Thursday, September 20, 2012

What exactly is achieved by the Semantic Web?

The web is currently filled with documents. There are reams of English text that can be consumed only by humans. Blogs like this one add to the ever-increasing pile of text content. Of course, there are also other types of content such as photos, images and videos. Thus the web is increasingly becoming a way of publishing content mainly for human consumption. The interesting aspect of these documents is that they are meaningfully linked to one another, enabling a user to traverse those hyperlinks and read all the linked content. For example, I point here to the W3C Semantic Web project. Thus there is no need to repeat what someone has already published; instead, one can simply link to it.

All this is good and we could have lived like this happily. Then came Tim Berners-Lee, the original inventor of the web. He saw that the web of documents holds a large amount of data that includes not just fancy content but dates, numbers, text, currencies, you name it. It appeared that if we could process this data, we could gain insight into the treasure trove of data that is on the public web.

Now, to achieve this, web pages should be published with additional information: the semantics, or meaning, of what is in the content of a page. These semantics can be seen as tags that extend the existing content of a web page. For example, a string in the page could be tagged as the name of the author of this blog, 'Thalapathy'. A date on the page could be tagged as the date on which the blog post was written. There could be tags that denote the comments on the post, their dates and so on. And there can be tags that say the page is about 'Semantic Web'. Thus there can be innumerable pieces of data within a page that carry additional semantics that a program can query.
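
As a minimal sketch of this idea, the tags can be pictured as subject-predicate-object statements about the page. The page URL and the property names below are assumptions made up for illustration, not a real vocabulary.

```python
# Hypothetical identifiers: the URL and property names are made up for illustration.
PAGE = "http://example.org/blog/semantic-web"

# Each tag becomes one (subject, predicate, object) statement about the page.
page_tags = [
    (PAGE, "author", "Thalapathy"),
    (PAGE, "datePublished", "2012-09-20"),
    (PAGE, "about", "Semantic Web"),
]


def lookup(tags, subject, predicate):
    """Return every value tagged with the given predicate for a subject."""
    return [o for s, p, o in tags if s == subject and p == predicate]


print(lookup(page_tags, PAGE, "author"))
```

A program no longer has to guess which string on the page is the author; it asks for the value tagged as such.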

If we draw a parallel with the database world, this is about looking at the whole web as one large database.
Queries can be run against it the way SQL queries are run on relational tables. This allows connecting disparate data spread across several web pages to answer a question. For example, an event reported in Bangalore about a Semantic Web conference can be related to a book released in California and to its popularity in customer reviews on Amazon, because the author who wrote that book attended the conference and the book is sold on Amazon, which in turn carries the reviews. This is not something that can be achieved with a simple Google search. It requires data to be related across seemingly disparate pieces of knowledge.
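
The conference, book and review example can be sketched roughly as a join across three sources. Everything here is invented for illustration: the names, the review score and the idea that each site publishes simple three-part statements.

```python
# All names and numbers below are invented for illustration only.
conference_site = [
    ("SemWebConf", "heldIn", "Bangalore"),
    ("SemWebConf", "attendedBy", "Author A"),
]
publisher_site = [
    ("Book B", "writtenBy", "Author A"),
    ("Book B", "releasedIn", "California"),
]
store_site = [
    ("Book B", "reviewScore", 4.5),
]

# Treat the three sources as one pool of statements, like tables in one database.
web = conference_site + publisher_site + store_site


def objects(subject, predicate):
    """Values o such that (subject, predicate, o) is stated somewhere."""
    return [o for s, p, o in web if s == subject and p == predicate]


def subjects(predicate, obj):
    """Subjects s such that (s, predicate, obj) is stated somewhere."""
    return [s for s, p, o in web if p == predicate and o == obj]


def reviews_for_conference(conference):
    """Follow conference -> attendee -> book -> review score."""
    results = []
    for author in objects(conference, "attendedBy"):
        for book in subjects("writtenBy", author):
            for score in objects(book, "reviewScore"):
                results.append((author, book, score))
    return results


print(reviews_for_conference("SemWebConf"))
```

The join works only because all three sources refer to the author and the book by the same identifiers, which is what linking data across pages is meant to provide.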

Thus the Semantic Web opens up a whole lot of possibilities for humans, and for machines acting on their behalf, to see the web as an extended human consciousness offering answers to what would otherwise have looked impossible.



Friday, September 14, 2012

Degree of structure to consider for organizing data

Let us examine, with examples, the degree of structure that exists in data exchanged between humans and what it has to do with organizing data.

Unstructured Data

     Mostly English text in blogs, Word documents, emails and web pages. Only humans can make sense of
     this, although NLP tools can to some extent.

Unstructured Data with annotations

     English text in a Word document or a web page with a given name, paragraph headers and so on, which
     can convey more meaning to a human reader than plain text that has no annotations.

Semi-structured data

     Data used in a business context. For example, in emails exchanged as part of a business transaction
     there can be something like this:


  •      Order ID:  1234
  •      Order Date: 1/1/2012
  •      Quantity: 5000 cps
  •      Price per piece: 10 USD


     The above data is more structured; however, the structure is more discernible to humans than to machines. Humans can also interpret it differently from one another, which makes it lean towards unstructured. In a way, it depends on the context of interpretation.
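
To make this concrete, here is a rough sketch of a program reading the order fields quoted above. Splitting the labels from the values is easy; deciding what the values mean is not. The date 3/4/2012 is a hypothetical value added only to show the ambiguity.

```python
from datetime import datetime

email_body = """\
Order ID: 1234
Order Date: 1/1/2012
Quantity: 5000 cps
Price per piece: 10 USD
"""


def parse_fields(text):
    """Split each 'label: value' line at the first colon."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields


order = parse_fields(email_body)

# The labels are easy to extract, but the values still need interpretation.
# A date such as 3/4/2012 (hypothetical) can be read as day/month or month/day;
# the format string encodes a choice a human reader would take from context.
as_day_first = datetime.strptime("3/4/2012", "%d/%m/%Y").date()
as_month_first = datetime.strptime("3/4/2012", "%m/%d/%Y").date()

print(order["Order ID"], as_day_first, as_month_first)
```

The same text yields two different dates depending on an assumption that is written nowhere in the email.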

Excel sheet data also falls in this category, though I would say it is a little more structured due to the visual grid that is used to organize the data. Hence it is more rigid than the above arrangement.

Your bank statement also falls in this category. It is a report that, though generated from a highly structured database, is meant more for human consumption.

Structured data

I place XML in this category because an XML document can conform to an XSD (XML Schema Definition). XML is meant for data exchange between machines, though XML expressed as plain text can still be read by humans. Hence I keep XML in the category of structured data but not highly structured data as defined below.
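
As an illustration, the order from the earlier email could be exchanged as XML along these lines. The element and attribute names are my assumptions; a real exchange would follow whatever XSD the two parties agreed on.

```python
import xml.etree.ElementTree as ET

# Element names are assumptions; a real exchange would follow an agreed XSD.
order_xml = """
<order id="1234">
  <orderDate>2012-01-01</orderDate>
  <quantity>5000</quantity>
  <pricePerPiece currency="USD">10</pricePerPiece>
</order>
"""

root = ET.fromstring(order_xml)

quantity = int(root.findtext("quantity"))
price = int(root.findtext("pricePerPiece"))
currency = root.find("pricePerPiece").get("currency")

# ElementTree only checks that the document is well-formed; validating it
# against an XSD needs a separate schema-aware library.
print(root.get("id"), quantity * price, currency)
```

The text is still readable by a person, but a machine can now pick out each field by name without guessing.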



Highly structured data

In this case, I mean a proper database, which requires the special skill of data modeling to define the data and relations. This is used more by machines.

Another example is an LDAP directory. All of these require a pre-arranged data model expressed in a schema language.
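
A small sketch of what "pre-arranged data model" means in practice, using SQLite from Python. The table and column names are hypothetical and chosen only to match the order example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# The data model is fixed up front: tables, column types and relations.
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE purchase_order (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    order_date  TEXT NOT NULL,
    quantity    INTEGER NOT NULL CHECK (quantity > 0)
);
""")

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Example Customer')")
conn.execute(
    "INSERT INTO purchase_order (id, customer_id, order_date, quantity) "
    "VALUES (1234, 1, '2012-01-01', 5000)"
)

# A row that breaks the model (customer 99 does not exist) is rejected.
try:
    conn.execute(
        "INSERT INTO purchase_order (id, customer_id, order_date, quantity) "
        "VALUES (1235, 99, '2012-01-02', 10)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

Unlike the email or the spreadsheet, the database refuses data that does not fit the model, which is why defining the model takes skill.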

Semantic web

This adds a layer of metadata to existing web pages to enable a machine to make sense of the content automatically. The expression of this metadata is highly structured, though the data itself can be unstructured. Thus it has the unique property of spanning highly unstructured to highly structured all in one go. For example, RDFS, which represents the ontology or the meaning, is highly structured, while RDF itself represents the information: facts about the world.
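
The split between the two layers can be sketched in plain Python. This is only a toy imitation of two RDFS ideas (a property's domain and subclassing), with made-up class and property names, and not a real RDF library.

```python
# Schema layer (RDFS-like): statements about classes and properties.
# Names are hypothetical; only the domain and subclass ideas come from RDF Schema.
schema = [
    ("BlogPost", "subClassOf", "Document"),
    ("author", "domain", "Document"),
]

# Data layer (RDF-like): facts about the world. The object can be free text.
facts = [
    ("post1", "type", "BlogPost"),
    ("post2", "author", "Thalapathy"),
    ("post1", "body", "The web is currently filled with documents."),
]


def infer_types(schema, facts):
    """Apply two RDFS-style rules: domain typing and subclass inheritance."""
    types = {(s, o) for s, p, o in facts if p == "type"}
    domains = {s: o for s, p, o in schema if p == "domain"}
    # Simplification: each class has at most one parent in this sketch.
    parents = {s: o for s, p, o in schema if p == "subClassOf"}

    # Rule 1: using a property implies the subject belongs to its domain.
    for s, p, o in facts:
        if p in domains:
            types.add((s, domains[p]))

    # Rule 2: membership in a class implies membership in its superclasses.
    changed = True
    while changed:
        changed = False
        for subject, cls in list(types):
            parent = parents.get(cls)
            if parent and (subject, parent) not in types:
                types.add((subject, parent))
                changed = True
    return types


print(sorted(infer_types(schema, facts)))
```

The facts can carry free text, yet the small, rigid schema lets a program conclude that both posts are documents without that being stated for either.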

In summary,

It is clear from the above that the more structured data is, the more easily it can be interpreted by machines, while the less structured it is, the more it is meant for human consumption. The key point is that even for humans we end up adding some syntactic and metadata-level aspects to make things clearer, without explicitly calling the metadata metadata. If we explicitly call out or isolate the metadata from the data, then it becomes more usable by machines and in turn more useful for humans as well.

Coming to what all of this has to do with organizing data: the more that good metadata (data that describes what the actual data contains) is available separately from the data, as in the Semantic Web or linked data concept, the better it is for both machine and human consumption. Machines can analyze the data better and draw more insights using the metadata, and humans ultimately benefit from the data. If data and metadata are interwoven, as in a relational database, then the result can only be interpreted by machines.


Wednesday, September 12, 2012

Organizing data

There is data everywhere today, more obvious and more striking than before with the World Wide Web and smartphones. I remember that when I was a Unix and C programmer in the early nineties, we used Unix programs for chat and email. Our only interaction with computers was to write some C code. My relatives, my parents and most of my friends were not even remotely using a PC. Mobiles were non-existent.
Now, people download smartphone apps to organize their to-do lists, their contacts and even their jewellery collections. Organizing things is not something new to humans. Entropy increases with time. Organizing is a discipline, and free will hates discipline. I organize things in my house only when it is absolutely needed. When I file my tax returns, I run around for the proofs, letters, documents and so on. During the year, when a tax-related paper is received, I dump it in a bin. And when the bin overflows, I put it in a file meant for income tax. Often I don't have a file for a specific category. When I get my stock report from my broker, I don't have a file for stocks, so I file it under something called personal finance. My home loan repayment certificate goes in the home-related file. Then I often have to correlate my tax return with the home loan across these files. Linking pieces of information in physical form is not that easy. I did the right thing in keeping these things separate. But I do need some kind of linking between them so that when I file tax returns, I know I also need to accommodate my home loan. And the home-related file contains several other pieces of information, such as home maintenance expenses.
Being self-employed is even more complex, since you need to track all your expenses methodically to apportion them between business and personal for claiming tax exemption.
Running a small business may be even more complex, when you have several interfaces with external vendors, partners and so on.
Running an enterprise...?
And leaving all this serious data behind, what about my blogs? What about the terms I searched for, the documents I read and the books I have? What about the emails I sent and received? What about the Facebook likes and LinkedIn updates I did? What about the spreadsheets I have in Google Docs or the slides on SlideShare? What about all the photos I took that are lying on my laptop and phones?

Should we even bother about organizing all of this data? Yes, for several reasons.
  • Imagine you are at a store and you need a copy of your passport to buy something
  • You need to know how much you spent this month on fuel
  • You need to find out if you already have the book titled 'Organizing data for dummies' before you make another purchase
  • You need to know the total money you made on consulting for someone
  • Or even the home address of a friend you plan to visit
Fundamentally, computers have taken over in today's world and are increasingly doing so. They all keep and process information. Wherever you go, some form of data needs to be fed into these machines and software to get more information or to do more stuff. That, broadly, is the case for why it is important to think about organizing your data.

Also, if you need to share some information, you need to be able to find it.

Add to this the amount of information assets that you accumulate day by day: the internet pages you read, the e-books, the photos, movies, audio and so on.

You cannot keep your sanity going forward with the sort of information explosion that is round the corner.