Sunday, October 7, 2012

Schema, data and semantic web

Unless you have taken a computer science course, and specifically one on databases, you are unlikely to know the term 'schema'. Most of us, though, even those without a computer science background, can say what 'data' is.

What is the difference and why does it matter?

Data is all about the values of something. For example, when someone asks your height, you may say 170 cm. The 170 cm is data. The tag or name given to that value is what distinguishes it from a myriad of other values, such as the length of your sofa, which may also be 170 cm. To tell the values apart or classify them as something meaningful, you need an additional tag that describes what each value stands for. Those tags are the schema. Simple, isn't it?

Now I hear you saying that you knew this already, and that you have been emailing the business partner who makes T-shirts for you about the length and breadth of a T-shirt in cm. Yes, you have been using a schema implicitly. But a computer program written to perform a check, say that the width of a T-shirt cannot be more than 100 cm, has to know which 'name' holds the width if it is to work across many T-shirts with different widths. Otherwise the program would be hardcoded to look only for the value 100, and it would not work for other dimensions.
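To make this concrete, here is a minimal sketch in Python. The records, the field names `length_cm` and `width_cm`, and the function `width_is_valid` are all made up for illustration. The point is only that the check finds the width by its name, so the same code works for any T-shirt.

```python
# Hypothetical T-shirt records; field names and values are illustrative only.
MAX_WIDTH_CM = 100

tshirts = [
    {"length_cm": 70, "width_cm": 50},
    {"length_cm": 75, "width_cm": 120},
]

def width_is_valid(tshirt, max_width_cm=MAX_WIDTH_CM):
    """Return True when the value named 'width_cm' is within the limit."""
    return tshirt["width_cm"] <= max_width_cm

# The check relies on the name 'width_cm', not on any particular value.
results = [width_is_valid(t) for t in tshirts]
print(results)
```

Without the name, the program would have no way to tell that 50 is a width and 70 is a length.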

In human communication the schema often goes unstated and is left to the reader to decipher. For example, I may say to my friend, 'Let us meet at the plaza to watch "My Cousin Vinny" at 7 o'clock in the evening'. This sentence contains several pieces of data, such as:
  • plaza
  • My Cousin Vinny
  • 7 pm
As you can see, all of the above are data points in the statement. However, for a machine to process it, the machine has to go beyond the values and attach tags that describe what each value is about, that is, the semantics of the information. The additional tags for the values above would be:

plaza - theatre
My Cousin Vinny - movie
7 pm - time

These additional tags bring a schema to the statement: plaza is a theatre and My Cousin Vinny is a movie. Interpreting the key elements of the statement this way lets software answer queries such as 'what is the name of the movie?' or 'which theatre is being talked about?'. More generally, it can span all statements that mention a theatre to find out, for example, how many web pages refer to 'plaza' the theatre and how many of them relate plaza to My Cousin Vinny.
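Here is a rough sketch of such queries in Python. The first record is the statement above; the other two records, and the helper names `movie_name` and `count_with`, are invented purely for illustration.

```python
# Tagged statements: each value carries a name saying what it is.
# Only the first record comes from the example; the rest are made up.
statements = [
    {"theatre": "plaza", "movie": "My Cousin Vinny", "time": "7 pm"},
    {"theatre": "plaza", "movie": "Some Other Film", "time": "9 pm"},
    {"theatre": "odeon", "movie": "My Cousin Vinny", "time": "6 pm"},
]

def movie_name(statement):
    """Answer 'what is the name of the movie?' for one statement."""
    return statement["movie"]

def count_with(statements, **tags):
    """Count the statements whose tags match all the given name/value pairs."""
    return sum(
        all(s.get(name) == value for name, value in tags.items())
        for s in statements
    )

print(movie_name(statements[0]))
print(count_with(statements, theatre="plaza"))
print(count_with(statements, theatre="plaza", movie="My Cousin Vinny"))
```

None of these questions could be asked of the bare values 'plaza', 'My Cousin Vinny' and '7 pm' alone.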

But you may wonder how on earth it will be possible to tag every statement and every word we speak, let alone the whole World Wide Web. The answer is that much of the 'data' on web pages today, whether it represents government information, companies, people or anything else, is published from databases or even Excel sheets. These sources all have a very rigid schema, but in the process of getting them into HTML the schema gets lost.

So all that is needed is tools that preserve this schema information when the data is published as HTML.
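As one possible illustration (not the only way to do it), the Python sketch below turns a database-style row into an HTML table row while keeping each column name alongside its value, using the `itemscope` and `itemprop` attributes from HTML microdata. The row, the property names and the function `row_to_html` are assumptions made for this example.

```python
from html import escape

def row_to_html(row):
    """Render one record as a table row, keeping each column name as an itemprop."""
    cells = "".join(
        f'<td itemprop="{escape(name, quote=True)}">{escape(str(value))}</td>'
        for name, value in row.items()
    )
    return f"<tr itemscope>{cells}</tr>"

# Hypothetical row, as it might come out of a database or a spreadsheet.
row = {"theatre": "plaza", "movie": "My Cousin Vinny", "time": "7 pm"}
html_row = row_to_html(row)
print(html_row)
```

A plain `<td>` would show the same text to a human reader, but the column names would be gone for any program reading the page.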

This is fine, but what about Wikipedia-like pages, which have a lot of textual content meant for human consumption?

There are efforts such as DBpedia that try to automatically derive the semantic information represented by wiki pages. Hence, it would not be difficult to bring back the schema of wiki pages.

This is obviously a large effort. But it is happening, and soon you may find the web of text looking like a web of data.


1 comment:

  1. Extremely well put :-).
    Looking forward to your approach
