Sunday, October 7, 2012

Schema, data and the semantic web

If you have not taken a computer science course, and specifically one on databases, you are unlikely to know the term 'schema'. Most of us, even people without a computer science background, can say what 'data' is.

What is the difference and why does it matter?

Data is the value of something. For example, when someone asks your height, you may say 170 cm. The 170 cm is data. The tag or name given to that value identifies what it is among a myriad of other values, such as the length of your sofa, which is also 170 cm. If you need to tell the values apart or classify them as something meaningful, you need an additional tag that describes what each value stands for. That set of tags is the schema. Simple, isn't it?

Now I hear you say that you knew this, and that you have been emailing the business partner who makes T-shirts for you about the length and breadth of a T-shirt in cm. Yes, you have been using a schema implicitly. But a computer program written to perform a check, say that the width of a T-shirt cannot be more than 100 cm, has to know the correct 'name' to look up if it is to work across the widths of different T-shirts. Otherwise the program would be hardcoded to look for the bare value 100, and it would not work for any other dimension.
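The width check described above can be sketched in a few lines of Python. This is only an illustration: the tag names 'length_cm' and 'width_cm' and the sample T-shirts are made up for this example, and only the 100 cm limit comes from the text.

```python
MAX_WIDTH_CM = 100  # the limit used in the example above

def width_ok(tshirt: dict) -> bool:
    """Return True when the value tagged 'width_cm' is within the limit."""
    # The program finds the right number by its name, not by its position
    # or by guessing which of several numbers is the width.
    return tshirt["width_cm"] <= MAX_WIDTH_CM

# Hypothetical T-shirts: the same check works for any dimensions.
tshirts = [
    {"length_cm": 70, "width_cm": 50},
    {"length_cm": 120, "width_cm": 110},
]

results = [width_ok(t) for t in tshirts]
print(results)
```

Without the 'width_cm' tag, the program could not tell 70 from 50, and the check would have nothing reliable to compare against the limit.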

In human communication, the schema is often left unsaid for the reader to work out. For example, I may say to my friend, 'Let us meet at the plaza to watch "My Cousin Vinny" at 7 o'clock in the evening'. This statement contains several pieces of data, such as
  • plaza
  • My Cousin Vinny
  • 7 pm
As you can see, all of the above are data points in this statement. However, for a machine to process it, the machine has to go beyond the values and attach tags that describe what each value is about, that is, the semantics of the information. The additional tags on the above would be

plaza - theatre
My Cousin Vinny - movie
7 pm - time

With these additional tags we bring the schema to the statement: plaza is a theatre and My Cousin Vinny is a movie. This interpretation of the key elements helps computer software answer queries like 'what is the name of the movie?' or 'which theatre is being talked about?'. More generally, it can span all statements that mention a theatre, to find out things like how many web pages refer to 'plaza' the theatre and how many of them relate plaza to My Cousin Vinny.
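Here is a rough Python sketch of the kind of queries described above. The tags 'theatre', 'movie' and 'time' come from the list of tags earlier; the second and third statements, and the helper name count_where, are invented purely for illustration.

```python
# Each statement is stored as tagged values rather than free text.
statements = [
    {"theatre": "plaza", "movie": "My Cousin Vinny", "time": "7 pm"},
    # The two statements below are hypothetical, added only to make
    # the counting queries interesting.
    {"theatre": "plaza", "movie": "Some Other Movie", "time": "9 pm"},
    {"theatre": "Another Hall", "movie": "My Cousin Vinny", "time": "6 pm"},
]

def count_where(statements, **tags):
    """Count the statements whose tagged values match all the given tags."""
    return sum(
        all(s.get(tag) == value for tag, value in tags.items())
        for s in statements
    )

# 'What is the name of the movie?' for the first statement.
movie_name = statements[0]["movie"]

# How many statements mention plaza the theatre?
plaza_count = count_where(statements, theatre="plaza")

# How many of them relate plaza to My Cousin Vinny?
plaza_vinny_count = count_where(
    statements, theatre="plaza", movie="My Cousin Vinny"
)

print(movie_name, plaza_count, plaza_vinny_count)
```

None of these questions can be answered reliably from the bare words 'plaza' or '7 pm'; it is the tags that make the queries possible.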

But you may wonder how on earth it is going to be possible to tag every statement and every word of what we say, let alone the whole World Wide Web. Well, most of the web pages today carry 'data' that represents government information, companies, people and so on. All of it is currently published from databases or even Excel sheets, and these sources have a very rigid schema. But in the process of turning them into HTML, the schema got lost.

So all this requires is tools that preserve these additional aspects when publishing to HTML.

This is fine, but what about Wikipedia-like pages, which have a lot of textual content meant for human consumption?

There are efforts like DBpedia, which tries to automatically derive the semantic information represented by wiki pages. Hence, it should be possible to bring back the schema of wiki pages.

This is obviously a large effort. But it is happening, and soon you may find the web of text looking like a web of data.


Tuesday, October 2, 2012

We are connected by data

Behind every social and business interaction there is data. To understand this statement, let me look at some examples.

You and your purchases


When you buy something, there is more data than the amount and the shop's name, address and phone number. There is also:
  • the warranty information, maintenance contract and service centre phone numbers
  • if it is to be paid by EMI, the reminders to ensure you pay on time
  • if the warranty expires and you want to extend it, the dates
  • if you get any free coupons, their details
  • if you bought it as a gift for someone, the details of that person
  • if you need to ship it somewhere, the address and phone numbers, and the tracking details until the goods arrive
...and so on.

You and your bank

In this case, sure, the bank maintains most of the details of your transactions and offers you a monthly or online statement. But when you issue a cheque to someone, the reason you issued it is known only to you, and likewise when you receive money, the reason you received it. The details behind a credit or debit card transaction are not machine readable either. For example, if I buy a laptop from an Apple store, the statement names the Apple store but not the fact that I bought a laptop.

With these simple examples, you can see the connection. What you bought, along with some additional details, is available in the first case (with the shop), while the details of all your transactions, not just with this shop but with all others and through other instruments such as cheques, are available with the bank.

As an individual I would definitely benefit if these two sets of data were linked so that together they make sense to me. But how do we make this happen? And how painless can it be for the end user?

The semantic web is one answer to this problem. By semantic web, I mean standards like RDF and Linked Data, which allow such linkages to be specified, provided vocabularies for the above data are available. Beyond that, every shop and every bank would have to publish its data using them. I feel that online e-commerce portals, at least, could start returning such information as RDF/XML that can be reconciled with the bank's records, giving you a way to get the details of all your spending automatically.
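To make the idea of linking concrete, here is a small Python sketch that models RDF-style statements as plain (subject, predicate, object) tuples. It does not use a real RDF library or a real vocabulary: the 'ex:' predicates, the identifiers, the amounts and the second transaction are all hypothetical, and it assumes the shop's record carries a reference to the bank transaction that paid for the order.

```python
# Data the shop could publish about an order (hypothetical vocabulary).
shop_triples = [
    ("order:1001", "ex:item", "laptop"),
    ("order:1001", "ex:seller", "Apple store"),
    ("order:1001", "ex:paidWith", "txn:77"),  # the link to the bank's data
]

# Data the bank could publish about transactions (hypothetical values).
bank_triples = [
    ("txn:77", "ex:amount", 1200),
    ("txn:77", "ex:payee", "Apple store"),
    ("txn:78", "ex:amount", 40),
    ("txn:78", "ex:payee", "Corner cafe"),
]

def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def describe_transactions(shop, bank):
    """Merge both sources and attach the purchased item to each transaction."""
    merged = shop + bank
    # Follow each 'ex:paidWith' link from an order to its transaction.
    item_by_txn = {}
    for s, p, o in merged:
        if p == "ex:paidWith":
            items = objects(merged, s, "ex:item")
            item_by_txn[o] = items[0] if items else None
    # Build one record per transaction; 'item' stays None when nothing links to it.
    report = {}
    for s, p, o in merged:
        if p == "ex:amount":
            report[s] = {"amount": o, "item": item_by_txn.get(s)}
    return report

report = describe_transactions(shop_triples, bank_triples)
print(report)
```

The transaction with a link from the shop now says what was bought, while the other one still shows only an amount, which is roughly the situation with bank statements today.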

Imagine how useful this would be for paying my taxes.

If the tax department could simply accept expenses in such a format and show their total against my income (assuming I am self-employed), would it not save a lot of energy for everyone, and of course a lot of paper and a lot of tracking?

The beauty of such linkages is that I can run through the data like a breeze and look up any type of spending in any category. The shop owner may be able to offer more discounts because the loyalty information is there up front. The income tax department could reward people with discounts because the data is clean and far easier to scrutinise than ever, saving lots of $$$ through more efficient tax collection.

That's the power of linked data.

This is just the surface... and if it happens, it can change your life and all our lives forever.