For the sake of software: semantic web


Wednesday, July 16, 2014

The Broken WEB

It is just that I have been intrigued by the divisions I see around me in life and how the mind thrives on these divisions. Divisions are dangerous, and we see it every day in politics, at the office and even among friends and family. Humans have this penchant for dissecting things to make sense of them. Dissecting is a double-edged sword. It can be used as a tool to analyze, and at the same time it can arouse passions and lead you into unwanted confusion. The problem is that a weak mind cannot tell when it ended the analysis and when it entered the realm of the unwanted. This sounds like philosophy, right?

I see the same thing in software. Microsoft, Google and Apple have all been delivering these so-called Apps, which are divisive. Each focuses on a single function and delivers it effectively with a great-looking UI/UX (the Apple way). They go on developing Apps like this, millions and zillions of them.

Now compare this to the WWW of the past, when it originated. It was a beautiful, marvelous discovery in my opinion. Hyper links unified the way you read documents. Even today, unless you watch yourself, you can get thoroughly lost in your browsing. I have seen this happen to me many times. I do not know why or how I ended up reading a page which had nothing to do with what I started with. Hyper links unified the sea of documents.

Unfortunately documents are a weak form of representing things. They cater to the form and lose the underlying structure from which they originated (the databases). In the quest to present things in a readable form, documents lose vital information. Thus the very strength brought in by hyper linking, which links documents, breaks down, because there are finer things inside the documents which cannot be linked to precisely.

With time, as documents started representing not merely some text but company financials, books, weather and so on, the hyper link broke and Google came in :-) Google emerged as the great leveler, the one that fixes the breakage. It brought in search, which fixes the hyper link problem indirectly.

I will give you an example. I look for 'lambda calculus' and find the profile of a lecturer. I find a link and go to his page, where some other subjects are mentioned, say 'Haskell'. I get interested, but I cannot go to them from there. I go to Google and look for those. Google has figured out the linking between these disconnected pages and has allowed me to browse again after a brief interruption. You may think it is brief, but I find it really annoying. I use an App (Google) to fix my linking when it should have been possible for the lecturer to have linked it or, if the page were publicly editable, for someone else to have done so. But how does the lecturer know which web site has the information about 'Haskell' without doing a search? Why was this, which seems vital for hyper links to function, not solved by the inventors of the WWW? Even if I do not want to go there, I still have to answer how the lecturer or someone else will know which web site to link to.

I feel the only way is to have the web itself converted from a web of documents to a web of things. Things are not just data or text. They embody the description of the data as well. They carry meaning. One aspect of describing a thing is its 'categorization', that is, where it belongs in a hierarchy of the things in the world. If only we had a web of things, it would be fairly easy for the lecturer or anyone else to look for things that belong to the category of functional programming, find the pages on Haskell categorized there and make that hyper link.
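As a rough sketch of that lookup, assume every page declares the thing it describes and every thing has a parent category. The hierarchy, the page URLs and the function names below are all made up for illustration; nothing like this is published by the sites mentioned.

```python
# Hypothetical category hierarchy: each thing points to its parent category.
PARENT = {
    "functional programming": "programming paradigms",
    "Haskell": "functional programming",
    "lambda calculus": "functional programming",
}

# Hypothetical pages, each tagged with the thing it is about.
PAGES = {
    "https://example.org/haskell-intro": "Haskell",
    "https://example.org/lambda-notes": "lambda calculus",
    "https://example.org/cooking": "recipes",
}


def belongs_to(thing, category):
    """Walk up the hierarchy to see whether `thing` falls under `category`."""
    while thing is not None:
        if thing == category:
            return True
        thing = PARENT.get(thing)
    return False


def pages_in_category(category):
    """Return the URLs of all pages whose thing falls under `category`."""
    return sorted(url for url, thing in PAGES.items() if belongs_to(thing, category))


# The lecturer asks for everything under 'functional programming'.
print(pages_in_category("functional programming"))
```

The lecturer never searches by text here. He asks for a category and gets back the pages he could link to.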

This way, the web would have remained powerful and would have been a true knowledge base of human beings or, in short, our collective consciousness. But as history would have it, we got an uglier fix by giving the right to fix this to the great company called Google. I admire them, as they probably were truly annoyed by this breakage as well and sincerely looked for an answer.

However, I feel the root cause is in the way we create things. It was form based and not structure based. It was about putting up a document in a human-cognizable way and leaving it for Google to discover, rather than presenting the underlying structure. Form falls into the Apple space. That is what they mastered and took very far, and many others followed. But Form is, as I said, very divisive. It caters to the human mind very well. It allows you to see things distinctly, without a way to know that underneath everything is related to everything else. Form destroys structure in the process of delivering something comprehensible to the human mind. I am not against forms. I feel they are good, but not to the extent where they destroy the underlying structure of things or the connections between them.

I see this world crafted by these two big companies. On one side is Apple, which has focused on Form. It brought the App Store and zillions of Apps that cater to your every need, without allowing you to realize the connectivity across this sea of Apps, because they focus on the Form and make you focus on the Form.

On the other side we have Google, which helps you relate the underlying structure. But in the process it allowed the lazy human way of typing text for everything to continue, which destroys the underlying structure. It did the grunt work of analyzing things and let humans continue in the realm of text or Form with which they are comfortable.

Thus one has blatantly sided with the Form and the other has not looked at alternate means and has allowed things to continue as they are. The net result is that we have a BROKEN WEB, and that continues.

I believe the original inventor of the web had realized this in some sense and brought in the Semantic Web and all the W3C work that followed it. However, what they miss is that the damage has already happened, and I see light only in microformats. But even those are not going to be seen as important by web authors for a long time to come, as it looks too late in the game and there are already zillions of web pages without them.

Coming back to philosophy, the divisiveness has to be fixed within, because it originated within you. That does not mean you stop seeing forms and only see the underlying structures and connections. You will still divide. You will still see forms. But now you do so with the awareness that you are dividing and that you are admiring the form. This awareness is not something that can be turned on in a day by people who have been dividing and seeing things this way for long. The awareness was there, is there and will always be there. It has just faded over time and remained muted. The realization that division is a property of the mind, and that it should be done only when needed, is the awareness aspect of it. With awareness, practice is needed to glide through life's challenges that force you to be divisive. With awareness one has to watch the divisions happening around. Being in steadfast awareness is the key. It is harder than a small reed withstanding a whirlwind. It is the way and the only way for living. A unified YOU is the goal. Then it is easier to drop YOU.

We have applied our divisive mind to designing the WEB. The spread of apps or the searching of documents is a testimony to that. By raising the awareness level of the WEB which represents the collective consciousness of all of us, we bring a seamless connected web without the distractions of the Apps, but not ignoring them altogether for the function and focus they bring in to solve the problems. A cohesive WEB is the goal. Then it is easier to drop YOU.

PS: I have deliberately not added any hyper links in the above text, to make the point that the web is broken. I have added tags though, to show that this is the way forward, even if it is not going to solve the mess we are already in.

Sunday, October 7, 2012

schema , data and semantic web

If you have not taken a computer science course, specifically one on databases, it is unlikely that you will know the term 'schema'. Many of us, even people from a non-computer-science background, may be able to tell what 'data' is.

What is the difference and why does it matter?

Data is all about the values of something. For example, when someone asks your height, you may say 170 cm. The 170 cm is data. The tag or name given to that value identifies what it is among a myriad of other values, such as the length of your sofa, which is also 170 cm. If you need to differentiate the values or classify them as something meaningful, you need an additional tag that describes what those values stand for. That tag is the schema. So simple, isn't it?

Now I hear you say that you knew this, and that you have been sending emails about the length and breadth of a T-shirt in cm to the business partner who makes T-shirts for you. Yes, you have been using it implicitly. But a computer program written to perform some checks, say a check that the width of a T-shirt cannot be more than 100 cm, has to know the correct 'name' to use for this comparison if it is to work across the widths of different T-shirts. Otherwise the program would be hardcoded to look only for the value 100 and would not work for other dimensions.
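Here is a small sketch of that width check, assuming each T-shirt is a record with named fields. The field names `width_cm` and `length_cm` and the sample records are my assumptions for illustration only.

```python
MAX_WIDTH_CM = 100  # the limit from the example above

# Hypothetical T-shirt records; the field names act as the schema.
tshirts = [
    {"name": "classic", "width_cm": 55, "length_cm": 72},
    {"name": "banner", "width_cm": 120, "length_cm": 100},
]


def too_wide(records, limit_cm=MAX_WIDTH_CM):
    """Return the names of records whose 'width_cm' exceeds the limit."""
    return [r["name"] for r in records if r["width_cm"] > limit_cm]


print(too_wide(tshirts))
```

Because the check asks for `width_cm` by name, the length of 100 in the second record is never mistaken for a width. Without the name, 100 is just a number.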

Many a time in human communication the schema is untold and left to the reader to decipher. For example, I may say to my friend, 'let us meet at the plaza to watch 'My cousin Vinny' at 7 o'clock in the evening'. In this, there is a lot of data, such as

plaza
My cousin Vinny
7 pm

Now, as you can see, all of the above are data points in this statement. However, for a machine to process this, it has to go beyond the values and add tags that describe what each one is about, that is, the semantics of the information. Thus the additional tags on the above would be

plaza - theatre
My cousin Vinny - movie
7 pm - time

Now we bring the schema to these statements through the additional tags: plaza is a theatre and My cousin Vinny is a movie. This kind of interpretation of the key elements helps computer software answer queries like 'what is the name of the movie?' or 'which theatre is being talked about?'. In general it can even span all statements that mention a theatre, to find out things like how many web pages refer to 'plaza' the theatre and how many of them have a statement that relates plaza to My cousin Vinny.
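To make those queries concrete, here is a minimal sketch that treats each tagged statement as a small record. The first record is the statement above. The other two, and the helper names, are invented so that there is something to count.

```python
# Each statement carries its tags (the schema) along with the values (the data).
statements = [
    {"theatre": "plaza", "movie": "My cousin Vinny", "time": "7 pm"},
    {"theatre": "plaza", "movie": "Some other film", "time": "9 pm"},  # invented
    {"theatre": "regal", "movie": "My cousin Vinny", "time": "6 pm"},  # invented
]


def value_of(tag, statement):
    """Answer 'what is the <tag>?' for one statement."""
    return statement.get(tag)


def count_where(records, **conditions):
    """Count the statements in which every given tag has the given value."""
    return sum(
        all(r.get(tag) == value for tag, value in conditions.items())
        for r in records
    )


print(value_of("movie", statements[0]))
print(count_where(statements, theatre="plaza"))
print(count_where(statements, theatre="plaza", movie="My cousin Vinny"))
```

None of these questions can be asked of the bare values 'plaza', 'My cousin Vinny' and '7 pm' without the tags.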

But you may wonder how on earth it is going to be possible to tag every statement and every word of what we speak, and especially the world wide web. Well, to answer this, a great many web pages today carry 'data' about governments, companies, people and so on. These are all currently published from databases or even Excel sheets, all of which have a very rigid schema. But in the process of getting them into HTML, the schema gets lost.

So all this needs is tools that keep these additional aspects intact in the process of publishing to HTML.

This is fine, but what about Wikipedia-like pages, which have a lot of textual content for human consumption?

There are efforts like DBpedia which try to automatically derive the semantic information represented by Wiki pages. Hence, it should not be too difficult to bring back the schema of wiki pages.

This is obviously an effort, and a large one. But it is happening, and soon you may find the web of text looking like a web of data.

Tuesday, October 2, 2012

We are connected by data

Behind every social and business interaction there is data. To understand this statement, let me look at some examples.

You and your purchases

When you buy something, apart from the amount and the shop's name, address and phone number, there is also the warranty information, the maintenance contract and the service centre phone numbers. If it is to be paid by EMI, there are the reminders to ensure you pay on time. If the warranty expires and you want to extend it, there are the dates. If you get any free coupons, there are the details of those. If you bought it as a gift for someone, there are the details of that person. If you need to ship it somewhere, there are the address and phone numbers and the tracking details until the goods arrive...and so on.

You and your bank

In this case, sure, the bank maintains most of the details of your transactions and offers you a monthly statement or an online statement. But when you issue a cheque to someone, the reason why you issued it, or when you receive money, the reason why you received it, is known only to you. The details behind a credit or debit card transaction are clearly not machine readable. For example, if I buy a laptop from an Apple store, the Apple store's details are there but not the fact that it was a laptop.

Now with these simple examples, you can see the connection. What you bought and some additional details are available in the first case (with the shop), and the details of all the transactions you did, not just with this shop but with all others and with other instruments (cheques), are available with the bank.

As an individual I would definitely benefit if both of the above sets of data were linked and thus made sense to me. But how do we make this happen? And how painless can it be for the end user?

Semantic web is one answer to this problem. When I say semantic web, I mean standards like RDF and Linked Data, which allow such linkages to be specified, provided vocabularies for the above data representations are available. But beyond that, every shop and every bank would have to publish their data using them. I feel at least the online e-commerce portals could start returning such information as RDF/XML, which can be reconciled with the banks' data, thus giving you a way of getting the details of all your spending automatically.
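A toy sketch of that reconciliation follows. It uses plain Python tuples in the (subject, predicate, object) shape that RDF uses, not a real RDF library or RDF/XML parsing. The identifiers, the predicate names such as `paymentRef` and the amounts are all hypothetical, and it assumes the shop publishes a reference to the bank transaction.

```python
# What the shop knows (hypothetical triples).
shop_triples = [
    ("order:1001", "item", "laptop"),
    ("order:1001", "amount", 1200),
    ("order:1001", "paymentRef", "txn:77"),
]

# What the bank knows (hypothetical triples).
bank_triples = [
    ("txn:77", "payee", "Apple store"),
    ("txn:77", "amount", 1200),
    ("txn:78", "payee", "Grocer"),
    ("txn:78", "amount", 45),
]


def describe(subject, triples):
    """Collect everything said about one subject into a dict."""
    return {p: o for s, p, o in triples if s == subject}


def reconcile(shop, bank):
    """Attach the purchased item to each bank transaction the shop refers to."""
    linked = {}
    for subject, predicate, obj in shop:
        if predicate == "paymentRef":
            order = describe(subject, shop)
            txn = describe(obj, bank)
            linked[obj] = {
                "payee": txn.get("payee"),
                "amount": txn.get("amount"),
                "item": order.get("item"),
            }
    return linked


print(reconcile(shop_triples, bank_triples))
```

The bank row for txn:77 now says it was a laptop, which neither side could tell on its own. The grocer transaction stays unexplained because no shop published anything that refers to it.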

Imagine how useful this is for paying my taxes..

If the tax department could simply accept expenses in such a format and show the total against my income (assuming I am self-employed), would it not save a lot of energy for everyone, and of course a lot of paper and a lot of tracking?

The beauty of such linkages is that I can run through such data like a breeze and look up any type of spending I did in any category. Maybe the shop owner can offer more discounts, as the loyalty information is there up front, and the income tax department can reward people with discounts, as the data is clean and available for scrutiny much more easily than ever, saving lots of $$$ through more efficient tax collection.

That's the power of linked data.

This is just the surface...and if it happens, it can change your life and all our lives forever.

Thursday, September 20, 2012

What is exactly achieved by Semantic web?

The web is currently filled with documents. There are reams of English text that can be consumed only by humans. Blogs like this add to the ever-increasing pile of text content. Of course there are also other types of content like photos, images, videos and so on. Thus the web is increasingly becoming a way of publishing content mainly for human consumption. The interesting aspect of these documents is that they are linked to one another meaningfully, enabling a user to traverse those hyper links and read all the linked content. For example, I point here to the W3C Semantic Web project. Thus there is no need to repeat what one has already published; one can instead link to it.

All this is good and we could have lived like this happily. Then came Tim Berners-Lee, the original inventor of the web. He saw that the web of documents holds a large amount of data that includes not just fancy content but dates, numbers, text, currencies, you name it. It appeared that if we could process this data, we could gain insight from a treasure trove of data on the public web.

Now to achieve this, web pages should be published with additional information: the semantics, or the meaning, of what is in the content of a page. This meaning could be seen as tags that extend the existing content of a web page. For example, there could be a string in the page which gives the name of the author of this blog as 'Thalapathy'. Other tags could mark the date on the page as the date on which the blog post was written. There could be tags that mark the comments on the post, their dates and so on. And there can be tags that tell that the page is about 'Semantic Web'. Thus there can be innumerable pieces of data within a page carrying additional semantics that a program can query.

If we make parallels to the database world, this is about looking at the whole web as one large database.

Queries can be run the way SQL is run on relational tables. This allows disparate data across several web pages to be connected to answer a question. For example, an event reported in Bangalore about a Semantic Web conference can be related to a book released in California and to its popularity in customer reviews on Amazon, because the author of that book attended the conference and the book is sold on Amazon, which in turn carries the reviews. This is not something that can be achieved with a simple Google search. It requires data to be related across seemingly disparate pieces of knowledge.
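To take the SQL parallel literally, the sketch below loads a handful of made-up facts into a single three-column table in an in-memory SQLite database and joins the table to itself. The identifiers, predicate names and review score are invented for illustration. On the real semantic web such a query would normally be written in SPARQL, the W3C query language for RDF, rather than SQL.

```python
import sqlite3

# Hypothetical facts, as if gathered from three unrelated web pages.
triples = [
    ("conf:semweb-blr", "heldIn", "Bangalore"),
    ("person:alice", "attended", "conf:semweb-blr"),
    ("book:linked", "writtenBy", "person:alice"),
    ("book:linked", "releasedIn", "California"),
    ("book:linked", "reviewScore", "4.5"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

# Conference city -> attendee -> book written by the attendee -> its review score.
query = """
SELECT held.o, written.s, review.o
FROM triple AS held
JOIN triple AS attended ON attended.o = held.s AND attended.p = 'attended'
JOIN triple AS written  ON written.o = attended.s AND written.p = 'writtenBy'
JOIN triple AS review   ON review.s = written.s AND review.p = 'reviewScore'
WHERE held.p = 'heldIn'
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Each join follows one link between facts that came from a different page, which is the kind of connection a keyword search does not make.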

Thus Semantic Web opens up a whole lot of possibilities for humans, and for machines acting on their behalf, to see the web as an extended human consciousness offering answers to what would otherwise have looked impossible.

Friday, September 14, 2012

Degree of structure to consider for organizing data

Let us examine below the degree of structure that exists in data exchanges between humans with examples and what it has got to do with organizing data.

Unstructured Data

Mostly English text in blogs, Word documents, emails and web pages. Only humans can make sense of this, and NLP tools to some extent.

Unstructured Data with annotations

English text in a Word document or a web page with a given title, paragraph headers and other annotations can convey more meaning to a human reader than plain text that has no annotations.

Semi-structured data

Data used in a business context. For example, in emails exchanged as part of a business transaction there can be something like this:

Order ID: 1234
Order Date: 1/1/2012
Quantity: 5000 cps
Price per piece: 10 USD

The above data is more structured; however, the structure is more discernible to humans than to machines. Humans can also interpret it differently, which makes it lean towards unstructured. In a way, it depends on the context of interpretation.
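A short sketch shows how far a program gets with the order lines above. The only assumption it makes is that every line has the shape 'name: value'. The function name is mine.

```python
email_body = """Order ID: 1234
Order Date: 1/1/2012
Quantity: 5000 cps
Price per piece: 10 USD"""


def parse_fields(text):
    """Split each 'name: value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            fields[key.strip()] = value.strip()
    return fields


order = parse_fields(email_body)
print(order)
```

The program recovers the names, but every value is still just a string. What 'cps' stands for, or how a date written as 3/4/2012 should be read, is left to the human, which is why I call this semi-structured.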

Excel sheet data also falls in this category, though I would say it is a little more structured due to the visual grid used to organize the data. Hence it is more rigid than the above arrangement.

Your bank statement falls in this category too. It is a report which, though generated from a highly structured database, is meant more for human consumption.

Structured data

I place XML in this category, as an XML document can be made to conform to an XSD (XML Schema Definition). XML is meant for data exchange between machines, though XML, being plain text, can still be read by humans. Hence I keep XML in the category of structured data but not of highly structured data as defined below.
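For comparison, here is the same kind of order written as XML and read with Python's standard library. The element and attribute names, and the reading of the quantity as pieces, are my assumptions. Note that `xml.etree.ElementTree` only parses the document; it does not validate it against an XSD, which would need a separate tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML form of an order; names and units are assumptions.
order_xml = """<order id="1234">
  <date>2012-01-01</date>
  <quantity unit="pieces">5000</quantity>
  <price currency="USD">10</price>
</order>"""

root = ET.fromstring(order_xml)

# Every value is reached by name, and units travel as attributes.
quantity = int(root.find("quantity").text)
unit_price = int(root.find("price").text)
currency = root.find("price").get("currency")
total = quantity * unit_price

print(root.get("id"), total, currency)
```

A machine can pick out each value by name and compute with it, yet a human can still open the same text and read it.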

Highly structured data

In this case, I mean a proper database, which requires the special skill of data modeling to define the data and relations. This is used more by machines.

Another example is an LDAP directory. All of these require a pre-arranged data model expressed in a schema language.

Semantic web

This adds a layer of metadata to existing web pages to enable a machine to make sense of the content automatically. The expression of this metadata is highly structured, though the data itself can be unstructured. Thus, this has the unique property of spanning highly unstructured to highly structured all in one go. For example, RDFS, which represents the ontology or the meaning, is highly structured, while RDF itself represents the information, the facts of the world.
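The split between RDFS and RDF can be sketched with two small lists of triples, one holding the ontology and one holding the facts. This is plain Python, not an RDF library. The class names and the page identifier are made up, and `subClassOf` and `type` stand in for the real `rdfs:subClassOf` and `rdf:type` terms.

```python
# Ontology (the RDFS side): how kinds of things relate to each other.
schema = [
    ("Blog", "subClassOf", "WebPage"),
    ("WebPage", "subClassOf", "Document"),
]

# Facts (the RDF side): statements about one particular thing.
facts = [
    ("page:broken-web", "type", "Blog"),
    ("page:broken-web", "author", "Thalapathy"),
]


def superclasses(cls, schema):
    """Follow subClassOf links upwards and return every ancestor class."""
    found = []
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in schema:
            if s == current and p == "subClassOf" and o not in found:
                found.append(o)
                frontier.append(o)
    return found


def types_of(subject, facts, schema):
    """Return the declared types of a subject plus those implied by the schema."""
    direct = [o for s, p, o in facts if s == subject and p == "type"]
    inferred = []
    for cls in direct:
        inferred.extend(c for c in superclasses(cls, schema) if c not in inferred)
    return direct + inferred


print(types_of("page:broken-web", facts, schema))
```

The facts only say the page is a Blog. It is the ontology that lets a machine conclude it is also a WebPage and a Document.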

In summary,

It is clear from the above that the more structured data is, the more easily it can be interpreted by machines, while the less structured it is, the more it is meant for human consumption. The key point is that even for humans we end up adding some syntactical and metadata-level aspects to make things clearer, without explicitly calling the metadata metadata. If we explicitly call out or isolate the metadata from the data, then it becomes more usable by machines and in turn more useful for humans as well.

Coming to what all of this has to do with organizing data: it is increasingly clear from the above that the better the metadata (data that describes what the actual data contains) and the more it is available separately from the data, as in the semantic web or linked data concept, the better it is for both machine and human consumption. Better analysis can be done and more insights obtained, by machines using the metadata and ultimately by humans using the data. If data and metadata are interwoven, as in a relational database, then the data can only be interpreted by machines.
