Sunday, October 7, 2012

Schema, data and the semantic web

If you have not taken a computer science course, and specifically one on databases, it is unlikely you will know the term 'schema'. Most of us, though, even people from a non-computer-science background, can tell what 'data' is.

What is the difference and why does it matter?

Data is all about the values of something. For example, when someone asks your height, you may say 170 cm. That '170 cm' is data, while the tag or name given to the value identifies what it is among a myriad of other values, such as the length of your sofa, which may also be 170 cm. If you need to differentiate values, or classify them as something meaningful, you need an additional tag that describes what they stand for. Simple, isn't it?

Now I hear you saying that you knew this, and that you have been sending emails to your business partner who makes T-shirts for you about the length and breadth of a T-shirt in centimetres. Yes, you have been using a schema implicitly. But a computer program written to perform some check, say one that verifies that the width of a T-shirt cannot be more than 100 cm, has to know the correct 'name' to use for the comparison if it is to work across the widths of different T-shirts. Otherwise the program will be hardcoded to look only for 100, and it will not work for other dimensions.
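
To make this concrete, here is a minimal sketch in Java (the class and field names are invented for illustration): the check works for any T-shirt because it selects the value through its schema name, widthCm, rather than being written for one particular shirt.

    // Hypothetical TShirt type: the field names are the schema,
    // the values stored in them are the data.
    class TShirt {
        final int widthCm;
        final int lengthCm;
        TShirt(int widthCm, int lengthCm) {
            this.widthCm = widthCm;
            this.lengthCm = lengthCm;
        }
    }

    public class WidthCheck {
        static final int MAX_WIDTH_CM = 100; // the business rule, named once

        // Works for any TShirt because it picks the value by its
        // schema name (widthCm), not by matching a hardcoded number.
        static boolean widthOk(TShirt t) {
            return t.widthCm <= MAX_WIDTH_CM;
        }

        public static void main(String[] args) {
            System.out.println(widthOk(new TShirt(95, 70)));  // true
            System.out.println(widthOk(new TShirt(120, 80))); // false
        }
    }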

Many a time in human communication, the schema is left untold, for the listener to decipher. For example, I may say to my friend, 'Let us meet at the plaza to watch 'My Cousin Vinny' at 7 o'clock in the evening'. In this, there is a lot of data, such as
  • plaza
  • My Cousin Vinny
  • 7 pm
Now, as you can see, all of the above are data points in this statement. However, in order for a machine to process it, it has to go beyond the values and be able to attach tags that describe what each one is about, that is, the semantics of the information. The additional tags on the above would be

plaza - theatre
My Cousin Vinny - movie
7 pm - time

Now we bring schema to the statement through these additional tags: plaza is a theatre, and My Cousin Vinny is a movie. This kind of interpretation of the key elements of a statement helps computer software answer queries like 'what is the name of the movie?' or 'which theatre is being talked about?', or, more generally, span all statements that mention a theatre, to find out how many web pages mention 'plaza' the theatre and how many of them relate plaza to My Cousin Vinny.
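
As a rough sketch of what such tags look like in practice, here is the example expressed as RDF triples using the Apache Jena library (the example.org URIs and the 'screens' property are invented purely for illustration):

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.RDF;

    public class TagExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            String ns = "http://example.org/";

            Resource plaza = model.createResource(ns + "plaza");
            Resource movie = model.createResource(ns + "MyCousinVinny");

            // the "tags": plaza is a Theatre, My Cousin Vinny is a Movie
            plaza.addProperty(RDF.type, model.createResource(ns + "Theatre"));
            movie.addProperty(RDF.type, model.createResource(ns + "Movie"));

            // the statement itself: the theatre screens the movie
            plaza.addProperty(model.createProperty(ns, "screens"), movie);

            model.write(System.out, "TURTLE");
        }
    }

A program holding these triples can now answer 'which theatre is being talked about?' by looking for the resource typed as Theatre, instead of pattern-matching raw text.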

But you may wonder how on earth it is going to be possible to tag every statement, every word of what we speak, and especially the whole world wide web. Well, to answer this: most of the web pages today carry 'data' that represents government information, companies, people and more. They are all currently published from databases or even Excel sheets, all of which have very rigid schemas. But in the process of getting them into HTML, the schema got dropped.

All this means, then, is having tools that preserve these additional aspects through the HTML publishing process.

This is fine, but what about Wikipedia-like pages, which have a lot of textual content meant for human consumption?

There are efforts like DBpedia, which tries to automatically derive the semantic information represented by wiki pages. Hence it would not be difficult to bring back the schema of the wiki pages.

This is obviously an effort, and a large one. But it is happening, and soon you may find the web of text looking like a web of data.


Tuesday, October 2, 2012

We are connected by data

Behind every social and business interaction there is data. To understand this statement, let me look at some examples.

You and your purchases


When you buy something, apart from the amount and the shop's name, address and phone number, there is also the warranty information, the maintenance contract, the service centre phone numbers; if there is an EMI to be paid, the reminders that ensure you pay on time; if the warranty expires and you want to extend it, the dates; if you get any free coupons, their details; if you bought it as a gift for someone, the details of that person; if you need to ship it somewhere, the address and phone numbers, and the tracking details until the goods arrive... and so on.

You and your bank

In this case, sure, the bank maintains most of the details of your transactions and offers you a monthly statement, or an online one. But when you issue a cheque to someone, the reason you issued it, or when you receive money, the reason you received it, is known only to you. The details of a credit or debit card transaction are also clearly not machine readable. For example, if I buy a laptop from the Apple Store, the Apple Store's name is there, but not the fact that the purchase was a laptop.

With these simple examples, you can see the connection: what you bought, and some additional details, are available in the first case (with the shop), while the details of all the transactions you did, not just with this shop but with all others and through other instruments (cheques), are available with the bank.

As an individual, I would definitely benefit if both of the above data sets were linked and thus made sense to me. But how do we make this happen? And how painless can it be for the end user?

The semantic web is one answer to this problem. By semantic web, I mean standards like RDF: linked data allows such linkages to be specified, provided vocabularies for the above data representations are available. Beyond that, every shop and every bank may have to publish using them. I feel at least the online e-commerce portals can start returning such information as RDF/XML, which can then be reconciled with the bank's records, giving you a way to collect the details of all your spending automatically.
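
As a hedged sketch of what a shop's receipt-as-RDF could look like, here is a fragment using Apache Jena; the purchase URI and the vocabulary (shop, itemCategory, paidAmount) are invented for illustration, since no standard vocabulary is assumed here:

    import org.apache.jena.rdf.model.*;

    public class ReceiptRdf {
        public static void main(String[] args) {
            String ex = "http://example.org/purchase#";
            Model m = ModelFactory.createDefaultModel();

            Resource purchase = m.createResource(ex + "txn-1001");
            purchase.addProperty(m.createProperty(ex, "shop"), "Apple Store");
            purchase.addProperty(m.createProperty(ex, "itemCategory"), "laptop");
            purchase.addProperty(m.createProperty(ex, "paidAmount"), "1200 USD");

            // RDF/XML is the serialization the portals could return
            m.write(System.out, "RDF/XML");
        }
    }

A bank statement line carrying the same transaction ID could then be joined to this record, filling in the 'what' that the card transaction alone does not capture.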

Imagine how useful this would be for paying my taxes...

If the tax department could simply accept such a format of expenses, and the total could be shown against my income (assuming you are self-employed), would it not save a lot of energy for everyone, and of course a lot of paper and a lot of tracking?

The beauty of such linkages is that I can run through the data like a breeze, looking for any type of spending I did in any category. The shop owner can offer more discounts because the loyalty information is there up front, and the income tax department can reward people with rebates because the data is clean and available for scrutiny far more easily than ever, saving lots of $$$ through more efficient tax collection mechanisms.

That's the power of linked data.

This is just the surface... and if it happens, it can change your life, and all our lives, for ever.


Thursday, September 20, 2012

What exactly is achieved by the Semantic Web?

The web is currently filled with documents. There are reams of English text that can be consumed only by humans, and blogs like this one add to the ever-increasing pile. Of course there are also other types of content: photos, images, videos and so on. Thus the web is increasingly becoming a way of publishing content mainly for human consumption. The interesting aspect of these documents is that they are linked to one another meaningfully, enabling a user to traverse the hyperlinks and read all the linked content. For example, I point here to the W3C Semantic Web project. Thus there is no need to repeat what someone has already published; one can simply link to it.

All this is good, and we could have lived happily like this. Then came Tim Berners-Lee, the original inventor of the web. He saw that the web of documents holds a large amount of data: not just fancy content, but dates and numbers and text and currencies and you name it. It appeared that if we could process this data, we could gain insight into a treasure trove sitting on the public web.

Now, to achieve this, web pages should be published with additional information: the semantics, or meaning, of what is in the content of a page. This meaning can be seen as tags that extend the existing information in the content. For example, there could be a string in the page giving the name of the author of this blog as 'Thalapathy'. Other tags could denote the date on which the post was written, the comments on the post and their dates, and the fact that the page is about the 'Semantic Web'. Thus there can be innumerable pieces of data within a page, carrying a lot more semantics that a program can query.

If we draw a parallel to the database world, this is about looking at the whole web as one large database.
Queries can be run the way SQL is run on relational tables, connecting disparate data across several web pages to answer a question. For example, an event reported in Bangalore about a Semantic Web conference can be related to a book released in California, and to the book's popularity in Amazon customer reviews, because the author who wrote the book attended the conference, and the book is sold on Amazon, which in turn carries the reviews. This is not something a simple Google search can achieve; it requires data to be related across seemingly disparate pieces of knowledge.
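
The query language for this web-as-a-database is SPARQL, the semantic web's rough analogue of SQL. Below is a sketch of the conference-book-reviews question as a SPARQL query run with Apache Jena over a local model; the predicates (ex:attended, ex:wrote, ex:sells) are invented for illustration:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;

    public class CrossLinkQuery {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // ...assume the model has been loaded with triples harvested
            // from the conference page, the book's page and Amazon...

            String q =
                "PREFIX ex: <http://example.org/> " +
                "SELECT ?book WHERE { " +
                "  ?author ex:attended ex:SemanticWebConfBangalore . " +
                "  ?author ex:wrote ?book . " +
                "  ex:Amazon ex:sells ?book . }";

            QueryExecution qe = QueryExecutionFactory.create(q, model);
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().get("book"));
                }
            } finally {
                qe.close();
            }
        }
    }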

Thus the Semantic Web opens up a whole lot of possibilities for humans, and for machines acting on their behalf, to see the web as an extension of human consciousness, offering answers to what would otherwise have looked like an impossibility.



Friday, September 14, 2012

Degree of structure to consider for organizing data

Let us examine below the degrees of structure that exist in data exchanged between humans, with examples, and what this has to do with organizing data.

Unstructured Data

     Mostly English text in blogs, Word documents, emails and web pages. Only humans can make sense of this; NLP tools can, to some extent.

Unstructured Data with annotations

     English text in a Word document or a web page with a given name; the paragraph headers and similar cues add more meaning for a human reader than plain text without annotations.

Semi-structured data

     Data used in a business context. For example, emails exchanged as part of a business transaction can contain something like this:


  •      Order ID: 1234
  •      Order Date: 1/1/2012
  •      Quantity: 5000 pcs
  •      Price per piece: 10 USD


     The above data is more structured; however, the structure is more discernible to humans than to machines, and humans can interpret it differently as well, which leans it back towards unstructured. In a way, it depends on the context of interpretation.
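
As a small illustration, once we fix a tiny schema for the message above ('each line is Name: Value'), a machine can consume it too. A minimal sketch in Java, where the field names simply mirror the example:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class OrderParser {
        public static void main(String[] args) {
            String email =
                "Order ID: 1234\n" +
                "Order Date: 1/1/2012\n" +
                "Quantity: 5000 pcs\n" +
                "Price per piece: 10 USD";

            Map<String, String> order = new LinkedHashMap<>();
            for (String line : email.split("\n")) {
                String[] parts = line.split(":", 2); // split on the first ':' only
                order.put(parts[0].trim(), parts[1].trim());
            }
            System.out.println(order.get("Order ID")); // prints 1234
        }
    }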

Excel sheet data also falls in this category, though I would say it is a little more structured, thanks to the visual grid used to organize the data. Hence it is more rigid than the arrangement above.

Your bank statement falls in this category too. It is a report which, though generated from a highly structured database, is meant for human consumption.

Structured data

I place XML in this category: it is structured because it conforms to an XSD (XML Schema Definition), and it is meant for data exchange between machines, though XML expressed as ASCII text can still be read by humans. Hence I keep XML in the structured category, but not in the highly structured category defined below.
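
Here is a small sketch of what 'conforms to an XSD' means in practice, using the standard javax.xml.validation API that ships with Java; the file names order.xml and order.xsd are placeholders:

    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.*;
    import java.io.File;

    public class XsdCheck {
        public static void main(String[] args) throws Exception {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new File("order.xsd"));
            Validator validator = schema.newValidator();

            // throws an exception if the document breaks the schema
            validator.validate(new StreamSource(new File("order.xml")));
            System.out.println("order.xml conforms to order.xsd");
        }
    }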



Highly structured data

By this I mean a proper database, which requires the special skill of data modeling to define the data and its relations. This is used mostly by machines.

Another example is an LDAP directory. All of these require a pre-arranged data model expressed in a schema language.

Semantic web

This adds a layer of metadata to existing web pages to enable a machine to make sense of the content automatically. The expression of this metadata is highly structured, though the data itself can be unstructured. Thus it has the unique property of spanning highly unstructured to highly structured in one go. For example, RDFS, which is highly structured, represents the ontology or the meaning, while RDF represents the information, the facts about the world.

In summary,

It is clear from the above that the more structured data is, the more easily machines can interpret it, while the less structured it is, the more it is meant for human consumption. The key point is that even for humans we end up adding some syntactic and metadata-level cues to make things clearer, without explicitly calling the metadata 'metadata'. If we explicitly call out or isolate the metadata from the data, it becomes more usable by machines, and in turn more useful for humans as well.

Coming to what all of this has to do with organizing data: the better the metadata (the data that describes what the actual data contains) is kept separate from the data, as in the semantic web or linked data concepts, the better it becomes for both machine and human consumption, since machines can analyze the data and extract more insights using the metadata, and ultimately humans benefit from the data. If data and metadata are woven together, then only machines built for that arrangement, like a relational database, can interpret them.


Wednesday, September 12, 2012

Organizing data

There is data everywhere today... more obvious and more in your face than before, with the world wide web and smartphones. I remember when I was a Unix and C programmer in the early nineties: we used Unix programs like chat and email, and our only interaction with computers was to write some C code. None of my relatives or parents, and few of my friends, were even remotely using a PC. Mobiles were non-existent.
Now people download smartphone apps to organize their to-do lists, their contacts and even their jewellery collections. Organizing things is not something new to humans. Entropy increases with time; organizing is a discipline, and free will hates discipline. I organize the things in my house only when it is absolutely needed. When I file my tax returns, I run around for the proofs and letters and documents and so on. During the year, when a tax-related paper arrives, I dump it in a bin, and when the bin overflows I put it in a file meant for income tax.

Often I don't have a file for a specific category: when the stock report from my broker arrives and I have no file for stocks, I file it under something called 'personal finance'. My home loan repayment certificate goes into the home-related file. Then I have to correlate my tax return with the home loan across these files. Linking pieces of information in physical form is not that easy. I did the right thing keeping these things separate, but I do need some kind of linking between them, so that when I file my tax returns I know I also need to account for my home loan. The home-related file, however, contains plenty of other information, like home maintenance expenses and so on.
Being self-employed is even more complex: you need to track all your expenses methodically to apportion them between business and personal for claiming tax exemptions.
Running a small business may be even more complex, with several interfaces to external vendors, partners and so on.
Running an enterprise...?
And leaving all this serious data behind, what about my blogs? What about the terms I searched for, the documents I read and the books I have? What about the emails I sent and received? What about the Facebook likes and LinkedIn updates I did? What about the spreadsheets I have in Google Docs or the slides on SlideShare? What about all the photos I took that are lying on my laptop and phones?

Should we even bother about organizing all of this data? Yes, for several reasons:
  • Imagine you are at a store and you need a copy of your passport to buy something
  • You need to know how much you spent this month on fuel
  • You need to find out whether you already have the book titled 'Organizing Data for Dummies' before you make another purchase
  • You need to know the total money you made consulting for someone
  • Or even the home address of the friend you plan to visit
Fundamentally, computers have taken over today's world, and increasingly so, and they all keep and process information. Wherever you go, some form of data needs to go into these machines and their software to get more information or to do more stuff. That, broadly, is the case for why it is important to think about organizing your data.

Also, if you need to share some information, you need to be able to find it.

Add to this the information assets that keep growing with you by the day: the internet pages you read, the e-books, the photos, movies, audio and so on.

You cannot keep your brain sane going forward with the sort of information explosion that is around the corner.



Thursday, February 23, 2012

Assume you have tens of thousands of lines of complex Java code involving multi-threading, JDBC connection pools, stateless session beans, heavy-duty XML processing, HTTP request/response processing, huge HashMaps (tens of megabytes) being accessed, a fork-join threading model and several third-party interfaces.

Now assume that your customer is asking for a response time of 10-100 ms per transaction, depending on the size of the (XML) response returned, and a TPS of 250 on an 8-core x86 blade running Linux with a large memory of, say, 64 GB.

How do we achieve this in Java? What needs to be kept in mind while tuning the performance of this application? That is what I am going to talk about below.
Remember that the application I am talking about is not a typical e-commerce application with human interfaces and response times in seconds. I am talking about a real low-latency expectation, in milliseconds.

There is no easy way. A common-sense approach is to profile your code with any of the profilers available in the market and identify bottlenecks, such as mega loops that consume time.

While this helps in a first pass, the biggest challenge you will face with Java is that functions like garbage collection play a big role in deciding the performance of your app. Java, unlike C or C++, allows programmers to allocate objects with the 'new' operator and then uses its GC mechanism to keep freeing the allocated objects, without the programmer needing to free them explicitly.

All allocated objects are maintained in a heap space, and the GC operates on that. Broadly, there are two categories of GC:



  • Single-generation

  • Multi-generation (typically two generations)

Single-generation means one contiguous heap space; generational implies several spaces for different purposes. For example, a two-generation heap could have one space for temporary objects that are short-lived (called the nursery space) and one for long-lifecycle objects.

The GC in this case operates only on the nursery space, until it becomes difficult to free anything there (meaning the short-lived young-generation objects are living longer) and it has to act by moving those objects into the tenured space.

Over a period of time, if the tenured space becomes full, the GC may have to run a full cycle across the entire (larger, old-generation) heap, resulting in a long pause while the application is running.

Apart from the way the spaces are maintained, there is also the question of whether the GC pauses the application during its run, or runs alongside the application in a sort of concurrent mode, without completely pausing it. This matters when your application's expected response time is really low: choosing a GC option that stops your application is not a good thing, especially if you are the low-latency kind.

Another aspect of the current GC algorithms is the mark-and-sweep function. The way it works is this: the GC leaves alone all objects in the heap that are referenced by live threads (from their stacks) or by statically allocated objects, and marks all unreferenced objects in one phase. In another phase, it sweeps them free. It is basically a two-step process, mainly to aid shorter pause times for the application.

While all this sounds cool, it comes loaded with trouble if your program is transaction-oriented and memory-intensive (some heavy XML processing, say) and also has a very scattered pattern of memory allocation and freeing. If, in such a situation, response time is a big thing to satisfy (low latencies, near-real-time responses), and you need to handle a large TPS on a single box (if you had to size the hardware during your bid and only later worry about how to meet it, which is typically the case :-) ), then the broad outlines below might help you tune your app. Again, I emphasize broad, because they cannot be taken as a straight fit for your situation. Take the clues here and study how your system behaves as you adjust the various parameters. I am only listing some of the key things you may want to look at while on the task.

1) Know how frequent your GC cycles, and therefore the pause times on the application, are.
Try both the single-generational and multi-generational heap options. Typically, for low-latency applications, a single generation collected concurrently alongside the application works better.

2) Check how quickly your entire heap space (determined by the -Xmx and -Xms options) is eaten up under heavy load, forcing a GC run to reclaim it. Do this by turning on GC verbose mode and observing how frequently GC runs, how long each run takes, and the heap occupancy before and after each run. All of this tells you whether your application is consuming heavy memory per transaction, leading to more GC runs, which in turn cause higher latencies.
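
As a rough illustration on a HotSpot JVM of that era, verbose GC data can be captured with flags like the following (flag names differ on JRockit and IBM JVMs, and yourapp.jar is a placeholder):

    java -Xms4g -Xmx4g \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:gc.log \
         -XX:+UseConcMarkSweepGC \
         -jar yourapp.jar

The last flag selects a concurrent collector of the kind point 1 recommends for low-latency applications; the gc.log output feeds the frequency and before/after heap observations described above.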

3) Do longer runs at higher TPS (until the CPU maxes out) to see whether the heap fragments over a long period, leaving it unusable and eventually making the application unresponsive. This is really bad and should be avoided at all costs. There are JVM options that let you compact the heap more frequently, to prevent this from happening.

4) If your CPU (in the top command on Linux) is misbehaving, meaning it is erratic, showing say 50%, then 90%, then coming back to 60%, then doing 90% again, oscillating widely, it could be caused by your GC runs. This has to be controlled, as it will not give you a consistent result or a trend.

5) As you move the TPS from low to high, you should see proportional CPU consumption, a stable heap and consistent, periodic GC runs. This is the point at which you have controlled the monster!

6) There are several things you can do in your application, like trying not to allocate huge objects that are held for the entire duration of a transaction. Holding them will typically hurt when you want high TPS at low latency, as it accumulates lots of objects in the heap for a long time, causing frequent GC runs.

7) See if you can use a pooling concept, where you allocate once and reuse the objects.
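
A minimal sketch of the pooling idea (the buffer type and sizes are illustrative): allocate a fixed set of reusable buffers once, then borrow and return them, so steady-state traffic creates no new garbage.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BufferPool {
        private final BlockingQueue<byte[]> pool;

        public BufferPool(int count, int bufferSize) {
            pool = new ArrayBlockingQueue<>(count);
            for (int i = 0; i < count; i++) {
                pool.add(new byte[bufferSize]); // one-time allocation
            }
        }

        public byte[] borrow() throws InterruptedException {
            return pool.take(); // blocks if all buffers are in use
        }

        public void release(byte[] buf) {
            pool.offer(buf); // return the buffer for reuse
        }
    }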

8) There are also several situations where you will want to monitor your threads via periodic thread dumps. These reveal quite a lot, especially under load. There may be threads blocked spinning on locks (fat locks), which you can examine to see if your code has bottlenecks, such as too many threads contending on a synchronized block, a Hashtable or a Vector. If the synchronization is not intended, it is safe to remove it and allow the parallelism (see the sketch below). There can also be contention on memory if your allocations eat all the space with long-lifetime objects. You can use memory leak detector tools to check for contention on locks during a GC cycle while Java code is trying to allocate. This can be induced indirectly by modules on the transaction processing path, but surface in a completely different module when it tries to allocate. A third-party library misbehaving under contention can also result in poor performance.
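
Here is a sketch of the synchronization fix mentioned above (the class is invented for illustration): replace a legacy Hashtable, where every call contends on one global lock, with a ConcurrentHashMap that lets threads proceed in parallel.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SessionCache {
        // before: one global lock, threads spin on it under load
        // private final Map<String, Object> cache = new Hashtable<>();

        // after: internal lock striping allows parallel readers and writers
        private final Map<String, Object> cache = new ConcurrentHashMap<>();

        public void put(String key, Object value) { cache.put(key, value); }

        public Object get(String key) { return cache.get(key); }
    }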

9) Try to keep the code your threads run performing short, quick actions in a multi-threading scenario. Remember that anything on a transaction path that runs for a long time will become a bottleneck under load and stop you from scaling.

10) There are several Java-specific good practices you can follow while coding. One good pointer is ftp://ftp.glenmccl.com/pub/free/jperf.pdf