Friday, September 14, 2012

Degree of structure to consider for organizing data

Let us examine, with examples, the degrees of structure found in the data that humans exchange, and what this has to do with organizing data.

Unstructured Data

     Mostly English text in blogs, Word documents, emails, and web pages. Only humans can fully make sense
     of it; NLP tools can to some extent.

Unstructured Data with annotations

     English text in a Word document or web page that carries a title, paragraph headers, and similar
     annotations conveys more meaning to a human reader than plain text without them.

Semi-structured data

     Data used in a business context. For example, emails exchanged as part of a business transaction
     can contain something like this:


  •      Order ID: 1234
  •      Order Date: 1/1/2012
  •      Quantity: 5000 cps
  •      Price per piece: 10 USD


     The above data is more structured; however, the structure is more discernible to humans than to machines. Even humans can interpret it differently, which makes it lean towards unstructured. In a way, it depends on the context of interpretation.
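
To illustrate how far a machine gets with such labelled lines, here is a minimal Python sketch. The function name parse_labelled_lines and the "Label: value" pattern are my own assumptions for illustration; the sample text is the order fragment from above without the bullets.

```python
import re

# Order fragment from the email example above (bullets removed).
EMAIL_BODY = """\
Order ID: 1234
Order Date: 1/1/2012
Quantity: 5000 cps
Price per piece: 10 USD
"""

def parse_labelled_lines(text):
    """Return a dict of label -> raw value for every 'Label: value' line.

    Hypothetical helper: it assumes one field per line and a colon
    separating the label from the value.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.+)", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields

order = parse_labelled_lines(EMAIL_BODY)
print(order["Order ID"])   # 1234
print(order["Quantity"])   # 5000 cps
```

The labels are recovered, but every value is still an uninterpreted string. Nothing tells the program what the unit after the quantity means or whether the price is a number. That interpretation is left to the human reader.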

Excel sheet data also falls in this category, though I would say it is a little more structured because of the visual grid used to organize the data. Hence it is more rigid than the above arrangement.

Your bank statement also falls in this category. It is a report that, though generated from a highly structured database, is meant mainly for human consumption.

Structured data

I place XML in this category: it is structured because an XML document can conform to an XSD (XML Schema Definition). XML is meant for data exchange between machines, though as plain text it can still be read by humans. Hence I keep XML in the structured category, but not in the highly structured category defined below.
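
As a rough sketch of what this buys a machine, the same order can be written as XML and read with Python's standard xml.etree.ElementTree module. The element and attribute names (order, orderDate, quantity, pricePerPiece, unit, currency) are made up for this example, and the unit value is my assumption. Note that the standard library only parses XML; checking a document against an XSD needs a separate validator, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of the order; all names are illustrative.
ORDER_XML = """\
<order id="1234">
  <orderDate>2012-01-01</orderDate>
  <quantity unit="pieces">5000</quantity>
  <pricePerPiece currency="USD">10</pricePerPiece>
</order>
"""

root = ET.fromstring(ORDER_XML)

# Values and their qualifiers (unit, currency) are addressable by name.
quantity = int(root.find("quantity").text)
price = int(root.find("pricePerPiece").text)
currency = root.find("pricePerPiece").get("currency")

total = quantity * price
print(root.get("id"), total, currency)  # 1234 50000 USD
```

The program no longer guesses where a value ends and its unit begins, yet a human can still read the text directly.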



Highly structured data

Here I mean a proper database, which requires the special skill of data modeling to define the data and its relations. This is used mostly by machines.

Another example is an LDAP directory. All of these require a pre-arranged data model expressed in a schema language.
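
A small sketch of such a pre-arranged data model, using Python's built-in sqlite3 module with an in-memory database. The table and column names (customer, customer_order and so on) are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# The schema is declared up front, before any data exists.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id        INTEGER PRIMARY KEY,
    customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date      TEXT NOT NULL,
    quantity        INTEGER NOT NULL CHECK (quantity > 0),
    price_per_piece INTEGER NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Example Buyer')")
conn.execute("INSERT INTO customer_order VALUES (1234, 1, '2012-01-01', 5000, 10)")

# An order pointing at a customer that does not exist violates the model.
try:
    conn.execute("INSERT INTO customer_order VALUES (1235, 99, '2012-01-02', 10, 10)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

The database refuses data that does not fit the declared relations, which is the rigidity that sets this category apart from the previous ones.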

Semantic web

This adds a layer of metadata to existing web pages so that a machine can make sense of the content automatically. The metadata is expressed in a highly structured form, though the data itself can be unstructured. Thus it has the unique property of spanning highly unstructured to highly structured all in one go. For example, RDFS is highly structured and represents the ontology, or the meaning, while RDF itself represents the information, that is, facts about the world.
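
The split between RDFS and RDF can be sketched with plain Python tuples standing in for triples. This is not an RDF library, only a toy model: the ex: names are invented, and the function infer_types is a hypothetical helper that applies just the rdfs:domain and rdfs:range rules.

```python
# Schema-level triples (the role RDFS plays): what kinds of things exist
# and how properties relate them.
SCHEMA = {
    ("ex:Order", "rdf:type", "rdfs:Class"),
    ("ex:Customer", "rdf:type", "rdfs:Class"),
    ("ex:placedBy", "rdfs:domain", "ex:Order"),
    ("ex:placedBy", "rdfs:range", "ex:Customer"),
}

# Fact-level triples (the role RDF plays): statements about the world.
FACTS = {
    ("ex:order1234", "ex:placedBy", "ex:alice"),
    ("ex:order1234", "ex:quantity", "5000"),
}

def infer_types(schema, facts):
    """Derive rdf:type triples from rdfs:domain and rdfs:range declarations."""
    domains = {s: o for s, p, o in schema if p == "rdfs:domain"}
    ranges = {s: o for s, p, o in schema if p == "rdfs:range"}
    inferred = set()
    for s, p, o in facts:
        if p in domains:
            inferred.add((s, "rdf:type", domains[p]))  # subject belongs to the domain class
        if p in ranges:
            inferred.add((o, "rdf:type", ranges[p]))   # object belongs to the range class
    return inferred

for triple in sorted(infer_types(SCHEMA, FACTS)):
    print(triple)
# ('ex:alice', 'rdf:type', 'ex:Customer')
# ('ex:order1234', 'rdf:type', 'ex:Order')
```

Nobody stated that ex:alice is a customer; the machine derived it from the metadata, which is kept separate from the facts.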

In summary,

It is clear from the above that the more structured data is, the more easily it can be interpreted by machines, while the less structured it is, the more it is meant for human consumption. The key point is that even for humans we end up adding some syntactic and metadata-level aspects to make things clearer, without explicitly calling the metadata metadata. If we explicitly call out or isolate the metadata from the data, the data becomes more usable by machines and in turn more useful for humans as well.

Coming to what all of this has to do with organizing data: the better the metadata (data that describes what the actual data contains) that is available separately from the data, as in the semantic web or linked data concept, the better it is for both machine and human consumption. Machines can use the metadata to analyze the data better and draw more insights, and humans ultimately benefit from the data. If data and metadata are interwoven, as in a relational database, the data can only be interpreted by machines.
