Analytics on Unstructured Data – Twitter, Facebook and Social Media
Quoting Wikipedia: Unstructured Data (or unstructured information) refers to information that
either does not have a pre-defined data model and/or does not fit well into relational tables.
Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and
facts as well. This results in irregularities and ambiguities that make it difficult to understand using
traditional computer programs as compared to data stored in fielded form in databases or
annotated (semantically tagged) in documents.
Yes, most big data sources, including Facebook and Twitter, hold unstructured data. And almost no
analytics technique can work directly on this unstructured data. Unstructured data is the starting
point, but it has to be transformed into some structured format before any actual analytics
technique can be applied. So what is the process?
Business requirement: Tweets, Facebook posts and other social comments have to be analysed
to determine the sentiment of the population.
Creating semi-structured or structured data (which can fit into relational tables) involves dissecting
the text into words and phrases, which can then be categorised from ‘good’ to ‘bad’ and everything
in between. Converted numerically, this gives a score in the range of +1 to -1. This fully numeric
data set is then ready for use, and all the usual analysis techniques can be applied to it to arrive at
results. A new step, extracting structured data from unstructured data, therefore gets added into the
analytics process.
Thus, all the analytics skills and techniques remain very much valid in this new paradigm.
Only the type of data, its source and our general understanding of it have to be revamped. And most
of us analysts can breathe a sigh of relief.
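The dissect-and-score step described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production method: the word list, its scores and the function names (LEXICON, tokenize, score_text, to_rows) are all assumptions made up for the example, and a real sentiment lexicon would be far larger and would also handle phrases, negation and sarcasm.

```python
import re

# Hypothetical word scores in the range -1 to +1 (illustrative only).
LEXICON = {"great": 1.0, "good": 0.5, "okay": 0.0, "bad": -0.5, "terrible": -1.0}

def tokenize(text):
    """Lowercase the text and dissect it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def score_text(text, lexicon=LEXICON):
    """Average the scores of recognised words; 0.0 if none are recognised."""
    scores = [lexicon[w] for w in tokenize(text) if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

def to_rows(posts):
    """Turn (source, text) pairs into structured rows that fit a relational table."""
    return [{"source": s, "text": t, "sentiment": score_text(t)} for s, t in posts]

# Made-up posts standing in for real tweets and Facebook comments.
posts = [
    ("twitter", "Great service, good prices"),
    ("facebook", "Terrible support and bad app"),
    ("twitter", "Just arrived at the airport"),
]
rows = to_rows(posts)
for row in rows:
    print(row)
```

Each row now carries a numeric sentiment column, so the usual techniques (averages by source, trends over time, regression) can be applied to it like any other structured data.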
So does this process of converting unstructured data to structured data have to be manual, through
heuristics? Or machine-driven, through algorithms? Algorithms tend to reduce accuracy but increase
scale. So a judicious choice between the two, or a gradual shift from manual to algorithmic
conversion, can be used to standardise this process within the organisation.
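One way to picture the gradual shift from manual to algorithmic conversion is a routing rule: the algorithm scores the posts it recognises well, and everything else goes to a human queue. The sketch below assumes a made-up lexicon and a made-up MIN_COVERAGE threshold (the share of words the lexicon must recognise); neither comes from any particular tool, and the names auto_score and route are hypothetical.

```python
import re

# Hypothetical lexicon and threshold, chosen only for illustration.
LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}
MIN_COVERAGE = 0.25  # minimum share of words the lexicon must recognise

def auto_score(text):
    """Return (score, coverage); score is None when no word is recognised."""
    words = re.findall(r"[a-z']+", text.lower())
    known = [LEXICON[w] for w in words if w in LEXICON]
    coverage = len(known) / len(words) if words else 0.0
    score = sum(known) / len(known) if known else None
    return score, coverage

def route(texts):
    """Split texts into machine-scored results and a manual review queue."""
    scored, manual_queue = [], []
    for text in texts:
        score, coverage = auto_score(text)
        if score is not None and coverage >= MIN_COVERAGE:
            scored.append((text, score))
        else:
            manual_queue.append(text)
    return scored, manual_queue

scored, manual_queue = route([
    "Good phone, great camera",
    "Well that was not what I expected from them at all",
])
print(scored)
print(manual_queue)
```

Lowering MIN_COVERAGE as the lexicon improves sends more posts down the algorithmic path and fewer to manual review, which is one simple way to make the shift gradual rather than all at once.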
In fact, in this whole landscape, the decision on what data to delete becomes of paramount
importance. And there is a new set of data guardians who are experts at helping organisations
retain only relevant data.
This understanding, and the coming together of all the bits and pieces, has given a lot of confidence
to the decision-making process on dynamic and big data.