Martin Willcox from Teradata wrote a couple of blog posts explaining why he feels the phrase “unstructured data” is marketing jargon and why “nontraditional data” is more appropriate.

Let me start by saying that the examples Martin uses in the first post would be technically accurate if we were all disk manufacturers. Whether binary (audio, video) or text (email, HTML), it’s true that all of these file types use a structured format when being processed by a computer. That being said, we are not all disk manufacturers.

As a data architect, I’ve always felt the true spirit of the phrase “unstructured data” corresponds to the modeling and analysis of the data. If you have a collection of objects in an email, an image, or a web page… then these things are unstructured. They tell you nothing without the context of the structured model.

If this were simply a preference in terminology, I wouldn’t think too much of it. But when a relational database vendor claims that “nontraditional” (unstructured) data is easily converted to “traditional” data by running fact/entity extraction routines and loading a table, it makes me stop and question the true intent of the original message. It’s not as simple as pushing a button, and an RDBMS is most often not your best option. This isn’t something that should be glossed over.

The problem is that when using a relational database schema, the relationships, attributes, and quantities must be defined before running any extraction routines. That’s OK when running against a fixed set of data looking for a known set of attributes and measures. But when you are mining millions of images or billions of web pages, many of the edges don’t show up until you actually start to extract and analyze the data. In this situation a relational database actually makes it harder to consume unstructured data, because every newly discovered attribute or relationship means a costly schema change.
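To make that concrete, here is a minimal Python sketch of the two approaches. Everything in it is made up for illustration: the `extracted` facts, the `image` table, and the helper names `load_fixed_schema` and `load_edge_list` are all hypothetical. It uses an in-memory SQLite database, where adding a column is cheap, so the count of `ALTER TABLE` statements is only a stand-in for the much larger cost of a schema change on a production warehouse.

```python
import sqlite3

# Hypothetical extraction output: (subject, relation, object) facts.
extracted = [
    ("img_001", "contains", "dog"),
    ("img_001", "taken_in", "Paris"),
    ("img_002", "contains", "cat"),
    ("img_002", "photographed_by", "alice"),  # a relation nobody planned for
]


def load_fixed_schema(facts, known_relations):
    """Schema-first load: one column per relation, declared up front.

    Returns the connection and the number of schema changes that were
    needed because extraction surfaced a relation not in the schema.
    """
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}" TEXT' for name in known_relations)
    conn.execute(f"CREATE TABLE image (id TEXT PRIMARY KEY, {columns})")

    known = set(known_relations)
    schema_changes = 0
    for subject, relation, obj in facts:
        if relation not in known:
            # The model has to change before the fact can be stored.
            conn.execute(f'ALTER TABLE image ADD COLUMN "{relation}" TEXT')
            known.add(relation)
            schema_changes += 1
        conn.execute("INSERT OR IGNORE INTO image (id) VALUES (?)", (subject,))
        conn.execute(
            f'UPDATE image SET "{relation}" = ? WHERE id = ?', (obj, subject)
        )
    return conn, schema_changes


def load_edge_list(facts):
    """Model-later load: group facts by relation as they appear."""
    edges = {}
    for subject, relation, obj in facts:
        edges.setdefault(relation, []).append((subject, obj))
    return edges
```

Calling `load_fixed_schema(extracted, ["contains", "taken_in"])` has to alter the table once, for `photographed_by`, while `load_edge_list(extracted)` absorbs the new relation without any change to its structure. Multiply that one surprise by billions of web pages and the difference starts to matter.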

To me the term unstructured makes sense… it’s simply the inverse of structured. Data without a model, if you will. And remember, the larger and more diverse the data set, the less you will know about its characteristics ahead of time.