Rapidly Changing Schemas

MongoDB CEO Max Schireson recently said, “Relational databases were designed in the 1970s during the time that computers were going from punch cards to terminals.” I admit there was a period when I bought into this thinking. Somehow, the thinking goes, if we just throw everything away and start from scratch we’ll do it 10x better. I suppose when you raise $231 million you have to really sell it, then proclaim you are going to take down Oracle, then have your team write posts on how to scale past 100GB.

These days I believe there is nothing a crotchety old relational database can’t handle for 99% of the use cases out there. I do admit, though, that when designing a model with a large number of high-cardinality columns, being able to rely on a soft schema can make things much easier from a code perspective.

A design pattern I’ve been intrigued by lately is the hybrid approach: specifically, the json datatype introduced in PostgreSQL 9.2, and the improved set of JSON functions in 9.3.

Previously a hybrid approach meant gluing your relational database to a NoSQL database or key-value store at the application layer. You get flexibility, but the downside is that all the logic shifts into code and becomes challenging to manage. With a PostgreSQL 9.2+ hybrid approach you can do your document storage in PostgreSQL while still taking advantage of things like constraints, foreign keys, and triggers. This allows you to tightly couple your structured and loosely structured data, because your document store is effectively just another column. From both a data integrity and a developer productivity standpoint this is a huge win.
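To make that concrete, here’s a minimal sketch of what the hybrid approach might look like. The table and column names are hypothetical, not from any real system; the point is that the json column sits right alongside ordinary constraints and foreign keys, and the 9.3 operators let you reach inside the document.

```sql
-- Hypothetical schema: structured columns plus a json column
-- for attributes that vary from product to product.
CREATE TABLE categories (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE products (
    id          serial PRIMARY KEY,
    sku         text NOT NULL UNIQUE,           -- ordinary constraint
    category_id integer REFERENCES categories,  -- ordinary foreign key
    attrs       json                            -- loosely structured data
);

INSERT INTO categories (name) VALUES ('cameras');

INSERT INTO products (sku, category_id, attrs)
VALUES ('CAM-100', 1, '{"megapixels": 24, "lens_mount": "EF"}');

-- The ->> operator (new in 9.3) extracts a field as text,
-- so you can filter on document contents in plain SQL.
SELECT sku, attrs->>'lens_mount' AS lens_mount
FROM products
WHERE (attrs->>'megapixels')::int >= 20;
```

If a “soft” attribute later hardens into something every row has, you can promote it to a real column without touching the application’s write path.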

Here’s a good presentation by Andrew Godwin on rapid schema development with PostgreSQL, with some high-level tips and tricks. Christophe Pettus also created a good presentation on this approach, which you can find here. Side note: I like the phrase “rapid schema” as it gets to the heart of what a schema-free database promises, which is the ability to change quickly. Unfortunately it’s sort of vague, like NoSQL.

MySQL Upgrade Issue

I just spent more time than I should have troubleshooting why upgrading MySQL from 5.0 to 5.1 on a Debian box resulted in an instance that wouldn’t start. There’s not a lot out there on this, so hopefully this will save someone a bit of time in the future.

When upgrading from 5.0 to 5.1 using apt, everything will install normally. Then, when the MySQL service tries to restart, you’ll see an init.d error along with output that looks something like this:

Errors were encountered while processing:
 mysql-server-5.1
 mysql-server

Not a lot to go on here, but as it turns out there is a deprecated entry in the my.cnf file called skip-bdb. Comment this line out and you should be good to go.
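The fix itself is a one-line edit. Here’s a sketch that reproduces it on a scratch copy of the file; the sample option values are made up, and on Debian the real file typically lives at /etc/mysql/my.cnf.

```shell
# Build a scratch my.cnf containing the deprecated option
# (contents are illustrative, not a real config).
cat > /tmp/my.cnf <<'EOF'
[mysqld]
user       = mysql
skip-bdb
key_buffer = 16M
EOF

# Comment out the deprecated skip-bdb line in place.
sed -i 's/^skip-bdb/#skip-bdb/' /tmp/my.cnf

# Confirm the line is now commented.
grep '^#skip-bdb' /tmp/my.cnf
```

After making the same edit to the real my.cnf, restart the service and the 5.1 instance should come up normally.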

Structuring the Unstructured

Martin Willcox from Teradata wrote a couple of blog posts outlining the reasons why he feels the phrase “unstructured data” is marketing jargon and that “nontraditional data” is more appropriate.

Let me start by saying that the examples Martin uses in the first post are technically accurate if we were all disk manufacturers. Whether bitmap (audio, video) or text (email, html), it’s true all of these file types use a structured format when being processed by a computer. That being said, we are not all disk manufacturers.

As a data architect I’ve always felt the true spirit of the phrase “unstructured data” concerns the modeling and analysis of the data. Take a collection of objects such as an email, an image, or a web page… these things are unstructured. They tell you nothing without the context of a structured model.

If this were simply a preference in terminology I wouldn’t think too much of it. But when a relational database vendor claims that “nontraditional” (unstructured) data is easily converted to “traditional” data by running fact/entity extraction routines and loading a table, it makes me stop and question the true intent of the original message. It’s not as simple as pushing a button, and an RDBMS is most often not your best option. This isn’t something that should be glossed over.

The problem is that when using a relational database schema, the relationships, attributes, and quantities must be defined before running any extraction routines. That’s fine when running against a fixed set of data looking for a known set of attributes and measures, but when you are mining millions of images or billions of web pages, the edge cases don’t start to show up until you actually extract and analyze the data. In this situation a relational database actually makes it harder to consume unstructured data, due to the high cost associated with schema changes.

To me the term unstructured makes sense… it’s simply the inverse of structured. Data without a model, if you will. And remember: the larger and more diverse the data set, the less you will know about its characteristics ahead of time.

© 2009 - 2019 Ross Bates