Sunday, April 21, 2013

The silent revolution in data architecture

While the data vault (http://danlinstedt.com/) is undoubtedly the best modelling technique for keeping historic data in a classic Enterprise Data Warehouse (no wonder, since it was invented for that in the first place), it is not obvious whether or how it will survive in the brave new world of in-memory analytics.

Now that widespread adoption is well under way, the data warehouse itself is changing. Major vendors are offering everything in-memory, with column-based storage and compression as the keys to ultra-efficient data retrieval. Ironically, one of the main arguments for the data vault is faster data loading (i.e. writing), not retrieval (i.e. reading). The reason is that, by replacing dependencies on business keys with surrogate keys, the contents of all tables can be written in parallel, regardless of referential integrity issues.
But with the innovation seen in the data warehouse area these days, the data vault may well be a technique that is dead before it has had the chance to prove its merits - bad luck, such is life in IT.
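
To make the parallel-loading argument concrete, here is a minimal sketch in Python. It rests on assumptions of mine that go beyond the text above: the surrogate key is derived deterministically by hashing the business key, so no loader has to look up a key generated by another loader, and the tables are plain in-memory dictionaries. All names (hash_key, load_hub_customer, staged_rows and so on) are hypothetical and only serve the illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(*business_key_parts):
    """Derive a deterministic key from business key parts (illustrative only)."""
    joined = "|".join(str(part).strip().upper() for part in business_key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical staged source rows: one customer order per row.
staged_rows = [
    {"customer_no": "C1", "order_no": "O-100"},
    {"customer_no": "C2", "order_no": "O-101"},
    {"customer_no": "C1", "order_no": "O-102"},
]


def load_hub_customer(rows):
    # One entry per distinct customer business key.
    return {hash_key(r["customer_no"]): r["customer_no"] for r in rows}


def load_hub_order(rows):
    # One entry per distinct order business key.
    return {hash_key(r["order_no"]): r["order_no"] for r in rows}


def load_link_customer_order(rows):
    # The link computes both parent keys itself, so it never has to wait
    # for the hub loaders to finish or read anything they produced.
    return {
        hash_key(r["customer_no"], r["order_no"]): (
            hash_key(r["customer_no"]),
            hash_key(r["order_no"]),
        )
        for r in rows
    }


def load_all(rows):
    # Every loader depends only on the staged rows, so all can run at once.
    loaders = [load_hub_customer, load_hub_order, load_link_customer_order]
    with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
        futures = [pool.submit(loader, rows) for loader in loaders]
        return [future.result() for future in futures]


hub_customer, hub_order, link_customer_order = load_all(staged_rows)
```

The point of the design is that no loader reads the output of another: the link table can be filled before, after or while the hubs are filled, and the keys still match once everything has landed.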

In summary, the cube/snowflake/star schema is dead (or will be any time soon), killed by column-based, in-memory analytical data stores; the relational database itself looks more and more shaky (SAP HANA as the basis of the ERP is a big step in that direction, at least if it succeeds); and even disk storage is seemingly on its way down. As someone said, "disk is the new tape". Why do you need persistent storage if you have huge amounts of RAM in fault-tolerant data centres? For backups. Then again, soon enough all backups will end up encrypted in the cloud, right?

Now, what about end-user devices? Again, the hard disk is all but dead. Flash, SSD and cloud storage are already the standard.

I find it really surprising that there's not more media hype around all these threats to the hard disk. After a couple of decades of suspiciously fast growth in hard disk capacity (is innovation really so fast that disks can double in capacity every year, or was it just market manipulation?), it seems that the tide is turning.

That's IT for you: without the physical constraints of hard disk storage, the database (whether relational or analytical) is no longer so optimal; without the constraints of network speed, local storage is no longer that relevant; without the constraints of local capacity, distributed systems are hardly necessary (Google Public DNS, anyone?); etc. But then again, we still have mainframes around, right?