Monday, September 2, 2013

Data Vault in LoB applications - why not?

The data vault modelling technique has been created specifically for the data warehouse. It is a rather simple design pattern that is intended to address the shortcomings of 3NF and dimensional modelling at the stable area of the DWH. Here's DV concept in a nutshell:

- There are 3 types of tables: hubs, satellites and links;
- Hubs contain the business keys (e.g. Customer ID) and, for each record, a surrogate key;
- Links are tables that represent relations (in practice turning every relation into the many-to-many variant);
- Satellites have the actual entity data. There can be as many or as few satellites as you wish. Satellites can be attached both to hubs and links (but not to each other). 

In addition to that each table (hub, satellite and link) has a "load date/time stamp" (Load_DTS) and a "record source" column. This way, tracking of data loading is ensured. This pattern easily allows for historic data to be persisted and back-tracked. If an entity's schema changes (e.g. from a certain point in time a company wants to have additional customer information than it had before thereby creating new data fields for "customer"), then a new satellite can be created. For convenience satellite tables have an "end date/time stamp" (End_DTS) column, so that a query can easily find the latest record for an entity. Furthermore, the primary key in a satellite is a combination of the foreign key from the hub (that's the surrogate key, not the business key) and the Load_DTS. Here's an example, directly from Dan Linstedt of how DV implements the Northwind database sample that comes with Microsoft products all the time (typical e-commerce solution).



It's clear that if a change to a customer occurs (e.g. new street address) they only thing that needs to be done is an insert operation to the satellite table - old data is always kept. If a new field is added to a customer then a new satellite table can be created. This is one of the main arguments for DV: "all the data, all the time". But looking at the example again there is something else that strikes me: what if the company now wants to have specialized data on their customers? For example, they want to distinguish between corporate and private customers. Obviously these have some fields in common and other that are only meaningful for the type of customer that they are. This is the "is a" relation that is always a problem to map to the relational world, especially if one wants to keep historic data, i.e. to see at the data as it was before the schema change. This is business-as-usual for object-oriented but it's always a pain when doing the persistence layer. First, one needs to agree on how the relational model looks like, as there is no specialization relation there. Of course one can add a table for each specialized class and keep the existing Customer table (adding a field for the customer type), flatten the whole thing by adding the extra columns to the existing customer table, or keep only the specialized tables (CorporateCustomer and PrivateCustomer) and dump the original customer table. In any case it's a pain to change the model, to migrate the old data to the new model and to rewrite the persistence layer (so-called O/RM, or object-to-relational mapping). 

As already said, if one wants to keep historic data then it's even more complicated and maintaining the consistency and mapping with the object-oriented world gets more and more complex as new changes come along (e.g. the model can later be further specialized for handling VIP customers).

Enter data vault: all of these shortcomings are suddenly addressed with very small effort and what's more, in a standardised way. In this case, just create 2 new satellite tables for private and corporate customers; that's it. On the application (OO model) extend the customer class to have the 2 subclasses and simply provide the serialisation/de-serialisation methods that write and read to the respective satellite. Job done. 

So why not use data vault for line-of-business (or so-called transactional) applications? A deeper look into how transactions and locking are supported is needed and I intend to try it out sometime. I certainly think it's worth the time.