Monday, September 2, 2013

Data Vault in LoB applications - why not?

The data vault modelling technique has been created specifically for the data warehouse. It is a rather simple design pattern that is intended to address the shortcomings of 3NF and dimensional modelling at the stable area of the DWH. Here's DV concept in a nutshell:

- There are 3 types of tables: hubs, satellites and links;
- Hubs contain the business keys (e.g. Customer ID) and, for each record, a surrogate key;
- Links are tables that represent relations (in practice turning every relation into the many-to-many variant);
- Satellites have the actual entity data. There can be as many or as few satellites as you wish. Satellites can be attached both to hubs and links (but not to each other). 

In addition to that each table (hub, satellite and link) has a "load date/time stamp" (Load_DTS) and a "record source" column. This way, tracking of data loading is ensured. This pattern easily allows for historic data to be persisted and back-tracked. If an entity's schema changes (e.g. from a certain point in time a company wants to have additional customer information than it had before thereby creating new data fields for "customer"), then a new satellite can be created. For convenience satellite tables have an "end date/time stamp" (End_DTS) column, so that a query can easily find the latest record for an entity. Furthermore, the primary key in a satellite is a combination of the foreign key from the hub (that's the surrogate key, not the business key) and the Load_DTS. Here's an example, directly from Dan Linstedt of how DV implements the Northwind database sample that comes with Microsoft products all the time (typical e-commerce solution).



It's clear that if a change to a customer occurs (e.g. new street address) they only thing that needs to be done is an insert operation to the satellite table - old data is always kept. If a new field is added to a customer then a new satellite table can be created. This is one of the main arguments for DV: "all the data, all the time". But looking at the example again there is something else that strikes me: what if the company now wants to have specialized data on their customers? For example, they want to distinguish between corporate and private customers. Obviously these have some fields in common and other that are only meaningful for the type of customer that they are. This is the "is a" relation that is always a problem to map to the relational world, especially if one wants to keep historic data, i.e. to see at the data as it was before the schema change. This is business-as-usual for object-oriented but it's always a pain when doing the persistence layer. First, one needs to agree on how the relational model looks like, as there is no specialization relation there. Of course one can add a table for each specialized class and keep the existing Customer table (adding a field for the customer type), flatten the whole thing by adding the extra columns to the existing customer table, or keep only the specialized tables (CorporateCustomer and PrivateCustomer) and dump the original customer table. In any case it's a pain to change the model, to migrate the old data to the new model and to rewrite the persistence layer (so-called O/RM, or object-to-relational mapping). 

As already said, if one wants to keep historic data then it's even more complicated and maintaining the consistency and mapping with the object-oriented world gets more and more complex as new changes come along (e.g. the model can later be further specialized for handling VIP customers).

Enter data vault: all of these shortcomings are suddenly addressed with very small effort and what's more, in a standardised way. In this case, just create 2 new satellite tables for private and corporate customers; that's it. On the application (OO model) extend the customer class to have the 2 subclasses and simply provide the serialisation/de-serialisation methods that write and read to the respective satellite. Job done. 

So why not use data vault for line-of-business (or so-called transactional) applications? A deeper look into how transactions and locking are supported is needed and I intend to try it out sometime. I certainly think it's worth the time.

3 comments:

  1. Data Vault data modelling training and certification is now available in Australia.

    http://au.analytics8.com/event/data-vault-in-australia-certified-3-day-data-vault-data-modeling-course

    Over the last 10 years, Data Vault has gained significant momentum as one of the best ways to design a Data Warehouse.

    The Data Vault approach results in systems that can:

    • be built quickly;
    • deal better with changes in business requirements; and
    • be more easily maintained.

    This certification course provides everything you need to know to design a Data Vault model.

    This certification course will be presented by Hans Hultgren from Genesee Academy.

    Course structure

    Day 1

    • Data Vault constructs of Hub, Link and Satellite
    • Impact of Data Vault – the benefits of applying this model
    • Sales Receipt example modeling with Data Vault
    • Deep dive into Hub design
    • Group Case: Secaucus Soccer
    • Deep dive into Link design
    • Group Case: Mt. Evans Harley
    • Advanced constructs
    • Assign integration case

    Day 2

    • Group Case: integration case
    • Working with Data Vault
    • Loading paradigms
    • Applying to the EDW
    • Data Vault in practice

    Day 3

    • Exam review and discussions
    • CDVDM (Certified Data Vault Data Modeller) examination

    Learn more and enrol at:

    http://au.analytics8.com/event/data-vault-in-australia-certified-3-day-data-vault-data-modeling-course

    Regards,
    Simon Berglund, National Sales & Marketing Manager
    Analytics 8
    Tel +61 2 9299 4430
    www.analytics8.com.au

    ReplyDelete
  2. Disclaimer: I decided to accept the comment above because it relates to the actual post. I am in no way affiliated to the company that is effectively advertising in this blog though.

    ReplyDelete
  3. - Satellites have the actual entity data. There can be as many or as few satellites as you wish. Satellites can be attached both to hubs and links (but not to each other). local paper shredding

    ReplyDelete

Comments are always welcome. They will be moderated for posts older than 14 days. In that case a delay of a few hours can be expected before publishing.