Monday, September 2, 2013

Data Vault in LoB applications - why not?

The data vault modelling technique was created specifically for the data warehouse. It is a rather simple design pattern intended to address the shortcomings of 3NF and dimensional modelling in the stable, core area of the DWH. Here's the DV concept in a nutshell:

- There are 3 types of tables: hubs, satellites and links;
- Hubs contain the business keys (e.g. Customer ID) and, for each record, a surrogate key;
- Links are tables that represent relationships (in practice turning every relationship into a many-to-many one);
- Satellites hold the actual entity data. There can be as many or as few satellites as you wish, and they can be attached to both hubs and links (but not to each other).

In addition, each table (hub, satellite and link) has a "load date/time stamp" (Load_DTS) and a "record source" column, so that data loading can always be traced. This pattern makes it easy to persist and back-track historic data. If an entity's schema changes (e.g. from a certain point in time a company wants to capture more customer information than before, creating new data fields for "customer"), then a new satellite can be created. For convenience, satellite tables also have an "end date/time stamp" (End_DTS) column, so that a query can easily find the latest record for an entity. Furthermore, the primary key of a satellite is the combination of the foreign key from the hub (the surrogate key, not the business key) and the Load_DTS. Here's an example, directly from Dan Linstedt, of how DV implements the Northwind database sample that ships with Microsoft products (a typical e-commerce solution).
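
To make the pattern more concrete, here is a minimal sketch of my own (not Linstedt's model; table and column names are invented for illustration) of what a customer hub, its satellite and a customer-order link could look like, using Python and SQLite:

```python
# Minimal hub/satellite/link sketch following the pattern described above.
# Table and column names are illustrative only.
import sqlite3

ddl = """
CREATE TABLE hub_customer (
    customer_sqn  INTEGER PRIMARY KEY,      -- surrogate key
    customer_id   TEXT NOT NULL UNIQUE,     -- business key
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);

CREATE TABLE sat_customer (
    customer_sqn  INTEGER NOT NULL REFERENCES hub_customer(customer_sqn),
    load_dts      TEXT NOT NULL,
    end_dts       TEXT,                     -- NULL = current version
    name          TEXT,
    street        TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_sqn, load_dts)    -- surrogate key + Load_DTS
);

CREATE TABLE hub_order (
    order_sqn     INTEGER PRIMARY KEY,
    order_id      TEXT NOT NULL UNIQUE,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- every relationship becomes a many-to-many link table
CREATE TABLE link_customer_order (
    link_sqn      INTEGER PRIMARY KEY,
    customer_sqn  INTEGER NOT NULL REFERENCES hub_customer(customer_sqn),
    order_sqn     INTEGER NOT NULL REFERENCES hub_order(order_sqn),
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
"""

with sqlite3.connect(":memory:") as conn:
    conn.executescript(ddl)
```

Note how the satellite's primary key combines the hub's surrogate key with Load_DTS, so every change simply becomes a new row.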



It's clear that if a change to a customer occurs (e.g. a new street address), the only thing that needs to be done is an insert into the satellite table - the old data is always kept. If a new field is added to a customer, then a new satellite table can be created. This is one of the main arguments for DV: "all the data, all the time". But looking at the example again, something else strikes me: what if the company now wants to hold specialised data on its customers? For example, it wants to distinguish between corporate and private customers. Obviously these have some fields in common and others that are only meaningful for the type of customer they are. This is the "is a" relation that is always a problem to map to the relational world, especially if one wants to keep historic data, i.e. to see the data as it was before the schema change. This is business as usual in the object-oriented world, but always a pain in the persistence layer. First, one needs to agree on what the relational model looks like, since there is no specialisation relation there. One can add a table for each specialised class and keep the existing Customer table (adding a field for the customer type), flatten the whole thing by adding the extra columns to the existing Customer table, or keep only the specialised tables (CorporateCustomer and PrivateCustomer) and drop the original Customer table - a rough sketch of these three options follows below. In any case it's a pain to change the model, to migrate the old data to the new model and to rewrite the persistence layer (the so-called O/RM, or object-relational mapping).
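
Here is that rough sketch of the three classic options in DDL form (simplified and hypothetical column names); every one of them changes the existing model and forces a data migration:

```python
# The three classic ways of mapping the Customer "is a" hierarchy to tables.
# Simplified, hypothetical DDL; each option is applied to a fresh copy of the
# original schema to show that all of them alter the existing model.
import sqlite3

base = "CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT, street TEXT);"

options = {
    "1: keep Customer, add a table per subclass": """
        ALTER TABLE customer ADD COLUMN customer_type TEXT;   -- 'PRIVATE' / 'CORPORATE'
        CREATE TABLE corporate_customer (
            customer_id TEXT PRIMARY KEY REFERENCES customer, vat_number TEXT);
        CREATE TABLE private_customer (
            customer_id TEXT PRIMARY KEY REFERENCES customer, birth_date TEXT);
    """,
    "2: flatten everything into Customer": """
        ALTER TABLE customer ADD COLUMN customer_type TEXT;
        ALTER TABLE customer ADD COLUMN vat_number TEXT;       -- NULL for private customers
        ALTER TABLE customer ADD COLUMN birth_date TEXT;       -- NULL for corporate customers
    """,
    "3: keep only the specialised tables": """
        CREATE TABLE corporate_customer (customer_id TEXT PRIMARY KEY, name TEXT, street TEXT, vat_number TEXT);
        CREATE TABLE private_customer   (customer_id TEXT PRIMARY KEY, name TEXT, street TEXT, birth_date TEXT);
        -- existing rows must be migrated first, then:
        DROP TABLE customer;
    """,
}

for name, ddl in options.items():
    with sqlite3.connect(":memory:") as conn:
        conn.executescript(base + ddl)
        print("applied option", name)
```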

As already said, if one wants to keep historic data it gets even more complicated, and maintaining consistency and the mapping to the object-oriented world becomes more and more complex as new changes come along (e.g. the model can later be further specialised to handle VIP customers).

Enter data vault: all of these shortcomings are suddenly addressed with very little effort and, what's more, in a standardised way. In this case, just create 2 new satellite tables for private and corporate customers; that's it. On the application side (the OO model), extend the customer class with the 2 subclasses and simply provide the serialisation/de-serialisation methods that write to and read from the respective satellite. Job done.
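
A hedged sketch of what the application side could look like, assuming the tables from the earlier sketch plus two new satellites (sat_private_customer, sat_corporate_customer) whose names and columns are my own invention:

```python
# OO customer hierarchy persisted insert-only into satellites.
# Satellite names and columns are invented for illustration.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Customer:
    customer_sqn: int
    name: str

    def _satellite_rows(self):
        # common attributes go to the shared satellite
        return [("sat_customer", {"name": self.name})]

    def save(self, conn, record_source="CRM"):
        """Insert-only persistence: every change adds new satellite rows."""
        now = datetime.now(timezone.utc).isoformat()
        for table, fields in self._satellite_rows():
            cols = ["customer_sqn", "load_dts", "record_source", *fields]
            vals = [self.customer_sqn, now, record_source, *fields.values()]
            marks = ", ".join("?" for _ in cols)
            conn.execute(f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})", vals)

@dataclass
class PrivateCustomer(Customer):
    birth_date: str = ""

    def _satellite_rows(self):
        return super()._satellite_rows() + [("sat_private_customer", {"birth_date": self.birth_date})]

@dataclass
class CorporateCustomer(Customer):
    vat_number: str = ""

    def _satellite_rows(self):
        return super()._satellite_rows() + [("sat_corporate_customer", {"vat_number": self.vat_number})]
```

The base class only knows about the common satellite; each subclass contributes one extra satellite, and persisting an object is always a plain set of inserts, so history is never overwritten.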

So why not use data vault for line-of-business (so-called transactional) applications? A deeper look into how transactions and locking would be supported is needed, and I intend to try it out sometime. I certainly think it's worth the time.

Sunday, April 21, 2013

The silent revolution in data architecture

While the data vault (http://danlinstedt.com/) is undoubtedly the best modelling technique for keeping historic data in a classic Enterprise Data Warehouse (no wonder, since it was invented for it in the first place), it is not obvious how, or whether, it will survive in the brave new world of in-memory analytics.

Now that widespread adoption is well under way, the data warehouse itself is changing. Major vendors are offering everything in-memory, with column-based storage and compression as the keys to ultra-efficient data retrieval. Ironically, one of the main arguments for the data vault is faster data loading (i.e. writing), not retrieval (i.e. reading): by breaking up key dependencies with surrogate keys, all tables can be written in parallel without worrying about referential integrity.
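
A toy illustration of that parallelism (assuming hash-based keys, which later Data Vault practice adopted precisely so that each table load can compute its keys without looking anything up):

```python
# If each table can derive its keys independently (here: a hash of the
# business key), hubs, links and satellites can be loaded concurrently
# without waiting for key lookups in other tables.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def key(*business_keys):
    return hashlib.md5("|".join(business_keys).encode()).hexdigest()

source_rows = [{"customer_id": "C042", "order_id": "O9001", "city": "Lisbon"}]

def load_hub_customer(rows):
    return [(key(r["customer_id"]), r["customer_id"]) for r in rows]

def load_hub_order(rows):
    return [(key(r["order_id"]), r["order_id"]) for r in rows]

def load_link_customer_order(rows):
    return [(key(r["customer_id"], r["order_id"]),
             key(r["customer_id"]), key(r["order_id"])) for r in rows]

def load_sat_customer(rows):
    return [(key(r["customer_id"]), r["city"]) for r in rows]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(f, source_rows) for f in
               (load_hub_customer, load_hub_order,
                load_link_customer_order, load_sat_customer)]
    results = [f.result() for f in futures]   # all four loads ran concurrently
```
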
But with the innovation currently seen in the data warehouse area, data vault may well turn out to be a technique that died before it had the chance to prove its merits - bad luck, but such is life in IT.

In summary: the cube/snowflake/star schema is (any time soon) dead, killed by column-based, in-memory analytical data stores; the relational database itself is looking shakier and shakier (with SAP HANA now becoming the basis of the ERP, SAP is taking a big step in that direction, at least if it succeeds); and even disk storage is seemingly on its way out. As someone said, "disk is the new tape". Why do you need persistent storage if you have huge amounts of RAM in fault-tolerant data centres? For backups. Then again, soon enough all backups will end up encrypted in the cloud, right?

Now, what about end-user devices? Again, the hard disk is all but dead: flash, SSD and the cloud are already the standard.

I find it really surprising that there isn't more media hype around all these threats to the HD. After a couple of decades of suspiciously fast growth in storage capacity (is innovation really so fast that disks can double their capacity every year, or was it just market manipulation?), it seems that the tide is turning.

That's IT for you: without the physical constraints of hard disk storage, the database (whether relational or analytical) is not so optimal anymore; without the constraints of network speed, local storage is not that relevant anymore; without the constraints of local capacity, distributed systems are hardly necessary (Google Public DNS, anyone?); and so on. But then again, we still have mainframes around, right?

Tuesday, February 19, 2013

Waiting for TOGAF 10


TOGAF 9 is a great framework. It has a robust "chassis" on which one can build a solid Enterprise Architecture function. The only problem is that some parts of it are outdated. All the parts work together quite well, and it's easy to see why it became so hot some time ago. It's like an old Rolls Royce: you admire its quality and engineering, but it is no longer the most practical means of transportation for everyday life.

Take, for example, the Architecture Building Blocks (ABBs): according to TOGAF's definition, ABBs "capture architecture requirements; e.g. business, data, application and technology requirements" and they "direct and guide the development of SBBs" (note: SBBs are Solution Building Blocks). ABBs are therefore "logical" components. Moreover, also according to TOGAF's definition, a building block (whether ABB or SBB) "has a defined boundary and is generally recognizable as "a thing" by domain experts".

The intention is understandable. By abstracting the "thing" into a set of functions, one should in principle be able to easily replace it with another equivalent "thing". The thing can be a web server, a database, an application server, a CRM system and so on.
The problem is that this kind of bottom-up abstraction doesn't last very long, since it is driven by implementation. It's like saying that in order to move items from point A to point B one needs a truck. "Truck" would then be the ABB and the actual choice of brand and model would be the SBB. But who says that one needs a truck? What if the business finds that hiring a transportation company is cheaper than buying one?

Indeed it is much more important to document that you need a transportation service with certain service requirements; this service should then show up in every other area of the enterprise where it is needed. The decision of, say, reusing a truck that is already used by some other department, or hiring a transportation company instead, can then be supported by real business arguments. Maybe the other department would benefit from a general contract with the transportation company and get rid of its truck. Or they would simply agree to share the truck. Either way, what matters is first looking at the application services that are needed to support a certain business process or function, and only then at possible solutions to provide those services. ABBs are an artificial construct in which I see no added value at all.

Furthermore, by extending the service concept downwards to the infrastructure level, a quite useful system map can be obtained: application components use infrastructure services, which are in turn provided by infrastructure components. By documenting all of that you end up with a map showing which business processes require each infrastructure component, via application services. This is a powerful tool for optimising the whole IT landscape.
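
As a toy sketch of such a map (all process, service and component names are invented), the whole thing boils down to a dependency graph that can answer questions like "which business processes are hit if this infrastructure component is retired?":

```python
# A toy "system map": business process -> application service -> application
# component -> infrastructure service -> infrastructure component.
# All names are invented for illustration.
uses = {
    "Order-to-Cash (process)": ["Invoicing service"],
    "Invoicing service":       ["Billing application"],
    "Billing application":     ["Relational DB service", "File transfer service"],
    "Relational DB service":   ["DB cluster A"],
    "File transfer service":   ["SFTP gateway"],
}

def depends_on(node, target, graph=uses):
    """Does `node` (directly or indirectly) require `target`?"""
    return any(child == target or depends_on(child, target, graph)
               for child in graph.get(node, []))

# Which business processes would be affected if "DB cluster A" were retired?
affected = [n for n in uses if "(process)" in n and depends_on(n, "DB cluster A")]
print(affected)   # ['Order-to-Cash (process)']
```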

Coming back to TOGAF 9, there is something else that belongs in the 1990s museum: the reference models - the Technical Reference Model (TRM) and the Integrated Information Infrastructure Reference Model (III-RM). Here's a quick summary of these 2 animals:
  • the TRM says that applications are built on top of application platforms, which provide different services with different required qualities; these are provided by an operating system, which uses network services built on a communications infrastructure;
  • the III-RM says that you should use a broker between consumer and provider applications.


These are quite generic and over-simplified views. They hold for most (but not all) IT applications out there, but what value do they actually provide? The way I see it, each organisation must produce its own set of reference models, reference architectures and patterns based on its needs and constraints. What TOGAF should be focusing on is a method to help develop these artifacts. Building them is an incremental exercise that is essentially capability-driven: when a new capability is identified as beneficial for the enterprise, it is usually best to turn it into a shared service. Then you need a reference model for it, and from that model (by mapping it to the existing landscape) you can derive a reference architecture and, finally, a set of patterns to guide consumers of the service.

In summary, TOGAF is a great framework, as it really helps to manage the many aspects of enterprise architecture, but I think a new model of this Rolls Royce is long overdue. The fact that The Open Group took Archimate on board as a common language looks promising, since Archimate provides an excellent way of representing a top-down, service-oriented view of the enterprise. I can't wait to see the next major version of TOGAF.

Friday, January 18, 2013

XML, the extra morose language


First there was HTML; it was a "quick and dirty" way of producing information and making it available. The difficulty of making web pages look good and the overhead it caused on the network were outweighed by the simplicity and the little time required to produce information in the form of standardised documents. With all its shortcomings, coupled with HTTP (also quite limited at the time), it was enough to change the world.

Those limitations were gradually removed with new versions of HTML, together with different scripting technologies on both the client and server sides. Later, compression became possible in order to reduce the load on the network.

Meanwhile XML came along and it too took over the world. Again many optimisations emerged, and also new ways of encoding binary data.

These are typical examples of IT solving problems that IT itself created. Looking at it now, what sense does it make to encode data in a human-readable format when it is only ever transferred between computer programs? Why do applications need to parse tags? Why do messaging components need to verify that end tags match start tags? Why do applications need to convert numbers to strings and then back again on the other side? Let's face it: markup languages are nonsense. There's no real advantage in using XML over ASN.1; it's slower, less scalable and overall far less efficient. And yet it has become "the" standard. But has it?
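
A quick back-of-the-envelope illustration (the element name is invented): shipping a single 32-bit number as XML versus as a fixed-length binary field.

```python
# One 32-bit value: tagged, human-readable text versus a plain binary field.
import struct

value = 1234567890
as_xml    = f"<temperatureReading>{value}</temperatureReading>".encode()
as_binary = struct.pack(">i", value)   # 4 bytes, no parsing, no tag matching

print(len(as_xml), len(as_binary))     # 51 vs 4 bytes
```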

With the introduction of Ajax, JSON (another nonsense standard) started to emerge, because it is far more compact and maps directly onto JavaScript objects. But meanwhile Google decided that GMail should be fast, so they just made up their own binary format and made it open source (Protocol Buffers). BTW, Google also produced a programming language (Go) with... pointers!

When I think of the petabytes of data that are transferred unnecessarily every day (just think of Base64), and of the processing power that is wasted just parsing XML, I get dizzy.
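
The Base64 part, at least, is easy to quantify: every 3 bytes of binary payload become 4 bytes of text, a fixed one-third overhead before any parsing even starts.

```python
import base64

payload = bytes(3 * 1024 * 1024)         # 3 MB of raw binary data (e.g. an attachment)
encoded = base64.b64encode(payload)
print(len(encoded) / len(payload))       # 1.333... -> a third more bytes on the wire
```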

This, however, is how the Internet works nowadays. And it's not only the Internet: SIP, the signalling protocol used in voice communications that should be taking over the world any time now (isn't it?), is often discarded because good old H.323 still proves much more efficient.

Many companies start out producing their own proprietary protocols, usually highly optimised. Their motivation is not to share but to gain market share. When standards finally become important, i.e. when customers demand them because they don't want to be locked in to specific vendors, these companies agree on a standard that is worse than their own solutions, probably hoping that customers will eventually give in and use the proprietary versions anyway. The world then surrenders to the lowest common denominator and spends a lot of time and effort solving the problems it creates.

In 2011 the W3C adopted EXI as a standard for "efficient XML interchange". It provides over 100x performance increases over XML (which is not hard to do). Let's hope it finally takes off and gets real adoption from vendors...

Saturday, January 5, 2013

The uncertainty of forecasting clouds


As I wrote in a previous post, the adoption of large-scale cloud computing services suffers from many ills, for which I can only envisage one medicine: regulation.

It is tempting to think that cloud is just a matter of time; that companies giving up their data centres and most of their IT staff is about to happen; that so-called "mega-vendors" will dominate the market, offering all the application, platform and infrastructure services that any non-IT company might ever desire. I, for one, am a believer. I think that most companies will have extremely reduced information departments compared to today. They will no longer have IT departments, but rather a few roles such as Information Architects, Security Experts, Data Scientists (high on the hype cycle right now) and possibly a handful of developers, depending on the type of business. Big corporations and institutions may keep a "private cloud", either for cost reasons, to keep highly confidential data in-house, or both. Cost is indeed an issue: nobody has yet proved that it is cheaper to rent a large data centre than to own one. In fact it may never be; that is the case with office buildings, where it is far cheaper for a corporation to own than to rent. On the other end of the spectrum, small to mid-sized companies may just outsource everything to cloud providers and not have an IT department at all. I am sure there are such cases already.

But for most companies cloud is still a challenge, and it will remain one until there is enough regulation and enforcement thereof. Technically, moving to the cloud is not difficult; the services have been available for a few years already. But there are a few reasons why businesses may not be prepared yet:

1- The cloud provider may go down, possibly even because of some other cloud provider with whom the prime contractor had an underpinning contract. A fresh example: Netflix going down, possibly because of an outage at Amazon.

2- Data leakage: it is almost impossible to keep at least a few system administrators from accessing the data. The only effective means of protecting it is encryption. But encryption is a tricky control, extremely hard to manage (see the sketch after this list). First, there is ultimately a key that is a single point of failure - encrypting is pointless if the system administrators, or anybody else, have access to the key; second, if the key is lost then all the data is lost; third, keys should be changed regularly, but then they need to be backed up somewhere safe - of course not with the data itself. And finally, depending on what is encrypted, different qualities may be affected. Database encryption usually has an impact on performance, may restrict available features and requires licensing of additional modules. File-system encryption has more or less the same issues, and volume encryption is useless while the system is running and the administrator has access to it. Obviously, encrypting data that needs to be accessed by multiple persons is complex, as many keys need to be managed; and if applications need to access the data, then the keys have to be stored somewhere in the system, which makes them accessible to system administrators (and hackers).

3- Undefined boundaries: who else is using the infrastructure? What if your system is running on a virtual machine and one of your "neighbours", hosted in another VM on the same physical server, breaks out and gets control of that server? Then your machine is compromised as well.

4- Legal/compliance: who knows where your systems are running? There may be constraints on location - e.g. your data may not be legally allowed to leave its country of origin. Moreover, if there is a police investigation into one of your "neighbours" (i.e. someone using the same infrastructure as you) and the police need access to your files to follow a trail, you will have to deal with that too.
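
To make the key-management point in item 2 concrete, here is a minimal envelope-encryption sketch (using Python's cryptography package purely as an example; the structure is the point, not the library): the data is encrypted with a data key, the data key is wrapped with a master key, and "rotating" keys means re-wrapping the small data key rather than re-encrypting the whole data set. Everything still hinges on who can read the master key, where it is backed up, and what happens if it is lost.

```python
# Minimal envelope-encryption sketch with the `cryptography` package.
# Data key encrypts the data; master key encrypts the data key; whoever
# holds the master key (an admin, a hacker, or nobody after it is lost)
# decides everything.
from cryptography.fernet import Fernet

master_key = Fernet.generate_key()            # the single point of failure
data_key   = Fernet.generate_key()

ciphertext       = Fernet(data_key).encrypt(b"customer record 42")
wrapped_data_key = Fernet(master_key).encrypt(data_key)   # stored next to the data

# "Key rotation": only the small wrapped key is re-encrypted, not the data set.
new_master_key   = Fernet.generate_key()
plain_data_key   = Fernet(master_key).decrypt(wrapped_data_key)
wrapped_data_key = Fernet(new_master_key).encrypt(plain_data_key)

# Reading the data always requires the current master key.
data_key2 = Fernet(new_master_key).decrypt(wrapped_data_key)
print(Fernet(data_key2).decrypt(ciphertext))  # b'customer record 42'
```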

The main problem with solving these issues may be a lack of motivation from those "mega-vendors". Lack of regulation benefits the established large vendors. First, they are happy to host systems from anyone on the planet within their own legal framework and on their own terms. Second, without regulation only the big vendors have enough credibility from the customers' point of view. And then there are the technology issues: regulation leads to standardisation and accreditation, and consequently to more competitors in the market and reduced profit margins. Besides, there is no doubt that in the cloud world open source is king. A provider will obviously prefer using and developing open source to paying licences to someone else - after all, they are not in the business of selling code, but services; so the large vendors that live on selling licences are not really interested in standardisation, but rather in keeping a competitive advantage by offering more and better-integrated services than their competitors.

Anyway, it is certainly an interesting transition era to live in.