Sunday, December 23, 2012

CAP and other tradeoffs


I really liked Daniel Abadi's post on the CAP theorem [http://dbmsmusings.blogspot.de/2010/04/problems-with-cap-and-yahoos-little.html]. It is certainly true that CAP is an oversimplified acronym. However, while Daniel's post is great food for thought, it is also not entirely thorough itself.

For those who are not acquainted with CAP, here's a quick explanation: a distributed system cannot simultaneously be consistent (C), available (A) and partition-tolerant (P). The P means that if nodes are "partitioned" the system should still behave normally from the consumer's perspective, i.e. it should be both consistent and available. By partitioning it is meant that the nodes cannot communicate with each other, either because of a network failure or because a node is down (which is sometimes impossible to distinguish anyway). In fact the usual term is fault-tolerant; I guess the author(s) just wanted a catchy acronym.

Anyway, the theorem postulates that it is impossible for a system to have all three qualities. In practice it means that when there is no communication between the nodes you have to choose either consistency or availability. This is not hard to understand, because consistency depends on that communication (changes must be propagated to all nodes). Hence, if there is a problem with it, a decision needs to be made: either the system doesn't respond until everything is restored and guaranteed to be consistent (sacrificing availability), or it does respond (sacrificing consistency).
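To make the choice concrete, here is a minimal sketch (the names and structure are invented for illustration, not taken from any real system) of how a single write request might be handled differently depending on which side of the trade-off a system takes once a partition is detected:

```java
import java.util.List;

// Minimal sketch (illustrative names, not a real system): one write request
// handled during a partition, choosing either consistency (CP) or availability (AP).
public class PartitionChoice {

    enum Mode { CP, AP }

    static String write(Mode mode, List<String> reachableReplicas, int totalReplicas, String value) {
        boolean partitioned = reachableReplicas.size() < totalReplicas;
        if (!partitioned) {
            return "OK: replicated to all nodes";                         // C and A both hold
        }
        if (mode == Mode.CP) {
            return "ERROR: write rejected until all replicas are reachable"; // give up A
        }
        return "OK: accepted locally, to be reconciled later";            // give up C
    }

    public static void main(String[] args) {
        // Two of three replicas unreachable: the same request gets a different
        // answer depending on the chosen trade-off.
        System.out.println(write(Mode.CP, List.of("node1"), 3, "x=1"));
        System.out.println(write(Mode.AP, List.of("node1"), 3, "x=1"));
    }
}
```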

It is obvious that a system cannot simply be CA and ignore P. The theorem is rather "if a partition occurs, choose between C and A". So Daniel correctly notes that a system can only be either CP or AP.

He says that CA and CP are identical, but the way I see it, CA just doesn't make sense.

He then goes on to explain that there are more trade-offs to consider apart from consistency and availability, and he introduces latency, backed by an example from Yahoo (PNUTS). Finally he proposes a new acronym, PACELC, meaning: if there is a Partition, choose either Availability or Consistency; Else, choose either Latency or Consistency.

In fact there is usually a trade-off between consistency and latency, since consistency in distributed systems is expensive. But that trade-off is not always relevant: the solution may simply not be sensitive to latency. Depending on the solution and its environment there will certainly be many other trade-offs to consider, such as between availability and maintainability (whether changes to the system configuration require downtime) or between latency and security (authentication and encryption both increase latency), to name just two typical ones.

Moreover, latency and availability are themselves intertwined. If the system's latency is too high, it will be considered unavailable - in fact it is often the case, for example in cluster controllers, that a node is considered down when it takes too long to respond. In practice, if the latency is higher than defined in the OLA (Operational Level Agreement) it will mean P; if it is higher than defined in the Service Level Agreement (SLA, whose threshold is higher than the OLA's) it will mean A.

If it is lower than both, it may not even be an issue, depending on the SLA.
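As a rough illustration of that relationship (the thresholds and names below are invented, and real monitoring is of course more involved), the same response time can be read in three different ways depending on how it compares with the OLA and the SLA:

```java
// Minimal sketch with made-up thresholds: how a monitor might interpret a
// response time as a partition, as unavailability, or as acceptable.
public class LatencyClassifier {

    static final long OLA_MILLIS = 500;   // hypothetical internal (node-level) agreement
    static final long SLA_MILLIS = 2000;  // hypothetical customer-facing agreement, higher than the OLA

    static String classify(long responseMillis) {
        if (responseMillis > SLA_MILLIS) {
            return "A violated: the consumer experiences the service as unavailable";
        }
        if (responseMillis > OLA_MILLIS) {
            return "P: the node is treated as down by the cluster controller";
        }
        return "OK: within both agreements";
    }

    public static void main(String[] args) {
        System.out.println(classify(300));
        System.out.println(classify(800));
        System.out.println(classify(5000));
    }
}
```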

So the way I see it, CAP identifies a pre-defined trade-off for a pattern of distributed systems, one that exists regardless of the particular solution: a point where a decision needs to be made, either to ensure consistency or availability, or to find some other mitigating measure. For example, by using more replicas (redundant nodes) you could reach a satisfactory level of C, A and P, and even L, but with an impact on cost, maintainability and possibly security.

CAP is indeed a predefined tradeoff in one architectural pattern, that's all. As always the actual implication depends on the system design and on the requirements. I would call the pattern a "cloud pattern": a client-server architecture where the server is distributed in many equal nodes and the client keeps no data itself but expects the server to be always available and always consistent. This is the pattern used in Big Data distributed file systems such as Hadoop (or more specifically HDFS), which is the reason why CAP became so popular.

For different patterns the theorem does not apply: for example (as someone commented in Daniel's post) in a publisher/subscriber model, where the service immediately informs its clients about changes, consistency may no longer be a problem, because the fact that the servers are not guaranteed to be consistent at a specific point in time may not even affect the clients.

Good articles on CAP:
http://ksat.me/a-plain-english-introduction-to-cap-theorem/
http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/
http://www.royans.net/arch/brewers-cap-theorem-on-distributed-systems/

Sunday, December 16, 2012

SOA, boundaries and segregation

If there is one principle that architects agree on, it is that silos should be avoided. Instead of building systems from the top down, shared services should be reused. Each project should identify reusable capabilities and contribute to building a service-oriented architecture (SOA).

Some classical examples of SOA capabilities are identity and access management, master data management, business intelligence, messaging, data integration, orchestration, complex event management. Many applications can then be built or extended just by reusing these capabilities.

On the other hand, systems have a life of their own. A system owner typically wants maximum flexibility, which means as few dependencies as possible. A silo is much more appealing both to a project manager and to a system owner, because they know that they will control the development process and the final product respectively.

Then there's the security viewpoint. Sharing can of course lead to the propagation of vulnerabilities. A vulnerability in one shared service may have an impact on each one of its consumers. But sometimes a vulnerability in one of the consumers can also lead to the shared service being compromised, and consequently all the other consumers and possibly other services as well. There are many ways of mitigating such risks, but the only way to avoid them is by segregating systems, which leads to more fragmentation - sometimes even more than the system owner wants.

All of these views are perfectly valid. Highly critical systems may in fact need to be isolated because the cost of doing so is less than the risk of data leakage, manipulation or unavailability. But that is rare. Shared services need to support the highest requirements of each consumer, so over time they tend to be much more available and secure than any individual system can afford to be. Moreover, it is usually not a whole system that has requirements so high for specific qualities that they cannot be met by shared services. At most it's a small subset of its data that is highly sensitive - e.g. credit card information, plans for new products, medical records.

So how are boundaries defined in an SOA? Let's say an application uses these services: an enterprise service bus, a reporting engine, a document management system, a database cluster, an application server, the virtualised infrastructure and the disaster recovery standby site. For all of these services operational level agreements (OLA) are defined that guarantee the final service level agreement (SLA) with the customer. 

As for security, the risk profile of a shared service must be communicated transparently not only to the stakeholders of that and other impacted services, but also to its consumers before they agree to use it. For example if a known vulnerability exists in the document management system, then the medical department may not want to store personal records there until the risk is mitigated.

The conclusion is that silos can be broken if and only if the governance framework allows it. Otherwise they just can't. That is why an architect should not support, except in exceptional circumstances, any kind of low-level "reuse" such as federated databases, database links or direct connections to the Internet. For example, if one system's database is hard-linked to another (e.g. because a sales manager decided to add more information to its CRM system), it is impossible to guarantee performance, availability, security or in fact any kind of service level. Just one simple query issued at one system may break the other; one field allowing an SQL injection attack on one system may lead to data leakage from another; maintenance windows of both systems must be synchronised because unavailability of one breaks the other; and so on.

In summary, silos should not be built if reuse is possible, but reuse is not possible without service contracts. In other words SOA is the only way of reducing costs by consolidating systems.

Saturday, December 8, 2012

Why getting the architecture right makes everything run better

While jogging some days ago I was, as usual, listening to music on my relatively smart phone. I have a play list for jogging with which I always use the “shuffle” option. At a certain point in time and physical exhaustion I wanted to listen to another song so I clicked “Next”.

To my surprise the next song was not part of the play list. It was a nice song from Portishead but not really good for jogging... so I had to slow down, go through the phone menus and get shuffle mode to work again. Not good.

This made me start thinking about what could be wrong with this simple app’s architecture. I also noticed that the app doesn’t allow shuffling within one play list – one must “shuffle all” playlists.
This is a likely implementation:
  • Shuffling in the app is done by creating a song index out of all the songs in all play lists and then playing them in a loop;
  • When the user presses the button to move to the next song, the loop is broken and the next song from the database is chosen instead.

That's indicative that the layered structure is not designed to support playlists and shuffle play correctly. This is what the architecture might look like:


In this design the user can choose either a playlist or the base song catalogue to play. The “shuffleAll()” function is completely controlled by the main player app (PlayerAppUI object) and “moveNext()” just calls “play()” for the next song in the BaseSongCatalog, thereby breaking the shuffle play loop.
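As a rough sketch of that problematic structure (the class and method names follow the description above; everything else is guessed for illustration), the UI owns the shuffle loop while "next" goes straight to the base catalogue:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the problematic design described above: shuffling lives in the UI,
// so moveNext() bypasses it entirely.
class BaseSongCatalog {
    private final List<String> songs;
    BaseSongCatalog(List<String> songs) { this.songs = songs; }
    List<String> allSongs() { return songs; }
    void play(String song) { System.out.println("Playing " + song); }
}

class PlayerAppUI {
    private final BaseSongCatalog catalog;
    private int position = 0;

    PlayerAppUI(BaseSongCatalog catalog) { this.catalog = catalog; }

    // Shuffle is implemented entirely inside the UI, over ALL songs in the catalogue.
    void shuffleAll() {
        List<String> shuffled = new ArrayList<>(catalog.allSongs());
        Collections.shuffle(shuffled);
        for (String song : shuffled) {
            catalog.play(song);
        }
    }

    // "Next" knows nothing about the shuffle loop: it just plays the next song
    // in the base catalogue, which is how an unrelated song ends up playing.
    void moveNext() {
        position = (position + 1) % catalog.allSongs().size();
        catalog.play(catalog.allSongs().get(position));
    }
}
```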

Here is another architecture that would make shuffling work correctly:



Here there’s an intermediate layer (with the “SongPlayer” and “ShufflePlayer” super and sub classes respectively) that is responsible for providing the services for browsing songs. In this case, if a user presses “Next”, the request is delegated to the selected player, which knows which song is next (in this proposal that is achieved by means of an Enumerator in each subclass). While some methods of the app will use the specialised subclass (e.g. the shuffle player), most (moveNext, movePrevious) just need to use the super class and delegate the specific player features to the subclass that is in use.

Other players can easily be added in this design – for example one that gets songs from an internet streaming service – without affecting the general part of the user interface, except for any specific functions. In shuffle play that means reshuffling; in the internet streaming case it could mean selecting the web site, for example.
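A minimal sketch of this layered alternative might look as follows (again, the class names follow the description above and the implementation details are guessed; a Java Iterator stands in for the Enumerator): the UI depends only on the abstract player, and each player owns its own enumeration of the songs:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Sketch of the layered design: "next" stays inside the selected player,
// so the chosen play mode is never broken by the UI.
abstract class SongPlayer {
    private final List<String> playlist;
    private Iterator<String> order;

    SongPlayer(List<String> playlist) { this.playlist = playlist; }

    // Each subclass decides in which order the songs are enumerated.
    protected abstract Iterator<String> buildOrder(List<String> playlist);

    void moveNext() {
        if (order == null) {
            order = buildOrder(playlist);
        }
        if (order.hasNext()) {
            System.out.println("Playing " + order.next());
        }
    }
}

class SequentialPlayer extends SongPlayer {
    SequentialPlayer(List<String> playlist) { super(playlist); }
    protected Iterator<String> buildOrder(List<String> playlist) {
        return playlist.iterator();
    }
}

class ShufflePlayer extends SongPlayer {
    ShufflePlayer(List<String> playlist) { super(playlist); }
    protected Iterator<String> buildOrder(List<String> playlist) {
        List<String> shuffled = new ArrayList<>(playlist);
        Collections.shuffle(shuffled);   // shuffle within ONE playlist only
        return shuffled.iterator();
    }
}

class PlayerAppDemo { // hypothetical entry point, just to show the usage
    public static void main(String[] args) {
        List<String> jogging = List.of("Song A", "Song B", "Song C");
        SongPlayer player = new ShufflePlayer(jogging);  // the UI sees only SongPlayer
        player.moveNext();   // stays within the shuffled jogging playlist
        player.moveNext();
    }
}
```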
Meanwhile there have been a few updates to this app but there are always problems and limitations with it. When the structure is wrong, fixing it is almost impossible. 

That’s why it’s important to get the architecture right before implementation, even in a small app - which has nonetheless a huge user base. 

Saturday, December 1, 2012

The Enterprise is a System!


... Or to be more exact, it’s a system of systems. But that doesn't matter; it's still an information system with moving parts and interdependencies.

That is why the notion that Enterprise Architecture (EA) shouldn't care about details doesn't make sense. It's also how so many "ivory towers" were built and later dismantled.

Failure to recognise that EA is the same discipline as system architecture has led to a more or less generalised disappointment with EA. It is not that there's anything wrong with EA frameworks such as TOGAF. The problem is that EA frameworks only specify how to organise the work, not how to do it. To me a framework is to architecture what a methodology (such as Scrum) is to development, and not more than that.

In fact I am of the opinion that TOGAF is really good to support programme management.

“Doing” architecture means designing systems based on qualities that stakeholders specify, identifying trade-off points, recommending solutions and strategies.

One of the newsletters that I like to follow is the one from SQLServerCentral.com. The editorial that I copied below caught my attention. It’s about something most have seen before – a story about the ills of micro-management. But why does it happen so often? Probably the manager knew, or at least suspected, that a low-level technical decision such as the choice of backup technology might have a direct impact on the customer’s experience and hence on the whole business. While it is often the case that managers get too involved in technical details and end up wasting everybody’s time (including their own), it also happens quite a lot that what looks like a detail ends up affecting a whole business. Some examples: the antenna in the iPhone 4, the brakes of the Toyota Prius a few years ago, the Mars Orbiter crashing. There are many cases where a technical decision has caused serious problems not only to the system that contained the affected component but also to the whole environment.

Of course I don’t claim that architecture can solve all the problems. Sometimes the architecture is perfectly right but a design or even implementation decision leads to problems.

But using an architecture method to design and evaluate systems should handle most of the problems that can affect the system’s structures. The architect focuses on system qualities, not functions. Stakeholders define the qualities that are most important to a system, and the architect is responsible for ensuring that they are met and for identifying risks that they will not be met.

In addition to building a system from the ground up to meet those qualities, by evaluating an architecture design with a method such as the ATAM the architect looks at the system with a “magnifying glass” that allows him to detect sensitivity points (elements that are sensitive to a certain quality), trade-off points (elements that are sensitive to multiple qualities in contradicting ways), risks and non-risks.

In summary, in the scenario below, the manager should probably not be making decisions on the backup strategy, but he should also not rely blindly on the developer’s preference, because the developer probably doesn’t have the complete picture to decide what the best choice is. Instead an architect should be analysing possible solutions (competing architectures), identifying risks and recommending the way forward, based on the most important system qualities for the stakeholders.



I was reading a note recently from a DBA working at a software company. Their management wanted to ensure clients had a simple backup solution and were leaning towards Windows OS backup instead of SQL Server backups. They were planning on running databases in simple mode instead of taking transaction log backups, which were seen as too complex. While this can work, I'm not sure this is the type of discussion that should even come up.

Management should be concerned with the higher level goals. Clients need a simple scripted backup. Period. The implementation of that isn't something that management should be discussing with developers. This is the perfect example of where the software development goes off the track with micro management. Managers becoming deeply involved in technical decisions and implementations is a sure way to ensure that less than optimal decisions are being made. 

What should happen? Technical developers should get the goals of management (a simple backup process for clients, every day). They should then recommend a solution, but with a minimal of technical details. Managers should have no idea that transaction log backups are being made or a part of the process. Developers should write scripts, tools, or processes that allow an administrator to accomplish a goal in an easy to execute fashion, but shouldn't need to explain how every detail works to the end user.

Keep it simple and effective. That's a mantra that's worked well for me throughout my career.

Steve Jones from SQLServerCentral.com

Sunday, November 25, 2012

Is cloud computing the end of Internet utopia?


Cloud computing may be unavoidable but for all the IaaS, SaaS and (to a lesser extent) PaaS services out there, we're not quite there yet, at least at the corporate level. In fact, despite a growing adoption of out-hosted services there aren’t all that many corporations that have gotten rid of their data centres.

So what is then missing? A signature, really. The problem is that nobody wants to sign off on what may be a death sentence for the business. So the next question is: what is required to get that signature? I can only see one possibility: the one thing that nobody wants on the Internet right now... regulation. Once this is accepted, everything will fall into place. The "cyber-world" will become ever more like our "physical" world. Even insurance companies may play their part and cover data leakage, manipulation or unavailability. And the Internet will finally be a normal part of our lives.

But will people ever want regulation? It is certainly about time that our children get the protection they deserve on the Internet, and that criminals start getting caught. No doubt a hacker would think twice before defacing a web site or impersonating somebody else if he could get jailed for it. It's all about value at risk. Let’s face it, right now there is virtually no risk in attacking whoever or whatever a hacker wants, and that makes it really hard and often expensive for businesses and people to defend themselves.

In my opinion it is long overdue because only with security can people and institutions thrive.


The challenge is to create a secure environment while keeping privacy and freedom of expression. And as far as I can see that cannot be achieved only with so-called self-regulation.

Saturday, November 17, 2012

"If you don't develop an architecture, you will get one anyway – and you might not like what you get!" (Linda Northrop, SEI)


Many times have I been asked what “IT Architecture” is, how it differs from system design, and what an architect does that is different from what an engineer does.
Sometimes the IT architect is simply an experienced system engineer. In fact I can’t really think of someone doing IT architecture without having been exposed to the many aspects of engineering and to many different technologies and patterns, and without knowing at least a few of them in depth. There is however a difference between architecture and engineering. In fact there are many sides to engineering; for example, someone with a background mainly in software development tends to look at things differently from someone who comes from network design, IT operations or IT security.

Once you become an architect you are expected to deal with all of these functions and to understand them well enough to propose decisions.

For instance, security often wants their systems (controls) to be based on appliances because they tend to be more secure. However Operations will argue that they can’t manage so many boxes and would prefer instead to have standard images and virtualisation. As for the development team, they always need the latest versions because the older ones have either run out of support or are not good enough.

The architect has to consider all of these points of view and many more when designing a system or working on a change to an existing system. And this is just inside the IT department. The customers have their own concerns, which are not necessarily related to security, technical component versions or release cycles. They need to have their tools and information available, quickly and in an uncomplicated way. In addition, many times different groups of customers have conflicting requirements.

This is where architecture comes into play. It’s all about managing trade-offs. The architect has to understand the concerns or viewpoints of each stakeholder and design a system that takes all of those aspects into consideration.

Architecture is mainly concerned with non-functional requirements, or system qualities. When you buy a car you want it to be fast, beautiful, big, comfortable, reliable, secure, eco-friendly, not depreciating too much, with low maintenance, lots of extras… and cheap. You have probably seen a panoramic roof which you liked in some model, tinted windows, MP3 connectivity, built-in GPS, a fridge, Bluetooth, parking aids. You want it all but then you realise that you need to compromise somewhere. So in the end you (sometimes) get the car that matches your most important qualities within budget. If the balance is not right you know you will regret it later, and probably end up replacing it and spending a lot more than you wanted.

So that’s the role of the architect: to find the right balance given the priorities that are communicated by each stakeholder and on which there must be agreement at design time.

Can you build a system without taking care of architecture? You certainly can. But inevitably the system will be unbalanced, neglecting the interests of key stakeholders, and problems will soon emerge. Like Linda Northrop from the SEI wrote “If you don't develop an architecture, you will get one anyway – and you might not like what you get!” (http://csse.usc.edu/gsaw/gsaw2003/s13/northrop.pdf).

There are countless examples of badly or non-architected systems. The consequence is that they usually have to be replaced quickly. That is what architecture tries to solve.

Wednesday, November 14, 2012

Architecture and agility: the ultimate trade-off?


Architecture-centric design is meant to minimise the probability that structural changes in the system will occur.
Changes in a system, as pointed out in the great “Software Architecture in Practice” book from the Software Engineering Institute (SEI), can be either:

  • Local - affecting only one element;
  • Non-local - affecting multiple elements but not the architecture itself; or,
  • Architectural - affecting the pattern of the architecture, i.e. the “fundamental ways in which the elements interact with each other”.

The system structure should therefore be created in a way that it minimises the likelihood of architectural changes. If possible only local changes should be necessary. 

Of course this requires design to be done upfront which is something that some “agilists” love to hate.

In fact, by using an architecture-centric approach such as the Attribute Driven Design (ADD) the system is decomposed recursively into components in a way that the desired qualities are achieved. The system can then be developed and prototyped incrementally like a skeleton into which the components can be added, tested and refined until the final product is achieved. This also allows components to be developed externally with minimal risk.

This kind of approach ensures that the system qualities, or non-functional requirements, are met, including in expected growth scenarios.

At first glance it contrasts with agile methodologies such as Scrum. These are based on developing in small iterations that build functionality incrementally. Each of these “sprints” is a small project that takes a chunk of requirements, or user stories, from the product backlog and provides an increment to the product. Requirements are not very clear until they get prioritised and included in a sprint.

This approach is usually focused on functional requirements, but it doesn’t have to be. As long as the architect is involved in the development process, the backlog prioritisation is aligned with architecture scenario-based priorities, and enough design decisions are made upfront, there’s no reason why Scrum and architecture can’t provide synergies. Moreover, intermediate architecture reviews or even trade-off analyses can be performed during the development process, after certain iterations.

After all, agile methodologies are frequently applied in projects where the architecture is implicitly defined anyway, via frameworks such as Spring, Hibernate, Grails, etc. So the idea of "no design upfront" doesn't really apply.