Sunday, December 23, 2012

CAP and other tradeoffs


I really liked Daniel Abadi's post on the CAP theorem [http://dbmsmusings.blogspot.de/2010/04/problems-with-cap-and-yahoos-little.html]. It is certainly true that CAP is an oversimplified acronym. However, while Daniel's post is great food for thought it is also not thorough itself.

For the ones that are not aquainted with CAP here's a quick explanation: a distributed system cannot be simulaneously consistent (C), available (A) and partition-tolerant (P). The P means that if nodes are "partitioned" the system should behave normally from the consumer's perspective, i.e. it should be both consistent and available. By partitioning it is meant that the nodes cannot communicate with each others, either because of a network failure or because of a node being down (which is sometimes impossible to distinguish anyway). In fact the usual term is fault-tolerant; I guess that the author(s) just wanted to have a catchy acronym.

Anyway the theorem postulates that it is impossible for a system to have the 3 qualities. In practice it means that when there is no communication between the nodes you have to choose either consistency or availability. It is not hard to understand because consistency depends on that communication (changes must be propagated to all nodes). Hence, if there is a problem with it a decision needs to be made: either the system doesn't respond until everything restored and guaranteed to be consistent (sacrificing availability) or it does (sacrificing consistency).

It is obvious that a system cannot be CA and not P. It is rather "if P then choose between C and A". So Daniel notices correctly that it can only be either CP or AP.

He says that CA and CP are identical but the way I see it CA just doesn't make sense.

He then goes on to explain that there are more trade-offs to consider apart from consistency and availability and he introduces latency backed by an example from Yahoo (PNUTS). Finally he proposes a new acronym, PACELC meaning: if P then choose either A or C Else choose either L or C.

In fact usually there is a tradeoff between consistency and latency, since consistency in distributed systems is expensive. But that is not necessarily true. The solution may not be sensitive to latency. Depending on the solution and its environment there will be for sure many other trade-offs to consider such as between availability and maintainability (whether changes to the system configuration require downtime) or between latency and security (authentication and encryption both increase latency) just to name 2 typical ones.

Moreover, lantency and availability are intertwined themselves. If the system has too high latency then it will be considered unavailable - in fact it is often the case, for example in cluster controllers, that a node is considered down when it takes too long to respond. In practice if it is higher than defined in the OLA (Operational Level Agreement) then it will mean P; if it is higher than defined in the Service Level Agreement (SLA, which is higher than the OLA) than it will mean A.

If it is lower than both than it may not even be an issue, depending on the SLA.

So the way I see it CAP identifies a pre-defined trade-off for a pattern of distributed systems that exists regardless of the solution, i.e. somewhere where a decision needs to be made: either to ensure consistency or availability, or find some other mitigating measure: for example by using more replicas (redundant nodes) you could reach a satisfactory level of C, A and P and even L but with an impact on cost, maintainability and possibly  security.

CAP is indeed a predefined tradeoff in one architectural pattern, that's all. As always the actual implication depends on the system design and on the requirements. I would call the pattern a "cloud pattern": a client-server architecture where the server is distributed in many equal nodes and the client keeps no data itself but expects the server to be always available and always consistent. This is the pattern used in Big Data distributed file systems such as Hadoop (or more specifically HDFS), which is the reason why CAP became so popular.

For different patterns the theorem does not apply: for example (as a someone commented in Daniel's post) in a publisher/subscriber model, where the service immediately informs its clients about changes, consistency may not a problem anymore because the fact that the servers are not guaranteed to be consistent at a specific point in time may not even affect the clients.

Good articles on CAP:
http://ksat.me/a-plain-english-introduction-to-cap-theorem/
http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/
http://www.royans.net/arch/brewers-cap-theorem-on-distributed-systems/

No comments:

Post a Comment

Comments are always welcome. They will be moderated for posts older than 14 days. In that case a delay of a few hours can be expected before publishing.