Trade-offs in distributed systems (part 2)

This is the second part of this topic. Make sure you read part 1 if some of the concepts (in bold font) in this post are unfamiliar or ambiguous to you. The intended readers are familiar with common distributed storage systems but want to systematically learn the vocabulary to describe and reason about the trade-offs. Otherwise, you could be overwhelmed by the systems and design patterns involved.

What can be compromised?

As we mentioned in part 1, we should assume the following are non-negotiable in 99% of distributed systems:

  • Network partition
  • Serializability
  • SLA-availability

That leaves only linearizability and low latency as candidates to give up, and indeed, most system designs trade one for the other. Even so, their narratives often vaguely use consistency and availability to describe the trade-off. SLA-availability alone is not enough for most applications: Amazon can't expect users to wait a minute while their orders are processed.

A note on throughput (QPS)

Before diving into the common design paradigms, I want to say a word about throughput. Throughput (QPS) is measured as the number of requests processed per second. Latency is measured in seconds of request processing time. For sequential (non-concurrent) systems,

throughput = 1 / latency

Thus high latency usually implies low throughput. For example, since write operations on the same register must be serialized (processed one by one), high write latency implies low write throughput. This is not necessarily true for reads, because reads can be executed in parallel. I thought about including throughput in the picture but decided not to, because it would complicate the description. With the knowledge from this post, readers should be able to comfortably reason about designs where throughput is a concern.
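
As a back-of-the-envelope illustration (all numbers here are hypothetical, just to make the formula concrete):

```python
# Hypothetical numbers, just to illustrate the formula above.
write_latency = 0.010  # 10 ms per write; writes on a register are serialized

# Sequential operations: throughput is the reciprocal of latency.
write_qps = 1 / write_latency  # 100 writes/sec at best

# Reads can run in parallel, so read throughput scales with the
# number of concurrent readers instead of being capped at 1/latency.
read_latency = 0.010
parallel_readers = 8
read_qps = parallel_readers / read_latency  # 800 reads/sec

print(f"max write QPS: {write_qps:.0f}, max read QPS: {read_qps:.0f}")
```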

Low latency without linearizability (eventual consistency)

This is the more common breed of the trade-off.

Primary-replica setup of MySQL

The most popular design in this style is the primary-replica setup of a traditional DB. Data from the primary host is replicated to N replicas. All writes are processed by the primary. Read requests can be served by both the primary and the replicas. Updates are broadcast from the primary to the replicas asynchronously.

The ACID guarantees provided by the DB software ensure the serializability of reads and writes. Obviously, linearizability can't be guaranteed because of the asynchronous updates.
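
A minimal sketch of why asynchronous replication gives up linearizability (the classes and the replication queue here are a toy model, not MySQL internals):

```python
import queue
import threading

class Replica:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)  # may return a stale value

class Primary:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.log = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, key, value):
        self.data[key] = value      # commit locally, ack the client...
        self.log.put((key, value))  # ...and ship to replicas in the background

    def _replicate(self):
        while True:
            key, value = self.log.get()
            for r in self.replicas:  # applied after an arbitrary delay
                r.data[key] = value

# A read on a replica issued right after Primary.write() returns can
# still see the old value: the write is acknowledged before it replicates.
```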

Look-aside cache

A look-aside cache is how Facebook architected their memcached clusters. Reads populate the cache. After writing to the DB, a process deletes the existing cache entry. Since invalidating all cache replicas can't be instantaneous, it's possible for an old value to be returned from a memcached node where the entry is not yet invalidated, even after the new value has been read from a node where invalidation is done. Thus linearizability is lost.
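
A minimal sketch of the read and write paths (the cache and db objects are hypothetical stand-ins, not Facebook's actual API):

```python
def read(key, cache, db):
    value = cache.get(key)
    if value is None:          # cache miss
        value = db.get(key)    # fall back to the source of truth
        cache.set(key, value)  # reads populate the cache
    return value

def write(key, value, cache, db):
    db.set(key, value)  # update the DB first...
    cache.delete(key)   # ...then invalidate (not update) the cache entry
    # With replicated cache nodes, the delete reaches each node at a
    # different time, which is the window where stale reads slip in.
```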

W + R <= N in NoSQL

Another typical design under this trade-off is supported by most NoSQL databases. Each register is replicated to N hosts. Each write operation must succeed on W hosts before it is acknowledged to the client as successful. Each read operation reads from R hosts and returns the newest version found.
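
A minimal sketch of the quorum scheme (the versioning and replica selection are simplified assumptions):

```python
import random

class QuorumStore:
    """Toy model: each key belongs on all N nodes, but a write is
    acknowledged after reaching only W of them; a read consults R."""

    def __init__(self, n, w, r):
        self.nodes = [dict() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0

    def write(self, key, value):
        self.version += 1
        # Only W replicas are guaranteed to have the value at ack time;
        # the remaining N - W catch up asynchronously (omitted here).
        for node in random.sample(self.nodes, self.w):
            node[key] = (self.version, value)

    def read(self, key):
        # Consult R replicas and return the newest version seen.
        replies = [n.get(key) for n in random.sample(self.nodes, self.r)]
        versions = [x for x in replies if x is not None]
        return max(versions)[1] if versions else None

# With W + R <= N, the R replicas read can entirely miss the W replicas
# written, so a read may return an old value even after a successful write.
```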

This paradigm is commonly called eventual consistency. The name faithfully captures the nature of the trade-off. There are other realizations of this paradigm, but the above should be enough for illustration.

Linearizability with high latency

We can have both linearizability and serializability if we are willing to accept high latency (and low throughput). Consensus algorithms like Raft and Paxos are specifically designed to solve this problem. (If you are hardcore enough to study one of them, I highly recommend Raft, though Paxos is far more famous.)

This paradigm is also termed strong consistency. Due to the high latency, such systems are usually not user-facing. For example, ZooKeeper provides strong consistency and is typically used to store small metadata or for service discovery.

Notice that, contrary to a common misunderstanding, W + R > N in NoSQL cannot guarantee strong consistency, which implies serializability. Many nuances exist in corner cases. For example, take <N = 3, W = 2, R = 2>. If a write operation succeeds on only one node, the return value of the next read depends on which two hosts it reads from. Thus the system is not serializable under failed writes. Cassandra uses a read-repair mechanism so that the system eventually reaches consensus; I will skip it for simplicity.
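
To make that corner case concrete, here is a toy replay of the <N = 3, W = 2, R = 2> scenario (the node states are hypothetical):

```python
# N = 3 replicas storing key "x" as (version, value). A write of "v2"
# needed W = 2 acks but landed on only one node, so it was reported failed.
nodes = [
    {"x": (2, "v2")},  # the lone node the failed write reached
    {"x": (1, "v1")},
    {"x": (1, "v1")},
]

def read(pair):
    """R = 2: read two replicas, return the value with the newest version."""
    return max(nodes[i]["x"] for i in pair)[1]

print(read((1, 2)))  # -> v1: the failed write is invisible
print(read((0, 1)))  # -> v2: the failed write resurfaces
# Two reads with no successful write in between disagree, so the
# history is not serializable under failed writes.
```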

Google Spanner

Google Spanner is a breakthrough of recent years because it appears to achieve both low latency and strong consistency. Specifically, Spanner has all of the following features at the same time:

  • Very high SLA-availability (99.999%)
  • Lock-free reads (high concurrency, low latency)
  • Serializability
  • Linearizability

Like most distributed systems run by large shops, Spanner achieves SLA-availability through robust private networks and SRE best practices. Serializability is achieved by two-phase commit (across shards) and Paxos (within a shard).

In a nutshell, Spanner is a sharded database in which each shard is replicated as a Paxos cluster. The magic comes from measuring real time with atomic clocks, a method called TrueTime. TrueTime can tell each host the real time within a bounded margin of error, and the margin is less than 7 ms. This way, every operation has a bounded begin and end time, so sequential operations can be ordered by real time, which is exactly what linearizability requires.
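
A minimal sketch of how a bounded clock error turns into a real-time ordering; the interval-returning clock mirrors the published TrueTime description, while the commit-wait loop is my simplified assumption, not Spanner's actual implementation:

```python
import time

EPSILON = 0.007  # TrueTime's error bound: less than 7 ms

def tt_now():
    """TrueTime-style clock: an interval guaranteed to contain real time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_txn):
    """Commit wait (simplified): pick a timestamp at the top of the
    uncertainty interval, then wait until real time has surely passed
    it before making the write visible."""
    _, latest = tt_now()
    commit_ts = latest
    apply_txn(commit_ts)
    while tt_now()[0] < commit_ts:  # wait out the uncertainty window
        time.sleep(EPSILON / 2)
    return commit_ts  # any operation that starts later gets a larger timestamp
```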

On top of the traditional solutions for SLA-availability and serializability, Spanner shines by enabling lock-free reads that return the most up-to-date data (linearizability). The idea is very simple: each read carries a required timestamp t, and a replica serves the read directly if its latest update from the primary is later than t; otherwise, it contacts the primary for the truthful data.
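
A minimal sketch of that replica-side decision (the name safe_time is mine, chosen for illustration):

```python
class ReadOnlyReplica:
    def __init__(self, leader):
        self.leader = leader
        self.store = {}       # key -> value, possibly behind the leader
        self.safe_time = 0.0  # newest timestamp fully applied from the leader

    def read(self, key, t):
        """Serve a read at required timestamp t, without taking locks."""
        if self.safe_time >= t:
            # Everything up to t has been applied locally, so the local
            # copy is guaranteed up to date as of t.
            return self.store.get(key)
        # Otherwise the replica may be stale: ask the leader for fresh data.
        return self.leader.read(key, t)
```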

In precise CAP terminology, Spanner is consistent under partition: the minority side of a Paxos cluster can't handle writes, and a node partitioned from the primary may not be able to handle reads. In practice, thanks to Google's advanced data center maintenance, partitions are very rare, so Spanner is effectively always CAP-consistent with very high SLA-availability.

References

[1] Spanner, TrueTime and the CAP Theorem. The title says it all.

[2] Scaling Memcache at Facebook. Read it if you are interested in the nuances of look-aside caching at global scale. Recommended for serious learners of distributed caching.

[3] Internals of Google Cloud Spanner. A gentler, though not completely precise, introduction to Spanner.

[4 … n] I realized there are too many references, including the Dynamo paper, Cassandra read repair, Raft/Paxos, etc. I guess most readers won't be interested, so I will stop here.
