System design paradigm: Distributed source of truth storage

Abracadabra · Dec 23, 2020

Source of truth

In designing distributed applications, it is good practice to separate the concern of serving latency from that of scaling out the data storage that acts as the source of truth. Serving latency is mostly handled by an in-memory index or cache. The source-of-truth storage problem instead focuses on scale, durability, and consistency.

Below is a typical design pattern for a distributed application. I will explain this pattern in detail in another post; the focus of this post is the source-of-truth storage.

I will use the key-value data model in this discussion for simplicity. There are two main approaches to designing horizontally scalable distributed storage.

Two-layer structure: sharding + replication

The first approach is sharding + replication. Sharding solves the problem that the data is too big to store on one host. Replication solves the durability and availability problems. Each shard is replicated as a Paxos or Raft cluster, so the shard keeps functioning as long as a majority of the nodes in the cluster are available.

Write requests can only be handled by the leader, and the contents are replicated to all followers. Reads can be handled by a follower if the client doesn't care about consistency, or if there is a special mechanism to ensure linearizability, like Google's TrueTime.

One challenge of this design is adding shards. Resharding is usually not an option because it requires moving all of the data. One common solution is to create a large enough number of logical shards up front and assign multiple shards to each physical server, then dynamically adjust the load. Google Spanner constantly adjusts the load across shards.
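
To make the pre-allocated-shards idea concrete, here is a minimal sketch. The shard count, server names, and static placement map are made up for illustration; in a real system the placement map would live in a metadata service and be adjusted dynamically by a load balancer.

```python
# Sketch: many logical shards mapped onto a smaller set of servers.
# NUM_SHARDS is fixed up front, so capacity is added later by moving
# whole shards between servers instead of rehashing every key.
import hashlib

NUM_SHARDS = 4096                      # chosen once, never changes
SERVERS = ["server-0", "server-1", "server-2"]

# Placement map: logical shard -> physical server (static here for simplicity).
shard_to_server = {s: SERVERS[s % len(SERVERS)] for s in range(NUM_SHARDS)}

def shard_of(key: str) -> int:
    """Stable key -> shard mapping; only the placement map ever changes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def server_of(key: str) -> str:
    return shard_to_server[shard_of(key)]

print(server_of("user:42"))            # e.g. "server-1"
```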

The most mature system using this hierarchical structure is Google Spanner.

Flat structure: consistent hashing

The second approach to distributed source-of-truth storage uses a single-layer structure based on consistent hashing. In this structure, each key is also replicated to several hosts. Consistent hashing specifies the assignment of keys to hosts: each key is hashed to an integer, and each host is hashed onto the same ring with the same function.

Let me use a concrete example to illustrate. Assume the hash range is [0, 65535] and each key is assigned to the next three hosts on the ring. In the diagram above, the key whose hash is 41780 is assigned to hosts X, Y, and Z, while keys whose hashes fall in (50170, 63937] are assigned to hosts Z, A, and B.
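
Here is a minimal sketch of such a ring in Python, using the same [0, 65535] hash range and three replicas per key. The host names and hash function are only illustrative, so the actual positions on the ring will not match the diagram.

```python
import bisect
import hashlib

RING_SIZE = 65536      # hash range [0, 65535]
REPLICAS = 3           # each key lives on the next three hosts clockwise

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    def __init__(self, hosts):
        # Hosts are hashed onto the ring with the same function as keys.
        self.points = sorted((ring_hash(h), h) for h in hosts)

    def replicas_for(self, key: str):
        """Return the REPLICAS distinct hosts clockwise from the key's position."""
        h = ring_hash(key)
        positions = [p for p, _ in self.points]
        idx = bisect.bisect_right(positions, h)
        owners = []
        for i in range(len(self.points)):
            host = self.points[(idx + i) % len(self.points)][1]
            if host not in owners:
                owners.append(host)
            if len(owners) == REPLICAS:
                break
        return owners

ring = Ring(["X", "Y", "Z", "A", "B"])
print(ring.replicas_for("some-key"))   # e.g. ['Z', 'A', 'B']
```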

This replication scheme was first described in Amazon's Dynamo paper and popularized by Cassandra.

The rest of this post discusses how the key operations could be implemented under this scheme, including write, read, and host restart under an unreliable network. The design presented here is just one way to implement them; many ideas are borrowed from Cassandra. I encourage readers to think about alternatives and the tradeoffs between availability and consistency.

Read and Write

Clients can specify the number of hosts to read from (R) for a read request, and the number of hosts a write must be replicated to (W) before a write request returns. Each write carries a timestamp.

If the write is replicated to fewer than W hosts, a failure is returned to the client.

If there is a conflict among the R nodes in a read, the value with the latest timestamp is returned.
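
Below is a rough sketch of how these R/W quorum rules and timestamp-based conflict resolution could look. The Replica class is an in-memory stand-in for a remote host (not any real client API), and failures are simulated with a simple up/down flag.

```python
import time

class Replica:
    """Stands in for one remote host; can be marked down to simulate failures."""
    def __init__(self):
        self.data = {}          # key -> (timestamp, value)
        self.up = True

    def write(self, key, ts, value):
        if not self.up:
            raise ConnectionError("replica unavailable")
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        if not self.up:
            raise ConnectionError("replica unavailable")
        return self.data.get(key)          # (timestamp, value) or None

def quorum_write(replicas, key, value, w):
    """Each write carries a timestamp; fail unless at least W replicas ack."""
    ts = time.time()
    acks = 0
    for r in replicas:
        try:
            r.write(key, ts, value)
            acks += 1
        except ConnectionError:
            pass
    if acks < w:
        raise RuntimeError(f"write failed: only {acks} acks, {w} required")

def quorum_read(replicas, key, r_count):
    """Read from R replicas; on conflict, the latest timestamp wins."""
    results = []
    for r in replicas:
        try:
            results.append(r.read(key))
        except ConnectionError:
            continue
        if len(results) == r_count:
            break
    if len(results) < r_count:
        raise RuntimeError("not enough replicas responded")
    versions = [v for v in results if v is not None]
    return max(versions)[1] if versions else None
```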

Partial write failure

If the total number of replicas is N, a common belief is that R + W > N implies ‘consistency’. This is not precise, because a value from a failed write can still be observed by subsequent reads. For example, with N = 3 and R = W = 2, if a write request succeeds on only one node, the value returned by the next read depends on which two nodes it happens to read from. Remember that a read returns the value with the latest timestamp when the R nodes return different values.

Such inconsistency is temporary and will eventually be fixed by read repair.

Read repair

When inconsistency is detected during a read request, a background process heals it by fetching the value from all N replicas and forcing a write of the value with the latest timestamp.
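
A sketch of read repair along those lines, reusing the in-memory Replica stand-in from the quorum sketch above:

```python
def read_repair(replicas, key):
    """Fetch the value from all N replicas and force-write the newest version back."""
    versions = []
    for r in replicas:
        try:
            v = r.read(key)
            if v is not None:
                versions.append(v)
        except ConnectionError:
            continue
    if not versions:
        return
    newest_ts, newest_value = max(versions)
    for r in replicas:
        try:
            # No-op on replicas that already hold the newest version.
            r.write(key, newest_ts, newest_value)
        except ConnectionError:
            pass    # a still-down replica will be repaired by a later read
```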

Client interaction

All the hosts are identical from the client’s perspective. The client can connect to any host, which then serves as the coordinator for the request. This greatly simplifies the design and makes the system more fault tolerant.

The coordinator needs to know the configuration of the consistent hash ring. This information is maintained externally in a strongly consistent distributed system like ZooKeeper. Each host caches it locally and re-queries ZooKeeper when a conflict is detected during RPC interaction with other hosts.
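
One possible shape for that caching behavior is sketched below. The fetch_ring_config callable and the StaleConfigError signal are assumptions standing in for a ZooKeeper read and a peer's config-version check; they are not real APIs.

```python
class StaleConfigError(Exception):
    """Raised by a peer when the coordinator's config version is out of date."""

class Coordinator:
    def __init__(self, fetch_ring_config):
        # fetch_ring_config() -> (version, ring); stands in for a ZooKeeper read.
        self._fetch = fetch_ring_config
        self.version, self.ring = self._fetch()     # cache the ring at startup

    def call_peer(self, peer, request):
        try:
            return peer.handle(request, config_version=self.version)
        except StaleConfigError:
            # Conflict detected during the RPC: re-read the authoritative
            # config, then retry once with the fresh ring.
            self.version, self.ring = self._fetch()
            return peer.handle(request, config_version=self.version)
```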

Host membership

The hosts on the hash ring are virtual nodes, and a mapping from virtual nodes to physical servers is also stored in ZooKeeper. This design makes load balancing among physical servers easier.
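
A tiny sketch of the two-step lookup this enables, with an illustrative virtual-node-to-server map; the ring here is the consistent-hash ring sketched earlier, just with virtual node names on it instead of hosts.

```python
# The ring from the earlier sketch, with virtual node names placed on it.
ring = Ring(["vnode-0", "vnode-1", "vnode-2", "vnode-3"])

# Virtual node -> physical server mapping; stored in ZooKeeper in the real
# system, a plain dict stands in for it here.
vnode_to_server = {
    "vnode-0": "machine-1",
    "vnode-1": "machine-2",
    "vnode-2": "machine-1",
    "vnode-3": "machine-3",
}

def physical_replicas(key):
    """Two-step lookup: key -> virtual nodes on the ring -> physical servers."""
    # A production system would also skip vnodes that land on an
    # already-chosen physical server, so replicas stay on distinct machines.
    return [vnode_to_server[v] for v in ring.replicas_for(key)]

print(physical_replicas("user:42"))    # e.g. ['machine-3', 'machine-1', 'machine-2']
```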

When a host is down

Reads and writes continue to succeed as long as R and W hosts, respectively, respond.

When a new physical machine is added

The new machine will be responsible for some of the keys, but initially it doesn’t have any data. Before it is made available for serving, it needs to populate itself by copying data from the other replicas. It is OK if the copied data is stale or inconsistent, because read repair will fix it.
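
A sketch of that bootstrap step is below. The membership hooks (existing_replicas_for, mark_available) are hypothetical, and the data copy reuses the Replica stand-in from the earlier quorum sketch.

```python
def bootstrap(new_replica, assigned_vnodes, existing_replicas_for, mark_available):
    """Copy data for each assigned vnode from an existing replica, then go live."""
    for vnode in assigned_vnodes:
        # Pick any current replica of this vnode as the copy source (hypothetical hook).
        source = existing_replicas_for(vnode)[0]
        # The copy can be stale or inconsistent if writes land during the scan;
        # that is acceptable because read repair will converge it later.
        for key, (ts, value) in source.data.items():
            new_replica.write(key, ts, value)
    # Only now publish the membership change so coordinators route traffic here.
    mark_available(new_replica, assigned_vnodes)
```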

Load balancing (LB)

Virtual-node-level LB is easier. A process monitors the load on each physical server and reassigns virtual nodes from overloaded servers to less loaded ones. This also requires background data transfer plus read repair to reach consistency. Assuming A is overloaded and B is underloaded, the procedure for transferring virtual node V1 from A to B is (a code sketch follows the list):

  1. Copy V1’s data from A to B. Note that this copy can be inconsistent if there are updates during the copy.
  2. Update the virtual-node mapping in ZooKeeper. B will begin serving requests for V1.
  3. Delete V1’s data from A.
  4. The inconsistency introduced in step 1 will eventually be fixed by read repair.
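
Here is a minimal sketch of those four steps, with each server's data modeled as a dict keyed by virtual node and the ZooKeeper mapping update reduced to a hypothetical callback:

```python
def move_vnode(vnode, src_data, dst_data, update_mapping_in_zk):
    """Transfer one virtual node's data between servers, mirroring steps 1-4 above.

    src_data / dst_data: dict of vnode -> {key: (timestamp, value)} held by each
    server; update_mapping_in_zk is a hypothetical hook for the ZooKeeper update.
    """
    # 1. Copy V1's data from A to B; the copy may be stale if writes arrive meanwhile.
    dst_data.setdefault(vnode, {}).update(src_data.get(vnode, {}))
    # 2. Flip the vnode -> server mapping in ZooKeeper; B starts serving V1.
    update_mapping_in_zk(vnode)
    # 3. Delete V1's data from A.
    src_data.pop(vnode, None)
    # 4. Any staleness introduced in step 1 is eventually healed by read repair.
```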

Key-level LB is more complex and requires adding virtual nodes to the hash ring; after that, the procedure is similar.

Conclusion

The consistent-hashing-based distributed storage introduced in this post is shared by most NoSQL systems. NoSQL became popular around 2010, but at that time those databases were not stable, were hard to operate, and were usually very slow. Many companies, including Facebook, chose a simplified version of the sharding + replication solution: no Raft cluster per shard, trading consistency for higher speed.

In recent years, NoSQL systems (for example, Cassandra) have matured and proved themselves in production, with superior horizontal scalability and ease of maintenance. In my mind, the benefit of choosing Cassandra over a manually sharded DB is reduced complexity in SRE operations and simpler application infrastructure. With manual sharding, you are essentially pretending that the system is not distributed, and you have to expose the sharding to the application in the process. The advantage is the ability to execute relational queries within each shard.
