System design paradigm: Primary-replica pattern
High read load problem
This design pattern solves the problem of scaling out read throughput on the DB. Assume a single DB host has 1ms latency for our read request and can handle 10 reads in parallel. The throughput (QPS) will be (1 / 0.001) × 10 = 10K/s. If we are happy with the 1ms latency, but our popular service has a read QPS of 100K/s, how can we scale out to handle it?
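A quick sanity check of that arithmetic, as a minimal sketch in Python:

```python
# Back-of-the-envelope throughput of a single DB host (numbers from the example above).
latency_s = 0.001      # 1 ms per read
parallel_reads = 10    # reads the host can serve concurrently

single_host_qps = (1 / latency_s) * parallel_reads
print(f"single host: {single_host_qps:,.0f} QPS")   # -> single host: 10,000 QPS
```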
DB reads usually become the bottleneck first because:
- Read QPS is often 10~100 times higher than write QPS
- A DB query is slower than most app-server computation since it usually needs to read from disk. Longer latency implies lower throughput per connection (throughput = 1 / latency)
Primary-replica pattern
The primary-replica setup is the classic solution for scaling out DB reads.
Only the primary DB host handles updates. Updates on the primary are synced to the replicas via binlog replay. Most mainstream databases, like MySQL, have built-in support for this setup. Read requests are load balanced (LB) across the replicas.
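As a minimal sketch of how an application might route traffic under this pattern (the connection objects and their `execute` method are hypothetical placeholders, not a real driver API):

```python
import random

class ReplicatedDB:
    """Route writes to the primary and load-balance reads across the replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary      # connection to the primary host (hypothetical object)
        self.replicas = replicas    # connections to the read replicas

    def write(self, sql, params=()):
        # All updates go to the primary; replicas catch up via binlog replay.
        return self.primary.execute(sql, params)

    def read(self, sql, params=()):
        # Naive load balancing: pick a random replica for each read.
        return random.choice(self.replicas).execute(sql, params)
```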
To handle the 100K/s read QPS in our example, we can simply run 10 replicas; after LB, each host needs to process 10K/s. In practice we need a few more replicas (12, for example) so that one failed replica won't push the remaining hosts over capacity and cause a cascading failure.
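The sizing math, including that headroom, is easy to check (a sketch using the example's numbers):

```python
import math

read_qps = 100_000        # total read load
per_host_qps = 10_000     # capacity of one replica, from the example above

min_replicas = math.ceil(read_qps / per_host_qps)   # 10
provisioned = min_replicas + 2                      # headroom: 12

# If one of the 12 replicas fails, the survivors still stay under capacity:
load_after_failure = read_qps / (provisioned - 1)   # ~9,091 QPS < 10,000 QPS
print(min_replicas, provisioned, round(load_after_failure))   # 10 12 9091
```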
Besides scaling out the read load, another benefit of this pattern is more resilient persistent storage, since the data now lives on multiple hosts.
Tradeoff: consistency
The primary-replica setup results in replication lag on the replicas and is a classic eventual-consistency model. Essentially we trade strong consistency for read scalability. Eventual consistency is enough for most applications, except for ones requiring 'read your writes' consistency.
'Read your writes' consistency can be improved by forcing the read to go to the primary if it closely follows a write, or, more naively, by forcing the read to wait several seconds so that all replicas have caught up. When some replicas are not in the same datacenter (DC), the read also needs to be restricted to the same DC.
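One way to sketch the "route to primary after a write" idea, building on the earlier `ReplicatedDB` sketch (the sticky window and per-user bookkeeping here are illustrative assumptions, not a prescribed design):

```python
import time

class ReadYourWritesDB(ReplicatedDB):
    """After a user writes, pin that user's reads to the primary for a short window."""

    STICKY_WINDOW_S = 5.0    # assumed worst-case replication lag

    def __init__(self, primary, replicas):
        super().__init__(primary, replicas)
        self._last_write = {}    # user_id -> time of that user's last write

    def write(self, user_id, sql, params=()):
        self._last_write[user_id] = time.monotonic()
        return self.primary.execute(sql, params)

    def read(self, user_id, sql, params=()):
        last = self._last_write.get(user_id, float("-inf"))
        if time.monotonic() - last < self.STICKY_WINDOW_S:
            # Replicas may not have caught up yet: serve this read from the primary.
            return self.primary.execute(sql, params)
        return super().read(sql, params)
```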
There are other techniques for mitigating the consistency issue when there's an application cache; I won't go deeper into them in this post.
Replication vs. cache
Using a cache can also mitigate the read scaling problem. Assuming the cache hit rate is 80%, we effectively take 80% of the read QPS away. In our example, the 100K QPS becomes 20K QPS that actually hits the DB.
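The arithmetic, as a short sketch:

```python
read_qps = 100_000
hit_rate = 0.80                       # assumed cache hit rate

db_qps = read_qps * (1 - hit_rate)    # only cache misses reach the DB
print(f"{db_qps:,.0f} QPS reach the DB")   # -> 20,000 QPS reach the DB
```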
From the perspective of scalability in distributed system design, cache and replication serve different goals. A cache is in memory and is used to improve latency. Replication is still on disk and is used to scale out read throughput and enhance durability.
The weakness: primary failure
There's no free lunch. While the primary-replica pattern is very scalable in theory and can solve many design problems, its weakness makes it a lesser solution in practice.
The main challenge comes when the primary host is down. If it's a transient network partition, replicas will have higher lag. But if the primary is hard down, we face a hard choice.
Though I do know some fairly big companies that depend on human intervention to recover from a primary failure, it is not acceptable in many use cases because:
- It abuses SRE on-call
- The downtime is not acceptable for high-availability systems like Amazon's
Therefore we need a solution for the primary's automatic fault recovery. In short, I don't recommend playing this game yourself.
As you may have sensed, this problem is very hard. GitHub has shared their solution (here and here). The idea is to have a separate system that constantly monitors the status of the primary and the lag on each replica. The monitor detects the primary's failure and adjusts the replication topology to promote one replica as the new primary. This requires being exposed to many low-level network details. I find it intimidating to depend on unfamiliar open-source projects doing tricky stuff on the network.
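To make the idea concrete, here is a deliberately simplified monitor loop; it is only a sketch of the concept, not GitHub's actual system, and `is_healthy`, `replication_lag`, and `promote_to_primary` are hypothetical hooks:

```python
import time

HEALTH_CHECK_INTERVAL_S = 1.0
FAILURE_THRESHOLD = 3    # consecutive failed checks before acting

def monitor_and_failover(primary, replicas):
    """Toy failover loop: detect a dead primary and promote the freshest replica."""
    failures = 0
    while True:
        if primary.is_healthy():    # hypothetical health-check hook
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                # Promote the replica with the smallest replication lag.
                candidate = min(replicas, key=lambda r: r.replication_lag())
                candidate.promote_to_primary()
                # A real system must also repoint the other replicas and the
                # application's write traffic, and fence off the old primary.
                return candidate
        time.sleep(HEALTH_CHECK_INTERVAL_S)
```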
Many NoSQL databases have symmetric hosts and thus good support for node failures. I believe the main benefit today of a NoSQL database like Cassandra is the ease of operation. I will cover the NoSQL database solution in later posts.
Notice that replica failure is less of an issue, as we can simply provision more replicas so that a failed replica can wait for human intervention.