The anatomy of a horizontally scalable distributed application

Abracadabra
6 min read · Dec 23, 2020


Motivation of this post

It’s a daunting task for many engineers to design a fully horizontally scalable application. In practice, very few of us have the opportunity to build such systems end to end. The lack of hands-on experience can be compensated by years of experience working indirectly with each technology (load balancer, sharding, etc). But it takes time to master all the tools and design patterns.

In hindsight, I find such a design is quite straightforward if one has a holistic picture of

  • what problems need to be solved (consistency, latency, availability, durability)
  • how each technology solves one specific problem
  • how the technologies should be assembled, and the tradeoffs involved

Sadly, I’ve yet to find any book or online material that teaches this knowledge from the application-infrastructure angle. There are books that go quite deep on each subtopic, like Designing Data-Intensive Applications. However, such books still don’t teach readers how to think about system design from a practitioner’s perspective.

This post aims to address this problem. After reading, experienced engineers should feel comfortable with distributed system design problems at work or in job interviews. Engineers with limited knowledge of the concepts in this post will feel challenged, but they will know what to learn on the path to becoming a competent architect.

Anatomy of a distributed application

Surprisingly, most, if not all, distributed system design problems can be solved using one simple template. In practice, the success of any system involves far more details. But this template is enough for any high level design.

Separation of read and write

Almost all applications can be broken down into read and write requests. They should be designed and represented separately. The separation makes the design easy to reason about and communicate.

Read/Write app server: stateless and load-balanced

The ‘Read App Server’ and ‘Write App Server’ are the stateless services that handle read and write requests respectively. They can be easily scaled out with load balancing due to their stateless nature. For the same reason, they are fault tolerant.

Message queue: the system backbone

Most systems are far simpler when modeled as event-driven rather than request-driven. Therefore a message queue (MQ) is the core of the system. It serves two purposes:

  1. MQ is the message hub of the event-driven system
  2. MQ makes the consumers (the services doing the heavy lifting) fault tolerant

Without an MQ, a system consisting of N services can have O(N²) types of RPC calls. With an MQ, the event-based system has O(N) types of RPC calls, because all the inter-service RPCs are replaced with message publishes to, and subscriptions from, the MQ.

Without an MQ, the app server would need to call the ‘DB update’ module or other consumer modules directly via RPC, and the system would be unavailable whenever such worker modules are down. With an MQ, messages are persisted in the MQ, so writes don’t need to be blocked even if all workers are unavailable. This is desirable for eventually consistent applications. I know some companies even use an MQ (Kafka) for long-term data persistence instead of a DB.
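The decoupling described above can be sketched in a few lines. This is a minimal illustration using Python’s in-memory `queue.Queue` as a stand-in for a real broker like Kafka; the names (`write_app_server`, `db_updater`) are invented for the sketch, not taken from any real API.

```python
import queue
import threading

mq = queue.Queue()   # stand-in for the message queue
db = {}              # stand-in for the database

def write_app_server(key, value):
    # The stateless write app server only publishes an event and returns.
    # It never calls the DB updater via RPC, so it stays available even
    # when all downstream workers are down.
    mq.put({"key": key, "value": value})

def db_updater():
    # A worker consuming from the MQ. If it crashes, events simply
    # accumulate in the queue and are processed after a restart.
    while True:
        event = mq.get()
        if event is None:   # sentinel used to stop the demo worker
            break
        db[event["key"]] = event["value"]

worker = threading.Thread(target=db_updater)
worker.start()

write_app_server("user:1", {"name": "alice"})
write_app_server("user:2", {"name": "bob"})

mq.put(None)   # shut the worker down for the demo
worker.join()
print(db["user:1"]["name"])  # alice
```

Note that the write path returns as soon as `mq.put` succeeds; durability of the write then depends on the broker’s persistence, which is exactly why Kafka-style disk-backed queues fit this slot.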

The typical choice of MQ is Kafka. Kafka splits each topic into multiple partitions to provide horizontally scalable write capacity, at the expense of losing cross-partition message ordering. All of Kafka’s operations are translated into sequential reads and writes to disk, so they are very fast. Kafka’s author has an amazing post about the philosophy of MQ-centric design. Be sure to read it.

DB: the distributed source of truth

In this template, the DB’s main purpose is to serve as the data persistence layer rather than to serve read queries. The DB’s focus is thus strong consistency (linearizability + serializability) and durability.

To be horizontally scalable, the data must be partitioned (a.k.a. sharded). There are two flavors. The first is the leaderless NoSQL DB; these are all based on ideas from Amazon’s Dynamo. The second is the leader-based DB: first split the data into partitions, then give each partition a leader-replica structure. The most successful implementation is Google Spanner.
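The common starting point for both flavors is a deterministic key-to-partition mapping. A minimal sketch, assuming simple hash partitioning (real systems often use consistent hashing or range partitioning to ease rebalancing):

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; real clusters pick this per workload

def partition_for(key: str) -> int:
    # Use a stable hash (not Python's per-process randomized hash()) so
    # every node in the cluster computes the same key -> partition mapping.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

# The mapping is deterministic: any replica routing "user:42" agrees.
print(partition_for("user:42") == partition_for("user:42"))  # True
```

In a leader-based design, each of the resulting partitions would then carry one leader and several replicas; the router sends writes to the leader of `partition_for(key)`.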

It’s worth noting that with data replication, the DB’s read throughput also scales out. Thus the DB can answer read queries directly if latency is not an issue. In practice, it usually is. That’s why we build an in-memory index to serve read requests.

In-memory index: the fast and volatile state

This is the volatile, stateful, and fast data used to serve online read requests. There are two flavors:

  1. Full data snapshot
  2. Cache

Some applications choose (1) because they need the full data in memory. Usually they partition (shard) at the application level so that the full data fits in memory. Each request hits every shard in parallel, and the results are aggregated before being returned to the app server. Each shard has its own replicas for scalability.
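This scatter-gather read path can be sketched as follows. Each “shard” here is just an in-process dict; in production each would be a network call to a shard replica, but the fan-out/aggregate shape is the same.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative shards: listing -> view count, split across 3 shards.
shards = [
    {"apple": 3, "kiwi": 7},
    {"banana": 5},
    {"cherry": 9, "plum": 1},
]

def query_shard(shard, min_count):
    # Per-shard work: return the local items matching the query.
    return [k for k, v in shard.items() if v >= min_count]

def scatter_gather(min_count):
    # Hit every shard in parallel, then aggregate the partial results
    # before returning to the app server.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = pool.map(lambda s: query_shard(s, min_count), shards)
    return sorted(item for part in parts for item in part)

print(scatter_gather(5))  # ['banana', 'cherry', 'kiwi']
```

The tail latency of such a request is the latency of the slowest shard, which is why each shard keeps replicas: the router can pick the fastest replica or hedge requests.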

In flavor (2), the in-memory index is used as a look-aside cache. This idea is discussed in detail in my other post. One common structure is illustrated in the chart below.

The app server will first ask the cache for any query and populate the cache if there’s a cache miss. Such populations need to hit the DB, so there’s a tradeoff between cache size and DB workload. Multiple consistency issues are discussed in detail in this post.
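The look-aside read path is short enough to show in full. A minimal sketch, with a dict standing in for the cache (e.g. memcached) and a counter so the DB cost of misses is visible:

```python
cache = {}                      # stand-in for the look-aside cache
db = {"user:1": "alice"}        # stand-in for the database
db_reads = 0                    # counts how often we had to hit the DB

def db_read(key):
    global db_reads
    db_reads += 1
    return db.get(key)

def read(key):
    if key in cache:            # cache hit: fast path, no DB load
        return cache[key]
    value = db_read(key)        # cache miss: go to the DB...
    cache[key] = value          # ...and populate the cache
    return value

read("user:1")   # miss: one DB read
read("user:1")   # hit: served from the cache
print(db_reads)  # 1
```

The cache-size/DB-workload tradeoff is visible here: every eviction turns a future hit back into a `db_read`.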

Notice that the data (state) in those caches is volatile. If the service crashes, it must reload the data from the DB. The exact cold-start mechanism needs to be carefully designed so that

  • the DB is not overwhelmed by cache warm-up
  • the app server never serves traffic until the cache is fully warmed up
  • if the cache crashes, cascading failures are avoided
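One way to satisfy the first two requirements is a warm-up gate: the cache node scans the DB in rate-limited batches and only flips a readiness flag once fully loaded. The sketch below is illustrative; the class name, batch size, and delay are all assumptions for the example.

```python
import itertools
import time

class WarmableCache:
    def __init__(self, db_scan, batch_size=2, delay=0.0):
        self._db_scan = db_scan      # callable yielding (key, value) pairs
        self._batch_size = batch_size
        self._delay = delay          # pause between batches (rate limit)
        self._data = {}
        self.ready = False           # the load balancer checks this flag

    def warm_up(self):
        it = iter(self._db_scan())
        while batch := list(itertools.islice(it, self._batch_size)):
            self._data.update(batch)
            time.sleep(self._delay)  # throttle the scan so the DB survives
        self.ready = True            # only now accept read traffic

    def get(self, key):
        if not self.ready:
            raise RuntimeError("cache still warming up")
        return self._data.get(key)

cache = WarmableCache(lambda: [("a", 1), ("b", 2), ("c", 3)])
cache.warm_up()
print(cache.get("b"))  # 2
```

The third requirement (no cascading failure) typically also needs request coalescing or a circuit breaker on the miss path, which is out of scope for this sketch.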

Update fan out: keep index posted

This service is the twin of the DB updater. Its role is to apply new writes to the in-memory index. This operation is usually a fan-out. For example, when a user posts a tweet, the new tweet needs to be added to all followers’ feeds.

Because the fan-out module is a stateless consumer of the MQ, the system will continue to serve both reads and writes when the fan-out module is down. Consistency will of course suffer, but the index will catch up once the fan-out module is restarted. Such eventual consistency is acceptable for most high-scale applications we can think of.
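The fan-out-on-write step itself is simple; the hard parts are scale and ranking. A minimal sketch, with the follower graph and feeds as plain dicts standing in for the real index:

```python
from collections import defaultdict

# Illustrative follower graph: author -> list of followers.
followers = {"celebrity": ["alice", "bob"], "alice": ["bob"]}
feeds = defaultdict(list)   # the in-memory index: user -> materialized feed

def fan_out(author, post):
    # Stateless MQ consumer: reads a "new post" event and appends the
    # post to every follower's materialized feed.
    for user in followers.get(author, []):
        feeds[user].append((author, post))

fan_out("celebrity", "hello world")
fan_out("alice", "hi")

print(feeds["bob"])  # [('celebrity', 'hello world'), ('alice', 'hi')]
```

Because the consumer is stateless, any number of fan-out workers can process events from the MQ in parallel, and a crashed worker loses nothing: unprocessed events stay in the queue.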

Examples of template instantiation

Almost all user facing applications can leverage this template to become a fully horizontally scalable system. Below are some examples.

User feed

According to this tech talk, Twitter used literally this design to implement their news feed. The index is the user’s feed. The fan-out is very complex, as it needs to

  • take care of ranking
  • avoid the thundering-herd effect when a celebrity posts an update

Top K most popular listings on Amazon

The task is to show the most viewed product on Amazon for each category.

This problem itself is a template for many system design interviews. Almost identical problems include showing a real-time like count for each FB post, or showing the real-time CTR of any post in the past hour. They can all be solved with the same solution.

The index is the data structure used to answer read queries. For most popular listings, it is the view count for each listing plus a min-heap of size K holding the top K most viewed listings.
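As a sketch of that index: a counter of views per listing, queried with a size-K min-heap over the counts. For brevity this recomputes the top K per query via `heapq.nlargest` (which uses a size-K min-heap internally); a production index would maintain the heap incrementally on each view event.

```python
import heapq
from collections import Counter

views = Counter()   # the in-memory index: listing -> view count

def record_view(listing):
    # Applied by the update fan-out service for each "view" event.
    views[listing] += 1

def top_k(k):
    # nlargest keeps a min-heap of size k while scanning the counts.
    return heapq.nlargest(k, views.items(), key=lambda kv: kv[1])

for listing in ["tv", "tv", "tv", "phone", "phone", "mug"]:
    record_view(listing)

print(top_k(2))  # [('tv', 3), ('phone', 2)]
```

The same shape answers the FB-likes and CTR variants: only the event type feeding `record_view` and the value being counted change.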

The DB stores the raw updates and uses them to recover the index when the cache crashes.

Online chatting

Online chat applications like Slack can also be implemented with this design, with some tweaks. Instead of an in-memory index, this slot holds a service that maintains long-lived connections.

More functions can be added easily: marking messages as delivered, loading all messages when the user first logs in, group chat, etc. If you find this template useful but are not yet comfortable with distributed system design, I encourage you to add the above features using this template.


Written by Abracadabra

“Writing in essence is rewriting”
