Preparing for a system design interview

Hello readers,

It has been a while (2 years!!!) since my last entry. I’ve been busy, and I apologize for not writing sooner. I’ve got some news: there were layoffs at my previous company, and I’m now on the market looking for my next opportunity.

Being more of a senior candidate, I can look forward to system design interviews in technical assessment rounds. There will be future posts on how to conduct yourself in different phases of a software engineering interview, but for now, I am sharing my notes on a case study that I put together to ramp up on concepts that you should know.

Let’s dive into it. One thing before we start: this post is about 3,000 words and takes roughly 12 minutes to read, so I’d advise setting aside some time to read and digest it 🙂

How to get started with a system design question in the context of an interview?

Usually, system design questions start out vague. The interview is primarily a conversation between two parties: the interviewer and you. Remember that you are the party responsible for driving the conversation, but that doesn’t mean you shouldn’t ask questions. That’s encouraged! You can only go on and design a system once you have a use case, feature, or requirement.

You must start the interview by asking clarifying questions. Doing so sends a solid signal to the interviewer that you understand software engineering isn’t a discipline focused solely on implementation; that’s the ā€˜easy’ part. There will be a time and place to focus on the implications of implementing whatever transpired while devising the system.

We need to take a step back, talk through complex topics, and even decide to drop specific items, whether because of their complexity or because we lack the resources, such as time or staffing, to accomplish the goal(s).

Here, your goals would be to identify a few things, such as:

  • What’s the expected traffic for the website?
  • Being a big platform serving multiple communities, what should we include in the scope of a 45-60 minute design discussion?
  • Are we heavier on read requests or write requests?
  • What core features will you cover within the scope of the conversation?

Use cases

Okay, since we’re not talking to an interviewer, we’ll make many assumptions about what information we’d extract from the interviewer as we begin our discussion.

Questions to ask the interviewer

  • Are we designing the system to support all the communities of StackExchange? No.

Huge help here! We trim a lot of scope by supporting only one of the communities in our design.

  • Being a huge platform, do we design to support everything? No, we’ll only focus on the core features of Stackoverflow.

Huge help again! These clarifying questions are vital to your success. We only need to focus on the core features you’d expect to see in Stackoverflow. If you encounter a system you’ve never used before, ask what those core features are, then ask relevant questions to understand better what the design needs to cover.

  • Can a user flag or comment on a post (question/answer)? No, and no.

  • Should we design a search feature? Yes; search is one of the core features we’ll scope below.

What are the constraints and assumptions for the system design? 

  • What core features will be included in the Stackoverflow platform design? It should allow users to look for posts and tags and provide relevant and popular results.
  • Is there a need to have suggestions as the user is typing? Yes; suggestions relevant to the search term should show up within 300 ms.
  • Can a user ask questions? Yes.
  • Can a user answer questions? Yes.
  • Can we downvote or upvote a question? Yes.
  • Should we have registered accounts? No, for simplicity’s sake, let’s act as if all users were anonymous, and we won’t support any registered account or have any features related to accounts.
  • How important is it that we never lose a post? Very.
  • What availability guarantees are we providing to our users? The system should be up 99.9% of the time.

In-scope items

  • A user gets on Stackoverflow and uses the search bar to find a specific post to solve their coding problem
  • A user goes on Stackoverflow and uses the search bar to find a particular tag to solve their coding problem
  • A user can ask a question
  • A user can answer a question
  • A user can upvote or downvote a post
  • A user should be able to see the list of questions asked on the platform
  • The user will be anonymous

Out-of-scope items

  • Users registering an account on Stackoverflow
  • Users flagging posts
  • Ranking for users
  • Users commenting on either questions or answers
  • All communities other than Stackoverflow
  • Logging and analytics

Constraints and assumptions

State assumptions

  • Searches should be fast
  • Posting a question and an answer should be fast
  • We can never lose a post
  • We have 300 million users
  • 3 billion searches per month
  • 10 billion posts read per month
  • 10 million posts written per month
  • 1,000:1 read-to-write ratio (10 billion reads to 10 million writes)

Here, as you extract information from the interviewer during the discussion, you can start to form an idea of how the system will look, but this is only the beginning. Because you’re racing against the clock, the interviewer might drop a few things so you can focus on the more critical components.

This portion of the interview is meant to test whether you can develop a coherent strategy that answers the requirements provided throughout the discussion. Having a complete design is a minor aspect, depending on the level you’re interviewing at and the company you’re talking to. As long as you demonstrate your train of thought and can defend your choices for the distributed system, you’ll be doing just fine.

With the information gathered here, we can observe that we have a large-scale application to start designing, but there are still many questions to ask as we’ll be focused on building each component, such as:

  • Do I need strong consistency or eventual consistency?
  • What kind of information will I be storing?
  • Do I need a cache? If so, what eviction policy should we pick to keep the latency low? How big should the cache be?
  • How much storage should I set aside to keep posts alive for at least ten years?

Let’s move towards calculating the usage.

Calculate system usage

Storage for system

Whether a question or an answer, a post can go as high as 10 KB. Since we have ten million new posts per month, this means that we have a total of:

  • 10 KB/post * 10 million posts = 100 GB per month
  • 12 TB in 10 years
  • 1.2 billion new posts in 10 years

Request throughput

  • ~4 post write requests per second
  • ~4,000 post reads per second
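
These figures come straight from the stated assumptions; as a sanity check, here is the back-of-envelope arithmetic as a quick sketch (approximating a month as 30 days):

# Back-of-envelope throughput check; a month is approximated as 30 days
SECONDS_PER_MONTH = 30 * 24 * 3600            # ~2.6 million seconds

posts_written_per_month = 10_000_000          # from the stated assumptions
posts_read_per_month = 10_000_000_000

print(posts_written_per_month / SECONDS_PER_MONTH)   # ~3.9 write requests per second
print(posts_read_per_month / SECONDS_PER_MONTH)      # ~3,858 reads per second, call it ~4,000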

Caching needs

Using the Pareto principle, where we estimate that 20% of the posts generate 80% of the traffic, we can calculate how much capacity we’d need to keep a healthy cache that sustains the traffic. We get about 4 new posts per second, or roughly 346,000 posts created daily.

346,000 posts generated per day * 20% * 10 KB payload per post ≈ 0.7 GB of memory.

Even a modest cache server could hold many days’ worth of hot posts; a machine with 128 GB of memory dedicated to caching would hold months of them. After evaluating the system’s real needs, we can attribute more capacity to the server.
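
The questions earlier asked which eviction policy to pick; a least-recently-used (LRU) policy is a common choice for this kind of hot-post cache, though nothing here commits us to it. A minimal sketch in Python, assuming we key entries by post URL (in production this would be Memcache or Redis rather than hand-rolled):

from collections import OrderedDict

class LRUCache:
    """Evicts the least-recently-used post once the cache is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, url):
        if url not in self.items:
            return None                     # miss: the caller falls back to the SQL store
        self.items.move_to_end(url)         # mark as most recently used
        return self.items[url]

    def put(self, url, post):
        self.items[url] = post
        self.items.move_to_end(url)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the coldest entry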

Trie storage needs

Again, let’s use the 80-20 principle to estimate that 20% of prefixes will generate 80% of the searches. Let’s assume that search phrases average 50 characters and that it takes 2 bytes to store a character, so about 100 bytes per phrase.

100 million searches per day * 20% * 100 bytes = 2 GB per day.

If we expect some growth in search volume and estimate that 5% of each day’s queries are new, we can then calculate the following:

2 GB + (5% new queries * 2 GB * 365 days) = 38.5 GB of total storage for the year to keep the entire trie search data on disk.

Availability guarantees

We assumed that the system has to stay up 99.9% of the time, a figure known in the industry as the ā€˜three nines.’ Knowing that, we can calculate how much downtime we can plan for in a year.

We have 365.25 days in a year * 24 hours per day * 3,600 seconds in an hour = 31,557,600 seconds in a year.

We need to be up 31,526,042.4 seconds per year; that leaves about 8.8 hours of downtime per year, or roughly 86 seconds per day.

System APIs

Expect to provide a high-level idea of how the backend APIs would look; be ready for questions like these.

Asking a question on Stackoverflow

askQuestion(apiDevKey, title, body, tagList = null, withAnswer = false, answerBody = null)

Posting an answer on Stackoverflow

postAnswer(apiDevKey, questionPostId, body)

Searching for a post on Stackoverflow

searchForPost(prefixSearchTerm = null, tagList = null)

To limit abuse, we can cap the number of questions and answers someone can post by binding a limit of new posts per allocated time window to the API development key, as sketched below.
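
A minimal sketch of that throttle as a fixed-window counter keyed by apiDevKey; the window length and quota below are assumed values, not requirements from the discussion:

import time

WINDOW_SECONDS = 3600         # assumed window: one hour
MAX_POSTS_PER_WINDOW = 30     # assumed quota per API dev key

_windows = {}                 # apiDevKey -> (window_start, posts_in_window)

def allow_post(api_dev_key):
    now = time.time()
    start, count = _windows.get(api_dev_key, (now, 0))
    if now - start >= WINDOW_SECONDS:
        start, count = now, 0               # the window expired: reset the counter
    if count >= MAX_POSTS_PER_WINDOW:
        return False                        # over quota: reject the new post
    _windows[api_dev_key] = (start, count + 1)
    return True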

High-level design

We have a client (C) making HTTP requests to a web server backend. This server can route requests to:

  • Post Write API
  • Post Read API
  • Post Search API

There are a few services that the top-level APIs can contact:

  • The Question Service
  • The Search Service

Due to reliability concerns, using a SQL data store makes sense to preserve the existing posts written over time. We can have a simple proof of concept with a single machine. Every API would interact with the data store for reading or writing operations.

Core component design

In this portion, discuss and clarify with your interviewer how much code you must write to support the system’s use cases.

Use case: The user posts a question

Because we’re concerned with a reliable product that can’t lose any posts from anonymous users, we could use a relational database to store all the questions asked over time.

  • The Client sends a question request to the Web Server, running as a reverse proxy.
  • The Web Server forwards the request to the Post Question API server.
  • The Post Question API server does the following:
  1. Confirms that the request’s body data is valid before moving forward.
    1. The title isn’t missing or duplicated, doesn’t contain illegal characters, and isn’t longer than the database allows.
    2. The body isn’t missing or longer than the database permits.
  2. Generates a unique URL for the new page.
    1. Checks that the URL is unique by looking in the SQL Database for duplicates. If the URL isn’t unique, generates a new one.
  3. Saves the post to the SQL Database QuestionPost table.
  4. Returns the URL.

The QuestionPost table could have the following structure:

create table QuestionPost (
  Id int not null identity(1,1) primary key,
  Title varchar(512),
  Body text,
  Url varchar(80), -- length assumed: a six-character key plus room to grow
  CreatedOn datetime,
  ModifiedOn datetime,
  Upvotes int,
  Downvotes int,
  AnswerId int null,
  foreign key (AnswerId) references AnswerPost(Id)
)

To generate the URL of the question, we could use a hashing algorithm that takes the title and the timestamp as input values. We can start with MD5, which produces a 128-bit hash; hashing is deterministic, so one input can only ever map to one hash result. We can then encode that hash with Base62, which covers [a-zA-Z0-9]. By taking the first six characters of the Base62 output, we get 62^6 (roughly 57 billion) possible values, which should be more than sufficient to handle the constraint of 1.2 billion new posts in 10 years.
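
A minimal sketch of that URL-key generation; the function names are illustrative, and a real system would retry on collision as described above:

import hashlib
import string

# [0-9A-Za-z]: the 62 characters Base62 covers
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def base62(n):
    out = ""
    while n:
        n, r = divmod(n, 62)
        out = ALPHABET[r] + out
    return out or "0"

def question_url_key(title, timestamp):
    # MD5 is deterministic: the same title and timestamp always hash the same way
    digest = hashlib.md5(f"{title}:{timestamp}".encode()).digest()
    # Interpret the 128-bit hash as an integer, encode it, keep six characters:
    # 62^6 is roughly 57 billion possible keys
    return base62(int.from_bytes(digest, "big"))[:6]

print(question_url_key("How do I exit Vim?", 1700000000))  # prints a six-character key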

When we create a post, the date and time allow us to extend the homepage’s design: by indexing this table on the CreatedOn column, we can retrieve the most recently asked questions. Moreover, once we start tracking analytics for the website, we could provide a section showcasing the most frequently visited questions. We did not add a visit count as a column on this entity because that would be a poor design: it brings unnecessary coupling that would force the database to be updated every time a web page is visited.

Use case: User searches for a specific post with a prefix search term with typeahead suggestions

  • The Client sends a search request to the Web Server, running as a reverse proxy
  • The Web Server forwards the request to the Search API server
  • The Search API contacts the Typeahead Service
  • The Typeahead Service does the following:
    • Uses the trie data structure to find matching words
    • Returns the output to the Search Service
  • The Search Service inspects popular posts in its cache to see if a post matches the input
    • In the event of a cache miss, we search the SQL Database for titles matching the request

The system has to provide suggestions to users as they type. Storing all the possible strings somewhere and querying that source for lookups would be slow. We need a scheme that stores the data compactly and queries it efficiently to keep latency low, which is a requirement for search.

An appropriate solution for this problem is a trie, a tree structure designed for quick lookups of a word or sentence. It requires less storage than a hash table, though lookups may take longer, and it handles prefix searches efficiently. Since a popular prefix can produce many hits, we can combine the result lists and return only the top matches for the prefix, backed by a cache. Here, a distributed cache such as Memcache or Redis would do great.
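
A minimal sketch of the structure, assuming each node keeps a small precomputed list of suggestions for its prefix (a common typeahead optimization); the ranking logic is left out:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_suggestions = []    # precomputed top post titles for this prefix

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase, max_suggestions=5):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            # Keep a capped suggestion list at every node along the path
            if phrase not in node.top_suggestions and len(node.top_suggestions) < max_suggestions:
                node.top_suggestions.append(phrase)

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []            # no posts match this prefix
            node = node.children[ch]
        return node.top_suggestions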

We calculated earlier that we’d need to store about 40 GB for the trie for a given year. Since interacting with a single trie can be slow, it’d be better to break it down across several physical nodes, which would speed up the search. And because we get 100 million searches per day, we can’t update the trie after each search; it would add too much latency.

We’d be better off updating the tries periodically. In the meantime, we can keep two copies on a node: the primary, which is online and serving traffic, and the secondary, which can be updated at any time. Once the secondary is updated, we swap it in and propagate the updates to the other copy.

We can log and track statistics for search terms, then update the trie at a given frequency to serve users the latest and most popular search results. We’d be able to rank the results returned by the nodes using the following:

  • The frequency of searches
  • The technologies the users are interested in, once we have registered accounts
  • The personal history of the user once we have registered accounts

Furthermore, we’ll have to be smart about tracking the searches’ frequencies: recounting everything every time we need to update the tries is too slow (bad latency). Instead, we can correct the differences in frequency for the top searched terms. If we keep the top searches of the past two weeks, we only need to subtract the hits that have aged out of the window. Moreover, we can give newer hits more weight than older ones, meaning something popular yesterday counts more than something popular 12 days ago, as sketched below.
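One way to implement that weighting, sketched under the assumption of an exponential decay with an arbitrary one-week half-life:

import time

HALF_LIFE_DAYS = 7.0   # arbitrary: a hit loses half its weight every week

def decayed_score(hit_log, now=None):
    """hit_log: list of (unix_timestamp, hit_count) pairs for one search term."""
    now = now or time.time()
    score = 0.0
    for ts, count in hit_log:
        age_days = (now - ts) / 86400
        score += count * 0.5 ** (age_days / HALF_LIFE_DAYS)  # newer hits weigh more
    return score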

Data partitioning

We need partitioning to scale out our database infrastructure to hold millions of posts, so we need an appropriate strategy to divide and store the data on different database servers. Here, we’re looking to efficiently store both the posts and the data backing the tries.

In the context of the posts, we could embrace a hashing-based solution. Compared to keeping everything on a single physical node, this approach is fantastic for minimizing hotspots and removing single points of failure, and it makes our system more resilient when something goes wrong in real-world conditions.

We can hash the post’s title and, from that value, calculate which node should store it. However, if we don’t use consistent hashing here, we could still overload individual partitions, and we wouldn’t have entirely solved the single point of failure.

It’s naive to take the hash result modulo the total number of nodes to determine where data should go. We know that any hardware or software component can degrade or fail in a distributed system, and tying the mapping to the absolute number of servers is detrimental: a server could go down, or we could add more servers to the resource pool as we scale horizontally. Either way, the previous mapping would break and require remapping all the existing keys, which takes time and means going offline. That makes the system less available and reduces the operational time during which it can perform its required function.

With consistent hashing, we map the data to the physical nodes in a way that ensures only a few keys move when servers are added to or removed from the resource pool. Each node is associated with a range of values, or tokens, instead. A key is generated by a hashing algorithm such as MD5, and the algorithm’s output determines which server is responsible for storing the data. We can introduce virtual nodes in this system to resolve load-balancing issues; without them, we could still encounter hotspots or significant overhead when a node is added or removed and the tokens must be recomputed.

A virtual node, or vnode, breaks a sizeable static range down into smaller subranges. In practice, vnodes are randomly distributed across the cluster and are non-contiguous, meaning two neighboring vnodes won’t sit on the same physical node. Having vnodes decreases the odds of hotspots and makes it easier to adjust token sizes based on each server’s physical characteristics.
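
A minimal sketch of such a ring with virtual nodes, using MD5 for token placement; the vnode count per server is an arbitrary choice:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes_per_server=100):   # vnode count is arbitrary
        self.ring = []                                    # sorted (token, server) pairs
        for server in servers:
            for i in range(vnodes_per_server):
                self.ring.append((self._hash(f"{server}#vnode{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        # MD5 output interpreted as a 128-bit integer token
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    def node_for(self, key):
        # Walk clockwise to the first vnode at or past the key's token
        i = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["post-db-1", "post-db-2", "post-db-3"])
print(ring.node_for("how-do-i-exit-vim"))   # the server responsible for this post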

For the trie servers, we can partition the data based on the maximum memory allocated for a trie: whenever a subtree doesn’t fit on a server, we break it at the partition boundary and move on to the next server. Depending on the user’s prefix search term, we might have to query multiple servers, so we’ll have to decide whether it’s better to aggregate the data in an aggregation layer on the backend or make the client responsible for it.

This was a big post!
