This System Design Cheat Sheet gives you a quick idea to tackle with the building blocks of any system design interview questions. Keep following this post for regular updates.
Primary database is your single source of truth when you save data. You always rely on this data.
Which database to choose?
We always get confused with the database to use as our primary database. Here are some points to keep in mind to make a better decision.
- Relational, Table-based
- When you require ACID (Atomicity, Consistency, Isolation, and Durability) properties.
- You choose SQL Database for saving financial transactions which gives you ACID guarantee that multiple records belongs to the same transaction are saved either all or nothing.
- You choose SQL Database when rigid schema is required. For e.g. you want to maintain strict relationships between the tables. you want to maintain unique, not_null constraints.
- You choose SQL Database when records are updated very frequently.
- You choose SQL Database when Joins and complex queries are required.
- SQL Databases are: Oracle, MySQL, MariaDB, Microsoft SQL Server, and PostgreSQL
- Document-based, key-value pairs, Graph database
- You choose NoSQL Database when flexible schema is required.
For e.g. e-commerce like Amazon can have Products in varied schema. Electronic products can have capacity (lt), power-rating (3-star) whereas Clothing products can have size (S,M,L,XL), material (cotton).
- You choose NoSQL when horizontal scaling is required. They scale very well.
- NoSQL Databases are: MongoDB, BigTable, Redis, RavenDb, Neo4j, and CouchDb
If you need any of the ACID property then use SQL Database else use NoSQL Database.
As your company grow, data grow and scalability comes into picture:-
SQL Databases are generally scaled vertically by adding more CPU, RAM or SSD, so that database can handle more load.
NoSQL Databases are generally scaled horizontally to store huge amount of data using Database Sharding.
Database sharding means partitioning of data in multiple databases. They are multiple ways:-
- Incoming data can be saved in databases in round-robin fashion
- Incoming data can be saved in databases based on some key. for e.g. userId in case of saving user profile.
total database in horizontal partitioning = 5 [DB_1, DB_2, DB_3, DB_4, DB_5] consistent hashing = 5 databases makes a consistent hash ring DB_1 ➚ ↘ DB_5 DB_2 ↖ ↙ DB_4 ⟵ DB_3 hash(userId_1) = 1002 database to insert this record = 1002%5 = 2 (DB_2) hash(userId_2) = 5119 database to insert this record = 5119%5 = 4 (DB_4)
Availability is measured based on replication factor. Replication factor = 3 means a database is having 2 additional copies which keep themselves in-sync asynchronously.
- Active-Active Data read and write happen in any database replica. This gives us eventual consistency because there might be a case where:
record_A write to DB_1
record_A read from DB_2 immediately
record_A might not be available in DB_2 if sync is yet to happen
- Active-Passive (Master-Slave) Data read and write happen in master database. Data is synced with slave database asynchronously. When master goes down, Slave started serving as master. This pattern is used when consistency is important.
- Write-Read Data write happen in master database. Data read happen from any database. This also gives us eventual consistency with increasing read performance.
- Cluster where data is partitioned across multiple database
total database in cluster = 5 [DB_1, DB_2, DB_3, DB_4, DB_5] replication factor = 3 consistent hashing = 5 databases makes a consistent hash ring DB_1 ➚ ↘ DB_5 DB_2 ↖ ↙ DB_4 ⟵ DB_3 Data write for DB_1 is copied to DB_2 and DB_3 (next 2 databases in the ring) Data write for DB_2 is copied to DB_3 and DB_4 (next 2 databases in the ring) ... ... Data write for DB_5 is copied to DB_1 and DB_2 (next 2 databases in the ring)
In consistent hash ring, next two databases either can contain 100% data copy of previous database or share the 50%-50% data copy depending upon the requirement.
Geo sharding is nothing but database sharding based on geo location.
- It is used by ride hailing apps such as Uber, Grab to store live locations of drivers and passengers.
- It is also used by tinder and other dating sites to provide recommendations based on location.
How to do?
- Google S2 library divides the geo locations into cells (for e.g. 1 cell = 100x100 miles square).
- You get the live location (coordinates lat, long) of user and pass to S2 library, which returns the cell_id in which that location falls into.
- Use this cell_id as a key to partition the data.
Consistent hashing is the algorithm used by:
- Most of the load balancers to distribute the load across multiple services.
- Most of the database horizontal partitioning to distribute the data across multiple databases.
The idea of consistent hashing is to place all the services (or databases) across the consistent hash ring and distribute the load based on some hashing algorithm. This gives us the flexibility to remove or add the service (or database) to the ring without disturbing the whole cluster with minimum or no data loss.
total database in cluster = 5 [DB_1, DB_2, DB_3, DB_4, DB_5] consistent hashing = 5 databases makes a consistent hash ring DB_1 ➚ ↘ DB_5 DB_2 ↖ ↙ DB_4 ⟵ DB_3
- If DB_2 goes down, Next database in the ring i.e. DB_3 is going to take additional load (future writes). If replication is not implemented then existing data of DB_2 will be lost but other database will still work.
- If DB_6 is added newly between DB_5 and DB_1 then it share the 50% load of the next database in the ring i.e. DB_1.
File Storage Options (Image/Video)
E-Commerce like Amazon, Social Media like Instagram, Facebook, Twitter, Streaming Provider like Netflix requires to store a huge amount of Images and Videos.
Images and Videos are stored a BLOB (Binary Large Object) and Database Storage is not a good option for this kind of storage. Why?
- You are not going to query on BLOB Storage
- You do not need any ACID guarantees for BLOB storage as such provided by RDBMS
- Costly to store in Database
Cheaper and scalable options to store such files are Distributed File Systems (DFS) such as Sharepoint, Amazon S3.
Details about the image/video such as its DFS URL, metadata can be save in Database, referred by Image ID.
It is also a smart idea to save thumbnail of image, or cover photo of the video in database, size of which should not exceed few KBs. Initially the thumbnail or cover photo can be sent to the client along with the URL of image/video. It is a good experience for the user to see something while image/video is downloading/streaming from DFS.
You may want to use Content Delivery Network (CDN) which distribute the same images/videos to different locations geographically near to the client locations.
You need to use Caching at some point of time in your system design for faster read.
Few famous Caching solutions are:
- Redis (recommended)
Text Search Capabilities
Almost all the websites requires searching capabilities in one way or another. For e.g. Amazon for product search, Netflix for movies search, Youtube for video search, Social media websites for user profile search, and so and so forth.
Some of the popular text search engine:
- Elastic Search (recommended)
- Apache Solr
- Apache Lucene
- Apache Lucene is the core framework written based on Map-Reduce and Inverted-Index algorithm.
- Elastic Search and Apache Solr has been written on top of Apache Lucene provide additional search capabilities such as fuzzy search, type ahead, search suggestions.
- You don’t rely on these for storage. You generally save data in your primary database and write to Elastic Search or Apache Solr asynchronously.
Reporting and Analytics
Data warehouse where you generally dump a lot of data and they provide very complex querying capabilities.
- Apache Hadoop
- Amazon Redshift
- Elastic Search