Design a WhatsApp chat messaging system.
Functional Requirements:
- 1:1 chat messages
- Read Receipts
- Last Seen
- Starred Messages
- Messages may contain media
Non Functional Requirements:
- Scalable
- Fault Tolerant
- Resilient - in the sense that messages should reach the recipient and not get lost!
The schema that I have created is below. I am choosing RDBMS here because relationships across different entities are clearly well defined. One can use a document store or NoSQL if they wish.
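As a rough sketch only, the four tables this design refers to (UserSessions, UserMessageTemp, Messages, UserMediaServer) could look like the following. Only the table names come from the design; every column name, type and index here is my assumption, and SQLite is used purely so the snippet runs anywhere.

```python
import sqlite3

# Hypothetical columns: only the four table names come from the design itself.
SCHEMA = """
CREATE TABLE UserSessions (
    user_id      TEXT PRIMARY KEY,
    node_id      TEXT NOT NULL,      -- server node holding the user's connection
    connected_at INTEGER NOT NULL
);
CREATE TABLE UserMessageTemp (
    message_id   TEXT PRIMARY KEY,
    sender_id    TEXT NOT NULL,
    recipient_id TEXT NOT NULL,
    body         TEXT NOT NULL,
    created_at   INTEGER NOT NULL
);
-- Offline messages are looked up by recipient when that user reconnects.
CREATE INDEX idx_temp_recipient ON UserMessageTemp (recipient_id);
CREATE TABLE Messages (
    message_id   TEXT PRIMARY KEY,
    sender_id    TEXT NOT NULL,
    recipient_id TEXT NOT NULL,
    body         TEXT NOT NULL,
    status       TEXT NOT NULL CHECK (status IN ('sent', 'delivered', 'read')),
    created_at   INTEGER NOT NULL
);
CREATE TABLE UserMediaServer (
    media_id     TEXT PRIMARY KEY,
    message_id   TEXT NOT NULL REFERENCES Messages (message_id),
    object_path  TEXT NOT NULL       -- path of the file in the object store
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all four tables on the given connection."""
    conn.executescript(SCHEMA)


def table_names(conn: sqlite3.Connection) -> list:
    """Return the names of the tables that exist, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The `status` column is one possible way to back the single tick/delivered/read states; starred messages and last seen would need their own columns or tables, which I have left out here.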
Schema & Logic:
- Users connect from their mobile or WhatsApp Desktop client and may land on any of several LBs. It is very important to keep track of which user is connected to which node, so we save this in the UserSessions table as soon as a user comes online.
- Happy Case - User A wants to send a message to user B and both are connected/online (they may be connected to same or different nodes, does not matter!).
- Here user A sends a message to the server via an HTTP REST call. The LB does nothing except forward the request to the Sessions microservice.
- The Sessions service formats the request into DBMS-friendly data and saves it into UserSessions only, for now.
- In parallel, as soon as the data is saved in UserSessions, an acknowledgement is sent to client A, and client A sees a 'single tick' on the message, meaning the message reached the server but has yet to be sent to user B.
- Since B is online, the Sessions service sends the message directly to B. It can get B's node details from UserSessions.
- Important to note: communication from server to client (acknowledgements, read receipts, delivering a message to the client, etc.) cannot be done with plain REST, because REST calls are initiated by the client and here the server has to push to the client. Instead we use long polling or WebSockets, where the client keeps listening to the server and consumes whatever the server sends. Also note that long polling can drain the user's battery and may not work on some devices when the device is in Battery Saver mode!
- Not So Happy Case - A sends a message to B, but B is not online. The Sessions service again has A's details in UserSessions, but there are no details for B, so the server does not know where to send the message. In this case we save the message in a temporary table, UserMessageTemp.
- Whenever B comes online, its entry gets added to UserSessions. We can have a job that polls/listens to this table, so the server learns that B is online, fetches B's messages from UserMessageTemp, sends them to B and then deletes those records from UserMessageTemp.
- Rows in UserMessageTemp need to be deleted once they have been delivered, so the table stays small and fast.
- As soon as client B receives the message and/or reads it, an ack is sent to client A. This drives the delivered/read receipts.
- Once a message has completed this end-to-end flow, it is saved in the Messages table for historical purposes. This table can be archived every 7 days for performance reasons. Even users who keep their message history don't access it often, so it is better to archive messages older than 7 days.
- Starred messages and last seen are quite straightforward.
- If a user sends media in chat, we save the media directly in an object store like S3. The S3 path is saved in the UserMediaServer table. When the recipient wants to download/see the image, we get them a copy from S3.
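To make the happy and not-so-happy cases concrete, here is a minimal in-memory sketch. The class `SessionService` and its method names are hypothetical, the two dicts stand in for the UserSessions and UserMessageTemp tables, and `_push` is a placeholder for a real WebSocket push. As a simplification, pending messages are flushed directly when the user connects instead of by a separate polling job.

```python
from dataclasses import dataclass, field


@dataclass
class SessionService:
    # user_id -> node_id; stands in for the UserSessions table
    sessions: dict = field(default_factory=dict)
    # recipient_id -> list of (sender, text); stands in for UserMessageTemp
    pending: dict = field(default_factory=dict)
    # record of every push made to a client, for inspection
    delivered: list = field(default_factory=list)

    def connect(self, user: str, node: str) -> int:
        """Register the user's node and flush any messages parked for them."""
        self.sessions[user] = node
        backlog = self.pending.pop(user, [])  # pop also deletes the temp rows
        for sender, text in backlog:
            self._push(node, user, sender, text)
        return len(backlog)

    def disconnect(self, user: str) -> None:
        """Forget the user's node when they go offline."""
        self.sessions.pop(user, None)

    def send(self, sender: str, recipient: str, text: str) -> str:
        """Deliver now if the recipient is online, otherwise park the message."""
        node = self.sessions.get(recipient)
        if node is None:
            self.pending.setdefault(recipient, []).append((sender, text))
            return "queued"
        self._push(node, recipient, sender, text)
        return "delivered"

    def _push(self, node: str, recipient: str, sender: str, text: str) -> None:
        # Placeholder for pushing over the recipient's open connection.
        self.delivered.append((node, recipient, sender, text))


svc = SessionService()
svc.connect("A", "node-1")
first = svc.send("A", "B", "hi")       # B is offline, so the message is parked
flushed = svc.connect("B", "node-2")   # B comes online and the backlog is sent
second = svc.send("A", "B", "again")   # B is online, so this goes straight out
```

In a real deployment the two dicts would be shared storage, since A and B may be connected to different nodes.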

Architecture Diagram

- All servers are Load Balanced/made redundant - we can use something like a single master/multiple slaves model to replicate data.
- Communication between services takes place via an MQ like Kafka/RabbitMQ.
- We also have a dead letter queue which holds all failure events, and we can send them to an error analysis service (not pictured here). This can be used by the L3 team/devs to troubleshoot.
- Clients send message to servers using REST and they consume message from server using WebSocket/Long Polling
- Every night at 2:00 AM, a backup service runs so that the messages in each user's local DB are backed up on WhatsApp servers. We should ensure last seen is not updated by this, because the backup is automatic and not user activity.
- All messages saved in the DB/archive are encrypted, respecting user privacy. Note that MD5 is a one-way hash, not encryption: a hashed message can never be read back, so it cannot be used to store messages. Instead one can encrypt with a reversible cipher such as AES, and if 2-level encryption is wanted, encrypt the resulting ciphertext again with a second key or algorithm before saving it in the tables.
- Messages to and from clients are also encrypted in transit (e.g. with TLS) so hackers cannot snoop on the TCP packets.
- In cases of festivals/high-load days/natural disasters, one can scale down or switch off services like last seen and read receipts and instead pour more resources into the core functionality - peer messaging.
- If the load on the servers is too much and even horizontal scaling does not help, we take the sender's messages and park them in a queue, from which they are later sent to the recipient. This way there could be some delay before a message reaches client B, but since the message is stored in an MQ, it is not lost. Also, in such high-pressure cases, we can give more priority to text messages than to media.
- All DBs are replicated.
- All DB nodes can talk to each other - they can use a gossip protocol where they send a heartbeat to each other every 5-10 seconds. If a node misses 2 back-to-back heartbeats, we treat it as dead and spin up a new node. The node configurations can be stored in Apache ZooKeeper.
- The same logic applies to the servers/microservices as well.
- We can have CDNs and caches for the media server, especially for images/videos.
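The "2 missed heartbeats means dead" rule from the list above can be sketched as below. This is only the failure-detection check, not a full gossip protocol; `HeartbeatMonitor`, the 5-second interval and the node names are all assumptions for illustration, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that went silent."""

    def __init__(self, interval_s: float = 5.0, missed_limit: int = 2):
        self.interval_s = interval_s      # expected gap between heartbeats
        self.missed_limit = missed_limit  # consecutive misses before "dead"
        self.last_seen = {}               # node_id -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` was heard from at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list:
        """Nodes silent for longer than `missed_limit` heartbeat intervals."""
        cutoff = self.interval_s * self.missed_limit
        return sorted(
            node for node, seen in self.last_seen.items() if now - seen > cutoff
        )


monitor = HeartbeatMonitor(interval_s=5.0, missed_limit=2)
monitor.heartbeat("db-1", now=0.0)
monitor.heartbeat("db-2", now=0.0)
monitor.heartbeat("db-2", now=5.0)
monitor.heartbeat("db-2", now=10.0)   # db-2 keeps reporting, db-1 went silent
dead = monitor.dead_nodes(now=11.0)   # db-1 has now missed two heartbeats
```

Whatever calls `dead_nodes` would then be responsible for spinning up a replacement node.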
If we get more time, one can think out loud for 2-3 minutes about the features below:
- If user is chatting with a WhatsApp business account, one can save user trends in some table (e.g. what is user buying, what does seller sell, etc) and connect this table to an ML engine for intelligent suggestions.
- Group Chats
- Disappearing Messages
- Efficient WhatsApp message searches/historic message searches
- VoIP/Video calls, etc.
- If a very unfortunate disaster occurs, say forest fires or an earthquake, WhatsApp can use users' precise location details at that moment and record this data in a DB which can be shared with authorities to help users stranded in the danger zone. This is a bit of a gray area, as we could violate users' privacy, and users will also need to allow location permissions on their Android/iOS devices.
Guys - How would you rate my design? Any feedback, positives, negatives? How would you have designed WhatsApp?