System Design: 100 Topics to Learn

System Requirements:

1. Functional requirements:

These are requirements that define specific behavior or functions, often represented by use cases.

  • How to Define: Start with user needs and business goals. Outline the primary tasks users should accomplish.
  • Working Backwards Approach: Begin with the end goal and trace steps backwards to identify requirements.

2. High Availability:

Availability is a measure of system uptime.

  • Time-based: E.g., 99.9% (or "three nines") uptime.
  • Count-based: E.g., 1 failure allowed in 1 million requests.
  • Design Principles: Redundancy, failover, replication, monitoring, and regular testing.
  • Processes: Regular maintenance, backup procedures, disaster recovery planning.
  • SLO vs SLA: Service Level Objective (SLO) is the target level of service. Service Level Agreement (SLA) is the contract with the customer promising certain SLOs.
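
The two ways of measuring availability above reduce to simple arithmetic. A minimal sketch, assuming a 365-day year by default; the function names are illustrative and not taken from any library:

```python
def allowed_downtime_minutes(availability_percent, period_days=365.0):
    """Downtime budget (in minutes) implied by a time-based availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def count_based_availability(successful_requests, total_requests):
    """Availability (in percent) measured as the share of successful requests."""
    return 100.0 * successful_requests / total_requests


# "Three nines" over a year leaves roughly 525.6 minutes of downtime.
yearly_budget = allowed_downtime_minutes(99.9)
```

The same target looks very different depending on the measurement window, which is why an SLO usually states the period explicitly.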

3. Fault Tolerance, Resilience, Reliability:

  • Error: An incorrect internal state.
  • Fault: A component's incorrect behavior.
  • Failure: When a system as a whole stops providing the required service.
  • Fault Tolerance: Ability of a system to behave in a well-defined manner once a fault occurs.
  • Resilience: Ability to return to a normal state after a failure.
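  • Game Day vs Chaos Engineering: Game Day is a planned, scheduled failure simulation. Chaos Engineering deliberately injects failures, often at random, to verify that the system tolerates them.
  • Reliability: Probability that a system will perform its required function without failure over a given period.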

4. Scalability:

  • Vertical Scaling: Increasing the resources of a single node.
  • Horizontal Scaling: Adding more nodes.
  • Elasticity vs Scalability: Scalability is the ability to handle more load by adding resources. Elasticity is the ability to add and remove those resources automatically as demand changes.

5. Performance:

  • Latency: Time taken to process a single operation.
  • Throughput: Number of operations processed in a given time period.
  • Percentiles: E.g., the 95th percentile (p95) latency is the value that 95% of request latencies fall at or below.
  • How to Increase: Optimize algorithms, caching, distributed systems.
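
As a rough illustration of percentile latency, here is a sketch using the nearest-rank method (one of several common percentile definitions). The `percentile` helper is a hypothetical name; monitoring systems often use interpolation or approximate histograms instead.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


latencies_ms = list(range(1, 101))  # 1 ms, 2 ms, ..., 100 ms
p95 = percentile(latencies_ms, 95)
```

Tail percentiles such as p95 or p99 are usually more informative than the average, because a few very slow requests barely move the mean.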

6. Durability:

Ensuring data is safely stored and retrievable.

  • Backup Types:
    • Full: Everything is backed up.
    • Differential: Only changes since the last full backup.
    • Incremental: Only changes since the last backup of any type.
  • RAID: Redundant Array of Independent Disks. A method for redundant storage.
  • Replication: Copying data to ensure durability and availability.
  • Checksum: A computed value to detect errors in data.
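
A checksum can be sketched with the CRC-32 function from Python's standard `zlib` module. The helper names are illustrative; storage systems may use stronger hashes where tampering, not just accidental corruption, is a concern.

```python
import zlib


def with_checksum(payload):
    """Return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)


def is_intact(payload, checksum):
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(payload) == checksum


data, crc = with_checksum(b"hello")
```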

7. Consistency:

Ensuring all users see the same data.

  • Eventual Consistency: Data will become consistent over time.
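  • Linearizability: Every operation appears to take effect atomically at a single point between its start and end, so once a write completes, all later reads see that value or a newer one.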
  • Monotonic Reads: If a process reads the value of a data item, any successive reads by that process will always return that value or a more recent one.
  • Read-Your-Writes: Guarantees that a write by a process is visible to a subsequent read by the same process.
  • Consistent Prefix Reads: Guarantees that every read receives a prefix (some initial segment) of the writes, in order.

8. Maintainability, Security, Cost:

  • Maintainability: Ease with which a system can be maintained. Includes monitoring, testing, and deployment strategies.
  • Security:
    • CIA Triad: Confidentiality, Integrity, Availability.
    • Identity and Permissions Management: Who can do what.
    • Infrastructure Protection: Protecting physical and virtual resources.
    • Data Protection: Ensuring data confidentiality, integrity, and availability.
  • Cost Aspects: Engineering, maintenance, hardware, and software costs.

9. Summary of System Requirements:

  • Most Popular Non-Functional Requirements: Availability, Scalability, Performance, Security, Maintainability.

10. Regions, Availability Zones, etc.:

  • Regions: Geographic areas that consist of multiple, isolated data centers.
  • Availability Zones: Isolated locations within a region.
  • Data Centers: Physical facilities housing servers.
  • Racks: Structures holding multiple servers.
  • Servers: Physical machines running software.

11. Physical Servers to Serverless:

  • Physical Servers: Tangible hardware running software.
  • Virtual Machines (VMs): Emulated computer systems.
  • Containers: Isolated application processes that share the host OS kernel. They are lighter than VMs because they do not run a separate guest OS.
    • Pros: Lightweight, fast, consistent environment, scalable.
    • Cons: Less isolation than VMs.
  • Serverless: Cloud-computing model which can run individual functions in response to events. Resources are fully managed by the cloud provider.
    • Pros: Scalability, no infrastructure management, cost-effective.
    • Cons: Cold starts, limited customization, statelessness.

12. Synchronous vs. Asynchronous Communication:

  • Synchronous: Sender waits for the receiver to respond.
  • Asynchronous: Sender sends the message and proceeds without waiting for a response.

13. Asynchronous Messaging Patterns:

  • Message Queuing: Temporarily storing messages to be processed later.
  • Publish/Subscribe: Senders (publishers) send messages to channels without specifying receivers. Receivers (subscribers) subscribe to channels they're interested in.
  • Competing Consumers: Multiple consumers process messages from a single channel, distributing load.
  • Request/Response Messaging: A two-way messaging pattern; a request is followed by a response.
  • Priority Queue: Messages are processed based on priority rather than order of arrival.
  • Claim Check: Large message is dropped off at a service and a claim check is provided to the requester, which can be used to get the full message later.

14. Network Protocols:

  • TCP (Transmission Control Protocol): Reliable, connection-oriented protocol.
  • UDP (User Datagram Protocol): Connectionless, no guarantee of message delivery.
  • HTTP (Hypertext Transfer Protocol): Application protocol for distributed, collaborative, hypermedia systems.
    • HTTP Request and Response: Basic units of HTTP communication, where a client sends a request and a server responds.

15. Blocking vs. Non-blocking I/O:

  • Blocking I/O: The process is blocked until data is available.
  • Non-blocking I/O: The process continues even if the data is not available.
  • Thread per Connection Model: Each connection gets a dedicated thread.
  • Thread per Request with Non-blocking I/O Model: Threads are used only when processing a request.
  • Event Loop Model: Single thread handles multiple connections using events.
  • Concurrency vs. Parallelism: Concurrency is multiple tasks starting, running, and completing in overlapping time periods, while parallelism is multiple tasks running at exactly the same time.

16. Data Encoding Formats:

  • Textual Formats (e.g., JSON, XML): Human-readable, more overhead.
  • Binary Formats (e.g., Protobuf, Avro): Efficient, not human-readable.
  • Schema Sharing Options: How schema (structure) of data is shared between sender and receiver.
  • Backward and Forward Compatibility: Ensuring newer systems can read older data and older systems can read newer data, respectively.

17. Message Acknowledgment:

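  • Safe Acknowledgment Modes: The receiver confirms a message only after it has been safely received (or processed), so an unconfirmed message can be redelivered.
  • Unsafe Acknowledgment Modes: The sender doesn't wait for a confirmation. This is faster, but messages can be lost.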

18. Deduplication Cache:

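  • Local vs. External Cache: A cache held in the application's own memory (fast, but private to each instance) vs. a separate cache service reached over the network (shared by all instances, but slower), e.g., Redis or Memcached.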
  • Adding Data to Cache: Can be added explicitly by the application or implicitly by the caching system.
  • Cache Data Eviction: Removing data from cache, can be based on size, time, or other criteria.
  • Expiration vs. Refresh: Time after which data is removed vs. updating data periodically.
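
A deduplication cache combining size-based eviction and time-based expiration might look like the sketch below. The class name, parameters, and the choice to pass `now` explicitly (rather than read a clock) are assumptions made to keep the example deterministic.

```python
from collections import OrderedDict


class DedupCache:
    """Remembers recently seen message IDs, bounded by size and age."""

    def __init__(self, max_entries, ttl_seconds):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._seen = OrderedDict()  # message_id -> time it was recorded

    def is_duplicate(self, message_id, now):
        seen_at = self._seen.get(message_id)
        if seen_at is not None and now - seen_at < self.ttl_seconds:
            self._seen.move_to_end(message_id)  # mark as recently used
            return True
        # New (or expired) ID: record it, then evict least recently used entries.
        self._seen[message_id] = now
        self._seen.move_to_end(message_id)
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)
        return False
```

Usage: call `is_duplicate(message_id, now)` before processing a message and skip it when the result is `True`.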

19. Metadata Cache:

  • Cache-aside Pattern: Application is responsible for loading cache.
  • Read-through and Write-through Patterns: Cache controls the data load and update.
  • Write-behind (Write-back) Pattern: Data is written to cache and is asynchronously written back to the storage.
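
The cache-aside pattern can be sketched as follows. `CacheAside` and `load_from_store` are hypothetical names, and a plain dictionary stands in for a real cache.

```python
class CacheAside:
    """The application checks the cache first and loads from storage on a miss."""

    def __init__(self, load_from_store):
        self._load = load_from_store  # function: key -> value (the slow path)
        self._cache = {}

    def get(self, key):
        if key in self._cache:
            return self._cache[key]   # cache hit
        value = self._load(key)       # cache miss: go to the backing store
        self._cache[key] = value      # populate the cache for next time
        return value

    def invalidate(self, key):
        # Called after the underlying data changes so stale values are not served.
        self._cache.pop(key, None)
```

In read-through and write-through caches the same load and update logic lives inside the cache layer instead of the application.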

20. Queue:

  • Bounded and Unbounded Queues: Limited vs unlimited size.
  • Circular Buffer (Ring Buffer): Fixed-size data structure that uses a single, continuous region of memory.
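
A bounded queue backed by a ring buffer could be sketched like this. A fixed-length Python list stands in for the contiguous memory region; the class and method names are illustrative.

```python
class RingBuffer:
    """Fixed-capacity FIFO queue that reuses slots by wrapping around."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = [None] * capacity
        self._head = 0   # index of the oldest element
        self._size = 0   # number of stored elements

    def push(self, item):
        """Add an item; return False if the queue is full."""
        if self._size == len(self._items):
            return False
        tail = (self._head + self._size) % len(self._items)
        self._items[tail] = item
        self._size += 1
        return True

    def pop(self):
        """Remove and return the oldest item."""
        if self._size == 0:
            raise IndexError("pop from empty ring buffer")
        item = self._items[self._head]
        self._items[self._head] = None
        self._head = (self._head + 1) % len(self._items)
        self._size -= 1
        return item
```

Returning `False` from `push` is one way to surface the "full queue" condition; the caller can then shed load or apply backpressure.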

21. Full and Empty Queue Problems:

  • Load Shedding: Discarding excess traffic.
  • Rate Limiting: Limiting the rate of requests by users.
  • Failed Requests Handling: Strategies like retries or moving to dead-letter queues.
  • Backpressure: Signal to the producer to slow down.
  • Elastic Scaling: Dynamically adjusting resources based on demand.

22. Start with Something Simple:

  • Single Machine vs. Distributed System Concepts: Many concepts in distributed systems (like caching, load balancing) can be understood with the analogy of single-machine concepts.
  • Interview Tip: For system design interviews, start with a basic solution and then incrementally add complexity based on the requirements and constraints presented.

23. Blocking Queue and Producer-Consumer Pattern:

  • Producer-Consumer Pattern: Separate components that produce and consume data, often operating at different rates.
  • Wait and Notify: Mechanisms to synchronize between producers and consumers.
  • Semaphores: A synchronization primitive used to control access to a shared resource.
  • Blocking Queue Applications: Used in thread pooling, task scheduling, etc.
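
The producer-consumer pattern can be sketched with `queue.Queue` from Python's standard library, which blocks producers when full and consumers when empty. The `run_pipeline` function and the doubling "work" are placeholders chosen for illustration.

```python
import queue
import threading


def run_pipeline(items, capacity=2):
    """Pass items from one producer thread to one consumer thread."""
    q = queue.Queue(maxsize=capacity)  # bounded blocking queue
    results = []
    done = object()                    # sentinel that signals end of input

    def producer():
        for item in items:
            q.put(item)                # blocks while the queue is full
        q.put(done)

    def consumer():
        while True:
            item = q.get()             # blocks while the queue is empty
            if item is done:
                break
            results.append(item * 2)   # stand-in for real processing

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the queue is bounded, a slow consumer automatically slows the producer down, which is a simple form of backpressure.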

24. Thread Pool:

  • Pros: Efficient use of resources, controlled resource usage, quick task start-up.
  • Cons: Complexity, potential resource contention.
  • CPU-bound vs. I/O-bound tasks: CPU-bound tasks spend most of their time using the CPU, while I/O-bound tasks spend time waiting for external operations (like disk or network operations).
  • Graceful Shutdown: Properly terminating threads and releasing resources when shutting down a pool.

25. Big Compute Architecture:

  • Batch Computing Model: Processing high volumes of data where a group of transactions is collected over a period.
  • Embarrassingly Parallel Problems: Problems where little or no effort is needed to separate the problem into tasks to run in parallel.

26. Log:

  • Memory vs. Disk: Storing logs in volatile memory (faster, but not persistent) vs. non-volatile disk storage (slower but persistent).
  • Log Segmentation: Splitting logs into segments to manage them more efficiently.
  • Message Position (Offset): The position of a message within a log.

27. Index:

  • How to Implement an Efficient Index for a Messaging System: Use data structures like B-trees or Hash Indexes to allow for quick lookups and writes.

28. Time Series Data:

Storing and retrieving data points indexed in time order.

  • How to Store and Retrieve: Use databases optimized for time series data, like InfluxDB, to ensure efficient storage and fast queries.

29. Simple Key-Value Database:

  • How to Build: Utilize hash tables for quick lookups. Address issues like collision resolution.
  • Log Compaction: Reducing the size of logs by removing redundant data.
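
One way to picture a log-backed key-value store with compaction is the in-memory sketch below. A Python list stands in for the on-disk log, and all names are hypothetical.

```python
class LogKV:
    """Append-only log plus a hash index pointing at the latest record per key."""

    def __init__(self):
        self.log = []     # list of (key, value) records, oldest first
        self.index = {}   # key -> position of the newest record for that key

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def compact(self):
        """Drop overwritten records, keeping only the latest value per key."""
        live_positions = sorted(self.index.values())
        new_log = [self.log[pos] for pos in live_positions]
        self.index = {key: i for i, (key, _) in enumerate(new_log)}
        self.log = new_log
```

Writes are cheap appends, reads are a single index lookup, and compaction reclaims the space taken by stale versions.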

30. B-tree Index:

  • Usage in Databases and Messaging Systems: B-trees allow for efficient data storage and retrieval in databases and are used to index data in messaging systems.

31. Embedded vs. Remote Database:

  • Embedded Database: Runs within the same address space as the application. E.g., SQLite.
  • Remote Database: Separate from the application and is accessed over a network. E.g., PostgreSQL, MySQL.

32. RocksDB:

A high-performance embedded database for key-value data.

  • Memtable: In-memory data structure where RocksDB writes are inserted.
  • Write-ahead Log (WAL): A log that stores changes to data, ensuring durability.
  • Sorted Strings Table (SSTable): Persistent, ordered immutable map of key-value pairs.

33. LSM-tree vs. B-tree:

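  • Log-Structured Merge-Tree (LSM-tree): Optimized for write-heavy workloads. Writes are buffered in memory and flushed to disk sequentially, whereas a B-tree updates pages in place and generally favors reads.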
  • Write Amplification vs. Read Amplification: Trade-offs between the cost of writing data vs. reading data.

34. Page Cache:

  • Increasing Disk Throughput: Techniques like batching and zero-copy read can help in efficiently reading/writing to disks.

35. Push vs. Pull:

  • Push: Server initiates sending data to the client.
  • Pull: Client requests data from the server.

36. Host Discovery:

  • DNS: The Domain Name System translates domain names to IP addresses.
  • Anycast: Routing strategy where a single destination address has multiple routing paths to two or more endpoint destinations.

37. Service Discovery:

  • Server-side and Client-side Discovery Patterns: Mechanisms where services find each other's network locations.
  • Service Registry: A database containing the network locations of service instances.

38. Peer Discovery:

  • Peer Discovery Options: How nodes in a network find each other.
  • Gossip Protocol: Nodes randomly exchange information, which gradually propagates through the system.

39. Choosing a Network Protocol:

  • TCP, UDP, and HTTP: Decision depends on use case requirements like reliability, speed, and connection state.

40. Video over HTTP:

  • Adaptive Streaming: Adjusting video quality in real-time based on the viewer's network and playback conditions.

41. CDN (Content Delivery Network):

  • How to Use & Benefits: CDNs distribute content across multiple servers to optimize access for users globally. They reduce latency and protect against traffic surges.

  • Point of Presence (POP): Physical locations or access points that connect to the internet and host servers, acting as the local source for cached content.

42. Push and Pull Technologies:

  • Short Polling: Client frequently asks the server for new data.
  • Long Polling: Client sends a request and the server holds it open until new data is available or a timeout expires; the client then immediately sends another request.
  • WebSocket: Protocol that allows two-way communication with a server over a single, long-lived connection.
  • Server-Sent Events: Server pushes updates to the client over an HTTP connection.

43. Push and Pull Technologies in Real-Life Systems:

  • Examples: Chat applications (like Slack) may use WebSockets for real-time messaging, and stock trading apps may use long polling to update stock prices.

44. Large-scale Push Architectures:

  • C10K and C10M Problems: Challenges of handling 10,000 and 10 million concurrent connections respectively.
  • Large-Scale Push Architectures: Techniques like event-driven programming to handle large numbers of simultaneous connections.
  • Problems of Long-Lived Connections: Resource management, detecting stale or "dead" connections, and handling dropped connections.

45. Building Reliable, Scalable, and Fast Systems:

  • Common Problems in Distributed Systems: Network failures, machine failures, performance bottlenecks.
  • System Design Concepts: Caching, load balancing, sharding, replication, and partitioning.
  • Three-Tier Architecture: Presentation, logic, and data tiers.

46. Timeouts:

  • Fast Failures vs. Slow Failures: Quickly failing a request vs. taking a long time before failing.
  • Connection and Request Timeouts: Time limits set for establishing a connection or waiting for a response.

47. Handling Failed Requests:

  • Strategies:
    • Cancel: Terminate the request.
    • Retry: Attempt the request again.
    • Failover: Redirect the request to another system or component.
    • Fallback: Use an alternate backup solution.

48. When to Retry:

  • Idempotency: An operation is idempotent if repeating it has the same effect as performing it once, which makes it safe to retry.
  • AWS API Failures: Not all API failures should be retried. For instance, validation errors shouldn't be retried.

49. How to Retry:

  • Exponential Backoff: Increase the wait time between retries exponentially.
  • Jitter: Adding randomness to retry intervals to spread out the load.
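
Exponential backoff with jitter can be sketched as below. This uses the "full jitter" variant (a uniformly random delay between zero and the exponential ceiling); the function name, defaults, and the injectable `rng` are assumptions for illustration.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)   # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))    # jitter spreads clients apart
    return delays
```

Without jitter, clients that failed at the same moment would all retry at the same moment too, recreating the load spike that caused the failure.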

50. Message Delivery Guarantees:

  • At-most-once: Messages are delivered once or not at all.
  • At-least-once: Messages are delivered one or more times.
  • Exactly-once: Messages are delivered precisely once.

51. Consumer Offsets:

  • Log-Based Messaging Systems: Systems where messages are stored in order and consumers track their position using offsets.
  • Checkpointing: Periodically saving the state of a process.
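
Offset tracking in a log-based system might be sketched as follows. A Python list plays the role of the log, and `OffsetConsumer` is a hypothetical class, not the API of any particular messaging system.

```python
class OffsetConsumer:
    """Reads an append-only log and checkpoints how far it has processed."""

    def __init__(self, log):
        self.log = log
        self.committed = 0  # offset of the next message to process

    def poll(self, max_records=10):
        """Return (offset, message) pairs starting at the committed offset."""
        end = min(len(self.log), self.committed + max_records)
        return [(offset, self.log[offset]) for offset in range(self.committed, end)]

    def commit(self, offset):
        # Checkpoint: everything up to and including `offset` is processed.
        self.committed = offset + 1
```

If the consumer crashes after processing a message but before committing it, the message is read again on restart, which gives at-least-once delivery.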

52. Batching:

  • Pros and Cons: Batching can improve efficiency but may introduce latency.
  • Handling Batch Requests: Ensuring that all items in a batch are processed, even if some fail.

53. Compression:

  • Pros and Cons: Reduces data size but adds computational overhead.
  • Compression Algorithms: Techniques like GZIP, Brotli, and LZ4, each with their trade-offs.

54. Scaling Message Consumption:

  • Single Consumer vs. Multiple Consumers: One consumer can ensure order but is limited in throughput. Multiple consumers can increase throughput but introduce complexity in message ordering.
  • Problems with Multiple Consumers: Ensuring order, handling duplicate processing, etc.

55. Partitioning in Real-Life Systems:

  • Pros and Cons: Helps in horizontal scaling and distributing loads but can introduce complexities.
  • Applications: Databases, messaging systems, etc.

56. Partitioning Strategies:

  • Lookup Strategy: A lookup table (directory) maps each key to its partition.
  • Range Strategy: Based on a range of data values.
  • Hash Strategy: Using a hash function for uniform distribution.
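
The range and hash strategies can be contrasted in a few lines. The function names and the example boundaries are illustrative assumptions.

```python
import bisect
import hashlib


def hash_partition(key, num_partitions):
    """Hash strategy: a stable hash of the key, modulo the partition count."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def range_partition(key, boundaries):
    """Range strategy: `boundaries` is a sorted list of split points."""
    return bisect.bisect_right(boundaries, key)


# With split points "h" and "q": keys < "h" -> 0, "h".."p" -> 1, >= "q" -> 2.
partition = range_partition("kiwi", ["h", "q"])
```

Range partitioning keeps neighboring keys together (good for scans, prone to hot spots), while hashing spreads keys evenly but loses ordering.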

57. Request Routing:

  • Physical and Virtual Shards: Actual data partitions vs logical partitions.
  • Routing Options: Methods for directing a request to the appropriate server or data partition.

58. Rebalancing Partitions:

Ensuring data is evenly distributed across systems, especially when adding or removing nodes.

59. Consistent Hashing:

  • Implementation: Uses a circular keyspace.
  • Advantages and Disadvantages: Reduces the number of re-mapped keys when scaling but can lead to non-uniform data distribution.
  • Virtual Nodes: Multiple hash values for a single node to ensure a more uniform distribution.
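
A consistent hash ring with virtual nodes could be sketched like this. The `HashRing` class, the `node#i` naming of virtual nodes, and the default of 100 virtual nodes per physical node are assumptions, not taken from any specific system.

```python
import bisect
import hashlib


def _hash(value):
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []       # sorted list of (position, node) points
        self._vnodes = vnodes
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node is placed on the ring at many positions.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def lookup(self, key):
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the circle
```

When a node is added, only the keys that now fall on its ring points move to it; every other key stays where it was.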

60. System Overload:

  • Importance: Protecting against system overload ensures continued availability and prevents cascading failures across dependent systems.
  • Protection Mechanisms: Load shedding, rate limiting, backpressure, and queuing.

61. Autoscaling:

  • Scaling Policies:
    • Metric-based: Scale based on metrics like CPU utilization or memory usage.
    • Schedule-based: Scale based on known usage patterns or times.
    • Predictive: Scale based on predicted future traffic patterns.
  • Autoscaling System Design: Involves metrics collection, decision-making algorithms, and mechanisms to add or remove resources.

62. Load Shedding:

  • Implementation: Prioritize and process only the most critical tasks, while dropping or delaying non-essential tasks.
  • Considerations: Decide which requests to drop, how to notify clients, and when to trigger load shedding.

63. Rate Limiting:

  • Purpose: To prevent any single user or client from overloading the system.
  • Implementation: Token buckets, leaky buckets, and fixed window counters are popular methods.
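
A token bucket, one of the methods listed above, can be sketched as follows. The class name and the explicit `now` parameter (instead of reading a clock) are choices made to keep the example deterministic.

```python
class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate` tokens/second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill according to the time elapsed since the last call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject or delay the request
```

In a real service there would typically be one bucket per client or API key, and `now` would come from a monotonic clock.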

64. Synchronous and Asynchronous Clients:

  • Admission Control Systems: Mechanisms to decide whether to process a request immediately or defer it.
  • Blocking I/O vs. Non-Blocking I/O Clients: Synchronous vs. Asynchronous client operations.

65. Circuit Breaker:

  • Finite-State Machine: Mechanism where the circuit breaker maintains states (Closed, Open, Half-Open) to decide whether to process or deny requests.
  • Considerations: Setting thresholds, reset intervals, and monitoring the health of downstream services.
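
The three-state machine can be sketched as below. Names, defaults, and the explicit `now` argument are illustrative assumptions; production libraries usually add half-open probe limits and metrics.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, now):
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = now
            raise
        # Success: reset the counter and close the circuit.
        self.failures = 0
        self.state = "closed"
        return result
```

While the circuit is open, callers fail immediately instead of waiting on a dependency that is known to be unhealthy.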

66. Fail-Fast Design Principle:

  • Problems with Slow Services: Slow services can cause chain reactions leading to larger system outages.
  • Solutions: Implementing timeouts, circuit breakers, and other mechanisms to quickly fail and return control to the caller.

67. Bulkhead:

  • Implementation: Isolating parts of a system so that if one fails, it doesn't bring down the whole system.
  • Applications: Can be applied at various levels, from processes to servers to data centers.

68. Shuffle Sharding:

  • Implementation: Combining sharding and redundancy to reduce the blast radius of failures.
  • Applications: Reducing the impact of failures in multi-tenant systems.
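
One possible way to assign shuffle shards is to derive a deterministic pseudo-random subset of workers from each tenant ID. The sketch below is an assumption about how this could be done, not a description of any provider's implementation.

```python
import hashlib
import random


def shuffle_shard(tenant_id, workers, shard_size):
    """Pick a stable, tenant-specific subset of workers."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    # Seeding with the tenant ID makes the choice repeatable across calls.
    return sorted(random.Random(seed).sample(workers, shard_size))
```

Because different tenants mostly receive different combinations of workers, a tenant that takes down its own shard affects only the few tenants that share the same combination.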

69. Host Discovery:

  • DNS: Translates domain names to IP addresses, enabling clients to discover hosts.
  • Anycast: The same IP address is announced from multiple locations, and network routing (typically BGP) delivers each request to the nearest or best-connected one.

70. Service Discovery:

  • Server-side and Client-side Discovery Patterns: Mechanisms where services find each other's network locations, either with a central server or clients querying a service registry.
  • Service Registry: Stores locations of service instances, allowing for service discovery.

71. Peer Discovery:

  • Peer Discovery Options: Mechanisms like bootstrap lists or centralized directories.
  • Gossip Protocol: Nodes share information about their peers, allowing for decentralized discovery.

72. Choosing a Network Protocol:

  • TCP vs. UDP vs. HTTP: Depending on the requirements (reliability, connection setup, data size), one might be chosen over the other.

73. Network Protocols in Real-Life Systems:

Analyzing different scenarios and challenges to choose the best protocol. For example, streaming services might prefer UDP, while banking transactions would use TCP.

74. Video Over HTTP:

  • Adaptive Streaming: The video quality is dynamically adjusted based on the viewer's network conditions.

75. CDN:

  • Benefits: Faster content delivery by caching content closer to the end-users and reducing origin server load.
  • How It Works: Content is distributed and stored across a network of servers. When a user requests content, it's served from the nearest cached server.

76. Push and Pull Technologies in Real-Life Systems:

Real-world applications like social media updates or live sports updates may use push technologies like WebSockets, while email clients might use pull technologies to check for new mails.

77. Large-Scale Push Architectures:

Design considerations for building systems that can handle millions of simultaneous connections.

78. Timeouts:

  • Importance: Ensure that a system doesn't hang indefinitely.
  • Types:
    • Connection Timeouts: Time a task will wait while attempting to establish a connection.
    • Read/Write Timeouts: Time a task will wait for a read or write operation to complete.
  • Handling: Opt for graceful degradation of service, provide feedback to users, and use retries judiciously.

79. Handling Failed Requests:

  • Logging: Essential for diagnostics. What failed, where, and why?
  • Retry Logic: Implementing algorithms like exponential backoff.
  • Notifications: Inform stakeholders or trigger systems about persistent failures.

80. When to Retry:

  • Transient vs Permanent Failures: Temporary glitches (network blip) vs. consistent failures (invalid credentials).
  • Safety: Ensuring retries don't exacerbate the problem.

81. How to Retry:

  • Immediate vs Delayed: Waiting between retry attempts can be crucial.
  • Exponential Backoff: Increasingly long waits between retries.
  • Jitter: Introducing randomness to retry intervals.

82. Message Delivery Guarantees:

  • Importance: Ensuring data isn't lost or seen multiple times.
  • Strategies: Acknowledgments, transaction logs, checkpoints.

83. Consumer Offsets:

  • Checkpointing: Storing position in a stream or log.
  • Benefits: Allows consumers to pick up where they left off after failures.

84. Batching:

  • Benefits: Can significantly reduce overheads and increase throughput.
  • Challenges: Introducing latency, complexity in error handling.

85. Compression:

  • Trade-offs: CPU time vs. bandwidth.
  • Use Cases: Important in bandwidth-constrained environments or for storage savings.

86. Scaling Message Consumption:

  • Parallelism: Distributing load across multiple consumers.
  • Ordering: Ensuring that the sequence is maintained, if necessary.

87. Partitioning in Real-Life Systems:

  • Consistency vs Availability: Deciding trade-offs based on system requirements.

88. Partitioning Strategies:

  • Consistent Hashing: Minimizing reorganization of data when nodes are added or removed.

89. Request Routing:

  • Load Balancers: Distributing incoming requests to prevent any single resource from being overwhelmed.
  • Content-based Routing: Directing traffic based on the content of the request.

90. Rebalancing Partitions:

  • Importance: Ensuring data and load are evenly distributed across nodes.
  • Strategies: Manual interventions, algorithmic rebalancing.

91. Consistent Hashing:

  • Virtual Nodes: A mechanism to ensure a more uniform distribution of data.

92. System Overload:

  • Monitoring and Alerts: Real-time insights into system health.
  • Throttling: Limiting user or service requests.

93. Autoscaling:

  • Cloud Services: Many cloud providers offer autoscaling services that can automatically add or remove resources based on rules or machine learning.

94. Load Shedding:

  • Dynamic Adjustments: Dropping non-critical tasks when the system is under heavy load.

95. Rate Limiting:

  • API Gateways: Many come with built-in rate limiting features.
  • Client Feedback: Informing clients when they're exceeding limits.

96. Synchronous and Asynchronous Clients:

  • Trade-offs: Responsiveness vs. resource usage.
  • Use Cases: Asynchronous operations for long-running tasks.

97. Circuit Breaker:

  • Resilience Patterns: Preventing system failures from cascading.

98. Fail-Fast Design Principle:

  • Health Checks: Regular checks to ensure services are operational.
  • Fallbacks: Having backup systems or data sources in place.

99. Bulkhead:

  • Isolation: Making sure failures in one part don't impact others.

100. Shuffle Sharding:

  • Resilience: Improving system reliability by reducing the impact radius of failures.