# 01. Generic System Design NFR Concepts
# 02. NFR fulfillment specific to BT (Imp)
# 03. System Design Questions specific to BT (Imp)
# 04. TRADE-OFFs in Distributed System Design
# 05. PROS & CONS of Microservices
# 06. CHALLENGES in Distributed System Design
# 07. BEST PRACTICES for designing microservices
# 08. Scaling the Stateful Services
# 09. Scaling communication b/w microservices
# 10. SD Questions for Senior Roles
# 11. Implement Rate limiter, Consistent Hashing
# 12. Addendum ( Building Blocks Concepts )

I.) Generic System Design NFR Concepts:
+++++++++++++++++++++++++++++++++
Discussing Non-Functional Requirements (NFRs) is crucial to demonstrating your understanding of how a system performs under real-world conditions. Here are some key NFRs to consider discussing during a system design interview: 10/20/24
1. Scalability
Definition: The ability of the system to handle increasing loads or to be expanded to accommodate growth.
Types:
Discussion Points:
Question: How would you scale your system to handle millions of users?
Answer: I would use horizontal scaling to add more instances of services across multiple servers to distribute the load. For example, deploying the services on Kubernetes or AWS EC2 Auto Scaling groups ensures that as the traffic grows, more instances are automatically added. I would also implement load balancing using tools like NGINX, AWS ELB, or Google Cloud Load Balancer to distribute incoming requests efficiently across the available instances.
Question: How does your system handle sudden traffic spikes (e.g., during a flash sale)?
Answer: To handle sudden traffic spikes, I would implement auto-scaling based on CPU or request load metrics. Additionally, I would use a queueing mechanism (like RabbitMQ or AWS SQS) to buffer requests if the system gets overwhelmed. For read-heavy workloads, caching layers like Redis or Memcached can offload pressure from the database.
Discuss load balancers, sharding, and replication.
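As a rough illustration of the load-balancing point above, here is a minimal round-robin selector in Go. It is only a sketch: the type name `RoundRobin`, the constructor, and the backend addresses are made up for this example, and a real load balancer would also track backend health.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RoundRobin hands out backend addresses in rotation.
// It assumes the backend list is non-empty and never changes.
type RoundRobin struct {
	next     uint64 // kept first so atomic access stays 64-bit aligned
	backends []string
}

// NewRoundRobin builds a selector over a fixed list of backends.
func NewRoundRobin(backends []string) *RoundRobin {
	return &RoundRobin{backends: backends}
}

// Next returns the next backend in rotation; safe for concurrent callers.
func (r *RoundRobin) Next() string {
	n := atomic.AddUint64(&r.next, 1) - 1
	return r.backends[n%uint64(len(r.backends))]
}

func main() {
	// Addresses are hypothetical.
	lb := NewRoundRobin([]string{"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"})
	for i := 0; i < 4; i++ {
		fmt.Println(lb.Next())
	}
}
```

The fourth call wraps around to the first backend, which is the whole idea: every instance receives an equal share of requests regardless of how long each request takes.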
2. Performance
Definition: The responsiveness and speed of the system under load.
Key Metrics:
Discussion Points:
Question: How would you minimize latency (e.g., through caching, using Content Delivery Networks)?
Answer: I would minimize latency by implementing caching at different layers:
* Database caching using in-memory databases like Redis to cache frequent queries.
* CDN (Content Delivery Network) for static assets (e.g., images, JavaScript, CSS) to reduce latency by serving content from the nearest edge server.
* Application-level caching to cache API responses for frequently requested data.
Additionally, I would optimize database queries using indexes and minimize network latency by colocating services in the same region.
Question: How does the system handle high throughput (e.g., batch processing, message queues)?
Answer: To handle high throughput, I would implement message queues like Kafka or RabbitMQ to distribute tasks across multiple consumers asynchronously. For example, user-facing requests can be processed in real-time, while background jobs (such as data processing or notifications) are offloaded to the queue. For batch operations, I would use a distributed framework like Apache Spark for efficient processing of large datasets.
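The queue-plus-consumers pattern described above can be sketched in Go with a buffered channel standing in for the message broker. This is an in-process simplification, not Kafka or RabbitMQ; the function name `ProcessAll` and its parameters are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// ProcessAll pushes jobs onto a buffered channel (the stand-in queue),
// lets `workers` goroutines consume them concurrently, and returns the
// sum of the handler results.
func ProcessAll(jobs []int, workers int, handle func(int) int) int {
	queue := make(chan int, len(jobs))
	results := make(chan int, len(jobs))

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range queue { // each worker pulls until the queue is closed
				results <- handle(j)
			}
		}()
	}

	for _, j := range jobs {
		queue <- j
	}
	close(queue) // signal that no more jobs are coming
	wg.Wait()
	close(results)

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	square := func(n int) int { return n * n }
	fmt.Println(ProcessAll([]int{1, 2, 3, 4}, 3, square))
}
```

Producers never wait on a specific consumer, so adding consumers raises throughput without touching the producer code; a real broker adds durability and delivery guarantees on top of this shape.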
3. Reliability
Definition: The system's ability to perform consistently and correctly over time without failure.
Key Concepts:
Discussion Points:
Question: How does your system handle failures in individual components (e.g., using failover mechanisms, backups)?
Answer: I would design the system using redundant components so that if one instance of a service fails, another can take over without disrupting the user experience. This includes database replication (e.g., using master-slave or multi-primary configurations) and load balancers for service redundancy. In case of catastrophic failures, I would rely on automated backups of critical data, with replication across multiple data centers.
Question: How would you ensure reliability through redundancy (e.g., data replication, multi-region deployment)?
Answer: I would ensure reliability by deploying services across multiple regions using geo-replication. This would allow failover to another region if the primary region goes down. Additionally, I would use database replication strategies like multi-master replication for critical databases, so data is replicated in real-time across regions.
4. Availability
Definition: The percentage of time the system is operational and accessible.
Key Concepts:
Discussion Points:
Question: How would you design for high availability (e.g., using load balancing, geo-redundancy)?
Answer: For high availability, I would use geo-redundant architectures where services are replicated across multiple data centers or regions. A load balancer (e.g., AWS Elastic Load Balancer or NGINX) would route traffic to healthy instances. I would implement auto-scaling to ensure the system can handle fluctuating loads and leverage failover strategies to route traffic to backup instances if the primary fails.
Question: What happens if a particular data center or region goes down?
Answer: If a data center or region goes down, my system would trigger a failover mechanism to route traffic to services running in other regions or data centers. This would be facilitated by DNS-based failover such as AWS Route 53 health checks, or by a global load balancer, which can detect outages and reroute traffic. Multi-region replication would keep an up-to-date copy of the data in the surviving region, so downtime would be minimal; with asynchronous replication, the most recent writes may be lost or delayed.
5. Fault Tolerance
Definition: The system’s ability to continue functioning even when some components fail.
Discussion Points:
Question: How would you ensure that critical services can still operate if other services fail (e.g., using retries, circuit breakers like Resilience4j or Hystrix)?
Answer: I would implement circuit breakers using libraries like Resilience4j or Hystrix to prevent cascading failures when dependent services fail. For example, if a downstream service becomes unresponsive, the circuit breaker would trip and prevent further requests from overloading the failing service. I would also implement retries with exponential backoff for transient failures.
Question: How do you plan for data loss prevention and service degradation?
Answer: For data loss prevention, I would use regular backups and replication across multiple data centers. For service degradation, I would implement graceful degradation by reducing functionality (e.g., disabling non-essential features) when the system is under heavy load, while ensuring core services remain available.
6. Consistency
Definition: The accuracy of data across the system, ensuring all users see the same data at the same time.
Key Concepts:
Discussion Points:
Question: What consistency model would you use (e.g., eventual consistency in a NoSQL database like Cassandra)?
Answer: In cases where strong consistency is not required (e.g., social media feed updates), I would use eventual consistency with databases like Cassandra. For scenarios requiring strong consistency (e.g., financial transactions), I would rely on databases with ACID properties such as PostgreSQL or MySQL in a replicated setup.
Question: How would you handle consistency in distributed systems (e.g., distributed databases like MongoDB or DynamoDB)?
Answer: To handle consistency in distributed systems, I would use techniques like two-phase commit for transactions that span multiple services or databases. For high-availability systems, I would implement conflict resolution strategies in eventual consistency models (e.g., last write wins or vector clocks).
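Vector clocks, mentioned above as a conflict-resolution aid, can be sketched in a few lines of Go. This is a minimal illustration under simple assumptions (node IDs are strings, counters are ints); the names `VClock`, `Compare`, and `Merge` are invented for the example.

```go
package main

import "fmt"

// VClock maps a node ID to the number of events seen from that node.
type VClock map[string]int

// Order describes how two clocks relate.
type Order int

const (
	Equal Order = iota
	Before
	After
	Concurrent
)

// Compare reports whether a happened before b, after b, is equal to b,
// or is concurrent with b (a conflict that needs resolution).
func Compare(a, b VClock) Order {
	aBehind, bBehind := false, false
	for n, av := range a {
		if av < b[n] {
			aBehind = true
		} else if av > b[n] {
			bBehind = true
		}
	}
	for n, bv := range b { // catches nodes that appear only in b
		if bv > a[n] {
			aBehind = true
		}
	}
	switch {
	case aBehind && bBehind:
		return Concurrent
	case aBehind:
		return Before
	case bBehind:
		return After
	default:
		return Equal
	}
}

// Merge returns the element-wise maximum of two clocks.
func Merge(a, b VClock) VClock {
	out := VClock{}
	for n, v := range a {
		out[n] = v
	}
	for n, v := range b {
		if v > out[n] {
			out[n] = v
		}
	}
	return out
}

func main() {
	a := VClock{"n1": 2, "n2": 1}
	b := VClock{"n1": 1, "n2": 2}
	fmt.Println(Compare(a, b) == Concurrent, Merge(a, b))
}
```

Unlike last-write-wins, a `Concurrent` result tells the application that neither write supersedes the other, so it can merge them or ask the user instead of silently dropping one.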
7. Security
Definition: Measures to protect the system and its data from unauthorized access, attacks, or breaches.
Key Concepts:
Discussion Points:
Question: How do you secure sensitive data (e.g., using encryption, TLS/SSL, token-based authentication like OAuth)?
Answer: I would secure sensitive data using:
1. Encryption at Rest: All sensitive data (such as user credentials and PII) would be encrypted using AES-256.
2. Encryption in Transit: I would enforce TLS/SSL to secure communication between microservices and between clients and services.
3. Token-based Authentication: I would use OAuth 2.0 (with OpenID Connect for user authentication) and JWTs (JSON Web Tokens) for stateless, secure authorization between services.
4. Access Control: I would implement Role-Based Access Control (RBAC) to enforce strict permissions on who can access or modify certain resources.
Question: What measures would you take to prevent DDoS attacks, data breaches, and insider threats?
Answer: To mitigate DDoS attacks, I would implement rate limiting at the API gateway level and use CDNs with built-in DDoS protection. To prevent data breaches, I would enforce strict encryption policies and use firewalls to block malicious traffic. For insider threats, I would use auditing and monitoring tools to log and detect suspicious activities, combined with least privilege access policies.

8. Latency
Definition: The delay between a request and the corresponding response.
Discussion Points:
Question: How would you reduce latency in a high-traffic system?
Answer: To reduce latency in a high-traffic system, I would implement:
1. Caching: Use in-memory caches like Redis or Memcached to store frequently accessed data, reducing the need to repeatedly query databases.
2. CDN (Content Delivery Network): Serve static content (e.g., images, CSS, JavaScript) through a CDN like Cloudflare or AWS CloudFront to deliver assets from locations geographically closer to users.
3. Database Optimization: Ensure efficient query execution by using indexes on frequently accessed tables and optimizing SQL queries. For NoSQL databases, use partitioning/sharding to distribute load.
4. Asynchronous Processing: Offload non-critical tasks (e.g., sending emails, logging) to background jobs via message queues like RabbitMQ to free up resources for time-sensitive requests.
Question: Discuss caching strategies (e.g., Redis, Memcached) and database query optimization.
Answer: For caching strategies, I would:
1. Application-level Caching: Use Redis or Memcached to cache API responses or session data for quick access without hitting the database.
2. Database Query Caching: Use write-through caching, where data is written to both the cache and the database on every write, to ensure the cache stays up-to-date. Alternatively, implement read-through caching, which loads data from the database only on a cache miss.
3. Query Optimization: Optimize queries by using database indexes, reducing the number of JOINs, and avoiding SELECT * queries. Additionally, leverage read replicas to offload read-heavy workloads.
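The read-through and write-through policies described above can be sketched together in Go. Plain maps stand in for both Redis and the database here, and the names `MapDB`, `Cache`, and `Misses` are assumptions made for the illustration.

```go
package main

import "fmt"

// MapDB is an in-memory stand-in for the database.
type MapDB map[string]string

// Cache sits in front of the database.
type Cache struct {
	db     MapDB
	data   map[string]string
	Misses int // number of reads that had to go to the database
}

func NewCache(db MapDB) *Cache {
	return &Cache{db: db, data: map[string]string{}}
}

// Get is read-through: on a miss it loads from the database and
// populates the cache before returning.
func (c *Cache) Get(key string) (string, bool) {
	if v, ok := c.data[key]; ok {
		return v, true
	}
	c.Misses++
	v, ok := c.db[key]
	if ok {
		c.data[key] = v
	}
	return v, ok
}

// Set is write-through: the database and the cache are updated together.
func (c *Cache) Set(key, value string) {
	c.db[key] = value
	c.data[key] = value
}

func main() {
	cache := NewCache(MapDB{"user:1": "alice"})
	cache.Get("user:1") // miss, loaded from the database
	cache.Get("user:1") // hit
	cache.Set("user:2", "bob")
	cache.Get("user:2") // hit, thanks to write-through
	fmt.Println("database reads:", cache.Misses)
}
```

Write-through trades slightly slower writes for reads that never see stale data from this cache; a production version would also need TTLs and eviction.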
9. Disaster Recovery
Definition: Strategies to restore functionality after a catastrophic event (e.g., data center failure, major security breach).
Discussion Points:
Question: What disaster recovery plans would you put in place (e.g., backup strategies, multi-region deployments)?
Answer: I would implement the following disaster recovery plans:
1. Multi-region Deployment: Deploy critical services and data across multiple geographic regions to ensure that even if one region experiences a failure, another region can take over with minimal downtime.
2. Automated Backups: Schedule regular automated backups of databases and critical stateful services, with data replication across multiple regions. For example, Amazon RDS Multi-AZ deployments protect against the loss of an Availability Zone, while cross-region read replicas or a globally distributed database such as Google Cloud Spanner cover the loss of a whole region.
3. Recovery Time Objective (RTO) and Recovery Point Objective (RPO): Establish RTO and RPO goals, ensuring that systems can be restored within acceptable time frames and with minimal data loss.
4. Failover Mechanisms: Use global load balancers and DNS failover solutions like AWS Route 53 to automatically route traffic to backup regions or instances during an outage.
Question: How would you handle failover to ensure minimal downtime?
Answer: To handle failover, I would implement active-active or active-passive failover mechanisms across multiple regions. A global routing layer such as AWS Route 53 (DNS failover) or Google Cloud Load Balancing would detect if a region goes down and reroute traffic to the next available healthy region. The system would maintain replicated databases in each region, using synchronous replication where strong consistency is required or asynchronous replication where eventual consistency is acceptable, to keep data loss during failover to a minimum.
Thus, by addressing these NFRs comprehensively, you show that you are not only focused on building functional systems but also on their performance, resilience, security, and efficiency, which are key to a well-designed system in real-world scenarios.
II.) NFR fulfillment @ BT: So far we have achieved a few of the NFRs, as below (PERSONAL NOTES): 10/20/24:
Security :
Performance Improvement ( microservice level)
Fault Tolerance :
Scaling :
Availability :
FIAT ON-RAMP: We can highlight relevant Work, Scope, Impact, Achievements and Challenges.
The system is designed with a clear API structure and well-defined data models, addressing both functional and non-functional requirements. Below are the key NF considerations:
Asynchronous Communication: The system supports asynchronous interactions using both gRPC and REST APIs. Responses are returned immediately after submission to the backend, with the backend processing independently, integrating with Binance.com and the Tron blockchain. Services are designed to be stateless, promoting scalability and ease of maintenance.
Security: Security is a critical focus. Messages are signed using the ECDSA algorithm, which allows entities to prove their identity and ensure data integrity without the need for a central authority. Signed messages are then sent over HTTPS/TLS (especially for the bridge-wallet REST server in Universal Wallet) or WSS (the WebSocket server in the Decentralized Bidding Project) for encrypted communication. HTTPS/TLS termination occurs at the Application Load Balancer (ALB), ensuring encrypted message exchanges. Mutual TLS (mTLS) with certificate-based authentication is implemented for secure inter-service communication (bridge-wallet to exchange/ledger). Additional layers of security include rate limiting (leaky bucket algorithm), IP whitelisting, and a fraud detection model to identify suspicious or outlier transactions.
Scaling: Auto-scaling is achieved via Kubernetes, especially for stateless services like the ledger and exchange, which handle fluctuating workloads during peak hours. Traffic is routed through the load balancer (Route 53). A master-slave architecture is used for databases, with write operations directed to the master database. Services are stateless, and gRPC keeps interactions between them non-blocking, efficient, and scalable.
Monitoring and Alerting: The system is continuously monitored using Prometheus and Grafana, with integrated Slack alerts for real-time notifications. In case of service failure, Slack alerts are triggered immediately. Kubernetes handles automatic rescheduling of pods and launches new nodes if necessary, ensuring service uptime through readiness and health probes.
Performance: Recent transaction hashes are stored in Redis, using payment IDs as keys. Write-through caching policies are applied, with session IDs set to a 6-hour TTL, transaction data to 24 hours, and other long-term data to 1 year TTL.
Data Consistency: To maintain consistency, a payment ID (in the Fiat On-Ramp project) or a Withdrawal_hash/Deposit_hash (in the Universal Wallet project) is used as an idempotency key at the application level. A reconciler job ensures proper transaction processing. MySQL is used at the database level, providing ACID properties for consistent data handling.
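As a hedged illustration of using a payment ID as an idempotency key, the Go sketch below credits a balance at most once per ID. It is not the project's actual code: the `Ledger` type and its methods are hypothetical, and an in-memory map stands in for what would normally be a unique-key constraint in MySQL.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger credits each payment at most once, keyed by payment ID.
type Ledger struct {
	mu        sync.Mutex
	processed map[string]bool // idempotency keys already applied
	balance   int64
}

func NewLedger() *Ledger {
	return &Ledger{processed: map[string]bool{}}
}

// Credit applies the amount only if paymentID has not been seen before.
// It returns false for a duplicate, so retries and replays are harmless.
func (l *Ledger) Credit(paymentID string, amount int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.processed[paymentID] {
		return false
	}
	l.processed[paymentID] = true
	l.balance += amount
	return true
}

// Balance returns the current balance.
func (l *Ledger) Balance() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.balance
}

func main() {
	ledger := NewLedger()
	fmt.Println(ledger.Credit("pay-123", 100)) // first delivery is applied
	fmt.Println(ledger.Credit("pay-123", 100)) // retry of the same payment is ignored
	fmt.Println(ledger.Balance())
}
```

Because the check and the update happen under one lock (or, in a database, inside one transaction), a reconciler can safely re-submit any payment it is unsure about.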
Fault Tolerance: A backoff counter is implemented for transient failure handling, with reconciling threads in the wallet service retrying failed transactions. Readiness and health probes are used to monitor service health and uptime.
Maintainability: A CI/CD pipeline is configured between GitHub and the build system, allowing for automatic build triggers, improving the development workflow.
Integration: The wallet service (ds_wallet) is integrated with multiple backend services (gRPC server for the ledger, REST server, Horus Gateway, reconciler) using Protocol Buffers (protobuf) and REST APIs.
Testing: Comprehensive testing includes unit tests, API tests, and ensuring code coverage, helping to maintain a robust codebase.
Future Improvements: Consider transitioning to webhooks for transaction status updates to eliminate the need for polling, improving system efficiency.
Challenges: One significant challenge was delivering tokens to users' off-chain balances, as Binance lacks direct access to the internal ledger. This required developing an exchange-service to transfer tokens from the hardware wallet to users' accounts.
Impact: The system has streamlined in-app purchases for 20 million users, resulting in a trading volume increase to 100K transactions per month. Binance has also seen significant user-base growth as a result.
The same NF considerations apply to other projects as well:
FYI: MOST Important/frequent Interview Questions Asked :
+++++++++++++++++++++++++++++++++++++++++++
Refer 1st section here : https://leetcode.com/discuss/study-guide/5935001/Leadership-Principles-(SRE)
III.) QUESTIONS BASED ON MY PROJECTS : 10/20/24: [ MUST READ ]
+++++++++++++++++++++++++++++++++++++++++++++++++++++
Refer Section-3 at https://leetcode.com/discuss/study-guide/5935001/Leadership-Principles-(SRE)
IV.) TRADE-OFFS: 10/20/24
+++++++++++++++++++++
1. Security Trade-offs:
Choice: Use of ECDSA for signing messages and mTLS for service-to-service communication.
Choice: Use of OAuth with JWTs for authentication.
Choice: Rate-limiting (token bucket) and DDoS protection via CDN.
2. Performance Optimization Trade-offs:
Choice: Redis caching for frequently accessed data.
Choice: Asynchronous processing for resource-intensive tasks.
Choice: Database optimization with RDS Master-Slave architecture.
3. Fault Tolerance Trade-offs:
Choice: Kubernetes for auto-rescheduling of pods.
Choice: Retries with exponential backoff for transient failures.
4. Scaling Decisions and Trade-offs:
Choice: Auto-scaling using Kubernetes.
Choice: Load balancing with round-robin.
Choice: Database sharding versus RDS Master-Slave architecture.
V.) PROS & CONS OF MICROSERVICES ( vs Monolithic Architecture) : 10/22/24
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Pros of Microservices
Scalability
Pro: Each microservice can be scaled independently based on demand. This allows for more efficient resource utilization, as only the necessary components need to be scaled, reducing infrastructure costs and improving performance for critical services.
Example: A high-traffic login service can be scaled up independently without having to scale the entire application.
Independent Development and Deployment
Pro: Microservices enable teams to work on different services independently, allowing faster development cycles and more frequent deployments. This supports DevOps practices like Continuous Integration/Continuous Deployment (CI/CD).
Example: A team working on the payment service can deploy updates without affecting the user profile or order services.
Fault Isolation
Pro: Failure in one service is isolated from others, preventing the entire system from crashing. This isolation improves the resilience and fault tolerance of the overall application.
Example: If the recommendation service goes down, it won’t affect the core functionality of the product checkout service.
Technology Flexibility
Pro: Different microservices can use different programming languages, frameworks, and databases based on their specific needs, allowing teams to choose the best tool for each task (polyglot programming, and polyglot persistence for data stores).
Example: A real-time analytics service can use Python and a NoSQL database, while the main application can use Java and a relational database.
Easier to Understand and Maintain
Pro: Microservices are smaller and focused on a single responsibility, making them easier to understand, maintain, and debug. This modular approach reduces complexity compared to a monolithic architecture.
Example: Understanding the behavior of a microservice that only handles user authentication is simpler than understanding a monolith handling user authentication, order management, and payments together.
Improved Security
Pro: Security can be applied more granularly to each microservice. Sensitive services (e.g., payment processing) can be secured independently with stricter policies.
Example: The authentication service can use stronger encryption and authorization mechanisms without affecting the entire system.
Faster Time to Market
Pro: Teams working on independent microservices can deliver features more rapidly since they can develop, test, and deploy each service separately without waiting for the entire system to be ready.
Example: Launching a new feature in the billing service doesn’t require the approval of changes in the product catalog or user profile services.
Cons of Microservices
Increased Complexity
Con: While each service may be simple, the overall system architecture becomes more complex as the number of microservices grows. Managing the interactions between many services (including dependencies, data consistency, and communication) can be challenging.
Example: Orchestrating communication between dozens or even hundreds of microservices, managing distributed transactions, and maintaining consistent data become complex tasks.
Operational Overhead
Con: Microservices require extensive infrastructure, monitoring, and management. Managing multiple services, including deployment, scaling, and troubleshooting, requires sophisticated tooling and operational expertise.
Example: Maintaining separate CI/CD pipelines, monitoring, and logging for each microservice increases the complexity of operations.
Distributed Systems Challenges
Con: Microservices rely on network communication, which introduces latency and the possibility of network failures. Handling issues like service discovery, network timeouts, retries, and load balancing requires careful design.
Example: A slow or unreliable network connection can degrade the performance of a microservices-based system, causing communication issues between services.
Data Consistency
Con: Achieving data consistency across distributed microservices is difficult. Since each microservice manages its own database, coordinating updates and ensuring consistency across services becomes complex, especially in transactional workflows.
Example: In an e-commerce system, updating the inventory in one service and processing a payment in another service may lead to inconsistencies if not properly managed.
Cross-Cutting Concerns
Con: Implementing cross-cutting concerns like logging, monitoring, security, and error handling across all services can be challenging. It often requires building or integrating shared libraries and tools.
Example: Applying a unified logging mechanism across 20+ microservices requires additional coordination and infrastructure.
Testing Complexity
Con: Testing a microservices architecture is more complicated than testing a monolithic application because you need to test not only individual services but also their interactions. Integration and end-to-end testing require coordination across multiple services.
Example: A change in one service might break functionality in another, making it harder to create comprehensive tests that cover all possible interactions between services.
Inter-Service Communication Overhead
Con: Microservices communicate over the network using protocols such as HTTP/REST, gRPC, or message queues. This introduces overhead due to serialization, deserialization, and network latency compared to direct in-process communication in monolithic systems.
Example: A call from the user service to the order service may take longer due to network overhead, even if both services are fast individually.
Deployment and Versioning Issues
Con: Managing deployments of multiple microservices becomes complex, especially when dealing with interdependent services. Keeping track of service versions and backward compatibility is essential but challenging.
Example: If the user service requires changes that break compatibility with the order service, both services need to be deployed in sync, complicating the deployment process.
Latency and Performance Overhead
Con: Since microservices often communicate over the network, the round-trip time (RTT) between services can introduce latency. Additionally, serialization/deserialization and API calls can reduce overall performance.
Example: In a monolith, a function call is quick, whereas in microservices, an API call between two services might incur additional latency.
Summary
Pros:
Cons:
VI.) CHALLENGES IN DISTRIBUTED SYSTEM DESIGN: 10/30/24
+++++++++++++++++++++++++++++++++++++++++++++++
In a distributed systems design interview, discussing the challenges shows a solid understanding of how these systems operate and the complexities involved. Here are some critical challenges and considerations to discuss:
Network Reliability and Latency:
Network issues (like packet loss, latency, or timeouts) can disrupt communication between nodes. Discuss the impact of unreliable networks and the importance of designing for eventual consistency, retries, and handling split-brain scenarios.
Data Consistency:
Ensuring data consistency across distributed nodes is challenging, especially with systems prioritizing availability and partition tolerance (as per the CAP theorem). You could discuss strategies like quorum-based replication, eventual consistency, and multi-version concurrency control (MVCC).
Scalability and Load Balancing:
Achieving horizontal scalability while ensuring the system can handle increased loads effectively. Load balancers, sharding, and partitioning strategies help, but each introduces complexity in handling failovers and rebalancing.
Fault Tolerance and Resilience:
Handling node failures gracefully without disrupting overall system functionality is critical. Techniques like replication, failover mechanisms, and using distributed consensus algorithms (e.g., Raft or Paxos) can improve fault tolerance.
Concurrency and Synchronization:
Concurrent access to shared resources across multiple nodes can lead to race conditions and data inconsistency. Techniques like distributed locking, vector clocks, or Lamport timestamps can help maintain order and consistency.
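Lamport timestamps are the simplest of the ordering techniques named above. The following Go sketch shows the two rules (increment on a local event, take the maximum plus one on receive); the type and method names are assumptions for the illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// LamportClock is a logical clock that orders events across nodes
// without relying on synchronized wall-clock time.
type LamportClock struct {
	mu   sync.Mutex
	time uint64
}

// Tick records a local event (including sending a message) and returns
// the timestamp to attach to it.
func (c *LamportClock) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.time++
	return c.time
}

// Observe is called when a message stamped `remote` arrives: the clock
// jumps past the remote timestamp so the receive is ordered after the send.
func (c *LamportClock) Observe(remote uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if remote > c.time {
		c.time = remote
	}
	c.time++
	return c.time
}

func main() {
	var nodeA, nodeB LamportClock
	nodeA.Tick()
	sent := nodeA.Tick()          // A sends a message with this timestamp
	received := nodeB.Observe(sent) // B receives it
	fmt.Println(sent, received)
}
```

The guarantee is one-directional: if one event causally precedes another, its timestamp is smaller, but two timestamps alone cannot prove causality, which is the gap vector clocks fill.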
Security and Data Privacy:
Distributed systems often involve data exchange across networks, exposing them to risks like man-in-the-middle attacks or unauthorized access. Strategies here include encryption, secure authentication, token-based authorization, and compartmentalizing access based on need-to-know principles.
Monitoring and Observability:
Ensuring visibility into system health is complex due to the distributed nature of components. Discuss using tools for centralized logging, tracing (e.g., OpenTelemetry), and monitoring (e.g., Prometheus, Grafana) to detect and troubleshoot issues in real-time.
Data Sharding and Partitioning:
Sharding and partitioning data effectively across nodes requires careful planning. This includes deciding partition keys, handling rebalancing when scaling, and managing the increased complexity in querying and maintaining data consistency across partitions.
Distributed Consensus and Coordination:
Coordination between nodes is essential, especially for tasks like leader election or maintaining a consistent state. Discuss consensus algorithms like Paxos or Raft (etcd uses Raft; ZooKeeper uses the related ZAB protocol) and their challenges in handling network partitions and latency.
Testing and Deployment:
Testing distributed systems is challenging due to unpredictable failures and inter-node communication issues. It’s helpful to discuss strategies like chaos engineering, integration tests with simulated network failures, and canary deployments to validate changes gradually.
Cost Management:
Operating and scaling distributed systems can become expensive, especially with cloud infrastructure. Optimizing resource usage, data transfer costs, and storage while maintaining performance can help manage expenses.
Each of these challenges can open the door to discussing specific solutions and trade-offs, which would demonstrate a well-rounded understanding of distributed systems design.
VII.) Best Practices for designing Microservices:
+++++++++++++++++++++++++++++++++++
Designing microservices-based services requires careful planning to ensure scalability, reliability, and maintainability. Here are some best practices that can guide you in developing robust microservices architectures:
By following these best practices, you can build microservices that are scalable, resilient, secure, and easy to manage. This ensures that your system can handle growing demands and complex business needs while maintaining high availability and performance.
VIII.) SCALING THE STATEFUL SERVICES: 10/22/24
++++++++++++++++++++++++++++++++++++++
Scaling stateful services presents unique challenges compared to scaling stateless services because the service instances need to maintain and manage their state across multiple requests or sessions. However, there are several strategies and best practices for scaling stateful services effectively. Below are the common approaches:
Description: The most common way to scale stateful services is to externalize the state by moving it to an external system (like a database, object storage, or distributed caching system) while keeping the service itself stateless.
How it works: The service no longer holds the state internally, allowing multiple instances of the service to work independently and share access to the external state. Databases, key-value stores (e.g., Redis, Memcached), or persistent storage (e.g., S3) can be used to store session or application data.
Benefits: This approach simplifies scaling, as the service becomes effectively stateless and can scale horizontally.
Tools:
Example: Store user session data in Redis or a database.
The service fetches the session data from Redis when required, making the service stateless but still able to handle sessions.
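The externalized-state approach can be sketched in Go as follows. An in-memory store stands in for Redis, and the `SessionStore` interface, `MemoryStore`, and `Server` names are assumptions for the example; the point is only that two service instances holding no state of their own can serve the same session.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionStore is the external state store (Redis or a database in practice).
type SessionStore interface {
	Get(sessionID string) (string, bool)
	Put(sessionID, user string)
}

// MemoryStore is an in-memory stand-in for the external store.
type MemoryStore struct {
	mu   sync.RWMutex
	data map[string]string
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{data: map[string]string{}}
}

func (m *MemoryStore) Get(id string) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	u, ok := m.data[id]
	return u, ok
}

func (m *MemoryStore) Put(id, user string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[id] = user
}

// Server is a stateless service instance: it keeps no session data itself.
type Server struct {
	Name  string
	Store SessionStore
}

// Login creates a session in the shared store.
func (s *Server) Login(sessionID, user string) {
	s.Store.Put(sessionID, user)
}

// Handle serves a request by looking the session up in the shared store.
func (s *Server) Handle(sessionID string) string {
	user, ok := s.Store.Get(sessionID)
	if !ok {
		return s.Name + ": no session"
	}
	return s.Name + ": hello " + user
}

func main() {
	store := NewMemoryStore()
	a := &Server{Name: "instance-a", Store: store}
	b := &Server{Name: "instance-b", Store: store}
	a.Login("sess-1", "alice")
	fmt.Println(b.Handle("sess-1")) // a different instance can serve the same session
}
```

Because any instance can handle any request, the load balancer needs no sticky sessions and instances can be added or removed freely.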
State Replication
Description: Replicate the state across multiple service instances. Each instance maintains a copy of the state so that any instance can handle any request.
How it works: When the state changes, those changes are propagated to all the replicas. This approach allows each instance to handle requests without needing to go to an external state store.
Challenges: It’s crucial to ensure data consistency across replicas, especially under high loads. Replication can introduce latency if the system waits for consistency.
Tools:
Distributed systems with replication (e.g., Apache ZooKeeper, Consul for service coordination)
Master-slave replication in databases
Example: A distributed system that replicates user data across multiple nodes so that any node can handle the user’s requests.
Stateful Microservices with Consistent Hashing
Key Considerations When Scaling Stateful Services:
By following these strategies, you can scale stateful services effectively while maintaining performance and consistency.
IX.) Scale communication between two microservices (gRPC-based) : 10/20/24:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Scaling communication between two gRPC-based microservices involves addressing several aspects, such as network efficiency, load balancing, failure handling, and resource optimization. Here’s a detailed guide to scaling this communication:
Client-Side Load Balancing:
gRPC natively supports client-side load balancing. You can implement a load balancing policy at the gRPC client level to distribute requests across multiple instances of a backend microservice. Common strategies include:
Example in gRPC (with DNS or service discovery):
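// Set up client-side load balancing with the round_robin policy (the dns:/// scheme resolves all backend addresses)
// grpc.WithBalancerName and grpc.WithInsecure are deprecated in grpc-go; the service config and insecure credentials replace them
conn, err := grpc.Dial("dns:///my-grpc-service:50051", grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin": {}}]}`))
For larger-scale systems, consider integrating with service discovery systems (like Consul or Eureka) or Kubernetes' DNS-based load balancing, where the client queries the registry or DNS for available microservice instances. In Kubernetes, a headless Service is what exposes the individual pod addresses to the client.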
Example of a Kubernetes HorizontalPodAutoscaler:
# autoscaling/v2 is the stable HPA API; v2beta2 was removed in Kubernetes 1.26
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-grpc-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-grpc-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
As pods scale up and down, the service discovery or DNS system dynamically updates the list of available instances.
Example of circuit breaker thresholds (field names as used in an Envoy cluster configuration):
circuit_breakers:
  thresholds:
  - max_connections: 1000
    max_pending_requests: 100
    max_requests: 500
This ensures that a failure in one service doesn’t cause failures across the system and provides more predictable performance under load.
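The same idea can also be applied inside the client process. The following is a minimal Go sketch of a circuit breaker, assuming a simple policy: open after a fixed number of consecutive failures, then allow a trial call once a cooldown has passed. The `Breaker` type, its fields, and `errBackend` are hypothetical names for illustration, not part of gRPC or Envoy.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is rejecting calls.
var ErrOpen = errors.New("circuit breaker is open")

// errBackend simulates a failing downstream call in the demo below.
var errBackend = errors.New("backend unavailable")

// Breaker opens after maxFailures consecutive failures and lets a trial
// call through once cooldown has elapsed since it opened.
type Breaker struct {
	mu          sync.Mutex
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open and still cooling down.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrOpen // fail fast instead of piling load on a sick backend
	}

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // open (or re-open) and restart the cooldown
		}
		return err
	}
	b.failures = 0 // a success closes the breaker
	return nil
}

func main() {
	b := NewBreaker(2, time.Second)
	failing := func() error { return errBackend }
	for i := 0; i < 3; i++ {
		fmt.Println(b.Call(failing))
	}
}
```

In this sketch the third call is rejected without reaching the backend. A production breaker would usually also limit how many trial calls run concurrently after the cooldown.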
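// grpc-go has no grpc.WithRetry option; retries are configured through the service config.
// "mypackage.MyService" is a placeholder for the fully qualified service name.
retryConfig := `{"methodConfig": [{"name": [{"service": "mypackage.MyService"}],
  "retryPolicy": {"maxAttempts": 4, "initialBackoff": "0.1s", "maxBackoff": "1s",
    "backoffMultiplier": 2.0, "retryableStatusCodes": ["UNAVAILABLE"]}}]}`
conn, err := grpc.Dial("my-grpc-service:50051",
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithDefaultServiceConfig(retryConfig))
For scenarios involving large data sets or real-time updates (e.g., streaming logs or bidirectional communication), gRPC streaming can be very efficient compared to traditional request/response models. With streaming, the connection is kept open, reducing the overhead of establishing new connections and handling batch requests.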
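Example of bidirectional streaming:
stream, err := client.MyServiceMethod(ctx)
if err != nil {
    log.Fatalf("failed to open stream: %v", err)
}
for {
    // Send one request, then block for the next response (use separate goroutines for fully asynchronous traffic)
    if err := stream.Send(&Request{}); err != nil {
        break
    }
    resp, err := stream.Recv()
    if err != nil { // io.EOF means the server closed the stream cleanly
        break
    }
    log.Printf("received: %v", resp)
}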
To scale streaming services, ensure the services are deployed across multiple instances and use load balancing to distribute streams evenly.
Example with OpenTelemetry:
tracer := otel.Tracer("grpc-client")
ctx, span := tracer.Start(context.Background(), "grpc-call")
defer span.End()
// Making the gRPC call
_, err := client.MyMethod(ctx, &Request{})
Metrics and Logging:
Use Prometheus to track gRPC-specific metrics such as:
Envoy, if used as a proxy, can also expose rich gRPC metrics that you can scrape via Prometheus. Set up alerts for unusual spikes in response times or connection errors to address scaling issues proactively.
Caching:
Use distributed caching (like Redis) between services for frequently accessed or computed data to reduce direct database hits. Redis can be used to cache responses from microservice A that are required by microservice B, thus reducing the load on both services.
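Example of Redis for caching responses (assuming go-redis v8 or later, where commands take a context; Response is a placeholder for the generated response type):
cacheKey := "grpc-response-key"
response := &Response{}
cachedData, err := redisClient.Get(ctx, cacheKey).Bytes()
if err == nil {
    // Cache hit: decode the stored bytes instead of calling the service
    err = proto.Unmarshal(cachedData, response)
} else if err == redis.Nil {
    // Cache miss: call the service and cache the serialized response for 10 minutes
    if response, err = grpcClient.MyMethod(ctx, &Request{}); err == nil {
        if data, mErr := proto.Marshal(response); mErr == nil {
            redisClient.Set(ctx, cacheKey, data, 10*time.Minute)
        }
    }
}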
Message Queuing:
For scenarios where eventual consistency and asynchronous communication are acceptable, you can introduce message queues (e.g., Kafka) between services. This decouples services and allows for better fault tolerance and traffic management.
TLS:
Use TLS to encrypt traffic between gRPC services. This ensures that the communication is secure even when scaling across different environments, such as multiple data centers or cloud regions.
creds, err := credentials.NewClientTLSFromFile("path/to/cert.pem", "")
conn, err := grpc.Dial("my-grpc-service:50051", grpc.WithTransportCredentials(creds))
Mutual TLS (mTLS):
For additional security, especially in zero-trust architectures, implement mTLS. Both the client and server validate each other’s certificates, ensuring mutual trust.
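A minimal Go sketch of the client side of mTLS, using only the standard library. The helper name `newMTLSConfig` is hypothetical, and the sketch assumes the certificate, key, and CA bundle are already loaded as PEM bytes.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// newMTLSConfig builds a client-side tls.Config for mutual TLS.
// certPEM/keyPEM are the client's own certificate and key; caPEM is the
// CA bundle used to verify the server. The function name is hypothetical.
func newMTLSConfig(certPEM, keyPEM, caPEM []byte) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("load client key pair: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in caPEM")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert}, // presented to the server
		RootCAs:      pool,                    // used to verify the server
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	// Placeholder input only demonstrates the error path; real PEM data is needed for a usable config.
	_, err := newMTLSConfig([]byte("not a cert"), []byte("not a key"), nil)
	fmt.Println("expected failure with placeholder input:", err)
}
```

With real PEM data, the resulting config can be wrapped with `credentials.NewTLS(cfg)` and passed to `grpc.WithTransportCredentials`. On the server side, the matching `tls.Config` would set `ClientCAs` and `ClientAuth: tls.RequireAndVerifyClientCert` so that clients without a valid certificate are rejected.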
OAuth 2.0 or JWT:
For user authentication and authorization, implement token-based mechanisms like OAuth 2.0 or JWT (JSON Web Tokens) in gRPC. The client sends tokens with each gRPC call, and the server validates them.
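To make the token check concrete, here is a minimal Go sketch of issuing and verifying an HS256-signed token with the standard library. It is an illustration under simplifying assumptions (a shared secret, only the `sub` and `exp` claims, hypothetical function names); production systems should rely on a vetted JWT library and perform this check in a gRPC server interceptor.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// Claims holds the small subset of JWT claims used in this sketch.
type Claims struct {
	Subject string `json:"sub"`
	Expiry  int64  `json:"exp"` // Unix seconds
}

var enc = base64.RawURLEncoding

// sign returns the base64url HMAC-SHA256 signature of the signing input.
func sign(input string, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(input))
	return enc.EncodeToString(mac.Sum(nil))
}

// IssueToken creates an HS256 token of the form header.payload.signature.
func IssueToken(c Claims, secret []byte) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	header := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	input := header + "." + enc.EncodeToString(payload)
	return input + "." + sign(input, secret), nil
}

// VerifyToken checks the signature first, then the expiry against nowUnix.
func VerifyToken(token string, secret []byte, nowUnix int64) (Claims, error) {
	var c Claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("malformed token")
	}
	want := sign(parts[0]+"."+parts[1], secret)
	if !hmac.Equal([]byte(want), []byte(parts[2])) {
		return c, errors.New("invalid signature")
	}
	payload, err := enc.DecodeString(parts[1])
	if err != nil {
		return c, err
	}
	if err := json.Unmarshal(payload, &c); err != nil {
		return c, err
	}
	if nowUnix >= c.Expiry {
		return c, errors.New("token expired")
	}
	return c, nil
}

func main() {
	secret := []byte("demo-secret") // placeholder; load real keys from a secret manager
	token, _ := IssueToken(Claims{Subject: "user-42", Expiry: time.Now().Add(time.Hour).Unix()}, secret)
	claims, err := VerifyToken(token, secret, time.Now().Unix())
	fmt.Println(claims.Subject, err)
}
```

The signature is verified before the payload is decoded, so untrusted claims are never read from a token that fails the integrity check.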
Additionally:
Summary of gRPC Security Approaches:
X.) SYSTEM DESIGN QUESTIONS FOR SENIOR ROLES : 10/20/24:
++++++++++++++++++++++++++++++++++++++++++++++++++
Question: Describe a time when you had to improve the scalability of a system. What challenges did you face, and how did you overcome them?
Answer: In my current role, I was tasked with improving the scalability of a microservices-based system. One of the main challenges was handling increasing traffic during peak hours without affecting the overall system performance. To address this, I introduced Kubernetes-based auto-scaling. The stateless services were set to scale based on CPU and memory usage. Traffic was distributed using a load balancer, which employed round-robin algorithms. Additionally, I optimized our database operations by introducing Redis for caching frequently accessed data, reducing the load on our primary database. We also considered sharding strategies but instead adopted a master-slave architecture that allowed us to split read and write operations, improving scalability without the immediate need for sharding. The main challenge here was ensuring minimal downtime during these changes, which we achieved through careful planning and gradual rollout.
Question: How do you ensure the reliability of a system that handles critical operations?
Answer: Ensuring reliability in a system handling critical operations requires a combination of proactive monitoring, fault tolerance mechanisms, and solid architectural design. In my current project, we implemented health and readiness probes within Kubernetes to monitor the status of pods and auto-reschedule them when needed. This way, even if individual components fail, the system continues functioning with minimal downtime. We also leveraged multi-AZ (Availability Zones) deployment to ensure high availability, so even if one zone goes down, traffic can be routed to another. Additionally, we built redundancy into our architecture by taking regular snapshots of our database and using Redis backups to ensure quick recovery in case of failure. The challenge was managing these snapshots and backups without affecting system performance, which we addressed by scheduling these operations during non-peak hours.
Question: Can you talk about a situation where you had to improve the performance of a system under heavy load?
Answer: One instance that comes to mind involved optimizing the performance of a high-throughput microservices-based system that was experiencing high latency during peak times. The first step I took was to remove deadlocks by sorting incoming requests and reusing connections, which significantly improved the speed of the API. We also adopted asynchronous processing using a queuing system for batch operations and background jobs, ensuring that real-time requests weren’t blocked by long-running processes. Redis caching was introduced to offload frequent data reads from the database, reducing latency. Another key optimization was implementing multithreading for certain operations, offloading tasks like public address generation to asynchronous threads. This improved throughput without increasing the load on individual services. The main challenge was optimizing these changes while maintaining system stability, which we solved through rigorous load testing before deployment.
Question: How have you handled fault tolerance in a system with high reliability demands?
Answer: Fault tolerance is a critical aspect of any high-reliability system. In my experience, we handled pod/process-level failures using Kubernetes' self-healing capabilities, where pods were automatically rescheduled based on health and readiness probes. However, if an entire service or AWS region failed, there was no fallback mechanism in place. To improve this, we considered geo-redundant architecture but had to settle for scaling across multiple Availability Zones within the same region due to budget constraints. For transient failures, we also implemented retries with exponential backoff to handle temporary issues gracefully. I also integrated graceful degradation into the system, where non-essential features were disabled during heavy loads, ensuring that critical operations remained unaffected. The challenge here was maintaining system functionality without overloading it, which was resolved by setting priority levels for different types of services.
Question: Can you describe a challenging situation where you had to ensure both security and performance in a system?
Answer: One challenging scenario I faced was ensuring security while optimizing system performance for a wallet service that handled financial transactions. Security was critical due to the nature of the transactions, so we implemented mutual TLS (mTLS) for service-to-service communication and OAuth tokens for authentication. Messages were signed using the ECDSA algorithm to ensure data integrity. However, these security measures introduced additional overhead, leading to performance degradation. To balance security and performance, we implemented application-based rate limiting using the leaky bucket algorithm to safeguard against DDoS attacks without affecting legitimate users. We also optimized the system by caching JWT tokens and session data in Redis, which reduced the overhead of frequent token validation. This approach improved performance while maintaining high levels of security. The challenge was fine-tuning these configurations to avoid overburdening the system while ensuring top-notch security.
Question: Tell me about a time when you had to make a tradeoff between performance and reliability. How did you decide what to prioritize?
Answer: In one of our systems, we faced a situation where we had to choose between maximizing performance and maintaining system reliability, particularly during a period of high transaction volume. Initially, we had designed the system to prioritize real-time processing of transactions, but as the load increased, some services began to fail intermittently. To address this, we implemented graceful degradation, where non-critical services (like notifications or analytics) were temporarily disabled during peak load, allowing us to focus resources on the core transactional services. We also introduced circuit breakers to limit cascading failures from affecting other parts of the system. Although this meant sacrificing some performance on non-essential features, it ensured the reliability of our core services and prevented downtime. The challenge was to make sure the degradation was transparent to the users, which we achieved by pre-communicating service limitations during high-load periods.
Question: Describe a situation where you had to implement a security solution that impacted system performance. How did you mitigate the performance impact?
Answer: One notable case was when we introduced mutual TLS (mTLS) for service-to-service communication to enhance security. While mTLS provided a high level of security by ensuring that both the client and server authenticate each other, it introduced significant overhead due to the handshake process. To mitigate the performance impact, I implemented connection pooling and session reuse mechanisms for frequently communicated services. This reduced the frequency of full TLS handshakes, improving latency without compromising security. We also introduced Redis-based caching for token management and session state, which allowed us to reduce repeated token validation processes for authenticated requests. By carefully balancing security and performance through these optimizations, we were able to improve response times while maintaining the strict security requirements of the system.
Question: Have you ever encountered a situation where scaling a system caused unexpected challenges? How did you handle it?
Answer: Yes, one such challenge arose when we implemented auto-scaling in our Kubernetes cluster for a microservices architecture. While auto-scaling helped handle increased load, we encountered an unexpected challenge with database bottlenecks. As the number of pods increased, the volume of database connections also rose, eventually causing connection pool exhaustion. To address this, I introduced connection pooling strategies and optimized database queries to ensure the system could handle the increased demand. Additionally, we adopted database read replicas to offload read-heavy operations from the master database, which helped to balance the load effectively. We also put rate-limiting mechanisms in place at the API gateway level to prevent a sudden surge in requests from overwhelming the database. The challenge here was managing the database scaling without affecting user experience, which we mitigated through these architectural changes and proper monitoring.
Question: Can you share an example of a time when you improved system availability? What was your approach, and what challenges did you face?
Answer: In one project, we needed to improve system availability as our services were deployed in a single region, which posed a risk of significant downtime in case of regional outages. To address this, I proposed and implemented a multi-AZ (Availability Zone) architecture, where services were replicated across multiple zones within the same region. We used an Elastic Load Balancer (ELB) to distribute traffic across healthy instances, ensuring high availability even if one AZ went down. Additionally, I set up health checks and failover strategies to reroute traffic to backup instances if the primary ones failed. The main challenge was ensuring consistent data replication across zones while keeping latency low. We overcame this by optimizing database writes and introducing read replicas in each AZ to handle local reads, reducing cross-AZ latency. While we couldn’t fully achieve geo-redundancy due to cost constraints, this approach significantly improved availability.
Question: How do you handle performance monitoring and optimization in large-scale systems? Can you give an example of how you used monitoring tools to identify and resolve a performance issue?
Answer: Performance monitoring and optimization are crucial for maintaining system health, especially at scale. In our system, we used Prometheus for monitoring and Grafana for real-time visualization of performance metrics. This setup allowed us to track key metrics like CPU usage, memory consumption, request latency, and database throughput. One issue we encountered was a sudden spike in response time during peak traffic hours. Upon investigating through Grafana dashboards, we identified a bottleneck in one of the microservices responsible for database-heavy operations. The service was performing inefficient queries that caused high CPU and memory consumption. To resolve this, I optimized the database queries by adding appropriate indexing and reducing the number of database calls. I also introduced Redis as a caching layer for frequently accessed data, significantly reducing the load on the database and improving overall response times. We used Prometheus alerts to ensure that similar performance issues would be flagged early in the future.
Question: Can you share a time when you had to address a system failure? How did you ensure quick recovery and prevent it from happening again?
Answer: In one instance, we experienced a system-wide outage due to a failure in our Kubernetes cluster caused by a misconfiguration in the auto-scaling policy. This resulted in excessive pod rescheduling and resource exhaustion, which brought down several key services. To recover quickly, we immediately rolled back the configuration changes and rescheduled the pods manually to restore service availability. I also worked closely with the DevOps team to adjust the auto-scaling parameters to prevent over-scaling. Post-recovery, we conducted a thorough root cause analysis and implemented a set of safeguards, including stricter health checks, more fine-tuned scaling policies, and better resource allocation strategies. We also set up real-time Slack alerts for critical service failures to ensure a faster response to similar incidents in the future.
Question: How do you handle NFR (Non-Functional Requirements) in your design process, especially when it comes to security, performance, and scalability?
Answer: Non-functional requirements such as security, performance, and scalability are always a key part of my design process. For security, I focus on incorporating strong authentication (OAuth tokens, JWTs), authorization, encryption (HTTPS, TLS, mTLS), and DDoS protection mechanisms like rate-limiting. For performance, I ensure that we leverage caching (e.g., Redis), optimize database queries, and use asynchronous processing for heavy tasks. We also monitor latency and throughput to make sure that the system can handle the required load. For scalability, I design systems with auto-scaling capabilities using Kubernetes, stateless services, and load balancing across services and zones. Database sharding and master-slave architectures are considered when scaling read and write operations. A key challenge is balancing these NFRs without causing trade-offs that might affect other parts of the system. For example, improving security shouldn’t introduce significant latency, and enhancing scalability shouldn’t compromise system reliability. Monitoring tools like Prometheus and real-time alerts help in making data-driven decisions to balance these requirements effectively.
Question: Tell me about a time when you needed to optimize system latency. How did you identify the sources of latency, and what actions did you take to resolve them?
Answer: We noticed increased latency in our ledger services during peak transaction hours. After profiling the service, I identified that multiple calls to the database were the primary cause. To address this, I implemented in-memory caching using Redis to store frequently accessed transaction data, significantly reducing the number of database queries. I also optimized our database schema and introduced asynchronous processing for non-critical operations, which helped us further reduce latency and improve system responsiveness.
Question: Can you describe a time when you had to make a trade-off between system availability and consistency? How did you decide what to prioritize?
Based on the Speed architectural diagram: The architecture includes services like the "Ledger Service" and an "Archive Service". These often require a balance between availability and consistency, especially in distributed systems.
Answer: When designing the ledger service for high availability, we faced a trade-off between ensuring data consistency and maintaining availability during peak traffic. During times of heavy load, we prioritized availability by implementing eventual consistency. I introduced asynchronous processing and transaction logs that allowed us to ensure transactions were eventually processed, even if some data was temporarily out of sync. This approach helped us maintain service availability, while batch processes ensured data consistency was restored without affecting the user experience.
XI.) IMPLEMENTATION :
++++++++++++++++++
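A rate limiter is one of the implementations listed for this section in the outline. The following is a minimal Go sketch of a leaky-bucket limiter used as a meter: each admitted request adds one unit to the bucket, the bucket drains at a constant rate, and requests that would overflow it are rejected. The type, field, and method names are hypothetical, and the capacity and leak rate are illustrative values.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// LeakyBucket admits a request only if the bucket has room; the bucket
// drains at a constant rate.
type LeakyBucket struct {
	mu       sync.Mutex
	capacity float64   // maximum fill level (burst size)
	leakRate float64   // units drained per second
	level    float64   // current fill level
	last     time.Time // last time the level was updated
}

func NewLeakyBucket(capacity, leakPerSecond float64) *LeakyBucket {
	return &LeakyBucket{capacity: capacity, leakRate: leakPerSecond, last: time.Now()}
}

// Allow reports whether a request arriving now fits in the bucket.
func (b *LeakyBucket) Allow() bool {
	return b.AllowAt(time.Now())
}

// AllowAt is Allow with an explicit arrival time, which keeps the logic testable.
func (b *LeakyBucket) AllowAt(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	// Drain the bucket for the time elapsed since the last update.
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.level -= elapsed * b.leakRate
		if b.level < 0 {
			b.level = 0
		}
		b.last = now
	}
	if b.level+1 > b.capacity {
		return false // bucket full: reject (or queue) the request
	}
	b.level++
	return true
}

func main() {
	b := NewLeakyBucket(3, 1) // burst of 3, drains 1 request per second
	for i := 0; i < 5; i++ {
		fmt.Println(i, b.Allow())
	}
}
```

In a multi-instance deployment the bucket state would have to live in a shared store such as Redis; this sketch only limits traffic within a single process.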
ADDENDUM
++++++++++
Summary of NFR @ BT



Load Balancing Strategies :
Static Algorithms
Round robin
Sticky round-robin
Weighted round-robin
Hash
Dynamic Algorithms
Least connections
Least response time
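As an illustration of the difference between the two families, here is a minimal Go sketch with one static strategy (round robin) and one dynamic strategy (least connections). The `Backend`, `RoundRobin`, and `LeastConnections` types are hypothetical, and both pickers assume a non-empty backend list.

```go
package main

import (
	"fmt"
	"sync"
)

// Backend is a hypothetical upstream instance.
type Backend struct {
	Addr   string
	Active int // in-flight requests, used by least-connections
}

// RoundRobin is a static strategy: it cycles through backends in order,
// ignoring their current load.
type RoundRobin struct {
	mu       sync.Mutex
	backends []*Backend
	next     int
}

func (r *RoundRobin) Pick() *Backend {
	r.mu.Lock()
	defer r.mu.Unlock()
	b := r.backends[r.next]
	r.next = (r.next + 1) % len(r.backends)
	return b
}

// LeastConnections is a dynamic strategy: it picks the backend with the
// fewest in-flight requests. Callers must call Release when a request ends.
type LeastConnections struct {
	mu       sync.Mutex
	backends []*Backend
}

func (l *LeastConnections) Pick() *Backend {
	l.mu.Lock()
	defer l.mu.Unlock()
	best := l.backends[0]
	for _, b := range l.backends[1:] {
		if b.Active < best.Active {
			best = b
		}
	}
	best.Active++
	return best
}

func (l *LeastConnections) Release(b *Backend) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if b.Active > 0 {
		b.Active--
	}
}

func main() {
	rr := &RoundRobin{backends: []*Backend{{Addr: "10.0.0.1"}, {Addr: "10.0.0.2"}}}
	for i := 0; i < 3; i++ {
		fmt.Println("round robin:", rr.Pick().Addr)
	}
	lc := &LeastConnections{backends: []*Backend{{Addr: "10.0.0.1"}, {Addr: "10.0.0.2"}}}
	first := lc.Pick()
	second := lc.Pick()
	lc.Release(second) // the second backend finishes its request first
	fmt.Println("least connections:", first.Addr, second.Addr, lc.Pick().Addr)
}
```

Round robin needs no feedback from the backends, which is why it counts as static; least connections depends on live request counts and so adapts when some requests run longer than others.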








LINUX COMMANDS :
1.) Networking and System Monitoring
* netstat — Displays network connections, routing tables, and interface statistics.
* ss — Shows socket statistics (similar to netstat but faster).
* nmap — Network scanner for discovering hosts and services.
* iftop — Real-time bandwidth usage of network interfaces.
* htop — Interactive system monitor for CPU, memory usage, and process management.
* ping — Tests connectivity between hosts.
* traceroute — Shows the path packets take to reach a host.
* curl — Transfers data from or to a server, supporting many protocols.
* wget — Downloads files from the internet.
* dig — Queries DNS information.
* nslookup — Looks up DNS records.
* arp — Displays or manipulates the system's ARP cache.
* ip — Shows/manages IP addresses, routing, and devices.
* route — Shows/manages IP routing tables.
* nc (netcat) — Reads and writes data across networks.
* iptables — Manages firewall rules.
2.) Disk and Filesystem Management
* df — Displays disk space usage.
* du — Shows disk usage of files and directories.
* ncdu — Text-based disk usage analyzer.
* rsync — File and directory synchronization and backup.
* lsof — Lists open files and their associated processes.
* mount / umount — Mounts and unmounts filesystems.
* lsblk — Lists information about all available block devices.
* blkid — Locates block devices and prints their attributes (UUID, filesystem type, label).
* mkfs — Formats a disk with a specific filesystem.
* fsck — Checks and repairs filesystem consistency.
* e2fsck — Checks and repairs ext2/ext3/ext4 filesystems.
* resize2fs — Resizes ext2/ext3/ext4 filesystems.
3.) System Resource Management
* ps aux — Lists all running processes with details.
* pkill — Kills processes by name.
* nice / renice — Adjusts the priority of running processes.
* sar — Collects and reports system activity information.
4.) Process Management
* top — Displays live system processes.
* kill — Terminates a process by ID.
* killall — Terminates processes by name.
* jobs — Lists background jobs in the current session.
* bg / fg — Moves jobs to the background or foreground.
* ps — Displays currently running processes.
* pmap — Displays memory map of a process.
5.) File System and Disk Management
* fdisk — Disk partitioning tool.
* parted — Advanced disk partitioning.
* tune2fs — Adjusts filesystem parameters on ext2/ext3/ext4 filesystems.
6.) Archiving and Compression
* tar — Archives files, optionally compressing them with gzip, bzip2, or xz.
* zip / unzip — Compresses and extracts files in ZIP format.
7.) Security and Permissions
* chmod — Changes file permissions.
* chown — Changes file ownership.
* iptables — Manages firewall rules.
* ufw — User-friendly firewall for Ubuntu-based systems.
8.) System Information and Diagnostics
* dmesg — Displays system messages (e.g., boot logs, hardware issues).
* iostat — Provides CPU and I/O statistics.
* strace — Traces system calls and signals for a command.
* ltrace — Tracks library calls made by a command.
9.) Device and Hardware Control
* ioctl — System call for device-specific input/output operations (used in programming).
* mknod — Creates device files manually.
* dmsetup — Device mapper tool for managing logical volumes.
* hdparm — Disk parameter control for tuning HDD/SSD performance.
* setpci — Configures PCI device settings.
* modprobe — Manages kernel modules dynamically.
* insmod / rmmod — Manually inserts and removes kernel modules.
10.) Kernel and System Configuration
* sysctl — Configures kernel parameters at runtime.
* ethtool — Configures network interface hardware settings.
* perf — Performance analysis tool to profile applications.
11.) File and Directory Management
* ls — Lists files and directories.
* cd — Changes the current directory.
* pwd — Displays the present working directory.
* mkdir — Creates new directories.
* rmdir — Removes empty directories.
* cp — Copies files or directories.
* mv — Moves or renames files or directories.
* rm — Removes files or directories.
* find — Searches for files and directories in a directory hierarchy.
* locate — Finds files quickly using an index.
* touch — Creates an empty file or updates an existing file's timestamp.
12.) Developer and Debugging Tools
* gcc — GNU Compiler Collection for compiling code.
* gdb — Debugger for programs written in C/C++.
* make — Utility for building and managing projects.
* valgrind — Tool for memory debugging, profiling, and leak detection.
* strace — Traces system calls.
* ltrace — Traces library calls.
* objdump — Displays information about object files.
* ldd — Lists shared libraries required by a program.
13.) System and Hardware Information
* uname — Shows system information.
* hostname — Displays or sets the system's hostname.
* lscpu — Displays CPU architecture information.
* lsusb — Lists USB devices.
* lspci — Lists PCI devices.
* dmidecode — Displays hardware information from the BIOS.
* uptime — Shows how long the system has been running.
* free — Shows memory usage.
* dstat — Displays system resource usage in real-time.
* mpstat — Shows CPU activity for each processor.
Consistent Hashing Concepts
Consistent hashing is a technique primarily used in distributed systems to distribute data evenly across a dynamic set of nodes (e.g., servers) while minimizing the amount of data that needs to be redistributed when nodes are added or removed. This approach is commonly used in caching, load balancing, and distributed databases, where node additions and deletions are frequent.
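The idea can be sketched in Go as a hash ring with virtual nodes. This is a minimal illustration under stated assumptions: FNV-1a is used as the hash function purely for convenience, the `Ring` type and its methods are hypothetical names, and the number of virtual nodes per physical node is an arbitrary example value.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring is a consistent-hash ring with virtual nodes.
type Ring struct {
	replicas int               // virtual nodes per physical node
	hashes   []uint32          // sorted positions on the ring
	owner    map[uint32]string // ring position -> physical node
}

func NewRing(replicas int) *Ring {
	return &Ring{replicas: replicas, owner: make(map[uint32]string)}
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Add places the node's virtual nodes on the ring.
func (r *Ring) Add(node string) {
	for i := 0; i < r.replicas; i++ {
		h := hashKey(node + "#" + strconv.Itoa(i))
		if _, taken := r.owner[h]; taken {
			continue // rare hash collision: keep the existing owner
		}
		r.owner[h] = node
		r.hashes = append(r.hashes, h)
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// Remove deletes all virtual nodes that belong to the given node.
func (r *Ring) Remove(node string) {
	kept := r.hashes[:0]
	for _, h := range r.hashes {
		if r.owner[h] == node {
			delete(r.owner, h)
			continue
		}
		kept = append(kept, h)
	}
	r.hashes = kept
}

// Get returns the node owning key: the first virtual node at or after the
// key's hash, wrapping around to the start of the ring if needed.
func (r *Ring) Get(key string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owner[r.hashes[i]]
}

func main() {
	ring := NewRing(100)
	for _, n := range []string{"cache-1", "cache-2", "cache-3"} {
		ring.Add(n)
	}
	fmt.Println("user:42 ->", ring.Get("user:42"))
	ring.Remove("cache-2")
	fmt.Println("user:42 after removing cache-2 ->", ring.Get("user:42"))
}
```

When a node is removed, only the keys it owned move to other nodes; when a node is added, keys either stay where they were or move to the new node. More virtual nodes per physical node generally give a more even spread at the cost of a larger ring.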
Basics of Consistent Hashing
Adding Nodes
Deleting Nodes
Load Distribution and Virtual Nodes
Summary of Load Distribution:

Database types :
https://medium.com/@i.vikash/looking-for-a-database-for-your-next-system-design-d65df8a778c5

DB CheatSheet : https://blog.bytebytego.com/p/understanding-database-types

Choose which database for your System Design.



LSM Engine :

