Design a system to capture unique addresses in the entire world.
This quesiton was taken from this https://leetcode.com/discuss/interview-experience/340499/amazon-sde2-aws-seattle-2019-offer/311001
Here is my attempt [ please share your thoughts ]
Follow up Questions:
We can find out the requirement accordingly, this may change the algorithm of uniquely identifying address drastically.
I'm assuming, least of them and considering we need to address, addresses on Sea point too (so that it can be use for some crucial purposes)
System Requirement:
Functional
Non-functional
The last point is debatable but since we need consistency for sure then we need to compromise the availability. That can be discussed with interviewer.
Capacity calculation
Earth is of nearly 500 million KM^2 in size which is 5 * 10^8 (KM)^2. Let's assume we divide the big square in small number of tiny square so that we can map each of them.
How short or long Is depend on how granular level of address we are supporting. If we support least one like Street address then we need to be very sure that square should be minimum like 5m x 5 m
Lets make it more granular with 3m X 3m
Now each 1km = 1000 m
so there will be 333m blocks of size 3m in a KM.
hence 1000m * 1000 m => 333m * 333m
If we need to store only address?
Let's assume that each address take up to 100 character (100 bytes) at max including pin codes etc.
Lets assume each 3mx3m block have a single address then
111 * ( 3mx3m) = 111 * 100 bytes = 11,100 bytes = 11 KB =>
so for storing 1 * (KM)^2 we need 11 KB
so for 500 million (KM)^2 we need = 500 million * 11 KB = 5500 million KB = 5500 GB ( 1million*1KB = 1 GB) = 5.5 TB
Data is reasonable in size 5.5 TB for whole world map.
Assuming a machine that can hold 15 TB than 1 node.
But storing whole data on 1 node would not be great idea. So assume on each node we store up to 100 GB makes this make our query bit faster
5500 / 100 = **55 Nodes **
With replica its 55*3 = 165
If we need to store images.[Point 3-Follow up]
Lets assume, that each of this block consume 1 MB (which include minimal contextual information too such as 1 building, 2 road name etc)
Size would be in MB : 1000m * 1000 m => 333m * 333m and each of 1MB => 333MB *333MB = 110889 MB lets make it 111000MB =111 GB
so for storing 1 * (KM)^2 we need 111 GB
so for 500 million (KM)^2 we need = 500 million * 111 GB =55500 million GB = 55500 PB { since 1 million * 1 GB = 1 PB)
55500 PB => 55.5 XB
Data is so huge 55.5 XB for whole world map.
Assuming a machine that can hold 15 TB than 3.7 million nodes. That so huge.
So we need to vertical scale it.
A 64 bit machine can hold up to 16XB of data "theoretically"
232 = 4 GB (4109 bytes), therefore
264 = (4109)2 = 161018 bytes = 16106 TBs = 16000 PBs = 16 EBs (Exabytes)
But that will be very costly, so lets settle down to somewhere 100 TB of size. Nodes require = 5,55,000 (~0.5 million)
[We need to have big size machines otherwise we need to have so many nodes, that makes our query bit slow but we need to find a way to solve it]
Since we need make data available all the time then assuming With 3 replace = 555000 * 3 = 1665000 = ~1.6 million nodes
Moving ahead, **managing 165 nodes is reasonably fine [ but not 1.6 million nodes ] **
Now let's dive into some system design.
Lets our system only store address and refer address using English language only.
The idea is we need to use engine words to identify address uniquely.
Ok, lets revise how many square we had
500 * 10^6 * km^2 = 500*10^6 * 110889 = 55444500 * 10^6 = 55 * 10^12 = 55 Trillion [ including Sea]
If we don't include sea and not useful area like forest, then we left with 10% of actual i.e. 50 million km^2
Hence 5.5 Trillion
We need to uniquely identify **55 trillion combination. **
English has ~60 K words, if we remove abusing, offensive and other bad words. Then we have up to 40K words. Since we have 3m x 3m square of blocks, we'll use 3 words to identify a address uniquely
so we have 40k^3 = 64 Trillion ( > 55 trillion > 5.5 Trillion )
Hence using 40K words, we can easily identify all the world address.
**We have (word1 x word2 x word3 ) 40k combination **
Now how we can map this uniquely ?
Three important points
So square 0 can be identify with (word0 x word0 x word0), square 1 with (word0 x word0 x word1).....
as
Offline algorithm
We can pre-compute all the address and give them a unique combination as state above. Whenever some one wants to map a address, we can locate the Lat Long, and fetch from our pre-compute unique address and map it to given address;
Hence
**User address = (Lat Long) => (word m * word n * word o) **
How can we identify a region is City or not?
Using deep-learning algorithm to extract the road pixels from satellite images. Then an algorithm to connected the pixels together into a road network. The system analyzed the density and shape of the roads to segment the network into different communities, and the densest cluster was labeled as the city center. The regions around the city center were divided into north, south, east, and west quadrants, and streets were numbered and lettered according to their orientation and distance from the center.
**System Desing Components **
Offline: This is pipeline based Server [ Note I made this very easy but in real, that going to be very complex and concurrent. As doing one by one by take ages ]