Meta System Design interview question (E4/E5)
Anonymous User

There's a very large website that we want to crawl. This site is organized like Wikipedia, with every page linking to multiple other pages. Please design a system that:

  • Downloads all pages from this website
  • Downloads each URL once
  • Minimizes the amount of traffic coming from any given node.

The interviewer stressed that you have 100 machines available and that the system must be designed around them.
Upon further clarification, I was told that we just want to visit the pages (and do nothing else?). Visiting a URL exactly once was very important. Assume the website has tens of millions of pages. We don't have to visit external links.
The third bullet point didn't make sense to me, so I asked him to clarify. He said it's not important (??).
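
For anyone thinking about the "each URL exactly once across 100 machines" part, here is a minimal sketch of one common approach, hash partitioning: every URL is hashed to exactly one owning machine, and only that machine keeps the visited set for it. This is not something the interviewer proposed. The names `owner_of`, `CrawlerNode`, and `simulate` are hypothetical, the in-memory `link_graph` stands in for real HTTP fetches and link extraction, and filtering out external links is left out.

```python
import hashlib
from urllib.parse import urldefrag

NUM_MACHINES = 100  # the machine count given in the question


def normalize(url: str) -> str:
    # Drop the #fragment so page#a and page#b are treated as one URL.
    return urldefrag(url).url


def owner_of(url: str, num_machines: int = NUM_MACHINES) -> int:
    # A stable hash (unlike Python's per-process salted hash()) so that
    # every machine computes the same owner for the same URL.
    digest = hashlib.sha256(normalize(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


class CrawlerNode:
    """One of the machines; it dedupes only the URLs it owns."""

    def __init__(self, node_id: int, num_machines: int = NUM_MACHINES):
        self.node_id = node_id
        self.num_machines = num_machines
        self.visited = set()   # URLs this node has already accepted
        self.frontier = []     # URLs accepted but not yet fetched

    def receive(self, url: str) -> bool:
        """Accept a URL routed here; return True if it was newly queued."""
        url = normalize(url)
        if owner_of(url, self.num_machines) != self.node_id:
            raise ValueError("URL was routed to the wrong node")
        if url in self.visited:
            return False
        self.visited.add(url)
        self.frontier.append(url)
        return True


def simulate(link_graph: dict, seed: str, num_machines: int = NUM_MACHINES) -> dict:
    """Single-process simulation; returns how many times each URL was fetched."""
    nodes = [CrawlerNode(i, num_machines) for i in range(num_machines)]
    fetch_counts = {}
    nodes[owner_of(seed, num_machines)].receive(seed)
    progress = True
    while progress:
        progress = False
        for node in nodes:
            while node.frontier:
                progress = True
                url = node.frontier.pop()
                # Stand-in for the real HTTP GET of this page.
                fetch_counts[url] = fetch_counts.get(url, 0) + 1
                # Route every discovered link to the machine that owns it.
                for link in link_graph.get(url, []):
                    nodes[owner_of(link, num_machines)].receive(link)
    return fetch_counts
```

In a real deployment the `receive` call would be a network message between machines, and the sketch says nothing about machine failures or rate limiting, both of which would be reasonable follow-up topics.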
