Wikipedia | DB/system design - thoughts?
Anonymous User
2050

Your task is to improve how the world thinks about content on Wikipedia. Currently, Wikipedia
stores edits to its pages (AKA articles) in a ​revision​ table:

revision_id  int
page_id int
revision_timestamp  timestamp
revision_text_id  int

That ​revision_text_id​ points to a blob of text in an external object store (e.g. Openstack Swift). Revision text contains a myriad of interesting unstructured data. For example, editors use ​the ​{{Citation needed}}​ markup template​ to indicate that a factual statement needs a reference, and in general that a page needs more work.

Product managers want to track how the ​{{Citation needed}}​ template is used over time. They want to know things like:

  • How many ​{{Citation needed}}​s are added and removed per page over time?
  • How many ​{{Citation needed}}​s are there in a given revision?
  • Is the total number of ​{{Citation needed}}​s fluctuating “too quickly”?
    The ​revision​ table is large, so large that simply querying it, looking up the text, parsing it, and counting occurrences of ​{{Citation needed}}​ will not scale to meet the query requirements. Your task is to re-architect the way this unstructured data is stored and queried.
    Submit a system architecture proposal that will support use cases like these at scale. You may be as abstract or specific as you like. If you choose to use specific technologies, please indicate why you made those choices.

Bonus 1: The new architecture should consider other unstructured data use cases, for example answer similar questions about other markup templates in the revision text.


My solution I submitted was:
Using a combination of a OLTP transactional db + nosql database like elasticsearch or sphinx for fast text search + OLAP data warehouse for analytics like redshift/snowflake.
Different tools for different domain questions mainly: a time series(treating time as first class citizen) database like influxdb or elastic search for answering “over-time” questions and Relational datastore for answering aggregate metrics queries. ( “total number of” questions)

But I got a rejection email after this. :(

Comments (2)