Your task is to improve how the world thinks about content on Wikipedia. Currently, Wikipedia
stores edits to its pages (AKA articles) in a revision table:
revision_id int
page_id int
revision_timestamp timestamp
revision_text_id intThat revision_text_id points to a blob of text in an external object store (e.g. Openstack Swift). Revision text contains a myriad of interesting unstructured data. For example, editors use the {{Citation needed}} markup template to indicate that a factual statement needs a reference, and in general that a page needs more work.
Product managers want to track how the {{Citation needed}} template is used over time. They want to know things like:
Bonus 1: The new architecture should consider other unstructured data use cases, for example answer similar questions about other markup templates in the revision text.
My solution I submitted was:
Using a combination of a OLTP transactional db + nosql database like elasticsearch or sphinx for fast text search + OLAP data warehouse for analytics like redshift/snowflake.
Different tools for different domain questions mainly: a time series(treating time as first class citizen) database like influxdb or elastic search for answering “over-time” questions and Relational datastore for answering aggregate metrics queries. ( “total number of” questions)
But I got a rejection email after this. :(