I recently got this question in the data engineering on-site interview for position in menlo park. This was part of the data modeling/ System Design interview. \n\nThe interviewer shared the below sample data logs. \nGiven below sample data \n\n```\nm0\tp0\tstart\t0.712\nm0\tp1\tstart\t0.841\nm0\tp2\tstart\t1.523\nm0\tp2\tend\t1.966\nm0\tp1\tstart\t2.856\nm0\tp2\tstart\t3.347\nm0\tp2\tend\t3.567\nm0\tp1\tstart\t3.800\nm0\tp2\tstart\t4.618\nm0\tp2\tend\t5.497\nm0\tp1\tstart\t5.961\nm0\tp2\tstart\t6.324\nm0\tp2\tend\t6.673\nm0\tp1\tend\t7.233\nm0\tp1\tend\t7.533\nm0\tp1\tend\t7.933\nm0\tp1\tend\t8.333\nm0\tp0\tend\t9.933\n```\n\na row m1:p1:start:2.984 means, machine m1 starts process p1 at timestamp 2.984. \n\nGoal:\n1. Design a table schema for this data to be used by data scientist to query metrics such as process with max average elapsed time and they can plot each process.\n2. Design a ETL in python to load data to above data model /table.\n\nFollow-up \n1. How to optimize process to parse the file and load to table. Can it be done with constant memory.\n2. There can be multiple machine m0..mN, each machine can have millions of process entries. How you will scale.

I recently got this question in the data engineering on-site interview for position in menlo park. This was part of the data modeling/ System Design interview.