This question was not asked of me; I saw it on a different site. It has reportedly been asked several times in phone interviews during 2019-2020.
We want to measure a metric called User Active Minutes (UAM). The UAM for a given user is defined as the number of distinct minutes in which the user takes some action on Twitter. Multiple actions in the same minute count as only one minute. We would like a histogram of the number of users who spend X minutes on Twitter, for different values of X, given 30 days of raw logs and an interval size in minutes.
The raw logs are in the format: [user_id, epoch timestamp]. Each row represents an action a user took on Twitter. The logs are ordered chronologically. Duplicates are possible.
Write code to compute the histogram of UAMs across our user base.
Example:
Raw logs
"""
[1, 1518290973]
[2, 1518291032]
[3, 1518291095]
[1, 1518291096]
[4, 1518291120]
[3, 1518291178]
[1, 1518291200]
[1, 1518291200]
"""
Interval size
2
Resulting histogram
[2, 2]
2 users spend 0-1 minutes on Twitter
2 users spend 2-3 minutes on Twitter
I am thinking we could create a hashmap for users who logged off from Twitter, plus a separate bucket (hashmap) to store each user's epoch login time.
Unfortunately I don't have more details for this question.
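For what it's worth, here is a minimal sketch of one way to read the problem as stated. It is not a confirmed interview answer. It assumes the timestamps are epoch seconds (the example values look that way), that bucket i covers UAM values from i * interval to (i + 1) * interval - 1, and that the histogram only extends to the largest bucket actually observed. The function name uam_histogram is made up for illustration.

```python
from collections import defaultdict


def uam_histogram(logs, interval):
    """Return a list where entry i counts users whose UAM falls in bucket i.

    logs: iterable of (user_id, epoch_seconds) pairs (seconds is an assumption).
    interval: bucket width in minutes.
    """
    # user_id -> set of distinct minutes in which the user acted.
    # Using a set makes duplicate rows and same-minute actions collapse.
    minutes_by_user = defaultdict(set)
    for user_id, ts in logs:
        minutes_by_user[user_id].add(ts // 60)

    if not minutes_by_user:
        return []

    # Map each user's UAM to a bucket index, then count users per bucket.
    buckets = [len(minutes) // interval for minutes in minutes_by_user.values()]
    hist = [0] * (max(buckets) + 1)
    for b in buckets:
        hist[b] += 1
    return hist


sample_logs = [
    (1, 1518290973),
    (2, 1518291032),
    (3, 1518291095),
    (1, 1518291096),
    (4, 1518291120),
    (3, 1518291178),
    (1, 1518291200),
    (1, 1518291200),
]
print(uam_histogram(sample_logs, 2))  # [2, 2]
```

On the example above, users 2 and 4 each have 1 active minute, user 3 has 2, and user 1 has 3, which gives the expected [2, 2] for an interval of 2.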
Follow-up questions: what if the data is too large, or what if the data arrives unevenly?
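For the "too large" case, one possible direction (a sketch under assumptions, not a known expected answer) is to lean on the statement that logs are ordered chronologically. If that holds, each user's minutes never go backwards, so it should be enough to remember only the last minute seen per user plus a running count, instead of every minute. The name uam_histogram_ordered is hypothetical, and the timestamps are assumed to be epoch seconds.

```python
def uam_histogram_ordered(logs, interval):
    """Single pass over chronologically ordered (user_id, epoch_seconds) rows.

    Memory is proportional to the number of users, not users times minutes.
    Only correct if rows really arrive in timestamp order.
    """
    last_minute = {}  # user_id -> most recent active minute seen
    uam = {}          # user_id -> distinct active minutes counted so far

    for user_id, ts in logs:
        minute = ts // 60
        # With ordered input, a minute that differs from the last one is new.
        if last_minute.get(user_id) != minute:
            last_minute[user_id] = minute
            uam[user_id] = uam.get(user_id, 0) + 1

    if not uam:
        return []

    hist = [0] * (max(uam.values()) // interval + 1)
    for count in uam.values():
        hist[count // interval] += 1
    return hist


ordered_sample = [
    (1, 1518290973), (2, 1518291032), (3, 1518291095), (1, 1518291096),
    (4, 1518291120), (3, 1518291178), (1, 1518291200), (1, 1518291200),
]
print(uam_histogram_ordered(iter(ordered_sample), 2))  # [2, 2]
```

If "unevenly" means rows can arrive out of order, the last-minute check can count the same minute twice, so this shortcut no longer applies; keeping a set of minutes per user, or sorting within a bounded lateness window, would be the safer fallback. If the data does not fit on one machine, partitioning rows by user_id would let each partition compute its users' UAMs independently, with the per-partition histograms summed bucket by bucket at the end.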