Position: SDE2
You are given a logfile with multiple requests in the following form:-
Timestamp Customer_ID Page_ID
Assume that the logfile is sorted on timestamp. Find a way of generating the most common 3 page sequence. A 3 page sequence is a sequence of pages such that a user visits these 3 pages one after each other.
eg.
T1 C1 P1
T2 C2 P2
T3 C1 P2
T4 C2 P1
T5 C1 P1
Here, the most frequent 3 page sequence is P1->P2->P1. C2 here is a corner-case because C2 only visits 2 pages, asked about this and was told to discard such sequences.
I gave 2 solutions:
- Straightforward hashmap: Hashmap<sequence, integer> in java, where sequence is a key generated by page visits. Eg. key(1->2->3) = 123 where 1,2,3 are page ids.
- O(n) time: O(n) for separating out page visits according to customers, O(n) for going through each possible sequence and incrementing count in hashmap(O(n - k) actually), and finally O(n) in finding the maximum. Space would be O(n) in all scenarious. Here n is the total number of logfile entries, and k is the sequence length.
=========================================================
- Compressed Trie of page sequences: much more scalable, much lesser memory in a real-world scenario, and slightly more time to get the maximum. The Leaves store the count of each sequence. Can have variable page sequences(k), uses much lesser memory in real-world scenario because almost 90% of users visit home first, and then probably search/detail/cart/checkout. To get the most popular/max-occurring page sequence: can keep a running maximum count. Implemented HashSet every trie-node for faster child-node retrieval(it is not characters but pageids, so we cannot just keep a 26-sized array map).
- Time Complexity analysis: Time taken to create trie: For 'n' sequences with all unique nodes,O(n) to parse for all users, O(n), for every page id, we have to create a node, but for real-world scenarios, it will be much faster(because sequences will often be repeated), finally O(n) for calculating maximum because everytime we get a sequence, we compare against the maximum count encountered till that sequence. Here n is the number of entries in the logfile.
- Memory Complexity: O(n) to create trie(again, much better in real-world scenario).
- All of this is assuming that page sequence size is a constant.
I implemented the 2nd one. Was rejected and the reason given after the interview was that I had issues in understanding Data Structures. The interviewers agreed with my preference for trie during the interview process, but the final result was given to me by the HR.
EDIT: I have coded the second solution here : https://leetcode.com/playground/DdwGRi3p. This solution also sorts the given log entries on the basis of timestamp, but in the original question, it is assumed that the entries are already sorted by time.
There is a leetcode question similar to this where sequences might not necessarily be consecutive. https://leetcode.com/problems/analyze-user-website-visit-pattern/
eg for P1 -> P2-> P3->P4, seqences for this question would be P1, P2, P3 and P2, P3, P4, but for the leetcode question, P1, P3, P4 is also a valid sequence.