Had 5 rounds (4 algo + 1 SD) in 2 days related to data stream problems. Had similar 1 last year too.
- Kth largest element
- K most frequent elements (algo + SD)
- Find median
I would like to have thoughts on "K most frequent elements" algorithm. Searched on Leetcode and other platforms, but I'm curious to know if there is a better approach.
My approach:
- For each incoming element,
- CASE 1: Incoming element is met for the first time in data stream
- insert into hashMap with element as key and count as value (O(1))
- insert into min heap (max heap also possible, but thats a different story)
- if heap size less than k, simply insert incoming element (O(logK))
- else, if incoming element has count greater than that of heap top element, remove heap top element (O(1) for removal of top + O(logK) for heapify) and insert incoming element (O(logK))
- CASE 2: Incoming element is previously met and is present in heap
- update the element's count in hashMap (O(1))
- remove the element from heap (O(k) for arbitrary search + O(logK) for heapify)
- insert the element back into heap with the updated count (O(logK))
- CASE 3: Incoming element is previously met and was present in heap but later removed from heap
- update the element's count in hashMap (O(1))
- if incoming element has count greater than that of heap top element, remove heap top element (O(1) for removal of top + O(logK) for heapify) and insert incoming element (O(logK))
-
But how to know if the incoming element is present in heap or not? ...so as not to simply search for elements that are not even present in the heap in CASE 2 removal:
- Either store a boolean flag 'isPresentInHeap' along with the element and count during inserting in heap, or use another hashMap (say presentInHeap) for elements in heap so for an incoming element, search it in presentInHeap first (if present, then do CASE 2) and then in first hashMap (say notInHeapMap) (if present do CASE3 else CASE 1).
- Since the former consumes memory for all elements (whether elements are in or out of heap), I chose the later.
- But (to avoid redundant space consumption between maps) for each insertion into heap, remove the element from notInHeapMap and insert in presentInHeapMap. And for each removal from heap, remove it from presentInHeapMap and insert in notInHeapMap. Map inserts and removals are in O(1) average case.
-
Retrieve the k most elements for each incoming element: iterate the k elements in heap (O(k))
Summary of above:
- O(k) for arbitrary search in heap and O(k) for iterating k elements during retrieval are the limiting factors for better time complexity during an incoming element. Can we do better?
- The two maps would totaly consume O(n) space for stream of n elements. Can we reduce this space consumption here?