rolling hash is usually used for search patterns in long string input. For example:
input = "abcdefgabcdefg"
pattern = "cd"1. compare ab and cd -> not equal, move the window
--
abcdefgabcdefg
2. compare bc and cd -> not equal, move the window
--
abcdefgabcdefg
3. compare cd and cd -> equal, return true
--
abcdefgabcdefgtime complexity O(N * M) where N is length of input and M is length of pattern.
the main problem in this approach is that we compare pattern with all windows, to avoid this we can calculate window hash and compare pattern and window only if their hashes are equal.
if input and pattern have only english characters we can simply calculate hash like sum of character value, for example:
a = 1
b = 2
c = 3
d = 4
...
input = "abcdefgabcdefg"
pattern = "cd"
1. calculate pattern hash:
p_hash = 3 + 4 = 7 (cb)
2. calculate first window hash:
w_hash = 1 + 2 = 3 (ab)
3. check window and pattern hashes -> not equal, move the window and recalculate hash
--
abcdefgabcdefg
because we moving window one by one character we don't need to recalculate hash from zero, we can remove previous character hash and add a new one
w_hash = 3 - 1 + 3 = 5 (bc)
4. check window and pattern hashes -> not equal, move the window and recalculate hash
--
abcdefgabcdefg
w_hash = 5 - 2 + 4 = 7 (cd)
5. check window and pattern hashes -> equal, check pattern and window character by character -> equal, return true
--
abcdefgabcdefgtime complexity O(M + N) where N is length of input and M is length of pattern because we need to calculate pattern hash and then recalculate hash for all input length but recalculation hash is O(1), so we can skip it.
if all windows hashes are equal for example:
input = "aabaabaabaab"
pattern = "baa"
first window and pattern hashes are equal time complexity can be O(M * N), to avoid this we need a strong hash function.
For the strong hash function, we can use a numeral system with positional notation, for this first, we need to calculate a number of unique characters in the input. For example, if we have only English alphabet characters, it is 26, if we have English alphabet + digits, it is 36 if we have only 0 and 1, it is 2, and so on.
That number will be used for numeral system base:
input = "abcdefgabcdefg"
pattern = "cd"
1. calculate pattern hash:
p_hash = 26^1 * 3 + 26^0 * 4 = 82 % 997 = 82 (cb) | (numeral system base)^(position) * (character value)
to avoid int overflow we can use any prime number for devision our hash like 997
2. calculate first window hash:
w_hash = 26^1 * 1 + 26^0 * 2 = 28 % 997 = 28 (ab)
3. check window and pattern hashes -> not equal, move the window and recalculate hash
--
abcdefgabcdefg
w_hash = 28 - 26 = | then before adding a new value we need to shift value to left, for this we can multiply value by numeral system base | = 2 * 26 + 26^0 * 3 = 57 % 997 = 57 (bc)
4. check window and pattern hashes -> not equal, move the window and recalculate hash
--
abcdefgabcdefg
w_hash = (57 - 54) * 26 + 26^0 * 4 = 82 % 997 = 82 (cd)
5. check window and pattern hashes -> equal, check pattern and window character by character -> equal, return true
--
abcdefgabcdefg