String matching algorithms are one of the important topic with respect to MAANG and Other Good Software Companies , often neglected by students.
Today we will be discussing about all the string matching algorithms and their implementation in C++.
String matching algorithms are basically used to find a pattern in a string or a text. It is also known as String Searching Algorithm.
Example: Suppose you have a string S and you want to find a pattern P in it. So, you will be using string matching algorithm to find the pattern P in the string S.
There are various types of string matching algorithms. Some of them are as follows:
Naive String Matching Algorithm is the simplest string matching algorithm. It is also known as Brute Force Algorithm.
In this algorithm, we simply traverse the string and check whether the pattern is present in the string or not.
NaiveStringMatchingAlgorithm(string S, string P)
{
n = length of string S
m = length of string P
for i = 0 to n-m
{
for j = 0 to m
{
if S[i+j] != P[j]
break
}
if j == m
print "Pattern found at index " + i
}
}The time complexity of Naive String Matching Algorithm is O(n*m).
n = length of string S
m = length of string PUsing the above pseudocode, we can calculate the time complexity of Naive String Matching Algorithm as follows:
for i = 0 to n-m
{
for j = 0 to m
{
if S[i+j] != P[j]
break
}
if j == m
print "Pattern found at index " + i
}The outer loop will run for n-m+1 times.
The inner loop will run for m+1 times.
So, the total number of times the inner loop will run is (n-m+1)*(m+1).
(n-m+1)*(m+1) = n*m + n - m*m - m + 1
n*m + n - m*m - m + 1 = n*m - m*m + n - m + 1
n*m - m*m + n - m + 1 = n*m - m(m - n + 1) + 1
n*m - m(m - n + 1) + 1 = n*m - m*n + m + 1
n*m - m*n + m + 1 = m + 1
m + 1 = m + 1So, the time complexity of Naive String Matching Algorithm is O(n*m).
Using Master Theorem, we can also calculate the time complexity of Naive String Matching Algorithm as follows:
T(n) = a*T(n/b) + f(n)
a = 1
b = 1
f(n) = n*m
logb(a) = log1(1) = 0
f(n) = n*m = O(n^c)
c = 2
logb(a) < c
So, the time complexity of Naive String Matching Algorithm is O(n^c) = O(n^2).#include <bits/stdc++.h>
using namespace std;
void NaiveStringMatchingAlgorithm(string S, string P)
{
int n = S.length();
int m = P.length();
for (int i = 0; i <= n - m; i++)
{
int j;
for (j = 0; j < m; j++)
{
if (S[i + j] != P[j])
break;
}
if (j == m)
cout << "Pattern found at index " << i << endl;
}
}
int main()
{
string S, P;
cin >> S >> P;
NaiveStringMatchingAlgorithm(S, P);
return 0;
}Suppose we have a string S of length n and a pattern P of length m.
Now, we will be calculating the number of comparisons that will be made by the Naive String Matching Algorithm.
for i = 0 to n-m
{
for j = 0 to m
{
if S[i+j] != P[j]
break
}
if j == m
print "Pattern found at index " + i
}It will give TLE for n = 10^5 and m = 10^5.
because 10^5 * 10^5 = 10^10 which is greater than 10^8.
KMP Algorithm is one of the most important string matching algorithm. It is also known as Knuth-Morris-Pratt Algorithm.
In this algorithm, we will be using a prefix array to find the pattern in the string.
KMPAlgorithm(string S, string P)
{
n = length of string S
m = length of string P
prefixArray = PrefixArray(P)
i = 0
j = 0
while i < n
{
if S[i] == P[j]
{
i++
j++
}
if j == m
{
print "Pattern found at index " + (i-j)
j = prefixArray[j-1]
}
else if i < n && S[i] != P[j]
{
if j != 0
j = prefixArray[j-1]
else
i++
}
}
}
PrefixArray(string P)
{
m = length of string P
prefixArray[0] = 0
len = 0
i = 1
while i < m
{
if P[i] == P[len]
{
len++
prefixArray[i] = len
i++
}
else
{
if len != 0
len = prefixArray[len-1]
else
{
prefixArray[i] = 0
i++
}
}
}
}The time complexity of KMP Algorithm is O(n+m).
n = length of string S
m = length of string PMaster Theorem cannot be used to calculate the time complexity of KMP Algorithm.
The time complexity of the Knuth-Morris-Pratt (KMP) string matching algorithm can be mathematically described as follows:
Let:
Preprocessing Phase:
Searching Phase:
Total Time Complexity:
Therefore, the total time complexity of the KMP algorithm can be mathematically expressed as:
T(n, m) = O(m) + O(n * m)
In big O notation, we can simplify this to:
T(n, m) = O(n + m)
So, mathematically, the time complexity of the KMP algorithm is O(n + m).
#include <bits/stdc++.h>
using namespace std;
void PrefixArray(string P, int prefixArray[])
{
int m = P.length();
prefixArray[0] = 0;
int len = 0;
int i = 1;
while (i < m)
{
if (P[i] == P[len])
{
len++;
prefixArray[i] = len;
i++;
}
else
{
if (len != 0)
len = prefixArray[len - 1];
else
{
prefixArray[i] = 0;
i++;
}
}
}
}
void KMPAlgorithm(string S, string P)
{
int n = S.length();
int m = P.length();
int prefixArray[m];
PrefixArray(P, prefixArray);
int i = 0;
int j = 0;
while (i < n)
{
if (S[i] == P[j])
{
i++;
j++;
}
if (j == m)
{
cout << "Pattern found at index " << (i - j) << endl;
j = prefixArray[j - 1];
}
else if (i < n && S[i] != P[j])
{
if (j != 0)
j = prefixArray[j - 1];
else
i++;
}
}
}
int main()
{
string S, P;
cin >> S >> P;
KMPAlgorithm(S, P);
return 0;
}Suppose we have a string S of length n and a pattern P of length m.
Pattern Preprocessing (Building the LPS Array):
lps[i] stores the length of the longest proper prefix of the pattern that is also a proper suffix up to position i.[0, 0, 1, 2, 3, 0].Search Phase:
i in the pattern, instead of simply shifting the pattern to the right by 1 character (as in brute force), KMP uses the LPS array to determine how far it can safely shift the pattern.i in the pattern. This means that the first lps[i] characters in the pattern are already known to match with the text.i - lps[i] characters, KMP ensures that it continues the comparison at the next character in the text that hasn't been compared yet with the pattern.Efficiency and Avoiding Redundant Comparisons:
In essence, the KMP algorithm works by cleverly determining how far to shift the pattern based on previously matched characters, which allows it to skip over sections of the text where a match is not possible. This avoids redundant comparisons and makes it a highly efficient string matching algorithm.
Rabin-Karp Algorithm is one of the most important string matching algorithm. It is also known as Rabin-Karp Algorithm.
In this algorithm, we will be using a hash function to find the pattern in the string.
RabinKarpAlgorithm(string S, string P)
{
n = length of string S
m = length of string P
hashValueOfPattern = HashValue(P)
hashValueOfSubstring = HashValue(S[0...m-1])
for i = 0 to n-m
{
if hashValueOfPattern == hashValueOfSubstring
{
for j = 0 to m
{
if S[i+j] != P[j]
break
}
if j == m
print "Pattern found at index " + i
}
if i < n-m
{
hashValueOfSubstring = RecalculateHashValue(hashValueOfSubstring, S[i], S[i+m], m)
}
}
}
HashValue(string S)
{
hashValue = 0
for i = 0 to length of string S
{
hashValue = hashValue + S[i]*p^i
}
return hashValue
}
RecalculateHashValue(hashValueOfSubstring, firstCharacter, lastCharacter, m)
{
hashValueOfSubstring = hashValueOfSubstring - firstCharacter
hashValueOfSubstring = hashValueOfSubstring/p
hashValueOfSubstring = hashValueOfSubstring + lastCharacter*p^(m-1)
return hashValueOfSubstring
}The time complexity of Rabin-Karp Algorithm is O(n+m).
n = length of string S
m = length of string PMaster Theorem cannot be used to calculate the time complexity of Rabin-Karp Algorithm.
#include <bits/stdc++.h>
using namespace std;
int HashValue(string S)
{
int hashValue = 0;
for (int i = 0; i < S.length(); i++)
{
hashValue = hashValue + S[i] * pow(3, i);
}
return hashValue;
}
int RecalculateHashValue(int hashValueOfSubstring, char firstCharacter, char lastCharacter, int m)
{
hashValueOfSubstring = hashValueOfSubstring - firstCharacter;
hashValueOfSubstring = hashValueOfSubstring / 3;
hashValueOfSubstring = hashValueOfSubstring + lastCharacter * pow(3, m - 1);
return hashValueOfSubstring;
}
void RabinKarpAlgorithm(string S, string P)
{
int n = S.length();
int m = P.length();
int hashValueOfPattern = HashValue(P);
int hashValueOfSubstring = HashValue(S.substr(0, m));
for (int i = 0; i <= n - m; i++)
{
if (hashValueOfPattern == hashValueOfSubstring)
{
int j;
for (j = 0; j < m; j++)
{
if (S[i + j] != P[j])
break;
}
if (j == m)
cout << "Pattern found at index " << i << endl;
}
if (i < n - m)
{
hashValueOfSubstring = RecalculateHashValue(hashValueOfSubstring, S[i], S[i + m], m);
}
}
}
int main()
{
string S, P;
cin >> S >> P;
RabinKarpAlgorithm(S, P);
return 0;
}Suppose we have a string S of length n and a pattern P of length m.
Hash Function:
p raised to the power of the index of the character in the string.65 * p^0 + 66 * p^1 + 65 * p^2 + 66 * p^3 + 67 * p^4 + 65 * p^5.p, then adding the ASCII value of the last character "A" multiplied by p^5.Search Phase:
Choosing a Prime Number:
p is important to ensure that the hash values of the pattern and the substrings of the text are unique.p should be large enough to avoid collisions but small enough to avoid overflow.p is 3 * 10^8 + 7, which is the largest prime number that can be stored in a 32-bit integer.Avoiding False Positives:
p.p, it should be run again with a different prime number p to confirm the match.Avoiding False Negatives:
p.p, it should be run again with a different prime number p to confirm the match.Finite Automata Algorithm is one of the most important string matching algorithm. It is also known as Finite Automata Algorithm.
In this algorithm, we will be using a finite automata to find the pattern in the string.
Here is an image of a digram for a finite automaton that accepts strings that start with the letter a and end with the letter b:

In this digram, the two states, q0 and q1, are represented by circles. The initial state, q0, is indicated by an arrow pointing to it from outside the diagram. The final state, q2, is indicated by a double circle. The transition from q0 to q1 on the input symbol a is represented by an arrow labeled a.
Any string that starts with a and ends with b will follow the path from q0 to q2 in this diagram. For example, the string ab will follow the path q0 -> q1 -> q2. The string aab will follow the path q0 -> q1 -> q1 -> q2. And the string aaba will follow the path q0 -> q1 -> q1 -> q1 -> q2.
However, any string that does not start with a or end with b will not be accepted by this automaton. For example, the string ba will follow the path q0 -> q1. However, there is no transition from q1 to a final state, so the string ba will not be accepted. Similarly, the strings aa and bbb will not be accepted, because they do not start with a or end with b.
FiniteAutomataAlgorithm(string S, string P)
{
n = length of string S
m = length of string P
transitionTable = TransitionTable(P)
currentState = 0
for i = 0 to n
{
currentState = transitionTable[currentState][S[i]]
if currentState == m
print "Pattern found at index " + (i-m+1)
}
}
TransitionTable(string P)
{
m = length of string P
transitionTable[m][256]
for state = 0 to m
{
for character = 0 to 256
{
if state < m && character == P[state]
transitionTable[state][character] = state + 1
else
{
transitionTable[state][character] = nextState(transitionTable, m, state, character)
}
}
}
return transitionTable
}
nextState(transitionTable, m, state, character)
{
if state == 0
return 0
i = 0
for x = state-1 to 0
{
if transitionTable[x][character] == state
{
for i = 0 to x
{
if P[i] != P[state-x+i]
break
}
if i == x+1
return x+1
}
}
return 0
}The time complexity of Finite Automata Algorithm is O(n+m).
n = length of string S
m = length of string P#include <bits/stdc++.h>
using namespace std;
int nextState(int transitionTable[][256], int m, int state, char character)
{
if (state == 0)
return 0;
int i = 0;
for (int x = state - 1; x >= 0; x--)
{
if (transitionTable[x][character] == state)
{
for (i = 0; i < x; i++)
{
if (P[i] != P[state - x + i])
break;
}
if (i == x)
return x + 1;
}
}
return 0;
}
int **TransitionTable(string P)
{
int m = P.length();
int **transitionTable = new int *[m + 1];
for (int i = 0; i <= m; i++)
{
transitionTable[i] = new int[256];
}
for (int state = 0; state <= m; state++)
{
for (int character = 0; character < 256; character++)
{
if (state < m && character == P[state])
transitionTable[state][character] = state + 1;
else
{
transitionTable[state][character] = nextState(transitionTable, m, state, character);
}
}
}
return transitionTable;
}
void FiniteAutomataAlgorithm(string S, string P)
{
int n = S.length();
int m = P.length();
int **transitionTable = TransitionTable(P);
int currentState = 0;
for (int i = 0; i < n; i++)
{
currentState = transitionTable[currentState][S[i]];
if (currentState == m)
cout << "Pattern found at index " << (i - m + 1) << endl;
}
}
int main()
{
string S, P;
cin >> S >> P;
FiniteAutomataAlgorithm(S, P);
return 0;
}Aho-Corasick Algorithm is one of the most important string matching algorithm. It is also known as Aho-Corasick Algorithm.
In this algorithm, we will be using a Trie to find the pattern in the string.
AhoCorasickAlgorithm(string S, string P)
{
n = length of string S
m = length of string P
trie = Trie(P)
currentState = 0
for i = 0 to n
{
currentState = trie[currentState][S[i]]
if trie[currentState].isLeaf == true
print "Pattern found at index " + (i-m+1)
}
}
Trie(string P)
{
m = length of string P
trie[m+1][256]
for state = 0 to m
{
for character = 0 to 256
{
if state < m && character == P[state]
trie[state][character] = state + 1
else
{
trie[state][character] = nextState(trie, m, state, character)
}
}
}
return trie
}
nextState(trie, m, state, character)
{
if state == 0
return 0
i = 0
for x = state-1 to 0
{
if trie[x][character] == state
{
for i = 0 to x
{
if P[i] != P[state-x+i]
break
}
if i == x+1
return x+1
}
}
return 0
}The time complexity of Aho-Corasick Algorithm is O(n+m).
#include <bits/stdc++.h>
using namespace std;
struct TrieNode
{
struct TrieNode *children[256];
bool isLeaf;
};
struct TrieNode *getNode(void)
{
struct TrieNode *pNode = new TrieNode;
pNode->isLeaf = false;
for (int i = 0; i < 256; i++)
pNode->children[i] = NULL;
return pNode;
}
void insert(struct TrieNode *root, string key)
{
struct TrieNode *pCrawl = root;
for (int i = 0; i < key.length(); i++)
{
int index = key[i];
if (!pCrawl->children[index])
pCrawl->children[index] = getNode();
pCrawl = pCrawl->children[index];
}
pCrawl->isLeaf = true;
}
void AhoCorasickAlgorithm(string S, string P)
{
int n = S.length();
int m = P.length();
struct TrieNode *root = getNode();
insert(root, P);
int currentState = 0;
for (int i = 0; i < n; i++)
{
int index = S[i];
if (root->children[index])
{
currentState = root->children[index];
if (root->children[index]->isLeaf)
cout << "Pattern found at index " << (i - m + 1) << endl;
}
else
{
currentState = 0;
}
}
}
int main()
{
string S, P;
cin >> S >> P;
AhoCorasickAlgorithm(S, P);
return 0;
}Z Algorithm is one of the most important string matching algorithm. It is also known as Z Algorithm.
In this algorithm, we will be using a Z Array to find the pattern in the string.
ZAlgorithm(string S, string P)
{
n = length of string S
m = length of string P
ZArray = ZArray(P + "$" + S)
for i = 0 to n
{
if ZArray[i] == m
print "Pattern found at index " + (i-m-1)
}
}
ZArray(string S)
{
n = length of string S
ZArray[n]
left = 0
right = 0
for i = 1 to n-1
{
if i > right
{
left = i
right = i
while right < n && S[right-left] == S[right]
right++
ZArray[i] = right - left
right--
}
else
{
k = i - left
if ZArray[k] < right - i + 1
ZArray[i] = ZArray[k]
else
{
left = i
while right < n && S[right-left] == S[right]
right++
ZArray[i] = right - left
right--
}
}
}
return ZArray
}The time complexity of Z Algorithm is O(n+m).
S = "aabcaabxaaz"
P = "aab"
Concatenated String: "aab$aabcaabxaaz"
Step 1:
Z[0] = 0 (initialization)
Step 2:
Z[1] = 0 (no common prefix with "a")
Step 3:
Z[2] = 1 (common prefix "a")
Step 4:
Z[3] = 0 (no common prefix with "$")
Step 5:
Z[4] = 2 (common prefix "aa")
Step 6:
Z[5] = 0 (no common prefix with "aab")
Step 7:
Z[6] = 1 (common prefix "a")
Step 8:
Z[7] = 0 (no common prefix with "aabc")
Step 9:
Z[8] = 0 (no common prefix with "aabca")
Step 10:
Z[9] = 3 (common prefix "aab")
Step 11:
Z[10] = 0 (no common prefix with "aabcaa")
Step 12:
Z[11] = 0 (no common prefix with "aabcaab")
Step 13:
Z[12] = 1 (common prefix "a")
Step 14:
Z[13] = 0 (no common prefix with "aabcaabx")
Step 15:
Z[14] = 0 (no common prefix with "aabcaabxa")
Step 16:
Z[15] = 0 (no common prefix with "aabcaabxaa")
Step 17:
Z[16] = 0 (no common prefix with "aabcaabxaaz")
The final Z-array for the combined string "aab$aabcaabxaaz" is as follows:
ZArray = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0]
This Z-array helps you identify the starting positions in the text (S) where the pattern (P) matches. In this case, it indicates that the pattern "aab" matches at positions 2, 9, and 12 in the text.
#include <bits/stdc++.h>
using namespace std;
void ZAlgorithm(string S, string P)
{
int n = S.length();
int m = P.length();
string concatenatedString = P + "$" + S;
int ZArray[n + m + 1];
int left = 0;
int right = 0;
for (int i = 1; i < n + m + 1; i++)
{
if (i > right)
{
left = i;
right = i;
while (right < n + m + 1 && concatenatedString[right - left] == concatenatedString[right])
right++;
ZArray[i] = right - left;
right--;
}
else
{
int k = i - left;
if (ZArray[k] < right - i + 1)
ZArray[i] = ZArray[k];
else
{
left = i;
while (right < n + m + 1 && concatenatedString[right - left] == concatenatedString[right])
right++;
ZArray[i] = right - left;
right--;
}
}
}
for (int i = 0; i < n + m + 1; i++)
{
if (ZArray[i] == m)
cout << "Pattern found at index " << (i - m - 1) << endl;
}
}
int main()
{
string S, P;
cin >> S >> P;
ZAlgorithm(S, P);
return 0;
}Try to understand the algorithm by reading the pseudocode and then try to implement it in your favorite programming language.
Try to solve some problems on string matching algorithms on various platforms like Codeforces, Codechef, Leetcode, Hackerrank, etc.
Learn a single algorithm at a time , Don't try to learn all the algorithms at once. It will result in confusion.
Try to learn the intuition behind the algorithm. It will help you to understand the algorithm better.
The Others Algorithms will be added soon.
Contributions are welcome! ❤️ Here's how you can contribute:
Add solutions to the existing problems in other languages (Preferably C++/Java/Python).
Add new problems with solutions in any language.