Regular Expression Matching
September 1, 2011 in backtracking, string
Implement regular expression matching with support for ‘.’ and ‘*’.
‘*’ Matches zero or more of the preceding element.
The matching should cover the entire input string (not partial).
The function prototype should be:
bool isMatch(const char *s, const char *p)
Some examples:
isMatch(“aa”,”a”) → false
isMatch(“aa”,”aa”) → true
isMatch(“aaa”,”aa”) → false
isMatch(“aa”, “a*”) → true
isMatch(“aa”, “.*”) → true
isMatch(“ab”, “.*”) → true
isMatch(“aab”, “c*a*b”) → true
Online Judge
This problem is available at Online Judge. Head over there and it will judge your solution. Currently only able to compile C++/Java code. If you are using other languages, you can still verify your solution by looking at the judge’s test cases and its expected output.
Background:
It might seem deceptively easy even you know the general idea, but programming it correctly with all the details require careful thought.
Edit:
It seems that some readers are confused about why the regex pattern “.*” matches the string “ab”. “.*” means repeat the preceding element 0 or more times. Here, the “preceding” element is the dot character in the pattern, which can match any characters. Therefore, the regex pattern “.*” allows the dot to be repeated any number of times, which matches any string (even an empty string).
Hints:
Think carefully how you would do matching of ‘*’. Please note that ‘*’ in regular expression is different from wildcard matching, as we match the previous character 0 or more times. But, how many times? If you are stuck, recursion is your friend.
This problem is a tricky one. Due to the huge number of edge cases, many people would write lengthy code and have numerous bugs on their first try. Try your best getting your code correct first, then refactor mercilessly to as clean and concise as possible!

A sample diagram of a deterministic finite state automata (DFA). DFAs are useful for doing lexical analysis and pattern matching. An example is UNIX’s grep tool. Please note that this post does not attempt to descibe a solution using DFA.
Solution:
This looks just like a straight forward string matching, isn’t it? Couldn’t we just match the pattern and the input string character by character? The question is, how to match a ‘*’?
A natural way is to use a greedy approach; that is, we attempt to match the previous character as many as we can. Does this work? Let us look at some examples.
s = “abbbc”, p = “ab*c”
Assume we have matched the first ‘a’ on both s and p. When we see “b*” in p, we skip all b’s in s. Since the last ‘c’ matches on both side, they both match.
s = “ac”, p = “ab*c”
After the first ‘a’, we see that there is no b’s to skip for “b*”. We match the last ‘c’ on both side and conclude that they both match.
It seems that being greedy is good. But how about this case?
s = “abbc”, p = “ab*bbc”
When we see “b*” in p, we would have skip all b’s in s. They both should match, but we have no more b’s to match. Therefore, the greedy approach fails in the above case.
One might be tempted to think of a quick workaround. How about counting the number of consecutive b’s in s? If it is smaller or equal to the number of consecutive b’s after “b*” in p, we conclude they both match and continue from there. For the opposite, we conclude there is not a match.
This seem to solve the above problem, but how about this case:
s = “abcbcd”, p = “a.*c.*d”
Here, “.*” in p means repeat ‘.’ 0 or more times. Since ‘.’ can match any character, it is not clear how many times ‘.’ should be repeated. Should the ‘c’ in p matches the first or second ‘c’ in s? Unfortunately, there is no way to tell without using some kind of exhaustive search.
We need some kind of backtracking mechanism such that when a matching fails, we return to the last successful matching state and attempt to match more characters in s with ‘*’. This approach leads naturally to recursion.
The recursion mainly breaks down elegantly to the following two cases:
- If the next character of p is NOT ‘*’, then it must match the current character of s. Continue pattern matching with the next character of both s and p.
- If the next character of p is ‘*’, then we do a brute force exhaustive matching of 0, 1, or more repeats of current character of p… Until we could not match any more characters.
You would need to consider the base case carefully too. That would be left as an exercise to the reader. ![]()
Below is the extremely concise code (Excluding comments and asserts, it’s about 10 lines of code).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | bool isMatch(const char *s, const char *p) { assert(s && p); if (*p == '\0') return *s == '\0'; // next char is not '*': must match current character if (*(p+1) != '*') { assert(*p != '*'); return ((*p == *s) || (*p == '.' && *s != '\0')) && isMatch(s+1, p+1); } // next char is '*' while ((*p == *s) || (*p == '.' && *s != '\0')) { if (isMatch(s, p+2)) return true; s++; } return isMatch(s, p+2); } |
Further Thoughts:
Some extra exercises to this problem:
- If you think carefully, you can exploit some cases that the above code runs in exponential complexity. Could you think of some examples? How would you make the above code more efficient?
- Try to implement partial matching instead of full matching. In addition, add ‘^’ and ‘$’ to the rule. ‘^’ matches the starting position within the string, while ‘$’ matches the ending position of the string.
- Try to implement wildcard matching where ‘*’ means any sequence of zero or more characters.
For the interested reader, real world regular expression matching (such as the grep tool) are usually implemented by applying formal language theory. To understand more about it, you may read this article.
Regular Expression Matching,
guoguo said on September 1, 2011
why isMatch(“aab”, “c*a*b”) -> true?
obviously, the second string contains one extra ‘c’ compared to the first string. Am i correct?
Thanks, great website:)
1337c0d3r said on September 1, 2011
isMatch(“aab”, “c*a*b”) returns true because c* means 0 or more ‘c’ to match. Since it doesn’t start with a ‘c’, so c* matches 0 ‘c’. Similarly, a* matches the two ‘a’s, and b matches the last ‘b’.
guoguo said on September 2, 2011
ic,lol
nathan said on October 14, 2011
That’s obviously wrong!
Andy said on January 16, 2012
I think “c*” should be clarified in the question. People may think ‘* ‘in “c*” means it can be 0 or more previous char, which is ‘c’. But ‘c’ in “c*” already exists. So “c*” means at least one ‘c’.
I guess here ‘*’ can cancel the previous char, and thus “c*” means 0 or more ‘c’. But, I think this is not common sense.
test said on September 3, 2011
Could you please tell me what’s the running time of this algorithm? O(n^2)?
Li Jin said on September 5, 2011
In case s=”a…..ab” p=”a*a*….a*”, it takes exponential time and O(n) space. Using memorizing, I think it takes O(n^3) time and O(n^2) space.
Martin said on December 31, 2011
To improve the exponential time, is it feasible to preprocess the pattern string? if p=”a*a*….a*”, p1=”a*”, are p and p1 equivalent?
Subramanian Ganapathy said on October 14, 2012
KMP gives a running time of O(n* number of * characters)
Kun Li said on September 20, 2011
The +1 button on my side has some problem
ds said on October 4, 2011
Could you tell why isMatch(“ab”, “.*”) → true? .can match “a”, so “.*” is a or aa or aa… How could it be “ab”?
rockstar said on October 10, 2011
how could isMatch(“ab”,”.*”) -> true ?? . matches a then preceding * can only match more a’s , how did this b came then ….!!
Semaphore said on October 17, 2011
It means 0 or multiple single character, which definitely match “ab”.
ds said on October 22, 2011
It should not be true. * means 0 or multiple “preceding” character. In this case it is ‘a’. So how did ‘b’ come from?
1337c0d3r said on October 27, 2011
@rockstar and @ds:
I mentioned this in my post, “.*” means repeat the preceding element of the pattern itself, which is the single dot character. Since dot can match virtually any character, the pattern “.*’ will match the string “ab”.
If you do not believe me, I’ll quote this from here, http://www.regular-expressions.info/dot.html :
“Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so “.*” seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero.”
lawnblower said on November 7, 2011
dumb test case isMatch(“aa”, “*a”) will trigger assert, which should not.
Michael said on May 16, 2013
from my understanding, the ‘*’ must have a preceding element, so your use case maybe wrong regarding to this scenario.
lawnblower said on November 7, 2011
in other word, a wrapper function is needed.
Ashok Koyi said on November 10, 2011
Please find the java version of the solution [Naive method]
flo said on November 10, 2011
Thanks for the tips and article
Tanin said on December 9, 2011
I come up with a dynamic programming solution. It takes O(n^2) in time and O(n) in space.
Could you please check if it is correct.
Here is the solution:
d[i,j] is true iff a[0..i] matches r[0..j]
d[i,j] is true if one of these cases is true:
(1) a[i] == r[j] and d[i-1,j-1] is true (match a character with a character)
(2) r[j] == ‘.’ and d[i-1,j-1] is true
(3) r[j] is * and it can match a[i] and d[i-1,j-1] is true (* consumes character as its first one)
(4) r[j] is * and it can match a[i] and d[i-1,j] is true (* consumes 1 more character)
(5) r[j] is * and it can match a[i] and d[i,j-1] is true (* matches an empty string)
The solution is called “Elastic Matching” used in Handwriting recognition.
Please see the full solution here: http://tanin.nanakorn.com/blog/!/id25
Tanin said on December 9, 2011
Detail is a little off and wrong. Sorry about that. But I guess you can get the concept of how it works
Gaurav said on December 17, 2011
Could anyone give me a O(N) solution ?
Random Guy said on December 27, 2011
No.
SK said on June 29, 2012
NO
Zhen said on January 23, 2012
Nice code!
I have a minor question.
IsMatch(“a”, “ab*a”) should be “ture”, but this code returns “false”.
Please correct me if I miss some point
1337c0d3r said on January 23, 2012
IsMatch(“a”, “ab*a”) should return “false”.
This is because the pattern “ab*a” means the string begin with an a, following by 0 or more b’s, then an another a at the end.
This imply that it must minimally contain two characters which has two a’s, ie. “aa”.
Zhen said on January 23, 2012
Oh, I miss the whole match condition.
If the regular expression string should be equal to the entire input string, then IsMatch(“a”, “ab*a”)–>false
1337c0d3r said on January 23, 2012
Yup, that’s correct. It’s mentioned in the problem statement, Just bolded the word “entire” to make the statement clearer.
zhong zhang said on February 9, 2012
The code is amazing.
I just don’t understand why we need:
assert(*p != ‘*’), which is after if (*(p+1) != ‘*’)
Thanks,
1337c0d3r said on February 10, 2012
This is to assume that multiple *’s are not in the input pattern and before a * there must be a letter to match, or else the code above will not be correct. In the case where multiple *’s can be in the input pattern, you can just use a loop to skip those, as those are redundant and won’t affect the correctness.
Nova2358 said on February 11, 2012
There is a bug.
char* exp = “b*bb”, char* text = “bb”
This should be a “Not match”, but your code will answer “match”.
You should change your last code to
1337c0d3r said on February 11, 2012
Thanks for your comment.
exp=”b*bb” and text=”bb” should be a match. This is because “b*” could match 0 or more b’s. In this case it would be 0 b.
Jay said on February 23, 2012
I think my code is same with answer but it shows the backtracking aspect of problem more.
int match(char *s, char *p)
{
int char_matched = 0;
if (*s == ”)
return (*p == ”);
if (*s == *p || *p == ‘.’)
char_matched = 1;
if (*(p+1) == ‘*’)
return char_matched? (match(s+1, p) || match(s+1, p+2)) : match(s+1, p+2);
else
return char_matched? match(s+1, p+1): 0;
}
Jay said on February 23, 2012
I just saw “bb”, “b*bb” case above. Yeah.. this should match. Added the case.
int match(char *s, char *p)
{
int char_matched = 0;
if (*s == ”)
return (*p == ”);
if (*s == *p || *p == ‘.’)
char_matched = 1;
if (*(p+1) == ‘*’)
return char_matched? (match(s+1, p) || match(s+1, p+2) || match(s, p+2)) : match(s+1, p+2);
else
return char_matched? match(s+1, p+1): 0;
}
Alex said on June 26, 2012
This code is buggy. Here are some examples where it would fail:
– it would return false on empty string for regex “.*” while it should have returned true.
– it would fail in the case such as text=”ab”and regex=”ab.*”.
ted cheng said on April 5, 2012
the code doesn’t handle when the length of p is 1
Hello said on April 24, 2012
// next char is ‘*’
while ((*p == *s) || (*p == ‘.’ && *s != ”))
Hello, in this piece of code, *p == ‘.’ would never be true since there is no any code inside while loop operates p.
flynewdream said on May 15, 2012
Cannot understand the following code:
// next char is ‘*’
while ((*p == *s) || (*p == ‘.’ && *s != ”)) {
if (isMatch(s, p+2)) return true;
s++;
}
why use “if (isMatch(s, p+2)) return true;”? This will stop comparing s to the characters before ‘*’ in p. Is this correct?
Ashish said on June 10, 2012
This code is breaking for abdabcabc & .*abc
Zhenjie Chen said on July 6, 2012
Python? DP can be applied (not in the following code).
Run:
./reg.py “abcbcd” “a.*c.*d”
Guyue Liu said on August 30, 2012
I am not sure about the time complexity of solution.
Why do not solve this problem via DP? it would be O(n ^ 2) ~
Guyue Liu said on August 30, 2012
here’s my code:
class Solution {
public:
bool canMatch(char a, char b)
{
return (a == b || b == ‘.’);
}
bool isMatch(const char *s, const char *p) {
// Start typing your C/C++ solution below
// DO NOT write int main() function
int lS = strlen(s);
int lP = strlen(p);
vector<vector > F(lS + 1, vector(lP + 1));
F[0][0] = true;
for (int i = 0; i <= lS; i++)
{
for (int j = 1; j 0)
if (F[i - 1][j - 1] && canMatch(s[i - 1], p[j - 1]))
{
F[i][j] = true;
continue;
}
if (i > 0 && j > 1)
if (F[i - 1][j] && canMatch(s[i - 1], p[j - 2]) && p[j - 1] == ‘*’)
{
F[i][j] = true;
continue;
}
if (j > 1)
if (F[i][j - 2] && p[j - 1] == ‘*’)
{
F[i][j] = true;
continue;
}
}
}
return F[lS][lP];
}
};
it can pass all test cases.
zhong zhang said on September 15, 2012
mark
zhong zhang said on September 17, 2012
Further thoughts:
Problem 1:
I came up one case that takes a long time to get the result while the length of string s and p are not huge.
s[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
p[] = "a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*b"
zhong zhang said on September 17, 2012
I don’t know is this the case of exponentially complexity or not. Can anyone help me check it? Thanks.
jgan said on October 6, 2012
here, definition of ‘*’ is confusing, a*b could equal to following strings?
b–>cancel a
ab–>0 a
aab–>1 a
aaab–>2 a
aaaab–>3 a
…
Subramanian Ganapathy said on October 14, 2012
Knuth morris pratt modified, * is greedy. All we have to do is build patterns of all characters upto *
for .e.g ab*bc*cd, we build 3 prefix that are also suffixes pattern for ab , bc and cd respectively.
for the actual string, we first start with ab using its prefix array try to match the string, once a match occurs we recursively try to match from that matched position for bc and finally for cd. we backtrack when we complete the entire string for any recursion level, seems like O(n*number of stars) to me. I am in the process of writing code for it.
Subramanian Ganapathy said on October 14, 2012
My previous solution was for wildcard match but it lends intuitively into regex match also. Cancel kmp prefix generation for a character whose next char is a *, then the same algo holds but this time we need to move to the next matching block once the current * char is matched and backtrack here only when the next matching block fails.
shady said on December 27, 2012
nice post, but has problem as isMatch(“aa”, “*a”) triggers assert which leads to termination of program.
this line is incorrect in the code, as it can lead to segmentation fault… how can we directly access p+2 element, we know for sure that p is not ”, but p+1 element can be ” , therefore leading to p+2 to be undefined.
Nathan said on January 23, 2013
I think there is a bug for your code.
For the case where p = “a” and s = “b*a”, your code will return FALSE. However, I believe it should return TRUE.
What do you think? Did I miss anything?
Yi said on February 28, 2013
p is the regular expression.
coding said on January 24, 2013
The following is the DP solution which passes all the online cases:
class Solution {
public:
bool isMatch(const char *s, const char *p) {
vector row_char, column_char;
vector column_st;
while(*s != ”)
row_char.push_back(*(s++));
while(*p != ”)
{
if(*p != ‘*’)
{
column_char.push_back(*p);
if(*(p + 1) == ‘*’)
column_st.push_back(true);
else
column_st.push_back(false);
}
p++;
}
int row_sz = row_char.size(), column_sz = column_char.size();
if(row_sz == 0 && column_sz == 0)
return true;
if(row_sz == 0)
{
for(int i = 0; i < column_sz; i++)
if(column_st[i] == false)
return false;
return true;
}
if(column_sz == 0)
return false;
bool state[2][column_sz];
int flag = false;
int i = 0;
for(; i < column_sz; i++)
{
if(column_char[i] == '.' || column_char[i] == row_char[0])
flag = true;
if(column_st[i] == true)
state[0][i] = true;
else
break;
}
if(i == column_sz && flag == false)
return false;
if(i < column_sz)
{
if(column_char[i] == '.' || column_char[i] == row_char[0])
state[0][i] = true;
else
state[0][i] = false;
i++;
}
for(; i < column_sz; i++)
{
if(column_st[i] == true && state[0][i - 1] == true)
state[0][i] = true;
else
state[0][i] = false;
}
int cur = 1, pre = 0;
for(i = 1; i < row_sz; i++)
{
if(column_st[0] == true && column_char[0] == '.')
state[cur][0] = true;
else if(column_st[0] == true && state[pre][0] == true && row_char[i - 1] == column_char[0] && row_char[i] == column_char[0])
state[cur][0] = true;
else
state[cur][0] = false;
for(int j = 1; j < column_sz; j++)
{
if(column_st[j] == true && state[cur][j - 1] == true)
state[cur][j] = true;
else if(column_char[j] == '.' || row_char[i] == column_char[j])
{
if(state[pre][j - 1] == true)
state[cur][j] = true;
else if(column_st[j] == true && state[pre][j] == true)
state[cur][j] = true;
else
state[cur][j] = false;
}
else
state[cur][j] = false;
}
swap(cur, pre);
}
return state[pre][column_sz - 1];
}
};
Jens said on January 26, 2013
Here is the ultimate in conciseness (short of using a regex library): 6 lines of Prolog!
Note that 42 is the ASCII for ‘*’ and 46 the ASCII for ‘.’. A sample query is this:
theactual said on January 27, 2013
Annie said on April 30, 2013
isMatch(s, p+2) maybe out of boundary, in which case will store unexpected characters. Better check before go into.
*(p+1) != ” && isMatch(s, p+2)
Annie said on April 30, 2013
shubham gupta said on May 2, 2013
your code is giving not match for isMatch(“.”,”a”) but it should be a match.