Solution


Approach 1: Stack

Summarizing the given problem, we can say that we need to determine whether a tag is valid or not, by checking the following properties.

  1. The code should be wrapped in valid closed tag.

  2. The TAG_NAME should be valid.

  3. The TAG_CONTENT should be valid.

  4. The cdata should be valid.

  5. All the tags should be closed. i.e. each start-tag should have a corresponding end-tag and vice-versa and the order of the tags should be correct as well.

In order to check the validity of all these, firstly, we need to identify which parts of the given string act as which part from the above mentioned categories. To understand how it's done, we'll go through the implementation and the reasoning behind it step by step.

We iterate over the given string. Whenever a < is encountered(unless we are currently inside <![CDATA[...]]>), it indicates the beginning of either a TAG_NAME(start tag or end tag) or the beginning of cdata as per the conditions given in the problem statement.

If the character immediately following this < is an !, the characters following this < can't be a part of a valid TAG_NAME, since only upper-case letters(in case of a start tag) or / followed by upper-case letters(in the case of an end tag). Thus, the choice now narrows down to only cdata. Thus, we need to check if the current bunch of characters following <!(including it) constitute a valid cdata. For doing this, firstly we find out the first matching ]]> following the current <! to mark the ending of cdata. If no such matching ]]> exists, the string is considered as invalid. Apart from this, the <! should also be immediately followed by CDATA[ for the cdata to be valid. The characters lying inside the <![CDATA[ and ]]> do not have any constraints on them.

If the character immediately following the < encountered isn't an !, this < can only mark the beginnning of TAG_NAME. Now, since a valid start tag can't contain anything except upper-case letters, if a / is found after <, the </ pair indicates the beginning of an end tag. Now, when a < refers to the beginning of a TAG_NAME(either start-tag or end-tag), we find out the first closing > following the < to find out the substring(say ), that constitutes the TAG_NAME. This should satisfy all the criterion to constitute a valid TAG_NAME. Thus, for every such , we check if it contains all upper-case letters and also check its length(It should be between 1 to 9). If any of the criteria isn't fulfilled, doesn't constitue a valid TAG_NAME. Hence, the string turns out to be invalid as well.

Apart from checking the validity of the TAG_NAME, we also need to ensure that the tags always exist in pairs. i.e. for every start-tag, a corresponding end-tag should always exist. Further, we can note that in case of multiple TAG_NAME's, the TAG_NAME whose start-tag comes later than the other ones, should have its end-tag appearing before the end-tags of those other TAG_NAME's. i.e. the tag which starts later should end first.

From this, we get the intuition that we can make use of a to check the existence of matching start and end-tags. Thus, whenever we find out a valid start-tag, as mentioned above, we push its TAG_NAME string onto a . Now, whenever an end-tag is found, we compare its TAG_NAME with the TAG_NAME at the top the and remove this element from the . If the two don't match, this implies that either the current end-tag has no corresponding start-tag or there is a problem with the ordering of the tags. The two need to match for the tag-pair to be valid, since there can't exist an end-tag without a corresponding start-tag and vice-versa. Thus, if a match isn't found, we can conclude that the given string is invalid.

Now, after the complete string has been traversed, the should be empty if all the start-tags have their corresponding end-tags as well. If the isn't empty, this implies that some start-tag doesn't have the corresponding end-tag, violating the closed-tag's validity condition.

Further, we also need to ensure that the given is completely enclosed within closed tags. For this, we need to ensure that the first cdata found is also inside the closed tags. Thus, when we find a possibility of the presence of cdata, we proceed further only if we've already found a start tag, indicated by a non-empty stack. Further, to ensure that no data lies after the last end-tag, we need to ensure that the doesn't become empty before we reach the end of the given string, since an empty indicates that the last end-tag has been encountered.

The following animation depicts the process.

!?!../Documents/Tag_Validator_Stack.json:1000,563!?!

Complexity Analysis

  • Time complexity : . We traverse over the given string of length .

  • Space complexity : . The stack can grow upto a size of in the worst case. e.g. In case of <A><B><C><D>, =12 and number of tags = 12/3 = 4.


Approach 2: Regex

Instead of manually checking the given string for checking the validity of TAG_NAME, TAG_CONTENT and cdata, we can make use of an inbuilt java fuunctionality known as regular expressions.

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. They can be used to search, edit, or manipulate text and data. The most common quantifiers used in regular expressions are listed below. A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur.

? The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".

* The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+ The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

{n} The preceding item is matched exactly n times.

{min,} The preceding item is matched min or more times.

{min,max} The preceding item is matched at least min times, but not more than max times.

| A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".

() Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey".

[...] Matches any single character in brackets.

[^...] Matches any single character not in brackets.

Thus, by making use of regex, we can directly check the validity of the string directly(except the nesting of the inner tags) by using the regex expression below:

<([A-Z]{1,9})>([^<]*((<\/?[A-Z]{1,9}>)|(<!\[CDATA\[(.*?)]]>))?[^<]*)*<\/\1>

The image below shows the portion of the string that each part of the expression helps to match:

Regex

But, if we make use of back-referencing as mentioned above, the matching process takes a very large amount of CPU time. Thus, we use the regex only to check the validity of the TAG_CONTENT, TAG_NAME and the cdata. We check the presence of the outermost closed tags by making use of a as done in the last approach.

The rest of the process remains the same as in the last approach, except that we need not manually check the validity of TAG_CONTENT, TAG_NAME and the cdata, since it is already done by the regex expression. We only need to check the presence of inner closed tags.

Check this link for testing any regular expression on a sample text.

Complexity Analysis

  • Time complexity : Regular Expressions are/can be implemented in the form of Finite State Machines. Thus, the time complexity is dependent on the internal representation. In case of any suggestions, please comment below.

  • Space complexity : . The stack can grow upto a size of in the worst case. e.g. In case of <A><B><C><D>, =12 and number of tags = 12/3 = 4.


Analysis written by: @vinod23