Monday, October 20, 2008

RXRDG V - Ins and outs

If you haven't done already I suggest you at least skim over the previous RXRDG installments to get an idea of what we have done so far. Our next addition to the project will be character sets.

Character sets are defined by placing regular expression tokens within brackets (ie. [abc])- so we'll need bracket tokens to our lexer. The brackets act in a way similar to parenthesis (which we have already implemented) so we'll copy that approach. We'll add a set state and new cases for transitioning from literal to set state. There are a few things to keep in mind though. The rules for meta-characters within the character sets are quite complex. Eg.

  • The character "^" (caret) is a literal but encountered at the beginning of a set it is a negation of the set.
  • The character "-" denotes a range between the previous and the next literal but encountered at the beginning or the end of a set it is a literal.
  • The character "]" marks the end of a set a range between the previous and the next literal but encountered at the beginning of a set or following a "^" at the begining of a set it is a literal.
  • ...

The mind boggles! What we need to keep doing is making small steps. Our regexp parser might not be perfect yet but we're making baby steps towards a better one with each addition. We'll worry about all the special cases later. There's one quick shortcut we can make here. Looking at the list it is clear that tokens are treated differently at the beginning of a set state. When we don't match any special tokens at the beginning we transition into "normal" set state. We can address this issue by simply having two set states with appropriate transitions.

The character sets are defined by containing literals or ranges (ie. a-z means characters from a to z). Therefore we'll also need to define a range token. As mentioned the sets can be negated so we'll also define a not token. As character sets are quite useful it is no surprise that some common sets have shorthands. Most notably the any set (the dot) matches any character.

There are quite a few new tokens and states we need to implement so it's best to start right away. We'll need to add:

  • Additional lexer token types and tokens (BracketLeft, BracketRight, Range, Not and Any (not shown)
  • Additional methods for the token builder (not shown)
  • Additional cases for the literal state
  • Begin set and set state

case '[':
context.ToState(new BeginSetState());
return TokenBuilder.BuildBracketLeftToken();
case '.':
return TokenBuilder.BuildAnyToken();

This takes care of the first transition. Next we'll define the two set states. Begin set state is always the first one and it can transition into set state. Ending the set state means transitioning through the begin set state so a double call to end state will be needed.


internal class BeginSetState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case '^':
return TokenBuilder.BuildNotToken();
default:
context.ToState(new SetState());
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

internal class SetState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case ']':
context.EndState();
context.EndState();
return TokenBuilder.BuildBracketRightToken();
case '-':
return TokenBuilder.BuildRangeToken();
default:
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

Now that our lexer is working we need to focus on the parser. For the set token we simply need to follow the parenthesis pattern and define a bracket node as an unary operator setting its precedence quite high (10). Looking at the range token it becomes clear that it simply connects two literal tokens. Eg. a-z becomes:


<range>
<literal character='a'>
<literal character='z'>
</range>

So we'll define range as a binary operator (setting the precedence to 130). The not token is a simple unary operator with a high precedence (11). Defining the nodes is an exercise left to the reader. That leaves us with the any token. Any token matches any character - ranging from the " " to the "~". So the dot is simply a predefined character set containing a range between space and a tilde. This gives us a chance to follow the builder pattern and define a node builder to complement the token builder class.


public static class NodeBuilder
{
public static LiteralNode BuildLiteralNode(LiteralToken literalToken)
{
return new LiteralNode(literalToken);
}

public static RangeNode BuildRangeNode(LiteralToken from, LiteralToken to)
{
RangeNode rangeNode = new RangeNode(TokenBuilder.BuildRangeToken());
rangeNode.ChildNodes.Add(BuildLiteralNode(from));
rangeNode.ChildNodes.Add(BuildLiteralNode(to));
return rangeNode;
}

public static RangeNode BuildAnyNode()
{
return BuildRangeNode(TokenBuilder.BuildLiteralToken((char)32), TokenBuilder.BuildLiteralToken((char)126));
}
}

And we're ready to add additional cases to the parse state.


case TokenType.Range:
RangeToken range = (RangeToken)token;
INode rangeNode = new RangeNode(range);
AddOperator(rangeNode);
break;
case TokenType.BracketLeft:
context.ToState(new ParseState());
break;
case TokenType.BracketRight:
BracketRightToken set = (BracketRightToken)token;
INode setNode = new BracketNode(set);
AddOperator(setNode);
context.EndState();
break;
case TokenType.Any:
RangeNode anyNode = NodeBuilder.BuildAnyNode();
AddOperand(anyNode);
break;
case TokenType.Not:
NotToken not = (NotToken)token;
INode notNode = new NotNode(not);
AddOperator(notNode);
break;

We'll need to update our visitor interface and updating xml visitor should be a breeze. The data generator however poses some problems. Character sets need to generate random data based on containing literals and ranges. A naive implementation would follow the pattern of visiting a random character set member. There are however two problems with that approach:

  1. The random pattern would not be fair. A large range node would be equaly visited as a neighboring literal.
  2. It would be impossible to implement character set negation.

The easiest way to deal with the problem is to visit the set child nodes and gather enough information to generate a simple set containing only literals. We'll create a literal node collection class to keep literals unique. As a side note - KeyedCollection generic is one of my favorite classes. It provides a solution for types that containt the id within themselves. All we need to do is to implement the GetKeyForItem method to specify how the key is retrieved from each item.


private class LiteralNodeCollection : System.Collections.ObjectModel.KeyedCollection
{
protected override char GetKeyForItem(LiteralNode item)
{
return item.Token.Character;
}
}

While visiting the bracket node we might encounter various types of nodes and deal with them immediately.

  • If we encounter the not node it means we need to negate the set. We'll leave the implementation to the not visiting method.
  • We must skip any concatenation nodes.
  • We must add literal nodes to the literal nodes collection
  • If we encounter a range node we must add every literal nodes within the range to the literal nodes collection

public void Visit(BracketNode node)
{
if (node.ChildNodes[0].Token.TokenType == TokenType.Not)
{
node.ChildNodes[0].Accept(this);
return;
}

INode current = node;
while (current.ChildNodes[0].Token.TokenType == TokenType.Concatenation)
{
current = node.ChildNodes[0];
}

LiteralNodeCollection nodes = new LiteralNodeCollection();
foreach (INode childNode in current.ChildNodes)
{
switch (childNode.Token.TokenType)
{
case TokenType.Literal:
LiteralNode literalChildNode = (LiteralNode)childNode;
if (nodes.Contains(literalChildNode) == false)
{
nodes.Add(literalChildNode);
}
break;
case TokenType.Range:
int min = (int)((LiteralNode)childNode.ChildNodes[0]).Token.Character;
int max = (int)((LiteralNode)childNode.ChildNodes[1]).Token.Character;
for (int i = min; i < max; i++)
{
char c = (char)i;
if (nodes.Contains(c) == false)
{
nodes.Add(NodeBuilder.BuildLiteralNode(TokenBuilder.BuildLiteralToken(c)));
}
}
break;
}
}
int index = RandomNumberProvider.GetRandomNumber(0, nodes.Count);
nodes[index].Accept(this);
}

Visiting the not node is just the opposite. First we add all nodes to the collection and then remove the ones we find. We must add range node visiting method for completeness and we're done. We've got ourselves a quite functional regular expression parser and a random data generator to match.

As promised the source code for the regular expression random data generator can be found at Google code. The series on RXRDG is far from over and the code will be updated accordingly.


Related posts:

No comments: