Monday, October 20, 2008

RXRDG V - Ins and outs

If you haven't done already I suggest you at least skim over the previous RXRDG installments to get an idea of what we have done so far. Our next addition to the project will be character sets.

Character sets are defined by placing regular expression tokens within brackets (ie. [abc])- so we'll need bracket tokens to our lexer. The brackets act in a way similar to parenthesis (which we have already implemented) so we'll copy that approach. We'll add a set state and new cases for transitioning from literal to set state. There are a few things to keep in mind though. The rules for meta-characters within the character sets are quite complex. Eg.

  • The character "^" (caret) is a literal but encountered at the beginning of a set it is a negation of the set.
  • The character "-" denotes a range between the previous and the next literal but encountered at the beginning or the end of a set it is a literal.
  • The character "]" marks the end of a set a range between the previous and the next literal but encountered at the beginning of a set or following a "^" at the begining of a set it is a literal.
  • ...

The mind boggles! What we need to keep doing is making small steps. Our regexp parser might not be perfect yet but we're making baby steps towards a better one with each addition. We'll worry about all the special cases later. There's one quick shortcut we can make here. Looking at the list it is clear that tokens are treated differently at the beginning of a set state. When we don't match any special tokens at the beginning we transition into "normal" set state. We can address this issue by simply having two set states with appropriate transitions.

The character sets are defined by containing literals or ranges (ie. a-z means characters from a to z). Therefore we'll also need to define a range token. As mentioned the sets can be negated so we'll also define a not token. As character sets are quite useful it is no surprise that some common sets have shorthands. Most notably the any set (the dot) matches any character.

There are quite a few new tokens and states we need to implement so it's best to start right away. We'll need to add:

  • Additional lexer token types and tokens (BracketLeft, BracketRight, Range, Not and Any (not shown)
  • Additional methods for the token builder (not shown)
  • Additional cases for the literal state
  • Begin set and set state

case '[':
context.ToState(new BeginSetState());
return TokenBuilder.BuildBracketLeftToken();
case '.':
return TokenBuilder.BuildAnyToken();

This takes care of the first transition. Next we'll define the two set states. Begin set state is always the first one and it can transition into set state. Ending the set state means transitioning through the begin set state so a double call to end state will be needed.


internal class BeginSetState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case '^':
return TokenBuilder.BuildNotToken();
default:
context.ToState(new SetState());
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

internal class SetState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case ']':
context.EndState();
context.EndState();
return TokenBuilder.BuildBracketRightToken();
case '-':
return TokenBuilder.BuildRangeToken();
default:
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

Now that our lexer is working we need to focus on the parser. For the set token we simply need to follow the parenthesis pattern and define a bracket node as an unary operator setting its precedence quite high (10). Looking at the range token it becomes clear that it simply connects two literal tokens. Eg. a-z becomes:


<range>
<literal character='a'>
<literal character='z'>
</range>

So we'll define range as a binary operator (setting the precedence to 130). The not token is a simple unary operator with a high precedence (11). Defining the nodes is an exercise left to the reader. That leaves us with the any token. Any token matches any character - ranging from the " " to the "~". So the dot is simply a predefined character set containing a range between space and a tilde. This gives us a chance to follow the builder pattern and define a node builder to complement the token builder class.


public static class NodeBuilder
{
public static LiteralNode BuildLiteralNode(LiteralToken literalToken)
{
return new LiteralNode(literalToken);
}

public static RangeNode BuildRangeNode(LiteralToken from, LiteralToken to)
{
RangeNode rangeNode = new RangeNode(TokenBuilder.BuildRangeToken());
rangeNode.ChildNodes.Add(BuildLiteralNode(from));
rangeNode.ChildNodes.Add(BuildLiteralNode(to));
return rangeNode;
}

public static RangeNode BuildAnyNode()
{
return BuildRangeNode(TokenBuilder.BuildLiteralToken((char)32), TokenBuilder.BuildLiteralToken((char)126));
}
}

And we're ready to add additional cases to the parse state.


case TokenType.Range:
RangeToken range = (RangeToken)token;
INode rangeNode = new RangeNode(range);
AddOperator(rangeNode);
break;
case TokenType.BracketLeft:
context.ToState(new ParseState());
break;
case TokenType.BracketRight:
BracketRightToken set = (BracketRightToken)token;
INode setNode = new BracketNode(set);
AddOperator(setNode);
context.EndState();
break;
case TokenType.Any:
RangeNode anyNode = NodeBuilder.BuildAnyNode();
AddOperand(anyNode);
break;
case TokenType.Not:
NotToken not = (NotToken)token;
INode notNode = new NotNode(not);
AddOperator(notNode);
break;

We'll need to update our visitor interface and updating xml visitor should be a breeze. The data generator however poses some problems. Character sets need to generate random data based on containing literals and ranges. A naive implementation would follow the pattern of visiting a random character set member. There are however two problems with that approach:

  1. The random pattern would not be fair. A large range node would be equaly visited as a neighboring literal.
  2. It would be impossible to implement character set negation.

The easiest way to deal with the problem is to visit the set child nodes and gather enough information to generate a simple set containing only literals. We'll create a literal node collection class to keep literals unique. As a side note - KeyedCollection generic is one of my favorite classes. It provides a solution for types that containt the id within themselves. All we need to do is to implement the GetKeyForItem method to specify how the key is retrieved from each item.


private class LiteralNodeCollection : System.Collections.ObjectModel.KeyedCollection
{
protected override char GetKeyForItem(LiteralNode item)
{
return item.Token.Character;
}
}

While visiting the bracket node we might encounter various types of nodes and deal with them immediately.

  • If we encounter the not node it means we need to negate the set. We'll leave the implementation to the not visiting method.
  • We must skip any concatenation nodes.
  • We must add literal nodes to the literal nodes collection
  • If we encounter a range node we must add every literal nodes within the range to the literal nodes collection

public void Visit(BracketNode node)
{
if (node.ChildNodes[0].Token.TokenType == TokenType.Not)
{
node.ChildNodes[0].Accept(this);
return;
}

INode current = node;
while (current.ChildNodes[0].Token.TokenType == TokenType.Concatenation)
{
current = node.ChildNodes[0];
}

LiteralNodeCollection nodes = new LiteralNodeCollection();
foreach (INode childNode in current.ChildNodes)
{
switch (childNode.Token.TokenType)
{
case TokenType.Literal:
LiteralNode literalChildNode = (LiteralNode)childNode;
if (nodes.Contains(literalChildNode) == false)
{
nodes.Add(literalChildNode);
}
break;
case TokenType.Range:
int min = (int)((LiteralNode)childNode.ChildNodes[0]).Token.Character;
int max = (int)((LiteralNode)childNode.ChildNodes[1]).Token.Character;
for (int i = min; i < max; i++)
{
char c = (char)i;
if (nodes.Contains(c) == false)
{
nodes.Add(NodeBuilder.BuildLiteralNode(TokenBuilder.BuildLiteralToken(c)));
}
}
break;
}
}
int index = RandomNumberProvider.GetRandomNumber(0, nodes.Count);
nodes[index].Accept(this);
}

Visiting the not node is just the opposite. First we add all nodes to the collection and then remove the ones we find. We must add range node visiting method for completeness and we're done. We've got ourselves a quite functional regular expression parser and a random data generator to match.

As promised the source code for the regular expression random data generator can be found at Google code. The series on RXRDG is far from over and the code will be updated accordingly.


Related posts:

Sunday, October 12, 2008

RXRDG Part IV - Basic extensions

To recap what we have learned in parts I, II and III. We've built a regular expression lexer using iterator pattern to loop through a sequence of characters and turn them into a sequence of regular expression tokens. We've used a state pattern to simplify the lexer code that enables us to deal with each context individually. We've extracted token building methods to a TokenBuilder in order to make the code more expressive. We've built a regular expression token parser on top of the lexer output stream of tokens. The parser itself used a state pattern to simplify expression grouping. The parser state implemented a Shunting Yard Algorithm to create an Abstract Sytax Tree. We've implemented a visitor pattern in the tree nodes and built a visitor that outputted a nicely indented xml representation of the AST. Aside from learning a few tricks what is the payoff for making these design decisions?

Narrow scope of responsibility

Our parser lacks the ability to correctly parse escaped characters. Some characters are reserved for use in the regular expressions. Yet sometimes you wish to look for exactly those characters. Eg. we wish to find "{" within a text. To match these special characters we need to prefix them with a "\". So how hard is it to extend our parser to implement this?

It's quite simple actually. When we find "\" in the stream of character we know we will treat the next character differently. In other words - we will change state! So we add another case to the LiteralState (ironically '\' must be prefixed by itself in the matching condition).


case '\\':
context.ToState(new EscapeState());

We also need to define the escape state.


internal class EscapeState : IState
{
public IToken Handle(IContext context)
{
context.EndState();
switch (context.Current)
{
default:
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

Done! The narrow scope of current context proves very beneficial. There's no need to think about the fact "*" indicates that previous operand should be repeated zero or more times in the escape state. So there are no ifs and checks to cover all the angles.

Separation of concerns

It took us a great deal of effort to implement the visitor pattern to retrieve value from nodes. We could have just added an overload to the GetString method and be done. But the original idea for this series was to generate random data and not a pretty printed AST. So how about we do that?

To generate our random output we'll just define another visitor. Our first order of concern is to define a random source.


public static class RandomNumberProvider
{
static readonly object _padlock = new object();

private static RandomNumberGenerator _rnd;
private static RandomNumberGenerator Rnd
{
get
{
lock (_padlock)
{
if (_rnd == null)
{
_rnd = RandomNumberGenerator.Create();
}
return _rnd;
}
}
}

public static int GetRandomNumber(int min, int max)
{
byte[] randbyte = new byte[1];
Rnd.GetNonZeroBytes(randbyte);
Random rand = new Random(Convert.ToInt32(randbyte[0]));

return rand.Next(min, max);
}
}

As we're not building an online gambling system, this should be sufficiently random. We'll copy most of the functionality of the XmlVisitor


public class GeneratorVisitor : IVisitor
{
private StringBuilder _builder = new StringBuilder();
private const int DefaultMaxOccurs = 11;

private GeneratorVisitor()
{ }

public static string Visit(INode node)
{
GeneratorVisitor visitor = new GeneratorVisitor();
node.Accept(visitor);
return visitor._builder.ToString();
}
}

And finaly we'll add the required Visit methods. The beauty of AST is that it gives us another view of separation of concerns:

  • concatenation and paranthesis nodes will "visit" their child nodes
  • literal node has no child nodes and will simply output its character.
  • repetition node will "visit" it's child nodes - repetedly

public void Visit(LiteralNode node)
{
_builder.Append(node.Token.Character.ToString());
}

public void Visit(RepetitionNode node)
{
int index = RandomNumberProvider.GetRandomNumber(node.Token.MinOccurs, node.Token.MaxOccurs > -1 ? node.Token.MaxOccurs + 1 : DefaultMaxOccurs);
for (int i = 0; i < index; i++)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}
}

public void Visit(ConcatenationNode node)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}

public void Visit(ParenthesisNode node)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}
}

And we have our first data generator that uses regular expressions to generate random strings conforming to those patterns!

Still we're missing some important regular expression features such as alternation and character sets. We'll add alternation and leave the character sets for the next installment.

Adding new tokens is the most difficult task in our project as new tokens propagate throughout the system (tokens, tokenbuilder methods, nodes, visitors...) First we need to update the lexer part.

  1. Add alternation to token type enum.
  2. Add alternation token
  3. Add build alternation token method to token builder
  4. Add a case to literal state switch (case '|') - not shown

public enum TokenType
{
Alternation
...
}

public class AlternationToken : IToken
{
public TokenType TokenType { get { return TokenType.Alternation; } }
}

public static AlternationToken BuildAlternationToken()
{
return new AlternationToken();
}

Next we update the parser part.

  1. Add the alternation node (alternation is the lowest precedence operator barring paranthesis)
  2. Add a case to parser state switch (remember - alternation is a binary operator)

public class AlternationNode : NodeBase
{
public AlternationNode(AlternationToken token)
: base(token) { }

public override NodeType NodeType
{
get { return NodeType.BinaryOperator; }
}

public override int Precedence
{
get { return 30; }
}

public override void Accept(IVisitor visitor)
{
visitor.Visit(this);
}
}

public void Handle(Parser context)
{
IToken token = context.Current;
switch (token.TokenType)
{
case TokenType.Alternation:
AlternationToken alternation = (AlternationToken)token;
INode alternationNode = new AlternationNode(alternation);
AddOperator(alternationNode);
break;
...
}
}

Which reminds us to update the visitors.


public interface IVisitor
{
void Visit(AlternationNode node);
...
}

public void Visit(AlternationNode node)
{
string tab = GetTab(_level);
_builder.AppendFormat("{0}<{1}>{2}", tab, "alternation", LR);
_level++;
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
_level--;
_builder.AppendFormat("{0}{2}", tab, "alternation", LR);
}

public void Visit(AlternationNode node)
{
int index = RandomNumberProvider.GetRandomNumber(0,2);
node.ChildNodes[index].Accept(this);
}

A nice expression to test our new feature is "(ab|cd)|ef" which should evaluate to:


<alternation>
<parenthesis>
<alternation>
<sequence>
<literal>a</literal>
<literal>>b</literal>
</sequence>
<sequence>
<literal>c</literal>
<literal>d</literal>
</sequence>
</alternation>
</parenthesis>
<sequence>
<literal>e</literal>
<literal>f</literal>
</sequence>
</alternation>

Thanks to our modular design adding new functionality to the parser is easy.


Related posts:

Saturday, October 4, 2008

RXRDG Part III - (Re)visiting the tree

This is the third part in the Regular Expression Random Data Generator series. This installment builds on and some knowledge of the previous two is required.

It's time to output our syntax tree. The main objective here will be too traverse the tree and output the type and other information on the nodes as we go.


static void Main(string[] args)
{
string expression = "ab(cd)*";

Parser parser = new Parser();
INode node = parser.Parse(expression);

Console.WriteLine(Traverse(node));
}

static string Traverse(INode node)
{
string result = string.Empty;
result += string.Format("<{0}>{1}", node.Token.TokenType, Environment.NewLine);
foreach (INode childNode in node.ChildNodes)
{
result += Traverse(childNode);
}
result += string.Format("{1}", node.Token.TokenType, Environment.NewLine);

return result;
}

This certainly does the job but there are a few problems with this approach:

  • It isn't very efficient - it uses recursion and string concatenation
  • It isn't very extensible - what happens when we change the type of output we want?
  • It isn't very helpful - to get better information out of a node we'll need to add a switch statement and test for the type of INode

A more interesting way of getting the output is using a visitor design pattern. The basic idea behind the visitor pattern is a mechanism called double dispatch. Lets define an interface for the visitor class.


public interface IVisitor
{
void Visit(LiteralNode node);
void Visit(RepetitionNode node);
void Visit(ConcatenationNode node);
void Visit(ParenthesisNode node);
}

The visitor will contain all the necessary logic to create the output we desire. Each Visit method takes a type of node so extracting information from a node within a method will be easy. The tricky part is to call the correct method with the correct parameter. Unfortunately C# does not support multimethods so there are only two ways of doing that. One is to write a switch statement to identify a type of node and call the appropriate method. The other is to let the nodes do it. First we define the caller interface.


public interface IVisitable
{
void Accept(IVisitor visitor);
}

Next we change the INode interface to inherit the IVisitable interface and extend the NodeBase class by adding an Accept method as required.


public abstract class NodeBase : INode
{
public abstract void Accept(IVisitor visitor);

//omited...
}

And finally we need to extend each node class with the concrete implementation of the Accept method.


public override void Accept(IVisitor visitor)
{
visitor.Visit(this);
}

So every node is capable of calling the correct Visitor method passing itself as a parameter. You may be wondering why we are violating the DRY principle here. Couldn't we simply implement the Accept method in the base class? We could, but the base class would simply call the Accept(NodeBase node) method and not the concrete method we need!

The disadvantages of this pattern are:

  • The code is less clear if the design pattern is not known.
  • The list of derived visitable classes must be known in advance
  • Boilerpate code is added to every derived visitable class

There are ways to mitigate these limitations (using generic visitors or dynamic casting) but that's out of scope for this series. In our case the complete list of RE tokens is known and not likely to extend. What visitor allows us to extend is the ways in which to utilize strong typed information we recieve from individual nodes. Lets write our XmlVisitor! First we define the basics.


public class XmlVisitor : IVisitor
{
readonly string LR = Environment.NewLine;
readonly static int TabSpaces = 3;

private int _level;
private StringBuilder _builder = new StringBuilder();

private static string GetTab(int tabLevel)
{
string tab = string.Empty;
return tab.PadLeft(tabLevel * TabSpaces, ' ');
}
}

The StringBuilder will help with the efficiency of building a string representation of the tree while GetTab method will take care of proper indentation. Next we fill in the interface members.


public void Visit(LiteralNode node)
{
string tab = GetTab(_level);
_builder.AppendFormat("{0}<{1}>{2}{3}", tab, "literal", node.Token.Character, LR);
}

public void Visit(RepetitionNode node)
{
string tab = GetTab(_level);
_builder.AppendFormat("{0}<{1} min='{2}' max='{3}'>{4}", tab, "repeatition", node.Token.MinOccurs, node.Token.MaxOccurs, LR);
_level++;
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
_level--;
_builder.AppendFormat("{0}{2}", tab, "repeatition", LR);
}

//rest omited...

Using our visitor is still quite cumbersome as we need to create an instance of the visitor, get the node to accept the visitor and then have a method return the StringBuilder value. We can wrap that functionality in a single static method. This means that initializing the visitor will not be needed so we can declare its default constructor private.


private XmlVisitor()
{ }

public static string Visit(INode node)
{
XmlVisitor visitor = new XmlVisitor();
node.Accept(visitor);
return visitor._builder.ToString();
}

We can remove the cumbersome Traverse method and test our shiny new visitor!


Console.WriteLine(XmlVisitor.Visit(node));
//Console.WriteLine(Traverse(node));


Related posts: