Thursday, September 25, 2008

RXRDG Part I - Regular Expression Random Data Generator

A while ago I was faced with a task of setting up a test database for the system I was building. I was preparing to use existing sql scripts to populate the database with random values but it seemed like an awful lot of work. Whenever something looks like an awful lot of work that's an opportunity to do something creative. But first you should always Google!

There are number of useful random data generators out there and the idea that I found interesting was using regular expressions (RE) to create random data. RE are usually used as a means of validating existing data but that means that data generated to fit those expressions should be valid test data as well.

So the basic idea is to take a regular RE, parse it and use the results to generate random data. Now that seems like an interesting and useful project. What I will try to do in the next few installments of this blog is to slowly build the RE random data generator and hopefully learn a few things along the way. The main objective is to learn and teach, to write readable and not the fastest code. I hope to provide you with a download of a working project to finish off the series.

Parsing expressions can become a complex operation. We need to recognize usable blocks of texts that have meaning on their own (tokens) and we need to decipher the meaning of the sequence of such blocks. The first phase is called tokenization and the second is called parsing. Now tokenization is usually done using regular expressions but it would be quite ironic to use that approach here :) Also most regular expression tokens are single characters (such as letters, dot operator, star operator etc…). Still - splitting the parsing process in two parts helps reduce the complexity of the process and will prove quite helpful in the long run.

The first step should be to build a lexer capable of recognizing at least some of the regular expression tokens. So how does one perform lexical analysis on an expression? Iterating through characters would be a good start. First we define some interfaces. We'll need a token type enumeration to distinguish between tokens and a lexer.


public enum TokenType
{
Literal,
Repeatition
}

public interface IToken
{
TokenType TokenType { get; }
}

public interface IContext
{
IEnumerable Tokenize(string expression);
}

I did say that the most of regular expression tokens are single characters, but not all. This means that when the lexer hits a specific character it will need to recognize it as a start of a usable block of characters and act accordingly. In other words - lexer will analyze the input and change its state. Such behavior calls for implementation of a state pattern. We will therefore split the lexer into lexer context which drives the analysis and lexer state which analyzes character stream and builds tokens. The lexer will output a stream of tokens from an expression and keep track of its state. So lets update the IContext interface and add another interface for the lexer states.


public interface IState
{
IToken Handle(IContext context);
}

public interface IContext
{
IEnumerable Tokenize(string expression);
void ToState(IState state);
void EndState();
char Current { get; }
}

It’s time to build the lexer class, the default state and a few tokens. But first we'll define the literal (characters) and repetition (quantifiers) token types.


public class LiteralToken : IToken
{
public char Character { get; set; }
public TokenType TokenType { get { return TokenType.Literal; } }
}

public class RepetitionToken : IToken
{
public int MinOccurs { get; set; }
public int MaxOccurs { get; set; }
public TokenType TokenType { get { return TokenType.Repetition; } }
}

Next we'll define the default lexer state. The job of a lexer state is to take a character and do either buid a token or transition to another state. This way tokenizing logic is always contained within a certain context. This will prove very helpful in the future.


public class LiteralState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case '(':
context.ToState(new LiteralState());
break;
case ')':
context.EndState();
break;
case '*':
return new RepetitionToken()
{ MinOccurs = 0, MaxOccurs = - };
default:
return new LiteralToken()
{ Character = context.Current };
}
return null;
}
}

And finally we can define the lexer class. Since lexer states are responsible for processing the input the lexer needs only to iterate through the characters of the expression and keep track of its states.


public class Lexer : IContext
{
Stack _states;
IState _currentState;
IEnumerator _characters;

public IEnumerable Tokenize(string expression)
{
_states = new Stack();
_currentState = new LiteralState();
_characters = expression.GetEnumerator();
while (_characters.MoveNext())
{
IToken token = _currentState.Handle(this);
if (token != null)
{
yield return token;
}
}
}

public void ToState(IState state)
{
_states.Push(_currentState);
_currentState = state;
}

public void EndState()
{
_currentState = _states.Pop();
}

public char Current
{
get
{
return _characters.Current;
}
}
}

That was easy! We can also refactor the token building procedures to a token builder. Remember to update the lexer state.


public static class TokenBuilder
{
public static LiteralToken BuildLiteralToken(char character)
{
return new LiteralToken()
{ Character = character };
}

public static RepetitionToken BuildZeroOrMoreToken()
{
return TokenBuilder.BuildRepetitionToken(0, -1);
}

public static RepetitionToken BuildRepetitionToken(int minOccurs, int maxOccurs)
{
return new RepetitionToken()
{ MinOccurs = minOccurs, MaxOccurs = maxOccurs};
}
}

We can quickly test our lexer on a few expressions (and think about adding new states).


static void Main(string[] args)
{
string expression = Console.ReadLine();

Lexer lexer = new Lexer();
foreach (IToken token in lexer.Tokenize(expression))
{
Console.WriteLine(token.TokenType.ToString());
}

Console.ReadLine();
}

Related posts:

2 comments:

Unknown said...

Weakness of attitude becomes weakness of charater. See the link below for more info.


#character
www.ufgop.org

eddyshaw9272711 said...

very good put up, i actually love this website, carry on it free online casino slots