Tuesday, June 05, 2007

Regular expressions vs. string operations

I'm not an expert at regular expressions, but I tend to use them wherever possible.
They are a powerful and handy tool that can be used for validation, string manipulation, searching and so on.
I admit that a regular expression is sometimes very difficult to understand. In those cases a regular expression tool, such as Expresso, is very helpful for decoding the expression.

Regular expressions are especially helpful when you are dealing with HTML.

Let's define a string variable that holds some simple HTML output.

string html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
  + "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\"><head><title>test page</title></head>"
  + "<body><form action=\"\" method=\"post\"><p>"
  + "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
  + "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
  + "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
  + "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
  + "</p></form></body></html>";

Suppose that you want to retrieve the text between the form tags. A regular expression is a very good candidate for the job.

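// Requires: using System.Text.RegularExpressions;
string result = null;
// Group 1 lazily captures everything between the opening and closing form tags.
Regex regex = new Regex(@"<form\b[^>]*>(.*?)</form>");
Match match = regex.Match(html);
if (match.Success){
  result = match.Groups[1].Value;
}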

The code is not messy and it gives you what you want. It takes 0.070625904 ms on average. That is fast.

How about using string operations?

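string result = null;
// IndexOf returns -1 when nothing is found, and 0 is a valid position.
int index = html.IndexOf("<form");
if (index >= 0){
  // End of the opening <form ...> tag.
  int formEndIndex = html.IndexOf('>', index);
  if (formEndIndex >= 0){
    // Look for the closing tag only after the opening tag.
    int endIndex = html.IndexOf("</form>", formEndIndex);
    if (endIndex >= 0){
      result = html.Substring(formEndIndex + 1, endIndex - formEndIndex - 1);
    }
  }
}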

The string operations take only 0.00312504 ms on average.

That is 22.6 times faster than the regular expression. Although this single test may not reflect every scenario, string operations appear to be faster than regular expressions here.

Note: Each code snippet was executed 10000 times to measure its speed. To calculate the average, the same process was repeated 10 times.
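
The measurement procedure in the note above could be sketched roughly as follows. This is only an illustration under assumptions, not the harness used for the numbers in this post: the helper name AverageMilliseconds, the warm-up call and the shortened HTML string are all made up for the example, and the timings it prints will differ from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class TimingSketch
{
    // Hypothetical helper: runs the action `iterations` times per round,
    // repeats that for `rounds` rounds, and returns the average time
    // of a single call in milliseconds.
    static double AverageMilliseconds(Action action, int iterations, int rounds)
    {
        action(); // warm-up call (an assumption of this sketch) so JIT cost is not timed

        double total = 0;
        for (int r = 0; r < rounds; r++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                action();
            }
            watch.Stop();
            total += watch.Elapsed.TotalMilliseconds / iterations;
        }
        return total / rounds;
    }

    static string ExtractWithRegex(Regex regex, string html)
    {
        Match match = regex.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }

    static string ExtractWithIndexOf(string html)
    {
        int index = html.IndexOf("<form");
        if (index < 0) return null;
        int formEndIndex = html.IndexOf('>', index);
        if (formEndIndex < 0) return null;
        int endIndex = html.IndexOf("</form>", formEndIndex);
        if (endIndex < 0) return null;
        return html.Substring(formEndIndex + 1, endIndex - formEndIndex - 1);
    }

    static void Main()
    {
        // Shortened sample input, not the full string from this post.
        string html = "<body><form action=\"\" method=\"post\"><p>Lorem ipsum</p></form></body>";
        Regex regex = new Regex(@"<form\b[^>]*>(.*?)</form>");

        double regexMs = AverageMilliseconds(() => ExtractWithRegex(regex, html), 10000, 10);
        double stringMs = AverageMilliseconds(() => ExtractWithIndexOf(html), 10000, 10);

        Console.WriteLine("Regex:  {0} ms per call", regexMs);
        Console.WriteLine("String: {0} ms per call", stringMs);
    }
}
```

Stopwatch is used here because it offers a higher resolution than subtracting DateTime values, which matters when a single call takes a fraction of a millisecond.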

Using the RegexOptions.Compiled option, as shown below, might help, but be careful!

Regex regex = new Regex(@"<form\b[^>]*>(.*?)</form>", RegexOptions.Compiled);

In this case the average drops to 0.019062744 ms, which is 3.7 times faster than the uncompiled regular expression.
(The Regex object was created once and executed 10000 times to get the result for each iteration.)

If you are not caching the Regex object for later executions, or you intend to use an expression only once, do not use the Compiled option. If we compile the expression every time before we use it, each execution takes 4.330055424 ms, which is 61 times slower than running it without the Compiled option!
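
One common way to cache a compiled expression is to keep it in a static readonly field so that it is built only once per process. The sketch below assumes that approach; the class name FormExtractor and the method name ExtractFormBody are invented for this example, and the HTML string is a shortened sample.

```csharp
using System;
using System.Text.RegularExpressions;

static class FormExtractor
{
    // Created once when the class is first used, so the cost of
    // RegexOptions.Compiled is paid a single time and then reused.
    private static readonly Regex FormRegex =
        new Regex(@"<form\b[^>]*>(.*?)</form>", RegexOptions.Compiled);

    // Returns the text between the form tags, or null when there is no match.
    public static string ExtractFormBody(string html)
    {
        Match match = FormRegex.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}

static class Program
{
    static void Main()
    {
        string html = "<form action=\"\" method=\"post\"><p>Lorem ipsum</p></form>";
        Console.WriteLine(FormExtractor.ExtractFormBody(html)); // prints <p>Lorem ipsum</p>
    }
}
```

Every caller then shares the same Regex instance instead of constructing, and recompiling, a new one for each call.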

We should use regular expressions, but reaching for them should not become a habit. String operations are still an option.
We should choose which method to use depending on the complexity of the task and the conditions.

How do we decide then?
Here is a list to start with.
- The code should perform well.
- It should be readable.
- It should be easy to maintain.
- Its documentation should be clear and easy to understand.
If you want to add anything to this list, send me an email.