What you really need to know about regular expressions before using them
Dusting off another old article…
I recently read a collection of tips and tricks for using regular expressions. That stuff isn’t what you need to know. Anybody can find those things in a manual, and how hard can character classes really be? What you really need to know is how this stuff ticks. You can get yourself into deep trouble by using the wrong kind of regexp engine in the wrong way.
[What follows is an article I wrote in 2015]
If you want to use regular expressions in production code, the most important thing you must know is that there are three general approaches to matching them. They have very different performance characteristics, and it is absolutely vital that you know which approach the library you are using implements.
Approach #1 — Non-deterministic Finite Automaton
- This approach first converts the regular expression into a non-deterministic state machine, which is just like a state machine except you have to keep track of all the states you might be in. Sort of like when you have a choice of going left or right you go both ways and remember it could be either. The state machine you get from this approach is proportional to the size of the expression in a fairly direct way and does not depend on the text you will match at all, just the regular expression. The maximum number of states you might have to track at any given moment is likewise determined by the expression in a fairly direct way and again does not depend on what you are matching, just the pattern.
- This is the “medium” approach to the problem, and it is popular in command line tools like grep/sed etc. Many text editors use this approach internally.
- Typically a library that does this offers a compile step to compile the RE first into a machine and then a match method to do the matching.
- The time to match is linear in the size of the input; the cost per character is bounded by the number of states you might have to track, which is fixed by the expression.
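The state-set bookkeeping above can be sketched in a few lines. The machine here is hand-built for the pattern `.*ab` (anchored to the whole input) purely for illustration; a real library would compile it from the expression:

```python
# A hand-built NFA for the pattern ".*ab" (illustrative sketch, not the
# output of a real compiler). State 0 loops on anything (the ".*"),
# state 1 means "just saw an 'a'", state 2 is the accepting state.
def nfa_step(states, ch):
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)                # ".*" can always swallow one more char
            if ch == 'a':
                nxt.add(1)            # ...or this 'a' starts the "ab" tail
        elif s == 1 and ch == 'b':
            nxt.add(2)                # "ab" completed
    return nxt

def nfa_match(text):
    states = {0}                      # the set of states we might be in
    for ch in text:
        states = nfa_step(states, ch)
    return 2 in states                # did any branch reach the accept state?
```

Note that the work per character is bounded by the number of states (three here), which depends only on the pattern, never on the text.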
Approach #2 — Deterministic Finite Automaton
- In this approach you basically post-process the NDFA from approach #1 into a regular, deterministic state machine. You can always do this: imagine that if in the NDFA you were in the state [maybe #1, maybe #7, maybe #22], then in the DFA you’re in a single state called [maybe_1_maybe_7_maybe_22], so there is one distinct DFA state for every combination of NDFA states you might have been in. The advantage of this approach is that once you have set up the DFA, the housekeeping at each step is much smaller: you’re only ever in one state, so each character costs one transition lookup.
- Like in #1, your run time is not affected by the complexity of the expression, only by the size of the text you are matching, linearly; better still, each step is a single table lookup rather than a set update. Also like #1 there is a compile step which you pay once per expression, so the total cost includes both compilation and matching.
- Command line tools like “egrep” and “lex” use this approach. Here’s the thing: if the tool is something like “lex”, you only pay the cost of the compilation once, when you generate your machine. The already-cooked state machine is then compiled directly into your program. So that’s a cool trade-off.
- The downside is that the number of states you could need can get very large indeed. In the worst case if the NDFA had N states then the DFA would have 2^N states. But usually that doesn’t happen.
- This is the “heavy” approach in terms of initial cost.
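The subset construction described above can be sketched directly. The `nfa_step` below is the same hand-built machine for `.*ab` used as an illustration (an assumption for the example, not from any real library); each reachable *set* of NFA states becomes one DFA state:

```python
# Subset construction: each DFA state is a frozenset of NFA states.
# The NFA is hand-built for the pattern ".*ab" (illustrative only).
def nfa_step(states, ch):
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if ch == 'a':
                nxt.add(1)
        elif s == 1 and ch == 'b':
            nxt.add(2)
    return nxt

def build_dfa(alphabet, start=frozenset({0})):
    table, todo, seen = {}, [start], {start}
    while todo:                        # explore every reachable state set once
        S = todo.pop()
        for ch in alphabet:
            T = frozenset(nfa_step(S, ch))
            table[(S, ch)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return table

def dfa_match(table, text, start=frozenset({0})):
    state = start
    for ch in text:
        state = table[(state, ch)]     # one table lookup per character
    return 2 in state

table = build_dfa("abx")               # alphabet restricted for the sketch
```

For this pattern only three state sets are reachable, so the 2^N worst case stays theoretical here; the matching loop itself does constant work per character.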
Approach #3 — Interpret the regular expression as you go and backtrack to match
- This doesn’t create a machine, and in fact you don’t need any extra state other than a stack. In this approach, when faced with a pattern like A|B you first try A, and if that doesn’t work you rewind and try B. This is all well and good, except that with even a simple pattern like .*.*X you find yourself swallowing all you can with the first .*, by which time you’re way past the X, so you rewind; but of course the second .* then grabs too much, so you rewind that… Now imagine a very long line with no X in it at all: you can spend all kinds of time trying different combinations of leading stuff, and matching becomes O(N²) in the size of the input. This kind of stuff happens all the time… and you can make it much, much worse: (.*.*)*X just compounds the problem. If that pattern looks too unrealistic to you, how do you feel about “.*=.*+.*;” to match anything that looks like X=Y+Z;? That pattern is a cosmic disaster… [imho, this type of engine should never be used in production code because it’s so unpredictable]
- Thing is, for very simple patterns #3 will often do better because there’s no compilation step. It can get bad, but often it doesn’t. Some libraries do this [C++ STL’s std::regex is typically implemented this way; RE2, by contrast, is automaton-based, along the lines of #1]
- This is the “light” approach in terms of initial cost.
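You can watch the blow-up happen with a toy backtracking matcher. This is a sketch in the style of the classic minimal matcher supporting only `.` and `*` (not any real engine), instrumented to count how many steps it takes on `.*.*X`:

```python
# A toy backtracking matcher supporting only "." and "c*" (a sketch, not a
# real engine). The pattern must match the whole text; "steps" counts how
# many times the matcher is entered, i.e. how much rewinding it does.
def backtrack_match(pattern, text):
    steps = 0
    def match_here(pat, txt):
        nonlocal steps
        steps += 1
        if not pat:
            return not txt            # pattern exhausted: match iff text is too
        if len(pat) >= 2 and pat[1] == '*':
            # Try consuming 0, 1, 2, ... characters with pat[0], rewinding
            # ("backtracking") each time the rest of the match fails.
            i = 0
            while True:
                if match_here(pat[2:], txt[i:]):
                    return True
                if i < len(txt) and pat[0] in ('.', txt[i]):
                    i += 1
                else:
                    return False
        if txt and pat[0] in ('.', txt[0]):
            return match_here(pat[1:], txt[1:])
        return False
    ok = match_here(pattern, text)
    return ok, steps
```

Running `backtrack_match(".*.*X", "a" * n)` for growing n shows the step count roughly quadrupling each time n doubles, exactly the N² behavior described above.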
Obviously these three all behave very differently.
Now, if you need to match many different regular expressions, to see if any match, then trying each one separately in a loop is the worst thing you can do. Remember, if you want to match A or B you can always encode that in a single expression A|B and then see if you get a match in one pass over the input, not two! If there are many expressions it’s even more valuable to consolidate like this. Using a lexical analysis technique like that provided by “lex” or “flex” allows you to state your (many) expressions in advance and get one machine that matches them all simultaneously (telling you which matched, of course), and you only pay the compilation cost once. This is a super economical way to match hundreds of expressions.
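The consolidation trick can be sketched in Python using named groups, so one combined pattern reports which alternative matched (the token names here are made up for illustration):

```python
import re

# One combined pattern instead of three separate ones. Named groups let us
# recover which alternative matched (m.lastgroup) in a single pass.
TOKEN = re.compile(r"(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[+=;])")

def tokenize(text):
    # One left-to-right scan classifies every token as it is found.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
```

Python’s own `re` is a backtracking engine (#3), so this only illustrates the one-pass idea; lex and flex go further and compile the combined expressions into a single DFA, as described under approach #2.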
For more details consider these references: