Regular expression security
Pass the Test
One important paradigm in software development, especially web applications, is careful validation of user input before allowing further processing. Much can go wrong if the input is not carefully checked. SQL injection and cross-site scripting attacks are just two of the most common examples of exploitation. Regular expressions are useful for checking user input, but even they are vulnerable to attacks. In this article, I show you how to check your regular expressions for vulnerabilities.
Regular Expressions
Regular expressions (regex) have become an established method of validating user input before processing to describe and check for permitted entries and to prevent certain characters (e.g., those with a special function in an application) from being entered. Unfortunately, some regular expressions can still cause unwanted behavior with certain types of input and make the script unusable – much like a denial of service attack.
A regular expression describes a language, wherein you define the language that you want to accept as input for an application. Email addresses provide a simple but useful example. The World Wide Web Consortium recommends the following regular expression for the language of email addresses:
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$
When you see a string like this for the first time, the semantics are not immediately apparent, but regular expressions form a language that is actually quite easy to understand. The two characters ^
and $
describe the start and end of input, respectively.
The parentheses form groups and the square brackets ([ ]
) classes of admissible input characters. In this case, no value other than those that would appear in an email address may be present in the input. The first class in the example comprises a-z
(all lowercase letters), A-Z
(all uppercase letters), 0-9
(all digits), and all the special characters specified up to the right square bracket.
Certain symbols define the frequency of occurrence of any characters in a group or class: ?
(zero or one time), +
(one or more times), and *
(zero or more times). The plus sign that appears after the first closing square bracket and in front of the @
sign therefore says that the @
symbol of an email address must be preceded by at least one to any number of the specified characters.
The regular expression after the @
checks domain names, which begin with at least one character, number, or hyphen (+
), with an optional extension (*
) comprising a period (zero or more times) and at least one of the characters of the class that follows.
The dot also has a special meaning for regular expressions: It stands for any character. If you really expect a dot at a location, you need to insert a backslash in front (e.g., \.
, as in the example) to "escape" the character). If you adhere to all these requirements, you can generate words in the regular language of email addresses.
Regular expressions also allow the logical AND and OR operations. The AND operation is implicit outside of a class, whereas within a class, characters are implicitly linked with OR. The |
character is used to stipulate an explicit OR; for example, a|b
stands for a character that is either a
or b
. The expression is equivalent to the class [ab]
. If you want to check the string for email addresses, you can use the Python console (Figure 1).
State Machines
State machines are used to test whether a word is part of a language – for example, whether the user input is an email address. If this doesn't mean anything to you, just imagine a ticket vending machine for the subway or a train. It assumes different states (e.g., the welcome screen, ticket selection, payment, or ticket printout) and expects different types of input from the operator that cause a transition from one state to another. If all the user input matches the previously defined input language, the machine accepts the input and prints a ticket.
State machines that check whether an input word is part of a language work in the same way. Each letter of an input word changes the state of the machine. If the state machine is in an accepting state after the input, the word is part of the language. Figure 2 shows a simple state machine with four states and the transitions between these states. State 1 is accepting, all other states are not. The state machine has an input alphabet that comprises only the letters a and b . Words are accepted in which an a is followed by another a or a b . The state machine is equivalent to the regular expression:
^a[ab]+$
If the input word starts with a b , the machine changes to state 3. From this state, it cannot transition to any other state, especially not the accepting state 1, which means the word input is not part of the language and is not accepted. If it reads an a first, the machine switches to state 2. From there, either an a or b will cause it to transition to the accepting state 1. If this terminates the input, then the word is part of the language, and the state machine accepts it. If further characters are read from the input, the machine proceeds to state 3 and does not accept the input.
For each regular expression in your application, you can create state machines and use them to check the input. State machines can become quite large, with many states and transitions. In particular, relations with +
and *
quantifiers, large classes, or the dot (.
) for all possible input symbols lead to very large state machines.
Attack with ReDoS
A state machine's performance when checking regular expressions depends on many factors: the size of the state machine (i.e., the number of states and transitions) and, of course, the input. A state machine checks each possible sequence of states in turn until it accepts one of the sequences, which means, in the worst case, the run time for a state machine that checks a regular expression can grow exponentially in relation to the input. More bad news is that you often don't even notice that a regular expression results in a large state machine with a correspondingly long run time.
The Open Web Application Security Project (OWASP) lists the regular expression denial of service (ReDoS) [1] as an attack and shows some regular expressions that have unexpectedly bad worst-case run times, such as:
^(a+)+$
At first glance this does not seem to be a problematic expression; for example, the input aaaa has only 16 (2^4) possible sequences of state transitions. However, if you enter a 16 times, you already have 2^16 = 65,536 possible sequences, all of which need to be checked. The number doubles with each additional a in the input.
Occasionally, developers also use user input to create regular expressions. Imagine you want to prevent a username from being included in the password you are using. If an attacker chooses (a+)+ as the username and types a 40 times as the password, the state machine would have to check 2^40 possible sequences. An attacker could thus deliberately cause a denial of service if they had some idea of how the application checked user input.
Buy this article as PDF
(incl. VAT)