Unit 4
ePortfolio component
Read Weidman (no date) then answer the questions below, adding them as evidence to your e-portfolio. You may want to complete this activity in conjunction with or after completing Seminar 2 preparation.
- What is Evil Regex?
- What are the common problems associated with the use of regex? How can these be mitigated?
- How and why could regex be used as part of a security solution?
You can share your responses with your tutor for formative feedback or discuss them in this week’s seminar.
An Evil Regex is a regular expression that is prone to catastrophic backtracking when given specially crafted input that exploits the regex’s structure. Usually these are expressions that allow the same string to be matched in multiple ways: they contain nested quantifiers or overlapping alternatives, which force the regex engine to explore a very large number of combinations before it can conclude that the match fails (Goyvaerts, 2023; Weidman, no date).
Processing such an expression can take time that grows exponentially with the length of the input, which is the basis for Regular expression Denial of Service (ReDoS) attacks. An attacker either submits crafted input to an existing vulnerable regex, or injects input from which the application builds a new regex and then triggers it with tailored data. In both cases, the result is service unavailability (Weidman, no date).
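The growth described above can be observed directly. The following is a minimal sketch, not taken from the cited sources: it times Python's backtracking `re` engine on the textbook pattern `^(a+)+$` against strings of the form `aaa...a!`, where the trailing `!` forces the match to fail. The helper name `time_failed_match` and the chosen lengths are illustrative assumptions, and absolute timings will vary by machine.

```python
import re
import time

# Textbook "evil" pattern: a quantified group whose body is itself quantified.
EVIL = re.compile(r"^(a+)+$")
# Accepts the same strings, but offers only one way to match each of them.
SAFE = re.compile(r"^a+$")


def time_failed_match(pattern, length):
    """Return the seconds spent failing to match 'aaa...a!' of the given length."""
    text = "a" * length + "!"  # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


# Lengths are kept small on purpose: each extra 'a' roughly doubles the work.
timings = {n: time_failed_match(EVIL, n) for n in (10, 14, 18, 20)}
for n, seconds in timings.items():
    print(f"evil, n={n:2d}: {seconds:.4f} s")
print(f"safe, n=20: {time_failed_match(SAFE, 20):.6f} s")
```

The evil pattern's time should climb steeply as `n` grows, while the equivalent safe pattern stays effectively constant at these lengths.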
To mitigate these issues, developers should learn to write safer patterns, for example by using possessive quantifiers and atomic groups and by avoiding unnecessary nesting, and should employ tools such as recheck to detect potentially “evil” expressions (Goyvaerts, 2023; Weidman, no date; TSUYUSATO ‘MakeNowJust’ Kitsune, no date). Some programming languages also offer ways to safely construct expressions from user input, such as Pattern#quote() in Java (Oracle, no date).
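In Python, the closest counterpart to the Java quoting method mentioned above is `re.escape()`. The sketch below is illustrative only: the function name `find_literal` and the sample strings are assumptions, not part of the cited sources. It shows user input being treated as literal text, so that an attacker cannot inject a new pattern.

```python
import re


def find_literal(user_text, haystack):
    """Search for user-supplied text as a literal string, not as a pattern."""
    # re.escape() backslash-escapes regex metacharacters, so input such as
    # "(a+)+$" is matched character by character instead of being compiled
    # as a nested quantifier.
    pattern = re.compile(re.escape(user_text))
    return pattern.search(haystack) is not None


hostile = "(a+)+$"
print(re.escape(hostile))
print(find_literal(hostile, "the pattern (a+)+$ is risky"))   # True
print(find_literal(hostile, "aaaaaaaaaaaaaaaaaaaaaaaaaaaa!"))  # False
```

Escaping only addresses injected patterns. A vulnerable pattern written by the developer still has to be rewritten or checked with an analysis tool.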
Regular expressions are also notoriously complex. Developers value them, but many find them hard to write and understand, and often lack confidence in their regex skills (Michael et al., 2019).
Still, regular expressions remain a powerful tool in modern software development. They are useful for processing string data: validating or sanitizing input, extracting information, or searching and filtering logs. But as demonstrated above, a poorly written regex can introduce unexpected vulnerabilities into otherwise secure code. Notably, only 38 % of developers surveyed by Michael et al. (2019) were aware of ReDoS attacks.
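As an illustration of the log-filtering use case, the sketch below counts failed logins per IP address. The log format, field names and sample lines are hypothetical, invented for this example. The pattern uses only bounded quantifiers and a fixed separator, so each line can be matched in just one way.

```python
import re
from collections import Counter

# Hypothetical log lines: the format is an assumption made for this example.
LOG_LINES = [
    "2025-07-17 10:01:02 FAILED login user=alice ip=203.0.113.7",
    "2025-07-17 10:01:05 OK login user=bob ip=198.51.100.4",
    "2025-07-17 10:01:09 FAILED login user=alice ip=203.0.113.7",
    "2025-07-17 10:02:11 FAILED login user=carol ip=192.0.2.55",
]

# Bounded quantifiers and literal separators leave no room for ambiguity.
FAILED_LOGIN = re.compile(
    r"FAILED login user=(?P<user>\w{1,32}) ip=(?P<ip>\d{1,3}(?:\.\d{1,3}){3})$"
)

failures = Counter()
for line in LOG_LINES:
    match = FAILED_LOGIN.search(line)
    if match:
        failures[match["ip"]] += 1

# Report addresses with the most failed attempts first.
for ip, count in failures.most_common():
    print(f"{ip}: {count} failed attempt(s)")
```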
References
Goyvaerts, J. (2023) Runaway Regular Expressions: Catastrophic Backtracking. Available at: https://www.regular-expressions.info/catastrophic.html (Accessed: 17 July 2025).
Michael, L.G. et al. (2019) ‘Regexes are Hard: Decision-Making, Difficulties, and Risks in Programming Regular Expressions’, in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 415–426. Available at: https://doi.org/10.1109/ASE.2019.00047.
Oracle (no date) Pattern (Java Platform SE 8 ). Available at: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html (Accessed: 17 July 2025).
TSUYUSATO ‘MakeNowJust’ Kitsune (no date) recheck. Available at: https://makenowjust-labs.github.io/recheck/ (Accessed: 17 July 2025).
Weidman, A. (no date) Regular expression Denial of Service - ReDoS | OWASP Foundation. Available at: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS (Accessed: 28 April 2025).
Regex
The second language concept we will look at is regular expressions (regex). We have already presented some studies on their use, and potential problems, above. The lecturecast also contains a useful link to a tutorial on creating regex. Re-read the provided links and tutorial (Jaiswal, 2020) and then attempt the problem presented below:
- The UK postcode system consists of a string that contains a number of characters and numbers – a typical example is ST7 9HV (this is not valid – see below for why). The rules for the pattern are available from idealpostcodes (2020).
- Create a python program that implements a regex that complies with the rules provided above – test it against the examples provided.
Examples:
- M1 1AA
- M60 1NW
- CR2 6XH
- DN55 1PT
- W1A 1HQ
- EC1A 1BB
How do you ensure your solution is not subject to an evil regex attack?
The first step in solving the problem is to review the UK postcode format. IdealPostcodes (2023) provide the following description of the pattern:
- One or two letters
- One or two digits
- One letter (optional)
- A space character
- One digit
- Two letters
Based on that, implementing a regular expression becomes a straightforward task. Moreover, because the format is restrictive and fixed in shape, the expression does not require any nested quantifiers or ambiguous alternatives that could make it a target for a ReDoS attack (Goyvaerts, 2023; Weidman, no date).
The Python code attached below implements parsing the provided postcodes and extracting different structural units from them. To achieve that, two regular expressions are used. The base structure of the expressions is the same:
^[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}$
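The claim that this pattern resists ReDoS can be checked with a short experiment. The sketch below is a minimal validator built on the base pattern. The names `is_valid_postcode` and `MAX_LENGTH` are illustrative assumptions. It uses `fullmatch()` rather than `^...$`, because in Python `$` also matches just before a trailing newline, and it adds a length guard as a cheap second line of defence.

```python
import re
import time

POSTCODE = re.compile(r"[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}")
MAX_LENGTH = 9  # longest string this pattern can match, e.g. "AB12C 1DE"


def is_valid_postcode(text):
    """Return True if text matches the simplified postcode pattern."""
    if len(text) > MAX_LENGTH:  # reject oversized input before the regex runs
        return False
    return POSTCODE.fullmatch(text) is not None


examples = ["M1 1AA", "M60 1NW", "CR2 6XH", "DN55 1PT", "W1A 1HQ", "EC1A 1BB"]
print(all(is_valid_postcode(code) for code in examples))  # True

# Even without the length guard, a long near-miss is rejected quickly:
# adjacent character classes do not overlap, so there is little to backtrack.
hostile = "AB" + "1" * 100_000 + "!"
start = time.perf_counter()
result = POSTCODE.fullmatch(hostile)
print(result, f"{time.perf_counter() - start:.6f} s")
```

Because letters, digits and the space never compete for the same character, the engine has at most a handful of alternatives to try at each position, regardless of input length.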
However, to simplify data extraction, it makes sense to break the expression down using named groups. Two variants are used: one for the geographical sections of a code (area, district, etc.) and one for the structural parts (outward and inward code):
^(?P<sector>(?P<subdistrict>(?P<district>(?P<area>[A-Z]{1,2})\d{1,2})[A-Z]?) \d)(?P<unit>[A-Z]{2})$
^(?P<outcode>[A-Z]{1,2}\d{1,2}[A-Z]?) (?P<incode>\d[A-Z]{2})$
Named groups make it possible to produce a dictionary that maps each postcode part name to its value by calling the groupdict() method on a Match object (Python Software Foundation, 2025).
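import re

# Raw strings keep backslashes such as \d from being read as string escapes.
# Nested named groups capture the geographic units, from area down to unit.
geo_parts_regex = r"^(?P<sector>(?P<subdistrict>(?P<district>(?P<area>[A-Z]{1,2})\d{1,2})[A-Z]?) \d)(?P<unit>[A-Z]{2})$"
geo_parts_order = ["area", "district", "subdistrict", "sector", "unit"]
# Outward and inward codes: the two structural halves of a postcode.
code_parts_regex = r"^(?P<outcode>[A-Z]{1,2}\d{1,2}[A-Z]?) (?P<incode>\d[A-Z]{2})$"
# Base pattern without groups, kept for reference; it is not used below.
base_regex = r"^[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}$"

postcodes = [
    "M1 1AA",
    "M60 1NW",
    "CR2 6XH",
    "DN55 1PT",
    "W1A 1HQ",
    "EC1A 1BB",
]

for postcode in postcodes:
    print(f"Postcode: {postcode}")
    geo_parts_match = re.match(geo_parts_regex, postcode)
    out_in_match = re.match(code_parts_regex, postcode)
    # re.match() returns None when the input does not fit the pattern.
    if geo_parts_match is None or out_in_match is None:
        print(" not a valid postcode")
        print()
        continue
    geo_parts = geo_parts_match.groupdict()
    # Print the geographic parts from the widest unit to the narrowest.
    for key in geo_parts_order:
        print(f" {key}: {geo_parts[key]}")
    print()
    for key, value in out_in_match.groupdict().items():
        print(f" {key}: {value}")
    print()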
Postcode: M1 1AA
area: M
district: M1
subdistrict: M1
sector: M1 1
unit: AA
outcode: M1
incode: 1AA
Postcode: M60 1NW
area: M
district: M60
subdistrict: M60
sector: M60 1
unit: NW
outcode: M60
incode: 1NW
Postcode: CR2 6XH
area: CR
district: CR2
subdistrict: CR2
sector: CR2 6
unit: XH
outcode: CR2
incode: 6XH
Postcode: DN55 1PT
area: DN
district: DN55
subdistrict: DN55
sector: DN55 1
unit: PT
outcode: DN55
incode: 1PT
Postcode: W1A 1HQ
area: W
district: W1
subdistrict: W1A
sector: W1A 1
unit: HQ
outcode: W1A
incode: 1HQ
Postcode: EC1A 1BB
area: EC
district: EC1
subdistrict: EC1A
sector: EC1A 1
unit: BB
outcode: EC1A
incode: 1BB
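The task statement notes that ST7 9HV is not a valid postcode, yet the simplified pattern used above accepts it. One rule that would explain this, assumed here rather than taken from the six-point summary above, is that the two letters of the inward code never use C, I, K, M, O or V. The sketch below shows how the pattern could be tightened under that assumption. The name `STRICT_POSTCODE` is illustrative.

```python
import re

# Assumption: inward-code letters exclude C, I, K, M, O and V.
# The character class below lists every remaining capital letter.
STRICT_POSTCODE = re.compile(r"[A-Z]{1,2}\d{1,2}[A-Z]? \d[ABD-HJLNP-UW-Z]{2}")

examples = ["M1 1AA", "M60 1NW", "CR2 6XH", "DN55 1PT", "W1A 1HQ", "EC1A 1BB"]
for code in examples + ["ST7 9HV"]:
    verdict = "valid" if STRICT_POSTCODE.fullmatch(code) else "invalid"
    print(f"{code}: {verdict}")
```

The stricter character class adds no quantifiers or alternatives, so it does not change the pattern's resistance to catastrophic backtracking.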
References
Goyvaerts, J. (2023) Runaway Regular Expressions: Catastrophic Backtracking. Available at: https://www.regular-expressions.info/catastrophic.html (Accessed: 17 July 2025).
IdealPostcodes (2023) The UK Postcode Format, IdealPostcodes. Available at: https://ideal-postcodes.co.uk (Accessed: 28 April 2025).
Python Software Foundation (2025) re — Regular expression operations, Python documentation. Available at: https://docs.python.org/3/library/re.html (Accessed: 17 July 2025).
Weidman, A. (no date) Regular expression Denial of Service - ReDoS | OWASP Foundation. Available at: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS (Accessed: 28 April 2025).