Regular expressions: a beginner’s guide
A pattern language for finding and shaping text — explained from zero.
A regular expression (regex) is a small pattern that describes a set of strings. Instead of searching for one fixed word, you write a pattern and the engine finds every piece of text that matches it. That makes regex the workhorse behind find-and-replace, form validation, log scraping and search filters. The syntax looks cryptic at first, but it’s built from a handful of pieces. Once you know them, most patterns become readable. The best way to learn is to build a pattern and watch what it matches, so keep our regex tester open in another tab as you read.
Literals and character classes
Most characters in a pattern are literals: the pattern cat
matches the letters c, a, t in that order, anywhere they appear. The power comes from
metacharacters that mean “a kind of character” rather than one exact letter.
A character class in square brackets matches any single character from the
set: [abc] matches one a, b, or c. Use a hyphen for a range, so
[a-z] matches any lowercase letter and [0-9] any digit. Put a caret
first to negate the class: [^0-9] matches any character that is not a
digit.
Common classes have shorthands, each with an uppercase negation:
| Token | Matches | Negation | Negation matches |
|---|---|---|---|
| \d | A digit, same as [0-9] | \D | Any non-digit |
| \w | A word character: letters, digits, _ | \W | Any non-word character |
| \s | Whitespace: space, tab, newline | \S | Any non-whitespace |
| . | Any single character (except newline) | (none) | (none) |
The dot . is the wildcard: it matches almost anything. Because it is so permissive,
beginners often reach for . when a narrower class like \d or
[a-z] would be safer and clearer.
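To see the classes and shorthands in action, here is a minimal sketch using Python's built-in re module. The choice of Python and the sample strings are assumptions made for illustration; the patterns themselves are the ones described above. One caveat: Python's \d and \w also accept non-ASCII digits and letters by default, so "same as [0-9]" holds for plain ASCII text.

```python
import re

# [abc] matches exactly one character from the set, wherever it appears.
set_hits = re.findall(r"[abc]", "cabbage")        # ['c', 'a', 'b', 'b', 'a']

# [0-9] and the shorthand \d pick out the same digits in ASCII text.
text = "Room 7b, floor 12"
digits_class = re.findall(r"[0-9]", text)         # ['7', '1', '2']
digits_short = re.findall(r"\d", text)            # ['7', '1', '2']

# A leading caret negates the class: every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")        # ['a', 'b']

# \w covers letters, digits and the underscore.
words = re.findall(r"\w+", "snake_case, v2!")     # ['snake_case', 'v2']

# The dot accepts any character except a newline, so "1." is far
# broader than "1\d" would be.
dot_hits = re.findall(r"1.", "12 1x 1\n")         # ['12', '1x']
digit_hits = re.findall(r"1\d", "12 1x 1\n")      # ['12']
```

The last two lines show why a narrower class is usually the safer choice: the dot happily accepts the x, while \d does not.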
Anchors and boundaries
By default a pattern can match anywhere inside the text. Anchors pin it to a
position instead of consuming a character. ^ means “start of the string” (or
start of a line in multiline mode) and $ means “end.” So ^\d+$
matches a string that is only digits from beginning to end — useful for validating
that an input is a whole number with nothing else mixed in.
The word boundary \b matches the invisible edge between a word character and a
non-word character. The pattern \bcat\b matches “cat” as a whole word but not
the “cat” inside “category” or “concatenate.” Boundaries are how you avoid the classic
find-and-replace disaster of mangling words that merely contain your target.
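The effect of anchors and word boundaries can be sketched in a few lines of Python. The helper name is_whole_number and the sample sentences are hypothetical, chosen only to illustrate the two ideas above.

```python
import re

WHOLE_NUMBER = re.compile(r"^\d+$")

def is_whole_number(s: str) -> bool:
    """Return True when s consists of digits from start to end."""
    return WHOLE_NUMBER.search(s) is not None

checks = {s: is_whole_number(s) for s in ["42", "12a", " 42", ""]}
# {'42': True, '12a': False, ' 42': False, '': False}

# \b keeps the match to whole words only.
sentence = "cat category concatenate cat."
whole_words = re.findall(r"\bcat\b", sentence)          # ['cat', 'cat']

# A plain text replacement mangles words that merely contain the target;
# the boundary-aware pattern leaves them alone.
naive = "cat category".replace("cat", "dog")            # 'dog dogegory'
careful = re.sub(r"\bcat\b", "dog", "cat category")     # 'dog category'
```

Note that the anchors and \b never appear in the matched text: they test a position and consume nothing.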
Quantifiers, groups and alternation
Quantifiers say how many times the preceding item may repeat. There are four you’ll use constantly:
- *: zero or more (the item is optional and repeatable)
- +: one or more (at least one required)
- ?: zero or one (optional, appears at most once)
- {n,m}: between n and m times; {3} is exactly three, {2,} is two or more
Groups in round brackets bundle part of a pattern so a quantifier applies to
the whole bundle, and they also capture what they matched for later reuse. Alternation
with the pipe | means “or.” Combine them and (cat|dog)s? matches
“cat,” “cats,” “dog,” or “dogs.” The group makes s? apply to either option.
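A short Python sketch makes the quantifiers and the (cat|dog)s? example concrete. The helper full and the test strings are assumptions for illustration; it uses re.fullmatch so that each check applies to the entire string.

```python
import re

def full(pattern: str, s: str) -> bool:
    """True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

star_ok = full(r"ab*", "a")             # True: zero b's is allowed
plus_ok = full(r"ab+", "a")             # False: at least one b required
optional = [full(r"colou?r", w) for w in ("color", "colour")]  # [True, True]
exactly_three = full(r"\d{3}", "1234")  # False: four digits, not three
two_or_more = full(r"\d{2,}", "12345")  # True
two_to_four = full(r"\d{2,4}", "12345") # False: five digits is too many

# The group bundles the alternatives so s? applies to either one,
# and it captures which alternative was matched.
pets = re.compile(r"(cat|dog)s?")
matches = list(pets.finditer("cat cats dog dogs bird"))
whole = [m.group(0) for m in matches]      # ['cat', 'cats', 'dog', 'dogs']
captured = [m.group(1) for m in matches]   # ['cat', 'cat', 'dog', 'dog']
```

group(0) is the full match, while group(1) is what the first pair of round brackets captured, which is why the trailing s is missing from the second list.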
Digits only
The pattern ^\d+$ accepts 42 or 1000 but rejects 12a or an empty input. Pair it with the m flag to check each line of multi-line text separately.
A simple email-like check
^[\w.+-]+@[\w-]+\.[a-z]{2,}$ matches an address such as “name@example.com.” It is deliberately loose: full email validation is famously hard, so favour a forgiving pattern.
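Both example patterns can be tried out with a minimal Python sketch. The constant names and the sample inputs are hypothetical, and re.MULTILINE is assumed here as Python's equivalent of the m flag.

```python
import re

DIGITS_ONLY = re.compile(r"^\d+$")
EMAIL_LIKE = re.compile(r"^[\w.+-]+@[\w-]+\.[a-z]{2,}$")

digit_results = {s: bool(DIGITS_ONLY.search(s)) for s in ["42", "1000", "12a", ""]}
# {'42': True, '1000': True, '12a': False, '': False}

# With the multiline flag, ^ and $ apply to every line, so findall
# returns only the lines that are made of digits.
numeric_lines = re.findall(r"^\d+$", "42\n12a\n1000", flags=re.MULTILINE)
# ['42', '1000']

email_results = {
    s: bool(EMAIL_LIKE.search(s))
    for s in [
        "name@example.com",
        "first.last+tag@mail-host.org",
        "no-at-sign.com",
        "user@mail.example.com",
    ]
}
# The first two pass; the last two are rejected.
```

The final address shows the limits of a short pattern: because [\w-]+ does not include the dot, this version turns away addresses whose domain has more than one dot. Whether that trade-off is acceptable depends on the form being validated.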
Escaping, flags and lazy matching
Because characters like ., *, ( and ?
have special meanings, you must escape them with a backslash to match them
literally. To match a real dot — say in a file extension — write \.. So
file\.txt matches “file.txt” and not “fileXtxt.” Inside a character class most
metacharacters lose their power, so [.?] simply means a dot or a question mark.
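The difference an escape makes is easy to check in Python. This is a sketch with made-up file names; re.escape is the standard-library helper that adds the backslashes for you when the text to match comes from a variable.

```python
import re

# Unescaped, the dot accepts any character, including the X.
loose = re.search(r"file.txt", "fileXtxt") is not None      # True

# Escaped, only a real dot will do.
strict_miss = re.search(r"file\.txt", "fileXtxt") is None   # True
strict_hit = re.search(r"file\.txt", "file.txt") is not None  # True

# Inside square brackets, . and ? are ordinary characters.
punctuation = re.findall(r"[.?]", "Really? Yes.")           # ['?', '.']

# re.escape builds a literal pattern from arbitrary text.
literal = re.escape("3.14")                                 # '3\\.14'
escaped_miss = re.search(literal, "3x14") is None           # True
escaped_hit = re.search(literal, "pi is 3.14") is not None  # True
```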
Flags change how the whole pattern behaves:
| Flag | Name | Effect |
|---|---|---|
| g | global | Find every match, not just the first. |
| i | case-insensitive | cat also matches “Cat” and “CAT.” |
| m | multiline | ^ and $ match at the start and end of every line, not just the whole string. |
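Flag spellings vary between languages. As a sketch, here is how the three behaviours in the table look in Python, which is an assumption made for these examples: there is no g flag, so you choose between re.search (first match) and re.findall (every match), while i and m correspond to re.IGNORECASE and re.MULTILINE. The sample text is invented.

```python
import re

text = "Cat\ncat nap\nCAT"

# "Global" is a choice of function in Python rather than a flag.
first_only = re.search(r"cat", text).group(0)                  # 'cat'
every_match = re.findall(r"cat", text, flags=re.IGNORECASE)    # ['Cat', 'cat', 'CAT']

# Without multiline mode, ^ and $ refer to the whole string.
whole_string = re.findall(r"^cat$", text, flags=re.IGNORECASE)  # []

# With it, they apply to each line; flags combine with |.
per_line = re.findall(r"^cat$", text, flags=re.IGNORECASE | re.MULTILINE)
# ['Cat', 'CAT']
```

The second line of the sample, "cat nap", is skipped in the last call because the anchors require the line to contain nothing but the word.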
Finally, quantifiers are greedy by default — they grab as much text as
possible. Against <a><b> the pattern <.*> matches
the entire string in one go, because .* swallows everything up to the last
>. Add a ? after the quantifier to make it lazy,
taking as little as possible: <.*?> matches just <a>
first, then <b> on the next pass. Greedy versus lazy is one of the most
common sources of “why did my regex match too much?” confusion.
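The greedy and lazy results described above can be reproduced with a short Python sketch. The third pattern, which uses a negated class instead of a lazy quantifier, is an addition for comparison and not something the text above prescribes.

```python
import re

tags = "<a><b>"

# Greedy: .* runs to the last > in the string.
greedy = re.findall(r"<.*>", tags)       # ['<a><b>']

# Lazy: .*? stops at the first > it can.
lazy = re.findall(r"<.*?>", tags)        # ['<a>', '<b>']

# Alternative: say exactly what may appear between the brackets.
narrow = re.findall(r"<[^>]*>", tags)    # ['<a>', '<b>']
```

The lazy and negated-class versions give the same answer here. The negated class states the intent more directly ("anything that is not a closing bracket"), which ties back to the earlier advice about preferring a narrow class over the dot.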
Common pitfalls
- Forgetting to escape. An unescaped . matches any character, so 3.14 as a pattern also matches “3x14.” Escape literal dots, slashes and brackets.
- Greedy overreach. If a match grabs far more than expected, switch the quantifier to its lazy form with ?.
- Anchoring mistakes. Without ^ and $, a “digits only” check passes any string that contains a digit. Anchor validation patterns.
- Over-engineering email and URL patterns. Perfect validation is nearly impossible; a loose pattern plus a confirmation step beats a brittle monster regex.
Regex rewards practice more than memorisation. Start with literals, add one metacharacter at a time, and check each step against real text. Within an afternoon the patterns stop looking like noise and start reading like the precise little sentences they are.