Regular expressions: a beginner’s guide

A pattern language for finding and shaping text — explained from zero.

A regular expression (regex) is a small pattern that describes a set of strings. Instead of searching for one fixed word, you write a pattern and the engine finds every piece of text that matches it. That makes regex the workhorse behind find-and-replace, form validation, log scraping and search filters. The syntax looks cryptic at first, but it’s built from a handful of pieces. Once you know them, most patterns become readable. The best way to learn is to build a pattern and watch what it matches, so keep our regex tester open in another tab as you read.

Literals and character classes

Most characters in a pattern are literals: the pattern cat matches the letters c, a, t in that order, anywhere they appear. The power comes from metacharacters that mean “a kind of character” rather than one exact letter. A character class in square brackets matches any single character from the set: [abc] matches one a, b, or c. Use a hyphen for a range, so [a-z] matches any lowercase letter and [0-9] any digit. Put a caret first to negate the class: [^0-9] matches any character that is not a digit.
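To make this concrete, here is a small sketch in TypeScript (an assumption: the guide itself does not name a language, and other engines may differ in details). The sample strings are made up for illustration.

```typescript
// Character classes: each bracket expression matches exactly one character.
const anyOfAbc = /[abc]/;   // one a, b, or c
const lowercase = /[a-z]/;  // one lowercase letter
const notADigit = /[^0-9]/; // one character that is not a digit

console.log(anyOfAbc.test("xyzb"));   // true: "b" is in the set
console.log(lowercase.test("HELLO")); // false: no lowercase letter anywhere
console.log(notADigit.test("2024"));  // false: every character is a digit
console.log(notADigit.test("20a4"));  // true: "a" is not a digit

// A literal pattern matches wherever its letters appear in order.
const catIndex = "concatenate".search(/cat/);
console.log(catIndex); // 3
```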

Common classes have shorthands, each with an uppercase negation:

Token  Matches                                 Negation  Negation matches
\d     A digit, same as [0-9]                  \D        Any non-digit
\w     A word character: letters, digits, _    \W        Any non-word character
\s     Whitespace: space, tab, newline         \S        Any non-whitespace
.      Any single character (except newline)

The dot . is the wildcard: it matches almost anything. Because it is so permissive, it is easy to overuse; beginners often reach for . when a narrower class like \d or [a-z] would be safer and clearer.
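The shorthands are easiest to appreciate side by side. The sketch below assumes a JavaScript-style engine (written here as TypeScript) and an invented sample string; the g flag simply asks for every match rather than the first.

```typescript
const sample = "Order 66 shipped_on 2024-05-01";

// \d+ collects each run of digits.
const digitRuns = sample.match(/\d+/g);
console.log(digitRuns); // ["66", "2024", "05", "01"]

// \w+ collects runs of letters, digits and underscores.
const words = sample.match(/\w+/g);
console.log(words); // ["Order", "66", "shipped_on", "2024", "05", "01"]

// \s+ is handy for splitting on whitespace.
const parts = sample.split(/\s+/);
console.log(parts); // ["Order", "66", "shipped_on", "2024-05-01"]

// The dot accepts nearly anything, so a.c is much looser than a\dc.
console.log(/a.c/.test("a-c"));   // true
console.log(/a\dc/.test("a-c"));  // false
console.log(/a\dc/.test("a7c"));  // true
```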

Anchors and boundaries

By default a pattern can match anywhere inside the text. Anchors pin it to a position instead of consuming a character. ^ means “start of the string” (or start of a line in multiline mode) and $ means “end.” So ^\d+$ matches a string that is only digits from beginning to end — useful for validating that an input is a whole number with nothing else mixed in.
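A quick way to see what the anchors buy you is to run the same digits pattern with and without them. This is a minimal TypeScript sketch; the variable names are illustrative only.

```typescript
const containsDigits = /\d+/; // unanchored: digits anywhere in the text
const onlyDigits = /^\d+$/;   // anchored: digits from start to end

console.log(containsDigits.test("abc123")); // true: "123" appears inside
console.log(onlyDigits.test("abc123"));     // false: "abc" is in the way
console.log(onlyDigits.test("123"));        // true
console.log(onlyDigits.test(""));           // false: + needs at least one digit
```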

The word boundary \b matches the invisible edge between a word character and a non-word character. The pattern \bcat\b matches “cat” as a whole word but not the “cat” inside “category” or “concatenate.” Boundaries are how you avoid the classic find-and-replace disaster of mangling words that merely contain your target.
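The find-and-replace disaster is easy to reproduce. The sketch below (TypeScript, with an invented sentence) replaces "cat" with "dog" twice, once naively and once with word boundaries.

```typescript
const sentence = "The cat sat near the category list; concatenate cat.";

// Without boundaries, every "cat" is replaced, even inside longer words.
const naive = sentence.replace(/cat/g, "dog");
console.log(naive);
// "The dog sat near the dogegory list; condogenate dog."

// With \b on both sides, only the standalone word is replaced.
const wholeWord = sentence.replace(/\bcat\b/g, "dog");
console.log(wholeWord);
// "The dog sat near the category list; concatenate dog."
```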

Quantifiers, groups and alternation

Quantifiers say how many times the preceding item may repeat. There are four you’ll use constantly:

  • * — zero or more (the item is optional and repeatable)
  • + — one or more (at least one required)
  • ? — zero or one (optional, appears at most once)
  • {n,m} — between n and m times; {3} is exactly three, {2,} is two or more
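Each of the four quantifiers can be tried in isolation. The patterns below are a TypeScript sketch with made-up inputs; they are anchored with ^ and $ so that the whole string has to fit the pattern.

```typescript
const zeroOrMore = /^ab*c$/;    // a, any number of b (including none), then c
const oneOrMore = /^ab+c$/;     // at least one b required
const optionalU = /^colou?r$/;  // the u may appear once or not at all
const exactlyThree = /^\d{3}$/; // exactly three digits
const twoOrMore = /^\d{2,}$/;   // two digits or more
const twoToFour = /^\d{2,4}$/;  // between two and four digits

console.log(zeroOrMore.test("ac"), zeroOrMore.test("abbbc"));     // true true
console.log(oneOrMore.test("ac"), oneOrMore.test("abc"));         // false true
console.log(optionalU.test("color"), optionalU.test("colour"));   // true true
console.log(optionalU.test("colouur"));                           // false
console.log(exactlyThree.test("123"), exactlyThree.test("1234")); // true false
console.log(twoOrMore.test("7"), twoOrMore.test("7777777"));      // false true
console.log(twoToFour.test("123"), twoToFour.test("12345"));      // true false
```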

Groups in round brackets bundle part of a pattern so a quantifier applies to the whole bundle, and they also capture what they matched for later reuse. Alternation with the pipe | means “or.” Combine them and (cat|dog)s? matches “cat,” “cats,” “dog,” or “dogs.” The group makes s? apply to either option.
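Here is the (cat|dog)s? pattern at work, sketched in TypeScript. It also shows the "later reuse" part: in JavaScript-style engines the first group's text is available as element 1 of the match and as $1 in a replacement string. The sample sentences are invented.

```typescript
// Anchored so that the whole word must be one of the four options.
const pets = /^(cat|dog)s?$/;
console.log(["cat", "cats", "dog", "dogs"].map(function (w) { return pets.test(w); }));
// [true, true, true, true]
console.log(pets.test("catdog")); // false

// The group captures which alternative matched.
const found = /(cat|dog)s?/.exec("I have two dogs");
console.log(found ? found[0] : null); // "dogs" (the whole match)
console.log(found ? found[1] : null); // "dog"  (the first group)

// $1 reuses the captured text in a replacement.
const bracketed = "cats and dog".replace(/(cat|dog)s?/g, "[$1]");
console.log(bracketed); // "[cat] and [dog]"
```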

Digits only

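The pattern ^\d+$ accepts 42 or 1000 but rejects 12a or an empty input. Add the m flag to work line by line: ^ and $ then match at the start and end of each line, so the pattern finds every line that contains only digits.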
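One subtlety is worth seeing in code. With the m flag, a single test succeeds as soon as any one line is all digits, so it does not prove that every line is. The TypeScript sketch below uses an invented three-line input and shows one possible way to check each line separately.

```typescript
const multiLine = "42\n12a\n1000";

// No m flag: ^ and $ refer to the whole string, which is not all digits.
const wholeStringOk = /^\d+$/.test(multiLine);
console.log(wholeStringOk); // false

// m flag: passes because at least one line is all digits.
const someLineOk = /^\d+$/m.test(multiLine);
console.log(someLineOk); // true

// g and m together list the lines that qualify.
const numericLines = multiLine.match(/^\d+$/gm);
console.log(numericLines); // ["42", "1000"]

// To require that every line is numeric, test the lines one at a time.
const allLinesOk = multiLine.split("\n").every(function (line) {
  return /^\d+$/.test(line);
});
console.log(allLinesOk); // false: "12a" fails
```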

A simple email-like check

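^[\w.+-]+@[\w-]+\.[a-z]{2,}$ matches “[email protected].” It is deliberately loose: full email validation is famously hard, so favour a forgiving pattern. Note that, as written, it still rejects domains with more than one dot (a subdomain, say) and, without the i flag, an uppercase ending such as .COM.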
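To get a feel for what this pattern accepts, try it on a few addresses. The ones below are made up for illustration (they use the reserved example.com domain), and the code is a TypeScript sketch rather than a recommended validator.

```typescript
const emailLike = /^[\w.+-]+@[\w-]+\.[a-z]{2,}$/;

console.log(emailLike.test("jane.doe+news@example.com")); // true
console.log(emailLike.test("jane@example"));              // false: no dot and ending
console.log(emailLike.test("jane.example.com"));          // false: no @ sign

// Two places where the pattern is stricter than it looks:
console.log(emailLike.test("jane@mail.example.com")); // false: [\w-]+ cannot cross a dot
console.log(emailLike.test("jane@example.COM"));      // false: [a-z] is lowercase only

// Adding the i flag fixes the second case.
const emailLikeAnyCase = /^[\w.+-]+@[\w-]+\.[a-z]{2,}$/i;
console.log(emailLikeAnyCase.test("jane@example.COM")); // true
```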

Escaping, flags and lazy matching

Because characters like ., *, ( and ? have special meanings, you must escape them with a backslash to match them literally. To match a real dot — say in a file extension — write \.. So file\.txt matches “file.txt” and not “fileXtxt.” Inside a character class most metacharacters lose their power, so [.?] simply means a dot or a question mark.
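The difference an escape makes shows up immediately in a test. The TypeScript sketch below also includes escapeForRegex, a hypothetical helper (not part of any standard library) that backslash-escapes metacharacters so arbitrary text can be used as a literal pattern.

```typescript
// Unescaped dot: any character fits between "file" and "txt".
console.log(/file.txt/.test("fileXtxt"));  // true
// Escaped dot: only a real dot fits.
console.log(/file\.txt/.test("fileXtxt")); // false
console.log(/file\.txt/.test("file.txt")); // true

// Inside a class, . and ? are plain characters.
const punctuation = "Really? Yes.".match(/[.?]/g);
console.log(punctuation); // ["?", "."]

// Hypothetical helper: put a backslash before each metacharacter.
function escapeForRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const escapedPi = escapeForRegex("3.14");
console.log(escapedPi);                          // 3\.14
console.log(new RegExp(escapedPi).test("3x14")); // false
console.log(new RegExp(escapedPi).test("3.14")); // true
```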

Flags change how the whole pattern behaves:

Flag  Name              Effect
g     global            Find every match, not just the first.
i     case-insensitive  cat also matches “Cat” and “CAT.”
m     multiline         ^ and $ match at the start and end of every line, not just the whole string.
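In JavaScript-style engines the flags are written after the closing slash of the pattern. A short TypeScript sketch, using invented sample text, shows each one changing the result:

```typescript
const phrase = "Cat, cat and CAT";

// No flags: case-sensitive, first match only.
const firstOnly = phrase.replace(/cat/, "dog");
console.log(firstOnly); // "Cat, dog and CAT"

// g: every case-sensitive match.
const globalOnly = phrase.match(/cat/g);
console.log(globalOnly); // ["cat"]

// g and i together: every match, ignoring case.
const globalAnyCase = phrase.match(/cat/gi);
console.log(globalAnyCase); // ["Cat", "cat", "CAT"]

// m: ^ and $ apply to each line instead of the whole string.
const twoLines = "first\nsecond";
console.log(twoLines.match(/^\w+$/g));  // null: the whole string is not one word
const perLine = twoLines.match(/^\w+$/gm);
console.log(perLine); // ["first", "second"]
```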

Finally, quantifiers are greedy by default — they grab as much text as possible. Against <a><b> the pattern <.*> matches the entire string in one go, because .* swallows everything up to the last >. Add a ? after the quantifier to make it lazy, taking as little as possible: <.*?> matches just <a> first, then <b> on the next pass. Greedy versus lazy is one of the most common sources of “why did my regex match too much?” confusion.
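The <a><b> example can be checked directly. The TypeScript sketch below runs the greedy and lazy forms, and adds one common alternative as an assumption of mine rather than something this guide prescribes: a negated class that can never run past the closing bracket.

```typescript
const tags = "<a><b>";

// Greedy: .* runs to the last > it can find.
const greedy = tags.match(/<.*>/g);
console.log(greedy); // ["<a><b>"]

// Lazy: .*? stops at the first > that lets the pattern succeed.
const lazy = tags.match(/<.*?>/g);
console.log(lazy); // ["<a>", "<b>"]

// Alternative: [^>]* matches anything except >, so it cannot overreach.
const negated = tags.match(/<[^>]*>/g);
console.log(negated); // ["<a>", "<b>"]
```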

Don’t write regex blind. Paste your pattern and some sample text into our regex tester to see matches highlighted live as you type, with each capture group broken out. When you tweak a pattern and want to confirm exactly what changed in the output, drop the before and after into the text diff tool. Both run entirely in your browser.

Common pitfalls

  • Forgetting to escape. An unescaped . matches any character, so 3.14 as a pattern also matches “3x14.” Escape literal dots, brackets and backslashes, plus forward slashes when the pattern is written between / delimiters.
  • Greedy overreach. If a match grabs far more than expected, switch the quantifier to its lazy form with ?.
  • Anchoring mistakes. Without ^ and $, a “digits only” check passes any string that contains a digit. Anchor validation patterns.
  • Over-engineering email and URL patterns. Perfect validation is nearly impossible; a loose pattern plus a confirmation step beats a brittle monster regex.

Regex rewards practice more than memorisation. Start with literals, add one metacharacter at a time, and check each step against real text. Within an afternoon the patterns stop looking like noise and start reading like the precise little sentences they are.
