Python Regular Expressions: A Comprehensive Guide to the re Module

Regular expressions (regex) are a powerful tool for defining search patterns within text. Python's built-in re module provides comprehensive support for regex operations, including pattern matching, string searching, text extraction, and data validation.

What is a Regular Expression?

A regular expression is a sequence of characters that specifies a search pattern. It is primarily used for:

  • String Validation: Checking if a string conforms to a specific format (e.g., email addresses, phone numbers).
  • Pattern Extraction: Pulling out specific pieces of information from text based on defined patterns.
  • Text Substitution: Replacing occurrences of a pattern within a string with another string.
  • Splitting Strings: Breaking a string into a list of substrings based on occurrences of a pattern.

Importing Python's re Module

To begin using regular expressions in Python, you need to import the re module:

import re

Basic Regex Functions in Python

The re module offers several key functions for working with regular expressions:

| Function | Description |
| --- | --- |
| re.match(pattern, string) | Checks for a match only at the beginning of the string. |
| re.search(pattern, string) | Scans the entire string for the first occurrence of the pattern. |
| re.findall(pattern, string) | Returns a list of all non-overlapping matches in the string. |
| re.finditer(pattern, string) | Returns an iterator yielding match objects for all non-overlapping matches. |
| re.sub(pattern, repl, string) | Replaces occurrences of the pattern with the repl string. |
| re.split(pattern, string) | Splits the string by occurrences of the pattern. |
| re.compile(pattern) | Compiles a pattern into a regex object, which can improve performance if reused. |

Regular Expression Syntax and Common Metacharacters

Regular expressions use special characters (metacharacters) to define complex patterns. Here's a look at some of the most common ones:

| Symbol | Meaning | Example Pattern | Matches |
| --- | --- | --- | --- |
| `.` | Any character except a newline | `c.t` | "cat", "cut", "cot" |
| `^` | Start of string | `^a` | "apple" (but not "banana") |
| `$` | End of string | `end$` | "weekend" (but not "ending") |
| `*` | 0 or more repetitions | `go*` | "g", "go", "goo" |
| `+` | 1 or more repetitions | `go+` | "go", "goo" |
| `?` | 0 or 1 repetition | `go?` | "g", "go" |
| `{n}` | Exactly n repetitions | `a{3}` | "aaa" |
| `{n,m}` | Between n and m repetitions | `a{2,4}` | "aa", "aaa", "aaaa" |
| `[]` | Set of characters | `[abc]` | "a", "b", or "c" |
| `[^]` | Negated set: any character not in the set | `[^abc]` | Any character except "a", "b", or "c" |
| `\d` | Any digit (0-9) | `\d{3}` | "123", "456" |
| `\D` | Any non-digit | | |
| `\w` | Word character (alphanumeric + underscore) | `\w+` | "word1", "data_123" |
| `\W` | Non-word character | | |
| `\s` | Whitespace character | | |
| `\S` | Non-whitespace character | | |
| `\|` | OR operator | `cat\|dog` | "cat" or "dog" |
| `()` | Grouping | `(abc)+` | "abc", "abcabc" (groups subexpressions for repetition or capturing) |
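
As a quick check of the table above, the following sketch runs a handful of the metacharacters through re.findall(). The sample strings are made up for illustration and are not from any particular dataset.

```python
import re

# (pattern, sample text) pairs; the sample strings are illustrative only.
samples = [
    (r"c.t", "cat cut c\nt cot"),    # '.' does not match the newline in "c\nt"
    (r"go*", "g go goo"),            # 'o' repeated zero or more times
    (r"go+", "g go goo"),            # 'o' repeated one or more times
    (r"a{2,4}", "a aa aaaaa"),       # greedy: takes up to four 'a's at a time
    (r"[^abc]", "abcd"),             # any character outside the set
    (r"cat|dog", "cat, dog, bird"),  # alternation
]

results = {}
for pattern, text in samples:
    results[pattern] = re.findall(pattern, text)
    print(f"{pattern:10} -> {results[pattern]}")

# c.t        -> ['cat', 'cut', 'cot']
# go*        -> ['g', 'go', 'goo']
# go+        -> ['go', 'goo']
# a{2,4}     -> ['aa', 'aaaa']
# [^abc]     -> ['d']
# cat|dog    -> ['cat', 'dog']
```

Note that quantifiers are greedy by default, which is why `a{2,4}` consumes four of the five consecutive "a" characters and leaves a single unmatched "a".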

Practical Regex Examples in Python

Let's explore how to use these functions with practical examples.

1. re.match() – Match Only at the Start

This function checks if the pattern matches from the very beginning of the string.

import re

result = re.match(r"Hello", "Hello World")
if result:
    print(result.group())  # Output: Hello
else:
    print("No match at the start.")

result_no_match = re.match(r"World", "Hello World")
if result_no_match:
    print(result_no_match.group())
else:
    print("No match at the start.") # Output: No match at the start.
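
Because re.match() only anchors at the start, it still succeeds when extra text follows the match. The sketch below contrasts it with re.search() and with re.fullmatch(), a related function in the re module that requires the pattern to cover the whole string. The variable names are illustrative.

```python
import re

text = "Hello World"

# match: anchored at the start only; trailing text is allowed.
starts_with_hello = re.match(r"Hello", text) is not None

# search: the pattern may match anywhere in the string.
contains_world = re.search(r"World", text) is not None

# fullmatch: the pattern must consume the entire string.
is_exactly_hello = re.fullmatch(r"Hello", text) is not None
is_two_words = re.fullmatch(r"Hello \w+", text) is not None

print(starts_with_hello, contains_world, is_exactly_hello, is_two_words)
# Output: True True False True
```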

2. re.search() – Find First Match Anywhere

This function scans the entire string and returns the first match it finds.

import re

text = "The year is 2024, not 2023."
result = re.search(r"\d{4}", text) # Find the first sequence of 4 digits
if result:
    print(f"Found: {result.group()} at position {result.start()}") # Output: Found: 2024 at position 12
else:
    print("No 4-digit number found.")
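
The match object returned by re.search() also exposes any capturing groups in the pattern. A minimal sketch, reusing the sample sentence with a pattern chosen here purely for illustration:

```python
import re

text = "The year is 2024, not 2023."

# Two capturing groups: the first year and the second year.
m = re.search(r"(\d{4}), not (\d{4})", text)
if m:
    print(m.group(0))  # whole match: 2024, not 2023
    print(m.group(1))  # first group: 2024
    print(m.groups())  # all groups: ('2024', '2023')
    print(m.span())    # (start, end) of the whole match: (12, 26)
```

Always check that the result is not None before calling group(), since re.search() returns None when nothing matches.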

3. re.findall() – Find All Matches

This function returns a list containing all non-overlapping matches of the pattern in the string.

import re

text = "Phone numbers: 123-456-7890, 987-654-3210"
matches = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(matches)
# Output: ['123-456-7890', '987-654-3210']
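
One behavior that often surprises people: when the pattern contains capturing groups, re.findall() returns the groups rather than the whole match. The sketch below shows both cases on the same sample text; the variable names are illustrative.

```python
import re

text = "Phone numbers: 123-456-7890, 987-654-3210"

# With several capturing groups, each result is a tuple of the groups.
parts = re.findall(r"(\d{3})-(\d{3})-(\d{4})", text)
print(parts)
# Output: [('123', '456', '7890'), ('987', '654', '3210')]

# A non-capturing group (?:...) groups without capturing,
# so findall returns the full matched strings again.
full = re.findall(r"(?:\d{3}-){2}\d{4}", text)
print(full)
# Output: ['123-456-7890', '987-654-3210']
```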

4. re.sub() – Replace Matches

This function replaces all occurrences of a pattern in a string with a specified replacement string.

import re

text = "Today is 2025-05-22, tomorrow is 2025-05-23."
new_text = re.sub(r"\d{4}-\d{2}-\d{2}", "DATE", text)
print(new_text)
# Output: Today is DATE, tomorrow is DATE.
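
The replacement does not have to be a fixed string. As a sketch, the example below uses backreferences to reorder the captured date parts, a callable to compute the replacement, and the count argument to limit how many matches are replaced. The helper name mask_day is hypothetical.

```python
import re

text = "Today is 2025-05-22, tomorrow is 2025-05-23."
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Backreferences \1, \2, \3 refer to the captured groups.
reordered = re.sub(date_pattern, r"\3/\2/\1", text)
print(reordered)
# Output: Today is 22/05/2025, tomorrow is 23/05/2025.

# A function receives each match object and returns the replacement text.
def mask_day(match):
    return f"{match.group(1)}-{match.group(2)}-XX"

masked = re.sub(date_pattern, mask_day, text)
print(masked)
# Output: Today is 2025-05-XX, tomorrow is 2025-05-XX.

# count=1 replaces only the first occurrence.
first_only = re.sub(date_pattern, "DATE", text, count=1)
print(first_only)
# Output: Today is DATE, tomorrow is 2025-05-23.
```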

5. re.split() – Split String by Pattern

This function splits a string into a list of substrings wherever the pattern matches.

import re

data = "apple,banana;grape|mango"
parts = re.split(r"[,;|]", data) # Split by comma, semicolon, or pipe
print(parts)
# Output: ['apple', 'banana', 'grape', 'mango']
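
Two variations are worth knowing, sketched below with illustrative variable names: wrapping the delimiter in a capturing group keeps the delimiters in the result, and maxsplit limits the number of splits. Allowing optional whitespace around the delimiter also avoids stray spaces in the pieces.

```python
import re

data = "apple,banana;grape|mango"

# A capturing group keeps the delimiters in the output list.
with_delims = re.split(r"([,;|])", data)
print(with_delims)
# Output: ['apple', ',', 'banana', ';', 'grape', '|', 'mango']

# maxsplit=1 stops after the first split.
head_tail = re.split(r"[,;|]", data, maxsplit=1)
print(head_tail)
# Output: ['apple', 'banana;grape|mango']

# \s* on both sides absorbs whitespace around each delimiter.
spaced = "apple , banana;  grape | mango"
trimmed = re.split(r"\s*[,;|]\s*", spaced)
print(trimmed)
# Output: ['apple', 'banana', 'grape', 'mango']
```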

6. re.finditer() – Iterator for Match Objects

This function returns an iterator that yields match objects for each match found. Match objects contain information like the matched string and its start/end positions.

import re

text = "abc123xyz456"
for match in re.finditer(r"\d+", text):
    print(f"Matched '{match.group()}' starting at index {match.start()}")
# Output:
# Matched '123' starting at index 3
# Matched '456' starting at index 9
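
Since each item is a full match object, re.finditer() pairs well with named groups. The following sketch builds one dictionary per match using groupdict(); the group names letters and digits are chosen here for illustration.

```python
import re

text = "abc123xyz456"

# (?P<name>...) defines a named capturing group.
pattern = r"(?P<letters>[a-z]+)(?P<digits>\d+)"

records = [
    {**m.groupdict(), "span": m.span()}
    for m in re.finditer(pattern, text)
]
print(records)
# Output:
# [{'letters': 'abc', 'digits': '123', 'span': (0, 6)},
#  {'letters': 'xyz', 'digits': '456', 'span': (6, 12)}]
```

Because the iterator is lazy, this approach can also process very large inputs without building the full list of matches first.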

Using re.compile() – Precompiled Patterns

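If you intend to use the same regular expression pattern many times, you can compile it once with re.compile(), which returns a reusable regex object. The module-level functions already cache recently compiled patterns, so the speedup is usually modest; the main benefits are skipping the repeated cache lookup in tight loops and giving the pattern a descriptive name.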

import re

# Compile a pattern to match words of exactly 4 word characters
# (\w also matches digits and underscores, not only letters)
pattern = re.compile(r"\b\w{4}\b")

text = "This will find four word hits in this sentence."
matches = pattern.findall(text)
print(matches)
# Output: ['This', 'will', 'find', 'four', 'word', 'hits', 'this']
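
re.compile() also accepts flags. As a sketch, the pattern below restricts the match to letters only and uses re.VERBOSE so the pattern can carry its own comments, combined with re.IGNORECASE. The sample sentence is extended with two made-up tokens to show what gets excluded.

```python
import re

# re.VERBOSE ignores whitespace and comments inside the pattern;
# re.IGNORECASE lets [a-z] match uppercase letters too.
four_letters = re.compile(
    r"""
    \b          # word boundary
    [a-z]{4}    # exactly four letters (no digits or underscores)
    \b          # word boundary
    """,
    re.IGNORECASE | re.VERBOSE,
)

text = "This will find four word hits in this sentence, not ab12 or a_b1."
matches = four_letters.findall(text)
print(matches)
# Output: ['This', 'will', 'find', 'four', 'word', 'hits', 'this']
```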

Real-World Examples

Validate an Email Address

A common use case for regex is input validation.

import re

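def is_valid_email(email):
    # A common, deliberately simple pattern; it does not cover every
    # address format permitted by the email standards
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    # fullmatch requires the whole string to match, so input with a
    # trailing newline is rejected
    return re.fullmatch(pattern, email) is not None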

print(f"'test@example.com' is valid: {is_valid_email('test@example.com')}")
# Output: 'test@example.com' is valid: True
print(f"'invalid-email' is valid: {is_valid_email('invalid-email')}")
# Output: 'invalid-email' is valid: False
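
One subtlety when validating with a `$`-anchored pattern: `$` also matches just before a trailing newline, so re.match() accepts input such as "test@example.com\n". The sketch below demonstrates the difference using re.fullmatch(); the variable names are illustrative.

```python
import re

pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
candidate = "test@example.com\n"  # note the trailing newline

# '$' matches before the final newline, so match() succeeds.
accepted_by_match = re.match(pattern, candidate) is not None

# fullmatch() must consume the entire string, newline included.
accepted_by_fullmatch = re.fullmatch(pattern, candidate) is not None

print(accepted_by_match, accepted_by_fullmatch)
# Output: True False
```

For validation, re.fullmatch() is therefore usually the safer choice.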

Extract Dates from Text

You can extract structured information like dates from unstructured text.

import re

text = "Important dates: 2023-01-01, 2025-12-31, and upcoming event on 2024-07-15."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)
# Output: ['2023-01-01', '2025-12-31', '2024-07-15']
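
The pattern `\d{4}-\d{2}-\d{2}` checks only the shape of a date, so it also matches impossible values such as 2024-02-30. One way to filter these out, sketched here with a hypothetical helper extract_valid_dates and made-up sample text, is to pass each candidate to datetime.strptime():

```python
import re
from datetime import datetime

def extract_valid_dates(text):
    """Return date objects for every YYYY-MM-DD string that is a real date."""
    valid = []
    for candidate in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text):
        try:
            valid.append(datetime.strptime(candidate, "%Y-%m-%d").date())
        except ValueError:
            # Right shape, but not a real calendar date; skip it.
            pass
    return valid

sample = "Dates: 2023-01-01, 2024-02-30, 2025-13-01, 2024-07-15."
valid_dates = extract_valid_dates(sample)
print([d.isoformat() for d in valid_dates])
# Output: ['2023-01-01', '2024-07-15']
```

Leaving calendar rules to the datetime module keeps the regex short and readable.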

Summary

Python's re module offers a robust set of tools for text pattern matching, data extraction, and string manipulation using regular expressions. Mastering regex is an invaluable skill for tasks ranging from validating user input and extracting structured data to performing complex text processing operations.

Interview Questions on Python Regex

  • What is the primary purpose of the re module in Python?
  • Explain the difference between re.match() and re.search(). When would you choose one over the other?
  • How does re.findall() differ from re.finditer()?
  • Describe the functionality of re.sub() and provide a simple example.
  • List and explain some of the most frequently used regex special characters and their meanings.
  • How can you validate an email address using a regular expression in Python?
  • What does re.compile() do, and in what scenarios is it beneficial to use?
  • How would you split a string using multiple different delimiters using a single regex pattern?
  • Write a regex pattern to extract all words consisting of exactly four letters from a given sentence.
  • What are some best practices for writing regular expressions that are both efficient and easy to read/maintain?