Python Regex: Master Text Pattern Matching with the `re` Module
Unlock the power of Python's `re` module for advanced text pattern matching, data extraction, and validation. A comprehensive guide for AI and ML professionals.
Python Regular Expressions: A Comprehensive Guide to the `re` Module
Regular expressions (regex) are a powerful tool for defining search patterns within text. Python's built-in `re` module provides comprehensive support for regex operations, including pattern matching, string searching, text extraction, and data validation.
What is a Regular Expression?
A regular expression is a sequence of characters that specifies a search pattern. It is primarily used for:
- String Validation: Checking if a string conforms to a specific format (e.g., email addresses, phone numbers).
- Pattern Extraction: Pulling out specific pieces of information from text based on defined patterns.
- Text Substitution: Replacing occurrences of a pattern within a string with another string.
- Splitting Strings: Breaking a string into a list of substrings based on occurrences of a pattern.
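The four uses above can be sketched in a few lines. The sample strings and variable names are invented for illustration, and `re.fullmatch()` is used for the validation case because it requires the pattern to cover the whole string.

```python
import re

# Validation: does the whole string consist of exactly five digits?
is_zip = re.fullmatch(r"\d{5}", "90210") is not None

# Extraction: pull every run of digits out of a sentence.
numbers = re.findall(r"\d+", "3 apples and 12 pears")

# Substitution: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", "too    many   spaces")

# Splitting: break on commas with optional surrounding spaces.
fields = re.split(r"\s*,\s*", "a , b,c")

print(is_zip, numbers, cleaned, fields)
```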
Importing Python's `re` Module
To begin using regular expressions in Python, you need to import the `re` module:
import re
Basic Regex Functions in Python
The `re` module offers several key functions for working with regular expressions:
Function | Description |
---|---|
re.match(pattern, string) | Checks for a match only at the beginning of the string. |
re.search(pattern, string) | Scans the entire string for the first occurrence of the pattern. |
re.findall(pattern, string) | Returns a list of all non-overlapping matches in the string. |
re.finditer(pattern, string) | Returns an iterator yielding match objects for all non-overlapping matches. |
re.sub(pattern, repl, string) | Replaces occurrences of the pattern with the repl string. |
re.split(pattern, string) | Splits the string by occurrences of the pattern. |
re.compile(pattern) | Compiles a pattern into a regex object, which can improve performance if reused. |
Regular Expression Syntax and Common Metacharacters
Regular expressions use special characters (metacharacters) to define complex patterns. Here's a look at some of the most common ones:
Symbol | Meaning | Example Pattern | Matches |
---|---|---|---|
. | Any character except a newline | c.t | "cat", "cut", "cot" |
^ | Start of string | ^a | "apple" (but not "banana") |
$ | End of string | end$ | "weekend" (but not "ending") |
* | 0 or more repetitions | go* | "g", "go", "goo" |
+ | 1 or more repetitions | go+ | "go", "goo" |
? | 0 or 1 repetition | go? | "g", "go" |
{n} | Exactly n repetitions | a{3} | "aaa" |
{n,m} | Between n and m repetitions | a{2,4} | "aa", "aaa", "aaaa" |
[] | Set of characters | [abc] | "a", "b", or "c" |
[^] | Negation – characters not in the set | [^abc] | Any character except "a", "b", or "c" |
\d | Any digit (0-9) | \d{3} | "123", "456" |
\D | Any non-digit | | |
\w | Word character (alphanumeric + underscore) | \w+ | "word1", "data_123" |
\W | Non-word character | | |
\s | Whitespace character | | |
\S | Non-whitespace character | | |
`\|` | OR operator (alternation) | `cat\|dog` | "cat" or "dog" |
() | Grouping | (abc)+ | "abc", "abcabc" (groups subexpressions for repetition or capturing) |
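As a quick check of the table, the sketch below runs a few of these patterns through `re.fullmatch()`, which succeeds only when the pattern covers the whole candidate string. The `checks` list and its test strings are illustrative choices, not part of the `re` API.

```python
import re

# Each entry: (pattern, candidate string, expected full-match result)
checks = [
    (r"c.t", "cut", True),        # . matches any single character
    (r"go*", "g", True),          # * allows zero "o"s
    (r"go+", "g", False),         # + requires at least one "o"
    (r"a{2,4}", "aaaaa", False),  # five "a"s exceed the upper bound
    (r"[^abc]", "d", True),       # negated set
    (r"\d{3}", "12a", False),     # \d matches digits only
    (r"cat|dog", "dog", True),    # alternation
    (r"(abc)+", "abcabc", True),  # a group repeated twice
]

results = [re.fullmatch(p, s) is not None for p, s, _ in checks]
for (pattern, text, expected), got in zip(checks, results):
    print(f"{pattern!r:12} {text!r:10} -> {got}")
```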
Practical Regex Examples in Python
Let's explore how to use these functions with practical examples.
1. `re.match()` – Match Only at the Start
This function checks if the pattern matches from the very beginning of the string.
import re

result = re.match(r"Hello", "Hello World")
if result:
    print(result.group())  # Output: Hello
else:
    print("No match at the start.")

result_no_match = re.match(r"World", "Hello World")
if result_no_match:
    print(result_no_match.group())
else:
    print("No match at the start.")  # Output: No match at the start.
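Note that `re.match()` anchors only the start of the match, not the end. The sketch below contrasts it with `re.fullmatch()` and with `re.search()` using an explicit `^`. The variable names are illustrative.

```python
import re

text = "Hello World"

# re.match anchors at the start but does not require the match to reach the end.
prefix = re.match(r"Hello", text)

# re.fullmatch succeeds only if the pattern covers the entire string.
whole = re.fullmatch(r"Hello", text)

# re.search with an explicit ^ behaves like re.match for single-line input.
anchored = re.search(r"^Hello", text)

print(prefix is not None, whole is not None, anchored is not None)
```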
2. `re.search()` – Find First Match Anywhere
This function scans the entire string and returns the first match it finds.
import re
text = "The year is 2024, not 2023."
result = re.search(r"\d{4}", text) # Find the first sequence of 4 digits
if result:
    print(f"Found: {result.group()} at position {result.start()}")  # Output: Found: 2024 at position 12
else:
    print("No 4-digit number found.")
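A match object can also expose parts of the match through groups. The following sketch uses Python's named-group syntax `(?P<name>...)`. The group name `year` and the surrounding literal text in the pattern are assumptions made for this example.

```python
import re

text = "The year is 2024, not 2023."

# The named group captures only the digits; the literal prefix gives context.
result = re.search(r"year is (?P<year>\d{4})", text)

if result:
    year = result.group("year")         # text captured by the named group
    year_start = result.start("year")   # index where the group begins
    whole_span = result.span()          # (start, end) of the entire match
    print(year, year_start, whole_span)
```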
3. `re.findall()` – Find All Matches
This function returns a list containing all non-overlapping matches of the pattern in the string.
import re
text = "Phone numbers: 123-456-7890, 987-654-3210"
matches = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(matches)
# Output: ['123-456-7890', '987-654-3210']
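One behavior that often surprises people: when the pattern contains capturing groups, `re.findall()` returns the groups rather than the full matches. A small sketch reusing the phone-number text from above:

```python
import re

text = "Phone numbers: 123-456-7890, 987-654-3210"

# No groups: each item is the full matched string.
full = re.findall(r"\d{3}-\d{3}-\d{4}", text)

# One group: each item is just that group's text.
area_codes = re.findall(r"(\d{3})-\d{3}-\d{4}", text)

# Several groups: each item is a tuple of the groups.
parts = re.findall(r"(\d{3})-(\d{3})-(\d{4})", text)

print(area_codes)
print(parts)
```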
4. `re.sub()` – Replace Matches
This function replaces all occurrences of a pattern in a string with a specified replacement string.
import re
text = "Today is 2025-05-22, tomorrow is 2025-05-23."
new_text = re.sub(r"\d{4}-\d{2}-\d{2}", "DATE", text)
print(new_text)
# Output: Today is DATE, tomorrow is DATE.
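The replacement does not have to be a fixed string. As a sketch, the example below shows backreferences, the `count` argument, and a replacement function. The helper `mask_year` is a hypothetical name introduced only for this illustration.

```python
import re

text = "Today is 2025-05-22, tomorrow is 2025-05-23."

# Backreferences (\1, \2, \3) reuse captured groups: reorder to DD/MM/YYYY.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# count=1 limits the substitution to the first match.
first_only = re.sub(r"\d{4}-\d{2}-\d{2}", "DATE", text, count=1)

# A function can compute each replacement from the match object.
def mask_year(match):
    return "YYYY" + match.group()[4:]

masked = re.sub(r"\d{4}-\d{2}-\d{2}", mask_year, text)

print(reordered)
print(first_only)
print(masked)
```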
5. `re.split()` – Split String by Pattern
This function splits a string into a list of substrings wherever the pattern matches.
import re
data = "apple,banana;grape|mango"
parts = re.split(r"[,;|]", data) # Split by comma, semicolon, or pipe
print(parts)
# Output: ['apple', 'banana', 'grape', 'mango']
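Two further options of `re.split()` may be useful here: `maxsplit` caps the number of splits, and wrapping the delimiter pattern in a capturing group keeps the delimiters in the result. A minimal sketch with the same input:

```python
import re

data = "apple,banana;grape|mango"

# maxsplit=2 stops after two splits; the rest stays in the last item.
limited = re.split(r"[,;|]", data, maxsplit=2)

# A capturing group around the delimiter keeps the delimiters in the result.
with_delims = re.split(r"([,;|])", data)

print(limited)
print(with_delims)
```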
6. `re.finditer()` – Iterator for Match Objects
This function returns an iterator that yields match objects for each match found. Match objects contain information like the matched string and its start/end positions.
import re
text = "abc123xyz456"
for match in re.finditer(r"\d+", text):
    print(f"Matched '{match.group()}' starting at index {match.start()}")
# Output:
# Matched '123' starting at index 3
# Matched '456' starting at index 9
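Because each item is a full match object, the end position is available as well. The sketch below collects text, start, and end for every match. The name `hits` is just an illustrative choice.

```python
import re

text = "abc123xyz456"

# Collect (matched text, start, end) for each run of digits.
# end() is exclusive, so text[start:end] reproduces the match.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(hits)
```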
Using `re.compile()` – Precompiled Patterns
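If you use the same pattern many times, compiling it once with `re.compile()` gives you a reusable regex object with its own methods such as `findall()` and `search()`. The `re` module already caches recently compiled patterns, so the speed gain is usually modest; the clearer benefit is defining the pattern once, under a name, and reusing it.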
import re
# Compile a pattern to match whole words of exactly 4 word characters (letters, digits, or underscore)
pattern = re.compile(r"\b\w{4}\b")
text = "This will find four word hits in this sentence."
matches = pattern.findall(text)
print(matches)
# Output: ['This', 'will', 'find', 'four', 'word', 'hits', 'this']
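A compiled pattern can also carry flags and be applied to many inputs. This sketch assumes a hypothetical list of lines and uses `re.IGNORECASE` so that `[a-z]` also matches uppercase letters:

```python
import re

# Compile once with a flag, then reuse the object's methods.
four_letter = re.compile(r"\b[a-z]{4}\b", re.IGNORECASE)

lines = ["This will find four", "word hits in THIS sentence."]

per_line = [four_letter.findall(line) for line in lines]
print(per_line)
```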
Real-World Examples
Validate an Email Address
A common use case for regex is input validation.
import re
def is_valid_email(email):
    # A common regex pattern for email validation
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    if re.match(pattern, email):
        return True
    else:
        return False

print(f"'test@example.com' is valid: {is_valid_email('test@example.com')}")
# Output: 'test@example.com' is valid: True
print(f"'invalid-email' is valid: {is_valid_email('invalid-email')}")
# Output: 'invalid-email' is valid: False
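One edge case worth knowing: `$` also matches just before a trailing newline, so an anchored pattern used with `re.match()` accepts an address followed by `"\n"`. A possible stricter variant, sketched here with the hypothetical names `EMAIL_PATTERN` and `is_valid_email_strict`, relies on `fullmatch()` instead:

```python
import re

# Same character classes as the pattern above; fullmatch() makes ^ and $ unnecessary.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email_strict(email):
    # Hypothetical helper: the pattern must cover the entire string.
    return EMAIL_PATTERN.fullmatch(email) is not None

with_newline = "test@example.com\n"

# The anchored re.match() version accepts the trailing newline...
loose = re.match(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$", with_newline) is not None

# ...while fullmatch() rejects it.
strict = is_valid_email_strict(with_newline)

print(loose, strict)
```

Neither version checks that the address actually exists; a pattern like this only screens for a plausible shape.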
Extract Dates from Text
You can extract structured information like dates from unstructured text.
import re
text = "Important dates: 2023-01-01, 2025-12-31, and upcoming event on 2024-07-15."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)
# Output: ['2023-01-01', '2025-12-31', '2024-07-15']
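The pattern above checks only the shape of a date, so a string such as `2024-13-45` would also be extracted. One way to filter out impossible dates, shown as a sketch with an invented sample text, is to pass each candidate through `datetime.strptime()`:

```python
import re
from datetime import datetime

text = "Dates: 2023-01-01, 2025-12-31, and a typo 2024-13-45."

valid_dates = []
for candidate in re.findall(r"\d{4}-\d{2}-\d{2}", text):
    try:
        # strptime raises ValueError for impossible dates such as month 13.
        valid_dates.append(datetime.strptime(candidate, "%Y-%m-%d").date())
    except ValueError:
        pass

print(valid_dates)
```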
Summary
Python's `re` module offers a robust set of tools for text pattern matching, data extraction, and string manipulation using regular expressions. Mastering regex is an invaluable skill for tasks ranging from validating user input and extracting structured data to performing complex text processing operations.
Interview Questions on Python Regex
- What is the primary purpose of the `re` module in Python?
- Explain the difference between `re.match()` and `re.search()`. When would you choose one over the other?
- How does `re.findall()` differ from `re.finditer()`?
- Describe the functionality of `re.sub()` and provide a simple example.
- List and explain some of the most frequently used regex special characters and their meanings.
- How can you validate an email address using a regular expression in Python?
- What does `re.compile()` do, and in what scenarios is it beneficial to use?
- How would you split a string using multiple different delimiters using a single regex pattern?
- Write a regex pattern to extract all words consisting of exactly four letters from a given sentence.
- What are some best practices for writing regular expressions that are both efficient and easy to read/maintain?