Regular expression to match lines that don’t contain a specific word
I know it’s possible to match a word and then reverse the matches using other tools (e.g., grep -v). However, is it possible to match lines that do not contain a specific word, e.g., ‘hede’, using a regular expression?
Input:
hoho
hihi
haha
hede
Code:
grep "<regex for 'doesn't contain hede'>" input
Desired output:
hoho
hihi
haha
To match lines that don’t contain a specific word like ‘hede’ using a regular expression, you can use a negative lookahead assertion. The regex ^(?!.*\bhede\b).*$ will match any line that does not contain the word “hede”, where the negative lookahead (?!.*\bhede\b) ensures the word is absent before matching the entire line with .*.
Contents
- Understanding Negative Lookahead
- Basic Solution for Line Matching
- Word Boundary Considerations
- Practical Implementation with Grep
- Alternative Approaches
- Performance Considerations
Understanding Negative Lookahead
A negative lookahead is a zero-width assertion in regular expressions that checks if a pattern does not appear at the current position without actually consuming any characters. The syntax for negative lookahead is (?!pattern), where pattern is the sequence you want to exclude.
According to the Regular Expressions Cookbook, “negative lookahead assertions can be useful for validating strings that do not start with specific words” and more broadly for ensuring patterns don’t appear anywhere in the text.
The key insight is that negative lookaheads allow you to specify what you don’t want to match, rather than trying to construct a complex pattern that excludes specific content.
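The zero-width behaviour can be seen in a small experiment. This is a minimal sketch using Python's re module, which supports the same (?!pattern) syntax; the sample strings and the variable name m are only illustrative.

```python
import re

# The lookahead inspects the upcoming text without consuming it,
# so \w+ still starts matching at the very first character.
m = re.match(r"(?!hede)\w+", "hoho")
print(m.group())  # hoho

# When the excluded text begins at the current position, the match fails.
print(re.match(r"(?!hede)\w+", "hede"))  # None

# The assertion only looks at the current position, so a later "hede" goes unnoticed.
print(re.match(r"(?!hede)\w+", "xhede").group())  # xhede
```

The last case shows why a bare (?!hede) is not enough to reject a whole line: the lookahead has to be told to scan ahead, which is what the .* inside it does in the patterns that follow.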
Basic Solution for Line Matching
The most common pattern to match lines that don’t contain a specific word is:
^(?!.*\bhede\b).*$
Breaking this down:
- ^ - Start of line anchor
- (?!.*\bhede\b) - Negative lookahead that asserts the word “hede” does not appear anywhere in the line
  - .* - Matches any characters (except newline) greedily
  - \b - Word boundary ensures we match the whole word, not part of another word
- .* - Matches the entire line content
- $ - End of line anchor
The Stack Overflow discussion explains that this approach “lets the lookahead part check out the whole text, ensure there is no ‘hede’, and then the normal part (.*) can eat the whole text all at one time.”
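As a quick check of the pattern against the sample input from the question, here is a minimal sketch in Python (the names NO_HEDE, lines, and kept are illustrative, not part of any library):

```python
import re

# The same pattern as above; Python's re module supports lookaheads and \b.
NO_HEDE = re.compile(r"^(?!.*\bhede\b).*$")

lines = ["hoho", "hihi", "haha", "hede"]

# Keep only the lines on which the pattern matches.
kept = [line for line in lines if NO_HEDE.match(line)]
print(kept)  # ['hoho', 'hihi', 'haha']
```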
Word Boundary Considerations
Word boundaries (\b) are crucial when matching specific words to avoid partial matches. Without word boundaries, a pattern like hede would also match inside longer words that contain it, such as “hedera” or “hedes”.
The pattern \bhede\b ensures you’re matching the complete word “hede” and not just a substring. As Saturn Cloud explains, “The \b matches a word boundary, which ensures that ‘word’ is not part of a larger word.”
If you specifically want to match lines that don’t contain the substring regardless of word boundaries, you can omit the \b markers:
^(?!.*hede).*$
This will exclude any line containing “hede” as a substring, which might be useful in some cases but could lead to unintended exclusions.
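The difference between the two variants shows up on a line where “hede” is part of a longer word. The following is a small Python sketch with made-up sample lines; the names whole_word and substring are assumptions for illustration.

```python
import re

whole_word = re.compile(r"^(?!.*\bhede\b).*$")  # excludes only the word "hede"
substring = re.compile(r"^(?!.*hede).*$")       # excludes any occurrence of "hede"

samples = ["hede", "say hede now", "hedera", "hoho"]

kept_whole = [s for s in samples if whole_word.match(s)]
kept_sub = [s for s in samples if substring.match(s)]

print(kept_whole)  # ['hedera', 'hoho']
print(kept_sub)    # ['hoho']
```

The line “hedera” survives the whole-word pattern because the “r” after “hede” means there is no word boundary at that point, while the substring pattern rejects it.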
Practical Implementation with Grep
To use this with grep as requested in your question, you would use:
grep -P '^(?!.*\bhede\b).*$' input
The -P flag enables Perl-compatible regex features, which support negative lookaheads. According to Sentry’s regex guide, “(?!word): Negative lookahead assertion” is the key syntax for this functionality.
Run against the sample input from your question, the command above produces your desired output:
hoho
hihi
haha
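Note that -P is a GNU grep extension and is not available in every grep implementation: the BSD grep shipped with macOS, for example, does not support it, and GNU grep can be built without PCRE support. On such systems, fall back on alternatives like grep -v 'hede' as you mentioned (add -w, as in grep -vw 'hede', to exclude only the whole word).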
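Where grep -P is unavailable, the same filtering can be done with any language whose regex engine supports lookaheads. Below is a rough Python sketch of a grep-like filter; the function name grep_without is hypothetical, and io.StringIO stands in for a real input file.

```python
import io
import re

def grep_without(word, stream):
    """Yield the lines of `stream` that do not contain `word` as a whole word."""
    # re.escape protects any regex metacharacters in the word.
    pattern = re.compile(r"^(?!.*\b" + re.escape(word) + r"\b).*$")
    for line in stream:
        line = line.rstrip("\n")  # drop the newline, as grep does before matching
        if pattern.match(line):
            yield line

# Stand-in for the "input" file from the question.
sample = io.StringIO("hoho\nhihi\nhaha\nhede\n")
result = list(grep_without("hede", sample))
print(result)  # ['hoho', 'hihi', 'haha']
```

For a real file, the StringIO object could be replaced by an open file handle, since both are iterated line by line.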
Alternative Approaches
Using Lazy Quantifier in Lookahead
Depending on the data, using a lazy quantifier inside the negative lookahead can let the engine find the excluded word sooner:
^(?!.*?hede).*$
As noted in the Stack Overflow answer, “Note the (*?) lazy quantifier in the negative lookahead part is optional, you can use (*) greedy quantifier instead, depending on your data: if ‘hede’ does present and in the beginning half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier be faster.”
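Whichever quantifier is used inside the negative lookahead, the set of accepted lines is the same; only the order in which the engine searches differs. A short Python sketch to confirm this on a few made-up lines (the names greedy, lazy, and same are illustrative):

```python
import re

greedy = re.compile(r"^(?!.*hede).*$")   # scans to the end of the line, then backtracks
lazy = re.compile(r"^(?!.*?hede).*$")    # expands one character at a time from the start

lines = ["hoho", "hede", "xx hede", "hede followed by a long tail " + "x" * 20]

# Both variants should accept or reject every line identically.
same = all(bool(greedy.match(l)) == bool(lazy.match(l)) for l in lines)
print(same)  # True

kept = [l for l in lines if lazy.match(l)]
print(kept)  # ['hoho']
```

Any speed difference depends on the engine and on where the word sits in the line, so it is worth measuring on real data before choosing one form.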
Using Multiple Lookaheads
For more complex exclusion patterns, you can chain multiple negative lookaheads:
^(?!.*\bhede\b)(?!.*\berror\b)(?!.*\bwarning\b).*$
This would match lines containing none of the words “hede”, “error”, or “warning”.
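When the list of excluded words varies, the chained pattern can be generated instead of written by hand. This is a minimal Python sketch; the helper name none_of and the sample lines are assumptions for illustration.

```python
import re

def none_of(words):
    """Build a pattern matching lines that contain none of `words` as whole words."""
    # One negative lookahead per word, all anchored at the start of the line.
    lookaheads = "".join(rf"(?!.*\b{re.escape(w)}\b)" for w in words)
    return re.compile("^" + lookaheads + ".*$")

pattern = none_of(["hede", "error", "warning"])
print(pattern.pattern)  # ^(?!.*\bhede\b)(?!.*\berror\b)(?!.*\bwarning\b).*$

lines = ["all good", "error: disk full", "warning issued", "hede", "errors ignored"]
kept = [l for l in lines if pattern.match(l)]
print(kept)  # ['all good', 'errors ignored']
```

The line “errors ignored” is kept because the word boundaries restrict the exclusion to the exact word “error”.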
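Checking the Lookahead at Every Position
Another formulation applies the negative lookahead before each character instead of once at the start of the line:
^(?:(?!hede).)*$
Here (?!hede). consumes one character only if “hede” does not begin at that position, and the surrounding (?:...)* repeats that step across the whole line. This is not a character class: a negated class such as [^h] can exclude single characters, but not a multi-character sequence. RexEgg describes the same assert-then-consume mechanism for a simpler pattern: “After the negative lookahead asserts that what follows the current position is not a Q, the \w matches a word character.”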
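The pattern above runs a negative lookahead before every character it consumes, and on single lines it should accept the same input as a pattern with one lookahead at the start. A brief Python sketch comparing the two (variable names and sample lines are illustrative):

```python
import re

per_position = re.compile(r"^(?:(?!hede).)*$")  # lookahead checked before each character
upfront = re.compile(r"^(?!.*hede).*$")         # lookahead checked once, at the start

lines = ["hoho", "hede", "ahedeb", "", "hed e"]

kept_per_position = [l for l in lines if per_position.match(l)]
kept_upfront = [l for l in lines if upfront.match(l)]

print(kept_per_position)                  # ['hoho', '', 'hed e']
print(kept_per_position == kept_upfront)  # True
```

Both patterns also accept the empty line, since an empty line contains no “hede”.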
Performance Considerations
Negative lookaheads can be computationally expensive, especially when applied to every position in a long line. As the O’Reilly Regular Expressions Cookbook warns, “Testing a negative lookahead against every position in a line or string is rather inefficient.”
For better performance with large files:
- Consider using grep -v 'hede' instead, which is more efficient
- If you must use regex, keep the patterns as simple as possible
- Avoid complex nested lookaheads when simpler alternatives exist
- Consider pre-filtering with faster tools before applying regex
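The grep -v idea carries over to code: search for the word with a plain pattern and keep the lines where the search fails, rather than encoding the negation in the regex. A minimal Python sketch, with illustrative names, showing that both routes select the same lines:

```python
import re

lines = ["hoho", "hihi", "haha", "hede"]

# Inverted search: the pattern only describes what to look for.
contains_hede = re.compile(r"\bhede\b")
kept_inverted = [l for l in lines if not contains_hede.search(l)]

# Negative lookahead: the negation lives inside the pattern.
no_hede = re.compile(r"^(?!.*\bhede\b).*$")
kept_lookahead = [l for l in lines if no_hede.match(l)]

print(kept_inverted)                    # ['hoho', 'hihi', 'haha']
print(kept_inverted == kept_lookahead)  # True
```

Which version is faster depends on the engine and the data, so timing both on representative input is the safer way to decide.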
Sources
- Regular expression to match a line that doesn’t contain a word - Stack Overflow
- Write a regular expression to match lines not containing a word | Sentry
- 5.11. Match Complete Lines That Do Not Contain a Word - Regular Expressions Cookbook, 2nd Edition
- Regular Expression to Match a Line That Doesn’t Contain a Word | Saturn Cloud Blog
- Regular Expression To Match A Line That Doesn’t Contain a Word
- Regex Tutorial: Lookahead and Lookbehind Zero-Length Assertions
- Lookahead and Lookbehind Tutorial: Tips & Tricks
Conclusion
Matching lines that don’t contain a specific word using regular expressions is indeed possible through negative lookahead assertions. The key takeaways are:
- Use negative lookahead (?!pattern) to specify what you want to exclude
- The most reliable pattern is ^(?!.*\bword\b).*$ to match complete lines without the specific word
- Word boundaries (\b) are important to avoid partial matches within larger words
- For your grep example, use grep -P '^(?!.*\bhede\b).*$' input to achieve the desired result
- Consider performance implications: for simple cases, grep -v 'hede' may be more efficient
While regular expressions provide a powerful way to exclude specific content, they can be complex and less performant than simpler alternatives. Choose the approach that best fits your specific requirements and performance constraints.