NeuroAgent

Regex to Match Lines Without Specific Word

Learn how to use regular expressions with negative lookahead to match lines that don't contain specific words. Complete guide with grep examples and performance tips.

Question

Regular expression to match lines that don’t contain a specific word

I know it’s possible to match a word and then reverse the matches using other tools (e.g., grep -v). However, is it possible to match lines that do not contain a specific word, e.g., ‘hede’, using a regular expression?

Input:
hoho
hihi
haha
hede

Code:
grep "<Regex for 'doesn't contain hede'>" input

Desired output:
hoho
hihi
haha

NeuroAgent

To match lines that don’t contain a specific word like ‘hede’ using a regular expression, you can use a negative lookahead assertion. The regex ^(?!.*\bhede\b).*$ will match any line that does not contain the word “hede”, where the negative lookahead (?!.*\bhede\b) ensures the word is absent before matching the entire line with .*.


Understanding Negative Lookahead

A negative lookahead is a zero-width assertion: it succeeds only if a pattern does not match starting at the current position, and it does so without consuming any characters. The syntax for negative lookahead is (?!pattern), where pattern is the sequence you want to exclude.

According to the Regular Expressions Cookbook, “negative lookahead assertions can be useful for validating strings that do not start with specific words” and more broadly for ensuring patterns don’t appear anywhere in the text.

The key insight is that negative lookaheads allow you to specify what you don’t want to match, rather than trying to construct a complex pattern that excludes specific content.
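
To make the zero-width behaviour concrete, here is a minimal sketch using Python's standard re module. The pattern foo(?!bar) and the sample strings are illustrative assumptions, not part of the original question.

```python
import re

# Match "foo" only when it is NOT immediately followed by "bar".
pattern = re.compile(r"foo(?!bar)")

# The lookahead fails right after "foo", so nothing matches.
blocked = pattern.search("foobar")

# The lookahead succeeds. The match is only "foo": the "baz" that the
# lookahead inspected is not consumed.
allowed = pattern.search("foobaz")

print(blocked)          # None
print(allowed.group())  # foo
print(allowed.end())    # 3
```

The match ends at index 3, which shows that the assertion looked ahead without moving the match position.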


Basic Solution for Line Matching

The most common pattern to match lines that don’t contain a specific word is:

regex
^(?!.*\bhede\b).*$

Breaking this down:

  • ^ - Start of line anchor
  • (?!.*\bhede\b) - Negative lookahead that asserts the word “hede” does not appear anywhere in the line
    • .* - Matches any characters (except newline) greedily
    • \b - Word boundary ensures we match the whole word, not part of another word
  • .* - Matches the entire line content
  • $ - End of line anchor

The Stack Overflow discussion explains that this approach “lets the lookahead part check out the whole text, ensure there is no ‘hede’, and then the normal part (.*) can eat the whole text all at one time.”
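
The same pattern can be checked outside grep. The following is a small sketch, assuming Python's re module and the four input lines from the question; the variable names are made up for illustration.

```python
import re

# The pattern from the answer, applied to one line at a time.
NO_HEDE = re.compile(r"^(?!.*\bhede\b).*$")

lines = ["hoho", "hihi", "haha", "hede"]

# Keep only the lines the pattern matches.
kept = [line for line in lines if NO_HEDE.match(line)]

print(kept)  # ['hoho', 'hihi', 'haha']
```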


Word Boundary Considerations

Word boundaries (\b) are crucial when matching specific words to avoid partial matches. Without word boundaries, a pattern like hede would also match inside longer strings such as “hedera” or “ahede”.

The pattern \bhede\b ensures you’re matching the complete word “hede” and not just a substring. As Saturn Cloud explains, “The \b matches a word boundary, which ensures that ‘word’ is not part of a larger word.”

If you specifically want to match lines that don’t contain the substring regardless of word boundaries, you can omit the \b markers:

regex
^(?!.*hede).*$

This will exclude any line containing “hede” as a substring, which might be useful in some cases but could lead to unintended exclusions.
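
The difference between the two patterns shows up only on lines where “hede” is embedded in a longer word. A short sketch, assuming Python's re module and a handful of made-up sample lines:

```python
import re

whole_word = re.compile(r"^(?!.*\bhede\b).*$")  # excludes the word "hede"
substring = re.compile(r"^(?!.*hede).*$")       # excludes any "hede" substring

# Illustrative samples (assumptions, not from the question).
samples = ["hoho", "hede", "say hede now", "hedera", "ahede"]

kept_whole = [s for s in samples if whole_word.match(s)]
kept_sub = [s for s in samples if substring.match(s)]

print(kept_whole)  # ['hoho', 'hedera', 'ahede']
print(kept_sub)    # ['hoho']
```

Note that \b treats letters, digits, and the underscore as word characters, so a line containing “hede_1” would also be kept by the whole-word pattern.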


Practical Implementation with Grep

To use this with grep as requested in your question, you would use:

bash
grep -P '^(?!.*\bhede\b).*$' input

The -P flag enables Perl-compatible regex features, which support negative lookaheads. According to Sentry’s regex guide, “(?!word): Negative lookahead assertion” is the key syntax for this functionality.

Run against the example input, this produces your desired output:

hoho
hihi
haha

Note that -P is a GNU grep extension and is only available when grep was built with PCRE support; the BSD grep shipped with macOS, for example, does not provide it. On such systems, fall back to grep -v 'hede' as you mentioned, or grep -vw 'hede' to exclude only the whole word.


Alternative Approaches

Using Lazy Quantifier in Lookahead

Depending on the input, a lazy quantifier inside the negative lookahead can be faster:

regex
^(?!.*?hede).*$

As noted in the Stack Overflow answer, “Note the (*?) lazy quantifier in the negative lookahead part is optional, you can use (*) greedy quantifier instead, depending on your data: if ‘hede’ does present and in the beginning half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier be faster.”
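
A greedy or lazy quantifier inside the lookahead changes only how the engine searches, not which lines match. The sketch below checks this on a few made-up lines, assuming Python's re module; it makes no claim about which form is faster on your data.

```python
import re

greedy = re.compile(r"^(?!.*hede).*$")
lazy = re.compile(r"^(?!.*?hede).*$")

# Illustrative samples with "hede" early, late, and absent.
samples = ["hoho", "hede", "hede at the start", "at the end hede", ""]

# Both forms must agree on every line.
for s in samples:
    assert bool(greedy.match(s)) == bool(lazy.match(s))

kept = [s for s in samples if lazy.match(s)]
print(kept)  # ['hoho', '']
```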

Using Multiple Lookaheads

For more complex exclusion patterns, you can chain multiple negative lookaheads:

regex
^(?!.*\bhede\b)(?!.*\berror\b)(?!.*\bwarning\b).*$

This would match lines containing none of the words “hede”, “error”, or “warning”.
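
When the list of excluded words is not fixed, the chained lookaheads can be generated. This is a sketch only: build_exclusion_pattern is a hypothetical helper name, and the sample lines are assumptions for illustration.

```python
import re

def build_exclusion_pattern(words):
    """Compile a pattern matching lines that contain none of `words`."""
    # One negative lookahead per word; re.escape guards against
    # regex metacharacters inside a word.
    lookaheads = "".join(rf"(?!.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(rf"^{lookaheads}.*$")

pattern = build_exclusion_pattern(["hede", "error", "warning"])

lines = ["all good", "error: disk full", "hede", "warnings ignored", "a warning"]
kept = [line for line in lines if pattern.match(line)]

print(kept)  # ['all good', 'warnings ignored']
```

“warnings ignored” is kept because \bwarning\b requires a boundary after the final “g”, and “warnings” does not have one.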

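Testing the Lookahead at Every Position

Another common form applies the negative lookahead before each character instead of once at the start of the line. This is not a character class: a negated class such as [^h] can only exclude single characters, not a whole word.

regex
^(?:(?!hede).)*$

Here (?!hede). consumes one character only if “hede” does not begin at that position, and the group repeats until the end of the line. RexEgg describes the same mechanism with a different example: “After the negative lookahead asserts that what follows the current position is not a Q, the \w matches a word character.”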
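
The per-position form and the single up-front lookahead accept the same lines; they differ in how often the lookahead runs. A minimal check, assuming Python's re module and illustrative sample lines:

```python
import re

per_position = re.compile(r"^(?:(?!hede).)*$")  # lookahead before every character
single_check = re.compile(r"^(?!.*hede).*$")    # one lookahead at line start

samples = ["hoho", "hihi", "haha", "hede", "xxhedexx", ""]

kept_per_position = [s for s in samples if per_position.match(s)]
kept_single_check = [s for s in samples if single_check.match(s)]

print(kept_per_position)  # ['hoho', 'hihi', 'haha', '']
print(kept_single_check)  # ['hoho', 'hihi', 'haha', '']
```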


Performance Considerations

Negative lookaheads can be computationally expensive, especially when applied to every position in a long line. As the O’Reilly Regular Expressions Cookbook warns, “Testing a negative lookahead against every position in a line or string is rather inefficient.”

For better performance with large files:

  1. Consider using grep -v 'hede' instead (or grep -vw 'hede' to exclude only the whole word), which is simpler and typically more efficient
  2. If you must use regex, keep the patterns as simple as possible
  3. Avoid complex nested lookaheads when simpler alternatives exist
  4. Consider pre-filtering with faster tools before applying regex
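
The first suggestion carries over to code: instead of a lookahead, search for the word and invert the result, which is what grep -v does. A sketch, assuming Python and the input lines from the question:

```python
import re

lines = ["hoho", "hihi", "haha", "hede"]

# Whole-word exclusion: a plain search, negated (the grep -vw analogue).
word = re.compile(r"\bhede\b")
kept_word = [line for line in lines if not word.search(line)]

# Substring exclusion needs no regex at all (the grep -v analogue).
kept_substring = [line for line in lines if "hede" not in line]

print(kept_word)       # ['hoho', 'hihi', 'haha']
print(kept_substring)  # ['hoho', 'hihi', 'haha']
```

This keeps the pattern trivial and moves the negation into ordinary program logic, which is usually easier to read and maintain than a lookahead.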

Sources

  1. Regular expression to match a line that doesn’t contain a word - Stack Overflow
  2. Write a regular expression to match lines not containing a word | Sentry
  3. 5.11. Match Complete Lines That Do Not Contain a Word - Regular Expressions Cookbook, 2nd Edition
  4. Regular Expression to Match a Line That Doesn’t Contain a Word | Saturn Cloud Blog
  5. Regular Expression To Match A Line That Doesn’t Contain a Word
  6. Regex Tutorial: Lookahead and Lookbehind Zero-Length Assertions
  7. Lookahead and Lookbehind Tutorial: Tips & Tricks

Conclusion

Matching lines that don’t contain a specific word using regular expressions is indeed possible through negative lookahead assertions. The key takeaways are:

  1. Use negative lookahead (?!pattern) to specify what you want to exclude
  2. The most reliable pattern is ^(?!.*\bword\b).*$ to match complete lines without the specific word
  3. Word boundaries (\b) are important to avoid partial matches within larger words
  4. For your grep example, use grep -P '^(?!.*\bhede\b).*$' input to achieve the desired result
  5. Consider performance implications - for simple cases, grep -v 'hede' may be more efficient

While regular expressions provide a powerful way to exclude specific content, they can be complex and less performant than simpler alternatives. Choose the approach that best fits your specific requirements and performance constraints.