I recently faced the problem of requiring negative lookbehinds in a regex engine that does not support them (Golang's Regexp package, RE2, and Hyperscan).
For example, say you want to match the string "def" except if it is preceeded by "abc". In PCRE using negative lookbehinds you could achieve this with the regex:
(?<!abc)def
But this will not work in all regex engines, so I needed an alternative.
I found this Stackoverflow answer by user Sebastian Proske. The approach is clever, but tricky to comprehend and write for long and complex negative lookbehinds. I will try to explain the general approach here, and then present a script to help use this approach for longer and more complex cases.
Continuing the example above, the approach works by having alternatives for each possible prefix that is not "abc". These alternatives are:
- The beginning of the string, or a character that is not "c", followed by "def"
- The beginning of the string, or a character that is not "b", followed by "cdef"
- The beginning of the string, or a character that is not "a", followed by "bcdef"
Written as a regex (line breaks and indentation for readability are not part of the regex):
( (^|[^c])def | (^|[^b])cdef | (^|[^a])bcdef )
We can extract the common suffix out:
( (^|[^c]) | (^|[^b])c | (^|[^a])bc ) def
To support multiple alternative negative lookbehinds, such as both "abc" and "1234", which in PCRE syntax this can be written as "(?<!abc|1234)def", we can write:
( (^|[^c4]) | (^|[^b])c | (^|[^a])bc | (^|[^3])4 | (^|[^2])34 | (^|[^1])234 ) def
To support character classes in the negative lookbehinds, such as "[bB]", which in PCRE syntax can be written as "(?<!a[bB]c)def", we can write:
( (^|[^c]) | (^|[^bB])c | (^|[^a])[bB]c ) def
For better equivalence to the negative lookbehinds, we can additionally use non-capturing groups:
(?: (?:^|[^c]) | (?:^|[^b])c | (?:^|[^a])bc ) def
Here's a quick and dirty Python script to help construct regexes following this approach. The example input here is equivalent to "(?<!a[bB]c|1234)".
The script is also available to conveniently run in your browser on repl.it: https://repl.it/@allanrbo/regexnegativelookbehindalternative1
- negativePrefixes = [
- "a[bB]c",
- "1234",
- ]
- def removeDuplicateChars(s):
- return "".join([c for i,c in enumerate(s) if c not in s[:i]])
- def removeChars(s, charsToRemove):
- return "".join([c for i,c in enumerate(s) if c not in charsToRemove])
- # Split into arrays of strings. Each string is either a single char, or a char class.
- negativePrefixesSplit = []
- for np in negativePrefixes:
- npSplit = []
- curCc = ""
- inCc = False
- for c in np:
- if c == "[":
- inCc = True
- elif c == "]":
- npSplit.append(removeDuplicateChars(curCc))
- curCc = ""
- inCc = False
- else:
- if inCc:
- if c in "-\\":
- raise "Only really simply char classes are currently supported. No ranges or escapes, sorry."
- curCc += c
- else:
- npSplit.append(c)
- negativePrefixesSplit.append(npSplit)
- allexprs = []
- class Expr():
- pass
- suffixLength = 0
- while True:
- suffixes = []
- for np in negativePrefixesSplit:
- if suffixLength < len(np):
- suffixes.append(np[len(np)-suffixLength-1:])
- if len(suffixes) == 0:
- break
- exprs = []
- for suffix in suffixes:
- curChar = suffix[0]
- remainder = suffix[1:]
- expr = Expr()
- expr.curChar = curChar
- expr.remainder = remainder
- exprs.append(expr)
- # Is the remainder a subset of any other suffixes remainders?
- for i in range(len(exprs)):
- e1 = exprs[i]
- for j in range(len(exprs)):
- e2 = exprs[j]
- isSubset = True
- for k in range(len(e1.remainder)):
- if not set(e1.remainder[k]).issubset(set(e2.remainder[k])):
- isSubset = False
- break
- if isSubset:
- if e1.curChar == e2.curChar:
- e1.remainder = e2.remainder
- continue
- e1.curChar += e2.curChar
- e1.curChar = removeDuplicateChars(e1.curChar)
- for k in range(len(e1.remainder)):
- if len(set(e2.remainder[k]) - set(e1.remainder[k])) > 0:
- charsInCommon = "".join(set(e2.remainder[k]) & set(e1.remainder[k]))
- e2.remainder[k] = removeChars(e2.remainder[k], charsInCommon)
- # Remove duplicate expressions
- exprsFiltered = []
- for i in range(len(exprs)):
- e1 = exprs[i]
- alreadyExists = False
- for j in range(len(exprs)):
- if i == j:
- break
- e2 = exprs[j]
- sameC = set(e1.curChar) == set(e2.curChar)
- sameR = True
- for k in range(len(e1.remainder)):
- if set(e1.remainder[k]) != set(e2.remainder[k]):
- sameR = False
- break
- if sameC and sameR:
- alreadyExists = True
- break
- if not alreadyExists:
- exprsFiltered.append(e1)
- allexprs.extend(exprsFiltered)
- suffixLength += 1
- continue
- out = "(?:\n"
- for i in range(len(allexprs)):
- e = allexprs[i]
- out += ("(?:^|[^" + e.curChar + "])")
- for c in e.remainder:
- if len(c) > 1:
- out += "[" + c + "]"
- else:
- out += c
- if i != len(allexprs)-1:
- out += "|"
- out += "\n"
- out += ")"
- print("Human readable:")
- print(out)
- print()
- print("Single line:")
- print(out.replace("\n",""))
Human readable: (?: (?:^|[^c4])| (?:^|[^bB])c| (?:^|[^3])4| (?:^|[^a])[bB]c| (?:^|[^2])34| (?:^|[^1])234 ) Single line: (?:(?:^|[^c4])|(?:^|[^bB])c|(?:^|[^3])4|(?:^|[^a])[bB]c|(?:^|[^2])34|(?:^|[^1])234)