I recently faced the problem of requiring negative lookbehinds in a regex engine that does not support them (Golang's Regexp package, RE2, and Hyperscan).
For example, say you want to match the string "def" except if it is preceeded by "abc". In PCRE using negative lookbehinds you could achieve this with the regex:
(?<!abc)def
But this will not work in all regex engines, so I needed an alternative.
I found this Stackoverflow answer by user Sebastian Proske. The approach is clever, but tricky to comprehend and write for long and complex negative lookbehinds. I will try to explain the general approach here, and then present a script to help use this approach for longer and more complex cases.
Continuing the example above, the approach works by having alternatives for each possible prefix that is not "abc". These alternatives are:
- The beginning of the string, or a character that is not "c", followed by "def"
- The beginning of the string, or a character that is not "b", followed by "cdef"
- The beginning of the string, or a character that is not "a", followed by "bcdef"
Written as a regex (line breaks and indentation for readability are not part of the regex):
( (^|[^c])def | (^|[^b])cdef | (^|[^a])bcdef )
We can extract the common suffix out:
( (^|[^c]) | (^|[^b])c | (^|[^a])bc ) def
To support multiple alternative negative lookbehinds, such as both "abc" and "1234", which in PCRE syntax this can be written as "(?<!abc|1234)def", we can write:
( (^|[^c4]) | (^|[^b])c | (^|[^a])bc | (^|[^3])4 | (^|[^2])34 | (^|[^1])234 ) def
To support character classes in the negative lookbehinds, such as "[bB]", which in PCRE syntax can be written as "(?<!a[bB]c)def", we can write:
( (^|[^c]) | (^|[^bB])c | (^|[^a])[bB]c ) def
For better equivalence to the negative lookbehinds, we can additionally use non-capturing groups:
(?: (?:^|[^c]) | (?:^|[^b])c | (?:^|[^a])bc ) def
Here's a quick and dirty Python script to help construct regexes following this approach. The example input here is equivalent to "(?<!a[bB]c|1234)".
The script is also available to conveniently run in your browser on repl.it: https://repl.it/@allanrbo/regexnegativelookbehindalternative1
negativePrefixes = [ "a[bB]c", "1234", ] def removeDuplicateChars(s): return "".join([c for i,c in enumerate(s) if c not in s[:i]]) def removeChars(s, charsToRemove): return "".join([c for i,c in enumerate(s) if c not in charsToRemove]) # Split into arrays of strings. Each string is either a single char, or a char class. negativePrefixesSplit = [] for np in negativePrefixes: npSplit = [] curCc = "" inCc = False for c in np: if c == "[": inCc = True elif c == "]": npSplit.append(removeDuplicateChars(curCc)) curCc = "" inCc = False else: if inCc: if c in "-\\": raise "Only really simply char classes are currently supported. No ranges or escapes, sorry." curCc += c else: npSplit.append(c) negativePrefixesSplit.append(npSplit) allexprs = [] class Expr(): pass suffixLength = 0 while True: suffixes = [] for np in negativePrefixesSplit: if suffixLength < len(np): suffixes.append(np[len(np)-suffixLength-1:]) if len(suffixes) == 0: break exprs = [] for suffix in suffixes: curChar = suffix[0] remainder = suffix[1:] expr = Expr() expr.curChar = curChar expr.remainder = remainder exprs.append(expr) # Is the remainder a subset of any other suffixes remainders? for i in range(len(exprs)): e1 = exprs[i] for j in range(len(exprs)): e2 = exprs[j] isSubset = True for k in range(len(e1.remainder)): if not set(e1.remainder[k]).issubset(set(e2.remainder[k])): isSubset = False break if isSubset: if e1.curChar == e2.curChar: e1.remainder = e2.remainder continue e1.curChar += e2.curChar e1.curChar = removeDuplicateChars(e1.curChar) for k in range(len(e1.remainder)): if len(set(e2.remainder[k]) - set(e1.remainder[k])) > 0: charsInCommon = "".join(set(e2.remainder[k]) & set(e1.remainder[k])) e2.remainder[k] = removeChars(e2.remainder[k], charsInCommon) # Remove duplicate expressions exprsFiltered = [] for i in range(len(exprs)): e1 = exprs[i] alreadyExists = False for j in range(len(exprs)): if i == j: break e2 = exprs[j] sameC = set(e1.curChar) == set(e2.curChar) sameR = True for k in range(len(e1.remainder)): if set(e1.remainder[k]) != set(e2.remainder[k]): sameR = False break if sameC and sameR: alreadyExists = True break if not alreadyExists: exprsFiltered.append(e1) allexprs.extend(exprsFiltered) suffixLength += 1 continue out = "(?:\n" for i in range(len(allexprs)): e = allexprs[i] out += ("(?:^|[^" + e.curChar + "])") for c in e.remainder: if len(c) > 1: out += "[" + c + "]" else: out += c if i != len(allexprs)-1: out += "|" out += "\n" out += ")" print("Human readable:") print(out) print() print("Single line:") print(out.replace("\n",""))Example output:
Human readable: (?: (?:^|[^c4])| (?:^|[^bB])c| (?:^|[^3])4| (?:^|[^a])[bB]c| (?:^|[^2])34| (?:^|[^1])234 ) Single line: (?:(?:^|[^c4])|(?:^|[^bB])c|(?:^|[^3])4|(?:^|[^a])[bB]c|(?:^|[^2])34|(?:^|[^1])234)
No comments:
Post a Comment