Friday, January 10, 2020

Alternative to negative lookbehinds in regular expressions

I recently faced the problem of requiring negative lookbehinds in a regex engine that does not support them (Golang's Regexp package, RE2, and Hyperscan).

For example, say you want to match the string "def" except if it is preceeded by "abc". In PCRE using negative lookbehinds you could achieve this with the regex:

(?<!abc)def

But this will not work in all regex engines, so I needed an alternative.

I found this Stackoverflow answer by user Sebastian Proske. The approach is clever, but tricky to comprehend and write for long and complex negative lookbehinds. I will try to explain the general approach here, and then present a script to help use this approach for longer and more complex cases.

Continuing the example above, the approach works by having alternatives for each possible prefix that is not "abc". These alternatives are:

  • The beginning of the string, or a character that is not "c", followed by "def"
  • The beginning of the string, or a character that is not "b", followed by "cdef"
  • The beginning of the string, or a character that is not "a", followed by "bcdef"

Written as a regex (line breaks and indentation for readability are not part of the regex):

(
    (^|[^c])def
    |
    (^|[^b])cdef
    |
    (^|[^a])bcdef
)

We can extract the common suffix out:

(
    (^|[^c])
    |
    (^|[^b])c
    |
    (^|[^a])bc
)
def

To support multiple alternative negative lookbehinds, such as both "abc" and "1234", which in PCRE syntax this can be written as "(?<!abc|1234)def", we can write:

(
    (^|[^c4])
    |
    (^|[^b])c
    |
    (^|[^a])bc
    |
    (^|[^3])4
    |
    (^|[^2])34
    |
    (^|[^1])234
)
def

To support character classes in the negative lookbehinds, such as "[bB]", which in PCRE syntax can be written as "(?<!a[bB]c)def", we can write:

(
    (^|[^c])
    |
    (^|[^bB])c
    |
    (^|[^a])[bB]c
)
def

For better equivalence to the negative lookbehinds, we can additionally use non-capturing groups:

(?:
    (?:^|[^c])
    |
    (?:^|[^b])c
    |
    (?:^|[^a])bc
)
def

Here's a quick and dirty Python script to help construct regexes following this approach. The example input here is equivalent to "(?<!a[bB]c|1234)".

The script is also available to conveniently run in your browser on repl.it: https://repl.it/@allanrbo/regexnegativelookbehindalternative1

  1. negativePrefixes = [  
  2. "a[bB]c",  
  3. "1234",  
  4. ]  
  5.   
  6. def removeDuplicateChars(s):  
  7.   return "".join([c for i,c in enumerate(s) if c not in s[:i]])  
  8.   
  9. def removeChars(s, charsToRemove):  
  10.   return "".join([c for i,c in enumerate(s) if c not in charsToRemove])  
  11.   
  12. # Split into arrays of strings. Each string is either a single char, or a char class.  
  13. negativePrefixesSplit = []  
  14. for np in negativePrefixes:  
  15.   npSplit = []  
  16.   curCc = ""  
  17.   inCc = False  
  18.   for c in np:  
  19.     if c == "[":  
  20.       inCc = True  
  21.     elif c == "]":  
  22.       npSplit.append(removeDuplicateChars(curCc))  
  23.       curCc = ""  
  24.       inCc = False  
  25.     else:  
  26.       if inCc:    
  27.         if c in "-\\":  
  28.           raise "Only really simply char classes are currently supported. No ranges or escapes, sorry."  
  29.         curCc += c  
  30.       else:  
  31.         npSplit.append(c)  
  32.   negativePrefixesSplit.append(npSplit)  
  33.   
  34. allexprs = []  
  35.   
  36. class Expr():  
  37.   pass  
  38.   
  39. suffixLength = 0  
  40. while True:  
  41.   suffixes = []  
  42.   for np in negativePrefixesSplit:  
  43.     if suffixLength < len(np):  
  44.       suffixes.append(np[len(np)-suffixLength-1:])  
  45.   
  46.   if len(suffixes) == 0:  
  47.     break  
  48.   
  49.   exprs = []  
  50.   for suffix in suffixes:  
  51.     curChar = suffix[0]  
  52.     remainder = suffix[1:]  
  53.     expr = Expr()  
  54.     expr.curChar = curChar  
  55.     expr.remainder = remainder  
  56.     exprs.append(expr)  
  57.   
  58.   # Is the remainder a subset of any other suffixes remainders?  
  59.   for i in range(len(exprs)):  
  60.     e1 = exprs[i]  
  61.     for j in range(len(exprs)):  
  62.       e2 = exprs[j]  
  63.       isSubset = True  
  64.       for k in range(len(e1.remainder)):  
  65.         if not set(e1.remainder[k]).issubset(set(e2.remainder[k])):  
  66.           isSubset = False  
  67.           break  
  68.       if isSubset:  
  69.         if e1.curChar == e2.curChar:  
  70.           e1.remainder = e2.remainder  
  71.           continue  
  72.   
  73.         e1.curChar += e2.curChar  
  74.         e1.curChar = removeDuplicateChars(e1.curChar)  
  75.         for k in range(len(e1.remainder)):  
  76.           if len(set(e2.remainder[k]) - set(e1.remainder[k])) > 0:  
  77.             charsInCommon = "".join(set(e2.remainder[k]) & set(e1.remainder[k]))  
  78.             e2.remainder[k] = removeChars(e2.remainder[k], charsInCommon)  
  79.   
  80.   # Remove duplicate expressions  
  81.   exprsFiltered = []  
  82.   for i in range(len(exprs)):  
  83.     e1 = exprs[i]  
  84.     alreadyExists = False  
  85.     for j in range(len(exprs)):  
  86.       if i == j:  
  87.         break  
  88.   
  89.       e2 = exprs[j]  
  90.   
  91.       sameC = set(e1.curChar) == set(e2.curChar)  
  92.       sameR = True  
  93.       for k in range(len(e1.remainder)):  
  94.         if set(e1.remainder[k]) != set(e2.remainder[k]):  
  95.           sameR = False  
  96.           break  
  97.       if sameC and sameR:  
  98.         alreadyExists = True  
  99.         break  
  100.   
  101.     if not alreadyExists:  
  102.       exprsFiltered.append(e1)  
  103.     
  104.   allexprs.extend(exprsFiltered)  
  105.   
  106.   suffixLength += 1  
  107.   continue  
  108.   
  109. out = "(?:\n"  
  110. for i in range(len(allexprs)):  
  111.   e = allexprs[i]  
  112.   out += ("(?:^|[^" + e.curChar + "])")  
  113.   for c in e.remainder:  
  114.     if len(c) > 1:  
  115.       out += "[" + c + "]"  
  116.     else:  
  117.       out += c  
  118.   if i != len(allexprs)-1:  
  119.     out += "|"  
  120.   out += "\n"  
  121. out += ")"  
  122.   
  123. print("Human readable:")  
  124. print(out)  
  125. print()  
  126. print("Single line:")  
  127. print(out.replace("\n",""))  
Example output:
Human readable:
(?:
(?:^|[^c4])|
(?:^|[^bB])c|
(?:^|[^3])4|
(?:^|[^a])[bB]c|
(?:^|[^2])34|
(?:^|[^1])234
)

Single line:
(?:(?:^|[^c4])|(?:^|[^bB])c|(?:^|[^3])4|(?:^|[^a])[bB]c|(?:^|[^2])34|(?:^|[^1])234)