Friday, January 10, 2020

Alternative to negative lookbehinds in regular expressions

I recently faced the problem of requiring negative lookbehinds in a regex engine that does not support them (Golang's Regexp package, RE2, and Hyperscan).

For example, say you want to match the string "def" except if it is preceeded by "abc". In PCRE using negative lookbehinds you could achieve this with the regex:

(?<!abc)def

But this will not work in all regex engines, so I needed an alternative.

I found this Stackoverflow answer by user Sebastian Proske. The approach is clever, but tricky to comprehend and write for long and complex negative lookbehinds. I will try to explain the general approach here, and then present a script to help use this approach for longer and more complex cases.

Continuing the example above, the approach works by having alternatives for each possible prefix that is not "abc". These alternatives are:

  • The beginning of the string, or a character that is not "c", followed by "def"
  • The beginning of the string, or a character that is not "b", followed by "cdef"
  • The beginning of the string, or a character that is not "a", followed by "bcdef"

Written as a regex (line breaks and indentation for readability are not part of the regex):

(
    (^|[^c])def
    |
    (^|[^b])cdef
    |
    (^|[^a])bcdef
)

We can extract the common suffix out:

(
    (^|[^c])
    |
    (^|[^b])c
    |
    (^|[^a])bc
)
def

To support multiple alternative negative lookbehinds, such as both "abc" and "1234", which in PCRE syntax this can be written as "(?<!abc|1234)def", we can write:

(
    (^|[^c4])
    |
    (^|[^b])c
    |
    (^|[^a])bc
    |
    (^|[^3])4
    |
    (^|[^2])34
    |
    (^|[^1])234
)
def

To support character classes in the negative lookbehinds, such as "[bB]", which in PCRE syntax can be written as "(?<!a[bB]c)def", we can write:

(
    (^|[^c])
    |
    (^|[^bB])c
    |
    (^|[^a])[bB]c
)
def

For better equivalence to the negative lookbehinds, we can additionally use non-capturing groups:

(?:
    (?:^|[^c])
    |
    (?:^|[^b])c
    |
    (?:^|[^a])bc
)
def

Here's a quick and dirty Python script to help construct regexes following this approach. The example input here is equivalent to "(?<!a[bB]c|1234)".

The script is also available to conveniently run in your browser on repl.it: https://repl.it/@allanrbo/regexnegativelookbehindalternative1

negativePrefixes = [
"a[bB]c",
"1234",
]

def removeDuplicateChars(s):
  return "".join([c for i,c in enumerate(s) if c not in s[:i]])

def removeChars(s, charsToRemove):
  return "".join([c for i,c in enumerate(s) if c not in charsToRemove])

# Split into arrays of strings. Each string is either a single char, or a char class.
negativePrefixesSplit = []
for np in negativePrefixes:
  npSplit = []
  curCc = ""
  inCc = False
  for c in np:
    if c == "[":
      inCc = True
    elif c == "]":
      npSplit.append(removeDuplicateChars(curCc))
      curCc = ""
      inCc = False
    else:
      if inCc:  
        if c in "-\\":
          raise "Only really simply char classes are currently supported. No ranges or escapes, sorry."
        curCc += c
      else:
        npSplit.append(c)
  negativePrefixesSplit.append(npSplit)

allexprs = []

class Expr():
  pass

suffixLength = 0
while True:
  suffixes = []
  for np in negativePrefixesSplit:
    if suffixLength < len(np):
      suffixes.append(np[len(np)-suffixLength-1:])

  if len(suffixes) == 0:
    break

  exprs = []
  for suffix in suffixes:
    curChar = suffix[0]
    remainder = suffix[1:]
    expr = Expr()
    expr.curChar = curChar
    expr.remainder = remainder
    exprs.append(expr)

  # Is the remainder a subset of any other suffixes remainders?
  for i in range(len(exprs)):
    e1 = exprs[i]
    for j in range(len(exprs)):
      e2 = exprs[j]
      isSubset = True
      for k in range(len(e1.remainder)):
        if not set(e1.remainder[k]).issubset(set(e2.remainder[k])):
          isSubset = False
          break
      if isSubset:
        if e1.curChar == e2.curChar:
          e1.remainder = e2.remainder
          continue

        e1.curChar += e2.curChar
        e1.curChar = removeDuplicateChars(e1.curChar)
        for k in range(len(e1.remainder)):
          if len(set(e2.remainder[k]) - set(e1.remainder[k])) > 0:
            charsInCommon = "".join(set(e2.remainder[k]) & set(e1.remainder[k]))
            e2.remainder[k] = removeChars(e2.remainder[k], charsInCommon)

  # Remove duplicate expressions
  exprsFiltered = []
  for i in range(len(exprs)):
    e1 = exprs[i]
    alreadyExists = False
    for j in range(len(exprs)):
      if i == j:
        break

      e2 = exprs[j]

      sameC = set(e1.curChar) == set(e2.curChar)
      sameR = True
      for k in range(len(e1.remainder)):
        if set(e1.remainder[k]) != set(e2.remainder[k]):
          sameR = False
          break
      if sameC and sameR:
        alreadyExists = True
        break

    if not alreadyExists:
      exprsFiltered.append(e1)
  
  allexprs.extend(exprsFiltered)

  suffixLength += 1
  continue

out = "(?:\n"
for i in range(len(allexprs)):
  e = allexprs[i]
  out += ("(?:^|[^" + e.curChar + "])")
  for c in e.remainder:
    if len(c) > 1:
      out += "[" + c + "]"
    else:
      out += c
  if i != len(allexprs)-1:
    out += "|"
  out += "\n"
out += ")"

print("Human readable:")
print(out)
print()
print("Single line:")
print(out.replace("\n",""))
Example output:
Human readable:
(?:
(?:^|[^c4])|
(?:^|[^bB])c|
(?:^|[^3])4|
(?:^|[^a])[bB]c|
(?:^|[^2])34|
(?:^|[^1])234
)

Single line:
(?:(?:^|[^c4])|(?:^|[^bB])c|(?:^|[^3])4|(?:^|[^a])[bB]c|(?:^|[^2])34|(?:^|[^1])234)