Bug 6802

Summary: /ss/i interpreted as /(?:ss|ß)/, Variable length lookbehind not implemented in regex
Product: Spamassassin Reporter: Mark Martinec <Mark.Martinec>
Component: LibrariesAssignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: apache, kmcgrail
Priority: P2    
Version: 3.4 SVN branch   
Target Milestone: 4.0.0   
Hardware: All   
OS: All   
Whiteboard:

Description Mark Martinec 2012-06-04 17:26:37 UTC
This is what I reported on a ML on 2012-05-28 - but see below,
the problem is likely to be more involved:


There are three cases of:

  "Variable length lookbehind not implemented in regex"

perl error reported by perl 5.16.0. in J_CHICKENPOX_* rules,
file rulesrc/sandbox/khopesh/20_chickenpox.cf.

Perldiag man page explains:

  Variable length lookbehind not implemented in m/%s/
    (F) Lookbehind is allowed only for subexpressions whose length
        is fixed and known at compile time.  See perlre.

perlre main page:

  "(?<=pattern)" "\K"
    A zero-width positive look-behind assertion.  For example,
    "/(?<=\t)\w+/" matches a word that follows a tab, without
    including the tab in $&.  Works only for fixed-width look-
    behind.

  "(?<!pattern)"
    A zero-width negative look-behind assertion.  For example
    "/(?<!bar)foo/" matches any occurrence of "foo" that does not
    follow "bar".  Works only for fixed-width look-behind.


lint: config: invalid regexp for rule J_CHICKENPOX_13: m/\s(?![acdgjlmnosx]
[`'"])[a-zA-Z]{1}[.,\;:?%!&+^~`'\$*=#|013467\(\)\[\]\{\}<>"][a-zA-Z]{3}(?<!\.
(?:(?-i:[A-Z][a-z]{2})|a(?:sc|sp|ux)|b(?:ak|at|iz|in|ks|mk|mp)|c(?:fg|gi|nf|
om|pp|ss)|d(?:at|ll|mg|oc)|e(?:du|nt|xe|xt)|g(?:if|ov)|htm|i(?:co|di|mg|nc|nf|
ni)|jpg|l(?:ib|og)|m(?:ap|bs|il|p[eg])|net|org|p(?:df|fb|hp|i[df]|ng|pt|sd)|
raw|s(?:cr|ql|ty|ys)|t(?:ar|ex|ld|mp|tf|xt)|usr|w(?:av|pi)|x(?:ls|ml|sl)|zip)|
[`'"]all)(?:[,'\?!]|\.?\s)/i: Variable length lookbehind not implemented in 
regex m/(?i)\s(?![acdgjlmnosx][`'"])[a-zA-Z]{1}[.,\;:?%!&+^~`'\$*=#|
013467\(\)\[\]\{\}<>"][a-zA-Z]{3}(?<!\.(?:(?-i:[A-Z][a-z].../

lint: config: invalid regexp for rule J_CHICKENPOX_23: m/\s(?!(?:fn|re):|
(?:cc|to)=|(?:qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)[a-zA-Z]{2}[.,
\;:?%!&+^~`'\$*=#|013467\(\)\[\]\{\}<>"][a-zA-Z]{3}(?<!\.(?:(?-i:[A-Z][a-z]
{2})|a(?:sc|sp|ux)|b(?:ak|at|iz|in|ks|mk|mp)|c(?:fg|gi|nf|om|pp|ss)|d(?:at|ll|
mg|oc)|e(?:du|nt|xe|xt)|g(?:if|ov)|htm|i(?:co|di|mg|nc|nf|ni)|jpg|l(?:ib|og)|
m(?:ap|bs|il|p[eg])|net|org|p(?:df|fb|hp|i[df]|ng|pt|sd)|raw|s(?:cr|ql|ty|ys)|
t(?:ar|ex|ld|mp|tf|xt)|usr|w(?:av|pi)|x(?:ls|ml|sl)|zip)|['`"]tje)(?:[,'\?!]|
\.?\s)/i: Variable length lookbehind not implemented in regex m/(?i)\s(?!
(?:fn|re):|(?:cc|to)=|(?:qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)[a-zA-Z]{2}[.,
\;:?%!&+^~`'\$*=#|013467\(\)\[\].../

lint: config: invalid regexp for rule J_CHICKENPOX_33: m/\s(?!(?:alt|biz|mrs|
rev|s(?:ci|en|oc))\.|(?:end|fwd|org|reg):|pop3|cos')[a-zA-Z]{3}[.,
\;:?%!&+^~`'\$*=#|013467\(\)\[\]\{\}<>"][a-zA-Z]{3}(?<!\.(?:(?-i:[A-Z][a-z]
{2})|a(?:sc|sp|ux)|b(?:ak|at|iz|in|ks|mk|mp)|c(?:fg|gi|nf|om|pp|ss)|d(?:at|ll|
mg|oc)|e(?:du|nt|xe|xt)|g(?:if|ov)|htm|i(?:co|di|mg|nc|nf|ni)|jpg|l(?:ib|og)|
m(?:ap|bs|il|p[eg])|net|org|p(?:df|fb|hp|i[df]|ng|pt|sd)|raw|s(?:cr|ql|ty|ys)|
t(?:ar|ex|ld|mp|tf|xt)|usr|w(?:av|pi)|x(?:ls|ml|sl)|zip)|['`"]tje)(?:[,'\?!]|
\.?\s)/i: Variable length lookbehind not implemented in regex m/(?i)\s(?!
(?:alt|biz|mrs|rev|s(?:ci|en|oc))\.|(?:end|fwd|org|reg):|pop3|cos')[a-zA-Z]{3}
[.,\;:?%!&+^~`'\$*=#|013467\(\.../

ERROR: LINT FAILED, suppressing output: rules/70_sandbox.cf




It seemed like a perl bug, but it turns out it may not be.
I reported the case as:
  https://rt.perl.org:443/rt3/Ticket/Display.html?id=113496

  Came across a small handful of complex regular expressions
  (in SpamAssassin rules), which seem to unwarrantedly hit an
  error: "Variable length lookbehind not implemented in regex"
  under perl 5.16.0 (on FreeBSD).

  Here is a distilled-down sample case:

  good:
    $ perl -e 'm/(?<!abc|cde)/i'

  good:
    $ perl -e 'm/(?<!abc|css)/'

  incorrect diagnostics:
    $ perl -e 'm/(?<!abc|css)/i'
    Variable length lookbehind not implemented in regex m/(?<!abc|css)/
      at -e line 1.


doy replied:
  This is likely because in 5.16, /ss/i compiles to something along the
  lines of /(?:ss|ß)/. You could possibly use \K instead of lookbehind
  (depending on the pattern), or use the /a regex modifier to enforce
  ASCII semantics.

Father Chrysostomos adds:
  The /aa modifier is what you need here.



So, are just the three J_CHICKENPOX_* rules in need of fixing,
or are we in deeper trouble?
Comment 1 Kevin A. McGrail 2012-06-04 17:31:29 UTC
I had been wondering when that email you wrote about chickenpox was going to rear it's head.  Unfortunately, this issue is WAY over my regexp head.  Does it only rear it's head with compiled rules?
Comment 2 Mark Martinec 2012-06-04 18:21:40 UTC
> Does it only rear it's head with compiled rules?

It showed in a --lint phase of install, non-compiled rules.

(compiled rules are possibly affected too, but these may
already be broken due to changes in a perl debug output
across versions, Bug 6649)

As a quick and dirty hack I just replaced a 'ss' with 's[s]'
in these three rules, so that installation/lint does not barf:

  Bug 6802: a hack on three J_CHICKENPOX_* rules,
  replacing ss with s[s] avoids interpreting "ss" as "sharp s"
    Sending rulesrc/sandbox/khopesh/20_chickenpox.cf
  Committed revision 1346064.


Don't know where else we may encounter effects of these changes
in perl. These three rules were just 'lucky' in using a construct
which involves an additional internal check on string lengths.

So far so good with 5.16.0, things appear to be working normally.
Comment 3 Mark Martinec 2012-06-05 18:49:08 UTC
> Don't know where else we may encounter effects of these changes

Should we be adding an:
  use re "/aa";
in code sections which interpret regexps in rules
to avoid surprises with Unicode semantics, or deal with
specific problems as we come across?

Adding an aa modifier directly in rules would break
these regexps for versions of perl older than 5.12 (I think).
Comment 4 Henrik Krohns 2019-07-26 09:22:25 UTC
Added /aa in compile_regex() for perl >=5.14.

Tested with mass-checks runs, it has no effect on rule hits or runtimes, so I deem it safe to use.

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/Util.pm
Sending        trunk/lib/Mail/SpamAssassin/Util.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1863788.
Comment 5 Henrik Krohns 2019-08-05 07:42:09 UTC
Rolled it back, it was buggy.

https://bugzilla.redhat.com/show_bug.cgi?id=731062

So it can be only used from perl 5.16 really.

Also I'm not sure if it should be enabled or disabled depending on normalize_charset. Needs a little bit more testing.

One can already use /aa in rules so it's not a very big problem.

body FOO m/(?<!abc|css)/aa
Comment 6 Henrik Krohns 2019-08-11 06:11:29 UTC
Btw using qr//aa for all regexes speed up total runtime by ~%6 !! I was wondering why things slowed down when I removed the patch. :-)

I'll play with it a bit in trunk, we can probably use it as long as the textual body is handled as bytes (could change later to utf8 for normalize_charset ?).
Comment 7 Henrik Krohns 2019-08-12 12:52:07 UTC
New try.

Sending        trunk/lib/Mail/SpamAssassin/Util.pm
Transmitting file data .done
Committing transaction...
Committed revision 1864964.