6649 – sa-compile fails on SOUGHT rule with re2c: error: line 207, column 2: unterminated string constant (missing ")

Bug 6649 - sa-compile fails on SOUGHT rule with re2c: error: line 207, column 2: unterminated string constant (missing ")

Summary: sa-compile fails on SOUGHT rule with re2c: error: line 207, column 2: untermi...

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Rules (show other bugs)
Version:	SVN Trunk (Latest Devel Version)
Hardware:	All All

Importance:	P2 major
Target Milestone:	Undefined
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-08-15 15:53 UTC by Darxus
Modified:	2018-01-28 17:56 UTC (History)
CC List:	8 users (show)

Attachment	Type	Actions	Submitter/CLA Status
Problematic scanner2.re	text/plain	None	Darxus
Patch to escape linefeeds in re2c input, against trunk	patch	None	Darxus
Sample email that hits __SEEK_4CZTNZ.	text/plain	None	Darxus
another problematic re2c input file from the sought ruleset	application/octet-stream	None	bpoliakoff
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Darxus 2011-08-15 15:53:16 UTC

Created attachment 4953 [details]
Problematic scanner2.re

This is being discussed on the users list today with the subject:  Subject: Latest sa-update crashing sa-compile?

# sa-compile
Aug 15 11:47:46.221 [21218] info: generic: base extraction starting. this can take a while...
Aug 15 11:47:46.221 [21218] info: generic: extracting from rules of type body_0
100% [================================================================================================] 568.89 rules/sec 00m02s DONE
100% [================================================================================================] 235.82 bases/sec 00m13s DONE
Aug 15 11:48:02.864 [21218] info: body_0: 1817 base strings extracted in 16 seconds
cd /tmp/.spamassassin21218ELu9TJtmp
cd Mail-SpamAssassin-CompiledRegexps-body_0
re2c -i -b -o scanner1.c scanner1.re
re2c -i -b -o scanner2.c scanner2.re
re2c: error: line 207, column 2: unterminated string constant (missing ")
command failed: exit 1


I'm guessing the problem is that the string in column 1 contains a newline (carriage return, linefeed, or both).  Quite possibly from SOUGHT.  So maybe a bug in SOUGHT's rule generation.  Or maybe it would be nice to handle this in re2c?

scanner2.re attached.

Comment 1 Darxus 2011-08-15 16:03:29 UTC

Yeah, the 21st character of line 207 in the attachment is hex 0A, a linefeed, the unix standard for a newline.  


head -n 208 /tmp/.spamassassin21134S0BApJtmp/Mail-SpamAssassin-CompiledRegexps-body_0/scanner2.re | tail -n 2 
        "��ٕh��uo~u뉒�z�
��$� ^\"�ka���w�b�>�;:����#��staf6���(���"            {RET("__SEEK_4CZTNZ,[l=1]");}


And it's from sought.  

/var/lib/spamassassin/3.004000/sought_rules_yerp_org# grep __SEEK_4CZTNZ *
20_sought.cf:body __SEEK_4CZTNZ  
...nevermind, that's too long.

# ls -l 20_sought.cf
-rw-r--r-- 1 root root 265630 2011-08-15 03:52 20_sought.cf
(Eastern Daylight Time)

On the users list it was mentioned that just running sa-update again gets this problem to go away, at Mon, 15 Aug 2011 14:13:31 +0000.


So, SOUGHT or re2c?  Or both?

Comment 2 Darxus 2011-08-15 16:09:13 UTC

I just tried sa-update and then re-running sa-compile, and this problem did *not* go away.

Comment 3 Darxus 2011-08-15 20:47:05 UTC

scanner2.re is written by sa-compile.  In trunk, scanner2.re is opened for writing on line 389, and it looks like the corrupted output is printed on line 431:

431   print $re "\t",
432           Mail::SpamAssassin::Plugin::BodyRuleBaseExtractor::fixup_re($regexp),
433           "            {RET(\"$reason\");}\n"
434           or die "error writing: $!";

re2c is then run on this output on line 455.

I'm thinking the best solution would be to modify Mail::SpamAssassin::Plugin::BodyRuleBaseExtractor::fixup_re() to escape the linefeed in its output, if re2c (which is not part of spamassassin) has a mechanism for escaping it.

Comment 4 Darxus 2011-08-15 21:10:50 UTC

Created attachment 4954 [details]
Patch to escape linefeeds in re2c input, against trunk

This might do it:

  $output =~ s/\n/\\n/g; # escape linefeeds

Near the end of lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm.  It makes the error go away.  Still need to test that it's actually creating useful output.

http://re2c.org/manual.html says ANSI-C escape sequences (which \n is) are allowed in strings.

Comment 5 Kevin A. McGrail 2011-08-15 21:15:16 UTC

(In reply to comment #4)
> Created attachment 4954 [details]
> Patch to escape linefeeds in re2c input, against trunk
> 
> This might do it:
> 
>   $output =~ s/\n/\\n/g; # escape linefeeds
> 
> Near the end of lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm.  It
> makes the error go away.  Still need to test that it's actually creating useful
> output.
> 
> http://re2c.org/manual.html says ANSI-C escape sequences (which \n is) are
> allowed in strings.

Excellent.  In theory sounds very good.  Any of the users complaining able to test?  I don't use re2c on any production systems :-(

Comment 6 Daniel J McDonald 2011-08-15 21:32:19 UTC

I applied the patch, ran sa-update and sa-compile and reloaded amavisd.  There were no errors during the compilation.  The JM_SOUGHT_3 rule has hit 4 messages in the past two minutes.  So, it appears to be working, but I don't know if it will detect a message with the particular sub-rule that was escaped.

(at JM_SOUGHT_3 just hit another 12 messages...)

Comment 7 Darxus 2011-08-15 21:42:11 UTC

So... I'm trying to figure out exactly what this __SEEK_4CZTNZ rule was supposed to hit, so I can try testing it.

\x{0a} would be the hex encoding of a linefeed character (0a).  But it doesn't exist anywhere in the pre-compiled rule.  Where that character comes from, there's a \x{01}.  Why is hex character 01 being converted to a linefeed, hex character 0A for input into re2c??  Can it... result in a compiled rule that works?

Comment 8 Justin Mason 2011-08-15 22:14:02 UTC

it's a phish containing the following MIME headers:

Content-Type: ;
        name="UPS_document.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
        filename="UPS_document.zip"

and a phishing attachment which is then being interpreted as text.

I've updated the backend to strip these out (hopefully).

Comment 9 Darxus 2011-08-15 22:18:36 UTC

Created attachment 4955 [details]
Sample email that hits __SEEK_4CZTNZ.

I can delete my compiled rules and verify this attached email hits __SEEK_4CZTNZ (the rule with the linefeed problem) via JM_SOUGHT_3.  Then I run sa-compile, and it *still* hits.  Which shouldn't be possible, due to 0x01 in the pre-compiled rule being converted to 0x0A in the input to re2c.

How do I verify it's the compiled form of a rule that I'm hitting?

I'm pretty sure there's something fantastically broken in here.

$ spamassassin -D -t < SEEK_4CZTNZ.mbox 2>&1 | grep 4CZTNZ
Aug 15 18:11:40.458 [6210] dbg: rules: ran body rule __SEEK_4CZTNZ ======> got hit: "[binary gunk]"

BTW, I created that .mbox file by just copying and pasting the regex (stuff between the //s) into a perl print statement and outputting it to a file.  Then just adding a From header and blank line before it.

Comment 10 Darxus 2011-08-15 22:53:06 UTC

From /tmp/.spamassassin9294X8Ocvktmp/Mail-SpamAssassin-CompiledRegexps-body_0/scanner2.re, the file generated by sa-compile for input into re2c:

"<82><D7>ٕh<FC><F2><BC><8E>uo~u뉒<A4>z<BF>
<B2><99>^]$<D2> ^\"<FD>ka<BD><86><8E>w<FF>b<8D>><8A>;:<AC><E9><CB><F5><A3>#<B7><AF>staf6<93><F7><A3><ED>(<BC><B2><B6>"            {RET("__SEEK_4CZTNZ,[l=1]");}

The rule it was generated from:

body __SEEK_4CZTNZ  /\x{82}\x{d7}\x{d9}\x{95}H\x{fc}\x{f2}\x{bc}\x{8e}uo\~U\x{eb}\x{89}\x{92}\x{a4}z\x{bf}\x{01}d\x{e6}\~\x{04}\x{c9}L\x{81}\x{d2}\{\{l\x{a9}\x{f4}\x{a1}\x{87}m\x{e7}\x{bb}\/n \x{19}\(\x{fa}\x{a3}\x{dd}\x{ac}\x{1a}\[\.R\x{1a}z\x{e8}a\x{bb}\x{fb}\x{98}\[\$r_\x{02}y\x{ab}i\x{c1}\x{e1} ,\x{9f}\x{f9}\x{f9}\x{d3}2_\x{f9}\x{d0}\x{fc}2R\x{13}\x{ee}g\x{aa}\x{9b}TK\\ \x{1c}\x{f8}qu\x{0f}\x{e9}k\{\x{ab}\x{aa}q \+u\^f\x{a5}\x{8f}\x{db}_\}
(truncated by me)

See where the original rule says z\x{bf}\x{01} the scanner2.re file should say z<BF><01>, but it says z<BF><newline>.  There's actually about 223 characters in the much longer original rule that got replaced by that one linefeed.  Any ideas why?

Comment 11 bpoliakoff 2011-09-21 16:59:41 UTC

We're seeing further instances of this behavior with SOUGHT rules.  I'm attaching another file that causes re2c to segfault.

Comment 12 bpoliakoff 2011-09-21 17:00:37 UTC

Created attachment 4964 [details]
another problematic re2c input file from the sought ruleset

Comment 13 Darxus 2011-09-21 17:11:39 UTC

It would be nice if the build tests included a test to sa-compile a rule with the full set of (binary) characters, verifying the output matches the input (as I explain it doesn't in comment 10).

Comment 14 Mark Martinec 2011-09-22 15:22:35 UTC

> Created attachment 4964 [details]
> another problematic re2c input file from the sought ruleset

Right, the re2c crashes with a core dump.

The culprit line is:

  "\336\275\311x@fe;\272w\207t-\313\230"\342u\351<\317\355"
    {RET("__SEEK_UWCSGI,[l=1]");

Note that a " within a string is unescaped!

Comment 15 Mark Martinec 2011-09-22 15:31:43 UTC

rule:
body __SEEK_UWCSGI
  /\x{de}\x{bd}\x{c9}x\@FE;\x{ba}w\x{87}T-\x{cb}
   \x{98}\x{e4}2\x{e2}U\x{e9}<\x{cf}\x{ed}
   \x{f2}r\x{f9}\x{19}\x{87}\x{b5}\x{13}\x{ef}9\*/

result:
  "\x{de}\x{bd}\x{c9}x@fe;\x{ba}w\x{87}t-\x{cb}
   \x{98}"\x{e2}u\x{e9}<\x{cf}\x{ed}"

note that  \x{e4}2  was transformed into a  "

Comment 16 Mark Martinec 2011-09-22 20:19:37 UTC

I'm beginning to understand what may have happened.

> \x{bf}\x{01}d\x{e6}...
> See where the original rule says z\x{bf}\x{01} the scanner2.re file
> should say z<BF><01>, but it says z<BF><newline>

See \x{01}d - note that \xd is a code for a newline.

> note that  \x{e4}2  was transformed into a  "

Cannot yet fully explain this, but note that \x22 is a code for "

Seems like the trouble originates from how the output
of a "perl -Mre=debug" command is being parsed by
Mail::SpamAssassin::Plugin::BodyRuleBaseExtractor::extract_hints().

The extract_hints() has to deal with truncted regexp debug output
(lines with "..." at their end). The RE debug output of perl has
changed subtly between 5.8 and 5.10:

  perl 5.8:              <xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...>(18)
  perl 5.10, 5.12, 5.14: <xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>... (18)

I'm not sure this is still being properly handled.
A side effect may be that a \x{dd} in a regexp may be arbitrarily
chopped for long strings, and an  \x{  may join later with some
text that follows. I do not fully understand what happens next
and how to properly fix it.

As a small hardening measure, I added escaping of NL and NULL
characters, following a suggestion in Comment 4. It does not solve
the origin of the trouble, but at least it limits the damage.

I'm also not sure that re2c knows how to handle null characters
(escaped or raw) in regexp strings in a scanner*.re source.
There are several of these in current SOUGHT rules.

trunk:
  Bug 6649: sa-compile fails on SOUGHT rule with re2c: unterminated
  string constant - protect special characters, some debuggings aids,
  perl -Mre=debug changed its output format with perl 5.10
Sending lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Sending lib/Mail/SpamAssassin/Util.pm
Sending sa-compile.raw
Committed revision 1174349.

Comment 17 Darxus 2011-11-14 17:54:29 UTC

Increasing priority from "normal" to "major" - sa-compile is breaking on input data (other than low ASCII) in a way nobody has been able to explain.  And which there's no build test for.

Comment 18 Mark Martinec 2011-11-14 18:13:21 UTC

> sa-compile is breaking on input data (other than low ASCII) in a way
> nobody has been able to explain.

What happens is explained in Comment 16.
It's just that nobody is willing / able / knowledgeable enough
to fix it.

Comment 19 Mark Martinec 2011-11-14 18:28:24 UTC

Perhaps I should explain: the BodyRuleBaseExtractor spawns a perl interpreter
in a regexp debug mode and gives it a regexp to compile - letting perl's regexp
machinery do the hard work.

Then it tries to parse the perl debug output and make some sense/use of it.
The trouble is that perl regexp debug output truncates its lines in a report,
and does so at fixed line sizes, which do not take into account possible
multi-character byte sequences (like backslash-escaping). So doing it right
is tricky if not impossible. I'm not volunteering.

Comment 20 Darxus 2012-12-17 18:25:34 UTC

Just want to record that this came up again today:  http://www.gossamer-threads.com/lists/spamassassin/users/176122

Comment 21 Darxus 2012-12-17 19:32:49 UTC

There is a related bug 6336, which was fixed in 3.3.2, which Mark Martinec believes my previous comment was actually caused by.

Summary of the status of this bug is in comment 19 and comment 16.

Comment 22 Giovanni Bechis 2018-01-28 14:41:15 UTC

No more comments in 6 years after bug 6336 has been fixed.
Time to close this bz ?

Comment 23 Dave Jones 2018-01-28 17:56:11 UTC

SOUGHT rules have been deprecated for years and this issue is 6 years old.