SA Bugzilla – Bug 6649
sa-compile fails on SOUGHT rule with re2c: error: line 207, column 2: unterminated string constant (missing ")
Last modified: 2018-01-28 17:56:11 UTC
Created attachment 4953 [details] Problematic scanner2.re This is being discussed on the users list today with the subject: Subject: Latest sa-update crashing sa-compile? # sa-compile Aug 15 11:47:46.221 [21218] info: generic: base extraction starting. this can take a while... Aug 15 11:47:46.221 [21218] info: generic: extracting from rules of type body_0 100% [================================================================================================] 568.89 rules/sec 00m02s DONE 100% [================================================================================================] 235.82 bases/sec 00m13s DONE Aug 15 11:48:02.864 [21218] info: body_0: 1817 base strings extracted in 16 seconds cd /tmp/.spamassassin21218ELu9TJtmp cd Mail-SpamAssassin-CompiledRegexps-body_0 re2c -i -b -o scanner1.c scanner1.re re2c -i -b -o scanner2.c scanner2.re re2c: error: line 207, column 2: unterminated string constant (missing ") command failed: exit 1 I'm guessing the problem is that the string in column 1 contains a newline (carriage return, linefeed, or both). Quite possibly from SOUGHT. So maybe a bug in SOUGHT's rule generation. Or maybe it would be nice to handle this in re2c? scanner2.re attached.
Yeah, the 21st character of line 207 in the attachment is hex 0A, a linefeed, the unix standard for a newline. head -n 208 /tmp/.spamassassin21134S0BApJtmp/Mail-SpamAssassin-CompiledRegexps-body_0/scanner2.re | tail -n 2 "��ٕh��uo~u뉒�z� ��$� ^\"�ka���w�b�>�;:����#��staf6���(���" {RET("__SEEK_4CZTNZ,[l=1]");} And it's from sought. /var/lib/spamassassin/3.004000/sought_rules_yerp_org# grep __SEEK_4CZTNZ * 20_sought.cf:body __SEEK_4CZTNZ ...nevermind, that's too long. # ls -l 20_sought.cf -rw-r--r-- 1 root root 265630 2011-08-15 03:52 20_sought.cf (Eastern Daylight Time) On the users list it was mentioned that just running sa-update again gets this problem to go away, at Mon, 15 Aug 2011 14:13:31 +0000. So, SOUGHT or re2c? Or both?
I just tried sa-update and then re-running sa-compile, and this problem did *not* go away.
scanner2.re is written by sa-compile. In trunk, scanner2.re is opened for writing on line 389, and it looks like the corrupted output is printed on line 431: 431 print $re "\t", 432 Mail::SpamAssassin::Plugin::BodyRuleBaseExtractor::fixup_re($regexp), 433 " {RET(\"$reason\");}\n" 434 or die "error writing: $!"; re2c is then run on this output on line 455. I'm thinking the best solution would be to modify Mail::SpamAssassin::Plugin::BodyRuleBaseExtractor::fixup_re() to escape the linefeed in its output, if re2c (which is not part of spamassassin) has a mechanism for escaping it.
Created attachment 4954 [details] Patch to escape linefeeds in re2c input, against trunk This might do it: $output =~ s/\n/\\n/g; # escape linefeeds Near the end of lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm. It makes the error go away. Still need to test that it's actually creating useful output. http://re2c.org/manual.html says ANSI-C escape sequences (which \n is) are allowed in strings.
(In reply to comment #4) > Created attachment 4954 [details] > Patch to escape linefeeds in re2c input, against trunk > > This might do it: > > $output =~ s/\n/\\n/g; # escape linefeeds > > Near the end of lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm. It > makes the error go away. Still need to test that it's actually creating useful > output. > > http://re2c.org/manual.html says ANSI-C escape sequences (which \n is) are > allowed in strings. Excellent. In theory sounds very good. Any of the users complaining able to test? I don't use re2c on any production systems :-(
I applied the patch, ran sa-update and sa-compile and reloaded amavisd. There were no errors during the compilation. The JM_SOUGHT_3 rule has hit 4 messages in the past two minutes. So, it appears to be working, but I don't know if it will detect a message with the particular sub-rule that was escaped. (at JM_SOUGHT_3 just hit another 12 messages...)
So... I'm trying to figure out exactly what this __SEEK_4CZTNZ rule was supposed to hit, so I can try testing it. \x{0a} would be the hex encoding of a linefeed character (0a). But it doesn't exist anywhere in the pre-compiled rule. Where that character comes from, there's a \x{01}. Why is hex character 01 being converted to a linefeed, hex character 0A for input into re2c?? Can it... result in a compiled rule that works?
it's a phish containing the following MIME headers: Content-Type: ; name="UPS_document.zip" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="UPS_document.zip" and a phishing attachment which is then being interpreted as text. I've updated the backend to strip these out (hopefully).
Created attachment 4955 [details] Sample email that hits __SEEK_4CZTNZ. I can delete my compiled rules and verify this attached email hits __SEEK_4CZTNZ (the rule with the linefeed problem) via JM_SOUGHT_3. Then I run sa-compile, and it *still* hits. Which shouldn't be possible, due to 0x01 in the pre-compiled rule being converted to 0x0A in the input to re2c. How do I verify it's the compiled form of a rule that I'm hitting? I'm pretty sure there's something fantastically broken in here. $ spamassassin -D -t < SEEK_4CZTNZ.mbox 2>&1 | grep 4CZTNZ Aug 15 18:11:40.458 [6210] dbg: rules: ran body rule __SEEK_4CZTNZ ======> got hit: "[binary gunk]" BTW, I created that .mbox file by just copying and pasting the regex (stuff between the //s) into a perl print statement and outputting it to a file. Then just adding a From header and blank line before it.
From /tmp/.spamassassin9294X8Ocvktmp/Mail-SpamAssassin-CompiledRegexps-body_0/scanner2.re, the file generated by sa-compile for input into re2c: "<82><D7>ٕh<FC><F2><BC><8E>uo~u뉒<A4>z<BF> <B2><99>^]$<D2> ^\"<FD>ka<BD><86><8E>w<FF>b<8D>><8A>;:<AC><E9><CB><F5><A3>#<B7><AF>staf6<93><F7><A3><ED>(<BC><B2><B6>" {RET("__SEEK_4CZTNZ,[l=1]");} The rule it was generated from: body __SEEK_4CZTNZ /\x{82}\x{d7}\x{d9}\x{95}H\x{fc}\x{f2}\x{bc}\x{8e}uo\~U\x{eb}\x{89}\x{92}\x{a4}z\x{bf}\x{01}d\x{e6}\~\x{04}\x{c9}L\x{81}\x{d2}\{\{l\x{a9}\x{f4}\x{a1}\x{87}m\x{e7}\x{bb}\/n \x{19}\(\x{fa}\x{a3}\x{dd}\x{ac}\x{1a}\[\.R\x{1a}z\x{e8}a\x{bb}\x{fb}\x{98}\[\$r_\x{02}y\x{ab}i\x{c1}\x{e1} ,\x{9f}\x{f9}\x{f9}\x{d3}2_\x{f9}\x{d0}\x{fc}2R\x{13}\x{ee}g\x{aa}\x{9b}TK\\ \x{1c}\x{f8}qu\x{0f}\x{e9}k\{\x{ab}\x{aa}q \+u\^f\x{a5}\x{8f}\x{db}_\} (truncated by me) See where the original rule says z\x{bf}\x{01} the scanner2.re file should say z<BF><01>, but it says z<BF><newline>. There's actually about 223 characters in the much longer original rule that got replaced by that one linefeed. Any ideas why?
We're seeing further instances of this behavior with SOUGHT rules. I'm attaching another file that causes re2c to segfault.
Created attachment 4964 [details] another problematic re2c input file from the sought ruleset
It would be nice if the build tests included a test to sa-compile a rule with the full set of (binary) characters, verifying the output matches the input (as I explain it doesn't in comment 10).
> Created attachment 4964 [details] > another problematic re2c input file from the sought ruleset Right, the re2c crashes with a core dump. The culprit line is: "\336\275\311x@fe;\272w\207t-\313\230"\342u\351<\317\355" {RET("__SEEK_UWCSGI,[l=1]"); Note that a " within a string is unescaped!
rule: body __SEEK_UWCSGI /\x{de}\x{bd}\x{c9}x\@FE;\x{ba}w\x{87}T-\x{cb} \x{98}\x{e4}2\x{e2}U\x{e9}<\x{cf}\x{ed} \x{f2}r\x{f9}\x{19}\x{87}\x{b5}\x{13}\x{ef}9\*/ result: "\x{de}\x{bd}\x{c9}x@fe;\x{ba}w\x{87}t-\x{cb} \x{98}"\x{e2}u\x{e9}<\x{cf}\x{ed}" note that \x{e4}2 was transformed into a "
I'm beginning to understand what may have happened. > \x{bf}\x{01}d\x{e6}... > See where the original rule says z\x{bf}\x{01} the scanner2.re file > should say z<BF><01>, but it says z<BF><newline> See \x{01}d - note that \xd is a code for a newline. > note that \x{e4}2 was transformed into a " Cannot yet fully explain this, but note that \x22 is a code for " Seems like the trouble originates from how the output of a "perl -Mre=debug" command is being parsed by Mail::SpamAssassin::Plugin::BodyRuleBaseExtractor::extract_hints(). The extract_hints() has to deal with truncted regexp debug output (lines with "..." at their end). The RE debug output of perl has changed subtly between 5.8 and 5.10: perl 5.8: <xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...>(18) perl 5.10, 5.12, 5.14: <xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>... (18) I'm not sure this is still being properly handled. A side effect may be that a \x{dd} in a regexp may be arbitrarily chopped for long strings, and an \x{ may join later with some text that follows. I do not fully understand what happens next and how to properly fix it. As a small hardening measure, I added escaping of NL and NULL characters, following a suggestion in Comment 4. It does not solve the origin of the trouble, but at least it limits the damage. I'm also not sure that re2c knows how to handle null characters (escaped or raw) in regexp strings in a scanner*.re source. There are several of these in current SOUGHT rules. trunk: Bug 6649: sa-compile fails on SOUGHT rule with re2c: unterminated string constant - protect special characters, some debuggings aids, perl -Mre=debug changed its output format with perl 5.10 Sending lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm Sending lib/Mail/SpamAssassin/Util.pm Sending sa-compile.raw Committed revision 1174349.
Increasing priority from "normal" to "major" - sa-compile is breaking on input data (other than low ASCII) in a way nobody has been able to explain. And which there's no build test for.
> sa-compile is breaking on input data (other than low ASCII) in a way > nobody has been able to explain. What happens is explained in Comment 16. It's just that nobody is willing / able / knowledgeable enough to fix it.
Perhaps I should explain: the BodyRuleBaseExtractor spawns a perl interpreter in a regexp debug mode and gives it a regexp to compile - letting perl's regexp machinery do the hard work. Then it tries to parse the perl debug output and make some sense/use of it. The trouble is that perl regexp debug output truncates its lines in a report, and does so at fixed line sizes, which do not take into account possible multi-character byte sequences (like backslash-escaping). So doing it right is tricky if not impossible. I'm not volunteering.
Just want to record that this came up again today: http://www.gossamer-threads.com/lists/spamassassin/users/176122
There is a related bug 6336, which was fixed in 3.3.2, which Mark Martinec believes my previous comment was actually caused by. Summary of the status of this bug is in comment 19 and comment 16.
No more comments in 6 years after bug 6336 has been fixed. Time to close this bz ?
SOUGHT rules have been deprecated for years and this issue is 6 years old.