Bug 6226 - mass-check lossage on NUL bytes and double-byte characters in hit texts
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Masses
Version: SVN Trunk (Latest Devel Version)
Hardware: Other All
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Depends on:
Reported: 2009-10-24 10:19 UTC by John Hardin
Modified: 2019-07-30 18:42 UTC
CC List: 2 users

Attachment Type Modified Status Actions Submitter/CLA Status
Let mass-check accept NUL bytes in rule hits patch None John Hardin [HasCLA]
Fix mass-check problems with NUL bytes and wide characters in match text when using --loghits. patch None John Hardin [HasCLA]

Description John Hardin 2009-10-24 10:19:19 UTC
Created attachment 4556
Let mass-check accept NUL bytes in rule hits

mass-check uses a NUL byte to delimit the hit-texts from the scan results in
IPC when --loghits is specified. Unfortunately the RE for parsing this does not
properly react to literal NULs appearing in the hit texts, e.g. for

Proposed change attached. Votes? Do we want to get this in promptly, or wait
until 3.0.0 ships?
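
For illustration only, here is a minimal sketch of NUL-delimited framing that tolerates literal NULs in the hit texts. It is not the attached patch and not mass-check's actual code: the record layout, the helper names pack_record and unpack_record, and the sample values are all assumptions. It also assumes the result field itself never contains a NUL, so only the first NUL is treated as the delimiter.

```perl
use strict;
use warnings;

# Hypothetical framing: "<result>\0<hit texts>". Not mass-check's real format.
sub pack_record {
    my ($result, $hits) = @_;
    return $result . "\0" . $hits;
}

sub unpack_record {
    my ($record) = @_;
    # A limit of 2 splits only at the first NUL; any later NULs stay in $hits.
    my ($result, $hits) = split /\0/, $record, 2;
    return ($result, $hits);
}

# Sample hit text with an embedded NUL (made-up data).
my $hits = "RULE_A=foo\0bar";
my ($r, $h) = unpack_record(pack_record("Y 5 /path/msg", $hits));
print "round trip ok\n" if $h eq $hits;
```

Splitting with a limit avoids having to write a regex that distinguishes the delimiter NUL from NULs in the payload.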
Comment 1 John Hardin 2009-10-24 17:22:08 UTC
Created attachment 4557
Fix mass-check problems with NUL bytes and wide characters in match text when using --loghits.
Comment 2 John Hardin 2009-10-24 17:23:27 UTC
Also: high-bit characters in the match texts cause "wide character" errors in single-process mode and a complete hang in parallel mode (at least on x86_64). The second attached patch fixes both problems.
Comment 3 Warren Togami 2009-10-24 20:50:54 UTC
Is utf8 the only possible encoding where this issue happens?
Comment 4 John Hardin 2009-10-25 09:10:01 UTC
It's not a matter of the encoding of the message itself so much as it is the presence of 8-bit characters in the match strings (e.g. somebody directly typed an iso-8859-1 accented character into an unencoded 8-bit message body, and that gets hit by some rule).

UTF8 encoding is done to temporarily armor the match string results for IPC to avoid hanging the entire masscheck process when the IO library routines encounter a wide character. Perhaps the _proper_ solution involves encoding the match strings much earlier in the process, when the scan generates them and the original encoding of the message (if any) is known. I didn't dig that deeply into it.
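
A rough sketch of the armoring idea, assuming the core Encode module is used. The sample match string is made up, and this is not the code from the attached patch. Note that if the match string is a plain byte string holding iso-8859-1 bytes, encode() treats each byte as a Latin-1 code point, which is one reason encoding earlier, while the original charset is known, might be cleaner.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical match text: one Latin-1 range character and one above 0xFF.
my $match = "caf\x{e9} \x{263a}";

# Writer side: turn the character string into UTF-8 octets before printing
# it to a byte-oriented handle such as an IPC pipe.
my $octets = encode('UTF-8', $match);

# Reader side: turn the octets back into the original character string.
my $restored = decode('UTF-8', $octets);

printf "chars: %d, octets: %d\n", length($match), length($octets);
```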

I should have posted this earlier - here's the tail of the log output from unpatched masscheck using -j>1 when an unencoded wide character appears in the match text. After this point all of the masscheck processes are present, but idle until killed.

-------------------BEGIN LOG
.....status:  26% ham: 1361   spam: 953    date: 2005-12-02   now: 2009-10-24 01:56:28 PM
status:  27% ham: 1413   spam: 990    date: 2007-06-29   now: 2009-10-24 01:57:21 PM
Wide character in print at /usr/lib64/perl5/5.8.8/x86_64-linux/IO/Handle.pm line 401.
-------------------END LOG

398 sub print {
399     @_ or croak 'usage: $io->print(ARGS)';
400     my $this = shift;
401     print $this @_;
402 }

When running without -j there are "wide character in print" warnings but the masscheck process runs to completion.
Comment 5 Henrik Krohns 2019-07-30 18:42:58 UTC
I'd rather base64-encode the RESULT than rely on possibly error-prone UTF-8 encoding.
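
As a sketch of what base64 armoring could look like (not the change committed below; the helper names armor_result and unarmor_result are hypothetical), using the core MIME::Base64 and Encode modules. One caveat: encode_base64 croaks on characters above 0xFF, so this sketch converts to UTF-8 octets first.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical helpers, for illustration only.
sub armor_result {
    my ($result) = @_;
    # Convert to octets first, then base64 with an empty line terminator
    # so the armored string stays on a single line.
    return encode_base64(encode('UTF-8', $result), '');
}

sub unarmor_result {
    my ($armored) = @_;
    return decode('UTF-8', decode_base64($armored));
}

# Made-up result containing both a NUL and a wide character.
my $result   = "RULE_A=foo\0bar \x{263a}";
my $armored  = armor_result($result);
my $restored = unarmor_result($armored);
print "armored: $armored\n";
```

The armored string contains only base64 alphabet characters, so neither NUL bytes nor wide characters reach the IPC handle.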

Sending        masses/mass-check
Transmitting file data .done
Committing transaction...
Committed revision 1864018.

Will leave this open for the print STDOUT issue. Not sure whether we should just binmode(STDOUT, ":utf8") or something, if it's still an issue.
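
To illustrate the binmode question, here is a small sketch that prints the same wide character to two handles, one byte-oriented and one with a UTF-8 layer, and counts the warnings. It uses temporary files as stand-ins for STDOUT, and the sample text is made up; whether an encoding layer is the right fix for mass-check itself is left open here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collect warnings so the difference between the two handles is visible.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $text = "match: \x{263a}\n";    # hypothetical hit text with a wide character

# Byte-oriented handle: print emits "Wide character in print".
my ($raw_fh, $raw_name) = tempfile(UNLINK => 1);
binmode($raw_fh);                  # force a handle with no encoding layer
print {$raw_fh} $text;
close($raw_fh);
my $warned_raw = scalar @warnings;

# Handle with a UTF-8 layer: the same print produces no warning.
@warnings = ();
my ($utf_fh, $utf_name) = tempfile(UNLINK => 1);
binmode($utf_fh, ':encoding(UTF-8)');
print {$utf_fh} $text;
close($utf_fh);
my $warned_utf = scalar @warnings;

print "raw handle warnings: $warned_raw, utf8 handle warnings: $warned_utf\n";
```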