SA Bugzilla – Bug 6583
log_message and get_content_preview should print safe
Last modified: 2024-03-04 19:38:17 UTC
Documentation: "This is returned as a multi-line string, with the lines separated by C<\n> characters, containing a fully-decoded, safe, plain-text sample of the first few lines of the message body." The preview isn't actually "binary safe". I would assume something like that. There are many cases where SA ends up scanning binary bodies.
Any objections? Or change it to just very bad control characters? --- lib/Mail/SpamAssassin/PerMsgStatus.pm (revision 1100366) +++ lib/Mail/SpamAssassin/PerMsgStatus.pm (working copy) @@ -619,6 +619,9 @@ $str =~ s/[-_\*\.]{10,}//gs; $str =~ s/\s+/ /gs; + # Bug 6583: it's supposed to be "safe", so replace unprintable with ? + $str =~ s/[^[:print:]]/?/gs; + # add "Content preview:" ourselves, so that the text aligns # correctly with the template -- then trim it off. We don't # have to get this *exactly* right, but it's nicer if we
Or better yet, show unprintable as hex "\xFF". I'd also like to do this with the whole debug output (atleast the matched rules part), since it's crapped my terminal many times..
I'm not sure if [:print:] is going to have the same locale problems as lc(), or if it'll be fine due to using whatever character set the body was encoded in. Looks like there's a simpler fix: $str =~ s/\P{IsPrint}/?/gs; From the perlre man page: But if the "locale" or "encoding" pragmas are not used and the string is not "utf8", then "[[:xxxxx:]]" (and "\w", etc.) will not match characters 0x80-0xff; whereas "\p{IsXxxxx}" will force the string to "utf8" and can match these characters (as Unicode). [2] "\p{IsPrint}" matches characters 0x09-0x0d but "[[:print:]]" does not.
(In reply to comment #3) > I'm not sure if [:print:] is going to have the same locale problems as lc(), or > if it'll be fine due to using whatever character set the body was encoded in. > Looks like there's a simpler fix: > > $str =~ s/\P{IsPrint}/?/gs; Again I'd like to be cautious with introducing untested features.. since especially old versions of Perl had even crashes with utf8 related things. And I don't really think we need to care utf8 with my proposal here, we just escape characters for the sake of readability. Here's what I propose, atleast for trunk.. if someone really wants to directly see some high characters like latin special characters or euro symbols let me know. Index: PerMsgStatus.pm =================================================================== --- PerMsgStatus.pm (revision 1100366) +++ PerMsgStatus.pm (working copy) @@ -626,6 +626,9 @@ $str = Mail::SpamAssassin::Util::wrap($str, " ", "Content preview: ", 75, 1); $str =~ s/^Content preview:\s+//gs; + # Bug 6583: replace unsafe characters with decimal octets "\000" - "\255" + $str =~ s/([\x00-\x09\x0b-\x1f\x7f-\xff])/sprintf("\\%03d",ord($1))/egs; + return $str; } Index: Logger.pm =================================================================== --- Logger.pm (revision 1098151) +++ Logger.pm (working copy) @@ -174,9 +174,10 @@ # the subclasses having to understand multi-line logs my $first = 1; foreach my $line (split(/\n/, $message)) { - # replace control characters with "_", tabs and spaces get - # replaced with a single space. - $line =~ tr/\x09\x20\x00-\x1f/ _/s; + # tabs and spaces get replaced with a single space. + $line =~ tr/\x09\x20/ /s; + # Bug 6583: replace unsafe characters with decimal octets "\000" - "\255" + $line =~ s/([\x00-\x09\x0b-\x1f\x7f-\xff])/sprintf("\\%03d",ord($1))/egs; if ($first) { $first = 0; } else {
Too untested and too many questions to consider for 3.3.X.
Some comments, before I forget: Considering that logging has three different back-ends internally (file, stderr and syslog) and that it has a documented open interface for third-party backends (which is used at least by amavisd-new to integrate SpamAssassin logging with its own), I'd prefer that each backend logger takes care of its own limitations, which may not be the same for each. For example a file-based logging may be happy with logging UTF-8 text, while the syslog may not be. A syslog backend may have line length limitations, while others may not. A concern with the amavisd backend logger is that it does its own sanitation of nonprintable characters, so if SpamAssassin does it at the common Logger.pm routine, we'd end up with doubled backslashes - not too bad, but ugly nevertheless. So I'd move the sanitation line from Logger.pm into Logger/*.pm modules. And perhaps the Logger/File.pm is a candidate for less strict sanitation, like allowing 8-bit characters through. Btw, the suggested s/// does not cope with character codes above 255, which may happen to find its way into a string being sanitized. Also the '\' is not being doubled, which means that the result is not uniquely reversible (if it happens that a '\nnn' is literally in a string). Perhaps something like the following would do: $str =~ s{([^\040-\176])} { $1 eq '\\' ? '\\\\' : sprintf(ord($1) > 255 ? '\\x{%04x}' : '\\%03o', ord($1)) }egs; An alternative for sanitizing strings with characters above 255 is to encode them first into utf-8, and only then \ooo -quote them. Not sure what subset of characters should be allow to pass through un-encoded in PerMsgStatus::get_content_preview().
Very nice insights Mark, didn't notice Logger/* .. I'll look into this later, if someone doesn't before me.
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target
This is a good idea but pushing for 3.4.2. Perhaps right after 3.4.1 we can get some patches committed for testing by those of us using trunk for day-to-day?
Pushing to 3.4.3
Committed to trunk in backwards and 3rd party compatible way. New escape => 1 option is used everywhere in SA code, for example: Mail::SpamAssassin::Logger::add(method => 'stderr', timestamp_fmt => $log_timestamp_fmt, escape => 1); If escape is not defined at all (like current Amavis), normal pre-4.0 method is used. This is the used method, which is moved into all Logger/*.pm modules. if ($self->{escape}) { local $1; # Bug 7305: # Quote non-ascii characters as \x{XX} or \x{XXXX} (Unicode) # Also quote backslash, so the log can be unescaped properly $msg =~ s{([^\x20-\x5b\x5d-\x7e])}{ $1 eq '\\' ? '\\\\' : sprintf(ord($1) > 255 ? '\\x{%04X}' : '\\x{%02X}', ord($1)) }egs; } elsif (!exists $self->{escape}) { # Backwards compatible pre-4.0 escaping, if $escape not given. # replace control characters with "_", tabs and spaces get # replaced with a single space. $msg =~ tr/\x09\x20\x00-\x1f/ _/s; } I hate octal presensation, so it's \x{FF} and \x{FFFF} to be uniform. Should have done this long time ago, it's super helpful! Aug 12 12:20:25.854 [11986] dbg: rules: ran rawbody rule __L_BODY_8BITS ======> got hit: "\x{EF}" Aug 12 12:20:48.846 [12007] dbg: rules: ran rawbody rule __HAS_HREF ======> got hit: "\x{09}\x{09}<td><a href=" Aug 12 12:21:54.700 [12081] dbg: message: _decode_header subject: Fyll inn alle detaljene for \x{C3}\x{B8}nsket l\x{C3}\x{A5}n Aug 12 12:21:54.902 [12081] dbg: rules: ran body rule __HIGHBITS ======> got hit: "\x{C3}\x{83}\x{C2}\x{B8}n" Aug 12 12:22:56.663 [12272] dbg: uri: canonicalizing parsed uri: http://www\x{E3}\x{80}\x{82}example3\x{E3}\x{80}\x{82}com Sending trunk/UPGRADE Sending trunk/lib/Mail/SpamAssassin/Logger/File.pm Sending trunk/lib/Mail/SpamAssassin/Logger/Stderr.pm Sending trunk/lib/Mail/SpamAssassin/Logger/Syslog.pm Sending trunk/lib/Mail/SpamAssassin/Logger.pm Sending trunk/spamd/spamd.raw Transmitting file data ......done Committing transaction... Committed revision 1864949.
Content preview: now escaped too Sending trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm Transmitting file data .done Committing transaction... Committed revision 1864961.
Very interested to see this in production.
Revamped it after more testing. It was a bit too eager imo.. For logs, now escaped are only escape chars + DEL + UTF-8 sequences. I think the UTF-8 bit is important, so debugs can be viewed on any terminal. I removed content preview escaping, the whole preview function needs more overall thought on how it handles different encodingins, since it's supposed to prints out stuff on mail body. Even normalize_charset breaks it I think. Sending trunk/UPGRADE Sending trunk/lib/Mail/SpamAssassin/Logger/File.pm Sending trunk/lib/Mail/SpamAssassin/Logger/Stderr.pm Sending trunk/lib/Mail/SpamAssassin/Logger/Syslog.pm Sending trunk/lib/Mail/SpamAssassin/Logger.pm Sending trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm Transmitting file data ......done Committing transaction... Committed revision 1865007.
The previous UTF-8 escaping is more problematic than the problem it's trying to solve (showing extended latin1 chars for readibility). All extended ascii is now escaped. Sending lib/Mail/SpamAssassin/Logger.pm Transmitting file data .done Committing transaction... Committed revision 1866130.
*** Bug 8217 has been marked as a duplicate of this bug. ***