Bug 6583 - log_message and get_content_preview should print safe
Summary: log_message and get_content_preview should print safe
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: 3.3.1
Hardware: All All
: P3 minor
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
: 8217 (view as bug list)
Depends on:
Blocks:
 
Reported: 2011-05-07 13:24 UTC by Henrik Krohns
Modified: 2024-03-04 19:38 UTC (History)
5 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Henrik Krohns 2011-05-07 13:24:48 UTC
Documentation:

"This is returned as a multi-line string, with the lines separated by C<\n>
characters, containing a fully-decoded, safe, plain-text sample of the first
few lines of the message body."

The preview isn't actually "binary safe". I would assume something like that. There are many cases where SA ends up scanning binary bodies.
Comment 1 Henrik Krohns 2011-05-07 13:28:06 UTC
Any objections? Or change it to just very bad control characters?

--- lib/Mail/SpamAssassin/PerMsgStatus.pm (revision 1100366)
+++ lib/Mail/SpamAssassin/PerMsgStatus.pm (working copy)
@@ -619,6 +619,9 @@
   $str =~ s/[-_\*\.]{10,}//gs;
   $str =~ s/\s+/ /gs;

+  # Bug 6583: it's supposed to be "safe", so replace unprintable with ?
+  $str =~ s/[^[:print:]]/?/gs;
+
   # add "Content preview:" ourselves, so that the text aligns
   # correctly with the template -- then trim it off.  We don't
   # have to get this *exactly* right, but it's nicer if we
Comment 2 Henrik Krohns 2011-05-07 13:38:24 UTC
Or better yet, show unprintable as hex "\xFF".

I'd also like to do this with the whole debug output (atleast the matched rules part), since it's crapped my terminal many times..
Comment 3 Darxus 2011-05-07 14:23:11 UTC
I'm not sure if [:print:] is going to have the same locale problems as lc(), or if it'll be fine due to using whatever character set the body was encoded in.  Looks like there's a simpler fix:

$str =~ s/\P{IsPrint}/?/gs;

From the perlre man page:


But if the "locale" or "encoding" pragmas are not used and the string is not "utf8", then "[[:xxxxx:]]" (and "\w", etc.)  will not match characters 0x80-0xff; whereas "\p{IsXxxxx}" will force the string to "utf8" and can match these characters (as Unicode).

[2] "\p{IsPrint}" matches characters 0x09-0x0d but "[[:print:]]" does not.
Comment 4 Henrik Krohns 2011-05-08 07:00:36 UTC
(In reply to comment #3)
> I'm not sure if [:print:] is going to have the same locale problems as lc(), or
> if it'll be fine due to using whatever character set the body was encoded in. 
> Looks like there's a simpler fix:
> 
> $str =~ s/\P{IsPrint}/?/gs;

Again I'd like to be cautious with introducing untested features.. since especially old versions of Perl had even crashes with utf8 related things. And I don't really think we need to care utf8 with my proposal here, we just escape characters for the sake of readability.

Here's what I propose, atleast for trunk.. if someone really wants to directly see some high characters like latin special characters or euro symbols let me know.


Index: PerMsgStatus.pm
===================================================================
--- PerMsgStatus.pm     (revision 1100366)
+++ PerMsgStatus.pm     (working copy)
@@ -626,6 +626,9 @@
   $str = Mail::SpamAssassin::Util::wrap($str, "  ", "Content preview:  ", 75, 1);
   $str =~ s/^Content preview:\s+//gs;

+  # Bug 6583: replace unsafe characters with decimal octets "\000" - "\255"
+  $str =~ s/([\x00-\x09\x0b-\x1f\x7f-\xff])/sprintf("\\%03d",ord($1))/egs;
+
   return $str;
 }

Index: Logger.pm
===================================================================
--- Logger.pm   (revision 1098151)
+++ Logger.pm   (working copy)
@@ -174,9 +174,10 @@
   # the subclasses having to understand multi-line logs
   my $first = 1;
   foreach my $line (split(/\n/, $message)) {
-    # replace control characters with "_", tabs and spaces get
-    # replaced with a single space.
-    $line =~ tr/\x09\x20\x00-\x1f/  _/s;
+    # tabs and spaces get replaced with a single space.
+    $line =~ tr/\x09\x20/  /s;
+    # Bug 6583: replace unsafe characters with decimal octets "\000" - "\255"
+    $line =~ s/([\x00-\x09\x0b-\x1f\x7f-\xff])/sprintf("\\%03d",ord($1))/egs;
     if ($first) {
       $first = 0;
     } else {
Comment 5 Kevin A. McGrail 2011-05-09 16:42:22 UTC
Too untested and too many questions to consider for 3.3.X.
Comment 6 Mark Martinec 2011-05-09 17:06:14 UTC
Some comments, before I forget:

Considering that logging has three different back-ends internally
(file, stderr and syslog) and that it has a documented open interface
for third-party backends (which is used at least by amavisd-new to
integrate SpamAssassin logging with its own), I'd prefer that each
backend logger takes care of its own limitations, which may not be
the same for each.

For example a file-based logging may be happy with logging UTF-8
text, while the syslog may not be. A syslog backend may have line
length limitations, while others may not.

A concern with the amavisd backend logger is that it does its own
sanitation of nonprintable characters, so if SpamAssassin does it
at the common Logger.pm routine, we'd end up with doubled backslashes
- not too bad, but ugly nevertheless.

So I'd move the sanitation line from Logger.pm into Logger/*.pm modules.
And perhaps the Logger/File.pm is a candidate for less strict sanitation,
like allowing 8-bit characters through.

Btw, the suggested s/// does not cope with character codes above 255,
which may happen to find its way into a string being sanitized.

Also the '\' is not being doubled, which means that the result is not
uniquely reversible (if it happens that a '\nnn' is literally in a string).

Perhaps something like the following would do:

  $str =~ s{([^\040-\176])}
           { $1 eq '\\' ? '\\\\' :
             sprintf(ord($1) > 255 ? '\\x{%04x}' : '\\%03o', ord($1)) }egs;

An alternative for sanitizing strings with characters above 255 is
to encode them first into utf-8, and only then \ooo -quote them.


Not sure what subset of characters should be allow to pass through
un-encoded in PerMsgStatus::get_content_preview().
Comment 7 Henrik Krohns 2011-05-09 17:55:56 UTC
Very nice insights Mark, didn't notice Logger/* .. I'll look into this later, if someone doesn't before me.
Comment 8 Kevin A. McGrail 2013-06-21 16:16:38 UTC
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target
Comment 9 Kevin A. McGrail 2015-04-07 14:15:30 UTC
This is a good idea but pushing for 3.4.2.  Perhaps right after 3.4.1 we can get some patches committed for testing by those of us using trunk for day-to-day?
Comment 10 Bill Cole 2018-09-01 00:31:26 UTC
Pushing to 3.4.3
Comment 11 Henrik Krohns 2019-08-12 09:23:31 UTC
Committed to trunk in backwards and 3rd party compatible way.

New escape => 1 option is used everywhere in SA code, for example:

Mail::SpamAssassin::Logger::add(method => 'stderr',
                                  timestamp_fmt => $log_timestamp_fmt,
                                  escape => 1);

If escape is not defined at all (like current Amavis), normal pre-4.0 method is used.

This is the used method, which is moved into all Logger/*.pm modules.

  if ($self->{escape}) {
    local $1;
    # Bug 7305:
    # Quote non-ascii characters as \x{XX} or \x{XXXX} (Unicode)
    # Also quote backslash, so the log can be unescaped properly
    $msg =~ s{([^\x20-\x5b\x5d-\x7e])}{ $1 eq '\\' ? '\\\\' :
      sprintf(ord($1) > 255 ? '\\x{%04X}' : '\\x{%02X}', ord($1)) }egs;
  } elsif (!exists $self->{escape}) {
    # Backwards compatible pre-4.0 escaping, if $escape not given.
    # replace control characters with "_", tabs and spaces get
    # replaced with a single space.
    $msg =~ tr/\x09\x20\x00-\x1f/  _/s;
  }

I hate octal presensation, so it's \x{FF} and \x{FFFF} to be uniform.

Should have done this long time ago, it's super helpful!

Aug 12 12:20:25.854 [11986] dbg: rules: ran rawbody rule __L_BODY_8BITS ======> got hit: "\x{EF}"

Aug 12 12:20:48.846 [12007] dbg: rules: ran rawbody rule __HAS_HREF ======> got hit: "\x{09}\x{09}<td><a href="

Aug 12 12:21:54.700 [12081] dbg: message: _decode_header subject: Fyll inn alle detaljene for \x{C3}\x{B8}nsket l\x{C3}\x{A5}n

Aug 12 12:21:54.902 [12081] dbg: rules: ran body rule __HIGHBITS ======> got hit: "\x{C3}\x{83}\x{C2}\x{B8}n"

Aug 12 12:22:56.663 [12272] dbg: uri: canonicalizing parsed uri: http://www\x{E3}\x{80}\x{82}example3\x{E3}\x{80}\x{82}com



Sending        trunk/UPGRADE
Sending        trunk/lib/Mail/SpamAssassin/Logger/File.pm
Sending        trunk/lib/Mail/SpamAssassin/Logger/Stderr.pm
Sending        trunk/lib/Mail/SpamAssassin/Logger/Syslog.pm
Sending        trunk/lib/Mail/SpamAssassin/Logger.pm
Sending        trunk/spamd/spamd.raw
Transmitting file data ......done
Committing transaction...
Committed revision 1864949.
Comment 12 Henrik Krohns 2019-08-12 11:48:55 UTC
Content preview: now escaped too

Sending        trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm
Transmitting file data .done
Committing transaction...
Committed revision 1864961.
Comment 13 Kevin A. McGrail 2019-08-12 12:06:32 UTC
Very interested to see this in production.
Comment 14 Henrik Krohns 2019-08-13 05:04:33 UTC
Revamped it after more testing. It was a bit too eager imo..

For logs, now escaped are only escape chars + DEL + UTF-8 sequences. I think the UTF-8 bit is important, so debugs can be viewed on any terminal.

I removed content preview escaping, the whole preview function needs more overall thought on how it handles different encodingins, since it's supposed to prints out stuff on mail body. Even normalize_charset breaks it I think.

Sending        trunk/UPGRADE
Sending        trunk/lib/Mail/SpamAssassin/Logger/File.pm
Sending        trunk/lib/Mail/SpamAssassin/Logger/Stderr.pm
Sending        trunk/lib/Mail/SpamAssassin/Logger/Syslog.pm
Sending        trunk/lib/Mail/SpamAssassin/Logger.pm
Sending        trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm
Transmitting file data ......done
Committing transaction...
Committed revision 1865007.
Comment 15 Henrik Krohns 2019-08-30 08:04:51 UTC
The previous UTF-8 escaping is more problematic than the problem it's trying to solve (showing extended latin1 chars for readibility). All extended ascii is now escaped.

Sending        lib/Mail/SpamAssassin/Logger.pm
Transmitting file data .done
Committing transaction...
Committed revision 1866130.
Comment 16 Sidney Markowitz 2024-03-04 19:38:17 UTC
*** Bug 8217 has been marked as a duplicate of this bug. ***