Bug 7253 - X-Spam-Report incorrectly mime-encodes multiline report in header, violating RFC 2047
Summary: X-Spam-Report incorrectly mime-encodes multiline report in header, violating ...
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: 3.4.1
Hardware: All All
: P2 normal
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-14 13:59 UTC by Mark Martinec
Modified: 2016-06-23 16:39 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
proposed patch patch None Mark Martinec [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Martinec 2015-10-14 13:59:07 UTC
With
  report_safe 0

and a rule with non-ASCII description, e.g.:

  header L_TEST_REPORT_ENCODING From =~ /./
  score  L_TEST_REPORT_ENCODING 0.01
  describe L_TEST_REPORT_ENCODING  En-tête contient caractères

the resulting X-Spam-Report multiline header field as inserted
into spam messages is incorrectly encoded into encoded-words:
the whole multiline header field is encoded into a single
encoded-words, whitespace is not encoded, the result contains
whitespace within encoded-word, and the encoded-word spans across
lines:

X-Spam-Report: =?UTF-8?Q?
  *  100 USER_IN_BLACKLIST From: address is in the user's black-list
  *  0.0 L_TEST_REPORT_ENCODING En-t=c3=aate contient caract=c3=a8res
  * -0.3 BAYES_05 BODY: Bayes spam probability is 1 to 5%
  *      [score: 0.0137]
  *  0.3 TXREP TXREP: Score normalizing based on sender's reputation?=

This is wrong on multiple accounts. The RFC 2047 is explicit:


   An 'encoded-word' may not be more than 75 characters long, including
   'charset', 'encoding', 'encoded-text', and delimiters.

[...]

   IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
   by an RFC 822 parser.  As a consequence, unencoded white space
   characters (such as SPACE and HTAB) are FORBIDDEN within an
   'encoded-word'.  For example, the character sequence
      =?iso-8859-1?q?this is some text?=
   would be parsed as four 'atom's, rather than as a single 'atom' (by
   an RFC 822 parser) or 'encoded-word' (by a parser which understands
   'encoded-words').  The correct way to encode the string "this is some
   text" is to encode the SPACE characters as well, e.g.
      =?iso-8859-1?q?this=20is=20some=20text?=

[...]

   Only a subset of the printable ASCII characters may be used in
   'encoded-text'.  Space and tab characters are not allowed, so that
   the beginning and end of an 'encoded-word' are obvious.


The culprit is  MS::PerMsgStatus::qp_encode_header().
It should encode (when necessary) each line individually,
and should encode whitespace within encoded-word(s).
Comment 1 Mark Martinec 2015-10-14 14:38:45 UTC
> The culprit is  MS::PerMsgStatus::qp_encode_header().
> It should encode (when necessary) each line individually,
> and should encode whitespace within encoded-word(s).

Tried to use Encode (latest version: 2.72) in qp_encode_header()
to do the encoding:
  $text = Encode::encode('MIME-Header', $text);

Although the Encode::MIME::Header is much more careful than
our code, it is still wrong, resulting in an empty line
inserted in the header!!!

X-Spam-Report:
  *  100 USER_IN_BLACKLIST From: address is in the user's black-list

 =?UTF-8?B?CSogIDAuMCBMX1RFU1RfUkVQT1JUX0VOQ09ESU5HIEVuLXTDg8KqdGUgY29udGllbg==?=
 =?UTF-8?B?dCBjYXJhY3TDg8KocmVz?=
  * -0.3 BAYES_05 BODY: Bayes spam probability is 1 to 5%
  *      [score: 0.0137]
  *  0.2 TXREP TXREP: Score normalizing based on sender's reputation


Bad, bad, bad, (and unsightly).
Comment 2 Mark Martinec 2015-10-14 15:12:07 UTC
> Tried to use Encode (latest version: 2.72)
[...]
> Although the Encode::MIME::Header is much more careful than
> our code, it is still wrong, resulting in an empty line
> inserted in the header!!!

Tried Encode 2.78 (the latest on CPAN), it is still broken
on multiple accounts:


- still inserts an empty line in a header on encoding:

$ perl -e 'use Encode; use utf8; printf("Subject: %s\n", encode("MIME-B",
  "report\n\tL_TEST_REPORT_ENCODING En-tête contient caractères\n\tend-report"))'

<quote>
Subject: report

 =?UTF-8?B?CUxfVEVTVF9SRVBPUlRfRU5DT0RJTkcgRW4tdMOqdGUgY29udGllbnQgY2FyYWM=?=
 =?UTF-8?B?dMOocmVz?=
        end-report
</quote>


- produces a continuation line with no leading whitespace on encoding:

$ perl -e 'use Encode; use utf8; printf("Subject: %s\n", encode("MIME-Q",
  "report\n\tL_TEST_REPORT_ENCODING En-tête contient caractères\n\tend-report"))'

<quote>
Subject: report
=?UTF-8?Q?=09L=5FTEST=5FREPORT=5FENCODI?=
 =?UTF-8?Q?NG=20En=2Dt=C3=AAte=20contient=20?= =?UTF-8?Q?caract=C3=A8res?=
        end-report
</quote>


- loses whitespace on folding when decodin
    https://rt.cpan.org/Public/Bug/Display.html?id=40027

$ perl -le 'use Encode; print decode("MIME-Header", "a: b\r\n c")'
a: bc
Comment 3 Mark Martinec 2015-10-14 15:33:17 UTC
For reference:

> - Encode::MIME::Header inserts an empty line in a header on encoding
reported: https://rt.cpan.org/Ticket/Display.html?id=107775

> - produces a continuation line with no leading whitespace on encoding
reported: https://rt.cpan.org/Ticket/Display.html?id=107776

> - loses whitespace on folding when decoding
> https://rt.cpan.org/Public/Bug/Display.html?id=40027
Comment 4 Mark Martinec 2015-10-15 17:58:23 UTC
It's funny (if not disheartening) that we have two tests which are there
specifically to test encoding of an X-Spam-Report in a header, and they
both expect an incorrectly encoded result. When the bug is fixed these
tests fail :)
  t/reportheader_8bit.t
  t/prefs_include.t
Comment 5 Mark Martinec 2015-10-15 18:49:38 UTC
Created attachment 5332 [details]
proposed patch

- encoded-words must not cross line boundaries
- encoded-words must not contain non-encoded whitespace
- choose between Q and B encoding based on shorter encoded length
  (and rename sub qp_encode_header to mime_encode_header)
- change a call to MIME::Base64::encode_base64 in oder to
  prevent it from inserting a newline in the middle of a
  longish encoded-word, add a test for this
- adjust t/prefs_include.t and t/reportheader_8bit.t to
  cope with this fix
- enhanced the test t/header_utf8.t

trunk:
  Sending lib/Mail/SpamAssassin/PerMsgStatus.pm
  Sending lib/Mail/SpamAssassin/Util.pm
  Sending t/header_utf8.t
  Sending t/prefs_include.t
  Sending t/reportheader_8bit.t
Committed revision 1708863.


Btw, the mime_encode_header() is still not perfect, as it is incapable
of breaking long encoded-words if line length exceeds 75 characters.
Still, it is a major improvement over what we had and over what
Encode::MIME::Header would offer.
Comment 6 Mark Martinec 2016-06-23 16:39:55 UTC
Should do for now, it's a significant improvement anyway.
Closing.