Bug 7307 - Base64 decoded headers have invisible characters sometimes injected at line breaks
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: PC Linux
Importance: P2 major
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
Reported: 2016-03-25 21:49 UTC by jidanni
Modified: 2016-06-23 16:32 UTC (History)



Attachment: Proposed patch to work around violations of the RFC 2047 section 5 requirement in two stages
Type: patch
Submitter/CLA Status: Mark Martinec [HasCLA]

Description jidanni 2016-03-25 21:49:45 UTC
Can you believe

Subject: =?UTF-8?B?44CQ6YeN6KaB6KiK5oGv44CR5Y+w6Zu7MTA15bm0?=
 =?UTF-8?B?M+aciOmbu+iyu++8jOWnlOiol+mHkeiejeapn+ani+aJow==?=
 =?UTF-8?B?57mz5oiQ5Yqf6Zu75a2Q57mz6LK75oaR6K2JKOmbu+iZnw==?=
 =?UTF-8?B?MDc0ODc2MTY3MzAp?=

decodes to

Mar 26 05:25:46.233 [9866] dbg: message: _decode_header subject: 【重要訊息】台電105年3月電費,委託金融機構扣_繳成功電子繳費憑證(電號_07487616730)

The "_" spuriously injected at each line break causes

header J_TAIPOWER_RCT Subject=~/委託金融機構扣繳成功電子繳費憑證/
not to match. Instead, ".." is needed to match the injected characters too:
header J_TAIPOWER_RCT Subject=~/委託金融機構扣..繳成功電子繳費憑證/
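
For reference, the expected decoding can be checked outside SpamAssassin. The following is a minimal Python sketch, not the Node.pm code: the names SUBJECT, ENCODED_WORD and decode_words are made up for illustration, and the regular expression only handles base64 ("B") encoded-words, which is all this subject uses. Each encoded-word is base64-decoded and UTF-8-decoded on its own, and the whitespace between adjacent encoded-words is dropped as RFC 2047 requires.

```python
import base64
import re

# Subject header copied from the report, including the folding whitespace.
SUBJECT = (
    "=?UTF-8?B?44CQ6YeN6KaB6KiK5oGv44CR5Y+w6Zu7MTA15bm0?=\n"
    " =?UTF-8?B?M+aciOmbu+iyu++8jOWnlOiol+mHkeiejeapn+ani+aJow==?=\n"
    " =?UTF-8?B?57mz5oiQ5Yqf6Zu75a2Q57mz6LK75oaR6K2JKOmbu+iZnw==?=\n"
    " =?UTF-8?B?MDc0ODc2MTY3MzAp?="
)

# Hypothetical helper pattern: matches base64 ("B") encoded-words only.
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Bb]\?([A-Za-z0-9+/=]*)\?=")


def decode_words(header):
    """Decode every encoded-word independently and join the results."""
    parts = []
    for charset, payload in ENCODED_WORD.findall(header):
        raw = base64.b64decode(payload)      # transfer decoding
        parts.append(raw.decode(charset))    # strict charset decoding
    return "".join(parts)


decoded = decode_words(SUBJECT)
print(decoded)

# The pattern from the J_TAIPOWER_RCT rule, applied to the clean decoding.
print(re.search("委託金融機構扣繳成功電子繳費憑證", decoded) is not None)
```

Strict decoding succeeds for every word here, so in this particular subject each encoded-word already holds whole characters and nothing should be inserted at the boundaries.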
Comment 1 Mark Martinec 2016-06-20 09:58:14 UTC
Reverted part of the fix for Bug 7249, which attempted to deal
with invalid splicing of multibyte characters in encoded-words but
does not work well in all cases and causes problems like the one
described here. A better (two-stage) approach is needed, but the change
is nontrivial and will be done later...

+ [...]
+# Bug 7307: the above code is commented-out as it mistreats adjecent
+# encoded chunks which end on ==?= (are not a multiple of 4 characters).
+# A better solution is needed: base64-decoding of all chunks first, then
+# multi-byte character set decoding over adjecent same-charset chunks.
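
The padding problem named in that comment can be illustrated in isolation. This is a small Python sketch under stated assumptions, not the reverted Perl code: the variable names are invented, and the two payloads are the second and third encoded-words from the report. It shows that splicing the base64 text of two chunks is not the same as base64-encoding the spliced bytes when the first chunk ends in "==".

```python
import base64

# Payloads of the second and third encoded-words from the report.
word2 = "M+aciOmbu+iyu++8jOWnlOiol+mHkeiejeapn+ani+aJow=="
word3 = "57mz5oiQ5Yqf6Zu75a2Q57mz6LK75oaR6K2JKOmbu+iZnw=="

bytes2 = base64.b64decode(word2)
bytes3 = base64.b64decode(word3)

# 34 bytes is not a multiple of 3, so the encoder had to emit "==" padding.
print(len(bytes2), len(bytes2) % 3)

# Splicing the base64 *text* leaves padding in the middle of the stream ...
spliced_text = word2 + word3
# ... which differs from the base64 encoding of the spliced *bytes*.
proper_text = base64.b64encode(bytes2 + bytes3).decode("ascii")
print(spliced_text == proper_text)  # False

# Decoding each chunk first and joining the raw bytes is unambiguous.
joined = (bytes2 + bytes3).decode("utf-8")
print("扣繳" in joined)  # True
```

How a decoder treats padding in the middle of a stream is implementation-specific, which is why the sketch avoids decoding spliced_text at all.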


Revert code attempting to deal with invalid splicing
of multibyte characters in encoded-words

trunk:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1749286.

Note: this only affects trunk, the problematic code section to
deal with invalid splicing of multibyte characters in encoded-words
is not in the 3.4 branch. Safe to close the ticket.
Comment 2 Mark Martinec 2016-06-20 11:33:59 UTC
This broke the t/header_utf8.t test; investigating...
Comment 3 Mark Martinec 2016-06-20 12:18:39 UTC
> This broke the t/header_utf8.t test; investigating...

The test also checks for sanitization of spliced multibyte characters,
which is now disabled. Need to find a better way, or disable these two
subtests...
Comment 4 Mark Martinec 2016-06-20 13:59:38 UTC
> The test also checks for sanitization of spliced multibyte characters,
> which is now disabled. Need to find a better way, or disable these two
> subtests...

- Temporarily disabled test patterns LT_SUBJ and LT_CT in t/header_utf8.t
to make Jenkins happy.

- Added test file t/data/nice/unicode2 with a test case from this bug
report, and added a subtest to t/header_utf8.t, checking for breakage
described by this bug report.

trunk:

svn ci -m 'Bug 7307: ... see comment #4'
  Sending        MANIFEST
  Adding  (bin)  t/data/nice/unicode2
  Sending        t/header_utf8.t
Committed revision 1749338.
Comment 5 Mark Martinec 2016-06-23 00:14:14 UTC
Created attachment 5395
Proposed patch to work around violations of the RFC 2047 section 5 requirement in two stages

Bug 7249: work around violations of the RFC 2047 section 5 requirement:
  Each 'encoded-word' MUST represent an integral number of characters.
  A multi-octet character may not be split across adjacent 'encoded-word's
Unfortunately such violations are not uncommon.

Bug 7307: to deal with the above, base64/QP decoding must be decoupled
from decoding a specified multi-byte character set into UTF-8.
The previous, simpler code could not handle base64 fill bits correctly
(it merged adjacent encoded sections before base64/QP decoding them).
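
The two-stage idea can be sketched as follows. This is a hedged Python illustration, not the attached patch: make_word, decode_per_word and decode_two_stage are hypothetical names, only base64 ("B") encoded-words are handled (no QP), and the header is assumed to consist solely of encoded-words separated by folding whitespace. Stage one base64-decodes each word to raw bytes; stage two decodes the character set over runs of adjacent words that share a charset.

```python
import base64
import re

# Hypothetical helper pattern: matches base64 ("B") encoded-words only.
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Bb]\?([A-Za-z0-9+/=]*)\?=")


def make_word(raw, charset="UTF-8"):
    """Build one encoded-word from raw bytes (for the demonstration only)."""
    return "=?%s?B?%s?=" % (charset, base64.b64encode(raw).decode("ascii"))


def decode_per_word(header):
    """Single stage: transfer-decode and charset-decode each word alone."""
    return "".join(
        base64.b64decode(payload).decode(charset, errors="replace")
        for charset, payload in ENCODED_WORD.findall(header)
    )


def decode_two_stage(header):
    # Stage 1: base64-decode every word, collecting bytes per charset run.
    runs = []  # list of [charset, bytearray]
    for charset, payload in ENCODED_WORD.findall(header):
        raw = base64.b64decode(payload)
        if runs and runs[-1][0] == charset.lower():
            runs[-1][1] += raw
        else:
            runs.append([charset.lower(), bytearray(raw)])
    # Stage 2: charset-decode each run as a whole.
    return "".join(bytes(buf).decode(cs, errors="replace") for cs, buf in runs)


# A header that violates RFC 2047 section 5: the second character's three
# UTF-8 bytes are split across two encoded-words (4 bytes + 2 bytes).
data = "電費".encode("utf-8")
header = make_word(data[:4]) + "\n " + make_word(data[4:])

print(decode_per_word(header))   # contains U+FFFD replacement characters
print(decode_two_stage(header))  # 電費
```

Because each word is base64-decoded separately, padding never ends up in the middle of a stream, and because the charset decoding sees the joined bytes, a character split across words is still recovered.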
Comment 6 Mark Martinec 2016-06-23 00:20:22 UTC
- Improved decoding of MIME encoded words violating RFC 2047 section 5.
- Restored all test patterns in the t/header_utf8.t test; it should now
  check the Bug 7249 and Bug 7307 test cases.

trunk:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending t/data/nice/unicode2
  Sending t/header_utf8.t
Committed revision 1749798.


(this only affects trunk, the problematic code section to deal with
invalid splicing of multibyte characters in encoded-words is not
in the 3.4 branch)
Comment 7 Mark Martinec 2016-06-23 16:32:55 UTC
The bug reported in this PR has been fixed.

If some other problem regarding decoding of multibyte characters
in encoded-words is found, please re-open Bug 7249 or create a new one.

Closing.