SA Bugzilla – Bug 7307
Base64 decoded headers have invisible characters sometimes injected at line breaks
Last modified: 2016-06-23 16:32:55 UTC
Can you believe

Subject: =?UTF-8?B?44CQ6YeN6KaB6KiK5oGv44CR5Y+w6Zu7MTA15bm0?= =?UTF-8?B?M+aciOmbu+iyu++8jOWnlOiol+mHkeiejeapn+ani+aJow==?= =?UTF-8?B?57mz5oiQ5Yqf6Zu75a2Q57mz6LK75oaR6K2JKOmbu+iZnw==?= =?UTF-8?B?MDc0ODc2MTY3MzAp?=

decodes to

Mar 26 05:25:46.233 [9866] dbg: message: _decode_header subject: 【重要訊息】台電105年3月電費,委託金融機構扣_繳成功電子繳費憑證(電號_07487616730)

That spuriously injected "_" causes

header J_TAIPOWER_RCT Subject=~/委託金融機構扣繳成功電子繳費憑證/

not to match; the rule instead needs ".." to match the injected characters too:

header J_TAIPOWER_RCT Subject=~/委託金融機構扣..繳成功電子繳費憑證/
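A quick way to confirm that the message itself is well-formed is to decode each encoded-word of the reported Subject on its own. The sketch below is Python rather than SpamAssassin's Perl, and the names `SUBJECT`, `PAYLOAD_RE` and `check_chunks` are made up for illustration. It shows that every chunk is a complete UTF-8 sequence by itself, so the extra characters come from the decoder and not from the sender:

```python
import base64
import re

# The Subject from this report, one encoded-word per string literal.
SUBJECT = (
    "=?UTF-8?B?44CQ6YeN6KaB6KiK5oGv44CR5Y+w6Zu7MTA15bm0?= "
    "=?UTF-8?B?M+aciOmbu+iyu++8jOWnlOiol+mHkeiejeapn+ani+aJow==?= "
    "=?UTF-8?B?57mz5oiQ5Yqf6Zu75a2Q57mz6LK75oaR6K2JKOmbu+iZnw==?= "
    "=?UTF-8?B?MDc0ODc2MTY3MzAp?="
)

# Hypothetical helper pattern: captures the base64 payload of each
# UTF-8, B-encoded word (enough for this one header, not a full parser).
PAYLOAD_RE = re.compile(r"=\?UTF-8\?B\?([^?]+)\?=")


def check_chunks(subject):
    """Return (payload_chars, decoded_bytes, text) for each encoded-word."""
    rows = []
    for payload in PAYLOAD_RE.findall(subject):
        raw = base64.b64decode(payload)
        # Strict decoding raises UnicodeDecodeError if a chunk were to
        # end or start in the middle of a multibyte character.
        text = raw.decode("utf-8")
        rows.append((len(payload), len(raw), text))
    return rows


for row in check_chunks(SUBJECT):
    print(row)
```

The second and third chunks are the ones whose payload ends in "==", and the spurious character in the debug output appears exactly after each of them.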
Reverted part of the fix for Bug 7249, which attempted to deal with invalid splicing of multibyte characters in encoded-words but does not work well in all cases, causing a problem like the one described here. A better (two-stage) approach is needed, but the change is nontrivial, to be done later...

+ [...]
+# Bug 7307: the above code is commented-out as it mistreats adjacent
+# encoded chunks which end on ==?= (are not a multiple of 4 characters).
+# A better solution is needed: base64-decoding of all chunks first, then
+# multi-byte character set decoding over adjacent same-charset chunks.

Revert code attempting to deal with invalid splicing of multibyte characters in encoded-words

trunk:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1749286.

Note: this only affects trunk; the problematic code section to deal with invalid splicing of multibyte characters in encoded-words is not in the 3.4 branch. Safe to close the ticket.
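For illustration, the two-stage idea from the code comment above (transfer-decode every chunk first, then decode the character set over adjacent same-charset chunks) could look roughly like this. It is a Python sketch, not the Perl code in Node.pm; `ENCODED_WORD`, `transfer_decode` and `decode_two_stage` are hypothetical names, and the sketch assumes a header value made up only of whitespace-separated encoded-words with charsets known to Python:

```python
import base64
import quopri
import re
from itertools import groupby

# Simplified encoded-word pattern: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")


def transfer_decode(encoding, payload):
    """Stage 1 helper: undo base64 or Q encoding, yielding raw bytes."""
    if encoding.upper() == "B":
        return base64.b64decode(payload)
    # header=True makes "_" decode to a space, as Q encoding requires.
    return quopri.decodestring(payload.encode("ascii"), header=True)


def decode_two_stage(value):
    """Decode a header value consisting only of encoded-words."""
    # Stage 1: each chunk is transfer-decoded on its own, so base64
    # padding in one chunk can never affect its neighbour.
    chunks = [
        (m.group(1).lower(), transfer_decode(m.group(2), m.group(3)))
        for m in ENCODED_WORD.finditer(value)
    ]
    # Stage 2: bytes of adjacent chunks with the same charset are joined
    # before charset decoding, so a character split across chunks survives.
    parts = []
    for charset, group in groupby(chunks, key=lambda chunk: chunk[0]):
        raw = b"".join(data for _, data in group)
        parts.append(raw.decode(charset, errors="replace"))
    return "".join(parts)
```

Applied to the Subject from this report, `decode_two_stage` returns text in which 扣 is immediately followed by 繳, so the original J_TAIPOWER_RCT pattern would match without the ".." workaround.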
Broken the t/header_utf8.t test, investigating...
> Broken the t/header_utf8.t test, investigating...

The test is also checking for sanitation of spliced multibyte characters, which is now disabled. Need to find a better way, or disable these two subtests...
> The test is also checking for sanitation of spliced multibyte characters,
> which is now disabled. Need to find a better way, or disable these two
> subtests...

- Temporarily disabled test patterns LT_SUBJ and LT_CT in t/header_utf8.t to make jenkins happy.
- Added test file t/data/nice/unicode2 with a test case from this bug report, and added a subtest to t/header_utf8.t, checking for breakage described by this bug report.

trunk:
svn ci -m 'Bug 7307: ... see comment #4'
  Sending MANIFEST
  Adding (bin) t/data/nice/unicode2
  Sending t/header_utf8.t
Committed revision 1749338.
Created attachment 5395
Proposed patch to work around violations of the RFC 2047 section 5 requirement in two stages

Bug 7249: work around violations of the RFC 2047 section 5 requirement:

  Each 'encoded-word' MUST represent an integral number of characters.
  A multi-octet character may not be split across adjacent 'encoded-word's

Unfortunately such violations are not uncommon.

Bug 7307: to deal with the above, base64/QP decoding must be decoupled from decoding a specified multi-byte character set into UTF-8. The previous, simpler code merged adjacent encoded sections before base64/QP decoding them and so could not handle base64 fill bits correctly.
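To make the RFC 2047 section 5 violation concrete, the following Python sketch (an illustration only, not part of the attached patch; all variable names are invented here) splits a two-character UTF-8 string in the middle of its second character, base64-encodes the halves separately, and compares charset-decoding each half on its own with charset-decoding the joined bytes:

```python
import base64

text = "電費"
raw = text.encode("utf-8")        # 6 bytes, 3 per character
first, second = raw[:4], raw[4:]  # cut inside the second character

# Each half gets its own base64 padding, as a violating sender would emit.
words = [base64.b64encode(part).decode("ascii") for part in (first, second)]

# Charset-decoding each word separately: the split character is lost and
# replacement characters (U+FFFD) show up at the chunk boundary.
per_word = "".join(
    base64.b64decode(word).decode("utf-8", errors="replace") for word in words
)

# Base64-decoding each word separately, then charset-decoding the joined
# bytes: the split character is reassembled.
joined = b"".join(base64.b64decode(word) for word in words)
two_stage = joined.decode("utf-8")

print(words)
print(repr(per_word))
print(repr(two_stage))
```

Because both words end in padding, their base64 text cannot simply be concatenated and decoded as one stream; decoding each word to bytes first avoids that problem while still allowing the bytes to be joined afterwards.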
- Improved decoding of MIME encoded words violating RFC 2047 section 5.
- Restored all test patterns of the t/header_utf8.t test; it should now check for Bug 7249 and Bug 7307 test cases.

trunk:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending t/data/nice/unicode2
  Sending t/header_utf8.t
Committed revision 1749798.

(this only affects trunk; the problematic code section to deal with invalid splicing of multibyte characters in encoded-words is not in the 3.4 branch)
The bug reported in this PR has been fixed. If some other problem regarding decoding of multibyte characters in encoded-words is found, please re-open Bug 7249 or create a new one. Closing.