SA Bugzilla – Bug 2106
binary data ends up in body
Last modified: 2004-01-25 11:53:36 UTC
I have at least 3 messages in my ham corpus where binary attachment data is ending up in the body. They appear to be valid MIME messages. I will attach (the two that I can post) in a moment. I noticed this when working on bug 1551 since I started getting HTML_COMMENT_8BITS and HTML_COMMENT_EMAIL matches on ham that didn't have those properties. The <!foo> regular expression is loose enough that it will match on a lot of binary data. However, the problem is not with the regular expression. I tried fixing it with various heuristics, but nothing really worked that well. In addition to that issue, this must be slowing us down and causing other FPs too.
Created attachment 1065 [details] first example
Created attachment 1066 [details] example 2
Justin figured this out. It's due to nested MIME where more than one boundary is present. Binary attachments that are in any level deeper than the first are not ignored properly.
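For illustration only, here is a minimal sketch of the behaviour described above: walk every level of a nested MIME tree and keep only the text/* leaves, so that binary attachments are skipped regardless of depth. It uses Python's standard `email` package rather than SpamAssassin's own parser, and the helper name `text_parts` and the sample message are assumptions made up for this example.

```python
from email import message_from_string
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def text_parts(msg):
    """Return the decoded bodies of all text/* leaf parts, at any depth."""
    bodies = []
    for part in msg.walk():  # walk() descends into nested multiparts
        if part.is_multipart():
            continue  # containers carry no body of their own
        if part.get_content_maintype() != "text":
            continue  # skip binary attachments at every level
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "us-ascii"
        bodies.append(payload.decode(charset, errors="replace"))
    return bodies


# Hypothetical sample: a multipart nested inside a multipart, with the
# binary attachment sitting at the second level.
inner = MIMEMultipart("mixed")
inner.attach(MIMEText("inner text", "plain"))
inner.attach(MIMEApplication(b"%PDF-1.4 \x00<q \xff", "pdf"))

outer = MIMEMultipart("mixed")
outer.attach(MIMEText("outer text", "plain"))
outer.attach(inner)

parsed = message_from_string(outer.as_string())
bodies = text_parts(parsed)
print(bodies)
```

The point of the sketch is only that the attachment check happens per leaf part and not per top-level boundary; it says nothing about how the actual fix was implemented.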
Perhaps we need to shift to using MIME::Parser? It goes through this stuff easily, and we could offload the parsing to the MIME-tools modules, which is the idea with these modules in the first place. ;)

i.e.:

Content-type: multipart/mixed
Effective-type: multipart/mixed
Body-file: NONE
Subject: removed
Num-parts: 2
--
    Content-type: multipart/related
    Effective-type: multipart/related
    Body-file: NONE
    Num-parts: 2
    --
        Content-type: multipart/alternative
        Effective-type: multipart/alternative
        Body-file: NONE
        Num-parts: 2
        --
            Content-type: text/plain
            Effective-type: text/plain
            Body-file: NONE
            --
            Content-type: text/html
            Effective-type: text/html
            Body-file: NONE
            --
        Content-type: image/jpeg
        Effective-type: image/jpeg
        Body-file: NONE
        Recommended-filename: 1dee719.jpg
        --
    Content-type: text/plain
    Effective-type: text/plain
    Body-file: NONE
    --

and

Content-type: multipart/mixed
Effective-type: multipart/mixed
Body-file: NONE
Subject: [Fwd: Call for Papers Emb. Linux Conf - Dallas, TX Event: June 11]
Num-parts: 2
--
    Content-type: text/plain
    Effective-type: text/plain
    Body-file: NONE
    --
    Content-type: message/rfc822
    Effective-type: message/rfc822
    Body-file: NONE
    Num-parts: 1
    --
        Content-type: multipart/mixed
        Effective-type: multipart/mixed
        Body-file: NONE
        Subject: Call for Papers Emb. Linux Conf - Dallas, TX Event: June 11
        Num-parts: 2
        --
            Content-type: multipart/alternative
            Effective-type: multipart/alternative
            Body-file: NONE
            Num-parts: 2
            --
                Content-type: text/plain
                Effective-type: text/plain
                Body-file: NONE
                --
                Content-type: text/html
                Effective-type: text/html
                Body-file: NONE
                --
            Content-type: application/pdf
            Effective-type: application/pdf
            Body-file: NONE
            Recommended-filename: Emb Linux Call for Pprs-Dallas 02.pdf
            --
Subject: Re: [SAdev] binary data ends up in body

bugzilla-daemon@bugzilla.spamassassin.org writes:
> Perhaps we need to shift to using MIME::Parser? It goes through this stuff easily, and we could
> offload the parsing to the MIME-tools modules which is the idea with these modules in the first
> place. ;)

Two things:

- it has a load of dependencies :(
- it can't cope with some of the stuff we need to cope with -- e.g. undeclared base64 data in the message body, which Outlook and SpamAssassin will parse, but MIME::Parser will not, if I recall correctly.

--j.
Subject: Re: [SAdev] binary data ends up in body

On Tue, Jun 24, 2003 at 09:57:49AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Two things:

Fair enough. I remember us talking about it at one point a while ago, then the idea went away, but I wasn't sure why.
This isn't a big enough issue to hold up the 2.60 release, so we'll take care of it for 2.70.
I just ran into this bug with a PDF attachment that was in a multipart inside a multipart (a forwarded message with a PDF attached). The binary data happened to contain something that matched /<q\s/i, so the PDF was treated as HTML and the message ended up triggering BODY_8BITS, HTML_COMMENT_8BITS, UNWANTED_LANGUAGE_BODY, and WEIRD_QUOTING. Just assuming that anything matching the HTML regexes must be HTML seems very risky. There are lots of binary files that match /<[abiqsu]\s/i, for example.
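To make the risk concrete, here is a small sketch showing how the loose pattern quoted above fires on arbitrary binary data. The regular expression is the one from this comment, translated to a Python bytes pattern; the function name `looks_like_html_loose` and the sample bytes are invented for the example and are not taken from a real attachment.

```python
import re

# The loose tag test quoted above, /<[abiqsu]\s/i, as a bytes pattern.
LOOSE_TAG = re.compile(rb"<[abiqsu]\s", re.IGNORECASE)


def looks_like_html_loose(data: bytes) -> bool:
    """True if a single short tag-like sequence appears anywhere in data."""
    return LOOSE_TAG.search(data) is not None


# Made-up binary blob: a PDF-style header followed by high-bit bytes that
# happen to contain the three-byte sequence "<q ".
sample = b"%PDF-1.4\n\x8f\xe2<q \x00\x13stream\xff\xfe"

print(looks_like_html_loose(sample))
print(looks_like_html_loose(b"\x00\x01\x02 plain binary \xff"))
```

Only three specific bytes in a row are needed for a match, so a large attachment has many chances to trip the test by accident.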
I just got two more messages that triggered this bug. Both were just MS Word documents attached to messages that were then forwarded. It seems like a pretty common situation. For now I've zeroed out BODY_8BITS and HTML_COMMENT_8BITS. I don't want to disable UNWANTED_LANGUAGE_BODY, though. I'm happy to help with fixing, but it seems like there's some doubt about where the fix should be, and it may affect too many other things. Maybe for a quick fix the HTML test could be changed to require a little more than just a single one-character tag, which can easily occur randomly in binary data? Require multiple tags? Look at the beginning of the data to see if it seems to be an HTML file? Alternatively, look for signatures of common binary file types (Word, PDF, GIF, JPEG) and skip them?
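A rough sketch of two of the quick-fix ideas above, for illustration only: check the leading bytes against well-known file signatures, and require several tags before treating data as HTML. The function names, the `min_tags` threshold of 3, and the tag pattern are all assumptions chosen for this example; none of them come from SpamAssassin.

```python
import re

# Leading "magic" bytes for the file types named above. The OLE2 entry
# covers legacy MS Office files such as Word .doc documents.
BINARY_SIGNATURES = {
    "pdf": b"%PDF-",
    "gif": b"GIF8",
    "jpeg": b"\xff\xd8\xff",
    "ole2": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
}

# Opening tag: "<", a tag name, then whitespace or ">".
TAG = re.compile(rb"<([a-z][a-z0-9]*)[\s>]", re.IGNORECASE)


def sniff_binary(data: bytes):
    """Return the name of a known binary type, or None if nothing matches."""
    for name, magic in BINARY_SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def plausibly_html(data: bytes, min_tags: int = 3) -> bool:
    """Treat data as HTML only if it is not a known binary type and
    contains at least min_tags opening tags (threshold is arbitrary)."""
    if sniff_binary(data) is not None:
        return False
    return len(TAG.findall(data)) >= min_tags


print(sniff_binary(b"%PDF-1.4\n\x8f<q \x00"))
print(plausibly_html(b"<html><body><p>hi</p></body></html>"))
print(plausibly_html(b"\x00\x01<q \x02\x03"))
```

Either check alone would still be a heuristic; neither replaces skipping binary parts correctly when the MIME structure is parsed.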
This seems to be a duplicate of 2402.
*** Bug 2644 has been marked as a duplicate of this bug. ***
*** Bug 2829 has been marked as a duplicate of this bug. ***
*** Bug 2767 has been marked as a duplicate of this bug. ***
Fixed in 2.70.
*** Bug 2367 has been marked as a duplicate of this bug. ***