Bug 2106 - binary data ends up in body
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P2 major
Target Milestone: 2.70
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 2367 2644 2767 2829
Depends on: 1527
Blocks:
Reported: 2003-06-20 13:14 UTC by Daniel Quinlan
Modified: 2004-01-25 11:53 UTC (History)

Attachment Type Modified Status Actions Submitter/CLA Status
first example text/plain None Daniel Quinlan [HasCLA]
example 2 text/plain None Daniel Quinlan [HasCLA]

Description Daniel Quinlan 2003-06-20 13:14:10 UTC
I have at least 3 messages in my ham corpus where binary attachment data is
ending up in the body.  They appear to be valid MIME messages.  I will attach
(the two that I can post) in a moment.

I noticed this when working on bug 1551 since I started getting
HTML_COMMENT_8BITS and HTML_COMMENT_EMAIL matches on ham that didn't have
those properties.  The <!foo> regular expression is loose enough that it will
match on a lot of binary data.  However, the problem is not with the regular
expression.  I tried fixing it with various heuristics, but nothing really
worked that well.

In addition to that issue, this must be slowing us down and causing other
FPs too.
Comment 1 Daniel Quinlan 2003-06-20 13:15:03 UTC
Created attachment 1065
first example
Comment 2 Daniel Quinlan 2003-06-20 13:16:10 UTC
Created attachment 1066
example 2
Comment 3 Daniel Quinlan 2003-06-21 00:18:46 UTC
Justin figured this out.  It's due to nested MIME where more than one
boundary is present.  Binary attachments that are in any level deeper
than the first are not ignored properly.
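
A minimal sketch of the kind of recursive walk this diagnosis calls for, written in Python with the standard `email` package rather than SpamAssassin's own Perl parser. The names `text_parts` and `build_nested_sample` are hypothetical, the sample message is made up for illustration, and the charset handling is deliberately simplified. The point is only that recursing on every multipart container, at any depth, keeps a binary leaf out of the text that body rules see.

```python
from email import message_from_bytes
from email.message import Message
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import List


def text_parts(msg: Message) -> List[str]:
    """Collect decoded text/* leaves at any nesting depth; skip everything else."""
    if msg.is_multipart():
        collected: List[str] = []
        for part in msg.get_payload():
            # Recurse: an inner multipart carries its own boundary.
            collected.extend(text_parts(part))
        return collected
    if msg.get_content_maintype() != "text":
        return []  # binary leaf (image, PDF, ...): never reaches the body
    payload = msg.get_payload(decode=True) or b""
    charset = msg.get_content_charset() or "us-ascii"
    try:
        return [payload.decode(charset, errors="replace")]
    except LookupError:
        # Unknown charset label: fall back to a lossless byte mapping.
        return [payload.decode("latin-1")]


def build_nested_sample() -> bytes:
    """Hypothetical message: a binary attachment one level below the top."""
    inner = MIMEMultipart("related")
    inner.attach(MIMEText("hello from the inner part", "plain"))
    inner.attach(MIMEApplication(b"\x00\xff<!foo>\x80 <q \x01"))
    outer = MIMEMultipart("mixed")
    outer.attach(inner)
    outer.attach(MIMEText("trailing text", "plain"))
    return outer.as_bytes()


parsed = message_from_bytes(build_nested_sample())
body = "\n".join(text_parts(parsed))
```

Here `body` holds the two text parts only; the bytes of the nested `application/octet-stream` part, including the `<!foo>` sequence, are left out.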
Comment 4 Theo Van Dinter 2003-06-24 07:35:47 UTC
Perhaps we need to shift to using MIME::Parser?  It goes through this stuff easily, and we could
offload the parsing to the MIME-tools modules, which is the idea with these modules in the first
place.  ;)

For example, here is how it parses the two attached messages:

Content-type: multipart/mixed
Effective-type: multipart/mixed
Body-file: NONE
Subject: removed
Num-parts: 2
--
    Content-type: multipart/related
    Effective-type: multipart/related
    Body-file: NONE
    Num-parts: 2
    --
        Content-type: multipart/alternative
        Effective-type: multipart/alternative
        Body-file: NONE
        Num-parts: 2
        --
            Content-type: text/plain
            Effective-type: text/plain
            Body-file: NONE
            --
            Content-type: text/html
            Effective-type: text/html
            Body-file: NONE
            --
        Content-type: image/jpeg
        Effective-type: image/jpeg
        Body-file: NONE
        Recommended-filename: 1dee719.jpg
        --
    Content-type: text/plain
    Effective-type: text/plain
    Body-file: NONE
    --

and

Content-type: multipart/mixed
Effective-type: multipart/mixed
Body-file: NONE
Subject: [Fwd: Call for Papers Emb. Linux Conf - Dallas, TX     Event: June 11]
Num-parts: 2
--
    Content-type: text/plain
    Effective-type: text/plain
    Body-file: NONE
    --
    Content-type: message/rfc822
    Effective-type: message/rfc822
    Body-file: NONE
    Num-parts: 1
    --
        Content-type: multipart/mixed
        Effective-type: multipart/mixed
        Body-file: NONE
        Subject: Call for Papers Emb. Linux Conf - Dallas, TX     Event: June 11 
        Num-parts: 2
        --
            Content-type: multipart/alternative
            Effective-type: multipart/alternative
            Body-file: NONE
            Num-parts: 2
            --
                Content-type: text/plain
                Effective-type: text/plain
                Body-file: NONE
                --
                Content-type: text/html
                Effective-type: text/html
                Body-file: NONE
                --
            Content-type: application/pdf
            Effective-type: application/pdf
            Body-file: NONE
            Recommended-filename: Emb Linux Call for Pprs-Dallas 02.pdf
            --
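
For comparison, a rough Python analogue of a structure dump like the two above, using the standard `email` package. This is a sketch, not MIME::Parser's output format: `skeleton` is a hypothetical helper, it prints only the content type per part, and the sample message is invented to mirror the shape of the forwarded PDF case.

```python
from email.message import Message
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import List


def skeleton(msg: Message, depth: int = 0) -> List[str]:
    """Return one indented line per MIME part, depth first."""
    lines = ["    " * depth + msg.get_content_type()]
    if msg.is_multipart():
        for part in msg.get_payload():
            lines.extend(skeleton(part, depth + 1))
    return lines


# Hypothetical sample: multipart/alternative plus a PDF inside multipart/mixed.
alternative = MIMEMultipart("alternative")
alternative.attach(MIMEText("plain body", "plain"))
alternative.attach(MIMEText("<p>html body</p>", "html"))
forwarded = MIMEMultipart("mixed")
forwarded.attach(alternative)
forwarded.attach(MIMEApplication(b"%PDF-1.4 placeholder bytes", "pdf"))

tree = skeleton(forwarded)
print("\n".join(tree))
```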
Comment 5 Justin Mason 2003-06-24 09:56:08 UTC
Subject: Re: [SAdev]  binary data ends up in body 


bugzilla-daemon@bugzilla.spamassassin.org writes:
>Perhaps we need to shift to using MIME::Parser?  It goes through this stuff
>easily, and we could offload the parsing to the MIME-tools modules which is
>the idea with these modules in the first place.  ;)

Two things:

- it has a load of dependencies :(

- it can't cope with some of the stuff we need to cope with -- e.g.
  undeclared base64 data in the message body, which Outlook and
  SpamAssassin will parse, but MIME::Parser will not, if I recall
  correctly.

--j.

Comment 6 Theo Van Dinter 2003-06-24 10:47:14 UTC
Subject: Re: [SAdev]  binary data ends up in body

On Tue, Jun 24, 2003 at 09:57:49AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Two things:

Fair enough.  I remember us talking about it at one point a while ago,
then the idea went away but I wasn't sure why.

Comment 7 Theo Van Dinter 2003-06-24 13:32:50 UTC
This isn't a big enough issue to hold up the 2.60 release, so we'll take care of this for 2.70.
Comment 8 Keith Ivey 2003-10-07 14:35:27 UTC
I just ran into this bug with a PDF attachment that was in a multipart inside a
multipart (a forwarded message with a PDF attached).  The binary data happened
to contain something that matched /<q\s/i, so the PDF was treated as HTML and
the message ended up triggering BODY_8BITS, HTML_COMMENT_8BITS,
UNWANTED_LANGUAGE_BODY, and WEIRD_QUOTING.  Just assuming that anything matching
the HTML regexes must be HTML seems very risky.  There are lots of binary files
that match /<[abiqsu]\s/i, for example.
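
A back-of-the-envelope sketch of why a pattern this short fires on binary data, in Python. It assumes every byte is independent and uniformly random, which is only a rough stand-in for real attachments (compressed streams come close, structured headers do not). It also uses Python's bytes-mode `\s`, which covers six ASCII whitespace bytes and may differ slightly from Perl's. `HTML_TAG_GUESS` and `chance_of_false_match` are hypothetical names.

```python
import re

# Same shape as the /<[abiqsu]\s/i pattern quoted above, applied to raw bytes.
HTML_TAG_GUESS = re.compile(rb"<[abiqsu]\s", re.IGNORECASE)

# Under the uniform-random assumption, a match at a given position needs:
#   byte 1: '<'                                   -> 1 of 256 values
#   byte 2: one of a, b, i, q, s, u (either case) -> 12 of 256 values
#   byte 3: space, \t, \n, \r, \f or \v           -> 6 of 256 values
PER_POSITION = (1 / 256) * (12 / 256) * (6 / 256)


def chance_of_false_match(n_bytes: int) -> float:
    """Probability of at least one match somewhere in n_bytes random bytes."""
    positions = max(n_bytes - 2, 0)
    return 1.0 - (1.0 - PER_POSITION) ** positions


# Two hand-made byte strings: one that trips the pattern, one that does not.
hit = HTML_TAG_GUESS.search(b"\x00\x9c<Q\t\xff")
miss = HTML_TAG_GUESS.search(b"\x00<z \x01")
one_mib = chance_of_false_match(1 << 20)
```

Under that assumption the expected number of matches works out to about 4.5 per MiB, so an attachment of that size would almost always look like HTML to this test.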
Comment 9 Keith Ivey 2003-10-10 06:37:36 UTC
I just got two more messages that triggered this bug.  Both were just MS Word
documents attached to messages that were then forwarded.  It seems like a pretty
common situation.  For now I've zeroed out BODY_8BITS and HTML_COMMENT_8BITS.  I
don't want to disable UNWANTED_LANGUAGE_BODY, though.

I'm happy to help with fixing, but it seems like there's some doubt about where
the fix should be, and it may affect too many other things.  Maybe for a quick
fix the HTML test could be changed to require a little more than just a single
one-character tag, which can easily occur randomly in binary data?  Require
multiple tags?  Look at the beginning of the data to see if it seems to be an
HTML file?  Alternatively, look for signature of common binary file types (Word,
PDF, GIF, JPEG) and skip them?
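
A minimal sketch of the last suggestion, checking leading bytes against well-known file signatures, in Python. The signature table covers only the four types named above, `sniff_binary_type` is a hypothetical helper, and nothing here reflects how SpamAssassin actually resolved the bug.

```python
from typing import Optional

# Leading "magic" bytes for the file types mentioned above.
SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    # OLE2 compound file, the container used by legacy MS Word .doc files.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
)


def sniff_binary_type(data: bytes) -> Optional[str]:
    """Return a type name if data starts with a known signature, else None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def should_skip_html_tests(data: bytes) -> bool:
    """Skip HTML heuristics for parts that look like known binary formats."""
    return sniff_binary_type(data) is not None
```

A check like this only helps for formats in the table; it would sit in front of the HTML regexes as a guard, not replace correct handling of nested MIME parts.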
Comment 10 Keith Ivey 2003-10-11 17:53:11 UTC
This seems to be a duplicate of 2402.
Comment 11 Nicolas 2003-10-29 06:26:15 UTC
*** Bug 2644 has been marked as a duplicate of this bug. ***
Comment 12 Nels Lindquist 2003-12-12 07:49:20 UTC
*** Bug 2829 has been marked as a duplicate of this bug. ***
Comment 13 Theo Van Dinter 2004-01-18 17:16:45 UTC
*** Bug 2767 has been marked as a duplicate of this bug. ***
Comment 14 Theo Van Dinter 2004-01-24 20:55:53 UTC
fixed in 2.70
Comment 15 Theo Van Dinter 2004-01-25 20:53:36 UTC
*** Bug 2367 has been marked as a duplicate of this bug. ***