Bug 4606 - Misparsing of message causes binary attachment to be treated as text
Summary: Misparsing of message causes binary attachment to be treated as text
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: 3.1.0
Hardware: All All
: P5 normal
Target Milestone: 3.1.1
Assignee: John Gardiner Myers
URL:
Whiteboard:
Keywords:
Depends on: 4609
Blocks:
  Show dependency tree
 
Reported: 2005-09-28 15:56 UTC by Fred T
Modified: 2006-02-24 10:54 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Zip of sample messages which are misparsed application/octet-stream None Fred T [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Fred T 2005-09-28 15:56:47 UTC
I tried describing this the best I can, I'll attach a collection of messages 
which are being incorrectly parsed causing SA 3.1 to treat the binary 
attachment as plain text.  These are all multi-layered mime messages, it looks 
like the typical forward of a forward of a forward thing.  Are all these 
messages broken or is SA not parsing them correct is my question.
Comment 1 Fred T 2005-09-28 15:57:28 UTC
Created attachment 3150 [details]
Zip of sample messages which are misparsed
Comment 2 Theo Van Dinter 2005-09-29 03:08:13 UTC
Subject: Re:   New: Misparsing of message causes binary attachment to be treated as text

On Wed, Sep 28, 2005 at 03:56:48PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I tried describing this the best I can, I'll attach a collection of messages 
> which are being incorrectly parsed causing SA 3.1 to treat the binary 
> attachment as plain text.  These are all multi-layered mime messages, it looks 
> like the typical forward of a forward of a forward thing.  Are all these 
> messages broken or is SA not parsing them correct is my question.

If I understand your concern, neither the messages are broken nor is SA
parsing them incorrectly.  SA subparses the first message/rfc822 part,
but the ones inside there aren't subparsed, so they'd be considered
text in the message.

Comment 3 Fred T 2005-09-29 08:38:26 UTC
Well this issue is causing me 100s of false positives each day.  I could write
custom nice rules to counter the misfiring rules but I was hoping for a better
solution, one that would work for everyone!  Is there anything that can be done
in SA to work around this issue?
Comment 4 Theo Van Dinter 2005-09-29 09:15:09 UTC
Subject: Re:  Misparsing of message causes binary attachment to be treated as text

On Thu, Sep 29, 2005 at 08:38:27AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Well this issue is causing me 100s of false positives each day.  I could write
> custom nice rules to counter the misfiring rules but I was hoping for a better
> solution, one that would work for everyone!  Is there anything that can be done
> in SA to work around this issue?

Not that I can think of.  We originally would have parsed the whole
thing, but certain messages would then cause perl to die with a deep
recursion error (bug 4103).  So now the code only subparses the first
level message/rfc822 to avoid the issue.

The messages don't necessary look like spam, and SA doesn't flag any of them
as spam with the default 3.1 set, so I'm not sure where you're getting FPs.

Comment 5 Justin Mason 2005-09-29 11:11:34 UTC
Fred -- you could use the MIMEHeader plugin in 3.1.0, and write a custom "nice"
rule to match message/rfc822 parts...
Comment 6 Fred T 2005-09-29 15:57:02 UTC
I was seeing some of the fuzzy rules hitting the binary text, also a few other
obfu type rules.  I have to admit there was SARE rules involved in the FP, also
Tripwire, but the issue I see here is that mis-parsing HAM messages like this
could create a false image about a paticular rule when it's not really the rules
fault it hit.  I hope I'm making sense out of this, what I mean is if you have
messages like this in your corpus, and the binary is being treated as text, that
could be messing up your mass-check results.
I created a work-around so if you don't feel any need to do anything about this,
I am happy to close this ticket in any fashion you see fit, the main point was I
thought this was incorrect behaviour.  Thanks for your time and help!
Comment 7 John Gardiner Myers 2005-09-29 16:14:59 UTC
A MIME recursion limit of 1 seems inordinately low.  I'd think it should be at
least 5.
Comment 8 Justin Mason 2005-09-29 16:32:45 UTC
we've been here before, and boy, was there fireworks. ;)

My position was that if a message contains a message/rfc822 MIME part, that part
should be considered a (non-scanned) binary part, not a (scanned) text part. 
That would avoid this bug.

However IIRC some MUAs *did* treat it as a text part, which is why we have the
current behaviour...
Comment 9 John Gardiner Myers 2005-09-29 16:44:29 UTC
The test cases show a succession of forwardings by MUAs that forward messages as
message/rfc822 entities.  This type of forwarding is a common MUA behavior.  It
is also the standard MTA behavior for DSNs.

If SpamAssassin didn't scan message/rfc822, that would give spammers an easy way
to bypass scanning.
Comment 10 Justin Mason 2005-09-29 17:05:53 UTC
'If SpamAssassin didn't scan message/rfc822, that would give spammers an easy way
to bypass scanning.'

we really have been here before.  seriously.  see:

http://bugzilla.spamassassin.org/show_bug.cgi?id=3367
http://bugzilla.spamassassin.org/show_bug.cgi?id=3069
Comment 11 Fred T 2005-09-29 17:31:03 UTC
Ok after further research, this is a duplicate of Bug 3069.
I'll mark this a duplicate of that.
I was thinking it'd be simple to not parse the base64 text regardless if it's a
message/rfc822 or not, I wasn't familiar of the internals and how this is worked
out.

*** This bug has been marked as a duplicate of 3069 ***
Comment 12 John Gardiner Myers 2005-09-29 18:07:01 UTC
This is not a duplicate of bug 3609.  Bug 3609 was fixed by recursing down a
single level of message/rfc822.  This bug complains that message/rfc822 messages
at more than one level of recursion are parsed as text, generating FPs.

A single level of recursion is not sufficient.  Even if it were, the current
handling of messages beyond the recursion limit is undesirable.

Comment 13 Fred T 2005-09-30 09:13:55 UTC
So the parsing of the mime structure is a problem here?  I was hoping or
thinking we could just drop parts like this:

Content-Type: image/jpeg
Content-Transfer-Encoding: base64

I understand that it's not always obvious what the Content-Type is, but when it
does show up, can't we just drop it from body?

The decision to do this was decided in Bug 3069 that's why I marked it a
duplicate, if a dev disagrees with me they are free to undo what I've done :)
Comment 14 Theo Van Dinter 2005-09-30 10:07:31 UTC
Subject: Re:  Misparsing of message causes binary attachment to be treated as text

On Fri, Sep 30, 2005 at 09:13:55AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> So the parsing of the mime structure is a problem here?  I was hoping or
> thinking we could just drop parts like this:
> 
> Content-Type: image/jpeg
> Content-Transfer-Encoding: base64
> 
> I understand that it's not always obvious what the Content-Type is, but when it
> does show up, can't we just drop it from body?

The message parser works fine, it just intentionally doesn't subparse
message/rfc822 parts past the first one.  Non text/* or message/* parts are
already ignored.  The specific issue here is 1) only subparsing one level and
2) treating non-parsed message/* parts as text.

Both of those behaviors come from other tickets and were decided upon as the
correct course of action to fix the issues from those tickets.  We can't do it
both ways, either we solve one set of problems or the reverse set of problems.
If you think you do know how to solve both sets, please feel free to explain.

Comment 15 John Gardiner Myers 2005-09-30 10:15:35 UTC
As those other tickets are either unreferenced or unavailable, it is not
reasonable to expect people to address the unknown problems therein.
Comment 16 Tom Schulz 2005-09-30 11:05:33 UTC
Well, bug 4103 is certainly unavailable.  It seems that the other bugs can be
found in the above comments and by following one reference.  The ones I find
are: bug 3069, bug 3271 and bug 3367.  Could bug 4103 be unblocked now that
fixes are out to everybody?
Comment 17 Justin Mason 2005-09-30 11:25:54 UTC
bug 4103 is now readable.
Comment 18 Loren Wilton 2005-09-30 15:04:15 UTC
Subject: Re:  Misparsing of message causes binary attachment to be treated as text

I wonder if there is a way to do "cheap subparsing" that might solve both
issues.

As I think I understand the problem with subparsing, I'm assuming that you
were basically calling some largish mime parser module recursively, and this
was probably upsetting perl at the stack size, leading to the deep recursion
error.

What about the possibility of simply hacking the mime parser by adding a
string array used as a stack of mime boundaries?  Every time when linearly
parsing down you recognize the introduction of a new boundary you push it
ont the stack and make it the active boundary, and run it until you see the
end boundary flag, when you pop it off, reverting to the previous boundary?

That has some potential holes with malformed mime introductions or missing
end boundaries, but I don't see that recursively calling the parser would in
any way be immune to those faults either.  You could always keep the outer
boundary handy and compare text to it to make sure you don't overrun the
overall message.  (Assuming that the outer boundary isn't reused as an
internal boundary.)

The main point of this hack would be to simply detect and delete internal
non-text mime parts, not to actually create separate recursive mime parts
for all the found text parts.  Thus, in the case of a message that had a
bunch of embedded messages as the reply history, all of that would still
show up as the outer-level text part by the time SA was scanning things.
However, any gif that was embedded in someone's signature three levels in
would have been correctly stripped from that one agglomerated text part.

Comment 19 Justin Mason 2005-10-14 13:37:10 UTC
reopening, as per discussion in bug 4609.  once that bug is resolved, we can fix
this one pretty easily as per Theo's idea in that bug: 'still only subparse the
one time, and then not treat message/* parts as part of the body after that.'

In other words, treat subparts that are not displayed by default by common MUAs,
in the same way we treat binary attachments currently -- ie. hidden.
Comment 20 John Gardiner Myers 2005-10-19 20:33:34 UTC
Taking bug.
Comment 21 John Gardiner Myers 2006-02-24 19:54:34 UTC
3.1 branch: Committed revision 380772.

Trunk: Committed revision 380783.