SA Bugzilla – Bug 4606
Misparsing of message causes binary attachment to be treated as text
Last modified: 2006-02-24 10:54:34 UTC
I tried describing this the best I can, I'll attach a collection of messages which are being incorrectly parsed causing SA 3.1 to treat the binary attachment as plain text. These are all multi-layered mime messages, it looks like the typical forward of a forward of a forward thing. Are all these messages broken or is SA not parsing them correct is my question.
Created attachment 3150 [details] Zip of sample messages which are misparsed
Subject: Re: New: Misparsing of message causes binary attachment to be treated as text On Wed, Sep 28, 2005 at 03:56:48PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > I tried describing this the best I can, I'll attach a collection of messages > which are being incorrectly parsed causing SA 3.1 to treat the binary > attachment as plain text. These are all multi-layered mime messages, it looks > like the typical forward of a forward of a forward thing. Are all these > messages broken or is SA not parsing them correct is my question. If I understand your concern, neither the messages are broken nor is SA parsing them incorrectly. SA subparses the first message/rfc822 part, but the ones inside there aren't subparsed, so they'd be considered text in the message.
Well this issue is causing me 100s of false positives each day. I could write custom nice rules to counter the misfiring rules but I was hoping for a better solution, one that would work for everyone! Is there anything that can be done in SA to work around this issue?
Subject: Re: Misparsing of message causes binary attachment to be treated as text On Thu, Sep 29, 2005 at 08:38:27AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > Well this issue is causing me 100s of false positives each day. I could write > custom nice rules to counter the misfiring rules but I was hoping for a better > solution, one that would work for everyone! Is there anything that can be done > in SA to work around this issue? Not that I can think of. We originally would have parsed the whole thing, but certain messages would then cause perl to die with a deep recursion error (bug 4103). So now the code only subparses the first level message/rfc822 to avoid the issue. The messages don't necessary look like spam, and SA doesn't flag any of them as spam with the default 3.1 set, so I'm not sure where you're getting FPs.
Fred -- you could use the MIMEHeader plugin in 3.1.0, and write a custom "nice" rule to match message/rfc822 parts...
I was seeing some of the fuzzy rules hitting the binary text, also a few other obfu type rules. I have to admit there was SARE rules involved in the FP, also Tripwire, but the issue I see here is that mis-parsing HAM messages like this could create a false image about a paticular rule when it's not really the rules fault it hit. I hope I'm making sense out of this, what I mean is if you have messages like this in your corpus, and the binary is being treated as text, that could be messing up your mass-check results. I created a work-around so if you don't feel any need to do anything about this, I am happy to close this ticket in any fashion you see fit, the main point was I thought this was incorrect behaviour. Thanks for your time and help!
A MIME recursion limit of 1 seems inordinately low. I'd think it should be at least 5.
we've been here before, and boy, was there fireworks. ;) My position was that if a message contains a message/rfc822 MIME part, that part should be considered a (non-scanned) binary part, not a (scanned) text part. That would avoid this bug. However IIRC some MUAs *did* treat it as a text part, which is why we have the current behaviour...
The test cases show a succession of forwardings by MUAs that forward messages as message/rfc822 entities. This type of forwarding is a common MUA behavior. It is also the standard MTA behavior for DSNs. If SpamAssassin didn't scan message/rfc822, that would give spammers an easy way to bypass scanning.
'If SpamAssassin didn't scan message/rfc822, that would give spammers an easy way to bypass scanning.' we really have been here before. seriously. see: http://bugzilla.spamassassin.org/show_bug.cgi?id=3367 http://bugzilla.spamassassin.org/show_bug.cgi?id=3069
Ok after further research, this is a duplicate of Bug 3069. I'll mark this a duplicate of that. I was thinking it'd be simple to not parse the base64 text regardless if it's a message/rfc822 or not, I wasn't familiar of the internals and how this is worked out. *** This bug has been marked as a duplicate of 3069 ***
This is not a duplicate of bug 3609. Bug 3609 was fixed by recursing down a single level of message/rfc822. This bug complains that message/rfc822 messages at more than one level of recursion are parsed as text, generating FPs. A single level of recursion is not sufficient. Even if it were, the current handling of messages beyond the recursion limit is undesirable.
So the parsing of the mime structure is a problem here? I was hoping or thinking we could just drop parts like this: Content-Type: image/jpeg Content-Transfer-Encoding: base64 I understand that it's not always obvious what the Content-Type is, but when it does show up, can't we just drop it from body? The decision to do this was decided in Bug 3069 that's why I marked it a duplicate, if a dev disagrees with me they are free to undo what I've done :)
Subject: Re: Misparsing of message causes binary attachment to be treated as text On Fri, Sep 30, 2005 at 09:13:55AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > So the parsing of the mime structure is a problem here? I was hoping or > thinking we could just drop parts like this: > > Content-Type: image/jpeg > Content-Transfer-Encoding: base64 > > I understand that it's not always obvious what the Content-Type is, but when it > does show up, can't we just drop it from body? The message parser works fine, it just intentionally doesn't subparse message/rfc822 parts past the first one. Non text/* or message/* parts are already ignored. The specific issue here is 1) only subparsing one level and 2) treating non-parsed message/* parts as text. Both of those behaviors come from other tickets and were decided upon as the correct course of action to fix the issues from those tickets. We can't do it both ways, either we solve one set of problems or the reverse set of problems. If you think you do know how to solve both sets, please feel free to explain.
As those other tickets are either unreferenced or unavailable, it is not reasonable to expect people to address the unknown problems therein.
Well, bug 4103 is certainly unavailable. It seems that the other bugs can be found in the above comments and by following one reference. The ones I find are: bug 3069, bug 3271 and bug 3367. Could bug 4103 be unblocked now that fixes are out to everybody?
bug 4103 is now readable.
Subject: Re: Misparsing of message causes binary attachment to be treated as text I wonder if there is a way to do "cheap subparsing" that might solve both issues. As I think I understand the problem with subparsing, I'm assuming that you were basically calling some largish mime parser module recursively, and this was probably upsetting perl at the stack size, leading to the deep recursion error. What about the possibility of simply hacking the mime parser by adding a string array used as a stack of mime boundaries? Every time when linearly parsing down you recognize the introduction of a new boundary you push it ont the stack and make it the active boundary, and run it until you see the end boundary flag, when you pop it off, reverting to the previous boundary? That has some potential holes with malformed mime introductions or missing end boundaries, but I don't see that recursively calling the parser would in any way be immune to those faults either. You could always keep the outer boundary handy and compare text to it to make sure you don't overrun the overall message. (Assuming that the outer boundary isn't reused as an internal boundary.) The main point of this hack would be to simply detect and delete internal non-text mime parts, not to actually create separate recursive mime parts for all the found text parts. Thus, in the case of a message that had a bunch of embedded messages as the reply history, all of that would still show up as the outer-level text part by the time SA was scanning things. However, any gif that was embedded in someone's signature three levels in would have been correctly stripped from that one agglomerated text part.
reopening, as per discussion in bug 4609. once that bug is resolved, we can fix this one pretty easily as per Theo's idea in that bug: 'still only subparse the one time, and then not treat message/* parts as part of the body after that.' In other words, treat subparts that are not displayed by default by common MUAs, in the same way we treat binary attachments currently -- ie. hidden.
Taking bug.
3.1 branch: Committed revision 380772. Trunk: Committed revision 380783.