Bug 5393

Summary: Text after closing MIME boundary is not examined.
Product: Spamassassin
Reporter: Bill Cole <billcole>
Component: spamassassin
Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW ---
Severity: normal
CC: mattheww
Priority: P5    
Version: 3.1.8   
Target Milestone: Undefined   
Hardware: All   
OS: All   
Whiteboard:
Attachments: zipfile containing 3 GTUBE test messages

Description Bill Cole 2007-03-28 14:33:54 UTC
Pathological multipart/alternative mail with a terminating MIME boundary string
and text after it evades SpamAssassin. I am attaching a zip file with a test
case in 3 files:

gtubetest: The GTUBE test provided on the SA site

gtubetest.multipart: That message minimally restructured into multipart/alternative

gtubetest.multipart.broken: The multipart version with the addition of a
terminating boundary line above the payload. 

SA 3.1.8 detects GTUBE in the first 2 correctly, but misses it entirely in the
third.
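
For readers without the attachment, the shape of the third file can be
reproduced roughly as follows. This is a minimal sketch in Python, not the
attached test case: the addresses, the boundary string "b1", and the
PAYLOAD-MARKER text (standing in for the GTUBE line) are made up for
illustration. It shows that a MIME-aware parser, here Python's standard
email package, files the trailing text under the epilogue and not under any
body part.

```python
from email import message_from_string

# Hypothetical message: two empty alternative parts, then text placed
# after the closing boundary line ("--b1--").
RAW = (
    "From: sender@example.com\n"
    "To: rcpt@example.com\n"
    "Subject: epilogue test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "\n"
    "--b1\n"
    "Content-Type: text/html\n"
    "\n"
    "\n"
    "--b1--\n"
    "PAYLOAD-MARKER after the closing boundary\n"
)

msg = message_from_string(RAW)

# Text of all body parts, as a MIME-compliant reader would see it.
part_text = "".join(part.get_payload() for part in msg.get_payload())

print("parts:", len(msg.get_payload()))
print("payload in parts:", "PAYLOAD-MARKER" in part_text)
print("payload in epilogue:", "PAYLOAD-MARKER" in (msg.epilogue or ""))
```

A scanner that only walks the body parts never sees the marker, which is the
behaviour reported above.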
Comment 1 Bill Cole 2007-03-28 14:35:11 UTC
Created attachment 3892 [details]
zipfile containing 3 GTUBE test messages
Comment 2 Theo Van Dinter 2007-03-28 15:40:31 UTC
Ok...

What's the issue?  Text outside the start/end MIME boundaries is supposed to be
ignored by MUAs.  If you have a MIME-compliant MUA that shows that text, then
it's not actually MIME-compliant. :)
Comment 3 Bill Cole 2007-03-28 15:51:19 UTC
Indeed, a MUA that displays such a thing is arguably broken. Yet people actually
pay for such abuse. 
Unfortunately, you can find ham structured thus if you look for sites that do
careless footer application, and many people still follow Postel's Law of
Robustness. 

I'm not even certain that display of the trailing stuff is technically wrong.
The termination of the MIME object might not need to be the termination of the
RFC(2)822 message. 
Comment 4 Theo Van Dinter 2007-03-28 16:04:28 UTC
(In reply to comment #3)
> I'm not even certain that display of the trailing stuff is technically wrong.
> The termination of the MIME object might not need to be the termination of the
> RFC(2)822 message. 

It's not wrong, just ignorable. :)  It's actually very clear in RFC1521 which
defines MIME:

      NOTE: These "preamble" and "epilogue" areas are generally not used
      because of the lack of proper typing of these parts and the lack
      of clear semantics for handling these areas at gateways,
      particularly X.400 gateways.  However, rather than leaving the
      preamble area blank, many MIME implementations have found this to
      be a convenient place to insert an explanatory note for recipients
      who read the message with pre-MIME software, since such notes will
      be ignored by MIME-compliant software.

So it pretty specifically states that anything before (preamble) or after
(epilogue) the bounded MIME parts will be ignored.  The BNF is pretty clear too:

   multipart-body := preamble 1*encapsulation close-delimiter epilogue
   epilogue := discard-text        ; to be ignored upon receipt.
   discard-text := *(*text CRLF)

Also, one of the example mails includes:

      --simple boundary--
      This is the epilogue.  It is also to be ignored.


:)
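
To make the grammar concrete, here is a minimal sketch in Python of how a
body splits into preamble, encapsulated parts, and epilogue. It is a
simplification and not a real MIME parser: it does not parse part headers,
and it ignores the rule that the line break before a delimiter belongs to
the delimiter. The function name split_multipart and the sample body
(modelled on the RFC example) are illustrative only.

```python
from typing import List, Tuple


def split_multipart(body: str, boundary: str) -> Tuple[str, List[str], str]:
    """Split a multipart body into (preamble, parts, epilogue)."""
    delim = "--" + boundary   # delimiter line that starts each part
    close = delim + "--"      # close-delimiter line
    preamble: List[str] = []
    parts: List[List[str]] = []
    epilogue: List[str] = []
    current = preamble        # lines go here until the first delimiter
    closed = False
    for line in body.splitlines():
        if closed:
            epilogue.append(line)      # everything after the close-delimiter
        elif line.rstrip() == close:
            closed = True
        elif line.rstrip() == delim:
            parts.append([])           # start a new encapsulated part
            current = parts[-1]
        else:
            current.append(line)
    return "\n".join(preamble), ["\n".join(p) for p in parts], "\n".join(epilogue)


BODY = "\n".join([
    "This is the preamble.  It is to be ignored.",
    "--simple boundary",
    "",
    "First part.",
    "--simple boundary--",
    "This is the epilogue.  It is also to be ignored.",
])

preamble, parts, epilogue = split_multipart(BODY, "simple boundary")
print(repr(epilogue))
```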
Comment 5 Theo Van Dinter 2007-03-28 16:07:06 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > I'm not even certain that display of the trailing stuff is technically wrong.
> > The termination of the MIME object might not need to be the termination of the
> > RFC(2)822 message. 
> 
> It's not wrong, just ignorable. :)  It's actually very clear in RFC1521 which
> defines MIME:

That didn't make a lot of sense.  I was trying to say that it's not wrong to have
trailing text, but it should be ignored.  It actually is wrong for a
MIME-compliant MUA to display it since it's supposed to have been ignored.
Comment 6 Loren Wilton 2007-03-28 17:10:02 UTC
Are there any examples of common MUAs that will display this stuff?  Even with 
non-default but easily available settings a real user might select? 

If so then putting spam there would be an evasion of SA checking.
Comment 7 Bill Cole 2007-03-29 08:18:13 UTC
(In reply to comment #6)
> Are there any examples of common MUAs that will display this stuff?  Even with 
> non-default but easily available settings a real user might select? 
> 
> If so then putting spam there would be an evasion of SA checking.

I discovered this because a sample was reported by an Outlook (2000?) user as a
filter miss. Due to policy issues I am not able to share the full original
(hence the constructed samples) and while I am unable to stand in front of his
machine to verify it, I feel pretty sure that he was not reporting the emptiness
of the text and HTML parts as spam. The epilogue-carried payload was a consumer
survey come-on, all bad HTML with lots of non-included images. 

In addition, any MUA that does not support MIME will display whatever epilogue
happens to be present. I know it sounds crazy, but people do still use mailx.
Really. There is even a politically significant (in geek terms) population of
people who use pure text mailers like mutt and mh and intentionally break
whatever MIME support is there. I have also confirmed that Palm's VersaMail MUA
will display the epilogue of MIME messages. 

I'd rate my confidence that this was intentional filter evasion at about 80% 
with a real chance that it was an "OOPS!" (the HTML  was clearly built by hand
by an amateur) but even so, it seems prudent to look in the epilogue. What could
it hurt? 

Comment 8 Theo Van Dinter 2007-03-29 08:34:52 UTC
(In reply to comment #7)
> I discovered this because a sample was reported by an Outlook (2000?) user as a
> filter miss. Due to policy issues I am not able to share the full original
> (hence the constructed samples) and while I am unable to stand in front of his
> machine to verify it, I feel pretty sure that he was not reporting the emptiness
> of the text and HTML parts as spam. The epilogue-carried payload was a consumer
> survey come-on, all bad HTML with lots of non-included images. 

Was it that he saw that information or did it just exist in the message? 
Spammers send out a lot of crappy email (construction, not just content,) and I
wouldn't be surprised if they had stuff all over the place.  But that's not
really important if the MUA does what it's supposed to and ignores it.

> In addition, any MUA that does not support MIME will display whatever epilogue
> happens to be present. I know it sounds crazy, but people do still use mailx.
> Really. There is even a politically significant (in geek terms) population of
> people who use pure text mailers like mutt and mh and intentionally break
> whatever MIME support is there. I have also confirmed that Palm's VersaMail MUA
> will display the epilogue of MIME messages. 

Well, readers like Mutt (which I use) don't break MIME.  You just look at the
text parts, not HTML or anything else.  mailx and anything else that's not a
MIME-compliant reader will obviously show you everything.

As far as spammers are concerned, they're not targeting mailx users.

> I'd rate my confidence that this was intentional filter evasion at about 80% 
> with a real chance that it was an "OOPS!" (the HTML  was clearly built by hand
> by an amateur) but even so, it seems prudent to look in the epilogue. What could
> it hurt? 

Well, asking for us to stop being MIME compliant because we can is a little
ridiculous imo. :)   If common MIME-compliant MUAs display the preamble or the
epilogue (very obviously incorrect), then we may have to do the same, and also
should get people to yell at the vendor about it.  If they don't, then we arguably
shouldn't either.

An obvious issue is that it potentially gives us a lot more data to process, and
gives spammers a way of specifically trying to clog anti-spam tools.  There's
already the issue of text/plain parts with garbage and a text/html part with the
actual payload.  It's hard enough trying to determine that the text/plain part
is garbage, let alone try to figure out whether or not all the data in the
epilogue should be ignored.
Comment 9 Bill Cole 2007-03-29 10:03:28 UTC
(In reply to comment #8)


> 
> Well, readers like Mutt (which I use) don't break MIME.  You just look at the
> text parts, not HTML or anything else.  mailx and anything else that's not a
> MIME-compliant reader will obviously show you everything.
> 
> As far as spammers are concerned, they're not targeting mailx users.

Please read all of my message. 
 
 
> > I'd rate my confidence that this was intentional filter evasion at about 80% 
> > with a real chance that it was an "OOPS!" (the HTML  was clearly built by hand
> > by an amateur) but even so, it seems prudent to look in the epilogue. What could
> > it hurt? 
> 
> Well, asking for us to stop being MIME compliant because we can is a little
> ridiculous imo. :)   

I think your reasoning here is silly, since SA is not a MUA. There is spam being
sent which is putting payload in the MIME epilogue. That alone ought to be a
useful thing to detect, and HTML in the MIME epilogue even more useful, as well
as detailed features of that payload that would  already be detectable with SA
except for the fact that SA totally ignores the epilogue.  By your reasoning,
one could support having SA ignore anything inside invalid HTML or completely
passing on the analysis of mail with improper RFC822 format. 

SA already quite usefully identifies a variety of technical flaws as such and
does not generally exempt the content of broken structures from filtering. 


> If common MIME-compliant MUAs display the preamble or the
> epilogue (very obviously incorrect), then we may have to do the same, and also
> should get people to yell at the vendor about it.  If they don't, then we arguably
> shouldn't either.

I have confirmation from the original user who reported the spam to me that
Outlook 97 displays the epilogue payload. Outlook 2003SP2 does not. Versamail
does. Any MIME-ignorant MUA does.  

Which does not mean I think your rationale makes sense. 

> An obvious issue is that it potentially gives us a lot more data to process, and
> gives spammers a way of specifically trying to clog anti-spam tools.  There's
> already the issue of text/plain parts with garbage and a text/html part with the
> actual payload.  It's hard enough trying to determine that the text/plain part
> is garbage, let alone try to figure out whether or not all the data in the
> epilogue should be ignored.

Yeah, spam sucks. I think that's something we can agree on completely. 

Having a filter act as though a piece of spam simply doesn't exist because it is
supposed to be ignored by MUAs that adhere to a standard seems like a novel
variation on the idea that some spam is not spam because of its content. 
Comment 10 Sidney Markowitz 2007-03-29 12:12:02 UTC
I don't think it really is a matter of SA only working with mail that follows
the RFCs, as we don't in every case. But in this case there is a specific answer
to your question about what is the harm in scanning the epilogue text. Given
that some MUAs ignore it and some MUAs show it, we have to make a choice based
on what spammers will do if we do or do not scan it.

If we do not scan the epilogue they can put spam there and not in the body,
targeting the people who use MUAs that see it and missing people who do not.

If we do scan the epilogue, they can target people who use MUAs that do not see
it by putting high volumes of non-spammy garbage in the epilogue designed to
overload spam filters and poison Bayes databases.

The question is not a matter of what does the RFC say, but which MUAs do what
with it, how popular are they, what are spammers doing now, and what potential
advantages do we give spammers with our choice of how we handle this.

I would like to see a list of MUAs that display the epilogue in these test
messages. If it is only Outlook 97, Versamail, and mail readers that don't
understand MIME, then I would go with having SA ignore the epilogue following
the reasoning that spammers will be more likely to try to take advantage of a
loophole to DoS spam filters than they are to target a very small subset of the
MUAs.
Comment 11 Loren Wilton 2007-03-30 01:57:44 UTC
In reply to comment #10, is not the epilog supposed to be small?  After all, if 
it is to be discarded (according to the RFCs), what would be the purpose in 
making it large, possibly larger than the legitimate body of the mail itself?

Maybe there is no point in scanning an epilog that is 200KB in size.  Or 1KB in 
size.  Maybe just add one point to the message for every KB of size of the 
epilog and be done with it.  For less than 27KB (about the standard size for an 
image spam these days) scan the epilog, as it is no bigger than a typical spam 
mail that you do feel is worth scanning.

The argument that putting large quantities of garbage in the epilog will 
prevent spam scanning or use up system resources doesn't hold.  Since this is 
supposed to be ignored by MUAs, then by definition it is NOT supposed to have 
valid content.  It is sufficient to detect that it DOES contain valid content 
and score that fact appropriately.  Detailed scanning on "typically sized" 
content would merely be a bonus.
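
One possible reading of this proposal, as a minimal sketch in Python: add
one point per full KB of epilogue, and only run detailed scanning when the
epilogue is non-empty and under the 27KB figure mentioned above. The
function name epilogue_policy and the exact thresholds are illustrative;
nothing like this exists in SA as far as this thread shows.

```python
from typing import Tuple

KB = 1024
SCAN_LIMIT = 27 * KB  # "about the standard size for an image spam"


def epilogue_policy(epilogue: bytes) -> Tuple[int, bool]:
    """Return (points to add, whether to scan the epilogue in detail)."""
    size = len(epilogue)
    points = size // KB              # one point per full KB of epilogue
    scan = 0 < size < SCAN_LIMIT     # skip empty or oversized epilogues
    return points, scan


print(epilogue_policy(b""))                  # no epilogue at all
print(epilogue_policy(b"x" * (2 * KB)))      # small epilogue
print(epilogue_policy(b"x" * (200 * KB)))    # very large epilogue
```

With these rules a 200KB epilogue is never scanned but still pushes the
score up heavily, so padding the epilogue does not buy the sender anything.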
Comment 12 Theo Van Dinter 2007-03-30 08:32:46 UTC
FWIW,  until someone comes up with tests for this stuff: both correctly and
incorrectly formatted MIME messages, and runs them all against numerous mail
clients, and shows clearly that the most commonly used MUAs are doing things one
way or the other ... Arguing about what should happen is pretty moot IMO.
Comment 13 Bill Cole 2007-03-30 09:43:20 UTC
(In reply to comment #11)
> In reply to comment #10, is not the epilog supposed to be small?  After all, if 
> it is to be discarded (according to the RFCs), what would be the purpose in 
> making it large, possibly larger than the legitimate body of the mail itself?
> 
> Maybe there is no point in scanning an epilog that is 200KB in size.  Or 1KB in 
> size.  Maybe just add one point to the message for every KB of size of the 
> epilog and be done with it.  For less than 27KB (about the standard size for an 
> image spam these days) scan the epilog, as it is no bigger than a typical spam 
> mail that you do feel is worth scanning.
> 
> The argument that putting large quantities of garbage in the epilog will 
> prevent spam scanning or use up system resources doesn't hold.  Since this is 
> supposed to be ignored by MUAs, then by definition it is NOT supposed to have 
> valid content.  It is sufficient to detect that it DOES contain valid content 
> and score that fact appropriately.  Detailed scanning on "typically sized" 
> content would merely be a bonus.


I think that's a very important point: the choice is not a binary choice between
scanning whatever epilogue there is as if it were a normal part of normal mail
or not scanning it at all. There are potentially interesting features that could
be detected without doing a full scan of the epilogue, including simple
existence, absolute size, and size relative to valid MIME parts. I've now seen 3
such messages in the wild, all of which had effectively empty MIME parts
consisting of a small number of blank lines.  
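
The three features named above can be sketched in a few lines. This is an
illustration in Python using the standard email package, not SA code; the
function name epilogue_features and the choice to express relative size as
a fraction of epilogue-plus-parts are assumptions. It only looks at the
top-level epilogue and ignores epilogues of nested multiparts.

```python
from email import message_from_string


def epilogue_features(raw: str) -> dict:
    """Cheap epilogue features: existence, size, and relative size."""
    msg = message_from_string(raw)
    # Only multipart messages have an epilogue; ignore pure whitespace.
    epilogue = (msg.epilogue or "").strip() if msg.is_multipart() else ""
    # Total size of the leaf (non-multipart) parts' payloads.
    parts_size = sum(
        len(part.get_payload())
        for part in msg.walk()
        if not part.is_multipart()
    )
    total = len(epilogue) + parts_size
    return {
        "has_epilogue": bool(epilogue),
        "epilogue_size": len(epilogue),
        "parts_size": parts_size,
        # Fraction of the content that sits after the closing boundary.
        "epilogue_fraction": len(epilogue) / total if total else 0.0,
    }
```

For a message like the samples described here (empty MIME parts, payload
after the closing boundary), the fraction comes out at or near 1.0, which
mailing-list footers appended to a real body would not produce.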

FWIW, I think the risk of overload attacks by the use of large epilogues is also
relatively low. It is already common (e.g. implemented in the MIMEDefang sample
code for using SA) to exempt large messages from SA scanning completely. That
path has itself been attacked by image spammers, but it remains useful to cap
the size of messages subjected to SA scanning. That practice also would limit
overload attacks via large epilogues. 
Comment 14 Matthew Wilson 2007-05-24 06:03:48 UTC
*** Bug 5474 has been marked as a duplicate of this bug. ***
Comment 15 Matthew Wilson 2007-05-24 06:07:40 UTC
(In reply to comment #12)
> FWIW,  until someone comes up with tests for this stuff: both correctly and
> incorrectly formatted MIME messages, and runs them all against numerous mail
> clients, and shows clearly that the most commonly used MUAs are doing things one
> way or the other ... Arguing about what should happen is pretty moot IMO.

Two more user-reported examples of Outlook 2003 and 2007 also being susceptible
to this SpamAssassin exploit are at
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5474

Comment 16 Theo Van Dinter 2007-07-05 12:43:55 UTC
So I coded up a test for this...   It's not good (results from the latest
nightly run):

  0.782   0.3261   2.7777    0.105   0.34    0.00  T_TVD_MIME_EPI

Basically, it's more of a ham test than anything else.
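
For anyone not used to reading these lines: assuming the columns are in the
usual hit-frequencies order (OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE, NAME),
the S/O figure is the share of hits that were spam. The sketch below, in
Python, recomputes it from the SPAM% and HAM% values on the line above; the
column order and the formula are assumptions, but the result matches the
reported 0.105.

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's hits that landed on spam (S/O)."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


# SPAM% and HAM% taken from the T_TVD_MIME_EPI line above.
so = spam_over_overall(0.3261, 2.7777)
print(f"{so:.3f}")  # 0.105
```

An S/O near 1.0 marks a good spam rule; a value this close to 0 means the
rule fires mostly on ham.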

Looking at my corpus, all the ham hits (except one) were for mailing lists which
blindly tack on a footer to each message.  (The exception was a newsletter which
repeated the text/html part after the closing boundary.)  For my spams, the only
epilogue I found was a single null character after the boundary, no actual content.

Testing for an epilogue is trivial, so I'm going to leave it in for now.