Bug 3348 - Dealing with invalid base64 encoded emails
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 normal
Target Milestone: 3.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 3208
 
Reported: 2004-05-03 21:37 UTC by Yusuf Goolamabbas
Modified: 2004-05-10 12:59 UTC

Attachment: testcase (text/plain), submitted by Yusuf Goolamabbas [HasCLA]

Description Yusuf Goolamabbas 2004-05-03 21:37:22 UTC
The attached message when run through a URI dumper returns nothing, however the
URI http://www.savemorenow.biz/aff249.asp is there in the message

I suspect it could be due to base64 encoding of the text/html message
Comment 1 Yusuf Goolamabbas 2004-05-03 21:38:12 UTC
Created attachment 1938
testcase
Comment 2 Theo Van Dinter 2004-05-04 07:10:44 UTC
Subject: Re:  New: base64 encoded html messages seem to confuse get_uri_list

On Mon, May 03, 2004 at 09:37:23PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I suspect it could be due to base64 encoding of the text/html message

Interesting...  This has nothing to do with get_uri_list, and more to
do with the base64 decoding...

According to SpamAssassin, the text part of the message, after decoding is:

$VAR1 = [
          'Want to get out of debt?
',
          '<WHOLE BUNCH OF BINARY DATA>'
        ];

Which is the subject as the first line, then the "decoded" body.

But according to mutt, for instance, the output is:

<SOME BINARY DATA>
<SOME BINARY DATA>
 <html>
<head>
<title>Form Application</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css">
[...]


/me keeps digging

Comment 3 Theo Van Dinter 2004-05-04 07:11:11 UTC
moving to 3.0 queue
Comment 4 Theo Van Dinter 2004-05-04 07:53:07 UTC
Subject: Re:  New: base64 encoded html messages seem to confuse get_uri_list

On Mon, May 03, 2004 at 09:37:23PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I suspect it could be due to base64 encoding of the text/html message

Hrm.

Well, the problem is that it's an invalid message.  The message headers
say text/html that's base64 encoded, but the message body has a MIME
boundary with headers saying text/html and base64.  So our system goes
"oh, the whole thing is base64 encoded" (which is what it's supposed to
do) and tries to decode it, which will include trying to decode the
invalid MIME boundary, then everything goes to hell.
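
A minimal sketch of the failure described above, in Python rather than SpamAssassin's own code. It assumes a lenient decoder that skips every non-base64 character; the helper name lenient_b64decode, the boundary string "boundary42" and the example.com URL are all hypothetical. The stray boundary and header lines contribute extra base64-alphabet characters, which shifts the alignment of the real payload so that it decodes to garbage:

```python
import base64
import re

NON_B64 = re.compile(r"[^A-Za-z0-9+/]")

def lenient_b64decode(text: str) -> bytes:
    """Decode base64, silently skipping every non-alphabet character.

    Hypothetical helper: padding is ignored and a trailing partial
    4-character group is dropped, which is enough for this demo.
    """
    cleaned = NON_B64.sub("", text)
    cleaned = cleaned[: len(cleaned) - len(cleaned) % 4]
    return base64.b64decode(cleaned)

URL = b"http://www.example.com/x.asp"  # placeholder URL, not from the testcase
html = b'<html><a href="' + URL + b'">Click</a></html>\n'
html += b" " * (-len(html) % 3)  # pad so the encoding needs no '=' padding
payload = base64.b64encode(html).decode("ascii")

# Headers claim "base64", but the body carries its own boundary and headers.
body = "\n".join([
    "--boundary42",
    "Content-Type: text/html",
    "Content-Transfer-Encoding: base64",
    "",
    payload,
    "--boundary42--",
    "",
])

whole = lenient_b64decode(body)     # decode the entire body as one chunk
inner = lenient_b64decode(payload)  # decode only the real base64 lines

print("URL visible when only the payload is decoded:", URL in inner)
print("URL visible when the whole body is decoded:  ", URL in whole)
```

In this sketch the URL is found only in the payload-only decode, which is consistent with a URI dumper returning nothing for the attached message.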

Here we go again with MUAs not simply doing what they're supposed to do
(treat this as invalid), but trying to look through the part to display
anything possible to the user.  <grrr>

MUA			Behavior
----------------------- -----------------------------
Mutt			Binary boundary, then HTML, then binary boundary
Pine			Binary boundary, then HTML, then binary boundary
Apple Mail		Text boundary, then HTML, then text boundary
Outlook Express		Binary gibberish
Exchange WebMail	Binary gibberish
Opera			Binary gibberish, includes "errors while decoding" msg

As far as I can tell, this tells you how different MUAs do decoding.
Mutt and Pine seem to go line by line.  Apple Mail seems to also
go line by line, but checks that the line is validly encoded first.
The Windows-based MUAs (OE, WebMail, and Opera) all seem to just take
the message body as a chunk and decode it, like we do.
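
The three behaviours guessed at above can be sketched roughly as follows. This is an illustration of the guesses only, not code taken from any of these MUAs; the function names, the "boundary42" boundary and the sample HTML are assumptions made for the example.

```python
import base64
import re

NON_B64 = re.compile(r"[^A-Za-z0-9+/]")
# A line made only of complete base64 groups, with optional final padding.
VALID_LINE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)

def lenient(text: str) -> bytes:
    """Skip non-base64 characters, drop a partial final group, decode."""
    cleaned = NON_B64.sub("", text)
    return base64.b64decode(cleaned[: len(cleaned) - len(cleaned) % 4])

def decode_chunk(body: str) -> bytes:
    """Whole body at once (the OE / WebMail / Opera style guess)."""
    return lenient(body)

def decode_per_line(body: str) -> bytes:
    """Each line on its own (the Mutt / Pine style guess)."""
    return b"".join(lenient(line) for line in body.splitlines())

def decode_per_line_checked(body: str) -> bytes:
    """Each line on its own, but only if it is valid base64
    (the Apple Mail style guess); other lines pass through as text."""
    out = []
    for line in body.splitlines():
        if line and VALID_LINE.fullmatch(line):
            out.append(base64.b64decode(line))
        else:
            out.append(line.encode("ascii", "replace") + b"\n")
    return b"".join(out)

html = b"<html><head><title>Form Application</title></head></html>\n"
html += b" " * (-len(html) % 3)  # avoid '=' padding in the demo payload
payload = base64.b64encode(html).decode("ascii")
wrapped = [payload[i:i + 20] for i in range(0, len(payload), 20)]

body = "\n".join(
    ["--boundary42", "Content-Type: text/html",
     "Content-Transfer-Encoding: base64", ""]
    + wrapped
    + ["--boundary42--"]
)

for name, fn in [("chunk", decode_chunk),
                 ("per line", decode_per_line),
                 ("per line, checked", decode_per_line_checked)]:
    print(f"{name:18} shows the HTML: {html in fn(body)}")
```

Decoding per line keeps each payload line aligned on its own 4-character groups, so the HTML survives between the mangled boundary lines, while the chunk decode loses alignment as soon as the boundary and header characters are mixed in.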

So...  Anyone have thoughts about this?  Part of me is inclined to leave
things as they are, and the other part of me says we should emulate
Apple Mail here and make as much valid/visible text as possible.

Comment 5 Theo Van Dinter 2004-05-04 08:02:32 UTC
changing the summary since the issue isn't get_uri_list related.
Comment 6 Justin Mason 2004-05-04 09:42:26 UTC
hmm, tricky one.

I think the best option is to stick with what we do already; if none of the
Windows MUAs can display it, it's not going to be much use for the spammers to
exploit this invalid formatting, because their victims won't be able to read it.
Comment 7 Theo Van Dinter 2004-05-04 10:24:23 UTC
Subject: Re:  base64 encoded html messages seem to confuse get_uri_list

On Tue, May 04, 2004 at 07:53:08AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> MUA			Behavior
> ----------------------- -----------------------------
> Mutt			Binary boundary, then HTML, then binary boundary
> Pine			Binary boundary, then HTML, then binary boundary
> Apple Mail		Text boundary, then HTML, then text boundary
> Outlook Express		Binary gibberish
> Exchange WebMail	Binary gibberish
> Opera			Binary gibberish, includes "errors while decoding" msg

Thunderbird		Binary gibberish
Sylpheed		Nothing is displayed at all

Comment 8 Daniel Quinlan 2004-05-04 11:11:45 UTC
Subject: Re:  base64 encoded html messages seem to confuse get_uri_list

> So...  Anyone have thoughts about this?  Part of me is inclined to leave
> things as they are, and the other part of me says we should emulate
> Apple Mail here and make as much valid/visible text as possible.

Maybe we should start decoding where base64 appears to begin.  I think
we should generally follow the common behavior, but this is one case
where we should probably avoid doing the thing that doesn't let us catch the
spam.  Perhaps something like:

  first non-blank line:
    if line is a legal MIME boundary
      treat as a MIME boundary
    anything else
      treat as base64
  all remaining lines:
    treat as base64
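
One possible reading of the outline above, sketched in Python. It assumes "treat as a MIME boundary" means "set the line aside and keep it out of the base64 data", and it takes "legal MIME boundary" to mean the delimiter syntax of RFC 2046 ("--" followed by up to 70 boundary characters, not ending in a space). The function name split_first_proposal is hypothetical.

```python
import re

# "--" followed by an RFC 2046 boundary: 1 to 70 characters from the
# allowed set, where the last character must not be a space.
BOUNDARY = re.compile(
    r"--[0-9A-Za-z'()+_,\-./:=? ]{0,69}[0-9A-Za-z'()+_,\-./:=?]"
)

def split_first_proposal(body: str):
    """Return (boundary_line_or_None, lines_to_treat_as_base64)."""
    lines = body.splitlines()
    # Index of the first non-blank line, if any.
    first = next((i for i, line in enumerate(lines) if line.strip()), None)
    if first is None:
        return None, []
    if BOUNDARY.fullmatch(lines[first].rstrip()):
        # Legal boundary: set it aside, everything after it is base64.
        return lines[first], lines[first + 1:]
    # Anything else: the first line is already base64.
    return None, lines[first:]

boundary, data = split_first_proposal("\n--boundary42\nPGh0bWw+\nPC9odG1sPg==\n")
print(boundary)  # the line that was set aside
print(data)      # the lines handed to the base64 decoder
```

Note that under this literal reading any header lines that follow the boundary still fall under "all remaining lines" and would be fed to the decoder.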

Thankfully, "-" is not in base64.  However, that could open us up to
some stupid spammer using a fake MIME boundary that is indeed decoded by
some mailers that skip non-base64 characters (like "-") and produces a
line of spam text.

So, a more robust technique would be:

  first non-blank line:
    if line is a legal MIME boundary
      decode line (raw, our MIME decoding routine simulates the more
        common behavior of skipping non-base64 characters)
      if decoded line is binary garbage
        treat the line as a MIME boundary
      else
        treat it like base64
    anything else
      treat as base64
  all remaining lines:
    treat as base64
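
A rough Python sketch of the check on the first line in this second outline. The "binary garbage" test is not specified in the proposal; the printable-ASCII heuristic used here (looks_like_text) is an assumption for illustration, as are the names classify_first_line and lenient_b64decode and the sample boundary strings.

```python
import base64
import re

NON_B64 = re.compile(r"[^A-Za-z0-9+/]")
# RFC 2046 style delimiter line: "--" plus 1 to 70 boundary characters.
BOUNDARY = re.compile(
    r"--[0-9A-Za-z'()+_,\-./:=? ]{0,69}[0-9A-Za-z'()+_,\-./:=?]"
)

def lenient_b64decode(text: str) -> bytes:
    """Decode while skipping non-base64 characters (hypothetical helper)."""
    cleaned = NON_B64.sub("", text)
    return base64.b64decode(cleaned[: len(cleaned) - len(cleaned) % 4])

def looks_like_text(data: bytes) -> bool:
    """Assumed heuristic: non-empty and entirely printable ASCII or whitespace."""
    return bool(data) and all(32 <= b < 127 or b in (9, 10, 13) for b in data)

def classify_first_line(line: str) -> str:
    """Return 'boundary' or 'base64' for the first non-blank body line."""
    if not BOUNDARY.fullmatch(line.rstrip()):
        return "base64"  # anything else: treat as base64
    decoded = lenient_b64decode(line)
    # A boundary that decodes to readable text is suspicious: treat it
    # as base64 so the hidden text is seen, otherwise as a boundary.
    return "base64" if looks_like_text(decoded) else "boundary"

# A fake "boundary" whose characters decode to readable text.
fake_boundary = "--" + base64.b64encode(b"Hello World!").decode("ascii")

print(classify_first_line("--boundary42"))
print(classify_first_line(fake_boundary))
print(classify_first_line("PGh0bWw+"))
```

The fake boundary is the case the outline guards against: a mailer that skips the "-" characters would decode it to readable text, so this sketch classifies it as base64 instead of discarding it.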

Comment 9 Sidney Markowitz 2004-05-04 11:43:45 UTC
I agree with Justin: why decode it if the most common MUAs don't? It's not like
this isn't identifying itself as spam anyway. And if I read it correctly, 6.5 of
the 19 points are directly associated with the bogus MIME/BASE64 in it:

Content analysis details:   (19.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.0 RCVD_BY_IP             Received by mail server with no name
 0.1 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 0.7 MIME_HTML_NO_CHARSET   RAW: Message text in HTML without charset
 0.0 MIME_BASE64_NO_NAME    RAW: base64 attachment does not have a file name
 1.7 MIME_BASE64_ILLEGAL    RAW: base64 attachment uses illegal characters
 0.0 MIME_BASE64_BLANKS     RAW: Extra blank lines in base64 encoding
 1.1 MIME_BASE64_TEXT       RAW: Message text disguised using base64 encoding
 2.2 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
             [Blocked - see <http://www.spamcop.net/bl.shtml?200.208.226.249>]
 1.0 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [200.208.226.249 listed in sbl-xbl.spamhaus.org]
 0.0 FORGED_AOL_TAGS        AOL mailers can't send HTML in this format
 1.0 RCVD_DOUBLE_IP_SPAM    Bulk email fingerprint (double IP) found
 1.7 HTML_MIME_NO_HTML_TAG  HTML-only message, but there is no HTML tag
 1.1 UPPERCASE_50_75        message body is 50-75% uppercase
 1.2 MISSING_MIMEOLE        Message has X-MSMail-Priority, but no X-MimeOLE
 4.3 FORGED_MUA_AOL_FROM    Forged mail pretending to be from AOL (by From)
 1.8 FORGED_AOL_HTML        AOL can't send HTML message only
 0.1 MISSING_OUTLOOK_NAME   Message looks like Outlook, but isn't
Comment 10 Justin Mason 2004-05-04 13:26:04 UTC
Subject: Re:  base64 encoded html messages seem to confuse get_uri_list 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Daniel Quinlan writes:
> > So...  Anyone have thoughts about this?  Part of me is inclined to leave
> > things as they are, and the other part of me says we should emulate
> > Apple Mail here and make as much valid/visible text as possible.
> 
> Maybe we should start decoding where base64 appears to begin.  I think
> we should generally follow the common behavior, but this is one case
> where we should probably avoid doing the thing that doesn't let us catch the
> spam.

TBH, I don't know. IMO, it would be better to emulate what the common UAs
do.  Otherwise, we could run into a situation where a spammer can
craft a message that looks one way in common MUAs, but another way
to *us* (possibly just by having more "innocent" text after the
payload.)

- --j.

> Perhaps something like:
> 
>   first non-blank line:
>     if line is a legal MIME boundary
>       treat as a MIME boundary
>     anything else
>       treat as base64
>   all remaining lines:
>     treat as base64
> 
> Thankfully, "-" is not in base64.  However, that could open us up to
> some stupid spammer using a fake MIME boundary that is indeed decoded by
> some mailers that skip non-base64 characters (like "-") and produces a
> line of spam text.
> 
> So, a more robust technique would be:
> 
>   first non-blank line:
>     if line is a legal MIME boundary
>       decode line (raw, our MIME decoding routine simulates the more
>         common behavior of skipping non-base64 characters)
>       if decoded line is binary garbage
>         treat the line as a MIME boundary
>       else
>         treat it like base64
>     anything else
>       treat as base64
>   all remaining lines:
>     treat as base64
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAl/xDQTcbUG5Y7woRAg4KAJ4ohhmK8rpdsNn7VdSKf9wnkqPcBgCeINN4
6bkd37V4uT13b/YLxZjORQQ=
=b9fj
-----END PGP SIGNATURE-----

Comment 11 Daniel Quinlan 2004-05-04 13:35:11 UTC
Subject: Re:  base64 encoded html messages seem to confuse get_uri_list 

Justin Mason <jm@jmason.org> writes:

> TBH, I don't know. IMO, it would be better to emulate what the common
> UAs do.  Otherwise, we could run into a situation where a spammer can
> craft a message that looks one way in common MUAs, but another way to
> *us* (possibly just by having more "innocent" text after the payload.)

I think my second proposal is pretty robust in that respect.

Daniel

Comment 12 Sidney Markowitz 2004-05-04 14:32:00 UTC
Daniel wrote:
> I think we should generally follow the common behavior, but this
> is one case where we should probably avoid doing the thing that doesn't
> let us catch the spam

But following the common behavior doesn't miss the spam. In the 6.5 points I
counted I missed another 1.1 points for UPPERCASE_50_75. So that's 7.6 points
that this coding produces in the current SVN rules. The existing rules catch
it handily without us giving spammers a chance at figuring out a way of fooling
SpamAssassin when it decodes things that the most common MUAs don't. And why add
code for a rare case that is not leading to FNs?
Comment 13 Daniel Quinlan 2004-05-04 16:07:05 UTC
Subject: Re:  Dealing with invalid base64 encoded emails

> And why add code for a rare case that is not leading to FNs?

I suppose that works for me.

Comment 14 Theo Van Dinter 2004-05-10 20:59:47 UTC
ok, it looks like the consensus here is that the message is invalid and that,
with the exception of a few, MUAs see that and treat the message accordingly.