Bug 1434 - Look for multipart/alternative that isn't?
Summary: Look for multipart/alternative that isn't?
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P2 enhancement
Target Milestone: 2.60
Assignee: Daniel Quinlan
 
Reported: 2003-02-02 14:54 UTC by Theo Van Dinter
Modified: 2003-05-27 02:59 UTC (History)



Description Theo Van Dinter 2003-02-02 14:54:17 UTC
I've been noticing spams that say they're multipart/alternative, but end up only having an HTML attachment.  I haven't gotten around to writing a rule and testing, but thought we may at least want to try it for 2.60. :)
Comment 1 Daniel Quinlan 2003-02-02 16:56:14 UTC
Subject: Re: [SAdev]  New: Look for multipart/alternative that isn't?

felicity@kluge.net writes:

> I've been noticing spams that say they're multipart/alternative, but
> end up only having an HTML attachment.  I haven't gotten around to
> writing a rule and testing, but thought we may at least want to try it
> for 2.60. :)

Cool idea.

Doing a hand-test, I get 2/11938 hits for ham and 109/8142 hits for
spam.  That's an S/O ratio of 0.988 if my math is right.

My test was looking for multipart/alternative messages with only one
Content-Type in the body.  (None of my spam had zero Content-Type
headers in the body.)

If I try testing for any multipart/ message, then the counts are
12/11938 for ham and 691/8142 for spam.  That S/O ratio is also
0.988 (a small fraction better, actually).
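
The two S/O figures above can be checked with a few lines of Python.  This is
only a sketch: it assumes S/O is computed from hit rates (spam rate divided by
spam rate plus ham rate) rather than from raw counts, since that normalized
form is the one that reproduces 0.988 for both tests.  The function name
so_ratio is made up for this example.

```python
def so_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """S/O from hit rates, so the unequal corpus sizes do not skew the result."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    return spam_rate / (spam_rate + ham_rate)

# multipart/alternative with only one Content-Type in the body
alternative_only = so_ratio(109, 8142, 2, 11938)
# any multipart/ message with only one Content-Type in the body
any_multipart = so_ratio(691, 8142, 12, 11938)

print(f"{alternative_only:.3f} {any_multipart:.3f}")  # prints "0.988 0.988"
```

Unrounded, the second value comes out slightly higher than the first, which
matches the "small fraction better" remark.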

Now, if I make that last test for only HTML (as you originally said),
then I get a pretty good spam hit rate of 491/8142 messages with 0 ham
hits.  That test is:

  - header contains Content-Type: multipart
  - body contains only one HTML Content-Type

That's the same as your test except also allowing other multipart
subtypes like "mixed".

I also tried some simple meta rules that are similar to the above:

header __CTYPE_MULTIPART	Content-Type =~ /multipart/i
meta T_MULTIPART_HTML_ONLY1	(__CTYPE_MULTIPART && __MIME_HTML_ONLY)
meta T_MULTIPART_HTML_ONLY2	(__CTYPE_MULTIPART && MIME_HTML_ONLY)

It looks like the new test is better than either meta rule.  Each meta
rule gets about 25 extra spam hits, but also 3 ham hits instead of 0.
The new test is only a few lines long if we work it into
_check_attachments: add a count for all body MIME parts, and if that is
equal to the mime_body_html_count and the header contains
Content-Type =~ /multipart/i, then you got it.
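
The counting logic described above can be sketched outside SpamAssassin with
Python's standard email package.  This is an illustration under assumptions,
not the code that went into _check_attachments: the function name
multipart_html_only is hypothetical, and "body MIME parts" is taken to mean
the non-container (leaf) parts of the message.

```python
from email import message_from_string

def multipart_html_only(raw_message: str) -> bool:
    """True if the header claims multipart/* but every body part is text/html."""
    msg = message_from_string(raw_message)
    # Header test: top-level Content-Type is some multipart subtype.
    if msg.get_content_maintype() != "multipart":
        return False
    # Count all leaf parts, then the subset that is text/html.
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    html = [part for part in leaves if part.get_content_type() == "text/html"]
    return len(leaves) > 0 and len(leaves) == len(html)
```

A message with a real text/plain alternative next to the HTML part does not
match, because the two counts differ.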

Comment 2 Eugene Miretsky 2003-02-03 07:18:22 UTC
Subject: Re: [SAdev]  Look for multipart/alternative that isn't?

I think that another useful (but more difficult) rule would be one that
checks whether the text/plain part is inadequately small compared to text/html.
I get a lot of mail with a normal text/html part where the text/plain part has
pretty much no information.
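
A rough sketch of such a size comparison, again in Python with only the
standard library.  Everything here is an assumption made for illustration: the
function names, the 0.1 threshold, and the choice to compare non-whitespace
character counts after stripping HTML tags.  It is not an existing
SpamAssassin rule.

```python
from email import message_from_string
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the markup itself."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def _non_space_length(text):
    return len("".join(text.split()))

def _visible_length(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return _non_space_length("".join(parser.chunks))

def plain_part_too_short(raw_message, min_ratio=0.1):
    """True if a text/plain part exists but is tiny next to the text/html part."""
    msg = message_from_string(raw_message)
    saw_plain = False
    plain_len = html_len = 0
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "ascii"
        text = payload.decode(charset, errors="replace")
        if part.get_content_type() == "text/plain":
            saw_plain = True
            plain_len += _non_space_length(text)
        elif part.get_content_type() == "text/html":
            html_len += _visible_length(text)
    return saw_plain and html_len > 0 and plain_len < min_ratio * html_len
```

Note that a check like this also fires on an empty text/plain part, so any
legitimate sender that ships one alongside full HTML would be a false positive.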

Comment 3 Justin Mason 2003-05-07 11:11:16 UTC
btw, quick data point: Apple's newsletters come with an empty text/plain part
and a full text/html part, so that would be one FP.
Comment 4 Daniel Quinlan 2003-05-25 17:21:55 UTC
I might as well do this bug since I played with some tests before.
Comment 5 Daniel Quinlan 2003-05-26 13:08:45 UTC
I expect to finish this within the next day or two.
Changing to 2.60 milestone.
Comment 6 Daniel Quinlan 2003-05-27 10:59:54 UTC
two new rules in CVS now

  MIME_HTML_ONLY_MULTI
  MIME_MULTIPART_SHORT