SA Bugzilla – Bug 1434
Look for multipart/alternative that isn't?
Last modified: 2003-05-27 02:59:54 UTC
I've been noticing spams that say they're multipart/alternative, but end up only having an HTML attachment. I haven't gotten around to writing a rule and testing, but thought we may at least want to try it for 2.60. :)
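For illustration only, here is a rough sketch of the condition described above, written against Python's standard `email` package rather than SpamAssassin's own Perl code. The function names and the sample messages are hypothetical, and "only having an HTML attachment" is read as an assumption to mean "every non-multipart part is text/html".

```python
from email import message_from_string


def leaf_content_types(msg):
    """Return the content types of all non-multipart (leaf) parts."""
    return [part.get_content_type()
            for part in msg.walk()
            if not part.is_multipart()]


def claims_alternative_but_html_only(raw_message):
    """True if the header says multipart/alternative but every body part is HTML."""
    msg = message_from_string(raw_message)
    if msg.get_content_type() != "multipart/alternative":
        return False
    leaves = leaf_content_types(msg)
    return bool(leaves) and all(ct == "text/html" for ct in leaves)


# Hypothetical sample: claims multipart/alternative, carries only an HTML part.
HTML_ONLY = (
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>Buy now</body></html>\n"
    "--XYZ--\n"
)

# Hypothetical sample: a genuine alternative with plain text and HTML.
REAL_ALTERNATIVE = (
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello there\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>Hello there</body></html>\n"
    "--XYZ--\n"
)

print(claims_alternative_but_html_only(HTML_ONLY))
print(claims_alternative_but_html_only(REAL_ALTERNATIVE))
```

Counting leaf parts, instead of looking at the first part only, keeps the check independent of how many HTML parts the message carries.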
Subject: Re: [SAdev] New: Look for multipart/alternative that isn't?

felicity@kluge.net writes:

> I've been noticing spams that say they're multipart/alternative, but
> end up only having an HTML attachment. I haven't gotten around to
> writing a rule and testing, but thought we may at least want to try it
> for 2.60. :)

Cool idea.

Doing a hand-test, I get 2/11938 hits for ham and 109/8142 hits for spam. That's an S/O ratio of 0.988 if my math is right.

My test was looking for multipart/alternative messages with only one Content-Type in the body. (None of my spam had zero Content-Type headers in the body.)

If I try testing for any multipart/ message, then the counts are 12/11938 for ham and 691/8142 for spam. That S/O ratio is also 0.988 (a small fraction better, actually).

Now, if I make that last test for only HTML (as you originally said), then I get a pretty good spam hit rate of 491/11938 messages with 0 ham hits. That test is:

- header contains Content-Type: multipart
- body contains only one HTML Content-Type

That's the same as your test except also allowing other multipart subtypes like "mixed".

I also tried some simple meta rules that are similar to the above:

  header __CTYPE_MULTIPART Content-Type =~ /multipart/i
  meta T_MULTIPART_HTML_ONLY1 (__CTYPE_MULTIPART && __MIME_HTML_ONLY)
  meta T_MULTIPART_HTML_ONLY2 (__CTYPE_MULTIPART && MIME_HTML_ONLY)

It looks like the new rule is better. I get about 25 extra spam hits for either, but also 3 ham hits instead of 0. The new rule is only a few lines long if we work it into _check_attachments. (Add a count for all body MIME parts; if that is equal to the mime_body_html_count and the header contains Content-Type =~ /multipart/i, then you've got it.)
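The S/O figures quoted in this comment can be re-checked with a few lines of Python. This is only a sketch: it assumes S/O means the spam hit rate divided by the sum of the spam and ham hit rates, each taken relative to its own corpus size, and the function name `so_ratio` is made up for the example.

```python
def so_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """Spam hit rate over (spam hit rate + ham hit rate); assumed S/O definition."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        raise ValueError("rule hit no messages in either corpus")
    return spam_rate / (spam_rate + ham_rate)


# Counts as reported in the comment: 8142 spam and 11938 ham messages.
alternative_only = so_ratio(109, 8142, 2, 11938)   # multipart/alternative test
any_multipart = so_ratio(691, 8142, 12, 11938)     # any multipart/ test

print(f"multipart/alternative only: {alternative_only:.3f}")
print(f"any multipart/:             {any_multipart:.3f}")
```

Under that definition both tests round to 0.988, with the any-multipart variant slightly higher, which matches the numbers given in the comment.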
Subject: Re: [SAdev] Look for multipart/alternative that isn't?

I think that another useful (but more difficult) rule would be one that checks whether the text/plain part is inadequately small compared to the text/html part. I get a lot of mails with a normal text/html part, but a text/plain part that has pretty much no information.

On Sun, Feb 02, 2003 at 04:56:15PM -0800, bugzilla-daemon@hughes-family.org wrote:
> http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1434
>
> ------- Additional Comments From quinlan@pathname.com 2003-02-02 16:56 -------
btw, quick data point: Apple's newsletters come with an empty text/plain part and a full text/html part, so that would be one FP.
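To make the size comparison and the false-positive concern concrete, here is a rough Python sketch using the standard `email` package. Everything in it is an assumption made for illustration: the function names, the `min_ratio` threshold of 0.05, and the sample messages are hypothetical and not taken from SpamAssassin.

```python
from email import message_from_string


def text_part_sizes(raw_message):
    """Total decoded size in bytes of the text/plain and text/html parts present."""
    sizes = {}
    for part in message_from_string(raw_message).walk():
        ctype = part.get_content_type()
        if part.is_multipart() or ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""
        sizes[ctype] = sizes.get(ctype, 0) + len(payload.strip())
    return sizes


def plain_part_too_short(raw_message, min_ratio=0.05):
    """True if a text/plain part exists but is tiny next to the text/html part."""
    sizes = text_part_sizes(raw_message)
    if "text/plain" not in sizes or not sizes.get("text/html"):
        return False
    return sizes["text/plain"] < min_ratio * sizes["text/html"]


def build_message(plain_text, html_text):
    """Assemble a hypothetical two-part multipart/alternative message."""
    return (
        "MIME-Version: 1.0\n"
        'Content-Type: multipart/alternative; boundary="XYZ"\n'
        "\n"
        "--XYZ\n"
        "Content-Type: text/plain\n"
        "\n"
        f"{plain_text}\n"
        "--XYZ\n"
        "Content-Type: text/html\n"
        "\n"
        f"{html_text}\n"
        "--XYZ--\n"
    )


BODY = "Great offers inside. " * 20
HTML = "<html><body>" + BODY + "</body></html>"

EMPTY_PLAIN = build_message("", HTML)   # empty text/plain, full text/html
BALANCED = build_message(BODY, HTML)    # both parts carry the content

print(plain_part_too_short(EMPTY_PLAIN))
print(plain_part_too_short(BALANCED))
```

A message shaped like the newsletter described above (empty text/plain, full text/html) would be flagged by this check, so a real rule would need testing against ham before getting a score.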
I might as well do this bug since I played with some tests before.
I expect to finish this within the next day or two. Changing to 2.60 milestone.
Two new rules are in CVS now: MIME_HTML_ONLY_MULTI and MIME_MULTIPART_SHORT.