SA Bugzilla – Bug 1253
Would like a "BODY_ONLY_HTML" test
Last modified: 2002-12-11 13:46:50 UTC
CTYPE_JUST_HTML catches mail that specifies an HTML content type in the header, and I find the spam correlation of it for my mail stream to be high enough that I've raised its score to 2.0. But, it does not catch mail that has Content-Type: multipart/mixed while having only one part and that part having Content-Type: text/html.
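To make the gap concrete, here is a minimal sketch in Python (not SpamAssassin's Perl code) of a header-only check. The function name `header_says_html` and both sample messages are made up for illustration. The check is true for a plain text/html message but false once the same HTML is wrapped in a one-part multipart/mixed container.

```python
from email import message_from_string


def header_says_html(raw):
    """Header-only check: is the top-level Content-Type text/html?"""
    return message_from_string(raw).get_content_type() == "text/html"


# Illustrative message whose header declares text/html directly.
PLAIN_HTML = (
    "Content-Type: text/html\n"
    "\n"
    "<html><body>Buy now</body></html>\n"
)

# Illustrative message with the same HTML wrapped in a one-part multipart/mixed.
WRAPPED_HTML = (
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>Buy now</body></html>\n"
    "--XYZ--\n"
)

print(header_says_html(PLAIN_HTML))
print(header_says_html(WRAPPED_HTML))
```

A check that only reads the top-level header cannot see the second case, which is why the request is for a test that also looks at the MIME parts.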
Subject: Re: [SAdev] New: Would like a "BODY_ONLY_HTML" test

On Thu, Dec 05, 2002 at 07:00:30PM -0800, bugzilla-daemon@hughes-family.org wrote:
> CTYPE_JUST_HTML catches mail that specifies an HTML content type in the header,
> and I find the spam correlation of it for my mail stream to be high enough that
> I've raised its score to 2.0. But, it does not catch mail that has
> Content-Type: multipart/mixed while having only one part and that part having
> Content-Type: text/html.

Oooh, very nice! I did up a CTYPE_JUST_HTML replacement which does the original plus the body, here are the results:

OVERALL%    SPAM%  NONSPAM%    S/O   RANK  SCORE  NAME
   34220    12158     22062  0.355   0.00   0.00  (all messages)
 100.000  35.5289   64.4711  0.355   0.00   0.00  (all messages as %)
  19.418  54.0056    0.3581  0.993   1.00   0.01  T_CTYPE_JUST_HTML
  16.774  46.5866    0.3445  0.993   0.00   1.00  CTYPE_JUST_HTML

I'll be committing it to CVS shortly. :)

BTW: I figured a replacement test would be more efficient (1 rule instead of 2 -- the evaltest code only needed 1 extra line) ... If in testing there are a lot more FPs, it's easy enough to remove that line of code and make it a body-version only.
I don't think I squeezed it into CVS in time for tonight's corpus runs, but I wrote up Theo's test using _check_attachments so it should have a little less overhead and comes out a few lines shorter. This version also seems to get a lot more hits.

OVERALL%    SPAM%  NONSPAM%    S/O   RANK  SCORE  NAME
   12603     4911      7692  0.390   0.00   0.00  (all messages)
 100.000  38.9669   61.0331  0.390   0.00   0.00  (all messages as %)
  21.725  55.6506    0.0650  0.999   1.00   0.01  T_CTYPE_JUST_HTML2
  20.559  52.6573    0.0650  0.999   1.00   0.01  T_CTYPE_JUST_HTML
  19.956  51.1098    0.0650  0.999   0.99   0.45  CTYPE_JUST_HTML
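For readers checking the figures: the S/O column in these results appears to be SPAM% divided by (SPAM% + NONSPAM%). That formula is inferred from the numbers, not stated anywhere in this thread, so treat the sketch below as an assumption. The function name `s_o_ratio` is made up for illustration.

```python
def s_o_ratio(spam_pct, nonspam_pct):
    """Assumed S/O formula: share of a rule's hit rate that falls on spam."""
    return spam_pct / (spam_pct + nonspam_pct)


# Rows taken from the results table above: (name, SPAM%, NONSPAM%).
ROWS = [
    ("(all messages as %)", 38.9669, 61.0331),
    ("T_CTYPE_JUST_HTML2", 55.6506, 0.0650),
    ("T_CTYPE_JUST_HTML", 52.6573, 0.0650),
    ("CTYPE_JUST_HTML", 51.1098, 0.0650),
]

for name, spam_pct, nonspam_pct in ROWS:
    print(f"{s_o_ratio(spam_pct, nonspam_pct):.3f}  {name}")
```

Under that assumption, the three rules share the same rounded S/O because their NONSPAM% is identical and small, so the extra spam hits of the newer versions show up in the SPAM% column, not in S/O.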
Subject: Re: [SAdev] Would like a "BODY_ONLY_HTML" test

On Sun, Dec 08, 2002 at 01:34:03AM -0800, bugzilla-daemon@hughes-family.org wrote:
> I don't think I squeezed it into CVS in time for tonight's corpus runs, but
> I wrote up Theo's test using _check_attachments so it should have a little
> less overhead and comes out a few lines shorter. This version also seems to
> get a lot more hits.
>
> OVERALL%    SPAM%  NONSPAM%    S/O   RANK  SCORE  NAME
>    12603     4911      7692  0.390   0.00   0.00  (all messages)
>  100.000  38.9669   61.0331  0.390   0.00   0.00  (all messages as %)
>   21.725  55.6506    0.0650  0.999   1.00   0.01  T_CTYPE_JUST_HTML2
>   20.559  52.6573    0.0650  0.999   1.00   0.01  T_CTYPE_JUST_HTML
>   19.956  51.1098    0.0650  0.999   0.99   0.45  CTYPE_JUST_HTML

That's interesting actually ... I forgot the case-insensitive 'i' for the header ctype, so that didn't help. The two rules act differently: 'HTML2' only checks for single mime part that is text/html, whereas 'HTML' checks that there are only text/html mime parts.

I guess there aren't a lot of spams with multiple text/html parts?

Do your results for T_CTYPE_JUST_HTML take the case-insensitivity flag that you put on (thanks) into account, or was this pre-that fix?
Subject: Re: [SAdev] Would like a "BODY_ONLY_HTML" test

felicity@kluge.net writes:
> That's interesting actually ... I forgot the case-insensitive 'i' for
> the header ctype, so that didn't help. The two rules act differently:
> 'HTML2' only checks for single mime part that is text/html, whereas
> 'HTML' checks that there are only text/html mime parts.
>
> I guess there aren't a lot of spams with multiple text/html parts?

I could see multiple text/html MIME parts happening and it happens once or twice in my corpus, so I'll modify the rule to do that. (Done.)

> Do your results for T_CTYPE_JUST_HTML take the case-insensitivity flag
> that you put on (thanks) into account, or was this pre-that fix?

Yes, I ran my mass-check after making that change.
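The difference between the two variants discussed here can be sketched in Python as follows. This is an illustration only, not the Perl eval test that went into CVS. The names `leaf_types`, `single_html_part`, `only_html_parts`, and `build_multipart` are hypothetical. Python's `get_content_type()` lowercases the type, which stands in for the case-insensitive match mentioned above.

```python
from email import message_from_string


def leaf_types(raw):
    """Content types of all non-container MIME parts, lowercased."""
    msg = message_from_string(raw)
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


def single_html_part(raw):
    """'HTML2'-style check: exactly one part, and it is text/html."""
    return leaf_types(raw) == ["text/html"]


def only_html_parts(raw):
    """'HTML'-style check: one or more parts, all of them text/html."""
    types = leaf_types(raw)
    return bool(types) and all(t == "text/html" for t in types)


def build_multipart(part_types):
    """Hypothetical helper: raw multipart/mixed message, one part per type."""
    lines = ['Content-Type: multipart/mixed; boundary="XYZ"', ""]
    for ctype in part_types:
        lines += ["--XYZ", f"Content-Type: {ctype}", "", "body", ""]
    lines += ["--XYZ--", ""]
    return "\n".join(lines)


two_html = build_multipart(["text/html", "text/html"])
print(single_html_part(two_html), only_html_parts(two_html))
```

A message with two text/html parts passes the second check but not the first, which is the case the modified rule is meant to cover.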
Subject: Re: [SAdev] Would like a "BODY_ONLY_HTML" test

On Sun, Dec 08, 2002 at 11:51:29AM -0800, bugzilla-daemon@hughes-family.org wrote:
> Yes, I ran my mass-check after making that change.

Interesting. Ok, then I'll remove my rule and rename yours to be the T_CTYPE_JUST_HTML then. :) I like the cleaner version you have.
changing owner to me
This has been checked into CVS. The revised CTYPE_JUST_HTML test is named "MIME_HTML_ONLY".
Subject: Re: [SAdev] Would like a "BODY_ONLY_HTML" test

On Wed, Dec 11, 2002 at 10:46:51PM -0800, bugzilla-daemon@hughes-family.org wrote:
> This has been checked into CVS. The revised CTYPE_JUST_HTML test is
> named "MIME_HTML_ONLY".

Since you've removed CTYPE_JUST_HTML, you probably want to remove the other associated entries as well:

30_text_de.cf:lang de describe CTYPE_JUST_HTML Reine HTML-Mail, ohne Textversion
30_text_fr.cf:lang fr describe CTYPE_JUST_HTML Le corps du mhl est uniquement en format HTML
30_text_it.cf:lang it describe CTYPE_JUST_HTML Email unicamente in formato HTML, senza versione testuale
50_scores.cf:score CTYPE_JUST_HTML 0.453
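A small sketch of how leftover references to a removed rule could be located, written in Python and not part of the SpamAssassin tree. The function name `find_rule_references` and the file `70_example.cf` are hypothetical. The word-boundary match is there so that longer names such as T_CTYPE_JUST_HTML2 are not reported.

```python
import re


def find_rule_references(rule, files):
    """Return (filename, line_number, line) for each line naming `rule`.

    `files` maps a file name to its text. Rule names consist of word
    characters, so \\b keeps longer names that contain `rule` from matching.
    """
    pattern = re.compile(r"\b" + re.escape(rule) + r"\b")
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((name, lineno, line.strip()))
    return hits


RULE_FILES = {
    "30_text_de.cf": "lang de describe CTYPE_JUST_HTML Reine HTML-Mail, ohne Textversion\n",
    "50_scores.cf": "score CTYPE_JUST_HTML 0.453\n",
    # Made-up file, only to show that longer rule names are not matched.
    "70_example.cf": "describe T_CTYPE_JUST_HTML2 hypothetical test rule\n",
}

for name, lineno, line in find_rule_references("CTYPE_JUST_HTML", RULE_FILES):
    print(f"{name}:{lineno}: {line}")
```

In a real checkout, the dictionary would be filled by reading the rules directory from disk; that step is left out here to keep the sketch self-contained.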