Bug 1253 - Would like a "BODY_ONLY_HTML" test
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (Eval Tests)
Version: 2.43
Hardware: Other other
Importance: P2 enhancement
Target Milestone: ---
Assignee: Daniel Quinlan
 
Reported: 2002-12-05 19:00 UTC by John DuBois
Modified: 2002-12-11 13:46 UTC

Description John DuBois 2002-12-05 19:00:29 UTC
CTYPE_JUST_HTML catches mail that specifies an HTML content type in the header,
and I find the spam correlation of it for my mail stream to be high enough that
I've raised its score to 2.0.  But, it does not catch mail that has 
Content-Type: multipart/mixed while having only one part and that part having
Content-Type: text/html.
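
The case described above can be reproduced with a small sketch. This is not SpamAssassin code (which is Perl); it uses Python's standard `email` package, and the function names `header_is_html` and `only_part_is_html` are hypothetical, chosen only for illustration. It shows a header-only check missing a multipart/mixed message whose single part is text/html.

```python
from email import message_from_string

# A multipart/mixed message that contains exactly one part, and that part is HTML.
RAW = """\
From: sender@example.com
Subject: example
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="b1"

--b1
Content-Type: text/html

<html><body>Buy now</body></html>
--b1--
"""

def header_is_html(msg):
    # Header-only check: looks at the top-level Content-Type and nothing else.
    return msg.get_content_type() == "text/html"

def only_part_is_html(msg):
    # Body-aware check (hypothetical): the message has a single leaf part
    # and that part is text/html, whatever the top-level container says.
    leaves = [p for p in msg.walk() if not p.is_multipart()]
    return len(leaves) == 1 and leaves[0].get_content_type() == "text/html"

msg = message_from_string(RAW)
print(header_is_html(msg), only_part_is_html(msg))
```

The header-only check reports no match here because the top-level type is multipart/mixed, while the body-aware check does match.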
Comment 1 Theo Van Dinter 2002-12-07 15:33:07 UTC
Subject: Re: [SAdev]  New: Would like a "BODY_ONLY_HTML" test

On Thu, Dec 05, 2002 at 07:00:30PM -0800, bugzilla-daemon@hughes-family.org wrote:
> CTYPE_JUST_HTML catches mail that specifies an HTML content type in the header,
> and I find the spam correlation of it for my mail stream to be high enough that
> I've raised its score to 2.0.  But, it does not catch mail that has 
> Content-Type: multipart/mixed while having only one part and that part having
> Content-Type: text/html.

Oooh, very nice!   I did up a CTYPE_JUST_HTML replacement which does
the original plus the body, here are the results:

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  34220    12158    22062    0.355   0.00    0.00  (all messages)
100.000  35.5289  64.4711    0.355   0.00    0.00  (all messages as %)
 19.418  54.0056   0.3581    0.993   1.00    0.01  T_CTYPE_JUST_HTML
 16.774  46.5866   0.3445    0.993   0.00    1.00  CTYPE_JUST_HTML

I'll be committing it to CVS shortly. :)

BTW: I figured a replacement test would be more efficient (1 rule instead
of 2 -- the evaltest code only needed 1 extra line) ...  If in testing
there are a lot more FPs, it's easy enough to remove that line of code
and make it a body-version only.
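
For readers unfamiliar with the S/O column: the per-rule figures in the table above are consistent with S/O being the spam percentage divided by the sum of the spam and nonspam percentages. That formula is an assumption inferred from the numbers, not something stated in this thread; the sketch below (with a hypothetical helper `s_over_o`) simply checks it against the two rule rows.

```python
def s_over_o(spam_pct, nonspam_pct):
    # Assumed definition: share of a rule's hits that fall on spam,
    # computed from the SPAM% and NONSPAM% columns.
    total = spam_pct + nonspam_pct
    return spam_pct / total if total else 0.0

# (SPAM%, NONSPAM%) copied from the table in this comment.
rows = {
    "T_CTYPE_JUST_HTML": (54.0056, 0.3581),
    "CTYPE_JUST_HTML": (46.5866, 0.3445),
}

for name, (spam_pct, nonspam_pct) in rows.items():
    print(f"{name}: {s_over_o(spam_pct, nonspam_pct):.3f}")
```

Both rows come out at roughly 0.993, matching the S/O column.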
Comment 2 Daniel Quinlan 2002-12-08 01:34:03 UTC
I don't think I squeezed it into CVS in time for tonight's corpus runs, but
I wrote up Theo's test using _check_attachments so it should have a little
less overhead and comes out a few lines shorter.  This version also seems to
get a lot more hits.

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  12603     4911     7692    0.390   0.00    0.00  (all messages)
100.000  38.9669  61.0331    0.390   0.00    0.00  (all messages as %)
 21.725  55.6506   0.0650    0.999   1.00    0.01  T_CTYPE_JUST_HTML2
 20.559  52.6573   0.0650    0.999   1.00    0.01  T_CTYPE_JUST_HTML
 19.956  51.1098   0.0650    0.999   0.99    0.45  CTYPE_JUST_HTML
Comment 3 Theo Van Dinter 2002-12-08 08:46:37 UTC
Subject: Re: [SAdev]  Would like a "BODY_ONLY_HTML" test

On Sun, Dec 08, 2002 at 01:34:03AM -0800, bugzilla-daemon@hughes-family.org wrote:
> I don't think I squeezed it into CVS in time for tonight's corpus runs, but
> I wrote up Theo's test using _check_attachments so it should have a little
> less overhead and comes out a few lines shorter.  This version also seems to
> get a lot more hits.
> 
> OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
>   12603     4911     7692    0.390   0.00    0.00  (all messages)
> 100.000  38.9669  61.0331    0.390   0.00    0.00  (all messages as %)
>  21.725  55.6506   0.0650    0.999   1.00    0.01  T_CTYPE_JUST_HTML2
>  20.559  52.6573   0.0650    0.999   1.00    0.01  T_CTYPE_JUST_HTML
>  19.956  51.1098   0.0650    0.999   0.99    0.45  CTYPE_JUST_HTML

That's interesting actually ...  I forgot the case-insensitive 'i' for
the header ctype, so that didn't help.  The two rules act differently:
'HTML2' only checks for single mime part that is text/html, whereas
'HTML' checks that there are only text/html mime parts.

I guess there aren't a lot of spams with multiple text/html parts?

Do your results for T_CTYPE_JUST_HTML take the case-insensitivity flag
that you put on (thanks) into account, or was this pre-that fix?
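
The difference between the two rule variants described above can be sketched as follows. This is an illustration in Python, not the actual Perl eval tests; `single_html_part`, `all_parts_html`, and `build` are hypothetical names. The first function mirrors the "single MIME part that is text/html" behaviour attributed to 'HTML2', the second the "only text/html MIME parts" behaviour attributed to 'HTML'.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def leaf_types(msg):
    # Content types of all non-container parts. get_content_type() returns
    # a lowercased value, so "Text/HTML" in a header would still match,
    # which plays the same role as a case-insensitive flag on a regex.
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]

def single_html_part(msg):
    # Variant A: exactly one leaf part, and it is text/html.
    types = leaf_types(msg)
    return len(types) == 1 and types[0] == "text/html"

def all_parts_html(msg):
    # Variant B: one or more leaf parts, every one of them text/html.
    types = leaf_types(msg)
    return bool(types) and all(t == "text/html" for t in types)

def build(*subtypes):
    # Test helper: a multipart/mixed message with one text part per subtype.
    msg = MIMEMultipart("mixed")
    for subtype in subtypes:
        msg.attach(MIMEText("<p>hi</p>", subtype))
    return msg

for label, msg in [("one html", build("html")),
                   ("two html", build("html", "html")),
                   ("plain + html", build("plain", "html"))]:
    print(label, single_html_part(msg), all_parts_html(msg))
```

The two variants only disagree on messages with multiple text/html parts, which matches the observation that their hit rates differ only slightly.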
Comment 4 Daniel Quinlan 2002-12-08 11:51:29 UTC
Subject: Re: [SAdev]  Would like a "BODY_ONLY_HTML" test

felicity@kluge.net writes:

> That's interesting actually ...  I forgot the case-insensitive 'i' for
> the header ctype, so that didn't help.  The two rules act differently:
> 'HTML2' only checks for single mime part that is text/html, whereas
> 'HTML' checks that there are only text/html mime parts.
> 
> I guess there aren't a lot of spams with multiple text/html parts?

I could see multiple text/html MIME parts happening and it happens once
or twice in my corpus, so I'll modify the rule to do that.  (Done.)
 
> Do your results for T_CTYPE_JUST_HTML take the case-insensitivity flag
> that you put on (thanks) into account, or was this pre-that fix?

Yes, I ran my mass-check after making that change.

Comment 5 Theo Van Dinter 2002-12-08 13:10:05 UTC
Subject: Re: [SAdev]  Would like a "BODY_ONLY_HTML" test

On Sun, Dec 08, 2002 at 11:51:29AM -0800, bugzilla-daemon@hughes-family.org wrote:
> Yes, I ran my mass-check after making that change.

Interesting.  Ok, then I'll remove my rule and rename yours to be the
T_CTYPE_JUST_HTML then. :)  I like the cleaner version you have.
Comment 6 Daniel Quinlan 2002-12-09 17:36:37 UTC
changing owner to me
Comment 7 Daniel Quinlan 2002-12-11 22:46:50 UTC
This has been checked into CVS.  The revised CTYPE_JUST_HTML test is
named "MIME_HTML_ONLY".
Comment 8 Theo Van Dinter 2002-12-12 07:30:53 UTC
Subject: Re: [SAdev]  Would like a "BODY_ONLY_HTML" test

On Wed, Dec 11, 2002 at 10:46:51PM -0800, bugzilla-daemon@hughes-family.org wrote:
> This has been checked into CVS.  The revised CTYPE_JUST_HTML test is
> named "MIME_HTML_ONLY".

Since you've removed CTYPE_JUST_HTML, you probably want to remove the
other associated entries as well:

30_text_de.cf:lang de describe CTYPE_JUST_HTML          Reine HTML-Mail, ohne Textversion
30_text_fr.cf:lang fr describe CTYPE_JUST_HTML        Le corps du mél est uniquement en format HTML
30_text_it.cf:lang it describe CTYPE_JUST_HTML        Email unicamente in formato HTML, senza versione testuale
50_scores.cf:score CTYPE_JUST_HTML                0.453