Bug 1053 - IMG tag based rules
Summary: IMG tag based rules
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (Eval Tests)
Version: unspecified
Hardware: Other other
Importance: P2 enhancement
Target Milestone: ---
Assignee: Daniel Quinlan
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-10-03 21:37 UTC by Matthew Cline
Modified: 2002-12-17 09:14 UTC



Description Matthew Cline 2002-10-03 21:37:42 UTC
Inspired by complaints about all-image or mostly-image spam that's
getting by SA, I've cooked up three sets of rules that analyze the use
of IMG tags in HTML: one that looks at the total area of all of the
images in the message (T_HTML_IMAGE_AREA*), one that looks at the
total number of images in the message (T_HTML_NUM_IMGS*), and one that
looks at the longest total run of consecutive images
(T_HTML_CONSEC_IMG*).

===============

The total area of all images is rather easy to compute: inside of
HTML::html_tests(), if an IMG tag has both the width and height
properties, then multiply them together and add the result to the
running total.
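
A minimal sketch of that running total, written in Python purely for
illustration (the actual code lives in the Perl module HTML.pm and is not
shown in this bug). The names ImageAreaParser and total_image_area are
hypothetical, and this version assumes only plain integer pixel values are
counted; percentage values are skipped.

```python
import re
from html.parser import HTMLParser

# Illustrative only: not the SpamAssassin implementation.
PIXELS = re.compile(r"[0-9]+")


class ImageAreaParser(HTMLParser):
    """Sums width * height over IMG tags that carry both attributes."""

    def __init__(self):
        super().__init__()
        self.total_area = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        attr = dict(attrs)
        width, height = attr.get("width"), attr.get("height")
        # Skip tags missing either attribute, and non-integer values.
        if width and height and PIXELS.fullmatch(width) and PIXELS.fullmatch(height):
            self.total_area += int(width) * int(height)


def total_image_area(html):
    """Return the summed pixel area of all fully sized IMG tags."""
    parser = ImageAreaParser()
    parser.feed(html)
    parser.close()
    return parser.total_area


sample = '<img width="100" height="50"><IMG WIDTH=10 HEIGHT=10><img width="3">'
print(total_image_area(sample))
```

The third tag in the sample has no height, so it contributes nothing to the
total.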

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  15113     4797    10316    0.32    0.00    0.00  (all messages)
100.000   31.741   68.259    0.32    0.00    0.00  (all messages as %)
  0.635    2.001    0.000    1.00    0.81    0.01  T_HTML_IMAGE_AREA14
  0.417    1.313    0.000    1.00    0.78    0.01  T_HTML_IMAGE_AREA15
  0.331    1.042    0.000    1.00    0.76    0.01  T_HTML_IMAGE_AREA07
  0.245    0.771    0.000    1.00    0.74    0.01  T_HTML_IMAGE_AREA10
  0.238    0.750    0.000    1.00    0.74    0.01  T_HTML_IMAGE_AREA02
  0.225    0.709    0.000    1.00    0.74    0.01  T_HTML_IMAGE_AREA16
  0.126    0.396    0.000    1.00    0.70    0.01  T_HTML_IMAGE_AREA18
  0.119    0.375    0.000    1.00    0.70    0.01  T_HTML_IMAGE_AREA19
  0.119    0.375    0.000    1.00    0.70    0.01  T_HTML_IMAGE_AREA17
  1.125    3.523    0.010    1.00    0.68    0.01  T_HTML_IMAGE_AREA12
  0.741    2.314    0.010    1.00    0.65    0.01  T_HTML_IMAGE_AREA13
  1.542    4.732    0.058    0.99    0.58    0.01  T_HTML_IMAGE_AREA11
  0.139    0.417    0.010    0.98    0.54    0.01  T_HTML_IMAGE_AREA08
  0.483    1.397    0.058    0.96    0.50    0.01  T_HTML_IMAGE_AREA03
  0.192    0.500    0.048    0.91    0.44    0.01  T_HTML_IMAGE_AREA06
  0.820    1.834    0.349    0.84    0.39    0.01  T_HTML_IMAGE_AREA04
  0.946    2.022    0.446    0.82    0.38    0.01  T_HTML_IMAGE_AREA01
  0.569    0.896    0.417    0.68    0.32    0.01  T_HTML_IMAGE_AREA05
  6.498    0.500    9.287    0.05    0.02    0.01  T_HTML_IMAGE_AREA09

Spam % of all rules with S/O > 0.90: 20.615%
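
For readers checking the numbers: the S/O column in the table above is
consistent with SPAM% / (SPAM% + NONSPAM%), and the summary figure is
consistent with adding up the SPAM% of every rule whose S/O exceeds 0.90.
The Python sketch below works under that assumption; the function names
s_over_o and spam_pct_above are made up for this example.

```python
# (rule name, SPAM%, NONSPAM%) copied from the T_HTML_IMAGE_AREA table above.
ROWS = [
    ("T_HTML_IMAGE_AREA14", 2.001, 0.000),
    ("T_HTML_IMAGE_AREA15", 1.313, 0.000),
    ("T_HTML_IMAGE_AREA07", 1.042, 0.000),
    ("T_HTML_IMAGE_AREA10", 0.771, 0.000),
    ("T_HTML_IMAGE_AREA02", 0.750, 0.000),
    ("T_HTML_IMAGE_AREA16", 0.709, 0.000),
    ("T_HTML_IMAGE_AREA18", 0.396, 0.000),
    ("T_HTML_IMAGE_AREA19", 0.375, 0.000),
    ("T_HTML_IMAGE_AREA17", 0.375, 0.000),
    ("T_HTML_IMAGE_AREA12", 3.523, 0.010),
    ("T_HTML_IMAGE_AREA13", 2.314, 0.010),
    ("T_HTML_IMAGE_AREA11", 4.732, 0.058),
    ("T_HTML_IMAGE_AREA08", 0.417, 0.010),
    ("T_HTML_IMAGE_AREA03", 1.397, 0.058),
    ("T_HTML_IMAGE_AREA06", 0.500, 0.048),
    ("T_HTML_IMAGE_AREA04", 1.834, 0.349),
    ("T_HTML_IMAGE_AREA01", 2.022, 0.446),
    ("T_HTML_IMAGE_AREA05", 0.896, 0.417),
    ("T_HTML_IMAGE_AREA09", 0.500, 9.287),
]


def s_over_o(spam_pct, nonspam_pct):
    """Assumed S/O definition: share of a rule's hit rate that is spam."""
    total = spam_pct + nonspam_pct
    return spam_pct / total if total else 0.0


def spam_pct_above(rows, threshold):
    """Sum SPAM% over rules whose S/O is strictly above the threshold."""
    return sum(s for _, s, n in rows if s_over_o(s, n) > threshold)


for name, spam, nonspam in ROWS:
    print(f"{s_over_o(spam, nonspam):.2f}  {name}")
print(f"Spam % of all rules with S/O > 0.90: {spam_pct_above(ROWS, 0.90):.3f}%")
```

Note that the sum simply adds per-rule percentages, so a message that hits
more than one rule is counted more than once.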

=============================

The total number of IMG tags is really easy to do.

  0.648    2.043    0.000    1.00    0.81    0.01  T_HTML_NUM_IMGS08
  0.609    1.918    0.000    1.00    0.80    0.01  T_HTML_NUM_IMGS09
  0.490    1.543    0.000    1.00    0.79    0.01  T_HTML_NUM_IMGS10
  0.119    0.375    0.000    1.00    0.70    0.01  T_HTML_NUM_IMGS14
  0.986    3.064    0.019    0.99    0.63    0.01  T_HTML_NUM_IMGS06
  2.303    7.150    0.048    0.99    0.62    0.01  T_HTML_NUM_IMGS11
  0.033    0.104    0.000    1.00    0.61    0.01  T_HTML_NUM_IMGS17
  0.787    2.439    0.019    0.99    0.61    0.01  T_HTML_NUM_IMGS12
  0.344    1.063    0.010    0.99    0.60    0.01  T_HTML_NUM_IMGS13
  0.020    0.063    0.000    1.00    0.58    0.01  T_HTML_NUM_IMGS20
  0.020    0.063    0.000    1.00    0.58    0.01  T_HTML_NUM_IMGS16
  0.860    2.627    0.039    0.99    0.57    0.01  T_HTML_NUM_IMGS05
  0.754    2.293    0.039    0.98    0.56    0.01  T_HTML_NUM_IMGS07
  0.013    0.042    0.000    1.00    0.55    0.01  T_HTML_NUM_IMGS18
  0.887    2.627    0.078    0.97    0.52    0.01  T_HTML_NUM_IMGS04
  1.356    3.711    0.262    0.93    0.47    0.01  T_HTML_NUM_IMGS03
  0.046    0.125    0.010    0.93    0.46    0.01  T_HTML_NUM_IMGS15
  6.061   10.256    4.110    0.71    0.34    0.01  T_HTML_NUM_IMGS01
  0.040    0.063    0.029    0.68    0.32    0.01  T_HTML_NUM_IMGS19
  6.233    4.753    6.921    0.41    0.22    0.01  T_HTML_NUM_IMGS02

Spam % of all rules with S/O > 0.90: 31.25%

=========================

I figured that spam that is made up of only images is going to only
have IMG tags interspersed with table, paragraph and linebreak tags,
and some whitespace, so there would be a lot of IMG tags with no plain
text (non-whitespace) between them.  So I defined consecutive IMG tags
to be ones with no text between them, and looked at the longest run of
consecutive IMGs within a message.

This one seems to do pretty well, because in my non-spam corpus
there's only a handful of messages with IMG runs longer than two.
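
The counting and the "consecutive" definition above can be sketched roughly
as follows. This is a Python illustration, not the Perl code in HTML.pm; the
class name ImgRunParser is hypothetical, and it assumes that any
non-whitespace text between two IMG tags ends the current run while other
tags (table, paragraph, linebreak) do not.

```python
from html.parser import HTMLParser


class ImgRunParser(HTMLParser):
    """Counts IMG tags and tracks the longest run with no text in between."""

    def __init__(self):
        super().__init__()
        self.num_imgs = 0       # total IMG tags seen
        self.current_run = 0    # IMG tags since the last non-whitespace text
        self.longest_run = 0    # longest such run in the message

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.num_imgs += 1
            self.current_run += 1
            self.longest_run = max(self.longest_run, self.current_run)

    def handle_data(self, data):
        # Whitespace-only text does not break a run; real text does.
        if data.strip():
            self.current_run = 0


html = (
    "<p>Hi</p><img src=a><br><img src=b>\n"
    "<table><tr><td><img src=c></td></tr></table>text<img src=d>"
)
parser = ImgRunParser()
parser.feed(html)
parser.close()
print(parser.num_imgs, parser.longest_run)
```

In the sample, the first three images are separated only by tags and a
newline, so they form one run; the word "text" ends it before the fourth.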

  0.450    1.418    0.000    1.00    0.78    0.01  T_HTML_CONSEC_IMGS06
  0.232    0.730    0.000    1.00    0.74    0.01  T_HTML_CONSEC_IMGS08
  0.205    0.646    0.000    1.00    0.73    0.01  T_HTML_CONSEC_IMGS11
  1.813    5.691    0.010    1.00    0.71    0.01  T_HTML_CONSEC_IMGS02
  1.019    3.189    0.010    1.00    0.67    0.01  T_HTML_CONSEC_IMGS03
  0.768    2.397    0.010    1.00    0.66    0.01  T_HTML_CONSEC_IMGS05
  0.053    0.167    0.000    1.00    0.64    0.01  T_HTML_CONSEC_IMGS12
  1.006    3.127    0.019    0.99    0.63    0.01  T_HTML_CONSEC_IMGS04
  0.483    1.501    0.010    0.99    0.62    0.01  T_HTML_CONSEC_IMGS07
  0.020    0.063    0.000    1.00    0.58    0.01  T_HTML_CONSEC_IMGS13
  0.020    0.063    0.000    1.00    0.58    0.01  T_HTML_CONSEC_IMGS15
  1.032    3.148    0.048    0.98    0.57    0.01  T_HTML_CONSEC_IMGS10
  0.199    0.605    0.010    0.98    0.57    0.01  T_HTML_CONSEC_IMGS09
  0.013    0.042    0.000    1.00    0.55    0.01  T_HTML_CONSEC_IMGS17
  0.013    0.042    0.000    1.00    0.55    0.01  T_HTML_CONSEC_IMGS19
  0.007    0.021    0.000    1.00    0.51    0.01  T_HTML_CONSEC_IMGS14
  7.080    7.484    6.892    0.52    0.26    0.01  T_HTML_CONSEC_IMGS01
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_CONSEC_IMGS16
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_CONSEC_IMGS18

Spam % of all rules with S/O > 0.90: 22.85%

==========================

Next I'm going to see if there's any meta rules I can make that will
reduce the FP rate for low S/O rules.
Comment 1 Michael Moncur 2002-10-03 21:46:38 UTC
Looks great! One note: most "image-only" spam actually has some text (a few 
words at the top, a disclaimer at the end) so keep that in mind. The typical 
image spam that slips through SA seems to have a bit of text, one huge image, 
then a bit more.
Comment 2 Daniel Quinlan 2002-10-04 15:54:57 UTC
<daf> Argument "100%" isn't numeric in multiplication (*) at
/home/daf/cvs/spamassassin/masses/../lib/Mail/SpamAssassin/HTML.pm line 230.

The code doesn't seem to handle width and height attributes expressed as
percentages.  It should convert those to the equivalent pixel size for an 800x600
monitor (that should do the job).

Something like: (800 * (percent / 100)) or (600 * (percent / 100))
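
A hedged sketch of that conversion, in Python for illustration only (the real
fix belongs in HTML.pm). The helper name to_pixels and the two constants are
hypothetical; the 800x600 display size is the assumption stated above.

```python
import re

# Assumed display size, per the suggestion above.
ASSUMED_WIDTH = 800
ASSUMED_HEIGHT = 600

SIZE = re.compile(r"\s*([0-9]+)\s*(%?)\s*")


def to_pixels(value, full_size):
    """Convert '120' or '50%' to pixels; return None if unparseable."""
    match = SIZE.fullmatch(value)
    if match is None:
        return None
    number = int(match.group(1))
    if match.group(2):
        # Percentage: scale against the assumed display dimension.
        return full_size * number // 100
    return number


width = to_pixels("100%", ASSUMED_WIDTH)
height = to_pixels("50%", ASSUMED_HEIGHT)
print(width, height, width * height)
```

Returning None for unparseable values lets the caller skip the tag instead of
multiplying a non-numeric string.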
Comment 3 Craig Hughes 2002-10-05 00:54:33 UTC
Call me an optimization freak, but isn't that just

(8 * percent)   or   (6 * percent)?
Comment 4 Matthew Cline 2002-10-05 01:29:59 UTC
Here's my results for the width/height ratios, one set of rules for the
minimum ratio found and a second for the maximum ratio found:

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  15169     4849    10320    0.32    0.00    0.00  (all messages)
100.000   31.967   68.033    0.32    0.00    0.00  (all messages as %)
  0.349    1.093    0.000    1.00    0.77    0.01  T_HTML_MIN_IMG_RATIO4
  0.165    0.516    0.000    1.00    0.72    0.01  T_HTML_MIN_IMG_RATIO1
  0.877    2.598    0.068    0.97    0.53    0.01  T_HTML_MIN_IMG_RATIO5
  0.310    0.887    0.039    0.96    0.50    0.01  T_HTML_MIN_IMG_RATIO2
  0.270    0.763    0.039    0.95    0.49    0.01  T_HTML_MIN_IMG_RATIO3
  1.536    3.753    0.494    0.88    0.42    0.01  T_HTML_MIN_IMG_RATIO6


All of the min ratios from 0.0 to 0.75 have an S/O of 0.95 or greater.

  15169     4849    10320    0.32    0.00    0.00  (all messages)
100.000   31.967   68.033    0.32    0.00    0.00  (all messages as %)
  0.626    1.959    0.000    1.00    0.80    0.01  T_HTML_MAX_IMG_RATIO02
  0.521    1.629    0.000    1.00    0.79    0.01  T_HTML_MAX_IMG_RATIO05
  0.435    1.361    0.000    1.00    0.78    0.01  T_HTML_MAX_IMG_RATIO06
  0.171    0.536    0.000    1.00    0.72    0.01  T_HTML_MAX_IMG_RATIO09
  0.145    0.454    0.000    1.00    0.71    0.01  T_HTML_MAX_IMG_RATIO07
  1.147    3.568    0.010    1.00    0.68    0.01  T_HTML_MAX_IMG_RATIO04
  1.642    5.094    0.019    1.00    0.66    0.01  T_HTML_MAX_IMG_RATIO03B
  0.554    1.670    0.029    0.98    0.56    0.01  T_HTML_MAX_IMG_RATIO10
  0.125    0.371    0.010    0.97    0.53    0.01  T_HTML_MAX_IMG_RATIO08
  0.092    0.268    0.010    0.97    0.51    0.01  T_HTML_MAX_IMG_RATIO01
  1.872    5.094    0.359    0.93    0.47    0.01  T_HTML_MAX_IMG_RATIO03

It looks like all of these rules have good S/O's because the width/height
ratios of 1 to 5 aren't covered by any rule.

T_HTML_MAX_IMG_RATIO03B does a lot better than T_HTML_MAX_IMG_RATIO03 because
it excludes all of the FROM_EGROUP messages, which often have image ratios
in that range.
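
For reference, the per-message minimum and maximum width/height ratios could
be computed along these lines. This is a Python sketch with a hypothetical
function name (img_ratio_range); it assumes the sizes have already been
extracted as (width, height) pixel pairs and that images with a zero
dimension are ignored.

```python
def img_ratio_range(dimensions):
    """Return (min_ratio, max_ratio) of width/height, or None if no usable image."""
    ratios = [w / h for w, h in dimensions if w > 0 and h > 0]
    if not ratios:
        return None
    return min(ratios), max(ratios)


# Three images: a wide banner, a tall sidebar, and one with an unusable size.
print(img_ratio_range([(100, 50), (10, 40), (0, 5)]))
```

Each MIN or MAX rule would then test whether the corresponding value falls
inside that rule's range; the exact range boundaries are not given in this
bug.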
Comment 5 Matthew Cline 2002-10-05 01:41:45 UTC
Oh, and I made a meta rule to combine the # of images test with the
HTML percentage test, since I figured that spam made mostly of images will
also be mostly HTML.  So here's what I got:

  3.896   12.167    0.010    1.00    0.76    0.01  T_HTML_50_70_IMGS3
  9.566   27.655    1.066    0.96    0.51    0.31  HTML_50_70

(T_HTML_50_70_IMGS3 == HTML_50_70 + 3 or more images)

It greatly reduces the FP rate, while reducing the spam rate by 56%.
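
The combination and the quoted reduction can be checked with a small Python
sketch. The function name html_50_70_imgs3 is hypothetical and only mirrors
the description "HTML_50_70 + 3 or more images"; the real meta rule is
defined in the SpamAssassin rules files, not in Python.

```python
def html_50_70_imgs3(html_50_70_hit, num_imgs):
    """Hit only when HTML_50_70 hit and the message has 3 or more images."""
    return bool(html_50_70_hit) and num_imgs >= 3


def relative_reduction(before, after):
    """Fraction by which a hit rate dropped going from before to after."""
    return (before - after) / before


# SPAM% and NONSPAM% for HTML_50_70 and T_HTML_50_70_IMGS3, from the table above.
spam_drop = relative_reduction(27.655, 12.167)
nonspam_drop = relative_reduction(1.066, 0.010)
print(f"spam hits down {spam_drop:.0%}, nonspam hits down {nonspam_drop:.0%}")
```

So the meta rule gives up a bit more than half of the spam hits of
HTML_50_70 in exchange for removing nearly all of its nonspam hits.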
Comment 6 Justin Mason 2002-10-05 06:07:01 UTC
Subject: Re: [SAdev]  IMG tag based rules 


Here's my results from last night for the T_HTML_ rules. Note that
there were some checkins after my last cvs update though, so these
aren't the latest versions.  However I think the modifications
were (a) handling percentages (b) some new rules and (c) avoiding
that warning message, so I think these hits are still valid.

I'll cvs update now and rerun.

  0.431    0.555    0.000    1.00    0.83    0.01  T_HTML_IMAGE_AREA16
  0.136    0.175    0.000    1.00    0.75    0.01  T_HTML_CONSEC_IMGS13
  0.023    0.029    0.000    1.00    0.61    0.01  T_HTML_IMAGE_AREA19
  0.023    0.029    0.000    1.00    0.61    0.01  T_HTML_IMAGE_AREA18
  4.195    5.348    0.202    0.96    0.59    0.01  T_HTML_NUM_IMGS02B
  4.195    5.348    0.202    0.96    0.59    0.01  T_HTML_NUM_IMGS02
  3.469    4.267    0.709    0.86    0.47    0.01  T_HTML_NUM_IMGS03
  2.404    2.864    0.810    0.78    0.42    0.01  T_HTML_NUM_IMGS04
  5.193    6.108    2.024    0.75    0.41    0.01  T_HTML_CONSEC_IMGS07
  5.170    6.078    2.024    0.75    0.41    0.01  T_HTML_MIN_IMG_RATIO4
  4.263    4.939    1.923    0.72    0.39    0.01  T_HTML_IMAGE_AREA12
  3.855    4.442    1.822    0.71    0.39    0.01  T_HTML_CONSEC_IMGS04
  0.635    0.731    0.304    0.71    0.38    0.01  T_HTML_MAX_IMG_RATIO02
  0.635    0.731    0.304    0.71    0.38    0.01  T_HTML_MAX_IMG_RATIO02B
  4.036    4.617    2.024    0.70    0.38    0.01  T_HTML_NUM_IMGS13
  1.020    1.140    0.607    0.65    0.36    0.01  T_HTML_NUM_IMGS09
  4.127    4.559    2.632    0.63    0.35    0.01  T_HTML_NUM_IMGS01B
  4.127    4.559    2.632    0.63    0.35    0.01  T_HTML_NUM_IMGS01
  2.268    2.425    1.721    0.58    0.33    0.01  T_HTML_NUM_IMGS08
  1.565    1.666    1.215    0.58    0.33    0.01  T_HTML_MIN_IMG_RATIO1
  1.837    1.929    1.518    0.56    0.32    0.01  T_HTML_NUM_IMGS10
  1.111    1.140    1.012    0.53    0.31    0.01  T_HTML_CONSEC_IMGS06
  5.397    5.377    5.466    0.50    0.29    0.01  T_HTML_MAX_IMG_RATIO04
  4.354    4.267    4.656    0.48    0.28    0.01  T_HTML_CONSEC_IMGS03
 12.517   12.215   13.563    0.47    0.28    0.01  T_HTML_IMAGE_AREA11
 10.703   10.374   11.842    0.47    0.28    0.01  T_HTML_NUM_IMGS11
  1.088    1.052    1.215    0.46    0.28    0.01  T_HTML_IMAGE_AREA13
  9.705    9.264   11.235    0.45    0.27    0.01  T_HTML_MAX_IMG_RATIO10
  0.680    0.643    0.810    0.44    0.27    0.01  T_HTML_CONSEC_IMGS08
  2.562    2.367    3.239    0.42    0.26    0.01  T_HTML_IMAGE_AREA10
  2.834    2.601    3.644    0.42    0.26    0.01  T_HTML_MIN_IMG_RATIO5
 11.156   10.111   14.777    0.41    0.25    0.01  T_HTML_IMAGE_AREA01
  0.998    0.877    1.417    0.38    0.24    0.01  T_HTML_MAX_IMG_RATIO06
  4.943    4.237    7.389    0.36    0.23    0.01  T_HTML_CONSEC_IMGS01
  4.943    4.237    7.389    0.36    0.23    0.01  T_HTML_CONSEC_IMGS01B
  0.839    0.701    1.316    0.35    0.22    0.01  T_HTML_CONSEC_IMGS09
  1.927    1.607    3.036    0.35    0.22    0.01  T_HTML_NUM_IMGS05
  2.766    2.279    4.453    0.34    0.22    0.01  T_HTML_NUM_IMGS12
  1.406    1.110    2.429    0.31    0.20    0.01  T_HTML_NUM_IMGS06
  2.449    1.929    4.251    0.31    0.20    0.01  T_HTML_MIN_IMG_RATIO6
  0.113    0.088    0.202    0.30    0.20    0.01  T_HTML_CONSEC_IMGS12
  7.778    5.932   14.170    0.30    0.19    0.01  T_HTML_CONSEC_IMGS02
  2.132    1.461    4.453    0.25    0.17    0.01  T_HTML_CONSEC_IMGS05
  7.029    4.793   14.777    0.24    0.16    0.01  T_HTML_MAX_IMG_RATIO03
  7.029    4.793   14.777    0.24    0.16    0.01  T_HTML_MAX_IMG_RATIO03B
  1.519    1.023    3.239    0.24    0.16    0.01  T_HTML_CONSEC_IMGS11
  5.714    3.624   12.955    0.22    0.15    0.01  T_HTML_MIN_IMG_RATIO3
  3.560    2.133    8.502    0.20    0.13    0.01  T_HTML_IMAGE_AREA04B
  3.560    2.133    8.502    0.20    0.13    0.01  T_HTML_IMAGE_AREA04
  0.544    0.321    1.316    0.20    0.13    0.01  T_HTML_IMAGE_AREA09
  1.633    0.964    3.947    0.20    0.13    0.01  T_HTML_IMAGE_AREA03B
  1.633    0.964    3.947    0.20    0.13    0.01  T_HTML_IMAGE_AREA03
  0.340    0.175    0.911    0.16    0.11    0.01  T_HTML_IMAGE_AREA14
  0.408    0.205    1.113    0.16    0.10    0.01  T_HTML_IMAGE_AREA07
  3.560    1.753    9.818    0.15    0.10    0.01  T_HTML_NUM_IMGS07
 12.404    6.078   34.312    0.15    0.10    0.01  T_HTML_50_70_IMGS3
  1.315    0.643    3.644    0.15    0.10    0.01  T_HTML_IMAGE_AREA02
  6.054    2.805   17.308    0.14    0.09    0.01  T_HTML_MAX_IMG_RATIO05
  2.381    1.052    6.984    0.13    0.08    0.01  T_HTML_IMAGE_AREA05B
  2.381    1.052    6.984    0.13    0.08    0.01  T_HTML_IMAGE_AREA05
  1.270    0.555    3.745    0.13    0.08    0.01  T_HTML_NUM_IMGS15
  0.068    0.029    0.202    0.13    0.08    0.01  T_HTML_IMAGE_AREA15
  1.270    0.526    3.846    0.12    0.07    0.01  T_HTML_IMAGE_AREA06B
  1.270    0.526    3.846    0.12    0.07    0.01  T_HTML_IMAGE_AREA06
 11.633    4.793   35.324    0.12    0.07    0.01  T_HTML_CONSEC_IMGS10
 10.091    3.916   31.478    0.11    0.06    0.01  T_HTML_MIN_IMG_RATIO2
  1.406    0.409    4.858    0.08    0.04    0.01  T_HTML_NUM_IMGS16
  1.338    0.380    4.656    0.08    0.04    0.01  T_HTML_NUM_IMGS14
  2.494    0.672    8.806    0.07    0.03    0.01  T_HTML_IMAGE_AREA08
  0.748    0.175    2.733    0.06    0.02    0.01  T_HTML_MAX_IMG_RATIO07
  1.905    0.351    7.287    0.05    0.01    0.01  T_HTML_MAX_IMG_RATIO09
  1.043    0.175    4.049    0.04    0.01    0.01  T_HTML_MAX_IMG_RATIO08
  0.703    0.088    2.834    0.03    0.01    0.01  T_HTML_MAX_IMG_RATIO01
  1.633    0.175    6.680    0.03    0.00    0.01  T_HTML_NUM_IMGS17
  0.998    0.088    4.150    0.02    0.00    0.01  T_HTML_NUM_IMGS18
  3.469    0.029   15.385    0.00    0.00    0.01  T_HTML_NUM_IMGS20
  0.408    0.000    1.822    0.00    0.00    0.01  T_HTML_NUM_IMGS19


Comment 7 Daniel Quinlan 2002-10-06 19:27:10 UTC
One request: could you name all of these rules T_HTML_IMG_* ?  I find that
hierarchical naming helps make it easier to compare similar rules.

Here are my current results for your IMG rules.  I added T_HTML_MESSAGE so
we can get baseline "how spammy is HTML in general" control numbers for
our corpuses (since they differ).  We really want all of the HTML rules to
have a significantly better S/O than T_HTML_MESSAGE.

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  12157     4462     7695    0.37    0.00    0.00  (all messages)
100.000   36.703   63.297    0.37    0.00    0.00  (all messages as %)
  2.056    5.603    0.000    1.00    0.90    0.01  T_HTML_CONSEC_IMGS01B
  2.056    5.603    0.000    1.00    0.90    0.01  T_HTML_CONSEC_IMGS01
  1.925    5.244    0.000    1.00    0.89    0.01  T_HTML_IMAGE_AREA01
  1.012    2.757    0.000    1.00    0.85    0.01  T_HTML_NUM_IMGS03
  0.888    2.420    0.000    1.00    0.84    0.01  T_HTML_CONSEC_IMGS03
  0.806    2.196    0.000    1.00    0.84    0.01  T_HTML_IMAGE_AREA12
  0.790    2.152    0.000    1.00    0.84    0.01  T_HTML_IMAGE_AREA06
  0.790    2.152    0.000    1.00    0.84    0.01  T_HTML_IMAGE_AREA06B
  0.740    2.017    0.000    1.00    0.83    0.01  T_HTML_MIN_IMG_RATIO5
  0.675    1.838    0.000    1.00    0.82    0.01  T_HTML_NUM_IMGS07
  0.633    1.726    0.000    1.00    0.82    0.01  T_HTML_NUM_IMGS12
  0.535    1.457    0.000    1.00    0.81    0.01  T_HTML_CONSEC_IMGS09
  0.461    1.255    0.000    1.00    0.80    0.01  T_HTML_MAX_IMG_RATIO02B
  0.461    1.255    0.000    1.00    0.80    0.01  T_HTML_MAX_IMG_RATIO02
  0.461    1.255    0.000    1.00    0.80    0.01  T_HTML_IMAGE_AREA07
  0.420    1.143    0.000    1.00    0.79    0.01  T_HTML_NUM_IMGS08
  0.395    1.076    0.000    1.00    0.79    0.01  T_HTML_CONSEC_IMGS05
  0.387    1.053    0.000    1.00    0.79    0.01  T_HTML_MAX_IMG_RATIO10
  0.387    1.053    0.000    1.00    0.79    0.01  T_HTML_IMAGE_AREA10
  0.321    0.874    0.000    1.00    0.77    0.01  T_HTML_MIN_IMG_RATIO4
  0.222    0.605    0.000    1.00    0.75    0.01  T_HTML_MAX_IMG_RATIO06
  0.214    0.583    0.000    1.00    0.75    0.01  T_HTML_IMAGE_AREA08
  0.148    0.403    0.000    1.00    0.72    0.01  T_HTML_NUM_IMGS13
  0.148    0.403    0.000    1.00    0.72    0.01  T_HTML_MIN_IMG_RATIO1
  0.123    0.336    0.000    1.00    0.71    0.01  T_HTML_CONSEC_IMGS08
  0.123    0.336    0.000    1.00    0.71    0.01  T_HTML_IMAGE_AREA13
  0.107    0.291    0.000    1.00    0.70    0.01  T_HTML_IMAGE_AREA14
  0.099    0.269    0.000    1.00    0.70    0.01  T_HTML_CONSEC_IMGS11
  3.150    8.539    0.026    1.00    0.69    0.01  T_HTML_NUM_IMGS01B
  1.563    4.236    0.013    1.00    0.69    0.01  T_HTML_MAX_IMG_RATIO03B
  0.090    0.247    0.000    1.00    0.69    0.01  T_HTML_MAX_IMG_RATIO07
  2.163    5.849    0.026    1.00    0.67    0.01  T_HTML_NUM_IMGS11
  0.058    0.157    0.000    1.00    0.66    0.01  T_HTML_IMAGE_AREA15
  0.847    2.286    0.013    0.99    0.65    0.01  T_HTML_MIN_IMG_RATIO6
  0.049    0.134    0.000    1.00    0.65    0.01  T_HTML_IMAGE_AREA16
  0.049    0.134    0.000    1.00    0.65    0.01  T_HTML_MAX_IMG_RATIO01
  2.139    5.760    0.039    0.99    0.64    0.01  T_HTML_CONSEC_IMGS02
  0.699    1.883    0.013    0.99    0.64    0.01  T_HTML_NUM_IMGS05
  0.041    0.112    0.000    1.00    0.64    0.01  T_HTML_CONSEC_IMGS12
  0.642    1.726    0.013    0.99    0.63    0.01  T_HTML_IMAGE_AREA04B
  0.033    0.090    0.000    1.00    0.62    0.01  T_HTML_NUM_IMGS15
  0.033    0.090    0.000    1.00    0.62    0.01  T_HTML_MAX_IMG_RATIO08
  0.033    0.090    0.000    1.00    0.62    0.01  T_HTML_CONSEC_IMGS13
  1.949    5.222    0.052    0.99    0.61    0.01  T_HTML_IMAGE_AREA11
  3.192    8.539    0.091    0.99    0.61    0.01  T_HTML_NUM_IMGS01
  0.025    0.067    0.000    1.00    0.60    0.01  T_HTML_NUM_IMGS16
  1.579    4.213    0.052    0.99    0.60    0.01  T_HTML_NUM_IMGS02B
  0.979    2.600    0.039    0.99    0.59    0.01  T_HTML_NUM_IMGS04
  0.650    1.726    0.026    0.99    0.59    0.01  T_HTML_IMAGE_AREA04
  0.642    1.703    0.026    0.98    0.59    0.01  T_HTML_IMAGE_AREA05
  0.642    1.703    0.026    0.98    0.59    0.01  T_HTML_IMAGE_AREA05B
  1.588    4.213    0.065    0.98    0.59    0.01  T_HTML_NUM_IMGS02
  0.313    0.829    0.013    0.98    0.58    0.01  T_HTML_MAX_IMG_RATIO05
  0.864    2.286    0.039    0.98    0.58    0.01  T_HTML_MAX_IMG_RATIO04
  0.568    1.502    0.026    0.98    0.58    0.01  T_HTML_CONSEC_IMGS10
  0.016    0.045    0.000    1.00    0.58    0.01  T_HTML_NUM_IMGS20
  0.535    1.412    0.026    0.98    0.57    0.01  T_HTML_CONSEC_IMGS04
  1.612    4.236    0.091    0.98    0.56    0.01  T_HTML_MAX_IMG_RATIO03
  1.004    2.622    0.065    0.98    0.55    0.01  T_HTML_NUM_IMGS06
  0.313    0.807    0.026    0.97    0.53    0.01  T_HTML_NUM_IMGS10
  0.008    0.022    0.000    1.00    0.53    0.01  T_HTML_IMAGE_AREA17
  0.008    0.022    0.000    1.00    0.53    0.01  T_HTML_CONSEC_IMGS14
  0.008    0.022    0.000    1.00    0.53    0.01  T_HTML_NUM_IMGS19
  0.230    0.583    0.026    0.96    0.51    0.01  T_HTML_CONSEC_IMGS07
 28.000   70.731    3.223    0.96    0.51    0.00  T_HTML_MESSAGE
  0.107    0.269    0.013    0.95    0.51    0.01  T_HTML_IMAGE_AREA02
  0.280    0.695    0.039    0.95    0.50    0.01  T_HTML_MIN_IMG_RATIO2
  0.082    0.202    0.013    0.94    0.49    0.01  T_HTML_NUM_IMGS14
  0.469    1.121    0.091    0.92    0.47    0.01  T_HTML_MIN_IMG_RATIO3
  0.403    0.941    0.091    0.91    0.46    0.01  T_HTML_IMAGE_AREA03
  0.403    0.941    0.091    0.91    0.46    0.01  T_HTML_IMAGE_AREA03B
  0.395    0.919    0.091    0.91    0.46    0.01  T_HTML_NUM_IMGS09
  0.304    0.672    0.091    0.88    0.43    0.01  T_HTML_CONSEC_IMGS06
  0.271    0.538    0.117    0.82    0.40    0.01  T_HTML_IMAGE_AREA09
  0.033    0.045    0.026    0.63    0.31    0.01  T_HTML_MAX_IMG_RATIO09
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_CONSEC_IMGS19
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_CONSEC_IMGS18
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_CONSEC_IMGS17
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_CONSEC_IMGS16
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_CONSEC_IMGS15
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_IMAGE_AREA18
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_IMAGE_AREA19
  0.008    0.000    0.013    0.00    0.00    0.01  T_HTML_NUM_IMGS17
  0.000    0.000    0.000    0.00    0.00    0.01  T_HTML_NUM_IMGS18
Comment 8 Daniel Quinlan 2002-10-07 20:31:43 UTC
assigning bug
Comment 9 Matthew Cline 2002-11-20 23:57:28 UTC
This doesn't seem to be working too well for other people; shall I remove this
from CVS and close the bug WONTFIX?
Comment 10 Daniel Quinlan 2002-12-08 23:56:26 UTC
> This doesn't seem to be working too well for other people; shall I remove this
> from CVS and close the bug WONTFIX?

I went through the nightly runs and added comments for all of these tests.
To summarize, I don't mind if you remove them, but please take a look at
my comments in 70_cvs_rules_under_test.cf first.  One or two sets of tests
look like they are worth some further work, especially T_HTML_IMAGE_AREA14
and higher.  The rest can probably go.

I also looked at all of the width and height attributes in my spam.  It looks
like 20% of them are specified using a percentage instead of a fixed value.
It might be worth guesstimating those.  I'll try it out.

If anyone other than Matt removes any of these, please make sure you also get
the code from HTML.pm.
Comment 11 Daniel Quinlan 2002-12-09 00:42:13 UTC
Okay, I made the percent change to T_HTML_IMAGE_AREA_* and it seems to improve
the results a tiny bit (without any upward movement for nonspam), so I checked
it in.  Here's the relative change, before to after (so positive is an
increase), out of 3504 HTML spam, which originally had 241 hits for
T_HTML_IMAGE_AREA01.

18      T_HTML_IMAGE_AREA08
18      T_HTML_MIN_IMG_RATIO4
1       T_HTML_IMAGE_AREA01
1       T_HTML_MAX_IMG_RATIO04
1       T_HTML_MAX_IMG_RATIO05
1       T_HTML_MAX_IMG_RATIO06
-2      T_HTML_MAX_IMG_RATIO03
-2      T_HTML_MAX_IMG_RATIO03B
-18     T_HTML_IMAGE_AREA05
-18     T_HTML_IMAGE_AREA05B

I suggest removing all of the other IMAGE stuff in that block except for
T_HTML_IMAGE_AREA and T_IMAGE_ONLY_* (which is in a separate block of the
file) ... along with the related code in HTML.pm.
Comment 12 Rod Begbie 2002-12-09 06:30:31 UTC
My nightly run gave a load of these errors tonight:

Argument "100%" isn't numeric in multiplication (*) at
/home/rod/build/sanightly/spamassassin/masses/../lib/Mail/SpamAssassin/HTML.pm
line 290.

Might be related to changes to this bug.
Comment 13 Daniel Quinlan 2002-12-09 16:19:40 UTC
Subject: Re: [SAdev]  IMG tag based rules

rOD-spamassassin@arsecandle.org writes:

> My nightly run gave a load of these errors tonight:
> 
> Argument "100%" isn't numeric in multiplication (*) at
> /home/rod/build/sanightly/spamassassin/masses/../lib/Mail/SpamAssassin/HTML.pm
> line 290.
> 
> Might be related to changes to this bug.

Thanks, it was a silly mistake on my part (now fixed).  The code still
worked (my results don't change with the fix).  Perl happens to do what
I wanted:

------- start of cut text --------------
$ perl -e 'use warnings; use strict; my $x = "100%"; my $y = 8; print $x * $y . "\n"'
Argument "100%" isn't numeric in multiplication (*) at -e line 1.
800
------- end ----------------------------

Comment 14 Daniel Quinlan 2002-12-13 18:03:53 UTC
Finishing these up, it looks like T_HTML_IMAGE_AREA* will be kept since it
works pretty well; I'm trying to find where the S/O ratio starts being really good.

The rest of the tests are going away.
Comment 15 Daniel Quinlan 2002-12-17 18:14:11 UTC
Done, promoted the rules for a total image area of 400000 square pixels and upwards.