SA Bugzilla – Bug 1053
IMG tag based rules
Last modified: 2002-12-17 09:14:11 UTC
Inspired by complaints about all-image or mostly-image spam that's getting by SA, I've cooked up three sets of rules that analyze the use of IMG tags in HTML: one that looks at the total area of all of the images in the message (T_HTML_IMAGE_AREA*), one that looks at the total number of images in the message (T_HTML_NUM_IMGS*), and one that looks at the longest run of consecutive images (T_HTML_CONSEC_IMGS*).

===============

The total area of all images is rather easy to compute: inside of HTML::html_tests(), if an IMG tag has both the width and height properties, then multiply them together and add the result to the running total.

OVERALL%   SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   15113    4797     10316  0.32  0.00   0.00  (all messages)
 100.000  31.741    68.259  0.32  0.00   0.00  (all messages as %)
   0.635   2.001     0.000  1.00  0.81   0.01  T_HTML_IMAGE_AREA14
   0.417   1.313     0.000  1.00  0.78   0.01  T_HTML_IMAGE_AREA15
   0.331   1.042     0.000  1.00  0.76   0.01  T_HTML_IMAGE_AREA07
   0.245   0.771     0.000  1.00  0.74   0.01  T_HTML_IMAGE_AREA10
   0.238   0.750     0.000  1.00  0.74   0.01  T_HTML_IMAGE_AREA02
   0.225   0.709     0.000  1.00  0.74   0.01  T_HTML_IMAGE_AREA16
   0.126   0.396     0.000  1.00  0.70   0.01  T_HTML_IMAGE_AREA18
   0.119   0.375     0.000  1.00  0.70   0.01  T_HTML_IMAGE_AREA19
   0.119   0.375     0.000  1.00  0.70   0.01  T_HTML_IMAGE_AREA17
   1.125   3.523     0.010  1.00  0.68   0.01  T_HTML_IMAGE_AREA12
   0.741   2.314     0.010  1.00  0.65   0.01  T_HTML_IMAGE_AREA13
   1.542   4.732     0.058  0.99  0.58   0.01  T_HTML_IMAGE_AREA11
   0.139   0.417     0.010  0.98  0.54   0.01  T_HTML_IMAGE_AREA08
   0.483   1.397     0.058  0.96  0.50   0.01  T_HTML_IMAGE_AREA03
   0.192   0.500     0.048  0.91  0.44   0.01  T_HTML_IMAGE_AREA06
   0.820   1.834     0.349  0.84  0.39   0.01  T_HTML_IMAGE_AREA04
   0.946   2.022     0.446  0.82  0.38   0.01  T_HTML_IMAGE_AREA01
   0.569   0.896     0.417  0.68  0.32   0.01  T_HTML_IMAGE_AREA05
   6.498   0.500     9.287  0.05  0.02   0.01  T_HTML_IMAGE_AREA09

Spam % of all rules with S/O > 0.90: 20.615%

=============================

The total number of IMG tags is really easy to do.
   0.648   2.043     0.000  1.00  0.81   0.01  T_HTML_NUM_IMGS08
   0.609   1.918     0.000  1.00  0.80   0.01  T_HTML_NUM_IMGS09
   0.490   1.543     0.000  1.00  0.79   0.01  T_HTML_NUM_IMGS10
   0.119   0.375     0.000  1.00  0.70   0.01  T_HTML_NUM_IMGS14
   0.986   3.064     0.019  0.99  0.63   0.01  T_HTML_NUM_IMGS06
   2.303   7.150     0.048  0.99  0.62   0.01  T_HTML_NUM_IMGS11
   0.033   0.104     0.000  1.00  0.61   0.01  T_HTML_NUM_IMGS17
   0.787   2.439     0.019  0.99  0.61   0.01  T_HTML_NUM_IMGS12
   0.344   1.063     0.010  0.99  0.60   0.01  T_HTML_NUM_IMGS13
   0.020   0.063     0.000  1.00  0.58   0.01  T_HTML_NUM_IMGS20
   0.020   0.063     0.000  1.00  0.58   0.01  T_HTML_NUM_IMGS16
   0.860   2.627     0.039  0.99  0.57   0.01  T_HTML_NUM_IMGS05
   0.754   2.293     0.039  0.98  0.56   0.01  T_HTML_NUM_IMGS07
   0.013   0.042     0.000  1.00  0.55   0.01  T_HTML_NUM_IMGS18
   0.887   2.627     0.078  0.97  0.52   0.01  T_HTML_NUM_IMGS04
   1.356   3.711     0.262  0.93  0.47   0.01  T_HTML_NUM_IMGS03
   0.046   0.125     0.010  0.93  0.46   0.01  T_HTML_NUM_IMGS15
   6.061  10.256     4.110  0.71  0.34   0.01  T_HTML_NUM_IMGS01
   0.040   0.063     0.029  0.68  0.32   0.01  T_HTML_NUM_IMGS19
   6.233   4.753     6.921  0.41  0.22   0.01  T_HTML_NUM_IMGS02

Spam % of all rules with S/O > 0.90: 31.25%

=========================

I figured that spam that is made up of only images is going to only have IMG tags interspersed with table, paragraph and linebreak tags, and some whitespace, so there would be a lot of IMG tags with no plain text (non-whitespace) between them. So I defined consecutive IMG tags to be ones with no text between them, and looked at the longest run of consecutive IMGs within a message. This one seems to do pretty well, because in my non-spam corpus there's only a handful of messages with IMG runs larger than two.
   0.450   1.418     0.000  1.00  0.78   0.01  T_HTML_CONSEC_IMGS06
   0.232   0.730     0.000  1.00  0.74   0.01  T_HTML_CONSEC_IMGS08
   0.205   0.646     0.000  1.00  0.73   0.01  T_HTML_CONSEC_IMGS11
   1.813   5.691     0.010  1.00  0.71   0.01  T_HTML_CONSEC_IMGS02
   1.019   3.189     0.010  1.00  0.67   0.01  T_HTML_CONSEC_IMGS03
   0.768   2.397     0.010  1.00  0.66   0.01  T_HTML_CONSEC_IMGS05
   0.053   0.167     0.000  1.00  0.64   0.01  T_HTML_CONSEC_IMGS12
   1.006   3.127     0.019  0.99  0.63   0.01  T_HTML_CONSEC_IMGS04
   0.483   1.501     0.010  0.99  0.62   0.01  T_HTML_CONSEC_IMGS07
   0.020   0.063     0.000  1.00  0.58   0.01  T_HTML_CONSEC_IMGS13
   0.020   0.063     0.000  1.00  0.58   0.01  T_HTML_CONSEC_IMGS15
   1.032   3.148     0.048  0.98  0.57   0.01  T_HTML_CONSEC_IMGS10
   0.199   0.605     0.010  0.98  0.57   0.01  T_HTML_CONSEC_IMGS09
   0.013   0.042     0.000  1.00  0.55   0.01  T_HTML_CONSEC_IMGS17
   0.013   0.042     0.000  1.00  0.55   0.01  T_HTML_CONSEC_IMGS19
   0.007   0.021     0.000  1.00  0.51   0.01  T_HTML_CONSEC_IMGS14
   7.080   7.484     6.892  0.52  0.26   0.01  T_HTML_CONSEC_IMGS01
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_CONSEC_IMGS16
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_CONSEC_IMGS18

Spam % of all rules with S/O > 0.90: 22.85%

==========================

Next I'm going to see if there are any meta rules I can make that will reduce the FP rate for low S/O rules.
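The image count and the "longest run of consecutive IMG tags" described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl code in HTML.pm; `ImageRunParser` and `image_stats` are hypothetical names, and it assumes that only non-whitespace text breaks a run while other tags (table, paragraph, linebreak) do not.

```python
from html.parser import HTMLParser


class ImageRunParser(HTMLParser):
    """Counts IMG tags and tracks the longest run with no text in between."""

    def __init__(self):
        super().__init__()
        self.num_images = 0    # total IMG tags seen
        self.current_run = 0   # IMG tags since the last non-whitespace text
        self.longest_run = 0   # longest run seen so far

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; non-IMG tags neither extend nor
        # break a run (assumption based on the description above).
        if tag == "img":
            self.num_images += 1
            self.current_run += 1
            self.longest_run = max(self.longest_run, self.current_run)

    def handle_data(self, data):
        # Any non-whitespace text ends the current run.
        if data.strip():
            self.current_run = 0


def image_stats(html):
    """Return (number of IMG tags, longest consecutive IMG run)."""
    parser = ImageRunParser()
    parser.feed(html)
    parser.close()
    return parser.num_images, parser.longest_run
```

For example, a message with three images separated only by markup and whitespace, followed by some text and a fourth image, would give a count of 4 and a longest run of 3.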
Looks great! One note: most "image-only" spam actually has some text (a few words at the top, a disclaimer at the end) so keep that in mind. The typical image spam that slips through SA seems to have a bit of text, one huge image, then a bit more.
<daf> Argument "100%" isn't numeric in multiplication (*) at /home/daf/cvs/spamassassin/masses/../lib/Mail/SpamAssassin/HTML.pm line 230. The code doesn't seem to handle width and height attributes expressed as percentages. It should convert those to the equivalent pixel size for an 800x600 monitor (that should do the job). Something like: (800 * (percent / 100)) or (600 * (percent / 100))
Call me an optimization freak, but isn't that just (8 * percent) or (6 * percent)?
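The percentage handling discussed above could look something like the sketch below. It is written in Python for illustration only (HTML.pm itself is Perl); `to_pixels`, `ImageAreaParser` and `total_image_area` are hypothetical names, and the 800x600 reference screen is the assumption suggested in this thread. Note that `800 * percent / 100` is indeed the same value as `8 * percent`.

```python
import re
from html.parser import HTMLParser

# Assumed reference screen size, taken from the 800x600 suggestion above.
SCREEN_WIDTH = 800
SCREEN_HEIGHT = 600


def to_pixels(value, full_size):
    """Convert a width/height attribute to pixels, or None if unusable."""
    if value is None:
        return None
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(%?)\s*", value)
    if match is None:
        return None
    number = float(match.group(1))
    if match.group(2):
        # Percentage: scale against the assumed screen dimension.
        return full_size * number / 100
    return number


class ImageAreaParser(HTMLParser):
    """Accumulates width * height over IMG tags that specify both."""

    def __init__(self):
        super().__init__()
        self.total_area = 0.0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        width = to_pixels(attributes.get("width"), SCREEN_WIDTH)
        height = to_pixels(attributes.get("height"), SCREEN_HEIGHT)
        # Only images with both dimensions contribute to the total.
        if width is not None and height is not None:
            self.total_area += width * height


def total_image_area(html):
    """Return the summed pixel area of all sized IMG tags in the HTML."""
    parser = ImageAreaParser()
    parser.feed(html)
    parser.close()
    return parser.total_area
```

Parsing the attribute explicitly, rather than relying on implicit string-to-number conversion, is a design choice of this sketch; it also sidesteps the "isn't numeric" warning quoted above.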
Here are my results for the width/height ratios, one set of rules for the minimum ratio found and a second for the maximum ratio found:

OVERALL%   SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   15169    4849     10320  0.32  0.00   0.00  (all messages)
 100.000  31.967    68.033  0.32  0.00   0.00  (all messages as %)
   0.349   1.093     0.000  1.00  0.77   0.01  T_HTML_MIN_IMG_RATIO4
   0.165   0.516     0.000  1.00  0.72   0.01  T_HTML_MIN_IMG_RATIO1
   0.877   2.598     0.068  0.97  0.53   0.01  T_HTML_MIN_IMG_RATIO5
   0.310   0.887     0.039  0.96  0.50   0.01  T_HTML_MIN_IMG_RATIO2
   0.270   0.763     0.039  0.95  0.49   0.01  T_HTML_MIN_IMG_RATIO3
   1.536   3.753     0.494  0.88  0.42   0.01  T_HTML_MIN_IMG_RATIO6

All of the min ratios from 0.0 to 0.75 have an S/O of 0.95 or greater.

   15169    4849     10320  0.32  0.00   0.00  (all messages)
 100.000  31.967    68.033  0.32  0.00   0.00  (all messages as %)
   0.626   1.959     0.000  1.00  0.80   0.01  T_HTML_MAX_IMG_RATIO02
   0.521   1.629     0.000  1.00  0.79   0.01  T_HTML_MAX_IMG_RATIO05
   0.435   1.361     0.000  1.00  0.78   0.01  T_HTML_MAX_IMG_RATIO06
   0.171   0.536     0.000  1.00  0.72   0.01  T_HTML_MAX_IMG_RATIO09
   0.145   0.454     0.000  1.00  0.71   0.01  T_HTML_MAX_IMG_RATIO07
   1.147   3.568     0.010  1.00  0.68   0.01  T_HTML_MAX_IMG_RATIO04
   1.642   5.094     0.019  1.00  0.66   0.01  T_HTML_MAX_IMG_RATIO03B
   0.554   1.670     0.029  0.98  0.56   0.01  T_HTML_MAX_IMG_RATIO10
   0.125   0.371     0.010  0.97  0.53   0.01  T_HTML_MAX_IMG_RATIO08
   0.092   0.268     0.010  0.97  0.51   0.01  T_HTML_MAX_IMG_RATIO01
   1.872   5.094     0.359  0.93  0.47   0.01  T_HTML_MAX_IMG_RATIO03

It looks like all of these rules have good S/O's because the width/height ratios of 1 to 5 aren't covered by any rule. T_HTML_MAX_IMG_RATIO03B does a lot better than T_HTML_MAX_IMG_RATIO03 because it excludes all of the FROM_EGROUP messages, which often have image ratios in that range.
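As a rough illustration of the minimum and maximum width/height ratio per message, here is a small Python sketch. The function name `image_ratios` and the choice to skip images with a non-positive dimension are assumptions made for this example; the actual thresholds used by the T_HTML_MIN_IMG_RATIO* and T_HTML_MAX_IMG_RATIO* rules are not stated in this thread and are not reproduced here.

```python
def image_ratios(dimensions):
    """Return (min, max) width/height ratio over (width, height) pairs.

    Pairs with a non-positive width or height are skipped (an assumption
    of this sketch). Returns None when no usable pair remains.
    """
    ratios = [width / height
              for width, height in dimensions
              if width > 0 and height > 0]
    if not ratios:
        return None
    return min(ratios), max(ratios)


# A wide banner, a square image and a tall narrow image.
print(image_ratios([(468, 60), (100, 100), (10, 40)]))
```

A rule would then compare each of the two values against its own range.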
Oh, and I made a meta rule to combine the # of images test with the HTML percentage test, since I figured that spam made mostly of images will also be mostly HTML. So here's what I got:

   3.896  12.167     0.010  1.00  0.76   0.01  T_HTML_50_70_IMGS3
   9.566  27.655     1.066  0.96  0.51   0.31  HTML_50_70

(T_HTML_50_70_IMGS3 == HTML_50_70 + 3 or more images)

It greatly reduces the FP rate, while reducing the spam hit rate by 56% (from 27.655% to 12.167%).
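To make the numbers above concrete, the short Python sketch below recomputes them. It assumes that S/O is the spam percentage divided by the sum of the spam and nonspam percentages, which matches the figures in the tables in this thread; `spam_over_overall` is a hypothetical helper name, not part of SpamAssassin.

```python
def spam_over_overall(spam_pct, nonspam_pct):
    """S/O: share of a rule's (percentage) hits that come from spam."""
    total = spam_pct + nonspam_pct
    if total == 0:
        # Rules with no hits are listed with an S/O of 0.00 in the tables.
        return 0.0
    return spam_pct / total


# SPAM% and NONSPAM% values taken from the two rows above.
html_50_70 = spam_over_overall(27.655, 1.066)
meta_rule = spam_over_overall(12.167, 0.010)

# Relative drop in the spam hit rate when 3 or more images are required.
reduction = 1 - 12.167 / 27.655

print(f"{html_50_70:.2f} {meta_rule:.2f} {reduction:.0%}")
# prints: 0.96 1.00 56%
```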
Subject: Re: [SAdev] IMG tag based rules

Here are my results from last night for the T_HTML_ rules. Note that there were some checkins after my last cvs update though, so these aren't the latest versions. However, I think the modifications were (a) handling percentages, (b) some new rules and (c) avoiding that warning message, so I think these hits are still valid. I'll cvs update now and rerun.

   0.431   0.555     0.000  1.00  0.83   0.01  T_HTML_IMAGE_AREA16
   0.136   0.175     0.000  1.00  0.75   0.01  T_HTML_CONSEC_IMGS13
   0.023   0.029     0.000  1.00  0.61   0.01  T_HTML_IMAGE_AREA19
   0.023   0.029     0.000  1.00  0.61   0.01  T_HTML_IMAGE_AREA18
   4.195   5.348     0.202  0.96  0.59   0.01  T_HTML_NUM_IMGS02B
   4.195   5.348     0.202  0.96  0.59   0.01  T_HTML_NUM_IMGS02
   3.469   4.267     0.709  0.86  0.47   0.01  T_HTML_NUM_IMGS03
   2.404   2.864     0.810  0.78  0.42   0.01  T_HTML_NUM_IMGS04
   5.193   6.108     2.024  0.75  0.41   0.01  T_HTML_CONSEC_IMGS07
   5.170   6.078     2.024  0.75  0.41   0.01  T_HTML_MIN_IMG_RATIO4
   4.263   4.939     1.923  0.72  0.39   0.01  T_HTML_IMAGE_AREA12
   3.855   4.442     1.822  0.71  0.39   0.01  T_HTML_CONSEC_IMGS04
   0.635   0.731     0.304  0.71  0.38   0.01  T_HTML_MAX_IMG_RATIO02
   0.635   0.731     0.304  0.71  0.38   0.01  T_HTML_MAX_IMG_RATIO02B
   4.036   4.617     2.024  0.70  0.38   0.01  T_HTML_NUM_IMGS13
   1.020   1.140     0.607  0.65  0.36   0.01  T_HTML_NUM_IMGS09
   4.127   4.559     2.632  0.63  0.35   0.01  T_HTML_NUM_IMGS01B
   4.127   4.559     2.632  0.63  0.35   0.01  T_HTML_NUM_IMGS01
   2.268   2.425     1.721  0.58  0.33   0.01  T_HTML_NUM_IMGS08
   1.565   1.666     1.215  0.58  0.33   0.01  T_HTML_MIN_IMG_RATIO1
   1.837   1.929     1.518  0.56  0.32   0.01  T_HTML_NUM_IMGS10
   1.111   1.140     1.012  0.53  0.31   0.01  T_HTML_CONSEC_IMGS06
   5.397   5.377     5.466  0.50  0.29   0.01  T_HTML_MAX_IMG_RATIO04
   4.354   4.267     4.656  0.48  0.28   0.01  T_HTML_CONSEC_IMGS03
  12.517  12.215    13.563  0.47  0.28   0.01  T_HTML_IMAGE_AREA11
  10.703  10.374    11.842  0.47  0.28   0.01  T_HTML_NUM_IMGS11
   1.088   1.052     1.215  0.46  0.28   0.01  T_HTML_IMAGE_AREA13
   9.705   9.264    11.235  0.45  0.27   0.01  T_HTML_MAX_IMG_RATIO10
   0.680   0.643     0.810  0.44  0.27   0.01  T_HTML_CONSEC_IMGS08
   2.562   2.367     3.239  0.42  0.26   0.01  T_HTML_IMAGE_AREA10
   2.834   2.601     3.644  0.42  0.26   0.01  T_HTML_MIN_IMG_RATIO5
  11.156  10.111    14.777  0.41  0.25   0.01  T_HTML_IMAGE_AREA01
   0.998   0.877     1.417  0.38  0.24   0.01  T_HTML_MAX_IMG_RATIO06
   4.943   4.237     7.389  0.36  0.23   0.01  T_HTML_CONSEC_IMGS01
   4.943   4.237     7.389  0.36  0.23   0.01  T_HTML_CONSEC_IMGS01B
   0.839   0.701     1.316  0.35  0.22   0.01  T_HTML_CONSEC_IMGS09
   1.927   1.607     3.036  0.35  0.22   0.01  T_HTML_NUM_IMGS05
   2.766   2.279     4.453  0.34  0.22   0.01  T_HTML_NUM_IMGS12
   1.406   1.110     2.429  0.31  0.20   0.01  T_HTML_NUM_IMGS06
   2.449   1.929     4.251  0.31  0.20   0.01  T_HTML_MIN_IMG_RATIO6
   0.113   0.088     0.202  0.30  0.20   0.01  T_HTML_CONSEC_IMGS12
   7.778   5.932    14.170  0.30  0.19   0.01  T_HTML_CONSEC_IMGS02
   2.132   1.461     4.453  0.25  0.17   0.01  T_HTML_CONSEC_IMGS05
   7.029   4.793    14.777  0.24  0.16   0.01  T_HTML_MAX_IMG_RATIO03
   7.029   4.793    14.777  0.24  0.16   0.01  T_HTML_MAX_IMG_RATIO03B
   1.519   1.023     3.239  0.24  0.16   0.01  T_HTML_CONSEC_IMGS11
   5.714   3.624    12.955  0.22  0.15   0.01  T_HTML_MIN_IMG_RATIO3
   3.560   2.133     8.502  0.20  0.13   0.01  T_HTML_IMAGE_AREA04B
   3.560   2.133     8.502  0.20  0.13   0.01  T_HTML_IMAGE_AREA04
   0.544   0.321     1.316  0.20  0.13   0.01  T_HTML_IMAGE_AREA09
   1.633   0.964     3.947  0.20  0.13   0.01  T_HTML_IMAGE_AREA03B
   1.633   0.964     3.947  0.20  0.13   0.01  T_HTML_IMAGE_AREA03
   0.340   0.175     0.911  0.16  0.11   0.01  T_HTML_IMAGE_AREA14
   0.408   0.205     1.113  0.16  0.10   0.01  T_HTML_IMAGE_AREA07
   3.560   1.753     9.818  0.15  0.10   0.01  T_HTML_NUM_IMGS07
  12.404   6.078    34.312  0.15  0.10   0.01  T_HTML_50_70_IMGS3
   1.315   0.643     3.644  0.15  0.10   0.01  T_HTML_IMAGE_AREA02
   6.054   2.805    17.308  0.14  0.09   0.01  T_HTML_MAX_IMG_RATIO05
   2.381   1.052     6.984  0.13  0.08   0.01  T_HTML_IMAGE_AREA05B
   2.381   1.052     6.984  0.13  0.08   0.01  T_HTML_IMAGE_AREA05
   1.270   0.555     3.745  0.13  0.08   0.01  T_HTML_NUM_IMGS15
   0.068   0.029     0.202  0.13  0.08   0.01  T_HTML_IMAGE_AREA15
   1.270   0.526     3.846  0.12  0.07   0.01  T_HTML_IMAGE_AREA06B
   1.270   0.526     3.846  0.12  0.07   0.01  T_HTML_IMAGE_AREA06
  11.633   4.793    35.324  0.12  0.07   0.01  T_HTML_CONSEC_IMGS10
  10.091   3.916    31.478  0.11  0.06   0.01  T_HTML_MIN_IMG_RATIO2
   1.406   0.409     4.858  0.08  0.04   0.01  T_HTML_NUM_IMGS16
   1.338   0.380     4.656  0.08  0.04   0.01  T_HTML_NUM_IMGS14
   2.494   0.672     8.806  0.07  0.03   0.01  T_HTML_IMAGE_AREA08
   0.748   0.175     2.733  0.06  0.02   0.01  T_HTML_MAX_IMG_RATIO07
   1.905   0.351     7.287  0.05  0.01   0.01  T_HTML_MAX_IMG_RATIO09
   1.043   0.175     4.049  0.04  0.01   0.01  T_HTML_MAX_IMG_RATIO08
   0.703   0.088     2.834  0.03  0.01   0.01  T_HTML_MAX_IMG_RATIO01
   1.633   0.175     6.680  0.03  0.00   0.01  T_HTML_NUM_IMGS17
   0.998   0.088     4.150  0.02  0.00   0.01  T_HTML_NUM_IMGS18
   3.469   0.029    15.385  0.00  0.00   0.01  T_HTML_NUM_IMGS20
   0.408   0.000     1.822  0.00  0.00   0.01  T_HTML_NUM_IMGS19
One request: could you name all of these rules T_HTML_IMG_* ? I find that hierarchical naming helps make it easier to compare similar rules.

Here are my current results for your IMG rules. I added T_HTML_MESSAGE so we can get baseline "how spammy is HTML in general" control numbers for our corpuses (since they differ). We really want all of the HTML rules to have a significantly better S/O than T_HTML_MESSAGE.

OVERALL%   SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   12157    4462      7695  0.37  0.00   0.00  (all messages)
 100.000  36.703    63.297  0.37  0.00   0.00  (all messages as %)
   2.056   5.603     0.000  1.00  0.90   0.01  T_HTML_CONSEC_IMGS01B
   2.056   5.603     0.000  1.00  0.90   0.01  T_HTML_CONSEC_IMGS01
   1.925   5.244     0.000  1.00  0.89   0.01  T_HTML_IMAGE_AREA01
   1.012   2.757     0.000  1.00  0.85   0.01  T_HTML_NUM_IMGS03
   0.888   2.420     0.000  1.00  0.84   0.01  T_HTML_CONSEC_IMGS03
   0.806   2.196     0.000  1.00  0.84   0.01  T_HTML_IMAGE_AREA12
   0.790   2.152     0.000  1.00  0.84   0.01  T_HTML_IMAGE_AREA06
   0.790   2.152     0.000  1.00  0.84   0.01  T_HTML_IMAGE_AREA06B
   0.740   2.017     0.000  1.00  0.83   0.01  T_HTML_MIN_IMG_RATIO5
   0.675   1.838     0.000  1.00  0.82   0.01  T_HTML_NUM_IMGS07
   0.633   1.726     0.000  1.00  0.82   0.01  T_HTML_NUM_IMGS12
   0.535   1.457     0.000  1.00  0.81   0.01  T_HTML_CONSEC_IMGS09
   0.461   1.255     0.000  1.00  0.80   0.01  T_HTML_MAX_IMG_RATIO02B
   0.461   1.255     0.000  1.00  0.80   0.01  T_HTML_MAX_IMG_RATIO02
   0.461   1.255     0.000  1.00  0.80   0.01  T_HTML_IMAGE_AREA07
   0.420   1.143     0.000  1.00  0.79   0.01  T_HTML_NUM_IMGS08
   0.395   1.076     0.000  1.00  0.79   0.01  T_HTML_CONSEC_IMGS05
   0.387   1.053     0.000  1.00  0.79   0.01  T_HTML_MAX_IMG_RATIO10
   0.387   1.053     0.000  1.00  0.79   0.01  T_HTML_IMAGE_AREA10
   0.321   0.874     0.000  1.00  0.77   0.01  T_HTML_MIN_IMG_RATIO4
   0.222   0.605     0.000  1.00  0.75   0.01  T_HTML_MAX_IMG_RATIO06
   0.214   0.583     0.000  1.00  0.75   0.01  T_HTML_IMAGE_AREA08
   0.148   0.403     0.000  1.00  0.72   0.01  T_HTML_NUM_IMGS13
   0.148   0.403     0.000  1.00  0.72   0.01  T_HTML_MIN_IMG_RATIO1
   0.123   0.336     0.000  1.00  0.71   0.01  T_HTML_CONSEC_IMGS08
   0.123   0.336     0.000  1.00  0.71   0.01  T_HTML_IMAGE_AREA13
   0.107   0.291     0.000  1.00  0.70   0.01  T_HTML_IMAGE_AREA14
   0.099   0.269     0.000  1.00  0.70   0.01  T_HTML_CONSEC_IMGS11
   3.150   8.539     0.026  1.00  0.69   0.01  T_HTML_NUM_IMGS01B
   1.563   4.236     0.013  1.00  0.69   0.01  T_HTML_MAX_IMG_RATIO03B
   0.090   0.247     0.000  1.00  0.69   0.01  T_HTML_MAX_IMG_RATIO07
   2.163   5.849     0.026  1.00  0.67   0.01  T_HTML_NUM_IMGS11
   0.058   0.157     0.000  1.00  0.66   0.01  T_HTML_IMAGE_AREA15
   0.847   2.286     0.013  0.99  0.65   0.01  T_HTML_MIN_IMG_RATIO6
   0.049   0.134     0.000  1.00  0.65   0.01  T_HTML_IMAGE_AREA16
   0.049   0.134     0.000  1.00  0.65   0.01  T_HTML_MAX_IMG_RATIO01
   2.139   5.760     0.039  0.99  0.64   0.01  T_HTML_CONSEC_IMGS02
   0.699   1.883     0.013  0.99  0.64   0.01  T_HTML_NUM_IMGS05
   0.041   0.112     0.000  1.00  0.64   0.01  T_HTML_CONSEC_IMGS12
   0.642   1.726     0.013  0.99  0.63   0.01  T_HTML_IMAGE_AREA04B
   0.033   0.090     0.000  1.00  0.62   0.01  T_HTML_NUM_IMGS15
   0.033   0.090     0.000  1.00  0.62   0.01  T_HTML_MAX_IMG_RATIO08
   0.033   0.090     0.000  1.00  0.62   0.01  T_HTML_CONSEC_IMGS13
   1.949   5.222     0.052  0.99  0.61   0.01  T_HTML_IMAGE_AREA11
   3.192   8.539     0.091  0.99  0.61   0.01  T_HTML_NUM_IMGS01
   0.025   0.067     0.000  1.00  0.60   0.01  T_HTML_NUM_IMGS16
   1.579   4.213     0.052  0.99  0.60   0.01  T_HTML_NUM_IMGS02B
   0.979   2.600     0.039  0.99  0.59   0.01  T_HTML_NUM_IMGS04
   0.650   1.726     0.026  0.99  0.59   0.01  T_HTML_IMAGE_AREA04
   0.642   1.703     0.026  0.98  0.59   0.01  T_HTML_IMAGE_AREA05
   0.642   1.703     0.026  0.98  0.59   0.01  T_HTML_IMAGE_AREA05B
   1.588   4.213     0.065  0.98  0.59   0.01  T_HTML_NUM_IMGS02
   0.313   0.829     0.013  0.98  0.58   0.01  T_HTML_MAX_IMG_RATIO05
   0.864   2.286     0.039  0.98  0.58   0.01  T_HTML_MAX_IMG_RATIO04
   0.568   1.502     0.026  0.98  0.58   0.01  T_HTML_CONSEC_IMGS10
   0.016   0.045     0.000  1.00  0.58   0.01  T_HTML_NUM_IMGS20
   0.535   1.412     0.026  0.98  0.57   0.01  T_HTML_CONSEC_IMGS04
   1.612   4.236     0.091  0.98  0.56   0.01  T_HTML_MAX_IMG_RATIO03
   1.004   2.622     0.065  0.98  0.55   0.01  T_HTML_NUM_IMGS06
   0.313   0.807     0.026  0.97  0.53   0.01  T_HTML_NUM_IMGS10
   0.008   0.022     0.000  1.00  0.53   0.01  T_HTML_IMAGE_AREA17
   0.008   0.022     0.000  1.00  0.53   0.01  T_HTML_CONSEC_IMGS14
   0.008   0.022     0.000  1.00  0.53   0.01  T_HTML_NUM_IMGS19
   0.230   0.583     0.026  0.96  0.51   0.01  T_HTML_CONSEC_IMGS07
  28.000  70.731     3.223  0.96  0.51   0.00  T_HTML_MESSAGE
   0.107   0.269     0.013  0.95  0.51   0.01  T_HTML_IMAGE_AREA02
   0.280   0.695     0.039  0.95  0.50   0.01  T_HTML_MIN_IMG_RATIO2
   0.082   0.202     0.013  0.94  0.49   0.01  T_HTML_NUM_IMGS14
   0.469   1.121     0.091  0.92  0.47   0.01  T_HTML_MIN_IMG_RATIO3
   0.403   0.941     0.091  0.91  0.46   0.01  T_HTML_IMAGE_AREA03
   0.403   0.941     0.091  0.91  0.46   0.01  T_HTML_IMAGE_AREA03B
   0.395   0.919     0.091  0.91  0.46   0.01  T_HTML_NUM_IMGS09
   0.304   0.672     0.091  0.88  0.43   0.01  T_HTML_CONSEC_IMGS06
   0.271   0.538     0.117  0.82  0.40   0.01  T_HTML_IMAGE_AREA09
   0.033   0.045     0.026  0.63  0.31   0.01  T_HTML_MAX_IMG_RATIO09
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_CONSEC_IMGS19
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_CONSEC_IMGS18
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_CONSEC_IMGS17
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_CONSEC_IMGS16
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_CONSEC_IMGS15
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_IMAGE_AREA18
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_IMAGE_AREA19
   0.008   0.000     0.013  0.00  0.00   0.01  T_HTML_NUM_IMGS17
   0.000   0.000     0.000  0.00  0.00   0.01  T_HTML_NUM_IMGS18
assigning bug
This doesn't seem to be working too well for other people; shall I remove this from CVS and close the bug WONTFIX?
> This doesn't seem to be working too well for other people; shall I remove this
> from CVS and close the bug WONTFIX?

I went through the nightly runs and added comments for all of these tests. To summarize, I don't mind if you remove them, but please take a look at my comments in 70_cvs_rules_under_test.cf first. One or two sets of tests look like they are worth some further work, especially T_HTML_IMAGE_AREA14 and higher. The rest can probably go.

I also looked at all of the width and height attributes in my spam. It looks like 20% of them are specified using a percentage instead of a fixed value. It might be worth guesstimating those. I'll try it out.

If anyone other than Matt removes any of these, please make sure you also get the code from HTML.pm.
Okay, I made the percent change to T_HTML_IMAGE_AREA_* and it seems to improve the results a tiny bit (without any upward movement for nonspam), so I checked it in. Here's the relative change, before to after (so positive is an increase), out of 3504 HTML spam which originally had 241 hits for T_HTML_IMAGE_AREA01.

   18  T_HTML_IMAGE_AREA08
   18  T_HTML_MIN_IMG_RATIO4
    1  T_HTML_IMAGE_AREA01
    1  T_HTML_MAX_IMG_RATIO04
    1  T_HTML_MAX_IMG_RATIO05
    1  T_HTML_MAX_IMG_RATIO06
   -2  T_HTML_MAX_IMG_RATIO03
   -2  T_HTML_MAX_IMG_RATIO03B
  -18  T_HTML_IMAGE_AREA05
  -18  T_HTML_IMAGE_AREA05B

I suggest removing all of the other IMAGE stuff in that block except for T_HTML_IMAGE_AREA and T_IMAGE_ONLY_* (which is in a separate block of the file) ... along with the related code in HTML.pm.
My nightly run gave a load of these errors tonight:

Argument "100%" isn't numeric in multiplication (*) at /home/rod/build/sanightly/spamassassin/masses/../lib/Mail/SpamAssassin/HTML.pm line 290.

Might be related to changes to this bug.
Subject: Re: [SAdev] IMG tag based rules

rOD-spamassassin@arsecandle.org writes:
> My nightly run gave a load of these errors tonight:
>
> Argument "100%" isn't numeric in multiplication (*) at
> /home/rod/build/sanightly/spamassassin/masses/../lib/Mail/SpamAssassin/HTML.pm
> line 290.
>
> Might be related to changes to this bug.

Thanks, it was a silly mistake on my part (now fixed). The code still worked (my results don't change with the fix). Perl happens to do what I wanted:

------- start of cut text --------------
$ perl -e 'use warnings; use strict; my $x = "100%"; my $y = 8; print $x * $y . "\n"'
Argument "100%" isn't numeric in multiplication (*) at -e line 1.
800
------- end ----------------------------
Finishing these up: it looks like T_HTML_IMAGE_AREA* will be kept since it works pretty well. I'm trying to find where the S/O ratio starts being really good. The rest of the tests are going away.
Done. Promoted the rules for a total image area of 400000 square pixels and upwards.