SA Bugzilla – Bug 3065
RFE: count the number of matches
Last modified: 2004-02-18 17:30:15 UTC
Sorry, this should have been in the initial description... (RETURN at the wrong time...)

There are some kinds of rules where counting the number of matches is important. For example, I would like to write a rule that counts the number of invalid html tags.

  full __INVALID_HTML_TAGS count:m/<(?!a|img|h\d|font|etc...)\w+[^>]*>/g
  full __VALID_HTML_TAGS count:m/<(?:a|img|h\d|font|etc...)[^>]*>/g
  meta MADE_UP_HTML ( __VALID_HTML_TAGS > 15 && __INVALID_HTML_TAGS > __VALID_HTML_TAGS * 0.5 )

Similarly, I would like to count the number of mid-word html tags vs regular html tags:

  full __IN_WORD_TAGS count:m/\w<[^>]+>\w/g
  full __TOTAL_TAGS count:m/<[^>]+>/g
  meta HTML_BREAKING_WORDS ( __TOTAL_TAGS > 50 && __IN_WORD_TAGS > __TOTAL_TAGS * 0.3 )
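To make the intent of the two meta rules concrete, here is a rough sketch in Python of what they would compute. It is not SpamAssassin code, and `count:` is only the syntax suggested in this report. The tag list `KNOWN_TAGS` is a short hypothetical stand-in for the elided list above, the function names are made up for illustration, and the `\b` after the tag name is an added assumption so that a name such as `abbr` is not treated as `a`.

```python
import re

# Hypothetical, deliberately short stand-in for the tag list elided in the report.
KNOWN_TAGS = ("a", "img", r"h\d", "font")
_KNOWN = "|".join(KNOWN_TAGS)

# Emulations of the four proposed count rules (the \b is an assumption of this sketch).
INVALID_HTML_TAGS = re.compile(r"<(?!(?:%s)\b)\w+[^>]*>" % _KNOWN, re.IGNORECASE)
VALID_HTML_TAGS = re.compile(r"<(?:%s)\b[^>]*>" % _KNOWN, re.IGNORECASE)
IN_WORD_TAGS = re.compile(r"\w<[^>]+>\w")
TOTAL_TAGS = re.compile(r"<[^>]+>")


def count_matches(pattern, text):
    """Number of non-overlapping matches, like counting the hits of m//g."""
    return sum(1 for _ in pattern.finditer(text))


def made_up_html(body):
    """Emulates the proposed MADE_UP_HTML meta expression."""
    valid = count_matches(VALID_HTML_TAGS, body)
    invalid = count_matches(INVALID_HTML_TAGS, body)
    return valid > 15 and invalid > valid * 0.5


def html_breaking_words(body):
    """Emulates the proposed HTML_BREAKING_WORDS meta expression."""
    total = count_matches(TOTAL_TAGS, body)
    in_word = count_matches(IN_WORD_TAGS, body)
    return total > 50 and in_word > total * 0.3


if __name__ == "__main__":
    sample = "<a href='x'>" * 16 + "<zzz>" * 9
    print(made_up_html(sample), html_breaking_words(sample))
```

Because matches do not overlap, two tags separated by a single word character (as in `V<b>i</b>agra`) count as one mid-word hit, not two; the same would apply to a `/g` match in Perl.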
This feature already exists in HEAD:

OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
  260784   188819    71965  0.724   0.00   0.00  (all messages)
 100.000  72.4044  27.5956  0.724   0.00   0.00  (all messages as %)
   0.933   1.2891   0.0000  1.000   0.94   1.00  HTML_NONELEMENT_90_100
   1.101   1.5205   0.0000  1.000   0.94   1.00  HTML_NONELEMENT_80_90
   1.086   1.4993   0.0000  1.000   0.94   1.00  HTML_NONELEMENT_70_80
   1.185   1.6370   0.0000  1.000   0.95   1.00  HTML_NONELEMENT_60_70
  14.805  20.4471   0.0014  1.000   0.97   1.00  HTML_NONELEMENT_50_60
   0.682   0.9374   0.0125  0.987   0.91   1.00  HTML_NONELEMENT_40_50
   2.822   3.8894   0.0222  0.994   0.93   1.00  HTML_NONELEMENT_30_40
   0.350   0.4714   0.0306  0.939   0.78   1.00  HTML_NONELEMENT_20_30
   0.511   0.6572   0.1265  0.839   0.56   1.00  HTML_NONELEMENT_10_20
   1.411   1.8123   0.3571  0.835   0.55   1.00  HTML_NONELEMENT_00_10
   0.016   0.0222   0.0000  1.000   0.94   1.00  HTML_BADTAG_90_100
   0.112   0.1546   0.0000  1.000   0.94   1.00  HTML_BADTAG_80_90
   0.275   0.3797   0.0000  1.000   0.94   1.00  HTML_BADTAG_70_80
   0.805   1.1116   0.0000  1.000   0.94   1.00  HTML_BADTAG_60_70
   0.354   0.4888   0.0000  1.000   0.94   1.00  HTML_BADTAG_50_60
  15.185  20.9730   0.0000  1.000   0.97   1.00  HTML_BADTAG_40_50
   1.321   1.8240   0.0014  0.999   0.94   1.00  HTML_BADTAG_30_40
   0.851   1.1736   0.0028  0.998   0.94   1.00  HTML_BADTAG_20_30
   3.637   5.0006   0.0584  0.988   0.92   1.00  HTML_BADTAG_10_20
   2.330   3.0325   0.4877  0.861   0.61   1.00  HTML_BADTAG_00_10

The ranges at 40 or 50% and above for both of these new rules match tons of spam. If you just look at HTML messages in our nightly corpus, the peaks of both ranges are among the top 6 of all rules.
OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
  166122   161128     4994  0.970   0.00   0.00  (all messages)
 100.000  96.9938   3.0062  0.970   0.00   0.00  (all messages as %)
  46.790  48.2238   0.5206  0.989   1.00   0.75  BIZ_TLD
  25.606  26.4001   0.0000  1.000   1.00   0.01  T_DEEP_DISC_MEDS
  23.839  24.5774   0.0000  1.000   1.00   1.00  HTML_BADTAG_40_50 <---
  23.609  24.3403   0.0000  1.000   1.00   0.01  T_SUBJ_VALIUM
  22.192  22.8799   0.0000  1.000   0.99   0.01  T_MSGID_EVIL_20
  23.241  23.9611   0.0200  0.999   0.99   1.00  HTML_NONELEMENT_50_60 <---

(removed other test variations of T_MSGID_EVIL)

In addition to the above tests for invalid tags, we also look for mid-word tags.

   0.287   0.2960   0.0000  1.000   0.96   1.00  HTML_OBFUSCATION_90_100
   0.182   0.1874   0.0000  1.000   0.96   1.00  HTML_OBFUSCATION_80_90
   0.680   0.7013   0.0000  1.000   0.96   1.00  HTML_OBFUSCATION_70_80
   0.640   0.6597   0.0000  1.000   0.96   1.00  HTML_OBFUSCATION_60_70
   1.365   1.4076   0.0000  1.000   0.96   1.00  HTML_OBFUSCATION_50_60
   1.798   1.8538   0.0000  1.000   0.96   1.00  HTML_OBFUSCATION_40_50
   1.678   1.7297   0.0200  0.989   0.93   1.00  HTML_OBFUSCATION_30_40
   3.724   3.8342   0.1802  0.955   0.84   1.00  HTML_OBFUSCATION_20_30
   2.376   2.4285   0.6808  0.781   0.46   1.00  HTML_OBFUSCATION_10_20
  21.327  20.0276  63.2559  0.240   0.04   1.00  HTML_OBFUSCATION_00_10

So, thanks for the suggestions. :-)
Counting html tags isn't the only use for a count rule. I'm happy to see that my examples are already taken care of.