Bug 3065 - RFE: count the number of matches
Summary: RFE: count the number of matches
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: unspecified
Hardware: Other other
Importance: P5 enhancement
Target Milestone: 3.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-02-18 21:41 UTC by David Muir Sharnoff
Modified: 2004-02-18 17:30 UTC



Description David Muir Sharnoff 2004-02-18 21:41:01 UTC
 
Comment 1 David Muir Sharnoff 2004-02-18 21:51:07 UTC
Sorry, this should have been in the initial description (I hit RETURN at the
wrong time).

There are some kinds of rules where counting the number of matches is
important. For example, I would like to write a rule that counts the number
of invalid HTML tags.
  
full    __INVALID_HTML_TAGS    count:m/<(?!a|img|h\d|font|etc...)\w+[^>]*>/g
full    __VALID_HTML_TAGS      count:m/<(?:a|img|h\d|font|etc...)[^>]*>/g
meta    MADE_UP_HTML           ( __VALID_HTML_TAGS > 15 && __INVALID_HTML_TAGS > __VALID_HTML_TAGS * 0.5 )
 
Similarly, I would like to count the number of mid-word HTML tags and compare
it with the total number of HTML tags:
 
full    __IN_WORD_TAGS         count:m/\w<[^>]+>\w/g
full    __TOTAL_TAGS           count:m/<[^>]+>/g
meta    HTML_BREAKING_WORDS    ( __TOTAL_TAGS > 50 && __IN_WORD_TAGS > __TOTAL_TAGS * 0.3 )
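
To make the intent of the proposed count: syntax concrete, here is a rough Python sketch of the same logic outside SpamAssassin. It is only an illustration, not how SpamAssassin evaluates rules. The tag list is a hypothetical stand-in for the elided "etc..." above, the \b anchors are an addition so that a tag such as <abbr> is not mistaken for <a>, and the function names are made up for this example.

```python
import re

# Hypothetical tag list; the patterns above elide the full list with "etc...".
KNOWN_TAGS = r"(?:a|img|h\d|font)"

# \b is an addition to the proposed patterns so "<abbr>" is not read as "<a>".
VALID_TAG = re.compile(rf"<{KNOWN_TAGS}\b[^>]*>", re.IGNORECASE)
INVALID_TAG = re.compile(rf"<(?!{KNOWN_TAGS}\b)\w+[^>]*>", re.IGNORECASE)
IN_WORD_TAG = re.compile(r"\w<[^>]+>\w")
ANY_TAG = re.compile(r"<[^>]+>")


def count_matches(pattern, text):
    """Count non-overlapping matches, like a global (/g) match would find."""
    return sum(1 for _ in pattern.finditer(text))


def made_up_html(text):
    """Mirror of the MADE_UP_HTML meta condition."""
    valid = count_matches(VALID_TAG, text)
    invalid = count_matches(INVALID_TAG, text)
    return valid > 15 and invalid > valid * 0.5


def html_breaking_words(text):
    """Mirror of the HTML_BREAKING_WORDS meta condition."""
    total = count_matches(ANY_TAG, text)
    in_word = count_matches(IN_WORD_TAG, text)
    return total > 50 and in_word > total * 0.3
```

Because matches are non-overlapping, two tags separated by a single word character (as in a<b>c<d>e) count as one mid-word match, which is also what a /g match would report.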
 
 
Comment 2 Daniel Quinlan 2004-02-19 01:50:28 UTC
This feature already exists in HEAD:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 260784   188819    71965    0.724   0.00    0.00  (all messages)
100.000  72.4044  27.5956    0.724   0.00    0.00  (all messages as %)
  0.933   1.2891   0.0000    1.000   0.94    1.00  HTML_NONELEMENT_90_100
  1.101   1.5205   0.0000    1.000   0.94    1.00  HTML_NONELEMENT_80_90
  1.086   1.4993   0.0000    1.000   0.94    1.00  HTML_NONELEMENT_70_80
  1.185   1.6370   0.0000    1.000   0.95    1.00  HTML_NONELEMENT_60_70
 14.805  20.4471   0.0014    1.000   0.97    1.00  HTML_NONELEMENT_50_60
  0.682   0.9374   0.0125    0.987   0.91    1.00  HTML_NONELEMENT_40_50
  2.822   3.8894   0.0222    0.994   0.93    1.00  HTML_NONELEMENT_30_40
  0.350   0.4714   0.0306    0.939   0.78    1.00  HTML_NONELEMENT_20_30
  0.511   0.6572   0.1265    0.839   0.56    1.00  HTML_NONELEMENT_10_20
  1.411   1.8123   0.3571    0.835   0.55    1.00  HTML_NONELEMENT_00_10
  0.016   0.0222   0.0000    1.000   0.94    1.00  HTML_BADTAG_90_100
  0.112   0.1546   0.0000    1.000   0.94    1.00  HTML_BADTAG_80_90
  0.275   0.3797   0.0000    1.000   0.94    1.00  HTML_BADTAG_70_80
  0.805   1.1116   0.0000    1.000   0.94    1.00  HTML_BADTAG_60_70
  0.354   0.4888   0.0000    1.000   0.94    1.00  HTML_BADTAG_50_60
 15.185  20.9730   0.0000    1.000   0.97    1.00  HTML_BADTAG_40_50
  1.321   1.8240   0.0014    0.999   0.94    1.00  HTML_BADTAG_30_40
  0.851   1.1736   0.0028    0.998   0.94    1.00  HTML_BADTAG_20_30
  3.637   5.0006   0.0584    0.988   0.92    1.00  HTML_BADTAG_10_20
  2.330   3.0325   0.4877    0.861   0.61    1.00  HTML_BADTAG_00_10

For both of these new rules, the ranges at 40 or 50% and above match tons of spam.
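
The rule names suggest that each test fires when some tag ratio falls into a 10-point percentage range. As a rough illustration only, a helper like the following could map a count ratio to such a name. The function name, its parameters, and the handling of range boundaries are all assumptions for this sketch, not the code in HEAD.

```python
def ratio_bucket(prefix, matched, total, width=10):
    """Map matched/total to a name such as HTML_BADTAG_40_50.

    Hypothetical helper: boundary handling (for example, exactly 50%)
    is an assumption, not taken from SpamAssassin.
    """
    if total <= 0:
        return None  # no tags at all, so no range applies
    percent = 100.0 * matched / total
    # Clamp so that exactly 100% lands in the top range (90_100).
    low = min(int(percent // width) * width, 100 - width)
    return f"{prefix}_{low:02d}_{low + width}"
```

For example, 45 matching tags out of 100 would land in HTML_BADTAG_40_50 under these assumptions.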

If you just look at HTML messages in our nightly corpus, the peaks of both
ranges are among the top 6 of all rules.

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 166122   161128     4994    0.970   0.00    0.00  (all messages)
100.000  96.9938   3.0062    0.970   0.00    0.00  (all messages as %)
 46.790  48.2238   0.5206    0.989   1.00    0.75  BIZ_TLD
 25.606  26.4001   0.0000    1.000   1.00    0.01  T_DEEP_DISC_MEDS
 23.839  24.5774   0.0000    1.000   1.00    1.00  HTML_BADTAG_40_50     <---
 23.609  24.3403   0.0000    1.000   1.00    0.01  T_SUBJ_VALIUM
 22.192  22.8799   0.0000    1.000   0.99    0.01  T_MSGID_EVIL_20
 23.241  23.9611   0.0200    0.999   0.99    1.00  HTML_NONELEMENT_50_60 <---

(removed other test variations of T_MSGID_EVIL)
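
For readers unfamiliar with the columns: the S/O values in the per-rule rows are consistent with SPAM% / (SPAM% + HAM%), that is, the share of a rule's hits that are spam. A small sketch to check this against the rows above (the function name is made up for this example):

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O as it appears to be computed in the tables: SPAM% / (SPAM% + HAM%)."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # rule never fired, so the ratio is undefined
    return spam_pct / total


# HTML_NONELEMENT_00_10: SPAM% 1.8123, HAM% 0.3571
print(round(spam_over_overall(1.8123, 0.3571), 3))  # 0.835, matching the table
```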

In addition to the above tests for invalid tags, we also look for mid-word
tags.

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  0.287   0.2960   0.0000    1.000   0.96    1.00  HTML_OBFUSCATION_90_100
  0.182   0.1874   0.0000    1.000   0.96    1.00  HTML_OBFUSCATION_80_90
  0.680   0.7013   0.0000    1.000   0.96    1.00  HTML_OBFUSCATION_70_80
  0.640   0.6597   0.0000    1.000   0.96    1.00  HTML_OBFUSCATION_60_70
  1.365   1.4076   0.0000    1.000   0.96    1.00  HTML_OBFUSCATION_50_60
  1.798   1.8538   0.0000    1.000   0.96    1.00  HTML_OBFUSCATION_40_50
  1.678   1.7297   0.0200    0.989   0.93    1.00  HTML_OBFUSCATION_30_40
  3.724   3.8342   0.1802    0.955   0.84    1.00  HTML_OBFUSCATION_20_30
  2.376   2.4285   0.6808    0.781   0.46    1.00  HTML_OBFUSCATION_10_20
 21.327  20.0276  63.2559    0.240   0.04    1.00  HTML_OBFUSCATION_00_10

So, thanks for the suggestions.  :-)
Comment 3 David Muir Sharnoff 2004-02-19 02:30:15 UTC
Counting HTML tags isn't the only use for a count rule.  I'm happy
to see that my examples are already taken care of.