Bug 5123 - should 'rules/72_active.cf' be distributed in the distro tarball?
Summary: should 'rules/72_active.cf' be distributed in the distro tarball?
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P5 minor
Target Milestone: 3.2.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 4681
  Show dependency tree
 
Reported: 2006-10-10 11:25 UTC by Justin Mason
Modified: 2006-12-28 09:46 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
totrim.xml text/plain None Justin Mason [HasCLA]
rules turned into meta subrules (some deleted instead if no metas still use them) text/plain None Justin Mason [HasCLA]
script to trim rules based on lines of XML text/plain None Justin Mason [HasCLA]
similar script, to turn rules into subrules text/plain None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2006-10-10 11:25:16 UTC
As discussed on the dev list last month, Theo believes that we need
to trim back some of the more baroque aspects of the rules project work.
I'm happy enough with that.  Here's what's suggested (pasted from
earlier mails on dev):

  Me: 'Hmm -- I think I'd need more details of how that'd work [..]'

  Theo: 'Keep 3.1 the way it currently is, semi-revert 3.2 to have a rules
  directory and we'll put a minimal set of 3.2 rules in there.
  External to the normal distro, we have code that generates updates based on
  the mass-check results (or whatever else we want to base them on).

  [on what files are distributed:] I think our current methodology of:

  - user downloads the engine, installs it, it comes with a core set of rules
  - user is encouraged to run sa-update, and then run it periodically as well
    to get the complete set of rules

  works well.  It doesn't require us to have everything in the engine
  distribution though.'



So in other words, the distributed tarball consists of:

  - engine
  - core rules in "rules/" directory

the SVN build would consist of:

  - engine
  - core rules in "rules/" directory
  - an SVN external for the "rulesrc/" directory
  - and the build/mkrules script to compile the latter

the "sa-update" distributed rules tarballs would have:

  - core rules in "rules/" directory
  - plus the 'active' rules in "rules/" directory (in other words, the compiled
    output from build/mkrules)


Another issue is version-dependent rulesets; these would move back
out of rulesrc, and back into the "rules/" directory, alongside
the engine whose code they depend on.

In terms of  implementation, it's pretty simple:

1. Move the core rules files from "rulesrc/core" back into "rules".

1a. (Maybe trim out the rules that are currently showing up as
unusable in the rule-QA app, since there's doubtless some rule rot
by now...)

2. change Makefile.PL to not run "build/mkrules" unless the "rulesrc"
directory is present; that will only be the case with an SVN
checkout rather than a "make dist" dir or tarball.
Comment 1 Justin Mason 2006-10-11 02:53:18 UTC
aiming at 3.2.0 as a blocker; this is something we need to decide before that
release
Comment 2 Justin Mason 2006-10-12 10:04:51 UTC
is the silence indicative of general happiness?  Warnock's dilemma strikes ;)
Comment 3 Daryl C. W. O'Shea 2006-10-13 11:23:39 UTC
Sounds like a reasonable plan of action to me.
Comment 4 Justin Mason 2006-10-20 08:22:28 UTC
(In reply to comment #3)
> Sounds like a reasonable plan of action to me.

OK.  the low-hanging fruit is now complete -- that's part (1) and (3).  next
thing to do is (2) spend some time going through
http://ruleqa.spamassassin.org/20061018-r465178-n , and removing the rules that
are greyed-out, in the core ruleset, and not required for other purposes.

Actually, there may also be a step 4.  currently, rules/72_active.cf -- ie. the
compiled "good enough" rules from the rulesrc/sandbox/* tree -- is being
distributed in distro tarballs; however, I think Theo was considering that it
shouldn't be, and instead should only be shipped in sa-update tarballs.

Theo, are you really keen on that idea?  I don't think it's too important, since
it will probably work fine the way it is now anyway...
Comment 5 Justin Mason 2006-10-20 11:22:58 UTC
Created attachment 3722 [details]
totrim.xml

here's the list of rules I plan to remove from the core ruleset, in XML
format (with ham percent hitrate, spam%, and S/O).   Shout if
you see any in there that you like and want to keep. 

Rules that are dependencies of "good" metas will be preserved, or possibly
turned into subrules.
Comment 6 Justin Mason 2006-10-23 10:39:15 UTC
Created attachment 3723 [details]
rules turned into meta subrules (some deleted instead if no metas still use them)

committed as r467038 and r467043, removing some 220-or-so rules!  phew.
Comment 7 Justin Mason 2006-10-23 10:41:49 UTC
Created attachment 3724 [details]
script to trim rules based on lines of XML

for reference, (or for next time), here's a script I used. I cut and pasted the
XML output in the comments of the 20061018-r465178-n ruleqa freqs output,
trimmed it down to the specific rules I wanted to get rid of, rescued a few
manually, then saved the file as "totrim.xml" and used this script like so:

./trim-rules-from-xml rules/*.cf
Comment 8 Justin Mason 2006-10-23 10:46:02 UTC
Created attachment 3725 [details]
similar script, to turn rules into subrules

and this script does similar, but turning rules into subrules (the first
script will warn about the cases where this is required).

./make-rules-into-subrules-from-xml rules/*.cf

then some manual cleanup, based off "svn diff" and "make test",
was required.

Finally, I have not removed these rules, even though the freqs would suggest
that it's appropriate, because I'm not sure I see why they're FPing -- if
anyone has time, it might be worth checking:

<rule><test>RCVD_ILLEGAL_IP</test><promo>0</promo>
<spc>1.2311</spc><hpc>0.1258</hpc><so>0.907</so> <detailhref
esc='1'>%2F20061018-r465178-n%2FRCVD_ILLEGAL_IP%2Fdetail%3Fxml%3D1</detailhref></rule>

<rule><test>URI_HEX</test><promo>0</promo>
<spc>1.4336</spc><hpc>0.1319</hpc><so>0.916</so> <detailhref
esc='1'>%2F20061018-r465178-n%2FURI_HEX%2Fdetail%3Fxml%3D1</detailhref></rule>
<rule><test>DATE_IN_PAST_96_XX</test><promo>0</promo>
<spc>1.2259</spc><hpc>0.0921</hpc><so>0.930</so> <detailhref
esc='1'>%2F20061018-r465178-n%2FDATE_IN_PAST_96_XX%2Fdetail%3Fxml%3D1</detailhref></rule>

<rule><test>FROM_EXCESS_BASE64</test><promo>0</promo>
<spc>1.0240</spc><hpc>0.0644</hpc><so>0.941</so> <detailhref
esc='1'>%2F20061018-r465178-n%2FFROM_EXCESS_BASE64%2Fdetail%3Fxml%3D1</detailhref></rule>

<rule><test>DRUGS_ERECTILE_OBFU</test><promo>0</promo>
<spc>1.0855</spc><hpc>0.0675</hpc><so>0.941</so> <detailhref
esc='1'>%2F20061018-r465178-n%2FDRUGS_ERECTILE_OBFU%2Fdetail%3Fxml%3D1</detailhref></rule>

<rule><test>IP_LINK_PLUS</test><promo>0</promo>
<spc>0.1760</spc><hpc>0.0399</hpc><so>0.815</so> <detailhref
esc='1'>%2F20061018-r465178-n%2FIP_LINK_PLUS%2Fdetail%3Fxml%3D1</detailhref></rule>

<rule><test>WEIRD_PORT</test><promo>0</promo>
<spc>0.3923</spc><hpc>0.1933</hpc><so>0.670</so> <detailhref
esc='1'>%2F20061018-r465178-n%2FWEIRD_PORT%2Fdetail%3Fxml%3D1</detailhref></rule>
Comment 9 Justin Mason 2006-10-23 11:21:24 UTC
yeesh -- chasing up the eval code into Html.pm has beaten me.  The following
rules use the following flags in that module:

HTML_00_10               HTML_ATTR_BAD           HTML_ATTR_UNIQUE
HTML_BACKHAIR_2          HTML_BACKHAIR_4
HTML_BACKHAIR_8          HTML_EVENT_UNSAFE       HTML_FONT_BIG
HTML_FONT_FACE_CAPS      HTML_FONT_INVISIBLE     HTML_FONT_TINY
HTML_SHOUTING3           HTML_SHOUTING4          HTML_SHOUTING5
HTML_SHOUTING6           HTML_SHOUTING7          HTML_TEXT_AFTER_BODY
HTML_TEXT_AFTER_HTML

flags:

        attr_unique_bad attr_bad ratio html_event_unsafe big_font
        font_face_caps font_invisible tiny_font max_shouting text_after_body
        text_after_html

I'm going to leave the code in there; I don't think it slows us down
much, and it's a really mucky task :(
Comment 10 Justin Mason 2006-10-23 14:06:01 UTC
lowering pri, now the main stuff is done; just the "72_active.cf in distro
tarball or not" question remaining.
Comment 11 Justin Mason 2006-11-14 10:59:03 UTC
fwiw, I did the mucky task:

: jm 1009...; svn commit -m "bug 5123: remove some more vestiges of now-obsolete
eval rules, namely the following HTML flags and range variables: attr_seen_*
attr_unique_bad attr_bad html_event_unsafe big_font font_face_caps
font_invisible tiny_font max_shouting text_after_body text_after_html"
lib/Mail/SpamAssassin/HTML.pm
Sending        lib/Mail/SpamAssassin/HTML.pm
Transmitting file data .
Committed revision 474920.

it had a nice side-effect of reducing scanner memory usage by 44KB.
Comment 12 Alex Broens 2006-11-28 11:12:19 UTC
There's abunch of rules which are nciely used in Metas. I don't think its a good
idea to remove them totally. Maybe keep in separate file and score them 0.01

Stuff like HTML_IMAGE_RATIO_06 come in very handy agains stock spams which
combined with *for example*  __ANY_OUTLOOK_MUA and some Msg-Id rule.

here's some examples I have in production:

META_OENOBDARY      (__ANY_OUTLOOK_MUA && MIME_MISSING_BOUNDARY) ยจ#would stop
working if you remove MIME_MISSING_BOUNDARY

META_ULESSBIZ        (BIZ_TLD && !__ANY_OUTLOOK_MUA && __HAS_URI &&
__HAS_X_MAILER!) # would be rendered useless without BIZ_TLD

I doubt I'm the only one who uses stuff like this to add the extra point to a
final score.

If there's any chance of keepigng these rules, I'd work my way thru the list and
a large collection of Metas and check if & what is worth keeping.

Alex
Comment 13 Justin Mason 2006-12-13 04:43:20 UTC
I may see if we can resurrect those HTML_IMAGE_RATIO_* rules as meta subrules,
for Alex. Let's keep that a separate issue, in bug 5242.

in the meantime, back to the other issue... I'm going to vote that we should
distribute rules/72_active.cf in the main distro tarball.  Basically, otherwise,
when we run the Perceptron at release time, it will only be able to make scores
for the rules in the main tarball, not taken from rulesrc via 72_active.cf.

This would mean that some of the best rules in the ruleset would go unscored.
Until we have a way to Perceptron-optimise rules for published rule updates, we
can't do this.
Comment 14 Justin Mason 2006-12-28 09:46:17 UTC
ok, my last suggestion garnered no complaints, so let's go that way ;)