SA Bugzilla – Bug 5123
should 'rules/72_active.cf' be distributed in the distro tarball?
Last modified: 2006-12-28 09:46:17 UTC
As discussed on the dev list last month, Theo believes that we need to trim back some of the more baroque aspects of the rules project work. I'm happy enough with that. Here's what's suggested (pasted from earlier mails on dev): Me: 'Hmm -- I think I'd need more details of how that'd work [..]' Theo: 'Keep 3.1 the way it currently is, semi-revert 3.2 to have a rules directory and we'll put a minimal set of 3.2 rules in there. External to the normal distro, we have code that generates updates based on the mass-check results (or whatever else we want to base them on). [on what files are distributed:] I think our current methodology of: - user downloads the engine, installs it, it comes with a core set of rules - user is encouraged to run sa-update, and then run it periodically as well to get the complete set of rules works well. It doesn't require us to have everything in the engine distribution though.' So in other words, the distributed tarball consists of: - engine - core rules in "rules/" directory the SVN build would consist of: - engine - core rules in "rules/" directory - an SVN external for the "rulesrc/" directory - and the build/mkrules script to compile the latter the "sa-update" distributed rules tarballs would have: - core rules in "rules/" directory - plus the 'active' rules in "rules/" directory (in other words, the compiled output from build/mkrules) Another issue is version-dependent rulesets; these would move back out of rulesrc, and back into the "rules/" directory, alongside the engine whose code they depend on. In terms of implementation, it's pretty simple: 1. Move the core rules files from "rulesrc/core" back into "rules". 1a. (Maybe trim out the rules that are currently showing up as unusable in the rule-QA app, since there's doubtless some rule rot by now...) 2. change Makefile.PL to not run "build/mkrules" unless the "rulesrc" directory is present; that will only be the case with an SVN checkout rather than a "make dist" dir or tarball.
aiming at 3.2.0 as a blocker; this is something we need to decide before that release
is the silence indicative of general happiness? Warnock's dilemma strikes ;)
Sounds like a reasonable plan of action to me.
(In reply to comment #3) > Sounds like a reasonable plan of action to me. OK. the low-hanging fruit is now complete -- that's part (1) and (3). next thing to do is (2) spend some time going through http://ruleqa.spamassassin.org/20061018-r465178-n , and removing the rules that are greyed-out, in the core ruleset, and not required for other purposes. Actually, there may also be a step 4. currently, rules/72_active.cf -- ie. the compiled "good enough" rules from the rulesrc/sandbox/* tree -- is being distributed in distro tarballs; however, I think Theo was considering that it shouldn't be, and instead should only be shipped in sa-update tarballs. Theo, are you really keen on that idea? I don't think it's too important, since it will probably work fine the way it is now anyway...
Created attachment 3722 [details] totrim.xml here's the list of rules I plan to remove from the core ruleset, in XML format (with ham percent hitrate, spam%, and S/O). Shout if you see any in there that you like and want to keep. Rules that are dependencies of "good" metas will be preserved, or possibly turned into subrules.
Created attachment 3723 [details] rules turned into meta subrules (some deleted instead if no metas still use them) committed as r467038 and r467043, removing some 220-or-so rules! phew.
Created attachment 3724 [details] script to trim rules based on lines of XML for reference, (or for next time), here's a script I used. I cut and pasted the XML output in the comments of the 20061018-r465178-n ruleqa freqs output, trimmed it down to the specific rules I wanted to get rid of, rescued a few manually, then saved the file as "totrim.xml" and used this script like so: ./trim-rules-from-xml rules/*.cf
Created attachment 3725 [details] similar script, to turn rules into subrules and this script does similar, but turning rules into subrules (the first script will warn about the cases where this is required). ./make-rules-into-subrules-from-xml rules/*.cf then some manual cleanup, based off "svn diff" and "make test", was required. Finally, I have not removed these rules, even though the freqs would suggest that it's appropriate, because I'm not sure I see why they're FPing -- if anyone has time, it might be worth checking: <rule><test>RCVD_ILLEGAL_IP</test><promo>0</promo> <spc>1.2311</spc><hpc>0.1258</hpc><so>0.907</so> <detailhref esc='1'>%2F20061018-r465178-n%2FRCVD_ILLEGAL_IP%2Fdetail%3Fxml%3D1</detailhref></rule> <rule><test>URI_HEX</test><promo>0</promo> <spc>1.4336</spc><hpc>0.1319</hpc><so>0.916</so> <detailhref esc='1'>%2F20061018-r465178-n%2FURI_HEX%2Fdetail%3Fxml%3D1</detailhref></rule> <rule><test>DATE_IN_PAST_96_XX</test><promo>0</promo> <spc>1.2259</spc><hpc>0.0921</hpc><so>0.930</so> <detailhref esc='1'>%2F20061018-r465178-n%2FDATE_IN_PAST_96_XX%2Fdetail%3Fxml%3D1</detailhref></rule> <rule><test>FROM_EXCESS_BASE64</test><promo>0</promo> <spc>1.0240</spc><hpc>0.0644</hpc><so>0.941</so> <detailhref esc='1'>%2F20061018-r465178-n%2FFROM_EXCESS_BASE64%2Fdetail%3Fxml%3D1</detailhref></rule> <rule><test>DRUGS_ERECTILE_OBFU</test><promo>0</promo> <spc>1.0855</spc><hpc>0.0675</hpc><so>0.941</so> <detailhref esc='1'>%2F20061018-r465178-n%2FDRUGS_ERECTILE_OBFU%2Fdetail%3Fxml%3D1</detailhref></rule> <rule><test>IP_LINK_PLUS</test><promo>0</promo> <spc>0.1760</spc><hpc>0.0399</hpc><so>0.815</so> <detailhref esc='1'>%2F20061018-r465178-n%2FIP_LINK_PLUS%2Fdetail%3Fxml%3D1</detailhref></rule> <rule><test>WEIRD_PORT</test><promo>0</promo> <spc>0.3923</spc><hpc>0.1933</hpc><so>0.670</so> <detailhref esc='1'>%2F20061018-r465178-n%2FWEIRD_PORT%2Fdetail%3Fxml%3D1</detailhref></rule>
yeesh -- chasing up the eval code into Html.pm has beaten me. The following rules use the following flags in that module: HTML_00_10 HTML_ATTR_BAD HTML_ATTR_UNIQUE HTML_BACKHAIR_2 HTML_BACKHAIR_4 HTML_BACKHAIR_8 HTML_EVENT_UNSAFE HTML_FONT_BIG HTML_FONT_FACE_CAPS HTML_FONT_INVISIBLE HTML_FONT_TINY HTML_SHOUTING3 HTML_SHOUTING4 HTML_SHOUTING5 HTML_SHOUTING6 HTML_SHOUTING7 HTML_TEXT_AFTER_BODY HTML_TEXT_AFTER_HTML flags: attr_unique_bad attr_bad ratio html_event_unsafe big_font font_face_caps font_invisible tiny_font max_shouting text_after_body text_after_html I'm going to leave the code in there; I don't think it slows us down much, and it's a really mucky task :(
lowering pri, now the main stuff is done; just the "72_active.cf in distro tarball or not" question remaining.
fwiw, I did the mucky task: : jm 1009...; svn commit -m "bug 5123: remove some more vestiges of now-obsolete eval rules, namely the following HTML flags and range variables: attr_seen_* attr_unique_bad attr_bad html_event_unsafe big_font font_face_caps font_invisible tiny_font max_shouting text_after_body text_after_html" lib/Mail/SpamAssassin/HTML.pm Sending lib/Mail/SpamAssassin/HTML.pm Transmitting file data . Committed revision 474920. it had a nice side-effect of reducing scanner memory usage by 44KB.
There's abunch of rules which are nciely used in Metas. I don't think its a good idea to remove them totally. Maybe keep in separate file and score them 0.01 Stuff like HTML_IMAGE_RATIO_06 come in very handy agains stock spams which combined with *for example* __ANY_OUTLOOK_MUA and some Msg-Id rule. here's some examples I have in production: META_OENOBDARY (__ANY_OUTLOOK_MUA && MIME_MISSING_BOUNDARY) ยจ#would stop working if you remove MIME_MISSING_BOUNDARY META_ULESSBIZ (BIZ_TLD && !__ANY_OUTLOOK_MUA && __HAS_URI && __HAS_X_MAILER!) # would be rendered useless without BIZ_TLD I doubt I'm the only one who uses stuff like this to add the extra point to a final score. If there's any chance of keepigng these rules, I'd work my way thru the list and a large collection of Metas and check if & what is worth keeping. Alex
I may see if we can resurrect those HTML_IMAGE_RATIO_* rules as meta subrules, for Alex. Let's keep that a separate issue, in bug 5242. in the meantime, back to the other issue... I'm going to vote that we should distribute rules/72_active.cf in the main distro tarball. Basically, otherwise, when we run the Perceptron at release time, it will only be able to make scores for the rules in the main tarball, not taken from rulesrc via 72_active.cf. This would mean that some of the best rules in the ruleset would go unscored. Until we have a way to Perceptron-optimise rules for published rule updates, we can't do this.
ok, my last suggestion garnered no complaints, so let's go that way ;)