SA Bugzilla – Bug 2853
Rewrite masses/ (in perl)
Last modified: 2019-08-01 06:53:14 UTC
Ideally it should be a lot easier to package and distribute the mass-check + evolve stuff. Would it be possible to have the evolve program read tests + scores after compilation? This would make it a lot easier for users to make their own scores.
I intend to re-write most of the masses/ stuff so that it's easier to use for end users. Namely, I intend to rewrite perceptron in perl, and fix up the other scripts so they don't use temp files all the time (there's a lot more reading and writing than i think is necessary). One major adavantage this will have is that a compiler won't be necessary to rescore everything. If anyone can think of some pitfalls I need to think about, I'd appreciate it if you could let me know. Or, if you think what I'm doing is useless... tell me. My goal is to create a spamassassin-tools/-utils package with this sort of thing so people can rescore themselves. One of the biggest weaknesses of SpamAssassin is that scores are hard to determine and get stale quickly after a release. By allowing users to make their own scores, we can prolong the "shelf life" of older version of SpamAssassin. This, combined with plugins should be quite useful to users. Ideally, SpamAssassin 3.0.x will be in Debian sarge. Since Debian releases so infrequently, it is important that this last quite a while.
Sounds generally like a good idea -- in particular, I'd suggest making mass-check easier to use. I'd like to see mass-check generating one output file, e.g. by adding a "ham"/"spam" indicator to the start of the line instead of keeping each in separate output files. Also the ancillary scripts -- fp-fn-statistics, hit-frequencies, etc. are a little complicated, and all of them make too many assumptions about their location, e.g. assumign that ../rules is the rules dir. However, rewriting the perceptron in perl gets -1 from me. IMO the C nature of the perceptron is not a big problem. Pretty much every Debian machine will have a C compiler available. Also, the amount of data (logs and scores) in RAM needs a good, compact and fast representation, and C works very very well for this; probably a lot better than perl can do without quite a bit of work. What is the problem, however, is that it currently requires a rebuild to include the hits and scores from the C files generated from logs-to-c. That should probably be fixed, so that the perceptron can be distributed as a binary and read that data at runtime. (as you said) IMO, the biggest problem for users of a system like this, will be in corpus management and mass-checking. the perceptron et al aren't too hard compared to that.
Subject: Re: Rewrite masses/ (in perl) On Tue, Mar 16, 2004 at 11:13:40AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > However, rewriting the perceptron in perl gets -1 from me. Um... Yeah, ditto to everything JM said, including the perceptron rewrite -1... (stupid stealing my thunder ... <G>)
ha! need to comment faster ;)
Subject: Re: Rewrite masses/ (in perl) As far as the perceptron in perl thing goes, I'm probably still going to try it to see how slow it is. My first goal will be to amalgamate many of those helper scripts -- hit-frequencies score-ranges-from-freqs logs-to-c, etc.
Subject: RE: Rewrite masses/ (in perl) My RFE, bug 3096, probably relates to this one, as well.
I suppose I should make sure this comment gets in bugzilla: if this is ready in time, I'd like to see it in 3.0.0. I'm going under the theory that this sort of change is exempt from the code freezes since it doesn't hold up the mass-checks (which should definitely be done using the old code, unless I do some significant work real soon) of course, this kind of change is R-T-C, so it's not entirely up to me... :-)
I'm almost done on this. I'd like to see it in 3.0.0. I need to do a little more testing and clean up the documenation, but my (quick) results suggest an average TCR of 64.622 (scoreset 2) which is better than the old stuff which had a TCR of 56.288. (I did make a bit of a change to the algorithm to score ranges to use soratio instead of rank, so that probably accounts for much of the difference.) I'm going to fix 3096 while I'm at it, so I'll mark that as a dupe.
*** Bug 3096 has been marked as a duplicate of this bug. ***
Subject: Re: Rewrite masses/ (in perl) -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > I'm almost done on this. I'd like to see it in 3.0.0. I need to do a > little more testing and clean up the documenation, but my (quick) results > suggest an average TCR of 64.622 (scoreset 2) which is better than the old > stuff which had a TCR of 56.288. (I did make a bit of a change to the > algorithm to score ranges to use soratio instead of rank, so that probably > accounts for much of the difference.) Cool! one thing though -- I found that relying entirely on soratio could be bad news when the hit-rates are too small; e.g. a test that hits 0.05% of spam and 0% ham may get a score of 4.0. This could be tricky if the training corpus isn't representative of other corpora, which might have some ham hits. Have you tested using 10FCV? that takes this into account. - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFApPoEQTcbUG5Y7woRAgpRAKCIbYO0+X61RuQY4bf1Ff5OFn6xRwCfTgdo 3OIwbtxaofwfJhRnC/9Ot2o= =vQ0i -----END PGP SIGNATURE-----
Subject: Re: Rewrite masses/ (in perl) > > I'm almost done on this. I'd like to see it in 3.0.0. I need to do a > > little more testing and clean up the documenation, but my (quick) results > > suggest an average TCR of 64.622 (scoreset 2) which is better than the old > > stuff which had a TCR of 56.288. (I did make a bit of a change to the > > algorithm to score ranges to use soratio instead of rank, so that probably > > accounts for much of the difference.) > > Cool! one thing though -- I found that relying entirely on soratio > could be bad news when the hit-rates are too small; e.g. > a test that hits 0.05% of spam and 0% ham may get a score of 4.0. > This could be tricky if the training corpus isn't representative > of other corpora, which might have some ham hits. Those numbers are based on 10FCV.
Subject: Re: Rewrite masses/ (in perl) -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >Those numbers are based on 10FCV. K, that's good then. ;) - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFApTEUQTcbUG5Y7woRAtr+AKCs5NmbZ+XifvWcwg5NzPkHPO39ZgCfU7cB BmL175yjtwtW7JdlKf6/+nY= =VVMA -----END PGP SIGNATURE-----
Created attachment 1960 [details] New masses/ stuff There's some stuff that needs to be fixed. I'm a little low on time right now, so if people could look at it and help find the bugs, that'd be great ;-)
Created attachment 1966 [details] Original perl implementation of perceptron.c This is the original implementation of the perceptron in perl. It is dog slow. It uses the same algorithm as the C version but takes approx. 60 times longer to run. Good luck with your own implementation.
60x slower? I vote for sticking with the C. *Any* slow-down is unacceptable since the perceptron is the major bottleneck when running a 10fcv which is slow enough even in the current development tree (and vastly faster than the GA, but that went without saying).
I agree that C would be best; although it would be better if someone fixed bug 3388 :-) (However I believe that your assertion that "*Any* slow-down is unacceptable" is false) Henry: thanks anyways. I remember that there was some talk of implementing perceptron in perl, but I'd forgotten that it had actually been done and that there really was such a slow down. I've changed the majority of the scripts in masses/ (I think i only missed the stuff in rule-QA) and I've written a README for end users. While re-writing, I added pod documentation for the "major" scripts -- the ones end users will actually need. They are: mass-check extract-message-from-mbox hit-frequencies lint-rules-from-freqs logs-to-c rewrite-cf-with-new-scores fp-fn-statistics I propose that these scripts as well as perceptron be distributed to the end users (in .deb's, .rpm's etc). I plan on putting this in a spamassassin-tools package for Debian. (I hope to upload a 3.0.0 cvs package with that to experimental by the end of this long weekend.) The major change is that most of the reading/computing stats stuff, etc is in lib/Mail/SpamAssassin/Masses.pm rather than duplicating the code in the scripts or using parse-rules-for-masses. Many of the scripts' options have been changed slightly for consistancy, and they no longer rely on @ARGV or shell piping/redirects. I'll post my changes (in giant patch format) later tonight after I finish one last big test.
Created attachment 1967 [details] Final (?) patch This patch is my latest copy. I think it's pretty much ready to go.
As I mentioned before, this would be R-T-C, so I'm seeking +1's. Thanks
BTW, I *do* like the idea of distributing and improving usability of these scripts by default, to lower the barrier-to-entry for mass-checking, running perceptron etc. I'm +1 on the patch contents. (I also am -1 on the idea of rewriting the perceptron. can't see the point in doing that, but I don't think that's what Duncan was really suggesting, fwiw ;) I'd suggest a further tweak, BTW: for the "masses" scripts that DAF is talking about putting into a -tools pkg, I suggest we rename them to use an "sa-" prefix. e.g. sa-mass-check, sa-hit-frequencies, et al. This helps to keep them logically together when installed, and is consistent with sa-learn. (we should probably do similar for the "tools" scripts too eventually). I'd also suggest some renaming to be clearer about what they do in reality, e.g.: mass-check sa-mass-scan extract-message-from-mbox sa-mbox-get hit-frequencies sa-msr-to-freqs lint-rules-from-freqs sa-rules-lint logs-to-c sa-msr-compile fp-fn-statistics sa-accuracy-report perceptron sa-perceptron rewrite-cf-with-new-scores sa-scores-rewrite also these scripts are very useful in rule dev, QA, and scoring: mass-check-results-to-mbox sa-msr-to-mbox overlap sa-msr-rule-overlap corpora/remove-tests-from-logs sa-msr-strip-rules (Note: "sa-rules-lint" should be modified to do a basic lint (without reading logs from STDIN) unless a (new) cmdline flag is supplied. Most of its linting behaviour does not require freqs data at all.) Also, would it be a good idea to come up with a file format name instead of "mass-check log file"? e.g. "MSR" for "mass-scan results"? (that's what sa-msr-to-mbox et al refers to.)
Subject: Re: [review] Rewrite masses/ (in perl) > I'd suggest a further tweak, BTW: for the "masses" scripts that DAF is > talking about putting into a -tools pkg, I suggest we rename them to > use an "sa-" prefix. e.g. sa-mass-check, sa-hit-frequencies, et al. > This helps to keep them logically together when installed, and is > consistent with sa-learn. (we should probably do similar for the > "tools" scripts too eventually). Only rename them when installed, please! > I'd also suggest some renaming to be clearer about what they do in > reality, e.g.: > > mass-check sa-mass-scan current name seems fine. > extract-message-from-mbox sa-mbox-get We have another script to do the same thing, "mboxget" which is generally easier to use, the only thing missing from it is the ability to read directly from a .log (which is just egrep -v '#'|awk '{print $3}'). > hit-frequencies sa-msr-to-freqs "msr" is less clear to me. > lint-rules-from-freqs sa-rules-lint > logs-to-c sa-msr-compile > fp-fn-statistics sa-accuracy-report Can we also fix the FP and FN numbers? Only print the correct one! :-) You know, I'm not really in favor of mass-renaming stuff. Rename the things that are completely misnamed, but leave it at that.
Subject: Re: [review] Rewrite masses/ (in perl) -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Daniel Quinlan writes: > > I'd suggest a further tweak, BTW: for the "masses" scripts that DAF is > > talking about putting into a -tools pkg, I suggest we rename them to > > use an "sa-" prefix. e.g. sa-mass-check, sa-hit-frequencies, et al. > > This helps to keep them logically together when installed, and is > > consistent with sa-learn. (we should probably do similar for the > > "tools" scripts too eventually). > > Only rename them when installed, please! No -- that's kind of pointless, because then any scripts or guidelines for running mass-checks/perceptron etc. would work only for 1 style or the other of installed or from masses dir. > > I'd also suggest some renaming to be clearer about what they do in > > reality, e.g.: > > > > mass-check sa-mass-scan > > current name seems fine. IMO, "check" is too generic. it's scanning. > > extract-message-from-mbox sa-mbox-get > > We have another script to do the same thing, "mboxget" which is > generally easier to use, the only thing missing from it is the ability > to read directly from a .log (which is just egrep -v '#'|awk '{print $3}'). good point. > > hit-frequencies sa-msr-to-freqs > > "msr" is less clear to me. > > > lint-rules-from-freqs sa-rules-lint > > logs-to-c sa-msr-compile > > fp-fn-statistics sa-accuracy-report > > Can we also fix the FP and FN numbers? Only print the correct one! :-) yes! good point. (drives me nuts, that one ;) > You know, I'm not really in favor of mass-renaming stuff. Rename the > things that are completely misnamed, but leave it at that. it seems like a good opportunity to do this to me, IMO... - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFAsnsoQTcbUG5Y7woRAk6oAKCoClyilqwklC5f8+VyK1kQgR4siwCeLWGY jU8M7lPyXnHTgDmB32053zQ= =gNX8 -----END PGP SIGNATURE-----
Subject: Re: [review] Rewrite masses/ (in perl) On Mon, May 24, 2004 at 03:46:24PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > IMO, "check" is too generic. it's scanning. Well, actually it is running check(). That's where I figured the name came from actually. :) So it becomes a semantics issue really imo.
Subject: Re: [review] Rewrite masses/ (in perl) -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >> IMO, "check" is too generic. it's scanning. > >Well, actually it is running check(). That's where I figured the name >came from actually. :) So it becomes a semantics issue really imo. true -- although we were talking a while back about renaming that too (although that's kind of fallen off the list as a priority). Also, a command-line script has a higher "API visibility level" than an internal method name, if you get me... - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFAsodlQTcbUG5Y7woRAqEZAJ9ekYOW+JuvWWcOmy3oQpVWEDneZQCdFNZn 9EDuQfF16mvMB+aIPPZ+Tz4= =w/df -----END PGP SIGNATURE-----
Re: mboxget vs extract-message-from-mbox: I kinda merged the two scripts, so the extract-message-from-mbox as shown in the patch should be pretty much like the current mboxget + it has the ability to read mass-check logs Re: adding sa- prefix: I'm relatively indifferent on this. It's probably a good idea, although I was originally planning on having them put into /usr/share/spamassassin-tools rather than /usr/bin to avoid cluttering the namespace. Renaming them makes them more accessible. So I'll vote for renaming (and only renaming the ones that are going to be distributed). I'm not currently distributing the scripts in tools, although I may wish to now (we should probably rename them then) Re: mass-check vs mass-scan: I like the ring to mass-check better, but mass-scan is probably more logical. Re: FP/FN numbers: I'm not sure what you're talking about. btw fp-fn-statistics is essentially what logs-to-c --count used to be. Re: MSR: I agree there should be a better name, but I'm not sure this is it. But in the absence of a better suggestion.... Someone want to give me another +1 so I can commit what I've got and we can worry about the rest later, or do you feel you want to see another patch?
comment 24 was me of course
Subject: Re: [review] Rewrite masses/ (in perl) -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Re: adding sa- prefix: I'm relatively indifferent on this. It's probably > a good idea, although I was originally planning on having them put into > /usr/share/spamassassin-tools rather than /usr/bin to avoid cluttering > the namespace. Renaming them makes them more accessible. So I'll vote > for renaming (and only renaming the ones that are going to be > distributed). I'm not currently distributing the scripts in tools, > although I may wish to now (we should probably rename them then) yeah. I was thinking that'd be a separate ticket anyway ;) > Re: FP/FN numbers: I'm not sure what you're talking about. btw > fp-fn-statistics is essentially what logs-to-c --count used to be. FP/FN report outputs *two* numbers on the "false positives"/"false negatives" lines; one is "false negs in the spam collection", one is "false negs in the combined spam+ham collection". The latter number is absolutely useless ;) > Re: MSR: I agree there should be a better name, but I'm not sure this is > it. But in the absence of a better suggestion.... yeah, I'd be happy to get other ideas for that... > Someone want to give me another +1 so I can commit what I've got and we > can worry about the rest later, or do you feel you want to see another > patch? I'm ok with working from this. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFAsqFmQTcbUG5Y7woRAjLeAJ0TNu3rrd2Omcb+ykPGZo6pHCTImQCfZOoH H/aVc/+844K47kleQ9E2gbk= =erJ7 -----END PGP SIGNATURE-----
Subject: Re: [review] Rewrite masses/ (in perl) > No -- that's kind of pointless, because then any scripts or guidelines > for running mass-checks/perceptron etc. would work only for 1 style > or the other of installed or from masses dir. Then let's not rename them at all. I am certain that these are *not* ready for prime time anyway, so let's stick with the current names and install them into /usr/share/spamassassin or the like for now. > IMO, "check" is too generic. it's scanning. I think check and scan are the same thing. Changing the name is just going to annoy people who use it. > it seems like a good opportunity to do this to me, IMO... We can do it later. -1 on renaming en masse.
Subject: Re: [review] Rewrite masses/ (in perl) > Re: mboxget vs extract-message-from-mbox: I kinda merged the two > scripts, so the extract-message-from-mbox as shown in the patch should > be pretty much like the current mboxget + it has the ability to read > mass-check logs Okay. Let's name the final script "mboxget". The other name is out of control length-wise (for something I run so often). > Re: adding sa- prefix: I'm relatively indifferent on this. It's > probably a good idea, although I was originally planning on having > them put into /usr/share/spamassassin-tools rather than /usr/bin to > avoid cluttering the namespace. I like your original plan MUCH better. I don't think adding sa-* to anything we don't actively intend users to run is a good idea and these scripts are far from there, that's a post-3.0 thing. > Renaming them makes them more accessible. So I'll vote for renaming > (and only renaming the ones that are going to be distributed). I'm not > currently distributing the scripts in tools, although I may wish to > now (we should probably rename them then) No. -1 on renaming. > Re: mass-check vs mass-scan: I like the ring to mass-check better, but > mass-scan is probably more logical. mass-check :-) > Re: FP/FN numbers: I'm not sure what you're talking about. btw > fp-fn-statistics is essentially what logs-to-c --count used to be. There is more than one number reported. The one we want is FP / (all ham) and FN / (all spam). (IIRC, it's the second one on the line.) > Re: MSR: I agree there should be a better name, but I'm not sure this > is it. But in the absence of a better suggestion.... Something like "results". No incomprehensible acronyms. > Someone want to give me another +1 so I can commit what I've got and we can > worry about the rest later, or do you feel you want to see another patch? -0.1 only because I want to see another patch.
ok, I'll give in on my suggestions. But I'm still +1 on Duncan's patch, since it came before I started throwing in crazy ideas. Dan, what exactly do you want to see in a new patch? Is it just the change of extract-message-from-mbox to be more like mboxget? Also, one other thing: if we install the masses scripts into $PREFIX/share/spamassassin-tools, note that we will still need to make one further change: they will have to be built from .raw versions by the top-level Makefile, like sa-learn, spamassassin and spamd, in order to take care of the usual stuff like: - sedding the #!/usr/bin/perl line, for when the user used a perl that isn't /usr/bin/perl in Makefile.PL - looking for the rules dir in a directory != /usr/share/spamassassin and != ../rules, for when PREFIX ne '/usr' - looking for the libs in a dir that isn't in @INC for when PREFIX ne '/usr'
Subject: Re: [review] Rewrite masses/ (in perl) > Also, one other thing: if we install the masses scripts into > $PREFIX/share/spamassassin-tools, note that we will still need to make one > further change: they will have to be built from .raw versions by the top-level > Makefile, like sa-learn, spamassassin and spamd, in order to take care of the > usual stuff like: > > - sedding the #!/usr/bin/perl line, for when the user used a perl that isn't > /usr/bin/perl in Makefile.PL > - looking for the rules dir in a directory != /usr/share/spamassassin and != > ../rules, for when PREFIX ne '/usr' > - looking for the libs in a dir that isn't in @INC for when PREFIX ne '/usr' I don't think this is necssary. I wasn't even going to suggest that this go in the Makefile. (although it probably should, but not by default). None of the scripts guess where the rules dir is, other than mass-check which relys on Mail::SpamAssassin to do the dirty work. I hadn't thought about point 3, but it makes it a lot more complex than necessary. I figure these scripts are optional, and if you want them, you get to figure out how to install them. ;-)
I'd suggest just changing Makefile.PL to build them, to avoid bug reports. #3 will be a killer otherwise. Here's a demo: 1. user runs "perl Makefile.PL PREFIX=~/sausr SYSCONFDIR=~/saetc" for example, to run sa from their home dir. (could be any other dir) 2. "make; make install" 3. tries to use mass-check. mass-check fails with "can't find module Mail::SpamAssassin" or similar there's no way for the user to fix that, apart from "vi /usr/share/spamassassin-tools/mass-check" and adding a "use lib" line. :( #2 is also bad, since Mail:SpamAssassin won't guess the location of the rules dir correctly in this case. (and this case is supported in the SA INSTALL file.)
Created attachment 1977 [details] Next Patch This is similar to the last one, but contains a couple more scripts and combines extract-message-from-mbox, mboxget and mass-check-results-to-mbox into one script. Still to do: figure out the Makefile.PL stuff (Malte? you interested?) - although we may need to discuss this a bit, it shouldnt happen by default. I'd like to commit this before we figure out the Makefile stuff, so it can get some wider testing.
+1 on applying that patch.
my only problem with this patch is that it'd be going in now, so close to release time. the original version has been beaten for a long time, and I don't want to put in a new version, do relatively little testing, then ask for mass-checks and such.
+1 on adding this to the trunk for 3.1.0. once we fork 3.0.0. It's a great patch! -1 on adding this change to 3.0.0, it's just too close to release. I'm afraid this could end up delaying the release for weeks. Even if we delay for a week, it could still mean a delay of weeks. I changed the milestone to 3.1.0.
moving to 3.1.0 milestone -- I think it'd be good to check this in reasonably soon to trunk.
Subject: Re: [review] Rewrite masses/ (in perl) On Wed, Oct 13, 2004 at 04:02:20PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > moving to 3.1.0 milestone -- I think it'd be good to check this in reasonably > soon to trunk. FWIW, the current patch might be a bit buggy... I think there's some stuff I might have tweaked, but unfortunately deleted :-( Anyways, I have little time now, if someone wants to pick up the torch, that'd be good. The more I think about it though, I feel my approach is perhaps a bit cumbersome and reworking it might be a good idea. (Perhaps people can review it and see what they think.)
bumping this to future milestone, too late in 3.1 cycle to do this
move bug to Future milestone (previously set to Future -- I hope)
Duncan, are you redoing this patch? ISTR seeing some mailing list traffic that suggested it. it might be worth just opening a branch and working there, to avoid the bit-rot that happened here. I really would like to see these changes getting into trunk at some stage ;)
Oh wow... I haven't looked at this bug in a while! I was pretty sure it was a lot further from complete than my comments seem to indicate. My current project is more along the lines of replacing the perceptron, but obviously cleanups in masses/ may be useful in doing this. :-) I'll see what happens; I'm not really planning on devoting extraordinary amounts of time to this, but I have a tendency of being bored during the exam period. :-)
removing "[review]" tag -- I doubt this is just in review anymore
Closing old stale bug. Any remaining uncommitted code is probably deprecated long time ago.