Bug 2853 - Rewrite masses/ (in perl)
Summary: Rewrite masses/ (in perl)
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Building & Packaging (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P5 enhancement
Target Milestone: Future
Assignee: Duncan Findlay
URL:
Whiteboard:
Keywords:
: 3096 (view as bug list)
Depends on: 3388
Blocks: 3208
  Show dependency tree
 
Reported: 2003-12-17 22:49 UTC by Duncan Findlay
Modified: 2019-08-01 06:53 UTC (History)
3 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
New masses/ stuff patch None Duncan Findlay [HasCLA]
Original perl implementation of perceptron.c text/plain None Henry Stern [HasCLA]
Final (?) patch patch None Duncan Findlay [HasCLA]
Next Patch patch None Duncan Findlay [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Duncan Findlay 2003-12-17 22:49:01 UTC
Ideally it should be a lot easier to package and distribute the mass-check +
evolve stuff. Would it be possible to have the evolve program read tests +
scores after compilation? This would make it a lot easier for users to make
their own scores.
Comment 1 Duncan Findlay 2004-03-16 08:18:41 UTC
I intend to re-write most of the masses/ stuff so that it's easier to use for
end users. Namely, I intend to rewrite perceptron in perl, and fix up the other
scripts so they don't use temp files all the time (there's a lot more reading
and writing than i think is necessary).

One major adavantage this will have is that a compiler won't be necessary to
rescore everything.

If anyone can think of some pitfalls I need to think about, I'd appreciate it if
you could let me know. Or, if you think what I'm doing is useless... tell me. My
goal is to create a spamassassin-tools/-utils package with this sort of thing so
people can rescore themselves.

One of the biggest weaknesses of SpamAssassin is that scores are hard to
determine and get stale quickly after a release. By allowing users to make their
own scores, we can prolong the "shelf life" of older version of SpamAssassin.
This, combined with plugins should be quite useful to users.

Ideally, SpamAssassin 3.0.x will be in Debian sarge. Since Debian releases so
infrequently, it is important that this last quite a while.
Comment 2 Justin Mason 2004-03-16 11:13:38 UTC
Sounds generally like a good idea -- in particular, I'd suggest making
mass-check easier to use.  I'd like to see mass-check generating one
output file, e.g. by adding a "ham"/"spam" indicator to the start of the line
instead of keeping each in separate output files.

Also the ancillary scripts -- fp-fn-statistics, hit-frequencies, etc. are a
little complicated, and all of them make too many assumptions about their
location, e.g. assumign that ../rules is the rules dir.

However, rewriting the perceptron in perl gets -1 from me.

IMO the C nature of the perceptron is not a big problem.  Pretty much every
Debian machine will have a C compiler available.  Also, the amount of data (logs
and scores) in RAM needs a good, compact and fast representation, and C works
very very well for this; probably a lot better than perl can do without quite a
bit of work.

What is the problem, however, is that it currently requires a rebuild to include
the hits and scores from the C files generated from logs-to-c.  That should
probably be fixed, so that the perceptron can be distributed as a binary and
read that data at runtime. (as you said)

IMO, the biggest problem for users of a system like this, will be in corpus
management and mass-checking.  the perceptron et al aren't too hard compared to
that.
Comment 3 Theo Van Dinter 2004-03-16 11:47:24 UTC
Subject: Re:  Rewrite masses/ (in perl)

On Tue, Mar 16, 2004 at 11:13:40AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> However, rewriting the perceptron in perl gets -1 from me.

Um...  Yeah, ditto to everything JM said, including the perceptron
rewrite -1...  (stupid stealing my thunder ... <G>)

Comment 4 Justin Mason 2004-03-16 12:26:32 UTC
ha! need to comment faster ;)
Comment 5 Duncan Findlay 2004-03-16 13:23:13 UTC
Subject: Re:  Rewrite masses/ (in perl)

As far as the perceptron in perl thing goes, I'm probably still going
to try it to see how slow it is.

My first goal will be to amalgamate many of those helper scripts --
hit-frequencies score-ranges-from-freqs logs-to-c, etc.
Comment 6 Gary Funck 2004-03-17 08:31:19 UTC
Subject: RE:  Rewrite masses/ (in perl)


My RFE, bug 3096, probably relates to this one, as well.


Comment 7 Duncan Findlay 2004-04-21 16:01:48 UTC
I suppose I should make sure this comment gets in bugzilla:

if this is ready in time, I'd like to see it in 3.0.0. I'm going under the
theory that this sort of change is exempt from the code freezes since it doesn't
hold up the mass-checks (which should definitely be done using the old code,
unless I do some significant work real soon)

of course, this kind of change is R-T-C, so it's not entirely up to me... :-)
Comment 8 Duncan Findlay 2004-05-13 20:11:55 UTC
I'm almost done on this. I'd like to see it in 3.0.0. I need to do a little more
testing and clean up the documenation, but my (quick) results suggest an average
TCR of 64.622 (scoreset 2) which is better than the old stuff which had a TCR of
56.288.

(I did make a bit of a change to the algorithm to score ranges to use soratio
instead of rank, so that probably accounts for much of the difference.)

I'm going to fix 3096 while I'm at it, so I'll mark that as a dupe.
Comment 9 Duncan Findlay 2004-05-13 20:14:17 UTC
*** Bug 3096 has been marked as a duplicate of this bug. ***
Comment 10 Justin Mason 2004-05-14 09:56:09 UTC
Subject: Re:  Rewrite masses/ (in perl) 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> I'm almost done on this. I'd like to see it in 3.0.0. I need to do a
> little more testing and clean up the documenation, but my (quick) results
> suggest an average TCR of 64.622 (scoreset 2) which is better than the old
> stuff which had a TCR of 56.288. (I did make a bit of a change to the
> algorithm to score ranges to use soratio instead of rank, so that probably
> accounts for much of the difference.)

Cool!  one thing though -- I found that relying entirely on soratio
could be bad news when the hit-rates are too small; e.g.
a test that hits 0.05% of spam and 0% ham may get a score of 4.0.
This could be tricky if the training corpus isn't representative
of other corpora, which might have some ham hits.

Have you tested using 10FCV?  that takes this into account.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFApPoEQTcbUG5Y7woRAgpRAKCIbYO0+X61RuQY4bf1Ff5OFn6xRwCfTgdo
3OIwbtxaofwfJhRnC/9Ot2o=
=vQ0i
-----END PGP SIGNATURE-----

Comment 11 Duncan Findlay 2004-05-14 13:01:28 UTC
Subject: Re:  Rewrite masses/ (in perl)

> > I'm almost done on this. I'd like to see it in 3.0.0. I need to do a
> > little more testing and clean up the documenation, but my (quick) results
> > suggest an average TCR of 64.622 (scoreset 2) which is better than the old
> > stuff which had a TCR of 56.288. (I did make a bit of a change to the
> > algorithm to score ranges to use soratio instead of rank, so that probably
> > accounts for much of the difference.)
> 
> Cool!  one thing though -- I found that relying entirely on soratio
> could be bad news when the hit-rates are too small; e.g.
> a test that hits 0.05% of spam and 0% ham may get a score of 4.0.
> This could be tricky if the training corpus isn't representative
> of other corpora, which might have some ham hits.

Those numbers are based on 10FCV.
Comment 12 Justin Mason 2004-05-14 13:51:07 UTC
Subject: Re:  Rewrite masses/ (in perl) 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>Those numbers are based on 10FCV.

K, that's good then. ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFApTEUQTcbUG5Y7woRAtr+AKCs5NmbZ+XifvWcwg5NzPkHPO39ZgCfU7cB
BmL175yjtwtW7JdlKf6/+nY=
=VVMA
-----END PGP SIGNATURE-----

Comment 13 Duncan Findlay 2004-05-16 21:48:10 UTC
Created attachment 1960 [details]
New masses/ stuff

There's some stuff that needs to be fixed. I'm a little low on time right now,
so if people could look at it and help find the bugs, that'd be great ;-)
Comment 14 Henry Stern 2004-05-22 13:17:21 UTC
Created attachment 1966 [details]
Original perl implementation of perceptron.c

This is the original implementation of the perceptron in perl.	It is dog slow.
 It uses the same algorithm as the C version but takes approx. 60 times longer
to run.

Good luck with your own implementation.
Comment 15 Daniel Quinlan 2004-05-22 14:22:21 UTC
60x slower?  I vote for sticking with the C.

*Any* slow-down is unacceptable since the perceptron is the major bottleneck
when running a 10fcv which is slow enough even in the current development tree
(and vastly faster than the GA, but that went without saying).
Comment 16 Duncan Findlay 2004-05-22 21:48:30 UTC
I agree that C would be best; although it would be better if someone fixed bug
3388 :-) (However I believe that your assertion that "*Any* slow-down is
unacceptable" is false)

Henry: thanks anyways. I remember that there was some talk of implementing
perceptron in perl, but I'd forgotten that it had actually been done and that
there really was such a slow down.

I've changed the majority of the scripts in masses/ (I think i only missed the
stuff in rule-QA) and I've written a README for end users. While re-writing, I
added pod documentation for the "major" scripts -- the ones end users will
actually need. They are:
mass-check
extract-message-from-mbox
hit-frequencies
lint-rules-from-freqs
logs-to-c
rewrite-cf-with-new-scores
fp-fn-statistics

I propose that these scripts as well as perceptron be distributed to the end
users (in .deb's, .rpm's etc). I plan on putting this in a spamassassin-tools
package for Debian. (I hope to upload a 3.0.0 cvs package with that to
experimental by the end of this long weekend.)

The major change is that most of the reading/computing stats stuff, etc is in
lib/Mail/SpamAssassin/Masses.pm rather than duplicating the code in the scripts
or using parse-rules-for-masses. Many of the scripts' options have been changed
slightly for consistancy, and they no longer rely on @ARGV or shell
piping/redirects.

I'll post my changes (in giant patch format) later tonight after I finish one
last big test.
Comment 17 Duncan Findlay 2004-05-22 23:02:29 UTC
Created attachment 1967 [details]
Final (?) patch

This patch is my latest copy. I think it's pretty much ready to go.
Comment 18 Duncan Findlay 2004-05-22 23:03:15 UTC
As I mentioned before, this would be R-T-C, so I'm seeking +1's.

Thanks
Comment 19 Justin Mason 2004-05-24 13:34:54 UTC
BTW, I *do* like the idea of distributing and improving usability of these
scripts by default, to lower the barrier-to-entry for mass-checking, running
perceptron etc.

I'm +1 on the patch contents.

(I also am -1 on the idea of rewriting the perceptron.  can't see the point in
doing that, but I don't think that's what Duncan was really suggesting, fwiw ;)

I'd suggest a further tweak, BTW:  for the "masses" scripts that DAF is talking
about putting into a -tools pkg, I suggest we rename them to use an "sa-"
prefix.  e.g. sa-mass-check, sa-hit-frequencies, et al.  This helps to keep them
logically together when installed, and is consistent with sa-learn.  (we should
probably do similar for the "tools" scripts too eventually).

I'd also suggest some renaming to be clearer about what they do in reality, e.g.:

mass-check                         sa-mass-scan
extract-message-from-mbox          sa-mbox-get
hit-frequencies                    sa-msr-to-freqs
lint-rules-from-freqs              sa-rules-lint
logs-to-c                          sa-msr-compile
fp-fn-statistics                   sa-accuracy-report
perceptron                         sa-perceptron
rewrite-cf-with-new-scores         sa-scores-rewrite

also these scripts are very useful in rule dev, QA, and scoring:

mass-check-results-to-mbox         sa-msr-to-mbox
overlap                            sa-msr-rule-overlap
corpora/remove-tests-from-logs     sa-msr-strip-rules

(Note: "sa-rules-lint" should be modified to do a basic lint (without reading
logs from STDIN) unless a (new) cmdline flag is supplied.  Most of its linting
behaviour does not require freqs data at all.)

Also, would it be a good idea to come up with a file format name instead of
"mass-check log file"?  e.g. "MSR" for "mass-scan results"?  (that's what
sa-msr-to-mbox et al refers to.)
Comment 20 Daniel Quinlan 2004-05-24 15:25:04 UTC
Subject: Re:  [review] Rewrite masses/ (in perl)

> I'd suggest a further tweak, BTW: for the "masses" scripts that DAF is
> talking about putting into a -tools pkg, I suggest we rename them to
> use an "sa-" prefix.  e.g. sa-mass-check, sa-hit-frequencies, et al.
> This helps to keep them logically together when installed, and is
> consistent with sa-learn.  (we should probably do similar for the
> "tools" scripts too eventually).

Only rename them when installed, please!
 
> I'd also suggest some renaming to be clearer about what they do in
> reality, e.g.:
> 
> mass-check                         sa-mass-scan

current name seems fine.

> extract-message-from-mbox          sa-mbox-get

We have another script to do the same thing, "mboxget" which is
generally easier to use, the only thing missing from it is the ability
to read directly from a .log (which is just egrep -v '#'|awk '{print $3}').

> hit-frequencies                    sa-msr-to-freqs

"msr" is less clear to me.

> lint-rules-from-freqs              sa-rules-lint
> logs-to-c                          sa-msr-compile
> fp-fn-statistics                   sa-accuracy-report

Can we also fix the FP and FN numbers?  Only print the correct one!  :-)

You know, I'm not really in favor of mass-renaming stuff.  Rename the
things that are completely misnamed, but leave it at that.

Comment 21 Justin Mason 2004-05-24 15:46:22 UTC
Subject: Re:  [review] Rewrite masses/ (in perl) 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Daniel Quinlan writes:
> > I'd suggest a further tweak, BTW: for the "masses" scripts that DAF is
> > talking about putting into a -tools pkg, I suggest we rename them to
> > use an "sa-" prefix.  e.g. sa-mass-check, sa-hit-frequencies, et al.
> > This helps to keep them logically together when installed, and is
> > consistent with sa-learn.  (we should probably do similar for the
> > "tools" scripts too eventually).
> 
> Only rename them when installed, please!

No -- that's kind of pointless, because then any scripts or guidelines
for running mass-checks/perceptron etc. would work only for 1 style
or the other of installed or from masses dir.

> > I'd also suggest some renaming to be clearer about what they do in
> > reality, e.g.:
> > 
> > mass-check                         sa-mass-scan
> 
> current name seems fine.

IMO, "check" is too generic.  it's scanning.

> > extract-message-from-mbox          sa-mbox-get
> 
> We have another script to do the same thing, "mboxget" which is
> generally easier to use, the only thing missing from it is the ability
> to read directly from a .log (which is just egrep -v '#'|awk '{print $3}').

good point.

> > hit-frequencies                    sa-msr-to-freqs
> 
> "msr" is less clear to me.
> 
> > lint-rules-from-freqs              sa-rules-lint
> > logs-to-c                          sa-msr-compile
> > fp-fn-statistics                   sa-accuracy-report
> 
> Can we also fix the FP and FN numbers?  Only print the correct one!  :-)

yes!  good point.   (drives me nuts, that one ;)

> You know, I'm not really in favor of mass-renaming stuff.  Rename the
> things that are completely misnamed, but leave it at that.

it seems like a good opportunity to do this to me, IMO...

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAsnsoQTcbUG5Y7woRAk6oAKCoClyilqwklC5f8+VyK1kQgR4siwCeLWGY
jU8M7lPyXnHTgDmB32053zQ=
=gNX8
-----END PGP SIGNATURE-----

Comment 22 Theo Van Dinter 2004-05-24 16:15:33 UTC
Subject: Re:  [review] Rewrite masses/ (in perl)

On Mon, May 24, 2004 at 03:46:24PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> IMO, "check" is too generic.  it's scanning.

Well, actually it is running check().  That's where I figured the name
came from actually.  :)   So it becomes a semantics issue really imo.

Comment 23 Justin Mason 2004-05-24 16:38:36 UTC
Subject: Re:  [review] Rewrite masses/ (in perl) 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>> IMO, "check" is too generic.  it's scanning.
>
>Well, actually it is running check().  That's where I figured the name
>came from actually.  :)   So it becomes a semantics issue really imo.

true -- although we were talking a while back about renaming that
too (although that's kind of fallen off the list as a priority).
Also, a command-line script has a higher "API visibility level"
than an internal method name, if you get me...

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAsodlQTcbUG5Y7woRAqEZAJ9ekYOW+JuvWWcOmy3oQpVWEDneZQCdFNZn
9EDuQfF16mvMB+aIPPZ+Tz4=
=w/df
-----END PGP SIGNATURE-----

Comment 24 spamassassin 2004-05-24 17:47:23 UTC
Re: mboxget vs extract-message-from-mbox: I kinda merged the two scripts, so the
extract-message-from-mbox as shown in the patch should be pretty much like the
current mboxget + it has the ability to read mass-check logs

Re: adding sa- prefix: I'm relatively indifferent on this. It's probably a good
idea, although I was originally planning on having them put into
/usr/share/spamassassin-tools rather than /usr/bin to avoid cluttering the
namespace. Renaming them makes them more accessible. So I'll vote for renaming
(and only renaming the ones that are going to be distributed). I'm not currently
distributing the scripts in tools, although I may wish to now (we should
probably rename them then)

Re: mass-check vs mass-scan: I like the ring to mass-check better, but mass-scan
is probably more logical.

Re: FP/FN numbers: I'm not sure what you're talking about. btw fp-fn-statistics
is essentially what logs-to-c --count used to be.

Re: MSR: I agree there should be a better name, but I'm not sure this is it. But
in the absence of a better suggestion....

Someone want to give me another +1 so I can commit what I've got and we can
worry about the rest later, or do you feel you want to see another patch?
Comment 25 Duncan Findlay 2004-05-24 17:48:25 UTC
comment 24 was me of course
Comment 26 Justin Mason 2004-05-24 18:56:34 UTC
Subject: Re:  [review] Rewrite masses/ (in perl) 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> Re: adding sa- prefix: I'm relatively indifferent on this. It's probably
> a good idea, although I was originally planning on having them put into
> /usr/share/spamassassin-tools rather than /usr/bin to avoid cluttering
> the namespace. Renaming them makes them more accessible. So I'll vote
> for renaming (and only renaming the ones that are going to be
> distributed). I'm not currently distributing the scripts in tools,
> although I may wish to now (we should probably rename them then)

yeah.  I was thinking that'd be a separate ticket anyway ;)

> Re: FP/FN numbers: I'm not sure what you're talking about. btw
> fp-fn-statistics is essentially what logs-to-c --count used to be.

FP/FN report outputs *two* numbers on the "false positives"/"false
negatives" lines; one is "false negs in the spam collection", one is
"false negs in the combined spam+ham collection".  The latter number
is absolutely useless ;)

> Re: MSR: I agree there should be a better name, but I'm not sure this is
> it. But in the absence of a better suggestion....

yeah, I'd be happy to get other ideas for that...

> Someone want to give me another +1 so I can commit what I've got and we
> can worry about the rest later, or do you feel you want to see another
> patch?

I'm ok with working from this.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAsqFmQTcbUG5Y7woRAjLeAJ0TNu3rrd2Omcb+ykPGZo6pHCTImQCfZOoH
H/aVc/+844K47kleQ9E2gbk=
=erJ7
-----END PGP SIGNATURE-----

Comment 27 Daniel Quinlan 2004-05-24 22:43:11 UTC
Subject: Re:  [review] Rewrite masses/ (in perl)

> No -- that's kind of pointless, because then any scripts or guidelines
> for running mass-checks/perceptron etc. would work only for 1 style
> or the other of installed or from masses dir.

Then let's not rename them at all.  I am certain that these are *not*
ready for prime time anyway, so let's stick with the current names and
install them into /usr/share/spamassassin or the like for now.
 
> IMO, "check" is too generic.  it's scanning.

I think check and scan are the same thing.  Changing the name is just
going to annoy people who use it.
 
> it seems like a good opportunity to do this to me, IMO...

We can do it later.  -1 on renaming en masse.

Comment 28 Daniel Quinlan 2004-05-24 22:48:48 UTC
Subject: Re:  [review] Rewrite masses/ (in perl)


> Re: mboxget vs extract-message-from-mbox: I kinda merged the two
> scripts, so the extract-message-from-mbox as shown in the patch should
> be pretty much like the current mboxget + it has the ability to read
> mass-check logs

Okay.  Let's name the final script "mboxget".  The other name is out of
control length-wise (for something I run so often).
 
> Re: adding sa- prefix: I'm relatively indifferent on this. It's
> probably a good idea, although I was originally planning on having
> them put into /usr/share/spamassassin-tools rather than /usr/bin to
> avoid cluttering the namespace.

I like your original plan MUCH better.  I don't think adding sa-* to
anything we don't actively intend users to run is a good idea and these
scripts are far from there, that's a post-3.0 thing.

> Renaming them makes them more accessible. So I'll vote for renaming
> (and only renaming the ones that are going to be distributed). I'm not
> currently distributing the scripts in tools, although I may wish to
> now (we should probably rename them then)

No.  -1 on renaming.
 
> Re: mass-check vs mass-scan: I like the ring to mass-check better, but
> mass-scan is probably more logical.

mass-check :-)
 
> Re: FP/FN numbers: I'm not sure what you're talking about. btw
> fp-fn-statistics is essentially what logs-to-c --count used to be.

There is more than one number reported.  The one we want is FP / (all
ham) and FN / (all spam).  (IIRC, it's the second one on the line.)
 
> Re: MSR: I agree there should be a better name, but I'm not sure this
> is it. But in the absence of a better suggestion....

Something like "results".  No incomprehensible acronyms.
 
> Someone want to give me another +1 so I can commit what I've got and we can
> worry about the rest later, or do you feel you want to see another patch?

-0.1 only because I want to see another patch.

Comment 29 Justin Mason 2004-05-25 17:57:39 UTC
ok, I'll give in on my suggestions.  But I'm still +1 on Duncan's patch, since
it came before I started throwing in crazy ideas.

Dan, what exactly do you want to see in a new patch?  Is it just the change of
extract-message-from-mbox to be more like mboxget?

Also, one other thing: if we install the masses scripts into
$PREFIX/share/spamassassin-tools, note that we will still need to make one
further change: they will have to be built from .raw versions by the top-level
Makefile, like sa-learn, spamassassin and spamd, in order to take care of the
usual stuff like:

- sedding the #!/usr/bin/perl line, for when the user used a perl that isn't
/usr/bin/perl in Makefile.PL
- looking for the rules dir in a directory != /usr/share/spamassassin and !=
../rules, for when PREFIX ne '/usr'
- looking for the libs in a dir that isn't in @INC for when PREFIX ne '/usr'
Comment 30 Duncan Findlay 2004-05-25 18:52:18 UTC
Subject: Re:  [review] Rewrite masses/ (in perl)

> Also, one other thing: if we install the masses scripts into
> $PREFIX/share/spamassassin-tools, note that we will still need to make one
> further change: they will have to be built from .raw versions by the top-level
> Makefile, like sa-learn, spamassassin and spamd, in order to take care of the
> usual stuff like:
> 
> - sedding the #!/usr/bin/perl line, for when the user used a perl that isn't
> /usr/bin/perl in Makefile.PL
> - looking for the rules dir in a directory != /usr/share/spamassassin and !=
> ../rules, for when PREFIX ne '/usr'
> - looking for the libs in a dir that isn't in @INC for when PREFIX ne '/usr'

I don't think this is necssary. I wasn't even going to suggest that
this go in the Makefile. (although it probably should, but not by default).

None of the scripts guess where the rules dir is, other than
mass-check which relys on Mail::SpamAssassin to do the dirty work. I
hadn't thought about point 3, but it makes it a lot more complex than
necessary. I figure these scripts are optional, and if you want them,
you get to figure out how to install them. ;-)
Comment 31 Justin Mason 2004-05-25 19:29:45 UTC
I'd suggest just changing Makefile.PL to build them, to avoid bug reports.
#3 will be a killer otherwise.

Here's a demo:

1. user runs "perl Makefile.PL PREFIX=~/sausr SYSCONFDIR=~/saetc" for example,
to run sa from their home dir. (could be any other dir)
2. "make; make install"
3. tries to use mass-check.  mass-check fails with "can't find module
Mail::SpamAssassin" or similar

there's no way for the user to fix that, apart from "vi
/usr/share/spamassassin-tools/mass-check" and adding a "use lib" line. :(  

#2 is also bad, since Mail:SpamAssassin won't guess the location of the rules
dir correctly in this case.  (and this case is supported in the SA INSTALL file.)
Comment 32 Duncan Findlay 2004-05-27 21:34:59 UTC
Created attachment 1977 [details]
Next Patch

This is similar to the last one, but contains a couple more scripts and
combines extract-message-from-mbox, mboxget and mass-check-results-to-mbox into
one script.

Still to do: figure out the Makefile.PL stuff (Malte? you interested?) -
although we may need to discuss this a bit, it shouldnt happen by default.

I'd like to commit this before we figure out the Makefile stuff, so it can get
some wider testing.
Comment 33 Justin Mason 2004-06-01 10:38:16 UTC
+1 on applying that patch.
Comment 34 Theo Van Dinter 2004-06-01 10:48:02 UTC
my only problem with this patch is that it'd be going in now, so close to release time.  the original 
version has been beaten for a long time, and I don't want to put in a new version, do relatively little 
testing, then ask for mass-checks and such.
Comment 35 Daniel Quinlan 2004-06-01 11:23:00 UTC
+1 on adding this to the trunk for 3.1.0. once we fork 3.0.0.  It's a great patch!

-1 on adding this change to 3.0.0, it's just too close to release.  I'm
afraid this could end up delaying the release for weeks.  Even if we delay
for a week, it could still mean a delay of weeks.

I changed the milestone to 3.1.0.
Comment 36 Justin Mason 2004-10-13 16:02:18 UTC
moving to 3.1.0 milestone -- I think it'd be good to check this in reasonably
soon to trunk.
Comment 37 Duncan Findlay 2004-10-13 16:14:54 UTC
Subject: Re:  [review] Rewrite masses/ (in perl)

On Wed, Oct 13, 2004 at 04:02:20PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> moving to 3.1.0 milestone -- I think it'd be good to check this in reasonably
> soon to trunk.

FWIW, the current patch might be a bit buggy... I think there's some
stuff I might have tweaked, but unfortunately deleted :-(

Anyways, I have little time now, if someone wants to pick up the
torch, that'd be good. The more I think about it though, I feel my
approach is perhaps a bit cumbersome and reworking it might be a good
idea. (Perhaps people can review it and see what they think.)
Comment 38 Daniel Quinlan 2005-03-29 14:21:29 UTC
bumping this to future milestone, too late in 3.1 cycle to do this
Comment 39 Daniel Quinlan 2005-03-30 01:08:44 UTC
move bug to Future milestone (previously set to Future -- I hope)
Comment 40 Justin Mason 2006-12-06 08:40:39 UTC
Duncan, are you redoing this patch?  ISTR seeing some mailing list traffic that
suggested it.

it might be worth just opening a branch and working there, to avoid the bit-rot
that happened here.   I really would like to see these changes getting into
trunk at some stage ;)
Comment 41 Duncan Findlay 2006-12-06 09:43:41 UTC
Oh wow... I haven't looked at this bug in a while!

I was pretty sure it was a lot further from complete than my comments seem to
indicate.

My current project is more along the lines of replacing the perceptron, but
obviously cleanups in masses/ may be useful in doing this. :-)

I'll see what happens; I'm not really planning on devoting extraordinary amounts
of time to this, but I have a tendency of being bored during the exam period. :-)
Comment 42 Justin Mason 2007-04-16 05:16:29 UTC
removing "[review]" tag -- I doubt this is just in review anymore
Comment 43 Henrik Krohns 2019-08-01 06:53:14 UTC
Closing old stale bug. Any remaining uncommitted code is probably deprecated long time ago.