Bug 1472 - some tests seem to be missing scores
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P2 major
Target Milestone: 2.50
Assignee: SpamAssassin Developer Mailing List
Blocks: 1471
Reported: 2003-02-11 04:29 UTC by Justin Mason
Modified: 2003-02-12 01:46 UTC

Description Justin Mason 2003-02-11 04:29:01 UTC
viz. 

: jm t 1196...; grep PENIS_ENLARGE ../rules/*.cf 
../rules/20_phrases.cf:# Jul  3 2002 jm: modified PENIS_ENLARGE patterns: removed "add", replaced with "inches",
../rules/20_phrases.cf:body PENIS_ENLARGE              /\b(?:enlarge|increase|grow|lengthen|larger\b|bigger\b|longer\b|thicker\b|\binches\b).{0,50}\b(?:penis|male organ|P[ -]?P\b|pee[ -]?pee|dick|sc?hlong|wh?anger|breast)/i
../rules/20_phrases.cf:describe PENIS_ENLARGE           Information on getting a larger penis or breasts
../rules/20_phrases.cf:body PENIS_ENLARGE2              /\b(?:penis|male organ|P[ -]?P\b|pee[ -]?pee|dick|sc?hlong|wh?anger|breast).{0,50}\b(?:enlarge|increase|grow|lengthen|larger\b|bigger\b|longer\b|thicker\b|\binches\b)/i
../rules/20_phrases.cf:describe PENIS_ENLARGE2          Information on getting a larger penis or breasts (2)
../rules/30_text_fr.cf:lang fr describe PENIS_ENLARGE  Contient des informations pour augmenter la taile du pénis
../rules/30_text_fr.cf:lang fr describe PENIS_ENLARGE2          Contient des information sur l'accroissementd e la taille du pénis
../rules/30_text_it.cf:lang it describe PENIS_ENLARGE   Spiega come aumentare le dimensioni del proprio pene 
../rules/30_text_it.cf:lang it describe PENIS_ENLARGE2          Spiega come aumentare le dimensioni del proprio pene


no hits in 50_scores.cf, therefore default score of 1.0.
I suspect the "remove rules based on GA feedback" thing didn't
work.
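
The manual grep above can be generalised into a check over every rule. What follows is only a rough sketch, not the project's own tooling: the list of rule-defining keywords is an assumed subset of the config syntax, the function names are hypothetical, and rules whose names start with "__" are assumed to be sub-rules that are not expected to carry a score.

```python
# Assumed subset of the keywords that introduce a rule definition.
RULE_TYPES = ("header", "body", "rawbody", "full", "uri", "meta")


def defined_rules(lines):
    """Collect rule names from lines such as 'body NAME /regex/'."""
    names = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] in RULE_TYPES:
            names.add(parts[1])
    return names


def scored_rules(lines):
    """Collect rule names from lines such as 'score NAME 1.5'."""
    names = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "score":
            names.add(parts[1])
    return names


def missing_scores(rule_lines, score_lines):
    """Return the defined rules that have no score line, sorted by name."""
    # Assumption: '__'-prefixed names are sub-rules and need no score.
    rules = {n for n in defined_rules(rule_lines) if not n.startswith("__")}
    return sorted(rules - scored_rules(score_lines))


# Illustrative input only; the regexes and the OTHER_RULE score are made up.
example_rules = [
    "body PENIS_ENLARGE /x/i",
    "describe PENIS_ENLARGE Information on getting a larger penis or breasts",
    "body PENIS_ENLARGE2 /y/i",
    "header __SUB_RULE Subject =~ /z/",
    "body OTHER_RULE /w/",
]
example_scores = ["score OTHER_RULE 1.5"]

print(missing_scores(example_rules, example_scores))
```

In practice the two line lists would be read from ../rules/*.cf and from 50_scores.cf respectively.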
Comment 1 Justin Mason 2003-02-11 05:09:59 UTC
BTW this seems to be the reason why t/utf8.t and t/rule_tests.t are failing,
AFAICS.  Those tests simply are not being run (or at least output) without
a score.
Comment 2 Malte S. Stretz 2003-02-11 05:26:50 UTC
Why aren't they run? A missing score defaults to 1.0 and enables the test, or am I
missing something here?
Comment 3 Theo Van Dinter 2003-02-11 07:44:37 UTC
Subject: Re: [SAdev]  New: some tests seem to be missing scores

On Tue, Feb 11, 2003 at 04:29:01AM -0800, bugzilla-daemon@hughes-family.org wrote:
> no hits in 50_scores.cf, therefore default score of 1.0.
> I suspect the "remove rules based on GA feedback" thing didn't
> work.

Well, no, the problem seems to be related to the "GA doesn't output
all the rules" problem/rewrite_scores issues.  PENIS_ENLARGE* both
had scores.  I just re-ran through my kluge script to make sure all
rules have a score (even of 0), then recreated the 50_scores.cf file.
It seems to have worked fine, at least for the PENIS_ENLARGE* rules.
So I don't know what happened there.

Related to that issue, any rules in rules.pl that didn't get a score
(say PL_*) now have a score in 50_scores.cf.  I'm trimming them out now.

More to come.

Comment 4 Theo Van Dinter 2003-02-11 08:18:48 UTC
Subject: Re: [SAdev]  New: some tests seem to be missing scores

Ok, so I've now gone through -- there are 844 rules that need scores.
The currently in-place 50_scores.cf has 607 scores, which is fewer than
the GA output in either run.  So I really have no idea what happened
to the extra scores.  Worst case, we should have ended up with at least
the number of scores in the set1 output.

I've generated a new 50_scores.cf with all 844 scores.  PENIS_ENLARGE*
is in there.  It's now committed.

Two things from this:
1) Do we want to scrap the current bayes mass-check run and restart again?
   With the scores off, autolearning wouldn't have worked correctly.
   Part of me says we should since the results aren't accurate,
   another part says that the results are pretty good right now and
   stopping/restarting would just piss people off.  I think I'm leaning
   towards: leave it alone right now, it's actually not too big of a
   problem, we'll get things straightened out for 2.6.  (it sounds like
   a big problem, but if you look at the results for the bayes runs
   so far, the stats look like what I would expect: the far edges are
   mostly 100% correct, the middle is questionable, and there's decent
   results everywhere else.  I hope that made sense.)


2) We need to come up with a solution to the multi-scoreset generation!
   I think the easiest thing is to have the GA output all the scores
   for all rules, even if the score was 0.  We can then use the same
   logic as always during the rewrite.
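
For illustration, point 2 might look something like the sketch below: every known rule gets a score line, and a rule the GA left out is written with an explicit 0. This is not the actual GA or rewrite code; the function name, the three-decimal output format and the single-score-per-line layout are all assumptions made for the example.

```python
def full_score_lines(all_rules, ga_scores):
    """Emit one 'score' line per rule, using 0 where the GA gave no score."""
    lines = []
    for name in sorted(all_rules):
        # Assumption: a rule absent from the GA output is written as 0
        # rather than being silently dropped from the file.
        score = ga_scores.get(name, 0.0)
        lines.append(f"score {name} {score:.3f}")
    return lines


# Hypothetical values, for illustration only.
all_rules = {"PENIS_ENLARGE", "PENIS_ENLARGE2", "OTHER_RULE"}
ga_scores = {"PENIS_ENLARGE": 2.5, "OTHER_RULE": 1.25}

for line in full_score_lines(all_rules, ga_scores):
    print(line)
```

With every rule present in the output, the rewrite step can tell "scored 0" apart from "never seen", which is the distinction that appears to have been lost here.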


Thoughts?

Comment 5 Antony Mawer 2003-02-11 09:04:44 UTC
Subject: Re:  some tests seem to be missing scores 


> 1) Do we want to scrap the current bayes mass-check run and restart again?
>    With the scores off, autolearning wouldn't have worked correctly.
>    Part of me says we should since the results aren't accurate,
>    another part says that the results are pretty good right now and
>    stopping/restarting would just piss people off.  I think I'm leaning
>    towards: leave it alone right now, it's actually not too big of a
>    problem, we'll get things straightened out for 2.6.  (it sounds like
>    a big problem, but if you look at the results for the bayes runs
>    so far, the stats look like what I would expect: the far edges are
>    mostly 100% correct, the middle is questionable, and there's decent
>    results everywhere else.  I hope that made sense.)

argh.  I think, pragmatically, we should use those results if possible,
even if they are not perfect.  As you say, the bayes graph is the expected
shape, and that's the main aim; getting an idea of the accuracy of that
so the GA can work its scores appropriately.

But won't the fact that the mass-check logs will not contain any of the
hits for those missed rules, screw up their scores for set2/set3?

> 2) We need to come up with a solution to the multi-scoreset generation!
>    I think the easiest thing is to have the GA output all the scores
>    for all rules, even if the score was 0.  We can then use the same
>    logic as always during the rewrite.

yes, sounds perfectly sensible.

--j.

Comment 6 Theo Van Dinter 2003-02-11 09:08:25 UTC
Subject: Re: [SAdev]  some tests seem to be missing scores

On Tue, Feb 11, 2003 at 09:04:44AM -0800, bugzilla-daemon@hughes-family.org wrote:
> But won't the fact that the mass-check logs will not contain any of the
> hits for those missed rules, screw up their scores for set2/set3?

The rules still hit (the scores aren't 0), they're just not getting
a proper score to help out Bayes autolearning.  So the resulting logs
will be fine for 2 and 3, with the question of how accurate the BAYES_*
test results are.
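
A small sketch of the effect described above, assuming only what this report states (a rule without a score line falls back to 1.0 and still hits). The function name and all the numbers are hypothetical.

```python
DEFAULT_SCORE = 1.0  # fallback for a rule with no score line, per this report


def message_total(hits, scores):
    """Sum the scores of the rules that hit, using the default when unset."""
    return sum(scores.get(name, DEFAULT_SCORE) for name in hits)


hits = ["PENIS_ENLARGE", "OTHER_RULE"]

# Scores file missing PENIS_ENLARGE: the rule still counts, but only as 1.0.
broken = {"OTHER_RULE": 2.5}
# Same file with a (made-up) tuned score in place.
fixed = {"OTHER_RULE": 2.5, "PENIS_ENLARGE": 3.0}

print(message_total(hits, broken), message_total(hits, fixed))
```

The hit itself is logged either way; only the message total, and so anything that keys off it such as autolearning, sees the difference.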

Comment 7 Antony Mawer 2003-02-11 09:19:22 UTC
Subject: Re:  some tests seem to be missing scores 


> The rules still hit (the scores aren't 0), they're just not getting
> a proper score to help out Bayes autolearning.  So the resulting logs
> will be fine for 2 and 3, with the question of how accurate the BAYES_*
> test results are.

OK -- in that case I think we're fine to use the results coming in now.

Comment 8 Theo Van Dinter 2003-02-12 10:46:57 UTC
Ok, since the scores are now in place, I'm closing this bug.  I should make a tracker to fix the GA score output issue/come up with something better.

BTW: Justin, your comments come out as being from Antony Mawer.  Bugzilla seems to be acting up.