Bug 6753 - ruleqa is handling late/early masscheck data badly
Summary: ruleqa is handling late/early masscheck data badly
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: RuleQA
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Duplicates: 6821
Depends on:
Reported: 2012-01-31 17:59 UTC by Darxus
Modified: 2019-06-16 06:39 UTC

Description Darxus 2012-01-31 17:59:19 UTC
If you look at the most recent 1000 ruleqa reports, you'll see weird things related to Axb's data:

For the Saturday / net runs, his data doesn't show up until it's a week late, and then it shows up with the *date* of a week later and overwrites everybody else's data.

Similarly, past Fridays all have their data overwritten with only Axb's data, and with Thursday's date.

Looking at the latest masscheck net log he uploaded, I noticed his "SVN revision" line had a value that was a week old.

So I asked him if he was running masscheck too early, and it turned out he was.  So while it's (possibly?) correctable by him editing his cron job, ruleqa is dealing with it really badly.  

There was a previous bug where somebody asked how, for example, the 2012-01-21 ruleqa report could have data from a week in the future, 2012-01-28. This is all it takes: grabbing the wrong SVN revision number by running masscheck too early.

It might be useful if masscheck itself guarded against this problem.  The files the SVN revision numbers come from, weekly-versions.txt and nightly-versions.txt, already provide dates alongside the revisions.  Those dates could be checked against whatever date ends up in the right hand column of ruleqa output ("20120128-r1237024-n").
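
A minimal sketch of that date check, written in Python rather than mass-check's Perl. It assumes every tag has the shape of the example above (date, then "-r" and the revision, then a short suffix). The helper names parse_tag and tag_matches_date are hypothetical, and reading the expected date out of weekly-versions.txt or nightly-versions.txt is left out because the layout of those files is not described here.

```python
import re
from datetime import date, datetime

# Assumed tag shape, taken from the example "20120128-r1237024-n":
# YYYYMMDD, then "-r" plus the SVN revision, then "-" plus a short suffix.
TAG_RE = re.compile(r"^(\d{8})-r(\d+)-(\w+)$")


def parse_tag(tag):
    """Split a ruleqa-style tag into (date, revision, suffix)."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    day = datetime.strptime(m.group(1), "%Y%m%d").date()
    return day, int(m.group(2)), m.group(3)


def tag_matches_date(tag, expected):
    """True if the date embedded in the tag equals the expected date."""
    day, _rev, _suffix = parse_tag(tag)
    return day == expected


if __name__ == "__main__":
    # A run tagged 2012-01-28 does not belong in the 2012-01-21 report.
    print(tag_matches_date("20120128-r1237024-n", date(2012, 1, 28)))
    print(tag_matches_date("20120128-r1237024-n", date(2012, 1, 21)))
```

A mismatch could be used to reject or re-file the submission before it reaches the report for the wrong day.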

But I think the biggest problem is the overwriting.
Comment 1 Kevin A. McGrail 2012-08-09 04:17:28 UTC
Should we instead allow an SVN range +/- 1000 or something similar?
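
For illustration only, the range idea amounts to a tolerance check like the Python sketch below. The function name revision_in_window is hypothetical, the tolerance of 1000 is the figure floated above, and nothing here reflects how ruleqa actually matches revisions.

```python
def revision_in_window(revision, target, tolerance=1000):
    """True if `revision` lies within +/- `tolerance` of `target`."""
    return abs(revision - target) <= tolerance


if __name__ == "__main__":
    # 1237024 is the revision from the example tag; the others are made up.
    print(revision_in_window(1236500, 1237024))
    print(revision_in_window(1230000, 1237024))
```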
Comment 2 AXB 2012-08-09 05:12:55 UTC
If I understand the outcome of making it a range, I'm not sure this would be wise: it might not reflect rule changes/revisions, etc., it could delay score changes, and it would make it even harder to find out whether someone's new rules are worth any effort while working on a rule subset.
Comment 3 Kevin A. McGrail 2012-08-09 07:15:07 UTC
*** Bug 6821 has been marked as a duplicate of this bug. ***
Comment 4 Kevin A. McGrail 2012-08-09 07:16:30 UTC
Perhaps we could set things up so that the svn checkout or rsync for the code uses the FIRST check-in of the day.  That would leave us ALWAYS one day behind on code but give a 24-hour window in which all mass-check clients use the same codebase.
Comment 5 AXB 2012-08-09 07:29:01 UTC
(In reply to comment #4)
> Perhaps setting that the svn checkout or rsync for the code will utilize
> the FIRST check-in for the day.  That will leave us ALWAYS one day behind on
> code but give a 24-hour window where all mass-check clients are using the
> same codebase.

Not sure if the FIRST is a good plan.
Suppose the first commit of the day is a borked code change that gets a fix afterwards....

I'd go for the last commit of the day before, where the chances of picking up good fixes are higher.
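
The two selection policies discussed here can be compared with a small Python sketch. The commit history is made up, and the helper names first_of_day and last_before_day are hypothetical; a real implementation would read the history from SVN.

```python
from datetime import date, datetime


def first_of_day(commits, day):
    """Earliest commit made on `day`, or None if there was none."""
    same_day = [c for c in commits if c[1].date() == day]
    return min(same_day, key=lambda c: c[1], default=None)


def last_before_day(commits, day):
    """Latest commit made before `day` began, or None if there was none."""
    earlier = [c for c in commits if c[1].date() < day]
    return max(earlier, key=lambda c: c[1], default=None)


# Made-up history: (revision, commit time in UTC).
COMMITS = [
    (100, datetime(2012, 8, 8, 9, 0)),
    (101, datetime(2012, 8, 8, 22, 30)),
    (102, datetime(2012, 8, 9, 1, 15)),  # imagine a broken change
    (103, datetime(2012, 8, 9, 3, 40)),  # imagine its fix
]

if __name__ == "__main__":
    day = date(2012, 8, 9)
    print(first_of_day(COMMITS, day)[0])     # the possibly broken first commit
    print(last_before_day(COMMITS, day)[0])  # the last commit of the day before
```

Either policy gives every client the same revision for a whole day; they differ only in whether the chosen revision is the day's first commit or the previous day's final state.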
Comment 6 Kevin A. McGrail 2012-08-09 15:27:49 UTC
Kevin Golding is getting SVN revision: unknown

I've asked him to add a few debug statements to the mass-check script to see which scenario in get_current_svn_revision is being used.

My theory is he is running the svn command line and might have a broken version of svn.  He might need to compile his own.

Does anyone know if it would break things to have get_current_svn_revision print extra information on stdout?

I think we should add code to abort the masscheck if the revision is unknown.  The results are not going to get used...
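
A sketch of that guard, in Python for illustration rather than as a patch to the Perl script. The names UnknownRevisionError, require_known_revision and main are hypothetical; the only detail taken from this report is that the failing value is the literal string "unknown".

```python
import sys


class UnknownRevisionError(Exception):
    """Raised when no usable SVN revision could be determined."""


def require_known_revision(revision):
    """Return the revision as a stripped string, or raise if it is unknown."""
    text = "" if revision is None else str(revision).strip()
    if text == "" or text.lower() == "unknown":
        raise UnknownRevisionError(
            "SVN revision is unknown; refusing to start masscheck")
    return text


def main(revision):
    """Return a process exit status: 0 to proceed, 1 to abort."""
    try:
        rev = require_known_revision(revision)
    except UnknownRevisionError as err:
        print(err, file=sys.stderr)
        return 1
    print(f"would run masscheck against r{rev}")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else None))
```

Failing before the run starts saves the submitter the corpus scan and keeps an unusable log from being uploaded.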
Comment 7 Kevin A. McGrail 2012-08-10 14:30:51 UTC
Kevin Golding: I added in extra statements outside the conditionals and it proved I'm not successfully entering any of them.  It looks like my problem is line 998:

  if (-d "$dir/.svn" || -f "$dir/svninfo.tmp") {

 At that point $dir = /usr/home/masscheck/trunk/masses

 I have a /usr/home/masscheck/trunk/.svn but no /usr/home/masscheck/trunk/masses/.svn

There is a patch in bug 6821 about this but the root cause is pointed out by Kris Deugau - SVN 1.7+:

> Have you upgraded to SVN 1.7?  The working copy structure uses only one
> .svn directory at the root of the working copy for 1.7, so if you've
> upgraded from < 1.7, your working copy at /usr/home/masscheck/trunk will
> not have a /usr/home/masscheck/trunk/masses/.svn directory any more.

We'll need to accommodate a case for that!
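
One way to accommodate the SVN 1.7+ layout is to look for .svn in the parent directories as well. The Python sketch below illustrates the idea and is not the patch from bug 6821; the function names find_svn_working_copy and has_svn_metadata are hypothetical, and the svninfo.tmp check simply mirrors the condition quoted above.

```python
import os
import tempfile


def find_svn_working_copy(start_dir):
    """Return the nearest directory at or above start_dir that contains a
    .svn directory, or None if no such directory exists."""
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isdir(os.path.join(current, ".svn")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


def has_svn_metadata(start_dir):
    """The quoted test, extended to also look in parent directories."""
    return (find_svn_working_copy(start_dir) is not None
            or os.path.isfile(os.path.join(start_dir, "svninfo.tmp")))


if __name__ == "__main__":
    # Recreate the reported layout: trunk/.svn exists, trunk/masses/.svn does not.
    with tempfile.TemporaryDirectory() as trunk:
        os.makedirs(os.path.join(trunk, ".svn"))
        masses = os.path.join(trunk, "masses")
        os.makedirs(masses)
        print(has_svn_metadata(masses))
```

Walking upwards handles both layouts: pre-1.7 working copies match on the first iteration, 1.7+ working copies match at the root of the checkout.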
Comment 8 Kevin A. McGrail 2012-08-10 14:42:34 UTC
We need to look more into these files:

http://rsync.spamassassin.org/nightly-versions.txt (and http://rsync.spamassassin.org/weekly-versions.txt of course)

Is the masscheck script using this to determine what version to download?

Can the rsync method that some people use to "check out" the code grab a specific SVN revision?
Comment 9 Paul Stead 2019-06-16 06:26:18 UTC
Fixed with


Early/late submissions are categorised with corpus files from the same date. Further problems with active.list exist - investigation continues.

Comment 10 Paul Stead 2019-06-16 06:39:32 UTC
* link