SA Bugzilla – Bug 6671
Updates not happening due to lack of bb corpora since August 27th
Last modified: 2011-10-30 23:36:05 UTC
Yesterday's weekly net mass-check again didn't have enough non-spam for score regeneration to happen: http://www.chaosreigns.com/dnswl/tot.svg (We had 102,436 out of the needed 150,000 non-spams, 68.3%.)
And again, none of the bb corpora showed up in ruleqa: http://ruleqa.spamassassin.org/?daterev=20111008
I'm guessing there is a direct relationship. bb corpora are the ones where people upload their emails and mass-check is run on a spamassassin server. Can somebody look into why these aren't making it to ruleqa? Is this sufficiently documented somewhere?
Missing ham-net-bb-guenther_fraud - last seen 20110820.
Missing ham-net-bb-jhardin - last seen 20110820.
Missing ham-net-bb-jhardin_fraud - last seen 20110820.
Missing ham-net-bb-jm - last seen 20110820.
It looks like ruleqa didn't run on 2011-08-27, and the bb corpora haven't been included since. That does correspond exactly to when net runs dropped below the 150,000 non-spam threshold. Last time they were included: http://ruleqa.spamassassin.org/?daterev=20110820
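For anyone wanting to reproduce the 68.3% figure above, here is a minimal sketch of the arithmetic. The function name `ham_coverage` is made up for illustration and is not part of any SpamAssassin tooling; only the 150,000 threshold and the 102,436 count come from this report.

```python
REQUIRED_HAM = 150_000  # non-spam threshold quoted in this report


def ham_coverage(ham_count: int, required: int = REQUIRED_HAM) -> tuple[int, float]:
    """Return (shortfall, percent of the required count) for a ham count.

    Hypothetical helper: it only restates the arithmetic from the report.
    """
    shortfall = max(required - ham_count, 0)
    percent = 100.0 * ham_count / required
    return shortfall, percent


shortfall, percent = ham_coverage(102_436)
print(f"short by {shortfall} hams ({percent:.1f}% of required)")
```

With the count from yesterday's run this reports a shortfall of 47,564 hams.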
Data point - a snapshot of the current ruleqa output:
1180567: 2011-10-09 05:57:14
khopesh: auto-generated rules
20111009-r1180567-n
bb-jhardin_fraud bb-jm danmcdonald darxus-trap darxus grenier jarif kgolding llanga wt-ackbar wt-en1 wt-en2-flh wt-en3 wt-hamtrap wt-homeone wt-jp1 [+]
(...Network masscheck omitted here...)
1179962: 2011-10-07 05:57:18
khopesh: auto-generated rules
20111009-r1179962-n
bb-guenther_fraud bb-jhardin [+]
How is the masscheck from 2011-10-07 using logs dated 20111009? I suspect two processes are running, and the one that posts last is using a smaller corpus. I wager the large-corpus "1180567: 2011-10-09 05:57:14" masscheck results will soon be overwritten by another masscheck using only the fraud corpora.
From the automc/freqsd log it looks like it's deciding to go back and overwrite older results outputs. I don't know if this is normal behavior; I'm grabbing the log so that I can perform more analysis locally.
(In reply to comment #1)
> Data point - a snapshot of the current ruleqa output:
>
> 1180567: 2011-10-09 05:57:14
> khopesh: auto-generated rules
>
> 20111009-r1180567-n
> bb-jhardin_fraud bb-jm danmcdonald darxus-trap darxus grenier jarif
> kgolding llanga wt-ackbar wt-en1 wt-en2-flh wt-en3 wt-hamtrap wt-homeone
> wt-jp1 [+]
> I wager the large-corpus "1180567: 2011-10-09 05:57:14" masscheck
> results will soon be overwritten by another masscheck
Yup:
1180567: 2011-10-09 05:57:14
khopesh: auto-generated rules
20111010-r1180567-n
bb-guenther_fraud bb-jhardin [+] [-]
ham-bb-guenther_fraud.20111010-r1180567-n.log: started: 20111010T090246Z; submitted: 20111010T080113Z; size: 4402 bytes
ham-bb-jhardin.20111010-r1180567-n.log: started: 20111010T090522Z; submitted: 20111010T110436Z; size: 9011673 bytes
spam-bb-guenther_fraud.20111010-r1180567-n.log: started: 20111010T090246Z; submitted: 20111010T080113Z; size: 1702133 bytes
spam-bb-jhardin.20111010-r1180567-n.log: started: 20111010T090522Z; submitted: 20111010T110436Z; size: 1604099 bytes
(end of corpus list. bb-jm et. al. are _gone_ now)
I still don't know enough about this to figure out why it's overwriting or deleting old results.
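The started/submitted stamps in that listing can be compared mechanically. Below is a rough sketch that parses ruleqa-style log lines and reports submitted minus started for each log. The regular expression and the function names are my own assumptions based only on the lines pasted above, not on the ruleqa source. It says nothing about why a difference is negative.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, inferred from the ruleqa listing pasted in this bug.
LINE_RE = re.compile(
    r"(?P<name>\S+\.log): started: (?P<started>\d{8}T\d{6}Z); "
    r"submitted: (?P<submitted>\d{8}T\d{6}Z); size: (?P<size>\d+) bytes"
)


def parse_stamp(stamp: str) -> datetime:
    """Parse a compact stamp such as 20111010T090246Z (naive, treated as UTC)."""
    return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")


def submitted_minus_started(text: str) -> dict[str, timedelta]:
    """Map each log name to (submitted - started); negative means submitted first."""
    deltas = {}
    for m in LINE_RE.finditer(text):
        deltas[m["name"]] = parse_stamp(m["submitted"]) - parse_stamp(m["started"])
    return deltas


SAMPLE = """
ham-bb-guenther_fraud.20111010-r1180567-n.log: started: 20111010T090246Z; submitted: 20111010T080113Z; size: 4402 bytes
ham-bb-jhardin.20111010-r1180567-n.log: started: 20111010T090522Z; submitted: 20111010T110436Z; size: 9011673 bytes
"""

for name, delta in submitted_minus_started(SAMPLE).items():
    print(name, int(delta.total_seconds()), "seconds")
```

On these two sample lines the guenther_fraud log comes out negative (submitted about an hour before the recorded start) and the jhardin log comes out positive.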
Yesterday was the eighth week without a -net run including the bb corpora. (In the final output.) Who has sufficient access to look at this?
Is the date / time on the machine running this set reasonably accurately? I'd be happy to look at the problem if you want to give me access. I believe I am very qualified.
(In reply to comment #5) > Is the date / time on the machine running this set reasonably accurately? > > I'd be happy to look at the problem if you want to give me access. I believe I > am very qualified. Email me off-list and perhaps we can share a session and discuss what might be the issue over the phone at the same time?
Time looks interesting, BTW:
zones: Wed Oct 19 19:48:38 GMT 2011
zones2: Wed Oct 19 20:51:45 UTC 2011
Far as I know GMT and UTC are identical zones, yes?
I'm getting in touch with Infra now.
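To put a number on the gap between the two readings, here is a small sketch. It assumes both `date` commands were run at roughly the same moment, which this report does not state, and it treats GMT and UTC as the same offset. The helper names are hypothetical, and month-name parsing assumes an English/C locale.

```python
from datetime import datetime, timedelta


def parse_date_output(line: str) -> datetime:
    """Parse `date`-style output such as 'Wed Oct 19 19:48:38 GMT 2011'.

    The weekday and zone tokens are dropped; only GMT and UTC are accepted,
    and both are treated as the same offset (an assumption of this sketch).
    """
    _weekday, month, day, clock, zone, year = line.split()
    if zone not in ("GMT", "UTC"):
        raise ValueError(f"unexpected zone: {zone}")
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def clock_gap(first: str, second: str) -> timedelta:
    """Return second minus first for two `date` output lines."""
    return parse_date_output(second) - parse_date_output(first)


gap = clock_gap("Wed Oct 19 19:48:38 GMT 2011", "Wed Oct 19 20:51:45 UTC 2011")
print(gap)  # 1:03:07
```

If the readings really were simultaneous, that is a bit over an hour of disagreement between zones and zones2.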
(In reply to comment #7) > Time looks interesting, BTW: > > zones: Wed Oct 19 19:48:38 GMT 2011 > zones2: Wed Oct 19 20:51:45 UTC 2011 > > Far as I know GMT and UTC are identical zones, yes? > > I'm getting in touch with Infra now. Jira ticket open: https://issues.apache.org/jira/browse/INFRA-4054
(In reply to comment #7)
> zones: Wed Oct 19 19:48:38 GMT 2011
> zones2: Wed Oct 19 20:51:45 UTC 2011
I was hoping it would be farther off.
> Far as I know GMT and UTC are identical zones, yes?
Yup. Difference is basically leap seconds. http://geography.about.com/od/timeandtimezones/a/gmtutc.htm
(In reply to comment #9) > (In reply to comment #7) > > zones: Wed Oct 19 19:48:38 GMT 2011 > > zones2: Wed Oct 19 20:51:45 UTC 2011 > > I was hoping it would be farther off. > > > Far as I know GMT and UTC are identical zones, yes? > > Yup. Difference is basically leap seconds. > http://geography.about.com/od/timeandtimezones/a/gmtutc.htm Time was a very good thought. My memory is there are some safety valves in the cron jobs that if time is off it could definitely mess with masscheck. I use ntpdate on all my boxes so I never even thought about time being wrong.
(In reply to comment #8) > Jira ticket open: https://issues.apache.org/jira/browse/INFRA-4054 Did ntpdate not work because the time was too far off? I know it'll do that. I think -b is the flag you want to force it.
(In reply to comment #11) > (In reply to comment #8) > > Jira ticket open: https://issues.apache.org/jira/browse/INFRA-4054 > > Did ntpdate not work because the time was too far off? I know it'll do that. > I think -b is the flag you want to force it. This is the error I got: 19 Oct 21:13:05 ntpdate[17267]: Can't set time of day: Not owner Since I think this is a virtualized box, I think it has to go upstream.
Sounds like this is running on Solaris? And a "zone" is a Solaris virtual machine. So it makes sense that the "global zone administrator" (someone with access to the host OS) would need to fix this. http://hub.opensolaris.org/bin/view/Community+Group+zones/faq#HQ:CanazonebeanNTPclientorserver3F
(In reply to comment #13) > Sounds like this is running on Solaris? And a "zone" is a Solaris virtual > machine. So it makes sense that the "global zone administrator" (someone with > access to the host OS) would need to fix this. > http://hub.opensolaris.org/bin/view/Community+Group+zones/faq#HQ:CanazonebeanNTPclientorserver3F That was my take as well. Already kicked up to ASF Infra via a Jira ticket.
We have bigger problems: Bringing back the bb corpora will not be sufficient to enable score generation / rule updates.
As of the last net run, we have 38,740 fewer hams than required. Only hams no more than 2 months old are used. Of the missing bb corpora, number of hams in the last 3 months when last seen:
bb-guenther_fraud 0 (all spam)
bb-jhardin 471
bb-jhardin_fraud 0 (all spam)
bb-jm 0 - most recent is 2010-11
So once we get bb corpora included again, we'll still be short at least 38,269 hams, having 74.5% of the required 150,000 no more than 2 months old.
The reason I gave counts over the last three months is, for some reason, when the bb corpora were last included, ham counts were including ham back to around 2006: http://ruleqa.spamassassin.org/20110820-r1159860-n/RCVD_IN_XBL/detail?s_corpus=1#corpus
For example, it says there were 70,329 hams in bb-jm, and if you look at the yearly / monthly counts, that would need to include hams back to 2006. Counts in the latest net run make sense for only including the last 2 months of ham. This is particularly weird because even before the age threshold was changed for bug #6557 to match score generation, the threshold was still only 6 months.
Please disregard my last comment. I got the ham and spam age limits backwards. But I am a little concerned that near a quarter of the ham we've been using for ruleqa / score generation is about 3-4 years old, from jm.
(In reply to comment #16) > Please disregard my last comment. I got the ham and spam age limits backwards. > > But I am a little concerned that near a quarter of the ham we've been using for > ruleqa / score generation is about 3-4 years old, from jm. We will get more! Though I will state that we are mostly pretty good at guessing what the limits should be for a lot of rules. AutoMC isn't the end of the world.
(In reply to comment #17)
> (In reply to comment #16)
> > Please disregard my last comment. I got the ham and spam age limits backwards.
> >
> > But I am a little concerned that near a quarter of the ham we've been using for
> > ruleqa / score generation is about 3-4 years old, from jm.
>
> We will get more! Though I will state that we are mostly pretty good at
> guessing what the limits should be for a lot of rules. AutoMC isn't the end of
> the world.
The CID img thing and its issues go way back to the SARE days, when stock spam was full of them.
I'll re-enable the old SARE auto masschecker instance (for a chosen few) so rules can be tested before they're put on sandbox.
> Missing ham-net-bb-guenther_fraud Since this has been mentioned quite a few times: Please do note that this is a hand classified corpus of *fraud* spam, intended for the SOUGHT_FRAUD rule-set. Naturally, the mentioned ham counterpart to the fraud corpus does not exist, and is not really expected to. With the only exception of occasionally holding very few ham samples, purely to prevent FPs -- e.g. forged facebook notifications. Please stop mentioning that corpus is "lacking" recent ham. Same goes for the jhardin fraud corpus, I believe.
(In reply to comment #19)
> > Missing ham-net-bb-guenther_fraud
>
> Since this has been mentioned quite a few times: Please do note that this is a
> hand classified corpus of *fraud* spam, intended for the SOUGHT_FRAUD rule-set.
>
> Naturally, the mentioned ham counterpart to the fraud corpus does not exist,
I realize that. The ruleqa page does actually say the ham corpus existed:
ham-net-bb-guenther_fraud.20110820-r1159860-n.log: started: 20110820T090310Z; submitted: 20110820T090048Z; size: 4547 bytes
It was there, although empty. It is now missing. What that indicates is that either the ham or spam part is missing, or both. I generally only include the ham part of the list of missing corpora from the output of my script to avoid redundancy. So when I say "ham-net-bb-guenther_fraud" is missing, what I mean is that some part of "bb-guenther_fraud" is missing. Sorry that bothers you. I could start stripping the ham-/spam- part off, but I'm afraid that could result in missing interesting output.
(In reply to comment #17)
> (In reply to comment #16)
> > But I am a little concerned that near a quarter of the ham we've been using for
> > ruleqa / score generation is about 3-4 years old, from jm.
>
> We will get more! Though I will state that we are mostly pretty good at
How?
> > We will get more! Though I will state that we are mostly pretty good at > > How? We have lots of people interested in contributing. My corpora aren't even included at the moment because something with rsync broke.
Why aren't those interested people contributing yet?
(In reply to comment #22) > Why aren't those interested people contributing yet? Because I haven't re-opened giving out rsync accounts because of a security hole we found that I'd rather not discuss in bugzilla.
(In reply to comment #19) > Same goes for the jhardin fraud corpus, I believe. Correct.
Would it be worth mentioning in the jira ticket that this is holding up a release? Any guesses on how long it takes The Apache Software Foundation to set the time on two computers?
Yesterday was the 9th week that rule updates didn't happen due to this problem. It's been 3.8 days since a ticket was opened for Apache Infrastructure to correctly set the time on the two relevant machines, which might fix it, with no response at all. And 23 days since a SA v3.4.0 Release Candidate was supposed to be released, that's being held up, at least in part, by this problem.
(In reply to comment #26)
> Yesterday was the 9th week that rule updates didn't happen due to this problem.
That concurs closely with my timing, yes.
> It's been 3.8 days since a ticket was opened for Apache Infrastructure to
> correctly set the time on the two relevant machines, which might fix it, with
> no response at all.
The time issue was fixed.
> And 23 days since a SA v3.4.0 Release Candidate was supposed to be released,
> that's being held up, at least in part, by this problem.
The date for the Release Candidate is an estimate. I'm far more worried about the rules operation.
(In reply to comment #27)
> The time issue was fixed.
Nice. I don't suppose they set up ntpd / ntpdate, by any chance?
Today (2011-10-24) and yesterday (2011-10-23) are the first days in a while (2011-08-24?) that the nightly (non-net) ruleqa output includes non-bb corpora. That's encouraging.
> The date for the Release Candidate is an estimate. I'm far more worried about
> the rules operation.
Sure. And if everybody were working on rule generation I wouldn't have brought it up. So I'm wondering what else we need to do for a release while we wait to see if rule generation works this coming Saturday.
(In reply to comment #28)
> > The date for the Release Candidate is an estimate. I'm far more worried about
> > the rules operation.
>
> Sure. And if everybody were working on rule generation I wouldn't have brought
> it up. So I'm wondering what else we need to do for a release while we wait to
> see if rule generation works this coming Saturday.
See that trunk gets the scores file for the auto-promoted rules, so that they are not scored 1.0 because the file is missing.
(In reply to comment #28)
> (In reply to comment #27)
> > The time issue was fixed.
>
> Nice. I don't suppose they set up ntpd / ntpdate, by any chance?
NTPD is supposed to be on the master zone but they gave me no feedback as to the technical nature of the fix beyond:
Zones and Zones2 are on two different hardware virtual machines. Ntpd/ntpupdate ONLY run on the master zone not on virtual zones.
So my assumption is they fixed ntpd on the master zone.
>
> Today (2011-10-24) and yesterday (2011-10-23) are the first days in a while
> (2011-08-24?) that the nightly (non-net) ruleqa output includes non-bb corpora.
> That's encouraging.
Excellent. That was my prediction/hope. I will sacrifice an intern to appease the computer gods if needed ;-)
>
> > The date for the Release Candidate is an estimate. I'm far more worried about
> > the rules operation.
>
> Sure. And if everybody were working on rule generation I wouldn't have brought
> it up. So I'm wondering what else we need to do for a release while we wait to
> see if rule generation works this coming Saturday.
Good question. I'll look. Off-hand, Mark has done a great job of moving the project towards a release with IPv6 support. I'll respond to dev about that.
(In reply to comment #30) > (In reply to comment #28) > > (In reply to comment #27) > > > > Today (2011-10-24) and yesterday (2011-10-23) are the first days in a while > > (2011-08-24?) that the nightly (non-net) ruleqa output includes non-bb corpora. > > That's encouraging. > > Excellent. Okay, it indeed looks like the clock variance is what was causing the masscheck results to be discarded. It's now got a couple of days of full results that it hasn't destroyed. I'll see if I can set up some monitoring tasks in automc's cron. If I can get that working should notifications be sent to the ruleqa list or the dev list?
> Okay, it indeed looks like the clock variance is what was causing the masscheck
> results to be discarded. It's now got a couple of days of full results that it
> hasn't destroyed.
Good call to Darxus. I never would have checked without his impetus.
> I'll see if I can set up some monitoring tasks in automc's cron. If I can get
> that working should notifications be sent to the ruleqa list or the dev list?
My $0.02 is Dev, please. RuleQA should be low volume. And people on Dev will likely know how to use a rule to filter things...
Okay, fixing the time discrepancy between the zones seems to have revived masscheck producing rule updates. Now we just need to figure out why 72_scores.cf is being omitted from the update tarball... (bug #6644)