Bug 6386

Summary: Limit corpora network test age in score generation
Product: Spamassassin
Reporter: Jeff Chan <jeffc>
Component: Score Generation
Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW ---
Severity: major
CC: apache, Darxus, davej, jeffc, jm, kmcgrail
Priority: P5
Version: SVN Trunk (Latest Devel Version)
Target Milestone: ---
Hardware: Other
OS: All
Whiteboard:

Description Jeff Chan 2010-03-24 07:16:15 UTC
[I'm marking this as major severity since it could have a major effect on the scores of all network tests.  Feel free to adjust as appropriate.]

Justin mentioned that old ham hits (resulting in false positives) from network tests are carried forward through time when new scores are generated: the hits recorded in the original score generation run, from when a given ham sample was first introduced, are reused.  This seems inappropriate, especially in the case of network tests, since the data behind network tests tends to change over time.  In particular, an FP on an old network test may no longer be an FP against current network test data, i.e., the entry that caused the FP may have been removed after the original scoring run.  As a result, such retrospective FPs under the existing score generation system may not reflect actual FPs from current network test data, leading to a lower-than-appropriate score for a particular test.

One solution would be to have some kind of time limit on network test results.  Some blacklist/blocklist data are highly dynamic and tend to change from day to day, so an expiration time on the order of a few days may be appropriate.
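
A minimal sketch of the expiration idea, not anything that exists in the masscheck tools: if the scan that produced a set of hits is older than a few days, drop the network-rule hits and keep the rest. The function name, the `NETWORK_RULES` set, and the 3-day default are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical set of rule names treated as network tests (assumption;
# a real list would come from the ruleset itself).
NETWORK_RULES = {"URIBL_SBL", "RCVD_IN_XBL"}


def expire_network_hits(hits, scanned_at, now, max_age_days=3):
    """Return the hits to keep for a message scanned at `scanned_at`.

    If the scan is no older than `max_age_days`, all hits are kept.
    Otherwise hits on rules in NETWORK_RULES are dropped, since the
    network data behind them may have changed since the scan.
    """
    if now - scanned_at <= timedelta(days=max_age_days):
        return list(hits)
    return [rule for rule in hits if rule not in NETWORK_RULES]


if __name__ == "__main__":
    now = datetime(2010, 3, 24, tzinfo=timezone.utc)
    month_old = now - timedelta(days=30)
    print(expire_network_hits(["URIBL_SBL", "HTML_MESSAGE"], month_old, now))
```

Dropping a stale hit is only one option; treating the whole log line as unusable for network score sets would be another.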
Comment 1 Jeff Chan 2010-03-24 07:46:43 UTC
[changed summary slightly; it's not so much the corpora that are incorrectly aged, but the network test results on those corpora]
 
Another solution is to run the network tests again for ham, but not for spam.  While ham FPs should tend to decrease over time, replaying old spam may produce FNs due to natural delisting/expiration on blacklists.  Spam data tend to expire off lists due to time locality, i.e., old blacklist data become unproductive and are removed as a result.
Comment 2 Justin Mason 2010-03-24 11:13:45 UTC
hey -- thanks for opening the bug.

I don't think we can safely run against old ham, either; there are innocuous URLs in 5-year-old ham messages which have expired and been stolen by a spammer.

http:// sitescooper dot org/ is an example of this.  It used to host a piece of software I wrote, but we let it expire, and a Russian link-farm picked it up; their NSes are on the SBL, so it now hits URIBL_SBL when re-scanned.
Comment 3 Darxus 2011-10-28 17:02:59 UTC
Current corpora limits for score generation are:
Ham: 6 years.
Spam: 2 months.

So, we should reduce the limit for ham?  To what?  

Score generation has a threshold of a minimum of 150,000 hams.  The 150,000th newest ham submitted on 2011-10-22 (which includes the bb corpora) was dated:  
Tue Apr 17 09:33:16 UTC 2007.  About 4.6 years.
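
The date above is the oldest cutoff forced by the 150,000-ham minimum: any tighter age limit leaves fewer hams than the threshold. A rough sketch of that lookup, assuming the ham timestamps are available as epoch seconds (the function name and the input list are hypothetical):

```python
import heapq


def nth_newest(timestamps, n):
    """Return the timestamp of the n-th newest item, or None if there
    are fewer than n items (i.e. the minimum cannot be met at all)."""
    if n <= 0 or len(timestamps) < n:
        return None
    # nlargest returns the n newest timestamps in descending order,
    # so the last one is the n-th newest.
    return heapq.nlargest(n, timestamps)[-1]


if __name__ == "__main__":
    sample = [10, 50, 30, 40, 20]
    # With a minimum of 3 items, the cutoff cannot be newer than this.
    print(nth_newest(sample, 3))
```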

29.8% of the ham currently used in score generation is from 2008 or older, from jm's corpus.

So I think it's important to fix the problem with adding new masscheck accounts, and get more data from more people.


It looks like the place to change this limit is rulesrc/sandbox/dos/new-rule-score-gen/generate-new-scores, arguments to log-grep-recent:
masses/log-grep-recent -m 72 ../corpus/usable-corpus-set$SCORESET/ham-*.log > masses/ham-full.log
masses/log-grep-recent -m 2 ../corpus/usable-corpus-set$SCORESET/spam-*.log > masses/spam-full.log

And ruleqa should be changed to match:
masses/rule-qa/reports-from-logs
my $OLDEST_HAM_WEEKS     = 72 * 4;       # 72 months = 6 years
my $OLDEST_SPAM_WEEKS    = 2 * 4;        # 2 months
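
To make the effect of an age limit concrete, here is a rough sketch of filtering log lines by age. It is not the log-grep-recent implementation; it assumes each log line carries a `time=<epoch seconds>` field, and the sample lines and function name are made up for illustration.

```python
import re

SECONDS_PER_DAY = 86400

# Matches "time=<digits>" but not the tail of "scantime=<digits>".
TIME_FIELD = re.compile(r"\btime=(\d+)")


def keep_recent(lines, now_epoch, max_age_days):
    """Keep only lines whose time= field is within max_age_days of now.

    Lines without a time= field are dropped, on the assumption that
    their age cannot be established.
    """
    cutoff = now_epoch - max_age_days * SECONDS_PER_DAY
    kept = []
    for line in lines:
        match = TIME_FIELD.search(line)
        if match and int(match.group(1)) >= cutoff:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Hypothetical log lines; the real format may differ.
    sample = [
        ". 0 /ham/a RULE_A time=1000000000,scantime=0",
        ". 0 /ham/b RULE_B time=1300000000,scantime=0",
    ]
    for line in keep_recent(sample, now_epoch=1319821379, max_age_days=365):
        print(line)
```

One thing to check when aligning the two limits: 72 * 4 weeks is 288 weeks, or 2016 days, which is about 5.5 years rather than 6 if a month is taken as a calendar month.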
Comment 4 Darxus 2011-11-08 17:53:46 UTC
Can I get some other opinions on what the ham age limit should be?

There's a nice graphical representation of the problem in this graph:  http://www.chaosreigns.com/dnswl/ham.svg

See that big hump on the right at the top, the light blue "At least None" line?  Where it goes from ~50, up to 60-62 for a while, then back down to ~47?  That 29% drop at the end was due to JM's corpora being added back; his ham corpus, mostly 3 to 4 years old, makes up 30% of the ham we use for re-scoring.

That "At least None" line represents the percent of ham that hits any rank of DNSWL.org.  And it shows that using so much data that's so old is really screwing up how accurately we measure the performance of things like white lists.  

20110806 50.6 
20110813 50.3545  bb present
20110820 50.5765 

20110910 62.304 
20110917 62.406 
20110924 61.4487 
20111001 60.9607  bb missing
20111008 60.9483 
20111015 60.5923 
20111022 61.6126 

20111029 47.4826  bb present
20111105 47.6509 
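
The weekly figures above can be summarized with a little arithmetic. This sketch assumes each "bb present" / "bb missing" label applies to the whole group of rows it sits in; the variable names are only for illustration. The size of the drop depends on which level is used as the baseline, so both are computed.

```python
from statistics import mean

# Percent of ham hitting any DNSWL rank, grouped by whether the bb
# corpora were included (grouping is an assumption from the labels).
bb_present_before = [50.6, 50.3545, 50.5765]
bb_missing = [62.304, 62.406, 61.4487, 60.9607, 60.9483, 60.5923, 61.6126]
bb_present_after = [47.4826, 47.6509]

missing_avg = mean(bb_missing)
after_avg = mean(bb_present_after)

# Drop in percentage points when the bb corpora came back.
point_drop = missing_avg - after_avg

# The same drop expressed relative to each of the two levels.
drop_vs_missing = point_drop / missing_avg * 100
drop_vs_after = point_drop / after_avg * 100

if __name__ == "__main__":
    print(f"bb missing average:  {missing_avg:.2f}")
    print(f"bb present average:  {after_avg:.2f}")
    print(f"drop in points:      {point_drop:.2f}")
    print(f"relative to missing: {drop_vs_missing:.1f}%")
    print(f"relative to present: {drop_vs_after:.1f}%")
```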

I realize this problem is critically linked to fixing our ability to add new masscheck accounts, but I'd like to try to get consensus on what the ham age limit should be changed to.
Comment 5 Kevin A. McGrail 2011-11-08 17:59:13 UTC
> I realize this problem is critically linked to fixing our ability to add new
> masscheck accounts, but I'd like to try to get consensus on what the ham age
> limit should be changed to.

Recommend we visit this again in 4 months to give time to get more mass checkers. I am working through the backlog and got at least one person their password yesterday because they are a committer.

But having a specific age implies that spammers will simply be able to use their old tricks again after X number of months or years.

So "once promoted, always promoted" becomes a bit of an interesting discussion.

Perhaps make a "hyper-efficient" ruleset for those that are interested in saving cycles?
Comment 6 Kevin A. McGrail 2015-04-13 21:49:57 UTC
Pushing to 3.4.2
Comment 7 Kevin A. McGrail 2018-09-04 15:33:06 UTC
Moving off a specific release.  Dave, I think we have sane limits on spam/ham age now, yes?  Can you document what they are here and let's close this ticket as resolved?
Comment 8 Henrik Krohns 2018-09-17 12:19:40 UTC
I'm surprised no one has mentioned "reuse" in this thread. That should be the fix for network tests, not some arbitrary age limits. Network test results should be from the time the MX received the mail, period.

It might need some extra work from masscheckers, but if they are willing to do it (with help if needed), there's not much downside.
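
A minimal sketch of the reuse idea, assuming the receiving MX ran SpamAssassin and left an `X-Spam-Status` header with a `tests=` list in the stored message. This is not the mass-check code; the function names and the set of network rules passed in are hypothetical.

```python
import re
from email import message_from_string


def recorded_tests(raw_message):
    """Return the rule names listed in X-Spam-Status, or None if the
    message has no such header (nothing to reuse)."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None
    # The tests= list may be folded across lines after a comma.
    unfolded = re.sub(r",\s+", ",", status)
    match = re.search(r"\btests=(\S*)", unfolded)
    if not match or match.group(1) in ("", "none"):
        return []
    return [name for name in match.group(1).split(",") if name]


def reusable_network_hits(raw_message, network_rules):
    """Keep only the recorded hits that belong to network rules."""
    tests = recorded_tests(raw_message)
    if tests is None:
        return None
    return [name for name in tests if name in network_rules]


if __name__ == "__main__":
    raw = (
        "X-Spam-Status: No, score=-2.3 required=5.0 tests=BAYES_00,\n"
        "\tRCVD_IN_DNSWL_MED,HTML_MESSAGE autolearn=ham version=3.3.1\n"
        "Subject: hello\n"
        "\n"
        "body\n"
    )
    print(reusable_network_hits(raw, {"RCVD_IN_DNSWL_MED", "URIBL_SBL"}))
```

Messages without the header return None here, so a caller can tell "no recorded result" apart from "scanned, no network hits".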