Bug 2818 - Bayes tokens learnt with atime zero
Summary: Bayes tokens learnt with atime zero
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 2.60
Hardware: PC Linux
Importance: P5 minor
Target Milestone: 2.70
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2003-12-08 09:16 UTC by Michel Bouissou
Modified: 2003-12-13 01:07 UTC



Description Michel Bouissou 2003-12-08 09:16:29 UTC
Hi, 
 
I've been checking from time to time the contents of my SA 2.60 bayes DB, and 
I've noticed that it recently learnt 8 tokens from a spam message with a 
"zero" atime, all 8 tokens coming from the same message given their contents. 
 
It seems that something didn't work as expected there...
Comment 1 Theo Van Dinter 2003-12-08 11:24:44 UTC
Can you attach that message to this bug ticket?  Bayes uses a function which determines a received 
timestamp for the message based on the headers, or 0 if it can't figure it out.  Apparently that 
message is strange enough that the date couldn't be parsed out.
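
For illustration only, the "received timestamp or 0" behaviour described above could be sketched as follows. This is not SpamAssassin's code (which is Perl); the function name `received_time` and the choice to read the date after the last ';' of each Received header are assumptions made for the example.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def received_time(raw_message: str) -> int:
    """Return a Unix timestamp from the first parseable Received header, else 0."""
    msg = message_from_string(raw_message)
    for header in msg.get_all("Received", []):
        # By convention the date follows the last ';' in a Received header.
        _, sep, date_part = str(header).rpartition(";")
        if not sep:
            continue
        parsed = parsedate_tz(date_part.strip())
        if parsed is not None:
            return mktime_tz(parsed)
    # Nothing usable was found: the caller ends up with a "zero" atime.
    return 0
```

A message with no Received header at all, or with only unparseable dates, falls through to the final `return 0`.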
Comment 2 Michel Bouissou 2003-12-08 11:40:16 UTC
I'd rather not forward this message with complete headers and analysis from my 
system to a public list, but if you wish I can gzip it and forward it to you 
by private mail. 
 
Please note one thing about this message: It wasn't detected as spam by SA and 
scored quite low, but was Bayes-learnt when reported with "spamassassin -r". 
 
Maybe this learning method has something to do with the resulting zero atime ? 
Comment 3 Theo Van Dinter 2003-12-08 11:50:00 UTC
On Mon, Dec 08, 2003 at 11:40:17AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I'd rather not forward this message with complete headers and analysis from my 
> system to a public list, but if you wish I can gzip it and forward it to you 
> by private mail. 

Actually all I'd need is the headers.  Private mail would be fine too,
whatever's more comfortable. :)

Comment 4 Michel Bouissou 2003-12-08 15:29:01 UTC
I'll send you the complete message by private mail. Nothing confidential in a 
spam, but I don't want to publicly expose some mail addresses from my 
domain, that already receive enough spam ;-) 
Comment 5 Theo Van Dinter 2003-12-08 15:56:21 UTC
hrm.  according to my testing, it gets an atime of 1070670464.

if I remove the Received headers, though, it'll use the mbox From separator:

From blah@blah.blah Thu Jan  1 01:00:00 1970

The Bayes code takes the date from the mbox separator, figuring whatever wrote it has already 
figured out the received date so why do work twice...  In my case, that translates to 18000, but 
could be 0 for you due to timezone.

So what do you think the possibility of having no Received headers when scanning would be?
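
As a rough sketch of why the same separator line can yield different epoch values, the snippet below interprets the date of an mbox "From " line as local time for a given UTC offset. The helper `separator_epoch` and its `utc_offset_hours` parameter are hypothetical names for this example, and the code assumes an English locale for the day and month names.

```python
import calendar
import time


def separator_epoch(from_line: str, utc_offset_hours: int) -> int:
    """Convert the date in an mbox "From " line to a Unix timestamp.

    utc_offset_hours is the assumed local offset of the machine that
    interprets the date, e.g. 1 for UTC+1.
    """
    # "From sender Thu Jan  1 01:00:00 1970" -> keep the last five fields.
    date_text = " ".join(from_line.split()[-5:])
    parsed = time.strptime(date_text, "%a %b %d %H:%M:%S %Y")
    # timegm reads the fields as UTC; subtract the offset to get true UTC.
    return calendar.timegm(parsed) - utc_offset_hours * 3600


line = "From blah@blah.blah Thu Jan  1 01:00:00 1970"
atime_utc_plus_1 = separator_epoch(line, 1)  # 01:00 local at UTC+1 is the epoch
atime_utc = separator_epoch(line, 0)         # one hour after the epoch
```

On a machine at UTC+1 the separator above is exactly the Unix epoch, which matches a stored atime of 0.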
Comment 6 Michel Bouissou 2003-12-08 16:11:49 UTC
Ooops. I'm sorry. I hadn't realized that the mbox From separator had a "Thu 
Jan  1 01:00:00 1970" date, that translates to the Unix "0" epoch, thus the 
zero atime. 
 
I believe this happens because this mail has manually been piped from a KMail 
MUA to SA -r, using a script. 
I believe the mbox "From" separator shows this strange 0 time because the 
KMail MUA tried to extract it from the "Date:" header, which looks incorrect 
with a localized "Date: sam., 06 déc 2003 01:27:38". 
 
I think that KMail probably couldn't parse this header, generating the "0" mbox 
From separator, that was then reused by SA. 
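
KMail's own parser is not shown in this thread, so the following is only an analogy: Python's standard RFC 2822 date parser also rejects the localized header, and a caller that falls back to 0 on failure ends up with the epoch. The function name `date_header_epoch` is made up for the example.

```python
from email.utils import mktime_tz, parsedate_tz


def date_header_epoch(value: str) -> int:
    """Return the Unix time for a Date header value, or 0 if it cannot be parsed."""
    parsed = parsedate_tz(value)
    return mktime_tz(parsed) if parsed is not None else 0


localized = date_header_epoch("sam., 06 déc 2003 01:27:38")       # month name not recognized
standard = date_header_epoch("Sat, 06 Dec 2003 01:27:38 +0100")   # RFC 2822 form
```

The localized value fails because "déc" is not an English month abbreviation; the missing timezone alone would not make the header unparseable.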
 
Question: Why does the learner bother extracting atime from the headers or 
mbox separators ? Wouldn't it be perfect to just use current atime at time of 
processing ? 
 
Isn't it anyway better for bayes to record the time when a given token was 
submitted / learnt, rather than trying to figure out the message's time, 
which is quite irrelevant for such an application ? 
 
Comment 7 Justin Mason 2003-12-08 16:33:15 UTC
>Isn't it anyway better for bayes to record the time when a given token was 
>submitted / learnt, rather than trying to figure out the message's time, 
>which is quite irrelevant for such an application ? 

if I recall correctly, the problem with that is that using learn-time for
initial atime would result in weird expiration behaviour.
  
e.g. learning a batch of spam messages, then learning a batch of ham
messages, would result in (let's say) 75,000 spam tokens of the same
atime, and 75,000 ham tokens of a younger atime.  So on the first expire,
more of the spam tokens would be expired because they're "older".

- --j.
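
The skew described above can be illustrated with a toy model. This is a sketch under stated assumptions, not SpamAssassin's expiry algorithm: the token store is a plain dict, `learn_batch` and `expire` are hypothetical helpers, and the atimes are invented numbers.

```python
from typing import Dict, Iterable


def learn_batch(db: Dict[str, int], tokens: Iterable[str], atime: int) -> None:
    """Record every token in the batch with the same atime (learn-time model)."""
    for token in tokens:
        db[token] = atime


def expire(db: Dict[str, int], cutoff: int) -> Dict[str, int]:
    """Return the tokens that survive: those with atime >= cutoff."""
    return {token: atime for token, atime in db.items() if atime >= cutoff}


# Hypothetical learn times: the spam batch is learned first, the ham batch later.
db: Dict[str, int] = {}
learn_batch(db, ["spamtok1", "spamtok2"], atime=1000)
learn_batch(db, ["hamtok1", "hamtok2"], atime=2000)

# Any cutoff between the two learn times removes every spam token and no ham token.
survivors = expire(db, cutoff=1500)
```

With learn-time atimes, which batch survives depends only on the order the batches were fed in, not on how recently the tokens were actually seen in mail.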

Comment 8 Theo Van Dinter 2003-12-08 16:47:44 UTC
On Mon, Dec 08, 2003 at 04:33:16PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> if I recall correctly, the problem with that is that using learn-time for
>   initial atime would result in weird expiration behaviour.

Yeah.  In short, you care when you _SAW_ the token, not when you _LEARNED_
the token.

Comment 9 Michel Bouissou 2003-12-09 15:35:13 UTC
Theo Van Dinter wrote: 
> 
> Yeah.  In short, you care when you _SAW_ the token, not when you _LEARNED_ 
> the token. 
 
Hmmmmm... With this, if somebody purposely wants to feed _now_ a set of old 
ham or old spam into his Bayes DB, then Bayes DB will learn them with old 
atimes corresponding to messages dates (not learning date). 
In such a situation, next expiry run will expire all the "old messages, newly 
learnt", which I'm not sure would be the intended result. 
 
Personally, I feel it's more important to care about when you _learn_ the token 
than about the message date. Learnt tokens are either learnt more or less 
"as mail comes" (autolearn, daily manual feed, etc.), in which case their 
learning time is close to the message timestamp, or learnt manually in 
batches. The latter is a manual operation by which one states "I want to learn 
this now", and then the most relevant date to record is, IMHO, the date of 
submission to the DB rather than the message dates, whatever they may be... 
 
Comment 10 Theo Van Dinter 2003-12-13 10:06:57 UTC
On Tue, Dec 09, 2003 at 03:35:14PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Hmmmmm... With this, if somebody purposely wants to feed _now_ a set of old 
> ham or old spam into his Bayes DB, then Bayes DB will learn them with old 
> atimes corresponding to messages dates (not learning date). 

Only if those tokens don't already exist.  If the token already exists
with a newer atime, it's left alone.
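
A minimal sketch of that rule, assuming a dict-based store and a hypothetical helper `touch_token`. The thread only states that a token with a newer atime is left alone; advancing an older atime to the message's newer one is an assumption of this example.

```python
from typing import Dict


def touch_token(db: Dict[str, int], token: str, msg_atime: int) -> None:
    """Store msg_atime for a new token; never move an existing atime backwards."""
    current = db.get(token)
    # Assumption: an existing older atime is advanced to the newer message time.
    if current is None or msg_atime > current:
        db[token] = msg_atime


db = {"hello": 2000}
touch_token(db, "hello", 1000)    # existing token with a newer atime: left alone
touch_token(db, "newtoken", 1000)  # unseen token: stored with the old message date
```

So feeding an old corpus only introduces old atimes for tokens the database has never seen.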

> In such a situation, next expiry run will expire all the "old messages, newly 
> learnt", which I'm not sure would be the intended result. 

Well, the goal of expiry is to get rid of tokens that haven't been seen
in a while.  Learn time is irrelevant there since you could have seen
the token 10 years ago.  Just because you want to learn it now doesn't
mean anything.  You want Bayes to know that the last time the token was
seen was 10 years ago.  If an expire happens right after you learn it,
yeah it'll be removed, but that's the behavior you want.

Comment 11 Theo Van Dinter 2003-12-13 10:07:29 UTC
message set atime to 0 ...