SA Bugzilla – Bug 2818
Bayes tokens learnt with atime zero
Last modified: 2003-12-13 01:07:29 UTC
Hi, I've been checking from time to time the contents of my SA 2.60 bayes DB, and I've noticed that it recently learnt 8 tokens from a spam message with a "zero" atime, all 8 tokens coming from the same message given their contents. It seems that something didn't work as expected there...
can you attach that message to this bug ticket? bayes uses a function which determines a received timestamp for the message based on the headers, or 0 if it can't figure it out. apparently that message is strange enough that the date can't be parsed out.
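The "timestamp from the headers, or 0" behaviour described above can be sketched as follows. This is only an illustration in Python, not SpamAssassin's actual Perl code: the function name `received_atime` is hypothetical, and it assumes the date stamp of a Received header is whatever follows the last `;`.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def received_atime(raw_message: str) -> int:
    """Return an epoch timestamp taken from the first parseable
    Received header, or 0 if no header yields a usable date."""
    msg = message_from_string(raw_message)
    for header in msg.get_all("Received", []):
        # Assumption: the date stamp follows the last ';' of the header.
        _, sep, date_part = header.rpartition(";")
        if not sep:
            continue
        parsed = parsedate_tz(date_part.strip())
        if parsed is not None:
            return mktime_tz(parsed)
    # Nothing parseable: fall back to 0, which is what ends up as a "zero" atime.
    return 0
```

A message with no Received headers at all, or with only unparseable dates, falls through to the final `return 0`.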
I'd rather not forward this message with complete headers and analysis from my system to a public list, but if you wish I can gzip it and forward it to you by private mail. Please note one thing about this message: it wasn't detected as spam by SA and scored quite low, but was bayes-learnt when it was reported with "spamassassin -r". Maybe this learning method has something to do with the resulting zero atime?
Subject: Re: [SAdev] Bayes tokens learnt with atime zero
On Mon, Dec 08, 2003 at 11:40:17AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I'd rather not forward this message with complete headers and analysis from my
> system to a public list, but if you wish I can gzip it and forward it to you
> by private mail.
Actually all I'd need is the headers. Private mail would be fine too, whatever's more comfortable. :)
I'll send you the complete message by private mail. Nothing confidential in a spam, but I don't want to publicly expose some mail addresses from my domain that already receive enough spam ;-)
hrm. according to my testing, it gets an atime of 1070670464. if I remove the Received headers, though, it'll use the mbox From separator:
From blah@blah.blah Thu Jan 1 01:00:00 1970
The Bayes code takes the date from the mbox separator, figuring whatever wrote it has already figured out the received date so why do work twice... In my case, that translates to 18000, but could be 0 for you due to timezone. So what do you think the possibility of having no Received headers when scanning would be?
Ooops. I'm sorry. I hadn't realized that the mbox From separator had a "Thu Jan 1 01:00:00 1970" date, which translates to the Unix epoch "0", thus the zero atime. I believe this happens because this mail was manually piped from the KMail MUA to SA -r, using a script. I believe the mbox "From" separator shows this strange 0 time because KMail tried to extract it from the "Date:" header, which looks incorrect with a localized "Date: sam., 06 déc 2003 01:27:38". I think that KMail probably couldn't parse this header and generated the "0" mbox From separator, which was then reused by SA.
Question: Why does the learner bother extracting atime from the headers or mbox separators? Wouldn't it be perfectly fine to just use the current time at the moment of processing? Isn't it better anyway for bayes to record the time when a given token was submitted / learnt, rather than trying to figure out the message's time, which is quite irrelevant for such an application?
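To see why a separator date of "Thu Jan 1 01:00:00 1970" can come out as atime 0, here is a small Python sketch of the conversion. It assumes the separator carries local time with no zone information and that the writing system was at UTC+1; the helper name `separator_epoch` and its `utc_offset_hours` parameter are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def separator_epoch(stamp: str, utc_offset_hours: int) -> int:
    """Interpret a zone-less mbox separator date under a fixed UTC offset
    and return the corresponding Unix epoch value."""
    naive = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    aware = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return int(aware.timestamp())


stamp = "Thu Jan 1 01:00:00 1970"
print(separator_epoch(stamp, 1))  # read as UTC+1 local time: 0
print(separator_epoch(stamp, 0))  # read as UTC: 3600
```

The same string therefore maps to different epoch values depending on the zone assumed by whoever reads it, which is why the result differs between systems.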
Subject: Re: [SAdev] Bayes tokens learnt with atime zero
> Isn't it better anyway for bayes to record the time when a given token was
> submitted / learnt, rather than trying to figure out the message's time,
> which is quite irrelevant for such an application?
if I recall correctly, the problem with that is that using learn-time for initial atime would result in weird expiration behaviour. e.g. learning a batch of spam messages, then learning a batch of ham messages, would result in (let's say) 75,000 spam tokens of the same atime, and 75,000 ham tokens of a younger atime. So on the first expire, more of the spam tokens would be expired because they're "older".
--j.
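The batch-learning skew can be illustrated with a toy simulation in Python. Everything here is hypothetical: the token names, the timestamps, and the "keep the N newest tokens" expiry rule, which is a deliberate simplification and not how SpamAssassin's expiry actually chooses its cutoff.

```python
def expire_keep_newest(atimes, keep):
    """Toy expiry: keep only the `keep` tokens with the newest atime."""
    ranked = sorted(atimes.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:keep])


# (token, class, time the message was received); all values are invented.
messages = [
    ("s1", "spam", 100), ("s2", "spam", 200), ("s3", "spam", 300),
    ("h1", "ham", 150), ("h2", "ham", 250), ("h3", "ham", 350),
]

# The spam batch is learnt first, the ham batch some time later.
SPAM_BATCH_LEARNED_AT = 1000
HAM_BATCH_LEARNED_AT = 2000

# Variant 1: atime is the time of learning.
by_learn_time = {
    tok: SPAM_BATCH_LEARNED_AT if kind == "spam" else HAM_BATCH_LEARNED_AT
    for tok, kind, _ in messages
}
# Variant 2: atime is the time the message was received.
by_message_time = {tok: received for tok, _, received in messages}

survivors_learn = sorted(expire_keep_newest(by_learn_time, 3))
survivors_msg = sorted(expire_keep_newest(by_message_time, 3))
print(survivors_learn)  # ['h1', 'h2', 'h3']: every spam token was expired
print(survivors_msg)    # ['h2', 'h3', 's3']: both classes survive
```

With learn-time atimes the whole earlier batch looks older than the later one, so expiry removes one class wholesale; with message-time atimes the two classes interleave.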
Subject: Re: [SAdev] Bayes tokens learnt with atime zero
On Mon, Dec 08, 2003 at 04:33:16PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> if I recall correctly, the problem with that is that using learn-time for
> initial atime would result in weird expiration behaviour.
Yeah. In short, you care when you _SAW_ the token, not when you _LEARNED_ the token.
Theo Van Dinter wrote:
> Yeah. In short, you care when you _SAW_ the token, not when you _LEARNED_
> the token.
Hmmmmm... With this, if somebody purposely wants to feed _now_ a set of old ham or old spam into his Bayes DB, then the Bayes DB will learn them with old atimes corresponding to the message dates (not the learning date). In such a situation, the next expiry run will expire all the "old messages, newly learnt", which I'm not sure would be the intended result. Personally, I feel it's more important to care about when you _learn_ the token than about the message date. Learnt tokens are either learnt more or less "as mail comes" (autolearn, daily manual feed, etc.), in which case their learning time is close to the message timestamp, or learnt manually in batches. The latter is a manual operation by which one states "I want to learn this now", and then the most relevant date to record is, IMHO, the date of submission to the DB, rather than the message dates, whatever they may be...
Subject: Re: [SAdev] Bayes tokens learnt with atime zero
On Tue, Dec 09, 2003 at 03:35:14PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Hmmmmm... With this, if somebody purposely wants to feed _now_ a set of old
> ham or old spam into his Bayes DB, then Bayes DB will learn them with old
> atimes corresponding to messages dates (not learning date).
Only if those tokens don't already exist. If the token already exists with a newer atime, it's left alone.
> In such a situation, next expiry run will expire all the "old messages, newly
> learnt", which I'm not sure would be the intended result.
Well, the goal of expiry is to get rid of tokens that haven't been seen in a while. Learn time is irrelevant there since you could have seen the token 10 years ago. Just because you want to learn it now doesn't mean anything. You want Bayes to know that the last time the token was seen was 10 years ago. If an expire happens right after you learn it, yeah it'll be removed, but that's the behavior you want.
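The two rules stated here (an existing token with a newer atime is left alone, and expiry drops tokens not seen for a while) can be sketched as a minimal in-memory store. This is a Python illustration under stated assumptions, not SpamAssassin's real storage code; `TokenStore`, `learn` and `expire` are hypothetical names, and the fixed `cutoff` argument stands in for whatever threshold the real expiry computes.

```python
class TokenStore:
    """Toy token database that tracks only the last-seen time per token."""

    def __init__(self):
        self.atime = {}  # token -> epoch seconds when it was last seen

    def learn(self, tokens, msg_atime):
        """Record tokens from a message whose received time is msg_atime."""
        for tok in tokens:
            # Only move the atime forward; a newer stored value is kept.
            if msg_atime > self.atime.get(tok, -1):
                self.atime[tok] = msg_atime

    def expire(self, cutoff):
        """Drop tokens last seen before `cutoff`; return how many were dropped."""
        stale = [tok for tok, seen in self.atime.items() if seen < cutoff]
        for tok in stale:
            del self.atime[tok]
        return len(stale)


store = TokenStore()
store.learn(["viagra"], 500)          # token seen recently
store.learn(["viagra", "old"], 100)   # old message learnt afterwards
removed = store.expire(200)           # "old" goes, "viagra" keeps atime 500
```

Learning an old message never makes an already known token look older, but a token that appears only in the old message enters with the old atime and is the first candidate at the next expire.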
message set atime to 0 ...