SA Bugzilla – Bug 2167
bayes: save unmunged tokens during scan, for later learning
Last modified: 2005-03-29 16:08:32 UTC
Here is the first run at a group of hacks to increase SA and Bayes usefulness in a large-scale environment where users either don't have access to shell accounts or aren't comfortable using them. These patches enable a few things:

- hashing (leafing really, but it's a module) of Bayes statedirs separately from user configuration dirs (or SQL)
- logging of Bayes tokens per message into unique files in a folder off of the statedir, for later use to 'train' Bayes via an external method. (We've written a server that lets users view their last N hours of email and select displayed messages as spam or ham.)
- sa-btoc-learn takes care of the backend work of importing the tokens as spam or ham into the Bayes databases.

Configuration of these new features is accomplished through local.cf, and there are a couple of new modules as well. This code has not been thoroughly tested yet, but we wanted to get some other people looking at it. We'll probably have some updates in a few days, when the rest of our new hardware arrives and we are able to run some large-scale tests. External tools (a cron job, etc.) need to handle removal of old 'bayes token logs'.
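The per-message token-log scheme described above might be sketched roughly as follows. This is a hypothetical illustration only (the actual patches are Perl modules against SpamAssassin; the directory layout, file naming, and function names here are invented), showing one way to write unique token-log files under the statedir and expire old ones cron-style:

```python
import hashlib
import os
import time


def write_token_log(statedir, message_id, tokens, score):
    """Write one message's Bayes tokens to a uniquely named file
    under <statedir>/token_logs/, for later spam/ham training.
    (Hypothetical layout; the real patches use their own naming.)"""
    logdir = os.path.join(statedir, "token_logs")
    os.makedirs(logdir, exist_ok=True)
    # Unique filename: timestamp plus a short hash of the message-id.
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:12]
    path = os.path.join(logdir, "%d.%s.log" % (int(time.time()), digest))
    with open(path, "w") as fh:
        fh.write("message-id: %s\n" % message_id)
        fh.write("score: %s\n" % score)
        for tok in tokens:
            fh.write(tok + "\n")
    return path


def expire_token_logs(statedir, max_age_secs):
    """The external cleanup the comment mentions (cron job, etc.):
    delete token logs older than max_age_secs."""
    logdir = os.path.join(statedir, "token_logs")
    now = time.time()
    removed = []
    for name in os.listdir(logdir):
        path = os.path.join(logdir, name)
        if now - os.path.getmtime(path) > max_age_secs:
            os.remove(path)
            removed.append(name)
    return removed
```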
Created attachment 1116 [details] patches for new sonic.net features
Created attachment 1129 [details] revised patch. Fixed patches (I think).
Created attachment 1130 [details] example backend server. Example server code to hook into, via a web app or other front end, so users can access the pre-tokenized messages through learn_from_token_log.
kgc -- looks interesting, as I mentioned a while back. A few comments:

1. It's a bit untidy and needs cleaning up; there are config settings in the local.cf example, a shortage of documentation ;), and the patch includes the "Makefile" (patching using "cvs diff" is better as it ignores generated files).

2. I think it'd be better if, instead of dumping some metadata and the tokens as newline-separated data in the storage files, it used a cleaner parseable format -- such as YAML or a custom one -- so that the format is extensible. I can imagine there may be situations further down the line where we want to add other kinds of data from the message -- full message text, more metadata, etc. If YAML sounds like overkill, a simple "Name: value" header-style thing would work fine here (just list all the tokens on one long line ;). A version line at the top of the file would help keep it forward-compatible, too, in case we need to make serious changes in future.

3. Disk space usage may be an issue -- as I mentioned, the tokens from a mail are often about the same size as the mail itself. Perhaps a good approach would be to store the tokens gzipped, but then we have to consider how to safely store binary data. (If the store can use one index file containing the metadata and then subfiles for the gzipped data, that'd work fine.)

I still haven't got Dan to comment on it, but I think it makes sense ;)
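The "Name: value" header format with a version line (point 2) and gzipped token storage (point 3) could be combined into something like the following sketch. All field names here are invented for illustration, and the tokens are kept on one long line before compression, as suggested:

```python
import gzip

FORMAT_VERSION = 1  # version line at the top keeps the format forward-compatible


def serialize_token_log(metadata, tokens):
    """Build a 'Name: value' header block, then append the token list
    (one long space-separated line) gzip-compressed as a binary body.
    Hypothetical format, not what the patches actually write."""
    header = ["Version: %d" % FORMAT_VERSION]
    for name, value in metadata.items():
        header.append("%s: %s" % (name, value))
    body = gzip.compress(" ".join(tokens).encode("utf-8"))
    header.append("Token-Length: %d" % len(body))
    # Blank line separates the text headers from the binary body.
    return "\n".join(header).encode("utf-8") + b"\n\n" + body


def parse_token_log(blob):
    """Split headers from the gzipped body and recover the tokens."""
    head, _, body = blob.partition(b"\n\n")
    metadata = {}
    for line in head.decode("utf-8").splitlines():
        name, _, value = line.partition(": ")
        metadata[name] = value
    tokens = gzip.decompress(body).decode("utf-8").split()
    return metadata, tokens
```

The blank-line separator is safe here because the header block itself never contains consecutive newlines; everything after the first blank line is treated as opaque binary data.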
Is this really an issue now that we have all this SQL stuff? I'm going to assume it's not, given that there's been nothing for 17 months... Closing WONTFIX; reopen if necessary.
yeah - Kelsey, feel free to pipe up with your opinions -- I'm wondering if it might be more appropriate to have some kind of support for this, stored in an SQL db for example. probably best to take that to a thread in dev@ rather than on this bug.
Well, even with the SQL stuff as is, it doesn't really address the issues that we were trying to solve with these patches -- the ability for the spamd servers to track message tokens and learn on them through a web interface at a later date. But the project's pretty much dead on this end, since we still haven't been able to address the SQL performance needed to do site-wide per-user Bayes. We'll bring it up again if and when we start to work on it.
Subject: Re: enhancements to bayes, statedir, new sa-btoc-learn script Learning via spamd would probably be a huge win here, even without the BayesSQL stuff in place. Actually I can think of several solutions using the existing API, or soon to come API, that would work for what you are trying to do. Michael
I've just heard that apparently DSpam does something similar to this -- it dumps the list of tokens to the SQL database for every message, adds a signature header to the filtered mails, and relearning is then just a matter of extracting the signature from the (possibly mangled) message and pulling the matching token list from the db. This may be useful functionality, since it cleans up one aspect that's quite tricky in many environments -- it's no longer necessary for the user to know how to safely transmit the message they want to learn in an unmunged format. The message can be thoroughly munged, as long as the signature header is intact (or just relatively intact). That's possibly the nastiest UI issue with the whole Bayes training thing. Anyone think this sounds useful? (Reopening just so the idea is tracked.)
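The DSpam-style flow described here might be sketched like this. An in-memory dict stands in for the SQL token table, and the X-Bayes-Signature header name is invented for the sketch (DSpam uses its own header):

```python
import uuid

# Stand-in for the SQL table mapping signature -> token list.
token_db = {}


def filter_message(raw_message, tokens):
    """At scan time: store the message's tokens under a fresh
    signature and tag the outgoing mail with a signature header."""
    sig = uuid.uuid4().hex
    token_db[sig] = list(tokens)
    return "X-Bayes-Signature: %s\n%s" % (sig, raw_message)


def relearn(munged_message, as_spam):
    """At training time: the body may be munged by the MUA or list
    software, but as long as the signature header survives we can
    recover the original token list from the db and learn on that."""
    for line in munged_message.splitlines():
        if line.startswith("X-Bayes-Signature: "):
            sig = line.split(": ", 1)[1].strip()
            tokens = token_db.get(sig)
            if tokens is not None:
                return ("spam" if as_spam else "ham", tokens)
    return None  # signature lost or unknown: cannot relearn safely
```

The point the comment makes is visible in the sketch: relearn() never looks at the (possibly mangled) body, only at the surviving header, so the user can forward or quote the message however they like.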
move bug to Future milestone (previously set to Future -- I hope)