SA Bugzilla – Bug 2167
bayes: save unmunged tokens during scan, for later learning
Last modified: 2005-03-29 16:08:32 UTC
Here is the first run at a group of hacks to increase SA and Bayes usefulness in a large-scale environment where users either don't have access to shell accounts or aren't comfortable using them. These patches enable a few things:

- hashing (leafing really, but it's a module) of Bayes statedirs separately from user configuration dirs (or SQL)
- logging of Bayes tokens per message into unique files in a folder off of the statedir, for later use to 'train' Bayes via an external method. (We've written a server that lets users view their last N hours of email and select displayed messages as spam or ham.)
- sa-btoc-learn takes care of the backend work of importing the tokens as spam or ham into the Bayes databases.

Configuration of these new features is accomplished through local.cf, and there are a couple of new modules as well. This code has not been thoroughly tested yet, but we wanted to get some other people looking at it. We'll probably have some updates in a few days, when the rest of our new hardware arrives and we are able to run some large-scale tests. External tools (a cron job, etc.) need to handle removal of old 'bayes token logs'.
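The per-message token-log scheme described above might be sketched roughly as follows. This is a hypothetical illustration only (the actual patches are Perl modules against SpamAssassin; the directory layout, file naming, and function names here are invented), showing one way to write unique token-log files under the statedir and expire old ones cron-style:

```python
import hashlib
import os
import time


def write_token_log(statedir, message_id, tokens, score):
    """Write one message's Bayes tokens to a uniquely named file
    under <statedir>/token_logs/, for later spam/ham training.
    (Hypothetical layout; the real patches use their own naming.)"""
    logdir = os.path.join(statedir, "token_logs")
    os.makedirs(logdir, exist_ok=True)
    # Unique filename: timestamp plus a short hash of the message-id.
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:12]
    path = os.path.join(logdir, "%d.%s.log" % (int(time.time()), digest))
    with open(path, "w") as fh:
        fh.write("message-id: %s\n" % message_id)
        fh.write("score: %s\n" % score)
        for tok in tokens:
            fh.write(tok + "\n")
    return path


def expire_token_logs(statedir, max_age_secs):
    """The external cleanup the comment mentions (cron job, etc.):
    delete token logs older than max_age_secs."""
    logdir = os.path.join(statedir, "token_logs")
    now = time.time()
    removed = []
    for name in os.listdir(logdir):
        path = os.path.join(logdir, name)
        if now - os.path.getmtime(path) > max_age_secs:
            os.remove(path)
            removed.append(name)
    return removed
```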
Created attachment 1116 [details] patches for new sonic.net features
Created attachment 1129 [details] revised patch. Fixed patches (I think).
Created attachment 1130 [details] example backend server. Example server code to hook into, via a web app or other front end, so users can access the pre-tokenized messages through learn_from_token_log.
kgc -- looks interesting, as I mentioned a while back. A few comments:

1. It's a bit untidy and needs cleaning up; there are config settings in the local.cf example, a shortage of documentation ;), and the patch includes the "Makefile" (patching using "cvs diff" is better as it ignores generated files).

2. I think it'd be better if, instead of dumping some metadata and the tokens as newline-separated data in the storage files, it used a cleaner parseable format -- such as YAML or a custom one -- so that the format is extensible. I can imagine there may be situations further down the line where we want to add other kinds of data from the message -- full message text, more metadata, etc. If YAML sounds like overkill, a simple "Name: value" header-style thing would work fine here (just list all the tokens on one long line ;). A version line at the top of the file would help keep it forward-compatible, too, in case we need to make serious changes in future.

3. Disk space usage may be an issue -- as I mentioned, the tokens from a mail are often about the same size as the mail itself. Perhaps a good approach would be to store the tokens gzipped, but then we have to consider how to safely store binary data. (If the store can use one index file containing the metadata and then subfiles for the gzipped data, that'd work fine.)

I still haven't got Dan to comment on it, but I think it makes sense ;)
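The "Name: value" header format with a version line (point 2) and gzipped token storage (point 3) could be combined into something like the following sketch. All field names here are invented for illustration, and the tokens are kept on one long line before compression, as suggested:

```python
import gzip

FORMAT_VERSION = 1  # version line at the top keeps the format forward-compatible


def serialize_token_log(metadata, tokens):
    """Build a 'Name: value' header block, then append the token list
    (one long space-separated line) gzip-compressed as a binary body.
    Hypothetical format, not what the patches actually write."""
    header = ["Version: %d" % FORMAT_VERSION]
    for name, value in metadata.items():
        header.append("%s: %s" % (name, value))
    body = gzip.compress(" ".join(tokens).encode("utf-8"))
    header.append("Token-Length: %d" % len(body))
    # Blank line separates the text headers from the binary body.
    return "\n".join(header).encode("utf-8") + b"\n\n" + body


def parse_token_log(blob):
    """Split headers from the gzipped body and recover the tokens."""
    head, _, body = blob.partition(b"\n\n")
    metadata = {}
    for line in head.decode("utf-8").splitlines():
        name, _, value = line.partition(": ")
        metadata[name] = value
    tokens = gzip.decompress(body).decode("utf-8").split()
    return metadata, tokens
```

The blank-line separator is safe here because the header block itself never contains consecutive newlines; everything after the first blank line is treated as opaque binary data.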
Is this really an issue now that we have all this SQL stuff? I'm going to assume it's not, given that there's been nothing for 17 months... Closing WONTFIX; reopen if necessary.
yeah - Kelsey, feel free to pipe up with your opinions -- I'm wondering if it might be more appropriate to have some kind of support for this, stored in an SQL db for example. probably best to take that to a thread in dev@ rather than on this bug.
Well, even with the SQL stuff as is, it doesn't really address the issues that we were trying to solve with these patches -- the ability for the spamd servers to track message tokens and learn on them through a web interface at a later date. But the project's pretty much dead on this end, since we still haven't been able to address the SQL performance needed to do site-wide per-user Bayes. We'll bring it up again if and when we start to work on it.
Subject: Re: enhancements to bayes, statedir, new sa-btoc-learn script Learning via spamd would probably be a huge win here, even without the BayesSQL stuff in place. Actually I can think of several solutions using the existing API, or soon to come API, that would work for what you are trying to do. Michael
I've just heard that apparently DSpam does something similar to this -- it dumps the list of tokens to the SQL database for every message, adds a signature header to the filtered mails, and relearning is then just a matter of extracting the signature from the (possibly mangled) message and pulling the matching token list from the db. This may be useful functionality, since it cleans up one aspect that's quite tricky in many environments -- it's no longer necessary for the user to know how to safely transmit the message they want to learn in an unmunged format. The message can be thoroughly munged, as long as the signature header is intact (or just relatively intact). That's possibly the nastiest UI issue with the whole Bayes training thing. Anyone think this sounds useful? (Reopening just so the idea is tracked.)
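The DSpam-style flow described here might be sketched like this. An in-memory dict stands in for the SQL token table, and the X-Bayes-Signature header name is invented for the sketch (DSpam uses its own header):

```python
import uuid

# Stand-in for the SQL table mapping signature -> token list.
token_db = {}


def filter_message(raw_message, tokens):
    """At scan time: store the message's tokens under a fresh
    signature and tag the outgoing mail with a signature header."""
    sig = uuid.uuid4().hex
    token_db[sig] = list(tokens)
    return "X-Bayes-Signature: %s\n%s" % (sig, raw_message)


def relearn(munged_message, as_spam):
    """At training time: the body may be munged by the MUA or list
    software, but as long as the signature header survives we can
    recover the original token list from the db and learn on that."""
    for line in munged_message.splitlines():
        if line.startswith("X-Bayes-Signature: "):
            sig = line.split(": ", 1)[1].strip()
            tokens = token_db.get(sig)
            if tokens is not None:
                return ("spam" if as_spam else "ham", tokens)
    return None  # signature lost or unknown: cannot relearn safely
```

The point the comment makes is visible in the sketch: relearn() never looks at the (possibly mangled) body, only at the surviving header, so the user can forward or quote the message however they like.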
move bug to Future milestone (previously set to Future -- I hope)