Bug 7890 - Integration with IMAP servers
Summary: Integration with IMAP servers
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd
Version: unspecified
Hardware: All All
Importance: P2 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2021-03-14 20:40 UTC by mi+apache@aldan.algebra.com
Modified: 2021-04-08 06:45 UTC
Description mi+apache@aldan.algebra.com 2021-03-14 20:40:54 UTC
On my little server, we have a convention -- messages that users consider spam are moved into each account's "spam" folder.

This is easy to set up -- Thunderbird's own filtering will do just that with the spam-flagged messages, if you let it.

To keep such messages from even entering the server, I run `sa-learn --spam` against all of these mailboxes from cron. Cyrus IMAPD keeps each message in a separate file, which makes this easy.

Easy but wasteful -- because the old spam messages are relearned over and over. It is also less effective, because the server's configuration is only updated once every few hours -- so multiple copies of the same or similar spam can enter again and again undetected before spamd learns about it.

The enhancement being requested would add code to sa-spamd to monitor the specified list of directories, using inotify on Linux, kqueue on BSD, and whatever else is suitable on other operating systems to:

1. Pass any new file appearing /in/ such a directory to (some equivalent of) sa-learn as spam.
2. Pass any file being moved /out/ of such a directory to (some equivalent of) sa-learn as ham.
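
The two rules above can be sketched in Python as a comparison of two directory listings. This is only an illustration of the requested behavior, not an implementation of it: it compares snapshots instead of using inotify or kqueue, the names `snapshot` and `classify_changes` are hypothetical, and a snapshot diff cannot tell a message that was moved out from one that was deleted.

```python
import os


def snapshot(folder):
    """Return the set of regular-file names currently in the folder."""
    with os.scandir(folder) as entries:
        return {entry.name for entry in entries if entry.is_file()}


def classify_changes(before, after):
    """Apply the two rules from the request to two snapshots.

    Returns (appeared, vanished): names that showed up in the folder
    (rule 1, learn as spam) and names that left it (rule 2, candidates
    for learning as ham).
    """
    appeared = sorted(after - before)
    vanished = sorted(before - after)
    return appeared, vanished
```

An event-driven version would receive the same two kinds of change directly from the kernel instead of computing them from listings.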

This feature may not be useful to /all/, but it will be useful to /many/. Please consider implementing it.
Comment 1 Benny Pedersen 2021-03-14 21:55:21 UTC
Is disabling Bayes per user not an option?

I think that, if anything should be changed, it could be more like dspam, where a message is learned once but tracked per user.

Or should Bayes be both global and per user?

But take all of that to the user@ mailing list, since it's not a bug yet.
Comment 2 mi+apache@aldan.algebra.com 2021-03-14 21:58:33 UTC
(In reply to Benny Pedersen from comment #1)
> disable bayes per user is not an option?

I don't think you understood the request. It is not about "per server" vs. "per user". It is about spamd monitoring a directory (or list thereof) and "learning" things based on files appearing in such a directory -- or being moved out of one.

> but all that mess take it to user@ mailing list since it's not a bug yet

It is not a bug, nor claimed to be a bug. On the contrary, it is explicitly marked as "enhancement".
Comment 3 Loren Wilton 2021-03-14 22:36:44 UTC
> 2. Pass any file being moved /out/ of such a directory to (some equivalent of) 
> sa-learn as ham.

Would you really want to do this? You said the problem (if that is the correct word) was that spam messages were being repeatedly re-learned. Personally, if they still represent spam, I don't see the problem. But assuming it is, I'd expect that removing it from a spam folder should only mean that it has been sufficiently learned, not that it has become ham. Messages could be removed from the folder by a cron job, or after they have been learned.

Personally I'd suggest having both spam and ham folders, and learning from both of them in the appropriate directions. I'd probably also only be concerned with a relearn when something was added to one or the other, not when it was removed.

All that said, you are hopefully either running per-user bayes, or you are monitoring what gets put into these spam folders, since it is notorious that users often stick stuff into spam folders that isn't spam. As long as it only affects their mail it isn't necessarily a problem.
Comment 4 mi+apache@aldan.algebra.com 2021-03-14 23:01:14 UTC
(In reply to Loren Wilton from comment #3)
> > 2. Pass any file being moved /out/ of such a directory to (some
> > equivalent of) sa-learn as ham.

> Would you really want to do this? You said the problem (if that is the
> correct word) was that spam messages were being repeatedly re-learned.

That and the other things: the reaction is /delayed/ -- until the next time the sa-learn cron job runs.

And yes, I could implement such a directory-watching daemon myself. But I don't want /another/ daemon -- it should be part of spamd, in my opinion. Both logically, and from the resource-consuming point of view: spamd is already running, and it has all of the Bayes code in it.

> But assuming it is, I'd expect that removing it from a spam folder should only
> mean that it has been sufficiently learned, not that it has become ham.

> Messages could be removed from the folder by a cron job, or after they have
> been learned.

They could be. That delay will still be there, though.

> Personally I'd suggest having both spam and ham folders, and learning from

Maintaining an explicit "ham" folder is too much of a burden on the users. The best I could think of is treating INBOX itself as ham, but only for messages older than N days -- on the assumption that a user NOT labeling a message as spam within N days is good evidence of it being ham.

But that's a different topic altogether.

> All that said, you are hopefully either running per-user bayes [...]

Per-user Bayes here is simply what Thunderbird comes with. But this is yet another topic...
Comment 5 RW 2021-03-14 23:53:56 UTC
(In reply to mi+apache@aldan.algebra.com from comment #4)
> (In reply to Loren Wilton from comment #3)

> That and the other things: the reaction is /delayed/ -- until the next time
> sa-learn cron-job runs.

There's nothing to stop you running it more often - particularly if you make it more efficient.

> And yes, I could implement such a directory-watching daemon myself. But I
> don't want /another/ daemon -- it should be part of spamd, in my opinion.
> Both logically, and from the resource-consuming point of view: spamd is
> already running, and it has all of the Bayes code in it.

I don't think that's ideal as it means giving spamd read access to the mail store - something it wouldn't otherwise need.

Dovecot has a plugin that largely does what you want. It can also be done using "IMAP Sieve", which allows sieve-like scripts to handle IMAP events. I don't know whether Cyrus has equivalent functionality; I suspect it does.


> > But assuming it is, I'd expect that removing it from a spam folder should only
> > mean that it has been sufficiently learned, not that it has become ham.
> 
> > Messages could be removed from the folder by a cron job, or after they have
> > been learned.
> 
> They could be. That delay will still be there, though.

It would become so cheap you could run it once a second if you like.
Comment 6 mi+apache@aldan.algebra.com 2021-03-15 02:48:15 UTC
(In reply to RW from comment #5)
> (In reply to mi+apache@aldan.algebra.com from comment #4)
> > (In reply to Loren Wilton from comment #3)
> 
> > the reaction is /delayed/ -- until the next time sa-learn cron-job runs.
> 
> There's nothing to stop you running it more often - particularly if you make
> it more efficient.

It is still polling -- an inferior method: increasing frequency increases load, but the reaction is still delayed.

> I don't think that's ideal as it means giving spamd read access to
> the mail store - something it wouldn't otherwise need.

It processes all incoming mail already. I don't think giving it access to the spam folders is particularly promiscuous :)

> Dovecot has a plugin that largely does what you want. It can also be done
> using "IMAP Sieve" which allows sieve-like scripts to handle IMAP events.
> I don't know whether Cyrus has equivalent functionality, I suspect it does.  

At best, that functionality will invoke sa-learn each time -- a separate Perl program. That is not as efficient as the already-running Perl program with all the necessary code already in it.

> It would become so cheap you could run it once a second if you like.

Which would be quite wasteful, when there is no mail... And not frequent enough, when there is a lot of it -- the problem of polling in general.
Comment 7 RW 2021-03-15 22:25:58 UTC
(In reply to mi+apache@aldan.algebra.com from comment #6)
> (In reply to RW from comment #5)
> > (In reply to mi+apache@aldan.algebra.com from comment #4)

> It is still polling -- an inferior method: increasing frequency increases
> load, but the reaction is still delayed.

It makes little difference as long as the interval is small compared to the typical time users take to react to misclassifications.

I use ls to determine whether there is anything in a training folder before running sa-learn on it. Typically it isn't even accessing the drive as it's working on cached metadata. 
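
The check described above could be sketched in Python roughly as follows. This is a minimal sketch, not the script actually in use: the function names and the injectable `run` parameter are hypothetical, and it assumes a dedicated training folder that holds only message files. The only external command is `sa-learn --spam`, as used in the original report.

```python
import os
import subprocess


def has_messages(folder):
    """Return True if the folder contains at least one regular file."""
    with os.scandir(folder) as entries:
        return any(entry.is_file() for entry in entries)


def learn_if_nonempty(folder, run=subprocess.run):
    """Run sa-learn on the folder only when it holds something to learn.

    `run` is injectable so the logic can be exercised without sa-learn
    installed; by default it is subprocess.run.
    Returns the command that was run, or None when the folder was empty.
    """
    if not has_messages(folder):
        return None  # cheap path: only the directory listing was read
    command = ["sa-learn", "--spam", folder]
    run(command, check=True)
    return command
```

On an empty folder the function returns without starting Perl at all, which is where the saving comes from.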

I'm not sure that your idea can be made reliable without keeping an extra database or doing a periodic full retrain. 

> > Dovecot has a plugin that largely does what you want. It can also be done
> > using "IMAP Sieve" which allows sieve-like scripts to handle IMAP events.
> > I don't know whether Cyrus has equivalent functionality, I suspect it does.  
> 
> At best, that functionality will be invoking sa-learn each time -- a
> separate perl-program. Not as efficient as the already-running perl-program
> with all necessary code already in it.

spamc can be used to train via spamd if you prefer, but then you get the problem of what happens when spamd is not running. That's fairly easy to fix, but IMHO the saving in CPU cycles is unlikely to be worth the effort.

Doing it from the IMAP server has the advantage that you can train as ham when mail is moved from the spam folder, and it can distinguish the special case of spam being sent to a trash folder.

I'm not sure you should even be training directly on a Cyrus mailbox, I think they contain additional metadata files. Training from IMAP would avoid any problems around that.
Comment 8 mi+apache@aldan.algebra.com 2021-03-15 22:38:33 UTC
(In reply to RW from comment #7)
> (In reply to mi+apache@aldan.algebra.com from comment #6)
> > (In reply to RW from comment #5)
> > > (In reply to mi+apache@aldan.algebra.com from comment #4)
> 
> > It is still polling -- an inferior method: increasing frequency increases
> > load, but the reaction is still delayed.
> 
> It makes little difference as long as the interval is small compared to
> the typical time the users take to react to misclassifications.

It should still be done without /further/ delay. Also, Thunderbird's own Bayes is invoked automatically, without any action by the user.

A "small" interval means it is done too often -- and still, there is a delay of, on average, half the polling interval. This is an inevitable flaw of polling.
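
The "half the polling interval" figure can be checked with a little arithmetic. The sketch below assumes events arrive evenly spread over one interval and that processing itself takes no time; the function name and the 60-second interval in the usage note are illustrative only.

```python
def mean_polling_delay(interval, samples=1000):
    """Average wait until the next poll for events spread evenly over one interval."""
    step = interval / samples
    # Event k happens at the midpoint of the k-th slice of the interval.
    arrivals = [(k + 0.5) * step for k in range(samples)]
    # Each event waits from its arrival until the poll at the end of the interval.
    delays = [interval - t for t in arrivals]
    return sum(delays) / samples
```

For a cron job running once a minute, `mean_polling_delay(60)` comes out at about 30 seconds, and the worst case approaches the full interval.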
 
> I use ls to determine whether there is anything in a training folder before
> running sa-learn on it. Typically it isn't even accessing the drive as it's
> working on cached metadata.

Human beings cannot distinguish a millisecond from a microsecond. That's not a good reason to not care about things taking 1000 times longer than they need to take...

> I'm not sure that your idea can be made reliable without keeping an extra
> database or doing a periodic full retrain.

Such a retrain can still be done -- via cron -- but a lot less often. Say, once a day, or even at reboot.

> spamc can be used to train to spamd if you prefer

Really? Can you elaborate? If spamc can -- without itself loading the Bayesian functionality -- tell spamd to process yet another file (as either spam or ham), that will solve a big part of the problem.

One'd still need a daemon, but it can be as simple as inotifyd...

> Doing it from the IMAP server has the advantage that you can train as ham
> when mail is moved from the spam folder, and it can distinguish the special
> case of spam being sent to a trash folder.

Yes, that is the situation I'm describing here:
1. sa-learn runs on the same machine as the imap-server.
2. sa-learn trains the same database used by spamd guarding the incoming mail to the same server.

> I'm not sure you should even be training directly on a Cyrus mailbox, I
> think they contain additional metadata files.

Yes, there are metadata files there, but they are not appearing /anew/. Unlike e-mail messages, which appear as new files, one message per file. Very convenient.

> Training from IMAP would avoid any problems around that.

Teaching IMAP to talk to spamd's database is (much) harder than teaching spamd to monitor a few directories.
Comment 9 Bill Cole 2021-03-15 23:24:57 UTC
(In reply to mi+apache@aldan.algebra.com from comment #8)
> (In reply to RW from comment #7)
[...]
> > spamc can be used to train to spamd if you prefer
> 
> Really? 

Of course. Feeding spamd is what spamc is for. See spamd/PROTOCOL for how it does that. RTFM for all the details. 

> Can you elaborate? 

Use "-L ham" and "-L spam" options. It's all there on the fine man page. 
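
As a rough illustration of the options just mentioned, a wrapper that hands one message file to spamc might look like the sketch below. The function names and the injectable `run` parameter are hypothetical; the sketch assumes spamc is on the PATH and reads the message from standard input. It deliberately does not interpret spamc's exit status, so check the man page for what the exit codes mean when learning.

```python
import subprocess

VALID_KINDS = ("spam", "ham")  # the two learn types named in the comment above


def spamc_learn_command(kind):
    """Build the spamc command line for learning a message as spam or ham."""
    if kind not in VALID_KINDS:
        raise ValueError("kind must be 'spam' or 'ham', got %r" % (kind,))
    return ["spamc", "-L", kind]


def learn_file(path, kind, run=subprocess.run):
    """Feed one message file to spamd via spamc, with the message on stdin.

    `run` is injectable so the wrapper can be tried without a running
    spamd; by default it is subprocess.run. The result of `run` is
    returned unchanged for the caller to inspect.
    """
    command = spamc_learn_command(kind)
    with open(path, "rb") as message:
        return run(command, stdin=message)
```

Because spamc only forwards the message, none of the Bayes code is loaded in the calling process.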

> If spamc can -- without itself loading the
> Bayesian functionality --

Look at the code for yourself: spamc doesn't know anything about any Perl modules.

> tell spamd to process yet another file (as either
> spam or ham), that will solve a big part of the problem.
> 
> One'd still need a daemon, but it can be as simple as inotifyd...

Not even really that. I do this using spamc from a shell script that runs from cron periodically and figures out what to have spamc pass by maintaining a 'last run' flag file and using find's '-newer' directive. I can't share that code because it was written for hire, but the basic concept is Not Hard. Yes, you'll get better performance (probably) with inotify/kqueue in a daemon, but it's not really a heavy task at all to identify new files and feed them to spamc. 
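
The basic concept described above, a 'last run' flag file plus a newer-than test, can be sketched in Python as follows. This is not the script mentioned in the comment, which was not shared: the function names are hypothetical, only a single directory is scanned (find would recurse), and the flag file is assumed to live outside the scanned folder.

```python
import os


def files_newer_than(folder, flag_file):
    """Return regular files in folder modified after flag_file.

    Mirrors the idea of find's -newer test for one directory level.
    If the flag file does not exist yet, every file counts as new.
    """
    try:
        cutoff = os.stat(flag_file).st_mtime_ns
    except FileNotFoundError:
        cutoff = None
    newer = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            if cutoff is None or entry.stat().st_mtime_ns > cutoff:
                newer.append(entry.path)
    return sorted(newer)


def touch(flag_file):
    """Create the flag file or update its modification time to now."""
    with open(flag_file, "a"):
        pass
    os.utime(flag_file, None)
```

Each returned path would then be handed to spamc, and `touch` called afterwards. A message arriving between the scan and the touch would be missed, so a more careful version would record the new timestamp before scanning.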

Also: the real reason to avoid re-submitting messages to spamd for training is not that you'll skew the data but only that you're wasting the Bayes subsystem's effort in noticing that it has seen the message before. 
 

FWIW, I am mildly negative on adding this functionality into SpamAssassin itself. It's feature bloat and scope creep. It would invite and take ownership of a whole new class of integration problems that we don't have the aggregate attention to provide support for.
Comment 10 Henrik Krohns 2021-04-08 06:45:24 UTC
Agree with Bill, there's nothing to add here. It's not spamc or spamd's job to monitor anything. You can feed learnable stuff to spamc per your own mechanisms.