SA Bugzilla – Bug 4915
RFE: Distributed mass-check
Last modified: 2006-09-08 10:49:06 UTC
This was a suggested idea for the Google Summer of Code 2006; I'm adding it to the bugzilla for future use, and in case anyone feels like implementing it.

Subject ID: spamassassin-distributed-mass-check
Keywords: corpora, perl
Description: mass-check currently makes use of a single system to process a number of messages. However, in larger organizations, or for people with multiple machines, it would be nice if multiple machines could all process a single mass-check run, preferably without needing to share the same filesystem, paths, etc. It would also be useful if we ended up with a single large corpus (see the spamassassin-corpus project above), so that multiple people could run the messages through over the Internet.
Possible Mentors: Theo Van Dinter (felicity -at- apache.org)
I'm working on implementing this in a branch in my spare time. It's still mostly floating around in my head, though I think I know what I'm looking for in an implementation. If/when I get the chance to write it out, I'll put it here in the ticket. :)
Interesting! How are you planning to distribute the workload and the scanned messages? ssh? A grid-based system? If you want to get complex, distributed work-queues are nice, and I'd be very happy to add that support to IPC::DirQueue ;)
(In reply to comment #2)
> how are you planning to distribute the workload and scanned messages? ssh? a
> grid-based system?

At the moment I'm planning on HTTP, in much the same way that our current "-j #" method works: the mass-check client connects to the mass-check server and makes a request ("give me at most X messages"); the server reads the messages in from disk and sends out a tar/gz archive of them in file format. The client then runs over them in normal mass-check mode, gathers all the results, then connects back to the server to hand over the results and request more work.

Somewhere along the line, I'm going to have the client dynamically adjust the "max" number based on the amount of time needed to process a message (i.e., the client wants to connect to the server roughly once a minute). I also need to add support in the server to track which messages were handed out, and to re-hand them out if some time limit passes without seeing a result.

There are several issues I haven't figured out how to deal with yet, so I'm leaving them for now: making sure that all of the clients run the same version with the same (as appropriate) modules, conf files, plugins, etc.; how to make this work with Bayes; a way to abort if not all the messages were processed; etc. This feels like reinventing the wheel, btw, but I don't know of a module that does what I want here.

> If you want to get complex, distributed work-queues are nice, and I'd be very
> happy to add that support to IPC::DirQueue ;)

I haven't looked too closely at IPC::DirQueue, but based on what I've read I don't think it fits this completely, though feel free to correct me. :) I'm planning on non-long-lived connections, the ability to communicate through a proxy, etc.
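The two mechanisms described above (a client that sizes its requests so each round takes about a minute, and a server that tracks handed-out messages and re-hands them out after a timeout) could be sketched roughly like this. This is a hypothetical illustration in Python, not the actual mass-check code; every name here (`next_batch_size`, `WorkServer`, the 60-second target, the 600-second lease) is invented for the sketch:

```python
import time

TARGET_ROUND_SECONDS = 60.0  # client aims to contact the server about once a minute

def next_batch_size(current_size, elapsed_seconds, max_size=500):
    """Scale the next request so a round takes roughly TARGET_ROUND_SECONDS."""
    if elapsed_seconds <= 0 or current_size <= 0:
        return current_size
    per_message = elapsed_seconds / current_size
    proposed = int(TARGET_ROUND_SECONDS / per_message)
    # clamp so one odd round doesn't swing the batch size wildly
    return max(1, min(proposed, max_size))

class WorkServer:
    """Tracks which messages were handed out; re-hands them out if no
    result arrives within lease_seconds."""

    def __init__(self, message_ids, lease_seconds=600):
        self.pending = list(message_ids)  # not yet handed out (or reclaimed)
        self.leased = {}                  # msg_id -> time it was handed out
        self.done = set()
        self.lease_seconds = lease_seconds

    def checkout(self, max_messages, now=None):
        """Hand out up to max_messages, reclaiming expired leases first."""
        now = time.time() if now is None else now
        for msg_id, handed_out_at in list(self.leased.items()):
            if now - handed_out_at > self.lease_seconds:
                del self.leased[msg_id]
                self.pending.append(msg_id)
        batch = self.pending[:max_messages]
        del self.pending[:max_messages]
        for msg_id in batch:
            self.leased[msg_id] = now
        return batch

    def report(self, msg_id):
        """Record a completed result and release its lease."""
        self.leased.pop(msg_id, None)
        self.done.add(msg_id)
```

For example, a client that processed 10 messages in 30 seconds would ask for 20 next time (3 s/message, targeting a 60-second round), and a message whose lease expires without a result simply goes back into the pending pool for the next client that asks.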
(In reply to comment #3)
> There are several issues I haven't figured out how to deal with, so I'm leaving
[...]
> with bayes. A way to abort if not all the messages were processed? etc.

There's also the issue of ordered runs (i.e., not using -n). Message ordering isn't guaranteed in the current code, and it's going to be much more difficult in a client/server model. "Happily", ordered runs typically only happen when using Bayes, which as mentioned before doesn't work in this model, so...
OK, this has generally been implemented in 3.2/trunk. I think it still needs some work around the edges, but it's good enough for me to close the ticket. :)