SA Bugzilla – Bug 4915
RFE: Distributed mass-check
Last modified: 2006-09-08 10:49:06 UTC
This was a suggested idea for the Google Summer of Code 2006; I'm adding it to the bugzilla for future use, and in case anyone feels like implementing it.

Subject ID: spamassassin-distributed-mass-check
Keywords: corpora, perl
Description: mass-check currently makes use of a single system to process a number of messages. However, in larger organizations, or for people with multiple machines, it would be nice if multiple machines could all process a single mass-check run, preferably without needing to share the same filesystem, paths, etc. It would also be useful if we ended up with a single large corpus (see the spamassassin-corpus project above), so that multiple people could run the messages through over the Internet.
Possible Mentors: Theo Van Dinter (felicity -at- apache.org)
I'm working on implementing this in a branch in my spare time. It's still mostly floating around in my head, though I think I know what I'm looking for in an implementation. If/when I get the chance to write it out, I'll put it here in the ticket. :)
Interesting! How are you planning to distribute the workload and the scanned messages? ssh? A grid-based system? If you want to get complex, distributed work-queues are nice, and I'd be very happy to add that support to IPC::DirQueue ;)
(In reply to comment #2)
> how are you planning to distribute the workload and scanned messages? ssh? a
> grid-based system?

At the moment I'm planning on HTTP, in much the same way that our current "-j #" method works: the mass-check client connects to the mass-check server and makes a request ("give me at most X messages"); the server reads the messages in from disk and sends out a tar/gz archive of them in file format. The client then runs over them in normal mass-check mode, gathers all the results, then connects back to the server to hand over the results and request more work.

Somewhere along the line, I'm going to have the client dynamically adjust the "max" number based on the amount of time needed to process a message (i.e., the client wants to connect to the server roughly once a minute). I also need to add support in the server to track which messages were handed out, and to re-hand them out if some time limit passes without seeing a result.

There are several issues I haven't figured out how to deal with yet, so I'm leaving them for now: making sure that all of the clients run the same version with the same (as appropriate) modules, conf files, plugins, etc.; how to make this work with Bayes; a way to abort if not all the messages were processed; etc. This feels like reinventing the wheel, btw, but I don't know of a module that does what I want here.

> If you want to get complex, distributed work-queues are nice, and I'd be very
> happy to add that support to IPC::DirQueue ;)

I haven't looked too closely at IPC::DirQueue, but based on what I've read I don't think it fits this completely, though feel free to correct me. :) I'm planning on non-long-lived connections, the ability to communicate through a proxy, etc.
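The two mechanisms described above (a client that sizes its requests so each round takes about a minute, and a server that tracks handed-out messages and re-hands them out after a timeout) could be sketched roughly like this. This is a hypothetical illustration in Python, not the actual mass-check code; every name here (`next_batch_size`, `WorkServer`, the 60-second target, the 600-second lease) is invented for the sketch:

```python
import time

TARGET_ROUND_SECONDS = 60.0  # client aims to contact the server about once a minute

def next_batch_size(current_size, elapsed_seconds, max_size=500):
    """Scale the next request so a round takes roughly TARGET_ROUND_SECONDS."""
    if elapsed_seconds <= 0 or current_size <= 0:
        return current_size
    per_message = elapsed_seconds / current_size
    proposed = int(TARGET_ROUND_SECONDS / per_message)
    # clamp so one odd round doesn't swing the batch size wildly
    return max(1, min(proposed, max_size))

class WorkServer:
    """Tracks which messages were handed out; re-hands them out if no
    result arrives within lease_seconds."""

    def __init__(self, message_ids, lease_seconds=600):
        self.pending = list(message_ids)  # not yet handed out (or reclaimed)
        self.leased = {}                  # msg_id -> time it was handed out
        self.done = set()
        self.lease_seconds = lease_seconds

    def checkout(self, max_messages, now=None):
        """Hand out up to max_messages, reclaiming expired leases first."""
        now = time.time() if now is None else now
        for msg_id, handed_out_at in list(self.leased.items()):
            if now - handed_out_at > self.lease_seconds:
                del self.leased[msg_id]
                self.pending.append(msg_id)
        batch = self.pending[:max_messages]
        del self.pending[:max_messages]
        for msg_id in batch:
            self.leased[msg_id] = now
        return batch

    def report(self, msg_id):
        """Record a completed result and release its lease."""
        self.leased.pop(msg_id, None)
        self.done.add(msg_id)
```

For example, a client that processed 10 messages in 30 seconds would ask for 20 next time (3 s/message, targeting a 60-second round), and a message whose lease expires without a result simply goes back into the pending pool for the next client that asks.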
(In reply to comment #3)
> There are several issues I haven't figured out how to deal with, so I'm leaving
[...]
> with bayes. A way to abort if not all the messages were processed? etc.

There's also the issue of ordered runs (i.e., not using -n). Message ordering isn't guaranteed in the current code, and it's going to be much more difficult in a client/server model. "Happily", ordered runs typically only happen when using Bayes, which as mentioned before doesn't work in this model, so...
OK, this has generally been implemented in 3.2/trunk. I think it still needs some work around the edges, but it's good enough for me to close the ticket. :)