SA Bugzilla – Full Text Bug Listing
Summary: | [patch] make SA work with multiple pyzor servers | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | John Hein <nm4zejxa5j> |
Component: | Plugins | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | enhancement | ||
Priority: | P5 | ||
Version: | 3.1.6 | ||
Target Milestone: | 3.2.0 | ||
Hardware: | Other | ||
OS: | other | ||
Whiteboard: | |||
Attachments:
patch for Pyzor.pm to pick max score from multiple pyzor servers
Multi-server support (cumulative) patch for Pyzor.pm
sum results from multiple pyzor servers
Description
John Hein
2006-10-24 12:05:43 UTC
Created attachment 3726 [details]
patch for Pyzor.pm to pick max score from multiple pyzor servers
Created attachment 3732 [details]
Multi-server support (cumulative) patch for Pyzor.pm
This is my version of the patch, which uses multiple servers in cumulative score
mode.
It is IMHO preferred, as pyzor servers are _not_ synced, and if one is used more
than the others, the others are effectively ignored with the original patch.
Any user can modify pyzor_max depending on the perceived usage of the secondary
server, thus accounting for increased pyzor scores.
It also resets $pyzor_count to 0 if the message is whitelisted, as the original
pyzor plugin code does (to stop false/invalid reports).
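The difference between the two attachments (pick the max vs. sum the counts) can be sketched in a few lines. This is an illustration in Python, not the Perl from either patch. The function name `combine_counts`, its `mode` parameter, and the tab-separated shape assumed for each `pyzor check` response line (server, status, report count, whitelist count) are assumptions made for the example.

```python
import re

# Assumed shape of one `pyzor check` response line (one line per server):
#   host:port <TAB> (code, 'message') <TAB> report_count <TAB> whitelist_count
LINE_RE = re.compile(r"^\S+\t.*?\t(\d+)\t(\d+)\s*$")


def combine_counts(lines, mode="sum"):
    """Combine per-server report counts into a single count.

    mode="sum" adds the counts from all servers (cumulative approach);
    mode="max" keeps only the largest count (max approach).
    Any whitelist hit resets the result to 0.
    """
    counts = []
    whitelisted = False
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a recognizable server response, skip it
        counts.append(int(match.group(1)))
        if int(match.group(2)) > 0:
            whitelisted = True
    if whitelisted or not counts:
        return 0
    return sum(counts) if mode == "sum" else max(counts)
```

With two hypothetical responses reporting 3 and 4 hits and no whitelist entries, `mode="sum"` gives 7 and `mode="max"` gives 4, so a pyzor_max threshold tuned for one server would need adjusting under the cumulative approach.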
Yep. That is one of the "different algorithms" I thought about. It's hard to say which is better. In the current state of affairs (two known public servers which don't sync with each other, one which seldom responds), the "sum" approach is probably better. I'll leave it at that.

+1

btw, I'm not a fan of the "sum" approach. Pyzor is already too time-consuming as a rule; doubling the potential lookup time is not a good thing. I'd prefer to apply John's patch -- but is it worth doing this, given this issue:

'It is IMHO preferred, as pyzor servers are _not_ synced, and if one is used more than the others, the others are effectively ignored with the original patch.'

Neither patch should consume more time. They are both designed to accommodate the situation where you have two servers listed in ~/.pyzor/servers. The pyzor software will try to contact both servers regardless.

No, the pyzor servers are not sync'd at this point, although there is some work going on to do that (see pyzor-users list discussions in the past few months). Once that is in place, it reduces the need for having more than one server in your pyzor config. In the meantime, this patch is useful, particularly since the "official" server retrieved via 'pyzor discover' is not reliable. The timeouts when talking to that server are probably the main reason pyzor is "time-consuming". You can lower that timeout in the pyzor config (I have done so since it almost always seems to respond within a second or two or not at all - the five second default is probably too high).

So to address the "sum vs. max (or even avg)" behavior of the patch... I don't think it matters in terms of time. Summing makes the most sense if most people are reporting to only one server. Max or averaging would make the most sense if most people are reporting to all servers. At this point, I would think that we should start out summing. Then as more people report to N servers, switch to max or avg. Or make it configurable (probably not worth the effort).

Of course, it's hard to tell if most people are reporting to just one server or not. However, with the current implementation in SA, if you report to more than one, SA gets confused. So, I would guess that most SA users either use one server or never get scores from pyzor. So assuming that most SA users use one pyzor server would be the safe bet for now. And that's why, if I had to choose, I would pick the summing method at this time (and maybe put some explanatory comments in the code).

It shouldn't affect whether pyzor takes more time or not. But you may want to lower the default timeout for pyzor. As I said above, 2 seconds is probably fine, but that's subjective - I haven't run a scientific test.

If you want a better patch (comments, lower default timeout) and don't want to do much work beyond review/commit, let me know.

ok. if it doesn't affect runtime, that's good.

'If you want a better patch (comments, lower default timeout) and don't want to do much work beyond review/commit, let me know.'

that would be great. ;)

p.s. I know I recommended that we raise the default timeout in my original description. So part of my last message might seem contradictory.

I'd say it's best to have a default pyzor timeout (in the pyzor config, not SA) of, say, 2 or 3 seconds. This timeout is per server. So in order not to short circuit that timeout, it's best for the pyzor_timeout in SA to be something like N * pyzor's timeout + 1 (N because the pyzor lookups are not threaded, but end to end). Unfortunately, pyzor code doesn't take this config setting on the command line. So SA has no way of knowing pyzor's timeout without reading the pyzor config.

Maybe it's just best to document a recommended setup. My recommendation would be 2 or 3 seconds for the pyzor (not SA) timeout and for SA's pyzor_timeout, use 3 or 4 seconds (and document how that interacts with pyzor's config). This would accommodate one pyzor server timing out. Document that bumping up the overall SA pyzor_timeout might be useful the more pyzor servers you use.

What we really want is to check 1 or 2 very reliable pyzor servers and report to as many as possible. You could do this with command line args to pyzor pointing it at different config/server files. But now we're beyond the intent of my original patch (to just get SA working) and may be overcome by events when people get together and sync pyzor servers (a la DCC perhaps).

jm wrote:
> that would be great. ;)

Okay. I'll come up with a new patch that uses the summing method and adds comments/docs. Gimme a day. Then you can review / clean up style / commit. Thanks for shepherding this.

Created attachment 3766 [details]
sum results from multiple pyzor servers
Okay... maybe a couple days.
This patch does the following:
- sum results from responses from all pyzor servers
- document timeout better (including refs to Pyzor's own timeout)
- lower overall Pyzor timeout a bit - my testing shows that a pyzor
server responds in less than a second (on a T1) or not at all.
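The timeout interaction discussed in this thread (pyzor's own per-server timeout versus SA's overall pyzor_timeout, with lookups done one after another) can be worked through numerically. A rough sketch in Python follows. The function names and the 0.5 second latency assumed for a responsive server are illustrative assumptions, not part of the patch.

```python
def worst_case_timeout(num_servers, per_server_timeout, slack=1.0):
    """SA-side timeout that never cuts pyzor short: N * per-server timeout + slack."""
    return num_servers * per_server_timeout + slack


def elapsed_time(per_server_timeout, servers_timing_out, servers_responding,
                 fast_latency=0.5):
    """Total wall-clock time when the servers are queried in series.

    fast_latency is an assumed response time for a healthy server.
    """
    return (servers_timing_out * per_server_timeout
            + servers_responding * fast_latency)


# Two servers, pyzor per-server timeout of 2 seconds.
print(worst_case_timeout(2, 2.0))

# One server times out and the other answers quickly:
# does the lookup finish within a 3.5 second overall timeout?
print(elapsed_time(2.0, 1, 1) <= 3.5)

# Both servers time out: does it still fit?
print(elapsed_time(2.0, 2, 0) <= 3.5)
```

Under these assumptions, an overall timeout of 3 to 4 seconds covers one unresponsive server out of two but not both, which is the trade-off the recommended settings accept.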
looks good: will apply this later, when I get more tuits.

so, just to check -- the pyzor script will check multiple pyzor servers *in parallel*, right? not in serial? just want to make sure I have the facts straight. ;)

The patch will not enforce whether Pyzor looks up the servers in series or in parallel. Pyzor itself is where that is done. And as of the version I have (using FreeBSD pkg rev pyzor-0.4.0_4), it is doing the lookups in series.

It's up to the user to put more than one server in his .pyzor/servers file. If he does so and multiple servers are regularly not very responsive, he should reconsider the pyzor server config.

This patch just keeps SA from not working at all if there is more than one response from 'pyzor check'. Before this patch, you could get two or more good responses in less than a second, but the SA plugin would not understand the response(s), throw up its hands and use _none_ of the legit responses.

IMO, since most people are probably using the official 'pyzor discover' published server (66.250.40.33), it will typically be just as slow whether you add another server or not. Especially with the default timeout settings (5 seconds per server for pyzor, and 5 seconds total for SA).

I have these two servers in my config:

82.94.255.100:24441
66.250.40.33:24441

The former is almost always responsive (that could change of course). If both respond when doing 'pyzor check', they do so typically in a total of half a second or less. If any one is not responding, you will hit timeouts. As I said, I think it's better to set the per-server timeout lower in pyzor and the overall SA timeout just a bit larger.

This will all probably be much better if and when server syncing is added and the world gets a few more servers.

We could add to the docs that I updated something like: "As of this writing, Pyzor communicates with servers in series, not in parallel." If I find a round tuit, I'll send it along.

ok, applied to 3.2.0

: jm 252...; svnc "bug 5148: fix Pyzor plugin to support lookups against multiple servers, summing their results. improve Pyzor docs regarding timeouts, and lower the default timeout to 3.5 seconds. thanks to John Hein <jhein at timing.com>" lib/Mail/SpamAssassin/Plugin/Pyzor.pm
Sending lib/Mail/SpamAssassin/Plugin/Pyzor.pm
Transmitting file data .
Committed revision 485624.