SA Bugzilla – Bug 4900
Persistent blacklist not really persistent
Last modified: 2015-07-29 17:14:40 UTC
When using the command "spamassassin --add-addr-to-blacklist=addr", which is listed in the -h output as "Add addr to persistent address blacklist", one would expect the address to get a permanent extra score value. However, in reality the score (rapidly) decreases with each scanned message. Initially it is 50, but after 5 messages from the same source it has already decreased to 8.7. It looks like the blacklist entry is not really "persistent", but is averaged with the score like an automatic whitelist entry.
Same in 3.1.2.
Nobody else who can confirm this?
(In reply to comment #1)
> Same in 3.1.2
> Nobody else who can confirm this?

It's a documentation issue. The entry is actually persistent, but the BL value isn't since the entry goes into the AWL. A "persistent blacklist" in your context requires a "blacklist_*" config entry.
Thanks. Unfortunately, keeping the blacklist in a configuration file is not very practical, as spamd needs to be reloaded after each modification. (There are also other reasons why I don't really like this solution.) Maybe the AWL could be extended so that a "persistent" entry could be recognized, either by an extra flag or by a magic value in the value field? It looks like they are inserted at value 50. How about not averaging the current score into this value when the entry is at value 50?
Bug 6032, Ian Turner 2008-12-18:

These commands (--add-addr-to-blacklist and --add-addr-to-whitelist) just set the AWL database with count=1 and totscore equal to plus or minus 100. But since the count is set to 1, the average score quickly returns to the mean typical for messages from this source.

For example, consider a message sender whitehat@example.com, whose messages are flagged as spam with score 30. Assume the system is configured with a spam threshold of 10. Finally, assume an administrator runs spamassassin --add-addr-to-whitelist=whitehat@example.com and that several messages are then received from this source. We will see the following behaviour:

Message  pre-AWL  post-AWL  count  totscore  Message accepted?
         score    score
                            1      -100
1        30       -35       2      -70       TRUE
2        30       -20       3      -40       TRUE
3        30       -5        4      -10       TRUE
4        30       10        5      20        FALSE
5        30       25        6      50        FALSE

As it turns out, the --add-addr-to-whitelist command was only good for three messages. In my opinion, the easy way to fix this bug is to set totscore=100000 and score=1000, but the best way is probably to have a scheme where if count is null then totscore is never changed, and to set count=null when adding known addresses.
*** Bug 6032 has been marked as a duplicate of this bug. ***
Just for reference, the *only* way to get a truly persistent blacklist in SpamAssassin is to use the blacklist_* configuration options. Period.

You can tweak the AWL score to any arbitrarily large value you want, and it is going to get diluted rapidly because the AWL is a score averaging system, it is *NOT* a whitelist or a blacklist (although it can have these effects, it's really just borrowing score from past reputation).

Perhaps we need to change these command-line options to:

--increase-awl-for-addr
--decrease-awl-for-addr

to avoid confusion. The existing options just reinforce the myth that the AWL is suitable for use as a hard whitelist/blacklist system, which it really isn't.
Matt, there's no reason you couldn't tweak the auto white list to get persistent blacklist or whitelist behavior using either of the changes which I suggested above. It's true that right now the AWL is not suitable as a persistent white/black list, but it's false to say that it couldn't be.
(In reply to comment #6)
> Just for reference, the *only* way to get a truly persistent blacklist in
> Spamassassin is to use the blacklist_* configuration options. Period.

Yes, but that is precisely what should be changed. When a certain address has been determined as a spam source, it should be possible to add it to the AWL and permanently blacklist it. Or when it is determined to be an address from which one wants to receive everything, it should stay whitelisted when added to the AWL.

This is because scripts can be written to add an entry to the AWL, but one does not want to modify a configuration file from such a script. Besides, we are running spamd with a single shared configuration for all users and one would have to restart spamd after every configuration change, which is not required when solving it via the AWL.
I have wanted different behavior in AWL as well. I think the basic issue is that, in addition to the AWL (which I think works well for what it is), there should be database entries that can assign a score bias to some addresses and that do not get adjusted. These entries should perhaps be independent of AWL. So I'd be able to do

  spamassassin --assign-score user@example.com -5

and then there'd be a db entry matching that sender that would add -5 points. One could use -100 or +100. (But, with forgeable senders, -5 feels right to me to rescue legit mails that look spammy but not give a total pass.)

To implement this the AWL db could store the score in the score field and -1 in the count field, and then AWL would ignore the entry. But, AWL and persistent scores probably should both operate.
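The count = -1 idea above can be sketched in Python. This is only a toy model of the proposal, not SpamAssassin code: the `FIXED` sentinel, the `apply_entry` function and the dict-based `db` are all hypothetical names for illustration, and the averaging branch assumes the default auto_whitelist_factor of 0.5. It covers only the sentinel case and does not attempt to let a fixed bias and the normal averaging operate together for the same sender.

```python
FIXED = -1  # hypothetical sentinel in the count field marking a fixed-score entry


def apply_entry(db, sender, pre_score, factor=0.5):
    """Return the adjusted score for `sender` and update `db` in place.

    `db` maps sender -> [count, totscore]. It is a toy stand-in for the
    AWL database, not SpamAssassin's real storage or API.
    """
    entry = db.get(sender)
    if entry is None:
        # First sighting: nothing to average against yet.
        db[sender] = [1, pre_score]
        return pre_score
    count, total = entry
    if count == FIXED:
        # Fixed bias: add the stored offset and never modify the entry.
        return pre_score + total
    # Ordinary AWL-style entry: pull the score toward the historical mean,
    # then fold the pre-adjustment score into the history.
    mean = total / count
    entry[0] += 1
    entry[1] += pre_score
    return pre_score + (mean - pre_score) * factor
```

With `db = {"user@example.com": [FIXED, -5]}`, every call for that sender subtracts 5 points and leaves the entry unchanged, no matter how many messages are scanned.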
(In reply to comment #9)
>
> spamassassin --assign-score user@example.com -5

and/or

spamassassin --assign-score-auth user@example.com -5

This could also be reasonable shortcut fodder. If the fixed score has been set to (for example) -500, it's unlikely any combination of spam rules will push it back above zero.
GTUBE is scored at +1000 points, so AWL whitelist offset should probably be more. However, we are now drifting outside the scope of this bug.
(In reply to comment #4 / Bug 6032, Ian Turner 2008-12-18)
> For example, consider a message sender whitehat@example.com, whose messages
> are flagged as spam with score 30. Assume the system is configured with a
> spam threshold of 10. Finally, assume an administrator runs spamassassin
> --add-addr-to-whitelist=whitehat@example.com and that several messages are
> then received from this source. We will see the following behaviour:
>
> Message  pre-AWL  post-AWL  count  totscore  Message accepted?
>          score    score
>                             1      -100
> 1        30       -35       2      -70       TRUE
> 2        30       -20       3      -40       TRUE
> 3        30       -5        4      -10       TRUE
> 4        30       10        5      20        FALSE
> 5        30       25        6      50        FALSE
>
> As it turns out, the --add-addr-to-whitelist command was only good for
> three messages.

I ran this test on vanilla 3.3.0 and 3.3.1 installs to verify; my numbers differed a bit (for the worse!). I tried it at three different auto_whitelist_factor values: the default of 0.5 (implicit and explicit, both were the same), 0.75, and 1.0.

Tests were performed using a vanilla email with a custom rule assigning 30 points to that email's Message-ID string. I trained SA via `spamassassin -W <test.eml`, then successively scanned the email with `spamassassin -D auto-whitelist <test.eml |grep score:`

The only value that changes with the auto_whitelist_factor is the post-AWL score. This is also the only difference my results have with Ian's in comment 4, so I'm only presenting the post-AWL scores at the three factors I tested. My results:

         ------- AWL factor -------
Message  0.5      0.75     1.0
1        -35      -67.5    -100
2        -2.5     -18.75   -35
3        8.333    -2.5     -13.333
4        13.75    5.625    -2.5
5        17       10.5     4
6        19.167   13.75    8.333
7        20.714   16.071   11.429
8        21.875   17.812   13.75
9        22.778   19.167   15.556

At a factor of 1.0, AWL brings the score to the previous average as specified in the documentation, which is handy for checking the math. Like Ian's results, my test at factor=0.5 results in the sender getting flagged as spam on the third email following a whitelist training. Even at factor=1.0, there are only five emails in the clear.

Here's another view of the issue, fixed at AWL factor 0.5 but with varying initial scores and learning as ham or spam:

         ---------------- initial score -----------------
Message  ham@30   ham@20  ham@10  spam@0   spam@-5  spam@-10
1        -35      -40     -45     50       47.5     45
2        -2.5     -10     -17.5   25       21.3     17.5
3        **8.3**  0       -8.3    16.7     12.5     8.3
4        13.8     **5**   -3.8    12.5     8.1      **3.8**
5        17       8       -1      10       5.5      1
6        19.2     10      0.8     8.3      **3.8**  -0.8
7        20.7     11.4    2.1     7.1      2.5      -2.1
8        21.9     12.5    3.1     6.3      1.6      -3.1
9        22.8     13.3    3.9     5.6      0.8      -3.9
10       23.5     14      4.5     5        0.3      -4.5
11       24.1     14.5    **5**   **4.5**  -0.2     -5

The turnover counts are the notable thresholds here. A ham scoring 30 bounces back to getting marked as spam on the third message. Ham at 20 takes just one more. Ham at 10 turns over on the 11th message. I didn't put a 5 point ham on the chart, but it's fine for quite a while (it hits 4.0 on the 53rd message and 4.5 on the 105th).

On the spam side, a spam that somehow gets to -10 evades detection on its fourth message. A spam at -5 returns to the inbox on the sixth. A zero-scoring spam is snuffed for ten iterations, returning on the 11th. Not on the chart, a spam scoring 2 comes back on the 17th message and a spam scoring 4.5 dips under 6 on its 32nd, under 5.5 after 48, and gets out of jail on its 94th.

Method: change the value of my local rule and then run:

  spamassassin --add-to-blacklist <~/Mail/test.eml >/dev/null; for a in `seq 1 105`; do spamassassin -D auto-whitelist <~/Mail/test.eml 2>&1 |sed -re '/.*post.*score: /!d' -e "s// $a\t/"; done

(or swap --add-to-blacklist with -W)
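The post-AWL figures in this comment follow a simple running-average rule, which can be reproduced in a few lines of Python. This is a sketch of the arithmetic as inferred from the numbers reported here (mean = totscore / count, post = pre + (mean - pre) * factor, then the pre-AWL score is added to the history); it is not SpamAssassin's own code, and `simulate_awl` and its parameter names are made up for illustration.

```python
def simulate_awl(pre_score, seed_total, seed_count=1, factor=0.5, messages=9):
    """Return the post-AWL score of each successive message.

    Assumed model, inferred from the figures in this thread:
      mean = totscore / count
      post = pre + (mean - pre) * factor
    after which the pre-AWL score is added to the sender's history.
    """
    total, count = float(seed_total), seed_count
    results = []
    for _ in range(messages):
        mean = total / count
        results.append(pre_score + (mean - pre_score) * factor)
        total += pre_score  # history stores the pre-AWL score
        count += 1
    return results


if __name__ == "__main__":
    # A 30-point message scanned repeatedly after whitelist training
    # (seed: count=1, totscore=-100), at the three factors tested above.
    for factor in (0.5, 0.75, 1.0):
        scores = simulate_awl(30, -100, factor=factor)
        print(factor, [round(s, 3) for s in scores])
```

Passing `seed_total=100` with a low `pre_score` models the blacklist columns of the second table in the same way.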
Greg proposed in comment 9:
> spamassassin --assign-score user@example.com -5
>
> and then there'd be a db entry matching that sender that would add -5
> points. One could use -100 or +100. (But, with forgeable senders, -5
> feels right to me to rescue legit mails that look spammy but not give
> a total pass.)
>
> To implement this the AWL db could store the score in the scorefield
> and -1 in the count field, and then AWL would ignore the entry. But,
> AWL and persistent scores probably should both operate.

I like the idea, but I'm not sure if that's exactly the right approach; that's too much choice in the hand of the user. I'd like either for the AWL plugin to use the old score to calculate what base score is needed to require 100 subsequent emails before reverting to flag/no-flag, or a larger base count (which I think is MUCH simpler and also far more statistically sound). Also, AWL and persistent scores are synonymous :-)

As to John's addition in comment 10:
> and/or
>
> spamassassin --assign-score-auth user@example.com -5

Doesn't AWL already keep track of authentication? I guess that's useful for manual training of addresses, but it would be automatic when using --add-to-[white|black]list

A lesson I learned when playing with greylisting: The email address is more trouble than it's worth unless you're dealing with a heavyweight like hotmail (we can already recognize the freemailers, which is probably good enough). Otherwise, all that matters is the IP address, which can't be forged. To simplify matters, I'd assume any IP that satisfies SPF should be treated the same.

Rethinking the way AWL (and traditional white/blacklisting) deals with authentication is a good exercise, but it is not related to THIS bug.

> This could also be reasonable shortcut fodder. If the fixed score has
> been set to (for example) -500, it's unlikely any combination of spam
> rules will push it back above zero.

No dice. Here's what that does:

1    -235.000
2    -102.500
3    -58.333
4    -36.250
5    -23.000
6    -14.167
7    -7.857
8    -3.125
9    0.556
10   3.500
11   5.909

I artificially set the AWL score to -500 (wipe with -R, alter the custom rule to get it to -500, run once, then move the custom rule back to 30 and loop through successive scans as noted in my last post, comment 12). As you can see, this only saves ten messages.

How about we set the AWL totscore to -100 and the count to 10 rather than 1?

1    -35.000
2    -29.091
3    -24.167
...
6    -13.333
9    -6.111
12   -0.952
13   0.455
14   1.739
15   2.917
16   4.000
17   5.000
18   5.926

Not enough in my book. I want to at least affect the first 50, ideally the first hundred or even the first 365. 30 is a high score, but I think it's an admirable start. Here are the results for a base 100x -100 in the AWL before scanning a 30-point email:

1    -35.000
2    -34.356
3    -33.725
...
10   -29.633
20   -24.622
30   -20.388
40   -16.763
50   -13.624
60   -10.881
70   -8.462
80   -6.313
90   -4.392
100  -2.663

My proposal is a much more "persistent" setting than just cranking the magnitude on the swing. Plus, it should amount to about a single line of code.

(In reply to Ian's comment #11)
> GTUBE is scored at +1000 points, so AWL whitelist offset should probably be
> more. However, we are now drifting outside the scope of this bug.

We're not supposed to "protect" against GTUBE. It's scored at 1000 points /specifically/ to bypass things like this.
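The effect of a larger seed count can be checked with a short Python sketch. It is an illustration built on assumptions, not SpamAssassin code. The figures in this comment are reproduced when the seed is treated as `seed_count` history entries averaging -100 each (that is, totscore = -100 × count), so that is the assumption made here. The function name `first_flagged_message` is invented, and the threshold of 5.0 is the stock required_score, matching the turnover points discussed in this thread.

```python
def first_flagged_message(pre_score, seed_mean, seed_count,
                          threshold=5.0, factor=0.5, limit=100000):
    """Return the 1-based index of the first message whose post-AWL score
    reaches `threshold`, or None if that does not happen within `limit`.

    Assumption: the seed is `seed_count` history entries averaging
    `seed_mean`, i.e. totscore = seed_mean * seed_count.
    """
    total = float(seed_mean * seed_count)
    count = seed_count
    for n in range(1, limit + 1):
        mean = total / count
        post = pre_score + (mean - pre_score) * factor
        if post >= threshold:
            return n
        total += pre_score  # history stores the pre-AWL score
        count += 1
    return None


if __name__ == "__main__":
    # Compare whitelist seeds of 1, 10 and 100 entries for a 30-point ham.
    for seed_count in (1, 10, 100):
        n = first_flagged_message(30, -100, seed_count)
        print(f"seed count {seed_count}: flagged at message {n}")
```

Under these assumptions a seed count of 10 delays the turnover for a 30-point ham to the 17th message, matching the 5.000 entry in the second list above.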
(In reply to comment #13)
> As to John's addition in comment 10:
> > This could also be reasonable shortcut fodder. If the fixed score has
> > been set to (for example) -500, it's unlikely any combination of spam
> > rules will push it back above zero.
>
> No dice. Here's what that does:
>
> 1 -235.000
> 2 -102.500
> 3 -58.333
> 4 -36.250
> 5 -23.000
> 6 -14.167
> 7 -7.857
> 8 -3.125
> 9 0.556
> 10 3.500
> 11 5.909

That isn't shortcutting (bypassing scan) when the AWL average is very negative/positive, which is what I was suggesting.

Shortcutting on AWL totscore is admittedly a more intrusive change than simply tweaking the numbers the AWL data gets set to, but shortcutting would indeed make it a permanent white/blacklist, not just a "white/blacklist for the next month or so".

It would also break GTUBE, because a large negative AWL score that shortcut scanning would bypass the GTUBE rule, unless we provide something like a tflag for "this rule can't be bypassed by AWL shortcut" and apply it to GTUBE.

It would also break the "averager" part of AWL, absent such a tflag, but use of that flag reduces the permanence of the white/blacklisting.

GTUBE also should not count towards AWL score or you'd risk blacklisting someone who sends a GTUBE... Yep, rather intrusive.

> How about we set the AWL totscore to -100 and the count to 10 rather
> than 1?
>
> 1 -35.000
> 2 -29.091
> 3 -24.167
> ...
> 6 -13.333
> 9 -6.111
> 12 -0.952
> 13 0.455
> 14 1.739
> 15 2.917
> 16 4.000
> 17 5.000
> 18 5.926
>
> Not enough in my book.

Push the envelope harder. Try setting totscore to -950 (to pass GTUBE) and count to 500 or 1000.
(In reply to comment #14)
> That isn't shortcutting (bypassing scan) when the AWL average is very
> negative/positive, which is what I was suggesting.
>
> Shortcutting on AWL totscore is admittedly a more intrusive change than simply
> tweaking the numbers the AWL data gets set to, but shortcutting would indeed
> make it a permanent white/blacklist, not just a "white/blacklist for the next
> month or so".

Oh, you're talking about an actually persistent setting. Sounds like dangerous waters. I'd rather that stay in the config file as whitelist_from et al.

> It would also break GTUBE, because a large negative AWL score that shortcut
> scanning would bypass the GTUBE rule, unless we provide something like a tflag
> for "this rule can't be bypassed by AWL shortcut" and apply it to GTUBE.
>
> GTUBE also should not count towards AWL score or you'd risk blacklisting
> someone who sends a GTUBE...

Good point. The GTUBE rule already has tflags userconf and noautolearn. There is no tflag to bypass AWL, but if we move its priority so that it fires after AWL, that would do the trick, no?

> > How about we set the AWL totscore to -100 and the count to 10 rather
> > than 1?
>
> Push the envelope harder. Try setting totscore to -950 (to pass GTUBE) and
> count to 500 or 1000.

Maybe you missed the bottom of my last comment? I tested count=100 for a 30-point ham. After 100 subsequent emails, it still scored only -2.663. I didn't report it, but a trial run of count=100 against a 10-point ham took 1001 runs to hit 5 points.

I think count=100 and totscore=-100 is fine (or perhaps even too strong); we don't want to go overboard, otherwise there's no difference between AWL and actual white/black lists. We also don't want to be ruled by corner-cases like a 30-point ham.

Also noted at the bottom of my last comment, my take on GTUBE's intent is that it is scored 1000 specifically to override all negatives, especially including something like AWL whose data isn't locatable via grep. As long as we immunize AWL from it, I think it's irrelevant. GTUBE was designed to test spam flagging, not how to save a 1000-point ham.
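The 1001-run figure for a 10-point ham can be checked on paper. The derivation below is a sketch that assumes the averaging rule implied by the numbers in this thread: a seed of c history entries with mean m_0, a constant pre-AWL score s, factor f, and spam threshold T. The symbols are introduced here for illustration only.

```latex
% Mean seen by the n-th message, after n-1 earlier messages of score s:
\mu_n = \frac{c\,m_0 + (n-1)\,s}{c + n - 1}

% Post-AWL score of the n-th message:
p_n = s + f\,(\mu_n - s) = s - \frac{f\,c\,(s - m_0)}{c + n - 1}

% First message to reach the threshold (requires s > T):
p_n \ge T \iff n \ge 1 - c + \frac{f\,c\,(s - m_0)}{s - T}

% Example: c = 100,\ m_0 = -100,\ s = 10,\ T = 5,\ f = 0.5
n \ge 1 - 100 + \frac{0.5 \cdot 100 \cdot 110}{5} = 1001
```

The same expression gives n = 3 for a 30-point ham with the stock seed (c = 1), which agrees with the turnover reported in comment 12.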
(In reply to comment #15)
> > Push the envelope harder. Try setting totscore to -950 (to pass GTUBE)
> > and count to 500 or 1000.
>
> Maybe you missed the bottom of my last comment? I tested count=100
> for a 30-point ham. After 100 subsequent emails, it still scored
> only -2.663.

...I did fail to notice that detail. Sorry.

> we don't want to go overboard, otherwise there's no
> difference between AWL and actual white/black lists.

That's fair. Shortcutting on extreme AWL score was only a suggestion; it sounds like an unwise one.

> Also noted at the bottom of my last comment, my take on GTUBE's intent is that
> it is scored 1000 specifically to override all negatives, especially including
> something like AWL whose data isn't locatable via grep.

Right, that's why I included it in my comments.
(In reply to comment #15)
> Oh, you're talking about an actually persistent setting. Sounds like dangerous
> waters. I'd rather that stay in the config file as whitelist_from et al.

I don't understand how it can be a good idea to keep variable data in a config file. A config file should specify config settings, policies etc, but not a list of mail addresses. That it currently offers options like whitelist_from with a hardcoded list of mail addresses is bad. Such a config option should point to an external db that stores the list of addresses and that can be updated without reloading the program.

The AWL is such a db. Unfortunately it has hardwired averaging in it. What is needed is either a mod to the AWL so that it can store fixed score offsets (as suggested in other comments) or a new AWL-like facility that can store fixed blacklist and whitelist values. I think adding this to the AWL is better.
> I don't understand how it can be a good idea to keep variable data in a config
> file.

Wait... This whole bug is about making AWL over-rides persistent. So is this data variable, or persistent? Pick one.

If you want the AWL mods to be short-lived, the existing mechanism works.

If they need to be persistent, then a config-file mod (user_prefs, etc) shouldn't be a problem.

To be honest, this particular concept actually makes me oppose making any AWL extension that has a persistent effect. If it's really persistent, it belongs in the configuration file, where it can be easily seen, no?

Long-lasting score hacks muddy the waters of what the AWL is doing, and make it difficult to see that "oh yeah, I did an over-ride for that" unless you go dumping through the AWL database. We may end up getting lots of "why is my AWL being crazy" complaints if it gets rolled into the AWL.

I'd consider the mod if it got reported as a separate rule (AWL_OVERRIDE or some such), but I'm really leaning towards opposing it because the nature of the AWL is intended to be transitory, and the config files are intended to be persistent. Use each for their intended purposes.
(In reply to comment #18)
> > I don't understand how it can be a good idea to keep variable data in a config
> > file.
>
> Wait... This whole bug is about making AWL over-rides persistent. So is this
> data variable, or persistent? Pick one.

Are you trying to be clever?

The list of mail addresses to be blacklisted/whitelisted is of course variable in that more addresses could be added to it at any time.
Every address put in that list should stay there until the user removes it.

> If you want the AWL mods to be short-lived, the existing mechanism works.
>
> If they need to be persistent, then a config-file mod (user_prefs, etc)
> shouldn't be a problem.

Do you really think that everyday users of a spamfilter want to edit their configuration file and restart the service for every address they encounter that they want to put in a blacklist or whitelist?

I think this is completely silly.
A config file holds data that makes this installation different from the one nextdoor. Things like a local domain name, filesystem paths, policies, etc.
But not a list of addresses that is maintained daily.
(In reply to comment #19)
> (In reply to comment #18)
> > > I don't understand how it can be a good idea to keep variable data in a config
> > > file.
> >
> > Wait... This whole bug is about making AWL over-rides persistent. So is this
> > data variable, or persistent? Pick one.
>
> Are you trying to be clever?

If by "clever" you mean rude, no. I'm merely pointing out we've got a contradiction here that needs resolution. Is this data in question persistent, or dynamic in nature? How persistent or how dynamic is it? I'm being serious, because we can't argue both sides.

> The list of mail addresses to be blacklisted/whitelisted is of course variable
> in that more addresses could be added to it at any time.
> Every address put in that list should stay there until the user removes it.
>
> > If you want the AWL mods to be short-lived, the existing mechanism works.
> >
> > If they need to be persistent, then a config-file mod (user_prefs, etc)
> > shouldn't be a problem.
>
> Do you really think that everyday users of a spamfilter want to edit their
> configuration file and restart the service for every address they encounter
> that they want to put in a blacklist or whitelist?

Do you really think these same users are prepared to go dumping through the cryptic AWL database to find the addresses they previously whitelisted or blacklisted to remove them?

> I think this is completely silly.
> A config file holds data that makes this installation different from the one
> nextdoor. Things like a local domain name, filesystem paths, policies, etc.
> But not a list of addresses that is maintained daily.

If you are modifying your black or whitelist daily, something is *severely* wrong with your SpamAssassin install. I've made maybe 20 such entries in 5 years of running a server with site-wide configuration for 100 users. This may be a bit on the abnormally low side, but long-term black and white lists should be a measure of absolute last resort, not daily maintenance.
"Do you really think that everyday users of a spamfilter want to edit their configuration file and restart the service for every address they encounter that they want to put in a blacklist or whitelist?" One more side note: you do NOT need to restart the service when adding entries to user_prefs. That file is reparsed per-message. Only the site-wide configs (ie: base rules, local.cf, etc) require a service restart. So, if you place the black/white entries in a user_prefs file, the restart part goes away.
(In reply to comment #21)
> One more side note: you do NOT need to restart the service when adding entries
> to user_prefs. That file is reparsed per-message. Only the site-wide configs
> (ie: base rules, local.cf, etc) require a service restart.

Our filter runs with only site-wide config. Users wouldn't have the skills to edit a user_prefs anyway.

As an admin I sometimes make changes to local.cf and have to restart the service. I also handle requests to blacklist or whitelist users, on user request or when detecting false positives in the logs. This is done using "spamassassin --add-addr-to-blacklist=" etc. from within a custom shell script. This does not require a restart. That is much better.

Only, unfortunately, those actions do not stick forever. That is the topic of this bug. Like others in the comments, I think it should be fixed. A blacklist/whitelist is not configuration data, so solving it by putting it in whatever configuration file is just a dirty hack.
First: Even if you are using a site-wide configuration there is still a user_prefs. You can use that user_prefs without restarting spamd. This is because a site-wide configuration is nothing more than a setup where SA always runs as the same user. That user still has a home directory in /etc/passwd, and SA will still try to go there and load user_prefs.

Claiming that local.cf and spamd restart is the only option for a site-wide configuration is bogus. I will grant that the point about editing a file being less than optimal is valid, but there are ways where you don't have to restart spamd.

Your custom shell script could be modified to a small perl or python script to manage a user_prefs file. This would take a bit of work, but it is hack-free.

Second: Putting black/whitelist data into a config file may be a "hack" to you, but putting it into the AWL database is an extraordinarily egregious hack to me. The AWL is a dynamic score averager, it is not a blacklist or whitelist, and it is not static in nature.

Persistent score settings are fundamentally counter to the nature of what the AWL is, because it is fundamentally dynamic, not static. Although adding this static data would seem to fit with the AWL's ill-chosen name (auto white list), it does not fit with the reality of what the AWL actually is. Adding static data to it is going to involve a really ugly duct-tape and kludging wire hack.

This is my primary reason for resisting this change. It may be convenient from the perspective of the user interface, but the insides of making it happen are going to be really horrible and ugly.
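A small script of the kind suggested above might look like the following Python sketch. It is an assumption about how such a helper could be written, not an existing SpamAssassin tool: the function name `add_list_entry` is invented, it only handles the `blacklist_from` and `whitelist_from` directives, and it rewrites the file in place without locking, which a production script would want to handle more carefully.

```python
from pathlib import Path

# Directives this sketch is willing to write; both are standard
# SpamAssassin configuration options.
ALLOWED = ("blacklist_from", "whitelist_from")


def add_list_entry(prefs_path, directive, address):
    """Add '<directive> <address>' to a user_prefs file unless it is
    already present. Returns True if a line was added, False otherwise.
    """
    if directive not in ALLOWED:
        raise ValueError(f"unsupported directive: {directive}")
    if not address or any(ch.isspace() for ch in address):
        raise ValueError("address must be a single token")
    path = Path(prefs_path)
    wanted = f"{directive} {address}"
    lines = path.read_text().splitlines() if path.exists() else []
    # Compare with whitespace normalised so 'blacklist_from   x' matches.
    if any(" ".join(line.split()) == wanted for line in lines):
        return False
    lines.append(wanted)
    path.write_text("\n".join(lines) + "\n")
    return True
```

A call such as `add_list_entry(Path.home() / ".spamassassin" / "user_prefs", "blacklist_from", "spammer@example.com")` would target the usual per-user preferences file; the path for a site-wide spamd user depends on the installation.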
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target
As noted, the issue is a documentation issue. To blacklist persistently, use the blacklist_auth/blacklist_from features. The command-line option really just adds a weight to the AWL database that will get progressively re-weighted with new mails.

Side-note: AWL is not recommended; I recommend you look at TxRep (in trunk and coming with 3.4.1).
(In reply to Kevin A. McGrail from comment #25)

Don't worry, the system where that functionality was required is no longer running SpamAssassin so I don't bother anymore.

(I still think it is a stupid decision to put blacklist information in a configuration file instead of a whitelist database, but to each his own!)
(In reply to Rob Janssen from comment #26)
> (In reply to Kevin A. McGrail from comment #25)
>
> Don't worry, the system where that functionality was required is no longer
> running SpamAssassin so I don't bother anymore.
>
> (I still think it is a stupid decision to put blacklist information in a
> configuration file instead of a whitelist database, but to each his own!)

I use a userpref SQL database for this information and store tons of whitelist and blacklist entries there.

regards,
KAM