SA Bugzilla – Bug 7335
URIBL randomly not triggered (and SPF too)
Last modified: 2017-10-16 18:51:26 UTC
there seems to be a major bug with asynchron rules randomly not fireing which makes whitelist_auth in case of ham as well as URIBL in case of spam completly unrlieable just pass a message repeatly through spamd, one time rules hit, other times not Am 05.07.2016 um 14:01 schrieb Reindl Harald: > i have here a message with URIBL_ABUSE_SURBL Contains an URL listed in > the ABUSE SURBL blocklist > > 50% of all tries against spamd it does NOT hit while the scantime for > the whole message is arounnd 3 seconds - since there is a local > unbound-cache with > > cache-min-ttl: 300 > cache-max-ttl: 10800 > > it's impossible that there are happening dns timeouts and i can observe > the same behavior randomly with URIBL_LOCAL where the unbound dns cache > on 127.0.0.1:53 talks to rblsdnsd on 127.0.0.1:1053 > > that smells why ever very unrelieable and frankly i observed similar > with SPF_PASS / SHORTCIRCUIT where people within 5 seconds get the same > message and one get USER_IN_SPF_WHITELIST while the other goes through > all tests that below too MUST NOT happen because one triggers USER_IN_SPF_WHITELIST and the other don't have any SPF test and given that there is a python-policyd-spf waiting 20 seconds for the response in 'smtpd_recipient_restrictions' long before the contentfilters the dns-results are cached Jul 4 11:34:51 mail-gw postfix/smtpd[13648]: 3rjhgb71LVzB47: client=o3.email.wetransfer.com[192.254.123.42] Jul 4 11:34:52 mail-gw spamd[12535]: spamd: processing message <577a2da06a20d_63ca5ed300132182e5@delayedjobs-17aj6hbldm9spghikobe88v7k.wetransfer.com.mail> for sa-milt:189 Jul 4 11:34:56 mail-gw spamd[12535]: spamd: result: . -4 - BAYES_00,CUST_DNSWL_2_SENDERSC_L,CUST_DNSWL_3_JEF_L,CUST_DNSWL_5_ORG_N,CUST_DNSWL_8_TL_N,DKIM_SIGNED,DKIM_VALID,HTML_MESSAGE,RCVD_IN_MSPIKE_H2,RP_MATCHES_RCVD scantime=4.2,size=18438,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<577a2da06a20d_63ca5ed300132182e5@delayedjobs-17aj6hbldm9spghikobe88v7k.wetransfer.com.mail>,bayes=0.000000,autolearn=disabled,shortcircuit=no Jul 4 11:57:01 mail-gw postfix/smtpd[14837]: 3rjj993Bk8zB7P: client=o3.email.wetransfer.com[192.254.123.42] Jul 4 11:57:02 mail-gw spamd[14302]: spamd: processing message <577a32e8f35bb_671c116b308134854a0@delayedjobs-16gux7nsdp9xgp69boio5hcsg.wetransfer.com.mail> for sa-milt:189 Jul 4 11:57:02 mail-gw spamd[14302]: spamd: result: . -100 - CUST_DNSWL_2_SENDERSC_L,CUST_DNSWL_3_JEF_L,CUST_DNSWL_5_ORG_N,CUST_DNSWL_8_TL_N,CUST_SHORTCIRCUIT,RCVD_IN_MSPIKE_H2,SHORTCIRCUIT,USER_IN_SPF_WHITELIST scantime=0.1,size=15685,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<577a32e8f35bb_671c116b308134854a0@delayedjobs-16gux7nsdp9xgp69boio5hcsg.wetransfer.com.mail>,autolearn=disabled,shortcircuit=spam
another time in production first one slipped and is missing BOTH - SPF_PASS and URIBL_BLACK, i would believe that URIBL_BLACK listed 30 seconds later but NO in case of SPF_PASS also missing which is clearly cached at that moment and since the spf-policyd adds the SPF-Received header there is not much to do for SA when the test runs Jul 6 11:46:56 mail-gw spamd[22406]: spamd: result: . 2 - BAYES_50,CUST_DNSBL_20_SORBS_SPAM,DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,HTML_FONT_LOW_CONTRAST,HTML_MESSAGE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,RP_MATCHES_RCVD scantime=1.2,size=96001,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<sr.EWhhvDbth0ODXb_7mD4aGw.j_CD3WdKG0Gtk5nJYH1oRQ@mta0.sbr14.net>,bayes=0.500000,autolearn=disabled,shortcircuit=no Jul 6 11:47:22 mail-gw spamd[22406]: spamd: result: Y 8 - BAYES_50,CUST_DNSBL_20_SORBS_SPAM,DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,HTML_FONT_LOW_CONTRAST,HTML_MESSAGE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,RP_MATCHES_RCVD,SPF_PASS,URIBL_BLACK scantime=2.8,size=95579,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<sr.EWhhvDbth0ODXb_7mD4aGw.GUALfXqPakuhMMwoy27UqA@mta0.sbr14.net>,bayes=0.500000,autolearn=disabled,shortcircuit=no
Created attachment 5398 [details] script and sample message
Comment on attachment 5398 [details] script and sample message [root@mail-gw:/scripts/spamfilter/development]$ ./test.php URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND
in fact after it was found all looks repeatable fine, until you restart spamd sometimes and then randomly it starts again with [root@mail-gw:/scripts/spamfilter/development]$ ./test.php URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND
pls run spamd with -D switch and post the relevant log lines showing SA is not sending the query. same with passing msg multiple times to "spamassassin -D dns" .
yes, i will try to find debug-infos but need to do that on a different machine on the production server i can trigger the test result also when i restart unbound (meaning some of the first tests don't return URIBL_BLACK) when i restart unbound before and so the cache is empty the main difference here is the 872 ms response looks like it's not a matter of "showing SA is not sending the query" but the asynchron responses are somewhere handeled racy and i guess the same happens when the TTL expired which may explain the differences on the other hand it does not really explain the changing results within a short timeframe while that's absoluetly not predictable to trigger at tleat the dig results are showing that the unbound cache *is* answering after a restart / expire _________________________________________________________ [root@mail-gw:/scripts/spamfilter/development]$ dig finewatch2016.ru.black.uribl.com. ;; ANSWER SECTION: finewatch2016.ru.black.uribl.com. 294 IN A 127.0.0.2 ;; Query time: 0 msec ;; SERVER: 127.0.0.1#53(127.0.0.1) ;; WHEN: Mi Jul 06 12:58:51 CEST 2016 ;; MSG SIZE rcvd: 77 [root@mail-gw:/scripts/spamfilter/development]$ systemctl restart unbound _________________________________________________________ [root@mail-gw:/scripts/spamfilter/development]$ dig finewatch2016.ru.black.uribl.com. ;; ANSWER SECTION: finewatch2016.ru.black.uribl.com. 300 IN A 127.0.0.2 ;; Query time: 872 msec ;; SERVER: 127.0.0.1#53(127.0.0.1) ;; WHEN: Mi Jul 06 12:58:58 CEST 2016 ;; MSG SIZE rcvd: 77
bingo - after a cigarette break TLL expired and without restart unbound the first pass did not return the listing - i will try to write a script restarting unbound and pass "spmassassin -D" outputs each time to single files - stay tuned [root@mail-gw:/scripts/spamfilter/development]$ ./test.php URIBL_BLACK NOT FOUND URIBL_BLACK FOUND
Created attachment 5399 [details] debug-outputs, testscript and sample-message at the bottom you see that the first "spamassassin -D dns" pass is missing URIBL_BLACK while the second one has it as expected 127.0.0.1:53 is a unbound doing direct recursion and the only resolver, it answers proveable after a restart, see comment above with 0.8 response time after restart i think it's pretty clear now that with a cold cache it's very likely to have at least the first time a new domain appears in a mail URIBL does not hit proper as said i observe similar problems with SPF_PASS and that is a problem with whitelisting / shortcircuit based on SPF meta CUST_SHORTCIRCUIT (USER_IN_WHITELIST || USER_IN_BLACKLIST || USER_IN_BLACKLIST_TO || USER_IN_ALL_SPAM_TO || USER_IN_DKIM_WHITELIST || USER_IN_SPF_WHITELIST || ALL_TRUSTED) priority CUST_SHORTCIRCUIT -500 shortcircuit CUST_SHORTCIRCUIT on score CUST_SHORTCIRCUIT 0.0001 describe CUST_SHORTCIRCUIT Skip tests based on whitelists/blacklists and local relay _______________________________________________________________________________ [root@mail-gw:/scripts/spamfilter/development]$ cat test.sh #!/bin/bash systemctl restart unbound rm -f /scripts/spamfilter/development/debug/* for NUMBER in "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" do su -c "/usr/bin/spamassassin -D < /scripts/spamfilter/development/uribl-sample.eml" - sa-milt &> /scripts/spamfilter/development/debug/$NUMBER.txt done sleep 60 systemctl restart unbound rm -f /scripts/spamfilter/development/debug-dns/* for NUMBER in "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" do su -c "/usr/bin/spamassassin -D dns < /scripts/spamfilter/development/uribl-sample.eml" - sa-milt &> /scripts/spamfilter/development/debug-dns/$NUMBER.txt done _______________________________________________________________________________ [root@mail-gw:/scripts/spamfilter/development/debug-dns]$ cat 01.txt | grep -i black Jul 6 13:40:58.054 [5522] dbg: dns: URIBL_BLACK lookup start Jul 6 13:40:58.058 [5522] dbg: dns: URIBL_BLACK lookup start * 3.0 URIBL_LOCAL Contains an URL listed in the URIBL blacklist * 0.0 CUST_SHORTCIRCUIT Skip tests based on whitelists/blacklists and _______________________________________________________________________________ [root@mail-gw:/scripts/spamfilter/development/debug-dns]$ cat 02.txt | grep -i black Jul 6 13:40:58.964 [5548] dbg: dns: URIBL_BLACK lookup start Jul 6 13:40:58.967 [5548] dbg: dns: URIBL_BLACK lookup start Jul 6 13:40:58.982 [5548] dbg: dns: URIBL_BLACK lookup finished * 6.5 URIBL_BLACK Contains an URL listed in the URIBL blacklist * 3.0 URIBL_LOCAL Contains an URL listed in the URIBL blacklist * 0.0 CUST_SHORTCIRCUIT Skip tests based on whitelists/blacklists and
Pls post OS/version, Perl & Net::DNS versions.
Fedora 23 x86_64 [root@mail-gw:~]$ uname -a Linux mail-gw.thelounge.net 4.5.7-202.fc23.x86_64 #1 SMP Tue Jun 28 18:22:51 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux [root@mail-gw:~]$ rpm -q perl perl-5.22.2-352.fc23.x86_64 [root@mail-gw:~]$ rpm -qa | grep -i perl | grep -i dns perl-Net-DNS-0.83-3.fc23.x86_64 [root@mail-gw:~]$ rpm -qa | grep -i spamassassin spamassassin-3.4.1-6.fc23.x86_64 spamassassin-clamav-2.0-4.fc23.20160525.rh.noarch spamassassin-iXhash2-2.05-8.fc23.noarch
can't reproduce the problem using Centos7's Perl, SA install from trunk and pdns-recursor - Maybe someone else has more luck.
sad to hear - started it again now after skip our own domains with "uridnsbl_skip_domain", same result with "-D" and "-D dns" the first one is missing URIBL_BLACK, the others have it the difference that only one compared with https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7335#c3 is likely spamd versus spamassassin and caused by the startup time of the non-daemonized process while in background the cache is filled before the second attempt
My first though was that you hit the retry/timeout incompatibility with Net::DNS 1.03, but you say your version is 0.83, so this is not it. What is your setting for rbl_timeout ? See man page for Mail::SpamAssassin::Conf : rbl_timeout t [t_min] [zone] (default: 15 3) All DNS queries are made at the beginning of a check and we try to read the results at the end. This value specifies the maximum period of time (in seconds) to wait for a DNS query. If most of the DNS queries have succeeded for a particular message, then SpamAssassin will not wait for the full period to avoid wasting time on unresponsive server(s), but will shrink the timeout according to a percentage of queries already completed. As the number of queries remaining approaches 0, the timeout value will gradually approach a t_min value, which is an optional second parameter and defaults to 0.2 * t. [...] It would be useful to see the full debug output with timestamps of the failed DNS lookup event, as produced by a command-line spaassassin -D. It would show how long did it take for a DNS reply to come back, and if spamassassin terminated the wait prematurely.
> It would be useful to see the full debug output > with timestamps of the failed DNS lookup event i have attached all infos requested, see: https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7335#c8
> i have attached all infos requested, see: > https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7335#c8 Sorry, I missed that. Don't use shortcircuit, it just discards and wastes valuable information like DNS replies which have not yet been processed. A small saving in time is not worth getting incomplete info.
Shortcircuiting is quite handy with dnsless rules like ALL_TRUSTED and WHITELIST_FROM_RCVD since they are tricky to implement at MTA level. But it needs some patching to work properly: https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5930
well - very interesting - without the SHORTCIRCUIT caused by using the live configuration / environment it's not that easy to reproduce with the test samples but i have seen that unpredictable behavior also in non-shortcircuit mailflow, especially in case of SPF_PASS don't fire and HENCE NO shortcircuit triggered maybe that helps to reproduce and look deeper by craft a message which would hit URIBL and SHORTCIRCUIT - thanks for that hint helping provide this information and explain likely "can't reproduce the problem using Centos7's Perl, SA install from trunk and pdns-recursor - Maybe someone else has more luck"
two days ago i update smy testserver vm to Fedora 24 which ships perl-Net-DNS-1.04-3.fc24.noarch and seems not to be affected from that problem compared with Fedora 23 with perl-Net-DNS-0.83-3.fc23.x86_64 - strange
i revert my previous "DNS-1.04-3.fc24.noarch seems not to be affected" - looks like that was only on my vm at home - Fedora 24 in production after upgrade 20 minutes ago has the same issue seems to depend really on timing - if the rest of the scan is very fast *and* the dns response slow something get out of sync [root@mail-gw:/scripts/spamfilter/development]$ systemctl restart unbound; ./test.php URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK NOT FOUND URIBL_BLACK FOUND URIBL_BLACK FOUND
The inability of others to reproduce the problem, in combination with the multiple changes in Net::DNS and SA (and their interaction) in the past year+ seems to me to be a sound basis for deeming this (whatever "this" was) to be fixed or to be a quirk of local configuration.
about which changes in SA in the last year are you talking about? there was no release in the past two years, sa-update is dead for nearly a whole year except a short timeframe rule-updates worked but according to the list the scroring was questionable and hence stopped again
and since SPF also randomly don't score even when there is a policyd header you can be sure that the dns stuff in spamassassin is slightly broken
(In reply to Reindl Harald from comment #21) > about which changes in SA in the last year are you talking about? The relevant bugs that come to mind are bug #7223, bug #7231, and bug #7265. > there was no release in the past two years Which does not mean that there have been no changes implemented in the current codebase. If this can be reproduced with the current code, rather than a vendor package of 3.4.1 with an unknown set of back-ported fixes and/or customizations. As has been noted, using SHORTCIRCUIT is problematic in connection with DNS-based rules, because unfinished queries are ignored. In the 2 collections of debug outputs provided, only the first of 20 runs fails to hit URIBL_BLACK and those 2 runs are cut short by SHORTCIRCUIT to less than 20ms from the start of the relevant DNS query to the harvesting of completed results. It is extremely unlikely to complete a recursive DNS query whose answer is not in a local cache in 20ms. In casual testing on a well-connected low-load host with the specific name being queried cleared from the cache between queries, I am unable to get an answer from any of the multi.uribl.com name servers in <35ms consistently and never <25ms. Given the available evidence, there is probably no bug here at all: only a known effect of SHORTCIRCUIT working as designed. If there IS anything more involved, it is not reasonable to diagnose a DNS issue using 3.4.1 rather than the current codebase due to the substantial code changes made in relation to DNS since the 3.4.1 release.
how does it help when something exists in commits - there are a ton of fixed bugs making it not to distributions because SA don't make releases as any other project over years
It doesn't help as much as I would like but we aren't responsible for distros. I have worked on getting a release done and the ruleqa fixes with Dave Jones as well.