Bug 7335 - URIBL randomly not triggered (and SPF too)
Summary: URIBL randomly not triggered (and SPF too)
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: 3.4.1
Hardware: PC Linux
: P2 major
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-06 09:45 UTC by Reindl Harald
Modified: 2017-10-16 18:51 UTC (History)
4 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
script and sample message application/octet-stream None Reindl Harald [NoCLA]
debug-outputs, testscript and sample-message application/octet-stream None Reindl Harald [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Reindl Harald 2016-07-06 09:45:59 UTC
there seems to be a major bug with asynchron rules randomly not fireing which makes whitelist_auth in case of ham as well as URIBL in case of spam completly unrlieable

just pass a message repeatly through spamd, one time rules hit, other times not

Am 05.07.2016 um 14:01 schrieb Reindl Harald:
> i have here a message with URIBL_ABUSE_SURBL Contains an URL listed in
> the ABUSE SURBL blocklist
>
> 50% of all tries against spamd it does NOT hit while the scantime for
> the whole message is arounnd 3 seconds - since there is a local
> unbound-cache with
>
>  cache-min-ttl: 300
>  cache-max-ttl: 10800
>
> it's impossible that there are happening dns timeouts and i can observe
> the same behavior randomly with URIBL_LOCAL where the unbound dns cache
> on 127.0.0.1:53 talks to rblsdnsd on 127.0.0.1:1053
>
> that smells why ever very unrelieable and frankly i observed similar
> with SPF_PASS / SHORTCIRCUIT where people within 5 seconds get the same
> message and one get USER_IN_SPF_WHITELIST while the other goes through
> all tests

that below too MUST NOT happen because one triggers USER_IN_SPF_WHITELIST and the other don't have any SPF test and given that there is a python-policyd-spf waiting 20 seconds for the response in 'smtpd_recipient_restrictions' long before the contentfilters the dns-results are cached

Jul  4 11:34:51 mail-gw postfix/smtpd[13648]: 3rjhgb71LVzB47: client=o3.email.wetransfer.com[192.254.123.42]
Jul  4 11:34:52 mail-gw spamd[12535]: spamd: processing message <577a2da06a20d_63ca5ed300132182e5@delayedjobs-17aj6hbldm9spghikobe88v7k.wetransfer.com.mail> for sa-milt:189
Jul  4 11:34:56 mail-gw spamd[12535]: spamd: result: . -4 - BAYES_00,CUST_DNSWL_2_SENDERSC_L,CUST_DNSWL_3_JEF_L,CUST_DNSWL_5_ORG_N,CUST_DNSWL_8_TL_N,DKIM_SIGNED,DKIM_VALID,HTML_MESSAGE,RCVD_IN_MSPIKE_H2,RP_MATCHES_RCVD scantime=4.2,size=18438,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<577a2da06a20d_63ca5ed300132182e5@delayedjobs-17aj6hbldm9spghikobe88v7k.wetransfer.com.mail>,bayes=0.000000,autolearn=disabled,shortcircuit=no

Jul  4 11:57:01 mail-gw postfix/smtpd[14837]: 3rjj993Bk8zB7P: client=o3.email.wetransfer.com[192.254.123.42]
Jul  4 11:57:02 mail-gw spamd[14302]: spamd: processing message <577a32e8f35bb_671c116b308134854a0@delayedjobs-16gux7nsdp9xgp69boio5hcsg.wetransfer.com.mail> for sa-milt:189
Jul  4 11:57:02 mail-gw spamd[14302]: spamd: result: . -100 - CUST_DNSWL_2_SENDERSC_L,CUST_DNSWL_3_JEF_L,CUST_DNSWL_5_ORG_N,CUST_DNSWL_8_TL_N,CUST_SHORTCIRCUIT,RCVD_IN_MSPIKE_H2,SHORTCIRCUIT,USER_IN_SPF_WHITELIST scantime=0.1,size=15685,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<577a32e8f35bb_671c116b308134854a0@delayedjobs-16gux7nsdp9xgp69boio5hcsg.wetransfer.com.mail>,autolearn=disabled,shortcircuit=spam
Comment 1 Reindl Harald 2016-07-06 10:11:47 UTC
another time in production

first one slipped and is missing BOTH - SPF_PASS and URIBL_BLACK, i would believe that URIBL_BLACK listed 30 seconds later but NO in case of SPF_PASS also missing which is clearly cached at that moment and since the spf-policyd adds the SPF-Received header there is not much to do for SA when the test runs

Jul  6 11:46:56 mail-gw spamd[22406]: spamd: result: . 2 - BAYES_50,CUST_DNSBL_20_SORBS_SPAM,DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,HTML_FONT_LOW_CONTRAST,HTML_MESSAGE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,RP_MATCHES_RCVD scantime=1.2,size=96001,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<sr.EWhhvDbth0ODXb_7mD4aGw.j_CD3WdKG0Gtk5nJYH1oRQ@mta0.sbr14.net>,bayes=0.500000,autolearn=disabled,shortcircuit=no

Jul  6 11:47:22 mail-gw spamd[22406]: spamd: result: Y 8 - BAYES_50,CUST_DNSBL_20_SORBS_SPAM,DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,HTML_FONT_LOW_CONTRAST,HTML_MESSAGE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,RP_MATCHES_RCVD,SPF_PASS,URIBL_BLACK scantime=2.8,size=95579,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<sr.EWhhvDbth0ODXb_7mD4aGw.GUALfXqPakuhMMwoy27UqA@mta0.sbr14.net>,bayes=0.500000,autolearn=disabled,shortcircuit=no
Comment 2 Reindl Harald 2016-07-06 10:43:36 UTC
Created attachment 5398 [details]
script and sample message
Comment 3 Reindl Harald 2016-07-06 10:44:09 UTC
Comment on attachment 5398 [details]
script and sample message

[root@mail-gw:/scripts/spamfilter/development]$ ./test.php 
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
Comment 4 Reindl Harald 2016-07-06 10:47:50 UTC
in fact after it was found all looks repeatable fine, until you restart spamd sometimes and then randomly it starts again with 

[root@mail-gw:/scripts/spamfilter/development]$ ./test.php 
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
Comment 5 AXB 2016-07-06 10:52:08 UTC
pls run spamd with -D switch and post the relevant log lines showing SA is not sending the query.

same with passing msg multiple times to "spamassassin -D dns" .
Comment 6 Reindl Harald 2016-07-06 11:06:50 UTC
yes, i will try to find debug-infos but need to do that on a different machine

on the production server i can trigger the test result also when i restart unbound (meaning some of the first tests don't return URIBL_BLACK) when i restart unbound before and so the cache is empty

the main difference here is the 872 ms response

looks like it's not a matter of "showing SA is not sending the query" but the asynchron responses are somewhere handeled racy and i guess the same happens when the TTL expired which may explain the differences 

on the other hand it does not really explain the changing results within a short timeframe while that's absoluetly not predictable to trigger

at tleat the dig results are showing that the unbound cache *is* answering after a restart / expire
_________________________________________________________

[root@mail-gw:/scripts/spamfilter/development]$ dig finewatch2016.ru.black.uribl.com.

;; ANSWER SECTION:
finewatch2016.ru.black.uribl.com. 294 IN A      127.0.0.2

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mi Jul 06 12:58:51 CEST 2016
;; MSG SIZE  rcvd: 77

[root@mail-gw:/scripts/spamfilter/development]$ systemctl restart unbound
_________________________________________________________

[root@mail-gw:/scripts/spamfilter/development]$ dig finewatch2016.ru.black.uribl.com.

;; ANSWER SECTION:
finewatch2016.ru.black.uribl.com. 300 IN A      127.0.0.2

;; Query time: 872 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mi Jul 06 12:58:58 CEST 2016
;; MSG SIZE  rcvd: 77
Comment 7 Reindl Harald 2016-07-06 11:18:21 UTC
bingo - after a cigarette break TLL expired and without restart unbound the first pass did not return the listing - i will try to write a script restarting unbound and pass "spmassassin -D" outputs each time to single files - stay tuned

[root@mail-gw:/scripts/spamfilter/development]$ ./test.php 
URIBL_BLACK NOT FOUND
URIBL_BLACK FOUND
Comment 8 Reindl Harald 2016-07-06 11:53:49 UTC
Created attachment 5399 [details]
debug-outputs, testscript and sample-message

at the bottom you see that the first "spamassassin -D dns" pass is missing URIBL_BLACK while the second one has it as expected

127.0.0.1:53 is a unbound doing direct recursion and the only resolver, it answers proveable after a restart, see comment above with 0.8 response time after restart

i think it's pretty clear now that with a cold cache it's very likely to have at least the first time a new domain appears in a mail URIBL does not hit proper 

as said i observe similar problems with SPF_PASS and that is a problem with whitelisting / shortcircuit based on SPF

meta          CUST_SHORTCIRCUIT  (USER_IN_WHITELIST || USER_IN_BLACKLIST || USER_IN_BLACKLIST_TO || USER_IN_ALL_SPAM_TO || USER_IN_DKIM_WHITELIST || USER_IN_SPF_WHITELIST || ALL_TRUSTED)
priority      CUST_SHORTCIRCUIT  -500
shortcircuit  CUST_SHORTCIRCUIT  on
score         CUST_SHORTCIRCUIT  0.0001
describe      CUST_SHORTCIRCUIT  Skip tests based on whitelists/blacklists and local relay
_______________________________________________________________________________

[root@mail-gw:/scripts/spamfilter/development]$ cat test.sh
#!/bin/bash
systemctl restart unbound
rm -f /scripts/spamfilter/development/debug/*
for NUMBER in "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
do
 su -c "/usr/bin/spamassassin -D < /scripts/spamfilter/development/uribl-sample.eml" - sa-milt &> /scripts/spamfilter/development/debug/$NUMBER.txt
done
sleep 60
systemctl restart unbound
rm -f /scripts/spamfilter/development/debug-dns/*
for NUMBER in "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
do
 su -c "/usr/bin/spamassassin -D dns < /scripts/spamfilter/development/uribl-sample.eml" - sa-milt &> /scripts/spamfilter/development/debug-dns/$NUMBER.txt
done
_______________________________________________________________________________

[root@mail-gw:/scripts/spamfilter/development/debug-dns]$ cat 01.txt | grep -i black
Jul  6 13:40:58.054 [5522] dbg: dns: URIBL_BLACK lookup start
Jul  6 13:40:58.058 [5522] dbg: dns: URIBL_BLACK lookup start
        *  3.0 URIBL_LOCAL Contains an URL listed in the URIBL blacklist
        *  0.0 CUST_SHORTCIRCUIT Skip tests based on whitelists/blacklists and
_______________________________________________________________________________

[root@mail-gw:/scripts/spamfilter/development/debug-dns]$ cat 02.txt | grep -i black
Jul  6 13:40:58.964 [5548] dbg: dns: URIBL_BLACK lookup start
Jul  6 13:40:58.967 [5548] dbg: dns: URIBL_BLACK lookup start
Jul  6 13:40:58.982 [5548] dbg: dns: URIBL_BLACK lookup finished
        *  6.5 URIBL_BLACK Contains an URL listed in the URIBL blacklist
        *  3.0 URIBL_LOCAL Contains an URL listed in the URIBL blacklist
        *  0.0 CUST_SHORTCIRCUIT Skip tests based on whitelists/blacklists and
Comment 9 AXB 2016-07-06 12:01:16 UTC
Pls post OS/version, Perl & Net::DNS versions.
Comment 10 Reindl Harald 2016-07-06 12:03:42 UTC
Fedora 23 x86_64

[root@mail-gw:~]$ uname -a
Linux mail-gw.thelounge.net 4.5.7-202.fc23.x86_64 #1 SMP Tue Jun 28 18:22:51 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@mail-gw:~]$ rpm -q perl
perl-5.22.2-352.fc23.x86_64

[root@mail-gw:~]$ rpm -qa | grep -i perl | grep -i dns
perl-Net-DNS-0.83-3.fc23.x86_64

[root@mail-gw:~]$ rpm -qa | grep -i spamassassin
spamassassin-3.4.1-6.fc23.x86_64
spamassassin-clamav-2.0-4.fc23.20160525.rh.noarch
spamassassin-iXhash2-2.05-8.fc23.noarch
Comment 11 AXB 2016-07-06 15:38:44 UTC
can't reproduce the problem using Centos7's Perl, SA install from trunk and pdns-recursor - Maybe someone else has more luck.
Comment 12 Reindl Harald 2016-07-06 23:15:58 UTC
sad to hear - started it again now after skip our own domains with "uridnsbl_skip_domain", same result

with "-D" and "-D dns" the first one is missing URIBL_BLACK, the others have it

the difference that only one compared with https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7335#c3 is likely spamd versus spamassassin and caused by the startup time of the non-daemonized process while in background the cache is filled before the second attempt
Comment 13 Mark Martinec 2016-07-07 10:04:44 UTC
My first though was that you hit the retry/timeout incompatibility with
Net::DNS 1.03, but you say your version is 0.83, so this is not it.

What is your setting for rbl_timeout ?
See man page for Mail::SpamAssassin::Conf :

  rbl_timeout t [t_min] [zone]       (default: 15 3)
    All DNS queries are made at the beginning of a check and we try to
    read the results at the end.  This value specifies the maximum
    period of time (in seconds) to wait for a DNS query.  If most of
    the DNS queries have succeeded for a particular message, then
    SpamAssassin will not wait for the full period to avoid wasting
    time on unresponsive server(s), but will shrink the timeout
    according to a percentage of queries already completed.  As the
    number of queries remaining approaches 0, the timeout value will
    gradually approach a t_min value, which is an optional second
    parameter and defaults to 0.2 * t.  [...]

It would be useful to see the full debug output with timestamps
of the failed DNS lookup event, as produced by a command-line
spaassassin -D.  It would show how long did it take for a DNS
reply to come back, and if spamassassin terminated the wait
prematurely.
Comment 14 Reindl Harald 2016-07-07 10:14:45 UTC
> It would be useful to see the full debug output 
> with timestamps of the failed DNS lookup event

i have attached all infos requested, see:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7335#c8
Comment 15 Mark Martinec 2016-07-07 11:09:13 UTC
> i have attached all infos requested, see:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7335#c8

Sorry, I missed that.

Don't use shortcircuit, it just discards and wastes valuable
information like DNS replies which have not yet been processed.
A small saving in time is not worth getting incomplete info.
Comment 16 Henrik Krohns 2016-07-07 12:19:46 UTC
Shortcircuiting is quite handy with dnsless rules like ALL_TRUSTED and WHITELIST_FROM_RCVD since they are tricky to implement at MTA level. But it needs some patching to work properly:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5930
Comment 17 Reindl Harald 2016-07-07 12:28:18 UTC
well - very interesting - without the SHORTCIRCUIT caused by using the live configuration / environment it's not that easy to reproduce with the test samples 

but i have seen that unpredictable behavior also in non-shortcircuit mailflow, especially in case of SPF_PASS don't fire and HENCE NO shortcircuit triggered

maybe that helps to reproduce and look deeper by craft a message which would hit URIBL and SHORTCIRCUIT - thanks for that hint helping provide this information and explain likely "can't reproduce the problem using Centos7's Perl, SA install from trunk and pdns-recursor - Maybe someone else has more luck"
Comment 18 Reindl Harald 2016-07-21 09:34:29 UTC
two days ago i update smy testserver vm to Fedora 24 which ships perl-Net-DNS-1.04-3.fc24.noarch and seems not to be affected from that problem compared with Fedora 23 with perl-Net-DNS-0.83-3.fc23.x86_64 - strange
Comment 19 Reindl Harald 2016-07-25 17:09:21 UTC
i revert my previous "DNS-1.04-3.fc24.noarch seems not to be affected" - looks like that was only on my vm at home - Fedora 24 in production after upgrade 20 minutes ago has the same issue

seems to depend really on timing - if the rest of the scan is very fast *and* the dns response slow something get out of sync

[root@mail-gw:/scripts/spamfilter/development]$ systemctl restart unbound; ./test.php 
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK NOT FOUND
URIBL_BLACK FOUND
URIBL_BLACK FOUND
Comment 20 Bill Cole 2017-10-15 07:32:51 UTC
The inability of others to reproduce the problem, in combination with the multiple changes in Net::DNS and SA (and their interaction) in the past year+ seems to me to be a sound basis for deeming this (whatever "this" was) to be fixed or to be a quirk of local configuration.
Comment 21 Reindl Harald 2017-10-15 11:31:23 UTC
about which changes in SA in the last year are you talking about?

there was no release in the past two years, sa-update is dead for nearly a whole year except a short timeframe rule-updates worked but according to the list the scroring was questionable and hence stopped again
Comment 22 Reindl Harald 2017-10-15 11:32:40 UTC
and since SPF also randomly don't score even when there is a policyd header you can be sure that the dns stuff in spamassassin is slightly broken
Comment 23 Bill Cole 2017-10-16 16:02:59 UTC
(In reply to Reindl Harald from comment #21)
> about which changes in SA in the last year are you talking about?

The relevant bugs that come to mind are bug #7223, bug #7231, and bug #7265. 

> there was no release in the past two years

Which does not mean that there have been no changes implemented in the current codebase. If this can be reproduced with the current code, rather than a vendor package of 3.4.1 with an unknown set of back-ported fixes and/or customizations. 

As has been noted, using SHORTCIRCUIT is problematic in connection with DNS-based rules, because unfinished queries are ignored. In the 2 collections of debug outputs provided, only the first of 20 runs fails to hit URIBL_BLACK and those 2 runs are cut short by SHORTCIRCUIT to less than 20ms from the start of the relevant DNS query to the harvesting of completed results. It is extremely unlikely to complete a recursive DNS query whose answer is not in a local cache in 20ms. In casual testing on a well-connected low-load host with the specific  name being queried cleared from the cache between queries, I am unable to get an answer from any of the multi.uribl.com name servers in <35ms consistently and never <25ms.

Given the available evidence, there is probably no bug here at all: only a known effect of SHORTCIRCUIT working as designed. If there IS anything more involved, it is not reasonable to diagnose a DNS issue using 3.4.1 rather than the current codebase due to the substantial code changes made in relation to DNS since the 3.4.1 release.
Comment 24 Reindl Harald 2017-10-16 16:37:45 UTC
how does it help when something exists in commits - there are a ton of fixed bugs making it not to distributions because SA don't make releases as any other project over years
Comment 25 Kevin A. McGrail 2017-10-16 18:51:26 UTC
It doesn't help as much as I would like but we aren't responsible for distros.  I have worked on getting a release done and the ruleqa fixes with Dave Jones as well.