SA Bugzilla – Bug 6884
uribl_all_types.t test is intermittently failing
Last modified: 2013-01-17 01:52:05 UTC
- We have failed the last 5 buidls 8280 through 8285 such as: - Test failure output example: https://builds.apache.org/job/SpamAssassin-trunk/8284/testReport/make_test/t_uribl_all_types_t/test______2/ - make test on my systems works without issue Thoughts: - Might be wrong output in the rules report vis a vis a bug similar to freemail where we haveincorrect output going into the report - Might be DNS related but Mark doesn't think so - Try and get the log output from Jenkins which KAM doesn't know how to do Per Michael: One thing you could do is modify the tests/jenkins to archive the logs so that you can go to the web interface and review them. I'd actually be surprised if that wasn't already done. It's been awhile since I messed with the source, but I think just adding a tar czvf log_output.tgz WORKDIR/t/log/* or whatever the path actually is, would get you a .tgz of the logs and then modify the Jenkins config to archive that file along with the results.
The underlying cause is identified by Mark in Bug 6880. *** This bug has been marked as a duplicate of bug 6880 ***
> The underlying cause is identified by Mark in Bug 6880. Not quite: the Bug 6880 is certainly the final drop that broke the uribl_all_types test unconditionally, but the intermittent variability in the sub-description still remains. On my host it fails in about 50% of runs.
Not entirely a duplicate, the original failure reason is still there.
(In reply to comment #3) > Not entirely a duplicate, the original failure reason is still there. Once I added a description, my ability to replicate the failure is gone. Descriptions are stored in a custom Tie Hash of a single string. I wonder if there is a bug in that storage involved rules with no descriptions? Anyway, I'll leave this open and see what fails on the next jenkins build.
(In reply to comment #3) > Not entirely a duplicate, the original failure reason is still there. As Jenkin's is now stable, I will close this ticket but if you can find any rhyme or reason or way to duplicate the issue that might lead to the REPORT leakage issue, please feel free to re-open.
I'm beginning to understand the issue. These are relevant DNS queries from the failing test: $ host -t A uribl-example-a.com.dnsbltest.spamassassin.org uribl-example-a.com.dnsbltest.spamassassin.org has address 127.0.0.2 $ host -t A uribl-example-b.com.dnsbltest.spamassassin.org uribl-example-b.com.dnsbltest.spamassassin.org has address 127.0.0.4 $ host -t A uribl-example-c.com.dnsbltest.spamassassin.org uribl-example-c.com.dnsbltest.spamassassin.org has address 127.0.0.6 $ host -t A 98.3.137.144.dnsbltest.spamassassin.org 98.3.137.144.dnsbltest.spamassassin.org has address 127.0.0.2 and here are the corresponding rules from the test: urirhssub X_URIBL_IPSONLY dnsbltest.spamassassin.org. A 2 body X_URIBL_IPSONLY eval:check_uridnsbl('X_URIBL_IPSONLY') describe X_URIBL_IPSONLY X_URIBL_IPSONLY tflags X_URIBL_IPSONLY net ips_only urirhssub X_URIBL_DOMSONLY dnsbltest.spamassassin.org. A 4 body X_URIBL_DOMSONLY eval:check_uridnsbl('X_URIBL_DOMSONLY') describe X_URIBL_DOMSONLY X_URIBL_DOMSONLY tflags X_URIBL_DOMSONLY net domains_only Note that both the uribl-example-a.com.dnsbltest.spamassassin.org as well as the 98.3.137.144.dnsbltest.spamassassin.org produce a reply 127.0.0.2, which matches the '2' subrule filter of X_URIBL_IPSONLY, so this rule collects both the 'uribl-example-a' and the '98.3.137.144' as a rule sub-description - but there is room for only one answer, so whichever comes last(?) stays there. It is a pure luck or a timing thing (or maybe a Perl version) if Jenkins happens to always return the same result. In my case I see one or the other result with about the same probability (perl 5.17.7). I see two open issues here: - allow multiple hits to show in the sub-description (this is already mentioned in the code as a TODO); - does the uribl-example-a.com.dnsbltest.spamassassin.org with its 127.0.0.2 really merit the hit on X_URIBL_IPSONLY (despite a match on a filter), considering its tflag ips_only ? Some determinism needs to be introduced here before this ticket can be closed, although it probably isn't a blocker, as the above test case is somewhat synthetic and should rarely occur in practice.
The reason why it happens to give consistent results on Jenkins, and why it fails in 2/3 of runs on my host is in a version of perl: perl5176delta - what is new for perl v5.17.6 Per process hash randomization The seed used by Perl's hash function is now random. This means that the order which keys/values will be returned from functions like "keys()", "values()", and "each()" will differ from run to run. This change was introduced to make Perl's hashes more robust to algorithmic complexity attacks, and also because we discovered that it exposes hash ordering dependency bugs and makes them easier to track down. Toolchain maintainers might want to invest in additional infrastructure to test for things like this. Running tests several times in a row and then comparing results will make it easier to spot hash order dependencies in code. Authors are strongly encouraged not to expose the key order of Perl's hashes to insecure audiences. Lucily/unluckily this exposes a bug in SA::Plugin::URIDNSBL::complete_dnsbl_lookup where a hit is inappropriately scored for all rules with the same subrule filter.
(In reply to comment #7) > Lucily/unluckily this exposes a bug in > > SA::Plugin::URIDNSBL::complete_dnsbl_lookup > > where a hit is inappropriately scored for all rules > with the same subrule filter. You get bonus points for digging this one up!
I'm still spinning in circles, thinking how to fix the URIDNSBL.pm. Meanwhile, here is some cosmetics and a debugging aid check-in, along with fixes for what appears to be an innocent bug in PerMsgStatus.pm, namely the: $self->{test_log_msgs} = (); seems a silly way of saying $self->{test_log_msgs} = undef. What was apparently intended is a %{$self->{test_log_msgs}} = (), which deletes both keys but keeps a hash cleaned for reuse by the next rule. trunk: Bug 6884: $pms->{test_log_msgs} is a hashref, not a hash; some cosmetics and debugging in Plugin/URIDNSBL.pm' Sending lib/Mail/SpamAssassin/PerMsgStatus.pm Sending lib/Mail/SpamAssassin/Plugin/Check.pm Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm Sending t/uribl_all_types.t Committed revision 1432200.
> Meanwhile, ... Another (loosely related) DNS hardening detail: trunk: DnsResolver: prevent a failed or suppressed DNS query packet generation from propagating unnecessarily Sending lib/Mail/SpamAssassin/Dns.pm Sending lib/Mail/SpamAssassin/DnsResolver.pm Sending lib/Mail/SpamAssassin/Plugin/ASN.pm Sending lib/Mail/SpamAssassin/Plugin/AskDNS.pm Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm Committed revision 1432359.
... getting there, looks promising ...
Found another test case which the current code gets wrong, and the t/uribl.t (test 5) covers the failure by claiming that this is an expected result. The test 5 in t/uribl.t is: urifullnsrhssub X_URIBL_FULL_NS dnsbltest.spamassassin.org. A 8 body X_URIBL_FULL_NS eval:check_uridnsbl('X_URIBL_FULL_NS') tflags X_URIBL_FULL_NS net The sample message is t/data/spam/dnsbl.eml, which contains an URL http://spamassassin.org/. The urifullnsrhssub is supposed to collect NS records of spamassassin.org, which are: a.auth-ns.sonic.net, b.auth-ns.sonic.net, c.auth-ns.sonic.net and ns.hyperreal.org, then submit these host full host names (not just their domains) to a DNS BL test zone dnsbltest.spamassassin.org: $ host -t a hyperreal.org.dnsbltest.spamassassin.org hyperreal.org.dnsbltest.spamassassin.org has address 127.0.0.8 $ host -t a ns.hyperreal.org.dnsbltest.spamassassin.org Host ns.hyperreal.org.dnsbltest.spamassassin.org not found: 3(NXDOMAIN) Note that ns.hyperreal.org.dnsbltest.spamassassin.org is *not* in the zone, so rule X_URIBL_FULL_NS should *not* have fired. The hyperreal.org.dnsbltest.spamassassin.org however *is* in the BL zone, so the other test rule X_URIBL_NS does fire correctly: urinsrhssub X_URIBL_NS dnsbltest.spamassassin.org. A 8 ... yet because X_URIBL_NS happens to have the same subtest (8) as X_URIBL_FULL_NS, it incorrectly drags the X_URIBL_FULL_NS with itself. Btw: seems like subroutines lookup_a, lookup_ptr, lookup_mx, and lookup_mx_exists in Dns.pm are not used anywhere??? Drop? (And the lookup_ns is only used during startup by is_dns_available() in the same module.)
Good catch on uribl.t as well. I'm moving this to a blocker status. BTW, re: "subroutines lookup_a, lookup_ptr, lookup_mx, and lookup_mx_exists in Dns.pm are not used anywhere??? Drop? (And the lookup_ns is only used during startup by is_dns_available() in the same module.)" I would say not to drop them. Too likely we will need those functions at some point in my opinion. Regards, KAM
(In reply to comment #13) > Good catch on uribl.t as well. I'm moving this to a blocker status. > > BTW, re: "subroutines lookup_a, lookup_ptr, lookup_mx, and > lookup_mx_exists in Dns.pm are not used anywhere??? Drop? > (And the lookup_ns is only used during startup by is_dns_available() > in the same module.)" > > I would say not to drop them. Too likely we will need those functions at > some point in my opinion. > > Regards, > KAM Brace yourselves, here comes a biggie! Better not look at the diff :) there are just too many - following a couple of days of fixing & rewriting lead one thing to another, finding new code simplification opportunities and cleanups. Looks like all the tests are passing now, and our production mail server seems happy too. The original approach in URIDNSBL.pm just didn't hold water. Grouping rules under a zone and subrule hashes and launching a single DNS query for these rules lost information on which queried domains belong to each rule and what are their flags, leading to unrelated rules being hit when another rule triggered with the same zone and subrule but a different query domain. The changed approach now pretends to launch a DNS query for each uri rule, but avoids actual duplicate DNS queries by letting a new subroutine AsyncLoop::bgsend_and_start_lookup() take care or grouping equal DNS queries into a single launch, while dealing with individual callbacks behind the scenes. This has the side effect of simplifying several code sections in URIDNSBL.pm, ASN.pm, AskDNS.pm and Dns.pm, without increasing the number of DNS queries (in some cases even reducing it). It also fixes the problem in AskDNS.pm, issuing warnings about 'such lookup has already been issued'. It also revealed several unused code sections, leftovers from the distant past before URIDNSBL.pm and AsyncLoop came to be. While at it, it came naturally to also add a long-desired feature: +=item dns_query_restriction (allow|deny) domain1 domain2 ... + +Option allows disabling of rules which would result in a DNS query to one of +the listed domains. The first argument must be a literal C<allow> or C<deny>, +remaining arguments are domains names. + +Most DNS queries (with some exceptions) are subject to dns_query_restriction. +A domain to be queried is successively stripped-off of its leading labels +(thus yielding a series of its parent domains), and on each iterations a +check is made against an associative array generated by dns_query_restriction +options. Search stops at the first match (i.e. a tightest match), and the +matching entry with its C<allow> or C<deny> value then controls if a DNS query +is allowed to be launched. + +If no match is found an implicit default is to allow all queries. The purpose +of an explicit C<allow> entry is to be able to override a previously configured +C<deny> on the same domain or to override an entry (possibly still to be +configured) on one of its parent domains. Thus an 'allow zen.spamhaus.org' +with a 'deny spamhaus.org' would permit DNS queries on a specific DNS BL zone +but deny queries to other zones under the same parent domain 'spamhaus.org'. + +Domains are matched case-insensitively, no wildcards are recognized, +there should be no leading or trailing dot. + +Specifying blocked domains has a similar effect as setting a score of such +rules to zero, and can be a handy alternative to hunting for such rules +when a site policy does not allow certain DNS block lists to be queried. + +Example: + dns_query_restriction deny dnswl.org surbl.org + dns_query_restriction allow zen.spamhaus.org + dns_query_restriction deny spamhaus.org mailspike.net spamcop.net +=item clear_dns_query_restriction + +The option removes any entries entered by previous 'dns_query_restriction' +options, leaving the list empty, i.e. allowing DNS queries for any domain +(including any DNS BL zone). So here it comes: - a new subroutine AsyncLoop::bgsend_and_start_lookup() implements a commonly used idiom of calling bgsend(), followed by start_lookup(), while implementing a mechanism for grouping equal DNS queries, simplifying several other code sections; - rewrite URIDNSBL.pm to put the new mechanism to good use, fixing the particular Bug 6884 and test 5 in t/uribl.t; - fix the t/uribl.t, moving X_URIBL_FULL_NS from %patterns to %anti_patterns; - add a dns_query_restriction configuration option; - some cosmetic edits, like renaming some variables for consistency. trunk: Sending lib/Mail/SpamAssassin/AsyncLoop.pm Sending lib/Mail/SpamAssassin/Conf.pm Sending lib/Mail/SpamAssassin/Dns.pm Sending lib/Mail/SpamAssassin/DnsResolver.pm Sending lib/Mail/SpamAssassin/Plugin/ASN.pm Sending lib/Mail/SpamAssassin/Plugin/AskDNS.pm Sending lib/Mail/SpamAssassin/Plugin/DNSEval.pm Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm Committed revision 1434360. Please pay attention to any anomalies I may have missed ;)
(In reply to comment #13) > BTW, re: "subroutines lookup_a, lookup_ptr, lookup_mx, and > lookup_mx_exists in Dns.pm are not used anywhere??? Drop? > I would say not to drop them. Too likely we will need those > functions at some point in my opinion. Kevin, I ditched them anyway, sorry :) Seems these subroutines pre-date the bgsend / start_lookup and callbacks through AsyncLoop. If a need for these would re-appear, these would need to be rewritten anyway with the new DNS model. Just left the lookup_ns, which is only needed during startup when an automatic determination of DNS availability is asked for.
forgot to check-in t/uribl.t Sending t/uribl.t Committed revision 1434367.
(In reply to comment #15) > (In reply to comment #13) > > BTW, re: "subroutines lookup_a, lookup_ptr, lookup_mx, and > > lookup_mx_exists in Dns.pm are not used anywhere??? Drop? > > I would say not to drop them. Too likely we will need those > > functions at some point in my opinion. > > Kevin, I ditched them anyway, sorry :) No worries. In working to mark this resolved and I pass disttest with 5.8.6 but I dit NOT pass with 5.17.8 for t/whitelist_addrs.t t/whitelist_addrs.t ............... 11/35 Not found: spamyes = X-Spam-Status: Yes at t/whitelist_addrs.t line 65. # Failed test 11 in t/SATest.pm at line 751 Output can be examined in: log/d.whitelist_addrs/11 t/whitelist_addrs.t ............... 23/35 Not found: spamyes = X-Spam-Status: Yes at t/whitelist_addrs.t line 154. # Failed test 23 in t/SATest.pm at line 751 fail #2 Output can be examined in: log/d.whitelist_addrs/23 t/whitelist_addrs.t ............... 35/35 Not found: spamyes = X-Spam-Status: Yes at t/whitelist_addrs.t line 245. # Failed test 35 in t/SATest.pm at line 751 fail #3 Output can be examined in: log/d.whitelist_addrs/35 t/whitelist_addrs.t ............... Failed 3/35 subtests Thoughts?
(In reply to comment #17) > (In reply to comment #15) > > (In reply to comment #13) > > > BTW, re: "subroutines lookup_a, lookup_ptr, lookup_mx, and > > > lookup_mx_exists in Dns.pm are not used anywhere??? Drop? > > > I would say not to drop them. Too likely we will need those > > > functions at some point in my opinion. > > > > Kevin, I ditched them anyway, sorry :) > > No worries. > > In working to mark this resolved and I pass disttest with 5.8.6 but I dit > NOT pass with 5.17.8 for t/whitelist_addrs.t > > t/whitelist_addrs.t ............... 11/35 Not found: spamyes = > X-Spam-Status: Yes at t/whitelist_addrs.t line 65. > # Failed test 11 in t/SATest.pm at line 751 > Output can be examined in: log/d.whitelist_addrs/11 > t/whitelist_addrs.t ............... 23/35 Not found: spamyes = > X-Spam-Status: Yes at t/whitelist_addrs.t line 154. > # Failed test 23 in t/SATest.pm at line 751 fail #2 > Output can be examined in: log/d.whitelist_addrs/23 > t/whitelist_addrs.t ............... 35/35 Not found: spamyes = > X-Spam-Status: Yes at t/whitelist_addrs.t line 245. > # Failed test 35 in t/SATest.pm at line 751 fail #3 > Output can be examined in: log/d.whitelist_addrs/35 > t/whitelist_addrs.t ............... Failed 3/35 subtests > > Thoughts? Found the issue. Ignore this for a moment...
(In reply to comment #18) > > In working to mark this resolved and I pass disttest with 5.8.6 but I dit > > NOT pass with 5.17.8 for t/whitelist_addrs.t First, marking this ticket resolved. Second, the issue with whitelist_addrs.t is that it requires DB_File module for several of the tests. I added logic to test for DB_File otherwise skip ALL the tests. svn commit -m 'Changes for t/whitelist_addrs.t to check for DB_File - bug 6884' t/ Sending t/whitelist_addrs.t Transmitting file data . Committed revision 1434534.