SA Bugzilla – Bug 1249
spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets
Last modified: 2003-04-05 06:04:30 UTC
spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets and takes the last dotted IP address between angle brackets as the source, for example

Received: from fiann.pair.com
        by office.atmedia.net (fetchmail-4.3.7 POP3)
        for <krusch@localhost> (single-drop); Wed, 04 Dec 2002 20:33:06 CET
Received: (qmail 66096 invoked from network); 4 Dec 2002 18:21:59 -0000
Received: from pro1.gotospeedoffrslist873009118273.com (64.70.20.77)
        by fiann.pair.com with SMTP; 4 Dec 2002 18:21:59 -0000
Received: from [10.0.1.21]
        by pro1.gotospeedoffrslist873009118273.com (10.0.1.31) with QMQP; 04 Dec 2002 10:22:19 +0000

The IP address logged is 10.0.1.31:

debug: AWL active, pre-score: 5.1, mean: 4.67142857142857, originating-ip: 10.0.1.31

This is in multiple places throughout the program, a generic routine to parse Received: lines and allow for different formats would help improve the accuracy. (Currently remote checks are not executed because the IP addresses are not recognized.)
Subject: Re: [SAdev] New: spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets

In message <20021204210509.A77BE276C@belphegore.hughes-family.org> (on 4 December 2002 13:05:09 -0800), bugzilla-daemon@hughes-family.org (bugzilla-daemon@hughes-family.org) wrote:

>http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1249
>
>           Summary: spamassassin does not correctly parse Received: lines
>                    with dotted IP addresses between round brackets
>           Product: Spamassassin
>           Version: 2.43
>          Platform: All
>        OS/Version: All
>            Status: NEW
>          Severity: normal
>          Priority: P2
>         Component: Libraries
>        AssignedTo: spamassassin-devel@lists.sourceforge.net
>        ReportedBy: KlausRusch@atmedia.net
>
>
>spamassassin does not correctly parse Received: lines with dotted IP
>addresses between round brackets and takes the last dotted IP address
>between angle brackets as the source, for example
>
>Received: from fiann.pair.com
>        by office.atmedia.net (fetchmail-4.3.7 POP3)
>        for <krusch@localhost> (single-drop); Wed, 04 Dec 2002 20:33:06 CET
>Received: (qmail 66096 invoked from network); 4 Dec 2002 18:21:59 -0000
>Received: from pro1.gotospeedoffrslist873009118273.com (64.70.20.77)
>        by fiann.pair.com with SMTP; 4 Dec 2002 18:21:59 -0000
>Received: from [10.0.1.21]
>        by pro1.gotospeedoffrslist873009118273.com (10.0.1.31) with QMQP; 04
>Dec 2002 10:22:19 +0000
>
>The IP address logged is 10.0.1.31:
>
>debug: AWL active, pre-score: 5.1, mean: 4.67142857142857, originating-ip:
>10.0.1.31
>
>
>This is in multiple places throughout the program, a generic routine to parse
>Received: lines and allow for different formats would help improve the
>accuracy. (Currently remote checks are not executed because the IP addresses
>are not recognized.)

I suggest that the problem is not so much angle vs [] vs (), but simply that the Received header needs to be separated into the parts before a 'from', between a 'from' and a 'by', and after a 'by'. IP addresses in the last category should be associated with the Received header one line up, _if_ it lacks a Received header.

Unfortunately, in the above case, _neither_ your suggestion nor mine would appear to help, if 10.0.1.21 would still be fed into the AWL. A few things to help fix this:

A. At the minimum, reserved IPs should be rejected unless there are no other IP addresses in the headers.

B. Some mechanism, probably a mixture of configuration options with automated means (hostname of machine being run on + hostname after 'by' in top Received header + their MXes for names; for IP addresses, any in the same /24 as these), needs to be put in place to recognize "trusted" or "normal" MXes and the Received lines they add.

C. If check_for_forged_received_trail returns 1, don't believe IP addresses in any but the top (or top "trusted") Received header for any form of whitelisting (including usage of bondedsender.com).

I've been working on some of these in the process of improving the RBL code. I'll try to post my initial set of patches sometime this week or weekend.

-Allen
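For illustration only, here is a minimal sketch of the from/by split and of point A (prefer non-reserved addresses). It is written in Python rather than SpamAssassin's Perl, and the names split_received, is_reserved and pick_source_ip are hypothetical; none of this is the code that was eventually committed.

```python
import ipaddress
import re

# Dotted-quad matcher; validity of each octet is checked later by ipaddress.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
BY_WORD = re.compile(r"\bby\b", re.IGNORECASE)


def split_received(value):
    """Return (before_from, from_part, by_part) for one Received header value."""
    m_from = re.search(r"\bfrom\b", value, re.IGNORECASE)
    pos = m_from.end() if m_from else 0
    m_by = BY_WORD.search(value, pos)
    end_from = m_by.start() if m_by else len(value)
    before = value[:m_from.start()] if m_from else value[:end_from]
    from_part = value[pos:end_from] if m_from else ""
    by_part = value[m_by.end():] if m_by else ""
    return before.strip(), from_part.strip(), by_part.strip()


def is_reserved(ip_text):
    """True for private, loopback, link-local, reserved or unparsable addresses."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return True  # assumption: an unparsable address is treated as unusable
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def pick_source_ip(received_values):
    """Take IPs from the 'from' part of each header, top to bottom.

    Reserved addresses are used only if no other address is present.
    """
    candidates = []
    for value in received_values:
        _, from_part, _ = split_received(value)
        candidates.extend(IPV4.findall(from_part))
    public = [ip for ip in candidates if not is_reserved(ip)]
    if public:
        return public[0]
    return candidates[0] if candidates else None
```

Fed the four unfolded Received values from the report, pick_source_ip would return 64.70.20.77, since 10.0.1.31 sits after the 'by' and 10.0.1.21 is a reserved address.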
Subject: Re: [SAdev] spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets

> A. At the minimum, reserved IPs should be rejected unless there are no
> other IP addresses in the headers.

yep!

> B. Some mechanism, probably a mixture of configuration options with
> automated means (hostname of machine being run on + hostname after
> 'by' in top Received header + their MXes for names; for IP
> addresses, any in the same /24 as these), needs to be put in place
> to recognize "trusted" or "normal" MXes and the Received lines they
> add.

Now this could be very hard. It would be nice to try to auto-determine it, if possible, or at least not require it.

> C. If check_for_forged_received_trail returns 1, don't believe IP
> addresses in any but the top (or top "trusted") Received header for
> any form of whitelisting (including usage of bondedsender.com).

I'd also be wary of this. check_for_forged_received_trail is very brittle. It hits on all Craig's mail for example, because it's sent from a DSL addr (according to rDNS) but HELOs as hughes-family.org, and this is quite a common setup.

> I've been working on some of these in the process of improving the RBL
> code. I'll try to post my initial set of patches sometime this week or
> weekend.

BTW could you abstract this, so that other tests can use the results? A good Received-line parser is very valuable, and something we don't have. In fact, a new class, with an instance hanging off PerMsgStatus, might be a good idea.

--j.
Subject: Re: [SAdev] spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets

In message <20021205114935.C048916F16@jmason.org> (on 5 December 2002 11:49:30 +0000), jm@jmason.org (Justin Mason) wrote:

>> A. At the minimum, reserved IPs should be rejected unless there are
>> no other IP addresses in the headers.
>
>yep!
>
>> B. Some mechanism, probably a mixture of configuration options with
>> automated means (hostname of machine being run on + hostname after
>> 'by' in top Received header + their MXes for names; for IP
>> addresses, any in the same /24 as these), needs to be put in place
>> to recognize "trusted" or "normal" MXes and the Received lines they
>> add.
>
>Now this could be very hard. It would be nice to try to auto-determine
>it, if possible,

That's what I was thinking with the hostname of the machine being run on + any hostname after 'by' in the top Received header (or the top one that _has_ a full hostname (not 'localhost'), if there's more than one valid Received header) + any MXes of theirs for hostnames (I've put together a routine to do 'is this in the same approximate domain' (as in same last 2 elements for 3+-letter TLDs, same last 3 for 2-letter, unless there aren't that many elements in it) determination); the IP addresses of said hosts plus the surrounding /24 for IP addresses.

> or at least not require it.

Right. Configuration would be a _supplement_ to the above, for cases where, say, mail is being (properly) forwarded by a host that's outside the above (e.g., I have a spamcop.net account).

>> C. If check_for_forged_received_trail returns 1, don't believe IP
>> addresses in any but the top (or top "trusted") Received header for
>> any form of whitelisting (including usage of bondedsender.com).
>
>I'd also be wary of this. check_for_forged_received_trail is very
>brittle. It hits on all Craig's mail for example, because it's sent
>from a DSL addr (according to rDNS) but HELOs as hughes-family.org,
>and this is quite a common setup.

Sigh... yes. I will try to see if there's anything that can be done to improve this when I get a chance; if anyone else wants to investigate before then, be my guest (as well as time deficiencies, I am by no means that familiar with all the different (usually broken) Received header formats)! One place I'd look at is the (open-source for an old version) SpamCop code for checking on this.

I've also been working on more general DNS background lookup code so more tests can be done on DNS data without massive slowdowns - an initial usage with check_for_from_mx looks like it's indeed helping - so using info from that should be more possible. (One problem, incidentally, is that Net::DNS::Resolver creates even bgsend sockets, and then sends using them, in _blocking_ mode, which I've seen create problems with hangs in the past with other code (my RBL evaluation scripts) - fixing this requires essentially replacing bgsend, unfortunately. Another difficulty is with handling CNAMEs... gah!)

Of course, even if Received headers look OK, a spammer _still_ could have added them - especially if the message came in via an open proxy.

>> I've been working on some of these in the process of improving the RBL
>> code. I'll try to post my initial set of patches sometime this week or
>> weekend.
>
>BTW could you abstract this, so that other tests can use the results?

I will do my best; I had already moved a lot of the IP-address-extracting code out of check_rbl and the new check_rbl_group, since it was in common between them.

>A good Received-line parser is very valuable, and something we don't have.

I wouldn't call what I've done a good one by any means - I more was trying to fix problems like reserved IP addresses "crowding out" other addresses (as someone else pointed out) and that the existing code assumed 1 IP address per line.

>In fact, a new class, with an instance hanging off PerMsgStatus, might be
>a good idea.

Yes. Actually, there _is_ a Perl module - not updated for quite a while, though - specifically for that purpose, namely Mail::Field::Received. I don't think making it a requirement would be a good idea, but borrowing some code from it - with proper attribution, of course - looks like a definite possibility. Your thoughts?

-Allen
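The "surrounding /24" idea discussed above can be sketched as follows. This is an illustrative Python fragment, not SpamAssassin code; trusted_networks and is_trusted are hypothetical names, and the addresses in the usage note are documentation-range placeholders.

```python
import ipaddress


def trusted_networks(host_ips, prefix=24):
    """Expand each known-good host address to its surrounding /prefix network."""
    # strict=False lets us pass a host address rather than a network address.
    return [ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in host_ips]


def is_trusted(ip_text, networks):
    """True if the address falls inside any of the trusted networks."""
    ip = ipaddress.ip_address(ip_text)
    return any(ip in net for net in networks)
```

With nets = trusted_networks(["192.0.2.17"]), an address such as 192.0.2.200 counts as trusted while 192.0.3.1 does not. Configured entries, as a supplement to the auto-detected ones, could simply be appended to the same list.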
Subject: Re: [SAdev] spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets

BTW, another tip on checking for forgery -- many of the current spams forge Received lines using either the recipient's domain, or one of the big popular webmail sites. So that means you can cut down the "possible forged Received line" detection to only detect those cases, and it may have better results.

> Actually, there _is_ a Perl module - not updated for quite a while,
> though - specifically for that purpose, namely Mail::Field::Received. I
> don't think making it a requirement would be a good idea, but borrowing some
> code from it - with proper attribution, of course - looks like a definite
> possibility. Your thoughts?

sounds good -- but make sure it's Artistic-licensed first ;)

--j.
Subject: Re: [SAdev] spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets

Incidentally, I'm not currently seeing the FPs from messages from Craig that you've mentioned; I am guessing that this is because the minimum threshold for number of mismatches to trigger the test has been increased since you noticed that, although it might be due to differing mail setups, alterations in how the helo-name is added by Craig's ISP, or whatever.

In message <20021205175457.B824A16F16@jmason.org> (on 5 December 2002 17:54:52 +0000), jm@jmason.org (Justin Mason) wrote:

>BTW, another tip on checking for forgery -- many of the current spams
>forge Received lines using either the recipient's domain, or one of the
>big popular webmail sites. So that means you can cut down the "possible
>forged Received line" detection to only detect those cases, and it may
>have better results.

Point (although in the case of the big webmail sites, there are more-specific tests for whether they've been forged in many cases). Currently, my preliminary results are looking like DNS checks will be necessary for increasing forgery detection without getting lots of FPs, but if DNS checks are timing out or giving other errors, then a good fallback may be assuming it's spam if it matches a webmail site or the recipient's domain (and in the latter case that the IP address isn't in a CIDR block recognized as good/normal, as per earlier discussion).

>> Actually, there _is_ a Perl module - not updated for quite a while,
>> though - specifically for that purpose, namely Mail::Field::Received. I
>> don't think making it a requirement would be a good idea, but borrowing some
>> code from it - with proper attribution, of course - looks like a definite
>> possibility. Your thoughts?
>
>sounds good -- but make sure it's Artistic-licensed first ;)

No problem; from the manpage:

LICENSE
    All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

It's available from CPAN.

-Allen
Created attachment 467: A couple of useful subroutines for telling 'is this a full name' and 'are these in the same overall domain' - guesses, but better than nothing
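The attachment itself is not reproduced in this thread. As a rough guess at what such heuristics might look like, based only on the rule described earlier (same last 2 labels for TLDs of 3+ letters, same last 3 for 2-letter TLDs, fewer if the name is shorter), here is a Python sketch with hypothetical function names:

```python
def _labels(host):
    """Lower-case the name, drop a trailing dot, and split into labels."""
    return host.strip().rstrip(".").lower().split(".")


def is_full_name(host):
    """Guess whether host is a full DNS name rather than 'localhost' or an IP."""
    labels = _labels(host)
    return len(labels) >= 2 and all(labels) and labels[-1].isalpha()


def base_domain(host):
    """Keep the last 3 labels for 2-letter TLDs, otherwise the last 2.

    Slicing returns the whole name when it has fewer labels than that.
    """
    labels = _labels(host)
    keep = 3 if len(labels[-1]) == 2 else 2
    return ".".join(labels[-keep:])


def same_overall_domain(a, b):
    """Heuristic comparison; a guess, not a public-suffix lookup."""
    return base_domain(a) == base_domain(b)
```

Under this rule fiann.pair.com and mail.pair.com compare equal, as do mx1.example.co.uk and www.example.co.uk. The guess is weak for 2-letter TLDs that register names directly at the second level: example.de and mail.example.de would compare as different.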
Assigning to Allen... cuz DNS is his bag. Allen.... move target milestone if necessary.
Hey! We have a dns keyword. Marking DNS bugs with the DNS keyword, setting milestone for 2.50, since we're hoping that Allen's DNS stuff will make it in time.
What I'm currently looking at for Received line parsing is first a set of regexes for the most common cases (have started on these; thanks go to JM for some initial ones, BTW!), ideally including those generated by most common MTAs, followed by what might be described as "tearing the line apart". For the last, there are already some partial versions in the forged-rcvd-trail tests and other places, and I am uncertain whether to take those as the main starting point, use Mail::Field::Received with updates, or use Parse::RecDescent to generate an initial version with manual trimming to remove the Parse::RecDescent dependency. (Most likely at this point is the first, followed by the third. I really don't want to use _just_ the last, for speed and accuracy reasons.)

-Allen
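A two-stage parser of the kind described (known-format regexes first, then a generic fallback) might be sketched like this. The Python below is illustrative only: the two patterns cover just the formats quoted in this report, and KNOWN_FORMATS and parse_received are hypothetical names, not the regexes actually under development.

```python
import re

QUAD = r"\d{1,3}(?:\.\d{1,3}){3}"

# Stage 1: ordered list of (name, pattern) for formats we recognise outright.
KNOWN_FORMATS = [
    # "from host (1.2.3.4) by host2 ...", as in the qmail-style line above
    ("paren-ip", re.compile(
        rf"^from\s+(?P<helo>\S+)\s+\((?P<ip>{QUAD})\)\s+by\s+(?P<by>\S+)", re.I)),
    # "from [1.2.3.4] by host2 ..."
    ("bracket-ip", re.compile(
        rf"^from\s+\[(?P<ip>{QUAD})\]\s+by\s+(?P<by>\S+)", re.I)),
]


def parse_received(value):
    """Return a dict describing one Received header value."""
    value = " ".join(value.split())  # unfold continuation lines
    for name, pattern in KNOWN_FORMATS:
        m = pattern.match(value)
        if m:
            result = m.groupdict()
            result["format"] = name
            return result
    # Stage 2: "tear the line apart": take the first dotted quad that
    # appears before any 'by', ignoring everything after it.
    head = re.split(r"\bby\b", value, maxsplit=1, flags=re.I)[0]
    m = re.search(QUAD, head)
    return {"format": "fallback", "ip": m.group(0) if m else None}
```

Trying the specific patterns first keeps the common case fast and precise; the fallback only has to be good enough for lines no pattern recognises.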
OK, now fixed in 2.60 CVS; we have a new class with knowledge of lots of Received hdr formats.