Bug 1249 - spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 2.43
Hardware: All All
Importance: P2 normal
Target Milestone: 2.60
Assignee: Allen Smith
URL:
Whiteboard:
Keywords: dns
Depends on:
Blocks:
 
Reported: 2002-12-04 13:05 UTC by Klaus Johannes Rusch
Modified: 2003-04-05 06:04 UTC



Attachment: A couple of useful subroutines for telling 'is this a full name' and 'are these in the same overall domain' - guesses, but better than nothing (text/plain; submitter: Allen Smith; CLA status: NoCLA)

Description Klaus Johannes Rusch 2002-12-04 13:05:09 UTC
spamassassin does not correctly parse Received: lines with dotted IP addresses 
between round brackets and takes the last dotted IP address between angle 
brackets as the source, for example

Received: from fiann.pair.com
        by office.atmedia.net (fetchmail-4.3.7 POP3)
        for <krusch@localhost> (single-drop); Wed, 04 Dec 2002 20:33:06 CET
Received: (qmail 66096 invoked from network); 4 Dec 2002 18:21:59 -0000
Received: from pro1.gotospeedoffrslist873009118273.com (64.70.20.77)
  by fiann.pair.com with SMTP; 4 Dec 2002 18:21:59 -0000
Received: from [10.0.1.21]
        by pro1.gotospeedoffrslist873009118273.com (10.0.1.31) with QMQP;
        04 Dec 2002 10:22:19 +0000

The IP address logged is 10.0.1.31:

debug: AWL active, pre-score: 5.1, mean: 4.67142857142857, originating-ip: 10.0.1.31


This parsing is done in multiple places throughout the program; a generic
routine to parse Received: lines and allow for different formats would help
improve the accuracy. (Currently remote checks are not executed because the
IP addresses are not recognized.)
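
A minimal Python sketch of the kind of generic routine requested above. It assumes, purely for illustration, that a header is split at its 'by' keyword and that only the 'from' segment is searched for the sending IP; the names `split_received` and `sender_ips` are hypothetical and are not SpamAssassin functions.

```python
import re

# Dotted-quad matcher; deliberately loose (does not range-check octets).
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def split_received(header):
    """Split a Received: header into its 'from' and 'by' segments."""
    text = " ".join(header.split())               # unfold continuation lines
    text = re.sub(r"^Received:\s*", "", text, flags=re.I)
    text = text.split(";", 1)[0]                  # drop the trailing date stamp
    by_match = re.search(r"(?:^|\s)by\s", text)
    before = text[:by_match.start()] if by_match else text
    after = text[by_match.end():] if by_match else ""
    from_part = before[5:] if before.startswith("from ") else ""
    return {"from": from_part.strip(), "by": after.strip()}


def sender_ips(header):
    """Return only the IP addresses that appear in the 'from' segment."""
    return IP_RE.findall(split_received(header)["from"])


# The two relevant headers from the report above.
RELAY = ("Received: from pro1.gotospeedoffrslist873009118273.com (64.70.20.77)\n"
         "  by fiann.pair.com with SMTP; 4 Dec 2002 18:21:59 -0000")
INNER = ("Received: from [10.0.1.21]\n"
         "        by pro1.gotospeedoffrslist873009118273.com (10.0.1.31) with QMQP;\n"
         "        04 Dec 2002 10:22:19 +0000")

print(sender_ips(RELAY))
print(sender_ips(INNER))
```

With this split, 10.0.1.31 lands in the 'by' segment of the last header and is no longer mistaken for the sender.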
Comment 1 Allen Smith 2002-12-04 17:00:07 UTC
Subject: Re: [SAdev]  New: spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets

In message <20021204210509.A77BE276C@belphegore.hughes-family.org> (on 4
December 2002 13:05:09 -0800), bugzilla-daemon@hughes-family.org
(bugzilla-daemon@hughes-family.org) wrote:
>http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1249
>
>           Summary: spamassassin does not correctly parse Received: lines
>                    with dotted IP addresses between round brackets
>           Product: Spamassassin
>           Version: 2.43
>          Platform: All
>        OS/Version: All
>            Status: NEW
>          Severity: normal
>          Priority: P2
>         Component: Libraries
>        AssignedTo: spamassassin-devel@lists.sourceforge.net
>        ReportedBy: KlausRusch@atmedia.net
>
>

I suggest that the problem is not so much angle vs [] vs (), but simply that
the Received header needs to be separated into the parts before a 'from',
between a 'from' and a 'by', and after a 'by'. IP addresses in the last
category should be associated with the Received header one line up, _if_ it
lacks a Received header. Unfortunately, in the above case, _neither_ your
suggestion nor mine would appear to help, if 10.0.1.21 would still be fed
into the AWL. A few things to help fix this:
     A. At the minimum, reserved IPs should be rejected unless there are no
        other IP addresses in the headers.
     B. Some mechanism, probably a mixture of configuration options with
        automated means (hostname of machine being run on + hostname after
        'by' in top Received header + their MXes for names; for IP
        addresses, any in the same /24 as these), needs to be put in place
        to recognize "trusted" or "normal" MXes and the Received lines they
        add.
     C. If check_for_forged_received_trail returns 1, don't believe IP
        addresses in any but the top (or top "trusted") Received header for
        any form of whitelisting (including usage of bondedsender.com).
I've been working on some of these in the process of improving the RBL
code. I'll try to post my initial set of patches sometime this week or
weekend.

	-Allen
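
Point A can be sketched in a few lines of Python. This is an illustration under assumptions, not the patch discussed in this comment: `is_reserved` and `usable_ips` are hypothetical names, and "reserved" is approximated with the standard `ipaddress` module's private, loopback, link-local and reserved flags.

```python
import ipaddress


def is_reserved(ip):
    """True for addresses that cannot identify a host on the public Internet."""
    addr = ipaddress.ip_address(ip)
    return (addr.is_private or addr.is_loopback
            or addr.is_link_local or addr.is_reserved)


def usable_ips(ips):
    """Drop reserved addresses unless they are the only ones available."""
    public = [ip for ip in ips if not is_reserved(ip)]
    return public if public else list(ips)


# Addresses from the Received headers in the report.
print(usable_ips(["64.70.20.77", "10.0.1.21", "10.0.1.31"]))
```

The fallback to the unfiltered list mirrors the "unless there are no other IP addresses in the headers" clause.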

Comment 2 Justin Mason 2002-12-05 03:54:56 UTC
Subject: Re: [SAdev]  spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets 


>      A. At the minimum, reserved IPs should be rejected unless there are no
>         other IP addresses in the headers.

yep!

>      B. Some mechanism, probably a mixture of configuration options with
>         automated means (hostname of machine being run on + hostname after
>         'by' in top Received header + their MXes for names; for IP
>         addresses, any in the same /24 as these), needs to be put in place
>         to recognize "trusted" or "normal" MXes and the Received lines they
>         add.

Now this could be very hard.  It would be nice to try to auto-determine
it, if possible, or at least not require it.

>      C. If check_for_forged_received_trail returns 1, don't believe IP
>         addresses in any but the top (or top "trusted") Received header for
>         any form of whitelisting (including usage of bondedsender.com).

I'd also be wary of this.  check_for_forged_received_trail is very
brittle.  It hits on all Craig's mail for example, because it's sent
from a DSL addr (according to rDNS) but HELOs as hughes-family.org,
and this is quite a common setup.

> I've been working on some of these in the process of improving the RBL
> code. I'll try to post my initial set of patches sometime this week or
> weekend.

BTW could you abstract this, so that other tests can use the results?  A
good Received-line parser is very valuable, and something we don't have.
In fact, a new class, with an instance hanging off PerMsgStatus, might be
a good idea.

--j.

Comment 3 Allen Smith 2002-12-05 09:49:28 UTC
Subject: Re: [SAdev]  spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets 

In message <20021205114935.C048916F16@jmason.org> (on 5 December 2002
11:49:30 +0000), jm@jmason.org (Justin Mason) wrote:
>
>>      A. At the minimum, reserved IPs should be rejected unless there are
>>         no other IP addresses in the headers.
>
>yep!
>
>>      B. Some mechanism, probably a mixture of configuration options with
>>         automated means (hostname of machine being run on + hostname after
>>         'by' in top Received header + their MXes for names; for IP
>>         addresses, any in the same /24 as these), needs to be put in place
>>         to recognize "trusted" or "normal" MXes and the Received lines they
>>         add.
>
>Now this could be very hard.  It would be nice to try to auto-determine
>it, if possible,

That's what I was thinking with the hostname of the machine being run on +
any hostname after 'by' in the top Received header (or the top one that
_has_ a full hostname (not 'localhost'), if there's more than one valid
Received header) + any MXes of theirs for hostnames (I've put together a
routine to do 'is this in the same approximate domain' (as in same last 2
elements for 3+-letter TLDs, same last 3 for 2-letter, unless there aren't
that many elements in it) determination); the IP addresses of said hosts
plus the surrounding /24 for IP addresses.
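
One possible reading of the 'same approximate domain' rule just described, as a Python sketch. The function names are hypothetical, and the handling of short names (use every label when there are fewer than the rule asks for) is an assumption about what "unless there aren't that many elements" means.

```python
def overall_domain(hostname):
    """Guess the registrable part of a hostname.

    Keeps the last 3 labels when the TLD has 2 letters (e.g. co.uk style),
    otherwise the last 2. Shorter names are returned whole.
    """
    labels = hostname.lower().rstrip(".").split(".")
    keep = 3 if len(labels[-1]) == 2 else 2
    return ".".join(labels[-keep:])


def same_overall_domain(a, b):
    """True when both hostnames reduce to the same guessed domain."""
    return overall_domain(a) == overall_domain(b)


print(same_overall_domain("fiann.pair.com", "mail.pair.com"))
print(overall_domain("mail.example.co.uk"))
```

As noted, this is a guess: under a 2-letter TLD without a second-level registry, `www.example.de` and `mail.example.de` would not be treated as the same domain.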

> or at least not require it.

Right. Configuration would be a _supplement_ to the above, for cases where,
say, mail is being (properly) forwarded by a host that's outside the above
(e.g., I have a spamcop.net account).
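
The combination of auto-detected /24 blocks and supplementary configuration could look roughly like this in Python. Everything here is a sketch under assumptions: `trusted_networks` and `is_trusted` are hypothetical names, the addresses are documentation-range placeholders, and the extra CIDR stands in for a manually configured forwarder.

```python
import ipaddress


def trusted_networks(known_host_ips, extra_cidrs=()):
    """Build the trusted set: the /24 around each known host, plus configured blocks."""
    nets = {ipaddress.ip_network(f"{ip}/24", strict=False) for ip in known_host_ips}
    nets.update(ipaddress.ip_network(cidr) for cidr in extra_cidrs)
    return nets


def is_trusted(ip, nets):
    """True when the address falls inside any trusted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)


# 192.0.2.10 plays the local MX; 198.51.100.0/24 plays a configured forwarder.
nets = trusted_networks(["192.0.2.10"], extra_cidrs=["198.51.100.0/24"])
print(is_trusted("192.0.2.200", nets))
print(is_trusted("203.0.113.5", nets))
```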

>>      C. If check_for_forged_received_trail returns 1, don't believe IP
>>         addresses in any but the top (or top "trusted") Received header for
>>         any form of whitelisting (including usage of bondedsender.com).
>
>I'd also be wary of this.  check_for_forged_received_trail is very
>brittle.  It hits on all Craig's mail for example, because it's sent
>from a DSL addr (according to rDNS) but HELOs as hughes-family.org,
>and this is quite a common setup.

Sigh... yes. I will try to see if there's anything that can be done to
improve this when I get a chance; if anyone else wants to investigate before
then, be my guest (as well as time deficiencies, I am by no means that
familiar with all the different (usually broken) Received header formats)!
One place I'd look at is the (open-source for an old version) SpamCop code
for checking on this. I've also been working on more general DNS background
lookup code so more tests can be done on DNS data without massive slowdowns
- an initial usage with check_for_from_mx looks like it's indeed helping -
so using info from that should be more possible. (One problem, incidentally,
is that Net::DNS::Resolver creates even bgsend sockets, and then sends using
them, in _blocking_ mode, which I've seen create problems with hangs in the
past with other code (my RBL evaluation scripts) - fixing this requires
essentially replacing bgsend, unfortunately. Another difficulty is with
handling CNAMEs... gah!) Of course, even if Received headers look OK, a
spammer _still_ could have added them - especially if the message came in
via an open proxy.
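
The general idea of background lookups (start the DNS queries early, collect the answers later, and treat anything that has not answered in time as unknown) can be sketched in Python as follows. Note the substitution: this uses a thread pool rather than the non-blocking sockets discussed above, and `start_lookups` and `collect` are hypothetical names. The resolver is passed in as a parameter so the sketch can be exercised without network access.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def start_lookups(names, resolve=socket.gethostbyname, workers=8):
    """Submit one lookup per name and return immediately with the futures."""
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = {name: pool.submit(resolve, name) for name in names}
    pool.shutdown(wait=False)  # already-submitted lookups still run
    return futures


def collect(futures, timeout=5.0):
    """Gather results under one overall deadline; failures become None."""
    deadline = time.monotonic() + timeout
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except (FutureTimeout, OSError):
            results[name] = None  # timed out or resolver error
    return results
```

Other tests can run between `start_lookups` and `collect`, so the DNS latency overlaps with useful work instead of adding to it.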

>> I've been working on some of these in the process of improving the RBL
>> code. I'll try to post my initial set of patches sometime this week or
>> weekend.
>
>BTW could you abstract this, so that other tests can use the results?

I will do my best; I had already moved a lot of the IP-address-extracting
code out of check_rbl and the new check_rbl_group, since it was in common
between them.

>A good Received-line parser is very valuable, and something we don't have.

I wouldn't call what I've done a good one by any means - I was mostly
trying to fix problems like reserved IP addresses "crowding out" other
addresses (as someone else pointed out) and the existing code's assumption
of 1 IP address per line.

>In fact, a new class, with an instance hanging off PerMsgStatus, might be
>a good idea.

Yes. Actually, there _is_ a Perl module - not updated for quite a while,
though - specifically for that purpose, namely Mail::Field::Received. I
don't think making it a requirement would be a good idea, but borrowing some
code from it - with proper attribution, of course - looks like a definite
possibility. Your thoughts?

	-Allen

Comment 4 Justin Mason 2002-12-05 09:55:03 UTC
Subject: Re: [SAdev]  spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets 


BTW, another tip on checking for forgery -- many of the current spams
forge Received lines using either the recipient's domain, or one of the
big popular webmail sites.  So that means you can cut down the "possible
forged Received line" detection to only detect those cases, and it may
have better results.

> Actually, there _is_ a Perl module - not updated for quite a while,
> though - specifically for that purpose, namely Mail::Field::Received. I
> don't think making it a requirement would be a good idea, but borrowing some
> code from it - with proper attribution, of course - looks like a definite
> possibility. Your thoughts?

sounds good -- but make sure it's Artistic-licensed first ;)

--j.

Comment 5 Allen Smith 2002-12-06 08:52:09 UTC
Subject: Re: [SAdev]  spamassassin does not correctly parse Received: lines with dotted IP addresses between round brackets 


Incidentally, I'm not currently seeing the FPs from messages from Craig that
you've mentioned; I am guessing that this is because the minimum threshold
for number of mismatches to trigger the test has been increased since you
noticed that, although it might be due to differing mail setups,
alterations in how the helo-name is added by Craig's ISP, or whatever.

In message <20021205175457.B824A16F16@jmason.org> (on 5 December 2002 17:54:52 +0000), jm@jmason.org (Justin Mason) wrote:
>
>BTW, another tip on checking for forgery -- many of the current spams
>forge Received lines using either the recipient's domain, or one of the
>big popular webmail sites.  So that means you can cut down the "possible
>forged Received line" detection to only detect those cases, and it may
>have better results.

Point (although in the case of the big webmail sites, there are
more-specific tests for whether they've been forged in many
cases). Currently, my preliminary results are looking like DNS checks
will be necessary for increasing forgery detection without getting lots of
FPs, but if DNS checks are timing out or giving other errors, then a good
fallback may be assuming it's spam if it matches a webmail site or the
recipient's domain (and in the latter case that the IP address isn't in a
CIDR block recognized as good/normal, as per earlier discussion).
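
The fallback described here can be written down as a small decision function. This Python sketch is an interpretation, not the eventual implementation: `helo_suspicious` is a hypothetical name, the webmail list is an illustrative placeholder, and `dns_matches` stands for the outcome of whatever DNS check is used (None when it timed out or errored).

```python
# Illustrative placeholder list, not a list taken from SpamAssassin.
WEBMAIL_DOMAINS = {"hotmail.com", "yahoo.com"}


def in_domain(name, domain):
    """True when name is the domain itself or a host under it."""
    return name == domain or name.endswith("." + domain)


def helo_suspicious(helo, recipient_domain, ip_trusted, dns_matches=None):
    """Decide whether a claimed HELO name looks forged."""
    if dns_matches is not None:
        return not dns_matches          # DNS gave an answer: believe it
    helo = helo.lower().rstrip(".")
    # DNS unavailable: fall back to the two commonly forged cases.
    if any(in_domain(helo, d) for d in WEBMAIL_DOMAINS):
        return True
    if in_domain(helo, recipient_domain.lower()) and not ip_trusted:
        return True
    return False


print(helo_suspicious("mail.atmedia.net", "atmedia.net", ip_trusted=False))
print(helo_suspicious("mail.atmedia.net", "atmedia.net", ip_trusted=True))
```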

>> Actually, there _is_ a Perl module - not updated for quite a while,
>> though - specifically for that purpose, namely Mail::Field::Received. I
>> don't think making it a requirement would be a good idea, but borrowing some
>> code from it - with proper attribution, of course - looks like a definite
>> possibility. Your thoughts?
>
>sounds good -- but make sure it's Artistic-licensed first ;)

No problem; from the manpage:

     LICENSE
          All rights reserved.  This program is free software; you can
          redistribute it and/or modify it under the same terms as
          Perl itself.

It's available from CPAN.

  -Allen

Comment 6 Allen Smith 2002-12-07 16:05:56 UTC
Created attachment 467 [details]
A couple of useful subroutines for telling 'is this a full name' and 'are these in the same overall domain' - guesses, but better than nothing
Comment 7 Duncan Findlay 2002-12-24 12:38:44 UTC
Assigning to Allen... cuz DNS is his bag.

Allen.... move target milestone if necessary.
Comment 8 Duncan Findlay 2002-12-24 12:53:58 UTC
Hey! We have a dns keyword. Marking DNS bugs with the DNS keyword, setting
milestone for 2.50, since we're hoping that Allen's DNS stuff will make it in time.
Comment 9 Allen Smith 2002-12-24 14:07:02 UTC
What I'm currently looking at for Received line parsing is first a set of
regexes for the most common cases (have started on these; thanks go to JM for
some initial ones, BTW!), ideally including those generated by the most common
MTAs, followed by what might be described as "tearing the line apart". For the
latter, there are already some partial versions in the forged-rcvd-trail tests
and other places, and I am uncertain whether to take those as the main starting
point, use Mail::Field::Received with updates, or use Parse::RecDescent to
generate an initial version with manual trimming to remove the Parse::RecDescent
dependency. (Most likely at this point is the first, followed by the third. I
really don't want to use _just_ the last, for speed and accuracy reasons.)

	-Allen
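
The two-stage approach (known-format regexes first, then a looser fallback) might be structured like this. It is a Python sketch under assumptions: the two patterns cover only the formats seen in this report, the names `KNOWN_FORMATS` and `parse_received` are hypothetical, and the fallback is a much cruder stand-in for "tearing the line apart".

```python
import re

IP = r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"

# Stage 1: exact patterns for common formats, tried in order.
KNOWN_FORMATS = [
    # from host (1.2.3.4) by host ...
    ("paren-ip", re.compile(
        r"^from\s+(?P<helo>\S+)\s+\(" + IP + r"\)\s+by\s+(?P<by>\S+)")),
    # from [1.2.3.4] by host ...
    ("bracket-ip", re.compile(
        r"^from\s+\[" + IP + r"\]\s+by\s+(?P<by>\S+)")),
]


def parse_received(body):
    """Parse the text after 'Received:'; returns a dict with a 'format' key."""
    text = " ".join(body.split())                 # unfold continuation lines
    for name, pattern in KNOWN_FORMATS:
        match = pattern.match(text)
        if match:
            return {"format": name, **match.groupdict()}
    # Stage 2: take whatever precedes ' by ' and grab the first IP in it.
    head = re.split(r"\sby\s", text, maxsplit=1)[0]
    if not head.startswith("from "):
        return {"format": "unparsed"}
    found = re.search(IP, head)
    return {"format": "fallback", "ip": found.group("ip") if found else None}


print(parse_received(
    "from pro1.gotospeedoffrslist873009118273.com (64.70.20.77)"
    " by fiann.pair.com with SMTP; 4 Dec 2002 18:21:59 -0000"))
```

Keeping the patterns in an ordered list means a new MTA format is one more entry rather than another branch scattered through the tests.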
Comment 10 Justin Mason 2003-04-05 15:04:30 UTC
OK, now fixed in 2.60 CVS; we have a new class with knowledge of lots of
Received hdr formats.