Bug 5780 - [review] URI processing turns uuencoded strings into http URI's which then causes FPs
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 normal
Target Milestone: 3.2.6
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 5836
Depends on:
Blocks: 5813
 
Reported: 2008-01-11 20:22 UTC by Sidney Markowitz
Modified: 2011-05-23 20:17 UTC
CC: 3 users



Attachments:
- example email that FPs on WEIRD_PORT (text/plain), Sidney Markowitz [HasCLA]
- Patch with complete rewrite of URI parsing from text and new test cases (patch), Sidney Markowitz [HasCLA]

Description Sidney Markowitz 2008-01-11 20:22:56 UTC
This is pretty much the same as bug 504, but a specific case has shown up that
was not fixed by bug 504's fix. Since it has been so long, I'm opening this as a
new bug instead of reopening bug 504.

I have a ham email that includes a uuencoded file. Aside from the issue
(discussed in the bug 504 comments) that SpamAssassin does not parse uuencodings,
the following line of uuencoding:

  MOD_M))%X"M0"-AE/9Y.KN:_0D2F:95^H*:I,8<W6PO;#18AZ-QV9/:K>J..Q

caused the URI code in PerMsgStatus.pm to find the following:

dbg: rules: ran uri rule WEIRD_PORT ======> got hit: "http://9Y.KN:_0D2F:95"

There was some discussion about this in bug 504, but I think that this result
shows that what we have settled on is still too liberal. No reasonable MUA would
turn plain text like "9Y.KN:_0D2F:95" into a hot link, so neither should
SpamAssassin.
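To make the failure mode concrete, here is a small Python sketch (an editorial illustration; the actual matching is Perl code in PerMsgStatus.pm, and this pattern is only in the spirit of it) showing how any sufficiently liberal host:port pattern matches substrings of uuencoded data:

```python
import re

# A deliberately liberal "host:port"-style pattern, similar in spirit to
# (but not identical to) what the URI code was matching: two dot-separated
# word-character runs followed by a colon and more word/colon characters.
liberal = re.compile(r'[A-Za-z0-9_]+\.[A-Za-z0-9_]+:[A-Za-z0-9_:]+')

uuline = 'MOD_M))%X"M0"-AE/9Y.KN:_0D2F:95^H*:I,8<W6PO;#18AZ-QV9/:K>J..Q'
print(liberal.search(uuline).group(0))  # '9Y.KN:_0D2F:95', the reported hit
```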
Comment 1 Sidney Markowitz 2008-01-11 20:24:42 UTC
Created attachment 4235 [details]
example email that FPs on WEIRD_PORT
Comment 2 Sidney Markowitz 2008-01-17 13:14:49 UTC
Do we have a good reason for parsing as URIs text that Thunderbird and Outlook
Express do not hot link? Mozilla's code for that is in
http://lxr.mozilla.org/mozilla/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp
A good place to start understanding it might be the functions ShouldLinkify and
 mozTXTToHTMLConv::CheckURLAndCreateHTML

Looking at the Mozilla code and trying things out in Outlook Express 6, I find
the following strings being linkified by both:

anything@anything.anything
www.anything
ftp.anything
http://anything
https://anything
ftp://anything
gopher://anything
news:anything
mailto:anything
nntp://anything

The following are linkified by OE but not by Thunderbird:

anything@anything
gopher.anything

The following are linkified by Thunderbird but not by OE

http:anything
ftp:anything
https:anything
gopher:anything
nntp:anything

I think we can ignore gopher, news, and nntp links. And I think that what we are
doing now is fine when there is an explicit protocol, e.g. http:// or http: as a
prefix to the string. I propose that when there is no protocol we only treat a
string as a URI if it begins with 'www.' or 'ftp.'.
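A hedged sketch of the rule proposed above (illustrative Python; the function and pattern names are mine, not SpamAssassin's):

```python
import re

# Illustrative sketch of the proposal: with an explicit scheme, keep current
# behaviour; with no scheme, only treat the string as a URI when it begins
# with "www." or "ftp.".  Names and patterns here are hypothetical.
SCHEME = re.compile(r'^(?:https?|ftp|mailto|news|nntp|gopher):(?://)?', re.I)
SCHEMELESS_OK = re.compile(r'^(?:www|ftp)\.', re.I)

def would_treat_as_uri(s):
    return bool(SCHEME.match(s) or SCHEMELESS_OK.match(s))

print(would_treat_as_uri('www.example.com'))  # True
print(would_treat_as_uri('http://anything'))  # True
print(would_treat_as_uri('9Y.KN:_0D2F:95'))   # False: the FP from this bug
```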
Comment 3 Theo Van Dinter 2008-01-17 13:18:42 UTC
(In reply to comment #2)
> I think we can ignore gopher, news, and nntp links. And I think that what we are
> doing now is fine when there is an explicit protocol, e.g. http:// or http: as a
> prefix to the string. I propose that when there is no protocol we only treat a
> string as a URI if it begins with 'www.' or 'ftp.'.

It all comes back to: do we want to try catching things like "goto example.com in your browser" ?

I agree that our "find things that look like a domain" handling is a bit lax at the moment, and wouldn't 
mind shoring it up, but I'm not sure if we want to strip the "raw domain" search back out versus making  
it suck less.

Comment 4 Sidney Markowitz 2008-01-17 13:37:25 UTC
(In reply to comment #3)

I guess the answer to that question depends on the relative cost of a false
positive versus a false negative in the URI detection. When the URI is used to
get a host name for an RBL lookup, we would probably prefer to get false
positive junk that doesn't appear in the RBL instead of missing any evil host
names. But for a rule such as WEIRD_PORT, the tradeoff is the opposite.

Perhaps the answer is to find a way to keep the extraction of host names for RBL
lookups liberal, but make rules that use the full URL be stricter.
Comment 5 Theo Van Dinter 2008-01-17 13:45:37 UTC
(In reply to comment #4)
> Perhaps the answer is to find a way to keep the extraction of host names for RBL
> lookups liberal, but make rules that use the full URL be stricter.

FWIW, we have some of this by the uri_detail data.  At least, we can tell if the URL came from a HTML 
tag (and which one) versus being parsed out of the text/plain, or both.  It doesn't track being parsed w/ 
protocol versus "likely domain", etc.

uri_detail is already used in the URIDNSBL plugin, for example, to give a different lookup priority to 
parsed domains versus HTML referenced domains.

Comment 6 Justin Mason 2008-01-17 14:11:10 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Perhaps the answer is to find a way to keep the extraction of host names for RBL
> > lookups liberal, but make rules that use the full URL be stricter.
> 
> FWIW, we have some of this by the uri_detail data.  At least, we can tell if the URL came from a HTML
> tag (and which one) versus being parsed out of the text/plain, or both.  It doesn't track being parsed w/
> protocol versus "likely domain", etc.
>
> uri_detail is already used in the URIDNSBL plugin, for example, to give a different lookup priority to
> parsed domains versus HTML referenced domains.

+1 to making WEIRD_PORT stricter so that it doesn't FP in this case.

if there's any changes made to the URI extraction algorithms, let's
ensure we come up with a set of test cases that it both should, and shouldn't
extract; test-driven development is the way to do this IMO.
Comment 7 Sidney Markowitz 2008-01-17 15:16:32 UTC
Looking at code in PerMsgStatus.pm I've noticed some bugs. Before making other
changes I'll see what the behaviour is like after the bugs are fixed. For
example, right now user@ftp.example.com@foo will get parsed as three URIs,
mailto:ftp://ftp.example.com@foo
ftp://ftp.example.com@foo
ftp.example.com@foo

Some ifs need to be changed to elsifs, among other small things.

I'm working on that now. I agree with Justin's comment #6 that we should make
the desired behaviour explicit in the form of test cases.
Comment 8 Sidney Markowitz 2008-01-17 15:32:03 UTC
> right now user@ftp.example.com@foo

That was a typo. I meant to type 'ftp.example.com@foo'. And I know that is
nonsensical, but it illustrates the bug and the resulting parse is clearly not
correct.
Comment 9 Sidney Markowitz 2008-01-17 17:24:31 UTC
The sub that gets URIs from plain text generates just a list of the URIs, not a
structure with detail data like the one produced by the HTML parsing. How about this as
a compromise to avoid additional complexity? If we decide that all that is
really needed in plain text URLs that have no scheme are the host names, for RBL
lookup, then we can parse such URIs truncated at the end of the host name. Then
there will never be any problem with using them in rules such as WEIRD_PORT.
Comment 10 Loren Wilton 2008-01-17 20:00:43 UTC
Since the original problem was parsing putative URIs from plain garbage, how 
about some sanity checks around the extraction to see if the stuff nearby even 
seems like words?  If there aren't any spaces in the line you pulled a "uri" 
out of the middle of, then your guess at a URI is probably wrong, for 
instance.  English tends to average around five letters between spaces; German 
tends to run rather longer.

Other checks might be a fuzzy match for "copy" or "paste" or "browser" 
somewhere near the putative URI.
Comment 11 Sidney Markowitz 2008-01-17 21:07:52 UTC
(reply to comment #10)
> see if the stuff nearby even seems like words?

That would be difficult to get working right and could make it easy to hide a
URI from SpamAssassin by surrounding it with artistic garbage that leaves it
visually standing out. Also, it's overkill. We want to find hostnames so we can
look them up in RBLs. For that false positives are better than false negatives,
because an occasional extra RBL lookup is not a big deal. But we want to avoid
false positives for things like the WEIRD_PORT rule that could cause overall FPs.

Right now I'm leaning toward a tiny change to the code that handles a
supposed URI that has no scheme and does not begin with www\d*\. or ftp\. to
only parse up to the first [:/] character. That will keep most uses of URIs
unchanged while avoiding the problems stemming from parsing garbage as links.
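A sketch of this proposed change (illustrative Python, not the actual Perl patch; the function name is mine):

```python
import re

# Sketch of the change proposed in this comment: a schemeless candidate
# that does not begin with www\d*\. or ftp\. is parsed only up to the
# first ':' or '/', leaving just the host name for RBL lookups and
# nothing for WEIRD_PORT to hit.
def truncate_schemeless(candidate):
    if re.match(r'(?i)^(?:www\d*|ftp)\.', candidate):
        return candidate                                # linkifiable: keep whole
    return re.split(r'[:/]', candidate, maxsplit=1)[0]  # host name only

print(truncate_schemeless('9Y.KN:_0D2F:95'))          # '9Y.KN', no :port left
print(truncate_schemeless('www.example.com:8080/x'))  # kept intact
```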
Comment 12 Justin Mason 2008-01-18 02:45:55 UTC
(In reply to comment #9)
> How about this as
> a compromise to avoid additional complexity? If we decide that all that is
> really needed in plain text URLs that have no scheme are the host names, for RBL
> lookup, then we can parse such URIs truncated at the end of the host name.

I don't know about that.   If OE detects

type this in your browser: foo.com/bar/baz 

as "http://foo.com/bar/baz", then we should too, since we want to be able
to apply our existing rules to that format.  (this includes WEIRD_PORT.)
Otherwise we create a hole.
Comment 13 Loren Wilton 2008-01-18 03:36:19 UTC
It doesn't.  It linkifies the http line, but not the first line.
If the first line had been www.foo.bar.baz it *would* have linkified it.
But not without the www (or ftp, as Sidney suggests; I haven't tested that).
Comment 14 Sidney Markowitz 2008-01-18 06:41:32 UTC
(In reply to comment #12)

In comment #2 I list exactly what OE 6 and Thunderbird linkify. I agree that we
should do the full processing including WEIRD_PORT on any URL that OE linkifies.
What I'm proposing would affect only non-linkified strings. It does create a
hole if we want to do more with

 type this in your browser: foo.com/bar/baz

than just an RBL check on foo.com. If there is some rule that we really want to
have see foo.com/bar/baz in that specific case, then what I'm proposing will not
handle it. Is there such a rule?

I've got some more questions about the existing code for someone who knows more
about the original intentions. Is that you or Theo? The code that adds a scheme
to the URI pushes both versions of the string, with and without the scheme, on
to the list. In your example, both "foo.com/bar/baz" and
"http://foo.com/bar/bax" go on the list of URIs. Is that really what is
intended, or should it be only "http://foo.com/bar/baz"?

Also, there are separate tests for URIs and email links, but the URI testing
section of the code also looks for an '@' and generates a mailto URI using a
somewhat different regexp than the branch of the code that is looking for mail
links. That results in two mailto URI strings being pushed on the list, possibly
slightly different. Is that a bug?

I've got some proposed code changes but I want to resolve those issues first
before I finalize it.

Another question is how to test changes. This would affect overall performance.
I could use help in figuring out from nightly run results what has happened once
we get to the point of being ready to test a change.
Comment 15 Justin Mason 2008-01-18 08:32:38 UTC
(In reply to comment #14)
> (in reply to comment #12)
> 
> In comment #2 I list exactly what OE 6 amd Thunderbird linkify. I agree that we
> should do the full processing including WEIRD_PORT on any URL that OE linkifies.
> What I'm proposing would affect only non-linkified strings. It does create a
> hole if we want to do more with
> 
>  type this in your browser: foo.com/bar/baz
> 
> than just an RBL check on foo.com. If there is some rule that we really want to
> have see foo.com/bar/baz in that specific case, then what I'm proposing will not
> handle it. Is there such a rule?

ah, I see what you mean; I got that wrong, assuming that OE was linkifying
the

   type this in your browser: foo.com

case (which it doesn't).

In that case, +1 to your suggestion at the end of comment 11.

> I've got some more questions about the existing code for someone who knows more
> about the original intentions. Is that you or Theo? The code that adds a scheme
> to the URI pushes both versions of the string, with and without the scheme, on
> to the list. In your example, both "foo.com/bar/baz" and
> "http://foo.com/bar/bax" go on the list of URIs. Is that really what is
> intended, or should it be only "http://foo.com/bar/baz"?

Pushing both is what was intended.  It's easier for rule writers that way.

> Also, there are separate tests for URIs and email links, but the URI testing
> section of the code also looks for an '@' and generates a mailto URI using a
> somewhat different regexp than the branch of the code that is looking for mail
> links. That results in two mailto URI strings being pushed on the list, possibly
> slightly different. Is that a bug?

Yes, that may be.  Go by the test cases ;)

> I've got some proposed code changes but I want to resolve those issues first
> before I finalize it.
> 
> Another question is how to test changes. This would affect overall performance.
> I could use help in figuring out from nightly run results what has happened once
> we get to the point of being ready to test a change.

ah, that side of things can be tricky.  If you click on the "[+]" icon in the
list of DateRevs on ruleqa.spamassassin.org, it'll give you start and end dates
for mass-checks. that may help; but I've found in the past that just
benchmarking a mass-check of 1000 messages is good enough.
Comment 16 Sidney Markowitz 2008-01-19 15:58:43 UTC
I'm using this comment as a convenient place to document something that I've
found: I'm concerned that the code in PerMsgStatus.pm that parses out email
addresses is some fancy stuff from Email::Find that, while elegant, is written to
cover the specs in the RFCs rather than what real-world MTAs and MUAs can
actually handle. For example it recognizes -AE/9Y.KN@[192.168.93.30] as a mailto
URI (although it leaves out the final ']', which probably was not intended). The
criteria used by Thunderbird and Outlook Express for what email addresses to
linkify are already quite liberal and I think provide a better model for what
SpamAssassin should recognize. Thunderbird's lexical scan determines the start
of a URI using the following characters:

 ><"'`,{[(|\ space, and for an email URI (any with a @ in it) any non-ASCII
character

The end of a URI

 ><"`}]{[(|

and (' or non-ASCII in email URIs

and ) only if a ( was found after the start (not in email URIs)

Also, the following characters are allowed inside a URI but not as the last
character, because in that position they are considered part of the plain text
of a message:

 .,;!?-'

This will be clearer when I code up test cases. I still need to determine what
the edge cases for Outlook Express are, but without the source code I have to do
that experimentally.

Comment 17 Sidney Markowitz 2008-01-19 16:20:25 UTC
Typo in the previous comment should have said

and ) only if a ( was _not_ found after the start (not in email URIs)

E.g., www.(bogus)example.com is all linkified but ignore(www.example.com)ignore
or ignore"www.example.com)ignore linkify just www.example.com
Comment 18 Loren Wilton 2008-01-19 22:04:17 UTC
> E.g., www.(bogus)example.com is all linkified but ignore(www.example.com)
> ignore or ignore"www.example.com)ignore linkify just www.example.com

Nope, not on my version of OE.  What is specifically linkified:

www.(bogus)example.com linkifies www.(bogus)example.com 
ignore(www.example.com)ignore linkifies www.example.com)ignore
ignore"www.example.com)ignore linkifies www.example.com)ignore

Note that in the second one the leading paren is NOT linkified, but the trailing 
one is; as is the trailing "ignore" in both cases.

Comment 19 Sidney Markowitz 2008-01-19 22:16:39 UTC
(In reply to comment #18)

> Nope, not on my version of OE

Comment #17 is a continuation of comment #16 which is documenting what I found
about Thunderbird.

Once I have things well organized from the tbird perspective it will be easier
to figure out what differences in OE have to be taken into account.
Comment 20 Loren Wilton 2008-01-19 22:54:12 UTC
> Comment #17 is a continuation of comment #16 which is documenting 
> what I found about Thunderbird.

Ah, sorry.  The last line of comment 16 confused me.

> I still need to determine what the edge cases for Outlook Express are, 
> but without the source code I have to do that experimentally.
Comment 21 Sidney Markowitz 2008-01-19 23:01:31 UTC
Another typo to correct in comment #16, the end delimiters of course should
include a space.
Comment 22 Sidney Markowitz 2008-01-19 23:20:42 UTC
Sigh, so many typos, so little time...

In comment #16 the list of end delimiters has yet another mistake, it should not
include ( because as it says in the next line ( is only an end delimiter in
email address URIs.
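Pulling together comment #16 with the corrections from comments #17, #21, and #22, the Thunderbird-style delimiter rules can be sketched roughly as follows (illustrative Python only; the real logic is C++ in Mozilla's mozTXTToHTMLConv.cpp, and these names are mine):

```python
# Rough sketch of the delimiter rules described in comment #16, with the
# corrections from comments #17, #21, and #22 applied.  Hypothetical names.
START_DELIMS = set('><"\'`,{[(|\\ ')  # plus any non-ASCII char for email URIs
END_DELIMS = set('><"`}]{[| ')        # ')' ends a URI only if no '(' followed
                                      # the start; '(' and "'" also end email URIs
NOT_LAST = set(".,;!?-'")             # allowed inside a URI, but a trailing one
                                      # is treated as message punctuation

def trim_trailing(uri):
    # Strip characters that may appear inside a URI but not at its end.
    while uri and uri[-1] in NOT_LAST:
        uri = uri[:-1]
    return uri

print(trim_trailing('www.example.com.'))   # 'www.example.com'
print(trim_trailing('www.example.com'))    # unchanged
```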
Comment 23 Justin Mason 2008-01-20 03:42:57 UTC
Sidney, if you fancy redoing the email-address-recognition algorithm, scrapping
the Email::Find stuff, and using a large set of test cases in a .t script to
decide (in test-driven development style) -- +1 to that!
Comment 24 Sidney Markowitz 2008-01-20 03:57:36 UTC
(reply to comment #23)

Yes, that's exactly what I'm working on right now.
Comment 25 Sidney Markowitz 2008-01-26 18:46:06 UTC
I would like to solicit an opinion. Looking at the existing tests in
t/uri_text.t there is one with the format

  foo @ example.com

with spaces around the '@', which the test expects to be parsed as
mailto:foo@example.com

I know that the relevant RFCs allow the spaces in an email address, but neither
Thunderbird nor Outlook Express linkify this example.

Does anyone know if that test is there because of its use in real spam in a
"copy and paste this address" type of message, or is it just there as a test
case to confirm an edge case in the RFC's spec? If the latter, I think we would
be better off not processing such text as a URI.
Comment 26 Justin Mason 2008-01-27 14:23:08 UTC
> Does anyone know if that test is there because of its use in real spam in a
> "copy and paste this address" type of message, or is it just there as a test
> case to confirm an edge case in the RFC's spec? If the latter, 

I think someone just thought it'd be a good idea ;)

> I think we would
> be better off not processing such text as a URI.

+1
Comment 27 Sidney Markowitz 2008-01-27 14:38:15 UTC
While debugging this I am seeing that get_uri_detail_list() gets run through
twice even though it caches its results in a hash in $self. I see
PerMsgStatus->new being called three times, with get_uri_detail_list() being
called on the first and third instance.

This is happening when I call spamassassin -D uri -L < testcase.eml

Is this a bug or is there a reason for the multiple instances of PerMsgStatus
and the lack of effective caching of get_uri_detail_list?
Comment 28 Sidney Markowitz 2008-01-27 15:01:54 UTC
In SpamAssassin.pm, check() creates a new PerMsgStatus object and
get_uri_detail_list() gets called on it. Then in Bayes.pm learn() creates
a new PerMsgStatus object. Then in plugin/Bayes.pm get_body_from_msg() creates a
third copy and get_uri_detail_list() gets called on it.

Does the plugin API not provide access to the parsed message body?
Comment 29 Sidney Markowitz 2008-01-27 15:42:48 UTC
(in reply to self, comment #28)

Ok, I see my question is bug 5291, which was hoped to be addressed by bug 5293
but wasn't.

It would be really nice to have the Bayes plugin not have to create two extra
PerMsgStatus objects and rescan the message.
Comment 30 Justin Mason 2008-01-28 01:28:31 UTC
(In reply to comment #29)
> It would be really nice to have the Bayes plugin not have to create two extra
> PerMsgStatus objects and rescan the message.

*definitely*.  This is some seriously wasteful code -- if it could be fixed
finally, that would be great ;)
Comment 31 Sidney Markowitz 2008-01-28 14:54:41 UTC
Created attachment 4248 [details]
Patch with complete rewrite of URI parsing from text and new test cases

Committed to trunk as revision 616097.

This passed my tests and runs a make test of uri_text.t at the same speed as
3.2, but I would like it if someone can compare scoring and performance with a
significantly sized corpus.

I've attached the patch to make it easier to review.

The patch works against branch 3.2, too. Opinions about whether this should go
into the 3.2 branch?
Comment 32 Justin Mason 2008-01-28 15:39:18 UTC
I like it!  also, I'm benchmarking the new code as being slightly faster ;) +1
Comment 33 Sidney Markowitz 2008-02-07 04:22:29 UTC
I've retargeted this to 3.2.5 but I would like comments about that. Do you think
this is too extensive a change to backport? The patch does apply cleanly to 3.2
branch.

Another possibility is to only use the one change that truncates a schemeless
URI from text after the domain, so there is no :portnum to trigger WEIRD_PORT
unless it begins with https?: or www\.

I'm happy with using the full patch, but I would go along with just the
truncation if other people want that.

Comments?
Comment 34 Sidney Markowitz 2008-02-07 19:44:22 UTC
Now that I've found and fixed bug 5813, I am more in favor of applying this fix
to branch 3.2 and then applying the bug 5813 fix on top of it. So I'm calling
for votes here and for bug 5813.
Comment 35 Sidney Markowitz 2008-02-08 14:14:33 UTC
Please see bug 5813 attachment 4258 [details] for the review of this bug and bug 5813.
Comment 36 Sidney Markowitz 2008-02-10 16:23:41 UTC
Changing status to "1 vote needed". See vote for combined patch at comment #9,
bug 5813
Comment 37 Sidney Markowitz 2008-02-25 14:56:14 UTC
*** Bug 5836 has been marked as a duplicate of this bug. ***
Comment 38 Michael Parker 2009-04-07 19:54:30 UTC
It seems like this change no longer finds URIs like:

http://URI

since it doesn't have a valid domain, is that intentional?
Comment 39 Sidney Markowitz 2009-04-07 21:11:38 UTC
(reply to comment #38)
It's been a while since I coded it, but I don't think it is what I intended. If a string begins with a known scheme, then Thunderbird will make it a hot link. I thought that was the point of $uriknownscheme in the patch. If it doesn't have a valid TLD then Firefox at least will tack on a ".com". So I would think that we should treat http://URI the same as http://URI.com.
Comment 40 Michael Parker 2009-04-07 21:29:10 UTC
It might be working OK and I'm just seeing some odd issue.  It does seem like URIs without domains don't make it through this chunk of code in _get_parsed_uri_list:

        my @tmp = Mail::SpamAssassin::Util::uri_list_canonify($redirector_patterns, $uri);
        my $goodurifound = 0;
        foreach my $cleanuri (@tmp) {
          my $domain = Mail::SpamAssassin::Util::uri_to_domain($cleanuri);
          if ($domain) {
            # bug 5780: Stop after domain to avoid FP, but do that after all deobfuscation of urlencoding and redirection
            $cleanuri =~ s/^(https?:\/\/[^:\/]+).*$/$1/ if $rblonly;
            push (@uris, $cleanuri);
            $goodurifound = 1;
          }
        }
        next unless $goodurifound;
        push @uris, $rawuri unless $rblonly;

Maybe it gets added later from the HTML uri metadata?
Comment 41 Sidney Markowitz 2009-04-07 21:37:22 UTC
Oh, ok, it's coming back to me. http://URI has to be parsed as a link because the MUA will, and it has to go through deobfuscation after that. But then it needs a TLD if it is going to be checked with an RBL, which is why it gets filtered at that point. Except we probably should (and probably don't now) tack a .com on the end if a browser would, and do that as part of the deobfuscation/redirection list of URIs that are generated to check with the RBLs.
Comment 42 Sidney Markowitz 2009-04-07 22:22:22 UTC
I think the right thing to do is to extend uri_list_canonify to add a URI of the original with .com tacked on the end if the URI is something like
  https?:\/\/(www\.)?[a-z0-9]+
That's just to a first approximation off the top of my head as about what I see Firefox doing.

This could go into a new bug report as it is independent of this patch.
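A rough sketch of that extension (hypothetical Python; uri_list_canonify itself lives in Mail::SpamAssassin::Util, and the pattern is only the first approximation described above):

```python
import re

# Hypothetical sketch of the proposed uri_list_canonify extension: for a
# URI of roughly the form https?://(www.)?word with no TLD, also emit a
# variant with ".com" appended, approximating what Firefox does.
BARE_HOST = re.compile(r'^(https?://(?:www\.)?[a-z0-9]+)$', re.I)

def canonify_add_com(uri):
    variants = [uri]
    m = BARE_HOST.match(uri)
    if m:
        variants.append(m.group(1) + '.com')
    return variants

print(canonify_add_com('http://uri'))      # ['http://uri', 'http://uri.com']
print(canonify_add_com('http://uri.com'))  # unchanged: already has a TLD
```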


Comment 43 Darxus 2011-05-23 17:20:05 UTC
Close.

Committed to trunk years before 3.3 was branched, and won't get backported to 3.2.

Might be worth opening a new bug for comments 38 - 42:  parsing "http://uri" as "http://uri.com" (appending ".com" to URLs lacking a Top Level Domain).

(I did just read all comments.)
Comment 44 Mark Martinec 2011-05-23 19:02:14 UTC
I lost track of what is going on here. Fine by me for closing.
Comment 45 Darxus 2011-05-23 19:06:31 UTC
Everything related to this bug was resolved and committed to trunk before it was branched to 3.3.  This includes everything up to comment 37.  So it should be closed.

The rest of the comments, 38 to 42, are another issue, which may or may not be worth opening an additional bug.
Comment 46 Sidney Markowitz 2011-05-23 20:17:36 UTC
I have opened Bug 6596 to continue consideration of comments 38 through 43 as a separate issue. Since this issue was being held open only for votes on committing it to 3.2, we can give up on backporting to 3.2 and close this.