SA Bugzilla – Bug 5780
[review] URI processing turns uuencoded strings into http URIs which then causes FPs
Last modified: 2011-05-23 20:17:36 UTC
This is pretty much the same as bug 504, but a specific case has shown up that was not fixed by bug 504's fix. Since it has been so long, I'm opening this as a new bug instead of reopening bug 504.

I have a ham email that includes a uuencoded file. Aside from the issue (discussed in the bug 504 comments) that SpamAssassin does not parse uuencodings, the following line of uuencoding:

  MOD_M))%X"M0"-AE/9Y.KN:_0D2F:95^H*:I,8<W6PO;#18AZ-QV9/:K>J..Q

caused the URI code in PerMsgStatus.pm to find the following:

  dbg: rules: ran uri rule WEIRD_PORT ======> got hit: "http://9Y.KN:_0D2F:95"

There was some discussion about this in bug 504, but I think this result shows that what we have settled on is still too liberal. No reasonable MUA would turn plain text like "9Y.KN:_0D2F:95" into a hot link, so neither should SpamAssassin.
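To make the failure mode concrete: any sufficiently liberal host:port matcher will pull a "URI" out of that line. The pattern below is made up for this illustration (it is not the actual PerMsgStatus.pm regexp), but it shows the shape of the problem:

  # Illustrative only; not the real SpamAssassin pattern.
  my $line = 'MOD_M))%X"M0"-AE/9Y.KN:_0D2F:95^H*:I,8<W6PO;#18AZ-QV9/:K>J..Q';
  if ($line =~ /((?:\w+\.)+\w+(?::\w+)*:\d+)/) {
    print "looks like host:port -> $1\n";    # prints 9Y.KN:_0D2F:95
  }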
Created attachment 4235 [details] example email that FPs on WEIRD_PORT
Do we have a good reason for parsing as URIs text that Thunderbird and Outlook Express do not hot link? Mozilla's code for that is in http://lxr.mozilla.org/mozilla/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp. A good place to start understanding it might be the functions ShouldLinkify and mozTXTToHTMLConv::CheckURLAndCreateHTML.

Looking at the Mozilla code and trying things out in Outlook Express 6, I find the following strings being linkified by both:

  anything@anything.anything
  www.anything
  ftp.anything
  http://anything
  https://anything
  ftp://anything
  gopher://anything
  news:anything
  mailto:anything
  nntp://anything

The following are linkified by OE but not by Thunderbird:

  anything@anything
  gopher.anything

The following are linkified by Thunderbird but not by OE:

  http:anything
  ftp:anything
  https:anything
  gopher:anything
  nntp:anything

I think we can ignore gopher, news, and nntp links. And I think that what we are doing now is fine when there is an explicit protocol, e.g. http:// or http: as a prefix to the string. I propose that when there is no protocol we only treat a string as a URI if it begins with 'www.' or 'ftp.' (see the sketch below).
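A minimal sketch of that proposal (the helper here is hypothetical, written just to pin down the intended behaviour, not actual patch code):

  # Hypothetical helper illustrating the proposal: with no explicit
  # scheme, only 'www.' and 'ftp.' prefixes are treated as URIs.
  sub guess_schemeless_uri {
    my ($text) = @_;
    return "http://$text" if $text =~ /^www\./i;
    return "ftp://$text"  if $text =~ /^ftp\./i;
    return undef;    # anything else stays plain text
  }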
(In reply to comment #2)
> I think we can ignore gopher, news, and nntp links. And I think that what we are
> doing now is fine when there is an explicit protocol, e.g. http:// or http: as a
> prefix to the string. I propose that when there is no protocol we only treat a
> string as a URI if it begins with 'www.' or 'ftp.'.

It all comes back to: do we want to try catching things like "goto example.com in your browser"? I agree that our "find things that look like a domain" handling is a bit lax at the moment, and wouldn't mind shoring it up, but I'm not sure if we want to strip the "raw domain" search back out versus making it suck less.
(In reply to comment #3) I guess the answer to that question depends on the relative cost of a false positive versus a false negative in the URI detection. When the URI is used to get a host name for an RBL lookup, we would probably prefer to get false-positive junk that doesn't appear in the RBL rather than miss any evil host names. But for a rule such as WEIRD_PORT, the tradeoff is the opposite. Perhaps the answer is to find a way to keep the extraction of host names for RBL lookups liberal, but make rules that use the full URL stricter.
(In reply to comment #4)
> Perhaps the answer is to find a way to keep the extraction of host names for RBL
> lookups liberal, but make rules that use the full URL stricter.

FWIW, we have some of this via the uri_detail data. At least, we can tell if the URL came from an HTML tag (and which one) versus being parsed out of the text/plain, or both. It doesn't track being parsed w/ protocol versus "likely domain", etc.

uri_detail is already used in the URIDNSBL plugin, for example, to give a different lookup priority to parsed domains versus HTML-referenced domains.
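From a rule or plugin's point of view that looks roughly like this (a sketch; the field names are as documented for get_uri_detail_list(), and $pms is assumed to be the PerMsgStatus object):

  # Sketch: telling HTML-sourced URIs apart from text-parsed ones.
  my $detail = $pms->get_uri_detail_list();
  while (my ($raw_uri, $info) = each %$detail) {
    my $from_html = $info->{types}->{a};         # found in an <a> tag
    my $from_text = $info->{types}->{parsed};    # parsed out of text/plain
    # a strict rule like WEIRD_PORT could require $from_html before firing
  }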
(In reply to comment #5)
> (In reply to comment #4)
> > Perhaps the answer is to find a way to keep the extraction of host names for RBL
> > lookups liberal, but make rules that use the full URL stricter.
>
> FWIW, we have some of this via the uri_detail data. At least, we can tell if the
> URL came from an HTML tag (and which one) versus being parsed out of the
> text/plain, or both. It doesn't track being parsed w/ protocol versus "likely
> domain", etc.
>
> uri_detail is already used in the URIDNSBL plugin, for example, to give a
> different lookup priority to parsed domains versus HTML-referenced domains.

+1 to making WEIRD_PORT stricter so that it doesn't FP in this case. If any changes are made to the URI extraction algorithms, let's ensure we come up with a set of test cases covering what it both should, and shouldn't, extract; test-driven development is the way to do this IMO.
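Something like this, for instance (plain Test::More; uri_found() is a made-up stand-in for the extraction code, defined trivially here just to show the shape of the cases):

  use Test::More tests => 2;

  # Made-up stand-in for the plain-text URI extractor, only so the
  # should/shouldn't cases below are concrete and runnable.
  sub uri_found { my ($t) = @_; return $t =~ /^(?:www\d*|ftp)\.|^[a-z]+:/i }

  ok(  uri_found('www.example.com'), 'www. prefix should be extracted');
  ok( !uri_found('9Y.KN:_0D2F:95'),  'uuencoded garbage should not be');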
Looking at the code in PerMsgStatus.pm I've noticed some bugs. Before making other changes I'll see what the behaviour is like after the bugs are fixed. For example, right now user@ftp.example.com@foo will get parsed as three URIs:

  mailto:ftp://ftp.example.com@foo
  ftp://ftp.example.com@foo
  ftp.example.com@foo

Some ifs need to be changed to elsif, among some other small things. I'm working on that now. I agree with Justin's comment #6 that we should make the desired behaviour explicit in the form of test cases.
> right now user@ftp.example.com@foo

That was a typo. I meant to type 'ftp.example.com@foo'. And I know that is nonsensical, but it illustrates the bug, and the resulting parse is clearly not correct.
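The shape of the bug is roughly this (illustrative code, not the actual PerMsgStatus.pm lines):

  # Illustrative only: because these are independent ifs rather than
  # if/elsif, ftp.example.com@foo matches both branches, and the second
  # branch sees the already-rewritten form.
  push @uris, $uri;                      # ftp.example.com@foo
  if ($uri =~ /^ftp\./i) {
    $uri = "ftp://$uri";
    push @uris, $uri;                    # ftp://ftp.example.com@foo
  }
  if ($uri =~ /\@/) {                    # should be elsif
    push @uris, "mailto:$uri";           # mailto:ftp://ftp.example.com@foo
  }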
The sub that gets URIs from plain text generates just a list of the URIs, not a structure with detail data as is done by the HTML parsing. How about this as a compromise to avoid additional complexity? If we decide that all that is really needed from plain-text URIs that have no scheme is the host name, for RBL lookup, then we can parse such URIs truncated at the end of the host name. Then there will never be any problem with using them in rules such as WEIRD_PORT.
Since the original problem was parsing putative URIs out of plain garbage, how about some sanity checks around the extraction to see if the stuff nearby even seems like words? If there aren't any spaces in the line you pulled a "URI" out of the middle of, then your guess at a URI is probably wrong, for instance. English tends to average around 5 letters per whitespace; German runs rather longer. Other checks might be a fuzzy match for "copy" or "paste" or "browser" somewhere near the putative URI.
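Roughly this kind of check (a rough sketch of the heuristic; the threshold is picked out of thin air):

  # Sketch: treat a line as prose only if it splits into words of
  # plausible average length; uuencoded lines have no spaces at all.
  sub looks_like_prose {
    my ($line) = @_;
    my @words = split ' ', $line;
    return 0 if @words < 2;                       # no whitespace at all
    my $avg = length(join '', @words) / @words;   # mean word length
    return $avg <= 12;
  }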
(In reply to comment #10)
> see if the stuff nearby even seems like words?

That would be difficult to get working right, and it could make it easy to hide a URI from SpamAssassin by surrounding it with artistic garbage that leaves it visually standing out. Also, it's overkill. We want to find hostnames so we can look them up in RBLs. For that, false positives are better than false negatives, because an occasional extra RBL lookup is not a big deal. But we want to avoid false positives for things like the WEIRD_PORT rule that could cause overall FPs. Right now I'm leaning toward a tiny change just to the code that handles a supposed URI that has no scheme and does not begin with www\d*\. or ftp\.: only parse up to the first [:/] character. That will keep most uses of URIs unchanged while avoiding the problems stemming from parsing garbage as links.
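In code, the change I have in mind looks roughly like this (a sketch of the idea, not the final patch):

  # Sketch: for a schemeless candidate that does not start with www\d*.
  # or ftp., keep only the would-be hostname, so no ":port" or path
  # survives to trip rules like WEIRD_PORT.
  unless ($uri =~ m{^[a-z][a-z0-9.+-]*:}i || $uri =~ /^(?:www\d*|ftp)\./i) {
    $uri =~ s/[:\/].*\z//s;    # truncate at the first ':' or '/'
  }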
(In reply to comment #9)
> How about this as a compromise to avoid additional complexity? If we decide
> that all that is really needed from plain-text URIs that have no scheme is
> the host name, for RBL lookup, then we can parse such URIs truncated at the
> end of the host name.

I don't know about that. If OE detects

  type this in your browser: foo.com/bar/baz

as "http://foo.com/bar/baz", then we should too, since we want to be able to apply our existing rules to that format (this includes WEIRD_PORT). Otherwise we create a hole.
It doesn't. It linkifies the http line, but not the first line. If the first line had been www.foo.bar.baz it *would* have linkified it. But not without the www (or ftp., which Sidney suggests; I haven't tested that).
(In reply to comment #12)

In comment #2 I list exactly what OE 6 and Thunderbird linkify. I agree that we should do the full processing, including WEIRD_PORT, on any URL that OE linkifies. What I'm proposing would affect only non-linkified strings. It does create a hole if we want to do more with

  type this in your browser: foo.com/bar/baz

than just an RBL check on foo.com. If there is some rule that we really want to have see foo.com/bar/baz in that specific case, then what I'm proposing will not handle it. Is there such a rule?

I've got some more questions about the existing code for someone who knows more about the original intentions. Is that you or Theo? The code that adds a scheme to the URI pushes both versions of the string, with and without the scheme, onto the list. In your example, both "foo.com/bar/baz" and "http://foo.com/bar/baz" go on the list of URIs. Is that really what is intended, or should it be only "http://foo.com/bar/baz"?

Also, there are separate tests for URIs and email links, but the URI-testing section of the code also looks for an '@' and generates a mailto URI using a somewhat different regexp than the branch of the code that is looking for mail links. That results in two mailto URI strings being pushed on the list, possibly slightly different. Is that a bug?

I've got some proposed code changes, but I want to resolve those issues first before I finalize them.

Another question is how to test changes. This would affect overall performance. I could use help in figuring out from nightly run results what has happened once we get to the point of being ready to test a change.
(In reply to comment #14)
> In comment #2 I list exactly what OE 6 and Thunderbird linkify. I agree that we
> should do the full processing, including WEIRD_PORT, on any URL that OE
> linkifies. What I'm proposing would affect only non-linkified strings. It does
> create a hole if we want to do more with
>
>   type this in your browser: foo.com/bar/baz
>
> than just an RBL check on foo.com. If there is some rule that we really want to
> have see foo.com/bar/baz in that specific case, then what I'm proposing will
> not handle it. Is there such a rule?

Ah, I see what you mean; I got that wrong, assuming that OE was linkifying the "type this in your browser" case (which it doesn't). In that case, +1 to your suggestion at the end of comment #11.

> I've got some more questions about the existing code for someone who knows more
> about the original intentions. Is that you or Theo? The code that adds a scheme
> to the URI pushes both versions of the string, with and without the scheme,
> onto the list. In your example, both "foo.com/bar/baz" and
> "http://foo.com/bar/baz" go on the list of URIs. Is that really what is
> intended, or should it be only "http://foo.com/bar/baz"?

Pushing both is what was intended. It's easier for rule writers that way.

> Also, there are separate tests for URIs and email links, but the URI-testing
> section of the code also looks for an '@' and generates a mailto URI using a
> somewhat different regexp than the branch of the code that is looking for mail
> links. That results in two mailto URI strings being pushed on the list,
> possibly slightly different. Is that a bug?

Yes, that may be. Go by the test cases ;)

> I've got some proposed code changes, but I want to resolve those issues first
> before I finalize them.
>
> Another question is how to test changes. This would affect overall performance.
> I could use help in figuring out from nightly run results what has happened
> once we get to the point of being ready to test a change.

Ah, that side of things can be tricky. If you click on the "[+]" icon in the list of DateRevs on ruleqa.spamassassin.org, it'll give you start and end dates for mass-checks. That may help; but I've found in the past that just benchmarking a mass-check of 1000 messages is good enough.
I'm using this comment as a convenient place to document something that I've found. I'm concerned that the code in PerMsgStatus.pm that parses out email addresses is some fancy stuff from Email::Find that, while elegant, is written to cover the specs in the RFCs rather than what can actually be handled by real-world MTAs and MUAs. For example, it recognizes -AE/9Y.KN@[192.168.93.30] as a mailto URI (although it leaves out the final ']', which probably was not intended).

The criteria used by Thunderbird and Outlook Express for what email addresses to linkify are already quite liberal, and I think they provide a better model for what SpamAssassin should recognize. Thunderbird's lexical scan determines the start of a URI using the following characters:

  ><"'`,{[(|\ space, and for an email URI (any with a @ in it) any non-ASCII character

The end of a URI:

  ><"`}]{[(|
  and (' or non-ASCII in email URIs
  and ) only if a ( was found after the start (not in email URIs)

Also, the following characters are allowed inside a URI but not as the last character, because in that position they are considered part of the plain text of the message:

  .,;!?-'

This will be clearer when I code up test cases. I still need to determine what the edge cases for Outlook Express are, but without the source code I have to do that experimentally.
Typo in the previous comment; it should have said:

  and ) only if a ( was _not_ found after the start (not in email URIs)

E.g., www.(bogus)example.com is all linkified, but ignore(www.example.com)ignore or ignore"www.example.com)ignore linkify just www.example.com
> E.g., www.(bogus)example.com is all linkified, but ignore(www.example.com)ignore
> or ignore"www.example.com)ignore linkify just www.example.com

Nope, not on my version of OE. What is specifically linkified:

  www.(bogus)example.com         linkifies  www.(bogus)example.com
  ignore(www.example.com)ignore  linkifies  www.example.com)ignore
  ignore"www.example.com)ignore  linkifies  www.example.com)ignore

Note in the second one the leading paren is NOT linkified, but the trailing one is, as is the trailing 'ignore' in both cases.
(In reply to comment #18)
> Nope, not on my version of OE

Comment #17 is a continuation of comment #16, which is documenting what I found about Thunderbird. Once I have things well organized from the Thunderbird perspective, it will be easier to figure out what differences in OE have to be taken into account.
> Comment #17 is a continuation of comment #16, which is documenting
> what I found about Thunderbird.

Ah, sorry. The last line of comment #16 confused me:

> I still need to determine what the edge cases for Outlook Express are,
> but without the source code I have to do that experimentally.
Another typo to correct in comment #16: the end delimiters of course should include a space.
Sigh, so many typos, so little time... In comment #16 the list of end delimiters has yet another mistake: it should not include ( because, as it says in the next line, ( is only an end delimiter in email address URIs.
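For reference, with those corrections applied the trailing-character handling amounts to roughly this (an illustrative Perl fragment describing the Thunderbird behaviour; the real implementation is the C++ in mozTXTToHTMLConv.cpp):

  # .,;!?-' are legal inside a URI but read as prose punctuation at its
  # end; ')' terminates the URI only when no '(' occurred inside it.
  $uri =~ s/[.,;!?\-']+\z//;
  $uri =~ s/\)\z// unless $uri =~ /\(/;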
Sidney, if you fancy redoing the email-address-recognition algorithm, scrapping the Email::Find stuff, and using a large set of test cases in a .t script to decide the behaviour (in test-driven-development style), then +1 to that!
(In reply to comment #23) Yes, that's exactly what I'm working on right now.
I would like to solicit an opinion. Among the existing tests in t/uri_text.t there is one with the format

  foo @ example.com

with spaces around the '@', which the test expects to be parsed as mailto:foo@example.com. I know that the relevant RFCs allow the spaces in an email address, but neither Thunderbird nor Outlook Express linkifies this example. Does anyone know if that test is there because of its use in real spam in a "copy and paste this address" type of message, or is it just there to confirm an edge case in the RFCs' spec? If the latter, I think we would be better off not processing such text as a URI.
> Does anyone know if that test is there because of its use in real spam in a
> "copy and paste this address" type of message, or is it just there to confirm
> an edge case in the RFCs' spec? If the latter, I think

someone just thought it'd be a good idea ;)

> I think we would be better off not processing such text as a URI.

+1
While debugging this I am seeing that get_uri_detail_list() gets run twice even though it caches its results in a hash in $self. I see PerMsgStatus->new being called three times, with get_uri_detail_list() being called on the first and third instances. This is happening when I call

  spamassassin -D uri -L < testcase.eml

Is this a bug, or is there a reason for the multiple instances of PerMsgStatus and the lack of effective caching of get_uri_detail_list()?
In SpamAssassin.pm, check() creates a new PerMsgStatus object and get_uri_detail_list() gets called on it. Then in Bayes.pm, learn() creates a second PerMsgStatus object. Then in plugin/Bayes.pm, get_body_from_msg() creates a third, and get_uri_detail_list() gets called on that one too. Does the plugin API not provide access to the already-parsed message body?
(In reply to self, comment #28) OK, I see my question is bug 5291, which bug 5293 was hoped to address but didn't. It would be really nice to have the Bayes plugin not have to create two extra PerMsgStatus objects and rescan the message.
(In reply to comment #29)
> It would be really nice to have the Bayes plugin not have to create two extra
> PerMsgStatus objects and rescan the message.

*definitely*. This is some seriously wasteful code; if it could finally be fixed, that would be great ;)
Created attachment 4248 [details] Patch with complete rewrite of URI parsing from text and new test cases

Committed to trunk as revision 616097. This passed my tests and runs a make test of uri_text.t at the same speed as 3.2, but I would like it if someone could compare scoring and performance with a significantly sized corpus. I've attached the patch to make it easier to review. The patch works against the 3.2 branch, too.

Opinions about whether this should go into the 3.2 branch?
I like it! Also, I'm benchmarking the new code as being slightly faster ;) +1
I've retargeted this to 3.2.5, but I would like comments about that. Do you think this is too extensive a change to backport? The patch does apply cleanly to the 3.2 branch. Another possibility is to use only the one change that truncates a schemeless URI from text after the domain, so there is no :portnum to trigger WEIRD_PORT unless the URI begins with https?: or www\. I'm happy with using the full patch, but I would go along with just the truncation if other people want that. Comments?
Now that I've found and fixed bug 5813, I am more in favor of applying this fix to branch 3.2 and then applying the bug 5813 fix on top of it. So I'm calling for votes here and for bug 5813.
Please see bug 5813 attachment 4258 [details] for the review of this bug and bug 5813.
Changing status to "1 vote needed". See vote for combined patch at comment #9, bug 5813
*** Bug 5836 has been marked as a duplicate of this bug. ***
It seems like this change no longer finds URIs like http://URI, since such a URI doesn't have a valid domain. Is that intentional?
(In reply to comment #38) It's been a while since I coded it, but I don't think that is what I intended. If a string begins with a known scheme, then Thunderbird will make it a hot link; I thought that was the point of $uriknownscheme in the patch. And if it doesn't have a valid TLD, then Firefox at least will tack on a ".com". So I would think that we should treat http://URI the same as http://URI.com.
It might be working OK and I'm just seeing some odd issue. It does seem like URIs without domains don't make it through this chunk of code in _get_parsed_uri_list:

  my @tmp = Mail::SpamAssassin::Util::uri_list_canonify($redirector_patterns, $uri);
  my $goodurifound = 0;
  foreach my $cleanuri (@tmp) {
    my $domain = Mail::SpamAssassin::Util::uri_to_domain($cleanuri);
    if ($domain) {
      # bug 5780: Stop after domain to avoid FP, but do that after all
      # deobfuscation of urlencoding and redirection
      $cleanuri =~ s/^(https?:\/\/[^:\/]+).*$/$1/ if $rblonly;
      push (@uris, $cleanuri);
      $goodurifound = 1;
    }
  }
  next unless $goodurifound;
  push @uris, $rawuri unless $rblonly;

Maybe it gets added later from the HTML URI metadata?
Oh, OK, it's coming back to me. http://URI has to be parsed as a link because the MUA will, and it has to go through deobfuscation after that. But then it needs a TLD if it is going to be checked against an RBL, which is why it gets filtered at that point. Except we probably should (and probably don't now) tack a .com on the end if a browser would, and do that as part of the deobfuscation/redirection list of URIs that are generated to check against the RBLs.
I think the right thing to do is to extend uri_list_canonify to add a URI consisting of the original with .com tacked on the end if the URI is something like https?://(www\.)?[a-z0-9]+ That's just a first approximation, off the top of my head, of roughly what I see Firefox doing. This could go into a new bug report, as it is independent of this patch.
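Something along these lines, perhaps (a hypothetical addition, sketched on the assumption that mimicking Firefox just means appending ".com" when the host part has no dot):

  # Hypothetical sketch, not part of the committed patch: if the host
  # has no dot (hence no TLD), also push a ".com" variant of the URI.
  if ($uri =~ m{^(https?://(?:www\.)?)([a-z0-9]+)((?:[:/].*)?)$}i) {
    push @uris, "$1$2.com$3";    # e.g. http://uri -> http://uri.com
  }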
Closing. This was committed to trunk years before 3.3 was branched, and won't get backported to 3.2. It might be worth opening a new bug for comments 38 through 42: parsing "http://uri" as "http://uri.com" (appending ".com" to URLs lacking a top-level domain). (I did just read all the comments.)
I've lost track of what is going on here. Fine by me for closing.
Everything related to this bug was resolved and committed to trunk before it was branched to 3.3; that includes everything up to comment 37, so it should be closed. The rest of the comments, 38 to 42, are another issue, which may or may not be worth opening an additional bug for.
I have opened Bug 6596 to continue consideration of comments 38 through 43 as a separate issue. Since this bug was being held open only for votes on committing the patch to 3.2, we can give up on backporting to 3.2 and close this.