SA Bugzilla – Bug 504
URI processing turns base64 strings into http URI's which then match rules
Last modified: 2002-07-29 00:26:32 UTC
I can't attach the specific email that triggered this. If this report isn't enough to reproduce the bug I can try to come up with a sanitized example. An email had a BASE64 MIME encoded attachment (of a Word doc file, but that probably is not relevant) which contained embedded within it the following lines: M"N\$=P%D``0``````/8)Y Q3:61N97D@=V%S(&$@=')U<W1E9"!M96UB97(@ M;V8@=&AE(&=R;W5P+B @2&4@8V]U;&0@8F4@8V]U;G1E9"!O;B!F;W(@8F]T M:"!H:7,@=&5C:&YI8V%L(&%B:6QI='D@86YD&P`)`!8`%@`6`!0`#0`<`!8` M$P`-`!8`#0`+``T`%@`3``L`%@`6``P`(P`5`",`%@`6``P`#0`5``P`# `+ M`!8`%@`.`!8`#0`6`!8`%@`*``X`#@`=`!8`#0`4`!8`%@`)`!8`#@`6`!4` M#@`4`!8`%@`6``L`%@`6``X`%@`6``T`"P`6``T`#@`6`!4`"P`6``X`%0`) M`!,`#@`+`!8`% `6`!8`" `4`!8`" `.`!8`%@`(``@`"0`+`!,`#@`6`!8` M%@`$````+@$```4````4`B!VP:($`````@$!``0````"`0$`!0```!0"'P5W M`00````N`0$`EP```#(*'P5W`5T`! ``````]@GD#&1I;&EG96YC92!T;R!C M;VUP;&5T92!T87-K<R!O;B!T:6UE+B @2&4@9G)E<75E;G1L>2!W87,@87-S I added a dbg statement in PerMsgStatus.pm to show the uri's that were parsed out for uritest and that indicated that the above produced: debug: Got URI: http://Q3:61N97D@=V%S(&$@=')U2!W87,@87-S Note that a big chunk between a < and a > got stripped out to make that URI. I don't think that makes sense as it does not look like a comment. There were also the lines M`0$`F@```#(*3P5W`5\`! ``````]@GD#&-H86QL96YG:6YG('1A<VMS(&EN M('1H92!C;VUP;&5T:6]N(&]F(&-L:65N="!W;W)K+B @2&4@;F5V97(@:&5S M:71A=&5D('1O('1A8VML92!N97<@<')O8FQE;7,L`!0`%@`6``@`"0`6`!8` M%@`)`!8`%@`4``L`%@`3`!0`$P`6``D`%@`6``L`%@`6`!4`% `5`",`%@`) M`!4`"@`)`!8`%@`6`!4`"P`6`!,`"0`)`!8`%0`+`!8`' `6``T`% `+`!4` M%@`=`!8`%@`6`!4`%0`6``T`%@`6`!8`$P`(``L`%@`+`!8`%@`5``L`%@`6 M``L`%0`4`!,`"0`6`!8`%@`6`!P`%@`6``T`%@`6``D`%0`C`!,`"P`$```` M+@$```4````4`B!VP:($`````@$!``0````"`0$`!0```!0"?P5W`00````N M`0$`H0```#(*?P5W`60`! 
``````]@GD#&5S<&5C:6%L;'D@=&AO<V4@<F5Q M=6ER:6YG(&YE=R!R97-E87)C:"X@($AE('=A<R!A;'=A>7,@82!T96%M('!L M87EE<BP@=VEL;&EN9R!T;R!H96QP(&]T:&5R(&UE;6)E<G,6`!,`%@`6`!0` M"0`6``D`"0`3``P`"P`6`!8`$P`6``T`#0`6`!8`%@`)``P`"0`6`!8`#0`6 M`!8`' `-``T`%@`3`!8`%@`-`!0`%@`+``T`#0`=`!4`#0`<`!8`$P`-`!8` M"0`<`!8`%0`3``T`%@`-``L`%@`6`"(`#0`6``D`%0`4`!8`#0`*``T`' `) M``@`" `)`!8`%@`-``L`%0`-`!8`%@`)`!8`#0`5``L`%@`6``T`# `C`!4` M(@`6`!8`#0`3``0````N`0``!0```!0"(';!H@0````"`0$`! ````(!`0`% M````% *O!7<!! ```"X!`0!5````,@JO!7<!,0`$``````#V">0,;V8@=&AE which parsed into debug: Got URI: http://H86QL96YG:6YG('1A0,;V8@=&AE Both of these ended up with http:// tacked on to them, which then caused matches of the rules HTTP_ESCAPED_HOST, HTTP_USERNAME_USED, and WEIRD_PORT because there were % @ and : characters in the resulting "URI"s. This caused perfectly fine mail with a Word attachment to get a high spam score. I think there is more than one bug here. 1) MIME-encoded attachments should not be parsed for URIs in the rawbody. 2) Something about the embedded <> string caused the URI parsing to misbehave.
SA is kind-of doing the right thing -- Outlook Express highlighted several lines of text in your message as URLs. Since they are theoretically clickable-upon, it makes sense to check them as URLs. I'm attaching a screenshot so you can see what I mean.
I don't see the screenshot attachment, but I don't think that Outlook Express finding URLs in my bug report means anything in this case. My bug report does contain URL-looking things that begin with http://, but part of the bug is that SpamAssassin added those http:// strings itself. They were not in the message body. First of all, SA should not be looking for URIs in the rawbody text of MIME attachments; that should be done in the decoded body. Secondly, looking at PerMsgStatus.pm I see that SA is trying to guess at relative URLs. It is true that if a host name like www.foo.com or ftp.bar.edu appears in the body then some mail software will make it a hot link. SA deals with that by prefixing http:// to it in order to make it a URI. That seems to be what is going on here. But H86QL96YG:6YG('1A<lotsofstuff>0,;V8@=&AE does not look like a www.* or ftp.* host name and should not be treated the same way. It appears that three things happen: foo:bar is treated as a URI as it stands; www.example.com is turned into http://www.example.com and treated as a URI; and foo:bar<baz>glug is turned into http://foo:barglug and treated as a URI, which is wrong. If the 'foo:' is treated as a URI scheme, then it definitely should not have http:// tacked on to the beginning of it. And what sense does it make to remove the embedded <...> string? On top of that, all this is being done to raw BASE64 characters that certainly are not URIs. I'm afraid that I am not familiar enough with Perl to know how to fix the relatively complicated code in PerMsgStatus.pm that handles the URIs. My guess is that the problem is somewhere in do_body_uri_tests, but perhaps it is also in how it is called.
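A minimal Python sketch of the stricter guess argued for above: only prepend http:// when the token actually looks like a www.* or ftp.* host name. The pattern HOSTLIKE and the function guess_uri are hypothetical names for illustration; they are not taken from PerMsgStatus.pm, which is Perl.

```python
import re

# Assumption: only tokens shaped like "www.<domain>" or "ftp.<domain>" count
# as schemeless host names. This pattern is illustrative, not the one SA uses.
HOSTLIKE = re.compile(
    r"^(?:www|ftp)\.[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?$",
    re.IGNORECASE,
)


def guess_uri(token):
    """Return an http:// URI for host-like tokens, otherwise None."""
    if HOSTLIKE.match(token):
        return "http://" + token
    return None
```

Under this guard, www.foo.com and ftp.bar.edu would still get the prefix, while a uuencode fragment such as H86QL96YG:6YG('1A0,;V8@=&AE would be left alone.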
For some reason, I can't access Bugzilla at the moment to post the image, so it can be seen at http://blazing.arsecandle.org/~rod/urls.png
Ok, I see the screenshot. The strings that Outlook Express is underlining have nothing to do with the strings that SA thinks are URIs. All this shows is that Outlook Express can get very confused when handed BASE64 encoded files as plain text. If anything it just emphasizes that SA's URI parsing should not be done on the rawbody in the first place. Note that Outlook Express would never see these strings in this way, it would decode the BASE64 MIME attachment first.
Is that even Base64 encoded? It looks like UUEncoding to me. Dan.
Duh! Yeah. Is it too late to change the title of the bug report? :-) Replace every mention of 'BASE64' above with 'uuencoded'. Sorry about that. But the bug still stands. The Word attachment still was flagged with three different kinds of spammy URIs that are not there. Does anyone want to try it out with BASE64 attachments to see if it still happens?
Right, and therein lies the problem - SA doesn't decode UUencoded stuff (yet?). I've got some code that plugs into my earlier posted mail parser to do UUE extraction, but I think both Craig and I discovered that SA needs a bit of a rewrite in order to make use of a proper mail parser (either MIME::Tools based or my parser based). Matt.
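Until uuencoded parts are decoded properly, one stopgap would be to skip them before scanning for URIs. A minimal Python sketch, assuming the body is available as a list of lines and that blocks are delimited by the usual "begin <mode> <name>" and "end" lines. strip_uuencoded is a hypothetical helper, not part of SA or of the parser mentioned above.

```python
import re

# A uuencoded block starts with "begin <octal mode> <name>" and ends with "end".
BEGIN = re.compile(r"^begin [0-7]{3,4} \S")


def strip_uuencoded(lines):
    """Return the body lines with any uuencoded blocks removed."""
    kept = []
    in_block = False
    for line in lines:
        if not in_block and BEGIN.match(line):
            in_block = True   # drop the "begin" line itself
            continue
        if in_block:
            if line.strip() == "end":
                in_block = False
            continue          # drop data lines and the closing "end"
        kept.append(line)
    return kept
```

Note that this sketch drops everything after an unterminated "begin" line, which may or may not be what a real filter wants.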
There's still another bug besides not decoding uuencode. I don't know enough Perl to propose a patch yet, but I think I found the offending line of code. In PerMsgStatus.pm, lines 1318 to 1321 in my current cvs version, it says

    # Does the uri start with "http://", "mailto:", "javascript:" or
    # such? If not, we probaly need to put the base URI in front
    # of it.
    if ($uri !~ /^[a-z]+:/i) {

But $uri was set using an expression that includes $uriRE, which is defined in terms of $schemeRE and $schemelessRE in line 1246. There you see that the scheme can have non-alphabetic characters before the :. The result of this bug is that a URI whose scheme contains non-alphabetic characters gets an http:// tacked on in front of it. An approximate fix would be to change the if expression above to

    if ($uri !~ /^[a-zA-Z][a-zA-Z0-9.+\-]*:/)

which better matches the definition of $schemeRE, or even better, if this works (remember I don't know Perl),

    if ($uri !~ /^$schemeRE:/)
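To see the difference between the two checks, here is a small Python sketch (for these two patterns the regex syntax is the same in Python and Perl). The names LETTERS_ONLY, PROPOSED and gets_http_prefix are made up for illustration; only the two patterns come from the comment above.

```python
import re

# The check quoted from PerMsgStatus.pm: letters only before the colon.
LETTERS_ONLY = re.compile(r"^[a-z]+:", re.IGNORECASE)
# The proposed check: a letter, then letters, digits, ".", "+" or "-".
PROPOSED = re.compile(r"^[a-zA-Z][a-zA-Z0-9.+\-]*:")


def gets_http_prefix(uri, scheme_check):
    """True if no scheme is recognised, i.e. http:// would be prepended."""
    return scheme_check.match(uri) is None


for sample in ("Q3:61N97D@=V%S", "mailto:user@example.com", "www.example.com"):
    print(sample,
          gets_http_prefix(sample, LETTERS_ONLY),
          gets_http_prefix(sample, PROPOSED))
```

For the uuencode fragment Q3:61N97D@=V%S, the letters-only check says http:// would be prepended, while the proposed check accepts Q3 as a scheme and leaves the string alone. Both checks agree on mailto: (no prefix) and on www.example.com (prefix).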
ok, I've locked that down by explicitly listing protocols. I don't think that's going to cause any more trouble than it already is ;)
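The fix itself is not shown in this report. As a rough idea of what explicitly listing protocols could look like, here is a Python sketch; the scheme list and the name has_known_scheme are assumptions, not the list that was actually committed.

```python
# Assumption: this scheme list is illustrative only; the report does not say
# which protocols the fix lists.
KNOWN_SCHEMES = ("http", "https", "ftp", "mailto", "javascript")


def has_known_scheme(uri):
    """True if the text before the first colon is one of the listed schemes."""
    scheme, sep, _rest = uri.partition(":")
    return sep == ":" and scheme.lower() in KNOWN_SCHEMES
```

With a check like this, a uuencode fragment such as Q3:61N97D@=V%S is no longer mistaken for something with a real scheme, at the cost of ignoring any protocol that is not on the list.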