Bug 504 - URI processing turns base64 strings into http URI's which then match rules
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: Other Linux
Importance: P2 major
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
Reported: 2002-06-25 11:14 UTC by Sidney Markowitz
Modified: 2002-07-29 00:26 UTC
Description Sidney Markowitz 2002-06-25 11:14:12 UTC
I can't attach the specific email that triggered this. If this report isn't
enough to reproduce the bug I can try to come up with a sanitized example.

An email had a BASE64 MIME encoded attachment (of a Word doc file, but that
probably is not relevant) which contained embedded within it the following lines:

M"N\$=P%D``0``````/8)Y Q3:61N97D@=V%S(&$@=')U<W1E9"!M96UB97(@
M;V8@=&AE(&=R;W5P+B @2&4@8V]U;&0@8F4@8V]U;G1E9"!O;B!F;W(@8F]T
M:"!H:7,@=&5C:&YI8V%L(&%B:6QI='D@86YD&P`)`!8`%@`6`!0`#0`<`!8`
M$P`-`!8`#0`+``T`%@`3``L`%@`6``P`(P`5`",`%@`6``P`#0`5``P`# `+
M`!8`%@`.`!8`#0`6`!8`%@`*``X`#@`=`!8`#0`4`!8`%@`)`!8`#@`6`!4`
M#@`4`!8`%@`6``L`%@`6``X`%@`6``T`"P`6``T`#@`6`!4`"P`6``X`%0`)
M`!,`#@`+`!8`% `6`!8`" `4`!8`" `.`!8`%@`(``@`"0`+`!,`#@`6`!8`
M%@`$````+@$```4````4`B!VP:($`````@$!``0````"`0$`!0```!0"'P5W
M`00````N`0$`EP```#(*'P5W`5T`! ``````]@GD#&1I;&EG96YC92!T;R!C
M;VUP;&5T92!T87-K<R!O;B!T:6UE+B @2&4@9G)E<75E;G1L>2!W87,@87-S

I added a dbg statement in PerMsgStatus.pm to show the URIs that were parsed
out for uritest, and it showed that the lines above produced:

debug: Got URI: http://Q3:61N97D@=V%S(&$@=')U2!W87,@87-S

Note that a big chunk between a < and a > got stripped out to make that URI. I
don't think that makes sense as it does not look like a comment.

There were also the lines

M`0$`F@```#(*3P5W`5\`! ``````]@GD#&-H86QL96YG:6YG('1A<VMS(&EN
M('1H92!C;VUP;&5T:6]N(&]F(&-L:65N="!W;W)K+B @2&4@;F5V97(@:&5S
M:71A=&5D('1O('1A8VML92!N97<@<')O8FQE;7,L`!0`%@`6``@`"0`6`!8`
M%@`)`!8`%@`4``L`%@`3`!0`$P`6``D`%@`6``L`%@`6`!4`% `5`",`%@`)
M`!4`"@`)`!8`%@`6`!4`"P`6`!,`"0`)`!8`%0`+`!8`' `6``T`% `+`!4`
M%@`=`!8`%@`6`!4`%0`6``T`%@`6`!8`$P`(``L`%@`+`!8`%@`5``L`%@`6
M``L`%0`4`!,`"0`6`!8`%@`6`!P`%@`6``T`%@`6``D`%0`C`!,`"P`$````
M+@$```4````4`B!VP:($`````@$!``0````"`0$`!0```!0"?P5W`00````N
M`0$`H0```#(*?P5W`60`! ``````]@GD#&5S<&5C:6%L;'D@=&AO<V4@<F5Q
M=6ER:6YG(&YE=R!R97-E87)C:"X@($AE('=A<R!A;'=A>7,@82!T96%M('!L
M87EE<BP@=VEL;&EN9R!T;R!H96QP(&]T:&5R(&UE;6)E<G,6`!,`%@`6`!0`
M"0`6``D`"0`3``P`"P`6`!8`$P`6``T`#0`6`!8`%@`)``P`"0`6`!8`#0`6
M`!8`' `-``T`%@`3`!8`%@`-`!0`%@`+``T`#0`=`!4`#0`<`!8`$P`-`!8`
M"0`<`!8`%0`3``T`%@`-``L`%@`6`"(`#0`6``D`%0`4`!8`#0`*``T`' `)
M``@`" `)`!8`%@`-``L`%0`-`!8`%@`)`!8`#0`5``L`%@`6``T`# `C`!4`
M(@`6`!8`#0`3``0````N`0``!0```!0"(';!H@0````"`0$`! ````(!`0`%
M````% *O!7<!! ```"X!`0!5````,@JO!7<!,0`$``````#V">0,;V8@=&AE

which parsed into

debug: Got URI: http://H86QL96YG:6YG('1A0,;V8@=&AE

Both of these ended up with http:// tacked on to them, which then caused matches
of the rules HTTP_ESCAPED_HOST, HTTP_USERNAME_USED, and WEIRD_PORT because
there were %, @, and : characters in the resulting "URI"s.

This caused perfectly fine mail with a Word attachment to get a high spam score.

I think there is more than one bug here. 1) MIME encoded attachments should not
be parsed for URIs in the rawbody. 2) Something about the embedded <> string
caused the URI parsing to act badly.
Comment 1 Rod Begbie 2002-06-25 12:29:36 UTC
SA is kind-of doing the right thing -- Outlook Express highlighted several 
lines of text in your message as URLs.

Since they are theoretically clickable-upon, it makes sense to check them as 
URLs.

I'm attaching a screenshot so you can see what I mean.
Comment 2 Sidney Markowitz 2002-06-25 13:19:24 UTC
I don't see the screenshot attachment, but I don't think that Outlook Express
finding URLs in my bug report means anything in this case.

My bug report does have URL looking things that begin with http://. But part of
the bug is that SpamAssassin added those http:// strings itself... They were not
in the message body.

And first of all, SA should not be looking for URIs in the rawbody text of MIME
attachments. That should be done in the decoded body.

Secondly, looking at PerMsgStatus.pm I see that SA is trying to guess at
relative URLs. It is true that if a host name like www.foo.com or ftp.bar.edu
appears in the body then some mail software will make it a hot link. SA deals
with that by prefixing http:// to it in order to make it a URI. That seems to be
what is going on here. But H86QL96YG:6YG('1A<lotsofstuff>0,;V8@=&AE does not
look like a www.* or ftp.* host name and should not be treated the same way.

It appears that foo:bar is treated as a URI, www.example.com is turned into
http://www.example.com and is treated as a URI. foo:bar<baz>glug is turned into
http://foo:barglug and is treated as a URI and that is wrong. If the 'foo:' is
treated as a URI scheme, then it definitely should not have http:// tacked on to
the beginning of it. And what sense does it make to remove the embedded <...>
string?

Plus all this is being done to raw BASE64 characters that certainly are not URIs.

I'm afraid that I am not familiar enough with Perl to know how to fix the
relatively complicated code in PerMsgStatus.pm that handles the URIs. My guess
is that the problem is somewhere in do_body_uri_tests, but perhaps it is also in
how it is called.

Comment 3 Evan Prodromou 2002-06-25 13:30:30 UTC
Subject: Re: [SAdev]  URI processing turns base64 strings into http URI's which then match rules

For some reason, I can't access Bugzilla at the moment to post the image, so
it can be seen at http://blazing.arsecandle.org/~rod/urls.png

Comment 4 Sidney Markowitz 2002-06-25 13:49:36 UTC
Ok, I see the screenshot. The strings that Outlook Express is underlining have
nothing to do with the strings that SA thinks are URIs. All this shows is that
Outlook Express can get very confused when handed BASE64 encoded files as plain
text. If anything it just emphasizes that SA's URI parsing should not be done on
the rawbody in the first place. Note that Outlook Express would never see these
strings in this way, it would decode the BASE64 MIME attachment first.
Comment 5 Daniel Rogers 2002-06-25 13:56:22 UTC
Is that even Base64 encoded?  It looks like UUEncoding to me.

Dan.
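
One quick way to check this is to run one of the quoted lines through Perl's built-in uudecoder. The sketch below is only an illustration and is not part of SpamAssassin; the variable names are made up for the example. The leading "M" on each line is the uuencode length character for a full 45-byte line, which is why every line in the report starts with it.

```perl
use strict;
use warnings;

# Second line of the first sample quoted in the description.
my $line = 'M;V8@=&AE(&=R;W5P+B @2&4@8V]U;&0@8F4@8V]U;G1E9"!O;B!F;W(@8F]T';

# unpack with the "u" template decodes a uuencoded string.
my $decoded = unpack('u', $line);

printf "%d bytes: %s\n", length($decoded), $decoded;
```

The decoded bytes are readable English text starting with "of the", which would not be the case if the data were BASE64.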
Comment 6 Sidney Markowitz 2002-06-25 14:19:18 UTC
Duh! Yeah. Is it too late to change the title of the bug report? :-)

Replace every mention of 'BASE64' above with 'uuencoded'. Sorry about that. But
the bug still stands. The Word attachment still was flagged with three different
kinds of spammy URIs that are not there.

Does anyone want to try it out with BASE64 attachments to see if it still happens?
Comment 7 Matt Sergeant 2002-06-26 02:07:45 UTC
Subject: Re: [SAdev]  URI processing turns base64 strings into http
 URI's which then match rules

Right, and therein lies the problem - SA doesn't decode UUencoded stuff 
(yet?).

I've got some code that plugs into my earlier posted mail parser to do 
UUE extraction, but I think both Craig and I discovered that SA needs a 
bit of a re-write in order to make use of a proper mail parser (either 
MIME::Tools based or my parser based).

Matt.

Comment 8 Sidney Markowitz 2002-06-26 02:46:03 UTC
There's still another bug besides not decoding uuencode.

I don't know enough Perl to propose a patch yet, but I think I found the 
offending line of code. In PerMsgStatus.pm, lines 1318 to 1321 in my current 
cvs version, it says

 # Does the uri start with "http://", "mailto:", "javascript:" or
 # such?  If not, we probably need to put the base URI in front
 # of it.
 if ($uri !~ /^[a-z]+:/i) {

But $uri was set using an expression that includes $uriRE, which is defined in 
terms of $schemeRE and $schemelessRE in line 1246. There you can see that the 
scheme can contain non-alphabetic characters before the :. The result of this bug 
is that a URI whose scheme contains non-alphabetic characters gets http:// tacked 
on in front of it.

An approximate fix would be to change the if expression above to

 if ($uri !~ /^[a-zA-Z][a-zA-Z0-9.+\-]*:/)

which better matches the definition of $schemeRE, or even better if this works 
(remember I don't know Perl)

 if ($uri !~ /^$schemeRE:/)
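
For what it's worth, interpolating a compiled pattern like that is valid Perl. Here is a minimal sketch comparing the two checks on truncated prefixes of the strings from the debug output. The $schemeRE definition below is an assumption based on the character class suggested above (the real one in PerMsgStatus.pm may differ), and needs_base_old / needs_base_new are hypothetical helper names.

```perl
use strict;
use warnings;

# Assumption: a scheme is a letter followed by letters, digits, ".", "+" or "-".
# The actual $schemeRE in PerMsgStatus.pm may be defined differently.
my $schemeRE = qr/[a-zA-Z][a-zA-Z0-9.+\-]*/;

# The existing check: true means "no scheme found, prepend the base URI".
sub needs_base_old { my ($uri) = @_; return $uri !~ /^[a-z]+:/i; }

# The proposed check, interpolating the compiled scheme pattern.
sub needs_base_new { my ($uri) = @_; return $uri !~ /^$schemeRE:/; }

for my $uri ('Q3:61N97D@=V%S', 'H86QL96YG:6YG',
             'www.example.com', 'mailto:a@example.com') {
    printf "%-22s old=%d new=%d\n", $uri,
        needs_base_old($uri) ? 1 : 0,
        needs_base_new($uri) ? 1 : 0;
}
```

With the old check, both uuencoded fragments are flagged as needing a base URI, which is where the http:// prefix comes from. With the proposed check they are not, although they would still be treated as URIs with the bogus schemes "Q3" and "H86QL96YG", so this addresses only part of the problem.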

Comment 9 Justin Mason 2002-07-29 08:26:32 UTC
ok, I've locked that down by explicitly listing protocols.
I don't think that's going to cause any more trouble than it
already is ;)
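
A rough sketch of what explicitly listing protocols could look like follows. This is not the committed change: the protocol list and the has_known_scheme helper are assumptions made up for illustration, and the actual list in PerMsgStatus.pm may differ.

```perl
use strict;
use warnings;

# Hypothetical whitelist; the real list in the fix may be different.
my @protocols = qw(http https ftp mailto javascript file);

# Build one case-insensitive alternation from the list.
my $alternation = join '|', map { quotemeta($_) } @protocols;
my $protoRE     = qr/(?:$alternation)/i;

# True only when the string starts with one of the listed protocols.
sub has_known_scheme {
    my ($uri) = @_;
    return $uri =~ /^$protoRE:/ ? 1 : 0;
}

for my $uri ('http://www.example.com', 'mailto:a@example.com',
             'Q3:61N97D@=V%S', 'H86QL96YG:6YG') {
    printf "%-24s known=%d\n", $uri, has_known_scheme($uri);
}
```

Under this approach the uuencoded fragments no longer count as having a scheme at all, since "Q3" and "H86QL96YG" are not in the list.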