Bug 2878 - Identify when plain text and HTML are different in multipart/alternative
Summary: Identify when plain text and HTML are different in multipart/alternative
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 2.61
Hardware: All All
Importance: P5 enhancement
Target Milestone: 2.70
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-12-30 10:20 UTC by Kjetil Kjernsmo
Modified: 2004-01-24 11:41 UTC (History)
1 user




Description Kjetil Kjernsmo 2003-12-30 10:20:45 UTC
Recently, I have received a lot of spam with the multipart/alternative MIME
type. There are some random words in the text/plain version and other random
words in the HTML version; the information is mainly contained in a linked
image. 

RFC 1521 (IIRC) says that the contents of the parts in multipart/alternative
should be essentially the same, so it should make a pretty good rule if it were
possible to compare the contents of the plain text and HTML versions to see if
the same words can be found in each. Comments can be ignored, and the words can
be compared. I don't know what kind of algorithms would be used, but surely
something exists for the purpose of comparing texts...?

I'm getting the same spam as in bug #2875, but I'll include a bit more of the
most relevant stuff:

----ALT--TCEF13321957421304
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit

swab companionway bagpipe elephant cucumber regal 
birmingham shuck soothe plethora arrogate phenolic lieu zombie 
cherub denote leland urania basket blight fairfield eat conqueror imposture 

----ALT--TCEF13321957421304
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 8bit

<HTML><HEAD>
<BODY>
<p>Fr</battlefront>ee Ca</courtyard>bleTV!N</histamine>o mo</bovine>re
p</consumptive>ay!&</p>
<a href="http://www.2004hosting.net/cable/">
<img border="0" src="http://www.2004hosting.net/fiter3.jpg"></a>
nature borealis chastity cow debra checkpoint ascribe deferring tabulate
marketeer lob eaton sophistry blockade eyepiece benthic exhibit oatmeal bacon
keen buckwheat champagne turtleback intoxicant defunct crewcut <BR>


Also quite common, and even easier to catch, are cases where the text/plain part is empty.
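The check described here can be sketched in Python (a hypothetical illustration only; SpamAssassin itself is written in Perl, and the function names and the 0.5 threshold below are made up):

```python
import re

def word_set(text):
    """Lowercased word tokens from a text body."""
    return set(re.findall(r"[a-z']+", text.lower()))

def strip_html(html):
    """Crude tag/comment removal; a real rule would reuse a proper
    HTML-to-text renderer."""
    html = re.sub(r"<!--.*?-->", " ", html, flags=re.S)
    return re.sub(r"<[^>]*>", " ", html)

def parts_disagree(plain, html, threshold=0.5):
    """True if fewer than `threshold` of the words in the text/plain
    part also appear in the rendered text/html part."""
    plain_words = word_set(plain)
    html_words = word_set(strip_html(html))
    if not plain_words:
        return True  # an empty text/plain part is itself suspicious
    overlap = len(plain_words & html_words) / len(plain_words)
    return overlap < threshold
```

On the sample above, the plain part's random words ("swab companionway bagpipe ...") share essentially nothing with the HTML part's words, so the check fires; on a legitimate alternative pair the overlap should be close to 1.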
Comment 1 Niels Teglsbo 2004-01-04 10:23:32 UTC
Is SA able to make a text version of the HTML part? It only needs to perform 
well on non-spam.

Then you have the problem of determining if the two texts are different.

That problem can be solved with the Levenshtein distance; it's described at 
http://www.merriampark.com/ld.htm

It tells you how much you have to change to make the texts equal.

The algorithm is O(n^2), so it should probably only be run on the first few kB 
of the mail or something like that.
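For reference, the classic dynamic-programming form of the Levenshtein distance, with the truncation suggested here, might look like this (a hypothetical Python sketch; the 4 kB limit is an arbitrary example, not an SA setting):

```python
def levenshtein(a, b, limit=4096):
    """Classic O(len(a) * len(b)) edit distance; each input is
    truncated to `limit` characters to bound the quadratic cost."""
    a, b = a[:limit], b[:limit]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```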
Comment 2 Kjetil Kjernsmo 2004-01-04 13:40:33 UTC
Thanks for the interesting follow-up! 

I suppose HTML::Strip should be perfect for stripping the HTML from the HTML
part... Probably collapsing all whitespace to a single space is also a good idea.
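The strip-and-collapse step might look like this in Python (a sketch using the stdlib html.parser rather than the Perl HTML::Strip module mentioned here):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data, silently dropping tags and comments."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html):
    """Strip markup and collapse all whitespace runs to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
```

Joining the chunks with no separator reassembles words the spammer broke up with bogus tags, e.g. "Fr&lt;/battlefront&gt;ee" comes back as "Free", which is what the recipient's mail client renders.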

As for the comparing of strings, I wasn't aware of that algorithm, but the link
gave me some keywords, so I googled for "normalized string edit distance", and
came up with several interesting results. Among them, Arslan and Egecioglu:
"Efficient Algorithms For Normalized Edit Distance" (2000),
http://www.cs.ucsb.edu/~omer/DOWNLOADABLE/JDA00.ps

It's not clear to me what the n in your post is, but if I understand the
abstract of the paper correctly (it was all that I read), their algorithm should
be better... :-)

The past week, 90% of the spam that has passed my SMTP rejection score of 13 has
been of this type, so it sure would have been a great addition to SA if we could
get this working. 

BTW, I've noticed something interesting about this spam: their random-word
database is evidently quite small, and contains a bunch of rarely used words,
which is the reason why Bayesian filtering works so well. Such words are rarely
used in any ham, so when they occur frequently in spam, they make the spam easy
to catch. 

However, it is just a matter of time before spammers make a larger database of
words, and while it can never fool a well-trained Bayesian filter, it may make
its signal weaker, so to speak. 
Comment 3 Theo Van Dinter 2004-01-04 17:32:33 UTC
Subject: Re:  Identify when plain text and HTML are different in multipart/alternative

On Sun, Jan 04, 2004 at 11:18:36AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Is SA able to make text version of the HTML part? It only needs to perform good 
> on non-spam.

Sort of.  2.70 has a new MIME parser which will render the HTML into a
text version which will be used for Bayes and the like.

Comment 4 Niels Teglsbo 2004-01-04 17:35:19 UTC
I was a little sloppy when I wrote O(n^2). What I meant was that if you have 
two texts M and N of lengths m and n, then the running time is O(mn); if both 
texts are of almost equal size, it's something like O(nn) = O(n^2).

The Normalized Edit Distance (NED) (as opposed to Edit Distance (ED), aka 
Levenshtein distance) has a best-known running time of O(m*n^2), which is 
worse than the O(m*n) for ED.

I'm not sure NED offers any advantage if you just assign each edit a weight 
of 1.

You could make an edit distance relative to length by just dividing the ED by 
the length of one of the strings or by the average length of both strings.
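The length-relative variant suggested here can be sketched as follows (hypothetical Python; the helper is a plain Levenshtein implementation, and dividing by the average length is just one of the normalizations mentioned):

```python
def edit_distance(a, b):
    """Plain O(len(a) * len(b)) Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def relative_distance(a, b):
    """ED scaled by the average length: 0 means identical, and values
    around 1 mean completely different strings of similar length."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / ((len(a) + len(b)) / 2)
```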
Comment 5 Sidney Markowitz 2004-01-04 18:14:25 UTC
If someone tries this, see if you can identify mail that has something like "If
you are seeing this then your mail program does not display HTML. Enable HTML or
use an HTML capable mail reader to see the actual message" and also see if there
is any non-spam mail that has something like that in the text portion.

I recall seeing some mail like that since I disabled automatically viewing HTML
in my mail, but I don't remember if it was only in spam.
Comment 6 Niels Teglsbo 2004-01-04 18:19:57 UTC
Also things like "This is a multi-part message in MIME format." are quite 
common.

But they are not in the text part; they come before the text part, and they 
will not be shown if the mail program understands MIME.
Comment 7 Justin Mason 2004-01-04 22:53:10 UTC
Subject: Re:  Identify when plain text and HTML are different in multipart/alternative 


The main issue I could see with such a test is mail from some legit
mailers like Apple.com; they use a multipart/alternative message with 
a text/plain part that says "read this issue online at URL" and a
text/html part that contains the full HTML text.

--j.

Comment 8 Justin Mason 2004-01-04 23:06:12 UTC
Subject: Re:  Identify when plain text and HTML are different in multipart/alternative 



>However, it is just a matter of time before spammers make a larger database of
>words, and while it can never fool a well-trained Bayesian filter, it may make
>its signal weaker, so to speak. 

BTW, it's important to note that this is *not* the case.

When a spammer adds random dictionary words to a spam as a bayes-buster,
those words will be quite rare (since people don't generally use *all* the
words in their language very frequently).   So they'll most likely have
never been seen before in the user's training.  Words that are not in the
training database are ignored.  So the bayes poison in that case will have
no effect.

What the spammers *should* be doing is figuring out what each recipient
email address has in its training db, and use that text instead. ;)

--j.

Comment 9 Sidney Markowitz 2004-01-05 00:03:01 UTC
> What the spammers *should* be doing is figuring out what
> each recipient email address has in its training db,
> and use that text instead. ;)

Oh, no, now the spammers are going to know to befriend each of us individually
and send us spam subliminally encoded inside what we think is ordinary
conversation with friends and colleagues. You have just handed them the ultimate
weapon to defeat spam filters! :-)
Comment 10 Kjetil Kjernsmo 2004-01-05 01:51:32 UTC
Subject: Re:  Identify when plain text and HTML are different in multipart/alternative

> >However, it is just a matter of time before spammers make a larger
> > database of words, and while it can never fool a well-trained
> > Bayesian filter, it may make its signal weaker, so to speak.
>
> BTW, it's important to note that this is *not* the case.
>
> When a spammer adds random dictionary words to a spam as a
> bayes-buster, those words will be quite rare (since people don't
> generally use *all* the words in their language very frequently).  
> So they'll most likely have never been seen before in the user's
> training.  Words that are not in the training database are ignored.
>  So the bayes poison in that case will have no effect.

Yeah, that's why I wrote "well-trained": unfortunately, there are 
many sites that do not allow individual users to train their own 
filters, among them my old university. I've seen Bayes filters 
successfully attacked several times, and these are probably what 
spammers are targeting, since there may be a large audience there that 
never trains its own filters. 

There, it is not too hard to guess what words people will use, and most 
words will be in the dictionary. It can probably be overcome by certain 
tricks, because indeed, it doesn't affect those most extreme cases that 
clearly say "spam" or clearly say "ham", but it can flatten the 
distribution function somewhat, which would affect reliability. 

That may be why spammers have rather rare words in their dictionary: the 
idea is that if a word hits, it will hit well. One of their words did, 
but obviously, it didn't help them too much...

> What the spammers *should* be doing is figuring out what each
> recipient email address has in its training db, and use that text
> instead. ;)

Uhm, there's a fine line between openly discussing this and giving them 
ideas here, I suppose. I can imagine ways to do that... :-/ So I think 
there are reasons to work on many fronts... 

Niels: Thanks for the clarification on the efficiency of the algorithm! 
When I added "normalized" to my google search, it was because I figured 
it would be convenient to have a measurement between 0 and 1, and I 
didn't realize what I found was an algorithm that had a slightly 
different purpose. Just doing ED/n would probably satisfy what I was 
looking for... :-) 

Kjetil

Comment 11 Theo Van Dinter 2004-01-24 20:41:05 UTC
I have a rule in 2.70 for this. :) It works quite well.