SA Bugzilla – Bug 2878
Identify when plain text and HTML are different in multipart/alternative
Last modified: 2004-01-24 11:41:05 UTC
Recently, I have received a lot of spam with the multipart/alternative MIME type. There are some random words in the text/plain version and some other random words in the HTML version; the actual information is mainly contained in a linked image.

RFC 1521 (IIRC) says that the contents of the parts in multipart/alternative should be essentially the same, so comparing the contents of the plain text and HTML versions to see whether the same words can be found in each should make a pretty good rule. Comments can be ignored, and the words can be compared. I don't know what kind of algorithm would be used, but surely something exists for the purpose of comparing texts...?

I'm getting the same spam as in bug #2875, but I'll include a bit more of the most relevant stuff:

----ALT--TCEF13321957421304
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit

swab companionway bagpipe elephant cucumber regal birmingham shuck soothe plethora arrogate phenolic lieu zombie cherub denote leland urania basket blight fairfield eat conqueror imposture

----ALT--TCEF13321957421304
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 8bit

<HTML><HEAD>
<BODY>
<p>Fr</battlefront>ee Ca</courtyard>bleTV!N</histamine>o mo</bovine>re p</consumptive>ay!&</p>
<a href="http://www.2004hosting.net/cable/">
<img border="0" src="http://www.2004hosting.net/fiter3.jpg"></a>
nature borealis chastity cow debra checkpoint ascribe deferring tabulate marketeer lob eaton sophistry blockade eyepiece benthic exhibit oatmeal bacon keen buckwheat champagne turtleback intoxicant defunct crewcut
<BR>

Also quite common, and even easier to catch, are cases where text/plain is empty.
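As a rough illustration of the "text/plain is empty" case, here is a minimal sketch in Python (chosen only for illustration; it is not SpamAssassin code). It uses the standard library `email` package to pull out the first text/plain and text/html parts of a message. The function names `alternative_parts` and `plain_part_is_empty` are hypothetical.

```python
from email import message_from_string


def alternative_parts(raw_message):
    """Return (plain, html) text of the first text/plain and text/html parts."""
    msg = message_from_string(raw_message)
    found = {"text/plain": None, "text/html": None}
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype in found and found[ctype] is None:
            # decode=True undoes base64 / quoted-printable and returns bytes
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "us-ascii"
            try:
                found[ctype] = payload.decode(charset, errors="replace")
            except LookupError:
                # unknown charset name in the header: fall back to Latin-1
                found[ctype] = payload.decode("latin-1")
    return found["text/plain"] or "", found["text/html"] or ""


def plain_part_is_empty(raw_message):
    """True when there is HTML content but the plain text part is blank."""
    plain, html = alternative_parts(raw_message)
    return bool(html.strip()) and not plain.strip()
```

A real rule would still have to decide how to treat messages with no text/plain part at all; this sketch treats them the same as an empty one.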
Is SA able to make a text version of the HTML part? It only needs to perform well on non-spam.

Then you have the problem of determining whether the two texts are different. That problem can be solved with the Levenshtein distance, described at http://www.merriampark.com/ld.htm. It tells you how much you have to change to make the texts equal. The algorithm is O(n^2), so it should probably only be run on the first few kB of the mail or something like that.
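For reference, a minimal Python sketch of the Levenshtein distance using the usual two-row dynamic programming table, plus a wrapper that only looks at a prefix of each text. The function names and the 2048-character default for `limit` are assumptions made for illustration; "the first few kB" above does not fix a number.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b. Time O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def prefix_distance(a, b, limit=2048):
    """Edit distance over only the first `limit` characters of each text."""
    return levenshtein(a[:limit], b[:limit])
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.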
Thanks for the interesting follow-up! I suppose HTML::Strip should be well suited to stripping the HTML from the HTML part... Collapsing all whitespace to a single space is probably also a good idea.

As for comparing the strings, I wasn't aware of that algorithm, but the link gave me some keywords, so I googled for "normalized string edit distance" and came up with several interesting results. Among them is Arslan and Egecioglu: "Efficient Algorithms For Normalized Edit Distance" (2000), http://www.cs.ucsb.edu/~omer/DOWNLOADABLE/JDA00.ps It's not clear to me what the n in your post is, but if I understand the abstract of the paper correctly (it was all that I read), their algorithm should be better... :-)

Over the past week, 90% of the spam that has passed my SMTP rejection score of 13 has been of this type, so it would be a great addition to SA if we could get this working.

BTW, I've noticed something interesting about this spam: their random-word database is evidently quite small and contains a bunch of rarely used words, which is why Bayesian filtering works so well. Such words rarely appear in any ham, so when they occur frequently in spam, they make it easy to catch. However, it is just a matter of time before spammers make a larger database of words, and while it can never fool a well-trained Bayesian filter, it may make its signal weaker, so to speak.
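HTML::Strip is a Perl module; the sketch below is not that module but a rough Python stand-in built on the standard library `html.parser`, shown only to illustrate the two steps mentioned above (drop the markup, then collapse whitespace). The names `_TextExtractor` and `html_to_text` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects character data, skipping tags, comments, script and style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse every run of whitespace to one space."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
```

Because bogus closing tags such as `</battlefront>` produce no text, a fragment like `<p>Fr</battlefront>ee Ca</courtyard>bleTV!</p>` comes out as `Free CableTV!`, which is what a reader of the HTML part would actually see.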
Subject: Re: Identify when plain text and HTML are different in multipart/alternative

On Sun, Jan 04, 2004 at 11:18:36AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Is SA able to make text version of the HTML part? It only needs to perform good
> on non-spam.

Sort of. 2.70 has a new MIME parser which will render the HTML into a text version which will be used for Bayes and the like.
I was a little sloppy when I wrote O(n^2). What I meant was that if you have two texts M and N of lengths m and n, then the running time is O(nm); if both texts are of almost equal size, that is something like O(nn) = O(n^2).

The Normalized Edit Distance (NED), as opposed to the Edit Distance (ED), aka Levenshtein distance, has a best known running time of O(m*n^2), which is worse than the O(m*n) for ED. I'm not sure the NED has any advantage if you just assign each edit a weight of 1. You could make an edit distance relative to length by just dividing the ED by the length of one of the strings, or by the average length of both strings.
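A minimal Python sketch of the length-relative edit distance suggested above. As an assumption, it divides by the length of the longer input: the plain edit distance can never exceed that length, so the result always lies between 0 and 1, whereas dividing by the average length can give values up to 2 (for example when one text is empty). The function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences, every edit weight 1."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (item_a != item_b),  # substitution
            ))
        previous = current
    return previous[-1]


def relative_distance(a, b):
    """Edit distance divided by the longer length: 0.0 identical, 1.0 unrelated."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty texts are identical
    return edit_distance(a, b) / longest
```

The inputs can be any sequences, so the same function also works on lists of words, e.g. `relative_distance(plain.split(), html_text.split())`, which keeps m and n much smaller than a character-by-character comparison.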
If someone tries this, see if you can identify mail that has something like "If you are seeing this then your mail program does not display HTML. Enable HTML or use an HTML capable mail reader to see the actual message" and also see if there is any non-spam mail that has something like that in the text portion. I recall seeing some mail like that since I disabled automatically viewing HTML in my mail, but I don't remember if it was only in spam.
Also things like "This is a multi-part message in MIME format." are quite common. But they are not in the text part, they are before the text part, and they will not be shown if the mail program understands MIME.
Subject: Re: Identify when plain text and HTML are different in multipart/alternative

The main issue I could see with such a test is with mails from some legit mailers like Apple.com; they use a multipart/alternative message with a text/plain part that says "read this issue online at URL" and a text/html that contains the full HTML text.

--j.
Subject: Re: Identify when plain text and HTML are different in multipart/alternative

>However, it is just a matter of time before spammers make a larger database of
>words, and while it can never fool a well-trained Bayesian filter, it may make
>its signal weaker, so to speak.

BTW, it's important to note that this is *not* the case.

When a spammer adds random dictionary words to a spam as a bayes-buster, those words will be quite rare (since people don't generally use *all* the words in their language very frequently). So they'll most likely have never been seen before in the user's training. Words that are not in the training database are ignored. So the bayes poison in that case will have no effect.

What the spammers *should* be doing is figuring out what each recipient email address has in its training db, and use that text instead. ;)

--j.
> What the spammers *should* be doing is figuring out what
> each recipient email address has in its training db,
> and use that text instead. ;)

Oh, no, now the spammers are going to know to befriend each of us individually and send us spam subliminally encoded inside what we think is ordinary conversation with friends and colleagues. You have just handed them the ultimate weapon to defeat spam filters! :-)
Subject: Re: Identify when plain text and HTML are different in multipart/alternative

> >However, it is just a matter of time before spammers make a larger
> > database of words, and while it can never fool a well-trained
> > Bayesian filter, it may make its signal weaker, so to speak.
>
> BTW, it's important to note that this is *not* the case.
>
> When a spammer adds random dictionary words to a spam as a
> bayes-buster, those words will be quite rare (since people don't
> generally use *all* the words in their language very frequently).
> So they'll most likely have never been seen before in the user's
> training. Words that are not in the training database are ignored.
> So the bayes poison in that case will have no effect.

Yeah, that's why I wrote "well-trained", since unfortunately there are many sites that do not allow individual users to train their own filters, among them my old university. I've seen Bayes filters being successfully attacked several times, and these sites are probably what spammers are targeting, since there may be a large audience there that never trains its own filters. There, it is not too hard to guess what words people will use, and most words will be in the dictionary. It can probably be overcome by certain tricks, because indeed, it doesn't affect the most extreme cases that clearly say "spam" or clearly say "ham", but it can flatten the distribution function somewhat, which would affect reliability.

It may be why spammers have rather rare words in their dictionary: the idea is that if a word hits, it will hit well. One of their words did, but obviously it didn't help them too much...

> What the spammers *should* be doing is figuring out what each
> recipient email address has in its training db, and use that text
> instead. ;)

Uhm, there's a fine line between discussing openly and giving them ideas here, I suppose. I can imagine ways to do that... :-/

So I think there are reasons to work on many fronts...
Niels: Thanks for the clarification on the efficiency of the algorithm! When I added "normalized" to my Google search, it was because I figured it would be convenient to have a measurement between 0 and 1, and I didn't realize what I found was an algorithm that had a slightly different purpose. Just doing ED/n would probably satisfy what I was looking for... :-)

Kjetil
I have a rule in 2.70 for this. :) works quite well.