SA Bugzilla – Bug 1373
Excessive Commenting in code
Last modified: 2003-05-19 04:06:53 UTC
SpamAssassin should check whether an email contains an excessive number of HTML comments (<!-- SOME TEXT, OR NO TEXT -->). Spammers often use them to keep spam filters such as SpamAssassin from correctly detecting spam. There is no reason to use a comment in HTML mail anyway, since it's never hand-edited and isn't processed by a server (SSI tags and such), so it's a pretty good indication of spam.
Comments in general, or comments in the middle of words? Isn't the latter already OBFUSCATING_COMMENT? If the former, please attach a sample.
I thought he meant comments in general, but now I'm not quite so sure. If he did... This rule might be usable if we create a meta and "and" it with CTYPE_JUST_HTML. Actually, now that I think about it, a lot of our failed HTML rules might work well enough to use if we "and" them with CTYPE_JUST_HTML.
I meant in general, since that catches both. There is no reason for comments in HTML mail, since it isn't hand edited, nor does it go through a parser. It's limited to being a spammer technique.
Subject: Re: [SAdev] Excessive Commenting in code
robert@accettura.com writes:
> I meant in general, since that catches both. There is no reason for
> comments in HTML mail, since it isn't hand edited, nor does it go
> through a parser. It's limited to being a spammer technique.
You have to be careful. It's difficult to tell the difference between a legitimate attached HTML file and an HTML part that's the HTML version of text (I suppose you could compare them, but we're not set up for that). Also, some legitimate newsletters use HTML editors as well as hand-edited HTML and contain mark-up not found in your average HTML email, which does, in fact, sometimes include comments. That's why I said it would be necessary to require CTYPE_JUST_HTML (and even then, it might not work well enough for general use due to newsletters and such).
Actually, I just did a quick hand test: 63 of the 335 messages containing text/html in my spam corpus contain HTML comments.
But newsletters tend to be sent out by majordomo, auto-whitelisted, etc., so the typical legit-newsletter rules should catch them (I'm researching whether there are more to be added). If those rules are accurate enough (and they should get better), the comments could get a small point value without really harming legitimate mail, no? I think it's worth less than 1.0 (at least at this time). Perhaps 0.2.
> You have to be careful. It's difficult to tell the difference between a
> legitimate attached HTML file and an HTML part that's the HTML version
> of text (I suppose you could compare them, but we're not set up for
> that.)
Didn't realize that. Coming to 3.0?
Subject: Re: [SAdev] Excessive Commenting in code
> I meant in general, since that catches both. There is no reason for
> comments in HTML mail, since it isn't hand edited, nor does it go
> through a parser. It's limited to being a spammer technique.
Data point: actually, MS Word documents saved as HTML do contain comments (Word stores conditional statements in there).
I grepped my nonspam corpus for HTML comments and found:
- Several messages from friends that appear to be composed in MS Word (WordMail for Outlook probably) and have a bunch of comments (metadata, not human-specified comments)
- Daily Dilbert newsletters from unitedmedia.com, which use comments around a script
- A GoDaddy.com domain renewal notice that has the entire text repeated in an HTML comment for some strange reason
- A Register.com renewal notice with a "saved from url" comment
- An HTML newsletter from a software company that has the text version in an HTML comment rather than a MIME part
- An HTML newsletter from an online store with some kind of tracking information in a comment
- A message sent to a YahooGroups list via their online interface, containing a blank comment for some reason
...so I'm not sure how useful this test will be.
I've got various obviously script-generated newsletters containing many comments. I'd say WONTFIX.
Subject: Re: [SAdev] Excessive Commenting in code
> I've got various obviously script-generated newsletters containing many
> comments. I'd say WONTFIX.
It's possible some spam has insanely high percentages of comments, like 90% commenting. Maybe worth a try after 2.50 is out. Do something like this: in html_comment, keep a counter like:
    $self->{html}{comment_length} += length($text) + 7;  # 7 = "<!--" + "-->"
then do:
    if ($self->{html}{non_uri_len}) {
        $self->{html}{comment_ratio} = $self->{html}{comment_length} / $self->{html}{non_uri_len};
    }
then a range test in 10% increments, etc.
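A standalone sketch of the counter-and-ratio idea may make it easier to try outside SpamAssassin. It is only an illustration: comment_ratio and ratio_bucket are hypothetical names, and because non_uri_len is not defined in this thread, the total message length is used as a stand-in denominator.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of $html taken up by HTML comments.
# Assumption: total length is the denominator, standing in for the
# non_uri_len value mentioned in the suggestion above.
sub comment_ratio {
    my ($html) = @_;
    my $total = length $html;
    return 0 unless $total;

    my $comment_length = 0;
    # Each match covers a whole comment, so the 7 delimiter characters
    # ("<!--" and "-->") are counted together with the comment text.
    while ($html =~ /(<!--.*?-->)/sg) {
        $comment_length += length $1;
    }
    return $comment_length / $total;
}

# Hypothetical helper: map a ratio in [0, 1] to a 10% range label
# in the same style as the rule names, e.g. "70_80".
sub ratio_bucket {
    my ($ratio) = @_;
    my $low = int($ratio * 10) * 10;
    $low = 90 if $low > 90;    # a ratio of exactly 1.0 belongs to 90_100
    return sprintf '%02d_%d', $low, $low + 10;
}
```

For example, a 16-character body whose only comment is the 8-character "<!--a-->" has a ratio of 0.5 and lands in the 50_60 range.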
I've been analysing mail by hand in three categories:
- Spam
- Newsletters/mass legit mailings
- Normal mail
What I have found is that spammers tend to put random data in comments between words, as well as in what looks to me like totally random places. I guess it's to obscure the code from easy view. The ones that do so tend to use massive quantities, so that comments appear to make up more than 40% of the code (just an estimate). It tends to be all out, or none at all.
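The "comments between words" pattern can be illustrated with a short sketch. This is not how OBFUSCATING_COMMENT is implemented; both function names are hypothetical, and "word" is assumed to mean a run of ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical helper: count comments that sit directly between two
# letters, i.e. comments that split a word such as "fr<!--x-->ee".
sub count_word_splitting_comments {
    my ($html) = @_;
    my $count = 0;
    # (?:(?!-->).)* keeps the match inside a single comment. The lookahead
    # leaves the following letter unconsumed, so that letter can also be
    # the one that precedes the next comment.
    $count++ while $html =~ /[A-Za-z]<!--(?:(?!-->).)*-->(?=[A-Za-z])/sg;
    return $count;
}

# Hypothetical helper: remove all comments to see the text as rendered.
sub strip_comments {
    my ($html) = @_;
    (my $visible = $html) =~ s/<!--.*?-->//sg;
    return $visible;
}
```

A comment that follows a tag or whitespace, as in the legitimate newsletters described earlier, is not counted by this check.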
OK, I've put this in testing for 2.60. It looks pretty good for me, but I don't get a lot of HTML mail so ... The results I got were:

Just the ratios, not looking for MIME_HTML_ONLY:

OVERALL%    SPAM%  NONSPAM%    S/O  RANK  SCORE  NAME
   4.059   6.0792    0.2261  0.964  0.81   1.00  __HTML_COMMENT_RATIO_00_10
   1.558   2.3703    0.0174  0.993  0.89   1.00  __HTML_COMMENT_RATIO_10_20
   0.402   0.6052    0.0174  0.972  0.82   1.00  __HTML_COMMENT_RATIO_20_30
   0.600   0.9123    0.0087  0.991  0.88   1.00  __HTML_COMMENT_RATIO_30_40
   0.189   0.2888    0.0000  1.000  0.90   1.00  __HTML_COMMENT_RATIO_40_50
   0.183   0.2797    0.0000  1.000  0.90   1.00  __HTML_COMMENT_RATIO_50_60
   0.165   0.2476    0.0087  0.966  0.80   1.00  __HTML_COMMENT_RATIO_60_70
   0.315   0.4814    0.0000  1.000  0.91   1.00  __HTML_COMMENT_RATIO_70_80
   0.201   0.3072    0.0000  1.000  0.90   1.00  __HTML_COMMENT_RATIO_80_90
   0.000   0.0000    0.0000  0.500  0.00   1.00  __HTML_COMMENT_RATIO_90_100

Making it a meta with MIME_HTML_ONLY ...

OVERALL%    SPAM%  NONSPAM%    S/O  RANK  SCORE  NAME
   2.426   3.6860    0.0348  0.991  0.88   0.01  T_HTML_COMMENT_RATIO_00_10
   1.225   1.8705    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_10_20
   0.357   0.5456    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_20_30
   0.564   0.8619    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_30_40
   0.177   0.2705    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_40_50
   0.159   0.2430    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_50_60
   0.156   0.2384    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_60_70
   0.306   0.4676    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_70_80
   0.195   0.2980    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_80_90
   0.000   0.0000    0.0000  0.500  0.00   0.01  T_HTML_COMMENT_RATIO_90_100
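For readers unfamiliar with these result lines, the fourth number (the S/O ratio) appears to be the spam hit rate divided by the sum of the spam and nonspam hit rates, with 0.500 shown when a rule hits nothing. The sketch below reproduces that arithmetic for two rows; s_o_ratio is a hypothetical name and the formula is inferred from the figures, not taken from the tool's source.

```perl
use strict;
use warnings;

# Hypothetical helper: S/O ratio inferred from the result lines as
# spam% / (spam% + nonspam%), falling back to 0.5 when there are no hits.
sub s_o_ratio {
    my ($spam_pct, $nonspam_pct) = @_;
    my $sum = $spam_pct + $nonspam_pct;
    return 0.5 if $sum == 0;
    return $spam_pct / $sum;
}

printf "%.3f\n", s_o_ratio(6.0792, 0.2261);    # __HTML_COMMENT_RATIO_00_10, prints 0.964
printf "%.3f\n", s_o_ratio(0.0000, 0.0000);    # __HTML_COMMENT_RATIO_90_100, prints 0.500
```

A value near 1.000 means nearly every hit was spam, so the 0.500 on the 90_100 rows says nothing either way: those rules simply never fired.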
Subject: Re: [SAdev] Excessive Commenting in code
felicity@kluge.net wrote:
> Just the ratios, not looking for MIME_HTML_ONLY:
How about without MIME_HTML_ONLY? That rule is so effective (and has a high spam hit rate) that I've started worrying it's way too easy to rely on it as an FP reduction tool. I have no data to back this up, though. :-) Looking at HTML rules with __MIME_HTML or HTML_MESSAGE seems pretty safe to me, though.
Subject: Re: [SAdev] Excessive Commenting in code
On Sun, Mar 02, 2003 at 08:36:24PM -0800, bugzilla-daemon@hughes-family.org wrote:
> How about without MIME_HTML_ONLY? That rule is so effective (and has a
> high spam hit rate) that I've started worrying way too easy to rely on
> it as a FP reduction tool. I have no data to back this up, though. :-)
I'm not sure what you're asking. Are you asking for the comment ratio results with the meta, or for "HTML_COMMENT_RATIO... && !MIME_HTML_ONLY"? If the former, that was posted. If the latter, I don't know, but we could try it if you think it would be a useful set of tests.
Subject: Re: [SAdev] Excessive Commenting in code
bugzilla-daemon@hughes-family.org writes:
> I'm not sure what you're asking. Are you asking for the comment ratio
> results with the meta, or for "HTML_COMMENT_RATIO... && !MIME_HTML_ONLY"?
>
> If the former, that was posted. If the latter, I don't know, but we
> could try it if you think it would be a useful set of tests.
I was mostly saying I'd rather not have this meta test require MIME_HTML_ONLY if it works well enough with HTML_MESSAGE or __MIME_HTML. It might be interesting to see results for (HTML_COMMENT_RATIO... && __MIME_HTML) and (HTML_COMMENT_RATIO... && HTML_MESSAGE) compared with the (HTML_COMMENT_RATIO... && MIME_HTML_ONLY) ones.
Subject: Re: [SAdev] Excessive Commenting in code
On Sun, Mar 02, 2003 at 09:38:08PM -0800, bugzilla-daemon@hughes-family.org wrote:
> It might be interesting to see results for (HTML_COMMENT_RATIO... &&
> __MIME_HTML) and (HTML_COMMENT_RATIO... && HTML_MESSAGE) compared with
> the (HTML_COMMENT_RATIO... && MIME_HTML_ONLY) ones.
Below are my results, sorted by rule name. MIME_HTML_ONLY produces the best S/O ratios at 0.991 (0-10) or 1.0 (10-100) while catching 5.57% of all messages. HTML_MESSAGE has S/O ratios ranging from 0.963 to 1 and catches 7.65% of all messages. __MIME_HTML has S/O ratios of 0.962 to 1 and catches 7.42% of all messages. So I still like MIME_HTML_ONLY: it catches fewer messages overall, but is more accurate. I've committed the new rules for testing in a larger arena. :)

OVERALL%    SPAM%  NONSPAM%    S/O  RANK  SCORE  NAME
   4.069   6.0889    0.2343  0.963  0.80   0.01  T_HTML_COMMENT_RATIO_00_10_HTML_MESSAGE
   2.431   3.6935    0.0347  0.991  0.88   0.01  T_HTML_COMMENT_RATIO_00_10_MIME_HTML_ONLY
   4.009   5.9974    0.2343  0.962  0.80   0.01  T_HTML_COMMENT_RATIO_00_10___MIME_HTML
   1.551   2.3587    0.0174  0.993  0.89   0.01  T_HTML_COMMENT_RATIO_10_20_HTML_MESSAGE
   1.222   1.8651    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_10_20_MIME_HTML_ONLY
   1.362   2.0708    0.0174  0.992  0.88   0.01  T_HTML_COMMENT_RATIO_10_20___MIME_HTML
   0.401   0.6034    0.0174  0.972  0.82   0.01  T_HTML_COMMENT_RATIO_20_30_HTML_MESSAGE
   0.356   0.5440    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_20_30_MIME_HTML_ONLY
   0.398   0.5988    0.0174  0.972  0.82   0.01  T_HTML_COMMENT_RATIO_20_30___MIME_HTML
   0.596   0.9051    0.0087  0.991  0.88   0.01  T_HTML_COMMENT_RATIO_30_40_HTML_MESSAGE
   0.563   0.8594    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_30_40_MIME_HTML_ONLY
   0.599   0.9097    0.0087  0.991  0.88   0.01  T_HTML_COMMENT_RATIO_30_40___MIME_HTML
   0.186   0.2834    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_40_50_HTML_MESSAGE
   0.177   0.2697    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_40_50_MIME_HTML_ONLY
   0.189   0.2880    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_40_50___MIME_HTML
   0.177   0.2697    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_50_60_HTML_MESSAGE
   0.159   0.2423    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_50_60_MIME_HTML_ONLY
   0.183   0.2788    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_50_60___MIME_HTML
   0.168   0.2514    0.0087  0.967  0.80   0.01  T_HTML_COMMENT_RATIO_60_70_HTML_MESSAGE
   0.162   0.2468    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_60_70_MIME_HTML_ONLY
   0.171   0.2560    0.0087  0.967  0.81   0.01  T_HTML_COMMENT_RATIO_60_70___MIME_HTML
   0.305   0.4663    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_70_80_HTML_MESSAGE
   0.305   0.4663    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_70_80_MIME_HTML_ONLY
   0.314   0.4800    0.0000  1.000  0.91   0.01  T_HTML_COMMENT_RATIO_70_80___MIME_HTML
   0.195   0.2971    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_80_90_HTML_MESSAGE
   0.195   0.2971    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_80_90_MIME_HTML_ONLY
   0.195   0.2971    0.0000  1.000  0.90   0.01  T_HTML_COMMENT_RATIO_80_90___MIME_HTML
   0.000   0.0000    0.0000  0.500  0.00   0.01  T_HTML_COMMENT_RATIO_90_100_HTML_MESSAGE
   0.000   0.0000    0.0000  0.500  0.00   0.01  T_HTML_COMMENT_RATIO_90_100_MIME_HTML_ONLY
   0.000   0.0000    0.0000  0.500  0.00   0.01  T_HTML_COMMENT_RATIO_90_100___MIME_HTML
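The "catches N% of all messages" figures quoted with these results can be checked by adding up the first number of each line for one variant, which seems to be how they were derived. The sketch below assumes the ten ranges of a variant never overlap on the same message; %overall and total_hit_rate are hypothetical names, and the values are copied from the results above in range order (00_10 through 90_100).

```perl
use strict;
use warnings;
use List::Util qw(sum);

# First column (overall hit rate, in percent) for each variant,
# listed from the 00_10 range up to the 90_100 range.
my %overall = (
    HTML_MESSAGE   => [4.069, 1.551, 0.401, 0.596, 0.186, 0.177, 0.168, 0.305, 0.195, 0.000],
    MIME_HTML_ONLY => [2.431, 1.222, 0.356, 0.563, 0.177, 0.159, 0.162, 0.305, 0.195, 0.000],
    __MIME_HTML    => [4.009, 1.362, 0.398, 0.599, 0.189, 0.183, 0.171, 0.314, 0.195, 0.000],
);

# Hypothetical helper: combined hit rate of all ranges for one variant.
sub total_hit_rate {
    my ($values) = @_;
    return sum(@$values);
}

for my $variant (sort keys %overall) {
    printf "%-15s %.2f%%\n", $variant, total_hit_rate($overall{$variant});
}
```

The totals come out to 7.65% for HTML_MESSAGE, 5.57% for MIME_HTML_ONLY, and 7.42% for __MIME_HTML, matching the figures quoted in the message.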
Reopening for further comment (not a simple rule in terms of what we should do). Here are the HTML_MESSAGE scores for this rule (rod/theo/quinlan) for last night's corpus run:

OVERALL%    SPAM%  NONSPAM%    S/O  RANK  SCORE  NAME
   0.460   0.5072    0.0000  1.000  0.96   0.01  T_HTML_COMMENT_RATIO_70_80_MIME_HTML_ONLY
   0.330   0.3644    0.0000  1.000  0.96   0.01  T_HTML_COMMENT_RATIO_80_90_MIME_HTML_ONLY
   0.477   0.5184    0.0725  0.877  0.65   0.01  T_HTML_COMMENT_RATIO_70_80___MIME_HTML
   0.477   0.5184    0.0725  0.877  0.65   0.01  T_HTML_COMMENT_RATIO_70_80_HTML_MESSAGE
   0.351   0.3794    0.0725  0.840  0.57   0.01  T_HTML_COMMENT_RATIO_80_90_HTML_MESSAGE
   0.344   0.3719    0.0725  0.837  0.56   0.01  T_HTML_COMMENT_RATIO_80_90___MIME_HTML
   1.991   2.1000    0.9424  0.690  0.32   0.01  T_HTML_COMMENT_BLANK
   0.000   0.0000    0.0000  0.500  0.12   0.01  T_HTML_COMMENT_RATIO_90_100_MIME_HTML_ONLY
   0.000   0.0000    0.0000  0.500  0.12   0.01  T_HTML_COMMENT_RATIO_90_100___MIME_HTML
   0.000   0.0000    0.0000  0.500  0.12   0.01  T_HTML_COMMENT_RATIO_90_100_HTML_MESSAGE
   1.014   1.0106    1.0511  0.490  0.11   0.01  T_HTML_COMMENT_RATIO_30_40_MIME_HTML_ONLY
   5.484   5.2707    7.5390  0.411  0.07   0.01  T_HTML_COMMENT_RATIO_00_10_MIME_HTML_ONLY
  10.181   9.3317   18.3762  0.337  0.05   0.01  T_HTML_COMMENT_RATIO_00_10___MIME_HTML
  10.290   9.4181   18.7024  0.335  0.05   0.01  T_HTML_COMMENT_RATIO_00_10_HTML_MESSAGE
   0.310   0.2893    0.5074  0.363  0.05   0.01  T_HTML_COMMENT_RATIO_60_70_MIME_HTML_ONLY
   2.481   2.2841    4.3856  0.342  0.04   0.01  T_HTML_COMMENT_NO_ALPHANUM
   0.327   0.3005    0.5799  0.341  0.04   0.01  T_HTML_COMMENT_RATIO_60_70_HTML_MESSAGE
   0.327   0.3005    0.5799  0.341  0.04   0.01  T_HTML_COMMENT_RATIO_60_70___MIME_HTML
   2.607   2.3705    4.8931  0.326  0.04   0.01  T_HTML_COMMENT_RATIO_10_20_MIME_HTML_ONLY
   0.807   0.7363    1.4860  0.331  0.04   0.01  T_HTML_COMMENT_RATIO_20_30_MIME_HTML_ONLY
   3.469   3.0392    7.6115  0.285  0.03   0.01  T_HTML_COMMENT_RATIO_10_20_HTML_MESSAGE
   3.213   2.7725    7.4665  0.271  0.02   0.01  T_HTML_COMMENT_RATIO_10_20___MIME_HTML
   1.256   1.0857    2.8996  0.272  0.02   0.01  T_HTML_COMMENT_RATIO_30_40___MIME_HTML
   1.256   1.0857    2.8996  0.272  0.02   0.01  T_HTML_COMMENT_RATIO_30_40_HTML_MESSAGE
   1.008   0.8077    2.9358  0.216  0.01   0.01  T_HTML_COMMENT_RATIO_20_30___MIME_HTML
   1.014   0.8115    2.9721  0.214  0.01   0.01  T_HTML_COMMENT_RATIO_20_30_HTML_MESSAGE
   0.425   0.3343    1.3048  0.204  0.01   0.01  T_HTML_COMMENT_RATIO_40_50_MIME_HTML_ONLY
   0.432   0.3268    1.4498  0.184  0.01   0.01  T_HTML_COMMENT_RATIO_50_60___MIME_HTML
   0.432   0.3268    1.4498  0.184  0.01   0.01  T_HTML_COMMENT_RATIO_50_60_HTML_MESSAGE
   0.470   0.3531    1.5948  0.181  0.01   0.01  T_HTML_COMMENT_RATIO_40_50_HTML_MESSAGE
   0.466   0.3494    1.5948  0.180  0.01   0.01  T_HTML_COMMENT_RATIO_40_50___MIME_HTML
   0.391   0.2930    1.3411  0.179  0.01   0.01  T_HTML_COMMENT_RATIO_50_60_MIME_HTML_ONLY

It looks like 70 and above are usable. The average rank for 70 and above is higher for the MIME_HTML_ONLY versions, so I would be okay using it. It looks safe to use __MIME_HTML_ONLY, though, so I'd suggest that just in case someone manages to successfully forge a Hotmail message. The S/O ratio is also so low for the lower end of the range that we might as well leave all of these rules in and see if any are usable by the GA as compensation rules (we don't have to explicitly tag stuff as nice, do we?)

Dan
Let me try that table again:

OVERALL%    SPAM%  NONSPAM%    S/O  RANK  SCORE  NAME
   0.460   0.5072    0.0000  1.000  0.96   0.01  T_HTML_COMMENT_RATIO_70_80_MIME_HTML_ONLY
   0.330   0.3644    0.0000  1.000  0.96   0.01  T_HTML_COMMENT_RATIO_80_90_MIME_HTML_ONLY
   0.477   0.5184    0.0725  0.877  0.65   0.01  T_HTML_COMMENT_RATIO_70_80___MIME_HTML
   0.477   0.5184    0.0725  0.877  0.65   0.01  T_HTML_COMMENT_RATIO_70_80_HTML_MESSAGE
   0.351   0.3794    0.0725  0.840  0.57   0.01  T_HTML_COMMENT_RATIO_80_90_HTML_MESSAGE
   0.344   0.3719    0.0725  0.837  0.56   0.01  T_HTML_COMMENT_RATIO_80_90___MIME_HTML
   1.991   2.1000    0.9424  0.690  0.32   0.01  T_HTML_COMMENT_BLANK
   0.000   0.0000    0.0000  0.500  0.12   0.01  T_HTML_COMMENT_RATIO_90_100_MIME_HTML_ONLY
   0.000   0.0000    0.0000  0.500  0.12   0.01  T_HTML_COMMENT_RATIO_90_100___MIME_HTML
   0.000   0.0000    0.0000  0.500  0.12   0.01  T_HTML_COMMENT_RATIO_90_100_HTML_MESSAGE
   1.014   1.0106    1.0511  0.490  0.11   0.01  T_HTML_COMMENT_RATIO_30_40_MIME_HTML_ONLY
   5.484   5.2707    7.5390  0.411  0.07   0.01  T_HTML_COMMENT_RATIO_00_10_MIME_HTML_ONLY
  10.181   9.3317   18.3762  0.337  0.05   0.01  T_HTML_COMMENT_RATIO_00_10___MIME_HTML
  10.290   9.4181   18.7024  0.335  0.05   0.01  T_HTML_COMMENT_RATIO_00_10_HTML_MESSAGE
   0.310   0.2893    0.5074  0.363  0.05   0.01  T_HTML_COMMENT_RATIO_60_70_MIME_HTML_ONLY
   2.481   2.2841    4.3856  0.342  0.04   0.01  T_HTML_COMMENT_NO_ALPHANUM
   0.327   0.3005    0.5799  0.341  0.04   0.01  T_HTML_COMMENT_RATIO_60_70_HTML_MESSAGE
   0.327   0.3005    0.5799  0.341  0.04   0.01  T_HTML_COMMENT_RATIO_60_70___MIME_HTML
   2.607   2.3705    4.8931  0.326  0.04   0.01  T_HTML_COMMENT_RATIO_10_20_MIME_HTML_ONLY
   0.807   0.7363    1.4860  0.331  0.04   0.01  T_HTML_COMMENT_RATIO_20_30_MIME_HTML_ONLY
   3.469   3.0392    7.6115  0.285  0.03   0.01  T_HTML_COMMENT_RATIO_10_20_HTML_MESSAGE
   3.213   2.7725    7.4665  0.271  0.02   0.01  T_HTML_COMMENT_RATIO_10_20___MIME_HTML
   1.256   1.0857    2.8996  0.272  0.02   0.01  T_HTML_COMMENT_RATIO_30_40___MIME_HTML
   1.256   1.0857    2.8996  0.272  0.02   0.01  T_HTML_COMMENT_RATIO_30_40_HTML_MESSAGE
   1.008   0.8077    2.9358  0.216  0.01   0.01  T_HTML_COMMENT_RATIO_20_30___MIME_HTML
   1.014   0.8115    2.9721  0.214  0.01   0.01  T_HTML_COMMENT_RATIO_20_30_HTML_MESSAGE
   0.425   0.3343    1.3048  0.204  0.01   0.01  T_HTML_COMMENT_RATIO_40_50_MIME_HTML_ONLY
   0.432   0.3268    1.4498  0.184  0.01   0.01  T_HTML_COMMENT_RATIO_50_60___MIME_HTML
   0.432   0.3268    1.4498  0.184  0.01   0.01  T_HTML_COMMENT_RATIO_50_60_HTML_MESSAGE
   0.470   0.3531    1.5948  0.181  0.01   0.01  T_HTML_COMMENT_RATIO_40_50_HTML_MESSAGE
   0.466   0.3494    1.5948  0.180  0.01   0.01  T_HTML_COMMENT_RATIO_40_50___MIME_HTML
   0.391   0.2930    1.3411  0.179  0.01   0.01  T_HTML_COMMENT_RATIO_50_60_MIME_HTML_ONLY
Subject: Re: [SAdev] Excessive Commenting in code
On Wed, Mar 05, 2003 at 09:14:47PM -0800, bugzilla-daemon@hughes-family.org wrote:
> The S/O ratio is also so low for the lower end of the range that we might
> as well leave all of these rules in and see if any are usable by the GA
> as compensation rules (we don't have to explicitly tag stuff as nice, do we?)
In the current GA code, yes, we would have to. Any rule not marked as "nice" is forced to have a >= 0 score. I'd first like to pick what set of those rules we want to use, then leave in the whole set for further testing (right now we only have 3 people total doing nightly runs, so the results are telling but not conclusive).
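The constraint on rules not marked "nice" can be pictured as a simple clamp. This is a sketch of the behaviour as described, not the GA's actual code; constrain_score is a hypothetical name, and since the message does not say how "nice" rules are treated, they are passed through unchanged here.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the ">= 0 unless nice" constraint to a score
# proposed by the optimizer.
sub constrain_score {
    my ($proposed, $is_nice) = @_;
    return $proposed if $is_nice;             # assumption: no clamp for nice rules
    return $proposed < 0 ? 0 : $proposed;     # other rules may not go negative
}
```

Under this constraint, a low-S/O rule such as the 40_50 range could only act as a compensation rule if it were explicitly tagged as nice.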
There is now a version checked into 2.60:

OVERALL%    SPAM%  NONSPAM%    S/O  RANK  SCORE  NAME
   0.266   0.8842    0.0061  0.993  0.94   1.00  HTML_COMMENT_RATIO

which works out pretty well I think. :)