SA Bugzilla – Full Text Bug Listing

Summary: | all caps headers | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Doug McCasland <dougm> |
Component: | Regression Tests | Assignee: | Daniel Quinlan <quinlan> |
Status: | RESOLVED FIXED | ||
Severity: | enhancement | CC: | dev |
Priority: | P2 | ||
Version: | 2.42 | ||
Target Milestone: | --- | ||
Hardware: | Other | ||
OS: | Linux | ||
Whiteboard: |
Description
Doug McCasland
2002-10-12 14:16:51 UTC
Subject: Re: [SAdev] New: all caps headers

On Sat, Oct 12, 2002 at 02:16:51PM -0700, bugzilla-daemon@hughes-family.org wrote:
> A + score for any header that is all-caps. I have found 99% of msgs that have
> the To: address in all caps to be spam.

In a quick test, I have 27 spam hits and 74 nonspam hits.

*** Bug 1111 has been marked as a duplicate of this bug. ***

*** Bug 1110 has been marked as a duplicate of this bug. ***

My test, BTW, was run using:

$ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' non-spam/* | wc -l
75
$ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' spam/* | wc -l
35

So this test doesn't look very good. Somewhere around a .70 with a small match percentage (110 out of 20000).

Oh, BTW... Please don't add a new bug saying "revise this old bug", it's extremely annoying. You can just add a comment to the actual bug...

I get an S/O of 0.112; not so hot.

Hi, sorry about the dupes. I think I get it now ;-o

Anyway, this is the main point, that "To:" (or other upper+lower tag) is not getting caught by the ALL_CAPS_HEADER test.

TO: BOBO@FOO.BIZ [the entire header is all caps]

gets an ALL_CAPS_HEADER score, fine. But

To: BOBO@FOO.BIZ ["To:" is not all caps]

does not.

Aside: my email server (postfix) converts incoming TO: to To:.

I think I found a variation of this test that works well, but I had to tweak it a lot, so I'm afraid it may be fragile outside of my corpus. In fact, I will be surprised if this test survives.

Notes:
- the test semantics are a bit different than the original suggestion too.
- if there are specific headers that cause more FPs than not, we can exclude them (I already excluded one.)

It's in CVS now as T_HEADER_ALL_CAPS.
OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   12402     4708      7694  0.38  0.00   0.00  (all messages)
 100.000   37.962    62.038  0.38  0.00   0.00  (all messages as %)
   2.104    5.523     0.013  1.00  0.71   1.00  T_HEADER_ALL_CAPS

(assigning to me)

Subject: Re: [SAdev] all caps headers

On Sun, Oct 13, 2002 at 01:27:52AM -0700, bugzilla-daemon@hughes-family.org wrote:
> OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
>    12402     4708      7694  0.38  0.00   0.00  (all messages)
>  100.000   37.962    62.038  0.38  0.00   0.00  (all messages as %)
>    2.104    5.523     0.013  1.00  0.71   1.00  T_HEADER_ALL_CAPS

Not even close on mine:

  73.235   41.268    95.311  0.30  0.20   1.00  T_HEADER_ALL_CAPS

I'll poke around a little bit to see if there's a common cause for this, but ...

Subject: Re: [SAdev] all caps headers

On Sun, Oct 13, 2002 at 09:28:04AM -0700, bugzilla-daemon@hughes-family.org wrote:
> I'll poke around a little bit to see if there's a common cause for this,
> but ...

I found the main cause was the X-UID header. After removing that, the scores at least look a little better:

   0.856    1.120     0.675  0.62  0.35   1.00  T_HEADER_ALL_CAPS

I found a number of mails which had semi-random "X-<something>: <no letters in text>", so I modified the code to only check headers that have [a-zA-Z] in the value portion:

   0.530    0.882     0.288  0.75  0.41   1.00  T_HEADER_ALL_CAPS

I then found a few more headers to remove:

/^X-UIDL?:/
/^X-IMDB-MAILFILTER:/
/^X-SMTP-\w+:/
/^X-AUTH:/
/^X-EPRI-ID:/
/^X-FID:/
/^X-BG:/
/^X-AS[HN]:/
/^X-AUTO:/
/^X-A[DNP]:/
/^X-\d+:/

   0.370    0.870     0.025  0.97  0.62   1.00  T_HEADER_ALL_CAPS

The left over FPs are valid mails with either all-caps TO or CONTENT-TYPE headers.

To make the list easier, I'm tempted to say ignore /^X-/, and just pay attention to "real" headers. Overall, I don't think the benefits justify the cost of processing, but YMMV.

Subject: Re: all caps headers

felicity@kluge.net writes:
> To make the list easier, I'm tempted to say ignore /^X-/, and just pay
> attention to "real" headers.
> Overall, I don't think the benefits justify
> the cost of processing, but YMMV.

I think I tried ignoring /^X-/ and lost almost all of the SPAM%, but I can't remember which version of the test that was, so I'll try it again.

Dan

Version that just tests headers not matching with /^X-/

   0.065    0.170     0.000  1.00  0.67   1.00  T_HEADER_ALL_CAPS

I looked at the spam hit header names of my previous version and here they are:

195 X-X:
 11 X-POSTMASTER:
  8 X-SLUIDL:
  8 SUBJECT:
  4 X-PLATTER:
  4 X-IONK:
  4 X-CRUNCHERS:
  4 X-CORONNA:
  2 X-REFERER:
  2 X-FROM:

X-X: has another rule, SUBJECT: has (or had, we might have removed it due to FPs) another rule, and the others aren't particularly high frequency.

I'm closing this bug as WONTFIX and removing the stuff from CVS.

Well, I didn't understand all the hoopla. ;-) But here's what I put in the system-wide prefs:

header ALLCAPS_TOCC ToCc !~ /[a-z]+\@/
score ALLCAPS_TOCC 2.0

This adds 2.0 if the ToCC headers didn't contain at least one lower-case lhs -- regardless of whether it's To: or TO: (or Cc: or CC:). I'm not yet sure about any side-effects. +2.0 might seem like a stiff penalty, but I have never seen an upper-case ToCc that wasn't spam. (Why spammers end up with UC addrs, and then use them without conversion, is beyond me.) Thanks for all your work.

Subject: Re: all caps headers

dougm@bravoecho.net:
> header ALLCAPS_TOCC ToCc !~ /[a-z]+\@/
> score ALLCAPS_TOCC 2.0

That's not a very good rule. This will match any sort of address ending in a number. I get the impression you aren't really testing these rules.

> This adds 2.0 if the ToCC headers didn't contain at least one
> lower-case lhs -- regardless of whether it's To: or TO: (or Cc: or
> CC:). I'm not yet sure about any side-effects. +2.0 might seem
> like a stiff penalty, but I have never seen an upper-case ToCc that
> wasn't spam. (Why spammers end up with UC addrs, and then use them
> without conversion, is beyond me.) Thanks for all your work.
2.0 would be WAY too high of a score for that rule. I only get an S/O of 0.95. Most rules with a score of 2.0 have an S/O of 1.00 or very close to that. (A few have lower S/O ratios, but they hit relatively few messages.)

OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   12402     4708      7694  0.38  0.00   0.00  (all messages)
 100.000   37.962    62.038  0.38  0.00   0.00  (all messages as %)
  10.224   24.894     1.248  0.95  0.50   1.00  T_ALLCAPS_TOCC

Some rules with scores of about 2.0:

   1.121    2.952     0.000  1.00  0.86   2.00  FORGED_MX_HOTMAIL
   0.734    1.933     0.000  1.00  0.83   2.03  FORGED_RCVD_TRAIL
   0.419    1.105     0.000  1.00  0.79   2.01  COMPARE_RATES
   0.298    0.786     0.000  1.00  0.77   1.99  COPY_ACCURATELY
   0.266    0.701     0.000  1.00  0.76   1.93  HR_4176
   0.250    0.658     0.000  1.00  0.76   1.98  LIVE_PORN
   0.161    0.425     0.000  1.00  0.73   1.99  CBYI
   0.814    2.124     0.013  0.99  0.65   2.01  BULK_EMAIL
   0.016    0.042     0.000  1.00  0.57   2.00  LYING_EYES
   0.177    0.446     0.013  0.97  0.54   1.94  DRASTIC_REDUCED
   0.403    0.913     0.091  0.91  0.46   2.06  CHARSET_FARAWAY_HEADERS
   0.185    0.382     0.065  0.85  0.42   1.96  US_DOLLARS

Subject: Re: [SAdev] all caps headers

bugzilla-daemon@hughes-family.org said:
> $ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' non-spam/* | wc -l
> 75
> $ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' spam/* | wc -l
> 35
>
> So this test doesn't look very good. Somewhere around a .70 with a small
> match percentage (110 out of 20000).

Yep, about the same here. It's worth noting that the default setup for LISTSERV mailing lists uses upper-case for the From and To addrs, so there's a big pool of false positives right there.

--j.

(reopening bug)

I wrote some rules with good stats (for me) that are worth testing. They're a bit more complicated and we'll need to figure out which ones work the best for everyone (some mailing list software might be a problem).

Here are my results for the new rule. The best RANK for previous rules was about 0.50. These are all much better (some lower than others, but this is the set that looks worthy of testing on other corpuses).
OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   12402     4708      7694  0.38  0.00   0.00  (all messages)
 100.000   37.962    62.038  0.38  0.00   0.00  (all messages as %)
   0.508    1.338     0.000  1.00  0.80   1.00  T_NO_LOWER_TO_CC_ALL_E
   0.508    1.338     0.000  1.00  0.80   1.00  T_NO_LOWER_TO_ALL_E
   0.484    1.274     0.000  1.00  0.80   1.00  T_NO_LOWER_TOCC_ALL_E
   0.169    0.446     0.000  1.00  0.73   1.00  T_NO_LOWER_FROM_1
   0.129    0.340     0.000  1.00  0.71   1.00  T_NO_LOWER_FROM_2
   4.636   12.107     0.065  0.99  0.66   1.00  T_NO_LOWER_TOCC_USER_E
   4.112   10.599     0.143  0.99  0.59   1.00  T_NO_LOWER_TOCC_HOST_E
   4.733   12.192     0.169  0.99  0.59   1.00  T_NO_LOWER_TOCC_EITHER_E
   7.523   19.329     0.299  0.98  0.59   1.00  T_NO_LOWER_TOCC_USER
   4.185   10.705     0.195  0.98  0.57   1.00  T_NO_LOWER_TO_HOST_E
   7.620   19.414     0.403  0.98  0.57   1.00  T_NO_LOWER_TOCC_EITHER
   6.999   17.821     0.377  0.98  0.56   1.00  T_NO_LOWER_TOCC_HOST
   4.241   10.790     0.234  0.98  0.56   1.00  T_NO_LOWER_TO_CC_HOST_E
   4.814   12.213     0.286  0.98  0.56   1.00  T_NO_LOWER_TO_USER_E
   4.870   12.319     0.312  0.98  0.55   1.00  T_NO_LOWER_TO_CC_USER_E
   3.370    8.496     0.234  0.97  0.55   1.00  T_NO_LOWER_TOCC_ALL
   7.394   18.585     0.546  0.97  0.54   1.00  T_NO_LOWER_TO_HOST
   8.023   20.093     0.637  0.97  0.54   1.00  T_NO_LOWER_TO_USER
   4.943   12.298     0.442  0.97  0.53   1.00  T_NO_LOWER_TO_EITHER_E
   3.717    9.218     0.351  0.96  0.52   1.00  T_NO_LOWER_TO_ALL
   8.152   20.178     0.793  0.96  0.52   1.00  T_NO_LOWER_TO_EITHER
   5.023   12.404     0.507  0.96  0.52   1.00  T_NO_LOWER_TO_CC_EITHER_E

Only one rule worth keeping, one of the T_FROM_NO_LOWER variants is now T_FROM_NO_LOWER. The rest are DELETED.
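
Note on the S/O column: the thread never spells out how S/O is computed, but every per-rule row quoted above is consistent with SPAM% / (SPAM% + NONSPAM%), rounded to two places. The sketch below is only a check of that reading against a few rows from this bug. The function name `so_ratio` is hypothetical and the formula is inferred from the numbers, not taken from the SpamAssassin source.

```python
def so_ratio(spam_pct: float, nonspam_pct: float) -> float:
    """Share of a rule's hits that are spam, from the SPAM% and NONSPAM% columns.

    Assumption: this formula is inferred from the tables in this bug,
    not copied from the tool that produced them.
    """
    total = spam_pct + nonspam_pct
    if total == 0:
        raise ValueError("rule hit no messages")
    return spam_pct / total


# (SPAM%, NONSPAM%) pairs copied from rows quoted in this bug.
rows = {
    "T_HEADER_ALL_CAPS": (5.523, 0.013),
    "T_ALLCAPS_TOCC": (24.894, 1.248),
    "CHARSET_FARAWAY_HEADERS": (0.913, 0.091),
    "US_DOLLARS": (0.382, 0.065),
}

for name, (spam_pct, nonspam_pct) in rows.items():
    print(f"{so_ratio(spam_pct, nonspam_pct):.2f}  {name}")
```

Because the inputs are percentages of each corpus rather than raw hit counts, the ratio does not depend on how much spam versus nonspam a tester happens to have. That would explain why raw counts such as 35 spam hits against 75 nonspam hits cannot be turned into an S/O without also knowing the corpus sizes.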