SA Bugzilla – Full Text Bug Listing
Summary: | SUBJ_ALL_CAPS got fired up on non-latin subjects with some latin characters which are all capital | | |
---|---|---|---|
Product: | Spamassassin | Reporter: | lee_yiu_chung |
Component: | Rules | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | NEW --- | | |
Severity: | normal | CC: | davej, dramitsharma0909, frank.urban, jidanni, kmcgrail, lee_yiu_chung, luther.blissett, ramtin.beheshti, rwmaillists, sidney, temnota.am |
Priority: | P2 | | |
Version: | 3.3.1 | | |
Target Milestone: | Undefined | | |
Hardware: | All | | |
OS: | All | | |
Whiteboard: | | | |
Description
lee_yiu_chung
2008-03-20 09:24:04 UTC
Me too. For details see http://news.gmane.org/find-root.php?message_id=87ws9pzguf.fsf@jidanni.org . It was a mere RE: instead of Re:, and they got slammed for it.

Copied from Bug 6398 comment 2:

Luther: When sending a mail with an Arabic subject (e.g. بسم الله الرحمن الرحيم), a reply or forward causes the SUBJ_ALL_CAPS pattern to match (e.g. "AW: بسم الله الرحمن الرحيم" or "FWD: بسم الله الرحمن الرحيم"). This might also be the case for other languages (e.g. Hindi, Thai, etc.)

Mark: Here is the attached header field sample:

    Subject: =?utf-8?Q?AW:_=D8=A7=D9=84=D9=84=D9=87_=D9=83=D8=A8=D8=B1?=

Should the CHARSETS_LIKELY_TO_FP_AS_CAPS in Constants.pm include 'utf-8'? Shouldn't the "=hh" entities be exempt from QP encoded strings entirely? Also, shouldn't the B-encoded (base64) MIME strings be exempt entirely from this test?

*** Bug 6398 has been marked as a duplicate of this bug. ***

IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in Latin characters that match IsUpper (regardless of charset).

I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a horrible workaround to me!

Also, subjects with charset windows-1251 or windows-1252 show the same issue.

I forgot that there was this bug on file and instead posted... http://permalink.gmane.org/gmane.mail.spam.spamassassin.general/129541

What would be a good workaround for one's user_prefs? All I found for a model was

    ./blib/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm:895:sub subject_is_all_caps {

(In reply to comment #4)
> IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in Latin
> characters that match IsUpper (regardless of charset).
>
> I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a
> horrible workaround to me!

I agree with this. CHARSETS_LIKELY_TO_FP_AS_CAPS should be renamed to CHARSETS_FOR_CHECK_AS_CAPS, and only Latin characters should be included in this rule.

Good feature to fix but pushing to 3.4.1

I think it's inevitable that some rules will hit on ham. So perhaps a ceiling on this rule and moving it to the sandbox to see if it is autopromoted would be good instead of trying to fix it? It seems a 75% indicator of spam from ruleqa, but perhaps some real-world additions are saying it needs to be artificially capped.

Pushing to 3.4.2 since this might require code changes to support the fix.

This is a rules issue, not release specific. Anyone want to look at the ruleqa s/o for this rule? Dave, perhaps we need a score ceiling for it?

good one here http://bit.ly/38OVSdD http://bit.ly/39Dgpn7

IMO the following two lines are in the wrong order

    return 0 if (length $subject < 10); # don't match short subjects
    $subject =~ s/[^a-zA-Z]//g; # only look at letters

The changes made from this thread and others have never actually addressed the original problem, which is that in:

    Subject: KS - SWC =?UTF-8?B?57WQ5p6c?=

the Chinese characters count towards the minimum length, but are then stripped. This allows the rule to fire on a single remaining [A-Z] character. I suspect that this causes most, if not all, of the problems.

It's not clear to me whether CHARSETS_LIKELY_TO_FP_AS_CAPS is anything more than a list of character sets where the problem happens to have been observed. It can be triggered with pure ASCII, e.g.

    Subject: X [ 243, 346 ]

(In reply to RW from comment #13)
> IMO the following two lines are in the wrong order
>
> return 0 if (length $subject < 10); # don't match short subjects
> $subject =~ s/[^a-zA-Z]//g; # only look at letters

That makes sense, but as an alternate fix, consider that the return line does check the length after stripping, like this:

    return length($subject) && ($subject eq uc($subject));

We could make it length($subject) < n for some number that makes sense. I don't know what n should be, though, without experimenting with ruleqa with different amounts. It does have to be something less than 10, or else the current code makes no sense at all.
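
For anyone who wants to experiment with the ordering question, here is a rough sketch in Python (not the actual Perl in HeaderEval.pm) that models only the lines quoted above: the length test, the letter stripping, and the final upper-case comparison. The function names, the `min_letters` parameter, and its default of 10 are assumptions made for illustration; the real `subject_is_all_caps` may do more than this, and a sensible threshold would still have to come from ruleqa.

```python
import re

MIN_LEN = 10  # threshold from the quoted length check


def fires_length_first(subject: str) -> bool:
    """Model of the quoted order: test length, then strip to ASCII letters."""
    if len(subject) < MIN_LEN:
        return False  # "don't match short subjects"
    letters = re.sub(r"[^a-zA-Z]", "", subject)  # "only look at letters"
    # Mirrors: return length($subject) && ($subject eq uc($subject));
    return len(letters) > 0 and letters == letters.upper()


def fires_strip_first(subject: str, min_letters: int = MIN_LEN) -> bool:
    """Hypothetical reordering: strip first, then require enough letters.

    min_letters is an illustrative knob, not a value taken from the thread.
    """
    letters = re.sub(r"[^a-zA-Z]", "", subject)
    if len(letters) < min_letters:
        return False
    return letters == letters.upper()


if __name__ == "__main__":
    samples = [
        "X [ 243, 346 ]",           # pure-ASCII example from the thread
        "KS - SWC 結果",             # decoded form of the B-encoded example
        "URGENT REPLY NEEDED NOW",  # a subject that really is all caps
    ]
    for s in samples:
        print(repr(s), fires_length_first(s), fires_strip_first(s))
```

With the length test first, both short examples from the thread are reported as all caps, because digits, punctuation and non-Latin characters count towards the length before being stripped. With the order swapped they are not, while a long, genuinely upper-case subject is flagged either way.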