SA Bugzilla – Bug 5859
SUBJ_ALL_CAPS got fired up on non-Latin subjects with some Latin characters which are all capitals
Last modified: 2022-07-14 22:31:26 UTC
I just found that SUBJ_ALL_CAPS fired on this header:

  Subject: KS - SWC =?UTF-8?B?57WQ5p6c?=

This is a Chinese subject that contains capital letters (a site code that my company uses); decoded, it reads "KS - SWC 結果". I think this rule should be tightened to avoid such hits. It is fairly common for non-Latin subjects to contain some capital letters, normally some sort of code.
Me too. For details see http://news.gmane.org/find-root.php?message_id=87ws9pzguf.fsf@jidanni.org . It was a mere RE: instead of Re:, and they got slammed for it.
Copied from Bug 6398 comment 2:

Luther: When sending a mail with an Arabic subject (e.g. بسم الله الرحمن الرحيم), a reply or forward causes the SUBJ_ALL_CAPS pattern to match (e.g. "AW: بسم الله الرحمن الرحيم" or "FWD: بسم الله الرحمن الرحيم"). This might also be the case for other languages (e.g. Hindi, Thai, etc.).

Mark: Here is the attached header field sample:

  Subject: =?utf-8?Q?AW:_=D8=A7=D9=84=D9=84=D9=87_=D9=83=D8=A8=D8=B1?=

Should the CHARSETS_LIKELY_TO_FP_AS_CAPS in Constants.pm include 'utf-8'? Shouldn't the "=hh" entities be exempt from QP encoded strings entirely? Also, shouldn't the B-encoded (base64) MIME strings be exempt entirely from this test?
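To make the two reported headers concrete, here is a minimal sketch that decodes them with Python's standard `email.header` module. It only shows what the encoded words contain; it is not how SpamAssassin decodes headers, and the helper name `decode_subject` is made up for this illustration.

```python
from email.header import decode_header, make_header


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a Subject value (hypothetical helper)."""
    return str(make_header(decode_header(raw)))


# Header from the original report (base64, "B" encoding).
chinese = decode_subject("KS - SWC =?UTF-8?B?57WQ5p6c?=")

# Header from Bug 6398 (quoted-printable, "Q" encoding).
arabic = decode_subject(
    "=?utf-8?Q?AW:_=D8=A7=D9=84=D9=84=D9=87_=D9=83=D8=A8=D8=B1?="
)

print(chinese)
print(arabic)

# In both subjects the only ASCII letters are the short all-caps prefix.
print([c for c in chinese if c.isascii() and c.isalpha()])
print([c for c in arabic if c.isascii() and c.isalpha()])
```

In both cases the ASCII letters that survive are just the site code or the "AW" reply prefix, which is what the rule ends up judging.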
*** Bug 6398 has been marked as a duplicate of this bug. ***
IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in latin characters that match IsUpper (regardless of charset). I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a horrible workaround to me!
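One possible reading of this proposal, sketched in Python rather than in the plugin's Perl: treat a subject as all caps only if every letter in it is a Latin letter and is upper case. The function names are hypothetical, and using the Unicode character name to detect Latin letters is an assumption made for this sketch (it would not count fullwidth forms, for example).

```python
import unicodedata


def is_latin_letter(ch: str) -> bool:
    """True if ch is a letter whose Unicode name starts with 'LATIN' (assumed test)."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def entirely_latin_upper(subject: str) -> bool:
    """Sketch of the proposal: every letter must be Latin and upper case."""
    letters = [ch for ch in subject if ch.isalpha()]
    if not letters:
        return False  # nothing to judge, e.g. digits and punctuation only
    return all(is_latin_letter(ch) and ch.isupper() for ch in letters)


print(entirely_latin_upper("BUY CHEAP WATCHES NOW"))
print(entirely_latin_upper("KS - SWC 結果"))
print(entirely_latin_upper("Re: meeting notes"))
```

Under this reading a single non-Latin letter anywhere in the decoded subject is enough to keep the rule from firing.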
Also, subjects with charset windows-1251 or windows-1252 show the same issue.
I forgot that there was this bug on file and instead posted... http://permalink.gmane.org/gmane.mail.spam.spamassassin.general/129541
What would be a good workaround for one's user_prefs? All I found for a model was ./blib/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm:895:sub subject_is_all_caps {
(In reply to comment #4)
> IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in latin
> characters that match IsUpper (regardless of charset).
>
> I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a
> horrible workaround to me!

I agree with this. CHARSETS_LIKELY_TO_FP_AS_CAPS should be renamed to CHARSETS_FOR_CHECK_AS_CAPS, and only Latin characters should be considered by this rule.
Good feature to fix but pushing to 3.4.1
I think it's inevitable that some rules will hit on ham. So perhaps, instead of trying to fix it, it would be good to put a ceiling on this rule and move it to the sandbox to see if it is autopromoted? It seems to be a 75% indicator of spam from ruleqa, but perhaps some real-world additions are saying it needs to be artificially capped. Pushing to 3.4.2 since this might require code changes to support the fix.
This is a rules issue, not release-specific. Anyone want to look at the ruleqa s/o for this rule? Dave, perhaps we need a score ceiling for it?
good one here http://bit.ly/38OVSdD http://bit.ly/39Dgpn7
IMO the following two lines are in the wrong order:

  return 0 if (length $subject < 10); # don't match short subjects
  $subject =~ s/[^a-zA-Z]//g; # only look at letters

The changes made from this thread and others have never actually addressed the original problem, which is that in:

  Subject: KS - SWC =?UTF-8?B?57WQ5p6c?=

the Chinese characters count towards the minimum length, but are then stripped. This allows the rule to fire on a single remaining [A-Z] character. I suspect that this causes most, if not all, of the problems.

It's not clear to me whether CHARSETS_LIKELY_TO_FP_AS_CAPS is anything more than a list of character sets where the problem happens to have been observed. It can be triggered with pure ASCII, e.g.

  Subject: X [ 243, 346 ]
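The ordering problem can be reproduced with a small Python model of the two quoted lines plus the final comparison quoted elsewhere in this thread. This is a sketch only: the real `subject_is_all_caps` does more than this, the function names are hypothetical, and applying the check to the decoded subject is an assumption.

```python
import re

MIN_LEN = 10  # threshold taken from the quoted Perl line


def strip_non_letters(subject: str) -> str:
    """Model of: $subject =~ s/[^a-zA-Z]//g"""
    return re.sub(r"[^a-zA-Z]", "", subject)


def fires_current_order(subject: str) -> bool:
    """Length is checked BEFORE stripping, as in the quoted code."""
    if len(subject) < MIN_LEN:
        return False
    letters = strip_non_letters(subject)
    return len(letters) > 0 and letters == letters.upper()


def fires_swapped_order(subject: str) -> bool:
    """Length is checked AFTER stripping, as proposed above."""
    letters = strip_non_letters(subject)
    if len(letters) < MIN_LEN:
        return False
    return letters == letters.upper()


for subject in ("KS - SWC 結果", "X [ 243, 346 ]", "BUY CHEAP WATCHES NOW"):
    print(repr(subject), fires_current_order(subject), fires_swapped_order(subject))
```

With the current order, the first two subjects pass the length check on the strength of characters that are then discarded; with the swapped order only the third, genuinely all-caps subject is matched.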
(In reply to RW from comment #13)
> IMO the following two lines are in the wrong order
>
> return 0 if (length $subject < 10); # don't match short subjects
> $subject =~ s/[^a-zA-Z]//g; # only look at letters

That makes sense, but as an alternate fix, consider that the return line does check the length after stripping, like this:

  return length($subject) && ($subject eq uc($subject));

We could make it return 0 when length($subject) < n, for some number n that makes sense. I don't know what n should be, though, without experimenting with ruleqa with different values. It does have to be something less than 10, or else the current code makes no sense at all.
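As a rough way to see what different values of n would do, the following Python sketch keeps the existing pre-strip length check and adds a minimum on the number of letters left after stripping. The function name and sample subjects are hypothetical, and this is no substitute for measuring against ruleqa.

```python
import re


def fires_with_min_letters(subject: str, min_letters: int) -> bool:
    """Model of the alternate fix: require at least min_letters after stripping."""
    if len(subject) < 10:  # existing pre-strip check, left unchanged
        return False
    letters = re.sub(r"[^a-zA-Z]", "", subject)
    return len(letters) >= min_letters and letters == letters.upper()


samples = ["KS - SWC 結果", "X [ 243, 346 ]", "BUY CHEAP WATCHES NOW"]

# Show which sample subjects would still be matched for a few values of n.
for n in (1, 3, 6, 10):
    hits = [s for s in samples if fires_with_min_letters(s, n)]
    print(n, hits)
```

With n = 1 this behaves like the current return line; for these samples, the site-code subject stops matching once n exceeds its five letters.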