Bug 5859 - SUBJ_ALL_CAPS fires on non-Latin subjects with some Latin characters which are all capital
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.3.1
Hardware: All All
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 6398
Depends on:
Blocks:
 
Reported: 2008-03-20 09:24 UTC by lee_yiu_chung
Modified: 2022-07-14 22:31 UTC

Description lee_yiu_chung 2008-03-20 09:24:04 UTC
I just found that SUBJ_ALL_CAPS fired on this header:

Subject: KS - SWC =?UTF-8?B?57WQ5p6c?=

This is a Chinese subject that contains a few capital letters, which are a site code used by my company (decoded: KS - SWC 結果). I think this rule should be tightened to avoid such cases. It is fairly common for non-Latin subjects to contain some capital letters, normally some sort of code.
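
For reference, the encoded word in that header can be decoded with Python's standard email.header module to see what the rule is looking at. This is only an illustration of the decoded text, not SpamAssassin's own (Perl) decoding path, and the variable names are made up for the example.

```python
from email.header import decode_header, make_header

raw_subject = "KS - SWC =?UTF-8?B?57WQ5p6c?="

# Decode the RFC 2047 encoded word and join it with the plain ASCII part.
decoded_subject = str(make_header(decode_header(raw_subject)))

# Keep only ASCII letters, i.e. what survives a strip of [^a-zA-Z]
# like the one quoted in comment 13 of this report.
ascii_letters = [ch for ch in decoded_subject if ch.isascii() and ch.isalpha()]

print(decoded_subject)
print("".join(ascii_letters))
```

Only the ASCII letters of the site code survive that filter, and all of them are capitals; the Chinese characters have no case at all.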
Comment 1 jidanni 2009-04-22 00:04:41 UTC
Me too. For details see http://news.gmane.org/find-root.php?message_id=87ws9pzguf.fsf@jidanni.org .
It was a mere RE: instead of Re:, and they got slammed for it.
Comment 2 Mark Martinec 2010-03-31 14:52:42 UTC
Copied from Bug 6398 comment 2:

Luther:
When sending a mail with an Arabic subject (e.g. بسم الله الرحمن الرحيم), a reply
or forward causes the SUBJ_ALL_CAPS pattern to match (e.g. "AW: بسم الله الرحمن
الرحيم" or "FWD: بسم الله الرحمن الرحيم").

This might also be the case for other languages (e.g. Hindi, Thai, etc.)


Mark:
Here is the attached header field sample:
  Subject: =?utf-8?Q?AW:_=D8=A7=D9=84=D9=84=D9=87_=D9=83=D8=A8=D8=B1?=

Should the CHARSETS_LIKELY_TO_FP_AS_CAPS in Constants.pm include 'utf-8'?

Shouldn't the "=hh" entities be exempt from QP encoded strings entirely?

Also, shouldn't the B-encoded (base64) MIME strings be exempt entirely
from this test?
Comment 3 Mark Martinec 2010-03-31 14:53:28 UTC
*** Bug 6398 has been marked as a duplicate of this bug. ***
Comment 4 John Wilcock 2010-03-31 15:32:26 UTC
IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in latin characters that match IsUpper (regardless of charset).

I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a horrible workaround to me!
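
One possible reading of this proposal, sketched in Python rather than in the plugin's Perl: the check passes only when every letter in the decoded subject is a Latin letter and is uppercase. The function names are hypothetical, and using the Unicode character name to decide what counts as "Latin" is an assumption of this sketch, not something taken from SpamAssassin.

```python
import unicodedata


def is_latin_letter(ch: str) -> bool:
    """Assumption: a letter is 'Latin' if its Unicode name starts with LATIN."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def only_latin_caps(subject: str) -> bool:
    """True only if the subject has letters and all of them are Latin capitals."""
    letters = [ch for ch in subject if ch.isalpha()]
    if not letters:
        return False
    return all(is_latin_letter(ch) and ch.isupper() for ch in letters)


print(only_latin_caps("KS - SWC 結果"))          # mixed Latin and Chinese letters
print(only_latin_caps("BUY CHEAP WATCHES NOW"))  # Latin capitals only
```

Under this reading a single non-Latin letter anywhere in the subject is enough to stop the rule from firing, so no per-charset list is needed.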
Comment 5 Andrey Melnikov 2010-04-02 18:20:57 UTC
Also, subjects with charset windows-1251 or windows-1252 show the same issue.
Comment 6 jidanni 2010-07-06 04:59:51 UTC
I forgot that there was this bug on file and instead posted...
http://permalink.gmane.org/gmane.mail.spam.spamassassin.general/129541
Comment 7 jidanni 2010-07-10 22:25:53 UTC
What would be a good workaround for one's user_prefs?
All I found as a model was
./blib/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm:895:sub subject_is_all_caps {
Comment 8 Frank Urban 2011-01-24 08:33:08 UTC
(In reply to comment #4)
> IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in latin
> characters that match IsUpper (regardless of charset).
> 
> I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a
> horrible workaround to me!

I agree with this.
CHARSETS_LIKELY_TO_FP_AS_CAPS should be renamed to
CHARSETS_FOR_CHECK_AS_CAPS
and only Latin characters should be included in this rule.
Comment 9 Kevin A. McGrail 2011-10-28 20:58:40 UTC
Good feature to fix but pushing to 3.4.1
Comment 10 Kevin A. McGrail 2015-04-13 22:33:54 UTC
I think it's inevitable that some rules will hit on ham.  So perhaps a ceiling on this rule and moving it to the sandbox to see if it is autopromoted would be good instead of trying to fix it?  It seems to be a 75% indicator of spam according to ruleqa, but perhaps some real-world reports are saying it needs to be artificially capped.

Pushing to 3.4.2 since this might require code changes to support the fix.
Comment 11 Kevin A. McGrail 2018-09-04 15:37:29 UTC
This is a rules issue not release specific.  Anyone want to look at the ruleqa s/o for this rule?  Dave, perhaps we need a score ceiling for it?
Comment 12 amit 2020-03-23 09:00:20 UTC
good one here 
http://bit.ly/38OVSdD
http://bit.ly/39Dgpn7
Comment 13 RW 2020-03-24 21:52:23 UTC
IMO the following two lines are in the wrong order

   return 0 if (length $subject < 10);  # don't match short subjects
   $subject =~ s/[^a-zA-Z]//g;          # only look at letters

The changes made from this thread and others have never actually addressed the original problem, which is that in:

Subject: KS - SWC =?UTF-8?B?57WQ5p6c?=

the Chinese characters count towards the minimum length, but are then stripped. This allows the rule to fire on a single remaining [A-Z] character.   

I suspect that this causes most, if not all, of the problems. It's not clear to me whether CHARSETS_LIKELY_TO_FP_AS_CAPS is anything more than a list of character sets where the problem happens to have been observed. It can be triggered with pure ASCII, e.g.

Subject: X [ 243, 346 ]
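
The effect of the ordering can be modelled in a few lines of Python. This is a sketch of only the two quoted statements plus the final uppercase comparison quoted in comment 14; it ignores whatever else the real Perl function does (for example charset handling), and it assumes the subject has already been decoded. Both function names are hypothetical.

```python
import re


def caps_check_current_order(subject: str) -> bool:
    """Length test on the raw subject first, then strip non-letters."""
    if len(subject) < 10:                           # counts digits, brackets, CJK...
        return False
    letters = re.sub(r"[^a-zA-Z]", "", subject)     # only look at letters
    return len(letters) > 0 and letters == letters.upper()


def caps_check_swapped_order(subject: str) -> bool:
    """Strip non-letters first, then apply the length test to what is left."""
    letters = re.sub(r"[^a-zA-Z]", "", subject)
    if len(letters) < 10:
        return False
    return letters == letters.upper()


for subject in ("X [ 243, 346 ]", "KS - SWC 結果", "BUY CHEAP WATCHES NOW"):
    print(subject, caps_check_current_order(subject), caps_check_swapped_order(subject))
```

With the current order the first two subjects pass the length test on characters that are later thrown away; with the swapped order only the subject that really consists of ten or more capital letters is flagged.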
Comment 14 Sidney Markowitz 2022-04-19 06:01:08 UTC
(In reply to RW from comment #13)
> IMO the following two lines are in the wrong order
> 
>    return 0 if (length $subject < 10);  # don't match short subjects
>    $subject =~ s/[^a-zA-Z]//g;          # only look at letters

That makes sense, but as an alternate fix, consider that the return line does check the length after stripping like this:

  return length($subject) && ($subject eq uc($subject));

We could make it length($subject) < n for some number that makes sense. I don't know what n should be, though, without experimenting with ruleqa with different amounts. It does have to be something less than 10, or else the current code makes no sense at all.
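
A rough Python model of this alternative, keeping the existing order but requiring at least n letters to remain after stripping. The function name is hypothetical, n is deliberately left as a required argument because no value has been chosen in this thread, and the sketch assumes an already decoded subject and ignores everything else the Perl function may do.

```python
import re


def caps_check_min_letters(subject: str, n: int) -> bool:
    """Existing order of checks, plus a floor of n remaining letters."""
    if len(subject) < 10:                           # don't match short subjects
        return False
    letters = re.sub(r"[^a-zA-Z]", "", subject)     # only look at letters
    # n = 1 behaves like the plain length($subject) truth test quoted above.
    return len(letters) >= n and letters == letters.upper()


print(caps_check_min_letters("X [ 243, 346 ]", n=1))
print(caps_check_min_letters("X [ 243, 346 ]", n=6))
print(caps_check_min_letters("BUY CHEAP WATCHES NOW", n=6))
```

Any n of 10 or more would make the first length test redundant, since a subject with that many letters is always at least 10 characters long; that is why n has to be smaller than 10 for the two checks to do different jobs.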