Bug 5859

Summary:	SUBJ_ALL_CAPS got fired up on non-latin subjects with some latin characters which are all capital
Product:	Spamassassin	Reporter:	lee_yiu_chung
Component:	Rules	Assignee:	SpamAssassin Developer Mailing List <dev>
Status:	NEW ---
Severity:	normal	CC:	davej, dramitsharma0909, frank.urban, jidanni, kmcgrail, lee_yiu_chung, luther.blissett, ramtin.beheshti, rwmaillists, sidney, temnota.am
Priority:	P2
Version:	3.3.1
Target Milestone:	Undefined
Hardware:	All
OS:	All
Whiteboard:

Description lee_yiu_chung 2008-03-20 09:24:04 UTC

I just found that SUBJ_ALL_CAPS fired on this header:

Subject: KS - SWC =?UTF-8?B?57WQ5p6c?=

This is a Chinese subject that contains capital letters, which are a site code used by my company (KS - SWC 結果). I think this rule should be tightened to avoid such cases. It is fairly common for non-Latin subjects to contain some capital letters, which are normally codes of some sort.
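
To see what a letters-only check is left with for this header, the encoded-word can be decoded and then stripped to ASCII letters. This is only an illustrative sketch using Perl's core Encode module; it is not SpamAssassin's own header handling, and the variable names are made up for the example:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the RFC 2047 encoded-word in the reported Subject header.
my $raw     = 'KS - SWC =?UTF-8?B?57WQ5p6c?=';
my $subject = decode('MIME-Header', $raw);

# Keep only ASCII letters, the way a [^a-zA-Z] filter would.
(my $latin = $subject) =~ s/[^a-zA-Z]//g;

binmode(STDOUT, ':encoding(UTF-8)');
printf "decoded subject:    %s\n", $subject;
printf "latin letters only: %s (%d letters)\n", $latin, length $latin;
```

Only the five letters of the site code survive the filter, and they are all upper case, even though most of the subject's meaning is carried by the Chinese text.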

Comment 1 jidanni 2009-04-22 00:04:41 UTC

Me too. For details see http://news.gmane.org/find-root.php?message_id=87ws9pzguf.fsf@jidanni.org .
It was a mere RE: instead of Re:, and they got slammed for it.

Comment 2 Mark Martinec 2010-03-31 14:52:42 UTC

Copied from Bug 6398 comment 2:

Luther:
When sending a mail with an Arabic subject (e.g. بسم الله الرحمن الرحيم), a reply
or forward causes the SUBJ_ALL_CAPS pattern to match (e.g. "AW: بسم الله الرحمن
الرحيم" or "FWD: بسم الله الرحمن الرحيم").

This might also be the case for other languages (e.g. Hindi, Thai, etc.)


Mark:
Here is the attached header field sample:
  Subject: =?utf-8?Q?AW:_=D8=A7=D9=84=D9=84=D9=87_=D9=83=D8=A8=D8=B1?=

Should the CHARSETS_LIKELY_TO_FP_AS_CAPS in Constants.pm include 'utf-8'?

Shouldn't the "=hh" entities be exempt from QP encoded strings entirely?

Also, shouldn't the B-encoded (base64) MIME strings be exempt entirely
from this test?

Comment 3 Mark Martinec 2010-03-31 14:53:28 UTC

*** Bug 6398 has been marked as a duplicate of this bug. ***

Comment 4 John Wilcock 2010-03-31 15:32:26 UTC

IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in latin characters that match IsUpper (regardless of charset).

I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a horrible workaround to me!

Comment 5 Andrey Melnikov 2010-04-02 18:20:57 UTC

Also, subjects with charset windows-1251 or windows-1252 show the same issue.

Comment 6 jidanni 2010-07-06 04:59:51 UTC

I forgot that there was this bug on file and instead posted...
http://permalink.gmane.org/gmane.mail.spam.spamassassin.general/129541

Comment 7 jidanni 2010-07-10 22:25:53 UTC

What would be a good workaround for one's user_prefs?
All I found for a model was
./blib/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm:895:sub subject_is_all_caps {

Comment 8 Frank Urban 2011-01-24 08:33:08 UTC

(In reply to comment #4)
> IMO SUBJ_ALL_CAPS should *only* fire if the subject is entirely in latin
> characters that match IsUpper (regardless of charset).
> 
> I haven't looked at the code, but CHARSETS_LIKELY_TO_FP_AS_CAPS sounds like a
> horrible workaround to me!

I agree with this.
CHARSETS_LIKELY_TO_FP_AS_CAPS should be renamed to
CHARSETS_FOR_CHECK_AS_CAPS,
and only Latin characters should be included in this rule.

Comment 9 Kevin A. McGrail 2011-10-28 20:58:40 UTC

Good feature to fix but pushing to 3.4.1

Comment 10 Kevin A. McGrail 2015-04-13 22:33:54 UTC

I think it's inevitable that some rules will hit on ham. So perhaps putting a ceiling on this rule and moving it to the sandbox to see if it is autopromoted would be better than trying to fix it? Ruleqa suggests it is a 75% indicator of spam, but perhaps some real-world reports are saying it needs to be artificially capped.

Pushing to 3.4.2 since this might require code changes to support the fix.

Comment 11 Kevin A. McGrail 2018-09-04 15:37:29 UTC

This is a rules issue, not release-specific. Anyone want to look at the ruleqa s/o for this rule? Dave, perhaps we need a score ceiling for it?

Comment 13 RW 2020-03-24 21:52:23 UTC

IMO the following two lines are in the wrong order

   return 0 if (length $subject < 10);  # don't match short subjects
   $subject =~ s/[^a-zA-Z]//g;          # only look at letters

The changes made from this thread and others have never actually addressed the original problem, which is that in:

Subject: KS - SWC =?UTF-8?B?57WQ5p6c?=

the Chinese characters count towards the minimum length, but are then stripped. This allows the rule to fire on a single remaining [A-Z] character.

I suspect that this causes most, if not all, of the problems. It's not clear to me whether CHARSETS_LIKELY_TO_FP_AS_CAPS is anything more than a list of character sets where the problem happens to have been observed. It can be triggered with pure ASCII, e.g.

Subject: X [ 243, 346 ]
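
The effect of the ordering can be sketched with two stripped-down versions of the check. These are not the real subject_is_all_caps from HeaderEval.pm, which does more than this; the sub names are hypothetical and only the two quoted lines plus a final upper-case comparison are modelled:

```perl
use strict;
use warnings;

# Order as quoted above: length is tested before non-letters are stripped.
sub caps_check_current_order {
    my ($subject) = @_;
    return 0 if length($subject) < 10;   # counts digits, punctuation, non-Latin text
    $subject =~ s/[^a-zA-Z]//g;          # only look at letters
    return (length($subject) && $subject eq uc($subject)) ? 1 : 0;
}

# Proposed order: strip first, so the length test sees only the letters.
sub caps_check_swapped_order {
    my ($subject) = @_;
    $subject =~ s/[^a-zA-Z]//g;          # only look at letters
    return 0 if length($subject) < 10;   # now counts letters only
    return ($subject eq uc($subject)) ? 1 : 0;
}

for my $s ('X [ 243, 346 ]', "KS - SWC \x{7D50}\x{679C}", 'BUY CHEAP WATCHES NOW') {
    printf "current=%d swapped=%d\n",
        caps_check_current_order($s), caps_check_swapped_order($s);
}
```

In this sketch the first two subjects fire only with the current order, while a genuinely all-caps subject with at least ten letters fires with both.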

Comment 14 Sidney Markowitz 2022-04-19 06:01:08 UTC

(In reply to RW from comment #13)
> IMO the following two lines are in the wrong order
> 
>    return 0 if (length $subject < 10);  # don't match short subjects
>    $subject =~ s/[^a-zA-Z]//g;          # only look at letters

That makes sense, but as an alternate fix, consider that the return line does check the length after stripping like this:

  return length($subject) && ($subject eq uc($subject));

We could change that to length($subject) >= n for some number that makes sense. I don't know what n should be, though, without experimenting with ruleqa with different values. It does have to be something less than 10, or else the current code makes no sense at all.
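
A rough sketch of that alternative, for experimenting with different values of n. The sub name and the values tried below are assumptions for illustration only; nothing in this thread settles what n should be, and the real function contains more checks than are shown here:

```perl
use strict;
use warnings;

# Hypothetical variant: keep the existing pre-strip length test, but also
# require at least $min_letters letters to remain after stripping.
sub caps_check_min_letters {
    my ($subject, $min_letters) = @_;
    return 0 if length($subject) < 10;   # don't match short subjects
    $subject =~ s/[^a-zA-Z]//g;          # only look at letters
    return (length($subject) >= $min_letters && $subject eq uc($subject)) ? 1 : 0;
}

# Try a few candidate thresholds against subjects from this report.
for my $n (4, 6) {
    printf "n=%d: X=%d KS-SWC=%d shouting=%d\n", $n,
        caps_check_min_letters('X [ 243, 346 ]', $n),
        caps_check_min_letters("KS - SWC \x{7D50}\x{679C}", $n),
        caps_check_min_letters('BUY CHEAP WATCHES NOW', $n);
}
```

One thing this makes visible: the original report leaves five letters (KSSWC) after stripping, so any n of 5 or less would still let the rule fire on it.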