Bug 6146 - FPs with Oriental text: TVD_SPACE_RATIO etc.
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (Eval Tests)
Version: 3.2.3
Hardware: All All
Importance: P5 minor
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-03 12:35 UTC by Cedric Knight
Modified: 2009-07-03 12:35 UTC



Description Cedric Knight 2009-07-03 12:35:31 UTC
The following rules triggered on ham in gb2312 character set:
HTML_FONT_FACE_BAD, MIME_BASE64_TEXT, TVD_SPACE_RATIO

I don't read Chinese myself, but text/plain parts in such a character set can have good reason to be base64-encoded.  Also, it seems that Chinese text rarely contains spaces, and that alone is sufficient to trigger TVD_SPACE_RATIO.
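
To illustrate the space point, here is a rough sketch in Python (not SpamAssassin's Perl, and not the actual TVD_SPACE_RATIO test, whose exact logic and thresholds I have not reproduced). The function name `space_ratio` and both sample strings are made up for illustration; the metric is simply the share of characters that are ASCII spaces.

```python
def space_ratio(text: str) -> float:
    """Fraction of non-newline characters that are ASCII spaces.

    Illustrative metric only; this is an assumption about the kind of
    measurement a space-ratio rule makes, not the real rule.
    """
    chars = [c for c in text if c not in "\r\n"]
    if not chars:
        return 0.0
    return sum(c == " " for c in chars) / len(chars)


# Hypothetical sample sentences, chosen only to show the contrast.
english = "The quick brown fox jumps over the lazy dog"
chinese = "敏捷的棕色狐狸跳过了那只懒狗"

print(space_ratio(english))  # roughly 0.19 (8 spaces in 43 characters)
print(space_ratio(chinese))  # 0.0, the sentence contains no spaces
```

Any rule that treats a very low space ratio as suspicious would therefore fire on ordinary Chinese prose, ham or spam alike.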

I also find that such email hits the Bayes rules, because all the Chinese email used to train it has been spam.  It may be worth checking whether enough Chinese ham is represented in the corpus, and also excluding big5, gb2312 etc. parts from TVD_SPACE_RATIO and MIME_BASE64_TEXT.
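
The exclusion suggested above could be sketched roughly as follows. This is Python using the standard `email` package, not SpamAssassin code; the charset list `CJK_CHARSETS` and the helper `rules_to_skip` are hypothetical names of mine, and which charsets belong in the list is an assumption that would need review.

```python
from email.message import EmailMessage

# Assumed list of charsets to exempt; "etc." in the report is not spelled out.
CJK_CHARSETS = {"gb2312", "big5"}

# The two rules proposed for exclusion in this report.
EXEMPT_RULES = {"TVD_SPACE_RATIO", "MIME_BASE64_TEXT"}


def rules_to_skip(part) -> set:
    """Return the rule names to skip for this MIME part (hypothetical helper)."""
    charset = part.get_content_charset()  # lowercased charset name, or None
    if charset is not None and charset in CJK_CHARSETS:
        return set(EXEMPT_RULES)
    return set()


# Build a base64-encoded text/plain part in gb2312, like the ham described above.
msg = EmailMessage()
msg.set_content("你好,世界", charset="gb2312", cte="base64")

print(msg["Content-Transfer-Encoding"])  # base64
print(sorted(rules_to_skip(msg)))        # both rule names
```

Keying the exemption on the declared charset is the simplest option, but it trusts the sender's header, so spammers could declare gb2312 to dodge these rules. Lowering the scores for such parts instead of skipping the rules outright might be a safer variant.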