SA Bugzilla – Bug 6146
FPs with Oriental text: TVD_SPACE_RATIO etc.
Last modified: 2009-07-03 12:35:31 UTC
The following rules triggered on ham in the gb2312 character set: HTML_FONT_FACE_BAD, MIME_BASE64_TEXT, TVD_SPACE_RATIO.

I don't read Chinese myself, but some text/plain parts in such a character set have good reason to be base64-encoded. Also, it seems that Chinese text rarely uses spaces, which is sufficient to trigger TVD_SPACE_RATIO. Such email also hits Bayes, because all the Chinese email used to train it has been spam.

Maybe it should be checked whether enough Chinese ham is represented in the corpus, and whether parts in big5, gb2312, etc. should be excluded from TVD_SPACE_RATIO and MIME_BASE64_TEXT.
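
To illustrate the point about spaces, here is a minimal Python sketch comparing the share of space characters in an English sentence and a Chinese one. The `space_ratio` helper and both sample sentences are hypothetical, made up for this illustration; this is not the actual TVD_SPACE_RATIO implementation, and no threshold from the real rule is assumed.

```python
def space_ratio(text: str) -> float:
    """Return the fraction of characters in text that are ASCII spaces.

    Hypothetical helper for illustration only; not SpamAssassin code.
    """
    if not text:
        return 0.0
    return text.count(" ") / len(text)


# Made-up sample sentences with roughly the same meaning.
english = "Please find the meeting notes attached to this message."
chinese = "请查收本邮件所附的会议记录。"

# Round-trip through gb2312 to mimic how such a text/plain part
# would arrive on the wire and be decoded before inspection.
decoded = chinese.encode("gb2312").decode("gb2312")

print(f"English space ratio: {space_ratio(english):.3f}")
print(f"Chinese space ratio: {space_ratio(decoded):.3f}")
```

Any check that expects a typical Western proportion of spaces would see the Chinese sample as an outlier, even though it is ordinary prose.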