Bug 6364 - Russian UTF-8 in TextCat
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins
Version: 3.3.0
Hardware: All All
Importance: P5 major
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Depends on: 7816
Reported: 2010-03-04 10:17 UTC by Mike Stupalov
Modified: 2020-05-14 14:09 UTC

Attachments:
Another Russian UTF-8 spam (type: application/mbox; status: None; submitter: jidanni@jidanni.org [NoCLA])

Description Mike Stupalov 2010-03-04 10:17:19 UTC
Please add support for UTF-8-encoded Russian to TextCat. Without it, detection of the Russian language does not work properly, and neither does "normalize_charset".

To solve the problem, a file ru.utf-8.lm needs to be added to the source.
This file, named ru-utf8.lm there, is available at:
http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/textcat

(The same location also has updated files for some other languages.)
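
To illustrate why a separate UTF-8 model file matters, here is a minimal sketch. It assumes the usual TextCat-style approach of rank-ordered byte n-gram profiles compared with an "out-of-place" distance; it is not SpamAssassin's actual code. The function names, sample sentences, and the `top` and `penalty` values are hypothetical, chosen only for the demonstration.

```python
from collections import Counter

def ngram_profile(data: bytes, max_n: int = 3, top: int = 50) -> list:
    """Return the `top` most frequent byte n-grams (length 1..max_n), most frequent first."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Sort by descending count, then by byte value so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc: list, model: list, penalty: int = 50) -> int:
    """Sum of rank differences; n-grams absent from the model cost a fixed penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model)}
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else penalty
        for rank, gram in enumerate(doc)
    )

# Hypothetical training text, turned into one model per encoding.
training = "Привет, это простое русское предложение для проверки."
utf8_model = ngram_profile(training.encode("utf-8"))
koi8_model = ngram_profile(training.encode("koi8-r"))

# A UTF-8 encoded Russian message, scored against both models (lower is closer).
message = ngram_profile("Это ещё одно русское письмо.".encode("utf-8"))
d_utf8 = out_of_place(message, utf8_model)
d_koi8 = out_of_place(message, koi8_model)
print("distance to UTF-8 model: ", d_utf8)
print("distance to KOI8-R model:", d_koi8)
```

Because the profiles are built from bytes, the same Russian text produces almost entirely different n-grams in KOI8-R and in UTF-8, so a UTF-8 message scores poorly against a KOI8-R model even though the language is the same.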

P.S. This problem with the Russian language is very old and unpleasant. I think the whole Russian-speaking community will be thankful for this small file :)
Comment 1 Mike Stupalov 2010-03-04 11:05:32 UTC
The repository mentioned above now also has UTF-8 TextCat models for Russian, French, Spanish, Italian and Chinese.
Comment 2 Henrik Krohns 2010-03-04 11:13:54 UTC
Also notice my bug.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

TextCat is currently broken with respect to both case handling and encoding handling. It should be completely revamped.
Comment 3 jidanni 2020-05-14 14:08:34 UTC
Created attachment 5702
Another Russian UTF-8 spam

Take the attached, obviously Russian, message "a".

Whether normalize_charset is 0 or 1 (the message is UTF-8 anyway), we get the same result:

X-Spam-Textcatresults: zh.gb2312:149533(1.00) zh.big5:151313(1.01)
	ko:152101(1.02) ja.shift-jis:152161(1.02) th:152504(1.02)
	ja.euc-jp:152931(1.02) hy:152988(1.02) ar.iso-8859-6:153918(1.03)
	am.utf-8:154133(1.03) ta:154349(1.03) mr:155147(1.04) hi:155343(1.04)
	ru.iso-8859-5:156383(1.05) uk.koi8-r:156736(1.05) vi:156983(1.05)
	ka:157040(1.05) bg.iso-8859-5:157425(1.05) ru.koi8-r:157592(1.05)
	ar.windows-1256:157731(1.05) pl:158285(1.06)

Why is ar.windows-1256 even tied with Russian?
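
For anyone reading the header above: the numbers in parentheses appear to be each candidate's score divided by the best (lowest) score. That is an inference from the values shown here, not something taken from the plugin documentation. A minimal sketch that parses the header value and recomputes the ratios (the regular expression and the function name are hypothetical helpers):

```python
import re

# Header value copied from the comment above.
HEADER = """zh.gb2312:149533(1.00) zh.big5:151313(1.01)
	ko:152101(1.02) ja.shift-jis:152161(1.02) th:152504(1.02)
	ja.euc-jp:152931(1.02) hy:152988(1.02) ar.iso-8859-6:153918(1.03)
	am.utf-8:154133(1.03) ta:154349(1.03) mr:155147(1.04) hi:155343(1.04)
	ru.iso-8859-5:156383(1.05) uk.koi8-r:156736(1.05) vi:156983(1.05)
	ka:157040(1.05) bg.iso-8859-5:157425(1.05) ru.koi8-r:157592(1.05)
	ar.windows-1256:157731(1.05) pl:158285(1.06)"""

# One entry looks like "ru.koi8-r:157592(1.05)": model name, score, reported ratio.
ENTRY = re.compile(r"([\w.-]+):(\d+)\(([\d.]+)\)")

def parse_textcat_results(value: str) -> list:
    """Return (model, score, reported_ratio) tuples in header order."""
    return [(m, int(s), float(r)) for m, s, r in ENTRY.findall(value)]

results = parse_textcat_results(HEADER)
best = min(score for _, score, _ in results)

for model, score, reported in results:
    # Assumption being checked: reported ratio == score / best score.
    print(f"{model:16} {score:7d} reported={reported:.2f} recomputed={score / best:.2f}")
```

If that reading is right, all twenty candidates lie within about 6% of the best score, so the "tie" between ar.windows-1256 and the Russian models may simply mean that none of the available models matches the UTF-8 text well.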