Bug 7314 - Bayes.pm, DECOMPOSE_BODY_TOKENS and Unicode
Summary: Bayes.pm, DECOMPOSE_BODY_TOKENS and Unicode
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: PC FreeBSD
Importance: P2 normal
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-27 15:10 UTC by azotov
Modified: 2016-06-15 23:35 UTC (History)



Attachment: suggested patch (text/plain), submitted by azotov@geolink-group.com [NoCLA]

Description azotov 2016-04-27 15:10:58 UTC
Created attachment 5387 [details]
suggested patch

SpamAssassin fails to generate the additional Bayes tokens "Foo", "foo!" and "foo" from an original token "Foo!" when that token contains Unicode characters from non-Latin scripts. This happens because \w in the Bayes.pm regexes matches only Latin letters and digits. As a consequence, almost all Cyrillic Unicode characters, for example, are deleted by s/[^\w:\*]//gs, leaving empty tokens or oddities such as "sk:" tokens.
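A minimal Python sketch of the same effect (an analogue, not the actual Bayes.pm code): Python's re.ASCII flag stands in for an ASCII-only \w, and the substitution mirrors the s/[^\w:\*]//gs cleanup described above.

```python
import re

token = "Привет!"  # a Cyrillic token with trailing punctuation

# ASCII-only \w: every Cyrillic letter is a non-word character,
# so the whole token is deleted along with the punctuation.
ascii_cleaned = re.sub(r"[^\w:*]", "", token, flags=re.ASCII)

# Unicode-aware \w keeps the letters and strips only the punctuation.
unicode_cleaned = re.sub(r"[^\w:*]", "", token)

print(repr(ascii_cleaned))    # ''
print(repr(unicode_cleaned))  # 'Привет'
```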

This problem can be corrected by the attached patch. I have little experience with Unicode in Perl, so there may be a better solution. The main idea is to make \w match any Unicode word character, not just Latin ones, and to replace [A-Z] with the more generic [[:upper:]].

Maybe it would be better to handle Unicode characters not just in the DECOMPOSE_BODY_TOKENS section but everywhere in the _tokenize_line sub. This idea was also mentioned in the discussion of Bug 7130. Too many regexes in the _tokenize_line sub currently misbehave on non-Latin Unicode characters: for example, splitting on "..." works only for Latin words, the regex in the IGNORE_TITLE_CASE section lowercases only the A-Z capitals, and so on.
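The title-case issue can be illustrated with a short Python analogue (again, not the actual Bayes.pm code): an [A-Z]-based pattern never matches capitals outside ASCII, while Unicode-aware case tests and case mapping handle them correctly.

```python
import re

word = "ПРИВЕТ"  # an all-caps Cyrillic word

# An [A-Z]-style class sees no ASCII capitals here, so any
# title-case/all-caps check built on it never fires.
assert re.fullmatch(r"[A-Z]+", word) is None

# Unicode-aware checks and lowercasing behave as expected.
assert word.isupper()
assert word.lower() == "привет"
```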
Comment 1 Mark Martinec 2016-06-15 23:35:31 UTC
> Maybe it is better to work with Unicode characters not just in
> DECOMPOSE_BODY_TOKENS section, but everywhere in _tokenize_line sub ...

Thanks. Yes, there are several problems still associated with the
historical assumption of single-byte characters. Some have been
addressed in the current trunk code, but more remain, like the one
reported here. To be addressed in the next major version (4.0) ...