Bug 7720 - Bayes plugin uses English specific stop words
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins
Version: 3.4 SVN branch
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2019-06-12 02:37 UTC by Shreyansh Shrivastava
Modified: 2020-06-17 14:39 UTC
Description Shreyansh Shrivastava 2019-06-12 02:37:54 UTC
When the Bayes plugin tokenizes a message, it ignores words shorter than 3 characters, along with a list of commonly occurring words, called stop words, that fall in a gray area (they do not affect spam detection). As I understand it, this is done mainly to speed up computation and reduce storage. But if a user's primary language is not English (for example, Spanish or French), the presence of English stop words in a mail is a strong indication of spam. For these users, it would be helpful if stop-word removal were configurable.
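To illustrate the request, here is a minimal Python sketch of a tokenizer with configurable stop-word removal. It is not SpamAssassin's code (the plugin is written in Perl); the names `STOP_WORDS`, `MIN_TOKEN_LENGTH`, `tokenize` and `stop_word_langs` are hypothetical, and the word lists are short made-up samples rather than the plugin's real lists.

```python
import re

# Hypothetical, abbreviated stop-word patterns keyed by language code.
# These are illustrative samples, not the lists SpamAssassin ships.
STOP_WORDS = {
    "en": re.compile(r"the|and|for|with|that|this|from"),
    "es": re.compile(r"los|las|para|con|que|por|una"),
}

# Assumption taken from the report: words shorter than 3 characters are ignored.
MIN_TOKEN_LENGTH = 3


def tokenize(text, stop_word_langs=("en",)):
    """Split text into lowercase word tokens, dropping short words and
    any word that fully matches a stop-word pattern of a selected language.
    Passing an empty tuple disables stop-word removal entirely."""
    tokens = []
    for word in re.findall(r"\w+", text.lower()):
        if len(word) < MIN_TOKEN_LENGTH:
            continue
        if any(STOP_WORDS[lang].fullmatch(word) for lang in stop_word_langs):
            continue
        tokens.append(word)
    return tokens


# A Spanish-speaking user could select only the Spanish list, so English
# stop words survive tokenization and can act as a signal.
english_default = tokenize("The offer is for you and me")
spanish_only = tokenize("The offer is for you and me", stop_word_langs=("es",))
```

With the English list selected, words such as "the" and "for" are dropped; with only the Spanish list selected, they are kept as tokens, which is the behaviour the report asks to make possible.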
Comment 1 Giovanni Bechis 2020-06-17 14:39:49 UTC
Fixed in trunk with revision 1878925.
At the moment, stop-word regexps are available for the en, es, fr, de and it languages.