SA Bugzilla – Bug 7720
Bayes plugin uses English specific stop words
Last modified: 2020-06-17 14:39:49 UTC
When the Bayes plugin tokenizes a message, it ignores words shorter than 3 characters, along with a list of commonly occurring words, called stop words, that fall in a gray area (they do not affect spam detection). As I understand it, this is done mainly to speed up computation and reduce storage. However, if a user's primary language is not English (e.g. Spanish or French), the presence of English stop words in a mail is a strong indication of spam. For these users, it would be helpful if the removal of stop words were configurable.
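To illustrate the request, here is a rough Python sketch of a tokenizer whose stop-word removal is selectable per language. It is not the plugin's actual code (SpamAssassin is written in Perl). The names `STOP_WORDS`, `MIN_TOKEN_LENGTH`, `tokenize` and `stop_word_langs` are hypothetical, and the word lists are short made-up samples.

```python
import re

# Hypothetical, abbreviated stop-word lists for illustration only;
# these are NOT the lists used by the Bayes plugin.
STOP_WORDS = {
    "en": {"the", "and", "for", "with", "this"},
    "es": {"que", "los", "las", "por", "con"},
}

# The report states that words with length < 3 are ignored.
MIN_TOKEN_LENGTH = 3


def tokenize(text, stop_word_langs=("en",)):
    """Return lowercase word tokens, dropping short words and the
    stop words of the selected languages (assumed option)."""
    ignored = set()
    for lang in stop_word_langs:
        ignored |= STOP_WORDS.get(lang, set())

    tokens = []
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if len(word) < MIN_TOKEN_LENGTH:
            continue  # too short to be kept
        if word in ignored:
            continue  # stop word for one of the selected languages
        tokens.append(word)
    return tokens
```

With this sketch, a Spanish-speaking user could pass `stop_word_langs=("es",)` so that English stop words such as "the" survive tokenization and can be learned as spam indicators.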
Fixed in trunk with revision 1878925. At the moment, regexps for the en, es, fr, de and it languages are available.