Bug 7819 - bayes is using usernames case-sensitive
Summary: bayes is using usernames case-sensitive
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd
Version: 3.4.4
Hardware: All All
Importance: P4 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-23 22:49 UTC by Benny Pedersen
Modified: 2024-01-22 01:15 UTC

Attachments:
Untested patch to have case insensitive usernames (type: patch; status: None; submitter: Giovanni Bechis [HasCLA])
Insert lowered usernames (type: patch; status: None; submitter: Giovanni Bechis [HasCLA])

Description Benny Pedersen 2020-05-23 22:49:45 UTC
So foo is not treated the same as fOo. Since email is case-insensitive, this can create two sets of Bayes learning data for the same user.
Comment 1 RW 2020-05-23 23:15:59 UTC
Domain names are not case-sensitive, but case sensitivity is optional for the local part of an address.
Comment 2 Benny Pedersen 2020-05-23 23:18:46 UTC
Is the problem only with PostgreSQL, then?
Comment 3 Benny Pedersen 2020-07-28 09:17:40 UTC
Could this get a higher priority?

While it's fixed in fuglu, it's not yet fixed in spamd/spamc.
Comment 4 Bill Cole 2020-07-28 20:11:16 UTC
I'm a bit confused by the "Component" change. If this is a Bayes issue as stated in the bug title, the "Component" should stay "Plugins" because that is how Bayes is implemented. The only way I can see that "spamc/spamd" would be correct is if you're not referring to Bayes at all, but to the per-user configuration (including Bayes DB) support in spamc/spamd. In either case, this is NOT A BUG but rather an enhancement request and I think it should be optional and non-default, because usernames (and virtual usernames) can be case-sensitive.

(In reply to Benny Pedersen from comment #2)
> Is the problem only with PostgreSQL, then?
I would think that if you are using an RDBMS you could fix this on the DB side by making the relevant field case-insensitive. In MySQL that's the default; in PostgreSQL it requires that the column type be CITEXT rather than TEXT. See https://www.postgresql.org/docs/current/citext.html for details.
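
As a rough illustration of the DB-side approach, here is a minimal Python sketch using SQLite's built-in NOCASE collation. SQLite is used only because it runs without a server; this is not the CITEXT mechanism, and the table and column names are made up for the example rather than taken from the SpamAssassin schema. Note that NOCASE folds ASCII letters only.

```python
import sqlite3

# Hypothetical table and column names, for illustration only;
# this is NOT the SpamAssassin Bayes schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_users ("
    " id INTEGER PRIMARY KEY,"
    " username TEXT NOT NULL UNIQUE COLLATE NOCASE)"
)
conn.execute("INSERT INTO demo_users (username) VALUES (?)", ("foo",))

# The comparison uses the column's NOCASE collation, so 'fOo' finds 'foo'.
row = conn.execute(
    "SELECT id, username FROM demo_users WHERE username = ?", ("fOo",)
).fetchone()

# A second insert that differs only in case violates the UNIQUE constraint.
try:
    conn.execute("INSERT INTO demo_users (username) VALUES (?)", ("FOO",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With the collation declared on the column, neither the INSERT nor the SELECT statements need to change, which is the appeal of handling this on the database side.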


(In reply to Benny Pedersen from comment #3)
> Could this get a higher priority?
I do not expect so. It would require substantial effort to "fix" and the behavior is a non-bug.

As RW says, local parts can be case-sensitive (with the exception of "postmaster") so it isn't formally wrong to treat 'fOo' and 'Foo' as different tokens. In fact it would be formally *wrong* to arbitrarily case-squash tokens just because they happen to be usernames. Not having examined the code for Bayes tokenization I cannot be certain, but I would expect that detection of usernames per se is not done. It would break an assumption of the "Naive Bayes" model of using simple classifiers that are stripped of context. 
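
To make the formal rule concrete, here is a small Python sketch of a strict address comparison: the domain is compared case-insensitively, the local part exactly, with "postmaster" as the one exception. The function name `same_mailbox` is hypothetical and nothing here is taken from the SpamAssassin code; it only illustrates the rule as described in this comment.

```python
def same_mailbox(a: str, b: str) -> bool:
    """Strict comparison: domain ignores case, local part does not."""
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    if domain_a.lower() != domain_b.lower():
        return False
    # "postmaster" is the one local part that matches regardless of case.
    if local_a.lower() == "postmaster" and local_b.lower() == "postmaster":
        return True
    return local_a == local_b


domains_differ_in_case = same_mailbox("foo@Example.COM", "foo@example.com")
locals_differ_in_case = same_mailbox("fOo@example.com", "foo@example.com")
postmaster_any_case = same_mailbox("PostMaster@example.com", "postmaster@EXAMPLE.com")
```

Under this rule only the second comparison fails, which is why squashing case on local parts has to be a deliberate site policy rather than a default.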

Patches welcome, of course. In my opinion, adding the behavior change for Bayes local-part tokens or per-user config usernames should be optional and non-default.
Comment 5 Giovanni Bechis 2020-07-30 09:19:10 UTC
Created attachment 5711 [details]
Untested patch to have case insensitive usernames

Untested patch to have case-insensitive usernames on PostgreSQL as well.
Comment 6 Benny Pedersen 2020-11-23 23:52:43 UTC
Thanks for the patch for PostgreSQL. Should the same be done for MySQL, LDAP, Berkeley DB, and SQLite?

If it were done in SpamAssassin core, it would not need to be database-specific.
Comment 7 Giovanni Bechis 2020-11-24 08:47:01 UTC
Created attachment 5732 [details]
Insert lowered usernames

A different patch that inserts usernames into Bayes as lowercase.
SELECT statements are left untouched because I do not know how compatible the SQL LOWER() function is across different databases.
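
For anyone evaluating this approach, here is a minimal Python sketch of the general idea, not of the attached patch itself. It assumes a case-sensitive username column (SQLite's default here) and uses made-up table and function names. The username is lowercased in application code before the INSERT, so no SQL LOWER() is needed, while the lookup compares whatever value it is handed.

```python
import sqlite3

# Hypothetical schema and helper names; NOT the SpamAssassin code or schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_vars (id INTEGER PRIMARY KEY, username TEXT UNIQUE)"
)


def insert_user(username: str) -> None:
    # Lowercase in the application, so the SQL stays portable.
    conn.execute(
        "INSERT OR IGNORE INTO demo_vars (username) VALUES (?)",
        (username.lower(),),
    )


def find_user(username: str):
    # Lookup left untouched: it compares the value exactly as given.
    return conn.execute(
        "SELECT id FROM demo_vars WHERE username = ?", (username,)
    ).fetchone()


insert_user("fOo")
insert_user("FOO")  # ignored: same lowercased name already stored
count = conn.execute("SELECT COUNT(*) FROM demo_vars").fetchone()[0]
mixed = find_user("fOo")            # no match on a case-sensitive column
lowered = find_user("fOo".lower())  # matches once the caller lowercases too
```

In this sketch only one row is stored, but a mixed-case lookup misses it unless the caller lowercases the name as well, or the column itself is case-insensitive. Whether that applies to the real patch depends on the database in use.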
Comment 8 Benny Pedersen 2024-01-22 01:15:24 UTC
Is this solved in trunk? Gentoo still does not have 4.x.x.