Bug 6037

Summary: Bayes-SQL improvements
Product: Spamassassin
Reporter: Thorsten Meinl <Thorsten>
Component: Learner
Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW ---    
Severity: enhancement    
Priority: P5    
Version: 3.2.4   
Target Milestone: Undefined   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: Patch for splitting the bayes_token table

Description Thorsten Meinl 2008-12-26 08:21:52 UTC
Created attachment 4410
Patch for splitting the bayes_token table

All Bayes tokens for all users are currently stored in one huge table (when Bayes data is kept in an SQL database). With several thousand users this becomes a bottleneck, especially for bayes_expire. The attached patch adds the option to split the token table into several tables. The table that holds a given user's tokens is looked up in bayes_vars, which gains an additional column, "token_table". New users are automatically assigned to a table based on the CRC32 checksum of their name (any other hash would have worked, but CRC32 was the easiest because it yields an integer from which a table number can be derived directly). With 10 tables instead of 1, the patch leads to considerably lower load on our machine, and bayes_expire now takes only about 5 hours instead of 20.
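
For illustration only, the assignment of a new user to a token table could look roughly like the sketch below. It is written in Python rather than the Perl used by the patch, and it is not taken from attachment 4410: the function names, the "bayes_token_" table prefix, and the use of a plain modulo are all assumptions made for the example.

```python
import zlib


def token_table_index(username: str, num_tables: int) -> int:
    """Map a username to a table number in the range 0..num_tables-1.

    Assumption: the table number is the CRC32 of the name modulo the
    number of tables. The real patch may derive the number differently.
    """
    if num_tables < 1:
        raise ValueError("num_tables must be at least 1")
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
    return zlib.crc32(username.encode("utf-8")) % num_tables


def token_table_name(username: str, num_tables: int,
                     prefix: str = "bayes_token_") -> str:
    """Build a hypothetical table name such as 'bayes_token_3'."""
    return f"{prefix}{token_table_index(username, num_tables)}"


if __name__ == "__main__":
    # The same name always maps to the same table, so the value can be
    # computed once and stored in bayes_vars.token_table for later lookups.
    for user in ("alice", "bob", "carol"):
        print(user, "->", token_table_name(user, 10))
```

Because the result is stored per user in bayes_vars, the hash only decides where a new user lands; changing the number of tables later would not move existing users unless their token_table value is updated as well.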
Comment 1 Michael Parker 2009-01-05 08:30:54 UTC
This is something best done in a new BayesStore module instead of patching the existing modules.
Comment 2 Justin Mason 2009-01-05 09:57:02 UTC
> This is something best done in a new BayesStore module instead of patching the
> existing modules.

agreed; ideally it should be possible to subclass the existing Bayes plugin, or hook into it in some similar way, to reuse as much of that code as possible.