Bug 7118

Summary: BayesStore/Redis.pm: please add support for per user database
Product: SpamAssassin
Reporter: Marcin M <issues.apache.org>
Component: Learner
Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW
Severity: enhancement
CC: apache, billcole, issues.apache.org, kmcgrail
Priority: P2
Version: unspecified
Target Milestone: Future
Hardware: PC
OS: Linux
Whiteboard:

Description Marcin M 2015-01-13 16:45:49 UTC
It would be nice to have a fast Bayes engine that supports storing Bayes data per user/recipient, similar to the other storage backends.
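For reference, the Redis backend is currently configured globally in local.cf, along these lines (the values here are illustrative, not a recommendation):

bayes_store_module  Mail::SpamAssassin::BayesStore::Redis
bayes_sql_dsn       server=127.0.0.1:6379;database=2
bayes_token_ttl     21d
bayes_seen_ttl      8d

There is no per-user dimension in this configuration; that is what this request asks for.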
Comment 1 Kevin A. McGrail 2015-01-13 16:52:14 UTC
Because of the way Redis works as a hash lookup, would that be one database with an additional key value for the username, or are you envisioning tons of separate Redis DBs?
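For illustration, the two approaches could look like this at the key level (the key layout and values here are hypothetical):

$ # one shared database, username embedded in the key
$ redis-cli SET "bayes:alice:token:0a1b2c" "3,1,1421160000"
$ # separate numbered databases, one per user
$ redis-cli -n 5 SET "token:0a1b2c" "3,1,1421160000"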
Comment 2 AXB 2015-01-13 16:57:26 UTC
You may want to look into MySQL with the memcached engine.
Comment 3 Marcin M 2015-01-13 18:36:12 UTC
I'd prefer not to use MySQL. :) May I expect that the Redis storage will support per-user Bayes in, e.g., this year? Or should I rather choose another backend?
Kevin, if you are asking me, I can say that I don't know Redis well enough to answer. Also, I'm not sure whether Redis can work with e.g. 5k databases. Personally I'd prefer to have everything in one database, but maybe I'm still thinking the SQL way? :)
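If the separate-database route were taken, the number of numbered databases is controlled by the databases directive in redis.conf (16 by default); whether a value in the thousands is practical is exactly the open question:

# redis.conf: hypothetical sizing for ~5k users
databases 5000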
Comment 4 Mark Martinec 2015-01-13 18:49:59 UTC
Can you afford to keep all Bayes token sets and seen sets
for each of your users in memory? Is it worth the cost in memory? How long will
it take for a Redis server to reload or make a periodic dump?
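One way to measure both on a running instance, using standard redis-cli commands:

$ redis-cli info memory | grep used_memory_rss
$ redis-cli info persistence | egrep '(rdb_last_bgsave_time_sec|loading)'

used_memory_rss shows the resident set size, and INFO persistence reports how long the last RDB dump took and whether a dump is currently being loaded.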
Comment 5 Marcin M 2015-01-14 09:11:29 UTC
Good suggestion, Mark.
$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        639          0  non-token data: nspam
0.000          0       7577          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0          0          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

and Redis has 169688 kB RSS. So that gives:
169688 kB / (639+7577) tokens
≈ 20.65 kB per token. That's a rather high value.
Comment 6 Henrik Krohns 2015-01-14 09:46:41 UTC
(In reply to Marcin M from comment #5)
> and Redis has 169688 kB RSS. So that gives:
> 169688 kB / (639+7577) tokens
> ≈ 20.65 kB per token. That's a rather high value.

nspam/nham are learned message counts, not tokens.


$ redis-cli info | egrep '(_rss|db0:keys)'
used_memory_rss:21688320
db0:keys=198733,expires=198729,avg_ttl=2079826604

... about 109 bytes per token (key)

This is just a few hundred messages a day with ~1 month token TTL. So perhaps 10-20 MB of memory per user could be a low ballpark.
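Reproducing the arithmetic from the INFO output above with shell integer math:

$ echo $((21688320 / 198733))        # bytes of RSS per key
109
$ echo $((109 * 198733 / 1048576))   # ~MB for this user's ~200k tokens
20

which is consistent with the 10-20 MB per-user ballpark.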