Bug 7674 - sa-learn learns all messages as ham even if --spam is specified
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 3.4.2
Hardware: All All
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2019-01-01 21:56 UTC by Ralf Glauberman
Modified: 2019-01-13 14:29 UTC



Description Ralf Glauberman 2019-01-01 21:56:00 UTC
When learning messages from a folder with "sa-learn --spam", the messages are in fact learned as ham instead of spam.

Debug log:
Jan  1 22:42:12.185 [19522] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x56193c244a38), bayes_store_module=Mail::SpamAssassin::BayesStore::SQL
Jan  1 22:42:12.204 [19522] dbg: bayes: using username: XXX
Jan  1 22:42:12.204 [19522] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::SQL=HASH(0x56193cd9a998)
Jan  1 22:42:12.217 [19522] dbg: bayes: database connection established
Jan  1 22:42:12.218 [19522] dbg: bayes: found bayes db version 3
Jan  1 22:42:12.218 [19522] dbg: bayes: Using userid: 4
Jan  1 22:42:12.219 [19522] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200
Jan  1 22:42:12.221 [19522] dbg: sa-learn: spamtest initialized
Jan  1 22:42:12.221 [19522] dbg: learn: initializing learner
Jan  1 22:42:12.221 [19522] dbg: bayes: bayes journal sync starting
Jan  1 22:42:12.221 [19522] dbg: bayes: bayes journal sync completed
Jan  1 22:42:12.221 [19522] dbg: bayes: expiry starting
Jan  1 22:42:12.222 [19522] dbg: bayes: database connection established
Jan  1 22:42:12.222 [19522] dbg: bayes: found bayes db version 3
Jan  1 22:42:12.223 [19522] dbg: bayes: Using userid: 4
Jan  1 22:42:12.234 [19522] dbg: bayes: DB expiry: tokens in DB: 430, Expiry max size: 150000, Oldest atime: 1546240630, Newest atime: 1546309979, Last expire: 0, Current time: 1546378932
Jan  1 22:42:12.236 [19522] dbg: bayes: expiry completed
Jan  1 22:42:12.238 [19522] dbg: learn: learning ham
Jan  1 22:42:12.258 [19522] dbg: bayes: tokenized body: 3 tokens
Jan  1 22:42:12.258 [19522] dbg: bayes: tokenized uri: 0 tokens
Jan  1 22:42:12.258 [19522] dbg: bayes: tokenized invisible: 0 tokens
Jan  1 22:42:12.261 [19522] dbg: bayes: tokenized header: 159 tokens
Jan  1 22:42:12.355 [19522] dbg: bayes: seen (6fbb589c1d2d27cf8a150d8345ff08c53ec827fa@sa_generated) put
Jan  1 22:42:12.356 [19522] dbg: bayes: learned '6fbb589c1d2d27cf8a150d8345ff08c53ec827fa@sa_generated', atime: 1546309979

Note the line "dbg: learn: learning ham"

The nham/nspam numbers from "sa-learn --dump magic" confirm that the message is in fact learned as ham and not as spam as intended.

Messages seem to be learned correctly if they are learned via autolearn rather than through the sa-learn script.

Bayes data is stored in a MySQL database backend, in case that is relevant.

System is Gentoo Linux x86_64 with the latest distribution SpamAssassin package (spamassassin-3.4.2-r2).
Comment 1 Bill Cole 2019-01-01 22:49:36 UTC
I cannot reproduce this with flat-file (BDB 5.3) Bayes. 

What is the exact command line you are using to run sa-learn and generate that log?
Comment 2 Ralf Glauberman 2019-01-02 17:10:49 UTC
Thanks for your response. I have looked into it a bit more, and it seems the learner is fine but the command-line parsing does not behave as I expected:

sa-learn test.eml --debug
--> Error reported (since no command is specified)

sa-learn --spam test.eml --debug
--> Mail is correctly learned as spam

sa-learn test.eml --spam --debug
--> Mail is learned as ham and no error is reported

I assumed that the order of parameters would not matter, and I find it a bit confusing that the parameter is on the one hand detected (no error is reported) but on the other hand ignored. Maybe a simple sanity check could be added?
Comment 3 Bill Cole 2019-01-02 20:33:51 UTC
I still can't reproduce the exact problem with the given command line, so that is apparently an artifact of the storage backend, the configuration, or the input. It may be helpful to take this problem to the SpamAssassin Users mailing list, where others with a diverse range of configurations can assist. 

HOWEVER: I will not (yet) unilaterally close this bug as "WORKSFORME" because even though I have been unable to reproduce the behavior, I strongly suspect that it is related to bug 7675, an issue that people have been working around practically forever rather than properly documenting and/or fixing. 

*** WORKAROUNDS *** 

Always give the -D/--debug option an explicit set of debug channels: either 'all' or a comma-delimited list.
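
As a hedged illustration of that workaround, the wrapper below always passes an explicit channel list to --debug. The channel names bayes and learn are an assumption taken from the prefixes in the debug log in the description; learn_spam_debug and the SA_LEARN variable are hypothetical names introduced only for this sketch.

```shell
#!/bin/sh
# SA_LEARN is a hypothetical override hook for dry runs; by default the
# real sa-learn binary is used.
SA_LEARN=${SA_LEARN:-sa-learn}

# Learn the given files as spam with an explicit debug channel list, so
# --debug is never left without a value.
# Assumption: "bayes" and "learn" are valid channel names (they match the
# prefixes seen in the debug log).
learn_spam_debug() {
    "$SA_LEARN" --spam --debug bayes,learn "$@"
}

# Example: learn_spam_debug test.eml
```

Replacing the channel list with 'all' gives the other form mentioned above.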
Comment 4 Ralf Glauberman 2019-01-13 14:29:13 UTC
Sorry for the delay, needed to debug it in more detail...

I still don't know why you are unable to reproduce the bug, but I am sure it is not related to the debug switch. I was able to reproduce the problem with a clean install and the default configuration (i.e. no MySQL or anything). To debug the problem I added the following line to sa-learn in the wanted function (at about line 576):

+  warn "learning $id as $class\n";
   my $status = $spamtest->learn( $ma, undef, $spam, $forget );

Executing the following command then returns:

./bin/sa-learn --spam test.eml --ham test2.eml --spam test3.eml
learning test.eml as s
learning test3.eml as s
learning test2.eml as h
Learned tokens from 2 message(s) (3 message(s) examined)

=> Works as intended

To allow spam and ham messages to be learned in a single run, the command line is parsed in order and each message to learn is added to the targets array by the target function. That function uses the global isspam variable to decide whether the message should be learned as spam or as ham, and the variable is set whenever GetOptions reads a --spam/--ham command line parameter. This means, however, that if a message file name is found on the command line before any --spam/--ham flag, the isspam variable has not been set by the time target is called. Since an unset variable evaluates as false in Perl, the statement "my $class = ( $isspam ? "spam" : "ham" );" results in the message being learned as ham.

./bin/sa-learn test.eml --ham test2.eml --spam test3.eml
learning test3.eml as s
learning test.eml as h
learning test2.eml as h
Learned tokens from 2 message(s) (3 message(s) examined)

Or with just one message:
./bin/sa-learn test.eml --spam
learning test.eml as h
Learned tokens from 0 message(s) (1 message(s) examined)
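
The order dependence shown in these runs can be modelled with a small shell sketch. This is a toy model, not sa-learn's code: classify_targets is a hypothetical helper that only mimics "use the most recent --spam/--ham seen so far", and it ignores every other option.

```shell
#!/bin/sh
# Toy model of the order-dependent parsing described above.
# NOT sa-learn's implementation; classify_targets is a hypothetical name.
classify_targets() {
    isspam=""                   # nothing seen yet, like the unset flag
    for arg in "$@"; do
        case "$arg" in
            --spam) isspam=1 ;;
            --ham)  isspam=0 ;;
            -*)     ;;          # other options are ignored in this model
            *)
                # An empty or 0 flag falls through to "ham", mirroring
                # ( $isspam ? "spam" : "ham" ).
                if [ "$isspam" = 1 ]; then
                    echo "$arg: spam"
                else
                    echo "$arg: ham"
                fi
                ;;
        esac
    done
}

classify_targets test.eml --ham test2.eml --spam test3.eml
# prints:
#   test.eml: ham
#   test2.eml: ham
#   test3.eml: spam
```

The model prints targets in command-line order and makes no attempt to reproduce the order in which sa-learn processes them; only the class assigned to each file is meant to match the runs above.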

I think the program should check that --spam/--ham has been seen on the command line before any message file name. The documentation should also be updated to make clear that both spam and ham can be learned during a single execution, but that the relevant flag has to come before the file name.
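
Until such a check exists in sa-learn itself, the idea can be sketched as a pre-flight test in a wrapper script. The function name check_class_before_files is hypothetical, and the sketch rests on a simplifying assumption: every argument that does not start with a dash is treated as a message file, so options that take a separate value would need extra handling.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of sa-learn: returns non-zero if
# a file name appears before the first --spam/--ham on the command line.
# Assumption: any argument not starting with "-" is a message file.
check_class_before_files() {
    seen_class=0
    for arg in "$@"; do
        case "$arg" in
            --spam|--ham) seen_class=1 ;;
            -*) ;;              # other options are skipped in this sketch
            *)
                if [ "$seen_class" -eq 0 ]; then
                    echo "error: '$arg' appears before --spam/--ham" >&2
                    return 1
                fi
                ;;
        esac
    done
    return 0
}

# Intended use inside a wrapper script:
#   check_class_before_files "$@" && sa-learn "$@"
```

With this guard, an invocation such as "test.eml --spam" is rejected with an error instead of silently being learned as ham.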

Sorry for not providing a patch, but I don't know Perl (read only).