SA Bugzilla – Bug 293
ok_languages contribution
Last modified: 2002-06-15 03:58:08 UTC
The new version is ready to be applied to your tree.

http://www.pathname.com/~quinlan/software/spamassassin/textcat.patch
http://www.pathname.com/~quinlan/software/spamassassin/lm.tar.gz

(Apply the patch first, then unpack the tarball; the tarball will overwrite a few files that are in the patch, but the contents are the same.)

Changes:
- all of the languages use ISO 639 names now (like locales) instead of English names
- language model files are named uniformly
- moved a few extremely unlikely languages (some are dead languages) into an "inactive" directory
- added models for ja.iso-2022 and tr.iso-8859-9 (someone could probably generate better ones, but these seem to work pretty well)

One quick thing that I just noticed is the $opt_d in TextCat.pm. It's hard-coded to be a file in /usr/local. Can you get that?

Generating good language models is harder than I expected. I tried replacing a few of the existing models that seem to be less accurate, but never managed to come up with a model that was always better than the old one. (I had one better in most cases, but worse in some, but my sample is too small so I chickened out.) Electronic books (text format) seem to work best since they are long and more devoid of English and technical words and pretty-printing. I avoided using spam as inputs since it's too short and often bizarre while the intent is accuracy (matching both good email and bad).

Let me know if you have any problems.
Craig asked me to assign this ticket to him.
Craig asked me to assign this ticket to him. (second try)
Re-assigning to me -- doesn't work for Daniel?
*** Bug 229 has been marked as a duplicate of this bug. ***
I might as well handle checking this in from my tree rather than submit a new version of the patch.

Changes since last version:
- set default to "all" to avoid overhead of language checking unless the administrator wants it
- put score in non-GA section of 50_scores.cf for now (2.0 still seems about right, I compared it with other rules with about the same level of accuracy)
- languages file location is not hard-coded anymore

So, do you think it's ready to check in?

For my spam corpus of 6149 messages (3122 spam and 4824 nonspam), UNDESIRED_LANGUAGE_BODY matches 87 spam and only 1 nonspam. It causes no additional false positives, and false negatives (spam messages that "get through") are reduced from 149 to 124 messages.
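For readers who want the figures in the comment above as ratios, here is a small arithmetic sketch. It uses only the numbers quoted there (87 spam hits, 1 nonspam hit, missed spam going from 149 to 124); the variable names are illustrative and not taken from SpamAssassin.

```python
# Figures quoted in the comment above.
spam_hits = 87        # spam messages matched by UNDESIRED_LANGUAGE_BODY
nonspam_hits = 1      # nonspam messages matched by the same rule
missed_before = 149   # spam that "got through" without the rule
missed_after = 124    # spam that "got through" with the rule

# Fraction of the rule's matches that were actually spam.
rule_precision = spam_hits / (spam_hits + nonspam_hits)

# Absolute and relative drop in spam that gets through.
missed_reduction = missed_before - missed_after
relative_reduction = missed_reduction / missed_before

print(f"rule precision: {rule_precision:.4f}")
print(f"missed spam reduced by {missed_reduction} ({relative_reduction:.1%})")
```

Only some of the 87 matched spam messages change the final verdict, presumably because the others were already scored as spam by other rules.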
Okay, the code is now checked into the tree.

Final changes:
- remove "Subject:" from beginning of body text (causes message to look more like English than it should)
- require 256 byte minimum before testing (removes several false positives exposed by first change plus the 1 false positive we've had for a while, and we only lose one match)
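The two final changes amount to a small pre-check before language classification. The sketch below illustrates the idea in Python; it is not the checked-in Perl code. The function name, the choice to measure length after stripping the label, the whitespace trimming, and the use of UTF-8 for byte counting are all assumptions made for illustration.

```python
from typing import Optional

MIN_BODY_BYTES = 256          # minimum size mentioned in the comment above
SUBJECT_LABEL = "Subject:"    # label that skews short texts toward English


def prepare_for_language_check(body_text: str) -> Optional[str]:
    """Return the text to classify, or None if it is too short to test.

    Hypothetical helper: strips a leading "Subject:" label, then applies
    the 256-byte minimum to what remains (ordering is an assumption).
    """
    text = body_text
    if text.startswith(SUBJECT_LABEL):
        # Drop only the label itself; the subject words stay in the text.
        text = text[len(SUBJECT_LABEL):].lstrip(" \t")
    # Count bytes rather than characters (UTF-8 is assumed here).
    if len(text.encode("utf-8")) < MIN_BODY_BYTES:
        return None
    return text
```

A caller would skip the language rule entirely when the helper returns None, which is how short messages avoid being misclassified.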
*** Bug 250 has been marked as a duplicate of this bug. ***