Bug 293 - ok_languages contribution
Summary: ok_languages contribution
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 2.30CVS
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Daniel Quinlan
URL:
Whiteboard:
Keywords:
Duplicates: 229 250
Depends on:
Blocks:
 
Reported: 2002-05-09 00:51 UTC by Daniel Quinlan
Modified: 2002-06-15 03:58 UTC

Description Daniel Quinlan 2002-05-09 00:51:36 UTC
The new version is ready to be applied to your tree.

http://www.pathname.com/~quinlan/software/spamassassin/textcat.patch
http://www.pathname.com/~quinlan/software/spamassassin/lm.tar.gz

(Apply the patch first, then unpack the tarball; the tarball will overwrite
a few files that are in the patch, but the contents are the same.)

Changes:

  - all of the languages use ISO 639 names now (like locales) instead
    of English names
  - language model files are named uniformly
  - moved a few extremely unlikely languages (some are dead languages)
    into an "inactive" directory
  - added models for ja.iso-2022 and tr.iso-8859-9 (someone could
    probably generate better ones, but these seem to work pretty well)
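
As a rough illustration of the uniform naming, a model name such as ja.iso-2022 can be split into an ISO 639 language code and an optional charset label. This is only a sketch: the function name is hypothetical, and the rule that the first dot separates the two parts is an assumption inferred from the two model names mentioned above, not something taken from the patch.

```python
def split_model_name(name):
    """Split a model name into (language, charset); charset is None if absent.

    Assumption: the text before the first dot is the ISO 639 language code
    and anything after it is a charset label, as in "tr.iso-8859-9".
    """
    language, _, charset = name.partition(".")
    return language, (charset or None)


for model in ("ja.iso-2022", "tr.iso-8859-9", "en"):
    print(split_model_name(model))
```

The bare "en" case is included only to show the fallback; whether any model actually lacks a charset suffix is not stated here.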

One quick thing that I just noticed is the $opt_d in TextCat.pm.  It's
hard-coded to be a file in /usr/local.  Can you get that?

Generating good language models is harder than I expected.  I tried
replacing a few of the existing models that seem to be less accurate,
but never managed to come up with a model that was always better than
the old one.  (I had one that was better in most cases but worse in some;
my sample was too small, so I chickened out.)

Electronic books (plain-text format) seem to work best as input since they
are long and largely free of English words, technical terms, and
pretty-printing.  I avoided using spam as input since it's too short and
often bizarre, while the intent is accuracy on both good email and bad.

Let me know if you have any problems.
Comment 1 Daniel Quinlan 2002-05-09 00:59:49 UTC
Craig asked me to assign this ticket to him.
Comment 2 Daniel Quinlan 2002-05-09 01:00:39 UTC
Craig asked me to assign this ticket to him.  (second try)
Comment 3 Craig Hughes 2002-05-09 13:01:34 UTC
Re-assigning to me -- doesn't work for Daniel?
Comment 4 Daniel Quinlan 2002-05-10 13:04:44 UTC
*** Bug 229 has been marked as a duplicate of this bug. ***
Comment 5 Daniel Quinlan 2002-05-29 13:19:25 UTC
I might as well handle checking this in from my tree rather than submit a new
version of the patch.  Changes since last version:

- set default to "all" to avoid overhead of language checking unless the
  administrator wants it
- put score in non-GA section of 50_scores.cf for now (2.0 still seems about
  right; I compared it with other rules with about the same level of accuracy)
- languages file location is not hard-coded anymore

So, do you think it's ready to check in?  For my spam corpus of 6149 messages
(3122 spam and 4824 nonspam), UNDESIRED_LANGUAGE_BODY matches 87 spam and only
1 nonspam.  It causes no additional false positives, and false negatives (spam
messages that "get through") are reduced from 149 to 124 messages.
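
To put the match counts in perspective, the arithmetic can be sketched as below. It uses only the figures quoted above (87 spam hits, 1 nonspam hit, and spam getting through dropping from 149 to 124). The variable names are made up for illustration, and "precision" here just means the share of rule matches that were spam.

```python
spam_hits = 87      # spam messages matched by UNDESIRED_LANGUAGE_BODY
nonspam_hits = 1    # nonspam messages matched by the rule
missed_before = 149  # spam that got through without the rule
missed_after = 124   # spam that got through with the rule

# Share of the rule's matches that were actually spam.
precision = spam_hits / (spam_hits + nonspam_hits)

# Absolute and relative drop in spam that gets through.
missed_reduction = missed_before - missed_after
missed_reduction_pct = 100.0 * missed_reduction / missed_before

print(f"rule precision: {precision:.3f}")
print(f"fewer missed spam: {missed_reduction} ({missed_reduction_pct:.1f}%)")
```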
Comment 6 Daniel Quinlan 2002-05-30 19:26:49 UTC
okay, the code is now checked into the tree

final changes:

- remove "Subject:" from beginning of body text (causes message to look more
  like English than it should)
- require a 256-byte minimum before testing (removes several false positives
  exposed by the first change plus the 1 false positive we've had for a while,
  and we lose only one match)
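
The two final changes can be sketched roughly as follows. This is not the checked-in code, just a minimal Python illustration. The function and constant names are hypothetical, and it assumes the prefix match is case-sensitive and that the 256-byte minimum is measured on the UTF-8 encoding after the "Subject:" prefix is removed; neither detail is stated above.

```python
MIN_BODY_BYTES = 256  # minimum size mentioned in the comment above


def prepare_body(body):
    """Return the text to classify, or None if it is too short to test.

    Assumptions (not confirmed by the bug report): the body is a str,
    the "Subject:" match is case-sensitive, and the length check runs
    after the prefix has been stripped.
    """
    if body.startswith("Subject:"):
        # Drop the literal header name so it does not bias the text
        # toward looking like English.
        body = body[len("Subject:"):].lstrip()
    if len(body.encode("utf-8")) < MIN_BODY_BYTES:
        return None  # too little text for a reliable language guess
    return body
```

A caller would skip the language test whenever this returns None and pass the returned text to the classifier otherwise.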
Comment 7 Craig Hughes 2002-06-10 00:49:02 UTC
*** Bug 250 has been marked as a duplicate of this bug. ***