Bug 4152 - RFE: train language identification with proper input data
Summary: RFE: train language identification with proper input data
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: HP Linux
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-02-23 22:33 UTC by era eriksson
Modified: 2007-02-02 05:37 UTC

Description era eriksson 2005-02-23 22:33:43 UTC
The language sample files which were used to create the input data for the
TextCat module under lm/ were probably never really intended for use in a
production environment. The amount of data is probably roughly the correct
order of magnitude -- there are about 5,000 words per language -- but there
are weird biases and blind spots. I imagine these were just quickly scraped
together to produce Gertjan's original TextCat demo.

I did a quick spot check of lm/sv.lm and noticed that substrings of the name
Diktonius get incredibly high scores. True, the poet Elmer Diktonius wrote in
Swedish, but he was from Finland, and it's not at all impossible that somebody
would write about him in just about any language; right now, any message with
the word "Diktonius" in it is virtually guaranteed to be classified as Swedish.
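
The spot check can be reproduced along these lines (a sketch in Python;
it assumes the classic TextCat .lm layout of one n-gram and its count per
line, most frequent first, with "_" standing in for a space, and
ISO-8859-1 encoding):

    # print the rank of every n-gram in lm/sv.lm that is a substring
    # of "Diktonius" ("_" in the model file means a space)
    with open("lm/sv.lm", encoding="iso-8859-1") as f:
        for rank, line in enumerate(f, start=1):
            if not line.strip():
                continue
            raw = line.split()[0]
            ngram = raw.replace("_", " ").strip()
            if ngram and ngram.lower() in "diktonius":
                print(rank, raw)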

There are 6 occurrences of this name in ShortTexts/swedish.txt, which ties it
with two genuine stop words for the position of 10th most common word in the
Swedish language (see the sketch after the table for how these counts were
obtained).

     22 en        "a" (real gender)
     18 i         "in"
     14 och       "and"
     13 som       "which"
     13 av        "of"
     11 den       "the" (real gender)
      9 på        "on"
      9 att       "to" (as in "to count words for fun and profit*")
      8 till      "to" (as in "from here to there")
      6 har       "has"
      6 för       "for"
      6 Diktonius "Hemingway"
      5 är        "is"
      5 sin       "his/her/its"
      5 med       "with"
      4 vid       "by"
      4 vi        "we"
      4 En        "A" (real gender)

* and academic credentials, if you're good at it.
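
The table can be regenerated with a few lines of Python (a sketch; it
assumes the sample file is available at this path and is ISO-8859-1
encoded, as the TextCat samples usually are; case is preserved, which is
why "en" and "En" count separately):

    from collections import Counter
    import re

    text = open("ShortTexts/swedish.txt", encoding="iso-8859-1").read()
    words = re.findall(r"[^\W\d_]+", text)   # letter runs; keeps å/ä/ö
    for word, n in Counter(words).most_common(18):
        print(f"{n:7d} {word}")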

This is just a particularly juicy data point; there are certainly other less
grave/entertaining but basically similar cases in other languages.
(Keskuskauppakamari, "Central Chamber of Commerce", has only 3 hits in the
Finnish file, but that is still enough to get it into the top 10; Finnish
probably needs more input data. There are some stray German words in the
Danish file. And anyhow, none of the samples are particularly representative
of the kind of text people write in e-mail.)

I would suggest locating around 10 reasonably different mailing lists in each
language that is deemed important, and semi-randomly picking 4-5 lines from a
few messages on each list. This is just off the top of my head, but I believe
that using genuine texts from the domain you are attempting to model would be
a good starting point; dictionaries, web pages, or regular corpora may be
easier to locate, at least for the major languages, but are probably less
suitable for this particular task.
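
To make the suggestion concrete, the sampling could look something like
this (a sketch; the mbox path, the list, and the numbers are illustrative,
not a tested recipe):

    import mailbox
    import random

    random.seed(4152)                        # reproducible sampling
    sample = []
    for msg in mailbox.mbox("archives/sv-list.mbox"):
        if msg.get_content_type() != "text/plain":
            continue
        body = msg.get_payload(decode=True).decode("latin-1", "replace")
        lines = [l for l in body.splitlines()
                 if l.strip() and not l.startswith(">")]  # drop quoted text
        sample.extend(random.sample(lines, min(5, len(lines))))

    with open("swedish-sample.txt", "w", encoding="latin-1") as out:
        out.write("\n".join(sample))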
Comment 1 eriker-sa 2007-02-02 05:25:36 UTC
Just for my own sake, here's a link to an old mailing list posting of mine
detailing how to do this.

http://article.gmane.org/gmane.mail.spam.spamassassin.general/33090
Comment 2 eriker-sa 2007-02-02 05:37:25 UTC
Also, while I have your attention: Gertjan confirmed to me in private email that
the language models he shipped with TextCat were not really intended for real
production use, and that he is not able to release the raw data he used, because
the licenses on those materials were somewhat murky. (This was long ago; I was
in touch with him regarding this last year or the year before.)

I have looked into some other language models in the distribution, and found
similar problems, but rather than report them in detail here, I suppose it would
be more useful to try to collect better raw data.

For the record, too: the shorttexts are apparently subsets of the full raw
data used for training. You can confirm this by compiling a language model
out of them; the counts you get are smaller than, but roughly congruent with,
those in the shipped models.
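
A sketch of that comparison in Python (the n-gram construction -- words
padded with "_", n-grams of length 1 to 5 -- follows the classic TextCat
recipe; the paths and the .lm layout are assumptions about the
distribution):

    from collections import Counter
    import re

    def profile(text, max_n=5):
        counts = Counter()
        for word in re.findall(r"[^\W\d_]+", text):
            padded = "_" + word + "_"
            for n in range(1, max_n + 1):
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
        return counts

    mine = profile(open("ShortTexts/swedish.txt",
                        encoding="iso-8859-1").read())
    with open("lm/sv.lm", encoding="iso-8859-1") as f:
        for line in list(f)[:20]:            # top 20 shipped n-grams
            ngram, count = line.split()
            print(f"{ngram:8s} shipped={count:>6s} "
                  f"shorttext={mine.get(ngram, 0):6d}")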

Note also that the mnogosearch guesser module uses a file format and an
internal implementation that are very close to TextCat's, but with much
bigger language models. Perhaps somebody should look into whether using those
instead would improve the detection. <http://www.mnogosearch.org/guesser/>
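
For reference, the "out of place" measure that TextCat (and, as far as I
can tell, the mnogosearch guesser) is built on fits in a few lines; a
bigger model is just a longer ranked list of n-grams (a sketch; the
profiles are assumed to be lists of n-grams in rank order, as loaded from
.lm-style files):

    def out_of_place(doc_profile, model_profile):
        # cumulative rank distance; n-grams missing from the model
        # get the maximum penalty
        model_rank = {g: r for r, g in enumerate(model_profile)}
        max_penalty = len(model_profile)
        return sum(abs(model_rank[g] - r) if g in model_rank else max_penalty
                   for r, g in enumerate(doc_profile))

The language whose model gives the smallest distance wins.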

PS. I tried to change the Hardware: field of this bug from "HP" [sic] to "All"
and the Component to Plugins, but I'm not allowed to.