SA Bugzilla – Bug 4152
RFE: train language identification with proper input data
Last modified: 2007-02-02 05:37:25 UTC
The language sample files which were used to create the input data for the TextCat module under lm/ were probably never really intended for use in a production environment. The amount of data is probably roughly the right order of magnitude -- there are about 5,000 words per language -- but there are weird biases and blind spots. I imagine these were just quickly scraped together to produce Gertjan's original TextCat demo.

I did a quick spot check of lm/sv.lm and noticed that substrings of the name Diktonius get incredibly high scores. True, the poet Elmer Diktonius wrote in Swedish, but he was from Finland, and it's not at all impossible that somebody would write about him in just about any language; right now, any message with the word "Diktonius" in it is virtually guaranteed to be classified as Swedish. There are 6 occurrences of this name in ShortTexts/swedish.txt, tying with two genuine stop words for 10th place among the most common words in the Swedish sample:

22 en         "a" (real gender)
18 i          "in"
14 och        "and"
13 som        "which"
13 av         "of"
11 den        "the" (real gender)
 9 på         "on"
 9 att        "to" (as in "to count words for fun and profit*")
 8 till       "to" (as in "from here to there")
 6 har        "has"
 6 för        "for"
 6 Diktonius  "Hemingway"
 5 är         "is"
 5 sin        "his/her/its"
 5 med        "with"
 4 vid        "by"
 4 vi         "we"
 4 En         "A" (real gender)

* and academic credentials, if you're good at it.

This is just a particularly juicy data point; there are certainly other less grave/entertaining but basically similar cases in other languages. (Keskuskauppakamari, "Central Chamber of Commerce", in Finnish -- only 3 hits, but that is still enough to get it into the top 10, so the Finnish file probably needs more input data too. There are also some stray German words in the Danish file. And anyhow, none of the samples are particularly representative of the kind of text people write in e-mail.)

I would suggest locating around 10 reasonably different mailing lists in each language that is deemed important, and semi-randomly picking 4-5 lines from a few messages on each list. This is just off the top of my head, but I believe that using genuine texts from the domain you are attempting to model would be a good starting point; dictionaries, web pages, or regular corpora may be easier to locate, at least for the major languages, but are probably less suitable for this particular task.
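To make the mechanism concrete, here is a minimal sketch of a TextCat-style profile builder in the spirit of the Cavnar & Trenkle n-gram method that the lm/ files encode. The word padding, the 1-5-gram range, the top-400 cutoff, the case folding, and the ISO-8859-1 encoding are my assumptions based on the published algorithm, not taken from the shipped code; the point is only that six repetitions of a long name in a ~5,000-word sample are enough to push its substrings into the highest ranks.

from collections import Counter
import re

def ngram_profile(text, max_n=5, top=400):
    counts = Counter()
    # TextCat-style tokenization: letter runs only, each word padded with
    # '_' so that word-boundary n-grams are counted as well.
    for word in re.findall(r"[^\W\d_]+", text):
        padded = "_" + word.lower() + "_"   # case folding is a guess
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Only the *rank* of the most frequent n-grams matters for the
    # out-of-place distance that is computed against a message.
    return [ng for ng, _ in counts.most_common(top)]

if __name__ == "__main__":
    with open("ShortTexts/swedish.txt", encoding="iso-8859-1") as fh:
        profile = ngram_profile(fh.read())
    # Six occurrences of "Diktonius" are enough to land n-grams such as
    # "_dikt" and "onius" well inside the top ranks.
    print([ng for ng in profile if "dikt" in ng or "oniu" in ng])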
Just for my own sake, here's a link to an old mailing list posting of mine detailing how to do this. http://article.gmane.org/gmane.mail.spam.spamassassin.general/33090
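For completeness, a rough sketch of the sampling scheme suggested above: a handful of messages from each mailing-list archive, with 4-5 semi-randomly chosen lines per message. The mbox paths, the quoted-text/signature filtering, and the UTF-8 decoding are all hypothetical illustrations, not part of any existing tool.

import mailbox
import random

def sample_lines(mbox_path, n_messages=10, lines_per_message=5):
    """Pick a few plausible body lines from a few messages in one archive."""
    picked = []
    messages = list(mailbox.mbox(mbox_path))
    for msg in random.sample(messages, min(n_messages, len(messages))):
        body = msg.get_payload(decode=True) or b""
        lines = []
        for raw in body.decode("utf-8", "replace").splitlines():
            line = raw.strip()
            # Skip empty lines, quoted text, and very short fragments.
            if line and not line.startswith(">") and len(line.split()) >= 3:
                lines.append(line)
        picked.extend(random.sample(lines, min(lines_per_message, len(lines))))
    return picked

if __name__ == "__main__":
    # e.g. ~10 archives per language, one mbox per mailing list
    for line in sample_lines("archives/sv/list1.mbox"):
        print(line)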
Also, while I have your attention: Gertjan confirmed to me in private email that the language models he shipped with TextCat were not really intended for production use, and that he is not able to release the raw data he used, because the licenses on those materials were somewhat murky. (This was a while ago; I was in touch with him about it last year or the year before.)

I have looked into some of the other language models in the distribution and found similar problems, but rather than report them in detail here, I suppose it would be more useful to try to collect better raw data.

For the record, too: the shorttexts are apparently subsets of the full raw data used for training. You can confirm this by compiling a language model out of them; the counts in the resulting files are smaller than, but roughly congruent with, those in the shipped models.

Note also that the mnogosearch guesser module uses a file format and an internal implementation very close to TextCat's, but with much bigger language models. Perhaps somebody should look into whether using those instead would improve detection. <http://www.mnogosearch.org/guesser/>

PS. I tried to change the Hardware: field of this bug from "HP" [sic] to "All" and the Component to Plugins, but I'm not allowed to.
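Coming back to the shorttext/model comparison a couple of paragraphs up, here is a rough sketch of how one might check it. It assumes the .lm files are "ngram<TAB>count" lines sorted by frequency, with '_' standing in for whitespace, and that the shorttexts are ISO-8859-1; all of that, plus the tokenization, is guessed from the classic algorithm rather than taken from the shipped code.

from collections import Counter
import re

def ngram_counts(text, max_n=5):
    # Same counting scheme as in the earlier sketch, but keeping raw counts.
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text):
        padded = "_" + word.lower() + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts

def read_lm(path):
    # Assumed format: one "ngram<TAB>count" per line, most frequent first.
    shipped = {}
    with open(path, encoding="iso-8859-1") as fh:
        for line in fh:
            ngram, count = line.rstrip("\n").split("\t")
            shipped[ngram] = int(count)
    return shipped

if __name__ == "__main__":
    with open("ShortTexts/swedish.txt", encoding="iso-8859-1") as fh:
        rebuilt = ngram_counts(fh.read())
    shipped = read_lm("lm/sv.lm")
    # If the shorttexts really are a subset of the training data, the rebuilt
    # counts should be <= the shipped ones but sort in roughly the same order.
    for ngram in list(shipped)[:20]:
        print(ngram, shipped[ngram], rebuilt.get(ngram, 0))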