Bug 6229 - [review] TextCat is too case sensitive
Summary: [review] TextCat is too case sensitive
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
: P5 normal
Target Milestone: 3.3.2
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
: 2730 (view as bug list)
Depends on:
Blocks:
 
Reported: 2009-10-29 00:41 UTC by Henrik Krohns
Modified: 2011-05-09 18:05 UTC (History)
4 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
TextCat problem sample text/plain None Henrik Krohns [HasCLA]
Diff of results, "english" corpus of 1000 text/plain None Henrik Krohns [HasCLA]
Diff of results, "finnish" corpus of 3000 text/plain None Henrik Krohns [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Henrik Krohns 2009-10-29 00:41:20 UTC
Created attachment 4562 [details]
TextCat problem sample

It seems the languages database is case sensitive. For example, all uppercase
english spams get very wonky results.

I have no idea what the best way to fix this would be, I'm using a quick fix
like this to get better results..

--- TextCat.pm.orig   2009-10-29 09:23:46.985152046 +0200
+++ TextCat.pm  2009-10-29 09:24:38.339651987 +0200
@@ -440,6 +440,7 @@
   # my $non_word_characters = qr/[0-9\s]/;
   for my $word (split(/[0-9\s]+/, ${$_[0]}))
   {
+    $word =~ tr/A-ZÖÄÅ/a-zöäå/ if $word =~ /[a-zA-ZöäåÖÄÅ]{4}/;
     $word = "\000" . $word . "\000";
     my $len = length($word);
     my $flen = $len;

Attached is a sample message. Running it with textcat_max_languages 20 gives
us:

ja.iso-2022-jp de zh.big5 sk.windows-1250 id sk.us-ascii cs.iso-8859-2 ca da vi
sw ms tl ne pl

Running it with my fix gives the expected single "en".
Comment 1 Henrik Krohns 2011-05-01 17:05:47 UTC
Please vote on my patch above and whether to implement it in 3.3 also (+1 for me).

It's obvious that no one is going to rewrite stuff for unicode support soon (probably needs extensive changes, I'm no guru in that area). Also unless no one is willing to spend time remaking the database, this is easier workaround.

I did lots of corpus testing and the patch doesn't seem to break anything, on the contrary it fixes many cases. The positives most likely far outweight any possible negatives. I think [a-z]{4} is pretty safe matching latin1 languages.

Feel free to test it if you have non-common language corpuses (Warren etc?). I can supply some perl scripts for testing if needed.
Comment 2 Henrik Krohns 2011-05-02 11:40:35 UTC
*** Bug 2730 has been marked as a duplicate of this bug. ***
Comment 3 Mark Martinec 2011-05-04 17:36:59 UTC
Why not just do a lc($word) instead of tr ?
This way other common European characters would be handled too,
such as É Í Ç Á Ã Ñ Ö Ï Ü.
Comment 4 Henrik Krohns 2011-05-04 18:45:47 UTC
(In reply to comment #3)
> Why not just do a lc($word) instead of tr ?
> This way other common European characters would be handled too,
> such as É Í Ç Á Ã Ñ Ö Ï Ü.

If you mean..

$word = lc($word) if $word =~ /[a-zA-ZöäåÖÄÅ]{4}/;

I'm fine with that if it handles the special chars? But isn't that locale dependent? Doesn't seem to work for me.

I guess more of the special chars would need to be handled in any case, I just went with the finnish ones and it worked for me well..
Comment 5 Mark Martinec 2011-05-04 19:56:33 UTC
> If you mean..
> $word = lc($word) if $word =~ /[a-zA-ZöäåÖÄÅ]{4}/;

Yes, something like that.

> I'm fine with that if it handles the special chars? But isn't that locale
> dependent? Doesn't seem to work for me.

It is locale dependent I believe. Another case for a Bug 3062.
It should be documented somewhere that SpamAssassin should be
run under a C locale.

> I guess more of the special chars would need to be handled in any case,
> I just went with the finnish ones and it worked for me well..

I only tried it with my installed version of perl under a C locale,
seems the lc() handles such characters well, but the string must be
decoded first into a correct character set - does not work on raw octets,
as it has not idea that these can be interpreted as ISO Latin1.

Ok, so this is probably too much of a change for a minor release,
backing off my suggestion.

What will happen with 8-bit characters in the source code?
So far there is no such case as far as I can tell.
Maybe these should be encoded in the source as \ooo or \x{hh}
to stay on the safe side (not depending on a locale).

Other common uppercase letters with diacritics from Latin1
should be included in the set I suppose.
Comment 6 Darxus 2011-05-04 20:08:18 UTC
lc:
"Otherwise, If EXPR has the UTF8 flag set
If the current package has a subroutine named ToLower , it will be used to change the case (See User-Defined Case Mappings in perlunicode.) Otherwise Unicode semantics are used for the case change." - http://perldoc.perl.org/functions/lc.html

Not sure if that flag is set, but it you can make sure it is with:

$word = Encode::decode_utf8($word); # set the flag

According to http://www.cosmocode.de/en/blog/gohr/2009-12/10-surviving-the-perl-utf-8-madness

So...

$word = lc(Encode::decode_utf8($word));

?

Ick.  Might be better to make sure that everything is getting imported as utf8.
Comment 7 Kevin A. McGrail 2011-05-05 20:09:56 UTC
Too much technical debate for 3.3.2 consideration.  Retargeting to 3.4.0.
Comment 8 Mark Martinec 2011-05-06 00:47:27 UTC
> Too much technical debate for 3.3.2 consideration.  Retargeting to 3.4.0.

How about just doing a plain lc for now, which will at least
handle all-ascii text such as English:

- $word = "\000" . $word . "\000";
+ $word = "\000" . lc($word) . "\000";

and leave the bug open for a better solution in 3.4 ?
Comment 9 Henrik Krohns 2011-05-06 07:00:21 UTC
(In reply to comment #8)
> > Too much technical debate for 3.3.2 consideration.  Retargeting to 3.4.0.
> 
> How about just doing a plain lc for now, which will at least
> handle all-ascii text such as English:
> 
> - $word = "\000" . $word . "\000";
> + $word = "\000" . lc($word) . "\000";
> 
> and leave the bug open for a better solution in 3.4 ?

I'm currently trying a "proper" set of characters.. imo lc is too vague and locale dependent.

$word =~ tr/A-Z\xc0-\xd6\xd8-\xde/a-z\xe0-\xf6\xf8-\xfe/
if $word =~ /[A-Z]/ && $word =~ /[a-zA-Z\xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe]{4}/;

This table includes all latin accents.

foreach (192..214, 216..222) {
    printf "%s %x %s %s %x %s\n", $_, $_, chr($_), $_ + 32, $_ + 32, chr($_ + 32);
}

Also I'm quite certain that lowering textcat_acceptable_score to 1.02 is also the right thing to do. I'm currently making a small corpus of different languages, including a separate fp corpus. I'll have some results soon..
Comment 10 Mark Martinec 2011-05-06 09:58:33 UTC
(> I'm currently trying a "proper" set of characters.. imo lc is too vague
> and locale dependent.
> 
> $word =~ tr/A-Z\xc0-\xd6\xd8-\xde/a-z\xe0-\xf6\xf8-\xfe/
> if $word =~ /[A-Z]/ && $word =~
> /[a-zA-Z\xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe]{4}/;

That would be fine with me for 3.3.2.

> Also I'm quite certain that lowering textcat_acceptable_score to 1.02 is also
> the right thing to do.

Fine with me.
Please attach a patch for both (and/or apply to trunk) so that we can vote.

Retargeting tentatively to 3.3.2.
Comment 11 Darxus 2011-05-06 16:26:13 UTC
Henrik, did you try lc() after setting the utf8 flag?

$word = Encode::decode_utf8($word); # set the flag
Comment 12 Henrik Krohns 2011-05-06 20:29:46 UTC
Created attachment 4878 [details]
Diff of results, "english" corpus of 1000
Comment 13 Henrik Krohns 2011-05-06 20:30:14 UTC
Created attachment 4879 [details]
Diff of results, "finnish" corpus of 3000
Comment 14 Henrik Krohns 2011-05-06 20:36:50 UTC
(In reply to comment #11)
> Henrik, did you try lc() after setting the utf8 flag?
> 
> $word = Encode::decode_utf8($word); # set the flag

I think that's trying to be too clever.. I believe the textcat database has some utf-8 signatures also.

About the attached files:

- I used acceptable score 1.02 for both, since it provides more accurate results
- Full tr/A-Z\xc0-\xd6\xd8-\xde/a-z\xe0-\xf6\xf8-\xfe/ is used in the new

I think this is safe tuning, reducing many of the "bunch of languages" and "all caps look like japanese".
Comment 15 Darxus 2011-05-06 20:52:03 UTC
(In reply to comment #14)
> > $word = Encode::decode_utf8($word); # set the flag
> 
> I think that's trying to be too clever.. I believe the textcat database has
> some utf-8 signatures also.

I don't, far from it.  That should give you proper case conversion for the entire set of utf8 characters.  

It would be better to figure out how to set the locale to utf8 for all of SA early on, but I think setting the flag on this variable here is cleaner than trying to figure out the right set of characters to feed to tr().  Although I completely agree with Mark that your tr() solution is fine for 3.3.2.
Comment 16 Henrik Krohns 2011-05-06 21:07:59 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > > $word = Encode::decode_utf8($word); # set the flag
> > 
> > I think that's trying to be too clever.. I believe the textcat database has
> > some utf-8 signatures also.
> 
> I don't, far from it.  That should give you proper case conversion for the
> entire set of utf8 characters.  

To clarify..

- Currently SA doesn't use Encode anywhere, I don't think this is the right way to introduce it..
- It is possible that it decodes something wrong or just borks? Like on some funny Asian non-utf8 stuff.

Hopefully sometime in the future we don't have to guess whether rendered body contains utf8 or not etc.
Comment 17 Henrik Krohns 2011-05-06 21:15:26 UTC
- Changed textcat_max_languages to 3 (How many can a message have? More likely TextCat just borks and guesses too many)
- Changed textcat_acceptable_score to 1.02 (Much safer than 1.05)
- Tiny documentation clarification about max_languages (I'd like to explain score better too, but I can't understand it)

Trunk:
Sending        lib/Mail/SpamAssassin/Plugin/TextCat.pm
Transmitting file data .
Committed revision 1100378.
Comment 18 Mark Martinec 2011-05-07 00:06:12 UTC
> Committed revision 1100378.

+1  for 3.3.2

(should do for 3.3.2, we'll worry about possible more complex
improvements later)



> > > $word = Encode::decode_utf8($word); # set the flag
> > I think that's trying to be too clever.. I believe the textcat database has
> > some utf-8 signatures also.
Darxus writes:
> I don't, far from it.  That should give you proper case conversion for the
> entire set of utf8 characters.  
> It would be better to figure out how to set the locale to utf8 for all of SA
> early on, but I think setting the flag on this variable here is cleaner than
> trying to figure out the right set of characters to feed to tr().  Although I
> completely agree with Mark that your tr() solution is fine for 3.3.2.

You are assuming that text from mail parts is decoded from octets into
perl characters (utf8) according to a MIME header field of each mail part.
This is how it should ideally be, but is not currently the case. The proper
solution will require work at several levels of SpamAssassin modules.
There were some attempts in the past, but we were often burned by perl
bugs dealing with utf8 in older versions of perl, or just horrible slowdowns.
Eventually we'll need to go this way, but this probably won't happen
even with the 3.4 release. For 3.3.x we can only assume there are plain
octets reaching TextCat, with no associated character set or any decoding
attempted. The assumed Latin1 case conversion by this patch is just an
attempt to deal with 80% of cases where TextCat got it wrong so far.
Comment 19 Darxus 2011-05-07 03:04:53 UTC
(In reply to comment #18)
> You are assuming that text from mail parts is decoded from octets into
> perl characters (utf8) according to a MIME header field of each mail part.

No.  Check out comment #6.  I believe that however the text gets there, setting the utf8 flag like this will cause lc() to work on all characters in the unicode set, which is probably effectively everything.

I agree that long term, it would be better if everything were read in in utf8 in the first place.
Comment 20 Henrik Krohns 2011-05-08 07:39:38 UTC
Btw we have a (somewhat forgotten) normalize_charset feature. :-) It converts rendered() body to latin1, using Encode::Detect and utf8::downgrade.

I think we could discuss about it in some related or new bug. Maybe even 3.4 could have it on by default.

In any case we probably need to keep the "lc-code" forever, since it could be hard to create textcat database with all case variations.. but we need to make sure we know the locale for body and handle accordingly.
Comment 21 Mark Martinec 2011-05-09 15:41:58 UTC
> Btw we have a (somewhat forgotten) normalize_charset feature. :-) It converts
> rendered() body to latin1, using Encode::Detect and utf8::downgrade.

The normalize_charset suffers from two problems:
- it tries to *guess* a character set from a text sample,
  instead of taking the encoding information for a MIME subheader;
- the Bug 5691 (Slow rules due to charset normalization) is still
  applicable. The attached test case there still takes 19 times as much
  time as a non-UTF8 case using perl 5.12 (it used to be 30 time slower
  with older perl).

> I think we could discuss about it in some related or new bug. Maybe even 3.4
> could have it on by default.

I used to have normalize_charset enabled, but after being bitten by
extreme slowdowns on some mail messages, we can no longer afford
to use this feature on a production mailer. Too bad. Not something
that could be enabled by default in 3.4 if you ask me.

> In any case we probably need to keep the "lc-code" forever, since it could be
> hard to create textcat database with all case variations.. but we need to make
> sure we know the locale for body and handle accordingly.

True.

Let's just keep things simple for 3.3.2 and apply this simple patch,
then we can open a new problem report to discuss introduction of more
fancy (but also more risky) stuff like proper handling of encodings
of each message mime part.
Comment 22 Kevin A. McGrail 2011-05-09 17:04:38 UTC
(In reply to comment #18)
> > Committed revision 1100378.
> 
> +1  for 3.3.2
> 
> (should do for 3.3.2, we'll worry about possible more complex
> improvements later)
> 

+1 on this as well. KAM
Comment 23 Henrik Krohns 2011-05-09 18:05:56 UTC
3.3 done:
Sending        lib/Mail/SpamAssassin/Plugin/TextCat.pm
Transmitting file data .
Committed revision 1101127.