Bug 7022 - normalize_charset
Status: NEW
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: unspecified
Hardware: All
OS: All
Priority: P2
Severity: enhancement
Target Milestone: Undefined
Assigned To: SpamAssassin Developer Mailing List
Depends on:
Blocks:
Reported: 2014-03-12 21:38 UTC by Ivo Truxa
Modified: 2014-03-13 22:43 UTC (History)
CC List: 3 users



Attachments
- SA/Conf.pm - changes at normalize_charset (text/plain) - Ivo Truxa [HasCLA]
- SA/Utils/DependencyInfo.pm - dependency on Text::Unidecode (text/plain) - Ivo Truxa [HasCLA]
- SA/Message/Node.pm (text/plain) - Ivo Truxa [HasCLA]
- diffs for all three modules (application/x-gzip) - Ivo Truxa [HasCLA]

Description Ivo Truxa 2014-03-12 21:38:49 UTC
Created attachment 5189 [details]
SA/Conf.pm - changes at normalize_charset

English is, I believe, the only language that uses the Latin alphabet without any diacritics (except in foreign words); Dutch gets by without diacritics relatively well, but that is about it. For all other languages, SpamAssassin does not work as well as it could because of this. Yes, when the normalize_charset option is enabled, practically all of the many foreign charsets are converted to Unicode, which solves at least one part of the problem - the multitude of encoding standards. It is not the full fix, though. Unicode brings problems of its own (more complex handling, slower regexes, more memory use, faster growth of the Bayes database, the need to write and maintain rules in Unicode, ...), but the main problem lies elsewhere.

In most countries, a large share of users often write their email without any diacritics, with incorrect diacritics, or with only some of them. The reasons vary: user ignorance, technical limitations of the device, OS, or software, compatibility issues, conversions, simple laziness, and many others. In consequence, even if you enable normalize_charset and carefully write and maintain your rules in Unicode, they will still very often miss the target unless you add every possible permutation with, without, and with partial diacritics. The same goes for Bayes: you may train it on plenty of spam, but it is enough for the spammer to use different diacritics (including completely wrong ones) and the tokens have to be learned again, separately from their equivalents.

So despite the existence of Unicode, I believe that normalizing email to good old 7-bit ASCII is still the best approach for spam detection. For this reason I patched SA with some rather minor changes to allow, besides the current UTF-8 normalization, US-ASCII normalization as well. I used the Text::Unidecode Perl module, which not only decomposes accented letters into their ASCII transcriptions (é => e, ô => o, ü => ue, ...) but also transliterates Greek, Cyrillic, and practically any other script, including Asian writing systems. It uses systematic, non-contextual transliteration, so for the more exotic scripts it is not always perfect, but it should be sufficient for the needs of SpamAssassin (especially for those who primarily need it for European languages).
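The core accent-stripping idea can be sketched with nothing but Python's stdlib unicodedata: decompose each character and drop the combining marks. (This is only an illustration of the simplest case; Text::Unidecode goes much further, e.g. it maps ü to "ue" rather than "u" and transliterates whole non-Latin scripts.)

```python
import unicodedata

def ascii_fold(text: str) -> str:
    """Strip accents by decomposing characters (NFKD) and
    discarding the combining marks that carried the diacritics."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("Müller écrit"))  # -> "Muller ecrit"
```

Note the difference from Unidecode's output ("Mueller ecrit") for the umlaut, which is exactly the ambiguity discussed below for rule writing.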

The type of the normalize_charset setting was changed from boolean to string; it can take the value 0 (no normalization), 1 or UTF or UTF8 (normalization to Unicode), or ASCII (aliases such as US-ASCII work too). The setting is case-insensitive. When set to ASCII, the Node.pm module converts text_visible_rendered and text_invisible_rendered into plain 7-bit ASCII containing only unaccented characters. Because in the original modules the normalization happens before HTML entities are decoded, which would leave those entities in UTF-8, I had to add the ASCII normalization there as well.
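With the patch applied, the configuration would look something like this (a sketch based on the description above):

```
# local.cf
normalize_charset ascii    # transliterate to 7-bit US-ASCII (patched behavior)
# normalize_charset utf8   # convert charsets to Unicode (existing behavior; 1/UTF/UTF8 also accepted)
# normalize_charset 0      # no normalization
```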

Bayes works with the rendered arrays, so the change affects Bayes as well as the regexes in rules. When writing rules for your language with ASCII normalization enabled, you simply write them unaccented. Just remember that some special characters are transliterated into multiple characters (for example characters with umlauts), so some ambiguity remains: some people will write Müller as Mueller and others as Muller, and you may still need more complex regexes for such cases.
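For instance, a rule intended to match the name Müller after ASCII normalization could allow both spellings (the rule name DEMO_MUELLER is purely illustrative):

```
body   DEMO_MUELLER   /\bMue?ller\b/i
```

The optional "e" covers both the Unidecode transliteration ("Mueller") and the form users type when they simply drop the umlaut ("Muller").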

I am attaching the modified files - SA/Conf.pm, SA/Message/Node.pm, and SA/Utils/DependencyInfo.pm (adding the dependency on the Text::Unidecode module). The originals are from v3.004.000. Although I have wanted to write this for a long time, I only stitched it together today, so it is not well tested and there may still be bugs and issues.
Comment 1 Ivo Truxa 2014-03-12 21:40:27 UTC
Created attachment 5190 [details]
SA/Utils/DependencyInfo.pm - dependency on Text::Unidecode
Comment 2 Ivo Truxa 2014-03-12 21:41:31 UTC
Created attachment 5191 [details]
SA/Message/Node.pm
Comment 3 Ivo Truxa 2014-03-12 21:42:53 UTC
I realized that I was probably supposed to post diffs, not the full modules. If necessary, I can of course do that too.
Comment 4 Kevin A. McGrail 2014-03-12 21:45:34 UTC
Diffs would be better unless you are adding something.
Comment 5 John Hardin 2014-03-12 21:51:09 UTC
If this is done globally we'll lose the ability to detect some forms of obfuscation. On the flip side, discarding the accents may have the effect of making that obfuscation pointless.

How does that balance out? Do we gain more from discarding all accents than we lose from being able to tell whether or not accents are being used to obfuscate a common word, which is a fairly strong spam sign?
Comment 6 Ivo Truxa 2014-03-12 22:19:04 UTC
(In reply to John Hardin from comment #5)
> If this is done globally we'll lose the ability to detect some forms of
> obfuscation. On the flip side, discarding the accents may have the effect of
> making that obfuscation pointless.
>
> How does that balance out? Do we gain more from discarding all accents than
> we lose from being able to tell whether or not accents are being used to
> obfuscate a common word, which is a fairly strong spam sign?

Yes, I think it will in fact unmask some of the obfuscation automatically. On the other hand, you are right that some obfuscated words would score higher as spam than their unobfuscated forms, and that signal would be lost.

There is also the possibility of appending the ASCII-normalized text after the Unicode version (or the original). That would satisfy both needs, but it would increase memory use and database growth.

However, the normalization is optional, and the administrator can choose what is best for their situation. In my case (the vast majority of email on the server is Czech, German, or French, in a multitude of diverse charsets), I know I want plain ASCII normalization, if only because writing rules is a nightmare otherwise. But I am sure many other administrators will opt for Unicode, or for no normalization at all.
Comment 7 John Hardin 2014-03-12 22:47:26 UTC
(In reply to Ivo Truxa from comment #6)
> There is also the possibility to append the ASCII normalization after the
> Unicode version (or the original). That would satisfy both needs, but would
> increase the memory needs and the database growth.

I'd recommend against doing that. That could have serious negative effects on "tflags multiple" rules.

I think a better approach would be to keep both unnormalized and normalized versions separate in memory, and rules would be run against the normalized version by default unless they had a tflag specifying they should run against the unnormalized version.

That way the relatively few rules that look for accent obfuscation can detect it, while the majority of rules (and bayes) get the normalized version and yield better overall results. There would be memory impact, but little additional impact on the scan time, and bayes wouldn't double-token.

As an efficiency hack, the unnormalized version could be discarded after normalization if there were no active rules that had the tflag specifying to run against the unnormalized message text. That would minimize the memory impact.
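In rule syntax, the proposal sketched in this comment might look something like the following (the tflag name "nonormalize" is purely hypothetical; no such flag exists in SpamAssassin):

```
# Most rules run against the normalized text by default.
body   SPAMMY_WORD    /\bviagra\b/i

# A rule hunting accent obfuscation opts back into the raw text.
body   ACCENT_OBFU    /\bv[ìíîï]agra\b/i
tflags ACCENT_OBFU    nonormalize
```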

> However, the normalizing is optional, and the administrator can choose what
> is better for his case. In my case (the vast majority of email on the server
> is Czech, German or French with a big multitude of diverse charsets), I know
> I want the plain ASCII normalizing, already because writing the rules is a
> nightmare otherwise. But I am sure that many other administrators will opt
> for Unicode, or no normalizing at all.

Right. So the tflag for "run against unnormalized message body" would not have any effect on the rule if normalizing was disabled.

An admin might also disable it to avoid the memory pressure and/or performance hit from normalization.

This sounds like it might be a big improvement.
Comment 8 Ivo Truxa 2014-03-12 22:52:18 UTC
Created attachment 5192 [details]
diffs for all three modules

OK, I am attaching the diffs. Hope I did it correctly.

BTW, the possibilities for obfuscation with Unicode are practically endless - you can easily find 20 or more accented or visually similar variants for practically every letter. That means that even for 5-letter words, the number of possible permutations easily runs into the millions. Although each of them may be a strong spam marker, you first need to learn them all, and you need a sufficiently large Bayes database to keep them all. By contrast, if you de-obfuscate them, the original word may help you catch the spam better than each of the rarely used variants.

However, all this is just speculation. We need to run some comparative tests to see what works better.
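The "millions of permutations" arithmetic from the comment above is easy to verify: with independent per-letter substitution the count grows exponentially in the word length.

```python
# If each letter of a 5-letter word has ~20 visually similar Unicode
# variants, the number of distinct obfuscated spellings is 20^5.
variants_per_letter = 20
word_length = 5
print(variants_per_letter ** word_length)  # 3,200,000
```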
Comment 9 Kevin A. McGrail 2014-03-12 22:56:46 UTC
(In reply to Ivo Truxa from comment #8)
> Created attachment 5192 [details]
> diffs for all three modules
> 
> OK, I am attaching the diffs. Hope I did it correctly.
> 
> BTW, the possibilities of the obfuscation by Unicode are practically endless
> - you will find easily 20 or often even more accented or visually similar
> variants for practically every letter. It means that already at 5-letter
> words, the number of available permutations can easily go into millions.
> Although each of them may be a strong spam marker, you need to learn them
> all first, and need a sufficiently big Bayes database to keep them all. In
> contrary, if you de-obfuscate them, the original word may help you to catch
> the spam better than each of the rarely used variants.
> 
> However, all these are just speculations. We need to perform some
> comparative tests to see what is better.

The idea John had ties in very neatly with an idea I had about needing a separate body message, without the subject, selected via a tflag.  A tflag selecting a non-obfuscated version for specific rules might help a lot.
Comment 10 AXB 2014-03-12 23:04:25 UTC
(In reply to Kevin A. McGrail from comment #9)
> (In reply to Ivo Truxa from comment #8)
> > Created attachment 5192 [details]
> > diffs for all three modules
> > 
> > OK, I am attaching the diffs. Hope I did it correctly.
> > 
> > BTW, the possibilities of the obfuscation by Unicode are practically endless
> > - you will find easily 20 or often even more accented or visually similar
> > variants for practically every letter. It means that already at 5-letter
> > words, the number of available permutations can easily go into millions.
> > Although each of them may be a strong spam marker, you need to learn them
> > all first, and need a sufficiently big Bayes database to keep them all. In
> > contrary, if you de-obfuscate them, the original word may help you to catch
> > the spam better than each of the rarely used variants.
> > 
> > However, all these are just speculations. We need to perform some
> > comparative tests to see what is better.
> 
> The idea John had ties in very neatly to the idea I had of needing a
> separate body message without the subject via a tflag.  A tflag for a
> non-obfuscated version for specific rules might help a lot.

Am I getting this right?
1.- normalization would have to be switched on ?
2.- normalized rules would require a tflag ?

If yes, sounds good, though I wonder how this could affect the ok_locales stuff, etc., etc. (fearing a can of worms)
Comment 11 John Hardin 2014-03-12 23:11:08 UTC
(In reply to AXB from comment #10)
> 
> Am I getting this right?
> 1.- normalization would have to be switched on ?

Agree.

> 2.- normalized rules would require a tflag ?

That's backwards. Running against the normalized text would be the default, you'd need a tflag to run against the non-normalized (raw) text.  (I hesitate to use "raw" in this discussion, to avoid confusion with "rawbody".)
Comment 12 AXB 2014-03-12 23:16:15 UTC
(In reply to John Hardin from comment #11)
> (In reply to AXB from comment #10)
> > 
> > Am I getting this right?
> > 1.- normalization would have to be switched on ?
> 
> Agree.
> 
> > 2.- normalized rules would require a tflag ?
> 
> That's backwards. Running against the normalized text would be the default,
> you'd need a tflag to run against the non-normalized (raw) text.  (I
> hesitate to use "raw" in this discussion, to avoid confusion with "rawbody".)

hmm.. that means that 90% of our rules would need the opt-out tflag, when 90% of the spam flow is detected without normalizing anything?
I must be missing something
Comment 13 Kevin A. McGrail 2014-03-12 23:17:16 UTC
(In reply to AXB from comment #12)
> (In reply to John Hardin from comment #11)
> > (In reply to AXB from comment #10)
> > > 
> > > Am I getting this right?
> > > 1.- normalization would have to be switched on ?
> > 
> > Agree.
> > 
> > > 2.- normalized rules would require a tflag ?
> > 
> > That's backwards. Running against the normalized text would be the default,
> > you'd need a tflag to run against the non-normalized (raw) text.  (I
> > hesitate to use "raw" in this discussion, to avoid confusion with "rawbody".)
> 
> hmm.. that means that 90% of our rules would have the opt-out tflag when 90%
> of the spamflow is detected without normalizing anything?
> I must be missing something

I agree with AXB.  If you want to use the normalized text, the tflag would be required, so that new rules can use the concept without modifying all the pre-existing rules.
Comment 14 Ivo Truxa 2014-03-12 23:22:32 UTC
(In reply to Kevin A. McGrail from comment #13)
> I agree with AXB.  If you want to use the normalized text, the tflag would
> be required so new rules can use the concept and not modify all the
> pre-existing rules.

Personally, I think it should depend on the normalize_charset setting. When an admin decides to turn it on, it means he expects to be able to write rules against the normalized version, and vice versa. So, depending on the setting, the tflag might be used to invert the choice.
Comment 15 AXB 2014-03-12 23:28:35 UTC
(In reply to Kevin A. McGrail from comment #13)
> (In reply to AXB from comment #12)
> > (In reply to John Hardin from comment #11)
> > > (In reply to AXB from comment #10)
> > > > 
> > > > Am I getting this right?
> > > > 1.- normalization would have to be switched on ?
> > > 
> > > Agree.
> > > 
> > > > 2.- normalized rules would require a tflag ?
> > > 
> > > That's backwards. Running against the normalized text would be the default,
> > > you'd need a tflag to run against the non-normalized (raw) text.  (I
> > > hesitate to use "raw" in this discussion, to avoid confusion with "rawbody".)
> > 
> > hmm.. that means that 90% of our rules would have the opt-out tflag when 90%
> > of the spamflow is detected without normalizing anything?
> > I must be missing something
> 
> I agree with AXB.  If you want to use the normalized text, the tflag would
> be required so new rules can use the concept and not modify all the
> pre-existing rules.

Imo, this sort of change is far from trivial (even scary), and if anything I think it should be released as a fork for those who may want it.
That way it can be tested thoroughly before it even comes close to the existing, released code, avoiding hundreds of fires burning at the same time.
Comment 16 John Hardin 2014-03-12 23:58:14 UTC
(In reply to Kevin A. McGrail from comment #13)
> (In reply to AXB from comment #12)
> > (In reply to John Hardin from comment #11)
> > > (In reply to AXB from comment #10)
> > > > 2.- normalized rules would require a tflag ?
> > > 
> > > That's backwards. Running against the normalized text would be the default,
> > > you'd need a tflag to run against the non-normalized (raw) text.
> > 
> > hmm.. that means that 90% of our rules would have the opt-out tflag when 90%
> > of the spamflow is detected without normalizing anything?
> > I must be missing something
> 
> I agree with AXB.  If you want to use the normalized text, the tflag would
> be required so new rules can use the concept and not modify all the
> pre-existing rules.

I think I disagree. The point of normalization is to make rules work *better* in the face of varying accents and attempts to obfuscate text using accents. The existing rules shouldn't _need_ modification to take advantage of this, and they should work better against accent-obfuscated (or incorrectly accented) text that currently defeats them. If this is in place they can be simplified to remove explicit alternations allowing for accents, though that shouldn't be necessary in many cases, as the unaccented character is likely already part of the alternation. (The exception would be things like umlauts that become two characters.)

If an existing rule contains accents to match accented text, either to detect obfuscation or from pulling such text verbatim from samples, it would need to be updated or have the tflag applied. That's not the majority of rules, though, is it?
Comment 17 Ivo Truxa 2014-03-13 00:36:29 UTC
In fact, I think it can be made transparent for the user and the rule developer, so that nobody needs to bother with it. Just as now, I would give the admin the choice to disable normalization altogether, enable Unicode normalization, or enable ASCII normalization.

Then, when processing rules, SA would check whether the rule contains non-ASCII characters. If it does, SA would match it against the UTF-8 or the non-normalized version (depending on normalize_charset); otherwise, against the ASCII-normalized one.

This would cover the vast majority of cases. Only in rather rare cases might someone want to run an ASCII regex against the non-ASCII version, and for those a special tflag could be used.
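The heuristic proposed here - route a rule by whether its pattern source contains any non-ASCII bytes - is a one-line check. A sketch (in Python for illustration; SpamAssassin itself would do this in Perl):

```python
import re

def rule_is_ascii(pattern: str) -> bool:
    """True if the rule's regex source is pure 7-bit ASCII,
    i.e. it could sensibly match the ASCII-normalized body."""
    return re.search(r"[^\x00-\x7f]", pattern) is None

print(rule_is_ascii(r"\bviagra\b"))    # matches the ASCII-normalized text
print(rule_is_ascii(r"\bv[ìí]agra"))   # matches the UTF-8/raw text instead
```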

However, as I said before, I think the default setting should stay as it is - no normalization - but both UTF-8 and ASCII normalization should be available to administrators who want to use them, regardless of whether a tflag for the normalized/non-normalized versions exists.

Finally, if I am not mistaken, there is currently no tflag for Unicode normalization either, so if there are rules written for UTF-8, or for specific code pages, they also do not always work correctly.
Comment 18 Ivo Truxa 2014-03-13 02:37:58 UTC
(In reply to John Hardin from comment #5)
> If this is done globally we'll lose the ability to detect some forms of
> obfuscation. On the flip side, discarding the accents may have the effect of
> making that obfuscation pointless.
> 
> How does that balance out? Do we gain more from discarding all accents than
> we lose from being able to tell whether or not accents are being used to
> obfuscate a common word, which is a fairly strong spam sign?

I come back to this comment once more. As I wrote, it needs testing to see the reality, but I am persuaded it can only be better.

Ask yourself why spammers obfuscate certain words. Certainly not because the words are hammy, but because every anti-spam filter would otherwise catch them immediately. They obfuscate the spammiest words. So the fear that removing the obfuscation loses a strong spam marker is unfounded. Quite the contrary: the original, unobfuscated spam word will become an even stronger marker than before (thanks to many more hits) and will help catch the spam more easily.

Only when the obfuscated word transliterates into something other than the original spam word can the score of the original word not be reused (and reinforced); but in most such cases you get a new nonsense word, which becomes as strong a spam marker as its obfuscated version.

Ivo
Comment 19 Ivo Truxa 2014-03-13 22:43:27 UTC
I also wrote a simple standalone tool that normalizes text in the same way as presented here. It may be useful for those who want to write rules for certain words in foreign alphabets and need to know how SpamAssassin would normalize them.

You can find it on GitHub here: https://github.com/truxoft/sa-normalize