Bug 6541 - ReplaceTags: Experience matches french word "expérience"
Summary: ReplaceTags: Experience matches french word "expérience"
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-01 13:50 UTC by jp
Modified: 2011-05-16 05:44 UTC
CC List: 2 users



Attachments:
mail sent to french debian list - FP on french "experience" | Type: message/rfc822 | Status: None | Submitter/CLA Status: mouss [NoCLA]

Description jp 2011-02-01 13:50:16 UTC
I am sending a newsletter in French that contains the word "expérience".
When the mail passes through SpamAssassin, it triggers the "ReplaceTags: Experience" rule which adds 3.0 points.
I don't think this rule should match valid words with accented characters. It seems to be far too easy for a legitimate mail written in French to be marked as spam this way.

I'm guessing it could also happen with other French words (like médication, crédit, ...).
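
To illustrate the false positive reported here, the following is a minimal sketch in Python rather than SpamAssassin's own Perl rules. It assumes that the rule is built from obfuscation-tolerant letter classes and that it already skips the plain English spelling; the class `E`, the simplified pattern, and the helper `hits` are hypothetical stand-ins, not the real FRT_EXPERIENCE definition.

```python
import re

# Hypothetical stand-in for an obfuscation-tolerant "e": plain letters,
# the digit 3, and accented forms. This is NOT the real SpamAssassin tag.
E = "[eE3èéêëÈÉÊË]"

# Simplified "experience" pattern built from that class. The negative
# lookahead is assumed to skip only the plain English spelling.
pattern = re.compile(rf"\b(?!experience){E}xp{E}ri{E}nc{E}\b", re.IGNORECASE)

def hits(text: str) -> bool:
    """Return True when the simplified pattern matches anywhere in text."""
    return pattern.search(text) is not None

print(hits("Free exp3rience inside"))       # obfuscated spelling
print(hits("Merci pour votre expérience"))  # legitimate French word
print(hits("Thanks for your experience"))   # plain English spelling
```

Under these assumptions the accented French word is treated the same way as a deliberately obfuscated spelling, which is the behaviour described in this report.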
Comment 1 John Wilcock 2011-02-01 14:39:59 UTC
I could have sworn this had already been reported, but can't find the bug.

Regardless, I've long since disabled FRT_EXPERIENCE and FRT_APPROV locally due to FPs in French. 

Quickly scanning logs, FRT_DIPLOMA also occasionally hits on "diplomé", "diplôme", though this scores less and rarely causes FPs.
Comment 2 Adam Katz 2011-03-04 13:44:57 UTC
Checked in r1075489 and r1077335 to introduce variants for the following words:

credit penis medication million approve experience diploma

I did not add an exclusion for "médication" because it's way too obscure, though to counter that point, we now exclude the Polish "dyplom" for diploma.
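
The kind of exclusion described above can be sketched as a wider negative lookahead. This is an illustration in Python's `re` syntax, not the text of the committed revisions; the letter class `E` and both patterns are hypothetical simplifications of the real rules.

```python
import re

E = "[eE3èéêëÈÉÊË]"  # hypothetical obfuscation class, not the real tag

# Before: only the plain English spelling is excluded.
before = re.compile(rf"\b(?!experience){E}xp{E}ri{E}nc{E}\b", re.IGNORECASE)

# After: the accented French spelling is excluded as well.
after = re.compile(rf"\b(?!exp[eé]rience){E}xp{E}ri{E}nc{E}\b", re.IGNORECASE)

def matched(pattern: re.Pattern, word: str) -> bool:
    """Return True when the pattern matches the given word."""
    return pattern.search(word) is not None

# Print word, result with the old exclusion, result with the new one.
for word in ("experience", "expérience", "exp3rience", "éxperience"):
    print(word, matched(before, word), matched(after, word))
```

In this sketch the legitimate French spelling stops matching, while obfuscated spellings that are not listed in the lookahead are still caught.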

Anybody looking to help on this front should look at rulesrc/sandbox/emailed/00_FVGT_File001.cf and rules/25_replace.cf, or perhaps search the entire collection with commands like these:

egrep -ri '^(raw|body|header.*subject).*\(\?![a-z?]{2,}\)' rules*

grep -ri '(?![^)]*[\[(?\\].*).*><' rules*


I'm resolving this bug.  Feel free to re-open with new FP examples.
Comment 3 Adam Katz 2011-03-04 13:59:31 UTC
(In reply to comment #2)
> grep -ri '(?![^)]*[\[(?\\].*).*><' rules*

Okay, that's hard to do without grep -P ... here's a more complete query:

grep --color -riP '\(\?\!\K[^)]*[\[(?\\\w].*(?=\)[<>\w]{1,30}><)'

or else using UNIX grep plus perl:

grep -r . rules* |perl -ne 'print if /\(\?\![^)]*[\[(?\\\w].*\)[<>\w]{1,30}></'

... with colors:

grep -r . trunk/rules* |perl -ne '
  if (/^([^:]*)(.*\(\?\!)([^)]*[\[(?\\\w].*)(\)[<>\w]{1,30}><.*)/) 
    { print "\e[0;35m$1\e[0;0m$2\e[1;32m$3\e[0;0m$4\n"; }'
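
For readers without `grep -P` or Perl at hand, the same filter can be sketched with Python's `re` module. The regular expression is carried over from the Perl one-liner above; the function name and the sample rule lines are hypothetical and are not copied from the rule files.

```python
import re

# Port of the Perl filter: a "(?!" lookahead whose closing parenthesis
# is followed by a run of replace tags such as <E><X>.
LOOKAHEAD_BEFORE_TAGS = re.compile(r"\(\?![^)]*[\[(?\\\w].*\)[<>\w]{1,30}><")

def find_exclusions(lines):
    """Return the lines containing a lookahead exclusion ahead of replace tags."""
    return [line for line in lines if LOOKAHEAD_BEFORE_TAGS.search(line)]

# Hypothetical rule lines for illustration only.
SAMPLE = [
    r"body DEMO_EXPERIENCE /\b(?!experience)<E><X><P><E><R><I><E><N><C><E>\b/i",
    r"body DEMO_PLAIN /\bcheap pills\b/i",
    r"body DEMO_DIPLOMA /\b(?!diploma|dyplom)<D><I><P><L><O><M><A>\b/i",
]

for line in find_exclusions(SAMPLE):
    print(line)
```

Unlike the `grep -r` pipeline, this sketch works on a list of lines already in memory; reading the rule files from disk is left out to keep the example self-contained.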
Comment 4 mouss 2011-05-16 05:44:10 UTC
Created attachment 4886
mail sent to french debian list - FP on french "experience"