Bug 4892 - The abbreviation for Oxfordshire causes high Spam Score
Summary: The abbreviation for Oxfordshire causes high Spam Score
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Score Generation
Version: 3.1.0
Hardware: PC Windows XP
Importance: P5 major
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 5075
Depends on:
Blocks:
 
Reported: 2006-05-02 11:27 UTC by Vicky Clarke
Modified: 2009-08-07 01:14 UTC



Attachment Type Modified Status Actions Submitter/CLA Status
Htm file which causes the error when file is sent via chilkat from our server. text/plain None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Jpeg required for htm mail previously attached image/jpeg None Vicky Clarke [NoCLA]
Example causing FP (text obfuscated for customer privacy). text/plain None Nick Leverton [HasCLA]
Suggested fix (apply \b to the pattern). patch None Nick Leverton [HasCLA]
Another FUZZY_XPILL false positive, bulk non-spam, somewhat munged message/rfc822 None Cedric Knight [HasCLA]

Description Vicky Clarke 2006-05-02 11:27:58 UTC
Hello! The word o x o n (no spaces!) triggers the FUZZY_XPILL BODY rule in the spam
score.
SpamAssassin gives any email containing the word O x o n (no spaces - I just
don't want this to get spammed too!) a high score of 4.1, whereas if I remove that
one word the score drops to 1.4! It's a very common word, being short for the county of
Oxfordshire as well as the animal, so I wondered if you could stop this
happening please?
SpamAssassin Report for mail with o x o n in:
Wed 2006-04-26 17:15:57: Spam Filter processing c:\mdaemon\localq\md50001193349.msg...
Wed 2006-04-26 17:15:57: > Message return-path: info@sportsworld.digi-email.com
Wed 2006-04-26 17:15:57: > Message from: info@sportsworld.digi-email.com
Wed 2006-04-26 17:15:57: > Message to: vicky.clarke@digi-products.com
Wed 2006-04-26 17:15:57: > Message subject: fmanager
Wed 2006-04-26 17:15:57: > Message ID: <CHILKAT-MID-5894957f-c134-413c-b473-12956ac5d324@web1>
Wed 2006-04-26 17:15:57: Start SpamAssassin results
Wed 2006-04-26 17:15:57: 4.10 points, 3.00 required
Wed 2006-04-26 17:15:57: *  0.8 EXTRA_MPART_TYPE Header has extraneous Content-type:...type= entry
Wed 2006-04-26 17:15:57: *  2.6 FUZZY_XPILL BODY: Attempt to obfuscate words in spam
Wed 2006-04-26 17:15:57: *  0.1 HTML_TAG_EXIST_TBODY BODY: HTML has "tbody" tag
Wed 2006-04-26 17:15:57: *  0.0 HTML_MESSAGE BODY: HTML included in message
Wed 2006-04-26 17:15:57: *  0.3 HTML_FONT_BIG BODY: HTML tag for a big font size
Wed 2006-04-26 17:15:57: *  0.2 MIME_BOUND_NEXTPART Spam tool pattern in MIME boundary
Wed 2006-04-26 17:15:57: End SpamAssassin results
Wed 2006-04-26 17:15:57: * c:\mdaemon\localq\md50001193349.msg deleted

Without the word o x o n the FUZZY_XPILL BODY doesn't appear in results.
Thanks very much - please email or call on +44 (0)1189 841567 if you need more 
info!
Comment 1 Theo Van Dinter 2006-05-02 13:45:50 UTC
I can't reproduce this, please attach (via the web form) a sample mail which has
this problem.  
Comment 2 Vicky Clarke 2006-05-03 08:26:23 UTC
Created attachment 3502 [details]
Htm file which causes the error when file is sent via chilkat from our server.

I'm not sure if I explained fully that these htm files are generated on our
server via asp and then mailed via chilkat. The SpamAssassin scores are
generated as the mail is processed by MDaemon on our mail server.
The attached mail had a score of 4.1 when containing the word O x o n but
without it I got 1.4. 
Thanks for looking into this - please let me know if you need any further info.
Comment 3 Vicky Clarke 2006-05-03 08:29:13 UTC
Created attachment 3503 [details]
Jpeg required for htm mail previously attached
Comment 4 Vicky Clarke 2006-05-03 08:30:20 UTC
Created attachment 3504 [details]
Jpeg required for htm mail previously attached

Would be much easier if you allowed upload of zip/rar files!
Comment 5 Vicky Clarke 2006-05-03 08:31:07 UTC
Created attachment 3505 [details]
Jpeg required for htm mail previously attached
Comment 6 Vicky Clarke 2006-05-03 08:31:58 UTC
Created attachment 3506 [details]
Jpeg required for htm mail previously attached
Comment 7 Vicky Clarke 2006-05-03 08:32:30 UTC
Created attachment 3507 [details]
Jpeg required for htm mail previously attached
Comment 8 Vicky Clarke 2006-05-03 08:32:55 UTC
Created attachment 3508 [details]
Jpeg required for htm mail previously attached
Comment 9 Vicky Clarke 2006-05-03 08:33:27 UTC
Created attachment 3509 [details]
Jpeg required for htm mail previously attached
Comment 10 Vicky Clarke 2006-05-03 08:33:51 UTC
Created attachment 3510 [details]
Jpeg required for htm mail previously attached
Comment 11 Sidney Markowitz 2006-05-03 11:20:57 UTC
Comment on attachment 3502 [details]
Htm file which causes the error when file is sent via chilkat from our server.

changing mime type of attachment to text/plain to make it easier to view in
bugzilla
Comment 12 Sidney Markowitz 2006-05-03 11:45:55 UTC
The problem is not the word Oxon by itself. The example contains an address that
has the lines

Abingdon,
Oxon.
OX14 3JF.

which in HTML look like

Abingdon,<BR>Oxon.<BR>OX14 3JF.<BR>

When the HTML tags are removed to process the text in the body, the string

Oxon.
OX14

is a fuzzy match for 'xanax' in the FUZZY_XPILL_BODY rule. The initial 'O' and
the final '14' are ignored, as are newlines and spaces, leaving xon.OX as what
is fuzzily matching with 'xanax'.

Whether that should be a match I leave for someone with more familiarity with
the fuzzy match rules to decide now that the problem has been narrowed down.

I do wonder if <br> should be replaced with a newline and the fuzzy match should
not go across lines, if that is possible with the way we parse out text from HTML.
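
The match described above can be reproduced with a minimal Python sketch. Everything in it is an assumption made for illustration: the character classes X, A, N and the GAP pattern are simplified guesses, not the real <X>, <A>, <N> or <inter W3> definitions that SpamAssassin expands, and strip_tags is a hypothetical stand-in for the way the body text is rendered from HTML. The class for A deliberately treats 'o' as a lookalike for 'a', which is what the analysis above implies.

```python
import re

# Hypothetical lookalike classes (NOT the real SpamAssassin replace tags).
X = "[x]"
A = "[ao0@]"          # assumption: 'o' and '0' count as lookalikes for 'a'
N = "[n]"
GAP = r"[\W_]{0,3}"   # assumption: up to three non-alphanumeric characters between letters

# Simplified stand-in for the rule body: fuzzy "xanax", but not the literal word.
FUZZY = re.compile(r"(?!xanax)" + GAP.join([X, A, N, A, X]), re.IGNORECASE)


def strip_tags(html: str) -> str:
    """Crude HTML-to-text step: <br> becomes a newline, other tags are dropped."""
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    return re.sub(r"<[^>]+>", "", text)


body = strip_tags("Abingdon,<BR>Oxon.<BR>OX14 3JF.<BR>")

# Matching over the whole body lets the gap swallow ".\n" between the lines.
whole_body_match = FUZZY.search(body)
print(repr(whole_body_match.group()))

# Matching each line separately, as suggested above, finds nothing here.
per_line_hit = any(FUZZY.search(line) for line in body.splitlines())
print(per_line_hit)
```

In this sketch the whole-body search hits the span running from the "xon" of "Oxon" into the "OX" of the postcode, while the per-line search does not. A per-line approach would only help when the county and postcode sit on different lines.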
Comment 13 Vicky Clarke 2006-06-02 11:04:32 UTC
(In reply to comment #12)
> The problem is not the word Oxon by itself. The example contains an address that
> has the lines
> Abingdon,
> Oxon.
> OX14 3JF.
> which in HTML look like
> Abingdon,<BR>Oxon.<BR>OX14 3JF.<BR>
> When the HTML tags are removed to process the text in the body, the string
> Oxon.
> OX14
> is a fuzzy match for 'xanax' in the FUZZY_XPILL_BODY rule. The initial 'O' and
> the final '14' are ignored, as are newlines and spaces, leaving xon.OX as what
> is fuzzily matching with 'xanax'.
> Whether that should be a match I leave for someone with more familiarity with
> the fuzzy match rules to decide now that the problem has been narrowed down.
> I do wonder if <br> should be replaced with a newline and the fuzzy match should
> not go across lines, if that is possible with the way we parse out text from HTML.

Hi there - thank you for looking into this. I'm sorry I didn't get the email 
regarding your finding because any mail with <BR>Oxon.<BR>OX14 would have got 
spammed! Our spam is deleted you see. Does that mean any business in that 
Oxfordshire postcode sending html mails with their address and abbreviation 
Oxon will get their mail marked as spam?
Would you be able to let me know when you will have a fix for this please? Or 
is there a workaround I can use for now? 
Thank you very much!
Vicky
Comment 14 Sidney Markowitz 2006-08-31 14:25:05 UTC
*** Bug 5075 has been marked as a duplicate of this bug. ***
Comment 15 Sidney Markowitz 2006-08-31 14:31:06 UTC
I'm pasting in the following suggestion that Nick Leverton made in bug 5075 so
it doesn't get lost in this discussion. Can someone test the effect of this
suggestion on some corpora?

 --------------------

The simplest fix seems to be to add \b to the rule as follows:     
     
body FUZZY_XPILL        /<inter W3><post P2>(?!xanax)\b<X><A><N><A><X>/i
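
To see why a leading \b would be expected to suppress this false positive, here is a small Python sketch. It is not the real rule: X, A, N and GAP are hypothetical, simplified stand-ins for the expanded <X>, <A>, <N> and <inter W3> tags, and the sample strings are made up for illustration.

```python
import re

# Hypothetical, simplified stand-ins for the replace tags (assumptions).
X, A, N = "[x]", "[ao0@]", "[n]"
GAP = r"[\W_]{0,3}"
CORE = GAP.join([X, A, N, A, X])

original = re.compile(r"(?!xanax)" + CORE, re.IGNORECASE)
patched = re.compile(r"(?!xanax)\b" + CORE, re.IGNORECASE)   # \b added before the first letter

samples = {
    "split address": "Abingdon,\nOxon.\nOX14 3JF.",
    "same-line address": "Abingdon, Oxon. OX14 3JF",
    "obfuscated drug name": "cheap x@n@x here",
}

# Record, for each sample, whether the original and patched patterns hit.
results = {
    name: (bool(original.search(text)), bool(patched.search(text)))
    for name, text in samples.items()
}

for name, (hit_original, hit_patched) in results.items():
    print(f"{name}: original={hit_original} patched={hit_patched}")
```

In "Oxon" the 'x' follows the word character 'O', so there is no word boundary in front of it and the patched pattern cannot start there; in the obfuscated sample the 'x' follows a space, so the boundary holds. Whether the real rule behaves the same way depends on the actual tag definitions, so corpus testing is still needed.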
Comment 16 Nick Leverton 2006-08-31 14:39:44 UTC
Created attachment 3677 [details]
Example causing FP (text obfuscated for customer privacy).
Comment 17 Nick Leverton 2006-08-31 14:42:08 UTC
Created attachment 3678 [details]
Suggested fix (apply \b to the pattern).
Comment 18 Nick Leverton 2006-08-31 14:46:01 UTC
(In reply to comment #12)
> I do wonder if <br> should be replaced with a newline and the fuzzy match should
> not go across lines, if that is possible with the way we parse out text from HTML.
 
The Post Office recommends that postcodes follow the county on the same line, 
although many people do split it onto two lines as in the OP's example.  My 
example though shows them both on the same line. 
 
Comment 19 Justin Mason 2007-03-15 05:50:03 UTC
testing replacement rule now
Comment 20 Justin Mason 2007-10-08 10:43:02 UTC
(In reply to comment #19)
> testing replacement rule now

the original isn't stellar these days, but the replacement certainly isn't
working too well:

MSECS    SPAM%                            HAM%                            S/O    RANK  SCORE  NAME
0.00000  0.1018  796 of 781638 messages   0.0129  21 of 162569 messages   0.887  0.65  3.40   FUZZY_XPILL
0.00000  0.0000  0 of 781638 messages     0.0000  0 of 162569 messages    0.500  0.48  0.01   T_FUZZY_XPILL_BUG4892
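
As a sanity check on the FUZZY_XPILL figures above, the percentages and the 0.887 value can be recomputed from the raw hit counts. This sketch assumes the 0.887 column is the S/O ratio, defined as spam% / (spam% + ham%); that definition is an assumption here, not something stated in this bug.

```python
# Hit counts copied from the FUZZY_XPILL result row.
spam_hits, spam_total = 796, 781638
ham_hits, ham_total = 21, 162569

spam_pct = 100.0 * spam_hits / spam_total
ham_pct = 100.0 * ham_hits / ham_total

# Assumed definition: S/O = spam% / (spam% + ham%).
s_o = spam_pct / (spam_pct + ham_pct)

print(f"SPAM% = {spam_pct:.4f}, HAM% = {ham_pct:.4f}, S/O = {s_o:.3f}")
# prints: SPAM% = 0.1018, HAM% = 0.0129, S/O = 0.887
```

The replacement rule hit no messages in either corpus, so the same ratio is undefined for it (0/0); the 0.500 shown for it is presumably a neutral default.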
Comment 21 Cedric Knight 2009-08-07 01:14:56 UTC
Created attachment 4505 [details]
Another FUZZY_XPILL false positive, bulk non-spam, somewhat munged

But how many of the spam hits for FUZZY_XPILL are actually for the right reason?  One third of the hits I see are on ham (I'd reduced score to 1.2 for this reason), and the rest are on spam in Cyrillic that is about Moscow removals firms, tyre rebalancing etc. with no pharmaceutical connotations.  I don't have samples but it looks like there would be a high rate of FPs on Cyrillic ham, as well as the ones already discussed.  Here's another false positive (English, bulk, ham, anonymised).

The modified rule still hits all fuzzy references to the drug name I can devise, although it could use (?:_|\b) instead of \b.

I guess this rule will get zeroed by the score generation, but it would be good to see the fix in update channels etc.  Could it please be reprioritised?
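
The difference between \b and the (?:_|\b) alternative mentioned above can be sketched in Python as follows. As before, X, A, N and GAP are hypothetical simplifications of the real replace tags, and the sample text is invented. The point is only that an underscore counts as a word character, so \b alone does not fire between '_' and a letter.

```python
import re

# Hypothetical, simplified stand-ins for the replace tags (assumptions).
X, A, N = "[x]", "[ao0@]", "[n]"
GAP = r"[\W_]{0,3}"
CORE = GAP.join([X, A, N, A, X])

with_b = re.compile(r"(?!xanax)\b" + CORE, re.IGNORECASE)
with_alt = re.compile(r"(?!xanax)(?:_|\b)" + CORE, re.IGNORECASE)

underscored = "buy_x@n@x_now"
address = "Abingdon,\nOxon.\nOX14 3JF."

print(with_b.search(underscored))            # no match: '_' before 'x' is a word character
print(with_alt.search(underscored).group())  # the leading '_' is consumed as part of the match
print(with_alt.search(address))              # the address example still does not match
```

One side effect in this sketch: because the lookahead is tested before the underscore is consumed, the alternative form would also match a literal "_xanax". Putting the alternation ahead of the lookahead would avoid that, if it matters for the real rule.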