Bug 1002 - obfuscated whitespace via quotes and other special characters
Summary: obfuscated whitespace via quotes and other special characters
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 2.41
Hardware: Other
OS: Other
Importance: P5 minor
Target Milestone: 2.60
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 1059 1445
Depends on:
Blocks: 1050
 
Reported: 2002-09-20 08:40 UTC by Eugene Miretsky
Modified: 2003-06-01 04:29 UTC
CC List: 2 users



Description Eugene Miretsky 2002-09-20 08:40:27 UTC
Hello,

I started getting lots of emails with special characters (such as ', ", *)
replacing spaces.  Often, single and double quotes are placed randomly.
Therefore, I suggest that instead of using " " or "\s" in regexes,
something like [\s'"\*]+ be used.

I received the spam that had the following 2 lines:
   'Lose" up to 10 Lbs. the *first week* up to 30 Lbs. the *first month*

   'Get your *free 14-day supply*

Unfortunately, neither the DIET rule nor the "spamphrase 420 month supply" test matched.

I recommend that DIET rule be changed to:
body DIET /\b(?:(?:without|no) (?:exercis|diet)ing|weight loss|(?:extra|lose|lost|losing)(?:[\s'"\*]+up[\s'"\*]+to[\s'"\*]+\d+|.{0,9})[\s'"\*](?:pounds|weight|inches|lbs?)|burn.{1,10}fat)\b/i

I added lbs as a possible unit of weight loss.  I also replaced spaces with the
above-mentioned character class, and added "up to \d+" as an alternative to the .{0,9} part.

Secondly, I recommend adding FREE_SUPPLY rule:
body FREE_SUPPLY /\bfree[\s'"\*]+(?:\d+[\s'"\*\-]+)?(?:month|year|day)s?/i
describe FREE_SUPPLY Offers some free supplies
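A quick way to sanity-check the proposed patterns against the sample lines above is to transliterate them into Python's re module (used here only as a stand-in for SpamAssassin's Perl rule syntax; the rule bodies are taken verbatim from this report):

```python
import re

# Separator class proposed above: whitespace, quotes, or asterisks.
SEP = r"[\s'\"\*]+"

# Proposed DIET rule, transliterated from the Perl pattern in this report.
diet = re.compile(
    r"\b(?:(?:without|no) (?:exercis|diet)ing"
    r"|weight loss"
    r"|(?:extra|lose|lost|losing)"
    r"(?:" + SEP + "up" + SEP + "to" + SEP + r"\d+|.{0,9})"
    r"[\s'\"\*](?:pounds|weight|inches|lbs?)"
    r"|burn.{1,10}fat)\b",
    re.IGNORECASE,
)

# Proposed FREE_SUPPLY rule.
free_supply = re.compile(
    r"\bfree[\s'\"\*]+(?:\d+[\s'\"\*\-]+)?(?:month|year|day)s?",
    re.IGNORECASE,
)

sample1 = "'Lose\" up to 10 Lbs. the *first week* up to 30 Lbs. the *first month*"
sample2 = "'Get your *free 14-day supply*"

print(bool(diet.search(sample1)))         # True
print(bool(free_supply.search(sample2)))  # True
```

Both sample lines from the spam hit the new patterns because the separator class absorbs the interleaved quote and asterisk characters.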
Comment 1 Daniel Quinlan 2002-09-20 13:12:16 UTC
Subject: Re: [SAdev]  New: Word separators & Free Supply rule & Modification to Diet Rule

As a first step, it might be worth adding a test to detect odd
punctuation and spacing habits.

> Therefore, I suggest that instead of using " " or "\s" in regexes,
> something like [\s'"\*]+ be used.

I don't want the regexes to become completely unreadable.  If we do
this, we have to do it automatically somehow, but not by changing
existing regular expressions.  I think the only reasonable way is to
create our own backslash code (something not used in perl 5 or 6)
and munge it before running the test.

Also, this would let us use " " or \s when we really meant it (as we
often do).

Dan
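Dan's munging idea can be sketched in a few lines. Python stands in for the Perl implementation here, and the escape token `\_` is purely a placeholder of my choosing (the real token would have to be something genuinely unused by Perl, as Dan notes); it is expanded into the separator class once, before the rule is compiled:

```python
import re

# Separator class covering whitespace plus common obfuscation characters
# (hyphen included, as in the reporter's FREE_SUPPLY rule).
SEP_CLASS = r"[\s'\"\*\-]+"

def munge(rule_pattern: str) -> str:
    """Expand the hypothetical \\_ escape into the separator class,
    leaving plain " " and \\s in rules untouched."""
    return rule_pattern.replace(r"\_", SEP_CLASS)

# A rule author writes the readable form...
rule_src = r"\bfree\_(?:\d+\_)?(?:month|year|day)s?"

# ...and the engine munges it once before compiling.
rx = re.compile(munge(rule_src), re.IGNORECASE)
print(bool(rx.search("Get your *free 14-day supply*")))  # True
```

Rules stay readable, ordinary " " and \s keep their literal meaning, and only patterns opting into the new escape pay for the wider class.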

Comment 2 Eugene Miretsky 2002-09-20 13:20:47 UTC
Subject: Re:  Word separators & Free Supply rule & Modification to Diet Rule

On Fri, Sep 20, 2002 at 01:12:16PM -0700, bugzilla-daemon@hughes-family.org wrote:
> I think the only reasonable way is to create our own backslash code
> (something not used in perl 5 or 6) and munge it before running the test.
Totally agree with this. This would be a pretty good idea actually.


Comment 3 Michael Moncur 2002-10-04 01:37:47 UTC
I tested these two rules and the old DIET rule:

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  15144     9848     5296    0.65    0.00    0.00  (all messages)
100.000   65.029   34.971    0.65    0.00    0.00  (all messages as %)
  2.905    4.387    0.151    0.97    1.00    0.40  DIET
  2.628    3.960    0.151    0.96    0.96    1.00  DIET_OLD
  1.479    1.868    0.755    0.71    0.00    1.00  FREE_SUPPLY

The new DIET is a clear improvement: it matches more spam and the same nonspam. 
FREE_SUPPLY looks bad here, but my results are likely biased - I have about 30 
messages from a legitimate mailing list that include the same ad that matches 
this text. Someone else should test that one.

The incriminating sentence is "Free 7-day membership", which isn't quite the 
same as a free supply of something.
Comment 4 Daniel Quinlan 2002-10-04 02:24:43 UTC
Can you look at bug 1050 before making any changes to this rule?  This problem
is way more general than just this rule.  I don't think we should make this
change to any one rule.

I suppose I should add my original comment here as an alternative to bug 1050.
I'm starting to think it might be easier to implement and use than the scheme
I proposed in 1050.

In addition, I think spam-phrases could be fixed quite easily, but I'm not
sure why it didn't match already.  It reduces all non-letters to whitespace.

Reporter: can you attach an example email?

FYI - I am also testing some improvements to the DIET rule in
70_cvs_rules_under_test.cf right now.
Comment 5 Daniel Quinlan 2002-10-06 00:57:35 UTC
*** Bug 1059 has been marked as a duplicate of this bug. ***
Comment 6 Justin Mason 2002-12-13 10:45:59 UTC
Not quite high-priority; this should wait until after the 2.50 release, since
it'll affect hit rates quite a bit on a few rules.

BTW I think I prefer the idea from bug 1050: defining a new
test type like "phrase" and then transforming the text in advance
before running those tests.  That would be much faster than modifying
each regexp to include [ '"-_*] or other alternative word-separator
chars, which would take a much greater speed hit.
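The bug 1050 approach Justin describes can be sketched as follows (again with Python standing in for the Perl implementation; the phrase rule and the exact separator set are illustrative, not a final design): normalize each message once, then run plain, readable phrase tests against the cleaned text.

```python
import re

# One-time transform: collapse runs of whitespace and the alternative
# word-separator chars mentioned above into single spaces.
SEPARATORS = re.compile(r"[\s'\"\-_\*]+")

def normalize(text: str) -> str:
    return SEPARATORS.sub(" ", text).strip()

# "phrase"-type tests can then be written against clean text,
# with no per-rule separator classes.
DIET_PHRASE = re.compile(r"\blose up to \d+ lbs?\b", re.IGNORECASE)

msg = "'Lose\" up to 10 Lbs. the *first week*"
clean = normalize(msg)
print(clean)                            # Lose up to 10 Lbs. the first week
print(bool(DIET_PHRASE.search(clean)))  # True
```

The normalization cost is paid once per message instead of once per rule, which is where the speed advantage over per-regexp character classes comes from.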
Comment 7 Theo Van Dinter 2003-02-04 07:46:28 UTC
*** Bug 1445 has been marked as a duplicate of this bug. ***
Comment 8 Justin Mason 2003-02-13 09:36:11 UTC
not a serious problem so far.  bayes eats these for breakfast ;)
Comment 9 Theo Van Dinter 2003-06-01 12:29:17 UTC
seems to be working ok