Bug 6188 - ISO-2022-JP false positives on OBFUSCATING_COMMENT
Summary: ISO-2022-JP false positives on OBFUSCATING_COMMENT
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: 3.3.0
Hardware: Other All
: P2 major
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 6155
  Show dependency tree
 
Reported: 2009-08-31 19:48 UTC by Warren Togami
Modified: 2009-09-03 14:25 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Example misfiring __OBFUSCATING_COMMENT_B text/plain None Warren Togami [HasCLA]
PATCH: Prevent misfiring on __OBFUSCATING_COMMENT_B with ISO-2022-JP text/plain None Warren Togami [HasCLA]
Example misfiring __OBFUSCATING_COMMENT_A text/plain None Warren Togami [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Warren Togami 2009-08-31 19:48:20 UTC
http://ruleqa.spamassassin.org/20090831-r809502-n/OBFUSCATING_COMMENT/detail
OBFUSCATING_COMMENT is triggering abnormally large amounts with ISO-2022-JP encoded mail.

rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/
rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/
meta OBFUSCATING_COMMENT        ((__OBFUSCATING_COMMENT_A && HTML_MESSAGE) || (__OBFUSCATING_COMMENT_B && MIME_HTML_ONLY))
describe OBFUSCATING_COMMENT    HTML comments which obfuscate text

If I'm reading these regex correctly, it is looking for <! comment tags in the middle of words or non-whitespace?

But if spamassassin (by default) does not decode the body of mail, then <![^>]*> can be valid text in the body that is not parsed by the HTML parser, since it is recognized as ISO-2022-JP formatted text.
Comment 1 Warren Togami 2009-08-31 21:27:43 UTC
Created attachment 4526 [details]
Example misfiring __OBFUSCATING_COMMENT_B
Comment 2 Warren Togami 2009-08-31 21:32:23 UTC
Created attachment 4527 [details]
PATCH: Prevent misfiring on __OBFUSCATING_COMMENT_B with ISO-2022-JP

It seems all cases triggering __OBFUSCATING_COMMENT_B in ISO-2022-JP mail had either a ! or # character prior to <!.  This patch completely eliminates all 213 hits of __OBFUSCATING_COMMENT_B in the wt-jp2 ham corpus.
Comment 3 Warren Togami 2009-08-31 21:53:27 UTC
Created attachment 4528 [details]
Example misfiring __OBFUSCATING_COMMENT_A

http://ruleqa.spamassassin.org/20090831-r809502-n/OBFUSCATING_COMMENT?mclog=ham-wt-jp2
Prior to the above patch, all hits of OBFUSCATING_COMMENT also hit __OBFUSCATING_COMMENT_B.  It was easy to eliminate __OBFUSCATING_COMMENT_B because seemingly all cases of <! was preceded by either ! or # characters (for some reason I don't understand).  Of the original 213 hits, only 71 matched __OBFUSCATING_COMMENT_A.

I tried mass-check after applying that patch.  Unfortunately, OBFUSCATING_COMMENT is now firing due to __OBFUSCATING_COMMENT_A 133 times in wt-jp2 ham corpus.

Was it short-circuiting and not bothering to test __OBFUSCATING_COMMENT_A if __OBFUSCATING_COMMENT_B had fired before?  Is there any short-circuiting or order of precedence in booleans of spamassassin rules?

This attachment is one example of __OBFUSCATING_COMMENT_A misfiring.
                                                
<TD class=3D"font_m">=1B$B!X%=3D%&!Y$N<!$O$3$l!*=1B(B =1B$B$"$N?M=
7A$b=3DP$F$-$^$9!#=1B(B<br>

It seems for (__OBFUSCATING_COMMENT_A && HTML_MESSAGE), <! can be preceded by other characters including ordinary ASCII letters.  I can't think of any way to fix __OBFUSCATING_COMMENT_A to eliminate ISO-2022-JP FP's.

It is at least improved a bit with the above patch.
Comment 4 Warren Togami 2009-08-31 22:01:14 UTC
Upon request I can give you an entire mbox full of samples triggering __OBFUSCATING_COMMENT_A or __OBFUSCATING_COMMENT_B.  It is a bit too large to attach here.
Comment 5 Warren Togami 2009-08-31 22:14:52 UTC
rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/

\w is problematic in languages like Japanese where there is not necessarily any whitespace between words.  Also multi-byte space could be encoded but we'll never know it is whitespace since we don't decode by default in spamassassin.

So it might be impossible to fix this rule for encoded languages.  Why?

* spamassassin does not decode by default.  Most rules work just fine without decoding.  Optional decoding makes it WAY slower which is why we don't do it.

* Even if decoding were not a problem, linguistically the whitespace assumptions used in the test cases for this rule are invalid for languages like Japanese without spaces between words in a sentence.

What should we do if __OBFUSCATING_COMMENT_A cannot be fixed?  This could make spamassassin dangerous for sizable populations of Asian users we don't test at all in the corpus like Chinese.

Should we apply the above patch fixing __OBFUSCATING_COMMENT_B at least?
Comment 6 Justin Mason 2009-09-01 04:03:38 UTC
let's attempt to fix before bug 6155 kicks off.
Comment 7 Warren Togami 2009-09-01 05:19:29 UTC
rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/

How would you modify this rule to exclude cases where the ! or # character precedes the <!, except you don't want ! or # to be captured?

(?![#!]) doesn't seem to be valid regex.
Comment 8 Mark Martinec 2009-09-01 05:28:19 UTC
> rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/

Btw, it is prudent to avoid \w in regexp, as its exact set of characters
depends on the current locale. I'm not sure if current SA code effectively
disables effects of currently selected locale.
Comment 9 Mark Martinec 2009-09-01 05:33:17 UTC
> How would you modify this rule to exclude cases where the ! or # character
> precedes the <!, except you don't want ! or # to be captured?

I don't think that is possible without capturing it.
 
> (?![#!]) doesn't seem to be valid regex.

It is, but is a look-ahead. A look-behind is not possible.
Comment 10 Warren Togami 2009-09-01 05:36:31 UTC
>> rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/
> Btw, it is prudent to avoid \w in regexp, as its exact set of characters
> depends on the current locale. I'm not sure if current SA code effectively
> disables effects of currently selected locale.

I didn't write the above rule.  This is what is currently in trunk.
Comment 11 Warren Togami 2009-09-01 05:39:06 UTC
Damn.

t/basic_lint.t .................. [1381] warn: config: invalid regexp for rule __OBFUSCATING_COMMENT_B: /[^\s>!: missing or invalid delimiters

My patch broke the __OBFUSCATING_COMMENT_B rule.

rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/

Any suggestions how to exclude ! and # from the leading character class?  [^\s>!#] doesn't work.
Comment 12 Warren Togami 2009-09-01 17:57:48 UTC
- rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/
+ rawbody __OBFUSCATING_COMMENT_B /[^\s>!\x23](?:<![^>]*>)+[^\s<]/

I was able to make it skip #<! and !<! with the above change.  I now discover that there are clearly too many possible characters that can be immediately before a <! in legitimate ISO-2022-JP encoded text.

^[$B:#=5$N%a%k%^%,$O!"7HBS$d%]!<%?%V%k%*!<%G%#%*$G;H$($k^[(BBluetooth(R)^[$BBP1~^[(B
^[$B%]!<%?%V%k%9%T!<%+!<$H!"%*%H%/$J%-%c%s%Z!<%s$N$4>R2p$+$i$G$9!#^[(B

This sample alone has ] and a few ASCII letters.

I believe this is unfixable.  Even with decoding you run into linguistic reasons why the English-based assumptions of this rule fail.  We are probably best off disabling this rule for 3.3.0?
Comment 13 Justin Mason 2009-09-02 00:51:21 UTC
(In reply to comment #12)
> I believe this is unfixable.  Even with decoding you run into linguistic
> reasons why the English-based assumptions of this rule fail.  We are probably
> best off disabling this rule for 3.3.0?

try making it a meta that doesn't fire if ISO-2022-JP shift markers are found in the body, which is what many of the other similar rules use (PLING_QUERY etc.)  There's already a meta subrule which fires if ISO-2022-JP is in use.
Comment 14 Justin Mason 2009-09-02 00:59:03 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > I believe this is unfixable.  Even with decoding you run into linguistic
> > reasons why the English-based assumptions of this rule fail.  We are probably
> > best off disabling this rule for 3.3.0?
> 
> try making it a meta that doesn't fire if ISO-2022-JP shift markers are found
> in the body, which is what many of the other similar rules use (PLING_QUERY
> etc.)  There's already a meta subrule which fires if ISO-2022-JP is in use.

got time to dig one up.  take a look:

rules/20_head_tests.cf:header __PLING_QUERY             Subject =~ /\?.*!|!.*\?/
rules/20_head_tests.cf:meta PLING_QUERY                (__PLING_QUERY && !__ISO_2022_JP_DELIM)

for reference, __ISO_2022_JP_DELIM looks like this:

rules/20_meta_tests.cf:body __ISO_2022_JP_DELIM /\e\$B/
Comment 15 John Hardin 2009-09-02 08:04:52 UTC
Can you meta the original test with an encoding test to say it only hits if the encoding is _not_ ISO-2022-JP ?
Comment 16 Warren Togami 2009-09-02 08:13:28 UTC
Wouldn't that make it trivially easy for spammers to bypass this rule?

Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that encoding and abuse this rule.  The MUA will render it just fine.
Comment 17 John Hardin 2009-09-02 08:25:59 UTC
(In reply to comment #16)
> Wouldn't that make it trivially easy for spammers to bypass this rule?
> 
> Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that
> encoding and abuse this rule.  The MUA will render it just fine.

...I wasn't aware that there _was_ an ASCII subset of that encoding. {reads} Ah, okay.

But then, an ISO-2022-JP encoding with _no_ "switch to japanese/chinese/korean" escape in the message might itself be a useful test....
Comment 18 Warren Togami 2009-09-02 08:50:39 UTC
http://en.wikipedia.org/wiki/ISO-2022-JP#ISO_2022_character_sets
> # ISO-2022-JP. A widely used encoding for Japanese. Starts in ASCII and
> includes the following escape sequences
>    * ESC ( B to switch to ASCII (1 byte per character)
>    * ESC ( J to switch to JIS X 0201-1976 (ISO 646:JP) Roman set (1 byte 
>      per character)
>    * ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
>    * ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)
Comment 19 Warren Togami 2009-09-02 09:00:33 UTC
> ...I wasn't aware that there _was_ an ASCII subset of that encoding. {reads}
> Ah, okay.

The problem isn't subset of the encoding, but rather your message can be in this encoding but plain ASCII continues to work.
Comment 20 Warren Togami 2009-09-02 09:11:23 UTC
test __OBFUSCATING_COMMENT ok     This is a te<!--foo-->st
test __OBFUSCATING_COMMENT fail   Not a <!-- problem --> here
test __OBFUSCATING_COMMENT fail   or<!--problem--> here
test __OBFUSCATING_COMMENT fail   This <tag><!-- neither --></tag> I hope

Here is example where decoding ISO-2022-JP wouldn't help us with this rule.  English assumes that you have spaces between words.  This is not true of Japanese.

The dog <!-- foo -->went to the park.
犬<!-- foo -->が公園に行きました。

Neither of these lines should be firing on this rule.
Comment 21 John Hardin 2009-09-02 09:49:28 UTC
(In reply to comment #20)

> The dog <!-- foo -->went to the park.
> 犬<!-- foo -->が公園に行きました。
> 
> Neither of these lines should be firing on this rule.

So how about making the \w explicitly ASCII, so it's not broken by encoding or locale adaptation?

rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i
Comment 22 Warren Togami 2009-09-02 10:02:24 UTC
> So how about making the \w explicitly ASCII, so it's not broken by encoding or
> locale adaptation?

> rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i

Wont help.  SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding.
Comment 23 John Hardin 2009-09-02 10:09:46 UTC
(In reply to comment #22)
> > So how about making the \w explicitly ASCII, so it's not broken by encoding or
> > locale adaptation?
> 
> > rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i
> 
> Wont help.  SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding.

...two-byte chars. Right. Duh.
Comment 24 Warren Togami 2009-09-02 10:15:49 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > > So how about making the \w explicitly ASCII, so it's not broken by encoding or
> > > locale adaptation?
> > 
> > > rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i
> > 
> > Wont help.  SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding.
> 
> ...two-byte chars. Right. Duh.

Variable byte actually.
Comment 25 Justin Mason 2009-09-02 13:32:18 UTC
(In reply to comment #16)
> Wouldn't that make it trivially easy for spammers to bypass this rule?
> 
> Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that
> encoding and abuse this rule.  The MUA will render it just fine.

that issue applies for almost all of our body ruleset.  The trick is that (as John notes) when a spammer attempts to evade it, we can look for the evasion as a much more reliable spam signature -- since nonspam never needs to attempt to evade rules like that.

Also, a low-accuracy body rule being evadable, is better than it FP'ing.  so I think that is the better approach.
Comment 26 Justin Mason 2009-09-02 15:43:35 UTC
wtf.

svn commit -m "bug 6188: add exclusion for __ISO_2022_JP_DELIM to OBFUSCATING_COMMENT as well" t.rules/OBFUSCATING_COMMENT rules/20_html_tests.cf
svn: Commit failed (details follow):
svn: Could not open the requested SVN filesystem

oh well.  here's the patch:
http://taint.org/x/2009/p
Comment 27 Justin Mason 2009-09-03 14:25:42 UTC
made it in eventually