|
SA Bugzilla – Full Text Bug Listing |
Summary: | ISO-2022-JP false positives on OBFUSCATING_COMMENT | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Warren Togami <wtogami> |
Component: | Rules | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | major | CC: | jm |
Priority: | P2 | ||
Version: | 3.3.0 | ||
Target Milestone: | 3.3.0 | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
Bug Depends on: | |||
Bug Blocks: | 6155 | ||
Attachments: |
Example misfiring __OBFUSCATING_COMMENT_B
PATCH: Prevent misfiring on __OBFUSCATING_COMMENT_B with ISO-2022-JP Example misfiring __OBFUSCATING_COMMENT_A |
Description
Warren Togami
2009-08-31 19:48:20 UTC
Created attachment 4526 [details]
Example misfiring __OBFUSCATING_COMMENT_B
Created attachment 4527 [details]
PATCH: Prevent misfiring on __OBFUSCATING_COMMENT_B with ISO-2022-JP
It seems all cases triggering __OBFUSCATING_COMMENT_B in ISO-2022-JP mail had either a ! or # character prior to <!. This patch completely eliminates all 213 hits of __OBFUSCATING_COMMENT_B in the wt-jp2 ham corpus.
Created attachment 4528 [details] Example misfiring __OBFUSCATING_COMMENT_A http://ruleqa.spamassassin.org/20090831-r809502-n/OBFUSCATING_COMMENT?mclog=ham-wt-jp2 Prior to the above patch, all hits of OBFUSCATING_COMMENT also hit __OBFUSCATING_COMMENT_B. It was easy to eliminate __OBFUSCATING_COMMENT_B because seemingly all cases of <! was preceded by either ! or # characters (for some reason I don't understand). Of the original 213 hits, only 71 matched __OBFUSCATING_COMMENT_A. I tried mass-check after applying that patch. Unfortunately, OBFUSCATING_COMMENT is now firing due to __OBFUSCATING_COMMENT_A 133 times in wt-jp2 ham corpus. Was it short-circuiting and not bothering to test __OBFUSCATING_COMMENT_A if __OBFUSCATING_COMMENT_B had fired before? Is there any short-circuiting or order of precedence in booleans of spamassassin rules? This attachment is one example of __OBFUSCATING_COMMENT_A misfiring. <TD class=3D"font_m">=1B$B!X%=3D%&!Y$N<!$O$3$l!*=1B(B =1B$B$"$N?M= 7A$b=3DP$F$-$^$9!#=1B(B<br> It seems for (__OBFUSCATING_COMMENT_A && HTML_MESSAGE), <! can be preceded by other characters including ordinary ASCII letters. I can't think of any way to fix __OBFUSCATING_COMMENT_A to eliminate ISO-2022-JP FP's. It is at least improved a bit with the above patch. Upon request I can give you an entire mbox full of samples triggering __OBFUSCATING_COMMENT_A or __OBFUSCATING_COMMENT_B. It is a bit too large to attach here. rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/ \w is problematic in languages like Japanese where there is not necessarily any whitespace between words. Also multi-byte space could be encoded but we'll never know it is whitespace since we don't decode by default in spamassassin. So it might be impossible to fix this rule for encoded languages. Why? * spamassassin does not decode by default. Most rules work just fine without decoding. Optional decoding makes it WAY slower which is why we don't do it. * Even if decoding were not a problem, linguistically the whitespace assumptions used in the test cases for this rule are invalid for languages like Japanese without spaces between words in a sentence. What should we do if __OBFUSCATING_COMMENT_A cannot be fixed? This could make spamassassin dangerous for sizable populations of Asian users we don't test at all in the corpus like Chinese. Should we apply the above patch fixing __OBFUSCATING_COMMENT_B at least? rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/ How would you modify this rule to exclude cases where the ! or # character precedes the <!, except you don't want ! or # to be captured? (?![#!]) doesn't seem to be valid regex. > rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/
Btw, it is prudent to avoid \w in regexp, as its exact set of characters
depends on the current locale. I'm not sure if current SA code effectively
disables effects of currently selected locale.
> How would you modify this rule to exclude cases where the ! or # character > precedes the <!, except you don't want ! or # to be captured? I don't think that is possible without capturing it. > (?![#!]) doesn't seem to be valid regex. It is, but is a look-ahead. A look-behind is not possible. >> rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/
> Btw, it is prudent to avoid \w in regexp, as its exact set of characters
> depends on the current locale. I'm not sure if current SA code effectively
> disables effects of currently selected locale.
I didn't write the above rule. This is what is currently in trunk.
Damn. t/basic_lint.t .................. [1381] warn: config: invalid regexp for rule __OBFUSCATING_COMMENT_B: /[^\s>!: missing or invalid delimiters My patch broke the __OBFUSCATING_COMMENT_B rule. rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/ Any suggestions how to exclude ! and # from the leading character class? [^\s>!#] doesn't work. - rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/ + rawbody __OBFUSCATING_COMMENT_B /[^\s>!\x23](?:<![^>]*>)+[^\s<]/ I was able to make it skip #<! and !<! with the above change. I now discover that there are clearly too many possible characters that can be immediately before a <! in legitimate ISO-2022-JP encoded text. ^[$B:#=5$N%a%k%^%,$O!"7HBS$d%]!<%?%V%k%*!<%G%#%*$G;H$($k^[(BBluetooth(R)^[$BBP1~^[(B ^[$B%]!<%?%V%k%9%T!<%+!<$H!"%*%H%/$J%-%c%s%Z!<%s$N$4>R2p$+$i$G$9!#^[(B This sample alone has ] and a few ASCII letters. I believe this is unfixable. Even with decoding you run into linguistic reasons why the English-based assumptions of this rule fail. We are probably best off disabling this rule for 3.3.0? (In reply to comment #12) > I believe this is unfixable. Even with decoding you run into linguistic > reasons why the English-based assumptions of this rule fail. We are probably > best off disabling this rule for 3.3.0? try making it a meta that doesn't fire if ISO-2022-JP shift markers are found in the body, which is what many of the other similar rules use (PLING_QUERY etc.) There's already a meta subrule which fires if ISO-2022-JP is in use. (In reply to comment #13) > (In reply to comment #12) > > I believe this is unfixable. Even with decoding you run into linguistic > > reasons why the English-based assumptions of this rule fail. We are probably > > best off disabling this rule for 3.3.0? > > try making it a meta that doesn't fire if ISO-2022-JP shift markers are found > in the body, which is what many of the other similar rules use (PLING_QUERY > etc.) There's already a meta subrule which fires if ISO-2022-JP is in use. got time to dig one up. take a look: rules/20_head_tests.cf:header __PLING_QUERY Subject =~ /\?.*!|!.*\?/ rules/20_head_tests.cf:meta PLING_QUERY (__PLING_QUERY && !__ISO_2022_JP_DELIM) for reference, __ISO_2022_JP_DELIM looks like this: rules/20_meta_tests.cf:body __ISO_2022_JP_DELIM /\e\$B/ Can you meta the original test with an encoding test to say it only hits if the encoding is _not_ ISO-2022-JP ? Wouldn't that make it trivially easy for spammers to bypass this rule? Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that encoding and abuse this rule. The MUA will render it just fine. (In reply to comment #16) > Wouldn't that make it trivially easy for spammers to bypass this rule? > > Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that > encoding and abuse this rule. The MUA will render it just fine. ...I wasn't aware that there _was_ an ASCII subset of that encoding. {reads} Ah, okay. But then, an ISO-2022-JP encoding with _no_ "switch to japanese/chinese/korean" escape in the message might itself be a useful test.... http://en.wikipedia.org/wiki/ISO-2022-JP#ISO_2022_character_sets > # ISO-2022-JP. A widely used encoding for Japanese. Starts in ASCII and > includes the following escape sequences > * ESC ( B to switch to ASCII (1 byte per character) > * ESC ( J to switch to JIS X 0201-1976 (ISO 646:JP) Roman set (1 byte > per character) > * ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character) > * ESC $ B to switch to JIS X 0208-1983 (2 bytes per character) > ...I wasn't aware that there _was_ an ASCII subset of that encoding. {reads}
> Ah, okay.
The problem isn't subset of the encoding, but rather your message can be in this encoding but plain ASCII continues to work.
test __OBFUSCATING_COMMENT ok This is a te<!--foo-->st test __OBFUSCATING_COMMENT fail Not a <!-- problem --> here test __OBFUSCATING_COMMENT fail or<!--problem--> here test __OBFUSCATING_COMMENT fail This <tag><!-- neither --></tag> I hope Here is example where decoding ISO-2022-JP wouldn't help us with this rule. English assumes that you have spaces between words. This is not true of Japanese. The dog <!-- foo -->went to the park. 犬<!-- foo -->が公園に行きました。 Neither of these lines should be firing on this rule. (In reply to comment #20) > The dog <!-- foo -->went to the park. > 犬<!-- foo -->が公園に行きました。 > > Neither of these lines should be firing on this rule. So how about making the \w explicitly ASCII, so it's not broken by encoding or locale adaptation? rawbody __OBFUSCATING_COMMENT_A /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i > So how about making the \w explicitly ASCII, so it's not broken by encoding or > locale adaptation? > rawbody __OBFUSCATING_COMMENT_A /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i Wont help. SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding. (In reply to comment #22) > > So how about making the \w explicitly ASCII, so it's not broken by encoding or > > locale adaptation? > > > rawbody __OBFUSCATING_COMMENT_A /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i > > Wont help. SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding. ...two-byte chars. Right. Duh. (In reply to comment #23) > (In reply to comment #22) > > > So how about making the \w explicitly ASCII, so it's not broken by encoding or > > > locale adaptation? > > > > > rawbody __OBFUSCATING_COMMENT_A /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i > > > > Wont help. SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding. > > ...two-byte chars. Right. Duh. Variable byte actually. (In reply to comment #16) > Wouldn't that make it trivially easy for spammers to bypass this rule? > > Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that > encoding and abuse this rule. The MUA will render it just fine. that issue applies for almost all of our body ruleset. The trick is that (as John notes) when a spammer attempts to evade it, we can look for the evasion as a much more reliable spam signature -- since nonspam never needs to attempt to evade rules like that. Also, a low-accuracy body rule being evadable, is better than it FP'ing. so I think that is the better approach. wtf. svn commit -m "bug 6188: add exclusion for __ISO_2022_JP_DELIM to OBFUSCATING_COMMENT as well" t.rules/OBFUSCATING_COMMENT rules/20_html_tests.cf svn: Commit failed (details follow): svn: Could not open the requested SVN filesystem oh well. here's the patch: http://taint.org/x/2009/p made it in eventually |