SA Bugzilla – Bug 3449
Testing markup tags, HTML_FONT_LOW_CONTRAST not triggered due to bad HTML parsing
Last modified: 2004-05-30 19:11:28 UTC
This bug is a follow-up to the discussion started back on February 18 on the spamassassin-users mailing list ("Testing markup tags", "Semi-invisible font missed by SA"). There was a consensus that something is definitely wrong with SpamAssassin's HTML parsing when a spammer puts excessive line breaks inside an HTML FONT tag, between the attribute name ("color") and its value ("#FFFFsomething"). Back then I published sample messages here:
http://olo.ab.altkom.pl/domowa/spam/samples/low_contrast/

The problem is that spammers use the following construct, aimed directly at SpamAssassin's HTML analysis, to bypass the test html_test('font_near_invisible') and so avoid triggering the rule HTML_FONT_LOW_CONTRAST:

<font color=

"#FFFFFB">some random text to fool Bayes</font>

The excessive line breaks between "color=" and "#FFFFFB" fool the parser into not detecting the value of that attribute. I analysed the SpamAssassin 2.63 code on 23 Feb and found that the SA code does indeed receive the string "color" instead of the hash colour code as the value of the "color" attribute.

Those messages keep coming and sometimes pass through SA without triggering HTML_FONT_LOW_CONTRAST. I'm currently using a custom rule to give them an additional score:

rawbody LOC_HTMLSPLITFONT /^\"?\#[a-z0-9]{6}\"?\>/i
describe LOC_HTMLSPLITFONT font color on separate line from font tag
score LOC_HTMLSPLITFONT 2.1 1.6 2.1 1.6

But this rule has a potential for false positives, so the ideal solution would be to make SpamAssassin parse those tags correctly using HTML::Parser. I've written a test Perl script that parses HTML and outputs the attribute names and values, and running it indicates that HTML::Parser works fine. You can see the script and test data here:
http://olo.ab.altkom.pl/domowa/admin/spamassassin/

There are 4 files there:

My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml
font_attribute_line_break_corrected.html
font_attribute_line_break_orig.html
parse_test.pl

The .eml file contains the message that passed through without triggering HTML_FONT_LOW_CONTRAST. The file parse_test.pl is the Perl script. The 2 .html files contain the HTML code from the .eml message: the "_orig" one contains the code unchanged, and the "_corrected" one has the excessive line breaks removed.

Running parse_test.pl on both HTML files shows that HTML::Parser does its job fine in both cases, so the problem must lie somewhere in the SpamAssassin code that does the parsing using HTML::Parser. However, the SA code is a bit too convoluted for me, so I'm asking its original author to have a look at it.

SA needs to be fixed to trigger the HTML_FONT_LOW_CONTRAST rule when processing the message My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml.
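To illustrate what the LOC_HTMLSPLITFONT workaround quoted in the description matches, and why it can produce false positives, here is a small Python sketch that applies the same regular expression line by line. It assumes the pattern is tested against each raw body line separately, which simplifies how SpamAssassin evaluates rawbody rules. The sample strings and the helper name matches_split_font are made up for illustration and are not taken from the attached messages.

```python
import re

# Same pattern as the LOC_HTMLSPLITFONT rule; Perl's /i becomes re.IGNORECASE.
# The Perl escapes \" \# \> denote literal characters, so they are dropped here.
SPLIT_FONT = re.compile(r'^"?#[a-z0-9]{6}"?>', re.IGNORECASE)


def matches_split_font(body):
    """Return the raw body lines the pattern hits (hypothetical helper)."""
    return [line for line in body.splitlines() if SPLIT_FONT.search(line)]


# Illustrative inputs only (assumptions, not the real attachments).
spam = '<font color=\n\n"#FFFFFB">some random text to fool Bayes</font>'
ham = '<td bgcolor=\n"#FFFFFF">legitimate but oddly wrapped table cell</td>'
normal = '<font color="#FFFFFB">same tag kept on one line</font>'

for label, body in (("spam", spam), ("ham", ham), ("normal", normal)):
    print(label, matches_split_font(body))
```

The pattern only looks at how a line starts, so any colour value that happens to begin a line is hit, whichever tag or attribute it belongs to. That is the false-positive risk, and the reason a parser-level fix is preferable.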
Created attachment 1982
Sample message
Message subject for attachment identification: "My prgivate s_ge_x life is now available to you!unarmed grotesques"

Created attachment 1983
Sample message
Message subject for attachment identification: "Beebe hey buddzzy bowels"

Created attachment 1984
Test HTML document, with excessive breaks
Extracted from the message "My prgivate s_ge_x life is now available to you!unarmed grotesques". For use with the test HTML parser.

Created attachment 1985
Corrected test HTML document, with excessive breaks removed
Extracted from the message "My prgivate s_ge_x life is now available to you!unarmed grotesques". For use with the test HTML parser.
Created attachment 1986
Test HTML parser
Usage:
  parse_test.pl font_attribute_line_break_orig.html
or:
  parse_test.pl font_attribute_line_break_corrected.html
You'll see that in both cases HTML::Parser correctly extracts the attributes. So something is wrong in SpamAssassin itself: it doesn't see the "color" attribute when excessive breaks are present.
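For readers without the attachments, the kind of check parse_test.pl performs can be approximated as follows. This is not the attached Perl script: it is a comparable sketch that uses Python's standard html.parser module in place of HTML::Parser, and the class name AttrCollector, the function font_attrs and the inline sample strings are assumptions for illustration. It only shows that a tolerant HTML parser reports the same "color" value whether or not line breaks separate the attribute name from its value.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute name, attribute value) for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already stripped.
        for name, value in attrs:
            self.seen.append((tag, name, value))


def font_attrs(html):
    """Parse an HTML fragment and return the collected attribute triples."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.seen


# Stand-ins for the "_orig" and "_corrected" test documents (assumed content).
orig = '<font color=\n\n"#FFFFFB">some random text to fool Bayes</font>'
corrected = '<font color="#FFFFFB">some random text to fool Bayes</font>'

print(font_attrs(orig))
print(font_attrs(corrected))
```

If both calls report the same triple, the parser itself handles the line breaks. That is the observation behind looking for the fault in the SpamAssassin code that consumes the parser's output.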
This appears to be fixed already in the current SVN, which has new HTML code and triggers HTML_FONT_LOW_CONTRAST on both of these examples.