Bug 3449 - Testing markup tags, HTML_FONT_LOW_CONTRAST not triggered due to bad HTML parsing
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 2.63
Hardware: PC Linux
Importance: P5 normal
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL: http://olo.ab.altkom.pl/domowa/spam/s...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-05-31 02:09 UTC by Aleksander Adamowski
Modified: 2004-05-30 19:11 UTC (History)
0 users



Attachments
Description                                                  Type        Status  Submitter/CLA Status
Sample message                                               text/plain  None    Aleksander Adamowski [NoCLA]
Sample message                                               text/plain  None    Aleksander Adamowski [NoCLA]
Test HTML document, with excessive breaks                    text/html   None    Aleksander Adamowski [NoCLA]
Corrected test HTML document, with excessive breaks removed  text/html   None    Aleksander Adamowski [NoCLA]
Test HTML parser                                             text/plain  None    Aleksander Adamowski [NoCLA]

Description Aleksander Adamowski 2004-05-31 02:09:01 UTC
This bug is a follow-up to the discussion started back on February 18 on the
spamassassin-users mailing list ("Testing markup tags", "Semi-invisible font
missed by SA").

There was a consensus that something is definitely wrong with SpamAssassin's
HTML parsing when a spammer puts excessive line breaks inside an HTML FONT tag,
between the attribute name ("color") and its value ("#FFFFsomething").

Back then, I published sample messages here:
http://olo.ab.altkom.pl/domowa/spam/samples/low_contrast/

The problem was that spammers use the following construct, aimed directly at
SpamAssassin's HTML analysis, to bypass the test
html_test('font_near_invisible') and so avoid triggering the rule
HTML_FONT_LOW_CONTRAST:
<font color=

"#FFFFFB">some random text to fool Bayes</font>

The excessive line breaks between "color=" and "#FFFFFB" fool the parser into
not detecting the presence of that attribute.

I analysed the SpamAssassin 2.63 code back then, on 23 Feb, and discovered that
the SA code does indeed receive the string "color" instead of the "#"-prefixed
color code as the value of the "color" attribute.

Those messages keep coming and sometimes pass through SA without triggering
HTML_FONT_LOW_CONTRAST, so I'm currently using a custom rule to give them an
additional score:

rawbody LOC_HTMLSPLITFONT  /^\"?\#[a-z0-9]{6}\"?\>/i
describe LOC_HTMLSPLITFONT font color on separate line from font tag
score LOC_HTMLSPLITFONT    2.1 1.6 2.1 1.6

But this rule is prone to false positives, so the ideal solution would be to
make SpamAssassin parse those tags correctly using HTML::Parser.
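
To see why the rule above is prone to false positives, its pattern can be tried
against individual lines. The following is a minimal Python sketch, not
SpamAssassin code: it assumes the rawbody pattern is applied to one body line at
a time, and the helper name matches_split_font and the second and third sample
lines are hypothetical.

```python
import re

# Same pattern as the LOC_HTMLSPLITFONT rule; the /i flag maps to re.IGNORECASE.
SPLIT_FONT = re.compile(r'^"?#[a-z0-9]{6}"?>', re.IGNORECASE)


def matches_split_font(line: str) -> bool:
    """Return True if a single body line matches the rule's pattern."""
    return SPLIT_FONT.search(line) is not None


samples = [
    '"#FFFFFB">some random text to fool Bayes</font>',  # construct from this report
    '<font color="#FFFFFB">text</font>',                # value on the same line as the tag
    '"#000000">plain black text</font>',                # hypothetical harmless wrapped tag
]

for line in samples:
    print(matches_split_font(line), line)
```

The third line matches as well: the pattern looks only at where the color value
starts, not at the color itself, so any mail whose markup happens to wrap at
that point would be scored too.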

I've made a test Perl script that parses HTML and outputs the attribute names
and values, and running it indicates that HTML::Parser works fine. You can see
the script and test data here:
http://olo.ab.altkom.pl/domowa/admin/spamassassin/

There are 4 files there:

My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml
font_attribute_line_break_corrected.html
font_attribute_line_break_orig.html
parse_test.pl


The .eml file contains the message that passed through without triggering
HTML_FONT_LOW_CONTRAST.
The file parse_test.pl is the Perl script.
The 2 .html files contain the HTML code from the .eml message: the "_orig" one
contains the code unchanged, and the "_corrected" one has the excessive line
breaks removed.

Running parse_test.pl on both HTML files shows that HTML::Parser does its job
fine in both cases, so the problem must lie somewhere in the SpamAssassin code
that does the parsing using HTML::Parser. However, the SA code is a bit too
convoluted for me - so I'm asking its original author to have a look at it.
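
The expectation that line breaks around "=" should not hide the attribute can be
illustrated outside Perl as well. The sketch below uses Python's standard
html.parser module rather than HTML::Parser or SpamAssassin's own code, so it
says nothing about the behaviour of either of those; the names FontAttrCollector
and font_attributes are made up for the example.

```python
from html.parser import HTMLParser


class FontAttrCollector(HTMLParser):
    """Collect the attributes of every <font> start tag."""

    def __init__(self):
        super().__init__()
        self.font_attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "font":
            self.font_attrs.append(dict(attrs))


def font_attributes(html: str):
    """Parse an HTML fragment and return one dict per <font> tag."""
    parser = FontAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.font_attrs


# The construct from this report: a blank line between "color=" and the value.
split_tag = '<font color=\n\n"#FFFFFB">some random text to fool Bayes</font>'
joined_tag = '<font color="#FFFFFB">some random text to fool Bayes</font>'

print(font_attributes(split_tag))
print(font_attributes(joined_tag))
```

Both calls report the same "color" value, which is the behaviour the rule
HTML_FONT_LOW_CONTRAST would need from the parsing step in order to see the
attribute at all.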

SA needs to be fixed to trigger HTML_FONT_LOW_CONTRAST rule when processing the
message My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml.
Comment 1 Aleksander Adamowski 2004-05-31 02:15:17 UTC
Created attachment 1982 [details]
Sample message

Message subject for attachment identification:
"My prgivate s_ge_x life is now available to you!unarmed grotesques"
Comment 2 Aleksander Adamowski 2004-05-31 02:16:13 UTC
Created attachment 1983 [details]
Sample message

Message subject for attachment identification:
"Beebe hey buddzzy bowels"
Comment 3 Aleksander Adamowski 2004-05-31 02:17:19 UTC
Created attachment 1984 [details]
Test HTML document, with excessive breaks

extracted from the message "My prgivate s_ge_x life is now available to
you!unarmed grotesques".

For use with test HTML parser.
Comment 4 Aleksander Adamowski 2004-05-31 02:18:31 UTC
Created attachment 1985 [details]
Corrected test HTML document, with excessive breaks removed

extracted from the message "My prgivate s_ge_x life is now available to
you!unarmed grotesques".

For use with test HTML parser.
Comment 5 Aleksander Adamowski 2004-05-31 02:20:39 UTC
Created attachment 1986 [details]
Test HTML parser

Usage:

parse_test.pl font_attribute_line_break_orig.html

or:

parse_test.pl font_attribute_line_break_corrected.html

You'll see that in both cases HTML::Parser correctly extracts attributes. So
there's something wrong in SpamAssassin itself in that it doesn't see the
"color" attribute when excessive breaks are present.
Comment 6 Sidney Markowitz 2004-05-31 03:11:28 UTC
This appears to be already fixed in the current SVN which has new HTML code and
which triggers HTML_FONT_LOW_CONTRAST on both of these examples.