Bug 2211 - New HTML Tag Tests (patch)
Summary: New HTML Tag Tests (patch)
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: 2.60
Hardware: All All
Importance: P5 enhancement
Target Milestone: 2.70
Assignee: Daniel Quinlan
URL:
Whiteboard:
Keywords:
Duplicates: 2947
Depends on:
Blocks:
 
Reported: 2003-07-11 10:00 UTC by Brian White
Modified: 2004-10-18 03:24 UTC



Attachment                                           Type        Status  Submitter/CLA Status
Combined patch for bug #2211 and #2231               patch       None    Brian White [HasCLA]
Tests to call new code.                              text/plain  None    Brian White [HasCLA]
Patch (against v2.60-rc2) to test for bad HTML tags  patch       None    Brian White [HasCLA]
Secondary patch for 20_html_tests.cf                 patch       None    Brendan Byrd/SineSwiper [HasCLA]
reimplementation of idea                             patch       None    Daniel Quinlan [HasCLA]
Additional tags and rework of RE                     text/plain  None    Mike Kuentz [HasCLA]
Patch for custom tags                                patch       None    Brendan Byrd/SineSwiper [HasCLA]

Description Brian White 2003-07-11 10:00:22 UTC
I've written some new code to deal with messages that use HTML tags to
obfuscate their content.  The full diff is attached, but here is the breakdown.


-------------------------------------------------------------------------------

This patch to "html_tag" adds two things:

1) Support for being called by "html_text" with tags not found by the
   HTML parser module (see that for more information).  This is done by
   passing a $num of 0 (zero).

2) When a tag is closed ($num goes to zero), a check is made to see if any
   content was placed between those tags.  Content is noted by the patches
   to "html_text" and "html_tests".  If no content existed between an
   open/close formatting pair, then a counter is incremented which can be
   tested via "eval:html_range()".

--- orig/HTML.pm        Thu Jul 10 10:20:52 2003
+++ HTML.pm     Fri Jul 11 10:26:35 2003
@@ -26,10 +26,21 @@
 sub html_tag {
   my ($self, $tag, $attr, $num) = @_;
 
-  $self->{html_inside}{$tag} += $num;
+  if ($num != 0) {
+    $self->{html_inside}{$tag} += $num;
+    if ($self->{html_inside}{$tag} == 0) {
+      # look for obfuscating tags (format changes that have no effect if nothing between)
+      if ($self->{html_used}{$tag} == 0 && $tag !~ m/^(p)$/) {
+        $self->{html}{empty_format}++;
+#       print STDERR "html_tag: no content within tag sequence '$tag' (empty_format=$self->{html}{empty_format})\n";
+      }
+      delete $self->{html_used}{$tag};
+    }
+  }
 
   $self->{html}{elements}++ if $tag =~ /^(?:$re_strict|$re_loose)$/io;
   $self->{html}{tags}++;
+# print STDERR "html_tag: found bad tag '$tag'\n" if $tag !~ /^(?:$re_strict|$re_loose)$/io;
 
   if ($num == 1) {
     $self->html_format($tag, $attr, $num);
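
The bookkeeping described above can be sketched outside SpamAssassin. This is a minimal illustration, not the patch itself: the `%state` hash and the `on_tag()`/`on_text()` helpers are hypothetical names, and only the counters mirror `html_inside`, `html_used` and `empty_format`.

```perl
use strict;
use warnings;

# Hypothetical stand-alone tracker; only the counter names follow the patch.
my %state = (inside => {}, used => {}, empty_format => 0);

sub on_tag {
    my ($tag, $num) = @_;            # $num: +1 for an open tag, -1 for a close tag
    $state{inside}{$tag} += $num;
    return if $state{inside}{$tag} != 0;

    # The last open instance of $tag has just closed.
    $state{empty_format}++ if !$state{used}{$tag} && $tag ne 'p';
    delete $state{used}{$tag};
    delete $state{inside}{$tag};     # keeps the loop in on_text() short
}

sub on_text {
    my ($text) = @_;
    return if $text eq '';
    # Any non-empty text marks every currently open tag as "used".
    $state{used}{$_} = 1 for keys %{ $state{inside} };
}

# Simulates: <b></b>Hello <i>world</i><font></font>
on_tag('b', 1);     on_tag('b', -1);
on_text('Hello ');
on_tag('i', 1);     on_text('world');   on_tag('i', -1);
on_tag('font', 1);  on_tag('font', -1);

print "empty_format = $state{empty_format}\n";   # b and font had nothing between them
```

Here only `<b></b>` and `<font></font>` are counted, because `<i>` saw text before it closed.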

-------------------------------------------------------------------------------

This patch to "html_tests" records found IMG tags as being content between
formatting tags.  Omitting this patch might even give better results, since
normal HTML email shouldn't be highly formatted anyway, but it's included
for correctness.

@@ -259,9 +270,16 @@
   }
   if (($tag eq "img" &&
        exists $attr->{src} && ($_ = $attr->{src})) ||
       ($tag =~ /^(?:body|table|tr|td|th)$/ && 
        exists $attr->{background} && ($_ = $attr->{background})))
   {
+    my $inside = $self->{html_inside};
+#   print STDERR "html_tests: found image inside tags:";
+    foreach (keys %$inside) {
+#     print STDERR " ",$_;
+      $self->{html_used}{$_}++;
+    }
+#   print STDERR "\n";
     if (/\?/ || (/[a-f\d]{12,}/i && ! /\.(?:jpe?g|gif|png)$/i && !/^cid:/))
     {
       $self->{html}{web_bugs} = 1;

-------------------------------------------------------------------------------

This patch to "html_text" does two things:

1) If a tag is found that was skipped by the real HTML parser (which it does
   if the tag was not recognized), then it forces the parsing of it here and
   removes it from the text stream.  A tag in this case is any "<...>"
   construct where the first character following the "<" is not a space.  I
   believe this fits how most browsers work and has caused no false hits in
   the tests I've run.  (A real "<" should be coded as "&lt;", anyway.)

2) When the text (after tags are removed) is non-null, then all active tags
   ("html_inside") are marked as "used" and thus will not be counted as some
   type of obfuscating tag.

@@ -314,6 +332,23 @@
 sub html_text {
   my ($self, $text) = @_;
 
+  # the HTML parser skips tags that it does not recognize; fine for normal, bad for spam
+  while ($text =~ s/<(\S[^>]*)>//) {
+#   print STDERR "html_text: found unparsed <$1> inside text\n";
+    html_tag($self,$1,undef,0);
+  }
+
+  # record when something non-tag exists between other tags (search for obfuscating tags)
+  if ($text ne "") {
+    my $inside = $self->{html_inside};
+#   print STDERR "html_text: found text inside tags:";
+    foreach (keys %$inside) {
+#     print STDERR " ",$_;
+      $self->{html_used}{$_} = 1;
+    }
+#   print STDERR "\n";
+  }
+
   if (exists $self->{html_inside}{a} && $self->{html_inside}{a} > 0) {
     $self->{html}{anchor_text} .= " $text";
   }
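
To see what the substitution loop in this hunk does to a chunk of text, here is a small stand-alone sketch. The helper name `strip_unparsed_tags()` is made up for illustration; only the regular expression is taken from the patch.

```perl
use strict;
use warnings;

# Illustrative helper (not part of SpamAssassin): remove leftover "<...>"
# constructs from a text chunk and return the remaining text plus the tags.
sub strip_unparsed_tags {
    my ($text) = @_;
    my @found;
    # Same pattern as the patch: "<", one non-space character, then up to ">".
    while ($text =~ s/<(\S[^>]*)>//) {
        push @found, $1;
    }
    return ($text, @found);
}

my ($rest, @tags) = strip_unparsed_tags('Fr</ringlet>ee Ca</ababa>ble, 1 < 2 > 0');
print "text: $rest\n";
print "tags: @tags\n";
```

The "< 2 >" sequence survives because a space follows the "<", which is the rule described in point 1 above.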

-------------------------------------------------------------------------------

This new function "html_bad_tags" tests whether the count, or the ratio, of
tags that were not valid falls within a given range (above "min" and, when
a "max" is given, no higher than "max").

@@ -392,6 +427,32 @@
 ###########################################################################
 # HTML parser tests
 ###########################################################################
+
+sub html_bad_tags {
+  my ($self, undef, $test, $min, $max) = @_;
+
+# print STDERR "html_bad_tags: test=$test; min=$min; max=$max (tags=$self->{html}{tags}; elements=$self->{html}{elements})\n";
+  return 0 if !$self->{html}{tags};
+
+  if ($test eq "ratio") {
+    # ratio of tags that are invalid
+    $test = ($self->{html}{tags} - $self->{html}{elements}) / $self->{html}{tags};
+  } elsif ($test eq "count") {
+    # number of invalid tags
+    $test = $self->{html}{tags} - $self->{html}{elements};
+  } else {
+    # invalid test
+    return 0;
+  }
+
+  # not all perls understand what "inf" means, so we need to do
+  # non-numeric tests!  urg!
+  if ( !defined $max || $max eq "inf" ) {
+    return ($test > $min);
+  } else {
+    return ($test > $min && $test <= $max);
+  }
+}
 
 sub html_tag_balance {
   my ($self, undef, $rawtag, $rawexpr) = @_;
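
The range semantics of "html_bad_tags" (exclusive lower bound, inclusive upper bound, open-ended when no maximum is given) can be checked in isolation. The sketch below is a simplified stand-alone version; `bad_tags_in_range()` and its argument list are assumptions made for illustration and do not match the eval-function signature in the patch.

```perl
use strict;
use warnings;

# Simplified stand-alone range check; takes the two counters directly.
sub bad_tags_in_range {
    my ($tags, $elements, $test, $min, $max) = @_;
    return 0 if !$tags;

    my $bad = $tags - $elements;     # tags that are not valid elements
    my $value;
    if    ($test eq 'ratio') { $value = $bad / $tags }
    elsif ($test eq 'count') { $value = $bad }
    else                     { return 0 }

    # Lower bound is exclusive, upper bound inclusive; a missing upper
    # bound (or the string "inf") means "no upper limit".
    if (!defined $max || $max eq 'inf') {
        return $value > $min ? 1 : 0;
    }
    return ($value > $min && $value <= $max) ? 1 : 0;
}

# 20 tags, 15 of them valid: 5 bad tags, ratio 0.25.
print bad_tags_in_range(20, 15, 'ratio', 0.20, 0.30), "\n";
print bad_tags_in_range(20, 15, 'count', 4, 9), "\n";
print bad_tags_in_range(20, 15, 'count', 0, 4), "\n";
print bad_tags_in_range(20, 20, 'ratio', 0.00, 0.10), "\n";
```

Because the lower bound is exclusive, a message with no bad tags at all does not match a ('ratio','0.00','0.10') range.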

-------------------------------------------------------------------------------

This patch to "get_decoded_stripped_body_text_array" just clears the "html_used"
structure between messages.

--- orig/PerMsgStatus.pm        Thu Jul 10 10:20:57 2003
+++ PerMsgStatus.pm     Thu Jul 10 13:07:17 2003
@@ -1128,6 +1128,7 @@
   # reset variables used in HTML tests
   $self->{html} = {};
   $self->{html_inside} = {};
+  $self->{html_used} = {};
   $self->{html}{ratio} = 0;
   $self->{html}{image_area} = 0;
   $self->{html}{shouting} = 0;

-------------------------------------------------------------------------------

Finally, here are the rules that make use of the new features.  Obviously, the
scores need more rigorous determination; these are just the intuitive values I
chose once everything appeared to be working.

body            HTML_BAD_TAGS_00_10     eval:html_bad_tags('ratio','0.00','0.10')
body            HTML_BAD_TAGS_10_20     eval:html_bad_tags('ratio','0.10','0.20')
body            HTML_BAD_TAGS_20_30     eval:html_bad_tags('ratio','0.20','0.30')
body            HTML_BAD_TAGS_30_40     eval:html_bad_tags('ratio','0.30','0.40')
body            HTML_BAD_TAGS_40_50     eval:html_bad_tags('ratio','0.40','0.50')
body            HTML_BAD_TAGS_50_60     eval:html_bad_tags('ratio','0.50','0.60')
body            HTML_BAD_TAGS_60_70     eval:html_bad_tags('ratio','0.60','0.70')
body            HTML_BAD_TAGS_70_80     eval:html_bad_tags('ratio','0.70','0.80')
body            HTML_BAD_TAGS_80_90     eval:html_bad_tags('ratio','0.80','0.90')
body            HTML_BAD_TAGS_90_100    eval:html_bad_tags('ratio','0.90','1.00')
body            HTML_BAD_TAGS_0         eval:html_bad_tags('count','-1','0')
body            HTML_BAD_TAGS_1         eval:html_bad_tags('count','0','4')
body            HTML_BAD_TAGS_5         eval:html_bad_tags('count','4','9')
body            HTML_BAD_TAGS_10        eval:html_bad_tags('count','9','24')
body            HTML_BAD_TAGS_25        eval:html_bad_tags('count','24','49')
body            HTML_BAD_TAGS_50        eval:html_bad_tags('count','49')
body            HTML_EMPTY_FORMAT_1     eval:html_range('empty_format','0','4')
body            HTML_EMPTY_FORMAT_5     eval:html_range('empty_format','4','9')
body            HTML_EMPTY_FORMAT_10    eval:html_range('empty_format','9','24')
body            HTML_EMPTY_FORMAT_25    eval:html_range('empty_format','24','49')
body            HTML_EMPTY_FORMAT_50    eval:html_range('empty_format','49')

describe        HTML_BAD_TAGS_00_10     0-10% of all HTML tags are invalid
describe        HTML_BAD_TAGS_10_20     10-20% of all HTML tags are invalid
describe        HTML_BAD_TAGS_20_30     20-30% of all HTML tags are invalid
describe        HTML_BAD_TAGS_30_40     30-40% of all HTML tags are invalid
describe        HTML_BAD_TAGS_40_50     40-50% of all HTML tags are invalid
describe        HTML_BAD_TAGS_50_60     50-60% of all HTML tags are invalid
describe        HTML_BAD_TAGS_60_70     60-70% of all HTML tags are invalid
describe        HTML_BAD_TAGS_70_80     70-80% of all HTML tags are invalid
describe        HTML_BAD_TAGS_80_90     80-90% of all HTML tags are invalid
describe        HTML_BAD_TAGS_90_100    90-100% of all HTML tags are invalid
describe        HTML_BAD_TAGS_0         HTML has no invalid tags
describe        HTML_BAD_TAGS_1         HTML has at least 1 invalid tag
describe        HTML_BAD_TAGS_5         HTML has at least 5 invalid tags
describe        HTML_BAD_TAGS_10        HTML has at least 10 invalid tags
describe        HTML_BAD_TAGS_25        HTML has at least 25 invalid tags
describe        HTML_BAD_TAGS_50        HTML has at least 50 invalid tags
describe        HTML_EMPTY_FORMAT_1     HTML has at least 1 formatting pair with nothing between
describe        HTML_EMPTY_FORMAT_5     HTML has at least 5 formatting pairs with nothing between
describe        HTML_EMPTY_FORMAT_10    HTML has at least 10 formatting pairs with nothing between
describe        HTML_EMPTY_FORMAT_25    HTML has at least 25 formatting pairs with nothing between
describe        HTML_EMPTY_FORMAT_50    HTML has at least 50 formatting pairs with nothing between

score           HTML_BAD_TAGS_00_10     0.1
score           HTML_BAD_TAGS_10_20     0.5
score           HTML_BAD_TAGS_20_30     1.0
score           HTML_BAD_TAGS_30_40     1.5
score           HTML_BAD_TAGS_40_50     2.0
score           HTML_BAD_TAGS_50_60     2.5
score           HTML_BAD_TAGS_60_70     3.0
score           HTML_BAD_TAGS_70_80     3.5
score           HTML_BAD_TAGS_80_90     4.0
score           HTML_BAD_TAGS_90_100    4.5
score           HTML_BAD_TAGS_0         -0.1
score           HTML_BAD_TAGS_1         0.1
score           HTML_BAD_TAGS_5         0.2
score           HTML_BAD_TAGS_10        0.3
score           HTML_BAD_TAGS_25        0.4
score           HTML_BAD_TAGS_50        0.5
score           HTML_EMPTY_FORMAT_1     0.1
score           HTML_EMPTY_FORMAT_5     0.5
score           HTML_EMPTY_FORMAT_10    1.0
score           HTML_EMPTY_FORMAT_25    1.5
score           HTML_EMPTY_FORMAT_50    2.0

-------------------------------------------------------------------------------


After going through testing, I find that these HTML rules don't fire all that
often.  Most HTML messages (spam and valid) use tags properly.  The "empty
format" tests would fire fairly frequently on email generated by MSWord; Word
also inserts a certain number of bad "<o:...>" tags.  The bad-tags test,
however, fired above 20% only on spam messages.

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    That which we persist in doing becomes easier.  It's not that the nature
      of the thing has changed but rather our ability at it has increased.




Comment 1 Daniel Quinlan 2003-07-11 14:26:39 UTC
Thanks.  I like this idea and it might work pretty well; I'll try it out.
Comment 2 Brian White 2003-07-14 09:10:47 UTC
Small problem with the patch I provided.  In the "html_tests" function, I use $_
while looping over the open tags.  However, $_ is used implicitly directly
below in the test for "web_bugs".  I think it is sufficient to move the code
I added to the bottom of the enclosing "if" block.
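
One way to avoid any interaction with $_ is to give the loop its own lexical variable. The sketch below is an illustration with made-up data (`%inside`, `%used` and the URL are placeholders). It also prints $_ after an implicit-$_ loop: per perlsyn, foreach localizes its loop variable and restores the former value on exit, so the original loop may be less harmful than feared, but a named variable does not depend on that.

```perl
use strict;
use warnings;

my %inside = (b => 1, font => 1);    # hypothetical open tags
my %used;

$_ = 'http://example.com/pic.jpg';   # stands in for the src value held in $_

# A named lexical loop variable never touches $_.
foreach my $open (keys %inside) {
    $used{$open} = 1;
}
my $after_named = $_;

# An implicit-$_ loop aliases $_ only for the duration of the loop.
foreach (keys %inside) {
    $used{$_} = 1;
}
my $after_implicit = $_;

print "after named loop:    $after_named\n";
print "after implicit loop: $after_implicit\n";
```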
Comment 3 Brian White 2003-07-15 13:25:43 UTC
Minor change...  This optimizes the process by removing "html_inside" entries
when they become zero so that the "html_used" test doesn't have to loop over
them.  It also changes an increment (++) to an assignment (=1).  The increment
was my original plan, but I decided I didn't need it and only changed one of the
two places where it was done.  Sorry.

--- orig/HTML.pm        Tue Jul 15 15:47:00 2003
+++ HTML.pm     Tue Jul 15 16:19:14 2003
@@ -35,6 +35,7 @@
 #       print STDERR "html_tag: no content within tag sequence '$tag' (empty_format=$self->{html}{empty_format})\n";
       }
       delete $self->{html_used}{$tag};
+      delete $self->{html_inside}{$tag};
     }
   }
 
@@ -306,7 +307,7 @@
 
     my $inside = $self->{html_inside};
     foreach (keys %$inside) {
-      $self->{html_used}{$_}++;
+      $self->{html_used}{$_} = 1;
     }
   }
   if ($tag eq "img" && exists $attr->{width} && exists $attr->{height}) {


Note that this does not include the fix to move the "html_used" code in function 
"html_tests" to the bottom of the "img" tag processing block (mentioned in 
a previous addendum) since I included that patch in my other bug report.  Sorry 
for the multiple patches.  Perhaps I should just stop for a while before 
everything gets even more mixed up.

-- Brian
Comment 4 Theo Van Dinter 2003-07-15 13:32:47 UTC
Subject: Re: [SAdev]  New HTML Tag Tests (patch)

On Tue, Jul 15, 2003 at 01:28:53PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Minor change...  This optimizes the process by removing "html_inside" entries 

we tend to ignore patches included in the comments for bugs.  please use
the "create a new attachment" link.

I added a line to "file new bug" which explains this, fyi to folks.

Comment 5 Brian White 2003-07-15 14:00:54 UTC
Created attachment 1159
Combined patch for bug #2211 and #2231

Okay...  No more pieces.  As requested, here's the final patch, listed against
v2.55.  It includes the change for bug #2231 as well.  I wouldn't normally
include both, but both are made to the same file and are difficult to separate now.


-- Brian
Comment 6 Brian White 2003-07-15 14:04:06 UTC
Created attachment 1160
Tests to call new code.
Comment 7 Matthew Cline 2003-08-05 15:24:57 UTC
> If a tag is found that was skipped by the real HTML parser (which it
> does if the tag was not recognized), then it forces the parsing of
> it here and removes it from the text stream.  A tag in this case is
> any "<...>" construct where the first character following the "<" is
> not a space.  I believe this fits how most browsers work and has
> caused no false hits in the tests I've run.  (A real "<" should be
> coded as "&lt;", anyway.)

But if you do this after HTML::Parser is run, &lt; will already have
been translated into "<".  So a spammer could, say, do:

   &lt;VIAGRA&gt;

Which would get translated into:

   <VIAGRA>

and removed by your code.
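
The problem can be reproduced in a few lines. This is only a sketch: `decode_basic_entities()` is a toy stand-in for the entity decoding done by the real HTML parser and handles just three entities; the stripping pattern is the one from the patch.

```perl
use strict;
use warnings;

# Toy entity decoder standing in for what the HTML parser does to text;
# only &lt;, &gt; and &amp; are handled here.
sub decode_basic_entities {
    my ($text) = @_;
    my %map = (lt => '<', gt => '>', amp => '&');
    $text =~ s/&(lt|gt|amp);/$map{$1}/g;
    return $text;
}

my $source  = 'Buy &lt;VIAGRA&gt; now';
my $decoded = decode_basic_entities($source);    # what html_text would then see

(my $stripped = $decoded) =~ s/<(\S[^>]*)>//g;   # the patch's tag-stripping pattern

print "decoded:  $decoded\n";
print "stripped: $stripped\n";
```

After decoding, the escaped word is indistinguishable from a tag, so it disappears from the text that later body rules examine.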
Comment 8 Brian White 2003-08-06 05:18:12 UTC
Subject: Re:  New HTML Tag Tests (patch)

> But if you do this after HTML::Parser is run, &lt; will already have
> been translated into "<".  So a spammer could, say, do:
> 
>    &lt;VIAGRA&gt;
> 
> Which would get translated into:
> 
>    <VIAGRA>
> 
> and removed by your code.

Good point.  I guess I'll have to look at patching the HTML::Parser code.

This code still provides a marked improvement over not having it since it
catches obfuscating tags and it would be undesirable for spammers to
place angle brackets around every incriminating word.

Better to fix it properly, though.

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
            If you're passed on the right, you're in the wrong lane.

Comment 9 Kristian Koehntopp 2003-08-28 22:32:35 UTC
I will implement the changes Justin requests in
http://bugzilla.spamassassin.org/show_bug.cgi?id=2205#c4 over the weekend and
submit a revised patch against the current CVS. I am looking forward to ConfSourceLDAP
becoming a standard part of SpamAssassin.

As a technical side note: I will not implement Bayes storage or AWL storage in LDAP,
despite the fact that this would be easy to implement. I cannot see how such an
extension is going to be useful. Bayes and AWL are highly volatile data, which LDAP is not
designed to handle. Storing such data in LDAP will only create replication storms if you
have more than one LDAP server.
Comment 10 Kristian Koehntopp 2003-08-28 22:38:41 UTC
Sorry, bugzilla fooled me. Please ignore the previous comment. 
Comment 11 Brian White 2003-08-29 07:08:55 UTC
Subject: Re:  New HTML Tag Tests (patch)

> I will implement the changes Justin requests in
> http://bugzilla.spamassassin.org/show_bug.cgi?id=2205#c4 over the weekend and
> submit a revised patch against the current CVS. I am looking forward to ConfSourceLDAP
> becoming a standard part of SpamAssassin.

Ummm...  You say bug #2205 but this was posted against bug #2211.

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
A bend in the road is not the end of the road unless you fail to make the turn.

Comment 12 Brian White 2003-09-03 06:07:30 UTC
Created attachment 1309
Patch (against v2.60-rc2) to test for bad HTML tags

I've re-generated this patch for the new version of SA.  In doing so, however,
I removed the part of the test that looked for empty HTML formatting.  Empty
formatting appears in email generated by MSWord, so such a rule is never going
to get weighted very high anyway.  I don't believe the complexity of the code
and the O(n*m) running time (though "n" and "m" are generally small) make it a
worthwhile test.

This patch only adds the ability to look for bad tags, which should never
appear.  It should hopefully weight at about the same level as
OBFUSCATING_COMMENT which is currently around the 4.0 range.

Note:  I do have other changes in the HTML.pm file so if you apply this patch
to original source there will be an offset on one hunk:

tolkien:~/tmp> patch -p1 <HTML.pm.diff 
missing header for unified diff at line 3 of patch
patching file HTML.pm
Hunk #4 succeeded at 695 (offset -11 lines).

-- Brian
Comment 13 Brendan Byrd/SineSwiper 2004-01-23 04:23:17 UTC
Yes, this patch is badly needed.  I have a ton of e-mail that seems to come 
from the same source.  While it gets caught by a few RBLs, they aren't weighted 
enough to matter.  It seems like the guy uses every trick to try to get past 
spam filters, possibly even engineered just for SA.  (Also note the X-Mailer, 
which may be ratware, but that's up for another discussion.)

Here's the message:

Received: from CCI1 (itandi{@[216.195.204.220])
	by ResonatorSoft.org (8.11.6/8.11.6) with SMTP id i0NAQpX05128
	for <sineswiper@resonatorsoft.org>; Fri, 23 Jan 2004 05:26:51 -0500
Received: from [216.195.204.220] by e-hostzz.comIP with HTTP;
	Fri, 23 Jan 2004 11:25:56 +0100
From: "Corey" <iekwbfat@web.de>
To: sineswiper@resonatorsoft.org
Subject: Re: IX, what had happened
Mime-Version: 1.0
X-Mailer: mPOP Web-Mail 2.19
X-Originating-IP: [e-hostzz.comIP]
Date: Fri, 23 Jan 2004 11:28:56 +0100
Reply-To: "Corey Watts" <iekwbfat@web.de>
Content-Type: multipart/alternative;
	boundary="--ALT--EGHH75168692978204"
Message-Id: <MWEDWUU-0004508982410@periphery>
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on ResonatorSoft.org
X-Spam-Level: ***
X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_44=-0.001,
	HTML_IMAGE_ONLY_06=1.439,HTML_MESSAGE=0.1,RCVD_IN_BL_SPAMCOP_NET=1.5,
	RCVD_IN_DSBL=0.706 autolearn=no version=2.63

----ALT--EGHH75168692978204
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit

ambassador mueller dwindle salami lura hoff coaxial linotype fast phyllis 
powder nothing thomas butte aunt metallurgic davy thereunder emile 
intrigue animate equip rebel 

----ALT--EGHH75168692978204
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 8bit

<HTML><HEAD>
<BODY>
<p>Fr</ringlet>ee Ca</ababa>ble* TV</p>
<a href="http://www.e-hostzz.com/cable/">
<img border="0" src="http://www.e-hostzz.com/fiter3.jpg"></a>
emancipate primrose lundquist suck breeze crystallographer photolysis filled 
anthropomorphic baronial jack shako downtown insulate wart sapling protoplasm 
cost sepuchral attrition fizeau anamorphic flagellate wakerobin rage <BR>
emblazon lure milky sal abate boswell torsion cover cowbird decedent <BR>

</BODY>
</HTML>


----ALT--EGHH75168692978204--
Comment 14 Brendan Byrd/SineSwiper 2004-01-23 05:06:16 UTC
Created attachment 1725
Secondary patch for 20_html_tests.cf

This should finish the patch, adding the tests to the .cf file.  It seems to work
with the test example, and it counted the right ratio.
Comment 15 Daniel Quinlan 2004-01-27 14:48:15 UTC
Created attachment 1729
reimplementation of idea
Comment 16 Daniel Quinlan 2004-01-27 14:49:21 UTC
I have added my own code which covers this bug and also tests for unique tags
(in addition to just testing counts and ratios) which seems to have a better
result in general.

My version just reuses html_range() for the test eval function so the
addition is actually smaller.

 lib/Mail/SpamAssassin/HTML.pm         |   16 ++++++++--
 lib/Mail/SpamAssassin/MsgContainer.pm |    6 ++++
 rules/70_cvs_rules_under_test.cf      |   50 ++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 4 deletions(-)

Brian and Brendan, thanks for figuring out the technique and the ideas.
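
The difference between counting bad tags and counting unique bad tags can be sketched as follows. This is not the code from the attachment, which is not reproduced in this thread; the `$known` allow-list and `bad_tag_stats()` are hypothetical and much shorter than the real $re_strict/$re_loose patterns.

```perl
use strict;
use warnings;

# Hypothetical allow-list standing in for the real tag patterns.
my $known = qr/^(?:html|head|body|p|a|img|br|b|i|font)$/i;

# Returns (number of unknown tags, number of distinct unknown tag names).
sub bad_tag_stats {
    my @tags = @_;
    my %seen;
    my $bad = 0;
    for my $tag (@tags) {
        (my $name = lc $tag) =~ s{^/}{};   # treat </x> like <x>
        next if $name =~ $known;
        $bad++;
        $seen{$name}++;
    }
    return ($bad, scalar keys %seen);
}

my ($bad, $unique) =
    bad_tag_stats(qw(html body p /ringlet /ababa /p a img /a br /ringlet));
print "bad=$bad unique=$unique\n";
```

A message that repeats one bogus tag many times gets a high count but a low unique count, while random-word tags push both numbers up.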
Comment 17 Brendan Byrd/SineSwiper 2004-01-28 09:19:14 UTC
Ahhh...cool.  You also added unique tags, which should catch a lot of the newer
spam.  Many are opting to spam "unique" tags, like they currently spam unique
words, to throw off the Bayes filters.  This should catch that technique easily.

Lemme know when the tests have been corpus'd enough to get some scores and I'll
test them out right away.

BTW, will the "inf" be translated right?  Brian was noting in his code that not
all Perls know about it.
Comment 18 Brian White 2004-01-28 09:33:12 UTC
Subject: Re:  New HTML Tag Tests (patch)

> BTW, will the "inf" be translated right?  Brian was noting in his code that not
> all Perls know about it.

As I recall, I copied that comment from elsewhere in the source code.  It
may not be relevant any longer.

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    In theory, theory and practice are the same.  In practice, they're not.

Comment 19 Mike Kuentz 2004-01-29 18:22:03 UTC
Created attachment 1735
Additional tags and rework of RE

I was working on something similar, but saw this and threw in the towel :)  I
had come up with a bigger list of tags seen and made a separate entry for
closing tags.  The separate one for closing tags is because they generally
don't have attributes like a lot of the beginning tags do.
Comment 20 Daniel Quinlan 2004-01-29 20:23:22 UTC
Subject: Re:  New HTML Tag Tests (patch)

> I was working on something similar, but saw this and threw in the
> towel :) I had come up with a bigger list of tags seen and made a
> separate entry for closing tags.  The separate one for closing tags is
> because they generally don't have attributes like a lot of the
> beginning tags do.

Two questions:

- Do you want to try adapting the code to see if being stricter helps?
  (Consider whether a tag is open-only.  I think attributes will
  probably FP a lot, but try that too.)

- Any other non-standard tags you think we should permit?

Daniel

Comment 21 Brendan Byrd/SineSwiper 2004-01-29 22:37:50 UTC
> Any other non-standard tags you think we should permit?

I scanned the list with the tags at http://www.htmlhelp.com/reference/html40/ 
and it looks clean.  But, I did find some tags in the Mozilla source code, 
which might be added to the code:

http://lxr.mozilla.org/seamonkey/source/htmlparser/tests/htmlgen/htmlgen.cpp#28

 30   "BGSOUND", "BLINK"
 33   "EMBED", 
 36   "ILAYER"
 37   "KEYGEN", 
 38   "LAYER", "LISTING", 
 39   "MULTICOL", 
 40   "NOBR", "NOEMBED", "NOLAYER", 
 42   "PLAINTEXT",
 44   "SERVER","SMALL","SOUND","SPACER",
 48   "WBR", 
 49   "XMP",

I'll do a quick patch in a minute...
Comment 22 Brendan Byrd/SineSwiper 2004-01-29 22:50:25 UTC
Created attachment 1737
Patch for custom tags

Added those custom tags to a separate variable, and also added the MARQUEE tag
in that list.
Comment 23 Mike Kuentz 2004-01-30 09:24:15 UTC
> - Do you want to try adapting the code to see if being stricter helps?
>   (Consider whether a tag is open-only.  I think attributes will
>   probably FP a lot, but try that too.)

The variable $beginning_tags has two sections.  The first (left) side holds
tags that typically have attributes, for example <a>, <font>, etc.  The right
side (after \b ?.*?)| ) holds tags that generally don't have attributes, for
example <u>, <kbd>, etc.  I think it makes sense to be stricter on those tags
because, for example, the <u> tag generally never has a use for an attribute.
You can certainly do <u title=foo>, but why would anyone ever do that in an
email?  I pulled the list of tags and attributes from
<http://www.w3schools.com/tags/tag_u.asp>.  Anything that had no optional
attributes and only standard attributes was put on the right side of that regex.
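
A two-sided pattern of this kind might look like the sketch below. The tag lists are hypothetical and far shorter than those in the attachment, and the variable names (`$with_attrs`, `$without_attrs`, `$opening_tag`) are invented for illustration; the attribute part is written as `[^>]*` rather than the attachment's exact expression.

```perl
use strict;
use warnings;

# Hypothetical, much-shortened tag lists.
my $with_attrs    = qr/(?:a|font|img|table|td)\b[^>]*/i;   # attributes allowed
my $without_attrs = qr/(?:u|b|i|kbd|br)/i;                 # bare tag only
my $opening_tag   = qr/^<(?:$with_attrs|$without_attrs)>$/;

sub is_plausible_open_tag {
    my ($tag) = @_;
    return $tag =~ $opening_tag ? 1 : 0;
}

for my $tag ('<font color=red>', '<u>', '<u title=foo>', '<ringlet>') {
    printf "%-18s %s\n", $tag, is_plausible_open_tag($tag) ? 'ok' : 'suspicious';
}
```

With this split, `<u title=foo>` is flagged even though `<u>` is a known tag, which is the stricter behaviour being proposed; whether that causes too many false positives is the open question.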

> - Any other non-standard tags you think we should permit?

Ya, the tags defined in the variables $ms_xml_tags and $extra_tags
in attachment <http://bugzilla.spamassassin.org/attachment.cgi?id=1735&action=view>
Comment 24 Daniel Quinlan 2004-10-18 11:24:00 UTC
*** Bug 2947 has been marked as a duplicate of this bug. ***