SA Bugzilla – Bug 2211
New HTML Tag Tests (patch)
Last modified: 2004-10-18 03:24:00 UTC
I've written some new code to deal with messages that use HTML tags to obfuscate their content. The full diff is attached, but here is the breakdown.

-------------------------------------------------------------------------------

This patch to "html_tag" adds two things:

1) Support for being called by "html_text" with tags not found by the HTML parser module (see the "html_text" patch for more information). This is done by passing a $num of 0 (zero).

2) When a tag is closed (its "html_inside" count returns to zero), a check is made to see if any content was placed between those tags. Content is noted by the patches to "html_text" and "html_tests". If no content existed between an open/close formatting pair, then a counter is incremented which can be tested via "eval:html_range()".

--- orig/HTML.pm  Thu Jul 10 10:20:52 2003
+++ HTML.pm  Fri Jul 11 10:26:35 2003
@@ -26,10 +26,21 @@
 sub html_tag {
   my ($self, $tag, $attr, $num) = @_;
 
-  $self->{html_inside}{$tag} += $num;
+  if ($num != 0) {
+    $self->{html_inside}{$tag} += $num;
+    if ($self->{html_inside}{$tag} == 0) {
+      # look for obsucating tags (format changes that have no affect if nothing between)
+      if ($self->{html_used}{$tag} == 0 && $tag !~ m/^(p)$/) {
+        $self->{html}{empty_format}++;
+#       print STDERR "html_tag: no content within tag sequence '$tag' (empty_format=$self->{html}{empty_format})\n";
+      }
+      delete $self->{html_used}{$tag};
+    }
+  }
 
   $self->{html}{elements}++ if $tag =~ /^(?:$re_strict|$re_loose)$/io;
   $self->{html}{tags}++;
+# print STDERR "html_tag: found bad tag '$tag'\n" if $tag !~ /^(?:$re_strict|$re_loose)$/io;
 
   if ($num == 1) {
     $self->html_format($tag, $attr, $num);

-------------------------------------------------------------------------------

This patch to "html_tests" records found IMG tags as being content between formatting tags. This patch could be omitted and maybe give better results since normal HTML email shouldn't be sending highly-formatted messages anyway, but it's included for correctness.

@@ -259,9 +270,16 @@
   }
   if (($tag eq "img" && exists $attr->{src} && ($_ = $attr->{src})) ||
       ($tag =~ /^(?:body|table|tr|td|th)$/ && exists $attr->{background} &&
        ($_ = $attr->{background})))
   {
+    my $inside = $self->{html_inside};
+#   print STDERR "html_tests: found image inside tags:";
+    foreach (keys %$inside) {
+#     print STDERR " ",$_;
+      $self->{html_used}{$_}++;
+    }
+#   print STDERR "\n";
     if (/\?/ || (/[a-f\d]{12,}/i && ! /\.(?:jpe?g|gif|png)$/i && !/^cid:/))
     {
       $self->{html}{web_bugs} = 1;

-------------------------------------------------------------------------------

This patch to "html_text" does two things:

1) If a tag is found that was skipped by the real HTML parser (which it does if the tag was not recognized), then it forces the parsing of it here and removes it from the text stream. A tag in this case is any "<...>" construct where the first character following the "<" is not a space. I believe this fits how most browsers work and has caused no false hits in the tests I've run. (A real "<" should be coded as "&lt;", anyway.)

2) When the text (after tags are removed) is non-null, then all active tags ("html_inside") are marked as "used" and thus will not be counted as some type of obfuscating tag.

@@ -314,6 +332,23 @@
 sub html_text {
   my ($self, $text) = @_;
 
+  # the HTML parses skips tags that it does not recognize; fine for normal, bad for spam
+  while ($text =~ s/<(\S[^>]*)>//) {
+#   print STDERR "html_text: found unparsed <$1> inside text\n";
+    html_tag($self,$1,undef,0);
+  }
+
+  # record when something non-tag exists between other tags (search of obfuscating tags)
+  if ($text ne "") {
+    my $inside = $self->{html_inside};
+#   print STDERR "html_text: found text inside tags:";
+    foreach (keys %$inside) {
+#     print STDERR " ",$_;
+      $self->{html_used}{$_} = 1;
+    }
+#   print STDERR "\n";
+  }
+
   if (exists $self->{html_inside}{a} && $self->{html_inside}{a} > 0) {
     $self->{html}{anchor_text} .= " $text";
   }

-------------------------------------------------------------------------------

This new function "html_bad_tags" tests either the count of invalid tags or the ratio of invalid tags to all tags (a fraction between 0 and 1) against a range.

@@ -392,6 +427,32 @@
 ###########################################################################
 # HTML parser tests
 ###########################################################################
+
+sub html_bad_tags {
+  my ($self, undef, $test, $min, $max) = @_;
+
+# print STDERR "html_bad_tags: test=$test; min=$min; max=$max (tags=$self->{html}{tags}; elements=$self->{html}{elements})\n";
+  return 0 if !$self->{html}{tags};
+
+  if ($test eq "ratio") {
+    # ratio of tags that are valid
+    $test = ($self->{html}{tags} - $self->{html}{elements}) / $self->{html}{tags};
+  } elsif ($test eq "count") {
+    # number of invalid tags
+    $test = $self->{html}{tags} - $self->{html}{elements};
+  } else {
+    # invalid test
+    return 0;
+  }
+
+  # not all perls understand what "inf" means, so we need to do
+  # non-numeric tests! urg!
+  if ( !defined $max || $max eq "inf" ) {
+    return ($test > $min);
+  } else {
+    return ($test > $min && $test <= $max);
+  }
+}
 
 sub html_tag_balance {
   my ($self, undef, $rawtag, $rawexpr) = @_;

-------------------------------------------------------------------------------

This patch to "get_decoded_stripped_body_text_array" just clears the "html_used" structure between messages.

--- orig/PerMsgStatus.pm  Thu Jul 10 10:20:57 2003
+++ PerMsgStatus.pm  Thu Jul 10 13:07:17 2003
@@ -1128,6 +1128,7 @@
   # reset variables used in HTML tests
   $self->{html} = {};
   $self->{html_inside} = {};
+  $self->{html_used} = {};
   $self->{html}{ratio} = 0;
   $self->{html}{image_area} = 0;
   $self->{html}{shouting} = 0;

-------------------------------------------------------------------------------

Finally, here are the rules that make use of the new features. Obviously, the scores need a more rigorous determination than the intuitive values I chose once everything appeared to be working.

body HTML_BAD_TAGS_00_10    eval:html_bad_tags('ratio','0.00','0.10')
body HTML_BAD_TAGS_10_20    eval:html_bad_tags('ratio','0.10','0.20')
body HTML_BAD_TAGS_20_30    eval:html_bad_tags('ratio','0.20','0.30')
body HTML_BAD_TAGS_30_40    eval:html_bad_tags('ratio','0.30','0.40')
body HTML_BAD_TAGS_40_50    eval:html_bad_tags('ratio','0.40','0.50')
body HTML_BAD_TAGS_50_60    eval:html_bad_tags('ratio','0.50','0.60')
body HTML_BAD_TAGS_60_70    eval:html_bad_tags('ratio','0.60','0.70')
body HTML_BAD_TAGS_70_80    eval:html_bad_tags('ratio','0.70','0.80')
body HTML_BAD_TAGS_80_90    eval:html_bad_tags('ratio','0.80','0.90')
body HTML_BAD_TAGS_90_100   eval:html_bad_tags('ratio','0.90','1.00')
body HTML_BAD_TAGS_0        eval:html_bad_tags('count','-1','0')
body HTML_BAD_TAGS_1        eval:html_bad_tags('count','0','4')
body HTML_BAD_TAGS_5        eval:html_bad_tags('count','4','9')
body HTML_BAD_TAGS_10       eval:html_bad_tags('count','9','24')
body HTML_BAD_TAGS_25       eval:html_bad_tags('count','24','49')
body HTML_BAD_TAGS_50       eval:html_bad_tags('count','49')
body HTML_EMPTY_FORMAT_1    eval:html_range('empty_format','0','4')
body HTML_EMPTY_FORMAT_5    eval:html_range('empty_format','4','9')
body HTML_EMPTY_FORMAT_10   eval:html_range('empty_format','9','24')
body HTML_EMPTY_FORMAT_25   eval:html_range('empty_format','24','49')
body HTML_EMPTY_FORMAT_50   eval:html_range('empty_format','49')

describe HTML_BAD_TAGS_00_10    0-10% of all HTML tags are invalid
describe HTML_BAD_TAGS_10_20    10-20% of all HTML tags are invalid
describe HTML_BAD_TAGS_20_30    20-30% of all HTML tags are invalid
describe HTML_BAD_TAGS_30_40    30-40% of all HTML tags are invalid
describe HTML_BAD_TAGS_40_50    40-50% of all HTML tags are invalid
describe HTML_BAD_TAGS_50_60    50-60% of all HTML tags are invalid
describe HTML_BAD_TAGS_60_70    60-70% of all HTML tags are invalid
describe HTML_BAD_TAGS_70_80    70-80% of all HTML tags are invalid
describe HTML_BAD_TAGS_80_90    80-90% of all HTML tags are invalid
describe HTML_BAD_TAGS_90_100   90-100% of all HTML tags are invalid
describe HTML_BAD_TAGS_0        HTML has no invalid tags
describe HTML_BAD_TAGS_1        HTML has at least 1 invalid tag
describe HTML_BAD_TAGS_5        HTML has at least 5 invalid tags
describe HTML_BAD_TAGS_10       HTML has at least 10 invalid tags
describe HTML_BAD_TAGS_25       HTML has at least 25 invalid tags
describe HTML_BAD_TAGS_50       HTML has at least 50 invalid tags
describe HTML_EMPTY_FORMAT_1    HTML has at least 1 formatting pair with nothing between
describe HTML_EMPTY_FORMAT_5    HTML has at least 5 formatting pairs with nothing between
describe HTML_EMPTY_FORMAT_10   HTML has at least 10 formatting pairs with nothing between
describe HTML_EMPTY_FORMAT_25   HTML has at least 25 formatting pairs with nothing between
describe HTML_EMPTY_FORMAT_50   HTML has at least 50 formatting pairs with nothing between

score HTML_BAD_TAGS_00_10    0.1
score HTML_BAD_TAGS_10_20    0.5
score HTML_BAD_TAGS_20_30    1.0
score HTML_BAD_TAGS_30_40    1.5
score HTML_BAD_TAGS_40_50    2.0
score HTML_BAD_TAGS_50_60    2.5
score HTML_BAD_TAGS_60_70    3.0
score HTML_BAD_TAGS_70_80    3.5
score HTML_BAD_TAGS_80_90    4.0
score HTML_BAD_TAGS_90_100   4.5
score HTML_BAD_TAGS_0        -0.1
score HTML_BAD_TAGS_1        0.1
score HTML_BAD_TAGS_5        0.2
score HTML_BAD_TAGS_10       0.3
score HTML_BAD_TAGS_25       0.4
score HTML_BAD_TAGS_50       0.5
score HTML_EMPTY_FORMAT_1    0.1
score HTML_EMPTY_FORMAT_5    0.5
score HTML_EMPTY_FORMAT_10   1.0
score HTML_EMPTY_FORMAT_25   1.5
score HTML_EMPTY_FORMAT_50   2.0

-------------------------------------------------------------------------------

After going through testing, I find that these HTML rules don't fire all that often. Most HTML messages (spam and valid) use tags properly. The "empty format" tests would fire fairly frequently on email generated by MSWord; Word also inserts a certain number of bad "<o:...>" tags. The bad-tags test, however, fired above 20% only on spam messages.

Brian ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
That which we persist in doing becomes easier. It's not that the nature of the thing has changed but rather our ability at it has increased.
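The "empty format" bookkeeping described in the breakdown can be tried out in isolation. The following is a minimal sketch, not the patch itself: count_empty_format() and its regex tokenizer are hypothetical stand-ins for the HTML::Parser callbacks, and images are not treated as content here. It keeps an "inside" count per open tag, marks every open tag as used when text is seen, and counts a pair as empty when it closes without having been used ("p" is exempt, as in the patch).

```perl
use strict;
use warnings;

# Hypothetical stand-alone model of the "empty_format" counter.
# The real patch hooks into parser callbacks; this sketch tokenizes
# a string with a regex purely for illustration.
sub count_empty_format {
  my ($html) = @_;
  my (%inside, %used);
  my $empty = 0;

  # split into "<...>" tokens and the text runs between them
  foreach my $token (split /(<[^>]*>)/, $html) {
    next if $token eq "";
    if ($token =~ /^</) {
      # ignore anything that does not look like a tag name
      next unless $token =~ m{^<(/?)([A-Za-z][A-Za-z0-9]*)};
      my ($closing, $tag) = ($1, lc $2);
      if (!$closing) {
        $inside{$tag}++;
      }
      elsif ($inside{$tag}) {
        if (--$inside{$tag} == 0) {
          # pair closed with nothing between: count it, except for "p"
          $empty++ if !$used{$tag} && $tag ne "p";
          delete $used{$tag};
          delete $inside{$tag};
        }
      }
    }
    else {
      # any non-empty text marks every currently open tag as used
      $used{$_} = 1 foreach keys %inside;
    }
  }
  return $empty;
}

print count_empty_format("V<b></b>ia<i></i>gra <b>now</b>"), "\n";
```

In this example the two empty pairs that split the word are counted, while the pair wrapping "now" is not.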
Thanks, I like this idea and it might work pretty well, I'll try it out.
Small problem with the patch I provided. In the "html_tests" function, I use $_ while looping over the open tags. However, $_ is used implicitly directly below in the test for "web_bugs". I think it is sufficient to move the code I added to the bottom of the enclosing "if" block.
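One detail may be worth checking before moving code around: Perl's foreach makes $_ local to the loop and restores the previous value when the loop exits, so a loop without an explicit variable should not by itself change what the later "web_bugs" test sees. The sketch below is a stand-alone reduction; is_web_bug() and its arguments are hypothetical names, not SpamAssassin code.

```perl
use strict;
use warnings;

# Hypothetical reduction of the html_tests logic, for illustration only.
sub is_web_bug {
  my ($src, $inside, $used) = @_;
  local $_ = $src;              # stands in for ($_ = $attr->{src})

  foreach (keys %$inside) {     # $_ is aliased to each key inside the loop
    $used->{$_} = 1;
  }
  # after the loop, $_ holds $src again

  return (/\?/ || (/[a-f\d]{12,}/i && !/\.(?:jpe?g|gif|png)$/i && !/^cid:/)) ? 1 : 0;
}

my %used;
print is_web_bug("http://example.com/t.gif?id=42", { b => 1, font => 1 }, \%used), "\n";
print join(",", sort keys %used), "\n";
```

Moving the block to the bottom of the enclosing "if", or using a named loop variable, is still a reasonable way to make the intent obvious to the next reader.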
Minor change... This optimizes the process by removing "html_inside" entries when they become zero so that the "html_used" test doesn't have to loop over them. It also changes an increment (++) to an assignment (=1). The increment was my original plan, but I decided I didn't need it and only changed one of the two places it was done. Sorry.

--- orig/HTML.pm  Tue Jul 15 15:47:00 2003
+++ HTML.pm  Tue Jul 15 16:19:14 2003
@@ -35,6 +35,7 @@
 #       print STDERR "html_tag: no content within tag sequence '$tag' (empty_format=$self->{html}{empty_format})\n";
       }
       delete $self->{html_used}{$tag};
+      delete $self->{html_inside}{$tag};
     }
   }
 
@@ -306,7 +307,7 @@
 
       my $inside = $self->{html_inside};
       foreach (keys %$inside) {
-        $self->{html_used}{$_}++;
+        $self->{html_used}{$_} = 1;
       }
     }
     if ($tag eq "img" && exists $attr->{width} && exists $attr->{height}) {

Note that this does not include the fix to move the "html_used" code in function "html_tests" to the bottom of the "img" tag processing block (mentioned in a previous addendum) since I included that patch in my other bug report.

Sorry for the multiple patches. Perhaps I should just stop for a while before everything gets even more mixed up.

-- Brian
Subject: Re: [SAdev] New HTML Tag Tests (patch)

On Tue, Jul 15, 2003 at 01:28:53PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Minor change... This optimizes the process by removing "html_inside" entries

we tend to ignore patches included in the comments for bugs. please use the "create a new attachment" link. I added a line to "file new bug" which explains this, fyi to folks.
Created attachment 1159 [details] Combined patch for bug #2211 and #2231 Okay... No more pieces. As requested, here's the final patch, listed against v2.55. It includes the change for bug #2231 as well. I wouldn't normally include both, but both are made to the same file and difficult to separate now. -- Brian
Created attachment 1160 [details] Tests to call new code.
> If a tag is found that was skipped by the real HTML parser (which it
> does if the tag was not recognized), then it forces the parsing of
> it here and removes it from the text stream. A tag in this case is
> any "<...>" construct where the first character following the "<" is
> not a space. I believe this fits how most browsers work and has
> caused no false hits in the tests I've run. (A real "<" should be
> coded as "&lt;", anyway.)

But if you do this after HTML::Parser is run, "&lt;" will already have been translated into "<". So a spammer could, say, do:

  &lt;VIAGRA&gt;

Which would get translated into:

  <VIAGRA>

and removed by your code.
Subject: Re: New HTML Tag Tests (patch)

> But if you do this after HTML::Parser is run, "&lt;" will already have
> been translated into "<". So a spammer could, say, do:
>
>   &lt;VIAGRA&gt;
>
> Which would get translated into:
>
>   <VIAGRA>
>
> and removed by your code.

Good point. I guess I'll have to look at patching the HTML::Parser code.

This code still provides a marked improvement over not having it, since it catches obfuscating tags and it would be undesirable for spammers to place angle brackets around every incriminating word. Better to fix it properly, though.

Brian ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
If you're passed on the right, you're in the wrong lane.
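The problem raised here can be reproduced without the parser. The sketch below is an illustration under stated assumptions: decode_basic_entities() is a hypothetical, minimal stand-in for the entity decoding that the text is said to have gone through before "html_text" sees it, and strip_unparsed_tags() wraps the same substitution loop the patch adds.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes only four basic entities, standing in for
# whatever decoding has already happened before html_text is called.
sub decode_basic_entities {
  my ($text) = @_;
  my %entity = (lt => "<", gt => ">", amp => "&", quot => '"');
  $text =~ s/&(lt|gt|amp|quot);/$entity{$1}/g;
  return $text;
}

# The tag-stripping loop from the html_text patch, returning the remaining
# text followed by the list of "tags" it removed.
sub strip_unparsed_tags {
  my ($text) = @_;
  my @found;
  while ($text =~ s/<(\S[^>]*)>//) {
    push @found, $1;
  }
  return ($text, @found);
}

my $raw = "Buy &lt;VIAGRA&gt; today";
my ($left, @tags) = strip_unparsed_tags(decode_basic_entities($raw));
print "text: '$left'\n";
print "tags: @tags\n";
```

The escaped word is treated as an unparsed tag and disappears from the text that later body rules would see, which is the evasion described above.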
I will implement the changes Justin requests in http://bugzilla.spamassassin.org/show_bug.cgi?id=2205#c4 over the weekend and submit a revised patch against the current CVS. I am looking forward to ConfSourceLDAP becoming a standard part of SpamAssassin. As a technical side note: I will not implement Bayes storage or AWL storage in LDAP, despite the fact that this would be easy to implement. I cannot see how such an extension is going to be useful. Bayes and AWL data are highly volatile, which LDAP is not designed to handle. Storing such data in LDAP will only create replication storms if you have more than one LDAP server.
Sorry, bugzilla fooled me. Please ignore the previous comment.
Subject: Re: New HTML Tag Tests (patch) > I will implement the changes Justin requests in > http://bugzilla.spamassassin.org/show_bug.cgi?id=2205#c4 over the weekend and > submit a revised patch against the current CVS. I am looking forward to ConfSourceLDAP > becoming standard part of SpamAssassin. Ummm... You say bug #2205 but this was posted against bug #2211. Brian ( bcwhite@precidia.com ) ------------------------------------------------------------------------------- A bend in the road is not the end of the road unless you fail to make the turn.
Created attachment 1309 [details]
Patch (against v2.60-rc2) to test for bad HTML tags

I've re-generated this patch for the new version of SA. In doing so, however, I removed the part of the test that looked for empty HTML formatting. Empty formatting appears in email generated by MSWord, so such a rule is never going to get weighted very high anyway. I don't believe the complexity of the code and the O(n*m) running time (though "n" and "m" are generally small) make it a worthwhile test.

This patch only adds the ability to look for bad tags, which should never appear. It should hopefully weight at about the same level as OBFUSCATING_COMMENT which is currently around the 4.0 range.

Note: I do have other changes in the HTML.pm file so if you apply this patch to original source there will be an offset on one hunk:

tolkien:~/tmp> patch -p1 <HTML.pm.diff
missing header for unified diff at line 3 of patch
patching file HTML.pm
Hunk #4 succeeded at 695 (offset -11 lines).

-- Brian
Yes, this patch is badly needed. I have a ton of e-mail that seems to come from the same source. While it gets caught by a few RBLs, they aren't weighted enough to matter. It seems like the guy uses every trick to try to get past spam filters, possibly even engineered just for SA. (Also note the X-Mailer, which may be ratware, but that's up for another discussion.)

Here's the message:

Received: from CCI1 (itandi{@[216.195.204.220])
        by ResonatorSoft.org (8.11.6/8.11.6) with SMTP id i0NAQpX05128
        for <sineswiper@resonatorsoft.org>; Fri, 23 Jan 2004 05:26:51 -0500
Received: from [216.195.204.220] by e-hostzz.comIP with HTTP;
        Fri, 23 Jan 2004 11:25:56 +0100
From: "Corey" <iekwbfat@web.de>
To: sineswiper@resonatorsoft.org
Subject: Re: IX, what had happened
Mime-Version: 1.0
X-Mailer: mPOP Web-Mail 2.19
X-Originating-IP: [e-hostzz.comIP]
Date: Fri, 23 Jan 2004 11:28:56 +0100
Reply-To: "Corey Watts" <iekwbfat@web.de>
Content-Type: multipart/alternative;
        boundary="--ALT--EGHH75168692978204"
Message-Id: <MWEDWUU-0004508982410@periphery>
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on ResonatorSoft.org
X-Spam-Level: ***
X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_44=-0.001,
        HTML_IMAGE_ONLY_06=1.439,HTML_MESSAGE=0.1,RCVD_IN_BL_SPAMCOP_NET=1.5,
        RCVD_IN_DSBL=0.706 autolearn=no version=2.63

----ALT--EGHH75168692978204
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit

ambassador mueller dwindle salami lura hoff coaxial linotype fast phyllis powder nothing thomas butte aunt metallurgic davy thereunder emile intrigue animate equip rebel

----ALT--EGHH75168692978204
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 8bit

<HTML><HEAD>
<BODY>
<p>Fr</ringlet>ee Ca</ababa>ble* TV</p>
<a href="http://www.e-hostzz.com/cable/">
<img border="0" src="http://www.e-hostzz.com/fiter3.jpg"></a>
emancipate primrose lundquist suck breeze crystallographer photolysis filled anthropomorphic baronial jack shako downtown insulate wart sapling protoplasm cost sepuchral attrition fizeau anamorphic flagellate wakerobin rage <BR>
emblazon lure milky sal abate boswell torsion cover cowbird decedent <BR>
</BODY>
</HTML>

----ALT--EGHH75168692978204--
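As a rough cross-check of how the bad-tag ratio behaves on a message like this one, the sketch below counts tags with a regex and compares them against a whitelist. It rests on assumptions: bad_tag_stats() is a hypothetical helper, the whitelist is deliberately tiny (the real code matches against the much longer $re_strict and $re_loose lists), and opening and closing tags are both counted as one tag each.

```perl
use strict;
use warnings;

# Hypothetical, deliberately short whitelist for this one sample.
my %known = map { $_ => 1 } qw(html head body p a img br);

# Returns (number of invalid tags, invalid tags / all tags).
sub bad_tag_stats {
  my ($html) = @_;
  my ($tags, $elements) = (0, 0);
  while ($html =~ m{<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>}g) {
    $tags++;
    $elements++ if $known{lc $1};
  }
  return (0, 0) if !$tags;
  my $bad = $tags - $elements;
  return ($bad, $bad / $tags);
}

# Markup of the HTML part quoted above, with the filler words removed.
my $sample = <<'END';
<HTML><HEAD>
<BODY>
<p>Fr</ringlet>ee Ca</ababa>ble* TV</p>
<a href="http://www.e-hostzz.com/cable/">
<img border="0" src="http://www.e-hostzz.com/fiter3.jpg"></a>
<BR>
<BR>
</BODY>
</HTML>
END

my ($count, $ratio) = bad_tag_stats($sample);
printf "bad tags: %d, ratio: %.2f\n", $count, $ratio;
```

Under these counting assumptions the two bogus closing tags are the only invalid ones; the ratio the real parser reports may differ, since it depends on which tags actually reach "html_tag".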
Created attachment 1725 [details] Secondary patch for 20_html_tests.cf This should finish the patch, adding the tests on the .cf file. Seems to work with the test example, and counted the right ratio.
Created attachment 1729 [details] reimplementation of idea
I have added my own code which covers this bug and also tests for unique tags (in addition to just testing counts and ratios) which seems to have a better result in general. My version just reuses html_range() for the test eval function so the addition is actually smaller.

 lib/Mail/SpamAssassin/HTML.pm         |   16 ++++++++--
 lib/Mail/SpamAssassin/MsgContainer.pm |    6 ++++
 rules/70_cvs_rules_under_test.cf      |   50 ++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 4 deletions(-)

Brian and Brendan, thanks for figuring out the technique and the ideas.
Ahhh...cool. You also added unique tags, which should catch a lot of the newer spam. Many are opting to spam "unique" tags, like they currently spam unique words, to throw off the Bayes filters. This should catch that technique easily. Lemme know when the tests have been corpus'd enough to get some scores and I'll test them out right away. BTW, will the "inf" be translated right? Brian was noting in his code that not all Perls know about it.
Subject: Re: New HTML Tag Tests (patch) > BTW, will the "inf" be translated right? Brian was noting in his code that not > all Perls know about it. As I recall, I copied that comment from elsewhere in the source code. It may not be relevant any longer. Brian ( bcwhite@precidia.com ) ------------------------------------------------------------------------------- In theory, theory and practice are the same. In practice, they're not.
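For reference, the range comparison in question can be exercised on its own. This is a sketch that mirrors the comparison at the end of "html_bad_tags" as posted in this bug; in_range() is a hypothetical name. The minimum is exclusive, the maximum is inclusive, and a missing or "inf" maximum is handled with a string test so that no numeric infinity is needed.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the comparison in html_bad_tags:
# $min is exclusive, $max is inclusive, and an undefined or "inf"
# maximum means there is no upper bound.
sub in_range {
  my ($value, $min, $max) = @_;
  if (!defined $max || $max eq "inf") {
    return $value > $min ? 1 : 0;
  }
  return ($value > $min && $value <= $max) ? 1 : 0;
}

print in_range(0, '-1', '0'), "\n";         # count of 0 against ('-1','0')
print in_range(5, '0', '4'), "\n";          # count of 5 against ('0','4')
print in_range(75, '49'), "\n";             # open-ended upper bound
print in_range(0.10, '0.00', '0.10'), "\n"; # upper bound is inclusive
```

One consequence of the exclusive minimum is that a ratio of exactly 0 does not match the ('ratio','0.00','0.10') rule, which is why the count rules use '-1' as the lower bound to catch zero.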
Created attachment 1735 [details] Additional tags and rework of RE I was working on something similar, but saw this and threw in the towel :) I had come up with a bigger list of tags seen and made a separate entry for closing tags. The separate one for closing tags is because they generally don't have attributes like a lot of the beginning tags do.
Subject: Re: New HTML Tag Tests (patch)

> I was working on some thing similar, but saw this and threw in the
> towel :) I had come up with a bigger list of tags seen and made a
> separate entry for closing tags. The separate one for closing tags is
> because they generally don't have attributes like a lot of the
> beginning tags do.

two questions:

- You want to try adapting the code to see if being stricter (considering whether a tag is open-only, I think attributes will probably FP a lot, but try that too)
- Any other non-standard tags you think we should permit?

Daniel
> Any other non-standard tags you think we should permit?

I scanned the list with the tags at http://www.htmlhelp.com/reference/html40/ and it looks clean. But, I did find some tags in the Mozilla source code, which might be added to the code:

http://lxr.mozilla.org/seamonkey/source/htmlparser/tests/htmlgen/htmlgen.cpp#28

  "BGSOUND", "BLINK", "EMBED", "ILAYER", "KEYGEN", "LAYER", "LISTING",
  "MULTICOL", "NOBR", "NOEMBED", "NOLAYER", "PLAINTEXT", "SERVER", "SMALL",
  "SOUND", "SPACER", "WBR", "XMP"

I'll do a quick patch in a minute...
Created attachment 1737 [details] Patch for custom tags Added those custom tags to a separate variable, and also added the MARQUEE tag in that list.
> - You want to try adapting the code to see if being stricter
> (considering whether a tag is open-only, I think attributes will
> probably FP a lot, but try that too)

The variable $beginning_tags has 2 sections. The first (left) side of it holds tags that typically have attributes, for example <a>, <font>, etc. The right side (after \b ?.*?)| ) holds tags that generally don't have attributes, for example <u>, <kbd>, etc.

I think it makes sense to be stricter on the tags because, for example, the <u> tag generally never has a use for an attribute. You can certainly do <u title=foo>, but why would anyone ever do that in an email? I pulled the list of tags and attributes from <http://www.w3schools.com/tags/tag_u.asp>. Anything that didn't have optional attributes and only had standard attributes was put on the right side of that regex.

> - Any other non-standard tags you think we should permit?

Ya, the tags defined in the variables $ms_xml_tags and $extra_tags in attachment <http://bugzilla.spamassassin.org/attachment.cgi?id=1735&action=view>
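The two-sided idea can be illustrated with a much smaller pattern. The sketch below is not the regex from the attachment: $with_attrs, $without_attrs, their short tag lists, and opening_tag_ok() are all hypothetical, chosen only to show how tags that normally carry attributes can be matched loosely while attribute-free tags are matched strictly.

```perl
use strict;
use warnings;

# Hypothetical, heavily shortened lists; the attachment's $beginning_tags
# regex is not reproduced here.
my $with_attrs    = qr/(?:a|font|img|table|td)/i;   # attributes expected
my $without_attrs = qr/(?:u|b|i|kbd|tt)/i;          # attributes not expected

# Returns 1 when an opening tag fits one of the two classes, 0 otherwise.
sub opening_tag_ok {
  my ($tag) = @_;
  return 1 if $tag =~ /^<$with_attrs\b[^>]*>$/;   # name, then anything
  return 1 if $tag =~ /^<$without_attrs\s*>$/;    # name, then only ">"
  return 0;
}

print opening_tag_ok($_), " $_\n"
  for '<font size="2">', '<u>', '<u title=foo>', '<ringlet>';
```

With this split, a legal but unusual form such as <u title=foo> is rejected along with unknown tags, which is the stricter behaviour being proposed and also where false positives would come from.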
*** Bug 2947 has been marked as a duplicate of this bug. ***