SA Bugzilla – Bug 4046
Parsing of undecoded UTF-8 will give garbage when decoding entities at [...]/Mail/SpamAssassin/HTML.pm line 182.
Last modified: 2015-02-06 03:25:28 UTC
The warning above is sometimes issued when sa-learn uses the HTML::Parser module to parse HTML messages. Happened to me on all 3.0.x versions.
Created attachment 2582 [details] proposed patch
Yes, I get lots of these warnings, too, in my nightly mass checks. It looks like this can be solved by enabling utf8_mode for HTML::Parser in parse() in Mail/SpamAssassin/HTML.pm. Unfortunately, this is a perl 5.8 option only. I'll attach a patch.
Oops, I was going to attach the same patch Sebastian already had attached ;)
we already have a workaround for this in trunk (by just trapping the warning). I'd be curious if this actually makes a difference either way...
(In reply to comment #4)
> we already have a workaround for this in trunk (by just trapping the warning).
> I'd be curious if this actually makes a difference either way...

Just trapping a warning doesn't look like a good solution to me; on the other hand, the minimum perl requirement was 5.6.1 until now, so any real fix for this without creating a dependency on perl 5.8 would be nice.
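For readers unfamiliar with the "trap the warning" workaround discussed above, here is a minimal sketch of the general technique using only core Perl. It is not the code in trunk; the helper name quiet_undecoded_utf and the regex are assumptions made for illustration, and the warnings are simulated with warn rather than produced by HTML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not SpamAssassin's actual code): run $code while
# swallowing warnings that start with "Parsing of undecoded UTF-".
# Every other warning is handed to the previously installed handler,
# or re-emitted with warn if there was none.
sub quiet_undecoded_utf {
    my ($code) = @_;
    my $prev = $SIG{__WARN__};
    local $SIG{__WARN__} = sub {
        my ($msg) = @_;
        return if $msg =~ /^Parsing of undecoded UTF-/;
        if (ref $prev eq 'CODE') { $prev->($msg) } else { warn $msg }
    };
    return $code->();
}

# Demo: collect whatever gets past the filter.
my @seen;
{
    local $SIG{__WARN__} = sub { push @seen, $_[0] };
    quiet_undecoded_utf(sub {
        warn "Parsing of undecoded UTF-8 will give garbage when decoding entities\n";
        warn "some other warning\n";
    });
}
print scalar(@seen), " warning(s) passed through\n";   # prints "1 warning(s) passed through"
```

This only hides the message; the parser's behaviour on the input is unchanged, which is the objection raised above.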
We could wrap the call either in a version-check or eval. But from <http://search.cpan.org/~gaas/HTML-Parser-3.45/Parser.pm#Unicode>: "If the file contains text encoded in a charset besides ASCII, Latin-1 or UTF-8 then decoding will always be needed." and especially <http://search.cpan.org/~gaas/HTML-Parser-3.45/Parser.pm#METHODS>: "If utf8_mode is enabled then it is an error to pass strings containing characters with code above 255 to the parse() method, and the parse() method will croak if you try." I'm not sure this won't make things worse (can we get UTF-16 encoded text?).
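A rough sketch of the "version-check or eval" idea from the comment above, combining both guards. The helper name enable_utf8_mode_if_possible is hypothetical and this is not the attached patch; it only assumes that HTML::Parser objects expose utf8_mode() as a method, as the quoted documentation describes. It does not address the croak on characters above 255 mentioned in the same comment.

```perl
use strict;
use warnings;

# Hypothetical helper: enable utf8_mode only where it can work.
# Returns 1 if the mode was switched on, 0 otherwise.
sub enable_utf8_mode_if_possible {
    my ($parser) = @_;
    return 0 unless $] >= 5.008;                 # utf8_mode is perl 5.8 only
    return 0 unless $parser->can('utf8_mode');   # older HTML::Parser lacks it
    # eval guards against the method croaking on an unsupported setup
    return eval { $parser->utf8_mode(1); 1 } ? 1 : 0;
}

# Usage, only if HTML::Parser happens to be installed:
if (eval { require HTML::Parser; 1 }) {
    my $p = HTML::Parser->new(api_version => 3);
    print "utf8_mode enabled: ", enable_utf8_mode_if_possible($p), "\n";
}
```

On perl 5.6.1 or with an older HTML::Parser the helper simply returns 0, so the caller falls back to the existing behaviour without a hard dependency on perl 5.8.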
After applying the patch, I still have problems with UTF-32. Running sa-learn evokes the message: "Parsing of undecoded UTF-32 at /usr/.../SpamAssassin/HTML.pm at line 184".
A test case message would be helpful.
Since this bug report has been looked at, confirmed, and not yet resolved though progress has been made, and since it doesn't seem to cause any serious problems for anyone, it looks like its resolution should be scheduled for 3.1.1.
(In reply to comment #8)
> A test case message would be helpful.

See attachment (mbox format).
Created attachment 2867 [details] testcase
testcase uploaded as text/plain, hope this doesn't change the file in any way.
Created attachment 2902 [details] tested fix

OK, I realised that we'd never tested utf8_mode() as a fix for this. So I tested it. The results are: big fat zero.

I used my most recent 2000 ham and 2000 spam, ran mass-check on both, stripped out the "scantime" stuff that varies as a result of mass-check CPU load etc. and diffed the results -- there were no differences in rule hits whatsoever. In other words, HTML::Parser's noisiness is entirely irrelevant to us in terms of spam filtering results.

On top of that, Art's UTF-32 message produces no warnings on perl 5.6.1 or perl 5.8.4 for me with svn trunk. Having said that, I'll change the svn trunk code to ignore UTF-32 noise as well, just in case. r178692.

The attached patch is the utf8_mode()-enabling code. I won't be applying it, since as I noted above it has no effect and I'd prefer not to add version-specific stuff unless there's a point. ;)
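For anyone wanting to repeat the comparison described in the comment above (strip the "scantime" noise, then diff two mass-check runs), a small sketch follows. The helper names and the sample log lines are made up for illustration; the real mass-check log layout may differ, and a position-by-position comparison is a simplification of a real diff.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the scantime field so timing noise does not
# show up as a difference between runs.
sub strip_scantime {
    my ($line) = @_;
    $line =~ s/\bscantime=\d+,?//;
    return $line;
}

# Compare two runs line by line and return the 1-based numbers of the
# lines that still differ once scantime has been removed.
sub differing_lines {
    my ($old, $new) = @_;
    my $n = @$old > @$new ? scalar @$old : scalar @$new;
    my @diff;
    for my $i (0 .. $n - 1) {
        my $x = strip_scantime(defined $old->[$i] ? $old->[$i] : '');
        my $y = strip_scantime(defined $new->[$i] ? $new->[$i] : '');
        push @diff, $i + 1 if $x ne $y;
    }
    return @diff;
}

# Invented sample lines: same rule hits, different scantime.
my @run_a = ("Y 12 /mail/spam/1 HTML_MESSAGE,MIME_HTML_ONLY time=1117,scantime=2");
my @run_b = ("Y 12 /mail/spam/1 HTML_MESSAGE,MIME_HTML_ONLY time=1117,scantime=5");
my @changed = differing_lines(\@run_a, \@run_b);
print scalar(@changed), " line(s) differ\n";   # prints "0 line(s) differ"
```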
meh, the test case message was from Sebastian, not Art. that may explain why it didn't produce warnings. moot point though as SVN trunk will now inhibit UTF-32 warnings too. closing as FIXED; please reopen if the problem is observed with SVN trunk.
*** Bug 4373 has been marked as a duplicate of this bug. ***
This still exists in the 3.0 branch, but is fixed in trunk.
*** Bug 3787 has been marked as a duplicate of this bug. ***
i have a sample message and spamd debug to follow that shows the malformed utf-8 warns... all 22k of em.

Other info....

# echo $LANG
en_US
# perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
3.45
# spamassassin -V
SpamAssassin version 3.2.0-r322462
  running on Perl version 5.8.5
Created attachment 3278 [details] sample message
sample message that generates utf-8 warns on SVN
I'm not sure whether or not the text/plain destroys the UTF-8 encoding. Can you check the md5sum of the original? Downloaded, I get: 1a0e58a1cfc5fb2a9f891c1c128ecd4c With that version, I see no issues on either my Linux (perl 5.8.5) or Mac OS X (perl 5.8.6) machines.
Created attachment 3279 [details] spamd debug for sample message
here is the spamd debug output.. i had to trim a bunch of the utf-8 warns out of this message to get it posted to bz under the 1k limit.
Created attachment 3280 [details] compressed sample message
-rw------- 1 root root 20210 Nov 28 11:18 msg.txt
# md5sum msg.txt
4bad4aedc872633b97400b8ff417fb02 msg.txt
-rw------- 1 root root 3488 Nov 28 11:18 msg.txt.gz
per request from dallas,

> Please test this message against your svn copy...
> http://www.engelken.net/download/msg.txt
>
> It produces 22k utf-8 warns for me...
>
> # grep -c UTF-8 spamd.debug.txt
> 22636
>
> And I only have 70_sare_obfu.cf in /usr/share/spamassassin for testing
> purposes..

with my RDJ rules defined:

TRUSTED_RULESETS="TRIPWIRE SARE_REDIRECT_POST300 SARE_EVILNUMBERS0 SARE_EVILNUMBERS1 SARE_BAYES_POISON_NXM SARE_HEADER SARE_HEADER_ENG SARE_SPECIFIC SARE_ADULT SARE_BML SARE_FRAUD SARE_SPOOF SARE_RANDOM SARE_SPAMCOP_TOP200 SARE_OEM SARE_GENLSUBJ SARE_GENLSUBJ_ENG SARE_UNSUB SARE_URI_ENG BOGUSVIRUS SARE_OBFU SARE_HTML"

and, for comparison, per your earlier bug attach,

% echo $LANG
en_US
% perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
3.46
% spamassassin -V
SpamAssassin version 3.2.0-r322462
  running on Perl version 5.8.6

on debug, output shows the 'usual' scads of errors:

% grep -c "Malformed UTF-8 character" log.txt
80099
This is an entirely new bug, not a valid reopen.
Reopening bug 3787, as it is not a duplicate. Re-resolving this as FIXED.
Ten years later ... case re-opened as Bug 7133.