Bug 4046 - Parsing of undecoded UTF-8 will give garbage when decoding entities at [...]/Mail/SpamAssassin/HTML.pm line 182.
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: unspecified
Hardware: All other
Importance: P5 normal
Target Milestone: 3.1.1
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 4373
Depends on:
Blocks:
 
Reported: 2004-12-21 09:03 UTC by Sebastian Jaenicke
Modified: 2015-02-06 03:25 UTC (History)
CC List: 2 users



Attachment Type Modified Status Actions Submitter/CLA Status
proposed patch patch None Sebastian Jaenicke [NoCLA]
testcase text/plain None Sebastian Jaenicke [NoCLA]
tested fix patch None Justin Mason [HasCLA]
sample message text/plain None Dallas Engelken [HasCLA]
spamd debug for sample message text/plain None Dallas Engelken [HasCLA]
compressed sample message application/octet-stream None Dallas Engelken [HasCLA]

Description Sebastian Jaenicke 2004-12-21 09:03:05 UTC
The warning above is sometimes issued when sa-learn uses the HTML::Parser
module to parse HTML messages. Happened to me on all 3.0.x versions.
Comment 1 Sebastian Jaenicke 2004-12-21 09:04:17 UTC
Created attachment 2582
proposed patch
Comment 2 Bas Zoetekouw 2004-12-27 06:14:41 UTC
Yes, I get lots of these warnings, too, in my nightly mass checks.

It looks like this can be solved by enabling utf8_mode for HTML::Parser in
parse() in Mail/SpamAssassin/HTML.pm. Unfortunately, this is a perl 5.8 option only.

I'll attach a patch.
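
For readers unfamiliar with the warning, the following is a minimal sketch of the underlying problem. It is written in Python rather than Perl and is not taken from SpamAssassin or HTML::Parser; the sample bytes and function names are made up for illustration. It shows why expanding entities in text whose UTF-8 bytes were never decoded yields mixed, garbled output.

```python
import html

# Raw bytes as they might appear in a message body: "cafe" with an accented e
# encoded as UTF-8, followed by the same character written as an HTML entity.
raw = b"caf\xc3\xa9 &eacute;"


def decode_entities_undecoded(data: bytes) -> str:
    # Treat every byte as one character (the "undecoded UTF-8" situation),
    # then expand entities into real Unicode characters.
    return html.unescape(data.decode("latin-1"))


def decode_entities_decoded(data: bytes) -> str:
    # Decode the UTF-8 first, so literal text and entities share one
    # character space before entities are expanded.
    return html.unescape(data.decode("utf-8"))


garbled = decode_entities_undecoded(raw)
clean = decode_entities_decoded(raw)
print(repr(garbled))
print(repr(clean))
```

In the first result the literal character shows up as two unrelated characters while the entity is expanded correctly, so the same letter is represented two different ways in one string. Decoding before entity expansion avoids that mismatch.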
Comment 3 Bas Zoetekouw 2004-12-27 06:18:05 UTC
Oops, I was going to attach the same patch Sebastian already had attached ;)
Comment 4 Justin Mason 2005-01-21 12:24:25 UTC
we already have a workaround for this in trunk (by just trapping the warning). 
 I'd be curious if this actually makes a difference either way...
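
As an illustration of what "trapping the warning" can look like, here is a minimal Python sketch. It is an analogy only and does not show the actual trunk code; `run_quietly`, `fake_parse` and the `NOISE` pattern are hypothetical names, and the pattern is based solely on the warning texts quoted in this report.

```python
import warnings

# Assumed pattern, derived from the warning texts quoted in this report.
NOISE = r"Parsing of undecoded UTF-(8|32)"


def run_quietly(func):
    """Call func(), dropping warnings that match NOISE and returning the rest."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")                    # record everything...
        warnings.filterwarnings("ignore", message=NOISE)   # ...except known noise
        result = func()
    return result, [str(w.message) for w in caught]


def fake_parse():
    # Stand-in for a parser call that emits one noisy and one genuine warning.
    warnings.warn("Parsing of undecoded UTF-8 will give garbage when decoding entities")
    warnings.warn("something else went wrong")
    return "parsed"


result, remaining = run_quietly(fake_parse)
print(result, remaining)
```

The filter is deliberately narrow: only the known noisy message is dropped, so unrelated warnings from the same call still reach the caller.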
Comment 5 Sebastian Jaenicke 2005-01-22 07:39:24 UTC
(In reply to comment #4)
> we already have a workaround for this in trunk (by just trapping the warning). 
>  I'd be curious if this actually makes a difference either way...

Just trapping a warning doesn't look like a good solution to me; on the
other hand, the minimum perl requirement was 5.6.1 until now, so any real
fix for this without creating a dependency on perl 5.8 would be nice.
Comment 6 Malte S. Stretz 2005-03-03 12:36:47 UTC
We could wrap the call either in a version-check or eval. 
 
But from <http://search.cpan.org/~gaas/HTML-Parser-3.45/Parser.pm#Unicode>: 
"If the file contains text encoded in a charset besides ASCII, Latin-1 or 
UTF-8 then decoding will always be needed." 
 
and especially 
<http://search.cpan.org/~gaas/HTML-Parser-3.45/Parser.pm#METHODS>: 
"If utf8_mode is enabled then it is an error to pass strings containing 
characters with code above 255 to the parse() method, and the parse() method 
will croak if you try." 
 
I'm not sure this won't make things worse (can we have UTF-16 encoded text?). 
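
The two concerns quoted above can be sketched in Python. This is an analogy under stated assumptions, not a reproduction of HTML::Parser's behaviour: `fits_in_bytes` and `looks_like_utf8` are hypothetical helpers, and the sample strings are invented.

```python
def fits_in_bytes(text: str) -> bool:
    """True if every character has a code of 255 or below, i.e. the string
    could be a plain byte string rather than already-decoded Unicode."""
    return all(ord(ch) <= 255 for ch in text)


def looks_like_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "h\u00e9llo"
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16")  # begins with a byte-order mark

print(fits_in_bytes("caf\u00e9"), fits_in_bytes("\u20ac"))
print(looks_like_utf8(utf8_bytes), looks_like_utf8(utf16_bytes))
```

A clean decode does not prove the input really is UTF-8 (pure-ASCII UTF-16 without a byte-order mark also decodes without error), so a check like this can only rule inputs out, not confirm them.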
Comment 7 Art Mandler 2005-04-14 05:12:07 UTC
After applying the patch, I still have problems with UTF-32.  Running sa-learn
evokes the message:  "Parsing of undecoded UTF-32 at
/usr/.../SpamAssassin/HTML.pm at line 184".
Comment 8 John Gardiner Myers 2005-04-14 08:31:30 UTC
A test case message would be helpful.
Comment 9 Bob Menschel 2005-04-23 16:27:09 UTC
Since this bug report has been looked at, confirmed, and not yet resolved though
progress has been made, and since it doesn't seem to cause any serious problems
for anyone, it looks like its resolution should be scheduled for 3.1.1. 
Comment 10 Sebastian Jaenicke 2005-05-11 13:40:05 UTC
(In reply to comment #8)
> A test case message would be helpful.

See attachment (mbox format).
Comment 11 Sebastian Jaenicke 2005-05-11 13:43:04 UTC
Created attachment 2867
testcase

testcase uploaded as text/plain, hope this doesn't change the file
in any way.
Comment 12 Justin Mason 2005-05-26 14:24:20 UTC
Created attachment 2902
tested fix

OK, I realised that we'd never tested utf8_mode() as a fix for this.
So I tested it.  The results are:

    big fat zero

I used my most recent 2000 ham and 2000 spam, ran mass-check on both,
stripped out the "scantime" stuff that varies as a result of mass-check
CPU load etc. and diffed the results -- there were no differences
in rule hits whatsoever.

in other words, HTML::Parser's noisiness is entirely irrelevant to us
in terms of spam filtering results.
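
The comparison described above (strip the run-dependent timing field, then diff) can be sketched as follows. This is a minimal Python illustration, not the procedure actually used: the log lines are invented, and the only assumption carried over from the comment is that each result line has a "scantime" field that varies between runs.

```python
import re

# Assumption: the varying field looks like "scantime=<number>"; the rest of
# the line layout below is invented for the example.
SCANTIME = re.compile(r"scantime=\d+,?")


def normalise(lines):
    """Remove the run-dependent scantime field from every line."""
    return [SCANTIME.sub("", line).strip() for line in lines]


def changed_lines(before, after):
    """Return (old, new) pairs that still differ once scantime is removed.

    Assumes both runs list the same messages in the same order.
    """
    return [(a, b) for a, b in zip(normalise(before), normalise(after)) if a != b]


before = ["Y 12 msg1 HTML_MESSAGE,BAYES_99 scantime=3", "N 0 msg2 - scantime=1"]
after = ["Y 12 msg1 HTML_MESSAGE,BAYES_99 scantime=7", "N 0 msg2 - scantime=2"]

print(changed_lines(before, after))
```

An empty result corresponds to the outcome reported here: no difference in rule hits between the two runs.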

On top of that, Art's UTF-32 message produces no warnings on perl 5.6.1 or perl
5.8.4 for me with svn trunk.  Having said that, I'll change the svn trunk code
to ignore UTF-32 noise as well, just in case. r178692.

the attached patch is the utf8_mode()-enabling code.  I won't be
applying it, since as I noted above it has no effect and I'd prefer
not to add version-specific stuff unless there's a point. ;)
Comment 13 Justin Mason 2005-05-26 14:25:49 UTC
meh, the test case message was from Sebastian, not Art.  that may explain why it
didn't produce warnings.  moot point though as SVN trunk will now inhibit UTF-32
warnings too.

closing as FIXED; please reopen if the problem is observed with SVN trunk.
Comment 14 John Madden 2005-06-01 10:10:33 UTC
*** Bug 4373 has been marked as a duplicate of this bug. ***
Comment 15 John Madden 2005-06-01 10:11:41 UTC
This still exists in the 3.0 branch, but is fixed in trunk.
Comment 16 Bob Menschel 2005-07-02 21:10:54 UTC
*** Bug 3787 has been marked as a duplicate of this bug. ***
Comment 17 Dallas Engelken 2005-11-28 17:59:12 UTC
i have a sample message and spamd debug to follow that shows the malformed
utf-8 warns... all 22k of em. 

Other info....

# echo $LANG
en_US

# perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
3.45

# spamassassin -V
SpamAssassin version 3.2.0-r322462
  running on Perl version 5.8.5

Comment 18 Dallas Engelken 2005-11-28 17:59:57 UTC
Created attachment 3278
sample message

sample message that generates utf-8 warns on SVN
Comment 19 Theo Van Dinter 2005-11-28 18:06:11 UTC
I'm not sure whether or not the text/plain destroys the UTF-8 encoding.  Can you check the md5sum of 
the original?  Downloaded, I get:

1a0e58a1cfc5fb2a9f891c1c128ecd4c

With that version, I see no issues on either my Linux (perl 5.8.5) or Mac OS X (perl 5.8.6) machines.
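
The checksum comparison requested above can be sketched in Python. This is a minimal illustration with invented message bytes; it does not reproduce either digest quoted in this report, and the line-ending rewrite is only one assumed way an upload could alter a file.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of data, comparable to md5sum output."""
    return hashlib.md5(data).hexdigest()


# Invented stand-in for the original message.
original = b"Subject: test\n\ncaf\xc3\xa9\n"

# Simulate a transfer that rewrote line endings along the way.
altered = original.replace(b"\n", b"\r\n")

print(md5_hex(original))
print(md5_hex(altered))
print(md5_hex(original) == md5_hex(altered))
```

Any byte-level change produces a different digest, so matching digests on both sides indicate that the downloaded copy is the same file the reporter tested.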
Comment 20 Dallas Engelken 2005-11-28 18:06:14 UTC
Created attachment 3279
spamd debug for sample message

here is the spamd debug output.. i had to trim a bunch of the utf-8 warns out
of this message to get it posted to bz under the 1k limit.
Comment 21 Dallas Engelken 2005-11-28 18:27:05 UTC
Created attachment 3280
compressed sample message

-rw-------  1 root root    20210 Nov 28 11:18 msg.txt

# md5sum msg.txt
4bad4aedc872633b97400b8ff417fb02  msg.txt

-rw-------  1 root root     3488 Nov 28 11:18 msg.txt.gz
Comment 22 openmacnews 2005-11-28 18:38:10 UTC
per request from dallas,

> Please test this message against your svn copy...
> http://www.engelken.net/download/msg.txt
> 
> It produces 22k utf-8 warns for me... 
> 
> # grep -c UTF-8 spamd.debug.txt
> 22636
> 
> And I only have 70_sare_obfu.cf in /usr/share/spamassassin for testing
> purposes..

with my RDJ rules defined:

TRUSTED_RULESETS="TRIPWIRE SARE_REDIRECT_POST300 SARE_EVILNUMBERS0
SARE_EVILNUMBERS1 SARE_BAYES_POISON_NXM SARE_HEADER SARE_HEADER_ENG
SARE_SPECIFIC SARE_ADULT SARE_BML SARE_FRAUD SARE_SPOOF SARE_RANDOM
SARE_SPAMCOP_TOP200 SARE_OEM SARE_GENLSUBJ SARE_GENLSUBJ_ENG SARE_UNSUB
SARE_URI_ENG BOGUSVIRUS SARE_OBFU SARE_HTML"

and, for comparison, per your earlier bug attach,

% echo $LANG
	en_US
% perl -e 'use HTML::Parser; print HTML::Parser->VERSION';
	3.46
% spamassassin -V
	SpamAssassin version 3.2.0-r322462
	  running on Perl version 5.8.6


on debug, output shows the 'usual' scads of errors:

% grep -c "Malformed UTF-8 character" log.txt
	80099
Comment 23 John Gardiner Myers 2005-11-28 18:42:44 UTC
This is an entirely new bug, not a valid reopen.
Comment 24 John Gardiner Myers 2005-11-28 18:50:05 UTC
Reopening bug 3787, as it is not a duplicate.  Re-resolving this as FIXED.
Comment 25 Mark Martinec 2015-02-06 03:25:28 UTC
Ten years later ... case re-opened as Bug 7133.