Bug 7780 - Parsing of undecoded UTF-16 at /usr/share/perl5/Mail/SpamAssassin/HTML.pm line 262.
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 3.4.3
Hardware: All All
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2019-12-24 15:06 UTC by Andrew Aitchison
Modified: 2022-04-26 02:32 UTC



Attachment Type Modified Status Actions Submitter/CLA Status
Email which triggers the error message/rfc822 None Andrew Aitchison [NoCLA]

Description Andrew Aitchison 2019-12-24 15:06:41 UTC
Created attachment 5680 [details]
Email which triggers the error

When I run
   sa-learn --spam --mbox billing-utf16.eml
on the attached email file, it reports:
Parsing of undecoded UTF-16 at /usr/share/perl5/Mail/SpamAssassin/HTML.pm line 262.
Learned tokens from 0 message(s) (1 message(s) examined)

Ubuntu 19.10 eoan
SpamAssassin version 3.4.2
  running on Perl version 5.28.1



This may be a duplicate of https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7453
However, the "attachment.cgi?..." examples there produce the UTF-16 error with
   spamassassin filename
and 
    sa-learn --spam filename
but not with
   sa-learn --spam --mbox filename
or
   spamassassin --mbox filename
Comment 1 Andrew Aitchison 2019-12-24 16:34:52 UTC
Copying SpamAssassin/HTML.pm from 3.4.3 into the 3.4.2 installation tree fixes the bug, so it appears to be fixed in 3.4.3.
Comment 2 Henrik Krohns 2019-12-24 17:05:52 UTC
3.4.3 simply turns all HTML::Parser warnings into info() messages, so the warning is only hidden in this case.

The HTML part starts with a valid UTF-16 BOM, and undecoded UTF-16 is something HTML::Parser can't parse.

Parsing of undecoded UTF-16
    (W) The parser found the Unicode UTF-16 BOM signature at the start of the document. The result of parsing will likely be garbage.

Not sure whether SA is supposed to handle this better; leaving open to check.
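
For illustration only, the condition behind this warning (a body part that still begins with a raw UTF-16 BOM when it reaches the HTML parser) can be sketched in Python as below. This is not SpamAssassin's or HTML::Parser's code; the helper name decode_if_utf16_bom and the UTF-8 fallback are assumptions made for the example.

```python
import codecs


def decode_if_utf16_bom(raw: bytes) -> str:
    """Hypothetical helper: decode bytes to text, honouring a leading UTF-16 BOM."""
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM to pick the byte order and strips it.
        # Caveat: a UTF-32 LE BOM also starts with these two bytes; this sketch
        # does not try to tell the two apart.
        return raw.decode("utf-16")
    # Assumption for this sketch only: anything without a BOM is treated as UTF-8.
    return raw.decode("utf-8", errors="replace")


# Python's "utf-16" encoder prepends a BOM, which mimics the HTML part in the report.
html_part = "<html><body>Invoice</body></html>".encode("utf-16")
print(decode_if_utf16_bom(html_part))
```

Decoding before parsing is one way to avoid handing the parser undecoded UTF-16; whether SpamAssassin should do this, and where, is the open question in this comment.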
Comment 3 Sidney Markowitz 2022-04-26 02:32:44 UTC
The test case with -D info reproduces the problem in 3.4.4 but not in trunk (4.0.0). Closing this as worksforme.