Bug 7780 - Parsing of undecoded UTF-16 at /usr/share/perl5/Mail/SpamAssassin/HTML.pm line 262.
Status: REOPENED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 3.4.3
Hardware: All All
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2019-12-24 15:06 UTC by Andrew Aitchison
Modified: 2019-12-24 17:05 UTC

Attachment Type Modified Status Actions Submitter/CLA Status
Email which triggers the error message/rfc822 None Andrew Aitchison [NoCLA]

Description Andrew Aitchison 2019-12-24 15:06:41 UTC
Created attachment 5680
Email which triggers the error

When I run
   sa-learn --spam --mbox billing-utf16.eml
on the attached email file, it reports:
Parsing of undecoded UTF-16 at /usr/share/perl5/Mail/SpamAssassin/HTML.pm line 262.
Learned tokens from 0 message(s) (1 message(s) examined)

Ubuntu 19.10 eoan
SpamAssassin version 3.4.2
  running on Perl version 5.28.1



This may be a duplicate of https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7453.
However, the "attachment.cgi?..." examples there produce the UTF-16 error with
   spamassassin filename
and
   sa-learn --spam filename
but not with
   sa-learn --spam --mbox filename
or
   spamassassin --mbox filename
Comment 1 Andrew Aitchison 2019-12-24 16:34:52 UTC
Copying SpamAssassin/HTML.pm from 3.4.3 into the 3.4.2 installation tree fixes the bug, so it appears to be fixed in 3.4.3.
Comment 2 Henrik Krohns 2019-12-24 17:05:52 UTC
3.4.3 simply turns all HTML::Parser warnings into info() messages, so in this case the warning is only hidden, not fixed.

The HTML part starts with a valid UTF-16 BOM, which HTML::Parser can't parse.

Parsing of undecoded UTF-16
    (W) The parser found the Unicode UTF-16 BOM signature at the start of the document. The result of parsing will likely be garbage.

I'm not sure whether SA is supposed to handle this better; leaving this open to check.
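
For illustration only, "handling this better" could mean checking the raw HTML part for a UTF-16 BOM and decoding it to text before it reaches the parser. The sketch below shows that idea in Python rather than in SpamAssassin's Perl. The function name decode_html_part and its declared_charset parameter are hypothetical and do not come from SpamAssassin or HTML::Parser. It also does not distinguish a UTF-32 LE BOM, which begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# Byte-order marks for the two UTF-16 variants (FF FE and FE FF).
_UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def decode_html_part(raw: bytes, declared_charset: str = "utf-8") -> str:
    """Hypothetical helper: turn a raw HTML body into text before parsing."""
    if raw.startswith(_UTF16_BOMS):
        # The generic "utf-16" codec reads the BOM, picks the matching
        # byte order, and drops the BOM from the decoded text.
        return raw.decode("utf-16", errors="replace")
    # No BOM: fall back to the charset declared for the MIME part
    # (an assumed input here), replacing undecodable bytes.
    return raw.decode(declared_charset, errors="replace")
```

With this approach the parser only ever sees decoded text, so the BOM check in the parser would never trigger; whether that fits SA's message handling is exactly the open question.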