SA Bugzilla – Bug 7780
Parsing of undecoded UTF-16 at /usr/share/perl5/Mail/SpamAssassin/HTML.pm line 262.
Last modified: 2022-04-26 02:32:44 UTC
Created attachment 5680: Email which triggers the error

When I run

    sa-learn --spam --mbox billing-utf16.eml

on the attached email file, it reports:

    Parsing of undecoded UTF-16 at /usr/share/perl5/Mail/SpamAssassin/HTML.pm line 262.
    Learned tokens from 0 message(s) (1 message(s) examined)

Environment: Ubuntu 19.10 (eoan), SpamAssassin version 3.4.2 running on Perl version 5.28.1.

This may be a duplicate of https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7453, but the "attachment.cgi?..." examples there produce the UTF-16 error with

    spamassassin filename
    sa-learn --spam filename

but not with

    sa-learn --spam --mbox filename
    spamassassin --mbox filename
Copying SpamAssassin/HTML.pm from 3.4.3 into the 3.4.2 installation tree makes the error go away, so it appears to be fixed in 3.4.3.
3.4.3 simply turns all HTML::Parser warnings into info() messages, so the warning is only hidden in this case, not fixed. The HTML part starts with a valid UTF-16 BOM, and that is something HTML::Parser can't parse. The warning text in full:

    Parsing of undecoded UTF-16
    (W) The parser found the Unicode UTF-16 BOM signature at the start of the document. The result of parsing will likely be garbage.

Not sure if SA is supposed to handle this better, leaving open to check.
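To illustrate what "handling this better" could mean in general terms, here is a minimal sketch of decoding a BOM-prefixed HTML part to text before it reaches a parser. It is written in Python with the standard library only and is not SpamAssassin's code or its actual fix. The names `decode_html_bytes`, `TextCollector`, and `extract_text` are hypothetical, and the UTF-8 fallback is an assumption.

```python
import codecs
from html.parser import HTMLParser


def decode_html_bytes(raw, fallback="utf-8"):
    """Decode raw HTML bytes, honouring a leading UTF-16 BOM if present.

    Hypothetical helper: the fallback charset is an assumption, and a
    UTF-32 BOM (which also begins with FF FE) is not handled here.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec uses the BOM to pick the byte order and strips it.
        return raw.decode("utf-16", errors="replace")
    return raw.decode(fallback, errors="replace")


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(raw):
    """Decode first, then parse, so the parser never sees undecoded UTF-16."""
    parser = TextCollector()
    parser.feed(decode_html_bytes(raw))
    parser.close()
    return parser.chunks


# Made-up sample: a UTF-16LE HTML part with a BOM, similar in shape to the report.
sample = codecs.BOM_UTF16_LE + "<html><body><p>Invoice due</p></body></html>".encode("utf-16-le")
print(extract_text(sample))  # ['Invoice due']
```

The point of the sketch is only the ordering: the BOM is detected and the bytes are decoded before parsing, so the parser receives characters instead of raw UTF-16 bytes.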
The test case with -D info reproduces the problem in 3.4.4 but not in trunk (4.0.0). Closing this as worksforme.