SA Bugzilla – Bug 8267
ExtractText.pm
Last modified: 2024-07-12 05:12:57 UTC
Hiya!

It looks like there's a bug here:

--- a/lib/Mail/SpamAssassin/Plugin/ExtractText.pm 2024-03-29 02:00:00.000000000 +0000
+++ b/lib/Mail/SpamAssassin/Plugin/ExtractText.pm 2024-07-06 21:56:00.788596023 +0100
@@ -601,7 +601,7 @@ sub _extract {
         push @{$coll->{flags}}, 'ActionURI';
         dbg("extracttext: ActionURI: $1");
         push @{$coll->{text}}, $text;
-        push @{$coll->{uris}}, $2;
+        push @{$coll->{uris}}, $1;
       } elsif($text =~ /QR-Code\:([^\s]*)/) {
         # zbarimg(1) prefixes the url with "QR-Code:" string
         my $qrurl = $1;

Note that the first group in the regex starts with "?:", so it is non-capturing:

    if ($text =~ /<a(?:\s+[^>]+)?\s+href="([^">]*)"/) {

So the only capture is $1, which holds the href value. $2 is undef.

A side note: the documentation says "This module (ExtractText.pm) uses external tools to extract text from message parts, and then sets the text as the rendered part. **External tool must output plain text**, not HTML or other non-textual result." Yet this code parses an HTML tag for an href attribute...

Cheers,
jps
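To illustrate the capture numbering described above, here is a small standalone sketch. It reuses the regex quoted in the report; the sample input string and the variable names are made up for the demonstration and are not taken from ExtractText.pm.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the regex comes from the report.
my $text = '<a class="btn" href="https://example.com/pay">Pay</a>';

my ($first, $second);
if ($text =~ /<a(?:\s+[^>]+)?\s+href="([^">]*)"/) {
    # (?:...) groups without capturing, so the href value is capture 1
    # and there is no capture 2.
    ($first, $second) = ($1, $2);
}

print "1: ", (defined $first  ? $first  : 'undef'), "\n";
print "2: ", (defined $second ? $second : 'undef'), "\n";
```

With this input, the first line should show the URL and the second should show undef, which is why pushing $2 onto the uris list stores nothing useful.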
Yes, that looks like a bug. Thanks for reporting it. I've committed your patch as r1919159. It should be part of the next release.

Regarding the documentation, I think that means that the extracted text will be directly available to body rules without any further processing. So it doesn't parse HTML and remove tags the way it would on a text/html part, for example.

Honestly, I don't know why it's only parsing & extracting the first <a> tag and not all of them. It simply adds the URI to metadata that would be available to Bayes and other plugins. I don't know what the original author's intentions were.
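For comparison, a minimal sketch of how every href could be collected instead of only the first, assuming the same regex is applied with the /g modifier in list context. This is not the plugin's code or a proposed patch; the input string and the @uris name are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical tool output containing two links.
my $text = '<a href="https://example.com/one">1</a> '
         . '<a id="x" href="https://example.org/two">2</a>';

# In list context, a /g match returns capture 1 from every match.
my @uris = $text =~ /<a(?:\s+[^>]+)?\s+href="([^">]*)"/g;

print "$_\n" for @uris;
```

Whether collecting all links is desirable here depends on the original author's intent, which is unknown.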