Bug 8268 - trim whitespace from anchor text in uri_detail_list
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: PC Linux
Importance: P2 minor
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-07-11 21:03 UTC by Kent Oyer
Modified: 2024-07-11 21:09 UTC

Attachments:
patch (type: patch; status: None), submitted by Kent Oyer [HasCLA]

Description Kent Oyer 2024-07-11 21:03:05 UTC
It would be convenient if leading and trailing whitespace were removed from anchor_text in uri_detail_list. For example, HTML such as:

<a href="#">
   Download File
</a>

will end up with anchor_text containing "\n   Download File\n". This leads to unexpected results if you have a rule such as:

uri_detail RULENAME text =~ /^download file$/i

The workaround is either to drop the regex anchors or to explicitly allow whitespace in the regex:

uri_detail RULENAME text =~ /^\s*download file/i
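
The mismatch can be reproduced outside SpamAssassin. The following is a minimal sketch in Python rather than Perl, assuming anchor_text holds exactly the string quoted above. For this case Python's `^` and `\s` behave the same way as Perl's without the /m modifier, and the variable names are illustrative only:

```python
import re

# Anchor text as the report says it is stored for the multi-line <a> element.
anchor_text = "\n   Download File\n"

# Equivalent of the anchored rule: fails because the string starts with "\n".
strict = re.compile(r"^download file$", re.IGNORECASE)

# Equivalent of the workaround: leading whitespace is allowed explicitly.
lenient = re.compile(r"^\s*download file", re.IGNORECASE)

strict_hit = strict.search(anchor_text) is not None
lenient_hit = lenient.search(anchor_text) is not None

print(strict_hit)   # False
print(lenient_hit)  # True
```

The anchored pattern misses only because of the leading newline and spaces. A `$` anchor would still match just before the final newline, so the leading whitespace is what breaks the rule here.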

However, this is unintuitive and has tripped me up several times. I don't think removing the whitespace does any harm, since under the rules of HTML whitespace the HTML above should render identically to this HTML:

<a href="#">Download File</a>
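
As a rough illustration of the proposed behavior (not the attached patch, which is not reproduced here), trimming could look like the sketch below. It is written in Python; `trim_anchor_text` is a hypothetical helper, and the whitespace set is assumed to be the ASCII whitespace characters that HTML treats as collapsible:

```python
import re

# ASCII whitespace as treated by HTML: space, tab, LF, FF, CR.
# A non-breaking space (U+00A0) is deliberately not in this set.
HTML_WHITESPACE = " \t\n\f\r"


def trim_anchor_text(text: str) -> str:
    """Hypothetical helper: strip leading and trailing HTML whitespace."""
    return text.strip(HTML_WHITESPACE)


multi_line = "\n   Download File\n"  # text of the multi-line <a> element
single_line = "Download File"        # text of the one-line <a> element

trimmed = trim_anchor_text(multi_line)
anchored = re.compile(r"^download file$", re.IGNORECASE)

print(trimmed == single_line)                # True
print(anchored.search(trimmed) is not None)  # True
```

After trimming, both forms of the link yield the same text, so the anchored rule matches either one.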

Please see the attached patch and provide feedback.
Comment 1 Kent Oyer 2024-07-11 21:06:12 UTC
Created attachment 5959
patch
Comment 2 Kent Oyer 2024-07-11 21:09:52 UTC
I think it might be good to collapse whitespace between words as well, again following the rules of HTML whitespace, in case someone does this:

<a href="#">
   Download 
   File
</a>
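
A sketch of what collapsing plus trimming might look like, again in Python and not taken from the attached patch. `normalize_anchor_text` is a hypothetical name, and the character class assumes only ASCII whitespace should be collapsed:

```python
import re

# One or more ASCII whitespace characters (space, tab, LF, FF, CR).
_HTML_WS_RUN = re.compile(r"[ \t\n\f\r]+")


def normalize_anchor_text(text: str) -> str:
    """Hypothetical helper: collapse whitespace runs, then trim the ends."""
    # Every run of whitespace becomes a single space, then the single
    # spaces left at either end are removed.
    return _HTML_WS_RUN.sub(" ", text).strip(" ")


# Text of the <a> element above, including the line breaks and indentation.
anchor_text = "\n   Download \n   File\n"

print(repr(normalize_anchor_text(anchor_text)))  # 'Download File'
```

With this normalization a rule anchored as /^download file$/i would match regardless of how the link text is wrapped in the HTML source.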