Bug 8206 - uri_list_canonicalize adds more domains than it should
Summary: uri_list_canonicalize adds more domains than it should
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-03 11:00 UTC by Giovanni Bechis
Modified: 2024-01-03 16:21 UTC
CC List: 2 users

Attachments:
Sample email (message/rfc822), status: None, submitter: Giovanni Bechis [HasCLA]
fix for the issue (patch), status: None, submitter: Giovanni Bechis [HasCLA]

Description Giovanni Bechis 2024-01-03 11:00:48 UTC
Created attachment 5930
Sample email

In the attached sample, the tag <img src="undefined/favicon.ico"> is wrongly translated into an http://undefined.com URI.
Comment 1 Giovanni Bechis 2024-01-03 11:02:23 UTC
Created attachment 5931
fix for the issue
Comment 2 Kris Deugau 2024-01-03 16:21:52 UTC
I don't have any specific examples right at hand, but I posted to the users list about essentially this issue in April of last year, with another specific case. See https://lists.apache.org/thread/gf3kyq2y3j1v1lj37g5tpngmk82wgmcz. I don't recall if any patches were committed as a result of that thread.

Looking at your patch, I think this is too narrow (even if only because it omits .png, .webp, and who knows what other image types some sender might use) and far too late in the process to fix the root cause. There is a long list of other HTML elements that get filed in the "URI" bin and can trigger this problem. I think that to properly solve it, potential URIs from HTML elements need to be more tightly preprocessed (and discarded) ahead of the rest of the canonicalization process.
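
To make the "preprocess and discard" idea concrete, here is a minimal sketch in Python. It is not SpamAssassin code (which is Perl) and not the attached patch. The function name is_candidate_uri is hypothetical, and the rule it applies (keep only values that name a host explicitly) is an assumption about what "tighter preprocessing" might mean:

```python
from urllib.parse import urlsplit

def is_candidate_uri(raw: str) -> bool:
    """Hypothetical pre-filter: keep an HTML attribute value only if it names a host."""
    value = raw.strip()
    # Protocol-relative references such as //cdn.example.com/x.png do name a host.
    if value.startswith("//"):
        value = "http:" + value
    try:
        parts = urlsplit(value)
    except ValueError:
        # Malformed input (for example a broken IPv6 literal) is discarded.
        return False
    # Schemeless values such as "undefined/favicon.ico" are relative references;
    # they carry no host, so nothing should be guessed from them.
    return parts.scheme in ("http", "https") and bool(parts.hostname)

for sample in ("undefined/favicon.ico", "none", "assets/css/styles.css",
               "http://example.com/favicon.ico"):
    print(sample, is_candidate_uri(sample))
```

The trade-off is that a sloppy but real link such as href="www.example.com" would also be discarded, so a real fix would probably need to treat attributes that a mail client renders as clickable differently from resource attributes like img src. Non-web schemes such as mailto: are outside the scope of this sketch.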

I have docucomments in my local configuration with this:

dbg: uri: canonicalizing html uri: none
dbg: uri: cleaned uri: http://www.none.com
dbg: uri: added host: www.none.com domain: none.com
dbg: uri: cleaned uri: none
dbg: uri: cleaned uri: http://none

(likely from that particular case I posted about)

and:

dbg: uri: canonicalizing html uri: assets/css/styles.css
dbg: uri: cleaned uri: http://www.assets.com/css/styles.css

along with matching uridnsbl_skip_domain entries for

none.com
assets.com
(I also have "none" listed, but that doesn't seem to work to suppress the entry)

and

background.com
www.com

I don't have debug detail recorded for the latter two, but both originated in essentially the same source: HTML/CSS elements (not content/text!) that specify a relative URI in some context or form. None of these were in text that a mail program *would* often turn into a clickable link.
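
For illustration, the pattern visible in the debug output above can be reproduced with a toy model in Python. This only mimics the logged input/output pairs. It is not the actual uri_list_canonicalize logic, and the function name guess_uri is hypothetical:

```python
def guess_uri(raw: str) -> str:
    """Toy model: treat the first path segment of a schemeless value as a hostname."""
    if "://" in raw:
        return raw  # already absolute, nothing to guess
    first, sep, rest = raw.partition("/")
    # Assumption drawn from the logs: a dotless first segment gets "www." and ".com" added.
    host = first if "." in first else "www." + first + ".com"
    return "http://" + host + sep + rest

print(guess_uri("none"))                   # http://www.none.com
print(guess_uri("assets/css/styles.css"))  # http://www.assets.com/css/styles.css
```

Under this model any relative reference whose first segment is an ordinary word ends up as a lookup for word.com, which would be consistent with the none.com and assets.com entries listed above.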