Bug 8218 - HTML URIs with linebreaks not parsed with Content-Transfer-Encoding: quoted-printable
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-27 11:34 UTC by brt
Modified: 2024-03-15 07:30 UTC



Attachments:
Test email (message/rfc822), status: None, submitter: Sidney Markowitz [HasCLA]

Description brt 2024-02-27 11:34:24 UTC
It appears that PerMsgStatus.pm does not parse URIs that are split across a line break when the message uses Content-Transfer-Encoding: quoted-printable.

For example, this blob:
<A style=3D"FONT-SIZE: 16px; TEXT-DECORATION: none; FONT-FAMILY: Helvetica,=
sans-serif; BACKGROUND: rgb(16,163,127) 0% 50%; FONT-WEIGHT: 400; COLOR: wh=
ite; PADDING-BOTTOM: 11px; PADDING-TOP: 12px; PADDING-LEFT: 20px; MARGIN: 0=
px; LINE-HEIGHT: 24px; PADDING-RIGHT: 20px" href="http://hashbltest.s=
urbl.org/example_uri">Verify email address</A>


produces the following entry from get_uri_detail_list:
 $VAR1 = {
           'types' => {
                        'schemeless' => 1,
                        'parsed' => 1,
                        'unlinked' => 1
                      },
           'cleaned' => [
                          'http://urbl.org/example_uri'
                        ],
           'hosts' => {
                        'urbl.org' => 'urbl.org'
                      },
           'domains' => {
                          'urbl.org' => 1
                        }
         };


We would expect the entire URI (http://hashbltest.surbl.org/example_uri) instead.
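
To illustrate the expected behaviour: in quoted-printable, a line ending in "=" is a soft line break, and decoding removes both the "=" and the newline, so the two halves of the URI should be rejoined before any URI extraction takes place. The following is a minimal Python sketch of that decoding step using the standard-library quopri module. It is not SpamAssassin code, and the names WRAPPED and decode_qp are made up for the example.

```python
import quopri

# The URI from the snippet above, split by a quoted-printable soft line
# break ("=" immediately followed by a newline).
WRAPPED = b"http://hashbltest.s=\nurbl.org/example_uri"


def decode_qp(raw: bytes) -> bytes:
    """Decode quoted-printable bytes; soft line breaks are dropped."""
    return quopri.decodestring(raw)


if __name__ == "__main__":
    print(decode_qp(WRAPPED).decode("ascii"))
    # prints: http://hashbltest.surbl.org/example_uri
```

If the part after the break were read on its own, it would begin with "urbl.org/example_uri", which resembles the host in the entry reported above. The report does not establish that this is what happens.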
Comment 1 Sidney Markowitz 2024-03-15 07:30:18 UTC
Created attachment 5942
Test email

Can you attach a full test email that demonstrates the problem? I made your snippet into an email with quoted-printable and ran

./spamassassin -t -D uri,message < uriwraptest.eml

Did I miss something? I am testing using current trunk.

Here is the debug output showing the correct URL being parsed:

Mar 15 20:23:08.229 [99779] dbg: message: _decode_header date: Thu, 2 May 2002 00:02:49 +1200
Mar 15 20:23:08.229 [99779] dbg: message: _decode_header subject: foo
Mar 15 20:23:08.229 [99779] dbg: message: _decode_header to: <bar@example.org>
Mar 15 20:23:08.229 [99779] dbg: message: _decode_header from: <baz@example.com>
Mar 15 20:23:08.229 [99779] dbg: message: _decode_header message-id: <INTM-6516584-3669405-2002.08.01-16.21.51--f@example.com>
Mar 15 20:23:08.229 [99779] dbg: message: _decode_header mime-version: 1.0
Mar 15 20:23:08.229 [99779] dbg: message: _decode_header content-type: text/html; charset=US-ASCII
Mar 15 20:23:08.229 [99779] dbg: message: _decode_header content-transfer-encoding: quoted-printable
Mar 15 20:23:08.230 [99779] dbg: message: main message type: text/html
Mar 15 20:23:08.235 [99779] dbg: message: ---- MIME PARSER START ----
Mar 15 20:23:08.235 [99779] dbg: message: parsing normal part
Mar 15 20:23:08.235 [99779] dbg: message: storing a body to memory
Mar 15 20:23:08.235 [99779] dbg: message: ---- MIME PARSER END ----
Mar 15 20:23:08.235 [99779] dbg: message: decoding quoted-printable
Mar 15 20:23:08.235 [99779] dbg: message: contains only US-ASCII characters, declared US-ASCII, not decoding
Mar 15 20:23:08.235 [99779] dbg: message: HTML::Parser utf8_mode off (default, assumed Unicode characters)
Mar 15 20:23:08.236 [99779] dbg: message: spaces (octets) in HTML: 3 out of 21, chars!?
Mar 15 20:23:08.242 [99779] dbg: uri: canonicalizing html uri: http://hashbltest.surbl.org/example_uri
Mar 15 20:23:08.242 [99779] dbg: uri: cleaned uri: http://hashbltest.surbl.org/example_uri
Mar 15 20:23:08.242 [99779] dbg: uri: added host: hashbltest.surbl.org domain: surbl.org
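
For anyone who wants to build a comparable test message, here is a rough Python sketch. The headers are copied from the debug output above. The body is a shortened, hypothetical stand-in for the attached test email (whose exact content is not shown here) and assumes the attribute is encoded as href=3D. The helper names build_test_message and decoded_body are invented for the example; the sketch only checks the message with Python's own email parser, not with SpamAssassin.

```python
from email import message_from_bytes, policy

# Headers as they appear in the debug output.
HEADERS = [
    "Date: Thu, 2 May 2002 00:02:49 +1200",
    "Subject: foo",
    "To: <bar@example.org>",
    "From: <baz@example.com>",
    "Message-ID: <INTM-6516584-3669405-2002.08.01-16.21.51--f@example.com>",
    "MIME-Version: 1.0",
    "Content-Type: text/html; charset=US-ASCII",
    "Content-Transfer-Encoding: quoted-printable",
]

# Hypothetical shortened body: the soft line break ("=" at end of line)
# falls inside the host name, as in the reported snippet.
BODY_LINES = [
    '<A style=3D"TEXT-DECORATION: none" href=3D"http://hashbltest.s=',
    'urbl.org/example_uri">Verify email address</A>',
]


def build_test_message() -> bytes:
    """Assemble the raw message: headers, a blank line, then the body."""
    text = "\n".join(HEADERS) + "\n\n" + "\n".join(BODY_LINES) + "\n"
    return text.encode("ascii")


def decoded_body(raw: bytes) -> str:
    """Parse the message and return the body after transfer decoding."""
    msg = message_from_bytes(raw, policy=policy.default)
    return msg.get_content()


if __name__ == "__main__":
    raw = build_test_message()
    # Sanity check: the decoded HTML should contain the unbroken URI.
    print("http://hashbltest.surbl.org/example_uri" in decoded_body(raw))
```

Writing the bytes returned by build_test_message() to a file such as uriwraptest.eml would give an input for the spamassassin command shown above.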