Bug 7866 - TextCat: Improper language classification on URIs in plain/text
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins
Version: 3.4.4
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2020-10-26 01:45 UTC by JDunphy
Modified: 2020-11-02 18:15 UTC
Description JDunphy 2020-10-26 01:45:41 UTC
TextCat can misclassify the language of a message when the text/plain part contains URIs. Here is a sample that was tagged as UNWANTED_BODY_TEXT (classified as sk and cs in this example):

-------

Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

A Weekly Review from AWS

Featured Announcements

Amazon Aurora enables dynamic resizing for database storage space          =
                                                                           =
                      =20
<https://email.awscloud.com/dc/KwqiTCOQ16Q1JCi3MdelD5Wf1a0xWVUzLJ8TEakWNHdl=
A7N3nIa2aQWviXRrQW0g0Nzk3qf9Jbd_7Br-VcC96_vVOrK4bJqlew1KGbdQmMLIlhsNLtVFFTo=
o0oG_f9iDbFtXhfHuZSrhIpoERCR4a4jOBbGd629KotGGay-7-sKFDTCWVGisnhbxOeaG-rBvct=
WHIpIaAIuHyhj21BdtQbvqu9vEkOLb4i9f5WJzjdvSttMrYY5mQQiiAxDzWx90K16R7A5hk3kuc=
4mmg5ogqliI9wKd7lBG1qX0Uis2H9tTvIbdKJhEU2XcxTXIVK0l2bb1qlYvipE7NL9dS516_m6n=
76Y0b_DoQp07kQfyQE3Cm-s5tpwt4oOzzjMzZvKLprHcLw3Lb7I6Pp_5WIyD-ze1ZmT5cFKEF7D=
C_c-BH24c5m2mByYrBLRlvBsCPRNQhAPdJZi-geOf6Jf9J_oeDFtxQwo1yU94FVAb3AB9MBU=3D=
/hOoW4pZ0lT000kthk000MCE>

--------
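
To illustrate why a sample like this is a problem, here is a minimal sketch, assuming the body is decoded from quoted-printable before classification. After decoding, the soft line breaks (a trailing "=") disappear and the wrapped URI becomes a single long token with no whitespace. TextCat classifies by n-gram statistics, so a token like that plausibly outweighs the few English words around it. The helper name longest_token and the shortened example.com URI are made up for illustration; decode_qp comes from the core MIME::QuotedPrint module.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: decode quoted-printable text and return the
# longest whitespace-delimited token in the result.
sub longest_token {
    my ($qp) = @_;
    my $decoded = decode_qp($qp);    # removes "=\n" soft breaks, maps =3D to "="
    my ($longest) = sort { length($b) <=> length($a) } split ' ', $decoded;
    return $longest;
}

# Made-up fragment shaped like the sample above (not the real URI).
my $qp = "Featured Announcements\n"
       . "<https://example.com/dc/KwqiTCOQ16Q1=\n"
       . "A7N3nIa2aQWv=\n"
       . "MBU=3D/hOoW4>\n";

print longest_token($qp), "\n";
```

In the real message the decoded URI is several hundred characters long, far more text than the surrounding prose.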

A fix would be to strip URIs from the body text before classification. 

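  # Join the rendered body lines into one string for classification.
  my $body = $msg->get_rendered_body_text_array();
  $body = join("\n", @{$body});
  # Drop a leading "Subject:" label if the text starts with one.
  $body =~ s/^Subject://i;

  # Proposed fix: remove http/https URIs so they are not classified.
  $body =~ s{https?://\S+}{}g;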
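
The substitution can be tried outside SpamAssassin. The following is a minimal standalone sketch, not the plugin's code: strip_uris is a hypothetical helper name and the example.com URI is made up. It applies the same pattern as the proposed fix to a plain string.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): remove http/https URIs
# from a string, using the same pattern as the proposed fix.
sub strip_uris {
    my ($text) = @_;
    $text =~ s{https?://\S+}{}g;    # \S+ runs up to the next whitespace
    return $text;
}

my $sample = "Featured Announcements\n"
           . "<https://example.com/dc/AbCdEf123_xyz-789>\n";

print strip_uris($sample);
```

Because \S+ matches up to the next whitespace, the closing ">" of an angle-bracketed URI is removed with it, while the opening "<" stays behind. That leftover is a single character and should have little effect on language classification.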
Comment 1 Giovanni Bechis 2020-11-02 18:15:52 UTC
A similar fix was already present in trunk; it has now been backported to the 3.4 tree in r1883069.