Bug 7579 - PDFInfo: pdfinfo:pdf_has_uri
Summary: PDFInfo: pdfinfo:pdf_has_uri
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
Reported: 2018-05-02 13:09 UTC by Giovanni Bechis
Modified: 2019-08-03 14:21 UTC



Attachment Type Modified Status Actions Submitter/CLA Status
pdfinfo::pdf_has_uri patch None Giovanni Bechis [HasCLA]
Extract URIs from pdf file and check them in URIBLs patch None Giovanni Bechis [HasCLA]

Description Giovanni Bechis 2018-05-02 13:09:12 UTC
Created attachment 5567 [details]
pdfinfo::pdf_has_uri

New function to check whether a PDF has a "clickable" URI. It does not detect all URIs, because some software stores links in binary data.
Is it worth adding it to PDFInfo.pm, or is it better to create a new plugin that depends on a PDF parser such as PDF::Parse or similar?
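
For illustration only, here is a minimal Python sketch of the kind of check described above. It is not the attached patch and not SpamAssassin code. It assumes clickable links appear as uncompressed `/URI (...)` entries in the file; the names `find_clickable_uris` and `pdf_has_uri` (as a Python function) are hypothetical. Links inside compressed streams are missed, which matches the limitation mentioned in the description.

```python
import re

# Matches the /URI entry of a PDF URI action, e.g. "/URI (http://example.com/)".
# Only literal strings in uncompressed objects are seen; escaped or nested
# parentheses and compressed streams are not handled (assumption of this sketch).
URI_ACTION_RE = re.compile(rb"/URI\s*\(([^()\\]*)\)")


def find_clickable_uris(pdf_bytes):
    """Return URIs found in uncompressed /URI action entries, in file order."""
    found = []
    for match in URI_ACTION_RE.finditer(pdf_bytes):
        uri = match.group(1).decode("latin-1").strip()
        if uri and uri not in found:
            found.append(uri)
    return found


def pdf_has_uri(pdf_bytes):
    """True if at least one clickable URI was found."""
    return bool(find_clickable_uris(pdf_bytes))
```

Returning the list of URIs rather than only a boolean keeps the door open for feeding them to other checks later.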
Comment 1 John Hardin 2018-05-02 21:12:56 UTC
That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum in isolation. I'd suggest it would be _much_ more useful to extract the URIs and add them to the pool that feeds uri rules and URIBL checks.

Even better if heuristics similar to what's used for body text would pull non-clickable URIs out of the PDF text, but doing that might best be controlled by a config option.
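
A rough Python sketch of this suggestion, under stated assumptions: `collect_pdf_uris` and the `parse_plain_text` switch are hypothetical names, the regular expression is far simpler than the real body-text heuristics, and the returned set merely stands in for the pool that feeds uri rules and URIBL checks.

```python
import re

# Deliberately simple pattern for illustration; real body-text heuristics
# are considerably more involved.
PLAIN_URI_RE = re.compile(r"\b(?:https?://|www\.)[^\s<>()\"']+", re.IGNORECASE)


def collect_pdf_uris(clickable_uris, pdf_text, parse_plain_text=False):
    """Return the set of URIs a PDF would contribute to the shared pool.

    clickable_uris   -- URIs already taken from link annotations
    pdf_text         -- text extracted from the PDF by some other means
    parse_plain_text -- hypothetical config switch for the heuristic pass
    """
    pool = set(clickable_uris)
    if parse_plain_text:
        for match in PLAIN_URI_RE.finditer(pdf_text):
            # Drop trailing punctuation that usually belongs to the sentence.
            pool.add(match.group(0).rstrip(".,;:!?"))
    return pool
```

With the switch off, only the clickable URIs reach the pool, so the heuristic pass stays opt-in.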
Comment 2 Kevin A. McGrail 2018-05-03 14:50:07 UTC
+1 to John's comment.
Comment 3 Giovanni Bechis 2018-05-04 06:39:59 UTC
(In reply to John Hardin from comment #1)
> That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum
> in isolation. I'd suggest it would be _much_ more useful to extract the URIs
> and add them to the pool that feeds uri rules and URIBL checks.
> 
Any hints on how to add URIs to the pool?
I had a look at DecodeShortURLs.pm, but it's ugly and I am not sure it works correctly.

> Even better if heuristics similar to what's used for body text would pull
> non-clickable URIs out of the PDF text, but doing that might best be
> controlled by a config option.
IMHO this should be a second step
Comment 4 Kevin A. McGrail 2018-05-04 13:19:06 UTC
Sorry, no hints here.  Probably best to ask on list.
Comment 5 Giovanni Bechis 2018-05-11 19:52:03 UTC
(In reply to John Hardin from comment #1)
> That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum
> in isolation. I'd suggest it would be _much_ more useful to extract the URIs
> and add them to the pool that feeds uri rules and URIBL checks.
> 
> Even better if heuristics similar to what's used for body text would pull
> non-clickable URIs out of the PDF text, but doing that might best be
> controlled by a config option.

Looking at my spam collection, a pdf named Invoice.pdf with a clickable uri is very probably spam.
Anyway I am looking at extracting URIs from attachments and adding them to the pool of uris to be checked.
Comment 6 Benny Pedersen 2018-05-11 22:39:51 UTC
Mail::SpamAssassin::Plugin::ExtractText 
Uses plugin extractors and/or external tools to extract text from message parts. Extractor plugins can extract parts that will be fed back into the plugin for checking, so, for example, an image OCR extractor could get to check images extracted from a PDF by another extractor. How to extract what from what is very configurable. Included are configs for MS Word, RTF, OpenDocument and PDF files, and a very simplistic OpenXML plugin.
Created by: Jonas Eckerman 

Found here: https://wiki.apache.org/spamassassin/UnmaintainedCustomPlugins

A good start, but it needs more maintenance.
Comment 7 Giovanni Bechis 2018-06-04 06:57:47 UTC
Created attachment 5571 [details]
Extract URIs from pdf file and check them in URIBLs

Extract URIs from pdf files (at least some of them) and add them to the pool of URIs to be checked (URIBL, etc...).
A pms method was added because it could be useful to other plugins (DecodeShortURLs, for example).
Comment 8 Henrik Krohns 2019-06-24 11:24:10 UTC
I don't think add_uri_detail_list should blindly overwrite an existing $pms->{uri_detail_list}->{$uri}. If the same URI was already found in HTML, it will lose its types etc.
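
To make the concern concrete, here is a small Python sketch of merging instead of overwriting. It is an illustration, not the SpamAssassin implementation: the dictionary is a simplified stand-in for the real uri_detail_list structure, `add_uri_detail` is a hypothetical helper, and the type names used with it are only examples.

```python
def add_uri_detail(uri_detail_list, uri, types):
    """Add a URI to the detail list, merging into any existing entry.

    uri_detail_list -- dict mapping URI -> {"types": set_of_type_names}
                       (a simplified stand-in for the real structure)
    uri             -- the URI string to record
    types           -- iterable of type names to attach to that URI
    """
    # setdefault keeps an existing entry untouched and only creates a new
    # one when the URI has not been seen before.
    entry = uri_detail_list.setdefault(uri, {"types": set()})
    entry["types"].update(types)
    return entry
```

A URI first seen in HTML keeps its existing types and simply gains the new one when it is also found in a PDF.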
Comment 9 Henrik Krohns 2019-06-24 11:25:20 UTC
And a function that accepts just "uri" as its argument should probably be named add_uri_list, since it doesn't accept any details.
Comment 10 Henrik Krohns 2019-08-02 07:10:37 UTC
FYI I'm rewriting the horrible mess of uri parsing for 4.0.0, so hold on..
Comment 11 Henrik Krohns 2019-08-03 14:21:40 UTC
We now have $pms->add_uri_detail_list() available in trunk.

Committed revision 1864336.