Bug 7579 - PDFInfo: pdfinfo:pdf_has_uri
Summary: PDFInfo: pdfinfo:pdf_has_uri
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-02 13:09 UTC by Giovanni Bechis
Modified: 2021-04-13 20:52 UTC

Attachments:
Attachment 5567: pdfinfo::pdf_has_uri (patch; status: None; submitter: Giovanni Bechis [HasCLA])
Attachment 5571: Extract URIs from pdf file and check them in URIBLs (patch; status: None; submitter: Giovanni Bechis [HasCLA])

Description Giovanni Bechis 2018-05-02 13:09:12 UTC
Created attachment 5567
pdfinfo::pdf_has_uri

New function to check whether a PDF has a "clickable" URI. It does not detect all URIs, because some software stores links in binary data.
Is it worth adding it to PDFInfo.pm, or is it better to create a new plugin that depends on a PDF parser such as PDF::Parse?
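
For illustration only, a check of this kind could be sketched as follows. This is not the attached patch; it is a Python sketch that assumes link actions are stored as uncompressed literal strings of the form /URI (...). The names URI_ACTION and find_pdf_uris are hypothetical. Links stored inside compressed streams are missed, which is the limitation noted above.

```python
import re

# Assumption: a clickable link appears as an uncompressed action such as
#   /S /URI /URI (http://example.com/)
# Backslash-escaped characters inside the literal string are allowed.
URI_ACTION = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def find_pdf_uris(data: bytes) -> list[str]:
    """Return URIs from uncompressed /URI actions, in order, without duplicates."""
    found = []
    for match in URI_ACTION.finditer(data):
        raw = match.group(1)
        # Undo the PDF escapes for parentheses and backslash.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        uri = raw.decode("latin-1")
        if uri not in found:
            found.append(uri)
    return found


def pdf_has_uri(data: bytes) -> bool:
    """True if at least one uncompressed /URI action is present."""
    return bool(find_pdf_uris(data))
```

A caller would pass the raw bytes of the attachment; a False result only means that no uncompressed link action was seen, not that the PDF contains no links.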
Comment 1 John Hardin 2018-05-02 21:12:56 UTC
That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum in isolation. I'd suggest it would be _much_ more useful to extract the URIs and add them to the pool that feeds uri rules and URIBL checks.

Even better if heuristics similar to what's used for body text would pull non-clickable URIs out of the PDF text, but doing that might best be controlled by a config option.
Comment 2 Kevin A. McGrail 2018-05-03 14:50:07 UTC
+1 to John's comment.
Comment 3 Giovanni Bechis 2018-05-04 06:39:59 UTC
(In reply to John Hardin from comment #1)
> That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum
> in isolation. I'd suggest it would be _much_ more useful to extract the URIs
> and add them to the pool that feeds uri rules and URIBL checks.
> 
Any hints on how to add URIs to the pool?
I had a look at DecodeShortURLs.pm, but it's ugly and I am not sure it works correctly.

> Even better if heuristics similar to what's used for body text would pull
> non-clickable URIs out of the PDF text, but doing that might best be
> controlled by a config option.
IMHO this should be a second step.
Comment 4 Kevin A. McGrail 2018-05-04 13:19:06 UTC
Sorry, no hints here.  Probably best to ask on list.
Comment 5 Giovanni Bechis 2018-05-11 19:52:03 UTC
(In reply to John Hardin from comment #1)
> That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum
> in isolation. I'd suggest it would be _much_ more useful to extract the URIs
> and add them to the pool that feeds uri rules and URIBL checks.
> 
> Even better if heuristics similar to what's used for body text would pull
> non-clickable URIs out of the PDF text, but doing that might best be
> controlled by a config option.

Looking at my spam collection, a PDF named Invoice.pdf with a clickable URI is very probably spam.
Anyway, I am looking into extracting URIs from attachments and adding them to the pool of URIs to be checked.
Comment 6 Benny Pedersen 2018-05-11 22:39:51 UTC
Mail::SpamAssassin::Plugin::ExtractText
Uses plugin extractors and/or external tools to extract text from message parts. Extractor plugins can extract parts that will be fed into the plugin for checking, so for example an image OCR extractor could get to check images extracted from a PDF by another extractor. How to extract what from what is very configurable. Included are configs for MS Word, RTF, OpenDocument and PDF files, and a very simplistic OpenXML plugin.
Created by: Jonas Eckerman

Found here: https://wiki.apache.org/spamassassin/UnmaintainedCustomPlugins

Good start, but it needs more maintenance.
Comment 7 Giovanni Bechis 2018-06-04 06:57:47 UTC
Created attachment 5571
Extract URIs from pdf file and check them in URIBLs

Extract URIs from pdf files (at least some of them) and add them to the pool of URIs to be checked (URIBL, etc...).
A $pms method was added because it could be useful to other plugins (DecodeShortURLs, for example).
Comment 8 Henrik Krohns 2019-06-24 11:24:10 UTC
I don't think add_uri_detail_list should blindly overwrite an existing $pms->{uri_detail_list}->{$uri}. If the same URI was already found in HTML, it will lose its types etc.
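
A minimal sketch of the non-overwriting behaviour suggested here, written in Python rather than the Perl of the actual code. The dictionary layout (a "types" set per URI), the function name add_uri_detail and the "pdf" type label are assumptions for illustration, not the real $pms->{uri_detail_list} structure.

```python
def add_uri_detail(uri_detail_list: dict, uri: str, types) -> None:
    """Merge `types` into the entry for `uri` instead of replacing the entry.

    Hypothetical layout: {uri: {"types": set_of_type_labels}}.
    """
    # Reuse the existing entry if the URI is already known (e.g. from HTML).
    entry = uri_detail_list.setdefault(uri, {"types": set()})
    entry["types"] |= set(types)


# Usage: a URI already seen as an HTML anchor keeps its "a" type.
details = {"http://example.com/": {"types": {"a"}}}
add_uri_detail(details, "http://example.com/", {"pdf"})
add_uri_detail(details, "http://example.org/", {"pdf"})
```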
Comment 9 Henrik Krohns 2019-06-24 11:25:20 UTC
And a function accepting just "uri" as its argument should probably be named add_uri_list, since it doesn't accept any details.
Comment 10 Henrik Krohns 2019-08-02 07:10:37 UTC
FYI I'm rewriting the horrible mess of uri parsing for 4.0.0, so hold on..
Comment 11 Henrik Krohns 2019-08-03 14:21:40 UTC
We now have $pms->add_uri_detail_list() available in trunk.

Committed revision 1864336.
Comment 12 Henrik Krohns 2021-04-12 14:16:26 UTC
(In reply to Giovanni Bechis from comment #7)
> 
> Extract URIs from pdf files (at least some of them) and add them to the pool
> of URIs to be checked (URIBL, etc...).

We have ExtractText.pm too, so which is the better tool for the job? How will we manage things in future when we have 10 plugins all adding some metadata? Do we actually want "uri" or URIBL to match _anything_ and how do we manage on per-rule basis which sources should be used?
Comment 13 Henrik Krohns 2021-04-12 14:29:24 UTC
Let's say some large PDF has a hundred unique "uris" for one reason or another. How would we manage this? Should we prefer to URIBL query them instead of body uris? Or shuffle and take n-amount of uris from here and there? How will different __URI* rules react, which depend on count / number of hits?
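
One possible answer to the "take n-amount of uris from here and there" question, sketched in Python for illustration only. The function select_uris, the half-and-half quota and the limit parameter are all hypothetical; nothing like this is confirmed to exist in the patch or in trunk.

```python
import random


def select_uris(body_uris, pdf_uris, limit, seed=None):
    """Pick at most `limit` unique URIs, split between body and PDF sources.

    Each source gets up to half of the budget; slots a source cannot fill
    are handed to the other source.
    """
    rng = random.Random(seed)
    body = list(dict.fromkeys(body_uris))  # unique, original order
    body_set = set(body)
    pdf = [u for u in dict.fromkeys(pdf_uris) if u not in body_set]
    rng.shuffle(body)
    rng.shuffle(pdf)
    pdf_quota = min(len(pdf), limit // 2)
    body_quota = min(len(body), limit - pdf_quota)
    # Give slots the body could not use back to the PDF source.
    pdf_quota = min(len(pdf), limit - body_quota)
    return body[:body_quota] + pdf[:pdf_quota]
```

With this policy a PDF carrying a hundred URIs cannot crowd out the body URIs, but rules that count hits would still see a different population than today, so the concern about count-based rules remains open.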

I'm quite sceptical that even ExtractText makes any sense. It has the same problems, along with possibly filling Bayes with semi-random stuff from badly OCR'd images or wonky rendered PDF's etc.

I think I would just vote to have a pdf_has_uri() which can match uris from PDFs and that's it. No complex metadata hassles.
Comment 14 Giovanni Bechis 2021-04-13 20:49:53 UTC
(In reply to Henrik Krohns from comment #12)
> (In reply to Giovanni Bechis from comment #7)
> > 
> > Extract URIs from pdf files (at least some of them) and add them to the pool
> > of URIs to be checked (URIBL, etc...).
> 
> We have ExtractText.pm too, so which is the better tool for the job? How will we
> manage things in future when we have 10 plugins all adding some metadata? Do
> we actually want "uri" or URIBL to match _anything_ and how do we manage on
> per-rule basis which sources should be used?

IMHO ExtractText.pm is more OCR-oriented and covers more than just PDF files, while PDFInfo.pm is more about attached PDF file names and other info strictly related to PDFs. Maybe they could be merged, but I do not think it's worth the effort.
Comment 15 Giovanni Bechis 2021-04-13 20:52:47 UTC
(In reply to Henrik Krohns from comment #13)
> Let's say some large PDF has a hundred unique "uris" for one reason or
> another. How would we manage this? Should we prefer to URIBL query them
> instead of body uris? Or shuffle and take n-amount of uris from here and
> there? How will different __URI* rules react, which depend on count / number
> of hits?
> 
> I'm quite sceptical that even ExtractText makes any sense. It has the same
> problems, along with possibly filling Bayes with semi-random stuff from
> badly OCR'd images or wonky rendered PDF's etc.
> 
> I think I would just vote to have a pdf_has_uri() which can match uris from
> PDFs and that's it. No complex metadata hassles.

ExtractText could poison Bayes databases, but a lot of other sources can do the same. On the other hand, it can parse .docx files and images as well, not just PDF files.
A warning about using ExtractText together with Bayes is a good idea anyway.