SA Bugzilla – Full Text Bug Listing |
Summary: | PDFInfo: pdfinfo:pdf_has_uri | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Giovanni Bechis <giovanni> |
Component: | Plugins | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | NEW --- | ||
Severity: | normal | CC: | apache, giovanni, jhardin, kmcgrail, me |
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | 4.0.0 | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Attachments: |
pdfinfo::pdf_has_uri
Extract URIs from pdf file and check them in URIBLs |
That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum in isolation. I'd suggest it would be _much_ more useful to extract the URIs and add them to the pool that feeds uri rules and URIBL checks.

Even better if heuristics similar to what's used for body text would pull non-clickable URIs out of the PDF text, but doing that might best be controlled by a config option.

+1 to John's comment.

(In reply to John Hardin from comment #1)
> That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum
> in isolation. I'd suggest it would be _much_ more useful to extract the URIs
> and add them to the pool that feeds uri rules and URIBL checks.

Any hints on how to add URIs to the pool? I had a look at DecodeShortURLs.pm, but it's ugly and I am not sure it works correctly.

> Even better if heuristics similar to what's used for body text would pull
> non-clickable URIs out of the PDF text, but doing that might best be
> controlled by a config option.

IMHO this should be a second step.

Sorry, no hints here. Probably best to ask on list.

(In reply to John Hardin from comment #1)
> That a PDF has a URI (clickable or not) doesn't seem a terribly useful datum
> in isolation. I'd suggest it would be _much_ more useful to extract the URIs
> and add them to the pool that feeds uri rules and URIBL checks.
>
> Even better if heuristics similar to what's used for body text would pull
> non-clickable URIs out of the PDF text, but doing that might best be
> controlled by a config option.

Looking at my spam collection, a PDF named Invoice.pdf with a clickable URI is very probably spam. Anyway, I am looking at extracting URIs from attachments and adding them to the pool of URIs to be checked.

Mail::SpamAssassin::Plugin::ExtractText

Uses plugin extractors and/or external tools to extract text from message parts. Extractor plugins can extract parts that will be fed back into the plugin for checking, so for example an image OCR extractor could get to check images extracted from a PDF by another extractor. How to extract what from what is very configurable. Included are configs for MS Word, RTF, OpenDocument and PDF files, and a very simplistic OpenXML plugin.

Created by: Jonas Eckerman
Found here: https://wiki.apache.org/spamassassin/UnmaintainedCustomPlugins

A good start, but it needs more maintenance.

Created attachment 5571 [details]
Extract URIs from pdf file and check them in URIBLs
Extract URIs from pdf files (at least some of them) and add them to the pool of URIs to be checked (URIBL, etc...).
A pms (PerMsgStatus) method was added because it could be useful to other plugins (decodeshorturls, for example).
I don't think add_uri_detail_list should blindly overwrite an existing $pms->{uri_detail_list}->{$uri}. If the same URI was already found in the HTML, it will lose its types etc. Also, a function that accepts just "uri" as an argument should probably be named add_uri_list, since it doesn't accept any details.

FYI, I'm rewriting the horrible mess of URI parsing for 4.0.0, so hold on.

We now have $pms->add_uri_detail_list() available in trunk.

Committed revision 1864336. |
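To illustrate the overwrite concern raised above, here is a minimal sketch in Python (not the plugin's Perl, and not the code committed to trunk). The function name `add_uri_detail` and the `types`/`hosts` fields are hypothetical stand-ins for whatever a real URI pool stores; the point is only that an existing entry is merged into rather than replaced.

```python
def add_uri_detail(pool, uri, types=(), hosts=()):
    """Merge details for `uri` into `pool` without discarding what is already there.

    `pool` is a plain dict keyed by URI; the "types" and "hosts" fields are
    illustrative assumptions, not SpamAssassin's actual data layout.
    """
    # setdefault keeps an existing entry (e.g. one found in the HTML part)
    # and only creates a fresh one when the URI is new.
    entry = pool.setdefault(uri, {"types": set(), "hosts": set()})
    entry["types"].update(types)
    entry["hosts"].update(hosts)
    return entry


# A URI first seen in an HTML anchor, then seen again inside a PDF attachment.
pool = {}
add_uri_detail(pool, "http://example.com/", types=["a"], hosts=["example.com"])
add_uri_detail(pool, "http://example.com/", types=["pdf"])
```

After both calls the entry carries the HTML-derived type as well as the PDF-derived one, which is the behavior the comment asks for.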
Created attachment 5567 [details]
pdfinfo::pdf_has_uri

New function to check whether a PDF has a "clickable" URI. It does not detect all URIs, because some software stores links in binary data.

Is it worth adding it to PDFInfo.pm, or is it better to create a new plugin that depends on a PDF parser such as PDF::Parse or similar?
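As a rough illustration of the kind of check described above, the following Python sketch scans raw PDF bytes for uncompressed `/URI (...)` link actions. It is not the attached patch (which targets the Perl plugin PDFInfo.pm); the names `extract_pdf_uris` and `pdf_has_uri` here are hypothetical Python helpers. It shares the stated limitation: links stored inside compressed or otherwise binary streams are not found.

```python
import re
from typing import List

# Matches an uncompressed URI action such as: /URI (https://example.com/)
# The inner group allows backslash-escaped characters in the PDF literal string.
URI_ACTION_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def extract_pdf_uris(data: bytes) -> List[str]:
    """Return clickable URIs found in uncompressed parts of a PDF, deduplicated in order."""
    found: List[str] = []
    for match in URI_ACTION_RE.finditer(data):
        # Undo simple backslash escapes such as \( \) and \\
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1))
        uri = raw.decode("latin-1").strip()
        if uri and uri not in found:
            found.append(uri)
    return found


def pdf_has_uri(data: bytes) -> bool:
    """True if at least one uncompressed URI action is present."""
    return bool(extract_pdf_uris(data))
```

Returning the list rather than only a boolean fits the suggestion made earlier in the thread: the extracted URIs can then be handed to whatever pool feeds uri rules and URIBL checks.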