SA Bugzilla – Bug 8107
Change how PDFs are parsed with the PDFInfo plugin
Last modified: 2023-01-26 06:42:53 UTC
I would like to discuss a possible rewrite of the PDFInfo plugin. The main issue I'm running into is that it does not detect images 100% of the time. For instance, if '/Height' and '/Width' are on different lines, the image is not detected. Likewise, if either '/Height' or '/Width' comes before '/Image', the image is not detected. I've looked for simple ways to fix this, but I believe the best fix is to parse the PDF correctly using the PDF object structure instead of the current line-oriented method.

Parsing the PDF object tree would allow the following additional features:

1. Differentiating between images displayed on the page and images used as a mask for other images (the latter can probably be ignored).
2. Taking scaling into account. The pixel dimensions of an image are unrelated to the amount of area it consumes on the page. For example, a 400x600 image can take up the whole page, while a 1200x900 image takes up only 25% of it.
3. Handling images that are defined once and used multiple times on a page or on multiple pages.
4. Prioritizing content on page 1 (or simply ignoring content on all other pages). Spammers usually put the payload on page 1; if there are other pages, they exist only to confuse the filters.
5. Accessing images and URIs located in binary data.

I've already started working on this and I think it's doable, but I don't want to duplicate work if someone else is already working on it. I would also like feedback on whether this should be a drop-in replacement or a totally new plugin. I would like to maintain backward compatibility, but there would be differences in how image-to-text ratios are calculated, and the fuzzy MD5 checksums would be different unless I keep the existing code (and parse each file twice) just to avoid changing the checksums. Any thoughts?
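To illustrate the failure mode described above, here is a minimal Python sketch (not the plugin's actual Perl code) contrasting a line-oriented scan with a scan over the whole object dictionary. The sample dictionaries are simplified; real image XObjects can sit inside compressed streams.

```python
import re

# Two equivalent image XObject dictionaries: one with all keys on a
# single line, one with /Width and /Height split across lines.
one_line = b"<< /Subtype /Image /Width 1200 /Height 900 >>"
multi_line = b"<< /Subtype /Image\n/Width 1200\n/Height 900 >>"

def detect_line_oriented(data):
    """Mimics the line-oriented method: the image counts only if
    /Image, /Width and /Height all appear on the same line."""
    for line in data.split(b"\n"):
        if (re.search(rb"/Image", line)
                and re.search(rb"/Width", line)
                and re.search(rb"/Height", line)):
            return True
    return False

def detect_object_level(data):
    """Scans the whole object dictionary, so key order and line
    breaks no longer matter."""
    keys = (rb"/Image", rb"/Width", rb"/Height")
    return all(re.search(k, data) for k in keys)

print(detect_line_oriented(one_line))    # True
print(detect_line_oriented(multi_line))  # False: the image is missed
print(detect_object_level(multi_line))   # True
```

A real object-structure parser would of course walk the xref table and object tree rather than regex-scan raw bytes, but the sketch shows why the line-oriented approach is fragile.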
Certainly the image detection failure is a good thing to work on. Is there a good module for PDF parsing as you describe?

Re: the additional features, here are my thoughts:

1. Mask images. KAM: not sure this will be an indicator of spam/ham.
2. Scaling. KAM: not sure this will be an indicator of spam/ham.
3. Images used multiple times. KAM: not sure this will be an indicator of spam/ham.
4. Prioritizing content on page 1 (or simply ignoring content on all other pages). KAM: This sounds like an interesting balance on efficiency that could be very useful.
5. Accessing images and URIs located in binary data. KAM: Are there PDFs avoiding scanning using this technique?

Re: "I've already started working on this and I think it's doable but I don't want to duplicate work if someone else is already working on it." I'm not aware of anything in progress, and we love new blood.

Re: "I would also like feedback on whether this should be a drop-in replacement or a totally new plugin." How it would affect the stock ruleset would be my main question to help answer that. What changes would people need to make? For example, are there any affected rules in the KAM Ruleset?
There are a few PDF parsing modules already, but they are overkill for what we need. I have written a more streamlined parser that just inspects images and URIs and is configured to stop after page 1. I still rely on ExtractText to pull out the text.

Regarding points 1, 2, and 3: they would not be indicators of spam/ham by themselves but would be used in conjunction with other rules. The current plugin has rules such as "pdf_image_to_text_ratio" and "pdf_image_size_range" which are also weak indicators when used by themselves. IMHO, it would be better to know the percentage of page area taken up by images rather than the raw number of pixels.

There are only a handful of rules in the stock ruleset that use this plugin. Most of the rules in 20_pdfinfo.cf are commented out, and the remaining ones are from circa 2007. Is there a way to see the effectiveness of the rules in the stock ruleset? IIRC there used to be a way to see the hit frequencies from the nightly mass check somewhere. In my setup, these rules are not very effective:

GMD_PDF_HORIZ          Contains pdf 100-240 (high) x 450-800 (wide)
GMD_PDF_SQUARE         Contains pdf 180-360 (high) x 180-360 (wide)
GMD_PDF_VERT           Contains pdf 450-800 (high) x 100-240 (wide)
GMD_PRODUCER_GPL       PDF producer was GPL Ghostscript
GMD_PRODUCER_POWERPDF  PDF producer was PowerPDF
GMD_PRODUCER_EASYPDF   PDF producer was BCL easyPDF

They have a low positive score but hit more ham than spam. Is anyone having success with these rules? The KAM ruleset only includes this one:

describe KAM_BADPDF1 Prevalent Junk PDF SPAMs - EMPTY BODY & ENCRYPTED
score    KAM_BADPDF1 2.5
meta     KAM_BADPDF1 (GMD_PDF_EMPTY_BODY + GMD_PDF_ENCRYPTED >= 2)

I would be curious to know how this rule is working in your environment.

Regarding point 5: yes, I have examples of PDFs that are encrypted with a blank password. Most PDF readers will seamlessly open such a PDF without prompting for a password, so to the user it seems like a normal PDF. But the data is not visible to SA without decrypting it first.

Thanks,
Kent
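As background on how weak indicators like these get used "in conjunction with other rules": the usual pattern is to define unscored sub-rules and combine them in a meta rule. A hedged sketch using eval checks from the stock PDFInfo plugin; the LOCAL_* names and the thresholds are made up for illustration and would need tuning against real mail:

```
# Unscored sub-rules using the current plugin's eval checks
body __LOCAL_PDF_LOW_TEXT   eval:pdf_image_to_text_ratio(0.000,0.100)
body __LOCAL_PDF_HAS_IMAGE  eval:pdf_image_count(1,2)

# Neither is meaningful alone; score only the combination
meta     LOCAL_PDF_IMAGE_HEAVY  (__LOCAL_PDF_LOW_TEXT && __LOCAL_PDF_HAS_IMAGE)
describe LOCAL_PDF_IMAGE_HEAVY  PDF is mostly image with little text
score    LOCAL_PDF_IMAGE_HEAVY  1.5
```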
I decided to make a new plugin because it would be too difficult to maintain backward compatibility with the old plugin. The existing PDFInfo plugin was a spinoff of the ImageInfo plugin, which probably explains why it focuses so much on image dimensions and pixel area.

I've made the plugin available on GitHub in case anyone wants to use it: https://github.com/mxguardian/Mail-SpamAssassin-Plugin-PDFInfo2

Feedback and suggestions are appreciated. I'm using this in production without any problems, but all the standard warnings and disclaimers apply. You can run this plugin in parallel with the old plugin in case you are using any rules that depend on the old plugin.

Notable improvements:

* It can parse PDFs that are encrypted with a blank password.
* Several of the tests focus exclusively on page 1 of each document. This not only helps with performance but is a countermeasure against content stuffing.
* pdf2_click_ratio: fires based on how much of page 1 is clickable. Based on preliminary testing, anything over 20% is likely spam, especially if there's only one link and the word count is low.
* I took the liberty of creating a new "pdf" URI type that can be used in writing uri-detail rules.

Let me know if you have any questions.
-Kent
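To show how the new "pdf" URI type might be used, here is a hypothetical uri-detail rule in the style of the stock URIDetail plugin. The rule name, target domain, and score are invented for illustration, and the exact rule syntax supported by PDFInfo2 should be taken from its GitHub README rather than from this sketch:

```
loadplugin Mail::SpamAssassin::Plugin::URIDetail

# Match URIs extracted from the PDF (type "pdf") pointing at a
# URL shortener; names and score are illustrative only.
uri_detail LOCAL_PDF_URI_SHORTENER  type =~ /^pdf$/  domain =~ /^bit\.ly$/
describe   LOCAL_PDF_URI_SHORTENER  PDF links to a URL shortener
score      LOCAL_PDF_URI_SHORTENER  1.0
```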
Thanks for the effort. I've added a mention of the plugin here: https://cwiki.apache.org/confluence/display/SPAMASSASSIN/CustomPlugins

Since it's a third-party plugin, development and any discussion can and should be continued on GitHub as needed. Closing this bug.