SA Bugzilla – Bug 4725
Add support for extracting terms from gif images for bayes subsystem
Last modified: 2006-12-31 13:01:35 UTC
The SpamAssassin Bayes subsystem should probably support extracting terms from GIF images in emails. These terms can provide much-needed data for emails that contain nothing but headers and an image attachment. The idea was first suggested by nico [tbb@hideout.ath.cx] to the author of the SpamProbe Bayesian filter and, once implemented, showed an improvement in spam filtering.
Here are some details I found after digging into it. This is a feature being tried out in the experimental 1.3x2 release of SpamProbe, as announced at http://sourceforge.net/mailarchive/message.php?msg_id=14058893. Google didn't find any detailed discussion of the feature.

The source code (file src/parser/GifParser.cc from the 1.3x2 source tarball) appears to use libungif to extract the following information from GIF files embedded in an email and to generate tokens from it for the Bayesian filter. I don't see anything to handle JPEG images right now.

MD5 digest of the image (I think it is a digest of the image bytes as they appear in the message, not parsed or uncompressed to pixels, but I'm not sure); height, width, left, top; interlaced or not; color map or no color map; bits per pixel; the red, green and blue values in the color map if there is one; and the extension code and characters of any GIF extension records.

The "image number" is made part of each of the tokens listed above (except the MD5); I think that may have to do with multiple images in a single GIF object.
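To make the token idea concrete, here is a minimal Python sketch of the header-level part only. It is not SpamProbe's code: SpamProbe appears to use libungif, whereas this reads the fixed GIF header and global color table directly from the raw bytes. The function name `gif_header_tokens` and the token spellings (`gif_width_...` and so on) are made up for illustration, and the per-image fields (left, top, interlaced) and extension records are not handled.

```python
import hashlib
import struct


def gif_header_tokens(data: bytes, prefix: str = "gif") -> list[str]:
    """Build illustrative Bayes tokens from a GIF's header fields.

    Token names are hypothetical; only the logical screen descriptor and
    the global color table are read.
    """
    if len(data) < 13 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF stream")

    # Logical screen descriptor: width, height (little-endian 16-bit),
    # a packed flags byte, background color index, pixel aspect ratio.
    width, height, packed, _bg_index, _aspect = struct.unpack("<HHBBB", data[6:13])

    tokens = [
        # Digest of the raw bytes as found in the message, not of decoded pixels.
        f"{prefix}_md5_{hashlib.md5(data).hexdigest()}",
        f"{prefix}_width_{width}",
        f"{prefix}_height_{height}",
    ]

    has_color_map = bool(packed & 0x80)  # bit 7: global color table present
    tokens.append(f"{prefix}_colormap_{int(has_color_map)}")

    if has_color_map:
        bits_per_pixel = (packed & 0x07) + 1  # table holds 2**bits entries
        tokens.append(f"{prefix}_bpp_{bits_per_pixel}")
        table = data[13:13 + 3 * (1 << bits_per_pixel)]
        # One token per complete RGB triple in the table.
        for i in range(0, len(table) - 2, 3):
            r, g, b = table[i:i + 3]
            tokens.append(f"{prefix}_color_{r:02x}{g:02x}{b:02x}")

    return tokens
```

Calling `gif_header_tokens(raw_bytes)` on a decoded image attachment would return a list of strings that could be fed to a tokenizer alongside the ordinary text tokens; whether those tokens actually help classification is exactly what would need testing.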
There are spammers who randomise the colour lookup table (CLUT), inserting random values in the unused slots and reordering the entries, to defeat MD5 sums, so the digest is probably not useful. I wonder if stuff like specific colour choices (e.g. "this image contains #040400 and #ffff82") would make a good signature?
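A small Python sketch of why that might work where an MD5 does not: if only the colours actually referenced by pixels are collected, and collected as a set, then neither reordering the table nor stuffing random values into unused slots changes the result. The helper names are hypothetical, and the sketch assumes the pixel indices have already been decoded from the GIF, which is not shown here.

```python
import hashlib


def used_colours(palette, pixel_indices):
    """Return the set of colours that pixels actually reference.

    `palette` is a list of (r, g, b) tuples and `pixel_indices` is an
    iterable of indexes into it. Unused palette slots are ignored and the
    order of the table does not matter.
    """
    return frozenset("#%02x%02x%02x" % palette[i] for i in set(pixel_indices))


def palette_md5(palette):
    """Digest of the raw palette bytes, which is sensitive to order and padding."""
    return hashlib.md5(bytes(v for rgb in palette for v in rgb)).hexdigest()


# Same picture twice: the second palette is reordered and has a different
# junk value in its unused slot.
palette_a = [(0x04, 0x04, 0x00), (0xFF, 0xFF, 0x82), (1, 2, 3)]
pixels_a = [0, 1, 1, 0]
palette_b = [(9, 9, 9), (0xFF, 0xFF, 0x82), (0x04, 0x04, 0x00)]
pixels_b = [2, 1, 1, 2]

same_signature = used_colours(palette_a, pixels_a) == used_colours(palette_b, pixels_b)
same_digest = palette_md5(palette_a) == palette_md5(palette_b)
```

Here `same_signature` is true while `same_digest` is false. This only shows that the signature survives the two tricks described above; it says nothing about whether colour sets discriminate spam from ham.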
Even if this technique has some success in SpamProbe, which only uses Bayes (and we don't yet know that, as it is a new experimental feature), that would still not indicate whether it would do more in SpamAssassin than the HTML_IMAGE* rules. Of course, I would not mind seeing the results if somebody wants to extract information from spam and ham embedded images and run some tests.
> I wonder if stuff like specific colour choices (e.g. "this image contains
> #040400 and #ffff82") would make a good signature?

I'm very skeptical. It's one thing to match on something that is characteristic of the content of spam (e.g., V!agra or $$$!!!), but there is no reason for spam to have characteristic image sizes or color maps or use of colors. Those are easily changed by spammers to arbitrary values if we do start looking for characteristic values.
'there is no reason for spam to have characteristic image sizes or color maps or use of colors.'

Actually, there was; certain spammers would use certain sizes, colors, etc. in their campaigns.

'Those are easily changed by spammers to arbitrary values if we do start looking for characteristic values'

And there's the rub. In my recent testing (at $DAYJOB), they've been doing a lot of this. I doubt these features are reliable indicators anymore against current spam. :(
Standard text formatting patterns (or, more commonly, standard HTML markup patterns) are still moderately good spam indicators in many cases. Since text formatting patterns can be useful, I see no reason why the generic concept of image formatting patterns shouldn't be about as useful. However, I'm talking about "image formatting patterns" as a generality, and not necessarily the exact colors or image sizes. This is probably a really good place for a fuzzy matching algorithm that could do things like determine that x% of the image is background color, or that there is a pattern change 37% of the way from top to bottom, or that 22% of the space appears to be text, etc. I have no idea at all how to produce those statistics in a useful way for filtering.
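As one possible starting point for the "x% of the image is background color" statistic, here is a rough Python sketch. It assumes the image has already been decoded into rows of pixel values, treats the most common value as the background, and rounds the fraction down into coarse buckets so that small random perturbations land on the same token. The function names, the bucket width, and the token spelling are all assumptions for illustration, not a tested filtering technique.

```python
from collections import Counter


def background_fraction(rows):
    """Return (background_value, fraction) for a decoded image.

    `rows` is a list of rows, each a list of pixel values (palette indexes
    or colour tuples). The background is assumed to be the most common value.
    """
    counts = Counter(pixel for row in rows for pixel in row)
    if not counts:
        raise ValueError("empty image")
    value, n = counts.most_common(1)[0]
    return value, n / sum(counts.values())


def fuzzy_bucket(fraction, step=10):
    """Round a 0..1 fraction down to a percentage bucket (0, 10, 20, ...)."""
    return int(fraction * 100) // step * step


def background_token(rows, step=10):
    """Build a hypothetical token such as 'img_bg_60' for the tokenizer."""
    _value, fraction = background_fraction(rows)
    return f"img_bg_{fuzzy_bucket(fraction, step)}"
```

The bucketing is the "fuzzy" part: an image that is 62% background and one that is 67% background produce the same token. The other statistics mentioned above (where a pattern change occurs, how much of the area looks like text) would need real image analysis and are not attempted here.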
In 3.2, plugins can "render" any part, including image/* parts, to text, which will be used in body rules, the bayes tokenizer, etc. So I think this is done. :)