Bug 4725 - Add support for extracting terms from gif images for bayes subsystem
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: unspecified
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2005-12-07 16:09 UTC by Eugene Morozov
Modified: 2006-12-31 13:01 UTC
Description Eugene Morozov 2005-12-07 16:09:29 UTC
The SpamAssassin Bayes subsystem should probably support extracting terms
from GIF images in emails. These terms can provide much-needed data for emails
that contain nothing but headers and an image attachment.

This idea was first suggested by nico [tbb@hideout.ath.cx] to the author of
the SpamProbe Bayesian filter, and when implemented it showed an improvement in
filtering spam.
Comment 1 Sidney Markowitz 2005-12-07 20:15:52 UTC
Here are some details I found after digging into it: This is a feature in
SpamProbe being tried out in the experimental 1.3x2 release of SpamProbe as
announced at http://sourceforge.net/mailarchive/message.php?msg_id=14058893

Google didn't find any detailed discussion of the feature.

The source code (file src/parser/GifParser.cc from the 1.3x2 source tarball)
appears to use libungif to extract the following information from embedded GIF
files in an email and generate tokens from them for the Bayesian filter. I don't
see anything to handle jpeg images right now:

MD5 digest of the image (I think digest of the image bytes from the message, not
parsed or uncompressed to pixels, but I'm not sure)

height, width, left, top, interlaced or not, color map or no color map, bits
per pixel, the red, green and blue values in the color map if there is one, and
the extension code and characters of any GIF extension records.

The "image number" is made part of the tokens in the above paragraph (but not
the MD5 token). I think that may have to do with multiple images in a single
GIF object.
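
As a rough illustration of the kind of tokens described above, here is a minimal Python sketch. It is not SpamProbe's code and does not use libungif: the function name `gif_tokens` and the `gif_*` token spellings are made up for this example. It only reads the GIF signature, the logical screen descriptor and the global color map, so the per-image fields (left, top, interlaced) and the extension records are left out, and the MD5 is taken over the raw bytes as guessed above.

```python
import hashlib
import struct


def gif_tokens(data):
    """Turn the header of a GIF byte string into a list of string tokens."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    if len(data) < 13:
        raise ValueError("truncated GIF header")

    # Logical Screen Descriptor: width and height (little-endian 16-bit),
    # a packed flags byte, background color index, pixel aspect ratio.
    width, height, packed, _bg, _aspect = struct.unpack("<HHBBB", data[6:13])
    has_map = bool(packed & 0x80)           # bit 7: global color map present
    bits_per_pixel = (packed & 0x07) + 1    # map holds 2**bits_per_pixel entries

    tokens = [
        "gif_md5_" + hashlib.md5(data).hexdigest(),  # digest of the raw bytes
        f"gif_width_{width}",
        f"gif_height_{height}",
        f"gif_colormap_{int(has_map)}",
        f"gif_bpp_{bits_per_pixel}",
    ]

    if has_map:
        # The global color map follows the descriptor as packed RGB triples.
        table = data[13:13 + 3 * (1 << bits_per_pixel)]
        for i in range(0, len(table) - 2, 3):
            r, g, b = table[i:i + 3]
            tokens.append(f"gif_color_{r:02x}{g:02x}{b:02x}")
    return tokens
```

Calling `gif_tokens(open("image.gif", "rb").read())` on a decoded attachment would give a list that could be fed to a Bayes tokenizer alongside the ordinary text tokens; the file name here is only a placeholder.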
Comment 2 Justin Mason 2005-12-07 21:00:36 UTC
there are spammers who randomise the colour lookup table, inserting random
values in the unused spaces, and reordering the CLUT in random order, to defeat
MD5 sums -- so that's probably not useful.

I wonder if stuff like specific colour choices (e.g. "this image contains
#040400 and #ffff82") would make a good signature?
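
To make the two points above concrete, here is a small Python sketch. Everything in it is hypothetical (the helper names, the `colour_` token spelling, the palette size of 8), and it works on bare palettes rather than real GIF files. It pads a palette with random entries and shuffles it, which changes the MD5, and then checks that per-colour tokens for the two genuinely used colours are still shared between the disguised variants.

```python
import hashlib
import random


def colour_tokens(palette):
    """One order-independent token per distinct colour in the palette."""
    return {f"colour_{r:02x}{g:02x}{b:02x}" for r, g, b in palette}


def disguise(palette, size, rng):
    """Pad the palette with random colours up to `size`, then shuffle it."""
    padded = list(palette)
    while len(padded) < size:
        padded.append((rng.randrange(256), rng.randrange(256), rng.randrange(256)))
    rng.shuffle(padded)
    return padded


def digest(palette):
    """MD5 over the palette bytes in table order."""
    return hashlib.md5(bytes(c for rgb in palette for c in rgb)).hexdigest()


used = [(0x04, 0x04, 0x00), (0xFF, 0xFF, 0x82)]   # the colours the image really uses
variant_a = disguise(used, 8, random.Random(1))
variant_b = disguise(used, 8, random.Random(2))

print(digest(variant_a) == digest(variant_b))      # digests no longer match
print(sorted(colour_tokens(variant_a) & colour_tokens(variant_b)))
```

Note that the random filler entries also become tokens, so a real tokenizer would probably have to look only at colours that the pixel data actually references; this sketch does not attempt that.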
Comment 3 Sidney Markowitz 2005-12-07 22:17:32 UTC
Even if this technique has some success in SpamProbe, which only uses Bayes
(and we don't yet know that, as it is a new experimental feature), that would
still not indicate whether it would do more in SpamAssassin than the HTML_IMAGE*
rules. Of course I would not mind seeing the results if somebody wants to
extract information from spam and ham embedded images and run some tests.
Comment 4 Sidney Markowitz 2005-12-08 20:51:26 UTC
>I wonder if stuff like specific colour choices (e.g. "this image contains
> #040400 and #ffff82") would make a good signature?

I'm very skeptical. It's one thing to match on something that is characteristic
of the content of spam (e.g., V!agra or $$$!!!), but there is no reason for spam
to have characteristic image sizes or color maps or use of colors. Those are
easily changed by spammers to arbitrary values if we do start looking for
characteristic values.
Comment 5 Justin Mason 2005-12-08 21:14:51 UTC
'there is no reason for spam
to have characteristic image sizes or color maps or use of colors.'

actually, there was; certain spammers would use certain sizes, colors, etc. in
their campaigns.

'Those are
easily changed by spammers to arbitrary values if we do start looking for
characteristic values'

and there's the rub.  In my recent testing (at $DAYJOB), they've been doing a
lot of this.  I doubt these features are reliable indicators anymore against
current spam. :(
Comment 6 Loren Wilton 2005-12-09 08:48:49 UTC
Standard text formatting patterns (or more commonly standard HTML markup
patterns) are still moderately good spam indicators in many cases.  Since the
text formatting patterns can be useful, I see no reason why the generic concept
of image formatting patterns shouldn't be about as useful.

However, I'm talking about "image formatting patterns" as a generality, and not
necessarily the exact colors or image sizes.  This is probably a really good
place for a fuzzy matching algorithm that could do things like determine that
x% of the image is background color, or that there is a pattern change 37% of
the way from top to bottom, or that 22% of the space appears to be text, etc.

I have no idea at all how to produce those statistics in a useful way for 
filtering.
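
One possible starting point, offered only as a sketch: compute a coarse statistic from the decoded pixels and round it into a bucket, so that small random variations still land on the same token. The Python below assumes the image has already been decoded into a 2-D grid of palette indices (decoding is not shown), treats the most common value as the "background", and uses made-up names (`background_fraction`, `bucket_token`, `img_background`). Whether such tokens would actually separate spam from ham is untested.

```python
from collections import Counter


def background_fraction(pixels):
    """Fraction of pixels that carry the most common value in a 2-D grid."""
    counts = Counter(value for row in pixels for value in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty image")
    return counts.most_common(1)[0][1] / total


def bucket_token(name, fraction, step=10):
    """Round a 0..1 fraction down to a coarse percentage bucket token."""
    percent = min(int(fraction * 100), 100 - step)  # keep 100% in the top bucket
    low = percent - percent % step
    return f"{name}_{low}_{low + step}"


# A tiny 4x4 "image": value 0 is the background, value 1 is a small blob.
grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
token = bucket_token("img_background", background_fraction(grid))
```

The same bucketing could be applied to other measurements mentioned above, such as the row at which the dominant color changes, with the bucket width acting as the "fuzziness" knob.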
Comment 7 Theo Van Dinter 2006-12-31 13:01:35 UTC
In 3.2, plugins can "render" any part, including image/* parts, to text, which
will be used in body rules, the bayes tokenizer, etc.  So I think this is done. :)