Bug 3781 - There should be a rule type for mime part headers
Summary: There should be a rule type for mime part headers
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P3 enhancement
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 2417
Reported: 2004-09-15 05:13 UTC by Loren Wilton
Modified: 2005-02-07 12:03 UTC



Description Loren Wilton 2004-09-15 05:13:53 UTC
This is slightly related to bug 3780, at least in concept.

With some of the newer spam formats there is gold to be found in the mime 
headers for attachments, typically either the file types, names, or sometimes 
bogus encoding formats.  However, there is currently (to the best of my 
knowledge) no way to grab the mime headers to look at this stuff.

As Theo pointed out in another discussion, mime headers are not body rules.  
They also probably aren't really header rules either.  So I think the best thing to do 
would be to make a new rule type, perhaps 'mimeheader' or 'mimehdr' or some 
such.

Within the new rule type I would expect it to work much like the current header 
rules, where you can grep a particular type of header, or use ALL.
Comment 1 Justin Mason 2004-09-15 05:55:52 UTC
true.   would it be worthwhile being able to match against MIME hdrs from
specific sub-parts of a message, or is matching against MIME hdrs from all
sub-parts at once OK?

AFAICS, the latter should be fine, but I'm not certain.

btw I'm also +1 on defining a new rule type for this.
Comment 2 Loren Wilton 2004-09-15 08:05:50 UTC
Subject: Re:  There should be a rule type for mime part headers

> true.   would it be worthwhile being able to match against MIME hdrs from
> specific sub-parts of a message, or is matching against MIME hdrs from all
> sub-parts at once OK?
>
> AFAICS, the latter should be fine, but I'm not certain.

I debated this when I wrote the idea up.  The best I can say at the moment
is "I don't know".

I *do* know that when manually examining a spam I'm usually interested in
one particular mime header and not others that may be there.  But I can't
think how I would be able to describe any particular header beyond "look at
them until I find the one I want".  Which isn't very useful.

I don't think that (for example) concatenating all of the Content-Encoding
parts from all headers into a single combined header (like is done for
Received in the main header) would be particularly useful or a particularly
good idea.  I think they should probably be served to the RE as individual
'lines', and the rule simply called a sufficient number of times.  As for
ALL, I think it should serve each mime header as an entity to the rule, but
not concatenate all headers into a single entity.

Doing things this way would make it difficult to do tests across multiple
headers at once.  But so far I haven't found a need to do that, and I'm
concerned about getting false results if the headers got combined.  I think
to handle that we could probably do something redundant like ALL:full or
ALL:ALL to tell the rule-driver to combine all the appropriate things into a
single string.  Right now I can't imagine actually using that construct.
But that doesn't mean that I might not find a rare use for it tomorrow.

        Loren
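
The matching behaviour described above (each MIME part header handed to the regular expression on its own, with ALL meaning "every header, one at a time" rather than one concatenated string) can be sketched roughly as follows. This is only an illustration in Python, assuming the standard library email package stands in for SpamAssassin's own MIME parser; the function name match_mime_header and the sample message are hypothetical.

```python
import re
from email import message_from_string

# Hypothetical two-part message used only for illustration.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

hello
--XX
Content-Type: image/gif;
 name="CitiLogo.gif"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
 filename="CitiLogo.gif"

R0lGODlhAQABAAAAACw=
--XX--
"""


def match_mime_header(msg, header, pattern):
    """Return True if any single MIME part header matches the pattern.

    header is a header name, or "ALL" to test every part header as a
    "Name: value" line. Headers are always tested one at a time; values
    from different parts are never concatenated.
    """
    regex = re.compile(pattern)
    for part in msg.walk():
        if part is msg:
            continue  # skip the main message header, only look at parts
        for name, value in part.items():
            if header != "ALL" and name.lower() != header.lower():
                continue
            # Unfold continuation lines so a folded header is one string.
            unfolded = re.sub(r"\r?\n[ \t]+", " ", value)
            subject = f"{name}: {unfolded}" if header == "ALL" else unfolded
            if regex.search(subject):
                return True
    return False


msg = message_from_string(RAW)
print(match_mime_header(msg, "Content-Disposition", r'filename="CitiLogo\.gif"'))
print(match_mime_header(msg, "Content-Transfer-Encoding", r"7bit.*base64"))
```

The first call matches the folded Content-Disposition header of the second part. The second call does not match, because the two Content-Transfer-Encoding values are tested separately instead of being combined the way Received headers are.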

Comment 3 Daniel Quinlan 2004-09-15 12:39:00 UTC
Subject: Re:  There should be a rule type for mime part headers

I think this can be handled in one of two ways:

1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do
   this in addition to what I'm working on now (accessing raw MIME part
   data)
2. we could clean up the header test types a bit and allow something
   like "full" on just the body.

I think I would probably object to a new core test type for just MIME
part headers.

Comment 4 Theo Van Dinter 2004-09-15 13:25:06 UTC
Subject: Re:  There should be a rule type for mime part headers

On Wed, Sep 15, 2004 at 12:39:01PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do
>    this in addition to what I'm working on now (accessing raw MIME part
>    data)
> 
> I think I would probably object to a new core test type for just MIME
> part headers.

There was a short discussion on the dev list.  My suggestion was that there be a
plugin to implement the rule type, since the information in body mime headers
has not been found to be very useful in anti-spam work.  Depending on whether
or not that has changed and is now useful, the rule type could be made
"standard".

I responded to Chris Santerre's message requesting the plugin (ie:
what kind of information/queries are they looking to do), but have not
received a response.

Since the body mime bits and the binary bits are pretty much the same, I think
getting them together in the same plugin would be fine.  The code to look in
the body mime headers is pretty trivial though, so a separate plugin may be
good depending on what is actually desired in such a plugin.

Comment 5 Loren Wilton 2004-09-15 20:42:39 UTC
Subject: Re:  There should be a rule type for mime part headers

> I responded to Chris Santerre's message requesting the plugin (ie:
> what kind of information/queries are they looking to do), but have not
> received a response.

I'm not Chris, but a mime boundary is really a pretty simple thing, for
instance:

Content-Type: image/gif;
 name="Vnguku.GIF"
Content-Transfer-Encoding: base64
Content-ID: <part1.08020704.07020304@jtmju@eurekabroadband.com>
Content-Disposition: inline;
 filename="Vnguku.GIF"

There just isn't a lot here, and I'd want to treat it as just more header
items.  Possibly even treating them as though they were part of the main
header would be sufficient.  But I'd prefer to be able to treat them as
individual headers and not agglomerate the contents of identically-named
header items into a single string.

I can't find a good example in the afternoon's spam collection, and being on
a Windoze box it is too much work to hunt around just for an example.  But
generally I'm interested in the value of Content-Transfer-Encoding, looking
for what appear to be bogus values that are being treated as text/plain.
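
As a rough sketch of that kind of check, the following Python looks at each part's Content-Transfer-Encoding and reports values outside the five standard ones defined by RFC 2045. The function name and the sample message are hypothetical, and treating every non-standard value (including the x- extension tokens that RFC 2045 permits) as "bogus" is an assumption made for the example.

```python
from email import message_from_string

# The five standard Content-Transfer-Encoding values from RFC 2045.
STANDARD_ENCODINGS = {"7bit", "8bit", "binary", "quoted-printable", "base64"}

# Hypothetical message: the first part carries a made-up encoding name.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain
Content-Transfer-Encoding: 8bits

hello
--XX
Content-Type: image/gif; name="Vnguku.GIF"
Content-Transfer-Encoding: base64

R0lGODlhAQABAAAAACw=
--XX--
"""


def bogus_transfer_encodings(msg):
    """List (content type, encoding) pairs for parts with odd encodings."""
    bogus = []
    for part in msg.walk():
        if part is msg:
            continue  # only MIME part headers, not the main header
        value = part.get("Content-Transfer-Encoding")
        if value is None:
            continue
        if value.strip().lower() not in STANDARD_ENCODINGS:
            bogus.append((part.get_content_type(), value.strip()))
    return bogus


print(bogus_transfer_encodings(message_from_string(SAMPLE)))
```

For the sample above only the text/plain part with the "8bits" value is reported; the base64 image part is left alone.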

The file name can also potentially be useful.  Obviously those that want to
use SA to strip bogus virus warnings would be interested in this field,
since it could be a virus, or it could be a stupid name like
"Norton.Deleted" that can be used very easily to tell the message is junk.
I'd personally be more interested in looking for things like "CitiLogo.gif",
as is typical in a CitiBunk phish.

As in all of these things, sometimes knowing the case of the words can be
interesting, just as with header items or body items.  Thus, semantically,
I'd like to treat them as normal header items can be treated.


> Since the body mime bits and the binary bits are pretty much the same, I think
> getting them together in the same plugin would be fine.  The code to look in
> the body mime headers is pretty trivial though, so a separate plugin may be
> good depending on what is actually desired in such a plugin.

This seems not unreasonable.  Perhaps a fake 'header' type of Mime-Body or
the like could be used to access the first 100 bytes or so of the body by
default, or the whole body if Mime-Body:full or some such were specified.  I
can't immediately see using this myself at the moment, but I wouldn't want
to rule it out.  It could be useful for looking for something like a
signature attachment with a particular value.

> 1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do
>    this in addition to what I'm working on now (accessing raw MIME part
>    data)

This seems feasible, if I'm understanding you correctly.

> 2. we could clean up the header test types a bit and allow something
>    like "full" on just the body.

I'm not sure I follow this here.  While an earlier bug pointed out that the
current implementation of rawbody is rather useless, forcing one to use
'full' far too often, I see this as unrelated.  I'd much rather not have to
scan across the entire message looking for mime header encodings and hope
that what I found was really a mime header.  Too much chance for an FP, and
too hard to do the correct mime part separation in normal rules.  SA already
knows how to split things into parts, and that should be leveraged.


> I think I would probably object to a new core test type for just MIME
> part headers.

I would not object to them being lumped into the current header tests.
Short of that, I'd like the appearance of a new test type, whether that is a
new object or a way of using a new eval/plugin.


Comment 6 Justin Mason 2004-09-16 03:26:33 UTC
Subject: Re:  There should be a rule type for mime part headers 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> I think this can be handled in one of two ways:
> 1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do
>    this in addition to what I'm working on now (accessing raw MIME part
>    data)

- -0: that doesn't make much sense; why would a plugin intended to
detect MS executables, also allow third-parties to match against
arbitrary data in MIME part headers?   that usage isn't exactly
suggested by the name/purpose of "MSExec".

However making the MSExec plugin depend *on* another plugin that
allows this, now *that* makes sense.

> 2. we could clean up the header test types a bit and allow something
>    like "full" on just the body.

Not sure how that affects MIME part headers?

> I think I would probably object to a new core test type for just MIME
> part headers.

I think we have a significant "blind spot", in that we currently have no
way for a rule to match against those part headers -- apart from writing
an eval rule.

I think we may even have a situation where we *could* match some
spam through a rule that did that, and we're just not looking yet.

I don't think being able to access them in an eval rule is good enough,
btw.  That's too high a jump to make for most rule authors; it requires
familiarity with the code a lot more than writing a rule does.  We
shouldn't require that rule authors know how to write eval tests.   In
addition, eval tests are slower than a native rule type.

(Eval tests shift quite a lot of parsing overhead from startup-time to
runtime -- due to the overhead of argument marshalling, an individual eval
' ' scope for the function call, etc.  Quite a lot slower than the other
rule types.  take a look at DProf output...)

So I think we do need some way for user rules to access this code.

I'd be in favour of a plugin that implements this new rule type.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBSWpVQTcbUG5Y7woRAliiAKCHtsJ0MjkA4Tk/c5bwzlERCFtpsACeIZYw
fMDh44CdVBK3WK3QSyNGM68=
=53Pn
-----END PGP SIGNATURE-----

Comment 7 Daniel Quinlan 2004-09-16 06:08:23 UTC
Subject: Re:  There should be a rule type for mime part headers

> - -0: that doesn't make much sense; why would a plugin intended to
> detect MS executables, also allow third-parties to match against
> arbitrary data in MIME part headers?   that usage isn't exactly
> suggested by the name/purpose of "MSExec".
> 
> However making the MSExec plugin depend *on* another plugin that
> allows this, now *that* makes sense.

It's not even named MSExec in my tree anymore.  I'm still tweaking it,
though.  It's becoming a plugin for doing MIME tests, I'm not sure what
the exact scope is going to be, but right now, it tests the decoded MIME
part data for file(1)-style functionality.  Writing a plugin for just
MSExec satisfies a long-standing bug, but it has easily become a much
more generally useful plugin without the hard-coded values in the .pm.

I'm still playing with the format, but it might be something like this:

  loadplugin     Mail::SpamAssassin::Plugin::Binary

  magic          MICROSOFT_EXECUTABLE (0, 'MZ')
  magic          MICROSOFT_EXECUTABLE (128, 'PE\x00\x00')
  body           MICROSOFT_EXECUTABLE eval:check_binary()

  magic          PNG_IMAGE (0, '\x89PNG')
  body           PNG_IMAGE eval:check_binary()

Don't worry too much about the format.
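
The magic lines above pair a rule name with (offset, bytes) signatures checked against decoded part data. A minimal Python sketch of that kind of check follows; it is not the plugin's code. The table layout is hypothetical, check_binary only borrows its name from the example, and requiring every listed signature for a name to match is an assumption, since the proposed format does not say how multiple magic lines combine.

```python
# Signature table mirroring the example rules: name -> [(offset, bytes), ...].
MAGIC = {
    "MICROSOFT_EXECUTABLE": [(0, b"MZ"), (128, b"PE\x00\x00")],
    "PNG_IMAGE": [(0, b"\x89PNG")],
}


def check_binary(data: bytes, name: str) -> bool:
    """Return True if data carries every signature listed for name.

    Assumption: all (offset, bytes) pairs must match; an "any of" reading
    would use any() instead of all().
    """
    return all(
        data[offset:offset + len(signature)] == signature
        for offset, signature in MAGIC[name]
    )


# Made-up payloads standing in for decoded MIME part data.
png_data = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
exe_data = b"MZ" + b"\x00" * 126 + b"PE\x00\x00"

print(check_binary(png_data, "PNG_IMAGE"))
print(check_binary(exe_data, "MICROSOFT_EXECUTABLE"))
print(check_binary(png_data, "MICROSOFT_EXECUTABLE"))
```

Keeping the signatures in a table rather than in code is what makes the check reusable for other file types, which is the point made above about removing hard-coded values from the .pm.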
 
>> 2. we could clean up the header test types a bit and allow something
>>    like "full" on just the body.

> Not sure how that affects MIME part headers?

A stupid line-by-line pristine raw untouched undecoded unrendered body
test would be sufficient for 99% of MIME header tests, especially of the
type desired by most rule writers.

I wasn't even close to suggesting that we make people write an eval
test.

Daniel

Comment 8 Justin Mason 2004-09-16 06:34:49 UTC
Subject: Re:  There should be a rule type for mime part headers 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> It's not even named MSExec in my tree anymore.  I'm still tweaking it,
> though.  It's becoming a plugin for doing MIME tests, I'm not sure what
> the exact scope is going to be, but right now, it tests the decoded MIME
> part data for file(1)-style functionality.  Writing a plugin for just
> MSExec satisfies a long-standing bug, but it has easily become a much
> more generally useful plugin without the hard-coded values in the .pm.
> 
> I'm still playing with the format, but it might be something like this:
> 
>   loadplugin     Mail::SpamAssassin::Plugin::Binary
> 
>   magic          MICROSOFT_EXECUTABLE (0, 'MZ')
>   magic          MICROSOFT_EXECUTABLE (128, 'PE\x00\x00')
>   body           MICROSOFT_EXECUTABLE eval:check_binary()
> 
>   magic          PNG_IMAGE (0, '\x89PNG')
>   body           PNG_IMAGE eval:check_binary()
> 
> Don't worry too much about the format.

sounds like a good plan.  don't fear adding a new test type -- that should
(a) be easy enough, (b) be modular enough now that it's in a plugin anyway
and (c) be faster than eval code.

> >> 2. we could clean up the header test types a bit and allow something
> >>    like "full" on just the body.
> 
> > Not sure how that affects MIME part headers?
> 
> A stupid line-by-line pristine raw untouched undecoded unrendered body
> test would be sufficient for 99% of MIME header tests, especially of the
> type desired by most rule writers.

however, there are problems there.  for example, (a) there would be no
way to tell a Content-Foo: line *inside* a MIME part, from a Content-Foo:
line in a MIME part's header.  (b), efficiency may not be so hot
when you consider large messages.   (c), MIME headers can span multiple
lines, so passing the "full-full" text line-by-line wouldn't work there.
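
Points (a) and (c) can be illustrated with a small Python sketch that compares a raw line-by-line match against a match on parsed, unfolded part headers. The standard library email package stands in for SpamAssassin's MIME parser here, and the function names and the sample message are hypothetical.

```python
import re
from email import message_from_string

# Hypothetical message: the first part's body contains a line that looks
# like a header, and the second part has a folded Content-Type header.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain

Content-Disposition: attachment; this line is body text, not a header
--XX
Content-Type: image/gif;
 name="logo.gif"

R0lGODlhAQABAAAAACw=
--XX--
"""


def raw_line_hits(raw, pattern):
    """Match the pattern against every raw line of the message."""
    regex = re.compile(pattern)
    return [line for line in raw.splitlines() if regex.search(line)]


def part_header_hits(raw, pattern):
    """Match the pattern against unfolded MIME part headers only."""
    regex = re.compile(pattern)
    msg = message_from_string(raw)
    hits = []
    for part in msg.walk():
        if part is msg:
            continue  # skip the main message header
        for name, value in part.items():
            unfolded = re.sub(r"\r?\n[ \t]+", " ", value)
            line = f"{name}: {unfolded}"
            if regex.search(line):
                hits.append(line)
    return hits


# (a) the raw scan is fooled by body text that looks like a header
print(raw_line_hits(RAW, r"^Content-Disposition:"))
print(part_header_hits(RAW, r"^Content-Disposition:"))

# (c) the raw scan misses a header folded across two lines
print(raw_line_hits(RAW, r"^Content-Type: image/gif;.*name="))
print(part_header_hits(RAW, r"^Content-Type: image/gif;.*name="))
```

In the first pair the raw scan reports the body line while the parsed scan reports nothing; in the second pair the raw scan finds nothing while the parsed scan sees the whole unfolded Content-Type header.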

> I wasn't even close to suggesting that we make people write an eval
> test.

ok.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBSZZ0QTcbUG5Y7woRAoz8AKCGGAPgNOYJ5KUOVeF53LPZ8AvY6gCeKPJX
tBm/KJT+m+X8N/LNUxwCAJQ=
=Cr6B
-----END PGP SIGNATURE-----

Comment 9 Loren Wilton 2004-09-16 06:54:34 UTC
Subject: Re:  There should be a rule type for mime part headers

>A stupid line-by-line pristine raw untouched undecoded unrendered body
>test would be sufficient for 99% of MIME header tests, especially of the
>type desired by most rule writers.

But might be pig slow compared to something that only scanned the mime headers, especially if there are largish binaries in the file.  (As tends to happen these days with the new one-line spams that include the whole message in a GIF, or the phish spams that use an image map as a link to the real site, and also include two or three inline bank logo images.)

         Loren

Comment 10 Daniel Quinlan 2004-09-16 13:14:58 UTC
Subject: Re:  There should be a rule type for mime part headers

For the obvious tests, like filename tests, Content-Type, etc., I'll
probably (> 90%) add that to the new module.  No hurry to define which
ones yet.  :-)

Comment 11 Justin Mason 2005-01-07 12:49:53 UTC
this is now more important, as anti-spam rules are being found that *do* work
based on the MIME part headers, cf:

http://mail-archives.apache.org/eyebrowse/ReadMsg?listName=dev@spamassassin.apache.org&msgNo=14755

a quick list of 'pros' for this approach:

- if these are not eval tests, it will mean we avoid having links between rules
  and code in EvalTests.  That results in "dead code" if those rules get
  removed

- less code in EvalTests, which is overall a good thing

- it removes the current inability of non-developers to write efficient rules
  for those bits; today they have to use "full", which is very, very
  inefficient in this case

- should be quite efficient: the actual parts of text matched (the MIME
  headers) form less than 5% of the bytes in a typical message body, and
  there's generally < 15 lines of MIME-in-body headers in a typical
  message  (at a rough guess)

cons:

- it adds another rule type.  this is a very minor con, and one which could
  actually be a "pro" depending on your outlook ;)  (for example: allows
  third-party developers to use that code, and allows the matching algorithm
  and behaviour to be documented in Conf.pm.)

Comment 12 Loren Wilton 2005-01-07 16:08:21 UTC
Subject: Re:  There should be a rule type for mime part headers

> this is now more important, as anti-spam rules are being found that *do* work
> based on the MIME part headers, cf:

Well, not to be too picky, but when I opened this enhancement X many months back, it was because I had rules that would have worked if I could have written them.  So the concept of working rules isn't exactly new.

Comment 13 Justin Mason 2005-02-07 21:03:57 UTC
done!  r152620.