SA Bugzilla – Bug 3781
There should be a rule type for mime part headers
Last modified: 2005-02-07 12:03:57 UTC
This is slightly related to bug 3780, at least in concept. With some of the newer spam formats there is gold to be found in the mime headers for attachments, typically either the file types, names, or sometimes bogus encoding formats. However, there is currently (to the best of my knowledge) no way to grab the mime headers to look at this stuff. As Theo pointed out in another discussion, mime headers are not body rules. They also probably aren't really header rules either. So I think the best thing to do would be to make a new rule type, perhaps 'mimeheader' or 'mimehdr' or some such. Within the new rule type I would expect it to work much like the current header rules, where you can grep a particular type of header, or use ALL.
true. would it be worthwhile being able to match against MIME hdrs from specific sub-parts of a message, or is matching against MIME hdrs from all sub-parts at once OK? AFAICS, the latter should be fine, but I'm not certain. btw I'm also +1 on defining a new rule type for this.
Subject: Re: There should be a rule type for mime part headers

> true. would it be worthwhile being able to match against MIME hdrs from
> specific sub-parts of a message, or is matching against MIME hdrs from all
> sub-parts at once OK?
>
> AFAICS, the latter should be fine, but I'm not certain.

I debated this when I wrote the idea up. The best I can say at the moment is "I don't know". I *do* know that when manually examining a spam I'm usually interested in one particular mime header and not others that may be there. But I can't think how I would be able to describe any particular header beyond "look at them until I find the one I want". Which isn't very useful.

I don't think that (for example) concatenating all of the Content-Encoding parts from all headers into a single combined header (like is done for Received in the main header) would be particularly useful or a particularly good idea. I think they should probably be served to the RE as individual 'lines', and the rule simply called a sufficient number of times.

As for ALL, I think it should serve each mime header as an entity to the rule, but not concatenate all headers into a single entity.

Doing things this way would make it difficult to do tests across multiple headers at once. But so far I haven't found a need to do that, and I'm concerned about getting false results if the headers got combined. I think to handle that we could probably do something redundant like ALL:full or ALL:ALL to tell the rule-driver to combine all the appropriate things into a single string. Right now I can't imagine actually using that construct. But that doesn't mean that I might not find a rare use for it tomorrow.

Loren
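To make the "serve each header individually" idea concrete, here is a minimal sketch in Python using the standard `email` package. It is not SpamAssassin code; the helper names `part_header_values` and `rule_hits` and the sample message are made up for illustration. Each sub-part header value is handed to the regex on its own, so a pattern can never match across two different parts by accident.

```python
import re
from email import message_from_string

# Hypothetical two-part sample message, used only for illustration.
RAW = """\
From: a@example.com
Subject: test
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

hello
--XX
Content-Type: image/gif; name="logo.gif"
Content-Transfer-Encoding: base64

R0lGODlhAQABAAAAACw=
--XX--
"""


def part_header_values(msg, name):
    """Yield each value of header `name` from every MIME sub-part, one at a time."""
    for part in msg.walk():
        if part is msg:
            continue  # skip the top-level (main) message headers
        for value in part.get_all(name, []):
            yield value


def rule_hits(msg, name, pattern):
    """Return True if any single sub-part header value matches `pattern`."""
    regex = re.compile(pattern)
    return any(regex.search(value) for value in part_header_values(msg, name))


msg = message_from_string(RAW)
encodings = list(part_header_values(msg, "Content-Transfer-Encoding"))

# Matching per header: only a value that is exactly "base64" can hit.
print(rule_hits(msg, "Content-Transfer-Encoding", r"^base64$"))
# A pattern that needs two values side by side cannot hit per header...
print(rule_hits(msg, "Content-Transfer-Encoding", r"7bit\s+base64"))
# ...but it does hit once the values are concatenated into one string.
print(bool(re.search(r"7bit\s+base64", "\n".join(encodings))))
```

The last two lines show the trade-off discussed above: per-header matching rules out cross-header tests, while concatenation makes them possible at the cost of matches that span unrelated parts.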
I think this can be handled in one of two ways:

1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do this in addition to what I'm working on now (accessing raw MIME part data)
2. we could clean up the header test types a bit and allow something like "full" on just the body.

I think I would probably object to a new core test type for just MIME part headers.
On Wed, Sep 15, 2004 at 12:39:01PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do
> this in addition to what I'm working on now (accessing raw MIME part
> data)
>
> I think I would probably object to a new core test type for just MIME
> part headers.

There was a short discussion on the dev list. My suggestion was that there be a plugin to implement the rule type, since the information in body mime headers has not been found to be very useful in anti-spam work. Depending on whether or not that has changed and is now useful, the rule type could be made "standard".

I responded to Chris Santerre's message requesting the plugin (ie: what kind of information/queries are they looking to do), but have not received a response.

Since the body mime bits and the binary bits are pretty much the same, I think getting them together in the same plugin would be fine. The code to look in the body mime headers is pretty trivial though, so a separate plugin may be good depending on what is actually desired in such a plugin.
> I responded to Chris Santerre's message requesting the plugin (ie:
> what kind of information/queries are they looking to do), but have not
> received a response.

I'm not Chris, but a mime boundary is really a pretty simple thing, for instance:

    Content-Type: image/gif; name="Vnguku.GIF"
    Content-Transfer-Encoding: base64
    Content-ID: <part1.08020704.07020304@jtmju@eurekabroadband.com>
    Content-Disposition: inline; filename="Vnguku.GIF"

There just isn't a lot here, and I'd want to treat it as just more header items. Possibly even treating them as though they were part of the main header would be sufficient. But I'd prefer to be able to treat them as individual headers and not agglomerate the contents of identically-named header items into a single string.

I can't find a good example in the afternoon's spam collection, and being on a Windoze box it is too much work to hunt around just for an example. But generally I'm interested in the value of Content-Transfer-Encoding, looking for what appear to be bogus values that are being treated as text/plain.

The file name can also potentially be useful. Obviously those that want to use SA to strip bogus virus warnings would be interested in this field, since it could be a virus, or it could be a stupid name like "Norton.Deleted" that can be used very easily to tell the message is junk. I'd personally be more interested in looking for things like "CitiLogo.gif", as is typical in a CitiBunk phish.

As in all of these things, sometimes knowing the case of the words can be interesting, just as with header items or body items. Thus, semantically, I'd like to treat them as normal header items can be treated.

> Since the body mime bits and the binary bits are pretty much the same, I think
> getting them together in the same plugin would be fine. The code to look in
> the body mime headers is pretty trivial though, so a separate plugin may be
> good depending on what is actually desired in such a plugin.

This seems not unreasonable. Perhaps a fake 'header' type of Mime-Body or the like could be used to access the first 100 bytes or so of the body by default, or the whole body if Mime-Body:full or some such were specified. I can't immediately see using this myself at the moment, but I wouldn't want to rule it out. It could be useful for looking for something like a signature attachment with a particular value.

> 1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do
> this in addition to what I'm working on now (accessing raw MIME part
> data)

This seems feasible, if I'm understanding you correctly.

> 2. we could clean up the header test types a bit and allow something
> like "full" on just the body.

I'm not sure I follow this here. While an earlier bug pointed out that the current implementation of rawbody is rather useless, forcing one to use 'full' far too often, I see this as unrelated. I'd much rather not have to scan across the entire message looking for mime header encodings and hope that what I found was really a mime header. Too much chance for an FP, and too hard to do the correct mime part separation in normal rules. SA already knows how to split things into parts, and that should be leveraged.

> I think I would probably object to a new core test type for just MIME
> part headers.

I would not object to them being lumped into the current header tests. Short of that, I'd like the appearance of a new test type, whether that is a new object or a way of using a new eval/plugin.
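As a rough illustration of the two checks described above (bogus Content-Transfer-Encoding values and telltale attachment names), here is a small Python sketch built on the standard `email` package. It is not how SpamAssassin does it; the function name `suspicious_parts`, the sample message, and the choice of treating only the five RFC 2045 encodings as known are all assumptions made for the example.

```python
from email import message_from_string

# The five Content-Transfer-Encoding values defined by RFC 2045. Treating
# everything else as "bogus" is a simplification for this sketch.
KNOWN_ENCODINGS = {"7bit", "8bit", "binary", "base64", "quoted-printable"}

# Hypothetical sample message, used only for illustration.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="BB"

--BB
Content-Type: text/plain
Content-Transfer-Encoding: bogus-enc

click here
--BB
Content-Type: image/gif; name="CitiLogo.gif"
Content-Transfer-Encoding: base64
Content-Disposition: inline; filename="CitiLogo.gif"

R0lGODlhAQABAAAAACw=
--BB--
"""


def suspicious_parts(raw_message):
    """Collect (kind, value) findings from the headers of each MIME sub-part."""
    msg = message_from_string(raw_message)
    findings = []
    for part in msg.walk():
        if part is msg:
            continue  # only sub-part headers are of interest here
        cte = part.get("Content-Transfer-Encoding")
        if cte is not None and cte.strip().lower() not in KNOWN_ENCODINGS:
            findings.append(("bogus-encoding", cte.strip()))
        # get_filename() reads Content-Disposition's filename parameter and
        # falls back to Content-Type's name parameter; case is preserved.
        filename = part.get_filename()
        if filename:
            findings.append(("filename", filename))
    return findings


findings = suspicious_parts(SAMPLE)
print(findings)
```

Because the file name is returned with its original case, a rule author could still distinguish "CitiLogo.gif" from "citilogo.gif", which is the case-sensitivity point raised above.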
> I think this can be handled in one of two ways:
> 1. the MIME-ish "Binary" plugin I'm working on based on MSExec could do
> this in addition to what I'm working on now (accessing raw MIME part
> data)

-0: that doesn't make much sense; why would a plugin intended to detect MS executables, also allow third-parties to match against arbitrary data in MIME part headers? that usage isn't exactly suggested by the name/purpose of "MSExec".

However making the MSExec plugin depend *on* another plugin that allows this, now *that* makes sense.

> 2. we could clean up the header test types a bit and allow something
> like "full" on just the body.

Not sure how that affects MIME part headers?

> I think I would probably object to a new core test type for just MIME
> part headers.

I think we have a significant "blind spot", in that we currently have no way for a rule to match against those part headers -- apart from writing an eval rule. I think we may even have a situation where we *could* match some spam through a rule that did that, and we're just not looking yet.

I don't think being able to access them in an eval rule is good enough, btw. That's too high a jump to make for most rule authors; it requires familiarity with the code a lot more than writing a rule does. We shouldn't require that rule authors know how to write eval tests.

In addition, eval tests are slower than a native rule type. (Eval tests shift quite a lot of parsing overhead from startup-time to runtime -- due to the overhead of argument marshalling, an individual eval ' ' scope for the function call, etc. Quite a lot slower than the other rule types. take a look at DProf output...)

So I think we do need some way for user rules to access this code. I'd be in favour of a plugin that implements this new rule type.

--j.
> -0: that doesn't make much sense; why would a plugin intended to
> detect MS executables, also allow third-parties to match against
> arbitrary data in MIME part headers? that usage isn't exactly
> suggested by the name/purpose of "MSExec".
>
> However making the MSExec plugin depend *on* another plugin that
> allows this, now *that* makes sense.

It's not even named MSExec in my tree anymore. I'm still tweaking it, though. It's becoming a plugin for doing MIME tests, I'm not sure what the exact scope is going to be, but right now, it tests the decoded MIME part data for file(1)-style functionality. Writing a plugin for just MSExec satisfies a long-standing bug, but it has easily become a much more generally useful plugin without the hard-coded values in the .pm.

I'm still playing with the format, but it might be something like this:

    loadplugin Mail::SpamAssassin::Plugin::Binary

    magic MICROSOFT_EXECUTABLE (0, 'MZ')
    magic MICROSOFT_EXECUTABLE (128, 'PE\x00\x00')
    body MICROSOFT_EXECUTABLE eval:check_binary()

    magic PNG_IMAGE (0, '\x89PNG')
    body PNG_IMAGE eval:check_binary()

Don't worry too much about the format.

>> 2. we could clean up the header test types a bit and allow something
>> like "full" on just the body.

> Not sure how that affects MIME part headers?

A stupid line-by-line pristine raw untouched undecoded unrendered body test would be sufficient for 99% of MIME header tests, especially of the type desired by most rule writers.

I wasn't even close to suggesting that we make people write an eval test.

Daniel
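For readers unfamiliar with file(1)-style tests, the proposed `magic` lines can be read as (offset, bytes) pairs checked against the decoded part data. Below is a minimal Python sketch of that idea. It is not the plugin's code: the table `MAGIC_RULES` just mirrors the draft config above, and requiring every pair for a name to match (a logical AND) is an assumption, since the comment does not say how multiple `magic` lines combine.

```python
# (offset, expected bytes) pairs per rule name, copied from the draft config.
MAGIC_RULES = {
    "MICROSOFT_EXECUTABLE": [(0, b"MZ"), (128, b"PE\x00\x00")],
    "PNG_IMAGE": [(0, b"\x89PNG")],
}


def check_binary(data, rule_name, rules=MAGIC_RULES):
    """Return True if `data` has the expected bytes at every listed offset.

    Assumption: all pairs registered under one rule name must match.
    """
    return all(
        data[offset:offset + len(magic)] == magic
        for offset, magic in rules[rule_name]
    )


# Synthetic test data: just enough bytes to satisfy the offsets above.
fake_exe = b"MZ" + b"\x00" * 126 + b"PE\x00\x00"
fake_png = b"\x89PNG\r\n\x1a\n"

print(check_binary(fake_exe, "MICROSOFT_EXECUTABLE"))
print(check_binary(fake_png, "PNG_IMAGE"))
print(check_binary(fake_png, "MICROSOFT_EXECUTABLE"))
```

Slicing past the end of `data` simply yields a shorter byte string, so short inputs fail the comparison instead of raising an error.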
> It's not even named MSExec in my tree anymore. I'm still tweaking it,
> though. It's becoming a plugin for doing MIME tests, I'm not sure what
> the exact scope is going to be, but right now, it tests the decoded MIME
> part data for file(1)-style functionality. Writing a plugin for just
> MSExec satisfies a long-standing bug, but it has easily become a much
> more generally useful plugin without the hard-coded values in the .pm.
>
> I'm still playing with the format, but it might be something like this:
>
> loadplugin Mail::SpamAssassin::Plugin::Binary
>
> magic MICROSOFT_EXECUTABLE (0, 'MZ')
> magic MICROSOFT_EXECUTABLE (128, 'PE\x00\x00')
> body MICROSOFT_EXECUTABLE eval:check_binary()
>
> magic PNG_IMAGE (0, '\x89PNG')
> body PNG_IMAGE eval:check_binary()
>
> Don't worry too much about the format.

sounds like a good plan. don't fear adding a new test type -- that should (a) be easy enough, (b) be modular enough now that it's in a plugin anyway and (c) be faster than eval code.

> >> 2. we could clean up the header test types a bit and allow something
> >> like "full" on just the body.
>
> > Not sure how that affects MIME part headers?
>
> A stupid line-by-line pristine raw untouched undecoded unrendered body
> test would be sufficient for 99% of MIME header tests, especially of the
> type desired by most rule writers.

however, there are problems there. for example, (a) there would be no way to tell a Content-Foo: line *inside* a MIME part, from a Content-Foo: line in a MIME part's header. (b), efficiency may not be so hot when you consider large messages. (c), MIME headers can span multiple lines, so passing the "full-full" text line-by-line wouldn't work there.

> I wasn't even close to suggesting that we make people write an eval
> test.

ok.

--j.
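Problems (a) and (c) above are easy to reproduce. The following Python sketch (standard `email` package, made-up sample message and variable names) runs the same filename test two ways: line by line over the raw text, and over the parsed sub-part headers. The raw scan picks up a Content-Disposition line that is really body text and misses the genuine header because it is folded across two lines.

```python
import re
from email import message_from_string

# Hypothetical sample: part 1 has a header look-alike in its body text,
# part 2 has a real Content-Disposition header folded over two lines.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="ZZ"

--ZZ
Content-Type: text/plain

Content-Disposition: attachment; filename="evil.exe"
--ZZ
Content-Type: application/octet-stream
Content-Disposition: attachment;
 filename="report.pdf"

AAAA
--ZZ--
"""

FILENAME_LINE = re.compile(r'^Content-Disposition:.*filename="([^"]+)"')

# Approach 1: naive line-by-line scan of the raw, unparsed message.
raw_hits = []
for line in RAW.splitlines():
    match = FILENAME_LINE.match(line)
    if match:
        raw_hits.append(match.group(1))

# Approach 2: let the MIME parser split parts and unfold headers first.
msg = message_from_string(RAW)
parsed_hits = []
for part in msg.walk():
    if part is msg:
        continue  # skip the top-level message headers
    filename = part.get_filename()
    if filename:
        parsed_hits.append(filename)

print(raw_hits)     # what the raw scan believes the attachment names are
print(parsed_hits)  # what the part headers actually say
```

The two lists disagree completely for this message, which is the argument for matching against headers the parser has already separated out.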
> A stupid line-by-line pristine raw untouched undecoded unrendered body
> test would be sufficient for 99% of MIME header tests, especially of the
> type desired by most rule writers.

But might be pig slow compared to something that only scanned the mime headers, especially if there are largish binaries in the file. (As tends to happen these days with the new one-line spams that include the whole message in a GIF, or the phish spams that use an image map as a link to the real site, and also include two or three inline bank logo images.)

Loren
For the obvious tests, like filename tests, Content-Type, etc., I'll probably (> 90%) add that to the new module. No hurry to define which ones yet. :-)
this is now more important, as anti-spam rules are being found that *do* work based on the MIME part headers, cf: http://mail-archives.apache.org/eyebrowse/ReadMsg?listName=dev@spamassassin.apache.org&msgNo=14755

a quick list of 'pros' for this approach:

- if these are not eval tests, it will mean we avoid having links between rules and code in EvalTests. Such links result in "dead code" if those rules get removed
- less code in EvalTests, which is overall a good thing
- it fixes the current inability of non-developers to write efficient rules for those bits; right now they have to use "full", which is very, very inefficient in this case
- should be quite efficient: the actual parts of text matched (the MIME headers) form less than 5% of the bytes in a typical message body, and there's generally < 15 lines of MIME-in-body headers in a typical message (at a rough guess)

cons:

- it adds another rule type. this is a very minor con, and one which could actually be a "pro" depending on your outlook ;) (for example: allows third-party developers to use that code, and allows the matching algorithm and behaviour to be documented in Conf.pm.)
> this is now more important, as anti-spam rules are being found that *do* work
> based on the MIME part headers, cf:

Well, not to be too picky, but when I opened this enhancement X many months back, it was because I had rules that would have worked if I could have written them. So the concept of working rules isn't exactly new.
done! r152620.