Bug 4469

Summary: Add a process/option to efficiently deal with very long mail messages
Product: Spamassassin
Reporter: Loren Wilton <lwilton>
Component: spamc/spamd
Assignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED DUPLICATE
Severity: enhancement
CC: apache, jm, parkerm, robert
Priority: P4    
Version: SVN Trunk (Latest Devel Version)   
Target Milestone: Future   
Hardware: Other   
OS: other   
Whiteboard:

Description Loren Wilton 2005-07-08 08:52:23 UTC
There are starting to be occasional reports of very large spams that make it 
past SA by virtue of the length cutoff limit.

Passing the entire message to SA would of course not be a Good Thing to do.  
However, armchair reasoning suggests that the spaminess of the message can 
probably be determined reasonably accurately from the headers and the first 
2..10K or so of the message body in virtually all cases.  In fact, this is 
probably virtually always true, even with messages in the 20K..250K range.

Suggest two things here: an option to SA (perhaps a special line on the front 
of the message stream itself) that tells it that this will be a partial 
message, and secondly a change to spamd to pass partial messages, along with 
this flag, when some size limit is exceeded.  

Since only a partial message is being passed, obviously spamd can't just pipe 
the entire message thru SA and out the other end.  Instead, it will have to get 
a declaration from SA of spaminess, and then do something itself with the 
original message.

The purpose of the flag to SA for a partial message would be twofold: it would 
disable some of the rules that expect correct mime-part terminations, and it 
might change the output from SA to perhaps only be headers for the message, 
plus a return value that somehow indicates spam.  This return value might be in 
the form of a real return value, or a first header line with special 
formatting, or perhaps something else.

If SA operating in this mode returned modified headers only, it would be 
trivial for the spamd child to remove the original message headers and replace 
them with the SA-supplied headers, and pipe the rest of the message straight 
through, thus avoiding the SA large-message overhead.
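
A minimal sketch of the flow proposed above, written in Python purely for illustration. Everything named here is hypothetical: the function names, the 10 KB body limit, and the `X-Partial-Message` marker line are assumptions made for the example, not an existing spamd or SpamAssassin interface. It also assumes LF line endings.

```python
from typing import Tuple

# Hypothetical values: neither the limit nor the marker header exists in SA.
BODY_LIMIT = 10 * 1024
PARTIAL_FLAG = b"X-Partial-Message: yes\n"


def split_message(raw: bytes) -> Tuple[bytes, bytes]:
    """Split at the first blank line; the headers keep their final newline."""
    head, sep, body = raw.partition(b"\n\n")
    if not sep:
        return raw, b""
    return head + b"\n", body


def partial_copy(raw: bytes, limit: int = BODY_LIMIT) -> bytes:
    """Copy handed to the scanner: flag line, headers, first `limit` body bytes."""
    headers, body = split_message(raw)
    if len(body) <= limit:
        return raw  # small enough, pass the whole message unchanged
    return PARTIAL_FLAG + headers + b"\n" + body[:limit]


def splice_headers(raw: bytes, new_headers: bytes) -> bytes:
    """Swap in scanner-supplied headers; the original body passes through as-is."""
    _, body = split_message(raw)
    return new_headers + b"\n" + body
```

In this sketch the caller would scan `partial_copy(raw)`, take the rewritten headers from the scanner, and emit `splice_headers(raw, new_headers)`, so the large body is never handed to the scanner.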

However this sort of option is implemented (if it is), it should be done in a 
way that tools calling SA or the SA API directly can fairly easily implement 
spam detection using this option.
Comment 1 Justin Mason 2005-07-08 10:10:14 UTC
yes, I agree something like this would be a worthwhile approach.  fwiw, I'd
prefer to do this entirely inside the Mail::SA modules, however.

IMO, we should take the qpsmtpd approach, too, in terms of storage of the full
pristine message -- if the size goes over the scanning-size threshold, the
remainder of the message data is written to a temp file instead of stored in
memory.  (we already use temp files anyway in parts of the code.)

This would allow us to scan even 100MB mails without breaking a sweat and
causing all those FAQs on the users list. ;)
Comment 2 Theo Van Dinter 2005-07-08 10:26:09 UTC
On Fri, Jul 08, 2005 at 10:10:15AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> IMO, we should take the qpsmtpd approach, too, in terms of storage of the full
> pristine message -- if the size goes over the scanning-size threshold, the
> remainder of the message data is written to a temp file instead of stored in
> memory.  (we already use temp files anyway in parts of the code.)

Yeah, I was thinking of something similar where text/* parts (at least)
are kept in memory, but other parts are stored in temp files since they'll
only be rarely used if at all.  Heck, even keep the filename in the part
information so that if a plugin wants to call an AV scanner, or something,
on that part it'd be easy to just point at the file instead of creating
a whole new temp file from the other temp file. ;)

In the original SA3 code, BTW, everything was a temp file.  That seemed
overly complicated, since each part can have multiple versions, etc., so
it was converted to the "all in memory" version.

> This would allow us to scan even 100MB mails without breaking a sweat and
> causing all those FAQs on the users list. ;)

Well, yes and no.  There's still the hit of storing the message in memory,
at least once, when it's initially read in.  We could store the pristine
body in a temp file, but then any full rules or the rewrite at the end
will cause that to come back in.

SA is really tuned for "everything in memory".

Comment 3 John Gardiner Myers 2005-07-08 10:36:02 UTC
I have a plugin that processes non-text parts in perl.  I would appreciate
continuing to be able to do so.
Comment 4 Auto-Mass-Checker 2005-07-08 10:40:42 UTC
> In the original SA3 code, BTW, everything was a temp file.  Since that
> seemed overly complicated since each part can have multiple versions,
> etc, it was converted to the "all in memory" version.

it's also slower. the qpsmtpd algorithm is nice, both for speed and RAM:
it goes like this:

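  use strict;
  use warnings;
  use File::Temp qw(tempfile);

  my $some_limit = 250 * 1024;   # scanning-size threshold in bytes (example value)
  my $buffer     = '';           # in-memory part we're prepared to scan
  my $size       = 0;            # bytes read so far
  my ($tmpfile_handle, $tmpfile_name);   # unset until the limit is exceeded

  while (my $line = <STDIN>) {
    $size += length $line;
    if ($size > $some_limit) {
      if (!$tmpfile_handle) {
        # generate tmpfile name and open it, only on first overflow
        ($tmpfile_handle, $tmpfile_name) = tempfile(UNLINK => 1);
      }
      # write the overflow to $tmpfile_handle
      print {$tmpfile_handle} $line or die "tmpfile write failed: $!";
    }
    else {
      $buffer .= $line;   # add to buffer
    }
  }
  close $tmpfile_handle if $tmpfile_handle;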

so the benefit is that the buffer contains the text part we're prepared to
scan, and the tmpfile is only ever opened (and disk I/O incurred) for
massive mails.

> > This would allow us to scan even 100MB mails without breaking a sweat and
> > causing all those FAQs on the users list. ;)
> 
> Well, yes and no.  There's still the hit of storing the message in memory,
> at least once, when it's initially read in.  We could store the pristine
> body in a temp file, but then any full rules or the rewrite at the end
> will cause that to come back in.

full rules: change the semantics to only match the first 250k of the
message data

rewrite: add a new iterator interface as well as the old all-in-RAM
interface
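
One way the two ideas above could fit together, as a hedged Python sketch rather than anything taken from the Mail::SA code: the class name `SpooledMessage`, its methods, and the chunk size are hypothetical, and the 250k limit is simply the figure mentioned above. The first `limit` bytes stay in RAM for scanning (what "full" rules would see), and an iterator replays the whole message for the final rewrite without loading the overflow into memory at once.

```python
import tempfile
from typing import BinaryIO, Iterator, Optional

SCAN_LIMIT = 250 * 1024  # the 250k figure from the comment; an assumption here


class SpooledMessage:
    """Keeps the first `limit` bytes in RAM and spills the rest to a temp file."""

    def __init__(self, limit: int = SCAN_LIMIT) -> None:
        self.limit = limit
        self.buffer = bytearray()
        self.overflow: Optional[BinaryIO] = None  # opened lazily

    def write(self, chunk: bytes) -> None:
        room = self.limit - len(self.buffer)
        if room > 0:
            self.buffer += chunk[:room]
            chunk = chunk[room:]
        if chunk:
            if self.overflow is None:
                # Disk I/O is only incurred once a message exceeds the limit.
                self.overflow = tempfile.TemporaryFile()
            self.overflow.write(chunk)

    def scan_text(self) -> bytes:
        """The in-memory prefix that rules would be matched against."""
        return bytes(self.buffer)

    def iter_chunks(self, size: int = 64 * 1024) -> Iterator[bytes]:
        """Iterator interface: replay the full message, buffer first."""
        yield bytes(self.buffer)
        if self.overflow is not None:
            self.overflow.seek(0)
            while True:
                block = self.overflow.read(size)
                if not block:
                    break
                yield block
```

Unlike the line-oriented loop above, this sketch splits at an exact byte offset; a real implementation would probably prefer to cut on a line or MIME-part boundary.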

--j.

Comment 5 Loren Wilton 2005-07-08 10:48:08 UTC
I have some (perhaps incorrect) memory that Bayes learning is limited to
some KB of the message since there was no real use to going further.
Perhaps the same limit would be reasonable for normal scanning?

Comment 6 Loren Wilton 2005-07-08 10:51:12 UTC
> > This would allow us to scan even 100MB mails without breaking a sweat and
> > causing all those FAQs on the users list. ;)
>
> Well, yes and no.  There's still the hit of storing the message in memory,
> at least once, when it's initially read in.  We could store the pristine
> body in a temp file, but then any full rules or the rewrite at the end
> will cause that to come back in.
>
> SA is really tuned for "everything in memory".

Which is why I suggested doing this in spamd and just passing the
'reasonable size' to SA itself.  It eliminates all those niggling worries
about some line of code somewhere suddenly sucking in 100MB of text to a
hash or the like.

Comment 7 Bob Menschel 2005-07-08 22:36:11 UTC
Ref bug 2977
Comment 8 Mark Martinec 2009-08-21 07:55:24 UTC
*** Bug 2977 has been marked as a duplicate of this bug. ***
Comment 9 Mark Martinec 2009-08-21 07:57:30 UTC
> Ref bug 2977

also:
  Bug 4469 - Add a process/option to efficiently deal with very long mail messages
Comment 10 Mark Martinec 2009-08-21 08:00:59 UTC
> > Ref bug 2977
> 
> also: Bug 4469
oops, a self-reference.

I meant:
  Bug 5939 - allow scanning of multi-MB spam
Comment 11 Mark Martinec 2009-08-21 08:02:54 UTC
*** Bug 5939 has been marked as a duplicate of this bug. ***
Comment 12 Mark Martinec 2009-08-21 08:10:33 UTC
See also Bug 6088, which solves the problem of long messages
in the Amavisd/SpamAssassin interaction.

Solving it for spamc/spamd would still require enhancements
to the spamc/spamd protocol.
Comment 13 Mark Martinec 2009-08-21 08:36:36 UTC
> See also Bug 6088, which solves the problem of long messages
> in the Amavisd/SpamAssassin interaction.

FYI, here is an excerpt from amavisd-new-2.6.3 release notes (2009-04-22):

- large messages beyond $sa_mail_body_size_limit are now partially passed
  to SpamAssassin and other spam scanners for checking: a copy passed to
  a spam scanner is truncated near or slightly past the indicated limit.
  Large messages are no longer given an almost free passage through spam
  checks.

  Note that message truncation can invalidate a DKIM or DK signature.
  If using (non-default) SpamAssassin rules to assign score points to mail
  with no valid signatures from authors which are expected to always provide
  a valid signature, the message truncation can cause false positives on
  these rules. As a workaround, to a truncated message passed to spam
  scanners, amavisd inserts a header field:
    X-Amavis-MessageSize: mmmmm, TRUNCATED to nnnnn
  which can be captured by SpamAssassin rules, e.g.:
    header __TRUNCATED X-Amavis-MessageSize =~ m{\A[^\n]*TRUNCATED}m
  and used in rules like NOTVALID_EBAY to prevent them from triggering.

  Starting with version 3.3.0 of SpamAssassin, its DKIM plugin understands
  the issue and receives undamaged DKIM signature objects directly from
  amavisd, so the above workaround is not needed. Also, a hit on a __TRUNCATED
  rule is automatically generated (explicit header rule is not necessary),
  just in case it might be useful for some purpose.
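
To make the truncation step concrete, here is a small Python sketch. It is not amavisd's code: the function name is hypothetical, it cuts at exactly `limit` bytes (the notes above say the real cut is "near or slightly past" the limit), and the header text for untruncated mail is an assumption. Only the `X-Amavis-MessageSize: mmmmm, TRUNCATED to nnnnn` form comes from the release notes.

```python
def truncated_copy(raw: bytes, limit: int) -> bytes:
    """Return the copy passed to spam scanners, with a size header prepended."""
    size = len(raw)
    if size > limit:
        # Header form quoted in the release notes above.
        header = b"X-Amavis-MessageSize: %d, TRUNCATED to %d\n" % (size, limit)
        return header + raw[:limit]
    # Assumed form for mail under the limit: size only, no TRUNCATED marker.
    return (b"X-Amavis-MessageSize: %d\n" % size) + raw
```

Because the header is prepended, it is the first `X-Amavis-MessageSize` field in the copy the scanner sees, which is what the `__TRUNCATED` rule above relies on.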
Comment 14 Justin Mason 2009-08-21 08:55:08 UTC
Mark: how do you deal with the danger of phishers inserting fake
'X-Amavis-MessageSize: mmmmm, TRUNCATED to nnnnn' headers in their templates
to avoid DKIM checks?  (you could avoid it by ensuring the header appears at
the start of the message, before any trusted+internal Received hdrs, if you're
not already doing that.)

Perhaps we should "standardize" an official TRUNCATED header name.

There is also the issue that HTML spam can be easily concocted that contains
an innocent-looking body for the first 512KB, then includes 3KB of spam
payload which uses CSS to hide the innocent text and display only the payload.
But I guess that may not be a showstopper.  Certainly not as bad as spam
getting past, unscanned. ;)
Comment 15 John Hardin 2009-08-21 09:10:13 UTC
(In reply to comment #14)
> Mark: how do you deal with the danger of phishers inserting fake
> 'X-Amavis-MessageSize: mmmmm, TRUNCATED to nnnnn' headers in their templates
> to avoid DKIM checks?  (you could avoid it by ensuring the header appears at
> the start of the message, before any trusted+internal Received hdrs, if you're
> not already doing that.)

A better way to avoid that problem is to have the header include the local hostname and IP address. Depending on position to determine trust is fragile. Depending on data a phisher is unlikely to know, and is thus unlikely to be able to successfully forge, is much more robust.

e.g.:

  As a workaround, to a truncated message passed to spam
    scanners, amavisd inserts a header field:
      X-Amavis-MessageSize: mmmmm, TRUNCATED to nnnnn on mta1.example.com [nn.nn.nn.nn]

Then the existing trust list can be used to vet the header.
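
A rough Python sketch of that vetting idea follows. It is illustrative only: the extended header format is the one suggested in this comment (not something amavisd emits), and `TRUSTED_RELAYS`, the host name, and the address are hypothetical stand-ins for whatever trust list a site already maintains.

```python
import re

# Hypothetical trust list of (hostname, IP) pairs for local relays.
TRUSTED_RELAYS = {("mta1.example.com", "192.0.2.10")}

# Matches the header value proposed above:
#   mmmmm, TRUNCATED to nnnnn on host [ip]
HEADER_RE = re.compile(
    r"\A\d+, TRUNCATED to \d+ on (?P<host>\S+) \[(?P<ip>[0-9a-fA-F.:]+)\]\Z"
)


def is_trusted_truncation(value: str) -> bool:
    """True only if the header names a relay found in the local trust list."""
    m = HEADER_RE.match(value.strip())
    return bool(m) and (m.group("host").lower(), m.group("ip")) in TRUSTED_RELAYS
```

A forged header would have to name both a trusted host and its address to pass, which is the data a phisher is assumed not to know.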
Comment 16 Mark Martinec 2009-08-21 09:27:43 UTC
> Mark: how do you deal with the danger of phishers inserting fake
> 'X-Amavis-MessageSize: mmmmm, TRUNCATED to nnnnn' headers in their templates
> to avoid DKIM checks?  (you could avoid it by ensuring the header appears at
> the start of the message, before any trusted+internal Received hdrs, if you're
> not already doing that.)

I already do that. The header field is always prepended to a message when
passing it to SA, and the rule (as suggested above) only checks for the
*first* occurrence of such header field:

header __TRUNCATED X-Amavis-MessageSize =~ m{\A[^\n]*TRUNCATED}m
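
Why this only sees the first occurrence can be shown with an equivalent pattern in Python. The sketch assumes, as the rule above does, that repeated header fields of the same name reach the pattern as one newline-joined string; the function name is hypothetical. `\A` anchors at the very start of that string and `[^\n]*` cannot cross into a later line, so a forged copy further down is never consulted.

```python
import re
from typing import List

# Same pattern as the rule; in Python, \A matches only at the start of the
# string even when MULTILINE is set.
FIRST_LINE_TRUNCATED = re.compile(r"\A[^\n]*TRUNCATED", re.MULTILINE)


def first_header_says_truncated(values: List[str]) -> bool:
    """values: all X-Amavis-MessageSize field values, in message order."""
    joined = "\n".join(values)
    return FIRST_LINE_TRUNCATED.search(joined) is not None
```

For example, `["500000, TRUNCATED to 420000", "1234"]` matches, while `["1234", "500000, TRUNCATED to 420000"]` does not, because the marker appears only in the second field.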

> Perhaps we should "standardize" an official TRUNCATED header name.

Wouldn't hurt.

> There is also the issue that HTML spam can be easily concocted that contains
> an innocent-looking body for the first 512KB, then includes 3KB of spam
> payload which uses CSS to hide the innocent text and display only the payload.
> But I guess that may not be a showstopper.  Certainly not as bad as spam
> getting past, unscanned. ;)

I'm aware of this, but for the time being this isn't being
exploited. It's certainly no worse than not checking at all.
Comment 17 Mark Martinec 2009-08-21 09:36:59 UTC
> I already do that. The header field is always prepended to a message when
> passing it to SA, and the rule (as suggested above) only checks for the
> *first* occurrence of such header field: 
> header __TRUNCATED X-Amavis-MessageSize =~ m{\A[^\n]*TRUNCATED}m

I probably wasn't clear. The X-Amavis-MessageSize header field is
always inserted, regardless of truncation. If a message is truncated,
it just carries the additional text 'TRUNCATED to ...'.
Comment 18 Mark Martinec 2009-08-21 11:27:37 UTC
Some statistics. I checked the last three weeks of our logs: 9462 messages
were larger than our configured limit of 420 kB, so only their first
420 kB were passed to SpamAssassin, i.e. they were truncated.

Of these 9462 messages,
9184 were ham, 111 were spam,
and 167 unclassified (delivered anyway, half of them probably spam).

So, 111/9462 = 1.2% of big messages were spam.
Not many, but still worth blocking.
Comment 19 Mark Martinec 2009-08-24 08:34:25 UTC
> So, 111/9462 = 1.2% of big messages were spam.
> Not many, but still worth blocking.

Redid the counting, this time I excluded all internally-originating mail
(= outbound or internal-to-internal), taking into account only inbound mail:

124 (spam) / 4900 (all) = 2.5 % of inbound big messages were spam.
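
The arithmetic behind both percentages can be checked with a few lines of Python; the counts are the ones reported in this comment and the previous one, and the helper name is just for illustration.

```python
def spam_share(spam: int, total: int) -> float:
    """Percentage of `total` that is spam, rounded to one decimal place."""
    return round(100.0 * spam / total, 1)


# All truncated mail: 9184 ham + 111 spam + 167 unclassified.
all_total = 9184 + 111 + 167
all_share = spam_share(111, all_total)

# Inbound-only recount.
inbound_share = spam_share(124, 4900)

print(all_total, all_share, inbound_share)
```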
Comment 20 Justin Mason 2009-08-31 08:39:46 UTC
(In reply to comment #19)
> 124 (spam) / 4900 (all) = 2.5 % of inbound big messages were spam.

that's quite significant -- I think this is something we should deal with "officially"...
Comment 21 Henrik Krohns 2019-07-07 15:45:16 UTC
This was resolved in 3.4.3 with body_part_scan_size.

*** This bug has been marked as a duplicate of bug 6582 ***