Bug 2977

Summary: RFE: large messages should have first half scanned
Product: Spamassassin
Reporter: Robert Kiessling <robert>
Component: spamc/spamd
Assignee: SpamAssassin Developer Mailing List <dev>
Severity: enhancement    
Priority: P5    
Version: 2.63   
Target Milestone: Future   
Hardware: All   
OS: All   

Description Robert Kiessling 2004-01-28 07:10:40 UTC
Currently spamc does not do any spam checking on large messages.

Consequently we have to configure the message size limit as high as reasonably 
possible to avoid too many false negatives caused by large spam messages.

My suggestion is to introduce a message size limit in the following way.

Assume size of the message to be spam checked is SIZE with 
SIZE=HEADERSIZE+BODYSIZE, the (old, absolute) size limit is LIMIT and the newly 
introduced body size limit is BODYLIMIT.

If SIZE is smaller than LIMIT, then pass the message to spamd.

If HEADERSIZE + BODYLIMIT is larger than LIMIT, then skip the message.

Otherwise:

- split the message into two parts: the initial segment of size 
- pass the initial segment to spamd
- take the result of spamd, append the remainder to it. This reconstructed 
message is the final spam-checked email
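
The steps above could be sketched roughly as follows. This is an illustration in Python, not the actual spamc code and not the Perl program mentioned in this report. The names `find_header_size`, `scan_partial` and the `scan` callback (which stands in for the spamd round trip) are hypothetical, and the sketch assumes that the initial segment is the header plus the first BODYLIMIT bytes of the body, which the description does not spell out.

```python
from typing import Callable


def find_header_size(message: bytes) -> int:
    """Return HEADERSIZE: the header block plus the blank separator line."""
    hits = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        idx = message.find(sep)
        if idx != -1:
            hits.append(idx + len(sep))
    # No blank line found: treat the whole message as header.
    return min(hits) if hits else len(message)


def scan_partial(
    message: bytes,
    limit: int,
    body_limit: int,
    scan: Callable[[bytes], bytes],
) -> bytes:
    """Apply the proposed size handling and return the resulting message.

    `scan` is a hypothetical stand-in for passing data to spamd and
    reading back the rewritten message (report_safe=0 assumed, so only
    headers change and the body can be appended verbatim).
    """
    # SIZE smaller than LIMIT: pass the whole message to spamd.
    if len(message) < limit:
        return scan(message)

    header_size = find_header_size(message)

    # HEADERSIZE + BODYLIMIT larger than LIMIT: skip the message.
    if header_size + body_limit > limit:
        return message

    # Assumption: the initial segment is header + first BODYLIMIT body bytes.
    cut = header_size + body_limit
    initial, remainder = message[:cut], message[cut:]

    # Scan only the initial segment, then append the untouched remainder.
    return scan(initial) + remainder
```

With a `scan` callback that only adds headers, the returned message has the same body as the original, but spamd has only seen the first `header_size + body_limit` bytes.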

Using the above algorithm we can set the size limit rather low, say 5kB. Spam 
checking will be pretty much as effective as if the full message was checked, 
while consuming considerably fewer resources since messages in the range of, say, 
10kB-250kB are checked faster.

I am currently using a similar strategy in a small Perl program that calls SA 
directly, and it works well.

The above works with report_safe=0. If report_safe is set, the spamd/spamc 
interaction would need to be modified so that the final MIME boundary marker 
can be added in the right place. It should still be worthwhile, though.
Comment 1 Justin Mason 2004-03-16 21:41:33 UTC
I can't see a good way to do this with report_safe 1, sorry...
have you seen many very large spam messages?

I don't think this is important.
Comment 2 Daniel Quinlan 2004-03-16 23:35:27 UTC
I'm not sure this belongs in spamc, maybe in SA itself.
Comment 3 Daniel Quinlan 2004-08-27 16:51:01 UTC
moving performance and accuracy bugs to 3.1.0 milestone
Comment 4 Daniel Quinlan 2004-08-27 17:25:50 UTC
moving performance and accuracy bugs to 3.1.0 milestone
Comment 5 Justin Mason 2006-07-10 12:14:22 UTC
btw, thinking about this; it's actually pretty trivial for a MIME-structured
HTML message to contain 300KB of innocuous-looking body text, then 5KB of
"payload" HTML.  on scan, SA would scan the first 250KB and it'd all look
innocuous; however, on display, the 5KB would use DOM or CSS tricks to "overlay"
the payload in place of the decoy text.

I'm not sure this suggestion is viable as a result, since it's vulnerable to
this evasion.
Comment 6 Mark Martinec 2009-08-21 07:55:24 UTC
Similar to:
  Bug 4469 - Add a process/option to efficiently deal with very long mail messages
  Bug 5939 - allow scanning of multi-MB spam

*** This bug has been marked as a duplicate of bug 4469 ***