Bug 6855

Summary: Rules with extremely poor CPU_time / mail_size ratio
Product: Spamassassin Reporter: Mark Martinec <Mark.Martinec>
Component: RulesAssignee: SpamAssassin Developer Mailing List <dev>
Status: NEW ---    
Severity: normal CC: apache, jhardin, kmcgrail, wolfsplat
Priority: P2    
Version: 3.4 SVN branch   
Target Milestone: Undefined   
Hardware: All   
OS: All   
Attachments: The diagram: CPU usage vs. mail size
Mail sample 1
Mail sample 2
Mail sample 3
Profiling report - CPU time per rule, as reported by NYTProf for four sample messages

Description Mark Martinec 2012-10-24 14:56:03 UTC
Here is another problem report referring to long SpamAssassin scan times
with larger mail. Unlike Bug 6854, which concentrates on code hotspots,
this one refers to problematic rules (and/or their implementation).
There may be other similar PRs elsewhere.

Attached is a png image of a plot. The scattergram shows dots for
each mail message being processed on our system for the last two days.
The x-axis is a mail size in bytes, the y-axis is a CPU usage time
in milliseconds used only by SpamAssassin, as reported by a getrusage(2)
system call (via Unix-Getrusage-0.03 perl module).

Initialization and startup times are excluded from the measurements,
and so are mail transfer times. Measuring the CPU time instead of
elapsed time also largely removes the effect of network rules.

What caught my eye is that the ratio of CPU time vs. mail size
does not fall on a narrow line, but forms a rather wide wedge.
While most samples fall on the lower side of the wedge (showing
a slope of about 355 kB/s with a fixed offset of about 0.4 s),
some samples fall near the upper side of the wedge, showing
an extremely wasteful linear slope of about 11.5 kB/s.
These samples are shown in blue on the diagram.

I managed to get hold of a couple of such extreme messages
which fell near the upper edge of the wedge. Luckily some
were addressed to me - like a SA cron job as sent to
"nitemc@spamassassin2.zones.apache.org" and a
"bad sandbox rules report". Similar were a couple of in-house
cron jobs or rsync reports. But not all extreme samples were
the result of automatic runs, there were also legitimate
googlegroups chats with long thread citations and some reports
to authors from a publishing company. Will attach two or three
of these.

Note these were not extremely large messages, here are sample
sizes of four of such cases: 187kB, 154kB, 73kB, 93kB.
Profiling these four with a NYTProf profiler in a command-line
spamassassin run shows similar results as can be obtained
through a HitFreqsRuleTiming.pm plugin.

Will attach the results. Here is a summary - a top few of
the most "CPU resource usage per mail size" wasteful rules
for each of these four (quite different) messages:

785ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
777ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
299ms  72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
216ms  72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
128ms  72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
126ms  72_active.cf, rule __FILL_THIS_FORM_LONG1, 
59.5ms 72_active.cf, rule __LOTSA_MONEY_03,

645ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
641ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
179ms  72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
145ms  72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
141ms  72_active.cf, rule __FILL_THIS_FORM_LONG1, 
110ms  72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
48.6ms 72_active.cf, rule __LOTSA_MONEY_03, 

219ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
212ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
84.2ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
62.6ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
28.6ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
27.7ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 
15.7ms 72_active.cf, rule __LOTSA_MONEY_03, 

236ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
225ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
89.9ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
62.9ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
34.9ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 
34.5ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
18.8ms 72_active.cf, rule T_LOTTO_URI, 
18.4ms 72_active.cf, rule __GAPPY_HTML_02, 
16.8ms 72_active.cf, rule __LOTSA_MONEY_03, 
15.9ms 72_active.cf, rule __LOTSA_MONEY_01, 

I'm not sure what can/should be done about it,
but people tell me it is not acceptable that it takes
30 seconds of intensive CPU spinning to process a 350 kB message
(on our AMD Opteron, not the very latest and greatest,
but still a decent hardware).
Comment 1 Mark Martinec 2012-10-24 14:56:55 UTC
Created attachment 5103 [details]
The diagram: CPU usage vs. mail size
Comment 2 Mark Martinec 2012-10-24 14:59:07 UTC
Created attachment 5104 [details]
Mail sample 1
Comment 3 Mark Martinec 2012-10-24 14:59:35 UTC
Created attachment 5105 [details]
Mail sample 2
Comment 4 Mark Martinec 2012-10-24 15:02:26 UTC
Created attachment 5106 [details]
Mail sample 3
Comment 5 AXB 2012-10-24 15:03:49 UTC
Assuming you tested this with Amavisd, could you run the same test using spamc/spamd and the same ruleset?
Comment 6 Mark Martinec 2012-10-24 15:07:04 UTC
Created attachment 5107 [details]
Profiling report - CPU time per rule, as reported by NYTProf for four sample messages
Comment 7 Mark Martinec 2012-10-24 15:12:46 UTC
> These samples are shown in blue on the diagram.

In red, sorry. The rest of the samples is in green.
Comment 8 Mark Martinec 2012-10-24 15:17:10 UTC
> Assuming you tested this with Amavisd,

Indeed. The dot-samples in the diagram were taken from its log.

> could you run the same test using spamc/spamd and the same ruleset?

The samples in the red category which I could get hold of
were re-measured in a command-line spamassassin (ignoring the
startup times), with similar results. I believe this should suffice.
Comment 9 Mark Martinec 2012-10-24 15:19:22 UTC
> > could you run the same test using spamc/spamd and the same ruleset?

Forgot to add: the ruleset in use what what came out freshly
from sa-update for 3.4.
Comment 10 Kevin A. McGrail 2012-10-24 15:59:48 UTC
This is just great data.  And I say that because none of my rules were flagged ;-)

Can we add this type of profiling to a test procedure for rules that is added to rulesqa website?  Something to consider with S/O as a ratio of S/O to resources required?
Comment 11 John Hardin 2012-10-24 18:50:43 UTC
(In reply to comment #4)
> Created attachment 5106 [details]
> Mail sample 3

Thanks, I will take a look at those as soon as I can and see if I can improve the rules.
Comment 12 Henrik Krohns 2019-06-19 10:46:50 UTC
__FILL* monsters still dominating runtimes..

T        __FILL_THIS_FORM_SHORT2  118.8653    0.2009 24884
T         __FILL_THIS_FORM_LONG2  111.5167    0.1723 24884
T                __E_LIKE_LETTER   48.7511    0.0261 24884
T  __FILL_THIS_FORM_FRAUD_PHISH1   48.7025    0.1394 24884
T                      __LOWER_E   44.6741    0.0277 24884
T         __FILL_THIS_FORM_LOAN1   35.8043    0.0469 24884
T            HTTPS_HTTP_MISMATCH   33.9572    1.1763 24884
T                    __MY_VICTIM   22.8257    0.0293 24884
T        __FILL_THIS_FORM_SHORT1   18.6945    0.0259 24884
T             __UPPERCASE_75_100   18.6015    0.1381 24884
T                    FUZZY_ERECT   18.1948    0.0405 24884
T         __FILL_THIS_FORM_LONG1   17.5074    0.0392 24884

Problem is they seem to work very well, but look impossible to rewrite better, there's gazillion optional matches and alternations going on.

Is there _any_ simple common identifier in these mails, word, phrase lists?

Perhaps for 4.0.0 we could introduce some meta/tflags option to run specified rules only if some other rule hit first, this would work by adjusting priorities.

In theory we wouldn't even need any new options, but I don't know how much code it would require to handle metas intelligently and skip running any subrules that pending metas do not require any more.
Comment 13 Kevin A. McGrail 2019-06-19 17:57:51 UTC
I like that idea for a tflag to chain rules only if another one hits