SA Bugzilla – Bug 6855
Rules with extremely poor CPU_time / mail_size ratio
Last modified: 2020-11-24 09:21:49 UTC
Here is another problem report referring to long SpamAssassin scan times with larger mail. Unlike Bug 6854, which concentrates on code hotspots, this one refers to problematic rules (and/or their implementation). There may be other similar PRs elsewhere. Attached is a png image of a plot. The scattergram shows dots for each mail message being processed on our system for the last two days. The x-axis is a mail size in bytes, the y-axis is a CPU usage time in milliseconds used only by SpamAssassin, as reported by a getrusage(2) system call (via Unix-Getrusage-0.03 perl module). Initialization and startup times are excluded from the measurements, and so are mail transfer times. Measuring the CPU time instead of elapsed time also largely removes the effect of network rules. What caught my eye is that the ratio of CPU time vs. mail size does not fall on a narrow line, but forms a rather wide wedge. While most samples fall on the lower side of the wedge (showing a slope of about 355 kB/s with a fixed offset of about 0.4 s), some samples fall near the upper side of the wedge, showing an extremely wasteful linear slope of about 11.5 kB/s. These samples are shown in blue on the diagram. I managed to get hold of a couple of such extreme messages which fell near the upper edge of the wedge. Luckily some were addressed to me - like a SA cron job as sent to "nitemc@spamassassin2.zones.apache.org" and a "bad sandbox rules report". Similar were a couple of in-house cron jobs or rsync reports. But not all extreme samples were the result of automatic runs, there were also legitimate googlegroups chats with long thread citations and some reports to authors from a publishing company. Will attach two or three of these. Note these were not extremely large messages, here are sample sizes of four of such cases: 187kB, 154kB, 73kB, 93kB. Profiling these four with a NYTProf profiler in a command-line spamassassin run shows similar results as can be obtained through a HitFreqsRuleTiming.pm plugin. Will attach the results. Here is a summary - a top few of the most "CPU resource usage per mail size" wasteful rules for each of these four (quite different) messages: 785ms 72_active.cf, rule __FILL_THIS_FORM_SHORT2, 777ms 72_active.cf, rule __FILL_THIS_FORM_LONG2, 299ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 216ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 128ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 126ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 59.5ms 72_active.cf, rule __LOTSA_MONEY_03, 645ms 72_active.cf, rule __FILL_THIS_FORM_SHORT2, 641ms 72_active.cf, rule __FILL_THIS_FORM_LONG2, 179ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 145ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 141ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 110ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 48.6ms 72_active.cf, rule __LOTSA_MONEY_03, 219ms 72_active.cf, rule __FILL_THIS_FORM_SHORT2, 212ms 72_active.cf, rule __FILL_THIS_FORM_LONG2, 84.2ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 62.6ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 28.6ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 27.7ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 15.7ms 72_active.cf, rule __LOTSA_MONEY_03, 236ms 72_active.cf, rule __FILL_THIS_FORM_SHORT2, 225ms 72_active.cf, rule __FILL_THIS_FORM_LONG2, 89.9ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 62.9ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 34.9ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 34.5ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 18.8ms 72_active.cf, rule T_LOTTO_URI, 18.4ms 72_active.cf, rule __GAPPY_HTML_02, 16.8ms 72_active.cf, rule __LOTSA_MONEY_03, 15.9ms 72_active.cf, rule __LOTSA_MONEY_01, I'm not sure what can/should be done about it, but people tell me it is not acceptable that it takes 30 seconds of intensive CPU spinning to process a 350 kB message (on our AMD Opteron, not the very latest and greatest, but still a decent hardware).
Created attachment 5103 [details] The diagram: CPU usage vs. mail size
Created attachment 5104 [details] Mail sample 1
Created attachment 5105 [details] Mail sample 2
Created attachment 5106 [details] Mail sample 3
Assuming you tested this with Amavisd, could you run the same test using spamc/spamd and the same ruleset?
Created attachment 5107 [details] Profiling report - CPU time per rule, as reported by NYTProf for four sample messages
> These samples are shown in blue on the diagram. In red, sorry. The rest of the samples is in green.
> Assuming you tested this with Amavisd, Indeed. The dot-samples in the diagram were taken from its log. > could you run the same test using spamc/spamd and the same ruleset? The samples in the red category which I could get hold of were re-measured in a command-line spamassassin (ignoring the startup times), with similar results. I believe this should suffice.
> > could you run the same test using spamc/spamd and the same ruleset? Forgot to add: the ruleset in use what what came out freshly from sa-update for 3.4.
This is just great data. And I say that because none of my rules were flagged ;-) Can we add this type of profiling to a test procedure for rules that is added to rulesqa website? Something to consider with S/O as a ratio of S/O to resources required?
(In reply to comment #4) > Created attachment 5106 [details] > Mail sample 3 Thanks, I will take a look at those as soon as I can and see if I can improve the rules.
__FILL* monsters still dominating runtimes.. T __FILL_THIS_FORM_SHORT2 118.8653 0.2009 24884 T __FILL_THIS_FORM_LONG2 111.5167 0.1723 24884 T __E_LIKE_LETTER 48.7511 0.0261 24884 T __FILL_THIS_FORM_FRAUD_PHISH1 48.7025 0.1394 24884 T __LOWER_E 44.6741 0.0277 24884 T __FILL_THIS_FORM_LOAN1 35.8043 0.0469 24884 T HTTPS_HTTP_MISMATCH 33.9572 1.1763 24884 T __MY_VICTIM 22.8257 0.0293 24884 T __FILL_THIS_FORM_SHORT1 18.6945 0.0259 24884 T __UPPERCASE_75_100 18.6015 0.1381 24884 T FUZZY_ERECT 18.1948 0.0405 24884 T __FILL_THIS_FORM_LONG1 17.5074 0.0392 24884 Problem is they seem to work very well, but look impossible to rewrite better, there's gazillion optional matches and alternations going on. Is there _any_ simple common identifier in these mails, word, phrase lists? Perhaps for 4.0.0 we could introduce some meta/tflags option to run specified rules only if some other rule hit first, this would work by adjusting priorities. In theory we wouldn't even need any new options, but I don't know how much code it would require to handle metas intelligently and skip running any subrules that pending metas do not require any more.
I like that idea for a tflag to chain rules only if another one hits