SA Bugzilla – Bug 6123
Add "tflags exponential" to allow increasing score for multiple hits
Last modified: 2019-07-30 09:22:21 UTC
"tflags multiple" is very useful, but it would be nice to be able to add a greater penalty the more times a given match is repeated. I suggest "tflags exponential" as a variant of "tflags multiple". Setting this flag would cause the overall score to be changed by (rule_score * rule_hits_so_far), as opposed to (rule_score * 1) as for the basic "multiple" scoring. (I know it's not truly exponential scoring...)
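To make the arithmetic in the proposal concrete, here is a minimal Python sketch of the two scoring schemes. It is an illustration only, not SpamAssassin code. The function names are hypothetical, and it assumes that rule_hits_so_far counts the current hit, so the first hit is multiplied by 1.

```python
def multiple_total(rule_score: float, hits: int) -> float:
    """Existing 'tflags multiple' behaviour: every hit adds rule_score once."""
    return rule_score * hits


def proposed_total(rule_score: float, hits: int) -> float:
    """Proposed behaviour: the k-th hit adds rule_score * k (assumed 1-based)."""
    return sum(rule_score * k for k in range(1, hits + 1))


if __name__ == "__main__":
    # Compare how the two totals grow as the number of hits increases.
    for hits in (1, 2, 4, 8):
        print(hits, multiple_total(1.0, hits), proposed_total(1.0, hits))
```

Under that assumption the proposed total is rule_score * n * (n + 1) / 2 for n hits, which grows quadratically rather than exponentially, consistent with the reporter's own caveat.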
In that case I would suggest calling it tflags multiply, not tflags exponential. Personally, I don't think it's an outright bad idea, but I don't see a whole lot of value in it. Perhaps I just need some examples of how it's really useful. In my thinking, it might be useful for implementing rules with small scores that don't hurt nonspam, but add up enough to mean something with repeated hits. However, I just can't think of an example where the spread of hit counts would be sufficiently large; i.e., I can think of cases where spam might have 3 matches and nonspam only 1, but that's not really a big enough spread. IMO, you'd need something with close to a factor of 10 difference between the typical nonspam hits and typical spam hits. Do you have some example ideas for rules, and spam messages that this would affect? I ask largely because it's my impression that while implementing this would not be horribly difficult, it also would not be trivial. However, implementing support for it in the perceptron so it can be used in the base ruleset would be downright complicated. It should also be noted that this has been suggested numerous times in the past, and nobody has presented a decent case to convince someone to implement it yet. That's not to say it's a bad idea, but it's one that clearly needs a reason, not just a simple "this would be useful".
While this can probably be useful in some cases, it makes it really easy for users to shoot themselves in the foot. The problem is carefully evaluating how many hits can be considered ok-ish and where to raise the score beyond all thresholds, and then crafting scores to match. If this were implemented, I guess I'd prefer something like the procmail weighted scoring technique, which would cover this as well as the existing behaviours as the special cases x=0 (plain rule) and x=1 (tflags multiple). More of a gut feeling, though; I haven't thought it through properly yet. ;) As for the name, tflags multiply is a no-go IMHO. This needs a better distinction from multiple than a single-character change. Matt, can you point us at a previous discussion? On list, or bugzilla? Anyway, regarding the challenge to come up with an example, the following flexible and easy-to-grok rule stub is about what you need to beat. Doesn't it pretty much do what you intend?
  tflags __FOO multiple
  meta FOO ( __FOO )
  score FOO 0.2
  meta FOO_4 ( __FOO >= 4 )
  score FOO_4 1.0
  meta FOO_8 ( __FOO >= 8 )
  score FOO_8 2.5
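For readers unfamiliar with procmail's weighted scoring, the following Python sketch shows the idea as it is commonly described for procmail's "w^x" conditions: the first match adds w, the second adds w*x, the third adds w*x*x, and so on. The function name is hypothetical and this is not procmail or SpamAssassin code; it only illustrates why x=0 behaves like a plain rule and x=1 like tflags multiple.

```python
def weighted_total(w: float, x: float, hits: int) -> float:
    """Procmail-style w^x scoring: the k-th hit (k = 1, 2, ...) adds w * x**(k - 1)."""
    # range(hits) yields exponents 0, 1, ..., hits - 1, one per hit.
    return sum(w * x ** k for k in range(hits))


if __name__ == "__main__":
    hits = 5
    print("x=0 (plain rule):     ", weighted_total(1.0, 0, hits))
    print("x=1 (tflags multiple):", weighted_total(1.0, 1, hits))
    print("x=2 (growing weight): ", weighted_total(1.0, 2, hits))
```

With x between 0 and 1 each further hit counts for less, and with x above 1 each further hit counts for more, so a single exponent would cover both the existing behaviours and the one requested here.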
I think we've implemented similar features in the past using eval rules, btw. It'd be easy to do it with an eval-rule plugin, too. But Karsten's 'tflags multiple' rules are even easier to understand...
The example I have in mind is "fill in the form" stuff in frauds and phishes. The more name/address/phone/gender/whatever blanks the spam has for the victim to fill in, the more points I'd like to assign. 6-8 blanks should score much higher than 1 or 2. Karsten's metas are certainly a workaround, but to my mind it'd be a lot easier to say "score 0.15" and "tflags exponential" than to figure out a set of meta rules. This would also keep the total number of rules down. I hadn't intended this to be used very often, or to be incorporated into the perceptron. I also thought implementation would be pretty easy; I'll poke around and see if that is indeed the case.
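As a rough comparison for the form-blank case, this Python sketch totals the score from Karsten's three meta rules and from the proposed flag at a base score of 0.15. It is an illustration under assumptions, not SpamAssassin behaviour: the function names are hypothetical, and the proposed flag is assumed to add 0.15 * k for the k-th hit.

```python
# Tiers taken from the rule stub earlier in the thread: (minimum hits, score added).
TIERS = [(1, 0.2), (4, 1.0), (8, 2.5)]


def tiered_total(hits: int) -> float:
    """Sum the scores of every meta rule (FOO, FOO_4, FOO_8) whose threshold is met."""
    return sum(score for min_hits, score in TIERS if hits >= min_hits)


def proposed_total(rule_score: float, hits: int) -> float:
    """Closed form of rule_score * (1 + 2 + ... + hits), the assumed proposal."""
    return rule_score * hits * (hits + 1) / 2


if __name__ == "__main__":
    for blanks in (1, 2, 6, 8):
        print(blanks,
              round(tiered_total(blanks), 2),
              round(proposed_total(0.15, blanks), 2))
```

The metas give a step function that stays flat between thresholds, while the proposed flag climbs with every additional blank, from 0.15 for one blank to about 5.4 for eight.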
Granted, this could result in shorter and more concise rules. A problem I see, however, is the potentially unbounded growth. The next thing to be requested will be a per-rule tflags exponential_max_score with a value, to prevent the rule from single-handedly pushing the score above some high cut-off threshold, as has often been requested for FuzzyOCR. Yes, this also applies to tflags multiple, to a lesser extent. ;) That flag seems to be used exclusively for counting, as in my example above, rather than for real multiple scoring. At least I don't recall ever seeing the latter since the backhair rule-set. (Which, frankly, was almost as unsightly to watch in the status and report header as what it targeted. ;)
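The cap mentioned above could look like the following Python sketch. Both the proposed flag and exponential_max_score are hypothetical; nothing here exists in SpamAssassin, and the growth formula assumes the k-th hit adds rule_score * k.

```python
def capped_total(rule_score: float, hits: int, max_score: float) -> float:
    """Total for the assumed proposal, limited to a hypothetical per-rule maximum."""
    uncapped = rule_score * hits * (hits + 1) / 2
    # Without the cap, the total grows without bound as hits increase.
    return min(uncapped, max_score)


if __name__ == "__main__":
    for hits in (2, 8, 20):
        print(hits, round(capped_total(0.15, hits, 3.0), 2))
```

This shows why a cap tends to follow such a feature: at a base score of 0.15, twenty hits would contribute about 31.5 points uncapped.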
This can easily be done in a plugin; in fact, I had the code half written when Firefox decided it didn't want to play well with others, and now I'm too busy to rewrite it. I think it would be better to just promote that method for folks and put it up on the wiki. All the tools are there; it's just a matter of someone putting it together. FYI, my solution did not involve tflags.
Closing old stale bug. I see no reason to bloat code for rare cases which can simply be handled with metas.