SA Bugzilla – Bug 5725
Track and count subjects
Last modified: 2019-10-02 10:45:48 UTC
I delete all spam with a score of 4 or above. I review anything below 3. In those mailboxes, I sort by subject. I've noticed that there is a lot of duplication in the subjects of spam messages. I would like to see SA track subjects and watch for matches. As a given subject is repeated, the score should go up.
I think at a casual glance it might look like a good idea, but I'm not so sure after spending a bit of time thinking about it. Consider the following points.

1) Implementing this would be "expensive" in that you'd need a database to track the subjects, probably with some form of atime-based expiry like the bayes system. This isn't really a case against doing it, but it does raise the bar for how effective it should be. We don't want to be spending a lot of time coding a feature or occupying disk space with databases unless it's going to be really effective.

2) Subjects repeat a lot, not just in spam. Consider mailing lists like spamassassin-users. Just this month there were 25 "Re: It's a fine line..." subjects (26 if you count the first one without the Re:). Also consider subscriber newsletters and notifications. Every month I get a lot of emails such as "Your bill is now available online" (Verizon) and "Your M&T E-statement is now available REF#:xxxxxxx" (my bank, and the reference is always the same; I have 8 of them on hand to check against). I get these *every* month, and over time the count piles up. This gets even worse if you consider sysadmin reporting tools like Nagios, which can bombard you with dozens of the same subject a day if part of your network keeps going up and down.

3) Spammers could easily evade such a system by randomizing subjects if it was exact-match based. They already randomize body text, so this would be trivial. If it's not exact-match, see 4.

4) SA's existing bayes system already tokenizes subject lines, which has this same effect, but on a trained basis, not on a counted basis.

Overall, I'm not sure this is really worth it. It would be difficult to find a variation of this idea that isn't a duplication of bayes per #4, that's effective against randomization per #3, doesn't cause FPs per #2, and is effective enough to be worth it as per #1.
I like that you're submitting ideas, and encourage you to keep doing so. I just don't think this one would work out in a broader reality.
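For anyone who wants to experiment with this outside of SA, here is a minimal Python sketch of the exact-match counting idea with atime-based expiry, as described in point 1. It is not SpamAssassin code and uses no SA API; the class name `SubjectTracker`, the parameters `ttl_seconds`, `points_per_repeat` and `max_points`, and the in-memory dict standing in for a database are all assumptions made for illustration.

```python
import time


class SubjectTracker:
    """Hypothetical exact-match subject counter with atime-based expiry."""

    def __init__(self, ttl_seconds=7 * 24 * 3600, points_per_repeat=0.5, max_points=3.0):
        self.ttl = ttl_seconds          # entries untouched for this long are dropped
        self.step = points_per_repeat   # score added per earlier sighting (assumed value)
        self.cap = max_points           # upper bound on the added score (assumed value)
        self.seen = {}                  # normalized subject -> (count, last access time)

    @staticmethod
    def normalize(subject):
        # Lowercase, collapse whitespace, and strip leading "Re:" prefixes.
        s = " ".join(subject.lower().split())
        while s.startswith("re:"):
            s = s[3:].strip()
        return s

    def expire(self, now):
        # Drop entries whose last access time is older than the TTL.
        stale = [k for k, (_, atime) in self.seen.items() if now - atime > self.ttl]
        for k in stale:
            del self.seen[k]

    def score(self, subject, now=None):
        # Return the extra score for this subject, then record the sighting.
        now = time.time() if now is None else now
        self.expire(now)
        key = self.normalize(subject)
        count, _ = self.seen.get(key, (0, now))
        self.seen[key] = (count + 1, now)
        # The first sighting adds nothing; each earlier sighting adds `step`, up to `cap`.
        return min(count * self.step, self.cap)
```

With these defaults, scoring the same subject three times in a row yields 0.0, 0.5 and 1.0. The sketch also shows the problems from points 2 and 3: a monthly "Your bill is now available online" notice accumulates score exactly as repeated spam does if the TTL is longer than a month, while a spam run that appends a random token to each subject always scores 0.0.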
Matt pretty much summed it up. I don't think any SA developer is going to look at it, though nothing prevents someone else from writing a plugin. Closing.