Bug 5725 - Track and count subjects
Summary: Track and count subjects
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 3.2.3
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-18 09:05 UTC by Vicki Brown
Modified: 2019-10-02 10:45 UTC
Description Vicki Brown 2007-11-18 09:05:36 UTC
I delete all spam with a score of 4 or above. I review anything below 3.  In
those mailboxes, I sort by subject.

I've noticed that there is a lot of duplication in the subjects of spam messages.

I would like to see SA track subjects and watch for matches. As a given subject
is repeated, the score should go up.
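
To make the request concrete, here is a minimal Python sketch of exact-match subject counting. It is not SpamAssassin code: the class name SubjectCounter, the per_repeat and cap values, and the sample subject are all assumptions chosen for illustration.

```python
from collections import Counter


class SubjectCounter:
    """Hypothetical helper: counts exact (case-folded) subject repeats."""

    def __init__(self, per_repeat=0.5, cap=3.0):
        self.per_repeat = per_repeat  # assumed score added per earlier sighting
        self.cap = cap                # assumed upper bound on the bonus
        self.seen = Counter()

    @staticmethod
    def normalize(subject):
        # Collapse whitespace and ignore case so trivial variants match.
        return " ".join(subject.split()).casefold()

    def bonus(self, subject):
        """Return the extra score for this message, then record the sighting."""
        key = self.normalize(subject)
        extra = min(self.cap, self.per_repeat * self.seen[key])
        self.seen[key] += 1
        return extra


counter = SubjectCounter()
# The same (made-up) subject seen three times earns a growing bonus.
bonuses = [counter.bonus("Cheap  meds NOW") for _ in range(3)]
```

The first sighting adds nothing; each repeat adds per_repeat until the cap is reached. Whether such a bonus would help in practice is a separate question.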
Comment 1 Matt Kettler 2007-11-18 10:22:29 UTC
I think at a casual glance it might look like a good idea, but I'm not so sure
after spending a bit of time thinking about it. Consider the following points.

1) Implementing this would be "expensive" in that you'd need a database to track
the subjects, probably with some form of atime-based expiry like the bayes
system. This isn't really a case against doing it, but does raise the bar for
how effective it should be. We don't want to be spending a lot of time coding a
feature or occupying disk space with databases unless it's going to be really
effective.

2) Subjects repeat a lot, not just in spam. Consider mailing lists like the
spamassassin-users. Just this month there were 25 "Re: It's a fine line..."
subjects (26 if you count the first one without the Re:). Also consider
subscriber newsletters and notifications. Every month I get a lot of emails such
as "Your bill is now available online" (Verizon) and "Your M&T E-statement is now
available REF#:xxxxxxx" (my bank; the reference is always the same, and I have 8
of them on hand to check against). I get these *every* month, and over time the
count piles up. This gets even worse if you consider sysadmin reporting tools
like nagios, which can bombard you with dozens of the same subject a day if part
of your network keeps going up and down.

3) Spammers could easily evade such a system by randomizing subjects if it were
exact match based. They already randomize body text, so this would be trivial.
If it's not exact-match, see 4.

4) SA's existing bayes system already tokenizes subject lines, which has this
same effect, but on a trained basis, not on a counted basis.
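
The four points above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the ExpiringSubjectStore class, the one-hour idle limit, the sample subjects, and the plain word split, which is only a rough stand-in for token-based matching and not SA's actual bayes tokenizer.

```python
import random
import string


class ExpiringSubjectStore:
    """Hypothetical store: subject -> (count, last access time in seconds)."""

    def __init__(self, max_idle):
        self.max_idle = max_idle  # assumed idle time before an entry expires
        self.entries = {}

    def record(self, subject, now):
        """Count one sighting at time `now` and return the updated count."""
        count, _ = self.entries.get(subject, (0, now))
        self.entries[subject] = (count + 1, now)
        return count + 1

    def expire(self, now):
        """Drop entries not touched within max_idle seconds."""
        stale = [s for s, (_, atime) in self.entries.items()
                 if now - atime > self.max_idle]
        for s in stale:
            del self.entries[s]
        return len(stale)


store = ExpiringSubjectStore(max_idle=3600)

# Point 2: a legitimate monitoring alert repeats and racks up a high count.
for t in range(12):
    alert_count = store.record("PROBLEM: host web01 is DOWN", now=t * 60)

# Point 3: a random suffix makes every spam subject unique to exact matching.
rng = random.Random(0)
spam_counts = []
for t in range(12):
    suffix = "".join(rng.choices(string.ascii_lowercase, k=8))
    spam_counts.append(store.record(f"Cheap meds {suffix}", now=t * 60))

# Point 4: word-level tokens still overlap across randomized subjects.
shared = (set("cheap meds abcdefgh".split())
          & set("cheap meds zyxwvuts".split()))

# Point 1: the store needs periodic expiry or it grows without bound.
expired = store.expire(now=2 * 3600)
```

In this sketch the legitimate alert ends up with the highest count while each randomized spam subject is seen only once, which is the opposite of what the counter is meant to achieve.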

Overall, I'm not sure this is really worth it. It would be difficult to find a
variation of this idea that isn't a duplication of bayes per #4, that's
effective against randomization per #3, doesn't cause FPs per #2, and is
effective enough to be worth it as per #1.

I like that you're submitting ideas, and encourage you to keep doing so, I just
don't think this one would work out in a broader reality.
Comment 2 Henrik Krohns 2019-10-02 10:45:48 UTC
Matt pretty much summed it up. I don't think any SA developer is going to look at it, but nothing prevents someone else from doing a plugin. Closing.