5725 – Track and count subjects

Summary: Track and count subjects

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Learner
Version:	3.2.3
Hardware:	Other other

Importance:	P5 enhancement
Target Milestone:	Undefined
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2007-11-18 09:05 UTC by Vicki Brown
Modified:	2019-10-02 10:45 UTC
CC List:	1 user

Description Vicki Brown 2007-11-18 09:05:36 UTC

I delete all spam with a score of 4 or above. I review anything below 3. In
those mailboxes, I sort by subject.

I've noticed that there is a lot of duplication in the subjects of spam messages.

I would like to see SA track subjects and watch for matches. As a given subject
is repeated, the score should go up.

Comment 1 Matt Kettler 2007-11-18 10:22:29 UTC

I think at a casual glance it might look like a good idea, but I'm not so sure
after spending a bit of time thinking about it. Consider the following points.

1) Implementing this would be "expensive" in that you'd need a database to track
the subjects, probably with some form of atime-based expiry like the bayes
system. This isn't really a case against doing it, but does raise the bar for
how effective it should be. We don't want to be spending a lot of time coding a
feature or occupying disk space with databases unless it's going to be really
effective.

2) Subjects repeat a lot, not just in spam. Consider mailing lists like
spamassassin-users. Just this month there were 25 "Re: It's a fine line..."
subjects (26 if you count the first one without the Re:). Also consider
subscriber newsletters and notifications. Every month I get a lot of emails such
as "Your bill is now available online" (Verizon) and "Your M&T E-statement is now
available REF#:xxxxxxx" (my bank; the reference is always the same, and I have 8
of them on hand to check against). I get these *every* month, and over time the
count piles up. This gets even worse if you consider sysadmin reporting tools
like Nagios, which can bombard you with dozens of the same subject a day if part
of your network keeps going up and down.

3) Spammers could easily evade such a system by randomizing subjects if it were
exact-match based. They already randomize body text, so this would be trivial.
If it's not exact-match, see 4.

4) SA's existing bayes system already tokenizes subject lines, which has this
same effect, but on a trained basis, not on a counted basis.
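
To make the trade-offs in points 1 to 3 concrete, here is a minimal Python
sketch of an exact-match subject counter with atime-based expiry. It is not
SpamAssassin code and does not use any SpamAssassin API: the class
`SubjectCounter`, its methods, the 30-day default, and the "Re:" stripping are
all hypothetical choices made for illustration only.

```python
import time
from typing import Dict, Optional, Tuple


class SubjectCounter:
    """Hypothetical exact-match subject tracker (illustration only)."""

    def __init__(self, max_age_seconds: float = 30 * 24 * 3600) -> None:
        # Assumed expiry window; entries not seen for this long are dropped.
        self.max_age_seconds = max_age_seconds
        # normalized subject -> (count, last access time)
        self._seen: Dict[str, Tuple[int, float]] = {}

    @staticmethod
    def normalize(subject: str) -> str:
        # Assumption: fold case, collapse whitespace, strip leading "Re:".
        s = " ".join(subject.split()).lower()
        while s.startswith("re:"):
            s = s[3:].lstrip()
        return s

    def expire(self, now: float) -> None:
        # atime-based expiry: forget subjects not accessed recently.
        cutoff = now - self.max_age_seconds
        stale = [k for k, (_, atime) in self._seen.items() if atime < cutoff]
        for k in stale:
            del self._seen[k]

    def observe(self, subject: str, now: Optional[float] = None) -> int:
        """Record one message and return how often its subject was seen."""
        now = time.time() if now is None else now
        self.expire(now)
        key = self.normalize(subject)
        count, _ = self._seen.get(key, (0, now))
        self._seen[key] = (count + 1, now)
        return count + 1
```

With a counter like this, a legitimate recurring subject such as "Your bill is
now available online" keeps climbing as long as it arrives within the expiry
window (point 2), while two spam subjects that differ only by a random suffix
each count as 1 and never accumulate (point 3). The dictionary stands in for
the on-disk database that point 1 describes.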

Overall, I'm not sure this is really worth it. It would be difficult to find a
variation of this idea that isn't a duplication of bayes per #4, that's
effective against randomization per #3, doesn't cause FPs per #2, and is
effective enough to be worth it as per #1.

I like that you're submitting ideas, and encourage you to keep doing so. I just
don't think this one would work out in a broader reality.

Comment 2 Henrik Krohns 2019-10-02 10:45:48 UTC

Matt pretty much summed it up. I don't think any SA developer is going to look at it, but nothing prevents someone else from doing a plugin. Closing.