7915 – TLD Discrimination

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Rules
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	Undefined
Assignee:	SpamAssassin Developer Mailing List

Reported:	2021-07-06 23:52 UTC by Joe Workman
Modified:	2021-07-14 16:10 UTC
CC List:	5 users

Description Joe Workman 2021-07-06 23:52:55 UTC

Let me start by saying that I am not a user of SpamAssassin. However, I am a developer, and I run a small software company at www.weavers.space. Sending email to my customers has been a major pain for years, and it's all because your software discriminates against non-traditional TLDs.

I sent an email to my customers today and it received an X-Spam-Score of 4.9. Anything above 5 is considered spam. I know that many of my customers miss emails that they signed up to receive because those emails get filed as spam.

I have worked very hard to improve my emails so that I get the lowest score possible. I think that I have reached a place where getting my score any lower is virtually impossible. Here are the results from testing the latest email that I sent to my customers.

X-Spam-Hits: BAYES_50 0.8, FROM_SUSPICIOUS_NTLD 0.499, FROM_SUSPICIOUS_NTLD_FP 1.6, HTML_FONT_LOW_CONTRAST 0.001, HTML_IMAGE_RATIO_04 0.001, HTML_MESSAGE 0.001, ME_HAS_VSSU 0.001, ME_SENDERREP_NEUTRAL 0.001, PDS_OTHER_BAD_TLD 1.999, RCVD_IN_DNSWL_NONE -0.0001, RCVD_IN_MSPIKE_H3 0.001, RCVD_IN_MSPIKE_WL 0.001, SPF_HELO_NONE 0.001, SPF_PASS -0.001, T_REMOTE_IMAGE 0.01, LANGUAGES en, BAYES_USED user, SA_VERSION 3.4.2

If we look at this, there are 3 tests that are 100% biased based solely on my domain's .space TLD:

* FROM_SUSPICIOUS_NTLD 0.499
* FROM_SUSPICIOUS_NTLD_FP 1.6
* PDS_OTHER_BAD_TLD 1.999

These 3 tests alone account for 4.1 points of my spam score! This means that if SpamAssassin did not discriminate based on TLD, I would have a really amazing score of 0.8.
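The arithmetic above can be checked with a short script. This is only a sketch: it assumes the X-Spam-Hits value is a comma-separated list of `NAME score` pairs, as in the header quoted in this report, and the helper name `parse_hits` is made up for illustration. It is not part of SpamAssassin.

```python
# Header value copied from the X-Spam-Hits line quoted in this report.
HITS = (
    "BAYES_50 0.8, FROM_SUSPICIOUS_NTLD 0.499, FROM_SUSPICIOUS_NTLD_FP 1.6, "
    "HTML_FONT_LOW_CONTRAST 0.001, HTML_IMAGE_RATIO_04 0.001, HTML_MESSAGE 0.001, "
    "ME_HAS_VSSU 0.001, ME_SENDERREP_NEUTRAL 0.001, PDS_OTHER_BAD_TLD 1.999, "
    "RCVD_IN_DNSWL_NONE -0.0001, RCVD_IN_MSPIKE_H3 0.001, RCVD_IN_MSPIKE_WL 0.001, "
    "SPF_HELO_NONE 0.001, SPF_PASS -0.001, T_REMOTE_IMAGE 0.01, LANGUAGES en, "
    "BAYES_USED user, SA_VERSION 3.4.2"
)

# The three rules that the report attributes to the .space TLD.
TLD_RULES = ("FROM_SUSPICIOUS_NTLD", "FROM_SUSPICIOUS_NTLD_FP", "PDS_OTHER_BAD_TLD")


def parse_hits(header):
    """Return {rule_name: score}, skipping entries whose value is not a number."""
    scores = {}
    for item in header.split(","):
        parts = item.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            scores[name] = float(value)
        except ValueError:
            # Non-score entries such as "LANGUAGES en" or "SA_VERSION 3.4.2".
            continue
    return scores


scores = parse_hits(HITS)
total = sum(scores.values())
tld_total = sum(scores[name] for name in TLD_RULES)

print(f"total score:         {total:.3f}")
print(f"TLD rules only:      {tld_total:.3f}")
print(f"score without them:  {total - tld_total:.3f}")
```

The three TLD rules sum to 4.098, which is where the rounded figure of 4.1 comes from.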

I can fully understand that many spammers work from fringe TLDs. However, there must be a better way to target them than simply blindly blocking a TLD. Why not take into account SPF, DKIM and DMARC?

Punishing valid businesses by giving them a starting score of 4.1 just because they chose an irregular TLD is immoral and dare I say lazy. A better solution needs to be found.

I have scoured the internet for a better solution for years. There is very little out there about this. If I am making any wrong assumptions, please let me know. I look forward to hearing back from you.

Comment 1 Kevin A. McGrail 2021-07-07 01:40:51 UTC

Unfortunately, the science backs up that the TLDs are problematic. If a false positive is generated on a legitimate email with stock rules on a current version of Apache SpamAssassin, that's a different issue.  

Not much we can do with this trouble report other than agree with you that the .space TLD is problematic.

Those rules all passed automated QA and had their rule scoring set by a genetic algorithm designed to correctly classify emails.

Naive Bayesian classification is also used, along with DKIM, DMARC and SPF, though DMARC is newer for most people.

If you have a false positive, put an email up on pastebin and post on the users mailing list.

Comment 2 RW 2021-07-07 15:02:15 UTC

I don't like the possibility that a scan can hit all three rules. PDS_OTHER_BAD_TLD looks at URI domains, while the other two work on the various From addresses. It's pretty common in both spam and ham for the author domain to appear in links, and it also gets added to the URI list automatically if it's in a DKIM signature.

If you have a situation where you are getting random single and double scoring on the same feature, the optimizer can't produce sensible scores.
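A toy model of the single versus double scoring described here, as a rough sketch only. It assumes that the two From-based rules fire whenever the From domain has a listed TLD and that the URI rule fires whenever any link domain does. The real rule definitions are more involved. The scores are the ones reported in this bug's X-Spam-Hits header, and `SUSPICIOUS_TLDS` and `tld_score` are hypothetical names for illustration.

```python
# Scores as reported in this bug's X-Spam-Hits header.
FROM_RULES = {"FROM_SUSPICIOUS_NTLD": 0.499, "FROM_SUSPICIOUS_NTLD_FP": 1.6}
URI_RULES = {"PDS_OTHER_BAD_TLD": 1.999}

# Illustrative list only; the real rules carry their own TLD lists.
SUSPICIOUS_TLDS = {"space"}


def tld(domain):
    """Return the last label of a domain name, lowercased."""
    return domain.rsplit(".", 1)[-1].lower()


def tld_score(from_domain, uri_domains):
    """Sum the TLD-related scores under the simplified firing model above."""
    score = 0.0
    if tld(from_domain) in SUSPICIOUS_TLDS:
        score += sum(FROM_RULES.values())
    if any(tld(d) in SUSPICIOUS_TLDS for d in uri_domains):
        score += sum(URI_RULES.values())
    return score


# Same sender domain, with and without a link back to that domain.
single = tld_score("www.weavers.space", [])
double = tld_score("www.weavers.space", ["www.weavers.space"])

print(f"From only:       {single:.3f}")
print(f"From plus link:  {double:.3f}")
```

Under these assumptions the same underlying feature, the sender's TLD, is charged roughly twice as much as soon as the message links back to the sender's own domain.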

Comment 3 John Hardin 2021-07-07 15:15:38 UTC

(In reply to RW from comment #2)
> I don't like the possibility that a scan can hit all three rules.

Agreed. In this case we should probably look at consolidating the multiple suspicious-TLD checks into a single scoreable rule.
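The effect of such a consolidation could be modeled roughly as follows. This is a sketch under stated assumptions, not a proposed rule: the combined rule name `SUSPICIOUS_TLD_ANY` and its score of 2.0 are placeholders invented for illustration, and any real score would come out of the usual rule QA and scoring process.

```python
# The three existing TLD rules named in this report.
TLD_RULES = frozenset(
    {"FROM_SUSPICIOUS_NTLD", "FROM_SUSPICIOUS_NTLD_FP", "PDS_OTHER_BAD_TLD"}
)


def consolidate(hits, combined_name="SUSPICIOUS_TLD_ANY", combined_score=2.0):
    """Replace any number of TLD rule hits with one combined hit.

    `hits` maps rule names to scores. The combined name and score are
    hypothetical placeholders, not real SpamAssassin values.
    """
    remaining = {name: s for name, s in hits.items() if name not in TLD_RULES}
    if len(remaining) != len(hits):
        # At least one TLD rule fired, so score the feature exactly once.
        remaining[combined_name] = combined_score
    return remaining


# Hits taken from this report (only the larger scores, for brevity).
hits = {
    "BAYES_50": 0.8,
    "FROM_SUSPICIOUS_NTLD": 0.499,
    "FROM_SUSPICIOUS_NTLD_FP": 1.6,
    "PDS_OTHER_BAD_TLD": 1.999,
}

merged = consolidate(hits)
print(sorted(merged))
print(f"before: {sum(hits.values()):.3f}  after: {sum(merged.values()):.3f}")
```

With a single rule, a message is charged the same amount whether the TLD shows up in the From address, in a link, or in both.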

Comment 4 Paul Stead 2021-07-07 15:19:56 UTC

(In reply to John Hardin from comment #3)

> Agreed. In this case we should probably look at consolidating the multiple
> suspicious-TLD checks into a single scoreable rule.

Happy to take suggestions and/or adjustments to these rules within my sandbox to help achieve this goal of a single scoreable rule for these TLDs.

Comment 5 Paul Stead 2021-07-07 15:23:21 UTC

As a note, FROM_SUSPICIOUS_NTLD has a maximum potential score of 0.5. Combining it with FROM_SUSPICIOUS_NTLD_FP at a maximum of 2.0 was seen at the time as a compromise compared with having FROM_SUSPICIOUS_NTLD_FP at a maximum of 2.5, hence the reason for these two.

Comment 6 Joe Workman 2021-07-14 04:59:33 UTC

Can this issue be re-opened? There are users who agree that the current ruleset is unfair. Can we please have a discussion about this?

Comment 7 AXB 2021-07-14 16:10:33 UTC

Please move any discussion to the users mailing list