Bug 6317

Summary: Enable URI testing in From: headers
Product: Spamassassin
Reporter: Martin Gregorie <martin>
Component: Libraries
Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW
Severity: enhancement
CC: antispam
Priority: P5    
Version: unspecified   
Target Milestone: Undefined   
Hardware: All   
OS: All   
Whiteboard:
Attachments: Parse URIs in From:name, the real-name part of the From header.

Description Martin Gregorie 2010-02-01 06:50:10 UTC
Some spam carries its payload as the sender's personal name. The rest of the user-writable message, i.e. the subject line and the message body, is filled with random gibberish. The personal name often contains a URL that can't be recognised or processed as a URL without using an expensive raw scan.

If body and uri tests could be applied to this text the same way they are applied to the Subject header text, we could easily write rules that fire on phrases and URLs in the From: header without adding much overhead to the scanning process.
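
To make the request concrete, here is a minimal sketch in Python of pulling URL-like strings out of the real-name part of a From: header. It is not SpamAssassin code: the function name `uris_in_from_name` and the loose `URLISH` pattern are assumptions made for illustration, and SA's own URI parser is far more thorough.

```python
import re
from email.utils import parseaddr

# Loose URL-ish pattern: an explicit http(s) scheme or a bare "www." host.
# Illustrative assumption only, not the pattern SpamAssassin uses.
URLISH = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def uris_in_from_name(from_header):
    """Return URL-like strings found in the real-name part of a From: header."""
    # parseaddr splits 'Real Name <user@host>' into (real name, address).
    real_name, _address = parseaddr(from_header)
    return URLISH.findall(real_name)


if __name__ == "__main__":
    print(uris_in_from_name('"Cheap pills www.example.com" <x@example.net>'))
    print(uris_in_from_name("Martin Gregorie <martin@example.org>"))
```

Only the display name is scanned, so the cost is one short regex match per message rather than a pass over the raw message.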
Comment 1 Adam Katz 2010-02-01 10:46:00 UTC
This stems from a list conversation archived at
http://old.nabble.com/forum/ViewPost.jtp?post=27384882&framed=y and my tests were also mentioned in another thread from last week at http://old.nabble.com/forum/ViewPost.jtp?post=27328212&framed=y

I'm not sure I agree with the full concept though, and I think my participatory remarks may have been misread.

Bayesian rules already examine From and Subject fields in addition to the body, and they rightly mark the collected words with the field name. For example, "from:adam" is a word plucked by Bayes when it sees "Adam Katz" in the From header; the colon is a forbidden character in standard word parsing, so the marked word stays distinct from the same word in the body. (This is not necessarily the exact mechanism SA uses to delimit, but it is close.)
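
A rough sketch of that field-marking idea in Python, under the same caveat that this is not SA's actual tokenizer; the `tokens` function and the `WORD` pattern are hypothetical names used only for illustration:

```python
import re

# Plain word characters only, so a ":" can never appear inside a body token.
WORD = re.compile(r"[A-Za-z0-9]+")


def tokens(text, field=None):
    """Lower-cased word tokens, prefixed with "field:" when a header field is given."""
    words = [w.lower() for w in WORD.findall(text)]
    if field is None:
        return words
    return [f"{field}:{w}" for w in words]


if __name__ == "__main__":
    print(tokens("Adam Katz", field="from"))
    print(tokens("Adam wrote this"))
```

Because the prefix contains a character that body tokens cannot contain, "adam" in the body and "from:adam" in the header are learned as separate words.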

The topic that spurred this request was related to spamvertised websites that appear in the From header rather than the body and thus are immune to SA's URI detection.  Martin has generalized this idea to all body tests, which may not be as wise.

Furthermore, URI detection for the From header may be a frivolous exercise, as my tests at http://ruleqa.spamassassin.org/?rule=/FROM_W&srcpath=khop seem to indicate that *any* URI in this location is itself a strong indicator of spam.  Further parsing is therefore unnecessary.

Publishing this rule with SA before legitimate senders start picking up this practice might deter its adoption.
Comment 2 Adam Katz 2010-02-01 10:51:16 UTC
(In reply to comment #1)
> Furthermore, URI detection for the From header may be a frivolous exercise, as
> my tests at http://ruleqa.spamassassin.org/?rule=/FROM_W&srcpath=khop seem to
> indicate that *any* URI in this location is itself a strong indicator of
> spam.  Further parsing is therefore unnecessary.

Sorry, that should begin with "Furthermore, URI *decoding* for the From header may be a frivolous exercise," as my rules detect URIs and call it a done deal without further investigation, and the numbers back them up.
Comment 3 Karsten Bräckelmann 2010-02-01 11:07:47 UTC
While not the same request, bug 6315 is about the very same recent pattern as discussed here. Candidate for DUPE.

Personally, while URIs in the From:name header are quite suspect on their own (though I have seen them used in legit mail), I like the idea of harvesting the URIs for URI DNSBL checks.
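
As a hedged illustration of what harvesting for a URI DNSBL check involves, the Python sketch below turns a harvested URI into the DNS name that would be queried. The zone `uribl.invalid` and the function `dnsbl_query_name` are hypothetical, no lookup is performed, and real lists generally expect the registrable domain rather than the full host name, a reduction this sketch skips.

```python
from urllib.parse import urlsplit

# Hypothetical zone name; ".invalid" is a reserved TLD that never resolves.
ZONE = "uribl.invalid"


def dnsbl_query_name(uri, zone=ZONE):
    """Build the DNS name a URI DNSBL lookup would query for this URI's host."""
    if "://" not in uri:
        # Bare "www.example.com" style, as typically seen in From:name.
        uri = "http://" + uri
    host = urlsplit(uri).hostname  # lower-cased, port and path stripped
    if not host:
        return None
    return f"{host.rstrip('.')}.{zone}"


if __name__ == "__main__":
    print(dnsbl_query_name("www.example.com/buy"))
```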
Comment 4 Adam Katz 2010-02-01 15:08:59 UTC
Okay, let's separate the two bugs.

Bug 6315 primarily focuses on spammy text in the From field.
Old name: "New spam type with drugs promo in envelope From: string"
New name: "Detect spammy words like drug promos in From: headers"

Bug 6317 primarily focuses on uri patterns in the From field.
Old name: "Enhancement: include sender text in the message body so body and uri tests can scan it"
New name: "Enable URI testing in From: headers"

That puts half of this bug's scope into bug 6315 instead.
Comment 5 Karsten Bräckelmann 2010-03-07 16:12:01 UTC
Created attachment 4698
Parse URIs in From:name, the real-name part of the From header.

Quick hack that enables parsing of URIs out of From:name. Ripped directly from a live running 3.2 system.

Does not directly apply to 3.3 or trunk. The for loop in the function _get_parsed_uri_list changed slightly in 3.3. Porting to trunk should be straightforward, though.