Bug 6156 - Add PSBL blacklist
Summary: Add PSBL blacklist
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.3.0
Hardware: Other All
Importance: P2 enhancement
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-15 13:38 UTC by Warren Togami
Modified: 2009-11-19 11:02 UTC



Attachment Type Modified Status Actions Submitter/CLA Status
Example ISO-2022-JP "From" that does not trigger FM_FRM_RN_L_BRACK message/rfc822 None Warren Togami [HasCLA]
RCVD_IN_PSBL_2WEEKS patch None Warren Togami [HasCLA]

Description Warren Togami 2009-07-15 13:38:35 UTC
http://psbl.surriel.com/
Please add PSBL to spamassassin.

I have been using it for a while now and it seems to be very good.  It is free.

It seems to be a very simple but effective DNSBL.  Anything that harvested addresses and had sent mail to spam traps gets added.  Removal from the list is quick and easy with a self-serve form.
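For readers unfamiliar with DNSBLs, here is a minimal Python sketch of how a client forms a lookup name: the IPv4 octets are reversed and the list's zone is appended. The zone name is the one used in the check_rbl rules later in this bug; the function name is hypothetical, and the sketch only builds the name without doing any DNS query.

```python
import ipaddress

# Zone name taken from the check_rbl rules quoted later in this bug.
PSBL_ZONE = "psbl.surriel.com"


def dnsbl_query_name(ip: str, zone: str = PSBL_ZONE) -> str:
    """Build the DNS name a DNSBL client would look up for an IPv4 address.

    Hypothetical helper for illustration; it performs no network access.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address (RFC 5737), not a real sender.
    print(dnsbl_query_name("192.0.2.1"))
```

A client would then resolve that name; an answer generally means "listed" and NXDOMAIN means "not listed".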

http://stats.dnsbl.com/
These stats seem to indicate it is of good quality.

Could we add it as an experimental rule for the automated tests at first?
Comment 1 Justin Mason 2009-07-15 14:17:10 UTC
would Rik be happy to see the additional query load?
Comment 2 Warren Togami 2009-07-15 14:40:11 UTC
<riel> oh, sure
<riel> I believe it could be useful

(He earlier said he might want additional DNS mirrors, but he said go ahead.)
Comment 3 Karsten Bräckelmann 2009-07-15 16:45:45 UTC
I believe we require more than that. At least some public statement by the BL owner on the dev list or on this bug -- that he is fine with an addition to SA by default, and aware of potential implications. Most importantly the added load, which likely will be massive. Can the BL infrastructure cope with that?

More mirrors doesn't sound like a bad idea...

Someone around who can phrase and express this better? :)

Also, of course, we'd need some rule-qa results first. Generally I'd prefer to see some longer-term stats first...

Warren, can you summon Rik to comment on here and preferably also join the dev list?
Comment 4 Warren Togami 2009-07-15 16:54:11 UTC
<warren> riel: Perhaps add a statement to the PSBL front page 'Yes, just add it to spamassassin already.  I can handle the load.  I will not answer any questions."
<riel> warren: I don't care _quite_ that much :)
* riel isn't going to join a mailing list or create a new bugzilla account to give permission
<warren> can I quote you on this?
<elb> but but but
<elb> they need you to exert effort so that they can use the service you provide for free
<warren> they're being a bit anal with disbelief that you can handle the load
<riel> warren: ok
<riel> personally I really don't care much about spamassassin
<elb> warren: can you find out how much load they're *actually* talking about?
<elb> do they know?
<warren> I don't know.
Comment 5 Karsten Bräckelmann 2009-07-15 17:17:41 UTC
(In reply to comment #4)
> * riel isn't going to join a mailing list or create a new bugzilla account to
> give permission

Well, no need to join the list to have me moderate it through. Sometimes Cc's and list moderation do serve a purpose...

> <elb> they need you to exert effort so that they can use the service you
> provide for free

That "they" would be our users (including paying customers to $distro, not us), using the software we provide for free.  Yes, that elb entity really groks things.

> <warren> they're being a bit anal with disbelief that you can handle the load

That would be me tonight. I can stand more than that. ;)  And it's not about disbelief, but some more official statement.
Comment 6 Warren Togami 2009-07-15 17:25:38 UTC
This is getting silly.  Isn't it clear that he gives permission, but doesn't care too much if it happens or not?

I'm not putting any more effort into convincing folks here that he gives permission.  If it happens, it happens.

Let me know privately if you want to become another PSBL mirror though.
Comment 7 Karsten Bräckelmann 2009-07-15 18:12:33 UTC
I agree, this is getting silly. I was not asking you to put more effort into this. I even was offering ways that require *my* time, just to get any direct statement, rather than what appears to be pastes from IRC.

Anyway, I was merely pointing out *I* would prefer to have any official statement. This doesn't block, and there are other devs to get the required +1's.

FWIW, committed RCVD_IN_PSBL testing rule to my sandbox, revision 794481.
Comment 8 Karsten Bräckelmann 2009-07-15 19:10:28 UTC
*sigh*  OK, make that revision 794493.
Comment 9 Justin Mason 2009-07-16 02:11:53 UTC
the "Day Old Bread" DNSBL reported a '600% increase in traffic', when SA rules querying it were put into rule updates.  I don't know what level they were at beforehand.
Comment 10 AXB 2009-07-16 02:18:00 UTC
(In reply to comment #9)
> the "Day Old Bread" DNSBL reported a '600% increase in traffic', when SA rules
> querying it were put into rule updates.  I don't know what level they were at
> beforehand.


considering that there's most probably way more IPs than URIs in mail flow, the traffic increase may even be higher.

can anybody ask Michelle/SORBS or some Spamcop admin for their numbers?
Comment 11 Matthias Leisi 2009-07-18 01:47:07 UTC
(In reply to comment #10)

> considering that there's most probably way more IPs than URIs in mail flow, the
> traffic increase may even be higher.

At dnswl.org, the data transfer volume roughly doubled. The PSBL has been around for quite some time already and is being used a lot as a DNSBL in MTAs, so I would expect a lower increase in this case. 

The impact of a small number of large mailsites querying the dnswl.org public nameservers is worse than a rather large number of small sites (we constantly have to contact ISPs, ESPs etc and ask them to switch to rsync when they are doing way above our rule of 100k queries/24 hours). 

It helps to avoid "accidental" traffic bursts by clearly stating the new DNSBL in the Changes/Readme file.
Comment 12 Warren Togami 2009-07-18 18:41:10 UTC
Interesting, I ran my first Saturday --net masscheck.  My results are not showing up on the ruleqa site though.

PSBL FP's were 47 of my 7283 non-spam.  One message was to myself, but it turned out to be an invalid case.  The 46 other FP's were all legitimate Japanese mail in my friend's hand classified mail.

This is interesting.  PSBL seems to be excellent for English, but shows some real trouble with Japanese mail.  This is perhaps indicative that:
- Japanese ISP's are not using PSBL.
- Japanese sysadmins are less likely to understand how to get themselves removed from PSBL due to the language barrier.
- Japanese sysadmins aren't listing themselves in DNSWL.

This underscores the need for people who speak other languages, especially Asian languages, to participate in the mass checks.
Comment 13 Matthias Leisi 2009-07-19 00:56:40 UTC
(In reply to comment #12)

> - Japanese sysadmins aren't listing themselves in DNSWL.

Japanese (and Asian in general) entries are clearly underrepresented in dnswl.org data. If you know people who can change that, I'd be glad to include their data.
Comment 14 Justin Mason 2009-07-19 13:18:19 UTC
(In reply to comment #12)
> Interesting, I ran my first Saturday --net masscheck.  My results are not
> showing up on the ruleqa site though.

that's showing up now, fwiw....

> PSBL FP's were 47 of my 7283 non-spam.  One message was to myself, but it
> turned out to be an invalid case.  The 46 other FP's were all legitimate
> Japanese mail in my friend's hand classified mail.
> 
> This is interesting.  PSBL seems to be excellent for English, but shows some
> real trouble with Japanese mail.  This is perhaps indicative that:
> - Japanese ISP's are not using PSBL.
> - Japanese sysadmins are less likely to understand how to get themselves
> removed from PSBL due to the language barrier.
> - Japanese sysadmins aren't listing themselves in DNSWL.
> 
> This underscores the need for people who speak other languages, especially
> Asian languages, to participate in the mass checks.

yep!
Comment 15 Justin Mason 2009-07-23 05:01:03 UTC
my FPs are:

- about 10 Yahoo! groups mails
- about 25 gmail messages
- 3 apache.org list messages
- 1 google-apps hosted domain

these are some _very_ common ham mail sources, and one would have thought, easily whitelisted.  This is a pretty common issue with trap-driven blocklists, you can see those FPs occasionally with BRBL too.  I wonder why they're not whitelisted by PSBL?
Comment 16 Justin Mason 2009-07-23 05:02:10 UTC
for reference, here's last saturday's ruleqa results:

http://ruleqa.spamassassin.org/20090718-r795325-n/RCVD_IN_PSBL/detail#all

     SPAM%     HAM%     S/O    RANK   SCORE  NAME WHO/AGE
0  17.8027   0.1807   0.990    0.86    0.00  RCVD_IN_PSBL  
0   6.0275   0.8495   0.876    0.85    0.00  RCVD_IN_PSBL bb-jm 
0  19.4270   0.1223   0.994    0.91    0.00  RCVD_IN_PSBL dos 
0   3.3784   0.1001   0.971    0.76    0.00  RCVD_IN_PSBL jm 
0   6.5474   0.6459   0.910    0.65    0.00  RCVD_IN_PSBL wtogami 
0   3.1553   0.1580   0.952    0.57    0.00  RCVD_IN_PSBL zmi 


overall an S/O ratio of 0.990.
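The S/O column in the table above appears to be the spam hit rate divided by the sum of the spam and ham hit rates. That formula is inferred from the numbers, not stated anywhere in this bug, but it reproduces every row. A short Python check (function and variable names are hypothetical):

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Spam share of a rule's hits, computed from per-class hit percentages.

    Assumed formula: SPAM% / (SPAM% + HAM%).
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule has no hits at all")
    return spam_pct / total


# (SPAM%, HAM%, S/O as printed) copied from the ruleqa table above.
RULEQA_ROWS = {
    "all":     (17.8027, 0.1807, 0.990),
    "bb-jm":   (6.0275,  0.8495, 0.876),
    "dos":     (19.4270, 0.1223, 0.994),
    "jm":      (3.3784,  0.1001, 0.971),
    "wtogami": (6.5474,  0.6459, 0.910),
    "zmi":     (3.1553,  0.1580, 0.952),
}

if __name__ == "__main__":
    for who, (spam_pct, ham_pct, printed) in RULEQA_ROWS.items():
        print(f"{who:8s} computed={s_over_o(spam_pct, ham_pct):.3f} printed={printed:.3f}")
```

Read this way, 0.990 means the rule is strongly skewed toward spam, while the fraction of all spam it catches is the separate 17.8% figure.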
Comment 17 Warren Togami 2009-08-20 17:08:17 UTC
What are the feelings about enabling this rule by default for 3.3.0?

http://ruleqa.spamassassin.org/20090815-r804443-n/RCVD_IN_PSBL/detail
25% of the FP's are to one of my Japanese users all from a single legitimate auction company in Japan.  I'm getting them listed in DNSWL, which will prevent it from being listed in PSBL.

Who runs DNSWL?  It would be helpful if DNSWL had their instructions translated into other languages like Japanese so sysadmins in those countries can more easily understand why they want to list and how.  I could find volunteers to translate it if DNSWL is willing to use those translations.
Comment 18 AXB 2009-08-20 23:09:01 UTC
(In reply to comment #17)
> What are the feelings about enabling this rule by default for 3.3.0?
> 
> http://ruleqa.spamassassin.org/20090815-r804443-n/RCVD_IN_PSBL/detail
> 25% of the FP's are to one of my Japanese users all from a single legitimate
> auction company in Japan.  I'm getting them listed in DNSWL, which will prevent
> it from being listed in PSBL.

if that site gets listed, why not try to fix the problem at the source instead of working around it?

> Who runs DNSWL?  It would be helpful if DNSWL had their instructions translated
> into other languages like Japanese so sysadmins in those countries can more
> easily understand why they want to list and how.  I could find volunteers to
> translate it if DNSWL is willing to use those translations.

On the DNSWL web site: "Contact" admins@dnswl.org
Comment 19 Warren Togami 2009-08-20 23:34:07 UTC
> if that site gets listed, why not try to fix the problem at the source instead
> of working around it?

The mails going into PSBL's spam trap look legitimate because they are legitimate.  Somebody figured out one of the spam trap addresses and subscribed a legit site to deliver to the spam trap.  This is known as trap poisoning.

DNSWL will solve this problem.
Comment 20 AXB 2009-08-20 23:57:26 UTC
(In reply to comment #19)
> > if that site gets listed, why not try to fix the problem at the source instead
> > of working around it?
> 
> The mails going into PSBL's spam trap look legitimate because they are
> legitimate.  Somebody figured out one of the spam trap addresses and
> subscribed a legit site to deliver to the spam trap.  This is known as trap
> poisoning.
> 
> DNSWL will solve this problem.

I dare assume you have confirmed this with Rik, the PSBL op.
Comment 21 Warren Togami 2009-08-22 14:21:42 UTC
Rik van Riel of PSBL recommends that spamassassin should not bother querying PSBL for any mail older than 2 weeks.  PSBL purges all hosts from the blacklist if they have not hit the spam trap in the previous 2 weeks.  Testing ham or spam older than 2 weeks is not helpful.

http://ruleqa.spamassassin.org/20090822-r806811-n/RCVD_IN_PSBL/detail
This is confirmed by the nightly --net masscheck.  See "set 0, broken down by message age in weeks".  The spam hit rate is 21% for only the first two weeks, then it drops off.
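A minimal Python sketch of the age cutoff described here, for illustration only. The function name, the window constant, and passing the receive time explicitly are all assumptions; this is not SpamAssassin code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Two-week listing lifetime, as described by Rik in this comment.
PSBL_LISTING_WINDOW = timedelta(weeks=2)


def worth_querying(received_at: datetime,
                   now: Optional[datetime] = None,
                   window: timedelta = PSBL_LISTING_WINDOW) -> bool:
    """Return True if a message is recent enough for a PSBL lookup to mean anything.

    Messages older than the window are skipped because their sending host
    may already have expired from the list. Future-dated messages are also
    rejected, since their age cannot be trusted.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - received_at
    return timedelta(0) <= age <= window


if __name__ == "__main__":
    check_time = datetime(2009, 8, 22, tzinfo=timezone.utc)
    print(worth_querying(check_time - timedelta(days=3), check_time))
    print(worth_querying(check_time - timedelta(days=30), check_time))
```

In a masscheck over an aged corpus, a filter like this would keep stale messages from counting for or against the list.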
Comment 22 Warren Togami 2009-08-31 17:08:53 UTC
Created attachment 4524 [details]
Example ISO-2022-JP "From" that does not trigger FM_FRM_RN_L_BRACK
Comment 23 Warren Togami 2009-08-31 17:09:35 UTC
damn, wrong bug
Comment 24 Warren Togami 2009-08-31 17:10:38 UTC
Is anything blocking making this default enabled in 3.3.0?
Comment 25 Justin Mason 2009-09-01 04:08:11 UTC
(In reply to comment #24)
> Is anything blocking making this default enabled in 3.3.0?

the FPs noted in comment #16 are killing its score -- from bug 6155:

+score RCVD_IN_PSBL 0 0.416 0 0.001 # n=2

That's not really strong enough to add a new DNSBL lookup, IMO.

if the issue you refer to in comment #21 is what's causing the high FP rate on common ham sources, then we may need to leave inclusion until we can measure DNSBL accuracy using only newer-than-2-weeks-old mail.
Comment 26 Warren Togami 2009-09-01 05:34:59 UTC
> the FPs noted in comment #16 are killing its score -- from bug 6155:
> +score RCVD_IN_PSBL 0 0.416 0 0.001 # n=2

Could we possibly decide on this DNSBL after the next saturday run?  Since the last Saturday I've whitelisted another 15% of the confirmed FP's in my own corpus.

> if the issue you refer to in comment #21 is what's causing the high FP rate on
> common ham sources, then we may need to leave inclusion until we can measure
> DNSBL accuracy using only newer-than-2-weeks-old mail.

How difficult would it be to measure this?

> my FPs are:
> - about 10 Yahoo! groups mails
> - about 25 gmail messages

Any ideas how we can contact yahoo and gmail to get them to maintain their own dnswl entries for their outgoing MTA's?

> - 3 apache.org list messages

Why isn't apache.org listed in dnswl?  This at least you can easily find who is in charge of it.
Comment 27 AXB 2009-09-01 05:49:41 UTC
(In reply to comment #26)
> > the FPs noted in comment #16 are killing its score -- from bug 6155:
> > +score RCVD_IN_PSBL 0 0.416 0 0.001 # n=2
> 
> Could we possibly decide on this DNSBL after the next saturday run?  Since the
> last Saturday I've whitelisted another 15% of the confirmed FP's in my own
> corpus.
> 
> > if the issue you refer to in comment #21 is what's causing the high FP rate on
> > common ham sources, then we may need to leave inclusion until we can measure
> > DNSBL accuracy using only newer-than-2-weeks-old mail.
> 
> How difficult would it be to measure this?
> 
> > my FPs are:
> > - about 10 Yahoo! groups mails
> > - about 25 gmail messages
> 
> Any ideas how we can contact yahoo and gmail to get them to maintain their own
> dnswl entries for their outgoing MTA's?
> 
> > - 3 apache.org list messages
> 
> Why isn't apache.org listed in dnswl?  This at least you can easily find who is
> in charge of it.

Wondering...
what's stopping you from including the PSBL in an extra .cf file your distro's SA rpm?
Comment 28 Warren Togami 2009-09-01 07:01:11 UTC
> Wondering...
> what's stopping you from including the PSBL in an extra .cf file your distro's
> SA rpm?

Nothing.  I could do that easily.

http://stats.dnsbl.com/
Except PSBL seems to be one of the higher quality free DNSBL's.  I would hope it could benefit more people by default.

Nearly all of the FP's seem to be easily fixed with DNSWL of large, legitimate providers.
Comment 29 Warren Togami 2009-09-01 07:15:38 UTC
(In reply to comment #15)
> my FPs are:
> 
> - about 10 Yahoo! groups mails
> - about 25 gmail messages
> - 3 apache.org list messages

Justin,

Rik is asking what IP's did the apache.org messages come from?
Comment 30 Warren Togami 2009-09-01 07:42:34 UTC
Rik said he is working on automatic conversion of google and rr.com's SPF records into automatic whitelists, so none of those IP's will be listed in PSBL even if they send mail to his trap.  ~10-25% of all RCVD_IN_PSBL FP's were in my Japanese corpus, all from a single legitimate Ebay-like company in Japan.  rakuten.co.jp is now being whitelisted using their SPF records.  (Normally SPF is pretty useless, except it logically works for spam trap whitelisting.)

This leaves only yahoo and apache as jm's FP's in Comment #15.

apache.org can certainly be listed in dnswl.

We only need to figure out how to contact yahoo.
Comment 31 Mark Martinec 2009-09-01 08:14:44 UTC
> > my FPs are:
> > - about 10 Yahoo! groups mails
> > - about 25 gmail messages
> Any ideas how we can contact yahoo and gmail to get them to maintain
> their own dnswl entries for their outgoing MTA's?

> We only need to figure out how to contact yahoo.

I wasn't following this track closely, but seems to me that
recognizing genuine yahoo and gmail.com mail is reliably and
quickly done by verifying their DKIM signatures. A meta rule
could turn a hit on DKIM_VALID_AU combined with a From in
gmail.com or yahoo.com to implicitly turn on RCVD_IN_DNSWL_MED.
Comment 32 J.D. Falk 2009-09-01 12:20:23 UTC
(In reply to comment #31)
> > > my FPs are:
> > > - about 10 Yahoo! groups mails
> > > - about 25 gmail messages
> > Any ideas how we can contact yahoo and gmail to get them to maintain
> > their own dnswl entries for their outgoing MTA's?
> 
> > We only need to figure out how to contact yahoo.
> 
> I wasn't following this track closely, but seems to me that
> recognizing genuine yahoo and gmail.com mail is reliably and
> quickly done by verifying their DKIM signatures. A meta rule
> could turn a hit on DKIM_VALID_AU combined with a From in
> gmail.com or yahoo.com to implicitly turn on RCVD_IN_DNSWL_MED.

I used to work for Yahoo!, and I'm fairly certain that their answer would (still) be to check the DKIM signature.
Comment 33 Mark Martinec 2009-09-02 05:05:24 UTC
> recognizing genuine yahoo and gmail.com mail is reliably and
> quickly done by verifying their DKIM signatures. A meta rule
> could turn a hit on DKIM_VALID_AU combined with a From in
> gmail.com or yahoo.com to implicitly turn on RCVD_IN_DNSWL_MED.

With the most recent change to a DKIM plugin in SVN this is even simpler:

full   DKIM_VALID_YG eval:check_dkim_valid
 ('gmail.com','googlemail.com','googlegroups.com','yahoogroups.com','.yahoo.com','yahoo.ca','yahoo.de','yahoo.fr','yahoo.in','yahoo.co.in','yahoo.co.jp','yahoo.co.nz','yahoo.co.uk','yahoo.com.hk','yahoo.com.ph','yahoo.com.vn')
score  DKIM_VALID_YG -0.5
meta   RCVD_IN_PSBL_NOYG  RCVD_IN_PSBL && !DKIM_VALID_YG
score  RCVD_IN_PSBL_NOYG  1
score  RCVD_IN_PSBL 0.1
Comment 34 Mark Martinec 2009-09-02 05:06:41 UTC
(mind the line wrap, the full DKIM_VALID_YG is one long line)
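To make the effect of the proposed scores concrete, here is a small Python sketch that adds up what the three rules from comment #33 would contribute in each case. The scores are copied from that comment; the function name and the boolean inputs are hypothetical, and DKIM verification and domain matching are not modelled.

```python
# Scores copied from the rules proposed in comment #33.
SCORES = {
    "DKIM_VALID_YG": -0.5,
    "RCVD_IN_PSBL_NOYG": 1.0,
    "RCVD_IN_PSBL": 0.1,
}


def psbl_contribution(listed_in_psbl: bool, dkim_valid_yg: bool) -> float:
    """Sum the scores of whichever of the three rules would fire.

    RCVD_IN_PSBL_NOYG mirrors the meta expression
    'RCVD_IN_PSBL && !DKIM_VALID_YG'.
    """
    hits = []
    if listed_in_psbl:
        hits.append("RCVD_IN_PSBL")
    if dkim_valid_yg:
        hits.append("DKIM_VALID_YG")
    if listed_in_psbl and not dkim_valid_yg:
        hits.append("RCVD_IN_PSBL_NOYG")
    return sum(SCORES[name] for name in hits)


if __name__ == "__main__":
    for listed in (False, True):
        for dkim in (False, True):
            total = psbl_contribution(listed, dkim)
            print(f"listed={listed!s:5} dkim_valid={dkim!s:5} -> {total:+.1f}")
```

So a listed relay with a valid Yahoo or Google signature ends up slightly negative overall, while a listed relay without one gets the full PSBL penalty.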
Comment 35 Mark Martinec 2009-09-02 09:37:49 UTC
Some of my statistics from our August logs:

- almost 400.000 hits on RCVD_IN_PSBL overall,
- of which 677 messages were below spam threshold, most of these
  were probably ham (but see below), i.e. false positives on RCVD_IN_PSBL

- 157 messages of these 677 likely-ham messages carried a valid
  DKIM or DK signature from the yahoo and google domains I listed in #33,
  and could be saved from false positives on RCVD_IN_PSBL by the
  above rule

I re-examined the subjects of the remaining 677-157 messages.
The recurring ones were some automatic backup reports, some
newsletters, and four mailing lists, the rest were one-of-a-kind
private messages. Few seemingly spam, but not many, so yes,
most of the remaining 520 messages were indeed false positives
(but some justifiably so, like postings from home networks etc)

Seems pretty good.
Comment 36 Warren Togami 2009-09-02 10:09:00 UTC
Do we have an existing subrule that could be meta booleaned to make it fire only if the message is 2 weeks old or less?
Comment 37 Warren Togami 2009-09-02 10:42:45 UTC
> score  DKIM_VALID_YG -0.5
> meta   RCVD_IN_PSBL_NOYG  RCVD_IN_PSBL && !DKIM_VALID_YG
> score  RCVD_IN_PSBL_NOYG  1
> score  RCVD_IN_PSBL 0.1

It looks like this could work, but there are some concerns here.

* DKIM is an optional dependency of spamassassin that most people are not using.

Aside from that, it might not be necessary for spamassassin to check DKIM for PSBL in the near future.

* Relatively soon Rik will implement SPF whitelisting for Google, to automatically prevent blacklisting.
* It might soon not be necessary.  Rik wants to later add DKIM to the spam trap so he could automatically detect Yahoo mail and refrain from blacklisting those IP's.  This might take a few weeks to set up though.
* Independently of that, I'm in communication with someone from Yahoo who is trying to convince folks internally to list themselves in DNSWL.

These automatic whitelisting measures will not take effect for a few weeks at PSBL.  Even after they are implemented, it will take 2 weeks for already listed IP's to timeout from the blacklist.
Comment 38 Warren Togami 2009-09-02 12:24:10 UTC
Created attachment 4530 [details]
RCVD_IN_PSBL_2WEEKS

Could we please test this in the next Saturday --net masscheck?  I used the received_within_months function which already existed in HeaderEval.pm.
Comment 39 Warren Togami 2009-09-02 14:09:50 UTC
Comments on the above patch please.

If there are reservations about the proposed rule, I hope at least the lib/Mail/SpamAssassin/Plugin/HeaderEval.pm one-liner can be applied.  That should be safe right?
Comment 40 Justin Mason 2009-09-02 15:05:48 UTC
(In reply to comment #38)
> Created an attachment (id=4530) [details]
> RCVD_IN_PSBL_2WEEKS
> 
> Could we please test this in the next Saturday --net masscheck?  I used the
> received_within_months function which already existed in HeaderEval.pm.

I'll add it to the mass-check tarball when we build it, and we can see how it does in the mass-checks and GA run.
Comment 41 Warren Togami 2009-09-02 19:02:52 UTC
I guess RCVD_IN_PSBL_2WEEKS is only really useful to more accurately measure the FP safety of PSBL in masscheck, since measuring RCVD_IN_PSBL directly is inaccurate by design.  It shouldn't be enabled as a regular rule.
Comment 42 Justin Mason 2009-09-03 01:03:34 UTC
(In reply to comment #41)
> I guess RCVD_IN_PSBL_2WEEKS is only really useful to more accurately measure
> the FP safety of PSBL in masscheck, since measuring RCVD_IN_PSBL directly is
> inaccurate by design.  It shouldn't be enabled as a regular rule.

well, it should be fine to use in place of RCVD_IN_PSBL, right?  "live" mail will always be < 2 weeks old.
Comment 43 Mark Martinec 2009-09-03 05:01:41 UTC
(In reply to comment #39)
> If there are reservations about the proposed rule, I hope at least the
> lib/Mail/SpamAssassin/Plugin/HeaderEval.pm one-liner can be applied.
> That should be safe right?

Can't hurt.


  Plugin/HeaderEval.pm: add one line: expose existing
  function 'received_within_months' as an eval function
Sending        lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
Committed revision 810905.
Comment 44 Warren Togami 2009-09-03 06:22:59 UTC
(In reply to comment #42)
> (In reply to comment #41)
> > I guess RCVD_IN_PSBL_2WEEKS is only really useful to more accurately measure
> > the FP safety of PSBL in masscheck, since measuring RCVD_IN_PSBL directly is
> > inaccurate by design.  It shouldn't be enabled as a regular rule.
> 
> well, it should be fine to use in place of RCVD_IN_PSBL, right?  "live" mail
> will always be < 2 weeks old.

Yes, fine, but if you think about it, it is rather useless to test "live" since it will always be true, right?

I am suggesting it might be useful for masscheck only because we can see both RCVD_IN_PSBL and RCVD_IN_PSBL_2WEEKS to compare.  This might be useful to test other network-based rules like pyzor or other DNSBL's as well since they really aren't designed to be accurate for old mail either.
Comment 45 Warren Togami 2009-09-03 11:52:41 UTC
(In reply to comment #40)
> (In reply to comment #38)
> > Created an attachment (id=4530) [details]
> > RCVD_IN_PSBL_2WEEKS
> > 
> > Could we please test this in the next Saturday --net masscheck?  I used the
> > received_within_months function which already existed in HeaderEval.pm.
> 
> I'll add it to the mass-check tarball when we build it, and we can see how it
> does in the mass-checks and GA run.

Huh?  Add to mass-check tarball but not spamassassin-trunk?
Comment 46 Justin Mason 2009-09-03 13:31:34 UTC
(In reply to comment #45)
> > I'll add it to the mass-check tarball when we build it, and we can see how it
> > does in the mass-checks and GA run.
> 
> Huh?  Add to mass-check tarball but not spamassassin-trunk?

it'll go into trunk, too. ;)
Comment 47 Justin Mason 2009-09-03 14:30:48 UTC
: 73...; svn commit -m "bug 6156: add RCVD_IN_PSBL_2WEEKS to measure accuracy of PSBL using recent mail only"
Sending        rulesrc/sandbox/kb/20_bug_6156.cf
Transmitting file data .
Committed revision 811133.
Comment 48 Warren Togami 2009-09-05 13:22:42 UTC
http://ruleqa.spamassassin.org/20090905-r811608-n/RCVD_IN_PSBL_2WEEKS/detail
Down to 22 FP's in RCVD_IN_PSBL_2WEEKS, all within hege's corpus.  Who is hege?  Could we possibly ask if those 22 FP's have anything in common?
Comment 49 Henrik Krohns 2009-09-06 00:10:41 UTC
It's all ham. Looks to be mostly from two separate SOHO instances (natted/exchange and through ISP smarthost) with bad looking security history.. so nothing out of the ordinary really. I'm surprised I'm the only one with such hits. :)
Comment 50 Henrik Krohns 2009-09-06 00:13:11 UTC
To be exact, it's not the smarthosts that were hit, but the real SOHO addresses. So DNSWL wouldn't have any effect here. This is the type of FPs to be expected from PSBL and deep parsing.
Comment 51 AXB 2009-09-06 00:20:38 UTC
(In reply to comment #50)
> To be exact, it's not the smarthosts that were hit, but the real SOHO addresses.
> So DNSWL wouldn't have any effect here. This is the type of FPs to be expected
> from PSBL and deep parsing.

the PSBL should NEVER be used for deep header parsing. It's screaming for FPs...
Comment 52 Warren Togami 2009-09-06 08:36:18 UTC
Do we have a way to make it parse only the last Received header?
Comment 53 AXB 2009-09-06 08:46:53 UTC
(In reply to comment #52)
> Do we have a way to make it parse only the last Received header?

afaik, changing

header   RCVD_IN_PSBL  eval:check_rbl('psbl', 'psbl.surriel.com.')

to

header   RCVD_IN_PSBL eval:check_rbl('psbl-lastexternal', 'psbl.surriel.com.')

should do the trick.
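The difference between the two variants can be sketched in Python. This is a simplified model, not SpamAssassin's implementation: it assumes the external (untrusted) relay IPs have already been extracted from the Received headers and are ordered with the relay that handed the message to the local network first. All names are hypothetical.

```python
from typing import Callable, List


def ips_to_query(external_relays: List[str], lastexternal: bool) -> List[str]:
    """Pick which relay IPs a DNSBL rule would look up.

    external_relays[0] is assumed to be the last external relay, i.e. the
    host that delivered the message to our own network.
    """
    if not external_relays:
        return []
    if lastexternal:
        return external_relays[:1]   # only the handing-off host
    return list(external_relays)     # "deep parsing": every external hop


def rule_hits(external_relays: List[str],
              is_listed: Callable[[str], bool],
              lastexternal: bool) -> bool:
    """True if any of the selected IPs is on the list."""
    return any(is_listed(ip) for ip in ips_to_query(external_relays, lastexternal))


if __name__ == "__main__":
    # Documentation addresses (RFC 5737); the second one plays a listed end-user IP.
    relays = ["198.51.100.7", "203.0.113.9"]
    listed = {"203.0.113.9"}
    print(rule_hits(relays, listed.__contains__, lastexternal=False))
    print(rule_hits(relays, listed.__contains__, lastexternal=True))
```

In this model a listed end-user address behind a clean smarthost triggers the deep variant but not the lastexternal one, which matches the kind of FP Henrik describes.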
Comment 54 Warren Togami 2009-09-07 19:00:22 UTC
(In reply to comment #53)
> (In reply to comment #52)
> > Do we have a way to make it parse only the last Received header?
> 
> afaik, changing
> 
> header   RCVD_IN_PSBL  eval:check_rbl('psbl', 'psbl.surriel.com.')
> 
> to
> 
> header   RCVD_IN_PSBL eval:check_rbl('psbl-lastexternal', 'psbl.surriel.com.')
> 
> should do the trick.

I can confirm that this seems to do the right thing.  Henrik, does this rule change eliminate all of your FP's?

NOTE: All cases of PSBL from yahoo or gmail will be whitelisted soon after riel implements auto-whitelisting with SPF/DKIM testing.  Other common FP senders with defined SPF or DKIM records like rr.com and rakuten.co.jp will be added soon.
Comment 55 Justin Mason 2009-09-08 03:01:28 UTC
(In reply to comment #54)
> I can confirm that this seems to do the right thing.  Henrik, does this rule
> change eliminate all of your FP's?

I'm not sure what we can do about this w.r.t. the mass-checks -- we've missed the window to check in that change.

We could estimate the effect on the scores, based on a limited weekly mass-check result, and fix the rule and modify the "real" score accordingly.... bit of a hack :(
Comment 56 Henrik Krohns 2009-09-08 03:09:35 UTC
Well I just checked and for me deep parsing does work MUCH better. It catches lots of relayed/forwarded/list etc mail, which lastexternal obviously doesn't. The very small amount of FPs doesn't bother me and I can trusted_networks such anyway.
Comment 57 Warren Togami 2009-09-08 06:38:34 UTC
(In reply to comment #56)
> Well I just checked and for me deep parsing does work MUCH better. It catches
> lots of relayed/forwarded/list etc mail, which lastexternal obviously doesn't.
> The very small amount of FPs doesn't bother me and I can trusted_networks such
> anyway.

Yeah, I suspected this.  I have ZERO FP's before the rule change, and I lose 20%+ of the spam hits afterward.
Comment 58 Warren Togami 2009-09-13 19:09:48 UTC
It seems the deep parsing PSBL is safe enough and we should stick with it.

http://ruleqa.spamassassin.org/20090912-r814117-n/RCVD_IN_PSBL_2WEEKS/detail
20 of 37006 ham false positives in the last 2 weeks of ham.
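For scale, the false-positive counts quoted in this bug can be turned into percentages with a few lines of Python. The helper name is hypothetical. The two figures come from different corpora and different rules (comment #12 is one contributor's RCVD_IN_PSBL result, this one is RCVD_IN_PSBL_2WEEKS across contributors), so they are not a like-for-like comparison.

```python
def fp_rate_percent(false_positives: int, ham_total: int) -> float:
    """False positives as a percentage of all ham checked."""
    if ham_total <= 0:
        raise ValueError("ham_total must be positive")
    return 100.0 * false_positives / ham_total


if __name__ == "__main__":
    # Counts as quoted in comment #12 and in this comment.
    print(f"comment 12: {fp_rate_percent(47, 7283):.4f}%")
    print(f"comment 58: {fp_rate_percent(20, 37006):.4f}%")
```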

riel tells me that PSBL now filters out all IP addresses (from SPF) of past FP's like gmail, rr.com and yahoo.co.jp.  yahoo.com is not yet filtered because they do not have SPF.  He is soon implementing DKIM in order to filter those out automatically.
Comment 59 Warren Togami 2009-09-22 10:27:28 UTC
http://ruleqa.spamassassin.org/20090919-r816871-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090919-r816871-n/RCVD_IN_PSBL_2WEEKS/detail
RCVD_IN_PSBL continues to perform well in weekly_mass_check.

But it seems mcsnapshot is broken.  RCVD_IN_PSBL seems to never trigger in the rescore masscheck.  What should we do?
Comment 60 Mark Martinec 2009-09-22 10:43:13 UTC
> But it seems mcsnapshot is broken.  RCVD_IN_PSBL seems to never trigger in the
> rescore masscheck.  What should we do?

Don't know, I was running the mcsnapshot.tar as posted by Justin
for my rescoring runs, and I have plenty of hits on RCVD_IN_PSBL in
my spam.log, and some in ham.log.

Did you forget the --net option on mass-check ?
Comment 61 Warren Togami 2009-09-22 12:13:39 UTC
./mass-check --progress --hamlog=ham-bayes-net-wt-en4.log --spamlog=spam-bayes-net-wt-en4.log --net --bayes spam:mbox:/path/to/spambox

If I run this command in mcsnapshot/masses/ it has other DNSBL hits like RCVD_IN_PBL but not RCVD_IN_PSBL.

If I run this command in nightly_mass_check/masses/, then RCVD_IN_PSBL has hits.

This is why I have been asking Justin to put up a ruleqa page for the rescoring masscheck.  I was looking only at the nightly and weekly masscheck results on ruleqa, so I didn't realize anything was broken in mcsnapshot until now.
Comment 62 Warren Togami 2009-09-22 13:46:21 UTC
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c39
Figured out a workaround that makes it work.  Redoing my mcsnapshot masschecks now...

Makes me wonder if other people's masschecks are also partially failing with invalid logs.
Comment 63 Warren Togami 2009-09-28 10:17:47 UTC
I believe we made a mistake here with deep parsing instead of lastexternal.

While masschecks have shown us that it catches maybe an additional 20% with deep parsing, it does introduce some very rare FP's.   Deep parsing catches IP addresses that posted to a Yahoo webmail interface or mail sent via a legitimate MTA.  They are FP's for reasons like sending mail from a mobile phone or mobile broadband from an IP address that was previously used by a spammer.  There is nothing these users can do about it.

While RCVD_IN_PSBL with deep parsing alone is not likely to flag the mail as spam, that IP address could easily FP on a different DNSBL and push it over.

I believe we should do the following.

1) RCVD_IN_PSBL becomes lastexternal.  It deserves a higher score because it eliminates the above type of FP.

2) Add a separate subrule that hits with PSBL deep parsing && !RCVD_IN_PSBL.  This can add a smaller score, safer to the very rare FP's.  Can deep parsing and lastexternal be done simultaneously without two queries?

3) Release 3.3.0 with #1 by default.  We could add #2 too, or wait until sa-update later.  It will be difficult to score part #2 with the GA given how rare the FP's are.  I am not aware of any in my own 8 users' corpus at the moment.
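The two-tier idea in points 1 and 2 can be sketched in Python as a classifier that looks each relay IP up at most once. This is only an illustration under assumptions: the relay list is taken as already extracted, with the last external relay first, and the tier labels and function names are hypothetical. Whether SpamAssassin itself can share one query between two such rules is the open question in point 2 and is not answered here.

```python
from typing import Callable, Dict, List, Optional


def classify_psbl_hit(external_relays: List[str],
                      is_listed: Callable[[str], bool]) -> Optional[str]:
    """Return 'lastexternal', 'deep_only', or None for a message.

    external_relays[0] is assumed to be the last external relay. Results
    are cached so each distinct IP is looked up only once.
    """
    cache: Dict[str, bool] = {}

    def cached(ip: str) -> bool:
        if ip not in cache:
            cache[ip] = is_listed(ip)
        return cache[ip]

    if not external_relays:
        return None
    if cached(external_relays[0]):
        return "lastexternal"      # point 1: higher-confidence hit
    if any(cached(ip) for ip in external_relays[1:]):
        return "deep_only"         # point 2: deep hit && !lastexternal
    return None


if __name__ == "__main__":
    listed = {"203.0.113.9"}
    print(classify_psbl_hit(["203.0.113.9", "198.51.100.7"], listed.__contains__))
    print(classify_psbl_hit(["198.51.100.7", "203.0.113.9"], listed.__contains__))
    print(classify_psbl_hit(["198.51.100.7"], listed.__contains__))
```

A scorer could then give the 'lastexternal' tier the larger score and the 'deep_only' tier a smaller one, as proposed above.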
Comment 64 Michael P 2009-09-28 10:35:57 UTC
(In reply to comment #63)
> I believe we made a mistake here with deep parsing instead of lastexternal.
> 
> While masschecks have shown us that it catches maybe an additional 20% with
> deep parsing, it does introduce some very rare FP's.   Deep parsing catches IP
> addresses that posted to a Yahoo webmail interface or mail sent via a
> legitimate MTA.  They are FP's for reasons like sending mail from a mobile
> phone or mobile broadband from an IP address that was previously used by a
> spammer.  There is nothing these users can do about it.
> 
> While RCVD_IN_PSBL with deep parsing alone is not likely to flag the mail as
> spam, that IP address could easily FP on a different DNSBL and push it over.
> 
> I believe we should do the following.
> 
> 1) RCVD_IN_PSBL becomes lastexternal.  It deserves a higher score because it
> eliminates the above type of FP.
> 
> 2) Add a separate subrule that hits with PSBL deep parsing && !RCVD_IN_PSBL. 
> This can add a smaller score, safer to the very rare FP's.  Can deep parsing
> and lastexternal be done simultaneously without two queries?
> 
> 3) Release 3.3.0 with #1 by default.  We could add #2 too, or wait until
> sa-update later.  It will be difficult to score part #2 with the GA given how
> rare the FP's are.  I am not aware of any in my own 8 users' corpus at the
> moment.

Agreed, deep parsing should NOT affect the Spam Scores, as deep parsing will reach the end IP Address, which normally should/could be on PSBL or other lists.  It should ONLY be last external, who in reality is the responsible party for controlling outbound leakage.
Comment 65 Warren Togami 2009-10-02 12:24:13 UTC
jm also agreed on list that this rule should be lastexternal.  I'm waiting for my ASF account to be created so I cannot commit this myself yet.

http://ruleqa.spamassassin.org/20090930-r808953-n/T_RCVD_IN_PSBL_2WEEKS/detail
rescore masscheck results for RCVD_IN_PSBL_2WEEKS are looking very favorable if you look only at the nearest weeks.  PSBL times out entries older than 2 weeks so older lookups are not very useful for measuring its safety.  The rescore masscheck results reflect deep parsing.  Even with unsafe deep parsing the numbers are looking very good.  Switching to lastexternal should make it even safer.
Comment 66 Warren Togami 2009-11-19 11:02:59 UTC
This is fixed.