Bug 1051 - Detect gibberish email addresses
Summary: Detect gibberish email addresses
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (Eval Tests)
Version: unspecified
Hardware: Other other
Importance: P2 enhancement
Target Milestone: ---
Assignee: Matthew Cline
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-10-03 00:14 UTC by Matthew Cline
Modified: 2002-11-21 14:23 UTC
CC List: 1 user



Attachment Type Modified Status Actions Submitter/CLA Status
Tar file containing extract-names.pl and name-triplets.pl application/x-tar None Matthew Cline [HasCLA]

Description Matthew Cline 2002-10-03 00:14:36 UTC
A lot of spammers seem to just make up random user names for the
"From" field.  A first stab at detecting this looks for a certain
number of consonants in a row (4 and 5 in a row for the two rules
below):
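As an illustration only (the actual rules are SpamAssassin eval tests
written in Perl; the function names and the treatment of "y" as a
consonant below are assumptions), a consonant-run check along these
lines could look like:

```python
import re

# Hypothetical sketch of the consonant-run tests described above:
# flag a From localpart containing N or more consonants in a row.
CONSONANT_RUN_4 = re.compile(r"[bcdfghjklmnpqrstvwxyz]{4}", re.IGNORECASE)
CONSONANT_RUN_5 = re.compile(r"[bcdfghjklmnpqrstvwxyz]{5}", re.IGNORECASE)

def nonsense_from_1(localpart):
    """Rough analogue of T_NONSENSE_FROM_1: four consonants in a row."""
    return bool(CONSONANT_RUN_4.search(localpart))

def nonsense_from_2(localpart):
    """Rough analogue of T_NONSENSE_FROM_2: five consonants in a row."""
    return bool(CONSONANT_RUN_5.search(localpart))
```

Note that legitimate names can contain four-consonant runs
("strength" has "ngth"), which is consistent with the mediocre S/O
reported below.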

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  14980     4752    10228    0.32    0.00    0.00  (all messages)
100.000   31.722   68.278    0.32    0.00    0.00  (all messages as %)
  1.689    4.019    0.606    0.87    0.41    1.00  T_NONSENSE_FROM_1
  0.874    2.125    0.293    0.88    0.42    1.00  T_NONSENSE_FROM_2

The S/O isn't that great.

I also tried something similar to the unique-subject-ID detection: it
checks each triplet of letters in the user name against a list of
triplets expected to appear in a real name, and calculates the
percentage that aren't expected.
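A sketch of the triplet-percentage idea, in Python for illustration
(the real implementation is a Perl eval test; the helper name and the
handling of short names here are assumptions, following the
initials-only exclusion described later in this report):

```python
def nonsense_percent(localpart, known_triplets):
    """Percentage of overlapping letter triplets in the user name that
    are NOT in the expected-triplet list.  Returns None for names with
    fewer than two triplets (probably just someone's initials)."""
    letters = "".join(ch for ch in localpart.lower() if ch.isalpha())
    triplets = [letters[i:i + 3] for i in range(len(letters) - 2)]
    if len(triplets) < 2:
        return None
    unknown = sum(1 for t in triplets if t not in known_triplets)
    return 100.0 * unknown / len(triplets)
```

The per-range rules below then just test whether this percentage falls
into a given bucket (40-50, 50-60, and so on).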

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  14980     4752    10228    0.32    0.00    0.00  (all messages)
100.000   31.722   68.278    0.32    0.00    0.00  (all messages as %)
  1.335    4.209    0.000    1.00    0.85    0.01  T_NONSENSE_FROM_60_70
  0.961    3.009    0.010    1.00    0.67    0.01  T_NONSENSE_FROM_70_80
  0.314    0.968    0.010    0.99    0.60    0.01  T_NONSENSE_FROM_80_90
  0.013    0.042    0.000    1.00    0.56    0.01  T_NONSENSE_FROM_92_93
  1.095    3.220    0.108    0.97    0.52    0.01  T_NONSENSE_FROM_50_60
  0.007    0.021    0.000    1.00    0.51    0.01  T_NONSENSE_FROM_91_92
  3.565    9.470    0.821    0.92    0.45    0.01  T_NONSENSE_FROM_40_50
  5.113   13.363    1.281    0.91    0.44    0.01  T_NONSENSE_FROM_99_100
  4.459    8.880    2.405    0.79    0.37    0.01  T_NONSENSE_FROM_20_30
  5.113    9.954    2.865    0.78    0.36    0.01  T_NONSENSE_FROM_30_40
  9.826    9.280   10.080    0.48    0.25    0.01  T_NONSENSE_FROM_10_20
 68.198   37.584   82.421    0.31    0.18    0.01  T_NONSENSE_FROM_00_10
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_93_94
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_90_91
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_95_96
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_94_95
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_96_97
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_98_99
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_97_98

So it looks like we could make three useful rules from that:
NONSENSE_FROM_40_50, NONSENSE_FROM_50_99, and NONSENSE_FROM_99_100.

NONSENSE_FROM_50_99 would catch 11.5% of spam, and NONSENSE_FROM_40_50
9.5% of spam, with pretty good S/O.

T_NONSENSE_FROM_99_100 has a high FP rate, probably because it
triggers on any email address whose user name contains only a single
triplet of letters, such as a person's initials.  When building the
triplet list I excluded user names with only one triplet of letters in
them, since those are probably just someone's initials, and thus
aren't good candidates for the kind of triplet to expect in an email
user name.

Of course, the list of triplets I use is badly over-fitted to my own
list of non-spammish email addresses, and will need to be expanded
with triplets from other people's corpora.  I'll attach a tar file
with two tools for this: extract-names.pl, which extracts email
addresses from files containing email messages, and name-triplets.pl,
which takes the output of extract-names.pl and writes out a triplets
file.  Before being fed to name-triplets.pl, the output of
extract-names.pl should be reviewed by hand to remove email addresses
that really are gibberish (since some non-spammers pick names like
this).  The various triplets files can then be combined by doing
"sort -u file1 file2 ... > name-triplets.txt"
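For illustration, here is a Python sketch of roughly what
name-triplets.pl is described as doing (the real tool is Perl; this is
an approximation under stated assumptions, including the one-triplet
exclusion mentioned above):

```python
def extract_triplets(names):
    """Sketch of name-triplets.pl's described behaviour: emit the
    sorted unique letter triplets found in a list of hand-vetted email
    user names.  Names yielding fewer than two triplets (probably just
    someone's initials) are skipped, per the report above."""
    triplets = set()
    for name in names:
        letters = "".join(ch for ch in name.lower() if ch.isalpha())
        if len(letters) < 4:  # fewer than two overlapping triplets
            continue
        for i in range(len(letters) - 2):
            triplets.add(letters[i:i + 3])
    return sorted(triplets)
```

Writing the sorted unique triplets one per line gives a file that can
be merged with others via the "sort -u" command above.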
Comment 1 Matthew Cline 2002-10-03 00:15:56 UTC
Created attachment 371 [details]
Tar file containing extract-names.pl and name-triplets.pl
Comment 2 Daniel Quinlan 2002-10-03 00:42:43 UTC
I like the idea, although based on your results (I am running a mass-check now),
I'd prefer to have more ranges: 50-60 is not as good as 60-90.  I'd suggest just
breaking them up by deciles.

Also, this rule is highly English specific.  It will barf on non-English
mail, so it needs to only be run for specific ok_locale or ok_language settings,
I think.
Comment 3 Daniel Quinlan 2002-10-07 20:33:19 UTC
assigning bug (note that I am adding a Cc: to SAdev for all of these)
Comment 4 Daniel Quinlan 2002-10-15 21:49:42 UTC
Here are my latest results.

I think the rule looks good from 80-95, but some people (I think it was
Theo) have reported 100% FP rates for all ranges.

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  12402     4708     7694    0.38    0.00    0.00  (all messages)
100.000   37.962   62.038    0.38    0.00    0.00  (all messages as %)
  0.524    1.381    0.000    1.00    0.81    0.01  T_NONSENSE_FROM_80_90
  0.024    0.064    0.000    1.00    0.60    0.01  T_NONSENSE_FROM_92_93
  1.169    2.995    0.052    0.98    0.58    1.00  T_NONSENSE_FROM_2
  0.016    0.042    0.000    1.00    0.57    0.01  T_NONSENSE_FROM_91_92
  1.919    3.462    0.975    0.78    0.38    0.01  T_NONSENSE_FROM_50_60
  2.766    4.397    1.768    0.71    0.35    0.01  T_NONSENSE_FROM_60_70
  2.000    3.122    1.313    0.70    0.34    0.01  T_NONSENSE_FROM_70_80
  4.217    6.393    2.885    0.69    0.34    1.00  T_NONSENSE_FROM_1
 27.447   32.009   24.656    0.56    0.29    0.01  T_NONSENSE_FROM_00_10
  8.095    9.303    7.356    0.56    0.29    0.01  T_NONSENSE_FROM_40_50
  8.861    9.941    8.201    0.55    0.28    0.01  T_NONSENSE_FROM_20_30
 11.530   11.257   11.697    0.49    0.26    0.01  T_NONSENSE_FROM_10_20
 12.603   11.682   13.166    0.47    0.25    0.01  T_NONSENSE_FROM_99_100
 24.214   13.339   30.868    0.30    0.18    0.01  T_NONSENSE_FROM_30_40
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_97_98
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_94_95
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_90_91
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_98_99
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_95_96
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_96_97
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_93_94
Comment 5 Theo Van Dinter 2002-10-15 22:07:56 UTC
Subject: Re: [SAdev]  Detect gibberish email addresses

On Tue, Oct 15, 2002 at 09:49:43PM -0700, bugzilla-daemon@hughes-family.org wrote:
> I think the rule looks good from 80-95, but some people (I think it was
> Theo) have reported 100% FP rates for all ranges.

I haven't had the time to track down why yet, but the problem comes
from me running a mass-check with only the 70_cvs_*.cf file which I do
for time.  If I run a mass-check with all the cf files, the rules seem
to work as expected.

For instance, here are the results from my latest CORPUS_SUBMIT run:

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  22109     8555    13554    0.39    0.00    0.00  (all messages)
100.000   38.695   61.305    0.39    0.00    0.00  (all messages as %)
  0.045    0.117    0.000    1.00    0.65    0.01  T_NONSENSE_FROM_92_93
  0.005    0.012    0.000    1.00    0.49    0.01  T_NONSENSE_FROM_90_91
  0.005    0.012    0.000    1.00    0.49    0.01  T_NONSENSE_FROM_96_97
  0.005    0.012    0.000    1.00    0.49    0.01  T_NONSENSE_FROM_93_94
  0.502    1.134    0.103    0.92    0.47    0.01  T_NONSENSE_FROM_80_90
  2.370    4.816    0.826    0.85    0.42    1.00  T_NONSENSE_FROM_1
  1.217    2.244    0.568    0.80    0.39    1.00  T_NONSENSE_FROM_2
  2.090    3.156    1.417    0.69    0.34    0.01  T_NONSENSE_FROM_50_60
  7.585    9.679    6.264    0.61    0.31    0.01  T_NONSENSE_FROM_40_50
  2.904    3.612    2.457    0.60    0.30    0.01  T_NONSENSE_FROM_60_70
 11.344   13.022   10.285    0.56    0.29    0.01  T_NONSENSE_FROM_30_40
  8.060    9.001    7.466    0.55    0.28    0.01  T_NONSENSE_FROM_20_30
  2.415    2.420    2.413    0.50    0.26    0.01  T_NONSENSE_FROM_70_80
 37.157   34.740   38.682    0.47    0.25    0.01  T_NONSENSE_FROM_00_10
 13.836   11.899   15.058    0.44    0.24    0.01  T_NONSENSE_FROM_99_100
 14.049   11.186   15.855    0.41    0.23    0.01  T_NONSENSE_FROM_10_20
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_98_99
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_97_98
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_91_92
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_95_96
  0.000    0.000    0.000    0.00    0.00    0.01  T_NONSENSE_FROM_94_95
Comment 6 Theo Van Dinter 2002-10-17 06:53:48 UTC
Subject: Re: [SAdev]  Detect gibberish email addresses

On Wed, Oct 16, 2002 at 01:07:51AM -0400, Theo Van Dinter wrote:
> I haven't had the time to track down why yet, but the problem comes
> from me running a mass-check with only the 70_cvs_*.cf file which I do
> for time.  If I run a mass-check with all the cf files, the rules seem
> to work as expected.

Ok, I used the time on my train ride in this morning to look at this
and I figured out the issue...

First: As I said, I _only_ have the 70_cvs*.cf file in there, which
means the name-triplets.txt file isn't, so the test couldn't find its
input file.

Second: nonsense_from_percent() returns 1 if (triplets in the From not
in the above file) / (total triplets in the From) is > min and <= max
(percentages taken from the rule file), and 0 otherwise.
Unfortunately, the function also returns 1 on error (like "No such
file"), which for me means every message triggers every nonsense rule;
see "First".

Third: nonsense_from_percent() uses "rules_filename" to find the triplets
file, which works well in mass-check since we set rules_filename, but
outside of that it will likely fail (rules_filename only gets set if
someone runs spamassassin or spamd with the -C option).  The rest of the
code has been looking at DEF_RULES_DIR and LOCAL_RULES_DIR which do get
set via spamassassin and spamd normally.

Fourth: mass-check wasn't setting DEF_RULES_DIR and LOCAL_RULES_DIR when
running, so some tests weren't able to find the files they wanted to use,
and some rules didn't match properly.  See "Third"; specifically:
nonsense_from_percent (T_NONSENSE_FROM_*) and word_is_in_dictionary
(SUBJ_HAS_UNIQ_ID).
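A minimal sketch of the fix being described, in Python for
illustration (the real code is the Perl eval test; the file format and
helper details here are assumptions): a missing triplets file must
mean "no match" (0), not a hit (1).

```python
def nonsense_from_percent(path, localpart, min_pct, max_pct):
    """Return 1 if the unknown-triplet percentage of the From localpart
    falls in (min_pct, max_pct], else 0.  On error, return 0."""
    try:
        with open(path) as fh:
            known = {line.strip() for line in fh if line.strip()}
    except OSError:
        return 0  # the bug: this path used to return 1, firing every rule
    letters = "".join(ch for ch in localpart.lower() if ch.isalpha())
    triplets = [letters[i:i + 3] for i in range(len(letters) - 2)]
    if not triplets:
        return 0
    pct = 100.0 * sum(t not in known for t in triplets) / len(triplets)
    return 1 if min_pct < pct <= max_pct else 0
```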


Anyway, I'll be committing patches to HEAD shortly.

Comment 7 Matthew Cline 2002-10-21 01:44:11 UTC
> Also, this rule is highly English specific.

OK, I have some code that will not run the tests if ok_languages and
ok_locales are both "all".  Otherwise, it looks for a
name-triplets.txt file for each language/locale specified, and only
runs if it finds them all.

However, if done this way, how will the tests run during a mass-check, so that
the results can be GA'd?
Comment 8 Matthew Cline 2002-11-20 23:57:10 UTC
This doesn't seem to be working too well for other people; shall I remove this
from CVS and close the bug WONTFIX?
Comment 9 Justin Mason 2002-11-21 09:09:19 UTC
Subject: Re: [SAdev]  Detect gibberish email addresses


> This doesn't seem to be working too well for other people; shall I
> remove this from CVS and close the bug WONTFIX?

in this case, I would.  sorry about that! ;)

Comment 10 Matthew Cline 2002-11-21 23:23:30 UTC
Removed code from CVS; marking WONTFIX.