SA Bugzilla – Bug 1051
Detect gibberish email addresses
Last modified: 2002-11-21 14:23:30 UTC
A lot of spammers seem to just make up random user names for the "From" field. My first stab at it tries to detect a certain number of consonants in a row (4 and 5 in a row, for the following rules):

OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   14980     4752     10228  0.32  0.00   0.00  (all messages)
 100.000   31.722    68.278  0.32  0.00   0.00  (all messages as %)
   1.689    4.019     0.606  0.87  0.41   1.00  T_NONSENSE_FROM_1
   0.874    2.125     0.293  0.88  0.42   1.00  T_NONSENSE_FROM_2

The S/O isn't that great.

I also tried something similar to detecting unique subject IDs. It looks through a list of triplets of letters that are expected to be in a user name, and calculates the percentage of the address's triplets that aren't expected.

OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   14980     4752     10228  0.32  0.00   0.00  (all messages)
 100.000   31.722    68.278  0.32  0.00   0.00  (all messages as %)
   1.335    4.209     0.000  1.00  0.85   0.01  T_NONSENSE_FROM_60_70
   0.961    3.009     0.010  1.00  0.67   0.01  T_NONSENSE_FROM_70_80
   0.314    0.968     0.010  0.99  0.60   0.01  T_NONSENSE_FROM_80_90
   0.013    0.042     0.000  1.00  0.56   0.01  T_NONSENSE_FROM_92_93
   1.095    3.220     0.108  0.97  0.52   0.01  T_NONSENSE_FROM_50_60
   0.007    0.021     0.000  1.00  0.51   0.01  T_NONSENSE_FROM_91_92
   3.565    9.470     0.821  0.92  0.45   0.01  T_NONSENSE_FROM_40_50
   5.113   13.363     1.281  0.91  0.44   0.01  T_NONSENSE_FROM_99_100
   4.459    8.880     2.405  0.79  0.37   0.01  T_NONSENSE_FROM_20_30
   5.113    9.954     2.865  0.78  0.36   0.01  T_NONSENSE_FROM_30_40
   9.826    9.280    10.080  0.48  0.25   0.01  T_NONSENSE_FROM_10_20
  68.198   37.584    82.421  0.31  0.18   0.01  T_NONSENSE_FROM_00_10
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_93_94
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_90_91
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_95_96
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_94_95
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_96_97
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_98_99
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_97_98

So it looks like we could make three useful rules from that: NONSENSE_FROM_40_50, NONSENSE_FROM_50_99, and NONSENSE_FROM_99_100. NONSENSE_FROM_50_99 would catch 11.5% of spam, and NONSENSE_FROM_40_50 9.5% of spam, with pretty good S/O.

T_NONSENSE_FROM_99_100 has a high FP rate, probably because it triggers on any email user name whose only triplet is a person's initials. When building the triplet list I excluded email user names that had only one triplet of letters in them, since those are probably just someone's initials, and thus aren't a good candidate for the type of triplet to expect in an email user name.

Of course, the list of triplets that I use is very over-fitted to my list of non-spammish email addresses, and will need to be expanded with those from the corpora of others. I'll attach two tools in a tar file to do this: extract-names.pl, which will extract email addresses from files containing email messages, and name-triplets.pl, which will take the output of extract-names.pl and spit out a triplets file. Before being fed to name-triplets.pl, the output of extract-names.pl should be gone through by hand, to take out email addresses which really are gibberish (since some non-spammers pick names like this). The various triplets files can then be combined by doing "sort -u file1 file2 ... > name-triplets.txt"
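To make the triplet idea concrete, here is a minimal Python sketch of it. It is not the code from the attachment: the function names are made up for illustration, and it assumes that triplets are overlapping runs of three letters taken from the alphabetic chunks of the user name, which the description above does not actually specify.

```python
import re


def triplets(address):
    """Return overlapping three-letter runs from the user name of an address.

    Assumption: digits and punctuation split the name into chunks, and
    triplets never span two chunks.
    """
    local_part = address.split("@", 1)[0].lower()
    found = []
    for chunk in re.findall(r"[a-z]+", local_part):
        found.extend(chunk[i:i + 3] for i in range(len(chunk) - 2))
    return found


def build_expected(known_good_addresses):
    """Collect the triplets seen in hand-checked, non-gibberish addresses."""
    expected = set()
    for address in known_good_addresses:
        trips = triplets(address)
        # A user name with a single triplet is probably just initials,
        # so it is left out of the expected list.
        if len(trips) > 1:
            expected.update(trips)
    return expected


def nonsense_percent(address, expected):
    """Percentage of the address's triplets that are not in the expected set.

    Returns None when the user name contains no triplet at all.
    """
    trips = triplets(address)
    if not trips:
        return None
    unexpected = sum(1 for t in trips if t not in expected)
    return 100.0 * unexpected / len(trips)


if __name__ == "__main__":
    expected = build_expected(["john.smith@example.com", "jds@example.com"])
    for addr in ["john@example.org", "jds@example.com", "xqzvk@example.net"]:
        print(addr, nonsense_percent(addr, expected))
```

Under these assumptions an initials-only name such as "jds" scores 100%, because its single triplet never makes it into the expected list. That is consistent with the false positives seen for the 99_100 range.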
Created attachment 371 [details] Tar file containing extract-names.pl and name-triplets.pl
I like the idea, although based on your results (I am running a mass-check now), I'd prefer to have more ranges: 50-60 is not as good as 60-90. I'd suggest just breaking them up by deciles. Also, this rule is highly English-specific. It will barf on non-English mail, so it needs to only be run for specific ok_locales or ok_languages settings, I think.
assigning bug (note that I am adding a Cc: to SAdev for all of these)
Here are my latest results. I think the rule looks good from 80-95, but some people (I think it was Theo) have reported 100% FP rates for all ranges.

OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   12402     4708      7694  0.38  0.00   0.00  (all messages)
 100.000   37.962    62.038  0.38  0.00   0.00  (all messages as %)
   0.524    1.381     0.000  1.00  0.81   0.01  T_NONSENSE_FROM_80_90
   0.024    0.064     0.000  1.00  0.60   0.01  T_NONSENSE_FROM_92_93
   1.169    2.995     0.052  0.98  0.58   1.00  T_NONSENSE_FROM_2
   0.016    0.042     0.000  1.00  0.57   0.01  T_NONSENSE_FROM_91_92
   1.919    3.462     0.975  0.78  0.38   0.01  T_NONSENSE_FROM_50_60
   2.766    4.397     1.768  0.71  0.35   0.01  T_NONSENSE_FROM_60_70
   2.000    3.122     1.313  0.70  0.34   0.01  T_NONSENSE_FROM_70_80
   4.217    6.393     2.885  0.69  0.34   1.00  T_NONSENSE_FROM_1
  27.447   32.009    24.656  0.56  0.29   0.01  T_NONSENSE_FROM_00_10
   8.095    9.303     7.356  0.56  0.29   0.01  T_NONSENSE_FROM_40_50
   8.861    9.941     8.201  0.55  0.28   0.01  T_NONSENSE_FROM_20_30
  11.530   11.257    11.697  0.49  0.26   0.01  T_NONSENSE_FROM_10_20
  12.603   11.682    13.166  0.47  0.25   0.01  T_NONSENSE_FROM_99_100
  24.214   13.339    30.868  0.30  0.18   0.01  T_NONSENSE_FROM_30_40
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_97_98
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_94_95
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_90_91
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_98_99
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_95_96
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_96_97
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_93_94
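For anyone reading the S/O column: the figures in these tables are consistent with S/O being the spam hit percentage divided by the sum of the spam and nonspam hit percentages. The Python sketch below only reproduces that arithmetic. The function name is made up, and the formula is inferred from the numbers posted in this bug rather than taken from the tool that produced them.

```python
def spam_over_overall(spam_pct, nonspam_pct):
    """S/O as inferred from the tables: SPAM% / (SPAM% + NONSPAM%).

    Rules that hit nothing are reported as 0.00, matching the tables.
    """
    total = spam_pct + nonspam_pct
    if total == 0:
        return 0.0
    return spam_pct / total


if __name__ == "__main__":
    # T_NONSENSE_FROM_2 and T_NONSENSE_FROM_30_40 from the results above.
    print(round(spam_over_overall(2.995, 0.052), 2))
    print(round(spam_over_overall(13.339, 30.868), 2))
```

A value near 1.00 means almost every hit was spam. A value near the corpus-wide figure (0.38 in this run) means the rule says little either way.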
Subject: Re: [SAdev] Detect gibberish email addresses

On Tue, Oct 15, 2002 at 09:49:43PM -0700, bugzilla-daemon@hughes-family.org wrote:
> I think the rule looks good from 80-95, but some people (I think it was
> Theo) have reported 100% FP rates for all ranges.

I haven't had the time to track down why yet, but the problem comes from me running a mass-check with only the 70_cvs_*.cf file which I do for time. If I run a mass-check with all the cf files, the rules seem to work as expected. For instance, here are the results from my latest CORPUS_SUBMIT run:

OVERALL%    SPAM%  NONSPAM%   S/O  RANK  SCORE  NAME
   22109     8555     13554  0.39  0.00   0.00  (all messages)
 100.000   38.695    61.305  0.39  0.00   0.00  (all messages as %)
   0.045    0.117     0.000  1.00  0.65   0.01  T_NONSENSE_FROM_92_93
   0.005    0.012     0.000  1.00  0.49   0.01  T_NONSENSE_FROM_90_91
   0.005    0.012     0.000  1.00  0.49   0.01  T_NONSENSE_FROM_96_97
   0.005    0.012     0.000  1.00  0.49   0.01  T_NONSENSE_FROM_93_94
   0.502    1.134     0.103  0.92  0.47   0.01  T_NONSENSE_FROM_80_90
   2.370    4.816     0.826  0.85  0.42   1.00  T_NONSENSE_FROM_1
   1.217    2.244     0.568  0.80  0.39   1.00  T_NONSENSE_FROM_2
   2.090    3.156     1.417  0.69  0.34   0.01  T_NONSENSE_FROM_50_60
   7.585    9.679     6.264  0.61  0.31   0.01  T_NONSENSE_FROM_40_50
   2.904    3.612     2.457  0.60  0.30   0.01  T_NONSENSE_FROM_60_70
  11.344   13.022    10.285  0.56  0.29   0.01  T_NONSENSE_FROM_30_40
   8.060    9.001     7.466  0.55  0.28   0.01  T_NONSENSE_FROM_20_30
   2.415    2.420     2.413  0.50  0.26   0.01  T_NONSENSE_FROM_70_80
  37.157   34.740    38.682  0.47  0.25   0.01  T_NONSENSE_FROM_00_10
  13.836   11.899    15.058  0.44  0.24   0.01  T_NONSENSE_FROM_99_100
  14.049   11.186    15.855  0.41  0.23   0.01  T_NONSENSE_FROM_10_20
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_98_99
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_97_98
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_91_92
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_95_96
   0.000    0.000     0.000  0.00  0.00   0.01  T_NONSENSE_FROM_94_95
Subject: Re: [SAdev] Detect gibberish email addresses

On Wed, Oct 16, 2002 at 01:07:51AM -0400, Theo Van Dinter wrote:
> I haven't had the time to track down why yet, but the problem comes
> from me running a mass-check with only the 70_cvs_*.cf file which I do
> for time. If I run a mass-check with all the cf files, the rules seem
> to work as expected.

Ok, I used the time on my train ride in this morning to look at this and I figured out the issue...

First: As I said, I _only_ have the 70_cvs*.cf file in there, which means the name-triplets.txt file isn't. So the test couldn't find the input file.

Second: nonsense_from_percent() returns a 1 if (from triplets not in above file) / (total triplets in from) > min && <= max percentage via the rule file, and 0 otherwise. Unfortunately, the function also returns 1 on error (like "No such file") which for me means that every message triggers every nonsense rule, see "First".

Third: nonsense_from_percent() uses "rules_filename" to find the triplets file, which works well in mass-check since we set rules_filename, but outside of that it will likely fail (rules_filename only gets set if someone runs spamassassin or spamd with the -C option). The rest of the code has been looking at DEF_RULES_DIR and LOCAL_RULES_DIR which do get set via spamassassin and spamd normally.

Fourth: mass-check wasn't setting DEF_RULES_DIR and LOCAL_RULES_DIR when running, so some tests weren't able to find the files they wanted to use, so some rules don't match properly. See "Third", but specifically: nonsense_from_percent (T_NONSENSE_FROM_*), and word_is_in_dictionary (SUBJ_HAS_UNIQ_ID).

Anyway, I'll be committing patches to HEAD shortly.
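The "Second" point is the one that produced the 100% FP rates, so here is a small Python sketch of the range test with the error path returning "no match" instead of "match". It illustrates the behaviour described above and is not the SpamAssassin code: the function signature, the load_triplets helper and the one-triplet-per-line file format are all assumptions.

```python
def load_triplets(path):
    """Read one expected triplet per line (assumed file format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def nonsense_from_percent(from_triplets, triplets_path, min_pct, max_pct):
    """Return True only when min_pct < unexpected% <= max_pct.

    Any problem reading the triplets file (for example "No such file")
    makes the rule stay silent rather than fire on every message.
    """
    try:
        expected = load_triplets(triplets_path)
    except OSError:
        return False
    if not from_triplets:
        return False
    unexpected = sum(1 for t in from_triplets if t not in expected)
    percent = 100.0 * unexpected / len(from_triplets)
    return min_pct < percent <= max_pct


if __name__ == "__main__":
    # With the file missing, no range matches.
    print(nonsense_from_percent(["xqz", "qzv"], "/nonexistent/name-triplets.txt", 99, 100))
```

Failing closed like this means a missing name-triplets.txt shows up as a rule that never hits, which is much easier to spot in mass-check output than one that hits everything.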
> Also, this rule is highly English-specific.

OK, I have some code that will not run the tests if ok_languages and ok_locales are both "all". Otherwise, it looks for a name-triplets.txt file for each language/locale specified, and only runs if it finds them all. However, if done this way, how will the tests run during a mass-check, so that the results can be GA'd?
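The gating just described could look roughly like the Python sketch below. It is only an illustration of the decision logic: the function name, the list-valued settings and the per-language file naming scheme (name-triplets.<code>.txt) are all hypothetical, since the comment above does not say how the per-language files are named.

```python
import os


def triplet_files_for(ok_languages, ok_locales, rules_dir):
    """Return the triplet files to load, or None if the tests should not run.

    Hypothetical naming scheme: one name-triplets.<code>.txt file per
    language or locale code, stored in rules_dir.
    """
    # Both settings left at "all": no language is known, so skip the tests.
    if ok_languages == ["all"] and ok_locales == ["all"]:
        return None
    wanted = [code for code in ok_languages + ok_locales if code != "all"]
    if not wanted:
        return None
    paths = [os.path.join(rules_dir, "name-triplets.%s.txt" % code) for code in wanted]
    # Only run when a triplets file exists for every requested code.
    if not all(os.path.isfile(path) for path in paths):
        return None
    return paths
```

This does not answer the mass-check question raised above. It only shows when the tests would be skipped.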
This doesn't seem to be working too well for other people; shall I remove this from CVS and close the bug WONTFIX?
Subject: Re: [SAdev] Detect gibberish email addresses

> This doesn't seem to be working too well for other people; shall I
> remove this from CVS and close the bug WONTFIX?

in this case, I would. sorry about that! ;)
Removed code from CVS; marking WONTFIX.