SA Bugzilla – Bug 2954
check_for_to_in_subject() EVAL modifications
Last modified: 2005-01-20 06:25:43 UTC
check_for_to_in_subject() will only match subjects like

dallase,how are you?
dallase,blah blah blah

This modified return regex will detect more variations:

return (($subject =~ /^\s*\Q$to\E[\.\,\-]+\s*\S/i) ||
        ($subject =~ /\S\s*[\.\,\-]+\Q$to\E(?:[\!\?\.]+)$/i));

dallase,how are you?
dallase, how are you?
DALLASE, how are you?
dallase... how are you?
dallase - how are you
how are you, DALLASE
how are you, dallase?
how are you, dallase!!!
how are you, dallase.

It is also case-insensitive, which caused the biggest jump in detection. My results show a 4-fold improvement in userpart detection in the subject (90 spam hits versus 22 for the original rule, with no ham hits).

Here are the results on my corpus. The first test is the original rule, the second is my modified regex (case-sensitive), and the third is my modified regex (case-insensitive).

# Tue Jan 20 13:44:12 CST 2004
# beginning test of testrule.USERPART.txt:

# original eval
header USERNAME_IN_SUBJECT eval:check_for_to_in_subject()
describe USERNAME_IN_SUBJECT To: username at front of subject
score USERNAME_IN_SUBJECT 2.900 2.800 2.800 2.700

# new eval 1
header USERPART_IN_SUBJECT_1 eval:check_user_part_in_subject()
describe USERPART_IN_SUBJECT_1 subject contains case-sensitive username at beginning or end
score USERPART_IN_SUBJECT_1 2.900 2.800 2.800 2.700

# new eval 2
header USERPART_IN_SUBJECT_2 eval:check_user_part_in_subject_nocase()
describe USERPART_IN_SUBJECT_2 subject contains username at beginning or end
score USERPART_IN_SUBJECT_2 2.900 2.800 2.800 2.700

############################################################
# USERNAME_IN_SUBJECT -- 22s/0h of 10963 corpus (6083s/4880h), 2004-01-20
############################################################

############################################################
# USERPART_IN_SUBJECT_1 -- 38s/0h of 10963 corpus (6083s/4880h), 2004-01-20
############################################################

############################################################
# USERPART_IN_SUBJECT_2 -- 90s/0h of 10963 corpus (6083s/4880h), 2004-01-20
############################################################

OVERALL    SPAM     HAM     S/O    RANK   SCORE  NAME
  10963    6083    4880   0.555    0.00    0.00  (all messages)
     90      90       0   1.000    1.00    2.90  USERPART_IN_SUBJECT_2
     38      38       0   1.000    0.24    2.90  USERPART_IN_SUBJECT_1
     22      22       0   1.000    0.00    2.90  USERNAME_IN_SUBJECT

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   10963    6083     4880   0.555    0.00    0.00  (all messages)
 100.000 55.4866  44.5134   0.555    0.00    0.00  (all messages as %)
   0.821  1.4795   0.0000   1.000    1.00    2.90  USERPART_IN_SUBJECT_2
   0.347  0.6247   0.0000   1.000    0.24    2.90  USERPART_IN_SUBJECT_1
   0.201  0.3617   0.0000   1.000    0.00    2.90  USERNAME_IN_SUBJECT

dallas
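To make the proposed expression easier to evaluate, here is a minimal standalone sketch that runs it against the subjects listed in the report. It is not taken from EvalTests or from the attached diff: the helper names matches_quoted and matches_relaxed are hypothetical, the recipient user part "dallase" is assumed from the examples, and the "relaxed" patterns are only an illustrative guess at how whitespace around the punctuation could be tolerated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The two patterns exactly as quoted in the report.
sub matches_quoted {
    my ($subject, $to) = @_;
    my $hit = ($subject =~ /^\s*\Q$to\E[\.\,\-]+\s*\S/i)
           || ($subject =~ /\S\s*[\.\,\-]+\Q$to\E(?:[\!\?\.]+)$/i);
    return $hit ? 1 : 0;
}

# Hypothetical variant (an assumption, not the reporter's code): allows
# optional whitespace between the name and the punctuation, and makes the
# trailing punctuation after the name optional.
sub matches_relaxed {
    my ($subject, $to) = @_;
    my $hit = ($subject =~ /^\s*\Q$to\E\s*[.,\-]+\s*\S/i)
           || ($subject =~ /\S\s*[.,\-]+\s*\Q$to\E[!?.]*$/i);
    return $hit ? 1 : 0;
}

# User part of the recipient address, assumed from the report's examples.
my $to = 'dallase';

my @subjects = (
    'dallase,how are you?',
    'dallase, how are you?',
    'DALLASE, how are you?',
    'dallase... how are you?',
    'dallase - how are you',
    'how are you, DALLASE',
    'how are you, dallase?',
    'how are you, dallase!!!',
    'how are you, dallase.',
);

# Print one line per subject showing which pattern set accepts it.
for my $s (@subjects) {
    printf "%-26s quoted=%d relaxed=%d\n",
        $s, matches_quoted($s, $to), matches_relaxed($s, $to);
}
```

As quoted, the patterns should accept only the first four subjects: neither allows whitespace between the name and the adjacent punctuation, and the second requires punctuation after the name. The expression in the attached diff may differ from the one quoted in the report. The relaxed variant accepts all nine, but a looser match may also leave more room for false positives.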
Created attachment 1720
Diff of EvalTest.pm for sub check_for_to_in_subject

improved userpart detection in subject...
The restriction on this rule was intentional. See bug 613 for the reason why: other variants had too many FPs. But I think some of your variants (e.g. name at the end of the subject) weren't checked back then and might be worth a try. Theo will know more :)
Granting the FPs, I'd still like to see the eval available, perhaps with a default score of zero. Those of us who never see our name legitimately appear in a subject line can override the score. It would be interesting to look at the FPs in the corpus and see if there's some other pattern that can be used to compensate for them. For instance, in a recent SA-Talk thread on this subject, the examples were all generated by local automated tools, and locally generated mail wouldn't normally go through SA.
moving accuracy and some bugs to 3.1.0 milestone
working on this
testing did not fare so well for these, so closing