Bug 2954 - check_for_to_in_subject() EVAL modifications
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (Eval Tests)
Version: 2.63
Hardware: All, OS: All
Importance: P4 enhancement
Target Milestone: 3.1.0
Assignee: Daniel Quinlan
Reported: 2004-01-22 07:59 UTC by Dallas Engelken
Modified: 2005-01-20 06:25 UTC

Attachments:
Diff of EvalTest.pm for sub check_for_to_in_subject (type: patch, status: None, submitter: Dallas Engelken [HasCLA])

Description Dallas Engelken 2004-01-22 07:59:53 UTC
check_for_to_in_subject() will only match subjects like

dallase,how are you?
dallase,blah blah blah

this modified return expression is intended to detect the variations listed after it:

return (($subject =~ /^\s*\Q$to\E[\.\,\-]+\s*\S/i) ||
       ($subject =~ /\S\s*[\.\,\-]+\Q$to\E(?:[\!\?\.]+)$/i));

dallase,how are you?
dallase, how are you?
DALLASE, how are you?
dallase... how are you?
dallase - how are you
how are you, DALLASE
how are you, dallase?
how are you, dallase!!!
how are you, dallase.

the modified regex is also case-insensitive, which caused the biggest jump in
detection.  my results show a roughly 4-fold improvement (22 to 90 spam hits) in
userpart detection in the subject.
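
A minimal Perl sketch for exercising the return expression from this description against the nine sample subjects. The helper names posted_match and loose_match are hypothetical, and $to is assumed to hold the local part of the recipient address ('dallase'). As quoted here, the expression requires the separator to sit directly against the name and requires closing punctuation after a trailing name, so it appears to match only the first four samples; the attached diff may differ from the quoted text. loose_match is one possible relaxation, not the patch itself, and looser matching is likely to raise the false-positive risk.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the return expression quoted in the description.
sub posted_match {
    my ($subject, $to) = @_;
    my $hit = ($subject =~ /^\s*\Q$to\E[\.\,\-]+\s*\S/i)
           || ($subject =~ /\S\s*[\.\,\-]+\Q$to\E(?:[\!\?\.]+)$/i);
    return $hit ? 1 : 0;
}

# Hypothetical relaxation (an assumption, not the attached patch):
# optional whitespace around the separator, optional closing punctuation.
sub loose_match {
    my ($subject, $to) = @_;
    my $hit = ($subject =~ /^\s*\Q$to\E\s*[.,\-]+\s*\S/i)
           || ($subject =~ /\S\s*[.,\-]+\s*\Q$to\E[!?.]*\s*$/i);
    return $hit ? 1 : 0;
}

my @subjects = (
    'dallase,how are you?',
    'dallase, how are you?',
    'DALLASE, how are you?',
    'dallase... how are you?',
    'dallase - how are you',
    'how are you, DALLASE',
    'how are you, dallase?',
    'how are you, dallase!!!',
    'how are you, dallase.',
);

# Print one line per sample subject with the result of each variant.
for my $s (@subjects) {
    printf "%-26s posted=%d loose=%d\n",
        $s, posted_match($s, 'dallase'), loose_match($s, 'dallase');
}
```

Running both variants side by side makes it easy to see which samples depend on whitespace handling and which depend on case-insensitivity.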

here are the results on my corpus... first test being the original rule, the 
second test being my modified regex case-sensitive, third test being modified 
regex case-insensitive.

# Tue Jan 20 13:44:12 CST 2004
# beginning test of testrule.USERPART.txt:
# original eval
header   USERNAME_IN_SUBJECT    eval:check_for_to_in_subject()
describe USERNAME_IN_SUBJECT    To: username at front of subject
score    USERNAME_IN_SUBJECT    2.900 2.800 2.800 2.700

# new eval 1
header   USERPART_IN_SUBJECT_1  eval:check_user_part_in_subject()
describe USERPART_IN_SUBJECT_1  subject contains case-sensitive username at beginning or end
score    USERPART_IN_SUBJECT_1  2.900 2.800 2.800 2.700

# new eval 2
header   USERPART_IN_SUBJECT_2  eval:check_user_part_in_subject_nocase()
describe USERPART_IN_SUBJECT_2  subject contains username at beginning or end
score    USERPART_IN_SUBJECT_2  2.900 2.800 2.800 2.700

############################################################
# USERNAME_IN_SUBJECT -- 22s/0h of 10963 corpus (6083s/4880h), 2004-01-20 
############################################################

############################################################
# USERPART_IN_SUBJECT_1 -- 38s/0h of 10963 corpus (6083s/4880h), 2004-01-20 
############################################################

############################################################
# USERPART_IN_SUBJECT_2 -- 90s/0h of 10963 corpus (6083s/4880h), 2004-01-20 
############################################################

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
  10963     6083     4880    0.555   0.00    0.00  (all messages)
     90       90        0    1.000   1.00   2.90  USERPART_IN_SUBJECT_2
     38       38        0    1.000   0.24   2.90  USERPART_IN_SUBJECT_1
     22       22        0    1.000   0.00   2.90  USERNAME_IN_SUBJECT

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  10963     6083     4880    0.555   0.00    0.00  (all messages)
100.000  55.4866  44.5134    0.555   0.00    0.00  (all messages as %)
  0.821   1.4795   0.0000    1.000   1.00    2.90  USERPART_IN_SUBJECT_2
  0.347   0.6247   0.0000    1.000   0.24    2.90  USERPART_IN_SUBJECT_1
  0.201   0.3617   0.0000    1.000   0.00    2.90  USERNAME_IN_SUBJECT
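
The percentage rows above can be re-derived from the raw hit counts. This is a small sketch, assuming S/O is defined as spam% / (spam% + ham%), which is consistent with the 0.555 shown for all messages; the function name rule_stats is hypothetical. The RANK column is not reproduced.

```perl
use strict;
use warnings;

# Corpus size and per-rule hit counts copied from the tables above.
my ($spam_total, $ham_total) = (6083, 4880);
my %hits = (    # rule name => [spam hits, ham hits]
    USERNAME_IN_SUBJECT   => [22, 0],
    USERPART_IN_SUBJECT_1 => [38, 0],
    USERPART_IN_SUBJECT_2 => [90, 0],
);

# Returns (overall%, spam%, ham%, S/O) for one rule.
# Assumed definition: S/O = spam% / (spam% + ham%).
sub rule_stats {
    my ($spam_hits, $ham_hits) = @_;
    my $spam_pct = 100 * $spam_hits / $spam_total;
    my $ham_pct  = 100 * $ham_hits  / $ham_total;
    my $overall  = 100 * ($spam_hits + $ham_hits) / ($spam_total + $ham_total);
    my $sum      = $spam_pct + $ham_pct;
    my $so       = $sum > 0 ? $spam_pct / $sum : 0;
    return ($overall, $spam_pct, $ham_pct, $so);
}

# Print rules ordered by spam hits, highest first.
for my $rule (sort { $hits{$b}[0] <=> $hits{$a}[0] } keys %hits) {
    printf "%7.3f %8.4f %8.4f %7.3f  %s\n", rule_stats(@{ $hits{$rule} }), $rule;
}

# Ratio of case-insensitive hits to hits of the original rule.
printf "improvement: %.2fx\n",
    $hits{USERPART_IN_SUBJECT_2}[0] / $hits{USERNAME_IN_SUBJECT}[0];
```

With zero ham hits for all three rules, S/O comes out as 1.000 in every row, so the comparison rests entirely on the spam hit counts.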

dallas

Comment 1 Dallas Engelken 2004-01-22 08:06:31 UTC
Created attachment 1720
Diff of EvalTest.pm for sub check_for_to_in_subject

improved userpart detection in subject...

Comment 2 Malte S. Stretz 2004-01-22 09:27:37 UTC
The restriction for this rule was made intentionally. See bug 613 for the
reason why; other variants had too many FPs (false positives). But I think some
of your variants (e.g., name at the end of the subject) weren't checked back
then and might be worth a try. Theo will know more :)

Comment 3 Kenneth Porter 2004-01-22 12:09:33 UTC
Granting the FPs, I'd still like to see the eval available, perhaps with a
default score of zero. Those of us who never see our name legitimately appear in
a subject line can override the score.

It would be interesting to look at the FPs in the corpus and see if there's
some other pattern that can be used to compensate for them. For instance, in a
recent SA-Talk thread on this subject, the examples were all generated by local
automated tools, and locally generated mail wouldn't normally go through SA.

Comment 4 Daniel Quinlan 2004-08-27 16:59:11 UTC
moving accuracy and some bugs to 3.1.0 milestone

Comment 5 Daniel Quinlan 2005-01-19 15:46:44 UTC
working on this

Comment 6 Daniel Quinlan 2005-01-20 15:25:43 UTC
testing did not fare so well for these, so closing