Bug 6119 - TVD_SPACE_RATIO false positives -- the FP Collection
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 3.2.5
Hardware: Other All
Importance: P5 normal
Target Milestone: 3.3.2
Assignee: SpamAssassin Developer Mailing List
Reported: 2009-05-25 15:07 UTC by Michael Monnerie
Modified: 2011-05-09 12:25 UTC



Attachments (description | type | submitter/CLA status):
Lots of FPs, but all of the same type of e-mail | application/octet-stream | Michael Monnerie [NoCLA]
legit mail triggering TVD_SPACE_RATIO | application/x-mimearchive | mouss [NoCLA]
another "sample" (it differs slightly from the one I've attached before) | application/x-mimearchive | mouss [NoCLA]
a sample from the multimedia@FreeBSD.org list | application/x-mimearchive | mouss [NoCLA]
and one from freebsd-current@freebsd.org | application/x-mimearchive | mouss [NoCLA]

Description Michael Monnerie 2009-05-25 15:07:07 UTC
Justin said: please attach FPs you can share to tickets on bugzilla.  they do help.

So here I go. I found an old mail, but that one is already in my masscheck_HAM list and should have been learned. Still, I got a report last week where a legit mail had been FP'd, and this rule was among those that hit. Sorry, I can't share that mail. But this one:

Received: by mailsrv1.zmi.at (Postfix, from userid 65534)	id 9D69F166ED;
	Wed, 25 Jul 2007 21:12:46 +0200 (CEST)
Received: from protegate5.zmi.at (protegate5.zmi.at [212.69.162.205])	(using TLSv1
	with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))	(Client CN "protegate1.zmi.at", Issuer "power4u.zmi.at" (not verified))
	by mailsrv1.zmi.at (Postfix) with ESMTP id 2FB5527EC
	for <ORBCOM10@DATAMATIX.AT>; Wed, 25 Jul 2007 21:12:46 +0200 (CEST)
X-Envelope-From: m12720600880x1@orbcomm2.net
Received: from localhost (localhost [127.0.0.1])
	by protegate5.zmi.at (Postfix) with ESMTP id 7BE451E0EF
	for <ORBCOM10@DATAMATIX.AT>; Wed, 25 Jul 2007 21:10:51 +0200 (CEST)
X-Virus-Scanned: amavisd-new at zmi.at
X-Spam-Score: 4.636
X-Spam-Level: ****
X-Spam-Status: No, score=4.636 tagged_above=-999 required=5
	tests=[AWL=0.186,
	BAYES_20=-0.74, DKIM_POLICY_SIGNSOME=0, DK_POLICY_SIGNSOME=0,
	FROM_LOCAL_DIGITS=0.001, FROM_LOCAL_HEX=1.2, INVALID_MSGID=1.9,
	L_P0F_UNKN=0.1, SPF_PASS=-0.001, TVD_SPACE_RATIO=1.99]
Received: from protegate5.zmi.at ([127.0.0.1])
	by localhost (protegate5.zmi.at [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id AVfBWtBEdgkn for <ORBCOM10@DATAMATIX.AT>;
	Wed, 25 Jul 2007 21:10:47 +0200 (CEST)
X-Envelope-From: m12720600880x1@orbcomm2.net
Received: from fw1.orbcomm2.net (fw1.orbcomm2.net [208.44.94.213])
	by protegate5.zmi.at (Postfix) with ESMTP id 6C6721E0E7
	for <ORBCOM10@DATAMATIX.AT>; Wed, 25 Jul 2007 21:10:46 +0200 (CEST)
Received: from omsseh (omss.orbcomm2.net [10.201.26.11])
	by fw1.orbcomm2.net (8.9.3/8.9.3p2.fw2525 (GMSS3.3E Internal)) with SMTP
	id TAA02634	for <ORBCOM10@DATAMATIX.AT>; Wed, 25 Jul 2007 19:10:44 GMT
From: M12720600880X1@ORBCOMM2.NET
Date: 25 Jul 2007 19:10:44 +0000
Message-ID: <"00020202 0001 46a7a034"* @MHS>
Subject: [GLOBALGRAM:SAT=35]
To: orbcom10@datamatix.at
Priority: non-urgent
X-sat_id: 35
X-ncc_id: 120
X-ncc_mha_ref: 1
X-ack_level: 0
X-DBMail-PhysMessage-ID: 610116
Return-Path: M12720600880X1@ORBCOMM2.NET
MIME-Version: 1.0
Content-Type: text/plain;
  charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
X-DBMail-PhysMessage-ID: 610117
X-Length: 2235
X-Original-X-UID: 1202527
X-UID: 8115

07-07-25,14:12:40,48.192509,16.349321,278.338989,132.710999,0.000000,12,73909,"STD"
Comment 1 Karsten Bräckelmann 2009-05-25 18:15:32 UTC
(In reply to comment #0)
> please attach FPs you can share to tickets on bugzilla.  they do help.
         ^^^^^^
Yes, indeed. Pretty please with sugar on top, *attach* the raw messages, do *not* paste them as a comment. See "Add an attachment" in the Attachment list below the details table, above the comments.  Thanks. :)
Comment 2 Michael Monnerie 2009-05-25 22:59:15 UTC
Created attachment 4451
Lots of FPs, but all of the same type of e-mail

Sorry, but I have no other FPs to provide ATM; I will report as soon as I have the next FP. Maybe I'll reset the score to defaults, then I should quickly get a report. But that's nasty...
Comment 3 mouss 2009-05-26 12:54:48 UTC
Created attachment 4452
legit mail triggering TVD_SPACE_RATIO

a lot of messages posted to freebsd-ports-bugs@FreeBSD.org hit this rule.
Comment 4 mouss 2009-05-26 12:55:28 UTC
Created attachment 4453
another "sample" (it differs slightly from the one I've attached before)
Comment 5 mouss 2009-05-26 12:57:43 UTC
Created attachment 4454
a sample from the multimedia@FreeBSD.org list
Comment 6 mouss 2009-05-26 12:58:42 UTC
Created attachment 4455
and one from freebsd-current@freebsd.org
Comment 7 Karsten Bräckelmann 2009-05-26 14:58:30 UTC
Just had a quick look at attachment 4452 and the BodyEval tvd_vertical_words() function, adding some noisy debugging love. The reason is quite simple -- the space to non-space ratio doesn't exceed 9%, which is less than the default 10% max.

This didn't become apparent from looking at the code only without the debugging, though. I expected it to check the body line by line. However, it actually checks the space ratio for *paragraphs* in a traditional UN*X style. That paragraph ends with *two* newlines.

This line for example would have a ratio of 18% on its own, still 13% with the longish header-style prefix and no (munged?) linebreak.

  Over_to_maintainer_(via_the_GNATS_Auto_Assign_Tool)

The text being looked at is the entire paragraph, though, including all lines immediately preceding or following without an empty line. This results in 20/201, or about 9%. One reason, and an explanation why it loves to hit on such messages, is the very long words prefixing each line. Or, in other words: There's not much real, human generated text there. Compare it to this very paragraph...

A quick and easy fix is to lower the max threshold (second argument) in 20_body_tests.cf, which currently reads:
  body TVD_SPACE_RATIO  eval:tvd_vertical_words('0','10')

However, given the idea is to identify lots of *vertical* words, I seriously wonder if this used to work on actual *lines*, rather than whole paragraphs. Theo?
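
To illustrate the per-line versus per-paragraph difference described in this comment, here is a minimal Python sketch. It is not the Perl code from BodyEval.pm; it only models the ratio as spaces divided by non-space characters, truncated to a whole percentage, which is what the figures quoted in this thread suggest. The function name `space_ratio`, the made-up header-style prefixes, and the choice to count a line break inside a paragraph as one space are all assumptions.

```python
def space_ratio(text: str) -> int:
    """Spaces as a truncated percentage of the non-space characters."""
    spaces = text.count(" ")
    nonspaces = len(text) - spaces
    if nonspaces == 0:
        return 100  # assumption: treat an all-space text as 100%
    return 100 * spaces // nonspaces


# Made-up GNATS-style paragraph; the long prefixes are illustrative only.
lines = [
    "Responsible-Changed-From-To: freebsd-ports-bugs->maintainer",
    "Responsible-Changed-By: edwin",
    "Responsible-Changed-Why: Over to maintainer (via the GNATS Auto Assign Tool)",
]

MAX_PCT = 10  # second argument of tvd_vertical_words('0','10')

# Line-by-line view: take the highest ratio found on any single line.
per_line = [space_ratio(line) for line in lines]

# Paragraph view (assumption: each line break counts as one space).
per_paragraph = space_ratio(" ".join(lines))

print("per line:", per_line, "below max:", max(per_line) < MAX_PCT)
print("per paragraph:", per_paragraph, "below max:", per_paragraph < MAX_PCT)
```

In this sketch the last line on its own stays above the 10% max, while the joined paragraph drops below it, so only the paragraph-based view would let the rule hit.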
Comment 8 Karsten Bräckelmann 2009-05-26 15:10:59 UTC
(In reply to comment #7)
> This line for example would have a ratio of 18% on its own, still 13% with the
> longish header-style prefix and no (munged?) linebreak.
> 
>   Over_to_maintainer_(via_the_GNATS_Auto_Assign_Tool)

Oops, underscores added by me. They are actually spaces.
Comment 9 mouss 2009-05-26 23:18:09 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > This line for example would have a ratio of 18% on its own, still 13% with the
> > longish header-style prefix and no (munged?) linebreak.
> > 
> >   Over_to_maintainer_(via_the_GNATS_Auto_Assign_Tool)
> 
> Oops, underscores added by me. They are actually spaces.
> 

remove the signature from 4454 and the rule doesn't hit anymore. 

Maybe there's a way to ignore "simple" signatures? 
Comment 10 Justin Mason 2009-05-27 01:52:05 UTC
for reference, the ruleqa page: http://ruleqa.spamassassin.org/20090526-r778623-n/TVD_SPACE_RATIO/detail

and the FP rates on our dev corpora:

MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME WHO/AGE
0.00000   1.2180   0.3403   0.782    0.70    2.90  TVD_SPACE_RATIO  
0.00000   1.7544   0.0646   0.964    0.82    2.90  TVD_SPACE_RATIO bb-jm 
0.00000   1.1575   0.1178   0.908    0.82    2.90  TVD_SPACE_RATIO dos 
0.00000   1.5153   0.4726   0.762    0.53    2.90  TVD_SPACE_RATIO jm 
0.00000   0.1232   0.8763   0.123    0.35    2.90  TVD_SPACE_RATIO zmi 
Comment 11 Karsten Bräckelmann 2009-05-27 03:17:39 UTC
Attachment 4454. Two paragraphs, the Subject and the entire body including that mailing-list generated sig, err footer. Ratio of 32 / 321 ~= 0.0997, which results in 9. Separating the footer from the body (the human generated text) results in 18 and 5 respectively. Well above the max.

The *cough* human generated body in attachment 4455 doesn't get above a ratio of 8. Plain paste of a stacktrace and some similar stuff.

My question in comment 7 remains.
Comment 12 Karsten Bräckelmann 2009-05-27 03:55:22 UTC
The second mail in attachment 4451, DBMail ID 610117 (same as the paste in comment 0) got 0/20 spaces for the Subject, and 1/83 spaces in the body. Ratio of 1, though I kind of would have expected a 0, since the body doesn't actually have any space.

The first message, DBMail ID 610202, a human generated short note, is NOT a FP. TVD_SPACE_RATIO just doesn't fire on that one. The note itself has a ratio of 19 (11/57). Just as expected. Michael, why did you put that one in?


It does show some odd results, though. A paragraph seems to always be ended with a space internally, even though there is no space in the mail and the single-word paragraph is the very last content. Merely following a newline to end the message.

That appended space means that tvd_vertical_words() actually does not identify vertical words, as long as the average length of words in a "paragraph" is not greater than 10. Like...

Viagra

That one will exonerate the entire message from triggering TVD_SPACE_RATIO. As would this.

VIAGRA
and
CIALIAS
only
$1.2
per_pill

Seems hardly intended. So, again, was this actually meant to be evaluated per paragraph or line by line? And an additional question: Was the trailing space added by SA internally intended to be counted?
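
The effect of the appended space can be reproduced with a small Python sketch. This is a model built only from the behaviour reported in this thread, not the actual SpamAssassin code: the highest paragraph ratio decides, paragraphs of 5 characters or fewer are skipped, and the rule hits when that value is at least min and below max. The helper names and the `append_space` switch are hypothetical.

```python
def paragraph_ratio(paragraph: str, append_space: bool = True) -> int:
    """Truncated percentage of spaces relative to non-space characters."""
    text = paragraph.replace("\n", " ")  # assumption: newlines become spaces
    if append_space:
        text += " "  # models the trailing space observed in the debug output
    spaces = text.count(" ")
    nonspaces = len(text) - spaces
    if nonspaces == 0:
        return 100
    return 100 * spaces // nonspaces


def rule_hits(paragraphs, min_pct=0, max_pct=10, append_space=True) -> bool:
    """True if the highest paragraph ratio lies in [min_pct, max_pct)."""
    highest = 0
    for paragraph in paragraphs:
        if len(paragraph) <= 5:  # very short paragraphs are ignored
            continue
        highest = max(highest, paragraph_ratio(paragraph, append_space))
    return min_pct <= highest < max_pct


data = '07-07-25,14:12:40,48.192509,16.349321,278.338989,132.710999,0.000000,12,73909,"STD"'
vertical = "VIAGRA\nand\nCIALIAS\nonly\n$1.2\nper_pill"

print(paragraph_ratio(data))      # the 1/83 case from this comment
print(paragraph_ratio("Viagra"))  # one appended space against six letters
print(paragraph_ratio(vertical))
print(rule_hits([data]), rule_hits([data, "Viagra"]))
```

With the appended space, the single-word paragraph already exceeds the 10% max, so adding it to an otherwise hitting message stops the rule from firing. With `append_space=False` the same message would still hit.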
Comment 13 Michael Monnerie 2009-05-28 01:17:32 UTC
> The first message, DBMail ID 610202, a human generated short note, is NOT a FP.
> TVD_SPACE_RATIO just doesn't fire on that one. The note itself has a ratio of
> 19 (11/57). Just as expected. Michael, why did you put that one in?

I must express my deepest apology for that mistake, sire! The error must have come from me being human, having too much work, and being full of mistakes. I will cut my left small finger off to remember the shame I brought to all of my family, and remind me to be more careful for all times.

mfg zmi
Comment 14 Karsten Bräckelmann 2009-05-28 13:03:26 UTC
> I must express my deepest apology for that mistake, sire!

No need to. :)  It perfectly serves as an example, that TVD_SPACE_RATIO indeed does tend not to pick on real, human written text.

The FP samples so far all are trivial to rescue, and IMHO probably shouldn't have been fed to SA in the first place. Given the current S/O in mass-check, however, I do agree the score is seriously too high.

Candidate for sa-update.

Since the Target Milestone is set to 3.3.0, the score issue most likely will be resolved magically by a full GA run. What I'd really like to sort about this for 3.3.0 is, if the eval really still works as originally intended, or if it maybe got broke by evaluating paragraphs instead of lines and the injected trailing space.
Comment 15 Justin Mason 2009-05-28 13:37:17 UTC
(In reply to comment #14)
> Since the Target Milestone is set to 3.3.0, the score issue most likely will be
> resolved magically by a full GA run. What I'd really like to sort about this
> for 3.3.0 is, if the eval really still works as originally intended, or if it
> maybe got broke by evaluating paragraphs instead of lines and the injected
> trailing space.

feel free to move it to 3.2.6 if you want to work on it for sa-update, though!
3.3.0 is just the easiest.
Comment 16 Pim van den Berg 2009-07-17 04:12:23 UTC
I have a small addition to this bug report. I think it is better to have an initial value like:

$pms->{tvd_vertical_words} = -1;

on line 198 of BodyEval.pm instead of

$pms->{tvd_vertical_words} = 0;

So that the subroutine doesn't return true (0 >= $min) if there are no matches (i.e. if there are no @lines over 5 chars in length).

For example when I send an e-mail with just a single word in the body like 'test', it probably shouldn't match TVD_SPACE_RATIO.
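
The difference between the two initial values can be shown with a short Python sketch. It is a simplified model of the behaviour described in this comment, not the BodyEval.pm code; the function name `rule_hits` and its `initial` parameter are hypothetical.

```python
def rule_hits(paragraphs, initial, min_pct=0, max_pct=10) -> bool:
    """Model of the check: highest ratio seen must be in [min_pct, max_pct)."""
    value = initial
    for text in paragraphs:
        if len(text) <= 5:  # only texts over 5 chars are evaluated
            continue
        spaces = text.count(" ")
        nonspaces = len(text) - spaces
        pct = 100 if nonspaces == 0 else 100 * spaces // nonspaces
        value = max(value, pct)
    return min_pct <= value < max_pct


body = ["test"]  # too short to be evaluated at all

print(rule_hits(body, initial=0))   # the untouched 0 still satisfies 0 >= min
print(rule_hits(body, initial=-1))  # the sentinel -1 stays below min

# A long token without spaces is still evaluated and still hits.
print(rule_hits(["a-long-token-without-spaces"], initial=-1))
```

Starting from -1 means "nothing was measured" can be told apart from "a ratio of 0 was measured", so a body with no evaluated text no longer matches.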
Comment 17 Warren Togami 2009-07-17 05:55:30 UTC
http://ruleqa.spamassassin.org/20090716-r794596-n/TVD_SPACE_RATIO/detail?s_corpus=1#corpus

Yikes, my statistics are very different from the other contributors.

I had noticed that simple mail containing only "test" in the body triggers this rule.

I'll find a few of the others and attach them here.
Comment 18 Warren Togami 2009-07-17 07:51:37 UTC
It appears that my TVD_SPACE_RATIO FPs are either:

* mail with only a single word or URL in the body.
* various legitimate personal mail written in Japanese.  Aside from being written in either ISO-2022-JP or UTF-8, written Japanese typically lacks spaces between words.  The same is true of Chinese.

Unfortunately I am unable to share any of the Japanese mail.  Most of it is in the folders of a user who agreed to hand-classify using our standards.  I will ask her to choose a few samples to share, but it will be early next week.
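
As a rough illustration of the second case, the Python sketch below compares a made-up English sentence with a made-up Japanese one. It assumes the ratio is spaces over non-space characters, truncated to a percentage, and it counts Unicode characters, whereas the real check may see encoded bytes. Since the Japanese sample contains no ASCII space at all, its ratio is 0 either way.

```python
def space_ratio(text: str) -> int:
    """Truncated percentage of ASCII spaces relative to other characters."""
    spaces = text.count(" ")
    nonspaces = len(text) - spaces
    if nonspaces == 0:
        return 100
    return 100 * spaces // nonspaces


MIN_PCT, MAX_PCT = 0, 10  # arguments of tvd_vertical_words('0','10')

# Both sample sentences are invented for this sketch.
english = "This is a short test message written in English."
japanese = "これはテストのメッセージです。"

for sample in (english, japanese):
    pct = space_ratio(sample)
    print(pct, MIN_PCT <= pct < MAX_PCT)
```

The English sentence lands well above the max and is left alone, while the space-free Japanese sentence falls inside the hit range.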
Comment 19 Warren Togami 2009-07-21 09:58:14 UTC
http://ruleqa.spamassassin.org/20090721-r796186-n/TVD_SPACE_RATIO/detail

Why does TVD_SPACE_RATIO have such a high score of 2.90?  It appears to have a very low spam hit % and high false positives.
Comment 20 Justin Mason 2009-07-21 15:51:27 UTC
(In reply to comment #19)
> http://ruleqa.spamassassin.org/20090721-r796186-n/TVD_SPACE_RATIO/detail
> 
> Why does TVD_SPACE_RATIO have such a high score of 2.90?  It appears to have a
> very low spam hit % and high false positives.

Warren, generally those rules did better when they were first added... often there was a spam run using those characteristics.
Comment 21 Justin Mason 2009-07-23 07:11:19 UTC
going to take a look at these
Comment 22 Justin Mason 2009-08-31 16:06:01 UTC
if we want to change this for 3.3.0, it needs to be in SVN by this Thursday; see bug 6155.
Comment 23 Justin Mason 2009-09-03 14:02:41 UTC
I've copied those mails into my corpus, and added the fix to avoid hitting on single-word mails.  I can't see an easy rule fix to avoid the other FP samples here.  Hopefully the GA will reduce its score.  

fwiw, bug 6155's test scoregen resulted in:

+score TVD_SPACE_RATIO 1.291 0.598 1.799 0.744 # n=2
Comment 24 Matus UHLAR - fantomas 2009-11-13 01:37:20 UTC
I see a match when a mail of type application/pdf is sent - the PDF isn't an attachment, it's the body. The mail contains the one required blank line between headers and body, and two blank lines at the end - might this be the problem?

It seems to happen at some fax2mail gateways.
Do you need sample? Unluckily I can't just attach customer's e-mail.
Comment 25 Matus UHLAR - fantomas 2009-11-13 01:46:32 UTC
note that modifying TVD_SPACE_RATIO could positively affect its score too ;)
Comment 26 Justin Mason 2009-11-13 06:27:46 UTC
(In reply to comment #24)
> Do you need sample? Unluckily I can't just attach customer's e-mail.

can you come up with something similar (but shareable) which triggers the issue?
Comment 27 Justin Mason 2010-01-27 02:20:30 UTC
moving most remaining 3.3.0 bugs to 3.3.1 milestone
Comment 28 Justin Mason 2010-01-27 03:16:18 UTC
reassigning, too
Comment 29 Volodin Arkady 2010-01-28 00:27:55 UTC
Generally, in my organization's experience, TVD_SPACE_RATIO is always an FP if the message was already checked by SpamAssassin (e.g. at the sending MTA or at a relay MTA).
I think it is because of the long runs of spaces in the X-Spam-Report header. I think the developers should strip such system headers before calculating the ratio between spaces and other characters.

For example, a message like this:

Return-path: <r00t@domainX.com>
Envelope-to: r00t@domainY.com
Delivery-date: Thu, 28 Jan 2010 11:05:10 +0300
Received: from [A1.B1.C1.D1] (helo=mail.domainX.com)
	by domainY.com with esmtps (TLSv1:AES256-SHA:256)
	(Exim 4.69)
	(envelope-from <r00t@domainX.com>)
	id 1NaPNE-0002HI-4W
	for r00t@domainY.com; Thu, 28 Jan 2010 11:05:10 +0300
Received: from host-1.domainZ.com ([A2.B2.C2.D2] helo=[192.168.60.39])
	by mail.domainX.com with esmtpsa (TLSv1:AES256-SHA:256)
	(Exim 4.71)
	(envelope-from <r00t@domainX.com>)
	id 1NaPMr-000563-Ah
	for r00t@domainY.com; Thu, 28 Jan 2010 08:04:45 +0000
Message-ID: <4B614516.8090700@domainX.com>
Date: Thu, 28 Jan 2010 11:04:38 +0300
From: R00T <r00t@domainX.com>
User-Agent: Thunderbird 2.0.0.23 (X11/20090812)
MIME-Version: 1.0
To: r00t@domainY.com
Subject: TEST
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -1.4 (-)
X-Spam-Report: Spam detection software, running on the system "s1010.hostingdns.tv", has
 identified this incoming email as possible spam.  The original message
 has been attached to this so you can view it (if it isn't spam) or label
 similar future email.  If you have any questions, see
 spam@transclaim.ru for details.
 
 Content preview:  TEST [...] 
 
 Content analysis details:   (-1.4 points, 5.0 required)
 
  pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -1.4 ALL_TRUSTED            Passed through trusted hosts only via SMTP
X-Spam-Level: -
X-Spam-Score: 3.0 (+++)
X-Spam-Report: Spam detection software, running on the system "mail.domainY.com", has
	identified this incoming email as possible spam.  The original message
	has been attached to this so you can view it (if it isn't spam) or label
	similar future email.  If you have any questions, see
	postmaster@domainY.com for details.
	Content preview:  TEST [...] 
	Content analysis details:   (2.9 points, 3.0 required)
	pts rule name              description
	---- ---------------------- --------------------------------------------------
	2.9 TVD_SPACE_RATIO        BODY: TVD_SPACE_RATIO
	0.1 RDNS_NONE              Delivered to trusted network by a host with no rDNS
Subject: ***SPAM*** TEST
X-Spam-Level: +++
X-Spam-Status: score=3.0

TEST
Comment 30 Justin Mason 2010-03-23 16:33:24 UTC
moving all open 3.3.1 bugs to 3.3.2
Comment 31 Karsten Bräckelmann 2010-03-23 17:42:37 UTC
Moving back off of Security, which got changed by accident during the mass Target Milestone move.
Comment 32 Henrik Krohns 2011-05-09 12:23:24 UTC
Mass-checks look abysmal. This rule seems to be one that can never be fixed for all cases.

Anyone in favor of score 0.001? +1 from me
Comment 33 Henrik Krohns 2011-05-09 12:25:14 UTC
(In reply to comment #32)
> Mass-checks look abysmal. This rule seems to be one that can never be fixed for
> all cases.
> 
> Anyone in favor of score 0.001? +1 from me

Never mind, it was already set to that. :-D Closing.