Bug 5673 - header-rule whitespace munging semantics should be consistent and documented
Summary: header-rule whitespace munging semantics should be consistent and documented
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P5 major
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-09 03:22 UTC by Justin Mason
Modified: 2009-06-30 15:29 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
fix patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2007-10-09 03:22:26 UTC
 
Comment 1 Justin Mason 2007-10-09 03:27:04 UTC
just noticed this; it appears that the "ALL" header looks like this:

  Header:  value
  Header2:  value
  ...

in other words there's an extra space added before the value of each header. 
Also, there's no test to catch this. :(

now fixed in trunk:

: jm 76...; svn commit -m "bug 5673: 'ALL' header was including extra spurious
spaces between header names and values.  fix"
Sending        MANIFEST
Sending        lib/Mail/SpamAssassin/Message/Node.pm
Sending        t/SATest.pm
Adding         t/data/Dumpheaders.pm
Adding         t/get_all_headers.t
Transmitting file data .....
Committed revision 583094.

Comment 2 Justin Mason 2007-10-09 03:28:34 UTC
Created attachment 4145 [details]
fix

this is the patch for 3.2.x.
Comment 3 Justin Mason 2007-10-09 03:29:31 UTC
please vote.  this is a major bug, btw, since the whole point of "ALL" is so
that patterns can be matched across header name/value/name/value boundaries...
Comment 4 Mark Martinec 2007-10-09 06:05:58 UTC
> this is a major bug, btw, since the whole point of "ALL" is so
> that patterns can be matched across header name/value/name/value
> boundaries...

+1 on a fix to Message/Node.pm

+0.9 on added test, which concentrates on one minor detail,
   where it could test 'a real thing', like given a message
   with repeated and mixed header fields, folded or not, ...
   and see what comes out.

> this is a major bug

Since the get("ALL") already clobbers whitespace on FWS,
turning ranges of TABs, spaces and form-feeds(!?) into a
single space, I don't see it as a 'major bug', but can't
hurt to get it right. 
Comment 5 Justin Mason 2007-10-09 07:47:49 UTC
(In reply to comment #4)
> Since the get("ALL") already clobbers whitespace on FWS,
> turning ranges of TABs, spaces and form-feeds(!?) into a
> single space, I don't see it as a 'major bug', but can't
> hurt to get it right. 

ugh.  that's a pretty serious bug too.  "ALL" should be the headers, verbatim,
as they were passed in (apart from the "From_" line).    perhaps we should just
use $msg->get_pristine_header() ?
Comment 6 Theo Van Dinter 2007-10-09 07:52:18 UTC
What's the problem here?  The amount of whitespace is irrelevant.

> ugh.  that's a pretty serious bug too.  "ALL" should be the headers, verbatim,
> as they were passed in (apart from the "From_" line).

Why?  There's nothing that says "ALL" means "pristine headers".  Not that I'm
against that idea, but I don't see a bug here.
Comment 7 Justin Mason 2007-10-09 08:09:25 UTC
(In reply to comment #6)
> What's the problem here?  The amount of whitespace is irrelevant.
> 
> > ugh.  that's a pretty serious bug too.  "ALL" should be the headers, verbatim,
> > as they were passed in (apart from the "From_" line).
> Why?  There's nothing that says "ALL" means "pristine headers".  Not that I'm
> against that idea, but I don't see a bug here.

true -- there's no doc indicating that it has to be pristine.  However it seems
more logical.   So are you saying this is really a feature request rather than a
bug report?
Comment 8 Justin Mason 2007-10-10 02:59:24 UTC
(In reply to comment #6)
> What's the problem here?  The amount of whitespace is irrelevant.

btw, I disagree on this point.  If the real headers look like

  Header1: value1
  Header2: value2

then get("ALL") should not return

  Header1:  value1
  Header2:  value2

with *additional* spaces added.  I can see marginal utility in "simplifying"
long strings of whitespace down to a single space -- but adding spaces where
they aren't in the original string, IMO, is a bug.
Comment 9 Mark Martinec 2007-10-10 03:33:58 UTC
> I can see marginal utility in "simplifying" long strings of whitespace
> down to a single space.

Pretty marginal I'd say, because it only turns TABs, FFs and SPs into a space
at line folds, but not within each line, so regexps in rules can not reliably 
count on such compressing of whitespace.

I would suggest to also change the get_header() to only unfold header fields
(i.e. only remove a line feed, and keep whitespace at line folds intact).
Not necessary for 3.2.4, but 3.3.0 would be a good target for this change.

Comment 10 Loren Wilton 2007-10-10 05:05:59 UTC
The more I think about it, the more I believe that header ALL shoudl literally 
return the header block of the message exactly as it is, with no more than 
perhaps removing trailing spaces on the line.  I don't believe I would remove 
newlines and concatenate multiple-line header items.  Sometimes the fact that 
the header is broken in a particular place is a spam sign.

Of course, meybe there is a header ALL:raw that does this.  In that case ALL 
without the :raw might do something else.  But it isn't clear to me what that 
difference might be.
Comment 11 Justin Mason 2007-10-10 05:09:17 UTC
yeah, Loren, I'm inclined to agree...
Comment 12 Justin Mason 2007-10-10 05:17:16 UTC
I've just checked in a change to use get_pristine_header(), and it now checks an
entire message's header in the test case.

: jm 152...; svn commit -m "bug 5673: just use get_pristine_header() for the
'ALL' pseudo-header, to preserve all internal whitespace in the headers; update
test"
Sending        lib/Mail/SpamAssassin/PerMsgStatus.pm
Sending        t/get_all_headers.t
Transmitting file data ..
Committed revision 583445.



BTW, I would say the current behaviour -- munging internal whitespace -- arguably
*is* a bug, since the documentation states:

> There are several special pseudo-headers that can be specified:
> ...
> =item C<ALL> can be used to mean the text of all the message's headers.

there's no mention that any munging at all takes place.  IMO, that's buggy.
Comment 13 Justin Mason 2007-10-10 05:18:27 UTC
(In reply to comment #9)
> I would suggest to also change the get_header() to only unfold header fields
> (i.e. only remove a line feed, and keep whitespace at line folds intact).

no, this is by design -- for individual headers, it's important to unfold the
values for rule matching, and the ":raw" variant is offered to match against the
unmunged form.
Comment 14 Mark Martinec 2007-10-10 05:29:43 UTC
> > I would suggest to also change the get_header() to only unfold ...

> no, this is by design -- for individual headers, it's important to unfold
> the values for rule matching, and the ":raw" variant is offered to match
> against the unmunged form.

Did I say otherwise? I don't think so. Unfold yes, munge whitespace no.
Unfolding means to remove newlines before whitespace. The whitespace
at the beginning of continuation lines is and integral part of the
mail header field body, not part of a folding.
Comment 15 Justin Mason 2007-10-10 05:41:50 UTC
(In reply to comment #14)
> > > I would suggest to also change the get_header() to only unfold ...
> 
> > no, this is by design -- for individual headers, it's important to unfold
> > the values for rule matching, and the ":raw" variant is offered to match
> > against the unmunged form.
> 
> Did I say otherwise? I don't think so. Unfold yes, munge whitespace no.
> Unfolding means to remove newlines before whitespace. The whitespace
> at the beginning of continuation lines is and integral part of the
> mail header field body, not part of a folding.

In terms of "unfolding", I refer to removing both the newline and the whitespace
at the beginning of continuation lines.  In terms of how it interferes with
writing rules, that whitespace is a nuisance.  Take this header:

  Received: from radish.jmason.org (localhost [127.0.0.1])
	by radish.jmason.org (Postfix) with ESMTP id 0409E33287
	for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST)

it could also be something like this:

  Received: from some-long-hostname-breaking-folding.jmason.org (localhost
        [127.0.0.1]) by radish.jmason.org (Postfix) with ESMTP id
        0409E33287 for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST)

This regexp

  /\(localhost \[127.0.0.1\]\) by radish.jmason.org/

is easier for a rule developer to deal with than the alternative

  /\(localhost\s+\[127.0.0.1\]\)\s+by\s+radish.jmason.org/

Again, this is by design, and IMO is still valuable behaviour.
Comment 16 Mark Martinec 2007-10-10 05:50:19 UTC
> In terms of "unfolding", I refer to removing both the newline and the
> whitespace at the beginning of continuation lines.

This is re-inventing RFC 2822, which says:
  ...wherever this standard allows for folding white space (not
  simply WSP characters), a CRLF may be inserted before any WSP.

> it could also be something like this: 
>   Received: from some-long-hostname-breaking-folding.jmason.org (localhost
>         [127.0.0.1]) by radish.jmason.org (Postfix) with ESMTP id
>         0409E33287 for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST)

It could also be:
Received:\tfrom some-long-hostname-breaking-folding.jmason.org (localhost
\t[127.0.0.1]) by\tradish.jmason.org (Postfix) with ESMTP\tid
 0409E33287 for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST)

> This regexp ... is easier for a rule developer to deal with
> than the alternative ...
> Again, this is by design, and IMO is still valuable behaviour.

In that case, *all* whitespace substrings should be replaced with
a single space, not just those after a CRLF.
Comment 17 Justin Mason 2007-10-10 06:54:27 UTC
(In reply to comment #16)
> In that case, *all* whitespace substrings should be replaced with
> a single space, not just those after a CRLF.

sure, I can go for that (since the ":raw" variant takes care of people
who want to match the whitespace).

btw, this stuff is undocumented in Mail::SA::Conf, so we need to fix that...
all it says is:

       header SYMBOLIC_TEST_NAME header op /pattern/modifiers [if-unset: STRING]

           Define a test.  "SYMBOLIC_TEST_NAME" is a symbolic test name, such as
           'FROM_ENDS_IN_NUMS'.  "header" is the name of a mail header, such as
           'Subject', 'To', etc.

           Appending ":raw" to the header name will inhibit decoding of quoted-
           printable or base-64 encoded strings.
Comment 18 Mark Martinec 2007-10-10 07:58:34 UTC
> > In that case, *all* whitespace substrings should be replaced with
> > a single space, not just those after a CRLF.
> sure, I can go for that (since the ":raw" variant takes care of people
> who want to match the whitespace).
> btw, this stuff is undocumented in Mail::SA::Conf, so we need to fix that...

Yes, that would be to my liking.
Comment 19 Loren Wilton 2007-10-10 12:21:43 UTC
> In that case, *all* whitespace substrings should be replaced with
> a single space, not just those after a CRLF.

I had always assumed that it worked that way, and would be inclined to consider 
it a bug if it doesn't.  (With the specific exception of the ALL pseudo-header 
which should implicitly be raw headers.)
Comment 20 Daryl C. W. O'Shea 2007-11-06 13:44:55 UTC
+1 on 4145 for 3.2.4, fix the rest whenever.
Comment 21 Sidney Markowitz 2007-12-29 14:10:25 UTC
+1 on committing patch 4145, then retarget this bug past 3.2.4 or open up a new
bug. I count 4 votes for the patch here, so that part should be ready to go, I'm
putting this in review and setting whiteboard status to show that.
Comment 22 Justin Mason 2007-12-30 13:10:36 UTC
ok, patch 4145 applied to 3.2.x:

: jm 142...; svn commit -m "bug 5673: 'ALL' header was including spurious extra
spaces between header names and values.  fix"
Sending        MANIFEST
Sending        lib/Mail/SpamAssassin/Message/Node.pm
Sending        t/SATest.pm
Adding         t/data/Dumpheaders.pm
Adding         t/get_all_headers.t
Transmitting file data .....
Committed revision 607590.

Bug retitled and aimed at 3.3.0 for the remaining work...
Comment 23 Justin Mason 2009-06-30 15:29:05 UTC
: 301...; svn commit -m "bug 5673: document 'ALL' pseudo-header's odd behaviour where whitespace is concerned" lib
Sending        lib/Mail/SpamAssassin/Conf.pm
Transmitting file data .
Committed revision 789988.

I've opened bug 6141 for the future TODO task.