SA Bugzilla – Bug 5673
header-rule whitespace munging semantics should be consistent and documented
Last modified: 2009-06-30 15:29:05 UTC
just noticed this; it appears that the "ALL" header looks like this: Header: value Header2: value ... in other words there's an extra space added before the value of each header. Also, there's no test to catch this. :( now fixed in trunk: : jm 76...; svn commit -m "bug 5673: 'ALL' header was including extra spurious spaces between header names and values. fix" Sending MANIFEST Sending lib/Mail/SpamAssassin/Message/Node.pm Sending t/SATest.pm Adding t/data/Dumpheaders.pm Adding t/get_all_headers.t Transmitting file data ..... Committed revision 583094.
Created attachment 4145 [details] fix this is the patch for 3.2.x.
please vote. this is a major bug, btw, since the whole point of "ALL" is so that patterns can be matched across header name/value/name/value boundaries...
> this is a major bug, btw, since the whole point of "ALL" is so > that patterns can be matched across header name/value/name/value > boundaries... +1 on a fix to Message/Node.pm +0.9 on added test, which concentrates on one minor detail, where it could test 'a real thing', like given a message with repeated and mixed header fields, folded or not, ... and see what comes out. > this is a major bug Since the get("ALL") already clobbers whitespace on FWS, turning ranges of TABs, spaces and form-feeds(!?) into a single space, I don't see it as a 'major bug', but can't hurt to get it right.
(In reply to comment #4) > Since the get("ALL") already clobbers whitespace on FWS, > turning ranges of TABs, spaces and form-feeds(!?) into a > single space, I don't see it as a 'major bug', but can't > hurt to get it right. ugh. that's a pretty serious bug too. "ALL" should be the headers, verbatim, as they were passed in (apart from the "From_" line). perhaps we should just use $msg->get_pristine_header() ?
What's the problem here? The amount of whitespace is irrelevant. > ugh. that's a pretty serious bug too. "ALL" should be the headers, verbatim, > as they were passed in (apart from the "From_" line). Why? There's nothing that says "ALL" means "pristine headers". Not that I'm against that idea, but I don't see a bug here.
(In reply to comment #6) > What's the problem here? The amount of whitespace is irrelevant. > > > ugh. that's a pretty serious bug too. "ALL" should be the headers, verbatim, > > as they were passed in (apart from the "From_" line). > Why? There's nothing that says "ALL" means "pristine headers". Not that I'm > against that idea, but I don't see a bug here. true -- there's no doc indicating that it has to be pristine. However it seems more logical. So are you saying this is really a feature request rather than a bug report?
(In reply to comment #6) > What's the problem here? The amount of whitespace is irrelevant. btw, I disagree on this point. If the real headers look like Header1: value1 Header2: value2 then get("ALL") should not return Header1: value1 Header2: value2 with *additional* spaces added. I can see marginal utility in "simplifying" long strings of whitespace down to a single space -- but adding spaces where they aren't in the original string, IMO, is a bug.
> I can see marginal utility in "simplifying" long strings of whitespace > down to a single space. Pretty marginal I'd say, because it only turns TABs, FFs and SPs into a space at line folds, but not within each line, so regexps in rules can not reliably count on such compressing of whitespace. I would suggest to also change the get_header() to only unfold header fields (i.e. only remove a line feed, and keep whitespace at line folds intact). Not necessary for 3.2.4, but 3.3.0 would be a good target for this change.
The more I think about it, the more I believe that header ALL shoudl literally return the header block of the message exactly as it is, with no more than perhaps removing trailing spaces on the line. I don't believe I would remove newlines and concatenate multiple-line header items. Sometimes the fact that the header is broken in a particular place is a spam sign. Of course, meybe there is a header ALL:raw that does this. In that case ALL without the :raw might do something else. But it isn't clear to me what that difference might be.
yeah, Loren, I'm inclined to agree...
I've just checked in a change to use get_pristine_header(), and it now checks an entire message's header in the test case. : jm 152...; svn commit -m "bug 5673: just use get_pristine_header() for the 'ALL' pseudo-header, to preserve all internal whitespace in the headers; update test" Sending lib/Mail/SpamAssassin/PerMsgStatus.pm Sending t/get_all_headers.t Transmitting file data .. Committed revision 583445. BTW, I would say the current behaviour -- munging internal whitespace -- arguably *is* a bug, since the documentation states: > There are several special pseudo-headers that can be specified: > ... > =item C<ALL> can be used to mean the text of all the message's headers. there's no mention that any munging at all takes place. IMO, that's buggy.
(In reply to comment #9) > I would suggest to also change the get_header() to only unfold header fields > (i.e. only remove a line feed, and keep whitespace at line folds intact). no, this is by design -- for individual headers, it's important to unfold the values for rule matching, and the ":raw" variant is offered to match against the unmunged form.
> > I would suggest to also change the get_header() to only unfold ... > no, this is by design -- for individual headers, it's important to unfold > the values for rule matching, and the ":raw" variant is offered to match > against the unmunged form. Did I say otherwise? I don't think so. Unfold yes, munge whitespace no. Unfolding means to remove newlines before whitespace. The whitespace at the beginning of continuation lines is and integral part of the mail header field body, not part of a folding.
(In reply to comment #14) > > > I would suggest to also change the get_header() to only unfold ... > > > no, this is by design -- for individual headers, it's important to unfold > > the values for rule matching, and the ":raw" variant is offered to match > > against the unmunged form. > > Did I say otherwise? I don't think so. Unfold yes, munge whitespace no. > Unfolding means to remove newlines before whitespace. The whitespace > at the beginning of continuation lines is and integral part of the > mail header field body, not part of a folding. In terms of "unfolding", I refer to removing both the newline and the whitespace at the beginning of continuation lines. In terms of how it interferes with writing rules, that whitespace is a nuisance. Take this header: Received: from radish.jmason.org (localhost [127.0.0.1]) by radish.jmason.org (Postfix) with ESMTP id 0409E33287 for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST) it could also be something like this: Received: from some-long-hostname-breaking-folding.jmason.org (localhost [127.0.0.1]) by radish.jmason.org (Postfix) with ESMTP id 0409E33287 for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST) This regexp /\(localhost \[127.0.0.1\]\) by radish.jmason.org/ is easier for a rule developer to deal with than the alternative /\(localhost\s+\[127.0.0.1\]\)\s+by\s+radish.jmason.org/ Again, this is by design, and IMO is still valuable behaviour.
> In terms of "unfolding", I refer to removing both the newline and the > whitespace at the beginning of continuation lines. This is re-inventing RFC 2822, which says: ...wherever this standard allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP. > it could also be something like this: > Received: from some-long-hostname-breaking-folding.jmason.org (localhost > [127.0.0.1]) by radish.jmason.org (Postfix) with ESMTP id > 0409E33287 for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST) It could also be: Received:\tfrom some-long-hostname-breaking-folding.jmason.org (localhost \t[127.0.0.1]) by\tradish.jmason.org (Postfix) with ESMTP\tid 0409E33287 for <jm@localhost>; Wed, 10 Oct 2007 13:30:15 +0100 (IST) > This regexp ... is easier for a rule developer to deal with > than the alternative ... > Again, this is by design, and IMO is still valuable behaviour. In that case, *all* whitespace substrings should be replaced with a single space, not just those after a CRLF.
(In reply to comment #16) > In that case, *all* whitespace substrings should be replaced with > a single space, not just those after a CRLF. sure, I can go for that (since the ":raw" variant takes care of people who want to match the whitespace). btw, this stuff is undocumented in Mail::SA::Conf, so we need to fix that... all it says is: header SYMBOLIC_TEST_NAME header op /pattern/modifiers [if-unset: STRING] Define a test. "SYMBOLIC_TEST_NAME" is a symbolic test name, such as 'FROM_ENDS_IN_NUMS'. "header" is the name of a mail header, such as 'Subject', 'To', etc. Appending ":raw" to the header name will inhibit decoding of quoted- printable or base-64 encoded strings.
> > In that case, *all* whitespace substrings should be replaced with > > a single space, not just those after a CRLF. > sure, I can go for that (since the ":raw" variant takes care of people > who want to match the whitespace). > btw, this stuff is undocumented in Mail::SA::Conf, so we need to fix that... Yes, that would be to my liking.
> In that case, *all* whitespace substrings should be replaced with > a single space, not just those after a CRLF. I had always assumed that it worked that way, and would be inclined to consider it a bug if it doesn't. (With the specific exception of the ALL pseudo-header which should implicitly be raw headers.)
+1 on 4145 for 3.2.4, fix the rest whenever.
+1 on committing patch 4145, then retarget this bug past 3.2.4 or open up a new bug. I count 4 votes for the patch here, so that part should be ready to go, I'm putting this in review and setting whiteboard status to show that.
ok, patch 4145 applied to 3.2.x: : jm 142...; svn commit -m "bug 5673: 'ALL' header was including spurious extra spaces between header names and values. fix" Sending MANIFEST Sending lib/Mail/SpamAssassin/Message/Node.pm Sending t/SATest.pm Adding t/data/Dumpheaders.pm Adding t/get_all_headers.t Transmitting file data ..... Committed revision 607590. Bug retitled and aimed at 3.3.0 for the remaining work...
: 301...; svn commit -m "bug 5673: document 'ALL' pseudo-header's odd behaviour where whitespace is concerned" lib Sending lib/Mail/SpamAssassin/Conf.pm Transmitting file data . Committed revision 789988. I've opened bug 6141 for the future TODO task.