Bug 4793 - [review] rewrite_header is not properly parsed
Summary: [review] rewrite_header is not properly parsed
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: 3.1.0
Hardware: PC Linux
: P5 minor
Target Milestone: 3.1.2
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: ready for commit
Keywords:
Depends on:
Blocks:
 
Reported: 2006-02-14 11:28 UTC by Dmitry Sedov
Modified: 2006-05-24 10:19 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
ignores unknown tag references patch None Justin Mason [HasCLA]
corrected patch against 3.1 branch with _get_tag that doesn't return undef if tag exists patch None Sidney Markowitz [HasCLA]
updated patch patch None Theo Van Dinter [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Dmitry Sedov 2006-02-14 11:28:17 UTC
Then option rewrite_header from local.cf contains symbols of underlining
("_"), path of his tag that added to spam messages is dissapear.

   Example:

y:/etc/mail/spamassassin # grep rewrite_header local.cf
rewrite_header Subject MESSAGE_MARKED_AS_SPAM
y:/etc/mail/spamassassin # spamassassin <
/home/dmitry/dl/build/Mail-SpamAssassin-3.1.0/sample-spam.txt|grep Subject
Subject: MESSAGEAS_SPAM Test spam mail (GTUBE)
Subject: Test spam mail (GTUBE)
y:/etc/mail/spamassassin #


   This effect take place not for all spam emails, but very frequently. 

   I use spamassassin 3.1.0, from www.spamassassin.org. I reproduce this bug on
RHEL AS3 and SLES9. I think this is a bug, because spamassassin 2.55 and 2.64
work correctly with identical tag of subject_tag option.
Comment 1 Dmitry Sedov 2006-02-14 16:02:22 UTC
  Correction. This issue take place for ALL spam emails marked on spamassassin
3.1.0. Without exlusion.
Comment 2 Theo Van Dinter 2006-02-14 16:24:26 UTC
As for what's going on:
SpamAssassin is seeing "_MARKED_" as a template tag (see the docs, but it's the same as _SCORE_, 
_VERSION_, etc,) and tries to replace it, which ends up just removing it from the header since it's not a 
valid tag.

As for what to do about it:
We could have PMS->_get_tag() return "_TAG_" if it's unknown, though I can also see the argument for the 
current behavior (returning the empty string "").  It may be worth noting in the documentation that 
"_SOMETHING_" will be considered a tag always.
Comment 3 Daryl C. W. O'Shea 2006-02-14 20:25:03 UTC
"It may be worth noting in the documentation that 
"_SOMETHING_" will be considered a tag always."

I'm for that.  I don't like the possibility of conflicts with the addition of
tags in the future.
Comment 4 Sidney Markowitz 2006-02-14 23:18:00 UTC
How about adding an escape character, for example use \_ to insert a literal _
and \\ to insert a literal \ character?

Alternatively, what about only recognizing a tag on a word boundary?
Comment 5 Loren Wilton 2006-02-15 00:30:08 UTC
I would argue that "\W_SOMETHING_\W" is a tag, and "\w_SOMETHING_" 
or "_SOMETHING_\w" is not a tag.
Comment 6 Theo Van Dinter 2006-02-15 00:44:59 UTC
(In reply to comment #5)
> I would argue that "\W_SOMETHING_\W" is a tag, and "\w_SOMETHING_" 
> or "_SOMETHING_\w" is not a tag.

Perhaps /\b_[^_]+_\b/, but \W is a character, so a header of "_SOMETHING_" wouldn't be considered a 
valid tag.  But then we're also trading one limitation for another -- if someone wants to do 
"CUSTOMER_CONTACTADDRESS_" for instance (so that in the report people would be directed to send 
mail to a customer-specific address), why should we disallow that?  I'd rather do:

- add documentation that "_SOMETHING_" in places like the report, add_header, replace_header, etc, 
will be considered a tag and replaced, even in the middle of a string such as "FOO_SOMETHING_BAR".

- possibly, somehow rewrite _replace_tags() with a version that if the tag named "SOMETHING" isn't 
valid and an error is returned we'll leave the text alone.  right now invalid tags get replaced with the 
blank string so that all tags are replaced with something.  this doesn't seem to documented either btw.
Comment 7 Loren Wilton 2006-02-15 02:50:21 UTC
> but \W is a character

Am I misunderstanding?  I thought \w was a word-character, so \W was not (a 
word character).  Therefore \W\w would be somewhat equivalent to \b\w.

I'm not happy with trading limitations.  How about going to something similar 
to the way a shell might handle something like this, and include some parends 
when necessary?  So if someone REALLY wants the cojoined "CONTACT_" with the 
mythical _ADDRESS_ you might do CONTACT_(_ADDRESS_).  (It seems much more 
likely in that case that what they probably want is "CONTACT: _ADDRESS_" which 
doesn't really have the limitation in the first place; nor 
would "CONTACT:_ADDRESS_" if they left the space out.)

Or on rereading your comment, what the user might want was 
CUSTOMER@_CONTACTADDRESS_, which would also work.

Comment 8 Theo Van Dinter 2006-02-15 03:22:20 UTC
(In reply to comment #7)
> > but \W is a character
> 
> Am I misunderstanding?  I thought \w was a word-character, so \W was not (a 
> word character).  Therefore \W\w would be somewhat equivalent to \b\w.

You are correct in the meaning of \w and \W, however \W requires there to be a character.  \b is the 
space between a \W and a \w, ie: zero-width, and includes the start and end of the line.

> I'm not happy with trading limitations.  How about going to something similar 
> to the way a shell might handle something like this, and include some parends 
> when necessary?  So if someone REALLY wants the cojoined "CONTACT_" with the 
> mythical _ADDRESS_ you might do CONTACT_(_ADDRESS_).  (It seems much more 

I really don't want to change what we currently accept and kluge in some way to solve the immediate 
request that will cause us problems down the road.  For example, I can definitely see people wanting to 
use "(_TAG_)" and leave the parens in place.  For instance, from the standard config:

add_header all Checker-Version SpamAssassin _VERSION_ (_SUBVERSION_) on _HOSTNAME_

> Or on rereading your comment, what the user might want was 
> CUSTOMER@_CONTACTADDRESS_, which would also work.

well, my example was a bit contrived.  the point is that I don't want to trade one limitation for another 
just because we can.  if we change to requiring /\b_SOMETHING_\b/, someone will come along and 
request to be able to do "FOO_SOMETHING_BAR" and we'll be back here again.
Comment 9 Theo Van Dinter 2006-02-22 05:05:31 UTC
ok, I committed a small documentation addition that should explain the situation.  I think this should 
be sufficient, but will leave the ticket open for a little while longer in case there are any other 
comments.

v3.1:
Sending        Conf.pm
Transmitting file data .
Committed revision 379672.

v3.2:
Sending        Conf.pm
Transmitting file data .
Committed revision 379673.
Comment 10 Justin Mason 2006-02-22 11:04:34 UTC
I'd prefer to require \b on either side; IMO not allowing underscores in the
rewrite_header, or in templates, is too much!
Comment 11 Theo Van Dinter 2006-02-24 23:48:49 UTC
(In reply to comment #10)
> I'd prefer to require \b on either side; IMO not allowing underscores in the
> rewrite_header, or in templates, is too much!

It's not disallowing underscores completely ("rewrite_header all Subject [SPAM_SCORE: _SCORE_]" would 
work as expected).  It's disallowing text that looks like a tag.  If people really want to shift to requiring \b 
around the tag, I won't oppose it, but it simply trades one limitation for another one which doesn't really 
seem like a worthwhile effort to me.
Comment 12 Theo Van Dinter 2006-03-08 19:12:12 UTC
I'm going to move this to 3.1.2 for more discussion.  If we come up with a plan in the near term, we can 
move back to 3.1.1, but I don't think this needs to hold up the release.
Comment 13 Daryl C. W. O'Shea 2006-04-03 04:51:15 UTC
(In reply to comment #11)
> It's not disallowing underscores completely ("rewrite_header all Subject
[SPAM_SCORE: _SCORE_]" would 
> work as expected).  It's disallowing text that looks like a tag. 

FWIW, I still agree with this.  Anything that looks like a tag should be treated
as a tag... basically the same as it is now.  Requiring \b around tags is just
one trade off for another.


What I think should be done, though, is linting of tags.  If "_NOTATAG_" appears
in a config, a lint check should warn about it.
Comment 14 Justin Mason 2006-04-03 13:50:50 UTC
'We could have PMS->_get_tag() return "_TAG_" if it's unknown'

actually I prefer this, on reflection.

(a) we have to allow underscores to appear in rewrite_header as this bug started
with; that's a given. 'rewrite_header Subject MESSAGE_MARKED_AS_SPAM' must work
sensibly.

(b) if a user typos a tag name in a custom rewrite_header line, this gives them
an easier way to tell that this has happened.  in some circumstances a
tag value might be "" in the normal case, in which case the error would be hard
to spot.

(c) ok, I can see theo's point about trading for a new limitation on the
semantics of existing (and working) tag references.
Comment 15 Justin Mason 2006-04-03 13:53:42 UTC
Created attachment 3445 [details]
ignores unknown tag references 

this patch leaves "_UNKNOWN_TAG_REFERENCE_" intact, and documents the
behaviour.   it also makes the plugin API get_tag() return undef if an unknown
tag is referred to, instead of silently returning "".
Comment 16 Theo Van Dinter 2006-04-03 19:29:57 UTC
sure.  +1
Comment 17 Daryl C. W. O'Shea 2006-04-03 21:48:36 UTC
(In reply to comment #14)
> 'We could have PMS->_get_tag() return "_TAG_" if it's unknown'
> 
> actually I prefer this, on reflection.
> 
> (a) we have to allow underscores to appear in rewrite_header as this bug started
> with; that's a given. 'rewrite_header Subject MESSAGE_MARKED_AS_SPAM' must work
> sensibly.

I don't know if that *must* work.  MESSAGE-MARKED-AS-SPAM seems to be just as good.


> (b) if a user typos a tag name in a custom rewrite_header line, this gives them
> an easier way to tell that this has happened.  in some circumstances a
> tag value might be "" in the normal case, in which case the error would be hard
> to spot.

Linting invalid tags would also be pretty obvious that there's a problem. :)


> (c) ok, I can see theo's point about trading for a new limitation on the
> semantics of existing (and working) tag references.

The only thing I don't like about leaving "_UNKNOWN_TAG_REFERENCE_" intact is
that it opens up the possibility of collision with new tags in the future.


So +0 of 3445.
Comment 18 Sidney Markowitz 2006-04-09 02:31:13 UTC
+1   Ready for commit
Comment 19 Sidney Markowitz 2006-04-09 06:23:24 UTC
I messed up my test and missed errors. After applying this patch, I get
t/rcvd_parser.t failing most or all of its tests.

-1 until I or someone else resolves this
Comment 20 Sidney Markowitz 2006-04-09 06:28:26 UTC
More details (one example -- all have the same error):

make test TEST_FILES=t/rcvd_parser.t TEST_VERBOSE=yes

# Failed test 2 in t/rcvd_parser.t at line 647
not ok 2
expected: [ ip=218.11.152.141 rdns= helo=cs.helsinki.fi by=mail.cs.helsinki.fi
ident= envfrom= id= auth= ] [ ip=64.235.238.165 rdns= helo=m165.4superdeals.biz
by=mail.cs.helsinki.fi ident= envfrom= id= auth= ]
got     : [ ip=218.11.152.141 rdns= helo=cs.helsinki.fi by=mail.cs.helsinki.fi
ident= envfrom= id= auth= ] [ ip=64.235.238.165 rdns= helo=m165.4superdeals.biz
by=mail.cs.helsinki.fi ident= envfrom= id= auth= ] _RELAYSTRUSTED_
Comment 21 Sidney Markowitz 2006-04-09 06:53:55 UTC
I think I see the problem.

$self->{tag_data}->{RELAYSTRUSTED} = $self->{relays_trusted_str};

When there are no trusted relays it is undef, not "", which looks to this patch
just like _RELAYSTRUSTED_ is not a defined tag.
Comment 22 Sidney Markowitz 2006-04-09 07:21:57 UTC
The obvious fix for this is to require that anything that defines a tag MUST set
it to "" if it is empty rather than leave it undef.

But that means that we have to do that for every tag that is defined in
PerMsgStatus and in any plugin. If anybody can think of a clean alternative that
does not depend on every newly added tag being done correctly like that, I would
like to hear it.
Comment 23 Sidney Markowitz 2006-04-09 08:12:42 UTC
Ok, the fix can all go in _get_tags if it tests for exists instead of boolean
false of the hash value and return "" instead of undef if the tag is there as a
hash key even with an undef value.

Updated patch forthcoming
Comment 24 Sidney Markowitz 2006-04-09 08:36:55 UTC
Created attachment 3465 [details]
corrected patch against 3.1 branch with _get_tag that doesn't return undef if tag exists

Here's a pretty simple change that seems to correct the problem

Back to review status, needs 2 votes
Comment 25 Justin Mason 2006-04-09 13:09:51 UTC
+1  -- thanks, my mistake!
Comment 26 Sidney Markowitz 2006-04-09 13:54:43 UTC
Committed to trunk revision 392709.

Still needs 1 vote for 3.1 branch.
Comment 27 Theo Van Dinter 2006-05-23 22:56:32 UTC
The patch doesn't deal properly with "_TAG(DATA)_" when TAG doesn't exist.  It'll return "_TAG_", so we 
still end up losing bits.  I'll put up a new one shortly.
Comment 28 Theo Van Dinter 2006-05-23 23:45:23 UTC
Created attachment 3520 [details]
updated patch

basically the same as 3465, except deal properly with "_TAG(DATA)_".  I also
trimmed a few things down ("return undef" -> "return", etc,) for conciseness.
Comment 29 Justin Mason 2006-05-24 08:45:10 UTC
+1
Comment 30 Michael Parker 2006-05-24 17:15:48 UTC
+1
Comment 31 Theo Van Dinter 2006-05-24 17:19:56 UTC
Sending        lib/Mail/SpamAssassin/Conf.pm
Sending        lib/Mail/SpamAssassin/PerMsgStatus.pm
Transmitting file data ..
Committed revision 409210.