SA Bugzilla – Bug 4793
[review] rewrite_header is not properly parsed
Last modified: 2006-05-24 10:19:56 UTC
Then option rewrite_header from local.cf contains symbols of underlining ("_"), path of his tag that added to spam messages is dissapear. Example: y:/etc/mail/spamassassin # grep rewrite_header local.cf rewrite_header Subject MESSAGE_MARKED_AS_SPAM y:/etc/mail/spamassassin # spamassassin < /home/dmitry/dl/build/Mail-SpamAssassin-3.1.0/sample-spam.txt|grep Subject Subject: MESSAGEAS_SPAM Test spam mail (GTUBE) Subject: Test spam mail (GTUBE) y:/etc/mail/spamassassin # This effect take place not for all spam emails, but very frequently. I use spamassassin 3.1.0, from www.spamassassin.org. I reproduce this bug on RHEL AS3 and SLES9. I think this is a bug, because spamassassin 2.55 and 2.64 work correctly with identical tag of subject_tag option.
Correction. This issue take place for ALL spam emails marked on spamassassin 3.1.0. Without exlusion.
As for what's going on: SpamAssassin is seeing "_MARKED_" as a template tag (see the docs, but it's the same as _SCORE_, _VERSION_, etc,) and tries to replace it, which ends up just removing it from the header since it's not a valid tag. As for what to do about it: We could have PMS->_get_tag() return "_TAG_" if it's unknown, though I can also see the argument for the current behavior (returning the empty string ""). It may be worth noting in the documentation that "_SOMETHING_" will be considered a tag always.
"It may be worth noting in the documentation that "_SOMETHING_" will be considered a tag always." I'm for that. I don't like the possibility of conflicts with the addition of tags in the future.
How about adding an escape character, for example use \_ to insert a literal _ and \\ to insert a literal \ character? Alternatively, what about only recognizing a tag on a word boundary?
I would argue that "\W_SOMETHING_\W" is a tag, and "\w_SOMETHING_" or "_SOMETHING_\w" is not a tag.
(In reply to comment #5) > I would argue that "\W_SOMETHING_\W" is a tag, and "\w_SOMETHING_" > or "_SOMETHING_\w" is not a tag. Perhaps /\b_[^_]+_\b/, but \W is a character, so a header of "_SOMETHING_" wouldn't be considered a valid tag. But then we're also trading one limitation for another -- if someone wants to do "CUSTOMER_CONTACTADDRESS_" for instance (so that in the report people would be directed to send mail to a customer-specific address), why should we disallow that? I'd rather do: - add documentation that "_SOMETHING_" in places like the report, add_header, replace_header, etc, will be considered a tag and replaced, even in the middle of a string such as "FOO_SOMETHING_BAR". - possibly, somehow rewrite _replace_tags() with a version that if the tag named "SOMETHING" isn't valid and an error is returned we'll leave the text alone. right now invalid tags get replaced with the blank string so that all tags are replaced with something. this doesn't seem to documented either btw.
> but \W is a character Am I misunderstanding? I thought \w was a word-character, so \W was not (a word character). Therefore \W\w would be somewhat equivalent to \b\w. I'm not happy with trading limitations. How about going to something similar to the way a shell might handle something like this, and include some parends when necessary? So if someone REALLY wants the cojoined "CONTACT_" with the mythical _ADDRESS_ you might do CONTACT_(_ADDRESS_). (It seems much more likely in that case that what they probably want is "CONTACT: _ADDRESS_" which doesn't really have the limitation in the first place; nor would "CONTACT:_ADDRESS_" if they left the space out.) Or on rereading your comment, what the user might want was CUSTOMER@_CONTACTADDRESS_, which would also work.
(In reply to comment #7) > > but \W is a character > > Am I misunderstanding? I thought \w was a word-character, so \W was not (a > word character). Therefore \W\w would be somewhat equivalent to \b\w. You are correct in the meaning of \w and \W, however \W requires there to be a character. \b is the space between a \W and a \w, ie: zero-width, and includes the start and end of the line. > I'm not happy with trading limitations. How about going to something similar > to the way a shell might handle something like this, and include some parends > when necessary? So if someone REALLY wants the cojoined "CONTACT_" with the > mythical _ADDRESS_ you might do CONTACT_(_ADDRESS_). (It seems much more I really don't want to change what we currently accept and kluge in some way to solve the immediate request that will cause us problems down the road. For example, I can definitely see people wanting to use "(_TAG_)" and leave the parens in place. For instance, from the standard config: add_header all Checker-Version SpamAssassin _VERSION_ (_SUBVERSION_) on _HOSTNAME_ > Or on rereading your comment, what the user might want was > CUSTOMER@_CONTACTADDRESS_, which would also work. well, my example was a bit contrived. the point is that I don't want to trade one limitation for another just because we can. if we change to requiring /\b_SOMETHING_\b/, someone will come along and request to be able to do "FOO_SOMETHING_BAR" and we'll be back here again.
ok, I committed a small documentation addition that should explain the situation. I think this should be sufficient, but will leave the ticket open for a little while longer in case there are any other comments. v3.1: Sending Conf.pm Transmitting file data . Committed revision 379672. v3.2: Sending Conf.pm Transmitting file data . Committed revision 379673.
I'd prefer to require \b on either side; IMO not allowing underscores in the rewrite_header, or in templates, is too much!
(In reply to comment #10) > I'd prefer to require \b on either side; IMO not allowing underscores in the > rewrite_header, or in templates, is too much! It's not disallowing underscores completely ("rewrite_header all Subject [SPAM_SCORE: _SCORE_]" would work as expected). It's disallowing text that looks like a tag. If people really want to shift to requiring \b around the tag, I won't oppose it, but it simply trades one limitation for another one which doesn't really seem like a worthwhile effort to me.
I'm going to move this to 3.1.2 for more discussion. If we come up with a plan in the near term, we can move back to 3.1.1, but I don't think this needs to hold up the release.
(In reply to comment #11) > It's not disallowing underscores completely ("rewrite_header all Subject [SPAM_SCORE: _SCORE_]" would > work as expected). It's disallowing text that looks like a tag. FWIW, I still agree with this. Anything that looks like a tag should be treated as a tag... basically the same as it is now. Requiring \b around tags is just one trade off for another. What I think should be done, though, is linting of tags. If "_NOTATAG_" appears in a config, a lint check should warn about it.
'We could have PMS->_get_tag() return "_TAG_" if it's unknown' actually I prefer this, on reflection. (a) we have to allow underscores to appear in rewrite_header as this bug started with; that's a given. 'rewrite_header Subject MESSAGE_MARKED_AS_SPAM' must work sensibly. (b) if a user typos a tag name in a custom rewrite_header line, this gives them an easier way to tell that this has happened. in some circumstances a tag value might be "" in the normal case, in which case the error would be hard to spot. (c) ok, I can see theo's point about trading for a new limitation on the semantics of existing (and working) tag references.
Created attachment 3445 [details] ignores unknown tag references this patch leaves "_UNKNOWN_TAG_REFERENCE_" intact, and documents the behaviour. it also makes the plugin API get_tag() return undef if an unknown tag is referred to, instead of silently returning "".
sure. +1
(In reply to comment #14) > 'We could have PMS->_get_tag() return "_TAG_" if it's unknown' > > actually I prefer this, on reflection. > > (a) we have to allow underscores to appear in rewrite_header as this bug started > with; that's a given. 'rewrite_header Subject MESSAGE_MARKED_AS_SPAM' must work > sensibly. I don't know if that *must* work. MESSAGE-MARKED-AS-SPAM seems to be just as good. > (b) if a user typos a tag name in a custom rewrite_header line, this gives them > an easier way to tell that this has happened. in some circumstances a > tag value might be "" in the normal case, in which case the error would be hard > to spot. Linting invalid tags would also be pretty obvious that there's a problem. :) > (c) ok, I can see theo's point about trading for a new limitation on the > semantics of existing (and working) tag references. The only thing I don't like about leaving "_UNKNOWN_TAG_REFERENCE_" intact is that it opens up the possibility of collision with new tags in the future. So +0 of 3445.
+1 Ready for commit
I messed up my test and missed errors. After applying this patch, I get t/rcvd_parser.t failing most or all of its tests. -1 until I or someone else resolves this
More details (one example -- all have the same error): make test TEST_FILES=t/rcvd_parser.t TEST_VERBOSE=yes # Failed test 2 in t/rcvd_parser.t at line 647 not ok 2 expected: [ ip=218.11.152.141 rdns= helo=cs.helsinki.fi by=mail.cs.helsinki.fi ident= envfrom= id= auth= ] [ ip=64.235.238.165 rdns= helo=m165.4superdeals.biz by=mail.cs.helsinki.fi ident= envfrom= id= auth= ] got : [ ip=218.11.152.141 rdns= helo=cs.helsinki.fi by=mail.cs.helsinki.fi ident= envfrom= id= auth= ] [ ip=64.235.238.165 rdns= helo=m165.4superdeals.biz by=mail.cs.helsinki.fi ident= envfrom= id= auth= ] _RELAYSTRUSTED_
I think I see the problem. $self->{tag_data}->{RELAYSTRUSTED} = $self->{relays_trusted_str}; When there are no trusted relays it is undef, not "", which looks to this patch just like _RELAYSTRUSTED_ is not a defined tag.
The obvious fix for this is to require that anything that defines a tag MUST set it to "" if it is empty rather than leave it undef. But that means that we have to do that for every tag that is defined in PerMsgStatus and in any plugin. If anybody can think of a clean alternative that does not depend on every newly added tag being done correctly like that, I would like to hear it.
Ok, the fix can all go in _get_tags if it tests for exists instead of boolean false of the hash value and return "" instead of undef if the tag is there as a hash key even with an undef value. Updated patch forthcoming
Created attachment 3465 [details] corrected patch against 3.1 branch with _get_tag that doesn't return undef if tag exists Here's a pretty simple change that seems to correct the problem Back to review status, needs 2 votes
+1 -- thanks, my mistake!
Committed to trunk revision 392709. Still needs 1 vote for 3.1 branch.
The patch doesn't deal properly with "_TAG(DATA)_" when TAG doesn't exist. It'll return "_TAG_", so we still end up losing bits. I'll put up a new one shortly.
Created attachment 3520 [details] updated patch basically the same as 3465, except deal properly with "_TAG(DATA)_". I also trimmed a few things down ("return undef" -> "return", etc,) for conciseness.
+1
Sending lib/Mail/SpamAssassin/Conf.pm Sending lib/Mail/SpamAssassin/PerMsgStatus.pm Transmitting file data .. Committed revision 409210.