SA Bugzilla – Bug 3979
[review] addresses not parsed properly for AWL if contain 2-byte language characters
Last modified: 2006-07-25 12:28:32 UTC
From: headers that contain 2-byte (Japanese) characters in the description (i.e. "2-byte chars here" <email@domain.tld>) were causing email addresses to be improperly parsed before being added to the AWL database. This caused entries in the database such as: "_z _ <kensho-a@ataru-kuji.com" (no quotes) The From: header looks like this: From: =?ISO-2022-JP?B?IhskQiF6OD02YiQ9JE4+bCRHRXYkPyRqISohekV2GyhC?= =?ISO-2022-JP?B?GyRCJD8kayEqISkkLyQ4IXo3fD5eJVwlcyVQITwhehsoQg==?= =?ISO-2022-JP?B?Ig==?= <kensho-a@ataru-kuji.com> the problem is reproduceable. I've collected debug information if necessary and am including a patch that will extract the "user@domain.tld" from what's passed to AutoWhitelist's _pack function. I haven't tested the patch extensively so it may be buggy or missing something in the regex.
Created attachment 2524 [details] Patch to extract email address when packing for AWL
I should have obfuscated the actual email address to protect the innocent, but forgot. if there's anyway that could be done for me, i'd appreciate it. also, i meant sub pack_addr, not _pack (just to prevent any further confusion)
Created attachment 2528 [details] slightly modified patch I've done some more testing and slight modification to the patch.
Triage: Problem seems defined well enough, and patch simple enough, to validate for 3.1.0 ...
Hmmm... I'd rather understand why the address is getting passed in as it is to begin with. That seems to be the problem. Moving to 3.1.1 milestone.
Created attachment 3588 [details] patch The problem is caused by a regex stopping at a '<' inside a pair of double quotes.
+1
hrm. why not just do something like: $result =~ s/"[^"]*"//g; The goal is to strip off the quoted bits, and the <> parts are handled in the next command. No sense trying to do everything at once if we don't need to.
Doing it all at once I should have used something like: $result =~ s/^\s*\".*?\"\s*<(.*?)>.*$/$1/; (In reply to comment #8) > hrm. why not just do something like: > > $result =~ s/"[^"]*"//g; > > The goal is to strip off the quoted bits, and the <> parts are handled in the > next command. No sense trying to do everything at once if we don't need to. That's fine by be too.
Created attachment 3592 [details] patch WTH was I doing backslashing double quotes? Anyway... patch for doing it in two steps.
+1 Seems simple enough.
uh, it's legal to have quoted strings in the address; Foo Bar <"Mr. Foo Bar"@barcom.biz> does that now support this? I suspect not. I think we're going about this the wrong way -- this is the perfect place for test-driven development, start with the test cases and write code that works against that.
(In reply to comment #12) > uh, it's legal to have quoted strings in the address; > > Foo Bar <"Mr. Foo Bar"@barcom.biz> > > does that now support this? I suspect not. > > I think we're going about this the wrong way -- this is the perfect > place for test-driven development, start with the test cases and > write code that works against that. Right, that's what we're doing (that's why I put the list of header contents in comments around that function so we know what we're dealing with). You're the first person to mention this other form. So it seems like we may want to do something like: $result =~ s/(?<!<)"[^"]*"//g; ?
> $result =~ s/(?<!<)"[^"]*"//g; try: 'To11:addr' failed! expect: '"Some User"@foo' got: '@foo' If <"Some User"@foo> is valid, I'd imagine we also want to handle "Some User"@foo without the angle brackets, no? If so, either the above with a negative look-ahead: $result =~ s/(?<!<)"[^"]*"(?!@)//g; or the original all-at-once way will do it: $result =~ s/^\s*".*?"\s*<(.*?)>.*$/$1/; BTW... [dos@FC5-VPC trunk]$ svn ci -m "bug 3979: add tests for correct :addr stripping" Sending MANIFEST Adding t/get_headers.t Transmitting file data .. Committed revision 423667.
since it seems there are still issues, I'm moving this out of review state. I'm tempted to say we ought to push this to 3.1.5, and and try to come up with something more comprehensive.
if no one minds, I'm moving this to 3.1.5. I don't think this is a blocker for 3.1.4.
ping jm on comment #14...
I agree, we should handle both <"Some User"@foo> and "Some User"@foo. whatever gets all the tests passing, with all the valid address test cases we've identified in this thread, I'm happy with ;)
Created attachment 3608 [details] suggested patch
Created attachment 3609 [details] suggested patch with tests
(In reply to comment #18) > I agree, we should handle both <"Some User"@foo> and "Some User"@foo. > whatever gets all the tests passing, with all the valid address test > cases we've identified in this thread, I'm happy with ;) I think attachment 3608 [details] will satisfy everyone. It strips stuff incrementally and passes all the tests identified. I don't think we need to include the actual tests in 3.1 though. However, attachment 3609 [details] is the same thing along with the tests. So take your pick. :) Back to 3.1.4 under the assumption that we're happy with this. [dos@FC5-VPC trunk]$ svn ci -m "bug 3979: better handling of quoted text when stripping email addresses" Sending lib/Mail/SpamAssassin/PerMsgStatus.pm Sending t/get_headers.t Transmitting file data .. Committed revision 425255.
+1 on both; I'd prefer 3609 was applied, fwiw ;)
sure, one more vote for 3609 then...
+1 for 3609
thanks :) [dos@FC5-VPC 3.1]$ svn ci -m "bug 3979: better handling of quoted text when stripping email addresses" Sending MANIFEST Sending lib/Mail/SpamAssassin/PerMsgStatus.pm Adding t/get_headers.t Transmitting file data ... Committed revision 425491.