Bug 3979 - [review] addresses not parsed properly for AWL if contain 2-byte language characters
Summary: [review] addresses not parsed properly for AWL if contain 2-byte language cha...
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: 3.0.1
Hardware: PC Linux
: P5 normal
Target Milestone: 3.1.4
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: can be commited
Keywords:
Depends on:
Blocks:
 
Reported: 2004-11-18 00:17 UTC by Alan Premselaar
Modified: 2006-07-25 12:28 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
Patch to extract email address when packing for AWL patch None Alan Premselaar [NoCLA]
slightly modified patch patch None Alan Premselaar [NoCLA]
patch patch None Daryl C. W. O'Shea [HasCLA]
patch patch None Daryl C. W. O'Shea [HasCLA]
suggested patch patch None Daryl C. W. O'Shea [HasCLA]
suggested patch with tests patch None Daryl C. W. O'Shea [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Premselaar 2004-11-18 00:17:24 UTC
From: headers that contain 2-byte (Japanese) characters in the description (i.e.
"2-byte chars here" <email@domain.tld>) were causing email addresses to be
improperly parsed before being added to the AWL database.

This caused entries in the database such as: "_z _ <kensho-a@ataru-kuji.com" (no
quotes) 

The From: header looks like this:

From: =?ISO-2022-JP?B?IhskQiF6OD02YiQ9JE4+bCRHRXYkPyRqISohekV2GyhC?=
        =?ISO-2022-JP?B?GyRCJD8kayEqISkkLyQ4IXo3fD5eJVwlcyVQITwhehsoQg==?=
        =?ISO-2022-JP?B?Ig==?=
        <kensho-a@ataru-kuji.com>

the problem is reproduceable.

I've collected debug information if necessary and am including a patch that will
extract the "user@domain.tld" from what's passed to AutoWhitelist's _pack function.

I haven't tested the patch extensively so it may be buggy or missing something
in the regex.
Comment 1 Alan Premselaar 2004-11-18 00:18:33 UTC
Created attachment 2524 [details]
Patch to extract email address when packing for AWL
Comment 2 Alan Premselaar 2004-11-18 00:25:58 UTC
I should have obfuscated the actual email address to protect the innocent, but
forgot.  if there's anyway that could be done for me, i'd appreciate it.

also, i meant sub pack_addr, not _pack

(just to prevent any further confusion)

Comment 3 Alan Premselaar 2004-11-18 22:37:20 UTC
Created attachment 2528 [details]
slightly modified patch

I've done some more testing and slight modification to the patch.
Comment 4 Bob Menschel 2005-04-28 21:48:05 UTC
Triage: Problem seems defined well enough, and patch simple enough, to validate
for 3.1.0 ...
Comment 5 Daniel Quinlan 2005-04-30 20:54:21 UTC
Hmmm... I'd rather understand why the address is getting passed in
as it is to begin with.  That seems to be the problem.

Moving to 3.1.1 milestone.
Comment 6 Daryl C. W. O'Shea 2006-07-18 03:29:29 UTC
Created attachment 3588 [details]
patch

The problem is caused by a regex stopping at a '<' inside a pair of double
quotes.
Comment 7 Justin Mason 2006-07-18 20:31:14 UTC
+1
Comment 8 Theo Van Dinter 2006-07-19 01:12:08 UTC
hrm.  why not just do something like:

$result =~ s/"[^"]*"//g;

The goal is to strip off the quoted bits, and the <> parts are handled in the
next command.  No sense trying to do everything at once if we don't need to.
Comment 9 Daryl C. W. O'Shea 2006-07-19 02:16:14 UTC
Doing it all at once I should have used something like:

$result =~ s/^\s*\".*?\"\s*<(.*?)>.*$/$1/;


(In reply to comment #8)
> hrm.  why not just do something like:
> 
> $result =~ s/"[^"]*"//g;
> 
> The goal is to strip off the quoted bits, and the <> parts are handled in the
> next command.  No sense trying to do everything at once if we don't need to.

That's fine by be too.
Comment 10 Daryl C. W. O'Shea 2006-07-19 05:47:43 UTC
Created attachment 3592 [details]
patch

WTH was I doing backslashing double quotes?

Anyway... patch for doing it in two steps.
Comment 11 Doc Schneider 2006-07-19 06:17:58 UTC
+1

Seems simple enough.
Comment 12 Justin Mason 2006-07-19 09:40:57 UTC
uh, it's legal to have quoted strings in the address;

   Foo Bar <"Mr. Foo Bar"@barcom.biz>

does that now support this?  I suspect not.

I think we're going about this the wrong way -- this is the perfect
place for test-driven development, start with the test cases and
write code that works against that.
Comment 13 Theo Van Dinter 2006-07-19 15:52:21 UTC
(In reply to comment #12)
> uh, it's legal to have quoted strings in the address;
> 
>    Foo Bar <"Mr. Foo Bar"@barcom.biz>
> 
> does that now support this?  I suspect not.
> 
> I think we're going about this the wrong way -- this is the perfect
> place for test-driven development, start with the test cases and
> write code that works against that.

Right, that's what we're doing (that's why I put the list of header contents in
comments around that function so we know what we're dealing with).  You're the
first person to mention this other form.  So it seems like we may want to do
something like:

$result =~ s/(?<!<)"[^"]*"//g;

?
Comment 14 Daryl C. W. O'Shea 2006-07-20 00:21:38 UTC
> $result =~ s/(?<!<)"[^"]*"//g;

try: 'To11:addr' failed! expect: '"Some User"@foo' got: '@foo'


If <"Some User"@foo> is valid, I'd imagine we also want to handle "Some
User"@foo without the angle brackets, no?

If so, either the above with a negative look-ahead:

$result =~ s/(?<!<)"[^"]*"(?!@)//g;

or the original all-at-once way will do it:

$result =~ s/^\s*".*?"\s*<(.*?)>.*$/$1/;



BTW...

[dos@FC5-VPC trunk]$ svn ci -m "bug 3979: add tests for correct :addr stripping"
Sending        MANIFEST
Adding         t/get_headers.t
Transmitting file data ..
Committed revision 423667.
Comment 15 Theo Van Dinter 2006-07-21 15:12:27 UTC
since it seems there are still issues, I'm moving this out of review state.  I'm
tempted to say we ought to push this to 3.1.5, and and try to come up with
something more comprehensive.
Comment 16 Theo Van Dinter 2006-07-24 15:12:18 UTC
if no one minds, I'm moving this to 3.1.5.  I don't think this is a blocker for
3.1.4.
Comment 17 Daryl C. W. O'Shea 2006-07-24 18:05:33 UTC
ping jm on comment #14...
Comment 18 Justin Mason 2006-07-24 20:02:35 UTC
I agree, we should handle both <"Some User"@foo> and "Some User"@foo.
whatever gets all the tests passing, with all the valid address test
cases we've identified in this thread, I'm happy with ;)
Comment 19 Daryl C. W. O'Shea 2006-07-25 01:49:10 UTC
Created attachment 3608 [details]
suggested patch
Comment 20 Daryl C. W. O'Shea 2006-07-25 01:49:55 UTC
Created attachment 3609 [details]
suggested patch with tests
Comment 21 Daryl C. W. O'Shea 2006-07-25 02:03:37 UTC
(In reply to comment #18)
> I agree, we should handle both <"Some User"@foo> and "Some User"@foo.
> whatever gets all the tests passing, with all the valid address test
> cases we've identified in this thread, I'm happy with ;)

I think attachment 3608 [details] will satisfy everyone.  It strips stuff incrementally
and passes all the tests identified.  I don't think we need to include the
actual tests in 3.1 though.

However, attachment 3609 [details] is the same thing along with the tests.  So take your
pick. :)


Back to 3.1.4 under the assumption that we're happy with this.


[dos@FC5-VPC trunk]$ svn ci -m "bug 3979: better handling of quoted text when
stripping email addresses"
Sending        lib/Mail/SpamAssassin/PerMsgStatus.pm
Sending        t/get_headers.t
Transmitting file data ..
Committed revision 425255.
Comment 22 Justin Mason 2006-07-25 09:10:09 UTC
+1 on both; I'd prefer 3609 was applied, fwiw ;)
Comment 23 Daryl C. W. O'Shea 2006-07-25 15:56:16 UTC
sure, one more vote for 3609 then...
Comment 24 Doc Schneider 2006-07-25 19:05:52 UTC
+1 for 3609
Comment 25 Daryl C. W. O'Shea 2006-07-25 19:28:32 UTC
thanks :)

[dos@FC5-VPC 3.1]$ svn ci -m "bug 3979: better handling of quoted text when
stripping email addresses"
Sending        MANIFEST
Sending        lib/Mail/SpamAssassin/PerMsgStatus.pm
Adding         t/get_headers.t
Transmitting file data ...
Committed revision 425491.