Bug 5763 - [review] Problem with invisible context extraction - whitespace chars dropped
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: unspecified
Hardware: Other other
Importance: P2 normal
Target Milestone: 3.2.6
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: needs 1 vote for 3.2
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-02 12:09 UTC by Yanyan Yang
Modified: 2011-05-23 17:51 UTC



Attachments:
Attachment | Type | Status | Submitter/CLA Status
Test msg with invisible text | text/plain | None | Yanyan Yang [NoCLA]
the invisible text extracted | text/plain | None | Yanyan Yang [NoCLA]
Proposed change to SpamAssassin HTML.pm | patch | None | Yanyan Yang [NoCLA]

Description Yanyan Yang 2008-01-02 12:09:47 UTC
There seems to be a problem with text extracted into the "invisible" context:
whitespace characters are dropped.

During HTML parsing (HTML.pm), whitespace is always treated as visible. On the
other hand, when text is added in display_text(), the trailing whitespace of
the previous element is trimmed when the current element is whitespace, and the
leading whitespace of the current element is trimmed when the previous element
is whitespace.

So when invisible text is extracted, it contains no whitespace, because either
the trailing or the leading whitespace has been trimmed.
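
The effect described above can be reproduced with a small model. This is only a
sketch in Python, not the Perl code in HTML.pm: the element layout (a dict with
"text", "whitespace" and "invisible" keys) and the helper rendered() are
assumptions made for illustration. Only the two trimming rules are taken from
the description.

```python
import re


def display_text(elements, text, whitespace=False, invisible=False):
    """Simplified model of the trimming rules described in this report."""
    if whitespace:
        # Current element is whitespace: trim the trailing whitespace
        # of the previous (non-whitespace) element.
        if elements and not elements[-1]["whitespace"]:
            elements[-1]["text"] = elements[-1]["text"].rstrip()
    else:
        # Collapse runs of whitespace inside ordinary text.
        text = re.sub(r"\s+", " ", text)
        # Previous element is whitespace: trim the leading whitespace
        # of the current element.
        if elements and elements[-1]["whitespace"]:
            text = text.lstrip()
    elements.append(
        {"text": text, "whitespace": whitespace, "invisible": invisible}
    )


def rendered(elements, invisible):
    """Join the elements that belong to one context (hypothetical helper)."""
    return "".join(e["text"] for e in elements if e["invisible"] == invisible)


elements = []
display_text(elements, "first line\n", invisible=True)
display_text(elements, "\n", whitespace=True)  # whitespace is always visible
display_text(elements, " second line", invisible=True)

print(repr(rendered(elements, True)))   # 'first linesecond line'
print(repr(rendered(elements, False)))  # '\n'
```

The newline element lands in the visible context only, while both neighbouring
invisible elements lose their own whitespace, so the invisible text runs
together.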
Comment 1 Yanyan Yang 2008-01-02 12:12:46 UTC
Created attachment 4226 [details]
Test msg with invisible text
Comment 2 Yanyan Yang 2008-01-02 12:14:41 UTC
Created attachment 4227 [details]
the invisible text extracted

Note how in the invisible context some lines are juxtaposed directly together
without any whitespace between them, even though there is whitespace (e.g.
newline chars and in some cases spaces) in the original message.
Comment 3 Yanyan Yang 2008-01-02 12:16:20 UTC
Created attachment 4228 [details]
Proposed change to SpamAssassin HTML.pm

In display_text(), do not trim the trailing whitespace of the last element if
the last element is invisible text; do not trim the leading whitespace of the
current element if the current element is invisible.
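
The rule proposed in this comment can be sketched as follows. This is a
simplified Python model, not the attached patch and not the Perl code in
HTML.pm: the function name display_text_patched, the dict-based element layout
and the helper rendered() are hypothetical. Only the two "do not trim"
conditions come from the comment.

```python
import re


def display_text_patched(elements, text, whitespace=False, invisible=False):
    """Model of display_text() with the proposed invisible-text exceptions."""
    if whitespace:
        # Trim the previous element's trailing whitespace only if that
        # element is ordinary text and is NOT invisible.
        prev = elements[-1] if elements else None
        if prev and not prev["whitespace"] and not prev["invisible"]:
            prev["text"] = prev["text"].rstrip()
    else:
        # Collapse runs of whitespace inside ordinary text.
        text = re.sub(r"\s+", " ", text)
        # Trim leading whitespace only if the previous element is
        # whitespace and the current element is NOT invisible.
        if elements and elements[-1]["whitespace"] and not invisible:
            text = text.lstrip()
    elements.append(
        {"text": text, "whitespace": whitespace, "invisible": invisible}
    )


def rendered(elements, invisible):
    """Join the elements that belong to one context (hypothetical helper)."""
    return "".join(e["text"] for e in elements if e["invisible"] == invisible)


# Invisible text keeps the whitespace it carries itself.
hidden = []
display_text_patched(hidden, "first line\n", invisible=True)
display_text_patched(hidden, "\n", whitespace=True)
display_text_patched(hidden, "second line", invisible=True)
print(repr(rendered(hidden, True)))   # 'first line second line'

# Visible text is still trimmed around whitespace elements as before.
shown = []
display_text_patched(shown, "first line\n")
display_text_patched(shown, "\n", whitespace=True)
display_text_patched(shown, " second line")
print(repr(rendered(shown, False)))   # 'first line\nsecond line'
```

In this model the visible output is unchanged, and the invisible context keeps
a separator between the two lines.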
Comment 4 Justin Mason 2008-01-02 15:07:31 UTC
thanks for the report and fix!  at first glance, it looks good;
aiming at 3.2.5.
Comment 5 Justin Mason 2008-01-08 14:34:44 UTC
checked into trunk:

: jm 419...; svn commit -m "bug 5763: whitespace characters are dropped from
'invisible' text sections.  fix, thanks to Yanyan Yang" 
lib/Mail/SpamAssassin/HTML.pm
Sending        lib/Mail/SpamAssassin/HTML.pm
Transmitting file data .
Committed revision 610204.

+1 for application to 3.2.5.
Comment 6 Sidney Markowitz 2008-02-28 11:30:29 UTC
+1
Comment 7 Justin Mason 2008-06-01 03:37:16 UTC
moving to 3.2.6 so that we can release a 3.2.5
Comment 8 Darxus 2011-05-23 16:42:41 UTC
This got committed to trunk when 3.2.5 was current, so it should be in 3.3, and should be closed, right?

3 years since last update.
Comment 9 Darxus 2011-05-23 16:48:32 UTC
Close.

Yup, 3.3 was branched from trunk January 21 2010, two years after this was committed to trunk, so this should be closed.

http://svn.apache.org/viewvc/spamassassin/branches/3.3/?view=log
Comment 10 Mark Martinec 2011-05-23 17:51:15 UTC
closing, fixed in 3.3.0