Bug 24611

Summary: Scrape doesn't support pages encoded in other charsets (like ISO-8859-1)
Product: Taglibs
Reporter: Ricardo Caetano <ricardo.caetano>
Component: Scrape Taglib
Assignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P3    
Version: 1.1   
Target Milestone: ---   
Hardware: Other   
OS: other   
Attachments: Zip file with a test case

Description Ricardo Caetano 2003-11-11 16:55:50 UTC
If a remote page is encoded in a charset other than the default, some 
characters are changed.

The following line in streamtochararray:
InputStreamReader input = new InputStreamReader(in); 
Could be changed to:
InputStreamReader input = new InputStreamReader(in, <charset>); 

The parameter <charset> could be passed in the <scrape> tag.
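
A minimal sketch of the suggested change, for illustration only. The class name, the method signature and the null fallback are assumptions and are not taken from the Scrape taglib source; only the two InputStreamReader constructors come from the description above.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class CharsetAwareReader {

    // Hypothetical stand-in for streamtochararray: reads the whole stream
    // into a char array, decoding with the given charset name.
    // A null charset falls back to the platform default (the old behaviour).
    static char[] streamToCharArray(InputStream in, String charset) throws IOException {
        Reader input = (charset == null)
                ? new InputStreamReader(in)
                : new InputStreamReader(in, charset); // throws UnsupportedEncodingException for unknown names

        CharArrayWriter out = new CharArrayWriter();
        char[] buffer = new char[4096];
        int read;
        while ((read = input.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toCharArray();
    }
}
```

With something like this, a charset attribute on the <scrape> tag could simply be handed through as the second argument.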
Comment 1 Felipe Leme 2004-02-29 15:05:24 UTC
Hi Ricardo,

Could you please provide a test case for this bug (like 2 JSP pages, the one
with the different charset and the other that "scrapes" that one)? 

Regards,

Felipe
Comment 2 Ricardo Caetano 2004-03-01 15:36:43 UTC
Just try a page with some characters like: 

"informações" which become: "informações"
"reunião" which become: "reunião"
"reuniões" which become: "reuniões"
"próximas" which become: "próximas".

This bug is most visible when the code page of the "scraper" machine is 
different from the code page of the "scraped" machine.
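
A small sketch of how such a mismatch corrupts text. It assumes, purely for illustration, that the scraped page is served as UTF-8 and decoded as ISO-8859-1; the actual pair of code pages in this report is not stated. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Decodes raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "informa\u00e7\u00f5es"; // "informações"

        // Assumption: the scraped machine sends the page as UTF-8 bytes.
        byte[] served = original.getBytes(StandardCharsets.UTF_8);

        // The scraping side decodes with a different charset, then with the right one.
        String wrong = decode(served, StandardCharsets.ISO_8859_1);
        String right = decode(served, StandardCharsets.UTF_8);

        // Each accented letter is two bytes in UTF-8, so the wrong decoding
        // produces two characters where there was one.
        System.out.println("original length: " + original.length());
        System.out.println("wrong-charset length: " + wrong.length());
        System.out.println("matches with wrong charset: " + original.equals(wrong));
        System.out.println("matches with right charset: " + original.equals(right));
    }
}
```

The lengths are printed instead of the strings so the result does not depend on the console encoding.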
Comment 3 Felipe Leme 2004-03-24 04:16:37 UTC
Created attachment 10939
Zip file with a test case
Comment 4 Felipe Leme 2004-03-24 04:19:14 UTC
I committed your suggestion - it should be available in the next nightly build.
Comment 5 Felipe Leme 2004-03-25 00:25:39 UTC
Ricardo,

Could you please try the new tag on the nightly build below:

http://cvs.apache.org/builds/jakarta-taglibs/nightly/projects/scrape/jakarta-taglibs-scrape-20040324.zip

Thanks,

Felipe
Comment 6 Felipe Leme 2004-03-29 15:38:17 UTC
Marking as fixed...