Bug 60769

Summary:	Problem with Jsp character encoding configuration
Product:	Tomcat 8	Reporter:	Lazar Kirchev <lazar.kirchev>
Component:	Jasper	Assignee:	Tomcat Developers Mailing List <dev>
Status:	RESOLVED FIXED
Severity:	normal
Priority:	P2
Version:	8.5.11
Target Milestone:	----
Hardware:	PC
OS:	All
Attachments:	The two sample applications reproducing the problems; A sample reproducing the problem with exotic encoding; Sample war with jspx in exotic encoding; Correct war for reproducing the exotic encoding problem

Description Lazar Kirchev 2017-02-23 08:16:44 UTC

Created attachment 34775 [details]
The two sample applications reproducing the problems

On Tomcat 8.5.11, the attached sample.war declares UTF-8 through a JSP configuration element in web.xml, while test.jspx declares Windows-1252 in its XML prolog. Requesting test.jspx does not throw an exception, and the jspx content is displayed. index.jsp, which also declares an encoding different from the one in the JSP configuration, throws an exception as expected. I would expect both index.jsp and test.jspx from sample.war to throw an exception, as they do on Tomcat 8.5.5, for example.

Also, if a jspx file declares the same encoding (in my case Windows-1252) in both the XML prolog and the pageEncoding attribute of the page directive, I get the error message:
"Page-encoding specified in XML prolog (UTF-8) is different from that specified in page directive (WINDOWS-1252)", while on Tomcat 8.5.5 I get no error and the page is displayed. This is the enctext.jspx file in sample1.war.

These behaviors are probably due to this change:

https://github.com/apache/tomcat85/commit/a03c5755a6fa2d9daa43abe357628f475230fdb2

If the two issues are unrelated I will open another bug report for the second one.

References to the relevant sections of the JSP 2.3 spec:

section 3.3.4 (Declaring page encodings):
"It is also a translation-time error to name different encodings in the prolog / text declaration of the document in XML syntax and in a JSP configuration element matching the document. It is legal to name the same encoding through multiple mechanisms."


section 4.1.2 (XML Syntax):
"It is a translation-time error to name different encodings in two or more of the following: the XML prolog / text declaration of a JSP document, the pageEncoding attribute of the page directive of the JSP document, and in a JSP configuration element whose URL pattern matches the document."
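The rule quoted from sections 3.3.4 and 4.1.2 can be illustrated with a minimal sketch. This is not Jasper's code: the class name PageEncodingCheckSketch, the method agreedEncoding and the choice to compare names through java.nio.charset.Charset (so that case differences and aliases count as the same encoding) are assumptions made for illustration only.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tomcat or Jasper.
final class PageEncodingCheckSketch {

    /**
     * Returns the canonical name of the encoding all declared sources agree on,
     * or null if no source declares one. Pass null for a source that is absent.
     * Throws IllegalStateException when two sources name different encodings.
     */
    static String agreedEncoding(String prolog, String pageDirective, String jspConfig) {
        Map<String, String> declared = new LinkedHashMap<>();
        if (prolog != null) declared.put("XML prolog", prolog);
        if (pageDirective != null) declared.put("page directive", pageDirective);
        if (jspConfig != null) declared.put("JSP configuration", jspConfig);

        String firstSource = null;
        Charset agreed = null;
        for (Map.Entry<String, String> entry : declared.entrySet()) {
            // Assumption: names are compared as charsets, so "utf-8" equals "UTF-8".
            Charset current = Charset.forName(entry.getValue());
            if (agreed == null) {
                agreed = current;
                firstSource = entry.getKey();
            } else if (!agreed.equals(current)) {
                throw new IllegalStateException("Encoding in " + entry.getKey()
                        + " (" + current.name() + ") differs from " + firstSource
                        + " (" + agreed.name() + ")");
            }
        }
        return agreed == null ? null : agreed.name();
    }
}
```

Under these assumptions, naming UTF-8 in the JSP configuration and ISO-8859-1 in the prolog raises an error, while naming the same encoding through several mechanisms is accepted.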

Regards,
Lazar

Comment 1 Mark Thomas 2017-03-01 21:00:57 UTC

Yes, there was a regression in the refactoring. The detected BOM encoding was incorrectly taking precedence over the prolog specified encoding (if any).

Thanks for the report and the test case.
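The precedence described in this comment, where an encoding named in the prolog wins over one inferred from the BOM, can be sketched as follows. This is an illustration and not the actual EncodingDetector code: the class and method names are hypothetical, only the UTF-8 and UTF-16 BOMs are recognised, and the UTF-8 fallback is an assumption.

```java
// Hypothetical helper, not the Tomcat implementation.
final class EncodingPrecedenceSketch {

    /** Returns the encoding implied by a leading BOM, or null if none is recognised. */
    static String detectBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // UTF-32 BOMs are deliberately ignored in this sketch
        }
        return null;
    }

    /** The prolog encoding, when present, takes precedence over the BOM encoding. */
    static String choose(String prologEncoding, String bomEncoding) {
        if (prologEncoding != null) {
            return prologEncoding;
        }
        if (bomEncoding != null) {
            return bomEncoding;
        }
        return "UTF-8"; // assumed default when neither source gives an answer
    }
}
```

The regression corresponds to checking bomEncoding before prologEncoding in choose.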

Comment 2 Lazar Kirchev 2017-04-12 15:51:08 UTC

Created attachment 34908 [details]
A sample reproducing the problem with exotic encoding

Comment 3 Lazar Kirchev 2017-04-12 15:56:05 UTC

Hello Mark,

I noticed that the second scenario still fails if the encoding is more exotic. I tried IBM871 (IBM EBCDIC, Icelandic).

I debugged a little and noticed that EncodingDetector.getPrologEncoding() returns null although an encoding attribute is specified in the prolog. The if statement on lines 67-73 of EncodingDetector then takes the second branch, as if no encoding were specified in the prolog.

I attach sample2.war, with which I reproduced it. It is essentially the same as sample1.war, only the encoding in enctest.jspx is IBM871.

Probably this is an issue with the XMLStreamReader?
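For anyone who wants to check the parser behaviour outside Tomcat, here is a minimal sketch of reading the declared encoding through the JRE's StAX API. It assumes the detector obtains the prolog encoding from an XMLStreamReader, as suspected above. The class name PrologEncodingSketch and the method declaredEncoding are hypothetical, and treating a parse failure as "no encoding" is an assumption of the sketch.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not the Tomcat implementation.
final class PrologEncodingSketch {

    /**
     * Returns the encoding named in the XML declaration of the stream,
     * or null if none is declared or the parser cannot read the start of the document.
     */
    static String declaredEncoding(InputStream in) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                // Encoding from the XML declaration, null when it declares none.
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Assumption of this sketch: a parse failure is reported as "no encoding".
            return null;
        }
    }
}
```

Feeding enctest.jspx from the attached war to declaredEncoding would show whether the parser reports IBM871 or null for that file.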

Comment 4 Mark Thomas 2017-04-13 19:39:55 UTC

I've done some further testing and fixed an unrelated bug, but as far as unusual encodings go, they have to be specified in the prolog; otherwise the JRE's XML parser doesn't have enough information to reliably determine the encoding.

Comment 5 Lazar Kirchev 2017-04-14 08:10:05 UTC

The content of the enctest.jspx is:

<?xml version="1.0" encoding="IBM871"?>

<html xmlns:jsp="http://java.sun.com/JSP/Page">
 <jsp:directive.page pageEncoding="IBM871" />
 <jsp:output omit-xml-declaration="no"/>
 <body>
 You should see this text.
 </body>
</html>

So there actually is an encoding attribute in the prolog. For some reason the JRE XML parser does not detect it correctly. On the other hand, the deprecated XMLEncodingDetector from before the refactoring, which parsed the files itself, correctly detects the encoding from the prolog. For example, with Tomcat 8.5.4 the sample works correctly.

I apologise that my second attachment is incorrect: by mistake I attached the second war from the first attachment instead of the problematic war with IBM871 encoding. I am now attaching the correct one, named encsample.war.

Comment 6 Lazar Kirchev 2017-04-14 08:10:56 UTC

Created attachment 34913 [details]
Sample war with jspx in exotic encoding

Comment 7 Lazar Kirchev 2017-04-14 09:35:18 UTC

Comment on attachment 34913 [details]
Sample war with jspx in exotic encoding

Invalid jspx file within.

Comment 8 Lazar Kirchev 2017-04-14 09:36:09 UTC

Created attachment 34914 [details]
Correct war for reproducing the exotic encoding problem

Comment 9 Mark Thomas 2017-04-19 08:56:28 UTC

The "unrelated bug" I fixed appears to have fixed the issue you were seeing. The fix is r1791298. If you can test with 9.0.x trunk or 8.5.x trunk to confirm, that would be great.

Comment 10 Lazar Kirchev 2017-04-19 09:30:45 UTC

Thanks Mark! I tried the fix from 8.5 trunk and it works.

I noticed something while debugging. It is probably not a problem, but I prefer to mention it:

In EncodingDetector's constructor, on line 61 (https://github.com/apache/tomcat85/blob/c29a2b45f57e481380d88a8fa0c6f4f0f242aca1/java/org/apache/jasper/compiler/EncodingDetector.java#L61), the buffered input stream is reset, but on the following lines the bytes that should be skipped are read from the initial input stream and not from the buffered input stream.

When the buffered input stream is reset, the underlying input stream is not reset and its position stays where it was, e.g. at 4. When the bytes to be skipped are then read from it, its position advances further, e.g. to 8. Is this intended?
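The behaviour in question can be reproduced with plain java.io classes. The sketch below is not the EncodingDetector code: the class name MarkResetSketch and the method probeThenReset are hypothetical, and a ByteArrayInputStream stands in for the JSP file's input stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demonstration, not the Tomcat implementation.
final class MarkResetSketch {

    /**
     * Probes the first bytes through a BufferedInputStream, resets the wrapper, and
     * returns { next byte read from the wrapper, bytes still available in the underlying stream }.
     */
    static int[] probeThenReset(byte[] data, int probeLength) throws IOException {
        InputStream underlying = new ByteArrayInputStream(data);
        BufferedInputStream buffered = new BufferedInputStream(underlying);

        buffered.mark(probeLength);
        byte[] probe = new byte[probeLength];
        buffered.read(probe, 0, probeLength); // fills the buffer from 'underlying'
        buffered.reset();                     // rewinds the wrapper only

        int nextFromWrapper = buffered.read();         // replayed from the mark
        int leftInUnderlying = underlying.available(); // underlying stream was not rewound
        return new int[] { nextFromWrapper, leftInUnderlying };
    }
}
```

After reset() the wrapper starts again at the first byte, while the underlying stream remains wherever the wrapper's reads left it. Skipping bytes on the underlying stream at that point therefore skips from the wrong position.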

Comment 11 Mark Thomas 2017-04-19 20:21:20 UTC

Good catch. That would be a bug. I'll get it fixed.