Bug 63985 - Tomcat 9 does not read UTF-8 files with no bom correctly
Status: RESOLVED WONTFIX
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina
Version: 9.0.x
Hardware: PC All
Importance: P2 normal
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
Reported: 2019-12-03 21:30 UTC by Hubert Gailly
Modified: 2022-12-14 11:41 UTC
Description Hubert Gailly 2019-12-03 21:30:01 UTC
Very simple to reproduce on Windows 10 or Windows Server 2012 (not tested on Linux).

Two identical HTML files are saved as UTF-8, one with a BOM and one without.

Tomcat 9 is started with -Dfile.encoding=UTF-8 (also tested with -Dfile.encoding=UTF8).

If the files are served by Apache, they are both correct. If the file with no BOM is read by Tomcat 9, it is apparently converted but still served as UTF-8, so accented characters display incorrectly in all browsers. There is no problem with the file saved with a BOM.
Comment 1 Mark Thomas 2019-12-04 17:57:12 UTC
Thanks for the report.

The issue is not how Tomcat reads the files. The bytes presented to the user agent are exactly the bytes that are on disk.

The issue is that the Default Servlet does not add a content-type and charset.

In the BOM case the user agent reads the BOM and renders the bytes as UTF-8.
In the non-BOM case the user agent renders the bytes as ISO-8859-1.

The immediate solution is to add:
<head>
<meta charset="utf-8"/>
</head>
to the HTML page with no BOM. This allows the user agent to do the right thing.
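
To illustrate the two cases above, here is a minimal Java sketch (not Tomcat or browser code) of a client that, when no charset is declared anywhere, decodes as UTF-8 only if it finds a BOM and otherwise falls back to ISO-8859-1. The class and method names are hypothetical, and real user agents apply more elaborate detection rules.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // The UTF-8 byte order mark: U+FEFF encoded as UTF-8.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the data starts with the three UTF-8 BOM bytes. */
    public static boolean hasUtf8Bom(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a copy of the data with the UTF-8 BOM prepended. */
    public static byte[] prependBom(byte[] data) {
        byte[] out = new byte[UTF8_BOM.length + data.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(data, 0, out, UTF8_BOM.length, data.length);
        return out;
    }

    /**
     * Assumed client behaviour when no charset is declared:
     * UTF-8 if a BOM is present (the BOM itself is skipped), else ISO-8859-1.
     */
    public static String decodeLikeClient(byte[] data) {
        if (hasUtf8Bom(data)) {
            return new String(data, UTF8_BOM.length, data.length - UTF8_BOM.length,
                    StandardCharsets.UTF_8);
        }
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final letter.
        byte[] noBom = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = prependBom(noBom);

        // With a BOM the text survives intact.
        System.out.println(decodeLikeClient(withBom).equals("caf\u00e9"));
        // Without one, the two UTF-8 bytes of the accented letter become two characters.
        System.out.println(decodeLikeClient(noBom).equals("caf\u00c3\u00a9"));
    }
}
```

The server sends the same content bytes in both cases; only the client's choice of decoder differs, which is why declaring the charset in the file or in the response header resolves it.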

Whether Tomcat should do anything about this is debatable. This is probably a discussion for the dev list. I'm going to resolve this as WONTFIX but I'll try and set out the key points here as a starter for a dev list discussion, should one be required.

This only applies to text files served by the default servlet.

There are multiple encodings in play here:
1. The encoding the text file has been saved with.
2. The encoding declared within the file (if any).
3. The fileEncoding init param for the Default servlet.
4. The default encoding configured for the web application (if any).
5. The encoding of the resource the static resource is being included in (if this is an include).
6. The default character encoding (ISO-8859-1) as defined by the Servlet spec.
7. Any explicit encoding declared for the request (e.g. by a filter)

The various encodings above are not always consistent. In this instance user agents will generally prioritise explicit encodings in the HTTP headers, then encodings in the file.

Because 3 is configured per web application at the finest granularity (it is typically set per server) and multiple values for 1 within a single web application are fairly common, Tomcat tries to do as little as possible on the assumption the user agent will be able to figure out the right thing to do from the file in most cases. This is why it is a good idea to declare encodings in files where the file format supports this.

Experience to date is that it breaks more things than it fixes to have Tomcat set an explicit encoding. That may change at some point as everyone shifts to UTF-8 everywhere. I'm not sure we are there yet.

If all of the following are true, Tomcat will attempt to convert the bytes from the input file:
- The requested resource is a text file
- An explicit character encoding has been set for the response
- The explicit character encoding set is not the same as fileEncoding
In this case only, Tomcat reads the bytes from the file, converts them to characters using fileEncoding, converts those characters back to bytes using the explicitly declared encoding and then writes those bytes to the response.
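
The conversion described above can be sketched in plain Java as follows. This is an illustrative approximation under stated assumptions, not the Default Servlet's actual implementation: the class name, method name and parameters are hypothetical, and a null response encoding stands in for "no explicit character encoding has been set".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TranscodeSketch {
    /**
     * Returns the bytes to write to the response.
     *
     * @param fileBytes        raw bytes read from disk
     * @param isTextResource   whether the resource is a text file
     * @param fileEncoding     encoding assumed for the file (the fileEncoding setting)
     * @param responseEncoding explicit response encoding, or null if none was set
     */
    public static byte[] bytesForResponse(byte[] fileBytes, boolean isTextResource,
                                          Charset fileEncoding, Charset responseEncoding) {
        boolean mustConvert = isTextResource
                && responseEncoding != null
                && !responseEncoding.equals(fileEncoding);
        if (!mustConvert) {
            // Pass-through: the bytes on the wire are the bytes on disk.
            return fileBytes;
        }
        // bytes -> characters using fileEncoding, then characters -> bytes
        // using the explicitly declared response encoding.
        String chars = new String(fileBytes, fileEncoding);
        return chars.getBytes(responseEncoding);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // e with acute accent in ISO-8859-1

        byte[] converted = bytesForResponse(latin1, true,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(converted)); // two bytes: 0xC3 0xA9 as signed values

        byte[] untouched = bytesForResponse(latin1, true,
                StandardCharsets.ISO_8859_1, null);
        System.out.println(untouched == latin1); // same array, nothing converted
    }
}
```

In the scenario reported here no explicit response encoding is set, so the sketch takes the pass-through branch and the file's bytes reach the client unchanged.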

All of this is sufficiently complex that we have over 3,000 unit tests checking various combinations.

Given the above, another solution would be to use the AddDefaultCharsetFilter to set all *.html files to have UTF-8 explicitly set.

There might be a case to add an option to the default servlet to add an explicit encoding to text responses that don't have one. It would probably need to allow for:
- same as fileEncoding
- same as web application default response encoding
- explicit charset
But I do wonder how much stuff that would actually break rather than fix.
Comment 2 Hubert Gailly 2019-12-14 08:05:31 UTC
Thanks for the answer. I knew all this, but in my opinion there is a significant bug in Tomcat 9.

The same text is saved as UTF-8 in two separate files, one with a BOM and one without.
Everything is declared as UTF-8.
In Apache 'httpd.conf': AddDefaultCharset UTF-8
In Tomcat 'server.xml': <Connector port="8009" enableLookups="false" redirectPort="8443" protocol="AJP/1.3" URIEncoding="UTF-8"/>
In the file itself: <META content="text/html; charset=utf-8" http-equiv=Content-Type>

I tried all different configurations :
In both application and tomcat 'web.xml'
<init-param>
            <param-name>fileEcoding</param-name>
            <param-value>UTF8</param-value> 
</init-param> 
Or 
<init-param>
            <param-name>fileEcoding</param-name>
            <param-value>UTF-8</param-value> 
</init-param> 
And/Or 
Starting Tomcat 9 with
-Dfile.encoding=UT8
Or
-Dfile.encoding=UTF-8

Now, the response is correctly received as UTF-8 in both cases by the browser.

If served by Tomcat, the file with no BOM is corrupted. Accents are rubbish characters.
If I save the file as ISO-8859-1, it is correct.

That means that whatever I tell Tomcat 9, if there is a UTF-8 static file, Tomcat 9 always reads it as ISO-8859-1, thus breaking the characters.
There is no problem with the file with BOM.
Comment 3 Christopher Schultz 2019-12-18 16:42:41 UTC
(In reply to Hubert Gailly from comment #2)
> The same text is saved as UTF-8 in two separate files, one with a BOM and one without.
> Everything is declared as UTF-8.
> In Apache 'httpd.conf' : AddDefaultCharset UTF-8

So Apache httpd is also in the mix? Great. More opportunities for things to go wrong with the character set.

> In tomcat 'server.xml' :
> <Connector port="8009" enableLookups="false" redirectPort="8443"
> protocol="AJP/1.3" URIEncoding="UTF-8"/>

This setting (URIEncoding) has nothing to do with the character set used to encode a response.

> In the file itself : <META
> content="text/html; charset=utf-8" http-equiv=Content-Type>

I'm not sure if that needs to be quoted, but I would definitely quote it. It doesn't matter, as the response header Content-Type will override whatever the <meta/> tag says.

> I tried all different configurations :
> In both application and tomcat 'web.xml'
> <init-param>
>             <param-name>fileEcoding</param-name>
>             <param-value>UTF8</param-value> 
> </init-param> 
> Or 
> <init-param>
>             <param-name>fileEcoding</param-name>
>             <param-value>UTF-8</param-value> 
> </init-param> 

Which filter is this? CharacterSetEncodingFilter? If so, you have not configured it correctly, which is probably why it's not working.
http://tomcat.apache.org/tomcat-9.0-doc/config/filter.html#Add_Default_Character_Set_Filter

The init-param is spelled "encoding", not "fileEncoding" or "fileEcoding".

> And/Or 
> Starting Tomcat 9 with
> -Dfile.encoding=UT8
> Or
> -Dfile.encoding=UTF-8

These don't matter, either.

> If served by Tomcat the file with no BOM is corrupted. Accents are rubbish
> characters.
> I save the file as ISO-8859-1, it is correct.
> 
> That means that whatever I tell Tomcat 9, if there is a UTF-8 static file,
> Tomcat 9 always reads it as ISO-8859-1, thus breaking the characters.

Tomcat is not "reading" anything at all. It's taking bytes from the disk and placing them on the wire. It's the client which is interpreting the bytes as ISO-8859-1.

> There is no problem with the file with BOM.

Again, this is down to client behavior. Please move this discussion to the users' list.
Comment 4 Ivan 2022-12-14 11:35:47 UTC
I've run into a BOM problem using the pdf.js library.
I've described my problem on their issue tracker:
https://github.com/mozilla/pdf.js/issues/15790
but they were unable to help me.

Could you explain why other servers have no problem returning JS files without a BOM, but Tomcat (9.0.62 in my case) does?
Comment 5 Mark Thomas 2022-12-14 11:41:27 UTC
Please take your question to the users mailing list.