Bug 66141 - useBomIfPresent removes UTF-BOM without modifying HTTP Content-Length
Status: RESOLVED INVALID
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina
Version: 9.0.26
Hardware: Other Linux
Importance: P2 normal
Assignee: Tomcat Developers Mailing List
Keywords: RFC
Reported: 2022-06-23 09:00 UTC by Jano John Akim Franke
Modified: 2022-06-23 15:29 UTC
Attachments
script to compare filesystem to HTTP-transfer (configure $SRC and wget-URL) (1.13 KB, application/x-shellscript)
2022-06-23 09:00 UTC, Jano John Akim Franke
Description Jano John Akim Franke 2022-06-23 09:00:53 UTC
Created attachment 38325
script to compare filesystem to HTTP-transfer (configure $SRC and wget-URL)

The DefaultServlet modifies files in transit by removing the BOM, and it mishandles the resulting size for UTF-8 and UTF-16 BOMs, which leads to a transfer timeout. UTF-32 files are left intact.

When downloaded with wget, the resulting file has its last bytes duplicated because of the retry; the number of duplicated bytes equals the BOM size. E.g. with the 3-byte UTF-8 BOM, the content "TEST" becomes "TESTEST".
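
To make the reported symptom concrete, here is a minimal Java sketch of the retry arithmetic. It assumes the server advertises the on-disk size as Content-Length, sends the body without the BOM, and that the client then resumes with a Range request served from the raw file. It is an illustration only, not Tomcat or wget code; the class and method names (BomRetryDemo, downloadWithRetry) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class BomRetryDemo {

    /**
     * Simulates a resuming client. Assumption: the advertised length is the
     * on-disk size, but the first response omits the leading bomLength bytes.
     */
    static byte[] downloadWithRetry(byte[] fileOnDisk, int bomLength) {
        int advertisedLength = fileOnDisk.length; // assumed Content-Length
        ByteArrayOutputStream received = new ByteArrayOutputStream();

        // First attempt: the body arrives with the BOM stripped, then the
        // connection stalls because the client still expects bomLength bytes.
        received.write(fileOnDisk, bomLength, fileOnDisk.length - bomLength);

        // Retry: the client asks for "bytes=<offset>-<advertisedLength - 1>",
        // where offset is the number of bytes it already holds. The range is
        // served from the raw file, so the tail of the file is repeated.
        int offset = received.size();
        if (offset < advertisedLength) {
            received.write(fileOnDisk, offset, advertisedLength - offset);
        }
        return received.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'T', 'E', 'S', 'T'};
        byte[] result = downloadWithRetry(utf8, 3);
        System.out.println(new String(result, StandardCharsets.US_ASCII)); // prints "TESTEST"
    }
}
```

With a 2-byte UTF-16 BOM the same arithmetic repeats the last two bytes of the file, which matches the hex dumps in comment 1.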

It looks to me as if the Tomcat code at https://github.com/apache/tomcat/blob/6a667943c5da6b5d61ac6bec1d7c9de061e3217c/java/org/apache/catalina/servlets/DefaultServlet.java#L1051 does not set conversionRequired when the BOM is removed, so at https://github.com/apache/tomcat/blob/6a667943c5da6b5d61ac6bec1d7c9de061e3217c/java/org/apache/catalina/servlets/DefaultServlet.java#L1079 the 'Content-Length' is written before the BOM is stripped, leaving clients waiting for bytes that never arrive.

Additionally, why does UTF-32 work? Its code lacks the 'skip' call that all the other encodings have:

UTF-8 skips and returns:
https://github.com/apache/tomcat/blob/6a667943c5da6b5d61ac6bec1d7c9de061e3217c/java/org/apache/catalina/servlets/DefaultServlet.java#L1275

UTF-32 does not skip, just returns the encoding name:
https://github.com/apache/tomcat/blob/6a667943c5da6b5d61ac6bec1d7c9de061e3217c/java/org/apache/catalina/servlets/DefaultServlet.java#L1287

See the attached test script for Micro Focus ZENworks, which uses Tomcat. Micro Focus received this bug report as #02286060 "ZCM Webserver 2020.01 is not transparent to BOM and mishandling modified filesize" on -05-05 but declined to report it upstream on -06-21, stating:

> Our engineering team come back with the analyses. Looking from ZENworks
> perspective there is no functionality impact. It seems Tomcat is used for your
> own for purpose, where the issue is happening. For that reason the suggestion
> is that you should report this case/scenario to the tomcat team. In case it
> will fixed from the Tomcat side, with every major ZENworks update a new version
> of Tomcat will be consumed.
Comment 1 Jano John Akim Franke 2022-06-23 10:28:58 UTC
Example output of test-encoding.sh showing file transfer with retry resulting in modified file contents:

Contents UTF-8: ...TEST
1. try        : TEST    (timeout waiting for 3 bytes)
2. try        : EST     (Content-Range: bytes 4-6/7)
Result        : TESTEST

[...]
1c1
< 0000000: efbb bf54 4553 54                        ...TEST
---
> 0000000: 5445 5354 4553 54                        TESTEST
d42db618f4b78cea995329eb8d60b491  /opt/novell/zenworks/install/downloads/TEST/UTF-8.txt
2961d3c31fbd6d0abc36fa53d3565915  /tmp/UTF-8.txt
1c1
< 0000000: feff 0054 0045 0053 0054                 ...T.E.S.T
---
> 0000000: 0054 0045 0053 0054 0054                 .T.E.S.T.T
dcb86ac7739a5776eadcc5e5dedf94fa  /opt/novell/zenworks/install/downloads/TEST/UTF-16BE.txt
09a69b9d518abf314fb830236d27bdce  /tmp/UTF-16BE.txt
1c1
< 0000000: fffe 5400 4500 5300 5400                 ..T.E.S.T.
---
> 0000000: 5400 4500 5300 5400 5400                 T.E.S.T.T.
64343f295737c917fc57e52431c6f6de  /opt/novell/zenworks/install/downloads/TEST/UTF-16LE.txt
55a65357c74490ce68b42bcca6962951  /tmp/UTF-16LE.txt
Files /opt/novell/zenworks/install/downloads/TEST/UTF-32BE.txt and /tmp/UTF-32BE.txt are identical
Files /opt/novell/zenworks/install/downloads/TEST/UTF-32LE.txt and /tmp/UTF-32LE.txt are identical
Comment 2 Mark Thomas 2022-06-23 15:29:32 UTC
The provided test case passes.

The analysis has a couple of flaws.

1. UTF-32 does skip the BOM
   BOM processing reads up to 4 bytes from the InputStream.
   For BOMs less than 4 bytes long, the method has to handle skipping the
     correct number of bytes for the given BOM. This is what the skip method
     does.
   UTF-32 has a 4 byte BOM. Therefore, if a UTF-32 BOM is detected, the BOM
     has already been fully read (i.e. skipped) and no correction for a
     shorter BOM is required.

2. The DefaultServlet never both sets the Content-Length and removes the BOM
   The BOM is only removed if:
   - the content is included; or
   - conversion is required
   If conversion is required, the Content-Length is not explicitly set.
   The Content-Length may be explicitly set for an included resource but
     setContentLengthLong is a NO-OP for included resources.
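
The reasoning in point 1 can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the actual DefaultServlet code: the class and method names (BomSkipSketch, detectAndSkipBom, rewindTo) are hypothetical, and the stream is assumed to support mark/reset. It shows why a 4-byte UTF-32 BOM needs no extra skip once 4 bytes have been read, while shorter BOMs need the stream repositioned.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BomSkipSketch {

    /**
     * Returns the encoding name implied by a leading BOM, or null if none is
     * found. On return the stream is positioned just after the BOM (or at the
     * start if there is no BOM).
     */
    static String detectAndSkipBom(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[4];
        in.mark(head.length);
        int count = in.readNBytes(head, 0, head.length); // reads up to 4 bytes
        int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
        int b2 = head[2] & 0xFF, b3 = head[3] & 0xFF;

        // 4-byte BOMs: everything read so far is the BOM, so nothing to undo.
        // Checked first because the UTF-32LE BOM starts with the UTF-16LE BOM.
        if (count == 4) {
            if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
            if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        }
        // Shorter BOMs: too many bytes were consumed, so rewind and re-skip.
        if (count >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) { rewindTo(in, 3); return "UTF-8"; }
        if (count >= 2 && b0 == 0xFE && b1 == 0xFF) { rewindTo(in, 2); return "UTF-16BE"; }
        if (count >= 2 && b0 == 0xFF && b1 == 0xFE) { rewindTo(in, 2); return "UTF-16LE"; }

        rewindTo(in, 0); // no BOM: leave the content untouched
        return null;
    }

    /** Resets to the mark and consumes exactly bomLength bytes. */
    private static void rewindTo(InputStream in, int bomLength) throws IOException {
        in.reset();
        for (int i = 0; i < bomLength; i++) {
            in.read();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'T', 'E', 'S', 'T'};
        InputStream in = new ByteArrayInputStream(utf8);
        System.out.println(detectAndSkipBom(in) + ", next byte: " + (char) in.read());
    }
}
```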

If you can recreate this issue on a clean install of the latest release of a currently supported Tomcat version (10.1.0-M16, 10.0.22, 9.0.64 or 8.5.81 at the time of writing) then feel free to re-open this issue and provide the steps to recreate.