Summary: | useBomIfPresent removes UTF-BOM without modifying HTTP Content-Length | ||
---|---|---|---|
Product: | Tomcat 9 | Reporter: | Jano John Akim Franke <jano.franke> |
Component: | Catalina | Assignee: | Tomcat Developers Mailing List <dev> |
Status: | RESOLVED INVALID | ||
Severity: | normal | Keywords: | RFC |
Priority: | P2 | ||
Version: | 9.0.26 | ||
Target Milestone: | ----- | ||
Hardware: | Other | ||
OS: | Linux | ||
Attachments: | script to compare filesystem to HTTP-transfer (configure $SRC and wget-URL) |
Description
Jano John Akim Franke
2022-06-23 09:00:53 UTC
Example output of test-encoding.sh showing file transfer with retry resulting in modified file contents:

Contents UTF-8: ...TEST
1. try : TEST (timeout waiting for 3 bytes)
2. try : EST (Content-Range: bytes 4-6/7)
Result : TESTEST

[...]

1c1
< 0000000: efbb bf54 4553 54 ...TEST
---
> 0000000: 5445 5354 4553 54 TESTEST
d42db618f4b78cea995329eb8d60b491 /opt/novell/zenworks/install/downloads/TEST/UTF-8.txt
2961d3c31fbd6d0abc36fa53d3565915 /tmp/UTF-8.txt
1c1
< 0000000: feff 0054 0045 0053 0054 ...T.E.S.T
---
> 0000000: 0054 0045 0053 0054 0054 .T.E.S.T.T
dcb86ac7739a5776eadcc5e5dedf94fa /opt/novell/zenworks/install/downloads/TEST/UTF-16BE.txt
09a69b9d518abf314fb830236d27bdce /tmp/UTF-16BE.txt
1c1
< 0000000: fffe 5400 4500 5300 5400 ..T.E.S.T.
---
> 0000000: 5400 4500 5300 5400 5400 T.E.S.T.T.
64343f295737c917fc57e52431c6f6de /opt/novell/zenworks/install/downloads/TEST/UTF-16LE.txt
55a65357c74490ce68b42bcca6962951 /tmp/UTF-16LE.txt
Files /opt/novell/zenworks/install/downloads/TEST/UTF-32BE.txt and /tmp/UTF-32BE.txt are identical
Files /opt/novell/zenworks/install/downloads/TEST/UTF-32LE.txt and /tmp/UTF-32LE.txt are identical

The provided test case passes. The analysis has a couple of flaws.

1. UTF-32 does skip the BOM

Process BOM reads up to 4 bytes from the InputStream. For BOMs less than 4 bytes long, the method has to handle skipping the correct number of bytes for the given BOM. This is what the skip method does. UTF-32 has a 4 byte BOM. Therefore, if a UTF-32 BOM is detected, the BOM has already been fully read (i.e. skipped) and no correction for a shorter BOM is required.

2. The DefaultServlet never both sets the Content-Length and removes the BOM

The BOM is only removed if:
- the content is included; or
- conversion is required

If conversion is required, the Content-Length is not explicitly set. The Content-Length may be explicitly set for an included resource, but setContentLengthLong is a NO-OP for included resources.

If you can recreate this issue on a clean install of the latest release of a currently supported Tomcat version (10.1.0-M16, 10.0.22, 9.0.64 or 8.5.81 at the time of writing) then feel free to re-open this issue and provide the steps to recreate.
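
For readers following the BOM handling described in the comment above, here is a minimal sketch of the "read up to 4 bytes, then skip only the BOM" idea. It is not Tomcat's actual code: the class `BomSketch` and the methods `consumeBom` and `matches` are hypothetical names chosen for illustration, and the sketch assumes an InputStream that supports mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class BomSketch {

    // Hypothetical helper: detects a Unicode BOM at the start of the stream,
    // leaves the stream positioned just after it, and returns the charset
    // name (or null if no BOM was found, with the stream left at its start).
    public static String consumeBom(InputStream is) throws IOException {
        if (!is.markSupported()) {
            throw new IllegalArgumentException("mark/reset required");
        }
        byte[] head = new byte[4];
        is.mark(head.length);

        // Read up to 4 bytes; a single read() call may return fewer.
        int count = 0;
        while (count < head.length) {
            int n = is.read(head, count, head.length - count);
            if (n < 0) {
                break;
            }
            count += n;
        }

        String charset = null;
        int bomLength = 0;
        // 4-byte BOMs are checked first because UTF-32LE starts with the
        // same two bytes as UTF-16LE.
        if (count >= 4 && matches(head, 0x00, 0x00, 0xFE, 0xFF)) {
            charset = "UTF-32BE"; bomLength = 4;
        } else if (count >= 4 && matches(head, 0xFF, 0xFE, 0x00, 0x00)) {
            charset = "UTF-32LE"; bomLength = 4;
        } else if (count >= 3 && matches(head, 0xEF, 0xBB, 0xBF)) {
            charset = "UTF-8"; bomLength = 3;
        } else if (count >= 2 && matches(head, 0xFE, 0xFF)) {
            charset = "UTF-16BE"; bomLength = 2;
        } else if (count >= 2 && matches(head, 0xFF, 0xFE)) {
            charset = "UTF-16LE"; bomLength = 2;
        }

        // If more bytes were read than the BOM occupies, rewind and skip
        // only the BOM. A 4-byte BOM has already been consumed exactly,
        // so no correction is needed in that case.
        if (bomLength < count) {
            is.reset();
            for (int i = 0; i < bomLength; i++) {
                is.read();
            }
        }
        return charset;
    }

    private static boolean matches(byte[] data, int... expected) {
        for (int i = 0; i < expected.length; i++) {
            if ((data[i] & 0xFF) != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM followed by "TEST": 7 bytes in the file, 4 after the BOM.
        byte[] file = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'T', 'E', 'S', 'T'};
        ByteArrayInputStream in = new ByteArrayInputStream(file);
        String charset = consumeBom(in);
        System.out.println(charset + ": " + in.available() + " of " + file.length + " bytes remain");
    }
}
```

Running `main` prints `UTF-8: 4 of 7 bytes remain`. The sketch only shows that stripping a BOM makes the body shorter than the file on disk; whether a Content-Length header is sent alongside is governed by the conditions listed in the comment above, not by this code.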