54602 – B2CConverter character decode underflow leaves bytes in buffer

Bug 54602 - B2CConverter character decode underflow leaves bytes in buffer

Summary: B2CConverter character decode underflow leaves bytes in buffer

Status:	RESOLVED FIXED

Alias:	None

Product:	Tomcat 7
Classification:	Unclassified
Component:	Catalina (show other bugs)
Version:	7.0.33
Hardware:	PC Linux

Importance:	P2 normal (vote)
Target Milestone:	---
Assignee:	Tomcat Developers Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-02-22 20:06 UTC by Nate Clark
Modified:	2013-03-05 14:05 UTC (History)
CC List:	0 users

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Nate Clark 2013-02-22 20:06:36 UTC

If a request contains a uri that ends in a multi byte character which is missing a byte the extra bytes are left in a buffer which is reused by a future request.

The problem comes from two different things:
1) If B2CConverter tries to convert a ByteChunk which ends in a character underflow it does not convert that last character and it is left in a buffer in B2CConver.

2) The B2CConverter in org.apache.catalina.connector.Request is not recycled with the rest of the objects. It looks like this is done intentionally based on the comment above it's declaration.

The issue with B2CConverter can be demonstrated with the code below, which is a simplification of what is done inside of CoyoteAdapter.convertURI():

<code>
B2CConverter conv = new B2CConverter("UTF-8");
ByteChunk bc = new ByteChunk();
CharChunk cc = new CharChunk();
byte[] bytes = { 0x61, 0x62, 0x63, 0x64, -8, -69, -73, -77 };

bc.append(bytes, 0, bytes.length);
cc.allocate(bc.getLength(), -1);
conv.convert(bc, cc, cc.getBuffer().length - cc.getEnd());
System.out.println(cc);

cc.recycle();
bc.recycle();

bc.append(bytes, 0, bytes.length);
cc.allocate(bc.getLength(), -1);
conv.convert(bc, cc, cc.getBuffer().length - cc.getEnd());
System.out.println(cc);
</code>

If the B2CConverter was recycled with everything else in Request this would prevent a previous request from corrupting the next request but it doesn't fix the issue that a character is dropped from the initial decode.

I tried playing with B2CConverter and the only way I could get the convert to get everything was by ignoring the limit argument and reading to the end of the stream. It looks like tomcat8 is moving in this direction with the move to NIO, however I tried playing with the latest B2CConverter and was still seeing issues with the given byte sequence.

Comment 1 Mark Thomas 2013-02-28 01:17:25 UTC

I do see code that is meant to recycle the converter. Do you have a test case / can you write a Tomcat unit test that demonstrates that the converter isn't being recycled?

Incomplete byte sequences should result in a 400 response. I'll take a look at why this isn't happening.

Comment 2 Mark Thomas 2013-02-28 17:17:10 UTC

Part of the problem here is that the UTF-8 decoder should reject bytes 5-8 as an invalid sequence but doesn't. That is a JVM bug that needs to be reported to Oracle.

Given the widespread use of UTF-8 I suspect we'll need to look at using the stricter UTF-8 decoder we use in WebSocket more widely.

A second issue (that doesn't affect this report) is that the leftover byte array is not big enough for the decoders available in the JRE.

Comment 3 Nate Clark 2013-02-28 17:40:06 UTC

InputStreamReader defaults to replacement characters so it won't reject those characters just replace them with the replacement.

The underlying InputStreamReader holds on to those remaining bytes because it is waiting for the last byte or end of stream but because B2CCoverter doesn't try to read any more it never sees the the end of stream.

If B2CConver was changed to just read until ReadConvertor returned -1 then the bytes wouldn't be left behind and the those last 4 bytes would be replaced with the replacement character.

If you want invalid UTF-8 to be rejected you would have to update the constructor of ReadConvertor to call super which takes a CharsetDecoder which uses CodingErrorAction.REPORT, but that is a much bigger change.

I'll work on the test at some point, but I don't have the time currently to get that working.

Comment 4 Mark Thomas 2013-02-28 23:36:04 UTC

The original report was about URI processing. Now you are talking about request bodies.

There are multiple issues here. So far I have found / suspect:
a) Invalid sequences are not rejected quickly enough
b) Partial sequences are not rejected
c) There is no mechanism for a client of the API to indicate input is complete
d) The storage for bytes left over between calls is not bug enough for all charsets
e) Leftover bytes may remain between requests

a) is a JVM bug that we can work-around
b) is a consequence of c)
c) is now fixed in trunk but the callers of the API need to be reviewed
d) is fixed in trunk
e) needs further investigation

Also, the handling of incomplete/partial data may need to be different for URIs and request bodies. There is an argument for request bodies to be more tolerant.

Comment 5 Nate Clark 2013-03-01 00:37:47 UTC

Where do you think I started talking about request bodies?

org.apache.catalina.connector.Request uses a B2CConverter for URIDecoding the variable is called URIConverter. B2CConverter uses ReadConverter, which is a very thin wrapper around InputStreamReader, to do its conversions.InputStreamReader when constructed with just a Charset, like ReadConverter does, calls StreamDecoder.forInputStreamReader() to construct a StreamDecoder. StreamDecoder when not constructed with a CharsetDecoder uses CodingErrorAction.REPLACE for both malformed input and unmappable characters.

Comment 6 Remy Maucherat 2013-03-01 09:54:54 UTC

Yes, this is confusing to you probably. Your problem with URI processing and trunk was that I forgot to port a call to recycle, which likely caused a problem there [and you can ignore the rest, it doesn't concern you]. As for non trunk, the code is really a bit hopeless IMO.

Mark doesn't know the new bit of code in trunk, so that's what he's looking into, and he's apparently busy trying to convince himself it needs tons of improvements. Actually, only d) looks real (and nobody complained so far about it, not enough creative encoding use I guess). I found trying to deal with end of input, for example, does not add any value. That's why there's a isUnderflow flag, it will do the same thing in a more generic way.

Comment 7 Mark Thomas 2013-03-01 16:37:08 UTC

(In reply to comment #5)
> Where do you think I started talking about request bodies?

Sorry about the confusion. You were looking at 7.0.x and I was looking at trunk. When you mentioned an InputStream I started thinking request body as trunk doesn't use a Writer to do the conversion.

To Remy's point, I think there are a handful of bugs, inconsistences and potential improvements in both trunk and 7.0.x and there are likely to be differences between the two branches. Exactly what is a bug and what is an improvement is somewhat academic as I intend to address all of them.

My plan is to write some more test cases for trunk, get them to pass in trunk, back-port the tests to 7.0.x and then get the tests passing in 7.0.x.

Comment 8 Mark Thomas 2013-03-05 10:23:59 UTC

Thanks again for this bug report. It promoted me to take a much closer look at UTF-8 decoding and I found a number of edge cases in both URI processing and request body processing.

trunk is now using the Apache Harmony based UTF-8 decoder for all UTF-8 bytes to chars conversion including URIs and request bodies. The test cases have also been expanded considerably.

The next step is to copy the test cases to 7.0.x and then review. I suspect the fix will involve porting Remy's new B2CConverter to 7.0.x.

Comment 9 Mark Thomas 2013-03-05 14:05:47 UTC

I have ported Remy's improved converter to 7.0.x and have also switched to the new UTF-8 decoder. The test cases have also been ported and all pass.