Bug 64202

Summary: Upload is broken in version 9.0.31
Product: Tomcat 9 Reporter: ishaigorodsky
Component: CatalinaAssignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: regression CC: mseele, ron
Priority: P2    
Version: 9.0.31   
Target Milestone: -----   
Hardware: Other   
OS: Linux   
Attachments: The sample example that demonstrates the breakage
The file known to trigger an issue

Description ishaigorodsky 2020-03-07 02:59:36 UTC
Created attachment 37067 [details]
The sample example that demonstrates the breakage

Jersey Multi-part upload is broken when using Mac OS 10.14.6 browsers and CenOS 6/Ubuntu 16 server with Tomcat version 9.0.31 over HTTPS for binary content. 
The upload is working with Tomcat version 9.0.30 for the same client and server.
To reproduce, unzip the sample attached and follow the steps.
The upload is known to work:
* locally
* over HTTP
* for text content such as CSV and JSON
Comment 1 ishaigorodsky 2020-03-07 03:08:18 UTC
The upload fails with 400 status code error in org.jvnet.mimepull.MIMEParser.
Depending on the browser used it can be:
* Reached EOF, but there is no closing MIME boundary.
* Missing start boundary
Comment 2 Michael Osipov 2020-03-07 09:00:56 UTC
Do you think you could provide a wire dump of the multipart message?
Comment 3 Mark Thomas 2020-03-07 10:53:51 UTC
I can't repeat this with the provided sample application. I suspect you are hitting an edge case somewhere in the I/O code.

Please can you provide:
- exact version of browser used
- exact version of Java used
- a file known to trigger the issue.
Comment 4 Mark Thomas 2020-03-07 11:10:21 UTC
If you could also test the current 9.0.32 release candidate that would be useful as well.
Comment 5 Remy Maucherat 2020-03-07 14:40:11 UTC
I suppose we can leave it for now but it looks like an obvious duplicate of 64195.
Comment 6 ishaigorodsky 2020-03-07 23:00:58 UTC
Created attachment 37068 [details]
The file known to trigger an issue
Comment 7 ishaigorodsky 2020-03-07 23:04:44 UTC
The issue happens with the following browsers on Mac OS Mojave 10.14.6 (18G3020):
* Firefox 73.0.1 (64-bit)
* Chrome 80.0.3987.122 (Official Build) (64-bit)
and the following JDK versions
* CentOS 7 (Core) openjdk 11.0.6 2020-01-14
* Ubuntu 16.04.6 LTS (Xenial Xerus) openjdk 11.0.1 2018-10-16
Let me get wire dump or curl to reproduce
Comment 8 Michael Osipov 2020-03-07 23:20:30 UTC
(In reply to ishaigorodsky from comment #6)
> Created attachment 37068 [details]
> The file known to trigger an issue

That a PNG and not a multipart request. Are you certain?
Comment 9 Mark Thomas 2020-03-09 15:00:49 UTC
I can now recreate the issue using 9.0.31 and the sample application. Using separate machines for client and server appears to be key. Exact Java / OS / browser versions seem less important at this stage. Possibly packet size or latency related.

I'm working on isolating the root cause.
Comment 10 ishaigorodsky 2020-03-09 15:30:32 UTC
Switching to Http11Nio2Protocol as suggested in the other ticket fixes the issue.
Comment 11 Mark Thomas 2020-03-09 15:56:32 UTC
*** Bug 64195 has been marked as a duplicate of this bug. ***
Comment 12 Mark Thomas 2020-03-09 16:03:29 UTC
I can confirm that this issue is fixed in 9.0.32.

Further the commit that fixed this was, as we suspected:
https://github.com/apache/tomcat/commit/6e60713c75141bc00f03f08f759df993a6416c71

I'm leaving this open for now as I want to dig into this a little further to see what the root cause was.
Comment 13 Mark Thomas 2020-03-09 16:11:56 UTC
Great. It is timing related. If I enable TLS debug logging the problem is no longer reproducible.
Comment 14 Remy Maucherat 2020-03-09 16:16:20 UTC
(In reply to Mark Thomas from comment #12)
> I can confirm that this issue is fixed in 9.0.32.
> 
> Further the commit that fixed this was, as we suspected:
> https://github.com/apache/tomcat/commit/
> 6e60713c75141bc00f03f08f759df993a6416c71
> 
> I'm leaving this open for now as I want to dig into this a little further to
> see what the root cause was.

Ok, can you see if the latch version of the same thing works better ? [the now removed block poller used that]

It would be good to have a "reliable" POST test case since this is an often reported issue and it's not practical to test it manually.
Comment 15 Mark Thomas 2020-03-09 19:11:17 UTC
Found the issue.

The read needs to be in a loop.

A TLS Network read can result in zero application bytes. Therefore you have to keep reading until you get a timeout, EOF or data. The original patch read a second time but did not handle the case of that read also resulting in zero byte. The current code also keeps track of the remaining read timeout int he case of multiple reads which adds a little complexity.

I haven't looked a write. It is possible a similar issue exists there.

Marking this as FIXED since the commit that triggered this has been reverted.
Comment 16 Remy Maucherat 2020-03-09 20:05:14 UTC
Thanks ! So annoying ... I was 100% certain the problem was some thread safety problem, so since I couldn't reproduce I got nowhere. I'll revert the change to use a latch since it is likely useless.

No problem for write, it already loops since it must write the whole buffer.

I probably won't make a change to keep track of the overall timeout at this time, the timeout is the IO timeout and something did happen. So this would be a deviation from the 9 behavior but it's not necessarily a bad thing. Up for discussion I guess ... If the timeout becomes the overall timeout then the write must be modified as well.
Comment 17 Remy Maucherat 2020-03-11 17:12:20 UTC
*** Bug 64218 has been marked as a duplicate of this bug. ***