Bug 56119 - File uploads fail every other attempt using timers
Summary: File uploads fail every other attempt using timers
Status: RESOLVED FIXED
Alias: None
Product: JMeter
Classification: Unclassified
Component: Main
Version: 2.11
Hardware: All
OS: All
Importance: P2 major with 1 vote
Target Milestone: ---
Assignee: JMeter issues mailing list
URL:
Keywords:
Duplicates: 56118 56417
Depends on:
Blocks:
 
Reported: 2014-02-07 17:09 UTC by oliver lloyd
Modified: 2015-06-02 09:48 UTC



Attachments:
Screenshot of issue (376.61 KB, image/png), added 2014-02-07 17:12 UTC by oliver lloyd
jmx (36.35 KB, text/plain), added 2014-02-07 17:15 UTC by oliver lloyd

Description oliver lloyd 2014-02-07 17:09:52 UTC
This is a fairly intractable problem, but I thought I'd raise it in case it matches something you've already seen.

In the attached plan, if I run it without a timer, every request is successful. However, if I enable the timer, the first upload request works but the second fails, then the third works, the fourth fails, and so on.

The only difference is enabling and disabling the timer.

Unfortunately, I can't actually give you a working jmx - the attached will always fail but hopefully it is still useful.

Failure is:
org.apache.http.NoHttpResponseException: The target server failed to respond

Success:
204 - blank response


The type of timer makes no difference.

The value in the timer DOES make a difference: 100ms and above = problem, 30ms = no problem.

The request I am making to S3 is valid; were it not, I would get an error response. In this case I get NoHttpResponseException.
Comment 1 oliver lloyd 2014-02-07 17:12:03 UTC
Created attachment 31292 [details]
Screenshot of issue
Comment 2 oliver lloyd 2014-02-07 17:13:35 UTC
I should also have stated that running this test with multiple threads does not change the per thread behaviour. 

Thread 1 > works, then fails, then works, then fails...
Thread 2 > works, then fails, then works, then fails...
Thread 3 > works, then fails, then works, then fails...
etc...
Comment 3 oliver lloyd 2014-02-07 17:15:10 UTC
Created attachment 31293 [details]
jmx
Comment 4 Philippe Mouawad 2014-02-08 17:56:39 UTC
*** Bug 56118 has been marked as a duplicate of this bug. ***
Comment 5 Philippe Mouawad 2014-02-08 22:22:55 UTC
Could you clarify which timer makes the test fail?
Is it the Constant Timer or the Gaussian Random Timer?

Why does the last request use a different server than the others?

Could it be that activating an additional timer (if that is the case; see my first question) randomly leads to a timeout on the created access key?

Reading the S3 docs, though, I don't see anything like this.
Comment 6 oliver lloyd 2014-02-09 10:48:28 UTC
It fails with both timers.

The last POST to actually upload the file is different because the server (our server) only gives the client/browser some security tokens; we do not receive the uploaded file. The actual file transfer goes from the browser directly to Amazon S3, and this is the last request.

See: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingHTTPPOST.html

There is an expiration for the credentials we pass back to the browser but this is set to 5 minutes.

See: http://docs.aws.amazon.com/AmazonS3/latest/dev/HTTPPOSTForms.html#HTTPPOSTExpiration

At first glance, I can see how this looks like a config issue with the request and, to be fair, it probably is. But it's weird that it relates to timing like this and can cause a null response, so I was curious whether there was anything you could think of.

But it's quite a minor problem: this will not occur in real life and we can easily work around it for testing.
Comment 7 Philippe Mouawad 2014-02-09 11:28:43 UTC
Thanks for the feedback.
The issue does not seem to be related to JMeter.
Closing, but I would be very interested in an explanation of the problem if you find one.
Comment 8 oliver lloyd 2014-02-14 11:08:31 UTC
We're now seeing this problem in other test plans, where an external request fails every second attempt. This is not related to uploads, but it does appear to relate to calls to external servers.

We ran with 2.9 and 2.10 and did not see the problem; it is specific to 2.11.

If you contact me at oliver.lloyd@news.co.uk I can provide you with a testplan to replicate.
Comment 9 Philippe Mouawad 2014-02-14 14:17:10 UTC
Hello,
After an exchange with Oliver, it seems these 2 settings improve things:
1) uncomment hc.parameters.file=hc.parameters in user.properties
2) set this in hc.parameters: http.connection.stalecheck$Boolean=true


Answer from Oliver:
"That improves things a lot. But not 100% - Now we see 1 in 10 requests fail (instead of 1 in 2)."
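For reference, the two edits described above look like this (file locations assume JMeter's bin/ directory, where these files normally live):

```
# user.properties (in JMeter's bin/ directory)
# Uncomment so JMeter reads HttpClient parameters from hc.parameters:
hc.parameters.file=hc.parameters

# hc.parameters (same directory)
# Re-enable the stale-connection check before each request:
http.connection.stalecheck$Boolean=true
```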
Comment 10 Philippe Mouawad 2014-02-15 13:54:10 UTC
@Oliver,
Regarding the remaining 10% failure, if it's confirmed, are you sure you don't get it with 2.9 and 2.10?

I think it might come from this behaviour of S3.
- http://aws.amazon.com/articles/1904


>"Also, don't overuse a connection. Amazon S3 will accept up to 100 requests before it closes a connection (resulting in 'connection reset'). Rather than having this happen, use a connection for 80-90 requests before closing and re-opening a new connection."

In this particular case, the settings change highly impacts the test, as you will get a lot of stale connections.


We may need to add the ability to discard a connection after a certain number of requests.

Finally, I am not sure we should roll back this change; maybe just document it better?
Comment 11 Sebb 2014-02-15 15:27:24 UTC
Also, note that the "Use keepalive" option can be deselected to force the next sampler to use a new connection.
Comment 12 oliver lloyd 2014-02-15 21:48:42 UTC
No, we only see failures with 2.11. We also see the same issue with other test plans that go out to other external endpoints, not just S3. Plus, we start to get failures early on, after just 10 or 20 requests.

The common factors are external calls, 2.11, and the use of some sort of timer; if we remove all delays between requests, then we do not get failures.

Sorry, it is hard to give a more reliable description for reproducing the problem. I will see if I can put something together next week to help.
Comment 13 Sebb 2014-02-15 22:08:07 UTC
Do you see the same behaviour if you use HttpClient 3.1 as the HTTP implementation?
Comment 14 Philippe Mouawad 2014-02-16 12:51:01 UTC
@Oliver, after changing the 2 following settings:
1) uncomment hc.parameters.file=hc.parameters in user.properties
2) Set this in hc.parameters: http.connection.stalecheck$Boolean=true

You should not see any difference between 2.10 and 2.11 as AFAIK there is no other change in HTTP stack that could explain it.

Do you confirm you see differences between 2.10 and 2.11 ?
or is it between 2.9 and 2.11 ?

In 2.9, the retry count for the HttpClient 3.1 and HttpClient 4.x samplers was changed to 0. Previously the default was 1; this could also explain differences in your use case.
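If you want to experiment with restoring the old retry behaviour, JMeter exposes retry-count properties for both implementations (property names as in recent JMeter releases; verify them against the jmeter.properties shipped with your version):

```
# user.properties (or jmeter.properties)
# Number of times HttpClient will retry a failed request.
# Default is 0 since JMeter 2.9; 1 restores the pre-2.9 behaviour.
httpclient4.retrycount=1
httpclient3.retrycount=1
```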
Comment 15 oliver lloyd 2014-02-18 15:27:24 UTC
Okay, so coming back to this today, I was not able to reproduce the problem (failed external calls) when using the settings advised by Philippe. We had previously seen lower failure rates of around 5%-10%, but it looks like that was an unfortunate coincidence. Apologies for that.

So, yes, the following changes:

1) uncomment hc.parameters.file=hc.parameters in user.properties
2) Set this in hc.parameters: http.connection.stalecheck$Boolean=true

work around the problem of every other external request failing in 2.11.
Comment 16 Philippe Mouawad 2014-02-18 20:07:48 UTC
Many thanks Oliver for your feedback.

In the end, do you consider this a bug?
I don't, as the current settings work for websites which do not have S3's way of invalidating connections after N requests.
But I would be interested in your opinion.

To the JMeter team: what's your opinion?
Should we roll back this setting change or not?
At least we should document it, but where? Wiki or component reference?
Comment 17 oliver lloyd 2014-02-19 10:50:11 UTC
My feeling is that this should be considered a bug. 

The two cases where we recently saw this problem were when going out to S3 and also when sending requests to Zuora, a payment provider. But I can think of several other scenarios where a web app might need to make a call to a third party and where this call might need to be included in a testplan; in fact, we do this all the time and I don't think we're unusual in this.
Comment 18 Sebb 2014-02-19 11:11:46 UTC
On the face of it, I don't think it is a bug - perhaps an enhancement request.

I'm not sure I fully understand how the timer interacts with the staleCheck setting, unless there is some S3 setting that disallows long-lasting sessions?

I think further tests are needed to establish exactly how enabling staleCheck prevents the problem.

It would be interesting to know if disabling keep-alive solved the issue in the absence of staleCheck.

Ideally we need a test server (either public or easy to set up locally) that shows the same behaviour as being reported here.

JMeter already supports closing a connection, i.e. forcing the next sample to use a new connection. It looks like the problem may be that the connection is expiring somehow. If so, there might be a need to provide automated connection closing based on reuse count and/or idle time.

But first we need to know if that is really what the problem is.
Comment 19 oliver lloyd 2014-02-19 11:32:28 UTC
2.11, Use KeepAlive OFF: 100% success
2.11, HttpClient 3.1: 100% success
2.11, default: 50% failures
2.9, default: 100% success
2.10, default: 100% success

Note: this is a test NOT using S3, but another, different, external call.
Comment 20 Sebb 2014-02-19 11:48:28 UTC
Thanks. That shows the issue is something to do with connection re-use.

However I don't understand why dropping the staleCheck should affect 50% of requests.

Or indeed why HC 3.1 should be 100% successful, but perhaps it does retries - I don't know without looking.

I've just had another look at the error in the screenshot, which shows "Target Server failed to respond", yet the elapsed time seems to be close to zero.

This suggests that the request may be failing before it gets to the server.

Is there a server against which we can test?

Or can you run a very short test showing the errors with full wire logging (or using Wireshark)?
Comment 21 Sebb 2014-02-19 17:51:03 UTC
Testing with the privately provided test plan shows that the problem occurs on one of the hosts only.

This host only occurs once in the plan, which means that its connection is relatively idle compared with the others.
When you add in all the other samples and delays, the elapsed time between samples of this particular host is about 60 seconds.

I just tried a simple GET on the same host, and that also fails when the elapsed time between samples is around 15 secs or more (10 secs seems to be OK)

I think this is the immediate cause of the issue.

The staleCheck option adds an extra check just before the real sample. This causes the old connection to be dropped by the server, but the stale check code does not report this: it just creates a new connection, which then works because it has not been idle.

That is why disabling Keep-Alive works - there is no attempt to reuse the connection which has been idle for a while.

The workaround - if one does not want to use staleCheck for everything - is to disable keep-alive for samples that are likely to trigger the idle timeout.

We need to have a dev@ discussion on what we might want to do to fix this in JMeter. I'll start that off shortly.
Comment 22 Sebb 2014-02-19 19:55:16 UTC
Just discovered something else.
I tried testing timeouts against the ASF server.

Even if the idle timeout is exceeded, the samples don't fail.
Debug shows that this is because the ASF server returns the following header:

Keep-Alive: timeout=5, max=100

This is detected by the HC 4.x code:

2014/02/19 19:37:47:619 GMT [DEBUG] headers - << Keep-Alive: timeout=5, max=100
2014/02/19 19:37:47:619 GMT [DEBUG] headers - << Connection: Keep-Alive
2014/02/19 19:37:47:619 GMT [DEBUG] headers - << Content-Type: text/html; charset=utf-8
2014/02/19 19:37:47:697 GMT [DEBUG] BasicClientConnectionManager - Connection can be kept alive for 5000 MILLISECONDS
2014/02/19 19:39:17:712 GMT [DEBUG] BasicClientConnectionManager - Get connection for route {s}->https://www.apache.org
2014/02/19 19:39:17:712 GMT [DEBUG] BasicClientConnectionManager - Connection [id:0][route:{s}->https://www.apache.org][state:null] expired @ Wed Feb 19 19:37:52 GMT 2014
2014/02/19 19:39:17:712 GMT [DEBUG] DefaultClientConnection - Connection 0.0.0.0:4368<->192.87.106.229:443 closed
2014/02/19 19:39:17:712 GMT [DEBUG] DefaultClientConnectionOperator - Connecting to www.apache.org:443

Note that the connection is automatically dropped by the connection manager.

I have just reviewed the logs from the private test plan, and these don't return a Keep-Alive header, so the HC code assumes the connection is not timed out. The log shows the following:

BasicClientConnectionManager - Connection can be kept alive indefinitely

This means that the HC code does not know when to drop the connection.

I think it is probably a configuration error on the server; it really ought to tell the client what the keep-alive period is. It probably explains why this issue has not been reported by others as well.

It gives a possible solution for the test errors - fix the server config!
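The fallback behaviour discussed above can be sketched in plain Java. This is an illustration only, not JMeter's actual code: the class name, constant, and method are invented; the header format follows the "Keep-Alive: timeout=5, max=100" line in the log above.

```java
// Sketch: decide how long a pooled HTTP connection may be kept alive,
// falling back to a default idle timeout when the server sends no
// Keep-Alive response header (instead of keeping it "indefinitely").
public class KeepAliveFallback {

    // Hypothetical default, in the spirit of httpclient4.idletimeout.
    static final long DEFAULT_IDLE_TIMEOUT_MS = 5000L;

    // Parse a Keep-Alive header value such as "timeout=5, max=100" and
    // return the keep-alive duration in milliseconds.
    static long keepAliveMillis(String keepAliveHeader) {
        if (keepAliveHeader != null) {
            for (String part : keepAliveHeader.split(",")) {
                String[] kv = part.trim().split("=");
                if (kv.length == 2 && kv[0].equalsIgnoreCase("timeout")) {
                    try {
                        // Header value is in seconds; convert to ms.
                        return Long.parseLong(kv[1].trim()) * 1000L;
                    } catch (NumberFormatException ignore) {
                        // Malformed value: fall through to the default.
                    }
                }
            }
        }
        // No usable header: apply the default so idle connections are
        // eventually dropped rather than reused after the server gave up.
        return DEFAULT_IDLE_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        System.out.println(keepAliveMillis("timeout=5, max=100")); // 5000
        System.out.println(keepAliveMillis(null)); // 5000 (fallback)
    }
}
```

With this kind of strategy, the "Connection can be kept alive indefinitely" case in the log becomes a bounded timeout, so the connection manager drops the idle connection before reusing it.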
Comment 23 oliver lloyd 2014-02-19 20:28:03 UTC
We also see the same issue in another test plan, this one using S3. It's feasible that both servers are set up wrongly, but that would suggest this server config error is common. That, or we're just unlucky!
Comment 24 Sebb 2014-02-19 22:11:57 UTC
Check the response headers from the servers.

Do they include the Keep-Alive header?
Comment 25 oliver lloyd 2014-02-19 22:53:34 UTC
No, I'm not seeing the keep-alive header in the response.

I've emailed you the plan.

I see what you're saying: the server is sending the wrong response, so how can the client know how to act? And sure, I can work around this easily. My concern, though, is that not sending this header seems to be pretty common, so this same issue could impact others. It's not immediately obvious what the cause of the problem is, and I can see how others might first blame the external server; I know I did. It might be that the problem is happening for other people but they are not raising it here... just a thought.
Comment 26 oliver lloyd 2014-02-19 23:10:16 UTC
Looking here: http://en.wikipedia.org/wiki/HTTP_persistent_connection

It seems as though the Keep-Alive header is effectively deprecated for HTTP/1.1, as all connections are assumed persistent.
Comment 27 Sebb 2014-02-19 23:36:56 UTC
(In reply to oliver lloyd from comment #25)

> my concern though is that not sending this header seems to be pretty common
> so this same issue could impact others.

AFAIK you are the first and only person to report it so far.
I don't know whether that is because:
+ 2.11 has not been out for long, or
+ there are not many misbehaving servers or
+ not many test plans are using connections that can become idle.
Comment 28 Sebb 2014-02-19 23:38:44 UTC
(In reply to oliver lloyd from comment #26)
> Looking here: http://en.wikipedia.org/wiki/HTTP_persistent_connection
> 
> It seems as though the Keep-Alive header is effectively deprecated for
> HTTP/1.1, as all connections are assumed persistent.

That is the Keep-Alive _request_ header.

The server should send the Keep-Alive _response_ header which has a different (but related) function.
Comment 29 Philippe Mouawad 2014-02-20 21:15:54 UTC
@Oliver,
Really big thanks for your follow up and investigations on this.

@sebb, I don't think Oliver is the only person facing this; I think I wrote somewhere here or on the dev list that another person was facing this with 2.11 and tweeted about it.

As discussed on dev list, we should:
- write a wiki page about this setting

- Select a way to invalidate idle connections, or connections that have run N requests (the S3 case), for non-'clean' servers
Comment 30 Sebb 2014-02-21 01:39:25 UTC
(In reply to Philippe Mouawad from comment #29)
> @Oliver,
> Really big thanks for your follow up and investigations on this.

Yes, thanks for the prompt responses to our queries.
 
> @sebb, I don't think Oliver is the only person who is facing this, I think I
> wrote somewhere here or on dev list that another person was facing this with
> 2.11 and tweeted about this.

OK, it would be interesting to know if they were also using S3.

> As discussed on dev list, we should:
> - write a wiki page about this setting
> 
> - Select a way to invalidate idle connections or connections that have run N
> requests (s3 case) for non 'clean' servers

I have just run a test against Amazon.com. They also have a server that fails to send a Keep-Alive header. It seems to try to drop the connection after a certain number of requests - by sending the following header:

Cneonction: close

Yes, it really is spelt that way!

What is odd is that the connection seems to keep going anyway. In any case, I suspect handling connection drop after N requests is not a problem, because that will be initiated by the server at the end of the response.

But I suppose it may still be worth considering recycling connections after a certain number of requests.

The problem with the idle timeout is that it occurs at the start of a new request, when disconnects are not currently expected. Maybe we can handle this in JMeter, or it may need some help from HC.
Comment 31 Rainer Jung 2014-02-23 17:05:41 UTC
The typical explanation for a Cneonction header is here:

http://stackoverflow.com/questions/4798461/cneonction-and-nncoection-http-headers

They say it happens when the web server is fronted by a load balancer that does its own connection handling but doesn't want to do too much protocol adjustment work. By permuting some characters in the name of the Connection header, the client no longer reacts to the close message of the origin server, but the packet sizes and probably the TCP checksums do not change. So the implementation at the packet level is cheap.

The LB reacts to the connection close itself but has decoupled the connection handling between client and LB from the connections between LB and origin server.

Now, if that is true, it might well be that the LB itself doesn't shut down connections cleanly once it decides to close a connection from the client. That would fit sebb's observation.

As a result, it should be OK to ignore those misspelled headers, because they are misspelled intentionally in order to be ignored. And in this situation the next hop might in fact exhibit somewhat strange connection-close behaviour itself.
Comment 32 Sebb 2014-02-27 01:58:40 UTC
Added a workaround to provide an idle timeout value if the server does not send the Keep-Alive header. This works, but is not ideal, as the same timeout applies to all hosts.
Comment 33 Sebb 2014-02-27 01:58:57 UTC
URL: http://svn.apache.org/r1572390
Log:
File uploads fail every other attempt using timers.
Enable idle timeouts for servers that don't send Keep-Alive headers.
Bugzilla Id: 56119

Modified:
    jmeter/trunk/bin/jmeter.properties
    jmeter/trunk/src/protocol/http/org/apache/jmeter/protocol/http/sampler/HTTPHC4Impl.java
    jmeter/trunk/xdocs/changes.xml
Comment 34 Sebb 2014-02-27 10:58:41 UTC
Note: if you want to try a nightly build with this fix, it will be in builds after revision r1572390. Define the property

httpclient4.idletimeout=5000

to set the timeout to 5 seconds for any servers that don't send Keep-Alive headers.
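The property can either go in user.properties or be passed per run on the command line (the -J flag is JMeter's standard way to set a property; the test plan name below is just a placeholder):

```
# user.properties
httpclient4.idletimeout=5000

# or, for a single non-GUI run:
jmeter -n -t testplan.jmx -Jhttpclient4.idletimeout=5000
```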
Comment 35 Philippe Mouawad 2014-03-22 21:26:31 UTC
Hello Oliver,
Any feedback?
Thanks
Comment 36 Philippe Mouawad 2014-04-16 19:21:03 UTC
*** Bug 56417 has been marked as a duplicate of this bug. ***
Comment 37 Sebb 2015-01-12 19:43:50 UTC
@pam <pramod.desai@abyeti.com>: please don't re-open an old issue without providing further details.
Comment 38 Florence Lee 2015-06-02 07:50:45 UTC
Hello, does anyone know why I get the exception "org.apache.commons.httpclient.NoHttpResponseException: The server <ip address> failed to respond", even though I use JMeter 2.10?