Bug 46634

Summary:	Requests sometimes take too long, wrong time taken reported in logs
Product:	Apache httpd-2	Reporter:	mb <m-b>
Component:	Core	Assignee:	Apache HTTPD Bugs Mailing List <bugs>
Status:	ASSIGNED ---
Severity:	normal
Priority:	P3
Version:	2.2.3
Target Milestone:	---
Hardware:	PC
OS:	Linux
Attachments:	screendump.jpg apache_access.log systeminfo.txt stacktraces.txt

Description mb 2009-01-30 05:55:55 UTC

Created attachment 23199 [details]
screendump.jpg

Our Apache servers, each processing some 7 million requests a day, sometimes take way to long to send a response (1500000ns as compared to the normal 500ns). This happens with content stored on the local disk and under different load conditions all day round and can be observed in the server access logs as well as on the client side.

Using tcpdump we've tracked the problem down to the following:
1) server receives request A (for tab1_active.gif) and sends response immediately
2) server receives request B (for btn_pfeil.gif) on the same connection (using keep-alive), waits some 2 seconds after accepting the GET and then sends out response for B
3) however in the logfiles, the time taken for request A ist given as over 2 seconds whereas request B is logged as processed in 200ns.

This looks like for some reason Apache suffers from some delay between receiving a request and sending out the response, but before(!) calculating the time taken for the previous request. The problem was tracked down on 2.2.3 but also occures on 2.2.10.

So there are two sub-problems:
A) the time taken in the log files does not match the real time taken
B) it shouldn't take that long to deliver a small file at all

Attached you'll find a screenshot of the tcpdump, the corresponding log entries as well as the configuration of the system Apache is running on.

Any help is greatly appreciated as this problem severely affects the performance of the whole website.

Comment 1 mb 2009-01-30 05:56:23 UTC

Created attachment 23200 [details]
apache_access.log

Comment 2 mb 2009-01-30 05:56:46 UTC

Created attachment 23201 [details]
systeminfo.txt

Comment 3 Ruediger Pluem 2009-01-30 07:08:19 UTC

Can you please provide a stacktrace of the httpd process that is stuck for 2 seconds?

Comment 4 mb 2009-01-30 09:01:31 UTC

Ok, we'll see what is possible. Probably need to recompile the server and roll it out. However that'll take some days. Or is there any other easy way to capture the dump in exactly that moment?

Comment 5 mb 2009-02-03 09:47:00 UTC

Created attachment 23222 [details]
stacktraces.txt

Comment 6 mb 2009-02-03 09:47:26 UTC

In the attachment stacktraces.txt you'll find two stacktraces of requests that were logged as being unusually slow. 

To be able to use conditional breakpoints, we had to make some changes to mod_log_config.so though:
1) Patched mod_log_config.c:log_request_duration_microseconds to store value in local variable (as shown in attachment)
2) Changed some compiler options: removed "-O2", added "-g -g3 -ggdb"

Comment 7 mb 2009-02-03 09:57:56 UTC

I've forgot to mention: For the debugging session, we've used the 2.2.10 version of the server.

Comment 8 Ruediger Pluem 2009-02-03 23:59:50 UTC

Sorry but the stacktrace doesn't really help as you set a breakpoint in the logging section. What we need is that once httpd gets stuck you start your debugging session (which stops httpd) and we see where it is stuck.

Comment 9 mb 2009-02-04 02:14:12 UTC

Hmmm... That doesn't sound very feasible to me, 'cause it would require us stopping httpd in exactly that moment when any of the hundred parallel requests hangs for a second or two. However, neither do we know beforehand, when that will happen nor do we know which request is affected until it is logged. Even if we tracked it down with wireshark in parallel, it would not be possible to analyze the data that fast and stop httpd.

Of course, if there is any way to achieve this, we'll try. Any help to accomplish this task is appreciated.

Another option would be some investigation in the sources. The problem seems to occur somewhere after the last byte of the previous response is sent, the next pipelined request is accepted on the connection, but before the log is written for the first response.

Comment 10 mb 2009-02-20 05:51:57 UTC

There are some news: With some debugging and call tracing we've found the place where it hangs. It is mmap_bucket_read in srclib/apr-util/buckets/apr_buckets_mmap.c called from server/ap_core_output_filter in core_filters.c.

The strange thing is: This function as well as apr_mmap_offset in srclib/apr/mmap/unix/common.c only does some pointer shuffeling but sometimes takes over a second to complete.

Comment 11 Jeff Trawick 2009-02-20 07:16:32 UTC

Interesting; I don't see any touching of the memory page with the mmap-ed file either.

It can block for disk I/O at the point where the memory address returned by the bucket-read is dereferenced a bit later; I don't grok a block at an earlier point, unless this heap storage representing the mmap is getting paged out.

Help ;)

Comment 12 mb 2009-02-20 09:17:20 UTC

It's getting really weird now. The situation is as follows:

1. The line "rv = apr_bucket_read(e, &str, &n, eblock);" in ap_core_output_filter (core_filters.c) takes more than a second (1000000µs) to complete
2. When that happens, the return value of apr_bucket_read is 0 (APR_SUCCESS).
3. After apr_bucket_read returns, the value of e->type->name is always "MMAP".
4. After apr_bucket_read returns, the value of e->type->read is always the address of mmap_bucket_read (apr_buckets_mmap.c).
5. mmap_bucket_read records time on entry and just before "return APR_SUCCESS" with apr_time_now().
6. A difference of more than 300000µs is _never_ recorded in that function.

So it looks like, calling apr_bucket_read takes over a second, but the called function never takes more than a third of a second between first and last statement.

Is it possible, that a completely different function is called, i.e. the value of e->type->read changes due to the call? Could it really take so long to evaluate e->type->read in the macro apr_bucket_read and do some basic logic of a function call?

Comment 13 mb 2009-02-24 10:05:49 UTC

Ok, call to "read" DOES seem to replace bucket received as a parameter. With that in mind, we were able to trace it down to the system call "mmap" called by a file bucket. However, the problem still persists with "EnableMMAP Off", so it's somewhere in basic file I/O. We'll keep on looking what could be the reason for that. To some degree, it seems to depend on the compiler version used to build Apache and the kernel.

Comment 14 mb 2009-02-25 06:50:22 UTC

Finally, we found the reason for the delay. After switching off logging file access time in ext3, the values went back to normal. Access time with ext2 seems fine however. So I've lowered the priority of the bug. What remains, is that the long request is logged for the request _before_ such a delay in a connection with keep alive, not for the one that is actually slow.

Comment 15 MortenK 2013-12-16 14:01:21 UTC

Hi

I am also seeing apache loggin wrong time to serve reqeust in the log.

My TCPdump shows that the reqeust took 51 seconds, but my apache log reports "14282" microseconds.

In other cases where it takes along time, its spot on, but in some cases its wrong.
I am not sure how to procede with this finding.