Bug 62668

Summary: httpd threads spinning at 100%
Product: Apache httpd-2 Reporter: Bernard Spil <brnrd>
Component: CoreAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED    
Severity: critical    
Priority: P2    
Version: 2.5-HEAD   
Target Milestone: ---   
Hardware: PC   
OS: FreeBSD   
URL: https://svn.apache.org/viewvc?view=revision&revision=1840028
Attachments: gdb thread apply all backtrace
stepping through util_filter.c 1012-1032

Description Bernard Spil 2018-09-01 13:24:14 UTC
Created attachment 36126 [details]
gdb thread apply all backtrace

Somewhere between 2018-06-20 and 2018-07-17 there's been a change in trunk that makes multiple threads spin at 100% after requests have been processed.

USER   PID  %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND
www  79187 251.8  1.6 155612 131724  -  IJ   15:04   28:05.18 /usr/local/sbin/httpd -DNOHTTPACCEPT
www  79256 151.2  0.7 133076  59624  -  IJ   15:04   17:34.26 /usr/local/sbin/httpd -DNOHTTPACCEPT
root 79173   0.0  0.2  24404  18872  -  SsJ  15:04    0:00.32 /usr/local/sbin/httpd -DNOHTTPACCEPT

the only solution I've found is a full restart apache (httpd -k graceful doesn't suffice). Problem arises again almost instantly.
A trunk snapshot from 2018-06-20 does not exhibit this behavior.

Platform: FreeBSD 11.2-p1 amd64
OpenSSL 1.1.1pre9 (from ports)

# httpd -V
Server version: Apache/2.5.1-dev (FreeBSD)
Server built:   Sep  1 2018 14:20:24
Server's Module Magic Number: 20180720:7
Server loaded:  APR 1.6.3, APR-UTIL 1.6.1, PCRE 8.42 2018-03-20
Compiled using: APR 1.6.3, APR-UTIL 1.6.1, PCRE 8.42 2018-03-20
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses disabled)
 -D APR_USE_FLOCK_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/usr/local"
 -D SUEXEC_BIN="/usr/local/bin/suexec"
 -D DEFAULT_PIDLOG="httpd.pid"
 -D DEFAULT_SCOREBOARD="apache_runtime_status"
 -D DEFAULT_ERRORLOG="/var/log/httpd-error.log"
 -D AP_TYPES_CONFIG_FILE="etc/apache25/mime.types"
 -D SERVER_CONFIG_FILE="etc/apache25/httpd.conf"

# httpd -M
Loaded Modules:
 core_module (static)
 so_module (static)
 http_module (static)
 mpm_event_module (static)
 authn_file_module (shared)
 authn_core_module (shared)
 authz_host_module (shared)
 authz_groupfile_module (shared)
 authz_user_module (shared)
 authz_core_module (shared)
 auth_basic_module (shared)
 auth_digest_module (shared)
 socache_shmcb_module (shared)
 reqtimeout_module (shared)
 include_module (shared)
 filter_module (shared)
 deflate_module (shared)
 brotli_module (shared)
 mime_module (shared)
 log_config_module (shared)
 env_module (shared)
 expires_module (shared)
 headers_module (shared)
 setenvif_module (shared)
 proxy_module (shared)
 proxy_connect_module (shared)
 proxy_http_module (shared)
 proxy_fcgi_module (shared)
 ssl_module (shared)
 unixd_module (shared)
 status_module (shared)
 autoindex_module (shared)
 negotiation_module (shared)
 dir_module (shared)
 alias_module (shared)
 rewrite_module (shared)
 http2_module (shared)
Comment 1 Bernard Spil 2018-09-01 13:31:03 UTC
The version I roll back to to make this work:
# httpd -V
Server version: Apache/2.5.1-dev (FreeBSD)
Server built:   Jun 20 2018 18:28:30
Server's Module Magic Number: 20180606:1

First version I have exhibiting the 100% CPU for multiple threads
# httpd -V
Server version: Apache/2.5.1-dev (FreeBSD)
Server built:   Jul 17 2018 10:28:46
Server's Module Magic Number: 20180716:2
Comment 2 Bernard Spil 2018-09-01 13:53:33 UTC
The version I roll back to to make this work (build date is also trunk date):
> # httpd -V
> Server built:   Jun 20 2018 18:28:30
> Server's Module Magic Number: 20180606:1

First version I have exhibiting the 100% CPU for multiple threads
> # httpd -V
> Server built:   Jul 17 2018 10:28:46
> Server's Module Magic Number: 20180716:2
tyj

Looking at the magic numbers I rolled back a bit further, trunk on 2018-07-11 is already problematic:
> # /usr/local/sbin/httpd -V
> Server's Module Magic Number: 20180711:1
Backtrace is in line with the attachment.

Magic number was bumped in https://svn.apache.org/viewvc?view=revision&revision=1835640 and the accompanying changes are the likely culprit.

Going back a couple commits on trunk to 2018-07-10 results in a non-problematic version
> # /usr/local/sbin/httpd -V
> Server's Module Magic Number: 20180606:1
Comment 3 Eric Covener 2018-09-01 14:34:12 UTC
(In reply to Bernard Spil from comment #0)
> Created attachment 36126 [details]
> gdb thread apply all backtrace
> 
> Somewhere between 2018-06-20 and 2018-07-17 there's been a change in trunk
> that makes multiple threads spin at 100% after requests have been processed.

Can you try moving fwd to r1837823 or greater?  r1837822+r1837823 seems related.
Comment 4 Yann Ylavic 2018-09-01 16:53:25 UTC
Bernard, is this with high concurrency scenario or it can be reproduced with a single/few requests?

Looking at the stacktrace, the most probable looping threads are numbers 60 to 62, thus ap_filter_output_pending(). Is the endless loop internal to this function or MPM event's process_socket() is calling it endlessly?

If I can't reproduce with a simple scenario, could you please debug that loop (latest trunk) and determine why ap_filter_output_pending() either never ends or keeps returning OK? It should stop (DECLINED) when the socket is completely flushed...
Comment 5 Bernard Spil 2018-09-02 08:24:16 UTC
(In reply to Eric Covener from comment #3)
> (In reply to Bernard Spil from comment #0)
> > Created attachment 36126 [details]
> > gdb thread apply all backtrace
> > 
> > Somewhere between 2018-06-20 and 2018-07-17 there's been a change in trunk
> > that makes multiple threads spin at 100% after requests have been processed.
> 
> Can you try moving fwd to r1837823 or greater?  r1837822+r1837823 seems
> related.

Hi Eric,

This is reproducible with a 20180831 snapshot as well (git tag 50a5690). I pull the tarball from GitHub but that should be the same code.
https://github.com/apache/httpd/commit/767deda89753286d376c5660198a0b2f7651301b
https://github.com/apache/httpd/commit/10942ed3a206b83d81bbf45f91a15678fe8bcf87
Comment 6 Bernard Spil 2018-09-02 08:51:52 UTC
(In reply to Yann Ylavic from comment #4)
> Bernard, is this with high concurrency scenario or it can be reproduced with
> a single/few requests?
> 
> Looking at the stacktrace, the most probable looping threads are numbers 60
> to 62, thus ap_filter_output_pending(). Is the endless loop internal to this
> function or MPM event's process_socket() is calling it endlessly?
> 
> If I can't reproduce with a simple scenario, could you please debug that
> loop (latest trunk) and determine why ap_filter_output_pending() either
> never ends or keeps returning OK? It should stop (DECLINED) when the socket
> is completely flushed...

Hi Yann,

I have found that I can reliably reproduce it using "ANALYZE" on https://developers.google.com/speed/pagespeed/insights
I have not been able to reproduce it by merely loading a page with many resources. Tried ab too (ab -n 10000 -c 250 https://test.brnrd.eu/) but that doesn't trigger the issue either.

Yesterday I had issues inspecting the running process, now trying again to see if I can find out more about the spinning thread.
Comment 7 Bernard Spil 2018-09-02 10:58:25 UTC
Created attachment 36129 [details]
stepping through util_filter.c 1012-1032

Stepping through I see that I'm switching between the threads that are running at 100% CPU and never even leave that block of code (lines 1012-1032 in server/util_filter.c).
https://svn.apache.org/viewvc/httpd/httpd/trunk/server/util_filter.c?revision=1837822&view=markup#l1012

Hope that that's what you were looking for.
Comment 8 Yann Ylavic 2018-09-02 11:29:21 UTC
OK thanks, if you don't already, can you try to compile with -fno-strict-aliasing?
Aliasing is known to break with APR_RING...
Comment 9 Bernard Spil 2018-09-02 17:23:49 UTC
Hi Yann,

-fno-strict-aliasing is enabled by default. Build log of an earlier version here:
https://brnrd.eu/poudriere/data/112libre-default/2018-08-05_22h41m54s/logs/apache25-devel-2.5.1.20180529.log
poudriere (FreeBSD default ports bulk build tool) logs the MAKE_ENV in a separate section:
> CFLAGS="-O2 -fno-strict-aliasing -pipe -march=native  -DLIBICONV_PLUG -fstack-protector"
> CXXFLAGS="-O2 -fno-strict-aliasing -pipe -march=native -DLIBICONV_PLUG -fstack-protector  -DLIBICONV_PLUG"
Comment 10 Yann Ylavic 2018-09-03 08:02:53 UTC
I don't see how ap_filter_output_pending() can loop indefinitely..

Can you please step (or next) there from gdb, with "set scheduler-locking on" to stay in the same thread (and preferably symbols if possible)?
Comment 11 Yann Ylavic 2018-09-03 09:44:14 UTC
It would be interesting to know which filters are pending in ap_filter_output_pending() too (i.e. print f->frec->name in the loop).
Comment 12 Yann Ylavic 2018-09-03 11:31:09 UTC
Nevermind, I think I can reproduce with large download and some rate limit on the client side. Let me see what happens...
Comment 13 Yann Ylavic 2018-09-03 23:52:10 UTC
Hi Bernard,

should be fixed in r1839997, can you please give it a try?
Comment 14 Yann Ylavic 2018-09-04 02:42:57 UTC
Follow up in r1840002, simpler to test latest trunk anyway.
Comment 15 Bernard Spil 2018-09-05 16:28:56 UTC
(In reply to Yann Ylavic from comment #14)
> Follow up in r1840002, simpler to test latest trunk anyway.

Hi Yann,

Tested trunk r1840028 and confirm that it's fixed.

Thanks! Bernard.