Created attachment 36126 [details]
gdb thread apply all backtrace

Somewhere between 2018-06-20 and 2018-07-17 there's been a change in trunk that makes multiple threads spin at 100% after requests have been processed.

USER   PID %CPU %MEM    VSZ    RSS  TT  STAT STARTED      TIME COMMAND
www  79187 251.8  1.6 155612 131724   -  IJ   15:04   28:05.18 /usr/local/sbin/httpd -DNOHTTPACCEPT
www  79256 151.2  0.7 133076  59624   -  IJ   15:04   17:34.26 /usr/local/sbin/httpd -DNOHTTPACCEPT
root 79173   0.0  0.2  24404  18872   -  SsJ  15:04    0:00.32 /usr/local/sbin/httpd -DNOHTTPACCEPT

The only solution I've found is a full restart of apache (httpd -k graceful doesn't suffice). The problem arises again almost instantly. A trunk snapshot from 2018-06-20 does not exhibit this behavior.

Platform: FreeBSD 11.2-p1 amd64, OpenSSL 1.1.1pre9 (from ports)

# httpd -V
Server version: Apache/2.5.1-dev (FreeBSD)
Server built:   Sep  1 2018 14:20:24
Server's Module Magic Number: 20180720:7
Server loaded:  APR 1.6.3, APR-UTIL 1.6.1, PCRE 8.42 2018-03-20
Compiled using: APR 1.6.3, APR-UTIL 1.6.1, PCRE 8.42 2018-03-20
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses disabled)
 -D APR_USE_FLOCK_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/usr/local"
 -D SUEXEC_BIN="/usr/local/bin/suexec"
 -D DEFAULT_PIDLOG="httpd.pid"
 -D DEFAULT_SCOREBOARD="apache_runtime_status"
 -D DEFAULT_ERRORLOG="/var/log/httpd-error.log"
 -D AP_TYPES_CONFIG_FILE="etc/apache25/mime.types"
 -D SERVER_CONFIG_FILE="etc/apache25/httpd.conf"

# httpd -M
Loaded Modules:
 core_module (static)
 so_module (static)
 http_module (static)
 mpm_event_module (static)
 authn_file_module (shared)
 authn_core_module (shared)
 authz_host_module (shared)
 authz_groupfile_module (shared)
 authz_user_module (shared)
 authz_core_module (shared)
 auth_basic_module (shared)
 auth_digest_module (shared)
 socache_shmcb_module (shared)
 reqtimeout_module (shared)
 include_module (shared)
 filter_module (shared)
 deflate_module (shared)
 brotli_module (shared)
 mime_module (shared)
 log_config_module (shared)
 env_module (shared)
 expires_module (shared)
 headers_module (shared)
 setenvif_module (shared)
 proxy_module (shared)
 proxy_connect_module (shared)
 proxy_http_module (shared)
 proxy_fcgi_module (shared)
 ssl_module (shared)
 unixd_module (shared)
 status_module (shared)
 autoindex_module (shared)
 negotiation_module (shared)
 dir_module (shared)
 alias_module (shared)
 rewrite_module (shared)
 http2_module (shared)
The version I roll back to to make this work:

# httpd -V
Server version: Apache/2.5.1-dev (FreeBSD)
Server built:   Jun 20 2018 18:28:30
Server's Module Magic Number: 20180606:1

First version I have exhibiting the 100% CPU for multiple threads:

# httpd -V
Server version: Apache/2.5.1-dev (FreeBSD)
Server built:   Jul 17 2018 10:28:46
Server's Module Magic Number: 20180716:2
The version I roll back to to make this work (build date is also trunk date):

> # httpd -V
> Server built: Jun 20 2018 18:28:30
> Server's Module Magic Number: 20180606:1

First version I have exhibiting the 100% CPU for multiple threads:

> # httpd -V
> Server built: Jul 17 2018 10:28:46
> Server's Module Magic Number: 20180716:2

Looking at the magic numbers I rolled back a bit further: trunk on 2018-07-11 is already problematic.

> # /usr/local/sbin/httpd -V
> Server's Module Magic Number: 20180711:1

The backtrace is in line with the attachment. The magic number was bumped in https://svn.apache.org/viewvc?view=revision&revision=1835640 and the accompanying changes are the likely culprit.

Going back a couple of commits on trunk to 2018-07-10 results in a non-problematic version:

> # /usr/local/sbin/httpd -V
> Server's Module Magic Number: 20180606:1
(In reply to Bernard Spil from comment #0)
> Created attachment 36126 [details]
> gdb thread apply all backtrace
>
> Somewhere between 2018-06-20 and 2018-07-17 there's been a change in trunk
> that makes multiple threads spin at 100% after requests have been processed.

Can you try moving fwd to r1837823 or greater? r1837822+r1837823 seems related.
Bernard, is this a high-concurrency scenario, or can it be reproduced with a single/few requests?

Looking at the stacktrace, the most probable looping threads are numbers 60 to 62, thus ap_filter_output_pending(). Is the endless loop internal to this function, or is MPM event's process_socket() calling it endlessly?

If I can't reproduce with a simple scenario, could you please debug that loop (latest trunk) and determine why ap_filter_output_pending() either never ends or keeps returning OK? It should stop (DECLINED) when the socket is completely flushed...
(In reply to Eric Covener from comment #3)
> (In reply to Bernard Spil from comment #0)
> > Created attachment 36126 [details]
> > gdb thread apply all backtrace
> >
> > Somewhere between 2018-06-20 and 2018-07-17 there's been a change in trunk
> > that makes multiple threads spin at 100% after requests have been processed.
>
> Can you try moving fwd to r1837823 or greater? r1837822+r1837823 seems
> related.

Hi Eric,

This is reproducible with a 20180831 snapshot as well (git tag 50a5690). I pull the tarball from GitHub, but that should be the same code.

https://github.com/apache/httpd/commit/767deda89753286d376c5660198a0b2f7651301b
https://github.com/apache/httpd/commit/10942ed3a206b83d81bbf45f91a15678fe8bcf87
(In reply to Yann Ylavic from comment #4)
> Bernard, is this with high concurrency scenario or it can be reproduced with
> a single/few requests?
>
> Looking at the stacktrace, the most probable looping threads are numbers 60
> to 62, thus ap_filter_output_pending(). Is the endless loop internal to this
> function or MPM event's process_socket() is calling it endlessly?
>
> If I can't reproduce with a simple scenario, could you please debug that
> loop (latest trunk) and determine why ap_filter_output_pending() either
> never ends or keeps returning OK? It should stop (DECLINED) when the socket
> is completely flushed...

Hi Yann,

I have found that I can reliably reproduce it using "ANALYZE" on
https://developers.google.com/speed/pagespeed/insights

I have not been able to reproduce it by merely loading a page with many resources. I tried ab too (ab -n 10000 -c 250 https://test.brnrd.eu/) but that doesn't trigger the issue either.

Yesterday I had issues inspecting the running process; now trying again to see if I can find out more about the spinning thread.
Created attachment 36129 [details]
stepping through util_filter.c 1012-1032

Stepping through, I see that I'm switching between the threads that are running at 100% CPU and never even leaving that block of code (lines 1012-1032 in server/util_filter.c).

https://svn.apache.org/viewvc/httpd/httpd/trunk/server/util_filter.c?revision=1837822&view=markup#l1012

Hope that's what you were looking for.
OK thanks, if you don't already, can you try to compile with -fno-strict-aliasing? Aliasing is known to break with APR_RING...
Hi Yann,

-fno-strict-aliasing is enabled by default. Build log of an earlier version here:
https://brnrd.eu/poudriere/data/112libre-default/2018-08-05_22h41m54s/logs/apache25-devel-2.5.1.20180529.log

poudriere (FreeBSD's default ports bulk-build tool) logs the MAKE_ENV in a separate section:

> CFLAGS="-O2 -fno-strict-aliasing -pipe -march=native -DLIBICONV_PLUG -fstack-protector"
> CXXFLAGS="-O2 -fno-strict-aliasing -pipe -march=native -DLIBICONV_PLUG -fstack-protector -DLIBICONV_PLUG"
I don't see how ap_filter_output_pending() can loop indefinitely. Can you please step (or next) there from gdb, with "set scheduler-locking on" to stay in the same thread (and preferably with symbols, if possible)?
It would be interesting to know which filters are pending in ap_filter_output_pending() too (i.e. print f->frec->name in the loop).
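Concretely, the two suggestions could be combined in a gdb session along these lines (the thread number and breakpoint line are illustrative, taken from the stacktrace and the util_filter.c loop referenced earlier):

```gdb
(gdb) thread 60                      # one of the threads spinning at 100% CPU
(gdb) set scheduler-locking on       # keep the other threads stopped while stepping
(gdb) break util_filter.c:1012       # top of the pending-filter loop
(gdb) continue
(gdb) print f->frec->name            # which filter is still pending?
(gdb) next                           # step the loop, watch for OK vs DECLINED
```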
Nevermind, I think I can reproduce with a large download and some rate limiting on the client side. Let me see what happens...
Hi Bernard, should be fixed in r1839997, can you please give it a try?
Follow up in r1840002, simpler to test latest trunk anyway.
(In reply to Yann Ylavic from comment #14)
> Follow up in r1840002, simpler to test latest trunk anyway.

Hi Yann,

Tested trunk r1840028 and confirm that it's fixed. Thanks!

Bernard.