Created attachment 32910 [details]
4 dumps of the vm

Every now and then I get high CPU on CentOS 5, 6 and 7 servers with Tomcat 8; switching back to Tomcat 7, I do not get the problem. It seems to be in the ajp-apr-8009-Poller. I use an Apache front end with the mod_jk connector, tomcat-native-1.1.33 and jdk1.8.0_40 (Oracle). I have tried 8.0.21 and 8.0.23.

PID  USER   PR NI VIRT  RES  SHR  S %CPU  %MEM TIME+    COMMAND
2691 runner 20 0  3644m 565m 8324 S 144.0 14.7 20:36.12 java

Dumps attached. I have asked on the user list but with no success.

Cheers
Greg
Will give version 8.0.24 a test. TBA.
Well, same again:

PID  USER   PR NI VIRT  RES  SHR  S %CPU %MEM TIME+    COMMAND
2707 runner 20 0  3728m 599m 7356 S 36.5 15.6 52:28.85 java
Created attachment 32936 [details] Dump for 8.0.24
Hello,

I'm having trouble with the following setup:
- Gentoo / MariaDB
- Tomcat 8.0.23
- Tomcat native 1.1.33
- Apache httpd 2.4.12 with mod_proxy_ajp

server.xml:

...
<Connector port="8208" enableLookups="false" maxThreads="1000"
           protocol="org.apache.coyote.ajp.AjpAprProtocol"
           connectionTimeout="5000" maxConnections="8000" />
...

Everything works fine as long as I don't restart the Apache server while the Tomcat server is also running. If I do so, I get this in my catalina.out (after tweaking logging.properties):

07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,023,232]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,023,232] from poller
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,023,232] for event(s) [1]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,078,656]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,078,656] from poller
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,078,656] for event(s) [1]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,023,232]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,023,232] from poller
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,023,232] for event(s) [1]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,078,656]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,078,656] from poller
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,078,656] for event(s) [1]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,023,232]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,023,232] from poller
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,023,232] for event(s) [1]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,078,656]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,078,656] from poller
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,078,656] for event(s) [1]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,023,232]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,023,232] from poller
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,023,232] for event(s) [1]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,918,626,078,656]
07-Jun-2015 11:47:48.217 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,918,626,078,656] from poller
07-Jun-2015 11:47:48.218 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,918,626,078,656] for event(s) [1]

As long as everything is running fine, the output is like:

07-Jun-2015 20:44:01.106 FINE [http-apr-127.0.0.1-8508-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [139,978,017,344,528] for event(s) [1]
07-Jun-2015 20:44:01.109 FINE [http-apr-127.0.0.1-8508-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,978,017,344,528] from poller
07-Jun-2015 20:44:01.929 FINE [http-apr-127.0.0.1-8508-Acceptor-0] org.apache.tomcat.util.net.AprEndpoint$Acceptor.run APR socket [139,977,950,259,008] opened with remote port [38,896]
07-Jun-2015 20:44:01.930 FINE [http-apr-127.0.0.1-8508-Acceptor-0] org.apache.tomcat.util.net.AprEndpoint.processSocketWithOptions socket [139,977,950,259,008]
07-Jun-2015 20:44:01.944 FINE [http-apr-127.0.0.1-8508-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,977,950,259,008]
07-Jun-2015 20:44:01.944 FINE [http-apr-127.0.0.1-8508-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,259,008] from poller
07-Jun-2015 20:44:06.471 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,977,950,307,072]
07-Jun-2015 20:44:06.471 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,307,072] from poller
07-Jun-2015 20:44:06.497 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,977,950,291,840]
07-Jun-2015 20:44:06.497 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,291,840] from poller
07-Jun-2015 20:44:06.501 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,977,950,322,336]
07-Jun-2015 20:44:06.501 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,322,336] from poller
07-Jun-2015 20:44:06.826 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Add to poller socket [139,977,950,271,008]
07-Jun-2015 20:44:06.826 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,271,008] from poller
07-Jun-2015 20:44:12.766 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [139,977,950,307,072]
07-Jun-2015 20:44:12.766 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,307,072] from poller
07-Jun-2015 20:44:12.766 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removed [139,977,950,307,072] from poller
07-Jun-2015 20:44:12.767 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [139,977,950,271,008]
07-Jun-2015 20:44:12.767 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,271,008] from poller
07-Jun-2015 20:44:12.767 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removed [139,977,950,271,008] from poller
07-Jun-2015 20:44:12.768 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [139,977,950,322,336]
07-Jun-2015 20:44:12.768 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,322,336] from poller
07-Jun-2015 20:44:12.768 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removed [139,977,950,322,336] from poller
07-Jun-2015 20:44:12.769 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [139,977,950,291,840]
07-Jun-2015 20:44:12.769 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Attempting to remove [139,977,950,291,840] from poller
07-Jun-2015 20:44:12.769 FINE [ajp-apr-8208-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removed [139,977,950,291,840] from poller

To me this looks like a bug: Tomcat is not able to detect that the Apache socket has gone away. But it could also be a bug in the native AJP implementation of the socket handlers. When disabling APR support in server.xml, all seems to function as expected and restarting Apache does not result in a flood of messages.

Any ideas on what could be happening? A bug?

Thanks for any help,
Milo van der Zee
Does it work ok on tomcat 7?
Works fine in Tomcat 7. In Tomcat 8 I see a massive amount of context switching and so I assume Tomcat tries to create a connection and fails but tries again immediately.
CentOS 7 running via jsvc:

PID  USER   PR NI VIRT    RES    SHR  S %CPU  %MEM TIME+   COMMAND
1459 runner 20 0  4588960 578536 8312 S 263.0 15.4 8:00.51 jsvc

:{
I can't reproduce this on either Linux or Windows with the 8.0.24 code or the latest 8.0.x code and tc-native 1.1.33. I've been hammering the Tomcat home page (via httpd) using ab, with the only impact of an httpd restart being that some requests are delayed a little. Steps to reproduce this with a clean Tomcat install would help.

Is the CPU spike you observe permanent or temporary? If temporary, how long until it settles back down?

If it is permanent, figuring out which thread(s) are using the CPU [1] and then remote debugging [2] to figure out what code path is causing the problem would be extremely helpful.

[1] http://javadrama.blogspot.co.uk/2012/02/why-is-java-eating-my-cpu.html
[2] http://wiki.apache.org/tomcat/FAQ/Developing#Q1
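As a possible shortcut for [1]: if the troubled JVM is reachable (via a JMX client, or by dropping in a small diagnostic class or servlet), the "which thread is hot" question can also be answered from inside the JVM with ThreadMXBean. A minimal sketch — the class name is mine, not part of Tomcat:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Comparator;

public class HotThreads {
    /** Returns thread infos sorted by accumulated CPU time, hottest first. */
    public static ThreadInfo[] hottest(ThreadMXBean mx) {
        ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds());
        return Arrays.stream(infos)
                .filter(ti -> ti != null) // threads may have died since getAllThreadIds()
                .sorted(Comparator.comparingLong(
                        (ThreadInfo ti) -> mx.getThreadCpuTime(ti.getThreadId()))
                        .reversed())
                .toArray(ThreadInfo[]::new);
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo ti : hottest(mx)) {
            System.out.printf("%-40s %d ns%n",
                    ti.getThreadName(), mx.getThreadCpuTime(ti.getThreadId()));
        }
    }
}
```

Note that getThreadCpuTime() may return -1 if CPU time measurement is unsupported or disabled on the platform; the top -H / jstack route in [1] works regardless and also covers native threads.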
It happens randomly on all three of my CentOS boxes (5/6/7). Maybe it's only CentOS/RH? Once it starts you have to stop Tomcat and restart it.

From the "Dump for 8.0.24":

"ajp-apr-8009-Poller" #26 daemon prio=5 os_prio=0 tid=0x00007f1d48374800 nid=0xc92 runnable [0x00007f1ceb9fa000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.tomcat.jni.Poll.poll(Native Method)
        at org.apache.tomcat.util.net.AprEndpoint$Poller.run(AprEndpoint.java:1810)
        at java.lang.Thread.run(Thread.java:745)

rv = Poll.poll(pollers[i], nextPollerTime, desc, true);

Why only on my CentOS boxes is a mystery. I will have another go at debugging it when it happens again.
Hello,

I'm using Gentoo, so not only CentOS is affected. I can trigger the bug by restarting Apache while there are sessions active. Load on the server does not seem to influence the occurrence of the bug. Sometimes the load goes down again after about 30 minutes to several hours; I did not actually measure the time, but I noticed that sometimes the load goes back to normal after a long while.

Because of the context switching (>400,000 cs/sec) that is causing the load, I assume that the parkNanos call is in the neighborhood of where to look for fixes.

"ajp-apr-8200-exec-107" #3671 daemon prio=5 os_prio=0 tid=0x00007f62a007f800 nid=0x3adc waiting on condition [0x00007f624d4a6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000000ed8dfbe0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at org.apache.tomcat.util.threads.TaskQueue.poll(Unknown Source)
        at org.apache.tomcat.util.threads.TaskQueue.poll(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:745)

Thanks.
MAG, Milo
A couple of further questions. What, exactly, do you mean by "sessions active"? An actively used AJP connection to Tomcat? Also, what do you mean by "restart"? Stop/start? Graceful restart?

It sounds like a connection is going in and out of the Poller very quickly, putting the Poller into (effectively) a tight loop. What we really need to know is why, hence my request for remote debugging to trace the code path. That should be enough to figure out what is going on.
I have set up Eclipse to debug Tomcat running on the server (not done much remote debugging, so bear with me). When I attach the debugger with a breakpoint, it stops as expected. If I then remove and add the breakpoint again and refresh the application, it hangs and does not go back to the breakpoint in Eclipse. I was expecting it to go back and allow me to step through the source. Are there any more detailed instructions on how to do this?

Go to "Run -> Debug Configurations...". Click on "Remote Java Application", then click "New". Type in the title. Note the port 8000 from the Tomcat instructions. Save and run. Eclipse will connect to the JVM that Tomcat is running under.

Wow, that was easy!

Thanks
Greg
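For anyone following along, the server side of this setup is the JPDA debug agent described in [2]. Assuming a standard catalina.sh install (which is an assumption — Greg is running via jsvc on one box, where the explicit JVM flag form applies instead), a typical way to enable it on the Eclipse-default port 8000 is:

```shell
# Start Tomcat with the JPDA debug agent listening on port 8000
# (these are the defaults used by "catalina.sh jpda start";
#  address and transport are configurable via these variables)
export JPDA_ADDRESS=8000
export JPDA_TRANSPORT=dt_socket
$CATALINA_HOME/bin/catalina.sh jpda start

# Equivalent explicit JVM flag, if you launch the JVM yourself (e.g. via jsvc):
#   -agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n
```

With `suspend=n` the JVM starts normally and the debugger can attach later, which suits debugging a bug that only appears after some uptime.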
What do you mean by "trace the code path"? Isn't that the stack trace? But that is already given, so what exactly do you mean?
Hello,

It started again and was on thread 3187 (see tomcatdump-8.0.24-2015-08-02.txt):

PID  USER   PR NI VIRT  RES  SHR S %CPU %MEM TIME+   COMMAND
3187 runner 20 0  3828m 707m 13m S 16.5 18.5 6:38.29 java

"ajp-apr-8009-Poller" #28 daemon prio=5 os_prio=0 tid=0x00007f0be4626000 nid=0xc73 runnable [0x00007f0b92745000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.tomcat.jni.Poll.poll(Native Method)
        at org.apache.tomcat.util.net.AprEndpoint$Poller.run(AprEndpoint.java:1810)
        at java.lang.Thread.run(Thread.java:745)

I managed to get the debugger attached and a breakpoint in the AprEndpoint class at line 1802 (below); it seemed to be going in a loop which was consuming the CPU.

// Poll for the specified interval
for (int i = 0; i < pollers.length; i++) {
    ...
}

The variable pollers.length == 1.

For some reason it stopped breaking and the CPU went back to normal after I held it on the breakpoint for some time while looking at the source, so not sure?

Cheers
Greg
Created attachment 32951 [details] Thread 3187 dump
Comparing the source from 7.0.61 (which works OK) with 8.0.24, there are some differences in the run method; maybe these need checking?

Added lines:

removeFromPoller(info.socket);
rv = mergeDescriptors(desc, rv);

And the two versions differ in the socket timeout handling:

// Process socket timeouts
if (getSoTimeout() > 0 && maintain++ > 1000 && pollerRunning) {
    // This works and uses only one timeout mechanism for everything, but the
    // non event poller might be a bit faster by using the old maintain.
    maintain = 0;
    maintain();
}

versus:

// Process socket timeouts
if (getSoTimeout() > 0 && pollerRunning) {
    // This works and uses only one timeout mechanism for everything, but the
    // non event poller might be a bit faster by using the old maintain.
    maintain();
}

Just an idea?

Cheers
Greg
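One of the two run() variants Greg quotes throttles the (relatively expensive) maintain() timeout scan with a counter (`maintain++ > 1000`), so it runs only once every ~1000 poll iterations, while the other runs it on every iteration. The pattern can be sketched in isolation — the names here are illustrative, not the actual Tomcat fields:

```java
/** Illustrative throttle: run an expensive task once every N iterations of a hot loop. */
public class MaintainThrottle {
    private final int interval;
    private int counter = 0;

    public MaintainThrottle(int interval) {
        this.interval = interval;
    }

    /** Returns true when the expensive maintenance pass should run on this iteration. */
    public boolean shouldRun() {
        if (++counter > interval) {
            counter = 0; // reset, as in the "maintain = 0;" line of the quoted code
            return true;
        }
        return false;
    }
}
```

With interval=1000 this mirrors the counter-guarded snippet; dropping the guard trades per-iteration cost for faster detection of socket timeouts. Either way, this is a cost difference, not itself an explanation for an endless loop.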
The stack trace only shows where execution is at a single point in time. It doesn't show how execution reached that point or where execution will go afterwards.

I suspect that something is putting a socket in an odd/bad state. The code then enters the following loop:
- the socket is queued to be added to the poller
- the socket is added to the poller
- poll() finds an event for the socket
- the event is processed
- processing terminates early because of the odd/bad state

What we need to figure out is what this odd/bad state is, so we can detect it and ensure the socket is correctly processed (probably closed).

It is possible that a timeout fired while you were looking at the code. I'd recommend increasing the socket timeout to 5-10 minutes while you are debugging this.
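The suspected cycle can be modeled in a few lines to show why it saturates a core: an event source that always reports the socket as ready, combined with processing that never consumes or closes it, turns the poller into a busy loop. This is only a toy model of the description above, not Tomcat code — all names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Toy model of the suspected cycle: a socket whose processing always
 * "terminates early" is re-queued immediately, so the poller never blocks.
 */
public class PollerLoopModel {
    /** Simulates poller iterations; returns how many times the socket was (re)processed. */
    public static int spin(boolean processingSucceeds, int maxIterations) {
        Queue<Long> addList = new ArrayDeque<>();
        addList.add(0xCAFEL); // one socket in the odd/bad state
        int processed = 0;
        for (int i = 0; i < maxIterations && !addList.isEmpty(); i++) {
            long socket = addList.poll(); // "add to poller" + poll() fires at once:
            processed++;                  // the dead peer always looks readable
            if (!processingSucceeds) {
                addList.add(socket);      // early termination -> straight back in
            }                             // success would close/drop the socket
        }
        return processed;
    }
}
```

Here spin(true, n) processes the socket once and stops (the healthy path), while spin(false, n) processes it on every iteration until the cap: a loop with no blocking call in it, which is exactly the 100%-CPU, huge-context-switch signature reported above.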
Mark, I do not think I will be able to debug the run() {..} loop, trying to check for the odd/bad state, when I really have no idea what it is supposed to do.

if (((desc[n*2] & Poll.APR_POLLHUP) == Poll.APR_POLLHUP) ||
        ((desc[n*2] & Poll.APR_POLLERR) == Poll.APR_POLLERR) ||
        ((desc[n*2] & Poll.APR_POLLNVAL) == Poll.APR_POLLNVAL)) {
    if (!processSocket(desc[n*2+1], SocketStatus.ERROR)) {
        // Close socket and clear pool
        closeSocket(desc[n*2+1]);
    }
}

Would the randomness of this be a concurrency issue? I am running on slower hardware (although it is also happening on my new i5 box). I have had this kind of thing before: adding and removing at the same time, causing deadlocks.

I don't mind making changes to the source, i.e. backtracking the mods to see which one is causing the run() to loop continuously.

Cheers
Greg
If you check the private void add(..), what does the ok flag do?

boolean ok = false;
synchronized (this) {
    // Add socket to the list. Newly added sockets will wait
    // at most for pollTime before being polled. Don't add the
    // socket once the poller has stopped but destroy it straight
    // away
    if (pollerRunning && addList.add(socket, timeout, flags)) {
        ok = true;
        this.notify();
    }
}
if (!ok) {
    // Can't do anything: close the socket right away
    boolean comet = connections.get(
            Long.valueOf(socket)).isComet();
    if (!comet || (comet && !processSocket(
            socket, SocketStatus.ERROR))) {
        closeSocket(socket);
    }
}
Don't worry too much about why the loop is happening. Once we know the route execution is taking, we'll probably be able to figure out the why.

As a starting point, would it be possible for you to find out the following when the loop happens:

1. Put a breakpoint on AprEndpoint.Poller.add(long, int, int) and tell us the full stack trace for the code that adds the socket to the Poller.

2. Put a breakpoint on AprEndpoint.processSocket(long, status) and (for the looping socket) tell us what the value of the status parameter is.

The above two bits of information should allow us to get a little further.
Mark, would it be helpful if I try to set up desktop sharing while the bug is happening? Then you could debug whatever you like :) But I'll also add the stack traces as soon as I'm able to grab them.
Breakpoint 1:

AprEndpoint$Poller.add(139982580808944, -1, 1) line: 1485
AprEndpoint$Poller.access$500(AprEndpoint$Poller, long, int, int) line: 1251
AprEndpoint$AprSocketWrapper.registerforEvent(int, boolean, boolean) line: 2516
AjpAprProtocol$AjpConnectionHandler.release(SocketWrapper<Long>, Processor<Long>, boolean, boolean) line: 136
AjpAprProtocol$AjpConnectionHandler(AbstractProtocol$AbstractConnectionHandler<S,P>).process(SocketWrapper<S>, SocketStatus) line: 722
AprEndpoint$SocketProcessor.doRun() line: 2463
AprEndpoint$SocketProcessor.run() line: 2452
ThreadPoolExecutor(ThreadPoolExecutor).runWorker(ThreadPoolExecutor$Worker) line: 1142
ThreadPoolExecutor$Worker.run() line: 617
TaskThread$WrappingRunnable.run() line: not available
TaskThread(Thread).run() line: 745

The socket number changes on every call to this method.

==================

Breakpoint 2:

AprEndpoint.processSocket(long, OPEN_READ) line: 887
AprEndpoint$Poller.run() line: 1884
Thread.run() line: 745

I set a conditional breakpoint (return !status.equals(SocketStatus.OPEN_READ);) and I've never seen anything other than OPEN_READ.

Hope this helps...
I found that completely stopping Apache triggers the bug instantly. The number of context switches rises from about 5,000 cs/sec to 500,000 cs/sec. When I then start Apache, it goes down to about 15,000 cs/sec.
Thanks for all the pointers. I'm able to reproduce behavior that looks like the behavior described, with high CPU on a single Java thread. I can't be sure that I am seeing the same problem, but I am seeing a problem which I should be able to fix. I'll update this ticket as I make progress. I might have some dev 8.0.x builds for people to test.
This should be fixed now in 8.0.x/trunk. The problem was that a return value of zero was treated as a non-blocking read that returned no data, rather than as EOF. That meant that the socket was put straight back into the Poller, only for the Poller to trigger a read event immediately for EOF, and so the loop continued.
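For readers hitting similar symptoms, the distinction behind the fix can be sketched as follows. This is only an illustration of the semantics Mark describes, not the actual patch; the class, enum, and error-code convention are mine:

```java
/**
 * Sketch of the root-cause distinction: for a non-blocking socket read,
 * 0 bytes means the peer closed the connection (EOF), while "no data
 * available right now" is reported via a separate error code.
 */
public class ReadResult {
    public enum Status { DATA, WOULD_BLOCK, EOF }

    /**
     * Classifies a raw read return value: n > 0 is data, n == 0 is EOF,
     * and the "try again" error code means no data yet. Treating 0 as
     * WOULD_BLOCK is the bug: the socket goes back into the poller, which
     * immediately reports it readable again (it is at EOF), and the loop
     * never ends. Classifying 0 as EOF lets the socket be closed instead.
     */
    public static Status classify(int n, int eagainCode) {
        if (n > 0) return Status.DATA;
        if (n == 0) return Status.EOF;                  // peer closed: close the socket
        if (n == eagainCode) return Status.WOULD_BLOCK; // genuinely no data yet: re-poll
        return Status.EOF;                              // any other error: treat as closed
    }
}
```

This also explains why stopping Apache triggers it instantly: killing the front end puts every idle AJP connection at EOF at the same moment.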
Thank you very much.
Excellent, I will apply the change in AjpAprProcessor locally to see if it's the same issue (I do not stop and restart Tomcat for it to happen). Thanks! Greg
@Gregh: not restarting Tomcat but restarting Apache triggers the bug in my setup. Stopping Apache kills one end of the socket and so Tomcat starts the loop.
Seems fixed with the AjpAprProcessor patch; it has been over a week without the fans starting!!! Thanks @Milo (& Mark). Greg :-)