Bug 56645

Summary: socket callback with timeout callback causes infinite loop
Product: Apache httpd-2 Reporter: Takashi Sato <takashi.asfbugzilla>
Component: mpm_eventAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: NEW ---    
Severity: major    
Priority: P2    
Version: 2.5-HEAD   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Bug Depends on: 56654    
Bug Blocks:    
Attachments: patch for debug (event MPM using apr_array for timer list)

Description Takashi Sato 2014-06-19 10:44:27 UTC
Even after fix bug 56642, mod_proxy_wstunnel with ProxyWebsocketAsync On and ProxyWebsocketIdleTimeout doesn't work well.

I sent and receive small data many times, and sometimes went infinite loop.

I got core and backtrace, and I saw one loop was inside of apr_skiplist_insert (event.c line 1464), and another is on while loop (event.c line 1767).

I looked at event.c and found all timer_skiplist seemed to be well protected with g_timer_skiplist_mtx, so I don't have any idea why the infinity loop occurs.
Comment 1 Eric Covener 2014-06-19 11:15:40 UTC
I notice there is no checking of the return code for lock/unlock of those mutexes. They could be failing
Comment 2 Takashi Sato 2014-06-19 11:20:20 UTC
(In reply to Eric Covener from comment #1)
> I notice there is no checking of the return code for lock/unlock of those
> mutexes. They could be failing

OK, I try to put trace logs to log whether lock/unlock success or not.
Comment 3 Yann Ylavic 2014-06-19 12:17:12 UTC
Maybe the following path can help?

Index: server/mpm/event/event.c
===================================================================
--- server/mpm/event/event.c	(revision 1603193)
+++ server/mpm/event/event.c	(working copy)
@@ -1975,7 +1975,9 @@ static void * APR_THREAD_FUNC listener_thread(apr_
                 int i = 0;
                 socket_callback_baton_t *baton = (socket_callback_baton_t *) pt->baton;
                 if (baton->cancel_event) {
+                    apr_thread_mutex_lock(g_timer_skiplist_mtx);
                     baton->cancel_event->canceled = 1;
+                    apr_thread_mutex_unlock(g_timer_skiplist_mtx);
                 }
 
                 /* We only signal once per N sockets with this baton */
--

There seem to be a race between the cancellation check and set.
Comment 4 Yann Ylavic 2014-06-19 12:20:39 UTC
(In reply to Yann Ylavic from comment #3)
> There seem to be a race between the cancellation check and set.

Hmm, these are in the same thread, sorry for the noise.
Comment 5 Takashi Sato 2014-06-19 12:51:10 UTC
(In reply to Takashi Sato from comment #2)
> (In reply to Eric Covener from comment #1)
> > I notice there is no checking of the return code for lock/unlock of those
> > mutexes. They could be failing
> 
> OK, I try to put trace logs to log whether lock/unlock success or not.

I tried. lock/unlock succeed...

Then, I tried to migrate apr_skiplist to apr_array, then works fine (of cource performance is bad and may have memory leaks)!
It seems apr_skiplist has some bug.
Comment 6 Eric Covener 2014-06-19 12:53:44 UTC
(In reply to Takashi Sato from comment #5)
> (In reply to Takashi Sato from comment #2)
> > (In reply to Eric Covener from comment #1)
> > > I notice there is no checking of the return code for lock/unlock of those
> > > mutexes. They could be failing
> > 
> > OK, I try to put trace logs to log whether lock/unlock success or not.
> 
> I tried. lock/unlock succeed...
> 
> Then, I tried to migrate apr_skiplist to apr_array, then works fine (of
> cource performance is bad and may have memory leaks)!
> It seems apr_skiplist has some bug.

You might try Yanns patch. I don't know anything about the skiplist impl, but I don't know what might get moved while other threads add/remove unrelated entries.  In that case, his lock would unfortunately be required.
Comment 7 Takashi Sato 2014-06-19 12:54:12 UTC
Created attachment 31733 [details]
patch for debug (event MPM using apr_array for timer list)
Comment 8 Takashi Sato 2014-06-21 06:51:47 UTC
I observed apr_skiplist corruption, so I have annalyzed apr_skiplist, and found the bug 56654.
With 56654 fixed APR, I don't see any infinite loop.

I don't think Yann's patch.
baton->cancel_event->canceled does not affect any for timer_skiplist.