|
|
Created:
8 years, 9 months ago by akalin Modified:
8 years, 9 months ago CC:
chromium-reviews, brettw-cc_chromium.org, Sigurður Ásgeirsson Base URL:
svn://svn.chromium.org/chrome/trunk/src Visibility:
Public. |
DescriptionClean up condition variable usage in SequencedWorkerPool
Add unit test for spurious work signal behavior.
Split cond_var_ into multiple condition variables; one for each distinct
condition. Document exactly when each condition variable is waited on.
Restrict the thread that SWP::Shutdown() can be called from.
Fix brace usage in destructor.
BUG=117469
TEST=
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=126241
Patch Set 1 #Patch Set 2 : Address comments #
Total comments: 25
Patch Set 3 : Address jar's comments #
Total comments: 6
Patch Set 4 : Address nits #
Messages
Total messages: 13 (0 generated)
+brettw for review +siggi, feel free to chime in if you notice anything (the more eyes on condition-variable code the better).
+jar for review instead, as brett says you're a better person to review this
Have not reviewed in detail, but it strikes me that it would be interesting to add a unittest that sets up the spurious wake that triggers the spin. You'd then be assured that: 1. the (a) problem condition has been correctly identified. 2. your fix works to alleviate that condition.
On 2012/03/09 11:59:20, Sigurður Ásgeirsson wrote: > Have not reviewed in detail, but it strikes me that it would be interesting to > add a unittest that sets up the spurious wake that triggers the spin. > You'd then be assured that: > 1. the (a) problem condition has been correctly identified. > 2. your fix works to alleviate that condition. Good idea. I was able to write a unit test to trigger the spin. Done. PTAL!
I don't feel qualified to review this change on merits, but I do feel it's urgent for us to get fix in for this. https://chromiumcodereview.appspot.com/9651026/diff/6001/base/threading/seque... File base/threading/sequenced_worker_pool_unittest.cc (right): https://chromiumcodereview.appspot.com/9651026/diff/6001/base/threading/seque... base/threading/sequenced_worker_pool_unittest.cc:484: // work signal. This shouldn't cause any other work signals to be nit: one space after full stop, same below. https://chromiumcodereview.appspot.com/9651026/diff/6001/base/threading/seque... base/threading/sequenced_worker_pool_unittest.cc:486: TEST_F(SequencedWorkerPoolTest, SpuriousWorkSignal) { Nice!
In addition to a bunch of questions about this CL, I'm still questioning whether we need any additional CVs to monitor status during a shutdown. When we are shutting down, threads either run critical tasks, or exit. They don't sleep (wait on the CV). As a result, the signal migrates to being used to try to notify the controlling thread that "yet another possibly blocking thread has exited". To try to say that another way: When the pool is working as normal, the work threads either work, or wait on the CV. A Signal() will always awaken a worker (if there is one pending). During shutdown, only (at most) the controlling thread will be Wait()ing. Hence a Signal() during shutdown is targeted only at the controlling thread. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... File base/threading/sequenced_worker_pool.cc (left): http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:387: { nit: You are dead right that this curly is not needed (if/after you make your change deleting line 435)... but it is making it much harder to see what you changed in this function (blame it on the diff system). Given how delicate this CL is, I'd rather leave it in for now, so that I can better see the changes here. Thanks in advance. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:404: cond_var_.Signal(); This signal is basically asking that anyone else that is waiting should check to see if they have work to do. Why did you delete it here? http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:437: cond_var_.Signal(); This is another fairly critical Signal() call. This indicates that we've decided to stop running tasks, and are terminating. We notify other waiting threads in case they too wish to follow suit. Why was this deleted? At best, playing with your multiple-cv plan, I'd expect the addition of: can_shutdown_cv_.Signal(); Surly when we are terminating, there is a *chance* that shutdown can and should proceed.... as we might have been the thread folks were waiting for. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:372: void SequencedWorkerPool::Inner::Shutdown() { IMO, it would be helpful if you commented, or asserted, that this is NOT run on a worker thread. I think it is probably run on only one thread... and it would be nicest if we could assert we're on said thread. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:385: has_work_cv_.Broadcast(); I liked the original code better, relying on signalling, rather than Broadcast(), which is plagued by the "box car" problem (threads wake up only to block on a lock). The good news is this is happening only at shutdown... so it may not be very perf critical. If you have distinct CVs (as are currently proposed in this CL), one for signalling wake-ups, and one for signalling shutdown progress, then you should IMO leave the code as it was (using Signal()), as it is cleaner and efficient. If you'd like to re-use the CV (stop trying to add additional condition variables), then you could use Broadcast() as suggested. That would wake all sleeping workers. They would then each either terminate, or run tasks (and eventually terminate), as we are in a shutdown mode. This would mean that the CV Signal()ing (hence forth) would only be targetted at this controlling thread. If you decide to keep the Broadcast(), and reuse the singular CV, then this is a pretty subtle transition IMO. There should be a comment to the effect that worker threads will no longer ever Wait() on this. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:394: // yet, or both. Block on the "queue empty" event to know when all tasks are This comment doesn't relate to any code IMO. In addition, queue empty doesn't mean that there is nothing left blocking us.... so it is very confused. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:422: if (CanShutdown()) I don't think anyone cares about CanShutdown() unless we have shutdown_called_. I expect the login in the running loop to check this status, and so I don't expect to need to see special cases on thread startup here. Bottom line: Why are lines 422 and 423 needed? In this class of code, less is better IMO. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:558: can_shutdown_cv_.Signal(); Again, I don't like seeing special case code interspersed and signalling this CV. I think the right place to signal this is when a thread does indeed shutdown, because it can find nothing more to do. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:610: can_shutdown_cv_.Signal(); Again, please don't add special case code for this. Try to put it in one central and most logical place (see comments above). http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:685: ++has_work_signal_count_; You are not generally inside a lock, but are manipulating this counter which does require a lock. I suspect that this testing variable should not be used... but I'm not sure.
Addressed all comments. PTAL. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... File base/threading/sequenced_worker_pool.cc (left): http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:387: { On 2012/03/12 17:39:13, jar wrote: > nit: You are dead right that this curly is not needed (if/after you make your > change deleting line 435)... but it is making it much harder to see what you > changed in this function (blame it on the diff system). Given how delicate this > CL is, I'd rather leave it in for now, so that I can better see the changes > here. Thanks in advance. Done. (It ended up being needed anyway when addressing your comments.) http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:404: cond_var_.Signal(); On 2012/03/12 17:39:13, jar wrote: > This signal is basically asking that anyone else that is waiting should check to > see if they have work to do. Why did you delete it here? Hmm. If every task posted causes a signal, then wouldn't another thread be woken up anyway? But put it back in nonetheless. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:437: cond_var_.Signal(); On 2012/03/12 17:39:13, jar wrote: > This is another fairly critical Signal() call. This indicates that we've decided > to stop running tasks, and are terminating. We notify other waiting threads in > case they too wish to follow suit. > > Why was this deleted? This was deleted because of the change to using Broadcast(). Now that I reverted that, I put this back in. > At best, playing with your multiple-cv plan, I'd expect the addition of: > can_shutdown_cv_.Signal(); > Surly when we are terminating, there is a *chance* that shutdown can and should > proceed.... as we might have been the thread folks were waiting for. Removed all the other places I signal can_shutdown_cv_ and added it here. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:372: void SequencedWorkerPool::Inner::Shutdown() { On 2012/03/12 17:39:13, jar wrote: > IMO, it would be helpful if you commented, or asserted, that this is NOT run on > a worker thread. I think it is probably run on only one thread... and it would > be nicest if we could assert we're on said thread. Done. (Added comment, and assert is in SWP::Shutdown()). http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:385: has_work_cv_.Broadcast(); On 2012/03/12 17:39:13, jar wrote: > I liked the original code better, relying on signalling, rather than > Broadcast(), which is plagued by the "box car" problem (threads wake up only to > block on a lock). The good news is this is happening only at shutdown... so it > may not be very perf critical. > > If you have distinct CVs (as are currently proposed in this CL), one for > signalling wake-ups, and one for signalling shutdown progress, then you should > IMO leave the code as it was (using Signal()), as it is cleaner and efficient. > > If you'd like to re-use the CV (stop trying to add additional condition > variables), then you could use Broadcast() as suggested. That would wake all > sleeping workers. They would then each either terminate, or run tasks (and > eventually terminate), as we are in a shutdown mode. This would mean that the > CV Signal()ing (hence forth) would only be targetted at this controlling thread. > > If you decide to keep the Broadcast(), and reuse the singular CV, then this is a > pretty subtle transition IMO. There should be a comment to the effect that > worker threads will no longer ever Wait() on this. Okay, moved back to signal chain. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:394: // yet, or both. Block on the "queue empty" event to know when all tasks are On 2012/03/12 17:39:13, jar wrote: > This comment doesn't relate to any code IMO. > In addition, queue empty doesn't mean that there is nothing left blocking us.... > so it is very confused. Replaced comment (and the one below) with something clearer (IMO). http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:422: if (CanShutdown()) On 2012/03/12 17:39:13, jar wrote: > I don't think anyone cares about CanShutdown() unless we have shutdown_called_. > I expect the login in the running loop to check this status, and so I don't > expect to need to see special cases on thread startup here. > > Bottom line: Why are lines 422 and 423 needed? In this class of code, less is > better IMO. Done. Moved can_shutdown_cv_ signal to end of block. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:558: can_shutdown_cv_.Signal(); On 2012/03/12 17:39:13, jar wrote: > Again, I don't like seeing special case code interspersed and signalling this > CV. I think the right place to signal this is when a thread does indeed > shutdown, because it can find nothing more to do. Done (see above) http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:610: can_shutdown_cv_.Signal(); On 2012/03/12 17:39:13, jar wrote: > Again, please don't add special case code for this. Try to put it in one > central and most logical place (see comments above). Done (see above) http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:685: ++has_work_signal_count_; On 2012/03/12 17:39:13, jar wrote: > You are not generally inside a lock, but are manipulating this counter which > does require a lock. > > I suspect that this testing variable should not be used... but I'm not sure. My mistake. This testing variable is kind of hacky, but it's needed for the regression test for the bug this is supposed to fix. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... File base/threading/sequenced_worker_pool_unittest.cc (right): http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool_unittest.cc:484: // work signal. This shouldn't cause any other work signals to be On 2012/03/12 13:01:40, Sigurður Ásgeirsson wrote: > nit: one space after full stop, same below. Done.
On 2012/03/12 17:39:12, jar wrote: > In addition to a bunch of questions about this CL, I'm still questioning whether > we need any additional CVs to monitor status during a shutdown. > > When we are shutting down, threads either run critical tasks, or exit. They > don't sleep (wait on the CV). As a result, the signal migrates to being used to > try to notify the controlling thread that "yet another possibly blocking thread > has exited". > > To try to say that another way: When the pool is working as normal, the work > threads either work, or wait on the CV. A Signal() will always awaken a worker > (if there is one pending). During shutdown, only (at most) the controlling > thread will be Wait()ing. Hence a Signal() during shutdown is targeted only at > the controlling thread. Well, I think we need at least one separate CV, is_idle_cv_, since conflating that with the other condition variable was what caused this bug in the first place. But you're right about has_work_cv_/can_shutdown_cv_. If you really want, I can re-combine them, but at the very least, I'd like there to be more comments about why the CV is used for different purposes. But I'm hesitant about doing so; is there really anything to gain from combining them? It's already caused one bug (since I think FlushForTesting() was added later, and whoever did assumed they could re-use the single CV) and I think the negative from that outweighs any possible advantage combining them would have. What do you think?
On 2012/03/12 18:39:20, akalin wrote: > But you're right about has_work_cv_/can_shutdown_cv_. If you really want, I can > re-combine them, but at the very least, I'd like there to be more comments about > why the CV is used for different purposes. But I'm hesitant about doing so; is > there really anything to gain from combining them? It's already caused one bug > (since I think FlushForTesting() was added later, and whoever did assumed they > could re-use the single CV) and I think the negative from that outweighs any > possible advantage combining them would have. What do you think? Actually, I'm questioning the correctness of combining has_work_cv_/can_shutdown_cv_ given that there is a "signal chain" when Shutdown() is called (where one thread shutting down signals the next thread to shut down). Let's say you have 3 worker threads and the work queue is empty. Then: - Main thread posts shutdown-blocking task (signal #1) - Main thread calls Shutdown (signal #2) - Main thread waits - WorkerThread1 wakes up (from signal #1), picks up the shutdown-blocking task (becoming a shutdown-blocking thread), then signals (signal #3) - Main thread wakes up, sees signal #3, waits again since CanShutdown() is still false - WorkerThread2 wakes up (from signal #2), sees no work to be done and shutdown called, and exits, signalling again (signal #4) - Main thread wakes up, sees signal #4, waits again since CanShutdown() is still false - WorkerThread1 runs the blocking task and unmarks itself as a blocking thread, then sees that shutdown was called, and exits, signalling again (signal #5) - Main thread wakes up (from signal #5) and continues, since CanShutdown() is true. - Main thread joins all worker threads (deadlock, since WorkerThread3 was never woken up). The problem being that the signals intended to wake up other worker threads end up waking the main thread instead. Is this a valid scenario?
Your scenario is what I (tried to) talk about with regard to the Broadcast(). If you wanted to stick with one CV, you'd then go to using Broadcast when you start to shutdown. If you stay with the Signal chain, you need to have the second (distinct) CV. Jim On 2012/03/12 19:14:16, akalin wrote: > On 2012/03/12 18:39:20, akalin wrote: > > But you're right about has_work_cv_/can_shutdown_cv_. If you really want, I > can > > re-combine them, but at the very least, I'd like there to be more comments > about > > why the CV is used for different purposes. But I'm hesitant about doing so; > is > > there really anything to gain from combining them? It's already caused one > bug > > (since I think FlushForTesting() was added later, and whoever did assumed they > > could re-use the single CV) and I think the negative from that outweighs any > > possible advantage combining them would have. What do you think? > > Actually, I'm questioning the correctness of combining > has_work_cv_/can_shutdown_cv_ given that there is a "signal chain" when > Shutdown() is called (where one thread shutting down signals the next thread to > shut down). Let's say you have 3 worker threads and the work queue is empty. > Then: > > - Main thread posts shutdown-blocking task (signal #1) > - Main thread calls Shutdown (signal #2) > - Main thread waits > - WorkerThread1 wakes up (from signal #1), picks up the shutdown-blocking task > (becoming a shutdown-blocking thread), then signals (signal #3) > - Main thread wakes up, sees signal #3, waits again since CanShutdown() is still > false > - WorkerThread2 wakes up (from signal #2), sees no work to be done and shutdown > called, and exits, signalling again (signal #4) > - Main thread wakes up, sees signal #4, waits again since CanShutdown() is still > false > - WorkerThread1 runs the blocking task and unmarks itself as a blocking thread, > then sees that shutdown was called, and exits, signalling again (signal #5) > - Main thread wakes up (from signal #5) and continues, since CanShutdown() is > true. > - Main thread joins all worker threads (deadlock, since WorkerThread3 was never > woken up). > > The problem being that the signals intended to wake up other worker threads end > up waking the main thread instead. Is this a valid scenario?
LGTM Nits and comments below don't require or request any changes to land. http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... File base/threading/sequenced_worker_pool.cc (left): http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:404: cond_var_.Signal(); I *think* you are correct. This is a slightly (IMO) abnormal pattern that we get from this use of CVs, in that we get a signal for each new task, and work doesn't just spontaneously start appearing without signals (i.e., tasks in our queue don't just quietly insert more tasks). IMO, it is a little subtle, and I'm used to a simple "always signal when you take on work and can't Wait()," pattern. I think this might therefore be overkill... but ends up having no real consequence (I think). It sure makes it easy to see correctness though, IMO. On 2012/03/12 18:34:34, akalin wrote: > On 2012/03/12 17:39:13, jar wrote: > > This signal is basically asking that anyone else that is waiting should check > to > > see if they have work to do. Why did you delete it here? > > Hmm. If every task posted causes a signal, then wouldn't another thread be > woken up anyway? But put it back in nonetheless. http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:367: int SequencedWorkerPool::Inner::GetWorkSignalCountForTesting() const { nit: It bugs me a little that we've added this must testing infrastructure... and the optimizer will probably never be able to realize these are not needed. I wish these could be done in the unit test file, with minimal access given to a FriendTest class. For instance, it would be nice if we only handed the test class private access to our lock, or such, and let it craft CVs to its heart content :-/. For now, I'll focus on this CL, and its correctness... but I wanted to pass on that thought. http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:469: can_shutdown_cv_.Signal(); Placing this here is probably more optimal than I had envisioned. I figured you'd just put the extra signal outside the lock, and let the controlling process (if any) check to see if CanShutdown() (which it is already definitely planning to check, watching for spurious signals!). Please add the comment for the closing brace below, so It will be clear that this flow only happens when we are about to exit (and we've broken out of the loop). http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:470: } nit: Add comment: // Release lock_.
On 2012/03/12 19:35:54, jar wrote: > Your scenario is what I (tried to) talk about with regard to the Broadcast(). > If you wanted to stick with one CV, you'd then go to using Broadcast when you > start to shutdown. If you stay with the Signal chain, you need to have the > second (distinct) CV. Ah, okay. I didn't read your comments closely enough!
submitting as soon as trybots pass base_unittests Thanks for the review! http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... File base/threading/sequenced_worker_pool.cc (left): http://codereview.chromium.org/9651026/diff/6001/base/threading/sequenced_wor... base/threading/sequenced_worker_pool.cc:404: cond_var_.Signal(); On 2012/03/12 19:42:36, jar wrote: > I *think* you are correct. This is a slightly (IMO) abnormal pattern that we > get from this use of CVs, in that we get a signal for each new task, and work > doesn't just spontaneously start appearing without signals (i.e., tasks in our > queue don't just quietly insert more tasks). > > IMO, it is a little subtle, and I'm used to a simple "always signal when you > take on work and can't Wait()," pattern. > > I think this might therefore be overkill... but ends up having no real > consequence (I think). It sure makes it easy to see correctness though, IMO. I see. Okay, I just added a comment as to why it's left in. http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:367: int SequencedWorkerPool::Inner::GetWorkSignalCountForTesting() const { On 2012/03/12 19:42:36, jar wrote: > nit: It bugs me a little that we've added this must testing infrastructure... > and the optimizer will probably never be able to realize these are not needed. > I wish these could be done in the unit test file, with minimal access given to a > FriendTest class. For instance, it would be nice if we only handed the test > class private access to our lock, or such, and let it craft CVs to its heart > content :-/. > > For now, I'll focus on this CL, and its correctness... but I wanted to pass on > that thought. I think I can remove this variable and instead add another method to the TestingObserver class that is called whenever has_work_cv_ is signalled. I'll do that in another CL. http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:469: can_shutdown_cv_.Signal(); On 2012/03/12 19:42:36, jar wrote: > Placing this here is probably more optimal than I had envisioned. I figured > you'd just put the extra signal outside the lock, and let the controlling > process (if any) check to see if CanShutdown() (which it is already definitely > planning to check, watching for spurious signals!). Yeah, I think that's simpler. I moved it out. http://codereview.chromium.org/9651026/diff/12001/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:470: } On 2012/03/12 19:42:36, jar wrote: > nit: Add comment: // Release lock_. Done. |