Chromium Code Reviews

Issue 17517002: Make sure that the UI window created by base::MessagePumpForUI is destroyed on the same thread (Wind… (Closed)

Created:
7 years, 6 months ago by alexeypa (please no reviews)
Modified:
7 years, 5 months ago
CC:
chromium-reviews, glider+watch_chromium.org, erikwright+watch_chromium.org, sadrul, timurrrr+watch_chromium.org, bruening+watch_chromium.org
Visibility:
Public.

Description

Make sure that the UI window created by base::MessagePumpForUI is destroyed on the same thread (Windows).

Currently the window created by base::MessagePumpForUI can be destroyed on the wrong thread. base::MessagePumpForUI is a ref-counted class, so it can (and does) outlive the owning base::MessageLoop. As a result, DestroyWindow() can be called on the wrong thread. This makes TSan unhappy, and it reports races deep inside user32.dll.

Changes in this CL:
- The message pump is now notified when the owning message loop is being destroyed. The notification is used to free all resources that have to be released on the base::MessageLoop's thread.
- MessagePumpForUI::ScheduleWork() synchronizes access to the message-only window handle to avoid posting messages to the window during or after its destruction.

BUG=241939
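The two changes described above can be sketched as follows. This is a minimal, platform-neutral illustration, not the actual Chromium code: the class and member names (`MessagePumpSketch`, `have_window_`, `posted_messages_`) are hypothetical, and a plain flag plus a counter stand in for the real `HWND` and `::PostMessage`/`::DestroyWindow` calls.

```cpp
#include <mutex>

// Simplified sketch of the locking pattern described in the CL description.
// Hypothetical names; the real pump uses an HWND and Win32 calls.
class MessagePumpSketch {
 public:
  // May be called from any thread. Takes the lock so it never "posts" to the
  // window during or after its destruction; returns false if the wake-up was
  // dropped because the window is already gone.
  bool ScheduleWork() {
    std::lock_guard<std::mutex> lock(hwnd_lock_);
    if (!have_window_)
      return false;       // Window already destroyed; drop the wake-up.
    ++posted_messages_;   // Stands in for ::PostMessage(message_hwnd_, ...).
    return true;
  }

  // Called on the owning MessageLoop's thread when the loop is destroyed,
  // mirroring the new notification added in this CL. Frees the resources
  // that must be released on the loop's thread.
  void WillDestroyCurrentMessageLoop() {
    std::lock_guard<std::mutex> lock(hwnd_lock_);
    have_window_ = false;  // Stands in for ::DestroyWindow(message_hwnd_).
  }

  int posted_messages() const { return posted_messages_; }

 private:
  std::mutex hwnd_lock_;
  bool have_window_ = true;
  int posted_messages_ = 0;
};
```

Because a ref-counted pump can outlive its loop, `ScheduleWork()` callers on other threads may still hold a reference after the loop dies; the lock plus the cleared flag make those late calls harmless no-ops instead of posts to a destroyed (or reused) window handle.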

Patch Set 1 #

Patch Set 2 : - #

Unified diffs Side-by-side diffs Delta from patch set Stats (+84 lines, -44 lines) Patch
M base/message_loop/message_loop.cc View 1 chunk +3 lines, -0 lines 0 comments Download
M base/message_loop/message_pump.h View 1 chunk +6 lines, -0 lines 0 comments Download
M base/message_loop/message_pump_android.h View 1 chunk +1 line, -0 lines 0 comments Download
M base/message_loop/message_pump_android.cc View 1 chunk +3 lines, -0 lines 0 comments Download
M base/message_loop/message_pump_default.h View 1 chunk +1 line, -0 lines 0 comments Download
M base/message_loop/message_pump_default.cc View 1 chunk +3 lines, -0 lines 0 comments Download
M base/message_loop/message_pump_glib.h View 1 chunk +1 line, -0 lines 0 comments Download
M base/message_loop/message_pump_glib.cc View 1 chunk +3 lines, -0 lines 0 comments Download
M base/message_loop/message_pump_libevent.h View 1 chunk +1 line, -0 lines 0 comments Download
M base/message_loop/message_pump_libevent.cc View 1 chunk +3 lines, -0 lines 0 comments Download
M base/message_loop/message_pump_mac.h View 1 chunk +1 line, -0 lines 0 comments Download
M base/message_loop/message_pump_mac.mm View 1 chunk +3 lines, -0 lines 0 comments Download
M base/message_loop/message_pump_win.h View 5 chunks +10 lines, -4 lines 0 comments Download
M base/message_loop/message_pump_win.cc View 1 5 chunks +45 lines, -9 lines 0 comments Download
M tools/valgrind/tsan/suppressions.txt View 1 chunk +0 lines, -31 lines 0 comments Download

Messages

Total messages: 16 (0 generated)
alexeypa (please no reviews)
Relanding r207278. It got reverted because a race around |message_hwnd_lock_| broke |NavigationControllerTest.PurgeScreenshot| unit test. Patch ...
7 years, 6 months ago (2013-06-20 21:11:57 UTC) #1
darin (slow to review)
What was wrong with the old patch exactly? Does the problem arise due to two ...
7 years, 6 months ago (2013-06-20 21:16:07 UTC) #2
alexeypa (please no reviews)
On 2013/06/20 21:16:07, darin wrote: > What was wrong with the old patch exactly? Does ...
7 years, 6 months ago (2013-06-20 21:26:11 UTC) #3
rvargas (doing something else)
On 2013/06/20 21:26:11, alexeypa wrote: > On 2013/06/20 21:16:07, darin wrote: > > What was ...
7 years, 6 months ago (2013-06-20 21:30:09 UTC) #4
alexeypa (please no reviews)
On 2013/06/20 21:30:09, rvargas wrote: > Shouldn't we then invert the order of Try and ...
7 years, 6 months ago (2013-06-20 21:37:11 UTC) #5
rvargas (doing something else)
On 2013/06/20 21:37:11, alexeypa wrote: > On 2013/06/20 21:30:09, rvargas wrote: > > Shouldn't we ...
7 years, 6 months ago (2013-06-20 21:42:45 UTC) #6
alexeypa (please no reviews)
On 2013/06/20 21:42:45, rvargas wrote: > Sorry, why not? The point of the lock is ...
7 years, 6 months ago (2013-06-20 22:09:44 UTC) #7
rvargas (doing something else)
On 2013/06/20 22:09:44, alexeypa wrote: > On 2013/06/20 21:42:45, rvargas wrote: > > Sorry, why ...
7 years, 6 months ago (2013-06-20 22:17:35 UTC) #8
alexeypa (please no reviews)
On 2013/06/20 22:17:35, rvargas wrote: > So?. The point of calling ScheduleWork is to tell ...
7 years, 6 months ago (2013-06-20 22:47:16 UTC) #9
alexeypa (please no reviews)
ping
7 years, 6 months ago (2013-06-21 21:34:01 UTC) #10
rvargas (doing something else)
On 2013/06/20 22:47:16, alexeypa wrote: > On 2013/06/20 22:17:35, rvargas wrote: > > So?. The ...
7 years, 6 months ago (2013-06-21 21:41:33 UTC) #11
alexeypa (please no reviews)
On 2013/06/21 21:41:33, rvargas wrote: > And I'm thinking that adding more synchronization is not ...
7 years, 6 months ago (2013-06-21 22:22:56 UTC) #12
rvargas (doing something else)
On 2013/06/21 22:22:56, alexeypa wrote: > On 2013/06/21 21:41:33, rvargas wrote: > > And I'm ...
7 years, 6 months ago (2013-06-21 23:10:38 UTC) #13
alexeypa (please no reviews)
> In fact, it would not be very > nice to enter the dispatching code ...
7 years, 6 months ago (2013-06-21 23:46:04 UTC) #14
alexeypa (please no reviews)
FYI: here is my attempt to make MessagePump not ref-counted: https://chromiumcodereview.appspot.com/17567007/.
7 years, 6 months ago (2013-06-24 23:21:30 UTC) #15
rvargas (doing something else)
7 years, 6 months ago (2013-06-25 23:08:39 UTC) #16
On 2013/06/21 23:46:04, alexeypa wrote:
> > In fact, it would not be very nice to enter the dispatching code with a
> > lock grabbed (even ignoring the fact that one of the tasks can attempt to
> > delete the loop), and that's the other thing that now I don't like about
> > adding another lock.
> 
> We are already doing that when another thread posts a task via
> MessageLoopProxy.
> 
> > Maybe... but it requires:
> > 0. Thread B grabs A's pump
> > 1. Thread A dies (message_hwnd_ closed)
> > 2. OS decides it is time to reuse handle (message_hwnd_)
> > 3. Thread B creates window and receives the same value as message_hwnd_
> > 4. Thread B remembers that it has to decrement a refcount for A's pump
> > 5. Thread B executes A's pump destructor
> > 6. We call DestroyWindow on a wrong handle
> 
> This is not the only problem that can occur. For instance, Thread A may not
> die (but instead create another loop). Then there will be a leaked window.
> It might even receive the last |kHaveWorkMsg| after the pump is destroyed.
> I bet no one looking at this code would think about such a possibility.
> 
> One of the reasons I want to replace message_hwnd_ with
> base::win::MessageWindow is that it has more checks than the current message
> pump code. The CL with this change validates that base::win::MessageWindow
> is destroyed on the proper thread and that public methods of MessagePumpUI
> are called on the right thread. It verifies the values returned from
> DestroyWindow and UnregisterClass. Basically, it is easy to make assumptions
> when we say 100% of races are not allowed. Once we allow a particular race
> to happen, it suddenly becomes much harder to reason about the code and make
> sure no scenarios were overlooked.
> 
> > Still, if we want to fix it we can, but we don't need a lock for it :)
> 
> I don't see yet how this can be fixed without a lock (assuming we keep
> MessagePump ref-counted).
> 
> > Do you have a link handy? The original bug report does not reflect this
> > scenario, and the suppression is not used on a recent run I looked at
> > yesterday.
> 
> Interesting. I thought the suppressions were still used. I guess the
> previous fix (a generated window class name) took care of it.
If message_hwnd_ is replaced by a general-purpose class that does locking, then
the magic of composition would lend another lock to the message pump/loop, and
that would probably be OK (out of sight, out of mind). If there is a
general-purpose class that handles the message window, then most likely that
class will have to worry about making sure that its API cannot be abused or end
up deadlocking, corrupting, or leaking resources.

However, if we are dealing with the message pump, the code can adjust to the
expected or desired properties/behavior of the message pump/loop/loop_proxy
(this is not exactly general-purpose code, even if it lives in base). In that
context, a leak may be bad, or it may not be (message loops are more often than
not a way to interact with a given thread; it's not like the common pattern is
to repeatedly cycle loops on a given thread, even in the context of tests). So
doing nothing is a real option, especially after seeing the giant lock in
MessageLoopProxyImpl::PostTaskHelper. But I fully understand that doing nothing
is not the likely outcome here :)

How would I fix this without a lock while keeping the ref count? I don't
remember what I was thinking at the time, but it certainly involved not
supporting repeated creation of a pump on a given thread. It probably didn't
work, but I think it was something like skipping DestroyWindow on random
threads.

Back to the real world: I still think that the lock here seems weird, so I
would prefer a bigger knife (thanks for taking the time to prototype something
else). That said, if Darin thinks we should keep trying an internal lock of
some sort, I would not argue (and in that case the lock in this CL seems good
enough, certainly the simplest fix). It's just that at this point the pump is
starting to sound like it has accumulated rust (especially after going through
all the pieces on the callers).
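The lock-free alternative alluded to above ("skipping DestroyWindow on random threads") could look roughly like this. This is a hypothetical sketch of that idea only, not code from this CL: the class name `ThreadAffineWindow` is invented, and a boolean stands in for the real window. The trade-off it encodes is exactly the one discussed: a deliberate leak on the wrong thread instead of a cross-thread DestroyWindow race.

```cpp
#include <thread>

// Hypothetical sketch: record the thread that created the window and simply
// skip the (simulated) DestroyWindow when the destructor runs elsewhere,
// accepting a leaked window rather than a race inside user32.dll.
class ThreadAffineWindow {
 public:
  ThreadAffineWindow() : owner_(std::this_thread::get_id()) {}

  // Returns true if the (simulated) window was actually destroyed.
  bool Destroy() {
    if (std::this_thread::get_id() != owner_)
      return false;  // Wrong thread: leak the window instead of racing.
    destroyed_ = true;  // Stands in for ::DestroyWindow(message_hwnd_).
    return true;
  }

  bool destroyed() const { return destroyed_; }

 private:
  const std::thread::id owner_;  // Thread that created the window.
  bool destroyed_ = false;
};
```

This matches the "doing nothing is a real option" position: a leaked message-only window per abandoned pump is cheap, but, as the quoted discussion notes, it only stays safe if pumps are not repeatedly created on the same thread.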

This is Rietveld 408576698