Created: 7 years, 3 months ago by Pawel Osciak
Modified: 7 years, 3 months ago
CC: chromium-reviews, joi+watch-content_chromium.org, feature-media-reviews_chromium.org, jam, darin-cc_chromium.org
Base URL: http://git.chromium.org/chromium/src.git@master
Visibility: Public.
Description:
Fix webrtc HW encode deadlock scenarios.
Webrtc unfortunately likes to sleep in BaseChannel::Send on the renderer's
ChildThread while directly or indirectly calling into the HW encoder. Since we
also try to use the ChildThread to allocate shared memory to service those
calls, we end up in a number of deadlocks of varying complexity.
The only way to avoid this is to not get onto the ChildThread while servicing
webrtc requests, so we use the static ChildThread::AllocateSharedMemory()
to send the request directly from the current thread.
Also add VEA::RequireBitstreamBuffers() to the initialization sequence, so that
RTCVideoEncoder::InitEncode() will only return after we've allocated the
requested buffers. VEA::RequireBitstreamBuffers() is effectively part of the
initialization sequence anyway, because we can't call VEA::Encode() without
knowing the VEA impl's coded size requirements. This could also potentially
reduce the latency of the first Encode() call.
And separately, zero out header structures sent to the client.
TEST=apprtc.appspot.com with HW encoding
BUG=260210
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=221395
Patch Set 1 #
Total comments: 8
Patch Set 2 : #
Messages
Total messages: 15 (0 generated)
Launching a thread per web feature is a bad idea (resource exhaustion attacks).

The ownership semantics of the thread (owned by the original Factories object; clones get a ref to the loop proxy) are incomplete; nothing guarantees that a Factories that was Clone()d wasn't also deleted while its Clone()s continue living.

This CL feels like an attempt to hack around bugs/shortcomings in libjingle code. Can we fix libjingle, instead?

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
File content/renderer/media/rtc_video_encoder.cc (right):

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
Nothing guarantees that RequireBB will be called until Encode() is called, which won't happen until initialization is done. The justification for this in the CL description is insufficient. If RequireBB was really part of VEA initialization, then NotifyInitializeDone should not be triggered before RBB/UOBB were done.
On 2013/09/03 20:09:53, Ami Fischman wrote:
> Launching a thread per web feature is a bad idea (resource exhaustion attacks).

I don't like it any more than you do :) I actually held off on publishing this CL to a wider audience and just shared it for testing and to have something concrete as a base for discussion as one of the alternatives. Looks like John deemed it publish-worthy though :)

In any case, I'm operating under the assumption that if webrtc sleeps on ChildThread, it must have a very good reason to do so. I was going to discuss it today nevertheless, after everyone came back from holiday.

There is one more way to solve this actually, but it's more involved: we could move allocation of shared memory to the GPU process and share it from there instead. Also a bit hacky, but less than this solution maybe.

> The ownership semantics of the thread (owned by the original Factories object,
> clones get ref to loopproxy) are incomplete; nothing guarantees that a Factories
> that was Clone()d wasn't also deleted while its Clone()s continue living.

The Thread is owned by the original factories only, which are in the RenderThreadImpl. It's not a problem if any of the clones is deleted; no clones get it, it's not copied. Only the original owns it. And RenderThread is the ChildThread, isn't it? Why is it less safe than copying proxies to the ChildThread and using the ChildThread to allocate memory on instead?

> This CL feels like an attempt to hack around bugs/shortcomings in libjingle
> code. Can we fix libjingle, instead?

Yes, I'd like to do that, but I don't think fixing threading in webrtc/libjingle can be done overnight. And the deadlocks in this code are very real and already there, and I hit them almost all the time, so it's at least a temporary solution.

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
File content/renderer/media/rtc_video_encoder.cc (right):

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
On 2013/09/03 20:09:54, Ami Fischman wrote:
> Nothing guarantees that RequireBB will be called until Encode() is called, which
> won't happen until initialization is done. The justification for this in the CL
> description is insufficient.
> If RequireBB was really part of VEA initialization, then NotifyInitializeDone
> should not be triggered before RBB/UOBB were done.

Well, if you look at how this class does it right now, it silently assumes that RequireBB has finished before the first Encode() is called on RTCVE. RTCVE::Encode() gets into Impl::Enqueue(), and if input_buffers_free is empty, it doesn't call EncodeOneFrame(), just returns, so Encode() will wait forever. My CL is an attempt to fix this as well.
> In any case, I'm operating under the assumption that if webrtc sleeps on
> ChildThread, it must have a very good reason to do so. I was going to
> discuss it today nevertheless, after everyone came back from holiday.

Yeah, I don't think that's the case. I suspect that the real answer is that the authors of those sleeps didn't realize they'd be blocking the renderer (and JS!).

> There is one more way to solve this actually, but it's more involved: we could
> move allocation of shared memory to the GPU process and share it from there
> instead. Also a bit hacky, but less than this solution maybe.

This is still a workaround (and one that affects the API!) for an implementation detail which I bet could be fixed at the root (namely, stop having webrtc sleep).

>> The ownership semantics of the thread (owned by the original Factories
>> object, clones get ref to loopproxy) are incomplete; nothing guarantees
>> that a Factories that was Clone()d wasn't also deleted while its Clone()s
>> continue living.
>
> The Thread is owned by the original factories only, which are in the
> RenderThreadImpl. It's not a problem if any of the clones is deleted, no
> clones get it, it's not copied. Only the original owns it. And RenderThread
> is the ChildThread, isn't it? Why is it less safe than copying proxies to
> the ChildThread and using ChildThread to allocate memory on instead?

My point is that you're adding to the impl of Factories the knowledge that there's a magical "original" copy stored in the RTI, but that's not guaranteed in any of the interfaces that Factories knows about.

>> This CL feels like an attempt to hack around bugs/shortcomings in libjingle
>> code. Can we fix libjingle, instead?
>
> Yes, I'd like to do that, but I don't think fixing threading in
> webrtc/libjingle can be done overnight. And the deadlocks in this code are
> very real and already there, and I hit them almost all the time, so it's at
> least a temporary solution.
:( I bet once you start you'll find it's not as hard as you think it is.

> https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_video_encoder.cc
> File content/renderer/media/rtc_video_encoder.cc (right):
>
> https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_video_encoder.cc#newcode300
> content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
> On 2013/09/03 20:09:54, Ami Fischman wrote:
>> Nothing guarantees that RequireBB will be called until Encode() is
>> called, which won't happen until initialization is done. The
>> justification for this in the CL description is insufficient.
>> If RequireBB was really part of VEA initialization, then
>> NotifyInitializeDone should not be triggered before RBB/UOBB were done.
>
> Well, if you look at how this class does it right now, it silently
> assumes that RequireBB has been finished before first Encode() is called
> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
> returns, so Encode() will wait forever. My CL is an attempt to fix this
> as well.

Can you instead fix it so that a legit VEA impl that happened to wait for the first Encode to RBB still works?
>>> If RequireBB was really part of VEA initialization, then
>>> NotifyInitializeDone should not be triggered before RBB/UOBB were done.
>>
>> Well, if you look at how this class does it right now, it silently
>> assumes that RequireBB has been finished before first Encode() is called
>> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
>> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
>> returns, so Encode() will wait forever. My CL is an attempt to fix this
>> as well.
>
> Can you instead fix it so that a legit VEA impl that happened to wait for
> the first Encode to RBB still works?

I don't see any use for this though; it would make clients more complicated, and nothing in the contents of input raw frames should change the VEA impl's mind about its buffer requirements... And on the other hand, we kind of need the impl's requirements for coded size before we start sending input frames to it.

To be honest, I've actually all this time thought the API was requiring RBB before Encode(). If this is not the case, haven't we overengineered it a little bit?
On Wed, Sep 4, 2013 at 4:37 AM, <posciak@chromium.org> wrote:
>>>> If RequireBB was really part of VEA initialization, then
>>>> NotifyInitializeDone should not be triggered before RBB/UOBB were done.
>>>
>>> Well, if you look at how this class does it right now, it silently
>>> assumes that RequireBB has been finished before first Encode() is called
>>> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
>>> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
>>> returns, so Encode() will wait forever. My CL is an attempt to fix this
>>> as well.
>>
>> Can you instead fix it so that a legit VEA impl that happened to wait for
>> the first Encode to RBB still works?
>
> I don't see any use for this though, it would make clients more
> complicated and nothing in the contents of input raw frames should change
> VEA impl's mind about its buffer requirements...

Changing the bitrate should trigger a new RBB round, no?
(it doesn't today because exynos bugs cause EVEA to always request maximally sized buffers)

> And on the other hand, we kind of need impl's
> requirements for coded size before we start sending input frames to it.

Why?

> https://chromiumcodereview.appspot.com/23440015/
>>>> Well, if you look at how this class does it right now, it silently
>>>> assumes that RequireBB has been finished before first Encode() is called
>>>> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
>>>> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
>>>> returns, so Encode() will wait forever. My CL is an attempt to fix this
>>>> as well.
>>>
>>> Can you instead fix it so that a legit VEA impl that happened to wait for
>>> the first Encode to RBB still works?
>>
>> I don't see any use for this though, it would make clients more
>> complicated and nothing in the contents of input raw frames should change
>> VEA impl's mind about its buffer requirements...
>
> Changing the bitrate should trigger a new RBB round, no?
> (it doesn't today because exynos bugs cause EVEA to always request
> maximally sized buffers)
>
>> And on the other hand, we kind of need impl's
>> requirements for coded size before we start sending input frames to it.
>
> Why?

RBB specifies the input_coded_size param, which specifies the size of the frames that must be passed to Encode(). So RBB implicitly must come before Encode().

I think this change is safe enough in that, since RBB always comes before Encode() (and after InitializeDone()), we can afford to wait a little longer on init to make sure that Encode() latency is consistent starting with the first frame.
https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/renderer_gpu_video_accelerator_factories.cc (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:393: CHECK(shared_memory_segment_->CreateAndMapAnonymous(size));
Do you even sandbox, bro? How is this working inside the renderer sandbox?
https://code.google.com/p/chromium/codesearch#chromium/src/content/child/chil...

If you _could_ create shm's on arbitrary renderer-process threads (as this CL implies, and which I think should actually be possible if you use a thread-safe IPC sender and the static ChildThread::AllocateSharedMemory() overload) then you wouldn't need the trampoline in the first place - just do the create from the calling thread!

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:394: render_thread_async_waiter_.Signal();
This is not legit considering it's no longer the render_thread we're running on (so, for example, two outstanding operations could be extant, one to the render thread and one to here, and this signal would terminate both waits, and so on). But if you take my suggestion above then this code is deleted and no worries.

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/renderer_gpu_video_accelerator_factories.h (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.h:153: scoped_ptr<base::Thread> shm_alloc_thread_;
Previously you said that "the original factories only, which are in the RenderThreadImpl", but AFAICT RTI::GetGpuFactories doesn't retain a ref to the Factories it returns, so I don't understand what keeps the original Factories alive (and thus the shm_alloc_thread_) for the lifetime of Clone()'d Factories.

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/rtc_video_encoder.cc (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
sheu@'s point about input_coded_size is reasonable to me. I forgot that we let that bleed into the API, but it certainly guarantees RBB before first Encode. The diff in this file is fine by me now.
https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/renderer_gpu_video_accelerator_factories.cc (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:393: CHECK(shared_memory_segment_->CreateAndMapAnonymous(size));
On 2013/09/04 20:47:56, Ami Fischman wrote:
> Do you even sandbox, bro?
> How is this working inside the renderer sandbox?
> https://code.google.com/p/chromium/codesearch#chromium/src/content/child/chil...
>
> If you _could_ create shm's on arbitrary renderer-process threads (as this CL
> implies, and which I think should actually be possible if you use a thread-safe
> IPC sender and the static ChildThread::AllocateSharedMemory() overload) then you
> wouldn't need the trampoline in the first place - just do the create from the
> calling thread!

Real men don't sandbox. Done for all the others.

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:394: render_thread_async_waiter_.Signal();
On 2013/09/04 20:47:56, Ami Fischman wrote:
> This is not legit considering it's no longer the render_thread we're running on
> (so, for example, two outstanding operations could be extant, one to the render
> thread and one to here, and this signal would terminate both waits, and so on).
> But if you take my suggestion above then this code is deleted and no worries.

Done.
LGTM % CL description needing a rewrite
On 2013/09/05 06:18:28, Ami Fischman wrote: > LGTM % CL description needing a rewrite Thanks. Yeah, I almost remembered this time that git cl upload doesn't update the commit message here if I update it locally. Thanks for noticing.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/posciak@chromium.org/23440015/13001
Retried try job too often on ios_dbg_simulator for step(s) ui_unittests http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=ios_dbg_si...
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/posciak@chromium.org/23440015/13001
Message was sent while issue was closed.
Change committed as 221395