Created: 7 years, 3 months ago by Pawel Osciak
Modified: 7 years, 3 months ago
CC: chromium-reviews, joi+watch-content_chromium.org, feature-media-reviews_chromium.org, jam, darin-cc_chromium.org
Base URL: http://git.chromium.org/chromium/src.git@master
Visibility: Public.
Description:
Fix webrtc HW encode deadlock scenarios.
Webrtc unfortunately likes to sleep in BaseChannel::Send on the renderer's
ChildThread while directly or indirectly calling into the HW encoder. Since we
also try to use the ChildThread to allocate shared memory to service those
calls, we end up in a number of deadlocks of varying complexity.
The only way to avoid this is to not get onto the ChildThread while servicing
webrtc requests, so we use the static ChildThread::AllocateSharedMemory()
to send the request directly from the current thread.
Also add VEA::RequireBitstreamBuffers() to the initialization sequence, so that
RTCVideoEncoder::InitEncode() will only return after we've allocated the
requested buffers. VEA::RequireBitstreamBuffers() is effectively part of the
initialization sequence anyway, because we can't call VEA::Encode() without
knowing the VEA impl's coded size requirements. This could also potentially
reduce the latency of the first Encode() call.
And separately, zero out header structures sent to the client.
TEST=apprtc.appspot.com with HW encoding
BUG=260210
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=221395
Patch Set 1 #
Total comments: 8
Patch Set 2 : #
Messages
Total messages: 15 (0 generated)
Launching a thread per web feature is a bad idea (resource exhaustion attacks).

The ownership semantics of the thread (owned by the original Factories object; clones get a ref to the loop proxy) are incomplete; nothing guarantees that a Factories that was Clone()d wasn't also deleted while its Clone()s continue living.

This CL feels like an attempt to hack around bugs/shortcomings in libjingle code. Can we fix libjingle, instead?

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
File content/renderer/media/rtc_video_encoder.cc (right):

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
Nothing guarantees that RequireBB will be called until Encode() is called, which won't happen until initialization is done. The justification for this in the CL description is insufficient. If RequireBB was really part of VEA initialization, then NotifyInitializeDone should not be triggered before RBB/UOBB were done.
On 2013/09/03 20:09:53, Ami Fischman wrote:
> Launching a thread per web feature is a bad idea (resource exhaustion attacks).

I don't like it any more than you do :) I actually held off on publishing this CL to a wider audience and just shared it for testing and to have something concrete as a base for discussion as one of the alternatives. Looks like John deemed it publish-worthy though :)

In any case, I'm operating under the assumption that if webrtc sleeps on ChildThread, it must have a very good reason to do so. I was going to discuss it today nevertheless, after everyone came back from holiday.

There is one more way to solve this actually, but it's more involved: we could move allocation of shared memory to the GPU process and share it from there instead. Also a bit hacky, but less than this solution maybe.

> The ownership semantics of the thread (owned by the original Factories object,
> clones get ref to loopproxy) are incomplete; nothing guarantees that a Factories
> that was Clone()d wasn't also deleted while its Clone()s continue living.

The Thread is owned by the original factories only, which are in the RenderThreadImpl. It's not a problem if any of the clones is deleted; no clones get it, it's not copied. Only the original owns it. And RenderThread is the ChildThread, isn't it? Why is it less safe than copying proxies to the ChildThread and using the ChildThread to allocate memory on instead?

> This CL feels like an attempt to hack around bugs/shortcomings in libjingle
> code. Can we fix libjingle, instead?

Yes, I'd like to do that, but I don't think fixing threading in webrtc/libjingle can be done overnight. And the deadlocks in this code are very real and already there, and I hit them almost all the time, so it's at least a temporary solution.

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
File content/renderer/media/rtc_video_encoder.cc (right):

https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_vi...
content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
On 2013/09/03 20:09:54, Ami Fischman wrote:
> Nothing guarantees that RequireBB will be called until Encode() is called, which
> won't happen until initialization is done. The justification for this in the CL
> description is insufficient.
> If RequireBB was really part of VEA initialization, then NotifyInitializeDone
> should not be triggered before RBB/UOBB were done.

Well, if you look at how this class does it right now, it silently assumes that RequireBB has finished before the first Encode() is called on RTCVE. RTCVE::Encode() gets into Impl::Enqueue(), and if input_buffers_free is empty, it doesn't call EncodeOneFrame(), just returns, so Encode() will wait forever. My CL is an attempt to fix this as well.
> In any case, I'm operating under the assumption that if webrtc sleeps on
> ChildThread, it must have a very good reason to do so. I was going to
> discuss it today nevertheless, after everyone came back from holiday.

Yeah, I don't think that's the case. I suspect that the real answer is that the authors of those sleeps didn't realize they'd be blocking the renderer (and JS!).

> There is one more way to solve this actually, but it's more involved: we could
> move allocation of shared memory to the GPU process and share it from there
> instead. Also a bit hacky, but less than this solution maybe.

This is still a workaround (and one that affects the API!) for an implementation detail which I bet could be fixed at the root (namely, stop having webrtc sleep).

>> The ownership semantics of the thread (owned by the original Factories
>> object, clones get ref to loopproxy) are incomplete; nothing guarantees
>> that a Factories that was Clone()d wasn't also deleted while its Clone()s
>> continue living.
>
> The Thread is owned by the original factories only, which are in the
> RenderThreadImpl. It's not a problem if any of the clones is deleted, no
> clones get it, it's not copied. Only the original owns it. And RenderThread
> is the ChildThread, isn't it? Why is it less safe than copying proxies to
> the ChildThread and using ChildThread to allocate memory on instead?

My point is that you're adding to the impl of Factories the knowledge that there's a magical "original" copy stored in the RTI, but that's not guaranteed in any of the interfaces that Factories knows about.

>> This CL feels like an attempt to hack around bugs/shortcomings in libjingle
>> code. Can we fix libjingle, instead?
>
> Yes, I'd like to do that, but I don't think fixing threading in
> webrtc/libjingle can be done overnight. And the deadlocks in this code are
> very real and already there, and I hit them almost all the time, so it's at
> least a temporary solution.
:( I bet once you start you'll find it's not as hard as you think it is.

> https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_video_encoder.cc
> File content/renderer/media/rtc_video_encoder.cc (right):
>
> https://codereview.chromium.org/23440015/diff/1/content/renderer/media/rtc_video_encoder.cc#newcode300
> content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
> On 2013/09/03 20:09:54, Ami Fischman wrote:
>> Nothing guarantees that RequireBB will be called until Encode() is
>> called, which won't happen until initialization is done. The
>> justification for this in the CL description is insufficient.
>> If RequireBB was really part of VEA initialization, then
>> NotifyInitializeDone should not be triggered before RBB/UOBB were done.
>
> Well, if you look at how this class does it right now, it silently
> assumes that RequireBB has been finished before first Encode() is called
> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
> returns, so Encode() will wait forever. My CL is an attempt to fix this
> as well.

Can you instead fix it so that a legit VEA impl that happened to wait for the first Encode to RBB still works?
>>> If RequireBB was really part of VEA initialization, then
>>> NotifyInitializeDone should not be triggered before RBB/UOBB were done.
>>
>> Well, if you look at how this class does it right now, it silently
>> assumes that RequireBB has been finished before first Encode() is called
>> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
>> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
>> returns, so Encode() will wait forever. My CL is an attempt to fix this
>> as well.
>
> Can you instead fix it so that a legit VEA impl that happened to wait for
> the first Encode to RBB still works?

I don't see any use for this though; it would make clients more complicated, and nothing in the contents of input raw frames should change the VEA impl's mind about its buffer requirements... And on the other hand, we kind of need the impl's requirements for coded size before we start sending input frames to it.

To be honest, I've actually all this time thought the API was requiring RBB before Encode(). If this is not the case, haven't we overengineered it a little bit?
On Wed, Sep 4, 2013 at 4:37 AM, <posciak@chromium.org> wrote:
>>>> If RequireBB was really part of VEA initialization, then
>>>> NotifyInitializeDone should not be triggered before RBB/UOBB were done.
>>>
>>> Well, if you look at how this class does it right now, it silently
>>> assumes that RequireBB has been finished before first Encode() is called
>>> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
>>> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
>>> returns, so Encode() will wait forever. My CL is an attempt to fix this
>>> as well.
>>
>> Can you instead fix it so that a legit VEA impl that happened to wait for
>> the first Encode to RBB still works?
>
> I don't see any use for this though, it would make clients more
> complicated and nothing in the contents of input raw frames should change
> VEA impl's mind about its buffer requirements...

Changing the bitrate should trigger a new RBB round, no?
(it doesn't today because exynos bugs cause EVEA to always request maximally sized buffers)

> And on the other hand, we kind of need impl's
> requirements for coded size before we start sending input frames to it.

Why?

> https://chromiumcodereview.appspot.com/23440015/
>>>> Well, if you look at how this class does it right now, it silently
>>>> assumes that RequireBB has been finished before first Encode() is called
>>>> on RTCVE. RTCVE::Encode() gets into Impl::Enqueue() and if
>>>> input_buffers_free is empty, it doesn't call EncodeOneFrame(), just
>>>> returns, so Encode() will wait forever. My CL is an attempt to fix this
>>>> as well.
>>>
>>> Can you instead fix it so that a legit VEA impl that happened to wait for
>>> the first Encode to RBB still works?
>>
>> I don't see any use for this though, it would make clients more
>> complicated and nothing in the contents of input raw frames should change
>> VEA impl's mind about its buffer requirements...
>
> Changing the bitrate should trigger a new RBB round, no?
> (it doesn't today because exynos bugs cause EVEA to always request
> maximally sized buffers)
>
>> And on the other hand, we kind of need impl's
>> requirements for coded size before we start sending input frames to it.
>
> Why?

RBB specifies the input_coded_size param, which specifies the size of the frames that must be passed to Encode(). So RBB implicitly must come before Encode().

I think this change is safe enough in that, since RBB always comes before Encode() (and after InitializeDone()), we can afford to wait a little longer on init to make sure that Encode() latency is consistent starting with the first frame.
https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/renderer_gpu_video_accelerator_factories.cc (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:393: CHECK(shared_memory_segment_->CreateAndMapAnonymous(size));
Do you even sandbox, bro? How is this working inside the renderer sandbox?
https://code.google.com/p/chromium/codesearch#chromium/src/content/child/chil...

If you _could_ create shm's on arbitrary renderer-process threads (as this CL implies, and which I think should actually be possible if you use a thread-safe IPC sender and the static ChildThread::AllocateSharedMemory() overload) then you wouldn't need the trampoline in the first place - just do the create from the calling thread!

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:394: render_thread_async_waiter_.Signal();
This is not legit considering it's no longer the render_thread we're running on (so, for example, two outstanding operations could be extant, one to the render thread and one to here, and this signal would terminate both waits, and so on). But if you take my suggestion above then this code is deleted and no worries.

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/renderer_gpu_video_accelerator_factories.h (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.h:153: scoped_ptr<base::Thread> shm_alloc_thread_;
Previously you said that "the original factories only, which are in the RenderThreadImpl", but AFAICT RTI::GetGpuFactories doesn't retain a ref to the Factories it returns, so I don't understand what keeps the original Factories alive (and thus the shm_alloc_thread_) for the lifetime of Clone()'d Factories.

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/rtc_video_encoder.cc (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/rtc_video_encoder.cc:300: SignalAsyncWaiter(WEBRTC_VIDEO_CODEC_OK);
sheu@'s point about input_coded_size is reasonable to me. I forgot that we let that bleed into the API, but it certainly guarantees RBB before first Encode. The diff in this file is fine by me now.
https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
File content/renderer/media/renderer_gpu_video_accelerator_factories.cc (right):

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:393: CHECK(shared_memory_segment_->CreateAndMapAnonymous(size));
On 2013/09/04 20:47:56, Ami Fischman wrote:
> Do you even sandbox, bro?
> How is this working inside the renderer sandbox?
> https://code.google.com/p/chromium/codesearch#chromium/src/content/child/chil...
>
> If you _could_ create shm's on arbitrary renderer-process threads (as this CL
> implies, and which I think should actually be possible if you use a thread-safe
> IPC sender and the static ChildThread::AllocateSharedMemory() overload) then you
> wouldn't need the trampoline in the first place - just do the create from the
> calling thread!

Real men don't sandbox. Done for all the others.

https://chromiumcodereview.appspot.com/23440015/diff/1/content/renderer/media...
content/renderer/media/renderer_gpu_video_accelerator_factories.cc:394: render_thread_async_waiter_.Signal();
On 2013/09/04 20:47:56, Ami Fischman wrote:
> This is not legit considering it's no longer the render_thread we're running on
> (so, for example, two outstanding operations could be extant, one to the render
> thread and one to here, and this signal would terminate both waits, and so on).
> But if you take my suggestion above then this code is deleted and no worries.

Done.
LGTM % CL description needing a rewrite
On 2013/09/05 06:18:28, Ami Fischman wrote: > LGTM % CL description needing a rewrite Thanks. Yeah, I almost remembered this time that git cl upload doesn't update the commit message here if I update it locally. Thanks for noticing.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/posciak@chromium.org/23440015/13001
Retried try job too often on ios_dbg_simulator for step(s) ui_unittests http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=ios_dbg_si...
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/posciak@chromium.org/23440015/13001
Message was sent while issue was closed.
Change committed as 221395