Created: 7 years, 10 months ago by epenner
Modified: 7 years, 10 months ago
CC: chromium-reviews, cc-bugs_chromium.org, enne (OOO), aelias_OOO_until_Jul13
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Visibility: Public
Description

cc: Enforce correct recycling in resource pool.
The active/pending tree split makes it appear as though we only recycle textures that are not in use; however, our memory assignment can easily take invalidated resources from the active tree and put them directly into the pending tree, breaking our texture recycling.
This introduces a read-fence concept in the resource provider and uses the read fence to strictly enforce correct recycling behavior: 'canLockForWrite' tells us whether a resource is truly safe to recycle.
No try, as this is only cc, and cc unit tests have passed.
NOTRY=true
BUG=172995
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=180886
Patch Set 1
Patch Set 2: Use fence interface. (Total comments: 9)
Patch Set 3: Moved to renderer. Addressed feedback. (Total comments: 1)
Patch Set 4: Address feedback.
Patch Set 5: Fix bots.
Patch Set 6: Rebase.
Messages
Total messages: 38 (0 generated)
Can you describe the situation that is causing the bug, first?
PTAL. I need to do some more verification, and I can add a unit test for the resource provider part. I went back and forth several times on where the code for this should go before arriving at this solution, which is a lot less code and which I think should prevent this bug from happening again in other places.
What on earth is a "double buffer our writes" and "double buffered write constraint"?
I updated the first paragraph to better describe the problem. Double buffering writes refers to not writing to a texture that is currently in use. We could double buffer every single texture if we wanted to do this the easy way, but that would waste lots of memory. So instead, when we need to update a texture, we recycle one from a pool of 'no longer in use' textures. But the recurring problem is that 'no longer in use' is not quite right. This fixes that by strictly keeping track of how long a texture is in use on the GPU.
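To make the 'no longer in use' condition concrete, here is a minimal sketch of a pool that refuses to hand a texture back out until a read fence covering its last GPU use has passed. The class and member names are hypothetical illustrations, not the actual cc::ResourcePool code.

#include <memory>
#include <queue>

// Abstract signal that the GPU has finished reading a resource.
struct ReadFence {
  virtual ~ReadFence() {}
  virtual bool HasPassed() const = 0;
};

struct PooledTexture {
  unsigned gl_id = 0;
  std::shared_ptr<ReadFence> read_fence;  // fence covering the last GPU read
};

class TexturePool {
 public:
  // Hands out a texture only if it is strictly safe to write to; returns
  // null if the oldest free texture might still be read by the GPU.
  std::unique_ptr<PooledTexture> AcquireForWrite() {
    if (free_.empty())
      return nullptr;
    PooledTexture* candidate = free_.front().get();
    if (candidate->read_fence && !candidate->read_fence->HasPassed())
      return nullptr;  // recycling now would race the GPU or force a stall
    std::unique_ptr<PooledTexture> texture = std::move(free_.front());
    free_.pop();
    return texture;
  }

  // The compositor is done with the texture, but it only becomes recyclable
  // once its read fence has passed.
  void Release(std::unique_ptr<PooledTexture> texture) {
    free_.push(std::move(texture));
  }

 private:
  std::queue<std::unique_ptr<PooledTexture>> free_;
};

The point of the sketch is that recency alone (e.g. sorting by last frame used) only makes a collision less likely; the fence check is what makes recycling strictly safe.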
Sorry, I still don't fully get it. I think what I hear you saying is that we can't do an async upload while there is a draw outstanding to that tile. If that's the case, then the bug is on the service side.
This is only because we perform texture uploads on a separate thread, right? Feels very specific to our async upload implementation on android. Would it make sense to handle most of this on gpu process side instead? Maybe just sort resources in the tile manager resource pool by last usage time but everything else done in gpu process?
No I don't think it's best to handle it on the GPU side. The reason is the same reason we didn't do 'partial updates' anymore. Whether or not we do async uploads, something bad happens when we upload to a texture while it is in use. So it's good to have a tool to prevent that bad thing from happening. On N10 it's a 16ms pipeline stall, for example.
The whole purpose of texture recycling is to prevent writing to a texture while it is still in use...
One thing that might help us all communicate is to list out the steps that are happening client side when things go wrong. E.g.:
1. Draw tile A: glDraw(tileA->resourceID); let's say it's resource 7.
2. manageTiles: tileA->freeResources; resource ID 7 for tileA returns to the pool.
3. Resource ID 7 is given out to tileB.
4. Rasterization of tileB completes.
5. asyncUpload of tileB's data into 7.
If I understand right, on the service side, somehow the asyncUpload(7) for tileB starts before the draw of 7 executes?
Nat, if the bug is on the service side, what do you propose the service should do? It's good to think about zero copy too. We'll have exactly the same issue there, and there is no service-side solution.
On 2013/02/04 02:23:34, epenner wrote: > Nat, if the bug is on the service side, what do you propose the service should > do? It's good to think about zero copy too. We'll have exactly the same issue there, > and there is no service-side solution. How much does just sorting resources in the resource pool by last frame used help? I'm wondering if this, together with just stalling async uploads on the GPU side as a last resort, would be good enough.
I'd prefer if we didn't make changes that prevent a different implementation of CHROMIUM_async_transfers from working in an optimal way. I.e. an idle upload implementation for desktop would work better if we didn't enforce this constraint on it. Maybe the right thing to do is somehow extend the async command buffer API to be able to handle this in an optimal way depending on the implementation?
On 2013/02/04 03:23:46, David Reveman wrote: > I'd prefer if we didn't make changes that prevent a different implementation of > CHROMIUM_async_transfers from working in an optimal way. i.e. an idle upload > implementation for desktop would work better if we didn't enforce this constraint > on it. I disagree. This is just 'nice client' behavior for almost any implementation. No matter what implementation you are on, underneath it needs to deal with copying your textures around when you write to a texture that is in use. We don't even have async uploads on trunk or Android's m18, yet we do have this same client behavior... And 99% of the time we were already doing this. It was just in the OOM case that we started to act differently.
The OOM case needs to be handled by limiting the visible portion of the active tree to take 50% of memory. Then this will never go over the limit... In any other case the system will need more memory somewhere, whether it be from our pool or inside the GPU.
Could we fix this by making the Android implementation of async uploads always create a new texture? Btw, the correctness problem this patch is solving is in my opinion a bug in the async upload implementation, not a problem in the usage. So even if we change the usage, I don't think it's OK for the implementation to produce incorrect results in these situations.
When would it create a new texture? We have no usage tracking of textures to my knowledge. Regarding correctness, how do you define correctness when using multiple GL threads? It's fraught with "undefined" etc., just like the description of this API in the design document: https://docs.google.com/a/google.com/document/d/1vj-QmyReXBEw1CfiJphpIEgAqqrl... When the API says ASYNC, it really really means async.
"device correctness" -> define correctness.
I think a better thing to do, if we could detect this on the service side, would be to issue some kind of error to the client. Adding unneeded and unintended slow paths to Chrome is not worth our time and effort. We have hit the "black fill" path countless times, and every time we need to debug and find it on Android. It would be better if it were an error.
I keep coming back to "what are the clean architectures" that we can support long term? I realize that it's frustrating, but we're stuck with using GL and a GL-inspired architecture to talk to the GPU right now, and while that's something we may want to change someday in order to increase our velocity, for now we've got to deal with it. Within the GL model, I think that leaves us with the following possible clean architectures:
1. The glAsyncUpload API has eglImage semantics BUT is only advertised as available if we also have true fences. When we return a resource to the pool, we set a fence and don't let it be used until the fence passes.
2. The glAsyncUpload API has a non-eglImage semantic. If we issue a glAsyncTex while the texture is being drawn, the GPU process needs to stall the upload.
As noted, fences don't work on Android, although we do know that the GPU never runs more than 1 frame behind. So we can "simulate" a fence based on tracking swaps, which is sort of what this patch does (a sketch of that idea follows below). I'm not sure how to proceed.
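Just to make the "simulate a fence by tracking swaps" idea concrete, here is a rough sketch. All names are illustrative (they are not the ones in this patch), and the pipeline depth is passed in rather than hard-coded, since it depends on the driver.

#include <cstdint>

// Counts buffer swaps issued by the compositor.
class SwapTracker {
 public:
  SwapTracker() : swap_count_(0) {}
  void DidSwapBuffers() { ++swap_count_; }
  uint64_t swap_count() const { return swap_count_; }

 private:
  uint64_t swap_count_;
};

// Pseudo-fence: passes once enough further swaps have happened that the GPU
// (assumed to run at most `max_frames_behind` frames behind) can no longer be
// reading anything issued in the frame where the fence was created.
class SwapCountFence {
 public:
  SwapCountFence(const SwapTracker* tracker, uint64_t max_frames_behind)
      : tracker_(tracker),
        passes_at_swap_(tracker->swap_count() + max_frames_behind) {}

  bool HasPassed() const { return tracker_->swap_count() >= passes_at_swap_; }

 private:
  const SwapTracker* tracker_;
  const uint64_t passes_at_swap_;
};

On Android, max_frames_behind would presumably be 1, given the "never more than one frame behind" observation; a real GL fence implementation could later replace this class behind the same HasPassed() interface.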
On 2013/02/04 03:53:14, epenner wrote: > When would it create a new texture? We have no usage tracking of textures to my > knowledge. How about always? I don't know if that's a good idea. You're the expert here. I'm just asking as it seems like that would be a simple solution if it worked. > > Regarding correctness, how do you define correctness when using multiple GL > threads? It's fraught with "undefined" etc., just like the description of this API > in the design document: > > https://docs.google.com/a/google.com/document/d/1vj-QmyReXBEw1CfiJphpIEgAqqrl... > > When the API says ASYNC, it really really means async. It's async but that doesn't mean there's concurrency. I think the implementation needs to be somewhat deterministic and protect against a misbehaving client. Having completely undefined behavior is a security problem. This API is after all exposed to the renderer.
Unfortunately we can't just allocate new textures on every upload (performance). Greg and I talked through both the API's current behavior and its safety as far as security goes. I personally don't see what the fuss is. We used the resource provider to completely block uninitialized texture reads. This is *awesome*, as we don't get black-fill bugs anymore. The same could be said of the nasty pipeline stalls and driver copies. They would be a thing of the past with a very simple policy enforced by the resource provider... I don't know if we need more "GL'isms" in the resource provider either.
Your frame concept is a very simple fence. I think if we can make it a fence, then it makes sense to me, and it sets up this code for success in the future in a way that is understandable by many people. It's a quick refactoring:

Fence : RefCounted { virtual bool hasPassed() = 0; };

Resource
  scoped_refptr<Fence> m_readFence

ResourceManager {
  RefPtr<Fence> m_fenceForCurrentFrame
}

lockForRead
  readLock = m_fenceForCurrentFrame

lockForWrite
  assert m_resource->m_readFence()->isPassed();

resourceProvider->alloc()
  check the fence

swap
  #if android
    setFence(pseudofence)
  #else
    ... TODOs
  #endif
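One way that sketch could be fleshed out, purely for illustration (std::shared_ptr stands in for scoped_refptr here, and the names are guesses at the intent rather than the code that actually landed):

#include <cassert>
#include <memory>

class Fence {
 public:
  virtual ~Fence() {}
  virtual bool HasPassed() const = 0;
};

struct Resource {
  unsigned gl_id = 0;
  std::shared_ptr<Fence> read_fence;  // the m_readFence of the sketch
};

class ResourceProvider {
 public:
  // Set once per frame, e.g. by the renderer around swapBuffers.
  void SetCurrentReadFence(std::shared_ptr<Fence> fence) {
    current_read_fence_ = fence;
  }

  void LockForRead(Resource* resource) {
    // Reading extends the resource's in-use window to the current fence.
    resource->read_fence = current_read_fence_;
  }

  void LockForWrite(Resource* resource) {
    // Writing (i.e. recycling) is only legal once any prior read has retired.
    assert(!resource->read_fence || resource->read_fence->HasPassed());
  }

 private:
  std::shared_ptr<Fence> current_read_fence_;
};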
Nat: 1.) it wouldn't just need fences, it would need special "async upload fences" since the uploads happen asynchronously from normal GPU commands. 2.) could be done more cleanly, but then we still need this patch to prevent the new slow path we would create. I think this patch should purely focus on the resource provider only and forget about async uploads. 'N-rotation-buffers' is what all games do to stream textures on consoles, so this just captures that technique and enforces that the textures are 'rotated' correctly.
The read fence could work since that occurs with regular GPU command stream. Happy to do that if piman also approves. Can discuss tomorrow.
On 2013/02/04 07:53:54, epenner wrote:
> Nat: 1.) it wouldn't just need fences, it would need special "async upload fences"
> since the uploads happen asynchronously from normal GPU commands. 2.) could be done
> more cleanly, but then we still need this patch to prevent the new slow path we
> would create.
>
> I think this patch should purely focus on the resource provider only and forget
> about async uploads. 'N-rotation-buffers' is what all games do to stream textures
> on consoles, so this just captures that technique and enforces that the textures
> are 'rotated' correctly.

If this technique is expected from GLES clients for good performance then this patch makes sense. I still think we need to do something on the service side to prevent undefined behavior when a texture is reused too early. That the following code has undefined behavior seems wrong:

beginQueryEXT(GL_ASYNC_PIXEL_TRANSFERS_COMPLETED_CHROMIUM, queryId);
asyncTexImage2DCHROMIUM(textureId, ..);
endQueryEXT(GL_ASYNC_PIXEL_TRANSFERS_COMPLETED_CHROMIUM);
getQueryObjectuivEXT(queryId, GL_QUERY_RESULT_AVAILABLE_EXT, &completed);
if (completed) {
  drawTexture(textureId);
  swapBuffers();
  asyncTexImage2DCHROMIUM(textureId, ..);
}

Btw, is the performance problem with creating a new texture for each upload related to the use of eglimages?
Couple of high-level comments:
- I'd really prefer if the RP didn't know anything about frames. The concept of a frame becomes a bit fuzzy with übercompositor in mind (which is what the RP was invented for).
- The number of frames you want to "lock" for is hardcoded at the wrong place. It depends on lower level things, such as the driver behavior (e.g. what if your driver allows 3 frames in flight instead of 2, tiled GPU or not), and infrastructure (latency of parent compositor, Übercomp or not, software or not). So I fear that this will lead to more hacks to fine-tune the number based on external factors.
I think Nat's concept of pseudo-fences, or something similar, will get us closer to a good architecture. At least we have a hook to differentiate between GL vs software vs Übercomp. I think if we want to be efficient in both perf (no stalls) and memory (re-using every time we can), we'll probably need help from the GPU process, but maybe that will have to be a follow-up. I also agree with Dave that we need to fix the infrastructure if his code snippet doesn't work as expected - it's important for our sanity that in the compositor we can reason about the commands we send based on their semantics without needing to understand (or rely on) the GPU process implementation. Should we have a quick meeting about this today so we can find a good solution that everyone is happy with?
On 2013/02/04 17:50:28, piman wrote: > Couple of high-level comments: > - I'd really prefer if the RP didn't know anything about frames. The concept of > a frame becomes a bit fuzzy with übercompositor in mind (which is what the RP > was invented for). > - the number of frames you want to "lock" for is hardcoded at the wrong place. > It depends on lower level things, such as the driver behavior (e.g. what if your > driver allows 3 frames in flight instead of 2, tiled GPU or not), and > infrastructure (latency of parent compositor, Übercomp or not, software or not). > So I fear that this will lead to more hacks to fine-tune the number based on > external factors. > > I think Nat's concept of pseudo-fences, or something similar will get us closer > to a good architecture. At least we have a hook to differentiate between the GL > vs software vs Übercomp. > > I think if we want to be efficient in both perf (no stalls) and memory (re-using > every time we can), we'll probably need help from the GPU process, but maybe > that will have to be a follow-up. > > > I also agree with Dave that we need to fix the infrastructure if his code > snippet doesn't work as expected - it's important for our sanity that in the > compositor we can reason about the commands we send based on their semantics > without needing to understand (or relying on) the GPU process implementation. > > Should we have a quick meeting about this today so we can find a good solution > that everyone is happy with? This sounds good. I can do it via pseudo fences, no problem. Except that the fence "completion" will still need to know about frames at the moment. I also don't see how a read fence could be implemented 'for real' without issuing way too many fences, but I guess that's a problem for the future. The GL semantics is the big disconnect here I think. Everybody seems to want "single threaded pixel buffer object" semantics, but from the beginning this was designed to have "multi threaded pixel buffer object" semantics, and that was fully described in the design. In the latter case the code snippet is getting exactly what it asked for, which is to write to the texture asynchronously. A meeting sounds good to me. I'll get started on the pseudo fence thing.
Couple follow ups: Dave, allocations are always expensive on Android (EGLImage or not). In fact, if we could afford to allocate a texture for every upload, we could already do zero-copy on Android (this is called the "one-shot surface-texture" approach). Recycling zero-copy buffers on Android will require us to enforce the exact same semantics here. I think the points about dealing with different GPU pipeline depths are interesting but seem moot when we realize we can't afford triple buffering in the compositor. The blocking solutions would just limit the pipeline depth to two, using one thousand fences instead of 2 fences ;)
Ptal. Sorry for the delay, lots of other bugs on my plate.
https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.cc
File cc/layer_tree_host_impl.cc (right):

https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.c...
cc/layer_tree_host_impl.cc:870: Can we move that logic to the renderer? It shouldn't apply to the software renderer, and the logic is likely different for the delegated renderer. Also that lets us later more easily tune the logic based on GPU logic (e.g. proper fence).
https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.cc
File cc/layer_tree_host_impl.cc (right):

https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.c...
cc/layer_tree_host_impl.cc:855: bool setToPassed() { m_passed = true; }
m_passed -> m_hasPassed
setHasPassed()

https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.c...
cc/layer_tree_host_impl.cc:861: bool LayerTreeHostImpl::swapBuffers()
what if output_surface.h had a scoped<RefPtr> ResourceProvider::Fence getFenceForWorkToThisPoint() call? It knows about swapbuffers. Effectively, move your LTHI changes to output surface.

https://codereview.chromium.org/12197004/diff/19001/cc/resource_pool.cc
File cc/resource_pool.cc (right):

https://codereview.chromium.org/12197004/diff/19001/cc/resource_pool.cc#newco...
cc/resource_pool.cc:65: // Extend all read locks on all resources until the resource is
Can we restrict this to only resources that are async uploaded somehow?

https://codereview.chromium.org/12197004/diff/19001/cc/resource_provider.h
File cc/resource_provider.h (right):

https://codereview.chromium.org/12197004/diff/19001/cc/resource_provider.h#ne...
cc/resource_provider.h:301: bool readLockFenceHasPassed(Resource* resource) { return !resource->readLockFence ||
does ? operator clean this? blah ? blah->hasPassed : true
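For the last nit, the suggested ternary would presumably read along these lines (a self-contained sketch with stand-in types, not the exact patch code):

#include <memory>

struct Fence {
  bool HasPassed() const { return passed; }
  bool passed = false;
};

struct Resource {
  std::shared_ptr<Fence> read_lock_fence;
};

// Equivalent to `!fence || fence->HasPassed()`, written with the ?: operator.
bool ReadLockFenceHasPassed(const Resource* resource) {
  return resource->read_lock_fence ? resource->read_lock_fence->HasPassed()
                                   : true;
}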
https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.cc
File cc/layer_tree_host_impl.cc (right):

https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.c...
cc/layer_tree_host_impl.cc:861: bool LayerTreeHostImpl::swapBuffers()
On 2013/02/05 22:39:05, nduca wrote:
> what if output_surface.h had a scoped<RefPtr> ResourceProvider::Fence
> getFenceForWorkToThisPoint() call? It knows about swapbuffers.

Sadly I don't think it knows about it yet, though we definitely want that and talked about it! That's why I suggested the renderer in the meantime (though I support moving swapBuffers to OutputSurface!).

> Effectively, move
> your LTHI changes to output surface.
https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.cc
File cc/layer_tree_host_impl.cc (right):

https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.c...
cc/layer_tree_host_impl.cc:855: bool setToPassed() { m_passed = true; }
On 2013/02/05 22:39:05, nduca wrote:
> m_passed -> m_hasPassed
> setHasPassed()
Done.

https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.c...
cc/layer_tree_host_impl.cc:861: bool LayerTreeHostImpl::swapBuffers()
On 2013/02/05 22:46:47, piman wrote:
> On 2013/02/05 22:39:05, nduca wrote:
> > what if output_surface.h had a scoped<RefPtr> ResourceProvider::Fence
> > getFenceForWorkToThisPoint() call? It knows about swapbuffers.
> Sadly I don't think it knows about it yet, though we definitely want that and
> talked about it! That's why I suggested the renderer in the meantime (though I
> support moving swapBuffers to OutputSurface!).
> > Effectively, move
> > your LTHI changes to output surface.
Done.

https://codereview.chromium.org/12197004/diff/19001/cc/layer_tree_host_impl.c...
cc/layer_tree_host_impl.cc:870:
On 2013/02/05 22:19:07, piman wrote:
> Can we move that logic to the renderer? It shouldn't apply to the software
> renderer, and the logic is likely different for the delegated renderer.
> Also that lets us later more easily tune the logic based on GPU logic (e.g.
> proper fence).
Done.
lgtm

https://codereview.chromium.org/12197004/diff/22002/cc/gl_renderer.cc
File cc/gl_renderer.cc (right):

https://codereview.chromium.org/12197004/diff/22002/cc/gl_renderer.cc#newcode...
cc/gl_renderer.cc:1300: }
nit: move to the top of the file instead of in the middle of the GLRenderer implementation?
lgtm
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/epenner@chromium.org/12197004/24004
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/epenner@chromium.org/12197004/19004
Message was sent while issue was closed.
Change committed as 180886