|
|
Created:
8 years, 1 month ago by bulach Modified:
8 years, 1 month ago CC:
chromium-reviews, peter+watch_chromium.org, bulach+watch_chromium.org, yfriedman+watch_chromium.org, ilevy+watch_chromium.org Base URL:
svn://svn.chromium.org/chrome/trunk/src Visibility:
Public. |
DescriptionAndroid: improves test sharding reliability.
Originally test sharding for android was written with performance in mind
(i.e., to scale up the test speed per device).
Now that we're on the main waterfall, we need to improve reliability as
devices may randomly drop offline during tests.
This patch captures exceptions in key places and retries if there are enough
devices available.
BUG=153718
TEST=run android tests, randomly unplugging devices
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=165918
Patch Set 1 #
Total comments: 14
Patch Set 2 : Comments #
Total comments: 6
Patch Set 3 : Unrecoverable error #
Total comments: 3
Messages
Total messages: 16 (0 generated)
managed to squeeze this in whilst waiting for other patches, would you mind taking a look please? core idea: to capture any exception at a very high level and retry from there.. the reason being that there are way too many points of failures in the lower level, and very little possibility for recover on failing adb commands and such.. please let me know your thoughts..
Thanks for looking into this http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:91: try: Shouldn't this try be in the for loop? |device| is only defined in there. http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:101: continue By continuing, we're not updating |final_results|. It looks like it might only be tabulating successful tests in which case it's ok to skip? http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:111: results_lists = async_results.get(999999) Will the exception get thrown while the other shards are still running? If so, does this mean that we'll whip around and retry shards of tests on devices that are still running their old tests? http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py File build/android/run_tests.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new... build/android/run_tests.py:227: return all_tests, available_devices Don't you want to collect all the devices and merge the lists of tests? This looks like as soon as it finds an available device, it'll return the tests from that device. Or is it just that you need a device to get the list of devices and you need to query each device for it's shard of tests. http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new... build/android/run_tests.py:228: break "break" unnecessary given "return" above
thanks yaron! another look please? http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:91: try: On 2012/10/31 17:18:51, Yaron wrote: > Shouldn't this try be in the for loop? |device| is only defined in there. not really.. we need to create all shards at once, or fail. python allows for |device| to be available outside the for scope. if we move the try inside the for loop, we'd need to somehow abort the current test_runners creation and retry all over again.. http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:101: continue On 2012/10/31 17:18:51, Yaron wrote: > By continuing, we're not updating |final_results|. It looks like it might only > be tabulating successful tests in which case it's ok to skip? this continue here would be for the loop on 85, so there's nothing to update yet here... the flow is basically: - try to create N shards. |-- if it fails, retry. - dispatch and collect results for the N |-- if it fails, retry at the end, since we got N results for N devices, try to optimize what should run next.. does it make sense? http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:111: results_lists = async_results.get(999999) On 2012/10/31 17:18:51, Yaron wrote: > Will the exception get thrown while the other shards are still running? If so, > does this mean that we'll whip around and retry shards of tests on devices that > are still running their old tests? it's not clear from the python documentation if the exception is raised as soon as the shard throws, or later when all the results are collected. thankfully, it doesn't matter :) we'll create the shards again, and that re-installs the apks, which kills the process. http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py File build/android/run_tests.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new... build/android/run_tests.py:227: return all_tests, available_devices On 2012/10/31 17:18:51, Yaron wrote: > Don't you want to collect all the devices and merge the lists of tests? This > looks like as soon as it finds an available device, it'll return the tests from > that device. Or is it just that you need a device to get the list of devices and > you need to query each device for it's shard of tests. sort of the latter (see 211 on the other side), but to clarify: this needs a single device D (out of N) to capture the full list of tests, that will then be sharded higher up. if D fails, we'll assume it's a bad kitten and remove from N and try again. http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new... build/android/run_tests.py:228: break On 2012/10/31 17:18:51, Yaron wrote: > "break" unnecessary given "return" above Done.
http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:101: continue On 2012/10/31 18:10:35, bulach wrote: > On 2012/10/31 17:18:51, Yaron wrote: > > By continuing, we're not updating |final_results|. It looks like it might only > > be tabulating successful tests in which case it's ok to skip? > > this continue here would be for the loop on 85, so there's nothing to update yet > here... > > the flow is basically: > > - try to create N shards. > |-- if it fails, retry. > - dispatch and collect results for the N > |-- if it fails, retry > > at the end, since we got N results for N devices, try to optimize what should > run next.. > > does it make sense? Yes. My point was though that it looks like it's doing special accounting on the last run of the loop. If you "continue" here, it never gets a chance to do that and we can't report that any tests successfully ran. I guess if we end up with 0 devices it'll be a pretty big failure anyway so it's ok. However, looking at OnTestsCompleted in build/android/run_tests.py it assumes that test_runners will be non-empty and that's not true. http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:111: results_lists = async_results.get(999999) On 2012/10/31 18:10:35, bulach wrote: > On 2012/10/31 17:18:51, Yaron wrote: > > Will the exception get thrown while the other shards are still running? If so, > > does this mean that we'll whip around and retry shards of tests on devices > that > > are still running their old tests? > > it's not clear from the python documentation if the exception is raised as soon > as the shard throws, or later when all the results are collected. > > thankfully, it doesn't matter :) > > we'll create the shards again, and that re-installs the apks, which kills the > process. Ah, right. Thanks. http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py File build/android/run_tests.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new... build/android/run_tests.py:227: return all_tests, available_devices On 2012/10/31 18:10:35, bulach wrote: > On 2012/10/31 17:18:51, Yaron wrote: > > Don't you want to collect all the devices and merge the lists of tests? This > > looks like as soon as it finds an available device, it'll return the tests > from > > that device. Or is it just that you need a device to get the list of devices > and > > you need to query each device for it's shard of tests. > > sort of the latter (see 211 on the other side), but to clarify: > > this needs a single device D (out of N) to capture the full list of tests, that > will then be sharded higher up. > > if D fails, we'll assume it's a bad kitten and remove from N and try again. My point should've read "and you _don't_ need to query..". Makes sense.
Thanks for getting to this Marcus. This looks good to me, I'll give it one final look over after lunch. Some suggestions inline. Cheers! http://codereview.chromium.org/11275078/diff/5002/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/5002/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:112: except errors.DeviceUnresponsiveError as e: Is it possible to get other exceptions besides DeviceUnresponsive? http://codereview.chromium.org/11275078/diff/5002/build/android/run_tests.py File build/android/run_tests.py (right): http://codereview.chromium.org/11275078/diff/5002/build/android/run_tests.py#... build/android/run_tests.py:216: def _GetTests(self): I wonder if this would be more stable if we did the all-device apk push before this part, asked all devices to generate the list of tests and then just used the first list that we got back. Would that reduce retry logic? http://codereview.chromium.org/11275078/diff/5002/build/android/run_tests.py#... build/android/run_tests.py:224: try: nit: Might be cleaner with a local variable for the current device. On line 224: device = available_devices.pop() Then replace occurrences of available_devices[-1] with device and remove line 231.
thanks for the clarification yaron! comment addressed, ptal http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_... build/android/pylib/base_test_sharder.py:101: continue On 2012/10/31 18:37:23, Yaron wrote: > On 2012/10/31 18:10:35, bulach wrote: > > On 2012/10/31 17:18:51, Yaron wrote: > > > By continuing, we're not updating |final_results|. It looks like it might > only > > > be tabulating successful tests in which case it's ok to skip? > > > > this continue here would be for the loop on 85, so there's nothing to update > yet > > here... > > > > the flow is basically: > > > > - try to create N shards. > > |-- if it fails, retry. > > - dispatch and collect results for the N > > |-- if it fails, retry > > > > at the end, since we got N results for N devices, try to optimize what should > > run next.. > > > > does it make sense? > > Yes. My point was though that it looks like it's doing special accounting on the > last run of the loop. If you "continue" here, it never gets a chance to do that > and we can't report that any tests successfully ran. I guess if we end up with 0 > devices it'll be a pretty big failure anyway so it's ok. > > However, looking at OnTestsCompleted in build/android/run_tests.py it assumes > that test_runners will be non-empty and that's not true. ahn, yes, good point! I added an else: clause to raise an exception if we run out of retries... at that point, there's no possible recovery anyways, but this at least clarifies which condition we hit..
thanks isaac! clarifications inline, please let me know your thoughts: http://codereview.chromium.org/11275078/diff/5002/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/5002/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:112: except errors.DeviceUnresponsiveError as e: On 2012/10/31 19:18:14, Isaac wrote: > Is it possible to get other exceptions besides DeviceUnresponsive? quite possibly. :) but trying to recover from them by just retrying on other devices would probably be wrong.. let's see how this behaves and we can fine tune from there. http://codereview.chromium.org/11275078/diff/5002/build/android/run_tests.py File build/android/run_tests.py (right): http://codereview.chromium.org/11275078/diff/5002/build/android/run_tests.py#... build/android/run_tests.py:216: def _GetTests(self): On 2012/10/31 19:18:14, Isaac wrote: > I wonder if this would be more stable if we did the all-device apk push before > this part, asked all devices to generate the list of tests and then just used > the first list that we got back. Would that reduce retry logic? we have no mechanism to do that, and it doesn't look like this patch should introduce this. I think the first step is to capture exception points and retry. we can later optimize.. http://codereview.chromium.org/11275078/diff/5002/build/android/run_tests.py#... build/android/run_tests.py:224: try: On 2012/10/31 19:18:14, Isaac wrote: > nit: Might be cleaner with a local variable for the current device. > > On line 224: device = available_devices.pop() > > Then replace occurrences of available_devices[-1] with device and remove line > 231. then line 227 would be more complicated, as it'd the typical path where we don't pop() anything.. ok to keep pop() as an exception? I think it's clearer.
lgtm
thanks. http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:115: continue If one device raises an exception, this might block other test runners. what about catching up exceptions in "_ShardedTestRunnable" so that once one is failed, others can continue to run? then in the next round, we just need to try those failures?
thanks Yongsheng! clarification below: http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:115: continue On 2012/11/01 02:22:49, yongsheng wrote: > If one device raises an exception, this might block other test runners. > what about catching up exceptions in "_ShardedTestRunnable" so that once one is > failed, others can continue to run? then in the next round, we just need to try > those failures? hmm, sorry, I'm not sure what do you mean by "block other test runners"... I just did a quick test, and if you run "foo.py 4", it blocks for ~10secs. running "foo.py 5" returns immediately, i.e., the exception thrown by any of the shards immediately bubbles up... so in our case here, we'd immediately retry on the remaining runners, and as mentioned to yaron above, they wouldn't be blocked... I appreciate we could potentially optimize this :) and just retry the failures. however, there's currently no simple way to communicate that back from "_ShardedTestRunnable" (we'd need more logic there to set all tests as broken with a device exception so that 121 and 132-134 would work...) if that's ok, would you be fine with doing this as a possible follow up? I'd like to gather data first to understand what are all failure scenarios in the bots. ===foo.py=== import multiprocessing import sys import time def sleep_or_raise(i): if i < 4: time.sleep(10) else: raise Exception('Raised!!!') n = int(sys.argv[1]) pool = multiprocessing.Pool(n) r = pool.map_async(sleep_or_raise, xrange(n)) r.get(9999999)
I agree with Marcus. Let's try this now for stability's sake and optimize later. On Thu, Nov 1, 2012 at 4:41 AM, <bulach@chromium.org> wrote: > thanks Yongsheng! clarification below: > > > > http://codereview.chromium.**org/11275078/diff/8002/build/** > android/pylib/base_test_**sharder.py<http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py> > File build/android/pylib/base_test_**sharder.py (right): > > http://codereview.chromium.**org/11275078/diff/8002/build/** > android/pylib/base_test_**sharder.py#newcode115<http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py#newcode115> > build/android/pylib/base_test_**sharder.py:115: continue > On 2012/11/01 02:22:49, yongsheng wrote: > >> If one device raises an exception, this might block other test >> > runners. > >> what about catching up exceptions in "_ShardedTestRunnable" so that >> > once one is > >> failed, others can continue to run? then in the next round, we just >> > need to try > >> those failures? >> > > hmm, sorry, I'm not sure what do you mean by "block other test > runners"... > I just did a quick test, and if you run "foo.py 4", it blocks for > ~10secs. running "foo.py 5" returns immediately, i.e., the exception > thrown by any of the shards immediately bubbles up... > > so in our case here, we'd immediately retry on the remaining runners, > and as mentioned to yaron above, they wouldn't be blocked... > > I appreciate we could potentially optimize this :) and just retry the > failures. > however, there's currently no simple way to communicate that back from > "_ShardedTestRunnable" (we'd need more logic there to set all tests as > broken with a device exception so that 121 and 132-134 would work...) > > if that's ok, would you be fine with doing this as a possible follow up? > > I'd like to gather data first to understand what are all failure > scenarios in the bots. > > > ===foo.py=== > > import multiprocessing > import sys > import time > > def sleep_or_raise(i): > if i < 4: > time.sleep(10) > else: > raise Exception('Raised!!!') > > n = int(sys.argv[1]) > pool = multiprocessing.Pool(n) > r = pool.map_async(sleep_or_raise, xrange(n)) > r.get(9999999) > > http://codereview.chromium.**org/11275078/<http://codereview.chromium.org/112... >
I also agree. lgtm
thanks for your explanation. Please go ahead http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:115: continue On 2012/11/01 11:41:06, bulach wrote: > On 2012/11/01 02:22:49, yongsheng wrote: > > If one device raises an exception, this might block other test runners. > > what about catching up exceptions in "_ShardedTestRunnable" so that once one > is > > failed, others can continue to run? then in the next round, we just need to > try > > those failures? > > hmm, sorry, I'm not sure what do you mean by "block other test runners"... > I just did a quick test, and if you run "foo.py 4", it blocks for ~10secs. > running "foo.py 5" returns immediately, i.e., the exception thrown by any of the > shards immediately bubbles up... > > so in our case here, we'd immediately retry on the remaining runners, and as > mentioned to yaron above, they wouldn't be blocked... > > I appreciate we could potentially optimize this :) and just retry the failures. > however, there's currently no simple way to communicate that back from > "_ShardedTestRunnable" (we'd need more logic there to set all tests as broken > with a device exception so that 121 and 132-134 would work...) you're right. That's what i mean: one exception is raised from one device and other devices running cases will be interrupted. yes, that's just a optimization. > if that's ok, would you be fine with doing this as a possible follow up? > > I'd like to gather data first to understand what are all failure scenarios in > the bots. > yes, please. > > ===foo.py=== > > import multiprocessing > import sys > import time > > def sleep_or_raise(i): > if i < 4: > time.sleep(10) > else: > raise Exception('Raised!!!') > > n = int(sys.argv[1]) > pool = multiprocessing.Pool(n) > r = pool.map_async(sleep_or_raise, xrange(n)) > r.get(9999999) > > oh, yes, Yaron is right. we can help improve this later. Please go ahead.
thanks everyone! I'll put through the try bots now and cq later..
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/bulach@chromium.org/11275078/8002
Change committed as 165918 |