Issue 11232037: Retry tests on other bots if the device is unresponsive/offline

yongsheng

bulach, please check whether this solution is feasible. I test it in my local machine ...

8 years, 2 months ago (2012-10-22 12:51:36 UTC) #1

bulach

+ilevy fyi many, many thanks, really appreciate!! this will make our infra much more reliable! ...

8 years, 2 months ago (2012-10-23 09:30:56 UTC) #2

yongsheng

thanks for your valuable comments. I'll refine them according to your suggestion. http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py ...

8 years, 2 months ago (2012-10-23 12:58:14 UTC) #3

thanks for your valuable comments. I'll refine them according to your
suggestion.

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te...
File build/android/pylib/base_test_sharder.py (right):

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te...
build/android/pylib/base_test_sharder.py:105: rounds += 1
On 2012/10/23 09:30:56, bulach wrote:
> hmm.. this may never finish: if we keep getting device exception, the while
> retry < rounds would never exit given rounds +=1 here.. maybe remove this
block,
> and let it fall through and just use the standard retry in case of failure?
> 
> we'll need to switch the bots to use this param, if they're not using it yet..
oh, yes. 
So what if retry equals to rounds - 1? then the failed cases won't be executed 
since there is no next retry. Suppose self.retries = 1 and one device gets
offline, then tests on this device won't be re-run.

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te...
build/android/pylib/base_test_sharder.py:110: self.attached_devices =
android_commands.GetAttachedDevices()
On 2012/10/23 09:30:56, bulach wrote:
> nice! for extra safety, maybe we could do something like:
> # Re-check the attached devices for some devices may
> # become offline.
> retry_devices = set(android_commands.GetAttachedDevices())
> # Remove devices that had exceptions.
> retry_devices -= TestResults.DeviceExceptions(results_lists)
> # Retry on devices that didn't have any exception.
> self.attached_devices = list(retry_devices)
great improvement. thanks. I'll follow your suggestion.

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/single_...
File build/android/pylib/single_test_runner.py (right):

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/single_...
build/android/pylib/single_test_runner.py:332: failed=failed_tests,
device_except=True)
On 2012/10/23 09:30:56, bulach wrote:
> nit: indentation should be by 4 only.
> 
> more important :)
> if you agree with the above, rather than device_except being a boolean, we
could
> use:
> 
> device_exception=self.device
agree with you.

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re...
File build/android/pylib/test_result.py (right):

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re...
build/android/pylib/test_result.py:92: return False
On 2012/10/23 09:30:56, bulach wrote:
> I think this could be:
> 
> return any(filter(lambda t: t.device_exception, results))
good.

yongsheng

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_test_sharder.py#newcode105 build/android/pylib/base_test_sharder.py:105: rounds += 1 On 2012/10/23 12:58:14, yongsheng wrote: > ...

8 years, 2 months ago (2012-10-23 13:08:35 UTC) #4

bulach

thanks!! I think we're agreeing already :) but just some clarifications below, just in case.. ...

8 years, 2 months ago (2012-10-23 15:01:52 UTC) #5

yongsheng

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_test_sharder.py#newcode105 build/android/pylib/base_test_sharder.py:105: rounds += 1 On 2012/10/23 15:01:53, bulach wrote: > ...

8 years, 2 months ago (2012-10-24 01:41:52 UTC) #6

bulach

thanks!! clarification inline, please let me know your thoughts! http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_test_sharder.py#newcode105 build/android/pylib/base_test_sharder.py:105: ...

8 years, 2 months ago (2012-10-24 10:43:48 UTC) #7

thanks!! clarification inline, please let me know your thoughts!

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te...
File build/android/pylib/base_test_sharder.py (right):

http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te...
build/android/pylib/base_test_sharder.py:105: rounds += 1
On 2012/10/24 01:41:52, yongsheng wrote:
> On 2012/10/23 15:01:53, bulach wrote:
> > On 2012/10/23 13:08:35, yongsheng wrote:
> > > On 2012/10/23 12:58:14, yongsheng wrote:
> > > > On 2012/10/23 09:30:56, bulach wrote:
> > > > > hmm.. this may never finish: if we keep getting device exception, the
> > while
> > > > > retry < rounds would never exit given rounds +=1 here.. maybe remove
> this
> > > > block,
> > > > > and let it fall through and just use the standard retry in case of
> > failure?
> > > > > 
> > > > > we'll need to switch the bots to use this param, if they're not using
it
> > > yet..
> > > > oh, yes. 
> > > > So what if retry equals to rounds - 1? then the failed cases won't be
> > executed
> > > 
> > > > since there is no next retry. Suppose self.retries = 1 and one device
gets
> > > > offline, then tests on this device won't be re-run.
> > > if one device raise exception, then it's excluded in the list according to
> > your
> > > suggestion in the below comment. and if device list is empty, the loop is
> > done.
> > 
> > oh, sorry, I think the current semantics (before your patch!!) for retry
here
> is
> > broken :)
> > perhaps we could implement the following:
> > - if there are are no failures or is the last round, break the loop..
> > 
> my only concern is that if there are failures(due to device exceptions) and
it's
> the last round (and there are attached devices), so we just ignore the
failures
> and don't retry them. If you think it's acceptable, I'll follw up your
> suggestion here. 
> > - else: remove devices that had exceptions from the list (if any), and try
> > again..
> > 
> > this is guaranteed to terminate, and to retry if necessary.
> > 
> > what do you think?
> > 
> > how does that sound?
> 

I think that'd be fine, at least it's deterministic :)
otherwise, if we have two devices, and they keep alternating the failures, we'd
never exit the loop...

bulach

lgtm, sorry, forgot to mention it! :) many thanks once again, this will help us ...

8 years, 2 months ago (2012-10-24 11:33:45 UTC) #8

Isaac (away)

Thanks! Looks good to me too! One nit in android_commands. https://chromiumcodereview.appspot.com/11232037/diff/13001/build/android/pylib/android_commands.py File build/android/pylib/android_commands.py (right): https://chromiumcodereview.appspot.com/11232037/diff/13001/build/android/pylib/android_commands.py#newcode110 ...

8 years, 2 months ago (2012-10-24 17:33:12 UTC) #9

yongsheng

http://codereview.chromium.org/11232037/diff/13001/build/android/pylib/android_commands.py File build/android/pylib/android_commands.py (right): http://codereview.chromium.org/11232037/diff/13001/build/android/pylib/android_commands.py#newcode110 build/android/pylib/android_commands.py:110: def IsDeviceAttached(device): On 2012/10/24 17:33:12, Isaac wrote: > We ...

8 years, 2 months ago (2012-10-25 01:20:06 UTC) #10

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/yongsheng.zhu@intel.com/11232037/20002

8 years, 2 months ago (2012-10-25 01:33:31 UTC) #12

Yaron

https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py#newcode90 build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) Based on: http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/2864/steps/base_unittests/logs/stdio and http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/2860/steps/base_unittests/logs/stdio ...

8 years, 1 month ago (2012-10-26 18:48:44 UTC) #14

yongsheng

http://codereview.chromium.org/11232037/diff/20002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/20002/build/android/pylib/base_test_sharder.py#newcode90 build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) On 2012/10/26 18:48:44, Yaron wrote: ...

8 years, 1 month ago (2012-10-28 01:41:02 UTC) #15

bulach

https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py#newcode90 build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) On 2012/10/28 01:41:03, yongsheng wrote: ...

8 years, 1 month ago (2012-10-29 14:45:42 UTC) #16

bulach

https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py#newcode90 build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) On 2012/10/29 14:45:42, bulach wrote: ...

8 years, 1 month ago (2012-10-29 14:51:23 UTC) #17

https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli...
File build/android/pylib/base_test_sharder.py (right):

https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli...
build/android/pylib/base_test_sharder.py:90: test_runner =
self.CreateShardedTestRunner(device, index)
On 2012/10/29 14:45:42, bulach wrote:
> On 2012/10/28 01:41:03, yongsheng wrote:
> > On 2012/10/26 18:48:44, Yaron wrote:
> > > Based on:
> > >
> >
>
http://build.chromium.org/p/chromium.linux/builders/Android%25252520Tests%252...
> > > and
> > >
> >
>
http://build.chromium.org/p/chromium.linux/builders/Android%25252520Tests%252...
> > > 
> > > It looks like the creation of sharded test runner needs to guarded with
> > catching
> > > DeviceUnresponsiveError or else we can fail without attempting any
retries.
> > > 
> > > I'm concerned though that we aren't catching this issue more generally and
> we
> > > need to ensure that all code run by a shard can enter the new retry logic
> > below.
> > yes, i don't try to catch exceptions here for i'm not sure all of these
errors
> > are due to device problems or others.
> > It seems that we have to do this kind of thing here.
> 
> yaron: just to clarify, I kindly asked Yongsheng (thanks btw! :) to help us
with
> the case where the device was going down when running the tests, which he
fixed
> here..
> 
> the sharding was originally done with speed in mind, not reliability. this is
> the very first patch to try and make it failsafe / hot reuse devices if they
> become ill during the test cycle, so I share your concern and we'll have to
work
> through them...
> 
> so two questions then:
> yongsheng: would you be able to continue to help us here?
> if not: yaron, would there be anyone on your side?

btw, the issue mentioned above I guess is in run_tests.py, TestSharder __init__
where it tries to GetAllTests().. we need to add some retry mechanism in there
should that call fail..

Yaron

On 2012/10/29 14:51:23, bulach wrote: > https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py > File build/android/pylib/base_test_sharder.py (right): > > https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pylib/base_test_sharder.py#newcode90 > ...

8 years, 1 month ago (2012-10-29 22:59:00 UTC) #18

On 2012/10/29 14:51:23, bulach wrote:
>
https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli...
> File build/android/pylib/base_test_sharder.py (right):
> 
>
https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli...
> build/android/pylib/base_test_sharder.py:90: test_runner =
> self.CreateShardedTestRunner(device, index)
> On 2012/10/29 14:45:42, bulach wrote:
> > On 2012/10/28 01:41:03, yongsheng wrote:
> > > On 2012/10/26 18:48:44, Yaron wrote:
> > > > Based on:
> > > >
> > >
> >
>
http://build.chromium.org/p/chromium.linux/builders/Android%2525252520Tests%2...
> > > > and
> > > >
> > >
> >
>
http://build.chromium.org/p/chromium.linux/builders/Android%2525252520Tests%2...
> > > > 
> > > > It looks like the creation of sharded test runner needs to guarded with
> > > catching
> > > > DeviceUnresponsiveError or else we can fail without attempting any
> retries.
> > > > 
> > > > I'm concerned though that we aren't catching this issue more generally
and
> > we
> > > > need to ensure that all code run by a shard can enter the new retry
logic
> > > below.
> > > yes, i don't try to catch exceptions here for i'm not sure all of these
> errors
> > > are due to device problems or others.
> > > It seems that we have to do this kind of thing here.
> > 
> > yaron: just to clarify, I kindly asked Yongsheng (thanks btw! :) to help us
> with
> > the case where the device was going down when running the tests, which he
> fixed
> > here..
> > 
> > the sharding was originally done with speed in mind, not reliability. this
is
> > the very first patch to try and make it failsafe / hot reuse devices if they
> > become ill during the test cycle, so I share your concern and we'll have to
> work
> > through them...
> > 
> > so two questions then:
> > yongsheng: would you be able to continue to help us here?
> > if not: yaron, would there be anyone on your side?
> 
> btw, the issue mentioned above I guess is in run_tests.py, TestSharder
__init__
> where it tries to GetAllTests().. we need to add some retry mechanism in there
> should that call fail..

yongsheng: it would be great if you can continue stabilizing this. If you can't,
I"ll find someone to help

yongsheng

8 years, 1 month ago (2012-10-30 07:50:54 UTC) #19

> yongsheng: it would be great if you can continue stabilizing this. If you
can't,
> I"ll find someone to help
Please ask someone else to help do this urgent issue since I'll be OOO. And cc
the new issue to me. Thanks a lot!

Issue 11232037: Retry tests on other bots if the device is unresponsive/offline (Closed)

Description

Patch Set 1 #

Patch Set 2 : small fixes #

Patch Set 3 : changes according to bulach's comments #

Patch Set 4 : small improvement #

Messages