|
|
Created:
8 years, 2 months ago by yongsheng Modified:
8 years, 1 month ago CC:
chromium-reviews, peter+watch_chromium.org, bulach+watch_chromium.org, yfriedman+watch_chromium.org, ilevy+watch_chromium.org, Isaac (away) Visibility:
Public. |
DescriptionRetry tests on other bots if the device is unresponsive/offline
Currently this works for offline devices when trying to access files
on devices or emulators.
Use DeviceUnresponsiveError to indicate the device is offline.
BUG=153718
TESTS=
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=164013
Patch Set 1 #Patch Set 2 : small fixes #
Total comments: 15
Patch Set 3 : changes according to bulach's comments #
Total comments: 2
Patch Set 4 : small improvement #
Total comments: 4
Messages
Total messages: 19 (0 generated)
bulach, please check whether this solution is feasible. I test it in my local machine and it works when i unplugin my device. Thanks.
+ilevy fyi many, many thanks, really appreciate!! this will make our infra much more reliable! just a few nits and some small suggestions below: http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/android... File build/android/pylib/android_commands.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/android... build/android/pylib/android_commands.py:1061: raise errors.DeviceUnresponsiveError("Device may be offline.") nit: single quote ' http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:105: rounds += 1 hmm.. this may never finish: if we keep getting device exception, the while retry < rounds would never exit given rounds +=1 here.. maybe remove this block, and let it fall through and just use the standard retry in case of failure? we'll need to switch the bots to use this param, if they're not using it yet.. http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:110: self.attached_devices = android_commands.GetAttachedDevices() nice! for extra safety, maybe we could do something like: # Re-check the attached devices for some devices may # become offline. retry_devices = set(android_commands.GetAttachedDevices()) # Remove devices that had exceptions. retry_devices -= TestResults.DeviceExceptions(results_lists) # Retry on devices that didn't have any exception. self.attached_devices = list(retry_devices) http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/single_... File build/android/pylib/single_test_runner.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/single_... build/android/pylib/single_test_runner.py:332: failed=failed_tests, device_except=True) nit: indentation should be by 4 only. more important :) if you agree with the above, rather than device_except being a boolean, we could use: device_exception=self.device http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re... File build/android/pylib/test_result.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re... build/android/pylib/test_result.py:57: self.device_except = False if you agree with the above: self.device_exception = None http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re... build/android/pylib/test_result.py:61: overall_fail=False, device_except=False): nit: device_exception=None http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re... build/android/pylib/test_result.py:92: return False I think this could be: return any(filter(lambda t: t.device_exception, results))
thanks for your valuable comments. I'll refine them according to your suggestion. http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:105: rounds += 1 On 2012/10/23 09:30:56, bulach wrote: > hmm.. this may never finish: if we keep getting device exception, the while > retry < rounds would never exit given rounds +=1 here.. maybe remove this block, > and let it fall through and just use the standard retry in case of failure? > > we'll need to switch the bots to use this param, if they're not using it yet.. oh, yes. So what if retry equals to rounds - 1? then the failed cases won't be executed since there is no next retry. Suppose self.retries = 1 and one device gets offline, then tests on this device won't be re-run. http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:110: self.attached_devices = android_commands.GetAttachedDevices() On 2012/10/23 09:30:56, bulach wrote: > nice! for extra safety, maybe we could do something like: > # Re-check the attached devices for some devices may > # become offline. > retry_devices = set(android_commands.GetAttachedDevices()) > # Remove devices that had exceptions. > retry_devices -= TestResults.DeviceExceptions(results_lists) > # Retry on devices that didn't have any exception. > self.attached_devices = list(retry_devices) great improvement. thanks. I'll follow your suggestion. http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/single_... File build/android/pylib/single_test_runner.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/single_... build/android/pylib/single_test_runner.py:332: failed=failed_tests, device_except=True) On 2012/10/23 09:30:56, bulach wrote: > nit: indentation should be by 4 only. > > more important :) > if you agree with the above, rather than device_except being a boolean, we could > use: > > device_exception=self.device agree with you. http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re... File build/android/pylib/test_result.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/test_re... build/android/pylib/test_result.py:92: return False On 2012/10/23 09:30:56, bulach wrote: > I think this could be: > > return any(filter(lambda t: t.device_exception, results)) good.
http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:105: rounds += 1 On 2012/10/23 12:58:14, yongsheng wrote: > On 2012/10/23 09:30:56, bulach wrote: > > hmm.. this may never finish: if we keep getting device exception, the while > > retry < rounds would never exit given rounds +=1 here.. maybe remove this > block, > > and let it fall through and just use the standard retry in case of failure? > > > > we'll need to switch the bots to use this param, if they're not using it yet.. > oh, yes. > So what if retry equals to rounds - 1? then the failed cases won't be executed > since there is no next retry. Suppose self.retries = 1 and one device gets > offline, then tests on this device won't be re-run. if one device raise exception, then it's excluded in the list according to your suggestion in the below comment. and if device list is empty, the loop is done.
thanks!! I think we're agreeing already :) but just some clarifications below, just in case.. http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:105: rounds += 1 On 2012/10/23 13:08:35, yongsheng wrote: > On 2012/10/23 12:58:14, yongsheng wrote: > > On 2012/10/23 09:30:56, bulach wrote: > > > hmm.. this may never finish: if we keep getting device exception, the while > > > retry < rounds would never exit given rounds +=1 here.. maybe remove this > > block, > > > and let it fall through and just use the standard retry in case of failure? > > > > > > we'll need to switch the bots to use this param, if they're not using it > yet.. > > oh, yes. > > So what if retry equals to rounds - 1? then the failed cases won't be executed > > > since there is no next retry. Suppose self.retries = 1 and one device gets > > offline, then tests on this device won't be re-run. > if one device raise exception, then it's excluded in the list according to your > suggestion in the below comment. and if device list is empty, the loop is done. oh, sorry, I think the current semantics (before your patch!!) for retry here is broken :) perhaps we could implement the following: - if there are are no failures or is the last round, break the loop.. - else: remove devices that had exceptions from the list (if any), and try again.. this is guaranteed to terminate, and to retry if necessary. what do you think? how does that sound?
http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:105: rounds += 1 On 2012/10/23 15:01:53, bulach wrote: > On 2012/10/23 13:08:35, yongsheng wrote: > > On 2012/10/23 12:58:14, yongsheng wrote: > > > On 2012/10/23 09:30:56, bulach wrote: > > > > hmm.. this may never finish: if we keep getting device exception, the > while > > > > retry < rounds would never exit given rounds +=1 here.. maybe remove this > > > block, > > > > and let it fall through and just use the standard retry in case of > failure? > > > > > > > > we'll need to switch the bots to use this param, if they're not using it > > yet.. > > > oh, yes. > > > So what if retry equals to rounds - 1? then the failed cases won't be > executed > > > > > since there is no next retry. Suppose self.retries = 1 and one device gets > > > offline, then tests on this device won't be re-run. > > if one device raise exception, then it's excluded in the list according to > your > > suggestion in the below comment. and if device list is empty, the loop is > done. > > oh, sorry, I think the current semantics (before your patch!!) for retry here is > broken :) > perhaps we could implement the following: > - if there are are no failures or is the last round, break the loop.. > my only concern is that if there are failures(due to device exceptions) and it's the last round (and there are attached devices), so we just ignore the failures and don't retry them. If you think it's acceptable, I'll follw up your suggestion here. > - else: remove devices that had exceptions from the list (if any), and try > again.. > > this is guaranteed to terminate, and to retry if necessary. > > what do you think? > > how does that sound?
thanks!! clarification inline, please let me know your thoughts! http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/1007/build/android/pylib/base_te... build/android/pylib/base_test_sharder.py:105: rounds += 1 On 2012/10/24 01:41:52, yongsheng wrote: > On 2012/10/23 15:01:53, bulach wrote: > > On 2012/10/23 13:08:35, yongsheng wrote: > > > On 2012/10/23 12:58:14, yongsheng wrote: > > > > On 2012/10/23 09:30:56, bulach wrote: > > > > > hmm.. this may never finish: if we keep getting device exception, the > > while > > > > > retry < rounds would never exit given rounds +=1 here.. maybe remove > this > > > > block, > > > > > and let it fall through and just use the standard retry in case of > > failure? > > > > > > > > > > we'll need to switch the bots to use this param, if they're not using it > > > yet.. > > > > oh, yes. > > > > So what if retry equals to rounds - 1? then the failed cases won't be > > executed > > > > > > > since there is no next retry. Suppose self.retries = 1 and one device gets > > > > offline, then tests on this device won't be re-run. > > > if one device raise exception, then it's excluded in the list according to > > your > > > suggestion in the below comment. and if device list is empty, the loop is > > done. > > > > oh, sorry, I think the current semantics (before your patch!!) for retry here > is > > broken :) > > perhaps we could implement the following: > > - if there are are no failures or is the last round, break the loop.. > > > my only concern is that if there are failures(due to device exceptions) and it's > the last round (and there are attached devices), so we just ignore the failures > and don't retry them. If you think it's acceptable, I'll follw up your > suggestion here. > > - else: remove devices that had exceptions from the list (if any), and try > > again.. > > > > this is guaranteed to terminate, and to retry if necessary. > > > > what do you think? > > > > how does that sound? > I think that'd be fine, at least it's deterministic :) otherwise, if we have two devices, and they keep alternating the failures, we'd never exit the loop...
lgtm, sorry, forgot to mention it! :) many thanks once again, this will help us a lot!!
Thanks! Looks good to me too! One nit in android_commands. https://chromiumcodereview.appspot.com/11232037/diff/13001/build/android/pyli... File build/android/pylib/android_commands.py (right): https://chromiumcodereview.appspot.com/11232037/diff/13001/build/android/pyli... build/android/pylib/android_commands.py:110: def IsDeviceAttached(device): We can simplify this to: def IsDeviceAttached(device): return device in GetAttachedDevices()
http://codereview.chromium.org/11232037/diff/13001/build/android/pylib/androi... File build/android/pylib/android_commands.py (right): http://codereview.chromium.org/11232037/diff/13001/build/android/pylib/androi... build/android/pylib/android_commands.py:110: def IsDeviceAttached(device): On 2012/10/24 17:33:12, Isaac wrote: > We can simplify this to: > > def IsDeviceAttached(device): > return device in GetAttachedDevices() yes, that's good. I'll change it.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/yongsheng.zhu@intel.com/11232037/20002
Change committed as 164013
https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... File build/android/pylib/base_test_sharder.py (right): https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) Based on: http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%... and http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%... It looks like the creation of sharded test runner needs to guarded with catching DeviceUnresponsiveError or else we can fail without attempting any retries. I'm concerned though that we aren't catching this issue more generally and we need to ensure that all code run by a shard can enter the new retry logic below.
http://codereview.chromium.org/11232037/diff/20002/build/android/pylib/base_t... File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11232037/diff/20002/build/android/pylib/base_t... build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) On 2012/10/26 18:48:44, Yaron wrote: > Based on: > http://build.chromium.org/p/chromium.linux/builders/Android%2520Tests%2520%25... > and > http://build.chromium.org/p/chromium.linux/builders/Android%2520Tests%2520%25... > > It looks like the creation of sharded test runner needs to guarded with catching > DeviceUnresponsiveError or else we can fail without attempting any retries. > > I'm concerned though that we aren't catching this issue more generally and we > need to ensure that all code run by a shard can enter the new retry logic below. yes, i don't try to catch exceptions here for i'm not sure all of these errors are due to device problems or others. It seems that we have to do this kind of thing here.
https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... File build/android/pylib/base_test_sharder.py (right): https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) On 2012/10/28 01:41:03, yongsheng wrote: > On 2012/10/26 18:48:44, Yaron wrote: > > Based on: > > > http://build.chromium.org/p/chromium.linux/builders/Android%252520Tests%25252... > > and > > > http://build.chromium.org/p/chromium.linux/builders/Android%252520Tests%25252... > > > > It looks like the creation of sharded test runner needs to guarded with > catching > > DeviceUnresponsiveError or else we can fail without attempting any retries. > > > > I'm concerned though that we aren't catching this issue more generally and we > > need to ensure that all code run by a shard can enter the new retry logic > below. > yes, i don't try to catch exceptions here for i'm not sure all of these errors > are due to device problems or others. > It seems that we have to do this kind of thing here. yaron: just to clarify, I kindly asked Yongsheng (thanks btw! :) to help us with the case where the device was going down when running the tests, which he fixed here.. the sharding was originally done with speed in mind, not reliability. this is the very first patch to try and make it failsafe / hot reuse devices if they become ill during the test cycle, so I share your concern and we'll have to work through them... so two questions then: yongsheng: would you be able to continue to help us here? if not: yaron, would there be anyone on your side?
https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... File build/android/pylib/base_test_sharder.py (right): https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... build/android/pylib/base_test_sharder.py:90: test_runner = self.CreateShardedTestRunner(device, index) On 2012/10/29 14:45:42, bulach wrote: > On 2012/10/28 01:41:03, yongsheng wrote: > > On 2012/10/26 18:48:44, Yaron wrote: > > > Based on: > > > > > > http://build.chromium.org/p/chromium.linux/builders/Android%25252520Tests%252... > > > and > > > > > > http://build.chromium.org/p/chromium.linux/builders/Android%25252520Tests%252... > > > > > > It looks like the creation of sharded test runner needs to guarded with > > catching > > > DeviceUnresponsiveError or else we can fail without attempting any retries. > > > > > > I'm concerned though that we aren't catching this issue more generally and > we > > > need to ensure that all code run by a shard can enter the new retry logic > > below. > > yes, i don't try to catch exceptions here for i'm not sure all of these errors > > are due to device problems or others. > > It seems that we have to do this kind of thing here. > > yaron: just to clarify, I kindly asked Yongsheng (thanks btw! :) to help us with > the case where the device was going down when running the tests, which he fixed > here.. > > the sharding was originally done with speed in mind, not reliability. this is > the very first patch to try and make it failsafe / hot reuse devices if they > become ill during the test cycle, so I share your concern and we'll have to work > through them... > > so two questions then: > yongsheng: would you be able to continue to help us here? > if not: yaron, would there be anyone on your side? btw, the issue mentioned above I guess is in run_tests.py, TestSharder __init__ where it tries to GetAllTests().. we need to add some retry mechanism in there should that call fail..
On 2012/10/29 14:51:23, bulach wrote: > https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... > File build/android/pylib/base_test_sharder.py (right): > > https://chromiumcodereview.appspot.com/11232037/diff/20002/build/android/pyli... > build/android/pylib/base_test_sharder.py:90: test_runner = > self.CreateShardedTestRunner(device, index) > On 2012/10/29 14:45:42, bulach wrote: > > On 2012/10/28 01:41:03, yongsheng wrote: > > > On 2012/10/26 18:48:44, Yaron wrote: > > > > Based on: > > > > > > > > > > http://build.chromium.org/p/chromium.linux/builders/Android%2525252520Tests%2... > > > > and > > > > > > > > > > http://build.chromium.org/p/chromium.linux/builders/Android%2525252520Tests%2... > > > > > > > > It looks like the creation of sharded test runner needs to guarded with > > > catching > > > > DeviceUnresponsiveError or else we can fail without attempting any > retries. > > > > > > > > I'm concerned though that we aren't catching this issue more generally and > > we > > > > need to ensure that all code run by a shard can enter the new retry logic > > > below. > > > yes, i don't try to catch exceptions here for i'm not sure all of these > errors > > > are due to device problems or others. > > > It seems that we have to do this kind of thing here. > > > > yaron: just to clarify, I kindly asked Yongsheng (thanks btw! :) to help us > with > > the case where the device was going down when running the tests, which he > fixed > > here.. > > > > the sharding was originally done with speed in mind, not reliability. this is > > the very first patch to try and make it failsafe / hot reuse devices if they > > become ill during the test cycle, so I share your concern and we'll have to > work > > through them... > > > > so two questions then: > > yongsheng: would you be able to continue to help us here? > > if not: yaron, would there be anyone on your side? > > btw, the issue mentioned above I guess is in run_tests.py, TestSharder __init__ > where it tries to GetAllTests().. we need to add some retry mechanism in there > should that call fail.. yongsheng: it would be great if you can continue stabilizing this. If you can't, I"ll find someone to help
> yongsheng: it would be great if you can continue stabilizing this. If you can't, > I"ll find someone to help Please ask someone else to help do this urgent issue since I'll be OOO. And cc the new issue to me. Thanks a lot! |