Issue 11275078: Android: improves test sharding reliability.

Issue 11275078: Android: improves test sharding reliability. (Closed)

Created:
8 years, 1 month ago by bulach

Modified:
8 years, 1 month ago

Reviewers:
Yaron, yongsheng, Isaac (away), yfriedman

CC:
chromium-reviews, peter+watch_chromium.org, bulach+watch_chromium.org, yfriedman+watch_chromium.org, ilevy+watch_chromium.org

Base URL:
svn://svn.chromium.org/chrome/trunk/src

Visibility:
Public.

More Reviews

Description

Android: improves test sharding reliability. Originally test sharding for android was written with performance in mind (i.e., to scale up the test speed per device). Now that we're on the main waterfall, we need to improve reliability as devices may randomly drop offline during tests. This patch captures exceptions in key places and retries if there are enough devices available. BUG=153718 TEST=run android tests, randomly unplugging devices Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=165918

Patch Set 1 #

Total comments: 14

Patch Set 2 : Comments #

Total comments: 6

Patch Set 3 : Unrecoverable error #

Total comments: 3

Created: 8 years, 1 month ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+66 lines, -25 lines)			Patch
	M	build/android/pylib/base_test_sharder.py	View	1 2	5 chunks	+28 lines, -9 lines	3 comments	Download
	M	build/android/run_tests.py	View	1	1 chunk	+38 lines, -16 lines	0 comments	Download

Messages

Total messages: 16 (0 generated)

Expand Messages | Collapse Messages

bulach

managed to squeeze this in whilst waiting for other patches, would you mind taking a ...

8 years, 1 month ago (2012-10-31 16:41:26 UTC) #1

Yaron

Thanks for looking into this http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py#newcode91 build/android/pylib/base_test_sharder.py:91: try: Shouldn't this try ...

8 years, 1 month ago (2012-10-31 17:18:51 UTC) #2

bulach

thanks yaron! another look please? http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py#newcode91 build/android/pylib/base_test_sharder.py:91: try: On 2012/10/31 17:18:51, ...

8 years, 1 month ago (2012-10-31 18:10:35 UTC) #3

thanks yaron! another look please?

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_...
File build/android/pylib/base_test_sharder.py (right):

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_...
build/android/pylib/base_test_sharder.py:91: try:
On 2012/10/31 17:18:51, Yaron wrote:
> Shouldn't this try be in the for loop? |device| is only defined in there.

not really.. we need to create all shards at once, or fail.
python allows for |device| to be available outside the for scope.
if we move the try inside the for loop, we'd need to somehow abort the current
test_runners creation and retry all over again..

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_...
build/android/pylib/base_test_sharder.py:101: continue
On 2012/10/31 17:18:51, Yaron wrote:
> By continuing, we're not updating |final_results|. It looks like it might only
> be tabulating successful tests in which case it's ok to skip?

this continue here would be for the loop on 85, so there's nothing to update yet
here...

the flow is basically:

- try to create N shards.
|-- if it fails, retry.
- dispatch and collect results for the N
|-- if it fails, retry

at the end, since we got N results for N devices, try to optimize what should
run next..

does it make sense?

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_...
build/android/pylib/base_test_sharder.py:111: results_lists =
async_results.get(999999)
On 2012/10/31 17:18:51, Yaron wrote:
> Will the exception get thrown while the other shards are still running? If so,
> does this mean that we'll whip around and retry shards of tests on devices
that
> are still running their old tests?

it's not clear from the python documentation if the exception is raised as soon
as the shard throws, or later when all the results are collected.

thankfully, it doesn't matter :)

we'll create the shards again, and that re-installs the apks, which kills the
process.

http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py
File build/android/run_tests.py (right):

http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new...
build/android/run_tests.py:227: return all_tests, available_devices
On 2012/10/31 17:18:51, Yaron wrote:
> Don't you want to collect all the devices and merge the lists of tests? This
> looks like as soon as it finds an available device, it'll return the tests
from
> that device. Or is it just that you need a device to get the list of devices
and
> you need to query each device for it's shard of tests.

sort of the latter (see 211 on the other side), but to clarify:

this needs a single device D (out of N) to capture the full list of tests, that
will then be sharded higher up.

if D fails, we'll assume it's a bad kitten and remove from N and try again.

http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new...
build/android/run_tests.py:228: break
On 2012/10/31 17:18:51, Yaron wrote:
> "break" unnecessary given "return" above

Done.

Yaron

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py#newcode101 build/android/pylib/base_test_sharder.py:101: continue On 2012/10/31 18:10:35, bulach wrote: > On 2012/10/31 ...

8 years, 1 month ago (2012-10-31 18:37:23 UTC) #4

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_...
File build/android/pylib/base_test_sharder.py (right):

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_...
build/android/pylib/base_test_sharder.py:101: continue
On 2012/10/31 18:10:35, bulach wrote:
> On 2012/10/31 17:18:51, Yaron wrote:
> > By continuing, we're not updating |final_results|. It looks like it might
only
> > be tabulating successful tests in which case it's ok to skip?
> 
> this continue here would be for the loop on 85, so there's nothing to update
yet
> here...
> 
> the flow is basically:
> 
> - try to create N shards.
> |-- if it fails, retry.
> - dispatch and collect results for the N
> |-- if it fails, retry
> 
> at the end, since we got N results for N devices, try to optimize what should
> run next..
> 
> does it make sense?

Yes. My point was though that it looks like it's doing special accounting on the
last run of the loop. If you "continue" here, it never gets a chance to do that
and we can't report that any tests successfully ran. I guess if we end up with 0
devices it'll be a pretty big failure anyway so it's ok.

However, looking at OnTestsCompleted in build/android/run_tests.py it assumes
that test_runners will be non-empty and that's not true.

http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_...
build/android/pylib/base_test_sharder.py:111: results_lists =
async_results.get(999999)
On 2012/10/31 18:10:35, bulach wrote:
> On 2012/10/31 17:18:51, Yaron wrote:
> > Will the exception get thrown while the other shards are still running? If
so,
> > does this mean that we'll whip around and retry shards of tests on devices
> that
> > are still running their old tests?
> 
> it's not clear from the python documentation if the exception is raised as
soon
> as the shard throws, or later when all the results are collected.
> 
> thankfully, it doesn't matter :)
> 
> we'll create the shards again, and that re-installs the apks, which kills the
> process.

Ah, right. Thanks.

http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py
File build/android/run_tests.py (right):

http://codereview.chromium.org/11275078/diff/1/build/android/run_tests.py#new...
build/android/run_tests.py:227: return all_tests, available_devices
On 2012/10/31 18:10:35, bulach wrote:
> On 2012/10/31 17:18:51, Yaron wrote:
> > Don't you want to collect all the devices and merge the lists of tests? This
> > looks like as soon as it finds an available device, it'll return the tests
> from
> > that device. Or is it just that you need a device to get the list of devices
> and
> > you need to query each device for it's shard of tests.
> 
> sort of the latter (see 211 on the other side), but to clarify:
> 
> this needs a single device D (out of N) to capture the full list of tests,
that
> will then be sharded higher up.
> 
> if D fails, we'll assume it's a bad kitten and remove from N and try again.

My point should've read "and you _don't_ need to query..".

Makes sense.

Isaac (away)

Thanks for getting to this Marcus. This looks good to me, I'll give it one ...

8 years, 1 month ago (2012-10-31 19:18:14 UTC) #5

bulach

thanks for the clarification yaron! comment addressed, ptal http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/1/build/android/pylib/base_test_sharder.py#newcode101 build/android/pylib/base_test_sharder.py:101: continue ...

8 years, 1 month ago (2012-10-31 19:18:17 UTC) #6

bulach

thanks isaac! clarifications inline, please let me know your thoughts: http://codereview.chromium.org/11275078/diff/5002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/5002/build/android/pylib/base_test_sharder.py#newcode112 ...

8 years, 1 month ago (2012-10-31 19:27:31 UTC) #7

yongsheng

thanks. http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py#newcode115 build/android/pylib/base_test_sharder.py:115: continue If one device raises an exception, this ...

8 years, 1 month ago (2012-11-01 02:22:49 UTC) #9

bulach

thanks Yongsheng! clarification below: http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py#newcode115 build/android/pylib/base_test_sharder.py:115: continue On 2012/11/01 02:22:49, yongsheng ...

8 years, 1 month ago (2012-11-01 11:41:06 UTC) #10

yfriedman

I agree with Marcus. Let's try this now for stability's sake and optimize later. On ...

8 years, 1 month ago (2012-11-01 15:33:01 UTC) #11

I agree with Marcus. Let's try this now for stability's sake and optimize
later.


On Thu, Nov 1, 2012 at 4:41 AM, <bulach@chromium.org> wrote:

> thanks Yongsheng! clarification below:
>
>
>
> http://codereview.chromium.**org/11275078/diff/8002/build/**
>
android/pylib/base_test_**sharder.py<http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py>
> File build/android/pylib/base_test_**sharder.py (right):
>
> http://codereview.chromium.**org/11275078/diff/8002/build/**
>
android/pylib/base_test_**sharder.py#newcode115<http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py#newcode115>
> build/android/pylib/base_test_**sharder.py:115: continue
> On 2012/11/01 02:22:49, yongsheng wrote:
>
>> If one device raises an exception, this might block other test
>>
> runners.
>
>> what about catching up exceptions in "_ShardedTestRunnable" so that
>>
> once one is
>
>> failed, others can continue to run? then in the next round, we just
>>
> need to try
>
>> those failures?
>>
>
> hmm, sorry, I'm not sure what do you mean by "block other test
> runners"...
> I just did a quick test, and if you run "foo.py 4", it blocks for
> ~10secs. running "foo.py 5" returns immediately, i.e., the exception
> thrown by any of the shards immediately bubbles up...
>
> so in our case here, we'd immediately retry on the remaining runners,
> and as mentioned to yaron above, they wouldn't be blocked...
>
> I appreciate we could potentially optimize this :) and just retry the
> failures.
> however, there's currently no simple way to communicate that back from
> "_ShardedTestRunnable" (we'd need more logic there to set all tests as
> broken with a device exception so that 121 and 132-134 would work...)
>
> if that's ok, would you be fine with doing this as a possible follow up?
>
> I'd like to gather data first to understand what are all failure
> scenarios in the bots.
>
>
> ===foo.py===
>
> import multiprocessing
> import sys
> import time
>
> def sleep_or_raise(i):
>   if i < 4:
>     time.sleep(10)
>   else:
>     raise Exception('Raised!!!')
>
> n = int(sys.argv[1])
> pool = multiprocessing.Pool(n)
> r = pool.map_async(sleep_or_raise, xrange(n))
> r.get(9999999)
>
>
http://codereview.chromium.**org/11275078/<http://codereview.chromium.org/112...
>

yongsheng

thanks for your explanation. Please go ahead http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py File build/android/pylib/base_test_sharder.py (right): http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_test_sharder.py#newcode115 build/android/pylib/base_test_sharder.py:115: continue On ...

8 years, 1 month ago (2012-11-02 00:55:11 UTC) #13

thanks for your explanation. Please go ahead

http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te...
File build/android/pylib/base_test_sharder.py (right):

http://codereview.chromium.org/11275078/diff/8002/build/android/pylib/base_te...
build/android/pylib/base_test_sharder.py:115: continue
On 2012/11/01 11:41:06, bulach wrote:
> On 2012/11/01 02:22:49, yongsheng wrote:
> > If one device raises an exception, this might block other test runners.
> > what about catching up exceptions in "_ShardedTestRunnable" so that once one
> is
> > failed, others can continue to run? then in the next round, we just need to
> try
> > those failures?
> 
> hmm, sorry, I'm not sure what do you mean by "block other test runners"...
> I just did a quick test, and if you run "foo.py 4", it blocks for ~10secs.
> running "foo.py 5" returns immediately, i.e., the exception thrown by any of
the
> shards immediately bubbles up...
> 
> so in our case here, we'd immediately retry on the remaining runners, and as
> mentioned to yaron above, they wouldn't be blocked...
> 
> I appreciate we could potentially optimize this :) and just retry the
failures.
> however, there's currently no simple way to communicate that back from
> "_ShardedTestRunnable" (we'd need more logic there to set all tests as broken
> with a device exception so that 121 and 132-134 would work...)
you're right. That's what i mean: one exception is raised from one device and
other devices running cases will be interrupted. yes, that's just a
optimization.
 
> if that's ok, would you be fine with doing this as a possible follow up?
> 
> I'd like to gather data first to understand what are all failure scenarios in
> the bots.
> 
yes, please.
> 
> ===foo.py===
> 
> import multiprocessing
> import sys
> import time
> 
> def sleep_or_raise(i):
>   if i < 4:
>     time.sleep(10)
>   else:
>     raise Exception('Raised!!!')
> 
> n = int(sys.argv[1])
> pool = multiprocessing.Pool(n)
> r = pool.map_async(sleep_or_raise, xrange(n))
> r.get(9999999)
> 
> 

oh, yes, Yaron is right. we can help improve this later. Please go ahead.

bulach

thanks everyone! I'll put through the try bots now and cq later..

8 years, 1 month ago (2012-11-02 09:55:21 UTC) #14

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/bulach@chromium.org/11275078/8002

8 years, 1 month ago (2012-11-05 09:36:18 UTC) #15

Change committed as 165918

Expand Messages | Collapse Messages