third_party/gsutil/gslib/addlhelp/prod.py - Issue 12685010: Added gsutil/gslib to depot_tools/third_party

Side by Side Diff: third_party/gsutil/gslib/addlhelp/prod.py

Issue 12685010: Added gsutil/gslib to depot_tools/third_party (Closed) Base URL: https://chromium.googlesource.com/chromium/tools/depot_tools.git@master

Patch Set: Created 7 years, 9 months ago


OLD	NEW
(Empty)
	1 # Copyright 2012 Google Inc. All Rights Reserved.

	2 #

	3 # Licensed under the Apache License, Version 2.0 (the "License");

	4 # you may not use this file except in compliance with the License.

	5 # You may obtain a copy of the License at

	6 #

	7 # http://www.apache.org/licenses/LICENSE-2.0

	8 #

	9 # Unless required by applicable law or agreed to in writing, software

	10 # distributed under the License is distributed on an "AS IS" BASIS,

	11 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

	12 # See the License for the specific language governing permissions and

	13 # limitations under the License.

	14

	15 from gslib.help_provider import HELP_NAME

	16 from gslib.help_provider import HELP_NAME_ALIASES

	17 from gslib.help_provider import HELP_ONE_LINE_SUMMARY

	18 from gslib.help_provider import HelpProvider

	19 from gslib.help_provider import HELP_TEXT

	20 from gslib.help_provider import HelpType

	21 from gslib.help_provider import HELP_TYPE

	22

	23 _detailed_help_text = ("""

	24 <B>OVERVIEW</B>

	25 If you use gsutil in large production tasks (such as uploading or

	26 downloading many GB of data each night), there are a number of things

	27 you can do to help ensure success. Specifically, this section discusses

	28 how to script large production tasks around gsutil's resumable transfer

	29 mechanism.

	30

	31

	32 <B>BACKGROUND ON RESUMABLE TRANSFERS</B>

	33 First, it's helpful to understand gsutil's resumable transfer mechanism,

	34 and how your script needs to be implemented around this mechanism to work

	35 reliably. gsutil uses the resumable transfer support in the boto library

	36 when you attempt to upload or download a file larger than a configurable

	37 threshold (by default, this threshold is 1MB). When a transfer fails

	38 partway through (e.g., because of an intermittent network problem),

	39 boto uses a randomized binary exponential backoff-and-retry strategy:

	40 wait a random period between [0..1] seconds and retry; if that fails,

	41 wait a random period between [0..2] seconds and retry; and if that

	42 fails, wait a random period between [0..4] seconds, and so on, up to a

	43 configurable number of times (the default is 6 times). Thus, the retry

	44 actually spans a randomized period up to 1+2+4+8+16+32=63 seconds.
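
The backoff arithmetic above can be sketched in a few lines of Python. This is only an illustration of the schedule the help text describes, not boto's actual implementation; the function names `backoff_delays` and `max_total_delay` are hypothetical.

```python
import random


def backoff_delays(num_retries=6, rng=random.random):
    """Yield one randomized delay (in seconds) per retry attempt.

    Attempt k (0-based) waits a uniform random period in [0, 2**k),
    mirroring the [0..1], [0..2], [0..4], ... schedule described above.
    """
    for attempt in range(num_retries):
        yield rng() * (2 ** attempt)


def max_total_delay(num_retries=6):
    """Upper bound on the total wait: 1 + 2 + 4 + ... = 2**n - 1 seconds."""
    return sum(2 ** attempt for attempt in range(num_retries))
```

With the default of 6 retries, `max_total_delay()` evaluates 1+2+4+8+16+32, the 63-second bound quoted in the text; the actual wait is random and usually shorter.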

	45

	46 If the transfer fails each of these attempts with no intervening

	47 progress, gsutil gives up on the transfer, but keeps a "tracker" file

	48 for it in a configurable location (the default location is ~/.gsutil/,

	49 in a file named by a combination of the SHA1 hash of the name of the

	50 bucket and object being transferred and the last 16 characters of the

	51 file name). When transfers fail in this fashion, you can rerun gsutil

	52 at some later time (e.g., after the networking problem has been

	53 resolved), and the resumable transfer picks up where it left off.

	54

	55

	56 <B>SCRIPTING DATA TRANSFER TASKS</B>

	57 To script large production data transfer tasks around this mechanism,

	58 you can implement a script that runs periodically, determines which file

	59 transfers have not yet succeeded, and runs gsutil to copy them. Below,

	60 we offer a number of suggestions about how this type of scripting should

	61 be implemented:

	62

	63 1. When resumable transfers fail without any progress 6 times in a row

	64 over the course of up to 63 seconds, it probably won't work to simply

	65 retry the transfer immediately. A more successful strategy would be to

	66 have a cron job that runs every 30 minutes, determines which transfers

	67 need to be run, and runs them. If the network experiences intermittent

	68 problems, the script picks up where it left off and will eventually

	69 succeed (once the network problem has been resolved).

	70

	71 2. If your business depends on timely data transfer, you should consider

	72 implementing some network monitoring. For example, you can implement

	73 a task that attempts a small download every few minutes (or more or less

	74 frequently, depending on your requirements) and raises an alert if several

	75 attempts in a row fail, so that your IT staff can

	76 investigate problems promptly. As usual with monitoring implementations,

	77 you should experiment with the alerting thresholds, to avoid false

	78 positive alerts that cause your staff to begin ignoring the alerts.

	79

	80 3. There are a variety of ways you can determine what files remain to be

	81 transferred. We recommend that you avoid attempting to get a complete

	82 listing of a bucket containing many objects (e.g., tens of thousands

	83 or more). One strategy is to structure your object names in a way that

	84 represents your transfer process, and use gsutil prefix wildcards to

	85 request partial bucket listings. For example, if your periodic process

	86 involves downloading the current day's objects, you could name objects

	87 using a year-month-day-object-ID format and then find today's objects by

	88 using a command like gsutil ls gs://bucket/2011-09-27-*. Note that it

	89 is more efficient to have a non-wildcard prefix like this than to use

	90 something like gsutil ls gs://bucket/*-2011-09-27. The latter command

	91 actually requests a complete bucket listing and then filters in gsutil,

	92 while the former asks Google Storage to return the subset of objects

	93 whose names start with everything up to the *.

	94

	95 For data uploads, another technique would be to move local files from a "to

	96 be processed" area to a "done" area as your script successfully copies files

	97 to the cloud. You can do this in parallel batches by using a command like:

	98

	99 gsutil -m cp -R to_upload/subdir_$i gs://bucket/subdir_$i

	100

	101 where i is a shell loop variable. Make sure to check that the shell exit status

	102 ($? in sh and bash, $status in csh and tcsh) is 0 after each gsutil cp command,

	103 to detect if some of the copies failed, and rerun the affected copies.

	104

	105 With this strategy, the file system keeps track of all remaining work to

	106 be done.

	107

	108 4. If you have really large numbers of objects in a single bucket

	109 (say hundreds of thousands or more), you should consider tracking your

	110 objects in a database instead of using bucket listings to enumerate

	111 the objects. For example, this database could track the state of your

	112 downloads, so you can determine what objects need to be downloaded by

	113 your periodic download script by querying the database locally instead

	114 of performing a bucket listing.

	115

	116 5. Make sure you don't delete partially downloaded files after a transfer

	117 fails: gsutil picks up where it left off (and performs an MD5 check of

	118 the final downloaded content to ensure data integrity), so deleting

	119 partially transferred files will cause you to lose progress and make

	120 more wasteful use of your network. You should also make sure whatever

	121 process is waiting to consume the downloaded data doesn't get pointed

	122 at the partially downloaded files. One way to do this is to download

	123 into a staging directory and then move successfully downloaded files to

	124 a directory where consumer processes will read them.

	125

	126 6. If you have a fast network connection, you can speed up the transfer of

	127 large numbers of files by using the gsutil -m (multi-threading /

	128 multi-processing) option. Be aware, however, that gsutil doesn't attempt to

	129 keep track of which files were downloaded successfully in cases where some

	130 files failed to download. For example, if you use multi-threaded transfers

	131 to download 100 files and 3 failed to download, it is up to your scripting

	132 process to determine which transfers didn't succeed, and retry them. A

	133 periodic check-and-run approach like the one outlined earlier would handle this case.

	134

	135 If you use parallel transfers (gsutil -m), you might want to experiment with

	136 the number of threads being used (via the parallel_thread_count setting

	137 in the .boto config file). By default, gsutil uses 24 threads. Depending

	138 on your network speed, available memory, CPU load, and other conditions,

	139 this may or may not be optimal. Try experimenting with higher or lower

	140 numbers of threads, to find the best number of threads for your environment.

	141 """)
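
The upload strategy from suggestions 1 and 3 of the help text (copy each batch, check the exit status, and move finished work out of the "to be processed" area) could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the function name `upload_pending`, the directory layout, and the injectable `run` parameter are all hypothetical, and the only gsutil invocation used is the `gsutil -m cp -R` form quoted in the help text.

```python
import shutil
import subprocess
from pathlib import Path


def upload_pending(to_upload, done, bucket_url, run=subprocess.call):
    """Copy each pending subdirectory with gsutil; move it to `done` on success.

    Returns the names of subdirectories whose copy failed, so that a later
    periodic run (for example, from cron) can retry them.
    """
    to_upload, done = Path(to_upload), Path(done)
    done.mkdir(parents=True, exist_ok=True)
    failed = []
    for subdir in sorted(p for p in to_upload.iterdir() if p.is_dir()):
        cmd = ["gsutil", "-m", "cp", "-R", str(subdir),
               f"{bucket_url}/{subdir.name}"]
        if run(cmd) == 0:
            # Exit status 0: the whole batch was copied, so record it as done.
            shutil.move(str(subdir), str(done / subdir.name))
        else:
            # Non-zero status: leave the batch in place for the next run.
            failed.append(subdir.name)
    return failed
```

Because unfinished batches simply stay in `to_upload`, rerunning the function is safe: the file system itself records what work remains.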

	142

	143

	144 class CommandOptions(HelpProvider):

	145 """Additional help about using gsutil for production tasks."""

	146

	147 help_spec = {

	148 # Name of command or auxiliary help info for which this help applies.

	149 HELP_NAME : 'prod',

	150 # List of help name aliases.

	151 HELP_NAME_ALIASES : ['production', 'resumable', 'resumable upload',

	152 'resumable transfer', 'resumable download',

	153 'scripts', 'scripting'],

	154 # Type of help:

	155 HELP_TYPE : HelpType.ADDITIONAL_HELP,

	156 # One line summary of this help.

	157 HELP_ONE_LINE_SUMMARY : 'Scripting production data transfers with gsutil',

	158 # The full help text.

	159 HELP_TEXT : _detailed_help_text,

	160 }
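
Suggestion 5 of the help text (download into a staging directory, then move completed files to where consumers read them) might look like the sketch below. The function name `download_via_staging`, the directory arguments, and the injectable `run` parameter are hypothetical; the sketch also assumes the staging and consumer directories are on the same file system, so that the final rename is atomic.

```python
import os
import subprocess
from pathlib import Path


def download_via_staging(object_url, staging_dir, ready_dir, run=subprocess.call):
    """Download into staging_dir; publish to ready_dir only on success."""
    staging_dir, ready_dir = Path(staging_dir), Path(ready_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    ready_dir.mkdir(parents=True, exist_ok=True)
    name = object_url.rsplit("/", 1)[-1]
    staged = staging_dir / name
    if run(["gsutil", "cp", object_url, str(staged)]) != 0:
        # Keep any partial file so a later run can resume the transfer.
        return False
    # Consumers never see a partial file: the rename happens only after
    # gsutil reports success.
    os.replace(staged, ready_dir / name)
    return True
```

A failed download returns `False` and deliberately leaves the partial file in the staging directory, in line with the advice not to delete partially downloaded files.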
