OLD | NEW |
(Empty) | |
| 1 .. cloudsearch_tut: |
| 2 |
| 3 =============================================== |
| 4 An Introduction to boto's Cloudsearch interface |
| 5 =============================================== |
| 6 |
| 7 This tutorial focuses on the boto interface to AWS' Cloudsearch_. It |
| 8 assumes that you have boto already downloaded and installed. |
| 9 |
| 10 .. _Cloudsearch: http://aws.amazon.com/cloudsearch/ |
| 11 |
| 12 Creating a Connection |
| 13 --------------------- |
| 14 The first step in accessing CloudSearch is to create a connection to the service. |
| 15 |
| 16 The recommended method of doing this is as follows:: |
| 17 |
| 18 >>> import boto.cloudsearch |
| 19 >>> conn = boto.cloudsearch.connect_to_region("us-east-1", aws_access_key_id='<aws access key>', aws_secret_access_key='<aws secret key>') |
| 20 |
| 21 At this point, the variable conn will point to a CloudSearch connection object |
| 22 in the us-east-1 region. Currently, this is the only region which has the |
| 23 CloudSearch service. In this example, the AWS access key and AWS secret key are |
| 24 passed in to the method explicitly. Alternatively, you can set the environment |
| 25 variables: |
| 26 |
| 27 * `AWS_ACCESS_KEY_ID` - Your AWS Access Key ID |
| 28 * `AWS_SECRET_ACCESS_KEY` - Your AWS Secret Access Key |
| 29 |
| 30 and then simply call:: |
| 31 |
| 32 >>> import boto.cloudsearch |
| 33 >>> conn = boto.cloudsearch.connect_to_region("us-east-1") |
| 34 |
| 35 In either case, conn will point to the Connection object which we will use |
| 36 throughout the remainder of this tutorial. |
| 37 |
| 38 Creating a Domain |
| 39 ----------------- |
| 40 |
| 41 Once you have a connection established with the CloudSearch service, you will |
| 42 want to create a domain. A domain encapsulates the data that you wish to index, |
| 43 as well as indexes and metadata relating to it. |
| 44 |
| 45 >>> from boto.cloudsearch.domain import Domain |
| 46 >>> domain = Domain(conn, conn.create_domain('demo')) |
| 47 |
| 48 This domain can be used to control access policies, indexes, and the actual |
| 49 document service, which you will use to index and search. |
| 50 |
| 51 Setting access policies |
| 52 ----------------------- |
| 53 |
| 54 Before you can connect to a document service, you need to set the correct access properties. |
| 55 For example, if you were connecting from 192.168.1.0, you could give yourself access as follows: |
| 56 |
| 57 >>> our_ip = '192.168.1.0' |
| 58 |
| 59 >>> # Allow our IP address to access the document and search services |
| 60 >>> policy = domain.get_access_policies() |
| 61 >>> policy.allow_search_ip(our_ip) |
| 62 >>> policy.allow_doc_ip(our_ip) |
| 63 |
| 64 You can use the allow_search_ip() and allow_doc_ip() methods to give different |
| 65 CIDR blocks access to searching and the document service respectively. |
| 66 |
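| You can pass any CIDR block to these methods; the addresses below are |
| placeholders, so substitute your own. As a sketch, this would open searching to |
| a whole subnet while limiting document uploads to a single host: |
| |
| >>> # Hypothetical addresses: search open to an office subnet, uploads |
| >>> # restricted to one machine |
| >>> policy.allow_search_ip('192.168.1.0/24') |
| >>> policy.allow_doc_ip('192.168.1.5') |
| |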
| 67 Creating index fields |
| 68 --------------------- |
| 69 |
| 70 Each domain can have up to twenty index fields which are indexed by the |
| 71 CloudSearch service. For each index field, you will need to specify whether |
| 72 it's a text or integer field, as well as optionally a default value. |
| 73 |
| 74 >>> # Create a 'text' index field called 'username' |
| 75 >>> uname_field = domain.create_index_field('username', 'text') |
| 76 |
| 77 >>> # Epoch time of when the user last did something |
| 78 >>> time_field = domain.create_index_field('last_activity', 'uint', default=0) |
| 79 |
| 80 It is also possible to mark an index field as a facet. Doing so allows a search |
| 81 query to return categories into which results can be grouped, or to create |
| 82 drill-down categories. |
| 83 |
| 84 >>> # But it would be neat to drill down into different countries |
| 85 >>> loc_field = domain.create_index_field('location', 'text', facet=True) |
| 86 |
| 87 Finally, you can also mark a field so that its contents can be returned |
| 88 directly in your search results by using the result option. |
| 89 |
| 90 >>> # Directly insert user snippets in our results |
| 91 >>> snippet_field = domain.create_index_field('snippet', 'text', result=True) |
| 92 |
| 93 You can add up to 20 index fields in this manner: |
| 94 |
| 95 >>> follower_field = domain.create_index_field('follower_count', 'uint', default=0) |
| 96 |
| 97 Adding Documents to the Index |
| 98 ----------------------------- |
| 99 |
| 100 Now, we can add some documents to our new search domain. First, you will need a |
| 101 document service object through which documents are batched and sent: |
| 102 |
| 103 >>> doc_service = domain.get_document_service() |
| 104 |
| 105 For this example, we will use a pre-populated list of sample content for our |
| 106 import. You would normally pull such data from your database or another |
| 107 document store. |
| 108 |
| 109 >>> users = [ |
| 110 { |
| 111 'id': 1, |
| 112 'username': 'dan', |
| 113 'last_activity': 1334252740, |
| 114 'follower_count': 20, |
| 115 'location': 'USA', |
| 116 'snippet': 'Dan likes watching sunsets and rock climbing', |
| 117 }, |
| 118 { |
| 119 'id': 2, |
| 120 'username': 'dankosaur', |
| 121 'last_activity': 1334252904, |
| 122 'follower_count': 1, |
| 123 'location': 'UK', |
| 124 'snippet': 'Likes to dress up as a dinosaur.', |
| 125 }, |
| 126 { |
| 127 'id': 3, |
| 128 'username': 'danielle', |
| 129 'last_activity': 1334252969, |
| 130 'follower_count': 100, |
| 131 'location': 'DE', |
| 132 'snippet': 'Just moved to Germany!' |
| 133 }, |
| 134 { |
| 135 'id': 4, |
| 136 'username': 'daniella', |
| 137 'last_activity': 1334253279, |
| 138 'follower_count': 7, |
| 139 'location': 'USA', |
| 140 'snippet': 'Just like Dan, I like to watch a good sunset, but heights scare me.', |
| 141 } |
| 142 ] |
| 143 |
| 144 When adding documents to our document service, we will batch them together. You |
| 145 can schedule a document to be added by using the add() method. Whenever you are |
| 146 adding a document, you must provide a unique ID, a version ID, and the actual |
| 147 document to be indexed. In this case, we are using the user ID as our unique |
| 148 ID. The version ID is used to determine which is the latest version of an |
| 149 object to be indexed. If you wish to update a document, you must use a higher |
| 150 version ID. In this case, we are using the time of the user's last activity as |
| 151 a version number. |
| 152 |
| 153 >>> for user in users: |
| 154 ...     doc_service.add(user['id'], user['last_activity'], user) |
| 155 |
| 156 When you are ready to send the batched request to the document service, you can |
| 157 do so with the commit() method. Note that CloudSearch will charge per 1000 batch |
| 158 uploads. Each batch upload must be under 5MB. |
| 159 |
| 160 >>> result = doc_service.commit() |
| 161 |
| 162 The result is an instance of ``cloudsearch.CommitResponse``, which wraps the |
| 163 plain dictionary response in a convenient object (i.e. result.adds, |
| 164 result.deletes) and raises an exception if any of our documents were not |
| 165 actually committed. |
| 166 |
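| For example, after committing the four users above, a quick check of the |
| response (illustrative output) might look like this: |
| |
| >>> result.adds |
| 4 |
| >>> result.deletes |
| 0 |
| |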
| 167 After you have successfully committed some documents to CloudSearch, you must |
| 168 use :py:meth:`clear_sdf |
| 169 <boto.cloudsearch.document.DocumentServiceConnection.clear_sdf>` to clear the |
| 170 connection's internal document cache before you reuse the same document |
| 171 service connection. |
| 172 |
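| For example, before building a second batch on the same connection: |
| |
| >>> # Clear the cached batch so the connection can be reused |
| >>> doc_service.clear_sdf() |
| |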
| 173 Searching Documents |
| 174 ------------------- |
| 175 |
| 176 Now, let's try performing a search. First, we will need a SearchServiceConnection: |
| 177 |
| 178 >>> search_service = domain.get_search_service() |
| 179 |
| 180 A standard search will return documents which contain the exact words being |
| 181 searched for. |
| 182 |
| 183 >>> results = search_service.search(q="dan") |
| 184 >>> results.hits |
| 185 2 |
| 186 >>> map(lambda x: x['id'], results) |
| 187 [u'1', u'4'] |
| 188 |
| 189 The standard search does not look at word order: |
| 190 |
| 191 >>> results = search_service.search(q="dinosaur dress") |
| 192 >>> results.hits |
| 193 1 |
| 194 >>> map(lambda x: x['id'], results) |
| 195 [u'2'] |
| 196 |
| 197 It's also possible to do more complex queries using the bq argument (Boolean |
| 198 Query). When you are using bq, your search terms must be enclosed in single |
| 199 quotes. |
| 200 |
| 201 >>> results = search_service.search(bq="'dan'") |
| 202 >>> results.hits |
| 203 2 |
| 204 >>> map(lambda x: x['id'], results) |
| 205 [u'1', u'4'] |
| 206 |
| 207 When you are using boolean queries, it's also possible to use wildcards to |
| 208 extend your search to all words which start with your search terms: |
| 209 |
| 210 >>> results = search_service.search(bq="'dan*'") |
| 211 >>> results.hits |
| 212 4 |
| 213 >>> map(lambda x: x['id'], results) |
| 214 [u'1', u'2', u'3', u'4'] |
| 215 |
| 216 The boolean query also allows you to create more complex queries. You can OR |
| 217 terms together using "|", AND terms together using "+" or a space, and you can |
| 218 remove words from the query using the "-" operator. |
| 219 |
| 220 >>> results = search_service.search(bq="'watched|moved'") |
| 221 >>> results.hits |
| 222 2 |
| 223 >>> map(lambda x: x['id'], results) |
| 224 [u'3', u'4'] |
| 225 |
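| The example above only exercises the "|" operator. As a rough sketch (treat the |
| exact bq syntax as illustrative and check the CloudSearch documentation for your |
| API version), the "-" operator can be combined with a wildcard to exclude |
| matches: |
| |
| >>> results = search_service.search(bq="'dan* -dress'") |
| |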
| 226 By default, the search will return 10 results, but it is possible to adjust |
| 227 this by using the size argument as follows: |
| 228 |
| 229 >>> results = search_service.search(bq="'dan*'", size=2) |
| 230 >>> results.hits |
| 231 4 |
| 232 >>> map(lambda x: x['id'], results) |
| 233 [u'1', u'2'] |
| 234 |
| 235 It is also possible to offset the start of the search by using the start argument as follows: |
| 236 |
| 237 >>> results = search_service.search(bq="'dan*'", start=2) |
| 238 >>> results.hits |
| 239 4 |
| 240 >>> map(lambda x: x['id'], results) |
| 241 [u'3', u'4'] |
| 242 |
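| The size and start arguments can be combined to page through a result set. |
| Following the examples above, the second page of two results each would be: |
| |
| >>> results = search_service.search(bq="'dan*'", size=2, start=2) |
| >>> map(lambda x: x['id'], results) |
| [u'3', u'4'] |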
| 243 |
| 244 Ordering search results and rank expressions |
| 245 -------------------------------------------- |
| 246 |
| 247 If your search query is going to return many results, it is good to be able to sort them. |
| 248 You can order your search results by using the rank argument. You are able to |
| 249 sort on any field which has the result option turned on. |
| 250 |
| 251 >>> results = search_service.search(bq="'dan*'", rank=['-follower_count']) |
| 252 |
| 253 You can also create your own rank expressions to sort your results according to |
| 254 other criteria: |
| 255 |
| 256 >>> domain.create_rank_expression('recently_active', 'last_activity')  # We'll want to be able to just show the most recently active users |
| 257 |
| 258 >>> domain.create_rank_expression('activish', 'text_relevance + ((follower_count/(time() - last_activity))*1000)')  # Let's get trickier and combine text relevance with a really dynamic expression |
| 259 |
| 260 >>> results = search_service.search(bq="'dan*'", rank=['-recently_active']) |
| 261 |
| 262 Viewing and Adjusting Stemming for a Domain |
| 263 ------------------------------------------- |
| 264 |
| 265 A stemming dictionary maps related words to a common stem. A stem is |
| 266 typically the root or base word from which variants are derived. For |
| 267 example, run is the stem of running and ran. During indexing, Amazon |
| 268 CloudSearch uses the stemming dictionary when it performs |
| 269 text-processing on text fields. At search time, the stemming |
| 270 dictionary is used to perform text-processing on the search |
| 271 request. This enables matching on variants of a word. For example, if |
| 272 you map the term running to the stem run and then search for running, |
| 273 the request matches documents that contain run as well as running. |
| 274 |
| 275 To get the current stemming dictionary defined for a domain, use the |
| 276 ``get_stemming`` method of the Domain object. |
| 277 |
| 278 >>> stems = domain.get_stemming() |
| 279 >>> stems |
| 280 {u'stems': {}} |
| 281 >>> |
| 282 |
| 283 This returns a dictionary object that can be manipulated directly to |
| 284 add additional stems for your search domain by adding pairs of term:stem |
| 285 to the stems dictionary. |
| 286 |
| 287 >>> stems['stems']['running'] = 'run' |
| 288 >>> stems['stems']['ran'] = 'run' |
| 289 >>> stems |
| 290 {u'stems': {u'ran': u'run', u'running': u'run'}} |
| 291 >>> |
| 292 |
| 293 This has changed the value locally. To update the information in |
| 294 Amazon CloudSearch, you need to save the data. |
| 295 |
| 296 >>> stems.save() |
| 297 |
| 298 You can also access certain CloudSearch-specific attributes related to |
| 299 the stemming dictionary defined for your domain. |
| 300 |
| 301 >>> stems.status |
| 302 u'RequiresIndexDocuments' |
| 303 >>> stems.creation_date |
| 304 u'2012-05-01T12:12:32Z' |
| 305 >>> stems.update_date |
| 306 u'2012-05-01T12:12:32Z' |
| 307 >>> stems.update_version |
| 308 19 |
| 309 >>> |
| 310 |
| 311 The status indicates that, because you have changed the stems associated |
| 312 with the domain, you will need to re-index the documents in the domain |
| 313 before the new stems are used. |
| 314 |
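| As a sketch, assuming your boto version exposes an ``index_documents`` method |
| on the Domain object (check your version's API reference), the re-indexing |
| request would look something like this: |
| |
| >>> # Assumption: Domain.index_documents() asks CloudSearch to rebuild the |
| >>> # index so the new stems take effect |
| >>> domain.index_documents() |
| |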
| 315 Viewing and Adjusting Stopwords for a Domain |
| 316 -------------------------------------------- |
| 317 |
| 318 Stopwords are words that should typically be ignored both during |
| 319 indexing and at search time because they are either insignificant or |
| 320 so common that including them would result in a massive number of |
| 321 matches. |
| 322 |
| 323 To view the stopwords currently defined for your domain, use the |
| 324 ``get_stopwords`` method of the Domain object. |
| 325 |
| 326 >>> stopwords = domain.get_stopwords() |
| 327 >>> stopwords |
| 328 {u'stopwords': [u'a', |
| 329 u'an', |
| 330 u'and', |
| 331 u'are', |
| 332 u'as', |
| 333 u'at', |
| 334 u'be', |
| 335 u'but', |
| 336 u'by', |
| 337 u'for', |
| 338 u'in', |
| 339 u'is', |
| 340 u'it', |
| 341 u'of', |
| 342 u'on', |
| 343 u'or', |
| 344 u'the', |
| 345 u'to', |
| 346 u'was']} |
| 347 >>> |
| 348 |
| 349 You can add additional stopwords by simply appending the values to the |
| 350 list. |
| 351 |
| 352 >>> stopwords['stopwords'].append('foo') |
| 353 >>> stopwords['stopwords'].append('bar') |
| 354 >>> stopwords |
| 355 |
| 356 Similarly, you can remove currently defined stopwords from the list; for |
| 357 example, to drop ``was`` from the list shown above: |
| 358 |
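| >>> stopwords['stopwords'].remove('was') |
| |
| To save the changes, use the ``save`` method. |
| |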
| 359 >>> stopwords.save() |
| 360 |
| 361 The stopwords object has the same status attributes described above for stemming, |
| 362 which provide additional information about the stopwords configured for your domain. |
| 363 |
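| For example (illustrative output; the actual values depend on your domain): |
| |
| >>> stopwords.status |
| u'RequiresIndexDocuments' |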
| 364 |
| 365 Viewing and Adjusting Synonyms for a Domain |
| 366 ------------------------------------------- |
| 367 |
| 368 You can configure synonyms for terms that appear in the data you are |
| 369 searching. That way, if a user searches for the synonym rather than |
| 370 the indexed term, the results will include documents that contain the |
| 371 indexed term. |
| 372 |
| 373 If you want two terms to match the same documents, you must define |
| 374 them as synonyms of each other. For example: |
| 375 |
| 376 cat, feline |
| 377 feline, cat |
| 378 |
| 379 To view the synonyms currently defined for your domain, use the |
| 380 ``get_synonyms`` method of the Domain object. |
| 381 |
| 382 >>> synonyms = domain.get_synonyms() |
| 383 >>> synonyms |
| 384 {u'synonyms': {}} |
| 385 >>> |
| 386 |
| 387 You can define new synonyms by adding new term:synonyms entries to the |
| 388 synonyms dictionary object. |
| 389 |
| 390 >>> synonyms['synonyms']['cat'] = ['feline', 'kitten'] |
| 391 >>> synonyms['synonyms']['dog'] = ['canine', 'puppy'] |
| 392 |
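| As noted above, if you want two terms to match the same documents you must |
| define them as synonyms of each other; following the same pattern, the reverse |
| mappings would be: |
| |
| >>> synonyms['synonyms']['feline'] = ['cat'] |
| >>> synonyms['synonyms']['canine'] = ['dog'] |
| |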
| 393 To save the changes, use the ``save`` method. |
| 394 |
| 395 >>> synonyms.save() |
| 396 |
| 397 The synonyms object has the same status attributes described above for stemming, |
| 398 which provide additional information about the synonyms configured for your domain. |
| 399 |
| 400 Deleting Documents |
| 401 ------------------ |
| 402 |
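| Deleting a document works much like adding one: you schedule the delete through |
| the document service, identifying the document by its ID and supplying a higher |
| version number, and then commit the batch. |
| |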
| 403 >>> import time |
| 404 >>> from datetime import datetime |
| 405 |
| 406 >>> doc_service = domain.get_document_service() |
| 407 |
| 408 >>> # Again we'll cheat and use the current epoch time as our version number |
| 409 |
| 410 >>> doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple()))) |
| 411 >>> doc_service.commit() |