Chromium Code Reviews
Side by Side Diff: third_party/gsutil/boto/docs/source/cloudsearch_tut.rst

Issue 12317103: Added gsutil to depot tools (Closed) Base URL: https://chromium.googlesource.com/chromium/tools/depot_tools.git@master
Patch Set: added readme Created 7 years, 9 months ago
OLDNEW
(Empty)
1 .. _cloudsearch_tut:
2
3 ===============================================
4 An Introduction to boto's CloudSearch interface
5 ===============================================
6
7 This tutorial focuses on the boto interface to AWS' Cloudsearch_. It assumes
8 that you have already downloaded and installed boto.
9
10 .. _Cloudsearch: http://aws.amazon.com/cloudsearch/
11
12 Creating a Connection
13 ---------------------
14 The first step in accessing CloudSearch is to create a connection to the service.
15
16 The recommended method of doing this is as follows::
17
18 >>> import boto.cloudsearch
19 >>> conn = boto.cloudsearch.connect_to_region("us-east-1", aws_access_key_id='<aws access key>', aws_secret_access_key='<aws secret key>')
20
21 At this point, the variable conn will point to a CloudSearch connection object
22 in the us-east-1 region. Currently, this is the only region which has the
23 CloudSearch service. In this example, the AWS access key and AWS secret key are
24 passed in to the method explicitly. Alternatively, you can set the environment
25 variables:
26
27 * `AWS_ACCESS_KEY_ID` - Your AWS Access Key ID
28 * `AWS_SECRET_ACCESS_KEY` - Your AWS Secret Access Key
29
30 and then simply call::
31
32 >>> import boto.cloudsearch
33 >>> conn = boto.cloudsearch.connect_to_region("us-east-1")
34
35 In either case, conn will point to the Connection object which we will use
36 throughout the remainder of this tutorial.
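
The environment-variable lookup described above can be pictured with a small sketch. The helper name ``load_aws_credentials`` is hypothetical and is not part of boto, which performs its own lookup; the code only illustrates which two variables need to be present.

```python
import os


def load_aws_credentials(environ=None):
    """Return (access_key, secret_key) read from an environment mapping.

    Hypothetical helper for illustration only; boto does its own lookup.
    """
    environ = os.environ if environ is None else environ
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    # Collect every variable that is absent or empty so the error is complete.
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return environ["AWS_ACCESS_KEY_ID"], environ["AWS_SECRET_ACCESS_KEY"]
```

Passing a plain dictionary instead of ``os.environ`` makes the helper easy to try out without touching real credentials.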
37
38 Creating a Domain
39 -----------------
40
41 Once you have a connection established with the CloudSearch service, you will
42 want to create a domain. A domain encapsulates the data that you wish to index,
43 as well as indexes and metadata relating to it.
44
45 >>> from boto.cloudsearch.domain import Domain
46 >>> domain = Domain(conn, conn.create_domain('demo'))
47
48 This domain can be used to control access policies, indexes, and the actual
49 document service, which you will use to index and search.
50
51 Setting access policies
52 -----------------------
53
54 Before you can connect to a document service, you need to set the correct access policies.
55 For example, if you were connecting from 192.168.1.0, you could give yourself access as follows:
56
57 >>> our_ip = '192.168.1.0'
58
59 >>> # Allow our IP address to access the document and search services
60 >>> policy = domain.get_access_policies()
61 >>> policy.allow_search_ip(our_ip)
62 >>> policy.allow_doc_ip(our_ip)
63
64 You can use the allow_search_ip() and allow_doc_ip() methods to give different
65 CIDR blocks access to searching and the document service respectively.
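
To see what a CIDR block covers before adding it to a policy, a quick local check can help. This is a minimal sketch using Python 3's standard ``ipaddress`` module; the helper name ``ip_in_block`` is made up for illustration and nothing here talks to CloudSearch.

```python
import ipaddress


def ip_in_block(ip, cidr_block):
    """Return True if the address `ip` falls inside `cidr_block`."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr_block)


# A /24 block covers 192.168.1.0 through 192.168.1.255.
print(ip_in_block("192.168.1.0", "192.168.1.0/24"))
print(ip_in_block("10.0.0.7", "192.168.1.0/24"))
```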
66
67 Creating index fields
68 ---------------------
69
70 Each domain can have up to twenty index fields which are indexed by the
71 CloudSearch service. For each index field, you will need to specify whether
72 it's a text or integer field and, optionally, a default value.
73
74 >>> # Create a 'text' index field called 'username'
75 >>> uname_field = domain.create_index_field('username', 'text')
76
77 >>> # Epoch time of when the user last did something
78 >>> time_field = domain.create_index_field('last_activity', 'uint', default=0)
79
80 It is also possible to mark an index field as a facet. Doing so allows a search
81 query to return categories into which results can be grouped, or to create
82 drill-down categories:
83
84 >>> # But it would be neat to drill down into different countries
85 >>> loc_field = domain.create_index_field('location', 'text', facet=True)
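
Conceptually, a facet groups matching documents by the value of a field and counts each group. The sketch below mimics that locally with ``collections.Counter``; the function ``facet_counts`` and the sample data are illustrative assumptions and do not reflect how CloudSearch computes facets internally.

```python
from collections import Counter


def facet_counts(documents, field):
    """Count how many documents carry each value of `field`."""
    return Counter(doc[field] for doc in documents if field in doc)


sample = [
    {"location": "USA"},
    {"location": "UK"},
    {"location": "DE"},
    {"location": "USA"},
]
counts = facet_counts(sample, "location")
```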
86
87 Finally, you can mark a field so that its text is returned directly in your
88 search results by using the result option.
89
90 >>> # Directly insert user snippets in our results
91 >>> snippet_field = domain.create_index_field('snippet', 'text', result=True)
92
93 You can add up to 20 index fields in this manner:
94
95 >>> follower_field = domain.create_index_field('follower_count', 'uint', default=0)
96
97 Adding Documents to the Index
98 -----------------------------
99
100 Now, we can add some documents to our new search domain. First, you will need a
101 document service object through which queries are sent:
102
103 >>> doc_service = domain.get_document_service()
104
105 For this example, we will use a pre-populated list of sample content for our
106 import. You would normally pull such data from your database or another
107 document store.
108
109 >>> users = [
110 {
111 'id': 1,
112 'username': 'dan',
113 'last_activity': 1334252740,
114 'follower_count': 20,
115 'location': 'USA',
116 'snippet': 'Dan likes watching sunsets and rock climbing',
117 },
118 {
119 'id': 2,
120 'username': 'dankosaur',
121 'last_activity': 1334252904,
122 'follower_count': 1,
123 'location': 'UK',
124 'snippet': 'Likes to dress up as a dinosaur.',
125 },
126 {
127 'id': 3,
128 'username': 'danielle',
129 'last_activity': 1334252969,
130 'follower_count': 100,
131 'location': 'DE',
132 'snippet': 'Just moved to Germany!'
133 },
134 {
135 'id': 4,
136 'username': 'daniella',
137 'last_activity': 1334253279,
138 'follower_count': 7,
139 'location': 'USA',
140 'snippet': 'Just like Dan, I like to watch a good sunset, but heights scare me.',
141 }
142 ]
143
144 When adding documents to our document service, we will batch them together. You
145 can schedule a document to be added by using the add() method. Whenever you are
146 adding a document, you must provide a unique ID, a version ID, and the actual
147 document to be indexed. In this case, we are using the user ID as our unique
148 ID. The version ID is used to determine which is the latest version of an
149 object to be indexed. If you wish to update a document, you must use a higher
150 version ID. In this case, we are using the time of the user's last activity as
151 a version number.
152
153 >>> for user in users:
154 ...     doc_service.add(user['id'], user['last_activity'], user)
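
The version rule described above (an update must carry a higher version ID) can be modelled with a toy in-memory store. ``VersionedStore`` is a hypothetical class written for this sketch; it assumes that an equal version is rejected, and it is not how boto or CloudSearch store documents.

```python
class VersionedStore:
    """Toy store that keeps only the highest version of each document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, fields)

    def add(self, doc_id, version, fields):
        """Store `fields` if `version` is newer; return True when stored."""
        current = self._docs.get(doc_id)
        if current is not None and version <= current[0]:
            return False  # stale or repeated version: keep the existing copy
        self._docs[doc_id] = (version, dict(fields))
        return True

    def get(self, doc_id):
        """Return the fields of the newest stored version of `doc_id`."""
        return self._docs[doc_id][1]
```

Using the time of the last activity as the version, as the tutorial does, means a later activity always replaces the earlier copy.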
155
156 When you are ready to send the batched request to the document service, you can
157 do so with the commit() method. Note that CloudSearch charges per 1000 batch
158 uploads. Each batch upload must be under 5MB.
159
160 >>> result = doc_service.commit()
161
162 The result is an instance of `cloudsearch.CommitResponse`, which wraps the
163 plain dictionary response in an object (i.e. result.adds, result.deletes)
164 and raises an exception if not all of our documents were actually
165 committed.
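
Because each batch upload must stay under the size limit mentioned above, a large import may need to be split across several commits. The following is a rough sketch of greedy batching by JSON size. The function ``split_into_batches`` and the constant ``MAX_BATCH_BYTES`` are assumptions for illustration, and the payload boto actually sends carries extra per-document fields, so real sizes will be somewhat larger than this estimate.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # assumed interpretation of the 5MB limit


def split_into_batches(documents, max_bytes=MAX_BATCH_BYTES):
    """Greedily group documents so each batch's JSON stays under max_bytes."""
    batches, current, current_size = [], [], 2  # 2 bytes for the outer "[]"
    for doc in documents:
        # Encoded size of this document plus a ", " separator (overestimate).
        size = len(json.dumps(doc).encode("utf-8")) + 2
        if 2 + size > max_bytes:
            raise ValueError("a single document exceeds the batch limit")
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 2
        current.append(doc)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch could then be added and committed separately, keeping in mind that charges are counted per batch upload.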
166
167 After you have successfully committed some documents to CloudSearch, you must
168 call :py:meth:`clear_sdf
169 <boto.cloudsearch.document.DocumentServiceConnection.clear_sdf>` to clear the
170 internal cache if you wish to use the same document service connection
171 again.
172
173 Searching Documents
174 -------------------
175
176 Now, let's try performing a search. First, we will need a SearchServiceConnection:
177
178 >>> search_service = domain.get_search_service()
179
180 A standard search will return documents which contain the exact words being
181 searched for.
182
183 >>> results = search_service.search(q="dan")
184 >>> results.hits
185 2
186 >>> map(lambda x: x['id'], results)
187 [u'1', u'4']
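
The ``map(lambda ...)`` calls in this section print a list because the examples were written for Python 2. In Python 3, ``map`` returns a lazy iterator, so a list comprehension is the more direct way to collect the IDs. The sketch below uses a plain list of dictionaries as a stand-in for the search results object.

```python
# Stand-in for a search result set: any iterable of dict-like hits.
results = [{"id": "1"}, {"id": "4"}]

# Collect the document IDs eagerly into a list.
ids = [hit["id"] for hit in results]
print(ids)
```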
188
189 The standard search does not look at word order:
190
191 >>> results = search_service.search(q="dinosaur dress")
192 >>> results.hits
193 1
194 >>> map(lambda x: x['id'], results)
195 [u'2']
196
197 It's also possible to do more complex queries using the bq argument (Boolean
198 Query). When you are using bq, your search terms must be enclosed in single
199 quotes.
200
201 >>> results = search_service.search(bq="'dan'")
202 >>> results.hits
203 2
204 >>> map(lambda x: x['id'], results)
205 [u'1', u'4']
206
207 When you are using boolean queries, it's also possible to use wildcards to
208 extend your search to all words which start with your search terms:
209
210 >>> results = search_service.search(bq="'dan*'")
211 >>> results.hits
212 4
213 >>> map(lambda x: x['id'], results)
214 [u'1', u'2', u'3', u'4']
215
216 The boolean query also allows you to create more complex queries. You can OR
217 terms together using "|", AND terms together using "+" or a space, and you can
218 remove words from the query using the "-" operator.
219
220 >>> results = search_service.search(bq="'watched|moved'")
221 >>> results.hits
222 2
223 >>> map(lambda x: x['id'], results)
224 [u'3', u'4']
225
226 By default, the search will return 10 results, but it is possible to adjust this
227 by using the size argument as follows:
228
229 >>> results = search_service.search(bq="'dan*'", size=2)
230 >>> results.hits
231 4
232 >>> map(lambda x: x['id'], results)
233 [u'1', u'2']
234
235 It is also possible to offset the start of the search by using the start argument as follows:
236
237 >>> results = search_service.search(bq="'dan*'", start=2)
238 >>> results.hits
239 4
240 >>> map(lambda x: x['id'], results)
241 [u'3', u'4']
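
Combining ``size`` and ``start`` gives simple pagination: each page is requested with the same ``size`` and a ``start`` that advances by that amount. The helper below, whose name ``page_offsets`` is made up for this sketch, computes the ``start`` values needed to cover a given number of hits.

```python
def page_offsets(total_hits, page_size):
    """Return the `start` offset of every page needed to cover total_hits."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total_hits, page_size))


# Four hits at two per page need two requests, starting at 0 and at 2.
print(page_offsets(4, 2))
```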
242
243
244 Ordering search results and rank expressions
245 --------------------------------------------
246
247 If your search query is going to return many results, it is good to be able to sort them.
248 You can order your search results by using the rank argument. You are able to
249 sort on any fields which have the result option turned on.
250
251 >>> results = search_service.search(bq=query, rank=['-follower_count'])
252
253 You can also create your own rank expressions to sort your results according to
254 other criteria:
255
256 >>> domain.create_rank_expression('recently_active', 'last_activity') # We'll want to be able to just show the most recently active users
257
258 >>> domain.create_rank_expression('activish', 'text_relevance + ((follower_count/(time() - last_activity))*1000)') # Let's get trickier and combine text relevance with a really dynamic expression
259
260 >>> results = search_service.search(bq=query, rank=['-recently_active'])
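
To get a feel for what the ``activish`` expression rewards, the same arithmetic can be reproduced locally. This is only a sketch: the expression is evaluated by CloudSearch on the server, and the ``text_relevance`` value and the floating-point division used here are assumptions rather than documented behaviour.

```python
def activish(text_relevance, follower_count, last_activity, now):
    """Mirror 'text_relevance + ((follower_count/(time() - last_activity))*1000)'."""
    elapsed = now - last_activity  # seconds since the user's last activity
    if elapsed <= 0:
        raise ValueError("now must be later than last_activity")
    return text_relevance + (follower_count / elapsed) * 1000


# A user with 100 followers who was active 1000 seconds ago.
score = activish(200, 100, 1334252969, 1334252969 + 1000)
```

The score grows with the follower count and shrinks as the last activity recedes into the past, which is why the tutorial calls the expression dynamic.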
261
262 Viewing and Adjusting Stemming for a Domain
263 -------------------------------------------
264
265 A stemming dictionary maps related words to a common stem. A stem is
266 typically the root or base word from which variants are derived. For
267 example, run is the stem of running and ran. During indexing, Amazon
268 CloudSearch uses the stemming dictionary when it performs
269 text-processing on text fields. At search time, the stemming
270 dictionary is used to perform text-processing on the search
271 request. This enables matching on variants of a word. For example, if
272 you map the term running to the stem run and then search for running,
273 the request matches documents that contain run as well as running.
274
275 To get the current stemming dictionary defined for a domain, use the
276 ``get_stemming`` method of the Domain object.
277
278 >>> stems = domain.get_stemming()
279 >>> stems
280 {u'stems': {}}
281 >>>
282
283 This returns a dictionary object that can be manipulated directly to
284 add additional stems for your search domain by adding pairs of term:stem
285 to the stems dictionary.
286
287 >>> stems['stems']['running'] = 'run'
288 >>> stems['stems']['ran'] = 'run'
289 >>> stems
290 {u'stems': {u'ran': u'run', u'running': u'run'}}
291 >>>
292
293 This has changed the value locally. To update the information in
294 Amazon CloudSearch, you need to save the data.
295
296 >>> stems.save()
297
298 You can also access certain CloudSearch-specific attributes related to
299 the stemming dictionary defined for your domain.
300
301 >>> stems.status
302 u'RequiresIndexDocuments'
303 >>> stems.creation_date
304 u'2012-05-01T12:12:32Z'
305 >>> stems.update_date
306 u'2012-05-01T12:12:32Z'
307 >>> stems.update_version
308 19
309 >>>
310
311 The status indicates that, because you have changed the stems associated
312 with the domain, you will need to re-index the documents in the domain
313 before the new stems are used.
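
The effect of the term:stem pairs can be illustrated without a domain at all. The sketch below applies a stemming dictionary to both the document text and the query, which is a simplified picture of the text processing described above; the function ``apply_stems`` is hypothetical.

```python
def apply_stems(tokens, stems):
    """Replace each token with its stem when the dictionary defines one."""
    return [stems.get(token, token) for token in tokens]


stems = {"running": "run", "ran": "run"}

# Stem the document at "index time" and the query at "search time".
document = apply_stems("she ran and kept running".split(), stems)
query = apply_stems(["running"], stems)
matches = any(term in document for term in query)
```

Because both sides are reduced to ``run``, a search for ``running`` also finds documents that only contained ``ran``.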
314
315 Viewing and Adjusting Stopwords for a Domain
316 --------------------------------------------
317
318 Stopwords are words that should typically be ignored both during
319 indexing and at search time because they are either insignificant or
320 so common that including them would result in a massive number of
321 matches.
322
323 To view the stopwords currently defined for your domain, use the
324 ``get_stopwords`` method of the Domain object.
325
326 >>> stopwords = domain.get_stopwords()
327 >>> stopwords
328 {u'stopwords': [u'a',
329 u'an',
330 u'and',
331 u'are',
332 u'as',
333 u'at',
334 u'be',
335 u'but',
336 u'by',
337 u'for',
338 u'in',
339 u'is',
340 u'it',
341 u'of',
342 u'on',
343 u'or',
344 u'the',
345 u'to',
346 u'was']}
347 >>>
348
349 You can add additional stopwords by simply appending the values to the
350 list.
351
352 >>> stopwords['stopwords'].append('foo')
353 >>> stopwords['stopwords'].append('bar')
354 >>> stopwords
355
356 Similarly, you could remove currently defined stopwords from the list.
357 To save the changes, use the ``save`` method.
358
359 >>> stopwords.save()
360
361 The stopwords object has attributes similar to those described above for stemming
362 that provide additional information about the stopwords in your domain.
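
As a rough picture of what a stopword list does to text, the sketch below drops stopwords from a list of tokens. The function ``remove_stopwords`` and the case-insensitive comparison are assumptions made for illustration.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword list (case-insensitive)."""
    blocked = {word.lower() for word in stopwords}
    return [token for token in tokens if token.lower() not in blocked]


kept = remove_stopwords("The cat is on the mat".split(), ["the", "is", "on"])
```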
363
364
365 Viewing and Adjusting Synonyms for a Domain
366 -------------------------------------------
367
368 You can configure synonyms for terms that appear in the data you are
369 searching. That way, if a user searches for the synonym rather than
370 the indexed term, the results will include documents that contain the
371 indexed term.
372
373 If you want two terms to match the same documents, you must define
374 them as synonyms of each other. For example:
375
376 cat, feline
377 feline, cat
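
The need to define both directions can be shown with a simplified model in which a search term is expanded with the synonyms listed for it. The function ``expand_query`` and the direction of the lookup are assumptions of this sketch, not a description of the service's internals.

```python
def expand_query(term, synonyms):
    """Return the set of terms matched by `term` under a one-directional map."""
    return {term} | set(synonyms.get(term, []))


one_way = {"cat": ["feline"]}
two_way = {"cat": ["feline"], "feline": ["cat"]}

# With only one entry, 'feline' is not expanded back to 'cat'.
print(expand_query("feline", one_way))
print(expand_query("feline", two_way))
```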
378
379 To view the synonyms currently defined for your domain, use the
380 ``get_synonyms`` method of the Domain object.
381
382 >>> synonyms = domain.get_synonyms()
383 >>> synonyms
384 {u'synonyms': {}}
385 >>>
386
387 You can define new synonyms by adding new term:synonyms entries to the
388 synonyms dictionary object.
389
390 >>> synonyms['synonyms']['cat'] = ['feline', 'kitten']
391 >>> synonyms['synonyms']['dog'] = ['canine', 'puppy']
392
393 To save the changes, use the ``save`` method.
394
395 >>> synonyms.save()
396
397 The synonyms object has attributes similar to those described above for stemming
398 that provide additional information about the synonyms in your domain.
399
400 Deleting Documents
401 ------------------
402
403 >>> import time
404 >>> from datetime import datetime
405
406 >>> doc_service = domain.get_document_service()
407
408 >>> # Again we'll cheat and use the current epoch time as our version number
409
410 >>> doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple())))
411 >>> doc_service.commit()
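
The delete above uses the current epoch time as its version number. Assuming a delete must carry a higher version than the one stored with the document, just as an update must, a small guard can make sure the chosen version always moves forward. ``next_version`` is a hypothetical helper for this sketch.

```python
import time


def next_version(previous_version, now=None):
    """Return an epoch-based version strictly greater than previous_version."""
    now = int(time.time()) if now is None else int(now)
    # Fall back to previous_version + 1 if the clock has not moved past it.
    return max(now, previous_version + 1)
```

For the fourth sample user, ``next_version(1334253279)`` would give a value that is safe to pass as the second argument to ``delete``.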