OLD | NEW |
(Empty) | |
| 1 .. cloudsearch_tut: |
| 2 |
| 3 =============================================== |
| 4 An Introduction to boto's Cloudsearch interface |
| 5 =============================================== |
| 6 |
| 7 This tutorial focuses on the boto interface to AWS' Cloudsearch_. It |
| 8 assumes that you have boto already downloaded and installed. |
| 9 |
| 10 .. _Cloudsearch: http://aws.amazon.com/cloudsearch/ |
| 11 |
| 12 Creating a Connection |
| 13 --------------------- |
| 14 The first step in accessing CloudSearch is to create a connection to the service. |
| 15 |
| 16 The recommended method of doing this is as follows:: |
| 17 |
| 18 >>> import boto.cloudsearch |
| 19 >>> conn = boto.cloudsearch.connect_to_region("us-east-1", aws_access_key_id='<aws access key>', aws_secret_access_key='<aws secret key>') |
| 20 |
| 21 At this point, the variable conn will point to a CloudSearch connection object |
| 22 in the us-east-1 region. Currently, this is the only region which has the |
| 23 CloudSearch service. In this example, the AWS access key and AWS secret key are |
| 24 passed in to the method explicitly. Alternatively, you can set the environment |
| 25 variables: |
| 26 |
| 27 * `AWS_ACCESS_KEY_ID` - Your AWS Access Key ID |
| 28 * `AWS_SECRET_ACCESS_KEY` - Your AWS Secret Access Key |
| 29 |
| 30 and then simply call:: |
| 31 |
| 32 >>> import boto.cloudsearch |
| 33 >>> conn = boto.cloudsearch.connect_to_region("us-east-1") |
| 34 |
| 35 In either case, conn will point to the Connection object which we will use |
| 36 throughout the remainder of this tutorial. |
| 37 |
| 38 Creating a Domain |
| 39 ----------------- |
| 40 |
| 41 Once you have a connection established with the CloudSearch service, you will |
| 42 want to create a domain. A domain encapsulates the data that you wish to index, |
| 43 as well as indexes and metadata relating to it. |
| 44 |
| 45 >>> from boto.cloudsearch.domain import Domain |
| 46 >>> domain = Domain(conn, conn.create_domain('demo')) |
| 47 |
| 48 This domain can be used to control access policies, indexes, and the actual |
| 49 document service, which you will use to index and search. |
| 50 |
| 51 Setting access policies |
| 52 ----------------------- |
| 53 |
| 54 Before you can connect to a document service, you need to set the correct access properties. |
| 55 For example, if you were connecting from 192.168.1.0, you could give yourself access as follows: |
| 56 |
| 57 >>> our_ip = '192.168.1.0' |
| 58 |
| 59 >>> # Allow our IP address to access the document and search services |
| 60 >>> policy = domain.get_access_policies() |
| 61 >>> policy.allow_search_ip(our_ip) |
| 62 >>> policy.allow_doc_ip(our_ip) |
| 63 |
| 64 You can use the allow_search_ip() and allow_doc_ip() methods to give different |
| 65 CIDR blocks access to searching and the document service respectively. |
| 66 |
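| You can pass any CIDR block to these methods; the addresses below are |
| placeholders, so substitute your own. As a sketch, this would open searching to |
| a whole subnet while limiting document uploads to a single host: |
| |
| >>> # Hypothetical addresses: search open to an office subnet, uploads |
| >>> # restricted to one machine |
| >>> policy.allow_search_ip('192.168.1.0/24') |
| >>> policy.allow_doc_ip('192.168.1.5') |
| |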
| 67 Creating index fields |
| 68 --------------------- |
| 69 |
| 70 Each domain can have up to twenty index fields which are indexed by the |
| 71 CloudSearch service. For each index field, you will need to specify whether |
| 72 it's a text or integer field, as well as optionally a default value. |
| 73 |
| 74 >>> # Create a 'text' index field called 'username' |
| 75 >>> uname_field = domain.create_index_field('username', 'text') |
| 76 |
| 77 >>> # Epoch time of when the user last did something |
| 78 >>> time_field = domain.create_index_field('last_activity', 'uint', default=0) |
| 79 |
| 80 It is also possible to mark an index field as a facet. Doing so allows a search |
| 81 query to return categories into which results can be grouped, or to create |
| 82 drill-down categories. |
| 83 |
| 84 >>> # But it would be neat to drill down into different countries |
| 85 >>> loc_field = domain.create_index_field('location', 'text', facet=True) |
| 86 |
| 87 Finally, you can also mark a field so that its contents can be returned |
| 88 directly in your search results by using the result option. |
| 89 |
| 90 >>> # Directly insert user snippets in our results |
| 91 >>> snippet_field = domain.create_index_field('snippet', 'text', result=True) |
| 92 |
| 93 You can add up to 20 index fields in this manner: |
| 94 |
| 95 >>> follower_field = domain.create_index_field('follower_count', 'uint', default=0) |
| 96 |
| 97 Adding Documents to the Index |
| 98 ----------------------------- |
| 99 |
| 100 Now, we can add some documents to our new search domain. First, you will need a |
| 101 document service object through which documents are batched and sent: |
| 102 |
| 103 >>> doc_service = domain.get_document_service() |
| 104 |
| 105 For this example, we will use a pre-populated list of sample content for our |
| 106 import. You would normally pull such data from your database or another |
| 107 document store. |
| 108 |
| 109 >>> users = [ |
| 110 { |
| 111 'id': 1, |
| 112 'username': 'dan', |
| 113 'last_activity': 1334252740, |
| 114 'follower_count': 20, |
| 115 'location': 'USA', |
| 116 'snippet': 'Dan likes watching sunsets and rock climbing', |
| 117 }, |
| 118 { |
| 119 'id': 2, |
| 120 'username': 'dankosaur', |
| 121 'last_activity': 1334252904, |
| 122 'follower_count': 1, |
| 123 'location': 'UK', |
| 124 'snippet': 'Likes to dress up as a dinosaur.', |
| 125 }, |
| 126 { |
| 127 'id': 3, |
| 128 'username': 'danielle', |
| 129 'last_activity': 1334252969, |
| 130 'follower_count': 100, |
| 131 'location': 'DE', |
| 132 'snippet': 'Just moved to Germany!' |
| 133 }, |
| 134 { |
| 135 'id': 4, |
| 136 'username': 'daniella', |
| 137 'last_activity': 1334253279, |
| 138 'follower_count': 7, |
| 139 'location': 'USA', |
| 140 'snippet': 'Just like Dan, I like to watch a good sunset, but heights scare me.', |
| 141 } |
| 142 ] |
| 143 |
| 144 When adding documents to our document service, we will batch them together. You |
| 145 can schedule a document to be added by using the add() method. Whenever you are |
| 146 adding a document, you must provide a unique ID, a version ID, and the actual |
| 147 document to be indexed. In this case, we are using the user ID as our unique |
| 148 ID. The version ID is used to determine which is the latest version of an |
| 149 object to be indexed. If you wish to update a document, you must use a higher |
| 150 version ID. In this case, we are using the time of the user's last activity as |
| 151 a version number. |
| 152 |
| 153 >>> for user in users: |
| 154 ...     doc_service.add(user['id'], user['last_activity'], user) |
| 155 |
| 156 When you are ready to send the batched request to the document service, you can |
| 157 do so with the commit() method. Note that CloudSearch will charge per 1000 batch |
| 158 uploads. Each batch upload must be under 5MB. |
| 159 |
| 160 >>> result = doc_service.commit() |
| 161 |
| 162 The result is an instance of ``cloudsearch.CommitResponse``, which wraps the |
| 163 plain dictionary response in a convenient object (i.e. result.adds, |
| 164 result.deletes) and raises an exception if any of our documents were not |
| 165 actually committed. |
| 166 |
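| For example, after committing the four users above, a quick check of the |
| response (illustrative output) might look like this: |
| |
| >>> result.adds |
| 4 |
| >>> result.deletes |
| 0 |
| |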
| 167 After you have successfully committed some documents to CloudSearch, you must |
| 168 use :py:meth:`clear_sdf |
| 169 <boto.cloudsearch.document.DocumentServiceConnection.clear_sdf>` to clear the |
| 170 connection's internal document cache before you reuse the same document |
| 171 service connection. |
| 172 |
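| For example, before building a second batch on the same connection: |
| |
| >>> # Clear the cached batch so the connection can be reused |
| >>> doc_service.clear_sdf() |
| |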
| 173 Searching Documents |
| 174 ------------------- |
| 175 |
| 176 Now, let's try performing a search. First, we will need a SearchServiceConnection: |
| 177 |
| 178 >>> search_service = domain.get_search_service() |
| 179 |
| 180 A standard search will return documents which contain the exact words being |
| 181 searched for. |
| 182 |
| 183 >>> results = search_service.search(q="dan") |
| 184 >>> results.hits |
| 185 2 |
| 186 >>> map(lambda x: x['id'], results) |
| 187 [u'1', u'4'] |
| 188 |
| 189 The standard search does not look at word order: |
| 190 |
| 191 >>> results = search_service.search(q="dinosaur dress") |
| 192 >>> results.hits |
| 193 1 |
| 194 >>> map(lambda x: x['id'], results) |
| 195 [u'2'] |
| 196 |
| 197 It's also possible to do more complex queries using the bq argument (Boolean |
| 198 Query). When you are using bq, your search terms must be enclosed in single |
| 199 quotes. |
| 200 |
| 201 >>> results = search_service.search(bq="'dan'") |
| 202 >>> results.hits |
| 203 2 |
| 204 >>> map(lambda x: x['id'], results) |
| 205 [u'1', u'4'] |
| 206 |
| 207 When you are using boolean queries, it's also possible to use wildcards to |
| 208 extend your search to all words which start with your search terms: |
| 209 |
| 210 >>> results = search_service.search(bq="'dan*'") |
| 211 >>> results.hits |
| 212 4 |
| 213 >>> map(lambda x: x['id'], results) |
| 214 [u'1', u'2', u'3', u'4'] |
| 215 |
| 216 The boolean query also allows you to create more complex queries. You can OR |
| 217 terms together using "|", AND terms together using "+" or a space, and you can |
| 218 remove words from the query using the "-" operator. |
| 219 |
| 220 >>> results = search_service.search(bq="'watched|moved'") |
| 221 >>> results.hits |
| 222 2 |
| 223 >>> map(lambda x: x['id'], results) |
| 224 [u'3', u'4'] |
| 225 |
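| The example above only exercises the "|" operator. As a rough sketch (treat the |
| exact bq syntax as illustrative and check the CloudSearch documentation for your |
| API version), the "-" operator can be combined with a wildcard to exclude |
| matches: |
| |
| >>> results = search_service.search(bq="'dan* -dress'") |
| |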
| 226 By default, the search will return 10 results, but it is possible to adjust |
| 227 this by using the size argument as follows: |
| 228 |
| 229 >>> results = search_service.search(bq="'dan*'", size=2) |
| 230 >>> results.hits |
| 231 4 |
| 232 >>> map(lambda x: x['id'], results) |
| 233 [u'1', u'2'] |
| 234 |
| 235 It is also possible to offset the start of the search by using the start argument as follows: |
| 236 |
| 237 >>> results = search_service.search(bq="'dan*'", start=2) |
| 238 >>> results.hits |
| 239 4 |
| 240 >>> map(lambda x: x['id'], results) |
| 241 [u'3', u'4'] |
| 242 |
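| The size and start arguments can be combined to page through a result set. |
| Following the examples above, the second page of two results each would be: |
| |
| >>> results = search_service.search(bq="'dan*'", size=2, start=2) |
| >>> map(lambda x: x['id'], results) |
| [u'3', u'4'] |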
| 243 |
| 244 Ordering search results and rank expressions |
| 245 -------------------------------------------- |
| 246 |
| 247 If your search query is going to return many results, it is good to be able to sort them. |
| 248 You can order your search results by using the rank argument. You are able to |
| 249 sort on any field which has the result option turned on. |
| 250 |
| 251 >>> results = search_service.search(bq="'dan*'", rank=['-follower_count']) |
| 252 |
| 253 You can also create your own rank expressions to sort your results according to |
| 254 other criteria: |
| 255 |
| 256 >>> domain.create_rank_expression('recently_active', 'last_activity')  # We'll want to be able to just show the most recently active users |
| 257 |
| 258 >>> domain.create_rank_expression('activish', 'text_relevance + ((follower_count/(time() - last_activity))*1000)')  # Let's get trickier and combine text relevance with a really dynamic expression |
| 259 |
| 260 >>> results = search_service.search(bq="'dan*'", rank=['-recently_active']) |
| 261 |
| 262 Viewing and Adjusting Stemming for a Domain |
| 263 ------------------------------------------- |
| 264 |
| 265 A stemming dictionary maps related words to a common stem. A stem is |
| 266 typically the root or base word from which variants are derived. For |
| 267 example, run is the stem of running and ran. During indexing, Amazon |
| 268 CloudSearch uses the stemming dictionary when it performs |
| 269 text-processing on text fields. At search time, the stemming |
| 270 dictionary is used to perform text-processing on the search |
| 271 request. This enables matching on variants of a word. For example, if |
| 272 you map the term running to the stem run and then search for running, |
| 273 the request matches documents that contain run as well as running. |
| 274 |
| 275 To get the current stemming dictionary defined for a domain, use the |
| 276 ``get_stemming`` method of the Domain object. |
| 277 |
| 278 >>> stems = domain.get_stemming() |
| 279 >>> stems |
| 280 {u'stems': {}} |
| 281 >>> |
| 282 |
| 283 This returns a dictionary object that can be manipulated directly to |
| 284 add additional stems for your search domain by adding pairs of term:stem |
| 285 to the stems dictionary. |
| 286 |
| 287 >>> stems['stems']['running'] = 'run' |
| 288 >>> stems['stems']['ran'] = 'run' |
| 289 >>> stems |
| 290 {u'stems': {u'ran': u'run', u'running': u'run'}} |
| 291 >>> |
| 292 |
| 293 This has changed the value locally. To update the information in |
| 294 Amazon CloudSearch, you need to save the data. |
| 295 |
| 296 >>> stems.save() |
| 297 |
| 298 You can also access certain CloudSearch-specific attributes related to |
| 299 the stemming dictionary defined for your domain. |
| 300 |
| 301 >>> stems.status |
| 302 u'RequiresIndexDocuments' |
| 303 >>> stems.creation_date |
| 304 u'2012-05-01T12:12:32Z' |
| 305 >>> stems.update_date |
| 306 u'2012-05-01T12:12:32Z' |
| 307 >>> stems.update_version |
| 308 19 |
| 309 >>> |
| 310 |
| 311 The status indicates that, because you have changed the stems associated |
| 312 with the domain, you will need to re-index the documents in the domain |
| 313 before the new stems are used. |
| 314 |
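| As a sketch, assuming your boto version exposes an ``index_documents`` method |
| on the Domain object (check your version's API reference), the re-indexing |
| request would look something like this: |
| |
| >>> # Assumption: Domain.index_documents() asks CloudSearch to rebuild the |
| >>> # index so the new stems take effect |
| >>> domain.index_documents() |
| |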
| 315 Viewing and Adjusting Stopwords for a Domain |
| 316 -------------------------------------------- |
| 317 |
| 318 Stopwords are words that should typically be ignored both during |
| 319 indexing and at search time because they are either insignificant or |
| 320 so common that including them would result in a massive number of |
| 321 matches. |
| 322 |
| 323 To view the stopwords currently defined for your domain, use the |
| 324 ``get_stopwords`` method of the Domain object. |
| 325 |
| 326 >>> stopwords = domain.get_stopwords() |
| 327 >>> stopwords |
| 328 {u'stopwords': [u'a', |
| 329 u'an', |
| 330 u'and', |
| 331 u'are', |
| 332 u'as', |
| 333 u'at', |
| 334 u'be', |
| 335 u'but', |
| 336 u'by', |
| 337 u'for', |
| 338 u'in', |
| 339 u'is', |
| 340 u'it', |
| 341 u'of', |
| 342 u'on', |
| 343 u'or', |
| 344 u'the', |
| 345 u'to', |
| 346 u'was']} |
| 347 >>> |
| 348 |
| 349 You can add additional stopwords by simply appending the values to the |
| 350 list. |
| 351 |
| 352 >>> stopwords['stopwords'].append('foo') |
| 353 >>> stopwords['stopwords'].append('bar') |
| 354 >>> stopwords |
| 355 |
| 356 Similarly, you can remove currently defined stopwords from the list; for |
| 357 example, to drop ``was`` from the list shown above: |
| 358 |
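| >>> stopwords['stopwords'].remove('was') |
| |
| To save the changes, use the ``save`` method. |
| |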
| 359 >>> stopwords.save() |
| 360 |
| 361 The stopwords object has the same status attributes described above for stemming, |
| 362 which provide additional information about the stopwords configured for your domain. |
| 363 |
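| For example (illustrative output; the actual values depend on your domain): |
| |
| >>> stopwords.status |
| u'RequiresIndexDocuments' |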
| 364 |
| 365 Viewing and Adjusting Synonyms for a Domain |
| 366 ------------------------------------------- |
| 367 |
| 368 You can configure synonyms for terms that appear in the data you are |
| 369 searching. That way, if a user searches for the synonym rather than |
| 370 the indexed term, the results will include documents that contain the |
| 371 indexed term. |
| 372 |
| 373 If you want two terms to match the same documents, you must define |
| 374 them as synonyms of each other. For example: |
| 375 |
| 376 cat, feline |
| 377 feline, cat |
| 378 |
| 379 To view the synonyms currently defined for your domain, use the |
| 380 ``get_synonyms`` method of the Domain object. |
| 381 |
| 382 >>> synonyms = domain.get_synonyms() |
| 383 >>> synonyms |
| 384 {u'synonyms': {}} |
| 385 >>> |
| 386 |
| 387 You can define new synonyms by adding new term:synonyms entries to the |
| 388 synonyms dictionary object. |
| 389 |
| 390 >>> synonyms['synonyms']['cat'] = ['feline', 'kitten'] |
| 391 >>> synonyms['synonyms']['dog'] = ['canine', 'puppy'] |
| 392 |
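| As noted above, if you want two terms to match the same documents you must |
| define them as synonyms of each other; following the same pattern, the reverse |
| mappings would be: |
| |
| >>> synonyms['synonyms']['feline'] = ['cat'] |
| >>> synonyms['synonyms']['canine'] = ['dog'] |
| |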
| 393 To save the changes, use the ``save`` method. |
| 394 |
| 395 >>> synonyms.save() |
| 396 |
| 397 The synonyms object has the same status attributes described above for stemming, |
| 398 which provide additional information about the synonyms configured for your domain. |
| 399 |
| 400 Deleting Documents |
| 401 ------------------ |
| 402 |
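| Deleting a document works much like adding one: you schedule the delete through |
| the document service, identifying the document by its ID and supplying a higher |
| version number, and then commit the batch. |
| |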
| 403 >>> import time |
| 404 >>> from datetime import datetime |
| 405 |
| 406 >>> doc_service = domain.get_document_service() |
| 407 |
| 408 >>> # Again we'll cheat and use the current epoch time as our version number |
| 409 |
| 410 >>> doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple()))) |
| 411 >>> doc_service.commit() |