Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful to understand what's going on in this
implementation, but it also can reveal some insight into how the real appengine
datastore works (though note that the specific encodings are different).

Additionally note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, property name for
the builtin indexes, etc.). The real implementation contains these bytes in the
table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized
using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the
primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings
are important to understanding why the schemas below sort correctly when
compared only using bytes.Compare (aka `memcmp`). This doc will assume that
you're familiar with those encodings, but will point out where we diverge from
the stock encodings.
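
As a point of reference, here is a minimal, self-contained sketch of the kind
of memcmp-sortable encoding involved. This is the classic float trick from
http://stereopsis.com/radix.html, not the actual cmpbin code (cmpbin's real
encodings differ):

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "math"
    )

    // sortableFloat64 transforms the IEEE-754 bits so that unsigned big-endian
    // byte comparison matches numeric ordering: positive values get their sign
    // bit flipped, negative values get all of their bits flipped.
    func sortableFloat64(f float64) []byte {
        b := math.Float64bits(f)
        b ^= -(b >> 63) | 0x8000000000000000
        out := make([]byte, 8)
        binary.BigEndian.PutUint64(out, b)
        return out
    }

    func main() {
        // -2.5 < -1.5 < 0 < 2.25 holds for the encoded bytes too:
        fmt.Println(bytes.Compare(sortableFloat64(-2.5), sortableFloat64(-1.5)) < 0) // true
        fmt.Println(bytes.Compare(sortableFloat64(-1.5), sortableFloat64(0)) < 0)    // true
        fmt.Println(bytes.Compare(sortableFloat64(0), sortableFloat64(2.25)) < 0)    // true
    }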

All encoded Property values used in Keys are serialized `WithoutContext`, and
`ShouldIndex`.

### Primary table

The primary table maps datastore keys to entities.
- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:
- Every entity root (e.g. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The value
    has a single property `__version__`, which contains the version number of
    this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__",1)` which contains the
  same `__version__` property, and indicates the last automatically allocated
  entity ID for root entities.

### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.
- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
  in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with
a compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.
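
A sketch of that transformation, inferred from the example above (this is not
the library's actual code; consult `datastore.IndexDefinition.PrepForIdxTable`
for the real thing):

    package main

    import "fmt"

    type Direction int

    const (
        ASCENDING Direction = iota
        DESCENDING
    )

    type IndexColumn struct {
        Property  string
        Direction Direction
    }

    type IndexDefinition struct {
        Kind     string
        Ancestor bool
        SortBy   []IndexColumn
    }

    // prepForIdxTable normalizes (appends the implied __key__ column) and then
    // reverses SortBy, so the definition's suffix becomes the key's prefix.
    func prepForIdxTable(id IndexDefinition) IndexDefinition {
        cols := append([]IndexColumn{}, id.SortBy...)
        cols = append(cols, IndexColumn{Property: "__key__", Direction: ASCENDING})
        for i, j := 0, len(cols)-1; i < j; i, j = i+1, j-1 {
            cols[i], cols[j] = cols[j], cols[i]
        }
        return IndexDefinition{Kind: id.Kind, Ancestor: id.Ancestor, SortBy: cols}
    }

    func main() {
        prepped := prepForIdxTable(IndexDefinition{
            Kind:     "Foo",
            Ancestor: true,
            SortBy: []IndexColumn{
                {"Something", DESCENDING},
                {"Else", ASCENDING},
                {"Cool", ASCENDING},
            },
        })
        fmt.Println(prepped.SortBy) // [{__key__ 0} {Cool 0} {Else 0} {Something 1}]
    }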

### Index Tables

Every index (both builtin and compound) has one index table per namespace, which
contains one row per entry in the index.
- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
- Value: empty

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor column
of the indexes that it matches. (The notation `/Some,1` denotes a root key: kind
`Some`, id `1`, with a nil Parent(); each additional `/Kind,id` segment is a
child of the key to its left.)

#### Builtin (automatic) indexes

The following indexes are automatically created for some entity with a key
`/Kind,*`, for every property (with `ShouldIndex` values) named "Foo":

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}

Index updates
-------------

(Note that this is a LARGE departure from how the production appengine datastore
does this. This model only works because the implementation is not distributed,
and not journaled. The real datastore does index updates in parallel and is
generally pretty fancy compared to this.)

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps) and the new entity value, create index entries for both, merge
them, and apply the deltas to the affected index tables (i.e. entries that
exist in the old entity but not the new one are deleted; entries that exist in
the new but not the old are added).
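
A minimal sketch of that delta step, assuming the index rows for each version
of the entity have already been rendered to strings:

    package main

    import "fmt"

    // indexDelta computes which already-rendered index rows to delete (present
    // for the old entity but not the new one) and which to add (the reverse).
    func indexDelta(old, cur map[string]struct{}) (toDel, toAdd []string) {
        for row := range old {
            if _, ok := cur[row]; !ok {
                toDel = append(toDel, row)
            }
        }
        for row := range cur {
            if _, ok := old[row]; !ok {
                toAdd = append(toAdd, row)
            }
        }
        return
    }

    func main() {
        old := map[string]struct{}{"rowA": {}, "rowB": {}}
        cur := map[string]struct{}{"rowB": {}, "rowC": {}}
        toDel, toAdd := indexDelta(old, cur)
        fmt.Println(toDel, toAdd) // [rowA] [rowC]
    }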

Index generation works (given a slice of indexes []Idxs) by:
* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx
  * if idx's columns contain properties that are not in the map, skip idx
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations (see the sketch
    below). If the SortBy column is DESCENDING, make sure to invert (XOR 0xFF)
    the serializedProperty value!
* add that generated []byte row to the index's corresponding table.
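
Here's a sketch of that row-generation step, assuming the per-column candidate
values are already serialized (an illustration, not the implementation's actual
code):

    package main

    import "fmt"

    // invert XORs every byte with 0xFF, used for DESCENDING columns.
    func invert(b []byte) []byte {
        out := make([]byte, len(b))
        for i, c := range b {
            out[i] = c ^ 0xFF
        }
        return out
    }

    // rows emits one []byte row per combination of column values, one value
    // drawn from each column, inverting DESCENDING columns as it goes.
    func rows(cols [][][]byte, descending []bool) [][]byte {
        out := [][]byte{{}}
        for i, vals := range cols {
            var next [][]byte
            for _, prefix := range out {
                for _, v := range vals {
                    if descending[i] {
                        v = invert(v)
                    }
                    row := append(append([]byte{}, prefix...), v...)
                    next = append(next, row)
                }
            }
            out = next
        }
        return out
    }

    func main() {
        // two values for column 0 (ASC), one for column 1 (DESC): 2 rows out
        cols := [][][]byte{{{0x01}, {0x02}}, {{0x0A}}}
        fmt.Printf("%x\n", rows(cols, []bool{false, true})) // [01f5 02f5]
    }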

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:
* Generate a suffix format from the user's query which looks like (a worked
  example follows this list):
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected.
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This
  does what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match by looking up rows in the "idx" table which begin with this serialized
  reversed index.
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.
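
As a hypothetical worked example of the suffix-format rules, the query
`Query("Kind").Filter("A =", 1).Filter("B >", 2).Project("C")` would produce:

    {Property: "B", Direction: ASCENDING},      // the inequality order comes first
    {Property: "C", Direction: ASCENDING},      // then unreferenced projections
    {Property: "__key__", Direction: ASCENDING} // then the implied __key__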

An index is a potential match if its suffix *exactly* matches the suffix format,
and it contains *only* sort orders which appear in the query (i.e. the index may
not contain a column which doesn't appear in the query as an equality or
inequality filter).

The index search continues until:
* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.
If we fail to find sufficient indexes to fulfill the query, we generate the
'missing' index by concatenating all missing equality filters (in ascending
order, but it doesn't matter if the user wants to make them descending),
followed by concatenating the suffix format that we generated for this query.

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the one
with the greatest number of columns) unconditionally. We then keep adding indexes
which contain one or more of the remaining constraints, until we have no
more constraints to satisfy.

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indices (i.e. they all share the same *suffix format*).
We can use this to just iterate over the index's SortBy columns which we'll use
for equality filters. For each equality column, we remove a corresponding value
from the constraints map. In the event that we _run out_ of constraints for a
given column, we simply _pick an arbitrary value_ from the original equality
filter mapping and use that. This is valid to do because, after all, they're
equality filters.
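
A condensed, self-contained model of that selection loop (real matching indexes
carry full IndexDefinitions and `[]serializedProperty` values; plain counts
stand in for them here):

    package main

    import (
        "fmt"
        "sort"
    )

    // matchedIdx is a matching index, reduced for this sketch to the property
    // names of its equality columns.
    type matchedIdx struct {
        name   string
        eqCols []string
    }

    // chooseIndexes models the loop: constraints counts how many equality
    // values remain to be covered per property name.
    func chooseIndexes(matches []matchedIdx, constraints map[string]int) (chosen []matchedIdx) {
        sort.Slice(matches, func(i, j int) bool {
            return len(matches[i].eqCols) > len(matches[j].eqCols)
        })
        for i, m := range matches {
            satisfied := 0
            for _, col := range m.eqCols {
                if constraints[col] > 0 {
                    constraints[col]-- // consume one remaining constraint
                    satisfied++
                }
            }
            if i == 0 || satisfied > 0 { // the 0th index is added unconditionally
                chosen = append(chosen, m)
            }
        }
        return
    }

    func main() {
        matches := []matchedIdx{
            {"builtin:A", []string{"A"}},
            {"compound:A+B", []string{"A", "B"}},
        }
        fmt.Println(chooseIndexes(matches, map[string]int{"A": 1, "B": 1}))
    }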

Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly put the
ancestor as if it were the first SortBy column. For satisfying Ancestor queries
with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index, we
only ever want to see index rows whose prefix exactly matches this. Unlike the
suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).

That's about it for index selection! See Query Execution for how we actually use
the selected indexes to run a query.
### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor queries
with equality filters, but they don't start with the magic Ancestor column!

There's a trick that you can do if the suffix format for the query is just
`__key__` though (e.g. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that you're
planning to use for the Ancestor query, then chop off the terminating null byte
from the encoding, and then use this as additional prefix bytes for this index.
So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

We omit the STOP which normally terminates the Key encoding, so using this
prefix will only return index rows which are `/Kind,1` or its children.
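
A sketch of that prefix construction, with toy byte values standing in for the
real serialized forms:

    package main

    import (
        "bytes"
        "fmt"
    )

    // ancestorPrefix drops the trailing STOP byte from the serialized ancestor
    // key so that child keys (which CONTINUE the encoding) still match.
    func ancestorPrefix(eqPrefix, serializedKey []byte) []byte {
        noStop := serializedKey[:len(serializedKey)-1]
        return append(append([]byte{}, eqPrefix...), noStop...)
    }

    func main() {
        // toy stand-ins for: PTInt ++ 100, and PTKey ++ "Kind" ++ 1 ++ STOP
        val := []byte{0x64}
        ancestor := []byte{'K', 1, 0x00}
        prefix := ancestorPrefix(val, ancestor)

        child := []byte{0x64, 'K', 1, 0x01, 'C', 2, 0x00} // /Kind,1/Child,2
        other := []byte{0x64, 'Q', 7, 0x00}               // /Quux,7
        fmt.Println(bytes.HasPrefix(child, prefix))       // true
        fmt.Println(bytes.HasPrefix(other, prefix))       // false
    }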

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indices.

### Cursor format

Cursors work by containing values for each of the columns in the suffix, in the
order and Direction specified by the suffix. In fact, cursors are just encoded
versions of the []IndexColumn used for the 'suffix format', followed by the
raw bytes of the suffix for that particular row (incremented by 1 bit).
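
One way to realize that "incremented" step (an assumption about the intended
semantics, not verified against the code): the smallest byte string that sorts
strictly after a row's suffix is the suffix with a 0x00 byte appended, which
makes the cursor resume just past the row it was generated from:

    package main

    import (
        "bytes"
        "fmt"
    )

    // successor returns the smallest []byte that sorts strictly after s under
    // bytes.Compare: s itself with a zero byte appended.
    func successor(s []byte) []byte {
        return append(append([]byte{}, s...), 0x00)
    }

    func main() {
        row := []byte{0x0A, 0xFF}
        next := successor(row)
        fmt.Println(bytes.Compare(row, next) < 0)          // true
        fmt.Println(bytes.Compare(next, []byte{0x0B}) < 0) // true: nothing is skipped
    }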

This means that technically you can port cursors between any queries which share
precisely the same suffix format, regardless of other query options, even if the
index planner ends up choosing different indexes to use from the first query to
the second. No state is maintained in the service implementation for cursors.

I suspect that this is a more liberal version of cursors than how the production
appengine implements them, but I haven't verified one way or the other.
Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc. index.

If any index iterates past that suffix, we start back at the 0th index with that
suffix, and continue to try to find a matching row. Doing this will end up
skipping large portions of all of the indexes in the list. This is the algorithm
known as "zigzag merge join", and you can find talks on it from some of the
appengine folks. It has very good algorithmic running time and tends to scale
with the number of full matches, rather than the size of all of the indexes
involved.
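
Here is a toy, self-contained model of the zigzag loop (sorted slices stand in
for the real index iterators):

    package main

    import (
        "bytes"
        "fmt"
        "sort"
    )

    // seek returns the first element of set that is >= target, or nil.
    func seek(set [][]byte, target []byte) []byte {
        i := sort.Search(len(set), func(i int) bool {
            return bytes.Compare(set[i], target) >= 0
        })
        if i == len(set) {
            return nil
        }
        return set[i]
    }

    // zigzag returns every suffix present in ALL of the sorted sets.
    func zigzag(sets [][][]byte) (hits [][]byte) {
        if len(sets) == 0 {
            return
        }
        candidate := seek(sets[0], []byte{})
        for candidate != nil {
            agreed := true
            for _, set := range sets {
                got := seek(set, candidate)
                if got == nil {
                    return // one iterator is exhausted: no more hits anywhere
                }
                if !bytes.Equal(got, candidate) {
                    candidate = got // overshot: restart the round from here
                    agreed = false
                    break
                }
            }
            if agreed { // every iterator landed on the same suffix: a hit
                hits = append(hits, candidate)
                candidate = append(append([]byte{}, candidate...), 0x00)
            }
        }
        return
    }

    func main() {
        a := [][]byte{{1}, {3}, {5}, {7}}
        b := [][]byte{{3}, {4}, {7}}
        c := [][]byte{{2}, {3}, {7}, {9}}
        fmt.Println(zigzag([][][]byte{a, b, c})) // [[3] [7]]
    }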

A hit occurs when all of the iterators have precisely the same suffix. This hit
suffix is then decoded using the suffix format information. The very last column
of the suffix will always be the datastore key. The suffix is then used to call
back to the user, according to the query type:
* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why. The
  projected values are pulled directly from the index, instead of going to the
  main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.
| OLD | NEW |