Chromium Code Reviews| OLD | NEW |
|---|---|
| (Empty) | |
| 1 // Copyright 2015 The Chromium Authors. All rights reserved. | |
| 2 // Use of this source code is governed by a BSD-style license that can be | |
| 3 // found in the LICENSE file. | |
| 4 | |
| 5 // Package cmpbin provides binary serialization routines which ensure that the | |
| 6 // serialized objects maintain the same sort order of the original inputs when | |
| 7 // sorted bytewise (i.e. with memcmp). Additionally, serialized objects are | |
| 8 // concatenatable. | |
| 9 // | |
| 10 // Notes on particular serialization schemes: | |
| 11 // | |
| 12 // * Numbers: | |
| 13 // The number encoding is less efficient on average than Varint | |
| 14 // ("encoding/binary") for small numbers (it has a minimum encoded size of | |
| 15 // 2 bytes), but is more efficient for large numbers (it has a maximum encoded | |
| 16 // size of 9 bytes for a 64 bit int, unlike the largest Varint which has a 10b | |
| 17 // representation). | |
| 18 // | |
| 19 // Both signed and unsigned numbers are encoded with the same scheme, and will | |
| 20 // sort together as signed numbers. Decoding with the incorrect routine will | |
| 21 // result in an ErrOverflow/ErrUnderflow error if the actual value is out of | |
| 22 // range. | |
| 23 // | |
| 24 // The scheme works like: | |
| 25 // * given an 2's compliment value V | |
|
M-A Ruel
2015/07/06 23:21:35
I prefer - to * for lists (not a big deal, since i
iannucci
2015/07/07 00:03:27
It's considered an indented block (so it's <pre>..
| |
| 26 // * extract the sign (S) and magnitude (M) of V | |
| 27 // * Find the position of the highest bit (P), minus 1. | |
| 28 // * write (bits): | |
| 29 // * SPPPPPPP MMMMMMMM MM000000 | |
| 30 // * S is 1 | |
| 31 // * P's are the log2(M)-1 | |
| 32 // * M's are the magnitude of V | |
| 33 // * 0's are padding | |
| 34 // * Additionally, if the number is negative, invert the bits of all the bytes | |
| 35 // (e.g. XOR 0xFF). This makes the sign bit S 0 for negative numbers, and | |
| 36 // makes the ordering of the numbers correct when compared bytewise. | |
| 37 // | |
| 38 // * Strings/[]byte | |
| 39 // Each byte in the encoded stream reserves the least significant bit as a stop | |
| 40 // bit (1 means that the string continues, 0 means that the string ends). The | |
| 41 // actual user data is shifted into the top 7 bits of every encoded byte. This | |
| 42 // results in a data inflation rate of 12.5%, but this overhead is constant | |
| 43 // (doesn't vary by the encoded content). Note that if space efficiency is very | |
| 44 // important and you are storing large strings on average, you could reduce the | |
| 45 // overhead by only placing the stop bit on every other byte or every 4th byte, | |
| 46 // etc. This would reduce the overhead to 6.25% or 3.125% accordingly (but would | |
| 47 // cause every string to round out to 2 or 4 byte chunks), and it would make | |
| 48 // the algorithm implementation more complex. The current implementation was | |
| 49 // chosen as good enough in light of the fact that pre-compressing regular data | |
| 50 // could save more than 12.5% overall, and that for incompressable data a | |
| 51 // commonly used encoding scheme (base64) has a full 25% overhead (and a | |
| 52 // generally more complex implementation). | |
| 53 // | |
| 54 // * Floats | |
| 55 // Floats are tricky (really tricky) because they have lots of weird | |
| 56 // non-sortable special cases (like NaN). That said, for the majority of | |
| 57 // non-weird cases, the implementation here will sort real numbers the way that | |
| 58 // you would expect. | |
| 59 // | |
| 60 // The implementation is derived from http://stereopsis.com/radix.html, and full | |
| 61 // credit for the original algorithm goes to Michael Herf. The algorithm is | |
| 62 // essentially: | |
| 63 // | |
| 64 // * if the number is positive, flip the top bit | |
| 65 // * if the number is negative, flip all the bits | |
| 66 // | |
| 67 // Floats are not varint encoded, you could varint encode the mantissa | |
| 68 // (significand). This is only a 52 bit section, meaning that it is normally | |
| 69 // encoded with 6.5 bytes (a nybble is stolen from the second exponent byte). | |
| 70 // Assuming you used the numerical encoding above, shifted left by 4 bits, | |
| 71 // discarding the sign bit (since its laready the MSb on the float, and then | |
| 72 // using 6 bits (instead of 7) to represent the number of significant bits in | |
| 73 // the mantissa (since there are only a maximum of 52), you could expect to see | |
| 74 // small-mantissa floats (of any characteristic) encoded in 3 bytes (this has | |
| 75 // 6 bits of mantissa), and the largest floats would have an encoded size of | |
| 76 // 9 bytes (with 2 wasted bits). However the implementation complexity would be | |
| 77 // higher. | |
| 78 // | |
| 79 // The actual encoded values for special cases are (sorted high to low): | |
| 80 // * QNaN - 0xFFF8000000000000 | |
| 81 // // note that golang doesn't seem to actually have SNaN? | |
| 82 // * SNaN - 0xFFF0000000000001 | |
| 83 // * +inf - 0xFFF0000000000000 | |
| 84 // * MaxFloat64 - 0xFFEFFFFFFFFFFFFF | |
| 85 // * SmallestNonzeroFloat64 - 0x8000000000000001 | |
| 86 // * 0 - 0x8000000000000000 | |
| 87 // * -0 - 0x7FFFFFFFFFFFFFFF | |
| 88 // * -SmallestNonzeroFloat64 - 0x7FFFFFFFFFFFFFFE | |
| 89 // * -MaxFloat64 - 0x0010000000000000 | |
| 90 // * -inf - 0x000FFFFFFFFFFFFF | |
| 91 package cmpbin | |
| OLD | NEW |