common/cmpbin/doc.go - Issue 1219323002: Rename funnybase and add more serializations.

Side by Side Diff: common/cmpbin/doc.go

Issue 1219323002: Rename funnybase and add more serializations. (Closed) Base URL: https://chromium.googlesource.com/external/github.com/luci/luci-go@master

Patch Set: fixes Created 5 years, 5 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View unified diff | Download patch

OLD	NEW
(Empty)
	1 // Copyright 2015 The Chromium Authors. All rights reserved.

	2 // Use of this source code is governed by a BSD-style license that can be

	3 // found in the LICENSE file.

	4

	5 // Package cmpbin provides binary serialization routines which ensure that the

	6 // serialized objects maintain the same sort order of the original inputs when

	7 // sorted bytewise (i.e. with memcmp). Additionally, serialized objects are

	8 // concatenatable.

	9 //

	10 // Notes on particular serialization schemes:

	11 //

	12 // - Numbers:

	13 // The number encoding is less efficient on average than Varint

	14 // ("encoding/binary") for small numbers (it has a minimum encoded size of

	15 // 2 bytes), but is more efficient for large numbers (it has a maximum encoded

	16 // size of 9 bytes for a 64 bit int, unlike the largest Varint which has a 10b

	17 // representation).

	18 //

	19 // Both signed and unsigned numbers are encoded with the same scheme, and will

	20 // sort together as signed numbers. Decoding with the incorrect routine will

	21 // result in an ErrOverflow/ErrUnderflow error if the actual value is out of

	22 // range.

	23 //

	24 // The scheme works like:

	25 // - given an 2's compliment value V

	26 // - extract the sign (S) and magnitude (M) of V

	27 // - Find the position of the highest bit (P), minus 1.

	28 // - write (bits):

	29 // - SPPPPPPP MMMMMMMM MM000000

	30 // - S is 1

	31 // - P's are the log2(M)-1

	32 // - M's are the magnitude of V

	33 // - 0's are padding

	34 // - Additionally, if the number is negative, invert the bits of all the bytes

	35 // (e.g. XOR 0xFF). This makes the sign bit S 0 for negative numbers, and

	36 // makes the ordering of the numbers correct when compared bytewise.

	37 //

	38 // - Strings/[]byte

	39 // Each byte in the encoded stream reserves the least significant bit as a stop

	40 // bit (1 means that the string continues, 0 means that the string ends). The

	41 // actual user data is shifted into the top 7 bits of every encoded byte. This

	42 // results in a data inflation rate of 12.5%, but this overhead is constant

	43 // (doesn't vary by the encoded content). Note that if space efficiency is very

	44 // important and you are storing large strings on average, you could reduce the

	45 // overhead by only placing the stop bit on every other byte or every 4th byte,

	46 // etc. This would reduce the overhead to 6.25% or 3.125% accordingly (but would

	47 // cause every string to round out to 2 or 4 byte chunks), and it would make

	48 // the algorithm implementation more complex. The current implementation was

	49 // chosen as good enough in light of the fact that pre-compressing regular data

	50 // could save more than 12.5% overall, and that for incompressable data a

	51 // commonly used encoding scheme (base64) has a full 25% overhead (and a

	52 // generally more complex implementation).

	53 //

	54 // - Floats

	55 // Floats are tricky (really tricky) because they have lots of weird

	56 // non-sortable special cases (like NaN). That said, for the majority of

	57 // non-weird cases, the implementation here will sort real numbers the way that

	58 // you would expect.

	59 //

	60 // The implementation is derived from http://stereopsis.com/radix.html, and full

	61 // credit for the original algorithm goes to Michael Herf. The algorithm is

	62 // essentially:

	63 //

	64 // - if the number is positive, flip the top bit

	65 // - if the number is negative, flip all the bits

	66 //

	67 // Floats are not varint encoded, you could varint encode the mantissa

	68 // (significand). This is only a 52 bit section, meaning that it is normally

	69 // encoded with 6.5 bytes (a nybble is stolen from the second exponent byte).

	70 // Assuming you used the numerical encoding above, shifted left by 4 bits,

	71 // discarding the sign bit (since its laready the MSb on the float, and then

	72 // using 6 bits (instead of 7) to represent the number of significant bits in

	73 // the mantissa (since there are only a maximum of 52), you could expect to see

	74 // small-mantissa floats (of any characteristic) encoded in 3 bytes (this has

	75 // 6 bits of mantissa), and the largest floats would have an encoded size of

	76 // 9 bytes (with 2 wasted bits). However the implementation complexity would be

	77 // higher.

	78 //

	79 // The actual encoded values for special cases are (sorted high to low):

	80 // - QNaN - 0xFFF8000000000000

	81 // // note that golang doesn't seem to actually have SNaN?

	82 // - SNaN - 0xFFF0000000000001

	83 // - +inf - 0xFFF0000000000000

	84 // - MaxFloat64 - 0xFFEFFFFFFFFFFFFF

	85 // - SmallestNonzeroFloat64 - 0x8000000000000001

	86 // - 0 - 0x8000000000000000

	87 // - -0 - 0x7FFFFFFFFFFFFFFF

	88 // - -SmallestNonzeroFloat64 - 0x7FFFFFFFFFFFFFFE

	89 // - -MaxFloat64 - 0x0010000000000000

	90 // - -inf - 0x000FFFFFFFFFFFFF

	91 package cmpbin

OLD	NEW

« no previous file with comments | « common/cmpbin/common_test.go ('k') | common/cmpbin/float.go » ('j') | no next file with comments »