Friday, November 28, 2008


Symbol Ranking is a method for data compression where one maintains a LRU list of symbols recently seen in the given context. Those symbols are expected to be more probable the next time and therefore encoded with fewer bits than other symbols. The LRU list is usually only up to three symbols long, thus making it possible to store a context in just one 32-bit word.

Symbol Ranking has been common computing folklore for ages, Peter Fenwick gave the first modern treatment and implementation subsequently improved by Matt Mahoney for example by adding a secondary arithmetic compression stage. I added a few ideas of my own, rewrote it in C in a way easily embeddable to other programs to compress e.g. network connections and database logging.

This compression library, SR3C, hits the front where one either has to consume more memory or cycles to compress better. The compression is slightly faster to gzip -7, but results in ~11% smaller data. bzip2 -2 compresses slightly better than SR3C, but takes almost three times as long and is not on-line.

Matt Mahoney made a version whose command line interface compiles on Windows and benchmarked it as well. Both versions of SR3C are published under the MIT license.

Random Access Pseudo-Random Numbers

Since my previous post I was approached by a fellow-in-keyboards in need of a "random-access" pseudo-random number generator which could generate the i'th generated random r number directly, without computing the intermediate i-1 random numbers since seeding. Would a hash function for integers be a good RAPRNG?

No! Good hash functions avoid funnels and are thus bijective, which for 32-bit hash functions would imply only a 2^32 period. Consequently, if we draw N < 2^32 consequtive random numbers, then hashing would make r could occur once or never, but not twice or more frequently. A correct random number generation would result in an exponential distribution of the frequency of r.

We can fix this by increasing the period, which is equivalent to increasing the number of output bits of the hash, but use only 32 bits of the output. Alternatively we could hash i with two independent integer hash functions and xor the result. Either way, we deliberately create just the right amount of final funnelling.

Practical tests, such as Dieharder, are surprisingly difficult. I tried xoring results of several pairs of known integer hash functions, and eventually found a reasonable solution based my own hash_32_to_64 mentioned in my previous post.
uint32_t raprng(uint64_t i, uint64_t seed)
uint64_t r = (2857720171ULL * ((uint32_t) i)) ^ 0x1EF57D8A7B344E7BULL;
r ^= r >> 29;
r += r << 16;
r ^= r >> 21;
r += r >> 32;
r = (2857720171ULL * ((uint32_t) (i ^ r))) ^ (0xD9EA571C8AF880B6ULL + seed);
r ^= r >> 29;
r += r << 16;
r ^= r >> 21;
return r + (r >> 32);

The hypothetical period of raprng is 2^96 if i were of sufficient width. This period is barely enough to be respectable, at least compared to some serious PRNG's like the Mersenne Twister. raprng is also slower. But it is random-accessible.

Having gone through all this I need to point out that there is a less-known PRNG called the explicit inversive congruential generator (EICG), where the i'th output is defined as modular inverse of a*i+b for some a and b, where a should obviously be non-zero. The problem herein is that modular inversion is by no means cheap, and it isn't clear to me that some other very non-linear bijection - such as a very good hash function - wouldn't do.

Thursday, November 13, 2008

Hashing with SSE2 Revisited, or My Hash Toolkit

Now that Bob "The Hashman" Jenkins refers to my earlier post on a hash function implemented with the help of SSE2 instructions, I feel compelled to post an update on it.

After my original post Bob and I concluded the mix step was not reversible, so I went back to the drawing board and came up with the function below. As earlier, the task of padding the data to even 16 bytes and other trivialities are left to the reader.

/* Compile with gcc -msse2 ... */

#include <stdlib.h>
#include <stdint.h>
#include <assert.h>
#include <xmmintrin.h>

static uint32_t coeffs[12] __attribute__((aligned(16))) = {
/* Four carefully selected coefficients and interleaving zeros. */
2561893793UL, 0, 1388747947UL, 0,
3077216833UL, 0, 3427609723UL, 0,
/* 128 bits of random data. */
0x564A4447, 0xC7265595, 0xE20C241D, 0x128FA608,

#define COMBINE_AND_MIX(c_1, c_2, s_1, s_2, in) \
/* Phase 1: Perform four 32x32->64 bit multiplication with the \
input block and words 1 and 3 coeffs, respectively. This \
effectively propagates a bit change in input to 32 more \
significant bit positions. Combine into internal state by \
subtracting the result of multiplications from the internal \
state. */ \
s_1 = _mm_sub_epi64(s_1, _mm_mul_epu32(c_1, _mm_unpackhi_epi32(in, in))); \
s_2 = _mm_sub_epi64(s_2, _mm_mul_epu32(c_2, _mm_unpacklo_epi32(in, in))); \
/* Phase 2: Perform shifts and xors to propagate the 32-bit \
changes produced above into 64-bit (and even a little larger) \
changes in the internal state. */ \
/* state ^= state >64> 29; */ \
s_1 = _mm_xor_si128(s_1, _mm_srli_epi64(s_1, 29)); \
s_2 = _mm_xor_si128(s_2, _mm_srli_epi64(s_2, 29)); \
/* state +64= state <64< 16; */ \
s_1 = _mm_add_epi64(s_1, _mm_slli_epi64(s_1, 16)); \
s_2 = _mm_add_epi64(s_2, _mm_slli_epi64(s_2, 16)); \
/* state ^= state >64> 21; */ \
s_1 = _mm_xor_si128(s_1, _mm_srli_epi64(s_1, 21)); \
s_2 = _mm_xor_si128(s_2, _mm_srli_epi64(s_2, 21)); \
/* state +64= state <128< 32; */ \
s_1 = _mm_add_epi64(s_1, _mm_slli_si128(s_1, 4)); \
s_2 = _mm_add_epi64(s_2, _mm_slli_si128(s_2, 4)); \
/* Phase 3: Propagate the changes among the four 64-bit words by \
performing 64-bit subtractions and 32-bit word shuffling. */ \
s_1 = _mm_sub_epi64(s_1, s_2); \
s_2 = _mm_sub_epi64(_mm_shuffle_epi32(s_2, _MM_SHUFFLE(0, 3, 2, 1)), s_1); \
s_1 = _mm_sub_epi64(_mm_shuffle_epi32(s_1, _MM_SHUFFLE(0, 1, 3, 2)), s_2); \
s_2 = _mm_sub_epi64(_mm_shuffle_epi32(s_2, _MM_SHUFFLE(2, 1, 0, 3)), s_1); \
s_1 = _mm_sub_epi64(_mm_shuffle_epi32(s_1, _MM_SHUFFLE(2, 1, 0, 3)), s_2); \
/* With good coefficients any one-bit flip in the input has now \
changed all bits in the internal state with a probability \
between 45% to 55%. */

void hasshe2(const unsigned char *input_buf, size_t n_bytes,
unsigned char *output_state)
__m128i coeffs_1, coeffs_2, rnd_data, input, state_1, state_2;

assert(n_bytes % 16 == 0);

coeffs_1 = _mm_load_si128((void *) coeffs);
coeffs_2 = _mm_load_si128((void *) (coeffs + 4));
rnd_data = _mm_load_si128((void *) (coeffs + 8));

/* Initialize internal state to something random. (Alternatively,
if hashing a chain of data, read in the previous hash result from
somewhere.) */
state_1 = state_2 = rnd_data;

while (n_bytes >= 16) {
/* Read in 16 bytes, or 128 bits, from buf. Advance buf and
decrement n_bytes accordingly. */
input = _mm_loadu_si128((void *) input_buf);
input_buf += 16;
n_bytes -= 16;

COMBINE_AND_MIX(coeffs_1, coeffs_2, state_1, state_2, input);

/* Postprocessing. Copy half of the internal state into fake input,
replace it with the constant rnd_data, and do one combine and mix
phase more. */
input = state_1;
state_1 = rnd_data;
COMBINE_AND_MIX(coeffs_1, coeffs_2, state_1, state_2, input);

_mm_storeu_si128((void *) output_state, state_1);
_mm_storeu_si128((void *) (output_state + 16), state_2);

Some people have asked me what my set preferred set of hash functions are. As mentioned earlier, the one-at-a-time hash fills most of my daily needs. But when I need more speed or more bits to address a larger hash table, I use either Jenkins hash functions for string-like data, or when I have fast multiplication and the hashed material is more reminiscient of picking fields in a record, I use

uint64_t hash_32_to_64(uint32_t key, uint64_t seed)
seed ^= 2857720171ULL * key;
seed ^= seed >> 29;
seed += seed << 16;
seed ^= seed >> 21;
seed += seed << 32;
return seed;

Bob Jenkins and Thomas Wang have pages for similar integer to integer hash functions. And when I need high speed and cryptographical width but not cryptographic strengh, I use the hasshe2.

In summary, my kit of non-cryptographic hash functions contains roughly half a dozen snippets of code I cut, paste and modify as needed.