Monday, September 10, 2007

Hashing with SSE2

NOTE: This post has been revised.

Almost all software engineers occasionally need a hash function, whether for hash table lookups, as a source of data-dependent pseudo-randomness, or as a checksum of some sort. The classic one I and many others use is the "one-at-a-time" hash:

#include <sys/types.h>   /* u_int32_t, size_t */

u_int32_t one_at_a_time(unsigned char *key, size_t len)
{
    u_int32_t hash, i;

    for (hash = i = 0; i < len; ++i) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

It's reasonably fast and pseudo-random for most everyday use. And because it's so simple, it is trivially modified for special purposes, such as hashing nul-terminated strings in C.
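
That variant might look something like this:

/* The same mixing steps, but the loop stops at the terminating NUL
   instead of counting bytes. */
u_int32_t one_at_a_time_str(const char *key)
{
    u_int32_t hash = 0;

    while (*key) {
        hash += (unsigned char) *key++;
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}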

The one-at-a-time hash could nevertheless be faster, and several better hash functions have been suggested by Bob Jenkins; by Fowler, Noll, and Vo; and by Paul Hsieh. The one-at-a-time hash uses a 32-bit internal state, stored in the variable hash, and consumes eight bits of input per iteration. Hsieh's hash speeds this up by consuming 32 bits per iteration while still significantly improving the hash function's statistical properties. Jenkins' hash increases both the internal state and the per-iteration consumption to 96 bits.
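
Of these, the Fowler-Noll-Vo hash is the simplest; if I remember the published 32-bit FNV-1a constants correctly, it goes like this:

/* 32-bit FNV-1a: XOR in one byte, then multiply by the FNV prime.
   Uses the same u_int32_t as the listing above. */
u_int32_t fnv1a(const unsigned char *key, size_t len)
{
    u_int32_t hash = 2166136261u;   /* FNV offset basis */
    size_t i;

    for (i = 0; i < len; ++i) {
        hash ^= key[i];
        hash *= 16777619u;          /* FNV prime */
    }
    return hash;
}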

But I occasionally want to use hash values as checksums, and thirty-two bits is insufficient for that even when I don't require cryptographic resilience against malicious attacks (use SHA-2 for those purposes). Thirty-two bits is also becoming insufficient for indexing a hash table on a large server. Jenkins does offer a 64-bit variant of his hash, but mixing a large internal state becomes slow with 32- or even 64-bit-wide registers.

So I decided to sacrifice portability and use Intel's multimedia extension SSE2, with its good supply of big fat 128-bit-wide registers and funny instructions for SIMD operations. The multiply-add instruction, which multiplies eight pairs of 16-bit integers and sums adjacent products into four 32-bit results, is particularly handy for performing Fowler-Noll-Vo-style multiplication to propagate one-bit changes to at least 16 more significant bits. A little more work is needed to propagate those changes evenly to the other fifteen 16-bit entities, but SSE2 does have 128-bit bitwise operations, 64-bit arithmetic, 32-bit word shuffling, and funny byte-interleaving primitives, for which both exact and more approachable documentation is available. Some research, a test driver to evaluate the statistical properties of the hash, a genetic algorithm to optimize the hypermystical constants in the code, and a weekend of CPU-time went into getting the details right - but hey, any hacker's goal is to save a few CPU-cycles regardless of how much computing it takes!
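
To see what the multiply-add does in isolation, here is a toy example (the input values are arbitrary, picked only to make the products easy to check by hand):

/* _mm_madd_epi16 (PMADDWD): multiply eight pairs of signed 16-bit
   values and add adjacent products, giving four signed 32-bit sums.
   Compile with gcc -msse2. */
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    short a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    short b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
    int r[4];
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);

    _mm_storeu_si128((__m128i *) r, _mm_madd_epi16(va, vb));

    /* Prints 50 250 610 1130, i.e. 1*10+2*20, 3*30+4*40, ... */
    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);
    return 0;
}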

To save you from semi-trivial details I present only a simplified version of the hash, which assumes the number of bytes to hash is divisible by 32. I clocked my hash as 5-10% slower than Hsieh's, but I'm quite happy with its statistical properties: in a test of 10^8 one-bit flips, flipping one bit of the input flips each bit of the 256-bit internal state by the end of the iteration with a probability within 0.03 of 50%. The corresponding probabilities for Hsieh's hash range from 3.2% to 25%, or, with the final avalanching included, stay within 0.31 of 50%. By sacrificing some of the quality of the hash, the function below could be modified to be even faster than Hsieh's hash.
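
The test driver itself is too boring to show, but the idea is roughly the following; here it is sketched for the one-at-a-time hash above and far fewer trials, whereas the real test does the same against the 256-bit internal state with 10^8 flips:

/* Sketch of a one-bit-flip test: hash a random key, flip one random
   input bit, hash again, and count which output bits changed.  An
   ideal hash flips each output bit with probability 0.5.  Assumes
   the one_at_a_time function above is in the same file. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define KEY_LEN 32
#define TRIALS  100000

int main(void)
{
    unsigned long flips[32] = { 0 };
    unsigned char key[KEY_LEN];
    u_int32_t h0, h1;
    int t, bit, out;

    for (t = 0; t < TRIALS; ++t) {
        for (bit = 0; bit < KEY_LEN; ++bit)
            key[bit] = rand() & 0xff;
        h0 = one_at_a_time(key, KEY_LEN);
        bit = rand() % (KEY_LEN * 8);
        key[bit / 8] ^= 1 << (bit % 8);    /* flip one input bit */
        h1 = one_at_a_time(key, KEY_LEN);
        for (out = 0; out < 32; ++out)
            if ((h0 ^ h1) & (1u << out))
                ++flips[out];
    }
    for (out = 0; out < 32; ++out)
        printf("output bit %2d flipped with probability %.4f\n",
               out, (double) flips[out] / TRIALS);
    return 0;
}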

/* Compile with gcc -msse2 ... */

#include <emmintrin.h>   /* SSE2 intrinsics */

typedef signed short s_int16_t;  /* A 16-bit signed type. */

static s_int16_t mul_add_coeffs[16] __attribute__((aligned(16))) = {
    18743, 28829, -22921, -14039, -15247, -11117, -14293, 18911,
    -23407, 17987, -11321, -25943, -32287, 10001, 30773, -12541,
};

/* Compute a 256-bit hash into res[0..31] using SSE2.  buf and res
   need not be aligned, but len must be divisible by 32. */
static void hasshe2(const unsigned char *buf, int len, unsigned char *res)
{
    __m128i madd_1, madd_2, in_1, in_2, hash_1, hash_2, tmp_1, tmp_2;

    /* The two halves of the 16-entry coefficient vector. */
    madd_1 = _mm_load_si128((const __m128i *) mul_add_coeffs);
    madd_2 = _mm_load_si128((const __m128i *) (mul_add_coeffs + 8));

    /* Prime the 256-bit state from the first 32 input bytes. */
    in_1 = _mm_loadu_si128((const __m128i *) buf);
    in_2 = _mm_loadu_si128((const __m128i *) (buf + 16));

    hash_1 = _mm_madd_epi16(madd_1, in_1);
    hash_2 = _mm_madd_epi16(madd_2, in_2);
    goto rest_of_mix;

    do {
        /* Fold in the next 32 input bytes. */
        in_1 = _mm_loadu_si128((const __m128i *) buf);
        in_2 = _mm_loadu_si128((const __m128i *) (buf + 16));

        hash_1 = _mm_xor_si128(hash_1, _mm_madd_epi16(madd_1, in_1));
        hash_2 = _mm_xor_si128(hash_2, _mm_madd_epi16(madd_2, in_2));

    rest_of_mix:
        /* Spread the multiply-add results across the whole state with
           64-bit arithmetic, word shuffles and byte interleaving. */
        hash_1 = _mm_add_epi64(hash_1, _mm_srli_si128(hash_1, 3));
        hash_1 = _mm_sub_epi64(hash_1,
                               _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
        hash_2 = _mm_add_epi64(hash_2, _mm_srli_si128(hash_2, 3));
        hash_2 = _mm_sub_epi64(hash_2,
                               _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));

        tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_2, hash_1));
        tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_1, hash_2));
        hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
        hash_1 = _mm_sub_epi64(hash_1,
                               _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
        hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
        hash_2 = _mm_sub_epi64(hash_2,
                               _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));

        /* Move these after the loop if good statistical properties are
           not required. */
        tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_1, hash_2));
        tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_2, hash_1));
        hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
        hash_1 = _mm_sub_epi64(hash_1,
                               _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(2, 1, 0, 3)));
        hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
        hash_2 = _mm_sub_epi64(hash_2,
                               _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(2, 1, 0, 3)));

        len -= 32;
        buf += 32;
    } while (len > 0);

    _mm_storeu_si128((__m128i *) res, hash_1);
    _mm_storeu_si128((__m128i *) (res + 16), hash_2);
}
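
A trivial way to exercise the function is to hash a zero-padded 64-byte buffer and print the result; something along these lines (the test string is of course arbitrary):

/* Example driver (same file as hasshe2 above): hash 64 bytes, a
   multiple of 32, and print the 256-bit result in hex. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char buf[64], res[32];
    int i;

    memset(buf, 0, sizeof(buf));
    strcpy((char *) buf, "The quick brown fox jumps over the lazy dog");

    hasshe2(buf, sizeof(buf), res);

    for (i = 0; i < 32; ++i)
        printf("%02x", res[i]);
    printf("\n");
    return 0;
}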


The code should work on all Pentium 4 and Athlon 64 class processors and newer, but my hash will nevertheless become gradually obsolete (or at least subject to further improvement) because SSE4.2 will provide an instruction for computing CRC32.
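
For the record, once compiler and CPU support exists, that instruction should be usable roughly like this (a sketch only; note that it yields a 32-bit CRC-32C, not a 256-bit hash):

/* Sketch of computing CRC-32C with the SSE4.2 intrinsics.
   Compile with gcc -msse4.2 on a CPU that supports it. */
#include <nmmintrin.h>
#include <stddef.h>

unsigned int crc32c(const unsigned char *buf, size_t len)
{
    unsigned int crc = 0xffffffffu;   /* conventional initial value */
    size_t i;

    for (i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, buf[i]);
    return crc ^ 0xffffffffu;         /* conventional final inversion */
}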