NOTE: This post has been revised.

Almost all software engineers will occasionally need a hash function, either for hash table lookups, as a source of data-dependent pseudo-randomness, or as a checksum of some sort. The classic one I and many others use is the "one-at-a-time" hash:

u_int32_t one_at_a_time(unsigned char *key, size_t len)
{
  u_int32_t hash, i;

  for (hash = i = 0; i < len; ++i) {
    hash += key[i];
    hash += hash << 10;
    hash ^= hash >> 6;
  }
  hash += hash << 3;
  hash ^= hash >> 11;
  hash += hash << 15;
  return hash;
}

It's reasonably fast and pseudo-random for most everyday uses. And because it's so simple, it is trivially modified for special purposes, such as hashing NUL-terminated strings in C.

The one-at-a-time hash could nevertheless be faster, and several better hash functions have been suggested by Bob Jenkins, Fowler, Noll and Vo, and Paul Hsieh. The one-at-a-time hash uses a 32-bit internal state stored in the variable `hash` and consumes eight bits of input per iteration. Hsieh's hash speeds this up by consuming 32 bits per iteration while still significantly improving the hash function's statistical properties. Jenkins' hash increases both the internal state and the consumption to 96 bits.

But I occasionally want to use hash values as checksums. Thirty-two bits is insufficient even if I don't require cryptographic resilience against malicious attacks - use SHA-2 for those purposes. Thirty-two bits is also becoming insufficient when indexing a hash table on a large server. Jenkins does offer a 64-bit variant of his hash, but mixing a large internal state becomes slow with 32- or even 64-bit-wide registers.

So I decided to sacrifice portability and use Intel's multimedia extension SSE2, with its good supply of big fat 128-bit-wide registers and funny instructions for SIMD operations. The multiply-add instruction, which takes eight 16-bit integer pairs to four 32-bit integer results, is particularly handy for performing Fowler-Noll-Vo-style multiplication to propagate one-bit changes to at least 16 more significant bits. A little more work is needed to propagate those changes evenly to the other fifteen 16-bit entities, but SSE2 does have 128-bit bitwise operations, 64-bit arithmetic, 32-bit word shuffling and funny byte interleaving primitives (exact and more approachable documentation). Some research, a test driver to evaluate the statistical properties of the hash, a genetic algorithm to optimize the hypermystical constants in the code, and a weekend of CPU time went into getting the details right - but hey, any hacker's goal is to save a few CPU cycles regardless of how much computing it takes!

In order to save you from semi-trivial details I present only a simplified version of the hash which assumes the number of bytes to hash is divisible by 32. I clocked my hash as 5-10% slower than Hsieh's, but I'm quite happy with its statistical properties: a one-bit flip in the input flips any given bit in the 256-bit internal state by the end of each iteration with a probability within 0.03 of 50% in a test of 10^8 one-bit flips. The corresponding probability range for Hsieh's hash is from 3.2% to 25%, or, if the final avalanching is included, within 0.31 of 50%. By sacrificing some of the quality of the hash, the function below could be modified to be even faster than Hsieh's.

/* Compile with gcc -msse2 ... */
#include <emmintrin.h>  /* SSE2 intrinsics; xmmintrin.h alone covers only SSE. */

typedef signed short s_int16_t;  /* A 16-bit signed type. */

static s_int16_t mul_add_coeffs[16] __attribute__((aligned(16))) = {
  18743, 28829, -22921, -14039, -15247, -11117, -14293, 18911,
  -23407, 17987, -11321, -25943, -32287, 10001, 30773, -12541,
};

/* Compute a 256-bit hash into res[0..31] using SSE2.  buf and res
   need not be aligned, but len must be divisible by 32. */
static void hasshe2(const unsigned char *buf, int len, unsigned char *res)
{
  __m128i madd_1, madd_2, in_1, in_2, hash_1, hash_2, tmp_1, tmp_2;

  madd_1 = _mm_load_si128((const __m128i *) mul_add_coeffs);
  madd_2 = _mm_load_si128((const __m128i *) (mul_add_coeffs + 8));
  in_1 = _mm_loadu_si128((const __m128i *) buf);
  in_2 = _mm_loadu_si128((const __m128i *) (buf + 16));
  hash_1 = _mm_madd_epi16(madd_1, in_1);
  hash_2 = _mm_madd_epi16(madd_2, in_2);
  goto rest_of_mix;  /* Jump into the loop body, skipping the xor-in. */
  do {
    in_1 = _mm_loadu_si128((const __m128i *) buf);
    in_2 = _mm_loadu_si128((const __m128i *) (buf + 16));
    hash_1 = _mm_xor_si128(hash_1, _mm_madd_epi16(madd_1, in_1));
    hash_2 = _mm_xor_si128(hash_2, _mm_madd_epi16(madd_2, in_2));
  rest_of_mix:
    hash_1 = _mm_add_epi64(hash_1, _mm_srli_si128(hash_1, 3));
    hash_1 = _mm_sub_epi64(hash_1,
                           _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
    hash_2 = _mm_add_epi64(hash_2, _mm_srli_si128(hash_2, 3));
    hash_2 = _mm_sub_epi64(hash_2,
                           _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));

    tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_2, hash_1));
    tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_1, hash_2));
    hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
    hash_1 = _mm_sub_epi64(hash_1,
                           _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
    hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
    hash_2 = _mm_sub_epi64(hash_2,
                           _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));

    /* Move these after the loop if good statistical properties are not
       required. */
    tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_1, hash_2));
    tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_2, hash_1));
    hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
    hash_1 = _mm_sub_epi64(hash_1,
                           _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(2, 1, 0, 3)));
    hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
    hash_2 = _mm_sub_epi64(hash_2,
                           _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(2, 1, 0, 3)));

    len -= 32;
    buf += 32;
  } while (len > 0);

  _mm_storeu_si128((__m128i *) res, hash_1);
  _mm_storeu_si128((__m128i *) (res + 16), hash_2);
}

The code should work on all Pentia 4 and Athlon 64's and newer, but my hash will nevertheless become gradually obsolete (or at least subject to further improvement) because SSE4 will provide an instruction for computing CRC32.