### Hashing with SSE2

NOTE: This post has been revised.


Almost all software engineers will occasionally need a hash function, whether for hash table lookups, as a source of data-dependent pseudo-randomness, or as a checksum of some sort. The classic one I and many others use is the "one-at-a-time" hash:

```c
#include <stddef.h>
#include <stdint.h>

uint32_t one_at_a_time(const unsigned char *key, size_t len)
{
    uint32_t hash;
    size_t i;

    for (hash = i = 0; i < len; ++i) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```

It's reasonably fast and pseudo-random for most everyday use. And because it's so simple, it is trivially modified for special purposes, such as hashing nul-terminated strings in C.
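For instance, a variant for NUL-terminated strings only needs to replace the length-counted loop with a test for the terminator (the function name below is mine, chosen for illustration):

```c
#include <stdint.h>

/* One-at-a-time hash over a NUL-terminated C string; no explicit
   length argument is needed.  Same mixing steps as the original. */
static uint32_t one_at_a_time_str(const char *s)
{
    uint32_t hash = 0;

    while (*s) {
        hash += (unsigned char)*s++;
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```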

The one-at-a-time hash could nevertheless be faster, and several better hash functions have been suggested by Bob Jenkins, by Fowler, Noll, and Vo, and by Paul Hsieh. The one-at-a-time hash uses a 32-bit internal state stored in the variable `hash` and consumes eight bits of input per iteration. Hsieh's hash speeds this up by consuming 32 bits per iteration while still significantly improving the statistical properties. Jenkins' hash increases both the internal state and the per-iteration consumption to 96 bits.

But I occasionally want to use hash values as checksums. Thirty-two bits is insufficient even if I don't require cryptographic resilience against malicious attacks - use SHA-2 for those purposes. Thirty-two bits is also becoming insufficient when indexing a hash table on a large server. Jenkins does offer a 64-bit variant of his hash, but mixing a large internal state becomes slow with 32- or even 64-bit-wide registers.

So I decided to sacrifice portability and use Intel's SSE2 multimedia extension, with its good supply of big fat 128-bit-wide registers and funny SIMD instructions. The multiply-add instruction, which turns eight pairs of 16-bit integers into four 32-bit results, is particularly handy for performing a Fowler-Noll-Vo-style multiplication that propagates one-bit changes to at least 16 more significant bits. A little more work is needed to propagate those changes evenly to the other fifteen 16-bit entities, but SSE2 does have 128-bit bitwise operations, 64-bit arithmetic, 32-bit word shuffling, and funny byte-interleaving primitives. Some research, a test driver to evaluate the statistical properties of the hash, a genetic algorithm to optimize the hypermystical constants in the code, and a weekend of CPU time went into getting the details right - but hey, any hacker's goal is to save a few CPU cycles regardless of how much computing it takes!
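To see what that multiply-add instruction computes, here is a portable scalar model of the SSE2 `_mm_madd_epi16` step (the model function is my own illustration, not part of the hash):

```c
#include <stdint.h>

/* Scalar model of SSE2 pmaddwd (_mm_madd_epi16): multiply eight
   pairs of signed 16-bit integers and add adjacent products,
   yielding four signed 32-bit results. */
static void madd_epi16_model(const int16_t a[8], const int16_t b[8],
                             int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}
```

Because each product of a 16-bit input with a large odd constant spills into the high half of the 32-bit result, a one-bit change in the input disturbs many more significant bits at once.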

To spare you the semi-trivial details, I present only a simplified version of the hash, which assumes the number of bytes to hash is divisible by 32. I clocked my hash as 5-10% slower than Hsieh's, but I'm quite happy with its statistical properties: a one-bit flip in the input flips any given bit of the 256-bit internal state by the end of each iteration with a probability within 0.03 of 50%, in a test of 10^8 one-bit flips. The corresponding probability range for Hsieh's hash is from 3.2% to 25%, or, if the final avalanching is included, within 0.31 of 50%. By sacrificing some of its quality, the function below could be modified to be even faster than Hsieh's hash.
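A minimal sketch of such an avalanche test, applied here to the portable one-at-a-time hash rather than to my actual SSE2 test driver (the harness structure and names are my assumptions):

```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t one_at_a_time(const unsigned char *key, size_t len)
{
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

/* Estimate, for each output bit, how often flipping one random input
   bit flips it.  A well-avalanching hash stays near 50% everywhere.
   len must be at most 64. */
static void avalanche(size_t len, int trials, double freq[32])
{
    int counts[32] = {0};
    unsigned char buf[64];

    for (int t = 0; t < trials; t++) {
        for (size_t i = 0; i < len; i++)
            buf[i] = (unsigned char)rand();
        uint32_t h1 = one_at_a_time(buf, len);
        int bit = rand() % (int)(8 * len);         /* flip one input bit */
        buf[bit / 8] ^= (unsigned char)(1u << (bit % 8));
        uint32_t diff = h1 ^ one_at_a_time(buf, len);
        for (int b = 0; b < 32; b++)
            counts[b] += (diff >> b) & 1;
    }
    for (int b = 0; b < 32; b++)
        freq[b] = (double)counts[b] / trials;
}
```

The real driver additionally examined the full 256-bit internal state after each iteration, not just the final 32-bit result.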

```c
/* Compile with gcc -msse2 ... */
#include <emmintrin.h>          /* SSE2 intrinsics */

typedef signed short s_int16_t; /* A 16-bit signed type. */

static s_int16_t mul_add_coeffs[16] __attribute__((aligned(16))) = {
    18743, 28829, -22921, -14039, -15247, -11117, -14293, 18911,
    -23407, 17987, -11321, -25943, -32287, 10001, 30773, -12541,
};

/* Compute a 256-bit hash into res[0..31] using SSE2.  buf and res
   need not be aligned, but len must be divisible by 32. */
static void hasshe2(const unsigned char *buf, int len, unsigned char *res)
{
    __m128i madd_1, madd_2, in_1, in_2, hash_1, hash_2, tmp_1, tmp_2;

    madd_1 = _mm_load_si128((const __m128i *) mul_add_coeffs);
    madd_2 = _mm_load_si128((const __m128i *) (mul_add_coeffs + 8));
    in_1 = _mm_loadu_si128((const __m128i *) buf);
    in_2 = _mm_loadu_si128((const __m128i *) (buf + 16));
    hash_1 = _mm_madd_epi16(madd_1, in_1);
    hash_2 = _mm_madd_epi16(madd_2, in_2);
    goto rest_of_mix;

    do {
        in_1 = _mm_loadu_si128((const __m128i *) buf);
        in_2 = _mm_loadu_si128((const __m128i *) (buf + 16));
        hash_1 = _mm_xor_si128(hash_1, _mm_madd_epi16(madd_1, in_1));
        hash_2 = _mm_xor_si128(hash_2, _mm_madd_epi16(madd_2, in_2));
    rest_of_mix:
        hash_1 = _mm_add_epi64(hash_1, _mm_srli_si128(hash_1, 3));
        hash_1 = _mm_sub_epi64(hash_1,
            _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
        hash_2 = _mm_add_epi64(hash_2, _mm_srli_si128(hash_2, 3));
        hash_2 = _mm_sub_epi64(hash_2,
            _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));

        tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_2, hash_1));
        tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_1, hash_2));
        hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
        hash_1 = _mm_sub_epi64(hash_1,
            _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
        hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
        hash_2 = _mm_sub_epi64(hash_2,
            _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));

        /* Move these after the loop if good statistical properties are
           not required. */
        tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_1, hash_2));
        tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_2, hash_1));
        hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
        hash_1 = _mm_sub_epi64(hash_1,
            _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(2, 1, 0, 3)));
        hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
        hash_2 = _mm_sub_epi64(hash_2,
            _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(2, 1, 0, 3)));

        len -= 32;
        buf += 32;
    } while (len > 0);

    _mm_storeu_si128((__m128i *) res, hash_1);
    _mm_storeu_si128((__m128i *) (res + 16), hash_2);
}
```

The code should work on all Pentium 4 and Athlon 64 processors and newer, but my hash will nevertheless gradually become obsolete (or at least subject to further improvement) because SSE4 will provide an instruction for computing CRC32.
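For reference, that SSE4.2 instruction computes CRC-32C (the Castagnoli polynomial, reflected form 0x82F63B78). The hardware operation can be modeled bit by bit in portable C; this is a sketch of the algorithm, not the hardware implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Fold one byte into a CRC-32C state, one bit at a time, using the
   reflected Castagnoli polynomial 0x82F63B78.  (0u - (crc & 1)) is
   all-ones when the low bit is set and zero otherwise. */
static uint32_t crc32c_byte(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    return crc;
}

/* CRC-32C of a buffer, with the conventional initial value and
   final inversion. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
        crc = crc32c_byte(crc, buf[i]);
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for CRC-32C, the CRC of the ASCII string "123456789", is 0xE3069283, which makes a handy sanity test for any implementation.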