## Thursday, November 13, 2008

### Hashing with SSE2 Revisited, or My Hash Toolkit

Now that Bob "The Hashman" Jenkins refers to my earlier post on a hash function implemented with the help of SSE2 instructions, I feel compelled to post an update on it.

After my original post Bob and I concluded the mix step was not reversible, so I went back to the drawing board and came up with the function below. As earlier, the task of padding the data to even 16 bytes and other trivialities are left to the reader.

`/* Compile with gcc -msse2 ... */#include <stdlib.h>#include <stdint.h>#include <assert.h>#include <xmmintrin.h>static uint32_t coeffs __attribute__((aligned(16))) = {  /* Four carefully selected coefficients and interleaving zeros. */  2561893793UL, 0, 1388747947UL, 0,  3077216833UL, 0, 3427609723UL, 0,  /* 128 bits of random data. */  0x564A4447, 0xC7265595, 0xE20C241D, 0x128FA608,};#define COMBINE_AND_MIX(c_1, c_2, s_1, s_2, in)                              \  /* Phase 1: Perform four 32x32->64 bit multiplication with the             \     input block and words 1 and 3 coeffs, respectively.  This               \     effectively propagates a bit change in input to 32 more                 \     significant bit positions.  Combine into internal state by              \     subtracting the result of multiplications from the internal             \     state. */                                                               \  s_1 = _mm_sub_epi64(s_1, _mm_mul_epu32(c_1, _mm_unpackhi_epi32(in, in)));  \  s_2 = _mm_sub_epi64(s_2, _mm_mul_epu32(c_2, _mm_unpacklo_epi32(in, in)));  \                                                                             \  /* Phase 2: Perform shifts and xors to propagate the 32-bit                \     changes produced above into 64-bit (and even a little larger)           \     changes in the internal state. */                                       \  /* state ^= state >64> 29; */                                              \  s_1 = _mm_xor_si128(s_1, _mm_srli_epi64(s_1, 29));                         \  s_2 = _mm_xor_si128(s_2, _mm_srli_epi64(s_2, 29));                         \  /* state +64= state <64< 16; */                                            \  s_1 = _mm_add_epi64(s_1, _mm_slli_epi64(s_1, 16));                         \  s_2 = _mm_add_epi64(s_2, _mm_slli_epi64(s_2, 16));                         \  /* state ^= state >64> 21; */                                              \  s_1 = _mm_xor_si128(s_1, _mm_srli_epi64(s_1, 21));                         \  s_2 = _mm_xor_si128(s_2, _mm_srli_epi64(s_2, 21));                         \  /* state +64= state <128< 32; */                                           \  s_1 = _mm_add_epi64(s_1, _mm_slli_si128(s_1, 4));                          \  s_2 = _mm_add_epi64(s_2, _mm_slli_si128(s_2, 4));                          \                                                                             \  /* Phase 3: Propagate the changes among the four 64-bit words by           \     performing 64-bit subtractions and 32-bit word shuffling. */            \  s_1 = _mm_sub_epi64(s_1, s_2);                                             \  s_2 = _mm_sub_epi64(_mm_shuffle_epi32(s_2, _MM_SHUFFLE(0, 3, 2, 1)), s_1); \  s_1 = _mm_sub_epi64(_mm_shuffle_epi32(s_1, _MM_SHUFFLE(0, 1, 3, 2)), s_2); \  s_2 = _mm_sub_epi64(_mm_shuffle_epi32(s_2, _MM_SHUFFLE(2, 1, 0, 3)), s_1); \  s_1 = _mm_sub_epi64(_mm_shuffle_epi32(s_1, _MM_SHUFFLE(2, 1, 0, 3)), s_2); \                                                                             \  /* With good coefficients any one-bit flip in the input has now            \     changed all bits in the internal state with a probability               \     between 45% to 55%. */void hasshe2(const unsigned char *input_buf, size_t n_bytes,              unsigned char *output_state){  __m128i coeffs_1, coeffs_2, rnd_data, input, state_1, state_2;  assert(n_bytes % 16 == 0);  coeffs_1 = _mm_load_si128((void *) coeffs);  coeffs_2 = _mm_load_si128((void *) (coeffs + 4));  rnd_data = _mm_load_si128((void *) (coeffs + 8));  /* Initialize internal state to something random.  (Alternatively,     if hashing a chain of data, read in the previous hash result from     somewhere.) */  state_1 = state_2 = rnd_data;  while (n_bytes >= 16) {    /* Read in 16 bytes, or 128 bits, from buf.  Advance buf and       decrement n_bytes accordingly. */    input = _mm_loadu_si128((void *) input_buf);    input_buf += 16;    n_bytes -= 16;    COMBINE_AND_MIX(coeffs_1, coeffs_2, state_1, state_2, input);  }  /* Postprocessing.  Copy half of the internal state into fake input,     replace it with the constant rnd_data, and do one combine and mix     phase more. */   input = state_1;  state_1 = rnd_data;  COMBINE_AND_MIX(coeffs_1, coeffs_2, state_1, state_2, input);  _mm_storeu_si128((void *) output_state, state_1);  _mm_storeu_si128((void *) (output_state + 16), state_2);}`

Some people have asked me what my set preferred set of hash functions are. As mentioned earlier, the one-at-a-time hash fills most of my daily needs. But when I need more speed or more bits to address a larger hash table, I use either Jenkins hash functions for string-like data, or when I have fast multiplication and the hashed material is more reminiscient of picking fields in a record, I use

`uint64_t hash_32_to_64(uint32_t key, uint64_t seed){  seed ^= 2857720171ULL * key;  seed ^= seed >> 29;  seed += seed << 16;  seed ^= seed >> 21;  seed += seed << 32;  return seed;}`

Bob Jenkins and Thomas Wang have pages for similar integer to integer hash functions. And when I need high speed and cryptographical width but not cryptographic strengh, I use the hasshe2.

In summary, my kit of non-cryptographic hash functions contains roughly half a dozen snippets of code I cut, paste and modify as needed. cbloom said...

Have you got a version of hasshe2 that does rollup & rolldown for non-aligned data ?

1:09 AM cessu said...

There are two 128-bit load insns in SSE2. The one wrapped in _mm_load_si128 assumes the argument it 16-byte aligned, and it is used in loading the constants. The loop that reads the input buffer, however, uses _mm_loadu_si128, which does NOT require 16-byte alignment, so your wish is already fullfilled. However, the code I presented does still assume the input buffer is padded to full 16 byte blocks.

10:34 AM Shintaro Miyazaki said...

I would like to contact you.
This project is amazing!
I worked in a more rough, but similiar direction. I really would like to know more about the project!

http://vimeo.com/6649750
http://www.algorhythmics.net/en/?page_id=2

9:46 PM