Cessu's blog

Self-referential Sentencing

2010-04-22T23:42:00.002+03:00

I explained a student the concept of powerlooping by constructing a program that generates self-referential sentences like

This sentence contains seventeen times the letter i, fifteen times the letter r, fourteen times the letter h, thirteen times the letter l, twelve times the letter n, eleven times the letter m, six times the letter f, five times the letter v, four times the letter o, and three times the letter u.

The student gawked for fifteen seconds while I was expecting an exclamation like "awesome" or "cool", but instead he balked with "and that is supposed to help me find a job?"

MegaHurtz for Your Ears

2009-10-15T01:22:00.002+03:00

Last week I experienced the awe of social media: a year ago I posted an article of how rendering a program's execution as sound could incidentally ne rather pleasing to the human ear, but not until last week did I receive an avalanche of comments and requests for more. It took a while to dust off all relevant programs I had used, but here you are: 190 hours of sound of programs, corresponding to approximately one second of CPU-time.

SR3C

2008-11-28T14:37:00.002+02:00

Symbol Ranking is a method for data compression where one maintains a LRU list of symbols recently seen in the given context. Those symbols are expected to be more probable the next time and therefore encoded with fewer bits than other symbols. The LRU list is usually only up to three symbols long, thus making it possible to store a context in just one 32-bit word.

Symbol Ranking has been common computing folklore for ages, Peter Fenwick gave the first modern treatment and implementation subsequently improved by Matt Mahoney for example by adding a secondary arithmetic compression stage. I added a few ideas of my own, rewrote it in C in a way easily embeddable to other programs to compress e.g. network connections and database logging.

This compression library, SR3C, hits the front where one either has to consume more memory or cycles to compress better. The compression is slightly faster to gzip -7, but results in ~11% smaller data. bzip2 -2 compresses slightly better than SR3C, but takes almost three times as long and is not on-line.

Matt Mahoney made a version whose command line interface compiles on Windows and benchmarked it as well. Both versions of SR3C are published under the MIT license.

Random Access Pseudo-Random Numbers

2008-11-28T14:22:00.002+02:00

Since my previous post I was approached by a fellow-in-keyboards in need of a "random-access" pseudo-random number generator which could generate the i'th generated random r number directly, without computing the intermediate i-1 random numbers since seeding. Would a hash function for integers be a good RAPRNG?

No! Good hash functions avoid funnels and are thus bijective, which for 32-bit hash functions would imply only a 2^32 period. Consequently, if we draw N < 2^32 consequtive random numbers, then hashing would make r could occur once or never, but not twice or more frequently. A correct random number generation would result in an exponential distribution of the frequency of r.

We can fix this by increasing the period, which is equivalent to increasing the number of output bits of the hash, but use only 32 bits of the output. Alternatively we could hash i with two independent integer hash functions and xor the result. Either way, we deliberately create just the right amount of final funnelling.

Practical tests, such as Dieharder, are surprisingly difficult. I tried xoring results of several pairs of known integer hash functions, and eventually found a reasonable solution based my own hash_32_to_64 mentioned in my previous post.

uint32_t raprng(uint64_t i, uint64_t seed)
{
  uint64_t r = (2857720171ULL * ((uint32_t) i)) ^ 0x1EF57D8A7B344E7BULL;
  r ^= r >> 29;
  r += r << 16;
  r ^= r >> 21;
  r += r >> 32;
  r = (2857720171ULL * ((uint32_t) (i ^ r))) ^ (0xD9EA571C8AF880B6ULL + seed);
  r ^= r >> 29;
  r += r << 16;
  r ^= r >> 21;
  return r + (r >> 32);
}

The hypothetical period of raprng is 2^96 if i were of sufficient width. This period is barely enough to be respectable, at least compared to some serious PRNG's like the Mersenne Twister. raprng is also slower. But it is random-accessible.

Having gone through all this I need to point out that there is a less-known PRNG called the explicit inversive congruential generator (EICG), where the i'th output is defined as modular inverse of a*i+b for some a and b, where a should obviously be non-zero. The problem herein is that modular inversion is by no means cheap, and it isn't clear to me that some other very non-linear bijection - such as a very good hash function - wouldn't do.

Hashing with SSE2 Revisited, or My Hash Toolkit

2008-11-13T10:45:00.002+02:00

Now that Bob "The Hashman" Jenkins refers to my earlier post on a hash function implemented with the help of SSE2 instructions, I feel compelled to post an update on it.

After my original post Bob and I concluded the mix step was not reversible, so I went back to the drawing board and came up with the function below. As earlier, the task of padding the data to even 16 bytes and other trivialities are left to the reader.

/* Compile with gcc -msse2 ... */

#include <stdlib.h>
#include <stdint.h>
#include <assert.h>
#include <xmmintrin.h>

static uint32_t coeffs[12] __attribute__((aligned(16))) = {
  /* Four carefully selected coefficients and interleaving zeros. */
  2561893793UL, 0, 1388747947UL, 0,
  3077216833UL, 0, 3427609723UL, 0,
  /* 128 bits of random data. */
  0x564A4447, 0xC7265595, 0xE20C241D, 0x128FA608,
};

#define COMBINE_AND_MIX(c_1, c_2, s_1, s_2, in)                              \
  /* Phase 1: Perform four 32x32->64 bit multiplication with the             \
     input block and words 1 and 3 coeffs, respectively.  This               \
     effectively propagates a bit change in input to 32 more                 \
     significant bit positions.  Combine into internal state by              \
     subtracting the result of multiplications from the internal             \
     state. */                                                               \
  s_1 = _mm_sub_epi64(s_1, _mm_mul_epu32(c_1, _mm_unpackhi_epi32(in, in)));  \
  s_2 = _mm_sub_epi64(s_2, _mm_mul_epu32(c_2, _mm_unpacklo_epi32(in, in)));  \
                                                                             \
  /* Phase 2: Perform shifts and xors to propagate the 32-bit                \
     changes produced above into 64-bit (and even a little larger)           \
     changes in the internal state. */                                       \
  /* state ^= state >64> 29; */                                              \
  s_1 = _mm_xor_si128(s_1, _mm_srli_epi64(s_1, 29));                         \
  s_2 = _mm_xor_si128(s_2, _mm_srli_epi64(s_2, 29));                         \
  /* state +64= state <64< 16; */                                            \
  s_1 = _mm_add_epi64(s_1, _mm_slli_epi64(s_1, 16));                         \
  s_2 = _mm_add_epi64(s_2, _mm_slli_epi64(s_2, 16));                         \
  /* state ^= state >64> 21; */                                              \
  s_1 = _mm_xor_si128(s_1, _mm_srli_epi64(s_1, 21));                         \
  s_2 = _mm_xor_si128(s_2, _mm_srli_epi64(s_2, 21));                         \
  /* state +64= state <128< 32; */                                           \
  s_1 = _mm_add_epi64(s_1, _mm_slli_si128(s_1, 4));                          \
  s_2 = _mm_add_epi64(s_2, _mm_slli_si128(s_2, 4));                          \
                                                                             \
  /* Phase 3: Propagate the changes among the four 64-bit words by           \
     performing 64-bit subtractions and 32-bit word shuffling. */            \
  s_1 = _mm_sub_epi64(s_1, s_2);                                             \
  s_2 = _mm_sub_epi64(_mm_shuffle_epi32(s_2, _MM_SHUFFLE(0, 3, 2, 1)), s_1); \
  s_1 = _mm_sub_epi64(_mm_shuffle_epi32(s_1, _MM_SHUFFLE(0, 1, 3, 2)), s_2); \
  s_2 = _mm_sub_epi64(_mm_shuffle_epi32(s_2, _MM_SHUFFLE(2, 1, 0, 3)), s_1); \
  s_1 = _mm_sub_epi64(_mm_shuffle_epi32(s_1, _MM_SHUFFLE(2, 1, 0, 3)), s_2); \
                                                                             \
  /* With good coefficients any one-bit flip in the input has now            \
     changed all bits in the internal state with a probability               \
     between 45% to 55%. */

void hasshe2(const unsigned char *input_buf, size_t n_bytes, 
             unsigned char *output_state)
{
  __m128i coeffs_1, coeffs_2, rnd_data, input, state_1, state_2;

  assert(n_bytes % 16 == 0);

  coeffs_1 = _mm_load_si128((void *) coeffs);
  coeffs_2 = _mm_load_si128((void *) (coeffs + 4));
  rnd_data = _mm_load_si128((void *) (coeffs + 8));

  /* Initialize internal state to something random.  (Alternatively,
     if hashing a chain of data, read in the previous hash result from
     somewhere.) */
  state_1 = state_2 = rnd_data;

  while (n_bytes >= 16) {
    /* Read in 16 bytes, or 128 bits, from buf.  Advance buf and
       decrement n_bytes accordingly. */
    input = _mm_loadu_si128((void *) input_buf);
    input_buf += 16;
    n_bytes -= 16;

    COMBINE_AND_MIX(coeffs_1, coeffs_2, state_1, state_2, input);
  }

  /* Postprocessing.  Copy half of the internal state into fake input,
     replace it with the constant rnd_data, and do one combine and mix
     phase more. */ 
  input = state_1;
  state_1 = rnd_data;
  COMBINE_AND_MIX(coeffs_1, coeffs_2, state_1, state_2, input);

  _mm_storeu_si128((void *) output_state, state_1);
  _mm_storeu_si128((void *) (output_state + 16), state_2);
}

Some people have asked me what my set preferred set of hash functions are. As mentioned earlier, the one-at-a-time hash fills most of my daily needs. But when I need more speed or more bits to address a larger hash table, I use either Jenkins hash functions for string-like data, or when I have fast multiplication and the hashed material is more reminiscient of picking fields in a record, I use

uint64_t hash_32_to_64(uint32_t key, uint64_t seed)
{
  seed ^= 2857720171ULL * key;
  seed ^= seed >> 29;
  seed += seed << 16;
  seed ^= seed >> 21;
  seed += seed << 32;
  return seed;
}

Bob Jenkins and Thomas Wang have pages for similar integer to integer hash functions. And when I need high speed and cryptographical width but not cryptographic strengh, I use the hasshe2.

In summary, my kit of non-cryptographic hash functions contains roughly half a dozen snippets of code I cut, paste and modify as needed.

Have You Listened to Your Program Today?

2008-09-23T14:34:00.003+03:00

Assume you could plug a synthesizer into the internals of your computer's CPU, what would you hear? Noise and cacophony would be an intuitive first guess, and probably correct if you would hear the sound in real time. But if we slow down the CPU to approximately 3500 operations per second, we just about might assume the CPU's computation could contain all essential ingredients of music: rhytm and repetition caused by iteration, melodies caused by various mechanisms that change a value incrementally, and harmonies caused by some mechanism of excluding values, such as memory alignment.

Since I wasn't able to open a live CPU without making it slightly disfunctional in the process, I took Valgrind to the rescue: I wrote a little object code instrumentation tool, "skin" as Valgrind authors call them, which logs into a file all values I wish to follow when executing the software with which I wish to have a Close Encounter of the Third Kind. A little more code to generate sound and movies of the values.

Here's what one would here if the bits one to seven in the result of each addition and subtraction would indicate a very short ping. All those bits being zero corresponds to the barely audible 24.5 Hz, or the sub-contra G. Roughly ten highest notes are unfortunately ultrasound, but this was nevertheless the best representation I could come up with.

They say listening daily to your spouse improves your relationship. I wonder whether the same applies to programs.

At 8am Helsinki is a 1.77-Dimensional

2007-12-04T00:00:00.000+02:00

And worse, at Sunday noon its dimensions decrease to 1.45. No wonder I occasionally feel a little confused about our geography...

Ok, maybe I need to explain a little. Assume we are given distances between random pairs of points in a hypercube of unknown number of dimensions and extent. A little calculus shows that if the coordinates of the points are uniformly distributed along each axis, the square of their distance has an expectation of a² N / 6 and variance of 7 a⁴ N / 180 where N is the number of dimensions and a the edge length of the hypercube. I hereafter make a lofty assumption that these formulas apply also to fractional dimensions, at least for some such definition.

According to YTV there are 6913 bus stops or other kind of access points for regional public transit in Helsinki and its satellite municipalities Espoo, Vantaa, Kerava, Kauniainen and Kirkkonummi combined. I wrote a script which picks two points at random and queries the optimum public transport solution between them from a web service generously provided by YTV. Due to the nature of the data provided by the web service, I define distance not by meters but by duration. More specifically, how many minutes before 8am must one enter the starting point in order to reach the end point by 8am on a working day's morning.

Based on some 20000 distance samples, the mean distance is 54.4 minutes with a whopping maximum of 275 minutes. No wonder people prefer cars in those more rural areas of the region. But from the statistics of the squares of the distances and solving the equations for a and N shows that the regional public transport system behaves like a 1.77-dimensional hypercube with edge length of 109 minutes of travelling.

On weekends public transport becomes less frequent and a larger portion of traffic goes to and from various local centers instead of offering short-cuts around them. This increases the mean travelling time by almost 10 minutes to 64 minutes when aiming to arrive on Sunday noon. In our hypercubistic terms, this decreases dimensionality to 1.45 and increases edge length to 142 minutes.

The really curious observation is the significance of high-speed rail traffic: refraining from subways and trains and using only busses, trams, ferries and feet doesn't increase the mean traveling time more than six minutes, but more importantly, it breaks off clusters reachable from others only through very long walks or infrequently running night-time busses. Consequently the variance of traveling time is larger than a homogenous cube-like space would allow. Mechanical use of the equations would in fact describe Helsinki without fast rails as a hypercube with 0.39 dimensions, i.e. fewer dimensions as a straight line!

Of course, the true reason for rails is not in fact so much in their speed (compared to an express bus) than in their capacity. And they indeed do add another dimension to our lives, or 1.38 to be more precise.

Hashing with SSE2

2007-09-10T15:17:00.001+03:00

NOTE: This post has been revised.

Almost all software engineers will occasionally need a hash function, either for hash table lookups, a source of data-dependent pseudo-randomness, or for a checksum of some sort. The classic one I and many others use is the "one-at-a-time" -hash:


u_int32_t one_at_a_time(unsigned char *key, size_t len)
{
  u_int32_t hash, i; 
  for (hash = i = 0; i < len; ++i) {
    hash += key[i];
    hash += hash << 10;
    hash ^= hash >> 6;
  }
  hash += hash << 3;
  hash ^= hash >> 11;
  hash += hash << 15;
  return hash;
}

It's reasonably fast and pseudo-random for most everyday use. And because it's so simple, it is trivially modified for special purposes, such as hashing nul-terminated strings in C.

The one-at-a-time hash could nevertheless be faster, and several better hash functions have been suggested by Bob Jenkins, Fowler, Noll and Vo, and Paul Hsieh. The one-at-a-time hash uses a 32-bit internal state stored in the variable hash and consumes eight bits of input per iteration. Hsieh's hash speeds this up consuming 32-bits per iteration while still significantly improving the hash function's statistical properties. Jenkins' hash increases both the internal state and consumption to 96 bits.

But I occasionally want to use hash values as checksums. Thirty-two bits is insufficient even if I don't require cryptographic resilience against malicious attacks - use SHA-2 for those purposes. Thirty-two bits is also becoming insufficient when indexing a hash table on a large server. Jenkins does offer a 64-bit variant of his hash, but mixing a large internal state becomes slow with 32 or even 64-bit wide registers.

So I decided to sacrifice portability and use the Intel's multimedia extension SSE2 with a good supply of big fat 128 bits wide registers and funny instructions for SIMD operations. The multiply-add instruction on eight 16-bit integer pairs to four 32-bit integer results is particularly handy for performing Fowler-Noll-Vo -style multiplication to propagate one-bit changes to at least 16 more significant bits. A little more work is needed to propagate those changes evenly to the other fifteen 16-bit
entities, but SSE does have 128-bit bitwise operations, 64-bit arithmetic, 32-bit word shuffling and funny byte interleaving primitives (exact and more approachable documentation). Some research, a test driver to evaluate the statistical properties of the hash, a genetic algorithm to optimize the hypermystical constants in the code, and a weekend of CPU-time went into getting the details right - but hey, any hacker's goal is to save a few CPU-cycles regardless of how much computing it takes!

In order to save you from semi-trivial details I present only with a simplified version of the hash which assumes the number of bytes to hash is divisible by 32. I clocked my hash as 5-10% slower than Hsieh's, but I'm quite happy with its statistical properties: a one-bit flip in the input flips any bit in the 256-bit internal state by the end of each iteration with a probability within 0.03 of 50% in a test of 10^8 one-bit flips. The corresponding probability range for Hsieh's hash is from 3.2% to 25%, or if the final avalanching included within 0.31 of 50%. By sacrificing some of the quality of the hash the function below could be modified to be even faster than Hsieh's hash.


/* Compile with gcc -msse2 ... */

#include <xmmintrin.h>

typedef signed short s_int16_t; /* A 16-bit signed type. */

static s_int16_t mul_add_coeffs[16] __attribute__((aligned(16))) = {
   18743,  28829, -22921, -14039, -15247, -11117, -14293,  18911,
  -23407,  17987, -11321, -25943, -32287,  10001,  30773, -12541,
};

/* Compute a 256-bit hash into res[0..31] using SSE2.  buf and res
   need not be aligned, but len must be divisible by 32. */
static void hasshe2(const unsigned char *buf, int len, unsigned char *res)
{
  __m128i madd_1, madd_2, in_1, in_2, hash_1, hash_2, tmp_1, tmp_2;

  madd_1 = _mm_load_si128((void *) mul_add_coeffs);
  madd_2 = _mm_load_si128((void *) (mul_add_coeffs + 8));

  in_1 = _mm_loadu_si128((void *) buf);
  in_2 = _mm_loadu_si128((void *) (buf + 16));

  hash_1 = _mm_madd_epi16(madd_1, in_1);
  hash_2 = _mm_madd_epi16(madd_2, in_2);
  goto rest_of_mix;

  do {
    in_1 = _mm_loadu_si128((void *) buf);
    in_2 = _mm_loadu_si128((void *) (buf + 16));

    hash_1 = _mm_xor_si128(hash_1, _mm_madd_epi16(madd_1, in_1));
    hash_2 = _mm_xor_si128(hash_2, _mm_madd_epi16(madd_2, in_2));

  rest_of_mix:
    hash_1 = _mm_add_epi64(hash_1, _mm_srli_si128(hash_1, 3));
    hash_1 = _mm_sub_epi64(hash_1, 
      _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
    hash_2 = _mm_add_epi64(hash_2, _mm_srli_si128(hash_2, 3));
    hash_2 = _mm_sub_epi64(hash_2,
      _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));
    
    tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_2, hash_1));
    tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_1, hash_2));
    hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
    hash_1 = _mm_sub_epi64(hash_1, 
      _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(0, 1, 2, 3)));
    hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
    hash_2 = _mm_sub_epi64(hash_2,
      _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(0, 1, 2, 3)));
    
    /* Move these after the loop if good statistical properties not
       required. */
    tmp_1 = _mm_xor_si128(in_1, _mm_unpackhi_epi8(hash_1, hash_2));
    tmp_2 = _mm_xor_si128(in_2, _mm_unpacklo_epi8(hash_2, hash_1));
    hash_1 = _mm_xor_si128(tmp_2, _mm_madd_epi16(madd_1, tmp_1));
    hash_1 = _mm_sub_epi64(hash_1, 
      _mm_shuffle_epi32(hash_1, _MM_SHUFFLE(2, 1, 0, 3)));
    hash_2 = _mm_xor_si128(tmp_1, _mm_madd_epi16(madd_2, tmp_2));
    hash_2 = _mm_sub_epi64(hash_2,
      _mm_shuffle_epi32(hash_2, _MM_SHUFFLE(2, 1, 0, 3)));
    
    len -= 32;
    buf += 32;
  } while (len > 0);

  _mm_storeu_si128((void *) res, hash_1);
  _mm_storeu_si128((void *) (res + 16), hash_2);
}

The code should work on all Pentia 4 and Athlon 64's and newer, but my hash will nevertheless become gradually obsolete (or at least subject to further improvement) because SSE4 will provide an instruction for computing CRC32.

Was Noah's Ark a Sperm Bank?

2007-06-20T13:43:00.000+03:00

"And God said unto Noah, [..] make thee an ark [..] the length of the ark shall be three hundred cubits, the breadth of it fifty cubits, and the height of it thirty cubits." (Gen. 6:13-15)

A cubit is the distance between an adult's elbow and tip of the finger, approximately 45 cm. The ark was thus 135 meters long, 22.5 meters wide, and 13.5 meters high. These are presumably outer measures, and hence the habitable volume of the ark could not have been much more than 35000 cubic meters.

Even the most conservative estimates state that there are at least two million species of animals alive today, and surely a comparable number of plants, fungi, etc. A large fraction of species live in water, but could not survive if their ecosystem would suddenly submerge a mile deep - according to the bible even high mountains were submerged. There are approximations that the number of species of a given body length L is proportional to L raised to some power between -1.5 and -3. For example, if we use the frequent choice of -2 and we knew there is one 10-meter long species, then there are 100 one-meter species, ten thousand 10-centimeter species, etc. If we limit the species size to a tenth of a millimeter (for example, assuming evolution (!) will sufficiently rapidly reproduce the smaller species), approximate that species are half as wide and high as they are long, and integrate the product of volume and number of species over the species length, we obtain a total volume of approximately 125000 cubic meters.

We have already exceeded the available volume in the ark by a factor of 3.5, and we are yet to cram in the other sex and enough food for animals and soil for plants for the 150 days the bible says Noah sailed. Remember also that God required Noah to provide accomodation for his extended family, and in order to avoid extinction by inbreeding, he would have had to take tens if not closer to a hundred individuals of all other species as well. All this is a little too impressive for a six hundred year old man.

Unless Noah was in fact history's first documented case of sperm banking. If a small forest of Giant Sequoias were a challenge to fit in the ark, a sachel of fist-sized cones or a teaspoon of seeds would not. A rough calculation shows that all necessary genetic material could easily be fitted in a negligible part of the ark leaving the rest free for all the technology needed for recolonization, deep-freezing sperm, artificial wombs and all the other technology necessary to recreate mammals without living parents. That's truly impressive technology, but far more plausible than the alternative. And hey, the guy was six hundred years old - that's plenty of time to write any number of dissertations on all required fields, and a good reason to become a drunkard afterwards, like Noah did.

In fact I predict that this will be the method that humans will use should we some day colonize other stars systems. We will send unmanned ships ahead, and once they reach their destination we submit by radio (or laser) the genetic information, culture, and without a doubt a large number of software updates to the distant ships.

But there's still something fishy about Noah and the ark. Noah had sent out pigeons to search for land. When the waters abated and the ark hit mount Ararat one pigeon came back with an olive leaf. However, here lies a contradiction: olive doesn't grow high on Ararat and all ground-level olives must have rotten by then.

My Alma Mater

2007-06-20T09:18:00.000+03:00

I graduated and now work in the Helsinki University of Technology (HUT). Since 2003 Institute of Higher Education at Shanghai Jiao Tong University has maintaned a yearly ranking of the 500 best universities in the world. We were ranked as 371st, 349th, 446th, and last year 447th in an unfortunate declining trend.

Surely this must only be a coincidence with the rampant managerialism we have experienced at my Alma Mater. In a few years we've been blessed with several administrative procedures for determining salaries, counting working hours, managing travels, and accepting bills, just to name a few. The least common denominator of all these systems is that they are rigid, vaguely documented, self-contradictory, their IT implementations suck big time, and they cause a huge net increase in management work of really annoyed professors and other staff who really should concentrate on teaching and research instead.

No wonder the Shanghai list ranks us low because our own salary policies rank our degrees even lower: our M.Sc.'s teaching courses, working in or leading a research project can be paid from 50-75% of the salary paid to a person in a non-teaching position with the lowest job grade requiring a corresponding degree, typically a non-science degree from another university. For example, at default performance Linus Torvald's salary could be at most 2178 euros per month, or roughly $40000 a year, whereas if he (or some bureaucrat) neither taught nor researched he would be paid at least 2931 euros per month. No surprize that the name of this new salary system, UPJ, is widely used as a profanity among current teachers and anybody trying to recruit competent personnel to a (practically oriented) research project.

And in accordance to the Parkinson's Law the administration is growing rapidly. In addition to increasingly filling the days of teachers and researchers, administration is a major recruiter of new people. I listed every open position at HUT closing on the second half of 2006 or the first half of 2007 as they appeared on the official recruitment page. I ignored part-time positions, graduate schools and stipendiates and counted only "serious" jobs. Considering the main function of our university it was surprizing to see that barely half, 83, of the 160 positions were primarily for teaching and research. Furthermore, 62% of the 77 non-teaching positions were tenure positions, which is significantly higher than the 25% of HUT overall.

I also wish to bring into our memories the intelligent design conference that was held in the Helsinki University of Technology on the 22nd of October 2004. Our ranking on the Shanghai list dropped by nearly 100 positions in the next update.

But maybe there's hope. There are still able and hard-working people here, and in a few years HUT will be merged with two other universities and funded by a foundation with a significant industrial contribution. Some see that as a threat to academic freedom, but I'm hopeful that the board of the foundation may have the courage to tackle the managerialism and let the able and hard-working people focus on the real work: teaching and research.

Skepticism, and Bullshit in Bullshit by Penn and Teller

2007-03-18T03:04:00.000+02:00

I'm a member of the Skepsis, the Finnish equivalent of CSICOP. For the last few weeks a new TV channel has aired (or should I call "cabled", or "coppered") the show Bullshit! by Penn and Teller. Penn's harsh language and Teller's eloquent zero-liners didn't set me back, but two other issues did. The first is the abundance of ad hominem attacks which in my ears turn against the attacker even if the victim would indeed be a swindling crackpot. It won't take much to make a spokesperson for an organization (let alone a teen-aged protester) look ignorant, but the same would apply for any organization.

The second, and far worse problem with the show is that some of Penn's bullshitting is in fact bullshit itself. So far I've only seen roughly ten episodes, but the episode where they claim there was no scientific evidence that showed second hand smoking was unhealty was simply bullshit. According to the Robert Todd Carroll, the author of the Skeptic's Dictionary, he as well as Penn and Teller were lead to trust the standards of risk assessment as promoted by the tobacco industry (led by Philip Morris) and their Republican generals like Jim Tozzi.

Also the episode on the environmental movement had some strong elements of bullshit. Surely the movement attracts alarmists and many of the movement's claims are bullshit, but the general concern of global warming and its consequences among academics is genuine rather than a tool for another round of research funding. Carroll states that this episode "is based on these same questionable standards pushed by Republican leaders for their corporate donors whose main interest is the deregulation of industries and products rather than public safety or health. This approach fits well with P&T's libertarian philosophy but it is essentially dishonest and does nothing to promote the view of skepticism as healthy critical thinking."

In fact this abuse of pseudo-skepticism as an opinion altering tool for industrial deregulation is what I consider the greatest threat to Skepsis and CSICOP. The crackpots with their occasional legal threats are just bullshit.

Also the episode on diets threw more bullshit than necessary. Of course there's an abundance of bullshit in the diet industry, but it seems the late Dr. Atkins did have a positive impact on the science of obesity research after all. Despite being used to treat some forms of epilepsy since 1920's has no-one ever shown his diet to be unhealty. Furthermore, numerous recent studies have shown the effect on both weight as well as blood lipids are generally better to the "prudent" generally recommended diets.

But I can't deny the entertainment value of the show even if the background research seems scarser than Teller's lines. Maybe some day they'll air a self-critiquing episode - that would likely be the most interesting episode of them all. Until then, my dear readers, you should regard Penn and Teller's Bullshit! as manure or politics rather than true skepticism or good journalism.

Estimated and Actual Duration of Projects

2007-01-31T00:32:00.000+02:00

Software engineers are often accused of missing deadlines and grossly under-estimating the duration of projects. But how about warfare? Let's take some examples:

When the first world war broke, German military leaders estimated it would last approximately six months. But over three years and three months later Kaiser Wilhelm fled the country and Germany capitulated.

When the largest land war operation in history, operation Barbarossa, started on June 22, 1941, German leaders promised the troops would be home by christmas. Three years and ten months later Hitler committed suicide in his underground bunker.

(Actually, some historians claim that the first world war wasn't concluded until the second world war ended - there was merely a 21-year armistice.)

The Vietnam war ended exactly three decades after Hitler's suicide. But six years and three months earlier Nixon had promised congressman Donald Riegle (R-MI) that the Vietnam war would end six months after Nixon assumed office. After all, Nixon had a "secret plan", just like George W. Bush has today.

And speaking of Bush Junior, it's now three years and eight months since he declared the "mission was accomplished" and "major combat ended". Since then, the guerillas have fulfilled Bush's request to "bring'em on", resulting in over twenty-folding the U.S. body count.

So, over-optimistic programmers are just like politicians and military leaders: given their estimate of how long the project will take, if they haven't succeeded in twice that time they will fail and at least six longer until they realize it themselves.

(To be fair, there have been also realistic estimates in both venues. Churchill, for example, never promised a quick and pleasant end to the second world war.)

Better RAIDs

2006-12-28T23:25:00.000+02:00

Regardless of what the hard disk vendors claim to be the mean time before failure, based on roughly twenty of my most recent disks from several different vendors I'ld claim a more realistic half-life of a normal off-the-shelf PC hard disks is at most four years, or approximately 35000 hours. Assuming there's one million hard disks in Finland, on average one hard disk dies while I engulf a fluffy Big Mac. Sorry, I'll try to cut down on junk food.

I personally replicate my essential data with unison on three hard disks on separate locations, but this occasionally leads to annoying conflicts and even data losses when I forget to promptly synchronize the replacas. And for bulk data, such as scanned images, digital camera pictures and videos, triplicating data becomes unnecessarily expensive.

Engineers have found smarter ways to improve the reliability of disk systems by organizing them into redundant arrays of independent disks, RAIDs. A typical RAID-4 might consist of four disks which contain the data and a fifth disk containing the parity (bitwise exclusive-or) of the other disks. If any of the disks crash, it can be replaced and its data can be reconstructed from the other disks. In RAID-5 the role of the parity disk is distributed among all the disks in order to avoid performance bottlenecks in updating the parity information.

According to theory, when one of the disks die, you hot replace it from a reserve of pristine disks, and maybe in an hour you've reached the original reliability. So, if individual disks would be replaced by RAIDs of four data and one parity disk, the MTBF would be increased from 35 thousand to over 60 million hours. Or in other words, the number of permanent data losses in Finland would be reduced to below one every two days. Sound comforting, but in practice any less than a professional sysadmin might need days or until the next paycheck to buy a new disk, and the MTBF reduces in proportion to how much longer than one hour the replacement drags.

Suppose we had more than five disks, could we do better? That's the question I felt compelled to study (motivated also by a possible future hobby project). A toy example demonstrates a positive answer: ten disks set up as two independent RAID-5s can recover only from those two-disk crashes where the two crashed disks reside in different sub-RAIDs. There's only a 50% chance of that. But if we use one of the data disks in both parity disks, then we reach a 73.3% chance of recovering from a two-disk crash.

Guaranteeing recovery from two simultaneous disk crashes, in other words reaching RAID level 6, is actually harder than one might think. Given one data disk, simple mirroring on two additional disks would do. Up to four data disks, three parity disks is sufficient. The parities could, for example, be computed like this:

  x0 = d0 ^ d1 ^ d3
  x1 = d0 ^ d2 ^ d3 = d1 ^ d2 ^ x0
  x2 = d0 ^ d1 ^ d2 = d1 ^ d3 ^ x1

The equations are laid out so that no two disks occur in edgemost expressions in all equations. I prefer this formulation because it also states the recovery algorithm by repeated solving for only one unknown and substitution to other equations.

Four parity disks is sufficient to recover any two-disk crash in a system of eleven data disks.

  x0 = d0 ^ d1 ^ d2 ^ d4 ^ d6 ^ d7 ^ d8
  x1 = d0 ^ d4 ^ d5 ^ d8 ^ d9 ^ d10 ^ x0
  x2 = d0 ^ d2 ^ d3 ^ d6 ^ d9 ^ x0 ^ x1
  x3 = d0 ^ d1 ^ d3 ^ d5 ^ d6 ^ d8 ^ d9

This scheme can also recover a three-disk crash with a 84.84% probability and a four-disk crash with a 39.78% probability.

Five parity disks is sufficient to reover any two-disk crash in a system of twenty disks.

  x0 = d0 ^ d1 ^ d5 ^ d6 ^ d8 ^ d10 ^ d12 ^ d13 ^ d18 ^ d19
  x1 = d1 ^ d2 ^ d3 ^ d6 ^ d7 ^ d11 ^ d12 ^ d13 ^ d14 ^ d16
  x2 = d0 ^ d2 ^ d4 ^ d6 ^ d8 ^ d9 ^ d13 ^ d14 ^ d15 ^ d19 ^ x1
  x3 = d5 ^ d7 ^ d9 ^ d12 ^ d14 ^ d16 ^ d17 ^ d18 ^ d19 ^ x2
  x4 = d0 ^ d1 ^ d2 ^ d3 ^ d4 ^ d5 ^ d6 ^ d7 ^ d10 ^ d12 ^ d19 ^ x2 ^ x3

A three-disk crash is recovered with a 93.22% and a four-disk crash with a 66.57% probability. Notice that this scheme reaches the same storage overhead as the typical four data and one parity disk RAID-5, but with a hugely improved reliability - less than one irrecoverable crash every five years in one million such installments - of course making the lofty assumptions the disk losses are independent and not caused by an EMP or a virus.

For those who need truly high reliability, nested RAIDs are the way to go. Here one builds a RAID-5 of other RAID-5 elements instead of individual disks. For example, the twenty data disks are placed in four RAID-5s, and a fifth RAID-5 (of five data and one parity disk) is allocated to serve as a parity for those sub-RAIDs. Yes, it is reliable: it can handle all three-disk crashes and 99.45% of all four disk crashes. This happens at the expense of disk space: for twenty disks containing user data such a nested RAID consumes ten disks for redundancy.

But also this nested RAID can be improved. The set of parity equations below almost doubles the reliability, in particular now 99.72% of four-disk crashes can be recovered.

  x0 = d0 ^ d1 ^ d3 ^ d6 ^ d11 ^ d15 ^ d18 ^ d19
  x1 = d7 ^ d8 ^ d9 ^ d13 ^ d18 ^ x0
  x2 = d2 ^ d6 ^ d9 ^ d11 ^ d14 ^ d15 ^ d16 ^ d17
  x3 = d0 ^ d3 ^ d4 ^ d7 ^ d12 ^ d15 ^ d17 ^ d18
  x4 = d1 ^ d3 ^ d8 ^ d10 ^ d15 ^ d16 ^ d17 ^ d18 ^ x1
  x5 = d0 ^ d5 ^ d6 ^ d12 ^ d13 ^ d14 ^ d15 ^ d16 ^ d19 ^ x0
  x6 = d6 ^ d7 ^ d9 ^ d10 ^ d15 ^ x2
  x7 = d3 ^ d4 ^ d5 ^ d8 ^ d9 ^ d18 ^ x6
  x8 = d0 ^ d3 ^ d4 ^ d7 ^ d13 ^ d16 ^ d17 ^ x2 ^ x5
  x9 = d4 ^ d6 ^ d7 ^ d12 ^ d18 ^ d19 ^ x5 ^ x7

Alternatively, one could add three data disks and still have a higher reliability than with the smaller nested RAID scheme:

  x0 = d2 ^ d4 ^ d5 ^ d9 ^ d10 ^ d16 ^ d18 ^ d20 ^ d21 ^ d22
  x1 = d0 ^ d1 ^ d2 ^ d3 ^ d5 ^ d6 ^ d9 ^ d10 ^ d13 ^ d17 ^ d20
  x2 = d1 ^ d3 ^ d5 ^ d7 ^ d8 ^ d15 ^ d17 ^ x0
  x3 = d0 ^ d10 ^ d11 ^ d13 ^ d15 ^ d16 ^ d17 ^ d19 ^ x0
  x4 = d0 ^ d2 ^ d5 ^ d8 ^ d11 ^ d12 ^ d16 ^ d21 ^ d22
  x5 = d0 ^ d1 ^ d2 ^ d6 ^ d7 ^ d8 ^ d12 ^ d14 ^ d18 ^ d19 ^ d21 ^ x0
  x6 = d4 ^ d5 ^ d6 ^ d7 ^ d12 ^ d17 ^ d20 ^ x2 ^ x5
  x7 = d0 ^ d3 ^ d4 ^ d5 ^ d6 ^ d8 ^ d9 ^ d11 ^ d13 ^ d18 ^ d20 ^ d22 ^ x6
  x8 = d3 ^ d8 ^ d9 ^ d14 ^ d16 ^ d18 ^ d20 ^ d22 ^ x0 ^ x4 ^ x6
  x9 = d0 ^ d1 ^ d16 ^ d17 ^ d20 ^ d21 ^ x5

To generalize, scientists have developed numerous other kinds of Forward Error Correcting codes. The Reed-Solomon is probably the most widely known with uses ranging from CDs and xDSL-lines to communicating with space probes. Reed-Solomon becomes computationally expensive for a single large disk block, but we could employ it "horizontally" so that the i'th s-bits of each disk block form a message which is then appended with Reed-Solomon parity information. Because all errors will be erasures in our case, we can expect to recover from as many disk crashes as we have redundant disks. Despite this advantage, recomputation costs of an incremental update, overhead caused by shortening or rounding s upwards, and perhaps also complexity has kept Reed-Solomon codes out of RAIDs. The only work I know of is this.

What I've done in this blog entry is essentially equivalent to the basic ideas of LDPC-codes invented by Gallager in the early sixties, but with the exception that I employ a genetic algorithm to find the excellent encodings. Current research focus seems to be in developing codes for media delivery where the number of blocks can be tens or hundreds of thousands, and consequently the choice of the equations can be random. Unfortunately the domain is patent-ridden, but the LDGM approach seems to be free.

Oh, my hobby project? I won't reveal that yet as I haven't really decided whether I'll pursue it. But suffice it to say, that if all Finns would share 1% of their disk space for parity information with each other and recovering a disk would take one hour, we would have to wait over 10^750000 years for the next loss of data.

(Unfortunately all the assumptions in that calculation can't be taken into practice without significantly reducing the glamour of the result.)

Speed Limits and Economy

2006-11-23T02:04:00.000+02:00

I recently participated in a debate about speed limits. Even if none of the participants were really heavy-duty-speeders, most of them nevertheless strongly opposed to current rather low speed limits (30 or 40 km/h on small streets, 50 on main streets in Helsinki), arguing, for example, it would hamper the national economy. Some debaters even argued that people should be allowed to drive their cars at any speed the driver consider safe. To me, a father of a child with whom I walk between home and day care over seven zebra crossings twice every day, this simply disgusted me.

So, let's make some hyper-optimistic assumptions: a driver drives, say, three kilometers every day in an area where speed limits have been reduced from 40 to 30 km/h, how much will it cost the state? Nothing because most likely this wouldn't increase salaries and consequently tax revenues, but theoretically it could cost the driver up to 1.5 minutes every day, or assuming we count only working days, some 330 minutes during the year. That's five and a half hours of taxable work time. Assuming a relatively well-paid (4000 e/month) driver that would be below 150 euros per year, or below 15 million euros for the approximately 100 000 cars suffering from the reduced speed limit.

Of course that's only theory. In practice driving speed is usually limited by other cars, traffic lights, and taking corners. In fact statistics show that average driving speed dropped only by 1.5 km/h, cutting the cost to yearly drivers to only two million euros.

Ok, even two million euros is money, and spending it has to be justified. So let's look at the issue from the victim's point.

If a car hits a person, the probability of death is 5%, 15% or 40% for speeds 30, 40, and 50 km/h, respectively. The severity of non-fatal injuries and likelihood of a hit is also reduced.

What then is the cost of a death to the society? If the victim is a 30-year old adult, the state loses all its investments in his education and future tax revenues. A typical university degree costs the state approximately a quarter of a million euros, the basic education roughly equally much, and missed taxes 20000 euros per year until the victims retirement. So far the loss is one million euros, and we haven't yet accounted for the economic support for the grieving family. Even if statistics are still scarce, the reduction in driving speeds did have the desired effect: in 1999 to 2003 on average seven pedestrians were killed each year, but in 2005 only two.

The five million euro saving in the reduced deaths already counters for the drivers' lost leasure time, but we haven't yet accounted for non-fatal injuries. Unfortunately I don't have good data on them, but let it be noted that during 1999-2003 approximately 150 non-fatal injuries from car-pedestrian accidents were reported each year. A large fraction of them were probably tissue damages and bone fractures that maybe required only a few surgeries and some months' time to recover, but some of them were also tetraplegics whose life-time medical bills to the society easily reach a million - in addition to the one million in their abrupted careers.

In this treatment I've deliberately concentrated on the economic justification of current reduced speed limits, and that is quite strong. Injuries will cause physical pain, death and even moreso paralysis will cause emotional losses I can't formulate in numbers. I'll only say that it definitely is not the drivers' prerogative to define the level of risk his potential victims should to accept.

We, and the society we live in, pay heavily for speed. And I haven't yet treated all aspects of speed.

Anti-Fibonacci Sequences and Rings of Saturn

2006-11-16T12:26:00.000+02:00

Assume a process which repeatedly marks all of positive integers which belong to a Fibonacci sequence which start with the lowest unmarked x and x+c. For example, if c is one, the first Fibonacci sequence marks 1, 2, 3, 5, 8, 13, 21, etc. The process would start the second Fibonacci sequence at 6 and 7, marking thereafter 13, 20, 33, ..., the third one would mark 9, 10, 19, 29, etc. The unmarked numbers 4, 18, 22, 28, 32, 47, 54, 72, etc. are then what I call the anti-Fibonacci numbers for the given c = 1. For c = 2 the anti-Fibonacci numbers would be 5, 8, 9, 24, 34, 38, 45, 50, etc.

One might assume that anti-Fibonacci number generation is a sufficiently complicated process to be essentially pseudo-random. My back-of-the-envelope calculation went as follows: Let p(i) be the probability that i is anti-Fibonacci. A single existing Fibonacci sequence grows sparser with a rate of phi/i where phi is the base of the golden ratio (sqrt(5) + 1) / 2 with which the Fibonacci values grow, but with a probability of p(i)^2 the i and i+c starts a new Fibonacci sequence and p(i) gets a nudge of 2/i upwards. Putting these in equilibrium and solving for p gives sqrt(sqrt(5) + 1) / 2, or roughly 0.899. I knew I had made some lofty approximations in the derivation above, especially in assuming that the density of multiple Fibonacci sequences would decrease as rapidly as that of one sequence, so I wrote a little program that computes some 100 million first anti-Fibonacci numbers. I was somewhat surprized to see that p is only approximately 0.856 for c = 1. For c = 2 the program gave 0.872, which as again surprizing because I didn't expect c to affect p. But when my computer spat out what looks like interference patterns my jaw really dropped.

Having seen the non-randomness of the average of p over all i and observed how it depends on c, a change in coordinates lead me to wonder if conversely the average of p over c's depends on i. The answer is yes. Below I have a sample of p(i) estimated by c up to ten thousand.

The moral of the story? Whenever playing with number theory, staple your jaw so it won't keep falling off constantly.

Price of a Parking Place

2006-11-02T17:24:00.000+02:00

I recently overheard a powerful curse of how costly parking is in the center of Helsinki. At its highest normal street parking costs three euros per hour, but evenings, nights and weekends are free. So, should a parking place be in 100% utilization, and should everybody really pay their parking, it would give a revenue of a little above 700 euros per month, or 50 to 60 euros per square meter per month. This sounds high in comparison to the average appartment rents, typically well over 15 euros per square meter per month, in the same area. But appartments are typically stacked six to nine floors on top of each other, so consequently one can gather up to twice the amount in rent than in street parking for each unit of ground area. The cost of heating, maintenance and repairs is perhaps a fifth of the rent. The cost of construction is harder to estimate, but in the long run - we live in an eighty year old building - it too becomes relatively small. So, in summary, renting a parking place doesn't really seem costlier than renting an appartment.

Bad Hearing

2006-11-02T17:22:00.000+02:00

A student with a heavy flu asked what a monad is. Hampered by my hearing disability and absentmindedness of the context of the discussion, I politely answered "the glands that produce reproductive cells". Yes, humiliation is good for the soul, especially if it is followed by a mutual vigorous respiratory exercise.

Similarity of Texts

2006-10-05T21:28:00.000+03:00

I know of some computational text comparison attempts, such as using word length histograms to reveal the real author of a pseudonym, and some Self-Organizing Map attempts to cluster for example word histograms of patent applications.

Could these attempts be improved upon? Arguably text compression poses and compressors have to solve a very strongly text-characterizing problem: predicting the next alphabet. So, I decided to try compression in quantifying the similarity of two natural languages. The working hypothesis is that mixed texts written in two similar languages should compress better than texts mixed of two very different languages. I gathered a number of bible versions and translations, mixed them pairwise verse by verse, compressed the results with one of the best dictionary-free compression programs I could find, compared the compression ratios to unmixed bible versions, and visualized the result on a plane. I've had to adjust some points ever so slightly in order to avoid the explanatory texts being placed on top of each other, but otherwise this is the verbatim output of the program. Clusters of similar languages or different versions of the same language are quite evident.

The limitations of only two dimensions is quite evident - some languages such as latin have influenced and would compress well when mixed with a number of other later languages even if those languages are very different. Perhaps adding a third dimension, or some kind of wormholes or teleports would help. SOM representation avoid this problem by having multiple nodes (or disconnected areas) represent the same item. I leave this as a later research topic, though.

Another limitation not evident in the picture above is that mixing and compressing can only be applied to languages that use similar alphabets. Otherwise typical context modelling techniques in compressors will separate the two languages essentially into two separate input streams.

In this example I mixed and compressed two versions or translations of the same text, but what I would really like to do is to mix and compress texts written in the same language but by different authors. Suggestions for good corpora are welcome.

Oatmeal vs Eggs, or the Volumetric Myth

2006-08-27T23:32:00.000+03:00

Eager proponents of oatmeal argue that eating it for breakfast keeps the hunger away well into afternoon. I have always been skeptical about this, because the same effect could also be caused by one's habitual daily eating rhythm - if you choose to eat a late lunch for a few days, then the body (brains, that is) adapts to this and you won't get hungry until afternoon regardless of what you ate in the morning. Typical food satiation tests will not discern this, as they mostly follow how much a person eats before feeling satiated, not so much how long he remains satiated or what their eating rhythm is.

So I decided to do what Morgan Spurlock tought us: eat nothing but oatmeal for thirty days. Well, in my case I settled for three days. Except for allowing my caffeine-addicted metabolism two hefty cups of coffee with a little cream each day, my diet will consist of water and a portion of oatmeal made in water whenever I feel hungry. Not even butter, sugar, berries, jam, or anything of caloric value in the oatmeal. I allow only sugar-free spices, chewing gum, and vitamin pills to compensate for the one-sided diet.

Day 1: I'm not accustomed to having anything but coffee for breakfast, so I don't get hungry until 9 am. Betaglucan, here I come! After a plateful of microware-prepared porridge I feel satiated and pieceful, but suddenly at 10 am I'm hungry again. A lingering hunger disturbs my work and makes me take subsequent portions at 11 am, 1 pm, 2:30 pm, 5 pm, 8 pm, 8:30 pm, 10:30 pm, and 12 pm. It is really annoying to merely burp after dining and feel ready to eat again. Fortunately I have learned to spice up the oatmeal with for example sugarfree juice concentrates.

Day 2: Portions at 9:30 am, 10:30 am, 1 pm, 3 pm, 5:30 pm, 8 pm, 9:30 pm, 12 pm. I have tried various ground spices, such as peppers, cinnamon, and bitter orange. I also feel a little constipated. This is a little surprizing considering the amount of oh-so-healthy fibers I'm eating. Perhaps constipation is more a function of a change in diet instead of the diet itself.

Day 3: The constipation eases. I shall save you descriptions of my stool, let it only be noted that I felt very bureaucratic as the paperwork lasted much longer than production. Portions at 9 am, 11 am, 12:30 pm, 3:30 pm, 5 pm, 6:30 pm, 8 pm, 11 pm, and after a late night hack, at 2 am. I fall asleep counting mutton steaks.

Instead of eating properly and become satisfied for a good while, eating has turned into constant slow nibbling, just like one would eat chips or pop-corn in a movie. Even if I had set as a standard that I would never feel hungry, I did so on several occasions during the experiment. Of course I could have willed the hunger away, but that would have defeated the purpose of the experiment. After an average of nine portions of oatmeal per day, I can safely conclude that the myth that oatmeal keeps hunger away is definitely busted.

But for comparison I felt compelled to repeat the experiment with eggs. Two eggs contains almost the same amount of energy as a portion of oatmeal, but in a rich combination of proteins, fats, vitamins, and other micronutrients. The high cholesterol content of the egg yolk does not worry me - contrary to some nutritional urban myths, science has long known that dietary intake of cholesterol has little if any effect on the total blood cholesterol level. After a purge period of four days the second part of my experiment begins.

Day 1: One cooked egg at 9:30 am, two at 11 am, one at 1 pm, two at 3 pm, three at 5:30 pm, four at 8 pm. Felt very satiated. At 11:30 pm I poached two eggs with vinegar, spices, and a little artificial sweetener to give a slight sweet-and-sour taste - quite delicious!

Day 2: Three cooked eggs at 11 am, two at 3 pm, two at 5:30 pm, two at 8 pm, and two at 11:30 pm. I notice my susceptibility of hunger is declining, probably because my body is moving to ketosis where at least I begin to lose appetite.

Day 3: Three cooked eggs at 10:30 am and two at 2 pm. A sequence of meetings prevented to prompty quench my hunger until devouring four eggs at 6:30 pm. At midnight I concluded the day with two eggs deviled with hot spices and a teaspoon of cream fraiche. Ok, the cream fraiche was cheating, but insignificant in the sense of total energy value.

All in all I ate 12 and a third egg per day, and a nod of cream fraiche. Compared to the nine portions of oatmeal per day, I ate only 70% of the calories on the egg diet.

Conclusion: Volumetrics, the assumption that a person consumes less calories if he eats foods with low enery content per weight or volume, doesn't work, at least not for me on oatmeal versus eggs. Cement might work, but that I'm not going to try without the helping hand of a bunch of anarchovegans.

Note that I'm not saying oatmeal wouldn't be a healthy way to start a day - surely healthier than a slice of white toast and marmalade - nor am I suggesting anyone should eat eggs like Cool Hand Luke. This inhumane experiment was conducted in such extreme only to test the naive idea of volumetrics.

Buddhabrot Aerobics

2006-08-18T23:24:00.000+03:00

I'm suffering from an incurable illness called nomipsnofunia, a serious disability to have fun without involving a lot of heavy computations. For example, when I read about Buddhabrots I compulsively needed to compute ones myself too. But I also decided to study what happens when we generalize the magic constant "2" in the Mandelbrot iteration formula z=z^2+c. In the animated gif below, which I baptized to Aerobics Buddhabrot, the exponent rotates around a circle with center at 2 and radius of 0.2 on the complex plane.

I also computed a three-minute PAL-resolution mpg movie (17 MB of MPEG-4 encoded video in AVI) where the exponent rotates around a cirle of radius 2 centered at the origo. A colleague described the visual effect as "the Buddha throwing his guts out, rotating them around him through infinity, and assembling himself back again on the other side." I employed some thirty workstations to compute the frames during the last few weeks.

Polyfilla and Old Painters

2006-07-31T14:19:00.000+03:00

If I had to name a single favourite construction material, it would probably be Polyfilla, a plaster for filling up holes and cracks in walls and ceilings. With regular gypsum powder it would be difficult to add exactly the amount of water which would make the paste stiff enough to stay in the crack, and after a few minutes all the paste you mixed would harden nearly instantaneously leaving you very little working time. Furthermore, gypsum shrinks when drying and thereby forms new smaller cracks which you will have to plaster up a second or third time. But then someone, apparently at Polycell UK, thought of adding cellulose to gypsum powder. Mixed with water the resulting paste is very pleasurable to handle, it hardens less dramatically and shrinks less. Where I have needed a little harder finish, for example in an outward corner, I've successfully added small amounts of cement.

Polyfilla is nowadays a worldwide brand, but recently I learned that old painters have used the basic idea for centuries: they cooked oat meal in water and mixed the resulting thin porridge instead of clear water to the gypsum powder the next day. Instead of oat they might also use other sources of polymeric carbohydrates, such as mashed potatoes, mouldy bread, old newspapers, etc. Supposedly a pair of old underwear has been spread in the walls of an architecturally renowned house in central Helsinki.

If slightly cheaper and tougher finish was required, the gypsum powder would be mixed with coke slag. This material, colloquially called "kananpaska" (chicken shit) or "karhunpaska" (bear shit) was the predominant material for building light walls in Helsinki: a mold would be erected on one side of the future wall, the wall would be constructed layer-by-layer by throwing the gypsum-ash paste on the mold or previous layer, the mold would be removed, and a finishing ashless layer would be plastered on both sides of the wall. The demeaning name probably follows from the messiness of the substance when demolishing the wall: the coke slag produces extreme amounts of slowly setting dust. When local coal heaters was replaced by district heating during the fifties and sixties this construction material has fortunately lost popularity.

Football World Cup Finals

2006-07-12T03:07:00.000+03:00

I've been told the football world cup finals (FWCF) are in progression in Germany at the moment. Haven't followed them as I'm less interested in watching football played by others than the one I play myself with Konami's Pro Evolution Soccer 4 on my PS2.

The FWCF to some degree relates to an interesting algorithmic problem: the selection problem in which an algorithm must select the i'th largest element of n values with as few pairwise comparisons as possible. Author, mathematician, and reverend Lewis Carroll, officially Charles Dodgson, became puzzled over the same problem while following tennis tournaments. Barring luck (randomness in individual game outcomes), the common knockout tournament is obviously the best tournament system for selecting the winner. I've previously studied some specific selection problems, but never ones which involve unreliable comparisons.

The FWCF employs first a stage of eight parallel round-robin tournaments, each containing four teams. The two best teams of each group proceed to a knock-out tournament. The winning team has played seven games, and it is possible that two teams play against each other twice.

The Swiss tournament system was introduced in chess tournaments in the late 18'th century. As in round-robin tournaments, each player/team accumulates a score based on its wins, draws and losses. On each round players with equal scores are pitted against each other, but with additional rules that prevent same players meeting again. The Swiss tournament has nice advantages, such as rapid convergence and high entertainment value to players, but your average football fan might find it perplexing and, more importantly, the tournament doesn't necessarily end in a final climax.

Naturally I wouldn't be writing this blog entry unless I would be suggesting some wise-ass tournament schemes of my own. But first, as a basis for estimating the tournament systems I choose the probability of a win for player A against player B to follow the logistic curve, 1/(1+c^(a-b)), where c I choose to be 1.4, and a and b are the positions of players A and B if all players were ordered "correctly". Hence, the best player wins the second best player with probability 1/(1+1.4^(1-2)) = 0.5833... and the tenth best player with probability 0.9538...

A full round-robin tournament among all thirty-two FWCF finalists sounds fair albeit somewhat lengthy, lasting at least three months. Simulating such a round-robin tournament one million times on a computer with the above game outcome probabilities shows that the best player would be awarded the gold medal only with approximately a 47.5% probability, silver with 26.7% probability, and bronze with 14.5% probability. Adding a second round increases the likelihood of correct winner to approximately 58%.

The standard knockout tournament fares even worse: the correct winner is chosen only with approximately 38.5% probability. But there is a simple remedy to improve it: playing the final best-of-three improves the best player's chances to 42.3%, playing also the semifinals best-of-three yields 45.2%. The next improvement would be to play the final best-of-five, resulting in almost 47.6% chances of finding the best player, but such a lengthy final may already lead to loss of concentration in the audience and tolerance in the native inhabitants of the hosting cities. Furthermore, any best-of-N system implies uncertainty of when the final decisive game will be played, which in turn may be troublesome to broadcasting companies and organizers who would like to schedule everything well in advance.

The FWCF system is not particularly good, only slightly more reliable than the basic knockout tournament and it requires seven rounds of games - two more than the straight-forward knockout tournament. Exchanging the order of group and knockout stage would make the tournament more reliable, but that wouldn't ensure all teams have at least three occasions to play and that there's a final climax at a predetermined date.

For my tournament system proposal I chose the simple sorting network as the basic computational device. I decided to add an eighth round of games, and as the comparator function I use not only the result of the played game but the number of wins accumulated in all played games so far, with emphasis on later wins. Only the final round is played as earlier - winners of those games are honored with gold and bronze regardless of their game history - thus ensuring the final climax. I wrote an elementary genetic algorithm and a fitness function which tries to maximize the likelihood of also the dimmer medals for the best player and avoid same players meeting twice. A nightful of CPU cycles later I obtained the following network which I have post-processed and visualized:

The above tournament system would award gold to the best player with a 44.6% probability. There are duplicate games, such as players 1 and 5 meeting on the semifinal and quarter-final round, but this was an elective trade-off. On average slightly over five games are duplicates. The role of maintaining a score sheet of wins is interesting, because if the comparator function replaced with the traditional - winner moves to the upper line - the tournament still produces the best winner with 42.5% probability. The most significant drawback of the above tournament is that the chance of winning is not entirely independent of which line the player is initially placed on. On the other hand, the best player is better off in my tournament than in the current FWCF regardless of which line he is placed on.

There's one remaining question: how to deal with draws. FWCF employs extra time and penalty shootouts, the latter of which undoubtably brings an additional element of randomness to the results. The tournament system I proposed above might allow draws in some places, but if draws are easy to achieve, the player which considers a draw more beneficial may adjust his playing style accordingly. As this is generally considered dull for spectators I propose a few solutions amenable for football: instead of the penalty shootout, the winning team is the one granted more corner kicks when the ball was passed outside by the defending goal keeper, but not field player. Should also that statistic be drawn, during the first three and the last rounds ties are broken by penalty kicks, during the latter half a variant of the away goals rule is used: a zero-zero in goals is considered a win for the team on the lower line, otherwise the team on the upper line wins.

Should a FWCF follow my suggestions some day (yeah, right, chances are even worse than Finland winning the Eurovision Song Contest), I promise I'll buy a ticket and eat it.

The Eurovision Song Contest and Geographics

2006-05-25T01:17:00.000+03:00

In general I don't plan to comment on daily topics and with the exception of Apocalyptica I don't particularly like modern heavy music. But absent-mindedly following the Eurovision Song Context last weekend, I couldn't but help noticing that many neighboring countries did vote each other. This has been noticeable in past years, and this year, when citizens actually voted, I explained this to myself that immigrants and minorities vote to support their ethnic background.

So, up with the calculator, or rather the computer. I wrote a program which tries to place the countries on a 2D plane so that countries that voted heavily on each other were placed more adjacently than other countries, and vice versa. The layout tries to minimize the sum of squares of differences between the countries on the layout and the number of mutually awarded points. Final points weigh twice the semifinal points.

Yes, clearly one can observe several clusters of neighbor countries, but the planarity constraint makes it impossible to bring all voters of successful countries sufficiently close.

Oh, did I have an opinion of the winner? Not seriously, except that IMHO it was rather nice to see latex on the outside for once win silicone on the inside. And in general, diversity was really needed.

Different Kind of Hacking

2006-05-21T23:45:00.000+03:00

I hack quite a bit of code, both for fun and funding, but this weekend I hacked up a zero-budget video demonstrating the use of orthogonal fisheye transformation as a surrogate for a big screen on handhelds. Fortunately there was some programming involved, but mostly my time went to battling against the heat strokes of my computer and ADSL modem, a malicious sound card, various video formats, and the lack of satisfactory FOSS video editing software. Oh, and since didn't have a microphone I had to use my digital camera's dictaphone feature. But that caused a lot noise and clicks I have to edit out some day.

In away all this was fun, and should I have a lot of extra time it would be fun to really delve into this subject, study video formats, more than basic computer graphics, solid modelling, blender, and make these tools as easily approachable to layman as a pen for writing.

Public Healtcare in Finland: Scrap, or How to Fix It!

2006-05-02T17:51:00.000+03:00

There has been quite a bit of debate recently on the state of the public healthcare in Finland. According to laws passed in early 70's the state (and/or municipalities) must provide an adequate level of basic healthcare for all citizens. For two decades it worked marvellously, but since the recession in early 90's the state funding has gradually decreased despite the subsequent accession. For quite some time that was compensated by municipalities and "increased employee efficiency" (i.e., in many cases unpaid, underpaid, or incompetent labor), but recently their back seems to have been broken: nurses are rallying for better salaries and more workforce, and nasty rumors are beginning to circulate about neglected elderly patients left to suffer alone in overcrowded institutions.

So far the supply of doctors has been somewhat sufficient, but as people become older the ratio of nurses to doctors should increase as treatments such as washing, feeding and moving begin to dominate. Nurses are not accustomed to defend their rights by striking, but the young have long ago voted with their feet: nursing has become an unpopular career and in fact there are roughly twice as many nurses in the age group of 40-49 than 30-39. So, as the older age groups retire, the situation will only become worse.

Simultaneously worries are beginning to rise about the state's (i.e., younger generations') ability to pay the pensions the older generations have been promised (by themselves, essentially). If all goes well, then the pensions can indeed be paid, but the economic uncertainties are considerable. Consequently, one way to make it easier for the younger generations is to lift the state's responsibility of public healthcare, at least for the older age groups. This would reduce government spending but the elderly would have to pay their own health care by selling their properties as very few have taken healthcare insurances.

If, on the other hand, we the Finns decide to keep our public basic healthcare system alive, I propose a very simple remedy that will make sure it remains in working condition: enact a law that forbids all private healthcare or any special treatment forever for anyone who has worked sufficiently high in the state or municipalities. This law should at least cover all members of the parliament and anyone else who has made a similar or salary (say, 6000 euros/month or more) while working anywhere in the public sector. Once the politics and higher public servants are made to eat their own dog food the rest will fix itself, trust me on that one!