Friday, November 28, 2008


Symbol Ranking is a method for data compression in which one maintains an LRU list of symbols recently seen in the given context. Those symbols are expected to be more probable the next time and are therefore encoded with fewer bits than other symbols. The LRU list is usually at most three symbols long, making it possible to store a context in a single 32-bit word.

Symbol Ranking has been common computing folklore for ages; Peter Fenwick gave the first modern treatment and implementation, subsequently improved by Matt Mahoney, for example by adding a secondary arithmetic compression stage. I added a few ideas of my own and rewrote it in C in a way that is easily embeddable in other programs, for example to compress network connections or database logs.

This compression library, SR3C, sits at the frontier where one has to spend either more memory or more cycles to compress better. Compression is slightly faster than gzip -7, but results in ~11% smaller data. bzip2 -2 compresses slightly better than SR3C, but takes almost three times as long and is not on-line.

Matt Mahoney made a version whose command-line interface compiles on Windows and benchmarked it as well. Both versions of SR3C are published under the MIT license.


Blogger cbloom said...

Hi, have you compared sr3c to a real modern coder (such as LZMA, BALZ, or LZP)?

Both gzip and bzip2 are very old, and both are designed for fast *decompression*, not fast compression.

1:04 AM  
Blogger cessu said...

I did some comparisons during development, but Matt Mahoney has independently tested sr3c on his large text compression benchmark. According to his results, for example, BALZ compresses 14% better, but takes 23 times the CPU and 41 times the memory. LZOP and QuickLZ were my main candidates for the purpose I had, but for some reason I got excellent results with Mahoney's sr3. So I did some testing, checked the literature, did a rewrite from scratch, and out came sr3c.

IMHO, the zoo of compression algorithms is becoming difficult to navigate. Not only compression speed, but also decompression speed, memory consumption, and the type of data affect the decision. And of course also API and IPR issues, or whether one needs on-line compression, random access, etc.

11:15 AM  
