Thursday, October 05, 2006

Similarity of Texts

I know of a few computational approaches to text comparison, such as using word-length histograms to identify the real author behind a pseudonym, and using Self-Organizing Maps (SOMs) to cluster, for example, word histograms of patent applications.

Could these attempts be improved upon? Arguably, text compression poses a problem that strongly characterizes the text, and compressors have to solve it: predicting the next character. So I decided to try compression as a way of quantifying the similarity of two natural languages. The working hypothesis is that a text mixed from two similar languages should compress better than a text mixed from two very different languages. I gathered a number of Bible versions and translations, mixed them pairwise verse by verse, compressed the results with one of the best dictionary-free compression programs I could find, compared the compression ratios to those of the unmixed Bible versions, and visualized the result on a plane. I had to adjust some points ever so slightly to keep the explanatory labels from being placed on top of each other, but otherwise this is the verbatim output of the program. Clusters of similar languages, or of different versions of the same language, are quite evident.
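
The mixing and compressing step can be sketched roughly as follows. This is only an illustration, not the program I used: Python's standard lzma module stands in for the compressor, the names interleave, compressed_size and mixing_score are made up for this sketch, and the score formula (mixed size divided by the sum of the unmixed sizes) is just one assumed way of comparing the mixed result to the unmixed versions.

```python
import lzma


def compressed_size(text: str) -> int:
    """Size in bytes of the text after compression.

    lzma is a stand-in compressor here (an assumption of this sketch).
    """
    return len(lzma.compress(text.encode("utf-8"), preset=9))


def interleave(verses_a, verses_b):
    """Mix two translations verse by verse: a1, b1, a2, b2, ..."""
    mixed = []
    for verse_a, verse_b in zip(verses_a, verses_b):
        mixed.append(verse_a)
        mixed.append(verse_b)
    return mixed


def mixing_score(verses_a, verses_b) -> float:
    """Compressed size of the mix relative to the two unmixed texts.

    Assumed formula: a lower score means the compressor could reuse
    more of what it learned from one text when coding the other.
    """
    size_a = compressed_size("\n".join(verses_a))
    size_b = compressed_size("\n".join(verses_b))
    size_mixed = compressed_size("\n".join(interleave(verses_a, verses_b)))
    return size_mixed / (size_a + size_b)
```

Given two lists of verses in the same order, mixing_score(verses_a, verses_b) yields one number per pair of translations. A table of such numbers for all pairs is the kind of input a planar layout could then be computed from.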



The limitation of only two dimensions is quite evident - some languages, such as Latin, have influenced a number of later languages and would compress well when mixed with them, even if those later languages are very different from each other. Perhaps adding a third dimension, or some kind of wormholes or teleports, would help. SOM representations avoid this problem by having multiple nodes (or disconnected areas) represent the same item. I leave this as a later research topic, though.

Another limitation, not evident in the picture above, is that mixing and compressing can only be applied to languages that use similar alphabets. Otherwise the typical context modelling techniques in compressors will essentially separate the two languages into two independent input streams.

In this example I mixed and compressed two versions or translations of the same text, but what I would really like to do is to mix and compress texts written in the same language but by different authors. Suggestions for good corpora are welcome.