Saturday, December 31, 2005

Limit of copyrightability

Perhaps it is possible to derive mathematically a lower limit, n, on the length of the shortest copyrightable text. My starting point is that the risk of my incidentally producing the same copyrightable text as someone else (and, should all text become perfectly searchable, the resulting risk of a false conviction) should not exceed an "acceptable risk" of 1 in 1,000,000.

Suppose there are p other persons, each producing k bits of information in a lifetime. We can safely assume k is much larger than n, and since a copyrightable work may begin at any position, we can approximate that there are p*k copyrightable works in the world written by others. The probability that two works of n bits each are identical is 2^-n (treating every bit pattern as equally likely). The probability that my new work is truly unique in the world is thus (1-2^-n)^(p*k). The probability that all of the k works I produce during my life are unique is (1-2^-n)^(p*k*k). If I've indeed written all those works myself, that probability should be at least certainty minus the acceptable risk, that is, at least 1 - 10^-6.
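The requirement above can be turned into an explicit bound on n. The following is only a sketch of that step: the symbol \varepsilon for the acceptable risk (10^-6) is my own notation, and the use of Bernoulli's inequality is an assumption about how the bound is best reached, not something stated in the post.

```latex
\begin{align*}
\text{Requirement:}\quad (1 - 2^{-n})^{p k^2} &\ge 1 - \varepsilon \\
\text{Bernoulli's inequality:}\quad (1 - 2^{-n})^{p k^2} &\ge 1 - p k^2 \, 2^{-n} \\
\text{so it suffices that}\quad p k^2 \, 2^{-n} &\le \varepsilon \\
\text{that is,}\quad n &\ge \log_2 \frac{p k^2}{\varepsilon}
\end{align*}
```

Because p k^2 2^-n is tiny here, the sufficient condition is also very close to the exact one.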

For a numerical value we might estimate p to be one billion (10^9) literarily productive persons, counting those alive now and cumulatively in the past, and k to be 50 years * 365 days/year * 10 kilobytes/day, or about 1460 million bits. With these values n > 110 bits would suffice.
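The estimate can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `min_bits` is hypothetical, one kilobyte is taken as 1000 bytes of 8 bits, and the bound n >= log2(p*k*k/risk) is the first-order form of the probability requirement rather than an exact solution.

```python
import math


def min_bits(p, k, risk):
    """Smallest whole n such that p * k**2 * 2**-n <= risk."""
    return math.ceil(math.log2(p * k * k / risk))


# Figures from the post; the kilobyte size is an assumption (1 kB = 1000 bytes).
p = 10**9                      # other literarily productive persons
k = 50 * 365 * 10 * 1000 * 8   # bits written in one lifetime
risk = 1e-6                    # acceptable risk, 1 in 1,000,000

n = min_bits(p, k, risk)
print(f"k = {k} bits, n >= {n} bits")
```

The bound is not sensitive to the exact value of k: doubling k raises n by only two bits, so rounding differences in the lifetime output hardly matter.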

But note that the information content of natural text is not the same as the number of bits it contains. Shannon estimated based on his experiments that the entropy of English text is between 0.6 and 1.3 bits per letter. This would imply that the minimum length of a copyrightable text lies between 84 and 183 letters. Hence the previous sentence might just barely be copyrightable. The previous two sentences together should probably be copyrightable.
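The conversion from bits to letters is a single division, sketched below. The function name `letters_needed` is hypothetical, and the sketch assumes the entropy per letter is constant over the whole text, which is a simplification of Shannon's estimate.

```python
def letters_needed(n_bits, bits_per_letter):
    """Number of letters of text carrying n_bits of information."""
    return n_bits / bits_per_letter


n_bits = 110  # limit derived in the post

# Shannon's range for English: 0.6 to 1.3 bits per letter.
shortest = letters_needed(n_bits, 1.3)  # high entropy, fewer letters needed
longest = letters_needed(n_bits, 0.6)   # low entropy, more letters needed

print(f"between {shortest:.1f} and {longest:.1f} letters")
```

Lower entropy per letter means more letters are needed to reach the same information content, which is why the 0.6 figure yields the longer text.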

Of course, in highly repetitive and phrase-heavy text (legalese, lyrics and chitchat, for example) the entropy would be even lower than 0.6 bits per letter, and the shortest copyrightable texts correspondingly longer.

Note that all this reasoning applies to two identical copies. If we wish to expand this reasoning from identical copies to equivalent or similar versions of text, the minimum copyrightable text length will necessarily increase. But more on this some other day. Probably I'll also expand this line of thought to music and software.

Thursday, December 29, 2005

Introduction

It took me some five years of hesitation before I started my own blog, or "web log", as they were called back then. Blogs are a form of narcissism, I've often thought. Therefore this blog won't be a teen diary of turbulent, hormonally induced feelings, nor will it be a captain's log promptly recording every irrelevant action I take in my life, nor will I re-blog every even remotely interesting link I find when surfing the net or reading other blogs. In fact, very little of my private or professional life will be recorded here.

So, what do I plan to blog? Primarily reviews of books and software, and essays, or even shorter essaylettes, on paradoxes, my opinions, observations or ideas I don't have time to pursue further myself. And I will probably blog relatively infrequently and in bursts.

And, who am I? A gradually middle-aging hacker, a software engineer currently working on my PhD in computer science at the Helsinki University of Technology, married, a happy father of a two-year-old daughter who's probably bound to outsmart me in everything except some arcane programming languages, a modest/low-carber, a member of IKI and Skepsis, and someone who likes good discussions, elegant algorithms, terse software designs, puzzles with an element of surprise, cycling, cooking, and - given a relaxed schedule - building or renovating woodwork.