### Limit of copyrightability

Perhaps it is possible to derive mathematically a limit, n, on the length in bits of the shortest copyrightable text. My starting point is that the risk of my accidentally producing the same copyrightable text as someone else (and thus, should all text become perfectly searchable, the risk of a false conviction) should not exceed the "acceptable risk" of 1 in 1,000,000.

Suppose there are p other persons, each producing k bits of information in a lifetime. We can safely assume k is much larger than n, and since a copyrightable work may begin at any position in that output, each person produces approximately k works, so there are approximately p*k copyrightable works in the world written by others. The probability that two works of n bits of information each are identical is 2^-n. The probability that my new work is truly unique in the world is thus (1-2^-n)^(p*k). The probability that all of the k works I produce during my life are unique is (1-2^-n)^(p*k*k). If I've indeed written all those works myself, that probability should be at least certainty minus the acceptable risk, that is, at least 1 - 10^-6.
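
The requirement above can be turned into an explicit bound on n. The following is only a sketch: it reads the last probability as the chance that every one of my k works is unique, it writes the acceptable risk as ε (a symbol introduced here, with ε = 10^-6), and it assumes 2^-n is so small that the first-order approximation (1-x)^m ≈ 1 - m*x is good enough.

```latex
(1 - 2^{-n})^{p k^2} \ge 1 - \varepsilon
\quad\Longrightarrow\quad
p k^2 \cdot 2^{-n} \lesssim \varepsilon
\quad\Longrightarrow\quad
n \gtrsim \log_2 \frac{p k^2}{\varepsilon}
```

In words, n grows only logarithmically with the number of writers, with the square of each writer's output, and with the inverse of the acceptable risk.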

For a numerical value, we might estimate p to be one billion (10^9) literarily productive persons, now and cumulatively in the past, and k to be 50 years * 365 days/year * 10 kilobytes/day, or roughly 1.46 * 10^9 bits (taking one kilobyte as 8000 bits). With these values, n > 110 bits would suffice.
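
The arithmetic can be checked with a short script. This is a minimal sketch under the estimates above, not a definitive calculation: the function name `min_bits` and the choice of 1 kilobyte = 1000 bytes are assumptions made for illustration.

```python
import math


def min_bits(p: float, k: float, risk: float) -> int:
    """Smallest integer n with (1 - 2**-n)**(p*k*k) >= 1 - risk."""
    trials = p * k * k
    # Start just below the first-order estimate log2(trials / risk),
    # then step upward until the exact condition holds (compared in logs
    # so that the tiny probability 2**-n does not vanish in rounding).
    n = max(1, math.floor(math.log2(trials / risk)) - 1)
    while trials * math.log1p(-(2.0 ** -n)) < math.log1p(-risk):
        n += 1
    return n


p = 1e9                       # literarily productive persons (estimate)
k = 50 * 365 * 10 * 1000 * 8  # lifetime output in bits, 1 kB = 1000 bytes (assumption)
print(min_bits(p, k, 1e-6))
```

The threshold sits a little below 111, so the smallest whole number of bits that meets the condition is 111, in line with the estimate that n must exceed 110 bits.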

But note that the information content of natural text is not the same as the number of bits it contains. Shannon estimated based on his experiments that the entropy of English text is between 0.6 and 1.3 bits per letter. This would imply that the minimum length of a copyrightable text lies between 84 and 183 letters. Hence the previous sentence might just barely be copyrightable. The previous two sentences together should probably be copyrightable.
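
The conversion from bits to letters is a single division, sketched below. The helper name `letters_needed` is hypothetical, and counting every character of the sentence (spaces and punctuation included) as a "letter" is an assumption made here for simplicity.

```python
import math


def letters_needed(bits: float, entropy_per_letter: float) -> float:
    """Letters of text needed to carry the given number of bits."""
    return bits / entropy_per_letter


n_bits = 110                          # limit derived above
low = letters_needed(n_bits, 1.3)     # high-entropy end of Shannon's range
high = letters_needed(n_bits, 0.6)    # low-entropy end of Shannon's range
print(math.floor(low), math.floor(high))

# The sentence that states the range, measured against that range.
sentence = ("This would imply that the minimum length of a copyrightable "
            "text lies between 84 and 183 letters.")
print(len(sentence), low < len(sentence) < high)
```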

Of course, in highly repetitive and phrase-heavy text (legalese, lyrics and chitchat, for example) the entropy would be even lower than 0.6 bits per letter, and the shortest copyrightable texts correspondingly longer.

Note that all this reasoning applies to two identical copies. If we wish to expand this reasoning from identical copies to equivalent or similar versions of text, the minimum copyrightable text length will necessarily increase. But more on this some other day. Probably I'll also expand this line of thought to music and software.
