Sunday, January 22, 2006

Limits of copyrightability, part 3

In previous posts (1, 2) I stated that the risk of two persons independently producing an equivalent work, and thus the risk of a false conviction of copyright violation, would be unacceptably high if (1-2^-n)^(s*p*k*k) falls below one minus the acceptable risk, 10^-6. Here n is the information content of the work in bits, p the number of other people producing copyrightable works, k the number of bits of information each of them produces on average during a lifetime, and s the number of works considered equivalent to an n-bit work. I have, for example, argued that for English text n corresponds to 100 to 216 letters, and that consequently individual haiku verses, for example, should not enjoy copyright protection. In this rather lengthy post I shall apply first this principle and then another approach to squeeze out a lower limit on the size of a copyrightable piece of program source code.
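
To make the criterion concrete, here is a minimal sketch in Python; the variable names follow the definitions above, but the function itself is only an illustration, not something taken from the earlier posts.

    import math

    # The work is "safe" if the probability that nobody else independently
    # produces an equivalent work stays above 1 - risk.  log1p and exp are
    # used because (1 - 2**-n) rounds to exactly 1.0 in double precision
    # once n grows past about 53 bits.
    def risk_is_acceptable(n, s, p, k, risk=1e-6):
        # Evaluates (1 - 2^-n)^(s*p*k*k) >= 1 - risk without underflow.
        no_collision = math.exp(s * p * k * k * math.log1p(-(2.0 ** -n)))
        return no_collision >= 1.0 - risk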

In the context of programming, the role of s becomes particularly interesting: does equivalence cover only superficial variations such as comments, indentation, naming conventions and the choice of equivalent control structures, or also deeper issues such as the partitioning of functionality, the representation of data, and the choice of algorithms, protocols and pre-existing software components? In the latter extreme all functionally equivalent programs, or significant parts of programs, would be considered equivalent, and consequently the copyright on a specification would also cover every implementation of that specification. But that, of course, is the domain of patents, not of copyright.

In this text I regard two programs, or parts of programs, as equivalent in the copyright sense if they compile to identical stripped object code. This is a whopping generalization, but a somewhat justified one. Firstly, compilers ignore comments and indentation and optimize the program into a more efficient and consequently more canonical form. Secondly, stripping removes the symbol information and thereby compensates for equivalent variations in variable and function names. Additionally, roughly half a kilobyte of each object file is taken up by various automatically generated metadata, which I ignore. Compressing the rest of the object file then gives a rough measure of the entropy of the original source code.
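
Schematically, the per-file measurement looks roughly like the sketch below. In reality the files were compiled by each package's own build system, and the half-kilobyte metadata allowance, the choice of compressor and the exact strip invocation here are simplifications rather than an exact recipe.

    import bz2, subprocess

    METADATA_ALLOWANCE = 512   # "roughly half a kilobyte" of generated metadata

    def object_entropy_bits(c_file):
        """Compile with -Os, strip, compress, and use the compressed size
        (minus the metadata allowance) as a crude entropy estimate in bits."""
        obj = c_file[:-2] + ".o"
        subprocess.check_call(["gcc", "-Os", "-c", c_file, "-o", obj])
        subprocess.check_call(["strip", obj])
        with open(obj, "rb") as f:
            data = f.read()
        return max(0, len(bz2.compress(data)) - METADATA_ALLOWANCE) * 8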

So, I took the newest software packages from ftp://ftp.funet.fi/pub/gnu/prep that compiled by simply issuing configure CFLAGS=-Os && make, and ignored source code not written in C as well as object files that could not easily be associated with their source file. The results from the remaining more than 5000 files are shown in the figure below. They include many outliers, such as source files that include other source files and, at the other extreme, files that contain nothing to execute on this particular platform. But a rough estimate is three bytes per non-empty non-comment line of source code as reported by David Wheeler's sloccount, or 0.8 bits per non-whitespace non-comment source code character. This measure is a lower, and hence more accurate, upper bound on the entropy than the one obtained by compressing the comment-stripped source code directly.
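
The per-character figure needs a count of non-whitespace, non-comment characters; one way to obtain it, sketched below, is to let gcc itself drop the comments. Take this as an illustration of the counting rule rather than the exact script used.

    import subprocess

    def nonws_noncomment_chars(c_file):
        """Count non-whitespace characters after stripping comments.
        -fpreprocessed -dD -E removes comments without expanding macros,
        and -P drops the line markers from the output."""
        out = subprocess.check_output(
            ["gcc", "-fpreprocessed", "-dD", "-E", "-P", c_file])
        return sum(1 for byte in out if not chr(byte).isspace())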

[Figure: compressed stripped object code size for the more than 5000 source files]

If we take the total number of SLOC produced in the world to be, say, 200*10^9, and the number of SLOC that a productive person writes during a lifetime to be one million, multiply both by the entropy per SLOC, i.e. 24 bits, and substitute into the formula introduced in the first part of this series, we get n >= 87. This corresponds to at least 3.6 average-sized source code lines, or 108 non-comment non-whitespace characters, before a piece of code reaches the threshold of copyrightability.
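
For the record, the back-of-the-envelope arithmetic can be checked as follows; the variable names are mine, and s is taken to be one because the equivalence classes have already been folded into the compilation step.

    import math

    bits_per_sloc = 24                        # three bytes per SLOC
    world_bits    = 200e9 * bits_per_sloc     # p*k: everything ever written
    lifetime_bits = 1e6 * bits_per_sloc       # k: one productive programmer
    s, risk       = 1, 1e-6

    # (1 - 2^-n)^m ~ 1 - m*2^-n for small 2^-n, so the criterion becomes
    # s*(p*k)*k*2^-n <= risk, i.e. n >= log2(s*(p*k)*k/risk).
    n = math.log2(s * world_bits * lifetime_bits / risk)
    print(n)                   # about 86.6, hence n >= 87 bits
    print(n / bits_per_sloc)   # about 3.6 source lines
    print(n / 0.8)             # about 108 non-whitespace non-comment characters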

This may seem a little disappointing, but note that even after these transformations the entropy of the source code is still overestimated, for example because of random variation in the compiler's register allocation and stack layout and because of the suboptimality of the compression program. In other words, the minimum copyrightable work is actually larger than those 3.6 lines, because the measurement also includes information about the context in which the work resides. Consequently, we could argue this lower limit for patches, but not for arbitrary code shown out of context. We could push the lower limit higher by finding compiler mechanisms to get rid of such random variation, but writing a new C compiler perhaps goes beyond the scope of this blog entry...

But there is another argument we could use: observe how frequently programmers independently produce code equivalent to some existing code. I chose gcc-core-4.0.2, a large open source project in which there is no legal or technical reason not to reuse code written by someone else. I configured and compiled it with -Os, and counted identical parts of the text segments of the 417 distinct object files, which contain almost 3 megabytes of executable code compiled from over 300 kSLOC of C, plus headers and another 180 kSLOC of machine-generated source code. I roughly estimate that each source code line results in some 6 bytes of executable code. Note that constant data (such as string constants) is excluded from this treatment even though it may contain copyrightable material.
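
Schematically, the duplicate hunt can be done along the lines sketched below: extract the executable bytes of each object file and count how many fixed-size windows occur more than once. The byte-granular window alignment and the use of objcopy here are simplifications of the actual procedure.

    import os, subprocess, tempfile
    from collections import Counter

    def text_bytes(obj):
        """Extract the raw .text section of an object file with objcopy."""
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp.close()
        try:
            subprocess.check_call(["objcopy", "-O", "binary",
                                   "--only-section=.text", obj, tmp.name])
            with open(tmp.name, "rb") as f:
                return f.read()
        finally:
            os.unlink(tmp.name)

    def duplicated_windows(object_files, window=128):
        """Count how many windows of executable code repeat an earlier one.
        Hashing the windows keeps memory use modest; hash collisions are
        negligible at these data sizes."""
        counts = Counter()
        for obj in object_files:
            data = text_bytes(obj)
            for i in range(len(data) - window + 1):
                counts[hash(data[i:i + window])] += 1
        return sum(c - 1 for c in counts.values() if c > 1)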

[Graph: how often identical parts of the text segments occur, as a function of their length in bytes]

As you can see from the graph above, equivalent code is produced far more frequently than the probabilistic treatment above would suggest. For example, the likelihood of writing an equivalent 128 bytes (21 SLOC) is over three times the acceptable risk. I went through the trouble of checking where some of the identical parts had come from, and instead of macro expansion (which can produce duplicate machine code), most of them were indeed caused by typical C liturgy, such as code for looking up, or even inserting with a possible rehash, in a chaining hash table with string keys. Some of this is caused by the rather primitive reuse mechanisms of the programming language, some by ignorance of the existing code.

Note that even though I have argued that an average four-line patch or twenty lines of new code should not deserve copyright protection, they may nevertheless be protected by patents. Also, if it can be shown that your four-line patch or twenty-line function was a derived or copied work - for example because you might not be sufficiently skilled in programming to have written it yourself - then none of the above arguments will help you.

1 Comment:

Anonymous said...

The lower graph very much corresponds to what Zipf's law would predict.

10:51 PM  
