Limits of copyrightability, part 3
In the context of programming, the role of equivalence becomes particularly interesting: does it cover only superficial variations such as comments, indentation, naming conventions, and the choice of equivalent control structures, or also deeper issues such as the partitioning of functionality, the representation of data, and the choice of algorithms, protocols and pre-existing software components? At the latter extreme, all functionally equivalent programs, or significant parts of programs, would be considered equivalent, and consequently the copyright on a specification would also cover every implementation of that specification. But that, of course, is the purpose of patents, not copyright.
In this text I regard two (parts of) programs as equivalent in the copyright sense if they compile to identical stripped object code. This is a whopping generalization, but a somewhat justified one. Firstly, compilers ignore comments and indentation and optimize programs towards their most efficient, and consequently more canonical, form. Secondly, stripping removes the symbol information and thereby compensates for equivalent variations in variable and function names. Additionally, roughly half a kilobyte in each object file is reserved for various automatically generated metadata, which is ignored. By compressing the rest of the object file we obtain a rough measure of the entropy of the original source code.
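The core of this measure can be sketched in a few lines of Python: take the (stripped) object code as a byte string, compress it, and use the compressed size as a crude upper bound on entropy. This is only an illustration of the principle; the compressor and the metadata handling in the actual experiment may differ.

```python
import os
import zlib

def entropy_bits(object_code: bytes) -> int:
    """Crude upper bound on entropy: the size in bits of the
    compressed code (level 9 for the tightest bound)."""
    return 8 * len(zlib.compress(object_code, 9))

# Highly repetitive "code" compresses far below incompressible random bytes
repetitive = entropy_bits(b"\x90" * 4096)    # e.g. a run of NOP bytes
random_ish = entropy_bits(os.urandom(4096))  # essentially incompressible
```

The point of compressing rather than just measuring size is visible immediately: the repetitive input yields a few hundred bits, while the random input stays near its raw 32768 bits.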
So, I took the newest software packages in ftp://ftp.funet.fi/pub/gnu/prep that compiled simply by issuing configure CFLAGS=-Os && make, and ignored source code not written in C as well as object files which couldn't easily be associated with their source file. The results from the remaining 5000-odd files are shown in the figure below. They include many outliers, such as source files that include other source files and, at the other extreme, files that contain nothing executable on this particular platform. But a rough estimate is three bytes per non-empty non-comment line of source code as reported by David Wheeler's sloccount, or 0.8 bits per non-whitespace non-comment source code character. This measure is a lower, and hence more accurate, upper bound on the entropy than compressing the comment-stripped source code directly.
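For readers unfamiliar with sloccount's notion of a source line, a very rough stand-in looks like this (the real tool is considerably more careful; this sketch ignores, for instance, comment markers inside string literals):

```python
import re

def crude_sloc(c_source: str) -> int:
    """Very rough stand-in for sloccount on C code: strip /* */ and //
    comments, then count the remaining non-blank lines."""
    no_block = re.sub(r"/\*.*?\*/", "", c_source, flags=re.S)
    no_line = re.sub(r"//[^\n]*", "", no_block)
    return sum(1 for line in no_line.splitlines() if line.strip())

sloc = crude_sloc("int main(void) {\n  /* comment */\n  return 0;\n}\n")
```

On the four physical lines above, the comment-only line is discounted, so the function reports three source lines.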
If we take the total SLOC produced in the world to be, say, 200*10^9, and the number of SLOC a productive person writes during his lifetime to be one million, multiply both by the entropy per SLOC, i.e. 24 bits, and substitute into the formula introduced in the first part of this series, we get n >= 87 bits. This corresponds to at least 3.6 average-sized source code lines, or 108 non-comment non-whitespace characters.
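The arithmetic can be checked as follows. Note that the bound itself comes from part 1 of this series; the formula below is my reconstruction of it as a birthday-style collision bound with an assumed acceptable risk p, not a quotation, but it does reproduce the same 87-bit figure.

```python
import math

# Quantities quoted in the text; the collision-bound form and the value
# of P_ACCEPTABLE are assumptions of this sketch, not quotes from part 1.
WORLD_BITS = 200e9 * 24     # total SLOC in the world times 24 bits/SLOC
LIFETIME_BITS = 1e6 * 24    # one programmer's lifetime output, in bits
P_ACCEPTABLE = 1e-6         # assumed acceptable risk of accidental collision

n = math.ceil(math.log2(WORLD_BITS * LIFETIME_BITS / P_ACCEPTABLE))
print(n)        # 87 bits
print(n / 24)   # ~3.6 average-sized source lines
print(n / 0.8)  # ~108.7 non-comment, non-whitespace characters
```

Under these assumptions the threshold lands at 87 bits, i.e. roughly 3.6 lines or just under 109 characters, matching the figures above.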
This may seem a little disappointing, but note that even after the transformations described above, the entropy of the source code is over-estimated, for example by random variations in the compiler's register allocation and stack layout, and by the suboptimality of the compression program. In other words, the minimum copyrightable work is actually larger than the 3.6 lines, because it includes information about the context in which the work resides. Consequently, we can argue this lower limit for patches, but not for arbitrary code shown out of context. We could push the limit higher by finding compiler mechanisms that eliminate such random variations, but perhaps writing a new C compiler goes beyond the scope of this blog entry...
But there is another argument we can use: observe how frequently programmers incidentally produce new code equivalent to some existing code. I chose gcc-core-4.0.2, a large open source project in which there is no legal or technical reason not to reuse code written by someone else. I configured and compiled it with -Os, and counted the identical parts of the text segments in the 417 distinct object files, containing almost 3 megabytes of executable code from over 300 kSLOC of hand-written source (plus headers) and 180 kSLOC of machine-generated source code. I roughly estimate that each source code line results in some 6 bytes of executable code. Note that constant data (such as string constants) is excluded from this treatment even though it may contain copyrightable material.
As you can see from the graph above, equivalent code is produced far more frequently than one would assume from the probabilistic treatment above. For example, the likelihood of writing an equivalent 128 bytes (21 SLOC) is over three times the acceptable risk. I took the trouble to see where some of the identical parts had come from, and instead of macro expansion (which can produce duplicate machine code), most of them were indeed caused by typical C liturgy, such as code for looking up, or even inserting with a possible rehash, in a chaining hash table with string keys. Some of this is caused by the rather primitive reuse mechanisms of the programming language, some by ignorance of the existing code.
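The duplicate search itself can be sketched along the following lines: hash every fixed-size window of each text segment and count the windows that occur in more than one object file. This is a simplification, assuming exact byte-for-byte matches at every offset; the comparison in the actual experiment may have handled alignment and window sizes differently.

```python
from collections import defaultdict

def shared_windows(segments, window=128):
    """Count the distinct `window`-byte sequences that appear in the
    text segments of more than one object file."""
    seen = defaultdict(set)  # window bytes -> set of segment indices
    for idx, seg in enumerate(segments):
        for off in range(len(seg) - window + 1):
            seen[seg[off:off + window]].add(idx)
    return sum(1 for files in seen.values() if len(files) > 1)

# Two synthetic "text segments" sharing one 128-byte run
common = bytes(range(128))
hits = shared_windows([common + b"\x00" * 64, b"\xff" * 64 + common])
```

Here the planted 128-byte run is found exactly once across the two segments, while two segments with nothing in common report zero shared windows.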
Note that even though I have argued that an average four-line patch or twenty lines of new code shouldn't deserve copyright protection, they may nevertheless be protected by patents. Also, if it can be shown that your four-line patch or twenty-line function was a derived or copied work - for example because you are not sufficiently skilled in programming - then none of the above arguments will help you.