Wednesday, January 25, 2006

Styles of Reading and Writing

Children read the pictures and comics in the newspapers, but once they go to school they read (also) the text. Later the reverse happens: students write their theses mostly in text, but after graduation they become professional designers of Powerpoint slides. No wonder Nokia main building is often called "PowerPoint Palace".

Is video compression worthwhile?

Whenever our digibox's hard disk gets full, I have the habit to download its contents to my desktop, compress from MPEG2 to something more concise, say DivX, and either store on a USB/FW disk or DVD. My typical compression settings reduce the data size by over 70% from approximately 1.5 GB/hour to approximately 400 MB/hour. Here I'm not concerned with quality - in fact the compression seems to improve the quality by smoothing out some digi-TV artifacts. But a friend of mine claimed this was hardly economically justified. So we calculated it out, and below are the results.

A 4.7 GB (or actually 4.4 GB) DVD-R disk costs roughly 50 cents (without media tax) in large spindle boxes, or approximately 11 cents per GB, and therefore by compressing I save some 12 cents per hour of movie. Large USB or Firewire disks cost roughly 80 cents per GB, so when I store to them, I save some 88 cents per hour of movie. These hard disks suffer another drawback: I can't insert them into the (DivX-capable) family DVD player and watch the movies on the big screen.

Compression takes quite a bit of CPU time, roughly four hours per one hour of movie on my machine. I estimate my machine consumes roughly 60 watts more electricity when the CPU is working. Electricity costs roughly 8 cents per hour of kilowatt (including taxes and transfer but no fixed costs), or roughly two cents per one our of movie.

So yes, for all current media, my video compression is indeed economically justified, albeit not necessarily indefinitely: in future media will be cheaper and power more expensive.

There's another argument: space. With individual cases I can store roughly 200 DVD disks per a meter of shelf space, up to 15 shelves high. That's 3000 disks. They cost 1500 euros, a shelf might cost some 300 euros, but the floor space in the center of Helsinki costs currently a whopping 4000 euros per square meter. This quadruples the benefit of video compression!

Now there's just one aspect left to ponder: The full shelf can store 33000 hours of compressed movies. Assuming I watched four hours of movies every day - which I definitely don't have time to do - it will take over 22 years to watch all the disks. Furthermore, since my current recording speed seems to be well below one disk per week, filling the shelf would be concluded by my potential grand children. But by then the copyright to the movies has extinguished and my grandchildren could legally share the movies (or actually largely also documentaries) I have recorded.

Sunday, January 22, 2006

Limits of copyrightability, part 3

In previous posts (1, 2) I stated that the risk of two persons incidentally producing an equivalent work and thus risking a false conviction of copyright violation would be unacceptably high if (1-2^-n)^(s*p*k*k) is below one minus the acceptable risk 10^-6. Here n is the information content of the work in bits, p the number of other people producing copyrightable works, k the number of bits of information everyone on average produces in his lifetime, and s is the number of works considered equivalent to an n bits long work. I have, for example, argued that in English text n would be from 100 to 216 letters, and consequently for example individual verses of Haiku should not enjoy copyright protection. In this rather lengthy post I shall first apply this principle, then another approach, to squeeze out some lower limit on the size of a copyrightable piece of program source code.

In the context of programming, the role of s becomes particularly interesting depending on whether we consider equivalence to cover superficial variations such as comments, indenting, naming conventions, and choice of equivalent control structures, or deeper issues such as the partition of functionality, representation of data, and choice of algorithms, protocols and pre-existing software components. In the latter extreme, all functionally equivalent programs, or significant parts of programs, would be considered equivalent and consequently the copyright of a specification would also cover all implementations of the specification. But that, of course, is not the purpose of copyright, but patents.

In this text I regard two (parts of) programs equivalent in the copyright sense if the two programs are compiled to the identical stripped object code. This is a whopping generalization, but somewhat justified. Firstly, compilers ignore comments and indentation and optimize the programs to the most efficient and consequently more canonical form. Secondly, stripping removes the symbol information and thereby compensate for equivalent variations in variable and function names. Additionally roughly half a kilobyte in each object file is reserved for various automatically generated metadata, which is ignored. After compressing the rest of the object files we obtain a rough measure of entropy for the original source code.

So, I took all newest software packets in which compiled by simply issuing configure CFLAGS=-Os && make, ignored source code not written in C, and object files which couldn't easily be associated with their source file. The results from the remaining over 5000 files are shown in the figure below. They include many outliers, such as source code files including other source code files and, in the other extreme, files which include nothing to execute for this particular platform. But a rough estimate is three bytes per every non-empty non-comment line of source code as reported by David Wheeler's sloccount, or 0.8 bits per each non-whitespace non-comment source code character. This measure is a lower, and hence more accurate, upper bound on the entropy than compressing the source code without comments directly.

If we pick the total SLOC produced in the world to be, say, 200*10^9, and the number of SLOC that a productive person writes during his lifetime to be one million, multiply these by the entropy per SLOC, i.e. 24, substitute into the formula introduced in the first part of this series, then we have n >= 87. This shows that at least 3.6 average size source code lines, or 108 non-comment non-whitespace characters.

This may seem a little disappointing, but note that even after the transformations we have done, the entropy of the source code is over-estimated by for example random variations such as register allocation and stack layout in the compiler and suboptimality of compression program. In other words, the minimum copyrightable work is actually larger than the 3.6 lines, because it includes information about the context in which the work resides. Consequently, we could argue this lower limit for patches, but not arbitrary code shown out of context. We could push the lower limit higher by finding compiler mechanisms to get rid of such random variations, but perhaps writing a new C compiler goes beyond the scope of this blog entry...

But there's another argument we could use: observe how frequently programmers incidentally produce new code equivalent to some existing code. I chose gcc-core-4.0.2, a large open source project where there's no legal or technical reason not to reuse code written by someone else. I configured and compiled with -Os, and looked for the number of identical parts of the text segments in the 417 distinct object files containing almost 3 megabytes of executable code from over 300 kSLOC, some more headers, and 180 kSLOC of machine-produced source code. I roughly estimate each source code line results in some 6 executable bytes. Note that constant data (such as string contants) are excluded in this treatment even if they may contain copyrightable material.

As you can see from the graph above, equivalent code is produced far more frequently than one would assume after the above probabilistic treatment. For example, the likelihood to write an equivalent 128 bytes (21 SLOC) is over three times the acceptable risk. I went through the trouble to see where some of the identical parts had come from, and instead of macro expansion (which can produce duplicate machine code), most of them were indeed caused by typical C-like liturgy, such as code for looking up or even inserting with possible rehash in a chaining hash table with string keys. Some of this is caused by rather primitive reuse mechanisms in the programming language, some by ignorance of the existing code.

Note that even if I have argued that an average four line patch or twenty lines of new code shouldn't deserve copyright protection, they may nevertheless be protected by patents. Also, if it can be shown that your four-line patch or twenty-line function was a derived or copied work - for example because you might not be sufficiently skilled in programming - then none of the above arguments will help you.

Tuesday, January 10, 2006

Crossword Puzzles

A trivial backtracking search to generate an N*N crossword puzzle consumes a depressingly rapidly growing running time O(c^(2*N)), where c is the number of N-letter words. On my computer all 3*3-crosswords are found instantly, 4*4-crosswords in a day, and 5*5-crosswords would take longer than sun shines.

The key to make crossword puzzle generation efficient is in to check immediately when laying a new letter of a word in a puzzle that the crossing direction is not impossible to satisfy. For example, if I place the word "take" immediately above "time" at the top of the puzzle, it will be impossible to find crossing words which begin with letters "tt" or "km" later in the search. This prunes the search space increasingly efficiently as N grows, because while some 31% of all digraphs occur in some word, only 8% of all trigraps and less than 1% of all 4-graphs occur in some word. Furthermore not all combinations occur in any place, for example both digraphs "tt" and "km" do occur in some words, but neither can begin a valid word.

I also wrote hints of varying difficulty to a number of the words. To tease you, here's an easy 3*3-puzzle:
  impregnate color -----+
celestial rotation -+ !
beer -------------+ ! !
! ! !
increase ->
put down ->
visual organ ->

And a very hard one:
  protective clothing -+
anecdotes ---------+ !
metal bar -------+ ! !
! ! !
chatter ->
cuckoo ->
small amount ->

There are numerous puzzles when N is 3, 4 and even 5, but my program found only two 6*6 crossword puzzles when using an old /usr/dict/words without names of places and people:
a c c u s e
p r o p e l
h a n d e d
i n v a d e
d i e t e r
s a y e r s

The second 6*6-crossword is of course the transpose of the above one.

Limits of copyrightability, side note

Apple has sued and apparently won a preliminary injunction against a German company Liquid Air Lab for the name "spod" for its podcasting service. If this decision holds, apparently every product or service name one latter and one capitalization change away from "iPod" could be successfully threatened by Apple, especially if they are somehow related to mobile listening of music. That's over 400 names for the price of one, including the word "apod" for "footless", Earth Picture of the Day, APOD, etc. In particular, it is a painfully large share of all pronounciable four letter names.

This relates to my previous posts on copyrights in the sense that overly wide interpretations of trademarked names pollute the namespace, and soon it may be impossible to name a product without fear of having to rename it. Of course, for big products such as new car models a proper proximity search in trademark databases is not a big overhead, but for small companies and open source projects it is.

Also, the state earns quite a bit from trademark fees. One would assume that some of that money would be funneled into setting up a free web site that searches similar trademarks and gives a legally binding decision if it based on predefined rules can deduce that a proposed name violates no trademark. Although nowadays one can trademark colors (such as Fazer's blue in the context of sweets), sounds and music (such as the Nokia tune in the context of mobile phones), it would be sufficient to limit to textual trademarks. No trivial thing, but a reasonable research project proposal, methinks.

On Taxation

I participated recently in an interesting discussion about taxation and its fairness. Some of the participants were quite well off, and as winners in that game they called taxes on properties to "envy tax" and estate tax to "death tax", and generally dismissed all contrary opinions as leftist propaganda. While I definitely don't regard myself as leftist, intellectual honesty demanded me to oppose some of their views on taxation. Here's why.

Outside of social aspects such as evening out the income gap, taxes serve a simple purpose: share the cost of common purchases and thereby obtain savings in scale. For example, if we wish to have a state where people needn't hire private armies to defend our borders and guards to keep our property safe, we gather a sum of money through taxes and hire a common army, police force, legislature process, courts of law, and prison system which exercise a monopoly of violence within our borders. (Incidentally, that is what defines a state.) Currently almost all of that system is paid by income and value added tax as they form an overwhelming majority of our budget's income.

But now consider who has most to lose if the army, police force, law, and criminal system would cease to exist. Rampant theft and vigilantism would follow. Except for our lives, what do we have to lose? Yes, our property! Eventually our society would degrade to stone age where no-one could safely own more than he could carry with him, guard with sight and hearing and run to protect before the thief is gone. So the ones that benefit most of our costly system for maintaining safety and order are the ones who own a lot of properties (land, buildings, valuables, money, shares, promissory notes, whatever). However, property tax pays only a negligible fraction of the costs associated with protecting the properties of people, and it is now being lifted altogether. I don't see the fairness in that.

Estate (inheritance) tax is argued againts by calling it multiple taxation, as the same property is taxed first when the deceased earned it and a second time when it is passed on to the children. However, this argument doesn't hold for two reasons: Firstly, why should property in estates not be taxed when they change hands while essentially all other exchanges of ownership is taxed? Secondly, the one who pays the estate tax is the recipient of the estate, not the deceased, so there's really no double taxation occurring at all!

In my view estate tax can be used to replace property tax simply because everybody dies some day, and whether the taxation is done after death or while living is irrelevant. Therefore now that property tax is removed, estate tax should be sufficiently large to cover a fair share of our common infrastructure that protects property. Nevertheless, in my mind property and estate tax carried by the Finnish state should roughly equal the sum consumed by the interior, defence and justice ministeries, or roughly 5 billion euros.

Sunday, January 08, 2006

Styles of Speaking

A group of Finns were having a conversation with each other in a crowded restaurant in Rome. Suddenly the lights went off because of a power outage. The Finns continued their conversation unabated, but all Italians went silent. Why? Because Italians gesture intensively with their hands, but that couldn't be seen in the darkness.

Limits of copyrightability, part 2

In a previus post I concluded that works shorter than 110 bits of information do not deserve copyright because the risk of two persons incidentally producing an identical work and thus risking a false conviction would be unacceptably high. We saw that this was achieved when (1-2^-n)^(p*k*k) was at most one minus the acceptable risk 10^-6. Here n is the length of the work, p the number of other people producing copyrightable works, k the number of bits of information everyone on average produces in his lifetime.

We can expand this approach to cover similarity simply by regarding each new work as representing s similar new works. In other words we require (1-2^-n)^(s*p*k*k) to be at most one minus the acceptable risk 10^-6. The difficulty, however, lies in defining s. In text we might assume s covers, for example, changes in tense, mode, person, sentence structure, word order, dialect or language, or systematic changes of words with their synonyms, names with new ones, etc. Based purely on intuition I would guess s must be at least millions, before the resulting sentences in a natural language appear significantly different. If we choose s = 10^6, then the minimum information content of a copyrightable work grows from 110 bits to 130 bits, or from 100 to 216 letters.

While the above rise may not appear significant, it already seems to imply that no individual verse of Haiku should be copyrightable. Recall that typical western-style Haiku are 5+7+5 syllables long, such as the 82-letter Haiku with which David Dixon won the Salon magazine's Haiku contest:
Three things are certain:
Death, taxes, and lost data.
Guess which has occurred.

However, in defense we might argue that not p = 10^9 people write Haiku verses all of their life. If we instead take p = 10^6 and k = 100 for one million people writing one hundred verses, it would imply n = 73, or from 56 to 121 letters. Considering the the first 45 letters from Dixon's Haiku occur in several other phrases (try it with Google, for example), I'ld still claim that this Haiku probably shouldn't deserve copyright according to the arguments I've laid out.

Nevertheless, I think Dixon's Haiku is great, and as someone who has had all but the first occur personally and repeatedly, I think Dixon deserved to win.

Next week I hope to write a little about how all my reasoning applies to programming and copyrightability of source code.

Thursday, January 05, 2006

More on Caesar

My previous entry left me thinking about the likelihood of being a descendant of Julius Caesar. For example, my boss, of noble birth and a long recorded genealogy, is a descendant of Charlemagne at least in two different lines, but because it is impossible to follow the blood line during the first turbulent millennium, I have to resort to probability estimates.

Caesar had two children, Julia and Caesarion, who later became Ptolemaios 15, king of Egypt, and a stepson Gaius Octavius, who later became Augustus Caesar. Undoubtably Caesar had many secret relationships, also heterosexual ones, and we can hereby presume that his genes have had better than average chances of replicating. If at 50 B.C. the world population was roughly 200 million, we can assume that at least two of them were descended of Julius Caesar.

We assume descedants of Caesar are equally productive to non-descendants. The only way to produce a non-descendant is that both parents are non-descendants. If we denote with P(i) the probability that a random person i generations after Caesar is not his descendant, then P(i+1) = P(i)^2, which simplifies into P(i) = P(0)^(2^i). After some numerics, we see that in 23 generations, or around 500 A.D. nearly half of the world population would be descended from Caesar, and by today practically everybody.

Of course this is hardly quite correct, because some populations live in such physical and cultural separation that descendency may not have reached them before the 19'th century. Even we Finns are considered somewhat isolated, but by no means entirely and in any case two millenia is ample of time for some traveling warrior, trader, or priest to seed some of my ancestors.

Ok, so do I now feel any Caesareity? Nope, not even the slightest foretelling sting in my torso. And by the way, it was Shakespeare who put the words "Et tu, Brute" in Caesar's mouth. Suetonius claims Caesar's last words to have been "You too, my son" in Greek. Some sources also claim that he cried "This is violence!" (in Latin) when stabbed for the first time, but perhaps during over 20 stabs you have time to think of a more eloquent last words.

Homeopathic Caesar Urine

Julius Caesar (100-44 B.C) lived 55 years, and assuming he urinated two liters every day, there should be approximately 40 cubic meters of water from his urine. Calculations I omit from here suggest that photosynthesis, the most likely method of breaking water molecules down in nature, has had a modest affect on this amount. Therefore we can approximate all that water has survived to this day and dispersed evenly on our planet during the past two millenia.

Dividing the 40 cubic meters with the total volume of all oceans we obtain that one molecule in every 3.6 * 10^15 today is genuine "chez Caesar". The strongest homeopathic medicines have been diluted up to 10^30, but 10^15 should already have a clear effect. Unfortunately I can't quite figure out what it should cure - megalomania, or perhaps laurel rash?

Another way of looking at this is to say that the Coca-Cola company has sold half a millilitre of Caesar's urine world-wide during its entire operation. I guess that's the "secret ingredient".

Tuesday, January 03, 2006

Outlaw Radiators!

Production of electric radiators and other low-temperature electric heating elements should be outlawed. They should be replaced by cheap computing elements which produce heat as a by-product. Mass production of such devices would further for example SETI@home and other easily parallellisable tasks. The increased production cost would be covered by selling CPU time.

Longest Shortest Path

A shortest path connects two words, for example "dos" and "lnx", with the shortest possible list of words which differ only in one letter, for example "dog"-"log"-"lug"-"lux". Surely it must be only coincidental that dos is close to dog and lnx close to lux, which among other meanings is Latin for light.

Given a vocabulary of N words it relatively easy to construct an efficient program which computes the shortest path between any two words. But what is the best algorithm for finding the longest shortest path? One could use the Floyd-Warshall algorithm which takes O(N^3) time, or Johnson's algorithm which takes O(N^2 log N) time. Given N is perhaps a few thousand, these algorithms are already palatable in modern computers.

The actual findings depend of course heavily on the given vocabularies, but below are some results using chosen-length subsets of an old /usr/dict/words for English words: Perhaps it is only because of my skeptical tendencies, but for some reason "chi", the circulating life energy in Chinese philosophy, and "ass" come up when searching the longest shortest path in words of three letters. The intervening words are "phi"-"poi"-"pow"-"row"-"roe"-"rye"-"aye"-"are"-"ark"-"ask". Similarly suspiciously the search in five letter long words produced "toxin" and "quack" with a whopping 44 words in between them. Only the four letter words seemed to refer to something less quacky: "zeta"-"unit" or "lava"-"chum" with 15 intervening words.