Monday, July 9, 2018

Shalizi On Jaynes’ Arrow Of Time

Though early, grossly Orientalist, commentators* described Jnana Yogi Swami Cosma Rohilla Shalizi in terms similar to Descartes, it is clear from the historical record that Sw. Shalizi wrote his works in robes of saffron and kermes. As a small contribution to restoring historical reality, this paper demonstrates how the recent publication by Carnegie-Mellon of Sw. Shalizi’s deutero-canonical writings throws new light on one of the canonical Shalizi texts in the Cornell Canon.

Some historical background is necessary. It is well known that Sw. Shalizi was a controversialist in the school of Guru Josiah Gibbs, opposing the faction led by Adhyapakah Edwin Jaynes. Gershom Scholem’s wonderful essay on the politics of saints “Religious Authority & Mysticism” gives a wonderful examination of how Sw. Shalizi likely felt about his work.

Adhyapakah Jaynes proclaimed, in seeming harmony with the tradition from Laplace to Schopenhauer, that the so-called “thermodynamic functions” - heat, pressure, etc. - had a mere conventional existence. Reality - the microstate - is something much stranger.

Against this preaching, Sw. Shalizi proposes that the true tradition would affirm the existence of heat even when there was no observer present. There seems to be an error in transcription in the Carnegie-Mellon text, as sometimes the text seems to be implying that without thermodynamic functions there would be no history prior to consciousness. But evolution of the objective microstate is not a function of the subjective macrostate. Fortunately, examination of the Cornell Canon is sufficient to demonstrate that Sw. Shalizi was aware of this.

Sw. Shalizi’s objection was much more subtle. Adhyapakah Jaynes essentially denied Mahapandita Ludwig Boltzmann’s claim that large numbers of particles play a role in thermalization, arguing that subjective ignorance is sufficient. Of course, Sw. Shalizi could not stand such a rejection of tradition.

We now move to the main text, Sw. Shalizi’s controversial article “False Jnanachakra”. In it, Sw. Shalizi imagines a contest between a mighty asura Andhaka and Bhagavan Shiva. The Mahayogi bets that if Andhaka can turn the Jnanachakra then Andhaka may have a night with Parvati. Of course, Andhaka cannot resist such a carnal opportunity. The price he pays is this: if Andhaka cannot but Shiva can, then Andhaka must die and live the rest of his lives in non-violence and celibacy.

The Jnanachakra is a weightless jade wheel with a single needle along its rim which may be lowered or raised. It has no effect on the pendulum. The Jnanachakra, as its name suggests, is merely a metaphor for the player’s knowledge. Beneath the Jnanachakra is an n-ary pendulum whose fobs beyond the first are invisible. Based only on macroscopic observation, the player attempting to turn the wheel must raise or lower the needle which will be pushed by the pendulum in the correct direction.

The observable state space of a single visible fob is a cylinder with one axis, up-down, as the velocity of the fob and the other axis, around, as the position (angle) of the fob. The same for the Jnanachakra. Holding the current velocity of the Jnanachakra constant gives us a circle around the phase space. We want the needle down only when the fob is above the current circle in phase space.

Andhaka’s strategy is to lower the needle in front of the fob when the fob is swinging right faster than the Jnanachakra. He believes this will allow him to rotate the Jnanachakra counterclockwise. For a short time, this works. But the hidden fobs cause the visible fob to bounce randomly on a long timescale - a timescale which can be calculated from coupling of the fobs (cite pg 44 Chaos & Coarse Graining In Stat Mech). Eventually, Andhaka’s predictions will be no better than chance. Therefore Andhaka’s strategy will fail in the long run.

Now the Mahabuddhi goes to the wheel. He follows a similar strategy. At first the Jnanachakra seems to thermalize - bounce randomly. But soon the wheel begins turning counterclockwise. Why? Shiva can *learn* the motion of hidden fobs by observing the system at large. Since the state space of the system is fixed, in the long run Shiva learns the whole system perfectly. Following the teachings of Guru Shannon, Sw. Shalizi calculates the rate at which one may learn and finds it is either 0 (for Andhaka) or exponentially rapid (for Shiva).

Now Sw. Shalizi makes his attack. Nrityapriya’s dance requires only learning a finite state space and converges rapidly. Why cannot Andhaka do the same? Sw. Shalizi challenges the anti-Boltzmannian to explain without reference to the enormous size of the state space. If he cannot, then the anti-Boltzmannian has admitted that an objective property of reality - the dimension of the state space - plays the role Adhyapakah Jaynes has denied.

The case for Bhagavan Shiva and against the asura Andhaka can be made explicit. Start by recalling that the rate of learning is either zero or exponential. 
Andhaka’s capacity to learn is constant. As the dimension of the state space - the number of fobs - grows large, one soon finds that the asura cannot converge on the true state because he doesn’t have enough memory to hold the state in his mind. Therefore his learning rate is zero.
But Akshayaguna is different. His capacity to learn grows large holding the dimension of the state space constant. Therefore his learning rate is exponential.

There is no denying that Sw. Shalizi’s logic is absolutely sound. But before the publication of Three Toed Sloth, the full extent of his argument couldn’t be appreciated. Sw. Shalizi's implicit claim is that subjective arrows of time must have a consistent direction only if the state space of possibilities is much larger than the capacity of any relevant learner. But Sw. Shalizi’s Cornell Canon piece doesn’t explicitly argue why the learning capacity of a mere asura is insufficient.

Only now are we learning that Sw. Shalizi intended this work as part of a campaign of Arhat Ashby against the Jaynesians. Sw. Shalizi intends to use Ashby’s Law Of Requisite Variety, showing that a model of a system must be as complex as the system itself. This demonstrates that nobody but one who is unified with Brahman may have a backwards arrow of subjective time.

In sum, Carnegie-Mellon’s publication of Shalizi’s deutero-canonical work has opened new fields for scholarship. Already his brief note has been revealed to be more than a mere grumbling of a lover of controversy. His is a philosophical distinction between the monotonic subjective arrow of time of an ordinary being and the free arrow of time of an extraordinary one. All philologists of this canon must take note.
*In case it isn't clear: this is written as a parody of Western commentary on Eastern religion - thus all the Sanskrit jargon despite none of the figures speaking the language. Why? Well, it seemed funny at the time. Of course, this blogpost shouldn't be considered scientific, much less serious theology.

Saturday, March 17, 2018

Two Dogmas Of Bayesianism

W V O Quine

There are different levels of disagreeing with a theory. To illustrate, we can imagine ourselves the central planner dispensing funds for researchers. The lowest level of disagreement is outright rejection - if a plan for a spaceship begins with "Assume we can violate the laws of thermodynamics..." I would simply not disburse any funds. Above this, there is non-fundamentalness. If I see a numerical simulation for a spaceship engine which I know doesn't respect energy conservation exactly, I might not dismiss it out of hand but would pay for investigation into whether the simulation flaws are fundamental. Off to the side, above rejection and below non-fundamentalness but not between either, there is suspicion. To speak ostensively: WVO Quine's attitude toward Carnap's dogmas was suspicion - he neither thought them to be wastes of paper nor without flaw. The notion of "analyticity" - words being true by definition - seemed both worth investigating and a bad foundation for the reconstruction of scientific investigation.

As Quine was suspicious of any empiricism that took analyticity as atomic, I am suspicious of the philosophies that in the main call themselves "Bayesianism". What exactly Bayesianism consists of and when it began is not easy to say because self-described Bayesians do not all agree with one another.

The followers of de Finetti and Savage would trace Bayesianism to the logicians Ramsey and Wittgenstein. And indeed, whoever you think deserves credit as the origin of Bayesian philosophy, one must grant that Ramsey & Wittgenstein gave unusually clear statements of Bayesian philosophy. In Wittgenstein, the intuitions of Bayesian philosophy were expounded in the set of propositions 5.1* of his Tractatus. Ramsey gave these notions a more explicit development in his paper Truth And Probability. The particular style of the arguments on pages 19 - 23 of this paper has become known as "Dutch Book Arguments".

The general concept is simple. We start with a Wittgensteinian metaphysics: each possible state of the world corresponds to some proposition p. The propositions can be "arranged in a series" - there is an ordering Rpq such that there is an isomorphism between the propositions and the real numbers that respects the probability calculus. This ordering is something like "q is at least as good as p". Further, Ramsey realized, the relation R comes pretty near to an intuitive idea of rational behavior. And even more interesting is that the converse also holds (or nearly holds): "If anyone's mental condition violated these laws ... [he] could have a book made against him by a cunning bett[o]r and would then stand to lose in any event". This philosophical interpretation of two way equivalence between orderings on propositions and probability calculus is what I call the first dogma of Bayesianism.

Many other Bayesians - such as the followers of Jaynes - would trace it to the great physicists Laplace and Gibbs. The Bayesianism of early physicists is, granting that it exists at all, implicit and by example. Gibbs asked us to imagine a large collection of experiments floating in idea space. We should expect* that our actual experiment could be any one of those experiments, the great mass of which are functionally identical. Consider a classical ideal gas held in a stiff container. In the ensemble of possible experiments, there are a few where the gas is entirely in the lower half of the bottle. But in the great mass of possible experiments the gas has had time to spread through the bottle. Therefore, for the great mass of possibilities, the volume the gas occupies would be the volume of the bottle. This means that if we measure the temperature, we get the pressure for free. Using Iverson Brackets, we see that for each possible temperature t and pressure p, Prob(P=p|T=t) = [p = (nR/V)t] or thereabouts. The information we get out of measuring the temperature - that is, putting a probability distribution on temperature - is a probability distribution on pressure. The lesson the Gibbs example teaches ostensively is - supposedly - that what we want out of an experiment is a "posterior probability". This is the second dogma of Bayesianism.


Both dogmas are open to question. Let us question them.


It is almost too easy to pick on the Dutch Book concept. Ramsey himself expresses a great deal of skepticism about the general argument - "I have not worked out the mathematical logic of this in detail, because this would, I think, be rather like working out to seven places of decimals a result only valid to two.". Ramsey even gives a cogent criticism of the application of Dutch Book Argument (one foolishly tossed aside as ignorable by Nozick in The Nature Of Rationality):

"The old-established way of measuring a person's belief is to propose a bet, and see what are the lowest odds which he will accept. This method I regard as fundamentally sound; but it suffers from being insufficiently general, and from being necessarily inexact. It is inexact partly because of the diminishing marginal utility of money, partly because the person may have a special eagerness or reluctance to bet, because he either enjoys or dislikes excitement or for any other reason, e.g. to make a book."

The assumption "Value Is Linear Over Money" implies the Ramsey - von Neumann - Morgenstern axioms easily, but it is also a false empirical proposition. Defining a behavioristic theory in terms of behavior towards non-existent objects is the height of folly. Money doesn't stop being money because you are using it in a Bayesian probability example.

This brings us to the deeper problem of rationality in non-monetary societies. Rationality was supposed to reduce the probability calculus to something more basic. But now it seems to imply rationality was invented with coinage. Did rationality change when we (who is this 'we'?) went off the gold standard? These are absurd implications but phrasing a theory of rationality in terms of money seems to imply them. What about Dawkins' "Selfish Gene", isn't its rational pursuit of "self-interest" (in abstract terms) one of the deep facts about it? Surely this is a theory that might be right or wrong, not a theory that is a priori wrong, and certainly not one that is a priori wrong because genes don't care about money.

Related to this is that the Dutch Book hypothesis seems to accept that people are "irrationally irrational" - they take their posted odds far too literally. As phrased in the Stanford Encyclopedia Of Philosophy article on Dutch Books:

"An incoherent agent might not be confronted by a clever bookie who could, or would, take advantage of her, perhaps because she can take effective measures to avoid such folk. Even if so confronted, the agent can always prevent a sure loss by simply refusing to bet."

This argument is clarified by thinking in evolutionary terms. Let there be three kinds of birds: blue jays, cuckoos and dodos. Further assume that we've solved the problem from two paragraphs ago and have an understanding of what it means for an animal to bet. Dodos are trusting and give 110%. Dodos will offer and accept some bets that don't conform to the probability calculus (and fair bets). Cuckoos - as is well known - are underhanded and will cheat a dodo if they can. They will offer but never accept bets that don't conform to the probability calculus (and accept fair bets). Blue jays are rational. They accept and offer only fair bets. Assuming spatially mixed populations there are seven distinct cases: Dodos only, Blue Jays only, Cuckoos only, Dodo/Blue Jay, Dodo/Cuckoo, Cuckoo/Blue Jay and Dodo/Cuckoo/Blue Jay. Contrary to naive Bayesian theory, each of the pure cases is stable. Dodo-dodo interaction is a wash: every unfair bet lost is an unfair bet won by a dodo. Pure blue jay and pure cuckoo interactions are identical. Dodo/blue jay interactions may come out favorably to the dodo but never to the blue jay: the dodos can do favors for each other but the blue jays can neither offer nor accept favors from dodos or blue jays. The dodo is weakly evolutionarily dominant over the blue jay. Dodo/cuckoo interaction can only work out in the cuckoo's favor, but contrariwise a dodo can do a favor for a dodo but a cuckoo cannot offer a favor to a cuckoo (though it can accept such an offer). We'll call this a wash. Cuckoo/blue jay interaction is identical to cuckoo/cuckoo and blue jay/blue jay interaction, so neither can drive the other out. Finally, a total mix is unstable - dodos can drive out blue jays. The strong Bayesian blue jay is on the bottom. Introducing space (a la Skyrms) makes the problem even more interesting. One can easily imagine an inner ring of dodos pushing an ever expanding ring of blue jays shielding them from the cuckoos.
There are plenty of senses of the word "stable" under which such a ring system is stable.
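The count of seven population mixes is just the number of non-empty subsets of three species. A minimal sketch that enumerates them (the variable names are mine, chosen for illustration):

```python
from itertools import combinations

# The three kinds of birds from the thought experiment.
species = ["Dodo", "Blue Jay", "Cuckoo"]

# Every non-empty subset of species is one possible population mix.
mixes = [
    "/".join(combo)
    for size in range(1, len(species) + 1)
    for combo in combinations(species, size)
]

for mix in mixes:
    print(mix)
print(len(mixes), "cases")
```

Three of the mixes are pure, three are pairwise and one is the total mix.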

What does the above analysis tell us? We called Dodo/Cuckoo interactions a wash, but it really depends on the set of irrational offers Dodos and Cuckoos accept & reject/offer. As Ramsey says, for a young male Dodo "choice ... depend[s] on the precise form in which the options were offered him...". Ramsey finds this "absurd.". But to avoid this by assuming that we are living in a world which is functionally only blue jays seems an unforced restriction.


The logic underlying Bayesian theory itself is impoverished. It is a propositional logic lacking even monadic predicates (I can point to Jaynes as an example of Bayesian theory not going beyond propositional logic). This makes Bayesian logic less expressive than Aristotelian logic! That's no good!

Such a primitive logic has a difficult time with sentences which are "infinite" - have countably many terms - or even just have infinite representations. (An example infinite representation of the rational number 1/4 is a lazily evaluated list that spits out [0, ., 2, 5, 0, 0, ...]) Many Bayesians, such as Savage and de Finetti, are suspicious of infinite combinations of propositions for more-or-less the same reasons Wittgenstein was. Others, such as Jaynes and Jeffreys, are confident that infinite combinations are allowed because to suppose otherwise would make every outcome depend on how one represented it. Even if one accepts infinite combinations, there also is the (related?) problem of sentences which would take infinitely many observations to justify.
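To make the "lazily evaluated list" concrete, here is a minimal sketch using a Python generator. The function name `decimal_digits` is mine, and the sketch assumes a fraction between 0 and 1 so that the integer part is 0.

```python
from itertools import islice

def decimal_digits(numerator, denominator):
    """Lazily yield the decimal expansion of numerator/denominator, forever.

    Assumes 0 <= numerator < denominator, so the integer part is 0.
    """
    yield 0      # integer part
    yield "."    # decimal point
    remainder = numerator
    while True:
        remainder *= 10
        yield remainder // denominator   # next digit, by long division
        remainder %= denominator

# The representation is infinite; we only ever look at a finite prefix.
first_six = list(islice(decimal_digits(1, 4), 6))
print(first_six)  # [0, '.', 2, 5, 0, 0]
```

The generator never terminates on its own. Any finite observer only sees a prefix, which is the point of the worry about infinite representations.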


An example of a simple finite sentence that takes infinitely many experiments to test is "A particular agent is Bayesian rational.". This is a monadic sentence that isn't in propositional logic. Unless one tests every possible combination of logical atoms, there's no way of telling whether the next one is out of place. The non-"Bayes observability" of Bayes rationality has inspired an enormous amount of commentary from people like Quine and Donald Davidson who accept the Wittgensteinian logical behaviorism that Bayesianism is founded upon.


So, is this literature confused or getting at something deeper? I don't know. But I believe that one can now see the reason Dutch Book arguments are hard to knock down isn't because there aren't plenty of criticisms. The real reason is the criticisms don't seem to go to the core of the theory. A criticism of Dutch Book needs to be like a statistical mechanics criticism of thermodynamics - it must explain both what the Dutch Book gets as well as what it misses. There does seem to be something there which will survive.

What You Want Is The Posterior

Okay okay okay, be that as it may, isn't it the case that what we want is the posterior odds? If D is some description of the world and E is some evidence we know to be actual, then we want P(D|E), right? Well, hold your horses. I say there's the probability of a description given evidence and probability of the truth of a theory and never the twain shall meet. To see it, let's meet our old uncle Noam.

Chomsky was talking about other people. Who he thought he was talking about isn't important, but he was really talking about Andrey Markov and Claude Shannon. Andrey Markov developed a crude mathematical description of poetry. He could write a machine with two states "print a vowel" and "print a consonant". He could give a probability for transferring between those states - given that you just printed a vowel/consonant, what is the probability you will print a consonant/vowel? The result will be nonsense, but look astonishingly like Russian. You can get more and more Russian like words just by increasing the state space. Our friend Claude Shannon comes in and tells us "Look, in each language there are finitely many words. Otherwise, language would be unlearnable. Grammar is just a grouping of words - nouns, verbs, adjectives, adverbs. By grouping the words and drawing connections between them in the right way we can get astonishingly English looking sentences pretty quickly.".
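A minimal sketch of the two-state vowel/consonant machine described above. The transition probabilities and the Latin letters standing in for Russian ones are made up for illustration; they are not Markov's estimates.

```python
import random

# Hypothetical transition probabilities, chosen for illustration only.
# TRANSITIONS[state][next_state] = probability of moving to next_state.
TRANSITIONS = {
    "vowel":     {"vowel": 0.2, "consonant": 0.8},
    "consonant": {"vowel": 0.6, "consonant": 0.4},
}

# Placeholder alphabets (Latin letters standing in for Russian ones).
LETTERS = {"vowel": "aeiou", "consonant": "bdklmnprstv"}

def generate(length, start="consonant", seed=None):
    """Print one letter per step, then hop between the two states at random."""
    rng = random.Random(seed)
    state = start
    out = []
    for _ in range(length):
        out.append(rng.choice(LETTERS[state]))
        next_states = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights)[0]
    return "".join(out)

sample = generate(20, seed=0)
print(sample)
```

Enlarging the state space (say, one state per letter or per pair of letters) is what makes the output look more and more like the target language.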

Now our "friend" Chomsky. He says "Look, you have English looking strings. They're syntactically correct. But you don't have English! The strings are not semantically constrained. There is no way to write a non-trivial machine that both passes Shakespeare and fails a nonsense sentence like 'Colorless green ideas sleep furiously.'! In order to know that sentence fails the machine needs to know ideas can't sleep, ideas can't be green, green things can't be colorless and one cannot sleep in a furious manner.".

Now, this is a scientific debate about models for language. Can it be put in a Bayesian manner? It cannot. The Markov-Shannon machine - by construction - matches the probabilistic behavior of the strings of the language. But it cannot be completely correct (otherwise, the regular languages would be all the languages a Turing machine can recognize). Therefore, the statement that "What we want is the posterior odds!" cannot be entirely true.

I have said before that I am in the "Let Ten Thousand Flowers Bloom" school of probability. I think these arguments are interesting and worth considering. But it is also clear that despite books like Nozick's The Nature Of Rationality and Jaynes' similar tome, Bayesian philosophy does not replace the complicated mysteries of probability theories with simple clarities.

There is a further criticism that applying behavioristic analysis to research papers is unwise but I will make that another day.

* Both Ramsey's paper and Gibbs' book assume that the expectation operator exists for every relevant probability distribution. This is a minor flaw that can be removed without difficulty, so I will not mention it again.

Friday, January 12, 2018

Absolute And Comparative Advantage



Adam Smith

Adam Smith imagined two firms with a choice of production of two products*. The two firms' entrepreneurs are Alice and Bob. Alice and Bob have the choice of producing two independent goods - maybe apples and blackboards. The firms have what has become known as "constant technology", the ratio of their outputs and their inputs is a constant independent of said quantities (different for each firm and product to avoid indeterminacy). Unemployment is ignored in both inputs and outputs - all inputs are bought and all output sold (Say's Law). The law of one price holds.

Each entrepreneur is then left with only one decision: how much of each good shall I make?

Leonid Kantorovich


The fastest way to this answer is through the theory of linear programming. Start by denoting \( L_{firm,product}\) the amount of labor Alice or Bob uses to make apples or blackboards. If \(L \) is the amount of labor available in society, then we have as a constraint

\[ L_{Alice,apples} + L_{Bob,apples} + L_{Alice,blackboards} + L_{Bob,blackboards} = L \]

The outputs are denoted \( Y_{firm,product} \) and the prices are \(p_{product}\) so that the total output is

 \(Y = p_{apples} (Y_{Alice,apples}+Y_{Bob,apples})+p_{blackboards} (Y_{Alice,blackboards}+Y_{Bob,blackboards}) \).

Finally, the technical coefficients are \( a_{firm,product} = Y_{firm,product} / L_{firm,product} \). Our goal then is to maximize \(Y\) in the above equation. We know from LP theory* that only the vertices matter. Why? Well, the short answer is interior point optimization. Look at this picture:


The dark lines are the constraints and the darkened point is a guess at the optimum. The line through that point has a slope equal to the price ratio. It's obvious that we get more output by moving to a guess in the direction of the arrows that is still feasible. The triangle made by the price ratio and the constraints (and its interior) are the "interior points". If an interior point set is one point, it must be the optimum. It's obvious that unless the price ratio is exactly the same slope as one of the boundaries, the only way to squeeze the interior point set down to one point is to choose one of the vertices.

Okay, so let's cycle through those vertices. The solution where no labor is used is the global minimum (the entire feasible set is the interior point set!). This also doesn't match the full employment constraint, so toss it.

Next, there's the case where only one firm is buying labor to produce output (so that three \( L_{i,j}=0\) and one \( L_{i,j}=L\)). We'll call this the \(Y_{one}\) solution as in "Why one?". Mathematically, it is written

\[ Y_{one} = \max_{i,j}(p_j a_{i,j}L) \]
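As a sanity check on the formula, here is a small sketch that computes \(Y_{one}\) directly. All the prices, technical coefficients and the labor supply are hypothetical numbers picked for illustration.

```python
# Hypothetical numbers, for illustration only.
prices = {"apples": 2.0, "blackboards": 5.0}   # p_j
tech = {                                       # a_{i,j}: output per unit of labor
    ("Alice", "apples"): 3.0,
    ("Alice", "blackboards"): 1.0,
    ("Bob", "apples"): 2.0,
    ("Bob", "blackboards"): 1.5,
}
total_labor = 10.0                             # L

def one_firm_solution(prices, tech, total_labor):
    """Return (Y_one, (firm, product)): all labor goes to the single best use."""
    best = max(tech, key=lambda ij: prices[ij[1]] * tech[ij])
    firm, product = best
    return prices[product] * tech[best] * total_labor, best

y_one, (firm, product) = one_firm_solution(prices, tech, total_labor)
print(y_one, firm, product)  # 75.0 Bob blackboards
```

With these numbers the revenue per unit of labor is 6 and 5 for Alice and 4 and 7.5 for Bob, so all labor goes to Bob's blackboards.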

The next class of solutions is the instructive one. In these vertices, the non-negativity constraint is active on two possibilities - that is to say: labor is being hired for two reasons. But there's a problem. If both entrepreneurs are making apples, then \(Y = p_{apples} (a_{Alice,apples} L_{Alice,apples} + a_{Bob,apples}L_{Bob,apples}) \). But if Alice is more efficient, then this can't be a maximum, because moving some labor from Bob's firm to Alice would increase output. This knocks out the two competitive vertices and leaves the four non-competitive vertices. Either one firm allocates labor to both products (and the other is dead) or two firms specialize in one product.

Read the above paragraph again, slowly. It's important to understand the next part. If you look at the vertices with three or all four labor hiring reasons, then the accounting for \( Y \) always has at least one term like the above. For instance, one of the three term vertices is \(Y = p_{apples} (a_{Alice,apples} L_{Alice,apples} + a_{Bob,apples}L_{Bob,apples}) + p_{blackboards} a_{Bob,blackboards} L_{Bob,blackboards} \). But this can't be a maximum, because of the above paragraph - if Alice is more efficient in apple growing we can get more output by moving some labor from Bob's firm to hers.

There are therefore only three possible classes of extreme vertices: Alice or Bob make one thing, Alice or Bob make everything and finally Alice and Bob specialize.

David Ricardo

Phew! The great economist David Ricardo was uncomfortable with Smith's story. The vertices where one firm did everything and the other was non-existent troubled him.

Why would a non-existent firm trouble someone? Well, Ricardo was exploring an analogy between firms and nations. Nations have different technical coefficients - nobody needs to explain why California produces more wine than Utah or why Kazakhstan produces more uranium than Italy. But the idea that a country could "dominate" and produce everything was very troubling - not to mention that it seemed contrary to the facts. There was a political economy problem as well. Absolute advantage seems to suggest that global and local output could be at cross purposes - global output might be maximized at the minimum of local output.

Ricardo "solved" this problem by ... restricting free trade! Yes, the classic argument for free trade assumes trade restrictions. Ricardo supposed that labor and capital couldn't move over national borders, but that consumption goods can.

John von Neumann

In the language of Linear Programming, Ricardo breaks the one labor constraint in the above problem into two:

\[ L_{Alice,apples} + L_{Alice,blackboards} = L_{Alice} \]

\[ L_{Bob,apples} +L_{Bob,blackboards} = L_{Bob}\]


Now the one-firm-produces-everything equilibria are knocked out - they don't satisfy the above constraints. How the firms will specialize depends on the price ratio and technical coefficients, but they must always produce.
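A small sketch of the two-constraint problem, under the assumption that prices are fixed and each firm simply puts all of its own labor into whichever product earns it the most per unit of labor. The numbers are hypothetical; Alice is deliberately given the higher technical coefficient in both goods.

```python
# Hypothetical numbers, for illustration only.
prices = {"apples": 2.0, "blackboards": 2.5}   # p_j
tech = {                                       # a_{i,j}: output per unit of labor
    ("Alice", "apples"): 3.0,
    ("Alice", "blackboards"): 2.0,
    ("Bob", "apples"): 1.0,
    ("Bob", "blackboards"): 1.5,
}
labor = {"Alice": 6.0, "Bob": 4.0}             # L_Alice, L_Bob (immobile)

def ricardo_allocation(prices, tech, labor):
    """For each firm, choose the product with the highest revenue per unit labor.

    Returns {firm: (product, value of output)}. Ties are broken arbitrarily.
    """
    plan = {}
    for firm, firm_labor in labor.items():
        product = max(prices, key=lambda j: prices[j] * tech[(firm, j)])
        plan[firm] = (product, prices[product] * tech[(firm, product)] * firm_labor)
    return plan

plan = ricardo_allocation(prices, tech, labor)
total_output = sum(value for _, value in plan.values())
print(plan, total_output)
```

Even though Alice is better at everything here, Bob's labor cannot move to her firm, so Bob still produces: Alice makes apples and Bob makes blackboards.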




Comparative advantage is often thought of as a "long run" view - this is the supposed justification for the assumption of full employment and full consumption. But if capital and labor can flow over borders, it is a short run view that ignores unemployment and underconsumption.


Or is there a better interpretation of how comparative advantage works?

* The Adam Smith of my imagination.