
Reprinted from THE PHYSICAL REVIEW, Vol. 106, No. 4, 620-630, May 15, 1957. Printed in U. S. A.

Information Theory and Statistical Mechanics

E. T. JAYNES
Department of Physics, Stanford University, Stanford, California

(Received September 4, 1956; revised manuscript received March 4, 1957)

Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum-entropy estimate. It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information. If one considers statistical mechanics as a form of statistical inference rather than as a physical theory, it is found that the usual computational rules, starting with the determination of the partition function, are an immediate consequence of the maximum-entropy principle. In the resulting "subjective statistical mechanics," the usual rules are thus justified independently of any physical argument, and in particular independently of experimental verification; whether or not the results agree with experiment, they still represent the best estimates that could have been made on the basis of the information available.

It is concluded that statistical mechanics need not be regarded as a physical theory dependent for its validity on the truth of additional assumptions not contained in the laws of mechanics (such as ergodicity, metric transitivity, equal a priori probabilities, etc.). Furthermore, it is possible to maintain a sharp distinction between its physical and statistical aspects. The former consists only of the correct enumeration of the states of a system and their properties; the latter is a straightforward example of statistical inference.

1. INTRODUCTION

The recent appearance of a very comprehensive survey¹ of past attempts to justify the methods of statistical mechanics in terms of mechanics, classical or quantum, has helped greatly, and at a very opportune time, to emphasize the unsolved problems in this field.

Although the subject has been under development for many years, we still do not have a complete and satisfactory theory, in the sense that there is no line of argument proceeding from the laws of microscopic mechanics to macroscopic phenomena, that is generally regarded by physicists as convincing in all respects. Such an argument should (a) be free from objection on mathematical grounds, (b) involve no additional arbitrary assumptions, and (c) automatically include an explanation of nonequilibrium conditions and irreversible processes as well as those of conventional thermodynamics, since equilibrium thermodynamics is merely an ideal limiting case of the behavior of matter.

¹ D. ter Haar, Revs. Modern Phys. 27, 289 (1955).

It might appear that condition (b) is too severe, since we expect that a physical theory will involve certain unproved assumptions, whose consequences are deduced and compared with experiment. For example, in the statistical mechanics of Gibbs² there were several difficulties which could not be understood in terms of classical mechanics, and before the models which he constructed could be made to correspond to the observed facts, it was necessary to incorporate into them additional restrictions not contained in the laws of classical mechanics. First was the "freezing up" of certain degrees of freedom, which caused the specific heat of diatomic gases to be only a fraction of the expected value. Secondly, the paradox regarding the entropy of combined systems, which was resolved only by adoption of the generic instead of the specific definition of phase, an assumption which seems impossible to justify in terms of classical notions.³ Thirdly, in order to account for the actual values of vapor pressures and equilibrium constants, an additional assumption about a natural unit of volume (h^{3N}) of phase space was needed. However, with the development of quantum mechanics the originally arbitrary assumptions are now seen as necessary consequences of the laws of physics. This suggests the possibility that we have now reached a state where statistical mechanics is no longer dependent on physical hypotheses, but may become merely an example of statistical inference.

That the present may be an opportune time to re-examine these questions is due to two recent developments. Statistical methods are being applied to a variety of specific phenomena involving irreversible processes, and the mathematical methods which have proven successful have not yet been incorporated into the basic apparatus of statistical mechanics. In addition, the development of information theory⁴ has been felt by many people to be of great significance for statistical mechanics, although the exact way in which it should be applied has remained obscure.

² J. W. Gibbs, Elementary Principles in Statistical Mechanics (Longmans Green and Company, New York, 1928), Vol. II of collected works.

³ We may note here that although Gibbs (reference 2, Chap. XV) started his discussion of this question by saying that the generic definition "seems in accordance with the spirit of the statistical method," he concluded it with, "The perfect similarity of several particles of a system will not in the least interfere with the identification of a particular particle in one case with a particular particle in another. The question is one to be decided in accordance with the requirements of practical convenience in the discussion of the problems with which we are engaged."

⁴ C. E. Shannon, Bell System Tech. J. 27, 379, 623 (1948); these papers are reprinted in C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (University of Illinois Press, Urbana, 1949).

In this connection it is essential to note the following. The mere fact that the same mathematical expression −Σ p_i log p_i occurs both in statistical mechanics and in information theory does not in itself establish any connection between these fields. This can be done only by finding new viewpoints from which thermodynamic entropy and information-theory entropy appear as the same concept. In this paper we suggest a reinterpretation of statistical mechanics which accomplishes this, so that information theory can be applied to the problem of justification of statistical mechanics. We shall be concerned with the prediction of equilibrium thermodynamic properties, by an elementary treatment which involves only the probabilities assigned to stationary states. Refinements obtainable by use of the density matrix and discussion of irreversible processes will be taken up in later papers.

Section 2 defines and establishes some of the elementary properties of maximum-entropy inference, and in Secs. 3 and 4 the application to statistical mechanics is discussed. The mathematical facts concerning maximization of entropy, as given in Sec. 2, were pointed out long ago by Gibbs. In the past, however, these properties were given the status of side remarks not essential to the theory and not providing in themselves any justification for the methods of statistical mechanics. The feature which was missing has been supplied only recently by Shannon⁴ in the demonstration that the expression for entropy has a deeper meaning, quite independent of thermodynamics. This makes possible a reversal of the usual line of reasoning in statistical mechanics. Previously, one constructed a theory based on the equations of motion, supplemented by additional hypotheses of ergodicity, metric transitivity, or equal a priori probabilities, and the identification of entropy was made only at the end, by comparison of the resulting equations with the laws of phenomenological thermodynamics. Now, however, we can take entropy as our starting concept, and the fact that a probability distribution maximizes the entropy subject to certain constraints becomes the essential fact which justifies use of that distribution for inference.

The most important consequence of this reversal of viewpoint is not, however, the conceptual and mathematical simplification which results. In freeing the theory from its apparent dependence on physical hypotheses of the above type, we make it possible to see statistical mechanics in a much more general light. Its principles and mathematical methods become available for treatment of many new physical problems. Two examples are provided by the derivation of Siegert's "pressure ensemble" and treatment of a nuclear polarization effect, in Sec. 5.

2. MAXIMUM-ENTROPY ESTIMATES

The quantity x is capable of assuming the discrete values x_i (i = 1, 2, ..., n). We are not given the corresponding probabilities p_i; all we know is the expectation value of the function f(x):

⟨f(x)⟩ = Σ_{i=1}^{n} p_i f(x_i).   (2-1)

On the basis of this information, what is the expectation value of the function g(x)? At first glance, the problem seems insoluble because the given information is insufficient to determine the probabilities p_i.⁵ Equation (2-1) and the normalization condition

Σ_i p_i = 1   (2-2)

would have to be supplemented by (n − 2) more conditions before ⟨g(x)⟩ could be found.

⁵ Yet this is precisely the problem confronting us in statistical mechanics; on the basis of information which is grossly inadequate to determine any assignment of probabilities to individual quantum states, we are asked to estimate the pressure, specific heat, intensity of magnetization, chemical potentials, etc., of a macroscopic system. Furthermore, statistical mechanics is amazingly successful in providing accurate estimates of these quantities. Evidently there must be other reasons for this success, that go beyond a mere correct statistical treatment of the problem as stated above.
⁶ The problems associated with the continuous case are fundamentally more complicated than those encountered with discrete random variables; only the discrete case will be considered here.
⁷ For several examples, see E. P. Northrop, Riddles in Mathematics (D. Van Nostrand Company, Inc., New York, 1944), Chap. 8.
⁸ H. Cramér, Mathematical Methods of Statistics (Princeton University Press, Princeton, 1946).
⁹ W. Feller, An Introduction to Probability Theory and its Applications (John Wiley and Sons, Inc., New York, 1950).

This problem of specification of probabilities in cases where little or no information is available, is as old as the theory of probability. Laplace's "Principle of Insufficient Reason" was an attempt to supply a criterion of choice, in which one said that two events are to be assigned equal probabilities if there is no reason to think otherwise. However, except in cases where there is an evident element of symmetry that clearly renders the events "equally possible," this assumption may appear just as arbitrary as any other that might be made. Furthermore, it has been very fertile in generating paradoxes in the case of continuously variable random quantities,⁶ since intuitive notions of "equally possible" are altered by a change of variables.⁷ Since the time of Laplace, this way of formulating problems has been largely abandoned, owing to the lack of any constructive principle which would give us a reason for preferring one probability distribution over another in cases where both agree equally well with the available information.

For further discussion of this problem, one must recognize the fact that probability theory has developed in two very different directions as regards fundamental notions. The "objective" school of thought⁸,⁹ regards the probability of an event as an objective property of that event, always capable in principle of empirical measurement by observation of frequency ratios in a random experiment. In calculating a probability distribution the objectivist believes that he is making predictions which are in principle verifiable in every detail, just as are those of classical mechanics. The test of a good objective probability distribution p(x) is: does it correctly represent the observable fluctuations of x?

On the other hand, the "subjective" school of thought¹⁰,¹¹ regards probabilities as expressions of human ignorance; the probability of an event is merely a formal expression of our expectation that the event will or did occur, based on whatever information is available. To the subjectivist, the purpose of probability theory is to help us in forming plausible conclusions in cases where there is not enough information available to lead to certain conclusions; thus detailed verification is not expected. The test of a good subjective probability distribution is: does it correctly represent our state of knowledge as to the value of x?

Although the theories of subjective and objective probability are mathematically identical, the concepts themselves refuse to be united. In the various statistical problems presented to us by physics, both viewpoints are required. Needless controversy has resulted from attempts to uphold one or the other in all cases. The subjective view is evidently the broader one, since it is always possible to interpret frequency ratios in this way; furthermore, the subjectivist will admit as legitimate objects of inquiry many questions which the objectivist considers meaningless. The problem posed at the beginning of this section is of this type, and therefore in considering it we are necessarily adopting the subjective point of view.

Just as in applied statistics the crux of a problem is often the devising of some method of sampling that avoids bias, our problem is that of finding a probability assignment which avoids bias, while agreeing with whatever information is given. The great advance provided by information theory lies in the discovery that there is a unique, unambiguous criterion for the "amount of uncertainty" represented by a discrete probability distribution, which agrees with our intuitive notions that a broad distribution represents more uncertainty than does a sharply peaked one, and satisfies all other conditions which make it reasonable.⁴

In Appendix A we sketch Shannon's proof that the quantity which is positive, which increases with increasing uncertainty, and is additive for independent sources of uncertainty, is

H(p₁, ..., p_n) = −K Σ_i p_i ln p_i,   (2-3)

where K is a positive constant. Since this is just the expression for entropy as found in statistical mechanics, it will be called the entropy of the probability distribution p_i; henceforth we will consider the terms "entropy" and "uncertainty" as synonymous.

¹⁰ J. M. Keynes, A Treatise on Probability (Macmillan Company, London, 1921).

¹¹ H. Jeffreys, Theory of Probability (Oxford University Press, London, 1939).
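
To make the claim concrete, the following minimal sketch (plain Python; the two five-point distributions are invented purely for illustration) evaluates Eq. (2-3) with K = 1 for a broad and for a sharply peaked assignment over the same set of values, showing that the broad one carries the larger entropy.

import math

def entropy(p, K=1.0):
    """Shannon entropy H = -K * sum(p_i ln p_i) of a discrete distribution."""
    assert abs(sum(p) - 1.0) < 1e-12
    return -K * sum(pi * math.log(pi) for pi in p if pi > 0.0)

broad  = [0.2, 0.2, 0.2, 0.2, 0.2]    # maximally noncommittal over 5 outcomes
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]

print(entropy(broad))    # ln 5 ~ 1.609, the largest possible value for n = 5
print(entropy(peaked))   # ~ 0.223: little remaining uncertainty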


It is now evident how to solve our problem; in making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information which by hypothesis we do not have. To maximize (2-3) subject to the constraints (2-1) and (2-2), one introduces Lagrangian multipliers λ, μ, in the usual way, and obtains the result

p_i = exp[−λ − μ f(x_i)].   (2-4)

The constants λ, μ are determined by substituting into (2-1) and (2-2). The result may be written in the form

⟨f(x)⟩ = −(∂/∂μ) ln Z(μ),   (2-5)

λ = ln Z(μ),   (2-6)

where

Z(μ) = Σ_i exp[−μ f(x_i)]   (2-7)

will be called the partition function.

This may be generalized to any number of functions f(x): given the averages

⟨f_r(x)⟩ = Σ_i p_i f_r(x_i),   (2-8)

form the partition function

Z(λ₁, ..., λ_m) = Σ_i exp{−[λ₁ f₁(x_i) + ... + λ_m f_m(x_i)]}.   (2-9)

Then the maximum-entropy probability distribution is given by

p_i = exp{−[λ₀ + λ₁ f₁(x_i) + ... + λ_m f_m(x_i)]},   (2-10)

in which the constants are determined from

⟨f_r(x)⟩ = −(∂/∂λ_r) ln Z,   (2-11)

λ₀ = ln Z.   (2-12)

The entropy of the distribution (2-10) then reduces to

S_max = λ₀ + λ₁⟨f₁(x)⟩ + ... + λ_m⟨f_m(x)⟩,   (2-13)

where the constant K in (2-3) has been set equal to unity. The variance of the distribution of f_r(x) is found to be

Δ²f_r = ⟨f_r²⟩ − ⟨f_r⟩² = ∂²(ln Z)/∂λ_r².   (2-14)

In addition to its dependence on x, the function f_r may contain other parameters α₁, α₂, ..., and it is easily shown that the maximum-entropy estimates of the derivatives are given by

⟨∂f_r/∂α_k⟩ = −(1/λ_r) ∂(ln Z)/∂α_k.   (2-15)

The principle of maximum entropy may be regarded as an extension of the principle of insufficient reason (to which it reduces in case no information is given except enumeration of the possibilities x_i), with the following essential difference. The maximum-entropy distribution may be asserted for the positive reason that it is uniquely determined as the one which is maximally noncommittal with regard to missing information, instead of the negative one that there was no reason to think otherwise. Thus the concept of entropy supplies the missing criterion of choice which Laplace needed to remove the apparent arbitrariness of the principle of insufficient reason, and in addition it shows precisely how this principle is to be modified in case there are reasons for "thinking otherwise."

Mathematically, the maximum-entropy distribution has the important property that no possibility is ignored; it assigns positive weight to every situation that is not absolutely excluded by the given information. This is quite similar in effect to an ergodic property. In this connection it is interesting to note that prior to the work of Shannon other information measures had been proposed¹²,¹³ and used in statistical inference, although in a different way than in the present paper. In particular, the quantity −Σ p_i² has many of the qualitative properties of Shannon's information measure, and in many cases leads to substantially the same results. However, it is much more difficult to apply in practice. Conditional maxima of −Σ p_i² cannot be found by a stationary property involving Lagrangian multipliers, because the distribution which makes this quantity stationary subject to prescribed averages does not in general satisfy the condition p_i ≥ 0. A much more important reason for preferring the Shannon measure is that it is the only one which satisfies the condition of consistency represented by the composition law (Appendix A). Therefore one expects that deductions made from any other information measure, if carried far enough, will eventually lead to contradictions.

¹² R. A. Fisher, Proc. Cambridge Phil. Soc. 22, 700 (1925).

¹³ J. L. Doob, Trans. Am. Math. Soc. 39, 410 (1936).
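
As a numerical illustration of the recipe in Eqs. (2-4)-(2-7), the sketch below (plain Python; the values x_i and the datum ⟨f(x)⟩ = 2.5 are invented assumptions) finds the multiplier μ by solving ⟨f(x)⟩ = −∂ ln Z/∂μ with a simple bisection, and then forms the maximum-entropy probabilities and their entropy.

import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # possible values x_i (illustrative)
f = lambda xi: xi                    # the constrained function f(x); here f(x) = x
target = 2.5                         # the given datum <f(x)> (assumed)

def Z(mu):                           # partition function, Eq. (2-7)
    return sum(math.exp(-mu * f(xi)) for xi in x)

def mean_f(mu):                      # <f(x)> under p_i = exp(-mu f(x_i)) / Z(mu)
    return sum(f(xi) * math.exp(-mu * f(xi)) for xi in x) / Z(mu)

# mean_f is monotone decreasing in mu, so Eq. (2-5) can be solved by bisection.
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_f(mid) > target else (lo, mid)
mu = 0.5 * (lo + hi)

p = [math.exp(-mu * f(xi)) / Z(mu) for xi in x]   # Eq. (2-4), with lambda = ln Z
S = -sum(pi * math.log(pi) for pi in p)           # maximized entropy, K = 1
print(mu, p, S)

With several constraints the same structure applies, except that the coupled equations (2-11) must be solved for λ₁, ..., λ_m simultaneously.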

3. APPLICATION TO STATISTICAL MECHANICS

It will be apparent from the equations in the preceding section that the theory of maximum-entropy inference is identical in mathematical form with the rules of calculation provided by statistical mechanics. Specifically, let the energy levels of a system be

E_i(α₁, α₂, ...),

where the external parameters α_i may include the volume, strain tensor, applied electric or magnetic fields, gravitational potential, etc. Then if we know only the average energy ⟨E⟩, the maximum-entropy probabilities of the levels E_i are given by a special case of (2-10), which we recognize as the Boltzmann distribution.

This observation really completes our derivation of the conventional rules of statistical mechanics as an example of statistical inference; the identification of temperature, free energy, etc., proceeds in a familiar manner,¹⁴ with results summarized as

λ₁ = 1/kT,   (3-1)

U − TS = F(T, α₁, α₂, ...) = −kT ln Z(T, α₁, α₂, ...),   (3-2)

S = −∂F/∂T = −k Σ_i p_i ln p_i,   (3-3)

β_i = kT ∂(ln Z)/∂α_i.   (3-4)

The thermodynamic entropy is identical with the information-theory entropy of the probability distribution except for the presence of Boltzmann's constant.¹⁵ The "forces" β_i include pressure, stress tensor, electric or magnetic moment, etc., and Eqs. (3-2), (3-3), (3-4) then give a complete description of the thermodynamic properties of the system, in which the forces are given by special cases of (2-15); i.e., as maximum-entropy estimates of the derivatives ⟨∂E_i/∂α_k⟩.

In the above relations we have assumed the number of molecules of each type to be fixed. Now let n₁ be the number of molecules of type 1, n₂ the number of type 2, etc. If the n_i are not known, then a possible "state" of the system requires a specification of all the n_i as well as a particular energy level E_i(α₁α₂...; n₁n₂...). If we are given the expectation values

⟨E⟩, ⟨n₁⟩, ⟨n₂⟩, ...,

then in order to make maximum-entropy inferences, we need to form, according to (2-9), the partition function

Z = Σ_{n₁n₂...} Σ_i exp{−[E_i(α₁α₂...; n₁n₂...)/kT + λ₁n₁ + λ₂n₂ + ...]},   (3-5)

and the corresponding maximum-entropy distribution (2-10) is that of the "quantum-mechanical grand canonical ensemble"; the Eqs. (2-11) fixing the constants are recognized as giving the relation between the chemical potentials

μ_i = −kT λ_i,   (3-6)

and the ⟨n_i⟩,

⟨n_i⟩ = −∂(ln Z)/∂λ_i,   (3-7)

where the free-energy function F = −kTλ₀, and λ₀ = ln Z is called the "grand potential."¹⁶ Writing out (2-13) for this case and rearranging, we have the usual expression

F(T, α₁α₂..., μ₁μ₂...) = ⟨E⟩ − TS + μ₁⟨n₁⟩ + μ₂⟨n₂⟩ + ....   (3-8)

¹⁴ E. Schrödinger, Statistical Thermodynamics (Cambridge University Press, Cambridge, 1948).

¹⁵ Boltzmann's constant may be regarded as a correction factor necessitated by our custom of measuring temperature in arbitrary units derived from the freezing and boiling points of water. Since the product TS must have the dimensions of energy, the units in which entropy is measured depend on those chosen for temperature. It would be convenient in general arguments to define an "absolute cgs unit" of temperature such that Boltzmann's constant is made equal to unity. Then entropy would become dimensionless (as the considerations of Sec. 2 indicate it should be), and the temperature would be equal to twice the average energy per degree of freedom; it is, of course, just the "modulus" Θ of Gibbs.

¹⁶ D. ter Haar, Elements of Statistical Mechanics (Rinehart and Company, New York, 1954), Chap. 7.
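
A minimal sketch of Eqs. (3-1)-(3-3) (plain Python; the three energy levels and the temperature are arbitrary choices, and Boltzmann's constant is set to unity in the spirit of footnote 15): it forms the Boltzmann distribution at the given temperature and checks numerically that U − TS equals −kT ln Z.

import math

E = [0.0, 1.0, 3.0]                  # energy levels E_i (illustrative)
kT = 1.5                             # temperature in energy units (k = 1)

lam1 = 1.0 / kT                                  # lambda_1 = 1/kT, Eq. (3-1)
Zc = sum(math.exp(-lam1 * Ei) for Ei in E)       # canonical partition function
p = [math.exp(-lam1 * Ei) / Zc for Ei in E]      # Boltzmann distribution
U = sum(pi * Ei for pi, Ei in zip(p, E))         # <E>
S = -sum(pi * math.log(pi) for pi in p)          # Eq. (3-3), with k = 1
print(U - kT * S, -kT * math.log(Zc))            # the two agree: Eq. (3-2)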

It is interesting to note the ease with which these rules of calculation are set up when we make entropy the primitive concept. Conventional arguments, which exploit all that is known about the laws of physics, in particular the constants of the motion, lead to exactly the same predictions that one obtains directly from maximizing the entropy. In the light of information theory, this can be recognized as telling us a simple but important fact: there is nothing in the general laws of motion that can provide us with any additional information about the state of a system beyond what we have obtained from measurement. This refers to interpretation of the state of a system at time t on the basis of measurements carried out at time t. For predicting the course of time-dependent phenomena, knowledge of the equations of motion is of course needed. By restricting our attention to the prediction of equilibrium properties as in the present paper, we are in effect deciding at the outset that the only type of initial information allowed will be values of quantities which are observed to be constant in time. Any prior knowledge that these quantities would be constant (within macroscopic experimental error) in consequence of the laws of physics, is then redundant and cannot help us in assigning probabilities.

This principle has interesting consequences. Suppose that a super-mathematician were to discover a new class of uniform integrals of the motion, hitherto unsuspected. In view of the importance ascribed to uniform integrals of the motion in conventional statistical mechanics, and the assumed nonexistence of new ones, one might expect that our equations would be completely changed by this development. This would not be the case, however, unless we also supplemented our prediction problem with new experimental data which provided us with some information as to the likely values of these new constants. Even if we had a clear proof that a system is not metrically transitive, we would still have no rational basis for excluding any region of phase space that is allowed by the information available to us. In its effect on our ultimate predictions, this fact is equivalent to an ergodic hypothesis, quite independently of whether physical systems are in fact ergodic.

This shows the great practical convenience of the subjective point of view. If we were attempting to establish the probabilities of different states in the objective sense, questions of metric transitivity would be crucial, and unless it could be shown that the system was metrically transitive, we would not be able to find any solution at all. If we are content with the more modest aim of finding subjective probabilities, metric transitivity is irrelevant. Nevertheless, the subjective theory leads to exactly the same predictions that one has attempted to justify in the objective sense. The only place where subjective statistical mechanics makes contact with the laws of physics is in the enumeration of the different possible, mutually exclusive states in which the system might be. Unless a new advance in knowledge affects this enumeration, it cannot alter the equations which we use for inference.

If the subject were dropped at this point, however, it would remain very difficult to understand why the above rules of calculation are so uniformly successful in predicting the behavior of individual systems. In stripping the statistical part of the argument to its bare essentials, we have revealed how little content it really has; the amount of information available in practical situations is so minute that it alone could never suffice for making reliable predictions. Without further conditions arising from the physical nature of macroscopic systems, one would expect such great uncertainty in prediction of quantities such as pressure that we would have no definite theory which could be compared with experiments. It might also be questioned whether it is not the most probable, rather than the average, value over the maximum-entropy distribution that should be compared with experiment, since the average might be the average of two peaks and itself correspond to an impossible value.

It is well known that the answer to both of these questions lies in the fact that for systems of very large number of degrees of freedom, the probability distributions of the usual macroscopic quantities determined from the equations above, possess a single extremely sharp peak which includes practically all the "mass" of the distribution. Thus for all practical purposes average, most probable, median, or any other type of estimate are one and the same. It is instructive to see how, in spite of the small amount of information given, maximum-entropy estimates of certain functions g(x) can approach practical certainty because of the way the possible values of x are distributed. We illustrate this by a model in which the possible values x_i are defined as follows: let n be a non-negative integer, and ε a small positive number. Then we take

x_i = [(n+1) ε i]^{1/(n+1)},   i = 1, 2, 3, ....   (3-9)

According to this law, the x_i increase without limit as i → ∞, but become closer together at a rate determined by n. By choosing ε sufficiently small we can make the density of points x_i in the neighborhood of any particular value of x as high as we please, and therefore for a continuous function f(x) we can approximate a sum as closely as we please by an integral taken over a corresponding range of values of x,

Σ_i f(x_i) → ∫ f(x) p(x) dx,

where, from (3-9), we have

p(x) = x^n/ε.

This approximation is not at all essential, but it simplifies the mathematics.

Now consider the problem: (A) Given ⟨x⟩, estimate x². Using our general rules, as developed in Sec. 2, we first obtain the partition function

Z(λ) = ∫₀^∞ p(x) e^{−λx} dx = n!/(ε λ^{n+1}),

with λ determined from (2-11),

⟨x⟩ = −(∂/∂λ) ln Z = (n+1)/λ.

Then we find, for the maximum-entropy estimate of x²,

⟨x²⟩ = [(n+2)/(n+1)] ⟨x⟩².   (3-10)

Next we invert the problem: (B) Given ⟨x²⟩, estimate x. The solution is

Z(λ) = ∫₀^∞ p(x) exp(−λx²) dx = √π n! / [2^{n+1} (½n)! ε λ^{½(n+1)}],

⟨x²⟩ = −(∂/∂λ) ln Z = (n+1)/(2λ),

⟨x⟩ = Z⁻¹ ∫₀^∞ x p(x) exp(−λx²) dx = [(½n)!/(½(n+1))!] [½(n+1)⟨x²⟩]^{1/2}.   (3-11)

The solutions are plotted in Fig. 1 for the case n = 1. The upper "regression line" represents Eq. (3-10), and the lower one Eq. (3-11). For other values of n, the slopes of the regression lines are plotted in Fig. 2. As n → ∞, both regression lines approach the line at 45°, and thus for large n, there is for all practical purposes a definite functional relationship between ⟨x⟩ and ⟨x²⟩, independently of which one is considered "given," and which one "estimated." Furthermore, as n increases the distributions become sharper; in problem (A) we find for the variance of x,

⟨x²⟩ − ⟨x⟩² = ⟨x⟩²/(n+1).   (3-12)
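
The sketch below (plain Python; the values of n, λ, and the integration grid are arbitrary) recomputes the moments of the problem-(A) and problem-(B) distributions by direct quadrature and compares them with the closed forms of Eqs. (3-10)-(3-12) above.

import math

n, lam = 1, 0.8                      # model parameter n and multiplier (arbitrary)

def moment(k, weight):
    """k-th moment of the unnormalized density x^n * weight(x) on (0, xmax]."""
    xmax, N = 60.0, 40000
    h = xmax / N
    num = den = 0.0
    for i in range(1, N + 1):
        xi = i * h
        w = xi**n * weight(xi)
        num += xi**k * w
        den += w
    return num / den

# Problem (A): distribution proportional to x^n exp(-lam * x).
mx  = moment(1, lambda t: math.exp(-lam * t))
mx2 = moment(2, lambda t: math.exp(-lam * t))
print(mx2, (n + 2) / (n + 1) * mx**2)              # both sides of Eq. (3-10)
print(mx2 - mx**2, mx**2 / (n + 1))                # both sides of Eq. (3-12)

# Problem (B): distribution proportional to x^n exp(-lam * x**2).
mxB  = moment(1, lambda t: math.exp(-lam * t * t))
mx2B = moment(2, lambda t: math.exp(-lam * t * t))
coeff = math.gamma(n / 2 + 1) / math.gamma((n + 1) / 2 + 1)   # (n/2)! / ((n+1)/2)!
print(mxB, coeff * math.sqrt((n + 1) / 2 * mx2B))  # both sides of Eq. (3-11)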

Similar results hold in this model for the maximum-entropy estimate of any sufficiently well-behaved function g(x). If g(x) can be expanded in a power series in a sufficiently wide region about the point x = ⟨x⟩, we obtain, using the distribution of problem A above, the following expressions for the expectation value and variance of g:

⟨g(x)⟩ = g(⟨x⟩) + g''(⟨x⟩) ⟨x⟩²/[2(n+1)] + O(1/n²),   (3-13)

Δ²(g) = ⟨g²(x)⟩ − ⟨g(x)⟩² = [g'(⟨x⟩)]² ⟨x⟩²/(n+1) + O(1/n²).   (3-14)

[Fig. 1. Regression of x and x² for state density increasing linearly with x. To find the maximum-entropy estimate of either quantity given the expectation value of the other, follow the arrows.]

[Fig. 2. Slope of the regression lines as a function of n.]

Conversely, a sufficient condition for x to be well determined by knowledge of ⟨g(x)⟩ is that x be a sufficiently smooth monotonic function of g. The apparent lack of symmetry, in that reasoning from ⟨x⟩ to g does not require monotonicity of g(x), is due to the fact that the distribution of possible values has been specified in terms of x rather than g.

As n increases, the relative standard deviations of all sufficiently well-behaved functions go down like n^{−1/2}; it is in this way that definite laws of thermodynamics, essentially independent of the type of information given, emerge from a statistical treatment that at first appears incapable of giving reliable predictions. The parameter n is to be compared with the number of degrees of freedom of a macroscopic system.

4. SUBJECTIVE AND OBJECTIVE STATISTICAL MECHANICS

Many of the propositions of statistical mechanics are capable of two different interpretations. The Maxwellian distribution of velocities in a gas is, on the one hand, the distribution that can be realized in the greatest number of ways for a given total energy; on the other hand, it is a well-verified experimental fact. Fluctuations in quantities such as the density of a gas or the voltage across a resistor represent on the one hand the uncertainty of our predictions, on the other a measurable physical phenomenon. Entropy as a concept may be regarded as a measure of our degree of ignorance as to the state of a system; on the other hand, for equilibrium conditions it is an experimentally measurable quantity, whose most important properties were first found empirically. It is this last circumstance that is most often advanced as an argument against the subjective interpretation of entropy.

The relation between maximum-entropy inference and experimental facts may be clarified as follows. We frankly recognize that the probabilities involved in prediction based on partial information can have only a subjective significance, and that the situation cannot be altered by the device of inventing a fictitious ensemble, even though this enables us to give the probabilities a frequency interpretation. One might then ask how such probabilities could be in any way relevant to the behavior of actual physical systems. A good answer to this is Laplace's famous remark that probability theory is nothing but "common sense reduced to calculation." If we have little or no information relevant to a certain question, common sense tells us that no strong conclusions either way are justified. The same thing must happen in statistical inference, the appearance of a broad probability distribution signifying the verdict, "no definite conclusion." On the other hand, whenever the available information is sufficient to justify fairly strong opinions, maximum-entropy inference gives sharp probability distributions indicating the favored alternative. Thus, the theory makes definite predictions as to experimental behavior only when, and to the extent that, it leads to sharp distributions.

When our distributions broaden, the predictions become indefinite and it becomes less and less meaningful to speak of experimental verification. As the available information decreases to zero, maximum-entropy inference (as well as common sense) shades continuously into nonsense and eventually becomes useless. Nevertheless, at each stage it still represents the best that could have been done with the given information.

Phenomena in which the predictions of statistical mechanics are well verified experimentally are always those in which our probability distributions, for the macroscopic quantities actually measured, have enormously sharp peaks. But the process of maximum-entropy inference is one in which we choose the broadest possible probability distribution over the microscopic states, compatible with the initial data. Evidently, such sharp distributions for macroscopic quantities can emerge only if it is true that for each of the overwhelming majority of those states to which appreciable weight is assigned, we would have the same macroscopic behavior. We regard this, not merely as an interesting side remark, but as the essential fact without which statistical mechanics could have no experimental validity, and indeed without which matter would have no definite macroscopic properties, and experimental physics would be impossible. It is this principle of "macroscopic uniformity" which provides the objective content of the calculations, not the probabilities per se. Because of it, the predictions of the theory are to a large extent independent of the probability distributions over microstates. For example, if we choose at random one out of each 10^{10^{10}} of the possible states and arbitrarily assign zero probability to all the others, this would in most cases have no discernible effect on the macroscopic predictions.

Consider now the case where the theory makes definite predictions and they are not borne out by experiment. This situation cannot be explained away by concluding that the initial information was not sufficient to lead to the correct prediction; if that were the case the theory would not have given a sharp distribution at all. The most reasonable conclusion in this case is that the enumeration of the different possible states (i.e., the part of the theory which involves our knowledge of the laws of physics) was not correctly given. Thus, experimental proof that a definite prediction is incorrect gives evidence of the existence of new laws of physics. The failures of classical statistical mechanics, and their resolution by quantum theory, provide several examples of this phenomenon.

Although the principle of maximum-entropy inference appears capable of handling most of the prediction problems of statistical mechanics, it is to be noted that prediction is only one of the functions of statistical mechanics. Equally important is the problem of interpretation; given certain observed behavior of a system, what conclusions can we draw as to the microscopic causes of that behavior? To treat this problem and others like it, a different theory, which we may call objective statistical mechanics, is needed. Considerable semantic confusion has resulted from failure to distinguish between the prediction and interpretation problems, and attempting to make a single formalism do for both.

In the problem of interpretation, one will, of course, consider the probabilities of different states in the objective sense; i.e., the probability of state n is the fraction of the time that the system spends in state n. It is readily seen that one can never deduce the objective probabilities of individual states from macroscopic measurements. There will be a great number of different probability assignments that are indistinguishable experimentally; very severe unknown constraints on the possible states could exist. We see that, although it is now a relevant question, metric transitivity is far from necessary, either for justifying the rules of calculation used in prediction, or for interpreting observed behavior. Bohm and Schützer¹⁷ have come to similar conclusions on the basis of entirely different arguments.

5. GENERALIZED STATISTICAL MECHANICS

In conventional statistical mechanics the energy plays a preferred role among all dynamical quantities because it is conserved both in the time development of isolated systems and in the interaction of different systems. Since, however, the principles of maximum-entropy inference are independent of any physical properties, it appears that in subjective statistical mechanics all measurable quantities may be treated on the same basis, subject to certain precautions. To exhibit this equivalence, we return to the general problem of maximum-entropy inference of Sec. 2, and consider the effect of a small change in the problem. Suppose we vary the functions f_k(x) whose expectation values are given, in an arbitrary way; δf_k(x_i) may be specified independently for each value of k and i. In addition we change the expectation values of the f_k in a manner independent of the δf_k; i.e., there is no relation between δ⟨f_k⟩ and ⟨δf_k⟩. We thus pass from one maximum-entropy probability distribution to a slightly different one, the variations in probabilities δp_i and in the Lagrangian multipliers δλ_k being determined from the δ⟨f_k⟩ and δf_k(x_i) by the relations of Sec. 2. How does this affect the entropy? The change in the partition function (2-9) is given by

δλ₀ = δ(ln Z) = −Σ_k [δλ_k ⟨f_k⟩ + λ_k ⟨δf_k⟩],   (5-1)

and therefore, using (2-13),

δS = Σ_k λ_k [δ⟨f_k⟩ − ⟨δf_k⟩] = Σ_k λ_k δQ_k.   (5-2)

The quantity

δQ_k ≡ δ⟨f_k⟩ − ⟨δf_k⟩   (5-3)

provides a generalization of the notion of infinitesimal heat supplied to the system, and might be called the "heat of the kth type." If f_k is the energy, δQ_k is the heat in the ordinary sense. We see that the Lagrangian multiplier λ_k is the integrating factor for the kth type of heat, and therefore it is possible to speak of the kth type of temperature. However, we shall refer to λ_k as the quantity "statistically conjugate" to f_k, and use the terms "heat" and "temperature" only in their conventional sense. Up to this point, the theory is completely symmetrical with respect to all quantities f_k.

¹⁷ D. Bohm and W. Schützer, Nuovo cimento, Suppl. II, 1004 (1955).
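
The identity δS = Σ_k λ_k δQ_k can be checked numerically. The sketch below (plain Python; a single constraint k = 1, with invented values x_i, an invented variation δf, and small finite increments standing in for the infinitesimals) re-solves the maximum-entropy problem after the variation and compares the entropy change with λ δQ of Eq. (5-3).

import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # possible values x_i (illustrative)

def maxent(fvals, target):
    """Maximum-entropy p_i subject to sum_i p_i f(x_i) = target (one constraint)."""
    Z = lambda lam: sum(math.exp(-lam * fv) for fv in fvals)
    mean = lambda lam: sum(fv * math.exp(-lam * fv) for fv in fvals) / Z(lam)
    lo, hi = -20.0, 20.0             # mean(lam) is monotone decreasing in lam
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean(mid) > target else (lo, mid)
    lam = 0.5 * (lo + hi)
    p = [math.exp(-lam * fv) / Z(lam) for fv in fvals]
    return lam, p, -sum(pi * math.log(pi) for pi in p)

f  = list(x)                                    # f(x_i) = x_i
df = [0.002 * xi * xi for xi in x]              # a small arbitrary variation delta f
d_target = 0.005                                # a small change in the given <f>

lam, p, S = maxent(f, 2.5)
_, _, S_new = maxent([fi + di for fi, di in zip(f, df)], 2.5 + d_target)

dQ = d_target - sum(pi * di for pi, di in zip(p, df))   # Eq. (5-3)
print(S_new - S, lam * dQ)           # equal to first order in the variation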

In all the foregoing discussions, the idea has been implicit that the ⟨f_k⟩ on which we base our probability distributions represent the results of measurements of various quantities. If the energy is included among the f_k, the resulting equations are identical with those of conventional statistical mechanics. However, in practice a measurement of energy is rarely part of the initial information available; it is the temperature that is easily measurable. In order to treat the experimental measurement of temperature from the present point of view, it is necessary to consider not only the system σ₁ under investigation, but also another system σ₂. We introduce several definitions:

A heat bath is a system σ₂ such that

(a) The separation of energy levels of σ₂ is much smaller than any macroscopically measurable energy difference, so that the possible energies E_{2j} form, from the macroscopic point of view, a continuum.

(b) The entropy S₂ of the maximum-entropy probability distribution for given ⟨E₂⟩ is a definite monotonic function of ⟨E₂⟩; i.e., σ₂ contains no "mechanical parameters" which can be varied independently of its energy.

(c) σ₂ can be placed in interaction with another system σ₁ in such a way that only energy can be transferred between them (i.e., no mass, momentum, etc.), and in the total energy E = E₁ + E₂ + E₁₂, the interaction term E₁₂ is small compared to either E₁ or E₂. This state of interaction will be called thermal contact.

A thermometer is a heat bath σ₂ equipped with a pointer which reads its average energy. The scale is, however, calibrated so as to give a number T, called the temperature, defined by

1/T ≡ dS₂/d⟨E₂⟩.   (5-4)

In a measurement of temperature, we place the thermometer in thermal contact with the system σ₁ of interest. We are now uncertain not only of the state of the system σ₁ but also of the state of the thermometer σ₂, and so in making inferences, we must find the maximum-entropy probability distribution of the total system Σ = σ₁ + σ₂, subject to the available information. A state of Σ is defined by specifying simultaneously a state i of σ₁ and a state j of σ₂, to which we assign a probability p_ij. Now however we have an additional piece of information, of a type not previously considered; we know that the interaction of σ₁ and σ₂ may allow transitions to take place between states (ij) and (mn) if the total energy is conserved:

E_{1i} + E_{2j} = E_{1m} + E_{2n}.

In the absence of detailed knowledge of the matrix elements of E₁₂ responsible for these transitions (which in practice is never available), we have no rational basis for excluding the possibility of any transition of this type. Therefore all states of Σ having a given total energy must be considered equivalent; the probability p_ij in its dependence on energy may contain only (E_{1i} + E_{2j}), not E_{1i} and E_{2j} separately.¹⁸ Therefore, the maximum-entropy probability distribution, based on knowledge of ⟨E₂⟩ and the conservation of energy, is associated with the partition function

Z(λ) = Σ_{ij} exp[−λ(E_{1i} + E_{2j})],

which factors into separate partition functions for the two systems

Z₁(λ) = Σ_i exp(−λE_{1i}),   Z₂(λ) = Σ_j exp(−λE_{2j}),   (5-6)

with λ determined as before by

⟨E₂⟩ = −(∂/∂λ) ln Z₂(λ);   (5-7)

or, solving for λ by use of (2-13), we find that the quantity statistically conjugate to the energy is the reciprocal temperature:

λ = dS₂/d⟨E₂⟩ = 1/T.   (5-8)

More generally, this factorization is always possible if the information available consists of certain properties of σ₁ by itself and certain properties of σ₂ by itself. The probability distribution then factors into two independent distributions,

p_ij = p_i^(1) p_j^(2),   (5-9)

and the total entropy is additive:

S(Σ) = S₁ + S₂.   (5-10)

We conclude that the function of the thermometer is merely to tell us what value of the parameter λ should be used in specifying the probability distribution of system σ₁. Given this value and the above factorization property, it is no longer necessary to consider the properties of the thermometer in detail when incorporating temperature measurements into our probability distributions; the mathematical processes used in setting up probability distributions based on energy or temperature measurements are exactly the same, but only interpreted differently.

It is clear that any quantity which can be interchanged between two systems in such a way that the total amount is conserved, may be used in place of energy in arguments of the above type, and the fundamental symmetry of the theory with respect to such quantities is preserved. Thus, we may define a "volume bath," "particle bath," "momentum bath," etc., and the probability distribution which gives the most unbiased representation of our knowledge of the state of a system is obtained by the same mathematical procedure whether the available information consists of a measurement of ⟨f_k⟩ or its statistically conjugate quantity λ_k.

¹⁸ This argument admittedly lacks rigor, which can be supplied only by consideration of phase coherence properties between the various states by means of the density matrix formalism. This, however, leads to the result given.
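
The factorization property is easy to exhibit numerically. In the sketch below (plain Python; the two small level sets and the value of λ are invented), the joint maximum-entropy distribution constrained only through the total energy E_{1i} + E_{2j} factors exactly: the joint partition function equals Z₁Z₂ as in (5-6), and the entropies add as in (5-10).

import math

E1 = [0.0, 0.7, 1.9]                 # energy levels of sigma_1 (illustrative)
E2 = [0.0, 0.3, 0.8, 1.4]            # energy levels of the heat bath sigma_2 (illustrative)
lam = 1.2                            # the single multiplier conjugate to the total energy

# Joint maximum-entropy distribution constrained only through E_1i + E_2j:
w = [[math.exp(-lam * (a + b)) for b in E2] for a in E1]
Z = sum(sum(row) for row in w)
p = [[wij / Z for wij in row] for row in w]

Z1 = sum(math.exp(-lam * a) for a in E1)         # Eq. (5-6)
Z2 = sum(math.exp(-lam * b) for b in E2)
print(Z, Z1 * Z2)                                # Z = Z1 * Z2

def entropy(q):
    return -sum(qi * math.log(qi) for qi in q)

S  = entropy([pij for row in p for pij in row])  # entropy of the joint distribution
S1 = entropy([math.exp(-lam * a) / Z1 for a in E1])
S2 = entropy([math.exp(-lam * b) / Z2 for b in E2])
print(S, S1 + S2)                                # S(Sigma) = S1 + S2, Eq. (5-10)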

Thus, when the available information consists of either of the quantities (T, ⟨E⟩), plus either of the quantities (P/T, ⟨V⟩), the probability distribution which describes this information, without assuming anything else, is obtained as follows. We now give two elementary examples of the treatment of problems using this generalized form of statistical mechanics.

The pressure ensemble. Consider a gas with energy levels E_i(V) dependent on the volume. If we are given macroscopic measurements of the energy ⟨E⟩ and the volume ⟨V⟩, the appropriate partition function is

Z(λ, μ) = ∫₀^∞ dV Σ_i exp[−λE_i(V) − μV],

where λ, μ are Lagrangian multipliers. A short calculation shows that the pressure is given by

P = −⟨∂E_i(V)/∂V⟩ = μ/λ,

so that the quantity statistically conjugate to the volume is

μ = λP = P/kT.

Thus, when the available information consists of either of the quantities (T, ⟨E⟩), plus either of the quantities (P/T, ⟨V⟩), the probability distribution which describes this information, without assuming anything else, is proportional to

exp{−[E_i(V) + PV]/kT}.   (5-11)

This is the distribution of the "pressure ensemble" of Lewis and Siegert.¹⁹

¹⁹ M. B. Lewis and A. J. F. Siegert, Phys. Rev. 101, 1227 (1956).

A nuclear polarization effect. Consider a macroscopic system which consists of σ₁ (a nucleus with spin I), and σ₂ (the rest of the system). The nuclear spin is very loosely coupled to its environment, and they can exchange angular momentum in such a way that the total amount is conserved; thus σ₂ is an angular momentum bath. On the other hand they cannot exchange energy, since all states of σ₁ have the same energy. Suppose we are given the temperature, and in addition are told that the system σ₂ is rotating about a certain axis, which we choose as the z axis, with a macroscopically measured angular velocity ω. Does that provide any evidence for expecting that the nuclear spin I is polarized along the same axis? Let m₂ be the angular momentum quantum number of σ₂, and denote by n all other quantum numbers necessary to specify a state of σ₂. Then we form the partition function

Z₂(β, x) = Σ_{n, m₂} exp[−βE(n, m₂) − x m₂],   (5-12)

where β = 1/kT, and x is determined by

⟨m₂⟩ = −(∂/∂x) ln Z₂ = Bω/ħ,   (5-13)

where B is the moment of inertia of σ₂. Then, our most unbiased guess is that the rotation of the molecular surroundings should produce on the average a nuclear polarization ⟨m₁⟩ = ⟨I_z⟩, equal to the Brillouin function

⟨m₁⟩ = −(∂/∂x) ln Z₁ = ½ coth(½x) − (I + ½) coth[(I + ½)x],   (5-14)

where

Z₁(x) = Σ_{m=−I}^{I} e^{−xm}.   (5-15)

In the case I = ½, the polarization reduces to

⟨m₁⟩ = −½ tanh(½x).   (5-16)

If the angular velocity ω is small, (5-12) may be approximated by a power series in x:

Z₂(β, x) = Z₂(β, 0)[1 − x⟨m₂⟩₀ + ½x²⟨m₂²⟩₀ + ...],   (5-17)

where ⟨ ⟩₀ stands for an expectation value in the nonrotating state. In the absence of a magnetic field ⟨m₂⟩₀ = 0, ħ²⟨m₂²⟩₀ = kTB, so that (5-13) reduces to

x = −ħω/kT.   (5-18)

Thus, the predicted polarization is just what would be produced by a magnetic field of such strength that the Larmor frequency ω_L = ω. If |x| ≪ 1, the result may be described by a "dragging coefficient."

There is every reason to believe that this effect actually exists; it is closely related to the Einstein-de Haas effect. It is especially interesting that it can be predicted in some detail by a form of statistical mechanics which does not involve the energy of the spin system, and makes no reference to the mechanism causing the polarization. As a numerical example, if a sample of water is rotated at 36,000 rpm, this should polarize the protons to the same extent as would a magnetic field of about 1/7 gauss. This should be accessible to experiment. A straightforward extension of these calculations would reveal how the effect is modified by nuclear quadrupole coupling, in the case of higher spin values.
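
The numerical claim for water can be checked directly from Eq. (5-16) with x = −ħω/kT. The minimal sketch below (plain Python) uses standard values of ħ, k, and the proton gyromagnetic ratio, and assumes room temperature; none of these numbers are given in the text, so they are stated here as assumptions.

import math

hbar = 1.054571817e-34               # J s
kB   = 1.380649e-23                  # J/K
gamma_p = 2.675e8                    # proton gyromagnetic ratio, rad s^-1 T^-1 (assumed)
T = 293.0                            # an assumed sample temperature, K

omega = 2.0 * math.pi * 36000.0 / 60.0        # 36,000 rpm as an angular velocity, rad/s
x = -hbar * omega / (kB * T)                  # the multiplier x = -hbar*omega/kT found above
m1 = -0.5 * math.tanh(0.5 * x)                # Eq. (5-16): spin-1/2 polarization

B_equiv = omega / gamma_p                     # field whose Larmor frequency equals omega
print(m1)                                     # ~2.5e-11: a very small polarization
print(B_equiv * 1e4, "gauss")                 # ~0.14 gauss, i.e., about 1/7 gauss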

6. CONCLUSION

The essential point in the arguments presented above is that we accept the von Neumann-Shannon expression for entropy, very literally, as a measure of the amount of uncertainty represented by a probability distribution; thus entropy becomes the primitive concept with which we work, more fundamental even than energy. If in addition we reinterpret the prediction problem of statistical mechanics in the subjective sense, we can derive the usual relations in a very elementary way without any consideration of ensembles or appeal to the usual arguments concerning ergodicity or equal a priori probabilities. The principles and mathematical methods of statistical mechanics are seen to be of much more general applicability than conventional arguments would lead one to suppose. In the problem of prediction, the maximization of entropy is not an application of a law of physics, but merely a method of reasoning which ensures that no unconscious arbitrary assumptions have been introduced.

APPENDIX A. ENTROPY OF A PROBABILITY DISTRIBUTION

The variable x can assume the discrete values (x₁, ..., x_n). Our partial understanding of the processes which determine the value of x can be represented by assigning corresponding probabilities (p₁, ..., p_n). We ask, with Shannon,⁴ whether it is possible to find any quantity H(p₁, ..., p_n) which measures in a unique way the amount of uncertainty represented by this probability distribution. It might at first seem very difficult to specify conditions for such a measure which would ensure both uniqueness and consistency, to say nothing of usefulness. Accordingly it is a very remarkable fact that the most elementary conditions of consistency, amounting really to only one composition law, already determine the function H(p₁, ..., p_n) to within a constant factor. The three conditions are:

(1) H is a continuous function of the p_i.

(2) If all p_i are equal, the quantity A(n) = H(1/n, ..., 1/n) is a monotonic increasing function of n.

(3) The composition law. Instead of giving the probabilities of the events (x₁, ..., x_n) directly, we might group the first k of them together as a single event, and give its probability w₁ = (p₁ + ... + p_k); then the next m possibilities are assigned the total probability w₂ = (p_{k+1} + ... + p_{k+m}), etc. When this much has been specified, the amount of uncertainty as to the composite events is H(w₁, ..., w_r). Then we give the conditional probabilities (p₁/w₁, ..., p_k/w₁) of the ultimate events (x₁, ..., x_k), given that the first composite event had occurred, the conditional probabilities for the second composite event, and so on. We arrive ultimately at the same state of knowledge as if the (p₁, ..., p_n) had been given directly; therefore if our information measure is to be consistent, we must obtain the same ultimate uncertainty no matter how the choices were broken down in this way. Thus, we must have

H(p₁, ..., p_n) = H(w₁, ..., w_r) + w₁H(p₁/w₁, ..., p_k/w₁) + w₂H(p_{k+1}/w₂, ..., p_{k+m}/w₂) + ....   (A-1)

The weighting factor w₁ appears in the second term because the additional uncertainty H(p₁/w₁, ..., p_k/w₁) is encountered only with probability w₁. For example, H(1/2, 1/3, 1/6) = H(1/2, 1/2) + ½H(2/3, 1/3).
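
For the measure that is eventually obtained, H = −Σ p_i ln p_i (K = 1), the quoted instance of the composition law can be verified directly; the short sketch below (plain Python) does so.

import math

def H(p):
    """Shannon measure with K = 1: H(p_1,...,p_n) = -sum p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p)

# The worked example above: H(1/2,1/3,1/6) = H(1/2,1/2) + (1/2) H(2/3,1/3).
lhs = H([1/2, 1/3, 1/6])
rhs = H([1/2, 1/2]) + 0.5 * H([2/3, 1/3])
print(lhs, rhs)          # both ~1.0114, so the composition law (A-1) is satisfied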

From condition (1), it is sufficient to determine H for all rational values

p_i = n_i / Σ_i n_i,

with n_i integers. But then condition (3) implies that H is determined already from the symmetrical quantities A(n). For we can regard a choice of one of the alternatives (x₁, ..., x_n) as a first step in the choice of one of

Σ_{i=1}^{n} n_i

equally likely alternatives, the second step of which is also a choice between n_i equally likely alternatives. As an example, with n = 3, we might choose (n₁, n₂, n₃) = (3, 4, 2). For this case the composition law becomes

H(3/9, 4/9, 2/9) + (3/9)A(3) + (4/9)A(4) + (2/9)A(2) = A(9).

In general, it could be written

H(p₁, ..., p_n) + Σ_i p_i A(n_i) = A(Σ_i n_i).   (A-2)

In particular, we could choose all n_i equal to m, whereupon (A-2) reduces to

A(m) + A(n) = A(mn).   (A-3)

Evidently this is solved by setting

A(n) = K ln n,   (A-4)

where, by condition (2), K > 0. For a proof that (A-4) is the only solution of (A-3), we refer the reader to Shannon's paper.⁴ Substituting (A-4) into (A-2), we have the desired result,

H(p₁, ..., p_n) = K ln(Σ_i n_i) − K Σ_i p_i ln n_i = −K Σ_i p_i ln p_i.   (A-5)

