Chapter 2

A Measure of Information
2.1 Encoding of Information

In this lecture we will think of a message as a sequence of symbols produced by an information source. The message will appear as a signal, which is a function of one or more independent variables. The signal may be a sound wave, an image, or any of a myriad of physical forms.
We will find in a future lecture that the message may be based on phenomena other than language. A language may be thought of as organized combinations of symbols that express and communicate thoughts and feelings. A message may be an expression in a language, and that is how the term is ordinarily used. However, we want to use the term in a more general form so we can model more than human communication. If we think of a message from a machine or from nature, it is difficult to associate it with thoughts or feelings. We can generalize by noting that we can usually model things, at least conceptually, in terms of their state. A state-space model is a quite general structure. We will therefore say that a message is an expression of the state of a system.[1]
Suppose that we have made some observations and have deduced the state of a system. We may want to record the state so that it can be reset to that value at another time. This is essential, for example, in simulation. To record the state we need to construct a signal that will drive a recording apparatus. A little thought will convince you that recording the state of a system and recording a message are equivalent tasks. Moreover, recording and communication are equivalent from our conceptual viewpoint.

[1] More precisely, it is the mathematical model of a system that has states. A good model will behave like the natural system. The behavior of the model is determined by its state. The current state is deduced from observations.
A recording is done through a signal that is suited to the recording medium. That signal may be produced by a sequence of symbols from an alphabet that we may choose; it is not determined by the system under observation. We simply require that each potential state (or each potential message) be uniquely represented. If there are N different states or messages, then there must be at least N unique arrangements of symbols available to us. In the last lecture we saw that any alphabet with two or more symbols can be used. Moreover, it is not the appearance of the symbols, but their number, that is important in determining their enumeration properties. We will find that at times one can gain some efficiency by properly choosing the size of the recording alphabet, but that is usually a small gain.
The important determiner of the efficiency of a recording code is the manner in which sequences of symbols are associated with the different states. It is the structure of the code, and not the identity of the symbols, that is important. A good code can greatly reduce the amount of data needed to record the state of the model. This is, in fact, the goal of image compression algorithms. A compression system contains an algorithm, which is a model of images, and a method to record the data needed to reproduce the image. The best compression system is the one which reproduces the image with acceptable fidelity with the minimum data.
The amount of information needed to record the state of a system depends upon the probabilities of the various states. It turns out that if we know the state probabilities we can calculate the minimum amount of data that is required on the average. This minimum is given by the entropy of the system, which is closely related to the concept of entropy in physical systems.
A code is an alphabet of symbols and a set of rules that associate events with patterns of symbols from the alphabet. An event is anything that one wishes to express in terms of the code, such as the state of a system. The development of codes depends only upon the abstract notion of events and their associated probabilities. The association with state models is accomplished by interpreting the meaning of event as a particular state.
The ingredients of a code are:

1. an alphabet A of symbols. We don't care about the specific representation of objects in A, only that they be distinguishable and that there be a finite number of them. We will typically represent the alphabet in the form A_r = {a_1, a_2, ..., a_r}. We only care about the size r. The smallest useful alphabet is A_2, whose symbols are customarily represented as A_2 = {0, 1}.
2. a set of events E = {e_1, e_2, ...}. A set of finite size may be indicated as E_n and one of infinite size by E_∞. The specific physical characteristics of the events are of no interest as far as the code is concerned.
3. a rule R that maps events onto codewords, which are combinations of symbols from A.
4. a set of probabilities over E. That is, one can compute the probability of any set B of events from E. If B ⊂ E then one has a probability measure p(B) which satisfies all the rules for probability measures.
A code C can be represented by the above items, and denoted by C(A, E, R, p). We require that R be such that each event is represented by a unique codeword. Let w_i be the codeword for an event e_i, and let n_i be the number of symbols (or length) of w_i. The set W of codewords is represented by the rule R: E → W. For unique decoding we require a rule R^{-1}: W → E.
Let n(e) be the length, or number of symbols, of a word w(e) that represents an event e. The average number of symbols used to represent an event, and thus the average number of symbols per codeword, is

    \bar{n} = \sum_{e \in E} n(e) p(e)    (2.1)
A good code is one which has a small value for \bar{n}. Here we seek to know how to find the minimum value of \bar{n} for a given event space, described by (E, p), over all possible code rules for a given alphabet A_r. We would also like to have guidance on how to find R for a given (E, p, A_r) such that the code is efficient and both R and R^{-1} are practical to implement. Information theory has made considerable progress on all these tasks.
Example 1 (Random Number Generator). A random number generator can produce positive integers from the set {e_1 = 1, e_2 = 2, ..., e_n = n}. This machine has a finite event space. The generator output is to be represented by words whose letters are from a binary code alphabet, A = {A, B}. We are interested in a code that can uniquely represent a sequence of numbers produced by the source.
If n = 2, an obvious code rule is to assign e_1 = 1 → A, e_2 = 2 → B. There will then be a one-to-one relationship between any sequence from the source and a corresponding sequence of letters from the alphabet.
If n = 4 then it is necessary to construct four codewords from an alphabet of two letters. We could use {e_1 → AA, e_2 → AB, e_3 → BA, e_4 → BB}.
By extension, we can construct a code whose words are of length \log_2 n when n is any power of 2. Any n that is not a power of 2 can be accommodated by using codewords whose length is the next integer above \log_2 n.
Codes that are constructed in this manner are uniquely decodable. However, they may not be as efficient as possible. The efficiency cannot be determined without knowing the statistics of the source events.
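As a quick sketch (the function name is our own, not from the text), the fixed-length construction for a binary alphabet can be written in a few lines of Python:

```python
from math import ceil, log2

def fixed_length(n: int) -> int:
    """Binary codeword length needed to give each of n events its own
    fixed-length word: log2(n) symbols when n is a power of 2,
    otherwise the next integer above log2(n)."""
    return ceil(log2(n))

# fixed_length(4) -> 2, fixed_length(5) -> 3
```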
A definition of unique decodability is given below.
2.1.1 Uniquely Decodable Codes

Suppose that one wants to record a sequence of events, which would then produce a sequence of codewords. We would like to decode any sequence of symbols produced by concatenating the codewords produced by the sequence of events. We want to be able to do this without introducing another symbol to serve as a separator, or comma, between codewords in the sequence. Codes with this property are called uniquely decodable.
2.1.2 Instantaneous Codes

We would like to be able to decode each of the codewords in the sequence independently. This imposes a stronger requirement than that of being uniquely decodable. We want to be able to recognize each codeword as soon as it is seen, reading left to right, independent of surrounding codewords. This stronger requirement could impose conditions that make the code longer.
A necessary and sufficient condition for a code to be instantaneous is that no complete codeword be a prefix of some other codeword.
Example 2 (A code that is not instantaneous). The code with words {a = 1, b = 10, c = 100, d = 1000} is uniquely decodable because each codeword is identified by the number of zeros it contains.
e_i   w_i   n_i
e_1   1     1
e_2   01    2
e_3   001   3
e_4   000   3

Table 2.1: An instantaneous code for an alphabet of four symbols.
The sequence b d a c is encoded as 1010001100. One cannot decode a sequence of digits until one is sure all the zeros have been received. For example, the initial sequence 10 may be the symbol b or the beginning of c or d; it cannot be decoded until more digits are seen. Hence, this code is not instantaneous.
To construct an instantaneous code we can use the simple expedient of not using any previously defined codeword as a prefix when defining a new codeword.
Example 3 (An instantaneous code). The code {a = 1, b = 01, c = 001, d = 000} is instantaneous. The decoding rule is to recognize a symbol as soon as its codeword is complete. The sequence b d a c encodes as 010001001.
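The decoding rule for an instantaneous code can be sketched directly; here is a minimal Python version (the function name is ours):

```python
def decode_instantaneous(code: dict, stream: str) -> str:
    """Decode a prefix-free (instantaneous) code by emitting an event
    as soon as its complete codeword is recognized, left to right."""
    inverse = {w: e for e, w in code.items()}   # the rule R^{-1}: W -> E
    events, buffer = [], ""
    for digit in stream:
        buffer += digit
        if buffer in inverse:          # a complete codeword has been seen
            events.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("stream ended inside a codeword")
    return "".join(events)

code = {"a": "1", "b": "01", "c": "001", "d": "000"}
decode_instantaneous(code, "010001001")   # recovers "bdac"
```

Note that the non-instantaneous code of Example 2 would defeat this loop: after reading 10 the decoder cannot yet tell b from the start of c or d.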
Clearly, one can have an instantaneous code that is uniquely decodable. The question is whether there is any advantage in imposing only unique decodability rather than the stronger instantaneous requirement. It turns out that there is no advantage, as will be shown by the analysis developed below. We first state and prove Kraft's inequality and then use it in McMillan's inequality. This establishes the above assertion.
2.1.3 Kraft's Inequality

The Kraft inequality provides the necessary and sufficient conditions for a code to be instantaneous. Suppose that a system has q recognizable events in a set E_q = {e_1, e_2, ..., e_q} which are to be recorded or communicated using an alphabet A_r = {a_1, a_2, ..., a_r}. The events can be any system observables or combination of observables that are of interest. Here we are interested in the number q. A code C is now defined which maps each event onto a codeword w_i = R(e_i), where w_i is constructed of symbols from A_r and is of length n_i. An example of a code with symbols from A_2 and codeword lengths {1, 2, 3, 3} is shown in Table 2.1.
Theorem 4. A sufficient condition for the existence of an instantaneous code with word lengths {n_1, n_2, ..., n_q} is that

    \sum_{i=1}^{q} r^{-n_i} \le 1    (2.2)

For a binary code, which uses the alphabet A_2, the Kraft inequality is

    \sum_{i=1}^{q} 2^{-n_i} \le 1    (2.3)
Example 5. The binary codeword lengths in Example 3 are {1, 2, 3, 3}. The Kraft sum is

    2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1

Hence, an instantaneous code exists, as we have demonstrated above.
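The Kraft sum is easy to check numerically; a small sketch in Python (the function name is ours):

```python
def kraft_sum(lengths, r=2):
    """Left-hand side of the Kraft inequality (2.2) for word lengths
    over an alphabet of r symbols."""
    return sum(r ** -n for n in lengths)

kraft_sum([1, 2, 3, 3])   # 1.0 -> an instantaneous binary code exists
kraft_sum([1, 2, 2, 3])   # 1.125 -> no instantaneous binary code
```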
2.1.4 Proof of the Kraft Inequality

To prove the sufficiency, we will provide a method to actually construct a code with word lengths {n_1, n_2, ..., n_q} when these lengths satisfy (2.2). Let c_i be the number of codewords of length i. In the above example c_1 = 1, c_2 = 1, c_3 = 2. If the maximum codeword length is n, then the counts must satisfy

    \sum_{i=1}^{n} c_i = q    (2.4)
since there must be a codeword for each source event. Using the above definitions, we can write the Kraft inequality as

    \sum_{i=1}^{n} c_i r^{-i} \le 1    (2.5)
Upon multiplying by r^n we have

    \sum_{i=1}^{n} c_i r^{n-i} \le r^n    (2.6)
This inequality may be written out and rearranged to obtain

    c_n \le r^n - c_1 r^{n-1} - c_2 r^{n-2} - \cdots - c_{n-1} r

The term on the right of the inequality must be nonnegative. If we divide it by r and rearrange, we obtain

    c_{n-1} \le r^{n-1} - c_1 r^{n-2} - c_2 r^{n-3} - \cdots - c_{n-2} r

Once again, the term on the right must be nonnegative. We can continue in this fashion to obtain the following sequence of inequalities:

    c_n \le r^n - c_1 r^{n-1} - c_2 r^{n-2} - \cdots - c_{n-1} r    (2.7)
    c_{n-1} \le r^{n-1} - c_1 r^{n-2} - c_2 r^{n-3} - \cdots - c_{n-2} r    (2.8)
    ⋮    (2.9)
    c_3 \le r^3 - c_1 r^2 - c_2 r    (2.10)
    c_2 \le r^2 - c_1 r    (2.11)
    c_1 \le r    (2.12)
The above inequalities now show that we can construct the required codewords. We are required to form c_1 words of length 1, leaving r − c_1 symbols that can serve as the beginning of longer codewords. By adding a new symbol to the end of each of these prefixes we can construct as many as r(r − c_1) = r^2 − c_1 r words of length 2. We are assured by (2.11) that there will be enough words available. From the unused two-symbol prefixes, there will remain r^2 − c_1 r − c_2, which can be used to construct r^3 − c_1 r^2 − c_2 r three-symbol words. Comparing with (2.10) we see that this is a sufficient number. Continuing in this manner, we will find that it is possible to construct the necessary number of words of each length. We have thus demonstrated that (2.2) or (2.5) are sufficient to guarantee the existence of an instantaneous code.
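The construction in the proof can be carried out mechanically. The sketch below (in Python, with names of our own choosing) assigns words in order of increasing length, taking the next unused prefix at each step; this is the so-called canonical form of the construction:

```python
def code_from_lengths(lengths, r=2):
    """Construct an instantaneous (prefix-free) code with the given
    word lengths, assuming they satisfy the Kraft inequality (2.2).
    Each new word is the next unused value, extended with fresh
    symbols whenever the required length grows."""
    assert sum(r ** -n for n in lengths) <= 1, "lengths violate Kraft"
    digits = "0123456789"[:r]
    words, value, prev = [], 0, None
    for n in sorted(lengths):
        if prev is not None:
            value = (value + 1) * r ** (n - prev)
        v, w = value, ""
        for _ in range(n):              # write value as n base-r digits
            w = digits[v % r] + w
            v //= r
        words.append(w)
        prev = n
    return words

code_from_lengths([1, 2, 3, 3])   # ['0', '10', '110', '111']
```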
2.1.5 McMillan's Inequality

Although the above proof shows that satisfaction of the Kraft inequality guarantees the existence of an instantaneous (and therefore a uniquely decodable) code, it does not show that the condition is necessary. Might there be uniquely decodable codes that are more efficient? The analysis below shows that the requirement is also necessary.
Consider raising the summation in the Kraft inequality to a power. Then

    \left(\sum_{i=1}^{q} r^{-n_i}\right)^m = \left(r^{-n_1} + r^{-n_2} + \cdots + r^{-n_q}\right)^m    (2.13)

When this is multiplied out and the terms with equal exponents are gathered together, the expression will be of the form

    \left(\sum_{i=1}^{q} r^{-n_i}\right)^m = \sum_{k=m}^{mn} N_k r^{-k}    (2.14)
where n is the length of the longest codeword and N_k is the number of terms of the form r^{-k}. N_k is also the number of distinct strings of codewords that can be constructed by stringing together m codewords such that the string length is exactly k symbols. If the code is to be uniquely decodable, then this number must be less than the number of possible strings of length k, which is r^k. Therefore, if we substitute r^k (which is at least N_k) in place of N_k, the above equation becomes an inequality:
    \left(\sum_{i=1}^{q} r^{-n_i}\right)^m \le \sum_{k=m}^{mn} r^k r^{-k} = mn - m + 1 \le mn
Let x stand for the sum on the left. For any x > 1, x^m > mn for any n if m is sufficiently large. This would violate the above inequality. Hence, a uniquely decodable code requires x ≤ 1, which is identical to the Kraft inequality (2.2).
McMillan's inequality shows that there is no advantage in not using an instantaneous code. That is, for any uniquely decodable code there is an equivalent instantaneous code.
2.1.6 Efficient Codes

We would like to construct instantaneous codes which use the minimum average number of symbols per event. This will enable us to describe an event in the most efficient manner possible, a fact that is important for storage and transmission of information. The average number of alphabet symbols per codeword is given by (2.1):

    \bar{n} = \sum_{i=1}^{q} n_i p_i

where q is the number of distinct events in E. We want to choose the n_i to satisfy the Kraft inequality and minimize \bar{n}. This is not a trivial matter, but it has been solved by the Huffman coding procedure for the case of statistically independent events. We will examine the Huffman procedure later. For now we will consider the problem of finding a lower bound on \bar{n}.
In our analysis we need to make use of the following theorem. We will find this theorem useful on other occasions.
Theorem 6. Let x_1, x_2, ..., x_q and y_1, y_2, ..., y_q be any two sets of numbers with x_i ≥ 0 and y_i ≥ 0 for 1 ≤ i ≤ q, with

    \sum_{i=1}^{q} x_i = 1  and  \sum_{i=1}^{q} y_i = 1

Then

    \sum_{i=1}^{q} x_i \log_a \frac{1}{x_i} \le \sum_{i=1}^{q} x_i \log_a \frac{1}{y_i}    (2.15)
To prove this we note first that ln x ≤ x − 1, with equality only at x = 1. This can be shown with a graph of the two functions. Then

    \sum_{i=1}^{q} x_i \ln\left(\frac{y_i}{x_i}\right) \le \sum_{i=1}^{q} x_i \left(\frac{y_i}{x_i} - 1\right) = \sum_{i=1}^{q} y_i - \sum_{i=1}^{q} x_i = 0

with equality if and only if x_i = y_i for all i. By expanding the logarithm and dividing through by ln a, the theorem is established for any logarithmic base a.
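Theorem 6 is easy to spot-check numerically. A small Python sketch (helper names are ours) draws random probability vectors and confirms that the left side of (2.15) never exceeds the right:

```python
import random
from math import log2

def cross_sum(x, y):
    """sum_i x_i * log2(1/y_i); equals the entropy of x when y = x."""
    return sum(xi * log2(1 / yi) for xi, yi in zip(x, y))

def normalize(v):
    s = sum(v)
    return [vi / s for vi in v]

# numeric spot-check of Theorem 6 on random probability vectors
random.seed(1)
for _ in range(1000):
    x = normalize([random.random() + 1e-9 for _ in range(5)])
    y = normalize([random.random() + 1e-9 for _ in range(5)])
    assert cross_sum(x, x) <= cross_sum(x, y) + 1e-12
```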
Now, by letting x_i = p(e_i) and y_i = 1/q, i = 1, 2, ..., q, we have (with a = q)

    \sum_{i=1}^{q} p_i \log_q \frac{1}{p_i} \le \sum_{i=1}^{q} p_i \log_q q

or

    \sum_{i=1}^{q} p_i \log_q \frac{1}{p_i} \le 1    (2.16)

with equality if and only if p_i = 1/q for all i. It is common to write the above expression in terms of \log_2, which yields

    \sum_{i=1}^{q} p_i \log_2 \frac{1}{p_i} \le \log_2 q    (2.17)
The above sum is known as the entropy of a source in which the events are statistically independent:

    H(E) = -\sum_{i=1}^{q} p_i \log_2 p_i  bits/event    (2.18)
The entropy, which will be discussed in considerable detail in the next lecture, measures the average amount of uncertainty that an observer has about the next source event. The uncertainty is maximum when all states are equally likely. That is, from (2.18),

    H(E) \le \log_2 q    (2.19)

where q is the size of the event space, with equality if and only if all events are equally likely.
For now, the entropy is just an interesting quantity that enables us to put a lower bound on the average codeword length. Let y_1, y_2, ..., y_q be any numerical quantities that satisfy y_i ≥ 0 for all i and \sum_{i=1}^{q} y_i = 1. Then by Theorem 6 we know that

    \sum_{i=1}^{q} p_i \log \frac{1}{p_i} \le \sum_{i=1}^{q} p_i \log \frac{1}{y_i}    (2.20)
with equality if and only if y_i = p_i for all i. Now let us choose the y_i to be

    y_i = \frac{r^{-n_i}}{\sum_{j=1}^{q} r^{-n_j}}    (2.21)
Then

    H(E) \le -\sum_{i=1}^{q} p_i \log_2 r^{-n_i} + \sum_{i=1}^{q} p_i \log_2\left(\sum_{j=1}^{q} r^{-n_j}\right)
         = \log_2 r \sum_{i=1}^{q} p_i n_i + \log_2\left(\sum_{j=1}^{q} r^{-n_j}\right)
         = \bar{n} \log_2 r + \log_2 \sum_{j=1}^{q} r^{-n_j}    (2.22)
The sum in the last term must satisfy the Kraft inequality if the code is to be uniquely decodable. The logarithm of this term is therefore non-positive. Therefore, we can write the above inequality as

    \bar{n} \log_2 r \ge H(E)

or

    \bar{n} \ge \frac{H(E)}{\log_2 r}    (2.23)
The efficiency of a code is defined as

    \eta = \frac{H(E)}{\bar{n} \log_2 r}    (2.24)
A code will have an efficiency of one when the word lengths from the alphabet A_r are chosen to match the entropy. This issue is explored further in the homework.
A code that is close to the lower bound for the minimum average length can be constructed by a simple method. Choose the codeword lengths to satisfy

    \log_r \frac{1}{p_i} \le n_i < \log_r \frac{1}{p_i} + 1    (2.25)
If p_i = r^{-k_i}, where k_i is an integer, then choose n_i = k_i; otherwise choose n_i to be the first integer greater than -\log_r p_i. The average codeword length can be found by multiplying the above equation through by p_i and summing over all the source events.
    \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i} \le \sum_{i=1}^{q} n_i p_i < \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i} + \sum_{i=1}^{q} p_i    (2.26)

    \frac{H(E)}{\log_2 r} \le \bar{n} < \frac{H(E)}{\log_2 r} + 1    (2.27)
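Rule (2.25) and the bound (2.27) can be checked directly. A sketch in Python for the binary case (function name ours; the probabilities are an arbitrary example, not from the text):

```python
from math import ceil, log2

def shannon_lengths(probs):
    """Binary codeword lengths chosen by rule (2.25):
    n_i = -log2(p_i) when that is an integer, else the next integer up."""
    # the small tolerance guards against float noise when p is a power of 1/2
    return [ceil(-log2(p) - 1e-12) for p in probs]

probs = [0.5, 0.25, 0.15, 0.1]
lengths = shannon_lengths(probs)                  # [1, 2, 3, 4]
H = sum(p * log2(1 / p) for p in probs)           # entropy in bits
nbar = sum(p * n for p, n in zip(probs, lengths))
# the lengths satisfy Kraft, and H <= nbar < H + 1 as (2.27) promises
```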
Example 7 (An image coding problem). A 1024 × 1024 image has eight grey levels. A histogram analysis has found that the levels have the probabilities listed in the table below. Also shown are two instantaneous binary codes that could be used to represent the image. The word lengths of Code B satisfy (2.25), while those of Code A were constructed by the Huffman coding procedure to be discussed in Lecture 3. It is readily verified that both codes satisfy the Kraft inequality and are instantaneous.
p_i    -log_2 p_i   Code A   Code B   n_Ai   n_Bi
0.25   2.00         00       00       2      2
0.2    2.32         10       100      2      3
0.17   2.56         010      010      3      3
0.15   2.74         110      110      3      3
0.1    3.32         111      1110     3      4
0.08   3.64         0111     0111     4      4
0.03   5.06         01100    011000   5      6
0.02   5.64         01101    011010   5      6
The average numbers of digits per codeword are

    \bar{n}_A = \sum_{i=1}^{8} p_i n_{Ai} = 2.73

    \bar{n}_B = \sum_{i=1}^{8} p_i n_{Bi} = 3.08
Clearly, Code B is no prize, since it requires more digits per pixel than would be required by using a straight three-digit binary code. However, Code A uses an average of less than three digits per pixel. The entropy of the image, assuming the pixels are statistically independent, is H = 2.7 bits per pixel. Hence, Code A is actually quite efficient.
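The numbers in Example 7 are easy to verify. A sketch in Python (variable names ours):

```python
from math import log2

p  = [0.25, 0.2, 0.17, 0.15, 0.1, 0.08, 0.03, 0.02]
nA = [2, 2, 3, 3, 3, 4, 5, 5]     # Code A (Huffman) word lengths
nB = [2, 3, 3, 3, 4, 4, 6, 6]     # Code B (rule 2.25) word lengths

avg = lambda ns: sum(pi * ni for pi, ni in zip(p, ns))
H = sum(pi * log2(1 / pi) for pi in p)

# avg(nA) = 2.73, avg(nB) = 3.08, and H is about 2.70 bits per pixel
```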
We will find that the simple Shannon coding technique of (2.25) can actually be quite efficient when groupings of pixels rather than single pixels are used. It is the theoretical basis of computations of asymptotic efficiency of coding methods.
2.2 A Measure of Information

In this lecture we will examine the relationships between events and observations. Suppose that there is a space X of events and a space Y of observations. How much information is provided by a particular observation y ∈ Y about each possible event x ∈ X? This question can be applied to any situation in which there is a need to relate observations to underlying causes. Its roots were in the field of communication, where the space X is the set of symbols or messages and the space Y is the space of signals, but it has far wider application. In general, X and Y are collections of random events connected by some process, as shown in Figure 2.1.
[Figure: Event → Process → Observation]

Figure 2.1: Events and observations may be related by any kind of process. The effect of the process is modeled by the conditional probability function p(y|x).
It is presumed that the joint probability distribution p(x, y) is available for any point (x, y) in the space X × Y. All of our results will ultimately be related to this distribution. The theory will apply if and only if this function can be constructed when modeling a real problem.
The rules of probability can be used to relate the joint probability to the marginal and conditional probabilities.

    p(x) = \sum_{y \in Y} p(x, y)        p(y|x) = \frac{p(x, y)}{p(x)}

    p(y) = \sum_{x \in X} p(x, y)        p(x|y) = \frac{p(x, y)}{p(y)}

[Figure: a binary symmetric channel; inputs x_1, x_2, each with p(x) = 0.5, are connected to outputs y_1, y_2, each with p(y) = 0.5; the direct transitions have p(y|x) = 0.9 and the crossover transitions have p(y|x) = 0.1.]

Figure 2.2: Input/output relationships for a binary symmetric channel with error probability 0.1.
It is common that the conditional relationship p(y|x) is available for a process. This is the normal situation when modeling an observation system, communication channel, recording system, and the like. If one then models the distribution of input events by choosing p(x), it is possible to calculate the joint probability distribution and all of the other probabilities above.
We are interested in the amount of information about the input that is provided by the observed output. Before the observation, a particular input had probability p(x), and after the observation it has probability p(x|y). Any observation of the output changes the probability distribution over the input. Some possible inputs will be made more likely and some less likely. In the extreme, one event may be made certain and the others made impossible. We would like to have a reasonable way to describe the amount of information that is provided about the input by any particular observation.
An example of the common digital communication channel called the binary symmetric channel (BSC) is shown in Figure 2.2. This model is used for many digital communication and recording systems. Both the input and output spaces contain two events. The probability of error is equal to the crossover probability. For this example, it is assumed that the input events are equally likely, which, because of the channel symmetry, causes the output symbols to also be equally likely.

An observer, located at the output side, assumes a uniform probability distribution p(x_1) = p(x_2) = 0.5 before any observation is made. After an observation is made, say y_1, the probabilities change to p(x_1|y_1) = 0.9 and p(x_2|y_1) = 0.1. The "evidence," although not completely certain, points to x_1 as the cause of the observation y_1. How much "information" has the observer received about the input?
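The posterior probabilities quoted here follow from Bayes' rule. A minimal sketch in Python (names ours):

```python
from math import log2

def posterior(p_x, p_y_given_x, y):
    """p(x|y) for each input event x, given the observed output y."""
    joint = {x: p_x[x] * p_y_given_x[x][y] for x in p_x}   # p(x, y)
    p_y = sum(joint.values())                              # marginal p(y)
    return {x: pxy / p_y for x, pxy in joint.items()}

p_x = {"x1": 0.5, "x2": 0.5}
p_y_given_x = {"x1": {"y1": 0.9, "y2": 0.1},
               "x2": {"y1": 0.1, "y2": 0.9}}

post = posterior(p_x, p_y_given_x, "y1")       # x1 -> 0.9, x2 -> 0.1
info_x1 = log2(post["x1"] / p_x["x1"])         # about 0.848 bits, per (2.28)
```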
The measure of information is based on the following definition:
Definition 8. The amount of information provided by the occurrence of the event y ∈ Y about the occurrence of the event x ∈ X is defined as

    I(x; y) = \log \frac{p(x|y)}{p(x)}    (2.28)
The base of the logarithm determines the units of information. When the base is 2, the unit is the bit.[2] This definition provides a natural, intuitive interpretation of "information" that will be developed as we investigate its properties.
The information measure has the property that if the observation y increases the probability of x then it is positive. That is, if p(x|y) > p(x) then I(x; y) > 0. If we expand (2.28) we have

    I(x; y) = -\log p(x) + \log p(x|y)    (2.29)
This expression can be given an intuitive meaning by defining the uncertainty of an event.
Definition 9. The uncertainty of an event x is -\log p(x).
We see that the uncertainty is zero if p(x) = 1 and increases as the probability decreases.
For the BSC of Figure 2.2 the uncertainty about both x_1 and x_2 is -\log_2 0.5 = 1 bit before the observation. This uncertainty can be removed by telling the observer which of two equally likely events has occurred. After observing y_1 the probability of x_1 has changed to p(x_1|y_1) = 0.9, and its uncertainty now is -\log_2 0.9 = 0.152 bit. Its uncertainty has decreased because its probability has increased from its original value. The information received about the event x_1 is 1 − 0.152 = 0.848 bit. At the same time, the probability of x_2 has changed to p(x_2|y_1) = 0.1. Its uncertainty now is -\log_2 0.1 = 3.322 bits. The information received about the event x_2 is 1 − 3.322 = −2.322 bits. If it later turns out that x_1 actually occurred then the information received gave a correct indication and is positive. However, if an error occurred then it is negative.

[2] The units "nat" and "Hartley" are often used with the bases e and 10, respectively. The decimal unit is in honor of communication pioneer R. V. L. Hartley.
The information gained by the observation can then be expressed as

    I(x; y) = initial uncertainty − final uncertainty    (2.30)

The amount of information about x that is produced by the observation y equals the initial uncertainty about x minus the uncertainty of x conditioned on y.

The term "uncertainty" can be related to the feeling of surprise. There is little surprise in the occurrence of an event with probability close to one, but there is great surprise at the occurrence of a rare event.
2.2.1 Self Information

Suppose that an observation y removes all of the uncertainty about x. That is, p(x|y) = 1. This would be the case for an ideal error-free channel. Then, since \log p(x|y) = 0, the amount of information provided must equal the initial uncertainty about x. We call this the self information.
Definition 10. The self information associated with an event x ∈ X is

    I(x) = -\log p(x)    (2.31)
It is common to write I(p) when one wants to focus on the probability p(x) rather than on the event. A plot of I(p) is shown in Figure 2.3. We see that an event has no uncertainty when p = 1 and infinite uncertainty when p = 0. I(p) is a measure of our "surprise" when an event of probability p occurs. We are not very surprised if p ≈ 1 and very surprised if p ≈ 0.
[Figure: plot of I(p) = -\log_2 p for 0 ≤ p ≤ 1; the curve falls toward zero as p approaches 1 and grows without bound as p approaches 0.]

Figure 2.3: Uncertainty of an event as a function of its probability.

We have applied the model of partial information gain to transmission through a binary communication channel. It can also be applied to code applications. We have seen that a set of events can be represented by codewords. When a codeword is transmitted symbol-by-symbol, each symbol adds to the quantity of information carried by the word. As each symbol is received at the other end, it adds information about the event that was transmitted. Each symbol of the received codeword will make some events less likely and others more likely until the final digit makes one event certain and the others impossible. We will explore this when we examine decoding trees below.
2.2.2 Entropy as Average Self Information

Suppose we want to know the average amount of information that is provided by the source. This is the amount of information per input event that will have to be carried by the channel or stored in a recording system. We can find the answer by averaging over the self information of each x ∈ X.
The self information I(x) = -\log p(x) is a random variable whose value is determined by the event x. Being a random variable, it is possible to compute its average value.
    E[I] = \sum_{x \in X} p(x) I(x) = -\sum_{x \in X} p(x) \log p(x) = H(X)    (2.32)
This is recognized as the quantity we named entropy in Lecture 2. Therefore, entropy can be interpreted as the average information of the underlying event space. When the logarithm of base 2 is used, the units are bits per event. We found that the average codeword length \bar{n} cannot be less than H digits. Thus, each digit of a binary code cannot, on the average, carry more than one bit of information.
In Lecture 2 we showed that the entropy for an event space of size q is bounded by

    0 \le H(E) \le \log_2 q    (2.33)

with H(E) = 0 if some p(e_i) = 1 and H(E) = \log_2 q if all p(e_i) = 1/q. In the first case only one of the events can occur, so there is no uncertainty; in the second, any event can occur with equal probability, so the uncertainty is maximum.
The BSC provides a useful example. Let the input probabilities be p(x_1) = p and p(x_2) = 1 − p. The entropy associated with a binary event space is a function of the single variable p, and can be plotted as a simple curve:

    H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)    (2.34)
As illustrated in Figure 2.4, this function is symmetric about p = 0.5, is maximum where the events are equally probable, and is zero where either event is certain. This means that if p ≠ 0.5 it is possible to record or communicate the information from this source at a rate of less than one bit per event. This can, in fact, be approximated closely by Huffman encoding of doublets or triplets from the source.
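The binary entropy function (2.34) is a direct transcription into Python; the guard for p = 0 or 1 uses the convention that 0 log 0 = 0:

```python
from math import log2

def H(p: float) -> float:
    """Binary entropy (2.34) in bits per event; 0*log(0) is taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# H(0.5) = 1.0 (the maximum); H(0.1) = H(0.9), about 0.469 (symmetry about 0.5)
```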
The entropy now has a meaning in terms of the average quantity of information needed to record the output of a source. We have shown that the entropy of a source that generates statistically independent events can be calculated by (2.32). There are many sources that do not fit the independent-event model, and more analysis will be needed to construct a method to compute it for such situations. For example, estimation of the entropy of English text is very difficult because one needs to take all of the constraints of spelling and grammar into account somehow. But we will find that however difficult it is to estimate H, it provides a lower bound on the amount of data needed to represent events. It is therefore a fundamental quantity in the design of efficient communication or recording systems.
[Figure: plot of H(p) for 0 ≤ p ≤ 1; the curve rises from 0 at p = 0 to a maximum of 1 at p = 0.5 and falls back to 0 at p = 1.]

Figure 2.4: Entropy H(p) as a function of p for a binary event space.
It should be noted that entropy is associated with long sequences of events, not with individual events. It is an average over the information produced by many events. It is meaningless to speak of the entropy of an individual event.
2.2.3 Decoding Trees and Partial Information

Let us suppose that events are to be represented by codewords whose symbols are drawn from an alphabet of size r. In an efficient code, codeword length is related to event probability, so that short codewords are associated with events of high probability. We will now see that codeword length is related to entropy.
The uncertainty of an event can be related to the length of its codeword. Events with high probability, and therefore low uncertainty, should get short codewords, while those with low probability, and high uncertainty, should get long codewords. An efficient binary code has words of length -\log_2 p_i \le n_i < -\log_2 p_i + 1. Thus, the binary codeword for an event e_i should have about the same number of digits as the uncertainty measured in bits. They cannot be equal unless the uncertainty is an integer, in which case p_i = 2^{-n_i} and \bar{n} = H(E). The code in Table 2.2 is an example.
i   p_i   -log_2 p_i   w_i   n_i
1   1/2   1            0     1
2   1/4   2            10    2
3   1/8   3            110   3
4   1/8   3            111   3

Table 2.2: Example of a 100% efficient code.

i   p_i   -log_2 p_i   w_i   n_i
1   0.4   1.32         0     1
2   0.3   1.74         10    2
3   0.2   2.32         110   3
4   0.1   3.32         111   3

Table 2.3: Example of a code that is less than 100% efficient.

Suppose now that someone begins to show you a codeword a digit at a time. The first digit gives you some information about the event, the second gives you more, and so on. Finally, when you have seen all the digits in the codeword, you know which event has occurred. Each digit gives you information and reduces your uncertainty. That is why we use the symbol I(e_i): it represents the information required to remove the uncertainty. We equate information to reduction in uncertainty and measure it in bits.
Because \bar{n} ≥ H(E), each digit can convey, on the average, at most one bit of information. We often talk about the symbols '0' and '1' as "bits" when they are really digits. A bit is a unit of information. A digit carries x bits of information, where x depends upon the situation.
A decoding tree for the code in Table 2.2 is shown in Figure 2.5. Each branch shows the digit related to it as well as a pair (p, I(p)). Suppose that you are given a codeword digit by digit. Each digit chooses between an upper or lower branch. The first symbol has probability p = 0.5 of being 1 or 0, so the first digit carries \log_2 2 = 1 bit of information. If the first symbol is '1' you go from node a to node b. Again, either path is equally likely, and so on. At each step the new information removes a full bit of uncertainty. If I(p) is summed along the path it equals the total uncertainty of the event, which, in this case, is an integer equal to the codeword length.
Suppose now that the probabilities are slightly different, so that -\log p_i is not an integer. This situation is shown in Table 2.3. The code tree is shown in Figure 2.6. The probability p of taking a given branch from its node is now not always 0.5, so the uncertainty is not always 1 bit. The uncertainty of each branch is therefore not an integer. The sum of the uncertainties along each path is equal to the uncertainty of the event, and this is close to the codeword length. The average codeword length is now a number slightly greater than the entropy. If one multiplies the probabilities or adds the uncertainties along a path leading to an event, the result is the probability or the uncertainty of the event, shown in columns 2 and 3 of Table 2.3.

[Figure] Figure 2.5: Decoding tree for the code of Table 2.2.

[Figure] Figure 2.6: Decoding tree for the code of Table 2.3.
We now see that there is a close association between information and uncertainty. Each symbol in a codeword carries information that reduces the uncertainty about the event. Low-probability events require more digits in an efficient code because they have more uncertainty.
Each digit of a codeword reduces the uncertainty but, unless it is the last one, does not eliminate it. To make this explicit, let a_{ij} be symbol j of the codeword w_i associated with event e_i. Before any digit has been observed, the uncertainty about e_i is -\log p(e_i). After symbol a_{i1} is observed the probability of e_i has changed to p(e_i|a_{i1}), which is the probability conditioned on the observed symbol. The uncertainty must now be -\log p(e_i|a_{i1}). The difference in the uncertainty is the information provided by a_{i1}.

The information about e_i provided by a_{i1} is

    I(e_i; a_{i1}) = \log\left(\frac{p(e_i|a_{i1})}{p(e_i)}\right)
Suppose that the event space of Table 2.3 is used and that the event e_3 has occurred. The message is, therefore, w_3 = (1, 1, 0). The original probability is p(e_3) = 0.2, which corresponds to 2.32 bits of uncertainty. After the first digit is received the probability is p(e_3|1) = 1/3 and the uncertainty has changed to -\log_2(1/3) = 1.58 bits. The information received was 2.32 − 1.58 = 0.74 bits. Note that this is the quantity on that branch of Figure 2.6. When the second digit is received the probability is p(e_3|11) = 2/3 and the uncertainty is -\log_2(2/3) = 0.58 bits. The information received was 1.58 − 0.58 = 1 bit, which is the quantity on that branch of the tree. The final digit increases the probability to p(e_3|110) = 1 so that the uncertainty is removed. This digit carried the remaining 0.58 bits of information.
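The calculation in this paragraph can be automated for any event. A sketch in Python, using the code of Table 2.3 with words {0, 10, 110, 111} (function and variable names are ours):

```python
from math import log2

p = {"e1": 0.4, "e2": 0.3, "e3": 0.2, "e4": 0.1}
w = {"e1": "0", "e2": "10", "e3": "110", "e4": "111"}

def information_per_digit(event):
    """Bits of information carried by each successive digit of the
    codeword for `event`, from the conditional probabilities p(e | prefix)."""
    word, gains = w[event], []
    prev_unc = -log2(p[event])                     # initial uncertainty
    for k in range(1, len(word) + 1):
        prefix = word[:k]
        # total probability of events still possible after this prefix
        mass = sum(p[e] for e in w if w[e].startswith(prefix))
        unc = -log2(p[event] / mass)               # remaining uncertainty
        gains.append(prev_unc - unc)
        prev_unc = unc
    return gains

information_per_digit("e3")   # about [0.74, 1.00, 0.58] bits, summing to 2.32
```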
This example shows that the amount of information carried by a digit depends upon the event and the preceding digits. Even if the events are independent, it is not necessarily true that the digits in a code sequence are statistically independent. However, we can say that in a 100% efficient code each digit must carry a full bit of information and must be statistically independent of all other digits. To see this latter point, carry out an analysis of the code of Figure 2.5 in a manner that parallels the preceding paragraph.