
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 3, MARCH 2003 563

Information-Theoretic Analysis of Information Hiding

Pierre Moulin, Fellow, IEEE, and Joseph A. O'Sullivan, Fellow, IEEE

Abstract—An information-theoretic analysis of information hiding is presented in this paper, forming the theoretical basis for design of information-hiding systems. Information hiding is an emerging research area which encompasses applications such as copyright protection for digital media, watermarking, fingerprinting, steganography, and data embedding. In these applications, information is hidden within a host data set and is to be reliably communicated to a receiver. The host data set is intentionally corrupted, but in a covert way, designed to be imperceptible to a casual analysis. Next, an attacker may seek to destroy this hidden information, and for this purpose, introduce additional distortion to the data set. Side information (in the form of cryptographic keys and/or information about the host signal) may be available to the information hider and to the decoder.

We formalize these notions and evaluate the hiding capacity, which upper-bounds the rates of reliable transmission and quantifies the fundamental tradeoff between three quantities: the achievable information-hiding rates and the allowed distortion levels for the information hider and the attacker. The hiding capacity is the value of a game between the information hider and the attacker. The optimal attack strategy is the solution of a particular rate-distortion problem, and the optimal hiding strategy is the solution to a channel-coding problem. The hiding capacity is derived by extending the Gel'fand–Pinsker theory of communication with side information at the encoder. The extensions include the presence of distortion constraints, side information at the decoder, and unknown communication channel. Explicit formulas for capacity are given in several cases, including Bernoulli and Gaussian problems, as well as the important special case of small distortions. In some cases, including the last two above, the hiding capacity is the same whether or not the decoder knows the host data set. It is shown that many existing information-hiding systems in the literature operate far below capacity.

Index Terms—Channel capacity, cryptography, fingerprinting, game theory, information hiding, network information theory, optimal jamming, randomized codes, rate-distortion theory, steganography, watermarking.

I. INTRODUCTION

INFORMATION hiding is an emerging research area which encompasses applications such as copyright protection for digital media, watermarking, fingerprinting, steganography, and data embedding. In particular, watermarking is now a major activity in audio, image, and video processing, and standardization efforts for JPEG-2000, MPEG-4, and digital video disks are underway. Commercial products are already being developed. International Workshops on Information Hiding have been held regularly since 1996. Special issues of the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS and of the PROCEEDINGS OF THE IEEE were recently devoted to copyright and privacy protection [1], [2]. An excellent review of the current state of the art appears in [3], and comprehensive surveys of image and multimedia watermarking techniques are available from [4], [5]. The majority of the papers to date have focused on novel ways to hide information and to detect and/or remove hidden information. However, these papers have lacked a guiding theory describing the fundamental limits of any information-hiding system. The need for practitioners and system designers to understand the nature of these fundamental limits has been recognized [1], [3], [5]–[7]. We help to close this gap by providing a theoretical basis for a generic version of the information-hiding problem. We formulate the information-hiding problem as a communication problem and seek the maximum rate of reliable communication through the communication system. Related aspects of this problem have also been recently explored by Merhav [8], Steinberg and Merhav [9], Somekh-Baruch and Merhav [10], Willems [11], Cohen and Lapidoth [12], and Chen and Wornell [13], [14]. Also, see the recent study by Hernández and Pérez-González on decision-theoretic aspects of the watermarking problem [15].

Manuscript received October 5, 1999; revised September 13, 2002. The material in this paper was presented in part at the IEEE International Symposia on Information Theory, Cambridge, MA, August 1998, and Sorrento, Italy, June 2000; and at the 39th Allerton Conference, Monticello, IL, October 2001. The work of P. Moulin was supported by NSF under Grants CDA-9624396, MIP-9707633, and CCR-0081268.

P. Moulin is with the Beckman Institute, the Coordinated Science Laboratory, and the Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: [email protected]).

J. A. O'Sullivan is with the Department of Electrical Engineering, Washington University, St. Louis, MO 63130 USA (e-mail: [email protected]).

Communicated by T. Fuja, Associate Editor At Large.
Digital Object Identifier 10.1109/TIT.2002.808134

In our generic information-hiding problem, a message M is to be embedded in a host data set S, and the resulting data set X may be subject to data processing operations (attacks) that attempt to remove any trace of M from X. The information-hiding system should satisfy two basic requirements. The first requirement is usually referred to as transparency or unobtrusiveness: the data set X should be similar to S, according to a suitable distortion measure. The second requirement is referred to as robustness: the hidden message should survive the application of any data processing technique (within a certain class) to X. Often there is a limit on the amount of distortion that an attacker is willing to introduce. A precise statement of the information-hiding problem is proposed in Section II.

Applications of information hiding are quite diverse [3]. In watermarking applications, the message contains information such as owner identification and a digital time stamp. The goal here is usually copyright protection. The message itself is not secret, but it is desired that it permanently reside within the host data set. Similar requirements exist for systems that embed data (such as object identification, text, or audio) in image and video databases. Such applications are commonly referred to as data hiding or data embedding. Closely related to watermarking is the fingerprinting, or traitor tracing, problem, where in addition to copyright information, the owner of the data set embeds a serial number, or fingerprint, that uniquely identifies the user of the data set and makes it possible to trace any unauthorized use of the data set back to the user [16], [17]. This application is particularly challenging as it opens up the possibility of a collusion between different users to remove these fingerprints. A different type of application is the embedding of data such as multilingual soundtracks in pay-per-view television programs. Here the message is secret in the sense that it should not be decipherable by unauthorized decoders. Other applications of information hiding to television broadcasting are described in [18]. Another, more classical application of information hiding is steganography. Here not only is the message secret, but its very presence within the host data set should be undetectable. Steganography and related applications have a long, sometimes romantic history dating from ancient times [3], [6], [19], [20].

0018-9448/03$17.00 © 2003 IEEE

This brief discussion suggests that information hiding borrows from a variety of areas, including signal processing, communications, game theory, and cryptography. Indeed, a vast array of techniques from signal processing and communications have been used to design algorithms for hiding information (e.g., the spread-spectrum methods popularized by Cox et al. [16], or the dithered quantization methods developed by Chen and Wornell [14], [21]) and for attempting to remove that information (by means of techniques such as compression, signal warping, and addition of noise). Perceptual models for audio, imagery, and video have helped to quantify the distortions introduced by information-hiding and attack algorithms. Game-theoretic aspects of information hiding have been explored for special cases by Ettinger [22] and O'Sullivan et al. [23], [24]. Cryptographic aspects of information hiding include the use of secret keys to protect the message. It should, however, be clearly recognized that the functional requirements of cryptography and information hiding as described above are very different, as secrecy of the message is the main objective in cryptography but is often not a requirement in information hiding, where reliable embedding of the message within the host data is often the single most important requirement. Moreover, while cryptography has received significant attention in the Information Theory community, following Shannon's landmark paper [25], information hiding today is an immature subject, both on a mathematical and a technological level. For instance, there is no consensus today about the formulation of system requirements, and there is considerable uncertainty about the eventual performance of such systems, as all published algorithms in the current audio and image watermarking literature can be defeated by attacks such as those in the popular freeware package Stirmark [6], [26].
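As an aside, the dithered-quantization idea mentioned above can be sketched in a few lines. This is a minimal illustration of quantization index modulation, not Chen and Wornell's exact construction; the step size `delta` and the two interleaved scalar lattices are illustrative choices.

```python
def qim_embed(s, bit, delta=1.0):
    # Embed one bit in host sample s by quantizing onto one of two
    # interleaved lattices: delta*Z for bit 0, delta*Z + delta/2 for bit 1.
    offset = (delta / 2) * bit
    return round((s - offset) / delta) * delta + offset

def qim_decode(y, delta=1.0):
    # Decode by choosing the lattice with the nearest reconstruction point.
    d0 = abs(y - round(y / delta) * delta)
    d1 = abs(y - (round((y - delta / 2) / delta) * delta + delta / 2))
    return 0 if d0 <= d1 else 1
```

The embedding distortion is at most delta/2 per sample, while decoding survives any perturbation of magnitude below delta/4; the step size thus controls the transparency/robustness tradeoff.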

Our analysis of information hiding as a communication problem begins in Section II. There we propose a precise formulation of the information-hiding problem and introduce the notions of covert channel and attack channel that are, respectively, designed by the information hider and by the attacker, subject to average distortion constraints. We then seek the maximum rate of reliable transmission through the communication system. This maximum rate is termed hiding capacity and is the value of a game played between the information hider and the attacker, as discussed in Section III. In Section IV, we characterize the value of this game under various assumptions about the knowledge available to the attacker and to the decoder. These assumptions include knowledge of the information-hiding strategy by the attacker, and knowledge of the attack strategy by the decoder. It is not our intent to provide an analysis of every possible scenario that may be encountered in practice, but we do investigate a few scenarios that lead to insightful results. To this end, we assume that side information is available to the decoder and investigate two extreme but important special cases in some depth: in the first case, sometimes termed private information hiding, the host data themselves are available to the decoder [16]; and in the second case, which we term blind information hiding, no side information at all is available [27], [28].1 Our theory quantifies the effect of side information on hiding capacity; the role of side information in watermarking has also been explored by Cox et al. [7], Chou et al. [29], Barron et al. [30], Chen and Wornell [14], and Willems [11]. The theory is illustrated using an example based on a Bernoulli process and a Hamming distortion function.

In Section V, we extend these results to the case of infinite alphabets. This allows us to treat the case of squared-error distortion in Euclidean spaces, which provides considerable insight into the information-hiding problem. In this case, we are able to give explicit formulas for hiding capacity under the assumption of Gaussian host data. These formulas are also upper bounds on hiding capacity for non-Gaussian host data sets. These results show that existing information-hiding schemes in the literature operate far below capacity. In Section VI, we investigate the case of small distortion. This is a problem of considerable interest, as many information-hiding schemes are precisely designed to operate at small distortion levels. We show that the upper bound on hiding capacity in Section V is in fact tight for non-Gaussian processes.

In Section VII, we study several extensions of the basic setup, including knowledge of the side information by the attacker, alternative distortion constraints, and steganography requirements.

The results above are derived under the assumption that attacks are memoryless, which greatly simplifies the presentation of the main ideas. In Section VIII, we show that such memoryless strategies are, in fact, optimal in a certain sense, and we extend our results to a simple but useful class of channels with memory, namely, blockwise memoryless channels. We also present an information-theoretic formulation of the fingerprinting problem and characterize its solution. Conclusions are presented in Section IX. The proofs of some technical results are given in the Appendix.

II. STATEMENT OF THE PROBLEM

Notation. We use the following notation. Random variables are denoted by capital letters (e.g., X), and their individual values by lower case letters (e.g., x). The domains over which random variables are defined are denoted by script letters (e.g., 𝒳). Sequences of N random variables are denoted with a superscript N (e.g., X^N = (X_1, …, X_N) takes its values on the product set 𝒳^N). The probability mass function (pmf) of a random variable X is denoted by p_X(x). When no confusion is possible, we drop the subscript in order to simplify the notation. Special letters such as Q and A are reserved for pmf's of special interest. We write X ∼ p(x) to indicate that a random variable X is distributed according to p(x). Given random variables X, Y, Z, we denote the entropy of X by H(X), the mutual information between X and Y by I(X; Y), and the conditional mutual information between X and Y, conditioned on Z, by I(X; Y|Z) [31]. The Gaussian distribution with mean μ and variance σ² is denoted by 𝒩(μ, σ²). Finally, we write f(N) ∼ g(N) as N → ∞ to denote asymptotic equality of two functions f and g, i.e., lim_{N→∞} f(N)/g(N) = 1.

1 By analogy to the established terminology "blind watermarking" and "private watermarking."

Fig. 1. Formulation of information hiding as a communication problem. Here M is the message to be embedded in the host data set S and transmitted to the decoder. The composite data set X is subject to attacks embodied by the channel A(y|x). The encoder and the decoder share common side information K.

A. Description of the Problem

There are various formulations of information-hiding problems. We consider the following generic version in this paper. Referring to Fig. 1, suppose there is a host-data source producing random variables S taking values in 𝒮 according to a known pmf p(s), a side-information source producing random variables K distributed jointly with S, and a message source producing a message M from a message set ℳ. Specifically, S, K, and M are as follows.

• In typical problems, S is a block of data or transform data (such as discrete cosine transform coefficients or wavelet coefficients) from an image, video, audio signal, or some other host data set in which the information hider embeds information. The set 𝒮 could be a continuum (such as a cube in Euclidean space) or a discrete set (such as a set of quantized transform coefficients). In all sections but Sections V and VI, we assume that 𝒮 has finite cardinality. The host data S^N = (S_1, …, S_N) are a sequence of independent and identically distributed (i.i.d.) samples drawn from p(s).

• Side information K^N = (K_1, …, K_N) is available to both the encoder and the decoder, but not to the attacker. The individual letters K_n are i.i.d. p(k). The side information potentially plays two roles. First, it may provide a source of randomness that is known to the decoder and enable the use of randomized codes. This is a standard communication technique which generally leads to improved transmission performance and is used to combat jamming [32]–[34]. Second, K may provide side information about S to the decoder. The dependencies between S and K are modeled using a joint distribution p(s, k). For instance, S may be fully available at the decoder, a common assumption in the watermarking literature [3], [4], [16]. In this case, S is a function of K. In other problems, only partial information about S (e.g., image features [35]) is available at the decoder. Other examples of side information include hash values [36], location of watermarks [37], [38], and seeds for modulating pseudonoise sequences in spread-spectrum systems [20], [28]. In blind-information-hiding applications, the decoder is not allowed access to any side information, so anyone can decode the message [21], [27]. If K is a cryptographic key that is independent of S, the application is still sometimes referred to as blind watermarking (information hiding) in the literature.

• The message M of interest is uniformly distributed over the message set ℳ and is to be reliably transmitted to the decoder. M may be a cryptographically encrypted message, in which case an additional cryptographic key is required at the encoder and the decoder.

The information hider passes S^N, K^N, and the message M through a function f_N, producing composite data X^N that are made publicly available.2

Next, an attacker passes X^N through a random attack channel A^N(y^N|x^N) to produce corrupted data Y^N in an attempt to remove traces of the message and to cause a decoder with inputs Y^N and K^N to produce an unreliable estimate M̂ of the message M. The setup considered includes deterministic attacks of the form y^N = g_N(x^N), where g_N is a deterministic map, as a special case. For deterministic attacks, the only possible values of A^N(y^N|x^N) are zero and one. An important point is that if the information-hiding system did not use side information, the attacker (knowing the code used) would be able to decode the message and might then be able to remove it from the data X^N. Hence, side information plays an important role. Our working assumption is that the attacker knows the distributions of all random variables in the problem and the actual information-hiding code used, but not the side information.
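The role of the shared side information can be made concrete with a toy end-to-end sketch of the system in Fig. 1. The bit-level scheme below is invented for illustration and is not the paper's code construction: the key bits mask the embedded message bits, so an attacker who observes the composite data but not the key cannot simply read the message off.

```python
import random

def encode(s_bits, m_bits, k_bits):
    # Toy encoder f_N: overwrite the first len(m_bits) host bits with
    # key-masked message bits; the remaining host bits pass through.
    x = list(s_bits)
    for i, b in enumerate(m_bits):
        x[i] = b ^ k_bits[i]
    return x

def attack(x_bits, flip_prob, rng):
    # Memoryless attack channel A(y|x): flip each bit with prob. flip_prob.
    return [b ^ (rng.random() < flip_prob) for b in x_bits]

def decode(y_bits, k_bits, num_bits):
    # Decoder phi_N uses the shared side information K to unmask the message.
    return [y_bits[i] ^ k_bits[i] for i in range(num_bits)]
```

With flip_prob = 0 the message is recovered exactly; a practical code would add redundancy so decoding survives flip_prob > 0.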

2 In watermarking applications, X is often referred to as the watermarked signal. In the information-hiding literature, M, K, S, and X are usually referred to as the mark, the stego-key, the cover-data, and the stego-data, respectively. We prefer not to use the stego- terminology to avoid possible confusion: undetectability of the message is a requirement in steganographic applications only.


Fig. 2. Design of information-hiding system used in the current watermarking literature. Various structural constraints may be imposed on f and on V.

The scenario in which no side information is used, or the attacker somehow has discovered the side information, is briefly considered in Section VII-A.

We assume that X and Y take their values in finite sets 𝒳 and 𝒴. In many applications, 𝒳 = 𝒴. In others, these sets may be different, e.g., 𝒮 is a set of continuous-tone pictures, 𝒳 is a set of 8-bit images, and 𝒴 is a set of halftone images.

Constrained Information-Hiding Systems. The systems used almost universally in the watermarking literature use the encoder and decoder shown in Fig. 2, in which several components are constrained [7], [16], [27], [28]. First, a mapping from messages m to codewords v^N(m) is defined. This mapping is independent of s^N. The composite data are obtained as x_n = f(s_n, v_n(m)), where f may be further constrained. A simple example would be x_n = s_n + v_n(m) for 1 ≤ n ≤ N, when 𝒳 is a finite field. The decoder could be a correlation rule, m̂ = arg max_{m∈ℳ} Σ_{n=1}^{N} y_n v_n(m), or a modified correlation rule [16]. More elaborate designs of f in the image watermarking literature exploit the perceptual characteristics of the human visual system and are not additive [4], [7]. While such designs make it convenient for the information hider to satisfy distortion constraints, heuristic choices can be largely suboptimal in terms of achievable rates. As our subsequent analysis shows, the above restrictions on the alphabets and on the information-hiding code may reduce the maximum rate of reliable transmission.
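The constrained additive structure just described can be sketched concretely. This is a hedged illustration in the spirit of the spread-spectrum systems of [16], with invented parameters: pseudorandom ±1 codewords v^N(m) generated from a seed that plays the role of a shared key, and an embedding strength `alpha`.

```python
import random

def make_codewords(num_messages, n, seed=0):
    # Pseudorandom +/-1 codewords v^N(m); the seed acts as a shared key.
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(n)]
            for _ in range(num_messages)]

def embed(s, m, codewords, alpha=0.5):
    # Additive rule x_n = s_n + alpha * v_n(m).
    return [sn + alpha * vn for sn, vn in zip(s, codewords[m])]

def correlation_decode(y, codewords):
    # Correlation rule: pick the message whose codeword correlates best with y.
    scores = [sum(yn * vn for yn, vn in zip(y, v)) for v in codewords]
    return max(range(len(codewords)), key=scores.__getitem__)
```

The host signal itself acts as interference at the correlator, which is one reason such blind additive schemes fall short of the achievable rates derived later in the paper.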

B. Distortion Constraints

We now formally define the constraints on the information-hiding and attack strategies. This completes the mathematical description of the information-hiding problem and allows us to derive achievable rates of reliable transmission for the communication system in Fig. 1.

Definition 2.1: A distortion function for the information hider is a nonnegative function d_1 : 𝒮 × 𝒳 → ℝ⁺.

Definition 2.2: A distortion function for the attacker is a nonnegative function d_2 : 𝒳 × 𝒴 → ℝ⁺.

The distortion function for the information hider is bounded: max_{s,x} d_1(s, x) < ∞. Other properties of interest are stated explicitly where applicable, including symmetry: d(x, y) = d(y, x) for all x, y, and the condition d(x, y) = 0 if and only if x = y. The distortion functions are extended to per-symbol distortions on N-tuples by

d^N(x^N, y^N) = (1/N) Σ_{n=1}^{N} d(x_n, y_n).

The theory developed in this paper applies to classical distortion functions such as the Hamming distance, as well as to more complicated perceptual (auditory or visual) distortion functions that would satisfy the technical conditions above. Note that the condition d(x, y) = 0 if and only if x = y is not satisfied by perceptual distortion functions in image processing, due to the presence of threshold effects in the human visual system [4].
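For instance, the per-symbol extension of the Hamming distortion is a direct transcription of the averaging rule above:

```python
def hamming(a, b):
    # Single-letter Hamming distortion: 0 if the symbols agree, 1 otherwise.
    return 0 if a == b else 1

def per_symbol_distortion(xs, ys, d=hamming):
    # d^N(x^N, y^N) = (1/N) * sum over n of d(x_n, y_n)
    assert len(xs) == len(ys) and xs
    return sum(d(x, y) for x, y in zip(xs, ys)) / len(xs)
```

For example, `per_symbol_distortion([0, 1, 1, 0], [0, 1, 0, 1])` is 0.5, since two of the four symbols disagree.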

Definition 2.3: A length-N information-hiding code subject to distortion D_1 is a triple (ℳ, f_N, φ_N), where

• ℳ is the message set of cardinality |ℳ|;

• f_N : 𝒮^N × ℳ × 𝒦^N → 𝒳^N is the encoder mapping a sequence s^N, a message m, and side information k^N to a sequence x^N = f_N(s^N, m, k^N). This mapping is subject to the distortion constraint

(1/|ℳ|) Σ_{m∈ℳ} Σ_{s^N, k^N} p(s^N, k^N) d_1^N(s^N, f_N(s^N, m, k^N)) ≤ D_1;   (2.1)

• φ_N : 𝒴^N × 𝒦^N → ℳ is the decoder mapping the received sequence y^N and the side information k^N to a decoded message m̂ = φ_N(y^N, k^N).

Note that there are distortion constraints but no rate constraints on f_N. Typically, D_1 is small, as the embedding of information within the host data set is intended to be imperceptible to a casual analysis. (In watermarking applications, this is known as transparent watermarking.) Also note that the definition of the distortion constraint (2.1) involves an averaging with respect to the distribution p(s^N, k^N) and with respect to the uniform distribution on the messages. This choice is made for convenience as it allows us to use classical tools from Shannon theory. Also possible but more difficult to analyze would be the use of maximum distortion constraints, where the maximum is with respect to s^N, k^N, and m. The distribution p(s^N, k^N) and the encoder mapping f_N induce a distribution p(x^N) on the composite data set.

Definition 2.4: An attack channel with memory, subject to distortion D_2, is a sequence of conditional pmfs A^N(y^N|x^N) from 𝒳^N to 𝒴^N, such that

Σ_{x^N, y^N} p(x^N) A^N(y^N|x^N) d_2^N(x^N, y^N) ≤ D_2   (2.2)

for all N.

If 𝒳 = 𝒴 = 𝒮, d_2 = d_1, and the distortion function is symmetric, it is true in many applications that D_2 ≥ D_1, so that the range of options offered to the attacker includes recovering the host data. Under the definition (2.2), the attack channel is subject to a constraint on the average distortion between X^N and Y^N. Another possibility, briefly considered in Section VII-B, is to constrain the average distortion between the host data S^N and Y^N.

In addition to distortion constraints, other restrictions may be imposed on the attack channels. For instance, the attack channel may be constrained to be memoryless, or blockwise memoryless. We let 𝒜 denote the class of attack channels considered. When the only restriction on the attack channel is to satisfy the distortion constraint (2.2), we denote this class by 𝒜(D_2). Observe that 𝒜(D_2) depends on p(x^N) via (2.2).

The rate of the information-hiding code is R_N = (1/N) log |ℳ|. The average probability of error is

P_{e,N} = Pr(φ_N(Y^N, K^N) ≠ M)   (2.3)

where M is a random variable in ℳ with a uniform probability distribution. That is, P_{e,N} equals the probability that an attack successfully removes the message, averaged over all messages. Rates of reliable transmission and hiding capacity are defined as follows.
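The average error probability in (2.3) can be estimated by simulation for any concrete code and attack. The sketch below uses invented stand-ins, not a construction from the paper: a one-bit repetition code against a memoryless bit-flipping attack, with majority-vote decoding.

```python
import random

def estimate_pe(flip_prob, reps=9, trials=2000, seed=0):
    # Monte Carlo estimate of P_e: draw a uniform message, pass the
    # repetition-coded bits through a binary symmetric attack channel,
    # and count majority-vote decoding failures.
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        m = rng.randrange(2)
        y = [m ^ (rng.random() < flip_prob) for _ in range(reps)]
        m_hat = 1 if sum(y) > reps // 2 else 0
        errors += (m_hat != m)
    return errors / trials
```

As expected, the estimate grows with the attack's flip probability; a stronger attack (larger distortion budget) makes reliable transmission harder.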

Definition 2.5: A rate R is achievable for distortion D_1 and for a class of attack channels 𝒜, if there is a sequence of codes (ℳ, f_N, φ_N), subject to distortion D_1, with rate R such that P_{e,N} → 0 as N → ∞.

Definition 2.6: The information-hiding capacity C(D_1, 𝒜) is the supremum of all achievable rates for distortion D_1 and attacks in the class 𝒜.

III. THE INFORMATION-HIDING GAME

Information hiding can be thought of as a game [39] between two cooperative players (the information hider and the decoder) and an opponent (the attacker). The first party tries to maximize a payoff function, and the opponent tries to minimize it. A natural choice for the payoff function would be the function −P_{e,N}, where P_{e,N} is the probability of error (2.3) [34]. Another classical choice would be the maximum achievable rate of reliable transmission, as in Definition 2.6. For any transmission rate above that value, P_{e,N} does not tend to zero as N → ∞. The remainder of this section uses a generic payoff function involving the functions (f_N, φ_N) and the attack channel directly.

Let J denote the payoff function, where the pair of encoding and decoding functions and the attack channel are the variables under control of the information hider and the attacker, respectively. The feasible set for the attack channel depends on the choice of the encoder via the distortion constraint. The feasible set for the pair is then said to be nonrectangular. The feasible set for the encoder and decoder is independent of the choice of attack channel.

A Nash equilibrium of the game is obtained if and only if

(3.1)

holds for all admissible strategies. Under some conditions (such as rectangular feasible sets), a Nash equilibrium is also a saddlepoint. The value of the game in this case is the common saddlepoint payoff. It is in the interest of neither party to deviate from a saddlepoint strategy [39]. For examples of saddlepoint strategies in information-theoretic games, see [31, p. 263], [41], [42].

In many games, Nash equilibria and saddlepoints do not exist. Then the information available to each party critically determines the value of the game. If the players choose their actions in a given order, then a conservative approach for the first player is to assume that the second player will find out what the first player's action is. This scenario applies to watermarking, where the information hider plays first (chooses the encoder) and may assume that the attacker will be able to learn it and then choose the attack channel accordingly. Likewise, the attacker may assume that the third player (the decoder) will be able to learn the attack channel and then choose the decoding function accordingly. In this case, the value of the game is

(3.2)

A more conservative (more secure) approach for the information hider/decoder team is to assume that they will be unable to know the attack channel, but that the attacker will be able to find out both the encoder and the decoder, and design the attack accordingly [12]. The value of the game played against such an omniscient attacker is

(3.3)

Because the maxmin of a payoff can never exceed its minmax (equality holds only at saddlepoints [39]),3 the value (3.3) is no larger than the value (3.2), as one does expect, owing to the additional information available to the attacker. The formulation (3.3) has been used in jamming problems [34], [40] and in recent watermarking problems [10], [12] and yields a lower value of the information-hiding game. An upper value of the game is obtained using the unrealistic (because it is overly optimistic) assumption that both the encoder and decoder know the attack channel:

(3.4)

If good attack-channel identification techniques are available, then (3.2) is an appropriate formulation of the watermarking game; otherwise, (3.3) should be preferred. Recent results by Somekh-Baruch and Merhav [10] and Cohen and Lapidoth [12] show that, under some assumptions, the values of both games are identical. Our focus in this paper is on (3.2).
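The ordering between the game values rests on the generic maxmin ≤ minmax inequality. A minimal numeric sketch (a toy payoff matrix chosen for illustration, not taken from the paper) makes the two values concrete:

```python
# Toy zero-sum game: the row player maximizes, the column player minimizes.
# For any finite payoff matrix, maxmin <= minmax, mirroring the ordering
# of the lower value (3.3) and the value (3.2) discussed above.
J = [[3.0, 1.0],
     [2.0, 4.0]]

# Lower value: the row player commits first, the column player reacts.
lower_value = max(min(row) for row in J)

# Upper value: the column player commits first, the row player reacts.
upper_value = min(max(J[i][j] for i in range(len(J)))
                  for j in range(len(J[0])))
```

Here the two values differ (2 versus 3), so no saddlepoint exists in pure strategies; when they coincide, the common value is the value of the game.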

IV. MAIN RESULTS

Definition 2.6 gives an operational definition of information-hiding capacity. In Section VIII, we show that memoryless attacks are optimal within a certain class of attack channels with memory. In this section, we assume that attacks are memoryless and show that capacity is the value of a mutual-information

3This holds because the set of feasible encoder–decoder pairs does not depend on the choice of the attack channel. Such inequalities generally do not hold for games over nonrectangular sets.


568 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 3, MARCH 2003

game between the information hider and the attacker. In order to maximize the payoff, the information hider optimally designs the covert channel defined in Section IV-A. In order to minimize the payoff, the attacker designs an optimal memoryless attack channel, defined in Section IV-B. An expression for the information-hiding capacity is derived in terms of the optimal covert and attack channels in Section IV-C and is applied to a simple example in Section IV-F.

A. Covert Channel

Define the support set of

(4.1)

Consider an auxiliary random variable U defined over a finite set of bounded cardinality. The role of U will become apparent in Section IV-C.

Definition 4.1: A memoryless covert channel subject to distortion D1 is a conditional pmf Q(x, u | s, k), such that

(4.2)

The length-N memoryless extension of Q is the conditional pmf

Note that the distortion constraint (4.2) involves Q only via its marginals.

Definition 4.2: The class  is the set of all memoryless covert channels subject to distortion D1.
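As a concrete sketch of Definition 4.1 (binary alphabets and Hamming distortion; the key variable is omitted, and the host pmf and channel below are illustrative choices, not the paper's):

```python
# Check the average-distortion constraint (4.2) for a candidate
# memoryless covert channel Q(x, u | s) over binary alphabets.
p_S = {0: 0.5, 1: 0.5}   # host pmf (illustrative)
D1 = 0.2                 # information hider's distortion budget

def hamming(s, x):
    """Hamming distortion between a host symbol and a composite symbol."""
    return 0 if s == x else 1

# Q[s][(x, u)] = Q(x, u | s): flip the host bit with probability 0.2
# and set the auxiliary variable u equal to the composite bit x.
Q = {
    0: {(0, 0): 0.8, (1, 1): 0.2},
    1: {(1, 1): 0.8, (0, 0): 0.2},
}

avg_distortion = sum(
    p_S[s] * prob * hamming(s, x)
    for s, cond in Q.items()
    for (x, u), prob in cond.items()
)
is_feasible = avg_distortion <= D1   # membership in the class of Definition 4.2
```

As the text notes, only the marginals of Q enter this constraint, which is why the class is cut out by linear inequalities and hence convex.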

To analyze the information-hiding system, we find it convenient to write Q in the cascaded form

B. Attack Channel

Definition 4.3: A memoryless attack channel subject to distortion D2 is a conditional pmf A(y | x), such that

(4.3)

The length-N memoryless extension of A is the conditional pmf

Definition 4.4: The class  is the set of all memoryless attack channels subject to distortion D2 under covert channel Q.

Both sets are defined by linear inequality constraints and hence are convex. The set of memoryless attack channels depends on the covert channel via the marginal distribution it induces. Additional constraints may be imposed on the attack channel. For this reason, we assume in what follows that attack channels belong to a (possibly nonconvex) subset of the class of Definition 4.4, obtained by intersecting it with some compact set of channels. For instance, that set could be a singleton (fixed attack channel), a finite-dimensional parametric family, or a class of channels that introduce no signal bias. Some of our results require that the set be convex or be equal to the full class. The admissible set of covert and attack channel pairs is generally nonrectangular.

Finally, we shall need to consider memoryless attack channels that satisfy the distortion constraint (2.2) for a particular information-hiding code.

Definition 4.5: The class  is the set of all memoryless attack channels that satisfy the distortion constraint (2.2) under the information-hiding code.

By analogy with the classes above, we also define

C. Hiding Capacity

We now introduce Theorem 4.4, which we view as the basic theorem of information hiding. For any arbitrarily complicated encoding scheme and memoryless attack, this theorem gives the maximum rate of reliable transmission for the information hider, under the assumptions that the attacker knows the information-hiding code but not the side information, and that the decoder knows both the information-hiding code and the attack channel. The latter assumption is weaker than it appears at first sight. Even if the attack channel is not given, it is to be expected that the decoder will be able to learn it, provided that N is large enough. While this assumption is reasonable, it might not always be satisfied in practice; see Section III.

The cost function in the mutual-information game is given by

J(Q, A) = I(U; Y | K) − I(U; S | K)    (4.4)

(4.5)

where the random variables S, U, X, Y, K are jointly distributed as

i.e., the attacked data depend on the remaining variables only through the composite data, forming a Markov chain. The payoff satisfies convexity properties stated in Proposition 4.1. Properties ii), iii), iv), v), and vii) are analogous to those stated in Gel'fand and Pinsker's paper [43]. The payoff (4.4) is convex in the attack channel but is nonconcave in the covert channel.
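For finite alphabets, the payoff can be evaluated by direct summation. The following sketch (an assumed toy configuration with the key variable omitted, not the paper's optimum) computes I(U; Y) − I(U; S) when the encoder ignores the host, X = U with U uniform, and the attack is a binary-symmetric channel:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_info(pxy):
    """I(X;Y) in bits from a joint pmf given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

eps = 0.1   # attack crossover probability
# U ~ Bernoulli(1/2), independent of S ~ Bernoulli(1/2); X = U; Y = BSC(eps)(X).
p_UY = {(u, y): 0.5 * ((1 - eps) if y == u else eps)
        for u in (0, 1) for y in (0, 1)}
p_US = {(u, s): 0.25 for u in (0, 1) for s in (0, 1)}   # independence

payoff = mutual_info(p_UY) - mutual_info(p_US)          # = 1 - h2(eps)
```

Since U is independent of S here, the penalty term I(U; S) vanishes and the payoff reduces to the BSC capacity term; host-dependent designs trade the two terms of (4.4) against each other.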

Proposition 4.1 (Convexity Properties of the Payoff):

i) For fixed  and , the payoff (4.4) is convex in .

ii) For fixed ,  and , (4.4) is concave in  [46].

iii) For fixed ,  and , the payoff (4.4) is convex in .

iv) To evaluate the maxmin of (4.4) over , one can restrict  to sets of cardinality .

Page 7: Information-theoretic analysis of information hiding ...jao/Papers/JournalPublications/01184136.pdfIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 3, MARCH 2003 563 Information-Theoretic

MOULIN AND O’SULLIVAN: INFORMATION-THEORETIC ANALYSIS OF INFORMATION HIDING 569

v) For fixed , (4.4) is maximized by some  that has the following property: the cardinality of the support set of  is equal to either one4 or two for each .

vi) Let , where  is conditionally independent of  given . Denote by  the joint conditional distribution of . Then

for all ,

with equality if and only if .

vii) There exists a covert channel achieving the maxmin of (4.4) such that x is a deterministic function of (s, u, k).

Proof: See Appendix.

Theorem 4.4 below follows from Proposition 4.1 iv), from a proof of achievability (Proposition 4.2), and a converse (Proposition 4.3) for the class of attack channels. Achievability is proved using a random bin coding technique [31, p. 410]. The decoder uses joint typical set decoding. The proof of Proposition 4.2 is an adaptation of techniques used by Gel'fand and Pinsker [43] to derive the capacity of a fixed discrete memoryless channel with random parameters (state of the channel) that are known at the encoder but not at the decoder; see Fig. 3. The capacity of this channel is given by

C = max [I(U; Y) − I(U; S)]

where S is the random channel parameter ("state" of the channel), and U is an auxiliary random variable. These results have been extended by Heegard and El Gamal to find the capacity of computer memory with defects [44]. In the information-hiding problem, the host data play the role of the random parameter in Gel'fand and Pinsker's work. The analogy between Gel'fand and Pinsker's work and watermarking problems was described by Chen [13] (who also credits it to Lapidoth) and Willems [11]. Our initial derivation of hiding capacity was based on the analogy between the watermarking problem and Wyner–Ziv's problem of source coding with side information at the decoder [31], [45]. This analogy was further developed by Chou et al. [29] and Barron et al. [30]. Four key differences between our setup and Gel'fand and Pinsker's are

• the presence of distortion constraints;
• the availability of side information at the encoder and the decoder;
• the fact that the encoder does not know the attack channel; and
• the unavailability of the side information to the attacker.

4In this case, x is a deterministic function of (s, u, k).

Fig. 3. Channel with random parameters s that are known at the encoder but not at the decoder.

Due to the latter constraint, the joint distribution of the variables is subject to the constraint that they form a Markov chain, which is a constrained form of the Markov chain relationship in the Gel'fand and Pinsker problem. Proofs of Propositions 4.2 and 4.3 can be found in the Appendix.

Proposition 4.2 (Achievability for Class of Memoryless Attack Channels): Select  that maximizes

over . For any  and sufficiently large N, there exists a length-N information-hiding code

with  and

Proposition 4.3 (Converse for Class of Memoryless Attack Channels): Consider a length-N information-hiding code

with rate R. If for any  we have

as N → ∞,

then there exists an alphabet of cardinality , and a covert channel  such that

Theorem 4.4: Assume that for any N, the attacker knows the information-hiding code, and the decoder knows both the code and the attack channel. A rate R is achievable for distortion D1 and attacks in the class if and only if it does not exceed the hiding capacity C, where

(4.6)

and U is a random variable defined over an alphabet of bounded cardinality.

Corollary 4.5: In the special case K = S, the hiding capacity is given by

(4.7)

where the maximization and minimization are subject to the distortion constraints. Let the indicated pair attain the maximum in (4.7). Any choice of the auxiliary variable that preserves this maximum is optimal; in particular, the design u = x is optimal.

Proof: When K = S, the payoff function (4.4) can be written as

(4.8)

where the inequality follows from the data processing inequality applied to the Markov chain. The inequality


is satisfied with equality when  is a function of . In this case, the payoff (4.4) depends on  only via . Hence, the capacity is given by (4.7).

Remark 4.1: The objective function is convex in  (for fixed ) and concave in , and, therefore, concave in  (for fixed ).

Corollary 4.6 gives an interpretation of the capacity-achieving distribution in Theorem 4.4 when no attack takes place. Corollary 4.7 gives lower bounds on capacity, based on suboptimal choices of the covert channel. An application of this corollary will be presented in Section VIII-A.

Corollary 4.6: Assume no attack takes place, and either K = S or no side information is available at the decoder. In both cases, the information-hiding capacity is given by

where the maximization is subject to the distortion constraint. For the blind information-hiding problem, the choice u = x is optimal.

Proof: If K = S, Corollary 4.5 yields

In the absence of side information, apply Theorem 4.4 with

Evaluating the payoff at the identity attack, we obtain the same value for all possible choices. But capacity cannot be greater than in the case K = S (for which more information is available at the decoder), so we must have the same capacity again.

Corollary 4.7: The information-hiding capacity admits the following lower bound:

where the maximization and minimization are subject to the distortion constraints.

Proof: The bound is obtained by restricting the maximum to covert channels such that

The hiding capacity (4.6) clearly depends on D1 and D2, which can be emphasized by writing C(D1, D2).

Proposition 4.8: Assume the class of attacks is the full distortion-constrained class for all D2. Then C(D1, D2) is a convex, nonincreasing function of D2.

Proof: See Appendix.

In general, there is no such convexity property with respect to D1. The function C(D1, D2) also satisfies the following simple properties.

1) C(D1, D2) is nondecreasing in D1.

2) C(D1, D2) is upper-bounded by the entropy of the host data. Informally, it is easier to hide information in complex data sets than in simple ones.

3) If  and the distortion function satisfies the condition , we have  when , i.e., no information can be transmitted. In this case,  for all .

4) For any fixed D1, the hiding capacity is zero if the class of attacks is large enough to include deterministic attacks that map every input to one fixed output. Such attacks eliminate all traces of the message.

D. Comments

1) The capacity bound (4.6) is achievable if the decoder knows the attack channel used or is able to estimate it reliably from the received data. A weaker requirement would be that a universal decoder over the class exists. There is a well-developed theory of universal decoding for compound channels [32], [34], but extension of this theory and development of universal decoding algorithms for information hiding is still an open problem. Promising results in this direction have recently been obtained by Somekh-Baruch and Merhav [10] and Cohen and Lapidoth [12].

2) Theorem 4.4 states that the problem reduces to a rate-distortion-constrained capacity game. The optimal attack is the solution to a rate-distortion problem, indicating the important role of data compression in the theory. The optimal information-hiding strategy is the solution to a constrained capacity problem.
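As a reference point for this rate-distortion connection (a textbook formula, not derived in this paper): a Bernoulli(p) source under Hamming distortion has R(D) = h(p) − h(D) for 0 ≤ D < min(p, 1 − p), and R(D) = 0 beyond.

```python
import math

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def binary_rate_distortion(p, D):
    """R(D) for a Bernoulli(p) source with Hamming distortion (standard result)."""
    if D >= min(p, 1.0 - p):
        return 0.0
    return h2(p) - h2(D)

r = binary_rate_distortion(0.5, 0.1)   # = 1 - h2(0.1), about 0.531 bit
```

Informally, this is the rate at which an attacker with distortion budget D can requantize the composite data, which is what makes compression the natural attack model.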

3) The information-hiding problem is also related to the wiretapping problem in cryptography [32, p. 407]. The wiretapping problem involves communication of data to a receiver via a direct channel, and the presence of a wiretapper that observes these data through a second channel and knows the codes used by the transmitter. The secrecy capacity is the maximum rate of reliable transmission under the constraint that the equivocation rate for the wiretapper is maximum. The secrecy capacity is zero if the wiretapper's channel is less noisy than the direct channel. In the information-hiding problem, the attacker does more than being a simple wiretapper, as he maliciously degrades the direct channel, which is at least as noisy as the wiretapper's channel. The secrecy capacity is zero unless cryptographic keys are used.

4) Many conventional watermarking algorithms can be viewed as choosing the watermark independently of the host data (see Fig. 2). This choice is generally suboptimal. Evaluation of (4.4) shows that the rates of reliable transmission in that case are upper-bounded by the maximum of the payoff over a smaller set of distributions.


5) Assume a symmetric distortion measure satisfying the stated condition. If the attacker somehow knew or could recover the host signal, the optimal attack would consist in setting the attack output equal to the host signal. In this case, the output of the attack no longer contains any trace of the message, and the hiding capacity is zero. This simple observation motivates a potentially powerful attack, in which the attacker attempts to construct an estimate of the host signal. Specifically, if the attacker is able to construct an estimate that meets the distortion constraint, the payoff is upper-bounded by

(4.9)

for all covert channels. Hence the capacity is bounded accordingly.

E. Solution of Maxmin Game

We have found Lemma 4.11 useful for deriving the solution to a maxmin game defined over a nonrectangular feasible set. The lemma provides sufficient conditions for a pair of strategies to achieve the maxmin over the feasible set. There is a natural way to design the family of attack channels in several problems we have considered. Note that equations for a Nash equilibrium are generally not sufficient to determine the maxmin solution, as Nash equilibria may not be unique. This is true even if the payoff function is strictly concave/convex and the feasible set is convex.

The proof of Lemma 4.11 follows immediately from the two simple lemmas below, which are also useful when a lower or an upper bound on capacity is sought (see, e.g., Sections V and VI).

Lemma 4.9: Consider a particular attack channel. Then

is a lower bound on

Lemma 4.10: Consider a particular family of attack channels, containing one member for each covert channel. Then

is an upper bound on

Lemma 4.11: Consider a covert channel and a family of attack channels such that the lower and upper bounds in Lemmas 4.9 and 4.10 are equal. Then the pair achieves

Fig. 4. Binary channel with optimal information-hiding and attack strategies. Here S ∼ Bernoulli(p), U ∼ Bernoulli(D1), and W ∼ Bernoulli(D2) are mutually independent random variables.

and

(4.10)

(4.11)

Remark 4.2: If the chosen attack is feasible for all covert channels (as is the case for rectangular feasible sets), one can simply choose a constant family. In this case, conditions (4.10) and (4.11) imply that the pair is a saddlepoint of the game.

Remark 4.3: One can always choose

but the minimization problem may be hard to solve for all covert channels. For that particular choice of the family, (4.10) and (4.11) are necessary and sufficient for optimality.

Remark 4.4: The lemma does not require any convexity or even continuity properties of the payoff function.

Remark 4.5: Lemma 4.11 also follows from the relationship between the maxmin solution of the original game and the saddlepoint solution of an induced dynamic game with rectangular feasible set [59].

F. Example: Binary Channel

We illustrate Theorem 4.4 and Corollary 4.5 using a simple problem involving a binary alphabet and a Bernoulli source with parameter p. The distortion function is Hamming distance: d(s, x) = 0 if s = x and d(s, x) = 1 otherwise. The covert and attack channels are subject solely to distortion constraints. We first assume the host data are available at the decoder: K = S. The problem is illustrated in Fig. 4, and its solution is stated in Proposition 4.12. One can think of the random variable U as a direct representation of the message itself.

Proposition 4.12: For D1, D2 < 1/2, the hiding capacity for the information-hiding game above is given by

where we let D1 ⊛ D2 = D1(1 − D2) + D2(1 − D1), and

The capacity-achieving distributions are as follows. The composite data are given by X = S ⊕ U, where ⊕ denotes exclusive-or, and U is a Bernoulli random variable independent of S. Both  and  are optimal. The optimal attack is given by Y = X ⊕ W, where W is a Bernoulli random variable independent of X. For  and , the hiding capacity is . For D2 ≥ 1/2, the hiding capacity is equal to zero.


TABLE I
OPTIMAL COVERT CHANNEL Q(x, u|s) FOR THE BLIND BERNOULLI GAME. THE OPTIMAL ATTACK IS THE SAME BINARY-SYMMETRIC CHANNEL AS IN THE PRIVATE BERNOULLI GAME

Proof: We apply Lemma 4.11 and verify successively conditions (4.10) and (4.11). We choose the attack defined in the statement of the proposition; it introduces distortion D2 and hence is feasible for all covert channels. Let U and W be as in Fig. 4, with W independent of (S, U). The payoff function is the conditional mutual information. Assume D1, D2 < 1/2.

Step 1. We have, for all feasible channels,

where the first step holds by definition of the conditional mutual information, the second because of the Markov chain property, and the inequality because conditioning reduces entropy. Equality is achieved if and only if the attack noise is independent of the remaining variables. The final inequality holds because U and the attack noise are independent (by the Markov chain property and the independence of U and S). So condition (4.11) is satisfied.

Step 2. With the attack specified in the statement of the proposition, S and W are independent. Moreover, the Markov chain property holds, so U and W are also independent. For all covert channels, we have

where the first inequality holds because conditioning reduces entropy, and the second holds because U and W are independent. So condition (4.10) is satisfied.

The derivation of hiding capacity in the remaining cases is straightforward.

The encoding system above is the same as Vernam's one-time pad encryption system [47]. The conditional distribution of the data given the message and the marginal distribution of the data are identical, which means that this system satisfies Shannon's perfect secrecy condition [25], [47]: observing the data does not provide the attacker with any information about the message. Moreover, the encoding system above satisfies the basic requirement of a steganographic system: the distributions of the host data and of the composite data are identical, so it is impossible for an observer (such as the attacker) to determine whether the data were drawn from one or the other.
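The perfect-secrecy and steganographic properties both rest on a one-line fact about the exclusive-or embedding (a minimal check; a uniform host is the assumption that makes it work):

```python
def xor_marginal(p_S1, p_U1):
    """P(X = 1) when X = S xor U with S, U independent Bernoulli variables."""
    return p_S1 * (1 - p_U1) + (1 - p_S1) * p_U1

# With a uniform host, the composite data are uniform for ANY embedding
# pmf of U, so observing X reveals nothing about U (or about whether
# embedding took place at all).
p_X1 = xor_marginal(0.5, 0.2)   # = 0.5, identical to P(S = 1)
```

This is exactly the one-time-pad mechanism: the marginal of X matches the marginal of S, so host and composite data are statistically indistinguishable.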

To illustrate the difference between private and blind information hiding, we also derived the solution to the maxmin game in the latter case. When no attack takes place, hiding capacity is obtained from Corollary 4.6; in this case, it is the same as in the private Bernoulli game.

The capacity of the blind Bernoulli game is strictly less than the capacity of the private Bernoulli game when the attacker's distortion level is positive. To show this, we performed a numerical search over the feasible set. The optimal cardinality for the alphabet of the auxiliary variable was found to be strictly less than the upper bound given by Theorem 4.4. The optimal attack is the same as in the private Bernoulli game. The optimal covert channel is given in Table I for a low-distortion and a high-distortion pair of settings. The capacities for the blind information-hiding game are about 30% lower than the corresponding capacities for the private information-hiding game. Observe that the optimal covert channel is a zero/one pmf.

V. CONTINUOUS ALPHABETS

The results above can be extended to the case of infinite alphabets. Given a probability measure, mutual information is defined as [48, Ch. 2.5], [49, Ch. 2]

(5.1)

(5.2)

where the supremum is over all finite partitions of the input and output alphabets, yielding finite-alphabet random variables.

Assume the alphabets are finite-dimensional Euclidean spaces or compact subsets thereof, and the joint density is bounded and continuous over its domain. The distortion functions are assumed to be continuous. The distortion constraints for the information hider and the attacker are given by (2.1) and (2.2), where sums are replaced by integrals. Likewise, memoryless covert and attack


channels are defined by (4.2) and (4.3), with sums replaced by integrals. These channels are now conditional probability density functions (pdfs). To ensure that the feasible sets of covert and attack channels are compact, we restrict their elements to an appropriate set of "well-behaved" functions (including properties such as boundedness and continuity).

Under these assumptions, the mutual informations (5.1) and (5.2) are finite [48], and so is the payoff

for all admissible channels. For any ε > 0, select a finite partition of the sets such that

(5.3)

(5.4)

(5.5)

(5.6)

for all . The existence of such a partition is guaranteed by our smoothness assumptions. Now let

The hiding capacity for the continuous game is defined as

(5.7)

From (5.3) and (5.4), we obtain

and hence,

Therefore, (5.7) becomes

For compact sets of channels, suprema and infima may be replaced by maxima and minima, and the results from Section IV still apply,5 provided that sums are replaced with integrals.

A. Gaussian Channels

The case of Gaussian host data and squared-error distortion measures is of considerable interest, as it becomes possible to explicitly compute the distributions that achieve capacity. We refer to this case as the Gaussian channel. Here the host data are zero-mean Gaussian with variance σ², and the alphabets are the real line. The class of attack channels is the distortion-constrained class. We consider two special cases. In the first one, the side information is the host data itself. In the second case (blind information hiding), no side information is available at the decoder.

The case of non-Gaussian host data is of course more difficult, but useful results can still be obtained. In particular, lower and

5With the exception of Corollary 4.6, the lower bounds in Corollary 4.7 (where entropy H should be replaced with differential entropy), Proposition 4.1 iv), v), and Comment 5 in Section IV-D.

Fig. 5. Optimal information-hiding and attack strategies for Gaussian host data S ∼ N(0, σ²). Here Z ∼ N(0, D1 − (a − 1)²σ²) and W are mutually independent random variables. The optimal attack is the Gaussian test channel with distortion level D2. For private watermarking, the scaling parameter equals a.

upper bounds on hiding capacity can be obtained by application of Lemmas 4.9 and 4.10. We found this bounding technique to be very useful for non-Gaussian channels, provided that the attack and the family of channels are suitably selected.

B. Host Data Are Available at the Decoder

Theorem 5.1 gives the hiding capacity when the host data are available at the decoder. The capacity-achieving distributions for Gaussian host data are given in Fig. 5. The same expressions were obtained by Cohen and Lapidoth [12], who corrected a mistake in our original derivation of the maxmin solution [24]6 and expanded the framework of the analysis.

Theorem 5.1: Let the host data be zero-mean Gaussian with variance σ², and let the distortion function be the squared-error distortion measure. Assume the host variance is large enough relative to the distortion levels. Let a be the maximizer of the function

(5.8)

in the interval indicated, where

Then we have the following.

i) If , the hiding capacity is .

ii) If the host is non-Gaussian with mean zero and standard deviation σ, the hiding capacity is upper-bounded by

(5.9)

iii) If  and , the hiding capacity is given by (5.9). The optimal covert channel is given by , where

6We had shown that the distributions Q, A in the statement of the theorem with a = 1 form a Nash equilibrium of the game, but as mentioned in Section IV-E, this property is not sufficient for maxmin optimality.


is independent of . The optimal attack is the Gaussian test channel from rate-distortion theory [31]

(5.10)

where , and .

Proof: See Appendix.

Corollary 5.2: In the asymptotic case , we have

(5.11)

The additive white Gaussian noise attack channel is asymptotically optimal.

Proof: See Appendix.
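A numeric reading of the asymptotic capacity (5.11) (assuming capacity in bits per sample): it is the AWGN capacity with signal power D1 and noise power D2, independent of the host variance.

```python
import math

def asymptotic_gaussian_capacity(D1, D2):
    """C = (1/2) log2(1 + D1/D2): asymptotic hiding capacity (5.11), in bits."""
    return 0.5 * math.log2(1.0 + D1 / D2)

c = asymptotic_gaussian_capacity(D1=1.0, D2=1.0)   # 0.5 bit per sample
```

Equal hider and attacker distortion budgets thus leave half a bit per sample in this regime.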

Note that the capacity (5.9) is zero in the regime corresponding to a host signal that is weak relative to the specified distortion levels. The zero-capacity problem is precisely the main limitation of some early watermarking schemes, in which watermarks were hidden in the least significant bit(s) of the host signal in a straightforward attempt to introduce minimal distortion. Unfortunately, such watermarks were easy to eliminate by an attacker that would simply randomize the least significant bit(s), and cause a minimal amount of distortion in the process. Cox et al. [16] were among the first researchers to realize the fundamental limitations of this approach.

The role of the host-signal scaling parameter a in Theorem 5.1 is to boost the composite signal and thereby to reduce the effective noise variance of the Gaussian test channel (see (E13) and Fig. 5). From (5.9), (E17), and (E18), the capacity can be written in a form with two terms, one of which decreases as a increases while the other increases. Hence, the optimal value of a results from a tradeoff.

C. Blind Information Hiding

Rates of reliable transmission for blind information hiding clearly cannot be higher than the rates in the case the decoder has access to side information. Hence hiding capacity is upper-bounded by (5.9).

Theorem 5.3 gives the optimal blind-information-hiding strategy and the optimal attack for Gaussian host data. The optimal attack is again the Gaussian test channel (5.10). The optimal distribution is the same optimal distribution that achieves capacity in a problem studied by Costa [50]. Costa's result is an elegant extension of Gel'fand and Pinsker's results (see Fig. 3) to the case of additive white Gaussian noise channels with input power constraints. The hiding capacity for the blind-information-hiding problem is given by (5.9) again! In other words, the achievable rate of reliable transmission is the same whether or not the host data are known at the decoder.7

7This conclusion was originally obtained by Chen for a closely related problem that assumes a fixed additive white Gaussian noise attack channel [13], [14].

Theorem 5.3: Let the host data be zero-mean Gaussian with variance σ², and let the distortion function be the squared-error distortion measure, under the assumptions of Theorem 5.1. Let a be the maximizer of the expression (5.8) in the interval indicated. Then, the following distribution yields the maxmin solution of the game (4.4): the covert channel of Fig. 5, in which the added noise is independent of the host. The optimal attack is the Gaussian test channel (5.10). Here

and

The hiding capacity is the same as (5.9) in the private watermarking game.

Proof: See Appendix.

Corollary 5.4: If the host is non-Gaussian with mean zero and variance σ², (5.9) is an upper bound on hiding capacity.

Proof: Capacity cannot be larger than in the case where the host data are known at the decoder. Capacity in that case is upper-bounded by (5.9), according to Theorem 5.1 ii).

Comments:

1) Consider the use of a covert channel of the same parametric form, but where the parameters are not given by the optimal values in Theorem 5.3. A special case of this design is the constrained encoding strategy of Fig. 2, for which the watermark is added directly and a codebook is designed for it (corresponding to U above). In the small-distortion case, the host signal acts as a strong interference, and the capacity of the cascaded covert/attack channel is limited by this interferer. The rate of reliable transmission for that suboptimal class of encoders is given by

(5.12)

Comparison of (5.9) and (5.12) shows that dramatic improvements are possible by using an optimal encoding strategy.

2) Consider the design of a codebook for  (corresponding to  above). One practical such system, based on dithered quantizers, is described by Chen and Wornell [21]. The gap to capacity for such schemes is further discussed in their recent paper [14].

3) Consider now two plausible but suboptimal attacks. The additive white Gaussian noise attack is suboptimal but is asymptotically optimal for large host variance, because the Gaussian test channel reduces to it in this case. In contrast, the attack which attempts


Fig. 6. Rates of reliable transmission for Gaussian channels, using different information-hiding strategies, assuming D = 1 and σ = 10. Current designs in the literature often operate far below capacity, as indicated on the graph and discussed in the text.

to degrade the message by recovering the host using the maximum a posteriori (MAP) estimation rule, is completely inefficient. Here the attack output is an invertible function of its input. Hence, the data processing inequality is satisfied with equality, and the attack fails to remove any information. However, an "invertible attack" can be very effective if a suboptimal decoder is used; for instance, a simple scaling of pixel intensities has been known to defeat some image watermarking decoders [3].

4) Fig. 6 illustrates different designs of for Gaussian channels.

5) Consider again the small-distortion case . Then, and as and . To achieve the capacity bound using the random binning technique, we need a codebook of size not

but , where

is typically very large relative to .

Structured codebooks based on dithered quantizers were developed in 1999 and are described in Lin's thesis [51] and in Chen and Wornell's paper [14].
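The dithered-quantizer idea behind these structured codebooks can be sketched in a few lines. The following is a minimal scalar quantization-index-modulation (QIM) example in the spirit of such schemes, not the exact construction of [14], [21], or [51]; the step size `delta` and the attack noise level are illustrative choices.

```python
import numpy as np

def qim_embed(host, bits, delta=4.0):
    """Quantize each host sample onto one of two interleaved lattices:
    offset 0 encodes bit 0, offset delta/2 encodes bit 1."""
    dither = bits * (delta / 2.0)
    return np.round((host - dither) / delta) * delta + dither

def qim_decode(received, delta=4.0):
    """Decide each bit by which lattice is closer to the received sample."""
    d0 = np.abs(received - np.round(received / delta) * delta)
    r1 = np.round((received - delta / 2) / delta) * delta + delta / 2
    d1 = np.abs(received - r1)
    return (d1 < d0).astype(int)

rng = np.random.default_rng(0)
host = rng.normal(0.0, 10.0, 1000)              # host signal
bits = rng.integers(0, 2, 1000)                 # message bits, one per sample
marked = qim_embed(host, bits)                  # embedding distortion ~ delta^2/12
attacked = marked + rng.normal(0.0, 0.4, 1000)  # mild AWGN attack

assert np.mean(qim_decode(attacked) == bits) > 0.95
```

Note that, unlike additive spread spectrum, the embedding distortion here does not grow with the host variance, which is why such schemes behave well against strong host interference.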

Fig. 7. Sphere-packing interpretation of blind Gaussian information hiding. The encoder selects one of the 2^{nR} scaled codewords U within a medium-size sphere centered at S. The received data vector Y lies within a small sphere centered at U. Shaded spheres are indexed by the same message m.

D. Sphere-Packing Interpretation

The property that capacity is the same whether or not is known at the decoder can be illustrated using sphere-packing arguments; see Fig. 7. We consider the asymptotic case of small distortions: and . In this case, and . From Theorem 5.3, we write

(5.13)

Page 14: Information-theoretic analysis of information hiding ...jao/Papers/JournalPublications/01184136.pdfIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 3, MARCH 2003 563 Information-Theoretic

576 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 3, MARCH 2003

(5.14)

where we have defined the random variable

(5.15)

which is distributed as , with

as

The variance of in (5.13) is

as

In the small-distortion case, is asymptotically independent of and hence of .

The encoder in the random binning construction (see proof of Proposition 4.2) selects a codeword that is jointly typical with , i.e., lies within a medium-size sphere of radius centered at . There are approximately

codewords (one for each possible message) within this medium-size sphere.

From (5.14), the received data vector lies within a small sphere of radius centered at . Decoding by joint typicality means decoding to the center of the closest small sphere. To yield a vanishing probability of error, the small spheres should not overlap. The number of small spheres that can be packed inside the medium-size sphere is

as

This number must be equal to , so .

The decoding procedure does not require knowledge of . Moreover, this procedure does not even make use of the fact that is Gaussian distributed. This helps explain why similar results are obtained in the non-Gaussian case (in the small-distortion regime); see Section VI.
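The counting can be written out explicitly. As a sketch, assuming (per the small-distortion setting above) a medium-size sphere of radius $\sqrt{n(D_1+D_2)}$ and small spheres of radius $\sqrt{nD_2}$:

```latex
\frac{\operatorname{Vol}(\text{medium sphere})}{\operatorname{Vol}(\text{small sphere})}
  = \frac{c_n \bigl(\sqrt{n(D_1+D_2)}\bigr)^{n}}{c_n \bigl(\sqrt{n D_2}\bigr)^{n}}
  = \Bigl(1 + \frac{D_1}{D_2}\Bigr)^{n/2}
  = 2^{\frac{n}{2}\log_2\left(1 + D_1/D_2\right)},
```

where $c_n$ is the volume of the unit ball in $\mathbb{R}^n$. Setting this count equal to $2^{nR}$ gives $R = \tfrac{1}{2}\log_2(1 + D_1/D_2)$. The count depends only on the ratio of the two radii, not on the location of the medium sphere, which is why knowledge of the host realization at the decoder does not change the achievable rate.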

E. Optimal Decoder

Optimal decoding performance is obtained using the MAPdecoding rule

where denotes the codebook for . For the optimal information-hiding and attack strategies of Theorem 5.3, we let

and obtain

(5.16)

where

is approximately equal to if . Hence, the decoder simply scales the received by a factor of and finds the codeword closest to in the Euclidean-distance sense. A practical watermarking system based on this concept is described in Lin's thesis [51]. For the conventional but suboptimal design in Section V-C, is approximately the same for all , and the MAP decoding rule (5.16) is approximately equivalent to the maximum-correlation rule

(5.17)

If the pair of random variables is non-Gaussian, or if is not the same for all , the maximum-correlation rule (5.17) is suboptimal. In the current watermarking literature, a maximum-correlation method similar to (5.17) is often used to decode watermarks. Another commonly used decision statistic is the normalized correlation coefficient between and [7], [16].
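As a toy illustration (not the system of [51]; the codebook, scaling factor, and noise level are invented for the example), the following shows that with equal-norm codewords, minimum-Euclidean-distance decoding after undoing a known scaling coincides with the maximum-correlation rule:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_messages = 256, 16
codebook = rng.choice([-1.0, 1.0], size=(n_messages, n))  # equal-norm codewords

host = rng.normal(0.0, 10.0, n)
m_true = 7
x = host + codebook[m_true]              # composite data
y = 0.5 * x + rng.normal(0.0, 0.5, n)    # invertible scaling plus noise

# Undo the (known) scaling; subtract the host for simplicity
# (informed decoding), then decode.
yprime = y / 0.5 - host

# With equal-norm codewords, minimum Euclidean distance and maximum
# correlation give the same decision:
m_dist = int(np.argmin(((yprime - codebook) ** 2).sum(axis=1)))
m_corr = int(np.argmax(codebook @ yprime))
assert m_dist == m_corr == m_true
```

A decoder that omitted the rescaling step would see all correlations shrink by the attack gain, which is exactly how the invertible scaling attack of Section V-C defeats naively calibrated detectors.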

VI. SMALL-DISTORTION ANALYSIS

The case of small distortions and is typical of many information-hiding problems. One may wonder whether some simplifications occur in the theory, as happens in rate-distortion theory [52].

We show that this is indeed the case. We consider the squared-error distortion metric over the real line and show that the hiding capacity is independent of the statistics of , asymptotically as . This complements the analogous remarkable Costa-type result for Gaussian channels in Section V-C, which was, however, valid for all distortion levels. Our result is formally stated in Theorem 6.1. An intuitive explanation for this apparently surprising result is that hiding capacity is essentially determined by the geometry of small distortion balls, and there are many such small balls within regions where is essentially flat; refer back to the discussion at the end of Section V-D.

Theorem 6.1: Let and be the squared-error distortion measure. Assume that no side information is used and that has zero mean and variance and is bounded and continuous. Then the capacity

is asymptotic to the hiding capacity in the Gaussian case:

as

The distribution that asymptotically achieves capacity is the same as in the Gaussian case: , , where , is independent of , and

is the Gaussian test channel (5.10) (with ).

Proof: See Appendix.

VII. FURTHER EXTENSIONS

The framework developed in this paper can be used to analyze the performance of a variety of information-hiding systems. Some useful extensions are considered in the following. We assume finite alphabets throughout this section.


A. Attacker Knows Message

Some information-hiding systems may not use side information, or the attacker may have managed to find out what the side information is (e.g., by breaking the code). In both cases, the attacker is able to decode the message. (This does not necessarily mean that he is able to remove traces of the message from .) Clearly, the hiding capacity is upper-bounded by the capacity from Theorem 4.4, because the attacker uses more information. But can the hiding capacity still be strictly positive?

The solution of this problem is as follows. We define an attack channel as a conditional pmf , and let be the set of all such channels satisfying the distortion constraint

(7.1)

The attack channel belongs to a subset of .

Then we have the following theorem, which takes the same form as Theorem 4.4, except that the sets are larger than in Theorem 4.4.

Theorem 7.1: Assume the attacker knows both and the side information , and the decoder knows , , and the attack channel. A rate is achievable for distortion and attacks in class if and only if , where

(7.2)

Proof: The proof parallels that of Theorem 4.4. For achievability, the encoder and decoder use the same random binning technique as in Proposition 4.2. But now the attacker is able to decode the codeword used. He is then able to generate from the distribution . The only modification in the proof of the converse theorem (Proposition 4.3) is that

Gaussian channels. If the decoder does not know and does not have access to any side information, what is the performance of the encoding schemes in Section V-C? Consider the design in Fig. 5 again. Under this design, we have . Conventional information-hiding systems use

[16]. For any , the deterministic attack

is admissible if ; but the rate of reliable transmission under this attack is zero. To improve performance, the information hider must use side information, or choose , in each case restricting the range of options available to the attacker.

B. Measuring Distortions With Respect to Host Data

In some applications, it may be more natural for the attacker to measure distortions with respect to the host data rather than with respect to the composite data . Consider the distortion constraint

(7.3)

as an alternative to (2.2) for the attacker.

Redefine a memoryless attack channel as a pmf that satisfies the distortion constraint

(7.4)

where , and the class as the set of all memoryless attack channels that satisfy (7.4).

The payoff function is the same as in Sections II–IV. The problems of deriving capacity for the class above and for the class in Definition 4.2 have a similar mathematical structure. In some cases, the results derived in Sections I–VI can be extended to defined above [56].

C. Steganography

The purpose of steganography is to convey a message to a receiver in such a way that the very presence of the message is undetectable by a third party. Other restrictions may apply. In particular, we assume that the distortion constraints (4.2) and (4.3) are imposed on the information hider and the attacker, respectively.

Cachin [19] introduced a natural requirement for steganographic systems. The information hider can guarantee a certain level of undetectability provided that the relative entropy

is no greater than some small, specified value . If is small enough, it becomes impossible for the attacker to determine whether was drawn from the distribution or from . The requirement introduces an additional convex constraint on and hence on the covert channel .

For memoryless sources and covert channels, the relative entropy increases linearly with

It follows that, unless , information-hiding schemes are always detectable for large enough . For perfect undetectability, we need , hence (as in the Bernoulli example of Section IV-F).
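The linear growth of the relative entropy is easy to verify numerically for a memoryless model. In this sketch, `p_s` and `p_y` are arbitrary binary distributions standing in for the host and attacked-data statistics:

```python
import itertools
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p_s = [0.5, 0.5]   # host-symbol distribution (illustrative)
p_y = [0.6, 0.4]   # attacked-data distribution (illustrative)

for N in (1, 2, 3):
    # product distributions over length-N sequences (memoryless model)
    pN = [float(np.prod(t)) for t in itertools.product(p_y, repeat=N)]
    qN = [float(np.prod(t)) for t in itertools.product(p_s, repeat=N)]
    # relative entropy is additive: D(p^N || q^N) = N * D(p || q)
    assert abs(kl(pN, qN) - N * kl(p_y, p_s)) < 1e-12
```

Unless the single-letter divergence is exactly zero, the sequence-level divergence therefore grows without bound in the block length, which is the detectability statement above.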

VIII. BLOCKWISE MEMORYLESS INFORMATION HIDING AND ATTACK STRATEGIES

In this section, we extend the basic results of Section IV to a simple class of attack channels with memory. This allows us to explore the optimality properties of memoryless information-hiding and attack strategies. Practical applications of this setup would include problems in image processing, where image data are partitioned into blocks (of size ), and the image is subject to some complex attack on each block, rather than to independent


Fig. 8. Block diagram for blockwise i.i.d. sources and blockwise memoryless attack channels.

attacks on the constituent pixels. A JPEG compression attack would nearly fit this model.8 We use boldface letters to denote blocks, e.g., . Definitions 8.1–8.5 are extensions of Definitions 4.1–4.5.

Definition 8.1: A blockwise memoryless covert channel, subject to distortion , is a conditional pmf such that

(8.1)

The length- memoryless extension of is

Definition 8.2: The class is the set of blockwise memoryless covert channels subject to distortion .

Definition 8.3: A blockwise memoryless attack channel, subject to distortion , is a conditional pmf from to which satisfies the blockwise distortion constraint

(8.2)

The length- blockwise memoryless extension of is the conditional pmf

where refers to the th block of data, , and .

Definition 8.4: The class is the set of all blockwise memoryless attack channels subject to distortion under covert channel .

Definition 8.5: The class is the set of all blockwise memoryless attack channels that satisfy the distortion constraint (2.2) under the information-hiding code .

8In the JPEG standard, the zero-frequency coefficients are encoded using a differential pulse code modulation (DPCM) technique. This introduces a small dependency between blocks.

We refer to as the attack-channel block length and assume that attack channels belong to a subset of , denoted by . The integer is fixed and independent of .

The pair is now assumed to be blockwise i.i.d.

(8.3)

This is a generalization of the model in Section IV, which assumed the individual symbols to be i.i.d. A block diagram of the system is shown in Fig. 8. For the model (8.3), we obtain the following result, which is a straightforward application of Theorem 4.4, using the alphabets , , , and in place of , , , and . The auxiliary alphabet can be taken to have a product form without loss of generality.

Proposition 8.1: Assume that for any , the attacker knows , and the decoder knows both and the attack channel. A rate is achievable for distortion and attacks in the class if and only if , where

(8.4)

and

(8.5)

A. On the Optimality of Memoryless Strategies

It follows directly from Propositions 4.2 and 4.3 that, under the assumptions of a time-invariant, memoryless attack and a memoryless source , the optimal (capacity-achieving) information-hiding strategy is time-invariant and memoryless. Proposition 8.2 i) and ii) below establish a dual result, namely, under a memoryless information-hiding strategy, the worst attack channel in the class is memoryless and time invariant. Proposition 8.2 iii) shows that the maxmin-optimal strategies for both players are memoryless.

Proposition 8.2: Assume the source is memoryless. Then we have the following.

i) Assume


Let be the marginals of , and be the product of these marginals. Then , and .

ii) If the covert channels in Part i) are independent of , then the optimal attack channels must also be independent of .

iii) The solution of the maxmin game (8.4) is achieved by memoryless and .

Proof: See Appendix.

If there exist dependencies between the components of , what are the effects on hiding capacity? If , one expects that dependencies between components of should reduce capacity, as the attacker may be able to develop a more efficient attack, based on his improved capability to estimate (as in the fingerprinting problem of Section VIII-B). If no side information is available at the decoder, then it is not clear that the same result should hold: the decoder is penalized by its lack of knowledge of , and so its task may be facilitated if dependencies exist between components of . Proposition 8.3 states our result when .

Proposition 8.3: Assume that is a blockwise memoryless source with block size , that , and that the attack channel is blockwise memoryless with block size . Let be the product of the marginals of . Let and be the capacities of the watermarking game subject to distortion and attacks in the class , assuming distributions and , respectively. Then , with equality if and only if .

Proof: See Appendix.

A popular class of attacks in the image watermarking literature is the so-called geometric attacks, such as attacks that apply a warping operation to an image. Blockwise-memoryless versions of such attacks may be ineffective for large blocks, because they fail to introduce sufficient randomness and/or irreversible data processing operations. Consider, for instance, a simple model in which these attacks are blockwise memoryless (with block length ) and take the following form:

(8.6)

where is a random variable with entropy , and the functions and are invertible. The set contains a null element such that is the identity function. This model for geometric attacks defines a set of attack channels; see Section IV-B. Let be the set of blockwise memoryless attack channels that satisfy both (8.6) and the distortion constraint (8.2).
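A low-entropy invertible attack of the form (8.6) can be neutralized by a decoder that searches the attack family. A minimal sketch, with cyclic shifts standing in for the geometric operation (the watermark, signal sizes, and correlation search are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 4096, 8                         # block length, size of attack family
host = rng.normal(0.0, 10.0, n)
wm = rng.choice([-1.0, 1.0], n)        # spread-spectrum watermark
x = host + wm                          # composite data

shift = int(rng.integers(K))           # attacker's random parameter, entropy <= log K
y = np.roll(x, shift)                  # invertible "geometric" attack

# The decoder searches the (small) attack family and inverts the best match:
scores = [float(np.roll(y, -k) @ wm) for k in range(K)]
assert int(np.argmax(scores)) == shift
```

Because the family has only K members, the search costs a factor K in decoding complexity and at most log K bits of rate, matching the intuition that such attacks are inefficient when the attack-parameter entropy is small.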

Proposition 8.4: Consider an i.i.d. source with distribution , where either or there is no side information. Then in (8.4) satisfies

(8.7)

where is the capacity obtained when no attack takes place, and the maximization is subject to the distortion constraints. Moreover

as (8.8)

Proof: The expression for is given by Corollary 4.6. The second inequality in (8.7) clearly holds. Likewise, from Corollary 4.7, we obtain

(8.9)

Now

where the second and fourth equalities hold because of our invertibility assumptions on the function . The first inequality in (8.7) follows immediately, due to the inequality

Equation (8.8) follows from the inequality

Proposition 8.4 shows that the attack is inefficient if is small, i.e., can be reliably estimated from the data , which is hardly surprising. But even if is unidentifiable , the attack is inefficient for large .

B. Fingerprinting

In fingerprinting applications, the information hider makes several copies of the host data available to different users. However, a different message is embedded in each copy. The message is a fingerprint, or serial number, which can be used to trace any unauthorized use of the signal back to the user. The fingerprint may contain additional user-specific information. The user should not be able to remove traces of the fingerprint without seriously degrading the signal, and the fingerprint itself should be imperceptible. Developing a successful fingerprinting system is difficult because of possible collusion between multiple users [3], [16], [17]. We show later that collusion allows users to compute a good estimate of the host signal, which contains little residual information about the individual messages. Single-user detectors are used to extract the message embedded in each user's data; it is assumed that these detectors do not know who the colluders are and do not communicate.
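The collusion effect is easy to reproduce numerically. In this sketch (an averaging collusion against additive, mutually independent fingerprints; all parameters are illustrative), each residual fingerprint is attenuated by a factor equal to the number of colluders:

```python
import numpy as np

rng = np.random.default_rng(3)
n, L = 100_000, 20                          # signal length, number of colluders
host = rng.normal(0.0, 10.0, n)
fps = rng.choice([-1.0, 1.0], size=(L, n))  # one unit-power fingerprint per user
copies = host + fps                         # the L fingerprinted copies

colluded = copies.mean(axis=0)              # averaging collusion

# Each residual fingerprint is attenuated from power 1 to about 1/L ...
assert abs((colluded - host) @ fps[0] / n - 1.0 / L) < 0.02
# ... and the colluded copy is a good estimate of the host itself.
assert np.mean((colluded - host) ** 2) < 2.0 / L
```

A single-user correlation detector therefore sees its decision statistic shrink by 1/L, consistent with the exponential capacity decay established in Theorem 8.5 below.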

Our model for this fingerprinting problem is as follows.9 Referring to Fig. 9, assume there are users, all potential colluders. Let be the data sent to user at time . Also let , , . Each user's sequence is individually encoded according to

9In many fingerprinting applications, the decoder has access to the host data as well as to the individual messages. Such problems are categorized as verification rather than transmission problems and are not considered here.


Fig. 9. Block-diagram representation of the fingerprinting problem.

where is the fingerprint for user , and is the encoder defined in Section II. In other words, the same host signal and the same side information are used for encoding all messages. Write and . The messages are assumed to be independent and uniformly distributed over . Next, the attack channel is defined as a conditional pmf from to , and its memoryless extension as . The sets of admissible covert and attack channels are denoted by and , respectively. Each message is decoded as , where is the decoder defined in Section II. The average probability of error (over messages and users) is

Code rate and capacity can then be defined as in Section II. We obtain the following result. Capacity is not the same as in Proposition 8.1 due to the use of single-user detectors.

Theorem 8.5:

i) For any attack subject to distortion , a rate is achievable if and only if , where

(8.10)

the sets and are defined in Definitions 4.2 and 8.4, respectively, and

(8.11)

The optimal attack channel satisfies the following symmetry property: is independent of .

ii) Assume that , the distortion function is symmetric, , and , where

is the Chernoff distance between distributions and . Then the capacity tends to zero exponentially fast (with rate lower-bounded by ) as .

Proof:

i) The proof parallels the proof of the achievability theorem (Proposition 4.2) and the converse theorem (Proposition 4.3). The data available to user are and . The encoding functions are constrained to be the same for all users and the messages are independent, so the covert channel is the same for all users and takes a product form:

The symmetry property follows from the convexity of with respect to .

ii) Assume the information hider uses the capacity-achieving distribution . Let each user implement the memoryless attack , the MAP estimate of from . The attack for all defines an attack channel and yields an upper bound on capacity. Next we use Fano's inequality and Chernoff bounds to show that tends to zero exponentially fast with , verify that the distortion constraint is satisfied, and invoke a simple extension of (4.9) to conclude the proof.

The error probability for the MAP estimation problem is . Using the Chernoff bound on the probability of error of binary hypothesis tests [31], we write

(8.12)

where . Since by assumption, the MAP estimate converges a.s. to as


Hence, , and the distortion constraint is satisfied for large enough.

Now Fano's lemma (applied to the same problem of estimating given ) gives , where

is monotonically increasing over the interval and is asymptotic to as . Hence, the Fano and Chernoff bounds can be combined to yield the upper bound

as (8.13)

The technique used to establish (4.9) can be used again to show that

The asymptotic inequality (8.13) implies that

as

Observe that the optimal attack channel is not memoryless, and that the exponential decrease in capacity holds for any . However, the bound (8.12) becomes looser as increases. The papers [16], [17] contain examples of specific fingerprinting algorithms whose performance degrades dramatically for large . According to Theorem 8.5, these results hold for any fingerprinting algorithm in a broad class. It may seem surprising that the MAP attack, which was so ineffective in problems of information hiding using Gaussian data (Section V-C), is so effective here. This is because the fingerprint attack, while deterministic, is many-to-one.

If the message to be embedded contains two parts, a message that is common to all users (say, copyright information) and user-dependent messages (fingerprints), then the results above suggest a two-stage encoding technique, where in the first stage the common message is embedded in the data set to produce an intermediary data set , and in the second stage the fingerprints are embedded in the data set to produce the fingerprinted data .

IX. SUMMARY AND CONCLUSION

We have presented an information-theoretic analysis of information hiding, including the fundamental theorem of information hiding that characterizes the communication rate achievable for the information hider in the face of an optimal attack. The goal of the information hider is to maximize the rate of reliable transmission; the goal of the attacker is to minimize that rate. The payoff function in this game is a difference between two mutual informations. Different expressions for the hiding capacity are obtained depending on the knowledge available to the information hider, to the attacker, and to the decoder. We have primarily focused on a scenario involving the following assumptions. The information hider does not know the attack that will be implemented. The attacker does not know the side information available to the decoder and hence is unable to decode the message. He designs an attack channel based on the data available to him. The decoder knows (or is able to learn) the attack channel. The allowed distortion levels (for specified distortion functions) for the information hider and the attacker are and , respectively. These distortion levels, respectively, characterize the transparency of the information-hiding scheme and the severity of the attacks.

Our analysis shows the fundamental role of optimal data compression strategies in the attack and of channel coding in the hiding. Under the assumptions stated above, the hiding capacity is the solution to a maxmin optimization problem. In blind-information-hiding problems with Gaussian host data and a squared-error distortion function, the optimal attack is the Gaussian test channel, and the hiding capacity is the same as if the host data were available at the decoder. This result may seem surprising but is analogous to results by Costa [50] and Chen [13]. The hiding capacity for non-Gaussian host data is upper-bounded by the capacity for Gaussian host data with the same variance.

We have also conducted an analysis of the information-hiding problem in the case of small squared-error distortion. Remarkable simplifications arise in this case. The hiding capacity is asymptotic to , in the limit as , independently of the statistics of the host data set, whether or not the decoder knows the host data.

While reasonable assumptions have been made about the knowledge available to the information hider, the attacker, and the decoder, other assumptions may be preferable in some applications and would require an extension of our basic theory. We have briefly described some of these extensions.

Much work remains to be done in designing practical information-hiding codes that approach capacity. Recent results have been reported in [14], [51], [53]–[55]. Our analysis has outlined the potential benefits of using randomized codes. Other practical problems include the choice of a suitable distortion measure, which is a holy grail in audio, image, and video processing. Theoretical problems include the computation of error exponents and reliability functions [8], and the design of universal decoders that perform well for a broad class of attack channels [10], [12].

APPENDIX A

A. Proof of Proposition 4.1

Here we prove convexity properties of the payoff (4.4).

i) The payoff (4.4) depends on only through the first term, . From elementary properties of conditional mutual information, is a convex function of and hence a convex function of .

iii) Write

(A1)

where

(A2)


To check convexity of with respect to , rewrite (A1) as

(A3)

The second term on the right side is linear in and hence convex. Fix and , and let

where . Also, let

and

for . Denote the first term on the right side of (A3), viewed as a function of , by . We now show that . Indeed

where the inequality is the log-sum inequality [31, p. 29] applied to the argument of . This proves the claim.

iv) The proof is based on Caratheodory's theorem [32, pp. 310–312]. We are given the quintuple of random variables

with joint distribution

where is defined in (4.1). Let . We show there exists another random variable with range and joint distribution

(A4)

such that for all .

Let be the convex set of pmfs defined over ; so for each . Let be real-valued, continuous functions defined over , and be the image of under the continuous mapping . By application of Caratheodory's theorem, each element in the convex closure of can be represented as the convex combination of at most elements of . Hence, for any , there exist elements of and nonnegative numbers summing to one, such that

(A5)

Apply (A5) to the following functions:

(A6)

(A7)

where , and in (A6) indexes all elements of except possibly one. Due to (A5) and (A6), there exists a random variable such that

which establishes (A4) with

The distortion constraint (4.2), which involves only via its marginal , is satisfied. Due to (A5) and (A7), we have

and hence,

for all .

v) Fix . The distortion constraint (4.2), written as a function of , becomes

(A8)

In the absence of distortion constraints, the set of admissible would be the Cartesian product of at most probability simplices in , where each probability simplex is indexed by a different value of . is a closed, convex, polyhedral set. Its vertices (extremal points [57]) satisfy the following property: the functions are zero/one functions for all . The maximum of a convex function over a convex polyhedral set is attained at a vertex of that set. Due to the distortion constraint (A8), the admissible set is the intersection of above with the half-hyperplane specified by (A8). So is a closed, convex, polyhedral set. Each of its vertices is either a vertex of or lies on an edge between two vertices of . From Part iii) of the proposition, the payoff (4.4), viewed as a function of , is convex. Hence, its maximum over is achieved on one of the vertices of [57]. This proves the claim.


vi) By definition of , we have and thus . Moreover

where the inequality holds because conditioning reduces entropy. The claim follows from the expression (4.5) for .

vii) Assume is optimal, where . Given any , we have possible collections of pairs

. Each such collection may be viewed as the graph of a function . Let denote the set of such graphs and index the elements of . Consider any joint distribution of the random variables , ,

, , such that is independent of conditioned on , and

We think of as a random function selector. The conditional pmf can be expanded as

where if and otherwise. Next, define , which takes its values in the alphabet . It follows from Part vi) that is capacity achieving. Here is a zero/one pmf, i.e.,

Let be the convex set of all pmfs defined over . Hence, , for all . Using the same application of Caratheodory's theorem as in Part iv), we can reduce the size of the alphabet to without loss of optimality.

APPENDIX B

A. Proof of Proposition 4.2

Here we prove achievability of the rate

when and are defined with respect to distortion levels and , respectively. The claim follows as . Part I of the proof builds on the proof of Proposition 2 in [43]. Part II deals with some technical aspects of the random coding argument.

First recall some definitions [31]. The type, or empirical probability distribution, of the sequence is times the number of occurrences of the symbol in the sequence . The strongly typical set is the set of sequences in whose type does not differ from its expected value by more than in any component

The strong law of large numbers implies that for all , as . The definition of strongly typical sets for pairs of random variables , for triples, etc., is similar to the definition of above.
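These definitions translate directly into code. A small sketch (binary alphabet; the tolerance `delta` is chosen loosely) that computes the type of a sequence and checks strong typicality:

```python
import numpy as np

def type_of(seq, alphabet):
    """Empirical probability distribution (type) of a sequence."""
    seq = np.asarray(seq)
    return np.array([np.mean(seq == a) for a in alphabet])

def strongly_typical(seq, p, alphabet, delta):
    """True iff the type is within delta of p in every component."""
    return bool(np.all(np.abs(type_of(seq, alphabet) - np.asarray(p)) <= delta))

p = [0.7, 0.3]
rng = np.random.default_rng(4)
seq = rng.choice([0, 1], size=10_000, p=p)
# By the strong law of large numbers, a long i.i.d. sequence is strongly
# typical with probability approaching one.
assert strongly_typical(seq, p, alphabet=[0, 1], delta=0.03)
```

The encoder and decoder below use exactly this kind of membership test, applied to pairs and triples of sequences rather than to a single sequence.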

Select that maximizes over in (4.4). Let be the minimizer of over . Note that does not depend on the actual choice of by the attacker, and that minimizes under . Also, both the minimum and the maximum are attainable because the functions to be minimized (resp., maximized) are bounded and continuous, and the admissible sets and are compact.

Part I. Random Codebook Generation. Let

and

From (4.5), we have

for all . We generate codewords whose letters are independently drawn from the distribution . For every message , we generate

independent codewords from the distribution . We denote each codeword by . The collection of these codewords is a codebook whose size is

Neither nor depends on the chosen by the attacker. The number in the definition of the strongly typical sets below is a function of such that as . For clarity, we write

to denote the probability of error associated with a particular code and attack channel; we write

to denote the average of over a distribution of random codes . The encoder and decoder perform the following operations.

Encoder. Given , , and , the encoder seeks a codeword in that is jointly typical with . Specifically, the encoder seeks the smallest such that

(An encoding error is declared and is set equal to zero if no such can be found, but the strong law of large numbers ensures this event has vanishing probability .) Let be the corresponding value of . Given , , and , the encoder then randomly generates from the distribution

(which can be taken to be a function according to Proposition 4.1 vii)). This completes the description of the information-hiding code . Denote by

(B1)

the expected distortion for code . The average distortion satisfies

(B2)

If tends to zero, the pair is distortion typical [31].

Decoder. Given the received sequence and the side information , the decoder seeks a codeword in the codebook such that , i.e., is jointly typical with . A decoding error is declared if no such codeword can be found


or if several codewords indexed by different 's are found, i.e., either

1) , or
2) there exist and such that

Due to the strong law of large numbers, the first event has vanishing probability for any . Proceeding as in [43], it is seen that the probability of the second event is upper-bounded by for any

. The worst case value of this upper bound over is

For any , one can choose sufficiently large so that , , and are all less than

. Hence, the total probability of error, averaged over , satisfies

(B3)

for all .
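The strong-typicality test that drives both the encoder and the decoder above can be sketched numerically. In this hedged Python illustration (the function names and the tolerance `delta` are our own, not the paper's), a sequence is declared strongly typical when every letter's empirical frequency is within `delta` of the generating pmf:

```python
from collections import Counter

def empirical_pmf(seq, alphabet):
    """Empirical distribution of a sequence over a finite alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return {a: counts.get(a, 0) / n for a in alphabet}

def is_strongly_typical(seq, pmf, delta):
    """True if every letter's empirical frequency is within delta of pmf."""
    emp = empirical_pmf(seq, pmf.keys())
    return all(abs(emp[a] - pmf[a]) <= delta for a in pmf)

pmf = {0: 0.5, 1: 0.5}
# A balanced sequence passes the test; a heavily skewed one fails it.
print(is_strongly_typical([0, 1] * 500, pmf, delta=0.05))  # True
print(is_strongly_typical([0] * 1000, pmf, delta=0.05))    # False
```

By the strong law of large numbers, i.i.d. draws from `pmf` pass this test with probability approaching one as the blocklength grows, which is exactly the fact the encoding-error analysis relies on.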

Part II. To prove achievability of the rate , it

suffices to demonstrate the existence of that satisfies the following inequalities:

(B4)

(B5)

The random-coding argument is that if is chosen randomly from the capacity-achieving distribution , these inequalities do hold with arbitrarily high probability for large enough; hence, there exists at least one such . This argument applies directly if the class of channels is finite, but more care is required when the class of channels is a continuum [34, p. 2151]. Still, the class is small (being a subset of the class of discrete memoryless channels), and it is well known that the random coding argument can be applied in such cases as well. We now outline the main steps of this proof.

First consider inequality (B4). Any induces a pmf on the th component of . Define the empirical pmf

associated with , from which we can also define marginals such as and . It is easily shown that

(with an abuse of notation). The empirical pmf converges in probability at an exponential rate to

Hence, the random variable also converges in probability at an exponential rate to . That is, there exists large enough such that

(B6)

for all . Combining this equation with (B2), we obtain (B4).

Next, consider (B5). We proceed in four steps.

Step 1. Define a dense subset of

, such that grows subexponentially with and any element of can be approximated accurately by an element of in the following sense.

Let be a discretization of the interval . For all and all , all components of

, except one (to ensure that ) are in the finite set . Hence, the cardinality of is

. To any we can associate some (via quantization of the components in the set )

such that

, if

, else. (B7)
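Step 1's discretization can be illustrated concretely. In this hedged Python sketch (the grid resolution `L` and the helper name are assumptions, not the paper's notation), all but one component of a pmf is rounded to the grid {0, 1/L, ..., 1}, and the remaining component absorbs the residual mass so the vector still sums to one:

```python
def quantize_pmf(p, L):
    """Round all but the last component of pmf p onto the grid {0, 1/L, ..., 1};
    the last component absorbs the remainder so the result still sums to 1."""
    q = [round(x * L) / L for x in p[:-1]]
    q.append(1.0 - sum(q))
    return q

p = [0.237, 0.512, 0.251]
q = quantize_pmf(p, L=10)
print(q)  # first components lie on the grid; the vector sums to 1
```

Applying this componentwise to the transition probabilities of a channel yields the dense subclass: a finite family (of subexponential size in the relevant parameters) approximating every channel in the continuum to within the grid spacing.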

Step 2. For every , the random variable converges in probability at an exponential rate to

. That is, there exists a constant (independent of ) such that

for all . Then we obtain

(B8)

for all . The last line follows from the union-of-events bound.

Step 3. Let

(B9)

denote the probability of error of the system using an arbitrary decoder . (So far, we have used exclusively the maximum-likelihood decoder , which is a function of and .) Consider associated with via the quantization operation (B7). We have

if for some component we have . Otherwise, we have

Thus,



and the inequalities at the top of the page are satisfied for all . Therefore, there exists such that

(B10)

for all .

Combining (B3), (B8), and (B10), we obtain

(B11)

for all .

Step 4. The set depends on only via its marginal ,

which we momentarily emphasize by writing . It is easily shown that (see (C18) for a similar derivation). The random variable converges in probability (at an exponential rate) to (marginal of ) used to generate the random information-hiding code. That is, there exists such that

for all . Now letting

and using for all , we have

(B12)

for all . Combining (B4), (B11), and (B12) proves the existence of that satisfies the required conditions (B4) and (B5) for .

This proves achievability of all rates below

APPENDIX C

A. Proof of Proposition 4.3

We show that any sequence of codes with rate and error probability

must satisfy the following condition: there exists a covert channel such that , and the distortion constraint (4.2) is satisfied, i.e., .

For any , we have

(C1)

where the first equality is because the message is drawn from a uniform distribution, the second is because and

are independent, the third follows from the definition of mutual information, and the inequality is Fano's inequality [31]. From (C1), we have

(C2)
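The chain of steps justified after (C1) can be written generically. In standard notation ($M$ the message, uniform over $2^{NR}$ values, $K^N$ the key, $Y^N$ the attacked data, $P_e$ the error probability; these symbols are a hedged reconstruction, not necessarily the paper's exact display):

```latex
\begin{aligned}
NR &= H(M) && \text{($M$ uniform)} \\
   &= H(M \mid K^N) && \text{($M$ and $K^N$ independent)} \\
   &= H(M \mid Y^N, K^N) + I(M; Y^N \mid K^N) && \text{(definition of mutual information)} \\
   &\le 1 + P_e\, NR + I(M; Y^N \mid K^N) && \text{(Fano's inequality).}
\end{aligned}
```

Dividing through by $N$ and letting $P_e \to 0$ then bounds the rate by a normalized mutual information, which is the form used in (C2).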

Define the quintuple and the set of probability distributions that have the form

10 and satisfy the distortion constraint (4.2) and . Also define the random variables

(C3)

(C4)

where

Notice that because and are independent. Using the same techniques as in [43, Lemma 4], we derive

(C5)

10 So by construction, (U, S, K) → X → Y forms a Markov chain.



Step 1. To prove (C5), observe that the random sequences and satisfy the following recurrence relations:

(C6)

(C7)

where and . We now show that

(C8)

The six terms in (C8) can be written as follows, using the chain rule:

(C9)

(C10)

(C11)

(C12)

(C13)

(C14)

And finally

(C15)

Adding (C10), (C11), and (C13), and subtracting (C9), (C12), and (C14), we obtain the difference between the right and left sides of (C8). This difference is equal to

which is a nonnegative quantity. This proves (C8). We then obtain (C5) by summing the inequalities (C8) for .

Step 2. Define a time-sharing random variable that is uniformly distributed over the set , and is independent of all the other random variables.11 Define random variables , , , , equal to , , , , respectively, when takes value . With this definition, we have

(C16)

where is a new random variable. Inequality (C16) holds because , as the pairs are i.i.d. From (C5) and (C16), we obtain

(C17)

where in equality is a random variable defined over an alphabet of size ; the equality follows from Proposition 4.1 iv); and in equality is the joint distribution of , conditioned on . We also have , because

, (refer to Section IV-B), and

(C18)

where the second equality follows from the equality

for any pmf and any function of the additive form

11 The application of this idea to the information-hiding problem is due to F. M. J. Willems [11].
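The time-sharing device of Step 2 converts a multiletter average into a single-letter quantity via an identity of the following generic form (the symbols and conditioning pattern are illustrative, not necessarily the paper's exact (C16)): with $T$ uniform on $\{1,\dots,N\}$ and independent of everything else,

```latex
\frac{1}{N}\sum_{i=1}^{N} I(U_i; Y_i \mid K_i)
  \;=\; I(U_T; Y_T \mid K_T, T)
  \;\le\; I(U_T, T;\, Y_T \mid K_T),
```

where the inequality holds because adding $T$ to the first argument can only increase mutual information; absorbing $T$ into a new auxiliary variable then yields a single-letter bound of the desired form.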



Step 3. By assumption, the code satisfies the distortion constraint (4.2), so

and .

Step 4. Using (C2) and (C17), we obtain

for identified in Step 2. The converse theorem follows.

APPENDIX D

A. Proof of Proposition 4.8

The monotonicity property follows from the fact that the feasible sets are embedded for any . To prove the convexity property, fix and let be an arbitrary number in the interval . Define and let

be the set of all pmfs of the form

(D1)

For any , the distortion incurred under an attack of the form (D1) is

Hence, . By convexity of the payoff with respect to , we also have

Hence, we obtain the expression at the bottom of the page for all . This concludes the proof.
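The feasibility argument behind (D1) can be checked numerically: the expected distortion of a convex mixture of two attack channels is the same convex combination of their distortions, so the mixture stays feasible whenever both endpoints are. A hedged Python sketch (the channels, input pmf, and distortion matrix are invented for illustration):

```python
import numpy as np

def expected_distortion(A, p_x, d):
    """Expected distortion E d(X,Y) of attack channel A(y|x) under input pmf p_x."""
    return float(np.sum(p_x[:, None] * A * d))

p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])   # Hamming distortion

A1 = np.array([[0.9, 0.1], [0.2, 0.8]])  # two attack channels (rows are pmfs)
A2 = np.array([[0.6, 0.4], [0.5, 0.5]])
lam = 0.3
A_mix = lam * A1 + (1 - lam) * A2        # mixed channel, as in (D1)

# Distortion of the mixture equals the convex combination of distortions.
lhs = expected_distortion(A_mix, p_x, d)
rhs = lam * expected_distortion(A1, p_x, d) + (1 - lam) * expected_distortion(A2, p_x, d)
print(abs(lhs - rhs) < 1e-12)  # True
```

Linearity of expectation in the channel is all that is needed; the convexity of the payoff in the channel then gives the stated bound.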

APPENDIX E

A. Proof of Theorem 5.1

i) The maximal value of over all admissible is and is achieved by the channel

where . For , the attack is admissible, in which case the value of the payoff function (4.4) is . The attack is admissible for all when

, so in this case.

ii) Here . We construct a family of attack

channels , (Step 1), and show that is an upper bound on capacity, by application of Lemma 4.10. To this end, we prove that

in two steps. First, Step 2 shows that , where is a particular Gaussian channel whose parameters depend on . Then Step 3 shows that .

Step 1. The variance is a function of . If is such that , define by . Clearly, is

feasible, and . If is such that , let

(E1)

(E2)

and define as the Gaussian channel

(E3)

where is independent of . Note that does depend on via and . By construction, satisfies the distortion constraint (4.3) with equality. Indeed, we have

(E4)
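A Gaussian attack of this general type can be sketched numerically. In this hedged Python illustration (the scale-and-add-noise parametrization and all numeric values are our own assumptions, not necessarily the paper's (E1)-(E3)), the noise power is chosen so the scaling attack meets an attack distortion budget D2 with equality, and a Monte Carlo run confirms the induced mean-squared distortion:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_x, D2 = 4.0, 1.0                        # illustrative signal power and budget
gamma = 0.8                                     # illustrative scaling factor
sigma2_v = D2 - (1 - gamma) ** 2 * sigma2_x     # noise power meeting the budget exactly

x = rng.normal(0.0, np.sqrt(sigma2_x), size=1_000_000)
y = gamma * x + rng.normal(0.0, np.sqrt(sigma2_v), size=x.size)

# Empirical E(Y - X)^2 vs. the closed form (1-gamma)^2 * sigma2_x + sigma2_v = D2
emp = np.mean((y - x) ** 2)
print(abs(emp - D2) < 0.02)  # True (up to Monte Carlo error)
```

The point mirrored here is only the bookkeeping of Step 1: the attack's parameters are tied together so that the distortion constraint holds with equality.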

Step 2. Let be the linear minimum-mean-squared-error (MMSE) estimator of given , where and



Let . Define the random variable , which has mean zero and variance , and is uncorrelated with . Let . We have

(E5)

where follows from (4.4), is because is a deterministic function of , follows from the definition of conditional mutual information, and holds because forms a Markov chain. Since , conditioned on , is Gaussian, the second term in the right-hand side of (E5) is given by

(E6)

and only the first term, , depends on . Now we have [58]

where

Hence,

(E7)

where

and

Inequality holds with equality if and only if conditioned on is ; or equivalently, conditioned on is . Inequality follows from Jensen's inequality, and holds with equality if and only if

Inequality is because is the variance of the MMSE estimator of given , and holds with equality if is Gaussian. Hence, (E7) holds with equality if

(E8)

where is independent of .
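The linear MMSE estimator used throughout Step 2 has the orthogonality property that its residual is uncorrelated with the observation. A hedged numerical sketch (zero-mean variables and all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
s = rng.normal(0.0, 2.0, n)      # zero-mean "host" variable (illustrative)
y = s + rng.normal(0.0, 1.0, n)  # noisy observation of s

# Linear MMSE estimate of s from y (zero means): a*y with a = E[SY]/E[Y^2]
a = np.mean(s * y) / np.mean(y * y)
resid = s - a * y

# Orthogonality: the estimation error is uncorrelated with the observation,
# the property exploited in the entropy bounds of Step 2.
print(abs(np.mean(resid * y)) < 1e-4)  # True
```

This orthogonality is what lets the proof split the received signal into the estimate plus an uncorrelated error term, with the Gaussian case attaining the entropy bounds with equality.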

Due to the definition of above (E5), we have

(E9)

and

(E10)

From (E5)–(E7), we obtain

(E11)

where

(E12)

Equality holds in (E11) if satisfies the Gaussian model (E8). From (E1) and (E2), we have

(E13)

Observe that is independent of and is an increasing function of and (the latter via (E13)).

It also follows from (E10) and (E9) that

(E14)

(E15)

Given , both and (and hence ) are maximal when and

(E16)

(E17)

Then (E13) yields

(E18)

Combining (E12), (E14), and (E18), we obtain (E19) (at the bottom of the page) with equality if the Gaussian model (E8) holds, the distortion constraint (E10) holds with equality, and

. It remains to optimize in the right side of (E19).

Step 3. We define as where , is specified in (E16), and the admissible values of are in

the interval . The extreme points of this interval correspond to . In addition, the optimal satisfies (otherwise, ); from (E17), this yields

For convenience, let . We view as a function of and maximize

(E19)



over . We have for all . Hence, the optimal must be larger than in the statement of the theorem. Moreover,

and (E20)

Hence, the maximum is an interior point and satisfies a third-order polynomial equation (setting ). Using the optimal in (E19), we obtain the upper bound (5.9).

iii) If , we apply Lemma 4.9 and show that is a lower bound on capacity. Let be the covert channel

given by the statement of the theorem, and consider any attack channel . Let

which is positive because by assumption. Let and . Hence, and

. Due to the distortion constraint (4.3), we have

(E21)

Now

where is a random variable such that the pair is Gaussian with the same second-order statistics as . Here equality is because and are independent, inequality follows from the fact that conditioning reduces entropy, and inequality can be found in [58]. Both inequalities are satisfied with equality if is Gaussian and independent of . Since and are independent and forms a Markov chain, and are also independent. Hence, the lower bound above becomes

The lowest possible bound is obtained by maximizing subject to the constraint (E21), where . We obtain

(E22)

which is maximized by and

Substituting this value of into (E22), we obtain . The Gaussian test channel attains . By application of ii) and Lemma 4.11, is the capacity of the Gaussian game, and are the capacity-achieving distributions.

APPENDIX F

A. Proof of Corollary 5.2

In the asymptotic case , the admissible range for is a vanishing interval to the right of

. The root of in (E20), when , is given by

hence,

as (F1)

The capacity (5.9) is given by , where

(F2)

As , the first term in brackets in the numerator of (F2) is asymptotic to

The second term in brackets is asymptotic to

The denominator is asymptotic to

Combining the preceding four expressions, we obtain

and thus , proving the claim (5.11). Finally,

in this analysis.

APPENDIX G

A. Proof of Theorem 5.3

The maximal value of over all admissible is and is achieved by the channel where

. For , the attack is admissible, in which case the value of the payoff function (4.4) is . The attack is admissible for all when

, so in this case.

Assume from now on that . Let be

the covert channel specified in the statement of the theorem. We show that is the same as the capacity (5.9) for the private watermarking game. So must be the capacity-achieving channel, because capacity cannot be greater than in the case the decoder knows .

We have , where does not depend on . Hence, the optimal attack channel under

is the one that minimizes . Let be a Gaussian random variable such that and have



the same second-order statistics. Then we use the classical inequality [58]

(G1)

where we use the notation . Equality is achieved in (G1) if is jointly Gaussian distributed. Now minimize over all Gaussian distributions that satisfy the distortion constraint .
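The classical inequality from [58] invoked in (G1) is, in its standard hedged form, the fact that among random variables with a given variance the Gaussian maximizes differential entropy:

```latex
h(Z) \;\le\; \tfrac{1}{2}\log\!\bigl(2\pi e\,\operatorname{Var}(Z)\bigr),
```

with equality if and only if $Z$ is Gaussian; conditional versions follow by applying the bound with the conditional variance and averaging.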

Because is jointly Gaussian, there exist two positive constants and and a random variable independent of , such that . We have where and .

Let be a Gaussian random variable such that and have the same second-order statistics. Then

(G2)

where

(G3)

Note that for a given value of , the value of that maximizes is . The problem at hand is, however, a

minimization of with respect to .

For the attack channel to be admissible, we need

the following condition on and :

where is because , and follows from the independence of and . We then obtain

(G4)

The right side of (G2) is minimized by maximizing , i.e., choosing that maximizes , and that achieves equality in (G4). Hence,

(G5)

and . These are the optimal values of and (independently of the choice of ). Hence, . The attack channel achieves that bound and hence is optimal.

APPENDIX H

A. Proof of Theorem 6.1

Let and be the two distributions specified in the statement of the theorem. For clarity, we explicitly indicate the subscripts on the other pdfs, such as , , and .

Corollary 5.4 implies that the hiding capacity is asymptotically upper-bounded by when , for any , , and . We will now show that is also a lower bound on as , and hence that

. The proof proceeds in three steps.

Step 1. Due to Lemma 4.9, any lower bound on ,

valid for all , is also a lower bound on . Rewrite the payoff function as

(H1)

and fix . By definition of , we have ; therefore, the first term in the right side of (H1)

is given by

(H2)

The second term depends on . We construct an upper bound on which depends on but not on (and is

asymptotically achievable as ). Then we let

(H3)

be our lower bound on .

Step 2. (Construction of an upper bound on .)

Step 2a. We have . Even though

and are not independent, we show that converges to as . Indeed, from the definitions

of and in terms of the independent random variables and , we derive

as (H4)

as

(H5)

where (H4) holds because the family with is a resolution of the identity, and (H5) is because is asymptotically concentrated in a vanishingly small neighborhood of the line . Hence,

as (H6)

and so .



Step 2b. Let be a Gaussian random variable that has the same second-order statistics as , conditioned on

(H7)

where , and is a Gaussian random variable with zero mean and variance . For any admissible attack channel, we have

(H8)

Step 2c. Consider

Define a random variable that is Gaussian, conditioned on , with distribution given in (H6), and let

For any , we have [58]

where . Hence,

(H9)

where

The last inequality follows from concavity of the , and holds with equality if and only if a.e. .

Step 2d. Now we seek an asymptotic upper bound on in (H9), as .

From (H6) and (H7), the asymptotic distribution of as is given by

as (H10)

where and are conditionally independent Gaussian random variables (conditioned on ). Hence,

as

(H11)

where the inequality comes from (H8), and is satisfied with equality if and .

Now defining

(H12)

and combining (H9) and (H11), we have , as , with equality if and only if . (Note that for any , the upper bound is not achievable by any element of , but it is asymptotically achievable, as .)

Step 3. Combining (H3), (H2), and (H12), we obtain

APPENDIX I

A. Proof of Proposition 8.2

Here we show that the optimal attack for a time-invariant, memoryless covert channel is time-invariant and memoryless.

i) (Memoryless.) By definition of the mutual information, we have

Only the second term in the right side depends on . Now

(I1)

where the equality follows from the chain rule for entropy [31, p. 21], and the first inequality follows from the fact that conditioning reduces entropy. Since the pairs are mutually independent, equality is achieved in this inequality if and only if is independent of , given . Equality is achieved for a memoryless attack channel. Hence,

Finally, observe that if the attack channel lies in the admissible set , then so does the product of its marginals

, as the expected distortions under and are equal.
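The "conditioning reduces entropy" fact invoked in (I1) can be checked on a toy example. In this hedged Python sketch (the joint pmf is invented for illustration), H(Y|X) is computed as H(X,Y) - H(X) and verified not to exceed H(Y):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a pmf given as a flat array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Toy joint pmf p(x, y): rows index x, columns index y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)

# Conditional entropy via the chain rule: H(Y|X) = H(X,Y) - H(X)
H_y_given_x = H(pxy.ravel()) - H(px)

# Conditioning reduces entropy: H(Y|X) <= H(Y)
print(H_y_given_x <= H(py))  # True
```

The proof applies the same inequality letter by letter: dropping the conditioning on past outputs can only increase the per-letter entropy terms, and a memoryless attack achieves the bound with equality.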

ii) (Time-invariant.) From Part i), the cost function to be minimized by the attacker takes the form

The attack is subject to linear constraints

and

for all . The cost function is strictly convex in , and both the cost function and the constraints are invariant to permutations of the indexes. Hence, must be independent of to achieve the minimum.



iii) Let be the set of all memoryless covert channels in , and be the set of all memoryless attack channels

in . Then, by application of (8.4),

where the inequality is because for all , and the last equality follows from Proposition 4.3.

APPENDIX J

A. Proof of Proposition 8.3

By straightforward extension of Corollary 4.5 to the block-wise memoryless case, we have

when . Denote the payoff function above by , where the subscript on indicates that .

The games and have solutions and , respectively. It is known from Proposition 8.2 that both and are memoryless. Define the memoryless channel

where is computed from the marginals of . Also define the memoryless attack channel

as a minimizer of . Consider the string of inequalities

where holds because is the optimal attack channel under , is because is memoryless, is the chain rule

for conditional entropy, is because conditioning reduces entropy, follows from the definition of and , and is because is capacity achieving. Equality is achieved if and only if and are memoryless. In this case, and . Finally, we observe that the distortions under

and are the same, hence is feasible. This concludes the proof.

ACKNOWLEDGMENT

The authors are deeply indebted to several colleagues who helped them at different stages during the preparation of this paper. In particular, they would like to thank J. Mark Ettinger and Tamer Basar for their help regarding game-theoretic aspects of the problem, and Brian Chen, Aaron Cohen, Ralf Koetter, Neri Merhav, M. Kivanç Mihçak, Kannan Ramchandran, Frans M. J. Willems, and the five anonymous reviewers for their useful comments. The reviewers, in particular, helped fix gaps in several proofs.

REFERENCES

[1] IEEE J. Select. Areas Commun. (Special Issue on Copyright and Privacy Protection), vol. 16, pp. 452–593, May 1998.

[2] Proc. IEEE (Special Issue on Identification and Protection of Multimedia Information), vol. 87, pp. 1062–1304, July 1999.

[3] M. D. Swanson, M. Kobayashi, and A. H. Tewfik, "Multimedia data-embedding and watermarking strategies," Proc. IEEE, vol. 86, pp. 1064–1087, June 1998.

[4] R. B. Wolfgang, C. I. Podilchuk, and E. J. Delp, "Perceptual watermarks for digital images and video," Proc. IEEE (Special Issue on Identification and Protection of Multimedia Information), vol. 87, pp. 1108–1126, July 1999.

[5] F. Hartung and M. Kutter, "Multimedia watermarking techniques," Proc. IEEE (Special Issue on Identification and Protection of Multimedia Information), vol. 87, pp. 1079–1107, July 1999.

[6] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn, "Information hiding—A survey," Proc. IEEE (Special Issue on Identification and Protection of Multimedia Information), vol. 87, pp. 1062–1078, July 1999.

[7] I. J. Cox, M. L. Miller, and A. L. McKellips, "Watermarking as communications with side information," Proc. IEEE (Special Issue on Identification and Protection of Multimedia Information), vol. 87, pp. 1127–1141, July 1999.

[8] N. Merhav, "On random coding error exponents of watermarking codes," IEEE Trans. Inform. Theory, vol. 46, pp. 420–430, Mar. 2000.

[9] Y. Steinberg and N. Merhav, "Identification in the presence of side information with application to watermarking," in Proc. IEEE Int. Symp. Information Theory, Sorrento, Italy, June 2000, p. 45.

[10] A. Somekh-Baruch and N. Merhav. (2002) On the capacity game of public watermarking systems. [Online]. Available: tiger.technion.ac.il/users/merhav

[11] F. M. J. Willems, "An information theoretical approach to information embedding," in Proc. 21st Symp. Information Theory in the Benelux, Wassenaar, The Netherlands, May 2000, pp. 255–260.

[12] A. S. Cohen and A. Lapidoth, "The Gaussian watermarking game," IEEE Trans. Inform. Theory (Special Issue on Shannon Theory), vol. 48, pp. 1639–1667, June 2002.

[13] B. Chen, Seminar at the University of Illinois, Apr. 1999.

[14] B. Chen and G. W. Wornell, "Quantization index modulation methods: A class of provably good methods for digital watermarking and information embedding," IEEE Trans. Inform. Theory, vol. 47, pp. 1423–1443, May 2001.

[15] J. R. Hernández and F. Pérez-González, "Statistical analysis of watermarking schemes for copyright protection of images," Proc. IEEE (Special Issue on Identification and Protection of Multimedia Information), vol. 87, pp. 1142–1166, July 1999.



[16] I. J. Cox, J. Killian, F. T. Leighton, and T. Shamoon, "Secure spread spectrum watermarking for multimedia," IEEE Trans. Image Processing, vol. 6, pp. 1673–1687, Dec. 1997.

[17] D. Boneh and J. Shaw, "Collusion-secure fingerprinting for digital data," in Adv. Cryptology: Proc. CRYPTO'95. New York: Springer-Verlag, 1995.

[18] B. Macq and J.-J. Quisquater, "Cryptology for digital TV broadcasting," Proc. IEEE, vol. 83, pp. 944–957, June 1995.

[19] C. Cachin, "An information-theoretic model for steganography," in Proc. 1998 Workshop on Information Hiding (Lecture Notes in Computer Science). Berlin, Germany: Springer-Verlag, 1998.

[20] L. Marvel, C. G. Boncelet, Jr., and C. T. Retter, "Spread-spectrum image steganography," IEEE Trans. Image Processing, vol. 8, pp. 1075–1083, Aug. 1999.

[21] B. Chen and G. W. Wornell, "An information-theoretic approach to the design of robust digital watermarking systems," in Proc. Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Phoenix, AZ, Mar. 1999, pp. 2061–2064.

[22] J. M. Ettinger, "Steganalysis and game equilibria," in Proc. 1998 Workshop on Information Hiding (Lecture Notes in Computer Science). Berlin, Germany: Springer-Verlag, 1998.

[23] J. A. O'Sullivan, P. Moulin, and J. M. Ettinger, "Information-theoretic analysis of steganography," in Proc. IEEE Int. Symp. Information Theory, Cambridge, MA, Aug. 1998, p. 297.

[24] P. Moulin and J. A. O'Sullivan, "Information-theoretic analysis of information hiding," in Proc. IEEE Int. Symp. Information Theory, Sorrento, Italy, June 2000, p. 19.

[25] C. E. Shannon, "Communication theory of secrecy systems," Bell Syst. Tech. J., vol. 28, no. 4, pp. 656–715, 1949.

[26] F. A. P. Petitcolas and M. G. Kuhn. StirMark software. [Online]. Available: www.cl.cam.ac.uk/~fapp2/watermarking/image_watermarking/stirmark/

[27] A. Piva, M. Barni, F. Bartolini, and V. Cappellini, "DCT-based watermark recovering without resorting to the uncorrupted original image," in Proc. IEEE Int. Conf. Image Processing, vol. I, Santa Barbara, CA, 1997, pp. 520–523.

[28] F. Hartung and B. Girod, "Digital watermarking of uncompressed and compressed video," Signal Processing, vol. 66, pp. 283–301, 1998.

[29] S. S. Pradhan, J. Chou, and K. Ramchandran, "On the duality between distributed source coding and data hiding," in Proc. 33rd Asilomar Conf., Pacific Grove, CA, Nov. 1999.

[30] R. J. Barron, B. Chen, and G. W. Wornell, "The duality between information embedding and source coding with side information and some applications," in Proc. IEEE Int. Symp. Information Theory, Washington, DC, June 2001, p. 300.

[31] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[32] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.

[33] M. V. Hegde, W. E. Stark, and D. Teneketzis, "On the capacity of channels with unknown interference," IEEE Trans. Inform. Theory, vol. 35, pp. 770–783, July 1989.

[34] A. Lapidoth and P. Narayan, "Reliable communication under channel uncertainty," IEEE Trans. Inform. Theory (Special Commemorative Issue), vol. 44, pp. 2148–2177, Oct. 1998.

[35] N. F. Johnson, Z. Duric, and S. Jajodia, "Recovery of watermarks from distorted images," in Proc. Information Hiding Workshop, Dresden, Germany, 2000, pp. 318–332.

[36] P. W. Wong, "A public key watermark for image verification and authentication," in Proc. Int. Conf. Image Processing, vol. I, 1998, pp. 455–459.

[37] W. Bender, D. Gruhl, and N. Morimoto, "Techniques for data hiding," Proc. SPIE, vol. 2420, p. 40, Feb. 1995.

[38] C. Busch, W. Funk, and S. Wolthusen, "Digital watermarking: From concepts to real-time video applications," IEEE Comput. Graph. Applic., pp. 25–35, Jan.–Feb. 1999.

[39] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory (SIAM Classics in Applied Mathematics). Philadelphia, PA: SIAM, 1999.

[40] T. Basar and Y.-W. Wu, "A complete characterization of minimax and maximin encoder-decoder policies for communication channels with incomplete statistical description," IEEE Trans. Inform. Theory, vol. IT-31, pp. 482–489, July 1985.

[41] T. Basar, "The Gaussian channel with an intelligent jammer," IEEE Trans. Inform. Theory, vol. IT-29, pp. 152–157, Jan. 1983.

[42] J. M. Borden, D. M. Mason, and R. J. McEliece, "Some information theoretic saddlepoints," SIAM J. Contr. Optimiz., vol. 23, no. 1, pp. 129–143, Jan. 1985.

[43] S. I. Gel'fand and M. S. Pinsker, "Coding for channel with random parameters," Probl. Contr. Inform. Theory, vol. 9, no. 1, pp. 19–31, 1980.

[44] C. Heegard and A. A. El Gamal, "On the capacity of computer memory with defects," IEEE Trans. Inform. Theory, vol. IT-29, pp. 731–739, Sept. 1983.

[45] A. D. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. Inform. Theory, vol. IT-22, pp. 1–10, Jan. 1976.

[46] J. A. O'Sullivan and P. Moulin, "Some properties of optimal information hiding and information attacks," in Proc. 39th Allerton Conf., Monticello, IL, Oct. 2001, pp. 92–101.

[47] D. R. Stinson, Cryptography: Theory and Practice. Boca Raton, FL: CRC, 1998.

[48] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.

[49] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco, CA: Holden-Day, 1964.

[50] M. Costa, "Writing on dirty paper," IEEE Trans. Inform. Theory, vol. IT-29, pp. 439–441, May 1983.

[51] G.-I. Lin, "Digital watermarking of still images using a modified dither modulation algorithm," M.S. thesis, Dept. Elec. Comput. Eng., Univ. Illinois at Urbana-Champaign, 2000.

[52] A. M. Gerrish and P. M. Schultheiss, "Information rates of non-Gaussian processes," IEEE Trans. Inform. Theory, vol. IT-10, pp. 265–271, Oct. 1964.

[53] B. Chen and G. W. Wornell, "Preprocessed and postprocessed quantization index modulation methods for digital watermarking," Proc. SPIE, vol. 3971, Jan. 2000.

[54] J. Chou, S. Pradhan, L. El Ghaoui, and K. Ramchandran, "A robust optimization solution to the data hiding problem using distributed source coding principles," Proc. SPIE, vol. 3971, Jan. 2000.

[55] M. Kesal, K. M. Mihçak, R. Kötter, and P. Moulin, "Iteratively decodable codes for watermarking applications," in Proc. 2nd Symp. Turbo Codes and Related Topics, Brest, France, Sept. 2000, pp. 113–116.

[56] P. Moulin and M. K. Mihçak. (2000) The parallel-Gaussian watermarking game. [Online]. Available: www.ifp.uiuc.edu/~moulin/Papers/pg01.ps.gz

[57] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ. Press, 1970.

[58] M. S. Pinsker, "Gaussian sources," Probl. Inform. Transm., vol. 14, pp. 59–100, 1963.

[59] T. Basar, personal communication.

