
An Overview of Renyi Entropy and Some Potential Applications

Ed Beadle, Jim Schroeder
Harris Corporation, Melbourne, FL 32902

Bill Moran, Sofia Suvorova
University of Melbourne, Melbourne, VIC 3010, AU

Abstract – We introduce Renyi Entropy in this paper and review the basic associated properties that differentiate Renyi Entropy from Shannon Entropy. Theoretic and simulation examples are used to illustrate the differences between the two entropies. We suggest several potential applications of Renyi Entropy to such areas as spectral estimation and pattern recognition.

I. INTRODUCTION

Entropy is an important concept in many fields related to communications. Traditionally one of the foremost application areas has been coding theory, which is based on the efficient representation of information, be it audio, video, still imagery, or even text [1]. Of course the most well known paper on information theory is likely the 1948 classic reference by Shannon himself [2]. However, what is less well known by many in the signal processing community is that Shannon built on earlier work by Hartley in the 1920’s [3], and that information theory has continued to evolve and generalize many of Shannon’s (and Hartley’s) concepts and measures such as entropy (e.g. conditional, joint), divergence, and mutual information. While there are many extensions of the familiar forms of (Shannon) entropy, (Kullback-Leibler) divergence and mutual information, perhaps the most topical advancement for signal processors is the so-called Renyi entropy or alpha-order entropy (often denoted as α-order) and its relation to a fairly new concept called the Quadratic Independence Measure (QIM) [4,6]. These new tools hold promise to impact the design of adaptive signal processing systems for a wide variety of problems where the ubiquitous second-order based methods fail to achieve the design goal or “classic” information theory approaches are too cumbersome.

Renyi entropy and related quantities (e.g. divergence and mutual information) can be argued to be a generalization of Shannon’s concepts and definitions. In part the argument can be posed that Renyi’s definitions are parameterized by a single parameter, α, which, when allowed to approach unity, reverts them all to the familiar Shannon concepts. One remarkable property of Renyi entropy is that for algorithms requiring entropy maximization, Renyi’s definitions, for any α, can be directly substituted where “classic” entropy has been used, since they both maximize for the same condition. With a free parameter, namely α, it is logical to ask whether there are preferred values of α and under what context those values are preferred. One answer is that α = 2 offers a large reduction in the computational effort required to produce entropy estimates (when needed), say to support ICA type algorithms [7, 20, 21]. Reference [6] deserves the credit for realizing that a break in the traditional signal processing paradigm could be useful by adopting Renyi’s view of information. Additionally, [6] coined the term “information theoretic learning” and did extensive theoretical development based on Renyi entropy along with tests in a wide range of applications which included blind source separation; see for example [6, 14, 15, 17, 18]. Currently the investigation and use of Renyi entropy extends well beyond results published in information theory journals into practical applications in diverse scientific fields such as biology and physics, and of course adaptive systems in engineering such as


blind separation/equalization/deconvolution, sensor management, and independent component analysis (ICA) [7, 20, 21, 22]. This paper is meant to be neither an authoritative treatise on information theory nor one on adaptive systems using information theory. There are growing “pockets” of this material; the objective here is to provide motivation for other signal processing practitioners to break the standard modus operandi and examine this methodology as offering potentially superior performance in systems designed for real-world applications.

II. THE UTILITY OF INFORMATION PROCESSING / INFORMATION FILTERING

Recently, information theoretic concepts have made their way into the literature and applications in adaptive signal processing; for example see [5,8,9]. Perhaps the driving force for the signal processing community to consider information theoretic constructs is the emerging need to define alternative criteria for optimality beyond those offered by the traditional second-order methods typified in the vast majority of the signal processing literature available today and standard reference texts such as [23]. As an alternative means to define optimality metrics, information theory offers the potential to address real-world problems that are, in truth, more complicated than the linear Gaussian models often used for second-order approaches. Information theoretic-based processing is able to handle a more complex class of problems where the underlying models are non-linear and/or non-Gaussian. Perhaps the most general umbrella to classify these new approaches is information theoretic learning (ITL) or information processing [6]. The recent surge in interest is likely due to the underlying selection of the optimality criteria for adaptive systems. The optimality criterion, after all, imposes certain properties onto the resulting solution. Blind source separation is one example where it can be seen that methods exploiting second-order statistics (e.g. PCA), while elegant and generally straightforward to implement, do not impose the most intuitively meaningful properties, such as recovering independence, onto the output. In that community it is well known that the family of principal components analysis (PCA) approaches in general provides uncorrelated but not independent outputs, even though the constituent signals of the sensed mixtures are often assumed, rightfully so, to be independent. In contrast, however, the families of techniques termed independent component analysis (ICA) methods do return outputs that are as “independent” as possible, where independence is measured by a selected metric such as kurtosis [7]. In addition, other information theoretic measures can be used which form the basis for algorithms in this family such as INFOMAX, MAXENT, and others [7]. This is not to say that second-order techniques do not provide useful approaches and results. In many cases they do, but not always. For example, consider the case of separating voice signals. In this application the density functions of the target signals are typically Laplacian, Gamma or Generalized Gaussian, where the distribution depends on the frame considered [24]. Thus, given the non-Gaussian nature of the signal modality, this is a first motivation for considering alternatives to second-order based processing. To illustrate the potential advantage of information filtering versus “classic” adaptive filters, consider the explanation below, where we follow reasoning previously published in [5]. The objective of a “classic” adaptive filter (e.g. LMS, MMSE, least squares) is to minimize the square-error norm (e.g. $L_2$ or $\ell_2$) between a desired response vector d and measured output data vector y. Typically the measurement model is assumed to be a linear function of some parameter vector w for efficacy. Thus the objective function for adaptation can be written as:

$J(\mathbf{w}) = \min_{\mathbf{w}} E\left\{ \left\| \mathbf{d} - \mathbf{y}(\mathbf{w}) \right\|^{2} \right\}$    (1)
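Before turning to the assumptions behind (1), it may help to see the classic criterion spelled out numerically. The fragment below is a minimal sketch, assuming a linear measurement model y(w) = Xw and synthetic data invented purely for illustration; in that setting the minimizer of (1) is simply the least-squares solution.

```python
import numpy as np

# Hypothetical linear-Gaussian example of the classic criterion in (1):
# y(w) = X w, and J(w) = E{ ||d - X w||^2 } is minimized by least squares.
rng = np.random.default_rng(0)
N, M = 200, 4                                   # samples, parameters (illustrative sizes)
X = rng.standard_normal((N, M))                 # measurement matrix
w_true = np.array([1.0, -0.5, 2.0, 0.3])
d = X @ w_true + 0.1 * rng.standard_normal(N)   # desired response plus Gaussian noise

w_hat, *_ = np.linalg.lstsq(X, d, rcond=None)   # minimizes the squared-error norm
print("estimated w:", w_hat)
```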

The common assumptions made for applying the above equation are that the system is both linear and Gaussian. Many systems in engineering and signal processing do conform to the linear Gaussian model, either exactly or within some reasonable approximation. The Gaussian assumption (when valid) imposes some useful properties on the resulting solution. For example, uncorrelated variables imply independence between the variables. The linearity

1699

Page 3: [IEEE 2008 42nd Asilomar Conference on Signals, Systems and Computers - Pacific Grove, CA, USA (2008.10.26-2008.10.29)] 2008 42nd Asilomar Conference on Signals, Systems and Computers

assumption typically simplifies the computations required to implement the estimator derived from the optimality criteria, and linearity can also maintain Gaussian properties at the output (given the inputs are Gaussian). However, when the system deviates from the linear Gaussian model, a squared error type optimality metric can lead to estimators that perform extremely poorly in an application, even though they are still optimal in the square error sense. In some cases, the true underlying model is so different from the linear Gaussian one that the solution derived by the square error criterion is (at best) not intuitively pleasing or (at worst) completely inappropriate given the problem context. Hence another approach must be found. So we consider information filtering as an approach for adaptive systems. By contrast to the above methodology, the information approach does not make the restrictive assumptions of a linear and Gaussian system. In this approach we do assume, as before, that there is a desired response vector d created by an unknown transformation f of the input x. The joint pdf p(x,d) fully characterizes the relationship, including any noise or errors. The objective is to construct a (possibly vector-valued) mapping g with parameters w to approximate the unknown mapping f. The measure of “closeness” or optimality of the approximation (i.e. estimation) adopted here is an information theoretic criterion, such as the Kullback-Leibler distance (KLD), namely,

$J(\mathbf{w}) = \min_{\mathbf{w}} \iint p(\mathbf{x},\mathbf{d}) \log \frac{p(\mathbf{x},\mathbf{d})}{\tilde{p}(\mathbf{x},\mathbf{d};\mathbf{w})}\, d\mathbf{x}\, d\mathbf{d} = D\!\left( p \,\|\, \tilde{p}_{\mathbf{w}} \right)$    (2)
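As a small numerical aside, the KLD in (2) is straightforward to evaluate once the densities are discretized; the sketch below uses toy probabilities made up purely to show the computation.

```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) returns the KL divergence D(p || q)

# Illustrative discrete version of the criterion in (2): D(p || p_tilde_w).
p = np.array([0.10, 0.40, 0.30, 0.20])          # "true" pmf (toy values)
p_model = np.array([0.25, 0.25, 0.25, 0.25])    # candidate model pmf parameterized by w

kld = entropy(p, p_model)        # sum_k p_k * log(p_k / p_model_k), natural log
print("D(p || p_model) =", kld)
```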

Returning to the optimization, if we choose to formulate the model structure to have a (possibly) non-linear dependence on the data x, namely,

d = f(x) + e (3)

Then using the entropy definitions above, it can be shown that minimizing the KLD under this transformation is equivalent to minimizing the entropy of the error [10],

$\min_{\mathbf{w}} H(\mathbf{e}) = -E\!\left[ \log_2 p_{\mathbf{e}}(\mathbf{e}) \right] = -\int p_{\mathbf{e}}(\mathbf{e})\, \log_2 p_{\mathbf{e}}(\mathbf{e})\, d\mathbf{e}$    (4)

where the pdfs above are parameterized by the parameters comprising w. Hence we have a new “objective” function based on information theory concepts where we have not imposed any constraints on the unknown system f, the approximating system g, or on the pdfs of the data and response. This can provide tremendous flexibility in the design of systems that adapt in real-world applications. While the above explanation has adhered to the Shannon principles, a fundamental question arises: how can one estimate the entropy with only data samples available and no knowledge of the underlying pdfs? The answer to this question is in part why Renyi entropy, and more specifically Renyi entropy of order 2, is a good approach from a computational and performance point of view. The next section introduces the concept of Renyi entropy and related measures; the subsequent section exposes the value of second-order Renyi entropy.
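To preview why order-2 Renyi entropy is computationally attractive, the following is a minimal sketch of a sample-based entropy estimator in the spirit of the ITL literature [6, 16]: a Gaussian Parzen window turns the quadratic Renyi entropy of an error signal into a simple pairwise kernel sum, with no explicit pdf estimate. The kernel width and the test data here are arbitrary illustrative choices, not values from this paper.

```python
import numpy as np

def renyi2_entropy_parzen(e, sigma=0.25):
    """Sample estimate of the quadratic Renyi entropy H_2 of a 1-D error signal.

    A Gaussian Parzen window makes H_2 a pairwise kernel sum (the "information
    potential" of the ITL literature), so no explicit pdf estimate is needed.
    """
    e = np.asarray(e, dtype=float)
    n = e.size
    diff = e[:, None] - e[None, :]                  # all pairwise differences
    s = np.sqrt(2.0) * sigma                        # width of two convolved kernels
    kernel = np.exp(-diff**2 / (2.0 * s**2)) / (s * np.sqrt(2.0 * np.pi))
    information_potential = kernel.sum() / n**2
    return -np.log(information_potential)

rng = np.random.default_rng(1)
err = rng.laplace(scale=0.5, size=500)              # toy non-Gaussian error sample
print("estimated H_2 of the error:", renyi2_entropy_parzen(err))
```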

III. RENYI ENTROPY DEFINITION

Renyi Entropy [1,2] or α-order Entropy is defined as

$H_R(\alpha) = \frac{1}{1-\alpha} \log_2 \int p^{\alpha}(\mathbf{x})\, d\mathbf{x}$    (5)

Note that when $\alpha \to 1$ this definition becomes the same as Shannon’s Entropy, namely,

$H = -\int p(\mathbf{x}) \log_2 p(\mathbf{x})\, d\mathbf{x}$    (6)

This may be easily shown for $\alpha \to 1$: the Renyi Entropy $H_R(\alpha)$ becomes a 0/0 indeterminate form, and use of L'Hopital's rule results in

$\lim_{\alpha \to 1} H_R(\alpha) = \lim_{\alpha \to 1} \frac{\frac{d}{d\alpha}\left[ -\log_2 \int p^{\alpha}(\mathbf{x})\, d\mathbf{x} \right]}{\frac{d}{d\alpha}\left[ \alpha - 1 \right]}$    (7)


that after simplification leads in a straightforward manner to Shannon’s Entropy. Therefore Shannon Entropy is a special case of Renyi Entropy. Renyi Entropy, $H_R(\alpha)$, of order $\alpha$, $0 < \alpha < \infty$, may also be defined, as suggested by [4], in integral form as

$H_R(\alpha) = \frac{1}{1-\alpha} \ln \int_{-W}^{W} P^{\alpha}(f)\, df$    (8)

where we interpret P(f), the Power Spectral Density, as a probability density function.
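The definitions (5)–(8) are easy to exercise numerically. The sketch below, using an arbitrary made-up spectrum, normalizes P(f) so it integrates to one over [-W, W], evaluates the Renyi entropy for several orders, and checks that the values approach the Shannon entropy as α → 1.

```python
import numpy as np

def renyi_entropy(p, alpha, df):
    """Renyi entropy of order alpha for a density sampled on a uniform grid."""
    if np.isclose(alpha, 1.0):                              # Shannon limit, eq. (6)
        return -np.sum(p * np.log(p)) * df
    return np.log(np.sum(p**alpha) * df) / (1.0 - alpha)    # eq. (5)/(8), natural log

# Toy "power spectral density" on -W..W, normalized so it integrates to one.
W, M = 0.5, 2048
f = np.linspace(-W, W, M)
df = f[1] - f[0]
P = np.exp(-0.5 * (f / 0.1)**2) + 0.2                       # arbitrary smooth spectrum
P /= np.sum(P) * df                                         # interpret P(f) as a pdf

for a in (0.5, 0.9, 0.999, 1.0, 1.001, 2.0):
    print(f"alpha={a:6.3f}  H_R={renyi_entropy(P, a, df):.4f}")
```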

IV. PROPERTIES OF RENYI ENTROPY

Several important properties of Renyi Entropy will be reviewed in this section.

A. Given a random variate with a discrete uniform probability mass function, then for all α the discrete Renyi Entropy is equal to the (discrete) Shannon Entropy, and both entropies are maximum with value $\log_2 N$.

Given a probability mass function for a random variate x where there are N possible equally likely outcomes we have,

$p_n \equiv \Pr(\mathbf{x} = x_n) = \frac{1}{N}, \qquad n \in \{1, 2, \ldots, N\}$    (9)

And the (discrete) Renyi Entropy is found to be,

$H_R(\alpha) = \frac{1}{1-\alpha} \log_2 \sum_{n=1}^{N} p_n^{\alpha} = \frac{1}{1-\alpha} \log_2 \sum_{n=1}^{N} \frac{1}{N^{\alpha}} = \frac{1}{1-\alpha} \log_2 \left( N \cdot \frac{1}{N^{\alpha}} \right) = \frac{1}{1-\alpha} \log_2 N^{1-\alpha} = \log_2 N$    (10)

Note that for Shannon Entropy,

$H_S = -\sum_{n=1}^{N} p_n \log_2 p_n = -\sum_{n=1}^{N} \frac{1}{N} \log_2 \frac{1}{N} = -\log_2 \frac{1}{N} = \log_2 N$    (11)

Therefore $H_R(\alpha) = H_S = \text{const} \ \forall \alpha$ as claimed.
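Property A can be verified in a few lines: for a uniform pmf over N outcomes, (10) returns log2 N for every α, which is also the Shannon value in (11). The snippet below is just that sanity check.

```python
import numpy as np

N = 8
p = np.full(N, 1.0 / N)                                 # uniform pmf, eq. (9)

def discrete_renyi(p, alpha):
    return np.log2(np.sum(p**alpha)) / (1.0 - alpha)    # eq. (10), base-2 log

shannon = -np.sum(p * np.log2(p))                       # eq. (11)
for a in (0.5, 2.0, 5.0):
    print(a, discrete_renyi(p, a))                      # each prints 3.0 == log2(8)
print("Shannon:", shannon)                              # also 3.0
```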

B. We are interested in contrasting and comparing the results for Shannon Entropy with those for Renyi alpha-order Entropy for different values of alpha. The prototype distribution used is the Generalized Error Distribution,

$p(x) = \frac{\exp\!\left( -\,|x-a|^{2/c} / (2b) \right)}{2^{\,c/2+1}\, b^{\,c/2}\, \Gamma\!\left( c/2 + 1 \right)}$    (12)

where the variate satisfies $-\infty < x < \infty$, the location parameter $-\infty < a < \infty$, the scale parameter b > 0, and the shape parameter c > 0. This distribution has a number of special cases, such as the Gaussian when b = c = 1, the Laplace when b = 1/2, c = 2, and a (nearly) rectangular density as c approaches zero. There is a natural dividing line at c = 1: distributions that are leptokurtic (larger or “heavier” tails than a Gaussian) occur when c > 1, and conversely distributions that are platykurtic (smaller or “lighter” tails than a Gaussian) occur when c < 1. For completeness, the Gaussian is referred to as mesokurtic. An example is shown below,

Fig. 1. Examples of 3 classes of PDFs.


An interesting feature of the Renyi entropy is noticed in the experiments reported below: for platykurtic and mesokurtic PDFs the trend for Renyi Entropy was as expected; as the order increased from 0 toward 1 and then beyond 1, the Renyi entropy began at a value larger than Shannon’s, met Shannon’s at $\alpha = 1$, and then decreased below Shannon’s measure. This is consistent with the statements from numerous sources [c.f. 9] that Renyi entropy (as $\alpha \to 1$) becomes Shannon’s entropy. Hence, the empirical finding for the leptokurtic example is interesting, since as $\alpha \to 1$ the limit does not exist.
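The trend described above can be reproduced with a short numerical experiment: evaluate the Generalized Error Distribution of (12) on a grid for the three kurtosis classes, compute the Renyi entropy of (5) for a sweep of orders, and compare each value with the Shannon entropy of (6). The grid, the parameter choices, and the normalizing constant of (12) as reconstructed above are assumptions made for illustration.

```python
import numpy as np
from math import gamma

def ged_pdf(x, a=0.0, b=1.0, c=1.0):
    # Generalized Error Distribution as in (12); normalization as reconstructed above.
    norm = 2.0**(c / 2.0 + 1.0) * b**(c / 2.0) * gamma(c / 2.0 + 1.0)
    return np.exp(-np.abs(x - a)**(2.0 / c) / (2.0 * b)) / norm

def renyi(p, alpha, dx):
    return np.log2(np.sum(p**alpha) * dx) / (1.0 - alpha)      # eq. (5), base-2 log

def shannon(p, dx):
    q = p[p > 0]
    return -np.sum(q * np.log2(q)) * dx                        # eq. (6)

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]
cases = {"mesokurtic (Gaussian, b=c=1)": dict(b=1.0, c=1.0),
         "leptokurtic (Laplace, b=1/2, c=2)": dict(b=0.5, c=2.0),
         "platykurtic (c=0.3)": dict(b=1.0, c=0.3)}

for name, kw in cases.items():
    p = ged_pdf(x, **kw)
    sweep = [(a, round(renyi(p, a, dx), 3)) for a in (0.25, 0.5, 0.75, 1.25, 2.0, 4.0)]
    print(name, "Shannon:", round(shannon(p, dx), 3), sweep)
```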

Fig. 2. Gaussian (mesokurtic) Example

Fig. 3. Platykurtic Example

Fig. 4. Leptokurtic Example

V. MAXIMUM RENYI ENTROPY SPECTRAL ESTIMATOR

A “Maximum Renyi Spectral Estimator” may be derived either by the calculus of variations or by maximizing the integral definition of Renyi Entropy in a Maximum Entropy sense, as reported in [30, 31, 32, 33], subject to the constraint that its Autocorrelation Function matches a set of known lag values. To summarize, using Renyi Entropy, $H_R(\alpha)$, of order $\alpha$, $0 < \alpha < \infty$, defined as

$H_R(\alpha) = \frac{1}{1-\alpha} \ln \int_{-W}^{W} P^{\alpha}(f)\, df$    (13)

we find the Power Spectral Density, P(f), of a Random Process sampled every T seconds with one-sided bandwidth W Hz and characterized by its Autocorrelation function, R(n), known for $-N \le n \le N$, that maximizes $H_R(\alpha)$ subject to the constraint

$R(n) = \int_{-W}^{W} P(f)\, e^{\,j 2\pi f n T}\, df, \qquad -N \le n \le N$ .    (14)

Using the Discrete-Time Continuous Frequency Transform (DTCF) relationship

$P(f) = \frac{1}{2W} \sum_{n=-\infty}^{\infty} R(n)\, e^{-j 2\pi f n T}$    (15)
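In practice the sum in (15) is truncated to the lags actually known; a minimal numerical rendering, with a unit sampling interval and made-up lag values, looks like the following and produces the quantity that is raised to the power α in the entropy below.

```python
import numpy as np

# Hypothetical known autocorrelation lags R(-N)..R(N) for a real process (R is even).
R = {0: 1.0, 1: 0.6, 2: 0.1}             # toy values; R(-n) = R(n)
T, W = 1.0, 0.5                           # unit sampling interval, one-sided band W = 1/(2T)

f = np.linspace(-W, W, 1001)
P = sum((1.0 / (2 * W)) * R[abs(n)] * np.exp(-2j * np.pi * f * n * T)
        for n in range(-2, 3)).real       # truncated version of eq. (15)
print("P(f) min/max:", P.min(), P.max())
```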

We may write the Renyi Entropy as

$H_R(\alpha) = \frac{1}{1-\alpha} \ln \int_{-W}^{W} \left[ \frac{1}{2W} \sum_{n=-\infty}^{\infty} R(n)\, e^{-j 2\pi f n T} \right]^{\alpha} df$ .    (16)

Since our goal is to find a P(f) that maximizes $H_R(\alpha)$ with respect to R(n), we solve


$\frac{\partial H_R(\alpha)}{\partial R(n)} = 0$ .

We maintain that the resulting spectral estimator, for $0 < \alpha < 1$, is given by

$P(f) = \beta \left[ \sum_{n=0}^{N} a_n\, e^{-j 2\pi f n T} \right]^{-\frac{1}{1-\alpha}}$    (17)

where $\beta$ is a scale factor determined by $\alpha$ and $\int_{-W}^{W} P^{\alpha}(f)\, df$.

We observe that up to a scale factor this spectral estimator is functionally equivalent to Burg’s Maximum Entropy spectral estimator, originally derived in [30, 31]. Figure 5 illustrates a comparison of a common Least Squares spectral estimator, the Forward-Backward Linear Prediction method, and the proposed Maximum Renyi Entropy spectral estimator with α = 1/2. The real-valued 64-sample data set consists of three sinusoids at normalized frequencies of 0.1, 0.2 and 0.21 Hz and a colored noise process obtained by bandpass filtering (0.35 Hz center frequency) Gaussian white noise, as suggested in [32]. A 10th-order model is used in both cases. As expected, the differences are rather subtle, with the MRSE possibly exhibiting increased smoothing of the bandpass spectrum over the frequency range 0.2 Hz < f < 0.5 Hz.

Fig. 5. Least Squares vs. Maximum Renyi Entropy Spectral Estimator
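For readers who wish to experiment with this comparison, the sketch below builds a data set in the spirit of the one just described (three sinusoids plus bandpass-filtered Gaussian noise), fits a 10th-order AR model by the autocorrelation (Yule-Walker) method rather than Burg’s recursion for brevity, and evaluates the spectral shape implied by (17) with α = 1/2, ignoring the scale factor β. The amplitudes, phases, noise level and filter design are assumptions, not the exact values behind Fig. 5.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import butter, lfilter

rng = np.random.default_rng(3)

# 64-sample test signal: three sinusoids plus colored (bandpass) noise.
# Amplitudes, phases, noise level and filter order are illustrative choices.
n = np.arange(64)
x = (np.cos(2 * np.pi * 0.10 * n) +
     np.cos(2 * np.pi * 0.20 * n + 0.7) +
     np.cos(2 * np.pi * 0.21 * n + 1.3))
bb, ab = butter(4, [0.30, 0.40], btype="bandpass", fs=1.0)   # band centered near 0.35
x = x + 0.5 * lfilter(bb, ab, rng.standard_normal(n.size))

# 10th-order AR fit by the autocorrelation (Yule-Walker) method.
p = 10
r = np.array([np.dot(x[:x.size - k], x[k:]) for k in range(p + 1)]) / x.size
ar = solve_toeplitz(r[:p], r[1:p + 1])       # AR coefficients a_1..a_p
a = np.concatenate(([1.0], -ar))             # prediction-error filter coefficients

# Spectral shape implied by (17): P(f) proportional to |A(f)|^(-1/(1-alpha)).
alpha = 0.5
f = np.linspace(0.0, 0.5, 512)
A = np.array([np.sum(a * np.exp(-2j * np.pi * fk * np.arange(p + 1))) for fk in f])
P = np.abs(A) ** (-1.0 / (1.0 - alpha))      # alpha = 1/2 gives the Burg-like 1/|A|^2 shape
print("dominant peak near f =", f[np.argmax(P)])
```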

VI. CONCLUSIONS

We have briefly overviewed Renyi Entropy and highlighted key properties. We suggested uses of Renyi Entropy for spectral estimation, pattern recognition and source separation. To further explore spectral estimation we derived a Maximum Renyi Entropy solution, motivated by Burg’s Maximum Entropy method. We compared this spectral estimator, using a standard data set, to another common parametric spectral estimator known as the “Forward-Backward” method.

REFERENCES

[1] Cover, T., and Thomas, J., Elements of Information Theory, Wiley, NY, 1991.

[2] C.E. Shannon, “A Mathematical Theory of Communication”, BSTJ Vol 27, pp. 379-423, 623 – 656, July/Oct 1948.

[3] R. Hartley, “Transmission of Information”, BSTJ, Vol 7, pp. 535 – 563, 1928.

[4] A. Renyi, “On Measures of Entropy and Information”, Proc. 4th Berkeley Symp on Math. Stat. Prob. Vol I, pp. 547-561, 1961.

[5] D. Erdogmus and J. Principe, “From Linear Adaptive Filtering to Nonlinear Information Processing”, IEEE Sig Proc Magazine, November, 2006, pp. 14 – 33.

[6] J. Principe, “Information-Theoretic Learning”, in Unsupervised Adaptive Filtering Vol I, Simon Haykin ed., Wiley, NY, 2000 (Chapter 7).

[7] Hyvarinen, A., Karhunen, J., Oja, E., Independent Component Analysis, Wiley, NY, 2001.

[8] J. Hudson, “Signal Processing Using Mutual Information”, IEEE Signal Proc Magazine, pp. 50 – 58, Nov 2006.

[9] D. Lake, “Renyi Entropy Measures of Heart Rate Gaussianity”, IEEE Trans on Biomedical Engineering, Vol 53, No. 1, pp. 21 – 27, Jan 2006

[10] D. Erdogmus and J. Principe, “An Error-Entropy Minimization Algorithm for Supervised Training of Non-linear Adaptive Systems”, IEEE Tran Sig Proc, vol 50, No. 7, pp. 1780 – 1786, 2002.

[11] C. Arndt, Information Measures: Information and its Description in Science and Engineering, Springer, Berlin, 2001.


[12] P. Grassberger and I. Procaccia, “Characterization of Strange Attractors”, Phys Rev Letters, Vol 50, No. 5, pp. 346 – 349, 1983.

[13] J.N. Kapur, Measures of Information Theory and Their Applications, John Wiley and Sons, NY, 1995

[14] K. Hild, et. al., “An Analysis of Entropy Estimators for Blind Source Separation”, Signal Processing 86 (2006), pp. 182 -194.

[15] K. Hild, et. al., “Blind Source Separation Using Renyi’s Mutual Information”, IEEE Signal Proc Letters, Vol 8, No 6, pp. 174-176, June 2001.

[16] E. Parzen, “On the Estimation of a Probability Density Function and the Mode”, Ann. Math. Stat. , vol. 33, 1962, pp. 1065.

[17] D. Erdogmus, Information Theoretic Learning: Renyi’s Entropy and its Application to Adaptive System Training, Ph.D. Dissertation, Univ of Florida, 2002.

[18] K. Hild, Blind Separation of Convolutive Mixtures using Renyi’s Divergence, Ph. D. Dissertation, Univ of Florida 2003.

[19] S. Han, A Family of Minimum Renyi’s Error Entropy Algorithm for Information Processing, Ph.D. Dissertation, Univ. of Florida, 2007.

[20] D. Erdogmus, et. al., “Independent Components Analysis Using Renyi’s Mutual Information and Legendre Density Estimation”, Intl. Joint Conf. on Neural Networks, pp. 2762-2767, July 2001

[21] Y. Bao and H. Krim, “Renyi Based Divergence Measures for ICA”, 2003 IEEE Workshop on Statistical Signal Processing Workshop 28 Sept.-1 Oct. 2003, pp. 565- 568.

[22] Kreucher, et. al., “An Information Based Sensor Management Method for Multitarget Tracking”, First IEEE Conference on Information Processing in Sensor Networks, pp. 209-222, 2003.

[23] B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice-Hall, NJ, 1985.

[24] S. Gazor and W. Zhang, “Speech Probability Distribution”, IEEE Signal Proc Letters, vol 10, No. 7, pp. 204 – 207, Jul 2003.

[25] A. Renyi, Probability Theory, Dover, NY, 1970.

[26] P. Viola, et. al., “Empirical Entropy Manipulation for Real-World Problems”, Advances in Neural Information Processing Sys 8 (NIPS*96), MIT Press, pp. 851-857, 1996.

[27] K. Waheed and F. Salam, “A Data-derived Quadratic Independence Measure for Adaptive Blind Source Recovery in Practical Applications”, 45th Midwest Symposium on Circuits and Systems, Vol 3, pp 473-476 , Aug. 2002.

[28] A. Renyi, Probability Theory. New York: Elsevier, 1970.

[29] C. Tsallis, “Possible generalization of Boltzmann-Gibbs statistics,” J. Statist. Phys., vol. 52, pp. 479–487, 1988.

[30] J. P. Burg, “Maximum entropy spectral analysis,” 37th Annual Meeting Society of Exploration Geophysicists, Oklahoma City, OK, 1967.

[31] John Parker Burg, “Maximum Entropy Spectral Analysis,” PhD Thesis, Department of Geophysics, Stanford University, May 1975.

[32] S.M. Kay and S.L. Marple, “Spectrum Analysis – A Modern Perspective,” IEEE Proc., Vol. 69, No. 11, pp. 1380-1419, November 1981.

[33] E.T. Jaynes, “On the Rationale of Maximum Entropy Methods,” IEEE Proc., Vol. 70, No. 9, pp. 939-952, September 1982.

[34] A. Papoulis, “Maximum Entropy and Spectral Estimation: A Review,” IEEE Trans. ASSP, Vol. ASSP-29, No. 6, pp. 1176-1186, December 1981.

[35] Deniz Erdogmus, Kenneth E. Hild, Jose C. Principe, Marcelino Lazaro, and Ignacio Santamaria, “Adaptive Blind Deconvolution of Linear Channels Using Renyi’s Entropy with Parzen Window Estimation,” IEEE Trans. SP, Vol. 52, No. 6, June 2004.
