
Chapter 4

Decision Making and Learning

4.1. Introduction

In spite of scientific advances in the fields of neuropsychology and, more generally, in cognitive sciences during the last 50 years, we are still very far from understanding the physiological mechanisms that explain learning and decision making in the animal kingdom. Although no computer can yet claim cognitive qualities similar to those of human beings, the problem of learning and decision making for an intelligent telecommunications system starts by establishing precise rules that give rise to these functions. Learning, for a cognitive system, corresponds to a phase of interpretation of the stimuli provided by the environment in a language that is understandable by the system and is minimalist in terms of information storage. In this chapter, we will describe a mathematical approach based on Bayesian probabilities and the principle of maximum entropy, which allows learning by cognitive radios (CRs). The advantages and drawbacks of these techniques will be elaborated through examples of channel modeling and channel estimation.

After learning, the intelligent device is required to make decisions involving actions that will enable it to adapt itself to its surrounding environment (and sometimes modify this environment). This decision comprises the choice of the action to be performed. This choice will be guided by information acquired by the system and, in particular, in intelligent systems for telecommunications, by information about other network agents. It is particularly essential that each agent knows, or at least has a priori knowledge of, the strategy adopted by the other network agents, to avoid reaching a never-ending situation in which the device takes this or that decision owing to the fact that the other agents know that it knows that they know ... which set of actions the device can take.

Chapter written by Romain COUILLET, Mérouane DEBBAH, Hamidou TEMBINE, Wassim JOUINI and Christophe MOY.

The decisions that a set of smart communication devices are to take are well analyzed using game theory, building on the seminal work of Von Neumann and Nash. We will discuss in succession the theoretical aspects of decision making and will study, through game theory, the artificial mechanisms that generate autonomous cooperative networks of intelligent devices.

In the beginning, this chapter will elucidate the problems of decision making and learning in the context of CR. In particular, the concept of a cognitive agent will be introduced. Then, the constraints related to decision making lead to the introduction of the notion of decision space. Subsequently, decision making and learning will be discussed from the device and network points of view. Finally, as a more concrete example, a state of the art in the case of dynamic adaptation will be proposed and, in this context, a proposal will be made to classify the related decision-making methods.

4.2. CR equipment: decision and/or learning

4.2.1. Cognitive agent

Whatever the context considered, CR equipment can be defined as an autonomous communication system, fully aware of its environment as well as of its own operational capabilities, and able to act on them intelligently [HAY 05, MIT 99a]. Hence, it is a device equipped with sensors that are able to collect different kinds of information, and capable of using this information to adapt to its environment as shown in Figure 1.1. This assumes that the system has elementary cognition capabilities, namely: perception (position, spectral environment, etc.), reasoning (data fusion, decision making, learning, memory, etc.), and executive functions (reconfiguration, transmission, etc.). The simplified interaction of the CR equipment with its environment is illustrated in Figure 5.1. Only the intelligent subsystem, i.e. the decision-making engine, will be of interest in the next section. The decision-making engine appears as a cognitive agent and can be considered as the brain of the CR equipment. Hence, its objective is to use the collected metrics in order to devise a strategy that satisfies the needs of the user according to the environment in which the equipment operates. This strategy results in commands sent to the rest of the equipment in order to change its operational parameters (modulation, coding, transmission power, band used, etc.). As a first approximation, the intelligent agent at each instant depends on:


– information that it collects and is capable of managing. These metrics characterize the operational environment of the equipment;

– an “objective function” (or utility function) that characterizes the user needs as well as the possible constraints to which the equipment is subjected. The “objective function” may be the union of several criteria to be optimized;

– all the actions/commands that it can execute/give. In terms of equipment, this is identical to the set of parameters that the CR is capable of modifying.

To sum up, we can reduce the decision-making problem to a function that has, as inputs, the pieces of information retrieved by sensors as well as those memorized throughout the operation of the cognitive agent, and which has, as outputs, commands for the rest of the equipment. Whatever the methods that may be suggested by the radio community, the prime objective is to determine a good decision function (explicit or implicit) that leads to a system capable of meeting user needs while adapting itself to its environment. To accomplish its mission successfully, the intelligent agent will have to face several problems inherent in multiobjective decision making, especially conflicts between objectives and problem modeling.

4.2.2. Conflicting objectives

In the context described above, the cognitive agent is a multicriteria decision function under constraints. In the context of CR, this multicriteria optimization problem is even more complex as the different objective functions are sometimes conflicting (for example, minimize the bit error rate (BER), minimize complexity, maximize throughput, maximize spectral efficiency, and minimize power consumption). As a consequence, a set of parameters that is optimal for one of the criteria may significantly deteriorate the system's performance with respect to another criterion. Consequently, in this kind of problem, most often there does not exist a single solution better than all others, but rather a subset of “good candidates”, each representing a compromise on the performance of the radio equipment with respect to the different criteria. Finally, with the increase in the number of parameters to be manipulated and of the objectives to be optimized, and hence in their heterogeneity, the solution space can quickly become too large to permit exact resolution of the problem or even any exhaustive search for solutions.

4.2.3. A modeling part in all approaches

In certain approaches recommended in the literature, an important amount of modeling is necessary to analyze the performance of the proposed methods. Unfortunately, the solutions found by these methods are not independent of the modeling. In view of these remarks, it is understood that these solutions will be more or less valid depending on the accuracy of the chosen model with respect to reality. Consequently, the solutions proposed at the time of resolution of the multicriteria problem will also be more or less adapted to reality. It is unfortunately very difficult to analyze, in the general case, the effects of these biases on solution quality; nevertheless, we can expect that the system will react “satisfactorily” (though not necessarily optimally) in cases in which the chosen model is close to reality. On the other hand, in cases where the identified model is “distant” from reality, it is difficult to predict the performance gap between the optimal case and the found solution. An opposite approach would be to consider only minimal assumptions on the environment. This leads to interesting solutions, as we will see at the end of this chapter, in the case of the “multi-armed bandit” (MAB). Nonetheless, conceiving an efficient system will require a compromise between modeling assumptions and expert knowledge.

Figure 4.1. An example of a multiobjective problem in the cognitive context [RON 06]. In the problem described above, the decision engine is interested in several objectives such as throughput, spectral efficiency, bit error rate (BER) or even power consumption. These objectives depend on a certain number of parameters common to the objectives (Tables I and II in this figure). Consequently, values chosen to optimize one of the criteria could lead to significant degradation in system performance with respect to another criterion

4.2.4. Decision making and learning: network equipment

In numerous scenarios, the decisions taken by the equipment can be considered local, and hence without any impact on the overall behavior of the network. Nevertheless, among the large range of problems encountered by the CR community, we cannot ignore behaviors of network elements that could lead to a significant degradation of the transmission performance of all the neighboring devices: use of excessively high power, access to a frequency band without authorization, etc. In these cases, it is necessary to set up basic rules of behavior in order to allow all to benefit evenhandedly from the resources offered by the environment.


4.3. Decision design space

4.3.1. Decision constraints

An analysis of the literature on dynamic configuration adaptation (DCA) [COL 08, MIT 99a, RIE 04] identifies three constraints on which the proper functioning of CR equipment [JOU 10b] depends. These are:

– environmental constraints;

– user constraints (or service requested);

– equipment constraints (particularly reconfiguration capability).

Designing CR equipment thus comes down to providing the equipment with the intelligence capabilities necessary for it to adapt itself according to its operational objectives and its capabilities. It is therefore essential to offer all the exploration possibilities in the three-dimensional space created by these three constraints. It is to be noted that this approach is equally valid for dynamic spectrum access, which also requires that the CR equipment be able to adapt itself in frequency.

4.3.1.1. Environmental constraints

Among environmental constraints, we can include not only what is imposed by operating conditions, namely propagation, obstacles, movements, etc., but also what depends on the rules for the use of frequencies, the interference tolerance level, etc. It is also true that if the environment imposes too many constraints, then the equipment has no degrees of freedom left to adapt itself. However, if the environment does not impose any constraint, the CR equipment can only act within the limits of its own capabilities. Its operation must also remain in accordance with the user's wishes.

4.3.1.2. User constraints

In the different usage modes of its telecommunications equipment, user requirements may vary depending on the nature of the service requested, the importance of its interactions, or other factors such as power consumption or communication cost. In addition, operator requirements can also be added to the above, e.g. increase spectral efficiency or use a certain mode that is more profitable at a specific time. However, if we want to apply too many constraints simultaneously, finding a solution to the problem can become impossible because the required objectives may be contradictory. We fully understand that the interests of the user and of the operator may differ: the user wants to pay less while the operator wants to charge more. However, if the user has no particular requirement, there is no point in equipping him with a CR device, because he will not take any benefit from it.


4.3.1.3. Equipment capacity constraints

Reconfiguration capabilities are added to the inherent computing power of the equipment so that the device can adapt itself to its environment. This is why CR will advantageously be based on software radio technologies, as indicated in Chapters 8, 10, and 11. The flexibility of the equipment offers as many new degrees of freedom as it has modifiable parameters.

4.3.2. Cognitive radio design space

The cognitive radio design space [JOU 10b] is the abstract volume formed by three dimensions: environmental constraints, utilization criteria, and the limits of the equipment platform. This space is shown in Figure 4.2. It should be noted that if we consider these three dimensions as independent of each other, we obtain an exceedingly large exploration space called the virtual space. But as these dimensions may be correlated with each other, this space is curtailed. Indeed, some degrees of freedom on one axis (increase the capacity) may be impossible to exploit if, on another axis, too many parameters are fixed (imposed waveform).

In a nutshell, let us note that this space is bound neither to a time nor to a place. It takes into account all the scenarios considered and also those that might be encountered by the radio equipment. As a result, the constraints imposed by the user (through all possible objectives), those fixed by the environment (especially all network constraints depending upon location and time), and the constraints intrinsic to the platform define a set of possible decision problems. At each instant, instances of this space will be defined as functions of the exact constraints being met. The cognitive agent will then search for a solution to the problem based on the a priori information that it has on the functional relationships linking these three dimensions of the space. We will see that in the case of DCA the decision space found in the literature is the same. However, depending on what is supposed to be known to the cognitive agent, the techniques proposed to solve the decision problem can vary.

4.4. Decision making and learning from the equipment’s perspective

4.4.1. A priori uncertainty measurements

One of the first to consider the mathematical modeling of learning was Richard Cox [COX 46], who in 1946 designated the domain of probabilities as the most appropriate tool to model the abstract notions of knowledge and learning. In a cognitive approach, probability theory characterizes not only the study of the laws of random events (the so-called frequentist approach), but also the study of the confidence of an agent in deterministic events with incomplete information (the Bayesian approach). Denoting the event by A, and all the information known a priori to the agent by I (whether this information is correlated to A or not), the degree of knowledge of A given the a priori information I is given by:

$$P(A|I) \qquad [4.1]$$

If A is completely known, i.e. if the agent is perfectly able to judge whether A is true or not starting from I, then P(A|I) ∈ {0, 1}. In the contrary case, when I alone does not determine the agent's knowledge of the truthfulness of A, P(A|I) ∈ [0, 1], indicating by this that the agent has a certain degree of belief in A. P(A|I) will be particularly close to 1 if I provides evidence of the truthfulness of A.

Figure 4.2. Modeling space of cognitive radio [JOU 10b]. The space shown above defines all the problems that could be faced by the radio. The volume of this space depends on three dimensions that constrain the decision agent: first, the constraints intrinsic to the equipment (e.g. the waveforms that it can generate), then the network-related constraints such as maximum power and authorized interference or bands to be used, etc., and finally, the user constraints that are manifested in the form of criteria to be met insofar as possible. If each dimension is considered independent of the others, the modeling space is larger than the limited real space. Indeed, the three dimensions are not independent and the constraints imposed by one of the constituents (equipment, network, user) necessarily have an impact on all the possibilities


Cox [COX 46] then demonstrated that this probabilistic approach is consistent with the classical rules, namely the Bayes rule, for two events A and B and an a priori I:

$$P(A|B,I) = \frac{P(B|A,I)\,P(A|I)}{P(B|I)} \qquad [4.2]$$

The Bayes rule reflects the learning of information on the event A when the information B is given to the agent. The transition from P(A|I) to P(A|B,I) is to be understood as the update of our degree of belief in event A when the additional information B is given to the agent. Bayesian learning techniques, which are an extension of Cox's approach, are discussed in the context of CR in the next section.
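To make the update [4.2] concrete, the following minimal Python sketch (all numbers are invented for illustration) shows a cognitive agent revising its belief that a frequency band is free after a sensing observation.

```python
# Hypothetical numeric illustration of the Bayes update in equation [4.2]:
# an agent revises its belief that a channel is free (event A) after a
# sensing measurement B. All numbers below are assumed for illustration.

def bayes_update(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B, I) from P(A | I) and the two likelihoods."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b

p_free = 0.5             # a priori belief P(A|I) that the band is free
p_quiet_if_free = 0.9    # P(B|A,I): sensor reports "quiet" when the band is free
p_quiet_if_busy = 0.2    # P(B|not A,I): sensor wrongly reports "quiet" when occupied

p_free_after = bayes_update(p_free, p_quiet_if_free, p_quiet_if_busy)
print(f"belief after observing 'quiet': {p_free_after:.3f}")  # about 0.818
```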

4.4.2. Bayesian techniques

In this section, we develop the Bayesian approach of learning information about the environment by an opportunistic agent in the cognitive network, through the example of channel modeling and estimation.

Let us consider a telecommunications equipment in an opportunistic network seeking to establish a multiantenna link between itself, which has nR antennas, and a primary network user whose number of antennas nT is known a priori. These two pieces of information naturally form part of the a priori information I. The cognitive agent then seeks to estimate the multidimensional channel H0 ∈ C^(nR×nT). If no stimulus (no data) is received through the link H0, then the a priori information I gathers little information in general; nevertheless, the cognitive agent has a degree of confidence P(H|I) that each possible link H is the effective channel H0. In particular, it is highly unlikely that every entry of H0 is identically equal to 10^10, corresponding to a gain of 200 dB on each channel input. This latter channel must therefore be assigned a low probability.

In general, it is desirable that for any a priori information I there exist a natural and systematic way to obtain P(H|I). This method, devised by Jaynes [JAY 82, JAY 89, JAY 03], exists and is commonly known as the theory of maximum entropy. It consists of the following two steps:

– Consider the whole set of probability distributions and eliminate from this set any probability distribution that is inconsistent with the a priori information I. In particular, if I states that E[H] = 0, every distribution with a non-zero mean is eliminated.

– In the remaining set, the probability distribution of maximum entropy is chosen and assigned to P(H|I), i.e. P(H|I) is defined by:

$$P(H|I) = \arg\max_{q} \left[ -\int q(H) \log q(H)\, dH \right] \qquad [4.3]$$


The choice of the maximum entropy distribution is justified by Jaynes as being the most “honest” choice in the sense that it does not presume any additional unknown information. It is to be particularly noted that this choice is consistent with the mathematical definition of information by Claude Shannon [SHA 48] and the work of Léon Brillouin [BRI 63] on the close ties between information theory and physics. In particular, note that the uniform distribution of a gas in an open space actually corresponds to the unconstrained maximum entropy distribution and is far more likely than the distribution in which all the gas is concentrated at one precise point in space. The cognitive agent may have important a priori information on the propagation channel; in particular, when the agent has strong reasons to believe that the transmitter is in line of sight (the typical case of a network deployed in a sparsely urbanized area), this information is integrated into I.

Similarly, if the agent is placed in an environment where electromagnetic waves are likely to come from one preferred direction, P(H|I) must integrate this information. The work of Debbah and colleagues describes several situations of channel inference under different statistical constraints gathered in the a priori information I [GUI 06].

Let us consider, in particular, the case where the only knowledge available about the channel is its average gain. I integrates this information as:

$$E\left[ \sum_{\substack{1 \le i \le n_R \\ 1 \le j \le n_T}} |h_{ij}|^2 \right] = E \qquad [4.4]$$

Using Lagrangian multipliers, it is possible to evaluate the distribution P(H|I) that has maximum entropy under the constraint I. This is given by:

$$P(H|I) = \arg\max_{q} \left\{ \left[ -\int dH\, q(H) \log q(H) \right] + \gamma \sum_{i=1}^{n_R} \sum_{j=1}^{n_T} \left[ E - \int dH\, |h_{ij}|^2 q(H) \right] + \beta \left[ 1 - \int dH\, q(H) \right] \right\} \qquad [4.5]$$

where γ and β are the Lagrangian multipliers associated, respectively, with the constraints that (i) q has a second-order moment equal to E and (ii) q is a probability distribution. It was shown in [GUI 06] that the previous optimization implies:

$$P(h_{ij}|I) = e^{-\left(\gamma |h_{ij}|^2 + \frac{\beta+1}{n_R n_T}\right)} \qquad [4.6]$$


and hence each entry of H is identically distributed. The power constraint E implies that P(H|I) is a multivariate Gaussian distribution with zero mean and variance E/(nR nT) for each entry. Therefore, a multiantenna channel with independent and identically distributed Gaussian entries is consistent with the a priori information of complete knowledge of the second-order moment. It can also be shown that a priori knowledge of the intrinsic correlations between transmit and receive antennas generates a channel distribution consistent with the Kronecker model, i.e.:

$$H = R^{\frac{1}{2}}\, X\, T^{\frac{1}{2}} \qquad [4.7]$$

where T ∈ C^(nT×nT) and R ∈ C^(nR×nR) are known correlation matrices at the transmitter and the receiver, respectively, and X is a random matrix with independent and identically distributed zero-mean entries.
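As an illustration, the following Python sketch (antenna numbers and correlation matrices are toy assumptions) draws one channel realization consistent with the Kronecker model [4.7].

```python
# A minimal numerical sketch (assumed parameter values) of drawing a channel
# from the Kronecker model of equation [4.7]: H = R^(1/2) X T^(1/2), with X
# an i.i.d. zero-mean complex Gaussian matrix.
import numpy as np

def sqrtm_psd(M):
    """Square root of a Hermitian positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

rng = np.random.default_rng(0)
n_r, n_t = 4, 2                               # assumed antenna numbers
# toy receive-side correlation (tridiagonal) and uncorrelated transmit side
R = np.eye(n_r) + 0.5 * np.eye(n_r, k=1) + 0.5 * np.eye(n_r, k=-1)
T = np.eye(n_t)

X = (rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))) / np.sqrt(2)
H = sqrtm_psd(R) @ X @ sqrtm_psd(T)
print(H.shape, np.linalg.norm(H))
```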

Once this a priori distribution is established, any additional stimulus x received by the antennas of the cognitive agent, typically a signal emitted by the transmitter, must be integrated into the information I, which becomes I′ = (x, I), and the resulting channel distribution, using the Bayes rule, becomes:

$$P(H|I') = P(H|x,I) = \frac{1}{\alpha}\, P(x|H,I)\, P(H|I) \qquad [4.8]$$

where α = ∫ P(x|H,I) P(H|I) dH is a simple normalization factor. We therefore have an explicit relationship describing the learning of H by the addition of the information x. If the information brought by x is useful, i.e. if it brings additional data to the final knowledge of H, then P(H|I′) will have a less spread out profile, assigning low probability to most H but, on the other hand, being more concentrated, with high probability, in the neighborhood of H0. In [COU 09] and [COU 10a], Couillet et al. describe in detail channel estimation operations starting from the a priori P(H|I); these condition the decisions to be made, which are discussed later in this chapter.

We now describe the techniques known as reinforcement techniques, which allow continuous learning when the environment evolves: every action of the agent modifies its environment and makes it possible to acquire more a posteriori information about this environment. These techniques deviate from the ideal Bayesian techniques described here, but propose robust algorithms for sequential learning.

4.4.3. Reinforcement techniques: general case

In many problems encountered in the CR domain (for example, opportunistic spectrum access or reconfiguration in an unknown environment), the decision maker faces many choices without any a priori knowledge of their performance. Nonetheless, it must find a strategy that enables it to maximize the quality of service offered to the users without disturbing the rest of the network. In this framework, the decision maker has no alternative but to try the different choices available to it and estimate their performance. This estimate is the reinforcement signal that will enable the decision maker to adapt its behavior to its environment. If the decision maker spends enough time on each of the possible choices, we can easily imagine that it will know their performance with sufficient precision to make an appropriate decision when the time comes. However, if it devotes too much time to estimating the performance of the possible choices, the user will not benefit from the collected information. As a consequence, the decision maker faces a dilemma between the immediate exploitation of the choices that seem most profitable (i.e. those with the best current estimate) and the exploration of other choices in order to improve the performance estimates of the available choices. In the rest of this section, we will separate the case where the chosen action depends on the present state of the decision maker from the case where the notion of state does not exist or is considered independent of the decisions to be taken.

4.4.3.1. Bellman’s equation

During the summer of 1949, Richard Bellman, a 28-year-old mathematician at Stanford University, already renowned for his promising work in number theory, was hired as a consultant at the RAND Corporation, a research and development institution founded in 1945 by the U.S. Air Force. He was interested in the applications of mathematics, and it was suggested that he work on multistage decision-making processes. At that time, research in mathematics was not really appreciated by the Department of Defense or among the politicians who also directed the Air Force. Bellman's first task was therefore to name his work in a way that would gratify his sponsors. He selected the word programming, which at that time was considered closer to planning and scheduling than to programming in the algorithmic sense of our time. Then he added the term dynamic to evoke the idea of evolution over time. The terminology of dynamic programming thus served as an umbrella for Bellman to cover his mathematical research activities at the RAND Corporation. Dynamic programming is based on a technique called Bellman's optimality principle. This general principle states that the solution to a global problem can be obtained by decomposing the problem into subproblems that are simpler to solve. An elementary but conventional example is the computation of shortest paths (or paths with lowest costs) in a graph. A famous example in this context is the traveling salesperson problem.
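As a toy illustration of this decomposition principle, the following Python sketch (graph and edge costs are invented) computes shortest-path costs by repeatedly applying the Bellman recursion.

```python
# An elementary illustration (toy graph, assumed weights) of Bellman's
# optimality principle: shortest-path costs obtained by decomposing the
# global problem into subproblems, via repeated application of the
# Bellman recursion cost(u) = min over successors v of [c(u,v) + cost(v)].
import math

# edges[u] = list of (v, cost) pairs; node 3 is the destination
edges = {0: [(1, 4.0), (2, 1.0)], 1: [(3, 1.0)], 2: [(1, 2.0), (3, 5.0)], 3: []}

cost = {u: math.inf for u in edges}
cost[3] = 0.0
for _ in range(len(edges)):                   # enough passes for convergence
    for u, succs in edges.items():
        for v, c in succs:
            cost[u] = min(cost[u], c + cost[v])   # Bellman recursion

print(cost)  # {0: 4.0, 1: 1.0, 2: 3.0, 3: 0.0}
```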

Bellman started working on the theory of optimal control while studying the optimality principle of dynamic programming. This domain deals with the problem of finding a control strategy for a given system in order to satisfy an optimality criterion involving a cost function that depends on state and control variables. For example, consider a car traversing a hilly road. The question is to determine how the driver must drive in order to minimize the total duration of the journey. Here, the control strategy means how the driver must press the accelerator or brake pedal. The system consists of the car and the road. The optimality criterion is to minimize the total duration of the journey. Control problems generally include auxiliary constraints. In the example considered, these may be the limited quantity of petrol, speed limits, etc. A cost function here is a mathematical expression giving travel time as a function of speed, geometric considerations of the road, etc. We introduce value functions at different times or intermediate stages and calculate them starting from the end and then by recursive induction. We will use this optimality principle to learn quality values as well as the optimal strategy in the following section.

4.4.3.2. From Bellman's equation to reinforcement techniques

Let us consider an entity that perceives its environment through sensors and acts accordingly. Perceptions are used not only to act but also to improve the performance of the agent in the future.

Let us consider a finite set S describing the possible states of the agent. In each state s ∈ S, the agent has a finite set of actions denoted by A(s). When the agent chooses an action at at time t, it modifies its environment and therefore its perceptions. It goes from state st to state st+1 and receives a reward rt = r(st, at). In general, the perceptions and the states do not permit reconstructing the entire environment, and the effect of an action on perceptions is a stochastic dynamic process in discrete time. We describe the case in which the process is Markovian of order 1, i.e. at any time t, the probability of going from s to s′ depends only on s and a, and not on the previous states and actions:

$$P(s_{t+1} = s' \mid s_t = s, a_t = a, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} = s' \mid s_t = s, a_t = a) = P_{sas'} \qquad [4.9]$$

This defines a Markov decision process given by the quadruple (S, A, P, r). We call a deterministic Markov policy a function π : S → A that associates with each state an action to be performed. More generally, we can define a non-deterministic Markov policy π which, given a state s, associates a probability distribution on the action space: π(s) ∈ Δ(A(s)), where Δ(A(s)) is the set of probability distributions over A(s). We define the value function of a state s under a deterministic policy π as the discounted cumulated reward:

$$v^\pi(s) = E\left[ (1-\gamma) r_0 + (1-\gamma)\gamma r_1 + (1-\gamma)\gamma^2 r_2 + \ldots \mid s_0 = s \right]$$

where E is the mathematical expectation with respect to the transition distribution P, and γ is a discount parameter (called the discount rate). This parameter γ determines the present value of future rewards.

Unlike dynamic programming, in which the complete model of the environment is assumed to be known, reinforcement learning relaxes a certain number of assumptions. The agent only knows its perceptions and states, and should continuously improve its policy by attempting new actions to better understand their consequences on the environment. One of the advantages of reinforcement techniques is that they are rather general models that do not require all the parameters of the environment and that allow dynamic adaptation when conditions change. To start with, let us study the reinforcement learning models that are directly constructed from Bellman's optimality equation (complete model). Then we will see what we can do if this information is unreliable. We define the discounted cumulative reward, also called quality, as follows:

$$Q^\pi(s,a) = (1-\gamma)\, E\left( \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s,\ a_0 = a \right)$$

which can be rewritten as:

$$Q^\pi(s,a) = (1-\gamma)\, r(s,a) + \gamma \sum_{s' \in S} P_{sas'}\, v^\pi(s')$$

Then the Bellman equation can be written as:

$$v^\pi(s) = (1-\gamma)\, r(s,\pi(s)) + \gamma \sum_{s' \in S} P_{s\pi(s)s'}\, v^\pi(s')$$

This equation gives a recursive relationship on vπ(s) for a state s by linking it to the value functions of its successor states. We can evaluate the value function of a Markov policy by solving the linear system of |S| (cardinality of S) equations with |S| unknowns. Note that to solve this system we must know the values of P_{sπ(s)s′}. We will describe in the next section how to learn the value function when P is unknown.

4.4.3.3. Value update

The method consists of value iterations that solve Bellman's optimality equations incrementally and then construct an optimal policy. This technique generates a sequence v0, v1, v2, . . . of value functions, which converges toward v∗. The algorithm is described as follows:

– v0 is set arbitrarily (e.g. the null vector);

– update the values:

$$v_{t+1}(s) = \max_{a \in A(s)} \left[ (1-\gamma)\, r(s,a) + \gamma \sum_{s' \in S} P_{sas'}\, v_t(s') \right];$$

– stop the iteration when the difference max_s |v_{t+1}(s) − v_t(s)| ≤ ε.

Bellman’s optimality equation is:

$$v^*(s) = \max_{a \in A(s)} \left[ (1-\gamma)\, r(s,a) + \gamma \sum_{s' \in S} P_{sas'}\, v^*(s') \right]$$


In a finite horizon, this result of Bellman's can be depicted using a graph: the Markov decision process then corresponds to a shortest path problem on a weighted graph. The result says that any subtrajectory of an optimal trajectory is optimal. We can show that this extends to infinite horizon problems. There also exists a generalization of this method that modifies the value function only on a subset of states at each stage of the algorithm.
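The following Python sketch illustrates the value update on a toy two-state, two-action Markov decision process (transition probabilities and rewards are invented), iterating the normalized Bellman optimality backup until convergence.

```python
# A minimal sketch of value iteration on a toy 2-state, 2-action MDP
# (transition probabilities and rewards below are assumed), using the
# chapter's normalized Bellman optimality equation
# v*(s) = max_a [(1-gamma) r(s,a) + gamma * sum_s' P[s][a][s'] v*(s')].
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s][a][s']
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                     # r[s][a]
              [0.0, 2.0]])
gamma, eps = 0.9, 1e-8

v = np.zeros(2)
delta = 1.0
while delta > eps:
    q = (1 - gamma) * r + gamma * P @ v       # q[s, a] from the current values
    v_new = q.max(axis=1)                     # Bellman optimality backup
    delta = np.max(np.abs(v_new - v))
    v = v_new

print("v* =", np.round(v, 4), " greedy policy:", q.argmax(axis=1))
```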

4.4.3.4. Iteration algorithm for policies

If we know P and r, we can directly solve the Bellman equation and obtain v∗. We can then construct an optimal policy starting from the Bellman equation by setting:

$$\pi^*(s) \in \arg\max_{a \in A(s)} \left[ (1-\gamma)\, r(s,a) + \gamma \sum_{s' \in S} P_{sas'}\, v^*(s') \right]$$

We say that a policy π is better than another policy π′ if, for every state s, the value function vπ(s) obtained under π starting from s is greater than or equal to the value function vπ′(s) obtained under π′ starting from s. We have the following policy improvement result: policy π is better than π′ if Qπ′(s, π(s)) ≥ vπ′(s), ∀ s ∈ S. On the basis of this result, we build an algorithm that improves the policy at each step. Since there are only a finite number of states and a finite number of actions in each state, the process stops after a certain number of steps (otherwise we would end up falling back on the same policy). The algorithm is described as follows:

– choose an initial policy π0. Evaluate this policy by calculating Qπ0 ;

– make a new policy: π1(s) ∈ argmaxa Qπ0(s, a), ∀ s ∈ S;

– by using the policy improvement result, the newly obtained policy π1 is better than π0. If π1 = π0, the algorithm stops; if not, we start again, taking policy π1 as the initial policy.

Even if the algorithm stops in a finite number of steps, the computational complexity of evaluating Qπ at each stage can be very high.
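A minimal Python sketch of the policy iteration scheme above, on the same kind of toy MDP (invented numbers): each iteration solves the linear policy evaluation system exactly and then improves the policy greedily.

```python
# A minimal sketch of policy iteration on a toy 2-state, 2-action MDP
# (assumed numbers), alternating exact policy evaluation (a linear solve)
# and greedy policy improvement, as described in the algorithm above.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s][a][s']
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                     # r[s][a]
              [0.0, 2.0]])
gamma = 0.9
n_states = 2

policy = np.zeros(n_states, dtype=int)        # initial policy pi_0
while True:
    # policy evaluation: solve (I - gamma * P_pi) v = (1 - gamma) r_pi
    P_pi = P[np.arange(n_states), policy]     # P_pi[s][s'] under the current policy
    r_pi = r[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, (1 - gamma) * r_pi)
    # policy improvement: greedy with respect to Q_pi
    q = (1 - gamma) * r + gamma * P @ v
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                                 # the same policy recurs: stop
    policy = new_policy

print("optimal policy =", policy, " v =", np.round(v, 4))
```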

4.4.3.5. Q-learning

Q-learning is a reinforcement learning technique whose goal is to learn the quality values (Q-values). The iterative method is the following:

$$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \left[ (1-\gamma)\, r(s_t,a_t) + \gamma \sum_{s' \in S} P_{s_t a_t s'} \max_{a'} Q_t(s',a') - Q_t(s_t,a_t) \right] \qquad [4.10]$$

In this equation, we replace r(st, at) by the perception rt and the summation over s′ by the observed next state st+1, and we scale the correction by a multiplicative factor αt. The Q-learning algorithm can then be written as:


– choose Q0(s, a) arbitrarily, ∀ s, a;

– observe the current state st and choose an action at ∈ A(st): initially, all actions of all the states are tested, then the function Q is exploited while exploration continues;

– observe the perceived reward rt and the new state st+1;

– update Qt(st, at) using:

$$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha_t \left[ (1-\gamma)\, r_t + \gamma \max_{a'} Q_t(s_{t+1},a') - Q_t(s_t,a_t) \right]$$

and decrease the coefficient αt;

– repeat from step 2.

Assuming that the coefficients αt satisfy:

$$\alpha_t \ge 0, \qquad \sum_{t \ge 0} \alpha_t = +\infty, \qquad \sum_{t \ge 0} \alpha_t^2 < +\infty,$$

we can show that this algorithm converges provided that, for each pair (s, a), the function Q keeps being updated.

Choosing an action:

– ε-threshold: draw a number λ uniformly in the interval [0, 1]. If λ < ε, then explore by drawing an action at uniformly at random in A(st). If λ ≥ ε, then exploit by choosing at ∈ argmax_a Q(st, a);

– choose an action at ∈ A(st) with probability proportional to its quality, i.e. according to the rule:

$$\frac{Q(s_t,a_t)}{\sum_{a} Q(s_t,a)}\,;$$

– choose at according to the Boltzmann–Gibbs distribution given by:

$$\frac{e^{\frac{1}{\varepsilon} Q(s_t,a_t)}}{\sum_{a} e^{\frac{1}{\varepsilon} Q(s_t,a)}}\,.$$

When ε is fairly large, the distribution is almost uniform. When ε tends to zero, the distribution concentrates on argmax_a Q(st, a).
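The following Python sketch combines the tabular Q-learning update with ε-threshold action selection on an invented two-state environment; the transition and reward values are illustrative assumptions, not taken from the chapter.

```python
# A minimal, self-contained sketch of the tabular Q-learning rule above,
# with epsilon-greedy (epsilon-threshold) exploration; the 2-state toy
# environment and its transition/reward numbers are assumed.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s][a][s'] (assumed)
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                    # r[s][a] (assumed)
              [0.0, 2.0]])
gamma, eps = 0.9, 0.1

Q = np.zeros((2, 2))
s = 0
for t in range(1, 50001):
    # epsilon-threshold choice of the action
    a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(2, p=P[s, a])        # environment transition
    alpha = 1.0 / t**0.6                     # decreasing learning rate
    # Q-learning update with the chapter's (1 - gamma) reward normalization
    Q[s, a] += alpha * ((1 - gamma) * r[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 3))
```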

4.4.4. Reinforcement techniques: slot machine problem

4.4.4.1. An introductory example: analogy with a slot machine

Let us suppose that a player enters a casino one day and is faced with a certain number of slot machines. S/he then seeks to maximize the earnings collected over a certain number of tries. If this player has complete information on the average earnings of each slot machine, an optimal strategy is to keep playing the machine having the highest average earnings. However, in the case where the player has no information on what s/he could win by playing on one machine or another, s/he has no other choice but to test the different machines to estimate their average earnings. This model is known in the literature as the MAB problem. If we imagine that these machines are subbands that we would like to access, or configurations to be tested, these decision and learning problems in CR are similar to a slot machine problem. This particular problem is known as opportunistic spectrum access (see section 1.3.2.4).

4.4.4.2. Mathematical formalism and fundamental results

Let us consider a set K = {1, 2, ..., k, ..., K} of K slot machines among which the decision maker seeks to determine the machine that provides the highest average gain. At each instant t = 0, 1, 2, ..., the decision maker sequentially plays one of the machines, following a certain strategy π. Each time a machine k is played, the player collects a reward rt drawn from the machine's own distribution θk. We will assume that the earnings collected from the same machine are independent and identically distributed. As for the distributions θk, they are supposed to be independent, stationary, but not all identical. Finally, we denote by μk = E[θk] the expectation of any distribution θk, and by μ∗ = max_k{μk} the expectation of the distribution associated with the optimal machine.

In this context, it is possible to define a notion of cumulative loss, called “regret”, as follows:

$$R^\pi_t = t\,\mu^* - \sum_{m=0}^{t-1} r_m \qquad [4.11]$$

Under these assumptions, and denoting Δk = μ∗ − μk, the expected cumulative regret can be written as:

$$E[R^\pi_t] = \sum_{k=1}^{K} \Delta_k\, E[T_k(t)] \qquad [4.12]$$

where Tk(t) denotes the number of times machine k has been played up to instant t.

We will say that a given strategy is efficient if it minimizes the average cumulativeregret.

In 1985, Lai and Robbins [LAI 85] showed that, whatever the adopted strategy, the average cumulated regret is necessarily greater than or equal to a logarithmic function of time t. They moreover showed that the coefficient of this logarithmic function depends on the distributions θk considered. Consequently, no player can expect to obtain, on average, a smaller average cumulative regret. Finally, they showed that strategies capable of achieving this lower bound on the average cumulative regret asymptotically exist, and they explicitly gave their forms for certain distributions (Gaussian, Bernoulli, etc.).

4.4.4.3. Upper confidence bound (UCB) algorithms

The algorithms presented by Lai and Robbins [LAI 85] belong to a large family of algorithms whose principle is as follows:


– associate with each machine an index that synthesizes the information collected on the machine until the instant t;

– at instant t+ 1, choose the machine that has the largest index;

– the decision maker plays the selected machine and obtains a gain rt drawn from the distribution associated with the machine;

– update the index of the played machine.

The optimal-form indices proposed by Lai and Robbins often demand long and tiresome calculations (they actually compute a generalized likelihood ratio) and assume the memorization of all the gains collected until the instant t. In addition, the results given by Lai and Robbins are valid only asymptotically. We prefer, in this chapter, to describe suboptimal indices that are very simple to calculate and that guarantee an average cumulative regret smaller than some logarithmic function of time uniformly in time, i.e. whatever the instant t considered. The general form of the indices considered is as follows:

$$B_{k,t,T_k(t)} = X_{k,T_k(t)} + A_{k,t,T_k(t)} \qquad [4.13]$$

where the terms that appear in the above equation are defined as follows:

– Bk,t,Tk(t) is the index associated with machine k after it has been played Tk(t) times up to iteration t. The B indices provide a UCB on the actual mean reward, as they are optimistic estimations of the performance of each of the machines (or arms);

– Xk,Tk(t) is the empirical average reward collected by playing the machine k;

– Ak,t,Tk(t) is an added bias whose role is to overestimate the performance of the machine.

In the rest of this section, the results presented are valid only in the case where the K reward distributions have support bounded in [0, 1].

4.4.4.4. UCB1 algorithm

In the case of UCB1, the bias has the following form:

$$A_{k,t,T_k(t)} = \sqrt{\frac{\alpha \ln(t)}{T_k(t)}} \qquad [4.14]$$

where α is a positive real number. It is possible to prove that the strategy that always chooses the machine with the highest index has a bounded average cumulated regret if α > 1 [AUD 07, AUE 02]:

$$E\left[R^{\pi=UCB1}_t\right] \le \sum_{k:\Delta_k>0} \frac{4\alpha}{\Delta_k}\, \ln(t) \qquad [4.15]$$
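As an illustration, the following Python sketch runs the UCB1 index of equation [4.14] on a few Bernoulli arms; their availability probabilities and the value of α are assumptions, and the arms can be thought of as candidate frequency bands.

```python
# A minimal sketch of the UCB1 index policy of equation [4.14] on Bernoulli
# arms; the number of arms, their availability probabilities, and alpha are
# assumed for illustration (think of each arm as a candidate frequency band).
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.8])                 # assumed availability of each arm
alpha, horizon = 1.5, 5000

n_arms = len(p)
counts = np.zeros(n_arms)                     # T_k(t): times each arm was played
sums = np.zeros(n_arms)                       # cumulated reward per arm

for t in range(1, horizon + 1):
    if t <= n_arms:
        k = t - 1                             # play each arm once to initialize
    else:
        means = sums / counts                 # empirical average X_k
        bias = np.sqrt(alpha * np.log(t) / counts)   # optimistic bias A_k
        k = int(np.argmax(means + bias))      # choose the largest index B_k
    reward = float(rng.random() < p[k])       # Bernoulli draw (channel free?)
    counts[k] += 1
    sums[k] += reward

print("selection frequencies:", np.round(counts / horizon, 3))
```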


4.4.4.5. UCBV algorithm

In the case of UCBV , the bias has the following form:

$$A_{k,t,T_k(t)} = \sqrt{\frac{2\xi\, V_k(t) \ln(t)}{T_k(t)}} + \frac{3\, c\, \xi \ln(t)}{T_k(t)} \qquad [4.16]$$

with c ≥ 1, ξ > 1, and where Vk(t) represents the empirical variance associated with the machine k. It is possible to prove that the strategy that always chooses the machine with the highest index has a bounded average cumulated regret [AUD 07]:

$$E\left[R^{\pi=UCBV}_t\right] \le C_\xi \sum_{k:\Delta_k>0} \left( \frac{\sigma_k^2}{\Delta_k} + 2 \right) \ln(t) \qquad [4.17]$$

where Cξ is a factor that depends on the parameter ξ and σ²k is the variance associated with each machine.

4.4.4.6. Application example: opportunistic spectrum access

During the past century, a significant part of the spectral resources was exclusively dedicated to the many services that appeared year after year. With the seemingly unending increase in the need to allocate frequency bands to emerging wireless applications, the world of radio communications was faced with a shortage of spectral resources. Nevertheless, a recent study showed that in reality these resources were underused (see Figures 1.5 and 1.6). In other words, it is not the abundance of frequency bands which is to be put in question but rather the management of this resource. In order to make the bands that are underused, at a certain moment and in a given place, accessible to other services, the CR community devised the distribution of access rights as follows:

– Equipment or networks that use a frequency band that had been allocated to them are called primary users. They have priority over the band and have all access rights (within the limit of those delegated by the local regulation authority).

– Any equipment or networks that attempt to benefit from a frequency band, momentarily, during the absence of the primary network are called secondary users. They must respect the priority of the primary users.

To avoid disturbing the primary networks in their neighborhood, secondary users need to scan their environment in order to detect the possible activity of a primary user. Based on the collected information, a decision is taken (e.g. spectrum access, change of frequency band, or continue to scan) followed by appropriate actions (e.g. reconfiguration and data transmission). The secondary user device must follow the elementary cognitive cycle (perception–reasoning–action) mentioned earlier. Hence, CR technology naturally responds to these specifications. The framework provided by the MAB is one of the models used to describe the opportunistic spectrum access problem from the point of view of the secondary users. Indeed, the machines in this case are just like the different frequency bands that the secondary user (i.e. the decision maker) would like to access. Time is assumed to be divided into blocks of defined size (the size of a data packet), t = 0, 1, 2, ... For each new block t, the CR equipment chooses a frequency band, senses this band in search of a primary user, and accesses it if the spectrum resource is available. In this case, the equipment transmits a data packet; otherwise the device waits for the next block. In all cases, at the end of the operations performed during a block, the decision engine again determines a frequency band to be explored, and the equipment repeats the aforementioned cycle. After each cycle, the decision-making engine collects a reward that depends on the user needs. This gain could be, for example, channel availability (0 for an occupied channel, 1 for a free channel) or transmission throughput. Hence, during a communication, the CR equipment seeks to maximize the accumulated gains (spectrum accesses or accumulated throughput). As a result, it is possible to use the tools described above (reinforcement learning in general and UCB algorithms in particular) to address this slot machine problem.

The behavior of the UCB1 and UCBV algorithms in the case of 10 channels (i.e. frequency bands or machines) [JOU 10a] is illustrated in Figure 4.3. In this particular example, the cognitive agent is interested in the availability of the probed channels. These curves show the proportion of time spent selecting the channel that is most available on average. It is observed that the two algorithms select more and more often the channel that is, on average, least used by the primary users (and hence the one that maximizes the gains of the secondary cognitive agent). However, their behaviors differ slightly, in that UCBV, which uses the empirical variance in the expression of its indices, seems less efficient, on average, at the beginning of the experiment. This phase can be interpreted as the learning time. Afterward, UCBV exploits better and better the information collected on the different channels. During this second phase, the selection rate of the best channel increases rapidly, which implies that the cognitive agent most often selects the best channel.

4.4.5. Artificial intelligence

The term “artificial intelligence” might cover all the methods discussed so far, in addition to those that we will present afterward. In fact, by artificial intelligence we mean all the methods that try to give learning and decision-making abilities to a system, hence conferring on it an aptitude to “reason” in order to adapt to its environment intelligently. Nevertheless, we have preferred to treat separately a few techniques that we will present later, in a specific context related to a fundamental problem in CR, i.e. DCA (see section 4.6).


Figure 4.3. Selection of the channel most available to the secondary user in the context of opportunistic spectrum access. The decision-making agent uses UCB algorithms. This curve illustrates the convergence of the cognitive agent toward the channel that seems to be, on average, the least used by the primary network. Thus, the secondary user can expect, on average, to maximize the number of spectrum accesses [JOU 10a]

4.5. Decision making and learning from the network perspective: game theory

4.5.1. Active or passive decision

Decision making can be divided into two approaches: passive decision making, which does not result in an action that modifies the environment, and active decision making, which results in a real action on the environment.

In general, passive decisions are made with the objective of choosing an action that does not modify the environment and whose goal is to accelerate learning or the selection of a parameter about which the agent has only incomplete information. Among the choices of passive actions, we mention the example of choosing the next information source to be exploited; translated into a human context, this choice is typically made when we want to collect information on weather conditions (Do I have to turn on the television and wait for the weather report? Do I have to look at the sky and judge by myself? etc.). Knuth [KNU 02] and Knuth et al. [KNU 07] addressed these issues both from the theoretical perspective, for example the mathematical definition of a question and the good choice of the next question to pose [KNU 02], and from a practical perspective, e.g. the design of a robot that recognizes geometric shapes quickly by optimally choosing its next observation. In the framework of CRs, where a large amount of data must sometimes pass through the opportunistic network for its combined learning, it may be important to select the requested information carefully. However, very few studies are known to date in this CR framework.

Among the examples of passive decisions that involve the choice of a parameter on which only incomplete information is available, we mention the context of channel estimation which, coupled with the learning described above, leads to the choice of an estimate Ĥ0 that best approximates the true channel H0. In [COU 09] and [COU 10a], decision methods based on minimum mean square error (MMSE) estimators are developed in the case of partial or complete knowledge I of the temporal coherence and spatial parameters of the channel. The general method is to decide on an error measure and to compute the estimate as a functional of the distribution P(H|I):

$$\hat{H}_0 = \int f(H)\, P(H|I)\, dH \qquad [4.18]$$

In particular, the typical choice of MMSE estimator gives:

$$\hat{H}_0^{(MMSE)} = \int H\, P(H|I)\, dH \qquad [4.19]$$
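The following scalar toy sketch (all parameters assumed) illustrates the MMSE estimator [4.19]: the posterior mean of a channel coefficient given one noisy observation, approximated by weighting samples drawn from the a priori distribution and compared with the closed-form Gaussian answer.

```python
# A scalar toy sketch (assumed parameters) of the MMSE estimator of
# equation [4.19]: the posterior mean of a channel coefficient h, given a
# noisy observation x = h + w, approximated by importance-weighting samples
# drawn from the a priori distribution P(h|I).
import numpy as np

rng = np.random.default_rng(0)
sigma_h, sigma_w = 1.0, 0.5                   # prior and noise std (assumed)
h_true = rng.normal(0.0, sigma_h)             # true channel coefficient
x = h_true + rng.normal(0.0, sigma_w)         # received stimulus

samples = rng.normal(0.0, sigma_h, size=200000)            # draws from P(h|I)
weights = np.exp(-0.5 * ((x - samples) / sigma_w) ** 2)    # likelihood P(x|h,I)
h_mmse_mc = np.sum(weights * samples) / np.sum(weights)    # Monte Carlo posterior mean

# closed form for this Gaussian prior/Gaussian noise case
h_mmse_exact = sigma_h**2 / (sigma_h**2 + sigma_w**2) * x
print(round(h_true, 4), round(h_mmse_mc, 4), round(h_mmse_exact, 4))
```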

In the context of decision making that results in an action from the agent, this action generally affects simultaneously the agent who takes the decision and its environment. This environment is most often composed of other actors who then modify their behavior in response to the action of the first agent. Poor decision making may lead to a radical modification of the environment that will drive the next decision to again modify the environment significantly, eventually resulting in a situation of total instability. This situation typically occurs when several cognitive agents tend to share a resource (a frequency band for transmission, for example) whereas no agent has a pre-established strategy to request the resource; this generally results in the corruption of a lot of exchanged data and a very low average throughput of the communication. On the contrary, when two agents have an established strategy and each player knows the resource-sharing strategy of its competitor, non-cooperative decisions can be made to maximize the collective throughput. We will discuss situations in which numerous agents have a well-established strategy for a given game situation. The next sections of this chapter are dedicated to an introduction to such non-cooperative games in the particular framework of resource sharing. First of all, we introduce the mathematical basics of game theory and the choice of optimal strategies in a multiagent context.

4.5.2. Techniques based on game theory

So far we have studied learning mechanisms for a single agent. In this section, we are interested in possible extensions to several agents. The difficulty of learning in a multiagent context lies in the fact that decision making is now interactive. The reward (also called utility, payment, performance, etc.) of an agent depends not only on its own decision but also on the decisions of the other agents. When each agent observes the actions taken by the other agents in the previous stage, there exist sophisticated techniques such as the Cournot tâtonnement (best response), fictitious play (each agent chooses a best response to the empirical frequency of the actions taken by the others), and best response dynamics. Learning becomes more complicated when these assumptions on observations are not satisfied. We will describe how to obtain approximate values in situations in which no agent is interested in changing its action unilaterally (it cannot do better in terms of immediate reward if the others hold on to their choices). Such a situation is known as a Cournot equilibrium or Nash equilibrium.

Let us consider n agents and denote the set of these agents by N = {1, . . . , n}. Each agent j ∈ N has a set of choices (actions) denoted by Aj. At each time instant t, every agent j chooses an action aj,t ∈ Aj. Agent j receives a reward rj,t = rj(a1,t, . . . , an,t). The collection (N, {Aj}j∈N, {rj}j∈N) is called a normal-form game or strategic-form game. For an agent j, an action bj is a best response to:

$$a_{-j} := (a_1, \ldots, a_{j-1}, a_{j+1}, \ldots, a_n) \in \prod_{j' \ne j} A_{j'}$$

if:

$$b_j \in \arg\max_{b'_j} r_j(a_1, \ldots, a_{j-1}, b'_j, a_{j+1}, \ldots, a_n)$$

Let us denote by BRj(a−j) the set of strategies of agent j that are best responses to the action profile a−j. This set plays a vital role in the concept of the Nash solution. In fact, a Nash equilibrium is characterized as a fixed point of this multivalued map.

4.5.2.1. Cournot’s competition and best response

Cournot's competition is a very simple adjustment process based on the observation of the previous actions of the other agents. Agents can act simultaneously or in turns. At time t, agent j plays the best response to the actions a−j,t−1 taken by the other agents at time t − 1. Therefore, this mechanism requires the knowledge of what the other agents have chosen in the previous step. The convergence of this process is not guaranteed. A typical example that illustrates this divergence phenomenon is given by the fixed points of the logistic map f(x) = r′x(1 − x) for r′ = 4.

4.5.2.2. Fictitious play

At each time instant, each agent chooses an action that is the best response to the empirical average of the actions taken by the others. The stationary distributions of this process are the Nash equilibria of the game. To formalize this, we introduce the empirical frequency of the actions performed by agent j until time t:

$$f_{j,t}(b_j) = \frac{1}{t} \sum_{t'=1}^{t} \mathbb{1}_{\{a_{j,t'} = b_j\}}.$$

This average can be calculated at each step by the following recursive equation:

$$f_{j,t+1}(b_j) = f_{j,t}(b_j) + \frac{1}{t+1}\left( \mathbb{1}_{\{a_{j,t+1} = b_j\}} - f_{j,t}(b_j) \right)$$

This algorithm converges (in frequency) in two-agent zero-sum games (r1 + r2 = 0), in games with common interests, and in potential games. It should be noted that the fictitious play algorithm implicitly assumes that each agent has information on the action played by all the other agents in the previous step. It can then calculate this average (using the recursive equation, for example) while assuming that the others are stationary. This behavior is limited in the sense that the future reward is not taken into account.
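A minimal Python sketch of fictitious play in a two-player zero-sum game (the payoff matrix is an invented matching-pennies example): each agent best-responds to the empirical frequency of the opponent's past actions, updated with the recursive equation above.

```python
# A minimal sketch of fictitious play in a two-player zero-sum game
# (invented matching-pennies payoffs): each agent best-responds to the
# empirical frequency of the opponent's actions, updated recursively.
import numpy as np

r1 = np.array([[1.0, -1.0], [-1.0, 1.0]])     # payoff of agent 1; r2 = -r1 (zero-sum)
r2 = -r1

f1 = np.array([0.5, 0.5])                     # empirical frequency of agent 1's actions
f2 = np.array([0.5, 0.5])                     # empirical frequency of agent 2's actions

for t in range(1, 20001):
    a1 = int(np.argmax(r1 @ f2))              # best response to the opponent's frequencies
    a2 = int(np.argmax(f1 @ r2))
    e1, e2 = np.eye(2)[a1], np.eye(2)[a2]
    f1 += (e1 - f1) / (t + 1)                 # recursive frequency update
    f2 += (e2 - f2) / (t + 1)

# the empirical frequencies approach the mixed Nash equilibrium (0.5, 0.5)
print("empirical frequencies:", np.round(f1, 3), np.round(f2, 3))
```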

4.5.2.3. Reinforcement strategy

The reinforcement learning mechanism gives more weight to the strategies that give good rewards while continuing to explore new actions. Let us consider the case where each agent j has a reference M_j on its reward and a learning rate λ_j > 0, and updates its strategy based on its current perception. Agent j updates its strategy as follows:

\[
x_{j,t+1}(b_j) =
\begin{cases}
x_{j,t}(b_j) + \lambda_j s_{j,t} \displaystyle\sum_{b'_j \neq b_j} x_{j,t}(b'_j) & \text{if } s_{j,t} \geq 0\\[2mm]
x_{j,t}(b_j) + \lambda_j s_{j,t}\, x_{j,t}(b_j) & \text{if } s_{j,t} < 0
\end{cases}
\]
The term
\[
s_{j,t} = \frac{r_{j,t} - M_j}{\sup_a |r_j(a) - M_j|}
\]
represents a reference measurement for agent j.

This algorithm has good convergence properties when the dimensions are small (say two or three). By a change of time scale, its trajectory can be approximated by a system of ordinary differential equations (ODEs). Most convergence proofs for learning mechanisms of this type use stochastic approximation techniques. Once the ODEs are obtained, we study vector fields, phase planes, and basins of attraction in order to deduce stability and instability conditions for the stationary points. This links the learning mechanisms to dynamics that are well known in evolutionary games (replicator dynamics, better response, Brown–von Neumann–Nash, projection dynamics, etc.). Consider the learning mechanism given by:

\[
x_{j,t+1}(b_j) = x_{j,t}(b_j) + \lambda_{j,t}\, r_{j,t} \left(\mathbb{1}_{\{a_{j,t} = b_j\}} - x_{j,t}(b_j)\right), \quad j \in \{1, \ldots, n\},\ b_j \in A_j
\]

The term x_{j,t} is the strategy of agent j at time t. It can be shown that the above learning mechanism converges to a variant of the replicator dynamics given by:

\[
\dot{x}_j(b_j) = x_j(b_j) \left[ r_j(b_j, x_{-j}) - \sum_{a_j \in A_j} x_j(a_j)\, r_j(a_j, x_{-j}) \right], \quad j \in N,\ b_j \in A_j
\]

with r_j(b_j, x_{-j}) := E_{x_{-j}} r_j(b_j, ·). It should be noted that these dynamics do not always converge; in particular, they may exhibit limit cycles and chaotic orbits.
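The following Python sketch simulates the last update rule on an illustrative 2×2 common-interest game. The payoff matrix, the learning rate, and the normalization of rewards to [0, 1] (which keeps the strategies on the probability simplex) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reinforcement update x_{t+1}(b) = x_t(b) + lambda * r_t * (1{a_t=b} - x_t(b))
# on a hypothetical 2x2 coordination game with rewards in [0, 1].

R = [np.array([[1.0, 0.2], [0.2, 0.8]]),     # reward of agent 0
     np.array([[1.0, 0.2], [0.2, 0.8]])]     # reward of agent 1 (common interest)

x = [np.ones(2) / 2, np.ones(2) / 2]         # mixed strategies
lam = 0.05                                    # learning rate

for t in range(5000):
    a = [rng.choice(2, p=x[0]), rng.choice(2, p=x[1])]
    for j in range(2):
        r = R[j][a[0], a[1]]
        e = np.zeros(2)
        e[a[j]] = 1.0
        x[j] += lam * r * (e - x[j])          # convex combination: stays a probability

print("learned strategies:", x[0], x[1])      # typically concentrates on (0,0) or (1,1)
```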

4.5.2.4. Boltzmann–Gibbs and coupled learning

The Boltzmann–Gibbs distribution finds the equilibrium of a perturbed game in which the reward functions are replaced by r_j(x) + ε_j H(x_j). The function H(x_j) = −Σ_{b_j} x_j(b_j) log(x_j(b_j)) is the entropy of the strategy x_j. The term ε_j H(x_j) can be interpreted as a penalty or cost associated with the strategy x_j. When ε_j → 0, the penalty term approaches zero, i.e. agent j exactly maximizes its own reward. Since agent j does not know the rewards of the actions other than the one it has chosen, it estimates the rewards r_j by updating them at each stage:
\[
\begin{cases}
x_{j,t+1} = (1 - \lambda_{j,t})\, x_{j,t} + \lambda_{j,t}\, \beta_j(r_{j,t})\\[1mm]
r_{j,t+1}(b_j) = r_{j,t}(b_j) + \dfrac{\mu_{j,t}}{x_{j,t}(b_j)}\, \mathbb{1}_{\{a_{j,t} = b_j\}} \left(r_{j,t} - r_{j,t}(b_j)\right)
\end{cases}
\quad j \in \{1, \ldots, n\},\ b_j \in A_j
\]

where the component corresponding to action a_j of the vector β_j(r_j) is given by:
\[
\beta_j(r_j)(a_j) = \frac{e^{\frac{1}{\varepsilon_j} r_j(a_j)}}{\sum_{b_j \in A_j} e^{\frac{1}{\varepsilon_j} r_j(b_j)}}
\]

where the parameters λ_j and μ_j are the learning rates of agent j. It can be shown that, under certain assumptions on the choice of the learning rates (λ_{j,t}, μ_{j,t}), the trajectories of this algorithm can be approximated by the solutions of the system of ODEs given by:
\[
\begin{cases}
\dot{x}_j = \beta_j(r_j) - x_j\\[1mm]
\dfrac{d}{dt}\, r_j(b_j) = \mathbb{E}_x r_j - r_j(b_j)
\end{cases}
\quad j \in \{1, \ldots, n\},\ b_j \in A_j
\]

The term E_x r_j represents the expected reward with respect to the strategies of the other agents.
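A minimal single-agent Python sketch of this coupled scheme is given below. The mean rewards of the three hypothetical configurations, the temperature ε, the learning-rate schedules, and the clipping of the estimation step are all assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Coupled Boltzmann-Gibbs learning: the strategy x is pulled toward the Gibbs
# distribution of the estimated rewards, and only the reward of the action
# actually played is re-estimated.

true_reward = np.array([0.3, 0.7, 0.5])   # hypothetical mean rewards of 3 configurations
eps = 0.05                                # temperature epsilon_j
K = len(true_reward)

x = np.ones(K) / K                        # strategy x_j
r_hat = np.zeros(K)                       # estimated rewards


def gibbs(r, eps):
    w = np.exp((r - r.max()) / eps)       # max-shift for numerical stability
    return w / w.sum()


for t in range(1, 3001):
    lam, mu = 1.0 / (t + 1), 1.0 / (t + 1) ** 0.6        # two-timescale learning rates
    a = rng.choice(K, p=x)
    r = true_reward[a] + 0.1 * rng.standard_normal()     # perceived reward r_{j,t}
    r_hat[a] += min(1.0, mu / x[a]) * (r - r_hat[a])     # update of the played action only
    x = (1 - lam) * x + lam * gibbs(r_hat, eps)          # pull toward Boltzmann-Gibbs

print("strategy:", x, "estimates:", r_hat)   # the mass concentrates on action 1
```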

4.5.2.5. Imitation

Learning by coupled reinforcement and imitation is a variant of the coupled learning mechanism that modifies the Boltzmann–Gibbs distribution and adds an imitation component: the actions that give the best performance are used more and more often, with a factor proportional to the probability with which the action is currently selected. The update of the strategy is obtained by replacing the Boltzmann–Gibbs distribution β by σ defined by:

\[
\sigma_j(x_{j,t}, r_{j,t})(b_j) = \frac{x_{j,t}(b_j)\, e^{\frac{1}{\varepsilon_j} r_{j,t}(b_j)}}{\sum_{a_j \in A_j} x_{j,t}(a_j)\, e^{\frac{1}{\varepsilon_j} r_{j,t}(a_j)}}
\]

As for the replicator dynamics, the interior stationary points of the asymptotic dynamics of this learning mechanism are Nash equilibria of the game when the parameters ε_j tend to zero.
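In the previous sketch, this amounts to replacing the Gibbs map by the weighted map below; the three-action example printed for illustration is hypothetical. Note that an action whose probability has reached zero can no longer be revived by pure imitation.

```python
import numpy as np

# Imitation variant: the Gibbs map of the previous sketch is replaced by a
# map weighting each action by the probability with which it is played.

def sigma(x, r_hat, eps):
    w = x * np.exp((r_hat - r_hat.max()) / eps)   # x(b) * exp(r_hat(b) / eps)
    return w / w.sum()


print(sigma(np.array([0.5, 0.3, 0.2]), np.array([0.3, 0.7, 0.5]), 0.05))
```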

4.5.2.6. Learning in stochastic games

Q-learning-type algorithms that depend on the state can, in certain cases, be extended to stochastic games (several interdependent Markov decision processes). For example, in the case of zero-sum games, the term max Q is replaced by minmax Q(s_{t+1}), the value of the game in state s_{t+1}. In non-zero-sum games, max Q can be replaced by Nash(Q(s_{t+1})), i.e. the reward obtained when a Nash equilibrium is played in state s_{t+1}.

It should be noted that computing the reward at a Nash equilibrium at each step makes this algorithm complex; the complexity in the state space S and the action spaces A_j is of exponential order.
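As an illustration of the zero-sum case, the sketch below computes the quantity that replaces max Q, namely the value minmax Q(s_{t+1}) of the matrix game Q(s_{t+1}), by linear programming. The Q matrix and the use of scipy are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

# Value of a zero-sum matrix game (row player maximizes), as used in
# minimax-Q style updates:
#   Q[s][a1, a2] += alpha * (reward + gamma * value(Q[s_next]) - Q[s][a1, a2])

def zero_sum_value(Q):
    """Return (value, maximizing mixed strategy) of the matrix game Q."""
    m, n = Q.shape
    # variables: p_1..p_m (row strategy) and v; maximize v <=> minimize -v
    c = np.concatenate([np.zeros(m), [-1.0]])
    # for every column j:  v - sum_i p_i Q[i, j] <= 0
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)   # sum_i p_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]


Q_next = np.array([[1.0, -1.0], [-1.0, 1.0]])       # illustrative Q(s_{t+1})
v, p = zero_sum_value(Q_next)
print("game value:", v, "maximizing strategy:", p)   # v ≈ 0, p ≈ [0.5, 0.5]
```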

4.6. Brief state of the art: classification of methods for dynamic configuration adaptation

Dynamic configuration adaptation (DCA) [JOU 09, JOU 10b] is a CR problem that arises when the equipment must choose a satisfactory or optimal operating configuration (according to optimization criteria) among K available configurations, in order to meet the three constraints fixing the design space mentioned above.

Analysis of the different approaches suggested in the literature in response to the DCA problem shows that, although all the proposed case studies are based on the same decision space, the decision approaches differ from each other depending on the assumptions made about the a priori knowledge that the cognitive agent has of the functional relationships that tightly connect the three dimensions. This information, which models the functional relationship between equipment parameters, "objective" functions, and environment metrics, can be envisioned as a fourth dimension added to the design space. Together, the four dimensions allow us to identify a tool to solve the defined problem. In fact, if we consider that the designer provides a sufficient set of analytical relations for the cognitive agent to directly infer decision rules, then an expert approach could prove sufficient. On the other hand, if the designer provides no information to the cognitive agent, the latter will have no alternative but to learn on its own by interacting with its environment. Figure 4.4 provides a summary of the classification of approaches detailed henceforth [JOU 10b].

4.6.1. The expert approach

The expert approach starts from the observation that the more complete the a priori knowledge, the better the device can exploit it and react to its environment.
This knowledge is based on the expertise of engineers, acquired at the theoretical level as well as through series of measurements. In his thesis, Mitola [MIT 00a] draws up a list of behavioral rules that are supposed to respond systematically to all the cases that the equipment will encounter during its use.

For this, it is necessary to be able to represent this knowledge in such a way as to exploit it for control, by adequately adapting the operations. Mitola defined, for this purpose, a knowledge representation language called the radio knowledge representation language (RKRL). As a result, the decision process becomes very simple, the complexity being shifted to the expression of knowledge at the design level. This approach can obviously be extended by devising ways to acquire new knowledge during operation.

4.6.2. Exploration-based decision making: genetic algorithms

In the case of CR, and considering that the received information provides a more or less good estimate of reality, an approach has been proposed in which the cognitive agent's decision is based on a genetic algorithm [RIE 04, RON 06].

In a general context, a genetic algorithm needs the notion of an individual (an admissible configuration in our case). The latter is encoded in the form of a chromosome whose genes represent the different adjustable parameters of the radio equipment. The alleles of the chromosome represent particular instances of these parameters. A fitness function evaluates the adequacy of each individual with respect to the environment encountered.

In the case of CR, this of course requires a priori knowledge of the objective functions and the definition of a fitness function that evaluates the overall performance of candidate solutions with respect to all the objectives. At each generation, a selection operation is conducted to favor the individuals whose evaluation is the most promising. Finally, a set of random operations, whose aim is to diversify the selected individuals, produces a new generation from the previous ones. The most common operations are:

– cross two parent chromosomes to obtain two child chromosomes. This crossover makes it possible to share their genetic inheritance;

– random mutation at the level of parent or child chromosomes.

The repetition of this chain of operations can converge to satisfactory solutions with respect to the chosen fitness function.
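To fix ideas, here is a toy Python sketch of this loop. The two configuration parameters (modulation order and transmit power), the fitness model, and all numerical values are hypothetical and serve only to illustrate the selection/crossover/mutation cycle.

```python
import random

# Toy genetic algorithm for configuration adaptation: each chromosome is a
# candidate radio configuration; the fitness is an illustrative stand-in for
# the (a priori known) objective functions.

MOD_ORDERS = [2, 4, 16, 64]          # gene 1: modulation order
POWERS = [1, 5, 10, 20]              # gene 2: transmit power (arbitrary units)


def fitness(conf):
    mod, pwr = conf
    throughput = (mod ** 0.5) * (pwr / (1 + 0.05 * pwr))   # illustrative model
    return throughput - 0.3 * pwr                          # penalize consumption


def crossover(p1, p2):
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def mutate(conf, rate=0.2):
    conf = list(conf)
    if random.random() < rate:
        conf[0] = random.choice(MOD_ORDERS)
    if random.random() < rate:
        conf[1] = random.choice(POWERS)
    return conf


population = [[random.choice(MOD_ORDERS), random.choice(POWERS)] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                    # select the most promising individuals
    children = []
    while len(children) < 10:
        c1, c2 = crossover(*random.sample(parents, 2))
        children += [mutate(c1), mutate(c2)]
    population = parents + children[:10]

print("best configuration:", max(population, key=fitness))
```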

The general model proposed by the team at Virginia Tech is based on two genetic algorithms and an intelligent control system called the cognitive system monitor. The genetic algorithms have the tasks of channel modeling on the one hand, and of determining
particular configurations on the other hand, in order to determine an adequate solution to the problem faced. The purpose of the intelligent control is to coordinate these operations and to implement medium- and long-term learning mechanisms in order to establish new decision rules that would enable the recognition of already analyzed cases.

Figure 4.4. Decision-making methods based on a priori knowledge introduced by the designer [JOU 10b]

Under the assumptions mentioned at the beginning of this section, the approach based on genetic algorithms is particularly promising for addressing CR-related problems: on the one hand, it makes it possible to explore a large solution space, and on the other hand, because a more or less diversified population of individuals is permanently maintained, it can adapt quickly to environmental changes.

Nevertheless, these advantages have a cost:

– The assumptions according to which we know the environment-related analytical models, the various system parameters, as well as the functions to be optimized are not very realistic. Indeed, on the one hand, these models are idealized, and on the other hand, we cannot assume that all the possible situations are already known and modeled.

– The fact that these algorithms manipulate a population of individuals and that many operations are necessary, generation after generation, makes the decision-making system particularly greedy in terms of time and computing capacity (and hence energy). In the context of CR, where these resources are inherently limited, this can become constraining.

– The evolutionary approach is very sensitive to the algorithm parameters (such as the population size, the selection rate, the crossover and mutation rates from one generation to the next, and the choice of stopping criterion), and its success depends on these choices.

4.6.3. Learning approaches: joint exploration and exploitation

Analytical approaches may not capture all the complexity of the phenomena at play. In order to be able to function in more realistic scenarios, it is
possible to use learning-based methods. This is the case for neural networks, evolving connectionist systems, statistical learning of regression models, etc.

Insofar as some of the data or models necessary for decision making are missing, the cognitive agent is obliged to implement a learning process. Among the techniques found in the literature, we can divide these approaches into three categories:

– prediction of the system performance from previously collected data: this approach separates the learning and decision-making phases. A wide range of techniques fall into this category; neural networks and statistical approaches are examples used in the framework of CR. The objective is to extract, from the observation phase, a functional link between the environment, the operational parameters of the equipment, and the system performance in order to infer decision rules. In a second phase, the cognitive agent uses these new rules to try to adapt itself to its environment;

– dividing the environment according to system experience and expert knowledge: this approach, in view of Colson's work, is akin to a clustering technique guided by expert information. Thus, by alternating learning and operation phases, the cognitive agent seeks to enrich its base of decision rules. The suggested general architecture is based on an evolving neural network proposed by Kasabov [KAS 98, KAS 07]; in fact, this is a particular case of evolving connectionist systems. These aim to overcome the weaknesses of conventional neural networks thanks to a faster learning capacity and a more flexible, evolving neural structure;

– monitoring (also control or supervision) with partial information: in this case, the learning and decision-making phases are intermingled. At each iteration, the decision taken makes it possible to collect information on the equipment performance as well as on the environment. This information is immediately incorporated to generate a new decision. The objective is to learn while using the equipment, in order to provide the requested service while ensuring that this service improves over time. This approach relies on notions of prediction and reinforcement learning. A particular case of this class of problems is that of multi-armed bandits ("MAB"), discussed earlier among the methods of reinforcement learning; a small sketch is given after this list.
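The following Python sketch illustrates this joint exploration/exploitation with the UCB1 index over a few hypothetical configurations. The Bernoulli reward probabilities and the horizon are assumptions introduced for the example.

```python
import math
import random

# UCB1 for "monitoring with partial information": at each iteration the agent
# selects one configuration, observes only that configuration's reward, and
# refines its estimate.  The success probabilities are illustrative.

success_prob = [0.2, 0.5, 0.8]        # hypothetical mean rewards of 3 configurations
K = len(success_prob)
counts, means = [0] * K, [0.0] * K

for t in range(1, 2001):
    if t <= K:                         # play each configuration once
        k = t - 1
    else:                              # UCB1 index: estimate + exploration bonus
        k = max(range(K),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
    reward = 1.0 if random.random() < success_prob[k] else 0.0
    counts[k] += 1
    means[k] += (reward - means[k]) / counts[k]

print("selections per configuration:", counts)   # concentrates on the best one
```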

4.7. Conclusion

Learning and decision making are two vast domains for which solid mathematical theories (Bayesian probabilities, game theory, decision theory) exist and make it possible to systematically describe optimal methods of learning as well as optimal choices in terms of decision. However, today we are still far from reproducing the levels of adaptation and intelligence of an animal brain; one of the reasons stems from the complexity of the optimal methods mentioned above when the dimension of the cognitive network becomes too large and the range of possible decisions
broadens. A natural selection of the subset of "sensible decisions", as well as a natural selection of the sufficient parameters contained in each stimulus, is necessary to envision the large-scale development of flexible and fast cognitive methods.

The different methodological tools introduced in this chapter (without being exhaustive) to develop cognitive algorithms (Bayesian probabilities, game theory, and reinforcement learning) offer promising solutions to the problem of decision making and learning for CR, but the question of a global approach still remains open.

