  • Intelligent Approaches for Communication Denial

    SaiDhiraj Amuru

    Dissertation submitted to the Faculty of the

    Virginia Polytechnic Institute and State University

    in partial fulfillment of the requirements for the degree of

    Doctor of Philosophy

    in

    Electrical Engineering

    R. Michael Buehrer, Chair

    Claudio R. C. M. da Silva

    Ravi Tandon

    Dhruv Batra

    Inyoung Kim

    September 21, 2015

    Blacksburg, Virginia

    Keywords: Communication, Denial, Jamming, Learning

    Copyright 2015, SaiDhiraj Amuru

  • Intelligent Approaches for Communication Denial

    SaiDhiraj Amuru

    (ABSTRACT)

    Spectrum supremacy is a vital part of security in the modern era. In the past 50 years, a great

    deal of work has been devoted to designing defenses against attacks from malicious nodes (e.g.,

    anti-jamming), while significantly less work has been devoted to the equally important task of

    designing effective strategies for denying communication between enemy nodes/radios within an

    area (e.g., jamming). Such denial techniques are especially useful in military applications and

    intrusion detection systems where untrusted communication must be stopped. In this dissertation,

we study these offensive attack procedures, collectively termed communication denial. The

    communication denial strategies studied in this dissertation are not only useful in undermining the

    communication between enemy nodes, but also help in analyzing the vulnerabilities of existing

    systems.

A majority of the works that address communication denial assume that knowledge about the

enemy nodes is available a priori. However, recent advances in communication systems create

the potential for dynamic environmental conditions in which it is difficult, and most likely not even

possible, to obtain a priori information regarding the environment and the nodes that are present

    in it. Therefore, it is necessary to have cognitive capabilities that enable the attacker to learn

    the environment and prevent enemy nodes from accessing valuable spectrum, thereby denying

    communication.

In this regard, we ask the following question in this dissertation: “Can an intelligent attacker

learn and adapt to unknown environments in an electronic warfare-type scenario?” Fundamentally

speaking, we explore whether existing machine learning techniques can be used to address

such cognitive scenarios and, if not, what missing pieces will enable an attacker to

achieve spectrum supremacy by denying an enemy the ability to communicate. The first task in

    achieving spectrum supremacy is to identify the signal of interest before it can be attacked. Thus,

we first address signal identification, specifically modulation classification, in practical wireless

environments where the interference is often non-Gaussian. Upon identifying the signal of

interest, the next step is to effectively attack the victim signals in order to deny communication. We

present a rigorous fundamental analysis regarding the attacker’s performance, in terms of achieving

    communication denial, in practical communication settings. Furthermore, we develop intelligent

    approaches for communication denial that employ novel machine learning techniques to attack the

    victim either at the physical layer, the MAC layer, or the network layer. We rigorously investigate

whether or not these learning techniques enable the attacker to approach the fundamental

performance limits achievable when an attacker has complete knowledge of the environment. As a result

    of our work, we debunk several myths about communication denial strategies that were believed

    to be true mainly because incorrect system models were previously considered and thus the wrong

    questions were answered.

  • Dedication

    To my parents (Sudhakar and Sai Sudha), my sister (Sai Deepika), my brother-in-law (Raghu

    Pavan) and my newborn niece.


  • Acknowledgments

    Finally, after several thousand cups of coffee, the time has come.1 I have been waiting for this

moment—to write the Acknowledgements section—for quite some time, even more than my

dissertation. Such is the impact various people have had on me over the years. Although it cannot be

    described in a few words, I will make an attempt.

    I would like to thank God for his blessings and for giving me and my family strength to pass

    through various difficulties and to finish this Ph.D. journey. It would not have been possible to

    finish this dissertation without all the sacrifices my family made. They supported me in every

    possible way all throughout my life and continue making an impact on me every single day. Thanks

    for all the love.

    I am very fortunate to have worked with several people during my Ph.D. Firstly, I owe my deepest

    gratitude to my advisor Dr. Buehrer for giving me an opportunity to be part of his group and for

    making sure I did not get lost at any point during my Ph.D. Thanks for being patient and correcting

    all the mistakes I made during these years and for pushing me to solve challenging problems. His

unique skill in identifying important research problems, attention to detail, and deep knowledge

of any problem have greatly helped in improving my research capabilities. You have always

    been supportive of my work even at times when I doubted myself. Thank you for guiding me

in the right direction not only in my graduate studies, but also for being a person I can

always look up to for life advice. I would be happy if I can put into practice all your teachings, be

    it research or otherwise, and become as dedicated, contented, and as committed as you are.

    Dr. Claudio da Silva was the reason I came to VT. He provided me immense support over the

years, more than just as an advisor, as a friend. His understanding of his students, especially the

difficulties faced by an international student, and his help in settling down and making a head start in

my Ph.D. were the stepping stone for my graduate studies. All our phone discussions, despite us

    being on opposite ends of the country, have motivated me to do better work and aim higher always.

    I will forever be indebted to him for all the encouragement that he provided and for also making

sure from time to time that I was doing well.

Dr. Ravi Tandon is my go-to guru for brainstorming about any research problem. He is full of

energy at any time during the day (and also during the night). He introduced me to information

theory, which, to this day, I am still afraid of. Thanks for patiently explaining the various intricacies in

    1Thanks to the folks at Next Door Bake Shop for pumping caffeine into my body and helping me finish my Ph.D.


a variety of problems that we worked on together. Working with him has helped me learn the ways

of research and has significantly improved my problem-solving skills. All the discussions during

    our coffee breaks have taught me a lot about the academic world. It has been a very memorable

    and pleasant experience to have collaborated with him.

I would like to thank Dr. Batra for his machine learning course, which sowed the seeds for the

    learning components of this dissertation. Thanks for being approachable and for the career advice

    you have provided me. Thanks to Dr. Kim for the Bayesian statistics class and also for providing

    valuable feedback on my research contributions.

    Going to UCLA during the Summer of 2014 was one of the best decisions made during my Ph.D.

    I would like to thank Dr. Mihaela for hosting me at UCLA during this time and for the wonderful

    collaboration from thereon. Thanks for helping me explore the crazy world of learning. Your

perseverance still amazes me. I would be very happy if I can be at least 1% as dedicated and

motivated as you are about venturing into new research fields. Dr. Cem and Dr. Xiao have been

    great teachers during my stay at UCLA. I am glad these teachings have resulted in publications.

    I worked with Dr. Harpreet during the final stages of my Ph.D. I am very glad he chose to join

VT. My association with him dates back to the days when I first joined Dr. Buehrer’s group.

His advice as a former student of Dr. Buehrer was very helpful in succeeding in my Ph.D. Working

    with him on stochastic geometry-related problems and also successfully organizing the W@VT

    seminars has been a great learning experience.

    Dr. Gautham, thanks for helping me hold on to the ropes during my first year at VT. You have

    been an awesome mentor and a great friend over the years and have always provided me the right

    suggestions while taking critical decisions at various times over the last four years.

    The rowdy bunch at the Wireless @ VT lab - Daniel, Matt, Kevin, Hilda, Chris Headley, Chris

    Phelps, Reza, Javier, Joe, and Mahi have made this journey very special. I am glad to have been

    part of all the fun, pranks, fantasy leagues, lunches, coffee breaks, game nights etc. The “wall”

    will never be forgotten nor will the hardships you guys gave me for being the student chair of

W@VT. Daniel, Marc Lichtman and Jeff Poston, thanks for proofreading my manuscripts several

    times. Nancy, thanks for making the lab a wonderful habitat, a place where I spent most of my

    time during the last four years. Being the student chair and organizing seminars has truly helped

    me appreciate the work Nancy and Hilda do for W@VT.

    All the members of Shawnee Theatre - Sriram, Viru, Deepak, Sarvesh, KC, Prasad and Aproov,

    thank you for helping me stay sane during my Ph.D. journey and for all the awesome fun, dancing,

cooking sessions we enjoyed together. Thanks to Varuni, Vishwas and Aditya for the various

stimulating and intellectual discussions. Sriram, Varuni, Himanshu, Deepak, and Emily have helped

    me stay healthy over the years with their mouth-watering dishes. Karteek and Santhosh, my friends

    from IIT, thanks for being just a call away and for talking to me in times of stress. Lakshman, Avik

    Dayal, Avik Sengupta, Mehrnaz, and several others, I am thankful for all the great times we shared.


  • Contents

1 Introduction
  1.1 Contributions
    1.1.1 Modulation classification
    1.1.2 Optimal jamming in AWGN channels
    1.1.3 Jamming in fading channels
    1.1.4 Jamming Bandits
    1.1.5 MAC-layer jamming
    1.1.6 Blind network interdiction
    1.1.7 Jamming against wireless networks
    1.1.8 List of relevant publications

2 Background
  2.1 Reinforcement Learning and Markov Decision Processes
  2.2 Multi-armed Bandits

3 A Blind Pre-Processor for Modulation Classification Applications in Frequency-Selective Non-Gaussian Channels
  3.1 Introduction
  3.2 System Model
  3.3 Blind Gibbs Sampling-Based Pre-Processing Stage
    3.3.1 Short introduction to Gibbs sampling
    3.3.2 Superconstellation
    3.3.3 Prior pdfs for the unknown parameters
    3.3.4 Marginal posterior pdfs
    3.3.5 Summary and a note on complexity
  3.4 Modulation Classification
  3.5 Numerical Results
    3.5.1 Pre-processing stage
    3.5.2 Numerical classifier
    3.5.3 Approximated likelihood classifier
    3.5.4 Carrier frequency offset
  3.6 Conclusion

4 Optimal Jamming against Digital Modulation
  4.1 Introduction
  4.2 System Model
    4.2.1 Motivation
  4.3 Perfect Channel Knowledge
    4.3.1 Optimum Jamming Signal Distribution
    4.3.2 Analysis against M-QAM victim signals
  4.4 Factors that mitigate jamming
    4.4.1 Non-Coherent Jamming
    4.4.2 Symbol Timing Offset
    4.4.3 Signal Level Mismatch
  4.5 Jamming an OFDM Signal
  4.6 The Case of Multiple Jammers
    4.6.1 Results
  4.7 Conclusion

5 On Jammer Power Allocation Against OFDM Signals in Fading Channels
  5.1 Introduction
  5.2 System Model
    5.2.1 Assumptions
  5.3 Jamming Strategies in Fading Channels
    5.3.1 Optimal power allocation
    5.3.2 Other power allocation strategies
    5.3.3 Approximately optimal jamming power allocation
    5.3.4 Summary
  5.4 Numerical Results
    5.4.1 Power allocation
    5.4.2 Jamming performance
    5.4.3 Approximately optimal jamming solution performance
    5.4.4 Factors that affect jamming
  5.5 Conclusion

6 Jamming Bandits - A Novel Learning Method for Optimal Jamming
  6.1 Introduction
  6.2 System Model
  6.3 Jamming against a Static Transmitter-Receiver Pair
    6.3.1 Set of actions for the jammer
    6.3.2 MAB formulation
    6.3.3 Proposed Algorithm
    6.3.4 Upper bound on the regret
    6.3.5 High confidence bounds
    6.3.6 Improving convergence via arm elimination
  6.4 Learning Jamming Strategies against a Time-Varying User
    6.4.1 Upper bound on the regret
  6.5 Numerical Results
    6.5.1 Fixed user strategy
    6.5.2 Jamming performance against an adaptive victim
    6.5.3 Multiple victims
    6.5.4 A note on the assumptions
  6.6 Conclusion

7 Optimal Jamming using Delayed Learning
  7.1 Introduction
  7.2 Learning Framework
    7.2.1 Delayed Learning
    7.2.2 A Novel Delayed Learning Framework with Transition-based Rewards
  7.3 Jamming via Delayed Learning
    7.3.1 Protocol Description
    7.3.2 Jamming Strategies
    7.3.3 Feedback Signals
  7.4 Numerical Results
    7.4.1 Learning the optimal policy: MDP model and ρ known
    7.4.2 Intuition about the optimal policy
    7.4.3 Learning ρ and the optimal policy: MDP model known
    7.4.4 Learning the MDP model, ρ and the optimal policy
  7.5 Conclusion

8 Blind Network Interdiction Strategies - A Learning Approach
  8.1 Introduction
  8.2 System Model and Problem Formulation
    8.2.1 Victim Network Model
    8.2.2 Flow model
    8.2.3 Attack Model
  8.3 Single-Node Attack – Strategies and Analysis
    8.3.1 Benchmark Strategies (when the attacker has topology knowledge)
    8.3.2 Blind strategies (when the attacker does not have topology knowledge)
    8.3.3 Random flows
    8.3.4 Notes on attack performance
    8.3.5 Learning rates in blind scenarios
  8.4 Results - Single Node Attack Scenario
    8.4.1 Fixed flows
    8.4.2 Random flows
  8.5 Multiple Node Attack Strategies
    8.5.1 Single attacker
    8.5.2 Multiple attackers
  8.6 Conclusion

9 On Jamming Attacks against Wireless Networks
  9.1 Introduction
  9.2 System Model
  9.3 Outage probability of the victim receiver
  9.4 Error Probability
    9.4.1 PEP derivation
    9.4.2 Gaussian-Hermite quadrature approximation
    9.4.3 ASEP Evaluation
  9.5 Results
    9.5.1 Outage Probability
    9.5.2 Error Probability
    9.5.3 Limitations and Future Work
  9.6 Conclusion

10 Conclusions

Bibliography

  • List of Figures

3.1 Realization of the samples obtained in the estimation process of h1, h2, τ, and λ2. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each Gibbs sampling iteration.

3.2 Correlation among the samples of σ1² (square) and among the samples of the real part of h1 (circle) after the burn-in period.

3.3 Realization of the samples obtained in the estimation process of τ for different resolution factor (OS) values. The dotted line represents the true value of τ. OS is set to 30 (dashed), 50 (dash-dot) and 100 (bold).

3.4 Average normalized variance of the error in the estimation of σ1² and σ2². Number of observed symbols equal to 100 (circle), 300 (square), and 500 (diamond). The average normalized variance in the estimate X̂ of X is defined as Var[X − X̂]/X.

3.5 Normalized MSE in the estimation of σ1² and σ2² when the modulation scheme of the received symbols is known and is either BPSK, QPSK, 8 PSK, 16 QAM, 32 QAM, or 64 QAM. The average normalized MSE in the estimate X̂ of X is defined as E[(X − X̂)²]/X.

3.6 Realization of the samples obtained in the estimation process of the real part of h1, h2, and h3. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each iteration.

3.7 Realization of the samples obtained in the estimation process of σ1², σ2², and σ3². The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each iteration.

3.8 Probability of correct classification of the numerical classifier for different numbers of observed symbols (750, 1000, and 1250). Clairvoyant classifier uses 1250 symbols. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 3.

3.9 Probability of correct classification of the numerical classifier for different values of cth. Number of observed symbols: 1250. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 2.

3.10 Probability of correct classification of the numerical classifier for the case when the values of L or N are over- or under-estimated. The correct values of L and N are 2. Number of observed symbols: 750. Set of possible modulation schemes: BPSK and QPSK.

3.11 Probability of correct classification of the approximated likelihood classifier for different numbers of symbols used by the pre-processing stage (300 or 500) and by the classifier (K=750 or K=1000). L = 3.

3.12 Probability of correct classification of the approximated likelihood classifier for different numbers of symbols used by the pre-processing stage (500 or 1000) and a fixed number of symbols used for classification (K=1000). Set of possible modulation schemes: BPSK, QPSK, 8 PSK, 16 QAM, 32 QAM, and 64 QAM. L = 3.

3.13 Probability of correct classification of the approximated likelihood classifier for different values of β. Number of symbols used for both estimation and classification is equal to 1000. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 2.

3.14 Realization of the samples obtained in the estimation process of τ and δf for the case when training data is available for parameter estimation. Length of the training sequence is 50 symbols. Carrier frequency offset is 0.0045. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each Gibbs sampling iteration.

3.15 Probability of correct classification of the approximated likelihood classifier for the case when the received symbols suffer phase rotation due to carrier frequency offset. Two carrier frequency offset values are considered (0.0045 and 0.01). Number of symbols used for estimation and classification is equal to 50 and K=300, respectively. Set of possible modulation schemes: BPSK and QPSK. L = 3.

4.1 Comparison of various jamming techniques against a 16-QAM modulated victim signal, JNR = 10 dB.

4.2 Comparison of jamming techniques against a 16-QAM victim signal in a non-coherent (random phase offset) scenario, JNR = 10 dB.

4.3 Comparison of jamming techniques against a 16-QAM victim signal in the presence of timing synchronization errors, JNR = 10 dB.

4.4 Comparison of jamming techniques against a 16-QAM victim signal in the presence of signal level mismatch, JNR = 10 dB.

4.5 Comparison of jamming techniques against an OFDM-modulated 16-QAM victim signal, JNR = 10 dB.

4.6 Comparison of jamming techniques against an OFDM-modulated 16-QAM victim signal in the presence of a frequency offset, JNR = 10 dB.

4.7 Comparison of jamming techniques when multiple jammers attack a single 16-QAM modulated victim signal, JNR = 10 dB.

5.1 Power allocations for an AWGN jammer against an OFDM-based 16-QAM victim signal, JNR = 10 dB, SNR = 15 dB; 52 out of Nsc = 64 subcarriers are shown. The solid lines indicate the channel power levels across the OFDM subcarriers. The optimal power allocation obtained by solving (5.6) is seen to be different from channel inversion, water-filling and capacity minimization-based power allocations.

5.2 Performance comparison of the various power allocation strategies when a pulsed AWGN jamming signal is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB.

5.3 Performance comparison of the various power allocation strategies when a pulsed QPSK jamming signal is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB.

5.4 Performance comparison of the approximately optimal jamming power allocation in (5.14) with the optimal, water-filling and channel inversion-based power allocations; pulsed AWGN jamming is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB. The approximately optimal power allocation strategy (diamond marker) performs nearly as well as the optimal power allocation strategy (triangle marker).

5.5 Jamming performance against an OFDM-based 16-QAM modulated victim signal with erroneous channel knowledge, JNR = 10 dB.

5.6 Jamming performance against an OFDM-based 16-QAM modulated victim signal in the presence of a frequency offset, JNR = 10 dB.

5.7 Jamming performance when the jammer is uncertain about the victim’s modulation scheme, JNR = 10 dB.

5.8 Jamming performance when the jammer is uncertain about the victim’s modulation scheme and when the victim’s channel {hk}, k = 1, . . . , Nsc, is not compensated prior to transmission, JNR = 10 dB.

5.9 Empirical KL divergence measure between a QPSK modulated jamming signal in the presence of a carrier frequency offset ε (normalized value) and an AWGN jamming signal.

6.1 An illustration of learning in one round of JB. It is possible that the optimal strategy denoted by {J*, JNR*, ρ*} lies outside the set of discretized strategies. In such a case the jammer learns the best discretized strategy, but based on the value of the discretization parameter M, the loss incurred by using this strategy with respect to the optimal strategy can be bounded using the Hölder continuity condition. The value of the discretization M is shown in the figure and Alg. 6.1.

6.2 Using Theorems 6.3 and 6.5 in a real-time jamming environment.

6.3 Instantaneous SER achieved by the JB algorithm when JNR = 10 dB, SNR = 20 dB and the victim uses BPSK.

6.4 Average SER achieved by the jammer when JNR = 10 dB, SNR = 20 dB and the victim uses BPSK. The jammer learns to use BPSK with ρ = 0.078 using JB. The learning performance of the ε-greedy learning algorithm with various discretization factors M is also shown.

6.5 Learning the optimal jamming strategy when JNR = 10 dB, SNR = 20 dB and the victim uses the QPSK modulation scheme. The jammer learns to use the QPSK signaling scheme with ρ = 0.087.

6.6 Average SER achieved by the jammer when JNR = 10 dB, SNR = 20 dB, the victim uses BPSK and there is a phase offset between the two signals. The jammer learns to use BPSK with ρ = 0.051 using JB. The learning performance of the ε-greedy learning algorithm with various discretization factors M is also shown.

6.7 Average PER inflicted by the jammer at the victim receiver, SNR = 20 dB, victim uses BPSK and JNR = 10 dB. The jammer learns to use the BPSK signaling scheme with ρ = 0.23.

6.8 Average reward obtained by the jammer against a BPSK modulated victim, SNR = 20 dB. The optimal reward is obtained via grid search with discretization M = 100.

6.9 Confidence level (optimal reward minus achieved reward) predicted by Theorem 6.3 and that achieved by JB.

6.10 Learning the jamming strategies by using arm elimination. The victim uses BPSK with SNR = 20 dB. The jammer learned to use BPSK with JNR = 15 dB and ρ = 0.22.

6.11 Learning the jammer’s strategy against a stochastic user. The victim transmitter-receiver pair uses a uniformly random signaling scheme that belongs to the set {BPSK, QPSK} and a random power level in the range [0, 20] dB.

6.12 Learning against a victim with time-varying strategies. The figure shows the power level adaptation by the jammer and that used by the victim.

6.13 Learning against a victim with time-varying strategies. The figure shows the power level adaptation by the jammer using a drifting algorithm and that used by the victim.

6.14 PER achieved by the jammer against 2 users; user 1 uses BPSK at 15 dB and user 2 sends BPSK at 5 dB. The jammer learns to use a BPSK signal with power 13 dB and ρ = 0.46.

6.15 PER achieved by the jammer against 2 users; user 1 sends QPSK at 5 dB and user 2 sends BPSK at 15 dB. The jammer learns to use a BPSK signal with power 11.25 dB and ρ = 0.25.

6.16 PER achieved by the jammer against 2 stochastic users in the network. Both users employ the BPSK signaling scheme. The jammer learns to use the BPSK signaling scheme to achieve power-efficient jamming strategies and also tracks the changes in the users’ strategies.

7.1 MDP model of the 802.11-type wireless network with the RTS-CTS protocol. The state transitions indicate the effect of a jamming attack on the wireless network.

7.2 Rewards obtained in various scenarios, ρ = 0.3. The rewards obtained with instantaneous knowledge are on average better than the rewards obtained in the delayed knowledge scenarios.

7.3 Optimal jamming policies as a function of the ratio of the throughput cost to the energy cost; ρ = 0.5. The colors represent the various optimal jamming policies.

7.4 Rewards obtained when the jammer is uncertain about the underlying MDP model and ρ and learns them by interacting with the environment; ρ = 0.5.

8.1 Betweenness metrics for nodes in network (a) 112[0, 6, 8, 0, 0] and network (b) 112[0, 6, 0, 0, 0].

8.2 Network attack performance against fixed flows in a star network, number of nodes = 50.

8.3 Network attack performance against an Erdös-Rényi random network, connection probability (p) = 0.8, number of nodes = 50. The average number of flows stopped in one network instantiation of the ER network is shown.

8.4 Network attack performance against fixed flows in an ER network, p = 0.8, number of nodes = 50.

8.5 Network attack performance against fixed flows in a BA network, connection degree = 5, number of nodes = 50.

8.6 PPP-based network model, with nearest neighbor connections. The red dots indicate the various network nodes and the blue lines indicate the network connections.

8.7 Network attack performance against fixed flows in a PPP-based network, number of nearest neighbor connections = 5, number of nodes = 50.

8.8 Network attack performance against random flows in a star network, number of nodes = 50.

8.9 Network attack performance against random flows in an ER random network, p = 0.8, number of nodes = 50.

8.10 Network attack performance against random flows in a BA network, number of nodes = 50, connection degree = 5.

8.11 Network attack performance against fixed flows in an ER network, with 25 nodes and p = 0.8, when two nodes can be attacked simultaneously by the attackers.

8.12 Attack performance by exploiting the similarity in a network modeled using a Poisson point process. L(G) = 5.

8.13 An example network attacked by two attackers, with each capable of attacking only a subset of nodes.

9.1 [System Model] The cross marks indicate the BS/APs in the wireless network that are distributed according to a PPP. The Voronoi tessellation indicates the coverage regions of the BS/APs. The square indicates the victim receiver, which is at the origin. The black arrow indicates the link between the closest BS and the victim receiver. The triangles indicate the jammers that are distributed according to a BPP within the black-dotted region of radius RJ.

9.2 [Effect of NJ]: Outage probability of the victim receiver as a function of the number of jammers NJ in the network. p = 0.01, PT/PJ = 0 dB. The solid lines indicate the outage probability obtained via Monte Carlo simulations and the markers indicate the theoretical outage probability evaluated using (9.3).

9.3 [Effect of NJc]: Outage probability of the victim receiver as a function of the number of jammers per cell (or per BS) NJc in the network. p = 0.01, NJ = 4, PT/PJ = 0 dB. The solid lines indicate the outage probability obtained via Monte Carlo simulations and the markers indicate the theoretical outage probability evaluated using (9.3).

9.4 [Effect of p]: Outage probability of the victim receiver as a function of the activity factor p. NJ = 4, NJc = 1, PT/PJ = 0 dB. The solid lines indicate the outage probability obtained using Monte Carlo simulations and the markers indicate the theoretical outage probability expression evaluated using (9.3).

9.5 [GHQ Approximation]: The accuracy of the Gaussian-Hermite quadrature approximation in evaluating the outage probability as a function of the number of terms N used in (9.12). The dotted line is the outage probability evaluated using (9.3). The marked lines indicate the outage probability evaluated using (9.12) for various values of N. p = 0.01, NJc = 1, PT/PJ = 0 dB.

9.6 [Effect of activity factor p]: Number of jammers NJ* required to cause a 90% probability of outage in the wireless network, as a function of the activity factor (network load) p. PT/PJ = 0 dB.

9.7 [Effect of λT]: Number of jammers NJ* required in a BPP to cause a 90% probability of outage in the wireless network, as a function of λT, p = 0.1.

9.8 [Effect of Shadowing]: Number of jammers NJ* required in a BPP to cause a 90% probability of outage in the wireless network, as a function of σχ and p = 0.01.

9.9 [Effect of Retransmissions]: The steady state activity factor (ps) as a function of the number of retransmissions (D). The initial activity factor is taken to be p = 0.01. The SIR threshold θ = 0 dB.

9.10 [Effect of Retransmissions]: The steady state packet drop probability (δ) as a function of the number of retransmissions (D). The initial activity factor is taken to be p = 0.01. The SIR threshold θ = 0 dB.

9.11 The accuracy of the Gaussian-Hermite quadrature approximation for error probability evaluation as a function of the number of terms N used in the approximation. The zoomed-in plot shows a part of the overall figure and indicates that N = 10 terms very closely matches the true value without any approximation.

9.12 [Effect of Activity Factor]: Average symbol error rate as a function of the activity factor p when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation. NJ = 4, NJc = 1, JNR = 100 dB. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.13 [Effect of Number of Jammers]: Average symbol error rate as a function of the number of jammers when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation. NJc = 1, p = 0.01. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.14 [Effect of NJc]: Average symbol error rate when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation, as a function of the number of jammers per cell (BS). The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.15 [Effect of shadowing]: Average symbol error rate as a function of the shadowing power level when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.16 [Effect of the jamming signaling scheme]: Average symbol error rate as a function of p when the victim receiver uses BPSK modulation and different jamming signals are used by the jammer network. NJ = 4, NJc = 1. It is seen that in all cases, the jamming performance of the three jamming signals is the same.

9.17 [No Fading Scenario]: Average symbol error rate when the victim receiver uses BPSK modulation and different jamming signals are used by the jammer network, NJ = 4, NJc = 1, p = 0.01. In all cases it is seen that BPSK jamming outperforms QPSK and AWGN jamming signaling schemes.

9.18 The symbol error probability of the victim receiver when the jammer interference is approximated as Gaussian with variance denoted by (9.30).

  • List of Tables

4.1 Optimal jamming signal level distribution against a 16-QAM victim signal, JNR = 10 dB. a1, a2 indicate the absolute values of the real and imaginary parts of the jamming signal.

4.2 Optimal jamming signals in a coherent scenario.

4.3 Optimal non-coherent jamming signal level distribution against a 16-QAM victim signal, JNR = 10 dB. a1, a2 indicate the absolute values of the real and imaginary parts of the jamming signal.

5.1 Optimal jamming strategies versus jammer knowledge.

6.1 Comparison between related bandit works.

6.2 Notations used.

7.1 MDP model state transitions.

7.2 Optimal Jamming Policies via Delayed Learning, E = −10, T = −100.

9.1 Notations used.

  • Chapter 1

    Introduction

    Wireless connectivity has now become ubiquitous and an integral part of our everyday lives. It

    is now more of a necessity than a luxury. With the advent of new technological capabilities, the

    demand for wireless spectrum is ever-increasing. However, the inherent openness of the wireless

medium makes it susceptible to both intentional and unintentional interference. Interference from

neighboring communicating devices is one of the major causes of unintentional interference. On

    the other hand, intentional interference corresponds to adversarial attacks on a victim receiver.

    Therefore, ensuring the security and privacy of every device, in order to avoid data breaches and

any type of attack, is of utmost importance. Security need not be only defensive, such as

cryptographic or information-theoretic security that evades attacks; it can also be offensive

on an as-needed basis. In this dissertation, we focus on the offensive techniques that help to ensure the

security of the various devices. Such security-related studies not only allow for the analysis of

system vulnerabilities but also enable undermining an enemy system’s capabilities.

    The rapid rise in the technological advancements in the field of Artificial Intelligence and Machine

    Learning can potentially enable every device (regardless of wired or wireless) to possess some sort

    of intelligence that allows for real time operation and adaptation [1]-[6]. If such capabilities exist

    with the malicious nodes,1 then it is a threat to the security of the various devices that co-exist in

    the same environment. It is thus imperative that devices be intelligent and predict the next move by

    the adversary so as to limit the effectiveness of attacks. Therefore, spectrum supremacy, or in other

    words, ensuring unimpeded access to spectrum while denying it to adversaries and thereby having

    control of the spectrum, is a vital part of security in the modern era. Throughout this dissertation,

    we refer to the offensive techniques that help to gain control over the spectrum as communication

    denial. Communication denial, for instance, is vital for military applications (popularly referred to

    as electronic warfare) [7] where the military devices must have un-interrupted access to spectrum

    resources to cater to mission critical applications. It is also useful in commercial applications

    where malicious sensor nodes must be stopped from eavesdropping, for instance during a private

1In this dissertation, the terms malicious nodes, adversarial nodes, enemy nodes and victim nodes are used interchangeably.


    meeting.

Communication denial has mainly been studied by using either optimization, game-theoretic, or

information-theoretic principles. The major disadvantage of these studies is that they assume a lot

of a priori information about the communication strategies used by the enemy nodes, environmental

conditions (such as the fading channel or spectrum occupancy), etc., which may not be available in

practical scenarios. Therefore, the major point of departure for this dissertation from the previous work

is the realization that the recent advances in communication systems create the potential for

dynamic environmental conditions. Under such scenarios, more often than not, it is difficult and most

    likely not even possible to obtain a priori information regarding the environment and the nodes that

    are present in it. Therefore, it is necessary to have cognitive capabilities that enable nodes to learn

    the environment and prevent the enemy nodes from accessing the spectrum and thereby denying

    communication.

In this dissertation, we address several unsolved fundamental problems in the area of

communication denial. In particular, this dissertation considers scenarios where several nodes are

attempting to communicate in a secure or sensitive area, and one or more secure nodes wish to

prevent that communication, i.e., deny the nodes from communicating. Broadly, we ask the

following question in this dissertation: “Can an intelligent attacker learn and adapt to an unknown

environment in an electronic warfare-type scenario?” We answer this question in several stages by

fundamentally analyzing the performance of an attacker in various communication settings. We

assume that the attacker has already identified that one or more devices are malicious, for instance by

    using device fingerprinting techniques [8]. In this dissertation, we focus on intelligent approaches

    for communication denial of the malicious node once it has been identified.

    1.1 Contributions

    Chapter 2 provides a short background on the learning theory concepts used in this dissertation.

Chapters 3-9 describe the major contributions of this dissertation, namely victim signal

identification and attack strategies at various open systems interconnection (OSI) model layers. Specifically,

    Chapters 3-6 and 9 discuss attacks at the physical layer, Chapter 7 discusses attacks at the MAC

    layer and Chapter 8 addresses attacks at the network layer. Conclusions and future directions are

    presented in Chapter 10. The major contributions of this dissertation are briefly described below.

    1.1.1 Modulation classification

As mentioned earlier, the first task in effectively attacking a malicious node is to identify its

signaling strategy. In Chapter 3, we present a novel signal identification technique, specifically a

    modulation classification algorithm to identify the modulation scheme used by the victim for its

communication. While modulation classification has been studied extensively (see [9]-[12] and


references therein), unfortunately none of the previous works consider practical, realistic

environments where the interference is often non-Gaussian and the attacker is not aware of the timing of

the victim’s signal. Further, the difficulty in performing modulation classification is due primarily

to the fact that classifiers operate with no or incomplete knowledge of the fading experienced

    by the signal and the distribution of the noise added in the channel. This is because a receiver

    typically has to first classify the received signal before it can successfully acquire symbol timing

    and estimate the channel state. As a result, the impractical assumption that the received signal is

    acquired and equalized by the radio front-end before classification is often made in the design of

    modulation classification algorithms [13], [14].

    In this chapter, we present and analyze a pre-processor that allows for the reliable classification

    of digital amplitude-phase modulated signals (ASK, PSK, and QAM) when the receiver has no

    knowledge of the timing (symbol transition epochs) of the received signal, the noise added in

the channel is non-Gaussian, and the unknown fading experienced by the signal is

frequency-selective. We assume that the additive noise is non-Gaussian because various studies have shown

    that most radio channels experience both man-made and natural noise, and that the combined noise

    is impulsive. This also accounts for non-Gaussian interference that is often experienced in practical

    wireless environments [9]. We propose a Bayesian pre-processing stage that estimates the various

    signal parameters and reliably identifies the signal of interest. The numerical results demonstrate

    that, by using the proposed pre-processor, modulation classification algorithms can perform well

    compared to clairvoyant classifiers assumed to be symbol synchronous with the received signal and

    to have perfect knowledge of the channel state and noise distribution. An extension of the proposed

    pre-processor for the case when the received symbols suffer phase rotation due to the presence of

    a residual carrier frequency offset is also considered. More details are given in Chapter 3.
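
To make the flavor of the Gibbs-sampling pre-processor concrete, the toy sketch below alternates between sampling per-symbol noise labels and a channel gain from their conditional posteriors. It is our illustration under simplifying assumptions (known BPSK pilot symbols and known two-component Gaussian-mixture noise parameters), not code from the dissertation, where the pre-processor is blind and also estimates timing, multipath taps, and the noise parameters themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative parameters): y_n = h * s_n + w_n, where w_n is
# impulsive noise drawn from a two-component Gaussian mixture.
N = 500
h_true = 0.8                      # unknown channel gain to estimate
s = rng.choice([-1.0, 1.0], N)    # BPSK symbols, assumed known here
eps = 0.1                         # probability of an impulsive sample
var1, var2 = 0.1, 4.0             # background / impulsive noise variances
z_true = rng.random(N) < eps
y = h_true * s + rng.normal(0.0, np.sqrt(np.where(z_true, var2, var1)))

# Gibbs sampler: alternate between the noise labels z and the gain h.
var_h = 10.0                      # Gaussian prior variance on h
h, samples = 0.0, []
for it in range(2000):
    # 1) Sample each label z_n given h: p(z_n = k) is proportional to
    #    pi_k * N(y_n; h * s_n, var_k).
    r = y - h * s
    p1 = (1 - eps) / np.sqrt(var1) * np.exp(-r**2 / (2 * var1))
    p2 = eps / np.sqrt(var2) * np.exp(-r**2 / (2 * var2))
    z = rng.random(N) < p2 / (p1 + p2)    # True -> impulsive component
    var_n = np.where(z, var2, var1)
    # 2) Sample h given z: conjugate Gaussian posterior.
    prec = 1.0 / var_h + np.sum(s**2 / var_n)
    mean = np.sum(s * y / var_n) / prec
    h = rng.normal(mean, 1.0 / np.sqrt(prec))
    samples.append(h)

print("posterior mean of h:", np.mean(samples[500:]), "(true:", h_true, ")")
```

Discarding the first few hundred iterations as burn-in (cf. the sampling-path figures of Chapter 3) and averaging the remaining samples approximates the posterior mean of the gain; the same alternating-conditionals idea extends to the timing, channel, and noise parameters estimated by the actual pre-processor.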

    1.1.2 Optimal jamming in AWGN channels

Once the victim’s signaling scheme is identified, the next task for the attacker is to efficiently

attack it using all the available information. In Chapter 4, we study attacks from a physical layer

    perspective. More specifically, we study jamming attacks against practical wireless signals, namely

    digital amplitude-phase modulated signals. Jamming has traditionally been studied in the context

    of spread spectrum communications [15]. Barrage jamming, partial-band/narrow-band jamming,

    tone-jamming (where a victim is attacked by sending either a single or multiple jamming tones) and

pulsed jamming are the most common types of jamming models considered in wireless

communication systems. Deviating from these traditional, simplistic techniques, we want to know: “What

is the optimum statistical distribution for power-constrained jamming signals in order to maximize

    the error probability of digital amplitude-phase modulated constellations?” This work answers

    a question that is more relevant to practical wireless communication systems when compared to

    similar questions studied in the past, and consequently offers different solutions mainly because

    incorrect system models were previously considered and thus the wrong questions were answered.

As a result of the analysis in this chapter, we show that modulation-based pulsed jamming


signals are optimal in both coherent and non-coherent (phase-asynchronous) scenarios against digital

    amplitude-phase modulated signals. As opposed to the common belief that matching the victim

    signal (correlated jamming) increases confusion at the victim receiver, our analysis shows that

    the optimal jamming signals match standard modulation formats only in a certain range of signal

and jamming powers. Beyond this range, either binary or quaternary pulsed jamming is the

optimal jamming signal. An interesting relationship between these optimal jamming signals and the

    well-known pulse jamming signals discussed in the context of spread spectrum communications

is illustrated. The performance of these optimal jamming signals is shown to be degraded when

the victim and the jamming signals are not phase or time synchronous, or when the jammer does not have

perfect knowledge of the power levels of the victim and the jamming signals, although the

optimal jamming signal distributions do not change. In this chapter, we also study jamming against

    OFDM-based victim signaling schemes and the effects of multiple jammers against the victim.

    More details are presented in Chapter 4.
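As a rough illustration of why the jamming waveform's distribution matters, the following Monte Carlo sketch compares Gaussian (barrage-like) jamming with a BPSK-modulated pulsed jammer of equal average power against a coherent QPSK victim. The powers, noise level, and duty cycle below are invented for illustration and are not the optimal values derived in Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def ser_qpsk(jammer, noise_var=0.05):
    """Symbol error rate of a coherent QPSK victim under a jamming waveform."""
    tx = qpsk[rng.integers(4, size=N)]
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
    y = tx + jammer + noise
    det = qpsk[np.argmin(np.abs(y[:, None] - qpsk[None, :]) ** 2, axis=1)]  # min-distance detection
    return np.mean(det != tx)

Pj = 0.5  # average jamming power (illustrative)

# (a) Gaussian (barrage-like) jamming at average power Pj
awgn_jam = np.sqrt(Pj / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# (b) BPSK-modulated pulsed jamming: on for a fraction rho of symbols at power Pj/rho
rho = 0.5
on = rng.random(N) < rho
bpsk_jam = np.where(on, np.sqrt(Pj / rho) * rng.choice([1.0, -1.0], size=N), 0.0).astype(complex)

print("SER, Gaussian jamming:    %.4f" % ser_qpsk(awgn_jam))
print("SER, BPSK pulsed jamming: %.4f" % ser_qpsk(bpsk_jam))
```

Which jammer wins depends on the signal and jamming power regime; characterizing that dependence is precisely the subject of Chapter 4.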

    1.1.3 Jamming in fading channels

    In Chapter 4 we studied jamming in AWGN channels. In Chapter 5, we take the jamming analysis

    in Chapter 4 a step further and investigate jamming attacks in fading channels. As pointed out

    in [16], most of the existing jamming works, see [7], [17]-[20] and references therein, ignore the

    presence of a fading channel between the jammer and the victim receiver as it simplifies the jam-

    ming analysis. Although the impact of fading channels on the jamming performance has sparingly

    been studied in the context of multiple-input multiple-output (MIMO) systems [16], [24]-[26],

    these works addressed jamming by only considering an AWGN jamming signal against Gaussian

    victim signaling and showed that equal power allocation or water filling based on the second-order

    statistics of the fading channel are Nash-equilibrium strategies. However, it was recently shown

    in [16] that ignoring the presence of a fading channel and/or using equal power allocation/ water

    filling is sub-optimal in terms of the jamming performance evaluated via the Shannon rate metric.

    While [16] addresses the shortcomings of the earlier works [21]-[26], it assumes that the victim

    employs Gaussian signaling schemes which are typically not used in practice. Furthermore, none

    of the works that study jamming against OFDM systems, which is the preferred signaling scheme

    for most wireless standards, explicitly consider the effects of a fading channel between the jammer

    and the victim receiver (see [21]-[23] and references therein for more information on jamming

    against OFDM systems). Hence, there is not currently a good understanding as to how a jam-

    mer can effectively attack a victim that uses practical wireless signals in the presence of a fading

    channel between the jammer and the victim receiver.

    Therefore, we address this open question by studying jamming attacks against digital modulation

    schemes in wireless fading channels. Again, we focus on the error probability metric as the Shan-

    non rate metric fails to capture the effects of digital modulation schemes typically employed by

    the victim. Specifically, in this chapter, we study the problem of jamming power allocation across

    a fading channel under total and peak power constraints in order to maximize the error probability

    of a victim receiver. As a result of the analysis in this chapter, an interesting power allocation


    strategy is obtained for the jammer, which is different from equal power allocation, channel inver-

    sion and water filling. Specifically, it will be shown that for a given jamming power, the power

    allocation is similar to channel inversion at low victim signal power values and to water filling

    at high victim signal power values. However, at medium victim signal power values, the jammer

allots more power when the channel fading is weak than when it is strong, but allots no power when the fading is weakest. The jammer performance is also evaluated under several non-ideal scenarios, which demonstrates the benefits of employing the proposed jamming strategies over conventional jamming techniques. Finally, the proposed jamming strategies are applicable not only to frequency-selective fading channels but also to time-selective fading channels, and hence can be

    used to optimally attack a victim across a variety of scenarios.
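The flavor of this power-allocation problem can be reproduced with a brute-force search. The sketch below maximizes the average error probability of a BPSK victim under Gaussian jamming over a small grid of per-fading-state jamming powers, subject to total and peak power constraints; the channel gains, powers, and the use of Gaussian jamming are illustrative assumptions, not the optimal strategy of Chapter 5.

```python
import numpy as np
from math import erfc
from itertools import product

def Q(x):
    """Gaussian tail function."""
    return 0.5 * erfc(x / np.sqrt(2))

P, sigma2 = 1.0, 0.1             # victim power and receiver noise (illustrative)
h2 = np.array([0.2, 1.0, 3.0])   # jammer-to-victim channel power gains per fading state
J_tot, J_peak, step = 2.0, 1.5, 0.05
levels = np.arange(0.0, J_peak + 1e-9, step)

def avg_ber(J):
    # BPSK victim; Gaussian jamming simply raises the noise floor in each state.
    return np.mean([Q(np.sqrt(2 * P / (sigma2 + g * j))) for g, j in zip(h2, J)])

# Exhaustive search over the power grid, respecting the total power budget.
best = max((J for J in product(levels, repeat=len(h2)) if sum(J) <= J_tot), key=avg_ber)
print("per-state jamming powers:", best, "-> average BER %.4f" % avg_ber(best))
```

Chapter 5 derives the structure of this allocation analytically rather than by search, and shows how it transitions between channel-inversion-like and water-filling-like behavior.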

    1.1.4 Jamming Bandits

As mentioned earlier, jamming was traditionally studied using optimization, game-theoretic, or information-theoretic principles; see [17]-[26] and references therein. The major dis-

    advantage of these studies is that they assume the jammer has a lot of a priori information about

    the strategies used by the (victim) transmitter-receiver pairs, channel gains, etc., which may not

    be available in practical scenarios. For instance, in Chapters 4 and 5, we analyzed jamming from

    an optimization perspective and studied jamming strategies in AWGN and fading channels. How-

    ever, these jamming strategies were obtained by assuming that the jammer has a priori knowledge

    regarding the transmission strategy of the victim transmitter-receiver pair. While the results in

    Chapters 4 and 5 shed light on the fundamental performance limits of the jammer, they cannot be

    used in real time environments due to the lack of a priori knowledge about the victim. Further,

    such optimization-based techniques need to be re-programmed whenever the victim changes its

    strategy, which may be a complicated procedure. Therefore, in contrast to prior work (both ours

    and others), in this chapter we develop online learning algorithms that learn the optimal jamming

    strategy by repeatedly interacting with the victim nodes. Essentially, the jammer must learn to act

    in an unknown environment in order to maximize its total reward (e.g., jamming success rate).

    In this regard, we ask “Can an intelligent jammer learn the optimal physical layer jamming strate-

    gies obtained in Chapter 4, with limited to no knowledge about the victim nodes?” By learning,

    we refer to the cognitive capabilities of a jammer wherein it has the ability to understand its envi-

    ronment and the impact of its actions on the environment. In Chapter 4, we show that the optimal

    jamming signal depends on three parameters, namely modulation scheme, signal power and the

    on-off duration. While the set of modulation schemes is discrete, the signal power and the on-off

    duration parameters are continuous. As will be discussed in detail in Chapter 6, traditional learning

    techniques (i.e., those available in the open literature) cannot be directly employed to learn in such

    mixed action spaces (discrete and continuous). The multi-armed bandit (MAB) framework lends

itself well to this problem, as will be described in Chapter 6. However, no existing bandit framework can be directly applied to this problem, which motivated us to develop new learning frameworks and algorithms, novel both in their application to jamming and with respect to the general learning literature, to address this cognitive physical-layer jamming problem.


Moreover, these algorithms come with theoretical guarantees on the jamming performance, which

    is vital in offensive security scenarios. Specifically, we prove that our learning algorithm converges

    to the optimal (in terms of the error rate inflicted at the victim and the energy used) jamming strat-

    egy. Even more importantly, we prove that the rate of convergence to the optimal jamming strategy

is sub-linear, i.e., the learning is fast in comparison to existing reinforcement learning algorithms,

    which is particularly important in dynamically changing wireless environments. Also, we charac-

    terize the performance of the proposed bandit-based learning algorithm against multiple static and

    adaptive transmitter-receiver pairs. More details are presented in Chapter 6.
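As a stripped-down illustration of the bandit viewpoint, the sketch below runs the classical UCB1 index policy over a small, finite set of candidate jamming strategies with unknown Bernoulli success rates. The true mixed discrete-continuous action space of Chapter 6 (modulation, power, and on-off duration) requires more than this finite-armed example, and all success probabilities here are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Five candidate jamming strategies (e.g., modulation/power/duty-cycle triples)
# with unknown mean success rates -- the values below are purely illustrative.
true_success = np.array([0.30, 0.45, 0.55, 0.40, 0.62])
K, T = len(true_success), 5000

counts, means = np.zeros(K), np.zeros(K)
for t in range(T):
    if t < K:
        arm = t                                   # initialization: play each arm once
    else:
        arm = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))  # UCB1 index
    reward = float(rng.random() < true_success[arm])   # Bernoulli jamming success
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running-mean update

print("pull counts per arm:", counts.astype(int))      # concentrates on the best arm
```

For finite arms, UCB1's weak regret grows only logarithmically in time; the sub-linear regret guarantees mentioned above are the analogous statements for the continuous jamming action space.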

    1.1.5 MAC-layer jamming

    Jamming all the information exchanged between the malicious nodes, for example by employing

    the physical layer jamming techniques obtained in Chapters 4-6, may not always be necessary. It

    was shown in [27], [28] that the jammer can perform better (say in terms of the energy efficiency)

    if it accounts for the inherent structure in the data transmission. For example, in some scenarios,

    jamming the control packets or pilot signals is sufficient to stop the malicious nodes from commu-

    nicating with each other [27], [28]. Thus, higher layer jamming attacks either at the MAC layer

    or network layer should be considered. MAC-layer attacks typically rely on the knowledge of the

    protocol used by the malicious nodes and network layer attacks rely on the ability to create con-

    gestion in the network by mimicking the packets that are sent by other nodes in the network [7]. In

    Chapter 7, we seek to understand the optimal MAC layer jamming attacks against an 802.11-basedwireless network. Specifically, we ask “Can an intelligent jammer learn the optimal MAC-layer

    jamming strategies when it has delayed knowledge about the malicious nodes?”

    In this chapter, we assume that the jammer can identify the basic MAC-layer protocol being used

    by the malicious nodes, although not necessarily the full details. This can be fairly easily achieved

    by observing the traffic pattern of the nodes in the environment over some time interval [29].

    However, one of the main challenges still faced in studying a MAC layer jamming attack is that the

    knowledge about the malicious nodes is not always available instantaneously, especially when the

    jammer intends to track the changes in the victim’s strategies. Hence, in this problem, we assume

a middle ground between Chapters 4 and 5 (where full a priori knowledge is assumed) and Chapter 6 (where almost none is assumed), and study how efficiently and effectively a jammer can learn the optimal jamming strategy when there is delayed knowledge about the malicious nodes, i.e., in cases where the jammer becomes aware of the malicious nodes' behavior only after

    some time delay. The framework for delayed observations is more practically relevant, especially

    in the context of wireless communications [30].

    In order to answer the question raised, we will use the Markov Decision Process (MDP) framework

    which is particularly useful in modeling environments that obey the Markovian property and have

    to keep track of only a small number of states. By state, we refer to the condition of the environment

    in this dissertation. However, as will be discussed in detail in Chapter 7, the literature on delayed

    learning frameworks is immature, thereby forcing the development of an appropriate framework

    that enables us to obtain the optimal MAC-layer jamming strategies. As a result of the analysis in


this chapter, we develop a novel delayed learning framework with transition-based rewards that allows us to handle the realistic case of delayed knowledge. Using this framework, it is shown that the jammer can learn the optimal policy. More details are presented in Chapter 7.
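To fix ideas before Chapter 7, the following solves a toy two-state MDP, with the Transmission/No-Transmission states and Jam/Don't-Jam actions used as the running example in Chapter 2, by value iteration. The transition probabilities and rewards are invented for illustration, and the delayed-observation machinery developed in Chapter 7 is deliberately omitted.

```python
import numpy as np

# States: 0 = No Transmission, 1 = Transmission; actions: 0 = Don't Jam, 1 = Jam.
# P[a, s, s'] and R[s, a] below are invented for illustration.
P = np.array([
    [[0.7, 0.3], [0.4, 0.6]],   # Don't Jam
    [[0.8, 0.2], [0.7, 0.3]],   # Jam: transmissions tend to stop
])
R = np.array([
    [0.0, -0.2],   # No Transmission: jamming only wastes energy
    [-1.0, 0.5],   # Transmission: jamming pays off despite the energy cost
])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):    # value iteration to (numerical) convergence
    V = np.max(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)
policy = np.argmax(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)
print("V* =", V, "| policy per state:", policy)  # expect: jam only when a transmission is present
```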

    1.1.6 Blind network interdiction

    Network centric architectures are increasingly gaining prominence, be it social networks or wire-

    less networks, as they allow for decentralized operation among various nodes without the need

    for a central entity to control their communication. With a widespread deployment of such ar-

    chitectures, the security aspects of the underlying networks is now a major concern. The ability

    to undermine a malicious network’s communication capabilities is crucial for ensuring security in

    sensitive environments. In Chapter 8, we particularly focus on attacks against networks when their

    topology is unknown a priori.

    Network interdiction refers to disrupting a network in an attempt to either analyze the network’s

    vulnerabilities or to undermine a network’s communication capabilities. A vast majority of the

    works that have studied network interdiction assume a priori knowledge of the network topology

    [31]-[42]. However, such knowledge may not be available in real-time settings. For instance,

    in practical electronic warfare-type settings, an attacker that intends to disrupt communication

    in the network may not know the topology a priori. Hence, it is necessary to develop online

    learning strategies that enable the attacker to interdict communication in the underlying network

in real time. In this chapter, we develop several learning techniques that enable the attacker to learn

    the best network interdiction strategies (in terms of the best nodes to attack to maximally disrupt

    communication in the network) and also discuss the potential limitations that the attacker faces in

such blind scenarios. We consider settings where a) only one node can be attacked and b) multiple nodes can be attacked in the network. In addition to the single-attacker setting, we also

    discuss learning strategies when multiple attackers attack this network and discuss the limitations

they face in real-time settings. Several different network topologies are considered in this study, through which we show that, under the blind settings considered in this chapter and except for some simple network topologies, the attacker cannot attack the network optimally (measured in terms of the number of flows stopped).

More specifically, in this chapter, we show that: (a) attacking a network by relying on well-known graph metrics, such as betweenness centrality [40], does not necessarily work for all network topologies, (b) under blind scenarios, the learning rates cannot be improved beyond O(|V|), where |V| is the number of nodes in the network, (c) under blind scenarios, multiple attackers must collaborate at every time instant in order to learn the best set of nodes to attack in the network

    and (d) the learning performance, be it a single attacker or multiple attackers, will depend on the

    network structure and not just the number of nodes in the network. More details are presented in

    Chapter 8.
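Finding (a) is easy to probe with standard graph tooling. The sketch below (using the networkx package, on an arbitrary barbell topology) implements the betweenness-centrality attack heuristic that, as shown in Chapter 8, does not generalize to all topologies.

```python
import networkx as nx

# An illustrative topology; in the blind setting of Chapter 8 the attacker
# does NOT know this graph and must learn which nodes to attack online.
G = nx.barbell_graph(5, 2)   # two cliques joined by a short path

# Heuristic (a): attack the node with the highest betweenness centrality.
bc = nx.betweenness_centrality(G)
target = max(bc, key=bc.get)
G.remove_node(target)
print("attacked node:", target, "| still connected:", nx.is_connected(G))
```

On the barbell graph the heuristic happens to work, since the bridge nodes carry all inter-clique flows; the point of finding (a) is that this success does not carry over to arbitrary topologies.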


    1.1.7 Jamming against wireless networks

    Jamming against wireless networks (not just single nodes) has been previously addressed albeit

from an optimization perspective in [37], [43]-[45]. The jammer-to-flow assignment problem, i.e.,

    optimally assigning jammers to stop flows in a network based on their locations and other con-

    straints such as power, was considered in [37]. In [43]-[45], the problem of jammer placement

    against wireless networks with the aim of disconnecting the network was studied. All these works

    model a network as a graph and find the best set of nodes/edges to attack so that the network is dis-

    connected. While these studies indicate which nodes/links to be attacked, they do not address the

    problem of how this attack can be realized in practice against cellular and/or WiFi-based wireless

networks. In other words, jamming techniques against wireless networks are not well understood

    from a physical layer perspective.

    In Chapter 9, we analyze the impact of randomly placed jammers against a wireless network in

    terms of a) the outage probability and b) the error probability of a victim receiver in the downlink

    of this wireless network. We derive analytical expressions for both these metrics and discuss in

    detail how the jammer network must be matched to the wireless network parameters in order to

    effectively attack the victim receiver. For instance, we show that as the network loading increases,

    assuming universal frequency reuse, the number of jammers that are needed to inflict a given outage

    probability at the victim receiver decreases. Retransmissions are commonly used across a variety

    of wireless protocols. We will show that when the wireless network uses retransmissions (in order

    to improve the probability of successful communication), the number of jammers necessary to

    achieve a required outage probability at the victim receiver decreases due to increased interference

    among the BSs. Furthermore, we will show that the behavior of the jammer network as a function

of the BS/AP density is not obvious. In particular, an interesting concave-type behavior is seen, which indicates that the number of jammers required to attack the wireless network must scale with the BS density only up to a certain value, beyond which it decreases.

    probability of the victim receiver, we study whether or not some recent results related to jamming in

    the point-to-point link scenario can be extended to the case of jamming against wireless networks.

As a result of the analysis in this chapter, we show that a fixed number of jammers can tip a wireless network, i.e., can significantly reduce the probability of successful communication in this

    wireless network. A similar analysis is performed in the context of the error probability of the

    victim receiver. Specifically, we will show that when the small scale fading effects are averaged

    out, then the results in Chapter 4 can be extended to the case of jamming against wireless networks

    and that significant gains can be achieved by using modulation-based jamming signals (i.e., the

    findings from Chapter 4) when compared to AWGN jamming.
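The outage analysis lends itself to quick Monte Carlo checks. The sketch below estimates the victim's outage probability when jammers are dropped as a homogeneous Poisson point process around it; the disc radius, path-loss exponent, Rayleigh fading model, and all power values are illustrative assumptions, and the inter-BS interference analyzed in Chapter 9 is ignored here.

```python
import numpy as np

rng = np.random.default_rng(3)

def outage_prob(lam_j, Pj=1.0, R=20.0, alpha=4.0, thresh=1.0, sigma2=0.01, trials=20000):
    """Monte Carlo P(SINR < thresh) for a victim at the origin served by a BS at
    unit distance, with jammers as a Poisson point process of density lam_j on a
    disc of radius R. Rayleigh fading on all links; every value is illustrative."""
    fails = 0
    for _ in range(trials):
        n = rng.poisson(lam_j * np.pi * R**2)     # number of jammers this drop
        r = R * np.sqrt(rng.random(n))            # PPP radii on the disc
        jam = Pj * np.sum(rng.exponential(size=n) * r**(-alpha))  # aggregate jamming power
        sig = rng.exponential()                   # fading on the serving link
        fails += sig / (sigma2 + jam) < thresh
    return fails / trials

for lam in [1e-3, 5e-3, 2e-2]:
    print(f"jammer density {lam:.0e}: outage = {outage_prob(lam):.3f}")
```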

    1.1.8 List of relevant publications

    This dissertation is based on the following publications:


    1. S. Amuru and C. R. C. M. da Silva, “A blind pre-processor for modulation classification

    applications in frequency-selective non-Gaussian channels,” IEEE Trans. Commun., vol.

    63, no. 1, pp. 156-169, Jan. 2015.

    2. S. Amuru and R. M. Buehrer, “Optimal jamming against digital modulation,” IEEE Trans.

    Inf. Forensics and Security, vol. 10, no. 10, pp. 2212-2224, Oct. 2015.

    3. S. Amuru, C. Tekin, M. van der Schaar, and R. M. Buehrer, “Jamming bandits - a novel

    learning method for optimal jamming,” submitted to IEEE Trans. Wireless Commun., avail-

    able at arXiv:1411.3652.

    4. S. Amuru and R. M. Buehrer, “On jamming power allocation against OFDM signals in

    fading channels,” submitted to IEEE Trans. Inf. Forensics and Security, Aug. 2015.

    5. S. Amuru, R. M. Buehrer, and M. van der Schaar, “Blind network interdiction strategies - a

    learning approach,” submitted to IEEE Trans. Cognitive Commun. Netw., Sept. 2015.

    6. S. Amuru, H. S. Dhillon, and R. M. Buehrer, “On jamming attacks against wireless net-

works,” submitted to IEEE Trans. Wireless Commun., Sept. 2015.

    7. S. Amuru and R. M. Buehrer, “Optimal jamming using delayed learning,” in Proc. IEEE

    Military Comm. Conf., (Milcom), Baltimore, MD, Oct. 2014, pp. 1528-1533.

    8. S. Amuru and R. M. Buehrer, “Optimal jamming strategies in digital communications-

    impact of modulation,” in Proc. IEEE Global Commun. Conf., Dec. 2014.

    9. S. Amuru, C. Tekin, M. van der Schaar, and R. M. Buehrer, “A systematic learning method

    for optimal jamming,” in Proc. Intern. Conf. Commun., Jun. 2015.

Chapter 2

    Background

    In this chapter, we briefly introduce two concepts that are used in this dissertation, namely a)

    Reinforcement learning and the associated theory of Markov Decision Processes and b) Multi-

    armed bandits.

    2.1 Reinforcement Learning and Markov Decision Processes

    Reinforcement learning is a technique that allows an agent to modify its actions (without any

    supervision) by repeatedly interacting with the environment and is commonly used to address

    sequential decision making. A reinforcement learning task that satisfies the Markov property1

is called a Markov decision process, or MDP [47]. An MDP is defined by a tuple (S, A, P, R), where S is the set of all possible environment states and A is the set of all possible actions that the agent can perform in any environment state. For instance, from a jammer’s perspective, the

    environment states could be Transmission/No Transmission to reflect the cases where a packet is

    exchanged between the transmit-receive pair or when they are idle, and the actions of the jammer

could be Jam/Don’t Jam. P is the state transition probability matrix that governs the dynamics of the environment, and its entries are given by the transition probabilities p(s′|s, a), which indicate the probability that the environment moves to state s′ when action a is executed in state s. Finally, R indicates the |S| × |A| reward matrix whose entries r(s, a) indicate the reward (for example, energy expended) obtained in state s when action a is executed. Here, |S| and |A| indicate the cardinality of the sets S and A, respectively.

1The Markov property refers to the memoryless property of a stochastic process. More specifically, the conditional probability distribution of the future states of the random process depends only on the present state and not on the states that happened earlier. Such a stochastic process is also known as a Markov process.

In the traditional RL framework, an agent observes the current state of the environment s and chooses an action a. An optimum policy (a functional mapping between states and the actions that can be performed in these states) is one that maximizes the total expected reward, which is more often than not discounted by a factor γ ∈ [0, 1) to account for an infinite time horizon. The objective of an RL algorithm is therefore to find an optimal policy Π (a mapping between states and actions) that maximizes the cumulative discounted reward

$$R(t) = \sum_{k=0}^{\infty} \gamma^k \, r\big(s_{t+k}, a_{t+k} = \Pi(s_{t+k})\big), \qquad (2.1)$$

where $s_t, a_t$ indicate the state and action taken at time $t$ [46]. The value of a policy Π when the environment is in state $s$ is given by

$$V^{\Pi}(s) = \mathbb{E}_{\Pi}\left(\sum_{k=0}^{\infty} \gamma^k \, r(s_{t+k}, a_{t+k} \mid s_t = s)\right), \qquad (2.2)$$

where $\mathbb{E}_{\Pi}$ indicates the averaging performed over all possible state transitions when the agent follows the policy Π. Several algorithms exist to find an optimal policy Π∗, such as value iteration and policy iteration (which are useful when P is known a priori); for more details, please see [47]. For ease of analysis, we assume a stationary model (the state transition matrix is independent of time)

    and ignore the time parameter t hereafter.

    When the underlying MDP model is known, policy evaluation (finding the value of a given policy)

    can also be done via matrix inversion (especially for small MDPs, i.e., MDPs with small state-

    action space) [47]. Specifically,

$$V^{\Pi}(s) = r(s, a = \Pi(s)) + \mathbb{E}_{\Pi}\left(\sum_{k=1}^{\infty} \gamma^k \, r(s_k, a_k \mid s)\right) = r(s, a = \Pi(s)) + \gamma \sum_{s'} p(s'|s, a = \Pi(s)) \, V^{\Pi}(s').$$

Thus, writing the above set of equations for all possible states $s \in S$ in the MDP, we have

$$\bar{V}^{\Pi} = \bar{r}^{\Pi} + \gamma P^{\Pi}(s'|s)\,\bar{V}^{\Pi} \;\Longrightarrow\; \bar{V}^{\Pi} = \left(I - \gamma P^{\Pi}(s'|s)\right)^{-1}\bar{r}^{\Pi}, \qquad (2.3)$$

where $\bar{V}^{\Pi}$ is the $|S| \times 1$ vector of values of the policy Π in states $s \in S$, $\bar{r}^{\Pi}$ is the $|S| \times 1$ vector of rewards obtained in states $s \in S$ using policy Π, $P^{\Pi}(s'|s)$ indicates the $|S| \times |S|$ state transition probability matrix when the agent uses policy Π, and $I$ is an identity matrix of appropriate dimensions.
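For a concrete instance of (2.3), the sketch below evaluates a fixed policy on a two-state MDP by direct matrix inversion; the transition matrix and rewards are invented for illustration.

```python
import numpy as np

# Policy evaluation via (2.3): V = (I - gamma * P_pi)^(-1) r_pi.
# Two states (0 = No Transmission, 1 = Transmission); numbers are illustrative.
gamma = 0.9
P_pi = np.array([[0.7, 0.3],      # row s: p(s'|s, a = Pi(s))
                 [0.7, 0.3]])
r_pi = np.array([0.0, 0.5])       # r(s, Pi(s))

# Solve the linear system rather than forming the inverse explicitly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print("V^Pi =", V)
```

For anything beyond small state spaces, iterative methods are preferred over this cubic-cost inversion, which is one reason the learning methods below matter.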

In general, when no policy is given a priori, for any set of states S and set of actions A, the optimal value function can be written as

$$V^*(s) = \max_{a \in A}\left(r(s, a) + \gamma \sum_{s'} p(s'|s, a) \, V(s')\right), \qquad (2.4)$$

which indicates the optimal value that can be associated with a state $s \in S$ (which is known by exploring all actions $a \in A$ in the state $s$). Along similar lines, we define a new state-action function $Q(s, a)$, which captures the quality of an action taken in a particular state, as follows:

$$Q(s, a) = r(s, a) + \gamma \max_{a' \in A} \sum_{s'} p(s'|s, a) \, Q(s', a'), \qquad (2.5)$$

which gives the optimal value function as $V^*(s) = \max_{a \in A} Q(s, a)$ and helps to find an optimal policy Π∗ as2

$$\Pi^*(s) = \arg\max_{a \in A} Q(s, a). \qquad (2.6)$$

2Note that while the optimal value function $V^*(s)$ is unique, the optimal policy is not necessarily unique [47].

    It should be clear by now that all the above equations can be used to find the optimal set of actions

only when P, i.e., the transition probability matrix, is known or can be estimated. Such techniques that rely on the knowledge of P are commonly known as Indirect Learning or Planning algorithms [47]. But usually, P is unknown in dynamic environments and can be difficult to estimate in real-time environments. Since the value of a state is defined as the expectation of the random rewards

    obtained when the MDP is started from the given state, a direct way of estimating this value is to

    estimate an average over multiple independent realizations of the MDP that start from the given

state, i.e., the Monte Carlo technique. Unfortunately, the variance of the returns can be high, which

    can result in poor estimates of the Q-function (because it is possible to obtain different estimates

    for the same state-action pair, for example due to the wireless channel conditions). To address this,

    an online learning technique popularly known as Q-Learning [47] was developed, which updates

    the state-action function as below:

$$Q_t(s, a) = (1 - \alpha_t)\,Q_{t-1}(s, a) + \alpha_t\left[r(s, a) + \gamma \max_{a'} \sum_{s'} p(s'|s, a)\, Q_{t-1}(s', a')\right], \qquad (2.7)$$

which is shown to converge to the optimal solution when the learning rate $\alpha_t \in (0, 1]$ satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. The proof of convergence is based on relating (2.7) to an ordinary differential equation with a fixed-point solution and the theory of stochastic approximation [48].

    Moreover, when we are concerned with online learning problems, finding a balance between ex-

    ploration (trying actions that may yield higher rewards) and exploitation (using the best actions

learned thus far) becomes important, given the finite available resources. $\epsilon$-Greedy is a commonly used learning algorithm where an agent explores the actions (in any state) with probability $\epsilon$ and exploits the existing knowledge with probability $1 - \epsilon$. In Q-Learning, actions are chosen as per an exploration-exploitation schedule that is decided a priori such that all actions can be tried in all

    possible environment states. Thus, such learning algorithms can guarantee optimality only asymp-

    totically as the size of the MDP grows. While the theory is mature for the case of finite MDPs,

    efficient exploration, for example, is still being studied in the case of large MDPs (this problem

has been addressed well in the context of multi-armed bandit problems, which is discussed next).

Finite-time bounds that indicate the rate of convergence to the optimal policy in the case of finite

    MDPs have been studied [49]. For more details on reinforcement learning, please see [46]-[49].
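The sketch below ties these pieces together: tabular Q-learning with $\epsilon$-greedy exploration and a learning rate $\alpha_t = 1/N_t(s,a)$ satisfying the two conditions above, run on the two-state jamming environment used as the running example. Note that (2.7) is written with the model term p(s′|s, a); the sketch uses the standard sampled-transition form, which is what makes Q-learning model-free. The dynamics and rewards are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two states (0 = No Transmission, 1 = Transmission), two actions (0 = Don't Jam, 1 = Jam).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],     # P[a][s][s'] for action 0 (Don't Jam)
              [[0.8, 0.2], [0.7, 0.3]]])    # and action 1 (Jam); values illustrative
R = np.array([[0.0, -0.2], [-1.0, 0.5]])    # R[s][a]
gamma, eps = 0.9, 0.1

Q = np.zeros((2, 2))
visits = np.zeros((2, 2))
s = 0
for _ in range(50_000):
    # epsilon-greedy action selection
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]               # satisfies sum(alpha)=inf, sum(alpha^2)<inf
    s_next = int(rng.random() < P[a, s, 1])  # sample s' from the (unknown-to-agent) model
    # sampled-transition form of update (2.7)
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy policy per state:", Q.argmax(axis=1))  # should jam only in the Transmission state
```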

    2.2 Multi-armed Bandits

    Multi-armed bandit problems are a sub-class of sequential decision making problems that are con-

    cerned with allocating the available resources among several alternative arms/actions [50]-[53].

    For example, such algorithms are most widely used in the context of clinical trials where several

    treatments are applied to patients in a sequential manner, and patients are dynamically allocated

    to the best treatment [50]. A single-armed bandit process is an arm that is defined by two random

    sequences namely, s(n) and r(s(n)), where s(n) is the state of the arm after it has been played ntimes and r(s(n)) is the instantaneous reward obtained after the arm is played n times. Specifically,it is assumed that the state of the arm evolves as s(n) = fn−1(s(0), s(1), . . . , s(n− 1), w(n− 1)),where fn−1(.) is known a priori and w(n) is a sequence of independent random variables that arealso independent of s(n) and come from a known statistical distribution. A multi-armed banditprocess is defined as a collection of K such single-armed bandit processes3 and a controller/playerthat has to make the decision to choose one among these K bandit processes at every time instant.The decisions are taken such that the average cumulative discounted reward is maximized. There-

    fore, it is a sequential decision problem as the decision to be made at every time instant depends

    on what happened thus far and thereby faces the exploration versus exploitation dilemma when it

    has to choose the arms. Putting this in the context of the MDPs introduced earlier4, a policy now

refers to the sequence of arms chosen over time and, therefore, an optimal policy is one that chooses the best arm at every time instant. The goal of this problem is to maximize

$$J = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \sum_{k=1}^{K} r_k\big(s_k(n_k(t)), u_k(t)\big) \,\Big|\, s_1(0), s_2(0), \ldots, s_K(0)\right], \qquad (2.8)$$

where $\gamma \in (0, 1]$ is the discount factor, $n_k(t)$ is the number of times arm $k$ has been chosen until time $t$, and $u_k(t)$ is a $1 \times K$ vector which has a 1 in the $k$th position and 0 elsewhere if the $k$th arm is chosen.

    It is easy to see that this problem formulation is similar to that of MDPs and can be solved by

using the regular dynamic programming techniques. However, it has been shown that this $K$-dimensional problem can be reduced to $K$ one-dimensional problems and that the optimal policy is

of an index type (i.e., an index known as the Gittins index is assigned to each of the arms and the arm with the highest index value is chosen at every time instant) under certain constraints which are satisfied by the MAB problems [50]. The specific constraints are a) all arms are independent, b) only one arm is chosen to play at any time instant, c) only the state of the arm chosen changes according to $f(\cdot)$ and the rest of the arms are frozen, and d) an arm gives a reward only when it is operated. However, these Gittins indices can be evaluated only with knowledge of the evolution process $f(\cdot)$ and perfect knowledge of the reward functions [50]-[53]. Further, as discussed in [52], this formulation is most useful and computationally feasible for obtaining optimal policies when the arms evolve according to a Markov process.

3For ease of analysis, $K$ is assumed to be finite. The case of a continuous state space $S$ is usually handled by discretization of the state space. More details about such continuous spaces will be presented as part of our solution to (Q2).

4While the concept of bandit processes was initially developed in the context of a Markovian evolution process (similar to the formulation of an MDP) [50], they were later generalized to all generic stochastic processes described by $s(n) = f_{n-1}(s(0), s(1), \ldots, s(n-1), w(n-1))$ in [51].

An alternative formulation for the MAB problem, which does not assume any parametric forms for the state and reward functions and is based on formulating a performance criterion known as "learning loss" or "regret," was primarily investigated in [54]-[56]. Regret is defined as the

    difference between the expected reward that can be obtained by an oracle policy that has some

    or complete knowledge about the statistical distribution of the arms and their rewards, and the

    expected reward of the player’s policy. The most commonly used oracle policy is the best single

    action policy that is optimal among all policies which choose only one arm over the entire time

    horizon. This type of regret is also known as the “weak regret” [57], which will be used throughout

    this report. Formally, weak regret is defined as

$$R_n^W = \max_{k=1,2,\ldots,K} \mathbb{E}\left(\sum_{t=1}^{n} r_{k,t} - \sum_{t=1}^{n} r_{I_t,t}\right), \qquad (2.9)$$

where $I_t$ indicates the arm chosen at time $t$, $r_{k,t}$ is the reward obtained at time $t$ by playing the $k$th arm, and the expectation is with respect to the random choices made by the player's policy and the unknown evolution process of the arms (in other words, the random rewards obtained from the arms). On the other hand, the concept of the "strong regret," defined as

$$R_n^S = \mathbb{E}\left[\max_{k=1,2,\ldots,K} \left(\sum_{t=1}^{n} r_{k,t} - \sum_{t=1}^{n} r_{I_t,t}\right)\right],$$

moves the expectation outside the maximum and is therefore never smaller than the weak regret.
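The weak regret (2.9) is straightforward to estimate empirically. The sketch below measures it for an $\epsilon$-greedy player on Bernoulli arms; the arm means are invented for illustration, and the known best mean is used as the oracle benchmark.

```python
import numpy as np

rng = np.random.default_rng(5)

# Empirical weak regret of epsilon-greedy on Bernoulli arms (means illustrative).
mu, eps, n = np.array([0.2, 0.5, 0.7]), 0.1, 10_000
K = len(mu)
counts, means, reward_sum = np.zeros(K), np.zeros(K), 0.0
for t in range(n):
    if t < K:
        arm = t                                   # play each arm once to initialize
    else:
        arm = rng.integers(K) if rng.random() < eps else int(np.argmax(means))
    r = float(rng.random() < mu[arm])             # Bernoulli reward
    counts[arm] += 1
    means[arm] += (r - means[arm]) / counts[arm]  # running-mean update
    reward_sum += r

# Weak regret benchmarks against always playing the best fixed arm in expectation.
print("empirical weak regret:", n * mu.max() - reward_sum)
```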

