Intelligent Approaches for Communication Denial
SaiDhiraj Amuru
Dissertation submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Electrical Engineering
R. Michael Buehrer, Chair
Claudio R. C. M. da Silva
Ravi Tandon
Dhruv Batra
Inyoung Kim
September 21, 2015
Blacksburg, Virginia
Keywords: Communication, Denial, Jamming, Learning
Copyright 2015, SaiDhiraj Amuru
(ABSTRACT)
Spectrum supremacy is a vital part of security in the modern era. In the past 50 years, a great
deal of work has been devoted to designing defenses against attacks from malicious nodes (e.g.,
anti-jamming), while significantly less work has been devoted to the equally important task of
designing effective strategies for denying communication between enemy nodes/radios within an
area (e.g., jamming). Such denial techniques are especially useful in military applications and
intrusion detection systems where untrusted communication must be stopped. In this dissertation,
we study these offensive attack procedures, collectively termed communication denial. The
communication denial strategies studied in this dissertation are not only useful in undermining the
communication between enemy nodes, but also help in analyzing the vulnerabilities of existing
systems.
A majority of the works that address communication denial assume that knowledge about the
enemy nodes is available a priori. However, recent advances in communication systems create
the potential for dynamic environmental conditions where it is difficult, and most likely not even
possible, to obtain a priori information regarding the environment and the nodes that are present
in it. Therefore, it is necessary to have cognitive capabilities that enable the attacker to learn
the environment and prevent enemy nodes from accessing valuable spectrum, thereby denying
communication.
In this regard, we ask the following question in this dissertation: “Can an intelligent attacker
learn and adapt to unknown environments in an electronic warfare-type scenario?” Fundamentally
speaking, we explore whether existing machine learning techniques can be used to address
such cognitive scenarios and, if not, what missing pieces will enable an attacker to
achieve spectrum supremacy by denying an enemy the ability to communicate. The first task in
achieving spectrum supremacy is to identify the signal of interest before it can be attacked. Thus,
we first address signal identification, specifically modulation classification, in practical wireless
environments where the interference is often non-Gaussian. Upon identifying the signal of inter-
est, the next step is to effectively attack the victim signals in order to deny communication. We
present a rigorous, fundamental analysis of the attacker’s performance, in terms of achieving
communication denial, in practical communication settings. Furthermore, we develop intelligent
approaches for communication denial that employ novel machine learning techniques to attack the
victim either at the physical layer, the MAC layer, or the network layer. We rigorously investigate
whether or not these learning techniques enable the attacker to approach the fundamental perfor-
mance limits achievable when an attacker has complete knowledge of the environment. As a result
of our work, we debunk several myths about communication denial strategies that were believed
to be true mainly because earlier studies considered incorrect system models and thus answered
the wrong questions.
Dedication
To my parents (Sudhakar and Sai Sudha), my sister (Sai Deepika), my brother-in-law (Raghu
Pavan) and my newborn niece.
Acknowledgments
Finally, after several thousand cups of coffee, the time has come.[1] I have been waiting for this
moment—to write the Acknowledgements section—for quite some time, even more eagerly than
for the dissertation itself. Such is the impact various people have had on me over the years.
Although it cannot be described in a few words, I will make an attempt.
I would like to thank God for his blessings and for giving me and my family the strength to get
through various difficulties and to finish this Ph.D. journey. It would not have been possible to
finish this dissertation without all the sacrifices my family made. They have supported me in every
possible way throughout my life and continue to make an impact on me every single day. Thanks
for all the love.
I am very fortunate to have worked with several people during my Ph.D. Firstly, I owe my deepest
gratitude to my advisor, Dr. Buehrer, for giving me the opportunity to be part of his group and for
making sure I did not get lost at any point during my Ph.D. Thanks for being patient, for correcting
all the mistakes I made during these years, and for pushing me to solve challenging problems. His
unique skill in identifying important research problems, attention to detail, and deep knowledge
of seemingly any problem have greatly helped in improving my research capabilities. You have always
been supportive of my work, even at times when I doubted myself. Thank you for guiding me
in the right direction, not only in my graduate studies, but also for being a person to whom I can
always look up for life advice. I would be happy if I can put into practice all your teachings, be
it in research or otherwise, and become as dedicated, contented, and committed as you are.
Dr. Claudio da Silva was the reason I came to VT. Over the years he provided me immense support,
not just as an advisor but as a friend. His understanding of his students, especially of the
difficulties faced by international students, and his help in settling down and making a head start in
my Ph.D. were the stepping stone for my graduate studies. All our phone discussions, despite our
being on opposite ends of the country, motivated me to do better work and to always aim higher.
I will forever be indebted to him for all the encouragement he provided and for checking in from
time to time to make sure I was doing well.
Dr. Ravi Tandon is my go-to guru for brainstorming about any research problem. He is full of
energy at any time of the day (and also during the night). He introduced me to information theory,
which, to this day, I am still afraid of. Thanks for patiently explaining the various intricacies in
a variety of problems that we worked on together. Working with him has helped me learn the ways
of research and has significantly improved my problem-solving skills. All the discussions during
our coffee breaks have taught me a lot about the academic world. It has been a very memorable
and pleasant experience to have collaborated with him.

[1] Thanks to the folks at Next Door Bake Shop for pumping caffeine into my body and helping me finish my Ph.D.
I would like to thank Dr. Batra for his machine learning course, which sowed the seeds for the
learning components of this dissertation. Thanks for being approachable and for the career advice
you have provided me. Thanks to Dr. Kim for the Bayesian statistics class and also for providing
valuable feedback on my research contributions.
Going to UCLA during the summer of 2014 was one of the best decisions I made during my Ph.D.
I would like to thank Dr. Mihaela for hosting me at UCLA during this time and for the wonderful
collaboration from then on. Thanks for helping me explore the crazy world of learning. Your
perseverance still amazes me. I would be very happy if I can be at least 1% as dedicated and
motivated as you are about venturing into new research fields. Dr. Cem and Dr. Xiao have been
great teachers during my stay at UCLA. I am glad these teachings have resulted in publications.
I worked with Dr. Harpreet during the final stages of my Ph.D. I am very glad he chose to join
VT. My association with him dates back to the days when I first joined Dr. Buehrer’s group.
His advice as a former student of Dr. Buehrer was very helpful in succeeding in my Ph.D. Working
with him on stochastic geometry-related problems, and successfully organizing the W@VT
seminars together, has been a great learning experience.
Dr. Gautham, thanks for helping me learn the ropes during my first year at VT. You have
been an awesome mentor and a great friend over the years and have always given me the right
suggestions when I had to make critical decisions at various times over the last four years.
The rowdy bunch at the Wireless @ VT lab - Daniel, Matt, Kevin, Hilda, Chris Headley, Chris
Phelps, Reza, Javier, Joe, and Mahi have made this journey very special. I am glad to have been
part of all the fun, pranks, fantasy leagues, lunches, coffee breaks, game nights, etc. The “wall”
will never be forgotten, nor will the hard time you guys gave me for being the student chair of
W@VT. Daniel, Marc Lichtman, and Jeff Poston, thanks for proofreading my manuscripts several
times. Nancy, thanks for making the lab a wonderful habitat, a place where I spent most of my
time during the last four years. Being the student chair and organizing seminars has truly helped
me appreciate the work Nancy and Hilda do for W@VT.
All the members of Shawnee Theatre - Sriram, Viru, Deepak, Sarvesh, KC, Prasad and Aproov,
thank you for helping me stay sane during my Ph.D. journey and for all the awesome fun, dancing,
cooking sessions we enjoyed together. Thanks to Varuni, Vishwas and Aditya for the various stim-
ulating and intellectual discussions. Sriram, Varuni, Himanshu, Deepak, and Emily have helped
me stay healthy over the years with their mouth-watering dishes. Karteek and Santhosh, my friends
from IIT, thanks for being just a call away and for talking to me in times of stress. Lakshman, Avik
Dayal, Avik Sengupta, Mehrnaz, and several others, I am thankful for all the great times we shared.
Contents
1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Modulation classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Optimal jamming in AWGN channels . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Jamming in fading channels . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Jamming Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 MAC-layer jamming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.6 Blind network interdiction . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.7 Jamming against wireless networks . . . . . . . . . . . . . . . . . . . . . 8
1.1.8 List of relevant publications . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 10
2.1 Reinforcement Learning and Markov Decision Processes . . . . . . . . . . . . . . 10
2.2 Multi-armed Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 A Blind Pre-Processor for Modulation Classification Applications in Frequency-Selective
Non-Gaussian Channels 16
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Blind Gibbs Sampling-Based Pre-Processing Stage . . . . . . . . . . . . . . . . . 19
3.3.1 Short introduction to Gibbs sampling . . . . . . . . . . . . . . . . . . . . 19
3.3.2 Superconstellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.3 Prior pdfs for the unknown parameters . . . . . . . . . . . . . . . . . . . . 22
3.3.4 Marginal posterior pdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.5 Summary and a note on complexity . . . . . . . . . . . . . . . . . . . . . 26
3.4 Modulation Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.1 Pre-processing stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5.2 Numerical classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.3 Approximated likelihood classifier . . . . . . . . . . . . . . . . . . . . . . 35
3.5.4 Carrier frequency offset . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Optimal Jamming against Digital Modulation 41
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Perfect Channel Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3.1 Optimum Jamming Signal Distribution . . . . . . . . . . . . . . . . . . . 46
4.3.2 Analysis against M-QAM victim signals . . . . . . . . . . . . . . . . . . 47
4.4 Factors that mitigate jamming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.1 Non-Coherent Jamming . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.2 Symbol Timing Offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.3 Signal Level Mismatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Jamming an OFDM Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6 The Case of Multiple Jammers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 On Jammer Power Allocation Against OFDM Signals in Fading Channels 66
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Jamming Strategies in Fading Channels . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.1 Optimal power allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.2 Other power allocation strategies . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3 Approximately optimal jamming power allocation . . . . . . . . . . . . . 75
5.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.1 Power allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4.2 Jamming performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.3 Approximately optimal jamming solution performance . . . . . . . . . . . 81
5.4.4 Factors that affect jamming . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6 Jamming Bandits - A Novel Learning Method for Optimal Jamming 93
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3 Jamming against a Static Transmitter-Receiver Pair . . . . . . . . . . . . . . . . . 96
6.3.1 Set of actions for the jammer . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.2 MAB formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.4 Upper bound on the regret . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3.5 High confidence bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3.6 Improving convergence via arm elimination . . . . . . . . . . . . . . . . . 106
6.4 Learning Jamming Strategies against a Time-Varying User . . . . . . . . . . . . . 108
6.4.1 Upper bound on the regret . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5.1 Fixed user strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.5.2 Jamming performance against an adaptive victim . . . . . . . . . . . . . . 114
6.5.3 Multiple victims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.5.4 A note on the assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7 Optimal Jamming using Delayed Learning 129
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2 Learning Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2.1 Delayed Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2.2 A Novel Delayed Learning Framework with Transition-based Rewards . . 133
7.3 Jamming via Delayed Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3.1 Protocol Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3.2 Jamming Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3.3 Feedback Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.4.1 Learning the optimal policy: MDP model and ρ known . . . . . . . . . . . 138
7.4.2 Intuition about the optimal policy . . . . . . . . . . . . . . . . . . . . . . 139
7.4.3 Learning ρ and the optimal policy: MDP model known . . . . . . . . . . . 139
7.4.4 Learning the MDP model, ρ and the optimal policy . . . . . . . . . . . . . 140
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8 Blind Network Interdiction Strategies - A Learning Approach 143
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2 System Model and Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 145
8.2.1 Victim Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.2.2 Flow model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.2.3 Attack Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.3 Single-Node Attack – Strategies and Analysis . . . . . . . . . . . . . . . . . . . . 148
8.3.1 Benchmark Strategies (when the attacker has topology knowledge) . . . . 149
8.3.2 Blind strategies (when the attacker does not have topology knowledge) . . 150
8.3.3 Random flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.3.4 Notes on attack performance . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.3.5 Learning rates in blind scenarios . . . . . . . . . . . . . . . . . . . . . . . 156
8.4 Results - Single Node Attack Scenario . . . . . . . . . . . . . . . . . . . . . . . . 158
8.4.1 Fixed flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.4.2 Random flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.5 Multiple Node Attack Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.5.1 Single attacker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.5.2 Multiple attackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9 On Jamming Attacks against Wireless Networks 175
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
9.3 Outage probability of the victim receiver . . . . . . . . . . . . . . . . . . . . . . . 180
9.4 Error Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.4.1 PEP derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
9.4.2 Gaussian-Hermite quadrature approximation . . . . . . . . . . . . . . . . 187
9.4.3 ASEP Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.5.1 Outage Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.5.2 Error Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9.5.3 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 198
9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10 Conclusions 203
Bibliography 207
List of Figures
3.1 Realization of the samples obtained in the estimation process of h1, h2, τ, and λ2. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each Gibbs sampling iteration. . . . . 30
3.2 Correlation among the samples of σ1² (square) and among the samples of the real part of h1 (circle) after the burn-in period. . . . . 31
3.3 Realization of the samples obtained in the estimation process of τ for different resolution factor (OS) values. The dotted line represents the true value of τ. OS is set to 30 (dashed), 50 (dash-dot), and 100 (bold). . . . . 31
3.4 Average normalized variance of the error in the estimation of σ1² and σ2². Number of observed symbols equal to 100 (circle), 300 (square), and 500 (diamond). The average normalized variance in the estimate X̂ of X is defined as Var[X − X̂]/X. . . . . 32
3.5 Normalized MSE in the estimation of σ1² and σ2² when the modulation scheme of the received symbols is known and is either BPSK, QPSK, 8 PSK, 16 QAM, 32 QAM, or 64 QAM. The average normalized MSE in the estimate X̂ of X is defined as E[(X − X̂)²]/X. . . . . 33
3.6 Realization of the samples obtained in the estimation process of the real part of h1, h2, and h3. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each iteration. . . . . 34
3.7 Realization of the samples obtained in the estimation process of σ1², σ2², and σ3². The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each iteration. . . . . 34
3.8 Probability of correct classification of the numerical classifier for different numbers of observed symbols (750, 1000, and 1250). Clairvoyant classifier uses 1250 symbols. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 3. . . . . 35
3.9 Probability of correct classification of the numerical classifier for different values of cth. Number of observed symbols: 1250. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 2. . . . . 36
3.10 Probability of correct classification of the numerical classifier for the case when the values of L or N are over- or under-estimated. The correct values of L and N are 2. Number of observed symbols: 750. Set of possible modulation schemes: BPSK and QPSK. . . . . 37
3.11 Probability of correct classification of the approximated likelihood classifier for different numbers of symbols used by the pre-processing stage (300 or 500) and by the classifier (K = 750 or K = 1000). L = 3. . . . . 37
3.12 Probability of correct classification of the approximated likelihood classifier for different numbers of symbols used by the pre-processing stage (500 or 1000) and a fixed number of symbols used for classification (K = 1000). Set of possible modulation schemes: BPSK, QPSK, 8 PSK, 16 QAM, 32 QAM, and 64 QAM. L = 3. . . . . 38
3.13 Probability of correct classification of the approximated likelihood classifier for different values of β. Number of symbols used for both estimation and classification is equal to 1000. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 2. . . . . 39
3.14 Realization of the samples obtained in the estimation process of τ and δf for the case when training data is available for parameter estimation. Length of the training sequence is 50 symbols. Carrier frequency offset is 0.0045. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each Gibbs sampling iteration. . . . . 40
3.15 Probability of correct classification of the approximated likelihood classifier for the case when the received symbols suffer phase rotation due to carrier frequency offset. Two carrier frequency offset values are considered (0.0045 and 0.01). Number of symbols used for estimation and classification is equal to 50 and K = 300, respectively. Set of possible modulation schemes: BPSK and QPSK. L = 3. . . . . 40
4.1 Comparison of various jamming techniques against a 16-QAM modulated victim signal, JNR = 10 dB. . . . . 50
4.2 Comparison of jamming techniques against a 16-QAM victim signal in a non-coherent (random phase offset) scenario, JNR = 10 dB. . . . . 54
4.3 Comparison of jamming techniques against a 16-QAM victim signal in the presence of timing synchronization errors, JNR = 10 dB. . . . . 56
4.4 Comparison of jamming techniques against a 16-QAM victim signal in the presence of signal level mismatch, JNR = 10 dB. . . . . 57
4.5 Comparison of jamming techniques against an OFDM-modulated 16-QAM victim signal, JNR = 10 dB. . . . . 59
4.6 Comparison of jamming techniques against an OFDM-modulated 16-QAM victim signal in the presence of a frequency offset, JNR = 10 dB. . . . . 60
4.7 Comparison of jamming techniques when multiple jammers attack a single 16-QAM modulated victim signal, JNR = 10 dB. . . . . 64
5.1 Power allocations for an AWGN jammer against an OFDM-based 16-QAM victim signal, JNR = 10 dB, SNR = 15 dB; 52 out of Nsc = 64 subcarriers are shown. The solid lines indicate the channel power levels across the OFDM subcarriers. The optimal power allocation obtained by solving (5.6) is seen to be different from channel inversion, water-filling, and capacity minimization-based power allocations. . . . . 78
5.2 Performance comparison of the various power allocation strategies when a pulsed AWGN jamming signal is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB. . . . . 79
5.3 Performance comparison of the various power allocation strategies when a pulsed QPSK jamming signal is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB. . . . . 80
5.4 Performance comparison of the approximately optimal jamming power allocation in (5.14) with the optimal, water-filling, and channel inversion-based power allocations; pulsed AWGN jamming is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB. The approximately optimal power allocation strategy (diamond marker) performs nearly as well as the optimal power allocation strategy (triangle marker). . . . . 82
5.5 Jamming performance against an OFDM-based 16-QAM modulated victim signal with erroneous channel knowledge, JNR = 10 dB. . . . . 83
5.6 Jamming performance against an OFDM-based 16-QAM modulated victim signal in the presence of a frequency offset, JNR = 10 dB. . . . . 84
5.7 Jamming performance when the jammer is uncertain about the victim’s modulation scheme, JNR = 10 dB. . . . . 86
5.8 Jamming performance when the jammer is uncertain about the victim’s modulation scheme and when the victim’s channel {hk}, k = 1, . . . , Nsc, is not compensated prior to transmission, JNR = 10 dB. . . . . 87
5.9 Empirical KL divergence measure between a QPSK modulated jamming signal in the presence of a carrier frequency offset ε (normalized value) and an AWGN jamming signal. . . . . 91
6.1 An illustration of learning in one round of JB. It is possible that the optimal strategy
denoted by {J ∗, JNR∗, ρ∗} lies out of the set of discretized strategies. In such acase the jammer learns the best discretized strategy, but based on the value of the
discretization parameter M , the loss incurred by using this strategy with respect tothe optimal strategy can be bounded using the Hölder continuity condition. The
value of the discretization M is shown in the figure and Alg. 6.1. . . . . . . . . . . 102
6.2 Using Theorems 6.3 and 6.5 in a real time jamming environment. . . . . . . . . . . 106
6.3 Instantaneous SER achieved by the JB algorithm when JNR = 10dB, SNR =20dB and the victim uses BPSK. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4 Average SER achieved by the jammer when JNR = 10dB, SNR = 20dB and thevictim uses BPSK. The jammer learns to use BPSK with ρ = 0.078 using JB. Thelearning performance of the ǫ-greedy learning algorithm with various discretizationfactors M is also shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.5 Learning the optimal jamming strategy when JNR = 10dB, SNR = 20dB and thevictim uses QPSK modulation scheme. The jammer learns to use QPSK signaling
scheme with ρ = 0.087. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6 Average SER achieved by the jammer when JNR = 10dB, SNR = 20dB and thevictim uses BPSK and there is a phase offset between the two signals. The jammer
learns to use BPSK with ρ = 0.051 using JB. The learning performance of theǫ-greedy learning algorithm with various discretization factors M is also shown. . . 111
6.7 Average PER inflicted by the jammer at the victim receiver, SNR = 20 dB, victimuses BPSK and JNR = 10 dB. The jammer learns to use BPSK signaling schemewith ρ = 0.23. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.8 Average reward obtained by the jammer against a BPSK modulated victim, SNR =20 dB. The optimal reward is obtained via grid search with discretization M = 100. 112
6.9 Confidence level (optimal reward-achieved reward) predicted by Theorem 6.3 and
that achieved by JB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.10 Learning the jamming strategies by using arm-elimination. The victim uses BPSK
with SNR = 20dB. The jammer learned to use BPSK with JNR = 15 dB andρ = 0.22. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.11 Learning jammers’ strategy against a stochastic user. The victim transmitter-receiver
pair use a uniformly random signaling scheme that belongs to the set {BPSK,QPSK}and random power level in the range [0, 20] dB. . . . . . . . . . . . . . . . . . . . 114
6.12 Learning against a victim with time-varying strategies. The figure shows the power
levels adaptation by the jammer and that used by the victim. . . . . . . . . . . . . 115
xiv
6.13 Learning against a victim with time-varying strategies. The figure shows the power
level adaptation by the jammer using a drifting algorithm and that used by the victim.115
6.14 PER achieved by the jammer against 2 users, user 1 uses BPSK at 15dB and user 2sends BPSK at 5dB. The jammer learns to use BPSK signal with power 13dB andρ = 0.46. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.15 PER achieved by the jammer against 2 users; user 1 sends QPSK at 5dB and user 2 sends BPSK at 15dB. The jammer learns to use a BPSK signal with power 11.25dB and ρ = 0.25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.16 PER achieved by the jammer against 2 stochastic users in the network. Both the
users employ BPSK signaling scheme. The jammer learns to use the BPSK sig-
naling scheme to achieve power efficient jamming strategies and also tracks the
changes in the users’ strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.1 MDP model of the 802.11-type wireless network with the RTS-CTS protocol. The state transitions indicate the effect of a jamming attack on the wireless network. . . 137
7.2 Rewards obtained in various scenarios, ρ = 0.3. The rewards obtained with instantaneous knowledge are on average better than the rewards obtained in the delayed
knowledge scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3 Optimal jamming policies as a function of the ratio of the throughput cost to the
energy cost; ρ = 0.5. The colors represent the various optimal jamming policies. . 141
7.4 Rewards obtained when jammer is uncertain about the underlying MDP model and
ρ and learns it by interacting with the environment; ρ = 0.5. . . . . . . . . . . . . 142
8.1 Betweenness metrics for nodes in network (a) 112[0, 6, 8, 0, 0] and
network (b) 112[0, 6, 0, 0, 0]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.2 Network attack performance against fixed flows in a star network, number of
nodes = 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.3 Network attack performance against an Erdős–Rényi random network, connection
probability (p) = 0.8, number of nodes = 50. The average number of flows stopped in one network instantiation of the ER network is shown. . . . . . . . . . . . . . . 158
8.4 Network attack performance against fixed flows in an ER network, p = 0.8, number of nodes = 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.5 Network attack performance against fixed flows in a BA network, connection de-
gree = 5, number of nodes = 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.6 PPP-based network model, with nearest neighbor connections. The red dots indi-
cate the various network nodes and the blue lines indicate the network connections. 161
8.7 Network attack performance against fixed flows in a PPP-based network, number
of nearest neighbor connections = 5, number of nodes = 50. . . . . . . . . . . . . . 161
8.8 Network attack performance against random flows in a Star network, number of
nodes = 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.9 Network attack performance against random flows in an ER random network, p = 0.8, number of nodes = 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.10 Network attack performance against random flows in a BA network, number of
nodes = 50, connection degree = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.11 Network attack performance against fixed flows in an ER network, with 25 nodes and p = 0.8, when two nodes can be attacked simultaneously by the attackers. . . . 166
8.12 Attack performance by exploiting the similarity in a network modeled using a
Poisson-point process. L(G) = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.13 An example network attacked by two attackers, with each capable of attacking
only a subset of nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.1 [System Model] The cross marks indicate the BS/APs in the wireless network that
are distributed according to a PPP. The Voronoi tessellation indicates the coverage
regions of the BS/APs. The square indicates the victim receiver which is at the
origin. The black arrow indicates the link between the closest BS and the
victim receiver. The triangles indicate the jammers that are distributed according
to a BPP within the black-dotted region of radius RJ . . . . . . . . . . . . . . . . . 178
9.2 [Effect of NJ ]: Outage probability of the victim receiver as a function of the number of jammers NJ in the network. p = 0.01, PT/PJ = 0dB. The solid lines indicate the outage probability obtained via Monte Carlo simulations and the markers
indicate the theoretical outage probability evaluated using (9.3). . . . . . . . . . . 190
9.3 [Effect of NJc]: Outage probability of the victim receiver as a function of the number of jammers per cell (or per BS) NJc in the network. p = 0.01, NJ = 4, PT/PJ = 0dB. The solid lines indicate the outage probability obtained via Monte Carlo simulations and the markers indicate the theoretical outage probability
evaluated using (9.3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
9.4 [Effect of p]: Outage probability of the victim receiver as a function of the activity factor p. NJ = 4, NJc = 1, PT/PJ = 0dB. The solid lines indicate the outage probability obtained using Monte Carlo simulations and the markers indicate the
theoretical outage probability expression evaluated using (9.3). . . . . . . . . . . . 191
9.5 [GHQ Approximation]: The accuracy of the Gaussian-Hermite quadrature approximation
in evaluating the outage probability as a function of the number of terms
N used in (9.12). The dotted line is the outage probability evaluated using (9.3). The marked lines indicate the outage probability evaluated using (9.12) for various
values of N . p = 0.01, NJc = 1, PT/PJ = 0dB. . . . . . . . . . . . . . . . . . . . 191
9.6 [Effect of activity factor p]: Number of jammers N∗J required to cause a 90% probability of outage in the wireless network, as a function of the activity factor
(network load) p. PT/PJ = 0dB. . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
9.7 [Effect of λT ]: Number of jammers N∗J required in a BPP to cause a 90% proba-
bility of outage in the wireless network, as a function of λT , p = 0.1. . . . . . . . . 193
9.8 [Effect of Shadowing]: Number of jammers N∗J required in a BPP to cause a 90% probability of outage in the wireless network, as a function of σχ and p = 0.01. . . 193
9.9 [Effect of Retransmissions]: The steady state activity factor (ps) as a function of the number of retransmissions (D). The initial activity factor is taken to be p = 0.01. The SIR threshold θ = 0dB. . . . . . . . . . . . . . . . . . . . . . . . . 194
9.10 [Effect of Retransmissions]: The steady state packet drop probability (δ) as a function of the number of retransmissions (D). The initial activity factor is taken to be p = 0.01. The SIR threshold θ = 0dB. . . . . . . . . . . . . . . . . . . . . . . . . 194
9.11 The accuracy of the Gaussian-Hermite quadrature approximation for error probability
evaluation as a function of the number of terms N used in the approximation. The zoomed-in plot shows a part of the overall figure and indicates that N = 10 terms very closely matches the true value without any approximation. . . . . . . . 195
9.12 [Effect of Activity Factor]: Average symbol error rate as a function of the activity
factor p when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation. NJ = 4, NJc = 1, JNR = 100 dB. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP
evaluated using (9.25). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9.13 [Effect of Number of Jammers]: Average symbol error rate as a function of the
number of jammers when the victim receiver uses BPSK modulation and the jammer
network uses BPSK modulation. NJc = 1, p = 0.01. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP
evaluated using (9.25). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.14 [Effect of NJc]: Average symbol error rate when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation as a function of the
number of jammers per cell (BS). The solid lines indicate the Monte Carlo simu-
lation results and the markers indicate the theoretical ASEP evaluated using (9.25). 196
9.15 [Effect of shadowing]: Average symbol error rate as a function of shadowing power
level when the victim receiver uses BPSK modulation and the jammer network
uses BPSK modulation. The solid lines indicate the Monte Carlo simulation results
and the markers indicate the theoretical ASEP evaluated using (9.25). . . . . . . . . 197
9.16 [Effect of the jamming signaling scheme]: Average symbol error rate as a function
of p when the victim receiver uses BPSK modulation and different jamming signals are used by the jammer network. NJ = 4, NJc = 1. It is seen that in all cases, the jamming performance of the three jamming signals is the same. . . . . . . . . . . 197
9.17 [No Fading Scenario]: Average symbol error rate when the victim receiver uses
BPSK modulation and different jamming signals are used by the jammer network,
NJ = 4, NJc = 1, p = 0.01. In all cases it is seen that BPSK jamming outperforms the QPSK and AWGN jamming signaling schemes. . . . . . . . . . . . . . . . . . 198
9.18 The symbol error probability of the victim receiver when the jammer interference
is approximated as Gaussian with variance denoted by (9.30). . . . . . . . . . . . . 199
List of Tables
4.1 Optimal jamming signal level distribution against a 16-QAM victim signal, JNR = 10 dB. a1, a2 indicate the absolute values of the real and imaginary parts of the jamming signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Optimal jamming signals in a coherent scenario. . . . . . . . . . . . . . . . . . . . 51
4.3 Optimal non-coherent jamming signal level distribution against a 16-QAM victim signal, JNR = 10 dB. a1, a2 indicate the absolute values of the real and imaginary parts of the jamming signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1 Optimal jamming strategies versus jammer knowledge . . . . . . . . . . . . . . . 68
6.1 Comparison between related bandit works . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Notations used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.1 MDP model state transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2 Optimal Jamming Policies via Delayed Learning, E = −10, T = −100 . . . . . . 139
9.1 Notations used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Chapter 1
Introduction
Wireless connectivity has now become ubiquitous and an integral part of our everyday lives. It
is now more of a necessity than a luxury. With the advent of new technological capabilities, the
demand for wireless spectrum is ever-increasing. However, the inherent openness of the wireless
medium makes it susceptible to both intentional and unintentional interference. Interference from
neighboring communicating devices is one of the major causes of unintentional interference. On
the other hand, intentional interference corresponds to adversarial attacks on a victim receiver.
Therefore, ensuring the security and privacy of every device, in order to avoid data breaches and
any type of attack, is of utmost importance. Security need not be solely defensive, such as cryptographic
or information-theoretic security that evades attacks; it can also be offensive on an as-needed
basis. In this dissertation, we focus on the offensive techniques that help ensure the
security of the various devices. Such security-related studies not only allow for the analysis of
system vulnerabilities but also enable undermining an enemy system's capabilities.
The rapid rise of technological advancements in Artificial Intelligence and Machine
Learning can potentially enable every device (whether wired or wireless) to possess some form
of intelligence that allows for real-time operation and adaptation [1]-[6]. If such capabilities exist
with the malicious nodes,1 then they pose a threat to the security of the various devices that co-exist in
the same environment. It is thus imperative that devices be intelligent and predict the next move by
the adversary so as to limit the effectiveness of attacks. Therefore, spectrum supremacy, or in other
words, ensuring unimpeded access to spectrum while denying it to adversaries and thereby having
control of the spectrum, is a vital part of security in the modern era. Throughout this dissertation,
we refer to the offensive techniques that help to gain control over the spectrum as communication
denial. Communication denial, for instance, is vital for military applications (popularly referred to
as electronic warfare) [7] where military devices must have uninterrupted access to spectrum
resources to cater to mission-critical applications. It is also useful in commercial applications
where malicious sensor nodes must be stopped from eavesdropping, for instance during a private
meeting.

1In this dissertation, the terms malicious nodes, adversarial nodes, enemy nodes and victim nodes are used interchangeably.

SaiDhiraj Amuru Chapter 1. Introduction 2
Communication denial has mainly been studied using optimization, game-theoretic, or
information-theoretic principles. The major disadvantage of these studies is that they assume a great deal
of a priori information about the communication strategies used by the enemy nodes, environmental
conditions (such as the fading channel and spectrum occupancy), etc., which may not be available in
practical scenarios. Therefore, the major point of departure for this dissertation from the previous work
is the realization that the recent advances in communication systems create the potential for dy-
namic environmental conditions. Under such scenarios, more often than not, it is difficult and most
likely not even possible to obtain a priori information regarding the environment and the nodes that
are present in it. Therefore, it is necessary to have cognitive capabilities that enable nodes to learn
the environment and prevent the enemy nodes from accessing the spectrum, thereby denying
communication.
In this dissertation, we address several unsolved fundamental problems in the area of commu-
nication denial. In particular, this dissertation considers the scenarios where several nodes are
attempting to communicate in a secure or sensitive area, and one or more secure nodes wish to
prevent that communication, i.e., deny the nodes from communicating. Broadly, we ask the following
question in this dissertation: "Can an intelligent attacker learn and adapt to an unknown
environment in an electronic warfare-type scenario?” We answer this question in several stages by
fundamentally analyzing the performance of an attacker in various communication settings. We
assume that the attacker has already identified that one or more devices are malicious, for instance by
using device fingerprinting techniques [8]. In this dissertation, we focus on intelligent approaches
for communication denial of the malicious node once it has been identified.
1.1 Contributions
Chapter 2 provides a short background on the learning theory concepts used in this dissertation.
Chapters 3-9 describe the major contributions of this dissertation, namely victim signal identifica-
tion and attack strategies at various open systems interconnection (OSI) model layers. Specifically,
Chapters 3-6 and 9 discuss attacks at the physical layer, Chapter 7 discusses attacks at the MAC
layer and Chapter 8 addresses attacks at the network layer. Conclusions and future directions are
presented in Chapter 10. The major contributions of this dissertation are briefly described below.
1.1.1 Modulation classification
As mentioned earlier, the first task in effectively attacking a malicious node is to identify its sig-
naling strategy. In Chapter 3, we present a novel signal identification technique, specifically a
modulation classification algorithm to identify the modulation scheme used by the victim for its
communication. While modulation classification has been studied extensively (see [9]-[12] and
references therein), unfortunately none of the previous works consider practical, realistic environ-
ments where the interference is often non-Gaussian and the attacker is not aware of the timing of
the victim’s signal. Further, the difficulty in performing modulation classification is due primar-
ily to the fact that classifiers operate with no or incomplete knowledge of the fading experienced
by the signal and the distribution of the noise added in the channel. This is because a receiver
typically has to first classify the received signal before it can successfully acquire symbol timing
and estimate the channel state. As a result, the impractical assumption that the received signal is
acquired and equalized by the radio front-end before classification is often made in the design of
modulation classification algorithms [13], [14].
In this chapter, we present and analyze a pre-processor that allows for the reliable classification
of digital amplitude-phase modulated signals (ASK, PSK, and QAM) when the receiver has no
knowledge of the timing (symbol transition epochs) of the received signal, the noise added in
the channel is non-Gaussian, and the unknown fading experienced by the signal is frequency-
selective. We assume that the additive noise is non-Gaussian because various studies have shown
that most radio channels experience both man-made and natural noise, and that the combined noise
is impulsive. This also accounts for non-Gaussian interference that is often experienced in practical
wireless environments [9]. We propose a Bayesian pre-processing stage that estimates the various
signal parameters and reliably identifies the signal of interest. The numerical results demonstrate
that, by using the proposed pre-processor, modulation classification algorithms can perform well
compared to clairvoyant classifiers assumed to be symbol synchronous with the received signal and
to have perfect knowledge of the channel state and noise distribution. An extension of the proposed
pre-processor for the case when the received symbols suffer phase rotation due to the presence of
a residual carrier frequency offset is also considered. More details are given in Chapter 3.
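As context for the classification task itself, the following is a minimal maximum-likelihood sketch, not the Bayesian pre-processor proposed in Chapter 3: it assumes exactly what the chapter argues is impractical, namely perfect symbol timing and known complex-AWGN noise. The two-constellation candidate set and all parameter values are illustrative assumptions.

```python
import numpy as np

def constellation(name):
    """Unit-energy reference constellations (illustrative candidate set)."""
    if name == "BPSK":
        return np.array([-1.0, 1.0])
    if name == "QPSK":
        return np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
    raise ValueError(name)

def avg_log_likelihood(y, points, noise_var):
    """Average log-likelihood of received symbols y under complex AWGN,
    marginalizing over equiprobable constellation points."""
    d2 = np.abs(y[:, None] - points[None, :]) ** 2
    lik = np.mean(np.exp(-d2 / noise_var), axis=1) / (np.pi * noise_var)
    return np.mean(np.log(lik + 1e-300))

def classify(y, candidates=("BPSK", "QPSK"), noise_var=0.1):
    """Pick the candidate constellation with the highest likelihood."""
    scores = {c: avg_log_likelihood(y, constellation(c), noise_var) for c in candidates}
    return max(scores, key=scores.get)

# synthetic BPSK burst observed in complex AWGN
rng = np.random.default_rng(0)
symbols = rng.choice(constellation("BPSK"), size=2000)
noise = np.sqrt(0.05) * (rng.standard_normal(2000) + 1j * rng.standard_normal(2000))
detected = classify(symbols + noise, noise_var=0.1)
```

A clairvoyant classifier of this form is the benchmark against which the proposed pre-processor is compared; the pre-processor's role is to supply the timing, channel, and noise estimates that this sketch simply assumes.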
1.1.2 Optimal jamming in AWGN channels
Once the victim’s signaling scheme is identified, the next task for the attacker is to efficiently at-
tack it using all the available information. In Chapter 4, we study attacks from a physical layer
perspective. More specifically, we study jamming attacks against practical wireless signals, namely
digital amplitude-phase modulated signals. Jamming has traditionally been studied in the context
of spread spectrum communications [15]. Barrage jamming, partial-band/narrow-band jamming,
tone-jamming (where a victim is attacked by sending either a single or multiple jamming tones) and
pulsed jamming are the most common types of jamming models considered in wireless commu-
nication systems. Deviating from these traditional, simplistic techniques, we ask: "What
is the optimum statistical distribution for power-constrained jamming signals in order to maximize
the error probability of digital amplitude-phase modulated constellations?” This work answers
a question that is more relevant to practical wireless communication systems when compared to
similar questions studied in the past, and consequently offers different solutions mainly because
incorrect system models were previously considered and thus the wrong questions were answered.
As a result of the analysis in this chapter, we show that modulation-based pulsed jamming signals
are optimal in both coherent and non-coherent (phase-asynchronous) scenarios against digital
amplitude-phase modulated signals. As opposed to the common belief that matching the victim
signal (correlated jamming) increases confusion at the victim receiver, our analysis shows that
the optimal jamming signals match standard modulation formats only in a certain range of signal
and jamming powers. Beyond this range, either binary or quaternary pulsed jamming is the opti-
mal jamming signal. An interesting relationship between these optimal jamming signals and the
well-known pulse jamming signals discussed in the context of spread spectrum communications
is illustrated. The performance of these optimal jamming signals is shown to degrade when
the victim and jamming signals are not phase or time synchronous, or when the jammer does not have
perfect knowledge of the power levels of the victim and jamming signals, although the optimal
jamming signal distributions do not change. In this chapter, we also study jamming against
OFDM-based victim signaling schemes and the effects of multiple jammers against the victim.
More details are presented in Chapter 4.
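To give a flavor of the optimization, the classical pulsed-jamming trade-off can be sketched in a few lines: an average-power-constrained jammer picks a duty cycle ρ and concentrates its power while "on". The sketch below treats the jamming as extra Gaussian noise against BPSK, which is a simplification of the modulation-based analysis in Chapter 4; the SNR/JNR values are illustrative.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def bpsk_ser(snr_linear):
    """BPSK symbol error rate in AWGN."""
    return q_func(math.sqrt(2 * snr_linear))

def pulsed_jamming_ser(snr_db, jnr_db, rho):
    """Average BPSK SER when the jammer is 'on' a fraction rho of the time
    and concentrates its average power J into instantaneous power J/rho."""
    S = 10 ** (snr_db / 10)
    J = 10 ** (jnr_db / 10)
    ser_on = bpsk_ser(S / (1 + J / rho))  # jamming treated as extra Gaussian noise
    ser_off = bpsk_ser(S)
    return rho * ser_on + (1 - rho) * ser_off

# grid search for the duty cycle that maximizes the victim's error rate
rhos = [i / 1000 for i in range(1, 1001)]
best_rho = max(rhos, key=lambda r: pulsed_jamming_ser(20, 10, r))
```

For SNR = 20 dB and JNR = 10 dB the search lands on a small duty cycle (around ρ ≈ 0.07 under this noise model), inflicting an error rate orders of magnitude above constant-power jamming, which is the qualitative behavior that makes pulsed jamming attractive.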
1.1.3 Jamming in fading channels
In Chapter 4, we studied jamming in AWGN channels. In Chapter 5, we take the jamming analysis
in Chapter 4 a step further and investigate jamming attacks in fading channels. As pointed out
in [16], most of the existing jamming works, see [7], [17]-[20] and references therein, ignore the
presence of a fading channel between the jammer and the victim receiver as it simplifies the jam-
ming analysis. Although the impact of fading channels on the jamming performance has sparingly
been studied in the context of multiple-input multiple-output (MIMO) systems [16], [24]-[26],
these works addressed jamming by only considering an AWGN jamming signal against Gaussian
victim signaling and showed that equal power allocation or water filling based on the second-order
statistics of the fading channel are Nash-equilibrium strategies. However, it was recently shown
in [16] that ignoring the presence of a fading channel and/or using equal power allocation/water
filling is sub-optimal in terms of the jamming performance evaluated via the Shannon rate metric.
While [16] addresses the shortcomings of the earlier works [21]-[26], it assumes that the victim
employs Gaussian signaling schemes which are typically not used in practice. Furthermore, none
of the works that study jamming against OFDM systems, which is the preferred signaling scheme
for most wireless standards, explicitly consider the effects of a fading channel between the jammer
and the victim receiver (see [21]-[23] and references therein for more information on jamming
against OFDM systems). Hence, there is not currently a good understanding as to how a jam-
mer can effectively attack a victim that uses practical wireless signals in the presence of a fading
channel between the jammer and the victim receiver.
Therefore, we address this open question by studying jamming attacks against digital modulation
schemes in wireless fading channels. Again, we focus on the error probability metric as the Shan-
non rate metric fails to capture the effects of digital modulation schemes typically employed by
the victim. Specifically, in this chapter, we study the problem of jamming power allocation across
a fading channel under total and peak power constraints in order to maximize the error probability
of a victim receiver. As a result of the analysis in this chapter, an interesting power allocation
strategy is obtained for the jammer, which is different from equal power allocation, channel inver-
sion and water filling. Specifically, it will be shown that for a given jamming power, the power
allocation is similar to channel inversion at low victim signal power values and to water filling
at high victim signal power values. However, at medium victim signal power values, the jammer
allots more power when the channel fading is weak than when it is strong, but allots no power
when the fading is weakest. The jammer performance is also considered under several non-ideal
scenarios, which shows the benefits of employing the proposed jamming strategies over conventional
jamming techniques. Finally, the proposed jamming strategies are applicable not only to
frequency-selective fading channels but also to time-selective fading channels, and hence can be
used to optimally attack a victim across a variety of scenarios.
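As a toy illustration of the power-allocation question (not the optimal strategy derived in Chapter 5, which does not reduce to a Gaussian-noise model), consider a jammer splitting its total power across two equally likely fading states seen by a BPSK victim; the channel gains and power levels below are assumed.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def avg_ser(split, gains, snr, j_total):
    """Average BPSK SER over two equally likely fading states when a
    fraction `split` of the jamming power goes to state 0, with the
    jamming power simply treated as extra noise at the victim."""
    alloc = [split * j_total, (1 - split) * j_total]
    return sum(0.5 * q_func(math.sqrt(2 * snr / (1 + g * j)))
               for g, j in zip(gains, alloc))

gains = [1.0, 0.2]          # jammer-to-victim channel power gains (assumed)
snr, j_total = 100.0, 10.0  # victim SNR and total jamming power, linear scale
splits = [i / 1000 for i in range(1001)]
best_split = max(splits, key=lambda s: avg_ser(s, gains, snr, j_total))
```

Even this toy version shows that the best split depends jointly on the fading gains and the power levels; the chapter's analysis replaces the Gaussian-noise assumption with the actual digital-modulation error probability, which is what produces the non-obvious allocation described above.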
1.1.4 Jamming Bandits
As mentioned earlier, jamming was traditionally studied using optimization, game-theoretic,
or information-theoretic principles; see [17]-[26] and references therein. The major dis-
advantage of these studies is that they assume the jammer has a lot of a priori information about
the strategies used by the (victim) transmitter-receiver pairs, channel gains, etc., which may not
be available in practical scenarios. For instance, in Chapters 4 and 5, we analyzed jamming from
an optimization perspective and studied jamming strategies in AWGN and fading channels. How-
ever, these jamming strategies were obtained by assuming that the jammer has a priori knowledge
regarding the transmission strategy of the victim transmitter-receiver pair. While the results in
Chapters 4 and 5 shed light on the fundamental performance limits of the jammer, they cannot be
used in real-time environments due to the lack of a priori knowledge about the victim. Further,
such optimization-based techniques need to be re-programmed whenever the victim changes its
strategy, which may be a complicated procedure. Therefore, in contrast to prior work (both ours
and others), in this chapter we develop online learning algorithms that learn the optimal jamming
strategy by repeatedly interacting with the victim nodes. Essentially, the jammer must learn to act
in an unknown environment in order to maximize its total reward (e.g., jamming success rate).
In this regard, we ask “Can an intelligent jammer learn the optimal physical layer jamming strate-
gies obtained in Chapter 4, with limited to no knowledge about the victim nodes?” By learning,
we refer to the cognitive capabilities of a jammer wherein it has the ability to understand its envi-
ronment and the impact of its actions on the environment. In Chapter 4, we show that the optimal
jamming signal depends on three parameters, namely modulation scheme, signal power and the
on-off duration. While the set of modulation schemes is discrete, the signal power and the on-off
duration parameters are continuous. As will be discussed in detail in Chapter 6, traditional learning
techniques (i.e., those available in the open literature) cannot be directly employed to learn in such
mixed action spaces (discrete and continuous). The multi-armed bandit (MAB) framework lends
itself well to solve this problem as will be described in Chapter 6. However, there are no exist-
ing bandit frameworks that can be directly applied to this problem which motivated us to develop
novel learning frameworks and algorithms, novel both with respect to their application to jamming
as well as the general learning literature, to address this cognitive physical layer jamming problem.
Moreover, these algorithms come with theoretical guarantees on the jamming performance which
is vital in offensive security scenarios. Specifically, we prove that our learning algorithm converges
to the optimal (in terms of the error rate inflicted at the victim and the energy used) jamming strat-
egy. Even more importantly, we prove that the rate of convergence to the optimal jamming strategy
is sub-linear, i.e., the learning is fast in comparison to existing reinforcement learning algorithms,
which is particularly important in dynamically changing wireless environments. Also, we charac-
terize the performance of the proposed bandit-based learning algorithm against multiple static and
adaptive transmitter-receiver pairs. More details are presented in Chapter 6.
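For reference, the discretized ǫ-greedy baseline that JB is compared against in Chapter 6 (e.g., the curves in Figure 6.6) has the generic form below. The reward model, a success probability peaking at an unknown power level, is a stand-in assumption rather than the victim model used in the chapter.

```python
import random

def eps_greedy_jammer(arms, reward_fn, rounds=5000, eps=0.1, seed=0):
    """ǫ-greedy bandit over a discretized jamming action set. `arms` holds
    candidate actions (here, power levels); `reward_fn(a, rng)` returns a
    noisy reward, e.g. 1 if the attacked packet was lost and 0 otherwise."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for _ in range(rounds):
        if rng.random() < eps:
            i = rng.randrange(len(arms))                       # explore
        else:
            i = max(range(len(arms)), key=lambda k: means[k])  # exploit
        r = reward_fn(arms[i], rng)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
    return arms[max(range(len(arms)), key=lambda k: means[k])]

# toy environment: jamming success probability peaks at power 10 (assumption)
def success_prob(p):
    return max(0.0, 1.0 - abs(p - 10.0) / 10.0)

def reward_fn(p, rng):
    return 1.0 if rng.random() < success_prob(p) else 0.0

M = 11                              # discretization factor
arms = [2.0 * k for k in range(M)]  # candidate power levels 0, 2, ..., 20
best_power = eps_greedy_jammer(arms, reward_fn)
```

The weakness this baseline exhibits, and which motivates the algorithms of Chapter 6, is that its performance hinges on the discretization M: too coarse and the optimum is missed, too fine and exploration over many arms slows learning.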
1.1.5 MAC-layer jamming
Jamming all the information exchanged between the malicious nodes, for example by employing
the physical layer jamming techniques obtained in Chapters 4-6, may not always be necessary. It
was shown in [27], [28] that the jammer can perform better (say, in terms of energy efficiency)
if it accounts for the inherent structure in the data transmission. For example, in some scenarios,
jamming the control packets or pilot signals is sufficient to stop the malicious nodes from commu-
nicating with each other [27], [28]. Thus, higher layer jamming attacks either at the MAC layer
or network layer should be considered. MAC-layer attacks typically rely on the knowledge of the
protocol used by the malicious nodes and network layer attacks rely on the ability to create con-
gestion in the network by mimicking the packets that are sent by other nodes in the network [7]. In
Chapter 7, we seek to understand the optimal MAC-layer jamming attacks against an 802.11-based wireless network. Specifically, we ask: "Can an intelligent jammer learn the optimal MAC-layer
jamming strategies when it has delayed knowledge about the malicious nodes?”
In this chapter, we assume that the jammer can identify the basic MAC-layer protocol being used
by the malicious nodes, although not necessarily the full details. This can be fairly easily achieved
by observing the traffic pattern of the nodes in the environment over some time interval [29].
However, one of the main challenges still faced in studying a MAC layer jamming attack is that the
knowledge about the malicious nodes is not always available instantaneously, especially when the
jammer intends to track the changes in the victim’s strategies. Hence, in this problem, we assume
a middle ground between Chapters 4, 5 and Chapter 6 and study how efficiently and effectively
a jammer can learn the optimal jamming strategy when there is delayed knowledge about the
malicious nodes, i.e., in cases where the jammer is aware of the malicious nodes' behavior after
some time delay. The framework for delayed observations is more practically relevant, especially
in the context of wireless communications [30].
In order to answer the question raised, we will use the Markov Decision Process (MDP) framework
which is particularly useful in modeling environments that obey the Markovian property and have
to keep track of only a small number of states. By state, we refer to the condition of the environment
in this dissertation. However, as will be discussed in detail in Chapter 7, the literature on delayed
learning frameworks is immature, thereby forcing the development of an appropriate framework
that enables us to obtain the optimal MAC-layer jamming strategies. As a result of the analysis in
this chapter, we develop a novel delayed learning framework with transition-based rewards that
allows us to handle the realistic case of delayed knowledge. Using this framework, it is shown that
the jammer can learn the optimal policy. More details are presented in Chapter 7.
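The underlying MDP machinery is standard; as a reference point, textbook value iteration on a toy two-state jamming MDP is sketched below. The states, transitions, and rewards are invented for illustration and are not the 802.11 RTS-CTS model of Chapter 7, which further has to cope with delayed observations.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Textbook value iteration. P[s][a] is a list of (prob, next_state)
    pairs, R[s][a] the expected immediate reward. Returns the optimal
    state values and a greedy policy."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            best = max(R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = [max(range(len(P[s])),
                  key=lambda a: R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a]))
              for s in range(n)]
    return V, policy

# toy 2-state MDP (states: channel idle/busy; actions: 0 = wait, 1 = jam);
# all transition probabilities and rewards are invented numbers
P = [[[(1.0, 0)], [(0.8, 1), (0.2, 0)]],   # from state 0
     [[(1.0, 0)], [(1.0, 1)]]]             # from state 1
R = [[0.0, 0.5],
     [0.0, 1.0]]
V, policy = value_iteration(P, R)
```

With these numbers the greedy policy jams in both states; Chapter 7's framework instead has to learn such a policy when the transition model is unknown and the state feedback arrives only after a delay.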
1.1.6 Blind network interdiction
Network-centric architectures are increasingly gaining prominence, be it social networks or wireless
networks, as they allow for decentralized operation among various nodes without the need
for a central entity to control their communication. With the widespread deployment of such
architectures, the security aspects of the underlying networks are now a major concern. The ability
to undermine a malicious network’s communication capabilities is crucial for ensuring security in
sensitive environments. In Chapter 8, we particularly focus on attacks against networks when their
topology is unknown a priori.
Network interdiction refers to disrupting a network in an attempt to either analyze the network’s
vulnerabilities or to undermine a network’s communication capabilities. A vast majority of the
works that have studied network interdiction assume a priori knowledge of the network topology
[31]-[42]. However, such knowledge may not be available in real-time settings. For instance,
in practical electronic warfare-type settings, an attacker that intends to disrupt communication
in the network may not know the topology a priori. Hence, it is necessary to develop online
learning strategies that enable the attacker to interdict communication in the underlying network
in real time. In this chapter, we develop several learning techniques that enable the attacker to learn
the best network interdiction strategies (in terms of the best nodes to attack to maximally disrupt
communication in the network) and also discuss the potential limitations that the attacker faces in
such blind scenarios. We consider settings where a) only one node can be attacked and b) multiple
nodes can be attacked in the network. In addition to the single-attacker setting, we also discuss
learning strategies when multiple attackers attack the network, along with the limitations they face
in real-time settings. Several different network topologies are considered in this study, and we show
that under the blind settings considered in this chapter, the attacker cannot attack the network
optimally (measured in terms of the number of flows stopped) except for some simple network
topologies.
More specifically, in this chapter we show that: (a) attacking a network by relying on well-known
graph metrics, such as betweenness centrality [40], does not necessarily work for all network
topologies, (b) under blind scenarios, the learning rates cannot be improved beyond $O(|V|)$, where
$|V|$ is the number of nodes in the network, (c) under blind scenarios, multiple attackers must
collaborate at every time instant in order to learn the best set of nodes to attack in the network,
and (d) the learning performance, whether for a single attacker or multiple attackers, will depend on the
network structure and not just the number of nodes in the network. More details are presented in
Chapter 8.
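As background for point (a) above, betweenness centrality can be computed by shortest-path counting: a node scores highly when many shortest paths between other node pairs pass through it. The following is a minimal pure-Python sketch on a hypothetical 7-node "barbell" graph (two triangles joined through a bridge node); the topology is an illustrative assumption, not one of the networks studied in this dissertation.

```python
from collections import deque

# Brute-force betweenness centrality for a small undirected graph: for
# every ordered pair (s, t), count the fraction of shortest s-t paths
# passing through each intermediate node v.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4],
       4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}

def bfs(src):
    """Return (distance, shortest-path count) maps from src via BFS."""
    dist = {src: 0}
    sigma = {v: 0 for v in adj}
    sigma[src] = 1
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]  # each shortest path to u extends to w
    return dist, sigma

dist, sigma = {}, {}
for v in adj:
    dist[v], sigma[v] = bfs(v)

bc = {v: 0.0 for v in adj}
for s in adj:
    for t in adj:
        if s == t:
            continue
        for v in adj:
            if v in (s, t):
                continue
            # v lies on a shortest s-t path iff the distances add up exactly.
            if dist[s][v] + dist[t][v] == dist[s][t]:
                bc[v] += sigma[s][v] * sigma[t][v] / sigma[s][t]
for v in bc:
    bc[v] /= 2  # each undirected pair was counted in both directions

best = max(bc, key=bc.get)
print(best, bc[best])  # the bridge node dominates on this topology
```

Here the bridge node attains the highest score, matching the intuition that all inter-cluster flows pass through it; on more symmetric topologies (e.g., a ring, where every node has equal betweenness) the metric provides no guidance, which illustrates why centrality-based attacks need not work for all topologies.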
1.1.7 Jamming against wireless networks
Jamming against wireless networks (not just single nodes) has been previously addressed, albeit
from an optimization perspective, in [37], [43]-[45]. The jammer-to-flow assignment problem, i.e.,
optimally assigning jammers to stop flows in a network based on their locations and other con-
straints such as power, was considered in [37]. In [43]-[45], the problem of jammer placement
against wireless networks with the aim of disconnecting the network was studied. All these works
model a network as a graph and find the best set of nodes/edges to attack so that the network is
disconnected. While these studies indicate which nodes/links are to be attacked, they do not address
how such an attack can be realized in practice against cellular and/or WiFi-based wireless
networks. In other words, jamming techniques against wireless networks are not well understood
from a physical-layer perspective.
In Chapter 9, we analyze the impact of randomly placed jammers against a wireless network in
terms of a) the outage probability and b) the error probability of a victim receiver in the downlink
of this wireless network. We derive analytical expressions for both these metrics and discuss in
detail how the jammer network must be matched to the wireless network parameters in order to
effectively attack the victim receiver. For instance, we show that as the network loading increases,
assuming universal frequency reuse, the number of jammers that are needed to inflict a given outage
probability at the victim receiver decreases. Retransmissions are commonly used across a variety
of wireless protocols. We will show that when the wireless network uses retransmissions (in order
to improve the probability of successful communication), the number of jammers necessary to
achieve a required outage probability at the victim receiver decreases due to increased interference
among the BSs. Furthermore, we will show that the behavior of the jammer network as a function
of the BS/AP density is not obvious. In particular, an interesting concave-type behavior is seen,
which indicates that the number of jammers required to attack the wireless network must scale
with the BS density only up to a certain value, beyond which it decreases. In the context of the error
probability of the victim receiver, we study whether or not some recent results related to jamming in
the point-to-point link scenario can be extended to the case of jamming against wireless networks.
As a result of the analysis in this chapter, we show that a fixed number of jammers can tip a
wireless network, i.e., can significantly reduce the probability of successful communication in this
wireless network. A similar analysis is performed in the context of the error probability of the
victim receiver. Specifically, we will show that when the small scale fading effects are averaged
out, then the results in Chapter 4 can be extended to the case of jamming against wireless networks
and that significant gains can be achieved by using modulation-based jamming signals (i.e., the
findings from Chapter 4) when compared to AWGN jamming.
1.1.8 List of relevant publications
This dissertation is based on the following publications:
1. S. Amuru and C. R. C. M. da Silva, “A blind pre-processor for modulation classification
applications in frequency-selective non-Gaussian channels,” IEEE Trans. Commun., vol.
63, no. 1, pp. 156-169, Jan. 2015.
2. S. Amuru and R. M. Buehrer, “Optimal jamming against digital modulation,” IEEE Trans.
Inf. Forensics and Security, vol. 10, no. 10, pp. 2212-2224, Oct. 2015.
3. S. Amuru, C. Tekin, M. van der Schaar, and R. M. Buehrer, “Jamming bandits - a novel
learning method for optimal jamming,” submitted to IEEE Trans. Wireless Commun., avail-
able at arXiv:1411.3652.
4. S. Amuru and R. M. Buehrer, “On jamming power allocation against OFDM signals in
fading channels,” submitted to IEEE Trans. Inf. Forensics and Security, Aug. 2015.
5. S. Amuru, R. M. Buehrer, and M. van der Schaar, “Blind network interdiction strategies - a
learning approach,” submitted to IEEE Trans. Cognitive Commun. Netw., Sept. 2015.
6. S. Amuru, H. S. Dhillon, and R. M. Buehrer, “On jamming attacks against wireless networks,”
submitted to IEEE Trans. Wireless Commun., Sept. 2015.
7. S. Amuru and R. M. Buehrer, “Optimal jamming using delayed learning,” in Proc. IEEE
Military Comm. Conf., (Milcom), Baltimore, MD, Oct. 2014, pp. 1528-1533.
8. S. Amuru and R. M. Buehrer, “Optimal jamming strategies in digital communications-
impact of modulation,” in Proc. IEEE Global Commun. Conf., Dec. 2014.
9. S. Amuru, C. Tekin, M. van der Schaar, and R. M. Buehrer, “A systematic learning method
for optimal jamming,” in Proc. Intern. Conf. Commun., Jun. 2015.
Chapter 2
Background
In this chapter, we briefly introduce two concepts that are used in this dissertation, namely a)
Reinforcement learning and the associated theory of Markov Decision Processes and b) Multi-
armed bandits.
2.1 Reinforcement Learning and Markov Decision Processes
Reinforcement learning is a technique that allows an agent to modify its actions (without any
supervision) by repeatedly interacting with the environment and is commonly used to address
sequential decision making. A reinforcement learning task that satisfies the Markov property¹
is called a Markov decision process, or MDP [47]. An MDP is defined by a tuple $(S, A, P, R)$,
where $S$ is the set of all possible environment states and $A$ is the set of all possible actions that
the agent can perform in any environment state. For instance, from a jammer’s perspective, the
environment states could be Transmission/No Transmission, reflecting the cases where a packet is
exchanged between the transmit-receive pair or where the pair is idle, and the actions of the jammer
could be Jam/Don’t Jam. $P$ is the state transition probability matrix that governs the dynamics of
the environment; its entries are the transition probabilities $p(s' \mid s, a)$, which indicate the
probability that the environment moves to state $s'$ when action $a$ is executed in state $s$.
Finally, $R$ denotes the $|S| \times |A|$ reward matrix whose entries $r(s, a)$ indicate the reward
(for example, energy expended) obtained in state $s$ when action $a$ is executed. Here, $|S|$ and
$|A|$ denote the cardinalities of the sets $S$ and $A$, respectively.

In the traditional RL framework, an agent observes the current state of the environment $s$ and
chooses an action $a$. An optimal policy (a functional mapping between states and the actions that
can be performed in these states) is one that maximizes the total expected reward, which is more often
¹ The Markov property refers to the memoryless property of a stochastic process. More specifically, the conditional
probability distribution of the future states of the random process depends only on the present state and not on the
states that occurred earlier. Such a stochastic process is also known as a Markov process.
than not discounted by a factor $\gamma \in [0, 1)$ to account for an infinite time horizon. The objective of
an RL algorithm is therefore to find an optimal policy $\Pi$ (a mapping between states and actions) that
maximizes the cumulative discounted reward
\[
R(t) = \sum_{k=0}^{\infty} \gamma^k r\big(s_{t+k},\, a_{t+k} = \Pi(s_{t+k})\big), \tag{2.1}
\]
where $s_t, a_t$ indicate the state and action taken at time $t$ [46]. The value of a policy $\Pi$ when the
environment is in state $s$ is given by
\[
V^{\Pi}(s) = \mathbb{E}_{\Pi}\left( \sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k} \mid s_t = s) \right), \tag{2.2}
\]
where $\mathbb{E}_{\Pi}$ indicates averaging over all possible state transitions when the agent
follows the policy $\Pi$. Several algorithms exist to find an optimal policy $\Pi^*$, such as value iteration
and policy iteration (which are useful when $P$ is known a priori); for more details, please see [47].
For ease of analysis, we assume a stationary model (the state transition matrix is independent of time)
and ignore the time parameter $t$ hereafter.
When the underlying MDP model is known, policy evaluation (finding the value of a given policy)
can also be done via matrix inversion (especially for small MDPs, i.e., MDPs with a small state-
action space) [47]. Specifically,
\[
V^{\Pi}(s) = r(s, a = \Pi(s)) + \mathbb{E}_{\Pi}\left( \sum_{k=1}^{\infty} \gamma^k r(s_k, a_k \mid s) \right)
= r(s, a = \Pi(s)) + \gamma \sum_{s'} p(s' \mid s, a = \Pi(s))\, V^{\Pi}(s').
\]
Thus, writing the above set of equations for all possible states $s \in S$ in the MDP, we have
\[
\bar{V}^{\Pi} = \bar{r}^{\Pi} + \gamma P^{\Pi}(s' \mid s)\, \bar{V}^{\Pi}
\;\Longrightarrow\;
\bar{V}^{\Pi} = \left( I - \gamma P^{\Pi}(s' \mid s) \right)^{-1} \bar{r}^{\Pi}, \tag{2.3}
\]
where $\bar{V}^{\Pi}$ is the $|S| \times 1$ vector of values of the policy $\Pi$ in states $s \in S$,
$\bar{r}^{\Pi}$ is the $|S| \times 1$ vector of rewards obtained in states $s \in S$ using policy $\Pi$,
$P^{\Pi}(s' \mid s)$ is the $|S| \times |S|$ state transition probability matrix when the agent uses
policy $\Pi$, and $I$ is an identity matrix of appropriate dimensions.
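Equation (2.3) can be exercised directly on a toy example. The sketch below evaluates a fixed policy on a hypothetical two-state MDP (states named after the Transmission/Idle example above); all probabilities and rewards are invented for illustration.

```python
# Policy evaluation for a toy 2-state MDP via the matrix-inversion
# formula V = (I - gamma * P)^(-1) r of equation (2.3).
# The transition probabilities and rewards are illustrative assumptions.

gamma = 0.9

# Under a fixed policy Pi: state 0 = "Transmission", state 1 = "Idle".
P = [[0.7, 0.3],   # transition probabilities out of state 0
     [0.4, 0.6]]   # transition probabilities out of state 1
r = [1.0, 0.0]     # expected one-step reward in each state under Pi

# Form A = I - gamma * P and invert the 2x2 matrix analytically.
a = 1 - gamma * P[0][0]; b = -gamma * P[0][1]
c = -gamma * P[1][0];    d = 1 - gamma * P[1][1]
det = a * d - b * c
V = [( d * r[0] - b * r[1]) / det,
     (-c * r[0] + a * r[1]) / det]

# Sanity check: V must satisfy the Bellman equation V = r + gamma * P V.
for s in range(2):
    bellman = r[s] + gamma * sum(P[s][sp] * V[sp] for sp in range(2))
    assert abs(V[s] - bellman) < 1e-9

print([round(v, 4) for v in V])
```

The "Transmission" state is worth more than the "Idle" state because it is the only state that pays a reward under this (assumed) policy.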
In general, when no policy is given a priori, for any set of states $S$ and set of actions $A$,
the optimal value function can be written as
\[
V^*(s) = \max_{a \in A} \left( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \right), \tag{2.4}
\]
which indicates the optimal value that can be associated with a state $s \in S$ (which is known by
exploring all actions $a \in A$ in the state $s$). Along similar lines, we define a new state-action
function $Q(s, a)$, which captures the quality of an action taken in a particular state, as follows:
\[
Q(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a' \in A} Q(s', a'), \tag{2.5}
\]
which gives the optimal value function as $V^*(s) = \max_{a \in A} Q(s, a)$ and helps to find an optimal
policy $\Pi^*$ as²
\[
\Pi^*(s) = \operatorname*{arg\,max}_{a \in A} \, Q(s, a). \tag{2.6}
\]
It should be clear by now that all of the above equations can be used to find the optimal set of actions
only when $P$, the transition probability matrix, is known or can be estimated. Techniques
that rely on knowledge of $P$ are commonly known as indirect learning or planning algorithms
[47]. However, $P$ is usually unknown in dynamic environments and can be difficult to estimate in
real-time settings. Since the value of a state is defined as the expectation of the random rewards
obtained when the MDP is started from the given state, a direct way of estimating this value is to
average over multiple independent realizations of the MDP that start from the given
state, i.e., the Monte Carlo technique. Unfortunately, the variance of the returns can be high, which
can result in poor estimates of the Q-function (because it is possible to obtain different estimates
for the same state-action pair, for example due to the wireless channel conditions). To address this,
an online learning technique popularly known as Q-learning [47] was developed, which updates
the state-action function, using the next state $s'$ observed after executing action $a$ in state $s$, as
\[
Q_t(s, a) = (1 - \alpha_t)\, Q_{t-1}(s, a) + \alpha_t \left[ r(s, a) + \gamma \max_{a'} Q_{t-1}(s', a') \right], \tag{2.7}
\]
which is shown to converge to the optimal solution when the learning rate $\alpha_t \in (0, 1]$ satisfies
$\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. The proof of convergence is based on relating (2.7) to an
ordinary differential equation with a fixed-point solution and on the theory of stochastic approximation
[48].
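The update (2.7) can be sketched as a short, model-free simulation. The toy two-state jam/don't-jam MDP below is an illustrative assumption; note that the learner never touches the transition probabilities, which are used only to generate samples.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on an assumed
# two-state jam/don't-jam MDP (illustrative numbers).
random.seed(0)
gamma, eps = 0.9, 0.1
p = {0: {0: [0.2, 0.8], 1: [0.9, 0.1]},   # simulator only: unknown to learner
     1: {0: [0.5, 0.5], 1: [0.3, 0.7]}}
r = {0: {0: 1.0, 1: 0.0}, 1: {0: -0.5, 1: 0.0}}

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
counts = {(s, a): 0 for s in (0, 1) for a in (0, 1)}
s = 0
for _ in range(200_000):
    # epsilon-greedy: explore with probability eps, else exploit.
    if random.random() < eps:
        a = random.choice((0, 1))
    else:
        a = max((0, 1), key=lambda x: Q[(s, x)])
    s_next = random.choices((0, 1), weights=p[s][a])[0]
    counts[(s, a)] += 1
    alpha = 1.0 / counts[(s, a)]  # satisfies sum(alpha)=inf, sum(alpha^2)<inf
    target = r[s][a] + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target  # update (2.7)
    s = s_next

policy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in (0, 1)}
print(policy)
```

With these assumed rewards, the learned greedy policy jams during transmissions and stays quiet when the channel is idle, recovering what a planner with full knowledge of $P$ would compute.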
Moreover, when we are concerned with online learning problems, finding a balance between ex-
ploration (trying actions that may yield higher rewards) and exploitation (using the best actions
learned thus far) becomes important, given the finite available resources. $\epsilon$-greedy is a commonly
used learning algorithm in which an agent explores the actions (in any state) with probability $\epsilon$ and
exploits the existing knowledge with probability $1 - \epsilon$. In Q-learning, actions are chosen as per
an exploration-exploitation schedule that is decided a priori such that all actions can be tried in all
possible environment states. Thus, such learning algorithms can guarantee optimality only asymp-
totically, a limitation that becomes more pronounced as the size of the MDP grows. While the theory
is mature for the case of finite MDPs, efficient exploration, for example, is still being studied in the
case of large MDPs (this problem
² Note that while the optimal value function $V^*(s)$ is unique, the optimal policy is not necessarily unique [47].
has been addressed well in the context of multi-armed bandit problems, which is discussed next).
Finite-time bounds that indicate the rate of convergence to the optimal policy in the case of finite
MDPs have been studied [49]. For more details on reinforcement learning, please see [46]-[49].
2.2 Multi-armed Bandits
Multi-armed bandit problems are a sub-class of sequential decision making problems that are con-
cerned with allocating the available resources among several alternative arms/actions [50]-[53].
For example, such algorithms are most widely used in the context of clinical trials where several
treatments are applied to patients in a sequential manner, and patients are dynamically allocated
to the best treatment [50]. A single-armed bandit process is an arm defined by two random
sequences, $s(n)$ and $r(s(n))$, where $s(n)$ is the state of the arm after it has been played $n$
times and $r(s(n))$ is the instantaneous reward obtained after the arm has been played $n$ times.
Specifically, it is assumed that the state of the arm evolves as
$s(n) = f_{n-1}(s(0), s(1), \ldots, s(n-1), w(n-1))$,
where $f_{n-1}(\cdot)$ is known a priori and $w(n)$ is a sequence of independent random variables that are
also independent of $s(n)$ and come from a known statistical distribution. A multi-armed bandit
process is defined as a collection of $K$ such single-armed bandit processes³ and a controller/player
that must choose one among these $K$ bandit processes at every time instant. The decisions are
made such that the average cumulative discounted reward is maximized. It is therefore a sequential
decision problem, since the decision made at every time instant depends on what has happened thus
far, and the player thereby faces the exploration versus exploitation dilemma when choosing arms.
In the context of the MDPs introduced earlier⁴, a policy now refers to the sequence of arms chosen
at every time instant, and an optimal policy is one that chooses the best arm at every time instant.
The goal of this problem is to maximize
\[
J = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \sum_{k=1}^{K} r_k\big(s_k(n_k(t)), u_k(t)\big)
\,\middle|\, s_1(0), s_2(0), \ldots, s_K(0) \right], \tag{2.8}
\]
where $\gamma \in (0, 1]$ is the discount factor, $n_k(t)$ is the number of times arm $k$ has been chosen until
time $t$, and $u_k(t)$ is a $1 \times K$ vector that has a 1 in the $k$th position and 0 elsewhere if the $k$th arm is
chosen.
It is easy to see that this problem formulation is similar to that of MDPs and can be solved using
regular dynamic programming techniques. However, it has been shown that this $K$-dimensional
problem can be reduced to $K$ one-dimensional problems and that the optimal policy is
³ For ease of analysis, $K$ is assumed to be finite. The case of a continuous state space $S$ is usually handled by
discretization of the state space. More details about such continuous spaces will be presented as part of our solution
to (Q2).
⁴ While the concept of bandit processes was initially developed in the context of a Markovian evolution process
(similar to the formulation of an MDP) [50], it was later generalized to all generic stochastic processes described
by $s(n) = f_{n-1}(s(0), s(1), \ldots, s(n-1), w(n-1))$ in [51].
of an index type (i.e., an index known as the Gittins index is assigned to each of the arms, and the
arm with the highest index value is chosen at every time instant) under certain constraints that
are satisfied by MAB problems [50]. The specific constraints are: a) all arms are independent,
b) only one arm is played at any time instant, c) only the state of the chosen arm changes,
according to $f(\cdot)$, while the rest of the arms are frozen, and d) an arm gives a reward only when
it is operated. However, these Gittins indices can be evaluated only with knowledge of
the evolution process $f(\cdot)$ and perfect knowledge of the reward functions [50]-[53]. Further, as
discussed in [52], this formulation is most useful and computationally feasible for obtaining optimal
policies when the arms evolve according to a Markov process.
An alternative formulation for the MAB problem, that does not assume any parametric forms
for the state and reward functions and is based on formulating a performance criterion known
as “learning loss” or “regret” was primarily investigated in [54]-[56]. Regret is defined as the
difference between the expected reward that can be obtained by an oracle policy that has some
or complete knowledge about the statistical distribution of the arms and their rewards, and the
expected reward of the player’s policy. The most commonly used oracle policy is the best single
action policy that is optimal among all policies which choose only one arm over the entire time
horizon. This type of regret is also known as the “weak regret” [57], which will be used throughout
this dissertation. Formally, weak regret is defined as
\[
R_n^W = \max_{k=1,2,\ldots,K} \mathbb{E}\left( \sum_{t=1}^{n} r_{k,t} - \sum_{t=1}^{n} r_{I_t,t} \right), \tag{2.9}
\]
where $I_t$ indicates the arm chosen at time $t$, $r_{k,t}$ is the reward obtained at time $t$ by playing the
$k$th arm, and the expectation is with respect to the random choices made by the player’s policy and
the unknown evolution process of the arms (in other words, the random rewards obtained from the
arms).
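The weak regret (2.9) can also be measured empirically against the best single arm in hindsight. A small sketch follows, assuming three Bernoulli arms with invented success probabilities and an $\epsilon$-greedy player; none of the numbers come from the dissertation.

```python
import random

# Empirical weak regret of an epsilon-greedy player on three Bernoulli
# arms (illustrative success probabilities).
random.seed(1)
mu = [0.3, 0.5, 0.8]          # unknown mean reward of each arm
K, n, eps = len(mu), 20_000, 0.05

pulls = [0] * K
means = [0.0] * K             # empirical mean reward per arm
player_reward = 0.0
for t in range(n):
    if random.random() < eps or min(pulls) == 0:
        arm = random.randrange(K)                    # explore
    else:
        arm = max(range(K), key=lambda k: means[k])  # exploit
    reward = 1.0 if random.random() < mu[arm] else 0.0
    pulls[arm] += 1
    means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean
    player_reward += reward

# Weak regret: expected reward of the best single arm minus the player's.
weak_regret = max(mu) * n - player_reward
print(round(weak_regret / n, 3))  # per-round regret
```

Because $\epsilon$ is held fixed, the player keeps paying a constant per-round exploration cost; schedules that decay $\epsilon$ (or index policies such as UCB) shrink this cost over time, which is the behavior the regret bounds in [54]-[57] formalize.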
arms). On the other hand, the concept of the “strong regret” defined as
RSn = E maxk=1,2,...,K
( n∑