  • Intelligent Approaches for Communication Denial

    SaiDhiraj Amuru

    Dissertation submitted to the Faculty of the

    Virginia Polytechnic Institute and State University

    in partial fulfillment of the requirements for the degree of

    Doctor of Philosophy

    in

    Electrical Engineering

    R. Michael Buehrer, Chair

    Claudio R. C. M. da Silva

    Ravi Tandon

    Dhruv Batra

    Inyoung Kim

    September 21, 2015

    Blacksburg, Virginia

    Keywords: Communication, Denial, Jamming, Learning

    Copyright 2015, SaiDhiraj Amuru

  • Intelligent Approaches for Communication Denial

    SaiDhiraj Amuru

    (ABSTRACT)

    Spectrum supremacy is a vital part of security in the modern era. In the past 50 years, a great

    deal of work has been devoted to designing defenses against attacks from malicious nodes (e.g.,

    anti-jamming), while significantly less work has been devoted to the equally important task of

    designing effective strategies for denying communication between enemy nodes/radios within an

    area (e.g., jamming). Such denial techniques are especially useful in military applications and

    intrusion detection systems where untrusted communication must be stopped. In this dissertation,

we study these offensive attack procedures, collectively termed communication denial. The

    communication denial strategies studied in this dissertation are not only useful in undermining the

    communication between enemy nodes, but also help in analyzing the vulnerabilities of existing

    systems.

A majority of the works that address communication denial assume that knowledge about the

enemy nodes is available a priori. However, recent advances in communication systems create

the potential for dynamic environmental conditions in which it is difficult, and most likely not even

possible, to obtain a priori information regarding the environment and the nodes that are present

    in it. Therefore, it is necessary to have cognitive capabilities that enable the attacker to learn

    the environment and prevent enemy nodes from accessing valuable spectrum, thereby denying

    communication.

In this regard, we ask the following question in this dissertation: “Can an intelligent attacker

learn and adapt to unknown environments in an electronic warfare-type scenario?” Fundamentally

speaking, we explore whether existing machine learning techniques can be used to address

such cognitive scenarios and, if not, what missing pieces will enable an attacker to

achieve spectrum supremacy by denying an enemy the ability to communicate. The first task in

    achieving spectrum supremacy is to identify the signal of interest before it can be attacked. Thus,

we first address signal identification, specifically modulation classification, in practical wireless

environments where the interference is often non-Gaussian. Upon identifying the signal of

interest, the next step is to effectively attack the victim signals in order to deny communication. We

present a rigorous fundamental analysis regarding the attacker’s performance, in terms of achieving

    communication denial, in practical communication settings. Furthermore, we develop intelligent

    approaches for communication denial that employ novel machine learning techniques to attack the

    victim either at the physical layer, the MAC layer, or the network layer. We rigorously investigate

whether or not these learning techniques enable the attacker to approach the fundamental

performance limits achievable when an attacker has complete knowledge of the environment. As a result

    of our work, we debunk several myths about communication denial strategies that were believed

    to be true mainly because incorrect system models were previously considered and thus the wrong

    questions were answered.

  • Dedication

    To my parents (Sudhakar and Sai Sudha), my sister (Sai Deepika), my brother-in-law (Raghu

    Pavan) and my newborn niece.


  • Acknowledgments

    Finally, after several thousand cups of coffee, the time has come.1 I have been waiting for this

moment—to write the Acknowledgements section—for quite some time, even more than my

dissertation. Such is the impact various people have had on me over the years. Although it cannot be

    described in a few words, I will make an attempt.

    I would like to thank God for his blessings and for giving me and my family strength to pass

    through various difficulties and to finish this Ph.D. journey. It would not have been possible to

    finish this dissertation without all the sacrifices my family made. They supported me in every

    possible way all throughout my life and continue making an impact on me every single day. Thanks

    for all the love.

    I am very fortunate to have worked with several people during my Ph.D. Firstly, I owe my deepest

    gratitude to my advisor Dr. Buehrer for giving me an opportunity to be part of his group and for

    making sure I did not get lost at any point during my Ph.D. Thanks for being patient and correcting

    all the mistakes I made during these years and for pushing me to solve challenging problems. His

unique skill in identifying important research problems, attention to detail, and deep knowledge

of any problem have greatly helped in improving my research capabilities. You have always

    been supportive of my work even at times when I doubted myself. Thank you for guiding me

in the right direction not only in my graduate studies, but also for being a person I can

always look up to for life advice. I would be happy if I can put into practice all your teachings, be

    it research or otherwise, and become as dedicated, contented, and as committed as you are.

    Dr. Claudio da Silva was the reason I came to VT. He provided me immense support over the

years, more than just as an advisor, as a friend. His understanding of his students, especially the

difficulties faced by an international student, and his help in settling down and making a head start in

my Ph.D. were the stepping stone for my graduate studies. All our phone discussions, despite us

    being on opposite ends of the country, have motivated me to do better work and aim higher always.

    I will forever be indebted to him for all the encouragement that he provided and for also making

sure from time to time that I was doing well.

Dr. Ravi Tandon is my go-to guru for brainstorming about any research problem. He is full of

energy at any time during the day (and also during the night). He introduced me to information

theory, which, to this day, I am still afraid of. Thanks for patiently explaining the various intricacies in

    1Thanks to the folks at Next Door Bake Shop for pumping caffeine into my body and helping me finish my Ph.D.


a variety of problems that we worked on together. Working with him has helped me learn the ways

of research and has significantly improved my problem-solving skills. All the discussions during

    our coffee breaks have taught me a lot about the academic world. It has been a very memorable

    and pleasant experience to have collaborated with him.

I would like to thank Dr. Batra for his machine learning course, which sowed the seeds for the

    learning components of this dissertation. Thanks for being approachable and for the career advice

    you have provided me. Thanks to Dr. Kim for the Bayesian statistics class and also for providing

    valuable feedback on my research contributions.

    Going to UCLA during the Summer of 2014 was one of the best decisions made during my Ph.D.

    I would like to thank Dr. Mihaela for hosting me at UCLA during this time and for the wonderful

    collaboration from thereon. Thanks for helping me explore the crazy world of learning. Your

perseverance still amazes me. I would be very happy if I can be at least 1% as dedicated and

motivated as you are about venturing into new research fields. Dr. Cem and Dr. Xiao have been

    great teachers during my stay at UCLA. I am glad these teachings have resulted in publications.

    I worked with Dr. Harpreet during the final stages of my Ph.D. I am very glad he chose to join

VT. My association with him dates back to the days when I first joined Dr. Buehrer’s group.

His advice as a former student of Dr. Buehrer was very helpful in succeeding in my Ph.D. Working

    with him on stochastic geometry-related problems and also successfully organizing the W@VT

    seminars has been a great learning experience.

    Dr. Gautham, thanks for helping me hold on to the ropes during my first year at VT. You have

    been an awesome mentor and a great friend over the years and have always provided me the right

    suggestions while taking critical decisions at various times over the last four years.

    The rowdy bunch at the Wireless @ VT lab - Daniel, Matt, Kevin, Hilda, Chris Headley, Chris

    Phelps, Reza, Javier, Joe, and Mahi have made this journey very special. I am glad to have been

    part of all the fun, pranks, fantasy leagues, lunches, coffee breaks, game nights etc. The “wall”

    will never be forgotten nor will the hardships you guys gave me for being the student chair of

W@VT. Daniel, Marc Lichtman and Jeff Poston, thanks for proofreading my manuscripts several

    times. Nancy, thanks for making the lab a wonderful habitat, a place where I spent most of my

    time during the last four years. Being the student chair and organizing seminars has truly helped

    me appreciate the work Nancy and Hilda do for W@VT.

    All the members of Shawnee Theatre - Sriram, Viru, Deepak, Sarvesh, KC, Prasad and Aproov,

    thank you for helping me stay sane during my Ph.D. journey and for all the awesome fun, dancing,

cooking sessions we enjoyed together. Thanks to Varuni, Vishwas and Aditya for the various

stimulating and intellectual discussions. Sriram, Varuni, Himanshu, Deepak, and Emily have helped

    me stay healthy over the years with their mouth-watering dishes. Karteek and Santhosh, my friends

    from IIT, thanks for being just a call away and for talking to me in times of stress. Lakshman, Avik

    Dayal, Avik Sengupta, Mehrnaz, and several others, I am thankful for all the great times we shared.


  • Contents

1 Introduction
  1.1 Contributions
    1.1.1 Modulation classification
    1.1.2 Optimal jamming in AWGN channels
    1.1.3 Jamming in fading channels
    1.1.4 Jamming Bandits
    1.1.5 MAC-layer jamming
    1.1.6 Blind network interdiction
    1.1.7 Jamming against wireless networks
    1.1.8 List of relevant publications

2 Background
  2.1 Reinforcement Learning and Markov Decision Processes
  2.2 Multi-armed Bandits

3 A Blind Pre-Processor for Modulation Classification Applications in Frequency-Selective Non-Gaussian Channels
  3.1 Introduction
  3.2 System Model
  3.3 Blind Gibbs Sampling-Based Pre-Processing Stage
    3.3.1 Short introduction to Gibbs sampling
    3.3.2 Superconstellation
    3.3.3 Prior pdfs for the unknown parameters
    3.3.4 Marginal posterior pdfs
    3.3.5 Summary and a note on complexity
  3.4 Modulation Classification
  3.5 Numerical Results
    3.5.1 Pre-processing stage
    3.5.2 Numerical classifier
    3.5.3 Approximated likelihood classifier
    3.5.4 Carrier frequency offset
  3.6 Conclusion

4 Optimal Jamming against Digital Modulation
  4.1 Introduction
  4.2 System Model
    4.2.1 Motivation
  4.3 Perfect Channel Knowledge
    4.3.1 Optimum Jamming Signal Distribution
    4.3.2 Analysis against M-QAM victim signals
  4.4 Factors that mitigate jamming
    4.4.1 Non-Coherent Jamming
    4.4.2 Symbol Timing Offset
    4.4.3 Signal Level Mismatch
  4.5 Jamming an OFDM Signal
  4.6 The Case of Multiple Jammers
    4.6.1 Results
  4.7 Conclusion

5 On Jammer Power Allocation Against OFDM Signals in Fading Channels
  5.1 Introduction
  5.2 System Model
    5.2.1 Assumptions
  5.3 Jamming Strategies in Fading Channels
    5.3.1 Optimal power allocation
    5.3.2 Other power allocation strategies
    5.3.3 Approximately optimal jamming power allocation
    5.3.4 Summary
  5.4 Numerical Results
    5.4.1 Power allocation
    5.4.2 Jamming performance
    5.4.3 Approximately optimal jamming solution performance
    5.4.4 Factors that affect jamming
  5.5 Conclusion

6 Jamming Bandits - A Novel Learning Method for Optimal Jamming
  6.1 Introduction
  6.2 System Model
  6.3 Jamming against a Static Transmitter-Receiver Pair
    6.3.1 Set of actions for the jammer
    6.3.2 MAB formulation
    6.3.3 Proposed Algorithm
    6.3.4 Upper bound on the regret
    6.3.5 High confidence bounds
    6.3.6 Improving convergence via arm elimination
  6.4 Learning Jamming Strategies against a Time-Varying User
    6.4.1 Upper bound on the regret
  6.5 Numerical Results
    6.5.1 Fixed user strategy
    6.5.2 Jamming performance against an adaptive victim
    6.5.3 Multiple victims
    6.5.4 A note on the assumptions
  6.6 Conclusion

7 Optimal Jamming using Delayed Learning
  7.1 Introduction
  7.2 Learning Framework
    7.2.1 Delayed Learning
    7.2.2 A Novel Delayed Learning Framework with Transition-based Rewards
  7.3 Jamming via Delayed Learning
    7.3.1 Protocol Description
    7.3.2 Jamming Strategies
    7.3.3 Feedback Signals
  7.4 Numerical Results
    7.4.1 Learning the optimal policy: MDP model and ρ known
    7.4.2 Intuition about the optimal policy
    7.4.3 Learning ρ and the optimal policy: MDP model known
    7.4.4 Learning the MDP model, ρ and the optimal policy
  7.5 Conclusion

8 Blind Network Interdiction Strategies - A Learning Approach
  8.1 Introduction
  8.2 System Model and Problem Formulation
    8.2.1 Victim Network Model
    8.2.2 Flow model
    8.2.3 Attack Model
  8.3 Single-Node Attack – Strategies and Analysis
    8.3.1 Benchmark Strategies (when the attacker has topology knowledge)
    8.3.2 Blind strategies (when the attacker does not have topology knowledge)
    8.3.3 Random flows
    8.3.4 Notes on attack performance
    8.3.5 Learning rates in blind scenarios
  8.4 Results - Single Node Attack Scenario
    8.4.1 Fixed flows
    8.4.2 Random flows
  8.5 Multiple Node Attack Strategies
    8.5.1 Single attacker
    8.5.2 Multiple attackers
  8.6 Conclusion

9 On Jamming Attacks against Wireless Networks
  9.1 Introduction
  9.2 System Model
  9.3 Outage probability of the victim receiver
  9.4 Error Probability
    9.4.1 PEP derivation
    9.4.2 Gaussian-Hermite quadrature approximation
    9.4.3 ASEP Evaluation
  9.5 Results
    9.5.1 Outage Probability
    9.5.2 Error Probability
    9.5.3 Limitations and Future Work
  9.6 Conclusion

10 Conclusions

Bibliography

  • List of Figures

3.1 Realization of the samples obtained in the estimation process of h1, h2, τ, and λ2. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each Gibbs sampling iteration.

3.2 Correlation among the samples of σ1² (square) and among the samples of the real part of h1 (circle) after the burn-in period.

3.3 Realization of the samples obtained in the estimation process of τ for different resolution factor (OS) values. The dotted line represents the true value of τ. OS is set to 30 (dashed), 50 (dash-dot) and 100 (bold).

3.4 Average normalized variance of the error in the estimation of σ1² and σ2². Number of observed symbols equal to 100 (circle), 300 (square), and 500 (diamond). The average normalized variance in the estimate X̂ of X is defined as Var[X − X̂]/X.

3.5 Normalized MSE in the estimation of σ1² and σ2² when the modulation scheme of the received symbols is known and is either BPSK, QPSK, 8 PSK, 16 QAM, 32 QAM, or 64 QAM. The average normalized MSE in the estimate X̂ of X is defined as E[(X − X̂)²]/X.

3.6 Realization of the samples obtained in the estimation process of the real part of h1, h2, and h3. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each iteration.

3.7 Realization of the samples obtained in the estimation process of σ1², σ2², and σ3². The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each iteration.

3.8 Probability of correct classification of the numerical classifier for different numbers of observed symbols (750, 1000, and 1250). Clairvoyant classifier uses 1250 symbols. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 3.

3.9 Probability of correct classification of the numerical classifier for different values of cth. Number of observed symbols: 1250. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 2.

3.10 Probability of correct classification of the numerical classifier for the case when the values of L or N are over- or under-estimated. The correct values of L and N are 2. Number of observed symbols: 750. Set of possible modulation schemes: BPSK and QPSK.

3.11 Probability of correct classification of the approximated likelihood classifier for different numbers of symbols used by the pre-processing stage (300 or 500) and by the classifier (K=750 or K=1000). L = 3.

3.12 Probability of correct classification of the approximated likelihood classifier for different numbers of symbols used by the pre-processing stage (500 or 1000) and a fixed number of symbols used for classification (K=1000). Set of possible modulation schemes: BPSK, QPSK, 8 PSK, 16 QAM, 32 QAM, and 64 QAM. L = 3.

3.13 Probability of correct classification of the approximated likelihood classifier for different values of β. Number of symbols used for both estimation and classification is equal to 1000. Set of possible modulation schemes: BPSK, QPSK, 8 PSK, and 16 QAM. L = 2.

3.14 Realization of the samples obtained in the estimation process of τ and δf for the case when training data is available for parameter estimation. Length of the training sequence is 50 symbols. Carrier frequency offset is 0.0045. The dotted lines represent the true values of the parameters being estimated and the bold lines are the samples obtained in each Gibbs sampling iteration.

3.15 Probability of correct classification of the approximated likelihood classifier for the case when the received symbols suffer phase rotation due to carrier frequency offset. Two carrier frequency offset values are considered (0.0045 and 0.01). Number of symbols used for estimation and classification is equal to 50 and K=300, respectively. Set of possible modulation schemes: BPSK and QPSK. L = 3.

4.1 Comparison of various jamming techniques against a 16-QAM modulated victim signal, JNR = 10 dB.

4.2 Comparison of jamming techniques against a 16-QAM victim signal in a non-coherent (random phase offset) scenario, JNR = 10 dB.

4.3 Comparison of jamming techniques against a 16-QAM victim signal in the presence of timing synchronization errors, JNR = 10 dB.

4.4 Comparison of jamming techniques against a 16-QAM victim signal in the presence of signal level mismatch, JNR = 10 dB.

4.5 Comparison of jamming techniques against an OFDM-modulated 16-QAM victim signal, JNR = 10 dB.

4.6 Comparison of jamming techniques against an OFDM-modulated 16-QAM victim signal in the presence of a frequency offset, JNR = 10 dB.

4.7 Comparison of jamming techniques when multiple jammers attack a single 16-QAM modulated victim signal, JNR = 10 dB.

5.1 Power allocations for an AWGN jammer against an OFDM-based 16-QAM victim signal, JNR = 10 dB, SNR = 15 dB; 52 out of Nsc = 64 subcarriers are shown. The solid lines indicate the channel power levels across the OFDM subcarriers. The optimal power allocation obtained by solving (5.6) is seen to be different from channel inversion, water-filling and capacity minimization-based power allocations.

5.2 Performance comparison of the various power allocation strategies when a pulsed AWGN jamming signal is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB.

5.3 Performance comparison of the various power allocation strategies when a pulsed QPSK jamming signal is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB.

5.4 Performance comparison of the approximately optimal jamming power allocation in (5.14) with the optimal, water-filling and channel inversion-based power allocations; pulsed AWGN jamming is used against an OFDM-based 16-QAM modulated victim signal, JNR = 10 dB. The approximately optimal power allocation strategy (diamond marker) performs nearly as well as the optimal power allocation strategy (triangle marker).

5.5 Jamming performance against an OFDM-based 16-QAM modulated victim signal with erroneous channel knowledge, JNR = 10 dB.

5.6 Jamming performance against an OFDM-based 16-QAM modulated victim signal in the presence of a frequency offset, JNR = 10 dB.

5.7 Jamming performance when the jammer is uncertain about the victim’s modulation scheme, JNR = 10 dB.

5.8 Jamming performance when the jammer is uncertain about the victim’s modulation scheme and when the victim’s channel {hk}, k = 1, . . . , Nsc, is not compensated prior to transmission, JNR = 10 dB.

5.9 Empirical KL divergence measure between a QPSK modulated jamming signal in the presence of a carrier frequency offset ε (normalized value) and an AWGN jamming signal.

6.1 An illustration of learning in one round of JB. It is possible that the optimal strategy denoted by {J*, JNR*, ρ*} lies outside the set of discretized strategies. In such a case the jammer learns the best discretized strategy, but based on the value of the discretization parameter M, the loss incurred by using this strategy with respect to the optimal strategy can be bounded using the Hölder continuity condition. The value of the discretization M is shown in the figure and Alg. 6.1.

6.2 Using Theorems 6.3 and 6.5 in a real-time jamming environment.

6.3 Instantaneous SER achieved by the JB algorithm when JNR = 10 dB, SNR = 20 dB and the victim uses BPSK.

6.4 Average SER achieved by the jammer when JNR = 10 dB, SNR = 20 dB and the victim uses BPSK. The jammer learns to use BPSK with ρ = 0.078 using JB. The learning performance of the ε-greedy learning algorithm with various discretization factors M is also shown.

6.5 Learning the optimal jamming strategy when JNR = 10 dB, SNR = 20 dB and the victim uses the QPSK modulation scheme. The jammer learns to use the QPSK signaling scheme with ρ = 0.087.

6.6 Average SER achieved by the jammer when JNR = 10 dB, SNR = 20 dB, the victim uses BPSK and there is a phase offset between the two signals. The jammer learns to use BPSK with ρ = 0.051 using JB. The learning performance of the ε-greedy learning algorithm with various discretization factors M is also shown.

6.7 Average PER inflicted by the jammer at the victim receiver, SNR = 20 dB, victim uses BPSK and JNR = 10 dB. The jammer learns to use the BPSK signaling scheme with ρ = 0.23.

6.8 Average reward obtained by the jammer against a BPSK modulated victim, SNR = 20 dB. The optimal reward is obtained via grid search with discretization M = 100.

6.9 Confidence level (optimal reward minus achieved reward) predicted by Theorem 6.3 and that achieved by JB.

6.10 Learning the jamming strategies by using arm elimination. The victim uses BPSK with SNR = 20 dB. The jammer learned to use BPSK with JNR = 15 dB and ρ = 0.22.

6.11 Learning the jammer’s strategy against a stochastic user. The victim transmitter-receiver pair uses a uniformly random signaling scheme that belongs to the set {BPSK, QPSK} and a random power level in the range [0, 20] dB.

6.12 Learning against a victim with time-varying strategies. The figure shows the power level adaptation by the jammer and that used by the victim.

6.13 Learning against a victim with time-varying strategies. The figure shows the power level adaptation by the jammer using a drifting algorithm and that used by the victim.

6.14 PER achieved by the jammer against 2 users; user 1 uses BPSK at 15 dB and user 2 sends BPSK at 5 dB. The jammer learns to use a BPSK signal with power 13 dB and ρ = 0.46.

6.15 PER achieved by the jammer against 2 users; user 1 sends QPSK at 5 dB and user 2 sends BPSK at 15 dB. The jammer learns to use a BPSK signal with power 11.25 dB and ρ = 0.25.

6.16 PER achieved by the jammer against 2 stochastic users in the network. Both users employ the BPSK signaling scheme. The jammer learns to use the BPSK signaling scheme to achieve power-efficient jamming strategies and also tracks the changes in the users’ strategies.

7.1 MDP model of the 802.11-type wireless network with the RTS-CTS protocol. The state transitions indicate the effect of a jamming attack on the wireless network.

7.2 Rewards obtained in various scenarios, ρ = 0.3. The rewards obtained with instantaneous knowledge are on average better than the rewards obtained in the delayed knowledge scenarios.

7.3 Optimal jamming policies as a function of the ratio of the throughput cost to the energy cost; ρ = 0.5. The colors represent the various optimal jamming policies.

7.4 Rewards obtained when the jammer is uncertain about the underlying MDP model and ρ and learns them by interacting with the environment; ρ = 0.5.

8.1 Betweenness metrics for nodes in network (a) 112[0, 6, 8, 0, 0] and network (b) 112[0, 6, 0, 0, 0].

8.2 Network attack performance against fixed flows in a star network, number of nodes = 50.

8.3 Network attack performance against an Erdös-Rényi random network, connection probability (p) = 0.8, number of nodes = 50. The average number of flows stopped in one network instantiation of the ER network is shown.

8.4 Network attack performance against fixed flows in an ER network, p = 0.8, number of nodes = 50.

8.5 Network attack performance against fixed flows in a BA network, connection degree = 5, number of nodes = 50.

8.6 PPP-based network model, with nearest neighbor connections. The red dots indicate the various network nodes and the blue lines indicate the network connections.

8.7 Network attack performance against fixed flows in a PPP-based network, number of nearest neighbor connections = 5, number of nodes = 50.

8.8 Network attack performance against random flows in a star network, number of nodes = 50.

8.9 Network attack performance against random flows in an ER random network, p = 0.8, number of nodes = 50.

8.10 Network attack performance against random flows in a BA network, number of nodes = 50, connection degree = 5.

8.11 Network attack performance against fixed flows in an ER network, with 25 nodes and p = 0.8, when two nodes can be attacked simultaneously by the attackers.

8.12 Attack performance by exploiting the similarity in a network modeled using a Poisson point process. L(G) = 5.

8.13 An example network attacked by two attackers, with each capable of attacking only a subset of nodes.

9.1 [System Model] The cross marks indicate the BS/APs in the wireless network that are distributed according to a PPP. The Voronoi tessellation indicates the coverage regions of the BS/APs. The square indicates the victim receiver, which is at the origin. The black arrow indicates the link between the closest BS and the victim receiver. The triangles indicate the jammers that are distributed according to a BPP within the black-dotted region of radius RJ.

9.2 [Effect of NJ]: Outage probability of the victim receiver as a function of the number of jammers NJ in the network. p = 0.01, PT/PJ = 0 dB. The solid lines indicate the outage probability obtained via Monte Carlo simulations and the markers indicate the theoretical outage probability evaluated using (9.3).

9.3 [Effect of NJc]: Outage probability of the victim receiver as a function of the number of jammers per cell (or per BS) NJc in the network. p = 0.01, NJ = 4, PT/PJ = 0 dB. The solid lines indicate the outage probability obtained via Monte Carlo simulations and the markers indicate the theoretical outage probability evaluated using (9.3).

9.4 [Effect of p]: Outage probability of the victim receiver as a function of the activity factor p. NJ = 4, NJc = 1, PT/PJ = 0 dB. The solid lines indicate the outage probability obtained using Monte Carlo simulations and the markers indicate the theoretical outage probability expression evaluated using (9.3).

9.5 [GHQ Approximation]: The accuracy of the Gaussian-Hermite quadrature approximation in evaluating the outage probability as a function of the number of terms N used in (9.12). The dotted line is the outage probability evaluated using (9.3). The marked lines indicate the outage probability evaluated using (9.12) for various values of N. p = 0.01, NJc = 1, PT/PJ = 0 dB.

9.6 [Effect of activity factor p]: Number of jammers NJ* required to cause a 90% probability of outage in the wireless network, as a function of the activity factor (network load) p. PT/PJ = 0 dB.

9.7 [Effect of λT]: Number of jammers NJ* required in a BPP to cause a 90% probability of outage in the wireless network, as a function of λT, p = 0.1.

9.8 [Effect of Shadowing]: Number of jammers NJ* required in a BPP to cause a 90% probability of outage in the wireless network, as a function of σχ and p = 0.01.

9.9 [Effect of Retransmissions]: The steady state activity factor (ps) as a function of the number of retransmissions (D). The initial activity factor is taken to be p = 0.01. The SIR threshold θ = 0 dB.

9.10 [Effect of Retransmissions]: The steady state packet drop probability (δ) as a function of the number of retransmissions (D). The initial activity factor is taken to be p = 0.01. The SIR threshold θ = 0 dB.

9.11 The accuracy of the Gaussian-Hermite quadrature approximation for error probability evaluation as a function of the number of terms N used in the approximation. The zoomed-in plot shows a part of the overall figure and indicates that N = 10 terms very closely matches the true value without any approximation.

9.12 [Effect of Activity Factor]: Average symbol error rate as a function of the activity factor p when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation. NJ = 4, NJc = 1, JNR = 100 dB. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.13 [Effect of Number of Jammers]: Average symbol error rate as a function of the number of jammers when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation. NJc = 1, p = 0.01. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.14 [Effect of NJc]: Average symbol error rate when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation, as a function of the number of jammers per cell (BS). The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.15 [Effect of shadowing]: Average symbol error rate as a function of the shadowing power level when the victim receiver uses BPSK modulation and the jammer network uses BPSK modulation. The solid lines indicate the Monte Carlo simulation results and the markers indicate the theoretical ASEP evaluated using (9.25).

9.16 [Effect of the jamming signaling scheme]: Average symbol error rate as a function of p when the victim receiver uses BPSK modulation and different jamming signals are used by the jammer network. NJ = 4, NJc = 1. It is seen that in all cases, the jamming performance of the three jamming signals is the same.

9.17 [No Fading Scenario]: Average symbol error rate when the victim receiver uses BPSK modulation and different jamming signals are used by the jammer network, NJ = 4, NJc = 1, p = 0.01. In all cases it is seen that BPSK jamming outperforms QPSK and AWGN jamming signaling schemes.

9.18 The symbol error probability of the victim receiver when the jammer interference is approximated as Gaussian with variance denoted by (9.30).

  • List of Tables

4.1 Optimal jamming signal level distribution against a 16-QAM victim signal, JNR = 10 dB. a1, a2 indicate the absolute values of the real and imaginary parts of the jamming signal.

4.2 Optimal jamming signals in a coherent scenario.

4.3 Optimal non-coherent jamming signal level distribution against a 16-QAM victim signal, JNR = 10 dB. a1, a2 indicate the absolute values of the real and imaginary parts of the jamming signal.

5.1 Optimal jamming strategies versus jammer knowledge.

6.1 Comparison between related bandit works.

6.2 Notations used.

7.1 MDP model state transitions.

7.2 Optimal Jamming Policies via Delayed Learning, E = −10, T = −100.

9.1 Notations used.

  • Chapter 1

    Introduction

    Wireless connectivity has now become ubiquitous and an integral part of our everyday lives. It

    is now more of a necessity than a luxury. With the advent of new technological capabilities, the

    demand for wireless spectrum is ever-increasing. However, the inherent openness of the wireless

medium makes it susceptible to both intentional and unintentional interference. Interference from

neighboring communicating devices is one of the major causes of unintentional interference. On

    the other hand, intentional interference corresponds to adversarial attacks on a victim receiver.

    Therefore, ensuring the security and privacy of every device, in order to avoid data breaches and

any type of attack, is of utmost importance. Security need not be only defensive, such as

cryptographic or information-theoretic security that evades attacks; it can also be offensive

on an as-needed basis. In this dissertation, we focus on the offensive techniques that help to ensure the

security of the various devices. Such security-related studies not only allow for the analysis of

system vulnerabilities but also enable undermining an enemy system’s capabilities.

    The rapid rise in the technological advancements in the field of Artificial Intelligence and Machine

    Learning can potentially enable every device (regardless of wired or wireless) to possess some sort

    of intelligence that allows for real time operation and adaptation [1]-[6]. If such capabilities exist

    with the malicious nodes,1 then it is a threat to the security of the various devices that co-exist in

    the same environment. It is thus imperative that devices be intelligent and predict the next move by

    the adversary so as to limit the effectiveness of attacks. Therefore, spectrum supremacy, or in other

    words, ensuring unimpeded access to spectrum while denying it to adversaries and thereby having

    control of the spectrum, is a vital part of security in the modern era. Throughout this dissertation,

    we refer to the offensive techniques that help to gain control over the spectrum as communication

    denial. Communication denial, for instance, is vital for military applications (popularly referred to

    as electronic warfare) [7] where the military devices must have un-interrupted access to spectrum

    resources to cater to mission critical applications. It is also useful in commercial applications

    where malicious sensor nodes must be stopped from eavesdropping, for instance during a private

1In this dissertation, the terms malicious nodes, adversarial nodes, enemy nodes and victim nodes are used interchangeably.


    meeting.

Communication denial has mainly been studied by using either optimization, game-theoretic, or

information-theoretic principles. The major disadvantage of these studies is that they assume a lot

of a priori information about the communication strategies used by the enemy nodes, environmental

conditions (such as the fading channel or spectrum occupancy), etc., which may not be available in

practical scenarios. Therefore, the major point of departure for this dissertation from the previous work

is the realization that the recent advances in communication systems create the potential for

dynamic environmental conditions. Under such scenarios, more often than not, it is difficult and most

    likely not even possible to obtain a priori information regarding the environment and the nodes that

    are present in it. Therefore, it is necessary to have cognitive capabilities that enable nodes to learn

    the environment and prevent the enemy nodes from accessing the spectrum and thereby denying

    communication.

In this dissertation, we address several unsolved fundamental problems in the area of

communication denial. In particular, this dissertation considers scenarios where several nodes are

attempting to communicate in a secure or sensitive area, and one or more secure nodes wish to

prevent that communication, i.e., deny the nodes from communicating. Broadly, we ask the

following question in this dissertation: “Can an intelligent attacker learn and adapt to an unknown

environment in an electronic warfare-type scenario?” We answer this question in several stages by

fundamentally analyzing the performance of an attacker in various communication settings. We

assume that the attacker has already identified that one or more devices are malicious, for instance by

    using device fingerprinting techniques [8]. In this dissertation, we focus on intelligent approaches

    for communication denial of the malicious node once it has been identified.

    1.1 Contributions

    Chapter 2 provides a short background on the learning theory concepts used in this dissertation.

Chapters 3-9 describe the major contributions of this dissertation, namely victim signal

identification and attack strategies at various open systems interconnection (OSI) model layers. Specifically,

    Chapters 3-6 and 9 discuss attacks at the physical layer, Chapter 7 discusses attacks at the MAC

    layer and Chapter 8 addresses attacks at the network layer. Conclusions and future directions are

    presented in Chapter 10. The major contributions of this dissertation are briefly described below.

    1.1.1 Modulation classification

As mentioned earlier, the first task in effectively attacking a malicious node is to identify its

signaling strategy. In Chapter 3, we present a novel signal identification technique, specifically a

    modulation classification algorithm to identify the modulation scheme used by the victim for its

communication. While modulation classification has been studied extensively (see [9]-[12] and


references therein), unfortunately none of the previous works consider practical, realistic

environments where the interference is often non-Gaussian and the attacker is not aware of the timing of

the victim’s signal. Further, the difficulty in performing modulation classification is due primarily

to the fact that classifiers operate with no or incomplete knowledge of the fading experienced

    by the signal and the distribution of the noise added in the channel. This is because a receiver

    typically has to first classify the received signal before it can successfully acquire symbol timing

    and estimate the channel state. As a result, the impractical assumption that the received signal is

    acquired and equalized by the radio front-end before classification is often made in the design of

    modulation classification algorithms [13], [14].

    In this chapter, we present and analyze a pre-processor that allows for the reliable classification

    of digital amplitude-phase modulated signals (ASK, PSK, and QAM) when the receiver has no

    knowledge of the timing (symbol transition epochs) of the received signal, the noise added in

the channel is non-Gaussian, and the unknown fading experienced by the signal is

frequency-selective. We assume that the additive noise is non-Gaussian because various studies have shown

    that most radio channels experience both man-made and natural noise, and that the combined noise

    is impulsive. This also accounts for non-Gaussian interference that is often experienced in practical

    wireless environments [9]. We propose a Bayesian pre-processing stage that estimates the various

    signal parameters and reliably identifies the signal of interest. The numerical results demonstrate

    that, by using the proposed pre-processor, modulation classification algorithms can perform well

    compared to clairvoyant classifiers assumed to be symbol synchronous with the received signal and

    to have perfect knowledge of the channel state and noise distribution. An extension of the proposed

    pre-processor for the case when the received symbols suffer phase rotation due to the presence of

    a residual carrier frequency offset is also considered. More details are given in Chapter 3.
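
To make the flavor of the Gibbs-sampling pre-processor concrete, the toy sketch below alternates between sampling per-symbol noise labels and a channel gain from their conditional posteriors. It is our illustration under simplifying assumptions (known BPSK pilot symbols and known two-component Gaussian-mixture noise parameters), not code from the dissertation, where the pre-processor is blind and also estimates timing, multipath taps, and the noise parameters themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative parameters): y_n = h * s_n + w_n, where w_n is
# impulsive noise drawn from a two-component Gaussian mixture.
N = 500
h_true = 0.8                      # unknown channel gain to estimate
s = rng.choice([-1.0, 1.0], N)    # BPSK symbols, assumed known here
eps = 0.1                         # probability of an impulsive sample
var1, var2 = 0.1, 4.0             # background / impulsive noise variances
z_true = rng.random(N) < eps
y = h_true * s + rng.normal(0.0, np.sqrt(np.where(z_true, var2, var1)))

# Gibbs sampler: alternate between the noise labels z and the gain h.
var_h = 10.0                      # Gaussian prior variance on h
h, samples = 0.0, []
for it in range(2000):
    # 1) Sample each label z_n given h: p(z_n = k) is proportional to
    #    pi_k * N(y_n; h * s_n, var_k).
    r = y - h * s
    p1 = (1 - eps) / np.sqrt(var1) * np.exp(-r**2 / (2 * var1))
    p2 = eps / np.sqrt(var2) * np.exp(-r**2 / (2 * var2))
    z = rng.random(N) < p2 / (p1 + p2)    # True -> impulsive component
    var_n = np.where(z, var2, var1)
    # 2) Sample h given z: conjugate Gaussian posterior.
    prec = 1.0 / var_h + np.sum(s**2 / var_n)
    mean = np.sum(s * y / var_n) / prec
    h = rng.normal(mean, 1.0 / np.sqrt(prec))
    samples.append(h)

print("posterior mean of h:", np.mean(samples[500:]), "(true:", h_true, ")")
```

Discarding the first few hundred iterations as burn-in (cf. the sampling-path figures of Chapter 3) and averaging the remaining samples approximates the posterior mean of the gain; the same alternating-conditionals idea extends to the timing, channel, and noise parameters estimated by the actual pre-processor.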

    1.1.2 Optimal jamming in AWGN channels

Once the victim’s signaling scheme is identified, the next task for the attacker is to efficiently

attack it using all the available information. In Chapter 4, we study attacks from a physical layer

    perspective. More specifically, we study jamming attacks against practical wireless signals, namely

    digital amplitude-phase modulated signals. Jamming has traditionally been studied in the context

    of spread spectrum communications [15]. Barrage jamming, partial-band/narrow-band jamming,

    tone-jamming (where a victim is attacked by sending either a single or multiple jamming tones) and

pulsed jamming are the most common types of jamming models considered in wireless

communication systems. Deviating from these traditional, simplistic techniques, we want to know: “What

is the optimum statistical distribution for power-constrained jamming signals in order to maximize

    the error probability of digital amplitude-phase modulated constellations?” This work answers

    a question that is more relevant to practical wireless communication systems when compared to

    similar questions studied in the past, and consequently offers different solutions mainly because

    incorrect system models were previously considered and thus the wrong questions were answered.

As a result of the analysis in this chapter, we show that modulation-based pulsed jamming


signals are optimal in both coherent and non-coherent (phase-asynchronous) scenarios against digital

    amplitude-phase modulated signals. As opposed to the common belief that matching the victim

    signal (correlated jamming) increases confusion at the victim receiver, our analysis shows that

    the optimal jamming signals match standard modulation formats only in a certain range of signal

and jamming powers. Beyond this range, either binary or quaternary pulsed jamming is the

optimal jamming signal. An interesting relationship between these optimal jamming signals and the

    well-known pulse jamming signals discussed in the context of spread spectrum communications

is illustrated. The performance of these optimal jamming signals is shown to be degraded when

the victim and the jamming signals are not phase or time synchronous, or when the jammer does not have

perfect knowledge of the power levels of the victim and the jamming signals, although the

optimal jamming signal distributions do not change. In this chapter, we also study jamming against

    OFDM-based victim signaling schemes and the effects of multiple jammers against the victim.

    More details are presented in Chapter 4.
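As a rough illustration of why the jamming waveform's distribution matters, the following Monte Carlo sketch compares Gaussian (barrage-like) jamming with a BPSK-modulated pulsed jammer of equal average power against a coherent QPSK victim. The powers, noise level, and duty cycle below are invented for illustration and are not the optimal values derived in Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def ser_qpsk(jammer, noise_var=0.05):
    """Symbol error rate of a coherent QPSK victim under a jamming waveform."""
    tx = qpsk[rng.integers(4, size=N)]
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
    y = tx + jammer + noise
    det = qpsk[np.argmin(np.abs(y[:, None] - qpsk[None, :]) ** 2, axis=1)]  # min-distance detection
    return np.mean(det != tx)

Pj = 0.5  # average jamming power (illustrative)

# (a) Gaussian (barrage-like) jamming at average power Pj
awgn_jam = np.sqrt(Pj / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# (b) BPSK-modulated pulsed jamming: on for a fraction rho of symbols at power Pj/rho
rho = 0.5
on = rng.random(N) < rho
bpsk_jam = np.where(on, np.sqrt(Pj / rho) * rng.choice([1.0, -1.0], size=N), 0.0).astype(complex)

print("SER, Gaussian jamming:    %.4f" % ser_qpsk(awgn_jam))
print("SER, BPSK pulsed jamming: %.4f" % ser_qpsk(bpsk_jam))
```

Which jammer wins depends on the signal and jamming power regime; characterizing that dependence is precisely the subject of Chapter 4.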

    1.1.3 Jamming in fading channels

    In Chapter 4 we studied jamming in AWGN channels. In Chapter 5, we take the jamming analysis

    in Chapter 4 a step further and investigate jamming attacks in fading channels. As pointed out

    in [16], most of the existing jamming works, see [7], [17]-[20] and references therein, ignore the

    presence of a fading channel between the jammer and the victim receiver as it simplifies the jam-

    ming analysis. Although the impact of fading channels on the jamming performance has sparingly

    been studied in the context of multiple-input multiple-output (MIMO) systems [16], [24]-[26],

    these works addressed jamming by only considering an AWGN jamming signal against Gaussian

    victim signaling and showed that equal power allocation or water filling based on the second-order

    statistics of the fading channel are Nash-equilibrium strategies. However, it was recently shown

    in [16] that ignoring the presence of a fading channel and/or using equal power allocation/ water

    filling is sub-optimal in terms of the jamming performance evaluated via the Shannon rate metric.

    While [16] addresses the shortcomings of the earlier works [21]-[26], it assumes that the victim

    employs Gaussian signaling schemes which are typically not used in practice. Furthermore, none

    of the works that study jamming against OFDM systems, which is the preferred signaling scheme

    for most wireless standards, explicitly consider the effects of a fading channel between the jammer

    and the victim receiver (see [21]-[23] and references therein for more information on jamming

    against OFDM systems). Hence, there is not currently a good understanding as to how a jam-

    mer can effectively attack a victim that uses practical wireless signals in the presence of a fading

    channel between the jammer and the victim receiver.

    Therefore, we address this open question by studying jamming attacks against digital modulation

    schemes in wireless fading channels. Again, we focus on the error probability metric as the Shan-

    non rate metric fails to capture the effects of digital modulation schemes typically employed by

    the victim. Specifically, in this chapter, we study the problem of jamming power allocation across

    a fading channel under total and peak power constraints in order to maximize the error probability

    of a victim receiver. As a result of the analysis in this chapter, an interesting power allocation


    strategy is obtained for the jammer, which is different from equal power allocation, channel inver-

    sion and water filling. Specifically, it will be shown that for a given jamming power, the power

    allocation is similar to channel inversion at low victim signal power values and to water filling

    at high victim signal power values. However, at medium victim signal power values, the jammer

allots more power when the channel fading is weak than when it is strong, but allots no power when the fading is weakest. The jammer performance is also evaluated under several non-ideal scenarios, which demonstrates the benefits of employing the proposed jamming strategies over conventional jamming techniques. Finally, the proposed jamming strategies are applicable not only to frequency-selective fading channels but also to time-selective fading channels, and hence can be

    used to optimally attack a victim across a variety of scenarios.
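The flavor of this power-allocation problem can be reproduced with a brute-force search. The sketch below maximizes the average error probability of a BPSK victim under Gaussian jamming over a small grid of per-fading-state jamming powers, subject to total and peak power constraints; the channel gains, powers, and the use of Gaussian jamming are illustrative assumptions, not the optimal strategy of Chapter 5.

```python
import numpy as np
from math import erfc
from itertools import product

def Q(x):
    """Gaussian tail function."""
    return 0.5 * erfc(x / np.sqrt(2))

P, sigma2 = 1.0, 0.1             # victim power and receiver noise (illustrative)
h2 = np.array([0.2, 1.0, 3.0])   # jammer-to-victim channel power gains per fading state
J_tot, J_peak, step = 2.0, 1.5, 0.05
levels = np.arange(0.0, J_peak + 1e-9, step)

def avg_ber(J):
    # BPSK victim; Gaussian jamming simply raises the noise floor in each state.
    return np.mean([Q(np.sqrt(2 * P / (sigma2 + g * j))) for g, j in zip(h2, J)])

# Exhaustive search over the power grid, respecting the total power budget.
best = max((J for J in product(levels, repeat=len(h2)) if sum(J) <= J_tot), key=avg_ber)
print("per-state jamming powers:", best, "-> average BER %.4f" % avg_ber(best))
```

Chapter 5 derives the structure of this allocation analytically rather than by search, and shows how it transitions between channel-inversion-like and water-filling-like behavior.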

    1.1.4 Jamming Bandits

As mentioned earlier, jamming was traditionally studied using optimization, game-theoretic, or information-theoretic principles; see [17]-[26] and references therein. The major dis-

    advantage of these studies is that they assume the jammer has a lot of a priori information about

    the strategies used by the (victim) transmitter-receiver pairs, channel gains, etc., which may not

    be available in practical scenarios. For instance, in Chapters 4 and 5, we analyzed jamming from

    an optimization perspective and studied jamming strategies in AWGN and fading channels. How-

    ever, these jamming strategies were obtained by assuming that the jammer has a priori knowledge

    regarding the transmission strategy of the victim transmitter-receiver pair. While the results in

    Chapters 4 and 5 shed light on the fundamental performance limits of the jammer, they cannot be

    used in real time environments due to the lack of a priori knowledge about the victim. Further,

    such optimization-based techniques need to be re-programmed whenever the victim changes its

    strategy, which may be a complicated procedure. Therefore, in contrast to prior work (both ours

    and others), in this chapter we develop online learning algorithms that learn the optimal jamming

    strategy by repeatedly interacting with the victim nodes. Essentially, the jammer must learn to act

    in an unknown environment in order to maximize its total reward (e.g., jamming success rate).

    In this regard, we ask “Can an intelligent jammer learn the optimal physical layer jamming strate-

    gies obtained in Chapter 4, with limited to no knowledge about the victim nodes?” By learning,

    we refer to the cognitive capabilities of a jammer wherein it has the ability to understand its envi-

    ronment and the impact of its actions on the environment. In Chapter 4, we show that the optimal

    jamming signal depends on three parameters, namely modulation scheme, signal power and the

    on-off duration. While the set of modulation schemes is discrete, the signal power and the on-off

    duration parameters are continuous. As will be discussed in detail in Chapter 6, traditional learning

    techniques (i.e., those available in the open literature) cannot be directly employed to learn in such

    mixed action spaces (discrete and continuous). The multi-armed bandit (MAB) framework lends

itself well to this problem, as will be described in Chapter 6. However, no existing bandit framework can be directly applied to this problem, which motivated us to develop new learning frameworks and algorithms, novel both in their application to jamming and with respect to the general learning literature, to address this cognitive physical-layer jamming problem.


Moreover, these algorithms come with theoretical guarantees on the jamming performance, which

    is vital in offensive security scenarios. Specifically, we prove that our learning algorithm converges

    to the optimal (in terms of the error rate inflicted at the victim and the energy used) jamming strat-

    egy. Even more importantly, we prove that the rate of convergence to the optimal jamming strategy

is sub-linear, i.e., the learning is fast in comparison to existing reinforcement learning algorithms,

    which is particularly important in dynamically changing wireless environments. Also, we charac-

    terize the performance of the proposed bandit-based learning algorithm against multiple static and

    adaptive transmitter-receiver pairs. More details are presented in Chapter 6.
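As a stripped-down illustration of the bandit viewpoint, the sketch below runs the classical UCB1 index policy over a small, finite set of candidate jamming strategies with unknown Bernoulli success rates. The true mixed discrete-continuous action space of Chapter 6 (modulation, power, and on-off duration) requires more than this finite-armed example, and all success probabilities here are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Five candidate jamming strategies (e.g., modulation/power/duty-cycle triples)
# with unknown mean success rates -- the values below are purely illustrative.
true_success = np.array([0.30, 0.45, 0.55, 0.40, 0.62])
K, T = len(true_success), 5000

counts, means = np.zeros(K), np.zeros(K)
for t in range(T):
    if t < K:
        arm = t                                   # initialization: play each arm once
    else:
        arm = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))  # UCB1 index
    reward = float(rng.random() < true_success[arm])   # Bernoulli jamming success
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running-mean update

print("pull counts per arm:", counts.astype(int))      # concentrates on the best arm
```

For finite arms, UCB1's weak regret grows only logarithmically in time; the sub-linear regret guarantees mentioned above are the analogous statements for the continuous jamming action space.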

    1.1.5 MAC-layer jamming

    Jamming all the information exchanged between the malicious nodes, for example by employing

    the physical layer jamming techniques obtained in Chapters 4-6, may not always be necessary. It

    was shown in [27], [28] that the jammer can perform better (say in terms of the energy efficiency)

    if it accounts for the inherent structure in the data transmission. For example, in some scenarios,

    jamming the control packets or pilot signals is sufficient to stop the malicious nodes from commu-

    nicating with each other [27], [28]. Thus, higher layer jamming attacks either at the MAC layer

    or network layer should be considered. MAC-layer attacks typically rely on the knowledge of the

    protocol used by the malicious nodes and network layer attacks rely on the ability to create con-

    gestion in the network by mimicking the packets that are sent by other nodes in the network [7]. In

    Chapter 7, we seek to understand the optimal MAC layer jamming attacks against an 802.11-basedwireless network. Specifically, we ask “Can an intelligent jammer learn the optimal MAC-layer

    jamming strategies when it has delayed knowledge about the malicious nodes?”

    In this chapter, we assume that the jammer can identify the basic MAC-layer protocol being used

    by the malicious nodes, although not necessarily the full details. This can be fairly easily achieved

    by observing the traffic pattern of the nodes in the environment over some time interval [29].

    However, one of the main challenges still faced in studying a MAC layer jamming attack is that the

    knowledge about the malicious nodes is not always available instantaneously, especially when the

    jammer intends to track the changes in the victim’s strategies. Hence, in this problem, we assume

a middle ground between Chapters 4 and 5 (where full a priori knowledge is assumed) and Chapter 6 (where almost none is assumed), and study how efficiently and effectively a jammer can learn the optimal jamming strategy when there is delayed knowledge about the malicious nodes, i.e., in cases where the jammer becomes aware of the malicious nodes' behavior only after

    some time delay. The framework for delayed observations is more practically relevant, especially

    in the context of wireless communications [30].

    In order to answer the question raised, we will use the Markov Decision Process (MDP) framework

    which is particularly useful in modeling environments that obey the Markovian property and have

    to keep track of only a small number of states. By state, we refer to the condition of the environment

    in this dissertation. However, as will be discussed in detail in Chapter 7, the literature on delayed

    learning frameworks is immature, thereby forcing the development of an appropriate framework

    that enables us to obtain the optimal MAC-layer jamming strategies. As a result of the analysis in


this chapter, we develop a novel delayed learning framework with transition-based rewards that allows us to handle the realistic case of delayed knowledge. Using this framework, it is shown that the jammer can learn the optimal policy. More details are presented in Chapter 7.
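To fix ideas before Chapter 7, the following solves a toy two-state MDP, with the Transmission/No-Transmission states and Jam/Don't-Jam actions used as the running example in Chapter 2, by value iteration. The transition probabilities and rewards are invented for illustration, and the delayed-observation machinery developed in Chapter 7 is deliberately omitted.

```python
import numpy as np

# States: 0 = No Transmission, 1 = Transmission; actions: 0 = Don't Jam, 1 = Jam.
# P[a, s, s'] and R[s, a] below are invented for illustration.
P = np.array([
    [[0.7, 0.3], [0.4, 0.6]],   # Don't Jam
    [[0.8, 0.2], [0.7, 0.3]],   # Jam: transmissions tend to stop
])
R = np.array([
    [0.0, -0.2],   # No Transmission: jamming only wastes energy
    [-1.0, 0.5],   # Transmission: jamming pays off despite the energy cost
])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):    # value iteration to (numerical) convergence
    V = np.max(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)
policy = np.argmax(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)
print("V* =", V, "| policy per state:", policy)  # expect: jam only when a transmission is present
```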

    1.1.6 Blind network interdiction

    Network centric architectures are increasingly gaining prominence, be it social networks or wire-

    less networks, as they allow for decentralized operation among various nodes without the need

    for a central entity to control their communication. With a widespread deployment of such ar-

    chitectures, the security aspects of the underlying networks is now a major concern. The ability

    to undermine a malicious network’s communication capabilities is crucial for ensuring security in

    sensitive environments. In Chapter 8, we particularly focus on attacks against networks when their

    topology is unknown a priori.

    Network interdiction refers to disrupting a network in an attempt to either analyze the network’s

    vulnerabilities or to undermine a network’s communication capabilities. A vast majority of the

    works that have studied network interdiction assume a priori knowledge of the network topology

    [31]-[42]. However, such knowledge may not be available in real-time settings. For instance,

    in practical electronic warfare-type settings, an attacker that intends to disrupt communication

    in the network may not know the topology a priori. Hence, it is necessary to develop online

    learning strategies that enable the attacker to interdict communication in the underlying network

in real time. In this chapter, we develop several learning techniques that enable the attacker to learn

    the best network interdiction strategies (in terms of the best nodes to attack to maximally disrupt

    communication in the network) and also discuss the potential limitations that the attacker faces in

such blind scenarios. We consider settings where a) only one node can be attacked and b) multiple nodes can be attacked in the network. In addition to the single-attacker setting, we also

    discuss learning strategies when multiple attackers attack this network and discuss the limitations

they face in real-time settings. Several different network topologies are considered in this study, through which we show that, under the blind settings considered in this chapter and except for some simple network topologies, the attacker cannot attack the network optimally (measured in terms of the number of flows stopped).

More specifically, in this chapter, we show that: (a) attacking a network by relying on well-known graph metrics, such as betweenness centrality [40], does not necessarily work for all network topologies, (b) under blind scenarios, the learning rates cannot be improved beyond O(|V|), where |V| is the number of nodes in the network, (c) under blind scenarios, multiple attackers must collaborate at every time instant in order to learn the best set of nodes to attack in the network

    and (d) the learning performance, be it a single attacker or multiple attackers, will depend on the

    network structure and not just the number of nodes in the network. More details are presented in

    Chapter 8.
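Finding (a) is easy to probe with standard graph tooling. The sketch below (using the networkx package, on an arbitrary barbell topology) implements the betweenness-centrality attack heuristic that, as shown in Chapter 8, does not generalize to all topologies.

```python
import networkx as nx

# An illustrative topology; in the blind setting of Chapter 8 the attacker
# does NOT know this graph and must learn which nodes to attack online.
G = nx.barbell_graph(5, 2)   # two cliques joined by a short path

# Heuristic (a): attack the node with the highest betweenness centrality.
bc = nx.betweenness_centrality(G)
target = max(bc, key=bc.get)
G.remove_node(target)
print("attacked node:", target, "| still connected:", nx.is_connected(G))
```

On the barbell graph the heuristic happens to work, since the bridge nodes carry all inter-clique flows; the point of finding (a) is that this success does not carry over to arbitrary topologies.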


    1.1.7 Jamming against wireless networks

    Jamming against wireless networks (not just single nodes) has been previously addressed albeit

from an optimization perspective in [37], [43]-[45]. The jammer-to-flow assignment problem, i.e.,

    optimally assigning jammers to stop flows in a network based on their locations and other con-

    straints such as power, was considered in [37]. In [43]-[45], the problem of jammer placement

    against wireless networks with the aim of disconnecting the network was studied. All these works

    model a network as a graph and find the best set of nodes/edges to attack so that the network is dis-

    connected. While these studies indicate which nodes/links to be attacked, they do not address the

    problem of how this attack can be realized in practice against cellular and/or WiFi-based wireless

networks. In other words, jamming techniques against wireless networks are not well understood

    from a physical layer perspective.

    In Chapter 9, we analyze the impact of randomly placed jammers against a wireless network in

    terms of a) the outage probability and b) the error probability of a victim receiver in the downlink

    of this wireless network. We derive analytical expressions for both these metrics and discuss in

    detail how the jammer network must be matched to the wireless network parameters in order to

    effectively attack the victim receiver. For instance, we show that as the network loading increases,

    assuming universal frequency reuse, the number of jammers that are needed to inflict a given outage

    probability at the victim receiver decreases. Retransmissions are commonly used across a variety

    of wireless protocols. We will show that when the wireless network uses retransmissions (in order

    to improve the probability of successful communication), the number of jammers necessary to

    achieve a required outage probability at the victim receiver decreases due to increased interference

    among the BSs. Furthermore, we will show that the behavior of the jammer network as a function

of the BS/AP density is not obvious. In particular, an interesting concave-type behavior is seen, which indicates that the number of jammers required to attack the wireless network must scale with the BS density only up to a certain value, beyond which it decreases.

    probability of the victim receiver, we study whether or not some recent results related to jamming in

    the point-to-point link scenario can be extended to the case of jamming against wireless networks.

As a result of the analysis in this chapter, we show that a fixed number of jammers can tip a wireless network, i.e., can significantly reduce the probability of successful communication in this

    wireless network. A similar analysis is performed in the context of the error probability of the

    victim receiver. Specifically, we will show that when the small scale fading effects are averaged

    out, then the results in Chapter 4 can be extended to the case of jamming against wireless networks

    and that significant gains can be achieved by using modulation-based jamming signals (i.e., the

    findings from Chapter 4) when compared to AWGN jamming.
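The outage analysis lends itself to quick Monte Carlo checks. The sketch below estimates the victim's outage probability when jammers are dropped as a homogeneous Poisson point process around it; the disc radius, path-loss exponent, Rayleigh fading model, and all power values are illustrative assumptions, and the inter-BS interference analyzed in Chapter 9 is ignored here.

```python
import numpy as np

rng = np.random.default_rng(3)

def outage_prob(lam_j, Pj=1.0, R=20.0, alpha=4.0, thresh=1.0, sigma2=0.01, trials=20000):
    """Monte Carlo P(SINR < thresh) for a victim at the origin served by a BS at
    unit distance, with jammers as a Poisson point process of density lam_j on a
    disc of radius R. Rayleigh fading on all links; every value is illustrative."""
    fails = 0
    for _ in range(trials):
        n = rng.poisson(lam_j * np.pi * R**2)     # number of jammers this drop
        r = R * np.sqrt(rng.random(n))            # PPP radii on the disc
        jam = Pj * np.sum(rng.exponential(size=n) * r**(-alpha))  # aggregate jamming power
        sig = rng.exponential()                   # fading on the serving link
        fails += sig / (sigma2 + jam) < thresh
    return fails / trials

for lam in [1e-3, 5e-3, 2e-2]:
    print(f"jammer density {lam:.0e}: outage = {outage_prob(lam):.3f}")
```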

    1.1.8 List of relevant publications

    This dissertation is based on the following publications:


    1. S. Amuru and C. R. C. M. da Silva, “A blind pre-processor for modulation classification

    applications in frequency-selective non-Gaussian channels,” IEEE Trans. Commun., vol.

    63, no. 1, pp. 156-169, Jan. 2015.

    2. S. Amuru and R. M. Buehrer, “Optimal jamming against digital modulation,” IEEE Trans.

    Inf. Forensics and Security, vol. 10, no. 10, pp. 2212-2224, Oct. 2015.

    3. S. Amuru, C. Tekin, M. van der Schaar, and R. M. Buehrer, “Jamming bandits - a novel

    learning method for optimal jamming,” submitted to IEEE Trans. Wireless Commun., avail-

    able at arXiv:1411.3652.

    4. S. Amuru and R. M. Buehrer, “On jamming power allocation against OFDM signals in

    fading channels,” submitted to IEEE Trans. Inf. Forensics and Security, Aug. 2015.

    5. S. Amuru, R. M. Buehrer, and M. van der Schaar, “Blind network interdiction strategies - a

    learning approach,” submitted to IEEE Trans. Cognitive Commun. Netw., Sept. 2015.

    6. S. Amuru, H. S. Dhillon, and R. M. Buehrer, “On jamming attacks against wireless net-

works,” submitted to IEEE Trans. Wireless Commun., Sept. 2015.

    7. S. Amuru and R. M. Buehrer, “Optimal jamming using delayed learning,” in Proc. IEEE

    Military Comm. Conf., (Milcom), Baltimore, MD, Oct. 2014, pp. 1528-1533.

    8. S. Amuru and R. M. Buehrer, “Optimal jamming strategies in digital communications-

    impact of modulation,” in Proc. IEEE Global Commun. Conf., Dec. 2014.

    9. S. Amuru, C. Tekin, M. van der Schaar, and R. M. Buehrer, “A systematic learning method

    for optimal jamming,” in Proc. Intern. Conf. Commun., Jun. 2015.

Chapter 2

    Background

    In this chapter, we briefly introduce two concepts that are used in this dissertation, namely a)

    Reinforcement learning and the associated theory of Markov Decision Processes and b) Multi-

    armed bandits.

    2.1 Reinforcement Learning and Markov Decision Processes

    Reinforcement learning is a technique that allows an agent to modify its actions (without any

    supervision) by repeatedly interacting with the environment and is commonly used to address

    sequential decision making. A reinforcement learning task that satisfies the Markov property1

is called a Markov decision process, or MDP [47]. An MDP is defined by a tuple (S, A, P, R), where S is the set of all possible environment states and A is the set of all possible actions that the agent can perform in any environment state. For instance, from a jammer’s perspective, the

    environment states could be Transmission/No Transmission to reflect the cases where a packet is

    exchanged between the transmit-receive pair or when they are idle, and the actions of the jammer

could be Jam/Don’t Jam. P is the state transition probability matrix that governs the dynamics of the environment, and its entries are given by the transition probabilities p(s′|s, a), which indicate the probability that the environment moves to state s′ when action a is executed in state s. Finally, R indicates the |S| × |A| reward matrix whose entries r(s, a) indicate the reward (for example, energy expended) obtained in state s when action a is executed. Here, |S| and |A| indicate the cardinality of the sets S and A, respectively.

1The Markov property refers to the memoryless property of a stochastic process. More specifically, the conditional probability distribution of the future states of the random process depends only on the present state and not on the states that happened earlier. Such a stochastic process is also known as a Markov process.

In the traditional RL framework, an agent observes the current state of the environment s and chooses an action a. An optimum policy (a functional mapping between states and the actions that can be performed in these states) is one that maximizes the total expected reward, which is more often than not discounted by a factor γ ∈ [0, 1) to account for an infinite time horizon. The objective of an RL algorithm is therefore to find an optimal policy Π (a mapping between states and actions) that maximizes the cumulative discounted reward

$$R(t) = \sum_{k=0}^{\infty} \gamma^k \, r\big(s_{t+k}, a_{t+k} = \Pi(s_{t+k})\big), \qquad (2.1)$$

where $s_t, a_t$ indicate the state and action taken at time $t$ [46]. The value of a policy Π when the environment is in state $s$ is given by

$$V^{\Pi}(s) = \mathbb{E}_{\Pi}\left(\sum_{k=0}^{\infty} \gamma^k \, r(s_{t+k}, a_{t+k} \mid s_t = s)\right), \qquad (2.2)$$

where $\mathbb{E}_{\Pi}$ indicates the averaging performed over all possible state transitions when the agent follows the policy Π. Several algorithms exist to find an optimal policy Π∗, such as value iteration and policy iteration (which are useful when P is known a priori); for more details, please see [47]. For ease of analysis, we assume a stationary model (the state transition matrix is independent of time)

    and ignore the time parameter t hereafter.

    When the underlying MDP model is known, policy evaluation (finding the value of a given policy)

    can also be done via matrix inversion (especially for small MDPs, i.e., MDPs with small state-

    action space) [47]. Specifically,

$$V^{\Pi}(s) = r(s, a = \Pi(s)) + \mathbb{E}_{\Pi}\left(\sum_{k=1}^{\infty} \gamma^k \, r(s_k, a_k \mid s)\right) = r(s, a = \Pi(s)) + \gamma \sum_{s'} p(s'|s, a = \Pi(s)) \, V^{\Pi}(s').$$

Thus, writing the above set of equations for all possible states $s \in S$ in the MDP, we have

$$\bar{V}^{\Pi} = \bar{r}^{\Pi} + \gamma P^{\Pi}(s'|s)\,\bar{V}^{\Pi} \;\Longrightarrow\; \bar{V}^{\Pi} = \left(I - \gamma P^{\Pi}(s'|s)\right)^{-1}\bar{r}^{\Pi}, \qquad (2.3)$$

where $\bar{V}^{\Pi}$ is the $|S| \times 1$ vector of values of the policy Π in states $s \in S$, $\bar{r}^{\Pi}$ is the $|S| \times 1$ vector of rewards obtained in states $s \in S$ using policy Π, $P^{\Pi}(s'|s)$ indicates the $|S| \times |S|$ state transition probability matrix when the agent uses policy Π, and $I$ is an identity matrix of appropriate dimensions.
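For a concrete instance of (2.3), the sketch below evaluates a fixed policy on a two-state MDP by direct matrix inversion; the transition matrix and rewards are invented for illustration.

```python
import numpy as np

# Policy evaluation via (2.3): V = (I - gamma * P_pi)^(-1) r_pi.
# Two states (0 = No Transmission, 1 = Transmission); numbers are illustrative.
gamma = 0.9
P_pi = np.array([[0.7, 0.3],      # row s: p(s'|s, a = Pi(s))
                 [0.7, 0.3]])
r_pi = np.array([0.0, 0.5])       # r(s, Pi(s))

# Solve the linear system rather than forming the inverse explicitly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print("V^Pi =", V)
```

For anything beyond small state spaces, iterative methods are preferred over this cubic-cost inversion, which is one reason the learning methods below matter.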

In general, when no policy is given a priori, for any set of states S and set of actions A, the optimal value function can be written as

$$V^*(s) = \max_{a \in A}\left(r(s, a) + \gamma \sum_{s'} p(s'|s, a) \, V(s')\right), \qquad (2.4)$$

which indicates the optimal value that can be associated with a state $s \in S$ (which is known by exploring all actions $a \in A$ in the state $s$). Along similar lines, we define a new state-action function $Q(s, a)$, which captures the quality of an action taken in a particular state, as follows:

$$Q(s, a) = r(s, a) + \gamma \max_{a' \in A} \sum_{s'} p(s'|s, a) \, Q(s', a'), \qquad (2.5)$$

which gives the optimal value function as $V^*(s) = \max_{a \in A} Q(s, a)$ and helps to find an optimal policy Π∗ as2

$$\Pi^*(s) = \arg\max_{a \in A} Q(s, a). \qquad (2.6)$$

2Note that while the optimal value function $V^*(s)$ is unique, the optimal policy is not necessarily unique [47].

    It should be clear by now that all the above equations can be used to find the optimal set of actions

only when P, i.e., the transition probability matrix, is known or can be estimated. Such techniques that rely on the knowledge of P are commonly known as Indirect Learning or Planning algorithms [47]. But usually, P is unknown in dynamic environments and can be difficult to estimate in real-time environments. Since the value of a state is defined as the expectation of the random rewards

    obtained when the MDP is started from the given state, a direct way of estimating this value is to

    estimate an average over multiple independent realizations of the MDP that start from the given

state, i.e., the Monte Carlo technique. Unfortunately, the variance of the returns can be high, which

    can result in poor estimates of the Q-function (because it is possible to obtain different estimates

    for the same state-action pair, for example due to the wireless channel conditions). To address this,

    an online learning technique popularly known as Q-Learning [47] was developed, which updates

    the state-action function as below:

$$Q_t(s, a) = (1 - \alpha_t)\,Q_{t-1}(s, a) + \alpha_t\left[r(s, a) + \gamma \max_{a'} \sum_{s'} p(s'|s, a)\, Q_{t-1}(s', a')\right], \qquad (2.7)$$

which is shown to converge to the optimal solution when the learning rate $\alpha_t \in (0, 1]$ satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. The proof of convergence is based on relating (2.7) to an ordinary differential equation with a fixed-point solution and the theory of stochastic approximation [48].

    Moreover, when we are concerned with online learning problems, finding a balance between ex-

    ploration (trying actions that may yield higher rewards) and exploitation (using the best actions

learned thus far) becomes important, given the finite available resources. $\epsilon$-Greedy is a commonly used learning algorithm where an agent explores the actions (in any state) with probability $\epsilon$ and exploits the existing knowledge with probability $1 - \epsilon$. In Q-Learning, actions are chosen as per an exploration-exploitation schedule that is decided a priori such that all actions can be tried in all

    possible environment states. Thus, such learning algorithms can guarantee optimality only asymp-

    totically as the size of the MDP grows. While the theory is mature for the case of finite MDPs,

    efficient exploration, for example, is still being studied in the case of large MDPs (this problem

has been addressed well in the context of multi-armed bandit problems, which is discussed next).

Finite-time bounds that indicate the rate of convergence to the optimal policy in the case of finite

    MDPs have been studied [49]. For more details on reinforcement learning, please see [46]-[49].
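The sketch below ties these pieces together: tabular Q-learning with $\epsilon$-greedy exploration and a learning rate $\alpha_t = 1/N_t(s,a)$ satisfying the two conditions above, run on the two-state jamming environment used as the running example. Note that (2.7) is written with the model term p(s′|s, a); the sketch uses the standard sampled-transition form, which is what makes Q-learning model-free. The dynamics and rewards are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two states (0 = No Transmission, 1 = Transmission), two actions (0 = Don't Jam, 1 = Jam).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],     # P[a][s][s'] for action 0 (Don't Jam)
              [[0.8, 0.2], [0.7, 0.3]]])    # and action 1 (Jam); values illustrative
R = np.array([[0.0, -0.2], [-1.0, 0.5]])    # R[s][a]
gamma, eps = 0.9, 0.1

Q = np.zeros((2, 2))
visits = np.zeros((2, 2))
s = 0
for _ in range(50_000):
    # epsilon-greedy action selection
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]               # satisfies sum(alpha)=inf, sum(alpha^2)<inf
    s_next = int(rng.random() < P[a, s, 1])  # sample s' from the (unknown-to-agent) model
    # sampled-transition form of update (2.7)
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy policy per state:", Q.argmax(axis=1))  # should jam only in the Transmission state
```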

    2.2 Multi-armed Bandits

    Multi-armed bandit problems are a sub-class of sequential decision making problems that are con-

    cerned with allocating the available resources among several alternative arms/actions [50]-[53].

    For example, such algorithms are most widely used in the context of clinical trials where several

    treatments are applied to patients in a sequential manner, and patients are dynamically allocated

    to the best treatment [50]. A single-armed bandit process is an arm that is defined by two random

    sequences namely, s(n) and r(s(n)), where s(n) is the state of the arm after it has been played ntimes and r(s(n)) is the instantaneous reward obtained after the arm is played n times. Specifically,it is assumed that the state of the arm evolves as s(n) = fn−1(s(0), s(1), . . . , s(n− 1), w(n− 1)),where fn−1(.) is known a priori and w(n) is a sequence of independent random variables that arealso independent of s(n) and come from a known statistical distribution. A multi-armed banditprocess is defined as a collection of K such single-armed bandit processes3 and a controller/playerthat has to make the decision to choose one among these K bandit processes at every time instant.The decisions are taken such that the average cumulative discounted reward is maximized. There-

    fore, it is a sequential decision problem as the decision to be made at every time instant depends

    on what happened thus far and thereby faces the exploration versus exploitation dilemma when it

    has to choose the arms. Putting this in the context of the MDPs introduced earlier4, a policy now

refers to the sequence of arms chosen over time and, therefore, an optimal policy is one that chooses the best arm at every time instant. The goal of this problem is to maximize

$$J = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \sum_{k=1}^{K} r_k\big(s_k(n_k(t)), u_k(t)\big) \,\Big|\, s_1(0), s_2(0), \ldots, s_K(0)\right], \qquad (2.8)$$

where $\gamma \in (0, 1]$ is the discount factor, $n_k(t)$ is the number of times arm $k$ has been chosen until time $t$, and $u_k(t)$ is a $1 \times K$ vector which has a 1 in the $k$th position and 0 elsewhere if the $k$th arm is chosen.

    It is easy to see that this problem formulation is similar to that of MDPs and can be solved by

using the regular dynamic programming techniques. However, it has been shown that this $K$-dimensional problem can be reduced to $K$ one-dimensional problems and that the optimal policy is

of an index type (i.e., an index known as the Gittins index is assigned to each of the arms and the arm with the highest index value is chosen at every time instant) under certain constraints which are satisfied by the MAB problems [50]. The specific constraints are a) all arms are independent, b) only one arm is chosen to play at any time instant, c) only the state of the arm chosen changes according to $f(\cdot)$ and the rest of the arms are frozen, and d) an arm gives a reward only when it is operated. However, these Gittins indices can be evaluated only with knowledge of the evolution process $f(\cdot)$ and perfect knowledge of the reward functions [50]-[53]. Further, as discussed in [52], this formulation is most useful and computationally feasible for obtaining optimal policies when the arms evolve according to a Markov process.

3For ease of analysis, $K$ is assumed to be finite. The case of a continuous state space $S$ is usually handled by discretization of the state space. More details about such continuous spaces will be presented as part of our solution to (Q2).

4While the concept of bandit processes was initially developed in the context of a Markovian evolution process (similar to the formulation of an MDP) [50], they were later generalized to all generic stochastic processes described by $s(n) = f_{n-1}(s(0), s(1), \ldots, s(n-1), w(n-1))$ in [51].

An alternative formulation for the MAB problem, which does not assume any parametric forms for the state and reward functions and is based on formulating a performance criterion known as "learning loss" or "regret," was primarily investigated in [54]-[56]. Regret is defined as the

    difference between the expected reward that can be obtained by an oracle policy that has some

    or complete knowledge about the statistical distribution of the arms and their rewards, and the

    expected reward of the player’s policy. The most commonly used oracle policy is the best single

    action policy that is optimal among all policies which choose only one arm over the entire time

    horizon. This type of regret is also known as the “weak regret” [57], which will be used throughout

    this report. Formally, weak regret is defined as

$$R_n^W = \max_{k=1,2,\ldots,K} \mathbb{E}\left(\sum_{t=1}^{n} r_{k,t} - \sum_{t=1}^{n} r_{I_t,t}\right), \qquad (2.9)$$

where $I_t$ indicates the arm chosen at time $t$, $r_{k,t}$ is the reward obtained at time $t$ by playing the $k$th arm, and the expectation is with respect to the random choices made by the player's policy and the unknown evolution process of the arms (in other words, the random rewards obtained from the arms). On the other hand, the concept of the "strong regret," defined as

$$R_n^S = \mathbb{E}\left[\max_{k=1,2,\ldots,K} \left(\sum_{t=1}^{n} r_{k,t} - \sum_{t=1}^{n} r_{I_t,t}\right)\right],$$

moves the expectation outside the maximum and is therefore never smaller than the weak regret.
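The weak regret (2.9) is straightforward to estimate empirically. The sketch below measures it for an $\epsilon$-greedy player on Bernoulli arms; the arm means are invented for illustration, and the known best mean is used as the oracle benchmark.

```python
import numpy as np

rng = np.random.default_rng(5)

# Empirical weak regret of epsilon-greedy on Bernoulli arms (means illustrative).
mu, eps, n = np.array([0.2, 0.5, 0.7]), 0.1, 10_000
K = len(mu)
counts, means, reward_sum = np.zeros(K), np.zeros(K), 0.0
for t in range(n):
    if t < K:
        arm = t                                   # play each arm once to initialize
    else:
        arm = rng.integers(K) if rng.random() < eps else int(np.argmax(means))
    r = float(rng.random() < mu[arm])             # Bernoulli reward
    counts[arm] += 1
    means[arm] += (r - means[arm]) / counts[arm]  # running-mean update
    reward_sum += r

# Weak regret benchmarks against always playing the best fixed arm in expectation.
print("empirical weak regret:", n * mu.max() - reward_sum)
```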

