
A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges

Moloud Abdar*, Farhad Pourpanah, Member, IEEE, Sadiq Hussain, Dana Rezazadegan, Li Liu, Senior Member, IEEE, Mohammad Ghavamzadeh, Paul Fieguth, Senior Member, IEEE, Xiaochun Cao, Senior Member, IEEE, Abbas Khosravi, Member, IEEE, U Rajendra Acharya, Senior Member, IEEE, Vladimir Makarenkov and Saeid Nahavandi, Fellow, IEEE

Abstract—Uncertainty quantification (UQ) plays a pivotal role in the reduction of uncertainties during both optimization and decision-making processes. It can be applied to solve a variety of real-world problems in science and engineering. Bayesian approximation and ensemble learning techniques are the two most widely used UQ methods in the literature. In this regard, researchers have proposed different UQ methods and examined their performance in a variety of applications such as computer vision (e.g., self-driving cars and object detection), image processing (e.g., image restoration), medical image analysis (e.g., medical image classification and segmentation), natural language processing (e.g., text classification, social media texts and recidivism risk-scoring), bioinformatics, etc. This study reviews recent advances in UQ methods used in deep learning. Moreover, we also investigate the application of these methods in reinforcement learning (RL). Then, we outline a few important applications of UQ methods. Finally, we briefly highlight the fundamental research challenges faced by UQ methods and discuss future research directions in this field.

Index Terms—Artificial intelligence, Uncertainty quantification, Deep learning, Machine learning, Bayesian statistics, Ensemble learning, Reinforcement learning.


    1 INTRODUCTION

In everyday scenarios, we deal with uncertainty in numerous fields, from investment opportunities and medical diagnosis to sporting games and weather forecasting, with the objective of making decisions based on collected observations and uncertain domain knowledge. Nowadays, we can rely on models developed using machine and deep learning

• M. Abdar, A. Khosravi and S. Nahavandi are with the Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia (e-mails: [email protected], [email protected], [email protected] & [email protected]).
• F. Pourpanah is with the College of Mathematics and Statistics, Shenzhen University, Shenzhen, China (e-mail: [email protected]).
• S. Hussain is with System Administration, Dibrugarh University, Dibrugarh, India (e-mail: [email protected]).
• D. Rezazadegan is with the Department of Computer Science and Software Engineering, Swinburne University of Technology, Melbourne, Australia (e-mail: [email protected]).
• L. Liu is with the Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland (e-mail: [email protected]).
• M. Ghavamzadeh is with Google Research (e-mail: [email protected]).
• P. Fieguth is with the Department of Systems Design Engineering, University of Waterloo, Waterloo, Canada (e-mail: [email protected]).
• X. Cao is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: [email protected]).
• U. R. Acharya is with the Department of Electronics and Computer Engineering, Ngee Ann Polytechnic, Clementi, Singapore (e-mail: [email protected]).
• V. Makarenkov is with the Department of Computer Science, University of Quebec in Montreal, Montreal (QC), Canada (e-mail: [email protected]).
• * Corresponding author: Moloud Abdar, [email protected]

Fig. 1: A schematic view of the main differences between aleatoric and epistemic uncertainties.

techniques that can quantify uncertainties and accomplish statistical inference [1]. It is very important to evaluate the efficacy of artificial intelligence (AI) systems before their usage [2]. The predictions made by such models are uncertain, as they are prone to noise and wrong model inference, in addition to the inductive assumptions that are inherent in the case of uncertainty. Thus, it is highly desirable to represent uncertainty in a trustworthy manner in any AI-based system. Such automated systems should be able to perform accurately by handling uncertainty effectively. The principle of uncertainty plays an important role in AI settings such as concrete learning algorithms [3] and active learning (AL) [4], [5]. Sources of uncertainty occur when the test and training


Fig. 2: Schematic view of three different uncertainty models with the related network architectures: (a) Monte Carlo (MC) dropout, (b) Bootstrap model, and (c) Gaussian Mixture Model (GMM); reproduced based on [9].

data are mismatched, and data uncertainty occurs because of class overlap or the presence of noise in the data [6]. Estimating knowledge uncertainty is more difficult than estimating data uncertainty, which is naturally measured as a result of maximum-likelihood training. The sources of uncertainty in a prediction are essential for tackling the uncertainty estimation problem [7]. There are two main sources of uncertainty, conceptually called aleatoric and epistemic uncertainty [8] (see Fig. 1).
Irreducible uncertainty in the data, giving rise to uncertainty in predictions, is aleatoric uncertainty (also known as data uncertainty). This type of uncertainty is not a property of the model, but rather an inherent property of the data distribution; hence it is irreducible. The other type of uncertainty is epistemic uncertainty (also known as knowledge uncertainty), which occurs due to inadequate knowledge and data. One can define models to answer different human questions posed in model-based prediction. In the case of a data-rich problem, there is a collection of massive data, but it may be informatively poor [10]. In such cases, AI-based methods can be used to define efficient models that characterize the emergent features of the data. Very often these data are incomplete, noisy, discordant and multimodal [1].
Uncertainty quantification (UQ) underpins many critical decisions today. Predictions made without UQ are usually not trustworthy and often inaccurate. To understand the Deep Learning (DL) [11], [12] process life cycle, we need to comprehend the role of UQ in DL. DL models start with the collection of the most comprehensive and potentially relevant datasets available for the decision-making process. The DL scenarios are designed to meet some performance goals, so as to select the most appropriate DL architecture after training

Fig. 3: A graphical representation of two different uncertainty-aware (UA) models: (a) BNN and (b) OoD classifier; reproduced based on [14].

the model using the labeled data. The iterative training process optimizes different learning parameters, which are 'tweaked' until the network provides a satisfactory level of performance.
There are several uncertainties that need to be quantified in the steps involved. The most obvious uncertainties in these steps are the following: (i) selection and collection of training data, (ii) completeness and accuracy of training data, (iii) understanding the DL (or traditional machine learning) model with its performance bounds and limitations, and (iv) uncertainties corresponding to the performance of the model based on operational data [13]. Data-driven approaches such as DL associated with UQ pose at least four overlapping groups of challenges: (i) absence of theory, (ii) absence of causal models, (iii) sensitivity to imperfect data, and (iv) computational expense. To mitigate such challenges, ad hoc solutions like the study of model variability and sensitivity analysis are sometimes employed. Uncertainty estimation and quantification have been extensively studied in DL and traditional machine learning. In the following, we provide a brief summary of some recent studies that examined the effectiveness of various methods to deal with uncertainties.
A schematic comparison of three different uncertainty models [9] (MC dropout, Bootstrap model and GMM) is provided in Fig. 2. In addition, a graphical representation of two uncertainty-aware models (BNN vs. OoD classifier) is illustrated in Fig. 3.

1.1 Research Objectives and Outline

In the era of big data, ML and DL, intelligent use of different raw data has an enormous potential to benefit a wide variety of areas. However, UQ in different ML and DL methods can significantly increase the reliability of their results. Ning et al. [15] summarized and classified the main contributions of the data-driven optimization paradigm under uncertainty. However, that paper reviewed data-driven optimization only. In another study, Kabir et al. [16] reviewed neural-network-based UQ. The authors focused on probabilistic forecasting and prediction intervals (PIs), as they are among the most widely used UQ techniques in the literature.
We have noticed that, from 2010 to 2020 (end of June), more than 2500 papers on UQ in AI have been published in various fields (e.g., computer vision, image processing, medical image analysis, signal processing, natural language


processing, etc.). On the one hand, we ignored a large number of papers due to a lack of adequate connection with the subject of our review. On the other hand, although many of the papers we reviewed have been published in related conferences and journals, many others were found on an open-access repository as electronic preprints (i.e., arXiv); we reviewed them due to their high quality and full relevance to the subject. We have tried our best to cover most of the related articles in this review paper. It is worth mentioning that this review can, therefore, serve as a comprehensive guide for readers wishing to navigate this fast-growing research field.
Unlike previous review papers in the field of UQ, this study reviews the most recent articles published on quantifying uncertainty in AI (ML and DL) using different approaches. In addition, we are keen to find out how UQ can impact real cases and how solving uncertainty in AI can help to obtain reliable results. Meanwhile, finding important gaps in existing methods is a great way to shed light on the path to future research. In this regard, this review paper gives more input to future researchers who work on UQ in ML and DL. We investigated the most recent studies in the domain of UQ applied in ML and DL methods, and we summarize the few existing studies on UQ in ML and DL. It is worth mentioning that the main purpose of this study is not to compare the performance of the different proposed UQ methods, because these methods were introduced for different data and specific tasks, and we argue that comparing the performance of all methods is beyond the scope of this study. Instead, this study mainly focuses on important areas including DL, ML and Reinforcement Learning (RL). Hence, the main contributions of this study are as follows:

• To the best of our knowledge, this is the first comprehensive review paper on UQ methods used in ML and DL, which is worthwhile for researchers in this domain.
• A comprehensive review of newly proposed UQ methods is provided.
• The main categories of important applications of UQ methods are also listed.
• The main research gaps of UQ methods are pointed out.
• Finally, a few solid future directions are discussed.

2 PRELIMINARIES

In this section, we explain the structure of the feed-forward neural network, followed by Bayesian modeling, in order to discuss uncertainty in detail.

    2.1 Feed-forward neural network

In this section, the structure of a single-hidden-layer neural network [17] is explained, which can be extended to multiple layers. Suppose x is a D-dimensional input vector; we use a linear map W1 and a bias b to transform x into a row vector with Q elements, i.e., xW1 + b. Next, a non-linear transfer function σ(·), such as the rectified linear unit (ReLU), can be applied to obtain the output of the hidden layer. Then another linear function W2 can be used to map the hidden layer to the output:

$$\hat{y} = \sigma(x W_1 + b)\, W_2 \tag{1}$$

For classification, to compute the probability of an input belonging to a label c in the set {1, ..., C}, the normalized score is obtained by passing the model output ŷ through a softmax function, $\hat{p}_d = \exp(\hat{y}_d) / \sum_{d'} \exp(\hat{y}_{d'})$. Then the softmax loss is used:

$$E_{W_1, W_2, b}(X, Y) = -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{p}_{i, c_i}) \tag{2}$$

where X = (x_1, ..., x_N) and Y = (y_1, ..., y_N) are the inputs and their corresponding outputs, respectively.

For regression, the Euclidean loss can be used:

$$E_{W_1, W_2, b}(X, Y) = \frac{1}{2N} \sum_{i=1}^{N} \|y_i - \hat{y}_i\|^2 \tag{3}$$
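As a concrete illustration of Eqs. (1)-(3), the following minimal NumPy sketch implements the single-hidden-layer forward pass together with the softmax classification loss and the Euclidean regression loss. The dimensions, random weights and data are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Q, C = 32, 10, 64, 5                      # samples, input dim, hidden units, classes (assumed)

X = rng.normal(size=(N, D))
y_cls = rng.integers(0, C, size=N)              # class labels for Eq. (2)
y_reg = rng.normal(size=(N, C))                 # regression targets for Eq. (3)

W1, b = 0.1 * rng.normal(size=(D, Q)), np.zeros(Q)
W2 = 0.1 * rng.normal(size=(Q, C))

def forward(X):
    """Eq. (1): y_hat = sigma(x W1 + b) W2 with a ReLU non-linearity."""
    return np.maximum(X @ W1 + b, 0.0) @ W2

y_hat = forward(X)

# Softmax probabilities and the classification loss of Eq. (2)
logits = y_hat - y_hat.max(axis=1, keepdims=True)          # shift for numerical stability
p_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
softmax_loss = -np.mean(np.log(p_hat[np.arange(N), y_cls]))

# Euclidean (squared-error) regression loss of Eq. (3)
euclidean_loss = np.sum((y_reg - y_hat) ** 2) / (2 * N)

print(softmax_loss, euclidean_loss)
```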

2.2 Uncertainty Modeling

As mentioned above, there are two main types of uncertainty: epistemic (model uncertainty) and aleatoric (data uncertainty) [18]. Aleatoric uncertainty has two variants: homoscedastic and heteroscedastic [19].
The predictive uncertainty (PU) consists of two parts, (i) epistemic uncertainty (EU) and (ii) aleatoric uncertainty (AU), and can be written as the sum of these two parts:

$$PU = EU + AU. \tag{4}$$

Epistemic uncertainty can be formulated as a probability distribution over the model parameters. Let $D_{tr} = \{X, Y\} = \{(x_i, y_i)\}_{i=1}^{N}$ denote a training dataset with inputs $x_i \in$


the posterior distribution obtained by the model. As such, the Kullback-Leibler (KL) divergence [20] needs to be minimized with respect to θ. The level of similarity between the two distributions can be measured as follows:

$$KL\left(q_\theta(\omega)\,\|\,p(\omega|X, Y)\right) = \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega|X, Y)}\, d\omega. \tag{9}$$

The predictive distribution can be approximated by minimizing the KL divergence, as follows:

$$p(y^*|x^*, X, Y) \approx \int p(y^*|x^*, \omega)\, q_\theta^*(\omega)\, d\omega =: q_\theta^*(y^*, x^*), \tag{10}$$

where $q_\theta^*(\omega)$ indicates the optimized objective. KL divergence minimization can also be rearranged into the evidence lower bound (ELBO) maximization [21]:

$$\mathcal{L}_{VI}(\theta) := \int q_\theta(\omega) \log p(Y|X, \omega)\, d\omega - KL\left(q_\theta(\omega)\,\|\,p(\omega)\right), \tag{11}$$

where $q_\theta(\omega)$ should describe the data well (by maximizing the first term) while being as close as possible to the prior (by minimizing the second term). This process is called variational inference (VI). Dropout VI is one of the most common approaches and has been widely used to approximate inference in complex models [22]. The minimization objective is as follows [23]:

$$\mathcal{L}(\theta, p) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i|x_i, \omega) + \frac{1 - p}{2N}\, \|\theta\|^2 \tag{12}$$

where N and p represent the number of samples and the dropout probability, respectively.
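A minimal sketch of the dropout VI objective in Eq. (12) is given below, assuming the per-sample log-likelihoods from one stochastic forward pass and a flattened parameter vector are already available; the random inputs are purely illustrative.

```python
import numpy as np

def dropout_vi_objective(log_likelihoods, theta, dropout_p):
    """Eq. (12): negative mean log-likelihood plus the (1 - p) / (2N) weight penalty.

    log_likelihoods : log p(y_i | x_i, omega) for each sample, one stochastic pass
    theta           : flattened vector of all network weights
    dropout_p       : dropout probability p
    """
    n = len(log_likelihoods)
    nll = -np.mean(log_likelihoods)
    weight_penalty = (1.0 - dropout_p) / (2.0 * n) * np.sum(theta ** 2)
    return nll + weight_penalty

rng = np.random.default_rng(1)
loss = dropout_vi_objective(rng.normal(-1.0, 0.1, size=128), rng.normal(size=1000), dropout_p=0.5)
print(loss)
```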

To obtain data-dependent uncertainty, the precision τ in (6) can be formulated as a function of the data. One approach to obtaining epistemic uncertainty is to mix two functions: the predictive mean $f_\theta(x)$ and the model precision $g_\theta(x)$, so that the likelihood can be written as $y_i = \mathcal{N}(f_\theta(x), g_\theta(x)^{-1})$. A prior distribution is placed over the weights of the model, and then the amount of change in the weights for given data samples is computed. The Euclidean loss function (3) can be adapted as follows:

$$E_{W_1, W_2, b} := \frac{1}{2}\,\big(y - f_{W_1, W_2, b}(x)\big)\, g_{W_1, W_2, b}(x)\, \big(y - f_{W_1, W_2, b}(x)\big)^T - \frac{1}{2} \log\det g_{W_1, W_2, b} + \frac{D}{2} \log 2\pi = -\log \mathcal{N}\big(f_\theta(x), g_\theta(x)^{-1}\big) \tag{13}$$

The predictive variance can be obtained as follows:

$$\widehat{Var}[x^*] := \frac{1}{T} \sum_{t=1}^{T} \Big( g_{\tilde{\omega}_t}(x)^{-1} I + f_{\tilde{\omega}_t}(x^*)^T f_{\tilde{\omega}_t}(x^*) \Big) - \tilde{E}[y^*]^T \tilde{E}[y^*] \;\xrightarrow[T \to \infty]{}\; Var_{q_\theta^*(y^*|x^*)}[y^*] \tag{14}$$
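The Monte Carlo estimate in Eq. (14) can be sketched as follows, assuming T stochastic forward passes have already produced predictive means and model precisions for a single test point; the array shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def predictive_moments(f_samples, g_samples):
    """Estimate the predictive mean and covariance of Eq. (14) for one test point x*.

    f_samples : (T, D) predictive means f(x*) from T stochastic forward passes
    g_samples : (T,)  model precisions g(x*) from the same passes
    """
    T, D = f_samples.shape
    mean = f_samples.mean(axis=0)                               # E~[y*]
    aleatoric = np.mean(1.0 / g_samples) * np.eye(D)            # (1/T) sum g^{-1} I
    second_moment = (f_samples[:, :, None] * f_samples[:, None, :]).mean(axis=0)
    return mean, aleatoric + second_moment - np.outer(mean, mean)

rng = np.random.default_rng(2)
mu, cov = predictive_moments(rng.normal(size=(50, 3)), rng.uniform(1.0, 2.0, size=50))
print(mu, np.diag(cov))
```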

3 UNCERTAINTY QUANTIFICATION USING BAYESIAN TECHNIQUES

    3.1 Bayesian Deep Learning/Bayesian Neural Networks

Despite the success of standard DL methods in solving various real-world problems, they cannot provide information about the reliability of their predictions. To alleviate this issue, BNNs/BDL [24], [25], [26] can be used to interpret the model parameters. BNNs/BDL are robust to the over-fitting problem and can be trained on both small and big datasets [27].

    3.2 Monte Carlo (MC) dropout

As stated earlier, it is difficult to compute the exact posterior inference, but it can be approximated. In this regard, Monte Carlo (MC) sampling [28] is an effective method. Nonetheless, it is slow and computationally expensive when integrated into a deep architecture. To combat this, MC dropout has been introduced, which uses dropout [29] as a regularization term to compute the prediction uncertainty [30]. Dropout is an effective technique that has been widely used to solve the over-fitting problem in DNNs. During the training process, dropout randomly drops some units of the NN to keep them from co-tuning too much. Assume an NN with L layers, in which $W_l$, $b_l$ and $K_l$ denote the weight matrix, bias vector and dimension of the l-th layer, respectively. The output of the NN and the target class of the i-th input $x_i$ (i = 1, ..., N) are denoted by $\hat{y}_i$ and $y_i$, respectively. The objective function using L2 regularization can be written as:

$$\mathcal{L}_{dropout} := \frac{1}{N} \sum_{i=1}^{N} E(y_i, \hat{y}_i) + \lambda \sum_{l=1}^{L} \big( \|W_l\|_2^2 + \|b_l\|_2^2 \big) \tag{15}$$

Dropout samples binary variables for each input data point and for every network unit in each layer (except the output layer), with probability $p_i$ for the i-th layer; if the sampled value is 0, the unit is dropped for that input. The same values are used in the backward pass to update the parameters. Fig. 4 shows several visualizations of variational distributions on a simple NN [31].
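In practice, MC dropout keeps dropout active at test time and averages several stochastic forward passes. The following PyTorch-style sketch illustrates this, assuming a small classifier with a dropout layer; the architecture, dropout rate and number of passes T are illustrative choices, not prescriptions from the paper.

```python
import torch
import torch.nn as nn

# Illustrative classifier with a dropout layer (sizes are assumptions)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 3))

def mc_dropout_predict(model, x, T=20):
    """Average T stochastic softmax outputs with dropout kept active at test time."""
    model.train()                           # keeps dropout sampling enabled
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    mean_p = probs.mean(dim=0)              # predictive distribution
    entropy = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=-1)   # predictive entropy
    return mean_p, entropy, probs.var(dim=0)

mean_p, entropy, var = mc_dropout_predict(model, torch.randn(4, 10))
print(mean_p.shape, entropy, var.mean())
```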

Several studies have used MC dropout [32] to estimate UQ. Wang et al. [33] analyzed epistemic and aleatoric uncertainties for deep CNN-based medical image segmentation problems at both the pixel and structure levels. They augmented the input image during the test phase to estimate the transformation uncertainty. Specifically, MC sampling was used to estimate the distribution of the output segmentation. Liu et al. [34] proposed a unified model using SGD to approximate both epistemic and aleatoric uncertainties of CNNs in the presence of universal adversarial perturbations. The epistemic uncertainty was estimated by applying MC dropout with a Bernoulli distribution at the output of the neurons. In addition, they introduced a texture bias to better approximate the aleatoric uncertainty. Nasir et al. [35] conducted MC dropout to estimate four types of uncertainty, including the variance of MC samples, predictive entropy, and Mutual Information (MI), in a 3D CNN to segment lesions from MRI sequences.
In [37], two dropout methods, i.e. element-wise Bernoulli


Fig. 4: A graphical representation of several different visualizations of variational distributions on a simple NN: (a) baseline neural network, (b) Bernoulli DropConnect, (c) Gaussian DropConnect, (d) Bernoulli Dropout, (e) Gaussian Dropout, and (f) Spike-and-Slab Dropout; reproduced based on [31].

Fig. 5: A general view of the semi-supervised UA-MT framework (a teacher model guiding a student model of the same architecture via an MC-dropout uncertainty map, an EMA update, and consistency and segmentation losses) applied to left atrium (LA) segmentation from 3D MRI; reproduced based on [36].

dropout [29] and spatial Bernoulli dropout [38], were implemented to compute the model uncertainty in BNNs for end-to-end autonomous vehicle control. McClure and Kriegeskorte [31] expressed that sampling weights using Bernoulli or Gaussian distributions can lead to a more accurate depiction of uncertainty than sampling units. However, according to the outcomes obtained in [31], it can be argued that using either Bernoulli or Gaussian dropout can improve the classification accuracy of a CNN. Based on these findings, they proposed a novel model (called spike-and-slab sampling) by combining Bernoulli and Gaussian dropout.
Do et al. [39] modified U-Net [40], which is a CNN-based deep model, to segment myocardial arterial spin labeling images and estimate uncertainty. Specifically, batch normalization and dropout were added after each convolutional layer and resolution scale, respectively. Later, Teye et al. [41] proposed MC batch normalization (MCBN), which can be used to estimate the uncertainty of networks with batch normalization. They showed that batch normalization can be considered an approximate Bayesian model. Yu et al. [36] proposed a semi-supervised model to segment the left atrium from 3D MR images. It consists of two modules, a teacher and a student, used in a UA framework called the UA self-ensembling mean teacher (UA-MT) model (see Fig. 5). As such, the student model learns from the teacher model via minimizing the segmentation and consistency losses of the

Fig. 6: A graphical illustration of different SG-MCMC implementations: (a) one worker, (b) synchronous, (c) asynchronous, and (d) asynchronous and periodic (showing center parameters, intermediate parameters and gradient evaluations); reproduced based on [42].

labeled samples and the targets of the teacher model, respectively. In addition, a UA framework based on MC dropout was designed to help the student model learn a better model by using the uncertainty information obtained from the teacher model. Table 1 lists studies that directly applied MC dropout to approximate uncertainty, along with their applications.

3.2.1 Comparison of MC dropout with other UQ methods

Recently, several studies have been conducted to compare different UQ methods. For example, Foong et al. [60] empirically and theoretically studied MC dropout and mean-field Gaussian VI. They found that both models can express uncertainty well in shallow BNNs. However, mean-field Gaussian VI could not approximate the posterior well enough to estimate uncertainty for deep BNNs. Ng et al. [61] compared MC dropout with BBB using U-Net [40] as a base classifier. Siddhant et al. [62] empirically studied various DAL models for NLP; during prediction, they applied dropout to CNNs and RNNs to estimate the uncertainty. Hubschneider et al. [9] compared MC dropout with a bootstrap ensembling-based method and a Gaussian mixture for the task of vehicle control. In addition, Mukhoti [63] applied MC dropout with several models to estimate uncertainty in regression problems. Kennamer et al. [64] empirically studied MC dropout under astronomical observing conditions.

3.3 Markov chain Monte Carlo (MCMC)

Markov chain Monte Carlo (MCMC) [65] is another effective method that has been used to approximate inference. It starts by taking a random draw $z_0$ from a distribution $q(z_0)$ or $q(z_0|x)$. Then, it applies a stochastic transition to $z_0$, as follows:

$$z_t \sim q(z_t | z_{t-1}, x). \tag{16}$$

This transition operator is chosen and repeated T times, and the outcome, which is a random variable, converges in distribution to the exact posterior. Salakhutdinov et al. [66] used MCMC to approximate the predictive distribution over the rating values of movies. Despite the success of conventional MCMC, the sufficient number of iterations is not known in advance. In addition, MCMC requires a long time to converge to the desired distribution [28].
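As a toy illustration of the repeated stochastic transition in Eq. (16), the sketch below runs a random-walk Metropolis kernel on a one-dimensional target; the standard-normal target, step size and chain length are assumptions made purely for illustration.

```python
import numpy as np

def log_post(z):
    """Toy unnormalized log-posterior (standard normal); stands in for log p(z | x)."""
    return -0.5 * z ** 2

def metropolis_chain(z0, T=5000, step=0.5, seed=3):
    """Repeatedly apply a stochastic transition q(z_t | z_{t-1}, x), as in Eq. (16)."""
    rng = np.random.default_rng(seed)
    z, samples = z0, []
    for _ in range(T):
        proposal = z + step * rng.normal()                      # propose a move
        if np.log(rng.uniform()) < log_post(proposal) - log_post(z):
            z = proposal                                        # accept, otherwise keep z
        samples.append(z)
    return np.array(samples)

chain = metropolis_chain(z0=5.0)
print(chain[2500:].mean(), chain[2500:].std())   # approaches the target's mean and std
```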

TABLE 1: A summary of studies that applied the original MC dropout to approximate uncertainty, along with their applications (sorted by year).

Study                    | Year | Method                              | Application                     | Code
Kendall et al. [43]      | 2015 | SegNet [44]                         | semantic segmentation           | √
Leibig et al. [45]       | 2017 | CNN                                 | diabetic retinopathy            | √
Choi et al. [46]         | 2017 | mixture density network (MDN) [47]  | regression                      | ×
Jung et al. [48]         | 2018 | full-resolution ResNet [49]         | brain tumor segmentation        | ×
Wickstrom et al. [50]    | 2018 | FCN [51] and SegNet [44]            | polyps segmentation             | ×
Jungo et al. [52]        | 2018 | FCN                                 | brain tumor segmentation        | ×
Vandal et al. [53]       | 2018 | Variational LSTM                    | predicting flight delays        | ×
Devries and Taylor [54]  | 2018 | CNN                                 | medical image segmentation      | ×
Tousignant et al. [55]   | 2019 | CNN                                 | MRI images                      | ×
Norouzi et al. [56]      | 2019 | FCN                                 | MRI image segmentation          | ×
Roy et al. [57]          | 2019 | Bayesian FCNN                       | brain image (MRI) segmentation  | √
Filos et al. [58]        | 2019 | CNN                                 | diabetic retinopathy            | √
Harper and Southern [59] | 2020 | RNN and CNN                         | emotion prediction              | ×

(√ indicates that code is available; × indicates that it is not.)

Several studies have been conducted to overcome these shortcomings. For example, Salimans et al. [67] expanded the space into a set of auxiliary random variables and interpreted the stochastic Markov chain as a variational approximation.
Stochastic gradient MCMC (SG-MCMC) [68], [69] was proposed to train DNNs. It only needs to estimate the gradient on small sets of mini-batches. In addition, SG-MCMC converges to the true posterior when the step sizes are decreased [70], [71]. Gong et al. [72] combined amortized inference with SG-MCMC to increase the generalization ability of the model. Li et al. [42] proposed an accelerated SG-MCMC to improve the speed of conventional SG-MCMC (see Fig. 6 for implementations of different SG-MCMC models). However, over a short time horizon, SG-MCMC suffers from a bounded estimation error [73], and it loses surface when applied to multi-layer networks [74]. In this regard, Zhang et al. [75] developed a cyclical SG-MCMC (cSG-MCMC) to compute the posterior over the weights of neural networks. Specifically, a cyclical stepsize was used instead of a decreasing one: a large stepsize allows the sampler to take large moves, while a small stepsize encourages the sampler to explore local modes.
Although SG-MCMC reduces the computational complexity by using a smaller subset (i.e., a mini-batch) of the dataset at each iteration to update the model parameters, those small subsets of data add noise into the model and consequently increase the uncertainty of the system. To alleviate this, Luo et al. [76] introduced a sampling method called the thermostat-assisted continuously tempered Hamiltonian Monte Carlo, which is an extended version of conventional Hamiltonian MC (HMC) [77]; note that HMC is an MCMC method [78]. Specifically, they used Nosé-Hoover thermostats [79], [80] to handle the noise generated by mini-batch datasets. Later, dropout HMC (D-HMC) [78] was proposed for uncertainty estimation and compared with SG-MCMC [68] and SGLD [81].
Besides, MCMC has been integrated into generative-based methods to approximate the posterior. For example, in [82], MCMC was applied to stochastic object models, learned by generative adversarial networks (GANs), to approximate the ideal observer. In [83], a visual tracking system based on a variational autoencoder MCMC (VAE-MCMC) was proposed.
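To make the SG-MCMC idea concrete, the sketch below shows a single stochastic gradient Langevin dynamics (SGLD) update, where the mini-batch gradient is rescaled to the full dataset and Gaussian noise matched to the step size is injected; the toy gradients, step-size schedule and batch sizes are assumptions for illustration only.

```python
import numpy as np

def sgld_step(theta, minibatch_grad_log_lik, grad_log_prior, step_size, n_total, n_batch, rng):
    """One stochastic gradient Langevin dynamics (SGLD) update of the parameters."""
    grad = grad_log_prior(theta) + (n_total / n_batch) * minibatch_grad_log_lik(theta)
    noise = rng.normal(scale=np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad + noise

rng = np.random.default_rng(4)
theta = np.zeros(2)
for t in range(1000):
    theta = sgld_step(theta,
                      minibatch_grad_log_lik=lambda th: -th,      # toy gradient (illustrative)
                      grad_log_prior=lambda th: -th,
                      step_size=1e-2 / (1 + t) ** 0.55,           # decreasing step sizes
                      n_total=1000, n_batch=100, rng=rng)
print(theta)
```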

Fig. 7: A summary of various VI methods for BDL, organized by tied vs. free parameters and structured vs. factorized distributions: NoisyK-FAC, matrix-variate normal VB, hierarchical VB, normalizing flows, structured mean field, Gaussian mean field, and weight sharing (mean-field assumption + dramatic reduction); reproduced based on [84]. Note that weight sharing (mean-field assumption + dramatic reduction) is added based on the method proposed in [84].

3.4 Variational Inference (VI)

Variational inference (VI) is an approximation method that learns the posterior distribution over the BNN weights. VI-based methods consider the Bayesian inference problem as an optimization problem of the kind used by SGD to train DNNs. Fig. 7 summarizes various VI methods for BDL [84].
For BNNs, VI-based methods aim to approximate the posterior distribution over the weights of the NN. To achieve this, the loss can be defined as follows:

$$\mathcal{L}(\Phi) \approx \frac{1}{2|D|} \sum_{i=1}^{|D|} \mathcal{L}_R\big(y^{(i)}, x^{(i)}\big) + \frac{1}{|D|}\, KL\big(q_\phi(w)\,\|\,p(w)\big) \tag{17}$$

where $|D|$ indicates the number of samples, and

$$\mathcal{L}_R(y, x) = -\log(\hat{\tau}_x)^T \mathbf{1} + \big\|\sqrt{\hat{\tau}_x} \odot (y - \hat{\mu}_x)\big\|^2 \tag{18}$$

$$\hat{\mu}_x = \hat{\mu}(x, w_\mu); \qquad w \sim q_\phi(w) \tag{19}$$

$$\hat{\tau}_x = \hat{\tau}(x, w_r), \tag{20}$$

where $\odot$ and $\mathbf{1}$ represent the element-wise product and a vector filled with ones, respectively. Eq. (17) can be used to compute (10).
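A minimal NumPy sketch of the loss in Eqs. (17)-(18) follows, assuming the network's predicted means and precisions and the KL term have already been computed elsewhere; the function names and the random inputs are illustrative assumptions.

```python
import numpy as np

def reconstruction_loss(y, mu_x, tau_x):
    """Eq. (18): -log(tau)^T 1 + || sqrt(tau) * (y - mu) ||^2 for one sample."""
    return -np.sum(np.log(tau_x)) + np.sum(tau_x * (y - mu_x) ** 2)

def vi_loss(recon_losses, kl_q_p):
    """Eq. (17): 0.5 * mean reconstruction loss + KL(q_phi(w) || p(w)) / |D|."""
    n = len(recon_losses)
    return 0.5 * np.mean(recon_losses) + kl_q_p / n

rng = np.random.default_rng(5)
losses = [reconstruction_loss(rng.normal(size=4), rng.normal(size=4), rng.uniform(0.5, 2.0, size=4))
          for _ in range(100)]
print(vi_loss(np.array(losses), kl_q_p=10.0))
```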

Posch et al. [85] defined the variational distribution using a product of Gaussian distributions with diagonal covariance matrices, representing the posterior uncertainty of the network parameters for each network layer. Later, in [86], they replaced the diagonal covariance matrices with the traditional ones to allow the network parameters to correlate with each other. Inspired by transfer learning and empirical Bayes (EB) [87], MOPED [88] used deterministic weights, derived from a pre-trained DNN with the same architecture, to select meaningful prior distributions over the weight space. Later, in [89], they integrated an approach based on parametric EB into MOPED for mean-field VI in Bayesian DNNs and used a fully factorized Gaussian distribution to model the weights. In addition, they used a real-world case study, diabetic retinopathy diagnosis, to evaluate their method. Subedar et al. [90] proposed an uncertainty-aware framework based on multi-modal Bayesian fusion for activity recognition. They scaled BDNNs into a deeper structure by combining deterministic and variational layers. Marino et al. [91] proposed a stochastic-modeling-based approach to model uncertainty. Specifically, a DBNN was used for the stochastic learning of the system. A variational BNN [92], which is a generative model, was proposed to predict the superconducting transition temperature. Specifically, VI was adapted to compute the distribution in the latent space of the model.

Louizos and Welling [93] adopted stochastic gradient VI [94] to compute the posterior distributions over the weights of NNs. Hubin and Storvik [95] proposed a stochastic VI method that jointly considers both model and parameter uncertainties in BNNs, and introduced latent binary variables to include or exclude certain weights of the model. Liu et al. [96] integrated VI into a spatial-temporal NN to approximate the posterior parameter distribution of the network and estimate the probability of the prediction. Ryu et al. [97] integrated a graph convolutional network (GCN) into the Bayesian framework to learn representations and predict molecular properties. Swiatkowski et al. [84] empirically studied Gaussian mean-field VI. They decomposed the variational parameters into a low-rank factorization to obtain a more compact approximation and to improve the signal-to-noise ratio of the stochastic gradient in estimating the variational lower bound. Farquhar et al. [98] used mean-field VI to better train deep models. They argued that a deeper linear mean-field network can provide a distribution over function space analogous to that of a shallower full-covariance network. A schematic view of the proposed approach is shown in Fig. 8.

3.5 Bayesian Active Learning (BAL)

Active learning (AL) methods aim to learn from unlabeled samples by querying an oracle [99]. Defining the right acquisition function, i.e., the condition under which a sample is most informative for the model, is the main challenge of

Fig. 8: A general architecture of the deeper linear mean-field network, with a full-covariance layer followed by three or more 'mean-field' weight layers, which is similarly expressive; reproduced based on [98].

AL-based methods. Although existing AL frameworks have shown promising results in a variety of tasks, they lack scalability to high-dimensional data [100]. In this regard, Bayesian approaches can be integrated into the DL structure to represent the uncertainty and then combined with a deep AL acquisition function to probe for the uncertain samples in the oracle.

DBAL [101], i.e., deep Bayesian AL, combines an AL framework with Bayesian DL to deal with high-dimensional data problems, i.e., image data. DBAL used batch acquisition to select the top n samples with the highest Bayesian AL by disagreement (BALD) [102] score. Model priors from empirical Bayes (MOPED) [103] used BALD to evaluate the uncertainty; in addition, MC dropout was applied to estimate the model uncertainty. Later, Kirsch et al. [104] proposed BatchBALD, which uses a greedy algorithm to select a batch in linear time and reduce the run time. They modeled the uncertainty by leveraging Bayesian AL (BAL) using dropout sampling. In [105], two types of uncertainty measures, namely entropy and BALD [102], were compared.
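A minimal sketch of the BALD acquisition score, computed as the mutual information between predictions and model parameters from MC-dropout samples, is shown below; the Dirichlet-sampled probabilities and the batch size are illustrative assumptions standing in for real stochastic forward passes.

```python
import numpy as np

def entropy(p, axis=-1):
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

def bald_score(mc_probs):
    """BALD mutual information from MC-dropout class probabilities.

    mc_probs : (T, N, C) softmax outputs from T stochastic forward passes.
    Returns an (N,) score; higher values mark samples worth querying from the oracle.
    """
    mean_p = mc_probs.mean(axis=0)                      # predictive distribution per sample
    predictive_entropy = entropy(mean_p)                # H[ E_w p(y|x,w) ]
    expected_entropy = entropy(mc_probs).mean(axis=0)   # E_w H[ p(y|x,w) ]
    return predictive_entropy - expected_entropy

rng = np.random.default_rng(6)
probs = rng.dirichlet(np.ones(4), size=(30, 100))       # (T, N, C) stand-in samples
query_idx = np.argsort(bald_score(probs))[-10:]          # top-10 acquisition batch
print(query_idx)
```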

ActiveHARNet [106], which is an AL-based framework for human action recognition, modeled the uncertainty by linking BNNs with GPs using dropout. To achieve this, dropout was applied before each fully connected layer to estimate the mean and variance of the BNN. DeepBASS [107], a deep AL semi-supervised learning method, is an expectation-maximization [108] based technique paired with an AL component. It applied MC dropout to estimate the uncertainty.

Scandalea et al. [109] proposed a framework based on the U-Net structure for deep AL to segment biomedical images, and used the uncertainty measure obtained by MC dropout to suggest the samples to be annotated. Specifically, the uncertainty was defined based on the standard deviation of the posterior probabilities over the MC samples. Zheng et al. [110] varied the number of Bayesian layers and their positions to estimate uncertainty through AL on the MNIST dataset. The outcome indicated that a few Bayesian layers near the output layer are enough to fully estimate the uncertainty of the model.

Inspired by [111], Bayesian batch AL [112], which selects a batch of samples at each AL iteration to perform posterior inference over the model parameters, was proposed for large-scale problems. Active user training [113], which is a BAL-based crowdsourcing model, was proposed to tackle high-dimensional and complex classification problems.


Fig. 9: Bayesian generative active deep learning, in which a classifier, an ACGAN and a VAE are jointly trained and an oracle labels the queried samples (note: ACGAN stands for auxiliary-classifier GAN); reproduced based on [116].

In addition, the Bayesian inference approach proposed in [114] was used to consider the uncertainty of the confusion matrix of the annotators.

Several generative-based AL frameworks have been introduced. In [115], a semi-supervised Bayesian AL model was developed, a deep generative model that uses BNNs to obtain the discriminative component. Tran et al. [116] proposed Bayesian generative active deep learning (BGADL) (Fig. 9) for image classification problems. They first used the concept of DBAL to select the most informative samples, and then a VAE-ACGAN was applied to generate new samples based on the selected ones. Akbari et al. [117] proposed a unified BDL framework to quantify both aleatoric and epistemic uncertainties for activity recognition. They used an unsupervised DL model to extract features from the time series, and the posterior distributions of these features were then learned through a VAE model. Finally, dropout [30] was applied after each dense layer, and at the test phase, to randomize the model weights and to sample from the approximate posterior, respectively.

3.6 Bayes by Backprop (BBB)

Learning a probability distribution over the weights of a neural network plays a significant role in achieving better prediction results. Blundell et al. [118] proposed a novel yet efficient algorithm named Bayes by Backprop (BBB) to quantify the uncertainty of these weights. The proposed BBB minimizes the compression cost, which is known as the variational free energy (VFE) or the expected lower bound of the marginal likelihood. To do so, they defined a cost function as follows:

$$F(D, \theta) = KL\big[q(w|\theta)\,\|\,P(w)\big] - \mathbb{E}_{q(w|\theta)}\big[\log P(D|w)\big]. \tag{21}$$

The BBB algorithm uses unbiased gradient estimates of the cost function in (21) to learn a distribution over the weights of neural networks. In another study, Fortunato et al. [119] proposed a new Bayesian recurrent neural network (BRNN) using the BBB algorithm. In order to improve the BBB algorithm, they used a simple adaptation of truncated back-propagation through time. The proposed Bayesian RNN (BRNN) model is shown in Fig. 10.
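The sketch below gives a one-sample estimate of the BBB cost in Eq. (21) for a single Bayesian linear layer with a Gaussian variational posterior and a standard-normal prior; the layer sizes, prior, likelihood and data are all illustrative assumptions rather than the exact setup of [118].

```python
import torch

class BayesLinear(torch.nn.Module):
    """Bayesian linear layer with a factorized Gaussian variational posterior (no bias, for brevity)."""
    def __init__(self, d_in, d_out, prior_std=1.0):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = torch.nn.Parameter(torch.full((d_out, d_in), -3.0))   # sigma = softplus(rho)
        self.prior = torch.distributions.Normal(0.0, prior_std)

    def forward(self, x):
        sigma = torch.nn.functional.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)        # reparameterized weight sample
        q = torch.distributions.Normal(self.mu, sigma)
        # One-sample estimate of KL[q(w|theta) || P(w)], the first term of Eq. (21)
        self.kl = (q.log_prob(w) - self.prior.log_prob(w)).sum()
        return x @ w.t()

layer = BayesLinear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
nll = 0.5 * ((layer(x) - y) ** 2).sum()                      # -log P(D|w) up to constants
loss = layer.kl + nll                                        # Eq. (21), estimated with one weight sample
loss.backward()
print(float(loss))
```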

Ebrahimi et al. [120] proposed an uncertainty-guided continual approach with BNNs (named UCB, which stands

Fig. 10: Bayesian RNNs (BRNNs), reproduced based on the model proposed by Fortunato et al. [119].

for Uncertainty-guided Continual learning with BNNs). Continual learning aims to learn a variety of new tasks while retaining the knowledge obtained from previously learned ones. The proposed UCB exploits the predicted uncertainty of the posterior distribution in order to formulate the modification of 'important' parameters, both by setting a hard threshold and in a soft way. Recognizing different actions in videos requires not only big data but is also a time-consuming process. To deal with this issue, de la Riva and Mettes [121] proposed a Bayesian deep learning method (named Bayesian 3D ConvNet) to analyze a small number of videos. In this regard, BBB was extended to 3D CNNs and then employed to deal with uncertainty over the convolution weights in the proposed model. To do so, a Gaussian distribution was applied to approximate the correct posterior in the proposed 3D convolution layers, parameterized by a mean and standard deviation (STD) as follows:

$$\theta = (\mu, \alpha), \qquad \sigma^2 = \alpha \mu^2, \qquad q_\theta(w_{ijhwt}|D) = \mathcal{N}\big(\mu_{ijhwt},\, \alpha_{ijhwt}\, \mu^2_{ijhwt}\big), \tag{22}$$

where i indexes the input, j the output, h the filter height, w the filter width and t the time dimension. In another study, Ng et al. [61] compared the performance of two well-known uncertainty methods (MC dropout and BBB) for medical image segmentation (cardiac MRI) on a U-Net model. The obtained results showed that MC dropout and BBB achieved almost similar performance on the medical image segmentation task.

3.7 Variational Autoencoders

An autoencoder is a variant of DL that consists of two components: (i) an encoder and (ii) a decoder. The encoder aims to map a high-dimensional input sample x to a low-dimensional latent variable z, while the decoder reproduces the original sample x from the latent variable z. The latent variables are compelled to conform to a given prior distribution P(z). Variational Autoencoders (VAEs) [94] are effective methods for modeling the posterior. They cast learning representations for high-dimensional distributions as a VI problem [123]. A probabilistic model $p_\theta(x)$ of a sample x in data space, with a latent variable z in latent space, can be written as follows:

$$p_\theta(x) = \int_z p_\theta(x|z)\, p(z)\, dz. \tag{23}$$


Fig. 11: Pairwise Supervised Hashing-Bernoulli VAE (PSH-BVAE), reproduced based on [122].

The VI can be used to obtain the evidence lower bound on log p_θ(x) as follows:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big), \tag{24}$$

where $q_\phi(z|x)$ and $p_\theta(x|z)$ are the encoder and decoder models, respectively, and φ and θ indicate their parameters.
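A compact PyTorch-style sketch of a VAE trained with the ELBO of Eq. (24) (negative ELBO = reconstruction term plus KL to a standard-normal prior) is shown below; the fully connected architecture, Bernoulli likelihood and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_x=784, d_z=16, d_h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU(), nn.Linear(d_h, 2 * d_z))
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(), nn.Linear(d_h, d_x))

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)                        # q_phi(z|x) parameters
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)          # reparameterization trick
        x_logits = self.dec(z)                                            # p_theta(x|z)
        recon = nn.functional.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())    # KL(q(z|x) || N(0, I))
        return (recon + kl) / x.shape[0]                                  # negative ELBO per sample

vae = VAE()
loss = vae(torch.rand(32, 784))     # illustrative data in [0, 1]
loss.backward()
print(float(loss))
```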

Zamani et al. [122] developed a discrete VAE framework with Bernoulli latent variables as binary hashing codes (Fig. 11). The stochastic gradient was exploited to learn the model. They proposed a pairwise supervised hashing (PSH) framework to derive better hashing codes. PSH maximizes the ELBO with weighted KL regularization to learn more informative binary codes, and adopts a pairwise loss function that rewards within-class similarity and between-class dissimilarity, minimizing the distance among the hashing codes of samples from the same class and vice versa.
Bohm et al. [124] studied UQ for linear inverse problems using VAEs. Specifically, a vanilla VAE with a mean-field Gaussian posterior was trained on uncorrupted samples under the ELBO. In addition, the EL2O method [125] was adopted to approximate the posterior. Edupuganti et al. [126] studied UQ tasks in magnetic resonance image recovery (see Fig. 12). A VAE-GAN, which is a probabilistic recovery scheme, was developed to map low-quality images to high-quality ones. The VAE-GAN consists of a VAE and a multi-layer CNN as the generator and discriminator, respectively. In addition, Stein's unbiased risk estimator (SURE) was leveraged as a proxy to predict the error and estimate the uncertainty of the model.

In [127], a framework based on the variational U-Net [128] architecture was proposed for UQ tasks in reservoir simulations. Both the simple U-Net and the variational U-Net (VUNet) are illustrated in Fig. 13. CosmoVAE [129], which is a DL (U-Net) based VAE, was proposed to restore missing observations of the cosmic microwave background (CMB) map. As such, the variational Bayes approximation was used to determine the ELBO of the likelihood of the reconstructed image. Mehrasa et al. [130] proposed the action point process VAE (APP-VAE) for action sequences. APP-VAE consists of two LSTMs that estimate the prior and posterior distributions. Sato et al. [131] proposed a VAE-based UA model for anomaly detection, using MC sampling to estimate the posterior.

Since VAEs are not stochastic processes, they are limited to encoding finite-dimensional priors. To alleviate this limitation, Mishra et al. [132] developed the prior-encoding VAE, i.e., πVAE. Inspired by the Gaussian process [133], πVAE is a stochastic process that learns a distribution over functions. To achieve this, the πVAE encoder first transforms the locations to a high-dimensional space and then uses a linear mapping to link the feature space to the outputs, while the πVAE decoder aims to recreate the linear mapping from the lower-dimensional probabilistic embedding. Finally, the recreated mapping is used to obtain the reconstruction of the outputs. Guo et al. [134] used a VAE to deal with data uncertainty under a just-in-time learning framework. A Gaussian distribution was employed to describe the latent space features variable-wise, and the KL-divergence was then used to ensure that the selected samples are the most relevant to a new sample. Daxberger et al. [135] tried to detect OoD samples during the test phase; to this end, they developed an unsupervised, probabilistic framework based on a Bayesian VAE. Besides, they estimated the posterior over the decoder parameters by applying SG-MCMC.

4 OTHER METHODS

In this section, we discuss a few other proposed UQ methods used in machine and deep learning algorithms.

4.1 Deep Gaussian processes

Deep Gaussian processes (DGPs) [136], [137], [138], [139], [140], [141], [142] are effective multi-layer decision-making models that can accurately model uncertainty. They represent a multi-layer hierarchy of Gaussian processes (GPs) [143], [144]. GPs are a non-parametric type of Bayesian model that encodes the similarity between samples using a kernel function. They represent distributions over the latent variables with respect to the input samples as a Gaussian distribution $f_x \sim \mathcal{GP}(m(x), k(x, x'))$. The output y is then distributed according to a likelihood function $y | f_x \sim h(f_x)$. However, conventional GPs cannot scale effectively to large datasets. To alleviate this issue, inducing samples can be used, and the following variational lower bound can be optimized:

$$\log p(Y) \ge \sum_{y, x \in Y, X} \mathbb{E}_{q(f_x)}\big[\log p(y|f_x)\big] - KL\big(q(f_Z)\,\|\,p(f_Z)\big), \tag{25}$$

where Z denotes the locations of the inducing samples and $q(f_x)$ is the variational approximation to the distribution of $f_x$.
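For intuition about how a (single-layer) GP encodes similarity through a kernel and yields predictive uncertainty, the following NumPy sketch performs exact GP regression with a squared-exponential kernel; the kernel hyperparameters, noise level and toy data are illustrative assumptions, and no sparse/inducing-point or deep extension is implemented here.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') encoding similarity between samples."""
    d2 = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X_train, y_train, X_test, noise=1e-2):
    """Exact GP regression posterior mean and variance at the test inputs."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    var = np.diag(K_ss - K_s.T @ np.linalg.solve(K, K_s)) + noise
    return mean, var

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
mu, var = gp_predict(X, y, np.linspace(-4, 4, 50)[:, None])
print(mu[:3], var[:3])
```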

Oh et al. [145] proposed the hedged instance embedding (HIB), which hedges the position of each sample in the embedding space, to model the uncertainty when the input sample is ambiguous. As such, the probability of two samples matching was extended to stochastic embedding, and


Fig. 12: A schematic view of the VAE model proposed by Edupuganti et al. (an encoder of 5x5 convolutional layers producing the mean and standard deviation of the latent code, followed by a dense layer and a convolutional decoder), reproduced based on [126].

Fig. 13: A general view of (a) U-Net and (b) VUNet, reproduced based on [127].

MC sampling was used to approximate it. Specifically, a mixture of C Gaussians was used to represent the uncertainty. Havasi et al. [146] applied SGHMC to DGPs to approximate the posterior distribution. They introduced a moving-window MC expectation maximization to obtain the maximum likelihood, dealing with the problem of optimizing a large number of parameters in DGPs. Maddox et al. [147] used stochastic weight averaging (SWA) [148] to build a Gaussian-based model to approximate the true posterior. Later, they proposed SWA-G [149], i.e., SWA-Gaussian, to model Bayesian averaging and estimate uncertainty.

Most weight perturbation-based algorithms suffer from high variance in the gradient estimates because all samples in a mini-batch share the same perturbation. To alleviate this problem, flipout [150] was proposed. Flipout samples pseudo-independent weight perturbations for each input to decorrelate the gradients within the mini-batch. It is able to reduce the variance and the computational time when training NNs with multiplicative Gaussian perturbations.

Fig. 14: A general Gaussian-based DNN model (a convolutional network whose hidden units feed a GP with a softmax output) proposed by Bradshaw et al. [152], reproduced based on the same reference.

Despite the success of DNNs in dealing with complex and high-dimensional image data, they are not robust to adversarial examples [151]. Bradshaw et al. [152] proposed a hybrid model of GPs and DNNs (GPDNNs) to deal with the uncertainty caused by adversarial examples (see Fig. 14).

Choi et al. [153] proposed a Gaussian-based model to predict the localization uncertainty in YOLOv3 [154]. As such, they applied a single Gaussian model to the bounding-box (bbox) coordinates of the detection layer. Specifically, the coordinates of each bbox are modeled by a mean (µ) and variance (Σ) to predict the uncertainty of the bbox.
Khan et al. [155] proposed a natural-gradient-based algorithm for Gaussian mean-field VI. A Gaussian distribution with diagonal covariance was used to estimate the probability. The proposed algorithm was implemented within the Adam optimizer. To achieve this, the network weights were perturbed during the gradient evaluation. In addition, they used a vector to adapt the learning rate in order to estimate uncertainty.

Sun et al. [156] considered the structural information of the model weights. They used the matrix-variate Gaussian (MVG) [157] distribution to model structured correlations within the weights of DNNs, and introduced a reparametrization of the MVG posterior to make posterior inference feasible. The resulting MVG model was applied to a probabilistic BP framework to estimate posterior inference. Louizos and Welling [158] used the MVG distribution to estimate the weight posterior uncertainty. They treated the weight matrix as a whole rather than treating each


component of the weight matrix independently. As mentioned earlier, GPs have been widely used for UQ in deep learning methods. Van der Wilk et al. [159], Blomqvist et al. [160], Tran et al. [161], Dutordoir et al. [162] and Shi et al. [163] introduced convolutional structure into GPs.

Fig. 15: A schematic view of the TCP model (a ConvNet classifier with an auxiliary ConfidNet confidence branch, where the classification model is fixed during confidence training), reproduced based on [164].

In another study, Corbière et al. [164] expressed that the confidence of DNNs, and predicting their failures, is of key importance for the practical application of these methods. In this regard, they showed that the TCP (True Class Probability) is more suitable than the MCP (Maximum Class Probability) for failure prediction in such deep learning methods:

$$TCP : \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}, \qquad (x, y^*) \mapsto P(Y = y^* | w, x), \tag{26}$$

where $x_i \in \mathbb{R}^d$ represents a d-dimensional feature and $y_i^* \in \mathcal{Y} = \{1, ..., K\}$ is its correct class. Then, they introduced a new normalized form of the TCP confidence criterion:

$$TCP^r(x, y^*) = \frac{P(Y = y^* | w, x)}{P(Y = \hat{y} | w, x)}. \tag{27}$$
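The difference between the MCP and TCP criteria in Eqs. (26)-(27) can be sketched as follows; note that the TCP requires the true label, which is why [164] learns to regress it with an auxiliary confidence network, and the Dirichlet-sampled softmax outputs below are purely illustrative.

```python
import numpy as np

def mcp_and_tcp(probs, true_labels):
    """Per-sample MCP, TCP (Eq. 26) and normalized TCP^r (Eq. 27).

    probs       : (N, K) softmax outputs P(Y | w, x)
    true_labels : (N,) correct classes y*
    """
    idx = np.arange(len(true_labels))
    mcp = probs.max(axis=1)                 # P(Y = y_hat | w, x)
    tcp = probs[idx, true_labels]           # P(Y = y*    | w, x)
    return mcp, tcp, tcp / mcp

rng = np.random.default_rng(8)
probs = rng.dirichlet(np.ones(5), size=100)
labels = rng.integers(0, 5, size=100)
mcp, tcp, tcp_r = mcp_and_tcp(probs, labels)
print(mcp[:3], tcp[:3], tcp_r[:3])
```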

A general view of the proposed model in [164] is illustrated in Fig. 15.
In another study, Atanov et al. [165] introduced a probabilistic model and showed that the Batch Normalization (BN) approach can maximize the lower bound of its related marginalized log-likelihood. Since the inference was not computationally efficient, they proposed the Stochastic BN (SBN) approach to approximate the proper inference procedure, as an uncertainty estimation method. Moreover, induced noise is generally employed to capture uncertainty, check overfitting and slightly improve performance via test-time averaging, whereas ordinary stochastic neural networks typically depend on the expected values of their weights to formulate predictions. Neklyudov et al. [166] proposed a different kind of stochastic layer called variance layers. A variance layer is parameterized by its variance, and each weight of a variance layer obeys a zero-mean distribution. This implies that each object is represented by a zero-mean distribution in the space of activations. They demonstrated that these layers present an upright defense against adversarial attacks and can serve as a crucial exploration tool in reinforcement learning tasks.

Fig. 16: A general view of BPO: (a) the training procedure, in which a Bayes filter maintains the belief over the latent parameters while batch policy optimization updates the policy, and (b) the network structure, in which the belief and state are encoded and fed to the policy network; reproduced based on [171].

4.2 Laplace approximations

Laplace approximations (LAs) are another popular family of UQ methods, used to estimate Bayesian inference [167]. They build a Gaussian distribution around the true posterior using a Taylor expansion around the MAP estimate, θ*, as follows:

$$p(\theta|D) \approx p(\theta^*) \exp\Big\{-\frac{1}{2}(\theta - \theta^*)'\, H|_{\theta^*}\, (\theta - \theta^*)\Big\} \tag{28}$$

where $H|_{\theta} = \nabla_\theta p(y|\theta)\, \nabla_\theta p(y|\theta)'$ indicates the Hessian of the likelihood estimated at the MAP estimate. Ritter et al. [168] introduced a scalable LA (SLA) approach for different NNs. The proposed model was then compared with other well-known methods, such as Dropout and a diagonal LA, for the uncertainty estimation of networks.
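A minimal sketch of a Laplace approximation for a tiny logistic-regression model is given below: a MAP estimate is found first, the Hessian is then approximated by a sum of per-sample gradient outer products plus the prior precision (in the spirit of Eq. (28)), and posterior samples are drawn from the resulting Gaussian; the data, prior and optimizer settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1) MAP estimate by gradient ascent on the log-posterior (standard-normal prior)
theta = np.zeros(3)
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ theta)) - theta
    theta += 0.01 * grad

# 2) Hessian approximated by per-sample gradient outer products, plus the prior precision
per_sample_grads = X * (y - sigmoid(X @ theta))[:, None]
H = per_sample_grads.T @ per_sample_grads + np.eye(3)

# 3) Gaussian posterior around the MAP: p(theta | D) ~ N(theta_MAP, H^{-1})
posterior_cov = np.linalg.inv(H)
samples = rng.multivariate_normal(theta, posterior_cov, size=100)
print(theta, samples.std(axis=0))
```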

    5 UNCERTAINTY QUANTIFICATION IN REINFORCE-MENT LEARNINGIn decision making process, uncertainty plays a key role indecision performance in various fields such as Reinforce-ment Learning (RL) [169]. Different UQ methods in RLhave been widely investigated in the literature [170]. Leeet al. [171] formulated the model uncertainty problem asBayes-Adaptive Markov Decision Process (BAMDP). Thegeneral BAMDP defined by a tuple 〈 S, Φ, A, T, R, P0, γ〉, where where S shows the underlying MDP’s observablestate space, Φ indicates the latent space, A represents theaction space, T is the parameterized transition and finallyR is the reward functions, respectively. Lets b0 be an initialbelief, a Bayes filter updates the posterior as follows:

$b'(\phi' \mid s, b, a', s') = \eta \sum_{\phi \in \Phi} b(\phi)\, T(s, \phi, a', s', \phi')$  (29)


Then, the Bayesian Policy Optimization (BPO) method (see Fig. 16) is applied to POMDPs, using a Bayes filter to compute the belief b over the hidden state as follows:

$b'(s') = \psi(b, a', o') = \eta \sum_{s \in S} b(s)\, T(s, a', s')\, Z(s, a', o')$  (30)
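As a small, self-contained sketch of a discrete Bayes-filter belief update in the spirit of Eq. (30) (our illustration; the transition and observation arrays below are made up for the example):

```python
import numpy as np

def belief_update(b, T, Z, a, o):
    """Discrete Bayes filter update: b'(s') ∝ Z[s', a, o] * sum_s b(s) T[s, a, s'].

    b : (S,) current belief over states.
    T : (S, A, S) transition probabilities T[s, a, s'].
    Z : (S, A, O) observation probabilities Z[s', a, o].
    """
    predicted = b @ T[:, a, :]          # sum_s b(s) T(s, a, s')
    unnorm = Z[:, a, o] * predicted     # weight by the observation likelihood
    return unnorm / unnorm.sum()        # eta normalizes to a valid distribution

# toy 2-state, 1-action, 2-observation example
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])
Z = np.array([[[0.7, 0.3]], [[0.1, 0.9]]])
b = np.array([0.5, 0.5])
print(belief_update(b, T, Z, a=0, o=1))  # posterior belief after observing o=1
```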

In another study, O'Donoghue et al. [172] proposed the uncertainty Bellman equation (UBE) to quantify uncertainty. The authors used a Bellman-based formulation which propagated the uncertainty (here, the variance) of the Bayesian posterior distribution. Kahn et al. [173] presented a new uncertainty-aware (UA) learning algorithm to control a mobile robot. A review of past studies in RL shows that different Bayesian approaches have been used for handling parameter uncertainty [174]; Bayesian RL was thoroughly reviewed by Ghavamzadeh et al. [174] in 2015. Due to page limitations, we do not discuss all applications of UQ in RL, but we summarise some of the recent studies here. Kahn et al. [173] used both Bootstrapping and Dropout methods to estimate uncertainty in NNs, which were then used in a UA collision prediction model. Besides Bayesian statistical methods, ensemble methods have been used to quantify uncertainty in RL [175]. In this regard, Tschantz et al. [175] applied an ensemble of point-estimate parameters $\theta = \{\theta_0, \ldots, \theta_B\}$, trained on different batches of a dataset D and maintained as an approximation of the posterior distribution $p(\theta \mid D)$. The ensemble method helped to capture both aleatoric and epistemic uncertainty. There are more UQ techniques used in RL; however, we are not able to discuss all of them in detail in this work due to page restrictions and the breadth of the literature. Table 2 summarizes different UQ methods used in a variety of RL subjects.

6 ENSEMBLE TECHNIQUES

Deep neural networks (DNNs) have been effectively employed in a wide variety of machine learning tasks and have achieved state-of-the-art performance in different domains such as bioinformatics, natural language processing (NLP), speech recognition and computer vision [187], [188]. In supervised learning benchmarks, NNs yield competitive accuracies but poor predictive uncertainty quantification; hence, they are inclined to produce overconfident predictions. Incorrect overconfident predictions can be harmful, so it is important to handle UQ properly in real-world applications [189]. As ground-truth uncertainty estimates are generally not available, evaluating the quality of predictive uncertainty is a challenging task. Two evaluation notions, calibration and domain shift, are commonly applied, both inspired by practical applications of NNs. Calibration measures the discrepancy between long-run frequencies and subjective forecasts. The second notion concerns the generalization of predictive uncertainty under domain shift, that is, assessing whether the network knows what it knows. An ensemble of models enhances predictive performance. However, it is not evident why and when an ensemble of NNs generates good uncertainty estimates. Bayesian model averaging (BMA) assumes that the true model lies within the hypothesis class of the prior and performs soft model selection to locate the single best model

within the hypothesis class. On the contrary, ensembles combine models to obtain a more powerful model; ensembles can be expected to be better when the true model does not lie within the hypothesis class. The authors in [190] devised the Maximize Overall Diversity (MOD) model to estimate ensemble-based uncertainty by taking into account the diversity of ensemble predictions across possible future inputs. Gustafsson et al. [191] presented an evaluation approach for measuring uncertainty estimation and investigating robustness in the computer vision domain. Researchers in [192] proposed a deep ensemble echo state network model for spatio-temporal forecasting with uncertainty quantification. Chua et al. [193] devised a novel method called probabilistic ensembles with trajectory sampling that integrated sampling-based uncertainty propagation with a UA deep network dynamics approach. The authors in [187] demonstrated that prevailing calibration error estimators are unreliable in the small-data regime and hence proposed a kernel density-based estimator for evaluating calibration performance, proving its consistency and unbiasedness. Liu et al. [194] presented a Bayesian nonparametric ensemble method which augments a model's distribution functions and prediction mechanism using Bayesian nonparametric machinery. Hu et al. [195] proposed margin-based Pareto deep ensemble pruning, a deep ensemble network that yielded competitive uncertainty estimation with a high prediction interval coverage probability and a small prediction interval width. In another study, the researchers in [196] explored the challenges associated with obtaining uncertainty estimates for structured prediction tasks and presented ensemble-based baselines for sequence-level out-of-domain input detection, sequence-level prediction rejection and token-level error detection. Ensembles involve memory and computational costs which are not acceptable in many applications [197]. There has been noteworthy work on distilling an ensemble into a single model; such approaches achieve accuracy comparable to ensembles while mitigating the computational costs. The uncertainty of the model is captured by the posterior distribution $p(\theta \mid D)$. Consider an ensemble of models sampled from the posterior, $\{P(y \mid x^*, \theta^{(m)})\}_{m=1}^M$, as follows [197]:

$\{P(y \mid x^*, \theta^{(m)})\}_{m=1}^{M} \rightarrow \{P(y \mid \pi^{(m)})\}_{m=1}^{M}, \quad \pi^{(m)} = f(x^*; \theta^{(m)}), \quad \theta^{(m)} \sim p(\theta \mid D)$  (31)

where $x^*$ is a test input and $\pi$ represents the parameters of a categorical distribution $[P(y = \omega_1), \ldots, P(y = \omega_K)]^T$. By taking the expectation with respect to the model posterior, the predictive posterior, or expected predictive distribution, for a test input $x^*$ is obtained:

$P(y \mid x^*, D) = \mathbb{E}_{p(\theta \mid D)}[P(y \mid x^*, \theta)]$  (32)

Each of the models $P(y \mid x^*, \theta^{(m)})$ yields a different estimate of data uncertainty. The 'disagreement', or level of spread, of an ensemble sampled from the posterior arises from the uncertainty in predictions caused by model uncertainty. Let us consider an ensemble

TABLE 2: Further information on some UQ methods used in RL.

Study | Application | Goal/Objective | UQ method | Code
Tegho et al. [176] | Dialogue management context | Dialogue policy optimisation | BBB propagation deep Q-networks (BBQN) | ×
Janz et al. [177] | Temporal difference learning | Posterior sampling for RL (PSRL) | Successor Uncertainties (SU) | √
Shen and How [178] | Discriminating potential threats | Stochastic belief space policy | Soft-Q learning | ×
Benatan and Pyzer-Knapp [179] | Safe RL (SRL) | The weights in RNN using mean and variance weights | Probabilistic Backpropagation (PBP) | ×
Kalweit and Boedecker [180] | Continuous Deep RL (CDRL) | Minimizing real-world interaction | Model-assisted Bootstrapped Deep Deterministic Policy Gradient (MA-BDDPG) | ×
Riquelme et al. [181] | Approximating the posterior sampling | Balancing both exploration and exploitation in different complex domains | Deep Bayesian Bandits Showdown using Thompson sampling | 
Huang et al. [182] | Model-based RL (MRL) | Better decisions and improved performance | Bootstrapped model-based RL (BMRL) | ×
Eriksson and Dimitrakakis [183] | Risk measures and leveraging preferences | Risk-Sensitive RL (RSRL) | Epistemic Risk Sensitive Policy Gradient (EPPG) | ×
Lötjens et al. [184] | SRL | UA navigation | Ensemble of MC dropout (EMCD) and Bootstrapping | ×
Clements et al. [185] | Designing risk-sensitive algorithms | Disentangling aleatoric and epistemic uncertainties | Combination of distributional RL (DRL) and Approximate Bayesian Computation (ABC) methods with NNs | 
D'Eramo et al. [186] | Drive exploration | Multi-Armed Bandit (MAB) | Bootstrapped deep Q-network with TS (BDQNTS) | ×

$\{P(y \mid x^*, \theta^{(m)})\}_{m=1}^M$ that yields the expected set of behaviors: the entropy of the expected distribution $P(y \mid x^*, D)$ can be utilized as an estimate of the total uncertainty in the prediction. Measures of spread or 'disagreement' of the ensemble, such as mutual information (MI), can be used to assess the uncertainty in predictions due to knowledge uncertainty as follows:

$\underbrace{\mathcal{MI}[y, \theta \mid x^*, D]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{H}\big[\mathbb{E}_{p(\theta \mid D)}[P(y \mid x^*, \theta)]\big]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(\theta \mid D)}\big[\mathcal{H}[P(y \mid x^*, \theta)]\big]}_{\text{Expected Data Uncertainty}}$  (33)
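A minimal sketch of the decomposition in Eq. (33), computed from the stacked softmax outputs of an ensemble (our illustration; the array names and the toy numbers are assumptions):

```python
import numpy as np

def uncertainty_decomposition(ensemble_probs, eps=1e-12):
    """ensemble_probs: (M, N, K) class probabilities from M ensemble members.

    Returns the total uncertainty, the expected data uncertainty and the
    knowledge uncertainty (mutual information) per input, following Eq. (33).
    """
    mean_probs = ensemble_probs.mean(axis=0)                        # P(y | x*, D)
    total = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)   # H[E_theta P]
    per_member = -(ensemble_probs * np.log(ensemble_probs + eps)).sum(axis=-1)
    expected_data = per_member.mean(axis=0)                         # E_theta H[P]
    knowledge = total - expected_data                               # mutual information
    return total, expected_data, knowledge

# three members disagreeing on the first input, agreeing (but unsure) on the second
probs = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.1, 0.9], [0.5, 0.5]],
                  [[0.5, 0.5], [0.5, 0.5]]])
print(uncertainty_decomposition(probs))
```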

The total uncertainty can thus be decomposed into expected data uncertainty and knowledge uncertainty via the MI formulation. If the model is uncertain, both for out-of-domain inputs and in regions of severe class overlap, the entropy of the predictive posterior (the total uncertainty) is high. If the models disagree, the difference between the entropy of the predictive posterior and the expected entropy of the individual models will be non-zero. For example, in regions of class overlap, each member of the ensemble will yield a high-entropy distribution, the expected entropy and the predictive posterior entropy will be similar, and MI will be low; in such a scenario, data uncertainty dominates the total uncertainty. For out-of-domain inputs, on the other hand, the predictive posterior is near uniform while the expected entropy of each model may be low, since the members produce diverse distributions over classes; in this region of the input space, knowledge uncertainty is high because the model's understanding of the data is low. In ensemble distribution distillation, the aim is to capture not only the mean of the ensemble but also its diversity. An ensemble can be viewed as a set of samples from an implicit distribution of output distributions:

$\{P(y \mid x^*, \theta^{(m)})\}_{m=1}^{M} \rightarrow \{P(y \mid \pi^{(m)})\}_{m=1}^{M}, \quad \pi^{(m)} \sim p(\pi \mid x^*, D).$  (34)

Prior Networks, a new class of models, were proposed to explicitly parameterize a conditional distribution over output distributions, $p(\pi \mid x^*, \hat{\phi})$, using a single neural network with a point estimate of the model parameters $\hat{\phi}$. A Prior Network can effectively emulate an ensemble and hence yield the same measures of uncertainty. By parameterizing a Dirichlet distribution, the Prior Network $p(\pi \mid x^*, \hat{\phi})$ represents a distribution over categorical output distributions. The performance of ensembling methods is commonly assessed via the quality of their uncertainty estimates, and deep learning ensembles produce benchmark results in uncertainty estimation. The authors in [198] explored in-domain uncertainty, examined the standards for its quantification and revealed pitfalls of prevailing metrics. They presented the deep ensemble equivalent (DEE) score and demonstrated how an ensemble of only a few trained networks can be equivalent to many more sophisticated ensembling methods with respect to test performance. They also proposed test-time augmentation (TTA) for a single ensemble in order to improve the performance of different ensemble learning techniques (see Fig. 17).

However, deep ensembles [199] are a simple approach that provides independent samples from different modes of the loss landscape. Under a fixed test-time compute budget, deep ensembles can be regarded as a powerful baseline for the performance of other ensembling methods. It is a



Fig. 17: A schematic view of TTA for ensembling techniques which is reproduced based on [198].

challenging task to compare the performance of ensembling methods: different metric values are achieved by different models on different datasets, and such values lack interpretability because the performance gain is compared against dataset- and model-specific baselines. Hence, Ashukha et al. [198] proposed the DEE score with the aim of introducing an interpretable perspective that uses deep ensembles as a yardstick for the performance of other ensembling methods. The DEE score tries to answer the question: what size of deep ensemble demonstrates the same performance as a given ensembling technique? The DEE score is based on the calibrated log-likelihood (CLL). DEE is defined for an ensembling technique m, and its lower and upper bounds are given below [198]:

$\mathrm{DEE}_m(k) = \min \{ l \in \mathbb{R},\ l \geq 1 \mid \mathrm{CLL}^{\mathrm{mean}}_{\mathrm{DE}}(l) \geq \mathrm{CLL}^{\mathrm{mean}}_{m}(k) \},$  (35)

$\mathrm{DEE}^{\mathrm{upper/lower}}_m(k) = \min \{ l \in \mathbb{R},\ l \geq 1 \mid \mathrm{CLL}^{\mathrm{mean}}_{\mathrm{DE}}(l) \mp \mathrm{CLL}^{\mathrm{std}}_{\mathrm{DE}}(l) \geq \mathrm{CLL}^{\mathrm{mean}}_{m}(k) \},$  (36)

where $\mathrm{CLL}^{\mathrm{mean/std}}_{m}(l)$ denotes the mean and standard deviation of the calibrated log-likelihood achieved by an ensembling technique m with l samples. They measured $\mathrm{CLL}^{\mathrm{mean}}_{\mathrm{DE}}(l)$ and $\mathrm{CLL}^{\mathrm{std}}_{\mathrm{DE}}(l)$ for natural numbers $l \in \mathbb{N}_{>0}$ and applied linear interpolation to define them for real values $l \geq 1$. They report $\mathrm{DEE}_m(k)$ for different numbers of samples k and different methods m, together with the upper and lower bounds $\mathrm{DEE}^{\mathrm{upper}}_m(k)$ and $\mathrm{DEE}^{\mathrm{lower}}_m(k)$.
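A small sketch of how a DEE-style score in the spirit of Eqs. (35)-(36) can be evaluated by linear interpolation over the deep-ensemble CLL curve (our illustration; the numbers are made up):

```python
import numpy as np

def dee_score(cll_de_mean, cll_method_k):
    """Deep ensemble equivalent score via linear interpolation.

    cll_de_mean  : mean CLL of deep ensembles of size 1..L (increasing in practice).
    cll_method_k : CLL achieved by the ensembling method under comparison.
    Returns the smallest (real-valued) ensemble size l >= 1 whose mean CLL
    matches cll_method_k, i.e. Eq. (35).
    """
    sizes = np.arange(1, len(cll_de_mean) + 1)
    if cll_method_k <= cll_de_mean[0]:
        return 1.0
    if cll_method_k >= cll_de_mean[-1]:
        return float(sizes[-1])
    # interpolate the ensemble size as a function of CLL
    return float(np.interp(cll_method_k, cll_de_mean, sizes))

cll_de = np.array([-0.95, -0.88, -0.85, -0.83, -0.82])  # deep ensembles of size 1..5
print(dee_score(cll_de, cll_method_k=-0.86))             # ~2.7 "equivalent" members
```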

Different sources of model uncertainty can be accounted for by the Bayesian nonparametric ensemble (BNE) model devised by Liu et al. [194]. BNE uses Bayesian nonparametric machinery to augment a model's distribution functions and predictions. It measures uncertainty patterns in the data distribution and decomposes the uncertainty into distinct components attributable to systematic error and noise. The model yields precise uncertainty estimates under observational noise and demonstrates its utility for detecting model bias and decomposing uncertainty in an ensemble used for prediction. The predictive mean of the BNE can be expressed as follows [194]:

$\mathbb{E}(y \mid X, \omega, \delta, G) = \sum_{k=1}^{K} f_k(X)\,\omega_k + \underbrace{\delta(X)}_{\text{Due to } \delta} + \underbrace{\int_{y \in \mathcal{Y}} \big[ \Phi(y \mid X, \mu) - G[\Phi(y \mid X, \mu)] \big] \, dy}_{\text{Due to } G}.$  (37)

The predictive mean of the full BNE is thus comprised of three parts:

1) the predictive mean of the original ensemble, $\sum_{k=1}^{K} f_k(X)\,\omega_k$;

2) BNE's direct correction to the prediction function, represented by the term $\delta$; and

3) BNE's indirect correction to the prediction, derived from relaxing the Gaussian assumption in the model cumulative distribution function, represented by the term $\int \big[ \Phi(y \mid X, \mu) - G[\Phi(y \mid X, \mu)] \big] dy$. In addition, two error-correction terms, $D_\delta(y \mid X)$ and $D_G(y \mid X)$, are also presented.

To express BNE's predictive uncertainty estimate, the term $\Phi_{\varepsilon,\omega}$ is used, which is the predictive cumulative distribution function of the original ensemble (i.e., with variance $\sigma^2_\varepsilon$ and mean $\sum_k f_k \omega_k$). The BNE's predictive interval is presented as [194]:

$U_q(y \mid X, \omega, \delta, G) = \Big[ \Phi^{-1}_{\varepsilon,\omega}\big(G^{-1}(\tfrac{1-q}{2} \mid X)\big) + \delta(x), \;\; \Phi^{-1}_{\varepsilon,\omega}\big(G^{-1}(\tfrac{1+q}{2} \mid X)\big) + \delta(x) \Big].$  (38)

Comparing the above equation to the predictive interval of the original ensemble, $\big[ \Phi^{-1}_{\varepsilon,\omega}\big(G^{-1}(\tfrac{1-q}{2} \mid X)\big), \; \Phi^{-1}_{\varepsilon,\omega}\big(G^{-1}(\tfrac{1+q}{2} \mid X)\big) \big]$, it can be observed that the residual process $\delta$ adjusts the locations of the BNE predictive interval endpoints, while G calibrates the spread of the predictive interval. As an important part of ensemble techniques, loss functions play a significant role in the performance of different ensemble methods; in other words, choosing an appropriate loss function can dramatically improve results. Due to page limitations, we summarise the most important loss functions applied for UQ in Table 3.

6.1 Deep Ensemble

Deep ensembles are another powerful method for measuring uncertainty and have been extensively applied in many real-world applications [195]. To achieve good learning results, the data distribution of the test set should be as close as possible to that of the training set. In many situations, however, the distribution of the test data is unknown, especially in uncertainty prediction problems, so it is difficult for traditional learning models to yield competitive performance. Some researchers applied MCMC and BNNs, which rely on prior distributions over the data, to work out

TABLE 3: Main loss functions used by ensemble techniques for UQ.

Study | Dataset type | Base classifier(s) | Method's name | Loss equation | Code
TV et al. [200] | Sensor data | Neural Networks (LSTM) | Ordinal Regression (OR) | $L_{OR}(y, \hat{y}) = -\frac{1}{N} \sum_{j=1}^{K} y_j \log(\hat{y}_j) + (1 - y_j)\log(1 - \hat{y}_j)$ | ×
Sinha et al. [201] | Image | Neural Networks | Diverse Information Bottleneck in Ensembles (DIBS) | $L_G = \mathbb{E}_{\hat{z}_1 \sim q(\tilde{z}_i|x), \hat{z}_2 \sim q(\tilde{z}_j|x)}[\log D(\hat{z}_1, \hat{z}_2)] + \mathbb{E}_{\hat{z}_1 \sim r(\tilde{z}), \hat{z}_2 \sim q(\tilde{z}_i|x)}[\log(1 - D(\hat{z}_1, \hat{z}_2))] + \mathbb{E}_{\hat{z}_1 \sim q(\tilde{z}_i|x), \hat{z}_2 \sim q(\tilde{z}_i|x)}[\log(1 - D(\hat{z}_1, \hat{z}_2))]$ | 
Zhang et al. [187] | Image | Neural Networks | Mix-n-Match Calibration | $\mathbb{E}\|z - y\|_2^2$ (the standard square loss) | ×
Lakshminarayanan et al. [189] | Image | Neural Networks | Deep Ensembles | $L(\theta) = -S(p_\theta, q)$ | ×
Jain et al. [190] | Image and protein-DNA binding | Deep Ensembles | Maximize Overall Diversity (MOD) | $L(\theta_m; x_n, y_n) = -\log p_{\theta_m}(y_n \mid x_n)$ | ×
Gustafsson et al. [191] | Video | Neural Networks | Scalable BDL | Regression: $L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{(y_i - \hat{\mu}(x_i))^2}{\sigma^2(x_i)} + \log \sigma^2(x_i) + \frac{1}{N}\theta^\top\theta$; Classification: $L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{C} y_{i,k} \log \hat{s}(x_i)_k + \frac{1}{2N}\theta^\top\theta$ | 
Chua et al. [193] | Robotics (video) | Neural Networks | Probabilistic ensembles with trajectory sampling (PETS) | $loss_P(\theta) = -\sum_{n=1}^{N} \log \tilde{f}_\theta(s_{n+1} \mid s_n, a_n)$ | 
Hu et al. [195] | Image and tabular data | Neural Networks | Margin-based Pareto deep ensemble pruning (MBPEP) | $Loss_{multi} = W_{CVAE} \cdot Loss_{CVAE} + W_{CRNN} \cdot Loss_{CRNN}$ | ×
Malinin et al. [197] | Image | Neural Networks | Ensemble Distribution Distillation (EnD$^2$) | $L(\phi, D_{ens}) = -\frac{1}{N}\sum_{i=1}^{N}\big[\ln\Gamma(\hat{\alpha}_0^{(i)}) - \sum_{c=1}^{K}\ln\Gamma(\hat{\alpha}_c^{(i)}) + \frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{K}(\hat{\alpha}_c^{(i)} - 1)\ln\pi_c^{(im)}\big]$ | √
Ashukha et al. [198] | Image | Neural Networks | Deep ensemble equivalent score (DEE) | $L(w) = -\frac{1}{N}\sum_{i=1}^{N}\log \hat{p}(y_i^* \mid x_i, w) + \frac{\lambda}{2}\|w\|^2 \rightarrow \min_w$ | 
Pearce et al. [202] | Tabular data | Neural Networks | Quality-Driven Ensembles (QD-Ens) | $Loss_{QD} = MPIW_{capt.} + \lambda \frac{n}{\alpha(1-\alpha)} \max\big(0, (1-\alpha) - PICP\big)^2$ | 
Ambrogioni et al. [203] | Tabular data | Bayesian logistic regression | Wasserstein variational gradient descent (WVG) | $L(z_1) = -\mathbb{E}_{z \sim p(z|x)}[c(z_j, z)]$ | ×
Hu et al. [204] | Image | Neural Networks | Bias-variance decomposition | $L = \frac{1}{2}\exp(-s(x)) \frac{\sum_r \|y_r(x) - \hat{y}(x)\|^2}{R} + \frac{1}{2}s(x)$ | ×

the uncertainty prediction problem [190]. When these approaches are employed in large networks, they become computationally expensive. Model ensembling is an effective technique that can be used to enhance the predictive performance of supervised learners. Deep ensembles are applied to obtain better predictions on test data and also to produce model uncertainty estimates when learners are provided with OoD data. The success of ensembles depends on the variance reduction obtained by combining predictions that individually are prone to several types of errors. Hence, the improvement in predictions is achieved by utilizing a large ensemble with numerous base models, and such ensembles also provide distributional estimates of model uncertainty. A deep ensemble echo state network (D-EESN) model, with two versions for spatio-temporal forecasting and associated uncertainty measurement, was presented in [192]. The first framework applies a bootstrap ensemble approach and the second is devised within a hierarchical Bayesian framework. Multiple levels of uncertainty and non-Gaussian data types are accommodated by the general hierarchical Bayesian approach. The authors in [192] extended some of the deep ESN components presented by Antonelo et al. [205] and Ma et al. [206] to fit a spatio-temporal ensemble approach in the D-EESN model. As in the previous section, we summarise a few loss functions of deep ensembles in Table 4.
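To make the basic deep-ensemble recipe concrete, the sketch below trains a handful of independently initialized regressors on the same data and uses the spread of their predictions as an uncertainty estimate; it is a generic illustration with scikit-learn, not the implementation of any specific cited paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# deep ensemble: same architecture, different random initializations
ensemble = [
    MLPRegressor(hidden_layer_sizes=(64, 64), random_state=seed, max_iter=2000).fit(X, y)
    for seed in range(5)
]

X_test = np.linspace(-6, 6, 50).reshape(-1, 1)        # includes out-of-range inputs
preds = np.stack([m.predict(X_test) for m in ensemble])
mean, std = preds.mean(axis=0), preds.std(axis=0)      # std grows outside the training range
print(mean[:3], std[:3])
```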

6.2 Deep Ensemble Bayesian

The expressive power of various ensemble techniques has been extensively shown in the literature. However, traditional learning techniques suffer from several drawbacks and limitations, as listed in [210]. To overcome these limitations, Fersini et al. [210] utilized an ensemble learning approach to mitigate the noise sensitivity related to language ambiguity, so that a more accurate prediction of

TABLE 4: Main loss functions used by deep ensemble techniques for UQ.

Study | Dataset type | Base classifier(s) | Method's name | Loss equation | Code
Fan et al. [207] | GPS-log | Neural Networks | Online Deep Ensemble Learning (ODEL) | $L = H\big(F_{ensemble}(X_{t-T:t-1}), \mathrm{one\_hot}(X_t)\big)$ | ×
Yang et al. [208] | Smart grid | K-means | Least absolute shrinkage and selection operator (LASSO) | $L(y_i^m, \hat{y}_i^{m,q}) = \frac{1}{Q}\sum_{q \in Q} \max\big((q-1)H_\varepsilon(y_i^m, \hat{y}_i^{m,q}),\ q H_\varepsilon(y_i^m, \hat{y}_i^{m,q})\big)$ | ×
van Amersfoort et al. [209] | Image | Neural Networks | Deterministic UQ (DUQ) | $L(x, y) = -\sum_c y_c \log(K_c) + (1 - y_c)\log(1 - K_c)$ | ×

polarity can be estimated. The proposed ensemble method employed Bayesian model averaging, where both the reliability and the uncertainty of each single model were considered. Study [211] presented an alteration to prevailing approximate Bayesian inference by regularizing parameters about values drawn from a distribution which can be set equal to the prior. The analysis of the procedure suggested that the recovered posterior was centered correctly but tended to have an overestimated correlation and an underestimated marginal variance. One of the most promising frameworks for obtaining uncertainty estimates is Deep BAL (DBAL) with MC dropout. Pop et al. [199] argued that, in variational inference methods, the mode collapse phenomenon is responsible for the overconfident predictions of DBAL methods. They devised Deep Ensemble BAL, which addressed the mode collapse issue and improved the MC dropout method. In another study, Pop et al. [212] proposed a novel AL technique specifically for DNNs, in which the statistical properties and expressive power of model ensembles were employed to enhance the state-of-the-art deep BAL technique suffering from the mode collapse problem. In another work, Pearce et al. [213] proposed a new ensemble of NNs, an approximately Bayesian ensembling approach called "anchored ensembling", which regularises the parameters towards values drawn from a distribution.
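A rough sketch of the anchored-ensembling idea (our reading of [213], not the authors' code): each ensemble member is trained with its parameters regularised towards its own anchor draw from the prior, rather than towards zero.

```python
import numpy as np

def anchored_loss(params, anchor, residuals, data_noise=0.1, prior_var=1.0):
    """Anchored-ensemble objective for one member (regression, squared-error form).

    params    : (D,) current parameters of this ensemble member.
    anchor    : (D,) this member's fixed anchor, drawn once from the prior N(0, prior_var).
    residuals : (N,) prediction errors y - f(x; params) on the training data.
    The second term pulls the parameters towards the anchor instead of zero,
    which makes the trained members behave approximately like posterior samples.
    """
    nll = np.sum(residuals ** 2) / (2 * data_noise)
    anchor_reg = np.sum((params - anchor) ** 2) / (2 * prior_var)
    return nll + anchor_reg

rng = np.random.default_rng(0)
anchors = [rng.normal(0.0, 1.0, size=10) for _ in range(5)]  # one anchor per member
print(anchored_loss(np.zeros(10), anchors[0], residuals=rng.standard_normal(20)))
```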

6.3 Uncertainty Quantification in the Traditional Machine Learning Domain using Ensemble Techniques

It is worth noting that UQ in traditional machine learning algorithms has been studied extensively using different ensemble techniques and a few other UQ methods (e.g., see [214]) in the literature. However, due to page limitations, we summarize only some of the ensemble techniques (as UQ methods) used in the traditional machine learning domain. For example, Tzelepis et al. [214] proposed a maximum margin classifier to deal with uncertainty in the input data. The proposed model is applied to classification tasks using the SVM (Support Vector Machine) algorithm with multi-dimensional Gaussian distributions; it is named SVM-GSU (SVM with Gaussian Sample Uncertainty) and is illustrated in Fig. 18. In another work, Pereira et al. [215] examined various techniques for transforming classifiers into uncertainty-aware methods, where predictions are accompanied by probability estimates reflecting their uncertainty. They applied several uncertainty methods: Venn-


Fig. 18: A schematic view of SVM-GSU which is reproduced based on [214].

ABERS predictors, Conformal Predictors, Platt Scaling and Isotonic Regression. Partalas et al. [216] presented a novel measure called Uncertainty Weighted Accuracy (UWA) for ensemble pruning through directed hill climbing that takes into account the uncertainty of the current ensemble decision. The experimental results demonstrated that pruning a heterogeneous ensemble with the new measure significantly enhances accuracy compared to baseline methods and other state-of-the-art measures. Peterson et al. [217] examined the different types of errors that can creep into atomistic machine learning and addressed how uncertainty analysis validates machine-learning predictions. They applied a bootstrap ensemble of neural-network-based calculators and showed that the width of the ensemble can provide an approximation of the uncertainty.
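A minimal sketch of the bootstrap-ensemble idea mentioned above (a generic illustration, not the calculators used in [217]): each member is fit on a resampled training set, and the spread of the member predictions serves as the uncertainty estimate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.exp(-X[:, 0]) + 0.05 * rng.standard_normal(100)

members = []
for _ in range(20):
    idx = rng.integers(0, len(X), size=len(X))           # bootstrap resample
    members.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

X_test = np.linspace(0, 7, 30).reshape(-1, 1)            # extrapolates beyond the training range
preds = np.stack([m.predict(X_test) for m in members])
uncertainty = preds.std(axis=0)                           # ensemble width as an uncertainty proxy
print(uncertainty.round(3))
```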

7 FURTHER STUDIES OF UQ METHODS

In this section, we cover other methods used to estimate uncertainty. We present a summary of the proposed methods rather than their theoretical details; due to page limitations and the large number of references, we are not able to review all the details of each method, and we recommend that interested readers consult the cited references. OoD inputs are a common source of error in machine and deep learning systems when the test data follow a distribution different from the training data. To address this issue, Ardywibowo et al. [218] introduced a new UA architecture called Neural Architecture Distribution Search (NADS). NADS finds an appropriate distribution over architectures which perform significantly well on a specified task. A single block diagram of the search space



Fig. 19: A single block diagram for the search space in the architecture which is reproduced based on [218].

in the architecture is presented in Fig. 19. Unlike previous architecture design methods, NADS allows common blocks to be recognized among all UA architectures. On the other hand, the cost functions for uncertainty-oriented neural networks (NNs) do not always converge, and a converged NN does not always generate optimized prediction intervals (PIs); with such cost functions, the convergence of training is uncertain and the cost functions are not customizable. To construct optimal PIs, Kabir et al. [219] presented a smooth, customizable cost function for training NNs. The PI coverage probability (PICP), PI-failure distances and the optimized average width of the PIs were computed to lessen the variation in PI quality, enhance the convergence probability and speed up training. They tested their method on electricity demand and wind power generation data. In non-Bayesian deep neural classification, uncertainty estimation methods introduce biased estimates for instances whose predictions are highly accurate; it has been argued that this limitation occurs because of the dynamics of training with SGD-like optimizers, which exhibit characteristics similar to overfitting. Geifman et al. [220] proposed an uncertainty estimation method that computes the uncertainty of highly confident points by utilizing snapshots of the trained model before their approximations were jittered; the proposed algorithm outperformed all well-known techniques. In another work, Tagasovska et al. [221] proposed single-model estimates of epistemic and aleatoric uncertainty for DNNs. They suggested a loss function called Simultaneous Quantile Regression (SQR) to learn the conditional quantiles of a target variable and thereby assess aleatoric uncertainty; well-calibrated prediction intervals can be derived from these quantiles. To estimate epistemic uncertainty, they devised Orthonormal Certificates (OCs), a collection of non-constant functions that map training samples to zero; OoD examples are mapped by these certificates to non-zero values.
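As a small illustration of the quantile (pinball) loss underlying SQR-style aleatoric uncertainty estimation (our generic sketch, not the code of [221]):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for quantile level tau in (0, 1).

    Minimizing this loss drives y_pred towards the tau-quantile of y_true,
    so training one model over many tau values yields conditional quantiles
    from which prediction intervals (aleatoric uncertainty) can be read off.
    """
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# toy check: for data ~ N(0, 1), the 0.9-quantile prediction should be near 1.28
samples = np.random.default_rng(0).standard_normal(10000)
grid = np.linspace(-3, 3, 601)
losses = [pinball_loss(samples, q, tau=0.9) for q in grid]
print(grid[int(np.argmin(losses))])  # close to 1.28
```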


Fig. 20: A general view of the DUQ architecture which is reproduced based on [209], [222].

van Amersfoort et al. [209], [222] presented a method to find and reject out-of-distribution data points when training a deterministic deep model with a single forward pass at test time. They exploited the ideas of RBF networks to devise deterministic UQ (DUQ), which is presented in Fig. 20, and scaled its training with a centroid updating scheme and a new loss function. Their method can detect out-of-distribution data consistently by utilizing a gradient penalty to track changes in the input; it performs competitively with deep ensembles and scales well to large datasets. Tagasovska et al. [223] presented frequentist estimates of epistemic and aleatoric uncertainty for DNNs. They proposed a loss function, simultaneous quantile regression, to estimate all the conditional quantiles of a given target variable for aleatoric uncertainty; well-calibrated prediction intervals can be obtained from these quantiles. For the estimation of epistemic uncertainty, they proposed a collection of non-trivial, diverse functions that map all training samples to zero, dubbed training certificates; the certificates signal high epistemic uncertainty by mapping OoD examples to non-zero values. By using Bayesian deep networks, it is possible to know what DNNs do not know in domains where safety is a major concern, since a flawed decision may incur a severe penalty in fields such as autonomous driving, security and medical diagnosis, and traditional approaches are incapable of scaling to large, complex neural networks. Mobiny et al. [224] proposed an approach that imposes a Bernoulli distribution on the model weights to approximate Bayesian inference for DNNs. Their framework, dubbed MC-DropConnect, provides model uncertainty with only a small alteration to the model structure or computational cost. They validated their technique on various datasets and architectures for semantic segmentation and classification tasks, introduced novel uncertainty quantification metrics, and showed considerable improvements in uncertainty estimation and prediction accuracy compared to prior approaches. Uncertainty measures are crucial estimation tools in the machine learning domain; they can be used to evaluate the similarity and dependence between two feature subsets and to verify the importance of features in clustering and classification algorithms. In classical rough sets, there are a few uncertainty measures for evaluating a feature subset, including rough entropy, information entropy, roughness and accuracy; these measures are designed for discrete-valued information systems and are not appropriate for real-valued datasets. Chen et al. [225] proposed the


neighborhood rough set model. In their approach, each object is related to a neighborhood subset, dubbed a neighborhood granule. Different uncertainty measures of neighborhood granules were introduced, namely information granularity, neighborhood entropy, information quantity, and neighborhood accuracy. Further, they confirmed that these uncertainty measures satisfy monotonicity, invariance and non-negativity. Their experimental results and theoretical analysis demonstrated that, in neighborhood systems, information granularity, neighborhood entropy and information quantity perform better than the neighborhood accuracy measure. On the other hand, reliable and accurate machine learning systems depend on techniques for reasoning under uncertainty. Bayesian methods provide a framework for UQ, but Bayesian uncertainty estimates are often imprecise because of approximate inference and model misspecification. Kuleshov et al. [226] devised a simple method for calibrating any regression algorithm; when applied to probabilistic and Bayesian models, it is guaranteed to provide calibrated uncertainty estimates given enough data. They assessed their technique on recurrent and feedforward neural networks as well as Bayesian linear regression, and obtained well-calibrated credible intervals while enhancing performance on model-based RL and time-series forecasting tasks. Gradient-based optimization techniques have shown their efficacy in learning overparameterized, complex neural networks from non-convex objectives. Nevertheless, generalization in DNNs, the induced training dynamics, and the precise theoretical relationship among gradient-based optimization methods remain unclear. Rudner et al. [227] examined the training dynamics of overparameterized neural networks under natural gradient descent. They demonstrated that the discrepancy between the functions obtained from non-linearized and linearized natural gradient descent is smaller than for standard gradient descent, and showed empirically that no limit argument about the width of the network layers is needed, as the discrepancy is small for overparameterized networks. Finally, they demonstrated on a set of regression benchmark problems that the discrepancy is small and that their theoretical results are consistent with the empirical discrepancy between the functions obtained from non-linearized and linearized natural gradient descent. Patro et al. [228] devised gradient-based certainty estimates with visual attention maps, addressing the visual question answering task. The gradients for the estimates were enhanced by incorporating probabilistic deep learning techniques, with two key advantages: (1) the certainty estimates correlate better with misclassified samples, and (2) state-of-the-art results are obtained by improving attention maps so that they correlate with human attention regions. The enhanced attention maps consistently improved different techniques for visual question answering, and improved certainty estimates and explanations of deep learning techniques can be achieved through the presented method. They provided empirical results on all benchmarks for the visual question answering task and compared them with standard techniques. BNNs have been used as a solution for uncertainty in neural network predictions, but specifying their prior remains an open challenge. An independent normal prior in weight space leads to weak

constraints on the function posterior, permitting it to generalize in unanticipated ways on inputs outside the training distribution. Hafner et al. [14] presented noise contrastive priors (NCPs) to obtain reliable uncertainty estimates. The key idea is to train the model to output high uncertainty for data points outside the training distribution. NCPs rely on an input prior, which adds noise to the inputs of the current mini-batch, and an output prior, which is a wide distribution for these inputs. NCPs limit overfitting outside the training distribution and produce useful uncertainty estimates for active learning (AL). BNNs with latent variables are flexible and scalable probabilistic models. T

