
QRMODA and BRMODA: Novel Models for Face Recognition Accuracy in Computer Vision Systems with Adapted Video Streams

Hayder R. Hamandi and Nabil J. Sarhan

Wayne State University

Detroit, MI

Abstract

A major challenge facing Computer Vision systems is providing the ability to accurately detect threats and recognize subjects and/or objects under dynamically changing network conditions. We propose two novel models that characterize the face recognition accuracy in terms of video encoding parameters. Specifically, we model the accuracy in terms of video resolution, quantization, and actual bit rate. We validate the models using two distinct video datasets and a large image dataset by conducting 1,668 experiments that involve simultaneously varying combinations of encoding parameters. We show that both models hold true for both deep-learning and statistical-based face recognition. Furthermore, we show that the models can be used to capture different accuracy metrics, specifically the recall, precision, and F1-score. Ultimately, we provide meaningful insights on the factors affecting the constants of each proposed model.

1 Introduction

The video streams in Computer Vision (CV) systems should be adapted dynamically (by changing the video capturing and encoding parameters) to fit the tight resource constraints, including network bandwidth, energy, and storage. Therefore, these adaptations lead to various tradeoffs involving the accuracy and the aforementioned constraints. The overwhelming majority of studies on CV focused on the development of robust algorithms to improve the accuracy, primarily in image datasets, through statistical [1] (and references within) and more recently deep learning approaches [2, 3, 4, 5] (and references within).

We analyze the behavior of CV accuracy, focusing primarily on face recognition. We make a fundamental contribution by developing two novel models that help in assessing the effect of combining adaptation strategies for the same video stream. The first model (QRMODA) characterizes CV accuracy in terms of the spatial resolution and quantization parameter (Q_p), which we determine as a logistic function of Q_p, with the x_0 value of the Sigmoid's midpoint being a function of the resolution. In contrast, the second model (BRMODA) characterizes the accuracy in terms of the spatial resolution and actual bitrate. We find that the accuracy is equal to the sum of two exponentials of the actual bitrate, with the resolution as a multiplicative factor with one exponential.

Furthermore, we validate each model against two different (deep learning and statistical-based) approaches of face recognition, utilizing two greatly distinct video datasets (Honda/UCSD [6] and DISFA [7]) and a large image dataset (Labeled Faces in the Wild (LFW) [8]). We conduct 1,668 actual experiments on 99 videos and 13,233 images, with 47 and 5,749 subjects, respectively. Subjects have different gender, ethnicity, and pose variations. The results indicate that both proposed models hold true for both face recognition approaches and using different datasets. The results also show that the models can characterize face detection. Moreover, the models can be used to characterize different accuracy metrics, specifically recall, precision, and F1-score, but we focus primarily on recall due to its importance in our particular application. We compute the coefficient of determination (R²) to assess the goodness of fit. Ultimately, we discuss the factors impacting the constants of each proposed model and how to compute them in actual systems.

The main contributions can be summarized as follows: (1) developing two mathematical models of face recognition accuracy in terms of the main video encoding parameters, (2) conducting extensive experiments to analyze the impacts of different combinations of video adaptation techniques on CV accuracy, (3) validating the two models using two greatly distinct and diverse video datasets and a large image dataset, and (4) discussing the factors impacting the constants of each model.

The rest of this paper is organized as follows. Section 2 provides background information and discusses the related work. Section 3 shows the development of both proposed models. Subsequently, Section 4 explains the experimental setup, and Section 5 presents and analyzes the main results. Section 6 provides additional discussion and analysis of the factors impacting the constants of both proposed models. Finally, conclusions are drawn.

2 Background Information and Related Work

2.1 Face Recognition

Face recognition is a major CV algorithm in many applications, including authentication systems, personal photo enhancement, automated video surveillance, and photo search engines. Face recognition approaches can be classified into two broad categories: neural-based and statistical-based.

Neural-based solutions employ neural networks to classify objects within an input image. Convolutional Neural Networks (CNNs) are the most widely employed type and have proven to deliver strong accuracy results. Examples include GoogleNet [2], VGG [3], and MobileFaceNets [4]. The performance of VGG, GoogleNet, and SqueezeNet has been benchmarked in terms of verification accuracy against different types of noise in [9]. All these architectures aim at reducing the size of the deep CNN. We use FaceNet [10] with the architecture model targeted towards a datacenter application. FaceNet achieved state-of-the-art performance in face recognition, according to a recent survey [11].

Face recognition using statistical-based algorithms can be classified into two main categories: appearance-based and model-based [12]. The first typically represents images as high-dimensional vectors and then employs statistical techniques, such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), for image vector analysis and feature extraction. PCA reduces the size of the n-dimensional space used by the appearance-based algorithm, ultimately simplifying computational complexity. LDA works similarly but requires more training data. On the other hand, model-based algorithms require manual human face model construction to capture facial features, while feature matching is achieved using an algorithm, such as Elastic Bunch Graph Matching (EBGM). In real-world situations, only a small number of samples for each subject are available for training. If a sufficient amount of representative data is not available, Martinez and Kak [30] have shown that the switch from nondiscriminant techniques (e.g., PCA) to discriminant approaches (e.g., LDA) is not always warranted and may sometimes lead to poor system design when small and nonrepresentative training data sets are used. For the reasons above, we validate our proposed models using PCA.

2.2 Relationship to Prior Work

The overwhelming majority of research on CV considered the development of robust algorithms to improve accuracy in static image datasets [4, 5] (and references within), fewer dealt with videos [13], and even fewer contributions addressed system design aspects [14].

In this study, we model the CV accuracy in terms of the main video encoding parameters. None of the prior studies developed accuracy models. Most studies on CV did not even consider the impact of video adaptation on the accuracy. In [15], video adaptation was analyzed in terms of face detection accuracy, without providing any models and without using open datasets. Although face detection is a simple CV algorithm to implement, it has limited usefulness in practice when applied alone. Face recognition is much more important as it can precisely reveal subject identity rather than merely pointing out the presence of an arbitrary subject. Furthermore, that study did not utilize deep learning algorithms.

Some prior studies on video adaptation considered video quality metrics, such as Mean Squared Error (MSE) and Structural Similarity Index (SSIM), with much literature on rate-distortion optimization [16] (and references within). In CV systems, however, the recognition accuracy, not the human perceptual quality, should be the main metric because the videos are analyzed by machines.

Study [17] explored the impact of illumination, facial expression, and occlusion on statistical-based face recognition accuracy without any modeling.

3 Development of the Proposed Models

3.1 Overview and Motivation

We analyze the effectiveness of combining video adaptation strategies in terms of the CV accuracy, focusing primarily on face recognition. We consider adapting the video streams by changing both the spatial resolution and the Signal-to-Noise Ratio (SNR).


We utilize a super-resolution algorithm to upscale the videos to their original resolutions before the analysis at the destinations in order to boost the accuracy. We employ the Lanczos upscaling algorithm because it outperforms other algorithms, including Bicubic and Spline, in terms of the overall tradeoff between performance and execution time [15]. For the SNR adaptation, we consider changing both the target bitrate and Q_p. We do not consider temporal adaptation, as missing frames will trivially lead to zero detection and therefore no recognition.

Moreover, we propose two models of the CV accuracy with respect to the parameters used by the aforementioned adaptations. These models are of great importance and can be utilized to control camera settings in a way that optimizes the overall CV accuracy. Figure 1 illustrates a potential use scenario. In the envisioned system, multiple sources stream video to a distributed processing system for analysis, and the proposed models enable the distributed processing system to optimally determine the settings of each camera (such as the resolution and the target bitrate or quantization parameter). Eventually, the distributed processing system, running the CV algorithm, will be able to achieve the optimal recognition accuracy, given various constraints and the available system and network resources.

Figure 1. Utilization of the Proposed Models in Controlling the Cameras of a CV System

3.2 Model Development

We develop two novel models of the CV accuracy in terms of the video encoding parameters. The first model characterizes face recognition accuracy with respect to variations in Q_p and resolution, and thus we refer to it as QRMODA (Q_p and Resolution based MODel for Accuracy). Similarly, we develop another model for accuracy with respect to the actual bitrate and resolution, and we refer to it as BRMODA (Bitrate and Resolution based MODel for Accuracy). Subsequently, we show that both models apply to face detection as well as face recognition, but with different constant values.

Though accuracy is the simplest metric for evaluating the performance of any classifier, other measures, such as precision and recall, tell more about the nature of the classifier. These metrics are particularly important when dealing with imbalanced data. When the negative class is dominant, any classifier will likely predict negative and achieve high accuracy. Nonetheless, such a classifier will have no more than 50% precision/recall, because these two measure the ratio of correctly predicted instances to the total predicted-positive and actual-positive instances, respectively. Since recall captures the rate with respect to actual data, rather than predictions, we believe it is more valuable in face recognition applications. In other words, a wrongly flagged (false positive) face is not a catastrophic issue, whereas an overlooked (false negative) face that should have been flagged as positive is a major security concern. For this reason, we argue that recall is an important measure in face recognition and is more valuable since it captures the sensitivity of the system [18]. Recall is also important when evaluating the performance of a face detector since it can characterize the performance of binary classification tests, such as face detection, which either detects a face or does not. Recent literature, such as [5], likewise uses recall to measure the performance of face detector adaptation when training with different datasets.

Table 1. Used Notations

Notation                            Explanation
E                                   Recall Error
TP                                  True Positive
FN                                  False Negative
f_logistic(x)                       The Logistic Function
Q_p                                 Quantization Parameter
c_1, c_2, c_3, c_4, and c_5         QRMODA Constants
c'_1, c'_2, c'_3, c'_4, and c'_5    BRMODA Constants
N × M                               Video Resolution
R                                   Actual Video Bitrate
R²                                  Coefficient of Determination

We use the recall error (denoted by E) to measure the system sensitivity. Given a video of k frames, E for the entire video can be given by

E = 1 - \frac{\sum_{i=1}^{k} TP_i}{\sum_{i=1}^{k} \left( TP_i + FN_i \right)},    (1)

where TP_i and FN_i are the numbers of correctly and erroneously identified faces in frame i, respectively. Our goal is to characterize E in terms of simultaneously and independently adapting parameters.
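As a concrete illustration of Eq. (1), the following minimal sketch (our own, not code from the paper) computes E for a video from hypothetical per-frame true-positive and false-negative counts:

```python
def recall_error(tp_counts, fn_counts):
    """Video-level recall error E of Eq. (1).

    tp_counts[i] and fn_counts[i] hold the numbers of correctly and
    erroneously identified faces in frame i, respectively.
    """
    total_tp = sum(tp_counts)
    total_fn = sum(fn_counts)
    if total_tp + total_fn == 0:
        raise ValueError("video contains no ground-truth faces")
    return 1.0 - total_tp / (total_tp + total_fn)

# Hypothetical 5-frame video with 4 faces per frame; faces are
# increasingly missed as quantization degrades the frames.
tp = [4, 4, 3, 2, 1]
fn = [0, 0, 1, 2, 3]
print(recall_error(tp, fn))  # 0.3, i.e., 30% of the faces were overlooked
```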

3.2.1 QRMODA

Since video adaptation is imposed due to network resource limitations, we expect the CV accuracy to suffer starvation beyond a certain threshold of Q_p. However, due to a simultaneous independent adaptation in video resolution, a compensation for the accuracy loss will be granted if the video adapts to a higher resolution. Our empirical data, discussed in Section 5, indicate that E follows an exponential trend with respect to changes in resolution and a bounded exponential bias towards Q_p variations. Hence, we determine that E is a function that combines the characteristics of both (exponential and bounded exponential) functions, which is known as the logistic function. We find that E is a logistic function of Q_p, with the x-axis value of the Sigmoid's midpoint (x_0) being a function of the spatial resolution. Specifically, given a video with a resolution of N × M, quantized at Q_p, E can be characterized as

E_{QRMODA} = f_{logistic}\left(x = Q_p,\ x_0 = c_1 (NM)^{c_2}\right) + c_3,    (2)

where

f_{logistic}(x) = \frac{c_4}{1 + e^{c_5 (x - x_0)}}.    (3)
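The model is straightforward to evaluate in code. The sketch below (our own illustration) implements Eqs. (2) and (3) and evaluates them with the CNN-based recognition constants reported for Honda/UCSD in Table 3:

```python
import math

def qrmoda_error(qp, n, m, c1, c2, c3, c4, c5):
    """Recall error predicted by QRMODA (Eqs. 2-3) for an N x M
    video quantized at Qp."""
    x0 = c1 * (n * m) ** c2  # Sigmoid midpoint shifts with resolution
    return c4 / (1.0 + math.exp(c5 * (qp - x0))) + c3

# CNN-based recognition constants for Honda/UCSD (Table 3).
consts = dict(c1=24.03, c2=0.05211, c3=0.61, c4=0.3838, c5=-0.2864)
for qp in (10, 30, 50):
    print(qp, round(qrmoda_error(qp, 600, 450, **consts), 3))
```

Because c_5 is negative, the logistic term vanishes for Q_p well below the midpoint x_0, leaving only the bias c_3, and approaches c_4 beyond it, so the error saturates at c_3 + c_4.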

We introduce c_3 as a bias to the model. This value defines the lowest achievable recall error (i.e., at the original resolution with no quantization). The model constants c_1 through c_5 vary based on factors that we discuss in Section 6. c_1 and c_2 define the sharpness in the change of the Sigmoid slope and impact the model's trend with respect to variations in only the spatial resolution. Small values (less than 1) will result in a smooth transition in recall (with a slope of around 80°, depending on the value of constant c_5) that is only slightly affected by spatial adaptation. Contrarily, values greater than 1 will result in a sharp transition in recall as more quantization is imposed (especially at low resolutions). As the resolution is increased, the recall transition flattens. The constant c_4 determines the maximum value of the logistic function without the bias. Specifically, (c_3 + c_4) determines the highest recall error, and thus the lowest recall rate, regardless of adaptation variations. Lastly, c_5 determines the logistic growth rate (steepness of the curve). Since the recall error increases with quantization, c_5 is always negative. Figure 2 illustrates the trend captured by the QRMODA model with sample frames at a fixed resolution but at different Q_p adaptation levels.

Figure 2. Illustration of the Trend Captured by QRMODA


3.2.2 BRMODA

Videos with low resolutions tend to produce low bitrates when high target bitrates are imposed. Likewise, videos with high resolutions tend to produce higher bitrates than the imposed target values. Low-bitrate videos have lower recall rates due to the reduction in video quality. As higher bitrates are granted, the video quality increases, thereby causing the recall error to drop drastically. Our empirical results show an exponential relationship. We determine that E is the sum of two exponentials of the actual bitrate, with the number of pixels in the frame being a multiplicative factor with one of the exponentials. Given an N × M resolution video with an actual bitrate R, E can be given as

E_{BRMODA} = c'_1 (NM)^{c'_2} e^{c'_3 R} + c'_4 e^{c'_5 R},    (4)

where c'_1 through c'_5 are constants. This model uses the value of the actual bitrate because the target bitrate may not be achieved precisely by the encoder. Constants c'_1 and c'_2 are similar in purpose to their counterparts in QRMODA: they define the steepness of the exponential drop with respect to spatial resolution variation. Constant c'_3 is always negative because E is inversely proportional to the actual achieved bitrate. In other words, high-resolution videos require high bitrates, and thus will produce high recall errors when low bitrates are imposed. c'_4 and c'_5 control the bias exponential. Figure 3 shows the trend captured by the BRMODA model, with sample frames at a fixed resolution but with different bitrates.
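A matching sketch of BRMODA as a function (again our own illustration; note that the units of R must match those used when the constants were fitted):

```python
import math

def brmoda_error(r, n, m, c1p, c2p, c3p, c4p, c5p):
    """Recall error predicted by BRMODA (Eq. 4) for an N x M video
    whose encoder achieved an actual bitrate r."""
    return c1p * (n * m) ** c2p * math.exp(c3p * r) + c4p * math.exp(c5p * r)

# CNN-based recognition constants for Honda/UCSD (Table 3); the
# bitrate unit here is assumed to match the one used during fitting.
print(brmoda_error(500, 600, 450, 0.0363, 0.292, -0.054, 0.273, -4.718e-6))
```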

Figure 3. Illustration of the Trend Captured by BRMODA

4 Experimental Setup

4.1 Used Datasets

We utilize two greatly distinct video datasets: Honda/UCSD and DISFA. The former is a standard video database provided for the evaluation of face detection, tracking, and recognition algorithms. The latter is used to study Facial Action Coding Systems (FACS). Honda/UCSD has lower quality videos, which serve as an example of limited-bandwidth network systems. In contrast, DISFA has High Definition (HD) quality videos. In addition, the subjects in Honda/UCSD make different combinations of 2-D and 3-D head rotations and have different facial expressions with varying speed. On the other hand, subjects in DISFA have limited pose variations, but great variations in facial action expressions.

Furthermore, we utilize a large image dataset: LFW [8]. This dataset aims at studying the problem of unconstrained face recognition. The main properties of all the used datasets are summarized in Table 2. We divide each database into three main sets: Training, Validation, and Testing. In Honda/UCSD, we use the first included dataset, which is already categorized into three groups: Training, Testing, and Testing with Partial Occlusion; we use the latter for validation. In DISFA, by contrast, we split the right-camera videos into training and validation sets and use the left-camera videos for testing. For LFW, we use the split method suggested by [8]. The adaptation is performed only on the Testing sets to avoid overfitting and selection bias towards adapted frames.

4.2 Video Adaptation Generation

We perform H.264 encoding on the testing-set videos/images of all datasets using FFmpeg to achieve different adaptation levels. We generate two sets of doubly adapted videos to analyze the impact of two encoding parameters on CV accuracy. The first set includes videos with combined resolution and target bitrate adaptations, whereas the second set includes videos that have a combination of Q_p and resolution adaptations. We use the Lanczos algorithm to upscale the videos, as it provides the best tradeoff in performance and execution time [15]. We generate an additional set of doubly adapted images using the testing image set of the LFW dataset. For this set, only a combination of Q_p and resolution is used because bitrate adaptation is inapplicable to images.
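The encoding scripts are not part of the paper; the sketch below shows how such doubly adapted videos could plausibly be generated with FFmpeg and libx264 (the helper names and file paths are hypothetical):

```python
import subprocess

def adapt(src, dst, width, height, qp=None, bitrate=None):
    """Downscale to width x height and apply SNR adaptation through
    either a constant quantizer (-qp) or a target bitrate (-b:v)."""
    cmd = ["ffmpeg", "-y", "-i", src,
           "-vf", f"scale={width}:{height}", "-c:v", "libx264"]
    if qp is not None:
        cmd += ["-qp", str(qp)]
    if bitrate is not None:
        cmd += ["-b:v", bitrate]
    subprocess.run(cmd + [dst], check=True)

def lanczos_upscale(src, dst, width, height):
    """Upscale back to the original resolution with the Lanczos
    filter before detection/recognition, as described in Section 3.1."""
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-vf", f"scale={width}:{height}:flags=lanczos", dst],
                   check=True)

# Example: adapt a 640x480 source to 400x300 at Qp 32, then restore it.
adapt("input.avi", "adapted.mp4", 400, 300, qp=32)
lanczos_upscale("adapted.mp4", "restored.mp4", 640, 480)
```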


Table 2. Characteristics of the Used Datasets

Characteristic   Honda/UCSD                    DISFA                           LFW
Camera           SONY EVI-D30                  PtGrey stereo                   Varies
Subjects         20 (2 females and 18 males)   27 (12 females and 15 males)    5,749
Resolution       640 × 480                     1024 × 768                      250 × 250
Frame Rate       15 frames/sec                 20 frames/sec                   N/A
Format           Uncompressed AVI              Uncompressed AVI                JPEG
Size             45 videos                     54 videos                       13,233 images

4.3 Face Detection and Recognition Implementations

We use CNN-based face detection and recognition, utilizing FaceNet [10] as the deep learning platform. We develop an interface that interacts with FaceNet to perform normalization, CNN training, face detection, and face recognition. The experiments start by organizing all training frames in a tree-like fashion such that each subject maintains its own directory of the respective frames. These frames are then aligned, maintaining the same directory structure. The aligned frames are then used to train the fully connected layers of the deep CNN, generating a classifier model file for use by the recognition module. Subsequently, we fine-tune this model using the validation videos.

We use the testing-set videos as an input to the CNN and detect the faces in every frame of those videos using FaceNet. We employ the aforementioned classifier model to classify each frame. The result of this step is a list of probabilities for each probe with respective classes. We pick the class with the highest probability and consider it the best candidate identifying the probe (Top-1 class). Finally, we collect a confusion matrix, which we use to compute the overall recall, precision, and F1-score of each experiment.
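The paper does not state how the confusion matrix is reduced to the reported metrics; one common convention, shown in the assumed sketch below, computes per-class recall, precision, and F1-score and macro-averages them:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class recall, precision, and F1-score from a confusion
    matrix cm[true_class, predicted_class]."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp  # this subject's faces labeled as someone else
    fp = cm.sum(axis=0) - tp  # other subjects' faces labeled as this one
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Toy 3-subject confusion matrix (rows: true subject, cols: Top-1 class).
cm = np.array([[8, 2, 0],
               [0, 9, 1],
               [1, 0, 9]])
r, p, f = per_class_metrics(cm)
print(r.mean(), p.mean(), f.mean())  # macro-averaged metrics
```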

In the statistical-based approach, we develop a face detector using the Viola-Jones [19] face detection algorithm that is implemented in OpenCV. We develop a platform for extracting eye coordinates from all frames. Since eye classifiers are not accurate and may return falsely detected eyes, we develop a mechanism to filter true eyes based on their sagittal coordinates. These coordinates are vitally important for recognition because they represent input parameters for the preprocessing steps, including geometric normalization, histogram equalization, and masking. We utilize the CSU Face Identification Evaluation System [1] to perform training and face recognition. We employ PCA because of its effectiveness in generating simpler representations of the huge video dataset with all adaptations.

5 Model Validation and Analysis

5.1 Baselines and Evaluation Metrics

We use two baselines to benchmark the validity of BRMODA and QRMODA. These baselines represent deep learning (FaceNet with the NN2 architecture [10]) and statistical (PCA) face recognition methods. We also employ another baseline for face detection using the Viola-Jones algorithm. Although the methods used by [10] and [1] perform image analysis, we develop interfaces to work with adapted video frames from the Honda/UCSD and DISFA datasets. We use R² to assess the goodness of fit of the proposed models, and our accuracy metrics are recall, precision, and F1-score. The R² values are shown with each figure subcaption.
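For reference, the coefficient of determination between measured and model-predicted errors follows the standard definition; a minimal helper (our own sketch):

```python
import numpy as np

def r_squared(measured, predicted):
    """R^2 = 1 - SS_res / SS_tot between measured recall errors and
    the values predicted by QRMODA or BRMODA."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((measured - predicted) ** 2)
    ss_tot = np.sum((measured - measured.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```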

5.2 Result Presentation and Analysis

For each model, we show the results for experiments performed using two methods of face detection/recognition (neural-based and statistical-based) and utilizing three different datasets for QRMODA and two video datasets for BRMODA (since bitrate adaptation is not applicable to the image dataset). For each set of experiments, we present two subfigures that demonstrate the extremes of the resolution adaptations considered.

We validate QRMODA in terms of detection and recognition sensitivity at different spatial resolutions, as shown in Figure 4. Subfigures 4(a) through 4(d) show the recall error with respect to Q_p variations for the Honda/UCSD dataset. FaceNet is used for both the face detection and recognition tasks. Similarly, Subfigures 4(e) to 4(h) show QRMODA's robustness to a change in the detection/recognition methods, when the Viola-Jones algorithm is used for face detection and PCA is used for recognition. Subfigures 4(i) and 4(j) demonstrate the validation using a different dataset (DISFA). We also validate QRMODA using state-of-the-art (according to a recent evaluation [11]) results on LFW, as shown in Subfigures 4(k) and 4(l). Additionally, 3D graphs of PCA vs. QRMODA are shown in Subfigures 4(m) through 4(p) using Honda/UCSD and DISFA, respectively. The latter subfigures demonstrate the entire model behavior with respect to simultaneous variations in both encoding parameters (i.e., Q_p and the resolution), represented by the x and y axes, respectively, with the recall error color-coded over the z-axis. The results demonstrate that our proposed model is highly accurate.

The recall error increases slowly with Q_p up to a critical point, represented by the Sigmoid's midpoint of the logistic function. After that point, the error increases sharply with Q_p until it becomes 100%. The lower bound for the face detection error is about 0.2 for Honda/UCSD and approximately 0 for DISFA. Additionally, both the detection and recognition recall rates in DISFA are much higher than those in Honda/UCSD. This difference in threshold levels is because of the variation in video contents, which contain fewer frontal face poses in Honda/UCSD than in DISFA. The pose angle is recognized as an important factor in detection and recognition. Table 3 lists some of the constants used in this study.

Table 3. List of Constants for CNN-based QRMODA/BRMODA [Detection, Recognition]

Const.   Honda/UCSD                       DISFA
c_1      17.98, 24.03                     0.7, 1.54
c_2      0.08493, 0.05211                 1.255, 1.121
c_3      0.5, 0.61                        0.003, 0.003
c_4      0.5, 0.3838                      0.039, 0.5913
c_5      −0.2, −0.2864                    −0.4, −0.517
c'_1     0.414, 0.0363                    2.64 × 10^−4, 1.867 × 10^−6
c'_2     0.175, 0.292                     0.65, 1.02
c'_3     −0.126, −0.054                   −0.2, −0.117
c'_4     0.174, 0.273                     0.0229, 0.06102
c'_5     −7.97 × 10^−6, −4.718 × 10^−6    −4.8 × 10^−6, −3.03 × 10^−6

Figure 5 validates BRMODA in terms of detection and recognition recall errors at different resolutions, using the neural-based and statistical-based methods, respectively. Each subfigure shows the normalized recall error versus the actual bitrate for selected resolutions. As the target bitrates may not be achieved precisely by the encoder, we report the actual bitrates, which are depicted in the figures on a logarithmic scale because of the wide range of bitrates considered in our experiments. The results demonstrate that the model is highly accurate in terms of the calculated R². Remarkably, both models can be applied to other accuracy measures as well, such as precision and F1-score; Subfigure 5(a) demonstrates how BRMODA can be applied to all the different metrics. The recall is inversely proportional to the actual bitrate achieved due to the negative value of c'_3. This behavior varies with spatial resolution variation because high-resolution videos require high bitrates, thereby producing high errors when low bitrates are imposed. For this reason, the recall error starts at larger values in Subfigures 5(b), 5(c), 5(f), 5(i), 5(j), and 5(m) than in Subfigures 5(d), 5(e), 5(g), 5(h), 5(l), 5(o), and 5(p), respectively.

6 Discussion

The actual recall rate depends on different factors, including the subject's pose angle and inter-ocular distance in pixels. Both these factors depend on the camera placement and settings (zoom, pan, and tilt). Other factors are related to the environment, such as lighting and potential occlusion by other subjects or objects in the scene. A further factor is the face recognition algorithm being used. As shown in Section 5, CNN-based face recognition achieves the highest recall. This is not only due to the merits of deep learning but also to the ability of FaceNet to detect and align frames with side facial poses, whereas the cascade classifiers used in OpenCV fail to do so.

The proposed models characterize the recall error in terms of the main encoding parameters. The constant values can be determined dynamically upon system calibration. We recommend that the constants be determined based on actual videos captured by the cameras in the (surveillance) site. The system can generate different adaptations, and then determine the constants that best fit the model(s).


Figure 4. Validation and Analysis of QRMODA. [Plots omitted; recall/accuracy error vs. Q_p curves, with subfigure resolutions and R² values: (a) 600×450 (0.992, 0.995), (b) 520×390 (0.989, 0.993), (c) 400×300 (0.988, 0.997), and (d) 280×210 (0.982, 0.998) on Honda/UCSD with FaceNet; (e) 600×450 (0.9962, 0.999), (f) 520×390 (0.9949, 0.997), (g) 400×300 (0.993, 0.989), and (h) 240×180 (0.973, 0.962) on Honda/UCSD with Viola-Jones/PCA; (i) 480×360 (0.9968, 0.998) and (j) 280×210 (0.895, 1) on DISFA; (k) 50×50 (0.998) and (l) 200×200 (0.992) on LFW; 3D surfaces of recall error vs. Q_p and resolution: (m) PCA (Honda/UCSD), (n) QRMODA (Honda/UCSD), (o) PCA (DISFA), (p) QRMODA (DISFA).]

For instance, to determine the constants in QRMODA, monotone regression splines can be employed. Standard curve fitting procedures, such as reformatting and redefining tools, can be used. Some of these functions are available off-the-shelf, including the recently released splines2 package implementation in R [20].
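As one concrete (purely illustrative) calibration route, nonlinear least squares can also fit QRMODA's constants to measured (Q_p, resolution, recall error) triples. The sketch below recovers the constants from synthetic measurements generated by the model itself; a deployed system would instead use errors measured on adapted videos from the site:

```python
import numpy as np
from scipy.optimize import curve_fit

def qrmoda(X, c1, c2, c3, c4, c5):
    """QRMODA recall error (Eqs. 2-3); X stacks the Qp values and
    pixel counts NM of the calibration measurements."""
    qp, nm = X
    x0 = c1 * nm ** c2
    return c4 / (1.0 + np.exp(c5 * (qp - x0))) + c3

# Synthetic calibration set: two resolutions, Qp swept from 0 to 60.
rng = np.random.default_rng(0)
true = (24.0, 0.052, 0.61, 0.38, -0.29)
qp = np.tile(np.arange(0.0, 61.0, 5.0), 2)
nm = np.concatenate([np.full(13, 600.0 * 450.0), np.full(13, 280.0 * 210.0)])
e_measured = qrmoda((qp, nm), *true) + rng.normal(0.0, 0.005, qp.size)

popt, _ = curve_fit(qrmoda, (qp, nm), e_measured,
                    p0=[20.0, 0.05, 0.5, 0.4, -0.3], maxfev=20000)
print(dict(zip(["c1", "c2", "c3", "c4", "c5"], popt)))
```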

7 Conclusions

We have proposed two novel models that characterize CV accuracy. We have conducted extensive experiments involving combinations of video adaptation techniques to assess the effect of video encoding parameters on the system. We have used two greatly distinct video datasets and a large image dataset for validation.


Figure 5. Validation and Analysis of BRMODA [5(a)-5(h): Neural-Based, 5(i)-5(p): Statistical-Based]. [Plots omitted; error vs. actual bitrate (Kbps, log scale), with subfigure resolutions and R² values: (a) 600×450 on Honda/UCSD showing recall, precision, and F1-score; (b) 600×450 (0.997, 0.994), (c) 480×360 (0.998, 0.994), (d) 360×270 (0.994, 0.998), and (e) 240×180 (0.992, 0.997) on Honda/UCSD; (f) 400×300 (0.894, 0.999), (g) 240×180 (0.995, 0.992), and (h) 120×90 (0.991, 0.997) on DISFA; (i) 600×450 (0.998, 0.994), (j) 520×390 (0.997, 0.981), (k) 400×300 (0.991, 0.971), and (l) 200×150 (0.982, 0.936) on Honda/UCSD; (m) 560×420 (0.999, 0.999), (n) 480×360 (0.9091, 0.999), (o) 360×270 (0.9846, 0.997), and (p) 280×210 (0.9458, 0.996) on DISFA.]

We have validated the models using both CNN-based and statistical-based methods and reported R². The results show that both models are valid under all experimental scenarios. We find it remarkable that the two models apply to greatly distinct video/image datasets and to both face recognition and detection. The models also apply to both deep learning and statistical-based methods and can be utilized to capture the different CV accuracy metrics (precision, recall, F1-score). Ultimately, we have discussed the factors impacting the constants of each model.

References

[1] R. Beveridge, D. Bolme, B. A. Draper, and M. Teixeira, "The CSU face identification evaluation system," Machine Vision and Applications, vol. 16, pp. 128–138, February 2005.

[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016.

[3] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., "Deep face recognition," in Proceedings of the British Machine Vision Conference (BMVC), vol. 1, p. 6, 2015.

[4] S. Chen, Y. Liu, X. Gao, and Z. Han, "MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices," in Proceedings of the Chinese Conference on Biometric Recognition (CCBR), pp. 428–438, 2018.

[5] M. Abdullah Jamal, H. Li, and B. Gong, "Deep face detector adaptation without negative transfer or catastrophic forgetting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5608–5618, 2018.

[6] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman, "Visual tracking and recognition using probabilistic appearance manifolds," Computer Vision and Image Understanding, vol. 99, no. 3, pp. 303–331, 2005.

[7] M. Mavadati, M. Mahoor, K. Bartlett, P. Trinh, and J. Cohn, "DISFA: A spontaneous facial action intensity database," IEEE Transactions on Affective Computing, vol. 4, pp. 151–160, April 2013.

[8] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Proceedings of the Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.

[9] K. Grm, V. Struc, A. Artiges, M. Caron, and H. K. Ekenel, "Strengths and weaknesses of deep learning models for face recognition against image degradations," IET Biometrics, vol. 7, no. 1, pp. 81–89, 2017.

[10] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, 2015.

[11] M. A. Hmani and D. Petrovska-Delacretaz, "State-of-the-art face recognition performance using publicly available software and datasets," in Proceedings of the IEEE International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp. 1–6, 2018.

[12] X. Lu, "Image analysis for face recognition," Master's thesis, Michigan State University, 2004.

[13] L. Esterle and P. R. Lewis, "Online multi-object k-coverage with mobile smart cameras," in Proceedings of the ACM International Conference on Distributed Smart Cameras (ICDSC), pp. 107–112, 2017.

[14] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, "Camera style adaptation for person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5157–5166, 2018.

[15] Y. Sharrab and N. Sarhan, "Accuracy and power consumption tradeoffs in video rate adaptation for computer vision applications," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Melbourne, VIC, Australia, pp. 410–415, July 2012.

[16] C.-H. Hsu and M. Hefeeda, "A framework for cross-layer optimization of video streaming in wireless networks," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 7, pp. 5:1–5:28, Feb. 2011.

[17] M. Gunther, L. El Shafey, and S. Marcel, "Face recognition in challenging environments: An experimental and reproducible research survey," in Face Recognition Across the Imaging Spectrum, pp. 247–280, Springer, 2016.

[18] D. M. Powers, "Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

[19] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[20] "Package splines2." https://cran.r-project.org/web/packages/splines2/splines2.pdf.
