
A distributed object detector-tracker aided video encoder for smart camera networks.

ICDSC-2017

Srivatsa Bhargava J, Pushkar Gorur,

Bharadwaj Amrutur.

ECE, RBCCPS, IISc Bangalore.

September 5, 2017

Motivation

- Large-scale deployment of surveillance cameras to monitor our public spaces.

- Centralized storage and analysis of surveillance video streams will become a challenge due to increased bandwidth and compute requirements.

Region of Interest Aided Video Encoding.

- Region-of-interest aided video encoding close to the edge camera nodes.

- Used foreground segmentation to define the Regions of Interest.¹

¹ [WS09], [HZL+15], [HC13], [CG13], [BHH11]

Semantically defining Regions of Interest

- Using the semantic information to assign interest levels to regions can help further improve the compression.

- Reliable object detectors such as Faster R-CNN [Gir15] and YOLO [RDGF16] use features extracted by DNNs.

Semantically defining Regions of Interest

- ROI MB: Macro Block in the upper one-third of the pedestrian bounding box.
- RORI MB: Macro Block in the lower two-thirds of the pedestrian bounding box.
- RONI MB: Macro Block in the background region.
- UnExp MB: Macro Block in the foreground region but not inside a pedestrian bounding box (a categorization sketch follows below).
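A minimal sketch of how a macroblock could be assigned one of these four categories, assuming pedestrian bounding boxes and a binary foreground mask are available; the function name, 16x16 MB size, and centre-based test are illustrative rather than taken from the paper:

```python
import numpy as np

def classify_mb(mb_x, mb_y, ped_boxes, fg_mask, mb_size=16):
    """Assign an interest category to the macroblock whose top-left corner is
    (mb_x, mb_y). ped_boxes: list of (x, y, w, h); fg_mask: HxW uint8 mask."""
    cx, cy = mb_x + mb_size // 2, mb_y + mb_size // 2   # MB centre
    for (x, y, w, h) in ped_boxes:
        if x <= cx < x + w and y <= cy < y + h:
            # Upper one-third of the pedestrian box -> ROI, lower two-thirds -> RORI.
            return "ROI" if cy < y + h / 3.0 else "RORI"
    # Not inside any pedestrian box: consult the foreground segmentation mask.
    patch = fg_mask[mb_y:mb_y + mb_size, mb_x:mb_x + mb_size]
    if patch.size and np.count_nonzero(patch) > 0:
        return "UnExp"   # foreground, but unexplained by any detection
    return "RONI"        # background
```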

Challenge

- Reliable object detectors can be power hungry and cannot fit within the computational footprint of processors on edge camera nodes.

- Feature-extraction networks such as the VGG-16 DNN consume up to 13 W when run on an NVIDIA TX1 [CPC16].

- Infeasible to run DNN-based object detectors on the edge camera nodes on every frame.

- Can we share a single powerful compute node, capable of running a reliable object detector, among multiple edge camera nodes?

Distributed object detector-tracker aided smart camera network.

- Idea: compute the locations of objects of interest at the edge camera nodes by tracking using low-level features, periodically corrected by the detections from a centralized detector.

- Goal: reduction of the data-rate requirement on the backhaul link without a significant increase in the edge camera compute requirement.

Distributed object detector-tracker aided smart camera network.

- D-Frame: frame sent to the aggregator node to run the object detector.

- T-Frame: frame processed at the edge camera node to compute the locations of objects of interest by tracking using low-level features (a toy scheduling sketch follows below).
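As a toy illustration only (not code from the paper), the edge node can route every N-th frame down the D-Frame path and all remaining frames down the T-Frame path; the two callables stand in for the aggregator's detector and the local low-level tracker:

```python
def process_frame(frame, frame_idx, n_detect, tracklets,
                  run_detector_tracker, run_lowlevel_tracker):
    """Route one frame: every n_detect-th frame is a D-Frame handled with the
    aggregator's detector; all other frames are T-Frames tracked locally."""
    if frame_idx % n_detect == 0:
        return run_detector_tracker(tracklets, frame)   # D-Frame path
    return run_lowlevel_tracker(tracklets, frame)       # T-Frame path
```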

Proposed system architecture.

Tracking objects in D-Frames

- Object locations are measured by the detector.
- Tracking-by-Detection formulation used in the MOT literature [KLCR15], [BAPXH16].
- We assume:
  - Gaussian measurement noise corrupting the detection bounding-box coordinates.
  - A linear state-update model on the true latent location of the pedestrians.
- A Kalman filter estimates the true latent location of the pedestrians.
- We used color histograms (8 bins per channel) to model the appearance of the tracklet template and the detections.
- Bhattacharyya distance is the distance metric used to solve for data association (see the sketch below).
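A sketch of the appearance model and association cost under these choices, using OpenCV's histogram routines; the helper names are mine, while the 8-bins-per-channel histogram and Bhattacharyya distance follow the slide:

```python
import cv2

def color_hist(patch_bgr, mask=None, bins=8):
    """8-bins-per-channel colour histogram of an image patch, L1-normalised."""
    hist = cv2.calcHist([patch_bgr], [0, 1, 2], mask,
                        [bins, bins, bins], [0, 256, 0, 256, 0, 256])
    cv2.normalize(hist, hist, alpha=1.0, norm_type=cv2.NORM_L1)
    return hist

def association_cost(det_patch, tracklet_templates):
    """Sum of Bhattacharyya distances between the detection's histogram and the
    tracklet template histograms from the past two frames (lower is better)."""
    det_hist = color_hist(det_patch)
    return sum(cv2.compareHist(det_hist, color_hist(t), cv2.HISTCMP_BHATTACHARYYA)
               for t in tracklet_templates[-2:])
```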

Tracking algorithm for D-Frames

- Inputs: Frame_t, trackletList, unsuppMeasList.
- Run the object detector to compute detList.
- Predict the next state of all the tracklets using the Kalman predictor step.
- Try to associate each detection in detList with one of the tracklets in trackletList, such that the sum of the Bhattacharyya distances between the detection's color histogram and the tracklet templates in the past two frames is minimized.
- Try to associate the remaining detections with candidate tracklets in the unsuppMeasList.

Tracking algorithm for D-Frames

- Correct the state of all the tracklets with the Kalman corrector, using the measurement from the associated detection.
- If a tracklet in trackletList has not been associated with a measurement for the past Tdeath consecutive frames, delete it from the list.
- If a candidate tracklet in unsuppMeasList has been associated with a measurement for the past Tbirth consecutive frames, create a new tracklet in trackletList.
- Return the updated trackletList (a minimal Kalman predict/correct sketch follows below).
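A minimal constant-velocity Kalman filter over the bounding-box state, in the spirit of the predict/correct steps above; cv2.KalmanFilter is used only for illustration, and the noise covariances are placeholder values, not the paper's:

```python
import numpy as np
import cv2

def make_bbox_kalman():
    """State: [x, y, w, h, vx, vy]; measurement: [x, y, w, h]."""
    kf = cv2.KalmanFilter(6, 4)
    kf.transitionMatrix = np.eye(6, dtype=np.float32)
    kf.transitionMatrix[0, 4] = 1.0   # x <- x + vx
    kf.transitionMatrix[1, 5] = 1.0   # y <- y + vy
    kf.measurementMatrix = np.eye(4, 6, dtype=np.float32)
    kf.processNoiseCov = np.eye(6, dtype=np.float32) * 1e-2      # placeholder
    kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-1  # placeholder
    return kf

# Per D-Frame: call kf.predict() for every tracklet, then, for tracklets that
# received a detection, kf.correct(np.float32([[x], [y], [w], [h]])).
```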

Tracking objects in T-Frames

- Low-level features are used to update the locations of objects in each frame:
  - Foreground segmentation mask.
  - Sparse optical flow vectors, computed at corner points.
- Noisy flow vectors are discarded using a consistency check on the forward and backward flow vectors and NCC scores (see the sketch below).
- Data association is done using color histogram matching.
- The Tracking-by-Detection formulation cannot be directly applied in this case.
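A sketch of the sparse-flow step with the forward-backward consistency check (Shi-Tomasi corners plus pyramidal LK via OpenCV); the NCC check mentioned above is omitted here and the thresholds are illustrative:

```python
import cv2
import numpy as np

def consistent_flow(prev_gray, curr_gray, max_fb_error=1.0, max_corners=200):
    """Sparse LK flow at Shi-Tomasi corners; a vector is kept only if tracking
    it back from the current frame lands near its starting point."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(curr_gray, prev_gray, fwd, None)
    fb_err = np.linalg.norm(pts - bwd, axis=2).ravel()
    keep = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < max_fb_error)
    return pts[keep].reshape(-1, 2), fwd[keep].reshape(-1, 2)  # start / end points
```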

Tracking objects in T-Frames

- A single object can potentially result in multiple foreground blobs.
- Multiple objects can cluster together to form a single foreground blob.
- Estimating the locations of the object bounding boxes from foreground blob bounding boxes involves solving ML estimation of skew-normal distributed parameters.

Tracking algorithm for T-Frames

- Inputs: Frame_t, trackletList, unsuppMeasList.
- Run foreground segmentation to compute fgBlobList.
- Compute Shi-Tomasi corner points on each tracklet's template image and on the foreground regions to get the lists trackletKPs and foregroundKPs.
- Compute the LK optical flow vectors from the previous frame's trackletKPs to the current frame, and from the current frame's foregroundKPs to the previous frame and back.
- Discard noisy flow vectors by performing a forward-backward consistency check to populate sparseOFVList.
- For each tracklet in trackletList:
  - Cluster the foreground blobs that might be associated with that tracklet by constructing the set below, and append this set's encompassing bounding box to the updated fgBlobList (see the frac_OFV sketch below):

    trackletClique = {fgBlob ∈ fgBlobList | frac_OFV(fgBlob, tracklet) ≥ τ1}
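One way to read the set construction above (a sketch only; frac_OFV is taken to be the fraction of a tracklet's surviving flow-vector endpoints that land inside a blob's bounding box, and the .bbox / .endpoints / .ofvs attributes are illustrative):

```python
def frac_ofv(fg_blob, tracklet_ofvs):
    """Fraction of the tracklet's optical-flow endpoints inside the blob's box."""
    x, y, w, h = fg_blob.bbox
    inside = sum(1 for (px, py) in tracklet_ofvs.endpoints
                 if x <= px < x + w and y <= py < y + h)
    return inside / max(len(tracklet_ofvs.endpoints), 1)

def tracklet_clique(tracklet, fg_blob_list, tau1):
    """trackletClique = {fgBlob | frac_OFV(fgBlob, tracklet) >= tau1}."""
    return [b for b in fg_blob_list if frac_ofv(b, tracklet.ofvs) >= tau1]
```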

Tracking algorithm for T-Frames

- Predict the next state of all the tracklets using the Kalman predictor step.
- For each fgBlob in the updated fgBlobList:
  - Cluster the tracklets that might have resulted in that foreground blob by constructing the set:

    fgBlobClique = {tracklet ∈ trackletList | frac_OFV(fgBlob, tracklet) ≥ τ2}

  - Update the fgBlob associations of the tracklets.
- For each tracklet in trackletList:
  - Prune tracklet_OFVs to remove optical flow vectors that did not land on the associated fgBlobs.
  - Compute init_mv = mean(tracklet_OFVs).
  - Search a local 64x64 region (with step size 8) to refine the motion vector for the tracklet:

    final_mv = argmin_{mv ∈ [−32,32]²} dist(tracklet.hist, hist(tracklet.loc + mv))

Tracking algorithm for T-Frames

- For each tracklet that was associated with an fgBlob, run the Kalman corrector step using prev_loc + final_mv as the new location.
- If a tracklet in trackletList has not been associated with a measurement for the past Tdeath consecutive frames, delete it from the list.
- Return the updated trackletList (a sketch of the motion-vector refinement step follows below).
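A sketch of the motion-vector refinement (step size 8 over offsets in [−32, 32]², i.e. a 64x64 search window), reusing the color_hist helper from the earlier sketch; the search is assumed to be centred on init_mv, and tracklet.loc / tracklet.hist are illustrative attributes:

```python
import cv2
import numpy as np
from itertools import product

def refine_motion_vector(frame, tracklet, init_mv, step=8, radius=32):
    """Keep the offset whose candidate patch histogram is closest (Bhattacharyya)
    to the tracklet's template histogram."""
    x, y, w, h = tracklet.loc                     # predicted bounding box (ints)
    best_mv, best_dist = init_mv, np.inf
    for dx, dy in product(range(-radius, radius + 1, step), repeat=2):
        cx, cy = int(x + init_mv[0] + dx), int(y + init_mv[1] + dy)
        if cx < 0 or cy < 0:
            continue                              # candidate starts outside the frame
        patch = frame[cy:cy + h, cx:cx + w]
        if patch.shape[:2] != (h, w):
            continue                              # candidate fell outside the frame
        d = cv2.compareHist(tracklet.hist, color_hist(patch),
                            cv2.HISTCMP_BHATTACHARYYA)
        if d < best_dist:
            best_dist, best_mv = d, (init_mv[0] + dx, init_mv[1] + dy)
    return best_mv
```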

Modulating the QP using the interest category

- Rate-Distortion Optimization is performed to compute the mode and QP that minimize the cost for each Macro Block (MB):

  J(QP) = D(QP) + λ·R(QP)

- D(QP): measure of the distortion caused by quantization.
- R(QP): bitrate cost at the given quantization.
- We modulate the distortion with an interest parameter α_MB:

  J(QP) = α_MB·D(QP) + λ·R(QP)

- The parameter α_MB can take one of four user-selected values, α_ROI_MB ≥ α_UnExp_MB ≥ α_RORI_MB ≥ α_RONI_MB, which are set depending on the relative interest of the regions for the end application.
- Macro Blocks categorized as RONI will be encoded only in I-Frames and marked as Skip MBs in P-Frames (a toy cost-selection sketch follows below).
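A toy illustration of the modulated cost; the distortion and rate estimators are placeholders, since in the actual encoder D(QP) and R(QP) come from the H.264 RDO loop:

```python
def pick_qp(alpha_mb, lam, candidate_qps, distortion_fn, rate_fn):
    """Choose the QP minimising J(QP) = alpha_MB * D(QP) + lambda * R(QP)."""
    return min(candidate_qps,
               key=lambda qp: alpha_mb * distortion_fn(qp) + lam * rate_fn(qp))

# Relative interest weights used in the experiments (user-selected):
ALPHA = {"ROI": 1000.0, "UnExp": 500.0, "RORI": 50.0, "RONI": 1.0}
```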

Experimental Results - RD Plots

- RD plots of 1280x720 resolution, 300-frame sequences² encoded at 10 fps.
- Relative interest parameter settings used: α_ROI_MB = 1000, α_RORI_MB = 50, α_UnExp_MB = 500, α_RONI_MB = 1.

² The yuv files of these sequences and the annotations can be downloaded from: https://goo.gl/JzuYks

PSNR in various regions

- PSNR in the computed ROI, RORI, and RONI regions for each frame of the entranceRoad sequence encoded at a 2 Mbps bitrate at 10 fps.

Performance when compressed D-Frames are sent to the object detector

- PSNR of the Region of Interest when the D-Frames were compressed with a JPEG encoder at various compression ratios before running the object detector, with a 3-frame interval between object detector runs.

Sample Video with ROI encoding

Zoomed Videos with and without ROI encoding

Comparison of various system configurations

System configuration | Edge camera compute per frame | Camera-to-aggregator datarate | Aggregator compute per frame | Backhaul datarate per camera | Total compute per camera per frame
Without ROI | 52.3 ms (1x) | 7.5 Mbps (1x) | N.A. | 7.5 Mbps (1x) | 52.3 ms (1x)
Encode, detection and tracking on edge camera | 379.5 ms (7.256x) | 2.5 Mbps (0.33x) | N.A. | 2.5 Mbps (0.33x) | 379.5 ms (7.256x)
Encode on edge camera; detection and tracking on aggregator node | 61.15 ms (1.169x) | 8.53 Mbps (1.137x) | 327.2 ms (6.25x) / camera | 2.5 Mbps (0.33x) | 379.5 ms (7.419x)
Encode and tracking on edge camera; detection on aggregator (N=3) | 100.383 ms (1.919x) | 3.422 Mbps (0.456x) | 303.2 ms (5.797x) / 3 cameras | 2.8 Mbps (0.364x) | 201.45 ms (3.85x)
Encode and tracking on edge camera; detection on aggregator (N=5) | 102.79 ms (1.965x) | 5.43 Mbps (0.72x) | 303.2 ms (5.797x) / 5 cameras | 4.6 Mbps (0.613x) | 163.43 ms (3.12x)

- Comparison of computational costs and achieved bitrates for encoding the entranceRoad sequence at 32 dB ROI PSNR in various simulated system configurations.

Conclusions and Future Directions

- Conclusions:
  - We have demonstrated that significant (up to 3x) bitrate savings are achievable without any significant reduction in PSNR computed over the Regions of Interest, compared to encoding without ROI information.
  - We have demonstrated a distributed object detector-tracker framework that makes this practically amenable.
  - Our experiments indicate scope for Power-Rate-Distortion optimization in these smart camera networks.
- Future Directions:
  - Improving the computational efficiency of object detection frameworks.
  - Improving the accuracy of the T-Frame tracker.
  - Analysis of Power-Rate-Distortion optimization of ROI-aided encoders in smart camera networks.

Thank You!

Backup Slides

Execution times

System component | Average execution time per frame
Object Detector | 300 ms (on Titan Black GPU)
D-Frame Tracker | 27.2 ms (on Core i5 CPU, 1 thread)
T-Frame Tracker | 54.1 ms (on Core i5 CPU, 1 thread)
H.264 Encoding | 52.3 ms (on Core i5 CPU, 1 thread)

- Average per-frame execution times of the proposed system components.

Tracking objects in D-Frames

- Object locations are measured by the detector.
- Tracking-by-Detection formulation used in the MOT literature [KLCR15], [BAPXH16].
- The observation likelihood of the detections is:

  p(y_d^j | X, A) = u(y_d^j)^{A_{j0}} · ∏_{i=1}^{N_T(t)} N(y_d^j | X_i(t), Σ_d)^{A_{ji}}

  p(h_d^j | X, A) = u(h_d^j)^{A_{j0}} · ∏_{i=1}^{N_T(t)} ( λ e^{−λ d_B(h_d^j(t), h_T^i(t−1))} )^{A_{ji}}

- y_d^j: 2-D image-plane position and dimensions of the j-th object detection.
- h_d^j: D-dimensional feature descriptor of the detected object for the j-th object detection.
- A: data association matrix.
- We use a Kalman-filter based state update and the Bhattacharyya distance of color histograms to solve for the data association A.

Tracking algorithm for D-Frames

Data: input frame frame_t, list of tracklets trackletList, frame number fNum, unsupported measurements list unsuppMeasList.
Result: list of updated tracklets updatedTrackletList.

detList = ObjectDetector(frame_t)
predict_all_tracklet_states_kalman(trackletList)
for det_i in detList do
    sim1 = computeSimilarity(det_i, allTrackletsStates(fNum − 1))
    sim2 = computeSimilarity(det_i, allTrackletsStates(fNum − 2))
    i_assoc = resolveAssociations(sim1, sim2)
    if i_assoc then
        tracklet.associate(det_i)
    else
        unsuppMeasList.append(unsuppObj(det_i))
    end
end
for tracklet_j in trackletList do
    correct_prediction_kalman(tracklet_j)
    if tracklet_j.lastUpdate ≤ (fNum − Tdeath) then
        delete(tracklet_j)
    end
end
for unsuppObj in unsuppMeasList do
    if unsuppObj.numUpdates ≥ Tbirth then
        createTracklet(unsuppObj)
    end
end
trackletList.Update()

Tracking objects in T-Frames

- Estimating the locations of the object bounding boxes from foreground blob bounding boxes involves solving ML estimation of skew-normal distributed parameters:

  p(y_b^j | X, A, C) = u(y_b^j)^{A_{j0}} · f_y(y_b^j | X, A, C)^{(1 − A_{j0})}

  p(h_b^j | X, A, C) = u(h_d^j)^{A_{j0}} · f_h(h_b^j | X, A, C)^{(1 − A_{j0})}

- Where:

  f_y(y_b^j | X, A, C) = ∑_{i=1}^{N_T} { N(y_b | X^i, σ_X^i) · A_{ji} · ∏_{n=1, n≠j}^{N_T} ( 0.5 − 2·erf( (y_b^X − X_n) / (√2 σ_X^n) ) )^{A_{jn}} }

- Difficulty in disambiguating individual object bounding-box locations from multiple corresponding foreground blob bounding boxes.

Tracking algorithm for T-Frames

Data: input frame frame_t, list of tracklets trackletList, frame number fNum, unsupported measurements list unsuppMeasList.
Result: list of updated tracklets updatedTrackletList.

fgBlobList, currFrame_m = run_FGBG_segm(frame_t)
trackletKPs = compute_ST_corners(prevFrame_m)
predict_all_tracklet_states(trackletList)
sparseOFVList = getSparseOFV(prevFrame_m, currFrame_m, trackletKPs)
for tracklet in trackletList do
    trackletClique = {fgBlob ∈ fgBlobList | frac_OFV(fgBlob, tracklet) ≥ τ1}
    clustered_fg_blobs.append(trackletClique)
end
update_fgBlobList(clustered_fg_blobs)
sparseOFVList = getSparseOFV(prevFrame_m, currFrame_m, trackletKPs)
for fgBlob in fgBlobList do
    fgBlobClique = {tracklet ∈ trackletList | frac_OFV(fgBlob, tracklet) ≥ τ2}
    C_j = update_blob_clique(fgBlobClique, predicted_tracklets)
end
for tracklet in trackletList do
    init_mv = mean(tracklet_OFVs)
    final_mv = argmin_mv dist(C_j · tracklet.hist, hist(tracklet.loc + mv))
    correct_prediction_kalman(tracklet.loc + final_mv)
end
for tracklet in trackletList do
    if tracklet.lastUpdate < (fNum − Tdeath) then
        delete(tracklet)
    end
end

References

[BAPXH16] Sileye Ba, Xavier Alameda-Pineda, Alessio Xompero, and Radu Horaud, An on-line variational Bayesian model for multi-person tracking from cluttered scenes, Computer Vision and Image Understanding 153 (2016), 64–76.

[BHH11] Sebastian Brutzer, Benjamin Hoferlin, and Gunther Heidemann, Evaluation of background subtraction techniques for video surveillance, Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1937–1944.

[CG13] Carlos Cuevas and Narciso Garcia, Efficient moving object detection for lightweight applications on smart cameras, IEEE Transactions on Circuits and Systems for Video Technology 23 (2013), no. 1, 1–14.

[CPC16] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello, An analysis of deep neural network models for practical applications, arXiv preprint arXiv:1605.07678 (2016).

[Gir15] Ross Girshick, Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.

[HC13] Shih-Chia Huang and Bo-Hao Chen, Highly accurate moving object detection in variable bit rate video-based traffic monitoring systems, IEEE Transactions on Neural Networks and Learning Systems 24 (2013), no. 12, 1920–1931.

[HZL+15] Hong Han, Jianfei Zhu, Shengcai Liao, Zhen Lei, and Stan Z Li, Moving object detection revisited: Speed and robustness, IEEE Transactions on Circuits and Systems for Video Technology 25 (2015), no. 6, 910–921.

[KLCR15] Chanho Kim, Fuxin Li, Arridhana Ciptadi, and James M Rehg, Multiple hypothesis tracking revisited, Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4696–4704.

[RDGF16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, You only look once: Unified, real-time object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

[WS09] Ching-Yu Wu and Po-Chyi Su, A region of interest rate-control scheme for encoding traffic surveillance videos, International Conference on Intelligent Information Hiding and Multimedia Signal Processing (2009).

