+ All Categories
Home > Documents > Server-Driven Video Streaming for Deep Learning Inferencejunchenj/docs/DDS_ArXiv.pdf · YouTube,...

Server-Driven Video Streaming for Deep Learning Inferencejunchenj/docs/DDS_ArXiv.pdf · YouTube,...

Date post: 20-May-2020
Category:
Upload: others
View: 18 times
Download: 1 times
Share this document with a friend
15
Server-Driven Video Streaming for Deep Learning Inference ABSTRACT Video streaming is crucial for AI applications that gather mas- sive videos from sources to remote servers for inference by deep neural nets (DNNs). Unlike Internet video that maxi- mize quality for human perception, this new types of video streaming permits aggressive compression or pruning of pix- els/frames that are not relevant to achieving high inference accuracy for AI applications. However, much of this potential is left unrealized because video streaming and compression are driven by the video source (camera) in today’s protocols where the compute is rather limited. We advocate that the video streaming protocol should be driven by real-time feedback from the server-side DNN in- stead. Our insight is two-fold: (1) server-side DNN has more context about the pixels/frames that maximize its inference accuracy; and (2) the DNN’s output contains rich informa- tion that is useful to guide video streaming/compression. We present DDS (DNN-Driven Streaming), a concrete design of this approach. DDS continuously sends a low-quality video stream to the server; the server runs the DNN to quickly determine the areas to zoom in (in similar spirit of region proposals but with key differences) and re-send them with a higher quality to increase the inference accuracy. We find that DDS maintains the same accuracy while reducing bandwidth usage by upto 59% or improves accuracy by upto 9% with no additional bandwidth usage compared to several recent baselines on multiple video genres and three vision tasks. 1 INTRODUCTION Internet video must balance between maximizing application- level quality and adapting to limited network resources. This perennial challenge has sparked decades of research and yielded various models of user-perceived quality of experi- ence (QoE) and a plethora of QoE-optimizing streaming pro- tocols. In the meantime, the proliferation of deep learning and video sensors has ushered in many distributed intelligent appli- cations (e.g., urban traffic analytics and safety anomaly detec- tion [4, 18, 24]). They also require streaming videos from cam- eras through bandwidth-constrained networks [20] to remote servers for deep neural nets (DNN)-based inference. We refer to it as machine-centric video streaming. Rather than maxi- mizing human-perceived QoE, machine-centric video stream- ing maximizes for DNN inference accuracy. This has inspired recent efforts to compress or prune frames and pixels that may not affect the DNN output (e.g., [26–28, 32, 45, 71, 73, 76]). A key design question in any video streaming system is where to place the functionality of deciding which actions can optimize application quality? Surprisingly, despite a wide variety of designs, all video streaming systems (both machine- centric and user-centric) take an essentially source-driven approach—it is the content source that decides how the video is best decoded/streamed. In traditional Internet videos (e.g., YouTube, Netflix), the video server determines the best encod- ing for a given bandwidth constraint, without explicit feed- back from the viewer. 1 Similarly, current machine-centric video streaming relies largely on the camera (source) to deter- mine which frames and pixels to stream. While the source-driven approach serves user-viewed videos well, we argue that it is necessarily suboptimal for analytics- oriented applications. 
The source-driven approach hinges on two premises: (1) the application-level quality can be esti- mated by the video source, and (2) it is hard to measure user experience directly in real time. While the assumptions largely hold in Internet video streaming, they need to be revisited in machine-centric video streaming. First, it is inherently difficult for the source (camera) to es- timate the inference accuracy of the server-side DNN by itself. Inference accuracy depends heavily on the compute-intensive feature extractors (tens of NN layers) in the server-side DNN. The disparity between most cameras and GPU servers in their compute capability means that any camera-side heuristics are unlikely to match the complexity of the server-side DNNs, and thus, the performance of the source-driven protocols is inherently limited. For instance, some works use inter-frame pixel changes [26] or cheap object detectors [76] to identify and send only the frames/regions that contain new objects, but they may consume more bandwidth than necessary (e.g., back- ground changes causing pixel-level differences) and/or cause more false negatives (e.g., small objects could be missed). Second, while incorporating real-time feedback from hu- man users may be hard in traditional video streaming, DNN models can provide rich and instantaneous feedback! Run- ning an object-detection DNN on an image returns not only detected bounding boxes, but also additional feedback for free, like the confidence score of these detections, interme- diate features etc. Moreover, such feedback can be extracted on-demand by probing the DNN with extra images. Unfortu- nately, such abundant feedback information has not yet been systematically exploited by prior work. In this paper, we advocate for an alternative DNN-driven approach, in which video compression and streaming are 1 DASH [6] protocol and its variants are often regarded as client(receiver)- driven, but the client does not provide any instant user-perceived QoE feed- back to let server re-encode the video; instead it merely monitors and adapts to bandwidth fluctuation, which could also be implemented on the source.
Transcript

Server-Driven Video Streaming for Deep Learning Inference

ABSTRACTVideo streaming is crucial for AI applications that gather mas-sive videos from sources to remote servers for inference bydeep neural nets (DNNs). Unlike Internet video that maxi-mize quality for human perception, this new types of videostreaming permits aggressive compression or pruning of pix-els/frames that are not relevant to achieving high inferenceaccuracy for AI applications. However, much of this potentialis left unrealized because video streaming and compressionare driven by the video source (camera) in today’s protocolswhere the compute is rather limited.

We advocate that the video streaming protocol should bedriven by real-time feedback from the server-side DNN in-stead. Our insight is two-fold: (1) server-side DNN has morecontext about the pixels/frames that maximize its inferenceaccuracy; and (2) the DNN’s output contains rich informa-tion that is useful to guide video streaming/compression. Wepresent DDS (DNN-Driven Streaming), a concrete design ofthis approach. DDS continuously sends a low-quality videostream to the server; the server runs the DNN to quicklydetermine the areas to zoom in (in similar spirit of regionproposals but with key differences) and re-send them with ahigher quality to increase the inference accuracy. We find thatDDS maintains the same accuracy while reducing bandwidthusage by upto 59% or improves accuracy by upto 9% withno additional bandwidth usage compared to several recentbaselines on multiple video genres and three vision tasks.

1 INTRODUCTIONInternet video must balance between maximizing application-level quality and adapting to limited network resources. Thisperennial challenge has sparked decades of research andyielded various models of user-perceived quality of experi-ence (QoE) and a plethora of QoE-optimizing streaming pro-tocols. In the meantime, the proliferation of deep learning andvideo sensors has ushered in many distributed intelligent appli-cations (e.g., urban traffic analytics and safety anomaly detec-tion [4, 18, 24]). They also require streaming videos from cam-eras through bandwidth-constrained networks [20] to remoteservers for deep neural nets (DNN)-based inference. We referto it as machine-centric video streaming. Rather than maxi-mizing human-perceived QoE, machine-centric video stream-ing maximizes for DNN inference accuracy. This has inspiredrecent efforts to compress or prune frames and pixels that maynot affect the DNN output (e.g., [26–28, 32, 45, 71, 73, 76]).

A key design question in any video streaming system iswhere to place the functionality of deciding which actionscan optimize application quality? Surprisingly, despite a wide

variety of designs, all video streaming systems (both machine-centric and user-centric) take an essentially source-drivenapproach—it is the content source that decides how the videois best decoded/streamed. In traditional Internet videos (e.g.,YouTube, Netflix), the video server determines the best encod-ing for a given bandwidth constraint, without explicit feed-back from the viewer.1 Similarly, current machine-centricvideo streaming relies largely on the camera (source) to deter-mine which frames and pixels to stream.

While the source-driven approach serves user-viewed videoswell, we argue that it is necessarily suboptimal for analytics-oriented applications. The source-driven approach hinges ontwo premises: (1) the application-level quality can be esti-mated by the video source, and (2) it is hard to measure userexperience directly in real time. While the assumptions largelyhold in Internet video streaming, they need to be revisited inmachine-centric video streaming.

First, it is inherently difficult for the source (camera) to es-timate the inference accuracy of the server-side DNN by itself.Inference accuracy depends heavily on the compute-intensivefeature extractors (tens of NN layers) in the server-side DNN.The disparity between most cameras and GPU servers in theircompute capability means that any camera-side heuristics areunlikely to match the complexity of the server-side DNNs,and thus, the performance of the source-driven protocols isinherently limited. For instance, some works use inter-framepixel changes [26] or cheap object detectors [76] to identifyand send only the frames/regions that contain new objects, butthey may consume more bandwidth than necessary (e.g., back-ground changes causing pixel-level differences) and/or causemore false negatives (e.g., small objects could be missed).

Second, while incorporating real-time feedback from hu-man users may be hard in traditional video streaming, DNNmodels can provide rich and instantaneous feedback! Run-ning an object-detection DNN on an image returns not onlydetected bounding boxes, but also additional feedback forfree, like the confidence score of these detections, interme-diate features etc. Moreover, such feedback can be extractedon-demand by probing the DNN with extra images. Unfortu-nately, such abundant feedback information has not yet beensystematically exploited by prior work.

In this paper, we advocate for an alternative DNN-drivenapproach, in which video compression and streaming are

1DASH [6] protocol and its variants are often regarded as client(receiver)-driven, but the client does not provide any instant user-perceived QoE feed-back to let server re-encode the video; instead it merely monitors and adaptsto bandwidth fluctuation, which could also be implemented on the source.

Better

Figure 1: Performance of DDS and the baselines on a videodataset with object detection. Full results are in §5.

driven by how the server-side DNN reacts to real-time videocontent, and explore it in the context of a concrete designcalled DDS (DNN-Driven Streaming). The key idea of DDSis an iterative workflow that centers around the notion offeedback regions—a relatively small set of regions whosequality could greatly influence the inference accuracy. Whenthe camera receives a new video segment, (1) it first sendsthe video in low quality to the server for DNN inference; (2)the server then derives the feedback regions from informationspit out by the DNN; and (3) the camera then re-encodesthe segment with only feedback regions in higher resolution(while blacking out the remaining pixels) and sends it to theserver for more accurate DNN inference.

DDS derives feedback regions from DNN by combiningtwo sources of information: (a) objectness of each element(pixel or bounding box, depending on the application), whichindicates the likelihood of the element being relevant to theinference result, and (b) the low-quality inference results,which indicates the elements that can be clearly analyzed atlow quality (e.g., large objects). The insight is that althoughthe low-quality video may not suffice to produce high DNNinference accuracy, but it can produce surprisingly accuratefeedback regions, which is essentially a binary classifica-tion task (i.e.,whether increasing its quality will yield moreaccurate inference), rather than the actual inference tasks(e.g.,what objects are in which region).

There is parallel between DDS and the emerging recenttrend in deep learning community of using attention mecha-nisms in DNN architectures (e.g., [38, 69]). Attention mecha-nisms learn to attend to the important pixels that will likelyimprove the DNN accuracy, thus sharing the same high-levelinsight with DDS that not every pixel affects the DNN out-put equally. These works are complementary to DDS– theyimprove DNN accuracy with additional DNN layers to focuscomputation on the important regions, while DDS proposesa suite of techniques to increase bandwidth efficiency (bysending only a few regions in high quality) for the same DNNaccuracy.

We evaluate DDS and a range of recent proposals, includ-ing reusing existing video streaming [71, 73], camera-sidefiltering heuristics [26, 51, 76], on three vision tasks. Across

49 videos, we find DDS achieves same or higher accuracywhile cutting bandwidth consumption by upto 59% or usesthe same bandwidth consumption while increasing accuracyby 3-9% (Figure 1 shows an example.)Differences with prior region-based streaming: We notethat while EAAR [51] and Vigil [76] also share with us theidea of selecting “relevant regions” and sending only them inhigh resolution, there are several crucial differences. EAARassumes that the DNN has a built-in region proposal net anduses its output as relevant regions. This leads to two problems.First, it wastes bandwidth to encode regions that are of interestbut can be easily inferred at low quality (e.g., big objects).Second, it will not work for the many DNNs which do notuse region-proposal net (e.g., semantic segmentation models).Moreover, it uses RoI encoding which requires regions ofinterest be determined from prior frames, forcing frames to beencoded separately. On the other hand, Vigil relies on shallowDNNs, which use inherently simpler feature extractors, soit often leads to false negatives (small objects) even if it hasaccess to the video in high quality. In short, there is conceptualgap between selecting the feedback regions needed to bestreamed at high quality and off-the-shelf DNNs (shallowDNNs or region-proposal nets specifically designed to boostaccuracy for certain DNNs), and it is this very gap that DDSseeks to bridge.

2 MOTIVATIONWe start with the background of video streaming for dis-tributed video analytics, including its need, performance met-rics and design space. We then use empirical measurementsto elucidate the key limitations of prior solutions.

2.1 Video streaming for video analyticsWhy video streaming? Two trends contribute to the widespread of distributed video analytics and much recent effort(e.g., [71, 73]). On one hand, computer vision accuracy hasbeen greatly improved by deep learning at the cost of in-creased compute demand. On the other hand, low prices ofhigh-definition cameras have enabled cheaply scaling out theanalytics of ever more cameras [2, 5, 17].

While one can always add accelerators [1, 12] to cam-eras,2 an economical way of scaling out camera networks isto offload the compute-intensive inference (partially or com-pletely) to remote GPU servers. For instance, buying 180HD cameras and an NVIDIA Tesla T4 GPU (with a through-put of 5,700fps [13]) to process the feeds each at 30fps (to-tal 5,400fps) costs $25 × 180(cameras)[15]+$2000(GPU)[9]=$6.5K; whereas buying 180 NVIDIA Jetson TX2 cameras(each runs ResNet50 barely at 30fps) costs about $600[11]×180= $108K, 1-2 orders of magnitude more expensive. Therefore,

2Of course, some video feeds cannot be sent out due to privacy regulationand have to be processed locally, and they are beyond the scope of this work.

2

(a) Video streaming for human viewers

(b) Video streaming for computer-vision analytics

Source(Video server)

Human Viewer

Source(Camera)

Server(DNN)

Figure 2: Contrasting video streaming for human viewers andmachine-centric video streaming leads to unique bandwidth-saving opportunities.

camera network operators increasingly stream videos fromnetwork-connected cameras to servers for deep learning in-ference in traffic monitoring [24], surveillance in buildings,video analytics in retail stores [8], and inspection of ware-houses or remote industrial sites with drone cameras [34].Why saving bandwidth? We aim to reduce bandwidth us-age of video analytics for three reasons. First, while manyInternet videos are still destined to viewers, the video feedsprocessed by analytics engines have grown dramatically; byone analysis, the amount of data generated by new videosurveillance cameras installed in 2015 is equal to all Netflix’scurrent users steaming 1.2 hours of ultra-high HD contentsimultaneously [22]. Second, many cameras are deployed inthe wild with only cellular network coverage [20], so set-ting up network connections and sending videos upstreamto remote servers can be expensive (more so than streamingvideos from CDN servers to wifi-connected PCs). Finally,while purchasing GPU servers is not cheap, it is a one-timecost that is amortized over time (so cheaper in the long run);while the communication cost is not amortized thereby incur-ring substantial operational costs for many large-scale cameranetworks.Performance metrics: An ideal video streaming protocol forvideo analytics should balance among three metrics: accuracy,bandwidth usage, and freshness. Most real-world scenarioseither detect/segment objects of interest, recognize faces orclassify if the images contain a person or an object-of-interest.Here, we define them in multi-class object detection andsemantics segmentation.• Accuracy: Accuracy is measured by the standard metrics:

F1 score for object detection (the harmonic mean of preci-sion and recall for the detected objects’ location and classlabels) and IoU for semantic segmentation. We use theoutputs of running the server-side DNN over the video inits highest quality as the reference and treat them as the

0.0 0.2 0.4 0.6 0.8 1.0Fraction of frame with objects

0.0

0.2

0.4

0.6

0.8

1.0

CDF

TrafficDashcamsDrone

Figure 3: Bandwidth-saving opportunities: Over 50-80% offrames, the objects (cars or pedestrians) only account for lessthan 20% of the frame area, so most pixels do not contributeto the accuracy of video analytics.

ground truth when calculating accuracy. This is consistentwith other work (e.g., [42, 73, 74]).

• Bandwidth usage: We measure the bandwidth usage by sizeof the sent video divided by its duration.

• Response delay (freshness): Finally, we define freshness bythe average processing delay per object, i.e., the expectedtime between when an object first appears in the video feedand when its region is reported and correctly classified,which includes both the time to transmit to the server andrun inference. We report the average delay.

2.2 Potential room for improvementTraditional video streaming maximizes human quality of expe-rience (QoE)—a high video resolution and smooth playback(minimum stalls, frame drops or quality switches) [31, 43, 47].For machine-centric video streaming, however, it is crucialthat server-received video has sufficient pixels in regions thatheavily affect the DNN’s ability to identify/classify objects;however, the received video does not have to be high defini-tion or smooth at all.

This contrast has profound implications that machine-centricstreaming could achieve high “quality” (i.e., accuracy) withmuch less bandwidth. Each frame can be spatially encodedwith non-uniform quality levels. In object detection, for in-stance, one may give low quality to or even blackout the areasother than the objects of interest (Figure 2(b))3. While rarelyused in traditional video streaming, this scheme could signif-icantly reduce bandwidth consumption and response delay,especially because objects of interest usually only occupy afraction of the video size. Figure 3 shows that across threedifferent scenarios (the datasets will be described in §5.1), in50-80% of frames, the objects of interest (cars or pedestrians)only occupy less than 20% of the spatial area. We also ob-serve similar uneven distributions of important pixels in face

3This may look like region-of-interest (ROI) encoding [56], but even ROIencoding does not completely remove the background either, and the ROIsare defined with respect to human perception.

3

Better

Figure 4: Current solutions exhibit a rigid bandwidth-accuracy tradeoff: any gain in accuracy comes at a costin bandwidth. The results are based on the traffic videos inour dataset (§5.1).

recognition and semantic segmentation. The question then ishow to fully explore the potential room for improvement?Limitations of existing solutions: Existing solutions forvideo streaming are essentially source-driven—the decisionsof how the video should be compressed or pruned are madeat the source with no real-time feedback from the server-sideDNN that analyzes the video. However, they show unfavor-able trade-off between bandwidth and accuracy (illustrated inFigure 4): any gain of accuracy comes at the cost of consid-erably more bandwidth usage. The fundamental problem isthat any heuristic that fits camera’s limited compute capacityis unlikely to precisely imitate a much more complex DNN,with no real-time feedback from it.

This problem manifests itself differently in two types ofsource-driven solutions. The first type is uniform-qualitystreaming that modifies the existing video streaming pro-tocols and adapts the quality level to maximize inferenceaccuracy under a bandwidth constraint. AWStream [73] usesDASH/H.264 and periodically re-profiles the inference accu-racy under each quality level. CloudSeg [71] sends a video atlow quality but upsizes the video by super resolution by theserver. They have two limitations: First, they do not leveragethe uneven distribution of important pixels; instead, the videosare encoded by traditional codecs with a spatially uniformquality. Second, while AWStream does get feedback from theserver DNN, it is not based on real-time video content, so itcannot suggest actions such as increasing quality on a specificregion in the current frame.

The second type is camera-side heuristics that identifies im-portant pixels/regions/frames that contain information neededby the server-side analytics engine (e.g., queried objects) byrunning various local heuristics (e.g., checking significantinter-frame difference [26], a cheap vision model [27, 28,45, 76]), or some DNN layers [32, 67]. These solutions es-sentially trade camera-side compute power to run inferencefor better accuracy/bandwidth trade-off [32]. For instance,Glimpse [26] and NoScope [45] rely on the inter-frame dif-ferences to signal which frames contain objects, but any falsepositives (e.g., pixel changes on the background) will cost

Source(Camera)

Server(DNN)

Passive video stream driven by camera-side heuristics

Source(Camera)

Server(DNN)

Stream A: Passive low-quality

Stream B: Feedback-driven

(a) Traditional video streaming

(b) Real-time DNN-driven streaming

Figure 5: Contrasting the new real-time DNN-driven stream-ing (iterative) with traditional video streaming in video ana-lytics.

unnecessary bandwidth usage and any false negatives willpreclude the server from detecting important information.

3 DESIGN OF DDS: A DNN-DRIVENSTREAMING PROTOCOL

We explore an alternative approach, called DDS (DNN-drivenstreaming), to analytics-oriented video streaming. The keydeparture from prior source-driven solutions is that DDS isdriven by the feedback (ambiguous regions) carefully elicitedfrom the output of the server-side DNN, which in theoryallows DDS to send only part of the video content needed bythe analytics engine.

This section presents the design of DDS. We start with thegeneral workflow that applies to different vision applications,with a focus on and how it generates reliable feedback. Thenwe use three concrete vision applications to demonstrate theworking of DDS. Finally, we extend the basic workflow ofDDS to reduce bandwidth usage, reduce latency and adapt todynamic bandwidth availability.3.1 General workflow with ambiguity as

feedbackWe first introduce some terminologies for clarity. A videostream is temporally chopped to segments4 (each containing𝑛 consecutive frames), and DDS operates on a segment-by-segment basis. Along the spatial dimension, we refer to thebasic unit of a vision application output as an element: e.g., anelement can be a rectangle-shape bounding box in the contextof object detection or face recognition; an element can also bea pixel in the context of semantic segmentation. In general, avision application will assign a label per element. We will usethe notion of element when presenting the general workflowof DDS. The next section will explain how DDS handlesdifferent vision applications differently.

Unlike prior source-driven streaming which is “single-shot”(i.e., cameras using heuristics to determine how the video4The concept of segments is commonly used in the Internet video proto-cols [6] and prior analytics-oriented video streaming systems (e.g., [73]).

4

Camera

Server

TimelineDNN Inference

Re-encoding ambiguous regions

DNN Inference

Downsizing

ambiguous regions

Figure 6: Illustration of the DDS workflow on a single frame.The two rounds of server-side inference use the same DNN.

should be streamed out), DDS has an iterative workflow builton two logical streams (illustrated in Figure 5). In StreamA, the camera continuously sends the video in low qualityto the server. In Stream B, The server DNN frequently (e.g.,every handful of frames) generates some ambiguous regions(explained soon) based on its output on the real-time low-quality video and sends them as feedback back the the camera.Upon receiving the feedback from the server, the camera thenre-encodes the ambiguous regions in a higher quality level andsends it to the server DNN for a “second look”. Figure 6 showsa simplified example of DDS in action. On the low qualityframe, the server DNN proposes three ambiguous regions.The camera then re-encodes these regions in higher quality,and sends them to the server, which then runs DNN againand returns more accurate result.5 DDS can adaptively choosethe second-iteration quality level to cope with bandwidthvariance.Defining and identifying ambiguous regions: The key ofDDS’s success, therefore, is the design of the ambiguousregions—a collection of ambiguous elements that are likelyto alter the inference result if encoded in higher quality (whichusually leads to more accurate inference).6 In other word, weare looking for those regions that are close to be correctly de-tected/classified but not yet, and these regions have the highestutility when being streamed at a high resolution. To this end,we identify elements that are of high objectness score (like-lihood of belonging to any class) but are of low confidencescore (the confidence in inference result). Figure 7 illustratesthe process of finding ambiguous elements by extracting high-confidence elements from high-objectness elements. It helpsto understand ambiguous elements with examples: in objectdetection, large objects tend not to be ambiguous since theyoften have high objectness scores as well as high confidence

5In theory, the iterative process can have more than two iterations, though theresponse delay will grow with more iterations and we found the performancebenefit diminishes after two iterations.6Although ambiguous regions are made mostly of ambiguous elements, butambiguous regions also need to be shaped in a way that minimizes encodingoverhead, which will be discussed in §3.3

Elements with high confidence scores

Elements with high objectness scores

Ambiguous elementsDNN inference output

Figure 7: The process of finding ambiguous elements by ex-tracting high-confidence elements from high-objectness ele-ments

scores; whereas some small objects that look similar to mul-tiple classes may have a high objectness but low confidencewhich makes them “ambiguous”. The per-element objectnessscores and confidence scores can be approximately extractedfrom DNN output (see §3.2 for more details).

It is useful to clarify the rationale behind ambiguous-region-based streaming.Why is low-quality video sufficient for feedback? The ra-tionale is that in low quality, ambiguous regions falls intothe gap between those regions that contains any objects andthose regions that contains confidently detected objects. Sincethe former is a binary classification problem and the latteris a multi-class classification problem, the former is funda-mentally easier than the latter. Consequently the gap betweenthese two type of regions do exist.Why not region proposal? Compared to region-proposal-network-based solutions like EAAR [51], there are four cru-cial difference. First, it requires region proposal networksto run. This makes it not generalizable to applications likesemantic segmentation. Second, it utilize single-image-basedRoI encoding. Without leveraging inter-frame redundancy,it will consume too much bandwidth. Third, it shift the re-gion proposal result of previous frame through cheap motion-vector-based tracking. Thus, it is not able to discover newly-appeared object in current frame, and will not work wellunder low frame rate. Fourth, it spend too much bandwidthon those objects that are already detected under low quality.The idea of server-driven region-based streaming also seemssimilar to Vigil [76], which also identifies and sends onlyregions likely with objects to the server, but as we will showin §5.2, even if Vigil uses a model (MobileNet-SSD) that runsonly 3× faster than the server-side DNN [16], it still missesabout 40% more objects of interest than DDS and sends over50% more data. The reasons are two-fold. First, Vigil’s localmodel uses inherently simpler feature extractor, and thus tendto miss some (especially small) objects even on high qualityimages. Second, the local model does not change with theserver-side vision task, so it may propose more regions than

5

necessary (if the server only looks for a specific type of ob-jects). Finally, unlike DDS, Vigil and Glimpse require extracomputing power on the camera side.

3.2 Extracting ambiguous regionsHigh-confidence inference output 𝐷dnn is a list of DNN-returned elements (labeled bounding boxes in object detectionand face recognition, or connected components in semanticsegmentation), each associated with its coordinates and confi-dence score. We use the elements that have confidence scoreover a threshold, which suggests they are ready to be reported.In each frame, one can “spatially sample” one or more re-gions (a region is a bounding box) and together they forma region set. We use 𝑆 to denote a segment, 𝑅 a region set,and𝑉𝑆,𝑅,𝑄 the video generated from a region set 𝑅 of segment𝑆 encoded in quality level 𝑄 , which specifies the resolutionand qp (quantization parameters). Note that a region coveringthe full frame can form a region set, labeled 𝑅full, so 𝑉𝑆,𝑅full,𝑄

denotes the entire segment encoded in quality level 𝑄 (i.e.,without any spatial sampling).General logic of feedback region generation: We extractfeedback regions 𝑅feedback based on high-confidence infer-ence result 𝐷dnn and proposed regions 𝑅dnn (instead of merelybased on proposed regions 𝑅dnn of previous video chunk, suchas EAAR [51]). As described in Fig. 8, DDS picks from 𝑅dnnthe feedback regions that meet two criteria: (a) each feedbackregion (in terms of the fraction of the whole frame) is lessthan a threshold 𝛾 (by default, 4%7); (b) each feedback regiondoes not overlap significantly with any high-confident infer-ence output 𝐷dnn,8; and (c) the objectness score of a region isover a threshold \ ; The rationale behind them is as follow. (a)filters out the regions that the DNN is already confident basedon the low-quality video. (b) filters out large proposed re-gions, because if it does contain an object, it should have beendetected by the DNN with high confidence already. Finally,(c) uses a threshold of objectness score (by default 0.5) as acontrol knob to adjust the bandwidth demand of the secondround (Stream B). Fig. 8 demonstrates an example.Applying the logic in various vision tasks:• Bounding-box-based inference: For bounding-box-based

inference (e.g., object detection, face recognition), our ob-servation is that: most of the DNNs (e.g., FastRCNN, Faster-RCNN, MTCNN, YoLo, or MaskRCNN [35]) explicitlypropose regions and evaluate their objectness scores. (Notethat YoLo assign per-class classification scores to each

7Although 4% of the whole frame size looks low, it actual is quite high(roughly the size of a truck in a distance of 70ft from the camera).8In object detection, a proposed region does not overlap with a detected re-gion if their IoU (intersection-over-union) is below a predetermined threshold(0.3, by default). In semantic segmentation, a proposed region (64x64 block)overlaps with a connected high-confidence region if they share commonpixels.

bounding-box-based application pixel-based application

overlapped with inference

result

inference result high-obj.-score regions

extract ambiguous regions

too large

final ambiguous regions

inference result high-obj.-score regions

blue: low-obj.-score regions

white: high-obj.-score regions

overlapped with inference result,

“blue” these regions

pick several “whitest” rectangle as final

ambiguous regions

extract ambiguous regions

Figure 8: Illustration on how DDS extract feedback regionson two types of applications.

proposed region. We can simply get the objectness scoresby summing up these per-class (except for backgroundclass) classification scores.9) We merge duplicated bound-ing boxes for processing efficiency. Fig. 8 demonstrates anexample.

• Pixel-based inference: For pixel-based inference (e.g., se-mantic segmentation), our observation is that: we can treateach pixel as an individual region. So we can obtain theobjectness score by applying the method similar to theYoLo case above and filter the regions according to algo-rithm mentioned above. After these two steps, we obtain acollection of pixels with corresponding objectness scores.For encoding efficiency, we clamp these pixels to several64×64-shaped bounding boxes by repeatedly proposing theregions with maximum sum objectness score and zero outthe score in that region. Fig. 8 demonstrates an example.

3.3 Optimizing bandwidth and latencySaving bandwidth by leveraging codec: A naive imple-mentation of Stream B would encode each region as separateimages where the requested regions have high quality andthe rest of the frames have low quality. But the total size ofthese images would be much greater than the original videosegment in even the higher quality level without cropping theregions! The reason is that video codecs (e.g., H.264/H.265),after decades of optimization, are very effective in exploitingspatial redundancy and temporal redundancy between videoframes to reduce the encoded video size. So instead, we lever-age this encoding effectiveness. In each frame, we set thepixels outside of the requested regions in the high qualityimage to black, i.e., RGB=(0,0,0), and encode these imagesinto a video file (mp4).Reducing delay via early reporting: The cost that DDSpays to get better performance is the worst-case responsedelay; the result of Stream B will wait for two “round trips”

9We acknowledge that the sum of confidence scores across all classes doesnot directly translate to the likelihood of a region being in any of the classes.That said, we have empirically found it a good indicator of objectness.

6

Stream A

low-quality inference result extract ambiguous regionsearly report:

report all results that do not overlap with ambiguous regions

Stream B

append results

errors inside errors inside

bounding-box-basedapplication

pixel-basedapplication

Figure 9: DDS reduces response delay through early reporting. Over 90% of the bounding-box-based inference results and over93% of the pixel-based inference results are early reported before stream B.

before it can be returned to the analysts. To address this prob-lem, we leverage the observation that a substantial fractionof the DNN output from the low-quality video (Stream A)already has high confidence and thus can be returned with-out waiting for Stream B. While this optimization does notchange the bandwidth consumption or worst-case responsedelay, it substantially reduces the delay of many inferenceresults. In object detection, we empirically found that over90% of all final detected objects could have been detected inStream A. These objects can be returned much faster than anyprior approach, because Stream A uses a quality level muchlower than what other work (e.g., [27, 28, 73]) would need toachieve the same accuracy. Fig. 9 demonstrates an example.Leveraging camera-side heuristics: DDS can also leveragecamera-side computing power. The key advantage of doing sois not only incorporating more computing power, but to lever-age that the cameras have access to the raw (full resolution)video. Here, we use tracking as a camera-side heuristics thatcan be incorporated in DDS. Tracking is usually less compute-intensive than tasks such as object detection, so some priorwork has proposed having the camera track objects (e.g., us-ing [36]) detected by the server-side logic (e.g., [26]). DDScan benefit from a similar idea by requesting fewer framesin the feedback regions and performing camera-side trackingbetween the requested frames. For instance, given a 30-framesegment, the server will decide regions in each of the 15frames, but instead of requesting all of them from the camera,it only requests the regions in the first frame and sends backthe final results of the first frame to the camera which thentracks them using local tracking logic.

3.4 Handling bandwidth variationLike other video streaming protocols, DDS must adapt itsbandwidth usage to handle bandwidth fluctuation. There arethree control knobs: the low quality level and the high qualitylevel, and the objectness score threshold \ . We empiricallyfind that these knobs affect the bandwidth/accuracy tradeoffin a similar way (i.e., on the same Pareto boundary; §5.4),

Figure 10: DDS’s adaptive feedback control system dynami-cally tunes the low and high quality configurations based onthe difference between the estimated available bandwidth forthe next segment and that used for the previous segment.

so DDS only tunes low quality level and high quality level,while fixing the objectness threshold at 0.5.

To tune the low and high quality regions we implement afeedback control system, illustrated in Figure 10. We base thiscontroller on prior work that proposed a virtual, adaptive con-trol system that can be customized for specific deployments[25, 57]. Instantiating the controller requires specifying threethings: a bandwidth constraint to be met, feedback for mon-itoring whether or not the bandwidth constraint is met, andthe tunable parameters that affect the bandwidth. For DDS,the bandwidth constraint is the estimated bandwidth for thenext segment (labelled (1) in the figure), the feedback is thetotal bandwidth (for both low and high quality) from the lastsegment (2), and the tunable parameters are the resolution andquantization parameters of both the low and high quality (3).The controller continually estimates the base behavior—inthis case, the last segment’s bandwidth if the default param-eter settings had been used. The controller then takes thisbase behavior as well as the difference between the desiredbandwidth for the next segment and the achieved bandwidthfor the previous segment and computes a scaling factor for thebase bandwidth. This scaling factor is passed to an optimizerwhich finds the low and high quality settings that deliver thescaled bandwidth while maximizing F1 score.

Our dynamic adaptation has several useful formal proper-ties based on its use of feedback control. First, the content

7

estimator handles non-linearities in the relationship betweenthe parameters and bandwidth. Intuitively, we can think of theparameter-bandwidth relationship as a curve and the contentestimator as drawing a tangent to that curve. The controllerthen uses that tangent as a linear approximation to the truebehavior [33]. The achieved bandwidth, then, will convergeto the desired bandwidth in time proportional to the logarithmof the error in this estimation [57]. Second, the optimizerfinds the highest quality given the bandwidth specified by thecontroller. This optimality is achieved by scheduling config-urations over multiple segments. As the system has a small,constant number of constraints, an optimal solution can befound in constant time [46].

4 IMPLEMENTATION AND INTERFACEWe implement DDS with about 2K lines of code, mostly inPython and the code is available and will be updated in [7].

DDS sits between the low-level functions (video codec andDNN inference) and the high-level applications (e.g., objectquery or face recognition). It provides “south-bound” APIsand “north-bound” APIs, both making minimum assumptionsabout the exact implementation of the low-level and high-level functions.

South-bound APIs interact with the video codec and DNN.Our implementation only uses the APIs already exposed bythe x264 MPEG video, such as x264_encoder_encode[21].From DNN, DDS implements two functions: (1) ambiguousregions (𝑅dnn), each with a specified location; and (2) detec-tion results (𝐷dnn) including the detected objects/identitieseach with a specified location and a detection confidencescore.

North-bound APIs implement the same analyst-facing (north-bound) APIs as the DNNs (DDS can simply forward any func-tion call to DNNs), so the high-level applications (e.g., [48,55]) do not need to change and DDS can be deployed trans-parently to analysts. The only assumption is that, becauseDDS runs the DNN twice on the same video segment, the twoDNN detection results must be merged into a single detectionresult; we did this in a similar way as how DNNs internallyto merge redundant results (e.g., [68]).

5 EVALUATIONThe key takeaways of our evaluation are:• On three vision tasks, DDS achieves same or higher ac-

curacy than baselines while using 18-58% less bandwidth(Figure 11) and 25-65% lower response delay (Figure 12).

• DDS sees even more improvements on certain video genreswhere objects are small (Figure 13) and on applicationswhere the specific target objects appear rarely (Figure 14).

• DDS’s gains remain substantial under various bandwidthbudgets (Figure 15) and bandwidth fluctuation (Figure 16).

Name Vision tasks Total length # videos # objs/IDs

Traffic obj detect / segment 2331s 7 24789Drone obj detect / segment 163s 13 41678

Dashcam obj detect / segment 5361s 9 24691Friends face recog 6000s 20 7945TBBT face recog 6000s 20 8370

Table 1: Summary of our datasets.

• DDS poses limited additional compute overhead on bothcamera and server (Figure 18).

5.1 MethodologyExperiment setup: We build an emulator of video streamingthat can measure the exact analytics accuracy and bandwidthdemand of DDS and our baselines. It consists of a client(camera) that encodes/decodes locally stored videos and afully functional server that runs any given DNN and a sepa-rate video encoder/decoder. Unless stated otherwise, we useFasterRCNN-ResNet101 [63] as the server-side DNN forobject detection InsightFace [10, 30] for face recognition,and FCN-ResNet101 [54]. DDS extracts objectness scores(from region proposals) of the object detection pipeline usingFasterRCNN[63], those of the face recognition pipeline usingMTCNN[75], and those of semantic segmentation using themethod described in §3.2. As we will see in §5.3, differentchoices of DNNs will not qualitatively change the takeaways.When needed, we vary video quality along QP parameters(from {26,28,30,36,38,40}) and resolution (from scale factorsof {0.8,0.7,0.5}), and DDS uses 36 (qp) as low quality and26 (qp) as high quality, both with resolution scale of 0.8. Inmost graphs, we assume a stable network connection, but in§5.4, we will vary the available bandwidth as well as increasebandwdith variance.Datasets: To evaluate DDS over various video genres, wecompile five video datasets each representing a real-worldscenario (summarized in Table 1 and their links can be foundin [3]). These videos are obtained from two public sources.First, we get videos from aiskyeye [19], a computer-visionbenchmark designed to test DNN accuracies on drone videos.Nonetheless, the key difference between DDS and baselinesnot the DNN but the client-server pipelines which can be af-fected by other factors such as fraction of frames with objectsof interest or size of the regions with objects. To cover a rangeof dynamism along these features, we also get videos fromYouTube as follows. We search keywords (e.g., “highway traf-fic video HD”) in private browsing mode (to avoid biases frompersonalization); among the top results, we remove the videosthat are irrelevant (e.g., news report that mentions traffic), andwe download the remaining videos in their entirety or the first10-minutes (if they exceed 10 minutes). The vision task is todetect (or segment) vehicles in traffic and dashcam videos,

8

DDS (ours) AWStream

EAARCloudSeg

Glimpse

Vigil

(a) Object detection (Traffic)

DDS (ours) AWStream

EAARCloudSeg

Glimpse

Vigil

(b) Object detection (Dashcam)

DDS (ours)AWStream

EAARCloudSeg

Glimpse

Vigil

(c) Object detection (Drone)

DDS (ours) AWStream

BetterVigil

Glimpse

(d) Face recognition (Friends)

DDS (ours)AWStream

BetterCloudseg

(e) Semantic segmentation (Traffic)

DDS (ours) AWStream

BetterCloudseg

(f) Sem. segmentation (Dashcam)

DDS (ours) AWStream

BetterCloudseg

(g) Semantic segmentation (Drone)

DDS (ours) AWStream

Better

Vigil

Glimpse

(h) Face recognition (TBBT)

Figure 11: The normalized bandwidth consumption v.s. inference accuracy of DDS and several baselines on various videogenres and applications. DDS achieves high accuracy with 55% bandwidth savings on object detection and 42% on semanticsegmentation, and 36% on face recognition. Ellipses show 1-𝜎 range of results.

to detect humans in drone videos, and recognize faces in sit-com videos. Because we use public sources to cover morereal-world videos, many videos do not have human-annotatedground truth. So for fairness, in all videos in our dataset, weuse the DNN output on the full-size original video as thereference result to calculate accuracy. For instance, in objectdetection, the accuracy is defined by the F1 score with respectto the server DNN output in highest resolution (original) withover 30% confidence score.Baselines: We use five baselines to represent two state-of-the-art techniques (see §2.2): camera-side heuristics (Glimpse[26], Vigil [76], EAAR [51]) and adaptive streaming (AW-Stream [73], CloudSeg [71]). Following minor modificationsare made to ensure the comparison is fair. First, all baselinesand DDS use the same server-side DNN. Second, althoughDDS needs no more camera-side compute power than en-coding, camera-side heuristics baselines are given enoughcompute resource to run more advanced tracking [36] andobject detection algorithm [64] than what Glimpse and Vigiloriginally used, so these baselines’ performance is strictlybetter than their original designs. Third, we allow EAAR toutilize the region proposal result of current frame (instead ofprevious frame, as mentioned in [51]) to help encode cur-rent frame. This will strictly enhance the inference accuracyof EAAR. Fourth, all DNNs used in baselines and DDS arepre-trained (i.e., not transfer-learned with samples from thetest dataset like in NoScope [45]); so our version of Cloud-Seg uses the pre-trained super-resolution model [23] from thewebsite [14]. This ensures the gains are due to the streaming

algorithm, not due to customization of DNN, and also helpsreproducibility.

Finally, although DDS could lower frame rate, but to ensureaccuracies of all baselines are calculated on the same set ofimages, we only tune resolution and QP in DDS, which arealso the most effective knobs [73].Performance metrics: We use the definition of accuracy andresponse delay from §2.1. To avoid impact of video contenton bandwidth consumption, we report bandwidth demandsof DDS and baselines after normalizing them against thebandwidth consumption of the original videos.

5.2 End-to-end improvementsWe start with DDS’s overall performance gains over the base-lines along bandwidth savings, accuracy, and response delay.Bandwidth savings: Figure 11 compares the bandwidth-accuracy tradeoffs of DDS with those of the baselines, whenthe total available bandwidth is set to match the size of theoriginal video (we will vary the available bandwidth in §5.4).Across three vision tasks, DDS achieves higher or comparableaccuracy than AWStream and CloudSeg but uses 55% lessbandwidth in object detection and 42% in semantic segmen-tation. Glimpse and Vigil do sometime use less bandwidthbut they have much lower accuracy. On the contrary, EAARhas higher accuracy but consumes too much bandwidth. Over-all, even if DDS is less accurate or uses more bandwidth, italways has an overwhelming gain on the other metric.Response delay: Figure 12 shows the response delay ofDDS and AWStream (the baseline that is the closest to DDSperformance-wise) with the same length of a segment (num-ber of consecutive frames encoded as a segment before sent

9

15 30 60 90 120# of frames per segment

0.00.51.01.52.02.53.03.5

Resp

onse

del

ay (s

) DDSAWStream

(a) Object detection

15 30 60 90 120# of frames per segment

0.00.51.01.52.02.53.03.54.0

Resp

onse

del

ay (s

) DDSAWStream

(b) Semantic segmentation

Figure 12: Response delay of DDS is consistently lower thanAWStream under various lengths of video segment.

to the server). The segment length limits the lower bound ofthe delay, and same segment length, we see that despite theneed for two iterations, DDS reduces the response delay by25-65% compared to AWStream. Moreover, the gap widensas the segment length increases. To put it into perspective,popular video sites use 4-second to 8-second segments [77](i.e., over 120 frames per segment), a range in which DDS’sgains are considerable.Intuition: DDS achieves significant bandwidth savings com-pared to AWStream because DDS is driven by DNN-generatedfeedback on the real-time video content while AWStream iscontent agnostic. Thus, DDS aggressively compresses videoby cropping out only regions of interest in the second iteration.The bandwidth savings compared to Glimpse or Vigil resultfrom the fact that camera-side heuristics lack crucial server-side information required to select the important frames. Forfairness, we do not customize the super-resolution model tothe specific video content; as a result, the super-resolutedframes actually leads to lower accuracy from the server DNN.DDS’s response delay is much lower than AWStream becauseit detects most of the objects in the first, low quality iteration.AWStream sends a single video stream and spends more timeto encode that stream with quality.

5.3 Sensitivity to application settingsImpact of video genres: Next, Figure 13 shows the distribu-tion of per-video bandwidth savings w.r.t AWStream (dividingAWStream’s bandwidth usage by DDS’s when DDS’s accu-racy is at least as high as AWStream) in three datasets. As ex-pected, there is substantial variability of performance amongvideos of the same type. This is because DDS’s gains dependon the fraction of pixels occupied by objects of interest, whichcan varies with videos.

That said, the impact of content on performance gainscan also be task-dependent. A somewhat surprising observa-tion is that, in object detection, dashcam videos show lessimprovement than other two datasets, but in semantic seg-mentation, dashcam videos show the most improvement! Infact, it highlights the different nature between the two vi-sion tasks. In object detection, in presence of a large object

Better

(a) Object detection

Better

(b) Semantic segmentation

Figure 13: Distributions of per-video bandwidth savings intwo datasets. The gains of DDS are video dependent.

Better

Figure 14: Segmentation on only motorcycles achieves 2-4×more bandwidth savings than segmentation on all classes.

that cannot be confidently classified, DDS will need to senda large region in its entirety to the server. However, in se-mantic segmentation, the ambiguity happens to individualpixels at the boundary of objects, which is less affected bylarge objects. Dashcam videos have more large objects thandrone/traffic camera videos, so contrast will appear only indashcam videos.Impact of inference tasks: So far we have evaluated DDSwhen the vision tasks deal with the common object classes(or all classes) in videos. An additional advantage of DDS isit can save more bandwidth by taking advantage of the server-side application when needs only a few specific object classes.To demonstrate this, we change the segmentation task fromdetecting all objects to detecting only motorcyles. Figure 14shows the DDS’s bandwidth savings (when achieving sameor higher accuracy than AWStream) on three traffic videosin which motorcyles appear for a small non-zero fraction offrames, and compare the savings with those when the taskis over all classes. We see that DDS’s gains are significantlyhigher when only motorcyles are the segmentation target, andthe additional gains are higher when the motorcyclists take asmaller fraction of pixels and frames (e.g., Video 2).Impact of DNN architecture: Last but not least, we testdifferent DNN architectures on a randomly selected traf-fic video (5-minute long), and find that DDS achieves sub-stantial bandwidth savings under different server-side DNNmodels: FasterRCNN-ResNet101 (44%) and FasterRCNN-ResNet50 [16] (54%) has the same architecture but differentfeature extractors, while Yolo [61] (51%) uses a differentarchitecture and feature extractor. This implies that we candesign the video streaming protocol agnostic to the model

10

0.00 0.25 0.50 0.75 1.00Normalized bandwidth consumption

0.78

0.80

0.82

Accu

racy

AWStreamDDS

(a) Object detection

0.4 0.6 0.8 1.0Normalized bandwidth consumption

0.60

0.65

0.70

0.75

Accu

racy

AWStreamDDS

(b) Semantic segmentation

Figure 15: DDS outperforms (the closest baseline) in accu-racy under various bandwidth consumption budgets.

architecture of the server-side DNN. We leave a full examina-tion of different DNN architecture (e.g., MaskRCNN [35]) tofuture work.

5.4 Sensitivity to network settingsAccuracy vs. bandwidth budget: We then vary the avail-able bandwidth and compare DDS with AWStream, whichis performance-wise the closest baseline. Figure 11 showsthat as available bandwidth varies, DDS always uses 35-43%bandwidth in object detection and 42-89% in semantic seg-mentation while achieving higher accuracy.Impact of bandwidth variance : Figure 16 compares DDSwith AWStream under increasing bandwidth variance. Weuse synthetic network bandwidth traces where available band-width is drawn from a normal distribution of 900·𝑁 (1, 𝜎2)kbpswhile we increase 𝜎 from 5% to 100%. We observe that DDSmaintains an accuracy advantage over AWStream. AlthoughDDS and AWStream use the same bandwidth estimator (aver-age of last two segments), DDS uses the available bandwidthmore efficiently because DDS’s feedback control system con-tinually adapts the model relating configuration parametersto bandwidth. Thus, DDS adaptively selects the best possibleconfiguration parameters at each time instant. Further, theavailable configuration parameters for low and high qualityvideo enable a wider range of possible adaptations and thusDDS tunes its behavior at a finer granularity compared toAWStream. Even when the variance in available bandwidth ishigh (𝜎 > 70% of the mean), DDS maintains a relatively lowresponse delay while AWStream’s delay increases.Impact of parameter settings: Figure 17 shows the impactof key parameters of DDS on its accuracy/bandwidth trade-offs: the QP of the first iteration in Stream A (“low qp”), theQP of the second iteration in Stream B (“high qp”), and theobjectness threshold. We vary one parameter at a time andtest them on the same set of traffic videos. The figures showthat by varying these parameters, we can flexibly trade accu-racy for bandwidth savings. Overall, we observe their effectroughly falls on the same Pareto boundary, so there may not

Figure 16: DDS can handle bandwidth variance and maintains a sizeable gain over the AWStream baseline even under substantial bandwidth fluctuation. (a) Accuracy (F1) and (b) network delay (s), each plotted against the coefficient of variation of the bandwidth trace (mean 900 kbps).

Figure 17: Sensitivity analysis of DDS's parameters (objectness threshold, low QP, high QP). By varying these parameters, DDS can flexibly reach a desirable accuracy for a given bandwidth constraint. (a) Object detection; (b) semantic segmentation, each plotting accuracy against normalized bandwidth consumption.

Figure 18: Compared to prior solutions (AWStream, Glimpse, Vigil), DDS adds little additional systems overhead on both client and server. (a) Client CPU (%); (b) server CPU (%); (c) server GPU (%).


5.5 System microbenchmarks

Camera- & server-side overheads: Figure 18 puts the systems overhead of DDS in perspective against the baselines. We follow the assumptions in §5.2 that the camera has 4 CPUs and the server has 4 CPUs and 8 GPUs. On the camera (client) side, all methods incur relatively low overhead, except Vigil and Glimpse, which run local tracking and object-detection heuristics. On the server side, Vigil and Glimpse use far fewer GPU and CPU cycles than DDS and AWStream because they run inference on only a small fraction of frames (which in part leads to their lower accuracy). The reason DDS has lower GPU usage than AWStream, despite running inference twice per frame, is that AWStream profiles a list of configurations (e.g., pairs of QP and resolution; 216 configurations in total in [73]), and that cost is non-trivial.


Figure 19: DDS handles server disconnection (or server failure) gracefully by falling back to client-side logic. The two panels plot response delay (s) and accuracy (F1) over time (s).

Figure 20: System refinements (introduced in §3.3) to (a) reduce Stream B bandwidth, comparing separate regions, black-background frames, and a black-background H.264-encoded video (bandwidth consumption in kbps), and (b) reduce response delay, comparing AWStream, returning results after the 2nd iteration, and early response after the 1st iteration (response delay in seconds).

In our implementation, we evaluate only 24 configurations on 15 frames in each 300-frame video, and the average GPU usage of AWStream is already higher than DDS's. That said, we acknowledge that if AWStream updated its profile less frequently (e.g., every few minutes), its GPU usage could be lower than DDS's, but its profile might then become out of date.

Fault tolerance: We stress-test DDS under the condition that the server-side DNN is unreachable. By default, the camera runs the DDS protocol and also keeps a local tracking algorithm as a backup. Figure 19 shows the time series of response delay and accuracy (a sketch of this fallback logic follows below). DDS initially maintains the desired accuracy; at t = 5 seconds, the server is disconnected. DDS waits briefly (until the server times out at t = 5.5) and then falls back to tracking the last detection results received from the server DNN. This allows accuracy to degrade gracefully rather than crashing or delaying inference indefinitely. Between the disconnection and the timeout, segments are placed in a temporary queue; once local inference begins, the queue is gradually drained. When the server comes back online (at t = 13), the camera is notified with at most one segment's worth of delay (0.5 seconds) and resumes the regular DDS protocol; in the meantime, the camera continues to send one probe per segment.
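The following Python sketch illustrates one way the camera-side fallback described above could be structured. The callables (send_to_server, local_tracker, probe_server) and the exact control flow are placeholders for illustration, not the DDS implementation; only the 0.5-second timeout value comes from the text.

import queue

SERVER_TIMEOUT = 0.5   # wait this long before declaring the server unreachable

def run_camera(segments, send_to_server, local_tracker, probe_server):
    """Per-segment loop: use the server DNN while it is reachable; on timeout,
    queue segments briefly, then track locally seeded by the last server
    results; probe once per segment to detect server recovery."""
    results = []
    backlog = queue.Queue()
    last_server_results = None
    server_up = True
    for seg in segments:
        if server_up:
            try:
                last_server_results = send_to_server(seg, timeout=SERVER_TIMEOUT)
                results.append(last_server_results)
                continue
            except TimeoutError:
                server_up = False                  # server disconnected
        backlog.put(seg)                           # hold segments until local inference starts
        while not backlog.empty():                 # drain the backlog with the local tracker
            results.append(local_tracker(backlog.get(), last_server_results))
        server_up = probe_server(timeout=SERVER_TIMEOUT)   # one probe per segment
    return results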

Performance optimization: Figure 20 examines two performance refinements. First, Figure 20(a) shows that (1) placing the proposed regions on a black-background frame yields about 2× bandwidth savings over encoding each region separately, and (2) compressing these frames into an H.264/mp4 video leads to another 10× savings. Second, Figure 20(b) shows that by returning the first-iteration output (i.e., the high-confidence results in Stream A) before the second iteration starts, DDS reduces the average response delay by about 40%. A sketch of the region-packing refinement follows below.
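As an illustration of the first refinement, here is a minimal sketch of pasting proposed regions onto black-background frames and encoding them as an H.264/mp4 video. It assumes OpenCV (cv2), numpy, and an ffmpeg binary on the PATH; the (x, y, w, h) region format, file names, and encoder settings are illustrative choices, not the DDS implementation (which controls quality via QP).

import subprocess
import numpy as np
import cv2  # opencv-python

def black_background_frames(frames, regions_per_frame):
    """For each frame, copy only the proposed regions (x, y, w, h boxes)
    onto a black canvas; the untouched black pixels compress to almost
    nothing, which is where the ~2x saving over separate regions comes from."""
    out = []
    for frame, regions in zip(frames, regions_per_frame):
        canvas = np.zeros_like(frame)
        for (x, y, w, h) in regions:
            canvas[y:y + h, x:x + w] = frame[y:y + h, x:x + w]
        out.append(canvas)
    return out

def encode_h264(frames, path="stream_b.mp4", fps=30, crf=36):
    """Encode the black-background frames with x264 via ffmpeg; compressing
    them as one video (rather than separate images) gives the further ~10x
    saving reported in Figure 20(a)."""
    for i, f in enumerate(frames):
        cv2.imwrite(f"bkgd_{i:05d}.png", f)
    subprocess.run(
        ["ffmpeg", "-y", "-framerate", str(fps), "-i", "bkgd_%05d.png",
         "-c:v", "libx264", "-crf", str(crf), path],
        check=True)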

6 RELATED WORK

We discuss the most closely related work in three categories.

Video analytics systems: The need to scale video analytics has sparked much systems research: DNN sharing (e.g., [39, 41]), resource allocation (e.g., [49, 74]), vision model cascades (e.g., [45, 66]), efficient execution frameworks (e.g., [48, 55, 59]), camera/edge/cloud collaboration (e.g., [26, 28, 67, 71, 73, 76]; see §2.2 for a detailed discussion), and multi-camera collaboration (e.g., [40, 42]). The work most related to DDS is AWStream [73], which shares with DDS the ethos of using a server-side DNN-generated profile. The key distinction is that such feedback is not based on real-time video content, so it cannot zoom in on specific regions of the current frames. Vigil [76], on the other hand, does send cropped regions, but it is bottlenecked by the camera's computing power (see §2.2).

Vision applications: Computer vision and deep learning research offers a substantial body of related work (e.g., [30, 44, 53, 60, 62, 63, 72]). Recent work on video object detection shows that it is inefficient to apply an object-detection DNN frame by frame; instead, detection should be augmented by tracking [37] (similar to §3.3) or by a temporal model such as an LSTM [50, 52]. This line of work complements DDS by designing new server-side deep learning models; DDS's distinctive advantage is that it explicitly optimizes the bandwidth/accuracy trade-off in a way that is largely agnostic to the server-side DNN.

Internet video encoding/streaming: Recent innovations in video encoding have provided better compression gains [29]. The closest efforts to DDS are scalable video coding [58] and region-of-interest encoding [56]. However, these approaches optimize human quality of experience (QoE) and are largely agnostic to what is in the video: region-of-interest encoding requires the viewer to specify the region of interest, and scalable video coding, while able to use bandwidth more efficiently than traditional encoding methods, still compresses the video uniformly in its entirety. Similarly, much work has been done on adaptive bitrate streaming (e.g., [31, 43, 47, 65]), which focuses on adapting the bitrate of pre-encoded video chunks to bandwidth fluctuation, rather than to dynamic content as DDS does.

7 LIMITATIONS AND DISCUSSION

Strict server-side resource budget: In some sense, DDS reduces bandwidth usage at the expense of relying on the server-side DNN to run inference more than once per frame. Similar trade-offs appear in AWStream, which periodically triggers costly re-profiling, and in CloudSeg, which requires an upfront super-resolution model customization process.


All these techniques may not be directly applicable where server resources have strict budgets or where GPU cost is proportional to usage (e.g., on cloud instances).

Implications for privacy: Privacy is an emerging concern in video analytics [70]. While DDS does not explicitly preserve privacy, it is amenable to privacy-preserving techniques: since DDS does not send full-resolution images, it could be repurposed to denature videos, sending only part of the video to the server for analytics.

Edge AI accelerators: Although DDS makes few assumptions about the camera's local computation capacity, it can naturally benefit from the trend of adding more accelerators to edge devices by using the camera-side heuristics described in §3.3. We also envision DDS running alongside the camera's local analytics to share the workload and provide higher inference accuracy with minimal bandwidth overhead.

8 CONCLUSION

Video streaming has been a driving application of networking research and has inspired innovative design paradigms. This work argues that emerging AI applications call for a paradigm shift away from the traditional source-driven approach to video streaming. We advocate a DNN-driven design that exploits two opportunities unique to deep learning applications: (1) unlike user QoE, video inference accuracy depends less on the pixels themselves than on what is in the video, and (2) the deep learning inference engine (the receiver) offers extra information that can be leveraged to decide how the video should be encoded and streamed. We believe the development of such video streaming protocols will significantly impact not only video analytics but also the future analytics stack of many distributed AI applications.

REFERENCES
[1] Amazon deeplens cameras. https://aws.amazon.com/deeplens/.
[2] Are we ready for ai-powered security cameras? https://thenewstack.io/are-we-ready-for-ai-powered-security-cameras/.
[3] Benchmarking videos used in dds. https://github.com/ddsprotocol/dds.
[4] Can 30,000 cameras help solve chicago's crime problem? https://www.nytimes.com/2018/05/26/us/chicago-police-surveillance.html.
[5] Cloud-based video analytics as a service of 2018. https://www.asmag.com/showpost/27143.aspx.
[6] Dashjs. https://github.com/Dash-Industry-Forum/dash.js.
[7] Dds: Machine-centric video streaming. https://github.com/dds-research-project/dds.
[8] How ai based video analytics is benefiting retail industry. https://www.lanner-america.com/blog/ai-based-video-analytics-benefiting-retail-industry/.
[9] Hp r0w29a tesla t4 graphic card - 1 gpus - 16 gb. https://www.amazon.com/HP-R0W29A-Tesla-Graphic-Card/dp/B07PGY6QPT/. Accessed: 2020-1-29.
[10] Insightface: 2d and 3d face analysis project. https://github.com/deepinsight/insightface.
[11] Jetson agx xavier. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/. Accessed: 2020-1-29.
[12] Nvidia jetson systems. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems-dev-kits-modules/.
[13] Nvidia tesla deep learning product performance. https://developer.nvidia.com/deep-learning-performance-training-inference. Accessed: 2020-1-29.
[14] Official implementation of efficient cascading residual network for sr. https://github.com/nmhkahn/CARN-pytorch.
[15] Smraza raspberry pi 4 camera module 5 megapixels 1080p. https://www.amazon.com/Smraza-Raspberry-Megapixels-Adjustable-Fish-Eye/dp/B07L2SY756/. Accessed: 2020-1-29.
[16] Tensorflow detection model zoo. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md.
[17] Video meets the internet of things. https://www.mckinsey.com/industries/high-tech/our-insights/video-meets-the-internet-of-things.
[18] Video surveillance: How technology and the cloud is disrupting the market. https://cdn.ihs.com/www/pdf/IHS-Markit-Technology-Video-surveillance.pdf.
[19] Vision meets drones: A challenge. http://www.aiskyeye.com/.
[20] Wi-fi vs. cellular: Which is better for iot? https://www.verypossible.com/blog/wi-fi-vs-cellular-which-is-better-for-iot.
[21] x264 open source video lan. https://www.videolan.org/developers/x264.html.
[22] Data generated by new surveillance cameras to increase exponentially in the coming years. http://www.securityinfowatch.com/news/12160483/, 2016.
[23] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252–268, 2018.
[24] Ganesh Ananthanarayanan, Victor Bahl, Peter Bodík, Krishna Chintalapudi, Matthai Philipose, Lenin Ravindranath Sivalingam, and Sudipta Sinha. Real-time video analytics – the killer app for edge computing. IEEE Computer, October 2017.
[25] S. Barati, F. A. Bartha, S. Biswas, R. Cartwright, A. Duracz, D. Fussell, H. Hoffmann, C. Imes, J. Miller, N. Mishra, Arvind, D. Nguyen, K. V. Palem, Y. Pei, K. Pingali, R. Sai, A. Wright, Y. Yang, and S. Zhang. Proteus: Language and runtime support for self-adaptive software development. IEEE Software, 36(2):73–82, March 2019.
[26] Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. Glimpse: Continuous, real-time object recognition on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 155–168. ACM, 2015.
[27] Ting-Wu Chin, Ruizhou Ding, and Diana Marculescu. Adascale: Towards real-time video object detection using adaptive scaling. arXiv preprint arXiv:1902.02910, 2019.
[28] Sandeep P Chinchali, Eyal Cidon, Evgenya Pergament, Tianshu Chu, and Sachin Katti. Neural networks meet physical networks: Distributed inference between edge devices and the cloud. In Proceedings of the 17th ACM Workshop on Hot Topics in Networks, pages 50–56. ACM, 2018.
[29] High efficiency video coding. ITU-T Rec. H.265 and ISO, 2013.
[30] Jiankang Deng, Jia Guo, Xue Niannan, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019.
[31] Florin Dobrian, Vyas Sekar, Asad Awan, Ion Stoica, Dilip Joseph, Aditya Ganjam, Jibin Zhan, and Hui Zhang. Understanding the impact of video quality on user engagement. In ACM SIGCOMM Computer Communication Review, volume 41, pages 362–373. ACM, 2011.
[32] John Emmons, Sadjad Fouladi, Ganesh Ananthanarayanan, Shivaram Venkataraman, Silvio Savarese, and Keith Winstein. Cracking open the dnn black-box: Video analytics with dnns across the camera-cloud boundary. In Proceedings of the 2019 Workshop on Hot Topics in Video Analytics and Intelligent Edges, pages 27–32, 2019.
[33] Antonio Filieri, Henry Hoffmann, and Martina Maggio. Automated design of self-adaptive software with control-theoretical formal guarantees. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 299–310, New York, NY, USA, 2014. ACM.
[34] Shilpa George, Junjue Wang, Mihir Bala, Thomas Eiszler, Padmanabhan Pillai, and Mahadev Satyanarayanan. Towards drone-sourced live video analytics for the construction industry. In Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications, pages 3–8. ACM, 2019.
[35] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[36] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2014.
[37] Congrui Hetang, Hongwei Qin, Shaohui Liu, and Junjie Yan. Impression network for video object detection. https://arxiv.org/pdf/1712.05896.pdf, Dec 2017.
[38] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[39] Chien-Chun Hung, Ganesh Ananthanarayanan, Peter Bodik, Leana Golubchik, Minlan Yu, Paramvir Bahl, and Matthai Philipose. Videoedge: Processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium on Edge Computing (SEC), pages 115–131. IEEE, 2018.
[40] Samvit Jain, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, and Joseph E. Gonzalez. Scaling video analytics systems to large camera deployments. In ACM HotMobile, 2019.
[41] Angela H Jiang, Daniel L-K Wong, Christopher Canel, Lilia Tang, Ishan Misra, Michael Kaminsky, Michael A Kozuch, Padmanabhan Pillai, David G Andersen, and Gregory R Ganger. Mainstream: Dynamic stem-sharing for multi-tenant video processing. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 29–42, 2018.
[42] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. Chameleon: Scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 253–266. ACM, 2018.
[43] Junchen Jiang, Vyas Sekar, and Hui Zhang. Improving fairness, efficiency, and stability in http-based adaptive video streaming with festive. IEEE/ACM Transactions on Networking (ToN), 22(1):326–340, 2014.
[44] Kinjal A Joshi and Darshak G Thakore. A survey on moving object detection and tracking in video surveillance system. International Journal of Soft Computing and Engineering, 2(3):44–48, 2012.
[45] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. Noscope: Optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.
[46] D. H. K. Kim, C. Imes, and H. Hoffmann. Racing and pacing to idle: Theoretical and empirical analysis of energy optimization heuristics. In ICCPS, 2015.
[47] S Shunmuga Krishnan and Ramesh K Sitaraman. Video stream quality impacts viewer behavior: Inferring causality using quasi-experimental designs. IEEE/ACM Transactions on Networking, 21(6):2001–2014, 2013.
[48] Sanjay Krishnan, Adam Dziedzic, and Aaron J Elmore. Deeplens: Towards a visual data management system. arXiv preprint arXiv:1812.07607, 2018.
[49] Robert LiKamWa and Lin Zhong. Starfish: Efficient concurrency support for computer vision applications. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, pages 213–226. ACM, 2015.
[50] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 7083–7093, 2019.
[51] Luyang Liu, Hongyu Li, and Marco Gruteser. Edge assisted real-time object detection for mobile augmented reality. In The 25th Annual International Conference on Mobile Computing and Networking, pages 1–16, 2019.
[52] Mason Liu and Menglong Zhu. Mobile video object detection with temporally-aware feature maps. https://arxiv.org/pdf/1711.06368.pdf, Mar 2018.
[53] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, pages 6738–6746, 2017.
[54] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[55] Yao Lu, Aakanksha Chowdhery, and Srikanth Kandula. Optasia: A relational platform for efficient large-scale video analytics. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 57–70. ACM, 2016.
[56] Marwa Meddeb. Region-of-interest-based video coding for video conference applications. PhD thesis, Telecom ParisTech, 2016.
[57] Nikita Mishra, Connor Imes, John D. Lafferty, and Henry Hoffmann. CALOREE: Learning control for predictable latency and low energy. In ASPLOS, 2018.
[58] J-R Ohm. Advances in scalable video coding. Proceedings of the IEEE, 93(1):42–56, 2005.
[59] Alex Poms, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. Scanner: Efficient video analysis at scale. ACM Transactions on Graphics (TOG), 37(4):1–13, 2018.
[60] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
[61] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[62] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[63] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, 2017.
[64] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[65] Michael Seufert, Sebastian Egger, Martin Slanina, Thomas Zinner, Tobias Hossfeld, and Phuoc Tran-Gia. A survey on quality of experience of http adaptive streaming. IEEE Communications Surveys & Tutorials, 17(1):469–492, 2015.
[66] Haichen Shen, Seungyeop Han, Matthai Philipose, and Arvind Krishnamurthy. Fast video classification via adaptive cascading of deep models. arXiv preprint, 2017.
[67] Surat Teerapittayanon, Bradley McDanel, and HT Kung. Distributed deep neural networks over the cloud, the edge and end devices. In Distributed Computing Systems (ICDCS), 2017 IEEE 37th International Conference on, pages 328–339. IEEE, 2017.
[68] Jilin Tu, Ana Del Amo, Yi Xu, Li Guari, Mingching Chang, and Thomas Sebastian. A fuzzy bounding box merging technique for moving object detection. In 2012 Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), pages 1–6. IEEE, 2012.
[69] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2017.
[70] Han Wang, Yuan Hong, Yu Kong, and Jaideep Vaidya. Publishing video data with indistinguishable objects. In Proceedings of the 22nd International Conference on Extending Database Technology (EDBT), 2020.
[71] Yiding Wang, Weiyan Wang, Junxue Zhang, Junchen Jiang, and Kai Chen. Bridging the edge-cloud barrier for real-time advanced vision analytics. In 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19), 2019.
[72] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.
[73] Ben Zhang, Xin Jin, Sylvia Ratnasamy, John Wawrzynek, and Edward A Lee. Awstream: Adaptive wide-area streaming analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 236–252. ACM, 2018.
[74] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J Freedman. Live video analytics at scale with approximation and delay-tolerance. In NSDI, volume 9, page 1, 2017.
[75] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multi-task cascaded convolutional networks. CoRR, abs/1604.02878, 2016.
[76] Tan Zhang, Aakanksha Chowdhery, Paramvir Victor Bahl, Kyle Jamieson, and Suman Banerjee. The design and implementation of a wireless video surveillance system. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 426–438. ACM, 2015.
[77] Tong Zhang, Fengyuan Ren, Wenxue Cheng, Xiaohui Luo, Ran Shu, and Xiaolan Liu. Modeling and analyzing the influence of chunk size variation on bitrate adaptation in dash. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.

