THE UNIVERSITY OF CHICAGO
ANALYTICS-ORIENTED VIDEO STREAMING STACK
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
MASTER OF SCIENCE (MS)
DEPARTMENT OF COMPUTER SCIENCE
BY
AHSAN PERVAIZ
CHICAGO, ILLINOIS
15TH JUNE 2019
Copyright © 2019 by Ahsan Pervaiz
All Rights Reserved
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
1 INTRODUCTION
2 MOTIVATION
2.1 Distributed Video Analytics Pipelines
2.2 Limitations of existing solutions
2.3 The case for a Custom Video Streaming Stack
3 OVERVIEW OF DDS
3.1 DDS: DNN Driven Streaming
3.2 Challenges
4 DESIGN AND IMPLEMENTATION
4.1 Algorithm
4.2 Implementation
5 EVALUATION
5.1 Methodology
5.2 Accuracy vs Bandwidth
5.3 Delay
5.4 Microbenchmark
5.5 Adaptation
6 FUTURE WORK
7 RELATED WORK
8 CONCLUSION
REFERENCES
LIST OF FIGURES
1.1 A client-side heuristics-based protocol
1.2 A DNN-feedback-based streaming protocol
2.1 Diminishing returns of increasing resolution
2.2 Scaling down and spatially cropping frames
3.1 Iterative control flow of DDS
4.1 Components and interfaces of DDS
5.1 F1 against bandwidth for various configurations
5.2 1-σ ellipse for DDS and baselines
5.3 Worst cases for DDS
5.4 Batch processing time (delay)
5.5 DDS maintains high accuracy in the presence of bandwidth fluctuations
6.1 Impact on accuracy of stitching 4 images together
LIST OF TABLES
4.1 Responsibilities of different parts of the Analytics Pipelines
5.1 Resource Utilization
ACKNOWLEDGMENTS
ABSTRACT
As deep learning becomes the de-facto approach to live video analytics, we are seeing a surge
in efforts to develop distributed analytics pipelines that leverage the cloud resources to run
deep neural networks (DNNs) on live videos from edge cameras. Prior work has focused on
offloading computation from the edge devices to the cloud. This paper highlights a crucial yet
missing piece in the distributed video analytics pipeline: a custom video streaming stack
to save bandwidth between the camera and the analytics server while ensuring high inference
accuracy. Through empirical studies, we have found unexploited opportunities to customize
today’s video streaming stack for video analytics. Inspired by the findings, we present
DDS, a video streaming stack that caters to distributed video analytics. Unlike previous
video streaming mechanisms designed to meet user experience or bandwidth budget, DDS
is built on a novel DNN-driven workflow that relies on server-side DNN logic to explicitly
balance the video analytics accuracy and bandwidth savings. At the core of DDS is an
interactive streaming protocol built on the active learning framework to make DNN-driven
workflow practical. We implement DDS as an underlying video streaming layer which could
be combined with a wide variety of distributed video analytics pipelines. Through evaluation
on real-world video datasets, we find that DDS increases the inference accuracy by up to 2x
while consuming the same or less bandwidth compared to other video analytics
pipelines.
CHAPTER 1
INTRODUCTION
In recent years there has been an increased interest in smart devices and surveillance cameras.
In 2018 alone, approximately 130 million networked cameras were shipped worldwide [34].
With the increase in video content produced by smart camera devices, there is a growing need
to extract meaningful information from the large input in an automated and cost-effective
way. Deep neural networks have dramatically improved the accuracy of such computer vision
tasks, in some cases even outperforming human beings [15]. However, running deep neural
networks (DNNs) on the devices that produce the video content is prohibitively expensive
[28, 27]. Enabling these smart devices to run DNNs efficiently would require building
substantial compute and storage resources into them, which would increase the cost of such
devices. Hence, researchers have been developing offloading schemes
and distributed pipelines for such analytics, which move the computationally expensive task
of running the DNNs away from the device and onto the cloud. In such schemes, instead
of the server consuming raw frames produced by the camera, the client compresses the raw
frames and selects a subset to send to the server, which then runs the DNNs and sends
results back to the client.
Prior work on offloading schemes has missed a crucial component: a custom video streaming
stack. The streaming stack has a significant impact on the accuracy, bandwidth utilization,
and delay of such protocols.
In this work, we use the key insight that streaming video content to a neural network
rather than a human being has fundamentally different requirements. Traditional video
streaming protocols (e.g. [30]) focus on user-perceived quality, so one goal of such protocols
is to stream video uninterruptedly in the highest possible resolution and quality. In contrast,
playback smoothness does not affect analytics in the same way. Rather than focusing on the
content as a whole in time, video analytics focuses on the temporal and spatial features of a
video across time.

Figure 1.1: A client-side heuristics-based protocol
Previously proposed schemes make use of client side logic to apply a variety of heuristics
to send only important frames to the server for further analysis [8, 38, 21]. The schemes
then use the returned results to further process frames not sent to the server. Client side
heuristics are inherently less accurate than server side DNNs, causing either too many frames
to be dropped, in which case the accuracy of the whole system decreases, or too many frames
to be sent to the server, in which case the system consumes more bandwidth than is needed
for the task. Furthermore, these protocols do not exploit other avenues for saving bandwidth
with the help of compression, because they work at the granularity of frames produced by
the smart device [8, 38].
The limitations of existing solutions motivate a different approach to solve the problem
of video analytics by making use of unexplored opportunities to achieve a better accu-
racy/bandwidth trade-off.
In this paper we present DDS, a new DNN-driven streaming stack that systemati-
cally balances bandwidth consumption and inference accuracy of distributed video analytics
pipelines. Unlike previous approaches which utilize client side logic, DDS is an iterative
protocol which makes use of the inference results from the server to determine the best com-
pression and streaming strategy. This approach allows DDS to maximize analytics accuracy
while saving bandwidth based on the results from the server rather than relying on client-side
heuristics which might not be good proxies for the content needed by the server to maximize
accuracy.
Figure 1.2: A DNN-feedback-based streaming protocol
The DDS protocol works iteratively. DDS, in its first iteration, sends a segment of the
video encoded in low resolution to the server. The server runs a complex DNN on these
low resolution frames and compiles a list of confirmed results along with a list of regions
in each frame in the segment that seem interesting and require further investigation. The
server sends both of these back to the client. In the second iteration, the client goes over
each frame and crops out the regions requested by the server. It then encodes these regions
in high resolution and sends them to the server. The server runs the DNN on these specific
regions rather than the whole frame and sends the results back to the client. The client places
the results onto the frames. This approach allows DDS to minimize the amount of data sent
to the server across iterations (low resolution and the cropped regions) while allowing us to
maximize accuracy, as DDS allows the server to further investigate only those regions which
it thinks can have objects of interest. This iterative process solves a fundamental issue in
other distributed video analytics pipelines: the analytics server cannot provide results for
what is not sent by the client, and the client cannot know what information is important for
the DNN to maximize accuracy.
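The two-iteration exchange described above can be sketched as a small control loop. The helper names (`server_infer`, `encode`, `crop`) and the payload shapes below are hypothetical stand-ins for the components described in later sections, not the actual DDS interfaces:

```python
def run_protocol(frames, server_infer, encode, crop):
    """Sketch of the two-iteration DDS exchange.

    server_infer(payload) -> (results, regions_of_interest)
    encode(frames, resolution) -> payload shipped to the server
    crop(frames, regions) -> frames with everything outside the regions removed
    All three are hypothetical placeholders for the real components.
    """
    # Iteration 1: ship the whole segment in low resolution.
    confirmed, regions = server_infer(encode(frames, resolution="low"))

    # Iteration 2: ship only the server-requested regions in high resolution.
    if regions:
        extra, _ = server_infer(encode(crop(frames, regions), resolution="high"))
        confirmed = confirmed + extra

    # The client merges the partial results of both iterations.
    return confirmed
```

The second round trip happens only when the server flags regions worth a closer look, which is how the protocol avoids resending full frames.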
As part of the paper, we present an implementation of DDS along with the evaluation
that shows the efficacy of this protocol. We provide end-to-end evaluation of our implemen-
tation that studies the accuracy/bandwidth trade-off along with delay of the protocol. The
contributions of this work are as follows:
• We demonstrate the common limitations of prior approaches by empirically examining
the drawbacks of using client-side heuristics, which result in a sub-optimal accuracy/bandwidth tradeoff.
• We present DDS, a novel server-driven approach to video streaming protocols that uses
feedback from the server to send only those parts of the frame that are most likely to
increase accuracy.
• We address several challenges that are inherent to a server-driven approach. We concretely define the notion of ‘feedback’ and delineate concrete ways to reduce bandwidth
consumption and minimize delay in the presence of multiple iterations.
• We provide a comprehensive technical design of DDS and a concrete implementation that
demonstrates the practicality of our proposed protocol.
• We also perform an end-to-end evaluation of the implementation. The results of the
evaluation empirically demonstrate the benefits of incorporating server feedback into
the streaming stack.
The DDS protocol in itself is complete enough to produce good results without the need
to incorporate client side heuristics. At the same time, it is flexible enough to be extended
using client side heuristics to further improve the results, illustrating the flexibility of a server
driven design.
The following sections of the paper motivate the need for a custom streaming stack
and discuss the limitations of existing offloading schemes (§2). We then provide
an overview (§3) and implementation (§4) of DDS along with the end-to-end evaluation (§5).
Towards the end, we discuss potential future work (§6) and the prior work in this area most
closely related to DDS (§7).
CHAPTER 2
MOTIVATION
In this section, we explore the need for advancement in video streaming analytics. We
look at the flaws in the common denominator of existing solutions and explain why they
cannot optimize the bandwidth/accuracy trade-off effectively. Finally, we show the potential
of customizing the video streaming stack.
2.1 Distributed Video Analytics Pipelines
The function of a live video analytics pipeline is to perform multi-class object detection
on a live video stream and extract objects of interest. The object classes which are of interest
to an application are defined in an analytics query.
Different applications have different analytics queries, e.g. a traffic monitoring application
might be interested in the ‘number of cars on a road’, a coffee shop might be interested in
the ‘number of people in the queue at the counter’, and a surveillance application might be
interested in the ‘number of people entering or exiting a building’ or ‘location of a particular
object for the duration of the video stream’. Getting real-time results from these queries has
tangible benefits for the application owner. In the case of a traffic monitoring application,
the result from the analytics pipeline can be used to dictate the flow of traffic so as to
minimize traffic congestion. For a coffee shop application, getting an accurate count of the
number of people in the queue can allow the owner to decide whether or not to open a new
counter. And in the case of a surveillance application, the information obtained from the
analytics pipeline can be used to detect intruders.
In most applications that make use of live video analytics, as in the aforementioned
examples, the accuracy and latency of results are of significant importance because the data
is being used to make real-time decisions that will have real-world impact.
Considering the impact of the accuracy of these results, vision based analytics applica-
tions are increasingly making use of deep neural networks that provide more accurate results
than traditional methods [13, 21]. But obtaining results from these DNNs at an acceptable
rate requires expensive, dedicated GPUs or a GPU-enabled VM from a cloud provider, which
costs $850 per month on average [4, 2, 1].
Additionally, for location-dependent object recognition, the client must run a tracking
algorithm to update the locations of objects across those frames that are not sent to the server [8].
2.2 Limitations of existing solutions
Traditional streaming protocols such as DASH [30] and video encoding standards such as
H.264 are designed to optimize quality of experience of human users. This experience de-
teriorates if the video stalls or drops frames. Traditional streaming stacks aim to provide
the highest possible resolution (bitrate) while avoiding delay between frames. In contrast,
stalling and startup delay do not impact the performance of video analytics in the same
way as they impact the experience of a user. Additionally, it is difficult to identify which re-
gions are important for a human watching a video. However, this can be reliably determined
in the case of video analytics (as we will see in the following sections). Therefore, the video
analytics backend does not require the entirety of the frame to perform accurate analytics. In
fact, sending a frame in a higher resolution than is actually needed has diminishing gains, as
illustrated in Figure 2.1. Hence, a video analytics backend does not need the maximum
possible resolution of a video. Furthermore, studies have shown that compression (even with
changes imperceptible to a human being) can impact the accuracy of a neural
network [31]. Applying certain compression algorithms might therefore not impact the quality of
experience of a human being but might still affect the inference accuracy of a neural network.
Traditional video streaming stacks do not take these factors into account.
Figure 2.1: Diminishing returns of increasing resolution (change in F1 with resolution scale)

Realizing that not all frames are equally important to the server, some pipelines perform
computation on the client side to filter out frames that do not contain useful information.
However, the client is often resource constrained and thus the heuristics performed at the
client side are inherently simpler than the server-side DNN. Hence, these client-side heuristics
cannot be used as appropriate proxies for server-side DNNs, because client-side heuristics
cannot accurately determine which frames the server will need. Such approaches end up
sending either too few (which results in lower accuracy) or too many (which results in
excessive bandwidth consumption) frames to the server. For example, in Glimpse [8], the
client decides which frames to send to the server based on the pixel level difference between
frames. This simple approach can result in a frame not being sent to the server when an
object enters the scene but is too small to change the pixel values by a
significant amount, resulting in lower accuracy. Conversely, changes in the brightness
of the scene would result in significant pixel level difference (another such scenario is that
of a panning camera without a change in objects in the frame), so the client will send such
frames to the server, resulting in excessive bandwidth consumption. In the case of Vigil [38],
the client-side neural network is fast but not sufficiently accurate. It can fail to recognize
several objects in a scene as well as produce several false positives, causing the client-side
heuristic to make mistakes when picking the frames to be sent to the server. AWStream [36]
also makes use of client side logic, but in a different way than Vigil [38], Glimpse [8] and
NoScope [21]. AWStream uses the client side logic to perform offline and online profiling
to learn a profile that predicts the accuracy and bandwidth trade-off. When it is given the
task to stream a video to the server, it uses the previously constructed profiles to adjust
the application data rate to match bandwidth while achieving maximum possible accuracy.
AWStream does not make use of any feedback from the server, hence it cannot actively
determine what the server deems to be important [36].
Furthermore, client-side heuristics, such as those mentioned above, are typically designed
for a specific application [21]. They cannot be trivially ported to a different analytics task
or pipeline. Hence, these schemes leave much to be desired in terms of flexibility. This
motivates the need to develop a more flexible video streaming stack that can be used with
a variety of video analytics pipelines with minimal porting effort.
2.3 The case for a Custom Video Streaming Stack
Contrary to recent work which focuses on designing client-side logic, our work considers a
server driven design. The high level idea behind the server driven design is that the server
has access to a large amount of resources which it can use to make better decisions about
what is important to achieve higher accuracy.
We propose an iterative approach to video analytics. During the first iteration, the client
provides the server just enough data to home in on the parts that seem the most promising.
During the second iteration, the client only sends the promising parts of the data requested
by the server and the server returns the final results after processing this extra information.
This feedback based approach allows the pipeline to selectively send bits of data that are
of interest to the server without involving heuristics that need to act as proxies for the
server-side DNN. This provides us the opportunity to minimize bandwidth consumption
while achieving higher accuracy.
Additionally, we look at an equally important part of any analytics pipeline, the streaming
stack. The streaming stack influences three key metrics, which we refer to as measures:
• Inference Accuracy: the number of objects of interest successfully identified.
• Bandwidth Consumption: the total data sent between the client and the server.
• Response Delay: the interval between the production of a frame and the final compilation
of its results.
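The inference-accuracy measure is reported as an F1 score in the evaluation. For reference, a standard F1 computation over matched detections might look like the following; the exact box-matching rule used to count true positives is not reproduced here:

```python
def f1_score(tp, fp, fn):
    """F1: the harmonic mean of precision and recall over detections.

    tp: detections matched to a ground-truth object (true positives)
    fp: detections with no matching ground truth (false positives)
    fn: ground-truth objects with no matching detection (false negatives)
    """
    if tp == 0:
        return 0.0  # no correct detections: both precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```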
The ideal streaming stack would retain just enough information in a video that allows the
server-side DNN to detect all objects of interest, to obtain the maximum possible accuracy.
Similarly, the ideal streaming stack would minimize bandwidth consumption by curtailing
the total data transferred between the client and the server, including the content of the video,
the results, and control signals (if any). Along with the aforementioned measures, an ideal
streaming stack would have minimum response time. Response time is important because
the results generated by the analytics pipeline are to be used to control real-world systems
and hence should not be so stale that they do not remain actionable in the real-world.
We balance inference accuracy and bandwidth consumption by compressing the video
along the following dimensions, which we refer to as control knobs.
• Resolution: the resolution at which the raw image is compressed before being sent to
the server.
• Spatial Cropping: cropping out specified regions of a given video frame and sending
them at a given resolution.
• Compression Level: encoding selected regions of the frame with a specified
compression level (quantization parameter).
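The three knobs can be thought of as one configuration object that the protocol tunes per batch. A minimal sketch, with illustrative names and default values that are not taken from the DDS implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class StreamConfig:
    """One setting of the three control knobs (illustrative names)."""
    resolution_scale: float = 0.5  # fraction of the native resolution
    # (x, y, w, h) regions to keep when spatially cropping; empty = whole frame
    crop_regions: List[Tuple[int, int, int, int]] = field(default_factory=list)
    qp: int = 28  # quantization parameter; higher means smaller but lossier

    def is_valid(self) -> bool:
        # H.264 defines QP over [0, 51]; the scale must be a positive fraction.
        return 0.0 < self.resolution_scale <= 1.0 and 0 <= self.qp <= 51
```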
Knob selection to perform dynamic adaptation is a general concept that is used to optimize one measure while meeting constraints on another (e.g. [20, 17, 18]). This allows
for dynamic adaptation in the presence of changing conditions and goals. To put this into
the perspective of work on video streaming, recent research looks into tuning the knobs for
encoding, resolution, and compression to reduce the size of the data sent over the
network while maximizing accuracy [36, 37, 30].
Figure 2.2: Scaling down and Spatially Cropping Frames
Figure 2.2 provides a general idea (explained in greater detail in the following section) of how our proposed system encodes a video frame to save bandwidth.
Notice that the encoded frames in the figure would likely degrade the quality of experience of
a user; however, they will not cause the accuracy of a neural network to drop.
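The two operations of Figure 2.2 can be sketched in a few lines of NumPy. A real deployment would use a codec (e.g. H.264) rather than raw subsampling; this only illustrates what the resolution and cropping knobs do to a frame:

```python
import numpy as np

def downscale(frame: np.ndarray, scale: float) -> np.ndarray:
    """Crude nearest-neighbour downscaling by subsampling rows and columns.
    Stands in for proper codec-based rescaling."""
    step = max(1, int(round(1 / scale)))
    return frame[::step, ::step]

def spatial_crop(frame: np.ndarray, box) -> np.ndarray:
    """Keep only the region (x, y, w, h); the rest of the frame is dropped."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]
```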
CHAPTER 3
OVERVIEW OF DDS
In this section we provide an overview of the design of a novel server driven video analytics
protocol. We also look at the challenges that we need to solve in order to make such a
protocol practical.
3.1 DDS: DNN Driven Streaming
The fundamental flaw in the client-driven protocols is that their decisions and adaptations are
agnostic to the performance of the server-side DNN. This results in such protocols choosing
sub-optimal configurations. In contrast, the idea behind DDS is that of a server-side DNN-driven
workflow: it is the server-side DNN logic, rather than the client-side video encoding
or heuristics, that drives the streaming stack. In DDS the server-side DNN determines the
settings for the control knobs. This difference is what sets DDS apart from existing protocols
(as illustrated using Figures 1.1 and 1.2). The key advantage that this approach provides
over existing protocols is that the DNN itself can explicitly maximize the accuracy within a
bandwidth constraint. And compared to simple video encoding (e.g. [36]), DDS integrates
the object detection results so that encoding is aware of the spatial requirements of the DNN
to perform accurate inference. DDS uses encoders such as H.264/MPEG to further compress
the frames that are sent to the server. DDS does not have to rely on less accurate client-side
heuristics to determine which frames are important.
Furthermore this design allows DDS to be flexible enough to be used for a number of
different video analytics applications, while also providing the opportunity to seamlessly
incorporate client-side logic into the protocol as an extension (§6).
3.2 Challenges
The aforementioned design has some inherent challenges that need to be solved in order to
make this approach possible.
What ’feedback’ does the server provide to the client? We need to concretely
define the notion of control signals that the client must act upon in order to maximize the
inference accuracy while minimizing bandwidth. Needless to say, the feedback sent
back to the client needs to carry just enough information for the server to
ultimately make correct inferences. In DDS this feedback is the set of
regions that the server needs to inspect more closely. This set of regions is computed using
a combination of detection and tracking (the complete details are presented in the following
section). The server compiles a list of regions in the frames that it requires in a higher
resolution, and sends this list back to the client. The client spatially crops the frames to
only include the regions that the server deems are worth investigating further. The client
then encodes these spatially cropped frames and sends these frames to the server in a higher
resolution.
This prompts an iterative design (illustrated in Figure 3.1) for the protocol. In the first
iteration the client sends some information (frames) to allow the server to decide what more
information is needed (regions in high resolution). The client then, in the second iteration,
sends only the requested information (spatially cropped frames). This iterative design leads
to two more challenges that are discussed in the following paragraphs.
How do we minimize bandwidth consumption? To minimize bandwidth consumption,
DDS, during the first iteration, sends frames in just high enough resolution that the
server is able to make accurate decisions about regions of interest. During the second
iteration, DDS blacks out the entire frame except the regions requested by the server and
sends the frames at just enough resolution for the server to make accurate inferences.
Additionally, instead of storing each region in a frame of its own, the client takes a union
of all the regions that are in a single frame and masks the frame except the area covered by
the union of the regions. The client then encodes and compresses these frames and sends
them to the server.
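The masking step can be sketched as follows, assuming (x, y, w, h) pixel regions; this is an illustration of the idea, not the DDS encoder itself. Pixels outside the union are zeroed, and large uniform black areas then compress cheaply:

```python
import numpy as np

def mask_outside_regions(frame: np.ndarray, regions) -> np.ndarray:
    """Black out every pixel not covered by the union of the requested
    regions. Each region is an (x, y, w, h) box; overlapping boxes are
    handled naturally because the mask is their union."""
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for x, y, w, h in regions:
        mask[y:y + h, x:x + w] = True
    if frame.ndim == 3:  # broadcast the mask over colour channels
        mask = mask[:, :, None]
    return np.where(mask, frame, 0)
```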
How do we minimize the delay of the protocol? To minimize delay, DDS makes
use of a number of optimizations. The first optimization stems from the fact that a typical
DNN resizes its input image to 600×600 regardless of the original size of the image.
If the frames sent to the server are smaller than 600×600, the server stitches these
frames together and runs object recognition on several frames in a single pass. This is done as
and the inference time in the evaluation section (§5). Additionally, the tracking done by
the server to compute regions of interest also adds delay to the pipeline. We recognize that
tracking is an independent task. Therefore, we do tracking in parallel to reduce the total
time required for the tracking part of the pipeline. The choice of the tracking algorithm is
also important, a less accurate but fast tracker will give us low delay but lower accuracy of
the regions of interest, while a highly accurate but slow tracker will result in high accuracy
of the regions of interest but a much greater delay.
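The stitching optimization can be illustrated with a fixed grid: several small cropped frames are placed on one canvas so that a single DNN pass covers all of them. A sketch under the assumption that all frames share one shape (the bookkeeping needed to map detections back to their source frames is omitted):

```python
import numpy as np

def stitch_frames(frames, grid=(2, 2)) -> np.ndarray:
    """Tile up to grid[0] * grid[1] equally-sized frames onto one canvas.
    The canvas is then fed to the DNN as a single input image."""
    rows, cols = grid
    h, w = frames[0].shape[:2]
    canvas = np.zeros((rows * h, cols * w) + frames[0].shape[2:],
                      dtype=frames[0].dtype)
    for i, f in enumerate(frames[:rows * cols]):
        r, c = divmod(i, cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = f
    return canvas
```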
There are several other network-side optimizations that reduce the network
delay of the protocol, but they are not core to the design of DDS; hence they are discussed
in the Implementation section (§4).
Figure 3.1: Iterative control flow of DDS
CHAPTER 4
DESIGN AND IMPLEMENTATION
In this section we describe the core DDS algorithm and the solutions to the technical challenges
in greater detail. We also describe the concrete implementation that is used for the end-to-end
evaluation of the protocol.
4.1 Algorithm
As mentioned in the last section, DDS is an iterative protocol that relies on feedback
from the server. The feedback in our case is the set of regions that the server deems important
and worth investigating further, allowing it to infer objects of interest with high accuracy
while sending only a small amount of data. We formalize the iterative protocol by looking at
the iterations separately and discussing the algorithm behind each.
During the first iteration the client sends a batch of s frames to the server in a low
resolution. The value of the low resolution is a knob that can be set dynamically to adapt to
changing video content and available bandwidth. Upon receiving the low resolution frames
the server runs object recognition using a deep neural network. The server then compiles
two lists of detected objects. In one list (A) it adds all objects that have detection confidence
greater than a certain threshold; these are the objects that the server is sure about, and they
require no further investigation. In the second list (B) it adds all objects that have detection
confidence between two thresholds; these are the objects that need to be investigated further.
Any results that are not part of these two lists are discarded, because the server decides that
there is little value in investigating them further. The maximum and minimum thresholds
used to generate these lists are also knobs that can be dynamically adjusted by the server.
For every result in list B the server tracks the objects in the backward and forward direction
using an optical motion tracking algorithm. The bounding boxes retrieved using tracking
are checked against the results in list A, and the bounding boxes are discarded if there
exists a result in list A that has the same bounding box as the bounding box obtained using
tracking. After the server has computed all bounding boxes from list B it sends information
about these bounding boxes back to the client along with results in list A, which contains
partial results for the frames in the batch. The client stores these partial results and uses the
bounding boxes’ information to prepare frames for the next iteration. For all the bounding
boxes in the same frame the client takes a union of the bounding boxes and blacks everything
in the original frame except the regions covered by the union of the bounding boxes. The
client does this for every frame for which there is a bounding box requested by the server.
The client compresses and encodes the frames in a higher resolution and sends them to the
server. Like the low resolution, high resolution is also a knob that can be set dynamically
for adaptation. The server receives and decodes the higher resolution frames and runs object
detection on them. The results from this object detection are sent back to the client which
merges the results of the second iteration along with the partial results sent during the first
iteration to obtain the final object detection results for frames in the batch. The algorithm
for this process is given below (Algorithm 1). The track statement in Algorithm 1 calls the
track routine from Algorithm 2.
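The server's split of detections into list A (confirmed) and list B (needs a second look) is a simple two-threshold partition. A sketch with illustrative threshold values; the real thresholds are tunable knobs:

```python
def partition_detections(detections, min_thres=0.3, max_thres=0.8):
    """Split detections into list A (confidence above max_thres: accepted
    as-is) and list B (confidence between the thresholds: re-examined in
    high resolution). Detections below min_thres are discarded.
    Each detection is assumed to be a dict with a 'confidence' key."""
    list_a, list_b = [], []
    for det in detections:
        if det["confidence"] > max_thres:
            list_a.append(det)
        elif det["confidence"] > min_thres:
            list_b.append(det)
        # everything else: too unlikely to be worth a second look
    return list_a, list_b
```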
Another important check that the server performs while compiling the list of regions that
it thinks are important is ensuring that the size of a region of interest is not greater
than a certain percentage of the frame size, because such regions might result in the majority of the
frame being sent back in higher resolution, resulting in higher bandwidth consumption. The
likelihood of objects that large not being confidently recognized during the low-resolution
iteration is quite low; hence such regions can safely be left out during the second iteration.
However, the maximum and minimum thresholds along
16
Data: lowRes, highRes, maxThres, minThres, maxSize
Result: boundingBoxes for each frame
acceptedResults = ∅, regions = ∅
lowResVideo = getVideo(lowRes)
lowResResults = runDNN(lowResVideo)
if lowResResults is ∅ then
    regions = getVideo(highRes)
else
    for result in lowResResults do
        if result.confidence > maxThres then
            acceptedResults = acceptedResults ∪ result
        end
    end
    for result in lowResResults do
        if result.confidence > minThres and result.boxSize < maxSize then
            regions = regions ∪ result
            regions = regions ∪ track(result, acceptedResults, lowResVideo)
        end
    end
end
highResCroppedVideo = getCroppedFrames(highRes, regions)
boundingBoxes = runDNN(highResCroppedVideo)
return boundingBoxes

Algorithm 1: Algorithm for the iterative process. track calls Algorithm 2.
Data: result, acceptedResults, lowResVideo, trackerLen, maxSize
Result: regions of interest
regions = ∅, frameNum = result.frameNum
for f = frameNum to frameNum + trackerLen do
    newResult = trackingAlgo(f, result)
    if newResult not in acceptedResults and newResult.boxSize < maxSize then
        regions = regions ∪ newResult
    end
end
for f = frameNum to frameNum − trackerLen do
    newResult = trackingAlgo(f, result)
    if newResult not in acceptedResults and newResult.boxSize < maxSize then
        regions = regions ∪ newResult
    end
end
return regions

Algorithm 2: Algorithm for the tracking phase of the server. trackingAlgo tracks result forward (first loop) and backward (second loop) from its frame.
4.2 Implementation
Figure 4.1 shows the key components and interfaces of the DDS protocol. The protocol is
amenable to a variety of video analytics applications; in principle, it can be used with any
camera capturing a feed and a DNN running on a remote server.

Client Side DDS: As in other video analytics pipelines, the video feed is encoded by the
camera into a sequence of high quality frames using an H.264 encoder; our implementation
uses ffmpeg. DDS assigns frame IDs to the frames, which it uses to fulfill the regions-of-interest
queries (for the second iteration) sent by the server. The camera-facing API is also responsible
for changing the resolution, cropping, and re-encoding the frames as required by the algorithm
described above. DDS also keeps the frames in a batch in memory until the
two iterations for the batch have been completed.
Server Side DDS: The server-side code of DDS queries the DNN and retrieves object
detection results through the DNN-facing API. Instead of relying on the pretrained DNN
to actively control the streaming, DDS passively queries the DNN through this API, and
the interface returns the results in a format that DDS can consume. In our implementation
the DNN returns the results as a key-value map in which the keys are frame IDs and each
value is a list of objects detected in the frame. The result for each object is stored as a
3-tuple whose elements are the frame ID, the confidence, and the bounding box coordinates.
The server-side code uses these 3-tuples to track the objects they represent with a high
speed tracker based on kernel correlation filters (KCF) [10]. We use the KCF tracker
because, to the best of our knowledge, it provides a good balance between tracking time
and accuracy; another tracking algorithm, based on multiple instance learning [5], is more
accurate but significantly slower than KCF [10].
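The result format described above might look like the following in Python; the field layout and helper name are illustrative assumptions, not the actual interface.

```python
# Each detection is a 3-tuple of (frame_id, confidence, bounding_box), and the
# DNN-facing API returns a map from frame ID to a list of such tuples.
# Names and box format here are hypothetical, chosen only to mirror the text.

def merge_results(partial, second_pass):
    """Merge iteration 1's partial results with iteration 2's detections."""
    merged = {}
    for frame_id in set(partial) | set(second_pass):
        merged[frame_id] = partial.get(frame_id, []) + second_pass.get(frame_id, [])
    return merged
```

A map keyed by frame ID makes the final merge on the client a simple per-frame concatenation, which matches the two-iteration structure of the protocol.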
Figure 4.1: Components and Interfaces of DDS
Network: DDS maintains two TCP connections between the client and the server. One
connection, C1, is used for sending encoded (cropped) video data from the client to the
server. The other connection, C2, is used for receiving control or feedback messages along
with the results from the server. We implement two types of control signals. A ‘heartbeat’
is sent by both the client and the server to provide status updates; heartbeats also serve
as keep-alive messages that keep the connections open. The second type of control message
is a ‘request’ for the regions that the server needs for the second iteration of the protocol.
Upon receiving a request from the server, the client retrieves the corresponding frames from
local storage, performs cropping, downsizing, and compression using ffmpeg, and sends the
resulting encoded video to the server over C1.
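One plausible way to frame the heartbeat and request messages on C2 is a simple length-prefixed encoding; this is a sketch of the idea, not the wire format DDS actually uses.

```python
import json
import struct

# Hypothetical framing for the control channel (C2): a 1-byte message type,
# a 4-byte big-endian payload length, then a JSON payload.
MSG_HEARTBEAT, MSG_REQUEST = 0, 1

def pack_msg(msg_type, payload):
    """Serialize a control message for transmission over the socket."""
    body = json.dumps(payload).encode()
    return struct.pack("!BI", msg_type, len(body)) + body

def unpack_msg(data):
    """Parse one framed control message back into (type, payload)."""
    msg_type, length = struct.unpack("!BI", data[:5])
    return msg_type, json.loads(data[5:5 + length].decode())
```

Explicit length prefixes matter here because TCP is a byte stream: without framing, a region request could be split or coalesced with a heartbeat at the receiver.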
Optimizations: DDS’s iterative process requires repeated fetching of short flows. If the
connections were closed and reopened for every flow, we would suffer TCP’s slow-start
phase each time, which can increase delay significantly. To avoid slow start, we use
persistent TCP sockets, which maintain a long-lived congestion control session. Additionally,
instead of running inference on low resolution images one by one, the server stitches images
together based on their size and runs inference on several images in one pass, decreasing
the inference time in the low resolution phase of the protocol. The same optimization could
be applied to the high resolution phase as well, but the current implementation only uses
stitching for the first phase. As mentioned earlier, tracking is done in parallel to speed up
the tracking phase of batch processing. With a clear understanding of the DDS protocol, we
can now compare the division of work between the camera, the client-side logic, and the
server-side logic. Table 4.1 summarizes the responsibilities of the components of the
streaming stack for each protocol.
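The stitching optimization can be sketched as tiling same-size frames into one canvas, running inference once, and mapping each detection back to its source frame by offset; the grid layout and function names below are assumed simplifications, not the server's actual code.

```python
import numpy as np

def stitch(frames, cols):
    """Tile same-sized frames into a grid so one inference pass covers all."""
    h, w = frames[0].shape[:2]
    rows = -(-len(frames) // cols)  # ceiling division
    canvas = np.zeros((rows * h, cols * w, 3), dtype=frames[0].dtype)
    for i, f in enumerate(frames):
        r, c = divmod(i, cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = f
    return canvas

def unstitch_box(box, frame_shape, cols):
    """Map a detection (x, y, w, h) on the canvas back to (frame index, box)."""
    h, w = frame_shape[:2]
    x, y, bw, bh = box
    r, c = y // h, x // w
    return r * cols + c, (x - c * w, y - r * h, bw, bh)
```

One caveat this sketch glosses over: a detection that straddles a tile boundary cannot be attributed to a single frame, which is one reason the optimization needs further study.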
           Camera           Client-Side                          Server-Side
Glimpse    Produce frames   Heuristics (differencing),           DNN detection
                            tracking, cache management
Vigil      Produce frames   Heuristics (NN), tracking,           DNN detection
                            cache management
AWStream   Produce frames   Encode and compress frames           DNN detection
DDS        Produce frames   Encode and compress frames,          DNN detection, tracking
                            spatially crop frames,
                            batch management

Table 4.1: Responsibilities of different parts of the Analytics Pipelines
CHAPTER 5
EVALUATION
We evaluate the performance of DDS with respect to other baselines. By evaluating the
end-to-end performance of DDS and through microbenchmarking, we show that:
• DDS shows promising results, improving the bandwidth-accuracy tradeoffs over recent
video analytics pipelines as well as traditional video streaming.
• The optimization on each control knob contributes a substantial fraction of DDS’s
improvement.
• DDS achieves stable bandwidth-accuracy tradeoffs even in the presence of dynamic
network conditions.
• Compared to baselines, DDS introduces only marginal computing overhead on the
server/client sides and inference delay.
5.1 Methodology
Dataset: The object detection model detects the vehicle class. We use twenty-seven videos
from the KITTI dataset [12], each at 10 frames per second.
Server/Client Setup: We use 30% as the low confidence threshold, 80% as the high
confidence threshold, and a batch size of 15 for our experiments. We run the client and server
on a Google Cloud VM with an Nvidia P100 GPU [3], 16 GB of RAM, and a 4-core Haswell
CPU. For delay measurements the client runs on a laptop, using ffmpeg and the DDS client
to encode videos and communicate with the server. The code for the client and the server
is written in Python, and for the tracking algorithm we use the KCF tracker implementation
from OpenCV [6]. The DNN-facing API is written using TensorFlow [32] with
ResNet 101 as the DNN model [14]. The implementation of DDS used for evaluation only
includes the parallel tracking optimization; the stitching optimization was not included in
the implementation used to obtain the measurements, for reasons discussed in §6.
Baseline setup: For our evaluation we use Glimpse, Vigil, and AWStream as baselines.
For Glimpse [8] we use the frame difference detector described in the paper along with
the parameter values used there. For Vigil [38], we use MobileNet SSD [19], with a mean
average precision of 21 [32], as the less accurate but fast client-side neural network. For
AWStream [36] we try multiple ffmpeg configurations and choose the best one to represent
AWStream’s result. For our experiments, the ffmpeg-based AWStream and DDS were allowed
to change only the resolution of the frames. The client-side and server-side logic for all of
the baselines ran on the same machines that were used for DDS.
5.2 Accuracy vs Bandwidth
We present results for the accuracy/bandwidth trade-off of our system on a set of four videos
from the KITTI dataset. Figure 5.1 shows the change in accuracy/bandwidth with different
configurations. Each point on the plots corresponds to a different value of the high and low
resolution control knob. We can see that even under different configurations the F1 score
remains largely stable while the bandwidth consumption varies significantly. The points to
the top left represent the best possible tradeoff. This shows the potential of a server driven
protocol: as we will demonstrate later, accuracy can be kept largely stable in the face of
fluctuations in available bandwidth. The plots also show that the values of the control knobs
must be chosen carefully to achieve the best accuracy/bandwidth trade-off.
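The F1 score reported in these plots is conventionally computed by matching detections to ground truth boxes via intersection-over-union (IoU); the greedy matcher and 0.5 IoU threshold below are common conventions, assumed here rather than taken from the actual evaluation code.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def f1_score(detections, ground_truth, thresh=0.5):
    """F1 from greedy one-to-one matching of detections to ground truth."""
    unmatched = list(ground_truth)
    tp = 0
    for d in detections:
        best = max(unmatched, key=lambda g: iou(d, g), default=None)
        if best is not None and iou(d, best) >= thresh:
            tp += 1
            unmatched.remove(best)  # each ground truth box matches at most once
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

F1 balances missed objects (false negatives) against spurious detections (false positives), which is why it is a natural single-number accuracy metric for this tradeoff study.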
We achieve 2x greater accuracy than the client-side heuristics based baselines. Against
AWStream, a client-side adaptation based protocol, we are able to
[Figure: four plots of F1 accuracy against bandwidth (Kbps), comparing MPEG, DDS, Vigil, and Glimpse.]
Figure 5.1: F1 Against Bandwidth for various configurations
perform slightly better than the best configuration AWStream could use, while consuming
no more bandwidth than the baselines.
To better illustrate the results, we extend our experiments to a set of 22 videos and use
a DDS simulation to draw the 1-σ ellipse (mean and one standard deviation) of the best
configuration points of DDS and the baselines across videos. The first plot in Figure 5.2
shows that in the average case DDS provides better accuracy while using less bandwidth,
and DDS consistently provides high accuracy (F1 score > 0.90) across all videos. The
second plot in Figure 5.2 shows the good cases, in which DDS outperforms all baselines
both in terms of bandwidth consumption and accuracy. A typical characteristic of these
24
0 250 500 750 1000 1250 1500 1750 2000Bandwidth (Kbps)
0.0
0.2
0.4
0.6
0.8
1.0F1
Sco
reAll Videos
AWStreamDDSVigilGlimpse
0 250 500 750 1000 1250 1500 1750 2000Bandwidth (Kbps)
0.0
0.2
0.4
0.6
0.8
1.0
F1 S
core
Good Cases
AWStreamDDSVigilGlimpse
Figure 5.2: 1-σ ellipse for DDS and Baselines
videos is that the objects in the frames are far apart from each other and they do not occlude
each other. Furthermore, the bounding boxes for the objects do not overlap.
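The 1-σ ellipse in Figure 5.2 summarizes each scheme's per-video (bandwidth, F1) points by their mean and covariance. A sketch of deriving the ellipse parameters, assuming the standard eigen-decomposition construction rather than the plotting code actually used:

```python
import numpy as np

def sigma_ellipse(points):
    """Return (center, axis lengths, rotation in radians) of the 1-sigma ellipse."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # One standard deviation along each principal axis of the covariance.
    axes = np.sqrt(eigvals)
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])
    return center, axes, angle
```

The ellipse center reflects the average tradeoff a scheme achieves, while the axis lengths show how consistently it achieves that tradeoff across videos.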
Unfortunately there are videos for which AWStream performs slightly better than DDS.
The results from these videos are shown in Figure 5.3. For these videos, although DDS
performs better than Vigil and Glimpse, AWStream is able to beat DDS in terms of accuracy
and bandwidth utilization. The main characteristic of such videos is that these are largely
empty with a small number of objects of interest in frames. The reason DDS does worse
in these scenarios is that during the low resolution phase, the DNN reports false positives
with confidence above the minimum threshold, so the bounding boxes for these false positives
are requested during the second iteration of DDS, increasing bandwidth utilization. Another
important characteristic of these videos is that the objects that appear in them are often
occluded by other objects. In addition, the lighting conditions under which these objects
are detected change frequently within the video (e.g., an object moving from direct sunlight
into the shade of a tree). This causes problems at several levels of the protocol. First and
foremost, when performing object recognition on low resolution images, an object detected
in one frame might not be detected in the next frame even though it is still present. As
mentioned earlier, tracking is used to rectify this mistake.
[Figure: F1 score against bandwidth (Kbps) for the worse cases, comparing AWStream, DDS, Vigil, and Glimpse.]
Figure 5.3: Worse Cases for DDS
However, tracking algorithms are susceptible to failure when an object moves from one
lighting condition to another, so in these videos tracking is unable to follow such objects.
As a result of the tracking failure, the object’s bounding box is not added to the regions
of interest and hence is not requested from the client in high resolution. This means that
the mistake made by DDS in low resolution can never be rectified.
5.3 Delay
We measure the delay for DDS and all baselines across five videos and report the average
delay. Delay is defined as the average time for the results of a frame to be finalized. In the
context of our measurements this is the average amount of time required to process a batch
of 15 frames. There is an important distinction in the way the delay measurement is made
in the Glimpse and Vigil papers’ evaluations. The delay in the original papers is defined
as the interval between the production of a frame and the appearance of that frame’s results
on screen. In both of these protocols there are caches that are used to store frames while
waiting for the result of a prior frame from the server. Once the result is received both Vigil
and Glimpse track the objects in the result through frames in the cache and show bounding
boxes on the latest frame. They report the interval between the production of the latest
[Figure: bar chart of delay (s) for DDS, AWStream, Vigil, and Glimpse.]
Figure 5.4: Batch Processing Time (Delay)
frame and the time it takes to finalize the results for this frame using tracking. Hence,
this interval ends up being equal to the time it takes to track objects through the frames
in the cache. In contrast, the delay we report includes the time required to process each
and every frame (even the ones in the cache).
We can see that Vigil and Glimpse have a far greater delay than DDS and AWStream.
This is because their client-side logic runs on a computationally constrained device. Due
to this constraint, Vigil and Glimpse cannot parallelize the most time consuming part of
the client-side logic, tracking. Furthermore, they also pay a cache management overhead.
Since tracking through each and every frame in the cache would not allow Vigil and Glimpse
to catch up to the latest frame, they include extra functions in their logic to select frames
intelligently from the cache; running these functions adds non-trivial overhead to both
systems.
DDS and AWStream take about the same time to process the batch. DDS takes a little
more time because while AWStream does inference just once and does not do any tracking,
DDS has to track objects along with performing an extra round of inference in the high
resolution phase of the protocol. Tracking objects in parallel greatly reduces the delay due
to tracking. Adding the stitching optimization can provide further reduction in the delay
(this is discussed further in §6).
5.4 Microbenchmark
Next we look at the resource utilization for each of these schemes. It is important to recall
the division of labour within these protocols. This helps us appreciate the resource utilization
measurements for these schemes.
                     DDS   AWStream   Glimpse   Vigil
Client CPU (%)        22         18        38      82
Server CPU (%)        62         20        18      18
Memory (# frames)     15          1        ∼8      ∼8

Table 5.1: Resource Utilization
The resource utilization measurements in Table 5.1 are fully consistent with the division
of labor shown in Table 4.1. The client-side DDS logic uses more CPU than AWStream’s
because it also has the responsibility of cropping images, while the CPU utilization of
Glimpse and Vigil is much greater because of their client-side heuristics. Notably, Vigil has
significantly higher client CPU utilization because of its use of a neural network. On the
server side, DDS uses the most CPU because tracking has been shifted to the server, and
the parallelization optimization further increases server-side CPU utilization. In contrast,
the only responsibility of the server in the baseline protocols is object recognition. However,
since the server does not run in a constrained environment, it is much better to spend
resources on the server side than on the client side.
Memory utilization for DDS is much higher than the baselines’. Because the server might
request regions from the frames it is currently processing, the DDS client must keep the
frames of the batch in memory. Since the batch size in our experiments was 15, DDS has to
keep 15 frames in memory at any given time. Both Glimpse and Vigil save all frames
produced between the time they send a frame to the server and the time they receive the
results for that frame; the number of frames saved depends on how long it takes to receive
a response from the server. In our experiments both Glimpse and Vigil saved roughly
8 frames in their caches. AWStream uses the least memory of all the protocols: because it
does not need to run tracking or respond to server feedback, it stores no frames other than
the one it is about to send to the server.
5.5 Adaptation
Figure 5.5 shows the change in F1 score across time while there are fluctuations in available
bandwidth.
The experiment has three intervals. During the first interval, DDS has a large amount of
available bandwidth and so chooses the configuration that provides the best F1 score. In
the second interval the available bandwidth decreases, which DDS recognizes from the
increased response delay; to cope, DDS chooses a different configuration that uses less
bandwidth while providing the highest possible accuracy. In the third interval, the available
bandwidth increases but does not return to its original level, so DDS chooses an intermediate
configuration and recovers some accuracy relative to the second interval. This shows the
potential advantages and the amenability of DDS to adaptation schemes.
The methodology for this experiment was to use the first several segments to profile the
performance of each configuration; after profiling, DDS was run on the rest of the video,
deciding which configuration to use based on the response time. Figure 5.5 shows results
from a single video with fairly consistent content.
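The profile-then-select strategy used in this experiment can be sketched as follows; the profile-table fields and the halving heuristic are simplified assumptions for illustration, not the exact logic of the experiment.

```python
def choose_config(profiles, measured_delay, target_delay):
    """Pick the most accurate profiled configuration whose cost seems affordable.

    profiles: list of dicts with 'bandwidth', 'f1', 'low_res', 'high_res'
    (a hypothetical profile table built from the first few segments).
    If responses arrive slower than the target, assume available bandwidth
    shrank and restrict the search to cheaper configurations.
    """
    candidates = sorted(profiles, key=lambda p: p["bandwidth"])
    if measured_delay > target_delay:
        # Keep only the cheaper half of the configuration space.
        candidates = candidates[:max(1, len(candidates) // 2)]
    return max(candidates, key=lambda p: p["f1"])
```

Response delay serves here as an indirect signal of available bandwidth, since the client cannot observe accuracy online; that limitation is exactly what motivates the better feedback measures discussed in §6.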
This simple profiling method, however, is not expected to generalize, because most videos
do not have consistent content. We therefore need to design a better way to adapt to changes in
[Figure: F1 score over time as available bandwidth shifts from 200 KBps to 100 KBps to 150 KBps.]
Figure 5.5: DDS maintains high accuracy in presence of bandwidth fluctuations
available bandwidth. There are further challenges associated with this strawman profiling
approach; they are discussed in §6.
CHAPTER 6
FUTURE WORK
While we have shown the potential gains for using a server driven streaming protocol, there
remains work that needs to be done.
Investigate the bad cases. As shown in the evaluation section there are certain kinds
of videos for which DDS does not perform well. We need to develop a greater understanding
of the reason for the poor performance. Our investigation so far has revealed that confi-
dence threshold in conjunction with the type of scene in the video plays a huge role in the
performance of DDS on these videos. The videos for which DDS does not perform well have
objects moving through different regions of brightness. In such frames the DNN is suscepti-
ble to giving false positives. If the confidence of these false positives is greater than the low
confidence threshold then the server requests the bounding boxes for these false positives in
higher resolution, consuming high bandwidth. A way forward would be to find the
distribution of objects detected in low resolution in each confidence range, along with the
distribution of false positives in each confidence range, and to confirm whether these
distributions are roughly the same for the worse case videos. Based on these distributions
we can then decide the optimal values for the low and high confidence thresholds.
Study the effect of stitching on the overall system. So far we have studied the
effect of stitching images on accuracy in isolation. As shown in Figure 6.1, stitching
images gives much better results when the original resolution of the images is low, and as
the image resolution increases, the gap between stitched and non-stitched results increases.
These results were obtained from an experiment using a set of 20 vehicle images from
ImageNet [9] along with 2 videos from the KITTI dataset [12]. These results, while
promising, need further investigation; we need to empirically demonstrate that stitching
does not significantly deteriorate detection accuracy for images with a variety of different
features.
Figure 6.1: Impact on accuracy of stitching 4 images together
Develop a method for dynamic adaptation. Figure 5.5 shows the potential benefits
of adding adaption to DDS. Developing a more sophisticated and robust dynamic adaptation
mechanism would allow DDS to perform well in the presence of bandwidth fluctuations and
change in video content. Before we can add dynamic adaptation to DDS, we will need to
answer some questions. Essentially, we need to develop feedback measures that provide the
adaptation mechanism with real-time information about the performance of DDS. This is a
non-trivial problem because DDS does not have ground-truth accuracy information under
the current configuration, which means we need other measures that can serve as a proxy
for accuracy. This would differ from the adaptation done in AWStream [36], which adapts
to match changes in bandwidth; in DDS we will adapt for response delay and accuracy as
well. Furthermore, adaptation is also necessary on the client side, which runs an encoder
in an energy constrained environment. CoAdapt [17] bears direct relevance to the adaptation
requirements of DDS: it coordinates accuracy aware applications, which have
accuracy/performance tradeoffs, with power aware systems, which expose power/performance
tradeoffs. DDS has both accuracy aware and power aware aspects as separate components
of the overall system. In summary, adding adaptation for energy would also allow DDS to
operate in energy aware setups that work under a given energy budget.
Investigate other video encoding and compression knobs. Currently we are only
looking at frame resolution as a knob. We need to investigate the effect of changing the
quantization parameter. We can look at a cross product of quantization parameter and
resolution to increase the configuration space. This would also allow us to perform a direct
comparison between DDS and AWStream (since quantization parameter is one of the knobs
in AWStream). However, this is not quite straightforward. As shown in recent works [31]
using compression can impact the accuracy of the DNN in unexpected ways. So we will need
to study the impact of change in the quantization parameter and bitrate on the performance
of DDS in more detail. Furthermore, while adding more compression can help us decrease
bandwidth consumption, it will also change the time it takes to compress frames, so delay
will need to be considered in addition to overall accuracy.
Incorporating client side logic. Existing analytics pipelines either use client side
heuristics based on pixel level difference ([8, 21]) or visual cues ([38]) to determine an im-
portant frame and send that frame to the server. Upon receiving results from the server
the client side logic propagates the results to the frames that were not sent to the server.
Based on the general design of DDS, one can easily incorporate client-side logic into DDS to
add another layer of frame selection. As future work we need to develop a natural way to
integrate client-side logic into DDS as an extension of the pipeline, and then investigate
whether such an extension provides noticeable benefits.
CHAPTER 7
RELATED WORK
DDS attempts to bridge the gap between deep video analytics and video streaming via a novel
iterative process that allows it to rectify its mistakes. Here, we discuss the most closely
related work from several directions.
Object Detection and Tracking: Vision research has a deep literature on object
detection [25] and tracking [35] and more recently deep learning based approaches are gaining
traction ([28, 19]). Most deep learning based approaches are slow enough that they cannot
keep up with frames produced in real time. Hence, recent research has also focused on
carrying information from one frame to another using a relatively faster method (e.g., with
the help of tracking). In [16], deep feature extraction is run on a key frame and the
impression feature is propagated down to the next key frame. Similarly, work has been done
to generate feature maps with a joint convolutional recurrent unit formed by combining a
standard convolutional layer with an LSTM [23]; the recurrent layers propagate temporal
cues across frames, allowing the network to access a progressively greater quantity of
information as it processes the video. Furthermore, there is growing interest in making
DNN architectures more scalable [37]. All of these advances can be used to improve the
capability of the server side of DDS, allowing the server to detect regions of interest more
accurately while reducing delay.
Iterative rectification: Iterative rectification of results over unlabeled data using active
learning has shown promising results in several computer vision tasks related to object
classification and scene understanding [22, 33]. These works focus on processing a small
amount of information and determining the best regions to process in subsequent steps.
Advances in this area can help the iterative pipeline of DDS by allowing it to determine the
best regions to retrieve in higher resolution, and may help solve the bad cases for DDS
discussed in §5.
Dynamic Adaptation: Modern computing systems must provide a certain quality of
service with minimal energy in the presence of fluctuations that impact their performance.
Metronome [29] provides a framework that extends operating systems with self-adaptive
capabilities. Recent work in dynamic adaptation [20, 17, 18] has shown that adaptation
can be applied to general applications (including streaming applications such as H.264
encoders and ffmpeg) to optimize an objective function while meeting constraints. In [11],
the authors extend the idea of general adaptation to allow optimization of multiple
objectives. Furthermore, [17] provides a system that coordinates accuracy aware applications
with power aware systems, which usually have competing objectives. CALOREE [24]
provides a control system, whose parameters are defined by a machine learning framework,
that performs adaptation in a dynamic environment; it learns control parameters to meet
latency requirements with minimal energy in systems that have complex interactions with
hardware under unpredictable changes in the operating environment and inputs. All of the
aforementioned works have direct relevance to the adaptation requirements of DDS. As in
[20], adaptation in DDS must be applied to a general purpose application, a video encoder,
on the client side; and as in [17], the objectives of the client side must be coordinated with
the overall objectives of the DDS pipeline, i.e., maximizing accuracy and minimizing latency.
Hence, prior research can be leveraged for the adaptation requirements of DDS.
Video streaming: Work in this area has yielded better compression gains in recent
encoding standards. Furthermore, advances in scalable coding [26] and region-of-interest
encoding [7] provide the most direct benefit to DDS: region-of-interest encoding requires the
viewer to specify a region of interest, while scalable coding allows efficient utilization of
bandwidth. Unfortunately, these video compression approaches are content-agnostic; this
is perhaps the side of the literature that is lacking the most. DDS focuses on a fundamentally
different target, as delineated in the previous sections. More research is needed on streaming
video to targets that are computational logic rather than human beings.
CHAPTER 8
CONCLUSION
The use of deep-learning for video analytics is on the rise. We present DDS, a way to make
efficient use of cloud computing resources for live video analytics. Our analytics pipeline pro-
vides high accuracy while making efficient use of available bandwidth. We base our protocol
on two key insights, 1) streaming to computation logic presents many unique opportunities
compared to traditional video streaming protocols, and 2) using the deterministic nature
of computational logic, we can exploit these opportunities by obtaining feedback from the
server and then acting upon this feedback by sending only important regions from frames
in higher resolution. We provide solutions to the inherent problems of such a server-driven
iterative protocol. To further demonstrate the practicality of our proposed protocol, we
provide a complete implementation. Our end-to-end evaluation makes use of this
implementation and shows improvements in both accuracy and bandwidth utilization over
the baselines in the average case. The evaluation, in general, validates our initial hypothesis
that incorporating server feedback into a live video analytics pipeline can provide tangible
accuracy/bandwidth benefits. However, there still remain cases that need further
investigation to improve the general performance of DDS. Nonetheless, our evaluation shows
the potential benefits of a custom streaming stack that uses an iterative approach based on
server feedback.
REFERENCES
[1] EC2 instance pricing, Amazon Web Services (AWS).
[2] Google Compute Engine pricing.
[3] NVIDIA Tesla P100: The most advanced data center accelerator.
[4] Pricing, Linux virtual machines, Microsoft Azure.
[5] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Robust object tracking with online multiple instance learning. 2011.
[6] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
[7] Mingliang Chen, Weiyao Lin, and Xiaozhen Zheng. An efficient coding method for coding region-of-interest locations in AVS2. CoRR, abs/1503.00118, 2015.
[8] Tiffany Yu-Han Chen, Hari Balakrishnan, Lenin Ravindranath, and Paramvir Bahl. Glimpse: Continuous, real-time object recognition on mobile devices. GetMobile: Mobile Comp. and Comm., 20(1):26–29, July 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[10] F. Feng, X. Wu, and T. Xu. Object tracking with kernel correlation filters based on mean shift. In 2017 International Smart Cities Conference (ISC2), pages 1–7, Sep. 2017.
[11] Antonio Filieri, Henry Hoffmann, and Martina Maggio. Automated multi-objective control for self-adaptive software design. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 13–24, New York, NY, USA, 2015. ACM.
[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[13] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. MCDNN: An approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’16, pages 123–136, New York, NY, USA, 2016. ACM.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[15] Alex Hern. Computers are now better than humans at recognising images, May 2015.
[16] Congrui Hetang, Hongwei Qin, Shaohui Liu, and Junjie Yan. Impression network for video object detection. CoRR, abs/1712.05896, 2017.
[17] H. Hoffmann. CoAdapt: Predictable behavior for accuracy-aware applications running on power-aware systems. In 2014 26th Euromicro Conference on Real-Time Systems, pages 223–232, July 2014.
[18] Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, and Martin Rinard. Dynamic knobs for responsive power-aware computing. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 199–212, New York, NY, USA, 2011. ACM.
[19] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[20] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann. Poet: a portable approachto minimizing energy under soft real-time constraints. In 21st IEEE Real-Time andEmbedded Technology and Applications Symposium, pages 75–86, April 2015.
[21] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. Opti-mizing deep cnn-based queries over video streams at scale. CoRR, abs/1703.02529,2017.
[22] Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, andJan Kautz. Iamnn: Iterative and adaptive mobile neural network for efficient imageclassification. CoRR, abs/1804.10123, 2018.
[23] Mason Liu and Menglong Zhu. Mobile video object detection with temporally-awarefeature maps. CoRR, abs/1711.06368, 2017.
[24] Nikita Mishra, Connor Imes, John D. Lafferty, and Henry Hoffmann. Caloree: Learn-ing control for predictable latency and low energy. In Proceedings of the Twenty-ThirdInternational Conference on Architectural Support for Programming Languages and Op-erating Systems, ASPLOS ’18, pages 184–198, New York, NY, USA, 2018. ACM.
[25] P. K. Mishra and G. P. Saroha. A study on video surveillance system for object detectionand tracking. In 2016 3rd International Conference on Computing for Sustainable GlobalDevelopment (INDIACom), pages 221–226, March 2016.
[26] J. . Ohm. Advances in scalable video coding. Proceedings of the IEEE, 93(1):42–56, Jan2005.
[27] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You onlylook once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
39
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detectionwith region proposal networks. IEEE Transactions on Pattern Analysis and MachineIntelligence, 39(6):1137–1149, June 2017.
[29] F. Sironi, D. B. Bartolini, S. Campanoni, F. Cancare, H. Hoffmann, D. Sciuto, andM. D. Santambrogio. Metronome: Operating system level performance managementvia self-adaptive computing. In DAC Design Automation Conference 2012, pages 856–865, June 2012.
[30] Thomas Stockhammer. Dynamic adaptive streaming over http –: Standards and de-sign principles. In Proceedings of the Second Annual ACM Conference on MultimediaSystems, MMSys ’11, pages 133–144, New York, NY, USA, 2011. ACM.
[31] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, IanGoodfellow, and Rob Fergus. Intriguing properties of neural networks. In InternationalConference on Learning Representations, 2014.
[32] Tensorflow. tensorflow/models.
[33] J. Wang, O. Russakovsky, and D. Ramanan. The more you look, the more you see:Towards general object understanding through recursive refinement. In 2018 IEEEWinter Conference on Applications of Computer Vision (WACV), pages 1794–1803,March 2018.
[34] Josh Woodhouse. Market for storage used for video surveillance worth 1.7bn in 2017,Dec 2017.
[35] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACMComput. Surv., 38(4), December 2006.
[36] Ben Zhang, Xin Jin, Sylvia Ratnasamy, John Wawrzynek, and Edward A. Lee. Aw-stream: Adaptive wide-area streaming analytics. In Proceedings of the 2018 Conferenceof the ACM Special Interest Group on Data Communication, SIGCOMM ’18, pages236–252, New York, NY, USA, 2018. ACM.
[37] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, ParamvirBahl, and Michael J. Freedman. Live video analytics at scale with approximation anddelay-tolerance. In 14th USENIX Symposium on Networked Systems Design and Imple-mentation (NSDI 17), pages 377–392, Boston, MA, 2017. USENIX Association.
[38] Tan Zhang, Aakanksha Chowdhery, Paramvir (Victor) Bahl, Kyle Jamieson, and SumanBanerjee. The design and implementation of a wireless video surveillance system. InProceedings of the 21st Annual International Conference on Mobile Computing andNetworking, MobiCom ’15, pages 426–438, New York, NY, USA, 2015. ACM.
40