EdgeDuet: Tiling Small Object Detection for Edge Assisted Autonomous Mobile Vision
Xu Wang*, Zheng Yang*, Jiahang Wu*, Yi Zhao*, Zimu Zhou‡
*School of Software and BNRist, Tsinghua University, China
‡School of Information Systems, Singapore Management University, Singapore
Motivation
• Various autonomous mobile vision applications
– All deploy HD cameras and must continuously detect objects in complex scenes to make decisions on the go, where small objects are common.
• An ideal object detection engine:
– Accuracy
– Real-Time
– Resource-Efficient
[Figure: robot dogs for patrolling; drones for traffic surveillance; humanoid robots for working. Small objects are common.]
Existing Object Recognition Solutions
• Run a light object detection model locally on-board
– Model compression techniques reduce the workload of deep learning models.
– Low-resolution inputs reduce the consumption of computation and memory resources.
Missing most small objects
Existing Object Recognition Solutions
• Offload the high-resolution video to the edge and run a heavy model
– Accurate detection on the edge ≠ accurate detection on the device
The view changes while high-resolution frames are being uploaded
Existing Object Recognition Solutions
• Pioneering studies leverage a “detect + track” strategy to support real-time object detection and reduce the influence of network delay
– Offload key frames to the edge.
– Track objects in the current frame with cached detection results from previous frames.
• They still face a long network delay each time a high-resolution frame is offloaded
Glimpse, SenSys ’15; EAAR, MobiCom ’19
Question
• Since neither an on-device light model nor an on-edge heavy model alone delivers accurate object detection in autonomous mobile vision, could we join the two systems and offload only small object detection to the edge?
– Large objects incur no transmission delay and can be tracked immediately.
– Only the part of the high-resolution frame that contains small objects needs to be uploaded to the edge, so the transmission delay is low.
– Parallelism is more efficient when the image is split into sub-images without crossing object boundaries.
Question
• What to offload to the edge, and how?
– No prior knowledge about small objects
– Parallel processing
• How to aggregate the detection results and track all objects in real time?
– Duplicate detection results
– Multiple objects to track
Our System
• EdgeDuet: an edge-device collaborative framework for enhancing small object detection with tile-level parallelism
– Low Latency
– High Accuracy
System Overview
Offloading Module
Local Object Detection Module
Real-time Tracking Module
The Offloading Module
• EdgeDuet exploits RoI frame encoding to compress video frames, and content-prioritized tile offloading for highly parallel object detection at the edge.
RoI Frame Encoding
• Goals
– Compress pixel blocks containing small objects in high quality.
– Compress the rest of the frame in low quality.
• Determine blocks containing small objects
• Determine compression levels
RoI Frame Encoding
• Determine blocks containing small objects
– An object is “small” if the local object detector cannot detect it but the remote object detector can.
– We characterize the local detector’s capability on small objects with its recall curve over object size.
– The size threshold of each class is defined as the value below which the recall is less than 90%.
– Approximate locations are estimated from the locations of small objects in the previous frame.
Class-dependent size threshold for small objects
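The threshold rule above can be sketched in a few lines. This is a minimal illustration, assuming the recall curve is available as per-bin recall values over ascending object sizes; the function name `size_threshold`, the bin sizes, and the recall numbers are hypothetical and are not taken from the slides.

```python
from typing import Sequence


def size_threshold(sizes: Sequence[float],
                   recalls: Sequence[float],
                   target: float = 0.90) -> float:
    """Return the smallest size from which recall stays at or above target.

    sizes:   object-size bins sorted in ascending order (hypothetical unit: pixels)
    recalls: recall of the local detector measured in each bin
    Objects smaller than the returned value would be treated as "small".
    """
    if len(sizes) != len(recalls):
        raise ValueError("sizes and recalls must have the same length")
    threshold = float("inf")  # no bin reaches the target: every object is small
    # Walk from the largest bin down and stop at the first bin below target.
    for size, recall in zip(reversed(sizes), reversed(recalls)):
        if recall < target:
            break
        threshold = size
    return threshold


# Hypothetical recall curve for one class.
sizes = [16, 32, 48, 64, 96]
recalls = [0.20, 0.55, 0.85, 0.92, 0.97]
print(size_threshold(sizes, recalls))
```

Running the same function once per class would give one threshold per class, which matches the class-dependent thresholds described above.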
RoI Frame Encoding
• Determine compression levels
– The high-quality level should balance object detection accuracy against the transmitted data size.
– The low-quality level should not be so low that new objects are missed.
– The low-quality level is chosen such that the remote object detector outputs low confidence scores on the compressed blocks but still locates the objects.
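One way to picture the two quality levels is as a per-block quantization map. The sketch below is an assumption-laden illustration, not the authors' encoder code: it assumes 64x64 blocks, boxes given as (x, y, w, h) from the previous frame, and two hypothetical QP values (in HEVC, a lower QP means higher quality). The name `roi_qp_map` is invented for this example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels


def roi_qp_map(frame_w: int, frame_h: int, small_boxes: Sequence[Box],
               block: int = 64,
               qp_high_quality: int = 22,   # hypothetical value
               qp_low_quality: int = 40     # hypothetical value
               ) -> List[List[int]]:
    """Build a rows x cols grid of QP values, one per pixel block."""
    cols = (frame_w + block - 1) // block
    rows = (frame_h + block - 1) // block
    # Start with the whole frame at the low-quality level.
    qp = [[qp_low_quality] * cols for _ in range(rows)]
    for x, y, w, h in small_boxes:
        if w <= 0 or h <= 0:
            continue
        # Range of blocks touched by the box, clamped to the frame.
        c0, c1 = max(0, x // block), min(cols - 1, (x + w - 1) // block)
        r0, r1 = max(0, y // block), min(rows - 1, (y + h - 1) // block)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                qp[r][c] = qp_high_quality
    return qp


# A 10x10 box that straddles the first two blocks of the top row.
for row in roi_qp_map(256, 128, [(60, 10, 10, 10)]):
    print(row)
```

The actual high and low levels would have to be tuned as described above; the two constants here only mark where those tuned values would plug in.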
Content-Prioritized Tile Offloading
• We offload the whole frame in units of tiles
[Figure: frame split into 3x2 tiles]
Content-Prioritized Tile Offloading
• We offload the whole frame in units of tiles
[Figure: per-tile offloading pipeline, one row per tile. Stages: E = Tile Encode, U = Tile Upload, D = Tile Decode, I = Inference, R = BBox Return]
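To see why streaming tiles through the five stages (encode, upload, decode, inference, bounding-box return) can shorten the end-to-end delay, the toy model below compares two schedules. It is only a sketch under strong assumptions: each stage handles one tile at a time, stage durations are made-up numbers in milliseconds, and the frame-level baseline finishes a stage for all tiles before the next stage starts. None of this reflects the authors' measurements.

```python
from typing import Sequence, Tuple

Stages = Tuple[float, float, float, float, float]  # E, U, D, I, R durations


def frame_level_latency(tiles: Sequence[Stages]) -> float:
    """Each stage waits until the previous stage has finished every tile."""
    n_stages = len(tiles[0])
    return sum(sum(t[s] for t in tiles) for s in range(n_stages))


def tile_level_latency(tiles: Sequence[Stages]) -> float:
    """A tile enters a stage as soon as it is ready and the stage is free."""
    n_stages = len(tiles[0])
    stage_free_at = [0.0] * n_stages
    for t in tiles:
        ready = 0.0
        for s in range(n_stages):
            start = max(ready, stage_free_at[s])
            ready = start + t[s]
            stage_free_at[s] = ready
    return stage_free_at[-1]


# Six tiles (as in a 3x2 split) with hypothetical per-stage times in ms.
tiles = [(2.0, 4.0, 1.0, 3.0, 1.0)] * 6
print(frame_level_latency(tiles), tile_level_latency(tiles))
```

In this model the pipelined schedule is bounded by the slowest stage rather than by the sum of all stages, which is the intuition behind tile-level offloading.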
Content-Prioritized Tile Offloading
• Enable tile-level parallelism
– Modify the frame encoding, frame decoding, and object detection stages to eliminate dependencies among tiles.
Content-Prioritized Tile Offloading
• Tile-level encoding
– Current HEVC video encoders support encoding tiles in parallel; however, they do not upload the bitstream until the whole frame is encoded.
– We modify the open-source video encoder Kvazaar [3] to support tile-level encoding.
[3] Kvazaar. https://github.com/ultravideo/kvazaar
Content-Prioritized Tile Offloading
• Tile-level decoding
– Existing video decoders depend on the first tile to locate the other tiles.
– We disguise each tile as a “first tile” by modifying the bitstream in the video encoder and the HEVC parser in the video decoder accordingly.
The output of the video decoder
Content-Prioritized Tile Offloading
• Object Detection
– Performing object detection on each tile separately may miss objects that cross the boundaries of adjacent tiles.
– Overlap-tiling: split each frame into primary tiles and overlap tiles, and group each primary tile with its surrounding overlap tiles for small object detection.
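The effect of grouping a primary tile with its surrounding overlap tiles can be approximated by expanding each tile of the grid by a fixed margin. The sketch below uses that simplification rather than the primary/overlap tile layout itself; the 3x2 grid follows the earlier slide, while the margin value and the names `overlap_regions` and `contains` are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1), x1/y1 exclusive


def overlap_regions(frame_w: int, frame_h: int,
                    cols: int = 3, rows: int = 2,
                    margin: int = 32) -> List[Region]:
    """Return one detection region per tile, grown by `margin` pixels."""
    regions = []
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * frame_w // cols, (c + 1) * frame_w // cols
            y0, y1 = r * frame_h // rows, (r + 1) * frame_h // rows
            regions.append((max(0, x0 - margin), max(0, y0 - margin),
                            min(frame_w, x1 + margin), min(frame_h, y1 + margin)))
    return regions


def contains(region: Region, box: Region) -> bool:
    """True if the box lies completely inside the region."""
    return (region[0] <= box[0] and region[1] <= box[1]
            and box[2] <= region[2] and box[3] <= region[3])


regions = overlap_regions(1920, 1080)
# A 20x20 object sitting on the corner shared by four tiles.
obj = (630, 530, 650, 550)
print([i for i, reg in enumerate(regions) if contains(reg, obj)])
```

Under this simplification, any object no wider or taller than the margin is fully covered by the region of the tile that holds its top-left corner, so a small object on a tile boundary is still seen whole by at least one detector pass.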
Content-Prioritized Tile Offloading
• Enable content-based priority
– Modify the task scheduling module in Kvazaar.
– On receiving a frame to encode, Kvazaar splits the frame into tiles and submits the tasks of each tile to a task queue.
– Add a dynamic priority mapping module to change the order of tasks in the queue.
– The priority is the number of small objects in the corresponding tile group.
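The priority rule can be illustrated outside the encoder with a plain priority queue. This is a sketch of the ordering idea only, assuming small-object centers from the previous frame are available; it does not reproduce Kvazaar's scheduler, and the function names are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def count_small_objects(centers: Sequence[Tuple[float, float]],
                        frame_w: int, frame_h: int,
                        cols: int = 3, rows: int = 2) -> List[int]:
    """Count small-object centers that fall into each tile (row-major order)."""
    counts = [0] * (cols * rows)
    for x, y in centers:
        c = min(cols - 1, max(0, int(x * cols // frame_w)))
        r = min(rows - 1, max(0, int(y * rows // frame_h)))
        counts[r * cols + c] += 1
    return counts


def tile_order(counts: Sequence[int]) -> List[int]:
    """Return tile indices, tiles with more small objects first."""
    # heapq is a min-heap, so negate the count; ties fall back to tile index.
    heap = [(-count, idx) for idx, count in enumerate(counts)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, idx = heapq.heappop(heap)
        order.append(idx)
    return order


centers = [(100, 100), (700, 100), (700, 120), (1900, 1000)]
counts = count_small_objects(centers, 1920, 1080)
print(counts, tile_order(counts))
```

Processing tiles in this order means the tiles most likely to contain small objects are encoded and uploaded first, so their detection results return earliest.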
Local Object Detector
• The local object detector aims to detect medium- to large-sized objects in the video frames locally on the mobile device.
– The local object detector should balance offline accuracy and latency to achieve high online accuracy.
– We choose YOLOv3 FP16 (640x640) as the local object detector.
Performance of local detector on VisDrone dataset
Real-time Tracking
• The module aggregates the offloaded and the local detection results into the cache and tracks all objects with multiple single-object trackers.
– To avoid duplicated results, we drop the local detector’s results for small objects and the remote detector’s results for medium- to large-sized objects.
– Adaptively update the tracking results based on the speed of the objects.
General workflow using multiple single-object trackers
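The de-duplication rule described above amounts to a filter on detection source and object size. The sketch below assumes a per-class size threshold is already known and measures size as the longer box side; both the size measure and the `Detection` type are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Detection:
    cls: str
    x: float
    y: float
    w: float
    h: float
    source: str  # "local" (on-device detector) or "remote" (edge detector)


def merge_detections(detections: Sequence[Detection],
                     thresholds: Dict[str, float]) -> List[Detection]:
    """Keep remote results for small objects and local results for the rest."""
    kept = []
    for d in detections:
        # Assumption: an object is small if its longer side is under the
        # class threshold.
        is_small = max(d.w, d.h) < thresholds[d.cls]
        if is_small and d.source == "remote":
            kept.append(d)
        elif not is_small and d.source == "local":
            kept.append(d)
    return kept


thresholds = {"car": 50.0}  # hypothetical threshold in pixels
detections = [
    Detection("car", 0, 0, 20, 20, "local"),
    Detection("car", 0, 0, 20, 20, "remote"),
    Detection("car", 100, 100, 120, 80, "local"),
    Detection("car", 100, 100, 120, 80, "remote"),
]
for d in merge_detections(detections, thresholds):
    print(d.source, d.w, d.h)
```

Because each object size range is owned by exactly one detector, no box-matching step is needed to remove duplicates before the results are handed to the trackers.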
Experiment
• Dataset
– VisDrone [4]
• Compared Methods
– Glimpse
– EAAR
– LaT
• Network Setting:
– 4G
– Wi-Fi 2.4GHz
– Wi-Fi 5GHz
• Metrics:
– Latency
– IoU Accuracy
[4] Vision Meets Drones: Past, Present and Future.
• 2 x Intel Xeon CPU E5-2560 v4• 2 x GTX 2080ti GPU• 256GB Memory
• iPhone 11 with the A13 bionic chip
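For reference, the intersection-over-union (IoU) underlying the accuracy metric listed above compares a predicted box with a ground-truth box. The sketch below shows the standard IoU computation for two axis-aligned boxes given as (x0, y0, x1, y1); how per-box IoU values are aggregated into the reported accuracy is not specified on these slides, so that step is left out.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # overlapping boxes
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes
```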
Experiment
• Overall Performance
– EdgeDuet notably outperforms the two offloading schemes, Glimpse and EAAR, in both accuracy and latency under all three network conditions.
– LaT is the fastest because it only performs local detection. However, pure local detection has the worst accuracy.
Experiment
• The accuracy of small objects
– EdgeDuet achieves 161.5%, 245.0%, and 292.4% improvements in small object detection accuracy under the three network conditions.
Experiment
• Benefits of Individual Modules in EdgeDuet
– EdgeDuet has a smaller frame size than EAAR and Glimpse.
– EdgeDuet achieves 12.2% and 5.1% latency improvements over Frame-Level and Tile-Level, respectively.
– EdgeDuet improves the overall accuracy by 4.2% with adaptive tracker configuration.
[Figure panels: RoI Frame Encoding; Content-Prioritized Tile Offloading; Adaptive tracker]
Conclusion & Contribution
• EdgeDuet is the first framework that enhances small object detection in crowded scenes via collaboration between the edge and the mobile device.
• We push the state-of-the-art offloaded object detection studies from task-level parallelism to tile-level parallelism, which notably reduces the offloading latency. EdgeDuet is a systematic design that enables accurate, real-time object detection on mobile devices even in the case of low network bandwidth.
• We implement EdgeDuet as a cross-platform framework. Evaluations on VisDrone show that EdgeDuet improves the overall accuracy by 44.7% and the end-to-end latency by 34.2% over the state-of-the-art object detection offloading schemes.