Computer Vision-based Solution to Monitor
Earth Material Loading Activities
by
Ehsan Rezazadeh Azar
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Department of Civil Engineering
University of Toronto
Toronto, Ontario
© Copyright by Ehsan Rezazadeh Azar (2013)
Computer Vision-based Solution to Monitor Earth
Material Loading Activities
Ehsan Rezazadeh Azar
Doctor of Philosophy
Department of Civil Engineering
University of Toronto
2013
Abstract
Large-scale earthmoving activities make up a costly and air-polluting aspect of many
construction projects and mining operations, which depend entirely on the use of heavy
construction equipment. The long-term jobsites and manufacturing nature of the mining
sector have encouraged the application of automated control systems, most notably GPS,
to manage the earthmoving fleet. Computer vision-based methods are another potential
tool to provide real-time information at low cost and to reduce human error on surface
earthmoving sites, as relatively clear views can be selected and the equipment offers
recognizable targets. Vision-based methods have some advantages over positioning devices
as they are not intrusive, provide detailed data about the behaviour of each piece of
equipment, and offer reliable documentation for future reviews. This dissertation explains the
development of a vision-based system, named server-customer interaction tracker (SCIT), to
recognize and estimate earth material loading cycles. The SCIT system consists of three main
modules: object recognition, tracking, and action recognition. Different object recognition
and tracking algorithms were evaluated and modified, and the most suitable methods were used
to develop the object recognition and tracking modules. A novel hybrid tracking framework
was developed for the SCIT system to track dump trucks in the challenging views found in
the loading zones. The object recognition and tracking engines provide spatiotemporal data
about the equipment which are then analyzed by the action recognition module to estimate
loading cycles. The entire framework was evaluated using videos taken under varying
conditions. The results highlight the promising performance of the SCIT system with the
hybrid tracking engine, thereby validating the possibility of its practical application.
Acknowledgements
There are a number of individuals to whom I want to express my special thanks; each of
them supported me in some way throughout the course of my PhD studies at the
University of Toronto.
My deepest gratitude goes to my supervisor, Professor Brenda McCabe, for her enduring
support and guidance. Her broad knowledge, experience, encouragement, mentorship, and
constructive advice have been of great value to me. She has always been available when I
needed her advice. It has been my honour to work under her supervision.
I would like to express my sincere gratitude to Professor Sven Dickinson, a member of my
advisory committee, for his valuable advice during all phases of this research. He helped me
bridge computer vision and construction engineering and carry out this exciting
interdisciplinary research. I am grateful to Professor Kim Pressnail, my other advisory
committee member, who provided insightful comments to improve the overall quality of this
research. I would also like to thank Professor Feniosky Pena-Mora for agreeing to be the
external appraiser for my PhD defence.
Finally, I would like to deeply thank my beloved parents and sisters for their endless
inspiration and unconditional love during all these years. They have been my best supporters.
Table of Contents
CHAPTER 1 - Introduction .................................................................................................................... 1
1.1 Research Objective ................................................................................................................. 4
1.2 Methodology .......................................................................................................................... 5
1.3 Research Scope ....................................................................................................................... 7
1.4 Outline of the Dissertation ...................................................................................................... 8
CHAPTER 2 - BACKGROUND ......................................................................................................... 10
2.1 Automated Data Collection in Construction ........................................................................ 10
2.2 Data Collection in Earthmoving Projects ............................................................................. 11
2.2.1 Machine control sensors ............................................................................................... 11
2.2.2 Global Positioning System ........................................................................................... 12
2.2.3 Ultra-wideband ............................................................................................................. 14
2.2.4 Radio-frequency identification (RFID) ........................................................................ 15
2.3 Computer vision-based methods .......................................................................................... 16
2.4 Object Recognition ............................................................................................................... 20
2.4.1 Haar-like Features ........................................................................................................ 21
2.4.2 Histogram of Oriented Gradients (HOG) ..................................................................... 22
2.5 Object Tracking .................................................................................................................... 23
2.5.1 Mean-shift Tracking ..................................................................................................... 23
2.6 Summary .............................................................................................................................. 25
CHAPTER 3 - OBJECT RECOGNITION MODULE ......................................................................... 28
3.1 Dump Trucks ........................................................................................................................ 29
3.1.1 Visual orientations ........................................................................................................ 30
3.1.2 Machine learning .......................................................................................................... 32
3.1.3 Performance of Detectors on Static Images ................................................................. 33
3.1.4 Haar-like vs. HOG performance ................................................................................... 41
3.1.5 Performance of Detector on Videos ............................................................................. 41
3.2 Hydraulic Excavator ............................................................................................................. 44
3.2.1 Deformable parts .......................................................................................................... 45
3.2.2 Features ........................................................................................................................ 48
3.2.3 Mixture models ............................................................................................................. 48
3.2.4 Static images ................................................................................................................. 50
3.2.5 Videos ........................................................................................................................... 54
3.2.6 Spatiotemporal reasoning ............................................................................................. 56
3.3 Robustness of the Recognition Results ................................................................................ 60
3.3.1 Occlusions .................................................................................................................... 60
3.3.2 Lighting ........................................................................................................................ 62
3.3.3 Shadow ......................................................................................................................... 62
3.3.4 Viewpoint ..................................................................................................................... 63
3.3.5 Articulation ................................................................................................................... 63
3.3.6 Scale change ................................................................................................................. 64
3.3.7 Orientation change ........................................................................................................ 65
3.4 Summary .............................................................................................................................. 65
CHAPTER 4 - OBJECT TRACKING MODULE ............................................................................... 68
4.1 Mean-shift Algorithm ........................................................................................................... 68
4.2 Hybrid Tracking ................................................................................................................... 70
4.2.1 Possibilities to Optimize Hybrid Algorithm ................................................................. 75
4.3 Summary .............................................................................................................................. 77
CHAPTER 5 - THE ACTION RECOGNITION MODULE AND SYSTEM ARCHITECTURE ...... 79
5.1 Baseline Task ....................................................................................................................... 80
5.2 Spatiotemporal Information .................................................................................................. 81
5.3 Activity Recognition Module (ARM) .................................................................................. 81
5.3.1 ARM Stage 1: Logical loading configuration .............................................................. 81
5.3.2 ARM Stage 2: Machine learning action recognition .................................................... 82
5.4 Cycle Conclusion.................................................................................................................. 84
5.5 System Architecture ............................................................................................................. 85
5.6 Summary .............................................................................................................................. 88
CHAPTER 6 - SCIT VALIDATION RESULTS ................................................................................. 89
6.1 Experimental Results ............................................................................................................ 90
6.2 Discussion ............................................................................................................................ 93
6.2.1 False positive cycles ..................................................................................................... 94
6.2.2 False negative cycles .................................................................................................... 97
6.2.3 Differences in start and finish times ............................................................................. 97
6.3 Practical Applications ........................................................................................................... 98
6.3.1 Cycle counting .............................................................................................................. 98
6.3.2 Cycle durations ............................................................................................................. 99
6.4 Monitoring Other Earthmoving Operations ....................................................................... 107
6.4.1 Hauling ....................................................................................................................... 107
6.4.2 Leveling and compacting ........................................................................................... 109
6.4.3 Excavation .................................................................................................................. 110
6.4.4 Extended Monitoring System ..................................................................................... 110
CHAPTER 7 - CONCLUSION AND FUTURE DIRECTIONS ....................................................... 112
7.1 Summary of Research......................................................................................................... 113
7.2 Summary of Results ........................................................................................................... 114
7.3 Contributions to the Body of Knowledge ........................................................................... 115
7.4 Contribution to the Body of Practice .................................................................................. 115
7.5 Limitations .......................................................................................................................... 116
7.6 Future Directions ................................................................................................................ 117
7.6.1 Application of Two Calibrated Cameras .................................................................... 118
7.6.2 Application of Multiple Non-calibrated Cameras ...................................................... 118
7.6.3 Integration of SCIT and GPS...................................................................................... 119
REFERENCES ................................................................................................................................... 121
List of Tables
Table 2-1: A summary of main features of equipment tracking methods ............................................ 26
Table 3-1: The number of training images in each category - With permission from ASCE
(Rezazadeh Azar and McCabe 2012a) ................................................................................................. 30
Table 3-2: Training windows of each method ...................................................................................... 31
Table 3-3: HOG runtimes for eight views using CPU and GPU .......................................................... 40
Table 3-4: Computation times of the Haar detectors in searching for eight orientations - With
permission from ASCE (Rezazadeh Azar and McCabe 2012a) ........................................................... 40
Table 3-5: Some samples of Haar detectors and their performances - With permission from ASCE
(Rezazadeh Azar and McCabe 2012a) ................................................................................................. 40
Table 3-6: Statistics of the training images in each view - With permission (Rezazadeh Azar and
McCabe 2012b) .................................................................................................................................... 48
Table 3-7: Dimension of the search areas based on the root dimensions ............................................. 50
Table 3-8: Results of the general HOG and part-based methods - With permission (Rezazadeh Azar
and McCabe 2012b) ............................................................................................................................. 52
Table 3-9: Part-based recognition runtimes for both directions using CPU and GPU ......................... 54
Table 3-10: Results of the general HOG and part-based algorithms in test videos - With permission
(Rezazadeh Azar and McCabe 2012b) ................................................................................................. 55
Table 3-11: Spatiotemporal constraints of the true positives - With permission (Rezazadeh Azar and
McCabe 2012b) .................................................................................................................................... 57
Table 3-12: Summary of the robustness assessment of the recognition process under main affecting
factors ................................................................................................................................................... 67
Table 5-1: Possible loading configurations .......................................................................................... 82
Table 6-1: Results of the experiments with different action recognition thresholds on test videos ..... 91
Table 6-2: Detailed results of the SCIT with hybrid - Test 3 ............................................................. 101
Table 6-3: Detailed results of the SCIT with hybrid - Test 4 ............................................................. 102
Table 6-4: Detailed results of the SCIT with hybrid - 3 second intervals .......................................... 104
Table 6-5: Detailed results of the SCIT with hybrid - 5 second intervals .......................................... 105
Table 6-6: Loading conditions of the test cases ................................................................................. 106
List of Figures
Figure 1-1: Left: long queue of waiting dump trucks, Right: idle excavator waiting for trucks ............ 1
Figure 1-2: Methodology of the dissertation .......................................................................................... 7
Figure 2-1: GPS antennas and grade control for Left: Bulldozer, Right: Grader ................................. 13
Figure 2-2: Detection cascade (Viola and Jones 2001) ........................................................................ 22
Figure 2-3: Left: Original image; Right: Visualization of the HOG descriptor ................................... 23
Figure 2-4: Top row: tracking of a dump truck in two frames, Bottom row: back projection of the
density distribution ............................................................................................................................... 24
Figure 3-1: Left: orientations; Right: samples of views (clockwise from top left: front, front-left,
front-right, side-left, side-right, rear, rear-left, rear-right) - With permission from ASCE (Rezazadeh
Azar and McCabe 2012a) ..................................................................................................................... 30
Figure 3-2: Possible outcomes of a binary classification process ........................................................ 34
Figure 3-3: ROC curve of the HOG detectors - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a) .................................................................................................................................... 36
Figure 3-4: HOG recognition samples - With permission from ASCE (Rezazadeh Azar and McCabe
2012a) ................................................................................................................................................... 37
Figure 3-5: Samples of missed dump trucks - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a) .................................................................................................................................... 38
Figure 3-6: ROC curve of the HOG detectors on videos ..................................................................... 42
Figure 3-7: Detection results in a series of frames at specified time intervals (a through d) ............... 43
Figure 3-8: Deformations of the hydraulic excavator - With permission (Rezazadeh Azar and McCabe
2012b) ................................................................................................................................................... 45
Figure 3-9: Root and part of the excavator ........................................................................................... 46
Figure 3-10: Top row: training instances of the boom in left direction; Second row: training samples
of the boom in right direction ............................................................................................................... 47
Figure 3-11: Poses of the dipper - With permission (Rezazadeh Azar and McCabe 2012b) ............... 47
Figure 3-12: Flowchart of the part-based recognition process ............................................................. 49
Figure 3-13: Search regions for dipper - With permission (Rezazadeh Azar and McCabe 2012b) ..... 50
Figure 3-14: ROC curve of the results on the excavator test images ................................................... 52
Figure 3-15: Samples of detected excavators - With permission (Rezazadeh Azar and McCabe 2012b)
.............................................................................................................................................................. 53
Figure 3-16: Object recognition at time intervals (images a to j), and four distinguished paths (images
k to n) - With permission (Rezazadeh Azar and McCabe 2012b) ........................................................ 58
Figure 3-17: Occlusion at ground-level view ....................................................................................... 61
Figure 3-18: Partially masked truck from elevated view ..................................................................... 62
Figure 3-19: Difficult viewpoints ......................................................................................................... 63
Figure 3-20: Changes in size of the dump truck as it approaches the camera ...................................... 65
Figure 4-1: left: selection of target truck in the original frame, right: isolation of pixels with similar
color histograms ................................................................................................................................... 70
Figure 4-2: left: original image, right: isolation of pixels with HOG response for side-right facing
trucks .................................................................................................................................................... 70
Figure 4-3: Flowchart of the hybrid tracking process .......................................................................... 73
Figure 4-4: a: detected truck at frame x1; b: HOG recognition result with lowered thresholds for three
viewpoints in frame x2; c: projected box of previous frame (frame x1) to frame x2 using KLT feature
tracker; d: fusion of the rectangles in b and c ....................................................................................... 73
Figure 4-5: Tracking of the orientation changes .................................................................................. 75
Figure 4-6: Left: red box encloses target truck, right: ROI to search for the target truck in the next
frame ..................................................................................................................................................... 76
Figure 4-7: Correction of the KLT method's distractions..................................................................... 77
Figure 5-1: Distances between the corners of trucks and the base point in both left and right
configurations ....................................................................................................................................... 84
Figure 5-2: a: Detection of the excavator; b: tracking the excavator; c: detection of a truck that does
not meet loading criteria; d: detection of the loading truck; e: tracking of the both equipment; f: truck
leaves the zone and tracking of the truck terminates ............................................................................ 86
Figure 5-3: Flowchart of the entire SCIT system ................................................................................. 87
Figure 6-1: Some of the earth material loading views .......................................................................... 90
Figure 6-2: Frames a to c: Expansion of mean-shift tracking, images d to f: Hybrid tracking ............ 93
Figure 6-3: frames a to c: Recognition and tracking of the incorrect loading truck due to severe
occlusion; frames d to f: correct recognition and tracking by changing the camera location .............. 96
Figure 6-4: a. incorrect detection of the loading start time, b. actual start time ................................... 98
Figure 6-5: Earthmoving foremen ........................................................................................................ 99
Figure 6-6: Left: detection of loading trucks; right: tracking of trucks .............................................. 109
Figure 6-7: Compaction with two rollers ........................................................................................... 110
Figure 7-1: Integration of the SCIT and GPS ..................................................................................... 120
Nomenclature/List of Acronyms
AdaBoost Adaptive Boosting, a machine learning algorithm
CAD Computer-aided design
CPU Central Processing Unit, computer hardware
CUDA Compute Unified Device Architecture, a parallel computing
architecture developed by Nvidia
GPS Global Positioning System
GPU Graphics Processing Unit, computer hardware
Haar-like features rectangular image features used for object recognition
HOG Histogram of Oriented Gradients, an object recognition algorithm
HSV Hue, Saturation, and Value, a cylindrical-coordinate representation of
points in an RGB color model
KLT Kanade-Lucas-Tomasi, a computer vision feature tracking method
NVA Non-value added
OpenCV an open source cross-platform library for computer vision algorithms
PASCAL Pattern Analysis, Statistical Modelling and Computational Learning, a
Network of Excellence funded by the European Union
RAM Random Access Memory, computer hardware
RFID Radio-frequency identification, a wireless identification system
ROC Receiver Operating Characteristic curve, a graphical plot to
demonstrate the performance of a binary classifier system
ROI Region of Interest, selected subset of an image
SCIT Server-customer interaction tracker; the developed vision-based
system in this research to estimate loading cycles
SVM Support Vector Machine, a machine learning algorithm
SVM-light open-source software to train linear support vector machine
classifiers
UWB Ultra-wideband, a radio-based positioning device
CHAPTER 1 - Introduction
For many years, the manufacturing sector has benefited from advances in information
technology (IT) to improve productivity and efficient data flow. The construction industry,
however, has been criticized for being slow to adopt IT. This inertia, along with the
fragmented and temporary nature of construction projects, has resulted in a lack of
productivity improvement in the construction industry (Navon and Sacks 2007). For
example, most data capture and transfer processes used for productivity tracking are done
manually in construction (Navon and Sacks 2007; Akinci et al. 2006; Navon 2005). The
immediate consequences of poor performance include inactive excavators waiting for dump
trucks or a queue of dump trucks waiting for a loading unit (see Figure 1-1).
Figure 1-1: Left: long queue of waiting dump trucks, Right: idle excavator waiting for trucks
To address this issue, researchers have evaluated sensing technologies to detect, track, and
recognize the actions of the construction workers and equipment in the rugged environment
of a construction site, with a potential to automate the manual and error-prone productivity
measurement processes. Automated productivity measurement systems promise to have a
positive impact not only on the management of site equipment and human resources, but also
on the planning of future projects.
The automation of productivity data collection can improve construction performance by:
- Reducing the need for expensive and error-prone human resources;
- Providing real-time data and proactive resource monitoring; and
- Recording accurate productivity data for future applications such as resource planning
and stochastic simulation.
Earthmoving is a major component of heavy-civil construction, such as highways, earth- and
rock-fill dams, pipelines, land development, irrigation systems, and harbour construction
projects. Surface mining, which includes aggregate pits and quarries, as well as mining for
specific minerals such as oil sands, coal, bauxite, and copper, also requires major
earthmoving activities. In 2010, surface mining accounted for 53% of the 1.6 million bbl/day
of Alberta’s crude bitumen production and it takes about two tonnes of mined oil sands to
extract a barrel of synthetic crude oil (Government of Alberta 2012). Thus, about 619 million
metric tonnes of oil sands were excavated and processed in 2010, and this figure rises every
year. This situation illustrates the enormous magnitude of earthmoving processes worldwide,
all of which depend on heavy equipment and have a repetitive nature. Slight improvements in
cycle times can result in significant improvements in productivity, cost savings, and
reductions in carbon emission.
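As a rough arithmetic check, the tonnage estimate above follows directly from the cited production figures:

```latex
0.53 \times \left(1.6\times10^{6}\ \tfrac{\text{bbl}}{\text{day}}\right)
\times 365\ \text{days}
\times 2\ \tfrac{\text{tonnes}}{\text{bbl}}
\approx 6.19\times10^{8}\ \text{tonnes}
```

That is, about 619 million metric tonnes of mined oil sands in 2010, consistent with the figure stated above.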
Traditional manual monitoring methods – wireless person-to-person communication with
equipment operators/supervisors and watching the operations directly or through real-time
videos – are expensive, time-consuming and error-prone (Rojas 2008).
New sensing technologies such as global positioning system (GPS) receivers and ultra-
wideband (UWB) sensors have been used to monitor earthmoving machines and provide
continuous productivity data. These real-time positioning devices estimate the three-
dimensional location of the machine and the logical unit of the control system analyzes the
spatiotemporal pattern of the machine to recognize the type of action and therefore the
productivity (Navon et al. 2004; Kim and Russell 2003). Since these frameworks recognize
the actions from indirect data, they may not correctly distinguish productive movements from
non-value-added traverses. In addition, these technologies are intrusive, as each machine to be
tracked requires the appropriate sensor to be installed and kept up to date. This issue is
particularly problematic for rented equipment due to the effort and cost of repeatedly installing
and removing sensing tags and of updating the monitoring software database.
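To illustrate how such indirect, position-based action recognition works, and why it can mislabel movements, the following is a minimal sketch of classifying an equipment track by simple speed thresholds. The function name, track format, and threshold values are assumptions for illustration only, not part of any cited system.

```python
import math

def classify_gps_track(points, idle_speed=0.3, work_speed=2.0):
    """Classify each segment of a GPS track by average speed.

    points: list of (t_seconds, x_metres, y_metres) fixes, in time order.
    Returns one state string per consecutive pair of fixes.
    Note the inherent ambiguity: "traveling" may be a productive haul
    or a non-value-added traverse -- position data alone cannot tell.
    """
    states = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dt = t1 - t0
        speed = math.hypot(x1 - x0, y1 - y0) / dt if dt > 0 else 0.0
        if speed < idle_speed:
            states.append("idle")        # effectively stationary
        elif speed < work_speed:
            states.append("working")     # slow maneuvering, e.g. spotting/loading
        else:
            states.append("traveling")   # productive haul OR NVA traverse
    return states
```

For example, a truck that creeps 1 m in 10 s, drives 14 m in the next 10 s, then stops would be labelled `["idle", "working", "idle"]`.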
Laser scanners can provide a 3D point cloud of a scene, but these devices are costly and the
scanning process is time-consuming; they are therefore mostly used to scan static objects
such as historic buildings or rock profiles. Even a low-resolution scan by a fast laser scanner
operating at 625,000 points per second can take more than 100 seconds
(Kiziltas et al. 2008), so laser scanners are not suitable for analyzing dynamic earthmoving
activities in real time.
Due to the emergence of low-cost cameras and high-capacity storage devices, it has become
common practice to monitor construction sites by surveillance cameras (Zou and Kim 2007;
Gong and Caldas 2010). Unlike in building construction, clear sightlines can be selected and
maintained throughout an earthmoving project. Vision-based monitoring is an economical
solution for earthmoving operations, but visual recognition technologies needed to
automatically extract data from the videos are in their infancy, despite the capabilities
displayed by the entertainment media.
It is possible to use any of these approaches to automate site data collection; however, the
development of a vision-based system has the potential to replicate human visual skills and
logic, thereby providing contractors and owners with relatively inexpensive assistance to
automatically monitor the site. As no known research has achieved this goal, the research
contained herein aims to help fill this gap.
1.1 Research Objective
The objective of this research is to develop a computer vision-based framework that can
correctly identify and track equipment activities on an earthmoving construction project in
real-time. To achieve this, the framework should:
- Use ordinary 2D construction videos;
- Detect, identify, and determine the orientation of typical loading equipment;
- Track the movement of the machine as it moves through the camera’s view;
- Identify the operations of the detected equipment and its interactions with other equipment; and
- Develop an automated vision-based data collection system for earth material loading operations using the mentioned modules, assess its performance, and validate the practical application of the system.
1.2 Methodology
The following steps will be followed to achieve the objectives:
- Assess the current state of productivity measurement in earthmoving projects and identify shortcomings and areas for improvement;
- Investigate the functional technology requirements for automating productivity measurement practices;
- Evaluate existing image and video processing methods and select an appropriate set of algorithms to develop the productivity measurement system based on the functional requirements;
- Develop a framework to automate the identification, tracking, and recognition of the activities of earth material loading equipment on actual construction sites;
- Evaluate the performance of the system using several test cases with various conditions;
- Compare the machine-generated and ground truth results to validate the performance of the system; and
- Identify the limitations of the system.
The research methodology for this study is depicted in Figure 1-2. First, the problem
statement for this research as well as the corresponding objectives and scope were defined.
Then the literature review was carried out to assess the current state of the data collection in
earthmoving projects, and to identify and select appropriate computer vision algorithms.
These algorithms were examined and modified to detect and track loading machines. Next,
an action recognition method was developed to recognize the loading action, and then all of
these modules were integrated into the server-customer interaction tracker (SCIT) pipeline.
This framework was examined by means of test videos from two construction sites. The
results of the experiments were statistically compared with ground truth productivity data to
validate the system’s performance for practical applications. Finally, the findings,
advantages, and limitations of this study were summarized and future research directions
discussed.
Figure 1-2: Methodology of the dissertation
1.3 Research Scope
There are several types of earthwork activities such as excavation, loading, hauling,
dumping, grading, trimming, and compaction; hence, there are special types of equipment to
carry out each of these tasks. A number of detection classifiers and activity recognition
modules are required to cover all of these different activities. Therefore, only analyses of
loading activities are considered for this research. The loading operation includes loading of
different types of soil, clay, aggregate, rock, and mineral ores. The envisioned system should
be able to monitor these loading activities regardless of the material type; therefore, the term
“earth material” is selected for the topic of this research to represent the general application
of this project in the construction and mining industries. Moreover, different types of
machines can perform these activities; for instance, both loaders and hydraulic excavators
can load different types of hauling machines, such as rigid off-highway dump trucks,
articulated off-highway dump trucks, urban dump trucks, and scrapers. The hydraulic
excavator was selected as the test case for the loading unit, and both the rigid off-highway
dump truck and the urban dump truck were chosen as instances of hauling machines.
Although the scope of this research encompasses loading operations, the methodology,
modules, and outcomes of the developed system can be applied or generalized to other
earthmoving operations.
1.4 Outline of the Dissertation
This dissertation is organized into seven chapters. Chapter 1 introduces the problem
statement and motivation for the dissertation topic. It then explains the objectives,
methodology, and the scope of this research. Chapter 2 consists of three main parts. First it
provides an overview of previous research efforts in automated data collection in
construction, and then describes recent applications of vision-based algorithms in the
construction industry. Finally, it gives a synthesis of advances in computer vision with an
emphasis on object recognition and tracking methods. Chapters 3 and 4 describe the object
recognition and object tracking modules of the proposed framework, respectively. In
Chapter 5, the action recognition module and the entire architecture of the framework are
presented. Chapter 6 demonstrates the experiments carried out to evaluate the performance of
the system. Finally, Chapter 7 gives the summary, discusses the contributions and limitations
of the research, and makes some recommendations for future research.
CHAPTER 2 - BACKGROUND
This chapter provides background information on the research advances for automated data
collection in construction. Much of the research is built upon advances in sensor
technologies, computer science, and new algorithms that have been developed to search
digital images for stationary and moving objects. The first section looks at automated data
collection generally in construction with Section Two focusing on data collection in
earthmoving. Then, the application of vision-based algorithms for automated data collection
in construction is investigated, and lastly advances in object recognition and tracking
methods are briefly introduced.
2.1 Automated Data Collection in Construction
Manual data collection and analysis are tedious and labour-intensive, taking about 30%-50%
of supervisors’ time (McCullouch 1997), and 2% of the entire effort in construction sites
(Cheok et al. 2000). In addition, manual data collection is error-prone and usually requires
extra non-value added communication between the office and field personnel (Akinci et al.
2006).
As a result, automated data collection has become one of the leading research streams in the
construction community and various data collection devices have been employed for
material, personnel, and equipment tracking, progress monitoring, productivity measurement,
quality control, and safety management (Kiziltas et al. 2008). Research often depends on the
advancement of data collection devices: barcodes, radio frequency identification (RFID),
global positioning system (GPS), ultra-wideband (UWB), laser scanning, and computer
vision algorithms are currently the most used technologies. Some of the features that
differentiate them include proneness to interference, data reading range, data accuracy,
interoperability of hardware and software, and memory requirements (Kiziltas et al. 2008).
Most recently developed systems analyse direct data, while others interpret indirect data to
extract the necessary information (Navon 2005; Navon and Sacks 2007). Laser scanners and
vision processing methods for estimating building progress are instances of direct data
analysis, whereas the application of spatiotemporal data provided by GPS or UWB to
estimate equipment productivity is an example of indirect analysis.
2.2 Data Collection in Earthmoving Projects
The earthmoving sector has a longer history of the application of automated data collection
technologies than other segments of the construction industry (Navon 2005). More
specifically, the long-term and repetitive nature of mining operations has allowed faster
technology adoption, in which different sensing devices have been used to locate and dispatch
large fleets of earthmoving plants. Since mining and earthmoving operations are similar in
nature and use similar equipment (even though mining machines are usually larger in
size), these sensing technologies were gradually employed in heavy construction projects as
well. In the following sections, the most common monitoring tools are introduced.
2.2.1 Machine control sensors
Various built-in sensing devices are commercially available that provide a wide range of data
from the machine itself, such as engine operating parameters (Caterpillar 2012), and location
and orientation of the machine parts such as boom and bucket orientations of a hydraulic
excavator (Trimble 2012c). These devices have been developed to improve the efficiency of
equipment operation, but it is also possible to collect and interpret these data to estimate the
machine’s productivity. Limitations of these devices include cost-effectiveness and data
interpretation: the engine parameters or movements of machine parts do not necessarily
correspond to productive actions, and it is also difficult to distinguish the type of work.
2.2.2 Global Positioning System
The Global Positioning System (GPS) is a space-based radio-navigation system created by
the U.S. Department of Defense (DoD) using a constellation of 24 satellites, the last of which
was launched in 1994. Each satellite continuously transmits the time and its position. GPS
receivers must receive these messages from at least four satellites to compute their 3D
location and time using a trilateration technique. The user equivalent range error is the
difference between the GPS coordinates and the true position. The main causes of this
inaccuracy are atmospheric effects, multipath distortion, satellite geometry, ephemeris errors
and orbit perturbations, time offset, instrumentation errors, and relativistic effects.
Differential correction techniques can resolve or minimize these errors; however, there are
other causes of minor distortions which cannot be corrected.
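To make the trilateration step concrete, the sketch below solves the standard pseudorange equations for a receiver’s 3D position and clock-bias distance by Gauss-Newton least squares. It is an illustrative toy, not part of any GPS product; the function name and the synthetic satellite geometry below are the author’s assumptions.

```python
import numpy as np

def solve_gps_fix(sat_positions, pseudoranges, iterations=20):
    """Estimate receiver position (x, y, z) and clock-bias distance b
    from satellite positions and measured pseudoranges using
    Gauss-Newton least squares. Requires at least four satellites."""
    sat_positions = np.asarray(sat_positions, dtype=float)
    pseudoranges = np.asarray(pseudoranges, dtype=float)
    x = np.zeros(4)  # initial guess: Earth's centre, zero clock bias
    for _ in range(iterations):
        diff = x[:3] - sat_positions            # vectors satellite -> receiver
        ranges = np.linalg.norm(diff, axis=1)   # geometric ranges
        residuals = pseudoranges - (ranges + x[3])
        # Jacobian of predicted pseudorange w.r.t. (x, y, z, b)
        J = np.hstack([diff / ranges[:, None], np.ones((len(ranges), 1))])
        dx, *_ = np.linalg.lstsq(J, residuals, rcond=None)
        x += dx
    return x[:3], x[3]
```

With four satellites the system is exactly determined; additional satellites overdetermine it and reduce the effect of measurement noise.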
Recent commercially developed GPS receivers for construction equipment have high
accuracy (under one metre); however, they may still be subject to the mentioned anomalies
(Trimble 2012a). In addition, earthmoving machines sometimes work in deep pit mines and
valleys, in proximity to rocks, trench slopes, and tall buildings, which can cause signal
reflection and multipath distortion problems. A modified mining GPS system exists to
solve such problems. This system uses additional transmit stations, or "Terralites", installed
on overlooking points of jobsites that relay satellite signals to equipment antennas (Trimble
2012a). This solution is expensive and mostly used in long-term open pit mines as it requires
a network of reference stations.
The GPS devices transmit the geographical location of machines to a central control
processor at regular time intervals, and these locational data together with logic algorithms of
the control system can recognize the action and estimate the productivity of the machine
(Navon et al. 2004; Kim and Russell 2003). This technology is suitable to track mobile
machines and detect queues or other misallocation, but it cannot provide any data other than
the location of a stationary plant such as a hydraulic excavator. Recent antennas have
customized features for specific heavy equipment that provide additional data to increase the
accuracy of earthmoving profiles and facilitate the operators’ job. For example, dual antennas
installed on the two sides of the blade of a bulldozer or grader (see Figure 2-1) provide the exact
position, cross slope, and heading of the blade to achieve accurate excavation and grading
profiles (Trimble 2012b).
Figure 2-1: GPS antennas and grade control for Left: Bulldozer, Right: Grader
The open pit mining sector has been extensively using this technology to dispatch and control
the earthmoving fleet (Vujic et al. 2008; Alarie and Gamache 2002). Several construction
researchers have also applied GPS antennas to track heavy equipment on construction projects
and estimate their productivity in activities such as grading and leveling (Navon and Shpatnisky
2005; Navon et al. 2004) and asphalt paving (Navon and Shpatnisky 2005; Peyret et al. 2000).
These productivity measurement frameworks extract the spatiotemporal data of the
equipment and transform them into the local map of the project. Then the processing
software interprets the movements of the machines in work zones and estimates their
productivity.
In addition to the technology-related problems already mentioned, GPS-based productivity
measurement has two further limitations. First, since the 3D coordinates and time are the only
available spatiotemporal data, it is difficult to distinguish productive activities from non-
value-added (NVA) traverses. Second, since rented equipment is commonly employed by
general contractors, such that different machines may be used on site on a daily basis, it is
costly and labour-intensive to install and remove GPS antennas from the plants and update
the monitoring software with each change in the fleet.
That said, GPS navigation systems remain the superior technology in this field, as this
level of detailed data from a single piece of equipment is not achievable by any other existing
automated data collection system.
2.2.3 Ultra-wideband
Ultra-wideband (UWB) is another radio-based positioning technology with the ability
to locate and track entities in limited zones. The system consists of a network of UWB
receivers, UWB tags, and a data processing unit. UWB sensors receive low-energy radio
waves transmitted by the tags, and the processing unit then analyzes the attributes of the
received signals to locate the tags. Because UWB systems employ high-bandwidth waves
(with very short pulses), most signal reflections do not retain the original pulse and
multipath fading is not an issue. The system can estimate the location of a tag with two of
the following four pieces of information (Ghavami et al. 2007): time of arrival (TOA), time
difference of arrival (TDOA), angle of arrival (AOA), and received signal strength (RSS).
This technology provides reliable spatiotemporal data for tracking resources on construction
sites (Cheng et al. 2011).
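To illustrate how one of these quantities yields a position, the sketch below recovers a 2D tag location from TDOA measurements taken against a reference receiver, using Gauss-Newton refinement of the range differences. The function name, receiver layout, and units are illustrative assumptions, not a UWB vendor API.

```python
import numpy as np

def tdoa_locate(receivers, tdoas, c=3e8, iterations=25):
    """2D tag localization from time-difference-of-arrival measurements.
    tdoas[i] is (t_{i+1} - t_0): the arrival-time difference between
    receiver i+1 and the reference receiver 0."""
    receivers = np.asarray(receivers, dtype=float)
    dd = np.asarray(tdoas, dtype=float) * c     # range differences (metres)
    x = receivers.mean(axis=0)                  # start in the middle of the array
    for _ in range(iterations):
        diff = x - receivers
        r = np.linalg.norm(diff, axis=1)        # ranges receiver -> tag
        residual = dd - (r[1:] - r[0])          # measured minus predicted
        u = diff / r[:, None]                   # unit vectors receiver -> tag
        J = u[1:] - u[0]                        # Jacobian of range differences
        dx, *_ = np.linalg.lstsq(J, residual, rcond=None)
        x += dx
    return x
```

Each TDOA constrains the tag to a hyperbola with two receivers as foci; the least-squares step intersects these constraints numerically.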
With regard to shortcomings, the ultra-wideband system requires a network of wired sensors
installed in different locations of a site, which makes it impractical for temporary linear
projects such as highway and pipeline construction. Problems with data interpretation and
the manual installation of tags are also associated with this technology.
2.2.4 Radio-frequency identification (RFID)
Radio-frequency identification (RFID) is a wireless system that employs radio-frequency
electromagnetic fields to identify and track a tag attached to an object. This system includes a
tag, which is an electronic chip coupled with an antenna, and a reader that transfers data to
the host computer. RFID tags are either passive or active: passive tags do not require a
battery because the electromagnetic field of the reader powers them, while active tags use
their own power source, usually a battery, to transmit data via radio waves. The main
advantages of RFID systems are that the tags do not need a direct line of sight, and that they
are durable and can be encapsulated.
This sensing technology has been used to measure the loading, hauling, and dumping times of
dump trucks. Fixed readers can be installed at the entrance gates of the loading and dumping
areas, with passive RFID tags attached to the dump trucks. The system then records the
entrance and exit times of the machines in each zone, and the time differences represent the
loading, traveling, and dumping cycle times (Montaser and Moselhi 2012). Although this
method is practical for a stationary construction site (e.g. a foundation excavation), it is
cumbersome to frequently move the gates for more linear work (e.g. highway construction).
In addition, this system only registers the entrance and exit of the machines, without regard
to whether the loading actually occurred.
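The gate-timing logic of such a system reduces to simple differencing of read events. The following is a hypothetical sketch; the event format and function name are the author’s illustration, not taken from the cited system.

```python
from collections import defaultdict

def cycle_times(reads):
    """Derive per-truck cycle components from RFID gate reads.
    `reads` is a chronological list of (timestamp_s, truck_id, gate)
    tuples, where gate is 'load_in', 'load_out', 'dump_in' or 'dump_out'.
    Returns {truck_id: [(loading_s, travel_s, dumping_s), ...]}."""
    events = defaultdict(list)
    for t, truck, gate in reads:
        events[truck].append((t, gate))
    cycles = defaultdict(list)
    for truck, evs in events.items():
        # walk through complete load_in -> load_out -> dump_in -> dump_out runs
        for i in range(0, len(evs) - 3, 4):
            (t0, g0), (t1, g1), (t2, g2), (t3, g3) = evs[i:i + 4]
            if (g0, g1, g2, g3) == ('load_in', 'load_out', 'dump_in', 'dump_out'):
                cycles[truck].append((t1 - t0, t2 - t1, t3 - t2))
    return dict(cycles)
```

Note that, consistent with the limitation discussed above, a truck that passed through the loading gates without actually being loaded would still be credited with a loading cycle.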
2.3 Computer vision-based methods
Site images and videos provide a vast automated data collection opportunity (Golparvar-Fard
et al. 2009; Brilakis and Soibelman 2008; Abeid et al. 2003). The emergence of cheap digital
cameras and high-capacity storage devices has significantly increased the number of photos
and videos captured in construction sites, many of which have surveillance cameras (Gong
and Caldas 2010; Zou and Kim 2007). These images and videos are used for several purposes
such as progress measurement, claims, reports, safety, and training. However, manual
annotation, retrieval, and analysis of these multimedia sources are cumbersome, and valuable
information within them is often missed (Brilakis and Soibelman 2005). Therefore, in the last
decade, a stream of construction research has focused on intelligent systems to use photos
and videos more effectively.
Computer vision algorithms have been applied in several fields of construction, including
progress monitoring (Golparvar-Fard et al. 2011; Wu et al. 2010; Golparvar-Fard et al. 2009;
Kim and Kano 2008), defect detection (Guo et al. 2009; Hutchinson and Chen 2006),
automated image retrieval (Brilakis and Soibelman 2008; Brilakis and Soibelman 2005), and
productivity measurement (Gong and Caldas 2011; Peddi et al. 2009; Weerasinghe and
Ruwanpura 2009; Almassi and McCabe 2008). To compare digital images with an electronic
as-planned 4D model, a means to coordinate a fixed camera viewpoint and the direction
vector of a construction 4D model is available (Kim and Kano 2008). D4AR (4 Dimensional
Augmented Reality) (Golparvar-Fard et al. 2011) goes one step further and uses casually
captured images from the construction site to build a virtual 3D walk through the
environment. It thus positions the 4D CAD model to assess progress of the building. In
addition to comparing as-built with as-proposed features, productivity, work progress, and
safety data can be extracted. Methods exist for monitoring workers (Teizer and Vela 2009;
Peddi et al. 2009; Weerasinghe and Ruwanpura 2009), and tracking personnel, equipment,
and materials in noisy construction videos (Brilakis et al. 2011; Park et al. 2011). The Haar
object detection method has been found capable of detecting large tools or key productivity
indicators from video images, such as a concrete hopper (Gong and Caldas 2010). Work
cycles can be determined by counting the number of times the hopper moves into concrete-
pouring zones (Gong and Caldas 2010; Almassi and McCabe 2008). Visual processing of
workers to analyze worker status can also be achieved using human pose analysis (Peddi et
al. 2009) and thermal image analysis combined with sound wave patterns (Weerasinghe and
Ruwanpura 2009).
Heavy construction work typically takes place outdoors with few visual obstructions.
Earthmoving activities, such as digging, loading, moving, spreading, grading, and
compacting, each involve specific equipment. Humans recognize construction equipment
by the shapes and features that make them unique. Developing an automated system to
undertake this task, however, has several challenges, such as the shape similarities of
different equipment, partially obstructed views, and a visually noisy environment. To
complicate matters, some earthwork plants have moving or deformable parts, such as
hydraulic excavators, adding another level of complexity to the recognition process. Visual
recognition research in construction has focused on three primary methods: color, motion,
and shape.
In a semi-automated approach, the user manually selects the excavator of interest, and then
the algorithm tracks the target excavator in subsequent frames. The system analyses
displacement of the excavator to determine the working state of the plant (Zou and Kim
2007). The use of the hue, saturation, and value (HSV) color space to detect equipment works
against well-contrasting backgrounds, such as soil and snow. However, it is challenging to use
on construction sites, as all of the equipment from one contractor is often similarly colored;
furthermore, many contractors use orange or yellow for safety purposes. This
color-based method (Zou and Kim 2007) is not robust to changes in illumination, scale,
viewpoint, and occlusion.
Since most construction entities, including earthmoving machines, are mobile, some research
efforts investigated the use of motion segmentation methods to detect moving objects and
then identify them. There are several foreground-background algorithms available, but the
selected method should be able to properly process construction videos with a dynamic
environment and harsh visual noise such as dust and smoke from equipment exhaust. In a
comparison of background subtraction algorithms (Gong and Caldas 2011), the Mixture of
Gaussians (Grimson et al. 1998), Codebook (Kim et al. 2005), and a Bayesian-based model
(Li et al. 2003) were evaluated; the Bayesian-based algorithm produced the best results on
construction videos. This motion segmentation filter uses Bayes decision rules to detect both
gradual and abrupt movements in videos with a static background.
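As a much simplified illustration of how per-pixel statistical background models operate, the sketch below maintains a single Gaussian per pixel and flags pixels that deviate strongly from their running mean. The cited methods use mixtures or Bayes decision rules, so this is a deliberately reduced cousin; all names are the author’s illustration.

```python
import numpy as np

class RunningGaussianBackground:
    """Minimal per-pixel Gaussian background model: each pixel keeps a
    running mean and variance, and pixels far from the mean (in units of
    standard deviation) are flagged as foreground."""
    def __init__(self, alpha=0.05, k=2.5):
        self.alpha, self.k = alpha, k   # learning rate, deviation threshold
        self.mean = None
        self.var = None

    def apply(self, frame):
        frame = frame.astype(float)
        if self.mean is None:           # first frame initialises the model
            self.mean = frame.copy()
            self.var = np.full(frame.shape, 25.0)
            return np.zeros(frame.shape, dtype=bool)
        d = frame - self.mean
        foreground = d * d > (self.k ** 2) * self.var
        # update the model only where the pixel still looks like background
        bg = ~foreground
        self.mean[bg] += self.alpha * d[bg]
        self.var[bg] += self.alpha * (d[bg] ** 2 - self.var[bg])
        return foreground
```

This sketch also exhibits the limitations discussed below: a foreground object that stays motionless is never absorbed here only because flagged pixels are excluded from the update, and the binary mask says nothing about whether connected pixels belong to one machine or two.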
The Bayesian-based foreground-background algorithm has been used to segment moving
entities in construction videos. The detected objects can then be identified by a classifier (e.g.
Bayes or neural network) using features of the object, including height/width aspect ratio,
height-normalized area, percentage of occupancy of the bounding box, and average
gray-scaled color of the area (Chi and Caldas 2011). This
productivity measurement of earthmoving activities (Gong and Caldas 2011). This
recognition approach, however, has five major limitations. First, it requires that the video
background remain still. Second, the motion segmentation algorithm may absorb foreground
particles if they stay motionless for a long time (Li et al. 2003). Third, the background
subtraction method is not able to consistently segment a moving object as a single unit and it
sometimes identifies a part of the moving object or may split one moving object into two or
more disconnected pieces. Fourth, the employed feature-based classifiers can be misled if an
unknown object (not trained before) enters the scene. Finally, this algorithm has difficulty in
processing occluded views that are typical in jobsites. For example, two machines move
close together and the system may segment them as a single blob. The shape features of this
blob represent none of those two plants.
To the best of the author’s knowledge, there is still a gap in semantic equipment
recognition in static images and videos of heavy construction, since the mentioned algorithms
are able to recognize machines only under certain conditions, such as plain backgrounds and
empty jobsites. Therefore, it is essential to develop a recognition framework for heavy civil
engineering projects in which different types of equipment may appear.
In addition to recognition, object tracking is another main module to make vision-based
systems practical. These algorithms can track the manually or automatically detected objects
and provide valuable spatiotemporal data including the location, direction of movement, and
velocity of the target. Thus, a number of studies were conducted to find suitable tracking
methods for noisy construction videos (Brilakis et al. 2011; Park et al. 2011; Gong and Caldas
2011).
All these research efforts are in a preliminary stage as they are only able to operate under
ideal conditions. For example, the test videos were taken from certain angles, with plain
backgrounds and slight occlusion, and only a few types of equipment appeared. In addition,
these systems only analyzed simple scenarios such as displacement of an excavator or a mini
loader to estimate their productivity.
The goal of this research is to close the practicability gap between vision-based systems and
earthmoving productivity measurement processes, so that loading cycles can be automatically
recognized and estimated under the varied visual conditions found on construction sites, such
as different viewpoints and the presence of various types of construction equipment. In
addition, the system should require minimal human intervention, limited to properly setting
the camera viewpoint.
2.4 Object Recognition
Computer vision, a form of artificial intelligence (AI), is evolving quickly, with object
recognition being one of the main branches of this field. However, existing algorithms have a
long way to go before they can match the flexibility and breadth of human vision (Dickinson
et al. 2009). The main difficulties arise from variations in illumination, viewpoint, scale,
occlusion, articulated shapes, and background clutter. In addition, the variety of samples
within a class can increase the complexity (Fei et al. 2007). Object recognition approaches
have been developed using recognition by parts, appearance-based, and feature-based
methods. Feature-based methods usually have two steps: computation of object descriptors,
and classification. All feature-based methods quantize the descriptors of positive and
negative samples to train a classifier. Due to proven performance and the availability of
source codes, Haar-like features (Viola and Jones 2001) and Histogram of Oriented
Gradients (HOG) (Dalal and Triggs 2005) were used in this research. These are discussed in
more detail next.
2.4.1 Haar-like Features
The Haar-like features framework (Haar) was originally introduced for face detection (Viola
and Jones 2001), and then broadly applied for other recognition purposes, such as traffic sign
and pedestrian detection, due to its high speed and accuracy. The algorithm partitions
images into a set of overlapping windows at different scales and then classifies whether each
window contains the target object. The Haar-like features framework has a cascade structure
that employs a series of weak classifiers. It uses a progressive, elimination-based
classification chain which rejects any sub-window that fails one of the classifiers. The
classifiers in the cascade become progressively more complex: each classifier rejects as many
of the remaining negative sub-windows as possible while still passing all but a small fraction
of the true positives (see Figure 2-2).
The features used to train each classifier consist of rectangular regions configured in various
Haar and bar-like arrangements. To learn and classify objects, it applies a form of the
AdaBoost (Freund and Schapire 1997) algorithm to features that were extracted from digital
images. This method is fast and efficient in recognizing objects that have a stable,
characteristic appearance and do not have large pose variations, such as human faces.
Figure 2-2: Detection cascade (Viola and Jones 2001)
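Much of the framework’s speed comes from the integral image, which allows any rectangle sum, and hence any Haar-like feature value, to be computed in constant time after a single pass over the image. A minimal sketch follows; the function names are illustrative, not the Viola-Jones reference implementation.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: rectangle sums in O(1) after O(N) precompute."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=float), axis=0), axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using the integral image (exclusive ends)."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def haar_two_rect_vertical(ii, r, c, h, w):
    """Two-rectangle Haar-like feature: left half-sum minus right half-sum."""
    half = w // 2
    left = rect_sum(ii, r, c, r + h, c + half)
    right = rect_sum(ii, r, c + half, r + h, c + w)
    return left - right
```

Because every feature costs only a handful of array lookups, the cascade can evaluate enormous numbers of sub-windows quickly and still reject most of them at the earliest stages.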
2.4.2 Histogram of Oriented Gradients (HOG)
Initially developed to detect pedestrians in static images (Dalal and Triggs 2005), the robust
HOG algorithm won the 2006 PASCAL object detection challenge (Everingham et al. 2006).
This method is invariant to changes in illumination and noise, and has been broadly applied
to detect rigid objects, such as highway vehicles (Rybski et al. 2010; Morlock 2008). In the
HOG algorithm, computed gradients of the gray-scale image are accumulated into spatial
cells and orientation bins (Figure 2-3), forming histograms of oriented gradients which are then
concatenated into a vector called a descriptor. Next, many positive and negative vectors are
required to train a detector using the linear support vector machine (SVM) algorithm (Cortes
and Vapnik 1995) under supervised learning. Unlike the cascade framework of the Haar-like
features, HOG uses a sliding window approach to search for the target object in all positions
and scales of an image. In this detection process, the detector first searches for the target in
the original scale image, then the frame is scaled down by the shrinkage coefficient, and the
scan is repeated until the image reaches the size of the classifier (e.g. 64x128 for pedestrian
detection). The classifier examines the test windows by the linear SVM classification process
which is a scalar product of the classifier and the test window vector. Unlike the Haar
method, the HOG algorithm tests all of the sub windows of the image and is more
computationally intensive.
Figure 2-3: Left: Original image; Right: Visualization of the HOG descriptor
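The descriptor computation can be sketched in stripped-down form: image gradients are binned by unsigned orientation within each spatial cell, and the cell histograms are concatenated into one vector. Block normalisation, an important part of the published method, is omitted here for brevity, and all names are the author’s illustration.

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9):
    """Minimal HOG-style descriptor: gradient magnitude-weighted histograms
    of unsigned orientation (0-180 degrees) over non-overlapping cells,
    concatenated and L2-normalised."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)                     # vertical, horizontal gradients
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h, w = img.shape
    feats = []
    for r in range(0, h - cell + 1, cell):
        for c in range(0, w - cell + 1, cell):
            a = ang[r:r + cell, c:c + cell].ravel()
            m = mag[r:r + cell, c:c + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-9)
```

In the detection stage described above, the linear SVM score for a test window is then simply the scalar (dot) product of the trained weight vector with such a descriptor.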
2.5 Object Tracking
Video tracking algorithms locate a target object across video frames and are widely used in
security, traffic control, human-computer interaction, and many other applications. There are
a number of tracking approaches, including contour tracking, kernel-based tracking, and
feature matching. Comparative studies of tracking methods applied to the visually noisy
construction environment revealed that the Mean-shift algorithm is reliable for tracking
objects (Gong and Caldas 2011; Park et al. 2011), and that the addition of the Kalman filter
and particle filter can stabilize its performance (Gong and Caldas 2011).
2.5.1 Mean-shift Tracking
Mean-shift tracking is a non-parametric, kernel-based procedure for locating the maxima of
the density distribution of a dataset (Comaniciu et al. 2003). This iterative algorithm starts
with an initial estimate and then calculates the weights of nearby points to re-estimate
the mean, thus ignoring outliers far from the peak. Any feature of the object can be used
to create the dataset, but color is one of the most efficient and commonly used. The top
two frames in Figure 2-4 show a red dump truck being tracked. The images in the second row
are the corresponding back projections, which filter the pixels in the range of the
target’s color histogram. Although other objects, such as the body of the excavator and the
other entering truck, have similar colors, the mean-shift tracker ignores them. A modified
version of the Mean-shift algorithm, called continuously adaptive Mean-shift or Camshift
(Bradski 1998), employs the mean-shift method and changes the size of the tracking window
to adapt to changes in the shape, orientation, and size of the target object.
Figure 2-4: Top row: tracking of a dump truck in two frames, Bottom row: back projection of the density
distribution
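The core mean-shift iteration over a back-projection image is short: the tracking window is repeatedly moved to the centroid of the weights it covers until the shift vanishes. A minimal sketch follows; the rectangle convention and names are the author’s illustration, and Camshift would additionally resize the window.

```python
import numpy as np

def mean_shift_window(weights, window, max_iter=20):
    """One mean-shift tracking step: shift a rectangular window (r, c, h, w)
    toward the centroid of the weights it covers. `weights` plays the role
    of the colour-histogram back projection."""
    r, c, h, w = window
    for _ in range(max_iter):
        patch = weights[r:r + h, c:c + w]
        total = patch.sum()
        if total == 0:                      # no target mass under the window
            break
        rows, cols = np.mgrid[0:h, 0:w]
        # shift from window centre to weighted centroid of the patch
        dr = int(round((rows * patch).sum() / total - (h - 1) / 2))
        dc = int(round((cols * patch).sum() / total - (w - 1) / 2))
        if dr == 0 and dc == 0:             # converged
            break
        r = min(max(r + dr, 0), weights.shape[0] - h)
        c = min(max(c + dc, 0), weights.shape[1] - w)
    return r, c, h, w
```

Because only the pixels under the window contribute, distractors elsewhere in the back projection, like the similarly coloured excavator body in Figure 2-4, do not pull the window away.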
2.6 Summary
This chapter describes recent advancements in monitoring earthmoving equipment on
jobsites. The developed technologies can be classified into two main groups: active and
passive tools.
Active methods include a central processing unit, a network of reference receivers, and
sensing devices that should be installed on each machine. There are different sensing
technologies, such as engine sensors, RFID, GPS, and UWB, that can provide various data
such as engine parameters and 3D location. In contrast, passive techniques do not require
sensing tags installed on the equipment; instead, a central unit processes data captured
remotely from the scene. Laser scanners and computer vision-based techniques are the two main
forms of passive data collection; however, only vision-based methods have been used to track
construction equipment, because laser scanning processes are too slow to track mobile
equipment.
Table 2-1 summarizes the characteristics of each technology based on four functional
technology requirements: type (tagging requirement), data provided, hardware requirement,
and spatiotemporal tracking.
Table 2-1: A summary of main features of equipment tracking methods

Technology | Type | Data provided | Hardware requirement | Spatiotemporal tracking
Engine sensors | Active | Equipment ID; engine parameters such as revolutions per minute | Sensors and central receiver | No
RFID | Active | Equipment ID; time | Tags and network of scanning gates | No
Standard GPS | Active | Equipment ID; 3D location; time | Sensors and central receiver | 3D
GPS with transmit stations | Active | Equipment ID; 3D location; time | Sensors, central receiver, and additional reference stations | 3D
UWB | Active | Equipment ID; 3D location; time | Tags and network of UWB receivers | 3D (limited to the receivers' zone)
Single camera for each scene | Passive | 2D location; equipment type; equipment orientation; 2D size; time | Network of digital cameras | 2D (limited to camera coverage)
Stereo view of each scene | Passive | 3D location; equipment type; equipment orientation; 3D size; time | Network of calibrated digital cameras | 3D (limited to camera coverage)
Active systems, more specifically GPS, have been broadly used in construction and mining
industries for more than a decade and numerous research efforts have investigated
shortcomings and development possibilities of these technologies. Vision-based algorithms
are another potential tool to track earthwork machines and estimate their productivity. Vision-
based systems, however, are very new in this field, and only a few research studies have been
carried out to monitor earthmoving equipment. These research projects are in their early
stages and require a high level of manual intervention. In addition, they can only operate
under ideal conditions that do not resemble those found on jobsites.
As stated before, a practical vision-based system requires recognizing different types of
earthmoving equipment and tracking them under realistic conditions. Then the system must
be able to analyze the provided spatiotemporal data and identify equipment interactions. The
next chapter describes different state of the art object recognition algorithms which are
evaluated and modified, as required to identify loading machines under different visual
conditions with high accuracy and efficient speed.
CHAPTER 3 - OBJECT RECOGNITION MODULE
This chapter describes the development of an object recognition module. Two types of
equipment were selected to develop the module. A hydraulic excavator represents a loading
unit, and off-highway and urban dump trucks were chosen as instances of hauling equipment.
Dump trucks are rigid objects (except during their relatively short dumping periods) and
existing object recognition algorithms showed promising performance in the detection of
similar rigid vehicles such as cars. But recognition of an object that regularly changes its
shape, such as an articulated excavator, is more challenging. Therefore, a recognition system
was developed for excavators. This recognition framework combines a part-based approach
and spatiotemporal reasoning for recognition of operating excavators in construction videos.
Development of an object recognition classifier requires a large number of images for
training and testing phases that must not overlap. The training dataset should include both
positive and negative samples. Earthmoving machines appear quite different depending on
their viewpoint with respect to the camera, which makes it impossible to identify anything
but a sphere with a single detector. Therefore, it is essential to collect plenty of images
containing different orientations of a machine. In addition, training samples should be taken
under different lighting conditions and include different makes within a class to produce
efficient classifiers. Since a supervised learning approach is used to train classifiers, the
following steps should be carried out to prepare the training dataset. These tasks are human-
intensive and time-consuming.
- Divide training samples into positive and negative groups; negative samples must not include the target object;
- Determine the training viewpoints for each piece of equipment;
- Group positive images into training categories;
- Crop positive objects using fixed-ratio boxes;
- Resize cropped images to the determined training sizes; this step is done automatically using the Image Processing Toolbox of the MATLAB software.
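The fixed-ratio cropping step above can be sketched in a few lines of Python (a sketch with hypothetical box coordinates; the thesis performed the actual resizing with MATLAB's Image Processing Toolbox):

```python
def fixed_ratio_box(x, y, w, h, ratio):
    """Expand a bounding box (top-left x, y, width w, height h)
    to match a fixed width:height ratio, keeping it centered."""
    if w / h < ratio:          # box too narrow: widen it
        new_w, new_h = ratio * h, h
    else:                      # box too short: heighten it
        new_w, new_h = w, w / ratio
    cx, cy = x + w / 2, y + h / 2
    return (cx - new_w / 2, cy - new_h / 2, new_w, new_h)

# e.g. a truck annotation cropped for a 36x20 (ratio 1.8) training window
box = fixed_ratio_box(100, 50, 90, 60, 36 / 20)
```

The crop is widened (or heightened) symmetrically so the machine stays centered in the training window.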
For training and testing purposes, a large number of images containing different makes and
sizes of dump trucks and hydraulic excavators were collected from a multimedia archive and
freely-available on-line sources. In addition, a large number of images were captured by the
author from three construction sites, including two large earth-fill dams and a foundation
excavation of a condominium complex. These data were randomly divided into training and
testing datasets. The statistics of the training images for each viewpoint of the dump trucks
and excavators are provided in the following sections. Although all of the images were taken
in daylight, they vary by time of the year and illumination levels. Images were taken from
ground or above ground levels, and the vehicles in the images were located at various
distances from the camera.
3.1 Dump Trucks
A variety of makes, models, and colors of trucks were used to ensure that the classifier is not
limited by any one of these factors. Two object recognition algorithms, namely Haar-like
features and Histogram of Oriented Gradients (HOG), were trained using the training dataset.
Their performance was evaluated using the test dataset. The following sections describe the
development and testing processes.
3.1.1 Visual orientations
Since the visual features of a dump truck change with the camera viewpoint, both Haar and
HOG detectors were trained with image samples from eight orientations as shown in Figure
3-1. This follows from previous research where using eight visual orientations provided
strong results in the detection of urban vehicles (Rybski et al. 2010, Han et al. 2006). Having
eight orientations not only increases the chance of detection, but it also enables prediction of
the trajectory of the dump truck, which can be valuable data for activity interpretation. Table
3-1 presents the number of training images in each of the eight viewpoints.
Table 3-1: The number of training images in each category - With permission from ASCE (Rezazadeh
Azar and McCabe 2012a)

Visual orientation | Front | Front-left | Front-right | Side-left | Side-right | Rear | Rear-left | Rear-right
# Positive samples |  488  |    699     |     581     |    755    |    755     | 304  |    488    |    488
# Negative samples | 8000  |   8000     |    8000     |   8000    |   8000     | 8000 |   8000    |   8000
Figure 3-1: Left: orientations; Right: samples of views (clockwise from top left: front, front-left, front-
right, side-left, side-right, rear, rear-left, rear-right) - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a)
All of the positive and negative training images were manually cropped and then scaled
down to predetermined sizes. Since the two object recognition algorithms use different
training approaches, the training samples for each method have different sizes. The Haar
method requires small images, in the range of 20x20 to 40x40 pixels, while the HOG
algorithm uses larger training frames of 64 to 128 pixels in each dimension; thus,
different training window sizes were used for each technique, as presented in Table 3-2.
Table 3-2: Training windows of each method

Method    | Front and rear views (pixels) | Other six views (pixels)
Haar-like | 21x19                         | 36x20
HOG       | 104x96                        | 128x80
The bounding boxes should completely enclose the machines in the training images, but a 16-
pixel margin around the target object on all four sides was added to improve the
performance of the HOG detectors (Dalal and Triggs 2005). The Haar framework, however,
requires smaller positive training windows with smaller margins. Therefore, 10 pixels were
cropped from all four sides of the HOG's positive samples to decrease the margin. Finally,
the 84x76 images were resized by a 1/4 scale factor and the 108x60 windows by 1/3, which
resulted in 21x19 and 36x20 boxes, respectively.
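The window-size arithmetic above can be checked with a short sketch (the sizes are those reported in the text; the helper function itself is hypothetical):

```python
def haar_window(hog_w, hog_h, crop, scale):
    """Derive a Haar training window from a HOG window by cropping
    `crop` pixels from all four sides and dividing by `scale`."""
    return ((hog_w - 2 * crop) // scale, (hog_h - 2 * crop) // scale)

# 104x96 HOG window -> crop 10 px per side -> 84x76 -> 1/4 scale -> 21x19
front_rear = haar_window(104, 96, 10, 4)
# 128x80 HOG window -> crop 10 px per side -> 108x60 -> 1/3 scale -> 36x20
other_views = haar_window(128, 80, 10, 3)
```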
A set of 800 images containing no dump trucks was obtained from the same sources and
used as the base negative training set. Negative images contained construction scenes, and
many of them contained other earthwork plant, such as bulldozers and graders, to mitigate
misclassification of such machines. Ten windows were randomly cropped from each
negative image and scaled down to the corresponding sizes for each viewpoint, resulting in
8000 negative training samples for each orientation.
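The negative-sample generation can be sketched as follows (a minimal sketch; the crop positions are illustrative):

```python
import random

def random_negative_crops(img_w, img_h, win_w, win_h, n=10, seed=0):
    """Pick n random sub-window positions (top-left corners) inside a
    negative image; each crop is later scaled to the training size."""
    rng = random.Random(seed)
    return [(rng.randint(0, img_w - win_w), rng.randint(0, img_h - win_h))
            for _ in range(n)]

crops = random_negative_crops(640, 480, 128, 80)   # 10 windows per image
# 800 negative images x 10 crops = 8000 negatives per orientation
```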
3.1.2 Machine learning
Training samples are prepared for machine learning after they have been grouped, cropped,
and resized to the determined sizes. The AdaBoost learning algorithm (Freund and Schapire
1997) and the linear SVM method were employed to train Haar-like features and HOG
detectors, respectively. The open-source OpenCV 2.1 library (OpenCV 2010) has built-in
functions to train Haar classifiers. First, a vector of positive samples should be created by
using the cvCreateTrainingSamplesFromInfo() function. Then, the
cvCreateTreeCascadeClassifier() function employs the AdaBoost learning method to train a
cascade classifier from the vector of positive samples and negative images. This function
requires several input parameters, such as number of stages, minimum detection rate,
maximum false alarm rate, boost type, and the size of positive samples. Depending on these
parameters and the capability of the processor, it may take from a couple of hours to more
than a day to train a cascade classifier.
The OpenCV library, however, does not include an efficient linear SVM learning function to
train HOG detectors. It only contains functions to compute HOG features and classify search
windows. Therefore, the compute() function from the cv::HOGDescriptor structure was used
to create a vector of HOG features for every positive and negative sample. The calculated
positive and negative vectors were then grouped and labelled with +1 and -1 respectively,
and saved as a single .dat file. Finally, publicly available SVM-light software (Joachims
1999) was used to train HOG classifiers from the created .dat file.
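The labelling and file-writing step can be sketched as follows (a simplified sketch of SVM-light's sparse `label index:value` input format, not the exact code used):

```python
def svmlight_line(label, vector):
    """Format one sample in SVM-light's sparse input format.
    Feature indices are 1-based; zero entries may be omitted."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1)
                     if v != 0)
    return f"{label:+d} {feats}"

def write_dat(path, positives, negatives):
    """Write positive (+1) and negative (-1) HOG vectors to one .dat file."""
    with open(path, "w") as f:
        for vec in positives:
            f.write(svmlight_line(+1, vec) + "\n")
        for vec in negatives:
            f.write(svmlight_line(-1, vec) + "\n")

line = svmlight_line(+1, [0.5, 0.0, 0.25])
```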
Training the HOG classifiers in two rounds significantly improves the results (Dalal and
Triggs 2005). In two-round training, the initially trained classifier searches the original
negative images; any detected window is necessarily a false detection. These false positives,
called hard negatives, were then scaled and added to the negative samples for the second
round of training.
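The two-round procedure can be sketched as follows, with `train` and `detect` as hypothetical stand-ins for the SVM-light training and HOG scanning steps:

```python
def mine_hard_negatives(train, detect, positives, negatives, negative_images):
    """Two-round training: any window the first-round classifier fires on
    in a negative image is, by construction, a false positive ('hard
    negative'); add those windows and retrain."""
    clf = train(positives, negatives)              # round 1
    hard = [win for img in negative_images
            for win in detect(clf, img)]           # every hit is false
    return train(positives, negatives + hard)      # round 2

# toy stand-ins: a "classifier" is just a tuple of training-set sizes
toy = mine_hard_negatives(
    train=lambda p, n: ("clf", len(p), len(n)),
    detect=lambda clf, img: ["hard_win"],          # one false hit per image
    positives=[1] * 5, negatives=[0] * 20, negative_images=range(3))
```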
3.1.3 Performance of Detectors on Static Images
An experimental process was designed to evaluate the performance of the detectors on static
images and to choose a suitable method. It involved scanning each test image with eight
single-class detectors (one for each orientation). These recognition tests were carried out on
a dual-core 2.93 GHz processing unit with 3 GB of RAM. For this experiment, 380 test
images were randomly selected from the image pool, none of which had been used for
training. These images contained 681 dump trucks in all eight orientations, together with
other types of heavy equipment, some of which had similar colors, to evaluate the
performance of the detectors in congested views.
Both Haar-like and HOG detectors use binary classifiers for object recognition. These
classifiers search sub-windows at different locations and scales of an image and decide
whether each sub-window matches the properties of the target object. Binary
classification of a sub-window has four possible outcomes, which are presented in a
confusion matrix in Figure 3-2.
Figure 3-2: Possible outcomes of a binary classification process
A main parameter for evaluating the performance of a binary classifier is the hit rate, also
known as the true positive rate, sensitivity, or recall. The hit rate is defined as the ratio of
correctly detected objects, or (true positives)/(true positives + false negatives). Other
measures can be derived from a confusion matrix to assess a classifier's performance, such as:
- false positive rate (fall-out) = false positives/(false positives + true negatives);
- accuracy = (true positives + true negatives)/total search windows;
- specificity = true negatives/(false positives + true negatives);
- precision = true positives/(true positives + false positives).
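These measures can be computed directly from the confusion-matrix counts (a sketch with illustrative counts):

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive the measures above from confusion-matrix counts."""
    return {
        "hit_rate":    tp / (tp + fn),           # recall / sensitivity
        "fall_out":    fp / (fp + tn),           # false positive rate
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "specificity": tn / (fp + tn),
        "precision":   tp / (tp + fp),
    }

# illustrative counts, not results from the experiments
m = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```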
Binary classifiers operate with a discrimination threshold. Altering this threshold changes
the hit rate and the false positive rate: lower thresholds pass more test sub-windows,
including both true and false positives. The receiver operating characteristic
(ROC) curve is a common approach to illustrate this trade-off, plotting the hit rate
versus the false positive rate at varying thresholds.
Both the hit rate and the false positive rate vary from 0 to 1, but the false positive rate
usually takes very small values. For example, the HOG detectors have to classify more than
320,000 sub-windows in a 640x480 frame for eight views, the majority of which are
negative. As will be demonstrated next, these classifiers generate very few false positives
per image, so the false positive rate would be on the order of 10^-6. To make the results
more meaningful for construction readers, the false positive rate was replaced with false
positives per frame in the reported ROC curves. This parameter simply presents the average
number of false positives occurring per test image.
Three factors were considered to evaluate the performance of the trained detectors: hit rate,
number of false positives per image, and computation time. The evaluation rules of the
PASCAL visual object classes challenge (Everingham and Winn 2010) were followed in this
experiment to determine whether a detected box is a true positive or a false alarm. They
require the detected bounding box to overlap more than 50% with the ground-truth bounding
box to be considered a true positive. In addition to location, a true positive should correctly
represent the orientation of the dump truck in the image. For example, if the classifier
identifies a dump truck with a "Rear-left" orientation instead of "Front-left", it is counted
as a false alarm.
There were some instances where the detectors of two adjacent views detected the same truck,
which had a boundary orientation. In this case, the system first checks all of the detected
bounding boxes to find the rectangles belonging to the same subset, which should have
similar sizes and locations. In this method, two rectangles are considered to be in the same
group if all of the distances between the x and y elements of the matching corners are lower
than the minimum average of the width and height of the boxes times a threshold (Viola and
Jones 2001). Then, the system picks the rectangle with the greater detection score and ignores
the other one. This method only considers two overlapping detections with adjacent
viewpoints and the highest scores; three or more overlapping orientations are penalized.
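The grouping test can be sketched as follows (a simplified, hypothetical implementation; the similarity rule follows the form used by OpenCV's rectangle grouping):

```python
def similar(r1, r2, eps=0.2):
    """True if two boxes (x, y, w, h) likely cover the same object:
    all corner distances must be within eps times the average of the
    smaller width and smaller height of the two boxes."""
    delta = eps * 0.5 * (min(r1[2], r2[2]) + min(r1[3], r2[3]))
    return (abs(r1[0] - r2[0]) <= delta and
            abs(r1[1] - r2[1]) <= delta and
            abs(r1[0] + r1[2] - r2[0] - r2[2]) <= delta and
            abs(r1[1] + r1[3] - r2[1] - r2[3]) <= delta)

def keep_best(detections):
    """From overlapping detections [(box, score, view)], keep only the
    highest-scoring one in each similarity group."""
    kept = []
    for det in sorted(detections, key=lambda d: -d[1]):
        if not any(similar(det[0], k[0]) for k in kept):
            kept.append(det)
    return kept

# two adjacent-view detections of the same truck at a boundary orientation
best = keep_best([((100, 100, 80, 40), 1.3, "side-left"),
                  ((104, 98, 82, 41), 0.9, "front-left")])
```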
3.1.3.1 HOG detectors
The receiver operating characteristic (ROC) curve was used to illustrate the performance of
the HOG detectors on the test images, as shown in Figure 3-3. This curve illustrates the
trade-off between the hit rate and false alarms per image, where lower classification
thresholds pass more test sub-windows, including both true and false positives.
Figure 3-3: ROC curve of the HOG detectors - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a)
Figure 3-4 shows some recognition samples. As observed in the results, HOG detectors could
recognize dump trucks with high accuracy among other types of machines, many of which
have a similar color. For instance, Figure 3-4 presents images taken from a rock-fill dam
construction project with rollers, bulldozers, graders, hydraulic excavators, and loaders. A
roller with “side-right” orientation is misclassified in the top right image.
Figure 3-4: HOG recognition samples - With permission from ASCE (Rezazadeh Azar and McCabe
2012a)
As the test images were randomly selected from among the most challenging images, a number of
false negatives were partially masked by piles of soil or other machines (see Figure 3-5). On
the other hand, many of the false alarms resulted from incorrect viewpoint estimates rather
than incorrect locations.
Figure 3-5: Samples of missed dump trucks - With permission from ASCE (Rezazadeh Azar and McCabe
2012a)
The computation time of this recognition method is another important factor in developing a
real-time application. Runtimes for different sizes of images were recorded and are presented
in Table 3-3. Processing a low resolution standard surveillance image of 640x480 pixels on a
dual core 2.93 GHz CPU for all eight viewpoints takes about 26 seconds, which is too long
for real-time purposes. This is because the HOG object recognition algorithm uses a brute-
force search approach as its classifier window searches for the target object in every location
and scale of the image. The classifier first searches for the object in the original scale frame,
then scales down the image by the shrinkage coefficient (set at 1.05 in this experiment), and
repeats the scan process. This process finishes when the image reaches the size of the
classifier window, which is 128x80 or 104x96 pixels, depending on the viewpoint. For
instance, to find dump trucks in all eight orientations in a 640x480 frame, the system should
classify 6x40,508 or 243,048 windows for the six viewpoints with 128x80 training windows,
and 2x40,999 or 81,998 windows for the "Front" and "Rear" orientations, which have
104x96 search windows. Each of these linear SVM classifications is the dot product of the
classifier and test window vectors, whose sizes are [4752x1] for 104x96 windows and
[4860x1] for 128x80 boxes.
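The scale of this brute-force search can be sketched as follows (a sketch assuming an 8-pixel window stride, a common HOG setting; exact counts depend on stride and padding, so the totals differ somewhat from those quoted above):

```python
def count_windows(img_w, img_h, win_w, win_h, shrink=1.05, stride=8):
    """Count classifier windows over the whole image pyramid: scan at the
    current scale, shrink the image by `shrink`, and repeat until the
    image is smaller than the classifier window."""
    total, scale = 0, 1.0
    while img_w / scale >= win_w and img_h / scale >= win_h:
        w, h = int(img_w / scale), int(img_h / scale)
        total += ((w - win_w) // stride + 1) * ((h - win_h) // stride + 1)
        scale *= shrink
    return total

per_view = count_windows(640, 480, 128, 80)   # windows for one viewpoint
```

Even under these assumptions, each viewpoint contributes tens of thousands of windows per 640x480 frame, which is why the sequential CPU implementation is too slow for real-time use.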
However, parallel implementation of the HOG algorithm on a new-generation Graphics
Processing Unit (GPU) can speed up the standard sequential code by over 67 times
(Prisacariu and Reid 2009). New-generation GPUs have hundreds of cores, enabling them
to process thousands of threads in parallel, and allow non-uniform memory access
(NUMA). The HOG recognition algorithm was implemented using CUDA, a parallel
computing platform and programming model developed by NVIDIA (NVIDIA
2012). First, the host CPU acquires the frame and copies it to GPU memory. The GPU
processes all of the scales and sub-windows of the frames and sends back calculated SVM
scores of each sub-window to the host CPU. The host CPU then formats the inputs that
contain the score and position of each sub-window. Finally, the host CPU carries out non-
maximal suppression to fuse the detected windows; this step is processed on the CPU
because it requires frequent access to main memory.
The computation times for scanning the same eight orientations with the same CPU (2.93
GHz dual core) and a GeForce GT 440 GPU with compute capability 2.1 were accelerated
significantly (Table 3-3). Parallel programming on the GPU enables the use of the standalone
HOG method as the truck recognition module of the framework, with a suitable detection
rate, while maintaining a real-time video stream.
Table 3-3: HOG runtimes for eight views using CPU and GPU

Image size | CPU dual-core 2.93 GHz, HOG (sec) | GPU NVIDIA GeForce GT 440, HOG (sec)
640x480    | 26                                | 1.07
1024x768   | 69                                | 2.8
1920x1080  | 186                               | 7.6
2592x1944  | 455                               | 18.8
3.1.3.2 Haar-like detectors
Although the standalone Haar-like feature detectors had short runtimes (Table 3-4), they
showed relatively low detection rates with very high false positives compared to the HOG
method (see Table 3-5). As such, this recognition algorithm was set aside.
Table 3-4: Computation times of the Haar detectors in searching for eight orientations - With permission
from ASCE (Rezazadeh Azar and McCabe 2012a)
Image size | Haar runtime (sec)
640x480    | 1.2-2.0
1024x768   | 2.5-5.1
1920x1080  | 6.9-13.1
2592x1944  | 19.6-25.7
Table 3-5: Some samples of Haar detectors and their performances - With permission from ASCE
(Rezazadeh Azar and McCabe 2012a)
Training settings                                  | Test results
Minimum hit rate | Max. false alarm | Boosting type | Detection rate | False positives per image
0.995            | 0.50             | Gentle AdaBoost | 49.2%        | 3.64
0.995            | 0.55             | Gentle AdaBoost | 71.5%        | 41.4
0.995            | 0.60             | Gentle AdaBoost | 86.8%        | 186.3
3.1.4 Haar-like vs. HOG performance
In static images, the HOG classifiers outperformed the Haar-like features algorithm with
respect to effectiveness, i.e., correct versus erroneous detections. With respect to efficiency,
however, the runtimes were much higher for HOG on the same CPU. Implementation of the
HOG recognition processes on a GPU resolved this issue. As such, the Haar-like recognition
algorithm was rejected for further use in this research.
3.1.5 Performance of Detector on Videos
To take advantage of the relatively rapid runtimes using the GPU, the system should scan
frames at time intervals slightly longer than the maximum runtime for real-time
applications. For example, processing each 640x480 pixel frame takes less than 1.1 seconds,
so the system can be set to scan video frames at intervals of ≥1.5 seconds (a 0.4-second
margin that the system may need for loading frames or other processes). Hence, the process
can sustain a real-time video stream.
For the test on videos, the algorithm was set to scan the frames every 5 seconds, as this time
interval is suitable for detecting a dump truck entering the scene while maintaining run-time
efficiency. Since dump trucks move slowly on sites due to speed limits (typically 25
km/h or less), they take considerable time to pass through the camera's view, and all of the
trucks appeared in at least one frame.
The performance of the HOG detectors was evaluated for the recognition of off-highway
dump trucks in test videos with 640x480 pixel frames. The 17 test videos with a total
duration of 65 minutes contained 62 dump trucks in different phases of their working cycles.
The system scans a frame every 5 seconds from the video stream, resulting in 773 frames
being processed. The recognition framework processed all of the videos without any delay to
the normal stream. The ROC of the results is illustrated in Figure 3-6, and Figure 3-7 shows
the recognition result on four consecutive video test frames.
Figure 3-6: ROC curve of the HOG detectors on videos
Figure 3-7: Detection results in a series of frames at specified time intervals (a through d)
The aim of this test was to detect trucks in the videos rather than to evaluate the performance
of the detector in each frame separately, so the performance evaluation was carried out
differently. The hit rate is calculated here as the number of detected machines (regardless of
the frequency of detections) divided by the number of dump trucks appearing in the video
stream. For instance, if a dump truck appears in two frames and is spotted in one or both of
them, it is counted as detected; however, any false alarms in other frames were counted as
well. Thus, the detectors had more than one detection chance for many of the 62 dump
trucks. Since the detectors had multiple opportunities to identify the trucks, higher detection
thresholds were set than the ones used on static images, which resulted in fewer false
positives. In addition, the larger static images (e.g. 1920x1080 and 2592x1944) had many
more search windows than the 640x480 pixel video frames; thus, the probability of false
alarms was much higher in the static images than in the videos. All of these points resulted in
much better performance of the HOG detectors on videos than on static images. The highest
hit rate in videos was 95.16% with 0.15 false positives per frame. In contrast, the detection
rate was 90.32% with 2.59 false alarms per frame on static images.
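The video-level hit-rate bookkeeping can be sketched as follows (the count of 59 detected trucks is back-calculated from the reported 95.16% of 62 trucks, so it is illustrative):

```python
def video_hit_rate(detected_trucks, appeared_trucks):
    """Hit rate at the video level, in percent: a truck counts as
    detected if it is spotted in at least one of the frames in which
    it appears, regardless of how many times it is detected."""
    return 100 * detected_trucks / appeared_trucks

video_rate = video_hit_rate(59, 62)   # matches the reported 95.16%
```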
3.2 Hydraulic Excavator
Deformable objects are significantly more difficult to detect, and state-of-the-art research on
human detection focuses on pose identification due to the countless possible configurations
of the human body. The resulting algorithms are highly applicable to security surveillance,
entertainment, and automated image and video indexing. Many of these methods use part-
entertainment, and automated image and video indexing. Many of these methods use part-
based and pictorial algorithms to detect a set of parts of a semantic object arranged in a
deformable configuration (Felzenszwalb et al. 2010; Andriluka et al. 2009). The Latent
support vector machine (Latent SVM) recognition method (Felzenszwalb et al. 2010) is a
cutting edge part-based model that won the 2009 PASCAL object detection challenge
(Everingham et al. 2009). This algorithm uses a modified version of the HOG detector, called
a root filter, to find the candidates for the target object. It then searches inside the detected
root boxes for the parts of the object at twice the spatial resolution relative to the original
resolution. A similar idea with substantial modifications was employed to detect a root and
then search for the possible configurations of the parts of the excavator to both recognize and
estimate the pose of the machine. The following sections describe this novel recognition
model.
3.2.1 Deformable parts
Highly articulated hydraulic excavators can swing 360 degrees and rotate all three parts of
their arm (boom, dipper, and attachment) around their hinged supports, as depicted in Figure
Hydraulic excavators appear in various configurations, making it impractical to detect them
with the limited number of training configurations used in the case of dump trucks.
Figure 3-8: Deformations of the hydraulic excavator - With permission (Rezazadeh Azar and McCabe
2012b)
In latent SVM part-based models (Felzenszwalb et al. 2010), the root classifier detects the
entire body (e.g., human), then searches for the body parts (e.g., arms, torso, and legs) inside
the root to validate the detection. A hydraulic excavator can have several forms with parts of
the equipment masked by soil deposits or by other machines in the frames. Thus, it is very
difficult to find the root candidates (entire excavator) with a few root detectors, so the
approach was modified.
The part of the machine that is most visible was defined as the root. Then instead of
searching for the object parts within the root, the algorithm searches for the adjacent parts in
a variety of possible formations to validate the recognition process. The boom of the
excavator was selected as the root (Figure 3-9) and the dipper (second section of the
articulated arm) as the adjacent part. The main body and the bucket were not considered as
the root or adjacent parts because the boom and dipper have approximately similar forms
and size ratios across different sizes and makes of excavators (except for long-boom
excavators). Cabin shapes, however, can vary broadly; for instance, urban excavators
have compact bodies to swing in confined working zones. In addition, excavators can carry
different attachments at the end of their dipper, such as pneumatic hammers, buckets, and
trenchers, to perform specific operations. Moreover, the bucket and cabin may be masked by
other machines or soil deposits, while the boom and dipper are the most visible parts of the
excavator. Finally, the addition of other parts would decrease the detection rate.
Figure 3-9: Root and part of the excavator
The HOG classifier for the root (boom) was trained in left and right orientations. Figure 3-10
shows some of the training samples. Since the dipper revolves around its hinged connection
with the boom, the dipper detector was trained for the six views illustrated in Figure 3-11. As
a result, six poses are possible: left-horizontal, left-inclined, left-vertical, right-horizontal,
right-inclined, and right-vertical. It is impossible to distinguish the parts in full front and rear
views, where the boom is aligned with the camera's line of sight, so those views would
require separate detectors. Because the SCIT system detects excavators in videos, and
excavators constantly slew while operating, the six side poses are sufficient for detection;
thus, front and rear views were not considered.
Figure 3-10: Top row: training instances of the boom in left direction; Second row: training samples of
the boom in right direction
Figure 3-11: Poses of the dipper - With permission (Rezazadeh Azar and McCabe 2012b)
3.2.2 Features
Root and part classifiers were trained using the HOG object recognition algorithm. Table 3-6
presents the statistics of the positive and negative samples used to train eight detectors. These
images were collected from the same sources used for dump truck detection.
Table 3-6: Statistics of the training images in each view - With permission (Rezazadeh Azar and McCabe
2012b)

Part               | Root-left | Root-right | Horizontal-right | Inclined-right | Vertical-right | Horizontal-left | Inclined-left | Vertical-left
# Positive samples | 1040      | 800        | 398              | 791            | 926            | 398             | 791           | 926
# Negative samples | 7700      | 7700       | 7700             | 7700           | 7700           | 7700            | 7700          | 7700
A negative training set including 770 negative images was collected, which contained
construction landscapes and earthmoving machines other than hydraulic excavators to reduce
the chance of misclassification of those plants as excavators. Ten boxes were randomly
cropped from each frame and scaled to corresponding viewpoint sizes, which produced 7700
negative training samples for each category. The two-round training approach was used to
train excavator classifiers.
3.2.3 Mixture models
A mixture model with m components is expressed by an m-tuple, P = (P1, …, Pm), where Pi is
the i-th piece of the articulated object; in this case there are two pieces (m = 2), the root and
the adjacent part. Each piece has a possible location and a HOG descriptor. As presented in
Figure 3-12, this part-based detection model is a two-stage recognition process, with both
stages implemented using HOG detectors.
Figure 3-12: Flowchart of the part-based recognition process
The system first searches the image for two directions (left, right) of the root, which may
produce several candidate windows. Then it searches for the dipper in possible regions
adjacent to the roots. For instance, if the root detector locates a boom in “left” orientation, the
dipper must be on the left side of the boom in one of the three possible configurations: “left-
horizontal”, “left-inclined”, and “left-vertical” as depicted in Figure 3-13. The sizes of these
search regions are based on the dimensions of the detected root as presented in Table 3-7.
Various size ratios were examined to achieve the best detection rate while maintaining run-
time efficiency. Increasing the search area raises the computation time and the possibility of
false positives; on the other hand, smaller search regions may not enclose the dipper. The
selected size ratios are large enough to surround all regular dippers, except those of long-
boom excavators, while maintaining run-time efficiency.
Figure 3-13: Search regions for dipper - With permission (Rezazadeh Azar and McCabe 2012b)
Table 3-7: Dimension of the search areas based on the root dimensions

Search region | Width                             | Height
Horizontal    | = width of the root's bounding box | = 0.7 x height of the root's bounding box
Inclined      | = width of the root's bounding box | = 1.2 x height of the root's bounding box
Vertical      | = width of the root's bounding box | = 1.4 x height of the root's bounding box
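The search-region sizing in Table 3-7 can be sketched as follows (a minimal sketch; the function name is hypothetical):

```python
def dipper_search_regions(root_w, root_h):
    """Width and height of each dipper search region, derived from the
    detected root (boom) bounding box per Table 3-7."""
    return {
        "horizontal": (root_w, 0.7 * root_h),
        "inclined":   (root_w, 1.2 * root_h),
        "vertical":   (root_w, 1.4 * root_h),
    }

# e.g. a detected boom box of 200x100 pixels
regions = dipper_search_regions(200, 100)
```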
3.2.4 Static images
Two sets of experiments were carried out to assess the performance of the part-based
algorithm and to determine whether this modified method can improve the results compared
to standalone HOG detectors. In the first set of experiments, only the root classifiers (right
and left orientations) scanned the test images; in the second set, the part-based framework
was evaluated. The rules for accepting a true positive are the same as the ones
used for dump truck recognition, namely the detected box should overlap more than 50% of
the ground-truth bounding box and the result should correctly show the direction of the
excavator boom. These detection models were evaluated using 253 images of different sizes,
showing 284 excavators varying in make and pose. The photos were randomly picked from
the collected image pool; none of them had been used in the training stage. These images
were captured in congested construction sites such that other types of equipment appeared in
many images, which allowed evaluation of the capability of the algorithm to correctly
recognize the hydraulic excavator among other machines.
The detectors processed the test images with different thresholds; Table 3-8 provides the results. Tests 1 to 5 indicate a change of threshold in the root classifier. A pair-wise comparison of the two tests in Table 3-8 demonstrates that the part-based method notably reduces false positives, but it lowers the detection rate as well, so the pairwise comparison alone cannot rank the overall performance of the two approaches. Therefore, the ROC curves
were plotted to compare the results at any given detection rate or false alarm and to find
differences. As shown in Figure 3-14, the part-based and standalone HOG methods have almost the same detection rate in the range of 0.52 to 2.19 false positives per frame, but the part-based method significantly outperforms the general HOG method at rates below 0.52 false alarms per frame. In addition, the part-based framework can estimate the
orientation of the dipper, which is helpful data for activity recognition.
Table 3-8: Results of the general HOG and part-based methods - With permission (Rezazadeh Azar and
McCabe 2012b)
          General HOG (only roots)             Part-based method
          Detection rate   False positives     Detection rate   False positives
          (%)              per frame           (%)              per frame
Test 1    61.62            0.27                58.45            0.11
Test 2    72.18            0.52                66.20            0.22
Test 3    77.46            0.99                72.54            0.57
Test 4    81.34            1.81                77.82            1.21
Test 5    85.56            3.60                82.75            2.19
Figure 3-14: ROC curve of the results on the excavator test images
Figure 3-15 illustrates some hydraulic excavators detected using the part-based method.
Since the HOG recognition algorithm is invariant to changes in illumination and scale, the
part-based framework (which uses HOG descriptors) demonstrated good performance in
detection of various sizes of excavators with different colors and illumination conditions. For
instance, Figure 3-15c shows an image captured at sunset in very low light, while the other
images in Figure 3-15 are taken in average (Figure 3-15d) to very bright conditions (Figure
3-15a).
Figure 3-15: Samples of detected excavators - With permission (Rezazadeh Azar and McCabe 2012b)
Both of these methods fail to detect the excavator if the arm is not visible or is aligned with
the camera viewfinder. Many of the false positives took place with a wrong boom direction
in the correct locations. For example, the part-based method spotted the excavator in the
“right-inclined” pose in addition to “left-inclined” (see the left machine in Figure 3-15b).
Another noticeable problem in the part-based algorithm was that it sometimes recognized the dipper in two adjacent poses at the same time: in some examples, the secondary classifier detected the dipper in both horizontal and inclined, or inclined and vertical, configurations. This issue is mainly due to overlap between the training samples; the samples were manually divided into training categories, and some samples near the boundary margins were misclassified due to human error. The solution, as implemented for this research, is to choose the pose with the higher detection score. Altogether, wrong objects accounted for 77.4% of the false positives; in 19.3% of them the dipper had been detected as the root (wrong direction); 2.3% were caused by incorrect sizes of the bounding boxes; and about 1% of the false alarms located the boom correctly but failed to estimate the correct orientation of the dipper.
Implementation of the part-based algorithm on the graphics processing unit showed satisfactory run-time results; it takes less than one second to process a standard VGA 640x480 pixel frame on a 2.93 GHz dual-core CPU and a GeForce GT 440 GPU with compute capability 2.1. Table 3-9 presents computation times for different image sizes. The varied process times of the part-based method are due to the different numbers and sizes of secondary search regions generated by the root detectors.
Table 3-9: Part-based recognition runtimes for both directions using CPU and GPU

Image size    Runtime (sec) on CPU      Runtime (sec) on GPU
(pixels)      dual-core 2.93 GHz        NVIDIA GeForce GT 440
640x480       6-7                       0.26-0.94
1024x768      18-21                     1.2-3.1
1920x1080     49-53                     1.9-5.4
2592x1944     116-120                   6.9-10.8
3.2.5 Videos
The main aim of this module is to detect hydraulic excavators in construction videos for
further analysis. Hydraulic excavators are stationary equipment and only move to change
their working zones, so this recognition unit needs to detect them only once and then pass the
information to the tracking module. To evaluate the performance of the detectors on movies,
21 videos with a total duration of two hours and twelve minutes were recorded from three
construction projects. These videos had 640x480 pixel resolution.
The system subsamples a frame every ten seconds from the videos until it detects an object, regardless of whether the detection is a true or false positive. A ten-second interval is much larger than the maximum recognition time of one second for 640x480 frames (Table 3-9), so the system can
maintain the real-time stream of the video. The experiments were carried out using both of
the general HOG and part-based methods with different thresholds to compare their
performances. The evaluation criteria for this test are detection rate and the average time to
find the first object in the videos. Table 3-10 presents the results.
Table 3-10: Results of the general HOG and part-based algorithms in test videos - With permission
(Rezazadeh Azar and McCabe 2012b)
Method        Test ID   Detection rate   No. of false detections   First detection (sec)
General       HOG0      85.71%           2                         53.33
HOG           HOG1      76.19%           5                         38.10
              HOG2      71.43%           6                         29.05
              HOG3      71.43%           12                        12.86
              HOG4      61.90%           24                        11.90
Part-based    P-B0      90.48%           0                         90.00
method        P-B1      90.48%           1                         64.29
              P-B2      90.48%           2                         35.71
              P-B3      76.19%           8                         14.76
              P-B4      76.19%           10                        13.33
The test cases with lower ID numbers have higher thresholds, and the threshold decreases as the ID number rises. Higher thresholds result in a higher detection rate, even though it takes longer (more search frames) to spot the first object. As the threshold decreases, the framework detects objects faster at the cost of more false positives. The part-based method outperformed the standalone HOG method in the higher-threshold tests, although it took longer, and more search frames, to find the first object.
3.2.6 Spatiotemporal reasoning
Even the most advanced object recognition algorithms produce type one and type two errors (see Figure 3-2 for definitions). Since the excavators are stationary,
the detected machine is passed to the tracking engine without further need for recognition.
So, the false positives are costly and will mislead the entire action recognition process. As
shown in Table 3-10, detectors with higher thresholds may need to search several frames to
find the excavator, and constant correct detection is not guaranteed.
Spatiotemporal reasoning uses background knowledge to interpret situations, employing inexpensive constraint analysis such as spatial information about the objects and time (Renz and Nebel 2007). As a result, spatiotemporal reasoning has been used for object recognition and visual motion analysis in image sequences (Laptev et al. 2007; Laptev and Lindeberg 2006).
Since the excavators are stationary equipment with cyclic movement patterns, a
spatiotemporal reasoning algorithm was developed and added to the recognition framework
to enhance the detection rate in videos. First the HOG recognition thresholds are set low to
generate multiple bounding boxes including true positives and false alarms in several
consecutive frames. In this way the risk of false negatives is virtually eliminated, even
though several false positives are produced. The system scans 10 frames in the first minute of
the video (one every six seconds) for a hydraulic excavator, and then the detected windows
are grouped based on defined spatiotemporal constraints. Size, displacement, and directions
(left or right) of the detected boxes are the spatiotemporal constraints. Two movies of
operating excavators with a total duration of twenty-one minutes were studied to set these
constraints. One frame in every six seconds was processed and the true positives were
carefully examined to determine the constraints, which are presented in Table 3-11. A six-second interval gives a large enough time span to capture the different poses of a working excavator.
Table 3-11: Spatiotemporal constraints of the true positives - With permission (Rezazadeh Azar and
McCabe 2012b)
Constraint     Comparison criteria
Size           < (1.4 × first object area) and > (first object area / 1.4)
Displacement   < width of the first detected box
Direction      If the first object faces right, the next one cannot face left
               after a rightward displacement; if the first object faces left,
               the next one cannot face right after a leftward displacement
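The grouping test of Table 3-11 can be sketched as follows. The node representation and the Euclidean displacement measure are assumptions made for illustration; the thesis does not specify how displacement is measured.

```python
def same_path(first, node):
    """Decide whether a new detection belongs to the path started by `first`.
    Each node is (x, y, w, h, direction); thresholds follow Table 3-11."""
    fx, fy, fw, fh, fdir = first
    nx, ny, nw, nh, ndir = node
    # Size: the areas must agree within a factor of 1.4.
    area_ok = (fw * fh) / 1.4 < nw * nh < (fw * fh) * 1.4
    # Displacement: centres may not drift more than one box width (assumed
    # here to be the Euclidean distance between centres).
    dx = (nx + nw / 2) - (fx + fw / 2)
    dy = (ny + nh / 2) - (fy + fh / 2)
    disp_ok = (dx * dx + dy * dy) ** 0.5 < fw
    # Direction: a right-facing boom cannot flip to left-facing while the
    # box moves right, and vice versa.
    dir_ok = not ((fdir == "right" and ndir == "left" and dx > 0) or
                  (fdir == "left" and ndir == "right" and dx < 0))
    return area_ok and disp_ok and dir_ok
```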
Detected bounding boxes are called “nodes”, and a group of similar nodes is named “path”.
For every detected window, the algorithm searches the rest of the frames and groups the
nodes that belong to the same path. Even if a node is not detected in a frame, the algorithm
loops through all of the remaining frames for matches. Two nodes of a path cannot be in the
same frame. Figure 3-16 illustrates nodes captured in the first minute (images a through j)
and frames k to n in Figure 3-16 show the four identified paths.
Figure 3-16: Object recognition at time intervals (images a to j), and four distinguished paths (images k
to n) - With permission (Rezazadeh Azar and McCabe 2012b)
The path that follows the logical movements of an excavator will be selected. This process
also involves spatiotemporal reasoning to select the path closest to the movement pattern of
an excavator. Again, the same two videos were investigated to develop the reasoning. Two types of false positive paths were observed in the test videos. The first type had a single node resulting from a random misclassification (frame n in Figure 3-16). The second type resulted from repeated false detections within the same region in several frames (frames l and m in Figure 3-16).
On the other hand, the paths of true positives include a cluster of bounding boxes with small
size variations, some displacement, and logical changes in direction of the boom due to
rotation of an operating excavator (frame k in Figure 3-16). To identify the correct path, the
framework first sorts the paths based on the number of nodes, as the path representing an
operating excavator is always amongst those with the highest number of nodes. The other candidate paths with a comparable number of nodes usually contain recurring false detections, so the path with greater changes of direction and displacement is selected as the target object. Since
the identified path includes a group of boxes, the main issue is to determine a box which has
the best representation of the excavator. The action recognition module needs the base-point
and the width of the boom to interpret interactions. The system chooses the box with the
highest recognition score to determine the base point and the width of the boom.
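The path-selection step can be sketched as follows. The tie-breaking activity measure (counting direction changes) is a simplified stand-in for the full direction-and-displacement reasoning described above, and the node representation is an assumption for illustration.

```python
def select_excavator_path(paths):
    """paths: list of candidate paths; each path is a list of nodes, and each
    node is a dict with 'box', 'score', and 'direction' keys. Returns the
    representative bounding box of the selected path."""
    best_len = max(len(p) for p in paths)
    candidates = [p for p in paths if len(p) == best_len]

    def direction_changes(path):
        # A real excavator path shows boom rotation (direction flips);
        # recurring false detections tend to repeat the same pose.
        return sum(1 for a, b in zip(path, path[1:])
                   if a["direction"] != b["direction"])

    path = max(candidates, key=direction_changes)
    # The representative box is the node with the highest recognition score.
    return max(path, key=lambda n: n["score"])["box"]
```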
This recognition system processed the first minute of the 21 test videos with one excavator in
each and correctly recognized twenty machines (95.2% detection rate) with just one
misclassification. This algorithm was tested on scenes with one working excavator, and it
needs further modifications to detect multiple excavators in video streams. In addition, this method can only process movies with static backgrounds, because spatiotemporal reasoning about the detected objects is possible only in videos captured by stationary cameras. A changing camera view would result in detection of the excavator in different regions of consecutive frames, making spatiotemporal reasoning impractical.
This spatiotemporal algorithm has some advantages over existing recognition methods used in construction videos, including color-based detection (Zou and Kim 2007) and background subtraction with Bayes or neural network classifiers (Chi and Caldas 2011), the latter of which reported a detection rate (96%) very close to that of this research (95.2%). As stated before,
color-based recognition has difficulties in various lighting conditions and is sensitive to
occlusion, and the normal Bayes or neural network based detectors can classify only a
limited number of trained objects. HOG based detectors however, can find the target
regardless of other existing objects. In that research (Chi and Caldas 2011), only three types
of objects including a mini loader, backhoe, and workers appeared in test videos and the
system had the corresponding classifiers, but the test videos in the current research contained
many moving objects including rollers, bulldozers, pickups, workers, dump trucks, SUVs,
truck mixers, pile driving machines, and mobile concrete pumps.
3.3 Robustness of the Recognition Results
This section discusses the robustness of the developed object detection module based on
seven factors namely, occlusion, lighting, shadow, viewpoint, articulation, scale change, and
orientation change.
3.3.1 Occlusions
Two videos from ground and elevated viewpoints were selected to evaluate the effect of
occlusion on object detection rates. In both of these videos, one dump truck occludes another.
The system scanned one frame every two seconds and the results are presented in Figure
3-17 (ground level) and Figure 3-18 (elevated view). In the ground-level view, the foreground (white) truck completely masks the other, which eventually becomes visible; in the elevated view, the foreground truck only partially masks the other truck. The detectors could not spot the background truck in frames a, b, and c of the ground-level view, and even missed the foreground truck in frame b as its appearance blends with the background machine (Figure 3-17). The system was able to recognize both machines in frame d and the successive frames.
Figure 3-17: Occlusion at ground-level view
The system was able to detect the partially occluded trucks captured from an elevated view,
even though it missed the foreground machine in frame b (Figure 3-18).
Figure 3-18: Partially masked truck from elevated view
3.3.2 Lighting
Since the HOG features are computed from the gradients of object edges, this method is largely invariant to lighting conditions. This was also investigated, as the test images for both excavators and dump trucks were taken under various conditions. The outcomes did not show any significant effect of lighting on the detection results. For instance, Figure 3-15 illustrates successfully detected excavators under a wide range of illumination.
3.3.3 Shadow
Although complete shadows do not affect detection, partial shadows may affect the detection process as they change the edges, and therefore the HOG features, of the object. However, it is difficult to quantitatively describe shadow effects on recognition performance: a numeric description would require taking numerous samples under various shadows and then estimating the shadows' contrast, overlapping areas, and orientation. This issue was not investigated in this research.
3.3.4 Viewpoint
The viewpoint of the camera is one of the main success factors in the recognition process. Although eight classifiers were trained to detect the machine, there are still views in which all of the detectors fail to recognize the target machine. These missed samples were mostly captured from extreme overhead angles (see Figure 3-19a) or when the machines moved on very steep roads (Figure 3-19b). Such views were not categorized in any of the eight orientations, as only a few of these viewpoints were available in the training stage.
Figure 3-19: Difficult viewpoints
3.3.5 Articulation
Dump trucks are rigid entities whose only articulation is opening and closing the bed. The dump truck training samples did not include machines with open beds, so the classifiers were not trained to detect trucks while dumping the load; detecting dumping operations is not within the scope of this research. However, there were some test images which showed trucks dumping the load, and the detectors were able to handle slight articulations as shown in Figure 3-4 (bottom right frame).
But articulation was a major issue in the detection of excavators, as they can take numerous poses. As long as the excavators had poses close to the training samples, the detector was able to recognize them with high accuracy; however, extreme poses were challenging to detect.
3.3.6 Scale change
Changes of the object size in the search frame did not affect the object recognition process.
Since the HOG recognition algorithm uses a sliding window approach with various scales,
the system is invariant to changes of target size. In addition, the shrinkage coefficient was set
to 1.05, which is a fairly low number to avoid missing a target. Figure 3-20 depicts a dump
truck moving toward the camera with dramatic scale change in four consecutive frames of
the video, and the detectors successfully detected all of the targets.
Figure 3-20: Changes in size of the dump truck as it approaches the camera
3.3.7 Orientation change
Similar to changes of size, orientation change was not a major issue in detection of machines
in a series of frames. For example, Figure 3-7 shows a series of frames from a video and
detectors successfully recognized the transition from side-left to rear-left in frame a and b.
3.4 Summary
Two object recognition algorithms, namely Haar-like features and HOG, were tested to recognize dump trucks from eight orientations. The HOG algorithm significantly outperformed the Haar-like features method, but its run-times were too long for real-time applications; parallel implementation of the HOG algorithm on a GPU solved this issue. Experiments on the test videos showed an acceptable hit rate with few false positives per frame. The highest hit rate in videos was 95.16% with 0.15 false alarms per frame.
Articulated poses of excavators, however, complicated the recognition process. Thus, a part-
based framework was developed to detect the boom and dipper of excavators in various
configurations. Since excavators are stationary equipment, detection of them in consecutive
frames can provide additional spatiotemporal data to enhance recognition performance.
Therefore, a spatiotemporal reasoning algorithm was developed to improve detection performance in construction videos.
The object recognition module identifies dump trucks and excavators in videos. The output of this module includes 2D boxes representing each machine; in addition, it identifies the orientation of dump trucks and the direction of the excavator's arm. Table 3-12 briefly describes the performance of the object recognition module under seven main affecting factors. The subsequent requirement of the SCIT system is to track the machine of interest, which is detected by the recognition module and identified as an active machine in the loading operation by the activity recognition module. The next chapter describes the object tracking engine, in which a novel tracking algorithm is introduced.
Table 3-12: Summary of the robustness assessment of the recognition process under main affecting factors

Factor               Strengths                                   Weaknesses
Occlusion            Able to detect objects with moderate        Fails under major occlusion, especially
                     occlusion (partially masked objects)        objects masked in ground-level views
Lighting             Invariant to regular day lighting           May have difficulty in very low
                     conditions                                  illumination and on foggy days
Shadow               Can detect objects with/under shadow        Some shadows may drastically alter HOG
                                                                 features and mislead classifiers
Viewpoint            Can identify objects from the viewpoints    May fail in some viewpoints, such as
                     included in the classifiers' training       extreme overhead angles or equipment
                     datasets                                    on a very steep road
Articulation         Not an issue for dump trucks                May have problems with extreme
                                                                 articulated poses of excavators
Scale change         Can detect targets as long as they are      -
                     mostly visible in the frame
Orientation change   Can detect usual orientation changes of     May fail in exceptional orientations
                     dump trucks and excavators' booms           not included in the training samples
CHAPTER 4 - OBJECT TRACKING MODULE
This chapter explains the steps taken to develop the tracking module for the SCIT system. Two tracking algorithms, mean-shift and a novel hybrid method, were used as a basis. The following sections describe these two algorithms.
4.1 Mean-shift Algorithm
Object tracking in videos is a useful tool to locate and monitor the activities of the equipment
and human resources on site. Recent studies (see section 2.5) showed that the mean-shift
method performs reliably in tracking resources in construction videos (Park et al. 2011; Gong
and Caldas 2011). Mean-shift is an iterative algorithm that starts with a preliminary point and then re-estimates the mean of the dataset until it converges. Probabilistic models such as the Kalman filter can predict the initial point for the iteration and therefore enhance the
performance of the mean-shift tracking (Gong and Caldas 2011). A modified version of the
mean-shift algorithm, called continuously adaptive mean-shift or Camshift (Bradski 1998)
was selected as the tracking engine.
The mean-shift algorithm can theoretically track different features of the target object, such as the color histogram and edges. The challenge, however, is to provide an intensity dataset in which the tracker can search for the local peak. This intensity dataset is usually represented as an 8-bit greyscale frame, varying from black at the weakest weight to white at the strongest. The hue, saturation, and value (HSV) color histogram is the most common tracking feature and was employed for this research as well. The algorithm first calculates the HSV color histogram of the target object, then segments the pixels whose HSV values fall within the histogram's range, providing the greyscale intensity dataset for the tracker in every search frame (see Figure 4-1).
In addition to the color histogram, the HOG response was employed as a second alternative. The HOG algorithm is invariant to color and illumination, so it can handle some color-related issues; for example, trucks covered with mud have a color histogram very close to that of the background soil and may mislead the tracker. The HOG object detector provides a dense greyscale map of the detection response for the mean-shift tracker: the maximum response is colored with the highest intensity and the rest of the responses are normalized based on their detection scores. For example, the right frame in Figure 4-2 shows the dense greyscale response of the left frame. The system scans one frame every second and the Camshift algorithm tracks the target using this response. This short interval is possible due to the high computational capability of the GPU, which takes less than 0.13 seconds to process HOG detection for one orientation.
Searching for local maxima can cause issues in certain situations. For instance, the tracker may expand or shift to a nearby object with features similar to those the algorithm is using. The left frame in Figure 4-1 shows the selected target for tracking and the right frame illustrates the segmented dataset with a similar color histogram; this close proximity caused the tracker (red ellipse) to expand to both machines. This issue is not limited to the application of the color histogram, as it is also visible in Figure 4-2, which uses the HOG response. This concern, along with other observed issues, is described further in the experimental results (Section 6.1).
Figure 4-1: left: selection of target truck in the original frame, right: isolation of pixels with similar color
histograms
Figure 4-2: left: original image, right: isolation of pixels with HOG response for side-right facing trucks
4.2 Hybrid Tracking
Because the mean-shift method has difficulty performing well when applied to construction
videos, a novel tracking framework was developed to robustly track dump trucks. This
hybrid algorithm was inspired by a recognition-based tracking framework developed to
interpret human actions (Barbu et al. 2012). They used the Latent SVM object recognition
method (Felzenszwalb et al. 2010) with lowered thresholds to create tracking candidates, and
employed the KLT feature tracker (Tomasi and Kanade 1991) to project each detected box
five frames forward to compensate for false negatives of the Latent SVM detector. The
second step uses a dynamic-programming algorithm (Viterbi 1971) to select a temporally
coherent set of detections for tracking.
Since the dump truck profiles do not change drastically between time steps, the HOG
algorithm was selected to help track a dump truck in the hybrid tracking technique developed
for this research. After recognition of a truck, the algorithm continues to find trucks, but in
shorter time intervals and in an optimized manner. In contrast with the recognition module
described in section 3.1.5, the hybrid tracker searches for only three orientations every two
seconds, i.e., the initial viewpoint of the target truck and the two adjacent orientations, which
the same GPU and CPU can process in 0.39 seconds. For instance, if the target truck was
spotted in a side-right viewpoint, the system only searches for front-right, side-right, and
rear-right orientations. This way, the hybrid tracker catches the changes in the trajectory of
the machine, but it is not required to check all eight orientations using a priori knowledge. In
addition, the detection thresholds are lowered to avoid false negatives, although the rate of
false positives increases as well. Each detected bounding box is a potential target.
Pure recognition-based tracking has two main issues. First, even decreasing the thresholds
cannot promise constant detection of the machine, so the target can be lost. Second, there
were some cases in the test videos where a second dump truck entered the frame and stopped
in the loading zone with a similar orientation as the truck being loaded. This often misled the
recognition-based tracker. In addition, a nearby false alarm can also cause the same error.
A feature tracking method was added to the framework to solve both of these problems. The
center point of the target machine is tracked by the Kanade-Lucas-Tomasi (KLT) feature
tracker (Tomasi and Kanade 1991) to project that bounding box to the next scanning frame.
Thus, it artificially generates bounding boxes in subsequent frames to significantly reduce the
risk of losing the machine, and to keep track of the actual target. Therefore, there will be a
projected window in addition to the true positive and false alarm boxes generated by the
recognition engine in every new frame. The KLT feature tracking algorithm is a differential
method to estimate the optical flow, which is based on three assumptions: 1) brightness
constancy, 2) temporal persistence, and 3) spatial coherence.
The result of the recognition and projection is a set of boxes with at least one member; a simple disjoint-set data structure is then employed to partition out the detection that is temporally coherent with the projected window, yielding a new bounding box and eliminating the other detections. This fusion algorithm groups two rectangles in the same subset if their bounding regions overlap: to be grouped, all of the distances between the x and y elements of the matching corners must be lower than the minimum average of the width and height of the two boxes times a threshold (Viola and Jones 2001). The corners of the final rectangle are the average of the corners of the projected box and the overlapping detection. If none of the detections is temporally coherent with the projected box, the projected box is taken as the final rectangle.
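A sketch of this fusion step follows. The grouping threshold value (`eps`) and the choice to fuse with the first coherent detection are assumptions made for illustration.

```python
def similar(a, b, eps=0.5):
    """Viola-Jones style grouping test: corresponding corner coordinates of
    boxes a = (x, y, w, h) and b must differ by less than eps times the
    smaller box's mean side length."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    delta = eps * min((aw + ah) / 2, (bw + bh) / 2)
    return (abs(ax - bx) < delta and abs(ay - by) < delta and
            abs(ax + aw - bx - bw) < delta and abs(ay + ah - by - bh) < delta)

def fuse(projected, detections, eps=0.5):
    """Average the projected window with a temporally coherent detection;
    keep the projection when nothing coherent was detected."""
    for det in detections:
        if similar(projected, det, eps):
            # Final corners are the average of the two coherent boxes.
            return tuple((p + d) / 2 for p, d in zip(projected, det))
    return projected
```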
The flowchart and the visual sequence of the entire hybrid tracking process are shown in Figure 4-3 and Figure 4-4, respectively. After formation of the new box, its center becomes a feature for the KLT tracker. The KLT feature tracking method is sensitive to objects passing in front of the tracked features, such as the bucket of an excavator or construction workers; even the shadow of the bucket can distract the tracking process. However, the hybrid character of this novel tracker means that the continuous HOG object recognition at short time intervals prevents the target equipment from being lost, improving the performance of the hybrid tracking algorithm.
Figure 4-3: Flowchart of the hybrid tracking process
Figure 4-4: a: detected truck at frame x1; b: HOG recognition result with lowered thresholds for three
viewpoints in frame x2; c: projected box of previous frame (frame x1) to frame x2 using KLT feature
tracker; d: fusion of the rectangles in b and c
The purpose of the tracking module is to track dump trucks in the limited region of the loading zone, which is discussed in the next chapter. Dump trucks usually move only slightly in the loading zone to achieve a better loading position; in addition, neither their orientation nor their scale changes dramatically. Therefore, the hybrid tracker does not have to handle extreme orientation or scale changes; the mixture character of the algorithm, however, allows it to process moderate situations. The recognition aspect of the hybrid algorithm has a dynamic
character. Once the target is identified, the HOG recognition searches for three orientations.
After the recognition process, the algorithm fuses the projected rectangle, which has the
initial size of the machine, with a detected box that overlaps. This fusion process helps the
system adjust the size of the targets. In addition, this algorithm continually changes the three
search viewpoints in every recognition attempt to account for a turning vehicle. For instance,
the initial orientation of the target was “side-right” in Figure 4-5a, so the framework searched
for rear-right, side-right, and front-right viewpoint in the next scan process (Figure 4-5b).
The system found the target with the same orientation in frame b, and then identified it with a front-right trajectory in Figure 4-5c. Therefore, the search orientations were changed to front, front-right, and side-right for the next scan (Figure 4-5d), and the machine was recognized with front-right orientation in frame d. The two-second time interval is appropriate for the purpose of this research, but it should be decreased to track fast objects or to capture extreme scale and trajectory changes.
Figure 4-5: Tracking of the orientation changes
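The dynamic selection of the three search viewpoints can be sketched as a lookup on the eight-way orientation wheel; the ordering of the orientation list is an assumption consistent with the examples above.

```python
# Eight viewpoints in circular order around the machine (assumed ordering).
ORIENTATIONS = ["front", "front-right", "side-right", "rear-right",
                "rear", "rear-left", "side-left", "front-left"]

def search_orientations(current):
    """The tracker scans only the current viewpoint and its two
    neighbours on the orientation wheel."""
    i = ORIENTATIONS.index(current)
    return [ORIENTATIONS[(i - 1) % 8], current, ORIENTATIONS[(i + 1) % 8]]
```

For a truck last seen side-right this yields front-right, side-right, and rear-right, matching the example in the text; once the truck is re-identified as front-right, the search set shifts to front, front-right, and side-right.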
4.2.1 Possibilities to Optimize Hybrid Algorithm
This version of the hybrid algorithm is computationally intensive and is viable only because of a GPU's parallel computation capability. The HOG object recognition searches for the target regardless of its size and location, so limiting the scale and the search region can reduce computations. The HOG algorithm searches for the target in the original frame, then scales the image down by the shrinkage coefficient and scans the resized image; this process continues until the frame reaches the size of the classifier window. In the case of a 1.05 shrinkage coefficient, for example, a 128x80 classifier window scans 34 frame sizes in an original image of 640x480 pixels. It is possible to optimize this recognition process for
tracking purposes. Instead of searching for all possible scales, the recognition aspect can be
set to search for the target in a range of scales (e.g. ± 20% of the prior scale). This
optimization may result in missing the targets with greater scale changes; however, it reduces
runtimes and therefore shorter time intervals can be exercised to compensate.
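The scale-pyramid arithmetic above can be reproduced with a short sketch. Assuming the detector rounds each scaled frame dimension to whole pixels (an assumption about the implementation, not stated in the text), counting the scan levels for a 1.05 shrinkage coefficient gives the 34 frame sizes quoted above:

```python
def num_pyramid_levels(frame_w, frame_h, win_w, win_h, shrink=1.05):
    """Count how many scaled frame sizes a sliding-window detector scans
    before the frame shrinks below the classifier window."""
    levels = 0
    while (round(frame_w / shrink ** levels) >= win_w and
           round(frame_h / shrink ** levels) >= win_h):
        levels += 1
    return levels

print(num_pyramid_levels(640, 480, 128, 80))  # 34
```

Restricting the search to ± 20% of the prior scale amounts to iterating only a narrow band of these levels instead of all 34.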
Another optimization opportunity is to limit the search region. Since the tracking engine
should track a dump truck with low speed and only minor scale and orientation changes, a
limited region was set for each search trial. In this setting, the HOG recognition engine
searches for the three orientations in a region of interest (ROI) that is determined using the
prior size and location of the target. This ROI has the same center as the bounding box of the
target in the previous search frame; the width and height of the ROI, however, are two and a
half times those of the target's bounding box (Figure 4-6). The ROI is dynamically defined
for each trial, allowing the system to capture size changes and movement of the tracked machine.
Figure 4-6: Left: red box encloses target truck, right: ROI to search for the target truck in the next frame
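The ROI construction described above can be sketched as follows; the (x, y, width, height) box format and the clipping to the frame borders are my assumptions:

```python
def search_roi(box, frame_w, frame_h, scale=2.5):
    """Build a region of interest centred on the target's previous
    bounding box, with width and height `scale` times the box's,
    clipped to the frame borders."""
    x, y, w, h = box                    # top-left x, top-left y, width, height
    cx, cy = x + w / 2, y + h / 2       # centre of the previous bounding box
    rw, rh = w * scale, h * scale
    rx = max(0, int(cx - rw / 2))
    ry = max(0, int(cy - rh / 2))
    rx2 = min(frame_w, int(cx + rw / 2))
    ry2 = min(frame_h, int(cy + rh / 2))
    return rx, ry, rx2 - rx, ry2 - ry

print(search_roi((300, 200, 100, 80), 640, 480))  # (225, 140, 250, 200)
```

Because the ROI is recomputed from the latest bounding box at every trial, it grows and shrinks with the target, which is what lets the system follow size changes and movement.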
Runtimes are significantly reduced compared to the 0.39 seconds needed to search the entire
frame. The process times depend on the size of the target's bounding box: larger boxes create
bigger ROIs and therefore longer process times. The runtimes were between 0.09 and 0.19
seconds in the test videos. These short process times allowed the interval to be reduced from
two seconds to one second. Shorter recognition intervals improve the algorithm's performance in
tracking fast objects and capturing dramatic scale and orientation changes. In addition, they
correct KLT's errors and prevent the tracking process from being misled. For example, a
passing worker distracts the KLT feature tracker in frames a and b in Figure 4-7 (the red dot
represents the feature tracker), but the hybrid algorithm corrects it in the subsequent frames
(frames c and d). This optimized implementation of the hybrid tracking algorithm was used in
this research.
Figure 4-7: Correction of the KLT method's distractions
4.3 Summary
This chapter described the tracking module of the SCIT system, which was developed using
two tracking algorithms. The mean-shift algorithm, one of the employed methods,
demonstrated promising results in earlier research. This method, however, may fail to track
dump trucks in some real-world conditions, such as when trucks with similar features are in
close proximity. Therefore, a novel tracking algorithm has been developed that combines
HOG recognition and a feature tracking algorithm to track dump trucks under the challenging
visual conditions found on jobsites.
The recognition and tracking modules provide useful spatiotemporal data of the earth
material loading equipment in construction video. The remaining major challenge is to
interpret these data to recognize the start and the finish time of loading cycles. The next
chapter introduces an action recognition algorithm that analyzes these spatiotemporal data to
recognize and estimate loading cycles.
CHAPTER 5 - THE ACTION RECOGNITION MODULE AND SYSTEM
ARCHITECTURE
Activity recognition is a popular and evolving research area in the computer vision field.
Several research efforts, mostly focusing on human action recognition, tackle this important
subject (Aggarwal and Ryoo 2011). Logical reasoning and machine learning algorithms are
two main approaches employed to address task recognition problems. Decision variables
must pass a set of consistent logical constraints to infer the action in logic-based methods.
Probability theory and statistical learning models, such as Bayesian belief networks, Hidden
Markov Models, and support vector machines, have been used to interpret the events in
machine learning approaches.
In logic-based algorithms, actions are considered to be objectives and background knowledge
is stated in a set of first-order constraints called an event hierarchy. This hierarchy is encoded
in first-order logic to examine spatiotemporal data, such as location, direction, or size, to
recognize actions. Logical reasoning has been used in construction research to estimate
productivity in cases where construction equipment or tools entering a predetermined work
zone is considered to be a working state. In some examples, a GPS system was employed to
track earthmoving plants, where the movements of the machines in work envelopes were
used to interpret grading and levelling operations (Navon et al. 2004), and a concrete hopper
entering a determined zone triggered a concrete pouring cycle (Gong and Caldas 2010).
Nonetheless, logical reasoning approaches have two main shortcomings. First, logical
constraints are strict and cannot incorporate uncertainty. For instance, a logic-based
framework cannot choose between two or more plans for which an agent qualifies. Second,
logical reasoning does not have a learning capability; the system cannot learn previously
unknown situations to tackle similar future scenarios.
In contrast, probabilistic and machine learning methods can learn behaviours using training
samples obtained from sensors or databases and are able to account for some level of
uncertainty. For example, probabilistic methods, such as Bayesian belief networks, can
estimate the probability of the potential plans, while machine learning algorithms, e.g.
support vector machines, not only decide whether a case belongs to an object class, but also
assign a score to passing instances. Thus, these methods can predict the most probable class.
The action recognition module of the SCIT system consists of a logical reasoning step and a
machine learning algorithm. This framework first checks whether the loading equipment,
namely an excavator and a dump truck, is positioned for loading, and then uses a machine
learning algorithm to examine the distances and sizes of the machines. The following
sections detail the components of the action recognition module.
5.1 Baseline Task
Earth material loading by an excavator is an interactive and one-way activity, which
significantly reduces the complexity compared to the human domain. Unlike the flexible and
unlimited number of poses of the human body, excavators typically operate in unidirectional
and predefined patterns. Due to the straightforward nature of this activity, a consistent set of
constraints was used to filter candidates, and another set of spatiotemporal calculi was used
to train an action recognition classifier. These spatiotemporal calculi provide simple
information about time and space, such as topology, direction, or distances between entities
(Renz and Nebel 2007).
5.2 Spatiotemporal Information
The recognition and tracking modules provide the location, size, and orientation/direction of
both dump trucks and the excavator. These spatiotemporal data were used to decide if the
server is loading one of the customers. Every detected dump truck and excavator is labelled
by the system according to its 2D coordinates, its orientation, and the dimensions of the
bounding box. To confirm a loading action, a dump truck should be in the appropriate
orientation within range of the excavator’s boom; hence the distance and the configuration of
the equipment are two key factors in recognition of a loading action.
5.3 Activity Recognition Module (ARM)
As noted before, this module is a mixture of two action recognition methods. They are
described more fully in the following subsections, but can be summarized as:
1. Logical reasoning process to identify possible loading activities using set rules
2. Machine learning process to confirm the loading activity
5.3.1 ARM Stage 1: Logical loading configuration
The first part of the ARM is a logical reasoning process to quickly examine the equipment
orientations for possible loading activity. These constraints are set based on a priori
knowledge of this activity. Dump trucks always need to draw their open-box bed near the
excavator for loading. For example, a left-facing boom and a side-right dump truck located
on the left of the excavator are not likely positioned for loading, although there are always
exceptions to the rule.
Although logic-based methods have two main shortcomings, neither is a problem for this
aspect of the ARM. First, the intention of this aspect is to filter possible candidates, not to
identify the final loading truck. So there is no uncertainty about the main target. Second, the
determined constraints are the most probable configurations, although they may not be
followed under special site conditions. Given the usual training dataset, even the machine
learning classifier will have difficulty recognizing those rare situations. Table 5-1 shows the
possible loading configurations based on the location of the excavator and orientation of
dump trucks.
Table 5-1: Possible loading configurations
Excavator
Dump truck
Located on the left of the
dump truck
Located on the right of
the dump truck
Front
Front-left
Front-right
Side-left
Side-right
Rear
Rear-left
Rear-right
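Stage 1 of the ARM reduces to a rule lookup keyed on the excavator's position relative to the truck. The orientation sets below are placeholders for illustration only; the admissible pairs actually used by the system are those of Table 5-1:

```python
# Hypothetical rule table keyed by the excavator's side relative to the
# truck; the real admissible orientations are listed in Table 5-1.
LOADABLE = {
    "left":  {"front-right", "side-right", "rear-right"},
    "right": {"front-left", "side-left", "rear-left"},
}

def passes_stage1(excavator_side, truck_orientation):
    """Return True if the pair is a plausible loading configuration."""
    return truck_orientation in LOADABLE[excavator_side]
```

Candidates rejected here never reach the SVM stage, which keeps the per-frame cost of action recognition low.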
5.3.2 ARM Stage 2: Machine learning action recognition
If a dump truck passes the first stage of recognition, it will be sent on to the second stage to
examine the distance and size ratio of the server and customer. The corner of the boom’s
bounding rectangle closest to the hinged support of the excavator arm is set as the base point.
For instance, if the boom is facing left, the base point would be the bottom right corner of the
bounding box. The system measures the distances between the base point of the excavator
and the four corners of the dump truck, and then divides them by the width of the excavator
bounding box to include the size factor. Figure 5-1 shows these distances. The distance
between the base point of the excavator and the top corner closest to the base point is
distance 1. The distance to the other top corner is distance 2. The distance to the closer
bottom corner is distance 3, and to the last corner, it is distance 4. These numbers create a
vector with four elements. A supervised learning approach was utilized to train linear
Support Vector Machines (SVM) (Cortes and Vapnik 1995) as the second step of the action
recognition process.
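The four-element feature vector described above can be computed as follows; the (x, y, width, height) box format is an assumption of this sketch:

```python
import math

def feature_vector(base, truck_box, excavator_width):
    """Distances from the excavator's base point to the four corners of
    the truck's bounding box, ordered closer-top, farther-top,
    closer-bottom, farther-bottom, and normalised by the width of the
    excavator's bounding box to include the size factor."""
    x, y, w, h = truck_box
    top = sorted([(x, y), (x + w, y)], key=lambda c: math.dist(base, c))
    bottom = sorted([(x, y + h), (x + w, y + h)],
                    key=lambda c: math.dist(base, c))
    corners = top + bottom              # distances 1, 2, 3, 4
    return [math.dist(base, c) / excavator_width for c in corners]
```

Dividing by the excavator's width makes the vector roughly scale-invariant, so the same classifier can serve cameras at different distances from the loading zone.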
Support vector machines are extensively used for pattern recognition, such as text
classification, object detection, and path recognition. This machine learning method
constructs a separating hyperplane or a set of hyperplanes that has the largest margin (gap)
between the positive and negative classes in either the input feature space or a kernelised
version of this. A large number of object class and non-object class samples are required to
train an efficient classifier.
Seven videos taken from different viewpoints, with a total duration of fifty-one minutes, were
selected for the training stage. The object recognition module scanned a frame every three
seconds for the machines, with lowered recognition thresholds to avoid false negatives and to
grouped into being loaded and not being loaded. Altogether, 1342 training vectors including
514 positive and 828 negative samples were produced to train the classifier. The publicly
available SVM-light software (Joachims 1999) was used to train the action recognition
classifier. This training software produces a weight vector of the same size as a training
sample. In addition, it also calculates a threshold, which is then used for classification. This
threshold can be adjusted slightly during the experiments to obtain the best results and to
perform sensitivity analysis. SVM-light calculated a threshold of -0.063 in this instance.
Figure 5-1: Distances between the corners of trucks and the base point in both left and right
configurations
For the classification stage, the SCIT system computes the distances between the base point
of the excavator and the four corners of any detected dump truck, including false positives,
then divides them by the excavator’s width, which produces a vector with four elements. The
resulting vectors are classified using the trained SVM classifier. This classification process is
the dot product of the classifier and test vector, and scores greater than the threshold are
accepted. If more than one dump truck is close enough to an excavator and therefore passes
the classification stage, the system will identify the one with the highest classification score
as the loading truck.
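The classification step then reduces to a dot product against the learned weight vector; the weights and threshold below are toy values for illustration, not the trained SVM-light model:

```python
def classify_loading(weights, threshold, candidates):
    """Score each candidate's four-element distance vector with a linear
    SVM (dot product) and return the index of the highest-scoring truck
    whose score exceeds the threshold, or None if no candidate passes."""
    best_idx, best_score = None, threshold
    for i, vec in enumerate(candidates):
        score = sum(w * x for w, x in zip(weights, vec))
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy example: negative weights favour trucks closer to the excavator.
print(classify_loading([-1, -0.5, -0.5, -0.25], -4,
                       [[1, 2, 2, 2], [3, 4, 4, 4]]))  # 0
```

Tracking the best score while iterating implements both rules at once: the threshold test, and the tie-break that picks the highest-scoring truck when several pass.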
5.4 Cycle Conclusion
After recognition of the truck being loaded, the system will stop searching for dump trucks,
record the start time, define the loading zone, and pass the loading truck to the tracking
engine. The dimensions of the loading zone depend on the size of the loading truck's
bounding box: the loading zone's length and height are 1.25 and 1.5 times the truck's length
and height, respectively (dark blue rectangle in Figure 5-2.d). The loading region is defined
fairly large to handle minor movements of dump trucks during loading for better positioning,
and to accommodate the small spatial variations produced by the tracking methods, thereby
reducing the risk of premature termination of the tracking of the loading dump truck. The tracking
module tracks the loading truck until the center of the tracking bounding box/ellipse exits the
loading zone. The system records that moment as the finish time and ends tracking of the
loaded truck.
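The loading-zone geometry and the exit test can be sketched as below (the implementation also shifts the zone's centre toward the excavator, as described in Section 5.5, which is omitted here):

```python
def loading_zone(truck_box, len_factor=1.25, ht_factor=1.5):
    """Zone centred on the loading truck's box, 1.25x its length and
    1.5x its height (Section 5.4)."""
    x, y, w, h = truck_box
    zw, zh = w * len_factor, h * ht_factor
    return (x + w / 2 - zw / 2, y + h / 2 - zh / 2, zw, zh)

def truck_left_zone(tracker_center, zone):
    """True once the tracker's centre point exits the zone, which marks
    the finish time of the loading cycle."""
    zx, zy, zw, zh = zone
    cx, cy = tracker_center
    return not (zx <= cx <= zx + zw and zy <= cy <= zy + zh)

zone = loading_zone((100, 100, 80, 40))
print(truck_left_zone((140, 120), zone))  # False: still loading
print(truck_left_zone((200, 120), zone))  # True: cycle finished
```

Testing only the centre point, rather than the whole box, is what tolerates the small jitter of the tracker while the truck is stationary.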
5.5 System Architecture
The system was implemented using the OpenCV 2.3.1 library (OpenCV 2011) in the Visual
C++ Express 2010 environment. OpenCV is an open-source library that mostly contains
video and image processing functions. The library is cross-platform, with C++, C, and
Python interfaces running on Windows, Linux, Android, and Mac operating systems. The library
includes various computer vision algorithms from basic level functions, such as loading and
saving images, to advanced algorithms, such as object recognition, tracking, and image
segmentation. Many of the functions used to develop the SCIT modules are already available
in the employed version of OpenCV (OpenCV 2011). The HOG object recognition, mean-
shift tracking, and KLT feature tracking are the off-the-shelf algorithms used for this
research. As stated in chapters three and four, however, these methods were modified and
integrated to develop the object recognition and tracking modules.
The object recognition module first searches for an excavator. Once detected (Figure 5-2.a),
it passes the detected bounding box to the mean-shift tracking module (Figure 5-2.b). The
current version of the SCIT’s object recognition module stops searching at one excavator, but
it is possible to extend the system to process videos with multiple servers. In addition to
tracking the excavator, the system begins to scan for dump trucks at predetermined time
intervals (see Figure 5-2.c). While scanning a 640x480 pixel frame for all eight orientations
takes about 1.07 seconds, any time interval greater than 1.1 seconds can maintain the real-
time stream of the video. For this research, the recommended four-second interval was used
(Rojas 2008). Then the action recognition module analyzes all detected dump
trucks in each detection interval to check whether any of them meet the logical configuration
constraints and pass the action recognition classifier. If the system confirms the loading
action (see Figure 5-2.d), it will discontinue searching for dump trucks, record the start time,
define the loading zone, and send the loading truck to the tracking engine (see Figure 5-2.e,
this frame shows the mean-shift tracker).
Figure 5-2: a: Detection of the excavator; b: tracking the excavator; c: detection of a truck that does not
meet loading criteria; d: detection of the loading truck; e: tracking of both pieces of equipment; f: truck
leaves the zone and tracking of the truck terminates
The tracking engine continues tracking the loading truck until the center of the tracker exits
the loading zone. The system records that moment as the finish time, terminates tracking of
the loaded truck, removes the loading zone (see Figure 5-2.f), and starts to search for new
dump trucks. As depicted in Figure 5-2.d, the center of the loading zone is shifted toward the
hydraulic excavator to handle the slight truck movements, which are mostly backward. The
shifted loading zone also causes earlier tracking termination and therefore improves the
accuracy of the SCIT finish times relative to the actual values. Figure 5-3 shows the entire
flowchart of this framework.
Figure 5-3: Flowchart of the entire SCIT system
There are possible situations where the excavator fully loads a dump truck, but it takes some
time for the truck to leave the zone. For example, the excavator has finished loading, but the
dump truck is still waiting for other vehicles to pass by. One could argue that no productive
work (loading) is done in the meantime, but as long as the truck stays in the zone, the next
loading cycle cannot begin. These abnormal cycles are part of actual working shifts and
should be included in cycle times.
5.6 Summary
Logical positioning and close proximity of a server and a customer are the two
spatiotemporal cues used to recognize a loading cycle. A logical reasoning framework checks
whether detected dump trucks are positioned for loading, and then a machine learning
classifier examines the relative distances of the passing trucks to the serving excavator. If
more than one dump truck passes the action recognition module, the one with the higher
score is accepted. In addition to action recognition, this module helps the system ignore the
dump trucks and false alarms that do not satisfy the requirements, and allows the truck
detection thresholds to be reduced to minimize false negatives.
The object recognition, tracking, and action recognition modules have been integrated to
develop the SCIT framework. The system first recognizes the server and then searches for
customers at predetermined time intervals. The action recognition module examines detected
dump trucks and, upon identification of a loading truck, the system defines a loading zone
and tracks the identified truck. Departure from the loading zone concludes the cycle, and the
SCIT then resumes searching for new dump trucks. The next chapter describes testing on
several construction videos to evaluate the performance of the SCIT system under the actual
conditions found on jobsites.
CHAPTER 6 - SCIT VALIDATION RESULTS
This chapter describes the process used to evaluate the performance of the SCIT system.
Several videos of excavation activities were captured at two condominium complexes in
downtown Toronto, Ontario. However, only the videos containing equipment with similar
productivity rates were selected to create a homogenous productivity dataset. Eighteen videos
with a total duration of 2 hours and 27 minutes were chosen, in which two types of hydraulic
excavators (Caterpillar 245B and Caterpillar 345D) and several makes of urban dump trucks
with similar hauling capacities, such as Mack, Sterling, Volvo, and Kenworth, appeared.
Since this system aims to estimate cycle durations under the actual conditions found on
jobsites, the videos were recorded during eight site visits in three seasons (winter, spring, and
summer), at different times of the day, and with different levels of cloudiness. This
allowed a variety of lighting conditions to be recorded. Moreover, these videos were taken
from both ground level and elevated viewpoints using two different makes of digital cameras
to diversify visual conditions. None of these videos were used to train the action recognition
classifier. Figure 6-1 depicts some of the views.
Figure 6-1: Some of the earth material loading views
The excavators had typical construction colors including yellow and red, while the urban
dump trucks were painted a variety of colors such as white, red, black, green, blue, gray, and
purple.
6.1 Experimental Results
The SCIT system with mean-shift and hybrid tracking modules processed the test videos with
varied action recognition thresholds, and the machine-generated results are provided together
with manual observations as ground truth in Table 6-1. This table provides the number of
true positives, false negatives, false alarms, incomplete detected cycles, and the average cycle
times of true positive cycles. Incomplete detected cycles were correctly recognized by the
system, but the tracking module failed to persistently track the loading truck throughout the
cycle, which resulted in early termination of the tracking and resetting of the timer. In the
manual observations, the loading time starts when the excavator initiates the loading activity
and ends as the truck starts moving out.
The tests with smaller ID numbers have lower action recognition thresholds, and the
threshold rises as the test number increases. Test 5 uses the threshold provided by the SVM-
light software at the end of the training stage, but this threshold should be decreased, as it
missed three true positive cycles (Table 6-1). Thus, the threshold was lowered in steps of 0.1
in the other tests for the sensitivity analysis of threshold alteration.
Table 6-1: Results of the experiments with different action recognition thresholds on test videos

Test                              Number of  False     False     Incomplete  Avg. cycle time    ARM
                                  detected   negative  positive  detected    (true positive     threshold
                                  cycles     cycles    cycles    cycles      cycles only, s)
Manual                            55         0         0         0           101.87             -
Test 1: SCIT with hybrid          53         0         4         0           106.49             -0.463
Test 2: SCIT with hybrid          54         0         2         0           106.43             -0.363
Test 3: SCIT with hybrid          54         0         2         0           105.93             -0.263
Test 4: SCIT with hybrid          54         0         2         0           105.33             -0.163
Test 5: SCIT with hybrid          51         3         1         0           105.08             -0.063
Test 3a: SCIT with mean-shift
  using HOG response              48         0         2         6           104.40             -0.263
Test 3b: SCIT with mean-shift
  using color histogram           30         0         2         24          105.00             -0.263
As presented in Table 6-1, the SCIT with the hybrid method had the best performance in Test
4. Tests 2 to 4 with hybrid tracking had the highest number of true positive cycles, but the
average time deviation in Test 4 was less than in Tests 2 and 3. The threshold found in hybrid
Test 3 (as a midpoint in the optimal range) was used for the SCIT with mean-shift tracking
using both color histogram and HOG response, but the performance was substandard (Tests
3a and 3b).
These tests were only able to provide correct data for 48 and 30 of the loading cycles. The
poor performance of the mean-shift algorithm in tracking the dump trucks being loaded
caused these unsatisfactory results, so the tests with other thresholds were aborted for SCIT
with mean-shift tracking.
The mean-shift algorithm with color histogram had problems tracking trucks with neutral
colors, including white, black, and gray. In a number of cases, the tracker missed the target
and thereby concluded that the loading cycle was complete. This stopped the loading clock
and reset it for the start of a new cycle upon the next detection of the same truck, resulting in
erroneous productivity data. Since the HOG method is invariant to color, however, the mean-
shift with HOG response could correctly estimate those cycles.
In addition, there were some instances in the tests using the mean-shift algorithm with color
histogram (Figure 6-2, images a to c) where the tracking blob switched or expanded from the
loading truck to a nearby machine of the same color, producing false results. A similar
problem was observed while using mean-shift with HOG response in cases where dump
trucks with the same orientation were in close proximity, regardless of their color; this
resulted in six incomplete cycles. The SCIT with the hybrid tracker correctly processed all of
the mentioned videos (frames d to f in Figure 6-2).
Figure 6-2: Frames a to c: Expansion of mean-shift tracking, images d to f: Hybrid tracking
6.2 Discussion
The SCIT with the hybrid tracking framework significantly outperformed the SCIT with the
mean-shift algorithm using either color histogram or HOG response, so the SCIT with mean-
shift was set aside, and this section discusses only the results obtained from the SCIT with
the hybrid tracking engine.
The results in Table 6-1 demonstrate that the performance of the SCIT (with hybrid tracker)
is not very sensitive to the threshold change. For instance, the system performed the same
within the range of -0.163 to -0.363 in terms of true positives, false negatives, and false
positives. The average cycle times, however, had slight differences. Lowering the threshold
stretches the true positive cycle times, as the system detects loading trucks before they get
sufficiently close to the excavator to start loading. In addition, lowering the threshold
increases the chance of accepting more bounding boxes as a loading truck and therefore
produces more false positive cycles. For instance, Test 1 has the lowest threshold: it
misclassified four cycles (two more than Tests 2, 3, and 4), and its average cycle time is the
highest. Increasing the thresholds in Tests 2 to 5 improved the average time.
Raising the threshold gives more accurate cycle times, as the classifier recognizes dump
trucks only when they are fully positioned for loading. However, higher thresholds may
result in missing some cycles: those in which the dump trucks stop farther away than usual
for loading, so that their SVM scores cannot pass the higher thresholds. For example, Test 5,
which had the highest threshold, missed three more cycles than Test 4.
All of the processed videos were investigated to find causes for the errors. Since this system
is composed of three modules, errors can be found in object recognition, tracking or the
action recognition classifier. The following sections describe the errors and deviations that
occurred in the testing.
6.2.1 False positive cycles
Two scenarios resulted in false detection of a loading cycle: recognition of a foreground
truck instead of a background truck and identification of a false positive box as a loading
truck. The SCIT identified the wrong machine when another truck largely masked the
loading truck (Figure 6-3 a, b, and c). The framework identified the foreground machine
(green truck) instead of the loading truck (dark blue) and produced the wrong productivity
result. This configuration took place in a video captured from ground level; the system,
however, was able to handle overlooking views in which the loading truck was only partially
masked by other equipment. Images d, e, and f in Figure 6-3 depict the same work zone with
the same arrangement of machines, recorded only a few minutes later from an elevated angle.
The SCIT with the hybrid tracker was able to correctly detect and track the loading truck. This
highlights the importance of an appropriate camera viewpoint for correct outcomes. Nearby
buildings, peaks of slopes, tower cranes, or temporary posts are suitable options for camera
installation; however, some construction sites may lack such options, as there are no
overlooking points nearby, or they are not accessible. Although not investigated in this
research, two possible solutions to overcome this problem without changing the camera view
are:
1. Logical reasoning would help interpret an occluded situation. For example, if a new
truck blocks the view of the loading truck, it is possible to conclude that the actual
machine is masked, and the system can wait until the truck becomes visible before
estimating the loading cycle. However, exceptional situations may mislead the
system; for example, the background truck may not appear again and may leave the
scene under cover of the foreground machine.
2. Use local suppression to detect and track the part of the machine that is not occluded.
In this approach, if a part of a truck gets a higher score than the foreground machine
in the action recognition phase, the background machine is selected and tracked.
Figure 6-3: frames a to c: Recognition and tracking of the incorrect loading truck due to severe
occlusion; frames d to f: correct recognition and tracking by changing the camera location
In the second scenario, false alarm detections may result in false positive cycles. As stated
before, the thresholds for the recognition of dump trucks were decreased to avoid false
negatives, so the detectors produced many false positives, most of which were rejected by the
action recognition module. Most of these false positives did not have the appropriate
orientation for loading, or their size and location could not pass the SVM classification. In
addition to the false positive resulting from occlusion, there were three false positives in Test
1 and one in each of Tests 2, 3, and 4 that resulted from non-truck detections. Test 5 did not
have a false positive cycle due to false alarm object detection. More false positives occurred
at lower action recognition thresholds, whereas higher thresholds successfully ignored them.
Since these false detections do not represent a real machine, the hybrid tracker does not track
a real truck, and all the center points exited the loading zone quickly. This occurred due to
the movement of the excavator boom or another object, which carried the KLT tracking
features outside of the loading box. It is easy to spot these false positives in the results, as
they have short durations compared to actual cycles.
6.2.2 False negative cycles
False negatives only occurred in the test with the highest action recognition threshold (Test
5). A review of the missed samples revealed that the dump trucks approached and stopped
farther away from the loader than usual due to special conditions of the loading zone;
therefore, the vector of relative distances did not pass the SVM classification.
6.2.3 Differences in start and finish times
The SCIT framework had some variations in recording the start and finish loading times
compared to manual observation, which produced different average cycle times (Table 6-1).
In reviewing the data, the inaccuracies had four main causes:
1. The SCIT scanned the videos for new dump trucks every four seconds, so variations
of 0 to 4 seconds in the activity start times compared to the manual study are
inevitable.
2. The SCIT was sometimes slow to detect the loading truck even though it was in place
when the system was searching for it; it took more than one scan process to identify
the loading truck.
3. The human observer recorded the finish time when the truck began to leave; however,
it took a few seconds for the system to detect the end of the activity, as it waited until
the center of the tracked truck exited the loading zone.
4. The viewpoint of the camera is another important factor for a better result. The side
view of the loading operation provides the best view for action recognition: both the
server and customer plants are at the same distance from the camera, and the action
recognition classifier can detect the start of the loading action on time. Front
viewpoints were the most challenging cases; in these cases, the dump truck appears
close enough in the 2D view to pass the action recognition module, but a gap between
the server and customer plants still remains to be closed. For instance, the SCIT
detected the start of the cycle after 3:56 minutes (Figure 6-4 frame a), but the
operation actually initiated after 4:04 minutes (Figure 6-4 frame b). The same issue
causes delays in recording the finish time; for example, there was a four-second
difference for the instance in Figure 6-4.
Figure 6-4: a. incorrect detection of the loading start time, b. actual start time
6.3 Practical Applications
The SCIT system provides two main types of data which are highly useful in management of
earthmoving projects: number of cycles and activity durations.
6.3.1 Cycle counting
The most basic outcome is production confirmation in which the SCIT can count the number
of loading cycles made by the earthmoving fleet, and then approximate the quantity of earth
material moved using the number of trips and the standard capacity of dump trucks. The
number of trips can also be used to confirm the quantity of work achieved by the
earthmoving subcontractor. Earthmoving foremen are usually responsible for this task and
they should also direct the dump trucks in the loading zone (Figure 6-5). This zone is one of
the most hazardous areas in construction sites as the excavators and dump trucks operate in a
confined space (Edwards and Nicholas 2002). For instance, the right frame in Figure 6-5
shows a packed jobsite where two people were assigned to manage dump trucks. The SCIT
has the potential to eliminate the distractive data recording task and the foremen can focus on
site safety.
Figure 6-5: Earthmoving foremen
6.3.2 Cycle durations
The SCIT system records loading cycle durations, which have several applications during the construction period and afterwards. Activity durations can be used to study productivity, find bottlenecks, and enhance ongoing operations. Practitioners can use industry standards, such as manufacturers' performance handbooks or productivity data from previous similar operations, as benchmarks to find substandard operations. In addition, productivity data are the main input for advanced analyses such as stochastic simulation for planning future activities (AbouRizk and Halpin 1992), and they can be used to estimate the cost of similar future operations.
As discussed before, the results of the SCIT system deviate from the ground-truth data. The main issue is to validate the accuracy of the machine-generated cycle times for practical construction applications. Construction companies need activity durations to assess the performance of their fleet at the end of each working shift, to find the causes of delay, and to correct them accordingly. Construction practitioners are interested in the average productivity of a working shift and are generally not concerned about a few prolonged cycles or idle times. In addition, working conditions, such as soil, weather, and equipment conditions, vary with every working shift; therefore, the test results of the SCIT are grouped by site visit.
Tests 2, 3, and 4 with the hybrid tracking algorithm had the best performance in terms of true positive cycles. Tests 3 and 4 had lower deviations from the manual data, so their output is grouped by the eight site visits to assess the deviation in each case. Table 6-2 and Table 6-3 provide the ground-truth and machine-generated data, along with the average loading cycle times and the average time between cycles, for each site visit in Test 3 and Test 4, respectively. In addition, these tables present the deviation percentage of the machine-generated average loading cycles for each site visit and overall. This deviation is calculated as: (machine-generated time − manual observation time) / manual observation time.
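As a worked example, the deviation for site visit 1 in Test 3 (manual average 0:01:42, machine-generated average 0:01:45) can be reproduced with a short script; the helper names below are illustrative, not part of the SCIT implementation:

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    h, m, s = (int(p) for p in hms.split(":"))
    return 3600 * h + 60 * m + s

def deviation(machine_avg: str, manual_avg: str) -> float:
    """(machine-generated time - manual observation time) / manual observation time, in percent."""
    machine, manual = to_seconds(machine_avg), to_seconds(manual_avg)
    return 100.0 * (machine - manual) / manual

# Site visit 1, Test 3: 0:01:45 vs 0:01:42
print(round(deviation("0:01:45", "0:01:42"), 2))  # → 2.94
# Overall, Test 3: 0:01:46 vs 0:01:41
print(round(deviation("0:01:46", "0:01:41"), 2))  # → 4.95
```

These values match the 2.94% and 4.95% entries in Table 6-2.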
Table 6-2: Detailed results of the SCIT with hybrid - Test 3

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 2.94%  | 0:00:57
  |    | Software | 0:29:47 | 0:15:12 | 0:01:45 |        | 0:00:54
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 13.73% | 0:00:43
  |    | Software | 0:11:35 | 0:02:54 | 0:01:56 |        | 0:00:29
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 11.54% | 0:02:05
  |    | Software | 0:04:21 | 0:05:48 | 0:01:27 |        | 0:01:56
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | 3.68%  | 0:01:14
  |    | Software | 0:04:42 | 0:02:18 | 0:02:21 |        | 0:01:09
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 6.14%  | 0:00:48
  |    | Software | 0:06:03 | 0:02:04 | 0:02:01 |        | 0:00:41
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | 0.00%  | 0:00:51
  |    | Software | 0:06:08 | 0:04:19 | 0:01:14 |        | 0:00:52
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 4.81%  | 0:00:49
  |    | Software | 0:10:53 | 0:04:27 | 0:01:49 |        | 0:00:45
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 1.87%  | 0:01:03
  |    | Software | 0:21:51 | 0:12:10 | 0:01:49 |        | 0:01:01
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 4.95% | 0:00:59
        |    | Software | 1:35:20 | 0:49:12 | 0:01:46 |       | 0:00:55
Table 6-3: Detailed results of the SCIT with hybrid - Test 4

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 0.98%  | 0:00:57
  |    | Software | 0:29:18 | 0:15:41 | 0:01:43 |        | 0:00:55
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 13.73% | 0:00:43
  |    | Software | 0:11:35 | 0:02:54 | 0:01:56 |        | 0:00:29
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 3.85%  | 0:02:05
  |    | Software | 0:04:04 | 0:06:05 | 0:01:21 |        | 0:02:02
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | 8.82%  | 0:01:14
  |    | Software | 0:04:56 | 0:02:04 | 0:02:28 |        | 0:01:02
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 7.89%  | 0:00:48
  |    | Software | 0:06:09 | 0:01:58 | 0:02:03 |        | 0:00:39
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | 0.00%  | 0:00:51
  |    | Software | 0:06:08 | 0:04:19 | 0:01:14 |        | 0:00:52
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 4.81%  | 0:00:49
  |    | Software | 0:10:53 | 0:04:27 | 0:01:49 |        | 0:00:45
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 1.87%  | 0:01:03
  |    | Software | 0:21:45 | 0:12:16 | 0:01:49 |        | 0:01:01
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 3.96% | 0:00:59
        |    | Software | 1:34:48 | 0:49:44 | 0:01:45 |       | 0:00:55
The overall average cycle times in Tests 3 and 4 deviate by 4.95% and 3.96%, respectively. Test 4, with a higher action recognition threshold, performed slightly better: its machine-generated total loading time (1:34:48) is closer to the manually recorded time (1:31:18) than the total loading time obtained in Test 3 (1:35:20). The lowest accuracy in both tests occurred in the videos of site visit 2, with a 13.73% deviation. A review of this case showed that the video captured the loading operation from a front-on viewpoint. As discussed in section 6.2.3, front views result in earlier start and later finish times; therefore, the machine-generated average cycle time was 14 seconds longer than the ground-truth data.
The scan time interval for dump trucks is another important factor. Thus, 3 and 5 second intervals were also tested (with the same action recognition threshold used in Test 3, as a middle point in the optimal threshold range), and the results are presented in Table 6-4 and Table 6-5, respectively. Changing the scan interval had a mixed effect on SCIT performance. The shorter interval (3 seconds) resulted in more timely detection of dump trucks in some cases, but sometimes caused premature detection of the loading dump trucks. In terms of accuracy, the results with 4 second intervals are slightly better than those with 3 and 5 second intervals.
In addition to the accuracy of cycle times, false positives and computational load are two other important factors. The test with 3 second intervals had four false positive cycles, while the tests with 4 and 5 second intervals each had two. The test with 3 second intervals scanned more frames than the 4 and 5 second tests, so the chance of misdetection was higher. Moreover, scanning for dump trucks at shorter intervals imposes more load on the GPU and the processing unit.
Table 6-4: Detailed results of the SCIT with hybrid - 3 second intervals

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 5.88%  | 0:00:57
  |    | Software | 0:30:32 | 0:14:27 | 0:01:48 |        | 0:00:51
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 14.71% | 0:00:43
  |    | Software | 0:11:41 | 0:02:52 | 0:01:57 |        | 0:00:29
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 8.97%  | 0:02:05
  |    | Software | 0:04:16 | 0:05:53 | 0:01:25 |        | 0:01:58
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | 0.00%  | 0:01:14
  |    | Software | 0:04:32 | 0:02:28 | 0:02:16 |        | 0:01:14
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 7.02%  | 0:00:48
  |    | Software | 0:06:06 | 0:02:01 | 0:02:02 |        | 0:00:40
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | -1.35% | 0:00:51
  |    | Software | 0:06:04 | 0:04:23 | 0:01:13 |        | 0:00:53
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 6.73%  | 0:00:49
  |    | Software | 0:11:09 | 0:04:11 | 0:01:51 |        | 0:00:42
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 0.93%  | 0:01:03
  |    | Software | 0:21:39 | 0:12:22 | 0:01:48 |        | 0:01:02
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 5.94% | 0:00:59
        |    | Software | 1:35:59 | 0:48:37 | 0:01:47 |       | 0:00:54
Table 6-5: Detailed results of the SCIT with hybrid - 5 second intervals

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 4.90%  | 0:00:57
  |    | Software | 0:30:18 | 0:14:41 | 0:01:47 |        | 0:00:52
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 13.73% | 0:00:43
  |    | Software | 0:11:34 | 0:02:58 | 0:01:56 |        | 0:00:30
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 12.82% | 0:02:05
  |    | Software | 0:04:24 | 0:05:45 | 0:01:28 |        | 0:01:55
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | -2.21% | 0:01:14
  |    | Software | 0:04:25 | 0:02:35 | 0:02:13 |        | 0:01:18
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 6.14%  | 0:00:48
  |    | Software | 0:06:04 | 0:02:03 | 0:02:01 |        | 0:00:41
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | -2.70% | 0:00:51
  |    | Software | 0:06:01 | 0:04:26 | 0:01:12 |        | 0:00:53
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 9.62%  | 0:00:49
  |    | Software | 0:11:26 | 0:03:54 | 0:01:54 |        | 0:00:39
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 0.00%  | 0:01:03
  |    | Software | 0:21:22 | 0:12:39 | 0:01:47 |        | 0:01:03
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 4.95% | 0:00:59
        |    | Software | 1:35:34 | 0:49:01 | 0:01:46 |       | 0:00:54
A key issue is: how much error is acceptable to the construction industry? Every industry has a level of tolerance for its automated data collection systems. For instance, the aviation industry has one of the highest safety levels, and the expected accuracy of its positioning devices, such as radar or onboard GPS antennas, is extremely high. The heavy civil engineering industry, however, has high levels of variation in equipment productivity. Even the equipment manufacturers, who have tested their products under various conditions with operators of different skill levels, provide a range of productivity rates. For instance, according to the Caterpillar performance handbook (Caterpillar 2006), dump trucks usually require 36 to 48 seconds to manoeuvre into the loading position, and it takes three or four buckets for the test case excavator (CAT 345C) to load a typical urban dump truck. This handbook provides a range of cycle times for an excavator swing (one bucket) under different conditions such as soil type, configuration of the excavator and dump truck, and the angle of swing. Each swing cycle includes four steps: load bucket, swing loaded, dump bucket, and swing empty. The manufacturer provides detailed ranges for different machines, including the test case of this research, the CAT 345C. The conditions of the loading operation in the test cases are presented in Table 6-6. According to the handbook, the minimum and maximum swing times are 19.8 and 24 seconds, respectively. This type of excavator can load an urban truck with 3 or 4 buckets, depending on the conditions and the skill/efficiency of the operator. Sometimes the operator can fully fill the buckets and load the truck with only three swings, but in most cases the loading conditions require four swings.
Table 6-6: Loading conditions of the test cases

Condition | Easy | Hard
Angle of swing | 90° | 120°
Soil condition | Hard packed soil with up to 50% rock content (both cases)
Depth of excavation | < 70% of max. capability | < 90% of max. capability
Excavator level relative to dump truck | Elevated | Same level
Swing time for excavator CAT 345C | 19.8 seconds | 24 seconds
Since the cases in this research were two urban construction sites, the excavator operators must compact and level the surface of each load to prevent loose material from escaping onto the street, cars, and pedestrians. A study of the seven videos used to train the action recognition classifiers showed that it takes 10-15 seconds for the excavators to level and clean the load's surface; thus, 12.5 seconds is added to the minimum and maximum benchmark times. The minimum benchmark time is three repetitions of the shortest swing cycle divided by an operator skill coefficient, for which the recommended value is 0.9 (Caterpillar 2006), plus the levelling time. The maximum benchmark time is four repetitions of the longest swing cycle divided by the operator skill coefficient, plus the levelling time. The calculated minimum and maximum benchmark times are 1:19 and 1:59 minutes, respectively.
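The benchmark calculation above can be reproduced directly; a minimal sketch in which the constant names are illustrative:

```python
# Swing cycle times for the CAT 345C under the test conditions (Caterpillar 2006)
MIN_SWING = 19.8   # seconds, easy conditions (90 degree swing)
MAX_SWING = 24.0   # seconds, hard conditions (120 degree swing)
SKILL_COEFF = 0.9  # recommended operator skill coefficient
LEVELLING = 12.5   # seconds, midpoint of the observed 10-15 s levelling time

# 3 repetitions of the shortest swing; 4 repetitions of the longest swing
min_benchmark = 3 * MIN_SWING / SKILL_COEFF + LEVELLING
max_benchmark = 4 * MAX_SWING / SKILL_COEFF + LEVELLING

def fmt(seconds: float) -> str:
    """Format seconds as m:ss, rounding half up to the nearest second."""
    total = int(seconds + 0.5)
    return f"{total // 60}:{total % 60:02d}"

print(fmt(min_benchmark), fmt(max_benchmark))  # → 1:19 1:59
```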
The overall ground-truth and machine-generated average cycle times are 1:41 and 1:45 minutes (Table 6-3). Not only are the recorded cycle times within the range suggested by Caterpillar, they fall within a narrow band of that range. It can therefore be concluded that the deviation between the actual and SCIT cycle times is acceptable in this context.
6.4 Monitoring Other Earthmoving Operations
The case presented in this research is one of the most challenging activities for a vision-based system because:
- It is an interactive activity involving two types of equipment;
- Loading zones are visually occluded, as many machines work in a confined area.
Other common earthmoving operations, including hauling, leveling, compacting, and excavating, are performed using one type of equipment, and the views are not as occluded.
6.4.1 Hauling
Dump trucks carry earth material within the jobsite or off site. If the viewfinder captures the entryway to a loading or dumping area, a vision-based system is able to count the number of truck loads and record the time gap between them. Since the object recognition and tracking modules were already developed for dump trucks, they were reused to count the number of truck loads. The viewfinder should be set on the entryway or an access road; the object recognition module then searches for dump trucks from eight viewpoints at 4 second intervals (the same approach employed in the SCIT). Any detected machine is then tracked using the hybrid tracking algorithm. False positives, however, remain an issue for this approach, as there is no action recognition module to remove them.
The spatiotemporal data provided by the tracking module can be used to distinguish true positives from false alarms. In this approach, all detected windows, including true positives and false alarms, are passed for tracking with the hybrid algorithm. As explained in section 4.2, the hybrid algorithm uses successive recognitions of the target in a ROI to track that machine. False positives are of two types: randomly scattered boxes and repeated detections in the same location. In the first case, the object recognition part of the hybrid tracker will not re-detect the target in the ROI; in the second, the recognition section repeatedly recognizes the target in the same place with no movement. A simple action recognition module was therefore developed to distinguish moving dump trucks from false positives.
The system assigns a score of zero to every detected box. Each redetection of the target during a tracking interval (one second in this framework) adds +1 to its score. A tracked object can therefore score 0-4 within a 4 second window, and the system removes targets with scores lower than 2. This constraint eliminates random false positives. The second constraint is the displacement of the target: any object that moves less than half the width of its bounding box within a 4 second interval is omitted as well. This constraint may remove a motionless true positive (a temporarily immobile dump truck on the road), but this does not affect the outcome, because the system will identify and track that truck upon departure.
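The two constraints can be sketched as a simple filter; the `Track` structure below is a hypothetical stand-in for the tracker's internal state, not the actual SCIT code:

```python
from dataclasses import dataclass

@dataclass
class Track:
    score: int           # redetections within the 4 s window (0-4)
    displacement: float  # pixels moved within the window
    box_width: float     # bounding box width in pixels

def is_true_positive(track: Track) -> bool:
    """Keep a track only if it was re-detected often enough and actually moved."""
    # Constraint 1: fewer than 2 redetections in 4 s -> random false positive
    if track.score < 2:
        return False
    # Constraint 2: movement below half the box width -> static false positive
    if track.displacement < track.box_width / 2:
        return False
    return True

# A moving truck re-detected in 3 of 4 one-second intervals is kept;
# a repeated detection with almost no movement is discarded.
print(is_true_positive(Track(score=3, displacement=60.0, box_width=100.0)))  # → True
print(is_true_positive(Track(score=4, displacement=10.0, box_width=100.0)))  # → False
```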
For instance, Figure 6-6 illustrates two frames with a view of an access road from a rock quarry to a rock-fill dam construction site. The dump trucks heading right carry rocks to the rock-fill area, and those heading left return to the quarry. The left frame shows the detected machines, and the right one depicts their tracking results 7 seconds later. The red lines represent the tracking path of each machine.
Figure 6-6: Left: detection of loading trucks; right: tracking of trucks
6.4.2 Leveling and compacting
Leveling and compacting tasks are performed using graders and rollers, respectively. A single machine or multiple machines carry out these operations in a cyclic manner. For example, compaction of an earth-fill layer may require 10 passes of a roller, so the contractor may assign one roller to make 10 passes, employ two rollers in a row for five passes each (Figure 6-7), or use any combination resulting in 10 passes. A vision-based system has to detect, track, and record the time for each pass. The major effort is to train the object recognition module. Since these plants have relatively rigid shapes, the same approach used to recognize and track dump trucks can be applied. Given the width and thickness of each layer, the system can estimate the productivity of the compaction operation. The same method is applicable to graders.
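As a sketch of how tracked passes translate into productivity, the following uses illustrative numbers; the function, its parameters, and the example values are assumptions for illustration, not data from this research:

```python
def compaction_productivity(drum_width_m: float, speed_m_per_h: float,
                            layer_thickness_m: float, passes_required: int,
                            overlap_m: float = 0.2) -> float:
    """Estimate compacted volume per hour (m^3/h) from tracked roller passes.

    The effective width discounts the overlap between adjacent lanes; dividing
    by the required number of passes converts area covered into finished area.
    """
    effective_width = drum_width_m - overlap_m
    finished_area_per_hour = effective_width * speed_m_per_h / passes_required
    return finished_area_per_hour * layer_thickness_m

# Hypothetical case: a 2.1 m drum at 4 km/h, 0.3 m lifts, 10 passes per lane
print(round(compaction_productivity(2.1, 4000.0, 0.3, 10), 1))  # → 228.0
```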
Figure 6-7: Compaction with two rollers
6.4.3 Excavation
Bulldozers move forward and backward to engage their blade or ripper for excavation. Detecting and tracking them can help estimate productivity, but the solution is not as straightforward as for rollers and graders. The volume of excavated material depends on the soil/rock conditions, the ground slope, the depth to which the blade is forced into the surface, and the skill level of the operator. It is challenging even for the naked eye to estimate the excavated volume, so a vision-based system can only estimate the movements of the bulldozer. Other techniques, such as surveying and laser scanning, are required to measure the volume of excavated material and thereby the productivity.
6.4.4 Extended Monitoring System
A system could include all the above-mentioned modules, with site engineers setting the viewfinder and choosing the type of activity to monitor. It is also possible to increase the level of automation so that the system semantically identifies the type of action and then switches to the associated modules. First, the system requires a defined earthmoving taxonomy so it can identify the type of activity based on the equipment present in the scene. In this method, the system uses a brute-force approach to search for the different types of equipment available in the recognition classifiers. This feature gives the system a context-awareness capability, which enables it to sense the environment and react based on the processed information. This approach, however, is computationally intensive. Therefore, the system should only scan the first couple of minutes of the video; if machines are detected, the system switches to the object detection and action recognition modules associated with the detected equipment. For example, if the system detects urban dump trucks and an excavator, it will conclude that the site is a loading zone in an urban setting and will use only the related modules.
CHAPTER 7 - CONCLUSION AND FUTURE DIRECTIONS
Earthmoving activities are a costly component of heavy construction projects and mining operations. Various automated controlling systems have been employed to monitor these equipment-intensive operations. GPS-based systems have been the main controlling device for more than a decade in both the construction and mining sectors. However, GPS antennas must be installed on every machine, and the transmitted data must be interpreted indirectly to estimate productivity. In addition, locational records do not provide clear representations of the scenes for finding the causes of delays or abnormal productivity rates.
The relatively clear sightlines found in earthmoving sites, low-cost cameras, high-capacity storage devices, and advances in image and video processing algorithms make earthwork jobsites promising candidates for vision-based monitoring systems. Recent studies have used computer vision algorithms to identify and track earthmoving plants in construction videos and time-lapse footage, and the outcomes were then used to estimate equipment productivity. These research efforts, however, have not proven capable of automating equipment productivity measurement in real-world scenarios, as they were carried out under ideal conditions, namely plain backgrounds, a very low level of occlusion, specific viewpoints, manually defined work zones, and the presence of only a few types of equipment. The conditions under which these approaches were evaluated do not resemble those of actual construction sites, which has prevented their practical application within the industry.
This research aimed to close this practicability gap between vision-based algorithms and equipment productivity measurement processes. The earth material loading operation was selected as the test case, and a vision-based system, named server-customer interaction tracker (SCIT), was developed to recognize and estimate loading cycles under a variety of conditions.
7.1 Summary of Research
The SCIT system contains three main modules: object recognition, tracking, and action recognition. The object recognition module employs two techniques to detect dump trucks and hydraulic excavators, i.e. rigid and deformable equipment, respectively. It uses HOG classifiers to detect dump trucks from eight viewpoints. It also uses the HOG algorithm to detect candidate boxes for the excavator arm in a series of video frames and then selects the group of boxes that is coherent with the movement pattern of the excavator.

The standard mean-shift and the innovative hybrid tracking methods were separately employed as tracking engines. The hybrid tracking framework combines HOG object recognition and KLT feature tracking to track dump trucks. After recognition of the excavator, the system scans the video frames at predetermined time intervals (three, four, and five seconds in this research) and checks whether any of the detected dump trucks are logically positioned for loading. In addition to this logical reasoning test, a machine learning algorithm, a linear Support Vector Machine, was used to select the loading dump truck based on its relative size and distance to the excavator. In this test, the spatiotemporal information of a detected truck is translated into a descriptor and then classified to recognize the start of the loading cycle. Upon recognition of a loading truck, the SCIT defines a loading zone, and the tracking module follows the loading truck while it stays inside that zone. The departure of the loading truck ends the loading cycle, and the elapsed time is the duration of the loading cycle. The SCIT then resumes searching for new dump trucks.
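The processing loop summarized above can be expressed in pseudocode; the function names are placeholders describing the module interfaces, not the actual SCIT API:

```
# Pseudocode sketch of the SCIT loading-cycle loop
excavator = detect_excavator(video)                # part-based HOG detection
cycles = []
while video has frames:
    every scan_interval (3, 4, or 5 s):
        for truck in scan_for_trucks(frame, viewpoints=8):   # HOG classifiers
            # logical position test + linear SVM on the spatiotemporal
            # descriptor (relative size and distance to the excavator)
            if is_loading_candidate(truck, excavator):
                start = current_time()
                zone = define_loading_zone(truck, excavator)
                track_until_exit(truck, zone)      # hybrid HOG + KLT tracker
                cycles.append(current_time() - start)    # cycle duration
```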
7.2 Summary of Results
Several test videos were captured under various conditions during eight site visits to evaluate system performance. The SCIT system processed these test videos with five different action recognition thresholds. Lower thresholds produced more false positive cycles. Increasing the action recognition threshold reduced the number of false positives and improved the cycle times. The performance improvement, however, stopped at a certain threshold beyond which the system missed some true positive cycles. Three sets of tests showed the best results, with the SCIT correctly detecting 54 out of 55 cycles. Two of these tests had 4.95% deviation from the ground-truth cycle times, and the third had 3.96%. These results were obtained using 4 second intervals to search for loading dump trucks. In addition, 3 and 5 second intervals were tested: the test with 3 second intervals had more false positives and higher deviation, whereas the test with 5 second intervals had almost the same results as the test with 4 second intervals. The results demonstrate that the SCIT is able to address the objectives of this research, which were to:
- Develop an automated vision-based system for regular 2D construction videos;
- Process in real time;
- Detect and track earth material loading equipment;
- Recognize and estimate loading cycles with an acceptable deviation.
It should be mentioned that the performance of this system is limited by the inherent shortcomings of single 2D videos, which are explained in section 7.5.
7.3 Contributions to the Body of Knowledge
This research lays the groundwork for the application of vision-based algorithms to monitor construction plants. It was the first research effort to evaluate the HOG object recognition algorithm and a part-based approach for detecting construction equipment, and the results were promising. In addition, it introduced a novel hybrid tracking algorithm to track construction entities in noisy construction videos. This innovative algorithm employs an optimized integration of HOG recognition and KLT feature tracking. The method performed well in partially occluded views in which the previously recommended mean-shift tracking algorithm, using an HSV color histogram and HOG response, failed to track successfully. Moreover, an action recognition framework was developed that uses spatiotemporal data to recognize the interaction of the loading plants. This modular system has learning ability, and its modules can be substituted to recognize and estimate other cyclic operations. Finally, the SCIT's modules were implemented and optimized so that the entire system can process construction videos in real time on an ordinary computer.
7.4 Contribution to the Body of Practice
As stated in the previous chapter, construction companies require records of the total number of truck trips. This is done either manually or by using active monitoring systems such as GPS or RFID. The SCIT framework has the potential to count the number of trips and, in addition, can provide detailed cycle time information in real time. Manual sampling methods fail to spot many abnormal cycles, so their results may not be useful for finding and resolving bottlenecks. In contrast, the SCIT system monitors all of the loading cycles and tags each with its cycle time. Abnormal working shifts can be flagged, and site engineers can visually review the anomalies. The SCIT avoids some of the main shortcomings of radio-based monitoring systems, namely GPS, as it is not intrusive and is able to provide visual spatiotemporal data, including the pose and orientation of the machines. These data help the system interpret the actions of the machines more accurately.
7.5 Limitations
Since the SCIT system processes a video stream from a single camera, this framework carries the inherent limitations of a 2D projection of the real world, including:
- Unknown depth of the objects in the frames;
- Occlusion;
- Limited coverage of the site by a single stationary camera.
Therefore, there are certain conditions/viewpoints in which the system fails to provide correct data. However, as mentioned previously, the SCIT system aims at closing the practicability shortcomings of recent vision-based systems that perform only under ideal conditions. In this respect, SCIT addresses the following issues:
- It can process congested work zones with various backgrounds in which several types of machines (unknown to the detectors) appear in the videos.
- It is fully invariant to colors, because none of the modules use color-based techniques.
- No manual intervention is required to define the work zones; the system defines them.
- The SCIT can handle a moderate level of occlusion (e.g. partially masked dump trucks from elevated viewpoints).
However, the SCIT framework has three main shortcomings that keep it from an ideal level:
- The camera location, viewpoint, and focus must be set manually.
- The system cannot accurately process highly occluded scenes (e.g. a masked dump truck from a ground-level viewpoint).
- Since the current version of the object recognition module for hydraulic excavators can detect only one excavator, the SCIT performance is limited to one operating server.
In addition, unlike active systems that use antennas or tags, the SCIT system can only recognize the type of machine; it is not able to identify individual pieces of equipment. This issue could be resolved by labelling each machine with recognizable visual signs, such as numbers; however, dust on construction sites and errors in the sign recognition algorithms may cause problems.
7.6 Future Directions
This research offers a major step in developing vision-based monitoring systems for real-time earthmoving videos. Future research is required to overcome the limitations of the current version of the SCIT. As stated in previous sections, this system has a few shortcomings. In addition to the intrinsic errors of the existing object recognition and tracking techniques, 2D videos do not provide any information about the depth of the scene, so future research should seek additional mechanisms to overcome the shortcomings of using a single 2D video. The following are some possible solutions:
7.6.1 Application of Two Calibrated Cameras
Two calibrated cameras can provide a stereo view of the scene. With the 3D coordinates of the cameras (their centers of projection) and the 2D locations of the detected machines in the video frames, epipolar geometry can be used to calculate the 3D location of the equipment. In addition to adding depth estimation, the use of two videos would help handle the occlusion problem.
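As a sketch of the underlying geometry, the 3D location of a detected machine can be triangulated from its 2D positions in two calibrated views. The projection matrices below are synthetic, and the linear (DLT) method shown is one standard option, not the method used in this research:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen by two calibrated cameras.

    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel coordinates.
    Returns the 3D point in world coordinates.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Two synthetic cameras: identity intrinsics, second camera shifted 1 m in x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true))  # → True
```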
7.6.2 Application of Multiple Non-calibrated Cameras
Although calibrated cameras provide accurate results, calibrating site cameras is a labour-intensive task. It is possible to avoid this effort by using multiple non-calibrated cameras, which would be much easier to install and manage while still providing additional valuable data. One example application is linear projects, such as highways, where one camera cannot cover the entire site. In this case, multiple cameras would provide a complete view of the construction right-of-way. A slight overlap between the cameras' views would be beneficial so that movement from one view to the next could be coordinated. Each camera would identify the activities occurring in its area, after which the data could be combined to gain knowledge of the whole operation. Another example application would be a large excavation where each candidate camera position would experience recurring occlusions. Two cameras could be placed such that each resolves the occlusion problem of the other. The production recorded from each camera could then be combined to determine where occlusions caused one camera to misinterpret some of the loading cycles.
7.6.3 Integration of SCIT and GPS
Despite manual preparation and technology-related problems, GPS systems are the leading alternative for monitoring an earthmoving fleet. Integrating the SCIT system with a GPS navigation system would cover the shortcomings of both frameworks. GPS antennas can provide the 3D coordinates of the equipment, and the SCIT can offer additional spatiotemporal data, including the orientation and poses of the equipment. This system requires a geographical map of the operation in which the camera coverage (see Figure 7-1) is identified based on the focal length of the camera lenses. The entrance of a machine into this area then triggers the SCIT to recognize and track that equipment. This way, accurate GPS locational data can eliminate the risk of both false positives and false negatives in the SCIT system. This hybrid system compares the 2D visual detections with the corresponding 3D locational data to remove inconsistent detections. Moreover, the appearance of a machine's 3D locational data on the geographical map could prompt the SCIT to undergo extra recognition trials for timely detection of that equipment in the video. The SCIT remains responsible for action recognition and productivity estimation. This system would also be able to interpret occluded views, such as the situation presented in Figure 7-1.
The visual information provided by the SCIT makes it possible to accurately recognize equipment poses and actions and to distinguish productive from non-value-added movements, which are the shortcomings of a standalone GPS system. In addition to providing accurate locational data, the GPS system can identify each machine, which helps the system estimate the productivity of each plant separately.
Figure 7-1: Integration of the SCIT and GPS
Another shortcoming of the SCIT system is the manual setting of the viewfinder. A module
could be developed to control the robotic base of the construction camera; it should
automatically scan the jobsite for loading operations and, once one is found, focus
appropriately on that operation. A new generation of cameras has built-in GPS receivers and
compasses, so when the system finds a loading operation, it can estimate the operation's
location on the jobsite and automatically annotate the productivity data.
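The location-annotation idea can be illustrated with a short sketch. The function name, the
flat-earth approximation, and the assumption that the distance to the operation is known
(e.g. estimated from the lens focal length) are all illustrative choices, not part of the
SCIT system.

```python
# Illustrative sketch: estimating a loading operation's position from the
# camera's built-in GPS fix and compass bearing, given an estimated distance.
# A flat-earth offset is adequate over the few hundred metres of a jobsite.

import math

EARTH_RADIUS_M = 6_371_000.0


def locate_operation(cam_lat, cam_lon, bearing_deg, distance_m):
    """Offset the camera's GPS fix along the compass bearing (degrees)."""
    brg = math.radians(bearing_deg)
    d_north = distance_m * math.cos(brg)
    d_east = distance_m * math.sin(brg)
    lat = cam_lat + math.degrees(d_north / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        d_east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat)))
    )
    return lat, lon


# Camera at the University of Toronto, pointing due east at a pit 200 m away.
lat, lon = locate_operation(43.6629, -79.3957, 90.0, 200.0)
print(f"operation at {lat:.5f}, {lon:.5f}")  # → operation at 43.66290, -79.39321
```

The resulting coordinate pair could then be attached to each productivity record as the
automatic annotation suggested above.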