Computer Vision-based Solution to Monitor
Earth Material Loading Activities
by
Ehsan Rezazadeh Azar
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Department of Civil Engineering
University of Toronto
Toronto, Ontario
© Copyright by Ehsan Rezazadeh Azar (2013)
Computer Vision-based Solution to Monitor Earth
Material Loading Activities
Ehsan Rezazadeh Azar
Doctor of Philosophy
Department of Civil Engineering
University of Toronto
2013
Abstract
Large-scale earthmoving activities make up a costly and air-polluting aspect of many
construction projects and mining operations, which depend entirely on the use of heavy
construction equipment. The long-term jobsites and manufacturing nature of the mining
sector have encouraged the application of automated control systems, most notably GPS,
to manage the earthmoving fleet. Computer vision-based methods are another potential
tool to provide real-time information at low cost and to reduce human error on surface
earthmoving sites, as relatively clear views can be selected and the equipment offers
recognizable targets. Vision-based methods have some advantages over positioning devices
as they are not intrusive, provide detailed data about the behaviour of each piece of
equipment, and offer reliable documentation for future reviews. This dissertation explains the
development of a vision-based system, named server-customer interaction tracker (SCIT), to
recognize and estimate earth material loading cycles. The SCIT system consists of three main
modules: object recognition, tracking, and action recognition. Different object recognition
and tracking algorithms were evaluated and modified, and the most suitable methods were used
to develop the object recognition and tracking modules. A novel hybrid tracking framework
was developed for the SCIT system to track dump trucks in the challenging views found in
the loading zones. The object recognition and tracking engines provide spatiotemporal data
about the equipment which are then analyzed by the action recognition module to estimate
loading cycles. The entire framework was evaluated using videos taken under varying
conditions. The results highlight the promising performance of the SCIT system with the
hybrid tracking engine, thereby validating the possibility of its practical application.
Acknowledgements
There are a number of individuals to whom I want to express my special thanks; each of
them supported me in some way throughout the course of my PhD studies at the
University of Toronto.
My deepest gratitude goes to my supervisor, Professor Brenda McCabe, for her enduring
support and guidance. Her broad knowledge, experience, encouragement, mentorship, and
constructive advice have been of great value to me. She has always been available when I
needed her advice. It has been my honour to work under her supervision.
I would like to express my sincere gratitude to Professor Sven Dickinson, a member of my
advisory committee, for his valuable advice during all phases of this research. He helped me
bridge computer vision and construction engineering and carry out this exciting
interdisciplinary research. I am grateful to Professor Kim Pressnail, my other advisory
committee member, who provided insightful comments to improve the overall quality of this
research. I would also like to thank Professor Feniosky Pena-Mora for agreeing to be the
external appraiser for my PhD defence.
Finally, I would like to deeply thank my beloved parents and sisters for their endless
inspiration and unconditional love during all these years. They have been my best supporters.
Table of Contents
CHAPTER 1 - Introduction .................................................................................................................... 1
1.1 Research Objective ................................................................................................................. 4
1.2 Methodology .......................................................................................................................... 5
1.3 Research Scope ....................................................................................................................... 7
1.4 Outline of the Dissertation ...................................................................................................... 8
CHAPTER 2 - BACKGROUND ......................................................................................................... 10
2.1 Automated Data Collection in Construction ........................................................................ 10
2.2 Data Collection in Earthmoving Projects ............................................................................. 11
2.2.1 Machine control sensors ............................................................................................... 11
2.2.2 Global Positioning System ........................................................................................... 12
2.2.3 Ultra-wideband ............................................................................................................. 14
2.2.4 Radio-frequency identification (RFID) ........................................................................ 15
2.3 Computer vision-based methods .......................................................................................... 16
2.4 Object Recognition ............................................................................................................... 20
2.4.1 Haar-like Features ........................................................................................................ 21
2.4.2 Histogram of Oriented Gradients (HOG) ..................................................................... 22
2.5 Object Tracking .................................................................................................................... 23
2.5.1 Mean-shift Tracking ..................................................................................................... 23
2.6 Summary .............................................................................................................................. 25
CHAPTER 3 - OBJECT RECOGNITION MODULE ......................................................................... 28
3.1 Dump Trucks ........................................................................................................................ 29
3.1.1 Visual orientations ........................................................................................................ 30
3.1.2 Machine learning .......................................................................................................... 32
3.1.3 Performance of Detectors on Static Images ................................................................. 33
3.1.4 Haar-like vs. HOG performance ................................................................................... 41
3.1.5 Performance of Detector on Videos ............................................................................. 41
3.2 Hydraulic Excavator ............................................................................................................. 44
3.2.1 Deformable parts .......................................................................................................... 45
3.2.2 Features ........................................................................................................................ 48
3.2.3 Mixture models ............................................................................................................. 48
3.2.4 Static images ................................................................................................................. 50
3.2.5 Videos ........................................................................................................................... 54
3.2.6 Spatiotemporal reasoning ............................................................................................. 56
3.3 Robustness of the Recognition Results ................................................................................ 60
3.3.1 Occlusions .................................................................................................................... 60
3.3.2 Lighting ........................................................................................................................ 62
3.3.3 Shadow ......................................................................................................................... 62
3.3.4 Viewpoint ..................................................................................................................... 63
3.3.5 Articulation ................................................................................................................... 63
3.3.6 Scale change ................................................................................................................. 64
3.3.7 Orientation change ........................................................................................................ 65
3.4 Summary .............................................................................................................................. 65
CHAPTER 4 - OBJECT TRACKING MODULE ............................................................................... 68
4.1 Mean-shift Algorithm ........................................................................................................... 68
4.2 Hybrid Tracking ................................................................................................................... 70
4.2.1 Possibilities to Optimize Hybrid Algorithm ................................................................. 75
4.3 Summary .............................................................................................................................. 77
CHAPTER 5 - THE ACTION RECOGNITION MODULE AND SYSTEM ARCHITECTURE ...... 79
5.1 Baseline Task ....................................................................................................................... 80
5.2 Spatiotemporal Information .................................................................................................. 81
5.3 Activity Recognition Module (ARM) .................................................................................. 81
5.3.1 ARM Stage 1: Logical loading configuration .............................................................. 81
5.3.2 ARM Stage 2: Machine learning action recognition .................................................... 82
5.4 Cycle Conclusion.................................................................................................................. 84
5.5 System Architecture ............................................................................................................. 85
5.6 Summary .............................................................................................................................. 88
CHAPTER 6 - SCIT VALIDATION RESULTS ................................................................................. 89
6.1 Experimental Results ............................................................................................................ 90
6.2 Discussion ............................................................................................................................ 93
6.2.1 False positive cycles ..................................................................................................... 94
6.2.2 False negative cycles .................................................................................................... 97
6.2.3 Differences in start and finish times ............................................................................. 97
6.3 Practical Applications ........................................................................................................... 98
6.3.1 Cycle counting .............................................................................................................. 98
6.3.2 Cycle durations ............................................................................................................. 99
6.4 Monitoring Other Earthmoving Operations ....................................................................... 107
6.4.1 Hauling ....................................................................................................................... 107
6.4.2 Leveling and compacting ........................................................................................... 109
6.4.3 Excavation .................................................................................................................. 110
6.4.4 Extended Monitoring System ..................................................................................... 110
CHAPTER 7 - CONCLUSION AND FUTURE DIRECTIONS ....................................................... 112
7.1 Summary of Research......................................................................................................... 113
7.2 Summary of Results ........................................................................................................... 114
7.3 Contributions to the Body of Knowledge ........................................................................... 115
7.4 Contribution to the Body of Practice .................................................................................. 115
7.5 Limitations .......................................................................................................................... 116
7.6 Future Directions ................................................................................................................ 117
7.6.1 Application of Two Calibrated Cameras .................................................................... 118
7.6.2 Application of Multiple Non-calibrated Cameras ...................................................... 118
7.6.3 Integration of SCIT and GPS...................................................................................... 119
REFERENCES ................................................................................................................................... 121
List of Tables
Table 2-1: A summary of main features of equipment tracking methods ............................................ 26
Table 3-1: The number of training images in each category - With permission from ASCE
(Rezazadeh Azar and McCabe 2012a) ................................................................................................. 30
Table 3-2: Training windows of each method ...................................................................................... 31
Table 3-3: HOG runtimes for eight views using CPU and GPU .......................................................... 40
Table 3-4: Computation times of the Haar detectors in searching for eight orientations - With
permission from ASCE (Rezazadeh Azar and McCabe 2012a) ........................................................... 40
Table 3-5: Some samples of Haar detectors and their performances - With permission from ASCE
(Rezazadeh Azar and McCabe 2012a) ................................................................................................. 40
Table 3-6: Statistics of the training images in each view - With permission (Rezazadeh Azar and
McCabe 2012b) .................................................................................................................................... 48
Table 3-7: Dimension of the search areas based on the root dimensions ............................................. 50
Table 3-8: Results of the general HOG and part-based methods - With permission (Rezazadeh Azar
and McCabe 2012b) ............................................................................................................................. 52
Table 3-9: Part-based recognition runtimes for both directions using CPU and GPU ......................... 54
Table 3-10: Results of the general HOG and part-based algorithms in test videos - With permission
(Rezazadeh Azar and McCabe 2012b) ................................................................................................. 55
Table 3-11: Spatiotemporal constraints of the true positives - With permission (Rezazadeh Azar and
McCabe 2012b) .................................................................................................................................... 57
Table 3-12: Summary of the robustness assessment of the recognition process under main affecting
factors ................................................................................................................................................... 67
Table 5-1: Possible loading configurations .......................................................................................... 82
Table 6-1: Results of the experiments with different action recognition thresholds on test videos ..... 91
Table 6-2: Detailed results of the SCIT with hybrid - Test 3 ............................................................. 101
Table 6-3: Detailed results of the SCIT with hybrid - Test 4 ............................................................. 102
Table 6-4: Detailed results of the SCIT with hybrid - 3 second intervals .......................................... 104
Table 6-5: Detailed results of the SCIT with hybrid - 5 second intervals .......................................... 105
Table 6-6: Loading conditions of the test cases ................................................................................. 106
List of Figures
Figure 1-1: Left: long queue of waiting dump trucks, Right: idle excavator waiting for trucks ............ 1
Figure 1-2: Methodology of the dissertation .......................................................................................... 7
Figure 2-1: GPS antennas and grade control for Left: Bulldozer, Right: Grader ................................. 13
Figure 2-2: Detection cascade (Viola and Jones 2001) ........................................................................ 22
Figure 2-3: Left: Original image; Right: Visualization of the HOG descriptor ................................... 23
Figure 2-4: Top row: tracking of a dump truck in two frames, Bottom row: back projection of the
density distribution ............................................................................................................................... 24
Figure 3-1: Left: orientations; Right: samples of views (clockwise from top left: front, front-left,
front-right, side-left, side-right, rear, rear-left, rear-right) - With permission from ASCE (Rezazadeh
Azar and McCabe 2012a) ..................................................................................................................... 30
Figure 3-2: Possible outcomes of a binary classification process ........................................................ 34
Figure 3-3: ROC curve of the HOG detectors - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a) .................................................................................................................................... 36
Figure 3-4: HOG recognition samples - With permission from ASCE (Rezazadeh Azar and McCabe
2012a) ................................................................................................................................................... 37
Figure 3-5: Samples of missed dump trucks - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a) .................................................................................................................................... 38
Figure 3-6: ROC curve of the HOG detectors on videos ..................................................................... 42
Figure 3-7: Detection results in a series of frames at specified time intervals (a through d) ............... 43
Figure 3-8: Deformations of the hydraulic excavator - With permission (Rezazadeh Azar and McCabe
2012b) ................................................................................................................................................... 45
Figure 3-9: Root and part of the excavator ........................................................................................... 46
Figure 3-10: Top row: training instances of the boom in left direction; Second row: training samples
of the boom in right direction ............................................................................................................... 47
Figure 3-11: Poses of the dipper - With permission (Rezazadeh Azar and McCabe 2012b) ............... 47
Figure 3-12: Flowchart of the part-based recognition process ............................................................. 49
Figure 3-13: Search regions for dipper - With permission (Rezazadeh Azar and McCabe 2012b) ..... 50
Figure 3-14: ROC curve of the results on the excavator test images ................................................... 52
Figure 3-15: Samples of detected excavators - With permission (Rezazadeh Azar and McCabe 2012b)
.............................................................................................................................................................. 53
Figure 3-16: Object recognition at time intervals (images a to j), and four distinguished paths (images
k to n) - With permission (Rezazadeh Azar and McCabe 2012b) ........................................................ 58
Figure 3-17: Occlusion at ground-level view ....................................................................................... 61
Figure 3-18: Partially masked truck from elevated view ..................................................................... 62
Figure 3-19: Difficult viewpoints ......................................................................................................... 63
Figure 3-20: Changes in size of the dump truck as it approaches the camera ...................................... 65
Figure 4-1: left: selection of target truck in the original frame, right: isolation of pixels with similar
color histograms ................................................................................................................................... 70
Figure 4-2: left: original image, right: isolation of pixels with HOG response for side-right facing
trucks .................................................................................................................................................... 70
Figure 4-3: Flowchart of the hybrid tracking process .......................................................................... 73
Figure 4-4: a: detected truck at frame x1; b: HOG recognition result with lowered thresholds for three
viewpoints in frame x2; c: projected box of previous frame (frame x1) to frame x2 using KLT feature
tracker; d: fusion of the rectangles in b and c ....................................................................................... 73
Figure 4-5: Tracking of the orientation changes .................................................................................. 75
Figure 4-6: Left: red box encloses target truck, right: ROI to search for the target truck in the next
frame ..................................................................................................................................................... 76
Figure 4-7: Correction of the KLT method's distractions..................................................................... 77
Figure 5-1: Distances between the corners of trucks and the base point in both left and right
configurations ....................................................................................................................................... 84
Figure 5-2: a: Detection of the excavator; b: tracking the excavator; c: detection of a truck that does
not meet loading criteria; d: detection of the loading truck; e: tracking of the both equipment; f: truck
leaves the zone and tracking of the truck terminates ............................................................................ 86
Figure 5-3: Flowchart of the entire SCIT system ................................................................................. 87
Figure 6-1: Some of the earth material loading views .......................................................................... 90
Figure 6-2: Frames a to c: Expansion of mean-shift tracking, images d to f: Hybrid tracking ............ 93
Figure 6-3: frames a to c: Recognition and tracking of the incorrect loading truck due to severe
occlusion; frames d to f: correct recognition and tracking by changing the camera location .............. 96
Figure 6-4: a. incorrect detection of the loading start time, b. actual start time ................................... 98
Figure 6-5: Earthmoving foremen ........................................................................................................ 99
Figure 6-6: Left: detection of loading trucks; right: tracking of trucks .............................................. 109
Figure 6-7: Compaction with two rollers ........................................................................................... 110
Figure 7-1: Integration of the SCIT and GPS ..................................................................................... 120
Nomenclature/List of Acronyms
AdaBoost Adaptive Boosting, a machine learning algorithm
CAD Computer-aided design
CPU Central Processing Unit, computer hardware
CUDA Compute Unified Device Architecture, a parallel computing
architecture developed by Nvidia
GPS Global Positioning System
GPU Graphics Processing Unit, computer hardware
Haar-like features rectangular image features used for object recognition
HOG Histogram of Oriented Gradients, an object recognition algorithm
HSV Hue, Saturation, and Value, a cylindrical-coordinate representation of
points in an RGB color model
KLT Kanade-Lucas-Tomasi, a computer vision feature tracking method
NVA Non-value added
OpenCV an open source cross-platform library for computer vision algorithms
PASCAL Pattern Analysis, Statistical Modelling and Computational Learning, a
Network of Excellence funded by the European Union
RAM Random Access Memory, computer hardware
RFID Radio-frequency identification, a wireless identification system
ROC Receiver Operating Characteristic curve, a graphical plot to
demonstrate the performance of a binary classifier system
ROI Region of Interest, selected subset of an image
SCIT Server-customer interaction tracker; the developed vision-based
system in this research to estimate loading cycles
SVM Support Vector Machine, a machine learning algorithm
SVM-light open-source software to train linear support vector machine
classifiers
UWB Ultra-wideband, a radio-based positioning device
CHAPTER 1 - Introduction
For many years, the manufacturing sector has benefited from advances in information
technology (IT) to improve productivity and efficient data flow. The construction industry,
however, has been criticized for being slow to adopt IT. This inertia, along with the
fragmented and temporary nature of construction projects, has resulted in a lack of
productivity improvement in the construction industry (Navon and Sacks 2007). For
example, most data capture and transfer processes used for productivity tracking are done
manually in construction (Navon and Sacks 2007; Akinci et al. 2006; Navon 2005). The
immediate consequences of poor performance include inactive excavators waiting for dump
trucks or a queue of dump trucks waiting for a loading unit (see Figure 1-1).
Figure 1-1: Left: long queue of waiting dump trucks, Right: idle excavator waiting for trucks
To address this issue, researchers have evaluated sensing technologies to detect, track, and
recognize the actions of the construction workers and equipment in the rugged environment
of a construction site, with a potential to automate the manual and error-prone productivity
measurement processes. Automated productivity measurement systems promise to have a
positive impact not only on the management of site equipment and human resources, but also
on the planning of future projects.
The automation of productivity data collection can improve construction performance by:
- Reducing the need for expensive and error-prone human resources;
- Providing real-time data and proactive resource monitoring; and
- Recording accurate productivity data for future applications such as resource planning
and stochastic simulation.
Earthmoving is a major component of heavy-civil construction, such as highways, earth- and
rock-fill dams, pipelines, land development, irrigation systems, and harbour construction
projects. Surface mining, which includes aggregate pits and quarries, as well as mining for
specific minerals such as oil sands, coal, bauxite, and copper, also requires major
earthmoving activities. In 2010, surface mining accounted for 53% of the 1.6 million bbl/day
of Alberta’s crude bitumen production and it takes about two tonnes of mined oil sands to
extract a barrel of synthetic crude oil (Government of Alberta 2012). Thus, about 619 million
metric tonnes of oil sands were excavated and processed in 2010, and this figure rises every
year. This situation illustrates the enormous magnitude of earthmoving processes worldwide,
all of which depend on heavy equipment and have a repetitive nature. Slight improvements in
cycle times can result in significant improvements in productivity, cost savings, and
reductions in carbon emission.
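As a rough arithmetic check, the tonnage estimate above follows directly from the cited production figures:

```latex
0.53 \times \left(1.6\times10^{6}\ \tfrac{\text{bbl}}{\text{day}}\right)
\times 365\ \text{days}
\times 2\ \tfrac{\text{tonnes}}{\text{bbl}}
\approx 6.19\times10^{8}\ \text{tonnes}
```

That is, about 619 million metric tonnes of mined oil sands in 2010, consistent with the figure stated above.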
Traditional manual monitoring methods – wireless person-to-person communication with
equipment operators/supervisors and watching the operations directly or through real-time
videos – are expensive, time-consuming and error-prone (Rojas 2008).
New sensing technologies such as global positioning system (GPS) receivers and ultra-
wideband (UWB) sensors have been used to monitor earthmoving machines and provide
continuous productivity data. These real-time positioning devices estimate the three-
dimensional location of the machine and the logical unit of the control system analyzes the
spatiotemporal pattern of the machine to recognize the type of action and therefore the
productivity (Navon et al. 2004; Kim and Russell 2003). Since these frameworks recognize
the actions from indirect data, they may not correctly distinguish productive movements from
non-value-added traverses. In addition, these technologies are intrusive, as each machine to be
tracked requires the appropriate sensor to be installed and kept up to date. This issue is
particularly problematic for rented equipment due to the effort and cost of repeatedly installing
and removing sensing tags and of updating the monitoring software database.
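To illustrate how such indirect, position-based action recognition works, and why it can mislabel movements, the following is a minimal sketch of classifying an equipment track by simple speed thresholds. The function name, track format, and threshold values are assumptions for illustration only, not part of any cited system.

```python
import math

def classify_gps_track(points, idle_speed=0.3, work_speed=2.0):
    """Classify each segment of a GPS track by average speed.

    points: list of (t_seconds, x_metres, y_metres) fixes, in time order.
    Returns one state string per consecutive pair of fixes.
    Note the inherent ambiguity: "traveling" may be a productive haul
    or a non-value-added traverse -- position data alone cannot tell.
    """
    states = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dt = t1 - t0
        speed = math.hypot(x1 - x0, y1 - y0) / dt if dt > 0 else 0.0
        if speed < idle_speed:
            states.append("idle")        # effectively stationary
        elif speed < work_speed:
            states.append("working")     # slow maneuvering, e.g. spotting/loading
        else:
            states.append("traveling")   # productive haul OR NVA traverse
    return states
```

For example, a truck that creeps 1 m in 10 s, drives 14 m in the next 10 s, then stops would be labelled `["idle", "working", "idle"]`.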
Laser scanners can provide a 3D point cloud of a scene, but these devices are costly and the
scanning process is time-consuming; they are therefore mostly used to scan static objects
such as historic buildings or rock profiles. Even a low-resolution scan by a fast laser scanner
operating at 625,000 points per second can take more than 100 seconds
(Kiziltas et al. 2008), so laser scanners are not suitable for analyzing dynamic earthmoving
activities in real time.
Due to the emergence of low-cost cameras and high-capacity storage devices, it has become
common practice to monitor construction sites by surveillance cameras (Zou and Kim 2007;
Gong and Caldas 2010). Unlike in building construction, clear sightlines can be selected and
maintained throughout an earthmoving project. Vision-based monitoring is an economical
solution for earthmoving operations, but visual recognition technologies needed to
automatically extract data from the videos are in their infancy, despite the capabilities
displayed by the entertainment media.
It is possible to use any of these approaches to automate site data collection; however, the
development of a vision-based system has the potential to replicate human visual skills and
logic, thereby providing contractors and owners with relatively inexpensive assistance to
automatically monitor the site. As no known research has achieved this goal, the research
contained herein aims to help fill this gap.
1.1 Research Objective
The objective of this research is to develop a computer vision-based framework that can
correctly identify and track equipment activities on an earthmoving construction project in
real-time. To achieve this, the framework should:
- Use ordinary 2D construction videos;
- Detect, identify, and determine the orientation of typical loading equipment;
- Track the movement of the machine as it moves through the camera’s view;
- Identify the operations of the detected equipment and its interactions with other equipment; and
- Develop an automated vision-based data collection system for earth material loading operations using the mentioned modules, assess its performance, and validate the practical application of the system.
1.2 Methodology
The following steps will be followed to achieve the objectives:
- Assess the current state of productivity measurement in earthmoving projects and identify shortcomings and areas for improvement;
- Investigate the functional technology requirements for automating productivity measurement practices;
- Evaluate existing image and video processing methods and select an appropriate set of algorithms to develop the productivity measurement system based on the functional requirements;
- Develop a framework to automate the identification, tracking, and recognition of the activities of earth material loading equipment on actual construction sites;
- Evaluate the performance of the system using several test cases with various conditions;
- Compare the machine-generated and ground truth results to validate the performance of the system; and
- Identify the limitations of the system.
The research methodology for this study is depicted in Figure 1-2. First, the problem
statement for this research as well as the corresponding objectives and scope were defined.
Then the literature review was carried out to assess the current state of the data collection in
earthmoving projects, and to identify and select appropriate computer vision algorithms.
These algorithms were examined and modified to detect and track loading machines. Next,
an action recognition method was developed to recognize the loading action, and then all of
these modules were integrated into the server-customer interaction tracker (SCIT) pipeline.
This framework was examined by means of test videos from two construction sites. The
results of the experiments were statistically compared with ground truth productivity data to
validate the system’s performance for practical applications. Finally, the findings,
advantages, and limitations of this study were summarized and future research directions
discussed.
Figure 1-2: Methodology of the dissertation
1.3 Research Scope
There are several types of earthwork activities such as excavation, loading, hauling,
dumping, grading, trimming, and compaction; hence, there are special types of equipment to
carry out each of these tasks. A number of detection classifiers and activity recognition
modules are required to cover all of these different activities. Therefore, only analyses of
loading activities are considered for this research. The loading operation includes loading of
different types of soil, clay, aggregate, rock, and mineral ores. The envisioned system should
be able to monitor these loading activities regardless of the material type; therefore, the term
“earth material” is selected for the topic of this research to represent the general application
of this project in the construction and mining industries. Moreover, different types of
machines can perform these activities; for instance, both loaders and hydraulic excavators
can load different types of hauling machines, such as rigid off-highway dump trucks,
articulated off-highway dump trucks, urban dump trucks, and scrapers. The hydraulic
excavator was selected as the test case for the loading unit, and both the rigid off-highway
dump truck and the urban dump truck were chosen as instances of hauling machines.
Although the scope of this research encompasses loading operations, the methodology,
modules, and outcomes of the developed system can be applied or generalized to other
earthmoving operations.
1.4 Outline of the Dissertation
This dissertation is organized into seven chapters. Chapter 1 introduces the problem
statement and motivation for the dissertation topic. It then explains the objectives,
methodology, and the scope of this research. Chapter 2 consists of three main parts. First it
provides an overview of previous research efforts in automated data collection in
construction, and then describes recent applications of vision-based algorithms in the
construction industry. Finally, it gives a synthesis of advances in computer vision with an
emphasis on object recognition and tracking methods. Chapters 3 and 4 describe the object
recognition and object tracking modules of the proposed framework, respectively. In
Chapter 5, the action recognition module and the entire architecture of the framework are
presented. Chapter 6 demonstrates the experiments carried out to evaluate the performance of
the system. Finally, Chapter 7 gives the summary, discusses the contributions and limitations
of the research, and makes some recommendations for future research.
CHAPTER 2 - BACKGROUND
This chapter provides background information on the research advances for automated data
collection in construction. Much of the research is built upon advances in sensor
technologies, computer science, and new algorithms that have been developed to search
digital images for stationary and moving objects. The first section looks at automated data
collection generally in construction with Section Two focusing on data collection in
earthmoving. Then, the application of vision-based algorithms for automated data collection
in construction is investigated, and lastly advances in object recognition and tracking
methods are briefly introduced.
2.1 Automated Data Collection in Construction
Manual data collection and analysis are tedious and labour-intensive, taking about 30%-50%
of supervisors’ time (McCullouch 1997), and 2% of the entire effort in construction sites
(Cheok et al. 2000). In addition, manual data collection is error-prone and usually requires
extra non-value added communication between the office and field personnel (Akinci et al.
2006).
As a result, automated data collection has become one of the leading research streams in the
construction community and various data collection devices have been employed for
material, personnel, and equipment tracking, progress monitoring, productivity measurement,
quality control, and safety management (Kiziltas et al. 2008). Research often depends on the
advancement of data collection devices: barcodes, radio frequency identification (RFID),
global positioning system (GPS), ultra-wideband (UWB), laser scanning, and computer
vision algorithms are currently the most used technologies. Some of the features that
differentiate them include proneness to interference, data reading range, data accuracy,
interoperability of hardware and software, and memory requirements (Kiziltas et al. 2008).
Most recently developed systems analyse direct data, while others interpret indirect data to
extract the necessary information (Navon 2005; Navon and Sacks 2007). Laser scanners and
vision processing methods for estimating building progress are instances of direct data
analysis, whereas the application of spatiotemporal data provided by GPS or UWB to
estimate equipment productivity is an example of indirect analysis.
2.2 Data Collection in Earthmoving Projects
The earthmoving sector has a longer history of the application of automated data collection
technologies than other segments of the construction industry (Navon 2005). More
specifically, the long-term and repetitive nature of mining operations has allowed faster
technology adoption, in which different sensing devices have been used to locate and dispatch
large fleets of earthmoving plants. Since mining and earthmoving operations are similar in
nature and use similar equipment (even though mining machines are usually larger in
size), these sensing technologies were gradually employed in heavy construction projects as
well. In the following sections, the most common monitoring tools are introduced.
2.2.1 Machine control sensors
Various built-in sensing devices are commercially available that provide a wide range of data
from the machine itself, such as engine operating parameters (Caterpillar 2012), and location
and orientation of the machine parts such as boom and bucket orientations of a hydraulic
excavator (Trimble 2012c). These devices have been developed to improve the efficiency of
equipment operation, but it is also possible to collect and interpret these data to estimate the
machine’s productivity. Limitations of these devices include cost-effectiveness and data
interpretation: the engine parameters or movements of machine parts do not necessarily
correspond to productive actions, and it is also difficult to distinguish the type of work.
2.2.2 Global Positioning System
The Global Positioning System (GPS) is a space-based radio-navigation system created by
the U.S. Department of Defense (DoD) using a constellation of 24 satellites, the last of which
was launched in 1994. Each satellite continuously transmits the time and its position. GPS
receivers must receive these messages from at least four satellites to compute their 3D
location and time using a trilateration technique. The user equivalent range error is the
difference between the GPS coordinates and the true position. The main causes of this
inaccuracy are atmospheric effects, multipath distortion, satellite geometry, ephemeris errors
and orbit perturbations, time offset, instrumentation errors, and relativistic effects.
Differential correction techniques can resolve or minimize these errors; however, there are
other causes of minor distortions which cannot be corrected.
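To make the trilateration step concrete, the sketch below solves the standard pseudorange equations for a receiver’s 3D position and clock-bias distance by Gauss-Newton least squares. It is an illustrative toy, not part of any GPS product; the function name and the synthetic satellite geometry below are the author’s assumptions.

```python
import numpy as np

def solve_gps_fix(sat_positions, pseudoranges, iterations=20):
    """Estimate receiver position (x, y, z) and clock-bias distance b
    from satellite positions and measured pseudoranges using
    Gauss-Newton least squares. Requires at least four satellites."""
    sat_positions = np.asarray(sat_positions, dtype=float)
    pseudoranges = np.asarray(pseudoranges, dtype=float)
    x = np.zeros(4)  # initial guess: Earth's centre, zero clock bias
    for _ in range(iterations):
        diff = x[:3] - sat_positions            # vectors satellite -> receiver
        ranges = np.linalg.norm(diff, axis=1)   # geometric ranges
        residuals = pseudoranges - (ranges + x[3])
        # Jacobian of predicted pseudorange w.r.t. (x, y, z, b)
        J = np.hstack([diff / ranges[:, None], np.ones((len(ranges), 1))])
        dx, *_ = np.linalg.lstsq(J, residuals, rcond=None)
        x += dx
    return x[:3], x[3]
```

With four satellites the system is exactly determined; additional satellites overdetermine it and reduce the effect of measurement noise.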
Recent commercially developed GPS receivers for construction equipment have high
accuracy (under one metre); however, they may still be subject to the mentioned anomalies
(Trimble 2012a). In addition, earthmoving machines sometimes work in deep pit mines and
valleys, in proximity to rocks, trench slopes, and tall buildings, which can cause signal
reflection and multipath distortion problems. A modified mining GPS system exists to
solve such problems. This system uses additional transmit stations, or "Terralites", installed
on overlooking points of jobsites that relay satellite signals to equipment antennas (Trimble
2012a). This solution is expensive and mostly used in long-term open pit mines as it requires
a network of reference stations.
The GPS devices transmit the geographical location of machines to a central control
processor at regular time intervals, and these locational data together with logic algorithms of
the control system can recognize the action and estimate the productivity of the machine
(Navon et al. 2004; Kim and Russell 2003). This technology is suitable to track mobile
machines and detect queues or other misallocation, but it cannot provide any data other than
the location of a stationary plant such as a hydraulic excavator. Recent antennas have
customized features for specific heavy equipment that provide additional data to increase the
accuracy of earthmoving profiles and facilitate the operators’ job. For example, dual antennas
installed on the two sides of the blade of a bulldozer or grader (see Figure 2-1) provide the exact
position, cross slope, and heading of the blade to achieve accurate excavation and grading
profiles (Trimble 2012b).
Figure 2-1: GPS antennas and grade control for Left: Bulldozer, Right: Grader
The open pit mining sector has been extensively using this technology to dispatch and control
the earthmoving fleet (Vujic et al. 2008; Alarie and Gamache 2002). Several construction
researchers have also applied GPS antennas to track heavy equipment on construction projects
and estimate their productivity in activities such as grading and leveling (Navon and Shpatnisky
2005; Navon et al. 2004) and asphalt paving (Navon and Shpatnisky 2005; Peyret et al. 2000).
These productivity measurement frameworks extract the spatiotemporal data of the
equipment and transform them into the local map of the project. Then the processing
software interprets the movements of the machines in work zones and estimates their
productivity.
In addition to the technology-related problems already mentioned, GPS-based productivity
measurement has two further limitations. First, since the 3D coordinates and time are the only
available spatiotemporal data, it is difficult to distinguish productive activities from non-
value-added (NVA) traverses. Second, since rented equipment is commonly employed by
general contractors, such that different machines may be used on site on a daily basis, it is
costly and labour-intensive to install and remove GPS antennas from the plants and update
the monitoring software with each change in the fleet.
That said, GPS navigation systems remain the superior technology in this field, as this
level of detailed data from a single piece of equipment is not achievable by any other existing
automated data collection system.
2.2.3 Ultra-wideband
Ultra-wideband (UWB) is another radio-based positioning technology with the ability
to locate and track entities in limited zones. The system consists of a network of UWB
receivers, UWB tags, and a data processing unit. UWB sensors receive low-energy radio
waves transmitted by the tags, and the processing unit then analyzes the attributes of the
received signals to locate the tags. Because UWB systems employ high-bandwidth waves
(with very short pulses), most signal reflections do not retain the original pulse and
multipath fading is not an issue. The system can estimate the location of a tag with two of
the following four pieces of information (Ghavami et al. 2007): time of arrival (TOA), time
difference of arrival (TDOA), angle of arrival (AOA), and received signal strength (RSS).
This technology provides reliable spatiotemporal data for tracking resources on construction
sites (Cheng et al. 2011).
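To illustrate how one of these quantities yields a position, the sketch below recovers a 2D tag location from TDOA measurements taken against a reference receiver, using Gauss-Newton refinement of the range differences. The function name, receiver layout, and units are illustrative assumptions, not a UWB vendor API.

```python
import numpy as np

def tdoa_locate(receivers, tdoas, c=3e8, iterations=25):
    """2D tag localization from time-difference-of-arrival measurements.
    tdoas[i] is (t_{i+1} - t_0): the arrival-time difference between
    receiver i+1 and the reference receiver 0."""
    receivers = np.asarray(receivers, dtype=float)
    dd = np.asarray(tdoas, dtype=float) * c     # range differences (metres)
    x = receivers.mean(axis=0)                  # start in the middle of the array
    for _ in range(iterations):
        diff = x - receivers
        r = np.linalg.norm(diff, axis=1)        # ranges receiver -> tag
        residual = dd - (r[1:] - r[0])          # measured minus predicted
        u = diff / r[:, None]                   # unit vectors receiver -> tag
        J = u[1:] - u[0]                        # Jacobian of range differences
        dx, *_ = np.linalg.lstsq(J, residual, rcond=None)
        x += dx
    return x
```

Each TDOA constrains the tag to a hyperbola with two receivers as foci; the least-squares step intersects these constraints numerically.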
With regard to shortcomings, the ultra-wideband system requires a network of wired sensors
installed in different locations of a site, which makes it impractical for temporary linear
projects such as highway and pipeline construction. Problems with data interpretation and
the manual installation of tags are also associated with this technology.
2.2.4 Radio-frequency identification (RFID)
Radio-frequency identification (RFID) is a wireless system that employs radio-frequency
electromagnetic fields to identify and track a tag attached to an object. This system includes a
tag, which is an electronic chip coupled with an antenna, and a reader that transfers data to
the host computer. RFID tags are either passive or active: passive tags do not require a
battery because the electromagnetic field of the reader powers them, while active tags use
their own power source, usually a battery, to transmit data via radio waves. The main
advantages of RFID systems are that the tags do not need a direct line of sight, and that they
are durable and can be encapsulated.
This sensing technology has been used to measure the loading, hauling, and dumping times of
dump trucks. Fixed readers can be installed at the entrance gates of the loading and dumping
areas, with passive RFID tags attached to the dump trucks. The system then records the
entrance and exit times of the machines in each zone, and the time differences represent the
loading, traveling, and dumping cycle times (Montaser and Moselhi 2012). Although this
method is practical for a stationary construction site (e.g. a foundation excavation), it is
cumbersome to frequently move the gates for more linear work (e.g. highway construction).
In addition, this system only registers the entrance and exit of the machines, without regard
to whether the loading actually occurred.
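The gate-timing logic of such a system reduces to simple differencing of read events. The following is a hypothetical sketch; the event format and function name are the author’s illustration, not taken from the cited system.

```python
from collections import defaultdict

def cycle_times(reads):
    """Derive per-truck cycle components from RFID gate reads.
    `reads` is a chronological list of (timestamp_s, truck_id, gate)
    tuples, where gate is 'load_in', 'load_out', 'dump_in' or 'dump_out'.
    Returns {truck_id: [(loading_s, travel_s, dumping_s), ...]}."""
    events = defaultdict(list)
    for t, truck, gate in reads:
        events[truck].append((t, gate))
    cycles = defaultdict(list)
    for truck, evs in events.items():
        # walk through complete load_in -> load_out -> dump_in -> dump_out runs
        for i in range(0, len(evs) - 3, 4):
            (t0, g0), (t1, g1), (t2, g2), (t3, g3) = evs[i:i + 4]
            if (g0, g1, g2, g3) == ('load_in', 'load_out', 'dump_in', 'dump_out'):
                cycles[truck].append((t1 - t0, t2 - t1, t3 - t2))
    return dict(cycles)
```

Note that, consistent with the limitation discussed above, a truck that passed through the loading gates without actually being loaded would still be credited with a loading cycle.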
2.3 Computer vision-based methods
Site images and videos provide a vast automated data collection opportunity (Golparvar-Fard
et al. 2009; Brilakis and Soibelman 2008; Abeid et al. 2003). The emergence of cheap digital
cameras and high-capacity storage devices has significantly increased the number of photos
and videos captured in construction sites, many of which have surveillance cameras (Gong
and Caldas 2010; Zou and Kim 2007). These images and videos are used for several purposes
such as progress measurement, claims, reports, safety, and training. However, manual
annotation, retrieval, and analysis of these multimedia sources are cumbersome, and valuable
information within them is often missed (Brilakis and Soibelman 2005). Therefore, in the last
decade, a stream of construction research has focused on intelligent systems to use photos
and videos more effectively.
Computer vision algorithms have been applied in several fields of construction, including
progress monitoring (Golparvar-Fard et al. 2011; Wu et al. 2010; Golparvar-Fard et al. 2009;
Kim and Kano 2008), defect detection (Guo et al. 2009; Hutchinson and Chen 2006),
automated image retrieval (Brilakis and Soibelman 2008; Brilakis and Soibelman 2005), and
productivity measurement (Gong and Caldas 2011; Peddi et al. 2009; Weerasinghe and
Ruwanpura 2009; Almassi and McCabe 2008). To compare digital images with an electronic
as-planned 4D model, a means to coordinate a fixed camera viewpoint and the direction
vector of a construction 4D model is available (Kim and Kano 2008). D4AR (4 Dimensional
Augmented Reality) (Golparvar-Fard et al. 2011) goes one step further and uses casually
captured images from the construction site to build a virtual 3D walk through the
environment. It thus positions the 4D CAD model to assess progress of the building. In
addition to comparing as-built with as-proposed features, productivity, work progress, and
safety data can be extracted. Methods exist for monitoring workers (Teizer and Vela 2009;
Peddi et al. 2009; Weerasinghe and Ruwanpura 2009), and tracking personnel, equipment,
and materials in noisy construction videos (Brilakis et al. 2011; Park et al. 2011). The Haar
object detection method has been found capable of detecting large tools or key productivity
indicators from video images, such as a concrete hopper (Gong and Caldas 2010). Work
cycles can be determined by counting the number of times the hopper moves into concrete-
pouring zones (Gong and Caldas 2010; Almassi and McCabe 2008). Visual processing of
workers to analyze worker status can also be achieved using human pose analysis (Peddi et
al. 2009) and thermal image analysis combined with sound wave patterns (Weerasinghe and
Ruwanpura 2009).
Heavy construction work typically takes place outdoors with few visual obstructions.
Earthmoving activities, such as digging, loading, moving, spreading, grading, and
compacting, each involve specific equipment. Humans recognize construction equipment
by the shapes and features that make them unique. Developing an automated system to
undertake this task, however, has several challenges, such as the shape similarities of
different equipment, partially obstructed views, and a visually noisy environment. To
complicate matters, some earthwork plants have moving or deformable parts, such as
hydraulic excavators, adding another level of complexity to the recognition process. Visual
recognition research in construction has focused on three primary methods: color, motion,
and shape.
In a semi-automated approach, the user manually selects the excavator of interest, and then
the algorithm tracks the target excavator in subsequent frames. The system analyses
displacement of the excavator to determine the working state of the plant (Zou and Kim
2007). The use of the hue, saturation, and value (HSV) color space to detect equipment works
against well-contrasting backgrounds, such as soil and snow. However, it is challenging to use
on construction sites, as all of the equipment from one contractor is often similarly colored;
furthermore, many contractors use orange or yellow for safety purposes. This
color-based method (Zou and Kim 2007) is not robust to changes in illumination, scale,
viewpoint, and occlusion.
Since most construction entities, including earthmoving machines, are mobile, some research
efforts investigated the use of motion segmentation methods to detect moving objects and
then identify them. There are several foreground-background algorithms available, but the
selected method should be able to properly process construction videos with a dynamic
environment and harsh visual noise such as dust and smoke from equipment exhaust. In a
comparison of background subtraction algorithms (Gong and Caldas 2011), the Mixture of
Gaussians (Grimson et al. 1998), Codebook (Kim et al. 2005), and a Bayesian-based model
(Li et al. 2003) were evaluated; the Bayesian-based algorithm produced the best results on
construction videos. This motion segmentation filter uses Bayes decision rules to detect both
gradual and abrupt movements in videos with a static background.
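As a much simplified illustration of how per-pixel statistical background models operate, the sketch below maintains a single Gaussian per pixel and flags pixels that deviate strongly from their running mean. The cited methods use mixtures or Bayes decision rules, so this is a deliberately reduced cousin; all names are the author’s illustration.

```python
import numpy as np

class RunningGaussianBackground:
    """Minimal per-pixel Gaussian background model: each pixel keeps a
    running mean and variance, and pixels far from the mean (in units of
    standard deviation) are flagged as foreground."""
    def __init__(self, alpha=0.05, k=2.5):
        self.alpha, self.k = alpha, k   # learning rate, deviation threshold
        self.mean = None
        self.var = None

    def apply(self, frame):
        frame = frame.astype(float)
        if self.mean is None:           # first frame initialises the model
            self.mean = frame.copy()
            self.var = np.full(frame.shape, 25.0)
            return np.zeros(frame.shape, dtype=bool)
        d = frame - self.mean
        foreground = d * d > (self.k ** 2) * self.var
        # update the model only where the pixel still looks like background
        bg = ~foreground
        self.mean[bg] += self.alpha * d[bg]
        self.var[bg] += self.alpha * (d[bg] ** 2 - self.var[bg])
        return foreground
```

This sketch also exhibits the limitations discussed below: a foreground object that stays motionless is never absorbed here only because flagged pixels are excluded from the update, and the binary mask says nothing about whether connected pixels belong to one machine or two.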
The Bayesian-based foreground-background algorithm has been used to segment moving
entities in construction videos. The detected objects can then be identified by a classifier (e.g.
Bayes or neural network) using features of the object, including height/width aspect ratio,
height-normalized area, percentage of occupancy of the bounding box, and average
gray-scaled color of the area (Chi and Caldas 2011). This
productivity measurement of earthmoving activities (Gong and Caldas 2011). This
recognition approach, however, has five major limitations. First, it requires that the video
background remain still. Second, the motion segmentation algorithm may absorb foreground
particles if they stay motionless for a long time (Li et al. 2003). Third, the background
subtraction method is not able to consistently segment a moving object as a single unit and it
sometimes identifies a part of the moving object or may split one moving object into two or
more disconnected pieces. Fourth, the employed feature-based classifiers can be misled if an
unknown object (not trained before) enters the scene. Finally, this algorithm has difficulty in
processing occluded views that are typical in jobsites. For example, two machines move
close together and the system may segment them as a single blob. The shape features of this
blob represent none of those two plants.
To the best of the author’s knowledge, there is still a gap in semantic equipment
recognition in static images and videos of heavy construction, since the mentioned algorithms
are able to recognize machines only under certain conditions, such as plain backgrounds and
empty jobsites. Therefore, it is essential to develop a recognition framework for heavy civil
engineering projects in which different types of equipment may appear.
In addition to recognition, object tracking is another main module to make vision-based
systems practical. These algorithms can track the manually or automatically detected objects
and provide valuable spatiotemporal data including the location, direction of movement, and
velocity of the target. Thus, a number of studies were conducted to find suitable tracking
methods for noisy construction videos (Brilakis et al. 2011; Park et al. 2011; Gong and Caldas
2011).
All these research efforts are in a preliminary stage as they are only able to operate under
ideal conditions. For example, the test videos were taken from certain angles, with plain
backgrounds and slight occlusion, and only a few types of equipment appeared. In addition,
these systems only analyzed simple scenarios such as displacement of an excavator or a mini
loader to estimate their productivity.
The goal of this research is to close the practicability gap between vision-based systems and
earthmoving productivity measurement processes, so that loading cycles can be automatically
recognized and estimated under the varied visual conditions found on construction sites, such
as different viewpoints and the presence of various types of construction equipment. In
addition, the system should require minimal human intervention, limited to properly setting
the camera viewpoint.
2.4 Object Recognition
Computer vision, a form of artificial intelligence (AI), is evolving quickly, with object
recognition being one of the main branches of this field. However, existing algorithms have a
long way to go before they can match the flexibility and breadth of human vision (Dickinson
et al. 2009). The main difficulties arise from variations in illumination, viewpoint, scale,
occlusion, articulated shapes, and background clutter. In addition, the variety of samples
within a class can increase the complexity (Fei et al. 2007). Object recognition approaches
have been developed using recognition by parts, appearance-based, and feature-based
methods. Feature-based methods usually have two steps: computation of object descriptors,
and classification. All feature-based methods quantize the descriptors of positive and
negative samples to train a classifier. Due to proven performance and the availability of
source codes, Haar-like features (Viola and Jones 2001) and Histogram of Oriented
Gradients (HOG) (Dalal and Triggs 2005) were used in this research. These are discussed in
more detail next.
2.4.1 Haar-like Features
The Haar-like features framework (Haar) was originally introduced for face detection (Viola
and Jones 2001), and then broadly applied for other recognition purposes, such as traffic sign
and pedestrian detection, due to its high speed and accuracy. The algorithm partitions
images into a set of overlapping windows at different scales and then classifies whether each
window contains the target object. The Haar-like features framework has a cascade structure
that employs a series of weak classifiers. It uses a progressive, elimination-based
classification chain which rejects any sub-window that fails one of the classifiers. The
classifiers in the cascade become progressively more complex: each classifier rejects as many
of the remaining negative sub-windows as possible while still passing all but a small fraction
of the true positives (see Figure 2-2).
The features used to train each classifier consist of rectangular regions configured in various
Haar and bar-like arrangements. To learn and classify objects, it applies a form of the
AdaBoost (Freund and Schapire 1997) algorithm to features that were extracted from digital
images. This method is fast and efficient in recognizing objects that have a stable,
characteristic appearance and do not have large pose variations, such as human faces.
Figure 2-2: Detection cascade (Viola and Jones 2001)
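Much of the framework’s speed comes from the integral image, which allows any rectangle sum, and hence any Haar-like feature value, to be computed in constant time after a single pass over the image. A minimal sketch follows; the function names are illustrative, not the Viola-Jones reference implementation.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: rectangle sums in O(1) after O(N) precompute."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=float), axis=0), axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using the integral image (exclusive ends)."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def haar_two_rect_vertical(ii, r, c, h, w):
    """Two-rectangle Haar-like feature: left half-sum minus right half-sum."""
    half = w // 2
    left = rect_sum(ii, r, c, r + h, c + half)
    right = rect_sum(ii, r, c + half, r + h, c + w)
    return left - right
```

Because every feature costs only a handful of array lookups, the cascade can evaluate enormous numbers of sub-windows quickly and still reject most of them at the earliest stages.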
2.4.2 Histogram of Oriented Gradients (HOG)
Initially developed to detect pedestrians in static images (Dalal and Triggs 2005), the robust
HOG algorithm won the 2006 PASCAL object detection challenge (Everingham et al. 2006).
This method is invariant to changes in illumination and noise, and has been broadly applied
to detect rigid objects, such as highway vehicles (Rybski et al. 2010; Morlock 2008). In the
HOG algorithm, computed gradients of the gray-scale image are accumulated into spatial
cells and orientation bins (Figure 2-3), forming histograms of oriented gradients which are then
concatenated into a vector called a descriptor. Next, many positive and negative vectors are
required to train a detector using the linear support vector machine (SVM) algorithm (Cortes
and Vapnik 1995) under supervised learning. Unlike the cascade framework of the Haar-like
features, HOG uses a sliding window approach to search for the target object in all positions
and scales of an image. In this detection process, the detector first searches for the target in
the original scale image, then the frame is scaled down by the shrinkage coefficient, and the
scan is repeated until the image reaches the size of the classifier (e.g. 64x128 for pedestrian
detection). The classifier examines the test windows by the linear SVM classification process
which is a scalar product of the classifier and the test window vector. Unlike the Haar
method, the HOG algorithm tests all of the sub windows of the image and is more
computationally intensive.
Figure 2-3: Left: Original image; Right: Visualization of the HOG descriptor
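The descriptor computation can be sketched in stripped-down form: image gradients are binned by unsigned orientation within each spatial cell, and the cell histograms are concatenated into one vector. Block normalisation, an important part of the published method, is omitted here for brevity, and all names are the author’s illustration.

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9):
    """Minimal HOG-style descriptor: gradient magnitude-weighted histograms
    of unsigned orientation (0-180 degrees) over non-overlapping cells,
    concatenated and L2-normalised."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)                     # vertical, horizontal gradients
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h, w = img.shape
    feats = []
    for r in range(0, h - cell + 1, cell):
        for c in range(0, w - cell + 1, cell):
            a = ang[r:r + cell, c:c + cell].ravel()
            m = mag[r:r + cell, c:c + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-9)
```

In the detection stage described above, the linear SVM score for a test window is then simply the scalar (dot) product of the trained weight vector with such a descriptor.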
2.5 Object Tracking
Video tracking algorithms locate a target object across video frames and are widely used in
security, traffic control, human-computer interaction, and many other applications. There are
a number of tracking approaches, including contour tracking, kernel-based tracking, and
feature matching. Comparative studies of tracking methods applied to the visually noisy
construction environment revealed that the Mean-shift algorithm is reliable for tracking
objects (Gong and Caldas 2011; Park et al. 2011), and that the addition of the Kalman filter
and particle filter can stabilize its performance (Gong and Caldas 2011).
2.5.1 Mean-shift Tracking
Mean-shift tracking is a non-parametric, kernel-based procedure for locating the maxima of
the density distribution of a dataset (Comaniciu et al. 2003). This iterative algorithm starts
with an initial estimate and then calculates the weights of nearby points to re-estimate
the mean, thus ignoring outliers far from the peak. Any feature of the object can be used
to create the dataset, but color is one of the most efficient and commonly used. The top
two frames in Figure 2-4 show a red dump truck being tracked. The images in the second row
are the corresponding back projections, which filter the pixels in the range of the
target’s color histogram. Although other objects, such as the body of the excavator and the
other entering truck, have similar colors, the mean-shift tracker ignores them. A modified
version of the Mean-shift algorithm, called continuously adaptive Mean-shift or Camshift
(Bradski 1998), employs the mean-shift method and changes the size of the tracking window
to adapt to changes in the shape, orientation, and size of the target object.
Figure 2-4: Top row: tracking of a dump truck in two frames, Bottom row: back projection of the density
distribution
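The core mean-shift iteration over a back-projection image is short: the tracking window is repeatedly moved to the centroid of the weights it covers until the shift vanishes. A minimal sketch follows; the rectangle convention and names are the author’s illustration, and Camshift would additionally resize the window.

```python
import numpy as np

def mean_shift_window(weights, window, max_iter=20):
    """One mean-shift tracking step: shift a rectangular window (r, c, h, w)
    toward the centroid of the weights it covers. `weights` plays the role
    of the colour-histogram back projection."""
    r, c, h, w = window
    for _ in range(max_iter):
        patch = weights[r:r + h, c:c + w]
        total = patch.sum()
        if total == 0:                      # no target mass under the window
            break
        rows, cols = np.mgrid[0:h, 0:w]
        # shift from window centre to weighted centroid of the patch
        dr = int(round((rows * patch).sum() / total - (h - 1) / 2))
        dc = int(round((cols * patch).sum() / total - (w - 1) / 2))
        if dr == 0 and dc == 0:             # converged
            break
        r = min(max(r + dr, 0), weights.shape[0] - h)
        c = min(max(c + dc, 0), weights.shape[1] - w)
    return r, c, h, w
```

Because only the pixels under the window contribute, distractors elsewhere in the back projection, like the similarly coloured excavator body in Figure 2-4, do not pull the window away.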
2.6 Summary
This chapter describes recent advancements in monitoring earthmoving equipment on
jobsites. The developed technologies can be classified into two main groups: active and
passive tools.
Active methods include a central processing unit, a network of reference receivers, and
sensing devices that should be installed on each machine. There are different sensing
technologies, such as engine sensors, RFID, GPS, and UWB, that can provide various data
such as engine parameters and 3D location. In contrast, passive techniques do not require
sensing tags installed on the equipment; instead, a central unit processes data captured
remotely from the scene. Laser scanners and computer vision-based techniques are the two main
forms of passive data collection; however, only vision-based methods have been used to track
construction equipment, because laser scanning processes are too slow to track mobile
equipment.
Table 2-1 summarizes the characteristics of each technology based on four functional
technology requirements: type (tagging requirement), data provided, hardware requirement,
and spatiotemporal tracking.
Table 2-1: A summary of main features of equipment tracking methods

Technology | Type | Data provided | Hardware requirement | Spatiotemporal tracking
Engine sensors | Active | Equipment ID; engine parameters such as revolutions per minute | Sensors and central receiver | No
RFID | Active | Equipment ID; time | Tags and network of scanning gates | No
Standard GPS | Active | Equipment ID; 3D location; time | Sensors and central receiver | 3D
GPS with transmit stations | Active | Equipment ID; 3D location; time | Sensors, central receiver, and additional reference stations | 3D
UWB | Active | Equipment ID; 3D location; time | Tags and network of UWB receivers | 3D (limited to the receivers' zone)
Single camera for each scene | Passive | 2D location; equipment type; equipment orientation; 2D size; time | Network of digital cameras | 2D (limited to camera coverage)
Stereo view of each scene | Passive | 3D location; equipment type; equipment orientation; 3D size; time | Network of calibrated digital cameras | 3D (limited to camera coverage)
Active systems, more specifically GPS, have been broadly used in construction and mining
industries for more than a decade and numerous research efforts have investigated
shortcomings and development possibilities of these technologies. Vision-based algorithms
are another potential tool to track earthwork machines and estimate their productivity. Vision-
based systems, however, are very new in this field, and only a few research studies have been
carried out to monitor earthmoving equipment. These research projects are in their early
stages and require a high level of manual intervention. In addition, they can only operate
under ideal conditions that do not resemble those found on jobsites.
As stated before, a practical vision-based system requires recognizing different types of
earthmoving equipment and tracking them under realistic conditions. Then the system must
be able to analyze the provided spatiotemporal data and identify equipment interactions. The
next chapter describes different state of the art object recognition algorithms which are
evaluated and modified, as required to identify loading machines under different visual
conditions with high accuracy and efficient speed.
CHAPTER 3 - OBJECT RECOGNITION MODULE
This chapter describes the development of an object recognition module. Two types of
equipment were selected to develop the module. A hydraulic excavator represents a loading
unit, and off-highway and urban dump trucks were chosen as instances of hauling equipment.
Dump trucks are rigid objects (except during their relatively short dumping periods) and
existing object recognition algorithms showed promising performance in the detection of
similar rigid vehicles such as cars. But recognition of an object that regularly changes its
shape, such as an articulated excavator, is more challenging. Therefore, a recognition system
was developed for excavators. This recognition framework combines a part-based approach
and spatiotemporal reasoning for recognition of operating excavators in construction videos.
Development of an object recognition classifier requires a large number of images for
training and testing phases that must not overlap. The training dataset should include both
positive and negative samples. Earthmoving machines appear quite different depending on
their viewpoint with respect to the camera, which makes it impossible to identify anything
but a sphere with a single detector. Therefore, it is essential to collect plenty of images
containing different orientations of a machine. In addition, training samples should be taken
under different lighting conditions and include different makes within a class to produce
efficient classifiers. Since a supervised learning approach is used to train classifiers, the
following steps should be carried out to prepare the training dataset. These tasks are human-
intensive and time-consuming.
- Divide training samples into positive and negative groups; negative samples must not include the target object;
- Determine the training viewpoints for each piece of equipment;
- Group positive images into training categories;
- Crop positive objects using fixed-ratio boxes;
- Resize cropped images to the determined training sizes; this step is done automatically using the Image Processing Toolbox of the MATLAB software.
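The fixed-ratio cropping step above can be sketched in a few lines of Python (a sketch with hypothetical box coordinates; the thesis performed the actual resizing with MATLAB's Image Processing Toolbox):

```python
def fixed_ratio_box(x, y, w, h, ratio):
    """Expand a bounding box (top-left x, y, width w, height h)
    to match a fixed width:height ratio, keeping it centered."""
    if w / h < ratio:          # box too narrow: widen it
        new_w, new_h = ratio * h, h
    else:                      # box too short: heighten it
        new_w, new_h = w, w / ratio
    cx, cy = x + w / 2, y + h / 2
    return (cx - new_w / 2, cy - new_h / 2, new_w, new_h)

# e.g. a truck annotation cropped for a 36x20 (ratio 1.8) training window
box = fixed_ratio_box(100, 50, 90, 60, 36 / 20)
```

The crop is widened (or heightened) symmetrically so the machine stays centered in the training window.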
For training and testing purposes, a large number of images containing different makes and
sizes of dump trucks and hydraulic excavators were collected from a multimedia archive and
freely-available on-line sources. In addition, a large number of images were captured by the
author from three construction sites, including two large earth-fill dams and a foundation
excavation of a condominium complex. These data were randomly divided into training and
testing datasets. The statistics of the training images for each viewpoint of the dump trucks
and excavators are provided in the following sections. Although all of the images were taken
in daylight, they vary by time of the year and illumination levels. Images were taken from
ground or above ground levels, and the vehicles in the images were located at various
distances from the camera.
3.1 Dump Trucks
A variety of makes, models, and colors of trucks were used to ensure that the classifier is not
limited by any one of these factors. Two object recognition algorithms, namely Haar-like
features and Histogram of Oriented Gradients (HOG), were trained using the training dataset.
Their performance was evaluated using the test dataset. The following sections describe the
development and testing processes.
3.1.1 Visual orientations
Since the visual features of a dump truck change with the camera viewpoint, both Haar and
HOG detectors were trained with image samples from eight orientations as shown in Figure
3-1. This follows from previous research where using eight visual orientations provided
strong results in the detection of urban vehicles (Rybski et al. 2010, Han et al. 2006). Having
eight orientations not only increases the chance of detection, but it also enables prediction of
the trajectory of the dump truck, which can be valuable data for activity interpretation. Table
3-1 presents the number of training images in each of the eight viewpoints.
Table 3-1: The number of training images in each category - With permission from ASCE (Rezazadeh
Azar and McCabe 2012a)

Visual orientation | Front | Front-left | Front-right | Side-left | Side-right | Rear | Rear-left | Rear-right
# Positive samples |  488  |    699     |     581     |    755    |    755     | 304  |    488    |    488
# Negative samples | 8000  |   8000     |    8000     |   8000    |   8000     | 8000 |   8000    |   8000
Figure 3-1: Left: orientations; Right: samples of views (clockwise from top left: front, front-left, front-
right, side-left, side-right, rear, rear-left, rear-right) - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a)
All of the positive and negative training images were manually cropped and then scaled
down to predetermined sizes. Since the two object recognition algorithms use different
training approaches, the training samples for each method have different sizes. The Haar
method requires small images, in the range of 20x20 to 40x40 pixels, while the HOG
algorithm uses larger training frames of 64 to 128 pixels in each dimension; thus,
different training window sizes were used for each technique, as presented in Table 3-2.
Table 3-2: Training windows of each method

Method    | Front and rear views (pixels) | Other six views (pixels)
Haar-like | 21x19                         | 36x20
HOG       | 104x96                        | 128x80
The bounding boxes should completely enclose the machines in the training images, but a 16-
pixel margin around the target object on all four sides was added to improve the
performance of the HOG detectors (Dalal and Triggs 2005). The Haar framework, however,
requires smaller positive training windows with smaller margins. Therefore, 10 pixels were
cropped from all four sides of the HOG's positive samples to decrease the margin. Finally,
the 84x76 images were resized by a 1/4 scale factor and the 108x60 windows by 1/3, which
resulted in 21x19 and 36x20 boxes, respectively.
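The window-size arithmetic above can be checked with a short sketch (the sizes are those reported in the text; the helper function itself is hypothetical):

```python
def haar_window(hog_w, hog_h, crop, scale):
    """Derive a Haar training window from a HOG window by cropping
    `crop` pixels from all four sides and dividing by `scale`."""
    return ((hog_w - 2 * crop) // scale, (hog_h - 2 * crop) // scale)

# 104x96 HOG window -> crop 10 px per side -> 84x76 -> 1/4 scale -> 21x19
front_rear = haar_window(104, 96, 10, 4)
# 128x80 HOG window -> crop 10 px per side -> 108x60 -> 1/3 scale -> 36x20
other_views = haar_window(128, 80, 10, 3)
```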
A set of 800 images containing no dump trucks was obtained from the same sources and
used as the base negative training set. Negative images contained construction scenes, and
many of them contained other earthwork plant, such as bulldozers and graders, to mitigate
misclassification of such machines. Ten windows were randomly cropped from each
negative image and scaled down to the corresponding sizes for each viewpoint, resulting in
8000 negative training samples for each orientation.
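The negative-sample generation can be sketched as follows (a minimal sketch; the crop positions are illustrative):

```python
import random

def random_negative_crops(img_w, img_h, win_w, win_h, n=10, seed=0):
    """Pick n random sub-window positions (top-left corners) inside a
    negative image; each crop is later scaled to the training size."""
    rng = random.Random(seed)
    return [(rng.randint(0, img_w - win_w), rng.randint(0, img_h - win_h))
            for _ in range(n)]

crops = random_negative_crops(640, 480, 128, 80)   # 10 windows per image
# 800 negative images x 10 crops = 8000 negatives per orientation
```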
3.1.2 Machine learning
Training samples are prepared for machine learning after they have been grouped, cropped,
and resized to the determined sizes. The AdaBoost learning algorithm (Freund and Schapire
1997) and the linear SVM method were employed to train Haar-like features and HOG
detectors, respectively. The open-source OpenCV 2.1 library (OpenCV 2010) has built-in
functions to train Haar classifiers. First, a vector of positive samples should be created by
using the cvCreateTrainingSamplesFromInfo() function. Then, the
cvCreateTreeCascadeClassifier() function employs the AdaBoost learning method to train a
cascade classifier from the vector of positive samples and negative images. This function
requires several input parameters, such as number of stages, minimum detection rate,
maximum false alarm rate, boost type, and the size of positive samples. Depending on these
parameters and the capability of the processor, it may take from a couple of hours to more
than a day to train a cascade classifier.
The OpenCV library, however, does not include an efficient linear SVM learning function to
train HOG detectors. It only contains functions to compute HOG features and classify search
windows. Therefore, the compute() function from the cv::HOGDescriptor structure was used
to create a vector of HOG features for every positive and negative sample. The calculated
positive and negative vectors were then grouped and labelled with +1 and -1 respectively,
and saved as a single .dat file. Finally, publicly available SVM-light software (Joachims
1999) was used to train HOG classifiers from the created .dat file.
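The labelling and file-writing step can be sketched as follows (a simplified sketch of SVM-light's sparse `label index:value` input format, not the exact code used):

```python
def svmlight_line(label, vector):
    """Format one sample in SVM-light's sparse input format.
    Feature indices are 1-based; zero entries may be omitted."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1)
                     if v != 0)
    return f"{label:+d} {feats}"

def write_dat(path, positives, negatives):
    """Write positive (+1) and negative (-1) HOG vectors to one .dat file."""
    with open(path, "w") as f:
        for vec in positives:
            f.write(svmlight_line(+1, vec) + "\n")
        for vec in negatives:
            f.write(svmlight_line(-1, vec) + "\n")

line = svmlight_line(+1, [0.5, 0.0, 0.25])
```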
Training the HOG classifiers in two rounds significantly improves the results (Dalal and
Triggs 2005). In two-round training, the initially trained classifier searches the original
negative images; any detected window is necessarily a false detection. These false positives,
called hard negatives, were then scaled and added to the negative samples for the second
round of training.
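The two-round procedure can be sketched as follows, with `train` and `detect` as hypothetical stand-ins for the SVM-light training and HOG scanning steps:

```python
def mine_hard_negatives(train, detect, positives, negatives, negative_images):
    """Two-round training: any window the first-round classifier fires on
    in a negative image is, by construction, a false positive ('hard
    negative'); add those windows and retrain."""
    clf = train(positives, negatives)              # round 1
    hard = [win for img in negative_images
            for win in detect(clf, img)]           # every hit is false
    return train(positives, negatives + hard)      # round 2

# toy stand-ins: a "classifier" is just a tuple of training-set sizes
toy = mine_hard_negatives(
    train=lambda p, n: ("clf", len(p), len(n)),
    detect=lambda clf, img: ["hard_win"],          # one false hit per image
    positives=[1] * 5, negatives=[0] * 20, negative_images=range(3))
```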
3.1.3 Performance of Detectors on Static Images
An experimental process was designed to evaluate the performance of the detectors on static
images and to choose a suitable method. It involved scanning each test image with eight
single-class detectors (one for each orientation). These recognition tests were carried out on
a dual-core 2.93 GHz processing unit with 3 GB of RAM. For this experiment, 380 test
images were randomly selected from the image pool, none of which had been used for
training. These images contained 681 dump trucks in all eight orientations, together with
other types of heavy equipment, some of which had similar colors, to evaluate the
performance of the detectors in congested views.
Both Haar-like and HOG detectors use binary classifiers for object recognition. These
classifiers search sub-windows at different locations and scales of an image and decide
whether each sub-window matches the properties of the target object. Binary
classification of a sub-window has four possible outcomes, which are presented in a
confusion matrix in Figure 3-2.
Figure 3-2: Possible outcomes of a binary classification process
A main parameter for evaluating the performance of a binary classifier is the hit rate, also
known as the true positive rate, sensitivity, or recall. The hit rate is defined as the ratio of
correctly detected objects, or (true positives)/(true positives + false negatives). Other
measures can be derived from a confusion matrix to assess a classifier's performance, such as:
- false positive rate (fall-out) = false positives/(false positives + true negatives);
- accuracy = (true positives + true negatives)/total search windows;
- specificity = true negatives/(false positives + true negatives);
- precision = true positives/(true positives + false positives).
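These measures can be computed directly from the confusion-matrix counts (a sketch with illustrative counts):

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive the measures above from confusion-matrix counts."""
    return {
        "hit_rate":    tp / (tp + fn),           # recall / sensitivity
        "fall_out":    fp / (fp + tn),           # false positive rate
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "specificity": tn / (fp + tn),
        "precision":   tp / (tp + fp),
    }

# illustrative counts, not results from the experiments
m = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```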
Binary classifiers operate with a discrimination threshold. Altering this threshold changes
the hit rate and the false positive rate: lower thresholds pass more test sub-windows,
including both true and false positives. The receiver operating characteristic
(ROC) curve is a common approach to illustrate this trade-off, plotting the hit rate
versus the false positive rate at varying thresholds.
Both the hit rate and the false positive rate vary from 0 to 1, but the false positive rate
usually takes very small values. For example, the HOG detectors have to classify more than
320,000 sub-windows in a 640x480 frame for eight views, the majority of which are
negative. As will be demonstrated next, these classifiers generate very few false positives
per image, so the false positive rate would be on the order of 10^-6. To make the results
more meaningful for construction readers, the false positive rate was replaced with false
positives per frame in the reported ROC curves. This parameter simply presents the average
number of false positives occurring per test image.
Three factors were considered to evaluate the performance of the trained detectors: hit rate,
number of false positives per image, and computation time. The evaluation rules of the
PASCAL visual object classes challenge (Everingham and Winn 2010) were followed in this
experiment to determine whether a detected box is a true positive or a false alarm. They
require the detected bounding box to overlap more than 50% with the ground-truth bounding
box to be considered a true positive. In addition to location, a true positive should correctly
represent the orientation of the dump truck in the image. For example, if the classifier
identifies a dump truck with a "Rear-left" orientation instead of "Front-left", it is counted
as a false alarm.
There were some instances where the detectors of two adjacent views detected the same truck,
which had a boundary orientation. In this case, the system first checks all of the detected
bounding boxes to find the rectangles belonging to the same subset, which should have
similar sizes and locations. In this method, two rectangles are considered to be in the same
group if all of the distances between the x and y elements of the matching corners are lower
than the minimum average of the width and height of the boxes times a threshold (Viola and
Jones 2001). Then, the system picks the rectangle with the greater detection score and ignores
the other one. This method only considers two overlapping detections with adjacent
viewpoints and the highest scores; three or more overlapping orientations are penalized.
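The grouping test can be sketched as follows (a simplified, hypothetical implementation; the similarity rule follows the form used by OpenCV's rectangle grouping):

```python
def similar(r1, r2, eps=0.2):
    """True if two boxes (x, y, w, h) likely cover the same object:
    all corner distances must be within eps times the average of the
    smaller width and smaller height of the two boxes."""
    delta = eps * 0.5 * (min(r1[2], r2[2]) + min(r1[3], r2[3]))
    return (abs(r1[0] - r2[0]) <= delta and
            abs(r1[1] - r2[1]) <= delta and
            abs(r1[0] + r1[2] - r2[0] - r2[2]) <= delta and
            abs(r1[1] + r1[3] - r2[1] - r2[3]) <= delta)

def keep_best(detections):
    """From overlapping detections [(box, score, view)], keep only the
    highest-scoring one in each similarity group."""
    kept = []
    for det in sorted(detections, key=lambda d: -d[1]):
        if not any(similar(det[0], k[0]) for k in kept):
            kept.append(det)
    return kept

# two adjacent-view detections of the same truck at a boundary orientation
best = keep_best([((100, 100, 80, 40), 1.3, "side-left"),
                  ((104, 98, 82, 41), 0.9, "front-left")])
```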
3.1.3.1 HOG detectors
The receiver operating characteristic (ROC) curve was used to illustrate the performance of
the HOG detectors on the test images, as shown in Figure 3-3. This curve illustrates the
trade-off between the hit rate and false alarms per image, where lower classification
thresholds pass more test sub-windows, including both true and false positives.
Figure 3-3: ROC curve of the HOG detectors - With permission from ASCE (Rezazadeh Azar and
McCabe 2012a)
Figure 3-4 shows some recognition samples. As observed in the results, HOG detectors could
recognize dump trucks with high accuracy among other types of machines, many of which
have a similar color. For instance, Figure 3-4 presents images taken from a rock-fill dam
construction project with rollers, bulldozers, graders, hydraulic excavators, and loaders. A
roller with “side-right” orientation is misclassified in the top right image.
Figure 3-4: HOG recognition samples - With permission from ASCE (Rezazadeh Azar and McCabe
2012a)
As the test images were randomly selected from among the most challenging images, a number of
false negatives were partially masked by piles of soil or other machines (see Figure 3-5). On
the other hand, many of the false alarms resulted from incorrect viewpoint estimates rather
than incorrect locations.
Figure 3-5: Samples of missed dump trucks - With permission from ASCE (Rezazadeh Azar and McCabe
2012a)
The computation time of this recognition method is another important factor in developing a
real-time application. Runtimes for different sizes of images were recorded and are presented
in Table 3-3. Processing a low resolution standard surveillance image of 640x480 pixels on a
dual core 2.93 GHz CPU for all eight viewpoints takes about 26 seconds, which is too long
for real-time purposes. This is because the HOG object recognition algorithm uses a brute-
force search approach as its classifier window searches for the target object in every location
and scale of the image. The classifier first searches for the object in the original scale frame,
then scales down the image by the shrinkage coefficient (set at 1.05 in this experiment), and
repeats the scan process. This process finishes when the image reaches the size of the
classifier window, which is 128x80 or 104x96 pixels, depending on the viewpoint. For
instance, to find dump trucks in all eight orientations in a 640x480 frame, the system should
classify 6x40,508 or 243,048 windows for the six viewpoints with 128x80 training windows,
and 2x40,999 or 81,998 windows for the "Front" and "Rear" orientations, which have
104x96 search windows. Each of these linear SVM classifications is the dot product of the
classifier and test window vectors, whose sizes are [4752x1] for 104x96 windows and
[4860x1] for 128x80 boxes.
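The scale of this brute-force search can be sketched as follows (a sketch assuming an 8-pixel window stride, a common HOG setting; exact counts depend on stride and padding, so the totals differ somewhat from those quoted above):

```python
def count_windows(img_w, img_h, win_w, win_h, shrink=1.05, stride=8):
    """Count classifier windows over the whole image pyramid: scan at the
    current scale, shrink the image by `shrink`, and repeat until the
    image is smaller than the classifier window."""
    total, scale = 0, 1.0
    while img_w / scale >= win_w and img_h / scale >= win_h:
        w, h = int(img_w / scale), int(img_h / scale)
        total += ((w - win_w) // stride + 1) * ((h - win_h) // stride + 1)
        scale *= shrink
    return total

per_view = count_windows(640, 480, 128, 80)   # windows for one viewpoint
```

Even under these assumptions, each viewpoint contributes tens of thousands of windows per 640x480 frame, which is why the sequential CPU implementation is too slow for real-time use.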
However, parallel implementation of the HOG algorithm on a new-generation Graphics
Processing Unit (GPU) can speed up the standard sequential code by over 67 times
(Prisacariu and Reid 2009). New-generation GPUs have hundreds of cores, enabling them
to process thousands of threads in parallel, and allow non-uniform memory access
(NUMA). The HOG recognition algorithm was implemented using CUDA, a parallel
computing platform and programming model developed by NVIDIA (NVIDIA
2012). First, the host CPU acquires the frame and copies it to GPU memory. The GPU
processes all of the scales and sub-windows of the frames and sends back calculated SVM
scores of each sub-window to the host CPU. The host CPU then formats the inputs that
contain the score and position of each sub-window. Finally, the host CPU carries out non-
maximal suppression to fuse the detected windows; this step is processed on the CPU
because it requires frequent access to main memory.
The computation times for scanning the same eight orientations with the same CPU (2.93
GHz dual core) and a GeForce GT 440 GPU with compute capability 2.1 were accelerated
significantly (Table 3-3). Parallel programming on the GPU enables the use of the standalone
HOG method as the truck recognition module of the framework, with a suitable detection
rate, while maintaining a real-time video stream.
Table 3-3: HOG runtimes for eight views using CPU and GPU

Image size | CPU dual-core 2.93 GHz, HOG (sec) | GPU NVIDIA GeForce GT 440, HOG (sec)
640x480    | 26                                | 1.07
1024x768   | 69                                | 2.8
1920x1080  | 186                               | 7.6
2592x1944  | 455                               | 18.8
3.1.3.2 Haar-like detectors
Although the standalone Haar-like feature detectors had short runtimes (Table 3-4), they
showed relatively low detection rates with very high false positives compared to the HOG
method (see Table 3-5). As such, this recognition algorithm was set aside.
Table 3-4: Computation times of the Haar detectors in searching for eight orientations - With permission
from ASCE (Rezazadeh Azar and McCabe 2012a)
Image size | Haar runtime (sec)
640x480    | 1.2-2.0
1024x768   | 2.5-5.1
1920x1080  | 6.9-13.1
2592x1944  | 19.6-25.7
Table 3-5: Some samples of Haar detectors and their performances - With permission from ASCE
(Rezazadeh Azar and McCabe 2012a)
Training settings                                  | Test results
Minimum hit rate | Max. false alarm | Boosting type | Detection rate | False positives per image
0.995            | 0.50             | Gentle AdaBoost | 49.2%        | 3.64
0.995            | 0.55             | Gentle AdaBoost | 71.5%        | 41.4
0.995            | 0.60             | Gentle AdaBoost | 86.8%        | 186.3
3.1.4 Haar-like vs. HOG performance
In static images, the HOG classifiers outperformed the Haar-like features algorithm with
respect to effectiveness, i.e., correct versus erroneous detections. With respect to efficiency,
however, the runtimes were much higher for HOG on the same CPU. Implementation of the
HOG recognition processes on a GPU resolved this issue. As such, the Haar-like recognition
algorithm was rejected for further use in this research.
3.1.5 Performance of Detector on Videos
To take advantage of the relatively rapid runtimes using the GPU, the system should scan
frames at time intervals slightly longer than the maximum runtime for real-time
applications. For example, processing each 640x480 pixel frame takes less than 1.1 seconds,
so the system can be set to scan video frames at intervals of ≥1.5 seconds (a 0.4-second
margin that the system may need for loading frames or other processes). Hence, the process
can sustain a real-time video stream.
For the test on videos, the algorithm was set to scan the frames every 5 seconds, as this time
interval is suitable for detecting a dump truck entering the scene while maintaining run-time
efficiency. Since dump trucks move slowly on sites due to speed limits (typically 25
km/h or less), they take considerable time to pass through the camera's view, and all of the
trucks appeared in at least one frame.
The performance of the HOG detectors was evaluated for the recognition of off-highway
dump trucks in test videos with 640x480 pixel frames. The 17 test videos with a total
duration of 65 minutes contained 62 dump trucks in different phases of their working cycles.
The system scans a frame every 5 seconds from the video stream, resulting in 773 frames
being processed. The recognition framework processed all of the videos without any delay to
the normal stream. The ROC of the results is illustrated in Figure 3-6, and Figure 3-7 shows
the recognition result on four consecutive video test frames.
Figure 3-6: ROC curve of the HOG detectors on videos
Figure 3-7: Detection results in a series of frames at specified time intervals (a through d)
The aim of this test was to detect trucks in the videos rather than to evaluate the performance
of the detector in each frame separately, so the performance evaluation was carried out
differently. The hit rate is calculated here as the number of detected machines (regardless of
the frequency of detections) divided by the number of dump trucks appearing in the video
stream. For instance, if a dump truck appears in two frames and is spotted in one or both of
them, it is counted as detected; however, any false alarms in other frames were counted as
well. Thus, the detectors had more than one detection chance for many of the 62 dump
trucks. Since the detectors had multiple opportunities to identify the trucks, higher detection
thresholds were set than the ones used on static images, which resulted in fewer false
positives. In addition, the larger static images (e.g. 1920x1080 and 2592x1944) had many
more search windows than the 640x480 pixel video frames; thus, the probability of false
alarms was much higher in the static images than in the videos. All of these points resulted in
much better performance of the HOG detectors on videos than on static images. The highest
hit rate in videos was 95.16% with 0.15 false positives per frame. In contrast, the detection
rate was 90.32% with 2.59 false alarms per frame on static images.
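The video-level hit-rate bookkeeping can be sketched as follows (the count of 59 detected trucks is back-calculated from the reported 95.16% of 62 trucks, so it is illustrative):

```python
def video_hit_rate(detected_trucks, appeared_trucks):
    """Hit rate at the video level, in percent: a truck counts as
    detected if it is spotted in at least one of the frames in which
    it appears, regardless of how many times it is detected."""
    return 100 * detected_trucks / appeared_trucks

video_rate = video_hit_rate(59, 62)   # matches the reported 95.16%
```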
3.2 Hydraulic Excavator
Deformable objects are significantly more difficult to detect, and state-of-the-art research on
human detection focuses on pose identification due to the countless possible configurations
of the human body. The resulting algorithms are highly applicable to security surveillance,
entertainment, and automated image and video indexing. Many of these methods use part-
entertainment, and automated image and video indexing. Many of these methods use part-
based and pictorial algorithms to detect a set of parts of a semantic object arranged in a
deformable configuration (Felzenszwalb et al. 2010; Andriluka et al. 2009). The Latent
support vector machine (Latent SVM) recognition method (Felzenszwalb et al. 2010) is a
cutting edge part-based model that won the 2009 PASCAL object detection challenge
(Everingham et al. 2009). This algorithm uses a modified version of the HOG detector, called
a root filter, to find the candidates for the target object. It then searches inside the detected
root boxes for the parts of the object at twice the spatial resolution relative to the original
resolution. A similar idea with substantial modifications was employed to detect a root and
then search for the possible configurations of the parts of the excavator to both recognize and
estimate the pose of the machine. The following sections describe this novel recognition
model.
3.2.1 Deformable parts
Highly articulated hydraulic excavators can swing 360 degrees and rotate all three parts of
their arm (boom, dipper, and attachment) around their hinged supports, as depicted in Figure
Hydraulic excavators appear in various configurations, making it impractical to detect them
with the limited number of training configurations used in the case of dump trucks.
Figure 3-8: Deformations of the hydraulic excavator - With permission (Rezazadeh Azar and McCabe
2012b)
In latent SVM part-based models (Felzenszwalb et al. 2010), the root classifier detects the
entire body (e.g., human), then searches for the body parts (e.g., arms, torso, and legs) inside
the root to validate the detection. A hydraulic excavator can have several forms with parts of
the equipment masked by soil deposits or by other machines in the frames. Thus, it is very
difficult to find the root candidates (entire excavator) with a few root detectors, so the
approach was modified.
The part of the machine that is most visible was defined as the root. Then instead of
searching for the object parts within the root, the algorithm searches for the adjacent parts in
a variety of possible formations to validate the recognition process. The boom of the
excavator was selected as the root (Figure 3-9) and the dipper (second section of the
articulated arm) as the adjacent part. The main body and the bucket were not considered as
the root or adjacent parts because the boom and dipper have approximately similar forms
and size ratios across different sizes and makes of excavators (except for long-boom
excavators). Cabin shapes, however, can vary broadly; for instance, urban excavators
have compact bodies to swing in confined working zones. In addition, excavators can carry
different attachments at the end of their dipper, such as pneumatic hammers, buckets, and
trenchers, to perform specific operations. Moreover, the bucket and cabin may be masked by
other machines or soil deposits, while the boom and dipper are the most visible parts of the
excavator. Finally, the addition of other parts would decrease the detection rate.
Figure 3-9: Root and part of the excavator
The HOG classifier for the root (boom) was trained in left and right orientations. Figure 3-10
shows some of the training samples. Since the dipper revolves around its hinged connection
with the boom, the dipper detector was trained for the six views illustrated in Figure 3-11. As
a result, six poses are possible: left-horizontal, left-inclined, left-vertical, right-horizontal,
right-inclined, and right-vertical. It is impossible to distinguish the parts in full front and rear
views, where the boom is aligned with the camera's line of sight, so those views would
require separate detectors. Because the SCIT system detects excavators in videos, and
excavators constantly slew while operating, the six side poses are sufficient for detection;
thus, front and rear views were not considered.
Figure 3-10: Top row: training instances of the boom in left direction; Second row: training samples of
the boom in right direction
Figure 3-11: Poses of the dipper - With permission (Rezazadeh Azar and McCabe 2012b)
3.2.2 Features
Root and part classifiers were trained using the HOG object recognition algorithm. Table 3-6
presents the statistics of the positive and negative samples used to train eight detectors. These
images were collected from the same sources used for dump truck detection.
Table 3-6: Statistics of the training images in each view - With permission (Rezazadeh Azar and McCabe
2012b)

Part               | Root-left | Root-right | Horizontal-right | Inclined-right | Vertical-right | Horizontal-left | Inclined-left | Vertical-left
# Positive samples | 1040      | 800        | 398              | 791            | 926            | 398             | 791           | 926
# Negative samples | 7700      | 7700       | 7700             | 7700           | 7700           | 7700            | 7700          | 7700
A negative training set including 770 negative images was collected, which contained
construction landscapes and earthmoving machines other than hydraulic excavators to reduce
the chance of misclassification of those plants as excavators. Ten boxes were randomly
cropped from each frame and scaled to corresponding viewpoint sizes, which produced 7700
negative training samples for each category. The two-round training approach was used to
train excavator classifiers.
3.2.3 Mixture models
A mixture model with m components is expressed by an m-tuple, P = (P1, …, Pm), where Pi is
the i-th piece of the articulated object; in this case there are two pieces (m = 2), the root and
the adjacent part. Each piece has a possible location and a HOG descriptor. As presented in
Figure 3-12, this part-based detection model is a two-stage recognition process, with both
stages implemented using HOG detectors.
Figure 3-12: Flowchart of the part-based recognition process
The system first searches the image for two directions (left, right) of the root, which may
produce several candidate windows. Then it searches for the dipper in possible regions
adjacent to the roots. For instance, if the root detector locates a boom in “left” orientation, the
dipper must be on the left side of the boom in one of the three possible configurations: “left-
horizontal”, “left-inclined”, and “left-vertical” as depicted in Figure 3-13. The sizes of these
search regions are based on the dimensions of the detected root as presented in Table 3-7.
Various size ratios were examined to achieve the best detection rate while maintaining run-
time efficiency. Increasing the search area raises the computation time and the possibility of
false positives; on the other hand, smaller search regions may not enclose the dipper. The
selected size ratios are large enough to surround all regular dippers, except those of long-
boom excavators, while maintaining run-time efficiency.
Figure 3-13: Search regions for dipper - With permission (Rezazadeh Azar and McCabe 2012b)
Table 3-7: Dimension of the search areas based on the root dimensions

Search region | Width                             | Height
Horizontal    | = width of the root's bounding box | = 0.7 x height of the root's bounding box
Inclined      | = width of the root's bounding box | = 1.2 x height of the root's bounding box
Vertical      | = width of the root's bounding box | = 1.4 x height of the root's bounding box
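The search-region sizing in Table 3-7 can be sketched as follows (a minimal sketch; the function name is hypothetical):

```python
def dipper_search_regions(root_w, root_h):
    """Width and height of each dipper search region, derived from the
    detected root (boom) bounding box per Table 3-7."""
    return {
        "horizontal": (root_w, 0.7 * root_h),
        "inclined":   (root_w, 1.2 * root_h),
        "vertical":   (root_w, 1.4 * root_h),
    }

# e.g. a detected boom box of 200x100 pixels
regions = dipper_search_regions(200, 100)
```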
3.2.4 Static images
Two sets of experiments were carried out to assess the performance of the part-based
algorithm and to determine whether this modified method can improve the results compared
to standalone HOG detectors. In the first set of experiments, only the root classifiers (right
and left orientations) scanned the test images; in the second set, the part-based framework
was evaluated. The rules for accepting a true positive are the same as the ones
used for dump truck recognition, namely the detected box should overlap more than 50% of
the ground-truth bounding box and the result should correctly show the direction of the
excavator boom. These detection models were evaluated using 253 images of different sizes,
showing 284 excavators varying in make and pose. The photos were randomly picked from
the collected image pool; none of them had been used in the training stage. These images
were captured in congested construction sites such that other types of equipment appeared in
many images, which allowed evaluation of the capability of the algorithm to correctly
recognize the hydraulic excavator among other machines.
The detectors processed the test images with different thresholds; Table 3-8 provides the results. Tests 1 to 5 indicate a change of threshold in the root classifier. A pair-wise comparison of the two tests in Table 3-8 demonstrates that the part-based method notably reduces false positives, but it lowers the detection rate as well, so the pairwise comparison alone cannot rank the overall performance of the two approaches. Therefore, the ROC curves
were plotted to compare the results at any given detection rate or false alarm and to find
differences. As shown in Figure 3-14, the part-based and standalone HOG methods have almost the same detection rate in the range of 0.52 to 2.19 false positives per frame, but the part-based method significantly outperforms the general HOG method at rates below 0.52 false alarms per frame. In addition, the part-based framework can estimate the
orientation of the dipper, which is helpful data for activity recognition.
Table 3-8: Results of the general HOG and part-based methods - With permission (Rezazadeh Azar and
McCabe 2012b)
          General HOG (only roots)             Part-based method
          Detection rate   False positives     Detection rate   False positives
          (%)              per frame           (%)              per frame
Test 1    61.62            0.27                58.45            0.11
Test 2    72.18            0.52                66.20            0.22
Test 3    77.46            0.99                72.54            0.57
Test 4    81.34            1.81                77.82            1.21
Test 5    85.56            3.60                82.75            2.19
Figure 3-14: ROC curve of the results on the excavator test images
Figure 3-15 illustrates some hydraulic excavators detected using the part-based method.
Since the HOG recognition algorithm is invariant to changes in illumination and scale, the
part-based framework (which uses HOG descriptors) demonstrated good performance in
detection of various sizes of excavators with different colors and illumination conditions. For
instance, Figure 3-15c shows an image captured at sunset in very low light, while the other
images in Figure 3-15 are taken in average (Figure 3-15d) to very bright conditions (Figure
3-15a).
Figure 3-15: Samples of detected excavators - With permission (Rezazadeh Azar and McCabe 2012b)
Both of these methods fail to detect the excavator if the arm is not visible or is aligned with
the camera viewfinder. Many of the false positives took place with a wrong boom direction
in the correct locations. For example, the part-based method spotted the excavator in the
“right-inclined” pose in addition to “left-inclined” (see the left machine in Figure 3-15b).
Another noticeable problem in the part-based algorithm was that it sometimes recognized the dipper in two adjacent poses at the same time: in some examples, the secondary classifier detected the dipper in both horizontal and inclined, or inclined and vertical, configurations. This issue is mainly due to overlap between the training samples; the samples were manually divided into training categories, and some samples near the boundary margins were misclassified due to human error. The solution, as implemented for this research, is to choose the pose with the higher detection score. Altogether, wrong objects accounted for 77.4% of the false positives; in 19.3% of them the dipper had been detected as the root (wrong direction); 2.3% were caused by incorrect sizes of the bounding boxes; and about 1% of the false alarms located the boom correctly but failed to estimate the correct orientation of the dipper.
Implementation of the part-based algorithm on the graphics processing unit showed satisfactory run-time results; it takes less than one second to process a standard VGA 640x480 pixel frame on a 2.93 GHz dual-core CPU and a GeForce GT 440 GPU with compute capability 2.1. Table 3-9 presents computation times for different image sizes. The varied process times of the part-based method are due to the different numbers and sizes of secondary search regions generated by the root detectors.
Table 3-9: Part-based recognition runtimes for both directions using CPU and GPU

Image size    Runtime (sec) on CPU      Runtime (sec) on GPU
(pixels)      dual-core 2.93 GHz        NVIDIA GeForce GT 440
640x480       6-7                       0.26-0.94
1024x768      18-21                     1.2-3.1
1920x1080     49-53                     1.9-5.4
2592x1944     116-120                   6.9-10.8
3.2.5 Videos
The main aim of this module is to detect hydraulic excavators in construction videos for
further analysis. Hydraulic excavators are stationary equipment and only move to change
their working zones, so this recognition unit needs to detect them only once and then pass the
information to the tracking module. To evaluate the performance of the detectors on movies,
21 videos with a total duration of two hours and twelve minutes were recorded from three
construction projects. These videos had 640x480 pixel resolution.
The system subsamples a frame every ten seconds from the videos until it detects an object, regardless of whether the detection is a true or false positive. A ten-second interval is much larger than the maximum recognition time of one second for 640x480 frames (Table 3-9), so the system can
maintain the real-time stream of the video. The experiments were carried out using both of
the general HOG and part-based methods with different thresholds to compare their
performances. The evaluation criteria for this test are detection rate and the average time to
find the first object in the videos. Table 3-10 presents the results.
Table 3-10: Results of the general HOG and part-based algorithms in test videos - With permission
(Rezazadeh Azar and McCabe 2012b)
Method        Test ID   Detection rate   No. of false detections   First detection (sec)
General       HOG0      85.71%           2                         53.33
HOG           HOG1      76.19%           5                         38.10
              HOG2      71.43%           6                         29.05
              HOG3      71.43%           12                        12.86
              HOG4      61.90%           24                        11.90
Part-based    P-B0      90.48%           0                         90.00
method        P-B1      90.48%           1                         64.29
              P-B2      90.48%           2                         35.71
              P-B3      76.19%           8                         14.76
              P-B4      76.19%           10                        13.33
The test cases with lower ID numbers have higher thresholds, and the threshold decreases as the ID number rises. Higher thresholds result in a higher detection rate, even though it takes longer (more search frames) to spot the first object. As the threshold decreases, the framework detects objects faster at the cost of more false positives. The part-based method outperformed the standalone HOG method in the higher-threshold tests, although it took longer, and more search frames, to find the first object.
3.2.6 Spatiotemporal reasoning
Even the most advanced object recognition algorithms produce type one and type two errors (see Figure 3-2 for definitions). Since the excavators are stationary,
the detected machine is passed to the tracking engine without further need for recognition.
So, the false positives are costly and will mislead the entire action recognition process. As
shown in Table 3-10, detectors with higher thresholds may need to search several frames to
find the excavator, and constant correct detection is not guaranteed.
Spatiotemporal reasoning uses background knowledge to interpret situations, employing inexpensive constraint analysis such as spatial information about the objects and time (Renz and Nebel 2007). As a result, spatiotemporal reasoning has been used for object recognition and visual motion analysis in image sequences (Laptev et al. 2007; Laptev and Lindeberg 2006).
Since the excavators are stationary equipment with cyclic movement patterns, a
spatiotemporal reasoning algorithm was developed and added to the recognition framework
to enhance the detection rate in videos. First the HOG recognition thresholds are set low to
generate multiple bounding boxes including true positives and false alarms in several
consecutive frames. In this way the risk of false negatives is virtually eliminated, even
though several false positives are produced. The system scans 10 frames in the first minute of
the video (one every six seconds) for a hydraulic excavator, and then the detected windows
are grouped based on defined spatiotemporal constraints. Size, displacement, and directions
(left or right) of the detected boxes are the spatiotemporal constraints. Two movies of
operating excavators with a total duration of twenty-one minutes were studied to set these
constraints. One frame in every six seconds was processed and the true positives were
carefully examined to determine the constraints, which are presented in Table 3-11. A six-second interval gives a large enough time span to capture the different poses of a working excavator.
Table 3-11: Spatiotemporal constraints of the true positives - With permission (Rezazadeh Azar and
McCabe 2012b)
Constraint     Comparison criteria
Size           < (1.4 × first object area) and > (first object area / 1.4)
Displacement   < width of the first detected box
Direction      If the first object faces right, the next one cannot face left
               after a rightward displacement; if the first object faces left,
               the next one cannot face right after a leftward displacement
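The grouping test of Table 3-11 can be sketched as follows. The node representation and the Euclidean displacement measure are assumptions made for illustration; the thesis does not specify how displacement is measured.

```python
def same_path(first, node):
    """Decide whether a new detection belongs to the path started by `first`.
    Each node is (x, y, w, h, direction); thresholds follow Table 3-11."""
    fx, fy, fw, fh, fdir = first
    nx, ny, nw, nh, ndir = node
    # Size: the areas must agree within a factor of 1.4.
    area_ok = (fw * fh) / 1.4 < nw * nh < (fw * fh) * 1.4
    # Displacement: centres may not drift more than one box width (assumed
    # here to be the Euclidean distance between centres).
    dx = (nx + nw / 2) - (fx + fw / 2)
    dy = (ny + nh / 2) - (fy + fh / 2)
    disp_ok = (dx * dx + dy * dy) ** 0.5 < fw
    # Direction: a right-facing boom cannot flip to left-facing while the
    # box moves right, and vice versa.
    dir_ok = not ((fdir == "right" and ndir == "left" and dx > 0) or
                  (fdir == "left" and ndir == "right" and dx < 0))
    return area_ok and disp_ok and dir_ok
```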
Detected bounding boxes are called “nodes”, and a group of similar nodes is named “path”.
For every detected window, the algorithm searches the rest of the frames and groups the
nodes that belong to the same path. Even if a node is not detected in a frame, the algorithm
loops through all of the remaining frames for matches. Two nodes of a path cannot be in the
same frame. Figure 3-16 illustrates nodes captured in the first minute (images a through j)
and frames k to n in Figure 3-16 show the four identified paths.
Figure 3-16: Object recognition at time intervals (images a to j), and four distinguished paths (images k
to n) - With permission (Rezazadeh Azar and McCabe 2012b)
The path that follows the logical movements of an excavator will be selected. This process
also involves spatiotemporal reasoning to select the path closest to the movement pattern of
an excavator. Again, the same two videos were investigated to develop the reasoning. Two types of false positive paths were observed in the test videos. The first type had a single node resulting from a random misclassification (frame n in Figure 3-16). The second type resulted from repeated false detections within the same region in several frames (frames l and m in Figure 3-16).
On the other hand, the paths of true positives include a cluster of bounding boxes with small
size variations, some displacement, and logical changes in direction of the boom due to
rotation of an operating excavator (frame k in Figure 3-16). To identify the correct path, the
framework first sorts the paths based on the number of nodes, as the path representing an
operating excavator is always amongst those with the highest number of nodes. The other candidate paths with a comparable number of nodes usually contain recurring false detections, so the path with greater changes of direction and displacement is selected as the target object. Since
the identified path includes a group of boxes, the main issue is to determine a box which has
the best representation of the excavator. The action recognition module needs the base-point
and the width of the boom to interpret interactions. The system chooses the box with the
highest recognition score to determine the base point and the width of the boom.
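The path-selection step can be sketched as follows. The tie-breaking activity measure (counting direction changes) is a simplified stand-in for the full direction-and-displacement reasoning described above, and the node representation is an assumption for illustration.

```python
def select_excavator_path(paths):
    """paths: list of candidate paths; each path is a list of nodes, and each
    node is a dict with 'box', 'score', and 'direction' keys. Returns the
    representative bounding box of the selected path."""
    best_len = max(len(p) for p in paths)
    candidates = [p for p in paths if len(p) == best_len]

    def direction_changes(path):
        # A real excavator path shows boom rotation (direction flips);
        # recurring false detections tend to repeat the same pose.
        return sum(1 for a, b in zip(path, path[1:])
                   if a["direction"] != b["direction"])

    path = max(candidates, key=direction_changes)
    # The representative box is the node with the highest recognition score.
    return max(path, key=lambda n: n["score"])["box"]
```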
This recognition system processed the first minute of the 21 test videos with one excavator in
each and correctly recognized twenty machines (95.2% detection rate) with just one
misclassification. This algorithm was tested on scenes with one working excavator, and it
needs further modifications to detect multiple excavators in video streams. In addition, this method can only process movies with static backgrounds, because spatiotemporal reasoning about the detected objects is possible only in videos captured by stationary cameras. A changing camera view would result in detection of the excavator in different regions of consecutive frames, making spatiotemporal reasoning impractical.
This spatiotemporal algorithm has some advantages over existing recognition methods used in construction videos, including color-based detection (Zou and Kim 2007) and background subtraction with Bayes or neural network classifiers (Chi and Caldas 2011), the latter of which reported a detection rate (96%) very close to that of this research (95.2%). As stated before,
color-based recognition has difficulties in various lighting conditions and is sensitive to
occlusion, and the normal Bayes or neural network based detectors can classify only a
limited number of trained objects. HOG based detectors however, can find the target
regardless of other existing objects. In that research (Chi and Caldas 2011), only three types
of objects including a mini loader, backhoe, and workers appeared in test videos and the
system had the corresponding classifiers, but the test videos in the current research contained
many moving objects including rollers, bulldozers, pickups, workers, dump trucks, SUVs,
truck mixers, pile driving machines, and mobile concrete pumps.
3.3 Robustness of the Recognition Results
This section discusses the robustness of the developed object detection module based on
seven factors namely, occlusion, lighting, shadow, viewpoint, articulation, scale change, and
orientation change.
3.3.1 Occlusions
Two videos from ground and elevated viewpoints were selected to evaluate the effect of
occlusion on object detection rates. In both of these videos, one dump truck occludes another.
The system scanned one frame every two seconds and the results are presented in Figure
3-17 (ground level) and Figure 3-18 (elevated view). In the ground-level view, the foreground (white) truck completely masks the other, which eventually becomes visible; in the elevated view, the foreground truck only partially masks the other truck. The detectors could not spot the background truck in frames a, b, and c of the ground-level view, and even missed the foreground truck in frame b as its appearance blends with the background machine (Figure 3-17). The system was able to recognize both machines in frame d and the successive frames.
Figure 3-17: Occlusion at ground-level view
The system was able to detect the partially occluded trucks captured from an elevated view,
even though it missed the foreground machine in frame b (Figure 3-18).
Figure 3-18: Partially masked truck from elevated view
3.3.2 Lighting
Since the HOG features are computed from the gradients of object edges, this method is largely invariant to lighting conditions. This was also investigated, as the test images for both excavators and dump trucks were taken under various conditions. The outcomes did not show any significant effect of lighting on the detection results. For instance, Figure 3-15 illustrates successfully detected excavators under a wide range of illumination.
3.3.3 Shadow
Although complete shadows do not affect detection, partial shadows may affect the detection process as they change the edges, and therefore the HOG features, of the object. However, it is difficult to quantitatively describe shadow effects on recognition performance: a numeric description would require taking numerous samples under various shadows and then estimating the shadows' contrast, overlapping areas, and orientation. This issue was not investigated in this research.
3.3.4 Viewpoint
The viewpoint of the camera is one of the main success factors in the recognition process. Although eight classifiers were trained to detect the machine, there are still views in which all of the detectors fail to recognize the target machine. These missed samples were mostly captured from extreme overhead angles (see Figure 3-19a) or when the machines moved on very steep roads (Figure 3-19b). Such views were not categorized in any of the eight orientations, as only a few of these viewpoints were available in the training stage.
Figure 3-19: Difficult viewpoints
3.3.5 Articulation
Dump trucks are rigid entities whose only articulation is opening and closing the bed. The dump truck training samples did not include machines with open beds, so the classifiers were not trained to detect trucks while dumping the load; detecting dumping operations is not within the scope of this research. However, there were some test images which showed trucks dumping the load, and the detectors were able to handle slight articulations as shown in Figure 3-4 (bottom right frame).
But articulation was a major issue in the detection of excavators, as they can take numerous poses. As long as the excavators had poses close to the training samples, the detector was able to recognize them with high accuracy; however, extreme poses were challenging to detect.
3.3.6 Scale change
Changes of the object size in the search frame did not affect the object recognition process.
Since the HOG recognition algorithm uses a sliding window approach with various scales,
the system is invariant to changes of target size. In addition, the shrinkage coefficient was set
to 1.05, which is a fairly low number to avoid missing a target. Figure 3-20 depicts a dump
truck moving toward the camera with dramatic scale change in four consecutive frames of
the video, and the detectors successfully detected all of the targets.
Figure 3-20: Changes in size of the dump truck as it approaches the camera
3.3.7 Orientation change
Similar to changes of size, orientation change was not a major issue in detection of machines
in a series of frames. For example, Figure 3-7 shows a series of frames from a video and
detectors successfully recognized the transition from side-left to rear-left in frame a and b.
3.4 Summary
Two object recognition algorithms, namely Haar-like features and HOG, were tested to recognize dump trucks from eight orientations. The HOG algorithm significantly outperformed the Haar-like features method, but its run-times were too long for real-time applications; parallel implementation of the HOG algorithm on a GPU solved this issue. Experiments on the test videos showed an acceptable hit rate with few false positives per frame. The highest hit rate in videos was 95.16% with 0.15 false alarms per frame.
Articulated poses of excavators, however, complicated the recognition process. Thus, a part-
based framework was developed to detect the boom and dipper of excavators in various
configurations. Since excavators are stationary equipment, detection of them in consecutive
frames can provide additional spatiotemporal data to enhance recognition performance.
Therefore, a spatiotemporal reasoning algorithm was developed to improve detection performance in construction videos.
The object recognition module identifies dump trucks and excavators in videos. The output of this module includes 2D boxes representing each machine; in addition, it identifies the orientation of dump trucks and the direction of the excavator's arm. Table 3-12 briefly describes the performance of the object recognition module under seven main affecting factors. The subsequent requirement of the SCIT system is to track the machine of interest, which is detected by the recognition module and identified as an active machine in the loading operation by the activity recognition module. The next chapter describes the object tracking engine, in which a novel tracking algorithm is introduced.
Table 3-12: Summary of the robustness assessment of the recognition process under main affecting factors

Factor               Strengths                                   Weaknesses
Occlusion            Able to detect objects with moderate        Fails under major occlusion, especially
                     occlusion (partially masked objects)        objects masked in ground-level views
Lighting             Invariant to regular day lighting           May have difficulty in very low
                     conditions                                  illumination and on foggy days
Shadow               Can detect objects with/under shadow        Some shadows may drastically alter HOG
                                                                 features and mislead classifiers
Viewpoint            Can identify objects from the viewpoints    May fail in some viewpoints, such as
                     included in the classifiers' training       extreme overhead angles or equipment
                     datasets                                    on a very steep road
Articulation         Not an issue for dump trucks                May have problems with extreme
                                                                 articulated poses of excavators
Scale change         Can detect targets as long as they are      -
                     mostly visible in the frame
Orientation change   Can detect usual orientation changes of     May fail in exceptional orientations
                     dump trucks and excavators' booms           not included in the training samples
CHAPTER 4 - OBJECT TRACKING MODULE
This chapter explains the steps taken to develop the tracking module for the SCIT system. Two tracking algorithms, mean-shift and a novel hybrid method, were used as a basis. The following sections describe these two algorithms.
4.1 Mean-shift Algorithm
Object tracking in videos is a useful tool to locate and monitor the activities of the equipment
and human resources on site. Recent studies (see section 2.5) showed that the mean-shift
method performs reliably in tracking resources in construction videos (Park et al. 2011; Gong
and Caldas 2011). Mean-shift is an iterative algorithm that starts with a preliminary point and then re-estimates the mean of the dataset until it converges. Probabilistic models such as the Kalman filter can predict the initial point for the iteration and therefore enhance the
performance of the mean-shift tracking (Gong and Caldas 2011). A modified version of the
mean-shift algorithm, called continuously adaptive mean-shift or Camshift (Bradski 1998)
was selected as the tracking engine.
The mean-shift algorithm can theoretically track different features of the target object, such as the color histogram and edges. The challenge, however, is to provide an intensity dataset in which the tracker can search for the local peak. This intensity dataset is usually represented as an 8-bit greyscale frame, varying from black at the weakest weight to white at the strongest. The hue, saturation, and value (HSV) color histogram is the most common tracking feature and was employed for this research as well. The algorithm first calculates the HSV color histogram of the target object, then segments the pixels whose HSV values fall within the histogram's range, providing the greyscale intensity dataset for the tracker in every search frame (see Figure 4-1).
In addition to the color histogram, the HOG response was employed as a second alternative. The HOG algorithm is invariant to color and illumination, so it can handle some color-related issues; for example, trucks covered with mud have a color histogram very close to that of the background soil and may mislead the tracker. The HOG object detector provides a dense greyscale map of the detection response for the mean-shift tracker: the maximum response is colored with the highest intensity and the rest of the responses are normalized based on their detection scores. For example, the right frame in Figure 4-2 shows the dense greyscale response of the left frame. The system scans one frame every second and the Camshift algorithm tracks the target using this response. This short interval is possible due to the high computational capability of the GPU, which takes less than 0.13 seconds to process HOG detection for one orientation.
Searching for local maxima can cause issues in certain situations. For instance, the tracker may expand or shift to a nearby object with features similar to those the algorithm is using. The left frame in Figure 4-1 shows the selected target for tracking and the right frame illustrates the segmented dataset with a similar color histogram; this close proximity caused the tracker (red ellipse) to expand to both machines. This issue is not limited to the application of the color histogram, as it is also visible in Figure 4-2, which uses the HOG response. This concern, along with other observed issues, is described further in the experimental results (Section 6.1).
Figure 4-1: left: selection of target truck in the original frame, right: isolation of pixels with similar color
histograms
Figure 4-2: left: original image, right: isolation of pixels with HOG response for side-right facing trucks
4.2 Hybrid Tracking
Because the mean-shift method has difficulty performing well when applied to construction
videos, a novel tracking framework was developed to robustly track dump trucks. This
hybrid algorithm was inspired by a recognition-based tracking framework developed to
interpret human actions (Barbu et al. 2012). They used the Latent SVM object recognition
method (Felzenszwalb et al. 2010) with lowered thresholds to create tracking candidates, and
employed the KLT feature tracker (Tomasi and Kanade 1991) to project each detected box
five frames forward to compensate for false negatives of the Latent SVM detector. The
second step uses a dynamic-programming algorithm (Viterbi 1971) to select a temporally
coherent set of detections for tracking.
Since the dump truck profiles do not change drastically between time steps, the HOG
algorithm was selected to help track a dump truck in the hybrid tracking technique developed
for this research. After recognition of a truck, the algorithm continues to find trucks, but in
shorter time intervals and in an optimized manner. In contrast with the recognition module
described in section 3.1.5, the hybrid tracker searches for only three orientations every two
seconds, i.e., the initial viewpoint of the target truck and the two adjacent orientations, which
the same GPU and CPU can process in 0.39 seconds. For instance, if the target truck was
spotted in a side-right viewpoint, the system only searches for front-right, side-right, and
rear-right orientations. This way, the hybrid tracker catches the changes in the trajectory of
the machine, but it is not required to check all eight orientations using a priori knowledge. In
addition, the detection thresholds are lowered to avoid false negatives, although the rate of
false positives increases as well. Each detected bounding box is a potential target.
Pure recognition-based tracking has two main issues. First, even decreasing the thresholds
cannot promise constant detection of the machine, so the target can be lost. Second, there
were some cases in the test videos where a second dump truck entered the frame and stopped
in the loading zone with a similar orientation as the truck being loaded. This often misled the
recognition-based tracker. In addition, a nearby false alarm can also cause the same error.
A feature tracking method was added to the framework to solve both of these problems. The
center point of the target machine is tracked by the Kanade-Lucas-Tomasi (KLT) feature
tracker (Tomasi and Kanade 1991) to project that bounding box to the next scanning frame.
Thus, it artificially generates bounding boxes in subsequent frames to significantly reduce the
risk of losing the machine, and to keep track of the actual target. Therefore, there will be a
projected window in addition to the true positive and false alarm boxes generated by the
recognition engine in every new frame. The KLT feature tracking algorithm is a differential
method to estimate the optical flow, which is based on three assumptions: 1) brightness
constancy, 2) temporal persistence, and 3) spatial coherence.
The result of the recognition and projection is a set of boxes with at least one member; a simple disjoint-set data structure is then employed to partition out the detection that is temporally coherent with the projected window, yielding a new bounding box and eliminating the other detections. This fusion algorithm groups two rectangles in the same subset if their bounding regions overlap: to be grouped, all of the distances between the x and y elements of the matching corners must be lower than the minimum average of the width and height of the two boxes times a threshold (Viola and Jones 2001). The corners of the final rectangle are the average of the corners of the projected box and the overlapping detection. If none of the detections is temporally coherent with the projected box, the projected box is taken as the final rectangle.
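A sketch of this fusion step follows. The grouping threshold value (`eps`) and the choice to fuse with the first coherent detection are assumptions made for illustration.

```python
def similar(a, b, eps=0.5):
    """Viola-Jones style grouping test: corresponding corner coordinates of
    boxes a = (x, y, w, h) and b must differ by less than eps times the
    smaller box's mean side length."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    delta = eps * min((aw + ah) / 2, (bw + bh) / 2)
    return (abs(ax - bx) < delta and abs(ay - by) < delta and
            abs(ax + aw - bx - bw) < delta and abs(ay + ah - by - bh) < delta)

def fuse(projected, detections, eps=0.5):
    """Average the projected window with a temporally coherent detection;
    keep the projection when nothing coherent was detected."""
    for det in detections:
        if similar(projected, det, eps):
            # Final corners are the average of the two coherent boxes.
            return tuple((p + d) / 2 for p, d in zip(projected, det))
    return projected
```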
The flowchart and the visual sequence of the entire hybrid tracking process are shown in Figure 4-3 and Figure 4-4, respectively. After formation of the new box, its center becomes a feature for the KLT tracker. The KLT feature tracking method is sensitive to objects passing in front of the tracked features, such as the bucket of an excavator or construction workers; even the shadow of the bucket can distract the tracking process. However, the hybrid character of this novel tracker means that the continuous HOG object recognition at short time intervals prevents the target equipment from being lost, improving the performance of the hybrid tracking algorithm.
Figure 4-3: Flowchart of the hybrid tracking process
Figure 4-4: a: detected truck at frame x1; b: HOG recognition result with lowered thresholds for three
viewpoints in frame x2; c: projected box of previous frame (frame x1) to frame x2 using KLT feature
tracker; d: fusion of the rectangles in b and c
The purpose of the tracking module is to track dump trucks in the limited region of the loading zone, which is discussed in the next chapter. Dump trucks usually move only slightly in the loading zone to achieve a better loading position; in addition, neither their orientation nor their scale changes dramatically. Therefore, the hybrid tracker does not have to handle extreme orientation or scale changes; the mixture character of the algorithm, however, allows it to process moderate situations. The recognition aspect of the hybrid algorithm has a dynamic
character. Once the target is identified, the HOG recognition searches for three orientations.
After the recognition process, the algorithm fuses the projected rectangle, which has the
initial size of the machine, with a detected box that overlaps. This fusion process helps the
system adjust the size of the targets. In addition, this algorithm continually changes the three
search viewpoints in every recognition attempt to account for a turning vehicle. For instance,
the initial orientation of the target was “side-right” in Figure 4-5a, so the framework searched
for rear-right, side-right, and front-right viewpoint in the next scan process (Figure 4-5b).
The system found the target with the same orientation in frame b, and then identified it with a front-right trajectory in Figure 4-5c. Therefore, the search orientations were changed to front, front-right, and side-right for the next scan (Figure 4-5d), and the machine was recognized with front-right orientation in frame d. The two-second time interval is appropriate for the purpose of this research, but it should be decreased to track fast objects or to capture extreme scale and trajectory changes.
Figure 4-5: Tracking of the orientation changes
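The dynamic selection of the three search viewpoints can be sketched as a lookup on the eight-way orientation wheel; the ordering of the orientation list is an assumption consistent with the examples above.

```python
# Eight viewpoints in circular order around the machine (assumed ordering).
ORIENTATIONS = ["front", "front-right", "side-right", "rear-right",
                "rear", "rear-left", "side-left", "front-left"]

def search_orientations(current):
    """The tracker scans only the current viewpoint and its two
    neighbours on the orientation wheel."""
    i = ORIENTATIONS.index(current)
    return [ORIENTATIONS[(i - 1) % 8], current, ORIENTATIONS[(i + 1) % 8]]
```

For a truck last seen side-right this yields front-right, side-right, and rear-right, matching the example in the text; once the truck is re-identified as front-right, the search set shifts to front, front-right, and side-right.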
4.2.1 Possibilities to Optimize Hybrid Algorithm
This version of the hybrid algorithm is computationally intensive and is viable only because of a GPU's parallel computation capability. The HOG object recognition searches for the target regardless of its size and location, so limiting the scale and the search region can reduce computations. The HOG algorithm searches for the target in the original frame, then scales the image down by the shrinkage coefficient and scans the resized image; this process continues until the frame reaches the size of the classifier window. In the case of a 1.05 shrinkage coefficient, for example, a 128x80 classifier window scans 34 frame sizes in an original image of 640x480 pixels. It is possible to optimize this recognition process for
tracking purposes. Instead of searching for all possible scales, the recognition aspect can be
set to search for the target in a range of scales (e.g. ± 20% of the prior scale). This
optimization may result in missing the targets with greater scale changes; however, it reduces
runtimes and therefore shorter time intervals can be exercised to compensate.
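The scale-pyramid arithmetic above can be reproduced with a short sketch. Assuming the detector rounds each scaled frame dimension to whole pixels (an assumption about the implementation, not stated in the text), counting the scan levels for a 1.05 shrinkage coefficient gives the 34 frame sizes quoted above:

```python
def num_pyramid_levels(frame_w, frame_h, win_w, win_h, shrink=1.05):
    """Count how many scaled frame sizes a sliding-window detector scans
    before the frame shrinks below the classifier window."""
    levels = 0
    while (round(frame_w / shrink ** levels) >= win_w and
           round(frame_h / shrink ** levels) >= win_h):
        levels += 1
    return levels

print(num_pyramid_levels(640, 480, 128, 80))  # 34
```

Restricting the search to ± 20% of the prior scale amounts to iterating only a narrow band of these levels instead of all 34.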
Another optimization opportunity is to limit the search region. Since the tracking engine
should track a dump truck with low speed and only minor scale and orientation changes, a
limited region was set for each search trial. In this setting, the HOG recognition engine
searches for the three orientations in a region of interest (ROI) that is determined using the
prior size and location of the target. This ROI has the same center as the bounding box of the
target in the previous search frame; the width and height of the ROI, however, are two and a
half times those of the target's bounding box (Figure 4-6). The ROI is dynamically defined
for each trial, allowing the system to capture size changes and movement of the tracked machine.
Figure 4-6: Left: red box encloses target truck, right: ROI to search for the target truck in the next frame
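The ROI construction described above can be sketched as follows; the (x, y, width, height) box format and the clipping to the frame borders are my assumptions:

```python
def search_roi(box, frame_w, frame_h, scale=2.5):
    """Build a region of interest centred on the target's previous
    bounding box, with width and height `scale` times the box's,
    clipped to the frame borders."""
    x, y, w, h = box                    # top-left x, top-left y, width, height
    cx, cy = x + w / 2, y + h / 2       # centre of the previous bounding box
    rw, rh = w * scale, h * scale
    rx = max(0, int(cx - rw / 2))
    ry = max(0, int(cy - rh / 2))
    rx2 = min(frame_w, int(cx + rw / 2))
    ry2 = min(frame_h, int(cy + rh / 2))
    return rx, ry, rx2 - rx, ry2 - ry

print(search_roi((300, 200, 100, 80), 640, 480))  # (225, 140, 250, 200)
```

Because the ROI is recomputed from the latest bounding box at every trial, it grows and shrinks with the target, which is what lets the system follow size changes and movement.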
Runtimes are significantly reduced compared to the 0.39 seconds needed to search the entire
frame. The process times depend on the size of the target's bounding box: larger boxes create
bigger ROIs and therefore longer process times. The runtimes were between 0.09 and 0.19
seconds in the test videos. These short process times allowed the interval to be reduced from
two seconds to one second. Shorter recognition intervals improve the algorithm's performance in
tracking fast objects and capturing dramatic scale and orientation changes. In addition, they
correct KLT's errors and prevent the tracking process from being misled. For example, a
passing worker distracts the KLT feature tracker in frames a and b in Figure 4-7 (the red dot
represents the feature tracker), but the hybrid algorithm corrects it in the subsequent frames
(frames c and d). This optimized implementation of the hybrid tracking algorithm was used in
this research.
Figure 4-7: Correction of the KLT method's distractions
4.3 Summary
This chapter described the tracking module of the SCIT system, which was developed using
two tracking algorithms. The mean-shift algorithm, one of the employed methods,
demonstrated promising results in earlier research. This method, however, may fail to track
dump trucks in some real-world conditions, such as when trucks with similar features are in
close proximity. Therefore, a novel tracking algorithm has been developed that combines
HOG recognition and a feature tracking algorithm to track dump trucks under the challenging
visual conditions found on jobsites.
The recognition and tracking modules provide useful spatiotemporal data of the earth
material loading equipment in construction video. The remaining major challenge is to
interpret these data to recognize the start and the finish time of loading cycles. The next
chapter introduces an action recognition algorithm that analyzes these spatiotemporal data to
recognize and estimate loading cycles.
CHAPTER 5 - THE ACTION RECOGNITION MODULE AND SYSTEM
ARCHITECTURE
Activity recognition is a popular and evolving research area in the computer vision field.
Several research efforts, mostly focusing on human action recognition, tackle this important
subject (Aggarwal and Ryoo 2011). Logical reasoning and machine learning algorithms are
two main approaches employed to address task recognition problems. Decision variables
must pass a set of consistent logical constraints to infer the action in logic-based methods.
Probability theory and statistical learning models, such as Bayesian belief networks, Hidden
Markov Models, and support vector machines, have been used to interpret the events in
machine learning approaches.
In logic-based algorithms, actions are considered to be objectives and background knowledge
is stated in a set of first-order constraints called an event hierarchy. This hierarchy is encoded
in first-order logic to examine spatiotemporal data, such as location, direction, or size, to
recognize actions. Logical reasoning has been used in construction research to estimate
productivity in cases where construction equipment or tools entering a predetermined work
zone is considered to be a working state. In some examples, a GPS system was employed to
track earthmoving plants, where the movements of the machines in work envelopes were
used to interpret grading and levelling operations (Navon et al. 2004), and a concrete hopper
entering a determined zone triggered a concrete pouring cycle (Gong and Caldas 2010).
Nonetheless, logical reasoning approaches have two main shortcomings. First, logical
constraints are strict and cannot incorporate uncertainty. For instance, a logic-based
framework cannot choose between two or more plans for which an agent qualifies. Second,
logical reasoning does not have a learning capability; the system cannot learn previously
unknown situations to tackle similar future scenarios.
In contrast, probabilistic and machine learning methods can learn behaviours using training
samples obtained from sensors or databases and are able to account for some level of
uncertainty. For example, probabilistic methods, such as Bayesian belief networks, can
estimate the probability of the potential plans, while machine learning algorithms, e.g.
support vector machines, not only decide whether a case belongs to an object class, but also
assign a score to passing instances. Thus, these methods can predict the most probable class.
The action recognition module of the SCIT system consists of a logical reasoning step and a
machine learning algorithm. This framework first checks whether the loading equipment,
namely an excavator and a dump truck, is positioned for loading, and then uses a machine
learning algorithm to examine the distances and sizes of the machines. The following
sections detail the components of the action recognition module.
5.1 Baseline Task
Earth material loading by an excavator is an interactive and one-way activity, which
significantly reduces the complexity compared to the human domain. Unlike the flexible and
unlimited number of poses of the human body, excavators typically operate in unidirectional
and predefined patterns. Due to the straightforward nature of this activity, a consistent set of
constraints was used to filter candidates, and another set of spatiotemporal calculi was used
to train an action recognition classifier. These spatiotemporal calculi provide simple
information about time and space, such as topology, direction, or distances between entities
(Renz and Nebel 2007).
5.2 Spatiotemporal Information
The recognition and tracking modules provide the location, size, and orientation/direction of
both dump trucks and the excavator. These spatiotemporal data were used to decide if the
server is loading one of the customers. Every detected dump truck and excavator is labelled
by the system according to its 2D coordinates, its orientation, and the dimensions of the
bounding box. To confirm a loading action, a dump truck should be in the appropriate
orientation within range of the excavator’s boom; hence the distance and the configuration of
the equipment are two key factors in recognition of a loading action.
5.3 Activity Recognition Module (ARM)
As noted before, this module is a mixture of two action recognition methods. They are
described more fully in the following subsections, but can be summarized as:
1. Logical reasoning process to identify possible loading activities using set rules
2. Machine learning process to confirm the loading activity
5.3.1 ARM Stage 1: Logical loading configuration
The first part of the ARM is a logical reasoning process to quickly examine the equipment
orientations for possible loading activity. These constraints are set based on a priori
knowledge of this activity. Dump trucks always need to draw their open-box bed near the
excavator for loading. For example, a left-facing boom and a side-right dump truck located
on the left of the excavator are not likely positioned for loading, although there are always
exceptions to the rule.
Although logic-based methods have two main shortcomings, neither is a problem for this
aspect of the ARM. First, the intention of this aspect is to filter possible candidates, not to
identify the final loading truck. So there is no uncertainty about the main target. Second, the
determined constraints are the most probable configurations, although they may not be
followed under special site conditions. Given the usual training dataset, even the machine
learning classifier will have difficulty recognizing those rare situations. Table 5-1 shows the
possible loading configurations based on the location of the excavator and orientation of
dump trucks.
Table 5-1: Possible loading configurations
Excavator
Dump truck
Located on the left of the
dump truck
Located on the right of
the dump truck
Front
Front-left
Front-right
Side-left
Side-right
Rear
Rear-left
Rear-right
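Stage 1 of the ARM reduces to a rule lookup keyed on the excavator's position relative to the truck. The orientation sets below are placeholders for illustration only; the admissible pairs actually used by the system are those of Table 5-1:

```python
# Hypothetical rule table keyed by the excavator's side relative to the
# truck; the real admissible orientations are listed in Table 5-1.
LOADABLE = {
    "left":  {"front-right", "side-right", "rear-right"},
    "right": {"front-left", "side-left", "rear-left"},
}

def passes_stage1(excavator_side, truck_orientation):
    """Return True if the pair is a plausible loading configuration."""
    return truck_orientation in LOADABLE[excavator_side]
```

Candidates rejected here never reach the SVM stage, which keeps the per-frame cost of action recognition low.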
5.3.2 ARM Stage 2: Machine learning action recognition
If a dump truck passes the first stage of recognition, it will be sent on to the second stage to
examine the distance and size ratio of the server and customer. The corner of the boom’s
bounding rectangle closest to the hinged support of the excavator arm is set as the base point.
For instance, if the boom is facing left, the base point would be the bottom right corner of the
bounding box. The system measures the distances between the base point of the excavator
and the four corners of the dump truck, and then divides them by the width of the excavator
bounding box to include the size factor. Figure 5-1 shows these distances. The distance
between the base point of the excavator and the top corner closest to the base point is
distance 1. The distance to the other top corner is distance 2. The distance to the closer
bottom corner is distance 3, and to the last corner, it is distance 4. These numbers create a
vector with four elements. A supervised learning approach was utilized to train linear
Support Vector Machines (SVM) (Cortes and Vapnik 1995) as the second step of the action
recognition process.
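The four-element feature vector described above can be computed as follows; the (x, y, width, height) box format is an assumption of this sketch:

```python
import math

def feature_vector(base, truck_box, excavator_width):
    """Distances from the excavator's base point to the four corners of
    the truck's bounding box, ordered closer-top, farther-top,
    closer-bottom, farther-bottom, and normalised by the width of the
    excavator's bounding box to include the size factor."""
    x, y, w, h = truck_box
    top = sorted([(x, y), (x + w, y)], key=lambda c: math.dist(base, c))
    bottom = sorted([(x, y + h), (x + w, y + h)],
                    key=lambda c: math.dist(base, c))
    corners = top + bottom              # distances 1, 2, 3, 4
    return [math.dist(base, c) / excavator_width for c in corners]
```

Dividing by the excavator's width makes the vector roughly scale-invariant, so the same classifier can serve cameras at different distances from the loading zone.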
Support vector machines are extensively used for pattern recognition, such as text
classification, object detection, and path recognition. This machine learning method
constructs a separating hyperplane or a set of hyperplanes that has the largest margin (gap)
between the positive and negative classes in either the input feature space or a kernelised
version of this. A large number of object class and non-object class samples are required to
train an efficient classifier.
Seven videos taken from different viewpoints, with a total duration of fifty-one minutes, were
selected for the training stage. The object recognition module scanned a frame every three
seconds for the machines, with lowered recognition thresholds to avoid false negatives and to
grouped into being loaded and not being loaded. Altogether, 1342 training vectors including
514 positive and 828 negative samples were produced to train the classifier. The publicly
available SVM-light software (Joachims 1999) was used to train the action recognition
classifier. This training software produces a weight vector of the same size as a training
sample. In addition, it also calculates a threshold, which is then used for classification. This
threshold can be adjusted slightly during the experiments to obtain the best results and to
perform sensitivity analysis. SVM-light calculated a threshold of -0.063 in this instance.
Figure 5-1: Distances between the corners of trucks and the base point in both left and right
configurations
For the classification stage, the SCIT system computes the distances between the base point
of the excavator and the four corners of any detected dump truck, including false positives,
then divides them by the excavator’s width, which produces a vector with four elements. The
resulting vectors are classified using the trained SVM classifier. This classification process is
the dot product of the classifier and test vector, and scores greater than the threshold are
accepted. If more than one dump truck is close enough to an excavator and therefore passes
the classification stage, the system will identify the one with the highest classification score
as the loading truck.
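The classification step then reduces to a dot product against the learned weight vector; the weights and threshold below are toy values for illustration, not the trained SVM-light model:

```python
def classify_loading(weights, threshold, candidates):
    """Score each candidate's four-element distance vector with a linear
    SVM (dot product) and return the index of the highest-scoring truck
    whose score exceeds the threshold, or None if no candidate passes."""
    best_idx, best_score = None, threshold
    for i, vec in enumerate(candidates):
        score = sum(w * x for w, x in zip(weights, vec))
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy example: negative weights favour trucks closer to the excavator.
print(classify_loading([-1, -0.5, -0.5, -0.25], -4,
                       [[1, 2, 2, 2], [3, 4, 4, 4]]))  # 0
```

Tracking the best score while iterating implements both rules at once: the threshold test, and the tie-break that picks the highest-scoring truck when several pass.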
5.4 Cycle Conclusion
After recognition of the truck being loaded, the system will stop searching for dump trucks,
record the start time, define the loading zone, and pass the loading truck to the tracking
engine. The dimensions of the loading zone depend on the size of the loading truck's
bounding box: the loading zone's length and height are 1.25 and 1.5 times the truck's length
and height, respectively (dark blue rectangle in Figure 5-2.d). The loading region is defined
fairly large to handle minor movements of dump trucks during loading for better positioning,
and to accommodate the small spatial variations produced by the tracking methods, thereby
reducing the risk of premature termination of the tracking of the loading dump truck. The tracking
module tracks the loading truck until the center of the tracking bounding box/ellipse exits the
loading zone. The system records that moment as the finish time and ends tracking of the
loaded truck.
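The loading-zone geometry and the exit test can be sketched as below (the implementation also shifts the zone's centre toward the excavator, as described in Section 5.5, which is omitted here):

```python
def loading_zone(truck_box, len_factor=1.25, ht_factor=1.5):
    """Zone centred on the loading truck's box, 1.25x its length and
    1.5x its height (Section 5.4)."""
    x, y, w, h = truck_box
    zw, zh = w * len_factor, h * ht_factor
    return (x + w / 2 - zw / 2, y + h / 2 - zh / 2, zw, zh)

def truck_left_zone(tracker_center, zone):
    """True once the tracker's centre point exits the zone, which marks
    the finish time of the loading cycle."""
    zx, zy, zw, zh = zone
    cx, cy = tracker_center
    return not (zx <= cx <= zx + zw and zy <= cy <= zy + zh)

zone = loading_zone((100, 100, 80, 40))
print(truck_left_zone((140, 120), zone))  # False: still loading
print(truck_left_zone((200, 120), zone))  # True: cycle finished
```

Testing only the centre point, rather than the whole box, is what tolerates the small jitter of the tracker while the truck is stationary.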
5.5 System Architecture
The system was implemented using the OpenCV 2.3.1 library (OpenCV 2011) in the Visual
C++ Express 2010 environment. OpenCV is an open-source library that mostly contains
video and image processing functions. The library is cross-platform, with C++, C, and
Python interfaces running on Windows, Linux, Android, and Mac operating systems. The library
includes various computer vision algorithms from basic level functions, such as loading and
saving images, to advanced algorithms, such as object recognition, tracking, and image
segmentation. Many of the functions used to develop the SCIT modules are already available
in the employed version of OpenCV (OpenCV 2011). The HOG object recognition, mean-
shift tracking, and KLT feature tracking are the off-the-shelf algorithms used for this
research. As stated in chapters three and four, however, these methods were modified and
integrated to develop the object recognition and tracking modules.
The object recognition module first searches for an excavator. Once detected (Figure 5-2.a),
it passes the detected bounding box to the mean-shift tracking module (Figure 5-2.b). The
current version of the SCIT’s object recognition module stops searching at one excavator, but
it is possible to extend the system to process videos with multiple servers. In addition to
tracking the excavator, the system begins to scan for dump trucks at predetermined time
intervals (see Figure 5-2.c). While scanning a 640x480 pixel frame for all eight orientations
takes about 1.07 seconds, any time interval greater than 1.1 seconds can maintain the real-
time stream of the video. For this research, the recommended four-second interval was used
(Rojas 2008). Then the action recognition module analyzes all detected dump
trucks in each detection interval to check whether any of them meet the logical configuration
constraints and pass the action recognition classifier. If the system confirms the loading
action (see Figure 5-2.d), it will discontinue searching for dump trucks, record the start time,
define the loading zone, and send the loading truck to the tracking engine (see Figure 5-2.e,
this frame shows the mean-shift tracker).
Figure 5-2: a: Detection of the excavator; b: tracking the excavator; c: detection of a truck that does not
meet loading criteria; d: detection of the loading truck; e: tracking of both pieces of equipment; f: truck
leaves the zone and tracking of the truck terminates
The tracking engine continues tracking the loading truck until the center of the tracker exits
the loading zone. The system records that moment as the finish time, terminates tracking of
the loaded truck, removes the loading zone (see Figure 5-2.f), and starts to search for new
dump trucks. As depicted in Figure 5-2.d, the center of the loading zone is shifted toward the
hydraulic excavator to handle the slight truck movements, which are mostly backward. The
shifted loading zone also causes earlier tracking termination and therefore improves the
accuracy of the SCIT finish times relative to the actual values. Figure 5-3 shows the entire
flowchart of this framework.
Figure 5-3: Flowchart of the entire SCIT system
There are possible situations where the excavator fully loads a dump truck, but it takes some
time for the truck to leave the zone. For example, the excavator has finished loading, but the
dump truck is still waiting for other vehicles to pass by. One could argue that no productive
work (loading) is done in the meantime, but as long as the truck stays in the zone, the next
loading cycle cannot begin. These abnormal cycles are part of actual working shifts and
should be included in cycle times.
5.6 Summary
Logical positioning and close proximity of a server and a customer are the two
spatiotemporal cues used to recognize a loading cycle. A logical reasoning framework checks
whether detected dump trucks are positioned for loading, and then a machine learning
classifier examines the relative distances of the passing trucks to the serving excavator. If
more than one dump truck passes the action recognition module, the one with the higher
score is accepted. In addition to action recognition, this module helps the system ignore the
dump trucks and false alarms that do not satisfy the requirements, and allows the truck
detection thresholds to be reduced to minimize false negatives.
The object recognition, tracking, and action recognition modules have been integrated to
develop the SCIT framework. The system first recognizes the server and then searches for
customers at predetermined time intervals. The action recognition module examines detected
dump trucks and, upon identification of a loading truck, the system defines a loading zone
and tracks the identified truck. Departure from the loading zone concludes the cycle, and the
SCIT then resumes searching for new dump trucks. The next chapter describes testing on
several construction videos to evaluate the performance of the SCIT system under the actual
conditions found on jobsites.
CHAPTER 6 - SCIT VALIDATION RESULTS
This chapter describes the process used to evaluate the performance of the SCIT system.
Several videos of excavation activities were captured at two condominium complexes in
downtown Toronto, Ontario. However, only the videos containing equipment with similar
productivity rates were selected to create a homogenous productivity dataset. Eighteen videos
with a total duration of 2 hours and 27 minutes were chosen, in which two types of hydraulic
excavators (Caterpillar 245B and Caterpillar 345D) and several makes of urban dump trucks
with similar hauling capacities, such as Mack, Sterling, Volvo, and Kenworth, appeared.
Since this system aims to estimate cycle durations under the actual conditions found on
jobsites, the videos were recorded during eight site visits in three seasons (winter, spring, and
summer), at different times of the day, and with different levels of cloudiness. This
allowed a variety of lighting conditions to be recorded. Moreover, these videos were taken
from both ground level and elevated viewpoints using two different makes of digital cameras
to diversify visual conditions. None of these videos were used to train the action recognition
classifier. Figure 6-1 depicts some of the views.
Figure 6-1: Some of the earth material loading views
The excavators had typical construction colors including yellow and red, while the urban
dump trucks were painted a variety of colors such as white, red, black, green, blue, gray, and
purple.
6.1 Experimental Results
The SCIT system with mean-shift and hybrid tracking modules processed the test videos with
varied action recognition thresholds, and the machine-generated results are provided together
with manual observations as ground truth in Table 6-1. This table provides the number of
true positives, false negatives, false alarms, incomplete detected cycles, and the average cycle
times of true positive cycles. Incomplete detected cycles were correctly recognized by the
system, but the tracking module failed to persistently track the loading truck throughout the
cycle, which resulted in early termination of the tracking and resetting of the timer. In the
manual observations, the loading time starts when the excavator initiates the loading activity
and ends as the truck starts moving out.
The tests with smaller ID numbers have lower action recognition thresholds, and the
threshold rises as the test number increases. Test 5 uses the threshold provided by the SVM-
light software at the end of the training stage, but this threshold should be decreased, as it
missed three true positive cycles (Table 6-1). Thus, the threshold was lowered in steps of 0.1
in the other tests for the sensitivity analysis of threshold alteration.
Table 6-1: Results of the experiments with different action recognition thresholds on test videos

Test                              Number of  False     False     Incomplete  Avg. cycle time    ARM
                                  detected   negative  positive  detected    (true positive     threshold
                                  cycles     cycles    cycles    cycles      cycles only, s)
Manual                            55         0         0         0           101.87             -
Test 1: SCIT with hybrid          53         0         4         0           106.49             -0.463
Test 2: SCIT with hybrid          54         0         2         0           106.43             -0.363
Test 3: SCIT with hybrid          54         0         2         0           105.93             -0.263
Test 4: SCIT with hybrid          54         0         2         0           105.33             -0.163
Test 5: SCIT with hybrid          51         3         1         0           105.08             -0.063
Test 3a: SCIT with mean-shift
  using HOG response              48         0         2         6           104.40             -0.263
Test 3b: SCIT with mean-shift
  using color histogram           30         0         2         24          105.00             -0.263
As presented in Table 6-1, the SCIT with the hybrid method had the best performance in Test
4. Tests 2 to 4 with hybrid tracking had the highest number of true positive cycles, but the
average time deviation in Test 4 was less than in Tests 2 and 3. The threshold found in hybrid
Test 3 (as a midpoint in the optimal range) was used for the SCIT with mean-shift tracking
using both color histogram and HOG response, but the performance was substandard (Tests
3a and 3b).
These tests were only able to provide correct data for 48 and 30 of the loading cycles. The
poor performance of the mean-shift algorithm in tracking the dump trucks being loaded
caused these unsatisfactory results, so the tests with other thresholds were aborted for SCIT
with mean-shift tracking.
The mean-shift algorithm with color histogram had problems tracking trucks with neutral
colors, including white, black, and gray. In a number of cases, the tracker missed the target
and thereby concluded that the loading cycle was complete. This stopped the loading clock
and reset it for the start of a new cycle upon the next detection of the same truck, resulting in
erroneous productivity data. Since the HOG method is invariant to color, however, the mean-
shift with HOG response could correctly estimate those cycles.
In addition, there were some instances in the tests using the mean-shift algorithm with color
histogram (Figure 6-2, images a to c) where the tracking blob switched or expanded from the
loading truck to a nearby machine of the same color, producing false results. A similar
problem was observed while using mean-shift with HOG response in cases where dump
trucks with the same orientation were in close proximity, regardless of their color; this
resulted in six incomplete cycles. The SCIT with the hybrid tracker correctly processed all of
the mentioned videos (frames d to f in Figure 6-2).
Figure 6-2: Frames a to c: Expansion of mean-shift tracking, images d to f: Hybrid tracking
6.2 Discussion
The SCIT with the hybrid tracking framework significantly outperformed the SCIT with the
mean-shift algorithm using either color histogram or HOG response, so the SCIT with mean-
shift was set aside, and this section discusses only the results obtained from the SCIT with
the hybrid tracking engine.
The results in Table 6-1 demonstrate that the performance of the SCIT (with hybrid tracker)
is not very sensitive to the threshold change. For instance, the system performed the same
within the range of -0.163 to -0.363 in terms of true positives, false negatives, and false
positives. The average cycle times, however, had slight differences. Lowering the threshold
stretches the true positive cycle times, as the system detects loading trucks before they get
sufficiently close to the excavator to start loading. In addition, lowering the threshold
increases the chance of accepting more bounding boxes as a loading truck and therefore
produces more false positive cycles. For instance, Test 1 has the lowest threshold: it
misclassified four cycles (two more than Tests 2, 3, and 4), and its average cycle time is the
highest. Increasing the thresholds in Tests 2 to 5 improved the average time.
Raising the threshold gives more accurate cycle times, as the classifier recognizes dump
trucks only when they are fully positioned for loading. However, higher thresholds may
result in missing some cycles: those in which the dump trucks stop farther away than usual
for loading, so that their SVM scores cannot pass the higher thresholds. For example, Test 5,
which had the highest threshold, missed three more cycles than Test 4.
All of the processed videos were investigated to find causes for the errors. Since this system
is composed of three modules, errors can be found in object recognition, tracking or the
action recognition classifier. The following sections describe the errors and deviations that
occurred in the testing.
6.2.1 False positive cycles
Two scenarios resulted in false detection of a loading cycle: recognition of a foreground
truck instead of a background truck and identification of a false positive box as a loading
truck. The SCIT identified the wrong machine when another truck largely masked the
loading truck (Figure 6-3 a, b, and c). The framework identified the foreground machine
(green truck) instead of the loading truck (dark blue) and produced the wrong productivity
result. This configuration took place in a video captured from ground level; the system,
however, was able to handle overlooking views in which the loading truck was only partially
masked by other equipment. Images d, e, and f in Figure 6-3 depict the same work zone with
the same arrangement of machines, recorded only a few minutes later from an elevated angle.
The SCIT with the hybrid tracker was able to correctly detect and track the loading truck. This
highlights the importance of an appropriate camera viewpoint for correct outcomes. Nearby
buildings, peaks of slopes, tower cranes, or temporary posts are suitable options for camera
installation; however, some construction sites may lack such options, as there are no
overlooking points nearby, or they are not accessible. Although not investigated in this
research, two possible solutions to overcome this problem without changing the camera view
are:
1. Logical reasoning would help interpret an occluded situation. For example, if a new
truck blocks the view of the loading truck, it is possible to conclude that the actual
machine is masked, and the system can wait until the truck becomes visible before
estimating the loading cycle. However, exceptional situations may mislead the
system; for example, the background truck may not appear again and may leave the
scene under cover of the foreground machine.
2. Use local suppression to detect and track the part of the machine that is not occluded.
In this approach, if a part of a truck gets a higher score than the foreground machine
in the action recognition phase, the background machine is selected and tracked.
Figure 6-3: frames a to c: Recognition and tracking of the incorrect loading truck due to severe
occlusion; frames d to f: correct recognition and tracking by changing the camera location
In the second scenario, false alarm detections may result in false positive cycles. As stated
before, the thresholds for the recognition of dump trucks were decreased to avoid false
negatives, so the detectors produced many false positives, most of which were rejected by the
action recognition module. Most of these false positives did not have the appropriate
orientation for loading, or their size and location could not pass the SVM classification. In
addition to the false positive resulting from occlusion, there were three false positives in Test
1 and one in each of Tests 2, 3, and 4 that resulted from non-truck detections. Test 5 did not
have a false positive cycle due to false alarm object detection. More false positives occurred
at lower action recognition thresholds, whereas higher thresholds successfully ignored them.
Since these false detections do not represent a real machine, the hybrid tracker does not track
a real truck, and all the center points exited the loading zone quickly. This occurred due to
the movement of the excavator boom or another object, which carried the KLT tracking
features outside of the loading box. It is easy to spot these false positives in the results, as
they have short durations compared to actual cycles.
6.2.2 False negative cycles
False negatives only occurred in the test with the highest action recognition threshold (Test
5). A review of the missed samples revealed that the dump trucks approached and stopped
farther away from the loader than usual due to special conditions of the loading zone;
therefore, the vector of relative distances did not pass the SVM classification.
6.2.3 Differences in start and finish times
The SCIT framework had some variations in recording the start and finish loading times
compared to manual observation, which produced different average cycle times (Table 6-1).
In reviewing the data, the inaccuracies had four main causes:
1. The SCIT scanned the videos for new dump trucks every four seconds, so variations
of 0 to 4 seconds in the activity start times compared to the manual study are
inevitable.
2. The SCIT was sometimes slow to detect the loading truck even though it was in place
when the system was searching for it; it took more than one scan process to identify
the loading truck.
3. The human observer recorded the finish time when the truck began to leave; however,
it took a few seconds for the system to detect the end of the activity, as it waited until
the center of the tracked truck exited the loading zone.
4. The viewpoint of the camera is another important factor for a better result. The side
view of the loading operation provides the best view for action recognition: both the
server and customer plants are at the same distance from the camera, and the action
recognition classifier can detect the start of the loading action on time. Front
viewpoints were the most challenging cases; in these cases, the dump truck appears
close enough in the 2D view to pass the action recognition module, but a gap between
the server and customer plants still remains to be closed. For instance, the SCIT
detected the start of the cycle after 3:56 minutes (Figure 6-4 frame a), but the
operation actually initiated after 4:04 minutes (Figure 6-4 frame b). The same issue
causes delays in recording the finish time; for example, there was a four-second
difference for the instance in Figure 6-4.
Figure 6-4: a. incorrect detection of the loading start time, b. actual start time
6.3 Practical Applications
The SCIT system provides two main types of data which are highly useful in management of
earthmoving projects: number of cycles and activity durations.
6.3.1 Cycle counting
The most basic outcome is production confirmation in which the SCIT can count the number
of loading cycles made by the earthmoving fleet, and then approximate the quantity of earth
material moved using the number of trips and the standard capacity of dump trucks. The
number of trips can also be used to confirm the quantity of work achieved by the
earthmoving subcontractor. Earthmoving foremen are usually responsible for this task and
they should also direct the dump trucks in the loading zone (Figure 6-5). This zone is one of
the most hazardous areas in construction sites as the excavators and dump trucks operate in a
confined space (Edwards and Nicholas 2002). For instance, the right frame in Figure 6-5
shows a packed jobsite where two people were assigned to manage dump trucks. The SCIT
has the potential to eliminate the distractive data recording task and the foremen can focus on
site safety.
Figure 6-5: Earthmoving foremen
6.3.2 Cycle durations
The SCIT system records loading cycle durations, which have several applications during the construction period and afterwards. Activity durations can be used to study productivity, find bottlenecks, and enhance ongoing operations. Practitioners can use industry standards, such as manufacturers' performance handbooks or productivity data from previous similar operations, as benchmarks to find substandard operations. In addition, productivity data are the main input for advanced analyses such as stochastic simulation for planning future activities (AbouRizk and Halpin 1992), and they can be used to estimate the cost of similar future operations.
As discussed before, the results of the SCIT system deviate from the ground-truth data. The main issue is to validate the accuracy of the machine-generated cycle times for practical construction applications. Construction companies need activity durations to assess the performance of their fleet at the end of each working shift, to find the causes of delay, and to correct them accordingly. Construction practitioners are interested in the average productivity of a working shift and are generally not concerned about a few prolonged cycles or idle times. In addition, working conditions, such as soil, weather, and equipment conditions, vary with every working shift; therefore, the test results of the SCIT are grouped by site visit.
Tests 2, 3, and 4 with the hybrid tracking algorithm had the best performance in terms of true positive cycles. Tests 3 and 4 had lower deviations from the manual data, so their output is grouped by the eight site visits to assess the deviation in each case. Table 6-2 and Table 6-3 provide the ground-truth and machine-generated data, along with the average loading cycle times and the average time between cycles, for each site visit in Test 3 and Test 4, respectively. In addition, these tables present the deviation percentage of the machine-generated average loading cycles for each site visit and overall. This deviation is calculated as: (machine-generated time − manual observation time) / manual observation time.
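As a worked example, the deviation for site visit 1 in Test 3 (manual average 0:01:42, machine-generated average 0:01:45) can be reproduced with a short script; the helper names below are illustrative, not part of the SCIT implementation:

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    h, m, s = (int(p) for p in hms.split(":"))
    return 3600 * h + 60 * m + s

def deviation(machine_avg: str, manual_avg: str) -> float:
    """(machine-generated time - manual observation time) / manual observation time, in percent."""
    machine, manual = to_seconds(machine_avg), to_seconds(manual_avg)
    return 100.0 * (machine - manual) / manual

# Site visit 1, Test 3: 0:01:45 vs 0:01:42
print(round(deviation("0:01:45", "0:01:42"), 2))  # → 2.94
# Overall, Test 3: 0:01:46 vs 0:01:41
print(round(deviation("0:01:46", "0:01:41"), 2))  # → 4.95
```

These values match the 2.94% and 4.95% entries in Table 6-2.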
Table 6-2: Detailed results of the SCIT with hybrid - Test 3

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 2.94%  | 0:00:57
  |    | Software | 0:29:47 | 0:15:12 | 0:01:45 |        | 0:00:54
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 13.73% | 0:00:43
  |    | Software | 0:11:35 | 0:02:54 | 0:01:56 |        | 0:00:29
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 11.54% | 0:02:05
  |    | Software | 0:04:21 | 0:05:48 | 0:01:27 |        | 0:01:56
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | 3.68%  | 0:01:14
  |    | Software | 0:04:42 | 0:02:18 | 0:02:21 |        | 0:01:09
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 6.14%  | 0:00:48
  |    | Software | 0:06:03 | 0:02:04 | 0:02:01 |        | 0:00:41
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | 0.00%  | 0:00:51
  |    | Software | 0:06:08 | 0:04:19 | 0:01:14 |        | 0:00:52
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 4.81%  | 0:00:49
  |    | Software | 0:10:53 | 0:04:27 | 0:01:49 |        | 0:00:45
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 1.87%  | 0:01:03
  |    | Software | 0:21:51 | 0:12:10 | 0:01:49 |        | 0:01:01
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 4.95% | 0:00:59
        |    | Software | 1:35:20 | 0:49:12 | 0:01:46 |       | 0:00:55
Table 6-3: Detailed results of the SCIT with hybrid - Test 4

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 0.98%  | 0:00:57
  |    | Software | 0:29:18 | 0:15:41 | 0:01:43 |        | 0:00:55
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 13.73% | 0:00:43
  |    | Software | 0:11:35 | 0:02:54 | 0:01:56 |        | 0:00:29
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 3.85%  | 0:02:05
  |    | Software | 0:04:04 | 0:06:05 | 0:01:21 |        | 0:02:02
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | 8.82%  | 0:01:14
  |    | Software | 0:04:56 | 0:02:04 | 0:02:28 |        | 0:01:02
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 7.89%  | 0:00:48
  |    | Software | 0:06:09 | 0:01:58 | 0:02:03 |        | 0:00:39
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | 0.00%  | 0:00:51
  |    | Software | 0:06:08 | 0:04:19 | 0:01:14 |        | 0:00:52
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 4.81%  | 0:00:49
  |    | Software | 0:10:53 | 0:04:27 | 0:01:49 |        | 0:00:45
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 1.87%  | 0:01:03
  |    | Software | 0:21:45 | 0:12:16 | 0:01:49 |        | 0:01:01
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 3.96% | 0:00:59
        |    | Software | 1:34:48 | 0:49:44 | 0:01:45 |       | 0:00:55
The overall average cycle times in Tests 3 and 4 deviate by 4.95% and 3.96%, respectively. Test 4, with a higher action recognition threshold, performed slightly better: its machine-generated total loading time (1:34:48) is closer to the manually recorded time (1:31:18) than the total loading time obtained in Test 3 (1:35:20). The lowest accuracy in both tests occurred in the videos of site visit 2, with a 13.73% deviation. A review of this case showed that the video captured the loading operation from a front-on viewpoint. As discussed in section 6.2.3, front views result in earlier start and later finish times; therefore, the machine-generated average cycle time was 14 seconds longer than the ground-truth data.
The scan time interval for dump trucks is another important factor. Thus, 3 and 5 second intervals were also tested (with the same action recognition threshold used in Test 3, as a middle point in the optimal threshold range), and the results are presented in Table 6-4 and Table 6-5, respectively. Changing the scan interval had a mixed effect on SCIT performance. The shorter interval (3 seconds) resulted in more timely detection of dump trucks in some cases, but sometimes caused premature detection of the loading dump trucks. In terms of accuracy, the results with 4 second intervals are slightly better than those with 3 and 5 second intervals.
In addition to the accuracy of cycle times, false positives and computational load are two other important factors. The test with 3 second intervals had four false positive cycles, while the tests with 4 and 5 second intervals each had two. The test with 3 second intervals scanned more frames than the 4 and 5 second tests, so the chance of misdetection was higher. Moreover, scanning for dump trucks at shorter intervals imposes more load on the GPU and the processing unit.
Table 6-4: Detailed results of the SCIT with hybrid - 3 second intervals

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 5.88%  | 0:00:57
  |    | Software | 0:30:32 | 0:14:27 | 0:01:48 |        | 0:00:51
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 14.71% | 0:00:43
  |    | Software | 0:11:41 | 0:02:52 | 0:01:57 |        | 0:00:29
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 8.97%  | 0:02:05
  |    | Software | 0:04:16 | 0:05:53 | 0:01:25 |        | 0:01:58
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | 0.00%  | 0:01:14
  |    | Software | 0:04:32 | 0:02:28 | 0:02:16 |        | 0:01:14
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 7.02%  | 0:00:48
  |    | Software | 0:06:06 | 0:02:01 | 0:02:02 |        | 0:00:40
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | -1.35% | 0:00:51
  |    | Software | 0:06:04 | 0:04:23 | 0:01:13 |        | 0:00:53
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 6.73%  | 0:00:49
  |    | Software | 0:11:09 | 0:04:11 | 0:01:51 |        | 0:00:42
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 0.93%  | 0:01:03
  |    | Software | 0:21:39 | 0:12:22 | 0:01:48 |        | 0:01:02
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 5.94% | 0:00:59
        |    | Software | 1:35:59 | 0:48:37 | 0:01:47 |       | 0:00:54
Table 6-5: Detailed results of the SCIT with hybrid - 5 second intervals

Site visit | No of cycles | Data type | Total loading time | Total time between cycles | Average loading time | Deviation % | Average time between cycles
1 | 17 | Manual   | 0:28:54 | 0:16:05 | 0:01:42 | 4.90%  | 0:00:57
  |    | Software | 0:30:18 | 0:14:41 | 0:01:47 |        | 0:00:52
2 | 6  | Manual   | 0:10:12 | 0:04:17 | 0:01:42 | 13.73% | 0:00:43
  |    | Software | 0:11:34 | 0:02:58 | 0:01:56 |        | 0:00:30
3 | 3  | Manual   | 0:03:55 | 0:06:14 | 0:01:18 | 12.82% | 0:02:05
  |    | Software | 0:04:24 | 0:05:45 | 0:01:28 |        | 0:01:55
4 | 2  | Manual   | 0:04:32 | 0:02:28 | 0:02:16 | -2.21% | 0:01:14
  |    | Software | 0:04:25 | 0:02:35 | 0:02:13 |        | 0:01:18
5 | 3  | Manual   | 0:05:42 | 0:02:25 | 0:01:54 | 6.14%  | 0:00:48
  |    | Software | 0:06:04 | 0:02:03 | 0:02:01 |        | 0:00:41
6 | 5  | Manual   | 0:06:11 | 0:04:16 | 0:01:14 | -2.70% | 0:00:51
  |    | Software | 0:06:01 | 0:04:26 | 0:01:12 |        | 0:00:53
7 | 6  | Manual   | 0:10:27 | 0:04:53 | 0:01:44 | 9.62%  | 0:00:49
  |    | Software | 0:11:26 | 0:03:54 | 0:01:54 |        | 0:00:39
8 | 12 | Manual   | 0:21:25 | 0:12:36 | 0:01:47 | 0.00%  | 0:01:03
  |    | Software | 0:21:22 | 0:12:39 | 0:01:47 |        | 0:01:03
Overall | 54 | Manual   | 1:31:18 | 0:53:14 | 0:01:41 | 4.95% | 0:00:59
        |    | Software | 1:35:34 | 0:49:01 | 0:01:46 |       | 0:00:54
A key issue is: how much error is acceptable to the construction industry? Every industry has a level of tolerance for its automated data collection systems. For instance, the aviation industry has one of the highest safety levels, and the expected accuracy of its positioning devices, such as radar or onboard GPS antennas, is extremely high. The heavy civil engineering industry, however, has high levels of variation in equipment productivity. Even the equipment manufacturers, who have tested their products under various conditions with operators of different skill levels, provide a range of productivity rates. For instance, according to the Caterpillar performance handbook (Caterpillar 2006), dump trucks usually require 36 to 48 seconds to manoeuvre into the loading position, and it takes three or four buckets for the test case excavator (CAT 345C) to load a typical urban dump truck. This handbook provides a range of cycle times for an excavator swing (one bucket) under different conditions such as soil type, configuration of the excavator and dump truck, and the angle of swing. Each swing cycle includes four steps: load bucket, swing loaded, dump bucket, and swing empty. The manufacturer provides detailed ranges for different machines, including the test case of this research, the CAT 345C. The conditions of the loading operation in the test cases are presented in Table 6-6. According to the handbook, the minimum and maximum swing times are 19.8 and 24 seconds, respectively. This type of excavator can load an urban truck with 3 or 4 buckets, depending on the conditions and the skill/efficiency of the operator. Sometimes the operator can fully fill the buckets and load the truck with only three swings, but in most cases the loading conditions require four swings.
Table 6-6: Loading conditions of the test cases

Condition | Easy | Hard
Angle of swing | 90° | 120°
Soil condition | Hard packed soil with up to 50% rock content (both cases)
Depth of excavation | < 70% of max. capability | < 90% of max. capability
Excavator level relative to dump truck | Elevated | Same level
Swing time for excavator CAT 345C | 19.8 seconds | 24 seconds
Since the cases in this research were two urban construction sites, the excavator operators must compact and level the surface of each load to prevent loose material from escaping onto the street, cars, and pedestrians. A study of the seven videos used to train the action recognition classifiers showed that it takes 10-15 seconds for the excavators to level and clean the load's surface; thus, 12.5 seconds is added to the minimum and maximum benchmark times. The minimum benchmark time is three repetitions of the shortest swing cycle divided by an operator skill coefficient, for which the recommended value is 0.9 (Caterpillar 2006), plus the levelling time. The maximum benchmark time is four repetitions of the longest swing cycle divided by the operator skill coefficient, plus the levelling time. The calculated minimum and maximum benchmark times are 1:19 and 1:59 minutes, respectively.
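The benchmark calculation above can be reproduced directly; a minimal sketch in which the constant names are illustrative:

```python
# Swing cycle times for the CAT 345C under the test conditions (Caterpillar 2006)
MIN_SWING = 19.8   # seconds, easy conditions (90 degree swing)
MAX_SWING = 24.0   # seconds, hard conditions (120 degree swing)
SKILL_COEFF = 0.9  # recommended operator skill coefficient
LEVELLING = 12.5   # seconds, midpoint of the observed 10-15 s levelling time

# 3 repetitions of the shortest swing; 4 repetitions of the longest swing
min_benchmark = 3 * MIN_SWING / SKILL_COEFF + LEVELLING
max_benchmark = 4 * MAX_SWING / SKILL_COEFF + LEVELLING

def fmt(seconds: float) -> str:
    """Format seconds as m:ss, rounding half up to the nearest second."""
    total = int(seconds + 0.5)
    return f"{total // 60}:{total % 60:02d}"

print(fmt(min_benchmark), fmt(max_benchmark))  # → 1:19 1:59
```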
The overall ground-truth and machine-generated average cycle times are 1:41 and 1:45 minutes (Table 6-3). Not only are the recorded cycle times within the range suggested by Caterpillar, they fall within a narrow band of that range. It can therefore be concluded that the deviation between the actual and SCIT cycle times is acceptable in this context.
6.4 Monitoring Other Earthmoving Operations
The case presented in this research is one of the most challenging activities for a vision-based system because:
- It is an interactive activity involving two types of equipment;
- Loading zones are visually occluded, as many machines work in a confined area.
Other common earthmoving operations, including hauling, leveling, compacting, and excavating, are performed using one type of equipment, and the views are not as occluded.
6.4.1 Hauling
Dump trucks carry earth material within the jobsite or off site. If the viewfinder captures the entryway to a loading or dumping area, a vision-based system is able to count the number of truck loads and record the time gap between them. Since the object recognition and tracking modules were already developed for dump trucks, they were reused to count the number of truck loads. The viewfinder should be set on the entryway or an access road; the object recognition module then searches for dump trucks from eight viewpoints at 4 second intervals (the same approach employed in the SCIT). Any detected machine is then tracked using the hybrid tracking algorithm. False positives, however, remain an issue for this approach, as there is no action recognition module to remove them.
The spatiotemporal data provided by the tracking module can be used to distinguish true positives from false alarms. In this approach, all detected windows, including true positives and false alarms, are passed for tracking with the hybrid algorithm. As explained in section 4.2, the hybrid algorithm uses successive recognitions of the target in a ROI to track that machine. False positives are of two types: randomly scattered boxes and repeated detections in the same location. In the first case, the object recognition part of the hybrid tracker will not re-detect the target in the ROI; in the second, the recognition section repeatedly recognizes the target in the same place with no movement. A simple action recognition module was therefore developed to distinguish moving dump trucks from false positives.
The system assigns a score of zero to every detected box. Each redetection of the target during a tracking interval (one second in this framework) adds +1 to its score. A tracked object can therefore score 0-4 within a 4 second window, and the system removes targets with scores lower than 2. This constraint eliminates random false positives. The second constraint is the displacement of the target: any object that moves less than half the width of its bounding box within a 4 second interval is omitted as well. This constraint may remove a motionless true positive (a temporarily immobile dump truck on the road), but this does not affect the outcome, because the system will identify and track that truck upon departure.
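The two constraints can be sketched as a simple filter; the `Track` structure below is a hypothetical stand-in for the tracker's internal state, not the actual SCIT code:

```python
from dataclasses import dataclass

@dataclass
class Track:
    score: int           # redetections within the 4 s window (0-4)
    displacement: float  # pixels moved within the window
    box_width: float     # bounding box width in pixels

def is_true_positive(track: Track) -> bool:
    """Keep a track only if it was re-detected often enough and actually moved."""
    # Constraint 1: fewer than 2 redetections in 4 s -> random false positive
    if track.score < 2:
        return False
    # Constraint 2: movement below half the box width -> static false positive
    if track.displacement < track.box_width / 2:
        return False
    return True

# A moving truck re-detected in 3 of 4 one-second intervals is kept;
# a repeated detection with almost no movement is discarded.
print(is_true_positive(Track(score=3, displacement=60.0, box_width=100.0)))  # → True
print(is_true_positive(Track(score=4, displacement=10.0, box_width=100.0)))  # → False
```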
For instance, Figure 6-6 illustrates two frames with a view of an access road from a rock quarry to a rock-fill dam construction site. The dump trucks heading right carry rocks to the rock-fill area, and those heading left return to the quarry. The left frame shows the detected machines, and the right one depicts their tracking results 7 seconds later. The red lines represent the tracking path of each machine.
Figure 6-6: Left: detection of loading trucks; right: tracking of trucks
6.4.2 Leveling and compacting
Leveling and compacting tasks are performed using graders and rollers, respectively. A single machine or multiple machines carry out these operations in a cyclic manner. For example, compaction of an earth-fill layer may require 10 passes of a roller, so the contractor may assign one roller to make 10 passes, employ two rollers in a row for five passes each (Figure 6-7), or use any combination resulting in 10 passes. A vision-based system has to detect, track, and record the time for each pass. The major effort is to train the object recognition module. Since these plants have relatively rigid shapes, the same approach used to recognize and track dump trucks can be applied. Given the width and thickness of each layer, the system can estimate the productivity of the compaction operation. The same method is applicable to graders.
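As a sketch of how tracked passes translate into productivity, the following uses illustrative numbers; the function, its parameters, and the example values are assumptions for illustration, not data from this research:

```python
def compaction_productivity(drum_width_m: float, speed_m_per_h: float,
                            layer_thickness_m: float, passes_required: int,
                            overlap_m: float = 0.2) -> float:
    """Estimate compacted volume per hour (m^3/h) from tracked roller passes.

    The effective width discounts the overlap between adjacent lanes; dividing
    by the required number of passes converts area covered into finished area.
    """
    effective_width = drum_width_m - overlap_m
    finished_area_per_hour = effective_width * speed_m_per_h / passes_required
    return finished_area_per_hour * layer_thickness_m

# Hypothetical case: a 2.1 m drum at 4 km/h, 0.3 m lifts, 10 passes per lane
print(round(compaction_productivity(2.1, 4000.0, 0.3, 10), 1))  # → 228.0
```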
Figure 6-7: Compaction with two rollers
6.4.3 Excavation
Bulldozers move forward and backward to engage their blade or ripper for excavation. Detecting and tracking them can help estimate productivity, but the solution is not as straightforward as for rollers and graders. The volume of excavated material depends on the soil/rock conditions, the ground slope, the depth to which the blade is forced into the surface, and the skill level of the operator. It is challenging even for the naked eye to estimate the excavated volume, so a vision-based system can only estimate the movements of the bulldozer. Other techniques, such as surveying and laser scanning, are required to measure the volume of excavated material and thereby the productivity.
6.4.4 Extended Monitoring System
A system could include all the above-mentioned modules, with site engineers setting the viewfinder and choosing the type of activity to monitor. It is also possible to increase the level of automation so that the system semantically identifies the type of action and then switches to the associated modules. First, the system requires a defined earthmoving taxonomy so it can identify the type of activity based on the equipment present in the scene. In this method, the system uses a brute-force approach to search for the different types of equipment available in the recognition classifiers. This feature gives the system a context-awareness capability, which enables it to sense the environment and react based on the processed information. This approach, however, is computationally intensive. Therefore, the system should only scan the first couple of minutes of the video; if machines are detected, the system switches to the object detection and action recognition modules associated with the detected equipment. For example, if the system detects urban dump trucks and an excavator, it will conclude that the site is a loading zone in an urban setting and will use only the related modules.
CHAPTER 7 - CONCLUSION AND FUTURE DIRECTIONS
Earthmoving activities are a costly component of heavy construction projects and mining operations. Various automated controlling systems have been employed to monitor these equipment-intensive operations. GPS-based systems have been the main controlling device for more than a decade in both the construction and mining sectors. However, GPS antennas must be installed on every machine, and the transmitted data must be interpreted indirectly to estimate productivity. In addition, locational records do not provide clear representations of the scenes for finding the causes of delays or abnormal productivity rates.
The relatively clear sightlines found in earthmoving sites, low-cost cameras, high-capacity storage devices, and advances in image and video processing algorithms make earthwork jobsites promising candidates for vision-based monitoring systems. Recent studies have used computer vision algorithms to identify and track earthmoving plants in construction videos and time-lapse footage, and the outcomes were then used to estimate equipment productivity. These research efforts, however, have not proven capable of automating equipment productivity measurement in real-world scenarios, as they were carried out under ideal conditions, namely plain backgrounds, a very low level of occlusion, specific viewpoints, manually defined work zones, and the presence of only a few types of equipment. The conditions under which these approaches were evaluated do not resemble those of actual construction sites, which has prevented their practical application within the industry.
This research aimed to close this practicability gap between vision-based algorithms and equipment productivity measurement processes. The earth material loading operation was selected as the test case, and a vision-based system, named server-customer interaction tracker (SCIT), was developed to recognize and estimate loading cycles under a variety of conditions.
7.1 Summary of Research
The SCIT system contains three main modules: object recognition, tracking, and action recognition. The object recognition module employs two techniques to detect dump trucks and hydraulic excavators, i.e. rigid and deformable equipment, respectively. It uses HOG classifiers to detect dump trucks from eight viewpoints. It also uses the HOG algorithm to detect candidate boxes for the excavator arm in a series of video frames and then selects the group of boxes that is coherent with the movement pattern of the excavator.

The standard mean-shift and the innovative hybrid tracking methods were separately employed as tracking engines. The hybrid tracking framework combines HOG object recognition and KLT feature tracking to track dump trucks. After recognition of the excavator, the system scans the video frames at predetermined time intervals (three, four, and five seconds in this research) and checks whether any of the detected dump trucks are logically positioned for loading. In addition to this logical reasoning test, a machine learning algorithm, a linear Support Vector Machine, was used to select the loading dump truck based on its relative size and distance to the excavator. In this test, the spatiotemporal information of a detected truck is translated into a descriptor and then classified to recognize the start of the loading cycle. Upon recognition of a loading truck, the SCIT defines a loading zone, and the tracking module follows the loading truck while it stays inside that zone. The departure of the loading truck ends the loading cycle, and the elapsed time is the duration of the loading cycle. The SCIT then resumes searching for new dump trucks.
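The processing loop summarized above can be expressed in pseudocode; the function names are placeholders describing the module interfaces, not the actual SCIT API:

```
# Pseudocode sketch of the SCIT loading-cycle loop
excavator = detect_excavator(video)                # part-based HOG detection
cycles = []
while video has frames:
    every scan_interval (3, 4, or 5 s):
        for truck in scan_for_trucks(frame, viewpoints=8):   # HOG classifiers
            # logical position test + linear SVM on the spatiotemporal
            # descriptor (relative size and distance to the excavator)
            if is_loading_candidate(truck, excavator):
                start = current_time()
                zone = define_loading_zone(truck, excavator)
                track_until_exit(truck, zone)      # hybrid HOG + KLT tracker
                cycles.append(current_time() - start)    # cycle duration
```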
7.2 Summary of Results
Several test videos were captured under various conditions during eight site visits to evaluate system performance. The SCIT system processed these test videos with five different action recognition thresholds. Lower thresholds produced more false positive cycles. Increasing the action recognition threshold reduced the number of false positives and improved the cycle times. The performance improvement, however, stopped at a certain threshold beyond which the system missed some true positive cycles. Three sets of tests showed the best results, with the SCIT correctly detecting 54 out of 55 cycles. Two of these tests had 4.95% deviation from the ground-truth cycle times, and the third had 3.96%. These results were obtained using 4 second intervals to search for loading dump trucks. In addition, 3 and 5 second intervals were tested: the test with 3 second intervals had more false positives and higher deviation, whereas the test with 5 second intervals had almost the same results as the test with 4 second intervals. The results demonstrate that the SCIT is able to address the objectives of this research, which were to:
- Develop an automated vision-based system for regular 2D construction videos;
- Process in real time;
- Detect and track earth material loading equipment;
- Recognize and estimate loading cycles with an acceptable deviation.
It should be mentioned that the performance of this system is limited by the inherent shortcomings of single 2D videos, which are explained in section 7.5.
7.3 Contributions to the Body of Knowledge
This research lays the groundwork for the application of vision-based algorithms to monitor construction plants. It was the first research effort to evaluate the HOG object recognition algorithm and a part-based approach for detecting construction equipment, and the results were promising. In addition, it introduced a novel hybrid tracking algorithm to track construction entities in noisy construction videos. This innovative algorithm employs an optimized integration of HOG recognition and KLT feature tracking. The method performed well in partially occluded views in which the previously recommended mean-shift tracking algorithm, using an HSV color histogram and HOG response, failed to track successfully. Moreover, an action recognition framework was developed that uses spatiotemporal data to recognize the interaction of the loading plants. This modular system has learning ability, and its modules can be substituted to recognize and estimate other cyclic operations. Finally, the SCIT's modules were implemented and optimized so that the entire system can process construction videos in real time on an ordinary computer.
7.4 Contribution to the Body of Practice
As stated in the previous chapter, construction companies require records of the total number of truck trips. This is done either manually or by using active monitoring systems such as GPS or RFID. The SCIT framework has the potential to count the number of trips and, in addition, can provide detailed cycle time information in real time. Manual sampling methods fail to spot many abnormal cycles, so their results may not be useful for finding and resolving bottlenecks. In contrast, the SCIT system monitors all of the loading cycles and tags each with its cycle time. Abnormal working shifts can be flagged, and site engineers can visually review the anomalies. The SCIT avoids some of the main shortcomings of radio-based monitoring systems, namely GPS, as it is not intrusive and is able to provide visual spatiotemporal data, including the pose and orientation of the machines. These data help the system interpret the actions of the machines more accurately.
7.5 Limitations
Since the SCIT system processes a video stream from a single camera, this framework carries the inherent limitations of a 2D projection of the real world, including:
- Unknown depth of the objects in the frames;
- Occlusion;
- Limited coverage of the site by a single stationary camera.
Therefore, there are certain conditions/viewpoints in which the system fails to provide correct data. However, as mentioned previously, the SCIT system aims at closing the practicability shortcomings of recent vision-based systems that perform only under ideal conditions. In this respect, SCIT addresses the following issues:
- It can process congested work zones with various backgrounds in which several types of machines (unknown to the detectors) appear in the videos.
- It is fully invariant to colors, because none of the modules use color-based techniques.
- No manual intervention is required to define the work zones; the system defines them.
- The SCIT can handle a moderate level of occlusion (e.g. partially masked dump trucks from elevated viewpoints).
However, the SCIT framework has three main shortcomings that keep it from an ideal level:
- The camera location, viewpoint, and focus must be set manually.
- The system cannot accurately process highly occluded scenes (e.g. a masked dump truck from a ground-level viewpoint).
- Since the current version of the object recognition module for hydraulic excavators can detect only one excavator, the SCIT performance is limited to one operating server.
In addition, unlike active systems that use antennas or tags, the SCIT system can only recognize the type of machine; it is not able to identify individual pieces of equipment. This issue could be resolved by labelling each machine with recognizable visual signs, such as numbers; however, dust on construction sites and errors in the sign recognition algorithms may cause problems.
7.6 Future Directions
This research offers a major step in developing vision-based monitoring systems for real-time earthmoving videos. Future research is required to overcome the limitations of the current version of the SCIT. As stated in previous sections, this system has a few shortcomings. In addition to the intrinsic errors of the existing object recognition and tracking techniques, 2D videos do not provide any information about the depth of the scene, so future research should seek additional mechanisms to overcome the shortcomings of using a single 2D video. The following are some possible solutions:
7.6.1 Application of Two Calibrated Cameras
Two calibrated cameras can provide a stereo view of the scene. With the 3D coordinates of the cameras (their centers of projection) and the 2D locations of the detected machines in the video frames, epipolar geometry can be used to calculate the 3D location of the equipment. In addition to adding depth estimation, the use of two videos would help handle the occlusion problem.
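As a sketch of the underlying geometry, the 3D location of a detected machine can be triangulated from its 2D positions in two calibrated views. The projection matrices below are synthetic, and the linear (DLT) method shown is one standard option, not the method used in this research:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen by two calibrated cameras.

    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel coordinates.
    Returns the 3D point in world coordinates.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Two synthetic cameras: identity intrinsics, second camera shifted 1 m in x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true))  # → True
```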
7.6.2 Application of Multiple Non-calibrated Cameras
Although calibrated cameras provide accurate results, calibrating site cameras is a labour-intensive task. It is possible to avoid this effort by using multiple non-calibrated cameras, which would be much easier to install and manage while still providing additional valuable data. One example application is linear projects, such as highways, where one camera cannot cover the entire site. In this case, multiple cameras would provide a complete view of the construction right-of-way. A slight overlap between the cameras' views would be beneficial so that movement from one view to the next could be coordinated. Each camera would identify the activities occurring in its area, after which the data could be combined to gain knowledge of the whole operation. Another example application would be a large excavation where each candidate camera position would experience recurring occlusions. Two cameras could be placed such that each resolves the occlusion problem of the other. The production recorded from each camera could then be combined to determine where occlusions caused one camera to misinterpret some of the loading cycles.
7.6.3 Integration of SCIT and GPS
Despite manual preparation and technology-related problems, GPS systems are the leading alternative for monitoring an earthmoving fleet. Integrating the SCIT system with a GPS navigation system would cover the shortcomings of both frameworks. GPS antennas can provide the 3D coordinates of the equipment, and the SCIT can offer additional spatiotemporal data, including the orientation and poses of the equipment. This system requires a geographical map of the operation in which the camera coverage (see Figure 7-1) is identified based on the focal length of the camera lenses. The entrance of a machine into this area then triggers the SCIT to recognize and track that equipment. This way, accurate GPS locational data can eliminate the risk of both false positives and false negatives in the SCIT system. This hybrid system compares the 2D visual detections with the corresponding 3D locational data to remove inconsistent detections. Moreover, the appearance of a machine's 3D locational data on the geographical map could prompt the SCIT to undergo extra recognition trials for timely detection of that equipment in the video. The SCIT remains responsible for action recognition and productivity estimation. This system would also be able to interpret occluded views, such as the situation presented in Figure 7-1.
The visual information provided by the SCIT makes it possible to accurately recognize equipment poses and actions and to distinguish productive from non-value-added movements, which are the shortcomings of a standalone GPS system. In addition to providing accurate locational data, the GPS system can identify each machine, which helps the system estimate the productivity of each plant separately.
Figure 7-1: Integration of the SCIT and GPS
Another shortcoming of the SCIT system is the manual setting of the viewfinder. A module
could be developed to control the robotic base of the construction camera; it should
automatically scan the jobsite for loading operations and, once one is found, focus
appropriately on that operation. A new generation of cameras has built-in GPS receivers and
compasses, so when the system finds a loading operation, it can estimate the operation's
location on the jobsite and automatically annotate the productivity data.
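The location-annotation idea can be illustrated with a short sketch. The function name, the
flat-earth approximation, and the assumption that the distance to the operation is known
(e.g. estimated from the lens focal length) are all illustrative choices, not part of the
SCIT system.

```python
# Illustrative sketch: estimating a loading operation's position from the
# camera's built-in GPS fix and compass bearing, given an estimated distance.
# A flat-earth offset is adequate over the few hundred metres of a jobsite.

import math

EARTH_RADIUS_M = 6_371_000.0


def locate_operation(cam_lat, cam_lon, bearing_deg, distance_m):
    """Offset the camera's GPS fix along the compass bearing (degrees)."""
    brg = math.radians(bearing_deg)
    d_north = distance_m * math.cos(brg)
    d_east = distance_m * math.sin(brg)
    lat = cam_lat + math.degrees(d_north / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        d_east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat)))
    )
    return lat, lon


# Camera at the University of Toronto, pointing due east at a pit 200 m away.
lat, lon = locate_operation(43.6629, -79.3957, 90.0, 200.0)
print(f"operation at {lat:.5f}, {lon:.5f}")  # → operation at 43.66290, -79.39321
```

The resulting coordinate pair could then be attached to each productivity record as the
automatic annotation suggested above.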