PARALLEL DISCRETE EVENT SIMULATION OF QUEUING NETWORKS USING GPU-BASED HARDWARE ACCELERATION
By
HYUNGWOOK PARK
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to my advisor, Dr. Paul A. Fishwick for
his excellent inspiration and guidance throughout my Ph.D. studies at the University of
Florida. I would also like to thank my Ph.D. committee members, Dr. Jih-Kwon Peir, Dr.
Shigang Chen, Dr. Benjamin C. Lok, and Dr. Howard W. Beck for their precious time and
advice for my research. Also, I am grateful to the Korean Army. They gave me a chance
to study in the United States of America with financial support. I would like to thank my
parents, Hyunkoo Park and Oksoon Jung who encouraged me throughout my studies. I
would especially like to thank my wife, Jisuk Han, and my sons, Kyungeon and Sangeon
Park. They have been very supportive and patient throughout my studies. I would never
have finished my study without them.
4
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Motivations and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Contributions to Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.1 A GPU-Based Toolkit for Discrete Event Simulation Based on Parallel Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.2 Mutual Exclusion Mechanism for GPU . . . . . . . . . . . . . . . . . 16
1.2.3 Event Clustering Algorithm on SIMD Hardware . . . . . . . . . . . . 17
1.2.4 Error Analysis and Correction . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . 18
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1 Queuing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Discrete Event Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Event Scheduling Method . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Parallel Discrete Event Simulation . . . . . . . . . . . . . . . . . . . 25
2.2.2.1 Conservative synchronization . . . . . . . . . . . . . . . . . 26
2.2.2.2 Optimistic synchronization . . . . . . . . . . . . . . . . . . 28
2.2.2.3 A comparison of two methods . . . . . . . . . . . . . . . . 30
2.3 GPU and CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.1 GPU as a Coprocessor . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.2 Stream Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 GeForce 8800 GTX . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.4 CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 Discrete Event Simulation on SIMD Hardware . . . . . . . . . . . . . . . . 38
3.2 Tradeoff between Accuracy and Performance . . . . . . . . . . . . . . . . 40
3.3 Concurrent Priority Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Parallel Simulation Problem Space . . . . . . . . . . . . . . . . . . . . . . 41
5
4 A GPU-BASED APPLICATION FRAMEWORK SUPPORTING FAST DISCRETE EVENT SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Parallel Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Issues in a Queuing Model Simulation . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Mutual Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Selective Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.3 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Data Structures and Functions . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Event Scheduling Method . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 Functions for a Queuing Model . . . . . . . . . . . . . . . . . . . . . 54
4.3.3 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Steps for Building a Queuing Model . . . . . . . . . . . . . . . . . . . . . 58
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.2 Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.3 Parallel Simulation with a Sequential Event Scheduling Method . . 63
4.5.4 Parallel Simulation with a Parallel Event Scheduling Method . . . . 64
4.5.5 Cluster Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 AN ANALYSIS OF QUEUE NETWORK SIMULATION USING GPU-BASED HARDWARE ACCELERATION . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Parallel Discrete Event Simulation of Queuing Networks on the GPU . . . 67
5.1.1 A Time-Synchronous/Event Algorithm . . . . . . . . . . . . . . . . . 67
5.1.2 Timestamp Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Implementation and Analysis of Queuing Network Simulation . . . . . . . 70
5.2.1 Closed and Open Queuing Networks . . . . . . . . . . . . . . . . . 70
5.2.2 Computer Network Model . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.3 CUDA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Simulation Model: Closed and Open Queuing Networks . . . . . . 76
5.3.1.1 Accuracy: closed vs. open queuing network . . . . . . . . 77
5.3.1.2 Accuracy: effects of parameter settings on accuracy . . . 79
5.3.1.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.2 Computer Network Model: a Mobile Ad Hoc Network . . . . . . . . 83
5.3.2.1 Simulation model . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.2.2 Accuracy and performance . . . . . . . . . . . . . . . . . 86
5.4 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6
LIST OF TABLES
Table page
2-1 Notations for queuing model statistics . . . . . . . . . . . . . . . . . . . . . . . 22
2-2 Equations for key queuing model statistics . . . . . . . . . . . . . . . . . . . . . 23
3-1 Classification of parallel simulation examples . . . . . . . . . . . . . . . . . . . 42
4-1 The future event list and its attributes . . . . . . . . . . . . . . . . . . . . . . . . 51
4-2 The service facility and its attributes . . . . . . . . . . . . . . . . . . . . . . . . 55
5-1 Simulation scenarios of MANET . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5-2 Utilization and sojourn time (Soj. time) for different values of time intervals (Δt) and mean service times (s̄) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8
LIST OF FIGURES
Figure page
2-1 Components of a single server queuing model . . . . . . . . . . . . . . . . . . 21
2-2 Cycle used for event scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2-3 Stream and kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2-4 Traditional vs. GeForce 8 series GPU pipeline . . . . . . . . . . . . . . . . . . . 34
2-5 GeForce 8800 GTX architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2-6 Execution between the host and the device . . . . . . . . . . . . . . . . . . . . 37
3-1 Diagram of parallel simulation problem space . . . . . . . . . . . . . . . . . . . 42
4-1 The algorithm for parallel event scheduling . . . . . . . . . . . . . . . . . . . . 44
4-2 The result of a concurrent request from two threads without a mutual exclusionalgorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4-3 A mutual exclusion algorithm with clustering events . . . . . . . . . . . . . . . . 48
4-4 Pseudocode for NextEventTime . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4-5 Pseudocode for NextEvent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4-6 Pseudocode for Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4-7 Pseudocode for Request . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4-8 Pseudocode for Release . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4-9 Pseudocode for ScheduleServer . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4-10 First step in parallel reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4-11 Steps in parallel reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4-12 Step 3: Event extraction and departure event . . . . . . . . . . . . . . . . . . . 60
4-13 Step 4: Update of service facility . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4-14 Step 5: New event scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4-15 3×3 toroidal queuing network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-16 Performance improvement by using a GPU as coprocessor . . . . . . . . . . . 64
4-17 Performance improvement from parallel event scheduling . . . . . . . . . . . . 65
9
5-1 Pseudocode for a hybrid time-synchronous/event algorithm with parallel event scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5-2 Queuing delay in the computer network model . . . . . . . . . . . . . . . . . . 73
5-3 3 linear queuing networks with 3 servers . . . . . . . . . . . . . . . . . . . . . . 76
5-4 Summary statistics of closed and open queuing network simulations . . . . . . 78
5-5 Summary statistics with varying parameter settings . . . . . . . . . . . . . . . . 80
5-6 Performance improvement with varying time intervals (Δt) . . . . . . . . . . 82
5-7 Comparison between wireless and mobile ad hoc networks . . . . . . . . . . . 84
5-8 Average end-to-end delay with varying time intervals (Δt) . . . . . . . . . . 87
5-9 Average hop counts and packet delivery ratio with varying time intervals (Δt) . 89
5-10 Performance improvement in MANET simulation with varying time intervals (Δt) 90
5-11 3-dimensional representation of utilization for varying time intervals and mean service times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5-12 Comparison between experimental and estimation results . . . . . . . . . . . . 93
5-13 Result of error correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
10
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
PARALLEL DISCRETE EVENT SIMULATION OF QUEUING NETWORKS USING GPU-BASED HARDWARE ACCELERATION
By
Hyungwook Park
December 2009
Chair: Paul A. Fishwick
Major: Computer Engineering
Queuing networks are used widely in computer simulation studies. Examples of
queuing networks can be found in areas such as supply chains, manufacturing
workflow, and Internet routing. If the networks are fairly small in size and complexity, it is
possible to create discrete event simulations of the networks without incurring significant
delays in analyzing the system. However, as the networks grow in size, such analysis
can be time consuming and thus require more expensive parallel processing computers
or clusters.
The trend in computing architectures has been toward multicore central processing
units (CPUs) and graphics processing units (GPUs). A GPU is fairly inexpensive
hardware found in most recent computing platforms, and it is a practical example of a
single instruction, multiple data (SIMD) architecture. The majority of studies using
the GPU within the graphics and simulation communities have focused on the use
of the GPU for models that are traditionally simulated using regular time increments,
whether these increments are accomplished through the addition of a time delta
(i.e., numerical integration) or event scheduling using the delta (i.e., discrete event
approximations of continuous-time systems). These types of models have the property
of being decomposable over a variable or parameter space. In prior studies, discrete
event simulation, such as a queuing network simulation, has been characterized as
being an inefficient application for the GPU primarily due to the inherent synchronicity of
11
the GPU organization and an apparent mismatch between the classic event scheduling
cycle and the GPU's basic functionality. However, we have found that irregular time
advances of the sort common in discrete event models can be successfully mapped to
a GPU, thus making it possible to execute discrete event systems on an inexpensive
personal computer platform.
This dissertation introduces a set of tools that allows the analyst to simulate
queuing networks in parallel using a GPU. We then present an analysis of a GPU-based
algorithm, describing benefits and issues with the GPU approach. The algorithm
clusters events, achieving speedup at the expense of an approximation error which
grows as the cluster size increases. We were able to achieve a 10x speedup using our
approach, with a small error in the output statistics for a general network topology.
Based on error analysis trends, this error can be mitigated to obtain reasonably accurate
output statistics.
12
CHAPTER 1
INTRODUCTION
1.1 Motivations and Challenges
Queuing models [1–4] are constructed to analyze humanly engineered systems
where jobs, parts, or people flow through a network of nodes (i.e. resources). The
study of queuing models, their simulation, and their analysis is one of the primary
research topics studied within the discrete event simulation community [5]. There
are two approaches to estimating the performance and analysis of queuing systems:
analytical modeling and simulation [3, 5, 6]. An analytical model is the abstraction of
a system based on probability theory, representing the description of a formal system
consisting of equations used to estimate the performance of the system. However, it is
difficult to represent all situations in the real world using an analytical model because
such a model requires a restricted set of assumptions, such as infinite queue capacity
and unbounded inter-arrival and service times, which rarely hold
in the real world. A simulation is often used to analyze the queuing system when a
theory for the system equations is unknown or the algorithm for the equations is too
complicated to be solved in closed-form. Computer simulation involves the formulation
of a mathematical model, often including a diagram. This model is then translated into
computer code, which is then executed and compared against a physical, or real-world,
system’s behavior under a variety of conditions.
Queuing model simulations can be expensive in terms of time and resources in
cases where the models are composed of multiple resource nodes and tokens that
flow through the system. Therefore, there is a need to find ways to speed up queuing
model simulations so that analyses can be obtained more quickly. Past approaches to
speeding up queuing model simulations have used asynchronous message-passing
with special emphasis on two approaches: the conservative and the optimistic
approaches [7]. Both approaches have been used to synchronize the asynchronous
13
logical processors (LPs), preserving causal relationships across LPs so that the results
obtained are exactly the same as those produced by sequential simulation. Most studies
of parallel simulation have been performed on multiple instruction, multiple data (MIMD)
machines, or related networks, where each processor executes a part of the simulation
model, or LP. Parallel simulation approaches that partition the simulation model into
several LPs can easily be employed for a queuing model simulation, since the start of
each execution need not be explicitly synchronized with the other LPs.
A graphics processing unit (GPU) is a processor that renders 3D graphics in real
time, and which contains several sub-processing units. Recently, the GPU has become
an increasingly attractive architecture for solving compute-intensive problems for general
purpose computation, which is called general-purpose computation on GPUs (GPGPU)
[8–11]. Availability as a commodity and increased computational power make the GPU
a substitute for expensive clusters of workstations in a parallel simulation, at a relatively
low cost. For much of the history of GPU development, there has been a need to map
the model into the graphics application programming interface (API), which limited the
availability of the GPU to those experts who had GPU- and graphics-specific knowledge.
This drawback has been resolved with the advent of the GeForce 8 series GPUs [12]
and compute unified device architecture (CUDA) [13, 14]. The control of the unified
stream processors on the GeForce 8 series GPUs is transparent to the programmer,
and CUDA provides an efficient environment for developing parallel code in the
high-level language C, without the need for graphics-specific knowledge.
In contrast to the previously ubiquitous MIMD approach to parallel computation
within the context of simulation research, the GPU is single instruction, multiple data
(SIMD)-based hardware that is oriented toward stream processing. SIMD hardware
is a relatively simple, inexpensive, and highly parallel architecture; however, there
are limits to developing an asynchronous model due to its synchronous operation.
Stream processing [15, 16] is the basic programming model of SIMD architecture. The
14
stream processing approach exploits data and task parallelism by mapping data flow to
processors, and provides efficient communication by accessing memory in a predictable
pattern using a producer-consumer locality as well. For these reasons, most simulation
models on the GPU are time-synchronous and compute-intensive models with stream
memory access.
However, queuing models are typical asynchronous models, and their temporal
events are relatively fine-grained. Queuing models are usually simulated based on event
scheduling with manipulation of the future event list (FEL). Event scheduling tends to be
a sequential operation, which often overwhelms the execution times of events in queuing
model simulations. Another problem lies in the dynamic data structure for the event
scheduling method in discrete event simulations. Dynamic data structures cannot be
directly used on the GPU because dynamic memory allocation is not supported during
kernel execution. Moreover, the randomized memory access for individual data cannot
take advantage of massive parallelism on the GPU.
Nonetheless, the GPU can become useful hardware for facilitating fine-grained
discrete event simulations, especially for large-scale models, with the concurrent
utilization of a number of threads and fast data transfer between processors. The
execution time of each event can be very small, but a higher data parallelism with
clustering of the events can be achieved for a large-scale model.
The objective of this dissertation is to simulate asynchronous queuing networks
using GPU-based hardware acceleration. Two main issues related to this study are:
(1) how can we simulate asynchronous models on SIMD hardware? And (2) how can
we achieve a higher degree of parallelism? Investigations of these two main issues
reveal that further attention must be paid to the following related issues: (a) parallel
event scheduling, (b) data consistency without explicit support for mutual exclusion, (c)
event clustering, and (d) error estimation and correction. This dissertation presents an
approach to resolve these challenges.
15
1.2 Contributions to Knowledge
1.2.1 A GPU-Based Toolkit for Discrete Event Simulation Based on Parallel Event Scheduling
We have developed GPU-based simulation libraries for CUDA so that the GPU can
easily be used for discrete event simulation, especially for a queuing network simulation.
A GPU is designed to process array-based data structures for the purpose of processing
pixel images in real time. The framework includes the functions for event scheduling and
queuing models that have been developed using arrays on the GPU.
In discrete event simulation, the event scheduling method occupies a large
portion of the overall simulation time. The FEL implementation, therefore, needs to
be parallelized in order to take full advantage of the GPU architecture. A concurrent
priority queue approach [17, 18] allows each processor to access the global FEL in
parallel on shared memory multiprocessors. The concurrent priority queue approach,
however, cannot be directly applied to SIMD-based hardware since the concurrent
insertion and deletion of the priority queue usually involves mutual exclusion, which is
not natively supported by GeForce 8800 GTX GPU [13].
Parallel event scheduling allows us to achieve significant speedup in queuing model
simulations on the GPU. A GPU has many threads executed in parallel, and each thread
can concurrently access the FEL. If the FEL is decomposed into many sub-FELs, and
each sub-FEL is exclusively accessed by one thread, the access to one element in the
FEL is guaranteed to be isolated from other threads. Exclusive access to each element
allows event insertion and deletion to be concurrently executed.
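The idea can be illustrated with a small CUDA kernel in which the FEL is stored as one slot per facility and thread i is the only thread that ever touches slot i. This is a hedged sketch under assumed names (fel_time, fel_token, out_token) and an assumed sentinel convention, not the toolkit's actual interface.

/* Hypothetical sketch: the FEL as one slot per facility, so thread i is the
 * sole reader and writer of fel_time[i] and fel_token[i].  Field names are
 * illustrative.  Launch example: ExtractEvents<<<(n + 255) / 256, 256>>>(...); */
__global__ void ExtractEvents(float *fel_time, int *fel_token,
                              float clock, float dt, int *out_token, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    /* Exclusive access: insertion and deletion in slot i never conflict
     * with any other thread, so no lock is required. */
    if (fel_time[i] <= clock + dt) {
        out_token[i] = fel_token[i];   /* event extracted in this step    */
        fel_time[i]  = 1.0e30f;        /* large sentinel marks slot empty */
    } else {
        out_token[i] = -1;             /* no event to execute yet         */
    }
}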
1.2.2 Mutual Exclusion Mechanism for GPU
We have reorganized the processing steps in a queuing model simulation by
employing alternate updates between the FEL and service facilities so that they can be
updated in SIMD fashion. The new procedure enables us to prevent multiple threads
16
from simultaneously accessing the same element, without having explicit support for
mutual exclusion on the GPU.
An alternate update is a lock-free method for achieving mutual exclusion on the GPU
when two interacting arrays must be kept consistent with each other. Only one of the
arrays can be accessed exclusively through the thread index when the indexes of the
two arrays are not inter-related. If one array needs to update the other, elements of the
other array would be accessed arbitrarily by the threads, and data consistency cannot
be maintained if two or more threads concurrently access the same element. The other
array is therefore updated only after the thread index is switched so that each thread
exclusively accesses its own element. The updated array, however, has to search all of
the elements in the requesting array to find the requests. If the updated array knows
in advance which elements of the requesting array are likely to request an update, the
number of searches can be limited. Each node in a queuing network usually knows
its incoming edges, which makes it possible to reduce the number of searches during
an alternate update and thereby reduce the overall execution time.
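A hedged sketch of the alternate update follows: the first kernel is indexed by source facility and only records requests, while the second kernel is indexed by destination facility and scans only its known incoming edges, so each array is written exclusively through its own thread index. The kernel names, the flat edge layout, and the -1 convention are assumptions made for the example.

/* Pass 1: one thread per facility.  A facility with a departure this step
 * writes the id of its destination facility into its own request slot. */
__global__ void PostRequests(const int *departing, const int *dest,
                             int *request, int n_facilities)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < n_facilities)
        request[s] = departing[s] ? dest[s] : -1;   /* -1 means no request */
}

/* Pass 2: one thread per facility.  Each facility scans only the source
 * facilities on its incoming edges, so it is the sole writer of its own
 * queue length and no lock is needed. */
__global__ void ApplyRequests(const int *request, const int *in_edge,
                              const int *in_degree, int *queue_len,
                              int n_facilities, int max_in)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= n_facilities) return;

    for (int k = 0; k < in_degree[f]; ++k) {
        int s = in_edge[f * max_in + k];   /* source facility on edge k   */
        if (request[s] == f)
            queue_len[f] += 1;             /* exclusive write by thread f */
    }
}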
1.2.3 Event Clustering Algorithm on SIMD Hardware
SIMD-based simulation is useful when a lot of computation is required by a single
instruction with different data. However, its potential problems include the bottleneck in
the control processor and load imbalance among processors. The bottleneck problem
should not be significant when applying the CPU/GPU approach, since the CPU is
designed to process heavyweight threads, whereas the GPU is designed to process
lightweight threads and to execute arithmetic equations quickly [16].
The load imbalance problem can be resolved by employing a time-synchronous/event
algorithm in order to achieve a higher degree of parallelism. A single timestamp alone
supplies few events to execute in parallel, since events in queuing models are irregularly
spaced in time. Thus, event times need to be modified so that events can be clustered and synchronized.
A time-synchronous/event algorithm is the SIMD-based hybrid approach to two common
types of discrete simulation: discrete event and time-stepped. The algorithm adopts the
17
advantages of both methods to utilize the GPU. The simulation clock advances when the
event occurs, but the events in the middle of the time interval are executed concurrently.
A time-synchronous/event algorithm naturally leads to approximation errors in the
summary statistics yielded from the simulation, because the events are not executed at
their precise timestamp.
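On the host side, the hybrid loop can be pictured as follows: the clock jumps to the earliest pending event time, and every event whose timestamp falls within the next time interval Δt is executed by one kernel call. This is only a sketch; MinReduceTime, ExecuteEvents, UpdateFacilities, and the Facility type are hypothetical stand-ins, not the framework's actual kernels.

/* Host-side sketch of a hybrid time-synchronous/event loop.  dt = 0 would
 * reproduce exact event ordering; a larger dt clusters more events per step
 * at the cost of accuracy. */
void RunHybridSimulation(float *d_fel_time, int *d_fel_token,
                         Facility *d_facility, int n,
                         float dt, float end_time, int blocks, int threads)
{
    float clock = 0.0f;
    while (clock < end_time) {
        float t_min = MinReduceTime(d_fel_time, n);   /* parallel reduction   */
        if (t_min >= end_time) break;
        clock = t_min;                                /* event-driven advance */

        /* Every event with a timestamp in [clock, clock + dt] is executed
         * in the same SIMD step. */
        ExecuteEvents<<<blocks, threads>>>(d_fel_time, d_fel_token,
                                           d_facility, clock, dt, n);
        UpdateFacilities<<<blocks, threads>>>(d_facility, d_fel_time,
                                              d_fel_token, clock, n);
        cudaDeviceSynchronize();
    }
}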
We investigated three different types of queuing models to observe the effects of
our simulation method, including an implementation of a real-world application (mobile
ad hoc network model). The experimental results of our investigation show that our
algorithm has different impacts on the statistical results and performance of three types
of queuing models.
1.2.4 Error Analysis and Correction
The error in our simulation is a numerical error since we preserve timestamp
ordering and the causal relationships of events, and the result is approximate in terms
of gathered summary statistics. The error may be acceptable for those modeled
applications where the analyst is more concerned with speed, and can accept relatively
small inaccuracies in summary statistics. In some cases, the error can be approximated
and potentially corrected to yield more accurate statistics. We present a method for
estimating the potential error incurred through event clustering by combining queuing
theory and simulation results. This method can be used to obtain a closer approximation
to the summary statistics through partially correcting the error.
1.3 Organization of the Dissertation
This dissertation is organized into 6 chapters. Chapter 2 reviews background
information, including the queuing model, sequential and parallel discrete event
simulation, GPU, and CUDA. Chapter 3 describes related work. We discuss other
studies for discrete event simulation on SIMD hardware, and a tradeoff between
accuracy and performance. Chapter 4 describes a GPU-based library and applications
framework for discrete event simulation. We introduce the routines that support parallel
18
event scheduling with mutual exclusion and queuing model simulations. Chapter
5 discusses a theoretical methodology and its performance analysis, including the
tradeoffs between numerical errors and performance gain, as well as the approaches
for error estimation and correction. Chapter 6 provides a summary of our findings and
introduces areas for future research.
19
CHAPTER 2
BACKGROUND
2.1 Queuing Model
Queues are commonly found in most human-engineered systems where there exist
one or more shared resources. Any system where the customer requests a service for
a finite-capacity resource may be considered to be a queuing system [1]. The grocery
store, theme parks, and fast-food restaurants are well-known examples of queuing
systems. A queuing system can also be referred to as a system of flow. A new customer
enters the queuing system and joins the queue (i.e., line) of customers unless the
queue is empty, while another customer who completes service may exit the system at
the same time. During execution, a waiting line forms in the system because the arrival
time of each customer is not predictable, and the service time often exceeds the customer
inter-arrival times. A significant number of arrivals makes each customer wait in line
longer than usual. Queuing models are constructed by a scientist or engineer to analyze
the performance of a dynamic system where waiting can occur. In general, the goals of
a queuing model are to minimize the average number of waiting customers in a queue
and to predict the estimated number of facilities in a queuing system. The performance
results of queuing model simulation are produced at the end of a simulation in the form
of aggregate statistics.
A queuing model is described by its attributes [2, 6]: customer population, arrival
and service pattern, queue discipline, queue capacity, and the number of servers. A new
customer from the calling population enters into the queuing model and waits for service
in the queue. If the queue is empty and the server is idle, a new customer is immediately
sent to the server for service; otherwise, the customer joins the waiting line and remains
in the queue until the server becomes idle and the customer is next to be served. When a customer
enters into the server, the status of the server becomes busy, not allowing any more
20
[Figure 2-1 diagram: customers from the calling population arrive at the queue, wait for service, are served one at a time by the server, and then depart; the labeled components are the calling population, arrival pattern, queue discipline, and service pattern.]
Figure 2-1. Components of a single server queuing model
arrivals to gain access to the server. After being served, a customer exits the system.
Figure 2-1 illustrates a single server queue with its attributes.
The calling population, which can be either finite or infinite, is defined as the pool
of customers who possibly can request the service in the near future. If the size of the
calling population is infinite, the arrival rate is not affected by the customers already in
the system. The arrival rate varies with the number of customers who have already arrived,
however, if the size of the calling population is finite and small. Arrival and service patterns are the two most important
factors determining behaviors of queuing models. A queuing model may be deterministic
or stochastic. For the stochastic case, new arrivals occur in a random pattern and their
service times are drawn from a probability distribution. The arrival and service rates, based
on observation, are provided as the values of parameters for stochastic queuing models.
The arrival rate is defined as the mean number of customers per unit time, and the
service rate is defined by the capacity of the server in the queuing model. If the service
rate is less than the arrival rate, the size of the queue will grow infinitely. The arrival rate
must be less than the service rate in order to maintain a stable queuing system [1, 6].
21
Table 2-1. Notations for queuing model statistics
Notation   Description
ar_i       Arrival time of customer i
a_i        Inter-arrival time of customer i
ā          Average inter-arrival time
λ          Arrival rate
T          Total simulation time
n          Number of arrived customers
s_i        Service time of the ith customer
μ          Service rate
ss_i       Service start time of the ith customer
d_i        Departure time of the ith customer
q̄          Mean wait time
w̄          Mean residence time
ρ          Utilization
B          System busy time
I          System idle time
The randomness of arrival and service patterns causes the length of the waiting line in the
queue to vary.
When a server becomes idle, the next customer is selected among candidates
from the queue. The strategy used for this selection is called the queue discipline.
A queue discipline [6, 19] is a scheduling algorithm that selects the next customer from the
queue. The common queue disciplines are first-in first-out (FIFO), last-in
first-out (LIFO), service in random order (SIRO), and priority queue. The earliest-arrived
customer is usually selected from a queue in the real world; thus, the most common
queue discipline is FIFO. In a priority queue discipline, each arrival has its own priority,
and the waiting customer with the highest priority is chosen from the queue.
The purpose of building a queuing model and running a simulation is to obtain
meaningful statistics such as the server performance. The notations used for statistics
are listed in Table 2-1, and the equations for key statistics are summarized in Table 2-2.
22
Table 2-2. Equations for key queuing model statistics
Name                     Equation                    Description
Inter-arrival time       a_i = ar_i − ar_(i−1)       Interval between two consecutive arrivals
Mean inter-arrival time  ā = (Σ a_i) / n             Average inter-arrival time
Arrival rate             λ = n / T                   Number of arrivals per unit time
                         λ = 1 / ā                   Long-run average
Mean service time        s̄ = (Σ s_i) / n             Average time for each customer to be served
Service rate             μ = 1 / s̄                   Server capability per unit time
Mean wait time           q̄ = Σ(ss_i − ar_i) / n      Average time each customer spends in a queue
Mean residence time      w̄ = Σ(d_i − ar_i) / n       Average time each customer stays in the system
System busy time         B = Σ s_i                   Total service time of the server
System idle time         I = T − B                   Total idle time of the server
System utilization       ρ = B / T                   Proportion of the time in which the server is busy
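As a concrete illustration of how these statistics are computed from raw simulation output, the following small C routine applies the equations of Table 2-2 to an array of per-customer records. The Customer struct and its field names are invented for this sketch; only the formulas follow the table (the first inter-arrival time is measured from time 0).

#include <stdio.h>

typedef struct {
    double ar;   /* arrival time ar_i        */
    double ss;   /* service start time ss_i  */
    double s;    /* service time s_i         */
    double d;    /* departure time d_i       */
} Customer;

void PrintQueueStats(const Customer *c, int n, double T)
{
    double sum_a = 0.0, sum_s = 0.0, sum_q = 0.0, sum_w = 0.0;
    for (int i = 0; i < n; ++i) {
        double prev = (i > 0) ? c[i - 1].ar : 0.0;
        sum_a += c[i].ar - prev;          /* inter-arrival time a_i    */
        sum_s += c[i].s;                  /* service time s_i          */
        sum_q += c[i].ss - c[i].ar;       /* wait time ss_i - ar_i     */
        sum_w += c[i].d - c[i].ar;        /* residence time d_i - ar_i */
    }
    double a_bar  = sum_a / n;            /* mean inter-arrival time   */
    double lambda = (double)n / T;        /* arrival rate              */
    double s_bar  = sum_s / n;            /* mean service time         */
    double mu     = 1.0 / s_bar;          /* service rate              */
    double q_bar  = sum_q / n;            /* mean wait time            */
    double w_bar  = sum_w / n;            /* mean residence time       */
    double rho    = sum_s / T;            /* utilization = B / T       */

    printf("lambda=%.3f mu=%.3f rho=%.3f q=%.3f w=%.3f a=%.3f\n",
           lambda, mu, rho, q_bar, w_bar, a_bar);
}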
2.2 Discrete Event Simulation
2.2.1 Event Scheduling Method
Discrete event simulation changes the state variables at a discrete time when
the event occurs. An event scheduling method [20] is the basic paradigm for discrete
event simulation and is used along with a time-advanced algorithm. The simulation
clock indicates the current simulated time, that is, the event time of the last event occurrence. The
unprocessed, or future, events are stored in a data structure called the FEL. Events in
the FEL are usually sorted in non-decreasing timestamp order. When the simulation
starts, the head of the FEL is extracted, updating the simulation clock. The
extracted event is then sent to an event routine, which produces a new event after
23
[Figure 2-2 diagram: the FEL holds token 5 (time 12, event 2), token 6 (time 15, event 1), and token 3 (time 18, event 3); (1) NEXT_EVENT extracts the head, (2) the clock advances from 10 to 12, (3) event routine 2 executes the event, and (4) SCHEDULE inserts the new event, token 5 (time 17, event 3), back into the FEL.]
Figure 2-2. Cycle used for event scheduling
its execution. The new event is inserted into the FEL, which is kept sorted in non-decreasing
timestamp order. This cycle repeats until the simulation ends.
Figure 2-2 illustrates the basic cycle for event scheduling [20]. Three future events
are stored in the FEL. When NEXT_EVENT is called, token ID #5 with timestamp 12
is extracted from the head of the FEL. The simulation clock then advances from 10 to
12. The event is executed at event routine 2, which creates a new future event, event #3.
Token ID #5 with event #3 is scheduled and inserted into the FEL. Token ID #5 is placed
between token ID #6 and token ID #3 after comparing their timestamps. The event loop
iterates, calling NEXT_EVENT until the simulation ends.
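The cycle above can be captured in a few lines of sequential C. The sketch below uses a sorted array as the FEL, with Schedule and NextEvent standing in for SCHEDULE and NEXT_EVENT; the event record and routine are invented for illustration and mirror the values in Figure 2-2.

#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 1024

typedef struct { double time; int token; int type; } Event;

static Event  fel[MAX_EVENTS];
static int    fel_size = 0;
static double clock_now = 0.0;

/* Insert, keeping the FEL in non-decreasing timestamp order. */
static void Schedule(Event e)
{
    int i = fel_size++;
    while (i > 0 && fel[i - 1].time > e.time) { fel[i] = fel[i - 1]; --i; }
    fel[i] = e;
}

/* Extract the head of the FEL and advance the simulation clock. */
static Event NextEvent(void)
{
    Event head = fel[0];
    memmove(fel, fel + 1, --fel_size * sizeof(Event));
    clock_now = head.time;
    return head;
}

int main(void)
{
    Schedule((Event){ 12.0, 5, 2 });       /* initial future events */
    Schedule((Event){ 15.0, 6, 1 });
    Schedule((Event){ 18.0, 3, 3 });

    while (fel_size > 0 && clock_now < 100.0) {
        Event e = NextEvent();             /* extract head, update clock   */
        printf("t=%5.1f token=%d event=%d\n", clock_now, e.token, e.type);
        if (e.type == 2)                   /* event routine schedules a    */
            Schedule((Event){ clock_now + 5.0, e.token, 3 });  /* new event */
    }
    return 0;
}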
The priority queue is the abstract data structure for an FEL. The priority queue
involves two operations for processing and maintaining the FEL: insert and delete-min.
The simplest way to implement the priority queue is to use an array or a linked list.
These data structures store events in a linear order by event time but are inefficient
24
for large-scale models, since a newly inserted event must compare its event time with
all others in the sequence. An array and a linked list take O(N) time for insertion and
O(1) time for deletion on average, where N is the number of elements in the data
structure. When an event is inserted, an array can be accessed faster than a linked list,
since the elements of an array are stored contiguously in memory. On the other hand, an
FEL using an array requires its own dynamic storage management [20].
The heap and splay tree [21] are data structures typically used for an FEL. They are
tree-based data structures and can execute operations faster than a linear data structure
such as an array. A min heap, implemented as a height-balanced binary tree, takes
O(log N) time for both insertion and deletion. A splay tree is a self-balancing binary tree
in which accessing an element rearranges the tree, placing that element at the root. This
allows recently accessed elements to be referenced again quickly. The splay tree
performs both operations in O(log N) amortized time. The heap and splay tree are therefore
suitable data structures for the priority queue of a large-scale model.
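For reference, a sketch of the array-backed binary min-heap operations follows, keyed by event time. It is a generic textbook implementation with a fixed capacity and no error handling, not the data structure used later on the GPU.

typedef struct { double time; int token; } HeapEvent;
typedef struct { HeapEvent e[1024]; int size; } Heap;

static void HeapInsert(Heap *h, HeapEvent ev)
{
    int i = h->size++;
    h->e[i] = ev;
    while (i > 0 && h->e[(i - 1) / 2].time > h->e[i].time) {   /* sift up */
        HeapEvent tmp = h->e[i];
        h->e[i] = h->e[(i - 1) / 2];
        h->e[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static HeapEvent HeapDeleteMin(Heap *h)
{
    HeapEvent min = h->e[0];
    h->e[0] = h->e[--h->size];
    int i = 0;
    for (;;) {                                                 /* sift down */
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < h->size && h->e[l].time < h->e[m].time) m = l;
        if (r < h->size && h->e[r].time < h->e[m].time) m = r;
        if (m == i) break;
        HeapEvent tmp = h->e[i]; h->e[i] = h->e[m]; h->e[m] = tmp;
        i = m;
    }
    return min;
}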
A calendar queue [22] operates with a hash function and performs both
operations in O(1) time on average. Each bucket is a "day" covering a specific time range,
with its own data structure for storing events in timestamp order. Enqueue and
dequeue operations hash on the event time. The number
of buckets and the range of a day are adjusted so that the hash function operates efficiently.
Calendar queues are efficient when events are evenly distributed across buckets, which
minimizes the adjustment of bucket sizes.
2.2.2 Parallel Discrete Event Simulation
In traditional parallel discrete event simulation (PDES) [7, 23, 24], the model
is decomposed into several LPs, and each LP is assigned to a processor used for
parallel simulation. Each LP runs its own independent part of the simulation with local
clock and state variables. When LPs need to communicate with each other, they send
timestamped messages to each other over a system bus or via a networking system.
25
Each local clock advances at a different pace because the interval between consecutive
events on the LP is irregular. For this reason, the timestamp of incoming events from
other LPs can be earlier than that of the currently executed event. A causality error
occurs if such an incoming event should have changed a state variable to which the current
event refers. Violating causality can produce incorrect results.
As a result, a synchronization method needs to process events in a non-decreasing
timestamp order and to preserve causal relationships across processors. The
performance gains are not proportional to the increased number of processors due
to the synchronization overhead. Conservative and optimistic approaches are two main
categories in synchronization.
2.2.2.1 Conservative synchronization
In conservative synchronization methods, each processor executes events when it
can guarantee that other processors will not send events with a smaller timestamp than
that of the current event. Conservative methods can cause a deadlock situation between
LPs because every LP can block the event if it is considered to be unsafe to process.
Deadlock avoidance, and deadlock detection and recovery are two major challenges of
conservative synchronization methods.
Chandy and Misra [25] and Bryant [26] developed a deadlock avoidance algorithm.
The necessary and sufficient condition is that the messages are sent to other LPs over
the links in non-decreasing timestamp order, which guarantees that the processor will
not receive an event with a lower timestamp than the previous one. A null message is
sent to avoid the deadlock, indicating that the processor will not later send a message
with a timestamp smaller than that of the null message. The timestamp of a null message is determined
by each incoming link, which provides the lower bound of the timestamp when the next
event occurs. The lower bound is determined by the knowledge of the simulation such
as lookahead, or the minimum timestamp increment for a message passing between
LPs. The variations of the null message method tried to reduce the number of null
26
messages based on demand since the amount of null message traffic can degrade
performance [27].
The deadlock detection and recovery proposed by Chandy and Misra [28] tried
to eliminate the use of null messages. The deadlock recovery approach allows the
processors to become deadlocked. When the deadlock is detected, the recovery
function is called. A controller, used to break the deadlock, identifies the event
containing the smallest timestamp among the processors, and sends the messages
to that LP indicating that the event is safe to process.
Barrier synchronization is one of the conservative synchronization approaches.
The lower bound on the timestamp (LBTS¹) is calculated from the time of the next
event and the lookahead, and it determines the time at which all processors stop execution
so that events can be processed safely. Events are executed only if their timestamps are
less than the LBTS. The distance between LPs is often used to determine the LBTS, as in
air traffic simulation, since it implies the minimum time needed to transmit an event from
one LP to another.
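The "safe to process" test at the heart of these conservative schemes can be written compactly. The following C sketch assumes a hypothetical per-LP record of incoming link clocks and lookaheads; it simply computes the lower bound described above and compares it against the next local event time.

#include <float.h>

typedef struct {
    double link_clock[8];   /* last timestamp seen on each incoming link  */
    double lookahead[8];    /* minimum increment promised by each sender  */
    int    n_links;
} LP;

/* Lower bound on the timestamp of any message this LP can still receive. */
static double LowerBoundTS(const LP *lp)
{
    double lbts = DBL_MAX;
    for (int k = 0; k < lp->n_links; ++k) {
        double bound = lp->link_clock[k] + lp->lookahead[k];
        if (bound < lbts) lbts = bound;
    }
    return lbts;
}

/* An event is safe only if its timestamp is below the lower bound. */
static int SafeToProcess(const LP *lp, double next_event_time)
{
    return next_event_time < LowerBoundTS(lp);
}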
Conservative approaches are easy to implement but performance relies on
lookahead. Lookahead is the minimum time increment by which a new event is scheduled
into the future; thus, a lookahead of L guarantees that no event with a timestamp smaller
than the current clock plus L will be generated. Lookahead is used to predict the next incoming
events from other processors when the processor determines whether the current event is safe.
If the lookahead is too small or zero, the currently executed event can cause all events
on the other LPs to wait. In this case, the events are executed nearly sequentially.
¹ LBTS is defined as the "lower bound on the timestamp of any message an LP can receive in the future" in [7], p. 77.
27
2.2.2.2 Optimistic synchronization
In optimistic methods, each processor executes its own events regardless of those
received from other processors. However, each processor has to roll back the simulation
when it detects a causality error from event execution in order to recover the system.
Rollback in a parallel computing environment is a complicated process because some of
the messages sent to other LPs also need to be canceled.
Time-warp [29] is the most well-known scheme in optimistic synchronization. Time
warp has two major parts: the local and global control mechanisms. The local control
mechanism assumes that each local processor executes the events in timestamp order
using its own local virtual clock. When an LP sends a message to others, the identical
message, except for one field, is created. The original message sent from the LPs has a
positive sign, and its corresponding copy, called an antimessage, has a negative sign. Each LP
maintains three queues. The state queue contains snapshots of the LP's recent states at
instants in time. The state changes whenever an event occurs and is enqueued
in the state queue. Messages received from other LPs are stored in an input queue in
the timestamp order. The antimessage, produced by its own LP, is stored at the output
queue. When the timestamp of the arrival event is earlier than the local virtual time of
the LP, the LP encounters a causality error. The state prior to the timestamp of the
arriving message is restored from the state queue. Antimessages are dequeued
from the output queue and sent to other LPs, if their timestamps are between the arrival
event and the local virtual time. When the LP receives an antimessage, they annihilate
each other to cancel future events if the input queue contains the corresponding positive
message. The LP is rolled back by an antimessage if the corresponding positive
messages are already executed.
Global virtual time (GVT) helps solve several problems with the local control
mechanism of Time Warp, such as memory management, the global control
of rollback, and the safe commitment time. The GVT is defined by the minimum of the
28
local virtual time among LPs and the timestamp of messages in transit, and serves
as a lower bound for the virtual times of the LPs. GVT allows efficient memory
management because previous states whose times are earlier than the GVT no longer
need to be maintained. Duplicate antimessages are often produced while the
LP re-executes events, causing a performance problem. Lazy
cancellation waits to send an antimessage until the LP checks whether re-execution
produces the same messages, whereas lazy reevaluation uses state vectors, instead of
messages, to address this problem [7].
In the optimistic approach, past states are saved for recovery, which introduces one
of its most significant drawbacks: memory management. State saving [30]
makes copies of the past states during simulation. Copy state saving (CSS) copies the
entire states of simulation before each event occurs. CSS is the easiest method for
state saving, but two drawbacks are the huge memory consumption to save the entire
states and the performance overhead during rollback. Periodic state saving (PSS) sets
checkpoints at intervals, skipping a few events between them. The performance is improved with
PSS, but all state values still have to be saved at the checkpoint. Incremental state
saving (ISS) is a method based on backtracking. Only the values and addresses of
modified variables are stored before the events execute. The old values are written to
the variables in reverse order when the states need to be restored. ISS reduces the
memory consumption and execution overheads, but the programmer has to add the
modules to handle each variable.
Reverse computation (RC) [31] was proposed to solve the limitation of the state
saving method for forward computation. RC does not save the values of state variables
during simulation. Computation is performed in reverse order to recover the values
of state variables until it reaches the checkpoint when the rollback is initiated. RC
uses bit variables to record the changes; thus, it can drastically reduce memory
consumption during simulation, especially for fine-grained models.
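A toy C example of reverse computation for a single arrival event is shown below. The Server state, the one-bit record, and the handlers are invented for illustration only: the forward handler records one bit per conditional change, and the reverse handler undoes the operations in the opposite order instead of restoring a saved state.

typedef struct { int queue_len; int busy; } Server;
typedef struct { unsigned started_service : 1; } EventBits;

static void ForwardArrival(Server *s, EventBits *b)
{
    s->queue_len += 1;                 /* unconditional: reversible by -1     */
    b->started_service = 0;
    if (!s->busy) {                    /* conditional change: remember in bit */
        s->busy = 1;
        s->queue_len -= 1;             /* customer goes straight into service */
        b->started_service = 1;
    }
}

static void ReverseArrival(Server *s, const EventBits *b)
{
    if (b->started_service) {          /* undo the conditional part first  */
        s->queue_len += 1;
        s->busy = 0;
    }
    s->queue_len -= 1;                 /* then undo the unconditional part */
}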
29
2.2.2.3 A comparison of two methods
Each synchronization approach has a drawback [32]. It takes considerable time to
run a simulation with zero lookahead in the conservative method. It is also too difficult to
roll back a simulation system to the previous state without error if we run the simulation
with a complicated model using the optimistic method. In general, the optimistic method
has an advantage over the conservative in that the execution is allowed where a
causality error is possible, but actually does not exist. In addition, the conservative
method often needs specific information for the application to determine when it is safe
to process the events, but it is not very relevant to an optimistic approach [23]. In some
cases, a very small lookahead prevents the simulation from proceeding in parallel even
though it can proceed sequentially. Finding the lookahead and its size can be critical factors
determining the performance gains of the conservative method [24]. However, the optimistic
mechanism is much more complex to implement, and frequent rollbacks cause more computation
overhead for a compute-intensive system. If the model is too complex to apply the
optimistic method, the conservative method is a better choice. On the other hand, if a
very small lookahead is expected, the optimistic method has to be applied.
2.3 GPU and CUDA
2.3.1 GPU as a Coprocessor
A GPU is a dedicated graphics processor that renders 3D graphics in real time,
which requires tremendous computational power. The computation speed of the
GeForce 8800 GTX is approximately four times faster than that of an Intel Core2
Quad processor with 3.0 GHz, which is approximately twice as expensive as the
GeForce 8800 GTX [13]. The growth of CPU clock speed has slowed since 2003
due to physical limitations, so Intel and AMD turned their attention to multi-core
architectures [33]. GPU speed, on the other hand, continues to grow
because more transistors can be devoted to parallel data processing than to data caching
and flow control on the GPU. Programmability is another reason that the GPU has
30
become attractive. The vertex and fragment processors can be customized with the
user’s own program.
The GPU has different features compared to the CPU [16]. The CPU is designed
to process general purpose programs. For this reason, CPU programming models
and their processes are generally serial, and the CPU enables the complex branch
controls. The GPU, however, is dedicated to processing the pixel image in real time,
thus it has much more parallelism than the CPU does. The CPU returns memory
references quickly to process as many jobs as possible, maximizing its throughput by
minimizing memory latency. As a result, a single thread on a CPU can produce
higher performance than one on a GPU. The GPU, on the other hand, maximizes
parallelism through threads. The performance of a single thread on a GPU is not as
good as that on a CPU, but executing threads in a massively parallel fashion
hides the memory latency and produces high throughput from parallel tasks. In addition,
more transistors are dedicated to data computation than to data caching
and flow control on the GPU, which gives the GPU a great advantage over the CPU when
cache misses occur [34].
Despite many advantages, harnessing the power of the GPU has been considered
difficult because GPU-specific knowledge, such as graphics APIs and hardware details,
is needed to work with the programmable GPU. Traditional GPUs have two types of
programmable processors: vertex and fragment [35]. Vertex processors transform
the streams of vertices which are defined by positions, colors, textures and lighting.
The transformed vertices are converted into fragments by the rasterizer. Fragment
processors compute the color of each pixel to render the image. Graphics shader
programming languages, such as Cg [36] and HLSL [37], allow the programmer to write
the code for the vertex and fragment processors in a high-level programming language.
Those languages are easier to learn than assembly language, but are still
graphics-specific, assuming that the user has a basic knowledge of interactive graphics
31
programming. The program, therefore, needs to be written in a graphics fashion, using
textures and pixels, by mapping the computational variables to graphics primitives through
a graphics API [38] such as DirectX or OpenGL, even for general purpose computations.
Another problem was the constrained memory layout and access. Indirect
writes, or scatter operations, were not possible because there is no write instruction in the
fragment processor [39]. As a result, implementing sparse data structures, such
as lists and trees, where scattering is required, is problematic, removing flexibility
from programming. The CPU can handle memory easily because it has a unified
memory model, but this is not trivial on the GPU because memory cannot be written
anywhere [35]. Finally, the advent of the GeForce 8800 GTX GPU and CUDA eliminated
these limitations and provided an easy solution for the programmer.
2.3.2 Stream Processing
Stream processing [15, 16] is the basis of the GPU programming model today. The
application of stream processing is divided into several parts for parallel processing.
Each part is referred to as a kernel, which is a programmed function to process
the stream and is independent of the incoming stream. The stream is a sequence
of elements composed of the same type and it requires the same instruction for
computation. Figure 2-3 shows the relationship between the stream and the kernel. The
stream processing model can process the input stream on each ALU at the same kernel
in parallel since the elements of the input stream are independent of one another. Also, stream
processing allows many streams to be processed concurrently at different kernels,
which hides the memory latency and communication delay. However, the stream
processing model is less flexible and not well suited to general purpose programs with
randomized data access, because a stream is passed directly to the next kernel
in sequence after it is processed. Stream processing can consist of several
stages, each of which has several kernels. Data parallelism is exploited by processing
32
[Figure 2-3 diagram: input data flows as streams through a network of kernels, each kernel consuming an input stream and producing an output stream for the next kernel, until the output data emerges.]
Figure 2-3. Stream and kernel
many streams in parallel at each stage and task parallelism is exploited by running
several stages concurrently.
Many cores can be utilized concurrently with a stream programming model. For
example, the GeForce 8800 GTX has 16 multiprocessors, and each can hold a maximum
of 768 threads. Theoretically, more than ten thousand threads can be executed in
parallel, yielding a high degree of parallelism.
2.3.3 GeForce 8800 GTX
The GeForce 8800 GTX [12, 13] GPU is the first GPU model unifying vertex,
geometry and fragment shaders into 128 individual stream processors. The previous
GPUs have the classic pipeline model with a number of stages to render the image
from the vertices. Many passes inside the GPU consume the bandwidth. Moreover,
some stages are not required for general purpose computations, which degrades
the performance of general purpose workloads on the GPU.
Figure 2-4 [40] illustrates the difference of pipeline stages between the traditional and
GeForce 8 series GPUs. In the GeForce 8800 GTX GPU, the shaders have been unified
into stream processors, which reduces the number of pipeline stages and changes
the sequential processing into loop-oriented processing. Unified stream processors
help to improve load balancing. Any graphical data can be assigned to any available
33
[Figure 2-4 diagram: the traditional pipeline proceeds from the application through command, vertex/geometry, rasterization, and fragment stages to the display, whereas the GeForce 8 series pipeline replaces the fixed vertex and fragment stages with programmable stream processors that loop around the rasterization stage.]
Figure 2-4. Traditional vs. GeForce 8 series GPU pipeline
stream processor, and its output stream can be used as an input stream of other stream
processors.
Figure 2-5 [41] shows the GeForce 8800 GTX architecture. The GPU consists of
16 stream multiprocessors (SMs). Each SM has 8 stream processors (SPs), which
makes a total of 128. Each SP contains a single arithmetic unit that supports IEEE 754
single-precision floating-point arithmetic and 32-bit integer operations, and can process
the instruction in SIMD fashion. Each SM can take up to 8 blocks or 768 threads, which
makes for a total of 12,288 threads, and 8192 registers on each SM can be dynamically
allocated into the threads running on it.
34
[Figure 2-5 diagram: a thread execution manager feeds 16 streaming multiprocessors (SMs), each containing stream processors (SPs), an instruction unit, and shared memory, all connected to global memory.]
Figure 2-5. GeForce 8800 GTX architecture
2.3.4 CUDA
CUDA [13] is a C programming language API for utilizing the NVIDIA class
of GPUs. CUDA, therefore, does not require a steep learning curve and provides
a simplified solution for those who are not familiar with graphics
hardware and APIs. The user can focus on the algorithm itself rather than on its
implementation with CUDA. When the program is written in CUDA, the CPU is a host
that runs the C program, and the GPU is a device that operates as a co-processor to the
CPU. The application is programmed as a C function, called a kernel, which is downloaded
to the GPU when compiled. The kernel uses memory on the GPU; memory allocation
and data transfer from the CPU to the GPU therefore need to be done before the kernel
invocation.
CUDA exploits data parallelism by utilizing a massive number of threads
simultaneously after partitioning larger problems into smaller elements. A thread is
the basic unit of execution that uses its unique identification to exclusively access parts
of elements in the data. The much smaller cost of creating and switching threads (as
compared to the higher costs associated with the CPU) makes the GPU more efficient
when running in parallel. The programmer organizes the threads in a two-level hierarchy.
35
A kernel invocation creates a grid (the unit of execution of a kernel). A grid consists of
a group of thread blocks that execute a single kernel with the same instruction and
different data. Each thread block consists of a batch of threads that can share data
with other threads through a low-latency shared memory. Moreover, their executions
are synchronized within a thread block to coordinate memory accesses by barrier
synchronization using the __syncthreads() function. Threads in the same block need to
reside on the same SM for the efficient operation, which restricts the number of threads
in a single block.
In the GeForce 8800 GTX, each block can take up to 512 threads. The programmer
determines the degree of parallelism by assigning the number of threads and blocks for
executing a kernel. The execution configuration has to be specified when invoking the
kernel on the GPU by defining the grid and block dimensions and the number of bytes of
shared memory per block, in an expression of the following form, where the shared memory size is optional:
KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(parameters);
The corresponding function is defined on the GPU as __global__ void
KernelFunc(parameters), where the __global__ qualifier indicates that the function runs
on the computing device, or GPU. Data are copied from the host, or CPU, to global
memory on the GPU and are loaded into shared memory. After performing the
computation, the results are copied back to the host via PCI-Express.
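A hedged host-side sketch of this sequence is shown below; the problem size, the KernelFunc parameter list, and the 8x128 execution configuration are assumptions for illustration only, and error checking is omitted:

    int n = 1024;                                   // hypothetical problem size
    size_t bytes = n * sizeof(float);
    float *hostData = (float *)malloc(bytes);
    float *deviceData;
    // ... fill hostData with input ...

    cudaMalloc((void **)&deviceData, bytes);        // allocate global memory on the device
    cudaMemcpy(deviceData, hostData, bytes,
               cudaMemcpyHostToDevice);             // copy the input to the GPU before the launch

    KernelFunc<<<8, 128, 0>>>(deviceData, n);       // 8 blocks of 128 threads, no dynamic shared memory

    cudaMemcpy(hostData, deviceData, bytes,
               cudaMemcpyDeviceToHost);             // copy the results back to the host via PCI-Express
    cudaFree(deviceData);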
Each SM processes a grid by scheduling batches of thread blocks, one after
another, but block ordering is not guaranteed. The number of thread blocks in one batch
depends upon how much shared memory and how many registers are assigned
per block and per thread, respectively. The currently executing blocks are referred to as
active blocks, and each one is split into groups of threads called warps. The number
of threads in a warp is called the warp size, and it is set to 32 on the GeForce 8 series. At
each clock cycle, the threads in a warp are physically executed in parallel. Warps are
executed alternately by time-slicing scheduling, which hides the memory access
latency.

[Figure 2-6 diagram: the host executes sequentially, invokes kernel 1, which launches Grid 1 (thread blocks of threads) on the device, resumes sequential execution, invokes kernel 2, which launches Grid 2, and then continues sequential execution on the host.]

Figure 2-6. Execution between the host and the device

The number of thread blocks can increase if we can decrease the size of shared
memory per block and the number of registers per thread. However, the kernel fails to
launch if the available shared memory per thread block is insufficient.
The overall performance depends on how effectively the programmer assigns
threads and blocks, keeping as many threads busy as possible. Each SM can typically
hold 3 thread blocks of 256 threads, or 6 blocks of 128 threads. The 16 KB of shared
memory available to a thread block can limit the number of threads in a thread block
and the number of elements for which each thread is responsible.
Figure 2-6 shows the interaction between the host and the device. A host executes the
C program in sequence before invoking kernel 1. A kernel invocation creates a grid,
which includes a number of blocks and threads, and maps one or more blocks onto one
SM. After a kernel finishes executing in parallel on the device, the host continues to
execute the program.
CHAPTER 3
RELATED WORK
3.1 Discrete Event Simulation on SIMD Hardware
In the 1990s, efforts were made to parallelize discrete event simulations using a
SIMD approach. Given a balanced workload, SIMD had the potential to significantly
speed up simulations. The research performed in this area was focused on replication.
The processors were used to parallelize the choice of parameters by implementing a
standard clock algorithm [42, 43]. Ayani and Berkman [44] used SIMD for parallelizing
simultaneous event executions, but SIMD was determined to be a poor choice because
of the uneven distribution of timed events. There was a need to fill the gap between
asynchronous applications and synchronous machines so that the SIMD machine could
be utilized for asynchronous applications [45].
Recently, the computer graphics community has widely published on the use of the
GPU for physical and geometric problem solving, and for visualization. These types of
models have the property of being decomposable over a variable or parameter space,
such as cellular automata [46] for discrete spaces and partial differential equations
(PDEs) [47, 48] for continuous spaces. Queuing models, however, do not strictly adhere
to the decomposability property.
Perumalla [49] has performed a discrete event simulation on a GPU by running a
diffusion simulation. Perumalla’s algorithm selects the minimum event time from the
list of update times, and uses it as a time-step to synchronously update all elements
on a given space throughout the simulation period. This approach is useful if a single
event in the simulation model causes large amounts of computation, where the event
occurrences are not so frequent. Queuing models, in contrast, have many events, but
each event does not require significant computation. A number of events with different
timestamps in queuing model simulations could make the execution nearly sequential
with this algorithm.
Xu and Bagrodia [50] proposed a discrete event simulation framework for network
simulations. They used the GPU as a co-processor to distribute compute-intensive
workloads for high-fidelity network simulations. Other parallel computing architectures
are combined to perform the computation in parallel. A field programmable gate array
(FPGA) and a Cell processor are included for task-parallel computation, and a GPU
is used for data-parallel computation. A fluid-flow-based TCP and a high-fidelity
physical layer model are exploited to utilize the GPU. The former is modeled with
driven differential equations, and the latter uses the adaptive antenna algorithm which
recursively updates the weights of the beamformers using least squares estimation. The
event scheduling method on the CPU sends those compute-intensive events to the GPU
whenever events occur.
These two examples demonstrate methodologies for running a discrete event
simulation on the GPU, but neither is applicable to improving the performance of
queuing model simulations on the GPU. In those GPU simulations, 2D or 3D spaces
represent the simulation results, and these spaces are implemented as arrays on the
GPU. Their models are easily adapted to the GPU by partitioning the result array and
computing each partition in parallel, since a single event in their simulation models
updates all elements in the result array at once. In queuing models, however, an
individual event makes changes only to a single element (e.g., a service facility) in the
result array, which makes it difficult to parallelize queuing model simulations. Queuing
model simulations need many concurrent events to benefit from the GPU.
Lysenko and D’Souza [51] proposed a GPU-based framework for large scale agent
based model (ABM) simulations. In ABM simulation, sequential execution using discrete
event simulation techniques makes the performance too inefficient for large scale ABM.
Data-parallel algorithms for environment updates, and agent interaction, death, and birth
were, therefore, presented for GPU-based ABM simulation. This study used an iterative
randomized scheme so that agent replication could be executed in O(1) average time in
parallel on the GPU.
3.2 Tradeoff between Accuracy and Performance
Some studies of parallel simulation have focused on enhancing performance at the
expense of accuracy, while others have focused on accuracy with a view to improving
performance. Tolerant synchronization [52] uses the lock-step method to process the
simulation conservatively, but it allows the processor to execute the event optimistically
if the timestamp is less than the tolerance point in the synchronization. The recovery
procedure is not called, even if a causality error occurs, until the timestamp reaches the
tolerance point.
Synchronization with a fixed quantum is a lock-step synchronization [53] that
ensures that all events are properly synchronized before advancing to the next quantum.
However, a quantum that is too small causes a significant slowdown of overall execution
time. In an adaptive synchronization technique [54], the quantum size is adjusted based
on the number of events at the current lock-step. A dynamic lock-step value improves
the performance with a larger quantum value, thus reducing the synchronization
overhead when the number of events is small and where the error rate is low.
State-matching is the most dominant overhead in a time-parallel simulation [7],
as is synchronization in a space-parallel simulation. If the initial and final states are
not matched at the boundary of a time interval, re-computation of those time intervals
degrades simulation performance. Approximation simulations [55, 56] have been used to
improve the simulation performance, albeit with a loss of accuracy.
Fujimoto [32] proposed exploiting temporal uncertainty, which introduces
approximate time. Approximate time is a time interval for the execution of an event,
rather than a precise timestamp, and is assigned to each event based on its timestamp.
When approximate time is used, the time intervals of events on different LPs can
overlap on the timeline at a common point. Whereas events on different
LPs have to wait for a synchronization signal under a conservative method when a precise
timestamp is assigned, approximate-timed events can be executed concurrently if their
time intervals overlap with each other. The performance is improved due to increased
concurrency, but at the cost of accuracy in the simulation result. Our approach differs
from this method in that we do not assign a time interval to each event: instead, events
are clustered at a time interval when they are extracted from the FEL. In addition,
approximate time is executed on a MIMD scheme that partitions the simulation
model, whereas our approach is based on a SIMD scheme.
3.3 Concurrent Priority Queue
The priority queue is an abstract data structure that has been widely used as an
FEL for discrete event simulation. The global priority queue is commonly used and
accessed sequentially for the purpose of ensuring consistency in PDES on shared
memory multiprocessors. The concurrent access of the priority queue has been studied
because the sequential access limits the potential speedup in parallel simulation
[17, 18]. Most concurrent priority queue approaches have been based on mutual
exclusion, locking part of a heap or tree when inserting or deleting the events so that
other processors would not access the currently updated element [57, 58]. However, this
blocking-based algorithm limits potential performance improvements to a certain degree,
since it involves several drawbacks, such as deadlock and starvation, which cause
the system to be in idle or wait states. The lock-free approach [59] avoids blocking
by using atomic synchronization primitives and guarantees that at least one active
operation can be processed. PDES that use the distributed FEL or message queue
have improved their performance by optimizing the scheduling algorithm to minimize the
synchronization overhead and to hide communication latency [60, 61].
3.4 Parallel Simulation Problem Space
The parallel simulation problem space can be classified by time/space behavior and by
the class of parallel computer, as shown in Figure 3-1. Parallel simulation models fall into
two major categories: continuous and discrete. Most physical simulations are continuous
simulations (e.g., ordinary and partial differential equations, cellular automata); however,
complex human-made systems (e.g., communication networks) tend to have a discrete
structure. Discrete models can be further categorized by the behavior of the simulation
model: asynchronous (discrete-event) and synchronous (time-stepped) models.
Asynchronous models can be classified according to how the partitioning is done. The
examples of each branch in Figure 3-1 are summarized in Table 3-1.

[Figure 3-1 diagram: the problem space branches by time/space behavior (continuous or discrete), by behavior of discrete models (asynchronous or synchronous), by partitioning method for asynchronous models (space or event), and by architecture (MIMD, SIMD, GPU); the leaves (1)-(10) correspond to the examples in Table 3-1.]

Figure 3-1. Diagram of parallel simulation problem space

Table 3-1. Classification of parallel simulation examples
Index   Examples
(1)     Ordinary differential equations [62]
(2)     Reservoir simulation [63]
(3)     Cloud dynamics [47], N-body simulation [48]
(4)     Chandy and Misra [25], Time-warp [29]
(5)     Ayani and Berkman [44], Shu and Wu [45]
(6)     Partial differential equations [64]
(7)     Cellular automata [65]
(8)     Retina simulation [46]
(9)     Diffusion simulation [49], Xu and Bagrodia [50]
(10)    Our queuing model simulation
CHAPTER 4
A GPU-BASED APPLICATION FRAMEWORK SUPPORTING FAST DISCRETE EVENT SIMULATION
4.1 Parallel Event Scheduling
SIMD-based computation has a bottleneck problem in that some operations,
such as instruction fetch, have to be implemented sequentially, which causes many
processors to be halted. Event scheduling in SIMD-based simulation can be considered
as a step of instruction fetch that distributes the workload to each processor. The
sequential operations in a shared event list can be crucial to the overall performance
of simulation for a large-scale model. Most implementations of the concurrent priority
queue have been run on MIMD machines; their asynchronous operation reduces the
number of lock conflicts at any instant of simulation time. However, it is inefficient to
implement a concurrent priority queue with a lock-based approach on SIMD hardware,
especially a GPU, because the points in time at which multiple threads access the priority
queue are synchronized. This produces many locks for mutual exclusion, making the
operations almost sequential. Moreover, sparse and dynamic data structures, such as
heaps, cannot be implemented directly on the GPU, since the GPU is optimized to process
dense and static data structures such as linear arrays.
Both insert and delete-min operations re-sort the FEL in timestamp order. Other
threads cannot access the FEL during the sort, since all the elements in the FEL are
sorted if a linear array is used for the data structure of the FEL. The concept of parallel
event scheduling is that an FEL is divided into many sub-FELs, and only one of them
is handled by each thread on the GPU. An element index that is used to access the
element in the FEL is calculated by a thread ID combined with a block ID, which allows
each thread to access its elements in parallel without any interference from other
threads. In addition, keeping the global FEL unsorted guarantees that each thread
can access its elements, regardless of the operations of other threads. The number of
while (current time is less than simulation time)    // executed by multiple threads
    minimumTimestamp = ParallelReduction(FEL);
    for each local FEL by each thread in parallel do
        currentEvent = ExtractEvent(minimumTimestamp);
        nextEvent = ExecuteEvent(currentEvent);
        ScheduleEvent(nextEvent);
    end for each
end while
Figure 4-1. The algorithm for parallel event scheduling
elements that each thread is responsible for processing at the current time is calculated
by dividing the number of elements in the FEL by the number of threads.
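A rough sketch of this indexing scheme is given below; the kernel and variable names are illustrative, and the flat FEL layout is assumed to match the attribute arrays used in the later figures:

    __global__ void ProcessLocalFEL(float *FEL, int numTokens, int numThreads)
    {
        // unique element index built from the block ID and the thread ID
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int tokensPerThread = numTokens / numThreads;     // size of each thread's local FEL
        // each thread scans only its own contiguous slice of the global, unsorted FEL
        for (int i = tid * tokensPerThread; i < (tid + 1) * tokensPerThread; i++) {
            // examine or update token i without interfering with other threads
        }
    }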
As a result, the heads of the global FEL and each local FEL accessed by each
thread are not the events with the minimum timestamp. Instead, the smallest timestamp
is determined by parallel reduction [14, 66], using multiple threads. With this timestamp,
each thread compares the minimum timestamp with that of each element in the local
FEL to find and extract the current active events (delete-min). After the current events
are executed in parallel, new events are created by the current events. The currently
extracted elements in the FEL are re-written by updating the attributes, such as an
event and its time (insert). The algorithm for parallel event scheduling on the GPU is
summarized in Figure 4-1.
Additional operations are needed for a queuing model simulation. The purpose of
discrete event simulation is to analyze the behavior of the system [67]. In a queuing
model simulation, a service facility is the system to be analyzed. Service facilities
are modeled in arrays as resources that contain information regarding server status,
current customers, and their queues. Scheduling the incoming customer to the service
facility (Arrival), releasing the customer after its service (Departure), and manipulating
the queue when its server is busy are the service facility operations. Queuing model
simulations also benefit from the tens of thousands of threads on the GPU. However,
there are some issues to be considered, since the arrays of both the FEL and service
facility reside in global memory, and threads share them.
4.2 Issues in a Queuing Model Simulation
4.2.1 Mutual Exclusion
Most simulations that are run on a GPU use 2D or 3D spaces to represent the
simulation results. The spaces and the variables, used for updating those spaces, are
implemented in an array on the GPU. The result array is updated based on variable
arrays throughout the simulation. For example, the velocity array is used for updating
the result array by a partial differential equation in a fluid simulation. The result array is
dependent on the variable arrays, but not vice versa: the changes
of velocity make the result different, but the result array does not
change the velocity. This kind of update is one-directional. Mutual exclusion is not
necessary, since each thread is responsible for a fixed number of elements, and does
not interfere with other threads.
However, the updates in a queuing model simulation are bi-directional. One event
simultaneously updates both the FEL and service facility arrays. Bi-directional updates
occurring at the same time may cause their results to be incorrect, because one of
the element indexes–either the FEL or the service facility–cannot be accessed by
other threads independently. For example, consider a concurrent request to the same
service facility that has only one server, as shown in Figure 4-2A. Both threads try to
schedule their token to the server because its idle status is read by both threads at the
same time. The simultaneous writing to the same location leads to the wrong result in
thread #1, as shown in Figure 4-2B. We need a mutual exclusion algorithm because
data inconsistency can occur when updating both arrays at the same time. The mutual
exclusion involved in this environment is different from the case of the concurrent priority
queue, in that two different arrays concurrently attempt to update each other and are
accessed by the same element index.
[Figure 4-2A diagram: facility #1 is busy serving token #3 and facility #2 is idle. Thread #1 holds token #1 (time 2, event ARRIVAL, facility #2, status Free) and thread #2 holds token #2 (time 2, event ARRIVAL, facility #2, status Free); both threads read the idle status of facility #2 at the same time.]

A A concurrent request from two threads

[Figure 4-2B diagram: facility #2 ends up busy with token #2, while both threads mark their tokens as Served.]

B The incorrect results for a concurrent request. The status of token #1 should be Queue.

Figure 4-2. The result of a concurrent request from two threads without a mutual exclusion algorithm
The simplest way to implement mutual exclusion is to separate both updates.
Alternate access between the FEL and service facility can resolve this problem. When
updates are happening in terms of the FEL, each extracted token in the FEL stores
information about the service facility, indicating that an update is required at the next
step. Service facilities are then updated based on these results. Each service facility
searches the FEL to find the extracted tokens that are related to itself at the current time.
Then, the extracted tokens are placed into the server or queue at the service facility for
an arrival event, or the status of the server turns to idle for a departure event. Finally, the
locations of extracted tokens in the FEL are updated using the results of the updated
service facility.
One of the biggest problems for discrete event simulation on a GPU is that events are
selectively updated. A few events occurring at any one event time make it difficult to fully
utilize all the threads at once; the approach is more efficient when the model has as many
concurrent events as possible. One way to improve performance is to cluster events into
one event time. If the event time can be rounded to an integer or to one decimal place,
more events can occur concurrently.
However, a causality error can occur because two or more tokens with different
timestamps may have the same timestamp, due to the rounding of the timestamp. The
correct order must be maintained; otherwise, the statistical results produced will be
different. Wieland [68] proposed a method to treat simultaneous events. The event
times of simultaneous events are altered by adding or subtracting a threshold so that
each event has a different timestamp. His method deals with originally simultaneous
events that are unknown in their correct order, but simultaneous events in our method
were non-simultaneous events before their timestamps were rounded. We use an
original timestamp to maintain the correct event order for simultaneous events. If
two tokens arrive at the same service facility with the same timestamp due to the
rounding of the timestamp, the token with the smaller original timestamp has priority.
An original timestamp is maintained as one of the attributes in the token. For originally
simultaneous events, the service facility randomly breaks the tie and determines their order.
The pseudocode for mutual exclusion algorithms with clustering events is summarized in
Figure 4-3.
// update the FEL
for each token in the FEL by each thread in parallel do
    if (Token.Time is less than or equal to the rounded minimum timestamp)
        Token.Extracted = TRUE;
    end if
end for each

// update the service facility
for each service facility by each thread in parallel do
    for each token in the FEL do
        if (Token.Extracted == TRUE && Token.Facility == currentServiceFacility)
            if (Token.Event == DEPARTURE)
                Facility.ServerStatus = IDLE;
            else if (Token.Event == ARRIVAL)
                Add the token into the requestTokenList;
            end if
        end if
    end for each
    // sort the current token list in original timestamp order
    sortedTokenList = Sort(requestTokenList);
    if (Facility.ServerStatus == BUSY)
        Place all tokens into the queue in sorted order;
    else if (Facility.ServerStatus == IDLE)
        Place the head token from sortedTokenList into the server, and
        place the others into the queue;
    end if
end for each

// update the FEL
for each token in the FEL by each thread in parallel do
    if (Token.Extracted == TRUE)
        Token.Extracted = FALSE;
        Token.Time = nextEventTime;
        Token.Event = nextEvent;
        Token.Status = SERVED or QUEUE;
    end if
end for each
Figure 4-3. A mutual exclusion algorithm with clustering events
4.2.2 Selective Update
The alternate update used for mutual exclusion introduces another issue.
Each extracted token in the FEL has information about the service facility, whereas each
service facility does not know which token has requested the service at the current time
during the alternate update. Each service facility searches the entire FEL to find the
requested tokens, which is executed in O(N) time. This sequential search significantly
degrades performance, especially for a large-scale model simulation. The number
of searched tokens for each service facility, therefore, needs to be reduced for the
performance of parallel simulations. One of the solutions is to use the incoming edges
of each service facility because a token enters the service facility only from the incoming
edges. If we limit the number of searches to the number of incoming edges, the search
time is reduced to O(Maximum number of edges) time.
A departure event can be executed at the first step of mutual exclusion because it
does not cause any type of thread collisions between the FEL and service facility. For a
departure event, no other concurrent requests for the same server at the service facility
exist, since the number of released tokens from one server is always one. Therefore,
each facility can store the just-released token when a departure event is executed. Each
service facility refers to its neighbor service facilities to check whether they released
the token at the current time during the update of itself. Performance may depend on
the simulation model, because search time depends on the maximum number of edges
among service facilities.
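A sketch of this bounded neighbor search is shown below; the edge-list layout and helper name are hypothetical, while numOfFacAttr and ReleasedToken refer to the facility attributes used in Figure 4-8:

    __device__ void CheckIncomingEdges(float *Facility, int *incomingEdges,
                                       int facilityId, int maxEdges)
    {
        // visit only the facilities that have an edge into facilityId
        for (int e = 0; e < maxEdges; e++) {
            int neighbor = incomingEdges[facilityId * maxEdges + e];
            if (neighbor < 0)
                continue;                                            // unused edge slot
            int tokenId = (int)Facility[neighbor * numOfFacAttr + ReleasedToken];
            // if the neighbor released tokenId toward facilityId at the current time,
            // schedule it here; the search cost is O(maxEdges) rather than O(N)
        }
    }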
4.2.3 Synchronization
Threads in the same thread block can be synchronized with shared local memory,
but the executions of threads in different thread blocks are completely independent
of each other. This independent execution removes the dependency of assignments
between the thread blocks and processors, allowing thread blocks to be scheduled
across any processor [14].
For a large-scale queuing model, arrays for both the FEL and service facility reside
in global memory. Both arrays are accessed and updated by an element ID in sequence.
If these steps are not synchronized, some indexes are used to access the FEL, while
others are used to update the service facility. The elements in both arrays may then
have incorrect information when updated.
We can obtain the effect of synchronization between blocks if the kernel is
decomposed into multiple kernels [66]. The alternating accesses to the two arrays need
to be developed as multiple kernels, and invoking these kernels in sequence from
the CPU explicitly synchronizes the thread blocks. One of the bottlenecks in CUDA
implementation is data transfer between the CPU and GPU, but sequential invocations
of kernels provide a global synchronization point without transferring any data between
them.
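A host-side sketch of this kernel decomposition follows; NextEventTime corresponds to Figure 4-4, whereas the other kernel names and signatures are hypothetical placeholders for the update steps described in section 4.4:

    void RunSimulation(float *devFEL, float *devFacility, float *devMinTime,
                       int numBlocks, int numThreads, int threadSize,
                       float simulationTime)
    {
        float currentTime = 0.0f;
        while (currentTime < simulationTime) {
            // each kernel launch below completes before the next one starts,
            // giving a global synchronization point without copying the arrays back
            NextEventTime<<<numBlocks, numThreads>>>(devFEL, devMinTime, threadSize);
            ExtractAndRelease<<<numBlocks, numThreads>>>(devFEL, devFacility);
            UpdateFacility<<<numBlocks, numThreads>>>(devFEL, devFacility);
            ScheduleNewEvents<<<numBlocks, numThreads>>>(devFEL, devFacility);
            // only the (already reduced) minimum timestamp is read back each iteration
            cudaMemcpy(&currentTime, devMinTime, sizeof(float), cudaMemcpyDeviceToHost);
        }
    }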
4.3 Data Structures and Functions
4.3.1 Event Scheduling Method
FEL Let a token denote any type of customer that requests service at the service
facility. The FEL is therefore a collection of unprocessed tokens, and tokens are
identified by their ID without being sorted in non-decreasing timestamp order. Each
element in the FEL has its own attributes: token ID, event, time, facility, and so on.
An FEL is represented as a two-dimensional array, and each one-dimensional array
consists of attributes of a token. Table 4-1 shows an instant status of the FEL with some
of the attributes. For example, token ID #3 will arrive at facility #3 at the simulation time
of 2. Status represents the specific location of the token at the service facility. Free is
assigned when the token is not associated with any service facility. Token #1, placed in
the queue of facility #2, cannot be scheduled for service until the server becomes idle.
Finding the Minimum Timestamp: NextEventTime The minimum timestamp
is calculated by parallel reduction without re-sorting the FEL. Parallel reduction is a
tree-based approach, and the number of comparisons is cut in half at each step. Each
Table 4-1. The future event list and its attributes
Token ID   Event       Time   Facility   Status
#1         Arrival     2      #2         Queue
#2         Departure   3      #3         Served
#3         Arrival     2      #3         Free
#4         Departure   4      #1         Served
thread finds the minimum value by comparing a fixed length of input. The number of
threads that is used for comparison is also cut in half after each thread completes
calculating the minimum value from its input. Finally, the minimum value is stored in
thread ID 0. The minimum timestamp is calculated by invoking the NextEventTime
function which returns the minimum timestamp. The CUDA-style pseudocode for
NextEventTime is illustrated in Figure 4-4. We have modified the parallel reduction
code [66] in the NVIDIA CUDA software development kit to develop the NextEventTime
function.
Comparing elements in global memory is very expensive, and additional
memory space is required to prevent the FEL from being re-sorted. Although the
shared memory, 16 KB per thread block, is too small to hold a large-scale model at once,
iterative executions allow the shared memory to be used for a large-scale model.
As an intermediate step, each block produces one minimum timestamp. At the start of
the next step, comparisons of the results between the blocks should be synchronized. In
addition, the number of threads and blocks used for comparison at the block-level step
will be different from those used at the thread-level step, due to the size of the remaining
elements. The different number of threads and blocks at the various steps as well as
the need for global synchronization requires that parallel reduction be invoked from the
CPU.
Event Extraction and Approximate Time: NextEvent When the minimum
timestamp is determined, each thread extracts the events with the smallest timestamp
by calling the NextEvent function. Figure 4-5 shows the pseudocode for the NextEvent
__global__ void
NextEventTime(float *FEL, float *minTime, int ThreadSize)
{
    __shared__ float eTime[BlockSize];
    int tid = threadIdx.x;
    int eid = blockIdx.x*BlockSize + threadIdx.x;
    int m = 0, j = 0, k = 0;

    // copy some parts of event times from the FEL to shared memory
    for (int i = eid*ThreadSize; i < eid*ThreadSize + ThreadSize; i++) {
        eTime[tid*ThreadSize + (m++)] = FEL[i*numOfTokenAttr + Time];
    }
    __syncthreads();

    // compare event times
    for (int i = 1; i < BlockSize*ThreadSize; i*=2) {
        // find the minimum value within each thread
        if (i < ThreadSize) {
            j = 0; k = 1;
            for (int m = 1; m <= ThreadSize/(2*i); m++) {
                if (eTime[tid*ThreadSize + j*i] > eTime[tid*ThreadSize + k*i]) {
                    eTime[tid*ThreadSize + j*i] = eTime[tid*ThreadSize + k*i];
                }
                j = j + 2;
                k = k + 2;
            }
        }
        // comparison between threads
        else {
            if ((tid % ((2*i)/ThreadSize) == 0) && (eTime[tid] > eTime[tid + i])) {
                eTime[tid] = eTime[tid + i];
            }
        }
        __syncthreads();
    }

    // copy the minimum value to global memory
    if (tid == 0) {
        minTime[blockIdx.x] = eTime[0];
    }
}
Figure 4-4. Pseudocode for NextEventTime
__device__ int
NextEvent(float *FEL, int elementIndex, int interval)
{
    if (FEL[elementIndex*numOfTokenAttr + Time] <= interval &&
        FEL[elementIndex*numOfTokenAttr + Status] != QUEUE)
        return TRUE;
    else
        return FALSE;
}
Figure 4-5. Pseudocode for NextEvent
function. Approximate time is calculated by assigning the interval that the event time can
be rounded to, so that more events are clustered into that time. Events are extracted
from the FEL, unless tokens are in a queue. Each thread executes one of the event
routines in parallel after the events are extracted.
Event Insertion and Update: Schedule New events are scheduled and inserted
into the FEL by the currently executed events. Each element that is executed at the
current time updates the current status (e.g. next event time, served facility, queue, and
so on) by calling the Schedule function. Figure 4-6 illustrates the pseudocode for the
Schedule function. The Schedule function is the general function to update the element
in the FEL, as well as to schedule new events. In an open queuing network, the number
of elements in the FEL varies, due to the arrivals from and departures to outside of the
simulation model. The index is maintained to put a newly arrived token into the FEL, and
the location of an exiting token is marked as being empty. When the index reaches the
last location of the FEL, the index goes back to the first location, and keeps increasing
by 1 until it finds an empty space. The CPU is responsible for generating new arrivals
from outside of the simulation model in the open queuing network, due to the mutual
exclusion problem of the index.
__device__ void
Schedule(float *FEL, int elementIndex, float *currentToken)
{
    for (int i = 0; i < numOfTokenAttr; i++) {
        FEL[elementIndex*numOfTokenAttr + i] = currentToken[i];
    }
}
Figure 4-6. Pseudocode for Schedule
4.3.2 Functions for a Queuing Model
Service Facility The service facility is the resource that provides the token with
service in the queuing model. A service facility consists of servers and a queue. When
the token arrives at the service facility, the token is placed into the server if its server is
idle. Otherwise, the token is placed into a queue. Each service facility can have one or
more servers.
A service facility has several attributes, such as its server status, currently served
token, a number of statistics, and queue status. The queue stores the waiting tokens in
First in, First out (FIFO) order, and its capacity is defined by the programmer at the start
of the simulation. When a token exits from the service facility, the head of the queue has
priority for the next service.
The results that are of interest in running the queuing model simulation are the
summary statistics, such as the utilization of the service facility and the average wait
time in the queue. Each service facility has some fields in which to collect information
about its service, in order to provide the summary statistics at the end of the simulation.
A service facility is also represented as a two-dimensional array, and each
one-dimensional array consists of attributes of a service facility. Each queue at the
service facility is represented in a one-dimensional array with an upper limit of capacity.
Table 4-2 shows the instant status of a service facility using some of the attributes. At
service facility #1, the server is busy, and the current service for token #2 began at a
__device__ void
Request(float *FEL, int elementIndex)
{
FEL[elementIndex*numOfTokenAttr + Extracted] = TRUE;}
}

Figure 4-7. Pseudocode for Request
simulation time of 4. The queue at service facility #1 has one token (#3), and its total
busy time so far is 3.
Table 4-2. The service facility and its attributes
Facility ID   Server Status   Served Token   Busy Time   Service Start Time   Queue Length   Queue
#1            Busy            #2             3           4                    1              #3
#2            Busy            #1             1           2                    0              -
#3            Idle            -              0           0                    0              -
#4            Busy            #6             2           5                    2              #4, #5
Arrival: Request Arrival and departure are the two main events for the queuing
model, and both events are executed after being extracted from the FEL. The arrival
event updates the element of the requesting token in the FEL and service facility after
checking the status of its server (busy or idle). However, an arrival event, executed
by calling the Request function, only updates the status of tokens, due to the mutual
exclusion problem. The pseudocode for the Request function is illustrated in Figure 4-7.
Departure: Release The departure event also needs updates for both the FEL
and the service facility. For a departure event, it is possible to update both of them, as
specified in the previous section. When the event is a departure event, token information
is updated by calling the Schedule function. Then, the Release function is called in order
to update the statistics of the service facility for the released token and the status of
its server. Figure 4-8 illustrates the pseudocode for the Release function. When the
Release function is executed, the index of the updated service facility is determined
__device__ void
Release(float *Facility, int released, int tokenId)
{
    Facility[released*numOfFacAttr + BusyTime]
        += currentTime - Facility[released*numOfFacAttr + ServiceStart];
    Facility[released*numOfFacAttr + NumOfServed]++;
    Facility[released*numOfFacAttr + ServerStatus] = IDLE;
    Facility[released*numOfFacAttr + ReleasedToken] = tokenId;
}

Figure 4-8. Pseudocode for Release
by the released token, not by the element index, as shown in Figure 4-8. The Release
function stores the currently released token for selective update.
Scheduling the Service Facility: ScheduleServer The service facility places
the currently requesting tokens into the server or queue, after searching the FEL, by
calling the ScheduleServer function. Figure 4-9 illustrates the pseudocode for the
ScheduleServer function. The token in the queue has priority, and is placed into the
server if the queue is not empty. For two or more tokens, the token with the minimum
original timestamp is placed into the server, and others are placed into the queue if the
queue is empty. The token is dropped from the service facility when the queue is full.
The head and tail indexes are used to insert a token into (EnQueue), and delete a token
from (DeQueue) the queue.
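The queue helpers are not listed in this chapter; the sketch below shows one plausible circular-buffer realization, assuming hypothetical attribute offsets (QueueHead, QueueTail, QueueBase) and a fixed queueCapacity stored with the other facility attributes:

    __device__ void EnQueue(float *Facility, int facilityIndex, int tokenId)
    {
        int tail = (int)Facility[facilityIndex * numOfFacAttr + QueueTail];
        Facility[facilityIndex * numOfFacAttr + QueueBase + tail] = tokenId;   // store at the tail slot
        Facility[facilityIndex * numOfFacAttr + QueueTail] = (tail + 1) % queueCapacity;
    }

    __device__ int DeQueue(float *Facility, int facilityIndex)
    {
        int head = (int)Facility[facilityIndex * numOfFacAttr + QueueHead];
        int tokenId = (int)Facility[facilityIndex * numOfFacAttr + QueueBase + head];
        Facility[facilityIndex * numOfFacAttr + QueueHead] = (head + 1) % queueCapacity;
        return tokenId;                                                        // head token has priority
    }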
Collecting Statistics: PrintResult Each service facility has several attributes for
summary statistics, including the accumulated busy time, the number of served tokens,
and the average length of the queue. When each event occurs at the service facility,
these attributes are updated. At the end of the simulation, these attributes are copied
to the CPU. The summary statistics are produced by calling the PrintResult function.
The PrintResult function is a CPU-side function with no parameters, and it returns
summary statistics including utilization, throughput, and mean wait time.
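The exact output of PrintResult is not reproduced here, but the usual measures follow directly from the accumulated attributes; a CPU-side sketch with hypothetical field names is:

    #include <stdio.h>

    void PrintFacilityStats(float busyTime, float numOfServed,
                            float totalQueueWait, float simTime)
    {
        float utilization = busyTime / simTime;             // fraction of time the server was busy
        float throughput  = numOfServed / simTime;          // tokens served per unit of simulated time
        float meanWait    = totalQueueWait / numOfServed;   // average time a token spent in the queue
        printf("utilization=%.3f throughput=%.3f mean wait=%.3f\n",
               utilization, throughput, meanWait);
    }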
__device__ void
ScheduleServer(float *Facility, int elementIndex, int *currentList)
{
    // if the queue is not empty, put the head of the queue into the server
    if (!IsEmpty(Facility, elementIndex)) {
        selectedToken = DeQueue(Facility, elementIndex);
        startIndex = 0;
    }
    // if the queue is empty
    else {
        // put the token with the minimum original timestamp into the server
        selectedToken = MinOriginalTime(currentList);
        startIndex = 1;
    }
    Facility[elementIndex*numOfFacAttr + ServedToken] = selectedToken;
    Facility[elementIndex*numOfFacAttr + ServerStatus] = BUSY;
    Facility[elementIndex*numOfFacAttr + ServiceStart] = currentTime;

    // put other tokens into the queue
    for (int i = startIndex; i < currentListSize; i++) {
        queueLength = QueueLength(Facility, elementIndex);
        if (queueLength >= queueCapacity) {
            // drop the current token
            break;
        }
        else {
            EnQueue(Facility, elementIndex, currentList[i]);
        }
    }
}
Figure 4-9. Pseudocode for ScheduleServer
4.3.3 Random Number Generation
In discrete event simulations, the time duration for each state is modeled as a
random variable [67]. Inter-arrival and service times in queuing models are the types
of variables that are modeled as specified statistical distributions. The Mersenne
twister [69] is used to produce the seeds for a pseudo-random number generator since
bitwise arithmetic and an arbitrary amount of memory writes are suitable for the CUDA
programming model [70]. Each thread block updates the seed array for the current
execution at every simulation step. Those seeds with statistical distributions, such
as uniform and exponential distributions, then produce the random numbers for the
variables.
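The step from a uniform random number to the distributions used in the queuing model is standard inverse-transform sampling; a sketch, assuming u is a uniform random number in (0, 1] derived from the Mersenne twister seeds and the function names are illustrative, is:

    __device__ float ExponentialVariate(float u, float mean)
    {
        // inverse-transform sampling: exponential inter-arrival or service time
        return -mean * logf(u);
    }

    __device__ int UniformFacility(float u, int numNeighbors)
    {
        // pick one of the outgoing neighbors with equal probability
        int choice = (int)(u * numNeighbors);
        return (choice < numNeighbors) ? choice : numNeighbors - 1;
    }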
4.4 Steps for Building a Queuing Model
This section describes the basic steps in developing the queuing model simulation.
Each step represents each kernel invoked from the CPU in sequence to develop the
mutual exclusion on the GPU. We have assumed that each service facility has only one
server for this example.
Step 1: Initialization The memory spaces are allocated for the FEL and service
facilities, and the state variables are defined by the programmer. The number of
elements for which each thread is responsible is determined by the problem size, as
well as by user selections, such as the number of threads in a thread block and the
number of blocks in a grid. Data structures for the FEL and service facility are copied to
the GPU, and initial events are generated for the simulation.
Step 2: Minimum Timestamp The NextEventTime function finds the minimum
timestamp in the FEL by utilizing multiple threads. At this step, each thread is
responsible for handling a certain number of elements in the FEL. The number of
elements each thread is responsible for may be different from that of other steps, if
shared memory is used for element comparison. The steps for finding the minimum
timestamp are illustrated in Figures 4-10 and 4-11. In Figure 4-10, each thread
[Figure 4-10 diagram: the FEL, divided among threads #1-#4 with two tokens each; token format ID, Time, Event (A: Arrival, D: Departure), Facility: #1, 5, A, 3 | #2, 3, D, 1 | #3, 6, D, 4 | #4, 2, A, 6 | #5, 2, D, 6 | #6, 4, A, 5 | #7, 6, D, 7 | #8, 2, A, 8.]
Figure 4-10. First step in parallel reduction
[Figure 4-11 diagram: the event times in shared memory (3 3 2 2 2 4 2 2) are reduced step by step (2 3 2 2 2 4 2 2, then 2 3 2 2 2 4 2 2) until the minimum event time, 2, is stored in the first position.]
Figure 4-11. Steps in parallel reduction
compares two timestamps, and the smaller timestamp is stored at the left location.
The timestamps in the FEL are copied to the shared memory when they are compared
so that the FEL will not be re-sorted, as shown in Figure 4-11.
Step 3: Event Extraction and Departure Event The NextEvent function extracts
the events with the minimum timestamp. At this step, each thread is responsible for
handling a certain number of elements in the FEL, as illustrated in Figure 4-12. Two
main event routines are executed at this step. A Request function executes an arrival
event partially, just indicating that these events will be executed at the current iteration.
A Release function, on the other hand, executes a departure event entirely at this
step, since only one constant index is used to access the service facility for a Release
[Figure 4-12 diagram: FEL (ID, Time, Event, Facility): #1, 5, A, 3 | #2, 3, D, 1 | #3, 6, D, 4 | #4, 2, A, 6 | #5, 2, D, 6 | #6, 4, A, 5 | #7, 6, D, 7 | #8, 2, A, 8, divided among threads #1-#4. Service facility (ID, Status (B: Busy, I: Idle), Token): #1, B, 2 | #2, I, - | #3, I, - | #4, B, 3 | #5, I, - | #6, B, 5 | #7, B, 7 | #8, I, -. The departure event turns facility #6 to (#6, I, -) and re-schedules token #5 as (#5, 6, A, 1).]
Figure 4-12. Step 3: Event extraction and departure event
function. In Figure 4-12, tokens #4, #5, and #8 are extracted for future updates, and
service facility #6 releases token #5 at this step, updating both the FEL and service
facility at the same time. Token #5 is re-scheduled when the Release function is
executed.
Step 4: Update of Service Facility The ScheduleServer function updates
the status of the server and the queue for each facility. At this step, each thread is
responsible for processing a certain number of elements in the service facility, as
illustrated in Figure 4-13. Each facility finds the newly arrived tokens by checking the
incoming edges and the FEL. If there is a newly arrived token at each service facility,
the service facilities with idle server (#2, #3, #5, #6, and #8) will place it into the server,
whereas the service facilities with busy server (#1, #4, and #7) will put it into the queue.
Token #8 is placed into the server of service facility #8. Token #4 can be located in the
server of service facility #6 because service facility #6 has already released token #5 at
the previous step.
Step 5: New Event Scheduling The Schedule function updates the executed
tokens in the FEL. At this step, each thread is responsible for processing a certain
number of elements in the FEL, as illustrated in Figure 4-14. All tokens that have
[Figure 4-13 diagram: service facility #6 places token #4 into its server (#6, B, 4) and facility #8 places token #8 into its server (#8, B, 8); the remaining FEL and facility entries are unchanged from Figure 4-12, with token #5 now listed as (#5, 6, A, 1).]
Figure 4-13. Step 4: Update of service facility
[Figure 4-14 diagram: tokens #4 and #8 are re-scheduled in the FEL as departure events, (#4, 5, D, 6) and (#8, 7, D, 8), based on the facility states of Figure 4-13.]
Figure 4-14. Step 5: New event scheduling
requested the service at the current time are re-scheduled by updating the attributes
of tokens in the FEL. Then, the control goes to Step 2, until the simulation ends. The
attributes of tokens #4 and #8 in Figure 4-14 are updated based on the results of the
previous step, as shown in Figure 4-13.
Step 6: Summary Statistics When the simulation ends, both arrays are copied to
the CPU, and the summary statistics are calculated and generated.
4.5 Experimental Results
The experimental results compare two parallel simulations with a sequential
simulation: the first is a parallel simulation with a sequential event scheduling method,
and the second is a parallel simulation with a parallel event scheduling method.
4.5.1 Simulation Environment
The experiment was conducted on an Intel Core 2 Extreme Quad 2.66GHz
processor with 3GB of main memory. The Nvidia GeForce 8800 GTX GPU [12] has
768MB of memory with a memory bandwidth of 86.4 GB/s. The CPU communicates
with the GPU via PCI-Express with a maximum of 4 GB/s in each direction. The C
version of SimPack [71] with a heap-based FEL was used as the sequential event
scheduling baseline for comparison with the parallel versions. SimPack is a simulation
toolkit that supports constructing various types of models and executing the
simulation, based on extensions of general-purpose programming languages. C,
C++, Java, JavaScript, and Python versions of SimPack have been developed. The
results reported in this dissertation are the average values of five runs.
4.5.2 Simulation Model
The toroidal queuing network model was used for the simulation. This application
is an example of a closed queuing network of interconnected service facilities.
Figure 4-15 shows an example of 3×3 toroidal queuing network. Each service facility
is connected to its four neighbors. When the token arrives at the service facility, the
service time is assigned to the token by a random number generator with an exponential
distribution. After being served by the service facility, the token moves to one of its
four neighbors, selected with uniform distribution. The mean service time of the facility
is set to 10 with exponential distribution, and the message population–the number of
initially assigned tokens per service facility–is set to 1. Each service time is rounded
to an integer so that many events are clustered into one event time. However, this will
introduce a numerical error into the simulation results because their execution times are
different from their original timestamps. The error may be acceptable in some applications,
but an error correction method may be required for more accurate results. In Chapter 5,
we analyze the error introduced by clustering events, and present the methods for error
estimation and correction.

Figure 4-15. 3×3 toroidal queuing network
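For concreteness, the per-token update in this toroidal model can be sketched as follows; the neighbor encoding and function name are hypothetical, and rounding the exponential service time to an integer is the step that clusters events (and introduces the error analyzed in Chapter 5):

    __device__ void NextHop(float u1, float u2, float meanServiceTime,
                            const int *neighbors, int *nextFacility, float *serviceTime)
    {
        // exponential service time, rounded to an integer so that more events cluster
        *serviceTime = rintf(-meanServiceTime * logf(u1));
        // one of the four toroidal neighbors chosen with uniform probability
        *nextFacility = neighbors[((int)(u2 * 4.0f)) % 4];
    }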
4.5.3 Parallel Simulation with a Sequential Event Scheduling Method
In this experiment, the CPU and GPU are combined into a master-slave paradigm.
The CPU works as the control unit, and the GPU executes the programmed codes
for events. We used a parallel simulation method based on a SIMD scheme so that
events with the same timestamp value are executed concurrently. If there are two or
more events with the same timestamp, they are clustered into a list, and each event
on the list is executed by each thread. During the simulation, the GPU produces two
random numbers for each active token: the service time at the current service facility,
drawn from an exponential distribution, and the next service facility, selected with a
uniform distribution. When the CPU
calls the kernel and passes the streams of active tokens, threads on the GPU generate
the results in parallel, and return them to the CPU. The CPU schedules the tokens using
these results.
Figure 4-16 shows the performance improvement in the GPU experiments. The
CPU-based simulation showed better performance in the 16×16 facilities because (1)
the sequential execution time in one time interval on the CPU was not long enough
compared to the data transfer time between the CPU and GPU, and (2) the number of
events in one time interval was not enough to maximize the number of threads on the GPU.

[Figure 4-16 plot: speedup versus the number of facilities (16×16 to 512×512) for the CPU-GPU simulation and the CPU-based simulation.]

Figure 4-16. Performance improvement by using a GPU as coprocessor
The GPU-based simulation outperforms the sequential simulation when (1) is satisfied,
and the performance increases when (2) is satisfied. However, the performance was
not good enough when we compare the results with other coarse-grained simulations.
In the SIMD execution, some parts of codes are processed in sequence, such as the
instruction fetch. The event scheduling method (e.g., the event insertion and extraction)
performed in sequence represents over 95% of the overall simulation time while the
event execution time (e.g., random number generation) is reduced by utilizing the GPU.
4.5.4 Parallel Simulation with a Parallel Event Scheduling Method
In the GPU experiment, the number of threads in the thread block is fixed at 128.
The number of elements that each thread processes and the number of thread blocks
are determined by the size of the simulation model. For example, there are 8 thread
blocks, and each thread only processes one element for both arrays in a 32×32 model.
[Figure 4-17 plot: speedup versus the number of facilities (16×16 to 512×512) for the GPU-based simulation and the CPU-GPU simulation.]
Figure 4-17. Performance improvement from parallel event scheduling
There are 64 thread blocks, and each thread processes 32 elements for both arrays
in a 512×512 model. Figure 4-17 shows the performance improvement in the GPU
experiments compared to sequential simulation on the CPU. The performance graph
shows an s-shaped curve. For a small simulation model, the CPU-based simulation
shows better performance, since the times to execute the mutual exclusion algorithm
and transfer data between the CPU and GPU exceed the sequential execution times.
Moreover, the number of concurrently executed events is too small. The GPU-based
simulation outperforms the sequential simulation when the number of concurrent events
is large enough to overcome the overhead of parallel execution. Finally, the performance
gradually increases when the problem size is large enough to fully utilize the threads on
the GPU. Compared to Figure 4-16, parallel event scheduling removes the bottleneck of
the simulation, and significantly improves the performance.
4.5.5 Cluster Experiment
We also conducted the simulation over a cluster using a sequential event scheduling
method. The clusters used for the simulation are composed of 24 Sun workstations
interconnected by a 100 Mbps Ethernet. Each workstation is a 1 GHz Sun SPARC
machine running version 5.8 of the Solaris operating system with 512 MB of
main memory. In this experiment, the processors are combined into a master-slave
paradigm. One master processor works as the control unit, and several slave processors
execute the programmed codes for events. Each event on the list of concurrent events
is sent to each processor. The simulation over a cluster did not demonstrate a good
performance without artificial delay, since the computation time of each event was too
short compared to the communication delay between the processors. Communication
delay of the null message between master and slave processors was measured as less
than 1 millisecond (ms), but it overwhelms the roughly ten microseconds (μs) of
computation time for each event.
Most traditional parallel discrete event simulations exchange messages between
processors in order to send an event to other processors or to use them as a signal
of synchronization. Communication delay is a critical factor in the performance of
simulation when computation granularity of events is relatively small [72]. Other
experimental results show that modest speedup is obtained from parallel simulation
with fine granularity, but speedup is relatively small compared to coarse granularity [73],
or performance is even worse than that of sequential simulation [74]. Communication
delay can be relatively negligible in the CPU-GPU simulation since communications are
handled on the same hardware.
CHAPTER 5
AN ANALYSIS OF QUEUE NETWORK SIMULATION USING GPU-BASED HARDWARE ACCELERATION
5.1 Parallel Discrete Event Simulation of Queuing Networks on the GPU
5.1.1 A Time-Synchronous/Event Algorithm
We used a parallel simulation method based on a SIMD scheme so that events
with the same timestamp value are executed concurrently. The simulation begins with
the extraction of the event with the lowest timestamp from an FEL. Event extraction
continues for as long as the next event has the same timestamp. Events with the same
timestamp are clustered into the current list of execution, and each event is executed
on each thread of the GPU. However, since it is unlikely that several events occur at
a single point of simulated time in a discrete event simulation, many threads will be
idle, resulting in wasted GPU resources and inefficient performance. We introduce a
time-synchronous/event algorithm using a time interval instead of a precise time, in
order to have more events occurring concurrently and to reduce the load imbalance on
the threads of the GPU. Clustering events within a time interval makes it possible for
many more events to be executed at a single point of simulated time, which reduces the
number of idle threads and achieves more efficient parallel processing.
A time-synchronous/event algorithm that is used to cluster more events at a
single event time is a hybrid algorithm of discrete event simulations and time-stepped
simulations. The main difference between the two types of discrete simulation is the
method used to advance time. Our approach is similar to a time-stepped simulation in
the sense that we execute events at the end of the time interval to improve the degree of
parallelism. However, a time-stepped simulation can be inefficient if the state changes
in the simulation model occur irregularly, or if event density is low at the time interval.
Although there is no event at the next time-step, the clock must advance to the next
time-step, which reduces efficiency owing to idle processing time. Our approach is
based on discrete event simulations in that the clock advances by the next event, rather
than by the next time-step.

while (current time is less than simulation time)
    minimumTimestamp = ParallelReduction(FEL);
    currentStep = the smallest multiple of the time interval greater than or
                  equal to minimumTimestamp;
    for each local FEL by each thread (or processor) in parallel do
        if (the timestamp of the event is less than or equal to currentStep)
            CurrentList += ExtractEvent(FEL);
            ExecuteEvent(CurrentList);
            Schedule new events from the results;
        end if
    end for each
end while

Figure 5-1. Pseudocode for a hybrid time-synchronous/event algorithm with parallel event scheduling
The pseudocode for a time-synchronous/event algorithm with parallel event
scheduling is illustrated in Figure 5-1. At each start of the simulation loop, the lowest
timestamp is calculated from the FEL by parallel reduction [66]. The clock is set to the
minimum timestamp, and the smallest multiple of the time interval that is greater than or
equal to the minimum timestamp is set to the current time-step. All events are extracted
from the FEL in parallel by multiple threads on the GPU if their timestamp is less than,
or equal to, the current time-step. Each extracted event is exclusively accessed and
executed by each thread on the GPU. The time interval in our approach is used to
execute events concurrently rather than to advance the clock. After executing the events,
the clock advances to the next lowest event time, and not to the next time-step.
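The current time-step used above reduces to one line of arithmetic; a sketch, with a hypothetical timeInterval parameter, is:

    __device__ float CurrentStep(float minimumTimestamp, float timeInterval)
    {
        // smallest multiple of the time interval that is >= the minimum timestamp
        return ceilf(minimumTimestamp / timeInterval) * timeInterval;
    }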
However, if events are executed only at the end of the time interval, the results
lose accuracy because each event has to be delayed in its execution compared to its
original timestamp. Fortunately, we can approximate the error due to the stochastic
nature of queues. For small and non-complex queuing networks, the analytic model
can provide the statistics without running a simulation based on queuing theory, albeit
with assumptions and approximations [1, 3]. We use queuing theory to estimate the
total error rate after we obtain the simulation results. The time interval can be another
parameter of the queuing model combined with two time-dependent parameters: arrival
and service rate. With the use of the time interval, the error rate caused by the time
interval is related to the arrival and service rates, and the amount of error depends
on the values of these parameters. The relationships between the time interval and
parameters are described in sections 5.3 and 5.4.
5.1.2 Timestamp Ordering
In parallel simulation, the purpose of synchronization is to process the events in
non-decreasing timestamps order to obtain the same results as those of a sequential
simulation. In a traditional parallel discrete event simulation, the event order can be
violated by the different speeds of event executions and communication delays between
processors, resulting in a causality error. Other simulation methods using the tradeoff
between accuracy and speedup allow the timestamp ordering to be violated within
certain limits, whereas our approach still keeps the timestamp ordering of events.
We do not need the explicit synchronization method since all the events are stored
in the global event list, and the time for each event execution is determined by the global
clock. The synchronous step of the simulation preserves the executions of events in a
non-decreasing timestamp order, blocking the event extractions from the FEL before
the current events finish scheduling the next events. The error caused by the time
interval, therefore, is different from the causality error because the timestamp ordering
is preserved, even though events are clustered at the end of the time interval. The
error in the result is the statistical error since each event does not occur at its precise
timestamp. However, a causality error can occur for the events with the same timestamp
when events are clustered by a time interval. Consider two or more tokens with different
timestamps requesting service at the same service facility. Their timestamps are
different, but they can be clustered into one time interval. In this case, an original
timestamp is used to determine the correct event order for simultaneous events. For
originally simultaneous events, the event order is randomly determined by each service
facility, as described in section 4.2.1.
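
As a small illustration of this ordering rule, the following comparison function (a sketch with assumed field names, not the simulator's actual code) shows how a service facility could decide which of two clustered arrivals is served first: original timestamps dominate, and a pre-drawn random key breaks ties between originally simultaneous events.

// Illustrative arrival record; field names are assumptions for this sketch.
struct Arrival { float originalTs; float randomKey; int token; };

// Returns true if a should be served before b at the same service facility.
__host__ __device__ bool servedFirst(const Arrival &a, const Arrival &b)
{
    if (a.originalTs != b.originalTs)         // clustered but not simultaneous:
        return a.originalTs < b.originalTs;   // keep non-decreasing timestamp order
    return a.randomKey < b.randomKey;         // originally simultaneous: random order
}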
5.2 Implementation and Analysis of Queuing Network Simulation
5.2.1 Closed and Open Queuing Networks
Queuing networks are classified into two types: closed and open [3]. In an open
queuing network, each token arrives at the system, based on the arrival rate, and
leaves the system after being served. In a closed queuing network, on the other hand, a
finite number of tokens is assigned, and these tokens circulate within the network during
the simulation. Open queuing networks are more realistic queuing models than closed
queuing networks are, and communication network and traffic flow models [75] are
typical examples of these. However, closed queuing networks are widely used in the
modeling of a system where the number of tokens in the system has an impact on the
nature of the arrival process, due to the finite input populations [76]. CPU scheduling,
flexible manufacturing systems [77] and truck-shovel systems [78] are examples of
closed queuing networks.
The main difference between these two types of queuing networks is that the open
queuing network has new arrivals during the simulation. The number of tokens in the
open queuing network varies over time due to the arrivals and
departures. The closed queuing network has a constant number of tokens during the
simulation since there are no new arrivals or departures. The error rate produced by the
use of a time interval will be different between the two types of queuing networks since
the number of tokens in the system affects the simulation results.
In the open queuing network, the arrival rate remains constant although the events
are only executed at the end of each time interval. A delayed execution time for each
event, compared to its precise timestamp, decreases the departure rate of the queuing
network, resulting in an increased number of tokens in the queuing network. As the
number of tokens in the queuing network increases, the wait time also increases since
the length of the queue at the service facility increases. In the closed queuing network,
we only need to consider the arrival and departure rates between the service facilities
since there is no entry from the outside. The delayed tokens arrive at the next service
facility as late as the difference between their original timestamps and actual execution
times. The length of the queue at the service facility remains unchanged by the time
interval since all tokens in the system are delayed at the same rate.
The implementation also differs between closed and open queuing networks.
It is possible to allocate a fixed size array for the FEL in the closed queuing network
because of the constant number of tokens during the simulation. A static memory
allocation with a fixed number of elements allows extraction of events from, and
re-scheduling into, the FEL to be performed on the GPU. The data need not be sent
back to the CPU in the middle of the simulation. In an open queuing network, the size
of the FEL varies over time. For this reason, the upper limit of memory for the FEL
needs to be estimated, which causes many threads on the GPU to be idle and memory
to be wasted. Moreover, the GeForce 8800 GTX GPU–a device of compute capability
1.0–does not support mutual exclusion and atomic functions [13]. A manual locking
method for concurrency control cannot be used when the interval between threads
that try to access the same element in memory is too short. The assignments of new
arrivals from outside of the queuing network to the shared resources of the FEL require
sequential execution so that multiple threads are prevented from concurrently accessing
and writing their new arrivals to the same empty location. In this case, newly arrived
tokens need to be generated on the CPU, resulting in data transfer between the CPU
and GPU. Both sequential execution and data transfer are a performance bottleneck,
and data transfer time can have a critical impact on performance in large-scale models.
The experimental results in Section 6 show these performance differences between a
closed and open queuing network.
On the other hand, if memory is separately allocated for each service facility
that receives external inputs from outside of the queuing network, then new arrivals can be
handled on the GPU. Giving each such facility its own dedicated locations prevents threads from
accessing the same location in the FEL. This is a feasible solution if there are few entries
from outside of the queuing network in the simulation model. Generally, this is not a
good approach for the large-scale model since memory allocation rapidly grows as the
number of service facilities increases.
5.2.2 Computer Network Model
The queuing model was originally developed to analyze and design a
telecommunication system [79], and has been frequently used to analyze the
performance of computer network systems. When a packet is sent from one node to
an adjacent node, there are four delays between the two nodes: processing, queuing,
transmission, and propagation delays [80]. Among these, the queuing delay is the most
studied delay because it is the only delay affected by the traffic load and congestion
pattern.
The time required for a packet to wait in the queue before being transmitted onto
the link is the queuing delay. In a computer network, the queuing delay includes the
medium access delay, which can increase the queuing delay. The medium access delay
is the time required for a packet to wait in the node until the medium is sensed as idle.
If another node connected to the same medium is transmitting packets, then a packet
in the first node cannot be transmitted, even if no packets are waiting in that queue.
Consequently, the shared medium is regarded as a common resource for the packets,
and the service facilities in the same medium are regarded as another queue for the
shared medium. Our simulation method causes the error rate to be higher due to the
two consecutive queues.
Figure 5-2 illustrates the possible delays caused by the time interval in the computer
network simulation when a packet is transmitted to the next node.

Figure 5-2. Queuing delay in the computer network model. The node timeline marks (1) packet arrival and waiting in the queue, (2) the original execution time, (3) the execution time by a time interval, (4) the end of backoff at the original transmission time, and (5) the transmission time by a time interval; d1 and d2 are the resulting delays.

In the general
queuing model, d1 is the only delay caused by a time interval; in the computer network model,
however, the packet cannot be transmitted the moment the backoff time ends and the medium is
sensed as idle. The second delay d2 is added onto the medium access delay, and it causes a greater error compared
to the general queuing model. The delays of other packets in the same medium also
make the d2 much longer.
The media access control (MAC) protocol [81] allows for several nodes to be
connected to the same medium and coordinates their accesses to the shared medium.
The implementation of the MAC protocol on the GPU can differ depending on
the behavior of the network. It is sufficient for each node only to sense the shared
medium in the sequential execution, but exclusive access to the shared medium by each
node needs to be guaranteed in the parallel execution. In a wired network, the network
topology is usually static, and the nodes connected to the same medium are also static
and not changed during the simulation. The MAC protocol in the simulation can be
developed to centrally control the nodes, which makes it possible for the MAC protocol to
be executed on the GPU using an alternate update.
The implementation of the MAC protocol in a wireless network with an access point
(AP) is not much different from that in a wired network. The topology of mobile nodes is
dynamic, but that of APs is static. The nodes connected to the same AP are different
at any point in time, but the MAC protocol in the simulation can still be developed to
centrally control the nodes after searching all nodes currently connected to the AP.
However, a mobile ad hoc network (MANET) simulation [82] requires the distributed
implementation of the MAC protocol. The topology in MANETs is changed rapidly, and
the shared medium, which is determined by the transmission range of each mobile node
without any fixed AP, is completely different for each mobile node. The MAC protocol in
the simulation needs to be implemented with respect to each node. This requires the
sequential execution of the MAC protocol on the GPU, degrading the performance.
5.2.3 CUDA Implementation
A higher degree of parallelism can be achieved by concurrently utilizing many
stream multiprocessors on the GPU. A GPU implicitly supports data-level parallelism by
decomposing the computation into a large number of small tasks and by guaranteeing each
thread exclusive access to its elements. The FEL and service facility are two main
data structures in the queuing model, and these are represented as two-dimensional
arrays. One or more elements in both arrays are assigned to a single thread for parallel
execution. Each thread executes only the active events at the current time-step. A
GPU can process one kernel at a time; thus, task-level parallelism must
be implemented manually by combining two or more tasks in a single kernel and by
dividing the thread blocks based on the number of tasks. In our MANET simulation,
the event extraction of data packets and the location update for each mobile node can be
programmed into a single kernel since the tasks are independent of each other, which
increases the utilization of threads.
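
A minimal sketch of this manual task-level parallelism is shown below; the array layouts and the division of the grid by block index are illustrative assumptions, and the event routine is reduced to a placeholder.

// The grid is split by block index: the first eventBlocks blocks process due
// data-packet events, the remaining blocks update mobile-node locations.
// The two tasks must be independent of each other.
__global__ void combinedKernel(float *eventTs, int numEvents,
                               float *posX, float *posY,
                               const float *velX, const float *velY, int numNodes,
                               int eventBlocks, float currentStep, float dt)
{
    if (blockIdx.x < eventBlocks) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;        // task 1: packet events
        if (i < numEvents && eventTs[i] <= currentStep) {
            // ... the data-packet event routine would run here ...
        }
    } else {
        int i = (blockIdx.x - eventBlocks) * blockDim.x + threadIdx.x;  // task 2
        if (i < numNodes) {                                    // location update
            posX[i] += velX[i] * dt;
            posY[i] += velY[i] * dt;
        }
    }
}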
Parallel processing is different from sequential processing in that many tasks are
concurrently executed, reducing the overall execution time. The problem needs to
be safely decomposed into sub-tasks so that concurrently executed tasks do not
affect one another or change the order of execution [83]. The FEL and service
facility have dependency in that two arrays need to be updated at the same time when
a request event is called. Arbitrary access to one array by multiple threads in parallel
allows multiple threads to concurrently access the same elements in the array. Their
executions need to be separated. Alternate updates between the FEL and service
facility have resolved this problem.
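
The following two-kernel sketch illustrates the alternate-update idea under simplified assumptions (hypothetical record layouts and a deterministic service time): each kernel writes only to the array whose elements are owned by its threads, so no two threads ever write the same location.

struct EventSlot    { float timestamp; int facility; int active; };
struct FacilitySlot { float busyUntil; };

// Pass 1: one thread per facility scans the FEL for requests addressed to it
// and updates only its own facility record.
__global__ void facilityUpdateKernel(const EventSlot *fel, int numEvents,
                                     FacilitySlot *fac, int numFacilities,
                                     float currentStep, float serviceTime)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= numFacilities) return;
    for (int i = 0; i < numEvents; ++i)
        if (fel[i].active && fel[i].facility == f && fel[i].timestamp <= currentStep)
            fac[f].busyUntil = fmaxf(fac[f].busyUntil, fel[i].timestamp) + serviceTime;
}

// Pass 2: one thread per FEL slot reads the facility state and updates only
// its own event record (a real implementation would give each token its own
// departure time rather than the facility's last one).
__global__ void felUpdateKernel(EventSlot *fel, int numEvents,
                                const FacilitySlot *fac, float currentStep)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numEvents) return;
    if (fel[i].active && fel[i].timestamp <= currentStep)
        fel[i].timestamp = fac[fel[i].facility].busyUntil;
}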
We need data transfer between the CPU and GPU to avoid simultaneous access
to shared resources since our GPU does not support mutual exclusion. The high speed
of data transfer between the CPU and GPU via the PCI-Express bus has a significant
advantage over clusters with message passing between processors, which makes
the CPU with a GPU a more appropriate architecture for the simulation of fine-grained
events. However, the frequent data transfer between two devices can be a bottleneck in
the simulation. Data transfer time can be reduced by minimizing the size of data transfer,
which can be achieved by separating the array into two parts. The essential elements
which require sequential execution on the CPU are placed in a separate array that
holds indices into the main array.
The size of the data structure needs to be static on the GPU. The number of service
facilities is constant during the simulation, whereas the number of elements in the FEL
of the open queuing network changes over time. Concurrent access
to the elements in the FEL forces the generation of newly arrived tokens to be executed
on the CPU, which, however, makes it possible to dynamically adjust the size of the
FEL on the CPU at the start of each simulation loop. The array of the FEL can either be
doubled or cut in half, based on the number of tokens in the FEL.
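
A host-side sketch of this doubling/halving policy is given below; it assumes (for illustration only) that the active entries are compacted at the front of the device array, and the Event layout is hypothetical.

#include <cuda_runtime.h>

struct Event { float timestamp; int facility; int active; };

// Grow or shrink the device-side FEL between iterations of the simulation loop.
static void resizeFel(struct Event **d_fel, int *capacity, int numTokens)
{
    int newCap = *capacity;
    if (numTokens > *capacity / 2)        newCap = *capacity * 2;   // double
    else if (numTokens < *capacity / 4)   newCap = *capacity / 2;   // cut in half
    if (newCap == *capacity) return;

    struct Event *d_new;
    cudaMalloc((void **)&d_new, newCap * sizeof(struct Event));
    // Assumes active tokens occupy the first numTokens slots.
    cudaMemcpy(d_new, *d_fel, numTokens * sizeof(struct Event),
               cudaMemcpyDeviceToDevice);
    cudaFree(*d_fel);
    *d_fel = d_new;
    *capacity = newCap;
}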
We have made extensive use of the data-parallel algorithms in the NVIDIA CUDA SDK
[84] for our parallel queuing model simulation. A parallel reduction is used for finding
the minimum timestamp when extracting the events from the FEL, which allows us to
maintain the FEL without sorting it.

Figure 5-3. 3 linear queuing networks with 3 servers, fed through a switch from a calling population

The sequential execution of the MAC protocol on
the CPU does not need to search all the elements in the array if the array for the MAC
protocol is sorted in a non-decreasing timestamp order. A bitonic sort on the GPU
allows us to search only the needed elements within the array.
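
For instance, once the MAC array is kept sorted by timestamp, the CPU only has to look at the leading entries that are due at the current time-step; a sketch of that lookup (with assumed array names) is shown below.

// Count how many leading entries of a timestamp-sorted array are due at or
// before the current time-step; only those entries need sequential handling.
static int countDueEntries(const float *sortedTs, int n, float currentStep)
{
    int lo = 0, hi = n;                    // first index with timestamp > currentStep
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (sortedTs[mid] <= currentStep) lo = mid + 1;
        else hi = mid;
    }
    return lo;                             // entries [0, lo) are due
}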
5.3 Experimental Results
5.3.1 Simulation Model: Closed and Open Queuing Networks
When we ran a simulation using the time interval, we used two kinds of queuing
network models–closed and open queuing networks–to identify the differences of the
statistical results and performance between the two models. We first compared the
results of the closed queuing network with those of the open queuing network, and
analyzed the accuracy of the closed queuing network.
The first model is the queuing network of the toroidal topology used in section
4.5.2. The values of various parameters can be important factors affecting accuracy and
performance. We ran the simulation with varying values of two different parameters to
see the effects of these parameters on the statistical results. The open queuing network
consists of N linear queuing networks with k servers, as shown in Figure 5-3. A new
token arrives at the queuing network based on arrival rate 𝜆 from the calling population.
The newly arrived token is assigned to one of the linear queuing networks with uniform
distribution. After being served at the last server in the linear queuing network, the
token completes its job and exits the queuing network. The arrival and service times are
determined by exponential distribution.
5.3.1.1 Accuracy: closed vs. open queuing network
The values of the parameters and the number of service facilities for closed and
open queuing networks are configured to obtain similar results when the time interval is
set to zero. The results for various time intervals are compared with those of a zero time
interval to determine accuracy. The mean service time of the service facility is set to 10
with exponential distribution for both queuing networks. In the closed queuing network,
the message population–the number of initially assigned tokens per service facility–is
set to 1. In the open queuing network, the mean inter-arrival time from the calling
population is set to 20. We used the 32×32 topology as a basis for the experiments to
determine the accuracy.
Two summary statistics are presented in Figure 5-4 to show the difference by
using the time interval. Sojourn time is the average time a token stays in one service
facility, including the wait time in the queue. Utilization represents the performance of the
simulation model. In each subsequent plot, the time interval is on the horizontal axis. A
time interval of zero indicates no error in accuracy. As the interval increases, the error
also increases for the variable being measured on the vertical axis. Figure 5-4A shows
the average sojourn time of open and closed queuing networks for the time interval. It
takes much longer for a token to pass a service facility in the open queuing network
than in the closed queuing network, since the number of tokens in the open queuing
network grows as the time interval increases. Figure 5-4B shows the utilization for
the time interval. Utilization of the closed queuing network drops since arrivals for each
service facility are delayed due to the time interval, whereas utilization of the open
queuing network is almost constant since the arrival rate is constant regardless of
the time interval, and the increased number of tokens fills up possible idle time at the
service facility.
Figure 5-4. Summary statistics of closed and open queuing network simulations, plotted against the time interval. A) Sojourn time per facility. B) Utilization.
5.3.1.2 Accuracy: effects of parameter settings on accuracy
The time interval becomes one of the parameters in our simulation, and it causes
error by combining with other parameters. The time interval is a time-dependent
parameter, and it forces the execution time of each event to be delayed at the end of the
time interval. Time-dependent parameters, therefore, are said to be the primary factors
affecting the accuracy of a simulation. The closed queuing network was used for the
simulation to determine the effects of the parameter settings on accuracy.
Figure 5-5A shows the utilization with variations in the number of service facilities
for the time interval. The experimental results clearly show that the error rate is constant
regardless of the number of service facilities, which is not a time-dependent parameter.
Figure 5-5B shows the utilization of the 32×32 toroidal queuing network, with variation in
the mean service time for the time interval. The variation of the mean service time–one
of the time-dependent parameters–changes the error rate. As the mean service time
increases, the ratio of the delay time by the time interval to the mean service time drops.
The error, therefore, decreases as the mean service time increases for the same time
interval. Interestingly, the error rate in Figure 5-5B is determined by the ratio of the mean
service time to the time interval. The utilizations are almost the same in the following three
cases:
∙ Mean service time: 5. Time interval: 0.2
∙ Mean service time: 10. Time interval: 0.4
∙ Mean service time: 20. Time interval: 0.8
Figure 5-5B implies that the error rate can be estimated based on the fact that
the error rate is regular for the same ratio of a time-dependent parameter to the time
interval.

Figure 5-5. Summary statistics with varying parameter settings, plotted against the time interval. A) Utilization with varying numbers of facilities (128×128, 64×64, and 32×32 nodes). B) Utilization with varying mean service times (20, 10, and 5).

5.3.1.3 Performance

The performance was calculated by comparing the runtime of a parallel simulation
with that of a sequential simulation. We can expect better performance as the time
interval increases since many events are clustered at one time interval; however, a large
time interval also introduces more errors in the results.
Figure 5-6A shows the improvement in the performance of closed queuing network
simulations for the number of service facilities and the time interval, with the same
values of parameters that were used in Figure 5-4. This graph indicates that the
performance improvement depends on the number of events in one time interval.
As expected, a larger time interval leads to better performance. For a very small-scale
model, especially the 16×16 topology, the number of threads that run concurrently is too
small. As a result, the overheads of parallel execution, such as mutual exclusion and
CPU-GPU interactions, exceed the sequential execution times. The parallel simulation
outperforms the sequential simulation when the number of clustered events in one
time interval is large enough to overcome the overheads of parallel execution. Not
all participant threads can be fully utilized in a discrete event simulation, since only
extracted events are executed at once. A large time interval keeps more participant
threads busy, resulting in an increasing number of events in one time interval. The
performance, therefore, increases in proportion to the increment of the time interval.
Finally, the performance improvement gradually increases when the number of events in
one time interval is large enough to maximize the number of threads executed in parallel
on the GPU. In a 512×512 topology, the number of events in the FEL is too large to
be loaded into the shared memory on the GPU at a time during the parallel reduction,
which limits the performance improvements compared to the 256×256 topology.
Figure 5-6B shows the speedup of open queuing network simulations for the
number of service facilities and the time interval, with the same values of parameters
that were used in Figure 5-4. The shapes of the curves are very similar to those of
closed queuing network simulations, except for the magnitude of speedup. The
overheads of sequential execution for newly arrived tokens on the CPU and of data
transfer between the CPU and GPU result in a degradation of performance in the
simulation of an open queuing network. The experimental results indicate that the
relationship between the error rate and performance improvement is model-dependent
and implementation-dependent; hence it is not easy to formalize.

Figure 5-6. Performance improvement (speedup vs. number of facilities) with varying time intervals (Δt). A) Closed queuing network. B) Open queuing network.
Parallel overheads in our experimental results are summarized below.
∙ Thread synchronization between event times
∙ Reorganization of simulation steps for mutual exclusion
∙ Data transfer between the CPU and the GPU
∙ Sequential execution on the CPU to avoid simultaneous access to shared resources
∙ Load imbalance between threads at each iteration
5.3.2 Computer Network Model: a Mobile Ad Hoc Network
5.3.2.1 Simulation model
A MANET is a self-configuring network composed of mobile nodes without any
centralized infrastructure. Each mobile node directly sends the packet to other mobile
nodes in a MANET. Each mobile node relays the packet to its neighbor node when the
source and destination nodes are not in transmission range of each other. Figure 5-7
illustrates the difference between wireless and mobile ad hoc networks. In a wireless
network, each mobile node is connected to an AP and communicates with other mobile
nodes via the AP. Figure 5-7A shows that node #1 can communicate with node #3 via
two APs in a wireless network. On the other hand, node #1 can communicate with node
#3 via nodes #2 and #4 in a MANET, as shown in Figure 5-7B.
When a mobile node sends the packet, it is relayed by intermediate nodes to reach
the destination node, using a routing algorithm. The development of an effective routing
algorithm can reduce the end-to-end delay as well as the number of hop counts, thus
minimizing network congestion. For this reason, a MANET simulation is often developed to
evaluate the routing algorithm. A MANET simulation requires many more computations
than a traditional wired network simulation because of its mobile nature. The locations
of mobile nodes are always changing, which makes the topology different at any point
in time. A routing table in each mobile node, therefore, must be frequently updated.

Figure 5-7. Comparison between wireless and mobile ad hoc networks (each circle represents the transmission range of a mobile node or AP). A) Wireless network. B) Mobile ad hoc network.
A routing algorithm requires a beacon signal to be transmitted between mobile nodes
to update the routing table. A MANET simulation can benefit from a GPU because it
requires heavy computations with frequent location updates of each mobile node and
routing table. We have developed the MANET simulation model with a routing algorithm,
mobility behavior, and MAC protocol to run the packet-level simulation.
Routing Algorithm: Greedy Perimeter Stateless Routing (GPSR) [85] is used
to implement the routing algorithm in a MANET. Each mobile node maintains only its
neighbor table. When the mobile node receives a greedy mode packet for forwarding,
it transmits the packet to the neighbor whose location is geographically closest to the
destination. If the current node is the closest node to the packet’s destination, the packet
is turned to a perimeter mode. The packet in the perimeter mode traverses the edges
in the planar graph by applying the right-hand rule. The packet returns to the greedy
mode when it reaches the node that is geographically closer to the destination than
the mobile node that previously set the packet to the perimeter mode. Each mobile
node broadcasts the beacon signal periodically to announce its location to the
neighbors. Those mobile nodes that receive the beacon signal update their neighbor
table. Each mobile node transmits the beacon signal every 0.5 to 1.5 seconds. The
detailed algorithm is specified in Karp and Kung [85].
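
As an illustration of the greedy mode only, the following device function sketches how a node could pick the next hop from its neighbor table; the table layout and names are assumptions made for this example, and the perimeter-mode handling of [85] is omitted.

// Hypothetical neighbor-table entry kept by each mobile node.
struct Neighbor { float x, y; int id; };

// Greedy forwarding: return the id of the neighbor geographically closest to
// the destination, or -1 if no neighbor is closer than the current node (the
// packet would then switch to perimeter mode).
__host__ __device__ int greedyNextHop(const Neighbor *table, int numNeighbors,
                                      float selfX, float selfY,
                                      float dstX, float dstY)
{
    float best = (selfX - dstX) * (selfX - dstX) + (selfY - dstY) * (selfY - dstY);
    int bestId = -1;
    for (int i = 0; i < numNeighbors; ++i) {
        float dx = table[i].x - dstX, dy = table[i].y - dstY;
        float d2 = dx * dx + dy * dy;
        if (d2 < best) { best = d2; bestId = table[i].id; }
    }
    return bestId;
}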
Mobility: The mobility of a mobile node is modeled by the random waypoint mobility
model [86]. A mobile node chooses a random destination with a random speed which is
uniformly distributed between 0 and 20 m/s. When the node arrives at its destination, it
stays for a certain period of time before selecting a new destination. The pause time is
uniformly distributed between 0 and 20 seconds.
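
A minimal host-side sketch of this mobility model is given below, assuming a simple per-node state record and the C library rand() for illustration (the actual simulator draws its random numbers on the GPU).

#include <math.h>
#include <stdlib.h>

// Hypothetical per-node state for the random waypoint model.
struct NodeState {
    float x, y;          // current position (m)
    float dstX, dstY;    // current waypoint (m)
    float speed;         // current speed (m/s)
    float pauseLeft;     // remaining pause time (s)
};

static float uniformRand(float lo, float hi)
{
    return lo + (hi - lo) * ((float)rand() / (float)RAND_MAX);
}

// Advance one node by dt seconds: move toward the waypoint, pause on arrival,
// then choose a new destination, speed (0-20 m/s), and pause time (0-20 s).
static void stepRandomWaypoint(struct NodeState *n, float dt,
                               float regionW, float regionH)
{
    if (n->pauseLeft > 0.0f) { n->pauseLeft -= dt; return; }

    float dx = n->dstX - n->x, dy = n->dstY - n->y;
    float dist = sqrtf(dx * dx + dy * dy);
    float step = n->speed * dt;

    if (dist <= step) {                       // waypoint reached
        n->x = n->dstX;  n->y = n->dstY;
        n->pauseLeft = uniformRand(0.0f, 20.0f);
        n->dstX  = uniformRand(0.0f, regionW);
        n->dstY  = uniformRand(0.0f, regionH);
        n->speed = uniformRand(0.0f, 20.0f);
    } else {
        n->x += step * dx / dist;
        n->y += step * dy / dist;
    }
}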
MAC Protocol: A mobile node can transmit a packet only if no other mobile node
within its transmission range is currently transmitting. Each mobile node senses
the medium before it sends the packet, and transmits the packet only if the medium is
sensed as idle. When the medium is sensed as busy, a random backoff time is chosen,
and the mobile node waits until the backoff time expires. We assumed that the packet
can be transmitted immediately when the medium is sensed as idle and the backoff time
expires, with ideal collision avoidance.

Table 5-1. Simulation scenarios of MANET
    Number of mobile nodes            50              200             800             3200
    Region (m×m)                      1500×600        3000×1200       6000×2400       12000×4800
    Node density                      1 node / 9000 m² (all scenarios)
    Packet arrival rate (per node)    0.8 packet/sec  0.4 packet/sec  0.2 packet/sec  0.1 packet/sec
Simulations were performed in four scenarios, as shown in Table 5-1. Each scenario
has a different number of nodes and region sizes, but the node density is identical. At
the start of the simulation, mobile nodes are randomly distributed within the area in each
scenario. Each node generates a 1024 byte packet at a rate of 𝜆 packets per second,
and transmits constant bit-rate (CBR) traffic to the randomly selected destination. The
transmission rate of each node is 1 Mbps, and the transmission range of each node is
250 meters. In each scenario, when we moved to the large-scale model, the number
of mobile nodes was quadrupled, but the number of packets was doubled so that the
network was not congested.
5.3.2.2 Accuracy and performance
We produced three statistics for the number of mobile nodes with varying time
intervals. Average end-to-end delay is the average transmission time of packets across a
network from the source to destination node. Packet delivery ratio is the successful ratio
of the data packets delivered to their destinations. Average hop count is the average
number of edges to be transmitted across a network from the source to destination
node.
Figure 5-8. Average end-to-end delay (ms) vs. number of mobile nodes with varying time intervals (Δt = 0, 1, 2 ms)
Figures 5-8 and 5-9 show three statistics produced by our simulation method. Each
curve represents the results with a different time interval, and the difference from a
time interval of zero represents an error. For the average end-to-end delay, the error
rate increases with the time interval, especially for the large-scale models, as
shown in Figure 5-8. We observed that the increment in the number of service facilities
did not cause the error rate to be different in the previous section since the number of
service facilities is not a time-dependent parameter. However, the graph of an average
end-to-end delay shows that the error rate varies with respect to the number of mobile
nodes. This is related to the medium access delay. As mentioned in the previous
section, we expected more delays in the computer network simulation by using the time
interval due to the medium access delay. In our simulation model, a large-scale model
has broader areas compared to a small-scale model. A packet usually passes a larger
number of intermediate nodes to reach the destination in the large-scale model. More
medium access delays, therefore, are expected to be included in the end-to-end delay,
resulting in more error in the results.
Figure 5-9A and B respectively show the average hop counts and packet delivery
ratio. All packets are included in the packet delivery ratio regardless of the existence of
their paths to the destination. These two statistics show both the efficiency and accuracy
of a routing algorithm. An error resulting from the time interval would imply that the routing
table in each mobile node was not updated correctly. The results, however, seem to be constant
regardless of the time interval value. Our time interval (1 ms or 2 ms) is negligible
compared to the beacon-signal interval (1 second on average) of each mobile node, and thus too small to
affect the results. Moreover, these two statistics are not time-dependent statistics, and
are not determined by time-dependent parameters. The experimental results indicate
that we can obtain accurate results if the results as well as the parameters are not
time-dependent.
Figure 5-10 shows the performance improvement for the number of mobile nodes
with varying time intervals. The sequential executions of new packet arrivals and MAC
protocols were the bottlenecks in performance, but we could achieve speedup by
executing the sub-tasks in parallel and minimizing data transfer between the CPU and
GPU. In addition, each event in a MANET simulation requires much more computation
time compared to the queuing models in the previous section. Two sub-tasks are easily
parallelizable: neighbor update in the routing algorithm, and location update in the
mobility. A single kernel combines each sub-task with the event routines for data packets
which are independent of those tasks.
Figure 5-9. Average hop counts and packet delivery ratio vs. number of mobile nodes with varying time intervals (Δt = 0, 1, 2 ms). A) Average hop counts. B) Packet delivery ratio.

Figure 5-10. Performance improvement (speedup vs. number of mobile nodes) in MANET simulation with varying time intervals (Δt = 1, 2 ms)

5.4 Error Analysis

In this section, we explain how the error equation is derived and how the error is
corrected to improve the accuracy of the resulting statistics. The methods for error
estimation and correction should be simple enough since our objective is to obtain
the results from the simulation, not from a complicated analytical method. For error
estimation, we first need to capture the characteristics of the simulation model, thereby
determining which parameters are sensitive to error. Then the error rate is derived as
an equation by combining the time interval with error-sensitive parameters using queuing
theory. In this dissertation, we start with a simple model–the closed queuing network–for
the analysis, because there are fewer parameters to consider.
Figure 5-11 and Table 5-2 show the relationship between time interval and mean
service time in closed queuing network simulations. Figure 5-11 shows a 3-dimensional
graph of utilization for varying time intervals and mean service times. When the mean
service time is relatively large, or when the time interval is small, the error rate tends
to be low. Table 5-2 summarizes two summary statistics for different values of time
intervals and mean service times. We can find some regularity in this table. The results
imply that the ratio of the mean service time to the time interval is directly related to the
error rate. These results indicate that time-dependent parameters are sensitive to error,
and that such errors can be estimated.
Figure 5-11. 3-dimensional representation of utilization for varying time intervals and mean service times

Table 5-2. Utilization and sojourn time (Soj.time) for different values of the time interval (δ) and mean service time (s̄)
                   s̄ = 5                   s̄ = 10                  s̄ = 20
    δ       Utilization  Soj.time    Utilization  Soj.time    Utilization  Soj.time
    0         0.5042       9.98        0.5042      19.97        0.5043      39.92
    0.5       0.4843      10.50        0.4938      20.59        0.4977      40.73
    1         0.4671      10.87        0.4840      21.03        0.4930      41.22
    2         0.4343      11.65        0.4671      21.74        0.4840      42.06

When a token is clustered at the end of the time interval, the token is delayed by
the amount of time between the original and actual execution times. Let d denote the
delay time by the time interval. When the token moves to the next service facility, the
inter-arrival time of the next service facility increases by an average of d. The utilization
of the M/M/1 queue is defined by λ/μ, where λ and μ refer to the arrival and service rates,
respectively [1]. The equation can also be defined by s/a, where s and a refer to the
service time and inter-arrival time, respectively. Consider the linear queuing network with
two queues, and yield statistics at an instant in time. The utilization (ρ2) of
the second queue is defined by equation (5-1), since the instantaneous inter-arrival time at
the second queue is the sum of the service time at the first queue and the delay time by
the time interval (δ):

    ρ2 = s / (a + d)                                                    (5-1)
Let an error rate denote the rate of decrease in utilization by the time interval. The error
rate e can be defined by equation (5-2):

    e = ρ2 / ρ1 = a / (a + d),    where ρ1 = s/a and ρ2 = s / (a + d)   (5-2)
To calculate an average d, we have to consider the probability P0 that the service facility
does not contain a token. In the open queuing network, the increased number of tokens
due to the time interval causes the probability P0 to drop, thus d increases exponentially.
In the closed queuing network, the probability P0 is not affected by the time interval
since all tokens are delayed, reducing the arrival rate to each service facility. All tokens
have to wait until the end of the time interval; thus the long-run time-average of d is
δ/2. The decline in utilization is affected by half the time interval. The long-run time-average
inter-arrival time ā in equation (5-2) approaches s̄, the long-run time-average service time.
When we substitute d = δ/2 into equation (5-2), the error rate e is defined by

    e = s̄ / (s̄ + δ/2)                                                  (5-3)
The utilization with the time interval, ρ(δ), is defined by equation (5-4), where δ0 refers to
a zero time interval:

    ρ(δ) = s̄ / (s̄ + δ/2) × ρ(δ0) = ρ(δ0) / (1 + μδ/2)                  (5-4)

Consequently, we can derive the equation to correct the error in utilization. The original
value of the utilization in the toroidal queuing network can be approximated by

    ρ(δ0) = (1 + μδ/2) × ρ(δ)                                           (5-5)
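
As a quick numerical illustration of equation (5-5), using the Table 5-2 values (and nothing beyond them), the small helper below applies the correction on the host; it is only a sketch of the arithmetic, not part of the simulation library.

// Equation (5-5): rho(0) is approximated by (1 + mu*delta/2) * rho(delta),
// where mu = 1 / mean service time and delta is the time interval.
static double correctUtilization(double rhoDelta, double delta, double meanServiceTime)
{
    double mu = 1.0 / meanServiceTime;
    return (1.0 + mu * delta / 2.0) * rhoDelta;
}

// Example with Table 5-2 values (mean service time 10, delta = 1):
// correctUtilization(0.4840, 1.0, 10.0) = 1.05 * 0.4840 = 0.5082, which is
// much closer to the zero-interval utilization of 0.5042 than 0.4840 is.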
Figure 5-12 shows the comparison of the error rate between the experimental
and estimated results for two cases of the mean service time. As the ratio of the
mean service time to the time interval increases, the difference between the two results
decreases. Figure 5-13 shows the results obtained by applying the error correction of
equation (5-5) to the experimental results in Figure 5-12. The graph indicates
that we can significantly reduce the error with the error correction method. For the mean
service time of 20, the error rate is only 0.6% at the time interval of 1.

Figure 5-12. Comparison between experimental and estimation results (utilization vs. time interval for mean service times of 5 and 20)

Figure 5-13. Result of error correction (utilization vs. time interval for mean service times of 5 and 20)
The equation of the utilization for error correction is not derived from the analysis
of individual nodes. Our intention is to approximate the total error rate when adding
one more parameter–time interval–so that the error is corrected to yield more accurate
results. The equation for the total error rate is derived from the equations of queuing
theory. The equation combined with the results from the simulation produces more
accurate results without building a complicated analytical model from each node.
CHAPTER 6
CONCLUSION
6.1 Summary
We have built a CUDA-based library to support parallel event scheduling and
queuing model simulation on a GPU, and introduced a time-synchronous/event
approach to achieve a higher degree of parallelism. There has been little research
in the use of a SIMD platform for parallelizing the simulation of queuing models. The
concerns in the literature regarding event distribution and the seemingly inappropriate
application of GPUs for discrete event simulation are addressed (1) by allowing events
to occur at approximate boundaries at the expense of accuracy, and (2) by using a
detection and compensation approach to minimize error. The tradeoff in our work is that
while we get significant speedup, the results are approximate and contain a numerical
error. However, in simulations where there is flexibility in the output results, the error
may be acceptable.
The event scheduling method occupies a significant portion of computational
time in discrete event simulations. A concurrent priority queue approach allowed each
processor to simultaneously access the global FEL on shared memory multiprocessors.
However, an array-based data structure and synchronous executions among threads
without explicit support for mutual exclusion prevented the concurrent priority queue
approach from being directly applied to the GPU. In our parallel event scheduling
method, the FEL is divided into many sub-FELs, which allows threads to process these
smaller units in parallel by utilizing a number of threads on a GPU without invoking
sophisticated mutual exclusion methods. Each element in the array holds its position
while the FEL remains unsorted, which guarantees that each element is only accessed
by one thread. In addition, alternate updates between the FEL and service facilities in a
queuing model simulation allow both shared resources to be updated bi-directionally on
the GPU, thereby avoiding simultaneous access to the shared resources.
We have simulated and analyzed three types of queuing models to see what
different impacts they have on the statistical results and performance using our
simulation method. The experimental results show that we can achieve up to 10 times
the speedup using our algorithm, although the increased speed comes at the expense of
accuracy in the results. The relationship between accuracy and performance, however,
is model dependent, and not easy to define on a more general basis. In addition, the
statistical results in MANET simulations show that our method only causes an error in
the time-dependent statistics. Although the improvement of performance introduced
an error into the simulation results, the experimental results showed that the error
in queuing network simulations is regular enough to be used to estimate more
accurate results. The time interval can be one of the parameters used to produce the
results, and so the error can be approximated with the values of the parameters and
topologies of the queuing network. The error produced by the time interval can be
mitigated using results from queuing theory.
6.2 Future Research
The current GPUs and CUDA provide programmers with an efficient framework of
parallel processing for general purpose computations. A GPU can be more powerful
and cost-effective than other parallel computers if it is efficiently programmed. However,
parallel programming on GPUs may still be inconvenient for programmers, since not all
general algorithms and programming techniques can be directly converted and used.
We can further improve the performance of queuing network simulations by
removing more of the sequential execution from the simulation. The magnitude of the performance
improvement depends on how much we can reduce sequential execution in the simulation. In this
study, we were able to completely remove sequential executions in the simulation of
the closed queuing network. However, the synchronous executions of multiple threads
require at least some code to be sequential. Thus, removing sequential execution in the
programming codes not only improves performance, but also reduces the error in the
statistical results, since we can achieve considerable speedup with a small time interval.
We will be able to convert some of the sequential code related to data
inconsistency into parallel code, using atomic functions on devices of compute capability 1.1 and above.
However, we still need parallel algorithms to process the remaining sequential code (e.g.
MAC protocol in MANET simulations) in parallel.
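
As one hypothetical direction, the sketch below uses atomicCAS (available from compute capability 1.1) to let threads claim free FEL slots for newly generated arrivals directly on the GPU; the slot layout and names are assumptions made for illustration.

// Each thread inserts one new arrival by claiming the first free slot it finds.
// slotOwner[s] == 0 means slot s is free; atomicCAS returns the previous value,
// so a return of 0 means this thread won the slot.
__global__ void insertArrivalsKernel(int *slotOwner, float *slotTs, int capacity,
                                     const float *newTs, int numNew)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numNew) return;
    for (int s = 0; s < capacity; ++s) {
        if (atomicCAS(&slotOwner[s], 0, 1) == 0) {
            slotTs[s] = newTs[i];
            break;
        }
    }
}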
Error analysis for real applications is more complex than it is for the example of the
toroidal queuing network, since the service rates of each service facility are different,
and also because there are many parameters to be considered. For these reasons, it
is difficult to capture the characteristics of the complex simulation models. Our future
research will include more studies for error estimation and correction methods in regards
to various applications.
REFERENCES
[1] L. Kleinrock, Queueing Systems Volume 1: Theory, Wiley-Interscience, New York, NY, 1975.
[2] D. Gross and C. M. Harris, Fundamentals of Queueing Theory (Wiley Series inProbability and Statistics), Wiley-Interscience, February 1998.
[3] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi, Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, Wiley-Interscience, New York, NY, 2006.
[4] R. B. Cooper, Introduction to Queueing Theory, North-Holland (Elsevier), 2ndedition, 1981.
[5] A. M. Law and W. D. Kelton, Simulation Modeling & Analysis, McGraw-Hill, Inc,New York, NY, 4th edition, 2006.
[6] J. Banks, J. Carson, B. L. Nelson, and D. Nicol, Discrete-Event System Simulation,Fourth Edition, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, December 2004.
[7] R. M. Fujimoto, Parallel and Distribution Simulation Systems, Wiley-Interscience,New York, NY, 2000.
[8] GPGPU, General-Purpose Computation on Graphics Hardware, 2008. Web.September 2008. <http://www.gpgpu.org>.
[9] D. Luebke, M. Harris, J. Kruger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley,and A. Lefohn, “Gpgpu: general purpose computation on graphics hardware,” inSIGGRAPH ’04: ACM SIGGRAPH 2004 Course Notes, New York, NY, USA, 2004,ACM Press.
[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, andT. J. Purcell, “A survey of general-purpose computation on graphics hardware,”Computer Graphics Forum, vol. 26, no. 1, pp. 80–113, 2007.
[11] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPUcomputing,” Proceedings of the IEEE, vol. 96, no. 5, pp. 879–899, May 2008.
[12] NVIDIA, Technical Brief: NVIDIA GeForce 8800 GPU architecture overview, 2006.
[13] NVIDIA, NVIDIA CUDA Programming Guide 2.0, 2008.
[14] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programmingwith cuda,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
[15] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D.Owens, “Programmable stream processors,” Computer, vol. 36, no. 8, pp. 54–62,2003.
[16] J. D. Owens, “Streaming architectures and technology trends,” in GPU Gems 2,M. Pharr, Ed., chapter 29. Addison Wesley, Upper Saddle River, NJ, 2005.
[17] V. N. Rao and V. Kumar, “Concurrent access of priority queues,” IEEE Trans.Comput., vol. 37, no. 12, pp. 1657–1665, 1988.
[18] D. W. Jones, “Concurrent operations on priority queues,” Commun. ACM, vol. 32,no. 1, pp. 132–137, 1989.
[19] L. M. Leemis and S. K. Park, Discrete-Event Simulation: A First Course,Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2005.
[20] P. A. Fishwick, Simulation Model Design and Execution: Building Digital Worlds,Prentice Hall, Upper Saddle River, NJ, 1995.
[21] D. D. Sleator and R. E. Tarjan, “Self-adjusting binary search trees,” J. ACM, vol. 32,no. 3, pp. 652–686, 1985.
[22] R. Brown, “Calendar queues: a fast o(1) priority queue implementation for thesimulation event set problem,” Commun. ACM, vol. 31, no. 10, pp. 1220–1227,1988.
[23] R. M. Fujimoto, “Parallel simulation: parallel and distributed simulation systems,” inWSC ’01: Proceedings of the 33nd conference on Winter simulation, Washington,DC, USA, 2001, pp. 147–157, IEEE Computer Society.
[24] K. S. Perumalla, “Parallel and distributed simulation: Traditional techniques andrecent advances,” in Proceedings of the 2006 Winter Simulation Conference, LosAlamitos, CA, Dec. 2006, pp. 84–95, IEEE Computer Society.
[25] K. M. Chandy and J. Misra, “Distributed simulation: A case study in design andverification of distributed programs,” Software Engineering, IEEE Transactions on,vol. SE-5, no. 5, pp. 440–452, 1979.
[26] R. E. Bryant, “Simulation of packet communication architecture computer systems,”Tech. Rep., Cambridge, MA, USA, 1977.
[27] J. Misra, “Distributed discrete-event simulation,” ACM Comput. Surv., vol. 18, no. 1,pp. 39–65, 1986.
[28] K. M. Chandy and J. Misra, “Asynchronous distributed simulation via a sequence ofparallel computations,” Commun. ACM, vol. 24, no. 4, pp. 198–206, 1981.
[29] D. R. Jefferson, “Virtual time,” ACM Trans. Program. Lang. Syst., vol. 7, no. 3, pp.404–425, 1985.
[30] F. Gomes, B. Unger, J. Cleary, and S. Franks, “Multiplexed state saving for boundedrollback,” in WSC ’97: Proceedings of the 29th conference on Winter simulation,Washington, DC, USA, 1997, pp. 460–467, IEEE Computer Society.
[31] C. D. Carothers, K. S. Perumalla, and R. M. Fujimoto, “Efficient optimistic parallelsimulations using reverse computation,” ACM Trans. Model. Comput. Simul., vol. 9,no. 3, pp. 224–253, 1999.
[32] R. M. Fujimoto, “Exploiting temporal uncertainty in parallel and distributedsimulations,” in Proceedings of the 13th workshop on Parallel and distributedsimulation, Washington, DC, May 1999, pp. 46–53, IEEE Computer Society.
[33] H. Sutter, “The free lunch is over: A fundamental turn toward concurrency insoftware,” Dr. Dobb’s Journal, vol. 30, no. 3, 2005.
[34] A. E. Lefohn, J. Kniss, and J. D. Owens, “Implementing efficient parallel datastructures on gpus,” in GPU Gems 2, M. Pharr, Ed., chapter 33. Addison Wesley,Upper Saddle River, NJ, 2005.
[35] M. Harris, “Mapping computational concepts to gpus,” in GPU Gems 2, M. Pharr,Ed., chapter 31. Addison Wesley, Upper Saddle River, NJ, 2005.
[36] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard, “Cg: a system forprogramming graphics hardware in a c-like language,” in SIGGRAPH ’03: ACMSIGGRAPH 2003 Papers, New York, NY, USA, 2003, pp. 896–907, ACM.
[37] Microsoft, Microsoft high-level shading language, 2008. Web. April 2008.<http://msdn.microsoft.com/en-us/library/ee418149(VS.85).aspx>.
[38] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, andP. Hanrahan, “Brook for gpus: stream computing on graphics hardware,” ACMTrans. Graph., vol. 23, no. 3, pp. 777–786, 2004.
[39] I. Buck, “Taking the plunge into gpu computing,” in GPU Gems 2, M. Pharr, Ed.,chapter 32. Addison Wesley, Upper Saddle River, NJ, 2005.
[40] J. D. Owens, “Gpu architecture overview,” in SIGGRAPH ’07: ACM SIGGRAPH2007 courses, New York, NY, USA, 2007, p. 2, ACM.
[41] D. Luebke, “Gpu architecture & applications,” March 2 2008, Tutorial, ASPLOS2008.
[42] P. Vakili, “Massively parallel and distributed simulation of a class of discrete eventsystems: a different perspective,” ACM Transactions on Modeling and ComputerSimulation, vol. 2, no. 3, pp. 214–238, 1992.
[43] N. T. Patsis, C. Chen, and M. E. Larson, “Simd parallel discrete event dynamicsystem simulation,” IEEE Transactions on Control Systems Technology, vol. 5, pp.30–41, 1997.
[44] R. Ayani and B. Berkman, “Parallel discrete event simulation on simd computers,”Journal of Parallel and Distributed Computing, vol. 18, no. 4, pp. 501–508, 1993.
[45] W. W. Shu and M. Wu, “Asynchronous problems on simd parallel computers,” IEEETransactions on Parallel and Distributed Systems, vol. 6, no. 7, pp. 704–713, 1995.
[46] S. Gobron, F. Devillard, and B. Heit, “Retina simulation using cellular automata andgpu programming,” Machine Vision and Applications, vol. 18, no. 6, pp. 331–342,2007.
[47] M. J. Harris, W. V. Baxter, T. Scheuermann, and A. Lastra, “Simulation of clouddynamics on graphics hardware,” in HWWS ’03: Proceedings of the ACM SIG-GRAPH/EUROGRAPHICS conference on Graphics hardware, Aire-la-Ville,Switzerland, Switzerland, 2003, pp. 92–101, Eurographics Association.
[48] L. Nyland, M. Harris, and J. Prins, “Fast n-body simulation with cuda,” in GPUGems 3, H. Nguyen, Ed., chapter 31. Addison Wesley, Upper Saddle River, NJ,2007.
[49] K. S. Perumalla, “Discrete-event execution alternatives on general purposegraphical processing units (gpgpus),” in PADS ’06: Proceedings of the 20thWorkshop on Principles of Advanced and Distributed Simulation, Washington, DC,2006, pp. 74–81, IEEE Computer Society.
[50] Z. Xu and R. Bagrodia, “Gpu-accelerated evaluation platform for high fidelitynetwork modeling,” in PADS ’07: Proceedings of the 21st International Workshopon Principles of Advanced and Distributed Simulation, Washington, DC, 2007, pp.131–140, IEEE Computer Society.
[51] M. Lysenko and R. M. D’Souza, “A framework for megascale agent based modelsimulations on graphics processing units,” Journal of Artificial Societies and SocialSimulation, vol. 11, no. 4, pp. 10, 2008.
[52] P. Martini, M. Rumekasten, and J. Tolle, “Tolerant synchronization for distributedsimulations of interconnected computer networks,” in Proceedings of the 11thworkshop on Parallel and distributed simulation, Washington, DC, June 1997, pp.138–141, IEEE Computer Society.
[53] S. K. Reinhardt, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood,“The wisconsin wind tunnel: virtual prototyping of parallel computers,” in SIG-METRICS ’93: Proceedings of the 1993 ACM SIGMETRICS conference onMeasurement and modeling of computer systems, New York, NY, USA, 1993, pp.48–60, ACM.
[54] A. Falcon, P. Faraboschi, and D. Ortega, “An adaptive synchronization techniquefor parallel simulation of networked clusters,” in ISPASS ’08: Proceedings ofthe ISPASS 2008 - IEEE International Symposium on Performance Analysis ofSystems and software, Washington, DC, USA, 2008, pp. 22–31, IEEE ComputerSociety.
[55] J. J. Wang and M. Abrams, “Approximate time-parallel simulation of queueingsystems with losses,” in WSC ’92: Proceedings of the 24th conference on Wintersimulation, New York, NY, USA, 1992, pp. 700–708, ACM.
[56] T. Kiesling, “Using approximation with time-parallel simulation,” Simulation, vol. 81,no. 4, pp. 255–266, 2005.
[57] G. C. Hunt, M. M. Michael, S. Parthasarathy, and M. L. Scott, “An efficient algorithmfor concurrent priority queue heaps,” Inf. Process. Lett., vol. 60, no. 3, pp. 151–157,1996.
[58] M. D. Grammatikakis and S. Liesche, “Priority queues and sorting methods forparallel simulation,” IEEE Trans. Softw. Eng., vol. 26, no. 5, pp. 401–422, 2000.
[59] H. Sundell and P. Tsigas, “Fast and lock-free concurrent priority queues formulti-thread systems,” J. Parallel Distrib. Comput., vol. 65, no. 5, pp. 609–627,2005.
[60] E. Naroska and U. Schwiegelshohn, “A new scheduling method for paralleldiscrete-event simulation,” in Euro-Par ’96: Proceedings of the Second InternationalEuro-Par Conference on Parallel Processing-Volume II, London, UK, 1996, pp.582–593, Springer-Verlag.
[61] J. Liu, D. M. Nicol, and K. Tan, “Lock-free scheduling of logical processes in parallelsimulation,” in In Proceedings of the 2000 Parallel and Distributed SimulationConference, Lake ArrowHead, CA, 2001, pp. 22–31.
[62] M. A. Franklin, “Parallel solution of ordinary differential equations,” IEEE Trans.Comput., vol. 27, no. 5, pp. 413–420, 1978.
[63] J. M. Rutledge, D. R. Jones, W. H. Chen, and E. Y. Chung, “The use of a massivelyparallel simd computer for reservoir simulation,” in Eleventh SPE Symposium onReservoir Simulation, 1991, pp. 117–124.
[64] A. T. Chronopoulos and G. Wang, “Parallel solution of a traffic flow simulationproblem,” Parallel Comput., vol. 22, no. 14, pp. 1965–1983, 1997.
[65] J. Signorini, “How a simd machine can implement a complex cellular automata? acase study: von neumann’s 29-state cellular automaton,” in Supercomputing ’89:Proceedings of the 1989 ACM/IEEE conference on Supercomputing, New York, NY,USA, 1989, pp. 175–186, ACM.
[66] M. Harris, Optimizing Parallel Reduction in CUDA, NVIDIA Corporation, 2007.
[67] R. Mansharamani, “An overview of discrete event simulation methodologies andimplementation,” Sadhana, vol. 22, no. 7, pp. 611–627, 1997.
[68] F. Wieland, “The threshold of event simultaneity,” SIGSIM Simul. Dig., vol. 27, no. 1,pp. 56–59, 1997.
[69] M. Matsumoto and T. Nishimura, “Mersenne twister: a 623-dimensionallyequidistributed uniform pseudo-random number generator,” ACM Trans. Model.Comput. Simul., vol. 8, no. 1, pp. 3–30, 1998.
[70] V. Podlozhnyuk, Parallel Mersenne Twister, NVIDIA Corporation, 2007.
[71] P. A. Fishwick, “Simpack: getting started with simulation programming in c andc++,” in Proceedings of the 1992 Winter Simulation Conference, J. J. Swain,D. Goldsman, R. C. Crain, and J. R. Wilson, Eds., New York, NY, 1992, pp.154–162, ACM Press.
[72] C. D. Carothers, R. M. Fujimoto, and P. England, “Effect of communicationoverheads on time warp performance: an experimental study,” in Proceedingsof the 8th workshop on Parallel and distributed simulation, New York, NY, July 1994,pp. 118–125, ACM Press.
[73] T. L. Wilmarth and L. V. Kale, “Pose: Getting over grainsize in parallel discreteevent simulation,” in Proceedings of the 2004 International Conference on ParallelProcessing (ICPP’04), Washington, DC, Aug. 2004, pp. 12–19, IEEE ComputerSociety.
[74] C. L. O. Kawabata, R. H. C. Santana, M. J. Santana, S. M. Bruschi, and K. R.L. J. C. Branco, “Performance evaluation of a cmb protocol,” in Proceedingsof the 38th conference on Winter simulation, Los Alamitos, CA, Dec. 2006, pp.1012–1019, IEEE Computer Society.
[75] N. Vandaele, T. V. Woensel, and A. Verbruggen, “A queueing based traffic flow model,” Transportation Research Part D: Transport and Environment, vol. 5, no. 2, pp. 121–135, 2000.
[76] L. Kleinrock, Queueing Systems Volume 2: Computer Applications, Wiley-Interscience, New York, NY, 1975.
[77] A. Seidmann, P. Schweitzer, and S. Shalev-Oren, “Computerized closed queueing network models of flexible manufacturing systems,” Large Scale Syst. J., vol. 12, pp. 91–107, 1987.
[78] P. K. Muduli and T. M. Yegulalp, “Modeling truck-shovel systems as closed queueing network with multiple job classes,” International Transactions in Operational Research, vol. 3, no. 1, pp. 89–98, 1996.
[79] R. B. Cooper, “Queueing theory,” in ACM 81: Proceedings of the ACM ’81 conference, New York, NY, USA, 1981, pp. 119–122, ACM.
[80] J. F. Kurose and K. W. Ross, Computer Networking: A Top-Down Approach (4th Edition), Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2007.
[81] S. Kumar, V. S. Raghavan, and J. Deng, “Medium access control protocols for ad hoc wireless networks: A survey,” Ad Hoc Networks, vol. 4, no. 3, pp. 326–358, 2006.
[82] A. Boukerche and L. Bononi, “Simulation and modeling of wireless, mobile, and ad hoc networks,” in Mobile Ad Hoc Networking, S. Basagni, M. Conti, S. Giordano, and I. Stojmenovic, Eds., chapter 14. Wiley-Interscience, New York, NY, 2004.
[83] T. Mattson, B. Sanders, and B. Massingill, Patterns for Parallel Programming, Addison-Wesley Professional, 2004.
[84] NVIDIA, CUDA, 2009. Web. May 2009. <http://www.nvidia.com/cuda>.
[85] B. Karp and H. T. Kung, “Gpsr: greedy perimeter stateless routing for wireless networks,” in MobiCom ’00: Proceedings of the 6th annual international conference on Mobile computing and networking, New York, NY, USA, 2000, pp. 243–254, ACM.
[86] T. Camp, J. Boleng, and V. Davies, “A survey of mobility models for ad hoc network research,” Wireless Communications & Mobile Computing (WCMC): Special issue on Mobile Ad Hoc Networking: Research, Trends and Applications, vol. 2, pp. 483–502, 2002.
BIOGRAPHICAL SKETCH
Hyungwook Park received his B.S. degree in computer science from Korea
Military Academy in 1995 and M.S. degree in computer and information science and
engineering from the University of Florida in 2003. He served as a senior programmer in
the Republic of Korea Army Headquarters and Logistics Command until he started his
Ph.D. studies at the University of Florida in 2005. His research interests are modeling
and parallel simulation.