PARALLEL GENERAL-PURPOSE RESERVOIR SIMULATION
WITH COUPLED RESERVOIR MODELS
AND MULTISEGMENT WELLS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ENERGY
RESOURCES ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Yifan Zhou
November 2012
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Prof. Hamdi Tchelepi) Principal Co-Advisor
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Prof. Khalid Aziz) Principal Co-Advisor
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Prof. Roland Horne)
Approved for the University Committee on Graduate Studies
Abstract
The development of a parallel general-purpose simulation framework for coupled reservoir models and multisegment wells is the subject of this dissertation. With this work, AD-GPRS, the new generation of the General Purpose Research Simulator (GPRS) built on a flexible Automatic Differentiation (AD) framework, is now a powerful and flexible platform for modeling thermal-compositional fluid flow in reservoir models with fully unstructured grids. AD-GPRS offers advanced and extensible spatial and temporal discretization schemes, advanced linear solvers, and a generalized multisegment well model. In addition, AD-GPRS supports OpenMP parallelization on multicore platforms and a Nested Factorization (NF) linear solver for systems with multiple GPUs (Graphics Processing Units).
AD-GPRS employs generalized MultiPoint Flux Approximations (MPFA) for spa-
tial discretization and a multilevel Adaptive Implicit Method (AIM) for time dis-
cretization. A generalized connection list is used to locate the block nonzero entries
in the system matrix associated with general MPFA discretization. Our AIM imple-
mentation allows for new fluid models and nonlinear formulations. The framework
can deal with any combination of TPFA (Two-Point Flux Approximation), MPFA,
FIM (Fully Implicit Method), and AIM.
For efficient linear solution of coupled reservoir models and advanced wells, AD-GPRS supports linear systems based on the MLBS (MultiLevel Block Sparse) data structure. MLBS was first designed and implemented in the original GPRS and extended further in AD-GPRS. Equipped with the CPR (Constrained Pressure Residual) preconditioner, the MLBS-based solver is both robust and efficient. Its hierarchical data structure accommodates systems with general MPFA discretization and AIM.
For accurate well modeling, a general MultiSegment (MS) well model is imple-
mented in AD-GPRS. In this model, variables and equations are defined for both
nodes and connections. The general MS well model allows for the following advanced
features: general branching, loops with arbitrary flow directions, multiple exit con-
nections with different constraints, and special nodes (e.g., separators, valves). The
linear and nonlinear solvers are extended to address the numerical challenges brought
about by the general MS well model.
Parallel reservoir simulation has recently drawn a lot of attention, and specializ-
ing the algorithms for the target parallel architecture has grown in importance. We
describe an architecture-aware approach to parallelization. First, multithreading par-
allelization of AD-GPRS is described. Parallel Jacobian construction is achieved with
a thread-safe extension of the ADETL library. For linear solution, we use a two-stage
CPR preconditioning strategy, which combines the parallel multigrid solver XSAMG
and the Block Jacobi technique with Block ILU(0) applied locally.
We also describe multi-GPU parallelization of Nested Factorization (NF). We
build on the Massively Parallel NF (MPNF) framework described by Appleyard et
al. [8]. The most important features of our GPU-based implementation of MPNF
include: 1) special ordering of the matrix elements to maximize coalesced access
to the GPU global memory, 2) application of ‘twisted factorization’ to increase the
number of concurrent threads at no additional cost, and 3) multi-GPU extension of the
algorithm by first performing computations in the halo region of each GPU, and then
overlapping the peer-to-peer memory transfer between GPUs with the computations
of the interior regions.
Acknowledgements
I would like to express my sincere gratitude to my advisors, Prof. Hamdi Tchelepi and Prof. Khalid Aziz, for their guidance, help, and encouragement in this work. I feel fortunate that they let me work on the development of AD-GPRS, which fits my background and interests very well. During the past five years, I was encouraged to explore various interesting and challenging aspects of reservoir simulation, and I believe the wide coverage of these aspects will greatly benefit my future career. Prof. Hamdi Tchelepi has an excellent sense of cutting-edge research topics in reservoir simulation and was always able to offer solid support and insightful views whenever I moved in a new direction. Prof. Khalid Aziz is very knowledgeable and highly experienced, and I was able to learn a lot from every discussion with him.
I would like to thank Prof. Roland Horne for reading my dissertation and providing valuable corrections and suggestions, some of which were quite beneficial for improving my academic writing. Prof. Kate Maher kindly chaired my PhD oral defense and is gratefully acknowledged. I would also like to thank Prof. Biondo Biondi for serving on my defense committee as an oral examiner. His valuable comments are highly appreciated.
I would like to thank Dr. Rami Younis, the original developer of ADETL and now a professor at the University of Tulsa, for his help and support during my M.S. research on extending ADETL. I also worked closely with Dr. Denis Voskov on the development of AD-GPRS, and his contribution is sincerely acknowledged.
I would like to thank Dr. Brad Mallison from Chevron ETC for offering sound guidance, and ample freedom, in an internship project relevant to my PhD research. I would also like to express my gratitude to Dr. Hui Cao, the original developer of GPRS and now with Total, for giving us helpful suggestions on various aspects, including the AIM formulation and linear solvers.
I would like to thank Dr. Yuanlin Jiang, also an important developer of GPRS and now with QRI, for his contributions to the general MS well model. I also benefited from discussions with Dr. Arthur Moncorge of Total, and his helpful suggestions are appreciated.
I would like to thank Dr. Klaus Stuben and Sebastian Gries from Fraunhofer SCAI for their help and support with the XSAMG solver. I would also like to thank Dr. Robert Clapp, Xukai Shen, Chris Leader, and a few others from the Department of Geophysics for their kind help in setting up the GPU computing environment.
I would like to thank all my friends in the Department of Energy Resources Engineering, as well as other departments in the School of Earth Sciences at Stanford University. I appreciate their help, support, and consideration over the last five years. Their friendship is a treasured part of my life.
I would like to acknowledge the industrial affiliates of SUPRI-B (Reservoir Simu-
lation) and SESAAI (Algorithms and Architectures) for their financial support that
made this work possible.
Finally, I would like to thank my parents for their continuous love and support during the past 27 years. Most importantly, I would like to express my greatest gratitude to my beloved wife, Dr. Xiaochen Wang. We have experienced many significant moments together and share countless beautiful memories. I am so proud of all her achievements. Without her care, love, support, and encouragement, my PhD life would never have been so colorful and splendid. She deserves appreciation and love from the deepest part of my heart.
Contents
Abstract iii
Acknowledgements v
1 Introduction 1
1.1 Background and Dissertation Outline . . . . . . . . . . . . . . . . . . 1
1.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Hierarchical organization of simulation variables . . . . . . . . 10
1.2.2 Building residual equations . . . . . . . . . . . . . . . . . . . . 12
2 AD Framework with MPFA and AIM capabilities 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 AD-Based MPFA Framework . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Nonlinear level . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Linear level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 AD-Based AIM Formulation . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Implicit-level determination (nonlinear level) . . . . . . . . . . 27
2.3.2 Treatment of nonlinear terms in the flux (nonlinear level) . . . 30
2.3.3 Algebraic reduction in terms of implicit variables (linear level) 32
2.3.4 Updating of the remaining variables (linear level) . . . . . . . 33
2.4 Test Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Full and upscaled SPE 10 problems . . . . . . . . . . . . . . . 34
2.4.2 Unstructured grid problem . . . . . . . . . . . . . . . . . . . . 38
2.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 Linear Solver Framework 43
3.1 Overview of the Framework . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 CSR linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Block linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.1 First level: global matrix . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Second level: reservoir and Facilities matrix . . . . . . . . . . 54
3.3.3 Third level: well matrices . . . . . . . . . . . . . . . . . . . . 57
3.4 Solution strategy of block linear system . . . . . . . . . . . . . . . . . 61
3.4.1 Matrix extraction . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 Algebraic reduction from full to primary system . . . . . . . . 64
3.4.3 Algebraic reduction from primary to implicit system . . . . . . 69
3.4.4 Preconditioned Linear Solver . . . . . . . . . . . . . . . . . . . 73
3.4.5 The Two-Stage Preconditioning Strategy . . . . . . . . . . . . 74
3.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4 General MultiSegment Well Model 91
4.1 Model Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Drift-Flux Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.4.1 Liquid-gas model . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.2 Oil-water model . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.3 Gas-oil-water model . . . . . . . . . . . . . . . . . . . . . . . 106
4.5 Extensions of the AD Simulation Framework . . . . . . . . . . . . . . 107
4.5.1 Global variable set . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5.2 Linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.3 Jacobian for the general MS well model . . . . . . . . . . . . . 110
4.6 Well Initialization, Calculation, and Variable Updating . . . . . . . . 113
4.6.1 Well initialization . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.6.2 Well calculation . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.6.3 Updating of the well variables . . . . . . . . . . . . . . . . . . 120
4.7 Multistage Preconditioner . . . . . . . . . . . . . . . . . . . . . . . . 121
4.7.1 First stage: global on the pressure system . . . . . . . . . . . 122
4.7.2 Second stage: local on the overall system . . . . . . . . . . . . 125
4.8 Nonlinear Solution: Local Facility Solver . . . . . . . . . . . . . . . . 127
4.9 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.9.1 Two-dimensional reservoir with a dual-branch general MS well 129
4.9.2 Upscaled SPE 10 reservoir with three multilateral producers . 131
4.9.3 Linear solver performance . . . . . . . . . . . . . . . . . . . . 134
4.9.4 Nonlinear solver performance . . . . . . . . . . . . . . . . . . 137
4.9.5 Comparison of simulation results: AD-GPRS versus Eclipse . 138
4.10 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5 Multicore Parallelization 142
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2 Jacobian Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2.1 Thread-safe ADETL . . . . . . . . . . . . . . . . . . . . . . . 143
5.2.2 Parallel computations other than the linear solver . . . . . . . 144
5.3 Linear Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3.1 Parallel matrix data structure . . . . . . . . . . . . . . . . . . 145
5.3.2 First stage pressure solution — XSAMG preconditioner . . . . 146
5.3.3 Second stage overall solution — Block Jacobi/BILU precondi-
tioner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.4 Parallel benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6 GPU Parallelization of Nested Factorization 153
6.1 Introduction to GPU Architecture . . . . . . . . . . . . . . . . . . . . 153
6.2 Nested Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.3 Massively Parallel Nested Factorization . . . . . . . . . . . . . . . . . 157
6.4 Our CUDA-based Implementation . . . . . . . . . . . . . . . . . . . . 161
6.4.1 Basic features . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.4.2 Runtime profiles . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.5 Coalesced memory access . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.6 More parallelism: multiple threads in a kernel . . . . . . . . . . . . . 167
6.6.1 Decrease the number of colors . . . . . . . . . . . . . . . . . . 168
6.6.2 Twisted Factorization . . . . . . . . . . . . . . . . . . . . . . 168
6.6.3 Cyclic Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.7 More flexibility in the system matrix . . . . . . . . . . . . . . . . . . 170
6.7.1 Inactive cells . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.7.2 Additional (e.g., well) equations . . . . . . . . . . . . . . . . . 171
6.8 Multi-GPU Parallelization . . . . . . . . . . . . . . . . . . . . . . . . 173
6.8.1 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.8.2 Data transfer between GPUs . . . . . . . . . . . . . . . . . . . 174
6.8.3 Overlapping data transfer with computation . . . . . . . . . . 177
6.8.4 Mapping between global and local vectors . . . . . . . . . . . 178
6.9 Parallel Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.9.1 Single-GPU test case 1: upscaled SPE 10 . . . . . . . . . . . . 180
6.9.2 Single-GPU test case 2: full SPE 10 . . . . . . . . . . . . . . . 182
6.9.3 Single-GPU test case 1: 8-fold refinement (2 by 2 by 2) of SPE 10 . . 183
6.9.4 Multi-GPU test case 2: 24-fold refinement (4 by 3 by 2) of SPE 10 . . 186
6.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.11 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7 Conclusions and Future Work 193
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Nomenclature 200
Bibliography 206
Appendix A Programming Model of AD-GPRS 218
A.1 The structure of AD-GPRS . . . . . . . . . . . . . . . . . . . . . . . 219
A.2 Flow sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
A.3 List of files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
List of Tables
1.1 A hypothetical variable set for three-phase black-oil simulation [85] . 11
2.1 FIM and AIM runtime performance of upscaled SPE 10 problem . . 35
2.2 Runtime performance of full SPE 10 case . . . . . . . . . . . . . . . 37
2.3 Runtime performance of unstructured grid case . . . . . . . . . . . . 40
3.1 Row pointer array of CSR format . . . . . . . . . . . . . . . . . . . . 45
3.2 Column index and value arrays of CSR format . . . . . . . . . . . . 45
List of Figures
1.1 Key features and capabilities of Automatic-Differentiation General-
Purpose Research Simulator (AD-GPRS) . . . . . . . . . . . . . . . . 3
2.1 Illustration of different MPFA schemes . . . . . . . . . . . . . . . . . 16
2.2 Illustration of the AIM scheme (IMPES+FIM) . . . . . . . . . . . . 17
2.3 Block nonzero entries corresponding to TPFA and MPFA fluxes . . . 23
2.4 An example of generalized connection list . . . . . . . . . . . . . . . 25
2.5 Grid mapping of full SPE 10 with skewness and distortion [99] . . . . 36
2.6 Gas rates of two injectors and oil rates of two producers in full SPE 10
with skewed grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7 Gas saturation at the end of simulation for TPFA and MPFA with 32:1
anisotropy ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.8 Gas rate of producer #1 and #4 for TPFA and MPFA with 1:1, 2:1,
8:1, 32:1 anisotropy ratios . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Typical global Jacobian matrix in reservoir simulation [38] . . . . . . 52
3.2 Submatrices in the global Jacobian matrix [98] . . . . . . . . . . . . . 53
3.3 Structure of the JRR submatrix . . . . . . . . . . . . . . . . . . . . . 55
3.4 Structure of second-level MLBS matrices [38] . . . . . . . . . . . . . . 58
3.5 Sample reservoir with two wells . . . . . . . . . . . . . . . . . . . . . 60
3.6 Structure of third-level MLBS matrices . . . . . . . . . . . . . . . . . 61
4.1 Illustration of the original multisegment well model (from [38]) . . . 92
4.2 Illustration of the general multisegment well model (from [38]) . . . 93
4.3 Multilevel block-sparse linear system (modified from [38]) . . . . . . . 110
4.4 Jacobian matrix structure of the general MS well model . . . . . . . 112
4.5 A segment with zero, one, or multiple perforations . . . . . . . . . . 115
4.6 Initialization of mixture flow rates . . . . . . . . . . . . . . . . . . . 118
4.7 Calculation sequence of the general MS well model . . . . . . . . . . 120
4.8 Variable update sequence of the general MS well model . . . . . . . . 122
4.9 The reservoir and well configuration of example 1 . . . . . . . . . . . 129
4.10 The simulation results of example 1 . . . . . . . . . . . . . . . . . . 130
4.11 The reservoir and well settings of example 2 with separate controls . 132
4.12 The simulation results of example 2 with separate controls . . . . . . 132
4.13 The reservoir and well settings of example 2 with a group control . . 133
4.14 The simulation results of example 2 with a group control . . . . . . . 134
4.15 The linear solver performance of example 3 . . . . . . . . . . . . . . 136
4.16 The nonlinear solver performance of example 4 . . . . . . . . . . . . 138
4.17 The reservoir and well settings of example 3 (AD-GPRS versus Eclipse) 139
4.18 Comparison of simulation results: AD-GPRS versus Eclipse . . . . . 140
5.1 Illustration of Thread-Local Storage (TLS) . . . . . . . . . . . . . . . 144
5.2 Illustration of XSAMG preconditioner [31] . . . . . . . . . . . . . . . 147
5.3 Illustration of Block Jacobi preconditioner . . . . . . . . . . . . . . . 149
5.4 Performance result of the full SPE 10 model with TPFA discretization 150
5.5 Performance result of the full SPE 10 model with MPFA O-method
discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.1 Illustration of the Fermi GPU architecture [52] . . . . . . . . . . . . . 154
6.2 Examples of different coloring strategies . . . . . . . . . . . . . . . . 160
6.3 Runtime profile of the GPU-based MPNF preconditioner for solving
the pressure system of the top 10 layers of the SPE 10 reservoir model
in single-precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4 Comparison of CUDA memory access patterns . . . . . . . . . . . . . 165
6.5 Runtime profile of the implementation with BiCGStab, customized
reduction kernel, and coalesced memory access . . . . . . . . . . . . . 167
6.6 Example of partitioning with three GPUs (top view) . . . . . . . . . 174
6.7 Illustration of left-right data transfer approach (modified from [49]) . 175
6.8 Illustration of pairwise data transfer approach (modified from [49]) . . 176
6.9 Assignment of tasks to two streams in the solution phase . . . . . . . 177
6.10 Assignment of tasks to two streams in the setup phase . . . . . . . . 179
6.11 The performance results of the upscaled SPE 10 problem with 139
thousand cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.12 The performance results of the full SPE 10 problem with 1.1 million cells . 183
6.13 The performance results of the refined SPE 10 problem with 8.8 million
cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.14 The performance results of the further refined SPE 10 problem with
26.9 million cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.1 Overall structure of the entire simulator . . . . . . . . . . . . . . . . 219
A.2 Structure of the NonlinearFormulation . . . . . . . . . . . . . . . . . 221
A.3 Structure of the Reservoir . . . . . . . . . . . . . . . . . . . . . . . . 223
A.4 Structure of the Fluid . . . . . . . . . . . . . . . . . . . . . . . . . . 224
A.5 Structure of the Facilities . . . . . . . . . . . . . . . . . . . . . . . . 226
A.6 Structure of the AIMScheme . . . . . . . . . . . . . . . . . . . . . . 227
A.7 Structure of the NonlinearSolver . . . . . . . . . . . . . . . . . . . . 228
A.8 Structure of the LinearSystem . . . . . . . . . . . . . . . . . . . . . 229
Chapter 1
Introduction
1.1 Background and Dissertation Outline
Reservoir simulation is a primary tool for planning and managing oil recovery and
CO2 sequestration processes. In recent years, there has been significant growth in
the resolution and complexity of simulation models of practical interest. This growth
includes both the reservoir and the well models. Moreover, there is a growing need
for accurate modeling of Enhanced Oil Recovery (EOR) processes from conventional
and unconventional resources. Several important efforts aimed at developing reser-
voir flow simulators based on generalized (thermal) compositional formulations have
been reported [5, 15, 18, 23, 25, 63]. In Stanford’s Reservoir Simulation Industrial Af-
filiates Program (SUPRI-B, see https://pangea.stanford.edu/researchgroups/
supri-b/), significant efforts have been invested to develop a flexible research plat-
form for general-purpose reservoir flow simulation.
The General Purpose Research Simulator (GPRS), which was first developed by
Cao [15] and extended significantly by Jiang [38], is a powerful platform that serves
as the predecessor of the new computational platform discussed here. GPRS is distin-
guished from the previous efforts in its extensible modular design and object-oriented
computer code written in C++ [38]. GPRS employs a compositional-thermal formula-
tion, utilizes a Two Point Flux Approximation (TPFA) for spatial discretization, and
the Adaptive Implicit Method (AIM) [15, 65, 80] for time discretization. GPRS has
a connection-based design allowing for both structured and unstructured grids. Ad-
vanced well modeling capabilities, including advanced multilateral wells, are available
in GPRS. Advanced linear solvers with multistage preconditioners make it possible to
solve large models with unstructured grids and complex wells [38,39]. More recently,
chemical reaction modeling has also been developed and integrated into GPRS [28,29].
With these features, GPRS can simulate subsurface CO2 sequestration and flow in fractured formations, and can facilitate gradient-based optimization [69]. Many students and
researchers in our group have contributed to this reservoir-simulation research plat-
form. The most significant developments include those by Cao [15,16], Jiang [38,39],
Fan [28,29], Pan [60,61], and Voskov [86,87].
Given that we already have a powerful simulation research platform, namely GPRS, why build a new research platform using Automatic Differentiation (AD)? To answer this question, we need to consider the unresolved and upcoming challenges associated with the development of a general-purpose reservoir-simulation platform. The research platform should be able to accommodate the growing variety and complexity of subsurface nonlinear processes that must be modeled accurately and efficiently.
All the existing general-purpose reservoir simulators, both in industry and academia, employ hand (manual) differentiation and implementation of the Jacobian [15, 21, 25, 70, 75]. Derivation of analytical derivatives, coding, debugging, and extensive testing are necessary whenever a new physical mechanism, constitutive relation, or discretization scheme (in space or time) is to be added. This time-consuming, tedious, and error-prone process of constructing the Jacobian matrix is a major reason why it is extremely difficult to extend the capabilities of existing reservoir simulators, including the current version of GPRS.
Here, our objective is to establish a general-purpose numerical simulation frame-
work that can be used as a flexible, extensible, and computationally efficient platform
for reservoir simulation research. For this purpose, AD is introduced as a key capabil-
ity of the new framework. Given the discrete form of the governing nonlinear residual
equations and declaration of the independent variables, the AD library employs ad-
vanced expression templates with block data-structures to automatically generate
compact computer code for the Jacobian matrix [92]. A brief description of our new
AD-GPRS framework is given in Section 1.2.
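The underlying idea can be illustrated with a minimal forward-mode AD type. ADETL's expression-template machinery is far more elaborate (sparse block derivatives and lazy evaluation), but the principle of propagating derivatives through overloaded arithmetic is the same; all names below are illustrative, not ADETL's actual API.

```cpp
#include <cassert>

// Minimal forward-mode AD scalar: carries a value and one derivative.
struct Dual {
    double val;  // function value
    double der;  // derivative w.r.t. the chosen independent variable
};

Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.der + b.der}; }
Dual operator*(Dual a, Dual b) {
    // product rule: (ab)' = a'b + ab'
    return {a.val * b.val, a.der * b.val + a.val * b.der};
}

// A residual-like expression R(p) = p*p + 3*p. Given only this code for
// the residual, the Jacobian entry dR/dp = 2p + 3 falls out automatically
// from the overloaded arithmetic -- no hand-coded derivative is needed.
Dual residual(Dual p) {
    Dual three{3.0, 0.0};  // constant: zero derivative
    return p * p + three * p;
}
```

Seeding the input with derivative 1 (`Dual p{2.0, 1.0}`) yields both the residual value and its Jacobian entry in one evaluation.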
[Figure 1.1 is a diagram whose labels list the key capabilities: Fractured Porous Media; CO2 Sequestration; Geomechanics Coupling; Gradient-based Optimization and History Matching; Compositional/Thermal Formulation; Advanced Linear/Nonlinear Solvers; Unstructured Grids with TPFA/MPFA; Adaptive Implicit Method (AIM); General Multi-Segment Wells; Multi-Core/GPU Parallelization; Automatic Differentiation.]
Figure 1.1: Key features and capabilities of Automatic-Differentiation General-Purpose Research Simulator (AD-GPRS)
We describe the overall design of our new generation of GPRS based on a flex-
ible AD framework. The key features and capabilities of AD-GPRS are shown in
Figure 1.1. With the research reported in this dissertation, AD-GPRS is capable of
modeling thermal-compositional fluid flow in reservoir models with fully unstructured
grids. AD-GPRS has general spatial and temporal discretization schemes, advanced
linear solvers, and a generalized multisegment well model. In addition, AD-GPRS
supports OpenMP parallelization on multicore platforms and a Nested Factorization
(NF, see [7]) linear solver for systems with multiple GPUs. With these capabilities,
AD-GPRS can simulate challenging problems in a flexible and efficient way. This in-
cludes modeling the long-term behavior of CO2 sequestration processes and fluid flow
in reservoirs with complex geological features, such as fractures and faults. Gradient-
based optimization and history matching are also supported through a fully-integrated
AD-based adjoint capability.
For flexible reservoir modeling, AD-GPRS supports generally unstructured grids,
employs a generalized MultiPoint Flux Approximation (MPFA [2, 3, 27, 59]) for spa-
tial discretization, and uses a multilevel Adaptive Implicit Method (AIM) for time
discretization. The MPFA and AIM capabilities are described in Chapter 2. For gen-
erality, no particular structure is assumed for the stencil. A generalized connection
list is introduced to locate the block nonzero entries in the system matrix with general
MPFA discretization. Our AIM implementation is designed to facilitate systematic
application of the method to new fluid models and variable formulations. AD-GPRS
allows for any combination of TPFA (Two-Point Flux Approximation), MPFA, FIM
(Fully Implicit Method), and AIM. The generic and modular design is amenable to
extension, both in terms of modeling additional flow processes and implementing new
numerical methods. The AD-based modeling capability is demonstrated for highly
nonlinear compositional problems using challenging large-scale reservoir models that
include full-tensor permeability fields and nonorthogonal grids. The behaviors of
TPFA and several MPFA schemes are analyzed for both FIM and AIM simulations.
The implications of using MPFA and AIM on both the nonlinear and linear solvers
are discussed and analyzed.
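As a sketch of the connection-list idea (the names and layout are illustrative, not AD-GPRS's actual classes), each connection can be reduced to an interface record carrying the full stencil of cells whose unknowns enter its flux expression; the block nonzero pattern of the Jacobian then follows directly:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Illustrative generalized connection: the two cells exchanging flux plus
// the stencil of cells appearing in the MPFA flux expression.
// (For TPFA the stencil degenerates to just {cellL, cellR}.)
struct Connection {
    int cellL, cellR;
    std::vector<int> stencil;
};

// Block nonzero pattern of the system matrix: each flux contributes to the
// residual rows of both adjacent cells, in every column of its stencil.
std::set<std::pair<int, int>> blockNonzeros(const std::vector<Connection>& conns) {
    std::set<std::pair<int, int>> nz;
    for (const auto& c : conns)
        for (int col : c.stencil) {
            nz.insert({c.cellL, col});  // flux enters cellL's residual row
            nz.insert({c.cellR, col});  // and, with opposite sign, cellR's row
        }
    return nz;
}
```

No grid structure is assumed anywhere: structured, unstructured, TPFA, and MPFA stencils all flow through the same loop.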
For efficient linear solution of coupled reservoir models and advanced wells, AD-
GPRS supports a block linear system based on the MLBS (MultiLevel Block Sparse)
data structure [38], which is discussed in Chapter 3. MLBS was first designed and im-
plemented in the original GPRS and further extended in AD-GPRS. Equipped with
GMRES (Generalized Minimal RESidual, see [67, 68]) and the CPR (Constrained
Pressure Residual, see [88, 89]) preconditioner, MLBS is very powerful and highly
efficient. This is the reason why MLBS is employed in AD-GPRS. The MLBS hierar-
chical data structure and associated solution strategies, which include matrix extrac-
tion, algebraic reduction, iterative linear solution and preconditioning, and explicit
updating, are discussed in detail.
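The algebraic-reduction step can be illustrated on a 2x2 block partition, with scalars standing in for the sub-blocks; this is a generic Schur-complement sketch under that simplification, not the MLBS implementation itself:

```cpp
#include <cassert>

// Eliminate "secondary" unknowns y from the coupled system
//   A x + B y = f
//   C x + D y = g
// via the Schur complement (A - B D^{-1} C) x = f - B D^{-1} g,
// then recover y by explicit updating: y = D^{-1}(g - C x).
// Scalars stand in for the sub-blocks of the multilevel block matrix.
struct Reduced { double schur; double rhs; };

Reduced reduce(double A, double B, double C, double D, double f, double g) {
    return {A - B / D * C, f - B / D * g};
}

double recoverSecondary(double C, double D, double g, double x) {
    return (g - C * x) / D;
}
```

The iterative solver only ever sees the reduced system; the secondary unknowns are updated explicitly afterwards, which is exactly the pattern the multilevel solution strategy repeats at each level of the hierarchy.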
For accurate modeling of multiphase flow in wellbores and surface (pipeline) net-
works, a general MultiSegment (MS) well model [38] is implemented in AD-GPRS
and discussed in Chapter 4. In this MS model, variables and equations are defined
for both nodes and connections. The general MS well model allows for the following advanced features: general branching, loops with arbitrary flow directions, multiple exit connections with different constraints, and special nodes (e.g., separators, valves). The model definition, mathematical formulation, and AD-based implementation, including the extensions of the AD framework and the well initialization, calculation, and variable-updating procedures, are described in detail.
ear and nonlinear solvers to address the numerical difficulties brought about by the
general MS well model are also discussed. Moreover, the robustness and efficiency
of AD-GPRS for complex reservoir models with MS wells are demonstrated with
numerical examples.
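Conceptually, the well topology reduces to a directed graph in which each connection simply lists its endpoint nodes; because no tree structure is assumed, branching, loops, and multiple exits all come for free. A hypothetical sketch (not the AD-GPRS data structures) of how node mass balances follow from such a list:

```cpp
#include <cassert>
#include <vector>

// Connection-based well topology: variables live on both nodes (e.g.,
// pressure) and connections (e.g., mixture flow rate). A positive rate
// means flow from 'from' to 'to'.
struct WellConnection { int from, to; double rate; };

// Net inflow into each node; the node mass-balance residual drives this
// to zero (perforation and exit source terms omitted for brevity).
std::vector<double> netInflow(int nNodes, const std::vector<WellConnection>& conns) {
    std::vector<double> q(nNodes, 0.0);
    for (const auto& c : conns) {
        q[c.to]   += c.rate;
        q[c.from] -= c.rate;
    }
    return q;
}
```

For example, two branches feeding a junction node that drains through a single exit connection balance exactly when the exit rate equals the sum of the branch rates.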
Parallel reservoir simulation has recently drawn a lot of attention due to the rapidly increasing size and complexity of simulation models, as well as the growth
in computational power. Due to the limitations of power consumption and heat
dissipation, single-core processors with high clock frequency have already been phased
out [79]. The various architectures of parallel computation bring new opportunities
and challenges to existing algorithms. Specializing the algorithms for the target
parallel architecture has grown in importance in the last few years. Emerging parallel
architectures for high performance computing include multicore, many-core [43, 71],
and GPGPU (General-Purpose computation on Graphics Processing Units) [52, 55]
platforms.
We describe an architecture-aware approach to parallel reservoir simulation in
two chapters. In Chapter 5, we describe multithreading parallelization of AD-GPRS.
Both the Jacobian generation and linear-solver computations are covered. Parallel
Jacobian construction is achieved with a thread-safe extension of our AD library.
For the linear solution, a two-stage CPR (Constrained Pressure Residual) preconditioning strategy is used. The latest parallel multigrid solver from Fraunhofer SCAI, XSAMG [31], serves as the first-stage pressure preconditioner, whereas the Block Jacobi technique, with Block ILU(0) as the local preconditioner, is employed in the second stage. The parallel performance of AD-GPRS is demonstrated using the full
SPE 10 problem [19] with three different discretization schemes for nonorthogonal
grids on multicore platforms.
In Chapter 6, we discuss the details of the multi-GPU parallelization of Nested
Factorization (NF) [7], a linear solver in which high concurrency can be exploited
through proper modification. We build on the Massively Parallel NF (MPNF) frame-
work described by Appleyard et al. [8]. The most important features of our GPU-
based implementation of MPNF include: 1) special ordering of the matrix elements
to maximize coalesced access to GPU global memory, 2) application of ‘twisted fac-
torization’ to increase the number of concurrent threads at no additional cost, and
3) multi-GPU extension of the algorithm by first performing computation in the halo
region in each GPU and then overlapping the peer-to-peer memory transfer between
GPUs with the computation of the interior regions. Numerical examples, including
upscaled, full, and refined SPE 10 problems, are used to demonstrate the parallel
performance of our MPNF implementation on single-GPU and multi-GPU platforms.
Finally, we summarize the research discussed here and draw conclusions in Chapter
7. Possible directions for future work are also suggested.
1.2 Automatic Differentiation
AD is a technique for generating computer code that computes derivatives. The
AD process consists of (1) analyzing the expression parse-tree and decomposing the
expression into basic unary or binary (+, -, *, /) operations, (2) applying the basic
differentiation rules: linearity, product, and treatment of quotients, (3) performing
transcendental elementary function derivatives (e.g., (exp(v))′ = exp(v) · v′), and (4)
performing the chain rule (i.e., (f ◦ g)′ = (f ′ ◦ g)g′) [30, 34, 64]. AD offers flexibility,
generality, and accuracy up to machine precision. However, because the algorithmic
complexity of AD is at least comparable to that of analytical differentiation, the ability
to develop an optimally efficient AD library is domain specific and often requires
significant effort [92].
Various aspects of AD, including improved higher-order derivative evaluation and
sparsity-aware methods, are under active research and development by a sizeable
community. Introductions, recent research activities [12, 14] and software packages
are available (see http://www.autodiff.org). Although AD is being used in nu-
merical simulation and optimization [13, 24, 42], it is not yet a mainstream approach
in industrial-grade, large-scale simulators.
AD is different from Numerical Differentiation (ND), which is a popular approach
for computing gradients [50]. ND uses truncated Taylor series to approximate deriva-
tives. For example, the second-order central differencing for the first derivative can
be expressed as:
$f'(x) = \left[f(x+\Delta x) - f(x-\Delta x)\right]/(2\Delta x) + O(\Delta x^2).$ (1.1)
The implementation is usually simpler than hand differentiation (i.e., coding the
analytic form of the derivative) because an explicit algebraic form of the derivatives
is not required. ND requires evaluation of the residual equations at a number of
points, which depends on the number of variables and the specific approximation
scheme. For instance, if central differencing is used for a residual equation f with N
variables, then 2N function evaluations are needed.
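The cost pattern of ND can be made concrete with a short sketch (illustrative Python; the function and variable names are ours, not part of any simulator). Each gradient component of a residual with N variables costs one pair of residual evaluations, for 2N in total:

```python
# Second-order central differencing, as in Eq. (1.1): each component of
# the gradient of a residual f with N variables costs two evaluations
# of f, i.e., 2N evaluations in total. Illustrative code only.

def central_gradient(f, x, dx=1e-6):
    """Approximate the gradient of f at point x by central differences."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += dx        # forward-shifted point
        xm = list(x); xm[i] -= dx        # backward-shifted point
        grad.append((f(xp) - f(xm)) / (2.0 * dx))
    return grad

# Example: f(x) = x0^2 + 3*x1, so df/dx0 = 2*x0 and df/dx1 = 3.
f = lambda x: x[0] ** 2 + 3.0 * x[1]
g = central_gradient(f, [2.0, 5.0])      # approximately [4.0, 3.0]
```

Shrinking dx much further in this sketch would begin to amplify round-off error, which illustrates the interval-selection difficulty noted in [50].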
AD has three primary advantages over ND: (1) conditional branches (e.g., upwind-
ing, or variable switching) are treated easily in AD, whereas they cannot be handled
readily in ND due to discontinuity in the derivatives; (2) AD has no truncation error,
because it generates the code that corresponds to the analytical derivatives, whereas
it is not always possible to bound the truncation error a priori in ND by selecting a proper interval $\Delta x_i$: too large an interval leads to large truncation errors, and too small an interval incurs significant round-off errors [50]; and (3) AD evaluates a residual
expression in a single pass, whereas the algorithmic complexity of ND grows quickly
with the number of variables. These advantages of AD over ND are particularly
important for the development of robust simulators for strongly coupled nonlinear
processes [92].
A simple example of AD is presented next. First, the addition operation (+) and
the sine (sin) function evaluator are both augmented with the ability to compute
derivatives (underlined functions are augmented), namely:
$a \,\underline{+}\, b = \{a+b,\; a'+b'\},$ (1.2)

$\underline{\sin}(f) = \{\sin(f),\; \cos(f)\cdot f'\}.$ (1.3)
Then, using the chain rule, the value and derivatives of sin(a + b) can be computed
as:
$\underline{\sin}(a \,\underline{+}\, b) = \underline{\sin}(\{a+b,\; a'+b'\})$
$= \{\sin(a+b),\; \cos(a+b)\cdot(a'+b')\}$ (1.4)
$= \{\sin(a+b),\; \cos(a+b)\cdot a' + \cos(a+b)\cdot b'\}.$
Repeating this process, it is easy to see that no matter how complex the residual
equation, R, is, the associated gradient, R′, will always be in the form of a linear
combination of ‘simple’ sparse gradients, which have already been defined or computed. That is,

$R' = c_1 v'_1 + c_2 v'_2 + \dots + c_N v'_N$ (1.5)

where $c_k$ ($1 \le k \le N$) are scalar coefficients and $v'_k$ ($1 \le k \le N$) are sparse gradients
that have been defined or computed prior to the evaluation of R. A slightly more
complicated AD example of a two-cell oil-water problem is described in [97].
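The augmented operations of Eqs. (1.2)-(1.4) can be mimicked with a minimal forward-mode sketch (illustrative Python with dict-based sparse gradients; ADETL itself is a C++ expression-template library, and the class below is not its API):

```python
import math

# Minimal forward-mode AD in the spirit of Eqs. (1.2)-(1.4): a value
# carries a sparse gradient (variable index -> derivative), and each
# augmented operation applies the corresponding differentiation rule.
# Illustrative only; not ADETL's actual design.

class ADScalar:
    def __init__(self, val, grad=None):
        self.val = val
        self.grad = dict(grad or {})     # sparse gradient

    def __add__(self, other):            # Eq. (1.2): (a + b)' = a' + b'
        g = dict(self.grad)
        for k, d in other.grad.items():
            g[k] = g.get(k, 0.0) + d
        return ADScalar(self.val + other.val, g)

def ad_sin(f):                           # Eq. (1.3): sin(f)' = cos(f) * f'
    return ADScalar(math.sin(f.val),
                    {k: math.cos(f.val) * d for k, d in f.grad.items()})

# Independent variables carry derivative 1 with respect to themselves.
a = ADScalar(1.0, {0: 1.0})
b = ADScalar(2.0, {1: 1.0})
r = ad_sin(a + b)                        # Eq. (1.4): {sin(a+b), cos(a+b)*(a'+b')}
```

Only the expression `ad_sin(a + b)` is written by the user; the gradient of the result emerges from the augmented operations, exactly as in Eq. (1.4).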
Several investigations of the use of AD for reservoir simulation have been pur-
sued recently. In [92, 93], an automatically differentiable data-type was introduced,
and an AD library - the Automatically Differentiable Expression Templates Library
(ADETL) - was designed and implemented. To make the automatic differentiation
process efficient, customized memory allocators and block-sparse treatment were later
incorporated into ADETL [96]. A novel ADETL-based simulation framework that al-
lows for a wide range of nonlinear solution strategies (e.g., natural, or molar, variable
sets) in compositional simulation was presented recently [84]. Our new research plat-
form AD-GPRS will soon have all the capabilities of GPRS, which had been developed
using hand differentiation [15,38].
1.2.1 Hierarchical organization of simulation variables
In order to establish a seamless link between the library (ADETL) [92, 93, 96] and
the simulator (AD-GPRS), ADETL provides a series of adapters, among which the
global variable set (adX) is the most important one. This adX adapter contains the
entire set of simulation variables (independent, dependent, and constant) for all cells
in the reservoir model, as well as for all nodes and connections in all the facilities
(e.g., wells, surface pipelines, etc.). A detailed description of the global variable set
and other adapters can be found in the ADETL User Manual [97].
A hierarchical structure is adopted for the organization and indexing of all vari-
ables in adX : at the highest level, adX is composed of several subsets correspond-
ing to different bases of variables. For example, we currently have one subset for
node-based variables and one subset for connection-based variables in AD-GPRS. In
order to specify the arrangement of variables among multiple subsets, a ‘helper class’
(AD Structure) is provided by ADETL. In adX, we can define any number of such
structure records, each of which represents a certain portion of one variable subset.
Here, a portion can be a number of grid blocks, well nodes, or well connections. Then,
the entire set of variables is organized according to the order of these records. Please
refer to the “Adapters” example in the ADETL User Manual [97] for details.
On the second level within each structure record, there are several blocks of vari-
ables. Here, a block has a generic definition and can be used to represent a reservoir
cell, or well node, in the node-based subset, or a well connection in the connection-
based subset. In each block, there is a number of variables. This number is deter-
mined for each subset at the beginning of a simulation and is fixed for the simulation
run. Usually, the number of variables depends on the number of (fluid) phases and
components.
On the third level, variables within a block can be classified as independent or
Table 1.1: A hypothetical variable set for three-phase black-oil simulation [85]

Phase state | P | So | Sg | Rs
O, W        | 0 | 1  | -  | -
O, G        | 0 | 1  | -  | 2
O, W, G     | 0 | 1  | 2  | 3
dependent (or constant) variables. The dependency of a variable is not a static prop-
erty, but is determined dynamically by the (phase) state of the block. For example,
in the variable set shown in Table 1.1, $R_s$ is an independent variable for the phase state (O, G), but a dependent variable if the phase state is (O, W). If a variable
is independent, it has an active index greater than or equal to 0, which represents
the ordering of all independent variables within that block. If the phase state of a
reservoir cell is (O, G), as given in Table 1.1, then $P$, $S_o$, and $R_s$ are the 0th, 1st, and
2nd independent variable in that cell, respectively. On the other hand, if a variable is
dependent, it has an active index of −1, and can be either a constant (not depending
on any variables) or a function of other variables (e.g., $\rho_p = \rho_p(P, T, x_{c,p})$).
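Under the variable set of Table 1.1, this bookkeeping can be sketched as follows (hypothetical Python names; AD-GPRS's internal structures differ). An entry of -1 marks a dependent variable; nonnegative entries give the ordering of the independent variables within the block:

```python
# Active indices per phase state, following Table 1.1 (illustrative).
ACTIVE_INDEX = {
    ("O", "W"):      {"P": 0, "So": 1, "Sg": -1, "Rs": -1},
    ("O", "G"):      {"P": 0, "So": 1, "Sg": -1, "Rs": 2},
    ("O", "W", "G"): {"P": 0, "So": 1, "Sg": 2,  "Rs": 3},
}

def independent_variables(state):
    """Names of the independent variables of a block, in active-index order."""
    idx = ACTIVE_INDEX[state]
    return [v for v, i in sorted(idx.items(), key=lambda kv: kv[1]) if i >= 0]

# Rs is independent for (O, G) but dependent for (O, W):
og = independent_variables(("O", "G"))    # ["P", "So", "Rs"]
ow = independent_variables(("O", "W"))    # ["P", "So"]
```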
On the fourth level, independent variables can be further classified as primary and
secondary variables. With an active index between 0 and $N_{\text{Primary}}-1$, the variable is primary; otherwise (with an active index greater than or equal to $N_{\text{Primary}}$), the variable is secondary. The residual vector and associated Jacobian matrix formed
using all independent variables make up the full system, which is usually too large
to be solved efficiently at the linear level. Among the governing equations for each
block, there are several ($N_{\text{Primary}}$) global conservation equations and the rest are local
constraints (e.g., thermodynamic equilibrium). As a consequence, we may perform
an algebraic reduction on the full system in order to generate a system that only
contains the first NPrimary equations and variables in each block [15]. This generated
system is called the primary system, which can be solved on the linear level for the
solution to primary variables. Then an explicit update can be performed to obtain
the solution to secondary variables.
If the Adaptive Implicit Method (AIM) is applied, the primary variables can be
further classified as implicit and explicit variables. Through a further step of algebraic
reduction after preparing the primary system, we can obtain an even smaller linear
system, which is called the implicit system. Correspondingly, after the solution of the
implicit system and before the update of secondary variables, an additional explicit
updating step can be applied to compute the solution of the explicit primary variables.
The details of the two-step algebraic reduction and the two-step explicit update can
be found in Section 3.4.
1.2.2 Building residual equations
The data type of all basic elements (variables) in the global variable set adX is
called ADscalar. An ADscalar stores not only the value, but also the gradient of one
variable, i.e., its derivatives with respect to all independent variables. The residual
equations in AD-GPRS have the same data type and are constructed using these AD
variables. There are three major steps in building and utilizing residual equations:
• Declaring independent variables:
The declaration of independent variables is managed by the global variable set
according to the block phase state, which is referred to as the ‘status’. Therefore, we need only determine the status of each block based on certain conditions (e.g., the appearance/disappearance of phases). After the
status of all blocks in both subsets is updated, we can call the specific function
provided in adX to map the status changes to the declaration of independent
variables. Occasionally, for a local system (e.g., Newton flash computations)
that has a different ordering from the global system, specific interfaces (e.g.,
make_independent; see [97] for details) in ADETL can be called to declare local
independent variables. For each independent variable, its gradient will be set
to contain a value of 1 with respect to itself only and 0 with respect to all other
variables.
• Writing residual equations:
This step contains several substeps: 1) property calculation (evaluating dependent variables as functions of independent variables); 2) for each reservoir cell,
calculate accumulation and other local terms and add them to the residual
equations of that cell; 3) for each interface between reservoir cells, compute
flux terms and add them to the residual equations of corresponding cells; 4) for
each well, compute the source/sink terms and add them to the reservoir resid-
ual equations of perforated cells; and 5) add one or more residual equations
for each well. In all these substeps, only the code that computes the residual
equations (or properties used in residual equations) is required, while the code
that computes the associated Jacobian matrix (or derivatives of properties with
respect to independent variables) is not needed. The derivatives of all AD vari-
ables, including the entire AD residual vector, are automatically computed by
ADETL.
• Using automatically generated gradients:
In this step, we use the interfaces in ADETL to extract the automatically gener-
ated Jacobian matrix, in which each column represents one independent variable
and contains the derivatives of all residual equations with respect to it. Corre-
spondingly, each row represents one residual equation and contains its deriva-
tives with respect to all independent variables. Depending on the structure and
format of the selected linear system, different auxiliary functions provided in
ADETL are used in the extraction process. After extraction, the full system,
as introduced in Section 1.2.1, is obtained in the desired format. Then, we
apply the algebraic reduction to derive the primary, or implicit, system, solve
the obtained linear system, and perform the explicit update. These processes
are briefly explained in Section 1.2.1 and discussed in detail in Section 3.4.
An example explaining the AD-based residual and Jacobian generation for a simple
oil-water model following these steps can be found in the Tutorial section of the
ADETL User Manual [97].
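The three steps can be traced on a toy single-phase, two-cell problem (illustrative Python with a deliberately minimal value-and-gradient representation; all names are hypothetical and ADETL's real interfaces differ):

```python
# Step 1: declare independent variables; step 2: write residuals using
# only value-level code; step 3: read off the automatically carried
# Jacobian. A value is a pair (val, {var_index: derivative}).

def var(val, i):
    """Declare independent variable i: derivative 1 w.r.t. itself."""
    return (val, {i: 1.0})

def add(a, b):
    g = dict(a[1])
    for k, d in b[1].items():
        g[k] = g.get(k, 0.0) + d
    return (a[0] + b[0], g)

def scale(c, a):
    """Multiply an AD value by a constant c."""
    return (c * a[0], {k: c * d for k, d in a[1].items()})

# Two cells with pressures P0, P1 and transmissibility T = 0.5:
# the flux T*(P1 - P0) enters cell 0's residual and leaves cell 1's.
P0, P1, T = var(10.0, 0), var(12.0, 1), 0.5
flux = scale(T, add(P1, scale(-1.0, P0)))
R = [flux, scale(-1.0, flux)]

# Jacobian row i holds dR_i/d[P0, P1], extracted from the gradients.
J = [[R[i][1].get(j, 0.0) for j in range(2)] for i in range(2)]
```

Note that only the residual expressions are coded explicitly; the Jacobian entries are carried along by the augmented operations and merely extracted at the end, which is the essence of the three-step workflow.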
Chapter 2
AD Framework with MPFA and
AIM capabilities
2.1 Introduction
Many of today’s industrial reservoir simulation models are quite complex with highly
detailed (often full-tensor) permeability distributions, geometrically complex geologic
features (e.g., faults, fractures, pinch-outs) and advanced wells (multilateral, multi-
segment). Complex grids, advanced multilateral wells, and heterogeneous full-tensor
permeability pose significant challenges for the space and time discretization schemes
applied to the nonlinear conservation equations. In most simulators, TPFA (Two-
Point Flux Approximation) is used for the ‘geometric part’ of the flux. Phase-based,
SPU (Single-Point Upstream) weighting is used for the flux of a phase, or compo-
nent [9]. The use of TPFA and SPU is widespread, due to their simplicity and ro-
bustness. However, when used with nonorthogonal grids and (absolute) permeability
tensors whose principal components are not aligned with the grid, TPFA introduces
an error that does not diminish as the grid is refined [27]. MPFA (MultiPoint Flux
Approximation) schemes have been developed to provide a consistent representation
Figure 2.1: Illustration of different MPFA schemes: (a) the MPFA O(η)-method, where η denotes the position of the continuity point on a subface; (b) the MPFA L-method
of the geometric part of the flux [2, 3, 27, 59]. Here, through use of the acronym
MPFA, we include all locally conservative schemes that represent a flux through an
interface as a linear combination of potentials in neighboring cells, e.g., the MPFA
O(η)-method [2,27,83] shown in Figure 2.1(a) and the MPFA L-method [3,4] shown
in Figure 2.1(b). Our implementation is designed to accommodate all such schemes
into compositional simulation models. See [1] for an introduction to MPFA methods.
For time discretization, the FIM (Fully Implicit Method) uses a backward-Euler
strategy, in which all the degrees of freedom (variables) and the coefficients that de-
pend on them are evaluated at the new time level, n + 1 [9]. The advantage of FIM
is its unconditional stability. However, each timestep of FIM is quite expensive, both
in memory and computational cost, especially when the number of components is
large and the problem is highly nonlinear. Moreover, compared to mixed-implicit
schemes, the time truncation errors (numerical dispersion) associated with FIM can
be large. On the other hand, the IMPES (IMplicit Pressure Explicit Saturation)
scheme, which treats all the variables other than pressure explicitly, is computation-
ally efficient for a single timestep. However, the stability constraint on the IMPES
timestep size can be quite restrictive [22]. As a result, the total cost of an IMPES
Figure 2.2: Illustration of the AIM scheme (IMPES+FIM); the figure contrasts stability and per-iteration efficiency for FIM, IMPES, and AIM relative to the stability limit of a large timestep ∆t
simulation can easily exceed that of FIM. The performance of the IMPSAT (IMplicit
Pressure and SATuration) scheme, which treats pressure and saturation(s) implicitly,
lies between that of IMPES and FIM [15]. In order to take advantage of the uncon-
ditional stability of FIM and the low computational cost (on a per Newton iteration
basis) of IMPES, or IMPSAT, time discretization schemes with different levels of im-
plicitness can be combined [15,65,80]. The AIM (Adaptive Implicit Method) scheme
utilizes such a combination by applying implicit treatment only to variables with CFL
(Courant-Friedrichs-Lewy) numbers that exceed the explicit stability limit. This idea
is illustrated in Figure 2.2, where the AIM scheme combines FIM and IMPES to
achieve an optimal balance between stability and efficiency. Note that pressure is
always treated implicitly. In thermal models, temperature in the convection terms of
the energy balance can be treated implicitly, or explicitly, depending on the stability
limit [51].
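The cell-by-cell selection can be sketched as follows (illustrative only: the helper names are ours, and AD-GPRS's actual stability analysis for multiphase, multicomponent flow is considerably more involved):

```python
# AIM implicitness selection sketch: cells whose CFL number exceeds the
# explicit stability limit are treated implicitly; the rest explicitly.
# Pressure is implicit everywhere regardless of this per-cell choice.

def cfl_numbers(velocities, dt, dx):
    """One-dimensional CFL numbers |u|*dt/dx per cell (hypothetical helper)."""
    return [abs(u) * dt / dx for u in velocities]

def select_treatment(cfl, stability_limit=1.0):
    """Label each cell for implicit (FIM-like) or explicit (IMPES-like) treatment."""
    return ["implicit" if c > stability_limit else "explicit" for c in cfl]

cfl = cfl_numbers([0.5, 8.0, 2.5, 0.1], dt=0.2, dx=1.0)
labels = select_treatment(cfl)            # only the fast cell is implicit
```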
2.2 AD-Based MPFA Framework
The MPFA O-method had been implemented in the original GPRS; however, the
MPFA capability of GPRS was not robust, for the following reasons:
• Loss of generality: TPFA and MPFA were implemented as two separate branches of the simulator, with different data structures and algorithms.
Both TPFA and MPFA versions of many C++ classes were needed, which led
to code duplication and difficulty in maintaining and extending the simulation
capabilities.
• Incompatibility: Several capabilities, including AIM, were only implemented
within the TPFA branch. With time, the disparity in both functionality and
robustness between the TPFA and MPFA branches continued to grow.
• Inefficiency: Because some advanced options did not work with the MPFA
branch, the TPFA branch became the default code base and enhancements to
the efficiency of the simulator, such as our specialized block-based GMRES
solver [68], were not applied to the MPFA branch.
In order to overcome these limitations, a generic design that accommodates both
MPFA and AIM is used in AD-GPRS. By ‘generic’, we mean that (1) a unified
structure with no code duplication is used to handle a wide range of MPFA and AIM
schemes, and that (2) minimal effort is needed for further modification and extension
of the existing functionality. Specifically, the general MPFA implementation assumes
no particular structure for the flux stencil. Hence, TPFA is simply a special case of
MPFA. Moreover, the computational cost and storage are proportional to the sparsity
pattern of the discretization stencil, even in cases where the sparsity pattern is not
symmetric. This generality is achieved by splitting the spatial discretization into two
levels: nonlinear and linear. Residual equations and the elements of the associated
Jacobian matrix are evaluated at the nonlinear level, whereas element extraction,
algebraic reduction, and linear-system assembly belong to the linear level.
2.2.1 Nonlinear level
At the nonlinear level within the framework, there is no need to consider the specific
data-structure of the Jacobian, i.e., where and how the derivatives should be com-
puted and stored. The only change needed at this level is in the computation of the
flux across the interface shared by two grid cells, i0 and i1. We assume that i0 < i1
and the interface normal-vector has an orientation that points into cell i0. The overall
flux of a component c from i1 to i0 is given by:
$F_c^{i_0,i_1} = \sum_p x_{c,p}\,\rho_p\,\lambda_p\,\Phi_p^{i_0,i_1}$ (2.1)

Here $x_{c,p}$ is the mole fraction of component $c$ in phase $p$, $\rho_p$ is the molar density of phase $p$, and $\lambda_p$ is the mobility of phase $p$. To simplify the notation, we have dropped the superscripts $i_0$ and $i_1$ in these expressions. $\Phi_p^{i_0,i_1}$ denotes the flow part of the flux of phase $p$. Here $x_{c,p}$, $\rho_p$, and $\lambda_p$ are evaluated using SPU weighting based on the sign of $\Phi_p^{i_0,i_1}$.
A TPFA expression for the flow part of the phase flux can be expressed as:
$\Phi_p^{i_0,i_1} = T^{i_0,i_1} \cdot \left(P_p^{i_1} - P_p^{i_0} - g \cdot \gamma_p^{i_0,i_1} \cdot (D^{i_1} - D^{i_0})\right)$ (2.2)

where $P_p^i$ is the pressure of phase $p$ in cell $i$, $g$ is the gravitational acceleration coefficient, $D^i$ is the depth of cell $i$, and $T^{i_0,i_1}$ is the two-point transmissibility coefficient for the interface $\{i_0, i_1\}$. We assume $T^{i_0,i_1} \ge 0$. $\gamma_p^{i_0,i_1}$ is the mass density of phase $p$ at the interface $\{i_0, i_1\}$, which can be expressed as:

$\gamma_p^{i_0,i_1} = \begin{cases} \left(\gamma_p^{i_0} + \gamma_p^{i_1}\right)/2, & \text{if phase } p \text{ appears in both cells } i_0 \text{ and } i_1 \\ \gamma_p^{i_0}, & \text{if phase } p \text{ appears only in cell } i_0 \\ \gamma_p^{i_1}, & \text{if phase } p \text{ appears only in cell } i_1 \\ 0, & \text{otherwise} \end{cases}$ (2.3)
In AD-GPRS, the simple expression (2.2) is replaced with a more general MPFA expression:

$\Phi_p^{i_0,i_1} = \sum_{m=0}^{n_p-1} T_{i_m}^{i_0,i_1} \cdot \left(P_p^{i_m} - g \cdot \gamma_p^{i_0,i_1} \cdot D^{i_m}\right)$ (2.4)

where $n_p$ is the number of points associated with the flux across interface $\{i_0, i_1\}$, and $T_{i_m}^{i_0,i_1}$ is the transmissibility coefficient associated with interface $\{i_0, i_1\}$ and cell $i_m$. We assume that $\sum_{m=0}^{n_p-1} T_{i_m}^{i_0,i_1} = 0$. Generally, $T_{i_0}^{i_0,i_1} \le 0$ and $T_{i_1}^{i_0,i_1} \ge 0$. We make no assumption regarding the flux stencil for $m \ge 2$, although most commonly used MPFA schemes restrict the flux stencil to those cells that geometrically share a vertex with the interface.

If we take $n_p = 2$, then the MPFA expression (2.4) is equivalent to its TPFA counterpart (2.2) with $T^{i_0,i_1} = T_{i_1}^{i_0,i_1} = -T_{i_0}^{i_0,i_1}$. As a consequence, we can use the
generalized MPFA expression (2.4) to represent a TPFA or MPFA flux. We reiterate
that at this level there is no need to consider how the gradients should be computed
and where they ought to be stored. With properly defined independent variables
(e.g., $P$, $S_p$, and $x_{c,p}$ in the natural-variables set) and dependent variables (e.g., $\gamma_p = \gamma_p(P, x_{c,p})$), the gradients of $\Phi_p^{i_0,i_1} = \Phi_p^{i_0,i_1}(P, \gamma_p)$ are evaluated automatically by the AD framework. Then, the value and associated gradients are used to compute the overall component fluxes, $F_c$, as in Eq. (2.1). Thus, by only changing the computation of the flow part of the phase flux, a generic implementation of MPFA is achieved at the nonlinear level without additional complications or specialization.
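The equivalence between the generalized expression (2.4) and its two-point counterpart (2.2) can be checked with a short sketch (illustrative Python; all names and numerical values are ours):

```python
# Flow part of the phase flux. MPFA form, Eq. (2.4):
#   Phi = sum_m T_m * (P_m - g * gamma * D_m),  with sum_m T_m = 0.
# A two-point stencil with T = [-T01, +T01] reproduces TPFA, Eq. (2.2).

def mpfa_phase_flux(T, P, D, gamma, g=9.81):
    assert abs(sum(T)) < 1e-12           # transmissibilities sum to zero
    return sum(t * (p - g * gamma * d) for t, p, d in zip(T, P, D))

def tpfa_phase_flux(T01, P0, P1, D0, D1, gamma, g=9.81):
    return T01 * (P1 - P0 - g * gamma * (D1 - D0))

# TPFA as the np = 2 special case (arbitrary illustrative values):
f_mpfa = mpfa_phase_flux([-2.0, 2.0], [100.0, 95.0], [10.0, 12.0], gamma=0.8)
f_tpfa = tpfa_phase_flux(2.0, 100.0, 95.0, 10.0, 12.0, gamma=0.8)
```

The zero-sum constraint on the transmissibilities is what allows the depth terms of Eq. (2.4) to collapse to the depth difference of Eq. (2.2) in the two-point case.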
2.2.2 Linear level
At the linear level, the main challenge is to identify and store the location of the
nonzero block entries in the Jacobian matrix. Once this sparsity pattern is known,
the following linear algebra operations can be performed:
• System matrix extraction from the AD residual vector.
• Algebraic reduction and explicit update.
• Linear solution including Sparse Matrix-Vector multiplication (SpMV) and pre-
conditioning.
The diagonal block entries, one per cell in the reservoir model with $N_B$ cells, are all
nonzero. The structure of these blocks will not change from TPFA to MPFA, even
though the values change with time. So, there is no need for a special list to record
the positions of the diagonal blocks.
On the other hand, the number, location, and form of the off-diagonal block
entries depend on the specific flux expression used. In a TPFA discretization with
$N_F$ interface fluxes, each flux expression leads to exactly two nonzero entries: one in
the lower, and one in the upper off-diagonal part of the Jacobian.
Consider the scenario depicted in Figure 2.3, in which four cells (white parallelo-
grams with a blue dot in the center of each cell) with global indices (cell numbers) k1,
k2, k3, and k4 share a common vertex (the top right corner of cell k1). Here we analyze
the block nonzero entries corresponding to the red interface between cell k1 and k2,
as well as the blue interface between cell k1 and k3. The fluxes at both interfaces can
be discretized with either TPFA or MPFA.
The TPFA stencils associated with the red and blue interfaces are {k1, k2} and
{k1, k3}, respectively. Note that k1 < k2 and k1 < k3. Block row i in the Jacobian
matrix refers to the set of discrete equations associated with cell i. Block column j
refers to the set of independent variables associated with cell j. Now, the flux for the
red interface appears in the discrete equations associated with cells k1 and k2, and the
flux involves the variables associated with cells in the stencil {k1, k2}. Therefore, this
flux introduces a nonzero block entry into the upper off-diagonal part of the matrix in
row k1, column k2, and a second nonzero block entry into the lower off-diagonal part
in row k2, column k1. Similarly, the flux for the blue interface introduces nonzero
off-diagonal blocks into row k1, column k3 and row k3, column k1. The positions of
these nonzero off-diagonal blocks are indicated in Figure 2.3(a). If TPFA is used for
all interfaces, each flux will introduce one nonzero entry in the upper off-diagonal part
of the matrix and one nonzero entry in the lower off-diagonal part. Each off-diagonal
block will store contributions from exactly one flux expression.
For the more general case of MPFA, such a simple mapping between fluxes and
off-diagonal blocks does not exist. As before, the flux for each interface appears in
the discrete equations associated with the two cells that share the interface. However,
each MPFA stencil $\{i_0, i_1, \ldots, i_{n_p-1}\}$ depends on the variables in $n_p$ cells, where $n_p \ge 2$ can be different for each flux, and will therefore contribute $n_p - 1$ off-diagonal blocks in row $i_0$ and $n_p - 1$ off-diagonal blocks in row $i_1$. Each off-diagonal block will
generally contain contributions from multiple fluxes. Assume that the MPFA stencil
associated with the red interface is {k1, k2, k3, k4}, and the stencil of the blue interface
is {k1, k3, k2, k4}. The positions of the off-diagonal blocks associated with the red and
blue interfaces are shown in Figures 2.3(b) and 2.3(c), respectively. Note that the
green blocks contain contributions from both fluxes. There is no guarantee that the
sparsity pattern of the Jacobian will be symmetric. An efficient data-structure is
needed to keep track of the sparsity pattern of the Jacobian associated with MPFA
Figure 2.3: Block nonzero entries corresponding to TPFA and MPFA fluxes: (a) the blue and red TPFA fluxes; (b) the red MPFA flux; (c) the blue MPFA flux
discretization.
We propose a ‘generalized connection list’ data-structure. At the beginning of a
simulation run, the list is generated from the specification of the MPFA stencils for
the $N_F$ flux entries in the model. Each entry includes the cells $\{i_0, i_1, \ldots, i_{n_p-1}\}$ and the transmissibilities $\{T_{i_0}, T_{i_1}, \ldots, T_{i_{n_p-1}}\}$ associated with each MPFA flux. The generalized connection list has three important properties: (1) only structural information
(i.e., the row and column indices of a nonzero entry) is stored. (2) Each nonzero entry
is represented by a unique (row, column) pair. That is, if the pair (k1, k2) has been
visited and is already accounted for in the generalized connection list when the red
MPFA flux was processed (see Figures 2.3(b) and 2.3(c)), then this (k1, k2) nonzero location
is not inserted into the list a second time when the blue MPFA flux is processed.
Note that the additional contribution to the (k1, k2) value from the blue MPFA flux
is accounted for during the simulation run. (3) All block nonzero entries, both in up-
per and lower off-diagonal parts (i.e., (k1, k2) and (k2, k1) are considered as different
entries), are added to the generalized connection list, in order to allow for general
schemes with possibly nonsymmetric connection structures.
The ‘set’ data-structure in the STL (Standard Template Library; see the description in [45]), which does not allow duplicate elements, is used to store
the connection list. For each MPFA flux associated with $n_p$ cells, the simulator attempts to insert each of the $2(n_p - 1)$ pairs into the generalized connection list. The set data-structure automatically skips over existing pairs and only inserts new occurrences. After every member of the MPFA flux list has been visited once, all of
the block nonzero entries in the off-diagonal parts of the matrix are recorded. Be-
cause this generalized connection list remains the same throughout the simulation,
we may then convert the generated set of block nonzero entries to any desired data
structure, which may be more efficient in terms of access, or better suited for parallel
computations.
Currently, the internal data format used to contain the converted generalized
connection list is CSR (Compressed Sparse Row), where the structure information
is stored in two arrays: row_ptr (row pointers) and col_ind (column indices). An
interface is provided for converting the intermediate set of block nonzero entries to
the internal CSR format. Note that because the diagonal entries always exist in each
block row, and are usually treated differently in the algebraic computations (e.g.,
algebraic reduction and explicit update), we do not store the diagonal entries in the
converted generalized connection list. Also, because only the structural information is
stored, there is no single val (data values) array associated with the generalized connection list. The actual data storage for the Jacobian is separate from this data
structure and will be discussed later in Section 3.3.
An example of the generalized connection list is shown in Figure 2.4. As described
above, an intermediate set of 20 off-diagonal block nonzero entries in eight rows is first
generated from all the fluxes and then converted into the CSR format. The diagonal
entries are not in the intermediate set, nor in the converted CSR arrays.
CHAPTER 2. AD FRAMEWORK WITH MPFA AND AIM CAPABILITIES 25
Figure 2.4: An example of the generalized connection list. For an eight-cell grid
(cells 0–7 arranged in two rows of four), the intermediate set contains the 20
off-diagonal pairs

    (0, 1) (0, 4) (1, 0) (1, 2) (1, 5)
    (2, 1) (2, 3) (2, 6) (3, 2) (3, 7)
    (4, 0) (4, 5) (5, 1) (5, 4) (5, 6)
    (6, 2) (6, 5) (6, 7) (7, 3) (7, 6)

which convert to the CSR arrays

    row_ptr   0  2  5  8  10  12  15  18  20
    col_ind   1  4  0  2  5  1  3  6  2  7  0  5  1  4  6  2  5  7  3  6
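The conversion from the intermediate set to the two CSR arrays can be sketched as follows (hypothetical names; a sketch only, assuming zero-based cell indices):

```python
def pairs_to_csr(pairs, n_rows):
    """Convert a set of off-diagonal (row, col) pairs to CSR structure arrays."""
    counts = [0] * n_rows
    for row, _ in pairs:
        counts[row] += 1                 # entries per block row
    row_ptr = [0]
    for c in counts:
        row_ptr.append(row_ptr[-1] + c)  # prefix sum gives the row pointers
    col_ind = [col for _, col in sorted(pairs)]  # row-major, ascending columns
    return row_ptr, col_ind
```

Applied to the ten two-point connections of the eight-cell example in Figure 2.4, this reproduces the row_ptr and col_ind arrays shown there.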
Finally, we note that connection-based loops in matrix operations, such as matrix
extraction, algebraic reduction, explicit updating, and SpMV, must be modified con-
sistently with the generalized connection list. Most of these modifications are straight-
forward. For example, loops over standard two-point connections should be changed
to loops over all entries in the generalized connection list. Given the currently adopted
CSR format, these entries are usually accessed row by row and then block by block.
That is, we first loop through all rows, and in each row i (0 ≤ i < N_B) the indices of
the nonzero block entries are all integers k such that row_ptr[i] ≤ k < row_ptr[i + 1].
Thus, the columns of these entries are col_ind[k], and we may access them one by
one. The treatment of the diagonal entries is not changed.
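The access pattern described above amounts to a nested loop over rows and then over the stored entries of each row; a minimal sketch (hypothetical names):

```python
def for_each_offdiag_block(row_ptr, col_ind, visit):
    """Visit every off-diagonal block entry in CSR order (row by row)."""
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            visit(i, col_ind[k], k)      # row i, column col_ind[k], storage slot k
```

A connection-based operation such as SpMV would accumulate the contribution of block (i, col_ind[k]) inside visit, with the diagonal blocks handled in a separate loop.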
2.3 AD-Based AIM Formulation
The AD-GPRS simulation framework is designed to support arbitrary levels of
implicitness, including FIM, IMPES, IMPSAT, and AIM. Among these schemes, AIM
is the most general, in that all of the other schemes can be considered special cases
of AIM. In order to have a unified code base for flexible discretization in time, the
AIM implementation is designed to handle the different schemes consistently and is
split into nonlinear and linear levels. A four-step procedure, with the corresponding
level in parentheses, is used to implement AIM in AD-GPRS:
1. Determination of the implicitness level (nonlinear level).
2. Treatment of the nonlinear terms in the flux (nonlinear level).
3. Algebraic reduction in terms of the implicit primary variables (linear level).
4. Updating the remaining explicit primary and secondary variables (linear level).
The AD-GPRS framework allows for complete flexibility in the specification of the
independent variables for different formulations and solution strategies [84]. There-
fore, no assumption is made in the common part (used by all variable formulations) of
time discretization about a specific formulation. Also, it should be relatively straight-
forward to introduce the AIM scheme for a new formulation. For this purpose, gen-
erality has been preserved in the following procedures of AIM:
• Switching criteria between different levels of implicitness.
• Algebraic reduction of the primary equations in terms of the primary unknowns.
• Explicit updating of the secondary variables.
As a result, a proposed new formulation (e.g., molar) can be realized easily as
follows: 1) implement the declaration of the variable set and associated properties
calculation, 2) test the FIM formulation using the new variable set, 3) implement
the various treatments for the nonlinear terms in the flux, and 4) assign levels of
implicitness to the variables for the specific AIM scheme of interest.
The four-step procedure used to implement AIM is discussed in the following
sections.
2.3.1 Implicit-level determination (nonlinear level)
CFL-based stability criteria are widely used in reservoir simulation [22, 51, 65].
In an isothermal compositional simulation (using the natural-variables formulation),
there are two types of CFL numbers:
• Component-based CFL number:

  CFL_{X,c} = \frac{\Delta t \sum_p \rho_p Q_p x_{c,p}}{V \phi \sum_p \rho_p S_p x_{c,p}}, \qquad CFL_X = \max_c \{ CFL_{X,c} \}    (2.5)

  where V is the volume of the grid block, φ is the porosity, and Q_p is the phase
  volumetric flux. Note that the denominator, V \phi \sum_p \rho_p S_p x_{c,p}, represents the
  total number of moles of component c in all phases in the current cell.
• Saturation-based CFL number (for one, two, or three phases, with gravity,
  without capillarity):

  – one phase (trivial):

      CFL_S = \frac{Q_p \Delta t}{V \phi}    (2.6)

  – two phases [15]:

      CFL_S = \frac{\Delta t}{V \phi} \cdot \frac{\dfrac{\lambda_{p_1}}{\lambda_{p_0}} \dfrac{\partial \lambda_{p_0}}{\partial S_{p_0}} Q_{p_0} - \dfrac{\lambda_{p_0}}{\lambda_{p_1}} \dfrac{\partial \lambda_{p_1}}{\partial S_{p_0}} Q_{p_1}}{\lambda_{p_0} + \lambda_{p_1}}    (2.7)
  – three phases [22]:

      CFL_S = \frac{1}{2} \frac{\Delta t}{V \phi} \left| f_{0,0} + f_{1,1} + \sqrt{(f_{0,0} + f_{1,1})^2 - 4 (f_{0,0} f_{1,1} - f_{0,1} f_{1,0})} \right|    (2.8)

    where f_{i,j} is evaluated as:

      f_{i,j} = \frac{\delta_{i,j} \lambda_T - \lambda_{p_i}}{\lambda_{p_j} \lambda_T} \cdot \sum_{k=0}^{2} \left( \frac{\partial \lambda_{p_k}}{\partial S_{p_j}} Q_{p_k} \right)    (2.9)

    in which \delta_{i,j} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} is the Kronecker delta, and \lambda_T = \sum_{k=0}^{2} \lambda_{p_k} is the
    total mobility.
Expressions that take capillarity into account can be used in place of expressions
(2.6) to (2.8) when necessary [22]. Note that the specific expression used for any
cell depends on the number of mobile phases present in that cell, not on the total
number of phases in the system. That is to say, in a three-phase system, we may
have cells with three mobile phases using Eq. (2.8), cells with two mobile phases
using Eq. (2.7), and cells with only one mobile phase using Eq. (2.6).
The total flow rate (Q_T) or any phase flow rate (Q_p) involved in the CFL
computation can be either an inflow or an outflow rate. For any given cell, the inflow
rate is the sum of the flow rates over all associated connections whose flow goes from
a neighboring cell into that cell. Correspondingly, the outflow rate is the sum of the
flow rates over all associated connections whose flow goes from that cell into a
neighboring cell. It has been proven in [22] that the 'inflow' CFL number (i.e., the
CFL number calculated from the inflow rate) is the more suitable choice for the
stability criteria. A less efficient but potentially more stable choice is to use the
maximum of the 'inflow' and 'outflow' CFL numbers.
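The inflow/outflow bookkeeping can be sketched as follows (hypothetical names; the sign convention for a connection flux is an assumption):

```python
def inflow_outflow_rates(n_cells, connection_flows):
    """Accumulate per-cell inflow and outflow rates from signed connection fluxes."""
    inflow = [0.0] * n_cells
    outflow = [0.0] * n_cells
    for a, b, q in connection_flows:
        # assumed convention: q > 0 means flow from cell a to cell b
        src, dst, rate = (a, b, q) if q >= 0.0 else (b, a, -q)
        outflow[src] += rate
        inflow[dst] += rate
    return inflow, outflow
```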
Based on the above component- and saturation-based CFL numbers, the decision
to treat the variables of a cell implicitly or explicitly for the current timestep can be
made as follows:

• If max{CFL_X, CFL_S} ≤ CFL_Lim, use IMPES

• If CFL_X ≤ CFL_Lim, use IMPSAT

• Otherwise, use FIM

where CFL_Lim is the upper stability limit. For nonlinear formulations other than
the natural variables, different stability criteria can be designed to determine the
implicit levels.
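The decision logic above, as a small Python sketch (hypothetical names; in practice CFL_Lim is an input parameter):

```python
def implicit_level(cfl_x, cfl_s, cfl_lim):
    """Choose a cell's implicitness level from its CFL numbers."""
    if max(cfl_x, cfl_s) <= cfl_lim:
        return "IMPES"    # both criteria satisfied: only pressure implicit
    if cfl_x <= cfl_lim:
        return "IMPSAT"   # saturation criterion violated: saturations implicit too
    return "FIM"          # component criterion violated: everything implicit
```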
From Eqs. (2.5) to (2.8), we may observe that the CFL numbers are functions of
the timestep size Δt. For the FIM scheme, and for the AIM scheme with no restriction
on the maximum portion of FIM cells, the timestep size is determined solely by the
maximum variable changes between two timesteps (see the section about timestepping
criteria in [85]) and is given as an input for the CFL computation and the subsequent
implicit-level determination. On the other hand, for the IMPES/IMPSAT schemes,
and for the AIM scheme with a restriction on the maximum portion of FIM cells, a
tentative timestep size Δt_upper is first calculated based on the maximum variable
changes and serves as an upper limit for the actual timestep size Δt. The actual
timestep size Δt is then obtained by enforcing the stability criterion of the
IMPES/IMPSAT scheme (i.e., that the maximum CFL number in the entire reservoir,
given Δt, is below CFL_Lim), or by enforcing the restriction on the maximum portion
of FIM cells in the AIM scheme. That is, given the maximum portion of FIM cells
P_FIM (0 ≤ P_FIM ≤ 1), we may calculate the minimum number of non-FIM cells as
N_non-FIM = (1 − P_FIM) N_B. Then, by sorting the CFL numbers of all cells from
the smallest to the largest, we require that the N_non-FIM'th smallest CFL number,
given Δt, is below CFL_Lim.
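Because each CFL number is linear in Δt, the restricted-AIM timestep can be found directly from the sorted per-unit-timestep CFL rates; a sketch (hypothetical names):

```python
def choose_timestep(dt_upper, cfl_rates, p_fim, cfl_lim):
    """Pick dt so that at least (1 - p_fim) * NB cells satisfy the CFL limit.

    cfl_rates[i] is cell i's CFL number per unit timestep (CFL_i = rate_i * dt);
    dt_upper is the tentative step from the maximum-variable-change criterion.
    """
    n_non_fim = int((1.0 - p_fim) * len(cfl_rates))  # minimum non-FIM cells
    if n_non_fim == 0:
        return dt_upper                  # pure FIM allowed: no CFL restriction
    rate = sorted(cfl_rates)[n_non_fim - 1]          # the N_non-FIM'th smallest
    if rate <= 0.0:
        return dt_upper
    return min(dt_upper, cfl_lim / rate)
```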
To further improve the stability of the overall system, an option to extend the
FIM cells by several layers is provided. After the implicit level has been determined
for all cells, the following operation may be performed: for each non-FIM cell that
neighbors an FIM cell, change its implicit level to FIM at the end of the operation,
so that the FIM region is extended by exactly one layer. Given the number of layers
(N_L) to be extended, the above operation is performed N_L times in each timestep,
so that the FIM region is extended by N_L layers. The larger we choose N_L, the
more FIM cells we will have; as a consequence, the solution will be less efficient but
potentially more stable.
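The layer-extension operation can be sketched as a repeated one-ring growth of the FIM region (hypothetical names; the neighbor lists would come from the generalized connection list):

```python
def extend_fim_layers(is_fim, neighbors, n_layers):
    """Grow the FIM region by n_layers rings of neighboring cells."""
    is_fim = list(is_fim)                # work on a copy
    for _ in range(n_layers):
        frontier = [c for c, fim in enumerate(is_fim) if not fim
                    and any(is_fim[nb] for nb in neighbors[c])]
        for c in frontier:               # flip after scanning: exactly one layer
            is_fim[c] = True
    return is_fim
```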
2.3.2 Treatment of nonlinear terms in the flux (nonlinear
level)
In an IMPES or IMPSAT formulation, the local terms (accumulation and fugacity)
are computed as in the FIM formulation. For the flux terms, one can apply different
treatments to the various quantities, which are either nonlinear functions of the
independent variables (e.g., ρ_p, λ_p, and γ_p) or independent variables themselves
(e.g., x_{c,p} in the natural-variables set). In our framework, we provide four options
for calculating each of these quantities. Taking the treatment of ρ_p = ρ_p(P, x_{c,p})
in an IMPES cell (where only P is implicit) as an example, we have:
• Explicit treatment: the nonlinear term is fixed at the beginning of the timestep
and no derivatives are calculated; e.g., the value is ρ_p^n, and the derivative vector
is empty.

• Partially implicit treatment (lagged iteration): the nonlinear term is updated
in each Newton iteration and only contains its derivatives with respect to the
implicit variables; e.g., ρ_p^{n+1,k} and (∂ρ_p/∂P)^{n+1,k}.
• Fully implicit treatment: the nonlinear term is updated in each Newton iteration
and contains all of its derivatives (with respect to both implicit and explicit
variables); e.g., ρ_p^{n+1,k}, (∂ρ_p/∂P)^{n+1,k}, and (∂ρ_p/∂x_{c,p})^{n+1,k}.
• Mixed time-level treatment: for the flux-term computation in each Newton
iteration, the nonlinear term is recomputed using the specified implicit variables
(from the last iteration, e.g., P^{n+1,k}) and explicit variables (from the last
timestep, e.g., x_{c,p}^n), and the derivatives of the quantity with respect to the
implicit variables are obtained naturally with AD; e.g., ρ_p(P^{n+1,k}, T, x_{c,p}^n)
and ∂ρ_p(P^{n+1,k}, T, x_{c,p}^n)/∂P^{n+1,k}.
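For the first three options, the treatment amounts to selecting which stored value and which derivatives of a term enter the flux computation; a Python sketch (hypothetical names, with derivatives kept as a dict from variable name to derivative; the mixed time-level option is omitted because it requires re-evaluating the property function itself):

```python
def treat_flux_term(option, value_n, value_k, derivs_k, implicit_vars):
    """Pick the (value, derivatives) of a nonlinear term used in the flux.

    value_n: value at the old timestep; value_k / derivs_k: value and full
    derivative dict at the current Newton iteration.
    """
    if option == "explicit":
        return value_n, {}               # frozen value, empty derivative vector
    if option == "partial":              # lagged iteration
        return value_k, {v: d for v, d in derivs_k.items() if v in implicit_vars}
    if option == "full":                 # fully implicit
        return value_k, dict(derivs_k)
    raise ValueError(option)
```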
Among the four options, the mixed time-level treatment seems to be the most
natural way to handle implicitness. One may raise the question: why do we need
the other three options? The reason is related to efficiency and convergence. Unlike
the explicit, partially implicit, or fully implicit treatments, which use the computed
value (either at current or last timestep) and derivatives of the nonlinear term, the
mixed time-level treatment usually incurs extra calculations. This is because some
of the nonlinear terms (e.g., ρp) are also used in the local accumulation and fugacity
terms, and such nonlinear terms must be treated fully implicitly in the computation
of the local terms, regardless of the time discretization of the flux terms. The extra
cost of recomputing such nonlinear terms can sometimes be larger than the time saved
in the discretization and linear solution, leading to an inefficient overall scheme.
As for convergence, the mixed time-level treatment is not necessarily the best
choice for all implicit schemes. As pointed out in [15], the explicit treatment, which
fixes the value of a term at the beginning of the timestep and removes all of its
derivatives as defined above, is necessary (in terms of convergence) for ρ_p, λ_p, and
x_{c,p} in the IMPES formulation, and for x_{c,p} in the IMPSAT formulation. Partially
implicit treatment, which uses the last-iteration value of a term and removes its
derivatives with respect to the explicit variables (i.e., S_p and x_{c,p} in IMPES; x_{c,p}
in IMPSAT), is suitable (also in terms of convergence) for γ_p in the IMPES
formulation, and for ρ_p, λ_p, and γ_p in the IMPSAT formulation. Fully implicit
treatment is necessary for all the nonlinear terms (ρ_p, λ_p, x_{c,p}, and γ_p) in the
FIM formulation. Note that single-phase cells with
the IMPSAT implicit level are actually treated as IMPES cells.
Because the special treatment of nonlinear terms is only needed for the flux term,
the choice is made immediately before the flux computation. The original values and
derivatives of these terms (evaluated using FIM) are saved. After the flux
computation, the affected terms are restored to their original values (and derivatives)
if necessary. For example, we need to recover the value and derivatives of x_{c,p},
which are independent variables and should be updated from the last-iteration values
after the linear solution. As for phase mobilities or capillary pressures, their values
and derivatives do not need to be recovered, because they are only used in the flux
term. However, at the end of a converged timestep, the values of all terms are
updated using the converged solution at time level n + 1, and can be readily used for
the CFL computation and explicit treatment in the next timestep.
2.3.3 Algebraic reduction in terms of implicit variables (linear level)
As described in Section 1.2.1, the full independent variable set is composed of
primary variables (n_c of them, as in the natural-variables set of isothermal
compositional simulation) and secondary variables (n_c(n_p − 1) of them). With AIM
(or IMPES/IMPSAT), the primary variables can be further divided into implicit
variables (N_Imp^i, the number of implicit variables in cell i: 1 for IMPES, 2 for
IMPSAT, and n_c for FIM) and explicit variables (n_c − N_Imp^i).
The full Jacobian includes the partial derivatives of all the equations (conservation
laws and local constraints) with respect to the full variable set (primary and
secondary). However, the linear solver sees a smaller (reduced) Jacobian, in which
the mass (and energy) conservation equations are expressed in terms of the implicit
primary set. This reduced Jacobian is obtained algebraically from the full system as
follows:
• Reduction from the full system to the primary system (needed by FIM and AIM):
because only the n_c mass-conservation equations are globally coupled, whereas the
remaining n_c(n_p − 1) equations describe local constraints, we can use a Schur
complement [94] for each block row that has more than n_c equations (i.e., where
more than one phase is present in the corresponding cell) to obtain a primary system
that contains only n_c equations as a function of n_c variables.

• Reduction from the primary system to the implicit system (only needed by AIM):
when AIM is used, it is possible that not all n_c primary variables in a cell are
implicit. Thus, we can further apply the Schur complement for those cells to obtain
an even smaller implicit system that contains only N_Imp^i equations and variables
for each cell i.
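Both reductions are block Schur complements. A small dense NumPy sketch (illustrative only; the simulator performs the elimination blockwise on the sparse system, and the names here are hypothetical):

```python
import numpy as np

def schur_reduce(A, b, n_keep):
    """Eliminate the trailing unknowns of A x = b by a Schur complement.

    A is split as [[App, Aps], [Asp, Ass]], with App covering the n_keep
    unknowns that remain (e.g., the primary or implicit variables).
    """
    App, Aps = A[:n_keep, :n_keep], A[:n_keep, n_keep:]
    Asp, Ass = A[n_keep:, :n_keep], A[n_keep:, n_keep:]
    Ass_inv = np.linalg.inv(Ass)         # local block, cheap to invert cell by cell
    return App - Aps @ Ass_inv @ Asp, b[:n_keep] - Aps @ Ass_inv @ b[n_keep:]
```

Solving the reduced system gives the same values for the kept unknowns as solving the full system.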
2.3.4 Updating of the remaining variables (linear level)
The linear solver deals with the reduced Jacobian (i.e., implicit system in terms of
the implicit unknowns). The implicit system is obtained algebraically from the full
system in two stages. After we obtain the solution of implicit unknowns from the
linear solver, a two-stage updating strategy is then used for the remaining variables,
including the explicit primary variables and secondary variables:
• From the implicit solution to the primary solution: the solution for the n_c − N_Imp^i
explicit primary variables is recovered algebraically from the implicit solution, which
contains the Newton updates to the N_Imp^i implicit variables per cell computed by a
specific linear solver.

• From the primary solution to the full solution: the solution for the n_c(n_p − 1)
secondary variables is recovered algebraically from the primary solution, which
contains the Newton updates to the n_c primary variables per cell obtained in the
first-stage explicit update.
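Each recovery stage is a back-substitution using the same block splitting as the reduction; a dense NumPy sketch (illustrative only, hypothetical names):

```python
import numpy as np

def recover_eliminated(A, b, x_keep):
    """Recover the eliminated unknowns from the kept (implicit) solution.

    With A split as [[App, Aps], [Asp, Ass]], the eliminated part satisfies
    Ass x_s = b_s - Asp x_p, which is a local (cell-by-cell) solve.
    """
    n_keep = len(x_keep)
    Asp, Ass = A[n_keep:, :n_keep], A[n_keep:, n_keep:]
    return np.linalg.solve(Ass, b[n_keep:] - Asp @ x_keep)
```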
The details of the above two-stage algebraic reduction and two-stage update strategy
will be discussed in Section 3.4, which covers the solution strategies for the block
linear system.
2.4 Test Cases
Two test cases, the full and upscaled SPE 10 problems and a fully unstructured grid
problem, are examined to demonstrate the correctness and efficiency of the AD
framework with AIM and MPFA capabilities. An additional test case, which uses a
two-dimensional circular reservoir model, can be found in [99].
2.4.1 Full and upscaled SPE 10 problems
This example is based on the permeability and porosity fields of the full and an
upscaled version of the SPE 10 problem [19]. The upscaled problem employs a simple
2×2×2 coarsening (i.e., 1/8th of SPE 10). Two time-discretization schemes are tested
and contrasted: 1) FIM and 2) AIM combining IMPES and FIM. A nine-component
fluid system is used with an initial distribution of 1% CO2, 19% C1, 5% C2, 5% C3,
10% n-C4, 10% n-C5, 10% C6, 20% C8, and 20% C10. We use a line-drive well pattern:
two injectors are placed at the two corners of one side of the reservoir, and two
producers at the two corners of the opposite side. All four wells penetrate all the
layers and are under BHP (Bottom-Hole Pressure) control: 40 bar (producers) and
120 bar (injectors). The injection stream is 90% CO2 and 10% C1 at both injectors.
For the upscaled problem, the model has about 140K cells, and there are nine
primary variables (n_c = 9) in each cell. So, we have a total of 1.26M primary
variables to solve for at each timestep. Dealing with such a large system poses a
great challenge to the linear solver. However, if AIM is applied in this case, given an
average of 10% implicit cells, we only need to solve for about 0.25M implicit variables.
The solution of the linear system becomes much easier, because its size has been
decreased by approximately 80%.
Table 2.1: FIM and AIM runtime performance of upscaled SPE 10 problem
Scheme                        FIM     AIM (IMPES + FIM)   Difference
Newton Iterations             377     354                 -6.1%
Solver Iterations             3.95    4.09                +3.5%
Discretization                2.62    1.82                -30.5%
Nonlinear Term Treatment      -       0.11                -
Properties Calculation        3.06    2.52                -17.6%
Matrix Extraction/Reduction   1.32    0.99                -25.4%
Linear Solver Time            5.45    2.94                -46.1%
Total Time                    13.09   9.02                -31.1%
The runtime performance data for both schemes are listed in Table 2.1, where the
items in boldface are reported on a per-Newton-iteration basis. In this case, we
actually have an average of about 9% implicit (FIM) cells, corresponding to about
19% implicit variables. The numbers of Newton iterations (−6%) and solver iterations
(+3%) are about the same, whereas we obtain savings of 18% to 46% in the
discretization, properties calculation,
matrix extraction/reduction, and linear solver time when using AIM. As we expected,
the largest saving is in the linear solution, which takes a little more than half of the
corresponding FIM time. Although some extra cost is incurred in the treatment of
the nonlinear terms, it is negligible (taking only 1% of the total time). Overall, we
have a reduced cost of about 31% per Newton iteration, with high-quality simulation
results.
Figure 2.5: Grid mapping of full SPE 10 with skewness and distortion [99]

As for the full SPE 10 problem, the grid is modified in all 85 horizontal layers,
each containing 60 × 220 cells, in two steps: 1) skewing each layer to be a
parallelogram, and 2) applying distortion to the x-coordinate of the grid nodes based
on the six control points depicted in Figure 2.5, where the numerical values indicate
the x-coordinate of the control points before (left) and after (right) mapping. The
positions of the other grid nodes are determined by linear interpolation between the
control points. This results in a three-dimensional large-scale nonorthogonal grid
with over 1.1 million cells. A four-component system with an initial distribution
of 1% CO2, 20% C1, 29% C4, and 50% C10 and the same line-drive well pattern
(two injectors at one side, two producers at the other side) are used. Three spatial
discretization schemes are employed and contrasted: TPFA, MPFA O(0)-method,
and MPFA L-method.
The simulation results are shown in Figure 2.6. We can see that the results
of both MPFA discretizations closely match each other, which indicates that our
MPFA framework is very robust with arbitrary discretization schemes on such a
large-scale reservoir model. On the other hand, TPFA results deviate from the MPFA
counterparts, because in skewed nonorthogonal grids the fluxes cannot be represented
accurately using TPFA.

Figure 2.6: Gas rates of two injectors and oil rates of two producers in full SPE 10 with skewed grid
Table 2.2: Runtime performance of full SPE 10 case
Discretization        TPFA   L-method MPFA   O(0)-method MPFA
Stencil               7      9 (+29%)        11 (+57%)
Peak memory           9GB    10GB (+11%)     11GB (+22%)
Newton iterations     170    167             184
Solver iterations     5.6    4.5             5.3
Discretization time   7.0    7.9 (+13%)      9.1 (+29%)
Solver time           25.9   28.8 (+11%)     36.5 (+41%)
Total time            43.4   47.0 (+8%)      56.8 (+31%)
The performance data are listed in Table 2.2. The items in boldface are reported
on a per Newton-iteration basis. The number of solver iterations is smaller with
MPFA discretizations. For both the L-method and O(0)-method, the increases in the
memory usage, discretization time, and linear solution time are all below the growth
in the stencil (29% and 57% respectively). The total extra cost per Newton iteration
is 31% for the O-method and only 8% for the L-method. Given approximately the
same results (both more accurate than the TPFA solution), a smaller memory
footprint, and better CPU efficiency, the L-method is preferred for this type of
problem (with a skewed grid).
2.4.2 Unstructured grid problem
The next test case is a two-dimensional unstructured grid problem with 17,545 trian-
gular cells of different sizes. We test TPFA and MPFA discretization for four different
anisotropy ratios: 1:1, 2:1, 8:1, and 32:1. The MPFA transmissibilities are calculated
using the O(1/3)-method, which is known to give good results for triangular cells.
An injector is placed at the center of the reservoir, which has an initial fluid
distribution of 0% CO2, 5% C1, 25% C4, and 70% C10. A mixture of 90% CO2 and
10% C1 is injected at a rate of 3×10^4 m3/day. There are four producers, all under
BHP control at 40 bar. The well configuration is shown in Figure 2.7, where the
white circles represent the five wells. Several linear no-flow features are included in
the model to mimic faults. The fluid flow is strongly affected by these flow barriers.
Thus, grid refinement is used to increase resolution around the wells and faults, while
relatively coarse grids are used for the rest of the model.
The simulation results are shown in Figure 2.8. We can see that the deviation of
the TPFA gas rates from the MPFA results increases as the anisotropy ratio increases.
In the isotropic case, the difference is barely noticeable, while in the case with 32:1
anisotropy ratio, the TPFA rates differ significantly.
The gas saturation profile shown in Figure 2.7 helps to explain the large differences
in the computed gas rates between the TPFA and MPFA discretizations for the cases
with large anisotropy ratios. The correct flow direction, which is primarily governed
by the anisotropy and heterogeneity of the permeability field, is properly represented
in the MPFA-based computations, whereas in the TPFA profile it is completely
missed, leading to O(1) errors in the simulation results.

Figure 2.7: Gas saturation at the end of simulation for TPFA and MPFA with 32:1 anisotropy ratio

That is to say, for unstructured-grid models with high anisotropy ratios, TPFA
computations are prone to significant errors, and an MPFA discretization scheme is
needed to represent the flow and transport accurately.
The performance data are listed in Table 2.3. As before, the items in boldface are
reported on a per-Newton-iteration basis.

Figure 2.8: Gas rates of producers #1 and #4 for TPFA and MPFA with 1:1, 2:1, 8:1, and 32:1 anisotropy ratios

Table 2.3: Runtime performance of unstructured grid case

Discretization scheme   Newton iter   Solver iter   Discretization time   Solver time    Total time
TPFA 1:1                597           9.48          0.08                  0.33           0.61
MPFA 1:1                576           5.58          0.18 (+114%)          0.60 (+82%)    0.98 (+60%)
TPFA 2:1                593           10.04         0.08                  0.34           0.63
MPFA 2:1                588           5.75          0.18 (+111%)          0.61 (+78%)    0.99 (+58%)
TPFA 8:1                573           11.55         0.09                  0.40           0.69
MPFA 8:1                581           10.17         0.18 (+113%)          0.85 (+110%)   1.24 (+78%)
TPFA 32:1               602           12.28         0.09                  0.45           0.77
MPFA 32:1               639           20.79         0.19 (+96%)           1.62 (+261%)   2.02 (+161%)
MPFA 32:1 SAMG          650           7.80          0.19 (+95%)           1.10 (+146%)   1.51 (+95%)

In this case, the MPFA and TPFA stencils have 18 and 4 entries, respectively. So,
on this basis alone, MPFA is 3.5 times more expensive than TPFA. However, as
shown in Table 2.3, for anisotropy ratios from 1:1 to 8:1, the cost increase is less
than 1.2 times for the discretization and less than 1.1 times for the linear solution.
As a consequence, the extra cost in the total computational time is less than 80%.
On the other hand, for the MPFA discretization with an anisotropy ratio of 32:1,
the number of linear-solver iterations per Newton iteration is quite large (> 20),
which leads to a much higher cost (+261%) in the linear solution. The increased
cost comes from convergence difficulties in the usually robust and highly efficient
AMG solver [76, 77], which is used as the first-stage pressure solver in the CPR
preconditioner [38, 88, 89]. When the reduced pressure matrix contains
large positive off-diagonal entries introduced by MPFA discretizations for models
with high anisotropy ratios, it deviates significantly from being an M-matrix. Using
the more robust SAMG [78] as the pressure solver, the number of linear-solver
iterations per Newton iteration decreases significantly (< 10), and the associated
linear-solution cost is greatly reduced from +261% to +146%. Therefore, for models
with high anisotropy ratios, SAMG is the preferred option for computing a stable
pressure solution.
2.5 Concluding Remarks
From the test cases presented, we observe that different spatial discretization schemes
(e.g., TPFA and MPFA) can lead to considerable differences in the computed results,
especially when the grid is highly nonorthogonal (e.g., skewed elements or unstruc-
tured grid). This also applies to the cases with full-tensor permeability. In order for
a general-purpose simulator to deal with nonorthogonal grids and full-tensor perme-
ability, as well as complex well geometry, TPFA discretization is often inadequate
and an MPFA scheme is necessary.
Our general AD-based, MPFA framework is quite efficient. For both TPFA and
various MPFA discretizations, the CPU time and memory cost are proportional to the
additional entries associated with the discretization. For MPFA-based simulations,
only the discretization and linear-solution time are affected; moreover, the increased
cost in both parts is always less than the growth in the stencil size. The effort
associated with all the remaining parts of a simulation remains the same as for TPFA.
Compared with TPFA, the significant improvements from the MPFA computations
come at an acceptable additional cost.
The flexible multilevel AIM implementation in our AD-based framework employs
FIM only where and when necessary, which reduces the overall simulation cost along
with the level of numerical dispersion. Our AIM framework is currently applied to
the natural-variables formulation with compositional, black-oil, or dead-oil fluids,
but it is designed to support new fluid models and nonlinear formulations.
In the AD framework, a unified code-base has been developed with minimal code
duplication. The unified code works for both TPFA and MPFA in space, and for
any combination of FIM, AIM, IMPES, and IMPSAT in time. Due to the generic
design that decouples the implementation into nonlinear and linear levels, the specific
schemes used for space and time discretization are compatible with the other cell-
based functionality in the simulator.
The AD-based MPFA and AIM capabilities described in this chapter allow for
efficient and flexible modeling of the reservoir part. This is an essential piece of our
general-purpose simulation framework. In Chapter 4, we focus on modeling advanced
wells based on a general multisegment model.
Chapter 3
Linear Solver Framework
3.1 Overview of the Framework
In AD-GPRS, a ‘linear system’ uses one specific format to store the Jacobian matrix
and is capable of performing the following tasks:
1. Extract the Jacobian matrix (and residual values) into the associated matrix
storage format from the AD residual vector that gets constructed in each New-
ton iteration using the independent variables. As long as this task is well defined
for each linear system, the linear solution can be completely separated from the
nonlinear computation.
2. Apply certain algebraic reduction steps to obtain a smaller linear system (e.g.,
the primary or implicit system) that can be solved more efficiently. This is an
optional step that mainly affects the efficiency of the linear solver, not the
solution it produces.
3. Utilize compatible linear solvers (and preconditioners) to obtain the linear solu-
tion to the reduced matrix system. This is the key step where the linear solver
gets called. The obtained solution corresponds to the reduced matrix system, which
is usually smaller than the full matrix system.
4. Use the explicit update procedure to recover the solution of the full system. If
step 2 was applied prior to calling the linear solver in step 3, this step is
required: we need to obtain the Newton update to all independent variables
rather than to only the primary or implicit variables.
5. Write the extracted Jacobian matrix to a text or binary file. This is an
optional feature that facilitates the debugging process and is only invoked in
special circumstances.
Currently, AD-GPRS supports two linear system types. The first uses the standard
Compressed Sparse Row format (CSR, or CRS for Compressed Row Storage; see
http://netlib.org/linalg/html_templates/node91.html). The second employs our
customized MultiLevel Block-Sparse (MLBS, or MLSB for MultiLevel Sparse Block;
see Chapter 3 in [38]) matrix format.
3.2 CSR linear system
Here, the Jacobian matrix is extracted and stored in the CSR format, which is a
widely used data structure for sparse matrices. The main idea of the CSR format is
illustrated using the example 4 × 4 matrix with nine nonzero elements shown below:

    1.0  2.0  0    0
    3.0  4.0  5.0  0
    6.0  0    7.0  8.0
    0    0    0    9.0

Table 3.1: Row pointer array of CSR format

    ind       0  1  2  3  4
    row_ptr   0  2  5  8  9

Table 3.2: Column index and value arrays of CSR format

    ind       0    1    2    3    4    5    6    7    8
    col_ind   0    1    0    1    2    0    2    3    3
    val       1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
The CSR representation of the above matrix includes three arrays: row_ptr (row
pointers, as shown in Table 3.1), col_ind (column indices, as shown in the middle row
of Table 3.2), and val (values, as shown in the last row of Table 3.2). Note that the
row and column indices are zero-based (0, 1, 2, . . .). For a general nrow × ncol sparse
matrix with nnz nonzero elements, the three arrays are defined as follows:
• row ptr: this array has nrow + 1 elements. row ptr[i] (i = 0, . . . , nrow) indicates the total number of nonzero elements in rows 0 to i − 1. Thus row ptr[0] must be equal to 0, and row ptr[nrow] must be equal to nnz because it represents the total number of nonzero elements in all rows. The indices of the nonzero entries in row i are row ptr[i], . . . , row ptr[i+1] − 1.
• col ind: this array has nnz elements. col ind[j] (j = 0, . . . , nnz − 1) indicates the column of the j'th nonzero entry. That is, col ind[j] (j = row ptr[i], . . . , row ptr[i+1] − 1) gives the column of each nonzero entry in row i. The elements are usually ordered from smallest to largest column within each row; this is the definition used in the CSR linear system. There are several variant definitions in the matrix components of the MLBS linear system. For example, in each row, the column of the diagonal entry is always ordered first, and the columns of the remaining nonzero entries are ordered from smallest to largest afterwards. This ordering is required by the AMG [76, 77] or SAMG preconditioners. Also, the column of the diagonal entry may sometimes be omitted from the col ind array if the diagonal entry always exists in each row.
• val: this array also has nnz elements. val[j] (j = 0, . . . , nnz − 1) stores the value of the j'th nonzero entry and thus has a one-to-one correspondence with col ind[j]. That is, its order is determined by the order of col ind in each row. Each element in the val array can be a single value (pointwise) or a dense block of values (blockwise). The pointwise storage is used for the CSR linear system, whereas the blockwise extension is used in the matrix components of the MLBS linear system.
Based on the above description of the three CSR arrays, the common way to visit the entries is row by row on the first level, and element by element on the second level. That is, we first loop over rows i from 0 to nrow − 1, and for each i, we loop over elements j from row ptr[i] to row ptr[i+1] − 1, accessing the column through col ind[j] and the value through val[j].
The major advantages of this format are its simplicity (it is easy to understand and operate on) and the wide availability of existing algorithms based on it. Because it is a standard
format, ADETL provides convenient interfaces to directly extract the necessary data
components of the CSR format (row pointer, column index, and value array) from an
AD vector [97]. Based on these interfaces, the entire extraction process is straight-
forward. Most of the time, we need no specialization in the CSR linear system to
accommodate new features, such as the facility models introduced into AD-GPRS.
However, because the CSR format does not contain any information about its submatrices, it is relatively complicated and expensive to extract and manipulate an arbitrary block (e.g., the submatrix for the entire facilities part, or for a specific well model) from the entire Jacobian. As a consequence, the available CSR-based linear solvers in AD-GPRS usually do not exploit the multilevel block structure of the Jacobian and thus are not optimal. Among the available solvers are sparse direct solvers, including PARDISO (http://www.pardiso-project.org/) and SuperLU (http://crd.lbl.gov/~xiaoye/SuperLU/), which are usually very slow and require huge amounts of memory for large problems. Sparse iterative solvers, such as ILUPACK (http://www.icm.tu-bs.de/~bolle/ilupack/), can also be incorporated.
The above solvers are called "library" solvers, because all of them are provided by external libraries and accessed through wrapper classes. Each of them has many configurable parameters. These parameters are stored in an editable text file and can be read into AD-GPRS through an extensible I/O class. In addition, most of these solvers provide the option to directly solve the transposed system (without the matrix being manually transposed), which could be useful for the adjoint functionality.
The major purpose of the CSR linear system is validation. When new features are introduced to AD-GPRS, a nontrivial extension may be required in the MLBS linear system. In such circumstances, we may first test these features with the CSR linear system to see whether they work properly. However, due to its low efficiency and significant memory consumption, no large problem can be tested this way. After validation on small problems, the extension in the MLBS linear system may be implemented to accommodate the new features, such that large-scale problems can be tested to demonstrate the efficiency (and wider applicability) of the features.
3.3 Block linear system
The block linear system is the default option in AD-GPRS, due to its high efficiency
and many advanced features. Its associated matrix format is the previously mentioned
MLBS matrix, which has a much more sophisticated data structure than that of the
CSR format.
The term ‘data structure’ refers to a scheme for organizing related pieces of infor-
mation. A good data structure design has the following features [38]:
• Accessibility: Users should be able to generate, access, and modify data easily;
• Encapsulation: Data should be hidden by the object that owns it, and other
objects should have minimal dependency on the detailed format of the data;
• Extensibility: It should be relatively easy to fit future developments into the
data structure and expand it if necessary;
• Computational efficiency: The data structures should be compatible with state-
of-the-art computational algorithms.
To fulfill the above requirements, the MLBS format [38] was first designed and
implemented in the original GPRS. In [38], the MLBS format was demonstrated to be
the most efficient option when used in conjunction with the GMRES solver [67,68] and
CPR (Constrained Pressure Residual, see [88, 89]) preconditioner. To inherit these
advantages, the MLBS matrix is introduced into AD-GPRS as the matrix format for
the block linear system. Due to the nature of the AD framework, the elements of
the MLBS matrix are no longer manually set as in the original GPRS. To establish
a connection from the AD residual vector to the MLBS matrix, the functionality of
the MLBS matrix format has been extended such that the elements can be extracted
from the AD residual vector automatically for each Newton iteration.
To simulate systems with many relatively independent components, e.g., an oilfield
production system that includes reservoirs, wells, and surface facilities, a hierarchical
storage system is used as the data structure of the MLBS matrix format. This data
structure corresponds to the components in the physical model. In this data structure,
a higher-level matrix is composed of a number of independent submatrices. The
higher-level matrix only contains some general information (e.g., the size of the matrix) and the pointers to the submatrices. There is no requirement on the type of each
submatrix, which enables the overall system to be composed of several different types
of matrices. The implementation details of the submatrices are completely hidden
from the higher level matrix. This feature honors data encapsulation and achieves
great flexibility and extensibility. The submatrices of new models can be integrated
into AD-GPRS without changing the high-level architecture.
The (sub)matrices at all levels of the MLBS matrix derive from the same base data type, GPRS Matrix, which defines the following common interfaces:
1. Extraction of the full system matrix from the AD residual vector and the sub-
sequent algebraic reduction from the full system to the primary system
2. Algebraic reduction from the primary system obtained in 1 to the implicit sys-
tem (needed for AIM)
3. Assembly of the global RHS (Right-Hand-Side) vector from all submatrices of
the implicit system obtained in 2
4. Distribution of the global solution vector to all submatrices in the implicit
system (called after the linear solution of the implicit system)
5. (Explicit) update of the primary explicit solution from the implicit solution
acquired in 4 (implicit → primary solution, needed for AIM)
6. (Explicit) update of the secondary solution from primary solution obtained in
5 (primary → full solution)
7. Sparse Matrix-Vector multiplication (SpMV, y = Ax, or y = y + Ax)
8. Recalculation of the matrix size, i.e., number of rows and columns
9. Output of the matrix in a descriptive format (keeping the structure information)
or pointwise COO [67] format (ignoring the structure information). COO stands
for coordinate list, which is a list of (row, column, value) tuples representing all
nonzero entries
10. Fetch of the type, size (number of rows and columns), and associated element
extractor of the matrix.
Among the above set of functionality, items 1 to 6 are new extensions relative to the MLBS matrix in the original GPRS. The matrix extraction from the AD residual vector is needed for the MLBS matrix in the AD framework, whereas the two-step algebraic reduction (from the full to the primary, then from the primary to the implicit system), the assembly of the global RHS vector, the distribution of the global solution vector, and the two-step explicit update (from the implicit to the primary, then from the primary to the full solution) were previously implemented in a separate series of classes called the 'equation selector'. Because this set of functionality depends strongly on the data organization in the (sub)matrices at different levels, that implementation yielded strong coupling between the (sub)matrices and the 'equation selector' classes and weakened the encapsulation and extensibility of the MLBS matrix format. To overcome this drawback and satisfy the previously mentioned requirements on the data structure (accessibility, encapsulation, extensibility, and computational efficiency), items 1 to 6 are now integrated as part of the functionality of each MLBS (sub)matrix type in AD-GPRS. In order for a new (sub)matrix type to function normally in the MLBS hierarchy, all items in the above set, including items 1 to 6, must be properly defined.
Moreover, the above set of functionality is defined and implemented recursively, such that a higher-level matrix does not depend on the specific types or implementation details of its lower-level submatrices. For example, to calculate the size of a higher-level matrix containing several submatrices, we first call the size-calculation functions of these submatrices. Once the submatrix sizes are returned, the higher-level matrix then calculates its own size. Thus, as long as the size calculation is properly defined for these submatrices, no specialization is required in the higher-level matrix. If a submatrix itself contains submatrices at an even lower level, the corresponding functions of those lower-level submatrices are called in turn, until the lowest-level submatrices are visited. This is how the recursive implementation works. Besides size calculation, the recursive implementation also applies to matrix extraction, algebraic reduction, global RHS assembly, solution distribution, explicit update, SpMV, and matrix output.
In the following sections, we will go through the hierarchy of the MLBS matrix
format from top to bottom.
3.3.1 First level: global matrix
In general-purpose reservoir simulation, we often need to solve a coupled system
that contains many complex objects, such as the reservoir model, wells, and surface
facilities. Each of these objects has its own set of nonlinear equations and variables.
Generally, the Newton method is employed as the nonlinear solver. Therefore, in fully
coupled schemes, a global Jacobian matrix that incorporates all the derivatives of the
different governing equations with respect to all the variables, is required. Figure
3.1 shows a Jacobian matrix generated for a three-dimensional structured reservoir
model with several standard wells. The blue dots represent nonzero elements.
The seven-diagonal banded structure in the top-left part of the matrix repre-
sents the reservoir equations and associated variables. The diagonal structure in
the bottom-right part represents the well equations and variables. The bottom-left
and top-right parts represent the coupling terms between the reservoir and the wells.
Figure 3.1: Typical global Jacobian matrix in reservoir simulation [38]

Thus, the matrix structures are quite different among the various parts, as described above. Given this typical structure, it is natural to conceptually separate the global matrix into the following four parts:
• JRR, derivatives of reservoir equations with respect to reservoir variables
• JRF , derivatives of reservoir equations with respect to facility variables
• JFR, derivatives of facility equations with respect to reservoir variables
• JFF , derivatives of facility equations with respect to facility variables
In the MLBS data structure, the first level (global Jacobian matrix) is defined
as a wrapper matrix with no substantial information other than handles to the four
submatrices. The structure of the wrapper matrix is shown in Figure 3.2. The
wrapper has no requirement on the types of its submatrices, i.e., all four parts (JRR,
JRF , JFR, and JFF ) can have different types. This is achieved through four template
parameters in the class definition. When the global matrix is declared, the types of
the four submatrices must be determined. These types will be discussed on the next
level.
    JRR  JRF
    JFR  JFF

Figure 3.2: Submatrices in the global Jacobian matrix [98]
In the global matrix, each of the four submatrices can be conveniently accessed
through the provided interfaces. In fact, for any MLBS (sub)matrix type that contains
a number of submatrices, consistent interfaces are provided for the access to these
submatrices. As a result, great flexibility is offered in the design of solution strategies.
For example, we can get the decoupled facilities matrix JFF without any additional cost, because it is a stand-alone part of the global matrix. The same operation would be quite complex and expensive if the global CSR format were used. Using the MLBS
matrix format, sequential solution strategies for the reservoir and facilities can be
implemented relatively easily.
In the original GPRS, the structure and data of the four submatrices are manip-
ulated by either the reservoir object (JRR) or facilities object (JRF , JFR, and JFF ).
Now in AD-GPRS, the structure of these submatrices is initialized by the linear sys-
tem object, taking necessary information, such as the reservoir general connection
list, number of implicit variables (and indices of implicit components) in each reser-
voir cell, the facilities object, and the AD variable set, as input. The data, on the
other hand, are extracted from the AD residual vector by these submatrices them-
selves. That is, the reservoir and facilities objects no longer need to know about the
structure of the associated submatrices and manually fill in the values. Instead, they
would access and manipulate the common AD residual vector on the nonlinear level,
while the derivatives of all equations with respect to independent variables are auto-
matically generated by the AD framework. Afterwards, through a unified interface,
these derivatives are extracted by the selected linear system into the desired format.
In this case, the block linear system will call the corresponding extraction function
in the global matrix, which will in turn call the extraction function in each of its
submatrices, to extract the Jacobian matrix into MLBS format.
3.3.2 Second level: reservoir and facilities matrices
The four submatrices (JRR, JRF , JFR, and JFF ) in the first-level global Jacobian
matrix have substructures of their own and are considered to be the second-level
matrices.
Reservoir matrix
The reservoir matrix JRR contains the derivatives of all reservoir equations with respect to all reservoir variables. This includes the NPri primary (mass and energy conservation) and NSec secondary (local constraints, e.g., thermodynamic equilibrium) equations, as well as the NPri primary and NSec secondary variables. In order
to improve data processing efficiency and avoid extensive memory usage, the block
structure is exploited and the data are stored in the block diagonal and off-diagonal
arrays. An example of JRR with its associated data arrays is illustrated in Figure
3.3, which is based on the 3× 2 (two-dimensional) reservoir shown in Figure 3.5. As
described in the common interfaces for all MLBS matrices, the matrix extraction is
combined with the algebraic reduction from the full system to the primary system.

Figure 3.3: Structure of the JRR submatrix (diagonal and off-diagonal block arrays; derivative entries with respect to secondary variables are discarded after the algebraic reduction from the full to the primary system)
Thus the data storage of JRR only contains the elements in the reduced primary sys-
tem and some supplementary elements (secondary(equation) - primary(variable) part
of diagonal blocks) that are needed for the recovery of the solution to the full system.
The data storage for each part is ordered first block by block, and then element by
element (column-major, as shown by the yellow arrows in Figure 3.3).
On one hand, the diagonal array contains NB NPri^2 + NB NPri NSec elements, where NB is the number of diagonal blocks (i.e., the number of reservoir cells), and NPri and NSec are the numbers of primary and secondary equations/variables as defined above. The first part (the NB NPri^2 elements in the first row of the diagonal array in Figure 3.3) is for all diagonal blocks in the primary system, whereas the second part (the NB NPri NSec elements in the second row of the diagonal array in Figure 3.3) is for the supplementary elements mentioned above. As illustrated in the top-right part of Figure 3.3, the derivatives of either primary or secondary equations with respect to secondary variables in the diagonal blocks are discarded after the algebraic reduction from the full to the primary system.
On the other hand, the off-diagonal array contains Noff NPri^2 elements, where Noff is the number of off-diagonal blocks (i.e., the number of entries in the reservoir
general connection list). For off-diagonal blocks, the secondary equations should have
no derivatives (all derivatives are local, i.e., in the diagonal blocks) and the derivatives
of primary equations with respect to secondary variables are discarded in a similar
way after the algebraic reduction process. The ordering of off-diagonal blocks in this array is based on the order of entries in the general connection list, because each entry in the list corresponds uniquely to one off-diagonal block. Currently, the
ordering scheme arranges all off-diagonal blocks first row by row, and in each row by
the column indices of the blocks. This is essentially the same as the order of elements
in a CSR matrix, except that diagonal elements are not included because they always
exist and are stored in another data array.
Besides these two data arrays, JRR contains some additional information, such as
the generalized connection list and the number of implicit variables in each block. The
constant pointers to these additional data are passed to JRR upon its construction,
because JRR only needs to access, not manipulate, these data during the computation. Reusing these data arrays significantly reduces the memory cost of the simulator and eliminates the need for a synchronization step.
Facilities matrices
After describing the structure of the reservoir matrix JRR, we may now discuss that
of the facilities matrices JRF , JFR, and JFF . These three matrices correspond to
computations in the facilities class, which represents the aggregate of all the wells
and surface facilities in the field. There may be tens to hundreds of these objects in
a large-scale field model.
Each facility object, e.g., a general multisegment well, owns a set of stand-alone
matrices, named JRW,i, JWR,i, and JWW,i, respectively, where i is the index of the
facility object. JRW,i is a vertical slice in JRF and contains the derivatives of the
reservoir equations with respect to well variables of the i’th facility object. Similarly,
JWR,i is a horizontal slice in JFR and contains the derivatives of well equations of the
i’th facility object with respect to all reservoir variables; JWW,i is a square matrix
in JFF and contains derivatives of well equations with respect to well variables, both
belonging to the i’th facility object. That is to say, JRF , JFR, and JFF are all
wrapper matrices and respectively contain the handles to the JRW,i, JWR,i, and
JWW,i matrices from all subordinate facility objects. Figure 3.4 shows the structure
of second-level matrices corresponding to a reservoir with two wells. We may notice
that the details of the well matrices corresponding to the first well (JRW,1, JWR,1, and
JWW,1) and to the second well (JRW,2, JWR,2, and JWW,2) are completely hidden
from the second-level matrices. The operations in JRF , JFR, and JFF are defined in
a generic way that is independent of the structures of their submatrices.

Figure 3.4: Structure of second-level MLBS matrices [38]

3.3.3 Third level: well matrices

As described in Section 3.3.2, the matrices JRW,i, JWR,i, and JWW,i belong to a single facility object with index i. They may or may not have substructures, i.e., any of them can be on the very basic level (similar to JRR, which has no submatrices),
or contain several lower-level submatrices. As mentioned before, there is no format
requirement for these basic matrices; they can even be dense or empty matrices, if
necessary. The structure details are left for model developers. Here, we only discuss
the data formats currently used in AD-GPRS, which can serve as an example for
future development. Based on model types, the matrix set of JRW,i, JWR,i, and
JWW,i can be very different in both size and format.
Figure 3.6 shows the structure of these submatrices in JRF , JFR, and JFF . These illustrations correspond to the standard well (with two perforations, in cells 1 and 4 respectively) and the general multisegment well (with three nodes, three connections, and two perforations, in cells 3 and 6 respectively) shown in Figure 3.5. The blue blocks in the matrices represent nonzero elements, whereas the white blocks represent zero elements and are therefore not stored.
On one hand, the first column in JRF (JRW,1), the first row in JFR (JWR,1), and the first diagonal element in JFF (JWW,1) correspond to Well 1 in Figure 3.5, which is a standard well. Currently, for consistency in the algebraic reduction and explicit update, the JWW,1 matrix shares the same type as JRR, although it
contains only one element. Both JRW,1 and JWR,1 have the same type, a block
COO (coordinate list) format. For a block COO matrix with Nblk nonzero blocks,
two arrays row ind and col ind, each with size Nblk, are used to store the row and
column coordinate of these blocks. The elements of these blocks are stored in a data
array with size Nblk NPri^row NPri^col, where NPri^row denotes the number of primary equations in the corresponding main-diagonal matrix on the same row, whereas NPri^col denotes the number of primary variables in the corresponding main-diagonal matrix on the same column. Similar to JRR, the data storage is ordered first block by block, then
element by element (column-major). Here, the corresponding main-diagonal matrix refers to the submatrix whose equations and variables have the same classification, or equivalently, whose first and second subscripts in the label are the same (e.g., JRR, JWW,i). For a block COO matrix that is not on the main diagonal (e.g.,
JRW,i), it needs to access one corresponding main-diagonal matrix on the same row
(e.g., JRR) and one on the same column (e.g., JWW,i), in order for its operations, such as matrix extraction, algebraic reduction, and explicit update, to function properly.
On the other hand, the submatrices JRW,2, JWR,2, and JWW,2, as enclosed by the dashed line in JRF , JFR, and JFF , correspond to Well 2, which is a general
multisegment well. We can see that the equation layout of a general multisegment
well is much more complicated than that of a standard well: JRW,2 and JWR,2 are wrapper matrices containing two submatrices each, whereas JWW,2 is a wrapper matrix with four submatrices (JNN , JNC , JCN , and JCC , where N and C stand for
node and connection respectively). The details of these submatrices will be discussed
in Section 4.5.3 after we introduce the general multisegment well model.
Figure 3.5: Sample reservoir with two wells (cells 1-6; Well 1 is a standard well, Well 2 a general multisegment well)

As we can see in the example above, the structure of the well matrices for different well models can be completely different. This is achieved by hiding the details
from upper-level matrices through common interfaces. In addition, the matrix for-
mat is also hidden from the well model, which handles the nonlinear computation
only. The decoupling between well models and matrix formats is achieved by the
facility-matrix handlers. Each of these handlers is defined for a specific well model
(e.g., the standard well model) and a certain matrix format (e.g., the MLBS matrix).
The handler will take an object of the corresponding well model as the input and
create associated matrices JRW,i, JWR,i, and JWW,i in the corresponding format.
For example, a handler defined for standard well and MLBS matrix will create sub-
matrices in the same format as JRW,1, JWR,1, and JWW,1 shown in Figure 3.6. The
handlers defined for different facility models and the same matrix format are created
and managed by an upper-level facility-matrix handler, which is defined for the over-
all facilities class and the same matrix format. This top-level handler is created by
the corresponding linear system. More details about the facility-matrix handlers are
discussed in Section 4.5.2.
Figure 3.6: Structure of third-level MLBS matrices
3.4 Solution strategy of block linear system
Due to the flexibility provided by the hierarchical structure of the block linear system, advanced techniques can be applied to its solution, such that the solution efficiency is generally much higher than that of the CSR linear system.
In order to minimize the size of the global linear system, a Schur-complement
procedure is applied to the full Jacobian of each (reservoir/well) block to express
the primary equations as a function of the primary variables only. This general
Schur-complement reduction strategy is an important aspect of AD-GPRS as well as
the original GPRS, as described in [15]. There is no special alignment requirement on the equations and variables, except that the submatrix eliminated in the Schur complement must be invertible. When an AIM scheme is adopted, one more Schur-complement reduction step is applied to the primary Jacobian to obtain an even smaller implicit Jacobian.
Instead of keeping the Schur-complement procedure separate from the matrix types (as in the equation-selector class of the original GPRS, see [15]), the procedure is defined recursively, along with the matrix extraction, inside the various MLBS matrix types in
AD-GPRS. For example, on the top level, the four submatrices (JRR, JRF , JFR, and
JFF ) correspond to four different types and each has its own matrix extraction/Schur-
complement function. This approach yields better flexibility and extensibility. When
we introduce a new MLBS matrix type, as long as its own extraction and Schur-
complement procedure (and other necessary procedures as listed previously) are prop-
erly defined, it will fit into the hierarchy of MLBS matrices. Ideally, we do not need
to modify the corresponding procedures in any other MLBS matrix types.
After the size of the full linear system is reduced, the resulting implicit linear system is solved for the Newton updates to the implicit variables. The most effective solution strategy is to employ the GMRES Krylov subspace solver [67, 68] preconditioned by the two-stage Constrained Pressure Residual (CPR) approach originally presented in [88] and [89]. In the first stage, an algebraic reduction step (e.g., true-IMPES) is used to construct the pressure system, which is then solved for an approximate solution to the pressure variables using an AMG solver [76, 77]. In the second stage, a block ILU preconditioner with different levels of fill-in is employed for the solution of the individual submatrices (see [17] and [38]). With this approach, it is not necessary to construct special pressure equations at the nonlinear level.
After the implicit linear system is solved, the solution to the primary explicit variables and to the secondary variables can be found locally for each grid cell through an explicit update corresponding to the algebraic reduction applied prior to the linear solution. Thus we obtain the Newton updates to all independent variables by solving a much smaller linear system (and hence considerably faster). The details of the matrix extraction, algebraic reduction (Schur complement), preconditioned linear solution, and explicit update procedures are discussed in the following sections.
3.4.1 Matrix extraction
An intermediate extractor class built on top of the extraction functions of the AD
library (see [97]) is used by all MLBS matrices for acquiring the elements from an AD
residual vector. A contiguous piece of derivatives with neighboring columns in a single row (i.e., from a single ADscalar variable) can be extracted at a time. The internal structure of
the gradient in an ADscalar variable can vary, e.g., in the form of pointwise (column,
value) pairs, or blockwise (block column, all values in a block) pairs (see the second
part of [96] for details). However, a unified interface is provided by the AD library to
extract a given number of neighboring derivatives, such that the internal structure of
the gradient is completely hidden from the user of the AD library.
After obtaining the needed piece of neighboring derivatives with the given size and
starting from the given column in the given row, the MLBS matrix can then store
these derivatives into its internal arrays according to its data storage scheme. We
usually extract a piece of derivatives with the size equal to the total number of active
(primary and secondary) variables in the current block of the matrix, because these
derivatives always occupy neighboring columns. The number of active variables
can change from block to block, and from iteration to iteration, depending on the
active phase status of the block. Note that the derivatives are usually stored in a
column-major fashion in MLBS matrices, whereas they are obtained in the form of a
contiguous piece of a row. In addition, a permutation (needed by the variable-based AIM
scheme) can be applied to both the columns and rows of an extracted matrix block. Thus
the MLBS matrix is responsible for mapping the elements in the extracted piece into
the appropriate places of its data storage. Also note that part of the derivatives (e.g.,
those with respect to secondary variables) may be stored in temporary arrays and
get discarded after the Schur-complement process that generates the primary system
from the full system. Nevertheless, necessary information for the recovery of the full
solution from the primary solution is kept.
Two alternative interfaces are provided by the extractor class, depending on
whether we would like to continue the searching process from the last extracted piece
of derivatives in order to speed up the process. This is achieved by passing either
the saved location of last extracted piece or the starting location in a gradient to
the extraction function of the gradient in an ADscalar variable. The former interface
can be more effective during the process of extracting several blocks of derivatives
with ascending columns from the same row, whereas the latter one should be used
for finding a specific piece of the gradient, e.g., the diagonal part, in a row. The
saved locations of the last extracted pieces are reset to the starting locations in all
gradients after the Jacobian generation and prior to the matrix extraction process in
each Newton iteration.
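The two interfaces can be sketched in a few lines. The Python names below (RowGradient, extract) are hypothetical stand-ins for the AD-library gradient interface; only the contrast between resuming from a saved location and searching from a given position is the point:

```python
import numpy as np

class RowGradient:
    """One row of derivatives stored as ascending (column, value) pairs.

    Hypothetical sketch: the real AD library hides this internal structure
    behind a unified extraction interface.
    """
    def __init__(self, cols, vals):
        self.cols = np.asarray(cols)
        self.vals = np.asarray(vals, dtype=float)

    def extract(self, start_col, count, resume_from=0):
        """Return `count` neighboring derivatives starting at `start_col`,
        searching from position `resume_from`, plus the saved location from
        which the next extraction can resume."""
        pos = resume_from
        while pos < len(self.cols) and self.cols[pos] < start_col:
            pos += 1
        return self.vals[pos:pos + count], pos + count

# Extracting two blocks with ascending columns from the same row: the second
# call resumes from the saved location instead of re-searching the row.
row = RowGradient([0, 1, 2, 5, 6, 7], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
block_a, saved = row.extract(start_col=0, count=3)
block_b, saved = row.extract(start_col=5, count=3, resume_from=saved)
```

With ascending extraction targets, resuming from the saved location avoids rescanning the already-visited part of the row, which is why the former interface is preferred when extracting several blocks from the same row.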
3.4.2 Algebraic reduction from full to primary system
Under the fully implicit formulation, the primary variables (NC per grid block, plus
one for the thermal formulation) are all implicit. So there is a one-step algebraic reduction from
full to primary system (i.e., Schur-complement process) before the linear solution,
and a one-step explicit update from primary to full variables after the linear solution.
The application of this procedure to JRR and its associated RHS vector is described
below.
First we define some notation, which is consistent with the implementation in
AD-GPRS, for the system matrix:
• A – Derivatives of primary equations with respect to primary variables;
• B – Derivatives of primary equations with respect to secondary variables;
• C – Derivatives of secondary equations with respect to primary variables;
• D – Derivatives of secondary equations with respect to secondary variables;
• Xp, Xs – Updates for the primary variables and for the secondary variables;
• Fp, Fs – RHS of the primary equations and of the secondary equations;
• first block subscript – the row of the block;
• second block subscript – the column of the block.
The block nonzero entries of the full system, using the above notation, are shown
below:

\[
\begin{bmatrix}
\begin{bmatrix} A & B \\ C & D \end{bmatrix}_{i,i} & \cdots & \begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{j,i} & \cdots & \begin{bmatrix} A & B \\ C & D \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_s \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} F_p \\ F_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} F_p \\ F_s \end{bmatrix}_j \end{bmatrix}
\tag{3.1}
\]
For each block row $i$, first left-multiply the second row (with $C$ and $D$ in the diagonal block and $0$ in the off-diagonal blocks, corresponding to the secondary equations) by $D_{i,i}^{-1}$, and then subtract from the first row (with $A$ and $B$ in the diagonal block, corresponding to the primary equations) the resultant second row (with $D_{i,i}^{-1}C$ and $I$ in the diagonal block) left-multiplied by $B_{i,i}$. The full system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{i,i} & \cdots & \begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{j,i} & \cdots & \begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_s \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} F_p - (BD^{-1})_{i,i}F_s \\ D_{i,i}^{-1}F_s \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} F_p - (BD^{-1})_{j,j}F_s \\ D_{j,j}^{-1}F_s \end{bmatrix}_j
\end{bmatrix}
\tag{3.2}
\]
Next, for each flux between cell $i$ (corresponding to block row/column $i$) and cell $j$ (corresponding to block row/column $j$), subtract from the first row (primary equations) of block row $i$ the second row (secondary equations) of block row $j$ left-multiplied by $B_{i,j}$, and similarly, subtract from the first row (primary equations) of block row $j$ the second row (secondary equations) of block row $i$ left-multiplied by $B_{j,i}$. The full system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} A - B(D^{-1}C)_{j,j} & 0 \\ 0 & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A - B(D^{-1}C)_{i,i} & 0 \\ 0 & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_s \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} F_p - (BD^{-1})_{i,i}F_s - B_{i,j}D_{j,j}^{-1}(F_s)_j \\ D_{i,i}^{-1}F_s \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} F_p - (BD^{-1})_{j,j}F_s - B_{j,i}D_{i,i}^{-1}(F_s)_i \\ D_{j,j}^{-1}F_s \end{bmatrix}_j
\end{bmatrix}
\tag{3.3}
\]
Then the primary system, as shown in Eq. (3.4), is decoupled from the full system
by taking the first row (primary equations) and first column (primary variables) out
of each block row and column.

\[
\begin{bmatrix}
(A - BD^{-1}C)_{i,i} & \cdots & A_{i,j} - B_{i,j}(D^{-1}C)_{j,j} \\
\vdots & \ddots & \vdots \\
A_{j,i} - B_{j,i}(D^{-1}C)_{i,i} & \cdots & (A - BD^{-1}C)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_p)_i \\ \vdots \\ (X_p)_j \end{bmatrix}
=
\begin{bmatrix}
(F_p)_i - (BD^{-1})_{i,i}(F_s)_i - B_{i,j}D_{j,j}^{-1}(F_s)_j \\
\vdots \\
(F_p)_j - (BD^{-1})_{j,j}(F_s)_j - B_{j,i}D_{i,i}^{-1}(F_s)_i
\end{bmatrix}
\tag{3.4}
\]
Repeating this process for each cell $j$ that is a neighbor of cell $i$ in the general
connection list, the diagonal part of block row $i$ in the primary system is kept as
$(A - BD^{-1}C)_{i,i}$, and the off-diagonal entry in block row $i$ and block column $j$ is
computed as $A_{i,j} - B_{i,j}(D^{-1}C)_{j,j}$ for each neighboring cell $j$. The final RHS of block
row $i$ is obtained as:

\[
R^{Pri}_i = (F_p)_i - (BD^{-1})_{i,i}(F_s)_i - \sum_{j \in nbr(i)} B_{i,j} D_{j,j}^{-1} (F_s)_j
\tag{3.5}
\]
The above reduction process can be applied to any main-diagonal submatrix that
shares the same structure as JRR, e.g., the JNN part of the JWW,i matrix of a general
multisegment well. Besides these main-diagonal submatrices, the treatment of the
off-diagonal terms described above also needs to be applied to the coupling submatrices
JRW,i and JWR,i of all facility objects (and to the block COO matrices on even
lower levels, e.g., the JNC and JCN parts of the JWW,i matrix of a general multisegment
well). For any block nonzero entry of an off-diagonal block COO matrix in the full
system, the entry takes the form:

\[
\begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{row,col},
\tag{3.6}
\]
where the second row and second column may or may not exist. For example, JRW,i of
a standard well contains no derivatives with respect to secondary variables, because
there is only one variable, BHP, per well; and similarly, JWR,i of a standard well
contains no derivatives from secondary equations, because there is only one equation
per well.
Treating the block entry (3.6) as an off-diagonal block of the main-diagonal matrix,
we may reduce it to the following primary block:

\[
A_{row,col} - B_{row,col}(D^{-1}C)_{col,col},
\tag{3.7}
\]

where $A_{row,col}$ and $B_{row,col}$ are taken from the block entry (3.6), and $(D^{-1}C)_{col,col}$ is taken from
the diagonal part of the corresponding main-diagonal matrix in the same column.
The RHS of the corresponding block row is updated as:

\[
R^{Pri}_{row} \leftarrow R^{Pri}_{row} - B_{row,col}\,D_{col,col}^{-1}(F_s)_{col}
\tag{3.8}
\]
The entire algebraic reduction process from the full to the primary system has now been
described. Correspondingly, after the primary variable update $(X_p)_i$ ($i = 1, 2, \ldots, N_B$) is
obtained, the update of the secondary variables can be computed explicitly as:

\[
(X_s)_i = D_{i,i}^{-1}(F_s)_i - (D^{-1}C)_{i,i}(X_p)_i, \quad i = 1, 2, \ldots, N_B
\tag{3.9}
\]

where $D_{i,i}^{-1}(F_s)_i$ and $(D^{-1}C)_{i,i}$ have already been computed in Eq. (3.2). $D_{i,i}^{-1}(F_s)_i$ is
stored in the original place for the secondary RHS, whereas $(D^{-1}C)_{i,i}$ corresponds to the
extra elements stored in the second part of the diagonal array of JRR.
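As a concrete illustration, the following numpy sketch carries out the full-to-primary reduction of Eqs. (3.2)-(3.5) and the explicit secondary update of Eq. (3.9) for a small dense two-cell system, and checks the combined result against a direct solve of the full system. The dense storage and random blocks are purely illustrative; in AD-GPRS these blocks live in the MLBS format:

```python
import numpy as np

rng = np.random.default_rng(0)
npv, ns = 2, 3                     # primary and secondary variables per cell
n = npv + ns

# Two-cell full system (Eq. 3.1): diagonal blocks carry [A B; C D];
# off-diagonal blocks carry only primary-equation derivatives [A B; 0 0].
J = np.zeros((2 * n, 2 * n))
for c in range(2):
    J[c*n:(c+1)*n, c*n:(c+1)*n] = rng.standard_normal((n, n)) + 4.0 * np.eye(n)
J[0:npv, n:2*n] = rng.standard_normal((npv, n))       # block (i, j)
J[n:n+npv, 0:n] = rng.standard_normal((npv, n))       # block (j, i)
F = rng.standard_normal(2 * n)

# Per-cell Schur complement (Eqs. 3.2-3.5) assembling the primary system.
Jp = np.zeros((2 * npv, 2 * npv))
Fp = np.zeros(2 * npv)
DC, DF = [], []                    # keep D^{-1}C and D^{-1}F_s for Eq. (3.9)
for c in range(2):
    blk = J[c*n:(c+1)*n, c*n:(c+1)*n]
    A, B = blk[:npv, :npv], blk[:npv, npv:]
    C, D = blk[npv:, :npv], blk[npv:, npv:]
    Dinv = np.linalg.inv(D)
    DC.append(Dinv @ C)
    DF.append(Dinv @ F[c*n+npv:(c+1)*n])
    Jp[c*npv:(c+1)*npv, c*npv:(c+1)*npv] = A - B @ DC[c]
    Fp[c*npv:(c+1)*npv] = F[c*n:c*n+npv] - B @ DF[c]
for (r, c) in [(0, 1), (1, 0)]:    # off-diagonal: A_ij - B_ij (D^{-1}C)_jj
    Aij = J[r*n:r*n+npv, c*n:c*n+npv]
    Bij = J[r*n:r*n+npv, c*n+npv:(c+1)*n]
    Jp[r*npv:(r+1)*npv, c*npv:(c+1)*npv] = Aij - Bij @ DC[c]
    Fp[r*npv:(r+1)*npv] -= Bij @ DF[c]

Xp = np.linalg.solve(Jp, Fp)
# Explicit update of the secondary variables, Eq. (3.9).
X = np.zeros(2 * n)
for c in range(2):
    X[c*n:c*n+npv] = Xp[c*npv:(c+1)*npv]
    X[c*n+npv:(c+1)*n] = DF[c] - DC[c] @ Xp[c*npv:(c+1)*npv]

err = np.max(np.abs(X - np.linalg.solve(J, F)))
```

Because the secondary equations have no off-diagonal entries, the reduction is exact Gaussian elimination, and the reconstructed full update matches the direct solve to round-off.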
3.4.3 Algebraic reduction from primary to implicit system
Under IMPES, IMPSAT, or any AIM formulation, the primary variables are
no longer all implicit. So in addition to the original "full to primary" reduction
discussed in Section 3.4.2, another reduction from the primary to the implicit system
must be performed between the original reduction and the linear solution. Correspondingly,
an additional explicit update step from implicit to primary variables is also needed
between the linear solution and the original "primary to full" update. Effectively, we
perform a two-step algebraic reduction (full to primary to implicit) before the linear
solution, as well as a two-step explicit update (implicit to primary to full) after the
linear solution.
First we further divide the primary equations and variables into implicit and (primary)
explicit parts. Note that the secondary variables are also explicit in the IMPES or
IMPSAT formulations, but here we work only on the primary system expressed by Eq.
(3.4). Also, there is no natural classification of "implicit" and "explicit" equations.
By "implicit" equations we mean the first $(N_{Imp})_i$ equations, where $(N_{Imp})_i$
is the number of implicit variables in grid block i. Correspondingly, “explicit” equa-
tions are the rest of the primary equations in that block. The following notation is
introduced for the second algebraic reduction and explicit update steps:
• AII – The derivatives of implicit equations with respect to implicit variables;
• AIE – The derivatives of implicit equations with respect to explicit variables;
• AEI – The derivatives of explicit equations with respect to implicit variables;
• AEE – The derivatives of explicit equations with respect to explicit variables;
• XI , XE – Updates for the implicit variables and for the explicit variables;
• FI , FE – RHS of the implicit equations and of the explicit equations.
Without loss of generality, we simply assume that blocks $i$ and $j$ both have at
least one primary explicit variable, such that the secondary variables in both blocks are
also explicit. With the correct explicit treatment of the nonlinear terms in the flux term,
we have $B_{i,j} = 0$ ($i \neq j$), i.e., the derivatives with respect to the (explicit)
secondary variables in off-diagonal blocks are zero. As a result, the off-diagonal terms are
kept at their original values after the first reduction step from the full to the primary system,
i.e., $A_{i,j} - B_{i,j}(D^{-1}C)_{j,j} = A_{i,j}$ and $A_{j,i} - B_{j,i}(D^{-1}C)_{i,i} = A_{j,i}$. Therefore
we also have $(A_{IE})_{i,j} = (A_{EE})_{i,j} = 0$ ($i \neq j$): the derivatives with respect to the primary
explicit variables in off-diagonal blocks are zero. Using the above notation, the
primary system obtained in Eq. (3.4) can be expressed as:
\[
\begin{bmatrix}
\begin{bmatrix} A_{II} & A_{IE} \\ A_{EI} & A_{EE} \end{bmatrix}_{i,i} & \cdots & \begin{bmatrix} A_{II} & 0 \\ A_{EI} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A_{II} & 0 \\ A_{EI} & 0 \end{bmatrix}_{j,i} & \cdots & \begin{bmatrix} A_{II} & A_{IE} \\ A_{EI} & A_{EE} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_I \\ X_E \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_I \\ X_E \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} F_I \\ F_E \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} F_I \\ F_E \end{bmatrix}_j \end{bmatrix}
\tag{3.10}
\]
For each block row $i$, subtract from the first row (with $A_{II}$ and $A_{IE}$ in the diagonal block) the second row (with $A_{EI}$ and $A_{EE}$ in the diagonal block) left-multiplied by $(A_{IE}A_{EE}^{-1})_{i,i}$. The primary system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})A_{EI} & 0 \\ A_{EI} & A_{EE} \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})_{i,i}A_{EI} & 0 \\ A_{EI} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})_{j,j}A_{EI} & 0 \\ A_{EI} & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})A_{EI} & 0 \\ A_{EI} & A_{EE} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_I \\ X_E \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_I \\ X_E \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} F_I - (A_{IE}A_{EE}^{-1})_{i,i}F_E \\ F_E \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} F_I - (A_{IE}A_{EE}^{-1})_{j,j}F_E \\ F_E \end{bmatrix}_j
\end{bmatrix}
\tag{3.11}
\]
Then the implicit system, as shown below, is decoupled from the primary system by
taking the first row (implicit equations) and first column (implicit variables) out of
each block row and column:

\[
\begin{bmatrix}
\left(A_{II} - (A_{IE}A_{EE}^{-1})A_{EI}\right)_{i,i} & \cdots & (A_{II})_{i,j} - (A_{IE}A_{EE}^{-1})_{i,i}(A_{EI})_{i,j} \\
\vdots & \ddots & \vdots \\
(A_{II})_{j,i} - (A_{IE}A_{EE}^{-1})_{j,j}(A_{EI})_{j,i} & \cdots & \left(A_{II} - (A_{IE}A_{EE}^{-1})A_{EI}\right)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_I)_i \\ \vdots \\ (X_I)_j \end{bmatrix}
=
\begin{bmatrix}
(F_I)_i - (A_{IE}A_{EE}^{-1})_{i,i}(F_E)_i \\ \vdots \\ (F_I)_j - (A_{IE}A_{EE}^{-1})_{j,j}(F_E)_j
\end{bmatrix}
\tag{3.12}
\]
Repeating this process for each cell $j$ that is a neighbor of cell $i$ in the general
connection list, the diagonal part and the RHS of block row $i$ in the implicit system are
kept as $(A_{II} - (A_{IE}A_{EE}^{-1})A_{EI})_{i,i}$ and $(F_I)_i - (A_{IE}A_{EE}^{-1})_{i,i}(F_E)_i$, respectively, whereas
the off-diagonal entry in block row $i$ and block column $j$ is computed as
$(A_{II})_{i,j} - (A_{IE}A_{EE}^{-1})_{i,i}(A_{EI})_{i,j}$ for each neighboring cell $j$.
The above process can again be applied to any main-diagonal submatrix that
shares the same structure as JRR. In addition, the treatment for off-diagonal terms
needs to be applied to the coupling submatrices as well. For any block nonzero entry
(with explicit variables) of an off-diagonal block COO matrix in the primary system,
the entry takes the form:

\[
\begin{bmatrix} A_{II} & 0 \\ A_{EI} & 0 \end{bmatrix}_{row,col},
\tag{3.13}
\]

where the second row and column may or may not exist, depending on the corresponding
facility model of the block COO matrix.
Treating the block entry (3.13) as an off-diagonal block of the main-diagonal
matrix, we may reduce it to the following implicit block:

\[
(A_{II})_{row,col} - (A_{IE}A_{EE}^{-1})_{row,row}(A_{EI})_{row,col},
\tag{3.14}
\]

where $(A_{II})_{row,col}$ and $(A_{EI})_{row,col}$ are taken from the block entry (3.13), and $(A_{IE}A_{EE}^{-1})_{row,row}$
is taken from the diagonal part of the corresponding main-diagonal matrix in the same row.
The entire algebraic reduction process from the primary to the implicit system has now
been described. Correspondingly, after the implicit variable update $(X_I)_i$ ($i = 1, 2, \ldots, N_B$)
is obtained from the linear solution of the implicit system, the update of the primary
explicit variables can be computed as:

\[
(X_E)_i = (A_{EE})_{i,i}^{-1}\Bigl[(F_E)_i - (A_{EI})_{i,i}(X_I)_i - \sum_{j \in nbr(i)} (A_{EI})_{i,j}(X_I)_j\Bigr], \quad i = 1, 2, \ldots, N_B
\tag{3.15}
\]

where $nbr(i)$ includes not only all cells neighboring cell $i$ in the general connection
list of the main-diagonal matrix, but also the block nonzero entries in row $i$ (if any)
of the off-diagonal block COO matrices, as discussed above.
Combining the implicit update $(X_I)_i$ and the primary explicit update $(X_E)_i$ for each
grid block, we obtain the primary update $(X_p)_i$, which is then used to
compute the full-system update using Eq. (3.9).
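The same pattern applies one level up. The numpy sketch below performs the primary-to-implicit reduction of Eqs. (3.11)-(3.12) and then recovers the primary explicit variables via Eq. (3.15), including the neighbor contributions $(A_{EI})_{i,j}(X_I)_j$; the dense two-cell system is again illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
ni, ne = 2, 2                     # implicit and primary explicit variables per cell
n = ni + ne

# Two-cell primary system in the AIM form of Eq. (3.10): off-diagonal blocks
# carry derivatives with respect to the implicit variables only.
A = np.zeros((2 * n, 2 * n))
for c in range(2):
    A[c*n:(c+1)*n, c*n:(c+1)*n] = rng.standard_normal((n, n)) + 4.0 * np.eye(n)
A[0:n, n:n+ni] = rng.standard_normal((n, ni))     # (A_II; A_EI) at block (i, j)
A[n:2*n, 0:ni] = rng.standard_normal((n, ni))     # (A_II; A_EI) at block (j, i)
F = rng.standard_normal(2 * n)

# Reduction to the implicit system, Eqs. (3.11)-(3.12).
W = []                                            # (A_IE A_EE^{-1}) per cell
for c in range(2):
    AIE = A[c*n:c*n+ni, c*n+ni:(c+1)*n]
    AEE = A[c*n+ni:(c+1)*n, c*n+ni:(c+1)*n]
    W.append(AIE @ np.linalg.inv(AEE))
AImp = np.zeros((2 * ni, 2 * ni))
FImp = np.zeros(2 * ni)
for r in range(2):
    for c in range(2):
        AII = A[r*n:r*n+ni, c*n:c*n+ni]
        AEI = A[r*n+ni:(r+1)*n, c*n:c*n+ni]
        AImp[r*ni:(r+1)*ni, c*ni:(c+1)*ni] = AII - W[r] @ AEI
    FImp[r*ni:(r+1)*ni] = F[r*n:r*n+ni] - W[r] @ F[r*n+ni:(r+1)*n]
XI = np.linalg.solve(AImp, FImp)

# Explicit update of the primary explicit variables, Eq. (3.15), including
# the neighbor contributions (A_EI)_{i,j} (X_I)_j.
X = np.zeros(2 * n)
for r in range(2):
    X[r*n:r*n+ni] = XI[r*ni:(r+1)*ni]
    rhs = F[r*n+ni:(r+1)*n].copy()
    for c in range(2):
        rhs -= A[r*n+ni:(r+1)*n, c*n:c*n+ni] @ XI[c*ni:(c+1)*ni]
    AEE = A[r*n+ni:(r+1)*n, r*n+ni:(r+1)*n]
    X[r*n+ni:(r+1)*n] = np.linalg.solve(AEE, rhs)

err = np.max(np.abs(X - np.linalg.solve(A, F)))
```

Since the explicit-variable columns of the off-diagonal blocks are zero by assumption, this second-level reduction is also exact, and the recovered full update matches a direct solve.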
3.4.4 Preconditioned Linear Solver
After obtaining the implicit linear system (3.12), we need to solve it with an efficient
linear solver. Although the implicit system is already much smaller than the full system,
it may still contain a large number of unknowns, e.g., on the order of $10^6$ or more.
Large, sparse linear systems in reservoir simulation are typically solved with
preconditioned iterative Krylov solvers. The Generalized Minimal Residual (GMRES)
solver [67, 68] is one of the most widely used Krylov subspace solvers in our community.
Due to the superior performance it offers in conjunction with an effective preconditioner,
GMRES is used as the default solver option in AD-GPRS. Here a brief introduction to
GMRES is given.
GMRES is an iterative algorithm for solving a linear system of equations of the form
$Ax = b$, where $A$ is a given sparse invertible matrix, $b$ is a given right-hand-side
vector, and $x$ is the unknown vector to be computed by the algorithm. GMRES places
no requirement on the storage format of the matrix $A$ or the vectors $x$ and $b$, provided
that the SpMV and preconditioner solution operations are well defined on the matrix and
vector types, and that the BLAS operations (copy, scal, axpy, dot, etc.) are well
defined on the vector type. Therefore, GMRES works naturally with the MLBS
matrix format described in Section 3.3. The preconditioned GMRES algorithm with the
restart option is listed in Algorithm 3.1.
The GMRES algorithm guarantees that the residual norm ($\|b - Ax\|_2$) decreases
monotonically within each restart cycle (i.e., for $i = 0, 1, \ldots, m-1$, where $i$ denotes the
iteration number within the current restart cycle, and $m$ denotes the total number
of iterations per restart cycle). For an $N \times N$ matrix, the algorithm takes at most $N$
steps to converge if it never restarts. In practice, $N$ is usually quite large, and the
algorithm has to restart due to the accumulation of numerical round-off error and the
limited memory available (to store $H_m$, $v_i$, and $w'_i$). With a good preconditioner,
GMRES usually takes only a small number of iterations ($\ll N$) to converge. It is
therefore quite important to devise an effective preconditioning strategy, as described
in the next section.
3.4.5 The Two-Stage Preconditioning Strategy
The performance of Krylov solvers depends strongly on the quality of the preconditioner.
Some preconditioners, e.g., the ILU family [11, 66], can be applied to many types of
matrices, but their effectiveness varies from matrix to matrix. On the other hand,
some preconditioners, e.g., AMG [76, 77], place strict requirements on the matrices to
be solved, but can be very effective when those requirements are satisfied.
The equations and unknowns in reservoir simulation have mixed properties. They
Algorithm 3.1 Preconditioned GMRES with restart option
 1: Given the initial guess x0, compute r = b − A x0 and β = ||r||_2
 2: Define the (m+1) × m matrix H_m = {h_kl}, 0 ≤ k ≤ m, 0 ≤ l ≤ m−1 (m: total number of iterations for each restart cycle)
 3: Let j = 1 (j: current number of iterations)
 4: while j ≤ j_max do
 5:   Compute v_0 = r/β
 6:   Let s_0 = β, i = 0 (i: number of iterations in the current restart cycle)
 7:   while i < m and j ≤ j_max do
 8:     Compute w'_i = M^{-1} v_i (preconditioner solution)
 9:     Compute w = A w'_i
10:     for k = 0, 1, ..., i do
11:       h_ki := (w, v_k)
12:       w := w − h_ki v_k
13:     end for
14:     h_{i+1,i} = ||w||_2
15:     v_{i+1} = w / h_{i+1,i}
16:     Transform H_m into upper triangular form using plane rotations, in order to solve the minimization problem ||β e_1 − H_m y||_2
17:     Update s_i in the corresponding RHS s = {s_k}, 0 ≤ k ≤ i
18:     Compute s_{i+1} as the residual of the minimization problem
19:     if s_{i+1}/||b||_2 < ε then
20:       Compute y = H_m^{-1} s and update x as x = x + Σ_{k=0}^{i} y_k w'_k
21:       Exit the algorithm with convergence
22:     end if
23:     Let i = i + 1, j = j + 1
24:   end while
25:   Compute y = H_m^{-1} s and update x as x = x + Σ_{k=0}^{m−1} y_k w'_k
26:   Compute r = b − A x, β = ||r||_2
27:   if β/||b||_2 < ε then
28:     Exit the algorithm with convergence
29:   end if
30: end while
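A compact numpy sketch of the algorithm is given below. It keeps the right-preconditioned structure of Algorithm 3.1 (each $w'_i = M^{-1}v_i$ is stored so that the final update $x \leftarrow x + \sum_k y_k w'_k$ requires no extra preconditioner application), but, for brevity, it solves the small minimization problem once per restart cycle with a least-squares solve instead of incrementally with plane rotations:

```python
import numpy as np

def gmres_right(A, b, Minv, x0=None, m=20, tol=1e-10, max_restarts=50):
    """Right-preconditioned GMRES with restarts (structure of Algorithm 3.1).

    The small problem ||beta*e1 - H y||_2 is solved once per restart cycle
    with lstsq instead of incrementally with plane rotations.
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    bnorm = np.linalg.norm(b)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta / bnorm < tol:
            break
        V = np.zeros((n, m + 1))        # Arnoldi basis v_0 .. v_m
        Wp = np.zeros((n, m))           # preconditioned vectors w'_i = M^{-1} v_i
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        k = m
        for i in range(m):
            Wp[:, i] = Minv(V[:, i])    # step 8: preconditioner solution
            w = A @ Wp[:, i]            # step 9
            for j in range(i + 1):      # steps 10-13: modified Gram-Schmidt
                H[j, i] = w @ V[:, j]
                w -= H[j, i] * V[:, j]
            H[i + 1, i] = np.linalg.norm(w)
            if H[i + 1, i] < 1e-14:     # happy breakdown: solution found exactly
                k = i + 1
                break
            V[:, i + 1] = w / H[i + 1, i]
        e1 = np.zeros(k + 1)
        e1[0] = beta
        y = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)[0]
        x = x + Wp[:, :k] @ y           # x += sum_k y_k w'_k
    return x

# Usage: a diagonally dominant system with a Jacobi (diagonal) preconditioner.
rng = np.random.default_rng(2)
n = 60
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = gmres_right(A, b, Minv=lambda v: v / np.diag(A), m=15)
res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
```

With this diagonally dominant test matrix and the simple Jacobi preconditioner, the relative residual drops below the tolerance within a few restart cycles.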
have both near-elliptic (pressure equations) and near-hyperbolic (advection equations)
parts. In order to handle this mixed character, a powerful two-stage preconditioner,
the Constrained Pressure Residual (CPR) method, was proposed by
Wallis [88] and Wallis et al. [89]. In the first stage, the pressure system is obtained
from the fully coupled Jacobian using an efficient algebraic reduction scheme, which
mimics the steps associated with constructing the pressure equation of the IMPES
formulation. This pressure system preserves the coupling of the reservoir-wells system
and is solved with an AMG solver. The low frequency errors associated with the pres-
sure variables are resolved in this stage. In the second stage, an ILU preconditioner is
usually applied to the full Jacobian matrix, which removes the high frequency (local)
errors. The general form of the two-stage preconditioner can be written as in [17]:

\[
M_{1,2}^{-1} = M_2^{-1}\left[I - A M_1^{-1}\right] + M_1^{-1},
\tag{3.16}
\]

where $M_{1,2}$ denotes the two-stage scheme; $M_1$ and $M_2$ denote the first- and second-stage
preconditioners, respectively; $I$ is the identity matrix; and $A$ is the matrix to be solved.
Note that we never explicitly calculate the inverse of $M_1$ or $M_2$. Instead, we
always calculate $M_i^{-1} r$ ($i = 1, 2$) for any given RHS vector $r$ (e.g., see step 8 in
Algorithm 3.1). The preconditioner at either stage should be able to calculate the
approximate solution in an efficient manner. Otherwise, even with good convergence
behavior, the total cost of the linear solver will be too high and the preconditioner
loses its advantage.
The two-stage preconditioner is applied as follows:
1. Apply the first-stage preconditioner to compute $x_1 = M_1^{-1} r$, where $r$ is the
given RHS and $x_1$ is the first-stage solution vector.

2. Compute $r_2 = r - A x_1$, which becomes the updated RHS for the second stage.

3. Apply the second-stage preconditioner to compute $x_2 = M_2^{-1} r_2$.

4. Compute $x = x_1 + x_2$ as the overall update. This is consistent with the definition
of the two-stage preconditioner in Eq. (3.16), because we have:

\[
\begin{aligned}
x = M_{1,2}^{-1} r &= \left(M_2^{-1}\left[I - A M_1^{-1}\right] + M_1^{-1}\right) r \\
&= M_2^{-1}\left[I - A M_1^{-1}\right] r + M_1^{-1} r \\
&= M_2^{-1}\left[r - A x_1\right] + x_1 \\
&= M_2^{-1} r_2 + x_1 \\
&= x_2 + x_1
\end{aligned}
\tag{3.17}
\]
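The equivalence asserted in Eq. (3.17) is easy to verify numerically. In the sketch below, the two stages are stood in for by simple matrix splittings (a diagonal matrix for $M_1$ and a lower-triangular one for $M_2$); these are illustrative placeholders, not the actual AMG and ILU stages:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
A = rng.standard_normal((n, n)) + n * np.eye(n)
r = rng.standard_normal(n)

# Illustrative stand-ins for the two stages (NOT the actual AMG/ILU stages):
# a diagonal splitting for M1 and a lower-triangular splitting for M2.
M1 = np.diag(np.diag(A))
M2 = np.tril(A)

# Two-step application (steps 1-4 above).
x1 = np.linalg.solve(M1, r)
r2 = r - A @ x1
x2 = np.linalg.solve(M2, r2)
x_steps = x1 + x2

# Closed form of Eq. (3.16): M_{1,2}^{-1} = M2^{-1}[I - A M1^{-1}] + M1^{-1}.
M1inv = np.linalg.inv(M1)
M2inv = np.linalg.inv(M2)
x_formula = (M2inv @ (np.eye(n) - A @ M1inv) + M1inv) @ r

err = np.max(np.abs(x_steps - x_formula))
```

The two-step procedure never forms any inverse explicitly, which is exactly why it is the form used in practice.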
With highly tuned preconditioners for each of the two stages as components,
CPR has been demonstrated to be a very efficient preconditioner for reservoir simu-
lation with standard wells (see, for example, [15,17,37,38]). More recently, CPR has
been extended to handle complex unstructured models with advanced multisegment
wells [38] and well groups [96,98]. The extension to handle coupled reservoir-facilities
simulation with general multisegment wells is discussed in Section 4.7. Here we give a
detailed description of the important components of the CPR method.
First stage: pressure system
In the first stage of CPR, a pressure system needs to be constructed from the implicit
system in reservoir simulation. As mentioned in [38], two alternative approaches to
obtain the pressure system can be considered:
• Nonlinear construction of the pressure system. The IMPES conservation equa-
tions can be formed by first performing the nonlinear treatment (explicit treat-
ment of ρp, λp, xcp and partially implicit treatment of γp) and then adding up
the mass conservation equations of all components in each cell. Then we obtain
the pressure system with one equation (the total mass conservation equation)
and one variable (pressure) per cell. This approach incurs a high construction
cost and introduces strong coupling between the implementation on the nonlinear level
and that on the linear level.
• Algebraic reduction from the implicit system. With certain assumptions (true-
IMPES or quasi-IMPES), the pressure linear system can be algebraically re-
duced from the implicit system derived in Sections 3.4.2 and 3.4.3. The reduced
linear system contains only one element per block and essentially becomes
pointwise. This operation is performed purely on the linear level, and its cost
is much lower than that of the former approach, given that the implicit linear
system has already been obtained prior to the setup stage of the preconditioner.
As reported in [38], the pressure systems constructed by the two approaches are quite
similar. Due to its low cost and decoupled implementation, algebraic reduction
is the desired approach to obtain the pressure system. As mentioned above, two
options based on different assumptions, true-IMPES and quasi-IMPES, can be used
to perform the reduction from the implicit matrix system to an IMPES-like matrix
system.
First we further divide the implicit equations and variables into pressure and
other (implicit) parts. The starting point here is the implicit system expressed by
Eq. (3.12). Note that our target is the pressure equation, but it is not naturally
generated in an implicit system. Thus, in the implicit system, we use “pressure”
equation simply to denote the first equation (corresponding to the pressure variable,
which is the first variable), and use “other” implicit equations to denote the rest of
the implicit equations (see the definition at the beginning of Section 3.4.3) in that
block. The following notation is introduced for the true- and quasi- IMPES reduction:
• App, Fpp – The derivatives of the accumulation term and of the flux term
in the pressure equation with respect to the pressure variable;
• Apo, Fpo – The derivatives of the accumulation and flux terms in the pressure
equation with respect to the other implicit variables;
• Aop, Fop – The derivatives of the accumulation and flux terms in the other
implicit equations with respect to the pressure variable;
• Aoo, Foo – The derivatives of the accumulation and flux terms in the other
implicit equations with respect to the other implicit variables;
• Xp, Xo – Updates for the pressure variables and for the other implicit variables;
• Rp, Ro – RHS of the pressure equations and of the other implicit equations.
Without loss of generality, we simply assume that blocks $i$ and $j$ both have at least
one other implicit variable. Using the above notation, the implicit system obtained
in Eq. (3.12) can be expressed as:
\[
\begin{bmatrix}
\begin{bmatrix} A_{pp}+F_{pp} & A_{po}+F_{po} \\ A_{op}+F_{op} & A_{oo}+F_{oo} \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} & F_{po} \\ F_{op} & F_{oo} \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} & F_{po} \\ F_{op} & F_{oo} \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A_{pp}+F_{pp} & A_{po}+F_{po} \\ A_{op}+F_{op} & A_{oo}+F_{oo} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} R_p \\ R_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} R_p \\ R_o \end{bmatrix}_j \end{bmatrix}.
\tag{3.18}
\]
We may observe that the diagonal entries contain contributions from both the accumulation
and flux terms, whereas the off-diagonal entries contain only the contribution
from the flux term.
True-IMPES reduction. The assumption of the true-IMPES reduction is to treat
all variables other than pressure in the flux terms explicitly. For preconditioning
purposes, however, the values from the last iteration, instead of those from the last
timestep, are used for these variables. This is the major difference between the pressure
system obtained by the true-IMPES reduction and that of the IMPES formulation, where
the values of the variables other than pressure are fixed at the last timestep. To achieve
this treatment purely on the linear level, we may apply a column sum to the derivatives
with respect to the other implicit variables. That is, for each off-diagonal block $(i,j)$, we
add its $F_{po}$ and $F_{oo}$ parts to the corresponding parts of the diagonal block in the
same block column, i.e., $(j,j)$. Because each flux term $(F_c)_{i,j}$ in block row $i$ has
a corresponding flux term $(F_c)_{j,i}$ in block row $j$ with the same absolute value but
opposite sign (i.e., $(F_c)_{j,i} = -(F_c)_{i,j}$), we have:
\[
(F_{po})_{j,j} = -\sum_{i \in nbr(j)} (F_{po})_{i,j}, \qquad
(F_{oo})_{j,j} = -\sum_{i \in nbr(j)} (F_{oo})_{i,j}
\tag{3.19}
\]
Then, by summing the derivatives with respect to the other implicit variables in
all off-diagonal blocks with the same block column into the corresponding places of
the diagonal block in that block column, the $F_{po}$ and $F_{oo}$ terms in the diagonal block
cancel out. Afterwards, we may simply ignore the $F_{po}$ and $F_{oo}$ parts in the
off-diagonal blocks (note that we do not need to manually fill them with zeros; we
just do not use them in the subsequent steps of the reduction process). The implicit
system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A_{pp}+F_{pp} & A_{po} \\ A_{op}+F_{op} & A_{oo} \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} & 0 \\ F_{op} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} & 0 \\ F_{op} & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A_{pp}+F_{pp} & A_{po} \\ A_{op}+F_{op} & A_{oo} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} R_p \\ R_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} R_p \\ R_o \end{bmatrix}_j \end{bmatrix}.
\tag{3.20}
\]
Next, we apply the Schur-complement process. That is, for each block row $i$, first
left-multiply the second row (with $A_{op}+F_{op}$ and $A_{oo}$ in the diagonal block) by
$(A_{oo})_{i,i}^{-1}$, and then subtract from the first row (with $A_{pp}+F_{pp}$ and $A_{po}$
in the diagonal block) the resultant second row (with $A_{oo}^{-1}(A_{op}+F_{op})$ and $I$ in
the diagonal block) left-multiplied by $(A_{po})_{i,i}$. Combining $A$ and $F$ into $J$, the
system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} J_{pp} - (A_{po}A_{oo}^{-1})J_{op} & 0 \\ A_{oo}^{-1}J_{op} & I \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} - (A_{po}A_{oo}^{-1})_{i,i}F_{op} & 0 \\ (A_{oo})_{i,i}^{-1}F_{op} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} - (A_{po}A_{oo}^{-1})_{j,j}F_{op} & 0 \\ (A_{oo})_{j,j}^{-1}F_{op} & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} J_{pp} - (A_{po}A_{oo}^{-1})J_{op} & 0 \\ A_{oo}^{-1}J_{op} & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} R_p - (A_{po}A_{oo}^{-1})_{i,i}R_o \\ (A_{oo})_{i,i}^{-1}R_o \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} R_p - (A_{po}A_{oo}^{-1})_{j,j}R_o \\ (A_{oo})_{j,j}^{-1}R_o \end{bmatrix}_j
\end{bmatrix}.
\tag{3.21}
\]
Then the pressure system, as shown below, is decoupled from the implicit system
by taking the first row (pressure equations) and first column (pressure variables) out
of each block row and column:

\[
\begin{bmatrix}
\left(J_{pp} - (A_{po}A_{oo}^{-1})J_{op}\right)_{i,i} & \cdots & (F_{pp})_{i,j} - (A_{po}A_{oo}^{-1})_{i,i}(F_{op})_{i,j} \\
\vdots & \ddots & \vdots \\
(F_{pp})_{j,i} - (A_{po}A_{oo}^{-1})_{j,j}(F_{op})_{j,i} & \cdots & \left(J_{pp} - (A_{po}A_{oo}^{-1})J_{op}\right)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_p)_i \\ \vdots \\ (X_p)_j \end{bmatrix}
=
\begin{bmatrix}
(R_p)_i - (A_{po}A_{oo}^{-1})_{i,i}(R_o)_i \\ \vdots \\ (R_p)_j - (A_{po}A_{oo}^{-1})_{j,j}(R_o)_j
\end{bmatrix}.
\tag{3.22}
\]
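The column-sum lumping of Eq. (3.19) and the subsequent Schur complement can be sketched for a two-cell system with one pressure and one other implicit variable per cell. The sign pattern of the flux blocks below encodes the single-connection identity $(F_c)_{j,i} = -(F_c)_{i,j}$; the random values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
nb = 2      # one pressure + one "other" implicit variable per cell, two cells

# Accumulation derivatives (diagonal blocks only) and the derivative of the
# single flux term between the two cells; the sign pattern makes each block
# column satisfy the antisymmetry behind Eq. (3.19).
Acc = [rng.standard_normal((nb, nb)) + 4.0 * np.eye(nb) for _ in range(2)]
dF = rng.standard_normal((nb, nb))
Fblk = {(0, 0): -dF, (0, 1): dF, (1, 0): dF, (1, 1): -dF}

J = np.zeros((2 * nb, 2 * nb))
for r in range(2):
    for c in range(2):
        J[r*nb:(r+1)*nb, c*nb:(c+1)*nb] = Fblk[(r, c)]
    J[r*nb:(r+1)*nb, r*nb:(r+1)*nb] += Acc[r]

# Column-sum lumping (Eq. 3.19): add the "other"-variable flux column of each
# off-diagonal block into the diagonal block of the same block column.
Jt = J.copy()
for (r, c) in [(0, 1), (1, 0)]:
    Jt[c*nb:(c+1)*nb, c*nb+1] += Jt[r*nb:(r+1)*nb, c*nb+1]
    Jt[r*nb:(r+1)*nb, c*nb+1] = 0.0
# After lumping, the diagonal "other" columns contain accumulation terms only.
lumped_ok = np.allclose(Jt[0:nb, 1], Acc[0][:, 1]) and \
            np.allclose(Jt[nb:2*nb, nb+1], Acc[1][:, 1])

# Schur complement to one pressure unknown per cell (Eqs. 3.20-3.22).
Ap = np.zeros((2, 2))
for r in range(2):
    w = Jt[r*nb, r*nb+1] / Jt[r*nb+1, r*nb+1]     # (A_po A_oo^{-1})_{r,r}
    for c in range(2):
        Ap[r, c] = Jt[r*nb, c*nb] - w * Jt[r*nb+1, c*nb]
```

The cancellation of the flux derivatives in the diagonal "other" columns is exactly why the lumped weights reduce to accumulation-only quantities $(A_{po}A_{oo}^{-1})$.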
Now we consider the treatment of standard wells. First, each entry in a
reservoir-well submatrix JRW,i takes the form:

\[
\begin{bmatrix} J_{pp} \\ J_{op} \end{bmatrix}_{row,col}
\tag{3.23}
\]
We treat it as an off-diagonal block with no derivatives with respect to the other implicit
variables. Therefore, it can be reduced to:

\[
(J_{pp})_{row,col} - (A_{po}A_{oo}^{-1})_{row,row}(J_{op})_{row,col},
\tag{3.24}
\]

where $(J_{pp})_{row,col}$ and $(J_{op})_{row,col}$ are taken from the JRW,i matrix, and $(A_{po}A_{oo}^{-1})_{row,row}$
is taken from the reservoir matrix.
Second, each entry in a well-reservoir submatrix JWR,i takes the form:

\[
\begin{bmatrix} J_{pp} & J_{po} \end{bmatrix}_{row,col}
\tag{3.25}
\]

Similarly, we treat it as an off-diagonal block with no other implicit equations. Therefore,
by ignoring the derivatives with respect to the other implicit variables in the explicit
treatment, we simply reduce it to $(J_{pp})_{row,col}$.
Third, there is only one entry in a well-well submatrix JWW,i: $(J_{pp})_{row,row}$. We
can keep it directly in the reduced system.
Quasi-IMPES reduction. In the quasi-IMPES reduction, we apply the Schur-complement
process directly to the implicit system in the form (3.18). Here we also combine
$A$ and $F$ into $J$. For each block row $i$, first left-multiply the second row (with $J_{op}$
and $J_{oo}$ in the diagonal block) by $(J_{oo})_{i,i}^{-1}$, and then subtract from the first row
(with $J_{pp}$ and $J_{po}$ in the diagonal block) the resultant second row (with $J_{oo}^{-1}J_{op}$
and $I$ in the diagonal block) left-multiplied by $(J_{po})_{i,i}$. The system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} J_{pp} - (J_{po}J_{oo}^{-1})J_{op} & 0 \\ J_{oo}^{-1}J_{op} & I \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} - (J_{po}J_{oo}^{-1})_{i,i}F_{op} & F^* \\ (J_{oo})_{i,i}^{-1}F_{op} & (J_{oo})_{i,i}^{-1}F_{oo} \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} - (J_{po}J_{oo}^{-1})_{j,j}F_{op} & F^* \\ (J_{oo})_{j,j}^{-1}F_{op} & (J_{oo})_{j,j}^{-1}F_{oo} \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} J_{pp} - (J_{po}J_{oo}^{-1})J_{op} & 0 \\ J_{oo}^{-1}J_{op} & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} R_p - (J_{po}J_{oo}^{-1})_{i,i}R_o \\ (J_{oo})_{i,i}^{-1}R_o \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} R_p - (J_{po}J_{oo}^{-1})_{j,j}R_o \\ (J_{oo})_{j,j}^{-1}R_o \end{bmatrix}_j
\end{bmatrix},
\tag{3.26}
\]
where $F^*_{i,j}$ can be computed as:

\[
F^*_{i,j} = (F_{po})_{i,j} - (J_{po}J_{oo}^{-1})_{i,i}(F_{oo})_{i,j}
\tag{3.27}
\]
The assumption of the quasi-IMPES reduction is that $F^* = 0$ (the derivatives of the
pressure equation with respect to the other implicit variables are ignored after the
Schur-complement process), such that Eq. (3.27) need not be computed in the
simulation. Under this assumption, we obtain the following pressure system, decoupled
from the implicit system by taking the first row (pressure equations) and
first column (pressure variables) out of each block row and column:
\[
\begin{bmatrix}
\left(J_{pp} - (J_{po}J_{oo}^{-1})J_{op}\right)_{i,i} & \cdots & (F_{pp})_{i,j} - (J_{po}J_{oo}^{-1})_{i,i}(F_{op})_{i,j} \\
\vdots & \ddots & \vdots \\
(F_{pp})_{j,i} - (J_{po}J_{oo}^{-1})_{j,j}(F_{op})_{j,i} & \cdots & \left(J_{pp} - (J_{po}J_{oo}^{-1})J_{op}\right)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_p)_i \\ \vdots \\ (X_p)_j \end{bmatrix}
=
\begin{bmatrix}
(R_p)_i - (J_{po}J_{oo}^{-1})_{i,i}(R_o)_i \\ \vdots \\ (R_p)_j - (J_{po}J_{oo}^{-1})_{j,j}(R_o)_j
\end{bmatrix}.
\tag{3.28}
\]
The standard wells can be treated in a similar way as explained for the true-IMPES
reduction, except that each entry in JRW,i is reduced to:

\[
(J_{pp})_{row,col} - (J_{po}J_{oo}^{-1})_{row,row}(J_{op})_{row,col},
\tag{3.29}
\]

instead of the term given by Eq. (3.24).
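A small sketch makes the relation between the two schemes concrete in one special case: when the flux term carries no derivatives with respect to the other implicit variables ($F_{po} = F_{oo} = 0$), neither scheme drops anything, and the two reductions coincide. The dense per-cell assembly below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
nb = 3        # pressure + two other implicit variables per cell, two cells

# Diagonal blocks J = Acc + Flux; off-diagonal blocks carry Flux only.
Acc = [rng.standard_normal((nb, nb)) + 4.0 * np.eye(nb) for _ in range(2)]
Flux = {k: rng.standard_normal((nb, nb)) for k in [(0, 0), (0, 1), (1, 0), (1, 1)]}
for k in Flux:                    # no flux derivatives w.r.t. other variables:
    Flux[k][:, 1:] = 0.0          # F_po = F_oo = 0 everywhere

def pressure_matrix(quasi):
    """Reduce to one pressure unknown per cell (Eqs. 3.22 / 3.28)."""
    Ap = np.zeros((2, 2))
    for r in range(2):
        Jd = Acc[r] + Flux[(r, r)]
        if quasi:     # weights (J_po J_oo^{-1}) from the full diagonal block
            w = np.linalg.solve(Jd[1:, 1:].T, Jd[0, 1:])
        else:         # true-IMPES weights (A_po A_oo^{-1}); the column-sum
            w = np.linalg.solve(Acc[r][1:, 1:].T, Acc[r][0, 1:])  # lumping adds zeros here
        for c in range(2):
            Jrc = Jd if r == c else Flux[(r, c)]
            Ap[r, c] = Jrc[0, 0] - w @ Jrc[1:, 0]
    return Ap

gap = np.max(np.abs(pressure_matrix(quasi=False) - pressure_matrix(quasi=True)))
```

When the flux does depend on the other implicit variables, the two schemes drop different terms and the two pressure matrices differ, which is where the practical comparison in [38] applies.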
Comments on the two reduction schemes. As described above and mentioned
in [38], the two reduction schemes take similar steps in different orders. True-IMPES
first ignores the derivatives of the flux term with respect to the other implicit variables
and then performs the Schur-complement reduction, whereas quasi-IMPES first performs
the Schur-complement reduction and then ignores the derivatives of the resultant flux
term with respect to the other implicit variables. The derivatives dropped in the
true-IMPES reduction correspond to an explicit treatment (a one-iteration lag) of all
variables other than pressure in the flux term, whereas those dropped in the quasi-IMPES
reduction have no clear physical meaning. Also, as reported in [38], the true-IMPES
reduction has been demonstrated, over a large number of simulations, to be consistently
better than the quasi-IMPES reduction.
Solution of the pressure system. With the previously described treatment of
standard wells during the true-IMPES reduction, we obtain an augmented
pressure linear system of the following form:

\[
\begin{bmatrix} A_{p,RR} & A_{p,RW} \\ A_{p,WR} & A_{p,WW} \end{bmatrix}
\cdot
\begin{bmatrix} x_{p,R} \\ x_{p,W} \end{bmatrix}
=
\begin{bmatrix} b_{p,R} \\ b_{p,W} \end{bmatrix},
\tag{3.30}
\]
where R stands for reservoir and W stands for well. The four parts $A_{p,RR}$, $A_{p,RW}$,
$A_{p,WR}$, and $A_{p,WW}$ in the pressure matrix $A_p$ are all pointwise sparse ($A_{p,WW}$ has a
diagonal structure) and contain only the derivatives of the pressure (or well constraint)
equations with respect to the pressure (or well BHP) variables. This pressure system
(in CSR format, see the introduction in Section 3.2) can be solved directly by the
AMG or SAMG preconditioner, provided that the following criteria are met: 1) the
diagonal element is always ordered as the first element in each row, 2) all diagonal
elements are positive, and 3) the row_ptr and col_ind arrays are both 1-based.
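As an illustration, here is a minimal sketch (a hypothetical helper, not part of AD-GPRS) of massaging a 0-based CSR matrix so that the diagonal comes first in each row and both index arrays become 1-based:

```python
# Reorder a 0-based CSR matrix so each row starts with its diagonal entry,
# then shift row_ptr and col_ind to 1-based (Fortran-style) numbering.
def to_samg_csr(row_ptr, col_ind, val, n):
    col_ind = list(col_ind)
    val = list(val)
    for i in range(n):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        for k in range(lo, hi):
            if col_ind[k] == i:            # found the diagonal entry of row i
                col_ind[lo], col_ind[k] = col_ind[k], col_ind[lo]
                val[lo], val[k] = val[k], val[lo]
                break
    return [p + 1 for p in row_ptr], [c + 1 for c in col_ind], val
```

For the 2x2 matrix [[4, 1], [2, 3]] stored with the diagonal out of order in the first row, the helper swaps the diagonal to the front of each row and shifts the indices up by one.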
An alternative strategy is to derive the Schur-complement pressure system as:
\[ \left(A_{p,RR} - A_{p,RW}A_{p,WW}^{-1}A_{p,WR}\right) \cdot x_{p,R} = b_{p,R} - A_{p,RW}A_{p,WW}^{-1}b_{p,W}. \qquad (3.31) \]
After solving the above linear system for $x_{p,R}$, we can explicitly obtain $x_{p,W}$ as
\[ x_{p,W} = A_{p,WW}^{-1}\left(b_{p,W} - A_{p,WR}\,x_{p,R}\right). \qquad (3.32) \]
Here the Schur-complement pressure matrix $A_p^{schur} = A_{p,RR} - A_{p,RW}A_{p,WW}^{-1}A_{p,WR}$
may not have the same structure as the $A_{p,RR}$ part of the original $A_p$ matrix when
there are wells with multiple perforations that are not directly connected. Thus,
dynamically changing the structure of the pressure matrix after the true-IMPES
reduction would incur a substantially higher cost. However, the structure of the Schur-
complement pressure matrix $A_p^{schur}$ is fixed as long as the well configuration (the number of
wells and the perforations of each well) does not change. For practical implementation,
the structure of $A_p^{schur}$ is predetermined at the beginning of the simulation and will
only need to be reconstructed when the well configuration changes.
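The two-step strategy of Eqs. (3.31)-(3.32) can be sketched with dense NumPy arrays (illustrative sizes and random values; a real implementation works with sparse formats and never forms explicit inverses):

```python
import numpy as np

rng = np.random.default_rng(0)
nR, nW = 6, 2                                        # reservoir cells, wells

A_RR = np.eye(nR) * 4 + rng.random((nR, nR)) * 0.1   # reservoir pressure block
A_RW = rng.random((nR, nW)) * 0.1                    # reservoir-well coupling
A_WR = rng.random((nW, nR)) * 0.1                    # well-reservoir coupling
A_WW = np.diag(rng.random(nW) + 1.0)                 # diagonal well block
b_R = rng.random(nR)
b_W = rng.random(nW)

# Eq. (3.31): eliminate the well unknowns and solve for reservoir pressures
A_schur = A_RR - A_RW @ np.linalg.inv(A_WW) @ A_WR
x_R = np.linalg.solve(A_schur, b_R - A_RW @ np.linalg.inv(A_WW) @ b_W)

# Eq. (3.32): recover the well pressures explicitly
x_W = np.linalg.inv(A_WW) @ (b_W - A_WR @ x_R)
```

Because `A_WW` is diagonal, its "inverse" is trivial in practice, which is what makes this elimination cheap.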
No matter which pressure system (augmented or Schur-complement) we solve, the
reduction process can be expressed by the following equation:
\[ A_p = R \cdot A^{Imp} \cdot P, \qquad (3.33) \]
where $A^{Imp}$ is the implicit matrix in Eq. (3.12), $R$ is the restriction matrix, and $P$
is the prolongation matrix. If we express the original implicit linear system as:
\[ A^{Imp} \cdot x^{Imp} = b^{Imp}, \qquad (3.34) \]
then by applying the restriction operator to the left of both sides, and letting $x^{Imp} = P \cdot x_p$ and $b_p = R \cdot b^{Imp}$, we have:
\[ (R \cdot A^{Imp} \cdot P) \cdot x_p = R \cdot b^{Imp}, \qquad (3.35) \]
\[ A_p \cdot x_p = b_p. \qquad (3.36) \]
Eq. (3.36) is the pressure system we solve, which can correspond to either Eq.
(3.30) or Eq. (3.31). The system is expected to be near-elliptic and thus can be solved
effectively by multigrid-type solvers, e.g., AMG and SAMG. Usually a single V-cycle
of AMG or SAMG per linear iteration is accurate enough for preconditioning
purposes. For challenging problems, such as unstructured-grid problems with strong
anisotropy, SAMG can be more stable than AMG.
After the solution of the pressure system, we apply the prolongation operator $P$ to
the pressure solution $x_p$, i.e., $x_1 = P \cdot x_p$. Note that we usually do not perform an explicit
update of the other implicit variables. That is, we distribute the pressure solution to
the update of the pressure variable in all reservoir cells and all wells, and pad zero values
for the update of the other implicit variables during the first stage of CPR. The updates
to the other implicit variables are computed in the second stage. Also note that the
reduction of the matrix only needs to be performed once per Newton iteration,
whereas the restriction of the RHS vector and the prolongation of the solution vector
must be performed once per linear iteration within a Newton iteration.
This is because, during each linear iteration, the GMRES
solver needs to find the approximate solution for a new RHS vector $v_i$ (see step 8
in Algorithm 3.1).
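The prolongation step can be sketched as follows (illustrative sizes; $P$ is shown as a dense matrix, whereas in practice it is a sparse scatter of pressure updates with zero padding):

```python
import numpy as np

# Illustrative sizes: 4 cells, 3 implicit unknowns per cell, pressure first.
n_cells, n_vars = 4, 3
n_imp = n_cells * n_vars

# Prolongation P scatters each cell's pressure update into the full implicit
# unknown vector and pads zeros for the other implicit variables.
P = np.zeros((n_imp, n_cells))
for i in range(n_cells):
    P[i * n_vars, i] = 1.0       # pressure occupies slot 0 of each cell block

x_p = np.array([1.0, 2.0, 3.0, 4.0])   # pressure solution from AMG/SAMG
x_1 = P @ x_p                          # first-stage CPR solution
```

The resulting `x_1` carries the pressure updates in the pressure slots and zeros everywhere else, which is exactly the vector corrected by the second CPR stage.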
Second stage: overall system
The reservoir and facility models are generally strongly coupled via pressure. Due to
the pressure decoupling at the first stage of CPR, we may solve the overall system
locally (i.e., one submatrix at a time) in the second stage. For these submatrices, the
ILU family of preconditioners, such as ILU(0), BILU(0), BILU(k) [44], can be used.
As described at the beginning of this section, we first calculate a corrected RHS
for the second stage as:
\[ r_2 = r - A x_1, \qquad (3.37) \]
where $r$ is the original RHS vector and $x_1$ is the prolonged first-stage solution vector.
The local preconditioners can then be applied to the submatrices and associated
RHS vectors for the reservoir part and for each facility model. For the reservoir part,
the BILU(0) preconditioner is preferred, because its factorization and per-iteration
solution are much more efficient than those of BILU(k), and it provides acceptable accuracy as
a second-stage preconditioner. However, when we apply the ILU family of preconditioners
directly as a single-stage preconditioner, BILU(k) with k = 1 is a better choice for the
reservoir part because it converges much faster and the overall cost is smaller [38, 98].
For the facilities part, the submatrices from standard wells are trivial, because the $J_{ww,i}$
submatrix for each well contains only one element and the solution can be obtained
directly. The linear solution strategy for advanced well models is discussed later in
Section 4.7.
Here we briefly describe the idea of the ILU family of preconditioners. In the setup
stage, an incomplete LU factorization is performed on the input matrix $A$. Depending
on the specific preconditioner type, we have:
\[ \text{Pointwise factorization without fill-ins:} \quad \mathrm{ILU0}(A) = L_p U_p, \qquad (3.38) \]
\[ \text{Blockwise factorization without fill-ins:} \quad \mathrm{BILU0}(A) = L_b U_b, \qquad (3.39) \]
\[ \text{Blockwise factorization with } k\text{-level fill-ins:} \quad \mathrm{BILU}k(A) = L_b^k U_b^k. \qquad (3.40) \]
The factorized matrices L and U are lower triangular and upper triangular (pointwise
or blockwise), respectively. With the given RHS, forward and backward sweeps can be
applied to quickly obtain the solution from these factorized matrices. For the linear
system Ax = b, we first use forward elimination to solve
Lz = b (3.41)
for the intermediate solution z, and then use backward substitution to solve
Ux = z (3.42)
for the final solution x.
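These two sweeps can be sketched for dense factors as follows (a plain illustration of Eqs. (3.41)-(3.42), not the blockwise in-place sweeps of AD-GPRS):

```python
import numpy as np

# Forward elimination (Eq. 3.41): solve L z = b for z.
def forward(L, b):
    n = len(b)
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

# Backward substitution (Eq. 3.42): solve U x = z for x.
def backward(U, z):
    n = len(z)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

Applying `forward` and then `backward` to the factors of a matrix recovers the solution of the factored system; for incomplete factors it yields the preconditioned residual instead.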
When we apply an ILU-type preconditioner without fill-ins, every single entry in
the factorized matrices L and U must be in the position of an original nonzero entry
in matrix A. All other entries induced by the factorization process are ignored. Thus,
the structures of L and U are already determined when the input matrix A is received
by the preconditioner. On the other hand, for an ILU-type preconditioner with k-level
fill-ins (k > 0), we first need to perform a symbolic factorization to determine the
positions of the fill-in entries in L and U, and then proceed with the numerical factorization
to set up the values of both the original and fill-in entries. The fill-in entries with level
k are induced by the elimination of the fill-in entries with level k − 1, for any k > 0.
The original nonzero entries in matrix A can be thought of as fill-in entries with
level 0. Thus the higher the fill-in level k is, the more fill-in entries we will have in
the factorized matrices, and the higher the factorization and per-iteration solution
cost will be. In return, the convergence rate should also get better because we have
obtained a closer approximation of the original matrix A by the preconditioner.
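The level rule can be sketched with a dense symbolic factorization (illustrative only; production codes track levels on sparse row structures):

```python
import numpy as np

INF = 10**9  # stands in for "no entry / infinite level"

# Symbolic ILU(k): propagate fill levels with the standard update rule
#   level(i,j) = min(level(i,j), level(i,t) + level(t,j) + 1),
# where original nonzeros have level 0. Entries with level > k are dropped.
def ilu_levels(pattern, k):
    n = len(pattern)
    lev = np.where(pattern, 0, INF)
    for t in range(n):                    # eliminate pivot row/column t
        for i in range(t + 1, n):
            if lev[i, t] > k:             # entry (i, t) is dropped; no update
                continue
            for j in range(t + 1, n):
                lev[i, j] = min(lev[i, j], lev[i, t] + lev[t, j] + 1)
    return lev <= k                       # sparsity pattern kept for ILU(k)
```

For an "arrowhead" pattern (dense first row and column plus the diagonal), eliminating the first unknown fills the whole trailing block at level 1: ILU(0) keeps only the original pattern, while ILU(1) keeps everything.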
The BILU(0) preconditioner in AD-GPRS works on a copy of the JRR matrix. The
copy is in block CSR format and contains only the derivatives of implicit equations
with respect to implicit variables, which are copied from the JRR matrix after the
two-step algebraic reduction in each Newton iteration. The reason the preconditioner
cannot work directly on the original JRR matrix is that JRR is needed for the SpMV
in the GMRES solver and thus cannot be modified during the linear iterations, whereas
the BILU(0) preconditioner performs an in-place factorization on the input matrix.
That is, the nonzero blocks of the factorized matrices L and U are stored at the same
places in memory as the nonzero blocks in the input matrix.
3.5 Concluding Remarks
In this chapter we described the linear solver framework, including the underlying
matrix formats and associated solution strategies of AD-GPRS. Currently AD-GPRS
can deal with two linear systems: the CSR linear system based on the public CSR
matrix format, and the block linear system based on the customized MLBS matrix
format. On one hand, the CSR matrix format has a very simple structure and is
easy to understand. However, the efficiency of its associated linear solvers is limited
by this structure. As a result, the CSR linear system can only be used to validate the
correctness of new features via small problems.
On the other hand, the MLBS matrix format has a much more sophisticated struc-
ture. A hierarchical storage system has been adopted such that the entire Jacobian
matrix is first divided into four parts ($J_{RR}$, $J_{RF}$, $J_{FR}$, and $J_{FF}$), with each part
(except $J_{RR}$) composed of smaller submatrices. There is no requirement imposed
on the types of the submatrices, as long as the common matrix operations, e.g.,
extraction from AD residual vector, algebraic reduction, explicit update, and SpMV,
are well defined. This design offers great flexibility and extensibility for the model
developers to devise the most suitable matrix format and associated solution strategy
for each individual facility model (or a certain new feature) of AD-GPRS.
For the solution of the block linear system, a two-step algebraic reduction from the
full system to the implicit system is applied prior to calling the linear solver. The
implicit system is solved by an iterative Krylov subspace solver (e.g., GMRES) with
a two-stage preconditioning strategy. Afterwards, a two-step explicit update is used
to recover the full solution from the implicit solution. The solution efficiency of this
strategy is much better than that of a single-stage preconditioner (e.g., BILU), not to
mention preconditioners working on the simple (pointwise) CSR linear system.
Chapter 4
General MultiSegment Well Model
4.1 Model Description
As mentioned in [38], long deviated and horizontal wells have become increasingly
important in the petroleum industry in recent years. Many horizontal wells, especially
offshore ones, can have such large volumetric rates that the pressure drops due to
friction and acceleration cannot be neglected [58]. In addition, the phase holdup
effect may greatly affect the behavior of the well [73, 74]. Because the standard
well model [62] is not capable of modeling these effects, the MultiSegment (MS)
well model, with a discretized wellbore, was introduced to accurately capture the flow behavior
in the wellbore [36, 38]. Hereinafter we refer to this model as the "original" MS
well model, in order to differentiate it from a more generalized model discussed in
this chapter. In addition to the hydrostatic pressure drop, the pressure drops due to
friction and acceleration are accounted for in the original MS well model. The drift-
flux model is often used in original MS wells to compute the individual phase holdups
and velocities in the wellbore [73, 74]. Figure 4.1 shows an original MS well and the
associated variables (degrees of freedom). Each original MS well is discretized into
multiple segments with predefined flow directions. Here, a segment must have two
Figure 4.1: Illustration of the original multisegment well model (from [38])
ends labelled “heel” and “toe”, respectively. The pressure is defined at the toe end of
the segment, whereas the mixture velocity is defined at the heel end of the segment.
Holdups and mole fractions are defined for the entire segment at the segment center.
The network of surface pipelines also plays an important role in field operations. In
a production system of several wells connected to a pipeline network, the constraints
are often applied on gathering points and storage facilities, not on the wellheads [38].
With properly chosen parameters, the original MS well model described above can
be used to simulate the transient effects in the wellbore. However, the original MS
well model cannot properly handle complex topology, including general branching,
loops, and multiple exits, of the pipeline network. Moreover, the flow directions
are predefined in the original MS well model, whereas in a pipeline network, the
flow directions may change with time and be different from the predefined ones. As a
consequence, a preliminary idea to generalize the original MS well model was proposed
in [38]. A simple stand-alone simulation package was implemented to justify the idea.
Here a general MS well model is described, which has been implemented using the
AD framework and seamlessly integrated with AD-GPRS.
Figure 4.2: Illustration of the general multisegment well model, showing two-outlet segments, multi-outlet segments, special segments, nodes, and connections (from [38])
As shown in Figure 4.2, each general MS well is discretized into nodes and connec-
tions in a manner that is quite similar to finite-volume discretization of reservoir flow.
In this model, each node (yellow circle) is associated with a unique segment that has
a finite volume and may be connected to any number of other nodes. Each nonexit
connection (red link) connects two segments (and their associated nodes), whereas
each exit connection (blue arrow) connects one segment (node) to the outside region.
As described in [38], there are three types of segments:
• Two-outlet segments (blue bars). They represent the most common type of
segments. Two-outlet segments are similar to the segments in the original MS
well model and can be used to model a section of the pipeline. There is no strict
restriction on the geometry of a two-outlet segment, i.e., it can be straight, or
bent. However, from an accuracy point of view (the holdup depends on the
angle of the segment), it is better to avoid bent two-outlet segments and to
have each two-outlet segment be straight, or nearly straight. We can always
divide a two-outlet segment into two, or more, connected shorter two-outlet
segments.
• Multioutlet segments (blue circles). Multioutlet segments represent junctions in
a pipeline system and can be used to create branches and loops. These segments
can have zero volume and thus have no accumulation.
• Special segments (symbols based on their functionality). Each type of special
segments can provide a specific functionality other than just being a segment
of “pipe”. For example, the special segment shown in Figure 4.2 is a separator
segment, which represents a facility that separates different fluid phases.
Using this discretization scheme, the general MS well model is capable of handling
complex geometries of multilateral wells, well groups, pipeline networks, or arbitrary
combinations of the above, provided the drift-flux model is sufficiently accurate for
the conditions being considered.
Well constraints are handled in a generic way in the general MS well model.
Multiple constraints can be imposed on a single general MS well. A constraint,
depending on its type, can be applied on any node, or exit connection. Because
pressure is a nodal variable (see Section 4.2), the pressure constraint will be applied
on the corresponding node. Also, because the mixture flow rate is a connection
variable, the rate constraint is defined on an exit connection.
The advanced features supported by this model are:
• General branching that allows any well node to be connected with any number of
other nodes. With the flexibility offered by this feature, we can define a facility
model with very complex geometry and thus have a better approximation to its
physical reality.
• Loops with arbitrary flow directions, whereby the flow direction is determined
dynamically during the Newton iteration process. The actual flow direction
is indicated by the sign of the mixture flow rate defined on each connection
(see Section 4.2): if the rate is positive, the actual direction is the same as
the preassumed direction; otherwise, the actual direction is opposite to the
preassumed direction.
• Multiple exit connections with different constraints. The current implementa-
tion supports only one active well constraint at a time. Extension to multiple
constraints is, however, supported by the nonlinear framework and the design
of the general MS well model.
• Special segments with various functionality (e.g., separators, valves). Any seg-
ment in a general MS well can be a special segment, such that its property
calculation is defined differently from an ordinary segment. The accumulation
terms in the mass conservation equations and the local constraints can also be
customized to accommodate the specific functionality of the segment. This idea
is elaborated in the first part of Section 4.6.1.
4.2 Variables
There are many more variables in the MS well model compared with the standard
well model, which has only one variable, the Bottom Hole Pressure (BHP), per well.
For a general MS well with multiple nodes and connections, different variables are
defined on nodes and on connections. Without loss of generality, we demonstrate
the variable definition of this model using the natural-variables formulation [15, 23],
with either black-oil or compositional fluids. On each node, the following independent
variables are defined:
• Pw (pressure)
• Tw (temperature, independent only in a thermal formulation)
• αp (holdup, or in-situ phase fraction, of phase p)
• xc,p (mole fraction of component c in phase p)
In addition to the above independent variables, dependent variables such as ρp (den-
sity), λp (mobility), and γp (mass density) are also defined on each node. In this
regard, both the independent and dependent variables are essentially the same as
the ones defined on each reservoir node (αp is equivalent to Sp). Therefore, they are
stored in the node-based subset of the global variable set (see Section 4.5.1).
Note that the above variables are all defined at the center of the node. This is
unlike the original MS well model, where pressure and mixture velocity are purposely
defined at the two ends of a segment. Due to this consistency in definition with the
cell-centered variables on a reservoir node, the initialization and property calculation of
well nodes can be carried out in a more generalized fashion. That is, they can share
many computational processes with reservoir cells.
As described above, these variables are applicable when the natural-variables for-
mulation is used. Different node-based variables can be defined automatically for
the general MS well model if another variable formulation (e.g., molar) is used. Our
general MS well model has no restrictions on the selection of the variable formulation,
or the fluid model. That is, we expect very little modification to deal with a new
variable formulation, or fluid model. This is an important generalization, as we
will only need to maintain one unified piece of code for the general MS well model,
and the code will work for arbitrary combinations of variable formulations and fluid
models.
On each connection, regardless of the selected variable formulation, the following
independent variable is defined:
• Qm (mixture flow rate, Qm = A · Vm, where A is the cross-sectional area of the
upstream segment, and Vm is the mixture velocity)
The dependent variables Qp (p = 1, . . . , np) are defined as the flow rate of phase
p. We have Qp = A · Vsp, where Vsp is the superficial velocity of phase p. Qp’s are
also defined on each connection. As a consequence, Qm and Qp’s are stored in the
connection-based subset of the global variable set (see Section 4.5.1).
4.3 Equations
On each node (segment) of an MS well, we solve mass and energy balance equations
and deal with local constraints. The $n_c$ mass balance equations are:
\[ \frac{\partial}{\partial t}\sum_p \rho_p \alpha_p x_{c,p} - \frac{\partial}{\partial z}\sum_p \rho_p V_{sp} x_{c,p} + \sum_p \rho_p x_{c,p} q_p = 0, \qquad c = 1,\dots,n_c, \qquad (4.1) \]
where ρp is the density of phase p, αp is the in-situ phase fraction of phase p, xc,p is
the mole fraction of component c in phase p, Vsp is the superficial velocity of phase
p, and qp is the inflow (per unit volume) of phase p to the segment. The time and
spatial derivative terms account for the mass accumulation and convective mass flux,
respectively. The last term, $\sum_p \rho_p x_{c,p} q_p$, represents the mass source/sink through the
wellbore. The corresponding discretized equations are:
\[ V_i\left[\left(\sum_p \rho_p \alpha_p x_{c,p}\right)_i^{n+1} - \left(\sum_p \rho_p \alpha_p x_{c,p}\right)_i^{n}\right] - \Delta t \sum_{j \in nbr(i)}\left(\sum_p \rho_p Q_p x_{c,p}\right)_{(i,j)} + \Delta t \left(\sum_p \rho_p x_{c,p} Q_p\right)_i = 0, \qquad c = 1,\dots,n_c, \qquad (4.2) \]
where $V_i$ is the volume of node $i$, $nbr(i)$ is the set of all neighboring nodes
of $i$, $(Q_p)_{(i,j)} = (A \cdot V_{sp})_{(i,j)}$ is the phase flow rate through connection $(i,j)$, and
$(Q_p)_i = (V \cdot q_p)_i$ is the volumetric inflow of phase $p$ to node $i$.
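Evaluated for a single component and node, Eq. (4.2) amounts to the following sketch (a plain function with illustrative inputs, not the AD residual assembly of AD-GPRS):

```python
# Schematic evaluation of the discretized mass balance, Eq. (4.2),
# for one component c at one node i.
def mass_residual(V_i, dt, acc_new, acc_old, flux_nbrs, source):
    """V_i: node volume; dt: time step;
    acc_new/acc_old: sum_p rho_p*alpha_p*x_cp at t^{n+1} and t^n;
    flux_nbrs: list of sum_p rho_p*Q_p*x_cp over connections (i, j);
    source: sum_p rho_p*x_cp*Q_p at node i (inflow through the wellbore)."""
    return V_i * (acc_new - acc_old) - dt * sum(flux_nbrs) + dt * source
```

Newton's method drives this residual (and its energy counterpart, Eq. (4.5)) to zero for every component and node simultaneously.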
For a thermal formulation, the energy balance equation [46,47,72] is:
\[ \frac{\partial}{\partial t}\sum_p \rho_p \alpha_p \left(U_p + \tfrac{1}{2}V_p^2\right) + \frac{\partial}{\partial z}\sum_p \rho_p \alpha_p V_p \left(H_p + \tfrac{1}{2}V_p^2\right) = \sum_p \rho_p \alpha_p V_p \bar{g} - Q_{loss} + \sum_p \rho_p H_p q_p, \qquad (4.3) \]
where $U_p$ and $H_p$ are the internal energy and enthalpy of phase $p$, $V_p$ is the interstitial
velocity of phase $p$ ($V_p = V_{sp}/\alpha_p$), and $\bar{g} = g\cos\theta$ is the gravitational component along
the well, in which $\theta$ is the inclination angle of the segment from vertical. The time
derivative term accounts for the energy accumulation, whereas the spatial derivative
term is for the convective energy flux. The first term on the right-hand side,
$\sum_p \rho_p \alpha_p V_p \bar{g}$, represents the rate of work done on the fluid by gravitational forces [47].
$Q_{loss}$ is the heat loss from the wellbore fluid to the surroundings, and it is calculated from
the formula:
\[ Q_{loss} = -2\pi r_w U_{to}(T_w - T), \qquad (4.4) \]
where $U_{to}$ is the overall heat transfer coefficient, $r_w$ is the wellbore radius, and $T_w$ is the
temperature of the wellbore fluids. The last term in Eq. (4.3), $\sum_p \rho_p H_p q_p$, represents the
energy source/sink through the wellbore. The corresponding discretized equation is:
\[ V_i\left[\left(\sum_p \rho_p \alpha_p \left(U_p + \tfrac{1}{2}V_p^2\right)\right)_i^{n+1} - \left(\sum_p \rho_p \alpha_p \left(U_p + \tfrac{1}{2}V_p^2\right)\right)_i^{n}\right] + \Delta t \sum_{j \in nbr(i)}\left(\sum_p \rho_p Q_p \left(H_p + \tfrac{1}{2}V_p^2\right)\right)_{(i,j)} = \Delta t \left(V \sum_p \rho_p V_{sp} \bar{g} - V Q_{loss} + \sum_p \rho_p H_p Q_p\right)_i, \qquad (4.5) \]
where $(V_{sp})_i$ is set equal to $(V_{sp})_{(i,j)}$ such that $i$ is an upstream node of $j$ with respect
to the mixture flow rate. This treatment is used because all velocity- (or flow-rate-)
related properties, including $V_{sp}$, are defined on connections instead of nodes, and
we have to use the connection-based counterparts to evaluate the superficial phase
velocities at the node (segment center).
In order to close the system composed of (4.1) and (4.3), or the corresponding
discretized equations (4.2) and (4.5), additional equations are needed, as discussed
below. For compositional fluids, the thermodynamic equilibrium equations for the
hydrocarbon components are:
\[ f_{c,p}(P, T, x_{c,p}) - f_{c,q}(P, T, x_{c,q}) = 0, \qquad c = 1,\dots,n_c, \quad 1 \le p \ne q \le n_p, \qquad (4.6) \]
where $f_{c,p}(P, T, x_{c,p})$ is the fugacity of component $c$ in phase $p$. Correspondingly, simple
PVT-based equilibrium equations are solved for black-oil fluids. The linear constraints for
the component mole fractions in a phase and for the phase holdups are:
\[ \sum_{c=1}^{n_c} x_{c,p} = 1, \qquad p = 1,\dots,n_p, \qquad (4.7) \]
\[ \sum_{p=1}^{n_p} \alpha_p = 1. \qquad (4.8) \]
Similar to the set of reservoir equations, these linear constraints are sometimes
eliminated at the nonlinear level, i.e., having one xc,p (currently the first one) in each
phase depend on the other xc,p’s in that phase, and having one αp (currently the
last one) depend on the other αp’s. These linear constraints are applicable for the
natural-variables formulation. Corresponding constraints can be defined for other
variable formulations.
AD-GPRS currently supports two approaches for modeling phase fractions inside
the wellbore: 1) the homogeneous model (no slip), and 2) the drift-flux model. These
models are used to compute the phase flow rates ($Q_p$'s). The homogeneous model
assumes equal phase velocities, i.e., no slip between phases. That is,
Vp = Vm, p = 1, . . . , np. (4.9)
Using the relationship between Vsp and Vp, and multiplying both sides by A (the
cross-sectional area), we can obtain the following equation for phase flow rates with
the assumption of homogeneous flux:
Qp = αpQm, p = 1, . . . , np. (4.10)
The drift-flux model uses a more complicated strategy for the computation of Qp and
is described in Section 4.4. More sophisticated mechanics models and those based on
multidimensional tabular data can be accommodated as described in the first part of
Section 4.6.1.
The following equation describing the pressure relation is solved for each nonexit
connection between two neighboring nodes:
\[ \Delta P^w = \Delta P^w_h + \Delta P^w_f + \Delta P^w_a. \qquad (4.11) \]
Here, $\Delta P^w$ is the pressure drop between two nodes, and $\Delta P^w_h$, $\Delta P^w_f$, and $\Delta P^w_a$ are
the hydrostatic, frictional, and acceleration components of the pressure drop, respec-
tively. In AD-GPRS, the user may choose to include one, two, or all three of these
components. The hydrostatic pressure difference between two nodes is:
\[ \Delta P^w_h = \gamma_m g \Delta D, \qquad (4.12) \]
where $\Delta D$ is the depth difference between the two nodes, and $\gamma_m = \sum_p \alpha_p \gamma_p$ is the
mixture mass density, in which $\gamma_p$ is the mass density of phase $p$.
The frictional pressure difference between two nodes is:
\[ \Delta P^w_f = \frac{2 f_{tp}\, \gamma_m V_m |V_m|}{D_H} \Delta z, \qquad (4.13) \]
where $f_{tp}$ is the Fanning friction factor, $D_H$ is the hydraulic diameter of a segment,
$V_m$ is the mixture velocity, and $\Delta z$ is the length between the two nodes. $f_{tp}$ is a function
of the dimensionless Reynolds number $Re$ (the ratio of inertial to viscous forces),
the roughness height $\varepsilon$, and the hydraulic diameter $D_H$, as described in [35]:
\[ f_{tp} = \begin{cases} 16/Re & \text{if } Re \le 2000 \text{ (laminar flow)} \\[4pt] 1\Big/\left(-3.6\log_{10}\left(\dfrac{6.9}{Re} + \left(\dfrac{\varepsilon}{3.7 D_H}\right)^{10/9}\right)\right)^{2} & \text{if } Re \ge 4000 \text{ (turbulent flow)} \\[4pt] 16/2000 + k_f (Re - 2000) & \text{if } 2000 < Re < 4000 \text{ (intermediate)} \end{cases} \qquad (4.14) \]
where $k_f$ is the slope for the linear interpolation of $f_{tp}$ over the intermediate range
($2000 < Re < 4000$), given by:
\[ k_f = \frac{f_{tp}(Re = 4000) - f_{tp}(Re = 2000)}{4000 - 2000}. \qquad (4.15) \]
The Reynolds number $Re$ is calculated as:
\[ Re = \frac{\gamma_m Q_m D_H}{\mu_m A}, \qquad (4.16) \]
where $A$ is the cross-sectional area of a segment and $\mu_m = \sum_p \alpha_p \mu_p$ is the mixture
viscosity, in which $\mu_p$ is the viscosity of phase $p$.
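Eqs. (4.14)-(4.16) translate into the following sketch (units are assumed consistent; the turbulent branch is the Haaland-type correlation quoted above):

```python
import math

# Fanning friction factor of Eq. (4.14); eps and D_H in consistent units.
def fanning(Re, eps, D_H):
    def turbulent(Re):
        return 1.0 / (-3.6 * math.log10(
            6.9 / Re + (eps / (3.7 * D_H)) ** (10.0 / 9.0))) ** 2

    if Re <= 2000.0:                      # laminar flow
        return 16.0 / Re
    if Re >= 4000.0:                      # turbulent flow
        return turbulent(Re)
    # intermediate regime: linear interpolation between the two endpoints,
    # with the slope k_f of Eq. (4.15)
    k_f = (turbulent(4000.0) - 16.0 / 2000.0) / (4000.0 - 2000.0)
    return 16.0 / 2000.0 + k_f * (Re - 2000.0)
```

By construction the interpolation matches the laminar value at Re = 2000 and the turbulent value at Re = 4000, so $f_{tp}$ is continuous across the regime boundaries.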
The pressure-drop component due to acceleration is given by the following formula:
\[ \Delta P^w_a = \frac{2\, m_{in} V_m}{A}, \qquad (4.17) \]
where $m_{in} = \sum_p \gamma_p Q_p$ is the mass flow rate of the mixture entering the segment.
For the exit connection (blue arrow in Figure 4.2) that connects node $e$ with the
outside region, we may denote it as $(e, exit)$. The following equation is solved on this
connection when a constant pressure constraint is applied:
\[ P^w_e = P^{target}, \qquad (4.18) \]
where $P^{target}$ is the specified target for the pressure of node $e$.
Correspondingly, the following equation is solved on this connection when a constant
rate constraint is applied:
\[ \frac{\nu^{sc}_j}{\rho^{sc}_j} \cdot \left(\sum_c \sum_p \rho_p Q_p x_{c,p}\right)_{(e,exit)} = Q^{target}_j, \qquad (4.19) \]
where $j$ is the phase whose rate is controlled and $sc$ denotes surface conditions.
$\nu^{sc}_j$ and $\rho^{sc}_j$ are the mole fraction and density of phase $j$ obtained through the well
flash at surface conditions. $Q^{target}_j$ is the specified target for the volumetric rate of
phase $j$ at surface conditions.
4.4 Drift-Flux Model
The drift-flux model was first proposed by Zuber and Findlay [100] for vertical two-
phase bubbly flow. In AD-GPRS, we use the extension described in [73,74] to model
the slip in: 1) two-phase liquid-gas flow [74], 2) two-phase oil-water flow [74], and
3) three-phase gas-oil-water flow as a combination of the two-phase liquid-gas and
oil-water flow models [73].
4.4.1 Liquid-gas model
The drift-flux model for liquid-gas flow is based on the assumption that gas bubbles
tend to flow through the central portion of the pipe, where the local mixture velocity
is greater than the cross-sectional average velocity [74]. In addition, the density
difference between the gas and liquid phases gives rise to a drift between phases. The
relationship can be expressed as:
\[ V_g = \frac{V_{sg}}{\alpha_g} = C_0 V_m + V_d, \qquad (4.20) \]
where $V_g$ is the gas velocity, $V_{sg}$ is the superficial gas velocity, $V_d$ is the terminal rise
velocity, and $C_0$ is a flow profile parameter given by:
\[ C_0 = \frac{A}{1 + (A - 1)\left(\dfrac{\alpha_g - B}{1 - B}\right)^2}, \qquad (4.21) \]
where the parameters $A$ and $B$ are found from experimental data. A typical value of
$A$ is 1.0 for circular pipes. In that case, $B$ does not affect the model, and
we have $C_0 = 1.0$ over the entire range of $\alpha_g$ [73, 74].
The terminal rise velocity $V_d$ is calculated from:
\[ V_d = \frac{(1 - \alpha_g C_0)\, C_0\, K(\alpha_g)\, V_c}{\alpha_g C_0 \sqrt{\gamma_g/\gamma_l} + 1 - \alpha_g C_0}\, m(\theta), \qquad (4.22) \]
where $V_c$ is the characteristic velocity, $m(\theta)$ is a scaling parameter for inclined pipes,
and $K(\alpha_g)$ is given by:
\[ K(\alpha_g) = \begin{cases} 1.53/C_0 & \text{if } \alpha_g \le a_1 \\[2pt] K_u(\hat{D}_H) & \text{if } \alpha_g \ge a_2 \\[2pt] 1.53/C_0 + \dfrac{K_u(\hat{D}_H) - 1.53/C_0}{a_2 - a_1}(\alpha_g - a_1) & \text{if } a_1 < \alpha_g < a_2 \end{cases} \qquad (4.23) \]
where $a_1 = 0.06$ and $a_2 = 0.21$ for circular pipes. $K_u(\hat{D}_H)$ denotes the critical
Kutateladze number, which is a function of the dimensionless diameter of the pipe:
\[ \hat{D}_H = \left(\frac{g(\gamma_l - \gamma_g)}{\sigma_{gl}}\right)^{1/2} D_H, \qquad (4.24) \]
as described in [74]. Here $\sigma_{gl}$ is the gas-liquid interfacial tension and $D_H$ is the
hydraulic diameter of the pipe.
The characteristic velocity $V_c$ and scaling parameter $m(\theta)$ are given by:
\[ V_c = \left(\frac{\sigma_{gl}\, g\, (\gamma_l - \gamma_g)}{\gamma_l^2}\right)^{1/4}, \qquad (4.25) \]
\[ m(\theta) = m_0 (\cos\theta)^{n_1} (1 + \sin\theta)^{n_2}. \qquad (4.26) \]
For circular pipes the following parameters have been proposed: $m_0 = 1.85$, $n_1 = 0.21$,
and $n_2 = 0.95$ [73, 74]. Note that if $\theta = 90^\circ$ (the pipe is horizontal), we have $m(\theta) = 0$,
and thus $V_d = 0$. If, in addition, $A = 1.0$ in Eq. (4.21), we have
$V_g = V_m$, and the model degenerates to the homogeneous flux model.
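The liquid-gas relations, Eqs. (4.20)-(4.26), can be collected into one sketch (circular-pipe defaults from the text; the critical Kutateladze number `Ku` is passed in as a plain number rather than evaluated from Eq. (4.24), and θ is assumed between 0 and 90° measured from vertical):

```python
import math

# Gas velocity from the liquid-gas drift-flux model, Eqs. (4.20)-(4.26).
# Defaults: A = 1.0, a1 = 0.06, a2 = 0.21 and m0 = 1.85, n1 = 0.21,
# n2 = 0.95 for circular pipes; B is irrelevant when A = 1.0.
def gas_velocity(alpha_g, V_m, gamma_g, gamma_l, sigma_gl, theta, Ku,
                 A=1.0, B=0.3, a1=0.06, a2=0.21, g=9.81):
    C0 = A / (1.0 + (A - 1.0) * ((alpha_g - B) / (1.0 - B)) ** 2)      # (4.21)
    Vc = (sigma_gl * g * (gamma_l - gamma_g) / gamma_l ** 2) ** 0.25   # (4.25)
    m = 1.85 * math.cos(theta) ** 0.21 * (1.0 + math.sin(theta)) ** 0.95  # (4.26)
    if alpha_g <= a1:                                                  # (4.23)
        K = 1.53 / C0
    elif alpha_g >= a2:
        K = Ku
    else:
        K = 1.53 / C0 + (Ku - 1.53 / C0) / (a2 - a1) * (alpha_g - a1)
    Vd = ((1.0 - alpha_g * C0) * C0 * K * Vc
          / (alpha_g * C0 * math.sqrt(gamma_g / gamma_l)
             + 1.0 - alpha_g * C0)) * m                                # (4.22)
    return C0 * V_m + Vd                                               # (4.20)
```

For a horizontal pipe (θ = 90°) the inclination scaling m(θ) vanishes, so the returned velocity reduces to the homogeneous value C0*Vm, as noted in the text.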
4.4.2 Oil-water model
The drift-flux model for oil-water flow [74] is quite similar to that for liquid-gas flow
described above. The oil velocity $V_o$ is computed as:
\[ V_o = \frac{V_{so}}{\alpha_o} = C_0' V_l + V_d', \qquad (4.27) \]
where $V_{so}$ is the superficial oil velocity, $V_l$ is the (mixed) liquid
velocity, and $C_0'$ is a continuous function of the oil volume fraction $\alpha_o$:
\[ C_0' = \begin{cases} A' & \text{if } \alpha_o \le B_1', \\[2pt] 1 & \text{if } \alpha_o \ge B_2', \\[2pt] A' - (A' - 1)\left(\dfrac{\alpha_o - B_1'}{B_2' - B_1'}\right) & \text{if } B_1' < \alpha_o < B_2'. \end{cases} \qquad (4.28) \]
For circular pipes, $A'$ has a typical value of 1.0. In that case, the parameters
$B_1'$ and $B_2'$ are not relevant, and we always have $C_0' = 1.0$.
The terminal rise velocity $V_d'$ is given by:
\[ V_d' = 1.53\, V_c' (1 - \alpha_o)^{n'} m'(\theta), \qquad (4.29) \]
where $n' = 1$ for circular pipes, and the characteristic velocity $V_c'$ is given by:
\[ V_c' = \left(\frac{\sigma_{ow}\, g\, (\gamma_w - \gamma_o)}{\gamma_w^2}\right)^{1/4}. \qquad (4.30) \]
The interfacial tensions $\sigma_{gw}$ (gas-water) and $\sigma_{go}$ (gas-oil) are calculated using the
published correlations in [10]. Then, $\sigma_{ow}$ can be calculated as:
\[ \sigma_{ow} = |\sigma_{go} - \sigma_{gw}|. \qquad (4.31) \]
The scaling parameter $m'(\theta)$ is given by:
\[ m'(\theta) = \begin{cases} n_1'\cos\theta + n_2'\sin^2\theta + n_3'\sin^3\theta & \text{if } \theta \le 88^\circ, \\[2pt] \sqrt{|\cos\theta|}\,(1 + \sin\theta)^2 & \text{if } \theta > 88^\circ, \end{cases} \qquad (4.32) \]
where $n_1' = 1.07$, $n_2' = 3.23$, and $n_3' = 2.32$ for circular pipes. Again, when $\theta = 90^\circ$,
we have $m'(\theta) = 0$, and thus $V_d' = 0$. If we also have $A' = 1.0$ in Eq. (4.28), then
$V_o = V_l$, and the model degenerates to the homogeneous flux model.
4.4.3 Gas-oil-water model
The three-phase (gas-oil-water) model works as a combination of the two-phase liquid-
gas and oil-water models [73]. First, the oil and water phases are considered
to be a mixed liquid phase. Both the gas-liquid interfacial tension and the mass
density of the liquid phase are estimated using the volume-weighted averages of
the corresponding oil-phase and water-phase properties:
\[ \sigma_{gl} = \frac{\alpha_o \sigma_{go} + \alpha_w \sigma_{gw}}{\alpha_o + \alpha_w}, \qquad (4.33) \]
\[ \gamma_l = \frac{\alpha_o \gamma_o + \alpha_w \gamma_w}{\alpha_o + \alpha_w}. \qquad (4.34) \]
Using the above properties, the liquid-gas model can be used to compute the superficial
velocity of the gas phase (derived from Eq. (4.20)):
\[ V_{sg} = \alpha_g C_0 V_m + \alpha_g V_d. \qquad (4.35) \]
Correspondingly, the superficial velocity of the liquid phase can be calculated as:
\[ V_{sl} = V_m - V_{sg}. \qquad (4.36) \]
Afterwards, we may apply the oil-water model inside the mixed liquid phase, with
αo (the volume fraction of the oil phase in all three phases) in Eq. (4.28) and (4.29)
replaced by αol (the volume fraction of the oil phase inside the mixed liquid phase),
which is given by:
\alpha_{ol} = \frac{\alpha_o}{\alpha_o + \alpha_w}.   (4.37)
The superficial velocity of the oil phase can be obtained as (derived from Eq. (4.27)):

V_{so} = \alpha_{ol} C'_0 V_{sl} + \alpha_o V'_d\, m_g(\theta, \alpha_g),   (4.38)

where m_g(\theta, \alpha_g) is an additional scaling parameter to account for the impact of gas on the oil-water slip, as described in [73]. Then the corresponding superficial velocity of the water phase can be calculated as:

V_{sw} = V_{sl} - V_{so}.   (4.39)
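The splitting sequence of Eqs. (4.35)-(4.39) can be sketched as follows. This is a hedged sketch: the drift-flux closures C0, Vd, C′0, V′d, and mg are taken as precomputed inputs, and the function name is ours.

```python
def three_phase_superficial(alpha_g, alpha_o, alpha_w, Vm,
                            C0, Vd, C0_ow, Vd_ow, mg=1.0):
    """Split the mixture velocity Vm into phase superficial velocities,
    Eqs. (4.35)-(4.39). C0 and Vd are the gas-liquid profile parameter
    and drift velocity evaluated with the mixed-liquid properties of
    Eqs. (4.33)-(4.34); C0_ow and Vd_ow are their oil-water
    counterparts; mg approximates m_g(theta, alpha_g) from [73]."""
    Vsg = alpha_g * C0 * Vm + alpha_g * Vd               # Eq. (4.35)
    Vsl = Vm - Vsg                                       # Eq. (4.36)
    alpha_ol = alpha_o / (alpha_o + alpha_w)             # Eq. (4.37)
    Vso = alpha_ol * C0_ow * Vsl + alpha_o * Vd_ow * mg  # Eq. (4.38)
    Vsw = Vsl - Vso                                      # Eq. (4.39)
    return Vsg, Vso, Vsw
```

In the no-slip limit (Vd = Vd_ow = 0 and C0 = C0_ow = 1), each phase simply carries its volume fraction of Vm, consistent with the homogeneous flux model.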
4.5 Extensions of the AD Simulation Framework
In order to implement the general MS well model under the AD-based simulation
framework, some extensions are needed.
4.5.1 Global variable set
The first extension is related to the AD variable set. As not all the independent
variables in the general MS well model are defined on the nodes (some are defined
on the connections), a new global variable set containing multiple subsets has been
established. The following subsets are currently included in the global set:
• The original variable set defined for each node (control volume), which includes
all node-based independent variables (e.g., P , Sp, and xc,p for the natural-
variables formulation) and dependent variables (e.g., ρp, λp, and γp). All reser-
voir variables and part of the general MS well variables (those defined on each
node - yellow circles in Figure 4.2) are currently contained in this subset.
• A new variable subset defined for each connection. It is introduced because
many properties are naturally defined on connections (faces). This is not only
true for the general MS well model in the facilities part (for those variables
defined on each connection - red links and blue arrows in Figure 4.2), but also
valid for the reservoir part, where connection-based variables are useful in the
dual-grid approach for geomechanics modeling.
Other subsets can be added in the future as necessary. All subsets have the same
interfaces as the original node-based variable set, such that any variable can be fetched
easily. On the other hand, global operations such as backup, restoration, and variable-
switching are managed by the global set in a consistent manner.
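The subset arrangement above can be sketched as a small container class. Class and method names here are hypothetical, not the AD-GPRS interface.

```python
class GlobalVariableSet:
    """Sketch of a global AD variable set with node-based and
    connection-based subsets behind one interface."""
    def __init__(self):
        self.subsets = {"node": {}, "connection": {}}
        self._backup = None

    def set(self, subset, key, value):
        self.subsets[subset][key] = value

    def get(self, subset, key):
        return self.subsets[subset][key]

    def backup(self):
        # global operations are managed consistently over all subsets
        self._backup = {s: dict(v) for s, v in self.subsets.items()}

    def restore(self):
        self.subsets = {s: dict(v) for s, v in self._backup.items()}
```

A backup taken before a Newton update can then be restored for all subsets at once, which mirrors the consistent global management described above.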
4.5.2 Linear system
The second extension is related to the linear system. In order to achieve a generic
design, the implementation of various facility models, including the general MS well
model, is decomposed into nonlinear and linear levels. We have already described its
mathematical formulation, which corresponds to the implementation on the nonlinear
level, in Sections 4.2 and 4.3. What is left is the support for the linear level. Because
AD-GPRS is designed to support multiple linear systems and multiple facility models,
an approach using so-called “handlers” is introduced to mix and match the facility
models and linear systems. For example, we can have the following handlers for the
current set of facility models (standard well and general MS well) and linear systems
(CSR linear system and block-sparse linear system):
• Standard well and CSR linear system (trivial)
• General MS well and CSR linear system (trivial)
• Standard well and block-sparse linear system
• General MS well and block-sparse linear system
In the future, more handlers can be created when a new linear system, or facility
model, is introduced. Each facility-matrix handler as defined above is responsible for
the following tasks:
• Creation of the submatrices for the facility model inside the global Jacobian
matrix. The submatrices, which may have their own types, are created using
the structural information of the facility model and then added to the global
Jacobian matrix.
• Preparation of the preconditioning data for the facility model. The precondi-
tioning data are generated from the extracted nonzero entries in the submatrices
as well as other necessary information of the facility model, and are currently used in the assembly of the reduced pressure system for the first stage of the CPR preconditioner.
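The mix-and-match of facility models and linear systems can be sketched as a handler registry keyed by the (facility model, linear system) pair. The names below are illustrative, not the AD-GPRS API.

```python
class FacilityMatrixHandler:
    """Base handler pairing one facility model with one linear system."""
    def create_submatrices(self, global_jacobian, facility):
        ...  # allocate facility submatrices inside the global Jacobian
    def prepare_preconditioning(self, facility):
        ...  # extract data for the reduced pressure system (CPR stage 1)

_HANDLERS = {}

def register(facility_type, system_type, handler_cls):
    """Register a handler class for a (facility model, linear system) pair."""
    _HANDLERS[(facility_type, system_type)] = handler_cls

def handler_for(facility_type, system_type):
    """Instantiate the handler registered for the given pair."""
    return _HANDLERS[(facility_type, system_type)]()
```

Adding a new linear system or facility model then amounts to registering one more handler class, without touching the existing ones.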
Now, we discuss the extensions needed for the multilevel block-sparse (MLBS)
linear system, which has been described in detail in Chapter 3. Here we recap some
of its important features. The hierarchical structure of this linear system is shown
in Figure 4.3. On the first level, the entire system matrix is considered as a single
object, which is then divided into four parts, JRR, JRF , JFR, and JFF , on the sec-
ond level. Here, the first subscript represents the equations and the second subscript
represents the variables. In addition, R stands for the reservoir and F stands for
facilities. The third-level submatrices (e.g., JRW,1, JRW,2 as shown in Figure 4.3(c),
Figure 4.3: Multilevel block-sparse linear system (modified from [38]): (a) 1st level, the entire system matrix (nz = 74960); (b) 2nd level, the four parts JRR, JRF , JFR, and JFF ; (c) 3rd level, individual facility submatrices such as JRW,1 and JRW,2.
where W stands for a well, i.e., an individual facility model) inside JRF , JFR, and JFF correspond to individual facility models. The types of these submatrices
need to be customized such that the matrix storage and computation are effective
for different facility models, including the standard well model and the general MS
well model. For the specific submatrix types of each facility model, the common
matrix operations, including extraction from the AD residual vector, algebraic re-
duction, explicit updating, and Sparse Matrix-Vector multiplication (SpMV), need
to be properly defined. Then, an advanced linear preconditioning strategy is needed
to handle the contributions from the general MS well model when solving the fully
coupled reservoir-facilities matrix system.
4.5.3 Jacobian for the general MS well model
The Jacobian matrix (JWW , as a submatrix in JFF ) of the general MS well model
is composed of four parts (see Figure 4.4):
• JNN : Derivatives of node equations with respect to node variables. This sub-
matrix is block sparse, with the full size of each block equal to (NPri + NSec) × (NPri + NSec), where NPri is the number of primary variables and NSec is the
number of secondary variables. For each connection (i, j), there are two corre-
sponding off-diagonal nonzero blocks, one on row i, column j, and the other on
row j, column i. The structure of this submatrix is quite similar to that of JRR
on the second level of the MLBS linear system, when the reservoir is discretized
using the TPFA scheme.
• JNC : Derivatives of node equations with respect to connection variables. This
submatrix is block sparse, with the full size of each block equal to (NPri +
NSec) × 1. For each connection (i, j) with a connection index k (i.e., it is the
k’th connection), there are two corresponding nonzero blocks, one on row i,
column k, and the other on row j, column k.
• JCN : Derivatives of connection equations with respect to node variables. This
submatrix is block sparse, with the full size of each block equal to 1× (NPri +
NSec). For each connection (i, j) with a connection index k, there are two
corresponding nonzero blocks, one on row k, column i, and the other on row k,
column j. This submatrix has a transposed structure from JNC .
• JCC : Derivatives of connection equations with respect to connection variables.
This submatrix is pointwise diagonal, because there is only one independent
variable (Qm) defined for each connection and there is no direct coupling be-
tween different connections.
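The four sparsity patterns described above follow directly from the connection list; a minimal sketch (structure only, hypothetical function name):

```python
def msw_jacobian_pattern(n_nodes, connections):
    """Nonzero block positions of JNN, JNC, JCN, JCC for a general MS
    well, given its list of (i, j) node connections."""
    JNN = {(i, i) for i in range(n_nodes)}       # diagonal blocks
    JNC, JCN = set(), set()
    for k, (i, j) in enumerate(connections):
        JNN |= {(i, j), (j, i)}                  # node-node coupling
        JNC |= {(i, k), (j, k)}                  # node eqs w.r.t. Q_m on conn k
        JCN |= {(k, i), (k, j)}                  # conn eqs w.r.t. node vars
    JCC = {(k, k) for k in range(len(connections))}  # pointwise diagonal
    return JNN, JNC, JCN, JCC
```

For a three-node chain with connections (0, 1) and (1, 2), JNN gets the TPFA-like tridiagonal block pattern, and JCN is the transpose pattern of JNC, as stated above.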
Due to the structural similarity between the JWW matrix of the general MS well
and the global Jacobian matrix, which is also composed of four parts (JRR, JRF ,
JFR, and JFF ), the same matrix wrapper is used for both matrices, although the
specific types used for the four parts may be different. Besides the above (fourth-
level) submatrices inside JWW , there are submatrices that represent the coupling
between the reservoir and the general MS well: JRW (as a submatrix in JRF ) and
JWR (as a submatrix in JFR). JRW can be further divided into two parts: JRN
and JRC , which correspond to derivatives of reservoir equations with respect to well
node and well connection variables respectively. With the current physical model,
we always have JRC = 0. Similarly, JWR is also composed of two parts: JNR and
JCR, which contain the derivatives of well node and well connection equations with
respect to reservoir variables, respectively. JCR ≠ 0 if and only if the acceleration-related pressure drop term is included in the pressure-relation equations defined on well connections.
Figure 4.4: Jacobian matrix structure of the general MS well model, showing the four parts JNN , JNC , JCN , and JCC
4.6 Well Initialization, Calculation, and Variable
Updating
4.6.1 Well initialization
The initialization of a general MS well includes three major parts: creation of node
(segment) and connection objects, initialization of static properties, and initialization
of dynamic properties.
Creation of node (segment) and connection objects
In the general MS well model, each node is a separate object. Nodes share the same
base type but can have different inherited types. Common interfaces have been de-
signed for the node objects, e.g., properties calculation, adding local contributions
to the residual equations, updating of node-based variables, and so on. The imple-
mentation details are left to the developers. This explains how special nodes can be
handled within the framework, i.e., we may construct new segment types (e.g., sepa-
rator or valve segment) by deriving from one of the existing segment types (e.g., the
base type) and customizing the member variables, as well as the underlying computational processes inside the member functions. Then, we are able to create some of
the nodes using these special segment types while keeping other nodes as the ordinary
segments.
The same situation applies to connections. That is, each connection is also a
separate object that shares the same base type. The computation of phase flow rates
(Qp) and various types of pressure drops can be customized in the derived connection
types. This is the mechanism to handle different physical models (e.g., homogeneous
flux and drift flux) on connections. For example, when the drift-flux model is applied
to a general MS well, the exit connections and those connections with 1) no depth
difference between the two neighboring nodes and 2) constant flow-profile parameters
C0 = C ′0 = 1 (Eq. (4.21) and (4.28)) will still be created as the “homogeneous”
connections, whereas the rest of the connections will be created as the “drift-flux”
connections. As a result, distinct computations of Qp’s will be performed for the two
types of connections, although they are in the same MS well and the calling con-
ventions to their member functions are completely the same. In this way, additional
segment types, or physical models that are based on more accurate mechanics prin-
ciples or tabulated experimental data, can be introduced into the general MS well
model in a consistent manner without altering the code structure.
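The selection rule above can be sketched with a base connection type and a factory. Class names are illustrative, and the drift-flux closure itself is stubbed out.

```python
class Connection:
    """Base connection type; derived types customize the flow-rate
    and pressure-drop computations."""
    def phase_rates(self, Qm, props):
        raise NotImplementedError

class HomogeneousConnection(Connection):
    def phase_rates(self, Qm, props):
        # no slip: each phase moves with the mixture
        return {p: a * Qm for p, a in props["holdup"].items()}

class DriftFluxConnection(Connection):
    def phase_rates(self, Qm, props):
        # the drift-flux closure (Section 4.4) would be evaluated here
        raise NotImplementedError

def make_connection(is_exit, depth_diff, C0, C0_ow):
    """Factory mirroring the selection rule in the text: exit
    connections, and connections with no depth difference and constant
    flow-profile parameters, stay homogeneous."""
    if is_exit or (depth_diff == 0.0 and C0 == 1.0 and C0_ow == 1.0):
        return HomogeneousConnection()
    return DriftFluxConnection()
```

Both connection types are called through the same interface, so the surrounding well code is unaware of which physical model a given connection uses.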
Initialization of static properties
In this part, the properties that do not change during the entire simulation will be ini-
tialized. Currently, this includes the pointers to the first variable of each node and of
each connection, the set of perforations corresponding to each segment (one segment
may have zero, one, or multiple perforations, as shown in Figure 4.5(a), 4.5(b), and
4.5(c), respectively), and the depth difference between each perforation and its corre-
sponding node (perforations may or may not be located at the node center) multiplied
by gravitational acceleration. This procedure will also let each segment initialize its
own static properties. For an ordinary segment, this includes its hydraulic diameter
DH , cross-sectional area A, volume V, and the slope kf (Eq. (4.15)) used in the linear
interpolation of Fanning friction factor ftp for an intermediate Re (Eq. (4.14)).
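The linear interpolation of the Fanning friction factor over the transition region might look as follows. This is a sketch under assumptions: the transition bounds and the laminar (16/Re) and Blasius turbulent branches are ours, since Eqs. (4.14)-(4.15) are given elsewhere in this chapter and may differ.

```python
RE_LAM, RE_TURB = 2000.0, 4000.0   # assumed transition bounds

def fanning_friction(Re):
    """Fanning friction factor with a linear bridge over the
    transition region between assumed laminar and turbulent branches."""
    f_lam_end = 16.0 / RE_LAM
    f_turb_start = 0.0791 * RE_TURB ** -0.25
    if Re <= RE_LAM:
        return 16.0 / Re                       # laminar
    if Re >= RE_TURB:
        return 0.0791 * Re ** -0.25            # turbulent (Blasius, assumed)
    # precomputable slope, analogous to the static k_f of Eq. (4.15)
    k_f = (f_turb_start - f_lam_end) / (RE_TURB - RE_LAM)
    return f_lam_end + k_f * (Re - RE_LAM)     # analogous to Eq. (4.14)
```

Because the slope depends only on the fixed transition bounds, it can be computed once per segment at initialization, which is why k_f is listed among the static properties.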
Initialization of dynamic properties
This part initializes all the independent variables and other properties that change
during the simulation. This is a sophisticated process and contains several steps:
1. Give initial guesses to the base variables, including pressure (Pw), temperature
Figure 4.5: A segment with zero, one, or multiple perforations: (a) a segment without any perforation, (b) a segment with one perforation, and (c) a segment with multiple perforations; perforated and unperforated reservoir cells are distinguished in the figure.
(Tw), and overall mole fractions (zc) of each well node. The independent vari-
ables of any formulation can be converted from these base variables. Among
these variables, the pressure of each node is estimated based on the hydrostatic
pressure difference from the estimated value of wellhead pressure (defined on
the 0th node). That is,
P^w_i = P^w_0 - \gamma^{avg}_m g (D_i - D_0),   (4.40)

where P^w_i is the pressure of node i, D_i is the depth of node i, and \gamma^{avg}_m is the overall mixture mass density computed using the following formula,
\gamma^{avg}_m = \frac{1}{N_{perf}} \sum_{k=0}^{N_{perf}-1} \left( \frac{\sum_p \lambda_p \gamma_p}{\sum_p \lambda_p} \right)^{res}_{perf(k)},   (4.41)
where Nperf is the number of perforations, the superscript res represents a
reservoir property, and the subscript perf(k) is the cell number of the k’th
perforation. Note that we usually use holdups (αp) or phase rates (Qp) as the
weighting factors to calculate the mixture mass density, whereas here we use
phase mobilities (λp) from the perforated reservoir cells. That is because, during
initialization, P and αp in each segment have not yet been determined, and
thus we cannot calculate Qp, which depends on the pressure difference between
the perforated reservoir cell and the corresponding segment. In this regard,
\lambda_p / (\sum_p \lambda_p) provides a rough estimation for Q_p / (\sum_p Q_p) during initialization
only. Once αp’s are initialized and then updated in each Newton iteration during
the simulation, they will be used as the weighting factors for the mixture mass
density of each segment. On the other hand, temperature and overall mole
fractions have distinct estimations for injectors and producers. For injectors,
they are directly assigned from the given injection stream (T inj and zinjc ). For
producers, because perforations, the properties of which we already know from
the reservoir initialization, have no one-to-one correspondence to nodes, we
calculate the average values of temperature and overall mole fractions over all
perforations and use these average values as initial guesses. That is,
T^{w,avg} = \frac{1}{N_{perf}} \sum_{k=0}^{N_{perf}-1} T^{res}_{perf(k)},   (4.42)

z^{avg}_{c^*} = \frac{1}{N_{perf}} \sum_{k=0}^{N_{perf}-1} \left( \frac{\sum_p \lambda_p \rho_p x_{c^*,p}}{\sum_c \sum_p \lambda_p \rho_p x_{c,p}} \right)^{res}_{perf(k)}.   (4.43)
2. For each node, convert the base variables P^w_i, T^w_i, and z_{c,i} to the independent variables of the selected formulation (e.g., natural or molar)
3. For each node, repeat the following substeps for a few iterations (e.g., 3) to
reach hydrostatic equilibrium:
(a) Detect the change of phase status in this node
(b) Calculate the various properties of this node, including the mixture mass density \gamma_{m,i} as the volume-weighted average of phase mass densities: \gamma_{m,i} = \sum_p (\alpha_p \gamma_p)_i
(c) Update the node pressure using the new mixture mass density γm,i:
P^w_i = P^w_0 - \gamma_{m,i}\, g (D_i - D_0).   (4.44)
4. Calculate the inflow/outflow rates at all perforations using the updated inde-
pendent variables of the well
5. Initialize the mixture flow rates (Qm) on all connections. This step includes the
following substeps:
(a) Find the starting nodes (i.e., nodes with only one associated connection)
and push them into the node queue (first in, first out)
(b) Set the status of all Qm’s as “undetermined”
(c) Fetch the first node seg in the node queue and remove it from the queue
(d) Calculate the volumetric influx Qtot into node seg, which includes the
injection/production rate of all phases from all perforations on this node,
and the mixture flow rate from the associated connections with already
determined Qm
(e) Calculate the total cross-sectional area Atot of the associated connections
with undetermined Qm
(f) For each associated connection with undetermined Qm, if its cross-sectional
area is A, its Qm can be determined as:
Q_m = \frac{A}{A_{tot}} Q_{tot}.   (4.45)
In addition, set the status of this Qm as “determined” and push the node
on the other side (i.e., not seg) of this connection into the node queue
(g) Check if the node queue is empty: if yes, the process of determining mixture
flow rates is ended; otherwise, go to step (5c)
Figure 4.6: Initialization of mixture flow rates: (a) before and (b) after processing node seg. Connections with determined and undetermined Qm are distinguished, and nodes 3 and 4 are appended to the node queue after seg is processed.
To better understand the initialization of mixture flow rates, let us consider the
following example. As shown in Figure 4.6(a), we suppose: 1) seg is the first
node in the node queue, 2) Qm,1 and Qm,2 have already been determined, and
3) Qm,3 and Qm,4 are undetermined. In step (5d), we calculate the volumetric
influx Qtot as: Qtot = Qm,1+Qm,2 (Qm,1 and Qm,2 are determined). Then, in step
(5e), we calculate the total cross-sectional area Atot as: Atot = A3 + A4 (Qm,3
and Qm,4 are undetermined). Next, in step (5f), the undetermined mixture flow
rates, Qm,3 and Qm,4, are calculated as:
Q_{m,3} = \frac{A_3}{A_{tot}} Q_{tot}, \qquad Q_{m,4} = \frac{A_4}{A_{tot}} Q_{tot}.   (4.46)
In addition, the statuses of Qm,3 and Qm,4 are now set as determined. Nodes 3 and 4 are pushed to the end of the node queue, as shown in Figure 4.6(b).
This process continues until the node queue becomes empty.
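The queue-based procedure of steps (5a)-(5g) can be sketched as follows. This is a simplified sketch for a producing well: exit nodes are excluded from the starting set, argument names are illustrative, and sign conventions for inflow/outflow are omitted.

```python
from collections import deque

def init_mixture_rates(connections, areas, perf_rate, exit_nodes=()):
    """Initialize Qm on every connection by a FIFO sweep over nodes.
    connections: (i, j) node pairs; areas: connection cross-sections;
    perf_rate: net volumetric influx per node from its perforations."""
    adj = {}
    for k, (i, j) in enumerate(connections):
        adj.setdefault(i, []).append(k)
        adj.setdefault(j, []).append(k)
    Qm = {k: None for k in range(len(connections))}  # None = undetermined
    # step (5a): start from single-connection nodes, excluding exit nodes
    queue = deque(n for n, ks in adj.items()
                  if len(ks) == 1 and n not in exit_nodes)
    while queue:
        seg = queue.popleft()                        # step (5c)
        q_tot = perf_rate.get(seg, 0.0)              # step (5d)
        undetermined = []
        for k in adj[seg]:
            if Qm[k] is None:
                undetermined.append(k)
            else:
                q_tot += Qm[k]   # sign conventions omitted for brevity
        a_tot = sum(areas[k] for k in undetermined)  # step (5e)
        for k in undetermined:                       # step (5f), Eq. (4.45)
            Qm[k] = areas[k] / a_tot * q_tot
            i, j = connections[k]
            queue.append(j if i == seg else i)
    return Qm
```

For a toe-to-head chain with perforation influxes at the two lower segments, the sweep accumulates the rates toward the exit connection, as in the worked example above.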
4.6.2 Well calculation
The calculation sequence of the general MS well model is shown in Figure 4.7. This
procedure is performed during each Newton iteration, after the computation of the
residual equations (and associated Jacobian matrix) for the reservoir part. The high-
lighted steps (in red) are additional steps from the standard well calculation and are
specifically implemented for general MS wells:
• Property calculation for each node, including density (ρp), viscosity (µp), mo-
bility (λp), overall mole fraction (zc), and fugacity (fc,p). This step is essentially
the same as the property calculation of a reservoir node, except that the poros-
ity (φ) does not need to be calculated and is always equal to 1 in the wellbore
and that the volume-weighted average of mass density (γm) and viscosity (µm)
need to be calculated in addition to all other properties. The node-based inde-
pendent variables, with their latest values and gradients, are used in this step.
The values and gradients usually correspond to the updated ones in the last
Newton iteration, or the converged ones in the last timestep during the first
Newton iteration of every new timestep. At the beginning of the simulation,
the values estimated in the initialization process (see Section 4.6.1) and the gra-
dients determined according to the initial independent indices are used. This
also applies to the calculation of connection-based properties.
• Property calculation for each connection, including the phase flow rate (Q_p) and the hydrostatic (\Delta P^w_h), frictional (\Delta P^w_f), and acceleration-related (\Delta P^w_a) pressure drops.
The computation of the phase flow rate depends on the selection of the flux model, whereas the three types of pressure drops depend on the selection of pressure drop
models. Both the connection-based mixture flow rates (Qm) and the node-
based independent variables in the two neighbouring nodes, with their latest
values and gradients, are used in this step. See Section 4.3 for details about the
calculation of connection-based properties.
• Construction of MS well equations on each node (mass balance (4.2), energy
balance (4.5), local constraints (4.6) - (4.8)) and on each connection (pressure
drop relation (4.11)). The detailed forms of these equations are discussed in
Section 4.3. Various properties calculated in the previous steps are used in this
step. With the help of the AD framework, only the nonlinear residual code is needed, while the associated gradients are generated automatically.
Figure 4.7: Calculation sequence of the general MS well model. The flowchart proceeds as follows: for each node, compute density, viscosity, mobility, zc, and fugacity; for each perforation, compute component and phase rates; for each connection, compute the phase flow rate and the frictional and acceleration-related pressure drops; form the MS well residual equations (nodes and connections); compute the total rate and run a flash at surface conditions; update the reservoir residual equations (perforations); if the control needs to be switched, switch to the first viable control and repeat; otherwise, the well calculation is completed.
4.6.3 Updating of the well variables
The variable updating sequence of the general MS well model is shown in Figure
4.8. For each well node (Figure 4.8(a)), the updating sequence is similar to the
corresponding process for a reservoir node:
1. Application of the updates to a temporary set of variables
2. Nonlinear correction (e.g., Appleyard chop [70]) and enforcement of physical
limits
3. Detection of phase status change: if more than one phase appears in the node, we check if the updated holdups, or phase mole fractions, indicate any phase disappearance. Also, if not all phases appear in the node, we perform a phase stability test to check if any new phase should appear
4. Copying the updates back from the temporary set
This process is much more involved than that for a standard well, where only one
variable (BHP) gets updated directly. However, it does not introduce extra complex-
ity into the implementation, because most of the functions needed above already exist
for the update of reservoir variables and can be shared. For each connection (Figure
4.8(b)), the update is directly applied to the mixture flow rate without checking its
range, because the mixture flow rate can be either positive or negative, indicating the
actual flow direction.
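For a node whose updated variables are holdups, the four-step sequence might be sketched as follows. Names are illustrative, the chop factor is an assumption, and the phase-status logic is reduced to a disappearance check.

```python
def update_holdups(holdups, deltas, chop=0.2):
    """Node update sketch: tentative update on a temporary copy, step
    damping (in the spirit of the Appleyard chop [70]), physical
    limits, phase-disappearance check, then copy-back."""
    tmp = dict(holdups)                                    # step 1: temp set
    for phase, d in deltas.items():
        d = max(-chop, min(chop, d))                       # step 2: chop
        tmp[phase] = min(max(holdups[phase] + d, 0.0), 1.0)  # physical limits
    vanished = [p for p, a in tmp.items() if a == 0.0]     # step 3: detection
    holdups.update(tmp)                                    # step 4: copy back
    return holdups, vanished
```

Working on a temporary copy means a rejected or chopped update never leaves the node variables in a half-modified state, which is the point of steps 1 and 4.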
4.7 Multistage Preconditioner
The two-stage CPR (Constrained Pressure Residual) preconditioner, as described in
Section 3.4.5, has been extended to deal with the coupled reservoir-facilities system
with original MS wells and well groups [96,98]. Here we describe its extension for the
coupled system with general MS wells.
Figure 4.8: Variable update sequence of the general MS well model. (a) For each node: apply tentative Newton updates to a temporary set; apply nonlinear correction (e.g., Appleyard chop); enforce physical limits on the variables; if all phases appear, check whether the updated holdups indicate phase disappearance, otherwise perform a phase stability test to check whether any new phase should appear; finally, copy the values from the temporary set to the node variables. (b) For each connection: apply the Newton update directly to the connection variable, the mixture flow rate.
4.7.1 First stage: global on the pressure system
The basic idea for this stage is to first reduce the coupled system with advanced
facility models (e.g., general MS wells) to a system with only standard wells, and then
apply the true IMPES reduction for the reservoir part to get the reduced pressure
system. Thus, the question lies in how the equations and variables for an advanced
facility model are converted into those that resemble a standard well. For general
MS wells, the idea is similar to that for the original MS wells as described in [96,98],
i.e., we algebraically reduce the equations and variables for a general MS well into a
standard-well like equation and a single BHP variable. This approach assumes that
pressure differences among all nodes and mixture flow rates through all connections
are lagged by one iteration. That is, the pressure update for any node i is equal to
that for node 0: \delta P^w_i = \delta P^w_0, and the update of the mixture flow rate of any connection (i, j) is zero in this stage: \delta (Q_m)_{(i,j)} = 0. The detailed treatments depend on the applied
well constraint and are described in the following sections.
Constant-pressure constraint
Without loss of generality, we assume the constraint is defined on the 0th node. In
this case, the general MS well equations are directly reduced to P^w_0 = P^{target}, which has the same form as the pressure control equation of a standard well and guarantees that the pressure at the 0th node equals the given constant value of the well constraint. Typically, P^w_0 = P^{target} should already hold under the constant-pressure constraint. Thus, we will usually have \delta P^w_i = \delta P^w_0 = 0.
Constant-rate constraint
We assume that the constraint is defined at the exit connection that links the 0th
node with the outside region. By summing up the mass conservation equations (4.2)
over all components and all nodes of this general MS well, we will have:
\sum_{i=0}^{N_N-1} V_i \left[ \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n+1} - \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n} \right] - \Delta t \left( \sum_c \sum_p \rho_p Q_p x_{c,p} \right)_{(0,\,exit)} + \Delta t \sum_{i=0}^{N_N-1} \left( \sum_c \sum_p \rho_p x_{c,p} Q_p \right)_i = 0,   (4.47)
where N_N is the number of nodes in this MS well, and (0, exit) represents the exit
connection. Note that 1) all three terms (accumulation, flux, and source/sink) are
summed up over all components, and 2) the accumulation and source/sink terms are
summed up over all nodes, whereas the flux terms through exit connections are the
only ones that are kept in the resultant equation (4.47). This is because, for each
flux term (∑
p ρpQpxc,p)(i,j) that appears in the mass balance equation for component
c in node i, there is a corresponding flux term (∑
p ρpQpxc,p)(j,i) in the mass balance
CHAPTER 4. GENERAL MULTISEGMENT WELL MODEL 124
equation for component c in node j, except when j = exit, i.e., (i, j) is an exit
connection. Because we always have (∑
p ρpQpxc,p)(i,j) = −(∑
p ρpQpxc,p)(j,i), the flux
terms from nonexit connections cancel out in pairs, and those from exit connections
are left and thus included in the resultant equation (4.47).
Recall the rate control equation (4.19): (\nu^{sc}_j / \rho^{sc}_j) \cdot \left( \sum_c \sum_p \rho_p Q_p x_{c,p} \right)_{(0,\,exit)} = Q^{target}_j. Now, if we multiply this equation by \Delta t \cdot (\rho^{sc}_j / \nu^{sc}_j) and add the resultant equation to (4.47), we have
\sum_{i=0}^{N_N-1} V_i \left[ \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n+1} - \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n} \right] + \Delta t \sum_{i=0}^{N_N-1} \left( \sum_c \sum_p \rho_p x_{c,p} Q_p \right)_i = \Delta t \cdot \frac{\rho^{sc}_j}{\nu^{sc}_j} Q^{target}_j.   (4.48)
By applying the assumption that the node pressure differences are lagged by one
iteration, we can sum up the derivatives of this equation with respect to the pressure
at any well node to a single derivative of this equation with respect to a single pressure
variable P^w_0. In addition, we may treat the coefficient \Delta t \cdot (\rho^{sc}_j / \nu^{sc}_j) on the right-hand
side as a constant. Its value may change in every Newton iteration, but we may
neglect its derivatives for preconditioning purposes. As a result, for each general MS
well in the coupled system, we obtain a single equation expressed by (4.48), which
has a form that is similar to a rate-control equation of a standard well.
After the above reduction process, the standard true IMPES reduction may be
applied to obtain a pressure system that can be effectively solved by certain precon-
ditioners (e.g., AMG). The obtained pressure update to the reduced well variable P^w_0
is applied to the pressure of all nodes in this general MS well. The update to all the
other variables (αp, xc,p on each node, Qm on each connection) in the general MS well
is zero in the first stage.
4.7.2 Second stage: local on the overall system
In the second stage, the reservoir part (JRR) and the facilities part (JFF ) are solved
separately. However, due to the coupling matrices JRF and JFR, the right-hand-side
of one part may still be affected by the solution of the other. Because the cost of
solving the facilities part is usually much smaller than that of solving the reservoir
part, we use the following sequence for the second stage:
1. Solve the following linear system for a preliminary facility solution vector xpreF :
JFF · xpreF = bF . (4.49)
Here, the individual facility submatrices JWW,1, ..., JWW,Nw inside JFF are
solved one by one with the corresponding subvectors in xpreF and bF as the
solution and RHS vectors. Nw is the number of facility models in the coupled
system.
2. With x^{pre}_F obtained in the last step, we update the RHS vector of the reservoir part as:

b^{corr}_R = b_R - J_{RF} \cdot x^{pre}_F.   (4.50)

Then, we solve the following linear system for the reservoir solution vector x_R:

J_{RR} \cdot x_R = b^{corr}_R.   (4.51)
3. With x_R obtained in the last step, we update the RHS vector of the facilities part as follows:

b^{corr}_F = b_F - J_{FR} \cdot x_R.   (4.52)

Then, we refine the facility solution x_F by solving the following linear system:

J_{FF} \cdot x_F = b^{corr}_F.   (4.53)

Again, the individual facility submatrices J_{WW,1}, ..., J_{WW,N_w} are solved one by one with the proper solution and RHS vectors. The preliminary facility solution vector x^{pre}_F may be used as an initial guess for the final solution vector x_F during this step.
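The three-step second-stage sweep of Eqs. (4.49)-(4.53) can be sketched as follows. This is a pure-Python sketch on dense lists; solve_R and solve_F stand in for the BILU(0) and per-well GMRES solves, and solve_diag is an illustrative exact solver for diagonal matrices.

```python
def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def solve_diag(A, b):
    """Stand-in exact solver for diagonal matrices (illustration only)."""
    return [b[i] / A[i][i] for i in range(len(b))]

def second_stage(J_RR, J_RF, J_FR, J_FF, b_R, b_F, solve_R, solve_F):
    """Facility pre-solve, corrected reservoir solve, facility refinement."""
    x_F_pre = solve_F(J_FF, b_F)                                    # Eq. (4.49)
    b_R_corr = [b - c for b, c in zip(b_R, matvec(J_RF, x_F_pre))]  # Eq. (4.50)
    x_R = solve_R(J_RR, b_R_corr)                                   # Eq. (4.51)
    b_F_corr = [b - c for b, c in zip(b_F, matvec(J_FR, x_R))]      # Eq. (4.52)
    x_F = solve_F(J_FF, b_F_corr)                                   # Eq. (4.53)
    return x_R, x_F
```

Solving the cheap facilities part first and last amounts to one block Gauss-Seidel-like sweep that accounts for the coupling matrices J_RF and J_FR at little extra cost.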
Using the same treatment as in the original CPR preconditioner, a single sweep of the BILU(0) preconditioner is applied to the reservoir part (J_{RR}). For each general MS well, its submatrix inside the facilities part (J_{FF}) is solved by preconditioned GMRES to a tight tolerance, because it is observed that the accuracy of the facility solution can have a considerable impact on the overall linear convergence rate. Specifically, for the connection part (J_{CC}), the inverse (J_{CC}^{-1}) can be obtained directly, because J_{CC} has a pointwise diagonal structure. Thus, we may perform a Schur complement to obtain the following linear system:

\left( J_{NN} - J_{NC} J_{CC}^{-1} J_{CN} \right) x_N = b_N - J_{NC} J_{CC}^{-1} b_C.   (4.54)
The left-hand-side matrix should have the same structure as JNN , but contains dif-
ferent values. Due to the structural nature of JNN (and hence of the resultant matrix
from Schur-complement process), we may apply a BILU(1) preconditioner to obtain
the solution of the node part, xN . Afterwards, we calculate the solution to the
connection part, x_C, as:

x_C = J_{CC}^{-1} \left( b_C - J_{CN} x_N \right).   (4.55)
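Since J_CC is diagonal, the Schur-complement solve of Eqs. (4.54)-(4.55) can be sketched directly. This is a dense-list sketch with an illustrative function name; solve_node stands in for the BILU(1)-preconditioned node solve.

```python
def solve_well_system(J_NN, J_NC, J_CN, d_CC, b_N, b_C, solve_node):
    """Eliminate the connection part (diagonal entries d_CC), solve the
    node Schur system, then back-substitute for the connection part."""
    nN, nC = len(b_N), len(b_C)
    # Schur complement: S = J_NN - J_NC * d_CC^-1 * J_CN
    S = [[J_NN[i][j] - sum(J_NC[i][k] / d_CC[k] * J_CN[k][j]
                           for k in range(nC)) for j in range(nN)]
         for i in range(nN)]
    rhs = [b_N[i] - sum(J_NC[i][k] / d_CC[k] * b_C[k] for k in range(nC))
           for i in range(nN)]
    x_N = solve_node(S, rhs)                               # Eq. (4.54)
    x_C = [(b_C[k] - sum(J_CN[k][j] * x_N[j] for j in range(nN))) / d_CC[k]
           for k in range(nC)]                             # Eq. (4.55)
    return x_N, x_C
```

With an exact node solve, the recovered pair (x_N, x_C) satisfies both original block rows, which is what makes the diagonal J_CC elimination lossless.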
4.8 Nonlinear Solution: Local Facility Solver
With complex physical models (e.g., drift-flux model and all three types of pressure
drops), general MS well equations become highly nonlinear and converge slower than
the reservoir equations. By fixing the reservoir conditions, a local nonlinear solver that
iterates on the facility part only can be used to accelerate the Newton convergence.
The effectiveness of the local nonlinear solver depends on the following requirements:
• The relative cost per iteration is very low. This condition is valid when the
reservoir model is sufficiently large, e.g., the number of reservoir cells is larger
by at least two orders of magnitude than the total number of well nodes in all
the MS wells. In such cases, the solution cost for the facilities part will be much
smaller than that for the reservoir part.
• The reservoir and facilities parts in the coupled system can be decoupled eas-
ily both on the nonlinear and linear levels. On the nonlinear level, wells are
separate objects, such that the calculation of properties, construction of resid-
ual equations, and updating of the independent variables can be carried out
independently from the reservoir part. On the linear level, when the multilevel
block-sparse linear system structure is used, the facility matrix (JFF ) is an independent component in the global system matrix. Thus, it can be taken out and
solved separately without additional cost.
• The local facility solution will not yield a negative impact on the convergence
of the reservoir part (and hence of the overall system). Sometimes when the
facility solution is obtained with an inaccurate reservoir condition, it may produce
an overshoot in the facility variables, and thus slow down the convergence
of the overall system. In order to resolve this problem, we may activate the
local nonlinear solver only when the reservoir part is close to convergence, such
that the reservoir condition used in the local facility solution will be relatively
accurate. In this way, the oscillation in Newton iterations brought about by
the premature usage of the local nonlinear solver can be avoided. However, the
threshold that is used to determine whether the reservoir part is already close
to convergence must be chosen carefully. Too large a threshold may lead to
oscillations. Too small a threshold will limit the usage of the local nonlinear
solver, and thus diminish the savings in computational time.
The algorithm of the local nonlinear solver includes the following steps:
1. Perform nonlinear treatments for the flux-related properties in all perforated
reservoir cells (see Section 2.3.2). This is only needed when IMPES or IMPSAT
time discretization is used, because all the reservoir cells, including perforated
ones, are treated explicitly in the IMPES or IMPSAT formulation, whereas
in the FIM or AIM formulation, the perforated reservoir cells are always
treated implicitly.
2. Reset the local nonlinear iteration to zero
3. While the local nonlinear iteration is below the specified maximum value, do
the following:
(a) Compute the properties for all facility models and form their own residual
equations with a fixed reservoir condition
(b) Check the norm of the residual and the maximum change of variables in
the local facilities system: if both are below the threshold, the facilities
part has converged, go to 4; otherwise, continue the iteration process
(c) Solve the facility linear system extracted from the residual equations formed
above for the Newton update to the facility variables
(d) If the linear solution fails, report the error, and go to 4 without conver-
gence; otherwise, apply the Newton update to the facility variables and
perform variable switching if the phase state of any well node has changed
(e) If the Newton update fails for any facility model, restore the state of that
facility model, report the error, and go to 4 without convergence; otherwise,
continue the iteration process
(f) Increase the local nonlinear iteration by 1 and go to 3
4. Restore the nonlinear treatment applied in step 1
5. Report the number of local nonlinear iterations and that local nonlinear solution
has ended
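The loop in steps 1-5 above can be condensed into a minimal sketch. Here a toy scalar residual stands in for the facility equations, the reservoir state is frozen as a single pressure `p_res`, and all names and the residual form are hypothetical; variable switching and failure handling are omitted.

```python
# Minimal sketch of the local facility Newton loop with a fixed reservoir
# condition. The residual (p_res - p) - 0.05*p*p is an invented stand-in for
# the facility mass balance; it is NOT the actual AD-GPRS formulation.
def local_facility_solve(p, p_res, tol=1e-8, max_iter=20):
    for it in range(max_iter):
        # (a) form the residual with the reservoir condition held fixed
        r = (p_res - p) - 0.05 * p * p
        # (b) convergence check on the residual norm
        if abs(r) < tol:
            return p, it, True
        # (c) solve the (here 1x1) facility linear system for the Newton update
        drdp = -1.0 - 0.1 * p
        dp = -r / drdp
        # (d) apply the Newton update (variable switching omitted in this sketch)
        p += dp
    return p, max_iter, False

p, iters, converged = local_facility_solve(p=10.0, p_res=50.0)
assert converged
```

The local iteration count and convergence flag returned here correspond to the reporting in step 5.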
4.9 Numerical Examples
4.9.1 Two-dimensional reservoir with a dual-branch general
MS well
Figure 4.9: The reservoir and well configuration of example 1
The first example is set up to demonstrate the correctness of the model. The
reservoir and well configurations are shown in Figure 4.9. A two-dimensional reservoir
with 51× 51× 1 cells is initially filled with 100% C10. A standard injector is on the
left side of the reservoir injecting pure C1 at 10^5 m3/day. A general MS well, which
is composed of 31 nodes and 32 connections, has the following characteristics:
• Two producing branches separated from node 17 on the surface and perforated
in the reservoir at node 29 (with smaller WI, or well index) and at node 31
(with larger WI), respectively
• A loop (nodes 7-14) on the surface
• Wellhead pressure control at 20 bars applied on node 1. Note that the two
producing branches have no separate controls and will be governed by this
control only.
a) Node pressure b) Node saturation c) Connection mixture flow rate
Figure 4.10: The simulation results of example 1
The simulation results of highlighted nodes (9, 12, 21, 26) and connections (8→ 7,
14→ 10, 13→ 14, 7→ 11) are shown in Figure 4.10. There are two key observations.
First, the pressure and saturation of nodes 9 and 12 are exactly the same, while those
of nodes 21 and 26 are different. This is because node 21 connects to a perforation
with a smaller WI, while node 26 connects to a perforation with a larger WI. Hence,
we get higher pressure and earlier gas breakthrough in node 26. On the other hand,
because the two distinct producing branches join at node 17, and downstream of it
nodes 9 and 12 are symmetric in the loop, they have exactly the same node properties.
Second, the mixture flow rates of connections 14 → 10 and 8 → 7 have positive
values, whereas those of connections 13 → 14 and 7 → 11 have negative values. We
notice that the predefined directions of these four connections are as shown in Figure
4.9, whereas the actual flow directions may, or may not, be the same as the predefined
ones. For connections 14 → 10 and 8 → 7, their actual flow directions match the
predefined ones, so that the mixture flow rates are positive. On the contrary, for
connections 13 → 14 and 7 → 11, we know the flow actually goes in the reverse
directions from their predefined ones, and this results in negative mixture flow rates.
The sign (direction) and value (magnitude) of each mixture flow rate are determined
by the model automatically in each iteration, and this demonstrates that our general
MS well model can handle loops with arbitrary flow directions.
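The sign convention demonstrated above can be captured in a small sketch: each connection carries a predefined direction (from node, to node), and a negative mixture rate means the actual flow runs opposite to that direction. The rate values below are made up to mirror the four highlighted connections of Figure 4.9.

```python
# Signed mixture rates relative to predefined connection directions.
# Positive rate: flow follows the predefined (from, to) orientation;
# negative rate: flow is reversed. Values are illustrative only.
connections = {
    (14, 10): +3.2,
    (8, 7):   +1.5,
    (13, 14): -3.2,
    (7, 11):  -1.5,
}

def actual_direction(conn, rate):
    """Return the (upstream, downstream) node pair implied by the signed rate."""
    a, b = conn
    return (a, b) if rate >= 0 else (b, a)

# Connections with negative rates flow against their predefined orientation
assert actual_direction((13, 14), connections[(13, 14)]) == (14, 13)
assert actual_direction((8, 7), connections[(8, 7)]) == (8, 7)
```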
4.9.2 Upscaled SPE 10 reservoir with three multilateral pro-
ducers
The second example is based on the permeability and porosity fields of an upscaled
version of the SPE 10 problem, in which a simple 4×4×4 coarsening (i.e., 1/64th of
SPE 10) is employed. The reservoir and well settings are shown in Figure 4.11. The
reservoir is initially filled with 0% CO2, 5% C1, 25% C4, and 70% C10. There are two
standard vertical injectors, each injecting pure CO2 with BHP control at 140 bar.
Three multilateral producers, controlled separately by wellhead pressure of 45 bar,
are used in the example. Each producer has two horizontal branches, which are fully
Figure 4.11: The reservoir and well settings of example 2 with separate controls
perforated and composed of nodes 8-17 (the yellow dashed ellipse in Figure 4.11) and
of nodes 19-28, respectively.
a) oil rates of three producers b) properties of one producing branch
Figure 4.12: The simulation results of example 2 with separate controls
Figure 4.12(a) shows the oil rates of the three producers, from which we can see
that this is a problem with strong nonlinearity. We can also examine certain proper-
ties, such as the node pressure and the connection mixture flow rate, of any producing
branch. Figure 4.12(b) shows a profile of these properties for the highlighted branch
(yellow dashed ellipse in Figure 4.11) at the end of the simulation. Note that node 8
is at the bend between the inclined segments and horizontal segments, whereas node
17 is at the end of the horizontal segments. From node 17 to 8, the pressure decreases
and the connection mixture flow rate increases as a result of more incoming fluid from
the perforations.
Figure 4.13: The reservoir and well settings of example 2 with a group control
The group control is also tested by connecting three multilateral producers to a
common surface part, such that they are effectively translated into three branches
of a single MS well, as shown in Figure 4.13. The three producers join the surface
pipeline through a loop and a single wellhead pressure control at 40 bar is applied to
node 1, which is the outlet of the loop. As a consequence, three multilateral producers
are now governed by a group wellhead pressure control at the surface.
Figure 4.14(a) shows a comparison of the total oil rate between separate controls
and the group control, from which we can see the difference incurred by the group
a) total oil rate of three producers b) properties of the surface loop
Figure 4.14: The simulation results of example 2 with a group control
control. The profile of node pressure and connection mixture flow rate of the surface
loop are shown in Figure 4.14(b). Note the green dashed partition line in the surface
loop in both Figure 4.13 and 4.14(b). It is observed that from node 2 to 8, pressure
increases and the mixture flow rates are positive, whereas from node 8 to 13, pressure
decreases and the mixture flow rates are negative. This is again because of the flow
directions. The predefined flow direction is 13 → 12 → · · · → 3 → 2, whereas the
actual flow direction is 8 → 7 → · · · → 3 → 2 and 8 → 9 → · · · → 12 → 13 on
both sides of the green dashed line. As a result, mixture flow rates are positive in
connections 8 → 7, ..., 3 → 2, and negative in connections 13 → 12, ..., 9 → 8. The
general MS well model correctly recognizes the actual flow directions.
4.9.3 Linear solver performance
The third example is designed to test the linear solver performance. This example
is also based on the permeability and porosity fields of an upscaled version of the
SPE 10 problem. A simple 2×2×2 coarsening (i.e., 1/8th of SPE 10) is employed for
this example. A 9-component fluid is used with the following initial mole fraction:
1% CO2, 19% C1, 5% C2, 5% C3, 10% n-C4, 10% n-C5, 10% C6, 20% C8, and 20% C10.
The well geometries and locations are the same as in the second example, as shown
in Figure 4.11. Each of the two standard vertical injectors is injecting 90% CO2 and
10% C1 with BHP control at 150 bar. Each of the three multilateral producers is
governed by well head pressure control at 30 bar.
A complex physics model, including drift-flux model and all three types of pres-
sure drops, is used for the three multilateral producers, such that their Jacobian
matrices have complex structures and are hard to solve. Here, we compare three
preconditioning options:
• BILU(0): this is a one-stage preconditioner working directly on the overall
system. Specifically, BILU(0) is applied to the reservoir and BILU(1) is applied
to the general MS wells. We cannot obtain a converged linear solution if
BILU(0) is also applied to the general MS wells
• BILU(1): this is also a one-stage preconditioner working directly on the overall
system. The difference from the previous option is that BILU(1) is applied both
to the reservoir and to the general MS wells
• CPR: this is the two-stage preconditioner described in Section 4.7, with AMG
preconditioner applied in the first stage solution of the global pressure system,
and a set of BILU preconditioners applied to the second stage solution of the
local overall system. For the second stage, BILU(0) is applied to the reservoir
and BILU(1) to the general MS wells
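How the two stages of a CPR apply fit together can be sketched on a tiny 2-cell system with unknowns ordered [p1, s1, p2, s2]. A direct 2x2 solve stands in for the first-stage AMG on the pressure block, a diagonal (Jacobi) sweep stands in for the second-stage BILU, and the matrix values are made up; this is a schematic of the technique, not the AD-GPRS implementation.

```python
# Two-stage CPR-style preconditioner apply, demonstrated inside a
# preconditioned Richardson iteration (a stand-in for GMRES).
A = [[10.0, 1.0, -2.0, 0.0],
     [1.0, 8.0, 0.0, -1.0],
     [-2.0, 0.0, 10.0, 1.0],
     [0.0, -1.0, 1.0, 8.0]]
b = [1.0, 2.0, 3.0, 4.0]
P = [0, 2]  # indices of the pressure unknowns

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def solve2(M, r):
    """Direct solve of a 2x2 system (stand-in for the AMG pressure solve)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * r[0] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

def cpr_apply(r):
    # Stage 1: solve the reduced pressure system, expand back to the full space
    App = [[A[i][j] for j in P] for i in P]
    dp = solve2(App, [r[i] for i in P])
    x1 = [0.0] * 4
    for k, i in enumerate(P):
        x1[i] = dp[k]
    # Stage 2: local correction (diagonal sweep here) on the updated residual
    r2 = [r[i] - v for i, v in enumerate(matvec(A, x1))]
    return [x1[i] + r2[i] / A[i][i] for i in range(4)]

x = [0.0] * 4
for _ in range(50):
    r = [b[i] - v for i, v in enumerate(matvec(A, x))]
    x = [x[i] + dx for i, dx in enumerate(cpr_apply(r))]
assert max(abs(b[i] - v) for i, v in enumerate(matvec(A, x))) < 1e-8
```

The pressure correction in stage 1 resolves the global (low-frequency) coupling, so the cheap local sweep in stage 2 suffices for the remainder, which is why CPR needs far fewer Krylov iterations than a one-stage BILU in the comparison below.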
The benchmark was conducted on a single core of a Xeon X5520 CPU running at
2.27GHz with 24GB of RAM. The performance results, averaged over all Newton
iterations, are shown in Figure 4.15 for all three preconditioner options described
above. From Figure 4.15(a), we find that the number of linear solver iterations ob-
tained with CPR is one order of magnitude smaller than that obtained with BILU(0).
a) Number of linear solver iterations per Newton iteration
b) Linear solver time and total time per Newton iteration
Figure 4.15: The linear solver performance of example 3
Even compared with BILU(1), which has better convergence behavior but also much
higher preconditioning cost than BILU(0), CPR takes only 21% of the linear solver
iterations taken by BILU(1). Correspondingly, from Figure 4.15(b), we see a large
drop in the linear solver time per Newton iteration from 151.7s for BILU(0) and 70.6s
for BILU(1) to only 12.5s for CPR. Although CPR has higher preconditioning cost
per iteration than either BILU(0) or BILU(1), the cost of BLAS operations per it-
eration increases linearly with the number of iterations for the GMRES solver and
becomes prohibitively expensive for BILU(0) and BILU(1) at such high iteration
counts (if the GMRES restart length is set to a small number, BILU(0) and BILU(1)
will take even more iterations to converge). Notice that
the total cost per Newton iteration also decreases severalfold, from 159.1s for
BILU(0) and 78.4s for BILU(1) to 20.4s for CPR. The performance difference is quite
large, and we expect higher savings with CPR in even larger models.
4.9.4 Nonlinear solver performance
The fourth example is designed to test the performance of the local nonlinear solver
for the facilities part. The reservoir, well, and fluid configurations are identical to
those in the first part of the second example (with three separate general MS
wells) discussed in Section 4.9.2.
Here we compare three options of nonlinear solvers:
• Option 1: standard Newton. The local facility solver is not applied at all
• Option 2: Newton with the local facility solver, which gets activated only when
the reservoir part has already converged while the facilities part has not yet
converged
• Option 3: Newton with the local facility solver, which gets activated whenever
the reservoir part is close to convergence (i.e., when its normalized residual gets
within the range that is one order of magnitude larger than the convergence
threshold)
The benchmark was also conducted on a single core of a Xeon X5520 CPU. The
performance results (total number of Newton iterations and total simulation time)
are shown in Figure 4.16. We see a 33.5% decrease in the number of Newton iterations
for Option 2 compared to Option 1. Note that the activation of the local nonlinear
solver is quite conservative (i.e., only when the reservoir part has already converged)
in Option 2. By activating the local nonlinear solver more frequently in Option 3,
but not yet incurring oscillation, we achieve a further 9.3% decrease in the number of
Newton iterations from Option 2. Correspondingly, we decrease the total simulation
time of Option 1 by 34.0% when using Option 2, and achieve a further 8.7% decrease
from the total simulation time of Option 2 when using Option 3. The savings in
the total simulation time is proportional to the savings in the number of Newton
Figure 4.16: The nonlinear solver performance of example 4
iterations. Although the local facility solver is activated many more times in Option
3 (252 times) than in Option 2 (117 times), its cost is so small (< 1% of the total
simulation time) that it can essentially be ignored in both options. With a larger
reservoir, such as 1/8th of SPE 10 or the full SPE 10, the relative cost of the local facility
solver will be even smaller. Therefore, we should be able to achieve comparable or
greater savings in the total simulation time with the local facility solver.
4.9.5 Comparison of simulation results: AD-GPRS versus
Eclipse
The last example is designed to validate the simulation results of the coupled system
with general MS wells. When a general MS well has relatively simple geometry (e.g.,
without loops), a single exit, and no special segments, we can find an equivalent
representation of the well using the original MS well model, which is implemented
in the Eclipse [70] simulator. Here, we use the permeability and porosity fields of an
upscaled version of the SPE 10 problem. A simple 2×2×2 coarsening is employed,
Figure 4.17: The reservoir and well settings of example 5 (AD-GPRS versus Eclipse)
resulting in 30 × 110 × 42 cells. When a 4-component fluid (CO2, C1, C4, and C10)
is used, the memory used by the Eclipse 300 simulator exceeds the limit of a 32-bit
application. As a result, we drop the bottom seven layers of the upscaled model and
keep 30 × 110 × 35 cells. As shown in Figure 4.17, four multisegment injectors and
four multisegment producers are included. All the wells (injectors and producers) are
vertical and perforate all 35 reservoir layers. Each well is discretized into 46 segments
(and 46 connections). Each injector is injecting 90% CO2 and 10% C1 with well head
pressure control at 140 bar, whereas each producer is governed by well head pressure
control at 5 bar. The drift-flux model and all three types of pressure drops are used
for all the wells. Using the equivalent input data, we simulate the problem with both
AD-GPRS and Eclipse 300 for a total simulation time of 100 days.
The simulation results obtained by both AD-GPRS and Eclipse 300 are shown
in Figure 4.18. In Figure 4.18(a), the gas injection rate is plotted versus simulation
time for the four multisegment injectors (W1, W2, W5, and W6). We observe that for
a) Gas rates of all injectors b) Oil rates of all producers
Figure 4.18: Comparison of simulation results: AD-GPRS versus Eclipse
each injector, the rates obtained by Eclipse 300 (represented by dots) overlap
closely with those obtained by AD-GPRS (represented by a curve) throughout the
simulation. In Figure 4.18(b), the oil production rate is plotted versus simulation
time for the four multisegment producers (W3, W4, W7, and W8). For all producers,
we again achieve a good match between the rates obtained by Eclipse 300 and those
obtained by AD-GPRS. Thus, in this case where MS wells have only relatively simple
geometry, single exit, and no special segments, the consistency between the general
MS well model in AD-GPRS and the original MS well model in Eclipse is validated.
4.10 Concluding Remarks
In this chapter, we discussed the mathematical formulation, our generic design and
AD-based implementation, our advanced multistage linear solver, and the new nonlinear
solution strategy for the general MS well model. Each MS well is discretized into
nodes and connections. We define variables and write the governing equations both on
nodes and on connections. The general MS well model has the following advantages:
1) general branching that allows for complex well geometry, 2) loops with arbitrary
flow directions, 3) multiple exit connections with different constraints, and 4) special
nodes with various functionality (e.g., separators, valves). The first two advantages
have been demonstrated in the numerical examples, whereas the required features for
the last two advantages are well supported by the current framework but have not
yet been implemented into AD-GPRS. These features can be introduced in the future
as extensions to the model.
Regarding the linear solution of the model, the effectiveness of the specialized linear
preconditioner, which extends the standard two-stage CPR preconditioner, is
demonstrated using numerical examples by comparing it against popular single-stage
preconditioners. As demonstrated here, local nonlinear solution for the facilities part
can be used to accelerate the nonlinear convergence of the coupled system with general
MS wells.
Chapter 5
Multicore Parallelization
5.1 Introduction
In this chapter, we describe our OpenMP-based shared-memory parallelization of
AD-GPRS. This strategy has two parts: Jacobian generation and the linear solver.
Under the AD framework described in Chapters 1 and 2, parallelization of Jacobian
generation is achieved with a thread-safe extension of our AD library (ADETL) [92,
96], which utilizes the Thread-Local Storage (TLS) technique (see [26] for details)
to avoid data race conditions. With the thread-safe ADETL, computations such as
discretization, property calculation, and Newton updates are safely parallelized.
For the parallel linear solution, many investigations have been conducted in
this field [17, 20, 32, 48, 75]. Among these efforts, both shared-memory (e.g., using
OpenMP) and distributed-memory (e.g., using MPI) approaches have been studied.
In our approach, we first parallelize our own MultiLevel Block-Sparse (MLBS) ma-
trix data structure (see Section 3.3), and then we use a two-stage CPR (Constrained
Pressure Residual, see [88,89]) preconditioning strategy in the iterative solution pro-
cess. The latest parallel multigrid solver from Fraunhofer SCAI - XSAMG [31,76–78]
- is used as the 1st-stage pressure preconditioner. For the 2nd-stage, we employ the
Block Jacobi technique, with Block ILU(0) applied as the local preconditioner.
This OpenMP implementation had a small impact on the overall structure of the
object-oriented code, so that code maintenance and extension will be relatively easy.
5.2 Jacobian Generation
5.2.1 Thread-safe ADETL
The purpose of the thread-safe extension of ADETL is to parallelize all the compu-
tations (other than the linear solver) in AD-GPRS.
Static variables are used by ADETL in various stages of gradient computation
(e.g., the memory pool for generating AD expression objects; see [92, 96] for details).
Thus, data-race conditions may arise when the library is used directly in
a multithreading environment. In order to guarantee thread safety in the gradient
computation, the Thread-Local Storage (TLS) technique is utilized. The basic idea of
TLS is for each thread to contain a local instance of the static variable, which is only
accessible by its ‘owner thread’. The usage of TLS is illustrated in Fig. 5.1. When
integers a and b are declared as static variables without the TLS keyword, they are
contained in the global memory and are accessible by all threads. In contrast, if the
TLS keyword (‘__thread’ in Linux; ‘__declspec(thread)’ in Windows) is specified in the
declaration, each thread has a local copy of both variables and these local copies can
only be read or written to by the corresponding thread that owns them, thus avoiding
the potential data races.
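The same idea can be sketched with Python's `threading.local()`, which plays the role of the `__thread` / `__declspec(thread)` storage class: each thread sees its own copy of the "static" data. The per-thread pool below is an invented stand-in for the AD memory pool, not the ADETL implementation.

```python
# Thread-local storage sketch: each thread fills its own private pool
# without locks, because no other thread can reach its copy of tls.pool.
import threading

tls = threading.local()   # per-thread storage, analogous to a TLS static
results = {}

def worker(tid):
    tls.pool = []                 # each thread initializes its own pool
    for _ in range(100):
        tls.pool.append(tid)      # no lock needed: the pool is thread-private
    results[tid] = (len(tls.pool), set(tls.pool))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread filled its own pool without interference from the others
assert all(results[t] == (100, {t}) for t in range(4))
```

Had `pool` been a single shared global, the appends from different threads would interleave in one list; with thread-local storage each thread's gradient workspace stays isolated, which is exactly the race-avoidance property described above.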
Besides the thread-safe extension, global operations (e.g., switching of the inde-
pendent variables, backup and restoration of all variables) on the AD variable and
backup sets (adX and adX_n) are parallelized internally using OpenMP directives,
because they may only be called from a single thread at any time.
Figure 5.1: Illustration of Thread-Local Storage (TLS)
5.2.2 Parallel computations other than the linear solver
Here, we discuss the parallelization of the following computations: discretization,
property calculation, and the Newton update. For discretization (i.e., computation
of flux and accumulation terms), a partitioner, such as METIS [40], is used to create
subdomains such that the amount of repeated flux computations across subdomains
is minimized. By assigning one subdomain to each thread, high-overhead locking
can be avoided entirely. The memory bandwidth requirement is estimated to
be high for the stenciling operations associated with flux computations. For property
calculations, the computations are local and inherently parallel; thus, good scalability
is expected. For the Newton update of variables and phase states, the workload can be
quite different depending on the phase state. Thus, one may consider using dynamic
scheduling to achieve a better load balance.
The above descriptions apply to the reservoir part. For the facilities part, all
computations (switch of facility control, calculation of facility properties, application
of source/sink terms, and construction of facility residuals) at nonlinear level are
parallelized by assigning facility objects to different threads, because the facility ob-
jects are independent of each other and have quite different computational processes.
There is no requirement on the distribution of facility objects among threads. Dynamic
scheduling may also be considered because the workload can differ greatly when the
facility model is not uniform (e.g., standard well versus general MS well).
5.3 Linear Solver
5.3.1 Parallel matrix data structure
On the topmost level, our MLBS matrix is composed of four parts: JRR, JRF , JFR,
JFF , as described in Section 3.3. The submatrix JRR corresponds to the reservoir
part, and its size is usually much larger than the other submatrices. Thus, matrix
operations such as extraction from AD vector, algebraic reduction (two-step), Sparse
Matrix-Vector multiplication (SpMV), and explicit update (two-step) are parallelized
for JRR by decomposing it into NTh pieces, where NTh is the number of threads.
Each piece contains approximately NB/NTh block rows (NB being the total number
of block rows) and is assigned to one thread.
Because the locations of block nonzero entries in JRR are recorded in the generalized
connection list, which uses the CSR format internally, each thread can find and access
the block nonzero entries in its corresponding block rows directly. Suppose block rows
r_i to r_{i+1} − 1 belong to thread i; then the indices of the associated block diagonal
entries are r_i, r_i + 1, ..., r_{i+1} − 1, and the indices of the corresponding block
off-diagonal entries are row_ptr(r_i), row_ptr(r_i) + 1, ..., row_ptr(r_{i+1}) − 1, where
row_ptr is the row-pointer array of the generalized connection list.
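The row-wise partitioning described above can be sketched as follows: contiguous chunks of CSR rows are assigned to threads, and each thread performs its share of the sparse matrix-vector product using the shared row-pointer array. Scalar entries stand in for the dense sub-blocks of JRR, and the small matrix is made up.

```python
# Row-partitioned parallel SpMV over a CSR matrix. Each thread owns a
# contiguous range of rows and writes only its own entries of y, so no
# synchronization is needed during the multiply.
import threading

# CSR data for a small 6x6 block-diagonal-dominant test matrix
row_ptr = [0, 2, 4, 6, 8, 10, 12]
col_idx = [0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5]
vals = [4.0, -1.0, -1.0, 4.0, 4.0, -1.0, -1.0, 4.0, 4.0, -1.0, -1.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n, n_threads = 6, 3
y = [0.0] * n

def spmv_rows(r_lo, r_hi):
    # The thread owning rows [r_lo, r_hi) locates its nonzeros via row_ptr
    for r in range(r_lo, r_hi):
        y[r] = sum(vals[k] * x[col_idx[k]]
                   for k in range(row_ptr[r], row_ptr[r + 1]))

chunk = n // n_threads
threads = [threading.Thread(target=spmv_rows, args=(t * chunk, (t + 1) * chunk))
           for t in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Compare against a serial SpMV
y_serial = [sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(n)]
assert y == y_serial
```

In AD-GPRS each "entry" is a dense block rather than a scalar, but the ownership pattern (contiguous block rows per thread, located through the shared row-pointer array) is the same.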
For the facility-related submatrices (JRF , JFR, and JFF ), each of them is again
composed of several second-level submatrices (e.g., JFF is composed of JWW,1,
JWW,2, ..., JWW,NF, where NF is the number of facilities in the simulation). Be-
cause there is no direct coupling between different facility objects on the nonlinear
level, the second-level submatrices in JRF , JFR, or JFF are decoupled and can be
operated on concurrently. Thus, parallelization for the facility-related submatrices is
achieved by assigning their second-level submatrices to different threads and perform-
ing computations on them in parallel. This is similar to our treatment of the facility
objects at the nonlinear level.
5.3.2 First stage pressure solution — XSAMG preconditioner
Given the parallel matrix data structure, the presolution (matrix extraction, algebraic
reduction) and post-solution (explicit update) steps, as well as the BLAS operations in
the iterative linear solver, can be performed in parallel. The only step that remains serial
in the linear solution is the preconditioning. As described in Section 3.4.5, for fully
implicit reservoir simulation problems, the two-stage CPR preconditioner has proved
to be a very effective choice. Recall that in CPR, we first solve a global pressure
system that is algebraically reduced from the primary system, and then solve locally
an overall primary system.
For the first-stage pressure solution, the latest parallel multigrid solver from Fraun-
hofer SCAI, XSAMG [31], is used. The idea of XSAMG is illustrated in Fig. 5.2.
The original SAMG preconditioner runs on a single computational node. If the major
simulation cost is in the SAMG linear solution, then by simply replacing SAMG with
the new XSAMG preconditioner, which runs on several computational nodes, considerable
savings can be achieved. That is, XSAMG provides us with a mechanism to
solve the pressure system on multiple computational nodes, regardless of how the rest
of the simulator is parallelized (e.g., multicore parallel or even serial). The advantage
of this approach is that little effort is required in code modification. One only needs
to add xsamg init and xsamg finalize at the beginning and end of the code, and
replace the call to SAMG with that to XSAMG.
The overall parallel efficiency is limited by Amdahl’s law. In the fully implicit
reservoir simulation, depending on the fluid complexity, the pressure solution can
take around 5% (very complex fluid) to 50% (very simple fluid) of the total simulation
time. Thus, it is usually not sufficient to just parallelize the pressure solution, which
would only yield 1.05X (X - times, similarly hereinafter) to 2X speedup of the overall
simulation cost.
Figure 5.2: Illustration of XSAMG preconditioner [31]
When applying XSAMG to the first stage of CPR, there are two important con-
cerns. First, the ‘first touch’ policy is critical in optimizing the performance of
XSAMG. On workstations with NonUniform Memory Architecture (NUMA), the
memory subsystem is divided to allow each CPU to have both local and remote
memory accesses. Accesses to the local memory are faster than accesses to the re-
mote memory. Memory is managed in fixed-size units called “pages”. Under the
‘first touch’ policy, the physical memory backing a page is allocated on the NUMA
node of the first CPU that touches (i.e., reads from or writes to) it; the page is thus
bound to that CPU, allowing faster subsequent accesses by that CPU. Therefore,
a parallel assembly process based on a partition of the matrix rows is needed for
XSAMG such that the matrix coefficients in each part of the rows are first accessed
by the corresponding CPU that will later perform computations on them during the
setup and solution phases. Second, a partial setup strategy is used to improve the
performance and parallel scalability. The setup of XSAMG contains two steps: 1)
coarsening and interpolation, which has suboptimal scalability; and 2) update of the
Galerkin operators, which has good scalability. Given a new pressure matrix, com-
plete setup involves both steps and scales poorly with the growing number of cores.
In contrast, a partial setup performs only the second step, which scales
much better. Based on our experience, it is usually safe to call the complete setup
only for the first Newton iteration of a timestep, and call a partial setup for each of
the remaining Newton iterations. By applying this partial setup strategy, we observe
considerable improvement in the scalability for the setup phase of XSAMG and no
severe degradation in the convergence.
5.3.3 Second stage overall solution — Block Jacobi/BILU
preconditioner
After applying the pressure update from the first stage of the CPR method, we need to
perform the second stage preconditioning to obtain an overall solution. Because most
of the global coupling is resolved in the first stage, a block Jacobi strategy has been
applied to the second stage solution, with BILU(0) or BILU(k) selected as the local
preconditioner.
At the beginning of the simulation, the reservoir matrix is partitioned into sev-
eral outer blocks with similar sizes. Note that this partition can be the same as,
or different from, the partitioning used for the discretization computations. Better
convergence behavior can be achieved if the partitioning minimizes the interdomain
connectivities, which are reflected in the transmissibility of interdomain connections,
or approximately, in the (neglected) pressure derivatives.
The idea of the second stage solution is shown in Fig. 5.3. It is worth mentioning
that each submatrix JRR,i also has a block sparse structure and shares the data
storage with the original reservoir matrix JRR. In addition, the structure of JRR,i is
Figure 5.3: Illustration of Block Jacobi preconditioner
created in the beginning and kept the same if the partitioning is fixed. Thus, there
is no additional setup, or data transfer, cost associated with JRR,i in each Newton
iteration.
Moreover, due to the consistency in the data format of JRR,i and the original JRR,
we may have an arbitrary choice of preconditioners for the submatrices, e.g., BILU(0)
or BILU(k). Also, because the block Jacobi preconditioner has the same interfaces as
the original BILU preconditioners, only trivial modifications are required in the second
stage of CPR. With a small number of outer blocks, the impact of using block-Jacobi
on convergence is mild: the number of linear solver iterations increased by less than
10% in our test problems.
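A minimal sketch of this second-stage strategy follows (numpy only; `make_block_jacobi` is a hypothetical helper, and exact local inverses stand in for the BILU(0)/BILU(k) factorizations used in AD-GPRS).

```python
import numpy as np

def make_block_jacobi(A, partition):
    """Sketch of the second-stage Block Jacobi preconditioner: the reservoir
    matrix is split into outer blocks, interdomain couplings are neglected,
    and each diagonal block gets its own local preconditioner.
    `partition` is a list of index arrays, one per outer block."""
    # Factorize each diagonal outer block once (the "setup" phase).
    factors = [(idx, np.linalg.inv(A[np.ix_(idx, idx)])) for idx in partition]

    def apply(r):
        # Local solves never touch each other's rows: trivially parallel.
        z = np.zeros_like(r, dtype=float)
        for idx, Binv in factors:
            z[idx] = Binv @ r[idx]
        return z

    return apply
```

Because the local solves are fully independent, each outer block can be handled by a separate thread, which is the point of the design.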
5.4 Parallel benchmark
The full SPE 10 model [19] of 60×220×85 (1.1M) cells with all of its horizontal layers
skewed and distorted as explained in [99] is used for the parallel benchmark. Three
different spatial discretization schemes are tested: TPFA (7-pt), MPFA L-method (9-
pt), and MPFA O(0)-method (11-pt). A line-drive well pattern with two injectors and
two producers is used. We simulated the model in parallel AD-GPRS with thread-
safe ADETL and the two-stage CPR-based linear solver (XSAMG + Block Jacobi /
BILU) on a workstation with dual quad-core Xeon E5520 CPU at 2.27GHz and 24GB
of RAM.
Figure 5.4: Performance result of the full SPE 10 model with TPFA discretization. Parallel speedups with 2, 4, and 8 threads are shown for the nonlinear kernels (discretization, properties, Newton update), the linear kernels (matrix extraction and algebraic reduction, linear solution, XSAMG preconditioning, BILU preconditioning), and the total simulation time (1.9X, 3.1X, and 4.8X, respectively).
The benchmark results for the TPFA (7-pt) discretization are shown in Fig. 5.4.
We observe that the speedups for the nonlinear (other than the linear solver) compu-
tations are generally good, especially for the properties calculation, where the local
computations put little stress on the memory bandwidth. The overall speedup for
the “nonlinear” part is 6X with 8 threads. On the other hand, the speedups for linear
solver computations are usually lower, due to the much higher requirement on the
memory bandwidth. The overall speedup for the linear part is 3.8X with 8 threads.
As a result, the total parallel speedup is 4.8X with 8 threads.
Figure 5.5: Performance result of the full SPE 10 model with MPFA O-method discretization. Parallel speedups with 2, 4, and 8 threads are shown for the nonlinear kernels, the linear kernels, and the total simulation time (1.9X, 3.1X, and 5.2X, respectively).
For the MPFA O-method (11-pt) discretization, as shown in Fig. 5.5, the per-
formance is somewhat similar to that of TPFA. The same overall speedup for the
nonlinear part has been achieved, whereas the speedup for the linear part is actually
a bit higher (4.5X with 8 threads). The total parallel speedup is hence increased
to 5.2X with 8 threads. The result for MPFA L-method (9-pt) lies somewhere in
between: a total parallel speedup of 5X has been achieved with 8 threads.
5.5 Concluding Remarks
The multithreading shared-memory parallelization strategy of AD-GPRS has been
described in this chapter. This OpenMP approach had a small impact on the overall
structure of the object-oriented code. Parallel Jacobian construction is achieved with
a thread-safe extension of ADETL, which utilizes the Thread Local Storage (TLS)
technique to avoid data-race conditions. For the linear solution, the matrix oper-
ations in the MLBS data structure are first parallelized. Then, a two-stage CPR
(Constrained Pressure Residual) preconditioning strategy, combining XSAMG and
Block Jacobi/BILU, is used. The implementation ensures that the data transfer cost
between the global matrix and the local preconditioning matrices is minimized.
The benchmarking results are obtained by simulating the full SPE 10 problem
(with three discretization schemes applied on the manipulated nonorthogonal grid)
using parallel AD-GPRS on multicore platforms. We find that, on average, the
speedup is about 5.0X on an 8-core (dual quad-core Nehalem) node. The parallel
performance is analyzed for the different nonlinear and linear kernels in the simulation.
Specifically, the average speedup is about 6.0X for the nonlinear computations and
4.1X for the linear solution.
For future work, the hybrid MPI/OpenMP parallelization, which may be consid-
ered as an extension of the current OpenMP parallelization, is a possible direction.
In that case, the distributed-memory parallel multigrid solver, SAMGp, may be used
to replace XSAMG. Also, we may consider the parallelization using emerging HPC
(High Performance Computing) architectures for certain components in the simula-
tor. As an example, in the next chapter, we will describe the GPU parallelization
of the Nested Factorization algorithm. Significant speedups can be achieved if the
algorithm, through proper modification, can fit the computing architecture well.
Chapter 6
GPU Parallelization of Nested
Factorization
6.1 Introduction to GPU Architecture
Among the emerging parallel architectures for high performance computing, GPGPU
(General-Purpose computing on Graphics Processing Units) platforms have drawn a
lot of attention in recent years. The GPU was originally devoted to graphics process-
ing and acceleration. With the help of GPGPU computing languages and libraries,
the GPU can be used to perform computations that have traditionally been handled
by the CPU. Currently, the dominant proprietary GPGPU computing framework
is NVIDIA’s CUDA (Computing Unified Device Architecture) [54, 55], whereas the
dominant open language for GPGPU is OpenCL (Open Computing Language) [41].
The GPU parallel algorithm described in this chapter has been implemented using
CUDA and integrated with AD-GPRS.
To better understand how the GPU platform can be used for computationally in-
tensive kernels, let us take a look at the architecture of a GPU. Here, we use NVIDIA’s
Fermi architecture [52] as an example. As shown in Figure 6.1(a), each Fermi chip
Figure 6.1: Illustration of the Fermi GPU architecture [52]: (a) Fermi GPU chip; (b) Fermi Streaming Multiprocessor (SM)
contains 16 Streaming Multiprocessors (SM) (e.g., the one enclosed in the red box),
a thread scheduler and dispatcher (GigaThread), an interface to the host (via PCIe
bus), a shared L2 cache, and up to 6GB of GDDR5 (Graphics Double Data Rate,
version 5) DRAM. The internal structure of each SM is shown in Figure 6.1(b). Each
SM contains 32 CUDA cores, 16 Load/Store units (LD/ST), four Special Function
Units (SFU), two ‘warp’ schedulers, and 64KB of configurable shared memory and
L1 cache. So, there are 512 CUDA cores per Fermi chip. The current top-tier Fermi-
architecture product - NVIDIA Tesla M2090 - has a peak capacity of 1331 and 665
GFlops for single- and double-precision computations, respectively. The M2090 has
a peak memory bandwidth of 177GB/s. These peak performance numbers are high
compared with state-of-the-art CPUs. However, the actual performance of an algo-
rithm depends strongly on the ability to utilize the computational resources in the
GPU and may differ significantly from the raw peak numbers.
6.2 Nested Factorization
Nested Factorization (NF) [7] is a preconditioning method built on the nested tridi-
agonal structure of the Jacobian matrix generated for a structured grid. There are
three levels in this nested structure: planes, lines, and cells. Correspondingly, when
applied to a grid with NPl×NLi×NCe cells (NPl: number of planes; NLi: number of
lines in each plane; NCe: number of cells in each line), the solution strategy of NF is:
1. Factorize/solve all two-dimensional planes i (1 ≤ i ≤ NPl) in the entire three-
dimensional domain. If we group the equations and variables of each plane into
a block, the matrix A has a block tridiagonal structure:
A = \begin{pmatrix}
D_1 & U_1 & 0 & \cdots & 0 \\
L_2 & D_2 & U_2 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & L_{N_{Pl}-1} & D_{N_{Pl}-1} & U_{N_{Pl}-1} \\
0 & \cdots & 0 & L_{N_{Pl}} & D_{N_{Pl}}
\end{pmatrix}. \qquad (6.1)
We use the following notation for this block tridiagonal structure:
A = \mathrm{TriDiag}_{i=1}^{N_{Pl}} \left( L_i, D_i, U_i \right), \qquad (6.2)
where Li contains the derivatives of equations of plane i with respect to variables
of plane i − 1, Di contains the derivatives of equations of plane i with respect
to variables of plane i, and Ui contains the derivatives of equations of plane i
with respect to variables of plane i+ 1. On the first level, the block tridiagonal
solution of A is serial.
2. Factorize/solve all one-dimensional lines j (1 ≤ j ≤ NLi) inside each two-
dimensional plane i. If we group the equations and variables of each line into a
block, the submatrix D_i also has a block tridiagonal structure (L_i is eliminated
using the solution of plane i − 1, whereas U_i is assimilated into D_i through an
approximation, e.g., relaxed row sum, column sum, or simply diagonal):

D_i = \mathrm{TriDiag}_{j=1}^{N_{Li}} \left( L_j^i, D_j^i, U_j^i \right), \qquad (6.3)

where L_j^i contains the derivatives of equations of line j with respect to variables
of line j − 1 (both in plane i), D_j^i contains the derivatives of equations of line j
with respect to variables of line j, and U_j^i contains the derivatives of equations
of line j with respect to variables of line j + 1. On the second level, the block
tridiagonal solution of D_i is again serial.
3. Factorize/solve all cells k (1 ≤ k ≤ N_{Ce}) within each one-dimensional line j.
The local submatrix D_j^i also has a tridiagonal structure (L_j^i is eliminated using
the solution of line j − 1, whereas U_j^i is again assimilated into D_j^i approximately):

D_j^i = \mathrm{TriDiag}_{k=1}^{N_{Ce}} \left( l_k^{i,j}, d_k^{i,j}, u_k^{i,j} \right), \qquad (6.4)

where l_k^{i,j} is the derivative of the equation of cell k with respect to the variable of cell
k − 1 (both in plane i, line j), d_k^{i,j} is the derivative of the equation of cell k with
respect to the variable of cell k, and u_k^{i,j} is the derivative of the equation of cell k with
respect to the variable of cell k + 1. On the third level, the tridiagonal solution of
D_j^i is usually serial.
When the two-stage CPR preconditioning strategy [88, 89] is applied, NF can
be used as the preconditioner for the first-stage pressure system, as well as for the
second-stage full Jacobian. When used as the pressure preconditioner, our experience
indicates that NF usually converges more slowly than AMG [76, 77].
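At the innermost level, the serial tridiagonal solve is the classical Thomas algorithm; a minimal reference sketch (ours, not the AD-GPRS implementation) follows.

```python
def thomas_solve(a, d, c, b):
    """Serial tridiagonal solve (Thomas algorithm), as used at the innermost
    (cell) level of NF. a = sub-diagonal (a[0] unused), d = diagonal,
    c = super-diagonal (c[-1] unused), b = right-hand side."""
    n = len(d)
    dd, bb = list(d), list(b)
    for k in range(1, n):              # forward elimination: cell k uses k - 1
        w = a[k] / dd[k - 1]
        dd[k] -= w * c[k - 1]
        bb[k] -= w * bb[k - 1]
    x = [0.0] * n
    x[-1] = bb[-1] / dd[-1]
    for k in range(n - 2, -1, -1):     # back substitution
        x[k] = (bb[k] - c[k] * x[k + 1]) / dd[k]
    return x
```

The forward sweep makes each cell depend on its predecessor, which is exactly why this level is serial.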
6.3 Massively Parallel Nested Factorization
The standard NF algorithm is serial. In order to extend the algorithm to GPU
systems, where thousands or more concurrent threads are required, significant modi-
fications are needed. The so-called Massively Parallel Nested Factorization (MPNF)
method was proposed by Appleyard et al. [8].
Two important concepts are introduced in MPNF: kernel and color. As defined
in [8], a kernel is a group of cells - treated as a unit - with an exact inverse. For
example, a row of cells in a model is a kernel, because the exact solution can be
obtained from a tridiagonal solve. Given a coloring strategy and the total number of
colors, each kernel is assigned a color, such that no adjacent kernels have the same
color. Then, by reordering the equations and variables first by color, then by kernel,
and last by cell, the MPNF solution process is as follows:
1. Factorize/solve all colors c (1 ≤ c ≤ Ncolor, Ncolor: the number of colors) in the
entire domain. By grouping the equations and variables of each color into a
block, the matrix A will have a block tridiagonal structure:
A = \mathrm{TriDiag}_{c=1}^{N_{color}} \left( L_c, D_c, U_c \right), \qquad (6.5)
where Lc contains the derivatives of the equations of color c with respect to the
variables of color c − 1, Dc contains the derivatives of the equations of color c
with respect to the variables of the same color, and Uc contains the derivatives
of the equations of color c with respect to the variables of color c+1. Therefore,
on the first level of MPNF, the block tridiagonal solution of A is still serial.
2. Factorize/solve all kernels r (1 ≤ r ≤ Nker(c), Nker(c): the number of kernels with
color c) that are assigned color c by the selected coloring strategy. Recall that
the coloring is such that no adjacent kernels have the same color. Consequently,
there will be no derivatives of the equations of one kernel r with respect to
variables of another kernel r′, given that kernels r and r′ have the same color
c. Thus, by grouping the equations and variables of each kernel into a block,
the submatrix Dc has a block-diagonal structure (Lc is eliminated using the
solution of color c − 1, whereas U_c is assimilated into D_c approximately):

D_c = \begin{pmatrix}
D_1^c & 0 & \cdots & \cdots & 0 \\
0 & D_2^c & & & \vdots \\
\vdots & & \ddots & & \vdots \\
\vdots & & & D_{N_{ker}(c)-1}^c & 0 \\
0 & \cdots & \cdots & 0 & D_{N_{ker}(c)}^c
\end{pmatrix}, \qquad (6.6)

where D_r^c contains the derivatives of equations of kernel r in color c with respect
to the variables of the same kernel. On the second level of MPNF, the solution
of D_c is parallel: for each color, we can employ as many concurrent threads as
there are kernels in that color (N_{ker}(c)). This is the computational
kernel in the algorithm where massive parallelism is exploited.
3. Factorize/solve all cells k (1 ≤ k ≤ N_{Ce}) within each kernel r. The innermost
level of MPNF is the same as that of NF. That is, the submatrix D_r^c has the
following tridiagonal structure:

D_r^c = \mathrm{TriDiag}_{k=1}^{N_{Ce}} \left( L_k^{c,r}, D_k^{c,r}, U_k^{c,r} \right), \qquad (6.7)

where L_k^{c,r}, D_k^{c,r}, and U_k^{c,r} have the same meaning as was explained in the third
level of NF. At this level, the tridiagonal solution of D_r^c is serial, or has limited
parallelism.
Now we discuss the coloring strategy. The basic strategy is the checkerboard
coloring, where each kernel is alternately assigned one of two colors, as shown in
Fig. 6.2(a). This is the simplest strategy and will yield a valid coloring of all kernels
for simple TPFA (Two-Point Flux Approximation) stencils. However, the resulting
preconditioner has poor convergence behavior. A more accurate preconditioner can
be obtained via a multicoloring strategy, such as cyclical (colors are assigned in the
order of 1, 2, ..., Ncolor−1, Ncolor, 1, 2, ...) and oscillatory coloring (colors are assigned
in the order of 1, 2, ..., Ncolor − 1, Ncolor, Ncolor − 1, Ncolor − 2, ...) [8]. In cyclical
coloring, because kernels with the last and first color are adjacent, the block matrix
on the outermost level deviates slightly from the tridiagonal structure (one additional
entry in the first and last block row). On the other hand, oscillatory coloring will
always yield a block tridiagonal matrix on the outermost level, because its first color
is only connected with the second color, and its last color is only connected with the
second last color. An example of a 4-color oscillatory coloring is shown in Fig. 6.2(b).
As stated in [8], oscillatory coloring usually produces a more accurate preconditioner
(and hence better convergence behavior), given the same number of colors.
Figure 6.2: Examples of different coloring strategies: (a) checkerboard coloring; (b) 4-color oscillatory coloring
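The two multicoloring assignments described above can be written down directly (an illustrative sketch; the function names are ours, and colors are 1-based as in the text).

```python
def cyclical_colors(n_kernels, n_colors):
    """Cyclical coloring: 1, 2, ..., Ncolor, 1, 2, ...
    Kernels with the last and first color end up adjacent."""
    return [k % n_colors + 1 for k in range(n_kernels)]

def oscillatory_colors(n_kernels, n_colors):
    """Oscillatory coloring: 1, 2, ..., Ncolor, Ncolor-1, ..., 2, 1, 2, ...
    Adjacent kernels always receive colors differing by exactly one, so the
    color-blocked matrix stays block tridiagonal."""
    period = 2 * (n_colors - 1)
    out = []
    for k in range(n_kernels):
        p = k % period
        out.append(p + 1 if p < n_colors else period - p + 1)
    return out
```

For Ncolor = 4, the oscillatory pattern 1, 2, 3, 4, 3, 2 repeats, so no wrap-around coupling between the first and last colors ever appears; the cyclical pattern wraps from 4 back to 1.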
For small problems, the strategy of assigning more than one thread to a kernel is
described briefly in [8]. This can be achieved using parallel tridiagonal solution meth-
ods, such as twisted factorization and cyclic reduction. The algorithmic descriptions
(with no GPU implementation details) of both methods are documented in [81].
It is reported in [8] that the speedup of the GPU-based MPNF-preconditioned
linear solver ranged from 5.7X to 14.3X for large problems (with 100,000, or more,
cells). Those results were obtained on a workstation with dual Intel Xeon X5550
CPUs (each CPU has four cores at 2.66GHz and 8MB cache), 12GB of RAM, and an
NVIDIA Tesla C2050 GPU (448 CUDA cores at 1.15 GHz, 3GB memory).
6.4 Our CUDA-based Implementation
6.4.1 Basic features
Based on the description above, we have redesigned and implemented the MPNF al-
gorithm based on CUDA. The key features of our GPU-based MPNF implementation
are:
• Support computation in single- and double-precision (denoted as SP and DP
below)
• Support checkerboard and three forms of oscillatory coloring strategies
• Use CUDA-based GMRES solver as an accelerator for the pressure system
• Use cuBLAS library [53] for BLAS (copy, scal, axpy, etc.) and reduction (dot
product) operations
• Use HYB (hybrid) format, which is a combination of EllPack [67] and COO
(coordinate list) format, in cuSPARSE library [57] for Sparse Matrix-Vector
multiplication (SpMV) in GMRES solver
• Support asynchronous memory transfer from CPU to GPU during the setup
(factorization) phase
The last feature is worth a bit more discussion. Asynchronous memory transfer is
used to hide the latency of memory transfer due to the limited bandwidth of the PCIe
bus, which connects the CPU and the GPU. Latency hiding is achieved by copying
the matrix entries of the next color, c + 1, from the CPU to the GPU, while the
current color, c, is being set up. NVIDIA GPUs have separate copy engines from the
processing engines; thus, they can handle asynchronous memory transfer while the
CUDA cores are busy computing. To ensure that the necessary data are ready before
the next matrix factorization, synchronization of the GPU device is applied, such
that the setup procedure of the next color, c + 1, will not begin until the following
conditions are satisfied:
• The contribution from the current color, c, to the next color, c + 1 (through
submatrix Uc), has been calculated
• Matrix entries of the next color, c + 1 (in submatrices Lc+1 and Dc+1), have
been transferred from the CPU to the GPU
6.4.2 Runtime profiles
The runtime profile of the MPNF preconditioner is obtained using the top 10 layers
of the SPE 10 reservoir model [19] with 60× 220× 10 (132 thousand) cells and two-
phase flow. Note that all of the results reported here are for the pressure system of
equations. The SP (Single-Precision) runtime profile is shown in Fig. 6.3(a).
We observe that the setup phase - Setup (factorization) and Create HYB matrix -
occupies a very small fraction (0.7% + 0.6%) of the total MPNF cost. In the solution
phase, the cost is dominated by BLAS (29.8%, mainly axpy) and reduction (36.3%,
dot product) operations, while the MPNF solution kernel takes less than 30% of the
total time. This is mainly because 1) the number of NF iterations in each pressure
solution is high (above 20, on average), and there are several pressure solutions per
overall CPR linear solution, so that the setup cost is usually much less than the
solution cost; 2) dot-product operations are not a good fit for the SIMT (Single
Instruction Multiple Thread) GPU architecture, because the number of participating
CUDA cores is halved for each reduction step. On the other hand, the computational
density (per memory access) is low for axpy operations. These operations will be
Figure 6.3: Runtime profile of the GPU-based MPNF preconditioner for solving the pressure system of the top 10 layers of the SPE 10 reservoir model in single-precision. (a) Basic implementation, total cost 112.3s: setup (factorization) 0.7%, create HYB matrix 0.6%, reorder/copy of B/X 1.0%, BLAS operations 29.8%, reduction (dot/norm) 36.3%, solve by preconditioner 28.0%, MV multiplication 3.6%. (b) With BiCGStab and customized reduction kernel, total cost 74.6s: setup (factorization) 1.1%, create HYB matrix 0.9%, reorder/copy of B/X 1.2%, BLAS operations 8.4%, reduction (dot/norm) 6.4%, solve by preconditioner 72.7%, MV multiplication 9.1%.
severely memory bound, such that the peak flops cannot be reached; and 3) most of
the axpy and dot-product operations are in the orthogonalization process of GMRES.
The number of such operations per NF iteration increases linearly with the number of
iterations (i + 1 operations for the i-th iteration). Thus, the cost of these operations
becomes significant when the number of NF iterations is large.
One possible way to resolve this problem is to use BiCGStab (BiConjugate Gradi-
ent Stabilized method, see [67,82]) as the accelerator. Although there is no guarantee
that the residual norm will decrease monotonically in every iteration, BiCGStab has
the advantage that the number of axpy and dot-product operations is fixed at five to
six (depending on the number of residual-norm evaluations) per iteration. Because
the number of MPNF iterations is usually much higher than ten for each pressure so-
lution, BiCGStab appears to be a better choice than GMRES, considering the savings
in axpy and dot-product operations.
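The operation-count argument can be made concrete with a rough model (the constants are approximate and implementation-dependent).

```python
def gmres_axpy_dot_ops(n_iters):
    """Rough count of axpy/dot-type vector operations in unrestarted GMRES:
    iteration i orthogonalizes against i previous basis vectors (about
    i + 1 vector operations), so the total grows quadratically."""
    return sum(i + 1 for i in range(1, n_iters + 1))

def bicgstab_axpy_dot_ops(n_iters, ops_per_iter=6):
    """BiCGStab: a fixed five to six axpy/dot operations per iteration
    (depending on how many residual-norm evaluations are performed)."""
    return ops_per_iter * n_iters
```

For the 20-plus NF iterations per pressure solve observed above, this model gives 230 GMRES-style vector operations versus about 120 for BiCGStab, which is the saving exploited here.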
Another consideration is to use the customized CUDA reduction kernel instead of
the most flexible one provided in cuBLAS. This is motivated by the parallel reduction
example in CUDA Samples [56], which are a set of code samples included with the
NVIDIA CUDA Toolkit. In the parallel reduction example, seven different versions
of reduction kernels have been implemented, with several tunable options. Using the
best available kernel (the 7th kernel with multiple elements assigned to each thread)
with proper setting (64 thread blocks at maximum, with 256 threads per block, us-
ing CPU for the final reduction of the aggregated results in all thread blocks), the
customized reduction kernel is able to provide up to 20% larger memory bandwidth
than the reduction operation in cuBLAS.
Using BiCGStab as an accelerator for MPNF along with the customized reduction
kernel, the updated runtime profile is shown in Fig. 6.3(b). It is observed that the
BLAS (8.4%) and reduction (6.4%) operations are no longer the bottlenecks. The
total cost decreases from 112.3s to 74.6s (a 33.6% reduction) and is now dominated
by the MPNF solution kernel (72.7%).
6.5 Coalesced memory access
In CUDA programs, ensuring coalesced memory access (see [54, 55] for details) is
critical for efficient utilization of the GPU resources. Coalesced memory organization
can be described as follows:
• Half-warp is the basic unit in the SIMT GPU architecture and contains 16
threads that always execute the same instruction at any time;
• For one instruction, if the participating threads in a half-warp follow a certain
access pattern (basically, contiguous memory or a permutation of it, see Fig.
6.4(a)), this costs the equivalent of 1 memory transaction. This is the most
efficient access pattern to GPU global memory;
• Otherwise, the memory access is noncoalesced (see Fig. 6.4(b)), and the in-
struction will cost 16 memory transactions.
Figure 6.4: Comparison of CUDA memory access patterns: (a) coalesced, 1 transaction (the half-warp reads, e.g., element 1 of kernels 1 to 16 contiguously); (b) noncoalesced, 16 transactions
Consider the following top-down ordering of the elements in diagonal submatrices
Dc: in each color, first by the kernel index, then by the element index in a kernel.
For a typical instruction that accesses the same element of all kernels in a color,
neighboring threads will not access contiguous memory. For example, when the first
element of all kernels is accessed, the access pattern follows what is shown in Fig.
6.4(b), where the interval between the accesses from two neighboring threads is the
number of elements in a kernel. To achieve coalescence, the matrix elements need
to be ordered in a reverse way. That is, in each color, first by the element index in
a kernel, then by the kernel index. With the reversed ordering, the access pattern
will follow exactly what is shown in Fig. 6.4(a), because the elements from different
kernels are now stored contiguously. Besides the matrix elements in Dc, the unknown
and right-hand-side (RHS) vectors are reordered in the same way.
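The two orderings amount to two index maps (an illustrative sketch with made-up sizes; 16 threads of a half-warp each load element 0 of 16 consecutive kernels).

```python
def kernel_major_index(kernel, element, n_elements):
    """Top-down ordering: first by kernel index, then by element in a kernel."""
    return kernel * n_elements + element

def element_major_index(kernel, element, n_kernels):
    """Reversed ordering: first by element index, then by kernel index."""
    return element * n_kernels + kernel

n_kernels, n_elements = 64, 40
# Addresses touched by one half-warp reading element 0 of kernels 0..15:
addrs_km = [kernel_major_index(k, 0, n_elements) for k in range(16)]  # strided
addrs_em = [element_major_index(k, 0, n_kernels) for k in range(16)]  # contiguous
```

With the top-down map the 16 accesses are strided by the kernel length (noncoalesced, 16 transactions); with the reversed map they hit contiguous addresses (coalesced, 1 transaction).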
The off-diagonal submatrices Lc and Uc have a block sparse structure with
each block row and column corresponding to one kernel. The maximum number
of nonzero block entries in each block row can be predetermined according to the
distribution of neighboring colors in the selected coloring strategy. Thus, the EllPack
format is used for Lc and Uc and the following top-down ordering was initially adopted:
first by block row (kernel) index, then by the index of block nonzero entry in a block
row (kernel), and finally by the element index inside a block nonzero entry (in diagonal
format). In the reverse ordering strategy, matrix elements in Lc and Uc are rearranged
as: first by the element index in a block nonzero entry, then by the index of the block
nonzero entry in a block row (kernel), and finally by the block row (kernel) index.
This ordering strategy yields a much more efficient access pattern to the GPU global
memory.
The updated runtime profile obtained using MPNF with coalesced memory access
is shown in Fig. 6.5. The cost of the preconditioner solution is driven down from
54.3s to 16.0s (a 70.5% reduction). Correspondingly, there is a further decrease in the
total MPNF cost from 74.6s to 35.4s (a 52.5% reduction). Now, the preconditioner
solution (45.1%) accounts for half of the total MPNF cost, while BLAS and reduction
operations, together with SpMV, account for the other half.
Figure 6.5: Runtime profile of the implementation with BiCGStab, customized reduction kernel, and coalesced memory access. Total cost reduced from 74.6s to 35.4s; solve cost reduced from 54.3s to 16.0s. Breakdown: setup (factorization) 1.9%, create HYB matrix 1.9%, reorder/copy of B/X 2.0%, BLAS operations 17.6%, reduction (dot/norm) 14.0%, solve by preconditioner 45.1%, MV multiplication 17.5%.
6.6 More parallelism: multiple threads in a kernel
As described in Section 6.3 and in [8], assigning more than one thread to a kernel
improves the performance for relatively small problems. This is motivated by the fact
that we usually need several times more threads than CUDA cores to fully utilize the
power of the GPU [54,55]. Because a Tesla M2090 GPU has 512 cores, we can make
full use of its power, if we have thousands of concurrent threads, or more. To analyze
this, we consider the following examples:
• A large reservoir model with 300×160×50 (2.4 million) cells. In a typical 4-color
oscillatory coloring, the number of kernels with the first color (or the last color)
is only about half the number of kernels with any intermediate color. Consequently,
the number of kernels per color is about 8000 ∼ 16000 for
this problem. This is sufficient to utilize the full power of the current generation
of GPUs.
• A relatively small reservoir model (upscaled SPE 10) with 30 × 110 × 42 (139
thousand) cells. Using the same 4-color oscillatory ordering, the number of
kernels per color is only about 550 ∼ 1100. Given the number of cores in the
current generation of GPUs, this number is far from sufficient to achieve optimal
parallel efficiency.
From the above comparison, we can say that the number of concurrent threads
is only sufficient for relatively large problems. To increase the number of concurrent
threads, there are several possible approaches.
6.6.1 Decrease the number of colors
Instead of four colors, we may use three colors or even two colors (checkerboard).
This will increase the number of threads per color up to N_x N_y / 2, assuming that
a kernel is a column of cells along the z-direction. However, as the number of colors decreases,
the accuracy of the preconditioner also degrades. This will usually deteriorate the
convergence rate and increase the number of NF iterations per pressure iteration.
Thus, in certain circumstances, it is possible that the increased parallelism may be
more than offset by the increased computational cost.
6.6.2 Twisted Factorization
The advantage of the Twisted Factorization (TF) [81] method is that it doubles the
number of concurrent threads, while keeping the same total computational cost. This
is achieved by assigning two threads to the tridiagonal linear system of each kernel
(D_r^c x_r^c = b_r^c) during the setup and solution phases. Instead of the standard LU
factorization, an alternative PQ factorization (D_r^c = P_r^c Q_r^c) is performed. For the
setup (factorization) phase, one sweep and one synchronization are needed:
1. Two threads set up the entries in the factorized matrices P_r^c and Q_r^c by sweeping
from both ends (first and last row) to the middle row concurrently
2. Thread synchronization
3. Exchange data to set up the entries in the middle row of the factorized matrix Q_r^c
For the solution phase, an additional sweep and an extra synchronization are
needed:
1. Two threads solve the upper and lower parts of the linear system P_r^c y_r^c = b_r^c
concurrently by sweeping from both ends to the middle row of the factorized matrix
P_r^c
2. Thread synchronization
3. Exchange data to obtain the solution for the middle element of y_r^c
4. Thread synchronization
5. Two threads solve the upper and lower parts of the linear system Q_r^c x_r^c = y_r^c
concurrently by sweeping from the middle row to both ends of the factorized matrix
Q_r^c
In order to maintain coalesced memory access, the two threads assigned to the
same kernel should be from different half-warps. That is, we may assign one half-warp
to sweep between the first and the middle row of 16 kernels, and another half-warp
to sweep between the last and the middle row of the same 16 kernels. Then, with
the reversed variable ordering discussed in Section 6.5, the memory access with TF
remains coalesced.
Because parallelism can be increased at no additional cost (i.e., the total compu-
tational cost is kept the same and the convergence rate is unaffected), this approach
is always desirable. However, it can only be applied once to each kernel, and thus at
most doubles the number of concurrent threads. This may still be insufficient for
certain small problems.
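The setup and solution sweeps above can be condensed into a serial reference sketch (our reconstruction from the algorithmic description in [81], not the CUDA kernel itself; the two halves that would belong to the two threads here simply run one after the other).

```python
import numpy as np

def twisted_solve(a, d, c, b):
    """Twisted (PQ) tridiagonal solve: two sweeps run from the first and last
    rows toward the middle row, exchange data there, then back-substitute
    outward. a = sub-diagonal (a[0] unused), d = diagonal, c = super-diagonal
    (c[-1] unused), b = right-hand side; requires len(d) >= 3."""
    n = len(d)
    m = n // 2
    dd, bb = np.array(d, float), np.array(b, float)
    # "Thread 1": eliminate sub-diagonal, sweeping down to the row above middle.
    for i in range(1, m):
        w = a[i] / dd[i - 1]
        dd[i] -= w * c[i - 1]
        bb[i] -= w * bb[i - 1]
    # "Thread 2": eliminate super-diagonal, sweeping up to the row below middle.
    for i in range(n - 2, m, -1):
        w = c[i] / dd[i + 1]
        dd[i] -= w * a[i + 1]
        bb[i] -= w * bb[i + 1]
    # Exchange at the middle row: fold in both neighbors, solve for x[m].
    x = np.zeros(n)
    dm = d[m] - a[m] * c[m - 1] / dd[m - 1] - c[m] * a[m + 1] / dd[m + 1]
    bm = b[m] - a[m] * bb[m - 1] / dd[m - 1] - c[m] * bb[m + 1] / dd[m + 1]
    x[m] = bm / dm
    # Back-substitute outward from the middle (again two independent sweeps).
    for i in range(m - 1, -1, -1):
        x[i] = (bb[i] - c[i] * x[i + 1]) / dd[i]
    for i in range(m + 1, n):
        x[i] = (bb[i] - a[i] * x[i - 1]) / dd[i]
    return x
```

The two elimination sweeps and the two back-substitution sweeps touch disjoint rows, which is what allows a second thread per kernel at the same total operation count.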
6.6.3 Cyclic Reduction
The basic idea of the Cyclic Reduction (CR) [33, 81, 95] method is to eliminate odd
and even numbered equations and variables separately and concurrently. This method
can be applied repeatedly and each time the number of concurrent threads is doubled.
However, the computational cost also increases. As stated in [33], CR yields 2.7 times
more operations than standard Gaussian elimination when it is applied the maximum
number of times on a tridiagonal matrix. We have not yet implemented CR, or its variant PCR
(Parallel Cyclic Reduction) [95], in our code base.
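For reference, the CR recursion can be sketched as follows (a serial numpy reconstruction from the literature, not part of our code base; the unused boundary entries a[0] and c[-1] are assumed to be zero).

```python
import numpy as np

def cyclic_reduction_solve(a, d, c, b):
    """Recursive cyclic reduction for a tridiagonal system. Each level
    eliminates the odd-indexed unknowns (independently, hence concurrently
    on a parallel machine), halving the system; the even-indexed unknowns
    are solved recursively and the odd ones recovered locally.
    Requires a[0] == 0 and c[-1] == 0 (unused boundary entries)."""
    a, d, c, b = (np.asarray(v, float) for v in (a, d, c, b))
    n = len(d)
    if n == 1:
        return b / d
    ev = np.arange(0, n, 2)                       # surviving even indices
    na, nc = np.zeros(len(ev)), np.zeros(len(ev))
    nd, nb = d[ev].copy(), b[ev].copy()
    for j, i in enumerate(ev):
        if i - 1 >= 0:                            # fold in odd row i - 1
            w = a[i] / d[i - 1]
            na[j] = -w * a[i - 1]
            nd[j] -= w * c[i - 1]
            nb[j] -= w * b[i - 1]
        if i + 1 < n:                             # fold in odd row i + 1
            w = c[i] / d[i + 1]
            nc[j] = -w * c[i + 1]
            nd[j] -= w * a[i + 1]
            nb[j] -= w * b[i + 1]
    x = np.zeros(n)
    x[ev] = cyclic_reduction_solve(na, nd, nc, nb)
    for i in range(1, n, 2):                      # recover odd unknowns
        rhs = b[i] - a[i] * x[i - 1]
        if i + 1 < n:
            rhs -= c[i] * x[i + 1]
        x[i] = rhs / d[i]
    return x
```

Each level roughly doubles the available concurrency at the price of extra arithmetic, consistent with the 2.7x operation count quoted above.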
6.7 More flexibility in the system matrix
6.7.1 Inactive cells
In heterogeneous reservoir models, some cells have very small pore volumes and are
thus labelled inactive. Their variables and equations are usually removed from the
system matrix during computation. In the MPNF algorithm, it is preferable to keep
the complete structure of the system matrix with both active and inactive cells. For
this purpose, we will create a mapping between the set of active cells, which is used
by all other parts of the simulator, and the set of all cells, which include both inactive
and active cells and is only used by MPNF.
This mapping procedure can be well integrated with the reordering process used in
MPNF to rearrange the cells into the top-down ordering (color → kernel → element)
or the reversed ordering as described in Section 6.5 (color → element → kernel).
During the setup phase of the MPNF matrix, the entries corresponding to the active
cells are directly copied from the original matrix, whereas the entries corresponding
to the inactive cells are not touched and their trivial initial values will be kept: 1 for
diagonal entries, and 0 for off-diagonal entries.
In the solution phase, for one factorized matrix, multiple RHS vectors will be given
to the MPNF algorithm. At the beginning of each solution process with a new RHS
vector borig, the corresponding RHS vector used by MPNF, b, is mapped from active
cells with original order to all cells with top-down or reversed order. Then, the GPU-based MPNF-preconditioned linear solver is called to compute the solution vector
x, which corresponds to all cells with top-down or reversed order. Finally x is mapped
back to xorig, which contains elements for active cells with original order. Because
our current implementation of MPNF is only used in the pressure preconditioning,
xorig can be taken by the CPR preconditioner as the first-stage solution. In this way,
the inactive cells can be handled seamlessly by the MPNF preconditioner.
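A minimal NumPy sketch of this idea, with an invented 5-cell example (3 active cells), shows how the trivial values keep the padded solve consistent with the active-cell system:

```python
import numpy as np

n_all = 5
active = [0, 2, 3]                      # cells 1 and 4 are inactive
A_act = np.array([[ 4.0, -1.0,  0.0],
                  [-1.0,  4.0, -1.0],
                  [ 0.0, -1.0,  4.0]])
b_act = np.array([1.0, 2.0, 3.0])

# setup phase: start from the trivial values (1 on the diagonal,
# 0 off-diagonal) and copy in the entries of the active cells
A_all = np.eye(n_all)
A_all[np.ix_(active, active)] = A_act

# solution phase: scatter the RHS (b_orig -> b), solve, gather back
b_all = np.zeros(n_all)
b_all[active] = b_act
x_all = np.linalg.solve(A_all, b_all)   # stands in for the MPNF-based solve
x_act = x_all[active]                   # x -> x_orig (all -> active cells)
```

Because the inactive rows and columns are decoupled identity rows with zero right-hand side, the active part of the padded solution coincides with the solution of the original active-cell system.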
6.7.2 Additional (e.g., well) equations
In general-purpose simulation, reservoir equations are usually coupled with some ad-
ditional (e.g., well) equations. Thus, the entire pressure linear system will have the
form:
\[
\begin{pmatrix} A_{p,RR} & A_{p,RW} \\ A_{p,WR} & A_{p,WW} \end{pmatrix}
\begin{pmatrix} x_R \\ x_W \end{pmatrix}
=
\begin{pmatrix} b_R \\ b_W \end{pmatrix},
\tag{6.8}
\]
where Ap,UV (U, V ∈ {R,W}) contains the derivatives of equations of U with respect
to variables of V . The Ap,RR part is the reservoir matrix that MPNF can already han-
dle. In order to solve the entire system with these additional equations, we consider
the following Schur-complement process:
\[
A'_{p,RR} = A_{p,RR} - A_{p,RW} A_{p,WW}^{-1} A_{p,WR}, \tag{6.9}
\]
\[
b'_R = b_R - A_{p,RW} A_{p,WW}^{-1} b_W. \tag{6.10}
\]
If the Schur-complement system A′p,RRxR = b′R can be solved with the MPNF-
preconditioned linear solver for the reservoir solution xR, then the well solution xW
can be obtained as:
\[
x_W = A_{p,WW}^{-1} \left( b_W - A_{p,WR} x_R \right). \tag{6.11}
\]
The issue with preconditioning the Schur-complement system using MPNF lies
in the structure of A′p,RR. For wells with multiple completions (i.e., having multiple
nonzero entries in one row of Ap,WR, or in one column of Ap,RW ), the resultant A′p,RR
will have additional nonzero entries introduced in the Schur-complement process.
These entries represent the synthetic connections between the multiple reservoir cells
with completions, which might not have been connected. For a typical vertical well
with multiple completions, the perforated cells are usually in one MPNF kernel that
contains a column of cells. The additional nonzero entries will turn the matrix of
this kernel from a tridiagonal matrix into a full matrix, which will greatly reduce the
parallel solution efficiency, even if there are only a few such full matrices.
The remedy is to use an approximated matrix in the MPNF preconditioner:
\[
A''_{p,RR} = A_{p,RR} - \operatorname{approx}\!\left( A_{p,RW} A_{p,WW}^{-1} A_{p,WR} \right), \tag{6.12}
\]
where the operator ‘approx’ can be a relaxed rowsum operator, or something similar, chosen such that A′′p,RR has the same structure as the original Ap,RR. Note that although the MPNF preconditioner solves the approximate Schur-complement problem
A′′p,RRxR = b′R, the accelerator (e.g., BiCGStab) still iterates on the exact Schur-
complement problem A′p,RRxR = b′R. Otherwise, the pressure solution will become
inaccurate such that the convergence rate of the CPR preconditioner is impacted
negatively.
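The following NumPy sketch (with made-up matrix values and an assumed single well completed in three cells) illustrates eqs. (6.9)–(6.12): the exact Schur complement fills in connections among the completed cells, while a rowsum-style lumping preserves the sparsity pattern of Ap,RR:

```python
import numpy as np

# invented example: 5 reservoir cells in a tridiagonal kernel, one well
# equation/variable, completions in cells 0, 2 and 4
A_RR = np.diag([4.0] * 5) + np.diag([-1.0] * 4, 1) + np.diag([-1.0] * 4, -1)
A_WW = np.array([[2.0]])
A_RW = np.zeros((5, 1)); A_RW[[0, 2, 4], 0] = -0.5
A_WR = np.zeros((1, 5)); A_WR[0, [0, 2, 4]] = -0.5

M = A_RW @ np.linalg.solve(A_WW, A_WR)    # A_RW A_WW^{-1} A_WR
A_exact = A_RR - M                        # eq. (6.9): couples cells 0, 2, 4
A_lumped = A_RR - np.diag(M.sum(axis=1))  # eq. (6.12), rowsum 'approx'

b_R = np.ones(5)
b_W = np.array([1.0])
b_prime = b_R - A_RW @ np.linalg.solve(A_WW, b_W)   # eq. (6.10)

x_R = np.linalg.solve(A_exact, b_prime)   # accelerator iterates on A_exact
x_W = np.linalg.solve(A_WW, b_W - A_WR @ x_R)       # eq. (6.11)
```

The lumped matrix keeps the original tridiagonal structure (so the kernel solves stay cheap), while the exact Schur complement introduces the synthetic connections among the three completed cells.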
6.8 Multi-GPU Parallelization
With the growth in problem size, a single GPU, even with 512 cores, may not be
able to solve problems of interest efficiently. This is due to limitations on both
computational power and memory capacity. Thus, multi-GPU parallelization of the
algorithm is employed to deal with very large problems. The basic idea is to partition
the entire domain into several subdomains and assign the computational tasks on
each subdomain to one GPU. Data need to be transferred between the subdomains
after the setup/solution of each color.
6.8.1 Partitioning
Due to the characteristics of peer-to-peer (P2P) data transfer between GPUs, the
partitioning is currently conducted in one dimension only, i.e., the dimension that
is not along the kernel direction and has the largest number of cells. As a result of
the one-dimensional partition, GPU g (0 ≤ g < NGPU , NGPU : number of GPUs)
only needs to communicate with GPU g − 1 (if g > 0) and g + 1 (if g < NGPU − 1).
Inside each subdomain, only the data from the cells at the domain edge need to
be transferred to another GPU, due to the interdomain connections. That is, GPU
g needs to transfer the data from its first layer (perpendicular to the partitioning
dimension) of cells to GPU g − 1 (if g > 0) and the data from its last layer of cells
to GPU g + 1 (if g < NGPU − 1). The region that contains the first and last layer
of cells is called the inner ‘halo’ region, and the region that contains the rest of cells
in a subdomain is called the interior region. The name ‘inner’ halo region is used
to differentiate from the ‘outer’ halo region, which corresponds to the layer of cells
immediately outside the subdomain. Each ‘outer’ halo region collocates with one
‘inner’ halo region of another GPU and is used to contain the data transferred from
that GPU.
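A sketch of this one-dimensional partitioning, in Python with invented names, computes each GPU's layer range, inner-halo layers, and neighbours:

```python
def partition_1d(n_layers, n_gpu):
    """Split n_layers grid layers into n_gpu contiguous slabs; return, per
    GPU, its layer range, inner-halo layers, and neighbour GPU ids."""
    base, extra = divmod(n_layers, n_gpu)
    parts, start = [], 0
    for g in range(n_gpu):
        size = base + (1 if g < extra else 0)
        first, last = start, start + size - 1
        halo = []
        if g > 0:
            halo.append(first)        # first layer is sent to GPU g-1
        if g < n_gpu - 1:
            halo.append(last)         # last layer is sent to GPU g+1
        nbrs = [h for h in (g - 1, g + 1) if 0 <= h < n_gpu]
        parts.append({"layers": (first, last), "inner_halo": halo,
                      "neighbours": nbrs})
        start += size
    return parts
```

With 10 layers on 3 GPUs, for instance, the slabs are layers 0–3, 4–6 and 7–9, and only the first and last layer of each interior slab belong to an inner halo.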
Figure 6.6: Example of partitioning with three GPUs (top view)
An example of partitioning with three GPUs is shown in Figure 6.6. The right
‘inner’ halo region of GPU 0 (in pink color) is the front ‘outer’ halo region of GPU
1. Correspondingly, the left ‘inner’ halo region of GPU 1 (in light green color) is the
back ‘outer’ halo region of GPU 0. A similar classification applies to the halo regions between GPUs 1 and 2.
6.8.2 Data transfer between GPUs
Based on the topology of multiple GPUs on a given host, there are two primary
approaches for data transfer between GPUs: left-right approach and pairwise ap-
proach [49]. Both approaches require two stages.
First we discuss the left-right approach. In the first stage, the data are transferred
from left to right, i.e., from the right inner halo region of g to the front outer halo
region of g + 1, for each GPU g (0 ≤ g < NGPU − 1). Then a synchronization is
performed until the first-stage data transfers are finished. In the second stage, the
data are transferred from right to left, i.e., from the left inner halo region of g to the
back outer halo region of g − 1, for each GPU g (1 ≤ g < NGPU). In this way, no
matter how many GPUs we have in the system, there will be no data contention in
the PCIe channel, as shown in Figure 6.7.
Figure 6.7: Illustration of the left-right data transfer approach: (a) first (left-to-right) phase; (b) second (right-to-left) phase (modified from [49])
Then we describe the pairwise approach. In the first stage, the data are transferred
between each even-numbered GPU and its next odd-numbered GPU, i.e., from the
right inner halo region of g to the front outer halo region of g+1, and from the left inner
halo region of g + 1 to the back outer halo region of g, for each even-numbered GPU
g (0 ≤ g < NGPU − 1, g is even). Then a synchronization is performed until the first-
stage data transfers are finished. In the second stage, the data are transferred between
each odd-numbered GPU and its next even-numbered GPU, i.e., performing the same
procedure as in the first stage, but for each odd-numbered GPU g (1 ≤ g < NGPU−1,
g is odd). The data contention in the PCIe channel can also be avoided if NGPU ≤ 4.
For NGPU > 4, there will be data contention in the second stage, as shown in Figure
6.8.
We choose the left-right approach because it has no data contention with an
arbitrary number of GPUs. See the bottom part of Figure 6.6. In this three-GPU
Figure 6.8: Illustration of the pairwise data transfer approach: (a) first (even-odd) phase; (b) second (odd-even) phase (modified from [49])
example, data are transferred from GPU 0 to GPU 1, and from GPU 1 to GPU 2
in the first stage, and then transferred from GPU 2 to GPU 1, and from GPU 1 to
GPU 0 in the second stage.
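The two schedules can be sketched as stage lists of (source, destination) transfers, where the transfers within a stage run concurrently (an illustrative reconstruction, not code from the simulator):

```python
def left_right_schedule(n_gpu):
    # every stage-1 transfer moves left-to-right and every stage-2 transfer
    # right-to-left, so no PCIe path carries traffic in both directions
    stage1 = [(g, g + 1) for g in range(n_gpu - 1)]
    stage2 = [(g, g - 1) for g in range(1, n_gpu)]
    return [stage1, stage2]

def pairwise_schedule(n_gpu):
    # each pair exchanges data in both directions within one stage
    stage1 = []
    for g in range(0, n_gpu - 1, 2):          # even g paired with g+1
        stage1 += [(g, g + 1), (g + 1, g)]
    stage2 = []
    for g in range(1, n_gpu - 1, 2):          # odd g paired with g+1
        stage2 += [(g, g + 1), (g + 1, g)]
    return [stage1, stage2]
```

Note that each GPU exchanges data with at most one partner per pairwise stage, whereas the left-right schedule relies on the uniform transfer direction, rather than on pair disjointness, to avoid contention.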
Note that one large chunk of data can usually be transferred more efficiently be-
tween two GPUs than many small pieces of data. Therefore, after the setup/solution
of each color, we should minimize the number of transactions for P2P data transfer
between GPUs. For top-down ordering of the matrix elements (in each color, first by
the kernel, then by the element in the kernel), the data to be transferred are already
grouped together if we order the kernels at the edge of subdomain consecutively.
Therefore, the data can be directly transferred after the computation. However, for
the reverse ordering scheme (in each color, first by the element in the kernel, then
by the kernel), the data to be transferred are scattered in the data chunk for each
element. Consequently, a preparation process that gathers the data from multiple places in the GPU memory into one large chunk in a preallocated buffer is needed between the computation and the data transfer.
6.8.3 Overlapping data transfer with computation
Similar to the idea of asynchronous memory transfer from CPU to GPU at the setup
stage, the P2P data transfer between GPUs can be overlapped with the computation
inside the GPU. Thus the data-transfer cost can be hidden, if it is always smaller
than the computational cost.
The overlapping strategy utilizes the partitioned regions in each subdomain. The
computation in the inner halo region and in the interior region must be performed separately. The inner halo region is usually very small compared with the interior region, and thus the computation in the inner halo region should finish in a relatively short time. Then we can start the computation in the interior region and, at the same time, the data transfer between GPUs. This is because the data to be transferred are those
in the inner halo region and have already been computed. The simultaneous kernel
launch and data transfer are achieved via streams. A typical assignment of tasks
in the solution phase is shown in Figure 6.9, in which all computational tasks are
performed on stream 1 whereas data transfer tasks are performed on stream 2. These
two streams are created for each GPU and thus we have a total of 2NGPU streams.
For this phase, stream 1 is usually fully occupied whereas stream 2 may idle during
certain periods of time.
Figure 6.9: Assignment of tasks to two streams in the solution phase
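A toy cost model (our own simplification, not taken from the implementation) captures this overlap: per color, the halo computation runs first on stream 1, after which the interior computation on stream 1 overlaps with the P2P transfer on stream 2, so each color costs the halo time plus the larger of the interior and transfer times:

```python
def solution_phase_time(colors):
    """colors: list of (halo, interior, transfer) costs per color."""
    return sum(halo + max(interior, transfer)
               for halo, interior, transfer in colors)

# transfer (2) < interior (10): fully hidden, total = 4 * (1 + 10)
hidden = solution_phase_time([(1, 10, 2)] * 4)
# transfer (12) > interior (10): the transfer is partially exposed
exposed = solution_phase_time([(1, 10, 12)] * 4)
```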
Assuming that the data-transfer cost is always smaller than the computational
cost of an interior region, the goal in the solution phase is really to minimize the
total computational cost of halo and interior regions. Given the fact that a large
number of concurrent threads is usually needed for good utilization of the GPU, the
computational efficiency in the halo region, which contains only a small number of
kernels, may be suboptimal. A possible way to resolve this problem is to temporarily
repartition the halo and interior regions inside a subdomain in this phase. That is,
we can assign more kernels to the halo region and fewer kernels to the interior region,
in order to achieve improved overall utilization of the GPU and thus a minimal total
computational cost. Note that no matter how the halo and interior regions are divided
here, only the data from kernels at edge of each subdomain (i.e., the initial definition
of the halo region) need to be transferred between GPUs.
For the setup phase, the asynchronous memory transfer from CPU to GPU will
also be performed on stream 2 and overlap with the computation (setup and rowsum)
on stream 1. As a consequence, the tasks are assigned as shown in Figure 6.10, in
which stream 1 and 2 have the same definitions as in Figure 6.9. Note that due to
the limited bandwidth of PCIe bus (especially on the path from CPU to I/O hub
shared by the multiple GPUs), the memory transfer from CPU to GPU is usually
the bottleneck in this phase. Therefore, stream 2, which is in charge of data transfer
from CPU to GPU and between GPUs, will be fully occupied, whereas stream 1 for
managing computational tasks will idle from time to time. This is different from the behavior in the solution phase.
6.8.4 Mapping between global and local vectors
As described in Section 6.7.1, at the beginning of each solution process with a new
RHS vector, there is a mapping from active cells with original order to all cells with
top-down or reversed order. At the end of the solution process, the solution vector is
mapped back in the opposite way.

Figure 6.10: Assignment of tasks to two streams in the setup phase
We extend that idea for multi-GPU parallelization of the algorithm. The mapping
at the beginning of the solution process is now a distribution of the vector elements,
i.e., from active cells in the global domain (i.e., including all subdomains) with original
order to all cells in one extended subdomain (i.e., the subdomain for one GPU plus
its outer halo region) with top-down or reversed order. In this mapping, the source is
(a part of) the same vector whereas the target is different for each GPU. As a result,
during the multi-GPU solution process, each GPU will only access its local vectors
plus the additional data received from the P2P transfer after the setup/solution of
each color.
Correspondingly, at the end of the solution process, there is an opposite mapping
for the local solution vector in each GPU. The opposite mapping is essentially an
assembly of the vector elements, i.e., from all cells in one extended subdomain with
top-down or reversed order to (a part of) the active cells in the global domain with
original order.
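The distribution and assembly mappings can be sketched in Python as scatter/gather operations over invented index lists (illustrative only; the real mappings also encode the top-down or reversed ordering):

```python
# per-GPU lists of global active-cell ids: owned cells in local order,
# followed by the outer-halo copies of neighbouring GPUs' cells
owned = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7]}
outer_halo = {0: [3], 1: [2, 6], 2: [5]}

def distribute(b_global):
    """Global RHS -> one extended local vector per GPU."""
    return {g: [b_global[i] for i in owned[g] + outer_halo[g]]
            for g in owned}

def assemble(x_local):
    """Local solutions -> global solution (owned entries only, so the
    duplicated outer-halo entries are never written back)."""
    x_global = [0.0] * sum(len(v) for v in owned.values())
    for g, ids in owned.items():
        for k, i in enumerate(ids):
            x_global[i] = x_local[g][k]
    return x_global
```

Distributing a vector and assembling it again is an exact round trip, since every global cell is owned by exactly one GPU.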
6.9 Parallel Benchmark
6.9.1 Single-GPU test case 1: upscaled SPE 10
We first consider a relatively small test problem based on the upscaled SPE 10 reservoir model with 30 × 110 × 42 (139 thousand) cells. The focus is on the performance
of the pressure preconditioner. We use a simple two-phase, two-component fluid. The
4-color oscillatory coloring is used as the default strategy. To establish a basis for
comparison, the OpenMP-based MPNF algorithm run on one, or multiple, CPU cores
was implemented. The time used by MPNF for a single CPU core is treated as the
baseline (1X speedup) of the benchmark. The test was conducted on a Dell Pow-
erEdge C410x PCIe Expansion Chassis with dual Intel Xeon X5660 CPUs (6 cores
at 2.80GHz, 12MB cache), 192GB of RAM, and an NVIDIA Tesla M2090 GPU (512
CUDA cores at 1.3GHz, 6GB memory).
The performance results are shown in Fig. 6.11. The left two columns represent
parallel speedups of SP and DP respectively, whereas the right column shows the ratio
of the minimum number of concurrent threads in a color to the number of CUDA
cores (512 for M2090), denoted as RT/C .
Figure 6.11: Performance results for the upscaled SPE 10 problem with 139 thousand cells
From Fig. 6.11 we observe that speedups of more than 6X are obtained for both SP and DP using 8 CPU cores, whereas the basic GPU-based MPNF (accelerated by BiCGStab with a customized reduction kernel and noncoalesced memory access) achieves only a 2.4X speedup for SP and 3.2X for DP. However, when we apply coalesced memory access to the GPU-based MPNF, its speedups rapidly increase to 8.5X for SP (+252%) and 7.9X for DP (+148%), which are already higher than those of 8 CPU cores.
Note that for this problem, when using 4-color oscillatory coloring without TF the
value of RT/C is only 1.1, which is far from sufficient. Using TF, RT/C is increased
to 2.1, which is still too low, and the speedups show a corresponding boost to 12.7X
for SP and 11.0X for DP. Next we investigate the impact of decreasing the number
of colors. Using three colors, RT/C is further increased to 3.2, and the speedups
correspondingly increase despite the increased number of NF iterations. When the
number of colors is further decreased to 2, RT/C becomes 6.4, which is still not enough
to fully utilize the power of the GPU. We see a further increase in the speedups to
17.1X for SP and 13.9X for DP, respectively. With Cyclic Reduction (CR), we can
exploit more parallelism and possibly achieve further improvement for this relatively
small problem.
6.9.2 Single-GPU test case 2: full SPE 10
Now we consider a larger test problem based on the full SPE 10 reservoir model with 60 × 220 × 85 (1.1 million) cells. The fluid model, default coloring strategy, and the
hardware configuration of both CPU and GPU are the same as in the previous test
problem.
The performance results are shown in Fig. 6.12. For this problem, the speedups
of the basic GPU-based MPNF with noncoalesced memory access (SP: 3.0X; DP:
3.1X) are again lower than those of the CPU-based MPNF using 8 cores (SP: 6.0X;
DP: 4.1X). However, when applying the coalesced memory access, there is significant
improvement in the performance of the GPU-based MPNF. The coalesced speedups
(SP: 22.1X; DP: 18.2X) are about 6 to 7 times those of their noncoalesced counterparts.
For this problem, RT/C is about 4.3 for the basic GPU-based MPNF. This is not
sufficient to achieve optimal performance. Using TF, RT/C is increased to 8.6, which
is marginally acceptable. We see a corresponding increase in the speedups to 26.5X
(SP) and 18.7X (DP), respectively. Next, we analyze the impact of the number of
colors. First, if we decrease the number of colors to three, or even two (checkerboard),
although the value of RT/C further increases to 12.9 and 25.8, respectively, the limited
increase in the parallelism (RT/C = 8.6 is already acceptable) is more than offset by
the increased number of NF iterations. On the other hand, if we increase the number
of colors to five or six, a potentially more accurate preconditioner can be obtained.
However, the value of RT/C becomes too low (6.4 for five colors, and 5.2 for six colors)
to maintain good utilization of the GPU. As a result, the overall speedup with five
or six colors is lower than that with four colors. That is, 4-color oscillatory coloring
with TF yields the best performance for this problem.
Figure 6.12: Performance results for the full SPE 10 problem with 1.1 million cells
6.9.3 Multi-GPU test case 1: 8-fold refinement (2 by 2 by 2)
of SPE 10
In order to test the multi-GPU parallelization of the MPNF algorithm, the problem
size needs to be large enough to obtain good utilization of the massive computational
power provided by the thousands of CUDA cores. Thus, we build a test case based
on a refined SPE 10 reservoir model with 120 × 440 × 170 (8.8 million) cells. The
fluid model, default coloring strategy, and the CPU hardware configuration are the
same as in the single-GPU test cases. Four NVIDIA Tesla M2090 GPUs are used in
this test case, bringing a total of 4 × 512 = 2048 CUDA cores and 4 × 6 GB = 24 GB of memory.
For this test case, we use only the optimal setting (4-color oscillatory ordering,
with coalesced memory access and twisted factorization) for the MPNF algorithm
and still use the run time needed by a single core of X5660 CPU as the baseline.
We simulate the test case with one, two, three, and four GPUs, respectively. The
cost of data preparation (reordering of matrix and vector elements) on the CPU is
not included in the total cost shown here, because this part cannot be accelerated by
multiple GPUs. This part of the cost will be relatively small when the entire linear
solver is parallelized on GPU (such that the RHS and solution vector will not need to
be transferred back and forth between CPU and GPU, and all the reordering tasks
can be performed on GPU).
The performance results are shown in Figure 6.13. We observe that with a single
GPU, the speedup (SP: 29.2X; DP: 19.3X) is quite similar to that obtained using
the full SPE 10 model (SP: 26.5X; DP: 18.7X), although RT/C has a much higher
value, namely 34.4, for the above mentioned optimal setting. This further validates
our interpretation that RT/C = 8.6, which was obtained using the full SPE 10 model
with the same setting, is already acceptable for good utilization of the GPU. For SP,
the (1024-core) 2-GPU solution has a speedup of 53.5X compared with the single-core
solution and is 1.8X faster than the single-GPU MPNF solution. Correspondingly,
for DP, a speedup of 36.2X is obtained with two GPUs and this is 1.9X faster than
on a single GPU. The speedup numbers for (1536-core) 3-GPU solution are 71.1X for
SP and 50.5X for DP, which are 2.4X and 2.6X faster than the single-GPU solution,
respectively. For the (2048-core) 4-GPU solution, we achieve 82.3X speedup for SP
and 61.7X speedup for DP. Using DP, the 4-GPU solution is more than three times
faster than the single-GPU solution.
Figure 6.13: Performance results for the refined SPE 10 problem with 8.8 million cells
Here, we analyze the reason why the speedup of multi-GPU solution versus single-
GPU solution deviates from linear scalability. Although the P2P data-transfer cost
is hidden by overlapping the transfer with the computation, the total computational
cost, with a separation between the inner halo and interior region, will be higher
than that without a separation. This is due to the decrease in the effective RT/C ,
which will be substantially lower than 34.4/NGPU for NGPU ≥ 2, and thus an overall
good utilization of all GPUs cannot be achieved when NGPU increases. Moreover,
the data-transfer cost from CPU to a single GPU will be more or less the same as
that from CPU to multiple GPUs, because the bandwidth of the PCIe bus from CPU
to the I/O hub is limited and shared by the multiple GPUs. With the evolution of the PCIe standard (e.g., the bandwidth of PCIe 3.0 is 16 GB/s, twice that of PCIe 2.x), this limitation will be alleviated.
6.9.4 Multi-GPU test case 2: 24-fold refinement (4 by 3 by
2) of SPE 10
As analyzed in the last test case, one major constraint on the efficiency of multi-
GPU solution is the reduced RT/C , which represents the level of parallelism. Even for
the refined SPE 10 model with 8.8 million cells, the effective RT/C will not be large
enough when the MPNF solution is performed on several GPUs. Here, a further
refined SPE 10 model with 240 × 660 × 170 (26.9 million) cells is generated to test the
scalability of multi-GPU solution for extremely large models. This corresponds to a
4X, 3X, and 2X refinement in the X, Y, and Z directions, respectively. To further
simplify the computation on the CPU side, a two-phase black-oil fluid model is used
in this test case. The default coloring strategy and the CPU hardware configuration
are the same as in the single-GPU test cases. Up to 6 NVIDIA Tesla M2090 GPUs
are used, giving a total of 6 × 512 = 3072 CUDA cores and 6 × 6 GB = 36 GB of memory.
Similar to the previous multi-GPU test case, we use only the ‘optimal setting’ for
the MPNF algorithm and take the run time used by a single core of X5660 CPU as
the baseline. We simulate this refined SPE 10 model with one to six GPUs for SP,
and two to six GPUs for DP. This is because, when DP is used, even the pressure system (about 8 GB) is too large to fit in one GPU. In the results, in order to compare the multi-GPU solution efficiency conveniently, we still include the
1-GPU DP speedup, which is estimated from the 1-GPU SP run time and the typical
run-time ratio between DP and SP. For the same reason, as discussed in the last test
case, the cost of data preparation on the CPU is not included in the total cost.
The performance results are shown in Figure 6.14. We observe that with a single
GPU, the SP speedup (31.9X) of this 24-fold refined version of the SPE 10 model
is only slightly better than that of the refined SPE 10 model (29.3X). Although
Figure 6.14: Performance results for the further refined SPE 10 problem with 26.9 million cells
RT/C here is as high as 103.1, the improvement in speedup is minor because the GPU utilization is already saturated. For SP, the (3072-core) 6-GPU solution reaches a speedup
of 169.9X, which is 5.3 times faster than the 1-GPU solution and 40.8 times faster
than the OpenMP-based MPNF using all 12 CPU cores. On the CPU cores, MPNF
is severely limited by memory bandwidth and has an SP speedup of only 4.2X.
Correspondingly, for DP, the 6-GPU solution (111.2X) is also 5.3 times faster than
the estimated 1-GPU solution time and about 34 times faster than the 12-CPU-core
solution (3.3X). Note that in this extremely large problem where the parallelism is
abundant, the multi-GPU solution efficiency (in terms of the speedup over 1-GPU
solution) of SP can be as good as that of DP. With 3 or more GPUs, the scalability
in this case (26.9 million cells) is considerably better than that in the refined SPE 10
model (8.8 million cells) where effective RT/C drops below the acceptable range. We
should also note that in terms of absolute time, the cost per pressure solve (the
MPNF-preconditioned BiCGStab solver with a relative tolerance of 0.2) is below one
second for SP solution on 4 or more GPUs, and quite close to one second for DP
solution on 6 GPUs.
6.10 Discussion
For CPU parallelization, one thread per core (or up to two if HyperThreading is used)
is ideal for achieving optimal performance. However, it is critical to have many more
threads per CUDA core in our GPU-based MPNF method for ideal performance. This
is closely related to the SIMT architecture and the fast switching of thread
warps on the GPU. If we have large numbers of thread warps, we can always switch
to the idle warps when some active warps get stalled (e.g., waiting for the completion
of a memory transaction). In this way, the latency of memory transactions can be
effectively hidden, so that the GPU memory bus can be saturated with a large number of in-flight memory transactions.
We can utilize the full power of the GPU (i.e., reach the asymptotic limit) only
when the number of threads per core (RT/C) is sufficiently large. Based on our
analysis, 8 < RT/C < 10 allows for good utilization of the GPU; if 10 < RT/C < 20,
the full power of the GPU can be utilized. However, if we push this number too far
(RT/C ≫ 20) by feeding extremely large problems to the algorithm, or by applying
the Cyclic Reduction (CR) too many times, no further improvement will be obtained.
That is, the asymptotic limit of the algorithm is reached under the current GPU
architecture.
When our GPU-based MPNF is applied to both stages of the CPR preconditioner,
the GPU memory size can be the major constraint on the size of the problem we can
solve. This is because the size of each block nonzero entry grows from 1 × 1 to nc × nc, where nc is the number of components. For the pressure system of the full SPE 10 problem,
MPNF costs about 300MB of GPU memory, in which the major part is used for
the storage of the elements in the original and factorized matrices. If we were to apply MPNF to both the pressure and primary systems with nc = 6, the algorithm would require about 10 GB of GPU memory, which is already beyond
the memory capacity (6GB) of an NVIDIA Tesla M2090 GPU. This is part of the
motivation for extending the algorithm to support multiple GPUs, so that a larger
total amount of GPU memory is available.
6.11 Future Directions
Based on the current features of our MPNF implementation, two major directions are considered for future work. First, we may apply MPNF to both stages of the CPR preconditioner with a global CUDA-based BiCGStab/GMRES accelerator. This extension is necessary to achieve a good overall speedup of the entire linear solution, rather than of the pressure solution only. However, for the best efficiency, all the steps in the linear solution process (not just the preconditioning) should be performed on the GPU platform, which poses many design and implementation challenges.
Here, several important issues that must be addressed in this extension are listed:
• Coalesced memory access. In order to maintain the optimal access pattern
to GPU global memory, the elements inside each block nonzero entry of size
nc × nc cannot be stored together. Otherwise, because each block entry is
handled by only one thread, neighboring threads would no longer read or write
contiguous memory addresses, and the access pattern would be noncoalesced.
Therefore, the suggested reversed ordering for matrix and vector elements is:
within each color, first by the element index in each block entry, then by the
block entry index in each kernel, and finally by the kernel index. With this
ordering, coalesced memory access is maintained.
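The effect of this reversed ordering can be sketched with a small index-mapping function. This is an illustrative sketch under our own naming (flatIndex, entriesInColor), not the AD-GPRS code: with the element index as the slowest-varying key and the block-entry index as the fastest-varying one, the threads handling block entries t and t+1 touch adjacent addresses for every element position, which is exactly the coalescing condition.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch (not the AD-GPRS implementation): flattened storage
// index for the suggested reversed ordering within one color. For a fixed
// element position inside the nc x nc block, consecutive block entries
// occupy consecutive addresses. Since thread t handles block entry t,
// neighboring threads touch contiguous memory and the accesses coalesce.
std::size_t flatIndex(std::size_t elem,          // element index in a block entry, 0..nc*nc-1
                      std::size_t entry,         // block-entry index within the color
                      std::size_t entriesInColor // number of block entries in this color
) {
    // Element index is the slowest-varying key; the block-entry index
    // (enumerating entries kernel by kernel) is the fastest-varying.
    return elem * entriesInColor + entry;
}
```

Storing the nc × nc elements of one block contiguously instead would put a stride of nc² between the addresses touched by neighboring threads, breaking coalescing.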
• Block SpMV. When solving the pressure system, the HYB format in the cuSPARSE
library is an efficient choice for SpMV. However, the current version of cuSPARSE
only provides a pointwise HYB format, which is not ideal for the block
sparse structure of the primary matrix. Therefore, we need to utilize a block
sparse matrix format (HYB is still preferred) provided by another GPU-based
library, or implement our own data structure and associated GPU kernels for
block SpMV.
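As a reference for what such a custom block-SpMV kernel must compute, here is a hedged CPU sketch using a block-CSR layout; the BlockCSR struct and spmv function are our own illustrations, not cuSPARSE or AD-GPRS API, and a GPU version would additionally reorder the values as discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a block SpMV y = A*x for a BCSR-like layout with block size
// nc x nc (row-major inside each block). This is a CPU reference for what
// a custom GPU kernel would compute.
struct BlockCSR {
    std::size_t nBlockRows;
    std::size_t nc;                  // block dimension
    std::vector<std::size_t> rowPtr; // size nBlockRows+1
    std::vector<std::size_t> colInd; // block-column index per stored block
    std::vector<double> val;         // nc*nc values per block, row-major
};

std::vector<double> spmv(const BlockCSR& A, const std::vector<double>& x) {
    std::vector<double> y(A.nBlockRows * A.nc, 0.0);
    for (std::size_t br = 0; br < A.nBlockRows; ++br)
        for (std::size_t b = A.rowPtr[br]; b < A.rowPtr[br + 1]; ++b) {
            const double* blk = &A.val[b * A.nc * A.nc];
            const std::size_t bc = A.colInd[b];
            for (std::size_t i = 0; i < A.nc; ++i)
                for (std::size_t j = 0; j < A.nc; ++j)
                    y[br * A.nc + i] += blk[i * A.nc + j] * x[bc * A.nc + j];
        }
    return y;
}
```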
• Schur-complement process for generating the primary system J_RR^schur that in-
cludes only reservoir equations and variables. In the augmented primary system
that includes both reservoir and all facility equations and variables, the facilities
submatrix J_FF is no longer diagonal. It is block diagonal, and the structure of
each block, which corresponds to a facility model, can be very different. Thus
it is usually hard to obtain its inverse J_FF^{-1}. As a result, the Schur complement
J_RR^schur = J_RR − J_RF J_FF^{-1} J_FR
cannot be directly computed. For the second-stage preconditioning of the
overall system, J_RR can be used to approximate J_RR^schur, whereas for block
SpMV, we can obtain the product as
J_RR^schur · x = J_RR · x − J_RF · (J_FF^{-1} (J_FR · x)), (6.13)
where y = J_FF^{-1} (J_FR · x) is calculated by solving
J_FF · y = J_FR · x. (6.14)
Considerable effort is expected to make all of these steps perform efficiently
on the GPU.
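The matrix-free product of Eqs. (6.13) and (6.14) can be illustrated with small dense matrices. This is a toy sketch under our own names, with Gaussian elimination standing in for the per-facility block solves; the point is that J_FF is never inverted explicitly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Solve A y = b by Gaussian elimination with partial pivoting (toy sizes);
// in practice this would be a structure-exploiting per-facility solve.
Vec solve(Mat A, Vec b) {
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::abs(A[i][k]) > std::abs(A[p][k])) p = i;
        std::swap(A[k], A[p]);
        std::swap(b[k], b[p]);
        for (std::size_t i = k + 1; i < n; ++i) {
            const double f = A[i][k] / A[k][k];
            for (std::size_t j = k; j < n; ++j) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }
    Vec y(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= A[i][j] * y[j];
        y[i] = s / A[i][i];
    }
    return y;
}

// J_RR^schur * x = J_RR*x - J_RF * solve(J_FF, J_FR*x)   (Eqs. 6.13-6.14)
Vec schurProduct(const Mat& JRR, const Mat& JRF,
                 const Mat& JFF, const Mat& JFR, const Vec& x) {
    Vec y = solve(JFF, matvec(JFR, x)); // Eq. (6.14): J_FF y = J_FR x
    Vec t = matvec(JRF, y);
    Vec r = matvec(JRR, x);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] -= t[i];
    return r;
}
```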
Second, we may extend the algorithm to account for off-band coefficients on the
outermost level, in order to accommodate matrices constructed from 2.5-dimensional
unstructured grids with more flexible coloring strategies or MPFA discretization
schemes. This extension has been discussed in [6]. As long as the reservoir model is
still divided into kernels (e.g., columns of cells) of similar sizes, and a viable col-
oring strategy (no neighboring kernels share the same color) is available, the MPNF
algorithm imposes no requirement on the grid structure of the two-dimensional domain
perpendicular to the kernel direction. A possible coloring strategy for 2.5-dimensional
unstructured grids is described in [6]. More colors (e.g., 6) may be needed to handle
the coloring on such grids.
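One simple way to obtain such a coloring is greedy first-fit coloring of the kernel adjacency graph. The sketch below is illustrative (the function name and graph representation are ours, and it is not necessarily the strategy of [6]); it only guarantees the MPNF requirement that no two neighboring kernels share a color.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy first-fit coloring of the 2D kernel adjacency graph: each kernel
// (column of cells) gets the smallest color unused by its already-colored
// neighbors. On unstructured 2.5D grids this may need more colors than the
// structured case.
std::vector<int> colorKernels(const std::vector<std::vector<std::size_t>>& adj) {
    std::vector<int> color(adj.size(), -1);
    for (std::size_t k = 0; k < adj.size(); ++k) {
        std::vector<bool> used(adj.size() + 1, false);
        for (std::size_t n : adj[k])
            if (color[n] >= 0) used[color[n]] = true;
        int c = 0;
        while (used[c]) ++c; // smallest color unused by neighbors
        color[k] = c;
    }
    return color;
}
```

In practice the kernel ordering matters: a poor visiting order can use more colors than necessary, and fewer colors mean more concurrency per color.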
Besides the coloring strategy, the definition of off-diagonal submatrices on the first
level of MPNF, Lc and Uc, needs to be extended. For SpMV, they may need to access
vector elements from more than one color, although most of the elements are still
expected to be from the previous color for Lc, and from the next color for Uc.
6.12 Concluding Remarks
In this chapter, we describe a parallel NF linear solver especially suited for GPUs,
building on the Massively Parallel NF framework described by Appleyard et al. [8].
Our GPU-based implementation of MPNF supports asynchronous memory transfer,
utilizes a CUDA-based BiCGStab solver as an accelerator, and computes dot products
with a customized, higher-bandwidth reduction kernel. The
most important features of our approach include: 1) special ordering of the matrix
elements that maximizes coalesced access to GPU global memory, 2) application of
“twisted factorization”, which increases the number of concurrent threads at no addi-
tional cost and maintains coalesced memory access, and 3) extension of the algorithm
for systems with multiple GPUs by first solving the halo region in each GPU and
overlapping the peer-to-peer memory transfer between GPUs with the solution of the
interior regions for each color in the solution sequence.
The GPU-based MPNF linear solver is demonstrated using several large problems,
and we break down the performance details of all the algorithmic components. For the
full SPE 10 model on a 512-core Tesla M2090 GPU, our implementation achieves a
speedup of 26 for single-precision and 19 for double-precision computations compared
with a single core of the Xeon X5660 CPU. Moreover, the 6-GPU (3072 cores) solution
of a highly refined SPE 10 model (26.9 million cells) is more than five times faster
than the single-GPU solution.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
The development of a parallel general-purpose numerical simulation framework is the
subject of this dissertation. The Automatic-Differentiation General-Purpose Research
Simulator (AD-GPRS) described here serves as a flexible, extensible, and efficient re-
search platform for coupled reservoir models and advanced (e.g., multisegment) wells.
The ADETL library, which was first developed by Younis [92, 93] and extended by
Zhou [96], provides a core component of the infrastructure for this powerful reservoir
simulation platform. This AD capability allows us to write only the nonlinear discrete
residual code and obtain the Jacobian matrix automatically and efficiently. The
automatic generation of the Jacobian using AD substantially enhances the flexibility
and extensibility of the simulation platform. Moreover, our generic and modular code
design facilitates the addition of new capabilities, including additional flow processes
and numerical methods, to the simulator.
For flexible reservoir modeling, AD-GPRS employs MPFA spatial discretization
schemes and a multilevel AIM strategy for time discretization. A unified code-base,
which works for TPFA or MPFA in space, and for any combination of FIM, AIM,
IMPES, and IMPSAT in time, has been developed. The object-oriented C++
implementation has minimal code duplication. Owing to the generic design and
implementation across the nonlinear and linear levels, the new discretization framework
in both space and time is compatible with all the other functionality in the
simulator. Moreover, the AD-based simulation capability is demonstrated using challenging
compositional problems with strong nonlinearity and large-scale highly heterogeneous
reservoir models, discretized using TPFA and various MPFA schemes. Any model can
be run using FIM or AIM. Our results clearly indicate that MPFA-based compositional
simulations are significantly more accurate than TPFA computations, especially
for nonorthogonal and unstructured grids; this also applies to cases with
full-tensor permeability. In addition, the multilevel AIM strategy reduces the overall
simulation cost, along with the level of numerical dispersion.
The linear solver framework is a very important component of AD-GPRS. Of
the two linear systems offered by AD-GPRS, the CSR representation is based on
the public CSR matrix format, which has a very simple structure and is thus easy
to understand. This simple structure, however, limits the efficiency of the associated
linear solvers and makes the representation unsuitable for large problems. The
more efficient option is the block linear system based on the customized MLBS matrix
format, which has a much more sophisticated structure. Based on a hierarchical
storage system, MLBS satisfies the requirements of a well-designed data structure
(accessibility, encapsulation, extensibility, and computational efficiency). For any
new submatrix type added to MLBS, no requirement is imposed on its internal structure,
as long as the matrix operations share common interfaces and are implemented
in a consistent way. With this extensible design, model developers have sufficient
flexibility to create new matrix formats and associated solution strategies that are
suitable for each individual facility model of AD-GPRS. This also applies to other
new features that change the structure of the global Jacobian. To solve the block
linear system efficiently, a two-step algebraic reduction is applied first to obtain the
smallest implicit system. With the CPR-based two-stage preconditioning strategy,
an iterative Krylov subspace solver (e.g., GMRES) is then used to solve the implicit
system. Finally, a two-step explicit update is used to recover the full solution from the
implicit solution. This solution strategy offers much higher efficiency than using a
single-stage preconditioner (e.g., BILU), let alone schemes based on the (pointwise)
CSR linear system.
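The algebra of the two-stage CPR apply can be sketched on a toy dense system. In this illustration (our own code, not AD-GPRS; Jacobi sweeps stand in for both the AMG pressure solve and the BILU smoother, and the restriction simply selects pressure rows rather than using the true CPR weighting), stage one solves the restricted pressure system and prolongs the result, and stage two smooths the remaining full residual.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// Plain Jacobi sweeps, standing in for AMG (stage 1) and BILU (stage 2).
Vec jacobi(const Mat& A, const Vec& b, int sweeps) {
    Vec x(b.size(), 0.0);
    for (int s = 0; s < sweeps; ++s) {
        Vec xn = x;
        for (std::size_t i = 0; i < b.size(); ++i) {
            double r = b[i];
            for (std::size_t j = 0; j < b.size(); ++j)
                if (j != i) r -= A[i][j] * x[j];
            xn[i] = r / A[i][i];
        }
        x = xn;
    }
    return x;
}

// Two-stage CPR-style apply: x = s + M2^{-1}(r - A*s), s = P * App^{-1} * R*r.
Vec cprApply(const Mat& A, const std::vector<std::size_t>& pIdx, const Vec& r) {
    // Stage 1: restrict to the pressure rows/columns and solve approximately.
    Mat App(pIdx.size(), Vec(pIdx.size()));
    Vec rp(pIdx.size());
    for (std::size_t i = 0; i < pIdx.size(); ++i) {
        rp[i] = r[pIdx[i]];
        for (std::size_t j = 0; j < pIdx.size(); ++j)
            App[i][j] = A[pIdx[i]][pIdx[j]];
    }
    Vec xp = jacobi(App, rp, 50);
    Vec s(r.size(), 0.0);
    for (std::size_t i = 0; i < pIdx.size(); ++i) s[pIdx[i]] = xp[i]; // prolong
    // Stage 2: smooth the remaining full residual r - A*s and correct.
    Vec r2(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) {
        r2[i] = r[i];
        for (std::size_t j = 0; j < r.size(); ++j) r2[i] -= A[i][j] * s[j];
    }
    Vec c = jacobi(A, r2, 5);
    for (std::size_t i = 0; i < r.size(); ++i) s[i] += c[i];
    return s;
}
```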
For accurate and unified modeling of multiphase flow in wellbores and surface
(pipeline) networks, AD-GPRS supports a generalized MS well model. With a flexible
discretization into nodes and connections, the general MS well model has a series of
advantages over the original MS well model: 1) general branching: the well segments
and surface pipelines can fork and rejoin at any place such that very complex geometry
can be created, 2) loops with arbitrary flow directions: the actual flow directions
are determined at run time and can change from iteration to iteration, such that a
loop may be defined without its flow directions being prespecified, 3) multiple exit
connections with different constraints, and 4) special nodes with various functionality
(e.g., separators, valves). The first two advantages are illustrated using numerical
examples, whereas the features needed for the last two have not yet been
implemented. With the flexible and extensible framework, these features can be
introduced later as extensions to the general MS well model. For the linear solution,
the original two-stage CPR preconditioner has been extended to handle the coupled
reservoir model with general MS wells. Comparisons with single-stage BILU(0) and
BILU(1) preconditioners indicate the superior efficiency of the extended two-stage
preconditioner. In addition, to address the difficulties in the nonlinear convergence
of the coupled system, a local nonlinear solver for the well regions is introduced by
fixing the reservoir conditions and solving only for the facility solution. About one
third of the Newton iterations are saved with the local nonlinear solver for an upscaled
SPE 10 model. We also found that the best strategy for utilizing the local nonlinear
solver is to activate it only when the reservoir part is already close to convergence. In
such circumstances, the reservoir part will serve as a reasonably accurate boundary
condition for the local facility solution and oscillations in the Newton iterations can
be avoided.
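The idea behind the local facility solve can be illustrated with a scalar toy problem: the reservoir pressure is frozen and serves as a boundary condition while Newton iterates on a single well equation. The quadratic friction balance below is purely illustrative (our own construction, not an AD-GPRS model).

```cpp
#include <cassert>
#include <cmath>

// Toy local nonlinear well solve: the reservoir pressure pRes is held fixed
// (acting as the boundary condition) while Newton iterates on the well rate
// q from a simple quadratic friction balance f(q) = pRes - pBH - c*q*|q|.
double solveWellRate(double pRes, double pBH, double c,
                     double q0 = 1.0, int maxIt = 50, double tol = 1e-12) {
    double q = q0;
    for (int it = 0; it < maxIt; ++it) {
        const double f = pRes - pBH - c * q * std::fabs(q);
        const double df = -2.0 * c * std::fabs(q); // df/dq
        if (std::fabs(f) < tol || df == 0.0) break;
        q -= f / df; // Newton update on the facility unknown only
    }
    return q;
}
```

In the simulator the same pattern applies to the full facility residual vector: the perforation inflows are evaluated at the frozen reservoir state, and only the facility Jacobian is factorized in the inner loop.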
Today, multicore CPUs are prevalent in mainstream desktop computers and servers.
In addition, general-purpose GPU computing is quickly emerging as an important
technology. High-performance computing techniques have drawn considerable attention
and are developing at a very fast pace. Parallelization allows us 1) to better
utilize the computational resources such that results can be obtained in a shorter
time, and 2) to simulate more detailed models with finer discretization, higher
compositional resolution, and more sophisticated physics. Consequently, OpenMP-based
multithreading parallelization has been implemented in AD-GPRS. Parallel Jacobian
construction is realized through the thread-safe extension of the ADETL library. For
the parallel linear solution, the CPR-based two-stage preconditioning strategy is used,
with the parallel multigrid solver XSAMG applied in the first stage and the block
Jacobi preconditioner, which uses BILU as the local preconditioner, applied in the
second stage. Based on the benchmarking results for the full SPE 10 problem (with
three discretization schemes applied on the manipulated nonorthogonal grid) on a
dual quad-core Nehalem node, we find that the linear solver speedup (an average of
4.1X) is usually lower than that of the remaining computations (i.e., other than the
linear solver; an average of 6.0X). This is because the linear solver places higher
demands on memory bandwidth and on the number of memory channels. Note that
the speedup of the linear solver could have been lower if the NUMA-aware pressure
matrix assembly, the partial setup strategy for XSAMG, or the optimal partitioning
for block Jacobi were not applied. The overall speedup averages 5.0X, which is
reasonable for the multicore platform we tested.
Finally, a parallel NF linear solver especially suited for GPUs is developed. This
solver is built on the MPNF framework described by Appleyard et al. [8] and is
integrated into parallel AD-GPRS as a pressure preconditioner. Our implementation
of MPNF is designed especially for the GPU architecture. For instance, our solver
applies asynchronous memory transfer from CPU to GPU in the setup phase, uses
a CUDA-based BiCGStab solver as an accelerator in the solution phase, and computes
dot products using a customized reduction kernel that offers higher bandwidth.
Most importantly, a reversed ordering is applied to the matrix elements such that
coalesced access to GPU global memory is maximized. For small problems, the
number of concurrent threads is insufficient for good utilization of a GPU. Thus,
the “twisted factorization” technique is applied to improve the concurrency with no
extra computation or side effects on coalesced memory access. For the algorithm
to work on multiple GPUs, peer-to-peer memory transfer, which is overlapped
with the computational kernels, is used for the data exchange between GPUs. The
GPU-based MPNF solver is demonstrated using several large reservoir models and
compared against its CPU-based counterpart (using OpenMP). We find that 1) coalesced
memory access is the most critical factor affecting the performance of the
algorithm: the speedup can differ by several times with or without this feature;
2) twisted factorization can further improve the performance of the algorithm and
should always be applied, for any problem size; 3) the accuracy of the preconditioner
increases while the parallelism decreases when more colors are used, and the
number of colors for which near-optimal performance is achieved depends on the
nature of the problem; 4) the number of threads per CUDA core (RT/C) is a reasonable
measure of GPU utilization, with 8 ∼ 10 being the minimally acceptable range for
good utilization; 5) the current bottleneck of multi-GPU parallelization lies
in the data transfer between CPU and GPU due to the limited PCIe bandwidth on a
single computational node, as well as in the decreased number of concurrent threads
due to the domain partitioning and the separation of computation into halo and interior
regions. Comparing the pressure solver time for the full SPE 10 model on a 512-core
Tesla M2090 GPU and on a single core of the Xeon X5660 CPU, our GPU-based
MPNF solver (with coalesced memory access and twisted factorization) achieves a
speedup of 26 for single-precision and 19 for double-precision computations. Moreover,
the pressure solution of a highly refined SPE 10 model (26.9 million cells) on
six GPUs with 3072 cores in total is more than five times faster than that on a single
GPU.
7.2 Future Work
Based on the work described in this dissertation, AD-GPRS has evolved into a
flexible and efficient reservoir simulation research laboratory in which students and
researchers can readily test their ideas. With general spatial and temporal discretization
schemes, AD-GPRS is capable of modeling thermal-compositional fluid flow in
reservoir models with fully unstructured grids and general multisegment wells. In
addition, the multicore and multi-GPU parallelization enables AD-GPRS to simulate
large-scale models efficiently on the corresponding platforms. Here, we list several
high-priority research directions related to the work described in this dissertation:
• Apply the MPNF algorithm to both stages of the CPR preconditioner with a global
CUDA-based BiCGStab / GMRES accelerator. See Section 6.11 for detailed
comments.
• Account for off-band coefficients on the outermost level of MPNF matrices,
in order for the MPNF algorithm to accommodate 2.5D unstructured grids
(and MPFA discretization schemes) with more flexible coloring strategies. See
Section 6.11 for detailed comments.
• Develop a hybrid MPI/OpenMP parallelization in order for AD-GPRS to run
efficiently on clusters. As a further step, extend the multi-GPU parallelization
for the MPNF algorithm to run across a number of GPU-equipped nodes.
• Extend the general MS well model to handle multiple exit connections, as well
as nodes with special functionality.
• Further test the local nonlinear solver and extend it for the solution of other
parts in the simulation, e.g., geomechanics and chemical reactions, when a fully-
implicit coupling is used to connect them to the fluid flow part.
• Investigate the impact of various MPFA schemes on the behavior of the linear
and nonlinear solvers.
• Devise AIM schemes for other compositional formulations (e.g., molar).
• Analyze and improve the variable-based AIM formulation, which sometimes
incurs instability with the current linear-stability criteria.
• Further grow the modeling and solution capabilities of AD-GPRS, e.g., new
thermal-compositional formulations and fluid models, coupled geomechanics,
and chemical reactions.
Nomenclature
A Cross-sectional area (of an MS-well segment)
C0 Flow profile parameter in the drift-flux model
D Depth of a reservoir cell or an MS-well node
DH Hydraulic diameter of an MS-well segment
DH Dimensionless hydraulic diameter of an MS-well segment
Fc Overall flux of component c
fc,p Fugacity of component c in phase p
ftp Fanning friction factor
g Gravitational acceleration coefficient
g Gravitational component along the well
Hp Enthalpy of phase p
kf Slope used in the linear interpolation of ftp for an intermediate Re
Ku(DH) Critical Kutateladze number
m(θ) A scaling parameter in the terminal rise velocity for inclined pipes
min Mass flow rate of the mixture entering an MS-well segment
nbr(i) The set containing all neighboring nodes of an MS-well node i
NB Number of reservoir cells
nc Number of components
NImp Number of implicit variables in a cell
np Number of phases
np Number of points associated with a flux
Nperf Number of perforations in a well
NPri Number of primary variables in a cell
NSec Number of secondary variables in a cell
perf(k) Reservoir cell number of the k’th perforation in a well
P Pressure (of the base phase - oil)
Pp Pressure of phase p
∆Pw Total pressure difference between two MS-well nodes
∆Pwh Hydrostatic pressure difference between two MS-well nodes
∆Pwf Frictional pressure difference between two MS-well nodes
∆Pwa Acceleration-related pressure difference between two MS-well nodes
Qloss Heat loss from wellbore fluid to surroundings
Qm Mixture flow rate
qp Inflow (per unit volume) of phase p (to an MS-well segment)
Qp Volumetric inflow of phase p (to an MS-well segment)
Qp Flow rate of phase p
Re Dimensionless Reynolds number
RT/C The ratio of the minimum number of concurrent threads in a color
of an MPNF matrix to the number of CUDA cores
rw Wellbore radius
Sp Saturation of phase p
∆t Timestep size
T Temperature
T^{i0,i1} Two-point transmissibility coefficient for the interface {i0, i1}
T^{i0,i1}_{im} Multi-point transmissibility coefficient associated with interface
{i0, i1} and cell im
V Volume of a reservoir cell or an MS-well node
Up Internal energy of phase p
Uto Overall heat transfer coefficient
Vc Characteristic velocity
Vd Terminal rise velocity
Vm Mixture velocity
Vp Interstitial velocity of phase p
Vsp Superficial velocity of phase p
xc,p Mole fraction of component c in phase p
zc Overall mole fraction of component c
∆z Length between two MS-well nodes
αp In-situ phase fraction (holdup) of phase p
ε Roughness height
γp Mass density of phase p
λp Mobility of phase p
µp Viscosity of phase p
νp Mole fraction of phase p
φ Porosity
Φp The flow part of the flux of phase p
ρp Molar density of phase p
σpq Interfacial tension between phase p and q
θ Inclination angle of an MS-well segment from vertical
Acronyms and Abbreviations
AD Automatic Differentiation
ADETL Automatically Differentiable Expression Templates Library
AIM Adaptive Implicit Method
AMG Algebraic MultiGrid
axpy Level 1 BLAS operation (add a multiple of one vector to another):
y ← ax + y
BiCGStab BiConjugate Gradient Stabilized method
BILU Block Incomplete LU factorization
BLAS Basic Linear Algebra Subprograms
BHP Bottom Hole Pressure
CFL Courant-Friedrichs-Lewy
COO COOordinate list
copy Level 1 BLAS operation (copy one vector to another): y ← x
CPR Constrained Pressure Residual
CSR Compressed Sparse Row (equivalent to CRS - Compressed Row
Storage)
CR Cyclic Reduction
CUDA Compute Unified Device Architecture
dot Level 1 BLAS operation (inner product): d← x · y
DP Double Precision
EOR Enhanced Oil Recovery
FIM Fully Implicit Method
GMRES Generalized Minimal RESidual
GPGPU General-Purpose computing on Graphics Processing Units
GPRS General Purpose Research Simulator
GPU Graphics Processing Unit
HPC High Performance Computing
HYB HYBrid matrix format (EllPack + COO)
ILU Incomplete LU factorization
IMPES IMplicit Pressure Explicit Saturation
IMPSAT IMplicit Pressure and SATuration
I/O Input / Output
MLBS Multi-Level Block Sparse (equivalent to MLSB - Multi-Level Sparse
Block)
MPFA Multi-Point Flux Approximations
MPNF Massively Parallel Nested Factorization
MS MultiSegment
NF Nested Factorization
NUMA Non-Uniform Memory Architecture
PCIe PCI (Peripheral Component Interconnect) express
RHS Right Hand Side
SAMG Algebraic MultiGrid methods for Systems
scal Level 1 BLAS operation (scale a vector by a constant): x← ax
SIMT Single Instruction Multiple Threads
SP Single Precision
SpMV Sparse Matrix-Vector multiplication
SPU Single-Point Upstream
TF Twisted Factorization
TLS Thread-Local Storage
TPFA Two-Point Flux Approximations
Subscripts
c Component
p Phase: g (gas), o (oil), w (water), or l (liquid, as a mixture of oil
and water)
m Mixture
i Node-based property for a reservoir cell or an MS-well node
(i, j) Connection-based property for the flux across interface (i, j)
C Connection (in a general MS well)
F Facilities
N Node (in a general MS well)
R Reservoir
W Well
Superscripts
avg Averaged value
inj Injection condition
n Time level n
n+ 1 Time level n+ 1
res Reservoir property
sc Standard condition
w Wellbore property
Bibliography
[1] I. Aavatsmark. An introduction to multipoint flux approximations for quadri-
lateral grids. Comput. Geosci., 6(3-4):405–432, 2002.
[2] I. Aavatsmark, T. Barkve, and T. Mannseth. Control-volume discretization
methods for 3D quadrilateral grids in inhomogeneous, anisotropic reservoirs.
SPE 38000, SPE Journal, 3(2):146–154, June 1998.
[3] I. Aavatsmark, G. Eigestad, B.-O. Heimsund, B. Mallison, J. Nordbotten, and
E. Øian. A new finite-volume approach to efficient discretization on challenging
grids. SPE 106435, SPE Journal, 15(3):658–669, September 2010.
[4] I. Aavatsmark, G. Eigestad, B. Mallison, and J. Nordbotten. A compact multi-
point flux approximation method with improved robustness. Numerical Methods
for Partial Differential Equations, 24(5):1329–1360, September 2008.
[5] G. Acs, S. Doleschall, and E. Farkas. General purpose compositional model.
SPE 10515, SPE Journal, 25(4):543–553, August 1985.
[6] J. Appleyard. Method and apparatus for estimating the state of a system,
January 2012. Patent, US 2012/0022841 A1.
[7] J. R. Appleyard. Nested factorization. SPE 12264, proceedings of the 7th SPE
Reservoir Simulation Symposium, San Francisco, CA, November 1983.
[8] J. R. Appleyard, J. D. Appleyard, M. A. Wakefield, and A. L. Desitter. Accel-
erating reservoir simulators using GPU technology. SPE 141402, proceedings of
the 21st SPE Reservoir Simulation Symposium, Houston, TX, February 2011.
[9] K. Aziz and A. Settari. Petroleum Reservoir Simulation. Applied Science Pub-
lishers, 1979.
[10] H. Beggs. Production optimization using NODAL analysis. Oil and Gas Con-
sultants International, Inc., Tulsa, OK, 1991.
[11] G. Behie. Practical considerations for incomplete factorization methods in reser-
voir simulation. SPE 12263, proceedings of the 7th SPE Reservoir Simulation
Symposium, San Francisco, CA, November 1983.
[12] C. Bischof, H. Bucker, P. Hovland, U. Naumann, and J. Utke. Advances in
Automatic Differentiation, Lect. Notes in Comp. Sci. and Eng. Springer, 2008.
[13] C. Bischof, G. Corliss, L. Green, A. Griewank, K. Haigler, and P. Newman.
Automatic differentiation of advanced CFD codes for multidisciplinary design.
Journal on Computing Systems in Engineering, 3:625–637, 1992.
[14] H. Bucker, G. Corliss, P. Hovland, U. Naumann, and B. Norris. Automatic Dif-
ferentiation: Applications, Theory and Implementations, Lect. Notes in Comp.
Sci. and Eng. Springer, 2006.
[15] H. Cao. Development of Techniques for General Purpose Simulation. PhD
thesis, Stanford University, 2002.
[16] H. Cao and K. Aziz. Performance of IMPSAT and IMPSAT-AIM models in
compositional simulation. SPE 77720, proceedings of the SPE Annual Technical
Conference and Exhibition, San Antonio, TX, September 29-October 2, 2002.
[17] H. Cao, H. Tchelepi, J. Wallis, and H. Yardumian. Parallel scalable CPR-type
linear solver for reservoir simulation. SPE 96809, proceedings of the SPE Annual
Technical Conference and Exhibition, Dallas, TX, October 2005.
[18] M. Chien, H. Yardumian, E. Chung, and W. Todd. The formulation of a thermal
simulation model in a vectorized, general purpose reservoir simulator. In SPE
18418, proceedings of the 10th SPE Reservoir Simulation Symposium, Houston,
TX, February 1989.
[19] M. Christie and M. Blunt. Tenth SPE comparative solution project: A com-
parison of upscaling techniques. SPE 72469, SPE Reservoir Eval. & Eng.,
4(4):308–317, August 2001.
[20] T. Clees and L. Ganzer. An efficient algebraic multigrid solver strategy for adap-
tive implicit methods in oil reservoir simulation. SPE 105789, SPE Journal,
15(3):670–681, September 2010.
[21] CMG. STARS User’s Guide. The Computer Modelling Group,
http://www.cmg.com, 2008.
[22] K. Coats. IMPES stability: Selection of stable timesteps. SPE 84924, SPE
Journal, 8(2):181–187, June 2003.
[23] K. Coats, L. Thomas, and R. Pierson. Compositional and black oil reservoir
simulation. SPE 50990, SPE Reservoir Eval. & Eng., 1(4):372–379, August
1998.
[24] G. Corliss, C. Bischof, A. Griewank, S. Wright, and T. Robey. Automatic dif-
ferentiation for PDE’s: Unsaturated flow case study. In Advances in Computer
Methods for Partial Differential Equations - VII. IMACS, 1992.
[25] D. DeBaun, T. Byer, P. Childs, J. Chen, F. Saaf, M. Wells, J. Liu, H. Cao,
L. Pianelo, V. Tilakraj, P. Crumpton, D. Walsh, H. Yardumian, R. Zorzynski,
K.-T. Lim, M. Schrader, V. Zapata, J. Nolen, and H. Tchelepi. An extensi-
ble architecture for next generation scalable parallel reservoir simulation. In
SPE 93274, proceedings of the 18th SPE Reservoir Simulation Symposium, The
Woodlands, TX, February 2005.
[26] U. Drepper. ELF handling for thread-local storage. Technical report, Red Hat
Inc., February 2003.
[27] M. Edwards and C. Rogers. Finite volume discretizations with imposed flux
continuity for the general tensor pressure equation. Comput. Geosci., 2(4):256–
290, 1998.
[28] Y. Fan. Chemical Reaction Modeling in a Subsurface Flow Simulator with Ap-
plication to In-Situ Upgrading and CO2 Mineralization. PhD thesis, Stanford
University, 2010.
[29] Y. Fan, L. J. Durlofsky, and H. A. Tchelepi. Numerical simulation of the in-situ
upgrading of oil shale. SPE 118958, SPE Journal, 15(2):368–381, June 2010.
[30] H. Fischer. Special problems in automatic differentiation. In A. Griewank and
G. Corliss, editors, Automatic Differentiation of Algorithms. SIAM, PA, 1991.
[31] Fraunhofer Institute SCAI. XSAMG Announcement. Fraunhofer SCAI,
http://www.scai.fraunhofer.de, 2011.
[32] L. S. Fung and A. H. Dogru. Parallel unstructured-solver methods for simu-
lation of complex giant reservoirs. SPE 106237, SPE Journal, 13(4):440–446,
December 2008.
[33] W. Gander and G. H. Golub. Cyclic reduction history and applications. Pro-
ceedings of the Workshop on Scientific Computing, Hong Kong, March 1997.
[34] A. Griewank. On automatic differentiation. In M. Iri and K. Tanabe, editors,
Mathematical Programming: Recent Developments and Applications. Kluwer
Academic Publishers, IL, 1990.
[35] S. Haaland. Simple and explicit formula for the friction factor in turbulent pipe
flow including natural gas pipelines. Ifag b-131, Div Aero and Gas Dynamics,
The Norwegian Institution of Technology, 1981.
[36] J. Holmes, T. Barkve, and O. Lund. Application of a multisegment well model
to simulate flow in advanced wells. SPE 50646, proceedings of the SPE European
Petroleum Conference, The Hague, Netherlands, October 1998.
[37] Y. Jiang. Tracer flow modeling and efficient solvers for GPRS. Master’s thesis,
Stanford University, 2004.
[38] Y. Jiang. Techniques for Modeling Complex Reservoirs and Advanced Wells.
PhD thesis, Stanford University, 2007.
[39] Y. Jiang and H. A. Tchelepi. Scalable multistage linear solver for coupled sys-
tems of multisegment wells and unstructured reservoir models. In SPE 119175,
proceedings of the 20th SPE Reservoir Simulation Symposium, The Woodlands,
TX, February 2009.
[40] G. Karypis and V. Kumar. A fast and high-quality multilevel scheme for parti-
tioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392,
1999.
[41] Khronos OpenCL Working Group. The OpenCL Specification, Version 1.2,
November 2011.
[42] J. Kim and S. Finsterle. Application of automatic differentiation in TOUGH2. In
Proceedings of The Tough Symposium, LBNL. LBNL, May 2003.
[43] L. Koesterke, J. Boisseau, J. Cazes, K. Milfeld, and D. Stanzione. Early expe-
riences with the Intel Many Integrated Cores accelerated computing technology.
In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery,
TG ’11, pages 21:1–21:8, New York, NY, USA, 2011. ACM.
[44] F. Kwok. A block ILU(k) preconditioner for GPRS. Technical report, Stanford
University, December 2004.
[45] S. B. Lippman, J. Lajoie, and B. E. Moo. C++ Primer, 4th Edition. Addison-
Wesley Professional, 2005.
[46] S. Livescu, L. Durlofsky, and K. Aziz. A semianalytical thermal multiphase
wellbore flow model for use in reservoir simulation. SPE 115796, SPE Journal,
15(3):794–804, September 2010.
[47] S. Livescu, L. Durlofsky, K. Aziz, and J. Ginestra. A fully-coupled thermal mul-
tiphase wellbore flow model for use in reservoir simulation. Journal of Petroleum
Science and Engineering, 71(3-4):138 – 146, 2010. Fourth International Sym-
posium on Hydrocarbons and Chemistry.
[48] P. Lu, J. Shaw, T. Eccles, I. Mishev, A. Usadi, and B. Beckner. Adaptive parallel
reservoir simulation. In International Petroleum Technology Conference. IPTC
12199, December 2008.
[49] P. Micikevicius. Multi-GPU Programming. NVIDIA Corporation, 2012.
[50] P. Moin. Fundamentals of Engineering Numerical Analysis. Cambridge Uni-
versity Press, 2001.
[51] A. Moncorge and H. A. Tchelepi. Stability criteria for thermal adaptive implicit
compositional flows. SPE 111610, SPE Journal, 14(2):311–322, June 2009.
[52] NVIDIA Corporation. Whitepaper - NVIDIA’s Next Generation CUDA Com-
pute Architecture: Fermi, 2009.
[53] NVIDIA Corporation. CUBLAS Library, Version 5.0, October 2012.
[54] NVIDIA Corporation. CUDA C Best Practices Guide, Version 5.0, October
2012.
[55] NVIDIA Corporation. CUDA C Programming Guide, Version 5.0, October
2012.
[56] NVIDIA Corporation. CUDA Samples, Version 5.0, October 2012.
[57] NVIDIA Corporation. CUSPARSE Library, Version 5.0, October 2012.
[58] L. Ouyang. Single Phase and Multiphase Fluid Flow in Horizontal Wells. PhD
thesis, Stanford University, 1998.
[59] M. Pal and M. G. Edwards. Quasimonotonic continuous Darcy-flux approximation for general 3D grids of any element type. In SPE 106486, proceedings of
the 19th SPE Reservoir Simulation Symposium, Houston, TX, February 2007.
[60] H. Pan and H. A. Tchelepi. Reduced variable method for general-purpose com-
positional reservoir simulation. In SPE 131737, proceedings of the International
Oil and Gas Conference and Exhibition in China, Beijing, China, June 2010.
[61] H. Pan and H. A. Tchelepi. Compositional flow simulation using reduced-
variables and stability-analysis bypassing. In SPE 142189, proceedings of the
21st SPE Reservoir Simulation Symposium, The Woodlands, TX, February
2011.
[62] D. Peaceman. Interpretation of well-block pressures in numerical reservoir sim-
ulation. SPE Journal, 18(3):183–194, June 1978.
[63] P. Quandalle and J. Sabathier. Typical features of a multipurpose reservoir
simulator. SPE 16007, SPE Reservoir Engineering, 4(4):475–480, 1989.
[64] L. Rall. Perspectives on automatic differentiation: Past, present and future.
In Automatic Differentiation: Applications, Theory and Implementations, Lect.
Notes in Comp. Sci. and Eng. Springer, 2005.
[65] T. Russell. Stability analysis and switching criteria for adaptive implicit meth-
ods based on the CFL condition. In SPE 18416, proceedings of the 10th SPE
Reservoir Simulation Symposium, Houston, TX, February 1989.
[66] Y. Saad. ILUT: a dual threshold incomplete LU factorization. Numerical linear
algebra with applications, 1(4):387–402, 1994.
[67] Y. Saad. Iterative Methods for Sparse Linear Systems, Second Edition. Society
for Industrial and Applied Mathematics, 2003.
[68] Y. Saad and M. Schultz. GMRES: A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856–
869, 1986.
[69] P. Sarma. Efficient Closed-Loop Optimal Control of Petroleum Reservoirs Under
Uncertainty. PhD thesis, Stanford University, 2006.
[70] Schlumberger. Eclipse Technical Description 2011.2. Schlumberger, 2011.
[71] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junk-
ins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and
P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing.
ACM Trans. Graph., 27(3):18:1–18:15, Aug. 2008.
[72] A. Semenova, S. Livescu, L. J. Durlofsky, and K. Aziz. Modeling of multiseg-
mented thermal wells in reservoir simulation. In SPE 130371, proceedings of the
SPE EUROPEC/EAGE Annual Conference and Exhibition, Barcelona, Spain,
June 2010.
[73] H. Shi, J. Holmes, L. Diaz, L. J. Durlofsky, and K. Aziz. Drift-flux parame-
ters for three-phase steady-state flow in wellbores. SPE 89836, SPE Journal,
10(2):130–137, 2005.
[74] H. Shi, J. Holmes, L. J. Durlofsky, K. Aziz, L. Diaz, B. Alkaya, and G. Oddie.
Drift-flux modeling of two-phase flow in wellbores. SPE 84228, SPE Journal,
10(1):24–33, 2005.
[75] G. Shiralkar, G. Fleming, J. Watts, T. Wong, B. Coats, R. Mossbarger, E. Rob-
bana, and A. Batten. Development and field application of a high performance,
unstructured simulator with parallel capability. In SPE 93080, proceedings of
the 18th SPE Reservoir Simulation Symposium, The Woodlands, TX, February
2005.
[76] K. Stuben. Algebraic multigrid (AMG): Experiences and comparisons. Proceedings of the International Multigrid Conference, April 1983.
[77] K. Stuben. An introduction to algebraic multigrid. Appendix in the book Multigrid, pages 413–532, 2001.
[78] K. Stuben and T. Clees. SAMG User’s Manual. Fraunhofer Institute SCAI,
2003.
[79] H. Sutter. The free lunch is over: a fundamental turn toward concurrency in
software. Dr. Dobb’s Journal, 30(3), 2005.
[80] G. Thomas and D. Thurnau. Reservoir simulation using an adaptive implicit
method. SPE 10120, SPE Journal, 23(5):759–768, October 1983.
[81] H. A. van der Vorst. Large tridiagonal and block tridiagonal linear systems on
vector and parallel computers. Parallel Computing, 5:45–54, July 1987.
[82] H. A. van der Vorst. BI-CGSTAB: a fast and smoothly converging variant
of BI-CG for the solution of nonsymmetric linear systems. SIAM Journal on
Scientific and Statistical Computing, 13(2):631–644, March 1992.
[83] S. Verma. Flexible Grids for Reservoir Simulation. PhD thesis, Stanford Uni-
versity, 1996.
[84] D. Voskov, R. Younis, and H. Tchelepi. General nonlinear solution strategies for multiphase multicomponent EOS-based simulation. In SPE 118996, proceedings of the 20th SPE Reservoir Simulation Symposium, The Woodlands, TX, February 2009.
[85] D. Voskov and Y. Zhou. Technical Description of the AD-GPRS. Energy
Resources Engineering, Stanford University, 2012.
[86] D. V. Voskov and H. A. Tchelepi. Compositional space parameterization: Mul-
ticontact miscible displacements and extension to multiple phases. SPE 113492,
SPE Journal, 14(3):441–449, September 2009.
[87] D. V. Voskov and H. A. Tchelepi. Compositional space parameterization: The-
ory and application for immiscible displacements. SPE 106029, SPE Journal,
14(3):431–440, September 2009.
[88] J. Wallis. Incomplete Gaussian elimination as a preconditioning for general-
ized conjugate gradient acceleration. SPE 12265, proceedings of the 7th SPE
Reservoir Simulation Symposium, San Francisco, CA, November 1983.
[89] J. Wallis, R. Kendall, T. Little, and J. Nolen. Constrained residual acceleration
of conjugate residual methods. SPE 13536, proceedings of the 8th SPE Reservoir
Simulation Symposium, Dallas, TX, February 1985.
[90] X. Wang. Trust-Region Newton Solver for Multiphase Flow and Transport in
Porous Media. PhD thesis, Stanford University, 2012.
[91] X. Wang and H. A. Tchelepi. Trust-region based nonlinear solver for counter-
current two-phase flow in heterogeneous porous media. In 13th European Con-
ference on the Mathematics of Oil Recovery. EAGE, September 2012.
[92] R. Younis. Modern Advances in Software and Solution Algorithms for Reservoir
Simulation. PhD thesis, Stanford University, 2011.
[93] R. Younis and K. Aziz. Parallel automatically differentiable data-types for next-
generation simulator development. In SPE 106493, proceedings of the 19th SPE
Reservoir Simulation Symposium, Houston, TX, February 2007.
[94] F. Zhang. The Schur Complement and Its Applications. Springer, 2005.
[95] Y. Zhang, J. Cohen, and J. D. Owens. Fast tridiagonal solvers on the GPU.
SIGPLAN Not., 45(5):127–136, Jan. 2010.
[96] Y. Zhou. Multistage preconditioner for well groups and automatic differentia-
tion for next generation GPRS. Master’s thesis, Stanford University, 2009.
[97] Y. Zhou. ADETL User Manual, 2012.
[98] Y. Zhou, Y. Jiang, and H. Tchelepi. A scalable multi-stage linear solver for coupled reservoir models with multi-segment wells. Computational Geosciences, 2012.
[99] Y. Zhou, H. A. Tchelepi, and B. T. Mallison. Automatic differentiation frame-
work for compositional simulation on unstructured grids with multi-point dis-
cretization schemes. In SPE 141592, proceedings of the 21st SPE Reservoir
Simulation Symposium, The Woodlands, TX, February 2011.
[100] N. Zuber and J. A. Findlay. Average volumetric concentration in two-phase
flow systems. Journal of Heat Transfer, 87:453–468, November 1965.
Appendix A
Programming Model of AD-GPRS
In this appendix the programming model of AD-GPRS is described. Our objective is
to make AD-GPRS a flexible and efficient reservoir simulation research laboratory
with extensible modeling and solution capabilities. To achieve this goal, a modular
object-oriented design is adopted. All of the code is written in standard C++. While
this design makes it convenient for developers to extend the simulator by incorporating
new physics, introducing complex processes, or adding new formulations and solution
algorithms, it requires some effort for newcomers to become fully familiar with
the code. AD-GPRS is a large and complex program. Currently it has over
200 header and source files (excluding ADETL and its associated FASTL library)
separated into more than ten subdirectories. To facilitate the development
and usage of AD-GPRS, good documentation, including technical descriptions, user
manuals, examples, and doxygen-style comments (see www.doxygen.org) in the code,
is essential.
This appendix is organized similarly to Appendix A in [15]. The structure of AD-GPRS
is discussed level by level, from the top down, followed by the flow sequence of the
simulation framework. The list of directories and files of AD-GPRS, together with
their descriptions, is also provided. Here we use italic shape to denote class names,
slanted shape to denote variable names, and boldface to denote function names.
A.1 The structure of AD-GPRS
The system model shows the basic classes and their relations, and it is very helpful
for understanding the structure of AD-GPRS. Due to the complexity of AD-GPRS,
the system model is explained level by level using multiple figures.
Figure A.1: Overall structure of the entire simulator
Figure A.1 shows the overall structure of the entire simulator. Simulator is the
topmost class, instantiated in the main function; it can be seen as the entry point
to the versatile simulation functionality. The Simulator class contains the outermost
loop over input regions. Inside each input region, there is an initialization process to
accommodate changes in parameters, followed by a loop over timesteps. For each
timestep, Simulator determines the proper timestep size such that the time-stepping
requirements of report steps and input regions (see Section A.2 for more information)
are satisfied. At the beginning of the simulation, Simulator creates the SimMaster
object, which is deallocated when the simulation run finishes.
SimMaster is the manager of all simulation-related objects. When SimMaster is created,
it creates a specific NonlinearFormulation object and its associated AIMScheme
object according to the formulation type in the input data. After that, common
objects including Reservoir, Facilities, NonlinearSolver, and StreamlineTracer are
constructed. The data members of SimMaster also include the global variable set
(adX), which stores all the simulation variables with both values and gradients, as well
as the global backup set (adX_n), which saves a copy of the values of all variables. For
details about the global variable and backup sets, please refer to Section 1.2.1 and the
ADETL user manual [97]. There is an additional backup set (adFX), which is used
only when the active-window mode is activated. In that case, adX and
adX_n cover only the variables in the active region, whereas adFX holds a
copy of the variables in the entire simulation domain, regardless of which regions
are activated. Moreover, for the initialization and subsequent update of the
global variable and backup sets, two additional vectors are needed: 1) totalBlocks,
which contains the number of blocks in each variable subset, and 2) allStatus, which
contains the (phase) status for each block in all variable subsets.
A common namespace shared by a variety of classes throughout the simulator is
called full_set. It contains a set of global indexing constants that are fixed for a given
number of phases and components, and are general to any variable formulation. For
example, for node-based variables, there are indexing constants such as PRES (pressure),
TEMP (temperature), SAT (saturations), XCP (molar fractions), MOBILITY
(phase mobilities), and so on. For connection-based variables,
indexing constants include QM (mixture flow rate) and QSP (superficial phase flow
rates). These indexing constants cover both the independent and dependent variables.
Any status in a variable formulation activates a part of the corresponding variables
to be independent and leaves the rest dependent, without changing
their values.
Now we discuss each component of the SimMaster class in detail. Figure A.2 shows
the structure of the NonlinearFormulation class.

Figure A.2: Structure of the NonlinearFormulation

NonlinearFormulation is the abstract base class of various
formulations, such as NaturalVariableFormulation, MolarVariableFormulation, and so
on. By constructing new inherited classes of NonlinearFormulation (or one of its
derived classes) with the same interfaces but possibly different realizations, we are
able to introduce new formulations.
NonlinearFormulation contains a variety of formulation-related virtual member
functions. Some functions are called in the initialization stage to specify the phase,
component, and variable structure of a formulation. This includes makeStructure
(create the table of active independent variables associated with each possible phase
status) and buildStatusTable (create the table of existing phases associated with
each possible phase status). Please see the numerical framework section in [85] for
more details.
Other virtual functions in NonlinearFormulation, such as enforceConstraints
(enforce certain local constraints), changeStatus (check the phase status and change
it if necessary), fluidPropertiesCalculation (compute all fluid properties), and
makeSecondaryTerm (form secondary equations), usually perform computations
on a specific block, which can be either a reservoir cell, or a well node. That is,
the NonlinearFormulation object is shared by Reservoir and various facility classes.
APPENDIX A. PROGRAMMING MODEL OF AD-GPRS 222
Thus these virtual functions should be designed in a generic way in order to work
for both cases. For example, the rock properties and fluid properties cannot be
calculated by the same function in NonlinearFormulation, because rock properties
are used in Reservoir class only, whereas fluid properties are shared by Reservoir
and facility classes. That is why the fluidPropertiesCalculation function is separated
from the original properties calculation that covers both rock and fluid properties.
Among the common member variables, pOrder is an array whose size equals
the number of primary (i.e., mass and energy conservation) equations. It is used to
reorder the conservation equations from their default order (i.e., the order of components)
at the nonlinear level. The energy conservation equation, if present, is still
required to be the last primary equation, i.e., after all mass conservation equations.
Next, phaseCombinations is a vector of integer vectors created in the common buildStatusTable
function. The number of elements in phaseCombinations is equal to
the total number of possible phase statuses, and each integer vector lists the
phases present in a specific phase status. If all combinations of np phases are feasible,
there will be 2^np − 1 statuses. For example, when np = 3, we have the following
2^3 − 1 = 7 statuses (integer vectors): (0), (1), (0, 1), (2), (0, 2), (1, 2), (0, 1, 2). This
table will be fetched by many other classes, such as Reservoir and AIMScheme, to
loop through the phases present in the current status of a cell. Moreover,
the default values of alwaysPresentPhases (boolean vector, all false by default)
and componentTable (vector of boolean vectors, all true by default) are also set in
the buildStatusTable function. Each element in alwaysPresentPhases indicates if
a phase is always present in all possible statuses (e.g., for black-oil and dead-oil fluid,
we have water and oil phases as always present phases), whereas each element in
componentTable indicates if a component (first dimension) can appear in a specific
phase (second dimension). Last but not least, pThermalFluid is only created in a
APPENDIX A. PROGRAMMING MODEL OF AD-GPRS 223
thermal formulation and can be used for the calculation of thermal properties.
Figure A.3: Structure of the Reservoir
Next, we discuss the second component in the SimMaster class: Reservoir. Figure
A.3 shows the structure of the Reservoir class. All reservoir-related operations (called
for the reservoir only, not for any facility model, e.g., initialization/update of all reservoir
variables, computation of all reservoir properties, flux terms, and accumulation
terms, etc.) are defined here. The Reservoir class needs to access the NonlinearFormulation
object and the global variable (adX) and backup (adX_n) sets. By calling unified
interfaces in NonlinearFormulation, the Reservoir class is general for all formulations,
i.e., we do not need to create derived Reservoir classes for different formulations.
Besides several constant pointers for fetching various reservoir properties from the
input data, the data members of the Reservoir class include a Rock object, a Fluid
object, a general connection list, a variable checker, and a trust-region solver. The Rock
class computes rock properties (e.g., porosity or pore volume under
reservoir conditions) for the reservoir. If geomechanics capability is integrated, the
functionality of the Rock class (and other relevant classes) will grow tremendously. The
general connection list is a data structure for recording the block nonzero entries
in the system matrix for a generalized MPFA discretization. At the beginning of a
simulation run, the list is generated from the specification of the MPFA stencils for
all flux entries in the model. For details about the general connection list, please refer
to Section 2.2.2 and [99]. The variable checker calculates, among all
reservoir cells, the maximum changes in the basic variables (e.g., p, S_p, ν_p, x_cp, and z_c)
between two Newton iterations. It provides a mechanism to check the convergence of
the nonlinear system in addition to checking the residuals. The trust-region solver is
an advanced nonlinear solver that provides local chopping in each reservoir cell based
on the shape of the flux function (see [90, 91] for details).
Figure A.4: Structure of the Fluid
Given the problem being simulated, SimMaster creates the proper Fluid
object for the Reservoir, e.g., TwoPhaseCompFluid for two-phase compositional simulation,
MultiPhaseCompFluid for multiphase compositional simulation, and BlackOilFluid
for black-oil simulation. The structure of the Fluid class is shown in Figure A.4.
A variety of fluid computations, which can be used by both the reservoir and various
facility models, are provided in the Fluid class through common interfaces. Part of the
data in the Fluid class, though, is specific to the reservoir or facility object it is
associated with. Thus, by creating the proper Fluid object and utilizing its common
interfaces, a (derived) NonlinearFormulation class can hide the details of certain
fluid computations. To minimize code duplication, one derived Fluid class can be
shared by multiple derived NonlinearFormulation classes (e.g., TwoPhaseCompFluid
can be used with both NaturalVariableFormulation and MolarVariableFormulation).
If none of the existing Fluid classes satisfies the requirements of a new nonlinear formulation
(e.g., GammaVariableFormulation), a new Fluid class (e.g., GammaPhaseFluid)
needs to be created by inheriting from the base class or any existing derived Fluid
class. Conversely, one derived NonlinearFormulation class can be compatible
with multiple types of fluid, e.g., NaturalVariableFormulation works
with TwoPhaseCompFluid, MultiPhaseCompFluid, or BlackOilFluid.
The Fluid class contains two primary data members: RockFluid and a vector of
Phase pointers. The RockFluid class, which is independent of the fluid type, com-
putes two properties: relative permeability and capillary pressure. The Phase class
is an abstract data type. It provides a series of phase-based property computations,
such as viscosity and density of one phase. Different data types, such as EOSPhase
(properties are calculated based on EOS) and PVTPhase (properties are calculated
based on PVT), are inherited from the Phase class. Each pointer in the phase vector
represents one specific phase (e.g., Gas, Oil, or Water) in the simulation, and can
be in any phase data type (EOS or PVT). The phases are created by the inherited
Fluid class. For example, BlackOilFluid creates a vector of PVT phases, whereas
TwoPhaseCompFluid creates a vector of EOS phases. In fact, the Fluid class is
able to create a mixture of phases of different types, e.g., two EOS phases and one
PVT phase in a compositional setting with two hydrocarbon phases and one simple
water phase. This allows developers to flexibly create new fluid scenarios with
heterogeneous treatments of the phases.
The third component in the SimMaster class is Facilities. Figure A.5 shows its
structure.

Figure A.5: Structure of the Facilities

All facility-related operations (called for facility models only, not for the reservoir,
e.g., initialization/update of all well variables, computation of all well properties,
adding the source/sink terms to reservoir residuals and forming well residuals, etc.)
are defined in this class. These operations are usually performed by calling the corresponding
function in each of the Well objects stored in a vector owned and managed
by the Facilities class. In addition to this vector of Well pointers, the Facilities class
contains a vector of WellState variables that can be used to back up and restore the
state of all wells at any time (e.g., at the beginning of a timestep), as well as a
variable checker that calculates, among all well nodes, the maximum changes in the
basic variables between two Newton iterations.
The Well class is an abstract base class that provides common interfaces for all
facility models. Currently, the standard well and the generalized MultiSegment (MS)
well model have been implemented in AD-GPRS. By inheriting from Well or its
derived classes, we can add new facility models (e.g., well groups) to the simulator.
Similar to the reservoir, wells need to access the NonlinearFormulation object and
the global variable (adX) and backup (adX_n) sets. Each well also has its own fluid
object, which is created from the reservoir fluid object and contains the specific
fluid-related data associated with the well. Because each facility model has its own
matrix format, and each linear system assumes a different matrix structure, it would
not be generic to put the facility treatment in a linear system or the matrix handling
in a facility model. To mix and match various facility models and linear systems,
the idea of a facility-matrix handler is introduced; please refer to Section 4.5.2 for
details. There is a top-level handler for the Facilities class and the selected matrix
system, which creates lower-level handlers for each well object and the same matrix
system. There are also many other common data members, e.g., a vector of possible
well controls, defined for the base Well class.
Figure A.6: Structure of the AIMScheme
The fourth component in the SimMaster class is AIMScheme. Figure A.6 shows its
structure. AIMScheme is designed for calculating the CFL numbers, determining
the implicit levels, and performing nonlinear treatments for all blocks. It is the base
class for the various AIM schemes corresponding to different nonlinear formulations. For
example, NaturalVariableAIMScheme works for NaturalVariableFormulation
with all compatible fluid types. The base AIMScheme class, which contains the
basic implementation of the CFL computation, can be used temporarily for any new
formulation. In that case, however, the new formulation is only compatible with the Fully
Implicit Method (FIM), because the nonlinear treatments differ between
formulations and are supposed to be implemented in the corresponding derived AIMScheme
classes. If AIM capability is desired later, a new inherited AIMScheme class
needs to be constructed for that nonlinear formulation by defining proper nonlinear
treatments (and processes for computing CFL numbers / criteria for determining
implicitness, if necessary).
Similar to Reservoir and all facility models, AIMScheme accesses the NonlinearFormulation
object, adX, and adX_n. In addition, it needs to access the fluid object in
the Reservoir class for the possible recalculation of fluid properties during the nonlinear
treatment (see Section 2.3.2). Its most important data members are listed
here: 1) AIMStatistics: a structured type containing statistical information (e.g., average
/ maximum number of implicit cells / variables) for AIM; 2) AIMParameters:
a structured type containing configuration parameters (e.g., whether AIM is used,
the CFL limit, the maximum number of implicit variables in a non-FIM cell) for AIM;
3) nImpVars: a vector containing the number of implicit variables in each cell; 4)
ImpComp: a boolean array indicating whether each component in each cell is implicit
(needed by variable-based AIM); 5) CFL: a vector containing the component-
and phase-based fluxes for the CFL computation; and 6) CFLb: a vector containing the
most constraining CFL number of each grid block.
Figure A.7: Structure of the NonlinearSolver
The fifth component in SimMaster is NonlinearSolver, with its structure shown
in Figure A.7. The main objective of NonlinearSolver is to find the nonlinear solution
of any timestep, given the converged solution at the last timestep and the
current timestep size. By default, Newton's method (with a certain chopping scheme)
is applied. Due to the wide range of data and functionality needed, NonlinearSolver
accesses almost every other component in SimMaster, including the NonlinearFormulation,
Reservoir, Facilities, and AIMScheme objects, as well as such variables as adX,
adX_n, and allStatus. The most important data members of NonlinearSolver are as
follows: 1) residual: the AD residual vector, which contains not only the residual
values but also the gradients (Jacobian) of the discretized governing equations; 2)
residualNormalizer: during the convergence check, the elements in the residual vector
are normalized by these values, which are usually computed at the same time as the
nonlinear residuals are built; 3) LinearSystem: extracts the Jacobian matrix from the
(AD) residual vector and obtains the Newton update through the linear solution; and 4)
ResidualChecker: checks the convergence of the nonlinear iterations based
on the normalized residual values, given the corresponding tolerances for transport
equations, local constraints, and well equations.
Figure A.8: Structure of the LinearSystem
The LinearSystem alone is a very large and essential part of AD-GPRS. Its structure
is shown in Figure A.8. LinearSystem is the abstract base class for the different
types of linear system (each with its own matrix format and a set of solvers that
can work on that matrix format). In the initialization process, the general connection
list from the Reservoir, the nImpVars and ImpComp vectors from the AIMScheme,
the Facilities object, and the global variable set (adX) are passed to the
LinearSystem, which acquires the necessary information from these objects. The dense
Right-Hand-Side (RHS) vector is shared by all types of linear system. In a general
linear problem Ax = b, this data member stores the vector b before the linear
solution and the vector x after it.
Currently, two types of linear systems are derived: CSRLinearSystem and BlockLinearSystem.
CSRLinearSystem (see Section 3.2 for details) works on the Compressed
Sparse Row (CSR) matrix format. A handy function is provided by the ADETL
library to extract the gradients contained in the (AD) residual vector in CSR format.
Therefore, most of the time we need no specialization in this linear system to
accommodate new facility models or other features introduced into AD-GPRS. However,
the efficiency of CSRLinearSystem is usually relatively low. Generally
speaking, any library solver that works on the CSR matrix type can be supported; for
example, the PARDISO and SuperLU direct solvers are currently included. These
solvers, however, are not competitive with state-of-the-art iterative linear solvers
employing CPR (Constrained Pressure Residual [88, 89]) based multistage preconditioners.
It is suggested to use CSRLinearSystem and its associated solver options
only for verifying correctness on small problems.
On the other hand, BlockLinearSystem (see Section 3.3 for details) works on
a much more complicated matrix format, which is extended from the MultiLevel
Sparse Block (MLSB) matrix [38] in the original GPRS. This is a recursively defined
matrix structure. On the very top level, it is divided into four parts: J_RR, J_RF,
J_FR, and J_FF, where the subscripts R and F represent Reservoir and Facilities,
respectively. The first subscript indicates the equations, whereas the second
indicates the variables (e.g., J_RF represents the derivatives of the Reservoir equations
with respect to the Facilities variables). On the second level, each of the four parts has
its own structure and possibly contains several submatrices, which are defined on the
third level. By repeating this process, a hierarchy of matrix structures is established.
To define the matrix structure for a new facility model, we need to
customize the submatrices in the J_RF, J_FR, and J_FF parts. All of the (sub)matrices in
the MLSB format are derived from the abstract base class GPRS_Matrix, where common
interfaces such as matrix extraction, algebraic reduction, explicit update, and matrix-vector
multiplication are defined. Because these matrix operations are implemented
recursively in the derived MLSB matrices, new MLSB matrix types can be introduced
without knowing or altering the implementation details of existing ones.
Given its flexible structure and awareness of block sparsity, MLSB-based linear
solvers can be much more efficient than the CSR-based ones. Currently, we have
a block GMRES solver with several preconditioner options, including single-stage
BILU(0) / BILU(1) and multistage CPR (AMG / SAMG + BILU(0) / BILU(1)).
CPR combining SAMG and BILU(0) usually yields the best performance and the most
robust results.
A.2 Flow sequence
Having shown the system model and important classes in AD-GPRS, we can now
review the flow sequence. The essential steps of a forward simulation are as follows:
1. Read the input data (problem definition) from the given simulation deck
2. Initialization (allocate memory, create objects, and apply initial conditions)
3. While there is still some input region that has not been simulated, do:
(a) Update the simulation parameters and corresponding data for the current
input region, if it is not the first one
(b) While the current input region has not been finished, do:
i. Calculate the timestep size
ii. Initialize the timestep: if the nonlinear solution succeeded in the last
timestep, use that converged solution as the initial guess; otherwise,
roll back the independent variables and phase statuses to the last
converged solution
iii. While the maximum number of Newton iterations in each timestep
has not been reached, do:
A. Discretize the reservoir residuals by computing accumulation terms
and secondary constraints for each reservoir cell, as well as flux
terms for each reservoir interface
B. For each facility model, check whether its control needs to be
switched, calculate its properties, add source/sink terms to the
corresponding reservoir residuals, and form its own well residuals
C. Check if the Newton iteration converges. If yes, go to 3(b)iv
D. Extract the linear system from (AD) residual vector, perform al-
gebraic reduction, solve the linear system, and use explicit update
to get back the full solution to all independent variables
E. Apply the solution (Newton update) to all reservoir and well vari-
ables. Update the phase statuses for all reservoir cells
F. Calculate updated rock and fluid properties for all reservoir cells
G. Go to step 3(b)iii for the next Newton iteration
iv. If the solution is accepted before reaching the maximum number of Newton
iterations, report statistics for the converged timestep and dump the
solution if necessary (e.g., at a report step)
v. Go to step 3b for the next timestep
(c) Go to step 3 for the next input region
4. Report overall statistics and perform post-processing if necessary
5. Deallocate everything and end simulation
A.3 List of files
Besides the Visual Studio project file (for Windows) and the Makefile (for Linux)
located in the base directory (SourceCode), there are more than ten subdirectories.
Here we introduce the files contained in each of them.
• ACSP
CSP_DataStorage.hpp/.cpp Data storage for the CSP method
CSP_Interpolation.hpp/.cpp Interpolation for the CSP method
CSP_TesselationBuilder.hpp/.cpp Tessellation builder for the CSP method
Octree.hpp/.cpp Octree data structure
• AIMScheme
AIMScheme.hpp/.cpp The base class of various AIM schemes
NaturalVariableAIMScheme.hpp/.cpp The derived AIM scheme for NaturalVari-
ableFormulation
• CSAT
CSAT_FB.hpp/.cpp CSAT FB
CSAT_G.hpp/.cpp CSAT G
CSAT_Interpolation.hpp/.cpp Interpolation for the CSAT method
CSAT_steam.hpp/.cpp CSAT with steam
CSATInput.hpp Input data for CSAT method
neg_flash.hpp Negative flash subroutines
phase_equil.hpp/.cpp Phase equilibrium using CSAT
tie_storage.hpp/.cpp Tie-line storage
TieSimplex.hpp/.cpp Tie simplex
TieSimplexTypesContainer.hpp/.cpp Container for tie simplex types
• Facilities
Facilities.hpp/.cpp The management class of all wells (facility
models)
Facilities_MLSB_Handler.hpp/.cpp The facility-matrix handler for the top facility
management class and the MLSB matrix format
FacilityMatrixHandler.hpp The base class of various facility-matrix
handlers
FacilityMatrixHandlerFactory.hpp Helper functions that create the corre-
sponding facility-matrix handler given the
facility model, system matrix, and param-
eters
GeneralizedWell.hpp/.cpp Generalized MS well class
GeneralizedWellConfig.hpp Configuration parameter class for Gener-
alizedWell
GenWell_MLSB_Handler.hpp/.cpp The facility-matrix handler for the GeneralizedWell
class and the MLSB matrix format
PseudoWell.hpp Pseudo well class
StandardWell.hpp/.cpp Standard well class
StdWell_MLSB_Handler.hpp/.cpp The facility-matrix handler for the StandardWell
class and the MLSB matrix format
Well.hpp/.cpp The base class for various facility models
WellControl.hpp The general well control class
WellInput.hpp All input data of a single well
WellState.hpp Class for gathering all basic well state vec-
tors
WellStateHDF5.hpp/.cpp Class for the conversion of WellState be-
tween its internal and HDF5 format
• IO
FluidParameters.hpp/.cpp Input parameters for Fluid
GridParameters.hpp Input parameters for Grid
hdf5_utils.hpp/.cpp HDF5 utilities for I/O
InputData.hpp/.cpp The collection of input data for one input
region
IO.hpp/.cpp Class for outputting everything (used by the
logger) and partially for input
Keywords.hpp/.cpp Class for handling all keywords in the AD-
GPRS input file (forward simulation)
KeywordsBase.hpp/.cpp Base class for keyword handling
Log.hpp/.cpp Class for redirecting output to screen and
disk file
Logger.hpp/.cpp Class for managing screen output, log
files, and solution dumping
Parse.hpp/.cpp Parser used in Keywords class
print.hpp/.cpp Functions for writing a vector or array to
disk file
readdata.hpp/.cpp The base class for reading data in certain
formats
ReservoirParameters.hpp/.cpp Input parameters for Reservoir
TuningParameters.hpp/.cpp Input parameters for tuning the simula-
tion
WellParameters.hpp/.cpp Input parameters for facility models
• LinearSolvers
adDenseBlockExtractor.hpp Auxiliary class for extracting dense blocks
of derivatives in ADvector
AMG.h Interfaces to Fortran functions of open-
source AMG solver (AMG1990)
AMGPre.h/.cpp The pressure preconditioner using open-
source AMG solver
BILU0Pre.h/.cpp Block ILU(0) preconditioner
BILUPre.h/.cpp Block ILU(k) preconditioner
BlkGMRESSolverMP.h/.cpp Block GMRES solver for MLSB matrix
with general MPFA discretization
BlockLinearSolverBase.h Base class for MLSB-based linear solvers
BlockLinearSystem.hpp/.cpp MLSB-based linear system
comprow_pointer.h/.cpp Block CSR matrix format (used in block-based
preconditioners such as BILU(0) and BILU(k))
CSRLinearSystem.hpp/.cpp CSR-based linear system
csrmatrixformat.hpp/.cpp (Pointwise) CSR matrix format
GaussSolver.hpp Common functions for dense Gaussian
elimination
GenWellBILUPre.h/.cpp Block ILU preconditioner for Generalized-
Well matrix in MLSB format
gmres.h Template implementation of GMRES ac-
celerator
GPRS_Matrix.h/.cpp Base class for submatrices in the MLSB format
librarysolver.hpp/.cpp Base class for all library solvers
LinearSystemBase.hpp/.cpp Base class for all linear systems
myheap.h/.cpp Heap data structure for symbolic factor-
ization in Block ILU(k) preconditioner
parametersforlibrarysolvers.hpp/.cpp Class for reading and storing parameters
of library solvers
pardiso4.hpp/.cpp Library PARDISO solver
PreconditionerBase.h Base class for all (MLSB-based) precondi-
tioners
ResFacBILUPre.h/.cpp Global Reservoir-Facilities Block ILU pre-
conditioner
SAMG.h Interfaces to Fortran functions of SAMG
solver
SAMGPre.h/.cpp The pressure preconditioner using SAMG
solver
SubRRBlk.h/.cpp The matrix type for the RR part and cer-
tain submatrices in the FF part of MLSB
matrix
SubCOOBlk.h/.cpp The matrix type for certain submatrices
in the RF and FR part of MLSB matrix
superlu.h/.cpp Library SuperLU solver
SysFFJwrapper.h/.cpp The wrapper format for the FF part in
MLSB matrix
SysFRJwrapper.h/.cpp The wrapper format for the FR part in
MLSB matrix
SysMatwrapper.h The template wrapper format for the en-
tire MLSB matrix (as well as the subma-
trix of a general MS well in the FF part)
SysRFJwrapper.h/.cpp The wrapper format for the RF part in
MLSB matrix
TrueIMPESPre.h/.cpp CPR-based multistage preconditioner
with true-IMPES reduction for reservoir
matrix
YUSolverInterface.hpp Class for extracting and solving local
dense linear systems (e.g., used in New-
ton flash)
• NonlinearSolvers
NonlinearSolver.hpp/.cpp Class for handling all the functionality as-
sociated with the nonlinear solution in ad-
vancing one timestep
ResidualChecker.hpp/.cpp Newton convergence checker based on the
maximum normalized residuals
TrustRegionSolver.hpp/.cpp Class for the trust-region based chopping
strategy
VariableChecker.hpp/.cpp Newton convergence checker based on the
maximum variable changes
• Properties
BlackOilFluid.hpp/.cpp Derived fluid class for black-oil simulation
CO2Phase.hpp/.cpp Derived phase class specifically for model-
ing CO2
ComponentParameters.hpp/.cpp Class containing (default) component pa-
rameters and EOS coefficients computa-
tion
CubicEquationSolver.hpp Functions for solving cubic equation in
EOS solution process
DeadOilFluid.hpp/.cpp Derived fluid class for dead-oil simulation
EOSPhase.hpp/.cpp Derived phase class based on EOS rela-
tionship
Fluid.hpp/.cpp Base class for various fluid types
GammaPhaseFluid.hpp/.cpp Derived fluid class specifically for Gam-
maVariableFormulation
kvalinput.hpp Input data for K-value method
MultiPhaseCompFluid.hpp/.cpp Derived fluid class for multiphase compo-
sitional simulation
MultiPhaseFlash.hpp/.cpp Class for the flash calculation with general
multiphase fluid
Phase.hpp Base class for various phase types
PVTPhase.hpp/.cpp Derived phase class based on PVT rela-
tionship
RockFluid.hpp/.cpp Class for the computation of relative per-
meability and capillary pressure
SCALData.hpp/.cpp Class used by RockFluid for tabulating
the relative permeability and capillary
pressure between two phases
ThermalProperties.hpp/.cpp Class for calculating thermal properties
TwoPhaseCompFluid.hpp/.cpp Derived fluid class for two-phase composi-
tional simulation
TwoPhaseKvalueCompFluid.hpp/.cpp Derived fluid class for two-phase compositional
simulation with K-values
• Reservoir
connections.hpp Type definitions of generic MPFA connec-
tion data and generalized connection list
Reservoir.hpp/.cpp Class for managing all reservoir-related
computations
Rock.hpp/.cpp Class for computing rock properties (e.g.,
porosity) for Reservoir
• Simulation
full_set.hpp/.cpp A namespace containing a set of global indexing
constants that are used throughout the simulation
InputTimeStepping.hpp Input data for time stepping parameters
main.cpp The main (entry) function of the entire
AD-GPRS
SimData.hpp/.cpp The singleton class that processes and
stores the current input data
SimMaster.hpp/.cpp Class for managing all simulation-related
objects and handling preprocessing and
post-processing functionality
SimulationTime.hpp/.cpp The singleton class containing simulation
time and timestep size
Simulator.hpp/.cpp The topmost class instantiated in main
function as an entry to versatile simula-
tion functionality
Statistics.hpp/.cpp The singleton class that handles all statis-
tical data during simulation
• Streamlines
Point3d.hpp Struct for a point in the three-dimensional
space
Streamline.hpp/.cpp Class of streamline
StreamlineHDF5.hpp/.cpp Class for the conversion of Streamline be-
tween its internal and HDF5 format
StreamlineTracer.hpp/.cpp Class for tracing streamlines
• Utilities
enumtypes.hpp/.cpp A collection of all enumerated data types
and functions that convert strings to these
types
impl_function.hpp Implementation of the implicit function theorem
interpolation.hpp Inline functions for linear interpolation
and extrapolation
main_def.hpp Major constants and definitions
MyMatrix.h/.cpp Functions for dense-matrix computation
and sparse-matrix index calculation
Timer.hpp/.cpp Class for accurate timing of various func-
tionality
utilities.hpp Useful common functions (e.g., finding the k-th
smallest element in an array, limiting the variable
update, and clamping variables to the physical range)
• VariableFormulations
GammaVariableFormulation.hpp/.cpp Class of the gamma variable formulation
ModelInput.hpp Input data for nonlinear formulation and
fluid model
MolarVariableFormulation.hpp/.cpp Class of molar variable formulation
NaturalVariableFormulation.hpp/.cpp Class of natural variable formulation
NonlinearFormulation.hpp/.cpp Base class of various nonlinear formula-
tions
The above list of subdirectories and files is shared by the serial and parallel
versions of AD-GPRS. There are some additional files for the parallel AD-GPRS.
Currently, the Intel C++ compiler is required to compile the parallel AD-GPRS
on both Linux and Windows platforms. Thus, there is an Intel C++ project file,
in addition to the Visual C++ solution and project files, for the parallel
AD-GPRS. The other additional files are in the subdirectories listed below:
• LinearSolvers
BlockJacobiPre.h/.cpp OpenMP-based Block Jacobi precondi-
tioner
CartesianMatrix.h Matrix format used by NF (Nested Factor-
ization) preconditioner on Cartesian grid
CartesianMatrixImpl.h Implementation file for the template
CartesianMatrix class
CudaMPNFPre.h/.cu CUDA-based MPNF (Massively Parallel
NF) preconditioner
CudaNFPreWrapper.h The wrapper class utilizing GMRES or
BiCGStab as an accelerator for CUDA-
based MPNF preconditioner
CudaNFPreWrapperImpl.h Implementation file for the template Cu-
daNFPreWrapper class
CudaReduction.h/.cu Class for computing dot products on GPU
(based on an example from CUDA SDK)
CudaUtilities.h/.cpp Some useful macros, functions, and global
variables shared by various CUDA-based
classes
MPNFPre.h OpenMP-based MPNF preconditioner
MPNFPreImpl.h Implementation file for the template MP-
NFPre class
NestedMatrix.h Matrix format with a nested structure and
used by OpenMP-based and CUDA-based
MPNF preconditioner
NestedMatrixImpl.h Implementation file for the template
NestedMatrix class
NFPre.h Serial NF preconditioner
NFPreImpl.h Implementation file for the template NF-
Pre class
NFPreWrapper.h The wrapper class utilizing GMRES or
BiCGStab as an accelerator for OpenMP-
based MPNF preconditioner or serial NF
preconditioner
xsamg.h Interfaces to Fortran functions of parallel
multigrid solver XSAMG
• Utilities
OpenMPTools.h/.cpp Some useful macros, functions, and global
variables shared by various OpenMP-
based functionality