PARALLEL GENERAL-PURPOSE RESERVOIR SIMULATION
WITH COUPLED RESERVOIR MODELS
AND MULTISEGMENT WELLS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ENERGY
RESOURCES ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Yifan Zhou
November 2012
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Prof. Hamdi Tchelepi) Principal Co-Advisor
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Prof. Khalid Aziz) Principal Co-Advisor
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Prof. Roland Horne)
Approved for the University Committee on Graduate Studies
Abstract
The development of a parallel general-purpose simulation framework for coupled reservoir models and multisegment wells is the subject of this dissertation. With this work, AD-GPRS, the new generation of the General Purpose Research Simulator (GPRS) built on a flexible Automatic Differentiation (AD) framework, is now a powerful and flexible platform for modeling thermal-compositional fluid flow in reservoir models with fully unstructured grids. AD-GPRS offers advanced and extensible spatial and temporal discretization schemes, advanced linear solvers, and a generalized multisegment well model. In addition, AD-GPRS supports OpenMP parallelization on multicore platforms and a Nested Factorization (NF) linear solver for systems with multiple GPUs (Graphics Processing Units).
AD-GPRS employs generalized MultiPoint Flux Approximations (MPFA) for spa-
tial discretization and a multilevel Adaptive Implicit Method (AIM) for time dis-
cretization. A generalized connection list is used to locate the block nonzero entries
in the system matrix associated with general MPFA discretization. Our AIM imple-
mentation allows for new fluid models and nonlinear formulations. The framework
can deal with any combination of TPFA (Two-Point Flux Approximation), MPFA,
FIM (Fully Implicit Method), and AIM.
For efficient linear solution of coupled reservoir models and advanced wells, AD-GPRS supports linear systems based on the MLBS (MultiLevel Block Sparse) data structure. MLBS was first designed and implemented in the original GPRS and extended further in AD-GPRS. Equipped with the CPR (Constrained Pressure Residual) preconditioner, the MLBS-based solver is both robust and efficient. Its hierarchical data structure accommodates systems with general MPFA discretization and AIM.
For accurate well modeling, a general MultiSegment (MS) well model is imple-
mented in AD-GPRS. In this model, variables and equations are defined for both
nodes and connections. The general MS well model allows for the following advanced
features: general branching, loops with arbitrary flow directions, multiple exit con-
nections with different constraints, and special nodes (e.g., separators, valves). The
linear and nonlinear solvers are extended to address the numerical challenges brought
about by the general MS well model.
Parallel reservoir simulation has recently drawn a lot of attention, and specializ-
ing the algorithms for the target parallel architecture has grown in importance. We
describe an architecture-aware approach to parallelization. First, multithreading par-
allelization of AD-GPRS is described. Parallel Jacobian construction is achieved with
a thread-safe extension of the ADETL library. For linear solution, we use a two-stage
CPR preconditioning strategy, which combines the parallel multigrid solver XSAMG
and the Block Jacobi technique with Block ILU(0) applied locally.
We also describe multi-GPU parallelization of Nested Factorization (NF). We
build on the Massively Parallel NF (MPNF) framework described by Appleyard et
al. [8]. The most important features of our GPU-based implementation of MPNF
include: 1) special ordering of the matrix elements to maximize coalesced access
to the GPU global memory, 2) application of ‘twisted factorization’ to increase the
number of concurrent threads at no additional cost, and 3) multi-GPU extension of the
algorithm by first performing computations in the halo region of each GPU, and then
overlapping the peer-to-peer memory transfer between GPUs with the computations
of the interior regions.
Acknowledgements
I would like to express my sincere gratitude to my advisors, Prof. Hamdi Tchelepi and Prof. Khalid Aziz, for their guidance, help, and encouragement in this work. I feel fortunate that they let me work on the development of AD-GPRS, which fits my background and interests very well. During the past five years, I was encouraged to explore various interesting and challenging aspects of reservoir simulation, and I believe the wide coverage of these aspects will greatly benefit my future career. Prof. Hamdi Tchelepi has an excellent sense of cutting-edge research topics in reservoir simulation and was always able to offer solid support and insightful views whenever I moved in a new direction. Prof. Khalid Aziz is very knowledgeable and highly experienced, and I was able to learn a lot from every discussion with him.
I would like to thank Prof. Roland Horne for reading my dissertation and providing valuable corrections and suggestions, some of which were quite beneficial for improving my academic writing. Prof. Kate Maher kindly chaired my PhD oral defense and is gratefully acknowledged. I would also like to thank Prof. Biondo Biondi for serving on my defense committee as an oral examiner. His valuable comments are highly appreciated.
I would like to thank Dr. Rami Younis, the original developer of ADETL and now a professor at the University of Tulsa, for his help and support during my M.S. research on extending ADETL. I also worked closely with Dr. Denis Voskov on the development of AD-GPRS, and his contribution is sincerely acknowledged.
I would like to thank Dr. Brad Mallison from Chevron ETC for offering sound guidance, and ample freedom, in an internship project relevant to my PhD research. I would also like to express my gratitude to Dr. Hui Cao, the original developer of GPRS and now with Total, for giving us helpful suggestions on various aspects, including the AIM formulation and linear solvers.
I would like to thank Dr. Yuanlin Jiang, also an important developer of GPRS and now with QRI, for his contributions to the general MS well model. I also benefited from discussions with Dr. Arthur Moncorge of Total, and his helpful suggestions are appreciated.
I would like to thank Dr. Klaus Stuben and Sebastian Gries from Fraunhofer SCAI for their help and support with the XSAMG solver. I would also like to thank Dr. Robert Clapp, Xukai Shen, Chris Leader, and a few others from the Department of Geophysics for their kind help in setting up the GPU computing environment.
I would like to thank all my friends in the Department of Energy Resources Engineering, as well as other departments in the School of Earth Sciences at Stanford University. I appreciate their help, support, and consideration over the last five years. Their friendship is a treasured part of my life.
I would like to acknowledge the industrial affiliates of SUPRI-B (Reservoir Simu-
lation) and SESAAI (Algorithms and Architectures) for their financial support that
made this work possible.
Finally, I would like to thank my parents for their continuous love and support during the past 27 years. Most importantly, I would like to express my greatest gratitude to my beloved wife, Dr. Xiaochen Wang. We have experienced many significant moments together and share countless beautiful memories. I am so proud of all her achievements. Without her care, love, support, and encouragement, my PhD life would never have been so colorful and splendid. She deserves appreciation and love from the deepest part of my heart.
Contents
Abstract iii
Acknowledgements v
1 Introduction 1
1.1 Background and Dissertation Outline . . . . . . . . . . . . . . . . . . 1
1.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Hierarchical organization of simulation variables . . . . . . . . 10
1.2.2 Building residual equations . . . . . . . . . . . . . . . . . . . . 12
2 AD Framework with MPFA and AIM capabilities 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 AD-Based MPFA Framework . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Nonlinear level . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Linear level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 AD-Based AIM Formulation . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Implicit-level determination (nonlinear level) . . . . . . . . . . 27
2.3.2 Treatment of nonlinear terms in the flux (nonlinear level) . . . 30
2.3.3 Algebraic reduction in terms of implicit variables (linear level) 32
2.3.4 Updating of the remaining variables (linear level) . . . . . . . 33
2.4 Test Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Full and upscaled SPE 10 problems . . . . . . . . . . . . . . . 34
2.4.2 Unstructured grid problem . . . . . . . . . . . . . . . . . . . . 38
2.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 Linear Solver Framework 43
3.1 Overview of the Framework . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 CSR linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Block linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.1 First level: global matrix . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Second level: reservoir and Facilities matrix . . . . . . . . . . 54
3.3.3 Third level: well matrices . . . . . . . . . . . . . . . . . . . . 57
3.4 Solution strategy of block linear system . . . . . . . . . . . . . . . . . 61
3.4.1 Matrix extraction . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 Algebraic reduction from full to primary system . . . . . . . . 64
3.4.3 Algebraic reduction from primary to implicit system . . . . . . 69
3.4.4 Preconditioned Linear Solver . . . . . . . . . . . . . . . . . . . 73
3.4.5 The Two-Stage Preconditioning Strategy . . . . . . . . . . . . 74
3.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4 General MultiSegment Well Model 91
4.1 Model Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Drift-Flux Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.4.1 Liquid-gas model . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.2 Oil-water model . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.3 Gas-oil-water model . . . . . . . . . . . . . . . . . . . . . . . 106
4.5 Extensions of the AD Simulation Framework . . . . . . . . . . . . . . 107
4.5.1 Global variable set . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5.2 Linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.3 Jacobian for the general MS well model . . . . . . . . . . . . . 110
4.6 Well Initialization, Calculation, and Variable Updating . . . . . . . . 113
4.6.1 Well initialization . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.6.2 Well calculation . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.6.3 Updating of the well variables . . . . . . . . . . . . . . . . . . 120
4.7 Multistage Preconditioner . . . . . . . . . . . . . . . . . . . . . . . . 121
4.7.1 First stage: global on the pressure system . . . . . . . . . . . 122
4.7.2 Second stage: local on the overall system . . . . . . . . . . . . 125
4.8 Nonlinear Solution: Local Facility Solver . . . . . . . . . . . . . . . . 127
4.9 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.9.1 Two-dimensional reservoir with a dual-branch general MS well 129
4.9.2 Upscaled SPE 10 reservoir with three multilateral producers . 131
4.9.3 Linear solver performance . . . . . . . . . . . . . . . . . . . . 134
4.9.4 Nonlinear solver performance . . . . . . . . . . . . . . . . . . 137
4.9.5 Comparison of simulation results: AD-GPRS versus Eclipse . 138
4.10 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5 Multicore Parallelization 142
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2 Jacobian Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2.1 Thread-safe ADETL . . . . . . . . . . . . . . . . . . . . . . . 143
5.2.2 Parallel computations other than the linear solver . . . . . . . 144
5.3 Linear Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3.1 Parallel matrix data structure . . . . . . . . . . . . . . . . . . 145
5.3.2 First stage pressure solution — XSAMG preconditioner . . . . 146
5.3.3 Second stage overall solution — Block Jacobi/BILU precondi-
tioner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.4 Parallel benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6 GPU Parallelization of Nested Factorization 153
6.1 Introduction to GPU Architecture . . . . . . . . . . . . . . . . . . . . 153
6.2 Nested Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.3 Massively Parallel Nested Factorization . . . . . . . . . . . . . . . . . 157
6.4 Our CUDA-based Implementation . . . . . . . . . . . . . . . . . . . . 161
6.4.1 Basic features . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.4.2 Runtime profiles . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.5 Coalesced memory access . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.6 More parallelism: multiple threads in a kernel . . . . . . . . . . . . . 167
6.6.1 Decrease the number of colors . . . . . . . . . . . . . . . . . . 168
6.6.2 Twisted Factorization . . . . . . . . . . . . . . . . . . . . . . 168
6.6.3 Cyclic Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.7 More flexibility in the system matrix . . . . . . . . . . . . . . . . . . 170
6.7.1 Inactive cells . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.7.2 Additional (e.g., well) equations . . . . . . . . . . . . . . . . . 171
6.8 Multi-GPU Parallelization . . . . . . . . . . . . . . . . . . . . . . . . 173
6.8.1 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.8.2 Data transfer between GPUs . . . . . . . . . . . . . . . . . . . 174
6.8.3 Overlapping data transfer with computation . . . . . . . . . . 177
6.8.4 Mapping between global and local vectors . . . . . . . . . . . 178
6.9 Parallel Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.9.1 Single-GPU test case 1: upscaled SPE 10 . . . . . . . . . . . . 180
6.9.2 Single-GPU test case 2: full SPE 10 . . . . . . . . . . . . . . . 182
6.9.3 Single-GPU test case 1: 8-fold refinement (2 by 2 by 2) of SPE 10 . . 183
6.9.4 Multi-GPU test case 2: 24-fold refinement (4 by 3 by 2) of SPE 10 . . 186
6.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.11 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7 Conclusions and Future Work 193
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Nomenclature 200
Bibliography 206
Appendix A Programming Model of AD-GPRS 218
A.1 The structure of AD-GPRS . . . . . . . . . . . . . . . . . . . . . . . 219
A.2 Flow sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
A.3 List of files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
List of Tables
1.1 A hypothetical variable set for three-phase black-oil simulation [85] . 11
2.1 FIM and AIM runtime performance of upscaled SPE 10 problem . . 35
2.2 Runtime performance of full SPE 10 case . . . . . . . . . . . . . . . 37
2.3 Runtime performance of unstructured grid case . . . . . . . . . . . . 40
3.1 Row pointer array of CSR format . . . . . . . . . . . . . . . . . . . . 45
3.2 Column index and value arrays of CSR format . . . . . . . . . . . . 45
List of Figures
1.1 Key features and capabilities of Automatic-Differentiation General-
Purpose Research Simulator (AD-GPRS) . . . . . . . . . . . . . . . . 3
2.1 Illustration of different MPFA schemes . . . . . . . . . . . . . . . . . 16
2.2 Illustration of the AIM scheme (IMPES+FIM) . . . . . . . . . . . . 17
2.3 Block nonzero entries corresponding to TPFA and MPFA fluxes . . . 23
2.4 An example of generalized connection list . . . . . . . . . . . . . . . 25
2.5 Grid mapping of full SPE 10 with skewness and distortion [99] . . . . 36
2.6 Gas rates of two injectors and oil rates of two producers in full SPE 10
with skewed grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7 Gas saturation at the end of simulation for TPFA and MPFA with 32:1
anisotropy ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.8 Gas rate of producer #1 and #4 for TPFA and MPFA with 1:1, 2:1,
8:1, 32:1 anisotropy ratios . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Typical global Jacobian matrix in reservoir simulation [38] . . . . . . 52
3.2 Submatrices in the global Jacobian matrix [98] . . . . . . . . . . . . . 53
3.3 Structure of the JRR submatrix . . . . . . . . . . . . . . . . . . . . . 55
3.4 Structure of second-level MLBS matrices [38] . . . . . . . . . . . . . . 58
3.5 Sample reservoir with two wells . . . . . . . . . . . . . . . . . . . . . 60
3.6 Structure of third-level MLBS matrices . . . . . . . . . . . . . . . . . 61
4.1 Illustration of the original multisegment well model (from [38]) . . . 92
4.2 Illustration of the general multisegment well model (from [38]) . . . 93
4.3 Multilevel block-sparse linear system (modified from [38]) . . . . . . . 110
4.4 Jacobian matrix structure of the general MS well model . . . . . . . 112
4.5 A segment with zero, one, or multiple perforations . . . . . . . . . . 115
4.6 Initialization of mixture flow rates . . . . . . . . . . . . . . . . . . . 118
4.7 Calculation sequence of the general MS well model . . . . . . . . . . 120
4.8 Variable update sequence of the general MS well model . . . . . . . . 122
4.9 The reservoir and well configuration of example 1 . . . . . . . . . . . 129
4.10 The simulation results of example 1 . . . . . . . . . . . . . . . . . . 130
4.11 The reservoir and well settings of example 2 with separate controls . 132
4.12 The simulation results of example 2 with separate controls . . . . . . 132
4.13 The reservoir and well settings of example 2 with a group control . . 133
4.14 The simulation results of example 2 with a group control . . . . . . . 134
4.15 The linear solver performance of example 3 . . . . . . . . . . . . . . 136
4.16 The nonlinear solver performance of example 4 . . . . . . . . . . . . 138
4.17 The reservoir and well settings of example 3 (AD-GPRS versus Eclipse) 139
4.18 Comparison of simulation results: AD-GPRS versus Eclipse . . . . . 140
5.1 Illustration of Thread-Local Storage (TLS) . . . . . . . . . . . . . . . 144
5.2 Illustration of XSAMG preconditioner [31] . . . . . . . . . . . . . . . 147
5.3 Illustration of Block Jacobi preconditioner . . . . . . . . . . . . . . . 149
5.4 Performance result of the full SPE 10 model with TPFA discretization 150
5.5 Performance result of the full SPE 10 model with MPFA O-method
discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.1 Illustration of the Fermi GPU architecture [52] . . . . . . . . . . . . . 154
6.2 Examples of different coloring strategies . . . . . . . . . . . . . . . . 160
6.3 Runtime profile of the GPU-based MPNF preconditioner for solving
the pressure system of the top 10 layers of the SPE 10 reservoir model
in single-precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4 Comparison of CUDA memory access patterns . . . . . . . . . . . . . 165
6.5 Runtime profile of the implementation with BiCGStab, customized
reduction kernel, and coalesced memory access . . . . . . . . . . . . . 167
6.6 Example of partitioning with three GPUs (top view) . . . . . . . . . 174
6.7 Illustration of left-right data transfer approach (modified from [49]) . 175
6.8 Illustration of pairwise data transfer approach (modified from [49]) . . 176
6.9 Assignment of tasks to two streams in the solution phase . . . . . . . 177
6.10 Assignment of tasks to two streams in the setup phase . . . . . . . . 179
6.11 The performance results of the upscaled SPE 10 problem with 139
thousand cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.12 The performance results of the full SPE 10 problem with 1.1 million cells . 183
6.13 The performance results of the refined SPE 10 problem with 8.8 million
cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.14 The performance results of the further refined SPE 10 problem with
26.9 million cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.1 Overall structure of the entire simulator . . . . . . . . . . . . . . . . 219
A.2 Structure of the NonlinearFormulation . . . . . . . . . . . . . . . . . 221
A.3 Structure of the Reservoir . . . . . . . . . . . . . . . . . . . . . . . . 223
A.4 Structure of the Fluid . . . . . . . . . . . . . . . . . . . . . . . . . . 224
A.5 Structure of the Facilities . . . . . . . . . . . . . . . . . . . . . . . . 226
A.6 Structure of the AIMScheme . . . . . . . . . . . . . . . . . . . . . . 227
A.7 Structure of the NonlinearSolver . . . . . . . . . . . . . . . . . . . . 228
A.8 Structure of the LinearSystem . . . . . . . . . . . . . . . . . . . . . 229
Chapter 1
Introduction
1.1 Background and Dissertation Outline
Reservoir simulation is a primary tool for planning and managing oil recovery and
CO2 sequestration processes. In recent years, there has been significant growth in
the resolution and complexity of simulation models of practical interest. This growth
includes both the reservoir and the well models. Moreover, there is a growing need
for accurate modeling of Enhanced Oil Recovery (EOR) processes from conventional
and unconventional resources. Several important efforts aimed at developing reser-
voir flow simulators based on generalized (thermal) compositional formulations have
been reported [5, 15, 18, 23, 25, 63]. In Stanford’s Reservoir Simulation Industrial Af-
filiates Program (SUPRI-B, see https://pangea.stanford.edu/researchgroups/
supri-b/), significant efforts have been invested to develop a flexible research plat-
form for general-purpose reservoir flow simulation.
The General Purpose Research Simulator (GPRS), which was first developed by
Cao [15] and extended significantly by Jiang [38], is a powerful platform that serves
as the predecessor of the new computational platform discussed here. GPRS is distin-
guished from the previous efforts in its extensible modular design and object-oriented
computer code written in C++ [38]. GPRS employs a compositional-thermal formula-
tion, utilizes a Two Point Flux Approximation (TPFA) for spatial discretization, and
the Adaptive Implicit Method (AIM) [15, 65, 80] for time discretization. GPRS has
a connection-based design allowing for both structured and unstructured grids. Ad-
vanced well modeling capabilities, including advanced multilateral wells, are available
in GPRS. Advanced linear solvers with multistage preconditioners make it possible to
solve large models with unstructured grids and complex wells [38,39]. More recently,
chemical reaction modeling has also been developed and integrated into GPRS [28,29].
With these features, GPRS can simulate subsurface CO2 sequestration and flow in fractured formations, and can facilitate gradient-based optimization [69]. Many students and
researchers in our group have contributed to this reservoir-simulation research plat-
form. The most significant developments include those by Cao [15,16], Jiang [38,39],
Fan [28,29], Pan [60,61], and Voskov [86,87].
Given that we already have a powerful simulation research platform, namely GPRS, why build a new research platform using Automatic Differentiation (AD)? To answer this question, we need to consider the unresolved and upcoming challenges associated with the development of a general-purpose reservoir-simulation platform. The research platform should be able to accommodate the growing variety and complexity of subsurface nonlinear processes that must be modeled accurately and efficiently.
All the existing general-purpose reservoir simulators, both in industry and academia, employ hand (manual) differentiation and implementation of the Jacobian [15, 21, 25, 70, 75]. Derivation of analytical derivatives, coding, debugging, and extensive testing are necessary whenever a new physical mechanism, constitutive relation, or discretization scheme (in space or time) is to be added. This time-consuming, tedious, and error-prone process of constructing the Jacobian matrix is a major reason why it is extremely difficult to extend the capabilities of existing reservoir simulators, including the current version of GPRS.
Here, our objective is to establish a general-purpose numerical simulation frame-
work that can be used as a flexible, extensible, and computationally efficient platform
for reservoir simulation research. For this purpose, AD is introduced as a key capabil-
ity of the new framework. Given the discrete form of the governing nonlinear residual
equations and declaration of the independent variables, the AD library employs ad-
vanced expression templates with block data-structures to automatically generate
compact computer code for the Jacobian matrix [92]. A brief description of our new
AD-GPRS framework is given in Section 1.2.
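The underlying idea can be illustrated with a minimal forward-mode AD type. ADETL's expression-template machinery is far more elaborate (sparse block derivatives and lazy evaluation), but the principle of propagating derivatives through overloaded arithmetic is the same; all names below are illustrative, not ADETL's actual API.

```cpp
#include <cassert>

// Minimal forward-mode AD scalar: carries a value and one derivative.
struct Dual {
    double val;  // function value
    double der;  // derivative w.r.t. the chosen independent variable
};

Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.der + b.der}; }
Dual operator*(Dual a, Dual b) {
    // product rule: (ab)' = a'b + ab'
    return {a.val * b.val, a.der * b.val + a.val * b.der};
}

// A residual-like expression R(p) = p*p + 3*p. Given only this code for
// the residual, the Jacobian entry dR/dp = 2p + 3 falls out automatically
// from the overloaded arithmetic -- no hand-coded derivative is needed.
Dual residual(Dual p) {
    Dual three{3.0, 0.0};  // constant: zero derivative
    return p * p + three * p;
}
```

Seeding the input with derivative 1 (`Dual p{2.0, 1.0}`) yields both the residual value and its Jacobian entry in one evaluation.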
[Figure 1.1 is a diagram whose labels list the key capabilities: Fractured Porous Media; CO2 Sequestration; Geomechanics Coupling; Gradient-based Optimization and History Matching; Compositional/Thermal Formulation; Advanced Linear/Nonlinear Solvers; Unstructured Grids with TPFA/MPFA; Adaptive Implicit Method (AIM); General Multi-Segment Wells; Multi-Core/GPU Parallelization; Automatic Differentiation.]
Figure 1.1: Key features and capabilities of Automatic-Differentiation General-Purpose Research Simulator (AD-GPRS)
We describe the overall design of our new generation of GPRS based on a flex-
ible AD framework. The key features and capabilities of AD-GPRS are shown in
Figure 1.1. With the research reported in this dissertation, AD-GPRS is capable of
modeling thermal-compositional fluid flow in reservoir models with fully unstructured
grids. AD-GPRS has general spatial and temporal discretization schemes, advanced
linear solvers, and a generalized multisegment well model. In addition, AD-GPRS
supports OpenMP parallelization on multicore platforms and a Nested Factorization
(NF, see [7]) linear solver for systems with multiple GPUs. With these capabilities,
AD-GPRS can simulate challenging problems in a flexible and efficient way. This in-
cludes modeling the long-term behavior of CO2 sequestration processes and fluid flow
in reservoirs with complex geological features, such as fractures and faults. Gradient-
based optimization and history matching are also supported through a fully-integrated
AD-based adjoint capability.
For flexible reservoir modeling, AD-GPRS supports generally unstructured grids,
employs a generalized MultiPoint Flux Approximation (MPFA [2, 3, 27, 59]) for spa-
tial discretization, and uses a multilevel Adaptive Implicit Method (AIM) for time
discretization. The MPFA and AIM capabilities are described in Chapter 2. For gen-
erality, no particular structure is assumed for the stencil. A generalized connection
list is introduced to locate the block nonzero entries in the system matrix with general
MPFA discretization. Our AIM implementation is designed to facilitate systematic
application of the method to new fluid models and variable formulations. AD-GPRS
allows for any combination of TPFA (Two-Point Flux Approximation), MPFA, FIM
(Fully Implicit Method), and AIM. The generic and modular design is amenable to
extension, both in terms of modeling additional flow processes and implementing new
numerical methods. The AD-based modeling capability is demonstrated for highly
nonlinear compositional problems using challenging large-scale reservoir models that
include full-tensor permeability fields and nonorthogonal grids. The behaviors of
TPFA and several MPFA schemes are analyzed for both FIM and AIM simulations.
The implications of using MPFA and AIM on both the nonlinear and linear solvers
are discussed and analyzed.
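As a sketch of the connection-list idea (the names and layout are illustrative, not AD-GPRS's actual classes), each connection can be reduced to an interface record carrying the full stencil of cells whose unknowns enter its flux expression; the block nonzero pattern of the Jacobian then follows directly:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Illustrative generalized connection: the two cells exchanging flux plus
// the stencil of cells appearing in the MPFA flux expression.
// (For TPFA the stencil degenerates to just {cellL, cellR}.)
struct Connection {
    int cellL, cellR;
    std::vector<int> stencil;
};

// Block nonzero pattern of the system matrix: each flux contributes to the
// residual rows of both adjacent cells, in every column of its stencil.
std::set<std::pair<int, int>> blockNonzeros(const std::vector<Connection>& conns) {
    std::set<std::pair<int, int>> nz;
    for (const auto& c : conns)
        for (int col : c.stencil) {
            nz.insert({c.cellL, col});  // flux enters cellL's residual row
            nz.insert({c.cellR, col});  // and, with opposite sign, cellR's row
        }
    return nz;
}
```

No grid structure is assumed anywhere: structured, unstructured, TPFA, and MPFA stencils all flow through the same loop.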
For efficient linear solution of coupled reservoir models and advanced wells, AD-
GPRS supports a block linear system based on the MLBS (MultiLevel Block Sparse)
data structure [38], which is discussed in Chapter 3. MLBS was first designed and im-
plemented in the original GPRS and further extended in AD-GPRS. Equipped with
GMRES (Generalized Minimal RESidual, see [67, 68]) and the CPR (Constrained
Pressure Residual, see [88, 89]) preconditioner, MLBS is very powerful and highly
efficient. This is the reason why MLBS is employed in AD-GPRS. The MLBS hierar-
chical data structure and associated solution strategies, which include matrix extrac-
tion, algebraic reduction, iterative linear solution and preconditioning, and explicit
updating, are discussed in detail.
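The algebraic-reduction step can be illustrated on a 2x2 block partition, with scalars standing in for the sub-blocks; this is a generic Schur-complement sketch under that simplification, not the MLBS implementation itself:

```cpp
#include <cassert>

// Eliminate "secondary" unknowns y from the coupled system
//   A x + B y = f
//   C x + D y = g
// via the Schur complement (A - B D^{-1} C) x = f - B D^{-1} g,
// then recover y by explicit updating: y = D^{-1}(g - C x).
// Scalars stand in for the sub-blocks of the multilevel block matrix.
struct Reduced { double schur; double rhs; };

Reduced reduce(double A, double B, double C, double D, double f, double g) {
    return {A - B / D * C, f - B / D * g};
}

double recoverSecondary(double C, double D, double g, double x) {
    return (g - C * x) / D;
}
```

The iterative solver only ever sees the reduced system; the secondary unknowns are updated explicitly afterwards, which is exactly the pattern the multilevel solution strategy repeats at each level of the hierarchy.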
For accurate modeling of multiphase flow in wellbores and surface (pipeline) net-
works, a general MultiSegment (MS) well model [38] is implemented in AD-GPRS
and discussed in Chapter 4. In this MS model, variables and equations are defined
for both nodes and connections. The general MS well model allows for the following advanced features: general branching, loops with arbitrary flow directions, multiple exit connections with different constraints, and special nodes (e.g., separators, valves). The model definition, mathematical formulation, and AD-based implementation, including the extensions of the AD framework and the well initialization, calculation, and variable-updating procedures, are described in detail.
ear and nonlinear solvers to address the numerical difficulties brought about by the
general MS well model are also discussed. Moreover, the robustness and efficiency
of AD-GPRS for complex reservoir models with MS wells are demonstrated with
numerical examples.
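Conceptually, the well topology reduces to a directed graph in which each connection simply lists its endpoint nodes; because no tree structure is assumed, branching, loops, and multiple exits all come for free. A hypothetical sketch (not the AD-GPRS data structures) of how node mass balances follow from such a list:

```cpp
#include <cassert>
#include <vector>

// Connection-based well topology: variables live on both nodes (e.g.,
// pressure) and connections (e.g., mixture flow rate). A positive rate
// means flow from 'from' to 'to'.
struct WellConnection { int from, to; double rate; };

// Net inflow into each node; the node mass-balance residual drives this
// to zero (perforation and exit source terms omitted for brevity).
std::vector<double> netInflow(int nNodes, const std::vector<WellConnection>& conns) {
    std::vector<double> q(nNodes, 0.0);
    for (const auto& c : conns) {
        q[c.to]   += c.rate;
        q[c.from] -= c.rate;
    }
    return q;
}
```

For example, two branches feeding a junction node that drains through a single exit connection balance exactly when the exit rate equals the sum of the branch rates.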
Parallel reservoir simulation has recently drawn a lot of attention due to the rapidly increasing size and complexity of simulation models, as well as the growth
in computational power. Due to the limitations of power consumption and heat
dissipation, single-core processors with high clock frequency have already been phased
out [79]. The various architectures of parallel computation bring new opportunities
and challenges to existing algorithms. Specializing the algorithms for the target
parallel architecture has grown in importance in the last few years. Emerging parallel
architectures for high performance computing include multicore, many-core [43, 71],
and GPGPU (General-Purpose computation on Graphics Processing Units) [52, 55]
platforms.
We describe an architecture-aware approach to parallel reservoir simulation in
two chapters. In Chapter 5, we describe multithreading parallelization of AD-GPRS.
Both the Jacobian generation and linear-solver computations are covered. Parallel
Jacobian construction is achieved with a thread-safe extension of our AD library.
For the linear solution, a two-stage CPR (Constrained Pressure Residual) preconditioning strategy is used. The latest parallel multigrid solver from Fraunhofer SCAI, XSAMG [31], serves as the first-stage pressure preconditioner, whereas the Block Jacobi technique, with Block ILU(0) as the local preconditioner, is employed in the second stage. The parallel performance of AD-GPRS is demonstrated using the full
SPE 10 problem [19] with three different discretization schemes for nonorthogonal
grids on multicore platforms.
In Chapter 6, we discuss the details of the multi-GPU parallelization of Nested
Factorization (NF) [7], a linear solver in which high concurrency can be exploited
through proper modification. We build on the Massively Parallel NF (MPNF) frame-
work described by Appleyard et al. [8]. The most important features of our GPU-
based implementation of MPNF include: 1) special ordering of the matrix elements
to maximize coalesced access to GPU global memory, 2) application of ‘twisted fac-
torization’ to increase the number of concurrent threads at no additional cost, and
3) multi-GPU extension of the algorithm by first performing computation in the halo
region in each GPU and then overlapping the peer-to-peer memory transfer between
GPUs with the computation of the interior regions. Numerical examples, including
upscaled, full, and refined SPE 10 problems, are used to demonstrate the parallel
performance of our MPNF implementation on single-GPU and multi-GPU platforms.
Finally, we summarize the research discussed here and draw conclusions in Chapter
7. Possible directions for future work are also suggested.
1.2 Automatic Differentiation
AD is a technique for generating computer code that computes derivatives. The
AD process consists of (1) analyzing the expression parse-tree and decomposing the
expression into basic unary or binary (+, -, *, /) operations, (2) applying the basic
differentiation rules: linearity, product, and treatment of quotients, (3) performing
transcendental elementary function derivatives (e.g., (exp(v))′ = exp(v) · v′), and (4)
performing the chain rule (i.e., (f ◦ g)′ = (f ′ ◦ g)g′) [30, 34, 64]. AD offers flexibility,
generality, and accuracy up to machine precision. However, because the algorithmic
complexity of AD is at least comparable to that of analytical differentiation, the ability
to develop an optimally efficient AD library is domain specific and often requires
significant effort [92].
Various aspects of AD, including improved higher-order derivative evaluation and
sparsity-aware methods, are under active research and development by a sizeable
community. Introductions, recent research activities [12, 14] and software packages
are available (see http://www.autodiff.org). Although AD is being used in nu-
merical simulation and optimization [13, 24, 42], it is not yet a mainstream approach
in industrial-grade, large-scale simulators.
AD is different from Numerical Differentiation (ND), which is a popular approach
for computing gradients [50]. ND uses truncated Taylor series to approximate deriva-
tives. For example, the second-order central differencing for the first derivative can
be expressed as:
$f'(x) = \left[f(x+\Delta x) - f(x-\Delta x)\right]/(2\Delta x) + O(\Delta x^2).$ (1.1)
The implementation is usually simpler than hand differentiation (i.e., coding the
analytic form of the derivative) because an explicit algebraic form of the derivatives
is not required. ND requires evaluation of the residual equations at a number of
points, which depends on the number of variables and the specific approximation
scheme. For instance, if central differencing is used for a residual equation f with N
variables, then 2N function evaluations are needed.
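The cost pattern of ND can be made concrete with a short sketch (illustrative Python; the function and variable names are ours, not part of any simulator). Each gradient component of a residual with N variables costs one pair of residual evaluations, for 2N in total:

```python
# Second-order central differencing, as in Eq. (1.1): each component of
# the gradient of a residual f with N variables costs two evaluations
# of f, i.e., 2N evaluations in total. Illustrative code only.

def central_gradient(f, x, dx=1e-6):
    """Approximate the gradient of f at point x by central differences."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += dx        # forward-shifted point
        xm = list(x); xm[i] -= dx        # backward-shifted point
        grad.append((f(xp) - f(xm)) / (2.0 * dx))
    return grad

# Example: f(x) = x0^2 + 3*x1, so df/dx0 = 2*x0 and df/dx1 = 3.
f = lambda x: x[0] ** 2 + 3.0 * x[1]
g = central_gradient(f, [2.0, 5.0])      # approximately [4.0, 3.0]
```

Shrinking dx much further in this sketch would begin to amplify round-off error, which illustrates the interval-selection difficulty noted in [50].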
AD has three primary advantages over ND: (1) conditional branches (e.g., upwind-
ing, or variable switching) are treated easily in AD, whereas they cannot be handled
readily in ND due to discontinuity in the derivatives; (2) AD has no truncation error,
because it generates the code that corresponds to the analytical derivatives, whereas
it is not always possible to bound the truncation error a priori in ND by selecting a proper interval $\Delta x_i$: too large an interval leads to large truncation errors, and too small an interval incurs significant round-off errors [50]; and (3) AD evaluates a residual
expression in a single pass, whereas the algorithmic complexity of ND grows quickly
with the number of variables. These advantages of AD over ND are particularly
important for the development of robust simulators for strongly coupled nonlinear
processes [92].
A simple example of AD is presented next. First, the addition operation (+) and
the sine (sin) function evaluator are both augmented with the ability to compute
derivatives (underlined functions are augmented), namely:
$a \,\underline{+}\, b = \{a+b,\; a'+b'\},$ (1.2)

$\underline{\sin}(f) = \{\sin(f),\; \cos(f)\cdot f'\}.$ (1.3)
Then, using the chain rule, the value and derivatives of sin(a + b) can be computed
as:
$\underline{\sin}(a \,\underline{+}\, b) = \underline{\sin}(\{a+b,\; a'+b'\})$
$= \{\sin(a+b),\; \cos(a+b)\cdot(a'+b')\}$ (1.4)
$= \{\sin(a+b),\; \cos(a+b)\cdot a' + \cos(a+b)\cdot b'\}.$
Repeating this process, it is easy to see that no matter how complex the residual
equation, R, is, the associated gradient, R′, will always be in the form of a linear
combination of ‘simple’ sparse gradients, which have already been defined or computed. That is,

$R' = c_1 v'_1 + c_2 v'_2 + \dots + c_N v'_N$ (1.5)

where $c_k$ ($1 \le k \le N$) are scalar coefficients and $v'_k$ ($1 \le k \le N$) are sparse gradients
that have been defined or computed prior to the evaluation of R. A slightly more
complicated AD example of a two-cell oil-water problem is described in [97].
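The augmented operations of Eqs. (1.2)-(1.4) can be mimicked with a minimal forward-mode sketch (illustrative Python with dict-based sparse gradients; ADETL itself is a C++ expression-template library, and the class below is not its API):

```python
import math

# Minimal forward-mode AD in the spirit of Eqs. (1.2)-(1.4): a value
# carries a sparse gradient (variable index -> derivative), and each
# augmented operation applies the corresponding differentiation rule.
# Illustrative only; not ADETL's actual design.

class ADScalar:
    def __init__(self, val, grad=None):
        self.val = val
        self.grad = dict(grad or {})     # sparse gradient

    def __add__(self, other):            # Eq. (1.2): (a + b)' = a' + b'
        g = dict(self.grad)
        for k, d in other.grad.items():
            g[k] = g.get(k, 0.0) + d
        return ADScalar(self.val + other.val, g)

def ad_sin(f):                           # Eq. (1.3): sin(f)' = cos(f) * f'
    return ADScalar(math.sin(f.val),
                    {k: math.cos(f.val) * d for k, d in f.grad.items()})

# Independent variables carry derivative 1 with respect to themselves.
a = ADScalar(1.0, {0: 1.0})
b = ADScalar(2.0, {1: 1.0})
r = ad_sin(a + b)                        # Eq. (1.4): {sin(a+b), cos(a+b)*(a'+b')}
```

Only the expression `ad_sin(a + b)` is written by the user; the gradient of the result emerges from the augmented operations, exactly as in Eq. (1.4).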
Several investigations of the use of AD for reservoir simulation have been pur-
sued recently. In [92, 93], an automatically differentiable data-type was introduced,
and an AD library - the Automatically Differentiable Expression Templates Library
(ADETL) - was designed and implemented. To make the automatic differentiation
process efficient, customized memory allocators and block-sparse treatment were later
incorporated into ADETL [96]. A novel ADETL-based simulation framework that al-
lows for a wide range of nonlinear solution strategies (e.g., natural, or molar, variable
sets) in compositional simulation was presented recently [84]. Our new research plat-
form AD-GPRS will soon have all the capabilities of GPRS, which had been developed
using hand differentiation [15,38].
1.2.1 Hierarchical organization of simulation variables
In order to establish a seamless link between the library (ADETL) [92, 93, 96] and
the simulator (AD-GPRS), ADETL provides a series of adapters, among which the
global variable set (adX) is the most important one. This adX adapter contains the
entire set of simulation variables (independent, dependent, and constant) for all cells
in the reservoir model, as well as for all nodes and connections in all the facilities
(e.g., wells, surface pipelines, etc.). A detailed description of the global variable set
and other adapters can be found in the ADETL User Manual [97].
A hierarchical structure is adopted for the organization and indexing of all vari-
ables in adX : at the highest level, adX is composed of several subsets correspond-
ing to different bases of variables. For example, we currently have one subset for
node-based variables and one subset for connection-based variables in AD-GPRS. In
order to specify the arrangement of variables among multiple subsets, a ‘helper class’
(AD Structure) is provided by ADETL. In adX, we can define any number of such
structure records, each of which represents a certain portion of one variable subset.
Here, a portion can be a number of grid blocks, well nodes, or well connections. Then,
the entire set of variables is organized according to the order of these records. Please
refer to the “Adapters” example in the ADETL User Manual [97] for details.
On the second level within each structure record, there are several blocks of vari-
ables. Here, a block has a generic definition and can be used to represent a reservoir
cell, or well node, in the node-based subset, or a well connection in the connection-
based subset. In each block, there is a number of variables. This number is deter-
mined for each subset at the beginning of a simulation and is fixed for the simulation
run. Usually, the number of variables depends on the number of (fluid) phases and
components.
On the third level, variables within a block can be classified as independent or
Table 1.1: A hypothetical variable set for three-phase black-oil simulation [85]

Phase state | P | So | Sg | Rs
O, W        | 0 | 1  | -  | -
O, G        | 0 | 1  | -  | 2
O, W, G     | 0 | 1  | 2  | 3
dependent (or constant) variables. The dependency of a variable is not a static prop-
erty, but is determined dynamically by the (phase) state of the block. For example,
in the variable set shown in Table 1.1, $R_s$ is an independent variable for the phase state (O, G), but a dependent variable if the phase state is (O, W). If a variable
is independent, it has an active index greater than or equal to 0, which represents
the ordering of all independent variables within that block. If the phase state of a
reservoir cell is (O, G), as given in Table 1.1, then $P$, $S_o$, and $R_s$ are the 0th, 1st, and
2nd independent variable in that cell, respectively. On the other hand, if a variable is
dependent, it has an active index of −1, and can be either a constant (not depending
on any variables) or a function of other variables (e.g., $\rho_p = \rho_p(P, T, x_{c,p})$).
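Under the variable set of Table 1.1, this bookkeeping can be sketched as follows (hypothetical Python names; AD-GPRS's internal structures differ). An entry of -1 marks a dependent variable; nonnegative entries give the ordering of the independent variables within the block:

```python
# Active indices per phase state, following Table 1.1 (illustrative).
ACTIVE_INDEX = {
    ("O", "W"):      {"P": 0, "So": 1, "Sg": -1, "Rs": -1},
    ("O", "G"):      {"P": 0, "So": 1, "Sg": -1, "Rs": 2},
    ("O", "W", "G"): {"P": 0, "So": 1, "Sg": 2,  "Rs": 3},
}

def independent_variables(state):
    """Names of the independent variables of a block, in active-index order."""
    idx = ACTIVE_INDEX[state]
    return [v for v, i in sorted(idx.items(), key=lambda kv: kv[1]) if i >= 0]

# Rs is independent for (O, G) but dependent for (O, W):
og = independent_variables(("O", "G"))    # ["P", "So", "Rs"]
ow = independent_variables(("O", "W"))    # ["P", "So"]
```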
On the fourth level, independent variables can be further classified as primary and
secondary variables. With an active index between 0 and $N_{\text{Primary}}-1$, the variable is primary; otherwise (with an active index greater than or equal to $N_{\text{Primary}}$), the variable is secondary. The residual vector and associated Jacobian matrix formed
using all independent variables make up the full system, which is usually too large
to be solved efficiently at the linear level. Among the governing equations for each
block, there are several ($N_{\text{Primary}}$) global conservation equations and the rest are local
constraints (e.g., thermodynamic equilibrium). As a consequence, we may perform
an algebraic reduction on the full system in order to generate a system that only
contains the first NPrimary equations and variables in each block [15]. This generated
system is called the primary system, which can be solved on the linear level for the
solution to primary variables. Then an explicit update can be performed to obtain
the solution to secondary variables.
If the Adaptive Implicit Method (AIM) is applied, the primary variables can be
further classified as implicit and explicit variables. Through a further step of algebraic
reduction after preparing the primary system, we can obtain an even smaller linear
system, which is called the implicit system. Correspondingly, after the solution of the
implicit system and before the update of secondary variables, an additional explicit
updating step can be applied to compute the solution of the explicit primary variables.
The details of the two-step algebraic reduction and the two-step explicit update can
be found in Section 3.4.
1.2.2 Building residual equations
The data type of all basic elements (variables) in the global variable set adX is
called ADscalar. An ADscalar stores not only the value, but also the gradient of one
variable, i.e., its derivatives with respect to all independent variables. The residual
equations in AD-GPRS have the same data type and are constructed using these AD
variables. There are three major steps in building and utilizing residual equations:
• Declaring independent variables:
The declaration of independent variables is managed by the global variable set
according to the block phase state, which is referred to as the ‘status’. Therefore, we need only determine the status of each block based on certain conditions (e.g., the appearance/disappearance of phases). After the
status of all blocks in both subsets is updated, we can call the specific function
provided in adX to map the status changes to the declaration of independent
variables. Occasionally, for a local system (e.g., Newton flash computations)
that has a different ordering from the global system, specific interfaces (e.g.,
make_independent; see [97] for details) in ADETL can be called to declare local
independent variables. For each independent variable, its gradient will be set
to contain a value of 1 with respect to itself only and 0 with respect to all other
variables.
• Writing residual equations:
This step contains several substeps: 1) property calculation (evaluating dependent variables as functions of independent variables); 2) for each reservoir cell,
calculate accumulation and other local terms and add them to the residual
equations of that cell; 3) for each interface between reservoir cells, compute
flux terms and add them to the residual equations of corresponding cells; 4) for
each well, compute the source/sink terms and add them to the reservoir resid-
ual equations of perforated cells; and 5) add one or more residual equations
for each well. In all these substeps, only the code that computes the residual
equations (or properties used in residual equations) is required, while the code
that computes the associated Jacobian matrix (or derivatives of properties with
respect to independent variables) is not needed. The derivatives of all AD vari-
ables, including the entire AD residual vector, are automatically computed by
ADETL.
• Using automatically generated gradients:
In this step, we use the interfaces in ADETL to extract the automatically gener-
ated Jacobian matrix, in which each column represents one independent variable
and contains the derivatives of all residual equations with respect to it. Corre-
spondingly, each row represents one residual equation and contains its deriva-
tives with respect to all independent variables. Depending on the structure and
format of the selected linear system, different auxiliary functions provided in
ADETL are used in the extraction process. After extraction, the full system,
as introduced in Section 1.2.1, is obtained in the desired format. Then, we
apply the algebraic reduction to derive the primary, or implicit, system, solve
the obtained linear system, and perform the explicit update. These processes
are briefly explained in Section 1.2.1 and discussed in detail in Section 3.4.
An example explaining the AD-based residual and Jacobian generation for a simple
oil-water model following these steps can be found in the Tutorial section of the
ADETL User Manual [97].
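The three steps can be traced on a toy single-phase, two-cell problem (illustrative Python with a deliberately minimal value-and-gradient representation; all names are hypothetical and ADETL's real interfaces differ):

```python
# Step 1: declare independent variables; step 2: write residuals using
# only value-level code; step 3: read off the automatically carried
# Jacobian. A value is a pair (val, {var_index: derivative}).

def var(val, i):
    """Declare independent variable i: derivative 1 w.r.t. itself."""
    return (val, {i: 1.0})

def add(a, b):
    g = dict(a[1])
    for k, d in b[1].items():
        g[k] = g.get(k, 0.0) + d
    return (a[0] + b[0], g)

def scale(c, a):
    """Multiply an AD value by a constant c."""
    return (c * a[0], {k: c * d for k, d in a[1].items()})

# Two cells with pressures P0, P1 and transmissibility T = 0.5:
# the flux T*(P1 - P0) enters cell 0's residual and leaves cell 1's.
P0, P1, T = var(10.0, 0), var(12.0, 1), 0.5
flux = scale(T, add(P1, scale(-1.0, P0)))
R = [flux, scale(-1.0, flux)]

# Jacobian row i holds dR_i/d[P0, P1], extracted from the gradients.
J = [[R[i][1].get(j, 0.0) for j in range(2)] for i in range(2)]
```

Note that only the residual expressions are coded explicitly; the Jacobian entries are carried along by the augmented operations and merely extracted at the end, which is the essence of the three-step workflow.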
Chapter 2
AD Framework with MPFA and
AIM capabilities
2.1 Introduction
Many of today’s industrial reservoir simulation models are quite complex with highly
detailed (often full-tensor) permeability distributions, geometrically complex geologic
features (e.g., faults, fractures, pinch-outs) and advanced wells (multilateral, multi-
segment). Complex grids, advanced multilateral wells, and heterogeneous full-tensor
permeability pose significant challenges for the space and time discretization schemes
applied to the nonlinear conservation equations. In most simulators, TPFA (Two-
Point Flux Approximation) is used for the ‘geometric part’ of the flux. Phase-based,
SPU (Single-Point Upstream) weighting is used for the flux of a phase, or compo-
nent [9]. The use of TPFA and SPU is widespread, due to their simplicity and ro-
bustness. However, when used with nonorthogonal grids and (absolute) permeability
tensors whose principal components are not aligned with the grid, TPFA introduces
an error that does not diminish as the grid is refined [27]. MPFA (MultiPoint Flux
Approximation) schemes have been developed to provide a consistent representation
Figure 2.1: Illustration of different MPFA schemes: (a) the MPFA O(η)-method, where η denotes the position of the continuity point on a subface; (b) the MPFA L-method
of the geometric part of the flux [2, 3, 27, 59]. Here, through use of the acronym
MPFA, we include all locally conservative schemes that represent a flux through an
interface as a linear combination of potentials in neighboring cells, e.g., the MPFA
O(η)-method [2,27,83] shown in Figure 2.1(a) and the MPFA L-method [3,4] shown
in Figure 2.1(b). Our implementation is designed to accommodate all such schemes
into compositional simulation models. See [1] for an introduction to MPFA methods.
For time discretization, the FIM (Fully Implicit Method) uses a backward-Euler
strategy, in which all the degrees of freedom (variables) and the coefficients that de-
pend on them are evaluated at the new time level, n + 1 [9]. The advantage of FIM
is its unconditional stability. However, each timestep of FIM is quite expensive, both
in memory and computational cost, especially when the number of components is
large and the problem is highly nonlinear. Moreover, compared to mixed-implicit
schemes, the time truncation errors (numerical dispersion) associated with FIM can
be large. On the other hand, the IMPES (IMplicit Pressure Explicit Saturation)
scheme, which treats all the variables other than pressure explicitly, is computation-
ally efficient for a single timestep. However, the stability constraint on the IMPES
timestep size can be quite restrictive [22]. As a result, the total cost of an IMPES
Figure 2.2: Illustration of the AIM scheme (IMPES+FIM); the figure contrasts stability and per-iteration efficiency for FIM, IMPES, and AIM relative to the stability limit of a large timestep ∆t
simulation can easily exceed that of FIM. The performance of the IMPSAT (IMplicit
Pressure and SATuration) scheme, which treats pressure and saturation(s) implicitly,
lies between that of IMPES and FIM [15]. In order to take advantage of the uncon-
ditional stability of FIM and the low computational cost (on a per Newton iteration
basis) of IMPES, or IMPSAT, time discretization schemes with different levels of im-
plicitness can be combined [15,65,80]. The AIM (Adaptive Implicit Method) scheme
utilizes such a combination by applying implicit treatment only to variables with CFL
(Courant-Friedrichs-Lewy) numbers that exceed the explicit stability limit. This idea
is illustrated in Figure 2.2, where the AIM scheme combines FIM and IMPES to
achieve an optimal balance between stability and efficiency. Note that pressure is
always treated implicitly. In thermal models, temperature in the convection terms of
the energy balance can be treated implicitly, or explicitly, depending on the stability
limit [51].
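The cell-by-cell selection can be sketched as follows (illustrative only: the helper names are ours, and AD-GPRS's actual stability analysis for multiphase, multicomponent flow is considerably more involved):

```python
# AIM implicitness selection sketch: cells whose CFL number exceeds the
# explicit stability limit are treated implicitly; the rest explicitly.
# Pressure is implicit everywhere regardless of this per-cell choice.

def cfl_numbers(velocities, dt, dx):
    """One-dimensional CFL numbers |u|*dt/dx per cell (hypothetical helper)."""
    return [abs(u) * dt / dx for u in velocities]

def select_treatment(cfl, stability_limit=1.0):
    """Label each cell for implicit (FIM-like) or explicit (IMPES-like) treatment."""
    return ["implicit" if c > stability_limit else "explicit" for c in cfl]

cfl = cfl_numbers([0.5, 8.0, 2.5, 0.1], dt=0.2, dx=1.0)
labels = select_treatment(cfl)            # only the fast cell is implicit
```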
2.2 AD-Based MPFA Framework
The MPFA O-method had been implemented in the original GPRS; however, the
MPFA capability of GPRS was not robust, for the following reasons:
• Loss of generality: TPFA and MPFA were implemented as two separate branches of the simulator, with different data structures and algorithms.
Both TPFA and MPFA versions of many C++ classes were needed, which led
to code duplication and difficulty in maintaining and extending the simulation
capabilities.
• Incompatibility: Several capabilities, including AIM, were only implemented
within the TPFA branch. With time, the disparity in both functionality and
robustness between the TPFA and MPFA branches continued to grow.
• Inefficiency: Because some advanced options did not work with the MPFA
branch, the TPFA branch became the default code base and enhancements to
the efficiency of the simulator, such as our specialized block-based GMRES
solver [68], were not applied to the MPFA branch.
In order to overcome these limitations, a generic design that accommodates both
MPFA and AIM is used in AD-GPRS. By ‘generic’, we mean that (1) a unified
structure with no code duplication is used to handle a wide range of MPFA and AIM
schemes, and that (2) minimal effort is needed for further modification and extension
of the existing functionality. Specifically, the general MPFA implementation assumes
no particular structure for the flux stencil. Hence, TPFA is simply a special case of
MPFA. Moreover, the computational cost and storage are proportional to the sparsity
pattern of the discretization stencil, even in cases where the sparsity pattern is not
symmetric. This generality is achieved by splitting the spatial discretization into two
levels: nonlinear and linear. Residual equations and the elements of the associated
Jacobian matrix are evaluated at the nonlinear level, whereas element extraction,
algebraic reduction, and linear-system assembly belong to the linear level.
2.2.1 Nonlinear level
At the nonlinear level within the framework, there is no need to consider the specific
data-structure of the Jacobian, i.e., where and how the derivatives should be com-
puted and stored. The only change needed at this level is in the computation of the
flux across the interface shared by two grid cells, i0 and i1. We assume that i0 < i1
and the interface normal-vector has an orientation that points into cell i0. The overall
flux of a component c from i1 to i0 is given by:
$F_c^{i_0,i_1} = \sum_p x_{c,p}\,\rho_p\,\lambda_p\,\Phi_p^{i_0,i_1}$ (2.1)

Here $x_{c,p}$ is the mole fraction of component $c$ in phase $p$, $\rho_p$ is the molar density of phase $p$, and $\lambda_p$ is the mobility of phase $p$. To simplify the notation, we have dropped the superscripts $i_0$ and $i_1$ in these expressions. $\Phi_p^{i_0,i_1}$ denotes the flow part of the flux of phase $p$. Here $x_{c,p}$, $\rho_p$, and $\lambda_p$ are evaluated using SPU weighting based on the sign of $\Phi_p^{i_0,i_1}$.
A TPFA expression for the flow part of the phase flux can be expressed as:
$\Phi_p^{i_0,i_1} = T^{i_0,i_1} \cdot \left(P_p^{i_1} - P_p^{i_0} - g \cdot \gamma_p^{i_0,i_1} \cdot (D^{i_1} - D^{i_0})\right)$ (2.2)

where $P_p^i$ is the pressure of phase $p$ in cell $i$, $g$ is the gravitational acceleration coefficient, $D^i$ is the depth of cell $i$, and $T^{i_0,i_1}$ is the two-point transmissibility coefficient for the interface $\{i_0, i_1\}$. We assume $T^{i_0,i_1} \ge 0$. $\gamma_p^{i_0,i_1}$ is the mass density of phase $p$ at the interface $\{i_0, i_1\}$, which can be expressed as:

$\gamma_p^{i_0,i_1} = \begin{cases} \left(\gamma_p^{i_0} + \gamma_p^{i_1}\right)/2, & \text{if phase } p \text{ appears in both cells } i_0 \text{ and } i_1 \\ \gamma_p^{i_0}, & \text{if phase } p \text{ appears only in cell } i_0 \\ \gamma_p^{i_1}, & \text{if phase } p \text{ appears only in cell } i_1 \\ 0, & \text{otherwise} \end{cases}$ (2.3)
In AD-GPRS, the simple expression (2.2) is replaced with a more general MPFA expression:

$\Phi_p^{i_0,i_1} = \sum_{m=0}^{n_p-1} T_{i_m}^{i_0,i_1} \cdot \left(P_p^{i_m} - g \cdot \gamma_p^{i_0,i_1} \cdot D^{i_m}\right)$ (2.4)

where $n_p$ is the number of points associated with the flux across interface $\{i_0, i_1\}$, and $T_{i_m}^{i_0,i_1}$ is the transmissibility coefficient associated with interface $\{i_0, i_1\}$ and cell $i_m$. We assume that $\sum_{m=0}^{n_p-1} T_{i_m}^{i_0,i_1} = 0$. Generally, $T_{i_0}^{i_0,i_1} \le 0$ and $T_{i_1}^{i_0,i_1} \ge 0$. We make no assumption regarding the flux stencil for $m \ge 2$, although most commonly used MPFA schemes restrict the flux stencil to those cells that geometrically share a vertex with the interface.

If we take $n_p = 2$, then the MPFA expression (2.4) is equivalent to its TPFA counterpart (2.2) with $T^{i_0,i_1} = T_{i_1}^{i_0,i_1} = -T_{i_0}^{i_0,i_1}$. As a consequence, we can use the
generalized MPFA expression (2.4) to represent a TPFA or MPFA flux. We reiterate
that at this level there is no need to consider how the gradients should be computed
and where they ought to be stored. With properly defined independent variables
(e.g., $P$, $S_p$, and $x_{c,p}$ in the natural-variables set) and dependent variables (e.g., $\gamma_p = \gamma_p(P, x_{c,p})$), the gradients of $\Phi_p^{i_0,i_1} = \Phi_p^{i_0,i_1}(P, \gamma_p)$ are evaluated automatically by the AD framework. Then, the value and associated gradients are used to compute the overall component fluxes, $F_c$, as in Eq. (2.1). Thus, by only changing the computation of the flow part of the phase flux, a generic implementation of MPFA is achieved at the nonlinear level without additional complications or specialization.
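The equivalence between the generalized expression (2.4) and its two-point counterpart (2.2) can be checked with a short sketch (illustrative Python; all names and numerical values are ours):

```python
# Flow part of the phase flux. MPFA form, Eq. (2.4):
#   Phi = sum_m T_m * (P_m - g * gamma * D_m),  with sum_m T_m = 0.
# A two-point stencil with T = [-T01, +T01] reproduces TPFA, Eq. (2.2).

def mpfa_phase_flux(T, P, D, gamma, g=9.81):
    assert abs(sum(T)) < 1e-12           # transmissibilities sum to zero
    return sum(t * (p - g * gamma * d) for t, p, d in zip(T, P, D))

def tpfa_phase_flux(T01, P0, P1, D0, D1, gamma, g=9.81):
    return T01 * (P1 - P0 - g * gamma * (D1 - D0))

# TPFA as the np = 2 special case (arbitrary illustrative values):
f_mpfa = mpfa_phase_flux([-2.0, 2.0], [100.0, 95.0], [10.0, 12.0], gamma=0.8)
f_tpfa = tpfa_phase_flux(2.0, 100.0, 95.0, 10.0, 12.0, gamma=0.8)
```

The zero-sum constraint on the transmissibilities is what allows the depth terms of Eq. (2.4) to collapse to the depth difference of Eq. (2.2) in the two-point case.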
2.2.2 Linear level
At the linear level, the main challenge is to identify and store the location of the
nonzero block entries in the Jacobian matrix. Once this sparsity pattern is known,
the following linear algebra operations can be performed:
• System matrix extraction from the AD residual vector.
• Algebraic reduction and explicit update.
• Linear solution including Sparse Matrix-Vector multiplication (SpMV) and pre-
conditioning.
The diagonal block entries, one per cell in the reservoir model with $N_B$ cells, are all
nonzero. The structure of these blocks will not change from TPFA to MPFA, even
though the values change with time. So, there is no need for a special list to record
the positions of the diagonal blocks.
On the other hand, the number, location, and form of the off-diagonal block
entries depend on the specific flux expression used. In a TPFA discretization with
$N_F$ interface fluxes, each flux expression leads to exactly two nonzero entries: one in
the lower, and one in the upper off-diagonal part of the Jacobian.
Consider the scenario depicted in Figure 2.3, in which four cells (white parallelo-
grams with a blue dot in the center of each cell) with global indices (cell numbers) k1,
k2, k3, and k4 share a common vertex (the top right corner of cell k1). Here we analyze
the block nonzero entries corresponding to the red interface between cell k1 and k2,
as well as the blue interface between cell k1 and k3. The fluxes at both interfaces can
be discretized with either TPFA or MPFA.
The TPFA stencils associated with the red and blue interfaces are {k1, k2} and
{k1, k3}, respectively. Note that k1 < k2 and k1 < k3. Block row i in the Jacobian
matrix refers to the set of discrete equations associated with cell i. Block column j
refers to the set of independent variables associated with cell j. Now, the flux for the
red interface appears in the discrete equations associated with cells k1 and k2, and the
flux involves the variables associated with cells in the stencil {k1, k2}. Therefore, this
flux introduces a nonzero block entry into the upper off-diagonal part of the matrix in
row k1, column k2, and a second nonzero block entry into the lower off-diagonal part
in row k2, column k1. Similarly, the flux for the blue interface introduces nonzero
off-diagonal blocks into row k1, column k3 and row k3, column k1. The positions of
these nonzero off-diagonal blocks are indicated in Figure 2.3(a). If TPFA is used for
all interfaces, each flux will introduce one nonzero entry in the upper off-diagonal part
of the matrix and one nonzero entry in the lower off-diagonal part. Each off-diagonal
block will store contributions from exactly one flux expression.
For the more general case of MPFA, such a simple mapping between fluxes and
off-diagonal blocks does not exist. As before, the flux for each interface appears in
the discrete equations associated with the two cells that share the interface. However,
each MPFA stencil $\{i_0, i_1, \ldots, i_{n_p-1}\}$ depends on the variables in $n_p$ cells, where $n_p \ge 2$ can be different for each flux, and will therefore contribute $n_p - 1$ off-diagonal blocks in row $i_0$ and $n_p - 1$ off-diagonal blocks in row $i_1$. Each off-diagonal block will
generally contain contributions from multiple fluxes. Assume that the MPFA stencil
associated with the red interface is {k1, k2, k3, k4}, and the stencil of the blue interface
is {k1, k3, k2, k4}. The positions of the off-diagonal blocks associated with the red and
blue interfaces are shown in Figures 2.3(b) and 2.3(c), respectively. Note that the
green blocks contain contributions from both fluxes. There is no guarantee that the
sparsity pattern of the Jacobian will be symmetric. An efficient data-structure is
needed to keep track of the sparsity pattern of the Jacobian associated with MPFA
Figure 2.3: Block nonzero entries corresponding to TPFA and MPFA fluxes: (a) the blue and red TPFA fluxes; (b) the red MPFA flux; (c) the blue MPFA flux
discretization.
We propose a ‘generalized connection list’ data-structure. At the beginning of a
simulation run, the list is generated from the specification of the MPFA stencils for
the $N_F$ flux entries in the model. Each entry includes the cells $\{i_0, i_1, \ldots, i_{n_p-1}\}$ and the transmissibilities $\{T_{i_0}, T_{i_1}, \ldots, T_{i_{n_p-1}}\}$ associated with each MPFA flux. The generalized connection list has three important properties: (1) only structural information
(i.e., the row and column indices of a nonzero entry) is stored. (2) Each nonzero entry
is represented by a unique (row, column) pair. That is, if the pair (k1, k2) has been
visited and is already accounted for in the generalized connection list when the red
MPFA flux was processed (see Figures 2.3(b) and 2.3(c)), then this (k1, k2) nonzero location
is not inserted into the list a second time when the blue MPFA flux is processed.
Note that the additional contribution to the (k1, k2) value from the blue MPFA flux
is accounted for during the simulation run. (3) All block nonzero entries, both in up-
per and lower off-diagonal parts (i.e., (k1, k2) and (k2, k1) are considered as different
entries), are added to the generalized connection list, in order to allow for general
schemes with possibly nonsymmetric connection structures.
The ‘set’ data-structure in the STL (Standard Template Library; see the description in [45]), which does not allow duplicate elements, is used to store
the connection list. For each MPFA flux associated with $n_p$ cells, the simulator attempts to insert each of the $2(n_p - 1)$ pairs into the generalized connection list. The set data-structure automatically skips over existing pairs and only inserts new occurrences. After every member of the MPFA flux list has been visited once, all of
the block nonzero entries in the off-diagonal parts of the matrix are recorded. Be-
cause this generalized connection list remains the same throughout the simulation,
we may then convert the generated set of block nonzero entries to any desired data
structure, which may be more efficient in terms of access, or better suited for parallel
computations.
Currently, the internal data format used to contain the converted generalized
connection list is CSR (Compressed Sparse Row), where the structure information
is stored in two arrays: row_ptr (row pointers) and col_ind (column indices). An
interface is provided for converting the intermediate set of block nonzero entries to
the internal CSR format. Note that because the diagonal entries always exist in each
block row, and are usually treated differently in the algebraic computations (e.g.,
algebraic reduction and explicit update), we do not store the diagonal entries in the
converted generalized connection list. Also, because only the structural information is
stored, there is no single val (data values) array associated with the generalized connection list. The actual data storage for the Jacobian is separate from this data
structure and will be discussed later in Section 3.3.
An example of the generalized connection list is shown in Figure 2.4. As described
above, an intermediate set of 20 off-diagonal block nonzero entries in eight rows is first
generated from all the fluxes and then converted into the CSR format. The diagonal
entries are not in the intermediate set, nor in the converted CSR arrays.
CHAPTER 2. AD FRAMEWORK WITH MPFA AND AIM CAPABILITIES 25
Figure 2.4: An example of the generalized connection list. For an eight-cell grid
(cells 0–7 arranged in two rows of four), the intermediate set contains the 20
off-diagonal pairs

    (0, 1) (0, 4) (1, 0) (1, 2) (1, 5)
    (2, 1) (2, 3) (2, 6) (3, 2) (3, 7)
    (4, 0) (4, 5) (5, 1) (5, 4) (5, 6)
    (6, 2) (6, 5) (6, 7) (7, 3) (7, 6)

which convert to the CSR arrays

    row_ptr   0  2  5  8  10  12  15  18  20
    col_ind   1  4  0  2  5  1  3  6  2  7  0  5  1  4  6  2  5  7  3  6
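The conversion from the intermediate set to the two CSR arrays can be sketched as follows (hypothetical names; a sketch only, assuming zero-based cell indices):

```python
def pairs_to_csr(pairs, n_rows):
    """Convert a set of off-diagonal (row, col) pairs to CSR structure arrays."""
    counts = [0] * n_rows
    for row, _ in pairs:
        counts[row] += 1                 # entries per block row
    row_ptr = [0]
    for c in counts:
        row_ptr.append(row_ptr[-1] + c)  # prefix sum gives the row pointers
    col_ind = [col for _, col in sorted(pairs)]  # row-major, ascending columns
    return row_ptr, col_ind
```

Applied to the ten two-point connections of the eight-cell example in Figure 2.4, this reproduces the row_ptr and col_ind arrays shown there.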
Finally, we note that connection-based loops in matrix operations, such as matrix
extraction, algebraic reduction, explicit updating, and SpMV, must be modified con-
sistently with the generalized connection list. Most of these modifications are straight-
forward. For example, loops over standard two-point connections should be changed
to loops over all entries in the generalized connection list. Given the currently adopted
CSR format, these entries are usually accessed row by row and then block by block.
That is, we first loop through all rows, and in each row i (0 ≤ i < N_B) the indices of
the nonzero block entries are all integers k such that row_ptr[i] ≤ k < row_ptr[i + 1].
Thus, the columns of these entries are col_ind[k], and we may access them one by
one. The treatment of the diagonal entries is not changed.
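The access pattern described above amounts to a nested loop over rows and then over the stored entries of each row; a minimal sketch (hypothetical names):

```python
def for_each_offdiag_block(row_ptr, col_ind, visit):
    """Visit every off-diagonal block entry in CSR order (row by row)."""
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            visit(i, col_ind[k], k)      # row i, column col_ind[k], storage slot k
```

A connection-based operation such as SpMV would accumulate the contribution of block (i, col_ind[k]) inside visit, with the diagonal blocks handled in a separate loop.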
2.3 AD-Based AIM Formulation
The AD-GPRS simulation framework is designed to support arbitrary levels of
implicitness, including FIM, IMPES, IMPSAT, and AIM. Among these schemes, AIM
is the most general, in that all of the other schemes can be considered special cases
of AIM. In order to have a unified code base for flexible discretization in time, the
AIM implementation is designed to handle the different schemes consistently and is
split into nonlinear and linear levels. A four-step procedure, with the corresponding
level in parentheses, is used to implement AIM in AD-GPRS:
1. Determination of the implicitness level (nonlinear level).
2. Treatment of the nonlinear terms in the flux (nonlinear level).
3. Algebraic reduction in terms of the implicit primary variables (linear level).
4. Updating the remaining explicit primary and secondary variables (linear level).
The AD-GPRS framework allows for complete flexibility in the specification of the
independent variables for different formulations and solution strategies [84]. There-
fore, no assumption is made in the common part (used by all variable formulations) of
time discretization about a specific formulation. Also, it should be relatively straight-
forward to introduce the AIM scheme for a new formulation. For this purpose, gen-
erality has been preserved in the following procedures of AIM:
• Switching criteria between different levels of implicitness.
• Algebraic reduction of the primary equations in terms of the primary unknowns.
• Explicit updating of the secondary variables.
As a result, a proposed new formulation (e.g., molar) can be realized easily as
follows: 1) implement the declaration of the variable set and associated properties
calculation, 2) test the FIM formulation using the new variable set, 3) implement
the various treatments for the nonlinear terms in the flux, and 4) assign levels of
implicitness to the variables for the specific AIM scheme of interest.
The four-step procedure used to implement AIM is discussed in the following
sections.
2.3.1 Implicit-level determination (nonlinear level)
CFL-based stability criteria are widely used in reservoir simulation [22, 51, 65].
In an isothermal compositional simulation (using the natural-variables formulation),
there are two types of CFL numbers:
• Component-based CFL number:

  CFL_{X,c} = \frac{\Delta t \sum_p \rho_p Q_p x_{c,p}}{V \phi \sum_p \rho_p S_p x_{c,p}}, \qquad CFL_X = \max_c \{ CFL_{X,c} \}    (2.5)

  where V is the volume of the grid block, φ is the porosity, and Q_p is the phase
  volumetric flux. Note that the denominator, V \phi \sum_p \rho_p S_p x_{c,p}, represents the
  total number of moles of component c in all phases in the current cell.
• Saturation-based CFL number (for one, two, or three phases, with gravity,
  without capillarity):

  – one phase (trivial):

      CFL_S = \frac{Q_p \Delta t}{V \phi}    (2.6)

  – two phases [15]:

      CFL_S = \frac{\Delta t}{V \phi} \cdot \frac{\dfrac{\lambda_{p_1}}{\lambda_{p_0}} \dfrac{\partial \lambda_{p_0}}{\partial S_{p_0}} Q_{p_0} - \dfrac{\lambda_{p_0}}{\lambda_{p_1}} \dfrac{\partial \lambda_{p_1}}{\partial S_{p_0}} Q_{p_1}}{\lambda_{p_0} + \lambda_{p_1}}    (2.7)
  – three phases [22]:

      CFL_S = \frac{1}{2} \frac{\Delta t}{V \phi} \left| f_{0,0} + f_{1,1} + \sqrt{(f_{0,0} + f_{1,1})^2 - 4 (f_{0,0} f_{1,1} - f_{0,1} f_{1,0})} \right|    (2.8)

    where f_{i,j} is evaluated as:

      f_{i,j} = \frac{\delta_{i,j} \lambda_T - \lambda_{p_i}}{\lambda_{p_j} \lambda_T} \cdot \sum_{k=0}^{2} \left( \frac{\partial \lambda_{p_k}}{\partial S_{p_j}} Q_{p_k} \right)    (2.9)

    in which \delta_{i,j} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} is the Kronecker delta, and \lambda_T = \sum_{k=0}^{2} \lambda_{p_k} is the
    total mobility.
Expressions that take capillarity into account can be used in place of expressions
(2.6) to (2.8) when necessary [22]. Note that the specific expression used for any
cell depends on the number of mobile phases present in that cell, not on the total
number of phases in the system. That is to say, in a three-phase system, we may
have cells with three mobile phases using Eq. (2.8), cells with two mobile phases
using Eq. (2.7), and cells with only one mobile phase using Eq. (2.6).
The total flow rate (Q_T) or any phase flow rate (Q_p) involved in the CFL
computation can be either an inflow or an outflow rate. For any given cell, the inflow
rate is the sum of the flow rates over all associated connections whose flow goes from
a neighboring cell into that cell. Correspondingly, the outflow rate is the sum of the
flow rates over all associated connections whose flow goes from that cell into a
neighboring cell. It has been proven in [22] that the 'inflow' CFL number (i.e., the
CFL number calculated from the inflow rate) is the more suitable choice for the
stability criteria. A less efficient but potentially more stable choice is to use the
maximum of the 'inflow' and 'outflow' CFL numbers.
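The inflow/outflow bookkeeping can be sketched as follows (hypothetical names; the sign convention for a connection flux is an assumption):

```python
def inflow_outflow_rates(n_cells, connection_flows):
    """Accumulate per-cell inflow and outflow rates from signed connection fluxes."""
    inflow = [0.0] * n_cells
    outflow = [0.0] * n_cells
    for a, b, q in connection_flows:
        # assumed convention: q > 0 means flow from cell a to cell b
        src, dst, rate = (a, b, q) if q >= 0.0 else (b, a, -q)
        outflow[src] += rate
        inflow[dst] += rate
    return inflow, outflow
```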
Based on the above component- and saturation-based CFL numbers, the decision
to treat the variables of a cell implicitly or explicitly for the current timestep can be
made as follows:

• If max{CFL_X, CFL_S} ≤ CFL_Lim, use IMPES

• If CFL_X ≤ CFL_Lim, use IMPSAT

• Otherwise, use FIM

where CFL_Lim is the upper stability limit. For nonlinear formulations other than
the natural variables, different stability criteria can be designed to determine the
implicit levels.
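The decision logic above, as a small Python sketch (hypothetical names; in practice CFL_Lim is an input parameter):

```python
def implicit_level(cfl_x, cfl_s, cfl_lim):
    """Choose a cell's implicitness level from its CFL numbers."""
    if max(cfl_x, cfl_s) <= cfl_lim:
        return "IMPES"    # both criteria satisfied: only pressure implicit
    if cfl_x <= cfl_lim:
        return "IMPSAT"   # saturation criterion violated: saturations implicit too
    return "FIM"          # component criterion violated: everything implicit
```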
From Eqs. (2.5) to (2.8), we may observe that the CFL numbers are functions of
the timestep size Δt. For the FIM scheme, and for the AIM scheme with no restriction
on the maximum portion of FIM cells, the timestep size is determined solely by the
maximum variable changes between two timesteps (see the section about timestepping
criteria in [85]) and is given as an input for the CFL computation and the subsequent
implicit-level determination. On the other hand, for the IMPES/IMPSAT schemes,
and for the AIM scheme with a restriction on the maximum portion of FIM cells, a
tentative timestep size Δt_upper is first calculated based on the maximum variable
changes and serves as an upper limit for the actual timestep size Δt. The actual
timestep size Δt is then obtained by enforcing the stability criterion of the
IMPES/IMPSAT scheme (i.e., that the maximum CFL number in the entire reservoir,
given Δt, is below CFL_Lim), or by enforcing the restriction on the maximum portion
of FIM cells in the AIM scheme. That is, given the maximum portion of FIM cells
P_FIM (0 ≤ P_FIM ≤ 1), we may calculate the minimum number of non-FIM cells as
N_non-FIM = (1 − P_FIM) N_B. Then, by sorting the CFL numbers of all cells from
the smallest to the largest, we require that the N_non-FIM'th smallest CFL number,
given Δt, is below CFL_Lim.
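Because each CFL number is linear in Δt, the restricted-AIM timestep can be found directly from the sorted per-unit-timestep CFL rates; a sketch (hypothetical names):

```python
def choose_timestep(dt_upper, cfl_rates, p_fim, cfl_lim):
    """Pick dt so that at least (1 - p_fim) * NB cells satisfy the CFL limit.

    cfl_rates[i] is cell i's CFL number per unit timestep (CFL_i = rate_i * dt);
    dt_upper is the tentative step from the maximum-variable-change criterion.
    """
    n_non_fim = int((1.0 - p_fim) * len(cfl_rates))  # minimum non-FIM cells
    if n_non_fim == 0:
        return dt_upper                  # pure FIM allowed: no CFL restriction
    rate = sorted(cfl_rates)[n_non_fim - 1]          # the N_non-FIM'th smallest
    if rate <= 0.0:
        return dt_upper
    return min(dt_upper, cfl_lim / rate)
```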
To further improve the stability of the overall system, an option to extend the
FIM cells by several layers is provided. After the implicit level has been determined
for all cells, the following operation may be performed: for each non-FIM cell that
neighbors an FIM cell, change its implicit level to FIM at the end of the operation,
so that the FIM region is extended by exactly one layer. Given the number of layers
(N_L) to be extended, the above operation is performed N_L times in each timestep,
so that the FIM region is extended by N_L layers. The larger we choose N_L, the
more FIM cells we will have; as a consequence, the solution will be less efficient but
potentially more stable.
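The layer-extension operation can be sketched as a repeated one-ring growth of the FIM region (hypothetical names; the neighbor lists would come from the generalized connection list):

```python
def extend_fim_layers(is_fim, neighbors, n_layers):
    """Grow the FIM region by n_layers rings of neighboring cells."""
    is_fim = list(is_fim)                # work on a copy
    for _ in range(n_layers):
        frontier = [c for c, fim in enumerate(is_fim) if not fim
                    and any(is_fim[nb] for nb in neighbors[c])]
        for c in frontier:               # flip after scanning: exactly one layer
            is_fim[c] = True
    return is_fim
```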
2.3.2 Treatment of nonlinear terms in the flux (nonlinear
level)
In an IMPES or IMPSAT formulation, the local terms (accumulation and fugacity)
are computed as in the FIM formulation. For the flux terms, one can apply different
treatments to the various quantities, which are either nonlinear functions of the
independent variables (e.g., ρ_p, λ_p, and γ_p) or independent variables themselves
(e.g., x_{c,p} in the natural-variables set). In our framework, we provide four options
for calculating each of these quantities. Taking the treatment of ρ_p = ρ_p(P, x_{c,p})
in an IMPES cell (where only P is implicit) as an example, we have:
• Explicit treatment: the nonlinear term is fixed at the beginning of the timestep
and no derivatives are calculated; e.g., the value is ρ_p^n, and the derivative vector
is empty.

• Partially implicit treatment (lagged iteration): the nonlinear term is updated
in each Newton iteration and only contains its derivatives with respect to the
implicit variables; e.g., ρ_p^{n+1,k} and (∂ρ_p/∂P)^{n+1,k}.
• Fully implicit treatment: the nonlinear term is updated in each Newton iteration
and contains all of its derivatives (with respect to both implicit and explicit
variables); e.g., ρ_p^{n+1,k}, (∂ρ_p/∂P)^{n+1,k}, and (∂ρ_p/∂x_{c,p})^{n+1,k}.
• Mixed time-level treatment: for the flux-term computation in each Newton
iteration, the nonlinear term is recomputed using the specified implicit variables
(from the last iteration, e.g., P^{n+1,k}) and explicit variables (from the last
timestep, e.g., x_{c,p}^n), and the derivatives of the quantity with respect to the
implicit variables are obtained naturally with AD; e.g., ρ_p(P^{n+1,k}, T, x_{c,p}^n)
and ∂ρ_p(P^{n+1,k}, T, x_{c,p}^n)/∂P^{n+1,k}.
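For the first three options, the treatment amounts to selecting which stored value and which derivatives of a term enter the flux computation; a Python sketch (hypothetical names, with derivatives kept as a dict from variable name to derivative; the mixed time-level option is omitted because it requires re-evaluating the property function itself):

```python
def treat_flux_term(option, value_n, value_k, derivs_k, implicit_vars):
    """Pick the (value, derivatives) of a nonlinear term used in the flux.

    value_n: value at the old timestep; value_k / derivs_k: value and full
    derivative dict at the current Newton iteration.
    """
    if option == "explicit":
        return value_n, {}               # frozen value, empty derivative vector
    if option == "partial":              # lagged iteration
        return value_k, {v: d for v, d in derivs_k.items() if v in implicit_vars}
    if option == "full":                 # fully implicit
        return value_k, dict(derivs_k)
    raise ValueError(option)
```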
Among the four options, the mixed time-level treatment seems to be the most
natural way to handle implicitness. One may raise the question: why do we need
the other three options? The reason is related to efficiency and convergence. Unlike
the explicit, partially implicit, or fully implicit treatments, which use the computed
value (either at current or last timestep) and derivatives of the nonlinear term, the
mixed time-level treatment usually incurs extra calculations. This is because some
of the nonlinear terms (e.g., ρp) are also used in the local accumulation and fugacity
terms, and such nonlinear terms must be treated fully implicitly in the computation
of the local terms, regardless of the time discretization of the flux terms. The extra
cost of recomputing such nonlinear terms can sometimes be larger than the time saved
in the discretization and linear solution, leading to an inefficient overall scheme.
As for convergence, the mixed time-level treatment is not necessarily the best
choice for all implicit schemes. As pointed out in [15], the explicit treatment, which
fixes the value of a term at the beginning of the timestep and removes all of its
derivatives as defined above, is necessary (in terms of convergence) for ρ_p, λ_p, and
x_{c,p} in the IMPES formulation, and for x_{c,p} in the IMPSAT formulation. Partially
implicit treatment, which uses the last-iteration value of a term and removes its
derivatives with respect to the explicit variables (i.e., S_p and x_{c,p} in IMPES; x_{c,p}
in IMPSAT), is suitable (also in terms of convergence) for γ_p in the IMPES
formulation, and for ρ_p, λ_p, and γ_p in the IMPSAT formulation. Fully implicit
treatment is necessary for all the nonlinear terms (ρ_p, λ_p, x_{c,p}, and γ_p) in the
FIM formulation. Note that single-phase cells with
the IMPSAT implicit level are actually treated as IMPES cells.
Because the special treatment of nonlinear terms is only needed for the flux term,
the choice is made immediately before the flux computation. The original values and
derivatives of these terms (evaluated using FIM) are saved. After the flux
computation, the affected terms are restored to their original values (and derivatives)
if necessary. For example, we need to recover the value and derivatives of x_{c,p},
which are independent variables and should be updated from the last-iteration values
after the linear solution. As for phase mobilities or capillary pressures, their values
and derivatives do not need to be recovered, because they are only used in the flux
term. However, at the end of a converged timestep, the values of all terms are
updated using the converged solution at time level n + 1, and can be readily used for
the CFL computation and explicit treatment in the next timestep.
2.3.3 Algebraic reduction in terms of implicit variables (linear level)
As described in Section 1.2.1, the full independent variable set is composed of
primary variables (n_c of them, as in the natural-variables set of isothermal
compositional simulation) and secondary variables (n_c(n_p − 1) of them). With AIM
(or IMPES/IMPSAT), the primary variables can be further divided into implicit
variables (N_Imp^i, the number of implicit variables in cell i: 1 for IMPES, 2 for
IMPSAT, and n_c for FIM) and explicit variables (n_c − N_Imp^i).
The full Jacobian includes the partial derivatives of all the equations (conservation
laws and local constraints) with respect to the full variable set (primary and
secondary). However, the linear solver sees a smaller (reduced) Jacobian, in which
the mass (and energy) conservation equations are expressed in terms of the implicit
primary set. This reduced Jacobian is obtained algebraically from the full system as
follows:
• Reduction from the full system to the primary system (needed by FIM and AIM):
because only the n_c mass-conservation equations are globally coupled, whereas the
remaining n_c(n_p − 1) equations describe local constraints, we can use a Schur
complement [94] for each block row that has more than n_c equations (i.e., where
more than one phase is present in the corresponding cell) to obtain a primary system
that contains only n_c equations as a function of n_c variables.

• Reduction from the primary system to the implicit system (only needed by AIM):
when AIM is used, it is possible that not all n_c primary variables in a cell are
implicit. Thus, we can further apply the Schur complement for those cells to obtain
an even smaller implicit system that contains only N_Imp^i equations and variables
for each cell i.
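Both reductions are block Schur complements. A small dense NumPy sketch (illustrative only; the simulator performs the elimination blockwise on the sparse system, and the names here are hypothetical):

```python
import numpy as np

def schur_reduce(A, b, n_keep):
    """Eliminate the trailing unknowns of A x = b by a Schur complement.

    A is split as [[App, Aps], [Asp, Ass]], with App covering the n_keep
    unknowns that remain (e.g., the primary or implicit variables).
    """
    App, Aps = A[:n_keep, :n_keep], A[:n_keep, n_keep:]
    Asp, Ass = A[n_keep:, :n_keep], A[n_keep:, n_keep:]
    Ass_inv = np.linalg.inv(Ass)         # local block, cheap to invert cell by cell
    return App - Aps @ Ass_inv @ Asp, b[:n_keep] - Aps @ Ass_inv @ b[n_keep:]
```

Solving the reduced system gives the same values for the kept unknowns as solving the full system.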
2.3.4 Updating of the remaining variables (linear level)
The linear solver deals with the reduced Jacobian (i.e., implicit system in terms of
the implicit unknowns). The implicit system is obtained algebraically from the full
system in two stages. After we obtain the solution of implicit unknowns from the
linear solver, a two-stage updating strategy is then used for the remaining variables,
including the explicit primary variables and secondary variables:
• From the implicit solution to the primary solution: the solution for the n_c − N_Imp^i
explicit primary variables is recovered algebraically from the implicit solution, which
contains the Newton updates to the N_Imp^i implicit variables per cell computed by a
specific linear solver.

• From the primary solution to the full solution: the solution for the n_c(n_p − 1)
secondary variables is recovered algebraically from the primary solution, which
contains the Newton updates to the n_c primary variables per cell obtained in the
first-stage explicit update.
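Each recovery stage is a back-substitution using the same block splitting as the reduction; a dense NumPy sketch (illustrative only, hypothetical names):

```python
import numpy as np

def recover_eliminated(A, b, x_keep):
    """Recover the eliminated unknowns from the kept (implicit) solution.

    With A split as [[App, Aps], [Asp, Ass]], the eliminated part satisfies
    Ass x_s = b_s - Asp x_p, which is a local (cell-by-cell) solve.
    """
    n_keep = len(x_keep)
    Asp, Ass = A[n_keep:, :n_keep], A[n_keep:, n_keep:]
    return np.linalg.solve(Ass, b[n_keep:] - Asp @ x_keep)
```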
The details of the above two-stage algebraic reduction and two-stage update strategy
will be discussed in Section 3.4, which covers the solution strategies for the block
linear system.
2.4 Test Cases
Two test cases, the full and upscaled SPE 10 problems and a fully unstructured grid
problem, are examined to demonstrate the correctness and efficiency of the AD
framework with AIM and MPFA capabilities. An additional test case, which uses a
two-dimensional circular reservoir model, can be found in [99].
2.4.1 Full and upscaled SPE 10 problems
This example is based on the permeability and porosity fields of the full and an
upscaled version of the SPE 10 problem [19]. The upscaled problem employs a simple
2×2×2 coarsening (i.e., 1/8th of SPE 10). Two time-discretization schemes are tested
and contrasted: 1) FIM and 2) AIM combining IMPES and FIM. A nine-component
fluid system is used with an initial distribution of 1% CO2, 19% C1, 5% C2, 5% C3,
10% n-C4, 10% n-C5, 10% C6, 20% C8, and 20% C10. We use a line-drive well pattern:
two injectors are placed at the two corners of one side of the reservoir, and two
producers at the two corners of the opposite side. All four wells penetrate all the
layers and are under BHP (Bottom-Hole Pressure) control: 40 bar (producers) and
120 bar (injectors). The injection stream is 90% CO2 and 10% C1 at both injectors.
For the upscaled problem, the model has about 140K cells, and there are nine
primary variables (n_c = 9) in each cell. So, we have a total of 1.26M primary
variables to solve for at each timestep. Dealing with such a large system poses a
great challenge to the linear solver. However, if AIM is applied in this case, given an
average of 10% implicit cells, we only need to solve for about 0.25M implicit variables.
The solution of the linear system becomes much easier, because its size has been
decreased by approximately 80%.
Table 2.1: FIM and AIM runtime performance of upscaled SPE 10 problem
Scheme                        FIM     AIM (IMPES + FIM)   Difference
Newton Iterations             377     354                 -6.1%
Solver Iterations             3.95    4.09                +3.5%
Discretization                2.62    1.82                -30.5%
Nonlinear Term Treatment      -       0.11                -
Properties Calculation        3.06    2.52                -17.6%
Matrix Extraction/Reduction   1.32    0.99                -25.4%
Linear Solver Time            5.45    2.94                -46.1%
Total Time                    13.09   9.02                -31.1%
The runtime performance data for both schemes are listed in Table 2.1, where the
items in boldface are reported on a per-Newton-iteration basis. In this case, we
actually have an average of about 9% implicit (FIM) cells, corresponding to about
19% implicit variables. The numbers of Newton iterations (−6%) and solver iterations
(+3%) are about the same, whereas we obtain savings of 18% to 46% in the
discretization, properties calculation,
matrix extraction/reduction, and linear solver time when using AIM. As we expected,
the largest saving is in the linear solution, which takes a little more than half of the
corresponding FIM time. Although some extra cost is incurred in the treatment of
the nonlinear terms, it is negligible (taking only 1% of the total time). Overall, we
have a reduced cost of about 31% per Newton iteration, with high-quality simulation
results.
Figure 2.5: Grid mapping of full SPE 10 with skewness and distortion [99]

As for the full SPE 10 problem, the grid is modified in all 85 horizontal layers,
each containing 60 × 220 cells, in two steps: 1) skewing each layer to be a
parallelogram, and 2) applying distortion to the x-coordinate of the grid nodes based
on the six control points depicted in Figure 2.5, where the numerical values indicate
the x-coordinate of the control points before (left) and after (right) mapping. The
positions of the other grid nodes are determined by linear interpolation between the
control points. This results in a three-dimensional large-scale nonorthogonal grid
with over 1.1 million cells. A four-component system with an initial distribution
of 1% CO2, 20% C1, 29% C4, and 50% C10 and the same line-drive well pattern
(two injectors at one side, two producers at the other side) are used. Three spatial
discretization schemes are employed and contrasted: TPFA, MPFA O(0)-method,
and MPFA L-method.
The simulation results are shown in Figure 2.6. We can see that the results
of both MPFA discretizations closely match each other, which indicates that our
MPFA framework is very robust with arbitrary discretization schemes on such a
large-scale reservoir model. On the other hand, TPFA results deviate from the MPFA
counterparts, because in skewed nonorthogonal grids the fluxes cannot be represented
accurately using TPFA.

Figure 2.6: Gas rates of two injectors and oil rates of two producers in full SPE 10 with skewed grid
Table 2.2: Runtime performance of full SPE 10 case
Discretization        TPFA   L-method MPFA   O(0)-method MPFA
Stencil               7      9 (+29%)        11 (+57%)
Peak memory           9GB    10GB (+11%)     11GB (+22%)
Newton iterations     170    167             184
Solver iterations     5.6    4.5             5.3
Discretization time   7.0    7.9 (+13%)      9.1 (+29%)
Solver time           25.9   28.8 (+11%)     36.5 (+41%)
Total time            43.4   47.0 (+8%)      56.8 (+31%)
The performance data are listed in Table 2.2. The items in boldface are reported
on a per Newton-iteration basis. The number of solver iterations is smaller with
MPFA discretizations. For both the L-method and O(0)-method, the increases in the
memory usage, discretization time, and linear solution time are all below the growth
in the stencil (29% and 57% respectively). The total extra cost per Newton iteration
is 31% for the O-method and only 8% for the L-method. Given approximately the
same results (both more accurate than the TPFA solution), a smaller memory
footprint, and better CPU efficiency, the L-method is preferred for this type of
problem (with a skewed grid).
2.4.2 Unstructured grid problem
The next test case is a two-dimensional unstructured grid problem with 17,545 trian-
gular cells of different sizes. We test TPFA and MPFA discretization for four different
anisotropy ratios: 1:1, 2:1, 8:1, and 32:1. The MPFA transmissibilities are calculated
using the O(1/3)-method, which is known to give good results for triangular cells.
An injector is placed at the center of the reservoir, which has an initial fluid
distribution of 0% CO2, 5% C1, 25% C4, and 70% C10. A mixture of 90% CO2 and
10% C1 is injected at a rate of 3×10^4 m3/day. There are four producers, all under
BHP control at 40 bar. The well configuration is shown in Figure 2.7, where the
white circles represent the five wells. Several linear no-flow features are included in
the model to mimic faults. The fluid flow is strongly affected by these flow barriers.
Thus, grid refinement is used to increase resolution around the wells and faults, while
relatively coarse grids are used for the rest of the model.
The simulation results are shown in Figure 2.8. We can see that the deviation of
the TPFA gas rates from the MPFA results increases as the anisotropy ratio increases.
In the isotropic case, the difference is barely noticeable, while in the case with 32:1
anisotropy ratio, the TPFA rates differ significantly.
The gas saturation profile shown in Figure 2.7 helps to explain the large differences
in the computed gas rates between the TPFA and MPFA discretizations for the cases
with large anisotropy ratios. The correct flow direction, which is primarily governed
by the anisotropy and heterogeneity of the permeability field, is properly represented
in the MPFA-based computations, whereas in the TPFA profile it is completely
missed, leading to O(1) errors in the simulation results.

Figure 2.7: Gas saturation at the end of simulation for TPFA and MPFA with 32:1 anisotropy ratio

That is to say, for unstructured-grid models with high anisotropy ratios, TPFA
computations are prone to significant errors, and an MPFA discretization scheme is
needed to represent the flow and transport accurately.
The performance data are listed in Table 2.3. As before, the items in boldface are
reported on a per-Newton-iteration basis.

Figure 2.8: Gas rates of producers #1 and #4 for TPFA and MPFA with 1:1, 2:1, 8:1, and 32:1 anisotropy ratios

Table 2.3: Runtime performance of unstructured grid case

Discretization scheme   Newton iter   Solver iter   Discretization time   Solver time    Total time
TPFA 1:1                597           9.48          0.08                  0.33           0.61
MPFA 1:1                576           5.58          0.18 (+114%)          0.60 (+82%)    0.98 (+60%)
TPFA 2:1                593           10.04         0.08                  0.34           0.63
MPFA 2:1                588           5.75          0.18 (+111%)          0.61 (+78%)    0.99 (+58%)
TPFA 8:1                573           11.55         0.09                  0.40           0.69
MPFA 8:1                581           10.17         0.18 (+113%)          0.85 (+110%)   1.24 (+78%)
TPFA 32:1               602           12.28         0.09                  0.45           0.77
MPFA 32:1               639           20.79         0.19 (+96%)           1.62 (+261%)   2.02 (+161%)
MPFA 32:1 SAMG          650           7.80          0.19 (+95%)           1.10 (+146%)   1.51 (+95%)

In this case, the MPFA and TPFA stencils have 18 and 4 entries, respectively. So,
on this basis alone, MPFA is 3.5 times more expensive than TPFA. However, as
shown in Table 2.3, for anisotropy ratios from 1:1 to 8:1, the cost increase is less
than 1.2 times for the discretization and less than 1.1 times for the linear solution.
As a consequence, the extra cost in the total computational time is less than 80%.
On the other hand, for the MPFA discretization with an anisotropy ratio of 32:1,
the number of linear-solver iterations per Newton iteration is quite large (> 20),
which leads to a much higher cost (+261%) in the linear solution. The increased
cost comes from convergence difficulties in the usually robust and highly efficient
AMG solver [76, 77], which is used as the first-stage pressure solver in the CPR
preconditioner [38, 88, 89]. When the reduced pressure matrix contains
large positive off-diagonal entries introduced by MPFA discretizations for models
with high anisotropy ratios, it deviates significantly from being an M-matrix. Using
the more robust SAMG [78] as the pressure solver, the number of linear-solver
iterations per Newton iteration decreases significantly (< 10), and the associated
linear-solution cost is greatly reduced from +261% to +146%. Therefore, for models
with high anisotropy ratios, SAMG is the preferred option for computing a stable
pressure solution.
2.5 Concluding Remarks
From the test cases presented, we observe that different spatial discretization schemes
(e.g., TPFA and MPFA) can lead to considerable differences in the computed results,
especially when the grid is highly nonorthogonal (e.g., skewed elements or unstruc-
tured grid). This also applies to the cases with full-tensor permeability. In order for
a general-purpose simulator to deal with nonorthogonal grids and full-tensor perme-
ability, as well as complex well geometry, TPFA discretization is often inadequate
and an MPFA scheme is necessary.
Our general AD-based, MPFA framework is quite efficient. For both TPFA and
various MPFA discretizations, the CPU time and memory cost are proportional to the
additional entries associated with the discretization. For MPFA-based simulations,
only the discretization and linear-solution time are affected; moreover, the increased
cost in both parts is always less than the growth in the stencil size. The effort
associated with all the remaining parts of a simulation remains the same as for TPFA.
Compared with TPFA, the significant improvements from the MPFA computations
come at an acceptable additional cost.
The flexible multilevel AIM implementation in our AD-based framework employs
FIM only where and when necessary, which reduces the overall simulation cost along
with the level of numerical dispersion. Our AIM framework is currently applied to
the natural-variables formulation with compositional, black-oil, or dead-oil fluids,
but it is designed to support new fluid models and nonlinear formulations.
In the AD framework, a unified code-base has been developed with minimal code
duplication. The unified code works for both TPFA and MPFA in space, and for
any combination of FIM, AIM, IMPES, and IMPSAT in time. Due to the generic
design that decouples the implementation into nonlinear and linear levels, the specific
schemes used for space and time discretization are compatible with the other cell-
based functionality in the simulator.
The AD-based MPFA and AIM capabilities described in this chapter allow for
efficient and flexible modeling of the reservoir part. This is an essential piece of our
general-purpose simulation framework. In Chapter 4, we focus on modeling advanced
wells based on a general multisegment model.
Chapter 3
Linear Solver Framework
3.1 Overview of the Framework
In AD-GPRS, a ‘linear system’ uses one specific format to store the Jacobian matrix
and is capable of performing the following tasks:
1. Extract the Jacobian matrix (and residual values) into the associated matrix
storage format from the AD residual vector that gets constructed in each New-
ton iteration using the independent variables. As long as this task is well defined
for each linear system, the linear solution can be completely separated from the
nonlinear computation.
2. Apply certain algebraic reduction steps to obtain a smaller linear system (e.g.,
the primary or implicit system) that can be solved more efficiently. This is an
optional step that mainly affects the efficiency of the linear solver, not the
solution it produces.
3. Utilize compatible linear solvers (and preconditioners) to obtain the linear solu-
tion to the reduced matrix system. This is the key step where the linear solver
gets called. The obtained solution corresponds to the reduced matrix system, which
is usually smaller than the full matrix system.
4. Use the explicit update procedure to recover the solution of the full system. If
step 2 was applied prior to calling the linear solver in step 3, this step is
required: we need to obtain the Newton update to all independent variables
rather than to only the primary or implicit variables.
5. Write the extracted Jacobian matrix to a text or binary file. This is an
optional feature that facilitates the debugging process and is only invoked in
special circumstances.
Currently, AD-GPRS supports two linear system types. The first uses the standard
Compressed Sparse Row format (CSR, or CRS for Compressed Row Storage; see
http://netlib.org/linalg/html_templates/node91.html). The second employs our
customized MultiLevel Block-Sparse (MLBS, or MLSB for MultiLevel Sparse Block;
see Chapter 3 in [38]) matrix format.
3.2 CSR linear system
Here, the Jacobian matrix is extracted and stored in the CSR format, which is a
widely used data structure for sparse matrices. The main idea of the CSR format is
illustrated using the example 4 × 4 matrix with nine nonzero elements shown below:

    1.0  2.0  0    0
    3.0  4.0  5.0  0
    6.0  0    7.0  8.0
    0    0    0    9.0

Table 3.1: Row pointer array of CSR format

    ind       0  1  2  3  4
    row_ptr   0  2  5  8  9

Table 3.2: Column index and value arrays of CSR format

    ind       0    1    2    3    4    5    6    7    8
    col_ind   0    1    0    1    2    0    2    3    3
    val       1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
The CSR representation of the above matrix includes three arrays: row_ptr (row
pointers, as shown in Table 3.1), col_ind (column indices, as shown in the middle row
of Table 3.2), and val (values, as shown in the last row of Table 3.2). Note that the
row and column indices are zero-based (0, 1, 2, . . .). For a general nrow × ncol sparse
matrix with nnz nonzero elements, the three arrays are defined as follows:
• row ptr: this array has nrow + 1 elements. row ptr[i] (i = 0, . . . , nrow) indicates the total number of nonzero elements in rows 0 to i − 1. Thus row ptr[0] must be equal to 0, and row ptr[nrow] must be equal to nnz because it represents the total number of nonzero elements in all rows. The indices of the nonzero entries in row i are row ptr[i], . . . , row ptr[i+1] − 1.
• col ind: this array has nnz elements. col ind[j] (j = 0, . . . , nnz − 1) indicates the column of the j'th nonzero entry. That is, col ind[j] (j = row ptr[i], . . . , row ptr[i+1] − 1) gives the column of each nonzero entry in row i. The elements are usually ordered from smallest to largest column within each row; this is the definition used in the CSR linear system. There are several variant definitions in the matrix components of the MLBS linear system. For example, in each row, the column of the diagonal entry is always ordered first, and the columns of the remaining nonzero entries are ordered from smallest to largest afterwards. This ordering is required by the AMG [76, 77] or SAMG preconditioners. Also, the column of the diagonal entry may sometimes be omitted from the col ind array if the diagonal entry always exists in each row.
• val: this array also has nnz elements. val[j] (j = 0, . . . , nnz − 1) stores the value of the j'th nonzero entry and thus has a one-to-one correspondence with col ind[j]. That is, its order is determined by the order of col ind in each row. Each element in the val array can be a single value (pointwise) or a dense block of values (blockwise). The pointwise storage is used for the CSR linear system, whereas the blockwise extension is used in the matrix components of the MLBS linear system.
Based on the above description of the three CSR arrays, the common way to visit the entries is row by row on the first level, and element by element on the second level. That is, we first loop over rows i from 0 to nrow − 1, and for each i, we loop over elements j from row ptr[i] to row ptr[i+1] − 1, accessing the column through col ind[j] and the value through val[j].
The major advantages of this format are its simplicity (it is easy to understand and operate on) and the wide availability of existing algorithms based on it. Because it is a standard
format, ADETL provides convenient interfaces to directly extract the necessary data
components of the CSR format (row pointer, column index, and value array) from an
AD vector [97]. Based on these interfaces, the entire extraction process is straight-
forward. Most of the time, we need no specialization in the CSR linear system to
accommodate new features, such as the facility models introduced into AD-GPRS.
However, because the CSR format does not contain any information about its submatrices, it is relatively complicated and expensive to extract and manipulate an arbitrary block (e.g., the submatrix for the entire facilities part, or for a specific well model) from the entire Jacobian. As a consequence, the available CSR-based linear solvers in AD-GPRS usually do not exploit the multilevel block structure of the Jacobian and thus are not optimal. Among the available solvers are sparse direct solvers, including PARDISO (http://www.pardiso-project.org/) and SuperLU (http://crd.lbl.gov/~xiaoye/SuperLU/), which are usually very slow and require huge amounts of memory for large problems. Sparse iterative solvers, such as ILUPACK (http://www.icm.tu-bs.de/~bolle/ilupack/), can also be incorporated.
The above solvers are called "library" solvers, because all of them are provided by external libraries and accessed through wrapper classes. Each of them has many configurable parameters. These parameters are stored in an editable text file and can be read into AD-GPRS through an extensible I/O class. In addition, most of these solvers provide the option to directly solve the transposed system (without the matrix being manually transposed), which could be useful for the adjoint functionality.
The major purpose of the CSR linear system is validation. When new features are introduced to AD-GPRS, a nontrivial extension may be required in the MLBS linear system. In such circumstances, we may first test these features with the CSR linear system to see whether they work properly. However, due to its low efficiency and significant memory consumption, no large problem can be tested this way. After validation on small problems, the extension in the MLBS linear system may be implemented to accommodate the new features, such that large-scale problems can be tested to demonstrate the efficiency (and wider applicability) of the features.
3.3 Block linear system
The block linear system is the default option in AD-GPRS, due to its high efficiency
and many advanced features. Its associated matrix format is the previously mentioned
MLBS matrix, which has a much more sophisticated data structure than that of the
CSR format.
The term ‘data structure’ refers to a scheme for organizing related pieces of infor-
mation. A good data structure design has the following features [38]:
• Accessibility: Users should be able to generate, access, and modify data easily;
• Encapsulation: Data should be hidden by the object that owns it, and other
objects should have minimal dependency on the detailed format of the data;
• Extensibility: It should be relatively easy to fit future developments into the
data structure and expand it if necessary;
• Computational efficiency: The data structures should be compatible with state-
of-the-art computational algorithms.
To fulfill the above requirements, the MLBS format [38] was first designed and
implemented in the original GPRS. In [38], the MLBS format was demonstrated to be
the most efficient option when used in conjunction with the GMRES solver [67,68] and
CPR (Constrained Pressure Residual, see [88, 89]) preconditioner. To inherit these
advantages, the MLBS matrix is introduced into AD-GPRS as the matrix format for
the block linear system. Due to the nature of the AD framework, the elements of
the MLBS matrix are no longer manually set as in the original GPRS. To establish
a connection from the AD residual vector to the MLBS matrix, the functionality of
the MLBS matrix format has been extended such that the elements can be extracted
from the AD residual vector automatically for each Newton iteration.
To simulate systems with many relatively independent components, e.g., an oilfield
production system that includes reservoirs, wells, and surface facilities, a hierarchical
storage system is used as the data structure of the MLBS matrix format. This data
structure corresponds to the components in the physical model. In this data structure,
a higher-level matrix is composed of a number of independent submatrices. The
higher-level matrix only contains some general information (e.g., the size of the matrix) and the pointers to the submatrices. There is no requirement on the type of each
submatrix, which enables the overall system to be composed of several different types
of matrices. The implementation details of the submatrices are completely hidden
from the higher level matrix. This feature honors data encapsulation and achieves
great flexibility and extensibility. The submatrices of new models can be integrated
into AD-GPRS without changing the high-level architecture.
The (sub)matrices at all levels of the MLBS matrix derive from the same base data type, GPRS Matrix, which defines the following common interfaces:
1. Extraction of the full system matrix from the AD residual vector and the sub-
sequent algebraic reduction from the full system to the primary system
2. Algebraic reduction from the primary system obtained in 1 to the implicit sys-
tem (needed for AIM)
3. Assembly of the global RHS (Right-Hand-Side) vector from all submatrices of
the implicit system obtained in 2
4. Distribution of the global solution vector to all submatrices in the implicit
system (called after the linear solution of the implicit system)
5. (Explicit) update of the primary explicit solution from the implicit solution
acquired in 4 (implicit → primary solution, needed for AIM)
6. (Explicit) update of the secondary solution from primary solution obtained in
5 (primary → full solution)
7. Sparse Matrix-Vector multiplication (SpMV, y = Ax, or y = y + Ax)
8. Recalculation of the matrix size, i.e., number of rows and columns
9. Output of the matrix in a descriptive format (keeping the structure information)
or pointwise COO [67] format (ignoring the structure information). COO stands
for coordinate list, which is a list of (row, column, value) tuples representing all
nonzero entries
10. Fetch of the type, size (number of rows and columns), and associated element
extractor of the matrix.
Among the above set of functionality, items 1 to 6 are new extensions relative to the MLBS matrix in the original GPRS. The matrix extraction from the AD residual vector is needed for the MLBS matrix in the AD framework, whereas the two-step algebraic reduction (from the full to the primary, then from the primary to the implicit system), the assembly of the global RHS vector, the distribution of the global solution vector, and the two-step explicit update (from the implicit to the primary, then from the primary to the full solution) were previously implemented in a separate series of classes called the 'equation selector'. Because this set of functionality depends strongly on the data organization in the (sub)matrices at different levels, that implementation yielded strong coupling between the (sub)matrices and the 'equation selector' classes and weakened the encapsulation and extensibility of the MLBS matrix format. To overcome this drawback and satisfy the previously mentioned requirements on the data structure (accessibility, encapsulation, extensibility, and computational efficiency), items 1 to 6 are now integrated as part of the functionality of each MLBS (sub)matrix type in AD-GPRS. In order for a new (sub)matrix type to function normally in the MLBS hierarchy, all items in the above set, including items 1 to 6, must be properly defined.
Moreover, the above set of functionality is defined and implemented recursively, such that a higher-level matrix does not depend on the specific types or implementation details of its lower-level submatrices. For example, to calculate the size of a higher-level matrix containing several submatrices, we first call the size-calculation functions of these submatrices. Once the submatrix sizes are returned, the higher-level matrix then calculates its own size. Thus, as long as the size calculation is properly defined for these submatrices, no specialization is required in the higher-level matrix. If a submatrix itself contains submatrices at an even lower level, the corresponding functions of those lower-level submatrices are called in turn, until the lowest-level submatrices are visited. This is how the recursive implementation works. Besides size calculation, the recursive implementation also applies to matrix extraction, algebraic reduction, global RHS assembly, solution distribution, explicit update, SpMV, and matrix output.
In the following sections, we will go through the hierarchy of the MLBS matrix
format from top to bottom.
3.3.1 First level: global matrix
In general-purpose reservoir simulation, we often need to solve a coupled system
that contains many complex objects, such as the reservoir model, wells, and surface
facilities. Each of these objects has its own set of nonlinear equations and variables.
Generally, the Newton method is employed as the nonlinear solver. Therefore, in fully
coupled schemes, a global Jacobian matrix that incorporates all the derivatives of the
different governing equations with respect to all the variables, is required. Figure
3.1 shows a Jacobian matrix generated for a three-dimensional structured reservoir
model with several standard wells. The blue dots represent nonzero elements.
The seven-diagonal banded structure in the top-left part of the matrix repre-
sents the reservoir equations and associated variables. The diagonal structure in
the bottom-right part represents the well equations and variables. The bottom-left
and top-right parts represent the coupling terms between the reservoir and the wells.
Figure 3.1: Typical global Jacobian matrix in reservoir simulation [38]

Thus, the matrix structures are quite different among the various parts, as described above. Given this typical structure, it is natural to conceptually separate the global matrix into the following four parts:
• JRR, derivatives of reservoir equations with respect to reservoir variables
• JRF , derivatives of reservoir equations with respect to facility variables
• JFR, derivatives of facility equations with respect to reservoir variables
• JFF , derivatives of facility equations with respect to facility variables
In the MLBS data structure, the first level (global Jacobian matrix) is defined
as a wrapper matrix with no substantial information other than handles to the four
submatrices. The structure of the wrapper matrix is shown in Figure 3.2. The
wrapper has no requirement on the types of its submatrices, i.e., all four parts (JRR,
JRF , JFR, and JFF ) can have different types. This is achieved through four template
parameters in the class definition. When the global matrix is declared, the types of
the four submatrices must be determined. These types will be discussed on the next
level.
    JRR  JRF
    JFR  JFF

Figure 3.2: Submatrices in the global Jacobian matrix [98]
In the global matrix, each of the four submatrices can be conveniently accessed
through the provided interfaces. In fact, for any MLBS (sub)matrix type that contains
a number of submatrices, consistent interfaces are provided for the access to these
submatrices. As a result, great flexibility is offered in the design of solution strategies.
For example, we can get the decoupled facilities matrix JFF without any additional cost, because it is a stand-alone part of the global matrix. The same operation would be quite complex and expensive if the global CSR format were used. Using the MLBS
matrix format, sequential solution strategies for the reservoir and facilities can be
implemented relatively easily.
In the original GPRS, the structure and data of the four submatrices are manip-
ulated by either the reservoir object (JRR) or facilities object (JRF , JFR, and JFF ).
Now in AD-GPRS, the structure of these submatrices is initialized by the linear sys-
tem object, taking necessary information, such as the reservoir general connection
list, number of implicit variables (and indices of implicit components) in each reser-
voir cell, the facilities object, and the AD variable set, as input. The data, on the
other hand, are extracted from the AD residual vector by these submatrices them-
selves. That is, the reservoir and facilities objects no longer need to know about the
structure of the associated submatrices and manually fill in the values. Instead, they
would access and manipulate the common AD residual vector on the nonlinear level,
while the derivatives of all equations with respect to independent variables are auto-
matically generated by the AD framework. Afterwards, through a unified interface,
these derivatives are extracted by the selected linear system into the desired format.
In this case, the block linear system will call the corresponding extraction function
in the global matrix, which will in turn call the extraction function in each of its
submatrices, to extract the Jacobian matrix into MLBS format.
3.3.2 Second level: reservoir and facilities matrices
The four submatrices (JRR, JRF , JFR, and JFF ) in the first-level global Jacobian
matrix have substructures of their own and are considered to be the second-level
matrices.
Reservoir matrix
The reservoir matrix JRR contains the derivatives of all reservoir equations with respect to all reservoir variables. This includes the NPri primary (mass and energy conservation) and NSec secondary (local constraints, e.g., thermodynamic equilibrium) equations, as well as the NPri primary and NSec secondary variables. In order
to improve data processing efficiency and avoid extensive memory usage, the block
structure is exploited and the data are stored in the block diagonal and off-diagonal
arrays. An example of JRR with its associated data arrays is illustrated in Figure
3.3, which is based on the 3× 2 (two-dimensional) reservoir shown in Figure 3.5. As
described in the common interfaces for all MLBS matrices, the matrix extraction is
combined with the algebraic reduction from the full system to the primary system.

Figure 3.3: Structure of the JRR submatrix (diagonal and off-diagonal block arrays; derivative entries with respect to secondary variables are discarded after the algebraic reduction from the full to the primary system)
Thus the data storage of JRR only contains the elements in the reduced primary sys-
tem and some supplementary elements (secondary(equation) - primary(variable) part
of diagonal blocks) that are needed for the recovery of the solution to the full system.
The data storage for each part is ordered first block by block, and then element by
element (column-major, as shown by the yellow arrows in Figure 3.3).
On one hand, the diagonal array contains NB NPri^2 + NB NPri NSec elements, where NB is the number of diagonal blocks (i.e., the number of reservoir cells), and NPri and NSec are the numbers of primary and secondary equations/variables as defined above. The first part (the NB NPri^2 elements in the first row of the diagonal array in Figure 3.3) is for all diagonal blocks in the primary system, whereas the second part (the NB NPri NSec elements in the second row of the diagonal array in Figure 3.3) is for the supplementary elements mentioned above. As illustrated in the top-right part of Figure 3.3, the derivatives of either primary or secondary equations with respect to secondary variables in the diagonal blocks are discarded after the algebraic reduction from the full to the primary system.
On the other hand, the off-diagonal array contains Noff NPri^2 elements, where Noff is the number of off-diagonal blocks (i.e., the number of entries in the reservoir
general connection list). For off-diagonal blocks, the secondary equations should have
no derivatives (all derivatives are local, i.e., in the diagonal blocks) and the derivatives
of primary equations with respect to secondary variables are discarded in a similar
way after the algebraic reduction process. The ordering of off-diagonal blocks in this array is based on the order of entries in the general connection list, because each entry in the list corresponds uniquely to one off-diagonal block. Currently, the
ordering scheme arranges all off-diagonal blocks first row by row, and in each row by
the column indices of the blocks. This is essentially the same as the order of elements
in a CSR matrix, except that diagonal elements are not included because they always
exist and are stored in another data array.
Besides these two data arrays, JRR contains some additional information, such as
the generalized connection list and the number of implicit variables in each block. The
constant pointers to these additional data are passed to JRR upon its construction,
because JRR only needs to access, not manipulate, these data during the computation. Reusing these data arrays significantly reduces the memory cost of the simulator and eliminates the need for a synchronization step.
Facilities matrices
After describing the structure of the reservoir matrix JRR, we may now discuss that
of the facilities matrices JRF , JFR, and JFF . These three matrices correspond to
computations in the facilities class, which represents the aggregate of all the wells
and surface facilities in the field. There may be tens to hundreds of these objects in
a large-scale field model.
Each facility object, e.g., a general multisegment well, owns a set of stand-alone
matrices, named JRW,i, JWR,i, and JWW,i, respectively, where i is the index of the
facility object. JRW,i is a vertical slice in JRF and contains the derivatives of the
reservoir equations with respect to well variables of the i’th facility object. Similarly,
JWR,i is a horizontal slice in JFR and contains the derivatives of well equations of the
i’th facility object with respect to all reservoir variables; JWW,i is a square matrix
in JFF and contains derivatives of well equations with respect to well variables, both
belonging to the i’th facility object. That is to say, JRF , JFR, and JFF are all
wrapper matrices and respectively contain the handles to the JRW,i, JWR,i, and
JWW,i matrices from all subordinate facility objects. Figure 3.4 shows the structure
of second-level matrices corresponding to a reservoir with two wells. We may notice
that the details of the well matrices corresponding to the first well (JRW,1, JWR,1, and
JWW,1) and to the second well (JRW,2, JWR,2, and JWW,2) are completely hidden
from the second-level matrices. The operations in JRF , JFR, and JFF are defined in
a generic way that is independent of the structures of their submatrices.

Figure 3.4: Structure of second-level MLBS matrices [38]

3.3.3 Third level: well matrices

As described in Section 3.3.2, the matrices JRW,i, JWR,i, and JWW,i belong to a single facility object with index i. They may or may not have substructures, i.e., any of them can be on the very basic level (similar to JRR, which has no submatrices),
or contain several lower-level submatrices. As mentioned before, there is no format
requirement for these basic matrices; they can even be dense or empty matrices, if
necessary. The structure details are left for model developers. Here, we only discuss
the data formats currently used in AD-GPRS, which can serve as an example for
future development. Based on model types, the matrix set of JRW,i, JWR,i, and
JWW,i can be very different in both size and format.
Figure 3.6 shows the structure of these submatrices in JRF , JFR, and JFF . These illustrations correspond to the standard well (with two perforations, in cells 1 and 4 respectively) and the general multisegment well (with three nodes, three connections, and two perforations, in cells 3 and 6 respectively) shown in Figure 3.5. The blue blocks in the matrices represent nonzero elements, whereas the white blocks represent zero elements and are therefore not stored.
On one hand, the first column in JRF (JRW,1), the first row in JFR (JWR,1), and the first diagonal element in JFF (JWW,1) correspond to Well 1 in Figure 3.5, which is a standard well. Currently, for consistency in the algebraic reduction and explicit update, the JWW,1 matrix shares the same type as JRR, although it
contains only one element. Both JRW,1 and JWR,1 have the same type, a block
COO (coordinate list) format. For a block COO matrix with Nblk nonzero blocks,
two arrays row ind and col ind, each with size Nblk, are used to store the row and
column coordinate of these blocks. The elements of these blocks are stored in a data
array with size Nblk NPri^row NPri^col, where NPri^row denotes the number of primary equations in the corresponding main-diagonal matrix on the same row, whereas NPri^col denotes the number of primary variables in the corresponding main-diagonal matrix on the same column. Similar to JRR, the data storage is ordered first block by block, then
element by element (column-major). Here, the corresponding main-diagonal matrix refers to the submatrix whose equations and variables have the same classification, or equivalently, whose first and second subscripts in the label are the same (e.g., JRR, JWW,i). For a block COO matrix that is not on the main diagonal (e.g.,
JRW,i), it needs to access one corresponding main-diagonal matrix on the same row
(e.g., JRR) and one on the same column (e.g., JWW,i), in order for its operations, such as matrix extraction, algebraic reduction, and explicit update, to function properly.
On the other hand, the submatrices JRW,2, JWR,2, and JWW,2, as enclosed by the dashed line in JRF , JFR, and JFF , correspond to Well 2, which is a general
multisegment well. We can see that the equation layout of a general multisegment
well is much more complicated than that of a standard well: JRW,2 and JWR,2 are wrapper matrices containing two submatrices each, whereas JWW,2 is a wrapper matrix with four submatrices (JNN , JNC , JCN , and JCC , where N and C stand for
node and connection respectively). The details of these submatrices will be discussed
in Section 4.5.3 after we introduce the general multisegment well model.
Figure 3.5: Sample reservoir with two wells (cells 1-6; Well 1 is a standard well, Well 2 a general multisegment well)

As we can see in the example above, the structure of the well matrices for different well models can be completely different. This is achieved by hiding the details
from upper-level matrices through common interfaces. In addition, the matrix for-
mat is also hidden from the well model, which handles the nonlinear computation
only. The decoupling between well models and matrix formats is achieved by the
facility-matrix handlers. Each of these handlers is defined for a specific well model
(e.g., the standard well model) and a certain matrix format (e.g., the MLBS matrix).
The handler will take an object of the corresponding well model as the input and
create associated matrices JRW,i, JWR,i, and JWW,i in the corresponding format.
For example, a handler defined for standard well and MLBS matrix will create sub-
matrices in the same format as JRW,1, JWR,1, and JWW,1 shown in Figure 3.6. The
handlers defined for different facility models and the same matrix format are created
and managed by an upper-level facility-matrix handler, which is defined for the over-
all facilities class and the same matrix format. This top-level handler is created by
the corresponding linear system. More details about the facility-matrix handlers are
discussed in Section 4.5.2.
Figure 3.6: Structure of third-level MLBS matrices
3.4 Solution strategy of block linear system
Due to the flexibility provided by the hierarchical structure of the block linear system, advanced techniques can be applied to its solution, such that the solution efficiency is generally much higher than that of the CSR linear system.
In order to minimize the size of the global linear system, a Schur-complement
procedure is applied to the full Jacobian of each (reservoir/well) block to express
the primary equations as a function of the primary variables only. This general
Schur-complement reduction strategy is an important aspect of AD-GPRS as well as
the original GPRS, as described in [15]. There is no special alignment requirement on the equations and variables, except that the submatrix eliminated in the Schur complement must be invertible. When an AIM scheme is adopted, one more Schur-complement reduction step is applied to the primary Jacobian to obtain an even smaller implicit Jacobian.
Instead of keeping the Schur-complement procedure separate from the matrix types (as in the equation-selector class of the original GPRS, see [15]), the procedure is defined recursively, along with the matrix extraction, inside the various MLBS matrix types in
AD-GPRS. For example, on the top level, the four submatrices (JRR, JRF , JFR, and
JFF ) correspond to four different types and each has its own matrix extraction/Schur-
complement function. This approach yields better flexibility and extensibility. When
we introduce a new MLBS matrix type, as long as its own extraction and Schur-
complement procedure (and other necessary procedures as listed previously) are prop-
erly defined, it will fit into the hierarchy of MLBS matrices. Ideally, we do not need
to modify the corresponding procedures in any other MLBS matrix types.
After the size of the full linear system is reduced, the resulting implicit linear system is solved for the Newton updates to the implicit variables. The most effective solution strategy is to employ the GMRES Krylov subspace solver [67, 68] preconditioned by the two-stage Constrained Pressure Residual (CPR) approach originally presented in [88] and [89]. In the first stage, an algebraic reduction step (e.g., true-IMPES) is used to construct the pressure system, which is then solved for an approximate solution to the pressure variables using an AMG solver [76, 77]. In the second stage, a block ILU preconditioner with different levels of fill-in is employed for the solution of the individual submatrices (see [17] and [38]). With this approach, it is not necessary to construct special pressure equations at the nonlinear level.
After the implicit linear system is solved, the solution to the primary explicit variables and to the secondary variables can be found locally for each grid cell through an explicit update corresponding to the algebraic reduction applied prior to the linear solution. Thus we obtain the Newton updates to all independent variables by solving a much smaller linear system (and hence considerably faster). The details of the matrix extraction, algebraic reduction (Schur complement), preconditioned linear solution, and explicit update procedures are discussed in the following sections.
3.4.1 Matrix extraction
An intermediate extractor class built on top of the extraction functions of the AD
library (see [97]) is used by all MLBS matrices for acquiring the elements from an AD
residual vector. A contiguous piece of derivatives with neighboring columns in a single row (i.e., from a single ADscalar variable) can be extracted at a time. The internal structure of
the gradient in an ADscalar variable can vary, e.g., in the form of pointwise (column,
value) pairs, or blockwise (block column, all values in a block) pairs (see the second
part of [96] for details). However, a unified interface is provided by the AD library to
extract a given number of neighboring derivatives, such that the internal structure of
the gradient is completely hidden from the user of the AD library.
After obtaining the needed piece of neighboring derivatives with the given size and
starting from the given column in the given row, the MLBS matrix can then store
these derivatives into its internal arrays according to its data storage scheme. We
usually extract a piece of derivatives with the size equal to the total number of active
(primary and secondary) variables in the current block of the matrix, because these
derivatives always occupy neighboring columns. The number of active variables
can change from block to block, and from iteration to iteration, depending on the
active phase status of the block. Note that the derivatives are usually stored in a
column-major fashion in MLBS matrices, whereas they are obtained in the form of a
contiguous piece of a row. In addition, a permutation (needed by the variable-based AIM
scheme) can be applied to both the columns and rows of an extracted matrix block. Thus
the MLBS matrix is responsible for mapping the elements in the extracted piece into
the appropriate places of its data storage. Also note that part of the derivatives (e.g.,
those with respect to secondary variables) may be stored in temporary arrays and
get discarded after the Schur-complement process that generates the primary system
from the full system. Nevertheless, necessary information for the recovery of the full
solution from the primary solution is kept.
Two alternative interfaces are provided by the extractor class, depending on
whether we would like to continue the searching process from the last extracted piece
of derivatives in order to speed up the process. This is achieved by passing either
the saved location of last extracted piece or the starting location in a gradient to
the extraction function of the gradient in an ADscalar variable. The former interface
can be more effective during the process of extracting several blocks of derivatives
with ascending columns from the same row, whereas the latter one should be used
for finding a specific piece of the gradient, e.g., the diagonal part, in a row. The
saved locations of the last extracted pieces are reset to the starting locations in all
gradients after the Jacobian generation and prior to the matrix extraction process in
each Newton iteration.
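The two interfaces can be sketched in a few lines. The Python names below (RowGradient, extract) are hypothetical stand-ins for the AD-library gradient interface; only the contrast between resuming from a saved location and searching from a given position is the point:

```python
import numpy as np

class RowGradient:
    """One row of derivatives stored as ascending (column, value) pairs.

    Hypothetical sketch: the real AD library hides this internal structure
    behind a unified extraction interface.
    """
    def __init__(self, cols, vals):
        self.cols = np.asarray(cols)
        self.vals = np.asarray(vals, dtype=float)

    def extract(self, start_col, count, resume_from=0):
        """Return `count` neighboring derivatives starting at `start_col`,
        searching from position `resume_from`, plus the saved location from
        which the next extraction can resume."""
        pos = resume_from
        while pos < len(self.cols) and self.cols[pos] < start_col:
            pos += 1
        return self.vals[pos:pos + count], pos + count

# Extracting two blocks with ascending columns from the same row: the second
# call resumes from the saved location instead of re-searching the row.
row = RowGradient([0, 1, 2, 5, 6, 7], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
block_a, saved = row.extract(start_col=0, count=3)
block_b, saved = row.extract(start_col=5, count=3, resume_from=saved)
```

With ascending extraction targets, resuming from the saved location avoids rescanning the already-visited part of the row, which is why the former interface is preferred when extracting several blocks from the same row.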
3.4.2 Algebraic reduction from full to primary system
Under the fully implicit formulation, the primary variables (NC per grid block, plus
one for the thermal formulation) are all implicit. So there is a one-step algebraic reduction from
full to primary system (i.e., Schur-complement process) before the linear solution,
and a one-step explicit update from primary to full variables after the linear solution.
The application of this procedure to JRR and its associated RHS vector is described
below.
First we define some notation, which is consistent with the implementation in
AD-GPRS, for the system matrix:
• A – Derivatives of primary equations with respect to primary variables;
• B – Derivatives of primary equations with respect to secondary variables;
• C – Derivatives of secondary equations with respect to primary variables;
• D – Derivatives of secondary equations with respect to secondary variables;
• Xp, Xs – Updates for the primary variables and for the secondary variables;
• Fp, Fs – RHS of the primary equations and of the secondary equations;
• first block subscript – the row of the block;
• second block subscript – the column of the block.
The block nonzero entries of the full system, using the above notation, are shown
below:

\[
\begin{bmatrix}
\begin{bmatrix} A & B \\ C & D \end{bmatrix}_{i,i} & \cdots & \begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{j,i} & \cdots & \begin{bmatrix} A & B \\ C & D \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_s \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} F_p \\ F_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} F_p \\ F_s \end{bmatrix}_j \end{bmatrix}
\tag{3.1}
\]
For each block row $i$, first left-multiply the second row (with $C$ and $D$ in the diagonal block and $0$ in the off-diagonal blocks, corresponding to the secondary equations) by $D_{i,i}^{-1}$, and then subtract from the first row (with $A$ and $B$ in the diagonal block, corresponding to the primary equations) the resultant second row (with $D_{i,i}^{-1}C$ and $I$ in the diagonal block) left-multiplied by $B_{i,i}$. The full system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{i,i} & \cdots & \begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{j,i} & \cdots & \begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_s \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} F_p - (BD^{-1})_{i,i}F_s \\ D_{i,i}^{-1}F_s \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} F_p - (BD^{-1})_{j,j}F_s \\ D_{j,j}^{-1}F_s \end{bmatrix}_j
\end{bmatrix}
\tag{3.2}
\]
Next, for each flux between cell $i$ (corresponding to block row/column $i$) and cell $j$ (corresponding to block row/column $j$), subtract from the first row (primary equations) of block row $i$ the second row (secondary equations) of block row $j$ left-multiplied by $B_{i,j}$, and similarly, subtract from the first row (primary equations) of block row $j$ the second row (secondary equations) of block row $i$ left-multiplied by $B_{j,i}$. The full system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} A - B(D^{-1}C)_{j,j} & 0 \\ 0 & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A - B(D^{-1}C)_{i,i} & 0 \\ 0 & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_s \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_s \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} F_p - (BD^{-1})_{i,i}F_s - B_{i,j}D_{j,j}^{-1}(F_s)_j \\ D_{i,i}^{-1}F_s \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} F_p - (BD^{-1})_{j,j}F_s - B_{j,i}D_{i,i}^{-1}(F_s)_i \\ D_{j,j}^{-1}F_s \end{bmatrix}_j
\end{bmatrix}
\tag{3.3}
\]
Then the primary system, as shown in Eq. (3.4), is decoupled from the full system
by taking the first row (primary equations) and first column (primary variables) out
of each block row and column.

\[
\begin{bmatrix}
(A - BD^{-1}C)_{i,i} & \cdots & A_{i,j} - B_{i,j}(D^{-1}C)_{j,j} \\
\vdots & \ddots & \vdots \\
A_{j,i} - B_{j,i}(D^{-1}C)_{i,i} & \cdots & (A - BD^{-1}C)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_p)_i \\ \vdots \\ (X_p)_j \end{bmatrix}
=
\begin{bmatrix}
(F_p)_i - (BD^{-1})_{i,i}(F_s)_i - B_{i,j}D_{j,j}^{-1}(F_s)_j \\
\vdots \\
(F_p)_j - (BD^{-1})_{j,j}(F_s)_j - B_{j,i}D_{i,i}^{-1}(F_s)_i
\end{bmatrix}
\tag{3.4}
\]
Repeating this process for each cell $j$ that is a neighbor of cell $i$ in the general
connection list, the diagonal part of block row $i$ in the primary system is kept as
$(A - BD^{-1}C)_{i,i}$, and the off-diagonal entry in block row $i$ and block column $j$ is
computed as $A_{i,j} - B_{i,j}(D^{-1}C)_{j,j}$ for each neighboring cell $j$. The final RHS of block
row $i$ is obtained as:

\[
R^{Pri}_i = (F_p)_i - (BD^{-1})_{i,i}(F_s)_i - \sum_{j \in nbr(i)} B_{i,j} D_{j,j}^{-1} (F_s)_j
\tag{3.5}
\]
The above reduction process can be applied to any main-diagonal submatrix that
shares the same structure as JRR, e.g., the JNN part of the JWW,i matrix of a general
multisegment well. Besides these main-diagonal submatrices, the treatment of the
off-diagonal terms described above also needs to be applied to the coupling submatrices
JRW,i and JWR,i of all facility objects (and to the block COO matrices on even
lower levels, e.g., the JNC and JCN parts of the JWW,i matrix of a general multisegment
well). For any block nonzero entry of an off-diagonal block COO matrix in the full
system, the entry takes the form:

\[
\begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}_{row,col},
\tag{3.6}
\]
where the second row and second column may or may not exist. For example, JRW,i of
a standard well contains no derivatives with respect to secondary variables, because
there is only one variable, BHP, per well; and similarly, JWR,i of a standard well
contains no derivatives from secondary equations, because there is only one equation
per well.
Treating the block entry (3.6) as an off-diagonal block of the main-diagonal matrix,
we may reduce it to the following primary block:

\[
A_{row,col} - B_{row,col}(D^{-1}C)_{col,col},
\tag{3.7}
\]

where $A_{row,col}$ and $B_{row,col}$ are taken from the block entry (3.6), and $(D^{-1}C)_{col,col}$ is taken from
the diagonal part of the corresponding main-diagonal matrix in the same column.
The RHS of the corresponding block row is updated as:

\[
R^{Pri}_{row} \leftarrow R^{Pri}_{row} - B_{row,col}\,D_{col,col}^{-1}(F_s)_{col}
\tag{3.8}
\]
The entire algebraic reduction process from the full to the primary system has now been
described. Correspondingly, after the primary variable update $(X_p)_i$ ($i = 1, 2, \ldots, N_B$) is
obtained, the update of the secondary variables can be computed explicitly as:

\[
(X_s)_i = D_{i,i}^{-1}(F_s)_i - (D^{-1}C)_{i,i}(X_p)_i, \quad i = 1, 2, \ldots, N_B
\tag{3.9}
\]

where $D_{i,i}^{-1}(F_s)_i$ and $(D^{-1}C)_{i,i}$ have already been computed in Eq. (3.2). $D_{i,i}^{-1}(F_s)_i$ is
stored in the original place for the secondary RHS, whereas $(D^{-1}C)_{i,i}$ corresponds to the
extra elements stored in the second part of the diagonal array of JRR.
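As a concrete illustration, the following numpy sketch carries out the full-to-primary reduction of Eqs. (3.2)-(3.5) and the explicit secondary update of Eq. (3.9) for a small dense two-cell system, and checks the combined result against a direct solve of the full system. The dense storage and random blocks are purely illustrative; in AD-GPRS these blocks live in the MLBS format:

```python
import numpy as np

rng = np.random.default_rng(0)
npv, ns = 2, 3                     # primary and secondary variables per cell
n = npv + ns

# Two-cell full system (Eq. 3.1): diagonal blocks carry [A B; C D];
# off-diagonal blocks carry only primary-equation derivatives [A B; 0 0].
J = np.zeros((2 * n, 2 * n))
for c in range(2):
    J[c*n:(c+1)*n, c*n:(c+1)*n] = rng.standard_normal((n, n)) + 4.0 * np.eye(n)
J[0:npv, n:2*n] = rng.standard_normal((npv, n))       # block (i, j)
J[n:n+npv, 0:n] = rng.standard_normal((npv, n))       # block (j, i)
F = rng.standard_normal(2 * n)

# Per-cell Schur complement (Eqs. 3.2-3.5) assembling the primary system.
Jp = np.zeros((2 * npv, 2 * npv))
Fp = np.zeros(2 * npv)
DC, DF = [], []                    # keep D^{-1}C and D^{-1}F_s for Eq. (3.9)
for c in range(2):
    blk = J[c*n:(c+1)*n, c*n:(c+1)*n]
    A, B = blk[:npv, :npv], blk[:npv, npv:]
    C, D = blk[npv:, :npv], blk[npv:, npv:]
    Dinv = np.linalg.inv(D)
    DC.append(Dinv @ C)
    DF.append(Dinv @ F[c*n+npv:(c+1)*n])
    Jp[c*npv:(c+1)*npv, c*npv:(c+1)*npv] = A - B @ DC[c]
    Fp[c*npv:(c+1)*npv] = F[c*n:c*n+npv] - B @ DF[c]
for (r, c) in [(0, 1), (1, 0)]:    # off-diagonal: A_ij - B_ij (D^{-1}C)_jj
    Aij = J[r*n:r*n+npv, c*n:c*n+npv]
    Bij = J[r*n:r*n+npv, c*n+npv:(c+1)*n]
    Jp[r*npv:(r+1)*npv, c*npv:(c+1)*npv] = Aij - Bij @ DC[c]
    Fp[r*npv:(r+1)*npv] -= Bij @ DF[c]

Xp = np.linalg.solve(Jp, Fp)
# Explicit update of the secondary variables, Eq. (3.9).
X = np.zeros(2 * n)
for c in range(2):
    X[c*n:c*n+npv] = Xp[c*npv:(c+1)*npv]
    X[c*n+npv:(c+1)*n] = DF[c] - DC[c] @ Xp[c*npv:(c+1)*npv]

err = np.max(np.abs(X - np.linalg.solve(J, F)))
```

Because the secondary equations have no off-diagonal entries, the reduction is exact Gaussian elimination, and the reconstructed full update matches the direct solve to round-off.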
3.4.3 Algebraic reduction from primary to implicit system
Under IMPES, IMPSAT, or any AIM formulation, the primary variables are
no longer all implicit. So in addition to the original "full to primary" reduction
discussed in Section 3.4.2, another reduction from the primary to the implicit system
must be performed between the original reduction and the linear solution. Correspondingly,
an additional explicit update step from implicit to primary variables is also needed
between the linear solution and the original "primary to full" update. Effectively, we
perform a two-step algebraic reduction (full to primary to implicit) before the linear
solution, as well as a two-step explicit update (implicit to primary to full) after the
linear solution.
First we further divide the primary equations and variables into implicit and (primary)
explicit parts. Note that the secondary variables are also explicit in the IMPES or
IMPSAT formulations, but here we work only on the primary system expressed by Eq.
(3.4). Also, there is no natural classification of "implicit" and "explicit" equations.
By "implicit" equations we mean the first $(N_{Imp})_i$ equations, where $(N_{Imp})_i$
is the number of implicit variables in grid block i. Correspondingly, “explicit” equa-
tions are the rest of the primary equations in that block. The following notation is
introduced for the second algebraic reduction and explicit update steps:
• AII – The derivatives of implicit equations with respect to implicit variables;
• AIE – The derivatives of implicit equations with respect to explicit variables;
• AEI – The derivatives of explicit equations with respect to implicit variables;
• AEE – The derivatives of explicit equations with respect to explicit variables;
• XI , XE – Updates for the implicit variables and for the explicit variables;
• FI , FE – RHS of the implicit equations and of the explicit equations.
Without loss of generality, we simply assume that blocks $i$ and $j$ both have at
least one primary explicit variable, such that the secondary variables in both blocks are
also explicit. With the correct explicit treatment of the nonlinear terms in the flux term,
we have $B_{i,j} = 0$ ($i \neq j$), i.e., the derivatives with respect to the (explicit)
secondary variables in off-diagonal blocks are zero. As a result, the off-diagonal terms are
kept at their original values after the first reduction step from the full to the primary system,
i.e., $A_{i,j} - B_{i,j}(D^{-1}C)_{j,j} = A_{i,j}$ and $A_{j,i} - B_{j,i}(D^{-1}C)_{i,i} = A_{j,i}$. Therefore
we also have $(A_{IE})_{i,j} = (A_{EE})_{i,j} = 0$ ($i \neq j$): the derivatives with respect to the primary
explicit variables in off-diagonal blocks are zero. Using the above notation, the
primary system obtained in Eq. (3.4) can be expressed as:
\[
\begin{bmatrix}
\begin{bmatrix} A_{II} & A_{IE} \\ A_{EI} & A_{EE} \end{bmatrix}_{i,i} & \cdots & \begin{bmatrix} A_{II} & 0 \\ A_{EI} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A_{II} & 0 \\ A_{EI} & 0 \end{bmatrix}_{j,i} & \cdots & \begin{bmatrix} A_{II} & A_{IE} \\ A_{EI} & A_{EE} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_I \\ X_E \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_I \\ X_E \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} F_I \\ F_E \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} F_I \\ F_E \end{bmatrix}_j \end{bmatrix}
\tag{3.10}
\]
For each block row $i$, subtract from the first row (with $A_{II}$ and $A_{IE}$ in the diagonal block) the second row (with $A_{EI}$ and $A_{EE}$ in the diagonal block) left-multiplied by $(A_{IE}A_{EE}^{-1})_{i,i}$. The primary system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})A_{EI} & 0 \\ A_{EI} & A_{EE} \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})_{i,i}A_{EI} & 0 \\ A_{EI} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})_{j,j}A_{EI} & 0 \\ A_{EI} & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A_{II} - (A_{IE}A_{EE}^{-1})A_{EI} & 0 \\ A_{EI} & A_{EE} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_I \\ X_E \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_I \\ X_E \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} F_I - (A_{IE}A_{EE}^{-1})_{i,i}F_E \\ F_E \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} F_I - (A_{IE}A_{EE}^{-1})_{j,j}F_E \\ F_E \end{bmatrix}_j
\end{bmatrix}
\tag{3.11}
\]
Then the implicit system, as shown below, is decoupled from the primary system by
taking the first row (implicit equations) and first column (implicit variables) out of
each block row and column:

\[
\begin{bmatrix}
\left(A_{II} - (A_{IE}A_{EE}^{-1})A_{EI}\right)_{i,i} & \cdots & (A_{II})_{i,j} - (A_{IE}A_{EE}^{-1})_{i,i}(A_{EI})_{i,j} \\
\vdots & \ddots & \vdots \\
(A_{II})_{j,i} - (A_{IE}A_{EE}^{-1})_{j,j}(A_{EI})_{j,i} & \cdots & \left(A_{II} - (A_{IE}A_{EE}^{-1})A_{EI}\right)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_I)_i \\ \vdots \\ (X_I)_j \end{bmatrix}
=
\begin{bmatrix}
(F_I)_i - (A_{IE}A_{EE}^{-1})_{i,i}(F_E)_i \\ \vdots \\ (F_I)_j - (A_{IE}A_{EE}^{-1})_{j,j}(F_E)_j
\end{bmatrix}
\tag{3.12}
\]
Repeating this process for each cell $j$ that is a neighbor of cell $i$ in the general
connection list, the diagonal part and the RHS of block row $i$ in the implicit system are
kept as $(A_{II} - (A_{IE}A_{EE}^{-1})A_{EI})_{i,i}$ and $(F_I)_i - (A_{IE}A_{EE}^{-1})_{i,i}(F_E)_i$, respectively, whereas
the off-diagonal entry in block row $i$ and block column $j$ is computed as
$(A_{II})_{i,j} - (A_{IE}A_{EE}^{-1})_{i,i}(A_{EI})_{i,j}$ for each neighboring cell $j$.
The above process can again be applied to any main-diagonal submatrix that
shares the same structure as JRR. In addition, the treatment for off-diagonal terms
needs to be applied to the coupling submatrices as well. For any block nonzero entry
(with explicit variables) of an off-diagonal block COO matrix in the primary system,
the entry takes the form:

\[
\begin{bmatrix} A_{II} & 0 \\ A_{EI} & 0 \end{bmatrix}_{row,col},
\tag{3.13}
\]

where the second row and column may or may not exist, depending on the corresponding
facility model of the block COO matrix.
Treating the block entry (3.13) as an off-diagonal block of the main-diagonal
matrix, we may reduce it to the following implicit block:

\[
(A_{II})_{row,col} - (A_{IE}A_{EE}^{-1})_{row,row}(A_{EI})_{row,col},
\tag{3.14}
\]

where $(A_{II})_{row,col}$ and $(A_{EI})_{row,col}$ are taken from the block entry (3.13), and $(A_{IE}A_{EE}^{-1})_{row,row}$
is taken from the diagonal part of the corresponding main-diagonal matrix in the same row.
The entire algebraic reduction process from the primary to the implicit system has now
been described. Correspondingly, after the implicit variable update $(X_I)_i$ ($i = 1, 2, \ldots, N_B$)
is obtained from the linear solution of the implicit system, the update of the primary
explicit variables can be computed as:

\[
(X_E)_i = (A_{EE})_{i,i}^{-1}\Bigl[(F_E)_i - (A_{EI})_{i,i}(X_I)_i - \sum_{j \in nbr(i)} (A_{EI})_{i,j}(X_I)_j\Bigr], \quad i = 1, 2, \ldots, N_B
\tag{3.15}
\]

where $nbr(i)$ includes not only all cells neighboring cell $i$ in the general connection
list of the main-diagonal matrix, but also the block nonzero entries in row $i$ (if any)
of the off-diagonal block COO matrices, as discussed above.
Combining the implicit update $(X_I)_i$ and the primary explicit update $(X_E)_i$ for each
grid block, we obtain the primary update $(X_p)_i$, which is then used to
compute the full-system update using Eq. (3.9).
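The same pattern applies one level up. The numpy sketch below performs the primary-to-implicit reduction of Eqs. (3.11)-(3.12) and then recovers the primary explicit variables via Eq. (3.15), including the neighbor contributions $(A_{EI})_{i,j}(X_I)_j$; the dense two-cell system is again illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
ni, ne = 2, 2                     # implicit and primary explicit variables per cell
n = ni + ne

# Two-cell primary system in the AIM form of Eq. (3.10): off-diagonal blocks
# carry derivatives with respect to the implicit variables only.
A = np.zeros((2 * n, 2 * n))
for c in range(2):
    A[c*n:(c+1)*n, c*n:(c+1)*n] = rng.standard_normal((n, n)) + 4.0 * np.eye(n)
A[0:n, n:n+ni] = rng.standard_normal((n, ni))     # (A_II; A_EI) at block (i, j)
A[n:2*n, 0:ni] = rng.standard_normal((n, ni))     # (A_II; A_EI) at block (j, i)
F = rng.standard_normal(2 * n)

# Reduction to the implicit system, Eqs. (3.11)-(3.12).
W = []                                            # (A_IE A_EE^{-1}) per cell
for c in range(2):
    AIE = A[c*n:c*n+ni, c*n+ni:(c+1)*n]
    AEE = A[c*n+ni:(c+1)*n, c*n+ni:(c+1)*n]
    W.append(AIE @ np.linalg.inv(AEE))
AImp = np.zeros((2 * ni, 2 * ni))
FImp = np.zeros(2 * ni)
for r in range(2):
    for c in range(2):
        AII = A[r*n:r*n+ni, c*n:c*n+ni]
        AEI = A[r*n+ni:(r+1)*n, c*n:c*n+ni]
        AImp[r*ni:(r+1)*ni, c*ni:(c+1)*ni] = AII - W[r] @ AEI
    FImp[r*ni:(r+1)*ni] = F[r*n:r*n+ni] - W[r] @ F[r*n+ni:(r+1)*n]
XI = np.linalg.solve(AImp, FImp)

# Explicit update of the primary explicit variables, Eq. (3.15), including
# the neighbor contributions (A_EI)_{i,j} (X_I)_j.
X = np.zeros(2 * n)
for r in range(2):
    X[r*n:r*n+ni] = XI[r*ni:(r+1)*ni]
    rhs = F[r*n+ni:(r+1)*n].copy()
    for c in range(2):
        rhs -= A[r*n+ni:(r+1)*n, c*n:c*n+ni] @ XI[c*ni:(c+1)*ni]
    AEE = A[r*n+ni:(r+1)*n, r*n+ni:(r+1)*n]
    X[r*n+ni:(r+1)*n] = np.linalg.solve(AEE, rhs)

err = np.max(np.abs(X - np.linalg.solve(A, F)))
```

Since the explicit-variable columns of the off-diagonal blocks are zero by assumption, this second-level reduction is also exact, and the recovered full update matches a direct solve.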
3.4.4 Preconditioned Linear Solver
After obtaining the implicit linear system (3.12), we need to solve it with an efficient
linear solver. Although the implicit system is already much smaller than the full system,
it may still contain a large number of unknowns, e.g., on the order of $10^6$ or more.
Large, sparse linear systems in reservoir simulation are typically solved with
preconditioned iterative Krylov solvers. The Generalized Minimal Residual (GMRES)
solver [67, 68] is one of the most widely used Krylov subspace solvers in our community.
Due to the superior performance it offers in conjunction with an effective preconditioner,
GMRES is used as the default solver option in AD-GPRS. Here a brief introduction to
GMRES is given.
GMRES is an iterative algorithm for solving a linear system of equations of the form
$Ax = b$, where $A$ is a given sparse invertible matrix, $b$ is a given right-hand-side
vector, and $x$ is the unknown vector to be computed by the algorithm. GMRES places
no requirement on the storage format of the matrix $A$ or the vectors $x$ and $b$, provided
that the SpMV and preconditioner solution operations are well defined on the matrix and
vector types, and that the BLAS operations (copy, scal, axpy, dot, etc.) are well
defined on the vector type. Therefore, GMRES works naturally with the MLBS
matrix format described in Section 3.3. The preconditioned GMRES algorithm with the
restart option is listed in Algorithm 3.1.
The GMRES algorithm guarantees that the residual norm ($\|b - Ax\|_2$) decreases
monotonically within each restart cycle (i.e., for $i = 0, 1, \ldots, m-1$, where $i$ denotes the
iteration number within the current restart cycle, and $m$ denotes the total number
of iterations per restart cycle). For an $N \times N$ matrix, the algorithm takes at most $N$
steps to converge if it never restarts. In practice, $N$ is usually quite large, and the
algorithm has to restart due to the accumulation of numerical round-off error and the
limited memory available (to store $H_m$, $v_i$, and $w'_i$). With a good preconditioner,
GMRES usually takes only a small number of iterations ($\ll N$) to converge. It is
therefore quite important to devise an effective preconditioning strategy, as described
in the next section.
3.4.5 The Two-Stage Preconditioning Strategy
The performance of Krylov solvers depends strongly on the quality of the preconditioner.
Some preconditioners, e.g., the ILU family [11, 66], can be applied to many types of
matrices, but their effectiveness varies from matrix to matrix. On the other hand,
some preconditioners, e.g., AMG [76, 77], place strict requirements on the matrices to
be solved, but can be very effective when those requirements are satisfied.
The equations and unknowns in reservoir simulation have mixed properties. They
Algorithm 3.1 Preconditioned GMRES with restart option
 1: Given the initial guess x0, compute r = b − A x0 and β = ||r||_2
 2: Define the (m+1) × m matrix H_m = {h_kl}, 0 ≤ k ≤ m, 0 ≤ l ≤ m−1 (m: total number of iterations for each restart cycle)
 3: Let j = 1 (j: current number of iterations)
 4: while j ≤ j_max do
 5:   Compute v_0 = r/β
 6:   Let s_0 = β, i = 0 (i: number of iterations in the current restart cycle)
 7:   while i < m and j ≤ j_max do
 8:     Compute w'_i = M^{-1} v_i (preconditioner solution)
 9:     Compute w = A w'_i
10:     for k = 0, 1, ..., i do
11:       h_ki := (w, v_k)
12:       w := w − h_ki v_k
13:     end for
14:     h_{i+1,i} = ||w||_2
15:     v_{i+1} = w / h_{i+1,i}
16:     Transform H_m into upper triangular form using plane rotations, in order to solve the minimization problem ||β e_1 − H_m y||_2
17:     Update s_i in the corresponding RHS s = {s_k}, 0 ≤ k ≤ i
18:     Compute s_{i+1} as the residual of the minimization problem
19:     if s_{i+1}/||b||_2 < ε then
20:       Compute y = H_m^{-1} s and update x as x = x + Σ_{k=0}^{i} y_k w'_k
21:       Exit the algorithm with convergence
22:     end if
23:     Let i = i + 1, j = j + 1
24:   end while
25:   Compute y = H_m^{-1} s and update x as x = x + Σ_{k=0}^{m−1} y_k w'_k
26:   Compute r = b − A x, β = ||r||_2
27:   if β/||b||_2 < ε then
28:     Exit the algorithm with convergence
29:   end if
30: end while
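A compact numpy sketch of the algorithm is given below. It keeps the right-preconditioned structure of Algorithm 3.1 (each $w'_i = M^{-1}v_i$ is stored so that the final update $x \leftarrow x + \sum_k y_k w'_k$ requires no extra preconditioner application), but, for brevity, it solves the small minimization problem once per restart cycle with a least-squares solve instead of incrementally with plane rotations:

```python
import numpy as np

def gmres_right(A, b, Minv, x0=None, m=20, tol=1e-10, max_restarts=50):
    """Right-preconditioned GMRES with restarts (structure of Algorithm 3.1).

    The small problem ||beta*e1 - H y||_2 is solved once per restart cycle
    with lstsq instead of incrementally with plane rotations.
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    bnorm = np.linalg.norm(b)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta / bnorm < tol:
            break
        V = np.zeros((n, m + 1))        # Arnoldi basis v_0 .. v_m
        Wp = np.zeros((n, m))           # preconditioned vectors w'_i = M^{-1} v_i
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        k = m
        for i in range(m):
            Wp[:, i] = Minv(V[:, i])    # step 8: preconditioner solution
            w = A @ Wp[:, i]            # step 9
            for j in range(i + 1):      # steps 10-13: modified Gram-Schmidt
                H[j, i] = w @ V[:, j]
                w -= H[j, i] * V[:, j]
            H[i + 1, i] = np.linalg.norm(w)
            if H[i + 1, i] < 1e-14:     # happy breakdown: solution found exactly
                k = i + 1
                break
            V[:, i + 1] = w / H[i + 1, i]
        e1 = np.zeros(k + 1)
        e1[0] = beta
        y = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)[0]
        x = x + Wp[:, :k] @ y           # x += sum_k y_k w'_k
    return x

# Usage: a diagonally dominant system with a Jacobi (diagonal) preconditioner.
rng = np.random.default_rng(2)
n = 60
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = gmres_right(A, b, Minv=lambda v: v / np.diag(A), m=15)
res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
```

With this diagonally dominant test matrix and the simple Jacobi preconditioner, the relative residual drops below the tolerance within a few restart cycles.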
have both near-elliptic (pressure equations) and near-hyperbolic (advection equations)
parts. In order to handle this mixed character, a powerful two-stage preconditioner,
the Constrained Pressure Residual (CPR) method, was proposed by
Wallis [88] and Wallis et al. [89]. In the first stage, the pressure system is obtained
from the fully coupled Jacobian using an efficient algebraic reduction scheme, which
mimics the steps associated with constructing the pressure equation of the IMPES
formulation. This pressure system preserves the coupling of the reservoir-wells system
and is solved with an AMG solver. The low frequency errors associated with the pres-
sure variables are resolved in this stage. In the second stage, an ILU preconditioner is
usually applied to the full Jacobian matrix, which removes the high frequency (local)
errors. The general form of the two-stage preconditioner can be written as in [17]:

\[
M_{1,2}^{-1} = M_2^{-1}\left[I - A M_1^{-1}\right] + M_1^{-1},
\tag{3.16}
\]

where $M_{1,2}$ denotes the two-stage scheme; $M_1$ and $M_2$ denote the first- and second-stage
preconditioners, respectively; $I$ is the identity matrix; and $A$ is the matrix to be solved.
Note that we never explicitly calculate the inverse of $M_1$ or $M_2$. Instead, we
always calculate $M_i^{-1} r$ ($i = 1, 2$) for any given RHS vector $r$ (e.g., see step 8 in
Algorithm 3.1). The preconditioner at either stage should be able to calculate the
approximate solution in an efficient manner. Otherwise, even with good convergence
behavior, the total cost of the linear solver will be too high and the preconditioner
loses its advantage.
The two-stage preconditioner is applied as follows:
1. Apply the first-stage preconditioner to compute $x_1 = M_1^{-1} r$, where $r$ is the
given RHS and $x_1$ is the first-stage solution vector.

2. Compute $r_2 = r - A x_1$, which becomes the updated RHS for the second stage.

3. Apply the second-stage preconditioner to compute $x_2 = M_2^{-1} r_2$.

4. Compute $x = x_1 + x_2$ as the overall update. This is consistent with the definition
of the two-stage preconditioner in Eq. (3.16), because we have:

\[
\begin{aligned}
x = M_{1,2}^{-1} r &= \left(M_2^{-1}\left[I - A M_1^{-1}\right] + M_1^{-1}\right) r \\
&= M_2^{-1}\left[I - A M_1^{-1}\right] r + M_1^{-1} r \\
&= M_2^{-1}\left[r - A x_1\right] + x_1 \\
&= M_2^{-1} r_2 + x_1 \\
&= x_2 + x_1
\end{aligned}
\tag{3.17}
\]
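The equivalence asserted in Eq. (3.17) is easy to verify numerically. In the sketch below, the two stages are stood in for by simple matrix splittings (a diagonal matrix for $M_1$ and a lower-triangular one for $M_2$); these are illustrative placeholders, not the actual AMG and ILU stages:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
A = rng.standard_normal((n, n)) + n * np.eye(n)
r = rng.standard_normal(n)

# Illustrative stand-ins for the two stages (NOT the actual AMG/ILU stages):
# a diagonal splitting for M1 and a lower-triangular splitting for M2.
M1 = np.diag(np.diag(A))
M2 = np.tril(A)

# Two-step application (steps 1-4 above).
x1 = np.linalg.solve(M1, r)
r2 = r - A @ x1
x2 = np.linalg.solve(M2, r2)
x_steps = x1 + x2

# Closed form of Eq. (3.16): M_{1,2}^{-1} = M2^{-1}[I - A M1^{-1}] + M1^{-1}.
M1inv = np.linalg.inv(M1)
M2inv = np.linalg.inv(M2)
x_formula = (M2inv @ (np.eye(n) - A @ M1inv) + M1inv) @ r

err = np.max(np.abs(x_steps - x_formula))
```

The two-step procedure never forms any inverse explicitly, which is exactly why it is the form used in practice.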
With highly tuned preconditioners for each of the two stages as components,
CPR has been demonstrated to be a very efficient preconditioner for reservoir simu-
lation with standard wells (see, for example, [15,17,37,38]). More recently, CPR has
been extended to handle complex unstructured models with advanced multisegment
wells [38] and well groups [96,98]. The extension to handle coupled reservoir-facilities
simulation with general multisegment wells is discussed in Section 4.7. Here we give a
detailed description of the important components of the CPR method.
First stage: pressure system
In the first stage of CPR, a pressure system needs to be constructed from the implicit
system in reservoir simulation. As mentioned in [38], two alternative approaches to
obtain the pressure system can be considered:
• Nonlinear construction of the pressure system. The IMPES conservation equa-
tions can be formed by first performing the nonlinear treatment (explicit treat-
ment of ρp, λp, xcp and partially implicit treatment of γp) and then adding up
the mass conservation equations of all components in each cell. Then we obtain
the pressure system with one equation (the total mass conservation equation)
and one variable (pressure) per cell. This approach incurs a high construction
cost and introduces strong coupling between the implementation on the nonlinear level
and that on the linear level.
• Algebraic reduction from the implicit system. With certain assumptions (true-
IMPES or quasi-IMPES), the pressure linear system can be algebraically re-
duced from the implicit system derived in Sections 3.4.2 and 3.4.3. The reduced
linear system contains only one element per block and essentially becomes
pointwise. This operation is performed purely on the linear level, and its cost
is much lower than that of the former approach, given that the implicit linear
system has already been obtained prior to the setup stage of the preconditioner.
As reported in [38], the pressure systems constructed by the two approaches are quite
similar. Due to its low cost and decoupled implementation, algebraic reduction
is the desired approach to obtain the pressure system. As mentioned above, two
options based on different assumptions, true-IMPES and quasi-IMPES, can be used
to perform the reduction from the implicit matrix system to an IMPES-like matrix
system.
First we further divide the implicit equations and variables into pressure and
other (implicit) parts. The starting point here is the implicit system expressed by
Eq. (3.12). Note that our target is the pressure equation, but it is not naturally
generated in an implicit system. Thus, in the implicit system, we use “pressure”
equation simply to denote the first equation (corresponding to the pressure variable,
which is the first variable), and use “other” implicit equations to denote the rest of
the implicit equations (see the definition at the beginning of Section 3.4.3) in that
block. The following notation is introduced for the true- and quasi- IMPES reduction:
• App, Fpp – The derivatives of the accumulation term and of the flux term
in the pressure equation with respect to the pressure variable;
• Apo, Fpo – The derivatives of the accumulation and flux terms in the pressure
equation with respect to the other implicit variables;
• Aop, Fop – The derivatives of the accumulation and flux terms in the other
implicit equations with respect to the pressure variable;
• Aoo, Foo – The derivatives of the accumulation and flux terms in the other
implicit equations with respect to the other implicit variables;
• Xp, Xo – Updates for the pressure variables and for the other implicit variables;
• Rp, Ro – RHS of the pressure equations and of the other implicit equations.
Without loss of generality, we simply assume that blocks $i$ and $j$ both have at least
one other implicit variable. Using the above notation, the implicit system obtained
in Eq. (3.12) can be expressed as:
\[
\begin{bmatrix}
\begin{bmatrix} A_{pp}+F_{pp} & A_{po}+F_{po} \\ A_{op}+F_{op} & A_{oo}+F_{oo} \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} & F_{po} \\ F_{op} & F_{oo} \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} & F_{po} \\ F_{op} & F_{oo} \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A_{pp}+F_{pp} & A_{po}+F_{po} \\ A_{op}+F_{op} & A_{oo}+F_{oo} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} R_p \\ R_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} R_p \\ R_o \end{bmatrix}_j \end{bmatrix}.
\tag{3.18}
\]
We may observe that the diagonal entries contain contributions from both the accumulation
and flux terms, whereas the off-diagonal entries contain only the contribution
from the flux term.
True-IMPES reduction. The assumption of the true-IMPES reduction is to treat
all variables other than pressure in the flux terms explicitly. For preconditioning
purposes, however, the values from the last iteration, instead of those from the last
timestep, are used for these variables. This is the major difference between the pressure
system obtained by the true-IMPES reduction and that of the IMPES formulation, where
the values of the variables other than pressure are fixed at the last timestep. To achieve
this treatment purely on the linear level, we may apply a column sum to the derivatives
with respect to the other implicit variables. That is, for each off-diagonal block $(i,j)$, we
add its $F_{po}$ and $F_{oo}$ parts to the corresponding parts of the diagonal block in the
same block column, i.e., $(j,j)$. Because each flux term $(F_c)_{i,j}$ in block row $i$ has
a corresponding flux term $(F_c)_{j,i}$ in block row $j$ with the same absolute value but
opposite sign (i.e., $(F_c)_{j,i} = -(F_c)_{i,j}$), we have:
\[
(F_{po})_{j,j} = -\sum_{i \in nbr(j)} (F_{po})_{i,j}, \qquad
(F_{oo})_{j,j} = -\sum_{i \in nbr(j)} (F_{oo})_{i,j}
\tag{3.19}
\]
Then, by summing the derivatives with respect to the other implicit variables in
all off-diagonal blocks with the same block column into the corresponding places of
the diagonal block in that block column, the $F_{po}$ and $F_{oo}$ terms in the diagonal block
cancel out. Afterwards, we may simply ignore the $F_{po}$ and $F_{oo}$ parts in the
off-diagonal blocks (note that we do not need to manually fill them with zeros; we
just do not use them in the subsequent steps of the reduction process). The implicit
system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} A_{pp}+F_{pp} & A_{po} \\ A_{op}+F_{op} & A_{oo} \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} & 0 \\ F_{op} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} & 0 \\ F_{op} & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} A_{pp}+F_{pp} & A_{po} \\ A_{op}+F_{op} & A_{oo} \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix} \begin{bmatrix} R_p \\ R_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} R_p \\ R_o \end{bmatrix}_j \end{bmatrix}.
\tag{3.20}
\]
Next, we apply the Schur-complement process. That is, for each block row $i$, first
left-multiply the second row (with $A_{op}+F_{op}$ and $A_{oo}$ in the diagonal block) by
$(A_{oo})_{i,i}^{-1}$, and then subtract from the first row (with $A_{pp}+F_{pp}$ and $A_{po}$
in the diagonal block) the resultant second row (with $A_{oo}^{-1}(A_{op}+F_{op})$ and $I$ in
the diagonal block) left-multiplied by $(A_{po})_{i,i}$. Combining $A$ and $F$ into $J$, the
system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} J_{pp} - (A_{po}A_{oo}^{-1})J_{op} & 0 \\ A_{oo}^{-1}J_{op} & I \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} - (A_{po}A_{oo}^{-1})_{i,i}F_{op} & 0 \\ (A_{oo})_{i,i}^{-1}F_{op} & 0 \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} - (A_{po}A_{oo}^{-1})_{j,j}F_{op} & 0 \\ (A_{oo})_{j,j}^{-1}F_{op} & 0 \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} J_{pp} - (A_{po}A_{oo}^{-1})J_{op} & 0 \\ A_{oo}^{-1}J_{op} & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} R_p - (A_{po}A_{oo}^{-1})_{i,i}R_o \\ (A_{oo})_{i,i}^{-1}R_o \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} R_p - (A_{po}A_{oo}^{-1})_{j,j}R_o \\ (A_{oo})_{j,j}^{-1}R_o \end{bmatrix}_j
\end{bmatrix}.
\tag{3.21}
\]
Then the pressure system, as shown below, is decoupled from the implicit system
by taking the first row (pressure equations) and first column (pressure variables) out
of each block row and column:

\[
\begin{bmatrix}
\left(J_{pp} - (A_{po}A_{oo}^{-1})J_{op}\right)_{i,i} & \cdots & (F_{pp})_{i,j} - (A_{po}A_{oo}^{-1})_{i,i}(F_{op})_{i,j} \\
\vdots & \ddots & \vdots \\
(F_{pp})_{j,i} - (A_{po}A_{oo}^{-1})_{j,j}(F_{op})_{j,i} & \cdots & \left(J_{pp} - (A_{po}A_{oo}^{-1})J_{op}\right)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_p)_i \\ \vdots \\ (X_p)_j \end{bmatrix}
=
\begin{bmatrix}
(R_p)_i - (A_{po}A_{oo}^{-1})_{i,i}(R_o)_i \\ \vdots \\ (R_p)_j - (A_{po}A_{oo}^{-1})_{j,j}(R_o)_j
\end{bmatrix}.
\tag{3.22}
\]
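The column-sum lumping of Eq. (3.19) and the subsequent Schur complement can be sketched for a two-cell system with one pressure and one other implicit variable per cell. The sign pattern of the flux blocks below encodes the single-connection identity $(F_c)_{j,i} = -(F_c)_{i,j}$; the random values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
nb = 2      # one pressure + one "other" implicit variable per cell, two cells

# Accumulation derivatives (diagonal blocks only) and the derivative of the
# single flux term between the two cells; the sign pattern makes each block
# column satisfy the antisymmetry behind Eq. (3.19).
Acc = [rng.standard_normal((nb, nb)) + 4.0 * np.eye(nb) for _ in range(2)]
dF = rng.standard_normal((nb, nb))
Fblk = {(0, 0): -dF, (0, 1): dF, (1, 0): dF, (1, 1): -dF}

J = np.zeros((2 * nb, 2 * nb))
for r in range(2):
    for c in range(2):
        J[r*nb:(r+1)*nb, c*nb:(c+1)*nb] = Fblk[(r, c)]
    J[r*nb:(r+1)*nb, r*nb:(r+1)*nb] += Acc[r]

# Column-sum lumping (Eq. 3.19): add the "other"-variable flux column of each
# off-diagonal block into the diagonal block of the same block column.
Jt = J.copy()
for (r, c) in [(0, 1), (1, 0)]:
    Jt[c*nb:(c+1)*nb, c*nb+1] += Jt[r*nb:(r+1)*nb, c*nb+1]
    Jt[r*nb:(r+1)*nb, c*nb+1] = 0.0
# After lumping, the diagonal "other" columns contain accumulation terms only.
lumped_ok = np.allclose(Jt[0:nb, 1], Acc[0][:, 1]) and \
            np.allclose(Jt[nb:2*nb, nb+1], Acc[1][:, 1])

# Schur complement to one pressure unknown per cell (Eqs. 3.20-3.22).
Ap = np.zeros((2, 2))
for r in range(2):
    w = Jt[r*nb, r*nb+1] / Jt[r*nb+1, r*nb+1]     # (A_po A_oo^{-1})_{r,r}
    for c in range(2):
        Ap[r, c] = Jt[r*nb, c*nb] - w * Jt[r*nb+1, c*nb]
```

The cancellation of the flux derivatives in the diagonal "other" columns is exactly why the lumped weights reduce to accumulation-only quantities $(A_{po}A_{oo}^{-1})$.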
Now we consider the treatment of standard wells. First, each entry in a
reservoir-well submatrix JRW,i takes the form:

\[
\begin{bmatrix} J_{pp} \\ J_{op} \end{bmatrix}_{row,col}
\tag{3.23}
\]
We treat it as an off-diagonal block with no derivatives with respect to the other implicit
variables. Therefore, it can be reduced to:

\[
(J_{pp})_{row,col} - (A_{po}A_{oo}^{-1})_{row,row}(J_{op})_{row,col},
\tag{3.24}
\]

where $(J_{pp})_{row,col}$ and $(J_{op})_{row,col}$ are taken from the JRW,i matrix, and $(A_{po}A_{oo}^{-1})_{row,row}$
is taken from the reservoir matrix.
Second, each entry in a well-reservoir submatrix JWR,i takes the form:

\[
\begin{bmatrix} J_{pp} & J_{po} \end{bmatrix}_{row,col}
\tag{3.25}
\]

Similarly, we treat it as an off-diagonal block with no other implicit equations. Therefore,
by ignoring the derivatives with respect to the other implicit variables in the explicit
treatment, we simply reduce it to $(J_{pp})_{row,col}$.
Third, there is only one entry in a well-well submatrix JWW,i: $(J_{pp})_{row,row}$. We
can keep it directly in the reduced system.
Quasi-IMPES reduction. In the quasi-IMPES reduction, we apply the Schur-complement
process directly to the implicit system in the form (3.18). Here we also combine
$A$ and $F$ into $J$. For each block row $i$, first left-multiply the second row (with $J_{op}$
and $J_{oo}$ in the diagonal block) by $(J_{oo})_{i,i}^{-1}$, and then subtract from the first row
(with $J_{pp}$ and $J_{po}$ in the diagonal block) the resultant second row (with $J_{oo}^{-1}J_{op}$
and $I$ in the diagonal block) left-multiplied by $(J_{po})_{i,i}$. The system becomes:
\[
\begin{bmatrix}
\begin{bmatrix} J_{pp} - (J_{po}J_{oo}^{-1})J_{op} & 0 \\ J_{oo}^{-1}J_{op} & I \end{bmatrix}_{i,i} & \cdots &
\begin{bmatrix} F_{pp} - (J_{po}J_{oo}^{-1})_{i,i}F_{op} & F^* \\ (J_{oo})_{i,i}^{-1}F_{op} & (J_{oo})_{i,i}^{-1}F_{oo} \end{bmatrix}_{i,j} \\
\vdots & \ddots & \vdots \\
\begin{bmatrix} F_{pp} - (J_{po}J_{oo}^{-1})_{j,j}F_{op} & F^* \\ (J_{oo})_{j,j}^{-1}F_{op} & (J_{oo})_{j,j}^{-1}F_{oo} \end{bmatrix}_{j,i} & \cdots &
\begin{bmatrix} J_{pp} - (J_{po}J_{oo}^{-1})J_{op} & 0 \\ J_{oo}^{-1}J_{op} & I \end{bmatrix}_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} \begin{bmatrix} X_p \\ X_o \end{bmatrix}_i \\ \vdots \\ \begin{bmatrix} X_p \\ X_o \end{bmatrix}_j \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix} R_p - (J_{po}J_{oo}^{-1})_{i,i}R_o \\ (J_{oo})_{i,i}^{-1}R_o \end{bmatrix}_i \\ \vdots \\
\begin{bmatrix} R_p - (J_{po}J_{oo}^{-1})_{j,j}R_o \\ (J_{oo})_{j,j}^{-1}R_o \end{bmatrix}_j
\end{bmatrix},
\tag{3.26}
\]
where $F^*_{i,j}$ can be computed as:

\[
F^*_{i,j} = (F_{po})_{i,j} - (J_{po}J_{oo}^{-1})_{i,i}(F_{oo})_{i,j}
\tag{3.27}
\]
The assumption of the quasi-IMPES reduction is that $F^* = 0$ (the derivatives of the
pressure equation with respect to the other implicit variables are ignored after the
Schur-complement process), such that Eq. (3.27) need not be computed in the
simulation. Under this assumption, we obtain the following pressure system, decoupled
from the implicit system by taking the first row (pressure equations) and
first column (pressure variables) out of each block row and column:
\[
\begin{bmatrix}
\left(J_{pp} - (J_{po}J_{oo}^{-1})J_{op}\right)_{i,i} & \cdots & (F_{pp})_{i,j} - (J_{po}J_{oo}^{-1})_{i,i}(F_{op})_{i,j} \\
\vdots & \ddots & \vdots \\
(F_{pp})_{j,i} - (J_{po}J_{oo}^{-1})_{j,j}(F_{op})_{j,i} & \cdots & \left(J_{pp} - (J_{po}J_{oo}^{-1})J_{op}\right)_{j,j}
\end{bmatrix}
\cdot
\begin{bmatrix} (X_p)_i \\ \vdots \\ (X_p)_j \end{bmatrix}
=
\begin{bmatrix}
(R_p)_i - (J_{po}J_{oo}^{-1})_{i,i}(R_o)_i \\ \vdots \\ (R_p)_j - (J_{po}J_{oo}^{-1})_{j,j}(R_o)_j
\end{bmatrix}.
\tag{3.28}
\]
The standard wells can be treated in a similar way as explained for the true-IMPES
reduction, except that each entry in JRW,i is reduced to:

\[
(J_{pp})_{row,col} - (J_{po}J_{oo}^{-1})_{row,row}(J_{op})_{row,col},
\tag{3.29}
\]

instead of the term given by Eq. (3.24).
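A small sketch makes the relation between the two schemes concrete in one special case: when the flux term carries no derivatives with respect to the other implicit variables ($F_{po} = F_{oo} = 0$), neither scheme drops anything, and the two reductions coincide. The dense per-cell assembly below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
nb = 3        # pressure + two other implicit variables per cell, two cells

# Diagonal blocks J = Acc + Flux; off-diagonal blocks carry Flux only.
Acc = [rng.standard_normal((nb, nb)) + 4.0 * np.eye(nb) for _ in range(2)]
Flux = {k: rng.standard_normal((nb, nb)) for k in [(0, 0), (0, 1), (1, 0), (1, 1)]}
for k in Flux:                    # no flux derivatives w.r.t. other variables:
    Flux[k][:, 1:] = 0.0          # F_po = F_oo = 0 everywhere

def pressure_matrix(quasi):
    """Reduce to one pressure unknown per cell (Eqs. 3.22 / 3.28)."""
    Ap = np.zeros((2, 2))
    for r in range(2):
        Jd = Acc[r] + Flux[(r, r)]
        if quasi:     # weights (J_po J_oo^{-1}) from the full diagonal block
            w = np.linalg.solve(Jd[1:, 1:].T, Jd[0, 1:])
        else:         # true-IMPES weights (A_po A_oo^{-1}); the column-sum
            w = np.linalg.solve(Acc[r][1:, 1:].T, Acc[r][0, 1:])  # lumping adds zeros here
        for c in range(2):
            Jrc = Jd if r == c else Flux[(r, c)]
            Ap[r, c] = Jrc[0, 0] - w @ Jrc[1:, 0]
    return Ap

gap = np.max(np.abs(pressure_matrix(quasi=False) - pressure_matrix(quasi=True)))
```

When the flux does depend on the other implicit variables, the two schemes drop different terms and the two pressure matrices differ, which is where the practical comparison in [38] applies.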
Comments on the two reduction schemes. As described above and mentioned
in [38], the two reduction schemes take similar steps in different orders. True-IMPES
first ignores the derivatives of the flux term with respect to the other implicit variables
and then performs the Schur-complement reduction, whereas quasi-IMPES first performs
the Schur-complement reduction and then ignores the derivatives of the resultant flux
term with respect to the other implicit variables. The derivatives dropped in the
true-IMPES reduction correspond to an explicit treatment (a one-iteration lag) of all
variables other than pressure in the flux term, whereas those dropped in the quasi-IMPES
reduction have no clear physical meaning. Also, as reported in [38], the true-IMPES
reduction has been demonstrated, over a large number of simulations, to be consistently
better than the quasi-IMPES reduction.
Solution of the pressure system. With the previously described treatment of
standard wells during the true-IMPES reduction, we obtain an augmented
pressure linear system of the following form:

\[
\begin{bmatrix} A_{p,RR} & A_{p,RW} \\ A_{p,WR} & A_{p,WW} \end{bmatrix}
\cdot
\begin{bmatrix} x_{p,R} \\ x_{p,W} \end{bmatrix}
=
\begin{bmatrix} b_{p,R} \\ b_{p,W} \end{bmatrix},
\tag{3.30}
\]
where R stands for reservoir and W stands for well. The four parts $A_{p,RR}$, $A_{p,RW}$,
$A_{p,WR}$, and $A_{p,WW}$ in the pressure matrix $A_p$ are all pointwise sparse ($A_{p,WW}$ has a
diagonal structure) and contain only the derivatives of the pressure (or well constraint)
equations with respect to the pressure (or well BHP) variables. This pressure system
(in CSR format, see the introduction in Section 3.2) can be solved directly by the
AMG or SAMG preconditioner, provided that the following criteria are met: 1) the
diagonal element is always ordered as the first element in each row, 2) all diagonal
elements are positive, and 3) the row_ptr and col_ind arrays are both 1-based.
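As an illustration, here is a minimal sketch (a hypothetical helper, not part of AD-GPRS) of massaging a 0-based CSR matrix so that the diagonal comes first in each row and both index arrays become 1-based:

```python
# Reorder a 0-based CSR matrix so each row starts with its diagonal entry,
# then shift row_ptr and col_ind to 1-based (Fortran-style) numbering.
def to_samg_csr(row_ptr, col_ind, val, n):
    col_ind = list(col_ind)
    val = list(val)
    for i in range(n):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        for k in range(lo, hi):
            if col_ind[k] == i:            # found the diagonal entry of row i
                col_ind[lo], col_ind[k] = col_ind[k], col_ind[lo]
                val[lo], val[k] = val[k], val[lo]
                break
    return [p + 1 for p in row_ptr], [c + 1 for c in col_ind], val
```

For the 2x2 matrix [[4, 1], [2, 3]] stored with the diagonal out of order in the first row, the helper swaps the diagonal to the front of each row and shifts the indices up by one.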
An alternative strategy is to derive the Schur-complement pressure system as:
\[ \left(A_{p,RR} - A_{p,RW}A_{p,WW}^{-1}A_{p,WR}\right) \cdot x_{p,R} = b_{p,R} - A_{p,RW}A_{p,WW}^{-1}b_{p,W}. \qquad (3.31) \]
After solving the above linear system for $x_{p,R}$, we can explicitly obtain $x_{p,W}$ as
\[ x_{p,W} = A_{p,WW}^{-1}\left(b_{p,W} - A_{p,WR}\,x_{p,R}\right). \qquad (3.32) \]
Here the Schur-complement pressure matrix $A_p^{schur} = A_{p,RR} - A_{p,RW}A_{p,WW}^{-1}A_{p,WR}$
may not have the same structure as the $A_{p,RR}$ part of the original $A_p$ matrix when
there are wells with multiple perforations that are not directly connected. Thus,
dynamically changing the structure of the pressure matrix after the true-IMPES
reduction would incur a substantially higher cost. However, the structure of the Schur-
complement pressure matrix $A_p^{schur}$ is fixed as long as the well configuration (the number of
wells and the perforations of each well) does not change. For practical implementation,
the structure of $A_p^{schur}$ is predetermined at the beginning of the simulation and will
only need to be reconstructed when the well configuration changes.
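The two-step strategy of Eqs. (3.31)-(3.32) can be sketched with dense NumPy arrays (illustrative sizes and random values; a real implementation works with sparse formats and never forms explicit inverses):

```python
import numpy as np

rng = np.random.default_rng(0)
nR, nW = 6, 2                                        # reservoir cells, wells

A_RR = np.eye(nR) * 4 + rng.random((nR, nR)) * 0.1   # reservoir pressure block
A_RW = rng.random((nR, nW)) * 0.1                    # reservoir-well coupling
A_WR = rng.random((nW, nR)) * 0.1                    # well-reservoir coupling
A_WW = np.diag(rng.random(nW) + 1.0)                 # diagonal well block
b_R = rng.random(nR)
b_W = rng.random(nW)

# Eq. (3.31): eliminate the well unknowns and solve for reservoir pressures
A_schur = A_RR - A_RW @ np.linalg.inv(A_WW) @ A_WR
x_R = np.linalg.solve(A_schur, b_R - A_RW @ np.linalg.inv(A_WW) @ b_W)

# Eq. (3.32): recover the well pressures explicitly
x_W = np.linalg.inv(A_WW) @ (b_W - A_WR @ x_R)
```

Because `A_WW` is diagonal, its "inverse" is trivial in practice, which is what makes this elimination cheap.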
No matter which pressure system (augmented or Schur-complement) we solve, the
reduction process can be expressed by the following equation:
\[ A_p = R \cdot A^{Imp} \cdot P, \qquad (3.33) \]
where $A^{Imp}$ is the implicit matrix in Eq. (3.12), $R$ is the restriction matrix, and $P$
is the prolongation matrix. If we express the original implicit linear system as:
\[ A^{Imp} \cdot x^{Imp} = b^{Imp}, \qquad (3.34) \]
then by applying the restriction operator to the left of both sides, and letting $x^{Imp} = P \cdot x_p$ and $b_p = R \cdot b^{Imp}$, we have:
\[ (R \cdot A^{Imp} \cdot P) \cdot x_p = R \cdot b^{Imp}, \qquad (3.35) \]
\[ A_p \cdot x_p = b_p. \qquad (3.36) \]
Eq. (3.36) is the pressure system we solve, which can correspond to either Eq.
(3.30) or Eq. (3.31). The system is expected to be near-elliptic and thus can be solved
effectively by multigrid-type solvers, e.g., AMG and SAMG. Usually a single V-cycle
of AMG or SAMG per linear iteration is accurate enough for preconditioning
purposes. For challenging problems, such as unstructured-grid problems with strong
anisotropy, SAMG can be more stable than AMG.
After the solution of the pressure system, we apply the prolongation operator $P$ to
the pressure solution $x_p$, i.e., $x_1 = P \cdot x_p$. Note that we usually do not perform an explicit
update of the other implicit variables. That is, we distribute the pressure solution to
the update of the pressure variable in all reservoir cells and all wells, and pad zero values
for the update of the other implicit variables during the first stage of CPR. The updates
to the other implicit variables are computed in the second stage. Also note that the
reduction of the matrix only needs to be performed once per Newton iteration,
whereas the restriction of the RHS vector and the prolongation of the solution vector
must be performed once per linear iteration within a Newton iteration.
This is because, during each linear iteration, the GMRES
solver needs to find the approximate solution for a new RHS vector $v_i$ (see step 8
in Algorithm 3.1).
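The prolongation step can be sketched as follows (illustrative sizes; $P$ is shown as a dense matrix, whereas in practice it is a sparse scatter of pressure updates with zero padding):

```python
import numpy as np

# Illustrative sizes: 4 cells, 3 implicit unknowns per cell, pressure first.
n_cells, n_vars = 4, 3
n_imp = n_cells * n_vars

# Prolongation P scatters each cell's pressure update into the full implicit
# unknown vector and pads zeros for the other implicit variables.
P = np.zeros((n_imp, n_cells))
for i in range(n_cells):
    P[i * n_vars, i] = 1.0       # pressure occupies slot 0 of each cell block

x_p = np.array([1.0, 2.0, 3.0, 4.0])   # pressure solution from AMG/SAMG
x_1 = P @ x_p                          # first-stage CPR solution
```

The resulting `x_1` carries the pressure updates in the pressure slots and zeros everywhere else, which is exactly the vector corrected by the second CPR stage.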
Second stage: overall system
The reservoir and facility models are generally strongly coupled via pressure. Due to
the pressure decoupling at the first stage of CPR, we may solve the overall system
locally (i.e., one submatrix at a time) in the second stage. For these submatrices, the
ILU family of preconditioners, such as ILU(0), BILU(0), BILU(k) [44], can be used.
As described at the beginning of this section, we first calculate a corrected RHS
for the second stage as:
\[ r_2 = r - A x_1, \qquad (3.37) \]
where $r$ is the original RHS vector and $x_1$ is the prolonged first-stage solution vector.
The local preconditioners can then be applied to the submatrices and associated
RHS vectors for the reservoir part and for each facility model. For the reservoir part,
the BILU(0) preconditioner is preferred, because its factorization and per-iteration
solution are much more efficient than those of BILU(k), and it provides acceptable accuracy as
a second-stage preconditioner. However, when we apply the ILU family of preconditioners
directly as a single-stage preconditioner, BILU(k) with k = 1 is a better choice for the
reservoir part because it converges much faster and the overall cost is smaller [38, 98].
For the facilities part, the submatrices from standard wells are trivial, because the $J_{ww,i}$
submatrix for each well contains only one element and the solution can be obtained
directly. The linear solution strategy for advanced well models is discussed later in
Section 4.7.
Here we briefly describe the idea of the ILU family of preconditioners. In the setup
stage, an incomplete LU factorization is performed on the input matrix $A$. Depending
on the specific preconditioner type, we have:
\[ \text{Pointwise factorization without fill-ins:} \quad \mathrm{ILU0}(A) = L_p U_p, \qquad (3.38) \]
\[ \text{Blockwise factorization without fill-ins:} \quad \mathrm{BILU0}(A) = L_b U_b, \qquad (3.39) \]
\[ \text{Blockwise factorization with } k\text{-level fill-ins:} \quad \mathrm{BILU}k(A) = L_b^k U_b^k. \qquad (3.40) \]
The factorized matrices L and U are lower triangular and upper triangular (pointwise
or blockwise), respectively. With the given RHS, forward and backward sweeps can be
applied to quickly obtain the solution from these factorized matrices. For the linear
system Ax = b, we first use forward elimination to solve
Lz = b (3.41)
for the intermediate solution z, and then use backward substitution to solve
Ux = z (3.42)
for the final solution x.
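These two sweeps can be sketched for dense factors as follows (a plain illustration of Eqs. (3.41)-(3.42), not the blockwise in-place sweeps of AD-GPRS):

```python
import numpy as np

# Forward elimination (Eq. 3.41): solve L z = b for z.
def forward(L, b):
    n = len(b)
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

# Backward substitution (Eq. 3.42): solve U x = z for x.
def backward(U, z):
    n = len(z)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

Applying `forward` and then `backward` to the factors of a matrix recovers the solution of the factored system; for incomplete factors it yields the preconditioned residual instead.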
When we apply an ILU-type preconditioner without fill-ins, every single entry in
the factorized matrices L and U must be in the position of an original nonzero entry
in matrix A. All other entries induced by the factorization process are ignored. Thus,
the structures of L and U are already determined when the input matrix A is received
by the preconditioner. On the other hand, for an ILU-type preconditioner with k-level
fill-ins (k > 0), we first need to perform a symbolic factorization to determine the
positions of the fill-in entries in L and U, and then proceed with the numerical factorization
to set up the values of both the original and fill-in entries. The fill-in entries with level
k are induced by the elimination of the fill-in entries with level k − 1, for any k > 0.
The original nonzero entries in matrix A can be thought of as fill-in entries with
level 0. Thus the higher the fill-in level k is, the more fill-in entries we will have in
the factorized matrices, and the higher the factorization and per-iteration solution
cost will be. In return, the convergence rate should also get better because we have
obtained a closer approximation of the original matrix A by the preconditioner.
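The level rule can be sketched with a dense symbolic factorization (illustrative only; production codes track levels on sparse row structures):

```python
import numpy as np

INF = 10**9  # stands in for "no entry / infinite level"

# Symbolic ILU(k): propagate fill levels with the standard update rule
#   level(i,j) = min(level(i,j), level(i,t) + level(t,j) + 1),
# where original nonzeros have level 0. Entries with level > k are dropped.
def ilu_levels(pattern, k):
    n = len(pattern)
    lev = np.where(pattern, 0, INF)
    for t in range(n):                    # eliminate pivot row/column t
        for i in range(t + 1, n):
            if lev[i, t] > k:             # entry (i, t) is dropped; no update
                continue
            for j in range(t + 1, n):
                lev[i, j] = min(lev[i, j], lev[i, t] + lev[t, j] + 1)
    return lev <= k                       # sparsity pattern kept for ILU(k)
```

For an "arrowhead" pattern (dense first row and column plus the diagonal), eliminating the first unknown fills the whole trailing block at level 1: ILU(0) keeps only the original pattern, while ILU(1) keeps everything.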
The BILU(0) preconditioner in AD-GPRS works on a copy of the JRR matrix. The
copy is in block CSR format and contains only the derivatives of implicit equations
with respect to implicit variables, which are copied from the JRR matrix after the
two-step algebraic reduction in each Newton iteration. The reason the preconditioner
cannot work directly on the original JRR matrix is that JRR is needed for the SpMV
in the GMRES solver and thus cannot be modified during the linear iterations, whereas
the BILU(0) preconditioner performs an in-place factorization on the input matrix.
That is, the nonzero blocks of the factorized matrices L and U are stored at the same
places in memory as the nonzero blocks in the input matrix.
3.5 Concluding Remarks
In this chapter we described the linear solver framework, including the underlying
matrix formats and associated solution strategies of AD-GPRS. Currently AD-GPRS
can deal with two linear systems: the CSR linear system based on the public CSR
matrix format, and the block linear system based on the customized MLBS matrix
format. On one hand, the CSR matrix format has a very simple structure and is
easy to understand. However, the efficiency of its associated linear solvers is limited
by this structure. As a result, the CSR linear system can only be used to validate the
correctness of new features via small problems.
On the other hand, the MLBS matrix format has a much more sophisticated struc-
ture. A hierarchical storage system has been adopted such that the entire Jacobian
matrix is first divided into four parts ($J_{RR}$, $J_{RF}$, $J_{FR}$, and $J_{FF}$), with each part
(except $J_{RR}$) composed of smaller submatrices. There is no requirement imposed
on the types of the submatrices, as long as the common matrix operations, e.g.,
extraction from AD residual vector, algebraic reduction, explicit update, and SpMV,
are well defined. This design offers great flexibility and extensibility for the model
developers to devise the most suitable matrix format and associated solution strategy
for each individual facility model (or a certain new feature) of AD-GPRS.
For the solution of the block linear system, a two-step algebraic reduction from the
full system to the implicit system is applied prior to calling the linear solver. The
implicit system is solved by an iterative Krylov subspace solver (e.g., GMRES) with
a two-stage preconditioning strategy. Afterwards, a two-step explicit update is used
to recover the full solution from the implicit solution. The solution efficiency of this
strategy is much better than that of a single-stage preconditioner (e.g., BILU), not to
mention preconditioners working on the simple (pointwise) CSR linear system.
Chapter 4
General MultiSegment Well Model
4.1 Model Description
As mentioned in [38], long deviated and horizontal wells have become increasingly
important in the petroleum industry in recent years. Many horizontal wells, especially
offshore ones, can have such large volumetric rates that the pressure drops due to
friction and acceleration cannot be neglected [58]. In addition, the phase holdup
effect may greatly affect the behavior of the well [73, 74]. Because the standard
well model [62] is not capable of modeling these effects, the MultiSegment (MS)
well model, with a discretized wellbore, was introduced to accurately capture the flow behavior
in the wellbore [36, 38]. Hereinafter we refer to this model as the "original" MS
well model, in order to differentiate it from a more generalized model discussed in
this chapter. In addition to the hydrostatic pressure drop, the pressure drops due to
friction and acceleration are accounted for in the original MS well model. The drift-
flux model is often used in original MS wells to compute the individual phase holdups
and velocities in the wellbore [73, 74]. Figure 4.1 shows an original MS well and the
associated variables (degrees of freedom). Each original MS well is discretized into
multiple segments with predefined flow directions. Here, a segment must have two
Figure 4.1: Illustration of the original multisegment well model (from [38])
ends labelled “heel” and “toe”, respectively. The pressure is defined at the toe end of
the segment, whereas the mixture velocity is defined at the heel end of the segment.
Holdups and mole fractions are defined for the entire segment at the segment center.
The network of surface pipelines also plays an important role in field operations. In
a production system of several wells connected to a pipeline network, the constraints
are often applied on gathering points and storage facilities, not on the wellheads [38].
With properly chosen parameters, the original MS well model described above can
be used to simulate the transient effects in the wellbore. However, the original MS
well model cannot properly handle complex topology, including general branching,
loops, and multiple exits, of the pipeline network. Moreover, the flow directions
are predefined in the original MS well model, whereas in a pipeline network, the
flow directions may change with time and be different from the predefined ones. As a
consequence, a preliminary idea to generalize the original MS well model was proposed
in [38]. A simple stand-alone simulation package was implemented to justify the idea.
Here a general MS well model is described, which has been implemented using the
AD framework and seamlessly integrated with AD-GPRS.
Figure 4.2: Illustration of the general multisegment well model, showing two-outlet segments, multi-outlet segments, special segments, nodes, and connections (from [38])
As shown in Figure 4.2, each general MS well is discretized into nodes and connec-
tions in a manner that is quite similar to finite-volume discretization of reservoir flow.
In this model, each node (yellow circle) is associated with a unique segment that has
a finite volume and may be connected to any number of other nodes. Each nonexit
connection (red link) connects two segments (and their associated nodes), whereas
each exit connection (blue arrow) connects one segment (node) to the outside region.
As described in [38], there are three types of segments:
• Two-outlet segments (blue bars). They represent the most common type of
segments. Two-outlet segments are similar to the segments in the original MS
well model and can be used to model a section of the pipeline. There is no strict
restriction on the geometry of a two-outlet segment, i.e., it can be straight, or
bent. However, from an accuracy point of view (the holdup depends on the
angle of the segment), it is better to avoid bent two-outlet segments and to
have each two-outlet segment be straight, or nearly straight. We can always
divide a two-outlet segment into two, or more, connected shorter two-outlet
segments.
• Multioutlet segments (blue circles). Multioutlet segments represent junctions in
a pipeline system and can be used to create branches and loops. These segments
can have zero volume and thus have no accumulation.
• Special segments (symbols based on their functionality). Each type of special
segments can provide a specific functionality other than just being a segment
of “pipe”. For example, the special segment shown in Figure 4.2 is a separator
segment, which represents a facility that separates different fluid phases.
Using this discretization scheme, the general MS well model is capable of handling
complex geometries of multilateral wells, well groups, pipeline networks, or arbitrary
combinations of the above, provided the drift-flux model is sufficiently accurate for
the conditions being considered.
Well constraints are handled in a generic way in the general MS well model.
Multiple constraints can be imposed on a single general MS well. A constraint,
depending on its type, can be applied on any node, or exit connection. Because
pressure is a nodal variable (see Section 4.2), the pressure constraint will be applied
on the corresponding node. Also, because the mixture flow rate is a connection
variable, the rate constraint is defined on an exit connection.
The advanced features supported by this model are:
• General branching that allows any well node to be connected with any number of
other nodes. With the flexibility offered by this feature, we can define a facility
model with very complex geometry and thus have a better approximation to its
physical reality.
• Loops with arbitrary flow directions, whereby the flow direction is determined
dynamically during the Newton iteration process. The actual flow direction
is indicated by the sign of the mixture flow rate defined on each connection
(see Section 4.2): if the rate is positive, the actual direction is the same as
the preassumed direction; otherwise, the actual direction is opposite to the
preassumed direction.
• Multiple exit connections with different constraints. The current implementa-
tion supports only one active well constraint at a time. Extension to multiple
constraints is, however, supported by the nonlinear framework and the design
of the general MS well model.
• Special segments with various functionality (e.g., separators, valves). Any seg-
ment in a general MS well can be a special segment, such that its property
calculation is defined differently from an ordinary segment. The accumulation
terms in the mass conservation equations and the local constraints can also be
customized to accommodate the specific functionality of the segment. This idea
is elaborated in the first part of Section 4.6.1.
4.2 Variables
There are many more variables in the MS well model compared with the standard
well model, which has only one variable, the Bottom Hole Pressure (BHP), per well.
For a general MS well with multiple nodes and connections, different variables are
defined on nodes and on connections. Without loss of generality, we demonstrate
the variable definition of this model using the natural-variables formulation [15, 23],
with either black-oil or compositional fluids. On each node, the following independent
variables are defined:
• Pw (pressure)
• Tw (temperature, independent only in a thermal formulation)
• αp (holdup, or in-situ phase fraction, of phase p)
• xc,p (mole fraction of component c in phase p)
In addition to the above independent variables, dependent variables such as ρp (den-
sity), λp (mobility), and γp (mass density) are also defined on each node. In this
regard, both the independent and dependent variables are essentially the same as
the ones defined on each reservoir node (αp is equivalent to Sp). Therefore, they are
stored in the node-based subset of the global variable set (see Section 4.5.1).
Note that the above variables are all defined at the center of the node. This is
unlike the original MS well model, where pressure and mixture velocity are purposely
defined at the two ends of a segment. Due to this consistency in definition with the
cell-centered variables on a reservoir node, the initialization and property calculation of
well nodes can be carried out in a more generalized fashion. That is, they can share
many computational processes with reservoir cells.
As described above, these variables are applicable when the natural-variables for-
mulation is used. Different node-based variables can be defined automatically for
the general MS well model if another variable formulation (e.g., molar) is used. Our
general MS well model has no restrictions on the selection of the variable formulation,
or the fluid model. That is, we expect very little modification to deal with a new
variable formulation, or fluid model. This is an important generalization, as we
will only need to maintain one unified piece of code for the general MS well model,
and the code will work for arbitrary combinations of variable formulations and fluid
models.
On each connection, regardless of the selected variable formulation, the following
independent variable is defined:
• Qm (mixture flow rate, Qm = A · Vm, where A is the cross-sectional area of the
upstream segment, and Vm is the mixture velocity)
The dependent variables Qp (p = 1, . . . , np) are defined as the flow rate of phase
p. We have Qp = A · Vsp, where Vsp is the superficial velocity of phase p. Qp’s are
also defined on each connection. As a consequence, Qm and Qp’s are stored in the
connection-based subset of the global variable set (see Section 4.5.1).
4.3 Equations
On each node (segment) of an MS well, we solve mass and energy balance equations
and deal with local constraints. The $n_c$ mass balance equations are:
\[ \frac{\partial}{\partial t}\sum_p \rho_p \alpha_p x_{c,p} - \frac{\partial}{\partial z}\sum_p \rho_p V_{sp} x_{c,p} + \sum_p \rho_p x_{c,p} q_p = 0, \qquad c = 1,\dots,n_c, \qquad (4.1) \]
where ρp is the density of phase p, αp is the in-situ phase fraction of phase p, xc,p is
the mole fraction of component c in phase p, Vsp is the superficial velocity of phase
p, and qp is the inflow (per unit volume) of phase p to the segment. The time and
spatial derivative terms account for the mass accumulation and convective mass flux,
respectively. The last term, $\sum_p \rho_p x_{c,p} q_p$, represents the mass source/sink through the
wellbore. The corresponding discretized equations are:
\[ V_i\left[\left(\sum_p \rho_p \alpha_p x_{c,p}\right)_i^{n+1} - \left(\sum_p \rho_p \alpha_p x_{c,p}\right)_i^{n}\right] - \Delta t \sum_{j \in nbr(i)}\left(\sum_p \rho_p Q_p x_{c,p}\right)_{(i,j)} + \Delta t \left(\sum_p \rho_p x_{c,p} Q_p\right)_i = 0, \qquad c = 1,\dots,n_c, \qquad (4.2) \]
where $V_i$ is the volume of node $i$, $nbr(i)$ is the set of all neighboring nodes
of $i$, $(Q_p)_{(i,j)} = (A \cdot V_{sp})_{(i,j)}$ is the phase flow rate through connection $(i,j)$, and
$(Q_p)_i = (V \cdot q_p)_i$ is the volumetric inflow of phase $p$ to node $i$.
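Evaluated for a single component and node, Eq. (4.2) amounts to the following sketch (a plain function with illustrative inputs, not the AD residual assembly of AD-GPRS):

```python
# Schematic evaluation of the discretized mass balance, Eq. (4.2),
# for one component c at one node i.
def mass_residual(V_i, dt, acc_new, acc_old, flux_nbrs, source):
    """V_i: node volume; dt: time step;
    acc_new/acc_old: sum_p rho_p*alpha_p*x_cp at t^{n+1} and t^n;
    flux_nbrs: list of sum_p rho_p*Q_p*x_cp over connections (i, j);
    source: sum_p rho_p*x_cp*Q_p at node i (inflow through the wellbore)."""
    return V_i * (acc_new - acc_old) - dt * sum(flux_nbrs) + dt * source
```

Newton's method drives this residual (and its energy counterpart, Eq. (4.5)) to zero for every component and node simultaneously.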
For a thermal formulation, the energy balance equation [46,47,72] is:
\[ \frac{\partial}{\partial t}\sum_p \rho_p \alpha_p \left(U_p + \tfrac{1}{2}V_p^2\right) + \frac{\partial}{\partial z}\sum_p \rho_p \alpha_p V_p \left(H_p + \tfrac{1}{2}V_p^2\right) = \sum_p \rho_p \alpha_p V_p \bar{g} - Q_{loss} + \sum_p \rho_p H_p q_p, \qquad (4.3) \]
where $U_p$ and $H_p$ are the internal energy and enthalpy of phase $p$, $V_p$ is the interstitial
velocity of phase $p$ ($V_p = V_{sp}/\alpha_p$), and $\bar{g} = g\cos\theta$ is the gravitational component along
the well, in which $\theta$ is the inclination angle of the segment from vertical. The time
derivative term accounts for the energy accumulation, whereas the spatial derivative
term is for the convective energy flux. The first term on the right-hand side,
$\sum_p \rho_p \alpha_p V_p \bar{g}$, represents the rate of work done on the fluid by gravitational forces [47].
$Q_{loss}$ is the heat loss from the wellbore fluid to the surroundings, and it is calculated from
the formula:
\[ Q_{loss} = -2\pi r_w U_{to}(T_w - T), \qquad (4.4) \]
where $U_{to}$ is the overall heat transfer coefficient, $r_w$ is the wellbore radius, and $T_w$ is the
temperature of the wellbore fluids. The last term in Eq. (4.3), $\sum_p \rho_p H_p q_p$, represents the
energy source/sink through the wellbore. The corresponding discretized equation is:
\[ V_i\left[\left(\sum_p \rho_p \alpha_p \left(U_p + \tfrac{1}{2}V_p^2\right)\right)_i^{n+1} - \left(\sum_p \rho_p \alpha_p \left(U_p + \tfrac{1}{2}V_p^2\right)\right)_i^{n}\right] + \Delta t \sum_{j \in nbr(i)}\left(\sum_p \rho_p Q_p \left(H_p + \tfrac{1}{2}V_p^2\right)\right)_{(i,j)} = \Delta t \left(V \sum_p \rho_p V_{sp} \bar{g} - V Q_{loss} + \sum_p \rho_p H_p Q_p\right)_i, \qquad (4.5) \]
where $(V_{sp})_i$ is set equal to $(V_{sp})_{(i,j)}$ such that $i$ is an upstream node of $j$ with respect
to the mixture flow rate. This treatment is used because all velocity- (or flow-rate-)
related properties, including $V_{sp}$, are defined on connections instead of nodes, and
we have to use the connection-based counterparts to evaluate the superficial phase
velocities at the node (segment center).
In order to close the system composed of (4.1) and (4.3), or the corresponding
discretized equations (4.2) and (4.5), additional equations are needed, as discussed
below. For compositional fluids, the thermodynamic equilibrium equations for the
hydrocarbon components are:
\[ f_{c,p}(P, T, x_{c,p}) - f_{c,q}(P, T, x_{c,q}) = 0, \qquad c = 1,\dots,n_c, \quad 1 \le p \ne q \le n_p, \qquad (4.6) \]
where $f_{c,p}(P, T, x_{c,p})$ is the fugacity of component $c$ in phase $p$. Correspondingly, simple
PVT-based equilibrium equations are solved for black-oil fluids. The linear constraints for
the component mole fractions in a phase and for the phase holdups are:
\[ \sum_{c=1}^{n_c} x_{c,p} = 1, \qquad p = 1,\dots,n_p, \qquad (4.7) \]
\[ \sum_{p=1}^{n_p} \alpha_p = 1. \qquad (4.8) \]
Similar to the set of reservoir equations, these linear constraints are sometimes
eliminated at the nonlinear level, i.e., having one xc,p (currently the first one) in each
phase depend on the other xc,p’s in that phase, and having one αp (currently the
last one) depend on the other αp’s. These linear constraints are applicable for the
natural-variables formulation. Corresponding constraints can be defined for other
variable formulations.
AD-GPRS currently supports two approaches for modeling phase fractions inside
the wellbore: 1) the homogeneous model (no slip), and 2) the drift-flux model. These
models are used to compute the phase flow rates ($Q_p$'s). The homogeneous model
assumes equal phase velocities, i.e., no slip between phases. That is,
Vp = Vm, p = 1, . . . , np. (4.9)
Using the relationship between Vsp and Vp, and multiplying both sides by A (the
cross-sectional area), we can obtain the following equation for phase flow rates with
the assumption of homogeneous flux:
Qp = αpQm, p = 1, . . . , np. (4.10)
The drift-flux model uses a more complicated strategy for the computation of Qp and
is described in Section 4.4. More sophisticated mechanics models and those based on
multidimensional tabular data can be accommodated as described in the first part of
Section 4.6.1.
The following equation describing the pressure relation is solved for each nonexit
connection between two neighboring nodes:
\[ \Delta P^w = \Delta P^w_h + \Delta P^w_f + \Delta P^w_a. \qquad (4.11) \]
Here, $\Delta P^w$ is the pressure drop between two nodes, and $\Delta P^w_h$, $\Delta P^w_f$, and $\Delta P^w_a$ are
the hydrostatic, frictional, and acceleration components of the pressure drop, respec-
tively. In AD-GPRS, the user may choose to include one, two, or all three of these
components. The hydrostatic pressure difference between two nodes is:
\[ \Delta P^w_h = \gamma_m g \Delta D, \qquad (4.12) \]
where $\Delta D$ is the depth difference between the two nodes, and $\gamma_m = \sum_p \alpha_p \gamma_p$ is the
mixture mass density, in which $\gamma_p$ is the mass density of phase $p$.
The frictional pressure difference between two nodes is:
\[ \Delta P^w_f = \frac{2 f_{tp}\, \gamma_m V_m |V_m|}{D_H} \Delta z, \qquad (4.13) \]
where $f_{tp}$ is the Fanning friction factor, $D_H$ is the hydraulic diameter of a segment,
$V_m$ is the mixture velocity, and $\Delta z$ is the length between the two nodes. $f_{tp}$ is a function
of the dimensionless Reynolds number $Re$ (the ratio of inertial to viscous forces),
the roughness height $\varepsilon$, and the hydraulic diameter $D_H$, as described in [35]:
\[ f_{tp} = \begin{cases} 16/Re & \text{if } Re \le 2000 \text{ (laminar flow)} \\[4pt] 1\Big/\left(-3.6\log_{10}\left(\dfrac{6.9}{Re} + \left(\dfrac{\varepsilon}{3.7 D_H}\right)^{10/9}\right)\right)^{2} & \text{if } Re \ge 4000 \text{ (turbulent flow)} \\[4pt] 16/2000 + k_f (Re - 2000) & \text{if } 2000 < Re < 4000 \text{ (intermediate)} \end{cases} \qquad (4.14) \]
where $k_f$ is the slope for the linear interpolation of $f_{tp}$ over the intermediate range
($2000 < Re < 4000$), given by:
\[ k_f = \frac{f_{tp}(Re = 4000) - f_{tp}(Re = 2000)}{4000 - 2000}. \qquad (4.15) \]
The Reynolds number $Re$ is calculated as:
\[ Re = \frac{\gamma_m Q_m D_H}{\mu_m A}, \qquad (4.16) \]
where $A$ is the cross-sectional area of a segment and $\mu_m = \sum_p \alpha_p \mu_p$ is the mixture
viscosity, in which $\mu_p$ is the viscosity of phase $p$.
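Eqs. (4.14)-(4.16) translate into the following sketch (units are assumed consistent; the turbulent branch is the Haaland-type correlation quoted above):

```python
import math

# Fanning friction factor of Eq. (4.14); eps and D_H in consistent units.
def fanning(Re, eps, D_H):
    def turbulent(Re):
        return 1.0 / (-3.6 * math.log10(
            6.9 / Re + (eps / (3.7 * D_H)) ** (10.0 / 9.0))) ** 2

    if Re <= 2000.0:                      # laminar flow
        return 16.0 / Re
    if Re >= 4000.0:                      # turbulent flow
        return turbulent(Re)
    # intermediate regime: linear interpolation between the two endpoints,
    # with the slope k_f of Eq. (4.15)
    k_f = (turbulent(4000.0) - 16.0 / 2000.0) / (4000.0 - 2000.0)
    return 16.0 / 2000.0 + k_f * (Re - 2000.0)
```

By construction the interpolation matches the laminar value at Re = 2000 and the turbulent value at Re = 4000, so $f_{tp}$ is continuous across the regime boundaries.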
The pressure-drop component due to acceleration is given by the following formula:
\[ \Delta P^w_a = \frac{2\, m_{in} V_m}{A}, \qquad (4.17) \]
where $m_{in} = \sum_p \gamma_p Q_p$ is the mass flow rate of the mixture entering the segment.
For the exit connection (blue arrow in Figure 4.2) that connects node $e$ with the
outside region, we may denote it as $(e, exit)$. The following equation is solved on this
connection when a constant pressure constraint is applied:
\[ P^w_e = P^{target}, \qquad (4.18) \]
where $P^{target}$ is the specified target for the pressure of node $e$.
Correspondingly, the following equation is solved on this connection when a constant
rate constraint is applied:
\[ \frac{\nu^{sc}_j}{\rho^{sc}_j} \cdot \left(\sum_c \sum_p \rho_p Q_p x_{c,p}\right)_{(e,exit)} = Q^{target}_j, \qquad (4.19) \]
where $j$ is the phase whose rate is controlled and $sc$ denotes surface conditions.
$\nu^{sc}_j$ and $\rho^{sc}_j$ are the mole fraction and density of phase $j$ obtained through the well
flash at surface conditions. $Q^{target}_j$ is the specified target for the volumetric rate of
phase $j$ at surface conditions.
4.4 Drift-Flux Model
The drift-flux model was first proposed by Zuber and Findlay [100] for vertical two-
phase bubbly flow. In AD-GPRS, we use the extension described in [73,74] to model
the slip in: 1) two-phase liquid-gas flow [74], 2) two-phase oil-water flow [74], and
3) three-phase gas-oil-water flow as a combination of the two-phase liquid-gas and
oil-water flow models [73].
4.4.1 Liquid-gas model
The drift-flux model for liquid-gas flow is based on the assumption that gas bubbles
tend to flow through the central portion of the pipe, where the local mixture velocity
is greater than the cross-sectional average velocity [74]. In addition, the density
difference between the gas and liquid phases gives rise to a drift between phases. The
relationship can be expressed as:
\[ V_g = \frac{V_{sg}}{\alpha_g} = C_0 V_m + V_d, \qquad (4.20) \]
where $V_g$ is the gas velocity, $V_{sg}$ is the superficial gas velocity, $V_d$ is the terminal rise
velocity, and $C_0$ is a flow profile parameter given by:
\[ C_0 = \frac{A}{1 + (A - 1)\left(\dfrac{\alpha_g - B}{1 - B}\right)^2}, \qquad (4.21) \]
where the parameters $A$ and $B$ are found from experimental data. A typical value of
$A$ is 1.0 for circular pipes. In that case, $B$ does not affect the model, and
we have $C_0 = 1.0$ over the entire range of $\alpha_g$ [73, 74].
The terminal rise velocity $V_d$ is calculated from:
\[ V_d = \frac{(1 - \alpha_g C_0)\, C_0\, K(\alpha_g)\, V_c}{\alpha_g C_0 \sqrt{\gamma_g/\gamma_l} + 1 - \alpha_g C_0}\, m(\theta), \qquad (4.22) \]
where $V_c$ is the characteristic velocity, $m(\theta)$ is a scaling parameter for inclined pipes,
and $K(\alpha_g)$ is given by:
\[ K(\alpha_g) = \begin{cases} 1.53/C_0 & \text{if } \alpha_g \le a_1 \\[2pt] K_u(\hat{D}_H) & \text{if } \alpha_g \ge a_2 \\[2pt] 1.53/C_0 + \dfrac{K_u(\hat{D}_H) - 1.53/C_0}{a_2 - a_1}(\alpha_g - a_1) & \text{if } a_1 < \alpha_g < a_2 \end{cases} \qquad (4.23) \]
where $a_1 = 0.06$ and $a_2 = 0.21$ for circular pipes. $K_u(\hat{D}_H)$ denotes the critical
Kutateladze number, which is a function of the dimensionless diameter of the pipe:
\[ \hat{D}_H = \left(\frac{g(\gamma_l - \gamma_g)}{\sigma_{gl}}\right)^{1/2} D_H, \qquad (4.24) \]
as described in [74]. Here $\sigma_{gl}$ is the gas-liquid interfacial tension and $D_H$ is the
hydraulic diameter of the pipe.
The characteristic velocity $V_c$ and scaling parameter $m(\theta)$ are given by:
\[ V_c = \left(\frac{\sigma_{gl}\, g\, (\gamma_l - \gamma_g)}{\gamma_l^2}\right)^{1/4}, \qquad (4.25) \]
\[ m(\theta) = m_0 (\cos\theta)^{n_1} (1 + \sin\theta)^{n_2}. \qquad (4.26) \]
For circular pipes the following parameters have been proposed: $m_0 = 1.85$, $n_1 = 0.21$,
and $n_2 = 0.95$ [73, 74]. Note that if $\theta = 90^\circ$ (the pipe is horizontal), we have $m(\theta) = 0$,
and thus $V_d = 0$. If, in addition, $A = 1.0$ in Eq. (4.21), we have
$V_g = V_m$, and the model degenerates to the homogeneous flux model.
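The liquid-gas relations, Eqs. (4.20)-(4.26), can be collected into one sketch (circular-pipe defaults from the text; the critical Kutateladze number `Ku` is passed in as a plain number rather than evaluated from Eq. (4.24), and θ is assumed between 0 and 90° measured from vertical):

```python
import math

# Gas velocity from the liquid-gas drift-flux model, Eqs. (4.20)-(4.26).
# Defaults: A = 1.0, a1 = 0.06, a2 = 0.21 and m0 = 1.85, n1 = 0.21,
# n2 = 0.95 for circular pipes; B is irrelevant when A = 1.0.
def gas_velocity(alpha_g, V_m, gamma_g, gamma_l, sigma_gl, theta, Ku,
                 A=1.0, B=0.3, a1=0.06, a2=0.21, g=9.81):
    C0 = A / (1.0 + (A - 1.0) * ((alpha_g - B) / (1.0 - B)) ** 2)      # (4.21)
    Vc = (sigma_gl * g * (gamma_l - gamma_g) / gamma_l ** 2) ** 0.25   # (4.25)
    m = 1.85 * math.cos(theta) ** 0.21 * (1.0 + math.sin(theta)) ** 0.95  # (4.26)
    if alpha_g <= a1:                                                  # (4.23)
        K = 1.53 / C0
    elif alpha_g >= a2:
        K = Ku
    else:
        K = 1.53 / C0 + (Ku - 1.53 / C0) / (a2 - a1) * (alpha_g - a1)
    Vd = ((1.0 - alpha_g * C0) * C0 * K * Vc
          / (alpha_g * C0 * math.sqrt(gamma_g / gamma_l)
             + 1.0 - alpha_g * C0)) * m                                # (4.22)
    return C0 * V_m + Vd                                               # (4.20)
```

For a horizontal pipe (θ = 90°) the inclination scaling m(θ) vanishes, so the returned velocity reduces to the homogeneous value C0*Vm, as noted in the text.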
4.4.2 Oil-water model
The drift-flux model for oil-water flow [74] is quite similar to that for liquid-gas flow
described above. The oil velocity $V_o$ is computed as:
\[ V_o = \frac{V_{so}}{\alpha_o} = C_0' V_l + V_d', \qquad (4.27) \]
where $V_{so}$ is the superficial oil velocity, $V_l$ is the (mixed) liquid
velocity, and $C_0'$ is a continuous function of the oil volume fraction $\alpha_o$:
\[ C_0' = \begin{cases} A' & \text{if } \alpha_o \le B_1', \\[2pt] 1 & \text{if } \alpha_o \ge B_2', \\[2pt] A' - (A' - 1)\left(\dfrac{\alpha_o - B_1'}{B_2' - B_1'}\right) & \text{if } B_1' < \alpha_o < B_2'. \end{cases} \qquad (4.28) \]
For circular pipes, $A'$ has a typical value of 1.0. In that case, the parameters
$B_1'$ and $B_2'$ are not relevant, and we always have $C_0' = 1.0$.
The terminal rise velocity $V_d'$ is given by:
\[ V_d' = 1.53\, V_c' (1 - \alpha_o)^{n'} m'(\theta), \qquad (4.29) \]
where $n' = 1$ for circular pipes, and the characteristic velocity $V_c'$ is given by:
\[ V_c' = \left(\frac{\sigma_{ow}\, g\, (\gamma_w - \gamma_o)}{\gamma_w^2}\right)^{1/4}. \qquad (4.30) \]
The interfacial tensions $\sigma_{gw}$ (gas-water) and $\sigma_{go}$ (gas-oil) are calculated using the
published correlations in [10]. Then, $\sigma_{ow}$ can be calculated as:
\[ \sigma_{ow} = |\sigma_{go} - \sigma_{gw}|. \qquad (4.31) \]
The scaling parameter $m'(\theta)$ is given by:
\[ m'(\theta) = \begin{cases} n_1'\cos\theta + n_2'\sin^2\theta + n_3'\sin^3\theta & \text{if } \theta \le 88^\circ, \\[2pt] \sqrt{|\cos\theta|}\,(1 + \sin\theta)^2 & \text{if } \theta > 88^\circ, \end{cases} \qquad (4.32) \]
where $n_1' = 1.07$, $n_2' = 3.23$, and $n_3' = 2.32$ for circular pipes. Again, when $\theta = 90^\circ$,
we have $m'(\theta) = 0$, and thus $V_d' = 0$. If we also have $A' = 1.0$ in Eq. (4.28), then
$V_o = V_l$, and the model degenerates to the homogeneous flux model.
4.4.3 Gas-oil-water model
The three-phase (gas-oil-water) model works as a combination of the two-phase liquid-
gas and oil-water models [73]. First, the oil and water phases are considered
to be a mixed liquid phase. Both the gas-liquid interfacial tension and the mass
density of the liquid phase are estimated using the volume-weighted averages of
the corresponding oil-phase and water-phase properties:
\[ \sigma_{gl} = \frac{\alpha_o \sigma_{go} + \alpha_w \sigma_{gw}}{\alpha_o + \alpha_w}, \qquad (4.33) \]
\[ \gamma_l = \frac{\alpha_o \gamma_o + \alpha_w \gamma_w}{\alpha_o + \alpha_w}. \qquad (4.34) \]
Using the above properties, the liquid-gas model can be used to compute the superficial
velocity of the gas phase (derived from Eq. (4.20)):
\[ V_{sg} = \alpha_g C_0 V_m + \alpha_g V_d. \qquad (4.35) \]
Correspondingly, the superficial velocity of the liquid phase can be calculated as:
\[ V_{sl} = V_m - V_{sg}. \qquad (4.36) \]
Afterwards, we may apply the oil-water model inside the mixed liquid phase, with
αo (the volume fraction of the oil phase in all three phases) in Eq. (4.28) and (4.29)
replaced by αol (the volume fraction of the oil phase inside the mixed liquid phase),
which is given by:
\alpha_{ol} = \frac{\alpha_o}{\alpha_o + \alpha_w}.   (4.37)
The superficial velocity of the oil phase can be obtained as (derived from Eq. (4.27)):

V_{so} = \alpha_{ol} C'_0 V_{sl} + \alpha_o V'_d\, m_g(\theta, \alpha_g),   (4.38)

where m_g(\theta, \alpha_g) is an additional scaling parameter to account for the impact of gas on the oil-water slip, as described in [73]. Then the corresponding superficial velocity of the water phase can be calculated as:

V_{sw} = V_{sl} - V_{so}.   (4.39)
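The splitting sequence of Eqs. (4.35)-(4.39) can be sketched as follows. This is a hedged sketch: the drift-flux closures C0, Vd, C′0, V′d, and mg are taken as precomputed inputs, and the function name is ours.

```python
def three_phase_superficial(alpha_g, alpha_o, alpha_w, Vm,
                            C0, Vd, C0_ow, Vd_ow, mg=1.0):
    """Split the mixture velocity Vm into phase superficial velocities,
    Eqs. (4.35)-(4.39). C0 and Vd are the gas-liquid profile parameter
    and drift velocity evaluated with the mixed-liquid properties of
    Eqs. (4.33)-(4.34); C0_ow and Vd_ow are their oil-water
    counterparts; mg approximates m_g(theta, alpha_g) from [73]."""
    Vsg = alpha_g * C0 * Vm + alpha_g * Vd               # Eq. (4.35)
    Vsl = Vm - Vsg                                       # Eq. (4.36)
    alpha_ol = alpha_o / (alpha_o + alpha_w)             # Eq. (4.37)
    Vso = alpha_ol * C0_ow * Vsl + alpha_o * Vd_ow * mg  # Eq. (4.38)
    Vsw = Vsl - Vso                                      # Eq. (4.39)
    return Vsg, Vso, Vsw
```

In the no-slip limit (Vd = Vd_ow = 0 and C0 = C0_ow = 1), each phase simply carries its volume fraction of Vm, consistent with the homogeneous flux model.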
4.5 Extensions of the AD Simulation Framework
In order to implement the general MS well model under the AD-based simulation
framework, some extensions are needed.
4.5.1 Global variable set
The first extension is related to the AD variable set. As not all the independent
variables in the general MS well model are defined on the nodes (some are defined
on the connections), a new global variable set containing multiple subsets has been
established. The following subsets are currently included in the global set:
• The original variable set defined for each node (control volume), which includes
all node-based independent variables (e.g., P , Sp, and xc,p for the natural-
variables formulation) and dependent variables (e.g., ρp, λp, and γp). All reser-
voir variables and part of the general MS well variables (those defined on each
node - yellow circles in Figure 4.2) are currently contained in this subset.
• A new variable subset defined for each connection. It is introduced because
many properties are naturally defined on connections (faces). This is not only
true for the general MS well model in the facilities part (for those variables
defined on each connection - red links and blue arrows in Figure 4.2), but also
valid for the reservoir part, where connection-based variables are useful in the
dual-grid approach for geomechanics modeling.
Other subsets can be added in the future as necessary. All subsets have the same
interfaces as the original node-based variable set, such that any variable can be fetched
easily. On the other hand, global operations such as backup, restoration, and variable-
switching are managed by the global set in a consistent manner.
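The subset arrangement above can be sketched as a small container class. Class and method names here are hypothetical, not the AD-GPRS interface.

```python
class GlobalVariableSet:
    """Sketch of a global AD variable set with node-based and
    connection-based subsets behind one interface."""
    def __init__(self):
        self.subsets = {"node": {}, "connection": {}}
        self._backup = None

    def set(self, subset, key, value):
        self.subsets[subset][key] = value

    def get(self, subset, key):
        return self.subsets[subset][key]

    def backup(self):
        # global operations are managed consistently over all subsets
        self._backup = {s: dict(v) for s, v in self.subsets.items()}

    def restore(self):
        self.subsets = {s: dict(v) for s, v in self._backup.items()}
```

A backup taken before a Newton update can then be restored for all subsets at once, which mirrors the consistent global management described above.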
4.5.2 Linear system
The second extension is related to the linear system. In order to achieve a generic
design, the implementation of various facility models, including the general MS well
model, is decomposed into nonlinear and linear levels. We have already described its
mathematical formulation, which corresponds to the implementation on the nonlinear
level, in Sections 4.2 and 4.3. What is left is the support for the linear level. Because
AD-GPRS is designed to support multiple linear systems and multiple facility models,
an approach using so-called “handlers” is introduced to mix and match the facility
models and linear systems. For example, we can have the following handlers for the
current set of facility models (standard well and general MS well) and linear systems
(CSR linear system and block-sparse linear system):
• Standard well and CSR linear system (trivial)
• General MS well and CSR linear system (trivial)
• Standard well and block-sparse linear system
• General MS well and block-sparse linear system
In the future, more handlers can be created when a new linear system, or facility
model, is introduced. Each facility-matrix handler as defined above is responsible for
the following tasks:
• Creation of the submatrices for the facility model inside the global Jacobian
matrix. The submatrices, which may have their own types, are created using
the structural information of the facility model and then added to the global
Jacobian matrix.
• Preparation of the preconditioning data for the facility model. The precondi-
tioning data are generated from the extracted nonzero entries in the submatrices
as well as other necessary information of the facility model, and are currently used in the assembly of the reduced pressure system for the first stage of the CPR preconditioner.
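The mix-and-match of facility models and linear systems can be sketched as a handler registry keyed by the (facility model, linear system) pair. The names below are illustrative, not the AD-GPRS API.

```python
class FacilityMatrixHandler:
    """Base handler pairing one facility model with one linear system."""
    def create_submatrices(self, global_jacobian, facility):
        ...  # allocate facility submatrices inside the global Jacobian
    def prepare_preconditioning(self, facility):
        ...  # extract data for the reduced pressure system (CPR stage 1)

_HANDLERS = {}

def register(facility_type, system_type, handler_cls):
    """Register a handler class for a (facility model, linear system) pair."""
    _HANDLERS[(facility_type, system_type)] = handler_cls

def handler_for(facility_type, system_type):
    """Instantiate the handler registered for the given pair."""
    return _HANDLERS[(facility_type, system_type)]()
```

Adding a new linear system or facility model then amounts to registering one more handler class, without touching the existing ones.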
Now, we discuss the extensions needed for the multilevel block-sparse (MLBS)
linear system, which has been described in detail in Chapter 3. Here we recap some
of its important features. The hierarchical structure of this linear system is shown
in Figure 4.3. On the first level, the entire system matrix is considered as a single
object, which is then divided into four parts, JRR, JRF , JFR, and JFF , on the sec-
ond level. Here, the first subscript represents the equations and the second subscript
represents the variables. In addition, R stands for the reservoir and F stands for
facilities. The third-level submatrices (e.g., JRW,1, JRW,2 as shown in Figure 4.3(c),
Figure 4.3: Multilevel block-sparse linear system (modified from [38]): (a) 1st level, the entire system matrix (nz = 74960); (b) 2nd level, the four parts JRR, JRF , JFR, and JFF ; (c) 3rd level, individual facility submatrices such as JRW,1 and JRW,2.
where W stands for a well, i.e., an individual facility model) inside JRF , JFR, and JFF correspond to individual facility models. The types of these submatrices
need to be customized such that the matrix storage and computation are effective
for different facility models, including the standard well model and the general MS
well model. For the specific submatrix types of each facility model, the common
matrix operations, including extraction from the AD residual vector, algebraic re-
duction, explicit updating, and Sparse Matrix-Vector multiplication (SpMV), need
to be properly defined. Then, an advanced linear preconditioning strategy is needed
to handle the contributions from the general MS well model when solving the fully
coupled reservoir-facilities matrix system.
4.5.3 Jacobian for the general MS well model
The Jacobian matrix (JWW , as a submatrix in JFF ) of the general MS well model
is composed of four parts (see Figure 4.4):
• JNN : Derivatives of node equations with respect to node variables. This sub-
matrix is block sparse, with the full size of each block equal to (NPri + NSec) × (NPri + NSec), where NPri is the number of primary variables and NSec is the
number of secondary variables. For each connection (i, j), there are two corre-
sponding off-diagonal nonzero blocks, one on row i, column j, and the other on
row j, column i. The structure of this submatrix is quite similar to that of JRR
on the second level of the MLBS linear system, when the reservoir is discretized
using the TPFA scheme.
• JNC : Derivatives of node equations with respect to connection variables. This
submatrix is block sparse, with the full size of each block equal to (NPri +
NSec) × 1. For each connection (i, j) with a connection index k (i.e., it is the
k’th connection), there are two corresponding nonzero blocks, one on row i,
column k, and the other on row j, column k.
• JCN : Derivatives of connection equations with respect to node variables. This
submatrix is block sparse, with the full size of each block equal to 1× (NPri +
NSec). For each connection (i, j) with a connection index k, there are two
corresponding nonzero blocks, one on row k, column i, and the other on row k,
column j. This submatrix has a transposed structure from JNC .
• JCC : Derivatives of connection equations with respect to connection variables.
This submatrix is pointwise diagonal, because there is only one independent
variable (Qm) defined for each connection and there is no direct coupling be-
tween different connections.
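The four sparsity patterns described above follow directly from the connection list; a minimal sketch (structure only, hypothetical function name):

```python
def msw_jacobian_pattern(n_nodes, connections):
    """Nonzero block positions of JNN, JNC, JCN, JCC for a general MS
    well, given its list of (i, j) node connections."""
    JNN = {(i, i) for i in range(n_nodes)}       # diagonal blocks
    JNC, JCN = set(), set()
    for k, (i, j) in enumerate(connections):
        JNN |= {(i, j), (j, i)}                  # node-node coupling
        JNC |= {(i, k), (j, k)}                  # node eqs w.r.t. Q_m on conn k
        JCN |= {(k, i), (k, j)}                  # conn eqs w.r.t. node vars
    JCC = {(k, k) for k in range(len(connections))}  # pointwise diagonal
    return JNN, JNC, JCN, JCC
```

For a three-node chain with connections (0, 1) and (1, 2), JNN gets the TPFA-like tridiagonal block pattern, and JCN is the transpose pattern of JNC, as stated above.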
Due to the structural similarity between the JWW matrix of the general MS well
and the global Jacobian matrix, which is also composed of four parts (JRR, JRF ,
JFR, and JFF ), the same matrix wrapper is used for both matrices, although the
specific types used for the four parts may be different. Besides the above (fourth-
level) submatrices inside JWW , there are submatrices that represent the coupling
between the reservoir and the general MS well: JRW (as a submatrix in JRF ) and
JWR (as a submatrix in JFR). JRW can be further divided into two parts: JRN
and JRC , which correspond to derivatives of reservoir equations with respect to well
node and well connection variables respectively. With the current physical model,
we always have JRC = 0. Similarly, JWR is also composed of two parts: JNR and
JCR, which contain the derivatives of well node and well connection equations with
respect to reservoir variables, respectively. JCR ≠ 0 if and only if the acceleration-related pressure drop term is included in the pressure-relation equations defined on well connections.
Figure 4.4: Jacobian matrix structure of the general MS well model, showing the four parts JNN , JNC , JCN , and JCC
4.6 Well Initialization, Calculation, and Variable
Updating
4.6.1 Well initialization
The initialization of a general MS well includes three major parts: creation of node
(segment) and connection objects, initialization of static properties, and initialization
of dynamic properties.
Creation of node (segment) and connection objects
In the general MS well model, each node is a separate object. Nodes share the same
base type but can have different inherited types. Common interfaces have been de-
signed for the node objects, e.g., properties calculation, adding local contributions
to the residual equations, updating of node-based variables, and so on. The imple-
mentation details are left to the developers. This explains how special nodes can be
handled within the framework, i.e., we may construct new segment types (e.g., sepa-
rator or valve segment) by deriving from one of the existing segment types (e.g., the
base type) and customizing the member variables, as well as the underlying computational processes inside the member functions. Then, we are able to create some of
the nodes using these special segment types while keeping other nodes as the ordinary
segments.
The same situation applies to connections. That is, each connection is also a
separate object that shares the same base type. The computation of phase flow rates
(Qp) and various types of pressure drops can be customized in the derived connection
types. This is the mechanism to handle different physical models (e.g., homogeneous
flux and drift flux) on connections. For example, when the drift-flux model is applied
to a general MS well, the exit connections and those connections with 1) no depth
difference between the two neighboring nodes and 2) constant flow-profile parameters
C0 = C ′0 = 1 (Eq. (4.21) and (4.28)) will still be created as the “homogeneous”
connections, whereas the rest of the connections will be created as the “drift-flux”
connections. As a result, distinct computations of Qp’s will be performed for the two
types of connections, although they are in the same MS well and the calling con-
ventions to their member functions are completely the same. In this way, additional
segment types, or physical models that are based on more accurate mechanics prin-
ciples or tabulated experimental data, can be introduced into the general MS well
model in a consistent manner without altering the code structure.
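The selection rule above can be sketched with a base connection type and a factory. Class names are illustrative, and the drift-flux closure itself is stubbed out.

```python
class Connection:
    """Base connection type; derived types customize the flow-rate
    and pressure-drop computations."""
    def phase_rates(self, Qm, props):
        raise NotImplementedError

class HomogeneousConnection(Connection):
    def phase_rates(self, Qm, props):
        # no slip: each phase moves with the mixture
        return {p: a * Qm for p, a in props["holdup"].items()}

class DriftFluxConnection(Connection):
    def phase_rates(self, Qm, props):
        # the drift-flux closure (Section 4.4) would be evaluated here
        raise NotImplementedError

def make_connection(is_exit, depth_diff, C0, C0_ow):
    """Factory mirroring the selection rule in the text: exit
    connections, and connections with no depth difference and constant
    flow-profile parameters, stay homogeneous."""
    if is_exit or (depth_diff == 0.0 and C0 == 1.0 and C0_ow == 1.0):
        return HomogeneousConnection()
    return DriftFluxConnection()
```

Both connection types are called through the same interface, so the surrounding well code is unaware of which physical model a given connection uses.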
Initialization of static properties
In this part, the properties that do not change during the entire simulation will be ini-
tialized. Currently, this includes the pointers to the first variable of each node and of
each connection, the set of perforations corresponding to each segment (one segment
may have zero, one, or multiple perforations, as shown in Figure 4.5(a), 4.5(b), and
4.5(c), respectively), and the depth difference between each perforation and its corre-
sponding node (perforations may or may not be located at the node center) multiplied
by gravitational acceleration. This procedure will also let each segment initialize its
own static properties. For an ordinary segment, this includes its hydraulic diameter
DH , cross-sectional area A, volume V, and the slope kf (Eq. (4.15)) used in the linear
interpolation of Fanning friction factor ftp for an intermediate Re (Eq. (4.14)).
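The linear interpolation of the Fanning friction factor over the transition region might look as follows. This is a sketch under assumptions: the transition bounds and the laminar (16/Re) and Blasius turbulent branches are ours, since Eqs. (4.14)-(4.15) are given elsewhere in this chapter and may differ.

```python
RE_LAM, RE_TURB = 2000.0, 4000.0   # assumed transition bounds

def fanning_friction(Re):
    """Fanning friction factor with a linear bridge over the
    transition region between assumed laminar and turbulent branches."""
    f_lam_end = 16.0 / RE_LAM
    f_turb_start = 0.0791 * RE_TURB ** -0.25
    if Re <= RE_LAM:
        return 16.0 / Re                       # laminar
    if Re >= RE_TURB:
        return 0.0791 * Re ** -0.25            # turbulent (Blasius, assumed)
    # precomputable slope, analogous to the static k_f of Eq. (4.15)
    k_f = (f_turb_start - f_lam_end) / (RE_TURB - RE_LAM)
    return f_lam_end + k_f * (Re - RE_LAM)     # analogous to Eq. (4.14)
```

Because the slope depends only on the fixed transition bounds, it can be computed once per segment at initialization, which is why k_f is listed among the static properties.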
Initialization of dynamic properties
This part initializes all the independent variables and other properties that change
during the simulation. This is a sophisticated process and contains several steps:
1. Give initial guesses to the base variables, including pressure (Pw), temperature
Figure 4.5: A segment with zero, one, or multiple perforations: (a) a segment without any perforation, (b) a segment with one perforation, and (c) a segment with multiple perforations; perforated and unperforated reservoir cells are distinguished in the figure.
(Tw), and overall mole fractions (zc) of each well node. The independent vari-
ables of any formulation can be converted from these base variables. Among
these variables, the pressure of each node is estimated based on the hydrostatic
pressure difference from the estimated value of wellhead pressure (defined on
the 0th node). That is,
P^w_i = P^w_0 - \gamma^{avg}_m g (D_i - D_0),   (4.40)

where P^w_i is the pressure of node i, D_i is the depth of node i, and \gamma^{avg}_m is the overall mixture mass density computed using the following formula,
\gamma^{avg}_m = \frac{1}{N_{perf}} \sum_{k=0}^{N_{perf}-1} \left( \frac{\sum_p \lambda_p \gamma_p}{\sum_p \lambda_p} \right)^{res}_{perf(k)},   (4.41)
where Nperf is the number of perforations, the superscript res represents a
reservoir property, and the subscript perf(k) is the cell number of the k’th
perforation. Note that we usually use holdups (αp) or phase rates (Qp) as the
weighting factors to calculate the mixture mass density, whereas here we use
phase mobilities (λp) from the perforated reservoir cells. That is because, during
initialization, P and αp in each segment have not yet been determined, and
thus we cannot calculate Qp, which depends on the pressure difference between
the perforated reservoir cell and the corresponding segment. In this regard,
\lambda_p / (\sum_p \lambda_p) provides a rough estimation for Q_p / (\sum_p Q_p) during initialization
only. Once αp’s are initialized and then updated in each Newton iteration during
the simulation, they will be used as the weighting factors for the mixture mass
density of each segment. On the other hand, temperature and overall mole
fractions have distinct estimations for injectors and producers. For injectors,
they are directly assigned from the given injection stream (T inj and zinjc ). For
producers, because perforations, the properties of which we already know from
the reservoir initialization, have no one-to-one correspondence to nodes, we
calculate the average values of temperature and overall mole fractions over all
perforations and use these average values as initial guesses. That is,
T^{w,avg} = \frac{1}{N_{perf}} \sum_{k=0}^{N_{perf}-1} T^{res}_{perf(k)},   (4.42)

z^{avg}_{c^*} = \frac{1}{N_{perf}} \sum_{k=0}^{N_{perf}-1} \left( \frac{\sum_p \lambda_p \rho_p x_{c^*,p}}{\sum_c \sum_p \lambda_p \rho_p x_{c,p}} \right)^{res}_{perf(k)}.   (4.43)
2. For each node, convert the base variables P^w_i, T^w_i, and z_{c,i} to the independent variables of the selected formulation (e.g., natural or molar)
3. For each node, repeat the following substeps for a few iterations (e.g., 3) to
reach hydrostatic equilibrium:
(a) Detect the change of phase status in this node
(b) Calculate the various properties of this node, including the mixture mass density \gamma_{m,i} as the volume-weighted average of phase mass densities: \gamma_{m,i} = \sum_p (\alpha_p \gamma_p)_i
(c) Update the node pressure using the new mixture mass density γm,i:
P^w_i = P^w_0 - \gamma_{m,i}\, g (D_i - D_0).   (4.44)
4. Calculate the inflow/outflow rates at all perforations using the updated inde-
pendent variables of the well
5. Initialize the mixture flow rates (Qm) on all connections. This step includes the
following substeps:
(a) Find the starting nodes (i.e., nodes with only one associated connection)
and push them into the node queue (first in, first out)
(b) Set the status of all Qm’s as “undetermined”
(c) Fetch the first node seg in the node queue and remove it from the queue
(d) Calculate the volumetric influx Qtot into node seg, which includes the
injection/production rate of all phases from all perforations on this node,
and the mixture flow rate from the associated connections with already
determined Qm
(e) Calculate the total cross-sectional area Atot of the associated connections
with undetermined Qm
(f) For each associated connection with undetermined Qm, if its cross-sectional
area is A, its Qm can be determined as:
Q_m = \frac{A}{A_{tot}} Q_{tot}.   (4.45)
In addition, set the status of this Qm as “determined” and push the node
on the other side (i.e., not seg) of this connection into the node queue
(g) Check if the node queue is empty: if yes, the process of determining mixture
flow rates is ended; otherwise, go to step (5c)
Figure 4.6: Initialization of mixture flow rates: (a) before and (b) after processing node seg. Connections with determined and undetermined Qm are distinguished, and nodes 3 and 4 are appended to the node queue after seg is processed.
To better understand the initialization of mixture flow rates, let us consider the
following example. As shown in Figure 4.6(a), we suppose: 1) seg is the first
node in the node queue, 2) Qm,1 and Qm,2 have already been determined, and
3) Qm,3 and Qm,4 are undetermined. In step (5d), we calculate the volumetric
influx Qtot as: Qtot = Qm,1+Qm,2 (Qm,1 and Qm,2 are determined). Then, in step
(5e), we calculate the total cross-sectional area Atot as: Atot = A3 + A4 (Qm,3
and Qm,4 are undetermined). Next, in step (5f), the undetermined mixture flow
rates, Qm,3 and Qm,4, are calculated as:
Q_{m,3} = \frac{A_3}{A_{tot}} Q_{tot}, \qquad Q_{m,4} = \frac{A_4}{A_{tot}} Q_{tot}.   (4.46)
In addition, the statuses of Qm,3 and Qm,4 are now set as determined. Nodes 3 and 4 are pushed to the end of the node queue, as shown in Figure 4.6(b).
This process continues until the node queue becomes empty.
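The queue-based procedure of steps (5a)-(5g) can be sketched as follows. This is a simplified sketch for a producing well: exit nodes are excluded from the starting set, argument names are illustrative, and sign conventions for inflow/outflow are omitted.

```python
from collections import deque

def init_mixture_rates(connections, areas, perf_rate, exit_nodes=()):
    """Initialize Qm on every connection by a FIFO sweep over nodes.
    connections: (i, j) node pairs; areas: connection cross-sections;
    perf_rate: net volumetric influx per node from its perforations."""
    adj = {}
    for k, (i, j) in enumerate(connections):
        adj.setdefault(i, []).append(k)
        adj.setdefault(j, []).append(k)
    Qm = {k: None for k in range(len(connections))}  # None = undetermined
    # step (5a): start from single-connection nodes, excluding exit nodes
    queue = deque(n for n, ks in adj.items()
                  if len(ks) == 1 and n not in exit_nodes)
    while queue:
        seg = queue.popleft()                        # step (5c)
        q_tot = perf_rate.get(seg, 0.0)              # step (5d)
        undetermined = []
        for k in adj[seg]:
            if Qm[k] is None:
                undetermined.append(k)
            else:
                q_tot += Qm[k]   # sign conventions omitted for brevity
        a_tot = sum(areas[k] for k in undetermined)  # step (5e)
        for k in undetermined:                       # step (5f), Eq. (4.45)
            Qm[k] = areas[k] / a_tot * q_tot
            i, j = connections[k]
            queue.append(j if i == seg else i)
    return Qm
```

For a toe-to-head chain with perforation influxes at the two lower segments, the sweep accumulates the rates toward the exit connection, as in the worked example above.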
4.6.2 Well calculation
The calculation sequence of the general MS well model is shown in Figure 4.7. This
procedure is performed during each Newton iteration, after the computation of the
residual equations (and associated Jacobian matrix) for the reservoir part. The high-
lighted steps (in red) are additional steps from the standard well calculation and are
specifically implemented for general MS wells:
• Property calculation for each node, including density (ρp), viscosity (µp), mo-
bility (λp), overall mole fraction (zc), and fugacity (fc,p). This step is essentially
the same as the property calculation of a reservoir node, except that the poros-
ity (φ) does not need to be calculated and is always equal to 1 in the wellbore
and that the volume-weighted average of mass density (γm) and viscosity (µm)
need to be calculated in addition to all other properties. The node-based inde-
pendent variables, with their latest values and gradients, are used in this step.
The values and gradients usually correspond to the updated ones in the last
Newton iteration, or the converged ones in the last timestep during the first
Newton iteration of every new timestep. At the beginning of the simulation,
the values estimated in the initialization process (see Section 4.6.1) and the gra-
dients determined according to the initial independent indices are used. This
also applies to the calculation of connection-based properties.
• Property calculation for each connection, including the phase flow rate (Q_p) and the hydrostatic (\Delta P^w_h), frictional (\Delta P^w_f), and acceleration-related (\Delta P^w_a) pressure drops.
The computation of the phase flow rate depends on the selection of the flux model, whereas the three types of pressure drops depend on the selection of pressure drop
models. Both the connection-based mixture flow rates (Qm) and the node-
based independent variables in the two neighbouring nodes, with their latest
values and gradients, are used in this step. See Section 4.3 for details about the
calculation of connection-based properties.
• Construction of MS well equations on each node (mass balance (4.2), energy
balance (4.5), local constraints (4.6) - (4.8)) and on each connection (pressure
drop relation (4.11)). The detailed forms of these equations are discussed in
Section 4.3. Various properties calculated in the previous steps are used in this
step. With the help of the AD framework, only the nonlinear residual code is needed, while the associated gradients are generated automatically.
Figure 4.7: Calculation sequence of the general MS well model. The flowchart proceeds as follows: for each node, compute density, viscosity, mobility, zc, and fugacity; for each perforation, compute component and phase rates; for each connection, compute the phase flow rate and the frictional and acceleration-related pressure drops; form the MS well residual equations (nodes and connections); compute the total rate and run a flash at surface conditions; update the reservoir residual equations (perforations); if the control needs to be switched, switch to the first viable control and repeat; otherwise, the well calculation is completed.
4.6.3 Updating of the well variables
The variable updating sequence of the general MS well model is shown in Figure
4.8. For each well node (Figure 4.8(a)), the updating sequence is similar to the
corresponding process for a reservoir node:
1. Application of the updates to a temporary set of variables
2. Nonlinear correction (e.g., Appleyard chop [70]) and enforcement of physical
limits
3. Detection of phase status change: if more than one phase appears in the node, we check if the updated holdups, or phase mole fractions, indicate any phase disappearance. Also, if not all phases appear in the node, we perform a phase stability test to check if any new phase should appear
4. Copying the updates back from the temporary set
This process is much more involved than that for a standard well, where only one
variable (BHP) gets updated directly. However, it does not introduce extra complex-
ity into the implementation, because most of the functions needed above already exist
for the update of reservoir variables and can be shared. For each connection (Figure
4.8(b)), the update is directly applied to the mixture flow rate without checking its
range, because the mixture flow rate can be either positive or negative, indicating the
actual flow direction.
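For a node whose updated variables are holdups, the four-step sequence might be sketched as follows. Names are illustrative, the chop factor is an assumption, and the phase-status logic is reduced to a disappearance check.

```python
def update_holdups(holdups, deltas, chop=0.2):
    """Node update sketch: tentative update on a temporary copy, step
    damping (in the spirit of the Appleyard chop [70]), physical
    limits, phase-disappearance check, then copy-back."""
    tmp = dict(holdups)                                    # step 1: temp set
    for phase, d in deltas.items():
        d = max(-chop, min(chop, d))                       # step 2: chop
        tmp[phase] = min(max(holdups[phase] + d, 0.0), 1.0)  # physical limits
    vanished = [p for p, a in tmp.items() if a == 0.0]     # step 3: detection
    holdups.update(tmp)                                    # step 4: copy back
    return holdups, vanished
```

Working on a temporary copy means a rejected or chopped update never leaves the node variables in a half-modified state, which is the point of steps 1 and 4.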
4.7 Multistage Preconditioner
The two-stage CPR (Constrained Pressure Residual) preconditioner, as described in
Section 3.4.5, has been extended to deal with the coupled reservoir-facilities system
with original MS wells and well groups [96,98]. Here we describe its extension for the
coupled system with general MS wells.
Figure 4.8: Variable update sequence of the general MS well model. (a) For each node: apply tentative Newton updates to a temporary set; apply nonlinear correction (e.g., Appleyard chop); enforce physical limits on the variables; if all phases appear, check whether the updated holdups indicate phase disappearance, otherwise perform a phase stability test to check whether any new phase should appear; finally, copy the values from the temporary set to the node variables. (b) For each connection: apply the Newton update directly to the connection variable, the mixture flow rate.
4.7.1 First stage: global on the pressure system
The basic idea for this stage is to first reduce the coupled system with advanced
facility models (e.g., general MS wells) to a system with only standard wells, and then
apply the true IMPES reduction for the reservoir part to get the reduced pressure
system. Thus, the question lies in how the equations and variables for an advanced
facility model are converted into those that resemble a standard well. For general
MS wells, the idea is similar to that for the original MS wells as described in [96,98],
i.e., we algebraically reduce the equations and variables for a general MS well into a
standard-well like equation and a single BHP variable. This approach assumes that
pressure differences among all nodes and mixture flow rates through all connections
are lagged by one iteration. That is, the pressure update for any node i is equal to
that for node 0: \delta P^w_i = \delta P^w_0, and the update of the mixture flow rate of any connection (i, j) is zero in this stage: \delta (Q_m)_{(i,j)} = 0. The detailed treatments depend on the applied
well constraint and are described in the following sections.
Constant-pressure constraint
Without loss of generality, we assume the constraint is defined on the 0th node. In
this case, the general MS well equations are directly reduced to P^w_0 = P^{target}, which has the same form as the pressure control equation of a standard well and guarantees that the pressure at the 0th node equals the given constant value of the well constraint. Typically, P^w_0 = P^{target} should already hold under the constant-pressure constraint. Thus, we will usually have \delta P^w_i = \delta P^w_0 = 0.
Constant-rate constraint
We assume that the constraint is defined at the exit connection that links the 0th
node with the outside region. By summing up the mass conservation equations (4.2)
over all components and all nodes of this general MS well, we will have:
\sum_{i=0}^{N_N-1} V_i \left[ \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n+1} - \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n} \right] - \Delta t \left( \sum_c \sum_p \rho_p Q_p x_{c,p} \right)_{(0,\,exit)} + \Delta t \sum_{i=0}^{N_N-1} \left( \sum_c \sum_p \rho_p x_{c,p} Q_p \right)_i = 0,   (4.47)
where N_N is the number of nodes in this MS well, and (0, exit) represents the exit
connection. Note that 1) all three terms (accumulation, flux, and source/sink) are
summed up over all components, and 2) the accumulation and source/sink terms are
summed up over all nodes, whereas the flux terms through exit connections are the
only ones that are kept in the resultant equation (4.47). This is because, for each
flux term (∑
p ρpQpxc,p)(i,j) that appears in the mass balance equation for component
c in node i, there is a corresponding flux term (∑
p ρpQpxc,p)(j,i) in the mass balance
CHAPTER 4. GENERAL MULTISEGMENT WELL MODEL 124
equation for component c in node j, except when j = exit, i.e., (i, j) is an exit
connection. Because we always have (∑
p ρpQpxc,p)(i,j) = −(∑
p ρpQpxc,p)(j,i), the flux
terms from nonexit connections cancel out in pairs, and those from exit connections
are left and thus included in the resultant equation (4.47).
Recall the rate control equation (4.19): (\nu^{sc}_j / \rho^{sc}_j) \cdot \left( \sum_c \sum_p \rho_p Q_p x_{c,p} \right)_{(0,\,exit)} = Q^{target}_j. Now, if we multiply this equation by \Delta t \cdot (\rho^{sc}_j / \nu^{sc}_j) and add the resultant equation to (4.47), we have
\sum_{i=0}^{N_N-1} V_i \left[ \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n+1} - \left( \sum_c \sum_p \rho_p \alpha_p x_{c,p} \right)_i^{n} \right] + \Delta t \sum_{i=0}^{N_N-1} \left( \sum_c \sum_p \rho_p x_{c,p} Q_p \right)_i = \Delta t \cdot \frac{\rho^{sc}_j}{\nu^{sc}_j} Q^{target}_j.   (4.48)
By applying the assumption that the node pressure differences are lagged by one
iteration, we can sum up the derivatives of this equation with respect to the pressure
at any well node to a single derivative of this equation with respect to a single pressure
variable P^w_0. In addition, we may treat the coefficient \Delta t \cdot (\rho^{sc}_j / \nu^{sc}_j) on the right-hand
side as a constant. Its value may change in every Newton iteration, but we may
neglect its derivatives for preconditioning purposes. As a result, for each general MS
well in the coupled system, we obtain a single equation expressed by (4.48), which
has a form that is similar to a rate-control equation of a standard well.
After the above reduction process, the standard true IMPES reduction may be
applied to obtain a pressure system that can be effectively solved by certain precon-
ditioners (e.g., AMG). The obtained pressure update to the reduced well variable P^w_0
is applied to the pressure of all nodes in this general MS well. The update to all the
other variables (αp, xc,p on each node, Qm on each connection) in the general MS well
is zero in the first stage.
4.7.2 Second stage: local on the overall system
In the second stage, the reservoir part (JRR) and the facilities part (JFF ) are solved
separately. However, due to the coupling matrices JRF and JFR, the right-hand-side
of one part may still be affected by the solution of the other. Because the cost of
solving the facilities part is usually much smaller than that of solving the reservoir
part, we use the following sequence for the second stage:
1. Solve the following linear system for a preliminary facility solution vector xpreF :
JFF · xpreF = bF . (4.49)
Here, the individual facility submatrices JWW,1, ..., JWW,Nw inside JFF are
solved one by one with the corresponding subvectors in xpreF and bF as the
solution and RHS vectors. Nw is the number of facility models in the coupled
system.
2. With x^{pre}_F obtained in the last step, we update the RHS vector of the reservoir part as:

b^{corr}_R = b_R - J_{RF} \cdot x^{pre}_F.   (4.50)

Then, we solve the following linear system for the reservoir solution vector x_R:

J_{RR} \cdot x_R = b^{corr}_R.   (4.51)
3. With x_R obtained in the last step, we update the RHS vector of the facilities part as follows:

b^{corr}_F = b_F - J_{FR} \cdot x_R.   (4.52)

Then, we refine the facility solution x_F by solving the following linear system:

J_{FF} \cdot x_F = b^{corr}_F.   (4.53)

Again, the individual facility submatrices J_{WW,1}, ..., J_{WW,N_w} are solved one by one with the proper solution and RHS vectors. The preliminary facility solution vector x^{pre}_F may be used as an initial guess for the final solution vector x_F during this step.
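The three-step second-stage sweep of Eqs. (4.49)-(4.53) can be sketched as follows. This is a pure-Python sketch on dense lists; solve_R and solve_F stand in for the BILU(0) and per-well GMRES solves, and solve_diag is an illustrative exact solver for diagonal matrices.

```python
def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def solve_diag(A, b):
    """Stand-in exact solver for diagonal matrices (illustration only)."""
    return [b[i] / A[i][i] for i in range(len(b))]

def second_stage(J_RR, J_RF, J_FR, J_FF, b_R, b_F, solve_R, solve_F):
    """Facility pre-solve, corrected reservoir solve, facility refinement."""
    x_F_pre = solve_F(J_FF, b_F)                                    # Eq. (4.49)
    b_R_corr = [b - c for b, c in zip(b_R, matvec(J_RF, x_F_pre))]  # Eq. (4.50)
    x_R = solve_R(J_RR, b_R_corr)                                   # Eq. (4.51)
    b_F_corr = [b - c for b, c in zip(b_F, matvec(J_FR, x_R))]      # Eq. (4.52)
    x_F = solve_F(J_FF, b_F_corr)                                   # Eq. (4.53)
    return x_R, x_F
```

Solving the cheap facilities part first and last amounts to one block Gauss-Seidel-like sweep that accounts for the coupling matrices J_RF and J_FR at little extra cost.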
Using the same treatment as in the original CPR preconditioner, a single sweep of the BILU(0) preconditioner is applied to the reservoir part (J_{RR}). For each general MS well, its submatrix inside the facilities part (J_{FF}) is solved by preconditioned GMRES to a tight tolerance, because it is observed that the accuracy of the facility solution can have a considerable impact on the overall linear convergence rate. Specifically, for the connection part (J_{CC}), the inverse (J_{CC}^{-1}) can be obtained directly, because J_{CC} has a pointwise diagonal structure. Thus, we may perform a Schur complement to obtain the following linear system:

\left( J_{NN} - J_{NC} J_{CC}^{-1} J_{CN} \right) x_N = b_N - J_{NC} J_{CC}^{-1} b_C.   (4.54)
The left-hand-side matrix should have the same structure as JNN , but contains dif-
ferent values. Due to the structural nature of JNN (and hence of the resultant matrix
from Schur-complement process), we may apply a BILU(1) preconditioner to obtain
the solution of the node part, xN . Afterwards, we calculate the solution to the
connection part, x_C, as:

x_C = J_{CC}^{-1} \left( b_C - J_{CN} x_N \right).   (4.55)
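Since J_CC is diagonal, the Schur-complement solve of Eqs. (4.54)-(4.55) can be sketched directly. This is a dense-list sketch with an illustrative function name; solve_node stands in for the BILU(1)-preconditioned node solve.

```python
def solve_well_system(J_NN, J_NC, J_CN, d_CC, b_N, b_C, solve_node):
    """Eliminate the connection part (diagonal entries d_CC), solve the
    node Schur system, then back-substitute for the connection part."""
    nN, nC = len(b_N), len(b_C)
    # Schur complement: S = J_NN - J_NC * d_CC^-1 * J_CN
    S = [[J_NN[i][j] - sum(J_NC[i][k] / d_CC[k] * J_CN[k][j]
                           for k in range(nC)) for j in range(nN)]
         for i in range(nN)]
    rhs = [b_N[i] - sum(J_NC[i][k] / d_CC[k] * b_C[k] for k in range(nC))
           for i in range(nN)]
    x_N = solve_node(S, rhs)                               # Eq. (4.54)
    x_C = [(b_C[k] - sum(J_CN[k][j] * x_N[j] for j in range(nN))) / d_CC[k]
           for k in range(nC)]                             # Eq. (4.55)
    return x_N, x_C
```

With an exact node solve, the recovered pair (x_N, x_C) satisfies both original block rows, which is what makes the diagonal J_CC elimination lossless.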
4.8 Nonlinear Solution: Local Facility Solver
With complex physical models (e.g., drift-flux model and all three types of pressure
drops), general MS well equations become highly nonlinear and converge slower than
the reservoir equations. By fixing the reservoir conditions, a local nonlinear solver that
iterates on the facility part only can be used to accelerate the Newton convergence.
The effectiveness of the local nonlinear solver depends on the following requirements:
• The relative cost per iteration is very low. This condition is valid when the
reservoir model is sufficiently large, e.g., the number of reservoir cells is larger
by at least two orders of magnitude than the total number of well nodes in all
the MS wells. In such cases, the solution cost for the facilities part will be much
smaller than that for the reservoir part.
• The reservoir and facilities parts in the coupled system can be decoupled eas-
ily both on the nonlinear and linear levels. On the nonlinear level, wells are
separate objects, such that the calculation of properties, construction of resid-
ual equations, and updating of the independent variables can be carried out
independently from the reservoir part. On the linear level, when the multilevel
block-sparse linear system structure is used, the facility matrix (JFF ) is an independent component in the global system matrix. Thus, it can be taken out and
solved separately without additional cost.
• The local facility solution will not yield a negative impact on the convergence
of the reservoir part (and hence of the overall system). Sometimes when the
facility solution is obtained with an inaccurate reservoir condition, it may produce
an overshoot in the facility variables, and thus slow down the convergence
of the overall system. In order to resolve this problem, we may activate the
local nonlinear solver only when the reservoir part is close to convergence, such
that the reservoir condition used in the local facility solution will be relatively
accurate. In this way, the oscillation in Newton iterations brought about by
the premature usage of the local nonlinear solver can be avoided. However, the
threshold that is used to determine whether the reservoir part is already close
to convergence must be chosen carefully. Too large a threshold may lead to
oscillations. Too small a threshold will limit the usage of the local nonlinear
solver, and thus diminish the savings in computational time.
The algorithm of the local nonlinear solver includes the following steps:
1. Perform nonlinear treatments for the flux-related properties in all perforated
reservoir cells (see Section 2.3.2). This is only needed when IMPES or IMPSAT
time discretization is used, because all the reservoir cells, including perforated
ones, are treated explicitly in the IMPES or IMPSAT formulation, whereas
in the FIM or AIM formulation, the perforated reservoir cells are always
treated implicitly.
2. Reset the local nonlinear iteration to zero
3. While the local nonlinear iteration is below the specified maximum value, do
the following:
(a) Compute the properties for all facility models and form their own residual
equations with a fixed reservoir condition
(b) Check the norm of the residual and the maximum change of variables in
the local facilities system: if both are below the threshold, the facilities
part has converged, go to 4; otherwise, continue the iteration process
(c) Solve the facility linear system extracted from the residual equations formed
above for the Newton update to the facility variables
(d) If the linear solution fails, report the error, and go to 4 without conver-
gence; otherwise, apply the Newton update to the facility variables and
perform variable switching if the phase state of any well node has changed
(e) If the Newton update fails for any facility model, restore the state of that
facility model, report the error, and go to 4 without convergence; otherwise,
continue the iteration process
(f) Increase the local nonlinear iteration by 1 and go to 3
4. Restore the nonlinear treatment applied in step 1
5. Report the number of local nonlinear iterations and that local nonlinear solution
has ended
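The loop in steps 1-5 above can be condensed into a minimal sketch. Here a toy scalar residual stands in for the facility equations, the reservoir state is frozen as a single pressure `p_res`, and all names and the residual form are hypothetical; variable switching and failure handling are omitted.

```python
# Minimal sketch of the local facility Newton loop with a fixed reservoir
# condition. The residual (p_res - p) - 0.05*p*p is an invented stand-in for
# the facility mass balance; it is NOT the actual AD-GPRS formulation.
def local_facility_solve(p, p_res, tol=1e-8, max_iter=20):
    for it in range(max_iter):
        # (a) form the residual with the reservoir condition held fixed
        r = (p_res - p) - 0.05 * p * p
        # (b) convergence check on the residual norm
        if abs(r) < tol:
            return p, it, True
        # (c) solve the (here 1x1) facility linear system for the Newton update
        drdp = -1.0 - 0.1 * p
        dp = -r / drdp
        # (d) apply the Newton update (variable switching omitted in this sketch)
        p += dp
    return p, max_iter, False

p, iters, converged = local_facility_solve(p=10.0, p_res=50.0)
assert converged
```

The local iteration count and convergence flag returned here correspond to the reporting in step 5.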
4.9 Numerical Examples
4.9.1 Two-dimensional reservoir with a dual-branch general
MS well
Figure 4.9: The reservoir and well configuration of example 1
The first example is set up to demonstrate the correctness of the model. The
reservoir and well configurations are shown in Figure 4.9. A two-dimensional reservoir
with 51× 51× 1 cells is initially filled with 100% C10. A standard injector is on the
left side of the reservoir injecting pure C1 at 10^5 m3/day. A general MS well, which
is composed of 31 nodes and 32 connections, has the following characteristics:
• Two producing branches separated from node 17 on the surface and perforated
in the reservoir at node 29 (with smaller WI, or well index) and at node 31
(with larger WI), respectively
• A loop (nodes 7-14) on the surface
• Wellhead pressure control at 20 bars applied on node 1. Note that the two
producing branches have no separate controls and will be governed by this
control only.
a) Node pressure b) Node saturation c) Connection mixture flow rate
Figure 4.10: The simulation results of example 1
The simulation results of highlighted nodes (9, 12, 21, 26) and connections (8→ 7,
14→ 10, 13→ 14, 7→ 11) are shown in Figure 4.10. There are two key observations.
First, the pressure and saturation of nodes 9 and 12 are exactly the same, while those
of nodes 21 and 26 are different. This is because node 21 connects to a perforation
with a smaller WI, while node 26 connects to a perforation with a larger WI. Hence,
we get higher pressure and earlier gas breakthrough in node 26. On the other hand,
because the two distinct producing branches join at node 17, and downstream of it
nodes 9 and 12 are symmetric in the loop, they have exactly the same node properties.
Second, the mixture flow rates of connections 14 → 10 and 8 → 7 have positive
values, whereas those of connections 13 → 14 and 7 → 11 have negative values. We
notice that the predefined directions of these four connections are as shown in Figure
4.9, whereas the actual flow directions may, or may not, be the same as the predefined
ones. For connections 14 → 10 and 8 → 7, their actual flow directions match the
predefined ones, so that the mixture flow rates are positive. On the contrary, for
connections 13 → 14 and 7 → 11, we know the flow actually goes in the reverse
directions from their predefined ones, and this results in negative mixture flow rates.
The sign (direction) and value (magnitude) of each mixture flow rate are determined
by the model automatically in each iteration, and this demonstrates that our general
MS well model can handle loops with arbitrary flow directions.
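The sign convention demonstrated above can be captured in a small sketch: each connection carries a predefined direction (from node, to node), and a negative mixture rate means the actual flow runs opposite to that direction. The rate values below are made up to mirror the four highlighted connections of Figure 4.9.

```python
# Signed mixture rates relative to predefined connection directions.
# Positive rate: flow follows the predefined (from, to) orientation;
# negative rate: flow is reversed. Values are illustrative only.
connections = {
    (14, 10): +3.2,
    (8, 7):   +1.5,
    (13, 14): -3.2,
    (7, 11):  -1.5,
}

def actual_direction(conn, rate):
    """Return the (upstream, downstream) node pair implied by the signed rate."""
    a, b = conn
    return (a, b) if rate >= 0 else (b, a)

# Connections with negative rates flow against their predefined orientation
assert actual_direction((13, 14), connections[(13, 14)]) == (14, 13)
assert actual_direction((8, 7), connections[(8, 7)]) == (8, 7)
```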
4.9.2 Upscaled SPE 10 reservoir with three multilateral pro-
ducers
The second example is based on the permeability and porosity fields of an upscaled
version of the SPE 10 problem, in which a simple 4×4×4 coarsening (i.e., 1/64th of
SPE 10) is employed. The reservoir and well settings are shown in Figure 4.11. The
reservoir is initially filled with 0% CO2, 5% C1, 25% C4, and 70% C10. There are two
standard vertical injectors, each injecting pure CO2 with BHP control at 140 bar.
Three multilateral producers, controlled separately by wellhead pressure of 45 bar,
are used in the example. Each producer has two horizontal branches, which are fully
Figure 4.11: The reservoir and well settings of example 2 with separate controls
perforated and composed of nodes 8-17 (the yellow dashed ellipse in Figure 4.11) and
of nodes 19-28, respectively.
a) oil rates of three producers b) properties of one producing branch
Figure 4.12: The simulation results of example 2 with separate controls
Figure 4.12(a) shows the oil rates of the three producers, from which we can see
that this is a problem with strong nonlinearity. We can also examine certain proper-
ties, such as the node pressure and the connection mixture flow rate, of any producing
branch. Figure 4.12(b) shows a profile of these properties for the highlighted branch
(yellow dashed ellipse in Figure 4.11) at the end of the simulation. Note that node 8
is at the bend between the inclined segments and horizontal segments, whereas node
17 is at the end of the horizontal segments. From node 17 to 8, the pressure decreases
and the connection mixture flow rate increases as a result of more incoming fluid from
the perforations.
Figure 4.13: The reservoir and well settings of example 2 with a group control
The group control is also tested by connecting three multilateral producers to a
common surface part, such that they are effectively translated into three branches
of a single MS well, as shown in Figure 4.13. The three producers join the surface
pipeline through a loop and a single wellhead pressure control at 40 bar is applied to
node 1, which is the outlet of the loop. As a consequence, three multilateral producers
are now governed by a group wellhead pressure control at the surface.
Figure 4.14(a) shows a comparison of the total oil rate between separate controls
and the group control, from which we can see the difference incurred by the group
a) total oil rate of three producers b) properties of the surface loop
Figure 4.14: The simulation results of example 2 with a group control
control. The profile of node pressure and connection mixture flow rate of the surface
loop are shown in Figure 4.14(b). Note the green dashed partition line in the surface
loop in both Figure 4.13 and 4.14(b). It is observed that from node 2 to 8, pressure
increases and the mixture flow rates are positive, whereas from node 8 to 13, pressure
decreases and the mixture flow rates are negative. This is again because of the flow
directions. The predefined flow direction is 13 → 12 → · · · → 3 → 2, whereas the
actual flow direction is 8 → 7 → · · · → 3 → 2 and 8 → 9 → · · · → 12 → 13 on
both sides of the green dashed line. As a result, mixture flow rates are positive in
connections 8 → 7, ..., 3 → 2, and negative in connections 13 → 12, ..., 9 → 8. The
general MS well model correctly recognizes the actual flow directions.
4.9.3 Linear solver performance
The third example is designed to test the linear solver performance. This example
is also based on the permeability and porosity fields of an upscaled version of the
SPE 10 problem. A simple 2×2×2 coarsening (i.e., 1/8th of SPE 10) is employed for
this example. A 9-component fluid is used with the following initial mole fraction:
1% CO2, 19% C1, 5% C2, 5% C3, 10% n-C4, 10% n-C5, 10% C6, 20% C8, and 20% C10.
The well geometries and locations are the same as in the second example, as shown
in Figure 4.11. Each of the two standard vertical injectors is injecting 90% CO2 and
10% C1 with BHP control at 150 bar. Each of the three multilateral producers is
governed by well head pressure control at 30 bar.
A complex physics model, including drift-flux model and all three types of pres-
sure drops, is used for the three multilateral producers, such that their Jacobian
matrices have complex structures and are hard to solve. Here, we compare three
preconditioning options:
• BILU(0): this is a one-stage preconditioner working directly on the overall
system. Specifically, BILU(0) is applied to the reservoir and BILU(1) is applied
to the general MS wells. We cannot obtain a converged linear solution if
BILU(0) is also applied to the general MS wells
• BILU(1): this is also a one-stage preconditioner working directly on the overall
system. The difference from the previous option is that BILU(1) is applied both
to the reservoir and to the general MS wells
• CPR: this is the two-stage preconditioner described in Section 4.7, with AMG
preconditioner applied in the first stage solution of the global pressure system,
and a set of BILU preconditioners applied to the second stage solution of the
local overall system. For the second stage, BILU(0) is applied to the reservoir
and BILU(1) to the general MS wells
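How the two stages of a CPR apply fit together can be sketched on a tiny 2-cell system with unknowns ordered [p1, s1, p2, s2]. A direct 2x2 solve stands in for the first-stage AMG on the pressure block, a diagonal (Jacobi) sweep stands in for the second-stage BILU, and the matrix values are made up; this is a schematic of the technique, not the AD-GPRS implementation.

```python
# Two-stage CPR-style preconditioner apply, demonstrated inside a
# preconditioned Richardson iteration (a stand-in for GMRES).
A = [[10.0, 1.0, -2.0, 0.0],
     [1.0, 8.0, 0.0, -1.0],
     [-2.0, 0.0, 10.0, 1.0],
     [0.0, -1.0, 1.0, 8.0]]
b = [1.0, 2.0, 3.0, 4.0]
P = [0, 2]  # indices of the pressure unknowns

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def solve2(M, r):
    """Direct solve of a 2x2 system (stand-in for the AMG pressure solve)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * r[0] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

def cpr_apply(r):
    # Stage 1: solve the reduced pressure system, expand back to the full space
    App = [[A[i][j] for j in P] for i in P]
    dp = solve2(App, [r[i] for i in P])
    x1 = [0.0] * 4
    for k, i in enumerate(P):
        x1[i] = dp[k]
    # Stage 2: local correction (diagonal sweep here) on the updated residual
    r2 = [r[i] - v for i, v in enumerate(matvec(A, x1))]
    return [x1[i] + r2[i] / A[i][i] for i in range(4)]

x = [0.0] * 4
for _ in range(50):
    r = [b[i] - v for i, v in enumerate(matvec(A, x))]
    x = [x[i] + dx for i, dx in enumerate(cpr_apply(r))]
assert max(abs(b[i] - v) for i, v in enumerate(matvec(A, x))) < 1e-8
```

The pressure correction in stage 1 resolves the global (low-frequency) coupling, so the cheap local sweep in stage 2 suffices for the remainder, which is why CPR needs far fewer Krylov iterations than a one-stage BILU in the comparison below.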
The benchmark was conducted on a single core of a Xeon X5520 CPU running at
2.27GHz with 24GB of RAM. The performance results, averaged over all Newton
iterations, are shown in Figure 4.15 for all three preconditioner options described
above. From Figure 4.15(a), we find that the number of linear solver iterations ob-
tained with CPR is one order of magnitude smaller than that obtained with BILU(0).
a) Number of linear solver iterations per Newton iteration
b) Linear solver time and total time per Newton iteration
Figure 4.15: The linear solver performance of example 3
Even compared with BILU(1), which has better convergence behavior but also much
higher preconditioning cost than BILU(0), CPR takes only 21% of the linear solver
iterations taken by BILU(1). Correspondingly, from Figure 4.15(b), we see a large
drop in the linear solver time per Newton iteration from 151.7s for BILU(0) and 70.6s
for BILU(1) to only 12.5s for CPR. Although CPR has higher preconditioning cost
per iteration than either BILU(0) or BILU(1), the cost of BLAS operations per it-
eration increases linearly with the number of iterations for the GMRES solver and
becomes prohibitively expensive for BILU(0) and BILU(1) at such high iteration
counts (if the GMRES restart length is set to a small number, BILU(0) and BILU(1)
will take even more iterations to converge). Notice that
the total cost per Newton iteration also decreases severalfold, from 159.1s for
BILU(0) and 78.4s for BILU(1) to 20.4s for CPR. The performance difference is quite
large, and we expect higher savings with CPR in even larger models.
4.9.4 Nonlinear solver performance
The fourth example is designed to test the performance of the local nonlinear solver
for the facilities part. The reservoir, well, and fluid configurations are identical to
those in the first part of the second example (with three separate general MS
wells) discussed in Section 4.9.2.
Here we compare three options of nonlinear solvers:
• Option 1: standard Newton. The local facility solver is not applied at all
• Option 2: Newton with the local facility solver, which gets activated only when
the reservoir part has already converged while the facilities part has not yet
converged
• Option 3: Newton with the local facility solver, which gets activated whenever
the reservoir part is close to convergence (i.e., when its normalized residual gets
within the range that is one order of magnitude larger than the convergence
threshold)
The benchmark was also conducted on a single core of a Xeon X5520 CPU. The
performance results (total number of Newton iterations and total simulation time)
are shown in Figure 4.16. We see a 33.5% decrease in the number of Newton iterations
for Option 2 compared to Option 1. Note that the activation of the local nonlinear
solver is quite conservative (i.e., only when the reservoir part has already converged)
in Option 2. By activating the local nonlinear solver more frequently in Option 3,
but not yet incurring oscillation, we achieve a further 9.3% decrease in the number of
Newton iterations from Option 2. Correspondingly, we decrease the total simulation
time of Option 1 by 34.0% when using Option 2, and achieve a further 8.7% decrease
from the total simulation time of Option 2 when using Option 3. The savings in
the total simulation time is proportional to the savings in the number of Newton
Figure 4.16: The nonlinear solver performance of example 4
iterations. Although the local facility solver is activated many more times in Option
3 (252 times) than in Option 2 (117 times), its cost is so small (< 1% of the total
simulation time) that it can essentially be ignored in both options. With a larger
reservoir, such as 1/8th of SPE 10 or the full SPE 10, the relative cost of the local facility
solver will be even smaller. Therefore, we should be able to achieve comparable or
greater savings in the total simulation time with the local facility solver.
4.9.5 Comparison of simulation results: AD-GPRS versus
Eclipse
The last example is designed to validate the simulation results of the coupled system
with general MS wells. When a general MS well has relatively simple geometry (e.g.,
without loops), a single exit, and no special segments, we can find an equivalent
representation of the well using the original MS well model, which is implemented
in the Eclipse [70] simulator. Here, we use the permeability and porosity fields of an
upscaled version of the SPE 10 problem. A simple 2×2×2 coarsening is employed,
Figure 4.17: The reservoir and well settings of example 5 (AD-GPRS versus Eclipse)
resulting in 30 × 110 × 42 cells. When a 4-component fluid (CO2, C1, C4, and C10)
is used, the memory used by the Eclipse 300 simulator exceeds the limit of a 32-bit
application. As a result, we drop the bottom seven layers of the upscaled model and
keep 30 × 110 × 35 cells. As shown in Figure 4.17, four multisegment injectors and
four multisegment producers are included. All the wells (injectors and producers) are
vertical and perforate all 35 reservoir layers. Each well is discretized into 46 segments
(and 46 connections). Each injector is injecting 90% CO2 and 10% C1 with well head
pressure control at 140 bar, whereas each producer is governed by well head pressure
control at 5 bar. The drift-flux model and all three types of pressure drops are used
for all the wells. Using the equivalent input data, we simulate the problem with both
AD-GPRS and Eclipse 300 for a total simulation time of 100 days.
The simulation results obtained by both AD-GPRS and Eclipse 300 are shown
in Figure 4.18. In Figure 4.18(a), the gas injection rate is plotted versus simulation
time for the four multisegment injectors (W1, W2, W5, and W6). We observe that for
a) Gas rates of all injectors b) Oil rates of all producers
Figure 4.18: Comparison of simulation results: AD-GPRS versus Eclipse
each injector, the rates obtained by Eclipse 300 (represented by dots) overlap
closely with those obtained by AD-GPRS (represented by a curve) throughout the
simulation. In Figure 4.18(b), the oil production rate is plotted versus simulation
time for the four multisegment producers (W3, W4, W7, and W8). For all producers,
we again achieve a good match between the rates obtained by Eclipse 300 and those
obtained by AD-GPRS. Thus, in this case where MS wells have only relatively simple
geometry, single exit, and no special segments, the consistency between the general
MS well model in AD-GPRS and the original MS well model in Eclipse is validated.
4.10 Concluding Remarks
In this chapter, we discussed the mathematical formulation, our generic design and
AD-based implementation, our advanced multistage linear solver, and the new nonlinear
solution strategy for the general MS well model. Each MS well is discretized into
nodes and connections. We define variables and write the governing equations both on
nodes and on connections. The general MS well model has the following advantages:
1) general branching that allows for complex well geometry, 2) loops with arbitrary
flow directions, 3) multiple exit connections with different constraints, and 4) special
nodes with various functionality (e.g., separators, valves). The first two advantages
have been demonstrated in the numerical examples, whereas the required features for
the last two advantages are well supported by the current framework but have not
yet been implemented into AD-GPRS. These features can be introduced in the future
as extensions to the model.
Regarding the linear solution of the model, the effectiveness of the specialized linear
preconditioner, which extends the standard two-stage CPR preconditioner, is
demonstrated using numerical examples by comparing it against popular single-stage
preconditioners. As demonstrated here, local nonlinear solution for the facilities part
can be used to accelerate the nonlinear convergence of the coupled system with general
MS wells.
Chapter 5
Multicore Parallelization
5.1 Introduction
In this chapter, we describe our OpenMP-based shared-memory parallelization of
AD-GPRS. This strategy has two parts: Jacobian generation and the linear solver.
Under the AD framework described in Chapters 1 and 2, parallelization of Jacobian
generation is achieved with a thread-safe extension of our AD library (ADETL) [92,
96], which utilizes the Thread-Local Storage (TLS) technique (see [26] for details)
to avoid data race conditions. With the thread-safe ADETL, computations such as
discretization, property calculation, and Newton updates are safely parallelized.
For the parallel linear solution, many investigations have been conducted in
this field [17, 20, 32, 48, 75]. Among these efforts, both shared-memory (e.g., using
OpenMP) and distributed-memory (e.g., using MPI) approaches have been studied.
In our approach, we first parallelize our own MultiLevel Block-Sparse (MLBS) ma-
trix data structure (see Section 3.3), and then we use a two-stage CPR (Constrained
Pressure Residual, see [88,89]) preconditioning strategy in the iterative solution pro-
cess. The latest parallel multigrid solver from Fraunhofer SCAI - XSAMG [31,76–78]
- is used as the 1st-stage pressure preconditioner. For the 2nd-stage, we employ the
Block Jacobi technique, with Block ILU(0) applied as the local preconditioner.
This OpenMP implementation had a small impact on the overall structure of the
object-oriented code, so that code maintenance and extension will be relatively easy.
5.2 Jacobian Generation
5.2.1 Thread-safe ADETL
The purpose of the thread-safe extension of ADETL is to parallelize all the compu-
tations (other than the linear solver) in AD-GPRS.
Static variables are used by ADETL in various stages of gradient computation
(e.g., the memory pool for generating AD expression objects; see [92, 96] for details).
Thus, data-race conditions may arise when the library is used directly in
a multithreading environment. In order to guarantee thread safety in the gradient
computation, the Thread-Local Storage (TLS) technique is utilized. The basic idea of
TLS is for each thread to contain a local instance of the static variable, which is only
accessible by its ‘owner thread’. The usage of TLS is illustrated in Fig. 5.1. When
integers a and b are declared as static variables without the TLS keyword, they are
contained in the global memory and are accessible by all threads. In contrast, if the
TLS keyword (‘__thread’ in Linux; ‘__declspec(thread)’ in Windows) is specified in the
declaration, each thread has a local copy of both variables and these local copies can
only be read or written to by the corresponding thread that owns them, thus avoiding
the potential data races.
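The same idea can be sketched with Python's `threading.local()`, which plays the role of the `__thread` / `__declspec(thread)` storage class: each thread sees its own copy of the "static" data. The per-thread pool below is an invented stand-in for the AD memory pool, not the ADETL implementation.

```python
# Thread-local storage sketch: each thread fills its own private pool
# without locks, because no other thread can reach its copy of tls.pool.
import threading

tls = threading.local()   # per-thread storage, analogous to a TLS static
results = {}

def worker(tid):
    tls.pool = []                 # each thread initializes its own pool
    for _ in range(100):
        tls.pool.append(tid)      # no lock needed: the pool is thread-private
    results[tid] = (len(tls.pool), set(tls.pool))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread filled its own pool without interference from the others
assert all(results[t] == (100, {t}) for t in range(4))
```

Had `pool` been a single shared global, the appends from different threads would interleave in one list; with thread-local storage each thread's gradient workspace stays isolated, which is exactly the race-avoidance property described above.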
Besides the thread-safe extension, global operations (e.g., switching of the inde-
pendent variables, backup and restoration of all variables) on the AD variable and
backup sets (adX and adX_n) are parallelized internally using OpenMP directives,
because they may only be called from a single thread at any time.
Figure 5.1: Illustration of Thread-Local Storage (TLS)
5.2.2 Parallel computations other than the linear solver
Here, we discuss the parallelization of the following computations: discretization,
property calculation, and the Newton update. For discretization (i.e., computation
of flux and accumulation terms), a partitioner, such as METIS [40], is used to create
subdomains such that the amount of repeated flux computations across subdomains
is minimized. By assigning one subdomain to each thread, high-overhead locking
can be avoided entirely. The memory bandwidth requirement is estimated to
be high for the stenciling operations associated with flux computations. For property
calculations, the computations are local and inherently parallel; thus, good scalability
is expected. For the Newton update of variables and phase states, the workload can be
quite different depending on the phase state. Thus, one may consider using dynamic
scheduling to achieve a better load balance.
The above descriptions apply to the reservoir part. For the facilities part, all
computations (switch of facility control, calculation of facility properties, application
of source/sink terms, and construction of facility residuals) at nonlinear level are
parallelized by assigning facility objects to different threads, because the facility ob-
jects are independent of each other and have quite different computational processes.
There is no requirement on the distribution of facility objects among threads. Dynamic
scheduling may also be considered because the workload can differ greatly when the
facility model is not uniform (e.g., standard well versus general MS well).
5.3 Linear Solver
5.3.1 Parallel matrix data structure
On the topmost level, our MLBS matrix is composed of four parts: JRR, JRF , JFR,
JFF , as described in Section 3.3. The submatrix JRR corresponds to the reservoir
part, and its size is usually much larger than the other submatrices. Thus, matrix
operations such as extraction from AD vector, algebraic reduction (two-step), Sparse
Matrix-Vector multiplication (SpMV), and explicit update (two-step) are parallelized
for JRR by decomposing it into NTh pieces, where NTh is the number of threads.
Each piece contains approximately NB/NTh block rows (NB being the total number
of block rows) and is assigned to one thread.
Because the locations of block nonzero entries in JRR are recorded in the generalized
connection list, which uses the CSR format internally, each thread can find and access
the block nonzero entries in its corresponding block rows directly. Suppose block rows
r_i to r_{i+1} − 1 belong to thread i; then the indices of the associated block diagonal
entries are r_i, r_i + 1, ..., r_{i+1} − 1, and the indices of the corresponding block
off-diagonal entries are row_ptr(r_i), row_ptr(r_i) + 1, ..., row_ptr(r_{i+1}) − 1, where
row_ptr is the row-pointer array of the generalized connection list.
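The row-wise partitioning described above can be sketched as follows: contiguous chunks of CSR rows are assigned to threads, and each thread performs its share of the sparse matrix-vector product using the shared row-pointer array. Scalar entries stand in for the dense sub-blocks of JRR, and the small matrix is made up.

```python
# Row-partitioned parallel SpMV over a CSR matrix. Each thread owns a
# contiguous range of rows and writes only its own entries of y, so no
# synchronization is needed during the multiply.
import threading

# CSR data for a small 6x6 block-diagonal-dominant test matrix
row_ptr = [0, 2, 4, 6, 8, 10, 12]
col_idx = [0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5]
vals = [4.0, -1.0, -1.0, 4.0, 4.0, -1.0, -1.0, 4.0, 4.0, -1.0, -1.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n, n_threads = 6, 3
y = [0.0] * n

def spmv_rows(r_lo, r_hi):
    # The thread owning rows [r_lo, r_hi) locates its nonzeros via row_ptr
    for r in range(r_lo, r_hi):
        y[r] = sum(vals[k] * x[col_idx[k]]
                   for k in range(row_ptr[r], row_ptr[r + 1]))

chunk = n // n_threads
threads = [threading.Thread(target=spmv_rows, args=(t * chunk, (t + 1) * chunk))
           for t in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Compare against a serial SpMV
y_serial = [sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(n)]
assert y == y_serial
```

In AD-GPRS each "entry" is a dense block rather than a scalar, but the ownership pattern (contiguous block rows per thread, located through the shared row-pointer array) is the same.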
For the facility-related submatrices (JRF , JFR, and JFF ), each of them is again
composed of several second-level submatrices (e.g., JFF is composed of JWW,1,
JWW,2, ..., JWW,NF, where NF is the number of facilities in the simulation). Be-
cause there is no direct coupling between different facility objects on the nonlinear
level, the second-level submatrices in JRF , JFR, or JFF are decoupled and can be
operated on concurrently. Thus, parallelization for the facility-related submatrices is
achieved by assigning their second-level submatrices to different threads and perform-
ing computations on them in parallel. This is similar to our treatment of the facility
objects at the nonlinear level.
5.3.2 First stage pressure solution — XSAMG preconditioner
Given the parallel matrix data structure, the presolution (matrix extraction, algebraic
reduction) and post-solution (explicit update) steps, as well as the BLAS operations in
the iterative linear solver, can be performed in parallel. The only step that remains serial
in the linear solution is the preconditioning. As described in Section 3.4.5, for fully
implicit reservoir simulation problems, the two-stage CPR preconditioner has proved
to be a very effective choice. Recall that in CPR, we first solve a global pressure
system that is algebraically reduced from the primary system, and then solve locally
an overall primary system.
For the first-stage pressure solution, the latest parallel multigrid solver from Fraun-
hofer SCAI, XSAMG [31], is used. The idea of XSAMG is illustrated in Fig. 5.2.
The original SAMG preconditioner runs on a single computational node. If the major
simulation cost is in the SAMG linear solution, then by simply replacing SAMG with
the new XSAMG preconditioner, which runs on several computational nodes, considerable
savings can be achieved. That is, XSAMG provides us with a mechanism to
solve the pressure system on multiple computational nodes, regardless of how the rest
of the simulator is parallelized (e.g., multicore parallel or even serial). The advantage
of this approach is that little effort is required in code modification. One only needs
to add xsamg init and xsamg finalize at the beginning and end of the code, and
replace the call to SAMG with that to XSAMG.
The overall parallel efficiency is limited by Amdahl’s law. In the fully implicit
reservoir simulation, depending on the fluid complexity, the pressure solution can
take around 5% (very complex fluid) to 50% (very simple fluid) of the total simulation
time. Thus, it is usually not sufficient to just parallelize the pressure solution, which
would only yield 1.05X (X - times, similarly hereinafter) to 2X speedup of the overall
simulation cost.
Figure 5.2: Illustration of XSAMG preconditioner [31]
When applying XSAMG to the first stage of CPR, there are two important con-
cerns. First, the ‘first touch’ policy is critical in optimizing the performance of
XSAMG. On workstations with NonUniform Memory Architecture (NUMA), the
memory subsystem is divided to allow each CPU to have both local and remote
memory accesses. Accesses to the local memory are faster than accesses to the re-
mote memory. Memory is managed in fixed-size units called “pages”. Under the
‘first touch’ policy, the physical memory backing a page is allocated on the NUMA
node of the first CPU that touches (i.e., reads from or writes to) it; the page is thus
bound to that CPU, allowing faster subsequent accesses by that CPU. Therefore,
a parallel assembly process based on a partition of the matrix rows is needed for
XSAMG such that the matrix coefficients in each part of the rows are first accessed
by the corresponding CPU that will later perform computations on them during the
setup and solution phases. Second, a partial setup strategy is used to improve the
performance and parallel scalability. The setup of XSAMG contains two steps: 1)
coarsening and interpolation, which has suboptimal scalability; and 2) update of the
Galerkin operators, which has good scalability. Given a new pressure matrix, com-
plete setup involves both steps and scales poorly with the growing number of cores.
In contrast, a partial setup performs only the second step, which scales
much better. Based on our experience, it is usually safe to call the complete setup
only for the first Newton iteration of a timestep, and call a partial setup for each of
the remaining Newton iterations. By applying this partial setup strategy, we observe
considerable improvement in the scalability for the setup phase of XSAMG and no
severe degradation in the convergence.
5.3.3 Second stage overall solution — Block Jacobi/BILU
preconditioner
After applying the pressure update from the first stage of the CPR method, we need to
perform the second stage preconditioning to obtain an overall solution. Because most
of the global coupling is resolved in the first stage, a block Jacobi strategy has been
applied to the second stage solution, with BILU(0) or BILU(k) selected as the local
preconditioner.
At the beginning of the simulation, the reservoir matrix is partitioned into sev-
eral outer blocks with similar sizes. Note that this partition can be the same as,
or different from, the partitioning used for the discretization computations. Better
convergence behavior can be achieved if the partitioning minimizes the interdomain
connectivities, which are reflected in the transmissibility of interdomain connections,
or approximately, in the (neglected) pressure derivatives.
The idea of the second stage solution is shown in Fig. 5.3. It is worth mentioning
that each submatrix JRR,i also has a block sparse structure and shares the data
storage with the original reservoir matrix JRR. In addition, the structure of JRR,i is
Figure 5.3: Illustration of Block Jacobi preconditioner
created in the beginning and kept the same if the partitioning is fixed. Thus, there
is no additional setup, or data transfer, cost associated with JRR,i in each Newton
iteration.
Moreover, due to the consistency in the data format of JRR,i and the original JRR,
we may have an arbitrary choice of preconditioners for the submatrices, e.g., BILU(0)
or BILU(k). Also, because the block Jacobi preconditioner has the same interfaces as
the original BILU preconditioners, only trivial modifications are required in the second
stage of CPR. With a small number of outer blocks, the impact of using block-Jacobi
on convergence is mild: the number of linear solver iterations increased by less than
10% in our test problems.
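A minimal sketch of this second-stage strategy follows (numpy only; `make_block_jacobi` is a hypothetical helper, and exact local inverses stand in for the BILU(0)/BILU(k) factorizations used in AD-GPRS).

```python
import numpy as np

def make_block_jacobi(A, partition):
    """Sketch of the second-stage Block Jacobi preconditioner: the reservoir
    matrix is split into outer blocks, interdomain couplings are neglected,
    and each diagonal block gets its own local preconditioner.
    `partition` is a list of index arrays, one per outer block."""
    # Factorize each diagonal outer block once (the "setup" phase).
    factors = [(idx, np.linalg.inv(A[np.ix_(idx, idx)])) for idx in partition]

    def apply(r):
        # Local solves never touch each other's rows: trivially parallel.
        z = np.zeros_like(r, dtype=float)
        for idx, Binv in factors:
            z[idx] = Binv @ r[idx]
        return z

    return apply
```

Because the local solves are fully independent, each outer block can be handled by a separate thread, which is the point of the design.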
5.4 Parallel benchmark
The full SPE 10 model [19] of 60×220×85 (1.1M) cells with all of its horizontal layers
skewed and distorted as explained in [99] is used for the parallel benchmark. Three
different spatial discretization schemes are tested: TPFA (7-pt), MPFA L-method (9-
pt), and MPFA O(0)-method (11-pt). A line-drive well pattern with two injectors and
two producers is used. We simulated the model in parallel AD-GPRS with thread-
safe ADETL and the two-stage CPR-based linear solver (XSAMG + Block Jacobi /
BILU) on a workstation with dual quad-core Xeon E5520 CPU at 2.27GHz and 24GB
of RAM.
Figure 5.4: Performance result of the full SPE 10 model with TPFA discretization. Parallel speedups with 2, 4, and 8 threads are shown for the nonlinear kernels (discretization, properties, Newton update), the linear kernels (matrix extraction and algebraic reduction, linear solution, XSAMG preconditioning, BILU preconditioning), and the total simulation time (1.9X, 3.1X, and 4.8X, respectively).
The benchmark results for the TPFA (7-pt) discretization are shown in Fig. 5.4.
We observe that the speedups for the nonlinear (other than the linear solver) compu-
tations are generally good, especially for the properties calculation, where the local
computations put little stress on the memory bandwidth. The overall speedup for
the “nonlinear” part is 6X with 8 threads. On the other hand, the speedups for linear
solver computations are usually lower, due to the much higher requirement on the
memory bandwidth. The overall speedup for the linear part is 3.8X with 8 threads.
As a result, the total parallel speedup is 4.8X with 8 threads.
Figure 5.5: Performance result of the full SPE 10 model with MPFA O-method discretization. Parallel speedups with 2, 4, and 8 threads are shown for the nonlinear kernels, the linear kernels, and the total simulation time (1.9X, 3.1X, and 5.2X, respectively).
For the MPFA O-method (11-pt) discretization, as shown in Fig. 5.5, the per-
formance is somewhat similar to that of TPFA. The same overall speedup for the
nonlinear part has been achieved, whereas the speedup for the linear part is actually
a bit higher (4.5X with 8 threads). The total parallel speedup is hence increased
to 5.2X with 8 threads. The result for MPFA L-method (9-pt) lies somewhere in
between: a total parallel speedup of 5X has been achieved with 8 threads.
5.5 Concluding Remarks
The multithreading shared-memory parallelization strategy of AD-GPRS has been
described in this chapter. This OpenMP approach had a small impact on the overall
structure of the object-oriented code. Parallel Jacobian construction is achieved with
a thread-safe extension of ADETL, which utilizes the Thread Local Storage (TLS)
technique to avoid data-race conditions. For the linear solution, the matrix oper-
ations in the MLBS data structure are first parallelized. Then, a two-stage CPR
(Constrained Pressure Residual) preconditioning strategy, combining XSAMG and
Block Jacobi/BILU, is used. The implementation ensures that the data transfer cost
between the global matrix and the local preconditioning matrices is minimized.
The benchmarking results are obtained by simulating the full SPE 10 problem
(with three discretization schemes applied on the manipulated nonorthogonal grid)
using parallel AD-GPRS on multicore platforms. We find that, on average, the
speedup is about 5.0X on an 8-core (dual quad-core Nehalem) node. The parallel
performance is analyzed for the different nonlinear and linear kernels in the simulation.
Specifically, the average speedup is about 6.0X for the nonlinear computations and
4.1X for the linear solution.
For future work, the hybrid MPI/OpenMP parallelization, which may be consid-
ered as an extension of the current OpenMP parallelization, is a possible direction.
In that case, the distributed-memory parallel multigrid solver, SAMGp, may be used
to replace XSAMG. Also, we may consider the parallelization using emerging HPC
(High Performance Computing) architectures for certain components in the simula-
tor. As an example, in the next chapter, we will describe the GPU parallelization
of the Nested Factorization algorithm. Significant speedups can be achieved if the
algorithm, through proper modification, can fit the computing architecture well.
Chapter 6
GPU Parallelization of Nested
Factorization
6.1 Introduction to GPU Architecture
Among the emerging parallel architectures for high performance computing, GPGPU
(General-Purpose computing on Graphics Processing Units) platforms have drawn a
lot of attention in recent years. The GPU was originally devoted to graphics process-
ing and acceleration. With the help of GPGPU computing languages and libraries,
the GPU can be used to perform computations that have traditionally been handled
by the CPU. Currently, the dominant proprietary GPGPU computing framework
is NVIDIA’s CUDA (Computing Unified Device Architecture) [54, 55], whereas the
dominant open language for GPGPU is OpenCL (Open Computing Language) [41].
The GPU parallel algorithm described in this chapter has been implemented using
CUDA and integrated with AD-GPRS.
To better understand how the GPU platform can be used for computationally in-
tensive kernels, let us take a look at the architecture of a GPU. Here, we use NVIDIA’s
Fermi architecture [52] as an example. As shown in Figure 6.1(a), each Fermi chip
Figure 6.1: Illustration of the Fermi GPU architecture [52]: (a) Fermi GPU chip; (b) Fermi Streaming Multiprocessor (SM)
contains 16 Streaming Multiprocessors (SM) (e.g., the one enclosed in the red box),
a thread scheduler and dispatcher (GigaThread), an interface to the host (via PCIe
bus), a shared L2 cache, and up to 6GB of GDDR5 (Graphics Double Data Rate,
version 5) DRAM. The internal structure of each SM is shown in Figure 6.1(b). Each
SM contains 32 CUDA cores, 16 Load/Store units (LD/ST), four Special Function
Units (SFU), two ‘warp’ schedulers, and 64KB of configurable shared memory and
L1 cache. So, there are 512 CUDA cores per Fermi chip. The current top-tier Fermi-
architecture product - NVIDIA Tesla M2090 - has a peak capacity of 1331 and 665
GFlops for single- and double-precision computations, respectively. The M2090 has
a peak memory bandwidth of 177GB/s. These peak performance numbers are high
compared with state-of-the-art CPUs. However, the actual performance of an algo-
rithm depends strongly on the ability to utilize the computational resources in the
GPU and may differ significantly from the raw peak numbers.
6.2 Nested Factorization
Nested Factorization (NF) [7] is a preconditioning method built on the nested tridi-
agonal structure of the Jacobian matrix generated for a structured grid. There are
three levels in this nested structure: planes, lines, and cells. Correspondingly, when
applied to a grid with NPl×NLi×NCe cells (NPl: number of planes; NLi: number of
lines in each plane; NCe: number of cells in each line), the solution strategy of NF is:
1. Factorize/solve all two-dimensional planes i (1 ≤ i ≤ NPl) in the entire three-
dimensional domain. If we group the equations and variables of each plane into
a block, the matrix A has a block tridiagonal structure:
A = \begin{pmatrix}
D_1 & U_1 & 0 & \cdots & 0 \\
L_2 & D_2 & U_2 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & L_{N_{Pl}-1} & D_{N_{Pl}-1} & U_{N_{Pl}-1} \\
0 & \cdots & 0 & L_{N_{Pl}} & D_{N_{Pl}}
\end{pmatrix}. \qquad (6.1)
We use the following notation for this block tridiagonal structure:
A = \mathrm{TriDiag}_{i=1}^{N_{Pl}} \left( L_i, D_i, U_i \right), \qquad (6.2)
where Li contains the derivatives of equations of plane i with respect to variables
of plane i − 1, Di contains the derivatives of equations of plane i with respect
to variables of plane i, and Ui contains the derivatives of equations of plane i
with respect to variables of plane i+ 1. On the first level, the block tridiagonal
solution of A is serial.
2. Factorize/solve all one-dimensional lines j (1 ≤ j ≤ NLi) inside each two-
dimensional plane i. If we group the equations and variables of each line into a
block, the submatrix D_i also has a block tridiagonal structure (L_i is eliminated
using the solution of plane i − 1, whereas U_i is assimilated into D_i through an
approximation, e.g., relaxed row sum, column sum, or simply diagonal):

D_i = \mathrm{TriDiag}_{j=1}^{N_{Li}} \left( L_j^i, D_j^i, U_j^i \right), \qquad (6.3)

where L_j^i contains the derivatives of equations of line j with respect to variables
of line j − 1 (both in plane i), D_j^i contains the derivatives of equations of line j
with respect to variables of line j, and U_j^i contains the derivatives of equations
of line j with respect to variables of line j + 1. On the second level, the block
tridiagonal solution of D_i is again serial.
3. Factorize/solve all cells k (1 ≤ k ≤ N_{Ce}) within each one-dimensional line j.
The local submatrix D_j^i also has a tridiagonal structure (L_j^i is eliminated using
the solution of line j − 1, whereas U_j^i is again assimilated into D_j^i approximately):

D_j^i = \mathrm{TriDiag}_{k=1}^{N_{Ce}} \left( l_k^{i,j}, d_k^{i,j}, u_k^{i,j} \right), \qquad (6.4)

where l_k^{i,j} is the derivative of the equation of cell k with respect to the variable of cell
k − 1 (both in plane i, line j), d_k^{i,j} is the derivative of the equation of cell k with
respect to the variable of cell k, and u_k^{i,j} is the derivative of the equation of cell k with
respect to the variable of cell k + 1. On the third level, the tridiagonal solution of
D_j^i is usually serial.
When the two-stage CPR preconditioning strategy [88, 89] is applied, NF can
be used as the preconditioner for the first-stage pressure system, as well as for the
second-stage full Jacobian. When used as the pressure preconditioner, our experience
indicates that NF usually converges more slowly than AMG [76, 77].
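At the innermost level, the serial tridiagonal solve is the classical Thomas algorithm; a minimal reference sketch (ours, not the AD-GPRS implementation) follows.

```python
def thomas_solve(a, d, c, b):
    """Serial tridiagonal solve (Thomas algorithm), as used at the innermost
    (cell) level of NF. a = sub-diagonal (a[0] unused), d = diagonal,
    c = super-diagonal (c[-1] unused), b = right-hand side."""
    n = len(d)
    dd, bb = list(d), list(b)
    for k in range(1, n):              # forward elimination: cell k uses k - 1
        w = a[k] / dd[k - 1]
        dd[k] -= w * c[k - 1]
        bb[k] -= w * bb[k - 1]
    x = [0.0] * n
    x[-1] = bb[-1] / dd[-1]
    for k in range(n - 2, -1, -1):     # back substitution
        x[k] = (bb[k] - c[k] * x[k + 1]) / dd[k]
    return x
```

The forward sweep makes each cell depend on its predecessor, which is exactly why this level is serial.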
6.3 Massively Parallel Nested Factorization
The standard NF algorithm is serial. In order to extend the algorithm to GPU
systems, where thousands or more concurrent threads are required, significant modi-
fications are needed. The so-called Massively Parallel Nested Factorization (MPNF)
method was proposed by Appleyard et al. [8].
Two important concepts are introduced in MPNF: kernel and color. As defined
in [8], a kernel is a group of cells - treated as a unit - with an exact inverse. For
example, a row of cells in a model is a kernel, because the exact solution can be
obtained from a tridiagonal solve. Given a coloring strategy and the total number of
colors, each kernel is assigned a color, such that no adjacent kernels have the same
color. Then, by reordering the equations and variables first by color, then by kernel,
and last by cell, the MPNF solution process is as follows:
1. Factorize/solve all colors c (1 ≤ c ≤ Ncolor, Ncolor: the number of colors) in the
entire domain. By grouping the equations and variables of each color into a
block, the matrix A will have a block tridiagonal structure:
A = \mathrm{TriDiag}_{c=1}^{N_{color}} \left( L_c, D_c, U_c \right), \qquad (6.5)
where Lc contains the derivatives of the equations of color c with respect to the
variables of color c − 1, Dc contains the derivatives of the equations of color c
with respect to the variables of the same color, and Uc contains the derivatives
of the equations of color c with respect to the variables of color c+1. Therefore,
on the first level of MPNF, the block tridiagonal solution of A is still serial.
2. Factorize/solve all kernels r (1 ≤ r ≤ Nker(c), Nker(c): the number of kernels with
color c) that are assigned color c by the selected coloring strategy. Recall that
the coloring is such that no adjacent kernels have the same color. Consequently,
there will be no derivatives of the equations of one kernel r with respect to
variables of another kernel r′, given that kernels r and r′ have the same color
c. Thus, by grouping the equations and variables of each kernel into a block,
the submatrix Dc has a block-diagonal structure (Lc is eliminated using the
solution of color c − 1, whereas U_c is assimilated into D_c approximately):

D_c = \begin{pmatrix}
D_1^c & 0 & \cdots & \cdots & 0 \\
0 & D_2^c & & & \vdots \\
\vdots & & \ddots & & \vdots \\
\vdots & & & D_{N_{ker}(c)-1}^c & 0 \\
0 & \cdots & \cdots & 0 & D_{N_{ker}(c)}^c
\end{pmatrix}, \qquad (6.6)

where D_r^c contains the derivatives of equations of kernel r in color c with respect
to the variables of the same kernel. On the second level of MPNF, the solution
of D_c is parallel: for each color, we can employ as many concurrent threads as
there are kernels in that color (N_{ker}(c)). This is the computational
kernel in the algorithm where massive parallelism is exploited.
3. Factorize/solve all cells k (1 ≤ k ≤ N_{Ce}) within each kernel r. The innermost
level of MPNF is the same as that of NF. That is, the submatrix D_r^c has the
following tridiagonal structure:

D_r^c = \mathrm{TriDiag}_{k=1}^{N_{Ce}} \left( L_k^{c,r}, D_k^{c,r}, U_k^{c,r} \right), \qquad (6.7)

where L_k^{c,r}, D_k^{c,r}, and U_k^{c,r} have the same meaning as was explained in the third
level of NF. At this level, the tridiagonal solution of D_r^c is serial, or has limited
parallelism.
Now we discuss the coloring strategy. The basic strategy is the checkerboard
coloring, where each kernel is alternately assigned one of two colors, as shown in
Fig. 6.2(a). This is the simplest strategy and will yield a valid coloring of all kernels
for simple TPFA (Two-Point Flux Approximation) stencils. However, the resulting
preconditioner has poor convergence behavior. A more accurate preconditioner can
be obtained via a multicoloring strategy, such as cyclical (colors are assigned in the
order of 1, 2, ..., Ncolor−1, Ncolor, 1, 2, ...) and oscillatory coloring (colors are assigned
in the order of 1, 2, ..., Ncolor − 1, Ncolor, Ncolor − 1, Ncolor − 2, ...) [8]. In cyclical
coloring, because kernels with the last and first color are adjacent, the block matrix
on the outermost level deviates slightly from the tridiagonal structure (one additional
entry in the first and last block row). On the other hand, oscillatory coloring will
always yield a block tridiagonal matrix on the outermost level, because its first color
is only connected with the second color, and its last color is only connected with the
second last color. An example of a 4-color oscillatory coloring is shown in Fig. 6.2(b).
As stated in [8], oscillatory coloring usually produces a more accurate preconditioner
(and hence better convergence behavior), given the same number of colors.
Figure 6.2: Examples of different coloring strategies: (a) checkerboard coloring; (b) 4-color oscillatory coloring
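The two multicoloring assignments described above can be written down directly (an illustrative sketch; the function names are ours, and colors are 1-based as in the text).

```python
def cyclical_colors(n_kernels, n_colors):
    """Cyclical coloring: 1, 2, ..., Ncolor, 1, 2, ...
    Kernels with the last and first color end up adjacent."""
    return [k % n_colors + 1 for k in range(n_kernels)]

def oscillatory_colors(n_kernels, n_colors):
    """Oscillatory coloring: 1, 2, ..., Ncolor, Ncolor-1, ..., 2, 1, 2, ...
    Adjacent kernels always receive colors differing by exactly one, so the
    color-blocked matrix stays block tridiagonal."""
    period = 2 * (n_colors - 1)
    out = []
    for k in range(n_kernels):
        p = k % period
        out.append(p + 1 if p < n_colors else period - p + 1)
    return out
```

For Ncolor = 4, the oscillatory pattern 1, 2, 3, 4, 3, 2 repeats, so no wrap-around coupling between the first and last colors ever appears; the cyclical pattern wraps from 4 back to 1.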
For small problems, the strategy of assigning more than one thread to a kernel is
described briefly in [8]. This can be achieved using parallel tridiagonal solution meth-
ods, such as twisted factorization and cyclic reduction. The algorithmic descriptions
(with no GPU implementation details) of both methods are documented in [81].
It is reported in [8] that the speedup of the GPU-based MPNF-preconditioned
linear solver ranged from 5.7X to 14.3X for large problems (with 100,000, or more,
cells). Those results were obtained on a workstation with dual Intel Xeon X5550
CPUs (each CPU has four cores at 2.66GHz and 8MB cache), 12GB of RAM, and an
NVIDIA Tesla C2050 GPU (448 CUDA cores at 1.15 GHz, 3GB memory).
6.4 Our CUDA-based Implementation
6.4.1 Basic features
Based on the description above, we have redesigned and implemented the MPNF al-
gorithm based on CUDA. The key features of our GPU-based MPNF implementation
are:
• Support computation in single- and double-precision (denoted as SP and DP
below)
• Support checkerboard and three forms of oscillatory coloring strategies
• Use CUDA-based GMRES solver as an accelerator for the pressure system
• Use cuBLAS library [53] for BLAS (copy, scal, axpy, etc.) and reduction (dot
product) operations
• Use HYB (hybrid) format, which is a combination of EllPack [67] and COO
(coordinate list) format, in cuSPARSE library [57] for Sparse Matrix-Vector
multiplication (SpMV) in GMRES solver
• Support asynchronous memory transfer from CPU to GPU during the setup
(factorization) phase
The last feature is worth a bit more discussion. Asynchronous memory transfer is
used to hide the latency of memory transfer due to the limited bandwidth of the PCIe
bus, which connects the CPU and the GPU. Latency hiding is achieved by copying
the matrix entries of the next color, c + 1, from the CPU to the GPU, while the
current color, c, is being set up. NVIDIA GPUs have separate copy engines from the
processing engines; thus, they can handle asynchronous memory transfer while the
CUDA cores are busy computing. To ensure that the necessary data are ready before
the next matrix factorization, synchronization of the GPU device is applied, such
that the setup procedure of the next color, c + 1, will not begin until the following
conditions are satisfied:
• The contribution from the current color, c, to the next color, c + 1 (through
submatrix Uc), has been calculated
• Matrix entries of the next color, c + 1 (in submatrices Lc+1 and Dc+1), have
been transferred from the CPU to the GPU
6.4.2 Runtime profiles
The runtime profile of the MPNF preconditioner is obtained using the top 10 layers
of the SPE 10 reservoir model [19] with 60× 220× 10 (132 thousand) cells and two-
phase flow. Note that all of the results reported here are for the pressure system of
equations. The SP (Single-Precision) runtime profile is shown in Fig. 6.3(a).
We observe that the setup phase - Setup (factorization) and Create HYB matrix -
occupies a very small fraction (0.7% + 0.6%) of the total MPNF cost. In the solution
phase, the cost is dominated by BLAS (29.8%, mainly axpy) and reduction (36.3%,
dot product) operations, while the MPNF solution kernel takes less than 30% of the
total time. This is mainly because 1) the number of NF iterations in each pressure
solution is high (above 20, on average), and there are several pressure solutions per
overall CPR linear solution, so that the setup cost is usually much less than the
solution cost; 2) dot-product operations are not a good fit for the SIMT (Single
Instruction Multiple Thread) GPU architecture, because the number of participating
CUDA cores is halved for each reduction step. On the other hand, the computational
density (per memory access) is low for axpy operations. These operations will be
Figure 6.3: Runtime profile of the GPU-based MPNF preconditioner for solving the pressure system of the top 10 layers of the SPE 10 reservoir model in single-precision. (a) Basic implementation, total cost 112.3s: setup (factorization) 0.7%, create HYB matrix 0.6%, reorder/copy of B/X 1.0%, BLAS operations 29.8%, reduction (dot/norm) 36.3%, solve by preconditioner 28.0%, MV multiplication 3.6%. (b) With BiCGStab and customized reduction kernel, total cost 74.6s: setup (factorization) 1.1%, create HYB matrix 0.9%, reorder/copy of B/X 1.2%, BLAS operations 8.4%, reduction (dot/norm) 6.4%, solve by preconditioner 72.7%, MV multiplication 9.1%.
severely memory bound, such that the peak flops cannot be reached; and 3) most of
the axpy and dot-product operations are in the orthogonalization process of GMRES.
The number of such operations per NF iteration increases linearly with the number of
iterations (i + 1 operations for the i-th iteration). Thus, the cost of these operations
becomes significant when the number of NF iterations is large.
One possible way to resolve this problem is to use BiCGStab (BiConjugate Gradi-
ent Stabilized method, see [67,82]) as the accelerator. Although there is no guarantee
that the residual norm will decrease monotonically in every iteration, BiCGStab has
the advantage that the number of axpy and dot-product operations is fixed at five to
six (depending on the number of residual-norm evaluations) per iteration. Because
the number of MPNF iterations is usually much higher than ten for each pressure so-
lution, BiCGStab appears to be a better choice than GMRES, considering the savings
in axpy and dot-product operations.
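The operation-count argument can be made concrete with a rough model (the constants are approximate and implementation-dependent).

```python
def gmres_axpy_dot_ops(n_iters):
    """Rough count of axpy/dot-type vector operations in unrestarted GMRES:
    iteration i orthogonalizes against i previous basis vectors (about
    i + 1 vector operations), so the total grows quadratically."""
    return sum(i + 1 for i in range(1, n_iters + 1))

def bicgstab_axpy_dot_ops(n_iters, ops_per_iter=6):
    """BiCGStab: a fixed five to six axpy/dot operations per iteration
    (depending on how many residual-norm evaluations are performed)."""
    return ops_per_iter * n_iters
```

For the 20-plus NF iterations per pressure solve observed above, this model gives 230 GMRES-style vector operations versus about 120 for BiCGStab, which is the saving exploited here.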
Another consideration is to use the customized CUDA reduction kernel instead of
the most flexible one provided in cuBLAS. This is motivated by the parallel reduction
example in CUDA Samples [56], which are a set of code samples included with the
NVIDIA CUDA Toolkit. In the parallel reduction example, seven different versions
of reduction kernels have been implemented, with several tunable options. Using the
best available kernel (the 7th kernel with multiple elements assigned to each thread)
with proper setting (64 thread blocks at maximum, with 256 threads per block, us-
ing CPU for the final reduction of the aggregated results in all thread blocks), the
customized reduction kernel is able to provide up to 20% larger memory bandwidth
than the reduction operation in cuBLAS.
Using BiCGStab as an accelerator for MPNF along with the customized reduction
kernel, the updated runtime profile is shown in Fig. 6.3(b). It is observed that the
BLAS (8.4%) and reduction (6.4%) operations are no longer the bottlenecks. The
total cost decreases from 112.3s to 74.6s (a 33.6% reduction) and is now dominated
by the MPNF solution kernel (72.7%).
6.5 Coalesced memory access
In CUDA programs, ensuring coalesced memory access (see [54, 55] for details) is
critical for efficient utilization of the GPU resources. Coalesced memory organization
can be described as follows:
• Half-warp is the basic unit in the SIMT GPU architecture and contains 16
threads that always execute the same instruction at any time;
• For one instruction, if the participating threads in a half-warp follow a certain
access pattern (basically, contiguous memory or a permutation of it, see Fig.
6.4(a)), this costs the equivalent of 1 memory transaction. This is the most
efficient access pattern to GPU global memory;
• Otherwise, the memory access is noncoalesced (see Fig. 6.4(b)), and the in-
struction will cost 16 memory transactions.
Figure 6.4: Comparison of CUDA memory access patterns: (a) coalesced, 1 transaction (the half-warp reads, e.g., element 1 of kernels 1 to 16 contiguously); (b) noncoalesced, 16 transactions
Consider the following top-down ordering of the elements in diagonal submatrices
Dc: in each color, first by the kernel index, then by the element index in a kernel.
For a typical instruction that accesses the same element of all kernels in a color,
neighboring threads will not access contiguous memory. For example, when the first
element of all kernels is accessed, the access pattern follows what is shown in Fig.
6.4(b), where the interval between the accesses from two neighboring threads is the
number of elements in a kernel. To achieve coalescence, the matrix elements need
to be ordered in a reverse way. That is, in each color, first by the element index in
a kernel, then by the kernel index. With the reversed ordering, the access pattern
will follow exactly what is shown in Fig. 6.4(a), because the elements from different
kernels are now stored contiguously. Besides the matrix elements in Dc, the unknown
and right-hand-side (RHS) vectors are reordered in the same way.
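The two orderings amount to two index maps (an illustrative sketch with made-up sizes; 16 threads of a half-warp each load element 0 of 16 consecutive kernels).

```python
def kernel_major_index(kernel, element, n_elements):
    """Top-down ordering: first by kernel index, then by element in a kernel."""
    return kernel * n_elements + element

def element_major_index(kernel, element, n_kernels):
    """Reversed ordering: first by element index, then by kernel index."""
    return element * n_kernels + kernel

n_kernels, n_elements = 64, 40
# Addresses touched by one half-warp reading element 0 of kernels 0..15:
addrs_km = [kernel_major_index(k, 0, n_elements) for k in range(16)]  # strided
addrs_em = [element_major_index(k, 0, n_kernels) for k in range(16)]  # contiguous
```

With the top-down map the 16 accesses are strided by the kernel length (noncoalesced, 16 transactions); with the reversed map they hit contiguous addresses (coalesced, 1 transaction).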
The off-diagonal submatrices Lc and Uc have a block sparse structure with
each block row and column corresponding to one kernel. The maximum number
of nonzero block entries in each block row can be predetermined according to the
distribution of neighboring colors in the selected coloring strategy. Thus, the EllPack
format is used for Lc and Uc and the following top-down ordering was initially adopted:
first by block row (kernel) index, then by the index of block nonzero entry in a block
row (kernel), and finally by the element index inside a block nonzero entry (in diagonal
format). In the reverse ordering strategy, matrix elements in Lc and Uc are rearranged
as: first by the element index in a block nonzero entry, then by the index of the block
nonzero entry in a block row (kernel), and finally by the block row (kernel) index.
This ordering strategy yields a much more efficient access pattern to the GPU global
memory.
The updated runtime profile obtained using MPNF with coalesced memory access
is shown in Fig. 6.5. The cost of the preconditioner solution is driven down from
54.3s to 16.0s (a 70.5% reduction). Correspondingly, there is a further decrease in the
total MPNF cost from 74.6s to 35.4s (a 52.5% reduction). Now, the preconditioner
solution (45.1%) accounts for half of the total MPNF cost, while BLAS and reduction
operations, together with SpMV, account for the other half.
Figure 6.5: Runtime profile of the implementation with BiCGStab, customized reduction kernel, and coalesced memory access. Total cost reduced from 74.6s to 35.4s; solve cost reduced from 54.3s to 16.0s. Breakdown: setup (factorization) 1.9%, create HYB matrix 1.9%, reorder/copy of B/X 2.0%, BLAS operations 17.6%, reduction (dot/norm) 14.0%, solve by preconditioner 45.1%, MV multiplication 17.5%.
6.6 More parallelism: multiple threads in a kernel
As described in Section 6.3 and in [8], assigning more than one thread to a kernel
improves the performance for relatively small problems. This is motivated by the fact
that we usually need several times more threads than CUDA cores to fully utilize the
power of the GPU [54,55]. Because a Tesla M2090 GPU has 512 cores, we can make
full use of its power, if we have thousands of concurrent threads, or more. To analyze
this, we consider the following examples:
• A large reservoir model with 300×160×50 (2.4 million) cells. In a typical 4-color
oscillatory coloring, the number of kernels with the first color (or the last color)
is only about half the number of kernels with any intermediate color. Consequently,
the number of kernels per color is about 8000 ∼ 16000 for
this problem. This is sufficient to utilize the full power of the current generation
of GPUs.
• A relatively small reservoir model (upscaled SPE 10) with 30 × 110 × 42 (139
thousand) cells. Using the same 4-color oscillatory ordering, the number of
kernels per color is only about 550 ∼ 1100. Given the number of cores in the
current generation of GPUs, this number is far from sufficient to achieve optimal
parallel efficiency.
From the above comparison, we can say that the number of concurrent threads
is only sufficient for relatively large problems. To increase the number of concurrent
threads, there are several possible approaches.
6.6.1 Decrease the number of colors
Instead of four colors, we may use three colors or even two colors (checkerboard).
This will increase the number of threads per color up to N_x N_y / 2, assuming that
a kernel is a column of cells along the z-direction. However, as the number of colors decreases,
the accuracy of the preconditioner also degrades. This will usually deteriorate the
convergence rate and increase the number of NF iterations per pressure iteration.
Thus, in certain circumstances, it is possible that the increased parallelism may be
more than offset by the increased computational cost.
6.6.2 Twisted Factorization
The advantage of the Twisted Factorization (TF) [81] method is that it doubles the
number of concurrent threads, while keeping the same total computational cost. This
is achieved by assigning two threads to the tridiagonal linear system of each kernel
(D_r^c x_r^c = b_r^c) during the setup and solution phases. Instead of the standard LU
factorization, an alternative PQ factorization (D_r^c = P_r^c Q_r^c) is performed. For the
setup (factorization) phase, one sweep and one synchronization are needed:
1. Two threads set up the entries in the factorized matrices P_r^c and Q_r^c by sweeping
from both ends (first and last row) to the middle row concurrently
2. Thread synchronization
3. Exchange data to set up the entries in the middle row of the factorized matrix Q_r^c
For the solution phase, an additional sweep and an extra synchronization are
needed:
1. Two threads solve the upper and lower parts of the linear system P_r^c y_r^c = b_r^c
concurrently by sweeping from both ends to the middle row of the factorized matrix
P_r^c
2. Thread synchronization
3. Exchange data to obtain the solution for the middle element of y_r^c
4. Thread synchronization
5. Two threads solve the upper and lower parts of the linear system Q_r^c x_r^c = y_r^c
concurrently by sweeping from the middle row to both ends of the factorized matrix
Q_r^c
In order to maintain coalesced memory access, the two threads assigned to the
same kernel should be from different half-warps. That is, we may assign one half-warp
to sweep between the first and the middle row of 16 kernels, and another half-warp
to sweep between the last and the middle row of the same 16 kernels. Then, with
the reversed variable ordering discussed in Section 6.5, the memory access with TF
remains coalesced.
Because parallelism can be increased at no additional cost (i.e., the total compu-
tational cost is kept the same and the convergence rate is unaffected), this approach
is always desirable. However, it can only be applied once to each kernel, and thus at
most doubles the number of concurrent threads. This may still be insufficient for
certain small problems.
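The setup and solution sweeps above can be condensed into a serial reference sketch (our reconstruction from the algorithmic description in [81], not the CUDA kernel itself; the two halves that would belong to the two threads here simply run one after the other).

```python
import numpy as np

def twisted_solve(a, d, c, b):
    """Twisted (PQ) tridiagonal solve: two sweeps run from the first and last
    rows toward the middle row, exchange data there, then back-substitute
    outward. a = sub-diagonal (a[0] unused), d = diagonal, c = super-diagonal
    (c[-1] unused), b = right-hand side; requires len(d) >= 3."""
    n = len(d)
    m = n // 2
    dd, bb = np.array(d, float), np.array(b, float)
    # "Thread 1": eliminate sub-diagonal, sweeping down to the row above middle.
    for i in range(1, m):
        w = a[i] / dd[i - 1]
        dd[i] -= w * c[i - 1]
        bb[i] -= w * bb[i - 1]
    # "Thread 2": eliminate super-diagonal, sweeping up to the row below middle.
    for i in range(n - 2, m, -1):
        w = c[i] / dd[i + 1]
        dd[i] -= w * a[i + 1]
        bb[i] -= w * bb[i + 1]
    # Exchange at the middle row: fold in both neighbors, solve for x[m].
    x = np.zeros(n)
    dm = d[m] - a[m] * c[m - 1] / dd[m - 1] - c[m] * a[m + 1] / dd[m + 1]
    bm = b[m] - a[m] * bb[m - 1] / dd[m - 1] - c[m] * bb[m + 1] / dd[m + 1]
    x[m] = bm / dm
    # Back-substitute outward from the middle (again two independent sweeps).
    for i in range(m - 1, -1, -1):
        x[i] = (bb[i] - c[i] * x[i + 1]) / dd[i]
    for i in range(m + 1, n):
        x[i] = (bb[i] - a[i] * x[i - 1]) / dd[i]
    return x
```

The two elimination sweeps and the two back-substitution sweeps touch disjoint rows, which is what allows a second thread per kernel at the same total operation count.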
6.6.3 Cyclic Reduction
The basic idea of the Cyclic Reduction (CR) [33, 81, 95] method is to eliminate odd
and even numbered equations and variables separately and concurrently. This method
can be applied repeatedly and each time the number of concurrent threads is doubled.
However, the computational cost also increases. As stated in [33], CR yields 2.7 times
more operations than standard Gaussian elimination when it is applied the maximum
number of times on a tridiagonal matrix. We have not yet implemented CR, or its variant PCR
(Parallel Cyclic Reduction) [95], in our code base.
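For reference, the CR recursion can be sketched as follows (a serial numpy reconstruction from the literature, not part of our code base; the unused boundary entries a[0] and c[-1] are assumed to be zero).

```python
import numpy as np

def cyclic_reduction_solve(a, d, c, b):
    """Recursive cyclic reduction for a tridiagonal system. Each level
    eliminates the odd-indexed unknowns (independently, hence concurrently
    on a parallel machine), halving the system; the even-indexed unknowns
    are solved recursively and the odd ones recovered locally.
    Requires a[0] == 0 and c[-1] == 0 (unused boundary entries)."""
    a, d, c, b = (np.asarray(v, float) for v in (a, d, c, b))
    n = len(d)
    if n == 1:
        return b / d
    ev = np.arange(0, n, 2)                       # surviving even indices
    na, nc = np.zeros(len(ev)), np.zeros(len(ev))
    nd, nb = d[ev].copy(), b[ev].copy()
    for j, i in enumerate(ev):
        if i - 1 >= 0:                            # fold in odd row i - 1
            w = a[i] / d[i - 1]
            na[j] = -w * a[i - 1]
            nd[j] -= w * c[i - 1]
            nb[j] -= w * b[i - 1]
        if i + 1 < n:                             # fold in odd row i + 1
            w = c[i] / d[i + 1]
            nc[j] = -w * c[i + 1]
            nd[j] -= w * a[i + 1]
            nb[j] -= w * b[i + 1]
    x = np.zeros(n)
    x[ev] = cyclic_reduction_solve(na, nd, nc, nb)
    for i in range(1, n, 2):                      # recover odd unknowns
        rhs = b[i] - a[i] * x[i - 1]
        if i + 1 < n:
            rhs -= c[i] * x[i + 1]
        x[i] = rhs / d[i]
    return x
```

Each level roughly doubles the available concurrency at the price of extra arithmetic, consistent with the 2.7x operation count quoted above.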
6.7 More flexibility in the system matrix
6.7.1 Inactive cells
In heterogeneous reservoir models, some cells have very small pore volumes and are
thus labelled inactive. Their variables and equations are usually removed from the
system matrix during computation. In the MPNF algorithm, it is preferable to keep
the complete structure of the system matrix with both active and inactive cells. For
this purpose, we will create a mapping between the set of active cells, which is used
by all other parts of the simulator, and the set of all cells, which include both inactive
and active cells and is only used by MPNF.
This mapping procedure can be well integrated with the reordering process used in
MPNF to rearrange the cells into the top-down ordering (color → kernel → element)
or the reversed ordering as described in Section 6.5 (color → element → kernel).
During the setup phase of the MPNF matrix, the entries corresponding to the active
cells are directly copied from the original matrix, whereas the entries corresponding
to the inactive cells are not touched and their trivial initial values will be kept: 1 for
diagonal entries, and 0 for off-diagonal entries.
In the solution phase, for one factorized matrix, multiple RHS vectors will be given
to the MPNF algorithm. At the beginning of each solution process with a new RHS
vector borig, the corresponding RHS vector used by MPNF, b, is mapped from active
cells with original order to all cells with top-down or reversed order. Then, the GPU-based MPNF-preconditioned linear solver is called to compute the solution vector
x, which corresponds to all cells with top-down or reversed order. Finally x is mapped
back to xorig, which contains elements for active cells with original order. Because
our current implementation of MPNF is only used in the pressure preconditioning,
xorig can be taken by the CPR preconditioner as the first-stage solution. In this way,
the inactive cells can be handled seamlessly by the MPNF preconditioner.
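A minimal NumPy sketch of this idea, with an invented 5-cell example (3 active cells), shows how the trivial values keep the padded solve consistent with the active-cell system:

```python
import numpy as np

n_all = 5
active = [0, 2, 3]                      # cells 1 and 4 are inactive
A_act = np.array([[ 4.0, -1.0,  0.0],
                  [-1.0,  4.0, -1.0],
                  [ 0.0, -1.0,  4.0]])
b_act = np.array([1.0, 2.0, 3.0])

# setup phase: start from the trivial values (1 on the diagonal,
# 0 off-diagonal) and copy in the entries of the active cells
A_all = np.eye(n_all)
A_all[np.ix_(active, active)] = A_act

# solution phase: scatter the RHS (b_orig -> b), solve, gather back
b_all = np.zeros(n_all)
b_all[active] = b_act
x_all = np.linalg.solve(A_all, b_all)   # stands in for the MPNF-based solve
x_act = x_all[active]                   # x -> x_orig (all -> active cells)
```

Because the inactive rows and columns are decoupled identity rows with zero right-hand side, the active part of the padded solution coincides with the solution of the original active-cell system.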
6.7.2 Additional (e.g., well) equations
In general-purpose simulation, reservoir equations are usually coupled with some ad-
ditional (e.g., well) equations. Thus, the entire pressure linear system will have the
form:
\[
\begin{pmatrix} A_{p,RR} & A_{p,RW} \\ A_{p,WR} & A_{p,WW} \end{pmatrix}
\begin{pmatrix} x_R \\ x_W \end{pmatrix}
=
\begin{pmatrix} b_R \\ b_W \end{pmatrix},
\tag{6.8}
\]
where Ap,UV (U, V ∈ {R,W}) contains the derivatives of equations of U with respect
to variables of V . The Ap,RR part is the reservoir matrix that MPNF can already han-
dle. In order to solve the entire system with these additional equations, we consider
the following Schur-complement process:
\[
A'_{p,RR} = A_{p,RR} - A_{p,RW} A_{p,WW}^{-1} A_{p,WR}, \tag{6.9}
\]
\[
b'_R = b_R - A_{p,RW} A_{p,WW}^{-1} b_W. \tag{6.10}
\]
If the Schur-complement system A′p,RRxR = b′R can be solved with the MPNF-
preconditioned linear solver for the reservoir solution xR, then the well solution xW
can be obtained as:
\[
x_W = A_{p,WW}^{-1} \left( b_W - A_{p,WR} x_R \right). \tag{6.11}
\]
The issue with preconditioning the Schur-complement system using MPNF lies
in the structure of A′p,RR. For wells with multiple completions (i.e., having multiple
nonzero entries in one row of Ap,WR, or in one column of Ap,RW ), the resultant A′p,RR
will have additional nonzero entries introduced in the Schur-complement process.
These entries represent the synthetic connections between the multiple reservoir cells
with completions, which might not have been connected. For a typical vertical well
with multiple completions, the perforated cells are usually in one MPNF kernel that
contains a column of cells. The additional nonzero entries will turn the matrix of
this kernel from a tridiagonal matrix into a full matrix, which will greatly reduce the
parallel solution efficiency, even if there are only a few such full matrices.
The remedy is to use an approximated matrix in the MPNF preconditioner:
\[
A''_{p,RR} = A_{p,RR} - \operatorname{approx}\!\left( A_{p,RW} A_{p,WW}^{-1} A_{p,WR} \right), \tag{6.12}
\]
where the operator ‘approx’ can be a relaxed rowsum operator, or something similar, chosen such that A′′p,RR has the same structure as the original Ap,RR. Note that although the MPNF preconditioner solves the approximate Schur-complement problem
A′′p,RRxR = b′R, the accelerator (e.g., BiCGStab) still iterates on the exact Schur-
complement problem A′p,RRxR = b′R. Otherwise, the pressure solution will become
inaccurate such that the convergence rate of the CPR preconditioner is impacted
negatively.
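The following NumPy sketch (with made-up matrix values and an assumed single well completed in three cells) illustrates eqs. (6.9)–(6.12): the exact Schur complement fills in connections among the completed cells, while a rowsum-style lumping preserves the sparsity pattern of Ap,RR:

```python
import numpy as np

# invented example: 5 reservoir cells in a tridiagonal kernel, one well
# equation/variable, completions in cells 0, 2 and 4
A_RR = np.diag([4.0] * 5) + np.diag([-1.0] * 4, 1) + np.diag([-1.0] * 4, -1)
A_WW = np.array([[2.0]])
A_RW = np.zeros((5, 1)); A_RW[[0, 2, 4], 0] = -0.5
A_WR = np.zeros((1, 5)); A_WR[0, [0, 2, 4]] = -0.5

M = A_RW @ np.linalg.solve(A_WW, A_WR)    # A_RW A_WW^{-1} A_WR
A_exact = A_RR - M                        # eq. (6.9): couples cells 0, 2, 4
A_lumped = A_RR - np.diag(M.sum(axis=1))  # eq. (6.12), rowsum 'approx'

b_R = np.ones(5)
b_W = np.array([1.0])
b_prime = b_R - A_RW @ np.linalg.solve(A_WW, b_W)   # eq. (6.10)

x_R = np.linalg.solve(A_exact, b_prime)   # accelerator iterates on A_exact
x_W = np.linalg.solve(A_WW, b_W - A_WR @ x_R)       # eq. (6.11)
```

The lumped matrix keeps the original tridiagonal structure (so the kernel solves stay cheap), while the exact Schur complement introduces the synthetic connections among the three completed cells.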
6.8 Multi-GPU Parallelization
With the growth in problem size, a single GPU, even with 512 cores, may not be
able to solve problems of interest efficiently. This is due to limitations on both
computational power and memory capacity. Thus, multi-GPU parallelization of the
algorithm is employed to deal with very large problems. The basic idea is to partition
the entire domain into several subdomains and assign the computational tasks on
each subdomain to one GPU. Data need to be transferred between the subdomains
after the setup/solution of each color.
6.8.1 Partitioning
Due to the characteristics of peer-to-peer (P2P) data transfer between GPUs, the
partitioning is currently conducted in one dimension only, i.e., the dimension that
is not along the kernel direction and has the largest number of cells. As a result of
the one-dimensional partition, GPU g (0 ≤ g < NGPU , NGPU : number of GPUs)
only needs to communicate with GPU g − 1 (if g > 0) and g + 1 (if g < NGPU − 1).
Inside each subdomain, only the data from the cells at the domain edge need to
be transferred to another GPU, due to the interdomain connections. That is, GPU
g needs to transfer the data from its first layer (perpendicular to the partitioning
dimension) of cells to GPU g − 1 (if g > 0) and the data from its last layer of cells
to GPU g + 1 (if g < NGPU − 1). The region that contains the first and last layer
of cells is called the inner ‘halo’ region, and the region that contains the rest of cells
in a subdomain is called the interior region. The name ‘inner’ halo region is used
to differentiate from the ‘outer’ halo region, which corresponds to the layer of cells
immediately outside the subdomain. Each ‘outer’ halo region collocates with one
‘inner’ halo region of another GPU and is used to contain the data transferred from
that GPU.
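A sketch of this one-dimensional partitioning, in Python with invented names, computes each GPU's layer range, inner-halo layers, and neighbours:

```python
def partition_1d(n_layers, n_gpu):
    """Split n_layers grid layers into n_gpu contiguous slabs; return, per
    GPU, its layer range, inner-halo layers, and neighbour GPU ids."""
    base, extra = divmod(n_layers, n_gpu)
    parts, start = [], 0
    for g in range(n_gpu):
        size = base + (1 if g < extra else 0)
        first, last = start, start + size - 1
        halo = []
        if g > 0:
            halo.append(first)        # first layer is sent to GPU g-1
        if g < n_gpu - 1:
            halo.append(last)         # last layer is sent to GPU g+1
        nbrs = [h for h in (g - 1, g + 1) if 0 <= h < n_gpu]
        parts.append({"layers": (first, last), "inner_halo": halo,
                      "neighbours": nbrs})
        start += size
    return parts
```

With 10 layers on 3 GPUs, for instance, the slabs are layers 0–3, 4–6 and 7–9, and only the first and last layer of each interior slab belong to an inner halo.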
Figure 6.6: Example of partitioning with three GPUs (top view)
An example of partitioning with three GPUs is shown in Figure 6.6. The right
‘inner’ halo region of GPU 0 (in pink color) is the front ‘outer’ halo region of GPU
1. Correspondingly, the left ‘inner’ halo region of GPU 1 (in light green color) is the
back ‘outer’ halo region of GPU 0. A similar classification applies to the halo regions between GPUs 1 and 2.
6.8.2 Data transfer between GPUs
Based on the topology of multiple GPUs on a given host, there are two primary
approaches for data transfer between GPUs: left-right approach and pairwise ap-
proach [49]. Both approaches require two stages.
First we discuss the left-right approach. In the first stage, the data are transferred
from left to right, i.e., from the right inner halo region of g to the front outer halo
region of g + 1, for each GPU g (0 ≤ g < NGPU − 1). Then a synchronization is
performed until the first-stage data transfers are finished. In the second stage, the
data are transferred from right to left, i.e., from the left inner halo region of g to the
back outer halo region of g − 1, for each GPU g (1 ≤ g < NGPU). In this way, no
matter how many GPUs we have in the system, there will be no data contention in
the PCIe channel, as shown in Figure 6.7.
Figure 6.7: Illustration of the left-right data transfer approach: (a) first (left-to-right) phase; (b) second (right-to-left) phase (modified from [49])
Then we describe the pairwise approach. In the first stage, the data are transferred
between each even-numbered GPU and its next odd-numbered GPU, i.e., from the
right inner halo region of g to the front outer halo region of g+1, and from the left inner
halo region of g + 1 to the back outer halo region of g, for each even-numbered GPU
g (0 ≤ g < NGPU − 1, g is even). Then a synchronization is performed until the first-
stage data transfers are finished. In the second stage, the data are transferred between
each odd-numbered GPU and its next even-numbered GPU, i.e., performing the same
procedure as in the first stage, but for each odd-numbered GPU g (1 ≤ g < NGPU−1,
g is odd). The data contention in the PCIe channel can also be avoided if NGPU ≤ 4.
For NGPU > 4, there will be data contention in the second stage, as shown in Figure
6.8.
We choose the left-right approach because it has no data contention with an
arbitrary number of GPUs. See the bottom part of Figure 6.6. In this three-GPU
Figure 6.8: Illustration of the pairwise data transfer approach: (a) first (even-odd) phase; (b) second (odd-even) phase (modified from [49])
example, data are transferred from GPU 0 to GPU 1, and from GPU 1 to GPU 2
in the first stage, and then transferred from GPU 2 to GPU 1, and from GPU 1 to
GPU 0 in the second stage.
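The two schedules can be sketched as stage lists of (source, destination) transfers, where the transfers within a stage run concurrently (an illustrative reconstruction, not code from the simulator):

```python
def left_right_schedule(n_gpu):
    # every stage-1 transfer moves left-to-right and every stage-2 transfer
    # right-to-left, so no PCIe path carries traffic in both directions
    stage1 = [(g, g + 1) for g in range(n_gpu - 1)]
    stage2 = [(g, g - 1) for g in range(1, n_gpu)]
    return [stage1, stage2]

def pairwise_schedule(n_gpu):
    # each pair exchanges data in both directions within one stage
    stage1 = []
    for g in range(0, n_gpu - 1, 2):          # even g paired with g+1
        stage1 += [(g, g + 1), (g + 1, g)]
    stage2 = []
    for g in range(1, n_gpu - 1, 2):          # odd g paired with g+1
        stage2 += [(g, g + 1), (g + 1, g)]
    return [stage1, stage2]
```

Note that each GPU exchanges data with at most one partner per pairwise stage, whereas the left-right schedule relies on the uniform transfer direction, rather than on pair disjointness, to avoid contention.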
Note that one large chunk of data can usually be transferred more efficiently be-
tween two GPUs than many small pieces of data. Therefore, after the setup/solution
of each color, we should minimize the number of transactions for P2P data transfer
between GPUs. For top-down ordering of the matrix elements (in each color, first by
the kernel, then by the element in the kernel), the data to be transferred are already
grouped together if we order the kernels at the edge of subdomain consecutively.
Therefore, the data can be directly transferred after the computation. However, for
the reverse ordering scheme (in each color, first by the element in the kernel, then
by the kernel), the data to be transferred are scattered in the data chunk for each
element. Consequently, a preparation process that gathers the data from multiple places in the GPU memory into one large chunk in a preallocated buffer is needed between the computation and the data transfer.
6.8.3 Overlapping data transfer with computation
Similar to the idea of asynchronous memory transfer from CPU to GPU at the setup
stage, the P2P data transfer between GPUs can be overlapped with the computation
inside the GPU. Thus the data-transfer cost can be hidden, if it is always smaller
than the computational cost.
The overlapping strategy utilizes the partitioned regions in each subdomain. The
computation in the inner halo region and in the interior region must be performed separately. The inner halo region is usually very small compared with the interior region, and thus the computation in the inner halo region should finish in a relatively short time. Then we can start the computation in the interior region and, at the same time, the data transfer between GPUs. This is because the data to be transferred are those
in the inner halo region and have already been computed. The simultaneous kernel
launch and data transfer are achieved via streams. A typical assignment of tasks
in the solution phase is shown in Figure 6.9, in which all computational tasks are
performed on stream 1 whereas data transfer tasks are performed on stream 2. These
two streams are created for each GPU and thus we have a total of 2NGPU streams.
For this phase, stream 1 is usually fully occupied whereas stream 2 may idle during
certain periods of time.
Figure 6.9: Assignment of tasks to two streams in the solution phase
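A toy cost model (our own simplification, not taken from the implementation) captures this overlap: per color, the halo computation runs first on stream 1, after which the interior computation on stream 1 overlaps with the P2P transfer on stream 2, so each color costs the halo time plus the larger of the interior and transfer times:

```python
def solution_phase_time(colors):
    """colors: list of (halo, interior, transfer) costs per color."""
    return sum(halo + max(interior, transfer)
               for halo, interior, transfer in colors)

# transfer (2) < interior (10): fully hidden, total = 4 * (1 + 10)
hidden = solution_phase_time([(1, 10, 2)] * 4)
# transfer (12) > interior (10): the transfer is partially exposed
exposed = solution_phase_time([(1, 10, 12)] * 4)
```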
Assuming that the data-transfer cost is always smaller than the computational
cost of an interior region, the goal in the solution phase is really to minimize the
total computational cost of halo and interior regions. Given the fact that a large
number of concurrent threads is usually needed for good utilization of the GPU, the
computational efficiency in the halo region, which contains only a small number of
kernels, may be suboptimal. A possible way to resolve this problem is to temporarily
repartition the halo and interior regions inside a subdomain in this phase. That is,
we can assign more kernels to the halo region and fewer kernels to the interior region,
in order to achieve improved overall utilization of the GPU and thus a minimal total
computational cost. Note that no matter how the halo and interior regions are divided
here, only the data from kernels at edge of each subdomain (i.e., the initial definition
of the halo region) need to be transferred between GPUs.
For the setup phase, the asynchronous memory transfer from CPU to GPU will
also be performed on stream 2 and overlap with the computation (setup and rowsum)
on stream 1. As a consequence, the tasks are assigned as shown in Figure 6.10, in
which stream 1 and 2 have the same definitions as in Figure 6.9. Note that due to
the limited bandwidth of PCIe bus (especially on the path from CPU to I/O hub
shared by the multiple GPUs), the memory transfer from CPU to GPU is usually
the bottleneck in this phase. Therefore, stream 2, which is in charge of data transfer
from CPU to GPU and between GPUs, will be fully occupied, whereas stream 1 for
managing computational tasks will idle from time to time. This is different from the behavior in the solution phase.
6.8.4 Mapping between global and local vectors
As described in Section 6.7.1, at the beginning of each solution process with a new
RHS vector, there is a mapping from active cells with original order to all cells with
top-down or reversed order. At the end of the solution process, the solution vector is
mapped back in the opposite way.

Figure 6.10: Assignment of tasks to two streams in the setup phase
We extend that idea for multi-GPU parallelization of the algorithm. The mapping
at the beginning of the solution process is now a distribution of the vector elements,
i.e., from active cells in the global domain (i.e., including all subdomains) with original
order to all cells in one extended subdomain (i.e., the subdomain for one GPU plus
its outer halo region) with top-down or reversed order. In this mapping, the source is
(a part of) the same vector whereas the target is different for each GPU. As a result,
during the multi-GPU solution process, each GPU will only access its local vectors
plus the additional data received from the P2P transfer after the setup/solution of
each color.
Correspondingly, at the end of the solution process, there is an opposite mapping
for the local solution vector in each GPU. The opposite mapping is essentially an
assembly of the vector elements, i.e., from all cells in one extended subdomain with
top-down or reversed order to (a part of) the active cells in the global domain with
original order.
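The distribution and assembly mappings can be sketched in Python as scatter/gather operations over invented index lists (illustrative only; the real mappings also encode the top-down or reversed ordering):

```python
# per-GPU lists of global active-cell ids: owned cells in local order,
# followed by the outer-halo copies of neighbouring GPUs' cells
owned = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7]}
outer_halo = {0: [3], 1: [2, 6], 2: [5]}

def distribute(b_global):
    """Global RHS -> one extended local vector per GPU."""
    return {g: [b_global[i] for i in owned[g] + outer_halo[g]]
            for g in owned}

def assemble(x_local):
    """Local solutions -> global solution (owned entries only, so the
    duplicated outer-halo entries are never written back)."""
    x_global = [0.0] * sum(len(v) for v in owned.values())
    for g, ids in owned.items():
        for k, i in enumerate(ids):
            x_global[i] = x_local[g][k]
    return x_global
```

Distributing a vector and assembling it again is an exact round trip, since every global cell is owned by exactly one GPU.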
6.9 Parallel Benchmark
6.9.1 Single-GPU test case 1: upscaled SPE 10
We first consider a relatively small test problem based on the upscaled SPE 10 reservoir model with 30 × 110 × 42 (139 thousand) cells. The focus is on the performance
of the pressure preconditioner. We use a simple two-phase, two-component fluid. The
4-color oscillatory coloring is used as the default strategy. To establish a basis for
comparison, the OpenMP-based MPNF algorithm run on one, or multiple, CPU cores
was implemented. The time used by MPNF for a single CPU core is treated as the
baseline (1X speedup) of the benchmark. The test was conducted on a Dell Pow-
erEdge C410x PCIe Expansion Chassis with dual Intel Xeon X5660 CPUs (6 cores
at 2.80GHz, 12MB cache), 192GB of RAM, and an NVIDIA Tesla M2090 GPU (512
CUDA cores at 1.3GHz, 6GB memory).
The performance results are shown in Fig. 6.11. The left two columns represent
parallel speedups of SP and DP respectively, whereas the right column shows the ratio
of the minimum number of concurrent threads in a color to the number of CUDA
cores (512 for M2090), denoted as RT/C .
Figure 6.11: Performance results for the upscaled SPE 10 problem with 139 thousand cells
From Fig. 6.11 we observe that speedups of more than 6X are obtained for both SP and DP using 8 CPU cores, whereas the basic GPU-based MPNF (accelerated by BiCGStab with a customized reduction kernel and noncoalesced memory access) achieves only a 2.4X speedup for SP and 3.2X for DP. However, when we apply coalesced memory access to the GPU-based MPNF, its speedups rapidly increase to 8.5X for SP (+252%) and 7.9X for DP (+148%), which are already higher than those of 8 CPU cores.
Note that for this problem, when using 4-color oscillatory coloring without TF the
value of RT/C is only 1.1, which is far from sufficient. Using TF, RT/C is increased
to 2.1, which is still too low, and the speedups show a corresponding boost to 12.7X
for SP and 11.0X for DP. Next we investigate the impact of decreasing the number
of colors. Using three colors, RT/C is further increased to 3.2, and the speedups
correspondingly increase despite the increased number of NF iterations. When the
number of colors is further decreased to 2, RT/C becomes 6.4, which is still not enough
to fully utilize the power of the GPU. We see a further increase in the speedups to
17.1X for SP and 13.9X for DP, respectively. With Cyclic Reduction (CR), we can
exploit more parallelism and possibly achieve further improvement for this relatively
small problem.
6.9.2 Single-GPU test case 2: full SPE 10
Now we consider a larger test problem based on the full SPE 10 reservoir model with 60 × 220 × 85 (1.1 million) cells. The fluid model, default coloring strategy, and the
hardware configuration of both CPU and GPU are the same as in the previous test
problem.
The performance results are shown in Fig. 6.12. For this problem, the speedups
of the basic GPU-based MPNF with noncoalesced memory access (SP: 3.0X; DP:
3.1X) are again lower than those of the CPU-based MPNF using 8 cores (SP: 6.0X;
DP: 4.1X). However, when applying the coalesced memory access, there is significant
improvement in the performance of the GPU-based MPNF. The coalesced speedups
(SP: 22.1X; DP: 18.2X) are about 6 to 7 times those of their noncoalesced counterparts.
For this problem, RT/C is about 4.3 for the basic GPU-based MPNF. This is not
sufficient to achieve optimal performance. Using TF, RT/C is increased to 8.6, which
is marginally acceptable. We see a corresponding increase in the speedups to 26.5X
(SP) and 18.7X (DP), respectively. Next, we analyze the impact of the number of
colors. First, if we decrease the number of colors to three, or even two (checkerboard),
although the value of RT/C further increases to 12.9 and 25.8, respectively, the limited
increase in the parallelism (RT/C = 8.6 is already acceptable) is more than offset by
the increased number of NF iterations. On the other hand, if we increase the number
of colors to five or six, a potentially more accurate preconditioner can be obtained.
However, the value of RT/C becomes too low (6.4 for five colors, and 5.2 for six colors)
to maintain good utilization of the GPU. As a result, the overall speedup with five
or six colors is lower than that with four colors. That is, 4-color oscillatory coloring
with TF yields the best performance for this problem.
Figure 6.12: Performance results for the full SPE 10 problem with 1.1 million cells
6.9.3 Multi-GPU test case 1: 8-fold refinement (2 by 2 by 2)
of SPE 10
In order to test the multi-GPU parallelization of the MPNF algorithm, the problem
size needs to be large enough to obtain good utilization of the massive computational
power provided by the thousands of CUDA cores. Thus, we build a test case based
on a refined SPE 10 reservoir model with 120 × 440 × 170 (8.8 million) cells. The
fluid model, default coloring strategy, and the CPU hardware configuration are the
same as in the single-GPU test cases. Four NVIDIA Tesla M2090 GPUs are used in
this test case, bringing a total of 4 × 512 = 2048 CUDA cores and 4 × 6 GB = 24 GB of memory.
For this test case, we use only the optimal setting (4-color oscillatory ordering,
with coalesced memory access and twisted factorization) for the MPNF algorithm
and still use the run time needed by a single core of X5660 CPU as the baseline.
We simulate the test case with one, two, three, and four GPUs, respectively. The
cost of data preparation (reordering of matrix and vector elements) on the CPU is
not included in the total cost shown here, because this part cannot be accelerated by
multiple GPUs. This part of the cost will be relatively small when the entire linear
solver is parallelized on GPU (such that the RHS and solution vector will not need to
be transferred back and forth between CPU and GPU, and all the reordering tasks
can be performed on GPU).
The performance results are shown in Figure 6.13. We observe that with a single
GPU, the speedup (SP: 29.2X; DP: 19.3X) is quite similar to that obtained using
the full SPE 10 model (SP: 26.5X; DP: 18.7X), although RT/C has a much higher
value, namely 34.4, for the above mentioned optimal setting. This further validates
our interpretation that RT/C = 8.6, which was obtained using the full SPE 10 model
with the same setting, is already acceptable for good utilization of the GPU. For SP,
the (1024-core) 2-GPU solution has a speedup of 53.5X compared with the single-core
solution and is 1.8X faster than the single-GPU MPNF solution. Correspondingly,
for DP, a speedup of 36.2X is obtained with two GPUs and this is 1.9X faster than
on a single GPU. The speedup numbers for (1536-core) 3-GPU solution are 71.1X for
SP and 50.5X for DP, which are 2.4X and 2.6X faster than the single-GPU solution,
respectively. For the (2048-core) 4-GPU solution, we achieve 82.3X speedup for SP
and 61.7X speedup for DP. Using DP, the 4-GPU solution is more than three times
faster than the single-GPU solution.
Figure 6.13: Performance results for the refined SPE 10 problem with 8.8 million cells
Here, we analyze the reason why the speedup of multi-GPU solution versus single-
GPU solution deviates from linear scalability. Although the P2P data-transfer cost
is hidden by overlapping the transfer with the computation, the total computational
cost, with a separation between the inner halo and interior region, will be higher
than that without a separation. This is due to the decrease in the effective RT/C ,
which will be substantially lower than 34.4/NGPU for NGPU ≥ 2, and thus an overall
good utilization of all GPUs cannot be achieved when NGPU increases. Moreover,
the data-transfer cost from CPU to a single GPU will be more or less the same as
that from CPU to multiple GPUs, because the bandwidth of the PCIe bus from CPU
to the I/O hub is limited and shared by the multiple GPUs. With the evolution of the PCIe standard (e.g., the bandwidth of PCIe 3.0 is 16 GB/s, twice that of PCIe 2.x), this limitation will be alleviated.
6.9.4 Multi-GPU test case 2: 24-fold refinement (4 by 3 by
2) of SPE 10
As analyzed in the last test case, one major constraint on the efficiency of multi-
GPU solution is the reduced RT/C , which represents the level of parallelism. Even for
the refined SPE 10 model with 8.8 million cells, the effective RT/C will not be large
enough when the MPNF solution is performed on several GPUs. Here, a further
refined SPE 10 model with 240 × 660 × 170 (26.9 million) cells is generated to test the
scalability of multi-GPU solution for extremely large models. This corresponds to a
4X, 3X, and 2X refinement in the X, Y, and Z directions, respectively. To further
simplify the computation on the CPU side, a two-phase black-oil fluid model is used
in this test case. The default coloring strategy and the CPU hardware configuration
are the same as in the single-GPU test cases. Up to 6 NVIDIA Tesla M2090 GPUs
are used, giving a total of 6 × 512 = 3072 CUDA cores and 6 × 6 GB = 36 GB of memory.
Similar to the previous multi-GPU test case, we use only the ‘optimal setting’ for
the MPNF algorithm and take the run time used by a single core of X5660 CPU as
the baseline. We simulate this refined SPE 10 model with one to six GPUs for SP,
and two to six GPUs for DP. This is because, when DP is used, even the pressure system (about 8 GB) is too large to fit in one GPU. In the results, in order to compare the multi-GPU solution efficiency conveniently, we still include the
1-GPU DP speedup, which is estimated from the 1-GPU SP run time and the typical
run-time ratio between DP and SP. For the same reason, as discussed in the last test
case, the cost of data preparation on the CPU is not included in the total cost.
The performance results are shown in Figure 6.14. We observe that with a single
GPU, the SP speedup (31.9X) of this 24-fold refined version of the SPE 10 model
is only slightly better than that of the refined SPE 10 model (29.3X). Although
Figure 6.14: Performance results for the further refined SPE 10 problem with 26.9 million cells
RT/C here is as high as 103.1, the improvement in speedup is minor because the GPU utilization is already saturated. For SP, the (3072-core) 6-GPU solution reaches a speedup
of 169.9X, which is 5.3 times faster than the 1-GPU solution and 40.8 times faster
than the OpenMP-based MPNF using all 12 CPU cores. On the CPU cores, MPNF
is severely limited by memory bandwidth and has an SP speedup of only 4.2X.
Correspondingly, for DP, the 6-GPU solution (111.2X) is also 5.3 times faster than
the estimated 1-GPU solution time and about 34 times faster than the 12-CPU-core
solution (3.3X). Note that in this extremely large problem where the parallelism is
abundant, the multi-GPU solution efficiency (in terms of the speedup over 1-GPU
solution) of SP can be as good as that of DP. With 3 or more GPUs, the scalability
in this case (26.9 million cells) is considerably better than that in the refined SPE 10
model (8.8 million cells) where effective RT/C drops below the acceptable range. We
should also note that in terms of absolute time, the cost per pressure solve (the
MPNF-preconditioned BiCGStab solver with a relative tolerance of 0.2) is below one
second for SP solution on 4 or more GPUs, and quite close to one second for DP
solution on 6 GPUs.
6.10 Discussion
For CPU parallelization, one thread per core (or up to two if HyperThreading is used)
is ideal for achieving optimal performance. However, it is critical to have many more
threads per CUDA core in our GPU-based MPNF method for ideal performance. This
is closely related to the SIMT architecture and the fast switching of thread
warps on the GPU. If we have large numbers of thread warps, we can always switch
to the idle warps when some active warps get stalled (e.g., waiting for the completion
of a memory transaction). In this way, the latency of memory transactions can be
effectively hidden, so that the GPU memory bus can be saturated with a large number of in-flight memory transactions.
We can utilize the full power of the GPU (i.e., reach the asymptotic limit) only
when the number of threads per core (RT/C) is sufficiently large. Based on our
analysis, 8 < RT/C < 10 allows for good utilization of the GPU; if 10 < RT/C < 20,
the full power of the GPU can be utilized. However, if we push this number too far
(RT/C ≫ 20) by feeding extremely large problems to the algorithm, or by applying
the Cyclic Reduction (CR) too many times, no further improvement will be obtained.
That is, the asymptotic limit of the algorithm is reached under the current GPU
architecture.
When our GPU-based MPNF is applied to both stages of the CPR preconditioner,
the GPU memory size can be the major constraint on the size of the problem we can
solve. This is because the size of each block nonzero entry grows from 1 × 1 to nc × nc, where nc is the number of components. For the pressure system of the full SPE 10 problem,
MPNF costs about 300MB of GPU memory, in which the major part is used for
the storage of the elements in the original and factorized matrices. If we were to apply MPNF to both the pressure and primary systems with nc = 6, the algorithm would require about 10 GB of GPU memory, which is already beyond
the memory capacity (6GB) of an NVIDIA Tesla M2090 GPU. This is part of the
motivation for extending the algorithm to support multiple GPUs, so that a larger
total amount of GPU memory is available.
6.11 Future Directions
Based on the current features of our MPNF implementation, two major directions are considered for future work. First, we may apply MPNF to both stages of the CPR preconditioner with a global CUDA-based BiCGStab/GMRES accelerator. This extension is necessary to achieve a good overall speedup of the entire linear solution, rather than of the pressure solution only. However, for the best efficiency, all the steps in the linear solution process (not just the preconditioning) should be performed on the GPU platform, which poses many design and implementation challenges.
Here, several important issues that must be addressed in this extension are listed:
• Coalesced memory access. In order to maintain the optimal access pattern
to GPU global memory, the elements inside each block nonzero entry of size
nc × nc cannot be stored together. Otherwise, because each block entry is
handled by only one thread, neighboring threads would no longer read or write
contiguous memory addresses, and the access pattern would be noncoalesced.
Therefore, the suggested reversed ordering for matrix and vector elements is:
within each color, first by the element index in each block entry, then by the
block entry index in each kernel, and finally by the kernel index. With this
ordering, coalesced memory access is maintained.
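The effect of this reversed ordering can be sketched with a small index-mapping function. This is an illustrative sketch under our own naming (flatIndex, entriesInColor), not the AD-GPRS code: with the element index as the slowest-varying key and the block-entry index as the fastest-varying one, the threads handling block entries t and t+1 touch adjacent addresses for every element position, which is exactly the coalescing condition.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch (not the AD-GPRS implementation): flattened storage
// index for the suggested reversed ordering within one color. For a fixed
// element position inside the nc x nc block, consecutive block entries
// occupy consecutive addresses. Since thread t handles block entry t,
// neighboring threads touch contiguous memory and the accesses coalesce.
std::size_t flatIndex(std::size_t elem,          // element index in a block entry, 0..nc*nc-1
                      std::size_t entry,         // block-entry index within the color
                      std::size_t entriesInColor // number of block entries in this color
) {
    // Element index is the slowest-varying key; the block-entry index
    // (enumerating entries kernel by kernel) is the fastest-varying.
    return elem * entriesInColor + entry;
}
```

Storing the nc × nc elements of one block contiguously instead would put a stride of nc² between the addresses touched by neighboring threads, breaking coalescing.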
• Block SpMV. When solving the pressure system, the HYB format in the cuSPARSE
library is an efficient choice for SpMV. However, the current version of cuSPARSE
only provides a pointwise HYB format, which is not ideal for the block
sparse structure of the primary matrix. Therefore, we need to utilize a block
sparse matrix format (HYB is still preferred) provided by another GPU-based
library, or implement our own data structure and associated GPU kernels for
block SpMV.
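As a reference for what such a custom block-SpMV kernel must compute, here is a hedged CPU sketch using a block-CSR layout; the BlockCSR struct and spmv function are our own illustrations, not cuSPARSE or AD-GPRS API, and a GPU version would additionally reorder the values as discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a block SpMV y = A*x for a BCSR-like layout with block size
// nc x nc (row-major inside each block). This is a CPU reference for what
// a custom GPU kernel would compute.
struct BlockCSR {
    std::size_t nBlockRows;
    std::size_t nc;                  // block dimension
    std::vector<std::size_t> rowPtr; // size nBlockRows+1
    std::vector<std::size_t> colInd; // block-column index per stored block
    std::vector<double> val;         // nc*nc values per block, row-major
};

std::vector<double> spmv(const BlockCSR& A, const std::vector<double>& x) {
    std::vector<double> y(A.nBlockRows * A.nc, 0.0);
    for (std::size_t br = 0; br < A.nBlockRows; ++br)
        for (std::size_t b = A.rowPtr[br]; b < A.rowPtr[br + 1]; ++b) {
            const double* blk = &A.val[b * A.nc * A.nc];
            const std::size_t bc = A.colInd[b];
            for (std::size_t i = 0; i < A.nc; ++i)
                for (std::size_t j = 0; j < A.nc; ++j)
                    y[br * A.nc + i] += blk[i * A.nc + j] * x[bc * A.nc + j];
        }
    return y;
}
```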
• Schur-complement process for generating the primary system J_RR^schur that in-
cludes only reservoir equations and variables. In the augmented primary system
that includes both reservoir and all facility equations and variables, the facilities
submatrix J_FF is no longer diagonal. It is block diagonal, and the structure of
each block, which corresponds to a facility model, can be very different. Thus
it is usually hard to obtain its inverse J_FF^{-1}. As a result, the Schur complement
J_RR^schur = J_RR − J_RF J_FF^{-1} J_FR
cannot be directly computed. For the second-stage preconditioning of the
overall system, J_RR can be used to approximate J_RR^schur, whereas for block
SpMV, we can obtain the product as
J_RR^schur · x = J_RR · x − J_RF · (J_FF^{-1} (J_FR · x)), (6.13)
where y = J_FF^{-1} (J_FR · x) is calculated by solving
J_FF · y = J_FR · x. (6.14)
Considerable effort is expected to make all of these steps perform efficiently
on the GPU.
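The matrix-free product of Eqs. (6.13) and (6.14) can be illustrated with small dense matrices. This is a toy sketch under our own names, with Gaussian elimination standing in for the per-facility block solves; the point is that J_FF is never inverted explicitly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Solve A y = b by Gaussian elimination with partial pivoting (toy sizes);
// in practice this would be a structure-exploiting per-facility solve.
Vec solve(Mat A, Vec b) {
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::abs(A[i][k]) > std::abs(A[p][k])) p = i;
        std::swap(A[k], A[p]);
        std::swap(b[k], b[p]);
        for (std::size_t i = k + 1; i < n; ++i) {
            const double f = A[i][k] / A[k][k];
            for (std::size_t j = k; j < n; ++j) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }
    Vec y(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= A[i][j] * y[j];
        y[i] = s / A[i][i];
    }
    return y;
}

// J_RR^schur * x = J_RR*x - J_RF * solve(J_FF, J_FR*x)   (Eqs. 6.13-6.14)
Vec schurProduct(const Mat& JRR, const Mat& JRF,
                 const Mat& JFF, const Mat& JFR, const Vec& x) {
    Vec y = solve(JFF, matvec(JFR, x)); // Eq. (6.14): J_FF y = J_FR x
    Vec t = matvec(JRF, y);
    Vec r = matvec(JRR, x);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] -= t[i];
    return r;
}
```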
Second, we may extend the algorithm to account for off-band coefficients on the
outermost level, in order to accommodate matrices constructed from 2.5-dimensional
unstructured grids with more flexible coloring strategies or MPFA discretization
schemes. This extension has been discussed in [6]. As long as the reservoir model is
still divided into kernels (e.g., columns of cells) of similar sizes, and a viable col-
oring strategy (no neighboring kernels share the same color) is available, the MPNF
algorithm imposes no requirement on the grid structure of the two-dimensional domain
perpendicular to the kernel direction. A possible coloring strategy for 2.5-dimensional
unstructured grids is described in [6]. More colors (e.g., 6) may be needed to handle
the coloring on such grids.
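One simple way to obtain such a coloring is greedy first-fit coloring of the kernel adjacency graph. The sketch below is illustrative (the function name and graph representation are ours, and it is not necessarily the strategy of [6]); it only guarantees the MPNF requirement that no two neighboring kernels share a color.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy first-fit coloring of the 2D kernel adjacency graph: each kernel
// (column of cells) gets the smallest color unused by its already-colored
// neighbors. On unstructured 2.5D grids this may need more colors than the
// structured case.
std::vector<int> colorKernels(const std::vector<std::vector<std::size_t>>& adj) {
    std::vector<int> color(adj.size(), -1);
    for (std::size_t k = 0; k < adj.size(); ++k) {
        std::vector<bool> used(adj.size() + 1, false);
        for (std::size_t n : adj[k])
            if (color[n] >= 0) used[color[n]] = true;
        int c = 0;
        while (used[c]) ++c; // smallest color unused by neighbors
        color[k] = c;
    }
    return color;
}
```

In practice the kernel ordering matters: a poor visiting order can use more colors than necessary, and fewer colors mean more concurrency per color.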
Besides the coloring strategy, the definition of off-diagonal submatrices on the first
level of MPNF, Lc and Uc, needs to be extended. For SpMV, they may need to access
vector elements from more than one color, although most of the elements are still
expected to be from the previous color for Lc, and from the next color for Uc.
6.12 Concluding Remarks
In this chapter, we describe a parallel NF linear solver especially suited for GPUs,
building on the Massively Parallel NF framework described by Appleyard et al. [8].
Our GPU-based implementation of MPNF supports asynchronous memory transfer,
utilizes a CUDA-based BiCGStab solver as an accelerator, and computes dot products
with a customized, higher-bandwidth reduction kernel. The
most important features of our approach include: 1) special ordering of the matrix
elements that maximizes coalesced access to GPU global memory, 2) application of
“twisted factorization”, which increases the number of concurrent threads at no addi-
tional cost and maintains coalesced memory access, and 3) extension of the algorithm
for systems with multiple GPUs by first solving the halo region in each GPU and
overlapping the peer-to-peer memory transfer between GPUs with the solution of the
interior regions for each color in the solution sequence.
The GPU-based MPNF linear solver is demonstrated using several large problems,
and we break down the performance details of all the algorithmic components. For the
full SPE 10 model on a 512-core Tesla M2090 GPU, our implementation achieves a
speedup of 26 for single-precision and 19 for double-precision computations compared
with a single core of the Xeon X5660 CPU. Moreover, the 6-GPU (3072 cores) solution
of a highly refined SPE 10 model (26.9 million cells) is more than five times faster
than the single-GPU solution.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
The development of a parallel general-purpose numerical simulation framework is the
subject of this dissertation. The Automatic-Differentiation General-Purpose Research
Simulator (AD-GPRS) described here serves as a flexible, extensible, and efficient re-
search platform for coupled reservoir models and advanced (e.g., multisegment) wells.
The ADETL library, which was first developed by Younis [92, 93] and extended by
Zhou [96], provides a core component of the infrastructure for this powerful reservoir
simulation platform. This AD capability allows us to write only the nonlinear discrete
residual code and obtain the Jacobian matrix automatically and efficiently. The
automatic generation of the Jacobian using AD substantially enhances the flexibility
and extensibility of the simulation platform. Moreover, our generic and modular code
design facilitates the addition of new capabilities, including additional flow processes
and numerical methods, to the simulator.
For flexible reservoir modeling, AD-GPRS employs MPFA spatial discretization
schemes and a multilevel AIM strategy for time discretization. A unified code-base,
which works for TPFA or MPFA in space, and for any combination of FIM, AIM,
IMPES, and IMPSAT in time, has been developed. The object-oriented C++
implementation has minimal code duplication. Owing to the generic design and
implementation across the nonlinear and linear levels, the new discretization framework
in both space and time is compatible with all the other functionality in the
simulator. Moreover, the AD-based simulation capability is demonstrated using challenging
compositional problems with strong nonlinearity and large-scale highly heterogeneous
reservoir models, discretized using TPFA and various MPFA schemes. Any model can
be run using FIM or AIM. Our results clearly indicate that MPFA-based compositional
simulations are significantly more accurate than TPFA computations, especially
for nonorthogonal and unstructured grids; this also applies to cases with
full-tensor permeability. In addition, the multilevel AIM strategy reduces the overall
simulation cost, along with the level of numerical dispersion.
The linear solver framework is a very important component of AD-GPRS. Of
the two linear systems offered by AD-GPRS, the CSR representation is based on
the public CSR matrix format, which has a very simple structure and is thus easy
to understand. This simple structure, however, limits the efficiency of the associated
linear solvers and makes the representation unsuitable for large problems. The
more efficient option is the block linear system based on the customized MLBS matrix
format, which has a much more sophisticated structure. Based on a hierarchical
storage system, MLBS satisfies the requirements of a well-designed data structure
(accessibility, encapsulation, extensibility, and computational efficiency). For any
new submatrix type added to MLBS, no requirement is imposed on its internal structure,
as long as the matrix operations share common interfaces and are implemented
in a consistent way. With this extensible design, model developers have sufficient
flexibility to create new matrix formats and associated solution strategies that are
suitable for each individual facility model of AD-GPRS. This also applies to other
new features that change the structure of the global Jacobian. To solve the block
linear system efficiently, a two-step algebraic reduction is applied first to obtain the
smallest implicit system. With the CPR-based two-stage preconditioning strategy,
an iterative Krylov subspace solver (e.g., GMRES) is then used to solve the implicit
system. Finally, a two-step explicit update is used to recover the full solution from the
implicit solution. This solution strategy offers much higher efficiency than using a
single-stage preconditioner (e.g., BILU), let alone schemes based on the (pointwise)
CSR linear system.
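The algebra of the two-stage CPR apply can be sketched on a toy dense system. In this illustration (our own code, not AD-GPRS; Jacobi sweeps stand in for both the AMG pressure solve and the BILU smoother, and the restriction simply selects pressure rows rather than using the true CPR weighting), stage one solves the restricted pressure system and prolongs the result, and stage two smooths the remaining full residual.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// Plain Jacobi sweeps, standing in for AMG (stage 1) and BILU (stage 2).
Vec jacobi(const Mat& A, const Vec& b, int sweeps) {
    Vec x(b.size(), 0.0);
    for (int s = 0; s < sweeps; ++s) {
        Vec xn = x;
        for (std::size_t i = 0; i < b.size(); ++i) {
            double r = b[i];
            for (std::size_t j = 0; j < b.size(); ++j)
                if (j != i) r -= A[i][j] * x[j];
            xn[i] = r / A[i][i];
        }
        x = xn;
    }
    return x;
}

// Two-stage CPR-style apply: x = s + M2^{-1}(r - A*s), s = P * App^{-1} * R*r.
Vec cprApply(const Mat& A, const std::vector<std::size_t>& pIdx, const Vec& r) {
    // Stage 1: restrict to the pressure rows/columns and solve approximately.
    Mat App(pIdx.size(), Vec(pIdx.size()));
    Vec rp(pIdx.size());
    for (std::size_t i = 0; i < pIdx.size(); ++i) {
        rp[i] = r[pIdx[i]];
        for (std::size_t j = 0; j < pIdx.size(); ++j)
            App[i][j] = A[pIdx[i]][pIdx[j]];
    }
    Vec xp = jacobi(App, rp, 50);
    Vec s(r.size(), 0.0);
    for (std::size_t i = 0; i < pIdx.size(); ++i) s[pIdx[i]] = xp[i]; // prolong
    // Stage 2: smooth the remaining full residual r - A*s and correct.
    Vec r2(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) {
        r2[i] = r[i];
        for (std::size_t j = 0; j < r.size(); ++j) r2[i] -= A[i][j] * s[j];
    }
    Vec c = jacobi(A, r2, 5);
    for (std::size_t i = 0; i < r.size(); ++i) s[i] += c[i];
    return s;
}
```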
For accurate and unified modeling of multiphase flow in wellbores and surface
(pipeline) networks, AD-GPRS supports a generalized MS well model. With a flexible
discretization into nodes and connections, the general MS well model has a series of
advantages over the original MS well model: 1) general branching: the well segments
and surface pipelines can fork and rejoin at any place such that very complex geometry
can be created, 2) loops with arbitrary flow directions: the actual flow directions
are determined at run time and can change from iteration to iteration, such that a
loop may be defined without its flow directions being prespecified, 3) multiple exit
connections with different constraints, and 4) special nodes with various functionality
(e.g., separators, valves). The first two advantages are illustrated using numerical
examples, whereas the features needed for the last two have not yet been
implemented. With the flexible and extensible framework, these features can be
introduced later as extensions to the general MS well model. For the linear solution,
the original two-stage CPR preconditioner has been extended to handle the coupled
reservoir model with general MS wells. Comparisons with single-stage BILU(0) and
BILU(1) preconditioners indicate the superior efficiency of the extended two-stage
preconditioner. In addition, to address the difficulties in the nonlinear convergence
of the coupled system, a local nonlinear solver for the well regions is introduced by
fixing the reservoir conditions and solving only for the facility solution. About one
third of the Newton iterations are saved with the local nonlinear solver for an upscaled
SPE 10 model. We also found that the best strategy for utilizing the local nonlinear
solver is to activate it only when the reservoir part is already close to convergence. In
such circumstances, the reservoir part will serve as a reasonably accurate boundary
condition for the local facility solution and oscillations in the Newton iterations can
be avoided.
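The idea behind the local facility solve can be illustrated with a scalar toy problem: the reservoir pressure is frozen and serves as a boundary condition while Newton iterates on a single well equation. The quadratic friction balance below is purely illustrative (our own construction, not an AD-GPRS model).

```cpp
#include <cassert>
#include <cmath>

// Toy local nonlinear well solve: the reservoir pressure pRes is held fixed
// (acting as the boundary condition) while Newton iterates on the well rate
// q from a simple quadratic friction balance f(q) = pRes - pBH - c*q*|q|.
double solveWellRate(double pRes, double pBH, double c,
                     double q0 = 1.0, int maxIt = 50, double tol = 1e-12) {
    double q = q0;
    for (int it = 0; it < maxIt; ++it) {
        const double f = pRes - pBH - c * q * std::fabs(q);
        const double df = -2.0 * c * std::fabs(q); // df/dq
        if (std::fabs(f) < tol || df == 0.0) break;
        q -= f / df; // Newton update on the facility unknown only
    }
    return q;
}
```

In the simulator the same pattern applies to the full facility residual vector: the perforation inflows are evaluated at the frozen reservoir state, and only the facility Jacobian is factorized in the inner loop.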
Today, multicore CPUs are prevalent in mainstream desktop computers and servers.
In addition, general-purpose GPU computing is quickly emerging as an important
technology. High-performance computing techniques have drawn considerable attention
and are developing at a very fast pace. Parallelization allows us 1) to better
utilize the computational resources such that results can be obtained in a shorter
time, and 2) to simulate more detailed models with finer discretization, higher
compositional resolution, and more sophisticated physics. Consequently, OpenMP-based
multithreading parallelization has been implemented in AD-GPRS. Parallel Jacobian
construction is realized through the thread-safe extension of the ADETL library. For
the parallel linear solution, the CPR-based two-stage preconditioning strategy is used,
with the parallel multigrid solver XSAMG applied in the first stage and the block
Jacobi preconditioner, which uses BILU as the local preconditioner, applied in the
second stage. Based on the benchmarking results for the full SPE 10 problem (with
three discretization schemes applied on the manipulated nonorthogonal grid) on a
dual quad-core Nehalem node, we find that the linear solver speedup (an average of
4.1X) is usually lower than that of the remaining computations (i.e., other than the
linear solver; an average of 6.0X). This is because the linear solver places higher
demands on memory bandwidth and on the number of memory channels. Note that
the speedup of the linear solver could have been lower if the NUMA-aware pressure
matrix assembly, the partial setup strategy for XSAMG, or the optimal partitioning
for block Jacobi were not applied. The overall speedup averages 5.0X, which is
reasonable for the multicore platform we tested.
Finally, a parallel NF linear solver especially suited for GPUs is developed. This
solver is built on the MPNF framework described by Appleyard et al. [8] and is
integrated into parallel AD-GPRS as a pressure preconditioner. Our implementation
of MPNF is designed especially for the GPU architecture. For instance, our solver
applies asynchronous memory transfer from CPU to GPU in the setup phase, uses
a CUDA-based BiCGStab solver as an accelerator in the solution phase, and computes
dot products using a customized reduction kernel that offers higher bandwidth.
Most importantly, a reversed ordering is applied to the matrix elements such that
coalesced access to GPU global memory is maximized. For small problems, the
number of concurrent threads is insufficient for good utilization of a GPU. Thus,
the “twisted factorization” technique is applied to improve the concurrency with no
extra computation or side effects on coalesced memory access. For the algorithm
to work on multiple GPUs, peer-to-peer memory transfer, which is overlapped
with the computational kernels, is used for the data exchange between GPUs. The
GPU-based MPNF solver is demonstrated using several large reservoir models and
compared against its CPU-based counterpart (using OpenMP). We find that 1) coalesced
memory access is the most critical factor affecting the performance of the
algorithm: the speedup can differ by several times with or without this feature;
2) twisted factorization can further improve the performance of the algorithm and
should always be applied, for any problem size; 3) the accuracy of the preconditioner
increases while the parallelism decreases when more colors are used, and the
number of colors for which near-optimal performance is achieved depends on the
nature of the problem; 4) the number of threads per CUDA core (RT/C) is a reasonable
measure of GPU utilization, with 8 ∼ 10 being the minimally acceptable range for
good utilization; 5) the current bottleneck of multi-GPU parallelization lies
in the data transfer between CPU and GPU due to the limited PCIe bandwidth on a
single computational node, as well as in the decreased number of concurrent threads
due to the domain partitioning and the separation of computation into halo and interior
regions. Comparing the pressure solver time for the full SPE 10 model on a 512-core
Tesla M2090 GPU and on a single core of the Xeon X5660 CPU, our GPU-based
MPNF solver (with coalesced memory access and twisted factorization) achieves a
speedup of 26 for single-precision and 19 for double-precision computations. Moreover,
the pressure solution of a highly refined SPE 10 model (26.9 million cells) on
six GPUs with 3072 cores in total is more than five times faster than that on a single
GPU.
7.2 Future Work
Based on the work described in this dissertation, AD-GPRS has evolved into a
flexible and efficient reservoir simulation research laboratory in which students and
researchers can readily test their ideas. With general spatial and temporal discretization
schemes, AD-GPRS is capable of modeling thermal-compositional fluid flow in
reservoir models with fully unstructured grids and general multisegment wells. In
addition, the multicore and multi-GPU parallelization enables AD-GPRS to simulate
large-scale models efficiently on the corresponding platforms. Here, we list several
high-priority research directions related to the work described in this dissertation:
• Apply the MPNF algorithm to both stages of the CPR preconditioner with a global
CUDA-based BiCGStab / GMRES accelerator. See Section 6.11 for detailed
comments.
• Account for off-band coefficients on the outermost level of MPNF matrices,
in order for the MPNF algorithm to accommodate 2.5D unstructured grids
(and MPFA discretization schemes) with more flexible coloring strategies. See
Section 6.11 for detailed comments.
• Develop a hybrid MPI/OpenMP parallelization in order for AD-GPRS to run
efficiently on clusters. As a further step, extend the multi-GPU parallelization
for the MPNF algorithm to run across a number of GPU-equipped nodes.
• Extend the general MS well model to handle multiple exit connections, as well
as nodes with special functionality.
• Further test the local nonlinear solver and extend it for the solution of other
parts in the simulation, e.g., geomechanics and chemical reactions, when a fully-
implicit coupling is used to connect them to the fluid flow part.
• Investigate the impact of various MPFA schemes on the behavior of the linear
and nonlinear solvers.
• Devise AIM schemes for other compositional formulations (e.g., molar).
• Analyze and improve the variable-based AIM formulation, which sometimes
incurs instability with the current linear-stability criteria.
• Further grow the modeling and solution capabilities of AD-GPRS, e.g., new
thermal-compositional formulations and fluid models, coupled geomechanics,
and chemical reactions.
Nomenclature
A Cross-sectional area (of an MS-well segment)
C0 Flow profile parameter in the drift-flux model
D Depth of a reservoir cell or an MS-well node
DH Hydraulic diameter of an MS-well segment
DH Dimensionless hydraulic diameter of an MS-well segment
Fc Overall flux of component c
fc,p Fugacity of component c in phase p
ftp Fanning friction factor
g Gravitational acceleration coefficient
g Gravitational component along the well
Hp Enthalpy of phase p
kf Slope used in the linear interpolation of ftp for an intermediate Re
Ku(DH) Critical Kutateladze number
m(θ) A scaling parameter in the terminal rise velocity for inclined pipes
min Mass flow rate of the mixture entering an MS-well segment
nbr(i) The set containing all neighboring nodes of an MS-well node i
NB Number of reservoir cells
nc Number of components
NImp Number of implicit variables in a cell
np Number of phases
np Number of points associated with a flux
Nperf Number of perforations in a well
NPri Number of primary variables in a cell
NSec Number of secondary variables in a cell
perf(k) Reservoir cell number of the k’th perforation in a well
P Pressure (of the base phase - oil)
Pp Pressure of phase p
∆Pw Total pressure difference between two MS-well nodes
∆Pwh Hydrostatic pressure difference between two MS-well nodes
∆Pwf Frictional pressure difference between two MS-well nodes
∆Pwa Acceleration-related pressure difference between two MS-well nodes
Qloss Heat loss from wellbore fluid to surroundings
Qm Mixture flow rate
qp Inflow (per unit volume) of phase p (to an MS-well segment)
Qp Volumetric inflow of phase p (to an MS-well segment)
Qp Flow rate of phase p
Re Dimensionless Reynolds number
RT/C The ratio of the minimum number of concurrent threads in a color
of an MPNF matrix to the number of CUDA cores
rw Wellbore radius
Sp Saturation of phase p
∆t Timestep size
T Temperature
T^{i0,i1} Two-point transmissibility coefficient for the interface {i0, i1}
T^{i0,i1}_{im} Multi-point transmissibility coefficient associated with interface
{i0, i1} and cell im
V Volume of a reservoir cell or an MS-well node
Up Internal energy of phase p
Uto Overall heat transfer coefficient
Vc Characteristic velocity
Vd Terminal rise velocity
Vm Mixture velocity
Vp Interstitial velocity of phase p
Vsp Superficial velocity of phase p
xc,p Mole fraction of component c in phase p
zc Overall mole fraction of component c
∆z Length between two MS-well nodes
αp In-situ phase fraction (holdup) of phase p
ε Roughness height
γp Mass density of phase p
λp Mobility of phase p
µp Viscosity of phase p
νp Mole fraction of phase p
φ Porosity
Φp The flow part of the flux of phase p
ρp Molar density of phase p
σpq Interfacial tension between phase p and q
θ Inclination angle of an MS-well segment from vertical
Acronyms and Abbreviations
AD Automatic Differentiation
ADETL Automatically Differentiable Expression Templates Library
AIM Adaptive Implicit Method
AMG Algebraic MultiGrid
axpy Level 1 BLAS operation (add a multiple of one vector to another):
y ← ax + y
BiCGStab BiConjugate Gradient Stabilized method
BILU Block Incomplete LU factorization
BLAS Basic Linear Algebra Subprograms
BHP Bottom Hole Pressure
CFL Courant-Friedrichs-Lewy
COO COOordinate list
copy Level 1 BLAS operation (copy one vector to another): y ← x
CPR Constrained Pressure Residual
CSR Compressed Sparse Row (equivalent to CRS - Compressed Row
Storage)
CR Cyclic Reduction
CUDA Compute Unified Device Architecture
dot Level 1 BLAS operation (inner product): d← x · y
DP Double Precision
EOR Enhanced Oil Recovery
FIM Fully Implicit Method
GMRES Generalized Minimal RESidual
GPGPU General-Purpose computing on Graphics Processing Units
GPRS General Purpose Research Simulator
GPU Graphics Processing Unit
HPC High Performance Computing
HYB HYBrid matrix format (EllPack + COO)
ILU Incomplete LU factorization
IMPES IMplicit Pressure Explicit Saturation
IMPSAT IMplicit Pressure and SATuration
I/O Input / Output
MLBS Multi-Level Block Sparse (equivalent to MLSB - Multi-Level Sparse
Block)
MPFA Multi-Point Flux Approximations
MPNF Massively Parallel Nested Factorization
MS MultiSegment
NF Nested Factorization
NUMA Non-Uniform Memory Architecture
PCIe PCI (Peripheral Component Interconnect) express
RHS Right Hand Side
SAMG Algebraic MultiGrid methods for Systems
scal Level 1 BLAS operation (scale a vector by a constant): x← ax
SIMT Single Instruction Multiple Threads
SP Single Precision
SpMV Sparse Matrix-Vector multiplication
SPU Single-Point Upstream
TF Twisted Factorization
TLS Thread-Local Storage
TPFA Two-Point Flux Approximations
Subscripts
c Component
p Phase: g (gas), o (oil), w (water), or l (liquid, as a mixture of oil
and water)
m Mixture
i Node-based property for a reservoir cell or an MS-well node
(i, j) Connection-based property for the flux across interface (i, j)
C Connection (in a general MS well)
F Facilities
N Node (in a general MS well)
R Reservoir
W Well
Superscripts
avg Averaged value
inj Injection condition
n Time level n
n+ 1 Time level n+ 1
res Reservoir property
sc Standard condition
w Wellbore property
Bibliography
[1] I. Aavatsmark. An introduction to multipoint flux approximations for quadri-
lateral grids. Comput. Geosci., 6(3-4):405–432, 2002.
[2] I. Aavatsmark, T. Barkve, and T. Mannseth. Control-volume discretization
methods for 3D quadrilateral grids in inhomogeneous, anisotropic reservoirs.
SPE 38000, SPE Journal, 3(2):146–154, June 1998.
[3] I. Aavatsmark, G. Eigestad, B.-O. Heimsund, B. Mallison, J. Nordbotten, and
E. Øian. A new finite-volume approach to efficient discretization on challenging
grids. SPE 106435, SPE Journal, 15(3):658–669, September 2010.
[4] I. Aavatsmark, G. Eigestad, B. Mallison, and J. Nordbotten. A compact multi-
point flux approximation method with improved robustness. Numerical Methods
for Partial Differential Equations, 24(5):1329–1360, September 2008.
[5] G. Acs, S. Doleschall, and E. Farkas. General purpose compositional model.
SPE 10515, SPE Journal, 25(4):543–553, August 1985.
[6] J. Appleyard. Method and apparatus for estimating the state of a system,
January 2012. Patent, US 2012/0022841 A1.
[7] J. R. Appleyard. Nested factorization. SPE 12264, proceedings of the 7th SPE
Reservoir Simulation Symposium, San Francisco, CA, November 1983.
[8] J. R. Appleyard, J. D. Appleyard, M. A. Wakefield, and A. L. Desitter. Accel-
erating reservoir simulators using GPU technology. SPE 141402, proceedings of
the 21st SPE Reservoir Simulation Symposium, Houston, TX, February 2011.
[9] K. Aziz and A. Settari. Petroleum Reservoir Simulation. Applied Science Pub-
lishers, 1979.
[10] H. Beggs. Production optimization using NODAL analysis. Oil and Gas Con-
sultants International, Inc., Tulsa, OK, 1991.
[11] G. Behie. Practical considerations for incomplete factorization methods in reser-
voir simulation. SPE 12263, proceedings of the 7th SPE Reservoir Simulation
Symposium, San Francisco, CA, November 1983.
[12] C. Bischof, H. Bucker, P. Hovland, U. Naumann, and J. Utke. Advances in
Automatic Differentiation, Lect. Notes in Comp. Sci. and Eng. Springer, 2008.
[13] C. Bischof, G. Corliss, L. Green, A. Griewank, K. Haigler, and P. Newman.
Automatic differentiation of advanced CFD codes for multidisciplinary design.
Journal on Computing Systems in Engineering, 3:625–637, 1992.
[14] H. Bucker, G. Corliss, P. Hovland, U. Naumann, and B. Norris. Automatic Dif-
ferentiation: Applications, Theory and Implementations, Lect. Notes in Comp.
Sci. and Eng. Springer, 2006.
[15] H. Cao. Development of Techniques for General Purpose Simulation. PhD
thesis, Stanford University, 2002.
[16] H. Cao and K. Aziz. Performance of IMPSAT and IMPSAT-AIM models in
compositional simulation. SPE 77720, proceedings of the SPE Annual Technical
Conference and Exhibition, San Antonio, TX, September 29-October 2, 2002.
[17] H. Cao, H. Tchelepi, J. Wallis, and H. Yardumian. Parallel scalable CPR-type
linear solver for reservoir simulation. SPE 96809, proceedings of the SPE Annual
Technical Conference and Exhibition, Dallas, TX, October 2005.
[18] M. Chien, H. Yardumian, E. Chung, and W. Todd. The formulation of a thermal
simulation model in a vectorized, general purpose reservoir simulator. In SPE
18418, proceedings of the 10th SPE Reservoir Simulation Symposium, Houston,
TX, February 1989.
[19] M. Christie and M. Blunt. Tenth SPE comparative solution project: A com-
parison of upscaling techniques. SPE 72469, SPE Reservoir Eval. & Eng.,
4(4):308–317, August 2001.
[20] T. Clees and L. Ganzer. An efficient algebraic multigrid solver strategy for adap-
tive implicit methods in oil reservoir simulation. SPE 105789, SPE Journal,
15(3):670–681, September 2010.
[21] CMG. STARS User’s Guide. The Computer Modelling Group,
http://www.cmg.com, 2008.
[22] K. Coats. IMPES stability: Selection of stable timesteps. SPE 84924, SPE
Journal, 8(2):181–187, June 2003.
[23] K. Coats, L. Thomas, and R. Pierson. Compositional and black oil reservoir
simulation. SPE 50990, SPE Reservoir Eval. & Eng., 1(4):372–379, August
1998.
[24] G. Corliss, C. Bischof, A. Griewank, S. Wright, and T. Robey. Automatic dif-
ferentiation for PDE’s: Unsaturated flow case study. In Advances in Computer
Methods for Partial Differential Equations - VII. IMACS, 1992.
[25] D. DeBaun, T. Byer, P. Childs, J. Chen, F. Saaf, M. Wells, J. Liu, H. Cao,
L. Pianelo, V. Tilakraj, P. Crumpton, D. Walsh, H. Yardumian, R. Zorzynski,
K.-T. Lim, M. Schrader, V. Zapata, J. Nolen, and H. Tchelepi. An extensi-
ble architecture for next generation scalable parallel reservoir simulation. In
SPE 93274, proceedings of the 18th SPE Reservoir Simulation Symposium, The
Woodlands, TX, February 2005.
[26] U. Drepper. ELF handling for thread-local storage. Technical report, Red Hat
Inc., February 2003.
[27] M. Edwards and C. Rogers. Finite volume discretizations with imposed flux
continuity for the general tensor pressure equation. Comput. Geosci., 2(4):256–
290, 1998.
[28] Y. Fan. Chemical Reaction Modeling in a Subsurface Flow Simulator with Ap-
plication to In-Situ Upgrading and CO2 Mineralization. PhD thesis, Stanford
University, 2010.
[29] Y. Fan, L. J. Durlofsky, and H. A. Tchelepi. Numerical simulation of the in-situ
upgrading of oil shale. SPE 118958, SPE Journal, 15(2):368–381, June 2010.
[30] H. Fischer. Special problems in automatic differentiation. In A. Griewank and
G. Corliss, editors, Automatic Differentiation of Algorithms. SIAM, PA, 1991.
[31] Fraunhofer Institute SCAI. XSAMG Announcement. Fraunhofer SCAI,
http://www.scai.fraunhofer.de, 2011.
[32] L. S. Fung and A. H. Dogru. Parallel unstructured-solver methods for simu-
lation of complex giant reservoirs. SPE 106237, SPE Journal, 13(4):440–446,
December 2008.
[33] W. Gander and G. H. Golub. Cyclic reduction history and applications. Pro-
ceedings of the Workshop on Scientific Computing, Hong Kong, March 1997.
[34] A. Griewank. On automatic differentiation. In M. Iri and K. Tanabe, editors,
Mathematical Programming: Recent Developments and Applications. Kluwer
Academic Publishers, IL, 1990.
[35] S. Haaland. Simple and explicit formula for the friction factor in turbulent pipe
flow including natural gas pipelines. Ifag b-131, Div Aero and Gas Dynamics,
The Norwegian Institution of Technology, 1981.
[36] J. Holmes, T. Barkve, and O. Lund. Application of a multisegment well model
to simulate flow in advanced wells. SPE 50646, proceedings of the SPE European
Petroleum Conference, The Hague, Netherlands, October 1998.
[37] Y. Jiang. Tracer flow modeling and efficient solvers for GPRS. Master’s thesis,
Stanford University, 2004.
[38] Y. Jiang. Techniques for Modeling Complex Reservoirs and Advanced Wells.
PhD thesis, Stanford University, 2007.
[39] Y. Jiang and H. A. Tchelepi. Scalable multistage linear solver for coupled sys-
tems of multisegment wells and unstructured reservoir models. In SPE 119175,
proceedings of the 20th SPE Reservoir Simulation Symposium, The Woodlands,
TX, February 2009.
[40] G. Karypis and V. Kumar. A fast and high-quality multilevel scheme for parti-
tioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392,
1999.
[41] Khronos OpenCL Working Group. The OpenCL Specification, Version 1.2,
November 2011.
[42] J. Kim and S. Finsterle. Application of automatic differentiation in TOUGH2. In
Proceedings of The Tough Symposium, LBNL. LBNL, May 2003.
[43] L. Koesterke, J. Boisseau, J. Cazes, K. Milfeld, and D. Stanzione. Early expe-
riences with the Intel Many Integrated Cores accelerated computing technology.
In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery,
TG ’11, pages 21:1–21:8, New York, NY, USA, 2011. ACM.
[44] F. Kwok. A block ILU(k) preconditioner for GPRS. Technical report, Stanford
University, December 2004.
[45] S. B. Lippman, J. Lajoie, and B. E. Moo. C++ Primer, 4th Edition. Addison-
Wesley Professional, 2005.
[46] S. Livescu, L. Durlofsky, and K. Aziz. A semianalytical thermal multiphase
wellbore flow model for use in reservoir simulation. SPE 115796, SPE Journal,
15(3):794–804, September 2010.
[47] S. Livescu, L. Durlofsky, K. Aziz, and J. Ginestra. A fully-coupled thermal mul-
tiphase wellbore flow model for use in reservoir simulation. Journal of Petroleum
Science and Engineering, 71(3-4):138 – 146, 2010. Fourth International Sym-
posium on Hydrocarbons and Chemistry.
[48] P. Lu, J. Shaw, T. Eccles, I. Mishev, A. Usadi, and B. Beckner. Adaptive parallel
reservoir simulation. In International Petroleum Technology Conference. IPTC
12199, December 2008.
[49] P. Micikevicius. Multi-GPU Programming. NVIDIA Corporation, 2012.
[50] P. Moin. Fundamentals of Engineering Numerical Analysis. Cambridge Uni-
versity Press, 2001.
[51] A. Moncorge and H. A. Tchelepi. Stability criteria for thermal adaptive implicit
compositional flows. SPE 111610, SPE Journal, 14(2):311–322, June 2009.
[52] NVIDIA Corporation. Whitepaper - NVIDIA’s Next Generation CUDA Com-
pute Architecture: Fermi, 2009.
[53] NVIDIA Corporation. CUBLAS Library, Version 5.0, October 2012.
[54] NVIDIA Corporation. CUDA C Best Practices Guide, Version 5.0, October
2012.
[55] NVIDIA Corporation. CUDA C Programming Guide, Version 5.0, October
2012.
[56] NVIDIA Corporation. CUDA Samples, Version 5.0, October 2012.
[57] NVIDIA Corporation. CUSPARSE Library, Version 5.0, October 2012.
[58] L. Ouyang. Single Phase and Multiphase Fluid Flow in Horizontal Wells. PhD
thesis, Stanford University, 1998.
[59] M. Pal and M. G. Edwards. Quasimonotonic continuous Darcy-flux approximation for general 3D grids of any element type. In SPE 106486, proceedings of
the 19th SPE Reservoir Simulation Symposium, Houston, TX, February 2007.
[60] H. Pan and H. A. Tchelepi. Reduced variable method for general-purpose com-
positional reservoir simulation. In SPE 131737, proceedings of the International
Oil and Gas Conference and Exhibition in China, Beijing, China, June 2010.
[61] H. Pan and H. A. Tchelepi. Compositional flow simulation using reduced-
variables and stability-analysis bypassing. In SPE 142189, proceedings of the
21st SPE Reservoir Simulation Symposium, The Woodlands, TX, February
2011.
[62] D. Peaceman. Interpretation of well-block pressures in numerical reservoir sim-
ulation. SPE Journal, 18(3):183–194, June 1978.
[63] P. Quandalle and J. Sabathier. Typical features of a multipurpose reservoir
simulator. SPE 16007, SPE Reservoir Engineering, 4(4):475–480, 1989.
[64] L. Rall. Perspectives on automatic differentiation: Past, present and future.
In Automatic Differentiation: Applications, Theory and Implementations, Lect.
Notes in Comp. Sci. and Eng. Springer, 2005.
[65] T. Russell. Stability analysis and switching criteria for adaptive implicit meth-
ods based on the CFL condition. In SPE 18416, proceedings of the 10th SPE
Reservoir Simulation Symposium, Houston, TX, February 1989.
[66] Y. Saad. ILUT: a dual threshold incomplete LU factorization. Numerical linear
algebra with applications, 1(4):387–402, 1994.
[67] Y. Saad. Iterative Methods for Sparse Linear Systems, Second Edition. Society
for Industrial and Applied Mathematics, 2003.
[68] Y. Saad and M. Schultz. GMRES: A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856–
869, 1986.
[69] P. Sarma. Efficient Closed-Loop Optimal Control of Petroleum Reservoirs Under
Uncertainty. PhD thesis, Stanford University, 2006.
[70] Schlumberger. Eclipse Technical Description 2011.2. Schlumberger, 2011.
[71] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junk-
ins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and
P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing.
ACM Trans. Graph., 27(3):18:1–18:15, Aug. 2008.
[72] A. Semenova, S. Livescu, L. J. Durlofsky, and K. Aziz. Modeling of multiseg-
mented thermal wells in reservoir simulation. In SPE 130371, proceedings of the
SPE EUROPEC/EAGE Annual Conference and Exhibition, Barcelona, Spain,
June 2010.
[73] H. Shi, J. Holmes, L. Diaz, L. J. Durlofsky, and K. Aziz. Drift-flux parame-
ters for three-phase steady-state flow in wellbores. SPE 89836, SPE Journal,
10(2):130–137, 2005.
[74] H. Shi, J. Holmes, L. J. Durlofsky, K. Aziz, L. Diaz, B. Alkaya, and G. Oddie.
Drift-flux modeling of two-phase flow in wellbores. SPE 84228, SPE Journal,
10(1):24–33, 2005.
[75] G. Shiralkar, G. Fleming, J. Watts, T. Wong, B. Coats, R. Mossbarger, E. Rob-
bana, and A. Batten. Development and field application of a high performance,
unstructured simulator with parallel capability. In SPE 93080, proceedings of
the 18th SPE Reservoir Simulation Symposium, The Woodlands, TX, February
2005.
[76] K. Stuben. Algebraic multigrid (AMG): Experiences and comparisons. Proceedings of the International Multigrid Conference, April 1983.
[77] K. Stuben. An introduction to algebraic multigrid. Appendix in the book Multigrid, pages 413–532, 2001.
[78] K. Stuben and T. Clees. SAMG User’s Manual. Fraunhofer Institute SCAI,
2003.
[79] H. Sutter. The free lunch is over: a fundamental turn toward concurrency in
software. Dr. Dobb’s Journal, 30(3), 2005.
[80] G. Thomas and D. Thurnau. Reservoir simulation using an adaptive implicit
method. SPE 10120, SPE Journal, 23(5):759–768, October 1983.
[81] H. A. van der Vorst. Large tridiagonal and block tridiagonal linear systems on
vector and parallel computers. Parallel Computing, 5:45–54, July 1987.
[82] H. A. van der Vorst. BI-CGSTAB: a fast and smoothly converging variant
of BI-CG for the solution of nonsymmetric linear systems. SIAM Journal on
Scientific and Statistical Computing, 13(2):631–644, March 1992.
[83] S. Verma. Flexible Grids for Reservoir Simulation. PhD thesis, Stanford Uni-
versity, 1996.
[84] D. Voskov, R. Younis, and H. Tchelepi. General nonlinear solution strategies for multiphase multicomponent EOS-based simulation. In SPE 118996, proceedings of the 20th SPE Reservoir Simulation Symposium, The Woodlands, TX, February 2009.
[85] D. Voskov and Y. Zhou. Technical Description of the AD-GPRS. Energy
Resources Engineering, Stanford University, 2012.
[86] D. V. Voskov and H. A. Tchelepi. Compositional space parameterization: Mul-
ticontact miscible displacements and extension to multiple phases. SPE 113492,
SPE Journal, 14(3):441–449, September 2009.
[87] D. V. Voskov and H. A. Tchelepi. Compositional space parameterization: The-
ory and application for immiscible displacements. SPE 106029, SPE Journal,
14(3):431–440, September 2009.
[88] J. Wallis. Incomplete Gaussian elimination as a preconditioning for general-
ized conjugate gradient acceleration. SPE 12265, proceedings of the 7th SPE
Reservoir Simulation Symposium, San Francisco, CA, November 1983.
[89] J. Wallis, R. Kendall, T. Little, and J. Nolen. Constrained residual acceleration
of conjugate residual methods. SPE 13536, proceedings of the 8th SPE Reservoir
Simulation Symposium, Dallas, TX, February 1985.
[90] X. Wang. Trust-Region Newton Solver for Multiphase Flow and Transport in
Porous Media. PhD thesis, Stanford University, 2012.
[91] X. Wang and H. A. Tchelepi. Trust-region based nonlinear solver for counter-
current two-phase flow in heterogeneous porous media. In 13th European Con-
ference on the Mathematics of Oil Recovery. EAGE, September 2012.
[92] R. Younis. Modern Advances in Software and Solution Algorithms for Reservoir
Simulation. PhD thesis, Stanford University, 2011.
[93] R. Younis and K. Aziz. Parallel automatically differentiable data-types for next-
generation simulator development. In SPE 106493, proceedings of the 19th SPE
Reservoir Simulation Symposium, Houston, TX, February 2007.
[94] F. Zhang. The Schur Complement and Its Applications. Springer, 2005.
[95] Y. Zhang, J. Cohen, and J. D. Owens. Fast tridiagonal solvers on the GPU.
SIGPLAN Not., 45(5):127–136, Jan. 2010.
[96] Y. Zhou. Multistage preconditioner for well groups and automatic differentia-
tion for next generation GPRS. Master’s thesis, Stanford University, 2009.
[97] Y. Zhou. ADETL User Manual, 2012.
[98] Y. Zhou, Y. Jiang, and H. Tchelepi. A scalable multi-stage linear solver for coupled reservoir models with multi-segment wells. Computational Geosciences, 2012.
[99] Y. Zhou, H. A. Tchelepi, and B. T. Mallison. Automatic differentiation frame-
work for compositional simulation on unstructured grids with multi-point dis-
cretization schemes. In SPE 141592, proceedings of the 21st SPE Reservoir
Simulation Symposium, The Woodlands, TX, February 2011.
[100] N. Zuber and J. A. Findlay. Average volumetric concentration in two-phase
flow systems. Journal of Heat Transfer, 87:453–468, November 1965.
Appendix A
Programming Model of AD-GPRS
In this appendix the programming model of AD-GPRS is described. Our objective is
to make AD-GPRS a flexible and efficient reservoir simulation research laboratory
with extensible modeling and solution capabilities. To achieve this goal, a modular
object-oriented design is adopted. All of the code is written in standard C++. While
this design makes it convenient for developers to extend the simulator by incorporating
new physics, introducing complex processes, or adding new formulations and solution
algorithms, it requires some effort for newcomers to become fully familiar with
the code. AD-GPRS is a large and complex program. Currently it has over
200 header and source files (excluding ADETL and its associated FASTL library)
separated into more than ten subdirectories. To facilitate the development
and usage of AD-GPRS, good documentation, including technical descriptions, user
manuals, examples, and doxygen-style comments (see www.doxygen.org) in the code,
is essential.
This appendix is organized similarly to Appendix A in [15]. The structure of AD-GPRS
is discussed level by level, from the top down, followed by the flow sequence of the
simulation framework. The list of directories and files of AD-GPRS, together with
their descriptions, is also provided. Here we use italic shape to denote class names,
slanted shape to denote variable names, and boldface to denote function names.
A.1 The structure of AD-GPRS
The system model shows the basic classes and their relations, and it is very helpful
for understanding the structure of AD-GPRS. Due to the complexity of AD-GPRS,
the system model is explained level by level using multiple figures.
Figure A.1: Overall structure of the entire simulator
Figure A.1 shows the overall structure of the entire simulator. Simulator is the
topmost class, instantiated in the main function; it can be seen as the entry point
to the versatile simulation functionality. The Simulator class contains the outermost
loop over input regions. Inside each input region, there is an initialization process to
accommodate changes in parameters, followed by a loop over timesteps. For each
timestep, Simulator determines the proper timestep size such that the time-stepping
requirements of report steps and input regions (see Section A.2 for more information)
are satisfied. At the beginning of the simulation, Simulator creates the SimMaster
object, which is deallocated when the simulation run finishes.
SimMaster is the manager of all simulation-related objects. When SimMaster is created,
it creates a specific NonlinearFormulation object and its associated AIMScheme
object according to the formulation type in the input data. After that, common
objects including Reservoir, Facilities, NonlinearSolver, and StreamlineTracer are
constructed. The data members of SimMaster also include the global variable set
(adX), which stores all the simulation variables with both values and gradients, as well
as the global backup set (adX_n), which saves a copy of the values of all variables. For
details about the global variable and backup sets, please refer to Section 1.2.1 and the
ADETL user manual [97]. There is an additional backup set (adFX), which is used
only when the active-window mode is activated. In that case, adX and
adX_n cover only the variables in the active region, whereas adFX holds a
copy of the variables in the entire simulation domain, regardless of which regions
are activated. Moreover, for the initialization and subsequent update of the
global variable and backup sets, two additional vectors are needed: 1) totalBlocks,
which contains the number of blocks in each variable subset, and 2) allStatus, which
contains the (phase) status for each block in all variable subsets.
A common namespace shared by a variety of classes throughout the simulator is
called full_set. It contains a set of global indexing constants that are fixed for a given
number of phases and components, and are general to any variable formulation. For
example, for node-based variables, there are indexing constants such as PRES (pressure),
TEMP (temperature), SAT (saturations), XCP (molar fractions), MOBILITY
(phase mobilities), and so on. For connection-based variables,
indexing constants include QM (mixture flow rate) and QSP (superficial phase flow
rates). These indexing constants cover both the independent and dependent variables.
Any status in a variable formulation activates a part of the corresponding variables
to be independent and leaves the rest dependent, without changing
their values.
Now we discuss each component of the SimMaster class in detail. Figure A.2 shows
the structure of the NonlinearFormulation class.

Figure A.2: Structure of the NonlinearFormulation

NonlinearFormulation is the abstract base class of various
formulations, such as NaturalVariableFormulation, MolarVariableFormulation, and so
on. By constructing new inherited classes of NonlinearFormulation (or one of its
derived classes) with the same interfaces but possibly different realizations, we are
able to introduce new formulations.
NonlinearFormulation contains a variety of formulation-related virtual member
functions. Some functions are called in the initialization stage to specify the phase,
component, and variable structure of a formulation. This includes makeStructure
(create the table of active independent variables associated with each possible phase
status) and buildStatusTable (create the table of existing phases associated with
each possible phase status). Please see the numerical framework section in [85] for
more details.
Other virtual functions in NonlinearFormulation, such as enforceConstraints
(enforce certain local constraints), changeStatus (check the phase status and change
it if necessary), fluidPropertiesCalculation (compute all fluid properties), and
makeSecondaryTerm (form secondary equations), usually perform computations
on a specific block, which can be either a reservoir cell, or a well node. That is,
the NonlinearFormulation object is shared by Reservoir and various facility classes.
APPENDIX A. PROGRAMMING MODEL OF AD-GPRS 222
Thus these virtual functions should be designed in a generic way in order to work
for both cases. For example, the rock properties and fluid properties cannot be
calculated by the same function in NonlinearFormulation, because rock properties
are used in Reservoir class only, whereas fluid properties are shared by Reservoir
and facility classes. That is why the fluidPropertiesCalculation function is separated
from the original properties calculation that covers both rock and fluid properties.
Among the common member variables, pOrder is an array whose size equals
the number of primary (i.e., mass and energy conservation) equations. It is used to
reorder the conservation equations from their default order (i.e., the order of components)
at the nonlinear level. The energy conservation equation, if present, is still
required to be the last primary equation, i.e., after all mass conservation equations.
Next, phaseCombinations is a vector of integer vectors created in the common buildStatusTable
function. The number of elements in phaseCombinations is equal to
the total number of possible phase statuses, and each integer vector lists the
phases present in a specific phase status. If all combinations of np phases are feasible,
there will be 2^np − 1 statuses. For example, when np = 3, we have the following
2^3 − 1 = 7 statuses (integer vectors): (0), (1), (0, 1), (2), (0, 2), (1, 2), (0, 1, 2). This
table will be fetched by many other classes, such as Reservoir and AIMScheme, to
loop through the phases present in the current status of a cell. Moreover,
the default values of alwaysPresentPhases (boolean vector, all false by default)
and componentTable (vector of boolean vectors, all true by default) are also set in
the buildStatusTable function. Each element in alwaysPresentPhases indicates if
a phase is always present in all possible statuses (e.g., for black-oil and dead-oil fluid,
we have water and oil phases as always present phases), whereas each element in
componentTable indicates if a component (first dimension) can appear in a specific
phase (second dimension). Last but not least, pThermalFluid is only created in a
APPENDIX A. PROGRAMMING MODEL OF AD-GPRS 223
thermal formulation and can be used for the calculation of thermal properties.
Figure A.3: Structure of the Reservoir
Next, we discuss the second component in the SimMaster class: Reservoir. Figure
A.3 shows the structure of the Reservoir class. All reservoir-related operations (called
for the reservoir only, not for any facility model, e.g., initialization/update of all reservoir
variables, computation of all reservoir properties, flux terms, and accumulation
terms, etc.) are defined here. The Reservoir class needs to access the NonlinearFormulation
object and the global variable (adX) and backup (adX_n) sets. By calling unified
interfaces in NonlinearFormulation, the Reservoir class is general for all formulations,
i.e., we do not need to create derived Reservoir classes for different formulations.
Besides several constant pointers for fetching various reservoir properties from the
input data, the data members of the Reservoir class include a Rock object, a Fluid
object, a general connection list, a variable checker, and a trust-region solver. The Rock
class computes rock properties (e.g., porosity or pore volume under
reservoir conditions) for the reservoir. If geomechanics capability is integrated, the
functionality of the Rock class (and other relevant classes) will grow tremendously. The
general connection list is a data structure for recording the block nonzero entries
in the system matrix for a generalized MPFA discretization. At the beginning of a
simulation run, the list is generated from the specification of the MPFA stencils for
all flux entries in the model. For details about the general connection list, please refer
to Section 2.2.2 and [99]. The variable checker calculates, among all
reservoir cells, the maximum changes in the basic variables (e.g., p, S_p, ν_p, x_cp, and z_c)
between two Newton iterations. It provides a mechanism to check the convergence of
the nonlinear system in addition to checking the residuals. The trust-region solver is
an advanced nonlinear solver that provides local chopping in each reservoir cell based
on the shape of the flux function (see [90, 91] for details).
Figure A.4: Structure of the Fluid
Given the problem being simulated, SimMaster creates the proper Fluid
object for the Reservoir, e.g., TwoPhaseCompFluid for two-phase compositional simulation,
MultiPhaseCompFluid for multiphase compositional simulation, and BlackOilFluid
for black-oil simulation. The structure of the Fluid class is shown in Figure A.4.
A variety of fluid computations, which can be used by both the reservoir and various
facility models, are provided in the Fluid class through common interfaces. Part of the
data in the Fluid class, though, is specific to the reservoir or facility object it is
associated with. Thus, by creating the proper Fluid object and utilizing its common
interfaces, a (derived) NonlinearFormulation class can hide the details of certain
fluid computations. To minimize code duplication, one derived Fluid class can be
shared by multiple derived NonlinearFormulation classes (e.g., TwoPhaseCompFluid
can be used with both NaturalVariableFormulation and MolarVariableFormulation).
If none of the existing Fluid classes satisfies the requirements of a new nonlinear formulation
(e.g., GammaVariableFormulation), a new Fluid class (e.g., GammaPhaseFluid)
needs to be created by inheriting from the base class or any existing derived Fluid
class. Conversely, one derived NonlinearFormulation class can be compatible
with multiple types of fluid, e.g., NaturalVariableFormulation works
with TwoPhaseCompFluid, MultiPhaseCompFluid, or BlackOilFluid.
The Fluid class contains two primary data members: RockFluid and a vector of
Phase pointers. The RockFluid class, which is independent of the fluid type, com-
putes two properties: relative permeability and capillary pressure. The Phase class
is an abstract data type. It provides a series of phase-based property computations,
such as viscosity and density of one phase. Different data types, such as EOSPhase
(properties are calculated based on EOS) and PVTPhase (properties are calculated
based on PVT), are inherited from the Phase class. Each pointer in the phase vector
represents one specific phase (e.g., Gas, Oil, or Water) in the simulation, and can
be in any phase data type (EOS or PVT). The phases are created by the inherited
Fluid class. For example, BlackOilFluid creates a vector of PVT phases, whereas
TwoPhaseCompFluid creates a vector of EOS phases. In fact, the Fluid class is
able to create a mixture of phases of different types, e.g., two EOS phases and one
PVT phase in a compositional setting with two hydrocarbon phases and one simple
water phase. This allows developers to flexibly create new fluid scenarios with
heterogeneous treatments of the phases.
The third component in the SimMaster class is Facilities. Figure A.5 shows its
structure.

Figure A.5: Structure of the Facilities

All facility-related operations (called for facility models only, not for the reservoir,
e.g., initialization/update of all well variables, computation of all well properties,
adding the source/sink terms to reservoir residuals and forming well residuals, etc.)
are defined in this class. These operations are usually performed by calling the corresponding
function in each of the Well objects stored in a vector owned and managed
by the Facilities class. In addition to this vector of Well pointers, the Facilities class
contains a vector of WellState variables that can be used to back up and restore the
state of all wells at any time (e.g., at the beginning of a timestep), as well as a
variable checker that calculates, among all well nodes, the maximum changes in the
basic variables between two Newton iterations.
The Well class is an abstract base class that provides common interfaces for all
facility models. Currently, the standard well and the generalized MultiSegment (MS)
well model have been implemented in AD-GPRS. By inheriting from Well or its
derived classes, we can add new facility models (e.g., well groups) to the simulator.
Similar to the reservoir, wells need to access the NonlinearFormulation object and
the global variable (adX) and backup (adX_n) sets. Each well also has its own fluid
object, which is created from the reservoir fluid object and contains the specific
fluid-related data associated with the well. Because each facility model has its own
matrix format, and each linear system assumes a different matrix structure, it would
not be generic to put the facility treatment in a linear system or the matrix handling
in a facility model. To mix and match various facility models and linear systems,
the idea of a facility-matrix handler is introduced; please refer to Section 4.5.2 for
details. There is a top-level handler for the Facilities class and the selected matrix
system, which creates lower-level handlers for each well object and the same matrix
system. There are also many other common data members, e.g., a vector of possible
well controls, defined for the base Well class.
Figure A.6: Structure of the AIMScheme
The fourth component in the SimMaster class is AIMScheme. Figure A.6 shows its
structure. AIMScheme is designed for calculating the CFL numbers, determining
the implicit levels, and performing nonlinear treatments for all blocks. It is the base
class for the various AIM schemes corresponding to different nonlinear formulations. For
example, NaturalVariableAIMScheme works for NaturalVariableFormulation
with all compatible fluid types. The base AIMScheme class, which contains the
basic implementation of the CFL computation, can be used temporarily for any new
formulation. In that case, however, the new formulation is only compatible with the Fully
Implicit Method (FIM), because the nonlinear treatments differ between
formulations and are supposed to be implemented in the corresponding derived AIMScheme
classes. If AIM capability is desired later, a new inherited AIMScheme class
needs to be constructed for that nonlinear formulation by defining proper nonlinear
treatments (and processes for computing CFL numbers / criteria for determining
implicitness, if necessary).
Similar to Reservoir and all facility models, AIMScheme accesses the NonlinearFormulation
object, adX, and adX_n. In addition, it needs to access the fluid object in
the Reservoir class for the possible recalculation of fluid properties during the nonlinear
treatment (see Section 2.3.2). Its most important data members are listed
here: 1) AIMStatistics: a structured type containing statistical information (e.g., average
/ maximum number of implicit cells / variables) for AIM; 2) AIMParameters:
a structured type containing configuration parameters (e.g., whether AIM is used,
the CFL limit, the maximum number of implicit variables in a non-FIM cell) for AIM;
3) nImpVars: a vector containing the number of implicit variables in each cell; 4)
ImpComp: a boolean array indicating whether each component in each cell is implicit
(needed by variable-based AIM); 5) CFL: a vector containing the component-
and phase-based fluxes for the CFL computation; and 6) CFLb: a vector containing the
most constraining CFL number of each grid block.
Figure A.7: Structure of the NonlinearSolver
The fifth component in SimMaster is NonlinearSolver, with its structure shown
in Figure A.7. The main objective of NonlinearSolver is to find the nonlinear solution
of any timestep, given the converged solution at the last timestep and the
current timestep size. By default, Newton's method (with a certain chopping scheme)
is applied. Due to the wide range of data and functionality needed, NonlinearSolver
accesses almost every other component in SimMaster, including the NonlinearFormulation,
Reservoir, Facilities, and AIMScheme objects, as well as such variables as adX,
adX_n, and allStatus. The most important data members of NonlinearSolver are as
follows: 1) residual: the AD residual vector, which contains not only the residual
values but also the gradients (Jacobian) of the discretized governing equations; 2)
residualNormalizer: during the convergence check, the elements in the residual vector
are normalized by these values, which are usually computed at the same time as the
nonlinear residuals are built; 3) LinearSystem: extracts the Jacobian matrix from the
(AD) residual vector and obtains the Newton update through the linear solution; and 4)
ResidualChecker: checks the convergence of the nonlinear iterations based
on the normalized residual values, given the corresponding tolerances for transport
equations, local constraints, and well equations.
Figure A.8: Structure of the LinearSystem
The LinearSystem alone is a very large and essential part of AD-GPRS. Its structure
is shown in Figure A.8. LinearSystem is the abstract base class for the different
types of linear system (each with its own matrix format and a set of solvers that
can work on that matrix format). In the initialization process, the general connection
list from the Reservoir, the nImpVars and ImpComp vectors from the AIMScheme,
the Facilities object, and the global variable set (adX) are passed to the
LinearSystem, which acquires the necessary information from these objects. The dense
Right-Hand-Side (RHS) vector is shared by all types of linear system. In a general
linear problem Ax = b, this data member stores the vector b before the linear
solution and the vector x after it.
Currently, two types of linear systems are derived: CSRLinearSystem and BlockLinearSystem.
CSRLinearSystem (see Section 3.2 for details) works on the Compressed
Sparse Row (CSR) matrix format. A handy function is provided by the ADETL
library to extract the gradients contained in the (AD) residual vector in CSR format.
Therefore, most of the time we need no specialization in this linear system to
accommodate new facility models or other features introduced into AD-GPRS. However,
the efficiency of CSRLinearSystem is usually relatively low. Generally
speaking, any library solver that works on the CSR matrix type can be supported; for
example, the PARDISO and SuperLU direct solvers are currently included. These
solvers, however, are not competitive with state-of-the-art iterative linear solvers
employing CPR (Constrained Pressure Residual [88, 89]) based multistage preconditioners.
It is suggested to use CSRLinearSystem and its associated solver options
only for verifying correctness on small problems.
On the other hand, BlockLinearSystem (see Section 3.3 for details) works on
a much more complicated matrix format, which is extended from the MultiLevel
Sparse Block (MLSB) matrix [38] in the original GPRS. This is a recursively defined
matrix structure. On the very top level, it is divided into four parts: J_RR, J_RF,
J_FR, and J_FF, where the subscripts R and F represent Reservoir and Facilities,
respectively. The first subscript indicates the equations, whereas the second
indicates the variables (e.g., J_RF represents the derivatives of the Reservoir equations
with respect to the Facilities variables). On the second level, each of the four parts has
its own structure and possibly contains several submatrices, which are defined on the
third level. By repeating this process, a hierarchy of matrix structures is established.
To define the matrix structure for a new facility model, we need to
customize the submatrices in the J_RF, J_FR, and J_FF parts. All of the (sub)matrices in
the MLSB format are derived from the abstract base class GPRS_Matrix, where common
interfaces such as matrix extraction, algebraic reduction, explicit update, and matrix-vector
multiplication are defined. Because these matrix operations are implemented
recursively in the derived MLSB matrices, new MLSB matrix types can be introduced
without knowing or altering the implementation details of existing ones.
Given its flexible structure and awareness of block sparsity, MLSB-based linear
solvers can be much more efficient than the CSR-based ones. Currently, we have
a block GMRES solver with several preconditioner options, including single-stage
BILU(0) / BILU(1) and multistage CPR (AMG / SAMG + BILU(0) / BILU(1)).
CPR combining SAMG and BILU(0) usually yields the best performance and the most
robust results.
A.2 Flow sequence
Having shown the system model and important classes in AD-GPRS, we can now
review the flow sequence. The essential steps of a forward simulation are as follows:
1. Read the input data (problem definition) from the given simulation deck
2. Initialization (allocate memory, create objects, and apply initial conditions)
3. While there is still some input region that has not been simulated, do:
(a) Update the simulation parameters and corresponding data for the current
input region, if it is not the first one
(b) While the current input region has not been finished, do:
i. Calculate the timestep size
ii. Initialize the timestep: if the nonlinear solution succeeded in the last
timestep, use that converged solution as the initial guess; otherwise,
roll back the independent variables and phase statuses to the last
converged solution
iii. While the maximum number of Newton iterations in each timestep
has not been reached, do:
A. Discretize the reservoir residuals by computing accumulation terms
and secondary constraints for each reservoir cell, as well as flux
terms for each reservoir interface
B. For each facility model, check whether its control needs to be
switched, calculate its properties, add source/sink terms to the
corresponding reservoir residuals, and form its own well residuals
C. Check if the Newton iteration converges. If yes, go to 3(b)iv
D. Extract the linear system from (AD) residual vector, perform al-
gebraic reduction, solve the linear system, and use explicit update
to get back the full solution to all independent variables
E. Apply the solution (Newton update) to all reservoir and well vari-
ables. Update the phase statuses for all reservoir cells
F. Calculate updated rock and fluid properties for all reservoir cells
G. Go to step 3(b)iii for the next Newton iteration
iv. If the solution is accepted before reaching the maximum number of Newton
iterations, report statistics for the converged timestep and dump the
solution if necessary (e.g., at a report step)
v. Go to step 3b for the next timestep
(c) Go to step 3 for the next input region
4. Report overall statistics and perform post-processing if necessary
5. Deallocate everything and end simulation
A.3 List of files
Besides the Visual Studio project file (for Windows) and the Makefile (for Linux)
located in the base directory (SourceCode), there are more than ten subdirectories.
Here we introduce the files contained in each of them.
• ACSP
CSP_DataStorage.hpp/.cpp Data storage for the CSP method
CSP_Interpolation.hpp/.cpp Interpolation for the CSP method
CSP_TesselationBuilder.hpp/.cpp Tessellation builder for the CSP method
Octree.hpp/.cpp Octree data structure
• AIMScheme
AIMScheme.hpp/.cpp The base class of various AIM schemes
NaturalVariableAIMScheme.hpp/.cpp The derived AIM scheme for NaturalVari-
ableFormulation
• CSAT
CSAT_FB.hpp/.cpp CSAT FB
CSAT_G.hpp/.cpp CSAT G
CSAT_Interpolation.hpp/.cpp Interpolation for the CSAT method
CSAT_steam.hpp/.cpp CSAT with steam
CSATInput.hpp Input data for CSAT method
neg_flash.hpp Negative flash subroutines
phase_equil.hpp/.cpp Phase equilibrium using CSAT
tie_storage.hpp/.cpp Tie-line storage
TieSimplex.hpp/.cpp Tie simplex
TieSimplexTypesContainer.hpp/.cpp Container for tie simplex types
• Facilities
Facilities.hpp/.cpp The management class of all wells (facility
models)
Facilities_MLSB_Handler.hpp/.cpp The facility-matrix handler for the top facility
management class and the MLSB matrix format
FacilityMatrixHandler.hpp The base class of various facility-matrix
handlers
FacilityMatrixHandlerFactory.hpp Helper functions that create the corre-
sponding facility-matrix handler given the
facility model, system matrix, and param-
eters
GeneralizedWell.hpp/.cpp Generalized MS well class
GeneralizedWellConfig.hpp Configuration parameter class for Gener-
alizedWell
GenWell_MLSB_Handler.hpp/.cpp The facility-matrix handler for the GeneralizedWell
class and the MLSB matrix format
PseudoWell.hpp Pseudo well class
StandardWell.hpp/.cpp Standard well class
StdWell_MLSB_Handler.hpp/.cpp The facility-matrix handler for the StandardWell
class and the MLSB matrix format
Well.hpp/.cpp The base class for various facility models
WellControl.hpp The general well control class
WellInput.hpp All input data of a single well
WellState.hpp Class for gathering all basic well state vec-
tors
WellStateHDF5.hpp/.cpp Class for the conversion of WellState be-
tween its internal and HDF5 format
• IO
FluidParameters.hpp/.cpp Input parameters for Fluid
GridParameters.hpp Input parameters for Grid
hdf5_utils.hpp/.cpp HDF5 utilities for I/O
InputData.hpp/.cpp The collection of input data for one input
region
IO.hpp/.cpp Class for outputting everything (used by the
logger) and partially for input
Keywords.hpp/.cpp Class for handling all keywords in the AD-
GPRS input file (forward simulation)
KeywordsBase.hpp/.cpp Base class for keyword handling
Log.hpp/.cpp Class for redirecting output to screen and
disk file
Logger.hpp/.cpp Class for managing screen output, log
files, and solution dumping
Parse.hpp/.cpp Parser used in Keywords class
print.hpp/.cpp Functions for writing a vector or array to
disk file
readdata.hpp/.cpp The base class for reading data in certain
formats
ReservoirParameters.hpp/.cpp Input parameters for Reservoir
TuningParameters.hpp/.cpp Input parameters for tuning the simula-
tion
WellParameters.hpp/.cpp Input parameters for facility models
• LinearSolvers
adDenseBlockExtractor.hpp Auxiliary class for extracting dense blocks
of derivatives in ADvector
AMG.h Interfaces to Fortran functions of open-
source AMG solver (AMG1990)
AMGPre.h/.cpp The pressure preconditioner using open-
source AMG solver
BILU0Pre.h/.cpp Block ILU(0) preconditioner
BILUPre.h/.cpp Block ILU(k) preconditioner
BlkGMRESSolverMP.h/.cpp Block GMRES solver for MLSB matrix
with general MPFA discretization
BlockLinearSolverBase.h Base class for MLSB-based linear solvers
BlockLinearSystem.hpp/.cpp MLSB-based linear system
comprow_pointer.h/.cpp Block CSR matrix format (used in block-based
preconditioners such as BILU(0) and BILU(k))
CSRLinearSystem.hpp/.cpp CSR-based linear system
csrmatrixformat.hpp/.cpp (Pointwise) CSR matrix format
GaussSolver.hpp Common functions for dense Gaussian
elimination
GenWellBILUPre.h/.cpp Block ILU preconditioner for Generalized-
Well matrix in MLSB format
gmres.h Template implementation of GMRES ac-
celerator
GPRS_Matrix.h/.cpp Base class for submatrices in the MLSB format
librarysolver.hpp/.cpp Base class for all library solvers
LinearSystemBase.hpp/.cpp Base class for all linear systems
myheap.h/.cpp Heap data structure for symbolic factor-
ization in Block ILU(k) preconditioner
parametersforlibrarysolvers.hpp/.cpp Class for reading and storing parameters
of library solvers
pardiso4.hpp/.cpp Library PARDISO solver
PreconditionerBase.h Base class for all (MLSB-based) precondi-
tioners
ResFacBILUPre.h/.cpp Global Reservoir-Facilities Block ILU pre-
conditioner
SAMG.h Interfaces to Fortran functions of SAMG
solver
SAMGPre.h/.cpp The pressure preconditioner using SAMG
solver
SubRRBlk.h/.cpp The matrix type for the RR part and cer-
tain submatrices in the FF part of MLSB
matrix
SubCOOBlk.h/.cpp The matrix type for certain submatrices
in the RF and FR part of MLSB matrix
superlu.h/.cpp Library SuperLU solver
SysFFJwrapper.h/.cpp The wrapper format for the FF part in
MLSB matrix
SysFRJwrapper.h/.cpp The wrapper format for the FR part in
MLSB matrix
SysMatwrapper.h The template wrapper format for the en-
tire MLSB matrix (as well as the subma-
trix of a general MS well in the FF part)
SysRFJwrapper.h/.cpp The wrapper format for the RF part in
MLSB matrix
TrueIMPESPre.h/.cpp CPR-based multistage preconditioner
with true-IMPES reduction for reservoir
matrix
YUSolverInterface.hpp Class for extracting and solving local
dense linear systems (e.g., used in New-
ton flash)
• NonlinearSolvers
NonlinearSolver.hpp/.cpp Class for handling all the functionality as-
sociated with the nonlinear solution in ad-
vancing one timestep
ResidualChecker.hpp/.cpp Newton convergence checker based on the
maximum normalized residuals
TrustRegionSolver.hpp/.cpp Class for the trust-region based chopping
strategy
VariableChecker.hpp/.cpp Newton convergence checker based on the
maximum variable changes
• Properties
BlackOilFluid.hpp/.cpp Derived fluid class for black-oil simulation
CO2Phase.hpp/.cpp Derived phase class specifically for model-
ing CO2
ComponentParameters.hpp/.cpp Class containing (default) component pa-
rameters and EOS coefficients computa-
tion
CubicEquationSolver.hpp Functions for solving cubic equation in
EOS solution process
DeadOilFluid.hpp/.cpp Derived fluid class for dead-oil simulation
EOSPhase.hpp/.cpp Derived phase class based on EOS rela-
tionship
Fluid.hpp/.cpp Base class for various fluid types
GammaPhaseFluid.hpp/.cpp Derived fluid class specifically for Gam-
maVariableFormulation
kvalinput.hpp Input data for K-value method
MultiPhaseCompFluid.hpp/.cpp Derived fluid class for multiphase compo-
sitional simulation
MultiPhaseFlash.hpp/.cpp Class for the flash calculation with general
multiphase fluid
Phase.hpp Base class for various phase types
PVTPhase.hpp/.cpp Derived phase class based on PVT rela-
tionship
RockFluid.hpp/.cpp Class for the computation of relative per-
meability and capillary pressure
SCALData.hpp/.cpp Class used by RockFluid for tabulating
the relative permeability and capillary
pressure between two phases
ThermalProperties.hpp/.cpp Class for calculating thermal properties
TwoPhaseCompFluid.hpp/.cpp Derived fluid class for two-phase composi-
tional simulation
TwoPhaseKvalueCompFluid.hpp/.cpp Derived fluid class for two-phase compositional
simulation with K-values
• Reservoir
connections.hpp Type definitions of generic MPFA connec-
tion data and generalized connection list
Reservoir.hpp/.cpp Class for managing all reservoir-related
computations
Rock.hpp/.cpp Class for computing rock properties (e.g.,
porosity) for Reservoir
• Simulation
full_set.hpp/.cpp A namespace containing a set of global indexing
constants that are used throughout the simulation
InputTimeStepping.hpp Input data for time stepping parameters
main.cpp The main (entry) function of the entire
AD-GPRS
SimData.hpp/.cpp The singleton class that processes and
stores the current input data
SimMaster.hpp/.cpp Class for managing all simulation-related
objects and handling preprocessing and
post-processing functionality
SimulationTime.hpp/.cpp The singleton class containing simulation
time and timestep size
Simulator.hpp/.cpp The topmost class instantiated in main
function as an entry to versatile simula-
tion functionality
Statistics.hpp/.cpp The singleton class that handles all statis-
tical data during simulation
• Streamlines
Point3d.hpp Struct for a point in the three-dimensional
space
Streamline.hpp/.cpp Class of streamline
StreamlineHDF5.hpp/.cpp Class for the conversion of Streamline be-
tween its internal and HDF5 format
StreamlineTracer.hpp/.cpp Class for tracing streamlines
• Utilities
enumtypes.hpp/.cpp A collection of all enumerated data types
and functions that convert strings to these
types
impl_function.hpp Implementation of the implicit function theorem
interpolation.hpp Inline functions for linear interpolation
and extrapolation
main_def.hpp Major constants and definitions
MyMatrix.h/.cpp Functions for dense-matrix computation
and sparse-matrix index calculation
Timer.hpp/.cpp Class for accurate timing of various func-
tionality
utilities.hpp Useful common functions (e.g., finding the k-th
smallest element in an array, limiting the variable
update, and clamping variables to the physical range)
• VariableFormulations
GammaVariableFormulation.hpp/.cpp Class of the gamma variable formulation
ModelInput.hpp Input data for nonlinear formulation and
fluid model
MolarVariableFormulation.hpp/.cpp Class of molar variable formulation
NaturalVariableFormulation.hpp/.cpp Class of natural variable formulation
NonlinearFormulation.hpp/.cpp Base class of various nonlinear formula-
tions
The above list of subdirectories and files is shared by the serial and parallel
versions of AD-GPRS. There are some additional files for the parallel AD-GPRS.
Currently, the Intel C++ compiler is required to compile the parallel AD-GPRS
on both Linux and Windows platforms. Thus, there is an Intel C++ project file,
in addition to the Visual C++ solution and project files, for the parallel
AD-GPRS. The other additional files are in the subdirectories listed below:
• LinearSolvers
BlockJacobiPre.h/.cpp OpenMP-based Block Jacobi precondi-
tioner
CartesianMatrix.h Matrix format used by NF (Nested Factor-
ization) preconditioner on Cartesian grid
CartesianMatrixImpl.h Implementation file for the template
CartesianMatrix class
CudaMPNFPre.h/.cu CUDA-based MPNF (Massively Parallel
NF) preconditioner
CudaNFPreWrapper.h The wrapper class utilizing GMRES or
BiCGStab as an accelerator for CUDA-
based MPNF preconditioner
CudaNFPreWrapperImpl.h Implementation file for the template Cu-
daNFPreWrapper class
CudaReduction.h/.cu Class for computing dot products on GPU
(based on an example from CUDA SDK)
CudaUtilities.h/.cpp Some useful macros, functions, and global
variables shared by various CUDA-based
classes
MPNFPre.h OpenMP-based MPNF preconditioner
MPNFPreImpl.h Implementation file for the template MP-
NFPre class
NestedMatrix.h Matrix format with a nested structure and
used by OpenMP-based and CUDA-based
MPNF preconditioner
NestedMatrixImpl.h Implementation file for the template
NestedMatrix class
NFPre.h Serial NF preconditioner
NFPreImpl.h Implementation file for the template NF-
Pre class
NFPreWrapper.h The wrapper class utilizing GMRES or
BiCGStab as an accelerator for OpenMP-
based MPNF preconditioner or serial NF
preconditioner
xsamg.h Interfaces to Fortran functions of parallel
multigrid solver XSAMG
• Utilities
OpenMPTools.h/.cpp Some useful macros, functions, and global
variables shared by various OpenMP-
based functionality