800 East 96th Street, 3rd Floor
Indianapolis, IN 46240 USA
Cisco Press
Practical Service Level Management:
Delivering High-Quality Web-Based Services
John McConnell with Eric Siegel
Practical Service Level Management: Delivering High-Quality Web-Based Services
John McConnell with Eric Siegel
Copyright© 2004 Cisco Systems, Inc.
Published by:
Cisco Press
800 East 96th Street
Indianapolis, IN 46240 USA
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, recording, or by any information storage and retrieval system, without
written permission from the publisher, except for the inclusion of brief quotations in a review.
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
First Printing January 2004
Library of Congress Cataloging-in-Publication Number: 2001097399
ISBN: 1-58705-079-x
Warning and Disclaimer

This book is designed to provide information about service level management. Every effort has been made to make
this book as complete and as accurate as possible, but no warranty or fitness is implied.
The information is provided on an “as is” basis. The author, Cisco Press, and Cisco Systems, Inc. shall have neither
liability nor responsibility to any person or entity with respect to any loss or damages arising from the information
contained in this book or from the use of the discs or programs that may accompany it.
The opinions expressed in this book belong to the author and are not necessarily those of Cisco Systems, Inc.
Feedback Information
At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted with care and precision, undergoing rigorous development that involves the unique expertise of members from the
professional technical community.
Readers’ feedback is a natural continuation of this process. If you have any comments regarding how we could
improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through e-mail
at [email protected]. Please make sure to include the book title and ISBN in your message.
We greatly appreciate your assistance.
Trademark Acknowledgments

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized.
Cisco Press or Cisco Systems, Inc. cannot attest to the accuracy of this information. Use of a term in this book
should not be regarded as affecting the validity of any trademark or service mark.
Corporate and Government Sales

Cisco Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales.
For more information, please contact:
U.S. Corporate and Government Sales 1-800-382-3419 [email protected]
For sales outside of the U.S., please contact:
International Sales 1-317-581-3793 [email protected]
Publisher John Wait
Editor-in-Chief John Kane
Executive Editor Brett Bartow
Cisco Representative Anthony Wolfenden
Cisco Press Program Manager Sonia Torres Chavez
Manager, Marketing Communications, Cisco Systems Scott Miller
Cisco Marketing Program Manager Edie Quiroz
Production Manager Patrick Kanouse
Acquisitions Editor Michelle Grandin
Development Editor Jill Batistick
Project Editor Marc Fowler
Copy Editor Jill Batistick
Technical Editors David M. Fishman
John P. Morency
Richard L. Ptak
Team Coordinator Tammi Barnett
Book Designer Gina Rexrode
Cover Designer Louisa Adair
Composition Mark Shirar
Indexer Larry Sweazy
In Loving Memory
This book was finished as a final tribute to my late husband, John McConnell.
I hope these words keep his ideas alive in the industry a little longer.
—Grace Morlock McConnell
My perception of my very special son, John W. McConnell
All that we can be, we must be.
Find a star and never settle for less.
John was born to be one of a kind.
Making his way with a mind of his own,
and making a difference and making it known.
He had his dreams and hopes to pursue.
by his mother, Jeanette McConnell
Dedication

John W. McConnell
December 9, 1943–November 3, 2002
This book is dedicated to my wife, Grace, whose support has been so helpful in carving out the time and quiet
needed for this project. My friends and Grace have also provided a supportive environment and tolerated my
frequent absences to work with clients. Returning home to a warm community has been really important to me.
Acknowledgments

Many people have been part of this process of turning some ideas and experience into a book. First, my thanks to
the Cisco Press team, especially Michelle Grandin. The steady enthusiasm and willingness of all to help are deeply
appreciated.

In the same vein, the technical reviewers have been so helpful. I've had the pleasure of spending good time exchanging
views with John Morency and Rich Ptak at many analyst conferences and other events; their suggestions for this
manuscript were specific and helpful, and in some cases spurred some spirited discussions. Although I’ve never met
David Fishman face to face, I’d be pleased to buy him a good meal someday as thanks for so many good suggestions
and his attention to detail and integrity on getting it right.
Another group I want to acknowledge are the clients I’ve worked with around the world. I’ve gotten to learn a lot
about how technology is actually used and to work with people who want to push the envelope.
Finally, my thanks to my friends and colleagues in the industry who constantly stimulate and challenge me. It’s
been a tremendous blessing to be among so many creative and independent thinkers and doers that have shaped the
networking industry.
—John McConnell
It’s impossible to begin these acknowledgements without wishing that John were still alive. This is his book, not
mine. He conceived it; he drafted it; he should have been writing this page. We all used to joke about how John
“towered over the industry,” and it wasn’t just because of his height. In working from John’s drafts to complete the
book, in talking to colleagues about his work, and in remembering the easy, jovial way he talked about examples of
industry practices, I was constantly reminded of his stature and of the friendly way he had. I think I can say, with
confidence, that everyone in the industry truly misses him; I certainly do.
John’s wife wanted to see this book come to publication, and Cisco Press went far out of their way to make that happen.
Jill Batistick and Michelle Grandin, the editors, were wonderfully friendly and helpful; they made the process of
working through the chapters almost enjoyable. The technical reviewers, Rich Ptak, John Morency, and David Fishman, put a tremendous amount of work into the book. They didn't just point out my errors; they suggested corrections and entire new paragraphs that could improve the text. They were truly partners in bringing the book to publication.
I’d also like to thank Astrid Wasserman, of MediaLive International, Inc., (the organizers of Networld+Interop),
who gave me a copy of John’s proposed two-day seminar on Service Level Management. Although it was never
presented, the seminar slides gave me a lot of insight into his ideas.
I have tried to stay close to John’s original thoughts and text, although I have occasionally succumbed to temptation
and added additional information. Minor additions occur in all chapters; major additions are in Chapter 2 (measurement statistics), Chapter 6 (triage for quick assignment of problems to appropriate diagnostic teams), Chapter 8
(transaction response time), and Chapter 11 (flash loads and abandonment). Most of the additions are topics that I
had discussed with John at various conferences we attended together; I hope, and believe, that he would agree with
them. In all cases when the author speaks directly to the reader, that author is John.
—Eric Siegel October 14, 2003
About the Authors

John McConnell was involved in networking for over 30 years. A member of the ARPANET working group, John contributed to early Internet architecture and protocol development. John has consulted with clients in the U.S., Europe, Asia, and the Middle East, and he has designed some of the first TCP/IP networks deployed in Europe and the Middle East.

John served as a consultant in the areas of systems and network management with a focus on Service Level Management (SLM), policy-based management solutions, and the emerging issues of management solutions for e-business.

John received a master's in electrical engineering and computer science from the University of California, Berkeley.

Eric Siegel, Principal Internet Consultant with Keynote Systems, Inc., "the Internet performance authority," first worked on the Internet in 1978. He wrote Designing Quality of Service Solutions for the Enterprise (John Wiley & Sons) and has taught Internet performance tuning, SLM, and quality of service (QoS) at major industry conferences such as Networld+Interop.

Before joining Keynote Systems, Eric was a Senior Network Analyst at NetReference, Inc., where he specialized in network architectural design for Fortune 100 companies, and he was a Senior Network Architect with Tandem Computers, where he was the technical leader and coordinator for all of Tandem's data communications specialists worldwide. Eric also worked for Network Strategies, Inc. and for the MITRE Corporation, where he specialized in computer network design and performance evaluation. Eric received his B.S. and M.Eng. degrees in electrical engineering from Cornell University, where he was elected to the Electrical Engineering honor society.
About the Technical Reviewers

David M. Fishman is at Sun Microsystems, where he is responsible for availability measurement strategies in the office of Sun's Chief Customer Advocate. Prior to that, he managed Sun's strategic technology relationship with Oracle, driving technology alignment on High Availability (HA), Java technology, and performance. Before joining Sun in 1996, Fishman held a variety of product management and business development positions at Mercury Interactive Corporation. Previous work experience includes high-tech
marketing and management in defense electronics, embedded systems, and office automation. David holds an
MBA from the School of Management at Yale University. He lives in Sunnyvale, California, with his wife and
two children.
John P. Morency is a 29-year veteran of the networking and telecommunications industries and president of
Momenta Research, Inc., a company that he founded in 2002. His industry experience includes network software
development, technical support, IT operations, industry consulting, product marketing, and business development.
Because of his wide range of experience, John has a unique ability to effectively assess the business, technological,
and operational impacts of new products and technologies. This is evidenced by the significant business case and
Total Cost of Ownership (TCO) work that John has done on behalf of hundreds of Fortune 1000 clients over the
past ten years, resulting in hundreds of millions of dollars in both top- and bottom-line benefits.
John’s current research is focused on the business benefits attributable to the implementation of wireless LANs
(Wi-Fi), network telephony, content networking, system and network security, Web services, disaster recovery,
and IT process automation.
He is the author of over 400 publications on the operations and business impact of new IT technology. His speaking
and publication credentials include Networld+Interop, Network World, Billing World, Broadband Year, LightWave,
Telecommunications, and Telecom-Plus International, among many others.
Richard L. Ptak, founder of Ptak & Associates, Inc., has more than 25 years' experience providing consulting services on the use of IT resources to achieve competitive advantage. Ptak earned his B.S. and M.S. at Kansas State University and his MBA at the University of Chicago.
Contents at a Glance
Preface xxi
Part I Service Level Agreements and Introduction to Service Level Management 3
Chapter 1 Introduction 5
Chapter 2 Service Level Management 13
Chapter 3 Service Management Architecture 41
Part II Components of the Service Level Management Infrastructure 59
Chapter 4 Instrumentation 61
Chapter 5 Event Management 81
Chapter 6 Real-Time Operations 101
Chapter 7 Policy-Based Management 129
Chapter 8 Managing the Application Infrastructure 145
Chapter 9 Managing the Server Infrastructure 163
Chapter 10 Managing the Transport Infrastructure 177
Part III Long-term Service Level Management Functions 193
Chapter 11 Load Testing 195
Chapter 12 Modeling and Capacity Planning 209
Part IV Planning and Implementation of Service Level Management 217
Chapter 13 ROI: Making the Business Case 219
Chapter 14 Implementing Service Level Management 231
Chapter 15 Future Developments 245
Index 259
Contents
Preface xxi
Part I Service Level Agreements and Introduction to Service Level Management 3
Chapter 1 Introduction 5
E-business Services 5
B2B 6
B2C 7
B2E 8
Webbed Services and the Webbed Ecosystem 8
Service Level Management 9
Structure of the Book 10
Summary 11
Chapter 2 Service Level Management 13
Overview of Service Level Management 14
The Internal Role of the IT Group 14
The External Role of the IT Group 15
The Components of Service Level Management 15
The Participants in a Service Level Agreement 15
Metrics Within a Service Level Agreement 16
Introduction to Technical Metrics 17
High-Level Technical Metrics 18
Workload 18
Availability 19
Transaction Failure Rate 20
Transaction Response Time 20
File Transfer Time 20
Stream Quality 20
Low-Level Technical Metrics 21
Workload and Availability 21
Packet Loss 22
Latency 22
Jitter 22
Server Response Time 23
Measurement Granularity 23
Measurement Scope 23
Measurement Sampling Frequency 24
Measurement Aggregation Interval 26
Measurement Validation and Statistical Analysis 28
Measurement Validation 28
Statistical Analysis 29
Business Process Metrics 31
Problem Management Metrics 33
Real-Time Service Management Metrics 33
Service Level Agreements 34
Summary 37
Chapter 3 Service Management Architecture 41
Web Service Delivery Architecture 42
Service Management Architecture: History and Design Factors 45
The Evolution of the Service Management Environment 45
Service Management Architectures for Heterogeneous Systems 46
Architectural Design Drivers 48
Demands for Changing, Expanding Services 49
Multiple Service Providers and Partners 49
Elastic Boundaries Among Teams and Providers 49
Demands for Fast System Management 50
Data Item Definition and Event Signaling 50
Service Management Architecture: A General Example 52
Instrumentation 52
Instrumentation Management 53
SLA Statistics and Reporting 54
Real-Time Event Handling, Operations, and Policy 54
Long-Term Operations 55
Back-Office Operations 56
Summary 57
Part II Components of the Service Level Management Infrastructure 59
Chapter 4 Instrumentation 61
Differences Between Element and Service Instrumentation 61
Information for Service Management Decisions 63
Operational Technical Decisions 64
Operational Business Decisions 64
Decisions That Have Long-Term Effect 65
Instrumentation Modes: Trip Wires and Time Slices 65
Trip Wires 66
Time Slices 67
The Instrumentation System 68
Starting with the Instrumentation Managers 69
Collectors 70
Aggregators 72
Processing 72
Ending with the Instrumentation Manager 73
Instrumentation Design for Service Monitoring 73
Demarcation Points 73
Passive and Active Monitoring Techniques 75
Passive Collection 75
Active Collection 75
Trade-Offs Between Passive and Active Collection 76
Hybrid Systems 77
Instrumentation Trends 77
Adaptability 77
Collaboration 78
Tighter Linkage for Passive and Active Collection 78
Summary 78
Chapter 5 Event Management 81
Event Management Overview 82
Alert Triggers 82
Reliable Alert Transport 83
Alert Management 84
Basic Event Management Functions: Reducing the Noise and Boosting the Signal
Volume Reduction 86
Roll-Up Method 86
De-duplication 87
Intelligent Monitoring 87
Artifact Reduction 88
Verification 88
Filtering 89
Correlation 90
Business Impact: Integrating Technology and Services 91
Top-Down and Bottom-Up Approaches 92
Modeling a Service 92
Care and Feeding Considerations 93
Prioritization 94
Activation 95
Coordination 96
A Market-Leading Event Manager: Micromuse 97
Netcool Product Suite 97
Event Management 98
Summary 99
Chapter 6 Real-Time Operations 101
Reactive Management 103
Triage 104
Root-Cause Analysis 107
Speed Versus Accuracy 107
Case Study of Root-Cause Analysis 108
Complicating Factors 110
Brownouts 110
Virtualized Resources 110
The Value of Good Enough 111
Proactive Management 112
The Benefits of Lead Time 112
Baseline Monitoring 112
The Value of Predicting Behavior 113
Automated Responses 113
Languages Used with Automated Responses 113
A Case Study 114
Step 1: Assessing Local Impact 114
Step 2: Adjusting Thresholds 115
Step 3: Assessing Headroom 115
Step 4: Taking Action 115
Step 5: Reporting 116
Building Automated Responses 116
Picking Candidates for Automation 116
Examples of Commercial Operations Managers 116
Tavve Software’s EventWatch 117
ProactiveNet 117
Netuitive 120
Handling DDoS Attacks 121
Traditional Defense Against DDoS Situations 122
Defense Through Redundancy and Buffering 124
Automated Defenses 124
Organizational Policy for DDoS Defense 126
Summary 127
Chapter 7 Policy-Based Management 129
Policy-Based Management 129
The Need for Policies 130
Management Policies for Elements 131
Service-Centric Policies 132
A Policy Architecture 133
Policy Management Tools 133
Repository 134
Policy Distribution 134
The Pull (Component-Centric) Model 134
The Push (Repository-Centric) Model 135
Hybrid Distribution 135
Enforcers 136
Policy Design 136
Policy Hierarchy 137
Policy Attributes 137
Policy Auditing 138
Policy Closure Criteria 138
Policy Testing 138
Policy Product Examples 139
Cisco QoS Policy Manager 139
Orchestream Service Activator 141
Summary 142
Chapter 8 Managing the Application Infrastructure 145
Interaction of Operations and Application Development Teams 146
The Effect of Organizational Structures 146
The Need to Understand the Operational Environment 146
Time Lines Are Shorter 147
Application-Level Metrics 147
Workload 149
Customer Behavior Measurement 149
Business Measurements 150
Service Quality Measurement 151
Transaction Response Time: An Example of Dependence on Lower-Level Services 152
Serialization Delay 153
Queuing Delay 154
Propagation Delay 154
Processing Delay 156
The Need for Communications Among Design and Operations Groups 156
Instrumenting Applications 157
Instrumenting Web Servers 157
Instrumenting Other Server Components 159
End-User Measurements 160
Summary 161
Chapter 9 Managing the Server Infrastructure 163
Architecture of the Server Infrastructure 163
Load Distribution and Front-End Processing 164
Local Load Distribution 166
Geographic Load Distribution 168
Caching 168
Content Distribution 169
Instrumentation of the Server Infrastructure 171
Load Distribution Instrumentation 172
Cache Instrumentation 173
Content Distribution Instrumentation 173
Summary 174
Chapter 10 Managing the Transport Infrastructure 177
Technical Quality Metrics for Transport Services 178
Workload and Bandwidth 178
Availability and Packet Loss 179
One-Way Latency 180
Round-Trip Latency 181
Jitter 181
QoS Technologies 181
Tag-Based QoS 182
IEEE 802 LAN QoS 182
IP TOS 183
IP DiffServ 183
MPLS 183
RSVP 184
Traffic-Shaping QoS 185
Rate Control 186
Queuing 187
Over-provisioning and Isolated Networks 188
Managing Data Flows Among Organizations 188
Levels of Control 189
Demarcation Points 189
Diagnosis and Recovery 189
Summary 191
Part III Long-term Service Level Management Functions 193
Chapter 11 Load Testing 195
The Performance Envelope 196
Load Testing Benchmarks 199
Load Test Beds and Load Generators 200
Building Transaction Load-Test Scripts and Profiles 203
Using the Test Results 205
Summary 206
Chapter 12 Modeling and Capacity Planning 209
Advantages of Simulation Modeling 209
Complexity of Simulation Modeling 211
Simulation Model Examples 211
Model Construction 211
Model Validation 213
Reporting 214
Capacity Planning 214
Summary 215
Part IV Planning and Implementation of Service Level Management 217
Chapter 13 ROI: Making the Business Case 219
Impact of ROI on the Organization 220
A Basic ROI Model 220
The ROI Mission Statement 222
Project Costs 223
Project Benefits 223
Availability Benefits 224
Performance Benefits 225
Staffing Benefits 225
Infrastructure Benefits 225
Deployment Benefits 225
Soft Benefits 226
ROI Case Study 226
Summary 228
Chapter 14 Implementing Service Level Management 231
Phased Implementation of SLM 231
Choosing the Initial Project 231
Incremental Aggregation 232
An SLM Project Implementation Plan 233
Census and Documentation of the Existing System 233
Specification of Performance Metrics 234
Instrumentation Choices and Locations 235
Passive Measurements 236
Active Measurements 236
Baseline of Existing System Performance 237
Investigation of System Performance Sensitivities and System Tuning 237
Construction of SLAs 239
Roles and Responsibilities 240
Reporting Mechanisms and Scheduled Reviews 240
Dispute Resolution 241
Summary 242
Chapter 15 Future Developments 245
The Demands of Speed and Dynamism 245
Evolution of Management Systems Integration 248
Superficial Integration 248
Data Integration 248
Event Integration 249
Process Integration 250
Architectural Trends for Web Management Systems 250
Loosely Coupled Service-Management Systems Architecture 251
Process Managers 251
Clustering and the Webbed Architecture 252
Integrating the Components with Signaling and Messaging 252
Loosely Coupled Service-Management Processes 253
Business Goals for Service Performance 254
Finding the Best Tools 255
Summary 256
Index 259
Preface

Some years ago I received a true pearl of wisdom from an industry colleague. "In order to truly understand your profession," he advised, "you must make the effort to learn other disciplines that are completely different from the one that you espouse."

That colleague was John McConnell, a man who truly understood this advice by walking the talk over the course of his life. Born into a military family, John developed a keen understanding of the importance of the global ecosystem at a very young age through his childhood experiences in both Europe and the Far East. Despite being a shy, scholarly individual throughout primary and secondary school, John also demonstrated the value of hard work and dedication by making the varsity rowing team at U.C. Berkeley.

The strong work ethic that John nurtured at Berkeley served him well after he received his master's in computer science in 1968. What differentiated John from many of his fellow graduates, however, was the application of his craft to non-IT disciplines after graduation. Some of his first initiatives included the application of computer technology to measure the rate of solar intensity upon the earth and the development of a programming language that was designed to test the content and substance of moon samples brought back to earth by the Apollo astronauts. In addition, John developed a number of network control programs for the ARPANET (the predecessor to today's Internet) in the mid-1970s, when the state of the commercial data networking industry was in its true infancy.

John also spent a number of years in professional capacities that had very little to do with information technology. After graduate school, John became an accomplished massage therapist, hypnotist, and practitioner in the art of Rolfing, a technique for the detection, treatment, and removal of bodily stress and pain. In 1983, using his Rolfing technique, John was selected to work with the members of the U.S. Olympic bicycling team, and he applied this technique to aid the team in preparing for the 1984 Olympic games. Recently, when not consulting, John was training to become an instructor in the Ridhwan Foundation, an institution whose focus is the rediscovery and integration of the true self into one's own professional and personal life. Over the years, he had a myriad of personal interests, including soaring, mountain climbing, bird watching, backpacking, rowing, and blues festivals. One of his most recent and satisfying accomplishments was the design, building, and completion of a second home in southern Costa Rica that effectively enabled both him and his wife Grace to really get away from it all.

First and foremost, John's professional focus in the IT industry was the advancement of technologies and products that improved the efficiency and the effectiveness of IT management. Given his whole-life background, John was especially dedicated to reducing the operational and business "pain points" associated with IT implementation and management. This focus is reflected in John's prior works, Internetworking Computer Systems and Managing Client/Server Environments, as well as in Practical Service Level Management: Delivering High-Quality Web-Based Services. John's numerous publications, conferences, and televised briefings reflect a focused dedication to the removal of technological barriers to the optimal effectiveness of IT organizations worldwide. His life experiences as a true Renaissance man uniquely enabled him to both understand and drive the level of change needed to improve not only the state of the art, but also quality of life. John was indeed the "gold standard" of knowledge, professionalism, and personal integrity that made the pursuit of these goals not only a logical possibility, but, for many of us, a practical reality. The loss of John will be keenly felt for some time, but the goals and values that he aspired to and embraced will inspire and guide many of us for years to come.
John Morency, President, Momenta Research
May 2003
P A R T I
Service Level Agreements and Introduction to Service Level Management
Chapter 1 Introduction
Chapter 2 Service Level Management
Chapter 3 Service Management Architecture
C H A P T E R 1
Introduction
The World Wide Web—the Web—is the catalyst for the changes in our communications, work styles, business processes, and ways of seeking entertainment and information. The Internet is just the transport infrastructure for the web-based services that drive so much innovation. Note, however, that the Internet generally gets all the credit. As Thomas Friedman writes in The Lexus and the Olive Tree:

The Internet is going to be like a huge vise that takes the globalization system that I have described—and keeps tightening and tightening that system around everyone, in ways that will only make the world smaller and smaller and faster and faster with each passing day.

This is an accurate description of the environment that most of us deal with directly on a daily basis. The Internet is a tremendous business engine, and, as it transforms the ways we do business, it is being transformed in turn by the ways we use it. We must learn how to manage the growing array of online business services or risk being marginalized by a fast-moving and more dynamic business environment.
In this introductory chapter, I discuss the following:
• The types of e-business services
• A definition of webbed services and the webbed ecosystem
• Service Level Management (SLM)
• The structure of this book
E-business Services

E-business is a generic term defining business activities that are carried out totally, or in part, through electronic communications between distributed organizations and people. These activities are characterized by speed, flexibility, and constant change.

The Internet has become the vehicle for transforming business processes. The reasons for its ascendancy include the following:

• The Internet protocols are the only workable set of technologies that really provide a high degree of interoperability among different systems.
• The wide geographic reach of the Internet increases the size of any potential market.
• Internet economies make it feasible to distribute information and transact business globally.
• The introduction of the browser and its supporting technologies makes the Internet much easier to use, thereby increasing the potential market.
There are many ways of segmenting and describing the large variety of services available
through the Internet and the Web. A simple classification that covers most services is based
on the relationship of the business to customers, business partners, and employees. For
example, the process shown in Figure 1-1 describes a simple situation involving all three
types of relationships: business to business (B2B), business to consumer (B2C), and
business to employee (B2E). These segments are an easy way of organizing our thinking
about services, although it’s important to remember that business processes in the real
world will have many variations and overlaps.
Figure 1-1 Business Relationships
The following sections discuss each relationship type in turn.
B2B

B2B services are a broad category that incorporates transactions among different businesses and government agencies. Many current B2B services, such as supply chain management and credit authorization, use the Internet to drive down the costs and delays associated with current processes and to boost their productivity.

B2B is rapidly broadening to include more than supply chain management and credit authorization. Functions such as shipping, billing, and Customer Relationship Management
(CRM) are now often external to the business; other businesses provide and host these specialized services as a utility. For example, entry of a customer's order can result in more than the functions of pricing, authorizing, assembling, and shipping; a modern system might use B2B links to provide the customer with a shipment tracking number from the shipping company, and it might interact with an external CRM service to reflect the current purchases and other factors of the customer's profile. Meanwhile, the sales person might be indirectly using B2B links to handle her commissions and personnel data through outsourced employee management services, and engineering staff might use B2B links for collaborative design.
Thanks to the Web, B2B is rapidly transforming into an even more dynamic set of services from which an enterprise can select in real time. No one wants to be dependent on a single supplier or customer; everyone must deal with competitive pressures exerted from both sides. Services such as credit authorization and shipping are examples of those that can be selected in real time based on their performance or costs. Other services and supplies may be selected from web-based exchanges or e-markets.
B2B processes can be complex. They must follow the business requirements for tracking orders, negotiating contracts, arranging payments, and reporting outcomes that govern these processes when they take place without the automation of electronic communications.

Note that new benefits become available, although at the cost of additional complexity, when B2B replaces older systems. For example, organizations can change their business processes to increase their business effectiveness by obtaining real-time information on order volumes, revenue rates, cancelled orders, and other factors. This additional information, while adding to complexity, provides value in addition to the acceleration of the processes themselves by identifying further efficiencies.
Continuous monitoring of B2B suppliers, partners, and web infrastructure
(communications, hosting, and exchanges) is necessary to determine whether they are
meeting their service quality commitments.
As in conventional commerce, managing across organizations adds complexity. All the
links in the B2B services chains are known, but these links are controlled by many different
organizations, are complex, and may change rapidly as services are selected in real time.
Managing B2B services therefore requires cooperation with the management teams of the
other participants and, possibly, with third-party measurement organizations to assure true
end-to-end service quality.
B2C

B2C garnered most of the early attention from the trade press and analysts as traditional
businesses took advantage of the Internet's wide geographic reach and low costs for
reaching customers. Some businesses (eBay and Amazon.com, for example) were founded
to exploit this new market opportunity.
B2C sites continually add new services of their own while offering links to related
businesses and services in an attempt to offer one-stop shopping—and selling—to their
customers. This is a highly competitive segment with little customer loyalty. The wide
selection of competing sites draws customers away whenever any one site has a service
disruption.
B2C environments are characterized by a lack of visibility and management control of the
customer-access infrastructure, which is the set of networks, caches, and other systems that
consumers use to connect to the B2C site. Customers usually don’t want measurement tools
embedded in their systems, and the access infrastructure providers also resist making their
internal performance readily visible. There is also limited visibility into the performance of
partner sites (advertisers and other third parties), which are important parts of the
customer’s perception of total site performance. The span of control and management
available to B2C sites is therefore usually limited to monitoring and managing their internal
operations (inside the firewall) as well as measurement of Internet delays and performance
as seen from various points on the edge of the Internet.
B2E

B2E services are also known as the intranet. These services help improve the internal
effectiveness of an organization and help it keep pace with its customers and business
partners. Many B2E services enable employees to query their benefits, schedule vacations,
fill out expense reports, and conduct a set of activities that formerly required a large staff to
coordinate.
B2C and B2E services use the web browser as the access device. Transactions are initiated
from the browser to deliver information and activate a range of business processes.
However, B2E environments are the only ones that enable administrators to have control of
both ends—the servers as well as the desktops, cell phones, and personal communicators
used to access them.
Webbed Services and the Webbed Ecosystem

In this book, I use the term webbed services to describe the set of business services that are
based on a component approach to systems design. This design is driven from the Web and
its associated technologies, regardless of the specific technologies used. Because webbed
services are constructed from a set of interconnected software components and services that
can be reused in multiple places, they can usually avoid some of the expense, time, and
effort associated with building and modifying monolithic applications.
Webbed services is a very inclusive term; it’s increasingly difficult to find services that are
not somehow tied into the Web. As a case in point, I was recently speaking about webbed
services at a large retail organization, and someone in the group stated that their main
application did not fit into the webbed category because it was a stand-alone Oracle
Financials application. However, further discussion soon revealed that their international
operations used real-time currency conversion decisions. The real-time exchange rates in
the Oracle Financials application were, in fact, accessed through the Web.
Indeed, webbed services are now taking on many of the characteristics of an ecosystem,
which is a group of independent but interrelated elements comprising a unified whole. A
smooth business process depends on each element carrying out its tasks accurately and
quickly, with consideration for maintaining balances among all the elements. In a well-
balanced webbed ecosystem, all elements bear appropriate shares of the load. None is
overwhelmed, none is underutilized. Balance is concurrently maintained between service
quality and service cost. The ecosystem metaphor is gaining momentum as online
processes evolve to dynamically select their elements (underlying services) based on their
current behavior and performance.
The webbed ecosystem perspective also holds within any subgroup of systems. For
instance, hosting facilities use a range of technologies, such as prioritizing devices,
bandwidth managers, global load balancers, and caches, to deliver online business services.
These systems also need balanced management; adding bandwidth when servers are
congested is a wasteful investment.
Service Level Management

Service quality is extremely important, given the accelerating number of critical business
processes going online. Customers and business partners go elsewhere if the services they
want are not available or are performing sluggishly. Unfortunately, good service quality is
a dynamic target and the demands continue to tighten. Competitors will match or exceed
service quality levels and create pressure toward matching or bettering theirs.
Service Level Management (SLM) is the process of managing network and computing
resources to ensure the delivery of acceptable service quality at an acceptable price in an
acceptable time frame. It focuses on the behavior of the services rather than on tracking the
status of every router, switch, and server in the environment. Through SLM, service quality
is guaranteed and priced for different levels of service.
SLM is a competitive weapon in the marketplace, offering the guarantees needed to
transition critical business activities online. Poorly managed services have harmed many
businesses when their web sites crashed, their applications slowed to a crawl, or their Web
content was not attractively presented or was too difficult to navigate. Good service quality
helps retain customers and differentiate your organization from those that have not yet
mastered the art of managing service quality.
Effective SLM is also an economic weapon. Managing resources more effectively reduces
costs, creates more revenue opportunities, and leverages technology investments.

Finally, SLM is a means to build the solid business relationships that make online business
initiatives successful.
The second group of chapters (8–10) in this part steps through the major
systems used for web service delivery. It looks at the ways they can be used
to improve service delivery and also discusses their specific instrumentation
needs, using the system management infrastructures described in the first
part of this section. Chapter 8 investigates the instrumentation and
management of applications and of end-user access devices, such as
browsers. Chapter 9 looks at web server systems, including servers, load
balancers, and content distribution networks. Finally, Chapter 10 discusses
instrumentation and management of the transport infrastructure, including
QoS technology and traffic shaping to achieve policy objectives.
• Part III: Long-term Service Level Management Functions (Chapters 11–12)—
This part covers load testing, modeling, and capacity planning. No management
system can provide necessary quality if the web serving system, as a whole, has
insufficient capacity.
• Part IV: Planning and Implementation of Service Level Management (Chapters
13–15)—Calculation of Return on Investment (ROI) for SLM is critical to the
justification and design of an implementation; it's covered in Chapter 13. Chapter 14
provides guidance for using the information in this book to design an SLM system for
your particular situation, and the part ends with discussion in Chapter 15 of possible
future developments in SLM.
Summary

The Internet, and the Web, are transforming business processes for interaction among
businesses, government, suppliers, customers, and employees. As more and more critical
business processes go online, the service quality of those processes becomes more
important to the success of business as a whole.
SLAs are the formal, negotiated contracts between service providers and service users that
define the services to be provided, their quality goals, and the actions to be taken if the SLA
terms are violated.
SLM is the process of managing network and computing resources to ensure the delivery
of acceptable service quality, usually as defined in an SLA, at an acceptable price in an
acceptable time frame. It is a competitive weapon in the marketplace because it can improve
customer relationships, create more revenue opportunities, and reduce costs.
Chapter 2

Service Level Management
Service Level Management (SLM) is a key for delivering the services that are necessary to
remain competitive in the Internet environment. Service quality must remain stable and
acceptable even when there are substantial changes in service volumes, customer activities,
and the supporting infrastructures.
Superior service quality also becomes a competitive differentiator because it reduces
customer churn and brings in new customers who are willing to pay the premiums for
guaranteed service quality. Customer churn is an insidious problem for almost every service
provider.
The competitive market increases customer acquisition costs because continuous
marketing and promotions are necessary just to replace the eroding customer base. High
customer acquisition costs must be dealt with by either raising prices (a difficult move in a
highly competitive market) or by taking longer to amortize the acquisition costs before
profitability for each customer is achieved. Improving customer retention therefore
dramatically increases profits.
This chapter covers the basics of SLM and lays part of the groundwork for the rest of the
book:

• An overview of SLM
• An introduction to technical metrics
• Detailed discussions of measurement granularity and measurement validation
• Business process metrics
• Service Level Agreements (SLAs)
Note that the chapter ends with a summary discussion in the context of building an SLA.
Use of metrics in combination with the SLA's service level objectives to control
performance is discussed in Chapter 6, "Real-Time Operations," and Chapter 7, "Policy-
Based Management."
Overview of Service Level Management

Often, one group's service provider is another group's customer. It is critical to understand
that service delivery is often, in fact, a chain of such relationships. As Figure 2-1 shows,
some entities, such as an IT group, can play different roles in the service delivery process.
As shown in the figure, a hosting company can be a customer of multiple service providers
while in turn acting as a service provider. An IT group may be a customer of several service
providers offering basic Internet connectivity, application hosting, content delivery, or other
services. Customers may use multiple providers of the same service to increase their
availability and to protect against dependence on a single provider. Customers will also use
specialized service providers to fulfill particular needs.
Figure 2-1 Roles of Customers and Service Providers
The Internal Role of the IT Group

An IT group serves the entire organization by aggregating demands of individual business
units and using them as leverage to reduce overall costs from service providers.

Today, such IT groups are making the necessary adjustments as managed services become
a mandatory requirement. IT managers are constantly reassessing the business and strategic
trade-offs of developing internal competence and expertise as opposed to outsourcing most
of the traditional IT work to external providers. The goal is to save money, protect strategic
assets, and maintain the necessary flexibility to meet new challenges.
The External Role of the IT Group

IT groups are increasingly being required to provide specific levels of service, and they are
also more frequently involved in helping business units negotiate agreements with external
service providers. Business units often choose to deal directly with service providers when
they have specialized needs or when they determine that the IT group cannot offer services
with competitive costs and benefits.
IT groups must therefore manage their own service levels as well as those of service
providers, and they must track compliance with negotiated SLAs.
The Components of Service Level Management

The process of monitoring service quality, detecting potential or actual problems, taking
actions necessary to maintain or restore the necessary service quality, and reporting on
achieved service levels is the core of SLM. Effective SLM solutions must deliver acceptable
service quality at an acceptable price.
Acceptable quality from a customer perspective means an ability to use the managed
services effectively. For example, acceptable quality may mean that an external customer
or business partner can transact the necessary business that will generate revenues,
strengthen business partnerships, increase the Internet brand, or improve internal
productivity. Specific ongoing measurements are carried out to determine acceptable
service quality levels, and noncompliance is noted and reported.
Acceptable costs must also be considered, because over-provisioning and throwing money
at service quality problems is not an acceptable strategy for either service providers or their
customers (and in spite of the cost, it often doesn't solve the problem). Service management
policies are applied to critical resources so that they are allocated to the appropriate
services; inappropriate activities are curtailed. Service providers that manage resources
effectively deliver superior service quality at competitive prices. Their customers, in turn,
must also increase their online business effectiveness and strengthen their bottom-line
results.
The Participants in a Service Level Agreement

The SLA is the basic tool used to define acceptable quality and any relationships between
quality and price. Because the SLA has value for both providers and customers, it's a
wonder that it has taken so long to become important. In practice, many organizations
and providers find the process of negotiating an acceptable SLA to be a difficult task. As
with many technical offerings, customers often experience difficulty in expressing what
they need in technical terms that are both measurable and manageable; therefore, they have
difficulty specifying their needs precisely and verifying that they are getting what they pay for.
Service providers, on the other hand, appreciate clearly-specified requirements and want to
take advantage of the opportunity to offer profitable premium services, but they also want
to minimize the risks of public failure and avoid increasingly stringent financial penalties
for noncompliance with the terms of the SLA.
Metrics Within a Service Level Agreement

Measurement is a key part of an SLA, and most SLAs have two different classes of metrics,
as shown in Figure 2-2, which may be divided into technical metrics and business process
metrics. Technical metrics include both high-level technical metrics, such as the success
rate of an entire transaction as seen by an end user, and low-level technical metrics, such as
the error rate of an underlying communications network. Business process metrics include
measures of provider business practices, such as the speed with which they respond to
problem reports.
Figure 2-2 Contents of a Service Level Agreement
Service Level Agreement
  Technical Metrics
    High-Level: Workload, Availability, Transaction Failure,
      Transaction Response Time, ...
    Low-Level: Workload, Availability, Packet Loss, One-Way Packet Delay,
      Jitter, Server Response Time, ...
  Business Process Metrics: Trouble Response Time, Trouble Relief Time,
    Provisioning Time, ...
  Metric Specification and Handling: Granularity, Validation,
    Statistical Analysis
  Penalties and Rewards
Service providers may package the metrics into specific profiles that suit common customer
requirements while simplifying the process of selecting and specifying the parameters.
Service profiles help the service provider by simplifying their planning and resource
allocation operations.
Introduction to Technical Metrics

Technical metrics are a core component of SLAs. They are used to quantify and to assess
the key technical attributes of delivered services.

Examples of technical metrics are shown in Table 2-1. They are separated into the two basic
groups: high-level metrics, which deal with attributes that are highly relevant to end users
and are easily understood by them, and low-level metrics, which deal with attributes of the
underlying technologies. Note that you should be very specific when defining these terms
in an agreement. Although many of these terms are in common use, their definitions vary.

Table 2-1 Examples of Technical Metrics
Metric                     Description

High-Level Technical Metrics

Workload                   Applied workload in terms understandable by the end user
                           (such as end-user transactions/second)
Availability               Percentage of scheduled uptime that the system is perceived
                           as available and functioning by the end user
Transaction Failure Rate   Percentage of initiated end-user transactions that fail to
                           complete
Transaction Response Time  Measure of response-time characteristics of a user transaction
File Transfer Time         Measure of total transfer-time characteristics of a file transfer
Stream Quality             Measure of the user-perceived quality of a multimedia stream

Low-Level Technical Metrics

Workload                   Applied workload in terms relevant to underlying technologies
                           (such as database transactions/second)
Availability               Percentage of scheduled uptime that the subsystem is available
                           and functioning
Packet Loss                Measure of one-way packet loss characteristics between
                           specified points
Latency                    Measure of transit time characteristics between specified points
Jitter                     Measure of the transit time variability characteristics between
                           specified points
Server Response Time       Measure of response-time characteristics of particular server
                           subsystems
Workload is an important characteristic of both high- and low-level metrics. It's not a
measure of delivered quality; instead, it's a critical measure of the load applied to the
system. For example, consider the workload of serving web pages. A text-only page might
comprise only 10 K bytes, whereas a graphics page could comprise a few megabytes. If the
requirement is to deliver a page in six seconds to the end user, massively different
bandwidth and capacity will be necessary. Indeed, content may need to be altered for low-
speed connections to meet the six-second download time.
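The bandwidth arithmetic behind the six-second example can be sketched as follows. This is an illustration of mine, not the book's calculation; the page sizes and the six-second target come from the text, and protocol overhead and latency are ignored for simplicity.

```python
# Rough sketch: minimum sustained throughput needed to deliver a page of a
# given size within a delivery target, ignoring latency and protocol overhead.

def required_bandwidth_bps(page_bytes: int, target_seconds: float) -> float:
    """Bits per second needed to move page_bytes within target_seconds."""
    return page_bytes * 8 / target_seconds

text_page = required_bandwidth_bps(10 * 1024, 6.0)            # ~10 KB text page
graphics_page = required_bandwidth_bps(2 * 1024 * 1024, 6.0)  # ~2 MB graphics page

print(f"text page:     {text_page / 1000:.1f} kbps")
print(f"graphics page: {graphics_page / 1_000_000:.2f} Mbps")
```

The two results differ by roughly a factor of 200, which is why the same six-second objective can imply massively different capacity, and why content may need to be altered for low-speed connections.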
NOTE In many situations, certain technical metrics aren't specified in the SLA. Instead, the
supplier is asked to use best effort, which represents the classic Internet delivery strategy of
"get it there somehow without concern for service quality." Today, best effort represents the
commodity level for services. There are no special treatments for best-effort services. The
only need is that there are sufficient resources to prevent best-effort services from starving
out, which means having the connection time out because of long periods of inactivity.
Discussions of all of the examples in Table 2-1 follow, to illustrate the basic concepts of
technical metrics. Additional descriptions of these metrics, and other technical metrics,
appear in Chapters 4 and 8–10.
High-Level Technical Metrics

These metrics deal with workload and performance as seen and understood by the end user.
Workload
The workload high-level technical metric is the measure of applied load in end-user terms.
It’s unreasonable to expect a service provider to agree to service levels for an unspecified
amount of workload; it’s also unreasonable to expect that an end user will willingly
substitute specification of obscurely-related low-level workload metrics instead of
understandable high-level metrics. SLAs should therefore begin by specifying the high-
level workload metrics, and service providers can then work with the customer’s technical
staff to derive low-level workload metrics from them.
For transaction systems, the workload metric is usually specified in terms of the end-user
transaction mix and volumes, which typically vary according to time of day and other
business cycles. For existing systems, these statistics can be obtained from logs; for new
systems or situations (such as a proposed major advertising campaign designed to drive
prospective customers to a web site), the organization’s marketing group or their
consultants should work to produce the most accurate, specific estimates possible. These
workload estimates for new systems should be used for load testing as well as for SLAs.
Transaction workload metrics must include end-user tolerance for transaction response
time delays. If response time delays are too long, external customers will abandon the
transaction. In legacy systems where external customers did not interact directly with the
server systems, abandonment was not a factor in workload testing. Call-center operators
handled any delays by talking to the customers, shielding them from the problem, if
necessary. On the Web, customers see the delays without any shielding, and they may
decide at any point to abandon the transaction—with immediate impact on the server
system's workload.
Another effect of the direct connection between customers and web-serving systems is that
there's no buffer between those customers and the servers. In a call center, the workload is
buffered by external queues. Incoming calls go through an automatic call distribution
system; callers are placed on hold until an operator is available. In an order-entry center,
the workload is buffered by the stack of documents on the entry clerk's desk. In contrast,
the web workload has no external buffer; massive spikes in workload hit the servers
instantly. These spikes in workload are called flash load, and they must be specified in the
workload metric and considered during load testing. Load specification for the Web should
therefore be in terms of arrival rate, not concurrent users, as was the case for call centers
and order-entry centers.
File-serving, web-page, and streaming-media workload metrics are similar to transaction
metrics, but simpler. They're usually specified in terms of the size and number of files that
must be transferred in a given time interval. (For web pages, the types of the files are usually
specified. Dynamically-generated files are clearly more resource-intensive than stored
static files.) The serving system must have the bandwidth to serve the files, and it must also
be able to handle the anticipated number of concurrent connections. There's a relationship
between these two variables; given a certain arrival rate, higher end-to-end bandwidth
results in fewer concurrent users.
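The relationship between arrival rate, bandwidth, and concurrency can be sketched with Little's Law (concurrency equals arrival rate times time in system). This is my own illustration with made-up numbers, not a calculation from the book.

```python
# Sketch of the arrival-rate/bandwidth trade-off: for a fixed arrival rate,
# higher end-to-end bandwidth shortens each transfer and therefore lowers
# the number of concurrent connections (Little's Law).

def concurrent_connections(arrivals_per_sec: float,
                           file_bytes: int,
                           bandwidth_bps: float) -> float:
    transfer_seconds = file_bytes * 8 / bandwidth_bps  # time each transfer occupies
    return arrivals_per_sec * transfer_seconds         # Little's Law: L = lambda * W

# Hypothetical load: 50 requests/second for a 500 KB file.
slow = concurrent_connections(50, 500_000, 1_000_000)   # 1 Mbps per user
fast = concurrent_connections(50, 500_000, 10_000_000)  # 10 Mbps per user

print(f"at 1 Mbps:  {slow:.0f} concurrent connections")
print(f"at 10 Mbps: {fast:.0f} concurrent connections")
```

Tenfold more bandwidth per user cuts the concurrent-connection requirement tenfold at the same arrival rate, which is why load specification in arrival-rate terms is the more fundamental quantity.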
Availability
Availability is the percentage of time that the system is perceived as available and
functioning by the end user. It is a function of both the Mean Time Between Failures
(MTBF) and the Mean Time To Repair (MTTR). Scheduled downtime might, in some
organizations, be excluded from these calculations. In those organizations, a system can be
declared 100 percent available even though it's down for an hour every night for system
maintenance.

Availability is a binary measurement—the service is either available or it isn't. For the end
user, and therefore for the high-level availability metric, the fact that particular underlying
components of a service are unavailable is not a concern if that unavailability is concealed
through redundant systems design.
Availability can be improved by increasing the MTBF or by decreasing the time spent on
each failure, which is measured by the MTTR. Chapter 3, "Service Management
Architecture," introduces the concept of triage, which decreases MTTR through quick
assignment of problems to the appropriate specialist organization.
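The steady-state formula behind this relationship can be sketched as follows. The MTBF and MTTR figures are invented for illustration; only the formula itself follows from the text.

```python
# Sketch of availability as a function of MTBF and MTTR:
#   availability = MTBF / (MTBF + MTTR)

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical numbers: 500 hours between failures, 2 hours to repair.
print(f"MTTR 2 h: {availability(500, 2) * 100:.2f}%")

# Halving MTTR (for example, through faster triage) raises availability
# without touching MTBF.
print(f"MTTR 1 h: {availability(500, 1) * 100:.2f}%")
```

This is why triage is effective: cutting repair time improves the availability number directly, often more cheaply than engineering fewer failures.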
Transaction Failure Rate
A transaction fails if, having successfully started, it does not successfully complete.
(Failure to start is the result of an availability problem.) As is true for availability, systems
design and redundancy may conceal some low-level failures from the end user and
therefore exclude the failures from the high-level transaction failure rate metric.
Transaction Response Time
This metric represents the acceptable delay for completing a transaction, measured at the
level of a business process.
It’s important to measure both the total time to complete a transaction and the elapsed time
per page of the transaction. That’s because the end user’s perception of transaction time,
which will be used to compare your system with your competitors’, is based on total
transaction time, regardless of the number of pages involved, while the slowest page will
influence end-user abandonment of a web transaction.
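Both quantities described above can be computed from the same per-page measurements. The page timings below are hypothetical values of mine, used only to show the two views of one transaction.

```python
# Sketch: measure both total transaction time (used for competitive
# comparison) and the slowest page (which drives abandonment).

page_seconds = [1.2, 0.8, 4.9, 1.1]  # hypothetical per-page response times

total_time = sum(page_seconds)    # end user's perception of the transaction
slowest_page = max(page_seconds)  # the page most likely to trigger abandonment

print(f"total transaction time: {total_time:.1f} s")
print(f"slowest page:           {slowest_page:.1f} s")
```

A transaction can have an acceptable total time while a single slow page still drives users away, so an SLA objective on the total alone is not sufficient.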
File Transfer Time
The file transfer time metric is closely associated with specified workload and is a measure
of success. The file transfer workload metric describes the work that must be accomplished
in a certain period; the file transfer time metric shows whether that workload was
successfully handled. Lack of end-to-end bandwidth, an insufficient number of concurrent
connections, or persistent transmission errors (requiring retransmission) will influence this
measure.
Stream Quality
The quality of multimedia streams is difficult to measure. Although underlying low-level
technical metrics, such as frame loss, can be obtained, their relationship to the quality as
perceived by an end user is very complex.
Streaming is a real-time service in which the content continues flowing even with variations
in the underlying data transmission rates and despite some underlying errors. A content
consumer may see a small blemish on a graphic because a packet is lost in transit—
equivalent to static on your car radio. There is no rewinding and playing it again, as there
might be with interactive services. Thus, packet loss is handled by just continuing with the
streaming rather than retransmitting lost packets.
Occasional packet loss can still be tolerated and sometimes may not even be noticed. If
packet loss increases, quality will begin to degrade until it falls below a threshold and
becomes unacceptable. Years of development have been focused on concealing these low-
level errors from the multimedia consumer, and the major existing technologies from
Microsoft, Real Networks, Apple, and others have different sensitivities to these errors.
Nevertheless, quality must be measured. The telephone companies years ago established
the Mean Opinion Score (MOS), a measure of the quality of telephone voice transmission.
There are also international standards for evaluation of audio and video quality as perceived
by human end users; examples are the International Telecommunication Union's ITU-T
P.800-series and P.900-series standards and the American National Standards Institute's
T1.518 and T1.801 standards. Simpler methods are also in use, such as measuring the
percentage of successful connection attempts to the streaming server, the effective
bandwidth delivered over that connection, and the number of rebuffers during transmission.
Low-Level Technical Metrics

These metrics deal with workload and performance of the underlying technical subsystems,
such as the transport infrastructure. Low-level technical metrics can be selected and defined
by first understanding the high-level technical metrics and their implications for the
performance requirements placed on underlying subsystems. For example, a clear
understanding of required transaction response time and the associated transaction
characteristics (the number of transits across the transport network, the size of each transit,
and so on) can help set the objective for the low-level technical metric that measures
network transit time (latency).
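The derivation described above can be sketched as a simple budget calculation. The response-time target, server time, and transit count below are hypothetical values of mine; the approach of dividing the remaining budget across transits is the point being illustrated.

```python
# Sketch: derive a low-level latency objective from a high-level
# response-time target by subtracting time spent in servers and dividing
# the remainder across the network transits of the transaction.

def latency_budget_ms(response_target_ms: float,
                      server_time_ms: float,
                      transits: int) -> float:
    return (response_target_ms - server_time_ms) / transits

# Hypothetical transaction: 2-second target, 800 ms of server processing,
# 6 network transits.
print(f"{latency_budget_ms(2000, 800, 6):.0f} ms allowed per transit")
```

A calculation like this turns an end-user objective into a number that can actually be written into a transport-level SLA.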
Workload and Availability
These low-level technical metrics are similar to those for the high-level discussion, but
they're focused on performance characteristics of the underlying systems rather than on
performance characteristics that are directly visible to end users. Their correlation with the
high-level metrics depends on the particular system design and the degree of redundancy
and substitution within that design.
Throughput, for example, is a low-level technical metric that measures the capacity of a
particular service flow. Services with rich content or critical real-time requirements might
need sufficient bandwidth to maintain acceptable service quality. Certain transactions, such
as downloading a file or accessing a new web page, might also require a certain bandwidth
for transferring rich content, such as complex graphics, within the specified transaction
delay time.
Packet Loss
Packet loss has different effects on the end-user experience, depending on the service using
the transport. The choice of a packet loss metric for a particular application must be
carefully considered. For example, packet loss in file transfer forces retransmission unless
the high-level transport contains embedded error correction codes. In contrast, moderate
packet loss in streaming media may have no user-perceptible effect at all—unless bad luck
results in the loss of a key frame.
The burst length must be included in packet loss metrics. Usually a uniform distribution of
dropped packets over longer time intervals is implicitly assumed. For example, out of every
100 packets there could be two lost without violating an SLA calling for two percent packet
loss. There may be a different perspective if you examine behavior over longer intervals,
such as 1,000 packets. Up to 20 packets in a row could be lost without violating the SLA.
However, losing 20 consecutive packets—creating a significant gap in data received—
might drive quality levels to unacceptable values.
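The measurement-interval effect described above is easy to demonstrate with synthetic loss traces. The two traces below are my own construction: both lose exactly 2 percent of packets over 1,000 packets, so both satisfy a 2 percent SLA measured over the full interval, yet one concentrates the loss into a single 20-packet burst.

```python
# Sketch: two loss traces with identical average loss but very different
# burst behavior, showing why burst length belongs in packet loss metrics.

uniform = [1 if i % 50 == 0 else 0 for i in range(1000)]  # one loss every 50 packets
burst = [1 if i < 20 else 0 for i in range(1000)]         # 20 consecutive losses

def loss_rate(trace):
    return sum(trace) / len(trace)

def longest_burst(trace):
    longest = run = 0
    for lost in trace:
        run = run + 1 if lost else 0
        longest = max(longest, run)
    return longest

assert loss_rate(uniform) == loss_rate(burst) == 0.02  # both "meet" a 2% SLA
print(f"uniform: longest burst {longest_burst(uniform)} packet(s)")
print(f"bursty:  longest burst {longest_burst(burst)} packet(s)")
```

An SLA that bounds only the average rate cannot distinguish these two cases; adding a burst-length bound can.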
Latency
Latency is the time needed for transit across the network; it's critical for real-time services.
Excessive latency quickly degrades the quality of web sites and of interactive sound and video.
Routes in the Internet are usually asymmetric, with flows often taking different paths
coming and going between any pair of locations. Thus, the delays in each direction are
usually different. Fortunately, most Internet applications are primarily sensitive to round-
trip delays, which are much simpler to measure than one-way delays. File transfer, web
sites, and transactions all require a flow of acknowledgments in the direction opposite to the
data flow. If acknowledgments are delayed, transmission temporarily ceases. The round-
trip latency therefore controls the effective bandwidth of the transmission.
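A rough sketch of that effect: with a fixed acknowledgment window, a sender can have at most one window of data outstanding per round trip, so round-trip latency caps throughput regardless of link speed. The numbers below assume a classic 64 KB TCP window for illustration:

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Ceiling on a single flow's throughput: one window of data
    can be in flight per round trip."""
    return window_bytes * 8 / rtt_seconds

# A classic 64 KB TCP window over a 100 ms round trip caps the flow
# near 5.2 Mbit/s, however fast the underlying links are.
cap = max_throughput_bps(64 * 1024, 0.100)
print(round(cap / 1e6, 1))  # 5.2
```

Halving the round-trip latency doubles this ceiling, which is why latency, not just raw bandwidth, belongs in the SLA.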
Round-trip latency is much simpler to measure than one-way latency, because clock
synchronization of separated locations is not necessary. That synchronization can be quite
tricky if it is accomplished across the same network that's having its one-way delay
measured. In that case, fluctuations in the metric being measured (one-way latency)
can easily affect the stability of the measurement apparatus itself. An
external reference, such as the satellite Global Positioning System (GPS) timers, is often
used in such situations.
Jitter
Jitter is the deviation in the arrival rate of data from ideal, evenly spaced arrival; see Figure
2-3. Some packets may be bunched more closely together (in terms of inter-packet delays)
or spread farther apart after crossing the network infrastructure. Jitter is caused by the
internal operation of network equipment, and it's unavoidable. Jitter is created whenever
there are queues and buffering in a system. Extreme jitter is also created when
packets are rerouted because of network congestion or failure.
Figure 2-3 Jitter
Interactive teleconferencing is an example of a service that is extremely sensitive to jitter;
too much jitter can make the service completely useless. Therefore, a reduction in jitter,
approaching zero, represents an increase in quality.
Buffering in the receiving device can be used to smooth out jitter; the jitter buffer is familiar
to those of us who have a CD player in the car. Small bumps are smoothed out and the sound
quality remains acceptable, but hitting a pothole usually causes more disturbance than the
buffer can overcome. The dejitter buffer allows for latency that is typically one or two times
that of the expected jitter; it's not a cure for all situations. The time spent in the dejitter
buffers is an important contributor to total system latency.
Server Response Time
Similar to the high-level technical metric transaction response time, this measures the
individual response time characteristics of underlying server systems. A common example
is the response time of the database back-end systems to specific query types. Although not
directly seen by end users, this is an important part of overall system performance.
Measurement Granularity
The SLA must describe the granularity of the measurements. There are three related parts
to that granularity: the scope, the sampling frequency, and the aggregation interval.
Measurement Scope
The first consideration is the scope of the measurement, and availability metrics make an
excellent example. Many providers define the availability of their services based on an
overall average of availability across all access points. This approach gives the
service providers the most flexibility and cushion for meeting negotiated levels.
Consider if your company had 100 sites and a target of 99 percent availability based on an
overall average. Ninety-nine of your sites could have complete availability (100 percent)
while one could have zero. Having a site with an extended period of complete unavailability
isn't usually acceptable, but the service provider has complied with the negotiated terms of
the SLA.
If the availability level is specified on a per-site basis instead, the provider would be
found noncompliant, and appropriate actions would follow in the form of penalties or
lost customers. The same principle applies when measuring the availability of multiple
sites, servers, or other units.
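The 100-site example reduces to a few lines. This sketch (the site names are invented) shows how the same data passes an aggregate-scope SLA while failing a per-site one:

```python
# 99 sites fully available, one completely down (site names are invented).
sites = {f"site-{i:03d}": 1.0 for i in range(1, 100)}
sites["site-100"] = 0.0
target = 0.99

overall = sum(sites.values()) / len(sites)
print(overall >= target)                          # True: aggregate scope passes
print(all(a >= target for a in sites.values()))   # False: per-site scope fails
```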
Availability has a second scope dimension beyond breadth: the depth to which
the end user can penetrate to the desired service. To use a telephone analogy, is dial tone
sufficient, or must the end user be able to reach specific numbers? In other words, which
transactions must be accessible for the system to be regarded as available?
Scope issues for performance metrics are similar to those for the availability metric. There
may be different sets of metrics for different groups of transactions, different times of day,
and different groups of end users. Some transactions may be unusually important to
particular groups of end users at particular times and completely unimportant at other
times.
Regardless of the scope selected for a given individual metric, it's important to realize that
executive management will want these various metrics aggregated into a single measure of
overall performance. Derivation of that aggregated metric must be addressed during
measurement definition.
Measurement Sampling Frequency
A higher sampling frequency catches problems sooner at the expense of consuming
additional network, server, and application resources. Longer intervals between
measurements reduce those impacts while possibly missing important changes, or at least not
detecting them as quickly as when a shorter interval is used. Customers and service
providers will need to negotiate the measurement interval because it affects the cost of the
service to some extent.
Statisticians recommend that sampling be random because it avoids accidental
synchronization with underlying processes and the resulting distortion of the metric.
Random sampling also helps discover brief patterns of poor performance; consecutive bad
results are more meaningful than individual, spaced-out difficulties.
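One standard way to randomize probes (an illustrative sketch, not drawn from the sources cited here) is Poisson sampling: draw inter-probe gaps from an exponential distribution, so the schedule never locks into step with periodic behavior in the system under test.

```python
import random

def poisson_sample_times(rate_per_min, duration_min, seed=42):
    """Measurement times with exponentially distributed inter-sample
    gaps, so probes never synchronize with periodic load patterns."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_min)
        if t > duration_min:
            return times
        times.append(t)

# About one probe per minute, on average, over an hour.
schedule = poisson_sample_times(rate_per_min=1.0, duration_min=60.0)
```

The average spacing is one minute, but individual gaps vary randomly, which is exactly what avoids the accidental synchronization described above.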
Confidence interval calculations can be used to help determine the sampling frequency.
Although it is impossible to perform an infinite number of measurements, it is possible to
calculate a range of values that we're reasonably sure would contain the true summary
values (median, average, and so on) if you could have performed an infinite number of
measurements. For example, you might want to be able to say the following: "There's a 95
percent chance that the true median, if we could perform an infinite number of
measurements, would be between five seconds and seven seconds." That is what the "95
Percent Confidence Interval" seeks to estimate, as shown in Figure 2-4. When you take
more measurements, the confidence interval (two seconds in this example) usually becomes
narrower. Therefore, confidence intervals can be used to help estimate how many
measurements you'll need to obtain a given level of precision with statistical confidence.
Figure 2-4 Confidence Interval for Internet Data
There are simple techniques for calculating confidence intervals for "normal distributions"
of data (the familiar bell-shaped curve). Unfortunately, as discussed in the subsequent
section on statistical analysis, Internet distributions are so different from the "normal
distribution" that these techniques cannot be used. Instead, the statistical simulation
technique known as "bootstrapping" can be used for these calculations on Internet
distributions.
In some cases, depending on the pattern of measurements, simple approximations for
calculating confidence intervals may be used. Keynote Systems recommends the following
approximation for calculating the confidence interval for availability metrics.
(This information is drawn from "Keynote Data Accuracy and Statistical Analysis for
Performance Trending and Service Level Management," Keynote Systems Inc., San Mateo,
California, 2002.) The formula is as follows:
• Omit data points that indicate measurement problems instead of availability
problems.
• Calculate a preliminary estimate of the 95 percent confidence interval for average
availability (avg) of a measurement sample with n valid data points:

Preliminary 95 Percent Confidence Interval = avg ± (1.96 * square root of [(avg * (1 – avg)) / (n – 1)])
For example, with a sample size n of 100, if 12 percent of the valid
measurements are errors, the average availability is 88 percent. The
confidence interval calculated by the formula is (0.82, 0.94). This suggests
that there's a 95 percent probability that the true average availability, if
we'd miraculously taken an infinite number of measurements, is between
82 and 94 percent. Notice that even with 100 measurements, this confidence
interval leaves much room for uncertainty! To narrow that band, you need
more valid measurements (a larger n, such as 1,000 data points).
• Now you must decide if the preliminary calculations are reasonable. We suggest that
the preliminary calculation should be accepted only if the upper limit is below 100
percent and the lower limit is above 0 percent. (The example just used gives an upper
limit > 100% for n = 29 or fewer, so this rule suggests that the calculation is reasonable
if n = 30 or greater.)
Note that we’re not saying that the confidence interval is too wide if the
upper limit is above 100 percent (or if the average availability itself is 100
percent because no errors were detected); we’re saying that you don’t know
what the confidence interval is. The reason is that the simplifying
assumptions you used to construct the calculation break down if there are notenough data points.
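Keynote's recipe, including the reasonableness check from the bullets above, fits in a few lines. The function name is ours; the numbers reproduce the worked example (n = 100, 88 percent average availability):

```python
import math

def availability_confidence_interval(avg, n):
    """Preliminary 95 percent CI for average availability from n valid
    measurements; None when the simplifying assumptions break down
    (interval touching 0 or 100 percent)."""
    half = 1.96 * math.sqrt(avg * (1 - avg) / (n - 1))
    lower, upper = avg - half, avg + half
    if lower <= 0 or upper >= 1:
        return None
    return lower, upper

# The worked example from the text: n = 100 samples, 12 percent errors.
ci = availability_confidence_interval(avg=0.88, n=100)
print(tuple(round(x, 2) for x in ci))   # (0.82, 0.94)

# With n = 29 or fewer the upper limit exceeds 100 percent, so the
# calculation is rejected, matching the n >= 30 rule above.
print(availability_confidence_interval(avg=0.88, n=29))  # None
```

Note that an avg of 100 percent (no errors detected) also returns None here, consistent with the note that you simply don't know the interval in that case.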
For performance metrics, a simple solution to the problem of confidence intervals is to use
geometric means and “geometric deviations” as measures of performance, which are
described in the subsequent section in this chapter on statistical analysis.
Keynote Systems suggests, in the paper previously cited, that you can approximate the 95
Percent Confidence Interval for the geometric mean as follows, for a measurement sample
with n valid (nonerror) data points:

Upper Limit = [geometric mean] * [geometric deviation] ^ (1.96 / square root of [n – 1])
Lower Limit = [geometric mean] / [geometric deviation] ^ (1.96 / square root of [n – 1])

This is similar to the use of the standard deviation with normally distributed data and can
be used as a rough approximation of confidence intervals for performance measurements.
Note that this ignores cyclic variations, such as by time of day or day of week; it is also
somewhat distorted because even the logarithms of the original data are asymmetrically
distributed, sometimes with a skew greater than 3. Nevertheless, the errors encountered
using this recipe are much less than those that result from the usual use of mean and
standard deviation.
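A sketch of the geometric recipe (the response-time samples are invented for illustration): take logarithms, compute the mean and standard deviation in log space, exponentiate back, and apply the multiplicative bound above.

```python
import math

def geometric_stats(samples):
    """Geometric mean and geometric deviation of positive measurements."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / (len(logs) - 1))
    return math.exp(mu), math.exp(sigma)

def geometric_ci_95(samples):
    """Keynote-style approximate 95 percent CI for the geometric mean."""
    gmean, gdev = geometric_stats(samples)
    factor = gdev ** (1.96 / math.sqrt(len(samples) - 1))
    return gmean / factor, gmean * factor

# Invented response times in seconds, including one outlier.
response_times = [1.2, 1.5, 2.1, 1.8, 6.0, 1.4, 2.5, 1.9, 3.2, 1.6]
low, high = geometric_ci_95(response_times)
```

Because the bound is multiplicative rather than additive, the interval can never dip below zero, which is one reason geometric statistics suit skewed Internet data better than the ordinary mean and standard deviation.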
Measurement Aggregation Interval
The time interval over which availability and performance are aggregated should
also be considered. Generally, providers and customers agree upon time spans ranging from
a week to a month. These are practical time intervals because they tend to hide small
fluctuations and irrelevant outlying measurements, but still enable reasonably prompt
analysis and response. Longer intervals permit longer problem periods before the SLA is
violated.
Table 2-2 shows this idea. If availability is measured on a small scale (hourly), high-
availability requirements such as the 5-9's (99.999 percent) permit only 0.036 seconds of
outage before there's a breach of the SLA. Providers must provision with adequate
redundancy to meet this type of stringent requirement, and clearly they will pass on these
costs to the customers that demand such high availability.

If a monthly (four-week) measurement interval is chosen, the 99.999 percent level indicates
that a cumulative outage of 24 seconds per month is permitted while remaining
in compliance. A 99.9 percent availability level permits up to 40 minutes of accumulated
downtime for a service each month. Many providers are still trying to negotiate SLAs
with availability levels ranging from 98 to 99.5 percent, or cumulative downtimes of 13.5
to 3.5 hours each month.

Note that these values assume 24 × 7 × 365 operations. For operations that do not require
round-the-clock availability, or are not up during weekends, or have scheduled maintenance
periods, the values will change. That said, they're pretty easy to compute.
The key is for the service provider and the service customer to set a common definition of the
critical time interval. Because longer aggregation intervals permit longer periods during
which metrics may be outside tolerance, many organizations must look more deeply at the
aggregation definitions and at their tolerance for service interruption. A 98 percent
availability level may be adequate and also economically acceptable, but how would the
business function if the 13.5 allotted hours of downtime per month occurred in a single
outage? Could the business tolerate an interruption of that length without serious damage?
If not, then another metric that limits the interruption must be incorporated. This could be
expressed in a statement such as the following: "Monthly availability at all sites shall be 98
percent or higher, and no service outage shall exceed three minutes." In other words, a little
arithmetic to evaluate scenarios for compliance goes a long way.
Table 2-2 Measurement Aggregation Intervals for Availability

Availability    Allowable Outage for Specified Aggregation Interval
Percentage      Hour         Day          Week         4 Weeks
98%             1.2 min      28.8 min     3.36 hr      13.4 hr
98.5%           0.9 min      21.6 min     2.52 hr      10 hr
99%             0.6 min      14.4 min     1.68 hr      6.7 hr
99.5%           0.3 min      7.2 min      50.4 min     3.36 hr
99.9%           3.6 sec      1.44 min     10 min       40 min
99.99%          0.36 sec     8.64 sec     1 min        4 min
99.999%         0.036 sec    0.864 sec    6 sec        24 sec
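Every entry in Table 2-2 follows from one line of arithmetic: allowed outage = interval × (1 − availability). A quick sketch for checking other combinations:

```python
def allowable_outage_seconds(availability_pct, interval_seconds):
    """Cumulative outage permitted over one aggregation interval."""
    return interval_seconds * (1 - availability_pct / 100)

HOUR, WEEK = 3600, 604_800
print(round(allowable_outage_seconds(99.999, HOUR), 3))      # 0.036
print(round(allowable_outage_seconds(99.999, 4 * WEEK)))     # 24
print(round(allowable_outage_seconds(99.9, 4 * WEEK) / 60))  # 40
```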
Measurement Validation and Statistical Analysis
The Internet and Web are extremely complex statistically. Invalid measurements and
incorrect statistical analysis can easily lead to SLA violation findings and penalties, which may
then fall apart when challenged by the service provider using a more appropriate analysis.
Therefore, special care must be taken to discard invalid measurements and to use
appropriate statistical analysis methods.
Measurement Validation
Measurement problems, which are artifacts of the measurement process, are inevitable in
any large-scale measurement system. The important issues are how quickly these errors are
detected and tagged in the database, and the degree of engineering and business integrity
that's applied to the process of error detection and tagging.
Measurement problems can be caused by instrument malfunction, such as a response timer
that fails, and by synthetic transaction script failure, which leads to false transaction error
reports. They can also be caused by abnormal congestion on a measurement tool's access link
to the backbone network and by many other factors. These are failures of the measurement
system, not of the system being measured. They therefore are best excluded from any SLA
compliance metrics.
Detection and tagging of erroneous measurements may take time, sometimes up to a day or
more, as the measurement team investigates the situation. Fortunately, SLA reports are not
generally done in real time, and there’s therefore an opportunity to detect and remove such
measurements.
The same measurements will probably also be used for quick diagnosis, or triage, and that
usage requires real-time reporting. There’s therefore no chance to remove erroneous
measurements before use, and the quick diagnosis techniques must themselves handle
possible problems in the measurement system. Good, fast-acting artifact reduction
techniques (discussed in Chapter 5, “Event Management”) can eliminate a large number of
misleading error messages and reduce the burden on the provider management system.
An emerging alternative is using a trusted, independent third party to provide the
monitoring and SLA compliance verification. The advantage of having an independent
party provide the information is that both service providers and their customers can view this
party as objective when they have disputes about delivered service quality.
Keynote Systems and Brix Networks are early movers into this market space. Keynote
Systems provides a service, whereas Brix Networks provides an integrated set of software
and hardware measurement devices to be installed and managed by the owner of the SLA.
They both provide active, managed measurement devices placed at the service demarcation
points between customers and