
Cisco Press

Practical Service Level Management:

Delivering High-Quality Web-Based Services

John McConnell with Eric Siegel

800 East 96th Street, 3rd Floor
Indianapolis, IN 46240 USA


Practical Service Level Management: Delivering High-Quality Web-Based Services
John McConnell with Eric Siegel

    Copyright© 2004 Cisco Systems, Inc.

    Published by:

    Cisco Press

    800 East 96th Street

    Indianapolis, IN 46240 USA

    All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic

    or mechanical, including photocopying, recording, or by any information storage and retrieval system, without

    written permission from the publisher, except for the inclusion of brief quotations in a review.

    Printed in the United States of America 1 2 3 4 5 6 7 8 9 0

    First Printing January 2004

    Library of Congress Cataloging-in-Publication Number: 2001097399

ISBN: 1-58705-079-X

Warning and Disclaimer

This book is designed to provide information about service level management. Every effort has been made to make

    this book as complete and as accurate as possible, but no warranty or fitness is implied.

    The information is provided on an “as is” basis. The author, Cisco Press, and Cisco Systems, Inc. shall have neither

    liability nor responsibility to any person or entity with respect to any loss or damages arising from the information

    contained in this book or from the use of the discs or programs that may accompany it.

    The opinions expressed in this book belong to the author and are not necessarily those of Cisco Systems, Inc.

    Feedback Information

At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted
with care and precision, undergoing rigorous development that involves the unique expertise of members from the

    professional technical community.

    Readers’ feedback is a natural continuation of this process. If you have any comments regarding how we could

    improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through e-mail

    at [email protected]. Please make sure to include the book title and ISBN in your message.

    We greatly appreciate your assistance.

Trademark Acknowledgments

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized.

    Cisco Press or Cisco Systems, Inc. cannot attest to the accuracy of this information. Use of a term in this book

    should not be regarded as affecting the validity of any trademark or service mark.

Corporate and Government Sales

Cisco Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales.

    For more information, please contact:

    U.S. Corporate and Government Sales 1-800-382-3419 [email protected]

    For sales outside of the U.S., please contact:

    International Sales 1-317-581-3793 [email protected]


    Publisher John Wait

    Editor-in-Chief John Kane

    Executive Editor Brett Bartow

    Cisco Representative Anthony Wolfenden

Cisco Press Program Manager Sonia Torres Chavez
Manager, Marketing Communications, Cisco Systems Scott Miller

    Cisco Marketing Program Manager Edie Quiroz

    Production Manager Patrick Kanouse

    Acquisitions Editor Michelle Grandin

    Development Editor Jill Batistick

    Project Editor Marc Fowler

    Copy Editor Jill Batistick

    Technical Editors David M. Fishman

    John P. Morency

    Richard L. Ptak

Team Coordinator Tammi Barnett
Book Designer Gina Rexrode

    Cover Designer Louisa Adair

    Composition Mark Shirar

    Indexer Larry Sweazy


    In Loving Memory 

    This book was finished as a final tribute to my late husband, John McConnell.

    I hope these words keep his ideas alive in the industry a little longer.

    —Grace Morlock McConnell

    My perception of my very special son, John W. McConnell

    All that we can be, we must be.

    Find a star and never settle for less.

    John was born to be one of a kind.

    Making his way with a mind of his own,

    and making a difference and making it known.

    He had his dreams and hopes to pursue.

    by his mother, Jeanette McConnell


Dedication

John W. McConnell

     December 9, 1943–November 3, 2002

    This book is dedicated to my wife, Grace, whose support has been so helpful in carving out the time and quiet

    needed for this project. My friends and Grace have also provided a supportive environment and tolerated my

    frequent absences to work with clients. Returning home to a warm community has been really important to me.


Acknowledgments

Many people have been part of this process of turning some ideas and experience into a book. First, my thanks to

    the Cisco Press team, especially Michelle Grandin. The steady enthusiasm and willingness of all to help are deeply

appreciated.

In the same vein, the technical reviewers have been so helpful. I’ve had the pleasure of spending good time exchanging

    views with John Morency and Rich Ptak at many analyst conferences and other events; their suggestions for this

    manuscript were specific and helpful, and in some cases spurred some spirited discussions. Although I’ve never met

    David Fishman face to face, I’d be pleased to buy him a good meal someday as thanks for so many good suggestions

    and his attention to detail and integrity on getting it right.

    Another group I want to acknowledge are the clients I’ve worked with around the world. I’ve gotten to learn a lot

    about how technology is actually used and to work with people who want to push the envelope.

    Finally, my thanks to my friends and colleagues in the industry who constantly stimulate and challenge me. It’s

    been a tremendous blessing to be among so many creative and independent thinkers and doers that have shaped the

    networking industry.

    —John McConnell

It’s impossible to begin these acknowledgments without wishing that John were still alive. This is his book, not

    mine. He conceived it; he drafted it; he should have been writing this page. We all used to joke about how John

    “towered over the industry,” and it wasn’t just because of his height. In working from John’s drafts to complete the

    book, in talking to colleagues about his work, and in remembering the easy, jovial way he talked about examples of

    industry practices, I was constantly reminded of his stature and of the friendly way he had. I think I can say, with

    confidence, that everyone in the industry truly misses him; I certainly do.

    John’s wife wanted to see this book come to publication, and Cisco Press went far out of their way to make that happen.

    Jill Batistick and Michelle Grandin, the editors, were wonderfully friendly and helpful; they made the process of

    working through the chapters almost enjoyable. The technical reviewers, Rich Ptak, John Morency, and David Fishman

put a tremendous amount of work into the book. They didn’t just point out my errors; they suggested corrections
and entire new paragraphs that could improve the text. They were truly partners in bringing the book to publication.

I’d also like to thank Astrid Wasserman, of MediaLive International, Inc. (the organizers of Networld+Interop),

    who gave me a copy of John’s proposed two-day seminar on Service Level Management. Although it was never

    presented, the seminar slides gave me a lot of insight into his ideas.

    I have tried to stay close to John’s original thoughts and text, although I have occasionally succumbed to temptation

and added additional information. Minor additions occur in all chapters; major additions are in Chapter 2 (measurement
statistics), Chapter 6 (triage for quick assignment of problems to appropriate diagnostic teams), Chapter 8

    (transaction response time), and Chapter 11 (flash loads and abandonment). Most of the additions are topics that I

    had discussed with John at various conferences we attended together; I hope, and believe, that he would agree with

    them. In all cases when the author speaks directly to the reader, that author is John.

    —Eric Siegel  October 14, 2003


About the Authors

John McConnell was involved in networking for over 30 years. A member of the ARPANET working group, John contributed to early Internet architecture and protocol development. John consulted with clients in the U.S., Europe, Asia, and the Middle East, and he designed some of the first TCP/IP networks deployed in Europe and the Middle East.

John served as a consultant in the areas of systems and network management with a focus on Service Level Management (SLM), policy-based management solutions, and the emerging issues of management solutions for e-business.

John received a master’s in electrical engineering and computer science from the University of California, Berkeley.

Eric Siegel, Principal Internet Consultant with Keynote Systems, Inc., “the Internet performance authority,” first worked on the Internet in 1978. He wrote Designing Quality of Service Solutions for the Enterprise (John Wiley & Sons) and has taught Internet performance tuning, SLM, and quality of service (QoS) at major industry conferences such as Networld+Interop.

Before joining Keynote Systems, Eric was a Senior Network Analyst at NetReference, Inc., where he specialized in network architectural design for Fortune 100 companies, and he was a Senior Network Architect with Tandem Computers, where he was the technical leader and coordinator for all of Tandem’s data communications specialists worldwide. Eric also worked for Network Strategies, Inc. and for the MITRE Corporation, where he specialized in computer network design and performance evaluation. Eric received his B.S. and M.Eng. degrees in electrical engineering from Cornell University, where he was elected to the Electrical Engineering honor society.


About the Technical Reviewers

David M. Fishman is at Sun Microsystems, where he is responsible for availability measurement strategies in the office of Sun’s Chief Customer Advocate. Prior to that, he managed Sun’s strategic technology relationship with Oracle, driving technology alignment on High Availability (HA), Java technology, and performance. Before joining Sun in 1996, Fishman held a variety of product management and business development positions at Mercury Interactive Corporation. Previous work experience includes high-tech

    marketing and management in defense electronics, embedded systems, and office automation. David holds an

    MBA from the School of Management at Yale University. He lives in Sunnyvale, California, with his wife and

    two children.

    John P. Morency is a 29-year veteran of the networking and telecommunications industries and president of

    Momenta Research, Inc., a company that he founded in 2002. His industry experience includes network software

    development, technical support, IT operations, industry consulting, product marketing, and business development.

Because of his wide range of experience, John has a unique ability to effectively assess the business, technological,

    and operational impacts of new products and technologies. This is evidenced by the significant business case and

    Total Cost of Ownership (TCO) work that John has done on behalf of hundreds of Fortune 1000 clients over the

    past ten years, resulting in hundreds of millions of dollars in both top- and bottom-line benefits.

    John’s current research is focused on the business benefits attributable to the implementation of wireless LANs

    (Wi-Fi), network telephony, content networking, system and network security, Web services, disaster recovery,

    and IT process automation.

He is the author of over 400 publications on the operations and business impact of new IT technology. His speaking

    and publication credentials include Networld+Interop, Network World, Billing World, Broadband Year, LightWave,

    Telecommunications, and Telecom-Plus International, among many others.

Richard L. Ptak, founder of Ptak & Associates, Inc., has more than 25 years of experience providing consulting services

    on the use of IT resources to achieve competitive advantage. Ptak earned his B.S. and M.S. at Kansas State University.

He earned his MBA at the University of Chicago.


    Contents at a Glance

    Preface xxi

Part I Service Level Agreements and Introduction to Service Level Management 3

    Chapter 1 Introduction 5

    Chapter 2 Service Level Management 13

    Chapter 3 Service Management Architecture 41

    Part II Components of the Service Level Management Infrastructure 59

    Chapter 4 Instrumentation 61

    Chapter 5 Event Management 81

    Chapter 6 Real-Time Operations 101

    Chapter 7 Policy-Based Management 129

    Chapter 8 Managing the Application Infrastructure 145

    Chapter 9 Managing the Server Infrastructure 163

    Chapter 10 Managing the Transport Infrastructure 177

    Part III Long-term Service Level Management Functions 193

    Chapter 11 Load Testing 195

    Chapter 12 Modeling and Capacity Planning 209

    Part IV Planning and Implementation of Service Level Management 217

    Chapter 13 ROI: Making the Business Case 219

    Chapter 14 Implementing Service Level Management 231

    Chapter 15 Future Developments 245

    Index 259


    Contents

    Preface xxi

    Part I Service Level Agreements and Introduction to Service Level Management 3

    Chapter 1 Introduction 5

    E-business Services 5

    B2B 6

    B2C 7

    B2E 8

    Webbed Services and the Webbed Ecosystem 8

    Service Level Management 9

    Structure of the Book 10

    Summary 11

    Chapter 2 Service Level Management 13

    Overview of Service Level Management 14

    The Internal Role of the IT Group 14

    The External Role of the IT Group 15

    The Components of Service Level Management 15

    The Participants in a Service Level Agreement 15

    Metrics Within a Service Level Agreement 16

    Introduction to Technical Metrics 17

    High-Level Technical Metrics 18

    Workload 18

    Availability 19

    Transaction Failure Rate 20

    Transaction Response Time 20

    File Transfer Time 20

    Stream Quality 20

    Low-Level Technical Metrics 21

    Workload and Availability 21

    Packet Loss 22


    Latency 22

    Jitter 22

    Server Response Time 23

Measurement Granularity 23
Measurement Scope 23

    Measurement Sampling Frequency 24

    Measurement Aggregation Interval 26

    Measurement Validation and Statistical Analysis 28

    Measurement Validation 28

    Statistical Analysis 29

    Business Process Metrics 31

Problem Management Metrics 33
Real-Time Service Management Metrics 33

    Service Level Agreements 34

    Summary 37

    Chapter 3 Service Management Architecture 41

    Web Service Delivery Architecture 42

    Service Management Architecture: History and Design Factors 45

    The Evolution of the Service Management Environment 45

    Service Management Architectures for Heterogeneous Systems 46

    Architectural Design Drivers 48

    Demands for Changing, Expanding Services 49

    Multiple Service Providers and Partners 49

    Elastic Boundaries Among Teams and Providers 49

    Demands for Fast System Management 50

    Data Item Definition and Event Signaling 50

    Service Management Architecture: A General Example 52

    Instrumentation 52

    Instrumentation Management 53


    SLA Statistics and Reporting 54

    Real-Time Event Handling, Operations, and Policy 54

    Long-Term Operations 55

    Back-Office Operations 56

    Summary 57

    Part II Components of the Service Level Management Infrastructure 59

    Chapter 4 Instrumentation 61

    Differences Between Element and Service Instrumentation 61

    Information for Service Management Decisions 63

    Operational Technical Decisions 64

Operational Business Decisions 64
Decisions That Have Long-Term Effect 65

    Instrumentation Modes: Trip Wires and Time Slices 65

    Trip Wires 66

    Time Slices 67

    The Instrumentation System 68

    Starting with the Instrumentation Managers 69

    Collectors 70

Aggregators 72
Processing 72

    Ending with the Instrumentation Manager 73

    Instrumentation Design for Service Monitoring 73

    Demarcation Points 73

    Passive and Active Monitoring Techniques 75

    Passive Collection 75

    Active Collection 75

    Trade-Offs Between Passive and Active Collection 76

    Hybrid Systems 77


    Instrumentation Trends 77

    Adaptability 77

    Collaboration 78

    Tighter Linkage for Passive and Active Collection 78

    Summary 78

    Chapter 5 Event Management 81

    Event Management Overview 82

    Alert Triggers 82

    Reliable Alert Transport 83

    Alert Management 84

Basic Event Management Functions: Reducing the Noise and Boosting the Signal
Volume Reduction 86

    Roll-Up Method 86

    De-duplication 87

    Intelligent Monitoring 87

    Artifact Reduction 88

    Verification 88

    Filtering 89

    Correlation 90

    Business Impact: Integrating Technology and Services 91

    Top-Down and Bottom-Up Approaches 92

    Modeling a Service 92

    Care and Feeding Considerations 93

    Prioritization 94

    Activation 95

    Coordination 96

    A Market-Leading Event Manager: Micromuse 97

    Netcool Product Suite 97

    Event Management 98

    Summary 99


    Chapter 6 Real-Time Operations 101

    Reactive Management 103

Triage 104
Root-Cause Analysis 107

    Speed Versus Accuracy 107

    Case Study of Root-Cause Analysis 108

    Complicating Factors 110

    Brownouts 110

    Virtualized Resources 110

    The Value of Good Enough 111

    Proactive Management 112

    The Benefits of Lead Time 112

    Baseline Monitoring 112

    The Value of Predicting Behavior 113

    Automated Responses 113

    Languages Used with Automated Responses 113

    A Case Study 114

    Step 1: Assessing Local Impact 114

    Step 2: Adjusting Thresholds 115

    Step 3: Assessing Headroom 115

    Step 4: Taking Action 115

    Step 5: Reporting 116

    Building Automated Responses 116

    Picking Candidates for Automation 116

    Examples of Commercial Operations Managers 116

    Tavve Software’s EventWatch 117

    ProactiveNet 117

    Netuitive 120


    Handling DDoS Attacks 121

    Traditional Defense Against DDoS Situations 122

    Defense Through Redundancy and Buffering 124

    Automated Defenses 124

    Organizational Policy for DDoS Defense 126

    Summary 127

    Chapter 7 Policy-Based Management 129

    Policy-Based Management 129

    The Need for Policies 130

    Management Policies for Elements 131

Service-Centric Policies 132
A Policy Architecture 133

    Policy Management Tools 133

    Repository 134

    Policy Distribution 134

    The Pull (Component-Centric) Model 134

    The Push (Repository-Centric) Model 135

    Hybrid Distribution 135

    Enforcers 136

    Policy Design 136

    Policy Hierarchy 137

    Policy Attributes 137

    Policy Auditing 138

    Policy Closure Criteria 138

    Policy Testing 138

    Policy Product Examples 139

    Cisco QoS Policy Manager 139

    Orchestream Service Activator 141

    Summary 142


    Chapter 8 Managing the Application Infrastructure 145

    Interaction of Operations and Application Development Teams 146

The Effect of Organizational Structures 146
The Need to Understand the Operational Environment 146

    Time Lines Are Shorter 147

    Application-Level Metrics 147

    Workload 149

    Customer Behavior Measurement 149

    Business Measurements 150

    Service Quality Measurement 151

Transaction Response Time: An Example of Dependence on Lower-Level Services 152

    Serialization Delay 153

    Queuing Delay 154

    Propagation Delay 154

    Processing Delay 156

    The Need for Communications Among Design and Operations Groups 156

    Instrumenting Applications 157

    Instrumenting Web Servers 157

    Instrumenting Other Server Components 159

    End-User Measurements 160

    Summary 161

    Chapter 9 Managing the Server Infrastructure 163

    Architecture of the Server Infrastructure 163

    Load Distribution and Front-End Processing 164

    Local Load Distribution 166

Geographic Load Distribution 168
Caching 168

    Content Distribution 169


    Instrumentation of the Server Infrastructure 171

    Load Distribution Instrumentation 172

Cache Instrumentation 173
Content Distribution Instrumentation 173

    Summary 174

    Chapter 10 Managing the Transport Infrastructure 177

    Technical Quality Metrics for Transport Services 178

    Workload and Bandwidth 178

    Availability and Packet Loss 179

    One-Way Latency 180

    Round-Trip Latency 181

    Jitter 181

    QoS Technologies 181

    Tag-Based QoS 182

    IEEE 802 LAN QoS 182

    IP TOS 183

    IP DiffServ 183

    MPLS 183

    RSVP 184

    Traffic-Shaping QoS 185

    Rate Control 186

    Queuing 187

    Over-provisioning and Isolated Networks 188

    Managing Data Flows Among Organizations 188

    Levels of Control 189

    Demarcation Points 189

    Diagnosis and Recovery 189

    Summary 191


    Part III Long-term Service Level Management Functions 193

    Chapter 11 Load Testing 195

The Performance Envelope 196
Load Testing Benchmarks 199

    Load Test Beds and Load Generators 200

    Building Transaction Load-Test Scripts and Profiles 203

    Using the Test Results 205

    Summary 206

    Chapter 12 Modeling and Capacity Planning 209

    Advantages of Simulation Modeling 209

    Complexity of Simulation Modeling 211

    Simulation Model Examples 211

    Model Construction 211

    Model Validation 213

    Reporting 214

    Capacity Planning 214

    Summary 215

Part IV Planning and Implementation of Service Level Management 217
Chapter 13 ROI: Making the Business Case 219

    Impact of ROI on the Organization 220

    A Basic ROI Model 220

    The ROI Mission Statement 222

    Project Costs 223

    Project Benefits 223

    Availability Benefits 224

Performance Benefits 225
Staffing Benefits 225

    Infrastructure Benefits 225

    Deployment Benefits 225


    Soft Benefits 226

    ROI Case Study 226

    Summary 228

    Chapter 14 Implementing Service Level Management 231

    Phased Implementation of SLM 231

    Choosing the Initial Project 231

    Incremental Aggregation 232

    An SLM Project Implementation Plan 233

    Census and Documentation of the Existing System 233

    Specification of Performance Metrics 234

    Instrumentation Choices and Locations 235

    Passive Measurements 236

    Active Measurements 236

    Baseline of Existing System Performance 237

    Investigation of System Performance Sensitivities and System Tuning 237

    Construction of SLAs 239

    Roles and Responsibilities 240

    Reporting Mechanisms and Scheduled Reviews 240

    Dispute Resolution 241

    Summary 242

    Chapter 15 Future Developments 245

    The Demands of Speed and Dynamism 245

    Evolution of Management Systems Integration 248

    Superficial Integration 248

    Data Integration 248

    Event Integration 249

    Process Integration 250

    Architectural Trends for Web Management Systems 250

    Loosely Coupled Service-Management Systems Architecture 251

    Process Managers 251

    Clustering and the Webbed Architecture 252


    Integrating the Components with Signaling and Messaging 252

    Loosely Coupled Service-Management Processes 253

Business Goals for Service Performance 254
Finding the Best Tools 255

    Summary 256

    Index 259


Preface

Some years ago I received a true pearl of wisdom from an industry colleague. “In order to truly understand your profession,” he advised, “you must make the effort to learn other disciplines that are completely different from the one that you espouse.”

That colleague was John McConnell, a man who truly understood this advice by walking the talk over the course of his life. Born into a military family, John developed a keen understanding of the importance of the global ecosystem at a very young age through his childhood experiences in both Europe and the Far East. Despite being a shy, scholarly individual throughout primary and secondary school, John also demonstrated the value of hard work and dedication by making the varsity rowing team at U.C. Berkeley.

The strong work ethic that John nurtured at Berkeley served him well after he received his master’s in computer science in 1968. What differentiated John from many of his fellow graduates, however, was the application of his craft to non-IT disciplines after graduation. Some of his first initiatives included the application of computer technology to measure the rate of solar intensity upon the earth and the development of a programming language that was designed to test the content and substance of moon samples brought back to earth by the Apollo astronauts. In addition, John developed a number of network control programs for the ARPANET (the predecessor to today’s Internet) in the mid-1970s, when the state of the commercial data networking industry was in its true infancy.

John also spent a number of years in professional capacities that had very little to do with information technology. After graduate school, John became an accomplished massage therapist, hypnotist, and practitioner in the art of Rolfing, a technique for the detection, treatment, and removal of bodily stress and pain. In 1983, using his Rolfing technique, John was selected to work with the members of the U.S. Olympic bicycling team, and he applied this technique to aid the team in preparing for the 1984 Olympic games. Recently, when not consulting, John was training to become an instructor in the Ridhwan Foundation, an institution whose focus is the rediscovery and integration of the true self into one’s own professional and personal life. Over the years, he had a myriad of personal interests, including soaring, mountain climbing, bird watching, backpacking, rowing, and blues festivals. One of his most recent and satisfying accomplishments was the design, building, and completion of a second home in southern Costa Rica that effectively enabled both him and his wife Grace to really get away from it all.

First and foremost, John’s professional focus in the IT industry was the advancement of technologies and products that improved the efficiency and the effectiveness of IT management. Given his whole-life background, John was especially dedicated to reducing the operational and business “pain points” associated with IT implementation and management. This focus is reflected in John’s prior works Internetworking Computer Systems and Managing Client/Server Environments, as well as in Practical Service Level Management: Delivering High-Quality Web-Based Services. John’s numerous publications, conferences, and televised briefings reflect a focused dedication to the removal of technological barriers to the optimal effectiveness of IT organizations worldwide. His life experiences as a true Renaissance man uniquely enabled him to both understand and drive the level of change needed to not only improve the state of the art, but also the quality of life. John was indeed the “gold standard” of knowledge, professionalism, and personal integrity that made the pursuit of these goals not only a logical possibility, but, for many of us, a practical reality. The loss of John will be keenly felt for some time, but the goals and values that he aspired to and embraced will inspire and guide many of us for years to come.

—John Morency, President, Momenta Research

    May 2003


    P A R T I

Service Level Agreements and Introduction to Service Level Management

    Chapter 1 Introduction

    Chapter 2 Service Level Management

    Chapter 3 Service Management Architecture


    C H A P T E R 1

    Introduction

The World Wide Web—the Web—is the catalyst for the changes in our communications, work styles, business processes, and ways of seeking entertainment and information. The Internet is just the transport infrastructure for the web-based services that drive so much innovation. Note, however, that the Internet generally gets all the credit. As Thomas Friedman writes in The Lexus and the Olive Tree:

The Internet is going to be like a huge vise that takes the globalization system that I have described—and keeps tightening and tightening that system around everyone, in ways that will only make the world smaller and smaller and faster and faster with each passing day.

This is an accurate description of the environment that most of us deal with directly on a daily basis. The Internet is a tremendous business engine, and, as it transforms the ways we do business, it is being transformed in turn by the ways we use it. We must learn how to manage the growing array of online business services or risk being marginalized by a fast-moving and more dynamic business environment.

    In this introductory chapter, I discuss the following:

• The types of e-business services
• A definition of webbed services and the webbed ecosystem
• Service Level Management (SLM)
• The structure of this book

E-business Services

E-business is a generic term defining business activities that are carried out totally, or in part, through electronic communications between distributed organizations and people. These activities are characterized by speed, flexibility, and constant change.

The Internet has become the vehicle for transforming business processes. The reasons for its ascendancy include the following:

• The Internet protocols are the only workable set of technologies that really provide a high degree of interoperability among different systems.
• The wide geographic reach of the Internet increases the size of any potential market.


• Internet economies make it feasible to distribute information and transact business globally.
• The introduction of the browser and its supporting technologies makes the Internet much easier to use, thereby increasing the potential market.

    There are many ways of segmenting and describing the large variety of services available

    through the Internet and the Web. A simple classification that covers most services is based

    on the relationship of the business to customers, business partners, and employees. For

    example, the process shown in Figure 1-1 describes a simple situation involving all three

    types of relationships: business to business (B2B), business to consumer (B2C), and

    business to employee (B2E). These segments are an easy way of organizing our thinking

    about services, although it’s important to remember that business processes in the real

    world will have many variations and overlaps.

    Figure 1-1  Business Relationships

    The following sections discuss each relationship type in turn.

B2B

B2B services are a broad category that incorporates transactions among different businesses and government agencies. Many current B2B services, such as supply chain

    management and credit authorization, use the Internet to drive down the costs and delays

    associated with current processes and to boost their productivity.

    B2B is rapidly broadening to include more than supply chain management and credit

    authorization. Functions such as shipping, billing, and Customer Relationship Management


(CRM) are now often external to the business; other businesses provide and host these specialized services as a utility. For example, entry of a customer’s order can result in more than the functions of pricing, authorizing, assembling, and shipping; a modern system might use B2B links to provide the customer with a shipment tracking number from the shipping company, and it might interact with an external CRM service to reflect the current purchases and other factors of the customer’s profile. Meanwhile, the sales person might be indirectly using B2B links to handle her commissions and personnel data through outsourced employee management services, and engineering staff might use B2B links for collaborative design.

Thanks to the Web, B2B is rapidly transforming into an even more dynamic set of services from which an enterprise can select in real time. No one wants to be dependent on a single supplier or customer; everyone must deal with competitive pressures exerted from both sides. Services such as credit authorization and shipping are examples of those that can be selected in real time based on their performance or costs. Other services and supplies may be selected from web-based exchanges or e-markets.

B2B processes can be complex. They must follow the business requirements for tracking orders, negotiating contracts, arranging payments, and reporting outcomes that govern these processes when they take place without the automation of electronic communications.

Note that new benefits become available, although at the cost of additional complexity, when B2B replaces older systems. For example, organizations can change their business processes to increase their business effectiveness by obtaining real-time information on order volumes, revenue rates, cancelled orders, and other factors. This additional information, while adding to complexity, provides value in addition to the acceleration of the processes themselves by identifying further efficiencies.

    Continuous monitoring of B2B suppliers, partners, and web infrastructure

    (communications, hosting, and exchanges) is necessary to determine whether they are

    meeting their service quality commitments.

As in conventional commerce, managing across organizations adds complexity. All the links in the B2B services chains are known, but these links are controlled by many different organizations, are complex, and may change rapidly as services are selected in real time. Managing B2B services therefore requires cooperation with the management teams of the other participants and, possibly, with third-party measurement organizations to assure true end-to-end service quality.

B2C

B2C garnered most of the early attention from the trade press and analysts as traditional businesses took advantage of the Internet’s wide geographic reach and low costs for reaching customers. Some businesses (eBay and Amazon.com, for example) were founded to exploit this new market opportunity.


B2C sites continually add new services of their own while offering links to related businesses and services in an attempt to offer one-stop shopping—and selling—to their customers. This is a highly competitive segment with little customer loyalty. The wide selection of competing sites draws customers away whenever any one site has a service disruption.

    B2C environments are characterized by a lack of visibility and management control of the

    customer-access infrastructure, which is the set of networks, caches, and other systems that

    consumers use to connect to the B2C site. Customers usually don’t want measurement tools

    embedded in their systems, and the access infrastructure providers also resist making their

    internal performance readily visible. There is also limited visibility into the performance of

    partner sites (advertisers and other third parties), which are important parts of the

    customer’s perception of total site performance. The span of control and management

    available to B2C sites is therefore usually limited to monitoring and managing their internal

    operations (inside the firewall) as well as measurement of Internet delays and performance

    as seen from various points on the edge of the Internet.

B2E

B2E services are also known as the intranet. These services help improve the internal effectiveness of an organization and help it keep pace with its customers and business partners. Many B2E services enable employees to query their benefits, schedule vacations, fill out expense reports, and conduct a set of activities that formerly required a large staff to coordinate.

B2C and B2E services use the web browser as the access device. Transactions are initiated from the browser to deliver information and activate a range of business processes. However, B2E environments are the only ones that enable administrators to have control of both ends—the servers as well as the desktops, cell phones, and personal communicators used to access them.

Webbed Services and the Webbed Ecosystem

In this book, I use the term webbed services to describe the set of business services that are based on a component approach to systems design. This design is driven from the Web and its associated technologies, regardless of the specific technologies used. Because webbed services are constructed from a set of interconnected software components and services that can be reused in multiple places, they can usually avoid some of the expense, time, and effort associated with building and modifying monolithic applications.

Webbed services is a very inclusive term; it’s increasingly difficult to find services that are not somehow tied into the Web. As a case in point, I was recently speaking about webbed services at a large retail organization, and someone in the group stated that their main application did not fit into the webbed category because it was a stand-alone Oracle


Financials application. However, further discussion soon revealed that their international operations used real-time currency conversion decisions. The real-time exchange rates in the Oracle Financials application were, in fact, accessed through the Web.

Indeed, webbed services are now taking on many of the characteristics of an ecosystem, which is a group of independent but interrelated elements comprising a unified whole. A smooth business process depends on each element carrying out its tasks accurately and quickly, with consideration for maintaining balances among all the elements. In a well-balanced webbed ecosystem, all elements bear appropriate shares of the load. None is overwhelmed, none is underutilized. Balance is concurrently maintained between service quality and service cost. The ecosystem metaphor is gaining momentum as online processes evolve to dynamically select their elements (underlying services) based on their current behavior and performance.

The webbed ecosystem perspective also holds within any subgroup of systems. For instance, hosting facilities use a range of technologies, such as prioritizing devices, bandwidth managers, global load balancers, and caches, to deliver online business services. These systems also need balanced management; adding bandwidth when servers are congested is a wasteful investment.

Service Level Management

Service quality is extremely important, given the accelerating number of critical business processes going online. Customers and business partners go elsewhere if the services they want are not available or are performing sluggishly. Unfortunately, good service quality is a dynamic target, and the demands continue to tighten. Competitors will match or exceed your service quality levels and create pressure toward matching or bettering theirs.

Service Level Management (SLM) is the process of managing network and computing resources to ensure the delivery of acceptable service quality at an acceptable price in an acceptable time frame. It focuses on the behavior of the services rather than on tracking the status of every router, switch, and server in the environment. Through SLM, service quality is guaranteed and priced for different levels of service.

SLM is a competitive weapon in the marketplace, offering the guarantees needed to transition critical business activities online. Poorly managed services have harmed many businesses when their web sites crashed, their applications slowed to a crawl, or their Web content was not attractively presented or was too difficult to navigate. Good service quality helps retain customers and differentiate your organization from those that have not yet mastered the art of managing service quality.

Effective SLM is also an economic weapon. Managing resources more effectively reduces costs, creates more revenue opportunities, and leverages technology investments.

Finally, SLM is a means to build the solid business relationships that make online business initiatives successful.


The second group of chapters (8–10) in this part steps through the major systems used for web service delivery. It looks at the ways they can be used to improve service delivery and also discusses their specific instrumentation needs, using the system management infrastructures described in the first part of this section. Chapter 8 investigates the instrumentation and

    management of applications and of end-user access devices, such as

    browsers. Chapter 9 looks at web server systems, including servers, load

    balancers, and content distribution networks. Finally, Chapter 10 discusses

    instrumentation and management of the transport infrastructure, including

    QoS technology and traffic shaping to achieve policy objectives.

    • Part III: Long-term Service Level Management Functions (Chapters 11–12)—This part covers load testing, modeling, and capacity planning. No management

    system can provide necessary quality if the web serving system, as a whole, has

    insufficient capacity.

• Part IV: Planning and Implementation of Service Level Management (Chapters 13–15)—Calculation of Return on Investment (ROI) for SLM is critical to the justification and design of an implementation; it’s covered in Chapter 13. Chapter 14 provides guidance for using the information in this book to design an SLM system for your particular situation, and the part ends with discussion in Chapter 15 of possible

    future developments in SLM.

Summary

The Internet and the Web are transforming business processes for interaction among businesses, government, suppliers, customers, and employees. As more and more critical business processes go online, the service quality of those processes becomes more important to the success of business as a whole.

SLAs are the formal, negotiated contracts between service providers and service users that define the services to be provided, their quality goals, and the actions to be taken if the SLA terms are violated.

SLM is the process of managing network and computing resources to ensure the delivery of acceptable service quality, usually as defined in an SLA, at an acceptable price in an acceptable time frame. It is a competitive weapon in the marketplace because it can improve customer relationships, create more revenue opportunities, and reduce costs.


    C H A P T E R 2

    Service Level Management

Service Level Management (SLM) is key to delivering the services that are necessary to remain competitive in the Internet environment. Service quality must remain stable and acceptable even when there are substantial changes in service volumes, customer activities, and the supporting infrastructures.

Superior service quality also becomes a competitive differentiator because it reduces customer churn and brings in new customers who are willing to pay the premiums for guaranteed service quality. Customer churn is an insidious problem for almost every service provider.

The competitive market increases customer acquisition costs because continuous marketing and promotions are necessary just to replace the eroding customer base. High customer acquisition costs must be dealt with either by raising prices (a difficult move in a highly competitive market) or by taking longer to amortize the acquisition costs before profitability for each customer is achieved. Improving customer retention therefore dramatically increases profits.

This chapter covers the basics of SLM and lays part of the groundwork for the rest of the book:

• An overview of SLM
• An introduction to technical metrics
• Detailed discussions of measurement granularity and measurement validation
• Business process metrics
• Service Level Agreements (SLAs)

Note that the chapter ends with a summary discussion in the context of building an SLA. Use of metrics in combination with the SLA’s service level objectives to control performance is discussed in Chapter 6, “Real-Time Operations,” and Chapter 7, “Policy-Based Management.”


Overview of Service Level Management

Often, one group’s service provider is another group’s customer. It is critical to understand

    that service delivery is often, in fact, a chain of such relationships. As Figure 2-1 shows,

    some entities, such as an IT group, can play different roles in the service delivery process.

    As shown in the figure, a hosting company can be a customer of multiple service providers

    while in turn acting as a service provider. An IT group may be a customer of several service

    providers offering basic Internet connectivity, application hosting, content delivery, or other

    services. Customers may use multiple providers of the same service to increase their

    availability and to protect against dependence on a single provider. Customers will also use

    specialized service providers to fulfill particular needs.

    Figure 2-1  Roles of Customers and Service Providers

The Internal Role of the IT Group

An IT group serves the entire organization by aggregating demands of individual business units and using them as leverage to reduce overall costs from service providers.

Today, such IT groups are making the necessary adjustments as managed services become a mandatory requirement. IT managers are constantly reassessing the business and strategic


trade-offs of developing internal competence and expertise as opposed to outsourcing more of the traditional IT work to external providers. The goal is to save money, protect strategic assets, and maintain the necessary flexibility to meet new challenges.

The External Role of the IT Group

IT groups are increasingly being required to provide specific levels of service, and they are also more frequently involved in helping business units negotiate agreements with external service providers. Business units often choose to deal directly with service providers when they have specialized needs or when they determine that the IT group cannot offer services

    with competitive costs and benefits.

    IT groups must therefore manage their own service levels as well as those of service

    providers, and they must track compliance with negotiated SLAs.

The Components of Service Level Management

The process of monitoring service quality, detecting potential or actual problems, taking the actions necessary to maintain or restore the necessary service quality, and reporting on achieved service levels is the core of SLM. Effective SLM solutions must deliver acceptable

    service quality at an acceptable price.

Acceptable quality from a customer perspective means an ability to use the managed services effectively. For example, acceptable quality may mean that an external customer or business partner can transact the necessary business that will generate revenues, strengthen business partnerships, increase the Internet brand, or improve internal productivity. Specific ongoing measurements are carried out to determine acceptable service quality levels, and noncompliance is noted and reported.

Acceptable costs must also be considered, because over-provisioning and throwing money at service quality problems is not an acceptable strategy for either service providers or their customers (and in spite of the cost, it often doesn’t solve the problem). Service management policies are applied to critical resources so that they are allocated to the appropriate services; inappropriate activities are curtailed. Service providers that manage resources effectively deliver superior service quality at competitive prices. Their customers, in turn, must also increase their online business effectiveness and strengthen their bottom-line results.

The Participants in a Service Level Agreement

The SLA is the basic tool used to define acceptable quality and any relationships between quality and price. Because the SLA has value for both providers and customers, it’s a wonder that it has taken so long for it to become important. In practice, many organizations


and providers find the process of negotiating an acceptable SLA to be a difficult task. As with many technical offerings, customers often experience difficulty in expressing what they need in technical terms that are both measurable and manageable; therefore, they have difficulty specifying their needs precisely and verifying that they are getting what they pay for.

Service providers, on the other hand, appreciate clearly specified requirements and want to

    take advantage of the opportunity to offer profitable premium services, but they also want

    to minimize the risks of public failure and avoid increasingly stringent financial penalties

    for noncompliance with the terms of the SLA.

Metrics Within a Service Level Agreement

Measurement is a key part of an SLA, and most SLAs have two different classes of metrics, as shown in Figure 2-2: technical metrics and business process metrics. Technical metrics include both high-level technical metrics, such as the success rate of an entire transaction as seen by an end user, and low-level technical metrics, such as the error rate of an underlying communications network. Business process metrics include

    measures of provider business practices, such as the speed with which they respond to

    problem reports.

    Figure 2-2 Contents of a Service Level Agreement 

[Figure 2-2 shows the contents of a service level agreement: technical metrics, divided into high-level metrics (workload, availability, transaction failure, transaction response time, and so on) and low-level metrics (workload, availability, packet loss, one-way packet delay, jitter, server response time, and so on); business process metrics (trouble response time, trouble relief time, provisioning time, and so on); metric specification and handling (granularity, validation, statistical analysis); and penalties and rewards.]


Service providers may package the metrics into specific profiles that suit common customer requirements while simplifying the process of selecting and specifying the parameters.

    Service profiles help the service provider by simplifying their planning and resource

    allocation operations.

Introduction to Technical Metrics

Technical metrics are a core component of SLAs. They are used to quantify and to assess the key technical attributes of delivered services.

Examples of technical metrics are shown in Table 2-1. They are separated into the two basic groups: high-level metrics, which deal with attributes that are highly relevant to end users and are easily understood by them, and low-level metrics, which deal with attributes of the underlying technologies. Note that you should be very specific when defining these terms in an agreement. Although many of these terms are in common use, their definitions vary.

Table 2-1  Examples of Technical Metrics

Metric                      Description

High-Level Technical Metrics
Workload                    Applied workload in terms understandable by the end user
                            (such as end-user transactions/second)
Availability                Percentage of scheduled uptime that the system is perceived
                            as available and functioning by the end user
Transaction Failure Rate    Percentage of initiated end-user transactions that fail to complete
Transaction Response Time   Measure of response-time characteristics of a user transaction
File Transfer Time          Measure of total transfer-time characteristics of a file transfer
Stream Quality              Measure of the user-perceived quality of a multimedia stream

Low-Level Technical Metrics
Workload                    Applied workload in terms relevant to underlying technologies
                            (such as database transactions/second)
Availability                Percentage of scheduled uptime that the subsystem is available
                            and functioning
Packet Loss                 Measure of one-way packet loss characteristics between specified points
Latency                     Measure of transit time characteristics between specified points
Jitter                      Measure of the transit time variability characteristics between
                            specified points
Server Response Time        Measure of response-time characteristics of particular server subsystems
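To make these definitions concrete, the following is a minimal sketch (my illustration, not from the book; the record format and function names are hypothetical) of how two of the high-level metrics might be computed from a log of end-user transactions:

    from dataclasses import dataclass
    from statistics import median

    @dataclass
    class TransactionRecord:
        name: str             # transaction type, e.g. "checkout" (hypothetical)
        response_time: float  # seconds from initiation to completion
        completed: bool       # False if the transaction failed or was abandoned

    def transaction_failure_rate(records):
        # Transaction Failure Rate: percentage of initiated end-user
        # transactions that fail to complete.
        failed = sum(1 for r in records if not r.completed)
        return 100.0 * failed / len(records)

    def response_time_summary(records):
        # Transaction Response Time: summarize completed transactions only;
        # the median resists distortion by a few very slow outliers.
        times = sorted(r.response_time for r in records if r.completed)
        return {"median_s": median(times), "worst_s": times[-1]}

The point of the sketch is simply that each metric in the table reduces to a well-defined computation over measured data; the validation and statistical analysis of such measurements are discussed later in this chapter.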


Workload is an important characteristic of both high- and low-level metrics. It’s not a measure of delivered quality; instead, it’s a critical measure of the load applied to the system. For example, consider the workload of serving web pages. A text-only page might comprise only 10 K bytes, whereas a graphics page could comprise a few megabytes. If the requirement is to deliver a page in six seconds to the end user, massively different bandwidth and capacity will be necessary. Indeed, content may need to be altered for low-speed connections to meet the six-second download time.

NOTE In many situations, certain technical metrics aren't specified in the SLA. Instead, the supplier is asked to use best effort, which represents the classic Internet delivery strategy of "get it there somehow without concern for service quality." Today, best effort represents the commodity level for services. There are no special treatments for best-effort services. The only need is that there are sufficient resources to prevent best-effort services from starving out, which means having the connection time out because of long periods of inactivity.

    Discussions of all of the examples in Table 2-1 follow, to illustrate the basic concepts of

    technical metrics. Additional descriptions of these metrics, and other technical metrics,

    appear in Chapters 4 and 8–10.

High-Level Technical Metrics

These metrics deal with workload and performance as seen and understood by the end user.

Workload

The workload high-level technical metric is the measure of applied load in end-user terms. It's unreasonable to expect a service provider to agree to service levels for an unspecified amount of workload; it's also unreasonable to expect that an end user will willingly substitute specification of obscurely-related low-level workload metrics for understandable high-level metrics. SLAs should therefore begin by specifying the high-level workload metrics, and service providers can then work with the customer's technical staff to derive low-level workload metrics from them.

For transaction systems, the workload metric is usually specified in terms of the end-user transaction mix and volumes, which typically vary according to time of day and other business cycles. For existing systems, these statistics can be obtained from logs; for new systems or situations (such as a proposed major advertising campaign designed to drive prospective customers to a web site), the organization's marketing group or their consultants should work to produce the most accurate, specific estimates possible. These workload estimates for new systems should be used for load testing as well as for SLAs.


Transaction workload metrics must include end-user tolerance for transaction response time delays. If response time delays are too long, external customers will abandon the transaction. In legacy systems where external customers did not interact directly with the server systems, abandonment was not a factor in workload testing. Call-center operators handled any delays by talking to the customers, shielding them from the problem, if necessary. On the Web, customers see the delays without any shielding, and they may decide at any point to abandon the transaction, with immediate impact on the server system's workload.

Another effect of the direct connection between customers and web-serving systems is that there's no buffer between those customers and the servers. In a call center, the workload is buffered by external queues. Incoming calls go through an automatic call distribution system; callers are placed on hold until an operator is available. In an order-entry center, the workload is buffered by the stack of documents on the entry clerk's desk. In contrast, the web workload has no external buffer; massive spikes in workload hit the servers instantly. These spikes in workload are called flash load, and they must be specified in the workload metric and considered during load testing. Load specification for the Web should therefore be in terms of arrival rate, not concurrent users, as was the case for call centers and order-entry centers.

File-serving, web-page, and streaming-media workload metrics are similar to transaction metrics, but simpler. They're usually specified in terms of the size and number of files that must be transferred in a given time interval. (For web pages, the types of the files are usually specified. Dynamically-generated files are clearly more resource-intensive than stored static files.) The serving system must have the bandwidth to serve the files, and it must also be able to handle the anticipated number of concurrent connections. There's a relationship between these two variables; given a certain arrival rate, higher end-to-end bandwidth results in fewer concurrent users.
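One way to see that relationship is Little's Law: the average number of concurrent connections equals the arrival rate multiplied by the average time each transfer holds a connection. A minimal sketch in Python; the arrival rate, file size, and bandwidth figures are invented for illustration, not drawn from this chapter:

    arrival_rate = 50.0          # file requests per second (assumed)
    file_size_bits = 2_000_000   # average file size of 2 megabits (assumed)

    def concurrent_connections(bandwidth_bps):
        """Little's Law: concurrency = arrival rate * average transfer time."""
        transfer_time = file_size_bits / bandwidth_bps   # seconds per transfer
        return arrival_rate * transfer_time

    for bw in (500_000, 1_000_000, 4_000_000):           # end-to-end bits/second
        print("%4.1f Mbps -> %6.1f concurrent connections"
              % (bw / 1e6, concurrent_connections(bw)))

At a fixed arrival rate, quadrupling per-user bandwidth cuts the transfer time, and therefore the concurrency the servers must sustain, by the same factor.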

Availability

Availability is the percentage of time that the system is perceived as available and functioning by the end user. It is a function of both the Mean Time Between Failures (MTBF) and the Mean Time To Repair (MTTR). Scheduled downtime might, in some organizations, be excluded from these calculations. In those organizations, a system can be declared 100 percent available even though it's down for an hour every night for system maintenance.
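The standard relationship implied by these definitions is availability = MTBF / (MTBF + MTTR). A quick sketch, with assumed failure and repair times:

    def availability(mtbf_hours, mttr_hours):
        """Steady-state availability from mean time between failures and mean time to repair."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Assumed figures: a failure every 500 hours, one hour to repair.
    print("%.4f" % availability(500, 1.0))    # 0.9980
    # Halving MTTR helps exactly as much as doubling MTBF:
    print("%.4f" % availability(500, 0.5))    # 0.9990
    print("%.4f" % availability(1000, 1.0))   # 0.9990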

Availability is a binary measurement: the service is either available or it isn't. For the end user, and therefore for the high-level availability metric, the fact that particular underlying components of a service are unavailable is not a concern if that unavailability is concealed through redundant systems design.

Availability can be improved by increasing the MTBF or by decreasing the time spent on each failure, which is measured by the MTTR.


Chapter 3, "Service Management Architecture," introduces the concept of triage, which decreases MTTR through quick assignment of problems to the appropriate specialist organization.

    Transaction Failure Rate

    A transaction fails if, having successfully started, it does not successfully complete.

    (Failure to start is the result of an availability problem.) As is true for availability, systems

    design and redundancy may conceal some low-level failures from the end user and

    therefore exclude the failures from the high-level transaction failure rate metric.

    Transaction Response Time

    This metric represents the acceptable delay for completing a transaction, measured at the

    level of a business process.

    It’s important to measure both the total time to complete a transaction and the elapsed time

    per page of the transaction. That’s because the end user’s perception of transaction time,

    which will be used to compare your system with your competitors’, is based on total

    transaction time, regardless of the number of pages involved, while the slowest page will

    influence end-user abandonment of a web transaction.
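A small sketch of the two views of the same measurement; the page names and timings are invented:

    # Per-page timings of one synthetic transaction, in seconds (invented data).
    page_times = {"login": 1.2, "search": 2.8, "item": 1.9, "checkout": 6.4}

    total = sum(page_times.values())            # drives perceived transaction time
    slowest = max(page_times, key=page_times.get)

    print("total transaction time: %.1f s" % total)
    print("slowest page: %s at %.1f s (abandonment risk)" % (slowest, page_times[slowest]))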

File Transfer Time

The file transfer time metric is closely associated with specified workload and is a measure of success. The file transfer workload metric describes the work that must be accomplished in a certain period; the file transfer time metric shows whether that workload was successfully handled. Lack of end-to-end bandwidth, an insufficient number of concurrent connections, or persistent transmission errors (requiring retransmission) will influence this measure.

    Stream Quality

    The quality of multimedia streams is difficult to measure. Although underlying low-level

    technical metrics, such as frame loss, can be obtained, their relationship to the quality as

    perceived by an end user is very complex.

Streaming is a real-time service in which the content continues flowing even with variations in the underlying data transmission rates and despite some underlying errors. A content consumer may see a small blemish on a graphic because a packet is lost in transit, equivalent to static on your car radio. There is no rewinding and playing it again, as there might be with interactive services. Thus, packet loss is handled by just continuing with the streaming rather than retransmitting lost packets.


Occasional packet loss can still be tolerated and sometimes may not even be noticed. If packet loss increases, quality will begin to degrade until it falls below a threshold and becomes unacceptable. Years of development have been focused on concealing these low-level errors from the multimedia consumer, and the major existing technologies from Microsoft, Real Networks, Apple, and others have different sensitivities to these errors.

Nevertheless, quality must be measured. The telephone companies years ago established the Mean Opinion Score (MOS), a measure of the quality of telephone voice transmission. There are also international standards for evaluation of audio and video quality as perceived by human end users; examples are the International Telecommunication Union's ITU-T P.800-series and P.900-series standards and the American National Standards Institute's T1.518 and T1.801 standards. Simpler methods are also in use, such as measuring the percentage of successful connection attempts to the streaming server, the effective bandwidth delivered over that connection, and the number of rebuffers during transmission.

Low-Level Technical Metrics

These metrics deal with workload and performance of the underlying technical subsystems, such as the transport infrastructure. Low-level technical metrics can be selected and defined by first understanding the high-level technical metrics and their implications for the performance requirements placed on underlying subsystems. For example, a clear understanding of required transaction response time and the associated transaction characteristics (the number of transits across the transport network, the size of each transit, and so on) can help set the objective for the low-level technical metric that measures network transit time (latency).

Workload and Availability

These low-level technical metrics are similar to those for the high-level discussion, but they're focused on performance characteristics of the underlying systems rather than on performance characteristics that are directly visible to end users. Their correlation with the high-level metrics depends on the particular system design and the degree of redundancy and substitution within that design.

Throughput, for example, is a low-level technical metric that measures the capacity of a particular service flow. Services with rich content or critical real-time requirements might need sufficient bandwidth to maintain acceptable service quality. Certain transactions, such as downloading a file or accessing a new web page, might also require a certain bandwidth for transferring rich content, such as complex graphics, within the specified transaction delay time.


Packet Loss

Packet loss has different effects on the end-user experience, depending on the service using the transport. The choice of a packet loss metric for a particular application must be carefully considered. For example, packet loss in file transfer forces retransmission unless the high-level transport contains embedded error correction codes. In contrast, moderate packet loss in streaming media may have no user-perceptible effect at all, unless bad luck results in the loss of a key frame.

The burst length must be included in packet loss metrics. Usually a uniform distribution of dropped packets over longer time intervals is implicitly assumed. For example, out of every 100 packets there could be two lost without violating an SLA calling for two percent packet loss. There may be a different perspective if you examine behavior over longer intervals, such as 1,000 packets. Up to 20 packets in a row could be lost without violating the SLA. However, losing 20 consecutive packets, creating a significant gap in data received, might drive quality levels to unacceptable values.
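A sketch of why an average-loss bound alone can hide damaging bursts; the loss pattern and the burst limit of 3 are invented for illustration:

    # A 1,000-packet window at exactly 2 percent loss, but with all 20 drops
    # consecutive (an invented worst-case pattern).
    received = [True] * 1000
    for i in range(500, 520):
        received[i] = False

    loss_rate = received.count(False) / len(received)

    # Longest run of consecutive drops.
    longest = run = 0
    for ok in received:
        run = 0 if ok else run + 1
        longest = max(longest, run)

    print("average loss: %.1f%%" % (loss_rate * 100))   # 2.0% -> average-loss SLA met
    print("longest burst: %d packets" % longest)        # 20 -> quality may be ruined
    # An SLA can bound both, e.g.: loss_rate <= 0.02 and longest <= 3.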

Latency

Latency is the time needed for transit across the network; it's critical for real-time services. Excessive latency quickly degrades the quality of web sites and of interactive sound and video.

Routes in the Internet are usually asymmetric, with flows often taking different paths coming and going between any pair of locations. Thus, the delays in each direction are usually different. Fortunately, most Internet applications are primarily sensitive to round-trip delays, which are much simpler to measure than one-way delays. File transfer, web sites, and transactions all require a flow of acknowledgments in the opposite direction to the data flow. If acknowledgments are delayed, transmission temporarily ceases. The round-trip latency therefore controls the effective bandwidth of the transmission.
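This is the familiar window-size bound for acknowledgment-clocked protocols such as TCP: throughput cannot exceed window size divided by round-trip time. A sketch; the classic 64 KB unscaled TCP window is an assumption chosen for illustration:

    # Throughput ceiling of an acknowledgment-clocked protocol such as TCP:
    #   effective bandwidth <= window size / round-trip time
    window_bytes = 65_535                   # classic TCP window, no scaling (assumed)

    for rtt_ms in (10, 50, 200):
        bps = window_bytes * 8 / (rtt_ms / 1000.0)
        print("RTT %3d ms -> at most %5.1f Mbps per connection" % (rtt_ms, bps / 1e6))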

    Round-trip latency is much simpler to measure than one-way latency, because clock

    synchronization of separated locations is not necessary. That synchronization can be quite

    tricky if it is accomplished across the same network that’s having its one-way delay

    measured. In that case, fluctuations in the metric that’s being measured (one-way latency)

    can easily affect the stability of the measurement apparatus for one-way latency. An

    external reference, such as the satellite Global Positioning System (GPS) timers, is often

    used in such situations.

    Jitter

Jitter is the deviation in the arrival rate of data from ideal, evenly-spaced arrival; see Figure

    2-3. Some packets may be bunched more closely together (in terms of inter-packet delays)

    or spread farther apart after crossing the network infrastructure. Jitter is caused by the

    internal operation of network equipment, and it’s unavoidable. Jitter is created whenever

    there are queues and buffering in a system. Extreme varieties of jitter are also created when

    there’s rerouting of packets because of network congestion or failure.


Figure 2-3 Jitter

[Figure 2-3 contrasts ideal packet spacing with actual packet spacing; the deviation between the two is the jitter.]

Interactive teleconferencing is an example of a service that is extremely sensitive to jitter; too much jitter can make the service completely useless. Therefore, a reduction in jitter, approaching zero, represents an increase in quality.

Buffering in the receiving device can be used to smooth out jitter; the jitter buffer is familiar to those of us who have a CD player in the car. Small bumps are smoothed out and the sound quality remains acceptable, but hitting a pothole usually causes more disturbance than the buffer can overcome. The dejitter buffer allows for latency that is typically one or two times that of the expected jitter; it's not a cure for all situations. The time spent in the dejitter buffers is an important contributor to total system latency.
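The chapter doesn't prescribe a jitter formula; one widely used option, assumed here purely for illustration, is the smoothed inter-arrival jitter estimator from RTP (RFC 3550). A sketch with invented timestamps:

    def rtp_jitter(send_times, arrival_times):
        """Smoothed inter-arrival jitter, RFC 3550 style: J += (|D| - J) / 16,
        where D is the change in (arrival - send) spacing between packet pairs."""
        transits = [a - s for s, a in zip(send_times, arrival_times)]
        j = 0.0
        for prev, cur in zip(transits, transits[1:]):
            j += (abs(cur - prev) - j) / 16.0
        return j

    send = [i * 0.020 for i in range(6)]                   # sent every 20 ms
    arrive = [0.050, 0.071, 0.089, 0.112, 0.130, 0.151]    # invented arrivals
    print("jitter estimate: %.2f ms" % (rtp_jitter(send, arrive) * 1000))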

Server Response Time

Similar to the high-level technical metric transaction response time, this measures the individual response time characteristics of underlying server systems. A common example is the response time of the database back-end systems to specific query types. Although not directly seen by end users, this is an important part of overall system performance.

Measurement Granularity

The SLA must describe the granularity of the measurements. There are three related parts to that granularity: the scope, the sampling frequency, and the aggregation interval.

Measurement Scope

The first consideration is the scope of the measurement, and availability metrics make an excellent example. Many providers define the availability of their services based on an overall average of availability across all access points. This is an approach that gives the service providers the most flexibility and cushion for meeting negotiated levels.



Consider if your company had 100 sites and a target of 99 percent availability based on an overall average. Ninety-nine of your sites could have complete availability (100 percent) while one could have zero. Having a site with an extended period of complete unavailability isn't usually acceptable, but the service provider has complied with the negotiated terms of the SLA.

    If the availability level is specified on a per-site basis instead, the provider would have been

    found to be noncompliant and appropriate actions would follow in the form of penalties or

    lost customers. The same principle applies when measuring the availability of multiple

    sites, servers, or other units.
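The 100-site example is easy to check with a few lines; the per-site numbers below are the hypothetical ones from the scenario above:

    # 100 sites: 99 perfectly available, one completely down.
    site_availability = [1.00] * 99 + [0.00]
    target = 0.99

    overall = sum(site_availability) / len(site_availability)
    worst = min(site_availability)

    print("aggregate scope: %.2f%% -> compliant: %s" % (overall * 100, overall >= target))
    print("per-site scope: worst site %.2f%% -> compliant: %s" % (worst * 100, worst >= target))

The aggregate scope reports compliance; the per-site scope exposes the dead site.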

    Availability has an additional scope dimension, in addition to breadth: the depth to which

    the end user can penetrate to the desired service. To use a telephone analogy, is dial tone

    sufficient, or must the end user be able to reach specific numbers? In other words, which

    transactions must be accessible for the system to be regarded as available?

Scope issues for performance metrics are similar to those for the availability metric. There may be different sets of metrics for different groups of transactions, different times of day, and different groups of end users. Some transactions may be unusually important to particular groups of end users at particular times and completely unimportant at other times.

Regardless of the scope selected for a given individual metric, it's important to realize that executive management will want these various metrics aggregated into a single measure of overall performance. Derivation of that aggregated metric must be addressed during measurement definition.

Measurement Sampling Frequency

A shorter sampling interval catches problems sooner at the expense of consuming additional network, server, and application resources. Longer intervals between measurements reduce the impacts while possibly missing important changes, or at least not detecting them as quickly as when a shorter interval is used. Customers and the service providers will need to negotiate the measurement interval because it affects the cost of the service to some extent.

Statisticians recommend that sampling be random because it avoids accidental synchronization with underlying processes and the resulting distortion of the metric. Random sampling also helps discover brief patterns of poor performance; consecutive bad results are more meaningful than individual, spaced-out difficulties.

Confidence interval calculations can be used to help determine the sampling frequency. Although it is impossible to perform an infinite number of measurements, it is possible to calculate a range of values that we're reasonably sure would contain the true summary values (median, average, and so on) if we could have performed an infinite number of measurements. For example, you might want to be able to say the following:


"There's a 95 percent chance that the true median, if we could perform an infinite number of measurements, would be between five seconds and seven seconds." That is what the "95 Percent Confidence Interval" seeks to estimate, as shown in Figure 2-4. When you take more measurements, the confidence interval (two seconds in this example) usually becomes narrower. Therefore, confidence intervals can be used to help estimate how many measurements you'll need to obtain a given level of precision with statistical confidence.

Figure 2-4 Confidence Interval for Internet Data

[Figure 2-4 plots the percentage of measurements against response time in seconds, marking the actual median and the confidence interval around it.]

There are simple techniques for calculating confidence intervals for "normal distributions" of data (the familiar bell-shaped curve). Unfortunately, as discussed in the subsequent section on statistical analysis, Internet distributions are so different from the "normal distribution" that these techniques cannot be used. Instead, the statistical simulation technique known as "bootstrapping" can be used for these calculations on Internet distributions.
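A minimal sketch of the percentile-bootstrap idea applied to the median of a response-time sample; the data is invented, and a production implementation would need more samples and more care:

    import random

    def bootstrap_median_ci(samples, n_resamples=10_000, alpha=0.05):
        """Percentile-bootstrap confidence interval for the median."""
        medians = []
        for _ in range(n_resamples):
            resample = sorted(random.choices(samples, k=len(samples)))
            medians.append(resample[len(resample) // 2])
        medians.sort()
        return (medians[int(n_resamples * alpha / 2)],
                medians[int(n_resamples * (1 - alpha / 2))])

    # Invented, heavy-tailed response times in seconds:
    times = [0.8, 1.1, 1.3, 1.2, 0.9, 6.5, 1.4, 1.0, 12.0, 1.1, 1.3, 0.9]
    print("95%% CI for the median: %.2f s to %.2f s" % bootstrap_median_ci(times))

Because bootstrapping resamples the observed data rather than assuming a bell curve, it tolerates the heavy-tailed distributions typical of Internet measurements.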

In some cases, depending on the pattern of measurements, simple approximations for calculating confidence intervals may be used. Keynote Systems recommends the following approximation for calculating the confidence interval for availability metrics; a short script after this list illustrates the arithmetic. (This information is drawn from "Keynote Data Accuracy and Statistical Analysis for Performance Trending and Service Level Management," Keynote Systems Inc., San Mateo, California, 2002.) The formula is as follows:

• Omit data points that indicate measurement problems instead of availability problems.

• Calculate a preliminary estimate of the 95 percent confidence interval for average availability (avg) of a measurement sample with n valid data points:

Preliminary 95 Percent Confidence Interval = avg ± (1.96 * square root of [(avg * (1 – avg)) / (n – 1)])

For example, with a sample size n of 100, if 12 percent of the valid measurements are errors, the average availability is 88 percent. The confidence interval is calculated by the formula as (0.82, 0.94). This suggests that there's a 95 percent probability that the true average availability, if


we'd miraculously taken an infinite number of measurements, is between 82 and 94 percent. Notice that even with 100 measurements, this confidence interval leaves much room for uncertainty! To narrow that band, you need more valid measurements (a larger n, such as 1000 data points).

• Now you must decide if the preliminary calculations are reasonable. We suggest that

    the preliminary calculation should be accepted only if the upper limit is below 100

    percent and the lower limit is above 0 percent. (The example just used gives an upper

    limit > 100% for n = 29 or fewer, so this rule suggests that the calculation is reasonable

    if n = 30 or greater.)

Note that we're not saying that the confidence interval is too wide if the upper limit is above 100 percent (or if the average availability itself is 100 percent because no errors were detected); we're saying that you don't know what the confidence interval is. The reason is that the simplifying assumptions you used to construct the calculation break down if there are not enough data points.
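A sketch of the recipe above, including the reasonableness check; the function and variable names are mine, not Keynote's:

    import math

    def availability_ci(avg, n):
        """Preliminary 95% confidence interval for average availability, per the
        approximation above; None when the simplifying assumptions break down."""
        half_width = 1.96 * math.sqrt(avg * (1 - avg) / (n - 1))
        lower, upper = avg - half_width, avg + half_width
        if lower <= 0.0 or upper >= 1.0:
            return None            # too few data points to trust the formula
        return lower, upper

    print(availability_ci(0.88, 100))   # about (0.82, 0.94), as in the example
    print(availability_ci(0.88, 29))    # None: upper limit would reach 100%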

    For performance metrics, a simple solution to the problem of confidence intervals is to use

    geometric means and “geometric deviations” as measures of performance, which are

    described in the subsequent section in this chapter on statistical analysis.

Keynote Systems suggests, in the paper previously cited, that you can approximate the 95 Percent Confidence Interval for the geometric mean as follows, for a measurement sample with n valid (nonerror) data points:

Upper Limit = [geometric mean] * [(geometric deviation) ^ (1.96 / square root of [n – 1])]

Lower Limit = [geometric mean] / [(geometric deviation) ^ (1.96 / square root of [n – 1])]

This is similar to the use of the standard deviation with normally distributed data and can be used as a rough approximation of confidence intervals for performance measurements.

    Note that this ignores cyclic variations, such as by time of day or day of week; it is also

    somewhat distorted because even the logarithms of the original data are asymmetrically

    distributed, sometimes with a skew greater than 3. Nevertheless, the errors encountered

    using this recipe are much less than those that result from the usual use of mean and

    standard deviation.
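In code, under the assumption that "geometric deviation" means the exponential of the standard deviation of the logarithms (a common definition; the paper's exact definition may differ):

    import math

    def geometric_stats(samples):
        """Geometric mean and geometric deviation (exp of the log-domain std dev)."""
        logs = [math.log(x) for x in samples]
        mean_log = sum(logs) / len(logs)
        var_log = sum((v - mean_log) ** 2 for v in logs) / (len(logs) - 1)
        return math.exp(mean_log), math.exp(math.sqrt(var_log))

    def geometric_mean_ci(samples):
        """Approximate 95% CI for the geometric mean, per the quoted formula."""
        gm, gd = geometric_stats(samples)
        factor = gd ** (1.96 / math.sqrt(len(samples) - 1))
        return gm / factor, gm * factor

    times = [1.2, 0.9, 1.5, 3.8, 1.1, 0.8, 2.2, 1.3, 7.4, 1.0]   # invented seconds
    print("approximate 95%% CI: %.2f s to %.2f s" % geometric_mean_ci(times))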

Measurement Aggregation Interval

Selecting the time interval over which availability and performance are aggregated should

    also be considered. Generally, providers and customers agree upon time spans ranging from

    a week to a month. These are practical time intervals because they will tend to hide small

    fluctuations and irrelevant outlying measurements, but still enable reasonably prompt

    analysis and response. Longer intervals enable longer problem periods before the SLA is

    violated.


Table 2-2 shows this idea. If availability is measured on a small scale (hourly), high-availability requirements such as the 5-9's (99.999 percent) permit only 0.036 seconds of outage before there's a breach of the SLA. Providers must provision with adequate redundancy to meet this type of stringent requirement, and clearly they will pass on these costs to the customers that demand such high availability.

If a monthly (four-week) measurement interval is chosen, the 99.999 percent level indicates that an acceptable cumulative outage of 24 seconds per month is permitted while remaining in compliance. A 99.9 percent availability level permits up to 40 minutes of accumulated downtime for a service each month. Many providers are still trying to negotiate SLAs with availability levels ranging from 98 to 99.5 percent, or cumulative downtimes of 13.5 to 3.5 hours each month.

Note that these values assume 24 * 7 * 365 operations. For operations that do not require round-the-clock availability, or are not up during weekends, or have scheduled maintenance periods, the values will change. That said, they're pretty easy to compute.

The key is for service provider and service customer to set a common definition of the critical time interval. Because longer aggregation intervals permit longer periods during which metrics may be outside tolerance, many organizations must look more deeply at the aggregation definitions and at their tolerance for service interruption. A 98 percent availability level may be adequate and also economically acceptable, but how would the business function if the 13.5 allotted hours of downtime per month occurred in a single outage? Could the business tolerate an interruption of that length without serious damage? If not, then another metric that limits the interruption must be incorporated. This could be expressed in a statement such as the following: "Monthly availability at all sites shall be 99 percent or higher, and no service outage shall exceed three minutes." In other words, a little arithmetic to evaluate scenarios for compliance goes a long way.

Table 2-2  Measurement Aggregation Intervals for Availability

Availability     Allowable Outage for Specified Aggregation Interval
Percentage       Hour         Day          Week        4 Weeks

98%              1.2 min      28.8 min     3.36 hr     13.4 hr
98.5%            0.9 min      21.6 min     2.52 hr     10 hr
99%              0.6 min      14.4 min     1.68 hr     6.7 hr
99.5%            0.3 min      7.2 min      50.4 min    3.36 hr
99.9%            3.6 sec      1.44 min     10 min      40 min
99.99%           0.36 sec     8.64 sec     1 min       4 min
99.999%          0.036 sec    0.864 sec    6 sec       24 sec
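The arithmetic behind Table 2-2 is easy to script when you want to check a scenario of your own; a sketch:

    intervals = {"hour": 3600, "day": 86_400, "week": 604_800, "4 weeks": 2_419_200}

    def allowable_outage_s(availability_pct, interval_s):
        """Allowable outage = aggregation interval * (1 - availability)."""
        return interval_s * (1 - availability_pct / 100.0)

    for pct in (98.0, 99.9, 99.999):
        print("%g%%: " % pct + ", ".join(
            "%s %.3f s" % (name, allowable_outage_s(pct, secs))
            for name, secs in intervals.items()))
    # 99.999% over 4 weeks -> 24.192 s, which the table rounds to 24 sec.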


Measurement Validation and Statistical Analysis

The Internet and Web are extremely complex statistically. Invalid measurements and

    incorrect statistical analysis can easily lead to SLA violations and penalties, which may

    then fall apart when challenged by the service provider using a more appropriate analysis.

    Therefore, special care must be taken to discard invalid measurements and to use the

    appropriate statistical analysis methods.

Measurement Validation

Measurement problems, which are artifacts of the measurement process, are inevitable in any large-scale measurement system. The important issues are how quickly these errors are detected and tagged in the database, and the degree of engineering and business integrity that's applied to the process of error detection and tagging.

    Measurement problems can be caused by instrument malfunction, such as a response timer

    that fails, and by synthetic transaction script failure, which leads to false transaction error

reports. Measurement problems can also be caused by abnormal congestion on a measurement tool's access link

    to the backbone network and by many other factors. These failures are of the measurement

    system, not of the system being measured. They therefore are best excluded from any SLA

    compliance metrics.

    Detection and tagging of erroneous measurements may take time, sometimes up to a day or

    more, as the measurement team investigates the situation. Fortunately, SLA reports are not

    generally done in real time, and there’s therefore an opportunity to detect and remove such

    measurements.

    The same measurements will probably also be used for quick diagnosis, or triage, and that

    usage requires real-time reporting. There’s therefore no chance to remove erroneous

    measurements before use, and the quick diagnosis techniques must themselves handle

    possible problems in the measurement system. Good, fast-acting artifact reduction

    techniques (discussed in Chapter 5, “Event Management”) can eliminate a large number of

    misleading error messages and reduce the burden on the provider management system.

An emerging alternative is using a trusted, independent third party to provide the monitoring and SLA compliance verification. The advantage of having an independent party provide the information is that both service providers and their customers can view this party as objective when they have disputes about delivered service quality.

    Keynote Systems and Brix Networks are early movers into this market space. Keynote

    Systems provides a service, whereas Brix Networks provides an integrated set of software

    and hardware measurement devices to be installed and managed by the owner of the SLA.

    They both provide active, managed measurement devices placed at the service demarcation

    points between customers and

