
Cisco Press

Practical Service Level Management:

Delivering High-Quality Web-Based Services

John McConnell with Eric Siegel

800 East 96th Street, 3rd Floor
Indianapolis, IN 46240 USA


Practical Service Level Management: Delivering High-Quality Web-Based Services
John McConnell with Eric Siegel

    Copyright© 2004 Cisco Systems, Inc.

    Published by:

    Cisco Press

    800 East 96th Street

    Indianapolis, IN 46240 USA

    All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic

    or mechanical, including photocopying, recording, or by any information storage and retrieval system, without

    written permission from the publisher, except for the inclusion of brief quotations in a review.

    Printed in the United States of America 1 2 3 4 5 6 7 8 9 0

    First Printing January 2004

    Library of Congress Cataloging-in-Publication Number: 2001097399

ISBN: 1-58705-079-X

Warning and Disclaimer

This book is designed to provide information about service level management. Every effort has been made to make

    this book as complete and as accurate as possible, but no warranty or fitness is implied.

    The information is provided on an “as is” basis. The author, Cisco Press, and Cisco Systems, Inc. shall have neither

    liability nor responsibility to any person or entity with respect to any loss or damages arising from the information

    contained in this book or from the use of the discs or programs that may accompany it.

    The opinions expressed in this book belong to the author and are not necessarily those of Cisco Systems, Inc.

    Feedback Information

At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted
with care and precision, undergoing rigorous development that involves the unique expertise of members from the

    professional technical community.

    Readers’ feedback is a natural continuation of this process. If you have any comments regarding how we could

    improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through e-mail

    at [email protected]. Please make sure to include the book title and ISBN in your message.

    We greatly appreciate your assistance.

Trademark Acknowledgments

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized.

    Cisco Press or Cisco Systems, Inc. cannot attest to the accuracy of this information. Use of a term in this book

    should not be regarded as affecting the validity of any trademark or service mark.

Corporate and Government Sales

Cisco Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales.

    For more information, please contact:

    U.S. Corporate and Government Sales 1-800-382-3419 [email protected]

    For sales outside of the U.S., please contact:

    International Sales 1-317-581-3793 [email protected]


    Publisher John Wait

    Editor-in-Chief John Kane

    Executive Editor Brett Bartow

    Cisco Representative Anthony Wolfenden

Cisco Press Program Manager Sonia Torres Chavez
Manager, Marketing Communications, Cisco Systems Scott Miller

    Cisco Marketing Program Manager Edie Quiroz

    Production Manager Patrick Kanouse

    Acquisitions Editor Michelle Grandin

    Development Editor Jill Batistick

    Project Editor Marc Fowler

    Copy Editor Jill Batistick

    Technical Editors David M. Fishman

    John P. Morency

    Richard L. Ptak

Team Coordinator Tammi Barnett
Book Designer Gina Rexrode

    Cover Designer Louisa Adair

    Composition Mark Shirar

    Indexer Larry Sweazy


    In Loving Memory 

    This book was finished as a final tribute to my late husband, John McConnell.

    I hope these words keep his ideas alive in the industry a little longer.

    —Grace Morlock McConnell

    My perception of my very special son, John W. McConnell

    All that we can be, we must be.

    Find a star and never settle for less.

    John was born to be one of a kind.

    Making his way with a mind of his own,

    and making a difference and making it known.

    He had his dreams and hopes to pursue.

    by his mother, Jeanette McConnell


Dedication

John W. McConnell

     December 9, 1943–November 3, 2002

    This book is dedicated to my wife, Grace, whose support has been so helpful in carving out the time and quiet

    needed for this project. My friends and Grace have also provided a supportive environment and tolerated my

    frequent absences to work with clients. Returning home to a warm community has been really important to me.


Acknowledgments

Many people have been part of this process of turning some ideas and experience into a book. First, my thanks to

    the Cisco Press team, especially Michelle Grandin. The steady enthusiasm and willingness of all to help are deeply

appreciated.

In the same vein, the technical reviewers have been so helpful. I’ve had the pleasure of spending good time exchanging

    views with John Morency and Rich Ptak at many analyst conferences and other events; their suggestions for this

    manuscript were specific and helpful, and in some cases spurred some spirited discussions. Although I’ve never met

    David Fishman face to face, I’d be pleased to buy him a good meal someday as thanks for so many good suggestions

    and his attention to detail and integrity on getting it right.

    Another group I want to acknowledge are the clients I’ve worked with around the world. I’ve gotten to learn a lot

    about how technology is actually used and to work with people who want to push the envelope.

    Finally, my thanks to my friends and colleagues in the industry who constantly stimulate and challenge me. It’s

    been a tremendous blessing to be among so many creative and independent thinkers and doers that have shaped the

    networking industry.

    —John McConnell

It’s impossible to begin these acknowledgments without wishing that John were still alive. This is his book, not

    mine. He conceived it; he drafted it; he should have been writing this page. We all used to joke about how John

    “towered over the industry,” and it wasn’t just because of his height. In working from John’s drafts to complete the

    book, in talking to colleagues about his work, and in remembering the easy, jovial way he talked about examples of

    industry practices, I was constantly reminded of his stature and of the friendly way he had. I think I can say, with

    confidence, that everyone in the industry truly misses him; I certainly do.

    John’s wife wanted to see this book come to publication, and Cisco Press went far out of their way to make that happen.

    Jill Batistick and Michelle Grandin, the editors, were wonderfully friendly and helpful; they made the process of

    working through the chapters almost enjoyable. The technical reviewers, Rich Ptak, John Morency, and David Fishman

put a tremendous amount of work into the book. They didn’t just point out my errors; they suggested corrections
and entire new paragraphs that could improve the text. They were truly partners in bringing the book to publication.

I’d also like to thank Astrid Wasserman, of MediaLive International, Inc. (the organizers of Networld+Interop),

    who gave me a copy of John’s proposed two-day seminar on Service Level Management. Although it was never

    presented, the seminar slides gave me a lot of insight into his ideas.

    I have tried to stay close to John’s original thoughts and text, although I have occasionally succumbed to temptation

and added additional information. Minor additions occur in all chapters; major additions are in Chapter 2 (measurement
statistics), Chapter 6 (triage for quick assignment of problems to appropriate diagnostic teams), Chapter 8

    (transaction response time), and Chapter 11 (flash loads and abandonment). Most of the additions are topics that I

    had discussed with John at various conferences we attended together; I hope, and believe, that he would agree with

    them. In all cases when the author speaks directly to the reader, that author is John.

    —Eric Siegel  October 14, 2003


About the Authors

John McConnell was involved in networking for over 30 years. A member of the ARPANET working group, John contributed to early Internet architecture and protocol development. John consulted with clients in the U.S., Europe, Asia, and the Middle East, and he designed some of the first TCP/IP networks deployed in Europe and the Middle East.

John served as a consultant in the areas of systems and network management with a focus on Service Level Management (SLM), policy-based management solutions, and the emerging issues of management solutions for e-business.

John received a master’s in electrical engineering and computer science from the University of California, Berkeley.

Eric Siegel, Principal Internet Consultant with Keynote Systems, Inc., “the Internet performance authority,” first worked on the Internet in 1978. He wrote Designing Quality of Service Solutions for the Enterprise (John Wiley & Sons) and has taught Internet performance tuning, SLM, and quality of service (QoS) at major industry conferences such as Networld+Interop.

Before joining Keynote Systems, Eric was a Senior Network Analyst at NetReference, Inc., where he specialized in network architectural design for Fortune 100 companies, and he was a Senior Network Architect with Tandem Computers, where he was the technical leader and coordinator for all of Tandem’s data communications specialists worldwide. Eric also worked for Network Strategies, Inc. and for the MITRE Corporation, where he specialized in computer network design and performance evaluation. Eric received his B.S. and M.Eng. degrees in electrical engineering from Cornell University, where he was elected to the Electrical Engineering honor society.


About the Technical Reviewers

David M. Fishman is at Sun Microsystems, where he is responsible for availability measurement strategies in the office of Sun’s Chief Customer Advocate. Prior to that, he managed Sun’s strategic technology relationship with Oracle, driving technology alignment on High Availability (HA), Java technology, and performance. Before joining Sun in 1996, Fishman held a variety of product management and business development positions at Mercury Interactive Corporation. Previous work experience includes high-tech

    marketing and management in defense electronics, embedded systems, and office automation. David holds an

    MBA from the School of Management at Yale University. He lives in Sunnyvale, California, with his wife and

    two children.

    John P. Morency is a 29-year veteran of the networking and telecommunications industries and president of

    Momenta Research, Inc., a company that he founded in 2002. His industry experience includes network software

    development, technical support, IT operations, industry consulting, product marketing, and business development.

Because of his wide range of experience, John has a unique ability to effectively assess the business, technological,

    and operational impacts of new products and technologies. This is evidenced by the significant business case and

    Total Cost of Ownership (TCO) work that John has done on behalf of hundreds of Fortune 1000 clients over the

    past ten years, resulting in hundreds of millions of dollars in both top- and bottom-line benefits.

    John’s current research is focused on the business benefits attributable to the implementation of wireless LANs

    (Wi-Fi), network telephony, content networking, system and network security, Web services, disaster recovery,

    and IT process automation.

He is the author of over 400 publications on the operations and business impact of new IT technology. His speaking

    and publication credentials include Networld+Interop, Network World, Billing World, Broadband Year, LightWave,

    Telecommunications, and Telecom-Plus International, among many others.

Richard L. Ptak, founder of Ptak & Associates, Inc., has more than 25 years of experience providing consulting services

    on the use of IT resources to achieve competitive advantage. Ptak earned his B.S. and M.S. at Kansas State University.

He earned his MBA at the University of Chicago.


    Contents at a Glance

    Preface xxi

Part I Service Level Agreements and Introduction to Service Level Management 3

    Chapter 1 Introduction 5

    Chapter 2 Service Level Management 13

    Chapter 3 Service Management Architecture 41

    Part II Components of the Service Level Management Infrastructure 59

    Chapter 4 Instrumentation 61

    Chapter 5 Event Management 81

    Chapter 6 Real-Time Operations 101

    Chapter 7 Policy-Based Management 129

    Chapter 8 Managing the Application Infrastructure 145

    Chapter 9 Managing the Server Infrastructure 163

    Chapter 10 Managing the Transport Infrastructure 177

    Part III Long-term Service Level Management Functions 193

    Chapter 11 Load Testing 195

    Chapter 12 Modeling and Capacity Planning 209

    Part IV Planning and Implementation of Service Level Management 217

    Chapter 13 ROI: Making the Business Case 219

    Chapter 14 Implementing Service Level Management 231

    Chapter 15 Future Developments 245

    Index 259


    Contents

    Preface xxi

    Part I Service Level Agreements and Introduction to Service Level Management 3

    Chapter 1 Introduction 5

    E-business Services 5

    B2B 6

    B2C 7

    B2E 8

    Webbed Services and the Webbed Ecosystem 8

    Service Level Management 9

    Structure of the Book 10

    Summary 11

    Chapter 2 Service Level Management 13

    Overview of Service Level Management 14

    The Internal Role of the IT Group 14

    The External Role of the IT Group 15

    The Components of Service Level Management 15

    The Participants in a Service Level Agreement 15

    Metrics Within a Service Level Agreement 16

    Introduction to Technical Metrics 17

    High-Level Technical Metrics 18

    Workload 18

    Availability 19

    Transaction Failure Rate 20

    Transaction Response Time 20

    File Transfer Time 20

    Stream Quality 20

    Low-Level Technical Metrics 21

    Workload and Availability 21

    Packet Loss 22


    Latency 22

    Jitter 22

    Server Response Time 23

Measurement Granularity 23
Measurement Scope 23

    Measurement Sampling Frequency 24

    Measurement Aggregation Interval 26

    Measurement Validation and Statistical Analysis 28

    Measurement Validation 28

    Statistical Analysis 29

    Business Process Metrics 31

Problem Management Metrics 33
Real-Time Service Management Metrics 33

    Service Level Agreements 34

    Summary 37

    Chapter 3 Service Management Architecture 41

    Web Service Delivery Architecture 42

    Service Management Architecture: History and Design Factors 45

    The Evolution of the Service Management Environment 45

    Service Management Architectures for Heterogeneous Systems 46

    Architectural Design Drivers 48

    Demands for Changing, Expanding Services 49

    Multiple Service Providers and Partners 49

    Elastic Boundaries Among Teams and Providers 49

    Demands for Fast System Management 50

    Data Item Definition and Event Signaling 50

    Service Management Architecture: A General Example 52

    Instrumentation 52

    Instrumentation Management 53


    SLA Statistics and Reporting 54

    Real-Time Event Handling, Operations, and Policy 54

    Long-Term Operations 55

    Back-Office Operations 56

    Summary 57

    Part II Components of the Service Level Management Infrastructure 59

    Chapter 4 Instrumentation 61

    Differences Between Element and Service Instrumentation 61

    Information for Service Management Decisions 63

    Operational Technical Decisions 64

Operational Business Decisions 64
Decisions That Have Long-Term Effect 65

    Instrumentation Modes: Trip Wires and Time Slices 65

    Trip Wires 66

    Time Slices 67

    The Instrumentation System 68

    Starting with the Instrumentation Managers 69

    Collectors 70

Aggregators 72
Processing 72

    Ending with the Instrumentation Manager 73

    Instrumentation Design for Service Monitoring 73

    Demarcation Points 73

    Passive and Active Monitoring Techniques 75

    Passive Collection 75

    Active Collection 75

    Trade-Offs Between Passive and Active Collection 76

    Hybrid Systems 77


    Instrumentation Trends 77

    Adaptability 77

    Collaboration 78

    Tighter Linkage for Passive and Active Collection 78

    Summary 78

    Chapter 5 Event Management 81

    Event Management Overview 82

    Alert Triggers 82

    Reliable Alert Transport 83

    Alert Management 84

Basic Event Management Functions: Reducing the Noise and Boosting the Signal
Volume Reduction 86

    Roll-Up Method 86

    De-duplication 87

    Intelligent Monitoring 87

    Artifact Reduction 88

    Verification 88

    Filtering 89

    Correlation 90

    Business Impact: Integrating Technology and Services 91

    Top-Down and Bottom-Up Approaches 92

    Modeling a Service 92

    Care and Feeding Considerations 93

    Prioritization 94

    Activation 95

    Coordination 96

    A Market-Leading Event Manager: Micromuse 97

    Netcool Product Suite 97

    Event Management 98

    Summary 99


    Chapter 6 Real-Time Operations 101

    Reactive Management 103

Triage 104
Root-Cause Analysis 107

    Speed Versus Accuracy 107

    Case Study of Root-Cause Analysis 108

    Complicating Factors 110

    Brownouts 110

    Virtualized Resources 110

    The Value of Good Enough 111

    Proactive Management 112

    The Benefits of Lead Time 112

    Baseline Monitoring 112

    The Value of Predicting Behavior 113

    Automated Responses 113

    Languages Used with Automated Responses 113

    A Case Study 114

    Step 1: Assessing Local Impact 114

    Step 2: Adjusting Thresholds 115

    Step 3: Assessing Headroom 115

    Step 4: Taking Action 115

    Step 5: Reporting 116

    Building Automated Responses 116

    Picking Candidates for Automation 116

    Examples of Commercial Operations Managers 116

    Tavve Software’s EventWatch 117

    ProactiveNet 117

    Netuitive 120


    Handling DDoS Attacks 121

    Traditional Defense Against DDoS Situations 122

    Defense Through Redundancy and Buffering 124

    Automated Defenses 124

    Organizational Policy for DDoS Defense 126

    Summary 127

    Chapter 7 Policy-Based Management 129

    Policy-Based Management 129

    The Need for Policies 130

    Management Policies for Elements 131

Service-Centric Policies 132
A Policy Architecture 133

    Policy Management Tools 133

    Repository 134

    Policy Distribution 134

    The Pull (Component-Centric) Model 134

    The Push (Repository-Centric) Model 135

    Hybrid Distribution 135

    Enforcers 136

    Policy Design 136

    Policy Hierarchy 137

    Policy Attributes 137

    Policy Auditing 138

    Policy Closure Criteria 138

    Policy Testing 138

    Policy Product Examples 139

    Cisco QoS Policy Manager 139

    Orchestream Service Activator 141

    Summary 142


    Chapter 8 Managing the Application Infrastructure 145

    Interaction of Operations and Application Development Teams 146

The Effect of Organizational Structures 146
The Need to Understand the Operational Environment 146

    Time Lines Are Shorter 147

    Application-Level Metrics 147

    Workload 149

    Customer Behavior Measurement 149

    Business Measurements 150

    Service Quality Measurement 151

Transaction Response Time: An Example of Dependence on Lower-Level Services 152

    Serialization Delay 153

    Queuing Delay 154

    Propagation Delay 154

    Processing Delay 156

    The Need for Communications Among Design and Operations Groups 156

    Instrumenting Applications 157

    Instrumenting Web Servers 157

    Instrumenting Other Server Components 159

    End-User Measurements 160

    Summary 161

    Chapter 9 Managing the Server Infrastructure 163

    Architecture of the Server Infrastructure 163

    Load Distribution and Front-End Processing 164

    Local Load Distribution 166

Geographic Load Distribution 168
Caching 168

    Content Distribution 169


    Instrumentation of the Server Infrastructure 171

    Load Distribution Instrumentation 172

Cache Instrumentation 173
Content Distribution Instrumentation 173

    Summary 174

    Chapter 10 Managing the Transport Infrastructure 177

    Technical Quality Metrics for Transport Services 178

    Workload and Bandwidth 178

    Availability and Packet Loss 179

    One-Way Latency 180

    Round-Trip Latency 181

    Jitter 181

    QoS Technologies 181

    Tag-Based QoS 182

    IEEE 802 LAN QoS 182

    IP TOS 183

    IP DiffServ 183

    MPLS 183

    RSVP 184

    Traffic-Shaping QoS 185

    Rate Control 186

    Queuing 187

    Over-provisioning and Isolated Networks 188

    Managing Data Flows Among Organizations 188

    Levels of Control 189

    Demarcation Points 189

    Diagnosis and Recovery 189

    Summary 191


    Part III Long-term Service Level Management Functions 193

    Chapter 11 Load Testing 195

The Performance Envelope 196
Load Testing Benchmarks 199

    Load Test Beds and Load Generators 200

    Building Transaction Load-Test Scripts and Profiles 203

    Using the Test Results 205

    Summary 206

    Chapter 12 Modeling and Capacity Planning 209

    Advantages of Simulation Modeling 209

    Complexity of Simulation Modeling 211

    Simulation Model Examples 211

    Model Construction 211

    Model Validation 213

    Reporting 214

    Capacity Planning 214

    Summary 215

Part IV Planning and Implementation of Service Level Management 217
Chapter 13 ROI: Making the Business Case 219

    Impact of ROI on the Organization 220

    A Basic ROI Model 220

    The ROI Mission Statement 222

    Project Costs 223

    Project Benefits 223

    Availability Benefits 224

Performance Benefits 225
Staffing Benefits 225

    Infrastructure Benefits 225

    Deployment Benefits 225


    Soft Benefits 226

    ROI Case Study 226

    Summary 228

    Chapter 14 Implementing Service Level Management 231

    Phased Implementation of SLM 231

    Choosing the Initial Project 231

    Incremental Aggregation 232

    An SLM Project Implementation Plan 233

    Census and Documentation of the Existing System 233

    Specification of Performance Metrics 234

    Instrumentation Choices and Locations 235

    Passive Measurements 236

    Active Measurements 236

    Baseline of Existing System Performance 237

    Investigation of System Performance Sensitivities and System Tuning 237

    Construction of SLAs 239

    Roles and Responsibilities 240

    Reporting Mechanisms and Scheduled Reviews 240

    Dispute Resolution 241

    Summary 242

    Chapter 15 Future Developments 245

    The Demands of Speed and Dynamism 245

    Evolution of Management Systems Integration 248

    Superficial Integration 248

    Data Integration 248

    Event Integration 249

    Process Integration 250

    Architectural Trends for Web Management Systems 250

    Loosely Coupled Service-Management Systems Architecture 251

    Process Managers 251

    Clustering and the Webbed Architecture 252


    Integrating the Components with Signaling and Messaging 252

    Loosely Coupled Service-Management Processes 253

Business Goals for Service Performance 254
Finding the Best Tools 255

    Summary 256

    Index 259


Preface

Some years ago I received a true pearl of wisdom from an industry colleague. “In order to truly understand your profession,” he advised, “you must make the effort to learn other disciplines that are completely different from the one that you espouse.”

That colleague was John McConnell, a man who truly understood this advice by walking the talk over the course of his life. Born into a military family, John developed a keen understanding of the importance of the global ecosystem at a very young age through his childhood experiences in both Europe and the Far East. Despite being a shy, scholarly individual throughout primary and secondary school, John also demonstrated the value of hard work and dedication by making the varsity rowing team at U.C. Berkeley.

The strong work ethic that John nurtured at Berkeley served him well after he received his master’s in computer science in 1968. What differentiated John from many of his fellow graduates, however, was the application of his craft to non-IT disciplines after graduation. Some of his first initiatives included the application of computer technology to measure the rate of solar intensity upon the earth and the development of a programming language that was designed to test the content and substance of moon samples brought back to earth by the Apollo astronauts. In addition, John developed a number of network control programs for the ARPANET (the predecessor to today’s Internet) in the mid-1970s, when the state of the commercial data networking industry was in its true infancy.

John also spent a number of years in professional capacities that had very little to do with information technology. After graduate school, John became an accomplished massage therapist, hypnotist, and practitioner in the art of Rolfing, a technique for the detection, treatment, and removal of bodily stress and pain. In 1983, using his Rolfing technique, John was selected to work with the members of the U.S. Olympic bicycling team, and he applied this technique to aid the team in preparing for the 1984 Olympic games. Recently, when not consulting, John was training to become an instructor in the Ridhwan Foundation, an institution whose focus is the rediscovery and integration of the true self into one’s own professional and personal life. Over the years, he had a myriad of personal interests, including soaring, mountain climbing, bird watching, backpacking, rowing, and blues festivals. One of his most recent and satisfying accomplishments was the design, building, and completion of a second home in southern Costa Rica that effectively enabled both him and his wife Grace to really get away from it all.

First and foremost, John’s professional focus in the IT industry was the advancement of technologies and products that improved the efficiency and the effectiveness of IT management. Given his whole-life background, John was especially dedicated to reducing the operational and business “pain points” associated with IT implementation and management. This focus is reflected in John’s prior works Internetworking Computer Systems and Managing Client/Server Environments, as well as in Practical Service Level Management: Delivering High-Quality Web-Based Services. John’s numerous publications, conferences, and televised briefings reflect a focused dedication to the removal of technological barriers to the optimal effectiveness of IT organizations worldwide. His life experiences as a true Renaissance man uniquely enabled him to both understand and drive the level of change needed to not only improve the state of the art, but also the quality of life. John was indeed the “gold standard” of knowledge, professionalism, and personal integrity that made the pursuit of these goals not only a logical possibility, but, for many of us, a practical reality. The loss of John will be keenly felt for some time, but the goals and values that he aspired to and embraced will inspire and guide many of us for years to come.

—John Morency, President, Momenta Research

    May 2003


    P A R T I

Service Level Agreements and Introduction to Service Level Management

    Chapter 1 Introduction

    Chapter 2 Service Level Management

    Chapter 3 Service Management Architecture


    C H A P T E R 1

    Introduction

The World Wide Web—the Web—is the catalyst for the changes in our communications, work styles, business processes, and ways of seeking entertainment and information. The Internet is just the transport infrastructure for the web-based services that drive so much innovation. Note, however, that the Internet generally gets all the credit. As Thomas Friedman writes in The Lexus and the Olive Tree:

The Internet is going to be like a huge vise that takes the globalization system that I have described—and keeps tightening and tightening that system around everyone, in ways that will only make the world smaller and smaller and faster and faster with each passing day.

This is an accurate description of the environment that most of us deal with directly on a daily basis. The Internet is a tremendous business engine, and, as it transforms the ways we do business, it is being transformed in turn by the ways we use it. We must learn how to manage the growing array of online business services or risk being marginalized by a fast-moving and more dynamic business environment.

    In this introductory chapter, I discuss the following:

• The types of e-business services
• A definition of webbed services and the webbed ecosystem
• Service Level Management (SLM)
• The structure of this book

E-business Services

E-business is a generic term defining business activities that are carried out totally, or in part, through electronic communications between distributed organizations and people. These activities are characterized by speed, flexibility, and constant change.

The Internet has become the vehicle for transforming business processes. The reasons for its ascendancy include the following:

• The Internet protocols are the only workable set of technologies that really provide a high degree of interoperability among different systems.
• The wide geographic reach of the Internet increases the size of any potential market.


• Internet economies make it feasible to distribute information and transact business globally.
• The introduction of the browser and its supporting technologies makes the Internet much easier to use, thereby increasing the potential market.

    There are many ways of segmenting and describing the large variety of services available

    through the Internet and the Web. A simple classification that covers most services is based

    on the relationship of the business to customers, business partners, and employees. For

    example, the process shown in Figure 1-1 describes a simple situation involving all three

    types of relationships: business to business (B2B), business to consumer (B2C), and

    business to employee (B2E). These segments are an easy way of organizing our thinking

    about services, although it’s important to remember that business processes in the real

    world will have many variations and overlaps.

    Figure 1-1  Business Relationships

    The following sections discuss each relationship type in turn.

B2B

B2B services are a broad category that incorporates transactions among different businesses and government agencies. Many current B2B services, such as supply chain

    management and credit authorization, use the Internet to drive down the costs and delays

    associated with current processes and to boost their productivity.

    B2B is rapidly broadening to include more than supply chain management and credit

    authorization. Functions such as shipping, billing, and Customer Relationship Management


(CRM) are now often external to the business; other businesses provide and host these specialized services as a utility. For example, entry of a customer’s order can result in more than the functions of pricing, authorizing, assembling, and shipping; a modern system might use B2B links to provide the customer with a shipment tracking number from the shipping company, and it might interact with an external CRM service to reflect the current purchases and other factors of the customer’s profile. Meanwhile, the sales person might be indirectly using B2B links to handle her commissions and personnel data through outsourced employee management services, and engineering staff might use B2B links for collaborative design.

Thanks to the Web, B2B is rapidly transforming into an even more dynamic set of services from which an enterprise can select in real time. No one wants to be dependent on a single supplier or customer; everyone must deal with competitive pressures exerted from both sides. Services such as credit authorization and shipping are examples of those that can be selected in real time based on their performance or costs. Other services and supplies may be selected from web-based exchanges or e-markets.

B2B processes can be complex. They must follow the business requirements for tracking orders, negotiating contracts, arranging payments, and reporting outcomes that govern these processes when they take place without the automation of electronic communications.

Note that new benefits become available, although at the cost of additional complexity, when B2B replaces older systems. For example, organizations can change their business processes to increase their business effectiveness by obtaining real-time information on order volumes, revenue rates, cancelled orders, and other factors. This additional information, while adding to complexity, provides value in addition to the acceleration of the processes themselves by identifying further efficiencies.

    Continuous monitoring of B2B suppliers, partners, and web infrastructure

    (communications, hosting, and exchanges) is necessary to determine whether they are

    meeting their service quality commitments.

As in conventional commerce, managing across organizations adds complexity. All the links in the B2B services chains are known, but these links are controlled by many different organizations, are complex, and may change rapidly as services are selected in real time. Managing B2B services therefore requires cooperation with the management teams of the other participants and, possibly, with third-party measurement organizations to assure true end-to-end service quality.

B2C

B2C garnered most of the early attention from the trade press and analysts as traditional businesses took advantage of the Internet’s wide geographic reach and low costs for reaching customers. Some businesses (eBay and Amazon.com, for example) were founded to exploit this new market opportunity.


B2C sites continually add new services of their own while offering links to related businesses and services in an attempt to offer one-stop shopping—and selling—to their customers. This is a highly competitive segment with little customer loyalty. The wide selection of competing sites draws customers away whenever any one site has a service disruption.

    B2C environments are characterized by a lack of visibility and management control of the

    customer-access infrastructure, which is the set of networks, caches, and other systems that

    consumers use to connect to the B2C site. Customers usually don’t want measurement tools

    embedded in their systems, and the access infrastructure providers also resist making their

    internal performance readily visible. There is also limited visibility into the performance of

    partner sites (advertisers and other third parties), which are important parts of the

    customer’s perception of total site performance. The span of control and management

    available to B2C sites is therefore usually limited to monitoring and managing their internal

    operations (inside the firewall) as well as measurement of Internet delays and performance

    as seen from various points on the edge of the Internet.

B2E

B2E services are also known as the intranet. These services help improve the internal effectiveness of an organization and help it keep pace with its customers and business partners. Many B2E services enable employees to query their benefits, schedule vacations, fill out expense reports, and conduct a set of activities that formerly required a large staff to coordinate.

B2C and B2E services use the web browser as the access device. Transactions are initiated from the browser to deliver information and activate a range of business processes. However, B2E environments are the only ones that enable administrators to have control of both ends—the servers as well as the desktops, cell phones, and personal communicators used to access them.

Webbed Services and the Webbed Ecosystem

In this book, I use the term webbed services to describe the set of business services that are based on a component approach to systems design. This design is driven from the Web and its associated technologies, regardless of the specific technologies used. Because webbed services are constructed from a set of interconnected software components and services that can be reused in multiple places, they can usually avoid some of the expense, time, and effort associated with building and modifying monolithic applications.

Webbed services is a very inclusive term; it’s increasingly difficult to find services that are not somehow tied into the Web. As a case in point, I was recently speaking about webbed services at a large retail organization, and someone in the group stated that their main application did not fit into the webbed category because it was a stand-alone Oracle


Financials application. However, further discussion soon revealed that their international operations used real-time currency conversion decisions. The real-time exchange rates in the Oracle Financials application were, in fact, accessed through the Web.

Indeed, webbed services are now taking on many of the characteristics of an ecosystem, which is a group of independent but interrelated elements comprising a unified whole. A smooth business process depends on each element carrying out its tasks accurately and quickly, with consideration for maintaining balances among all the elements. In a well-balanced webbed ecosystem, all elements bear appropriate shares of the load. None is overwhelmed, none is underutilized. Balance is concurrently maintained between service quality and service cost. The ecosystem metaphor is gaining momentum as online processes evolve to dynamically select their elements (underlying services) based on their current behavior and performance.

The webbed ecosystem perspective also holds within any subgroup of systems. For instance, hosting facilities use a range of technologies, such as prioritizing devices, bandwidth managers, global load balancers, and caches, to deliver online business services. These systems also need balanced management; adding bandwidth when servers are congested is a wasteful investment.

Service Level Management

Service quality is extremely important, given the accelerating number of critical business processes going online. Customers and business partners go elsewhere if the services they want are not available or are performing sluggishly. Unfortunately, good service quality is a dynamic target, and the demands continue to tighten. Competitors will match or exceed your service quality levels and create pressure toward matching or bettering theirs.

Service Level Management (SLM) is the process of managing network and computing resources to ensure the delivery of acceptable service quality at an acceptable price in an acceptable time frame. It focuses on the behavior of the services rather than on tracking the status of every router, switch, and server in the environment. Through SLM, service quality is guaranteed and priced for different levels of service.

SLM is a competitive weapon in the marketplace, offering the guarantees needed to transition critical business activities online. Poorly managed services have harmed many businesses when their web sites crashed, their applications slowed to a crawl, or their Web content was not attractively presented or was too difficult to navigate. Good service quality helps retain customers and differentiate your organization from those that have not yet mastered the art of managing service quality.

Effective SLM is also an economic weapon. Managing resources more effectively reduces costs, creates more revenue opportunities, and leverages technology investments.

Finally, SLM is a means to build the solid business relationships that make online business initiatives successful.


The second group of chapters (8–10) in this part steps through the major systems used for web service delivery. It looks at the ways they can be used to improve service delivery and also discusses their specific instrumentation needs, using the system management infrastructures described in the first part of this section. Chapter 8 investigates the instrumentation and

    management of applications and of end-user access devices, such as

    browsers. Chapter 9 looks at web server systems, including servers, load

    balancers, and content distribution networks. Finally, Chapter 10 discusses

    instrumentation and management of the transport infrastructure, including

    QoS technology and traffic shaping to achieve policy objectives.

    • Part III: Long-term Service Level Management Functions (Chapters 11–12)—This part covers load testing, modeling, and capacity planning. No management

    system can provide necessary quality if the web serving system, as a whole, has

    insufficient capacity.

• Part IV: Planning and Implementation of Service Level Management (Chapters 13–15)—Calculation of Return on Investment (ROI) for SLM is critical to the justification and design of an implementation; it’s covered in Chapter 13. Chapter 14 provides guidance for using the information in this book to design an SLM system for your particular situation, and the part ends with discussion in Chapter 15 of possible

    future developments in SLM.

Summary

The Internet and the Web are transforming business processes for interaction among businesses, government, suppliers, customers, and employees. As more and more critical business processes go online, the service quality of those processes becomes more important to the success of business as a whole.

SLAs are the formal, negotiated contracts between service providers and service users that define the services to be provided, their quality goals, and the actions to be taken if the SLA terms are violated.

SLM is the process of managing network and computing resources to ensure the delivery of acceptable service quality, usually as defined in an SLA, at an acceptable price in an acceptable time frame. It is a competitive weapon in the marketplace because it can improve customer relationships, create more revenue opportunities, and reduce costs.


    C H A P T E R 2

    Service Level Management

Service Level Management (SLM) is key to delivering the services that are necessary to remain competitive in the Internet environment. Service quality must remain stable and acceptable even when there are substantial changes in service volumes, customer activities, and the supporting infrastructures.

Superior service quality also becomes a competitive differentiator because it reduces customer churn and brings in new customers who are willing to pay the premiums for guaranteed service quality. Customer churn is an insidious problem for almost every service provider.

The competitive market increases customer acquisition costs because continuous marketing and promotions are necessary just to replace the eroding customer base. High customer acquisition costs must be dealt with either by raising prices (a difficult move in a highly competitive market) or by taking longer to amortize the acquisition costs before profitability for each customer is achieved. Improving customer retention therefore dramatically increases profits.

This chapter covers the basics of SLM and lays part of the groundwork for the rest of the book:

• An overview of SLM
• An introduction to technical metrics
• Detailed discussions of measurement granularity and measurement validation
• Business process metrics
• Service Level Agreements (SLAs)

Note that the chapter ends with a summary discussion in the context of building an SLA. Use of metrics in combination with the SLA’s service level objectives to control performance is discussed in Chapter 6, “Real-Time Operations,” and Chapter 7, “Policy-Based Management.”


Overview of Service Level Management

Often, one group’s service provider is another group’s customer. It is critical to understand

    that service delivery is often, in fact, a chain of such relationships. As Figure 2-1 shows,

    some entities, such as an IT group, can play different roles in the service delivery process.

    As shown in the figure, a hosting company can be a customer of multiple service providers

    while in turn acting as a service provider. An IT group may be a customer of several service

    providers offering basic Internet connectivity, application hosting, content delivery, or other

    services. Customers may use multiple providers of the same service to increase their

    availability and to protect against dependence on a single provider. Customers will also use

    specialized service providers to fulfill particular needs.

    Figure 2-1  Roles of Customers and Service Providers

The Internal Role of the IT Group

An IT group serves the entire organization by aggregating demands of individual business units and using them as leverage to reduce overall costs from service providers.

Today, such IT groups are making the necessary adjustments as managed services become a mandatory requirement. IT managers are constantly reassessing the business and strategic


trade-offs of developing internal competence and expertise as opposed to outsourcing more of the traditional IT work to external providers. The goal is to save money, protect strategic assets, and maintain the necessary flexibility to meet new challenges.

The External Role of the IT Group

IT groups are increasingly being required to provide specific levels of service, and they are also more frequently involved in helping business units negotiate agreements with external service providers. Business units often choose to deal directly with service providers when they have specialized needs or when they determine that the IT group cannot offer services

    with competitive costs and benefits.

    IT groups must therefore manage their own service levels as well as those of service

    providers, and they must track compliance with negotiated SLAs.

The Components of Service Level Management

The process of monitoring service quality, detecting potential or actual problems, taking the actions necessary to maintain or restore the necessary service quality, and reporting on achieved service levels is the core of SLM. Effective SLM solutions must deliver acceptable

    service quality at an acceptable price.

Acceptable quality from a customer perspective means an ability to use the managed services effectively. For example, acceptable quality may mean that an external customer or business partner can transact the necessary business that will generate revenues, strengthen business partnerships, increase the Internet brand, or improve internal productivity. Specific ongoing measurements are carried out to determine acceptable service quality levels, and noncompliance is noted and reported.

Acceptable costs must also be considered, because over-provisioning and throwing money at service quality problems is not an acceptable strategy for either service providers or their customers (and in spite of the cost, it often doesn’t solve the problem). Service management policies are applied to critical resources so that they are allocated to the appropriate services; inappropriate activities are curtailed. Service providers that manage resources effectively deliver superior service quality at competitive prices. Their customers, in turn, must also increase their online business effectiveness and strengthen their bottom-line results.

The Participants in a Service Level Agreement

The SLA is the basic tool used to define acceptable quality and any relationships between quality and price. Because the SLA has value for both providers and customers, it’s a wonder that it has taken so long for it to become important. In practice, many organizations


and providers find the process of negotiating an acceptable SLA to be a difficult task. As with many technical offerings, customers often experience difficulty in expressing what they need in technical terms that are both measurable and manageable; therefore, they have difficulty specifying their needs precisely and verifying that they are getting what they pay for.

Service providers, on the other hand, appreciate clearly specified requirements and want to

    take advantage of the opportunity to offer profitable premium services, but they also want

    to minimize the risks of public failure and avoid increasingly stringent financial penalties

    for noncompliance with the terms of the SLA.

Metrics Within a Service Level Agreement

Measurement is a key part of an SLA, and most SLAs have two different classes of metrics, as shown in Figure 2-2: technical metrics and business process metrics. Technical metrics include both high-level technical metrics, such as the success rate of an entire transaction as seen by an end user, and low-level technical metrics, such as the error rate of an underlying communications network. Business process metrics include

    measures of provider business practices, such as the speed with which they respond to

    problem reports.

    Figure 2-2 Contents of a Service Level Agreement 

[Figure 2-2 shows the contents of a service level agreement: technical metrics, divided into high-level metrics (workload, availability, transaction failure, transaction response time, and so on) and low-level metrics (workload, availability, packet loss, one-way packet delay, jitter, server response time, and so on); business process metrics (trouble response time, trouble relief time, provisioning time, and so on); metric specification and handling (granularity, validation, statistical analysis); and penalties and rewards.]


Service providers may package the metrics into specific profiles that suit common customer requirements while simplifying the process of selecting and specifying the parameters.

    Service profiles help the service provider by simplifying their planning and resource

    allocation operations.

Introduction to Technical Metrics

Technical metrics are a core component of SLAs. They are used to quantify and to assess the key technical attributes of delivered services.

Examples of technical metrics are shown in Table 2-1. They are separated into the two basic groups: high-level metrics, which deal with attributes that are highly relevant to end users and are easily understood by them, and low-level metrics, which deal with attributes of the underlying technologies. Note that you should be very specific when defining these terms in an agreement. Although many of these terms are in common use, their definitions vary.

Table 2-1  Examples of Technical Metrics

Metric                      Description

High-Level Technical Metrics
Workload                    Applied workload in terms understandable by the end user
                            (such as end-user transactions/second)
Availability                Percentage of scheduled uptime that the system is perceived
                            as available and functioning by the end user
Transaction Failure Rate    Percentage of initiated end-user transactions that fail to complete
Transaction Response Time   Measure of response-time characteristics of a user transaction
File Transfer Time          Measure of total transfer-time characteristics of a file transfer
Stream Quality              Measure of the user-perceived quality of a multimedia stream

Low-Level Technical Metrics
Workload                    Applied workload in terms relevant to underlying technologies
                            (such as database transactions/second)
Availability                Percentage of scheduled uptime that the subsystem is available
                            and functioning
Packet Loss                 Measure of one-way packet loss characteristics between specified points
Latency                     Measure of transit time characteristics between specified points
Jitter                      Measure of the transit time variability characteristics between
                            specified points
Server Response Time        Measure of response-time characteristics of particular server subsystems
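To make these definitions concrete, the following is a minimal sketch (my illustration, not from the book; the record format and function names are hypothetical) of how two of the high-level metrics might be computed from a log of end-user transactions:

    from dataclasses import dataclass
    from statistics import median

    @dataclass
    class TransactionRecord:
        name: str             # transaction type, e.g. "checkout" (hypothetical)
        response_time: float  # seconds from initiation to completion
        completed: bool       # False if the transaction failed or was abandoned

    def transaction_failure_rate(records):
        # Transaction Failure Rate: percentage of initiated end-user
        # transactions that fail to complete.
        failed = sum(1 for r in records if not r.completed)
        return 100.0 * failed / len(records)

    def response_time_summary(records):
        # Transaction Response Time: summarize completed transactions only;
        # the median resists distortion by a few very slow outliers.
        times = sorted(r.response_time for r in records if r.completed)
        return {"median_s": median(times), "worst_s": times[-1]}

The point of the sketch is simply that each metric in the table reduces to a well-defined computation over measured data; the validation and statistical analysis of such measurements are discussed later in this chapter.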


Workload is an important characteristic of both high- and low-level metrics. It’s not a measure of delivered quality; instead, it’s a critical measure of the load applied to the system. For example, consider the workload of serving web pages. A text-only page might comprise only 10 K bytes, whereas a graphics page could comprise a few megabytes. If the requirement is to deliver a page in six seconds to the end user, massively different bandwidth and capacity will be necessary. Indeed, content may need to be altered for low-speed connections to meet the six-second download time.

NOTE In many situations, certain technical metrics aren't specified in the SLA. Instead, the supplier is asked to use best effort, which represents the classic Internet delivery strategy of "get it there somehow without concern for service quality." Today, best effort represents the commodity level for services. There are no special treatments for best-effort services. The only need is that there are sufficient resources to prevent best-effort services from starving out, which means having the connection time out because of long periods of inactivity.

    Discussions of all of the examples in Table 2-1 follow, to illustrate the basic concepts of

    technical metrics. Additional descriptions of these metrics, and other technical metrics,

    appear in Chapters 4 and 8–10.

High-Level Technical Metrics

These metrics deal with workload and performance as seen and understood by the end user.

Workload

The workload high-level technical metric is the measure of applied load in end-user terms. It's unreasonable to expect a service provider to agree to service levels for an unspecified amount of workload; it's also unreasonable to expect that an end user will willingly substitute specification of obscurely-related low-level workload metrics for understandable high-level metrics. SLAs should therefore begin by specifying the high-level workload metrics, and service providers can then work with the customer's technical staff to derive low-level workload metrics from them.

For transaction systems, the workload metric is usually specified in terms of the end-user transaction mix and volumes, which typically vary according to time of day and other business cycles. For existing systems, these statistics can be obtained from logs; for new systems or situations (such as a proposed major advertising campaign designed to drive prospective customers to a web site), the organization's marketing group or their consultants should work to produce the most accurate, specific estimates possible. These workload estimates for new systems should be used for load testing as well as for SLAs.


Transaction workload metrics must include end-user tolerance for transaction response time delays. If response time delays are too long, external customers will abandon the transaction. In legacy systems where external customers did not interact directly with the server systems, abandonment was not a factor in workload testing. Call-center operators handled any delays by talking to the customers, shielding them from the problem, if necessary. On the Web, customers see the delays without any shielding, and they may decide at any point to abandon the transaction, with immediate impact on the server system's workload.

Another effect of the direct connection between customers and web-serving systems is that there's no buffer between those customers and the servers. In a call center, the workload is buffered by external queues. Incoming calls go through an automatic call distribution system; callers are placed on hold until an operator is available. In an order-entry center, the workload is buffered by the stack of documents on the entry clerk's desk. In contrast, the web workload has no external buffer; massive spikes in workload hit the servers instantly. These spikes in workload are called flash load, and they must be specified in the workload metric and considered during load testing. Load specification for the Web should therefore be in terms of arrival rate, not concurrent users, as was the case for call centers and order-entry centers.

File-serving, web-page, and streaming-media workload metrics are similar to transaction metrics, but simpler. They're usually specified in terms of the size and number of files that must be transferred in a given time interval. (For web pages, the types of the files are usually specified. Dynamically-generated files are clearly more resource-intensive than stored static files.) The serving system must have the bandwidth to serve the files, and it must also be able to handle the anticipated number of concurrent connections. There's a relationship between these two variables; given a certain arrival rate, higher end-to-end bandwidth results in fewer concurrent users.
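One way to see that relationship is Little's Law: the average number of concurrent connections equals the arrival rate multiplied by the average time each transfer holds a connection. A minimal sketch in Python; the arrival rate, file size, and bandwidth figures are invented for illustration, not drawn from this chapter:

    arrival_rate = 50.0          # file requests per second (assumed)
    file_size_bits = 2_000_000   # average file size of 2 megabits (assumed)

    def concurrent_connections(bandwidth_bps):
        """Little's Law: concurrency = arrival rate * average transfer time."""
        transfer_time = file_size_bits / bandwidth_bps   # seconds per transfer
        return arrival_rate * transfer_time

    for bw in (500_000, 1_000_000, 4_000_000):           # end-to-end bits/second
        print("%4.1f Mbps -> %6.1f concurrent connections"
              % (bw / 1e6, concurrent_connections(bw)))

At a fixed arrival rate, quadrupling per-user bandwidth cuts the transfer time, and therefore the concurrency the servers must sustain, by the same factor.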

Availability

Availability is the percentage of time that the system is perceived as available and functioning by the end user. It is a function of both the Mean Time Between Failures (MTBF) and the Mean Time To Repair (MTTR). Scheduled downtime might, in some organizations, be excluded from these calculations. In those organizations, a system can be declared 100 percent available even though it's down for an hour every night for system maintenance.
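The standard relationship implied by these definitions is availability = MTBF / (MTBF + MTTR). A quick sketch, with assumed failure and repair times:

    def availability(mtbf_hours, mttr_hours):
        """Steady-state availability from mean time between failures and mean time to repair."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Assumed figures: a failure every 500 hours, one hour to repair.
    print("%.4f" % availability(500, 1.0))    # 0.9980
    # Halving MTTR helps exactly as much as doubling MTBF:
    print("%.4f" % availability(500, 0.5))    # 0.9990
    print("%.4f" % availability(1000, 1.0))   # 0.9990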

Availability is a binary measurement: the service is either available or it isn't. For the end user, and therefore for the high-level availability metric, the fact that particular underlying components of a service are unavailable is not a concern if that unavailability is concealed through redundant systems design.

Availability can be improved by increasing the MTBF or by decreasing the time spent on each failure, which is measured by the MTTR.


Chapter 3, "Service Management Architecture," introduces the concept of triage, which decreases MTTR through quick assignment of problems to the appropriate specialist organization.

    Transaction Failure Rate

    A transaction fails if, having successfully started, it does not successfully complete.

    (Failure to start is the result of an availability problem.) As is true for availability, systems

    design and redundancy may conceal some low-level failures from the end user and

    therefore exclude the failures from the high-level transaction failure rate metric.

    Transaction Response Time

    This metric represents the acceptable delay for completing a transaction, measured at the

    level of a business process.

    It’s important to measure both the total time to complete a transaction and the elapsed time

    per page of the transaction. That’s because the end user’s perception of transaction time,

    which will be used to compare your system with your competitors’, is based on total

    transaction time, regardless of the number of pages involved, while the slowest page will

    influence end-user abandonment of a web transaction.
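A small sketch of the two views of the same measurement; the page names and timings are invented:

    # Per-page timings of one synthetic transaction, in seconds (invented data).
    page_times = {"login": 1.2, "search": 2.8, "item": 1.9, "checkout": 6.4}

    total = sum(page_times.values())            # drives perceived transaction time
    slowest = max(page_times, key=page_times.get)

    print("total transaction time: %.1f s" % total)
    print("slowest page: %s at %.1f s (abandonment risk)" % (slowest, page_times[slowest]))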

File Transfer Time

The file transfer time metric is closely associated with specified workload and is a measure of success. The file transfer workload metric describes the work that must be accomplished in a certain period; the file transfer time metric shows whether that workload was successfully handled. Lack of end-to-end bandwidth, an insufficient number of concurrent connections, or persistent transmission errors (requiring retransmission) will influence this measure.

    Stream Quality

    The quality of multimedia streams is difficult to measure. Although underlying low-level

    technical metrics, such as frame loss, can be obtained, their relationship to the quality as

    perceived by an end user is very complex.

Streaming is a real-time service in which the content continues flowing even with variations in the underlying data transmission rates and despite some underlying errors. A content consumer may see a small blemish on a graphic because a packet is lost in transit, equivalent to static on your car radio. There is no rewinding and playing it again, as there might be with interactive services. Thus, packet loss is handled by just continuing with the streaming rather than retransmitting lost packets.


Occasional packet loss can still be tolerated and sometimes may not even be noticed. If packet loss increases, quality will begin to degrade until it falls below a threshold and becomes unacceptable. Years of development have been focused on concealing these low-level errors from the multimedia consumer, and the major existing technologies from Microsoft, Real Networks, Apple, and others have different sensitivities to these errors.

Nevertheless, quality must be measured. The telephone companies years ago established the Mean Opinion Score (MOS), a measure of the quality of telephone voice transmission. There are also international standards for evaluation of audio and video quality as perceived by human end users; examples are the International Telecommunication Union's ITU-T P.800-series and P.900-series standards and the American National Standards Institute's T1.518 and T1.801 standards. Simpler methods are also in use, such as measuring the percentage of successful connection attempts to the streaming server, the effective bandwidth delivered over that connection, and the number of rebuffers during transmission.

Low-Level Technical Metrics

These metrics deal with workload and performance of the underlying technical subsystems, such as the transport infrastructure. Low-level technical metrics can be selected and defined by first understanding the high-level technical metrics and their implications for the performance requirements placed on underlying subsystems. For example, a clear understanding of required transaction response time and the associated transaction characteristics (the number of transits across the transport network, the size of each transit, and so on) can help set the objective for the low-level technical metric that measures network transit time (latency).

Workload and Availability

These low-level technical metrics are similar to those for the high-level discussion, but they're focused on performance characteristics of the underlying systems rather than on performance characteristics that are directly visible to end users. Their correlation with the high-level metrics depends on the particular system design and the degree of redundancy and substitution within that design.

Throughput, for example, is a low-level technical metric that measures the capacity of a particular service flow. Services with rich content or critical real-time requirements might need sufficient bandwidth to maintain acceptable service quality. Certain transactions, such as downloading a file or accessing a new web page, might also require a certain bandwidth for transferring rich content, such as complex graphics, within the specified transaction delay time.


Packet Loss

Packet loss has different effects on the end-user experience, depending on the service using the transport. The choice of a packet loss metric for a particular application must be carefully considered. For example, packet loss in file transfer forces retransmission unless the high-level transport contains embedded error correction codes. In contrast, moderate packet loss in streaming media may have no user-perceptible effect at all, unless bad luck results in the loss of a key frame.

The burst length must be included in packet loss metrics. Usually a uniform distribution of dropped packets over longer time intervals is implicitly assumed. For example, out of every 100 packets there could be two lost without violating an SLA calling for two percent packet loss. There may be a different perspective if you examine behavior over longer intervals, such as 1,000 packets. Up to 20 packets in a row could be lost without violating the SLA. However, losing 20 consecutive packets, creating a significant gap in data received, might drive quality levels to unacceptable values.
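A sketch of why an average-loss bound alone can hide damaging bursts; the loss pattern and the burst limit of 3 are invented for illustration:

    # A 1,000-packet window at exactly 2 percent loss, but with all 20 drops
    # consecutive (an invented worst-case pattern).
    received = [True] * 1000
    for i in range(500, 520):
        received[i] = False

    loss_rate = received.count(False) / len(received)

    # Longest run of consecutive drops.
    longest = run = 0
    for ok in received:
        run = 0 if ok else run + 1
        longest = max(longest, run)

    print("average loss: %.1f%%" % (loss_rate * 100))   # 2.0% -> average-loss SLA met
    print("longest burst: %d packets" % longest)        # 20 -> quality may be ruined
    # An SLA can bound both, e.g.: loss_rate <= 0.02 and longest <= 3.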

Latency

Latency is the time needed for transit across the network; it's critical for real-time services. Excessive latency quickly degrades the quality of web sites and of interactive sound and video.

Routes in the Internet are usually asymmetric, with flows often taking different paths coming and going between any pair of locations. Thus, the delays in each direction are usually different. Fortunately, most Internet applications are primarily sensitive to round-trip delays, which are much simpler to measure than one-way delays. File transfer, web sites, and transactions all require a flow of acknowledgments in the opposite direction to the data flow. If acknowledgments are delayed, transmission temporarily ceases. The round-trip latency therefore controls the effective bandwidth of the transmission.
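This is the familiar window-size bound for acknowledgment-clocked protocols such as TCP: throughput cannot exceed window size divided by round-trip time. A sketch; the classic 64 KB unscaled TCP window is an assumption chosen for illustration:

    # Throughput ceiling of an acknowledgment-clocked protocol such as TCP:
    #   effective bandwidth <= window size / round-trip time
    window_bytes = 65_535                   # classic TCP window, no scaling (assumed)

    for rtt_ms in (10, 50, 200):
        bps = window_bytes * 8 / (rtt_ms / 1000.0)
        print("RTT %3d ms -> at most %5.1f Mbps per connection" % (rtt_ms, bps / 1e6))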

    Round-trip latency is much simpler to measure than one-way latency, because clock

    synchronization of separated locations is not necessary. That synchronization can be quite

    tricky if it is accomplished across the same network that’s having its one-way delay

    measured. In that case, fluctuations in the metric that’s being measured (one-way latency)

    can easily affect the stability of the measurement apparatus for one-way latency. An

    external reference, such as the satellite Global Positioning System (GPS) timers, is often

    used in such situations.

    Jitter

Jitter is the deviation in the arrival rate of data from ideal, evenly-spaced arrival; see Figure

    2-3. Some packets may be bunched more closely together (in terms of inter-packet delays)

    or spread farther apart after crossing the network infrastructure. Jitter is caused by the

    internal operation of network equipment, and it’s unavoidable. Jitter is created whenever

    there are queues and buffering in a system. Extreme varieties of jitter are also created when

    there’s rerouting of packets because of network congestion or failure.


Figure 2-3 Jitter

[Figure 2-3 contrasts ideal packet spacing with actual packet spacing; the deviation between the two is the jitter.]

Interactive teleconferencing is an example of a service that is extremely sensitive to jitter; too much jitter can make the service completely useless. Therefore, a reduction in jitter, approaching zero, represents an increase in quality.

Buffering in the receiving device can be used to smooth out jitter; the jitter buffer is familiar to those of us who have a CD player in the car. Small bumps are smoothed out and the sound quality remains acceptable, but hitting a pothole usually causes more disturbance than the buffer can overcome. The dejitter buffer allows for latency that is typically one or two times that of the expected jitter; it's not a cure for all situations. The time spent in the dejitter buffers is an important contributor to total system latency.
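The chapter doesn't prescribe a jitter formula; one widely used option, assumed here purely for illustration, is the smoothed inter-arrival jitter estimator from RTP (RFC 3550). A sketch with invented timestamps:

    def rtp_jitter(send_times, arrival_times):
        """Smoothed inter-arrival jitter, RFC 3550 style: J += (|D| - J) / 16,
        where D is the change in (arrival - send) spacing between packet pairs."""
        transits = [a - s for s, a in zip(send_times, arrival_times)]
        j = 0.0
        for prev, cur in zip(transits, transits[1:]):
            j += (abs(cur - prev) - j) / 16.0
        return j

    send = [i * 0.020 for i in range(6)]                   # sent every 20 ms
    arrive = [0.050, 0.071, 0.089, 0.112, 0.130, 0.151]    # invented arrivals
    print("jitter estimate: %.2f ms" % (rtp_jitter(send, arrive) * 1000))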

Server Response Time

Similar to the high-level technical metric transaction response time, this measures the individual response time characteristics of underlying server systems. A common example is the response time of the database back-end systems to specific query types. Although not directly seen by end users, this is an important part of overall system performance.

Measurement Granularity

The SLA must describe the granularity of the measurements. There are three related parts to that granularity: the scope, the sampling frequency, and the aggregation interval.

Measurement Scope

The first consideration is the scope of the measurement, and availability metrics make an excellent example. Many providers define the availability of their services based on an overall average of availability across all access points. This is an approach that gives the service providers the most flexibility and cushion for meeting negotiated levels.



Consider if your company had 100 sites and a target of 99 percent availability based on an overall average. Ninety-nine of your sites could have complete availability (100 percent) while one could have zero. Having a site with an extended period of complete unavailability isn't usually acceptable, but the service provider has complied with the negotiated terms of the SLA.

    If the availability level is specified on a per-site basis instead, the provider would have been

    found to be noncompliant and appropriate actions would follow in the form of penalties or

    lost customers. The same principle applies when measuring the availability of multiple

    sites, servers, or other units.
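The 100-site example is easy to check with a few lines; the per-site numbers below are the hypothetical ones from the scenario above:

    # 100 sites: 99 perfectly available, one completely down.
    site_availability = [1.00] * 99 + [0.00]
    target = 0.99

    overall = sum(site_availability) / len(site_availability)
    worst = min(site_availability)

    print("aggregate scope: %.2f%% -> compliant: %s" % (overall * 100, overall >= target))
    print("per-site scope: worst site %.2f%% -> compliant: %s" % (worst * 100, worst >= target))

The aggregate scope reports compliance; the per-site scope exposes the dead site.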

    Availability has an additional scope dimension, in addition to breadth: the depth to which

    the end user can penetrate to the desired service. To use a telephone analogy, is dial tone

    sufficient, or must the end user be able to reach specific numbers? In other words, which

    transactions must be accessible for the system to be regarded as available?

Scope issues for performance metrics are similar to those for the availability metric. There may be different sets of metrics for different groups of transactions, different times of day, and different groups of end users. Some transactions may be unusually important to particular groups of end users at particular times and completely unimportant at other times.

Regardless of the scope selected for a given individual metric, it's important to realize that executive management will want these various metrics aggregated into a single measure of overall performance. Derivation of that aggregated metric must be addressed during measurement definition.

Measurement Sampling Frequency

A shorter sampling interval catches problems sooner at the expense of consuming additional network, server, and application resources. Longer intervals between measurements reduce the impacts while possibly missing important changes, or at least not detecting them as quickly as when a shorter interval is used. Customers and the service providers will need to negotiate the measurement interval because it affects the cost of the service to some extent.

Statisticians recommend that sampling be random because it avoids accidental synchronization with underlying processes and the resulting distortion of the metric. Random sampling also helps discover brief patterns of poor performance; consecutive bad results are more meaningful than individual, spaced-out difficulties.

Confidence interval calculations can be used to help determine the sampling frequency. Although it is impossible to perform an infinite number of measurements, it is possible to calculate a range of values that we're reasonably sure would contain the true summary values (median, average, and so on) if we could have performed an infinite number of measurements. For example, you might want to be able to say the following:


"There's a 95 percent chance that the true median, if we could perform an infinite number of measurements, would be between five seconds and seven seconds." That is what the "95 Percent Confidence Interval" seeks to estimate, as shown in Figure 2-4. When you take more measurements, the confidence interval (two seconds in this example) usually becomes narrower. Therefore, confidence intervals can be used to help estimate how many measurements you'll need to obtain a given level of precision with statistical confidence.

Figure 2-4 Confidence Interval for Internet Data

[Figure 2-4 plots the percentage of measurements against response time in seconds, marking the actual median and the confidence interval around it.]

There are simple techniques for calculating confidence intervals for "normal distributions" of data (the familiar bell-shaped curve). Unfortunately, as discussed in the subsequent section on statistical analysis, Internet distributions are so different from the "normal distribution" that these techniques cannot be used. Instead, the statistical simulation technique known as "bootstrapping" can be used for these calculations on Internet distributions.
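A minimal sketch of the percentile-bootstrap idea applied to the median of a response-time sample; the data is invented, and a production implementation would need more samples and more care:

    import random

    def bootstrap_median_ci(samples, n_resamples=10_000, alpha=0.05):
        """Percentile-bootstrap confidence interval for the median."""
        medians = []
        for _ in range(n_resamples):
            resample = sorted(random.choices(samples, k=len(samples)))
            medians.append(resample[len(resample) // 2])
        medians.sort()
        return (medians[int(n_resamples * alpha / 2)],
                medians[int(n_resamples * (1 - alpha / 2))])

    # Invented, heavy-tailed response times in seconds:
    times = [0.8, 1.1, 1.3, 1.2, 0.9, 6.5, 1.4, 1.0, 12.0, 1.1, 1.3, 0.9]
    print("95%% CI for the median: %.2f s to %.2f s" % bootstrap_median_ci(times))

Because bootstrapping resamples the observed data rather than assuming a bell curve, it tolerates the heavy-tailed distributions typical of Internet measurements.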

In some cases, depending on the pattern of measurements, simple approximations for calculating confidence intervals may be used. Keynote Systems recommends the following approximation for calculating the confidence interval for availability metrics; a short script after this list illustrates the arithmetic. (This information is drawn from "Keynote Data Accuracy and Statistical Analysis for Performance Trending and Service Level Management," Keynote Systems Inc., San Mateo, California, 2002.) The formula is as follows:

• Omit data points that indicate measurement problems instead of availability problems.

• Calculate a preliminary estimate of the 95 percent confidence interval for average availability (avg) of a measurement sample with n valid data points:

Preliminary 95 Percent Confidence Interval = avg ± (1.96 * square root of [(avg * (1 – avg)) / (n – 1)])

For example, with a sample size n of 100, if 12 percent of the valid measurements are errors, the average availability is 88 percent. The confidence interval is calculated by the formula as (0.82, 0.94). This suggests that there's a 95 percent probability that the true average availability, if


we'd miraculously taken an infinite number of measurements, is between 82 and 94 percent. Notice that even with 100 measurements, this confidence interval leaves much room for uncertainty! To narrow that band, you need more valid measurements (a larger n, such as 1000 data points).

• Now you must decide if the preliminary calculations are reasonable. We suggest that

    the preliminary calculation should be accepted only if the upper limit is below 100

    percent and the lower limit is above 0 percent. (The example just used gives an upper

    limit > 100% for n = 29 or fewer, so this rule suggests that the calculation is reasonable

    if n = 30 or greater.)

Note that we're not saying that the confidence interval is too wide if the upper limit is above 100 percent (or if the average availability itself is 100 percent because no errors were detected); we're saying that you don't know what the confidence interval is. The reason is that the simplifying assumptions you used to construct the calculation break down if there are not enough data points.
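A sketch of the recipe above, including the reasonableness check; the function and variable names are mine, not Keynote's:

    import math

    def availability_ci(avg, n):
        """Preliminary 95% confidence interval for average availability, per the
        approximation above; None when the simplifying assumptions break down."""
        half_width = 1.96 * math.sqrt(avg * (1 - avg) / (n - 1))
        lower, upper = avg - half_width, avg + half_width
        if lower <= 0.0 or upper >= 1.0:
            return None            # too few data points to trust the formula
        return lower, upper

    print(availability_ci(0.88, 100))   # about (0.82, 0.94), as in the example
    print(availability_ci(0.88, 29))    # None: upper limit would reach 100%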

    For performance metrics, a simple solution to the problem of confidence intervals is to use

    geometric means and “geometric deviations” as measures of performance, which are

    described in the subsequent section in this chapter on statistical analysis.

Keynote Systems suggests, in the paper previously cited, that you can approximate the 95 Percent Confidence Interval for the geometric mean as follows, for a measurement sample with n valid (nonerror) data points:

Upper Limit = [geometric mean] * [(geometric deviation) ^ (1.96 / square root of [n – 1])]

Lower Limit = [geometric mean] / [(geometric deviation) ^ (1.96 / square root of [n – 1])]

This is similar to the use of the standard deviation with normally distributed data and can be used as a rough approximation of confidence intervals for performance measurements.

    Note that this ignores cyclic variations, such as by time of day or day of week; it is also

    somewhat distorted because even the logarithms of the original data are asymmetrically

    distributed, sometimes with a skew greater than 3. Nevertheless, the errors encountered

    using this recipe are much less than those that result from the usual use of mean and

    standard deviation.
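In code, under the assumption that "geometric deviation" means the exponential of the standard deviation of the logarithms (a common definition; the paper's exact definition may differ):

    import math

    def geometric_stats(samples):
        """Geometric mean and geometric deviation (exp of the log-domain std dev)."""
        logs = [math.log(x) for x in samples]
        mean_log = sum(logs) / len(logs)
        var_log = sum((v - mean_log) ** 2 for v in logs) / (len(logs) - 1)
        return math.exp(mean_log), math.exp(math.sqrt(var_log))

    def geometric_mean_ci(samples):
        """Approximate 95% CI for the geometric mean, per the quoted formula."""
        gm, gd = geometric_stats(samples)
        factor = gd ** (1.96 / math.sqrt(len(samples) - 1))
        return gm / factor, gm * factor

    times = [1.2, 0.9, 1.5, 3.8, 1.1, 0.8, 2.2, 1.3, 7.4, 1.0]   # invented seconds
    print("approximate 95%% CI: %.2f s to %.2f s" % geometric_mean_ci(times))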

Measurement Aggregation Interval

Selecting the time interval over which availability and performance are aggregated should

    also be considered. Generally, providers and customers agree upon time spans ranging from

    a week to a month. These are practical time intervals because they will tend to hide small

    fluctuations and irrelevant outlying measurements, but still enable reasonably prompt

    analysis and response. Longer intervals enable longer problem periods before the SLA is

    violated.


Table 2-2 shows this idea. If availability is measured on a small scale (hourly), high-availability requirements such as the 5-9's (99.999 percent) permit only 0.036 seconds of outage before there's a breach of the SLA. Providers must provision with adequate redundancy to meet this type of stringent requirement, and clearly they will pass on these costs to the customers that demand such high availability.

If a monthly (four-week) measurement interval is chosen, the 99.999 percent level indicates that an acceptable cumulative outage of 24 seconds per month is permitted while remaining in compliance. A 99.9 percent availability level permits up to 40 minutes of accumulated downtime for a service each month. Many providers are still trying to negotiate SLAs with availability levels ranging from 98 to 99.5 percent, or cumulative downtimes of 13.5 to 3.5 hours each month.

Note that these values assume 24 * 7 * 365 operations. For operations that do not require round-the-clock availability, or are not up during weekends, or have scheduled maintenance periods, the values will change. That said, they're pretty easy to compute.

The key is for service provider and service customer to set a common definition of the critical time interval. Because longer aggregation intervals permit longer periods during which metrics may be outside tolerance, many organizations must look more deeply at the aggregation definitions and at their tolerance for service interruption. A 98 percent availability level may be adequate and also economically acceptable, but how would the business function if the 13.5 allotted hours of downtime per month occurred in a single outage? Could the business tolerate an interruption of that length without serious damage? If not, then another metric that limits the interruption must be incorporated. This could be expressed in a statement such as the following: "Monthly availability at all sites shall be 99 percent or higher, and no service outage shall exceed three minutes." In other words, a little arithmetic to evaluate scenarios for compliance goes a long way.

Table 2-2  Measurement Aggregation Intervals for Availability

Availability     Allowable Outage for Specified Aggregation Interval
Percentage       Hour         Day          Week        4 Weeks

98%              1.2 min      28.8 min     3.36 hr     13.4 hr
98.5%            0.9 min      21.6 min     2.52 hr     10 hr
99%              0.6 min      14.4 min     1.68 hr     6.7 hr
99.5%            0.3 min      7.2 min      50.4 min    3.36 hr
99.9%            3.6 sec      1.44 min     10 min      40 min
99.99%           0.36 sec     8.64 sec     1 min       4 min
99.999%          0.036 sec    0.864 sec    6 sec       24 sec
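The arithmetic behind Table 2-2 is easy to script when you want to check a scenario of your own; a sketch:

    intervals = {"hour": 3600, "day": 86_400, "week": 604_800, "4 weeks": 2_419_200}

    def allowable_outage_s(availability_pct, interval_s):
        """Allowable outage = aggregation interval * (1 - availability)."""
        return interval_s * (1 - availability_pct / 100.0)

    for pct in (98.0, 99.9, 99.999):
        print("%g%%: " % pct + ", ".join(
            "%s %.3f s" % (name, allowable_outage_s(pct, secs))
            for name, secs in intervals.items()))
    # 99.999% over 4 weeks -> 24.192 s, which the table rounds to 24 sec.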


Measurement Validation and Statistical Analysis

The Internet and Web are extremely complex statistically. Invalid measurements and

    incorrect statistical analysis can easily lead to SLA violations and penalties, which may

    then fall apart when challenged by the service provider using a more appropriate analysis.

    Therefore, special care must be taken to discard invalid measurements and to use the

    appropriate statistical analysis methods.

Measurement Validation

Measurement problems, which are artifacts of the measurement process, are inevitable in any large-scale measurement system. The important issues are how quickly these errors are detected and tagged in the database, and the degree of engineering and business integrity that's applied to the process of error detection and tagging.

    Measurement problems can be caused by instrument malfunction, such as a response timer

    that fails, and by synthetic transaction script failure, which leads to false transaction error

reports. Measurement problems can also be caused by abnormal congestion on a measurement tool's access link

    to the backbone network and by many other factors. These failures are of the measurement

    system, not of the system being measured. They therefore are best excluded from any SLA

    compliance metrics.

    Detection and tagging of erroneous measurements may take time, sometimes up to a day or

    more, as the measurement team investigates the situation. Fortunately, SLA reports are not

    generally done in real time, and there’s therefore an opportunity to detect and remove such

    measurements.

    The same measurements will probably also be used for quick diagnosis, or triage, and that

    usage requires real-time reporting. There’s therefore no chance to remove erroneous

    measurements before use, and the quick diagnosis techniques must themselves handle

    possible problems in the measurement system. Good, fast-acting artifact reduction

    techniques (discussed in Chapter 5, “Event Management”) can eliminate a large number of

    misleading error messages and reduce the burden on the provider management system.

An emerging alternative is using a trusted, independent third party to provide the monitoring and SLA compliance verification. The advantage of having an independent party provide the information is that both service providers and their customers can view this party as objective when they have disputes about delivered service quality.

    Keynote Systems and Brix Networks are early movers into this market space. Keynote

    Systems provides a service, whereas Brix Networks provides an integrated set of software

    and hardware measurement devices to be installed and managed by the owner of the SLA.

    They both provide active, managed measurement devices placed at the service demarcation

    points between customers and

