12 November 2003 Rebecca Isaacs Paul Barham Richard Mortier Dushyanth Narayanan Microsoft Research...

12 November 2003

Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan
Microsoft Research Cambridge

James Bulpin
University of Cambridge

Magpie: Distributed request tracking for realistic performance modelling

Performance in distributed systems

- Faults in distributed systems are notoriously hard to diagnose
- Performance problems are even more subtle to debug
  - Often transient, or affect only a subset of requests / users
  - Frequently involve complex interactions between multiple machines
  - Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie Approach

- Track individual requests end to end
  - Observe control flow (causality)
  - Monitor resource consumption: CPU, bandwidth, disk
  - Debug performance “in the small”
- Build a probabilistic workload model from the aggregate requests
  - Cluster similar requests according to their observed behaviour
  - Debug performance “in the large”

How do we use this information?

- Performance debugging
  - Why did this request take much longer than that request?
  - Fault detection
  - Configuration and management
- Performance prediction
  - Realistic workload models for capacity planning
  - Obtain automatically on a “live” system

Magpie components

- Instrumentation
  - System activity recorded to logs
- Generic request parser
  - Extract individual requests from logs according to an event schema
- Model construction
  - Behavioural clusters
  - Probabilistic state machine

Outline

- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status

What is a request?

- System activity which takes place in response to an action initiated by the application being traced
  - HTTP request
  - Database query
  - File open request
- We describe a request as
  - The sequence of application components involved in its processing
  - The resources consumed at each stage: CPU, bandwidth, disk transfer size, (latency)
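
One way to hold this description in code is an ordered list of per-component stages, each carrying its own resource counters. The sketch below is illustrative only: the class and field names (`Stage`, `Request`, `cpu_ms`, `net_bytes`, `disk_bytes`) and the numbers are assumptions made for the example, not Magpie's own data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """Resources consumed by one component while handling a request (hypothetical fields)."""
    component: str          # e.g. "IIS", "ASP.NET", "SQL Server"
    cpu_ms: float = 0.0
    net_bytes: int = 0
    disk_bytes: int = 0


@dataclass
class Request:
    """A request: the ordered components involved plus per-stage resource usage."""
    kind: str               # e.g. "HTTP request", "Database query"
    stages: List[Stage] = field(default_factory=list)

    def components(self) -> List[str]:
        return [s.component for s in self.stages]

    def total_cpu_ms(self) -> float:
        return sum(s.cpu_ms for s in self.stages)


# Made-up numbers, purely to show the shape of the record.
req = Request("HTTP request", [
    Stage("IIS", cpu_ms=5.0, net_bytes=1024),
    Stage("ASP.NET", cpu_ms=6.0),
    Stage("SQL Server", cpu_ms=3.0, disk_bytes=24 * 1024),
])
print(req.components(), req.total_cpu_ms())
```

Keeping the stages in order preserves the control-flow view, while summing over them gives the per-request totals a capacity model would need.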

A typical e-commerce site (1)

[Figure: site architecture, with labels Internet, Web Front Ends, SQL Servers, Storage]

A typical e-commerce site (2)

[Figure: software components on each tier. Labels: Web Server (Kernel, http.sys, IIS, Filter, Static Content, CLR, ASP.NET, Application Logic, ADO.NET, WinSock2 API); SQL Server (Kernel, WinSock2 API, Stored procedures, Data)]

HTTP request: detailed view

[Figure: timeline of one HTTP request from 10.051s to 10.155s, showing threads WEB.eec, WEB.398 and SQL.9c4, with Disk, Net RX and Net TX rows for each machine. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other]

Annotations on the figure:

- HTTP request packet
- IIS worker thread picks up request from http.sys
- Sync WinSock send to SQL Server
- ASP.NET thread blocks after RPC to database
- ASP.NET worker thread takes over
- TDS request and reply packets sent and received
- SQL thread unblocks
- HTTP response packets sent back to client
- IIS worker thread wakes up to write log

Why is request tracking hard?

- Many components, multiple machines
  - Must track control flow across machines
- No globally unique request ID
  - Components are developed independently
- Multiple thread pools
  - Many threads participate in processing a request
- Asynchronous communication
  - Must match send/recvs between threads/machines
- Hand-rolled synchronization primitives
  - SQL Server has a user-mode scheduler


Event Tracing for Windows

- Low-overhead event mechanism
  - Events timestamped with cycle counter
  - Global ordering on events on a single machine
  - Can enable/disable sets of events at runtime
- Using ETW in Magpie
  - Each instrumentation point posts an event
  - Events are logged to disk
  - Logs are post-processed to extract requests
  - Can also consume events in real time

Instrumentation points

- Existing ETW event providers
  - IIS, kernel
- App-specific hooks
  - IIS, ASP.NET, SQL Server
- Detours
  - Wrap DLLs to trap Win32 and WinSock2 calls
- WinPcap
  - Capture packets on the wire

CPU usage from kernel events

- The ETW kernel logger records every context switch
  - How do we know which cycles are used for which request?
- We can attribute cycles to a request by
  - An application-specific event which occurs within a delimited sector of CPU time, or
  - The current context of execution, e.g. thread id
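
The second rule, attribution by current execution context, can be sketched as follows. This is a simplified illustration rather than Magpie's implementation: it assumes a single CPU, a hypothetical event format of `(cycle_timestamp, thread_id_switched_in)`, and a fixed thread-to-request map (in a real trace the binding of threads to requests changes over time).

```python
from collections import defaultdict


def attribute_cycles(cswitches, thread_to_request):
    """Charge the cycles between consecutive context switches to the request
    that owns the thread which was running during that interval.

    cswitches: list of (cycle_timestamp, thread_id_switched_in), sorted by time.
    thread_to_request: dict mapping thread id -> request id (assumed static here).
    """
    usage = defaultdict(int)
    for (start, tid), (end, _next_tid) in zip(cswitches, cswitches[1:]):
        request = thread_to_request.get(tid)
        if request is not None:          # cycles of unbound threads are not charged
            usage[request] += end - start
    return dict(usage)


# Hypothetical trace: thread 0xa runs, then 0xb, then 0xa again, then the idle thread.
events = [(100, 0xA), (250, 0xB), (400, 0xA), (450, 0x0)]
owners = {0xA: "request 1", 0xB: "request 2"}
usage = attribute_cycles(events, owners)
print(usage)
```

Because every context switch is logged, the intervals tile the CPU's time exactly, so no cycles are counted twice.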

Example: protocol processing in a DPC

[Figure: timeline of events (cswitch, DPC start, pkt recv, DPC end, cswitch), with the cycle counts attributed to Request 1 and Request 2 marked along the time axis]
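
A minimal sketch of this kind of attribution, under assumptions of my own: the cycles between a DPC start and DPC end are charged to whichever request owns the connection named by a packet-receive event seen inside that window, regardless of which thread was interrupted. The event tuples, event names and the `conn_to_request` map are hypothetical.

```python
def attribute_dpc_cycles(events, conn_to_request):
    """Charge each DPC's cycles to the request identified by a packet-receive
    event that occurs between the DPC's start and end.

    events: list of (cycle_timestamp, kind, arg), sorted by time.
    conn_to_request: dict mapping connection id -> request id.
    """
    usage = {}
    dpc_start = None   # timestamp of the DPC currently running, if any
    owner = None       # request identified inside the current DPC
    for ts, kind, arg in events:
        if kind == "dpc_start":
            dpc_start, owner = ts, None
        elif kind == "pkt_recv" and dpc_start is not None:
            owner = conn_to_request.get(arg)
        elif kind == "dpc_end" and dpc_start is not None:
            if owner is not None:
                usage[owner] = usage.get(owner, 0) + (ts - dpc_start)
            dpc_start = None
    return usage


# Hypothetical trace: the DPC interrupts a thread working on another request,
# but receives a packet on a connection that belongs to request 2.
trace = [
    (1000, "cswitch", 0xA),
    (1100, "dpc_start", None),
    (1130, "pkt_recv", 0xD),
    (1180, "dpc_end", None),
    (1400, "cswitch", 0xB),
]
dpc_usage = attribute_dpc_cycles(trace, {0xD: "request 2"})
print(dpc_usage)
```

A fuller version would also subtract the DPC window from the cycles charged to the interrupted thread's request.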

Application and middleware events

- Cover points where flow of control moves between components
- Cover points where resources are multiplexed and demultiplexed
  - E.g. user-level scheduling primitives
- Propagation of a global request id is not required!
  - Magpie used to do this but not any more

Instrumenting a web service

[Figure: the e-commerce component diagram annotated with instrumentation points. Component labels: Web Server (Kernel, http.sys, IIS, Filter, Static Content, CLR, ASP.NET, Application Logic, ADO.NET); SQL Server (Kernel, Stored procedures, Data). Instrumentation labels: Event Tracing for Windows, Packet capture, WinSock2 API Intercept, ISAPI Filter, HTTPModule, CLR profiler, Wrappers, Extended SPs]


Generic request extraction

- No inbuilt assumptions about the system or the application
  - No common unique identifier
- Schema specifies semantics of events
  - Easy to add new event types
- Parser stitches events into requests based on event semantics

Terminology

- Namespace
  - Event parameter which references an entity in the system, e.g. thread id
- Timeline
  - Instantiation of a namespace with a unique value, e.g. thread id = 0xa
- Events bind or unbind requests to timelines
  - Bindings capture the semantics of each event for a particular request type
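
To make the bind/unbind idea concrete, here is a small sketch of a schema-driven parser. The schema format, the event names (`tcp_pkt`, `recv_returns`, `work`, `request_end`) and the `start` flag are all hypothetical, invented for this example; the only idea taken from the slide is that each event names timelines, and that events attach requests to those timelines or detach them.

```python
# Hypothetical schema: for each event type, the namespaces it binds together,
# the namespaces it unbinds afterwards, and whether it may start a new request.
SCHEMA = {
    "tcp_pkt":      {"bind": ["connid"],        "unbind": [],         "start": True},
    "recv_returns": {"bind": ["connid", "tid"], "unbind": ["connid"], "start": False},
    "work":         {"bind": ["tid"],           "unbind": [],         "start": False},
    "request_end":  {"bind": ["tid"],           "unbind": ["tid"],    "start": False},
}


def extract_requests(events):
    """Stitch events into requests by following timeline bindings."""
    bound = {}      # timeline (namespace, value) -> request id
    requests = {}   # request id -> list of event types, in log order
    next_id = 1
    for ev in events:
        rule = SCHEMA[ev["type"]]
        timelines = [(ns, ev[ns]) for ns in rule["bind"]]
        # The event belongs to whichever request already owns one of its timelines.
        rid = next((bound[t] for t in timelines if t in bound), None)
        if rid is None:
            if not rule["start"]:
                continue            # cannot be attributed to any request
            rid = next_id
            next_id += 1
        for t in timelines:
            bound[t] = rid          # bind: the request now owns these timelines
        requests.setdefault(rid, []).append(ev["type"])
        for ns in rule["unbind"]:
            bound.pop((ns, ev[ns]), None)
    return requests


log = [
    {"type": "tcp_pkt", "connid": 0xD},
    {"type": "tcp_pkt", "connid": 0xE},
    {"type": "recv_returns", "connid": 0xD, "tid": 0xA},
    {"type": "work", "tid": 0xA},
    {"type": "recv_returns", "connid": 0xE, "tid": 0xB},
    {"type": "work", "tid": 0xB},
    {"type": "request_end", "tid": 0xA},
    {"type": "work", "tid": 0xA},   # thread 0xa is unbound by now, so this is skipped
]
extracted = extract_requests(log)
print(extracted)
```

Note that no request id appears in any event: ownership flows from the connection timeline to the thread timeline through the one event (`recv_returns`) that mentions both.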

Example: connecting events

[Figure: timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, joined by the events Enter Recv, cswitch (twice), DPC start, DPC end, Recv returns and TCP pkt, with the segments belonging to Request 1 and Request 2 marked]

End-to-end request extraction

- An instance of the request parser runs on each machine in the distributed system
  - Online or offline mode
- Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
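
The offline join can be pictured as grouping per-machine fragments that mention the same packet identifier. The sketch below uses a small union-find for this; the fragment format (`name`, `packets`) and the identifiers are made up for illustration. Since the IPv4 identification field is only 16 bits and wraps around, a real join would presumably also have to use addresses and timestamps to disambiguate, which this sketch ignores.

```python
def stitch_fragments(fragments):
    """Group request fragments that share at least one packet identifier.

    fragments: list of dicts with a "name" and a set of "packets" (packet ids
    seen by that fragment, whether sent or received).
    Returns a sorted list of sorted groups of fragment names.
    """
    parent = list(range(len(fragments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # packet id -> index of the first fragment that mentions it
    for i, frag in enumerate(fragments):
        for pkt in frag["packets"]:
            if pkt in first_seen:
                parent[find(i)] = find(first_seen[pkt])
            else:
                first_seen[pkt] = i

    groups = {}
    for i, frag in enumerate(fragments):
        groups.setdefault(find(i), []).append(frag["name"])
    return sorted(sorted(g) for g in groups.values())


fragments = [
    {"name": "WEB/req-1", "packets": {0x1A2B, 0x1A2C}},
    {"name": "SQL/req-7", "packets": {0x1A2B, 0x1A2C}},
    {"name": "WEB/req-2", "packets": {0x2F00}},
    {"name": "SQL/req-8", "packets": {0x2F00}},
]
end_to_end = stitch_fragments(fragments)
print(end_to_end)
```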


Clustering for workload generation

- Target the Indy performance modelling tool
  - Calculates throughput, bottlenecks
  - Needs transaction mix, resource consumption
- Previously: microbenchmark approach
  - Run 10000 of each “transaction type” (URL)
  - Divide aggregate resource usage by 10000
- Aim: provide realistic workload models
  - From real, mixed workloads
  - Derive transaction “types” automatically
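
The microbenchmark arithmetic is just an average over identical runs. A tiny sketch, with made-up aggregate figures and hypothetical resource names:

```python
def per_transaction_cost(aggregate, runs=10000):
    """Average resource cost of one transaction, given totals over `runs` runs."""
    return {resource: total / runs for resource, total in aggregate.items()}


# Made-up totals measured while replaying one URL 10000 times.
totals = {"cpu_ms": 140000.0, "disk_bytes": 245760000}
cost = per_transaction_cost(totals)
print(cost)
```

The average says nothing about how requests for the same URL differ from one another, or how they behave when mixed with other traffic, which is the gap the clustering approach aims to fill.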

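Single request: cartoon view

- Partial ordering of events
- Annotated with resource usage

[Figure: one request drawn as a partially ordered graph of stages. Key: IIS CPU, ASP.NET CPU, SQL Server CPU, Disk, Network. Stage annotations: times 5ms, 6ms, 1ms, 3ms, 6ms, 2ms, 3ms, 6ms; sizes 6k, 1k, 192k read, 24k read, 12k, 1k]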

Behavioural clustering of requests

- Represent requests as event strings
  - “Flatten” out any concurrency
- Use Levenshtein string edit distance
  - Modified to factor in resource usage vectors
- Cluster requests based on this distance
  - Linear-time algorithm
- Each cluster is a request “type”
  - Select representative from near centroid
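
The slide does not say how the edit distance is modified or which clustering algorithm is used, so the following is only one plausible sketch. Assumptions: each event carries a single scalar resource figure rather than a vector; matching event symbols cost a fraction (`resource_weight`) of their relative resource difference; and clustering is a single-pass "leader" scheme, which is linear in the number of requests when the number of clusters stays bounded.

```python
def edit_distance(a, b, resource_weight=0.5):
    """Levenshtein distance over (symbol, resource) pairs.

    Insertions, deletions and substitutions of different symbols cost 1.
    Aligning equal symbols costs resource_weight * relative resource difference.
    """
    def sub_cost(x, y):
        if x[0] != y[0]:
            return 1.0
        hi = max(x[1], y[1])
        return 0.0 if hi == 0 else resource_weight * abs(x[1] - y[1]) / hi

    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [float(i)]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1.0,                  # delete x
                           cur[j - 1] + 1.0,               # insert y
                           prev[j - 1] + sub_cost(x, y)))  # align x with y
        prev = cur
    return prev[-1]


def cluster(requests, threshold=1.0):
    """Single-pass leader clustering: join the first cluster whose leader is
    within `threshold`, otherwise start a new cluster."""
    leaders, clusters = [], []
    for r in requests:
        for k, leader in enumerate(leaders):
            if edit_distance(r, leader) <= threshold:
                clusters[k].append(r)
                break
        else:
            leaders.append(r)
            clusters.append([r])
    return clusters


def representative(members):
    """Member with the smallest total distance to the others (a medoid)."""
    return min(members, key=lambda r: sum(edit_distance(r, o) for o in members))


# Event strings: I = IIS, A = ASP.NET, S = SQL Server, N = network send (CPU ms).
r1 = [("I", 5), ("A", 6), ("S", 3), ("N", 1)]
r2 = [("I", 5), ("A", 6), ("S", 6), ("N", 1)]   # same path, more SQL CPU
r3 = [("I", 2), ("N", 1)]                       # static content
r4 = [("I", 2), ("N", 2)]
clusters = cluster([r1, r3, r2, r4])
print([len(c) for c in clusters], representative(clusters[0]))
```

The medoid stands in for "representative from near centroid", since a centroid is not directly defined for strings.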

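Build a workload model by clustering similar requests

Requests in the same cluster often have different URLs, and one URL may appear in many clusters

[Figure: five clusters labelled A to E, each drawn in the style of the single-request view. Annotations recovered per cluster:]

- A (7%): times 2ms, 10ms, 1ms, 14ms, 24ms, 5ms, 11ms, 5ms, 5ms, 5ms; sizes 6k, 0.2k, 30k, 1k, 0.1k, 0.2k, 2k, 0.2k
- B (10%): times 14ms, 27ms, 1ms, 2ms, 7ms, 2ms; sizes 11k, 1k
- C (15%): times 5ms, 6ms, 1ms, 3ms, 6ms, 2ms, 3ms, 6ms; sizes 6k, 1k, 192k read, 24k read, 12k, 1k
- D (5%): times 2ms, 13ms, 2ms, 3ms, 5ms, 5ms, 11ms; sizes 0.3k, 11k, 1k, 0.3k
- E (63%): times 5ms, 11ms; sizes 1k, 0.6k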

Taking it further: work-in-progress

- Online and incremental modelling
  - Detect component failure
  - Detect sudden shifts in workload
- More sophisticated models
  - Learn the probabilistic state machine for each request
  - c.f. flowcharts annotated with performance information
- “Bayesian watchdogs”
  - Compute the likelihood of a request’s behaviour as it moves through the system
  - Deal with “unlikely” requests appropriately
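
As a rough illustration of scoring a request's behaviour, the sketch below learns first-order transition probabilities between components from observed paths and then computes the log-likelihood of a new path. A first-order Markov chain is my stand-in for the "probabilistic state machine"; the slides do not specify the model, and the component names, the `floor` probability for unseen transitions, and the training data are all assumptions.

```python
import math
from collections import Counter, defaultdict


def learn_transitions(paths):
    """Estimate P(next component | current component) from observed request paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}


def log_likelihood(path, model, floor=1e-6):
    """Log-probability of a path; unseen transitions get a small floor probability."""
    return sum(math.log(model.get(a, {}).get(b, floor))
               for a, b in zip(path, path[1:]))


# Made-up training data: three dynamic-page requests and one static-content request.
dynamic = ["http.sys", "IIS", "ASP.NET", "SQL", "done"]
static = ["http.sys", "IIS", "done"]
model = learn_transitions([dynamic, dynamic, dynamic, static])

odd = ["http.sys", "IIS", "SQL", "done"]   # skips ASP.NET: never seen in training
print(log_likelihood(dynamic, model), log_likelihood(odd, model))
```

Because the score is a running sum over transitions, it can be updated as each event arrives, which fits the idea of watching a request while it moves through the system.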


Current status

- Recent focus has been developing a generic request extraction scheme
  - Prototype for 2-machine e-commerce site
    - TPC-W style workload
  - Prototype for single machine SQL Server 2000
    - Challenge is user mode scheduler
    - TPC-C workload
- Other applications on the way
  - Large-scale
  - “Real” systems with “real” performance problems

Conclusion

- Magpie is a tool for performance analysis in a distributed system
- Bottom up, per-request approach
- Complementary to existing techniques:
  - Performance counters
  - Program profiling
- Feeds into performance debugging and prediction tools
