12 November 2003
Magpie: Distributed request tracking for realistic performance modelling
Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan
Microsoft Research Cambridge
James Bulpin
University of Cambridge
Performance in distributed systems
- Faults in distributed systems are notoriously hard to diagnose
- Performance problems are even more subtle to debug
  - Often transient or affect only a subset of requests / users
  - Frequently involve complex interactions between multiple machines
  - Aggregate statistics (e.g. utilization) may look perfectly normal
Magpie Approach
- Track individual requests end to end
  - Observe control flow (causality)
  - Monitor resource consumption: CPU, bandwidth, disk
  - Debug performance “in the small”
- Build a probabilistic workload model from the aggregate requests
  - Cluster similar requests according to their observed behaviour
  - Debug performance “in the large”
How do we use this information?
- Performance debugging
  - Why did this request take much longer than that request?
- Fault detection
- Configuration and management
- Performance prediction
  - Realistic workload models for capacity planning
  - Obtain automatically on a “live” system
Magpie components
- Instrumentation
  - System activity recorded to logs
- Generic request parser
  - Extract individual requests from logs according to an event schema
- Model construction
  - Behavioural clusters
  - Probabilistic state machine
Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status
What is a request?
- System activity which takes place in response to an action initiated by the application being traced
  - HTTP request
  - Database query
  - File open request
- We describe a request as
  - The sequence of application components involved in its processing
  - The resources consumed at each stage
    - CPU, bandwidth, disk transfer size, (latency)
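The description above (a sequence of components, each annotated with resource usage) can be sketched as a small record type. This is only an illustration: the names `Stage` and `Request`, the field names, and the numbers in the example are assumptions, not Magpie's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One component's share of the work for a request (hypothetical type)."""
    component: str        # e.g. "IIS", "ASP.NET", "SQL Server"
    cpu_ms: float = 0.0
    net_bytes: int = 0
    disk_bytes: int = 0


@dataclass
class Request:
    """A request: ordered stages plus the resources each stage consumed."""
    kind: str             # e.g. "HTTP request"
    stages: list = field(default_factory=list)

    def components(self):
        return [s.component for s in self.stages]

    def total_cpu_ms(self):
        return sum(s.cpu_ms for s in self.stages)


# Illustrative values only.
req = Request("HTTP request", [
    Stage("IIS", cpu_ms=5.0, net_bytes=1024),
    Stage("ASP.NET", cpu_ms=6.0),
    Stage("SQL Server", cpu_ms=3.0, disk_bytes=24 * 1024),
])
print(req.components(), req.total_cpu_ms())
```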
A typical e-commerce site (1)
[Figure: the Internet, web front ends, SQL servers and storage]
A typical e-commerce site (2)
[Figure: software components of the site. Web Server labels: Kernel, http.sys, IIS, Filter, CLR, ASP.NET, ADO.NET, Application Logic, Static Content, WinSock2 API. SQL Server labels: Kernel, WinSock2 API, Stored procedures, Data]
HTTP request: detailed view
[Figure: timeline of a single HTTP request from 10.051s to 10.155s, with a marker at 10.100s. Rows: threads WEB.eec, WEB.398 and SQL.9c4, plus Disk, Net RX and Net TX for each machine. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other]
Annotations on the figure:
- HTTP request packet
- IIS worker thread picks up request from http.sys
- Sync WinSock send to SQL Server
- ASP.NET thread blocks after RPC to database
- ASP.NET worker thread takes over
- TDS request and reply packets sent and received
- SQL thread unblocks
- HTTP response packets sent back to client
- IIS worker thread wakes up to write log
Why is request tracking hard?
- Many components, multiple machines
  - Must track control flow across machines
- No globally unique request ID
  - Components are developed independently
- Multiple thread pools
  - Many threads participate in processing a request
- Asynchronous communication
  - Must match send/recvs between threads/machines
- Hand-rolled synchronization primitives
  - SQL Server has user-mode scheduler
Event Tracing for Windows
- Low-overhead event mechanism
  - Events timestamped with cycle counter
  - Global ordering on events on a single machine
  - Can enable/disable sets of events at runtime
- Using ETW in Magpie
  - Each instrumentation point posts an event
  - Events are logged to disk
  - Logs are post-processed to extract requests
  - Can also consume events in real time
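As a rough picture of the properties listed above (timestamped events, a single ordered log per machine, providers switched on and off at runtime), here is a toy in-memory log. It is a stand-in written for illustration and does not use or imitate the ETW API; the class `EventLog`, the provider names and the event names are all assumptions, and `time.perf_counter_ns()` substitutes for the cycle counter.

```python
import time


class EventLog:
    """Toy in-memory event log; a stand-in for ETW, not its API."""

    def __init__(self):
        self.enabled = set()   # providers currently switched on
        self.events = []       # (timestamp_ns, provider, name, params)

    def enable(self, provider):
        self.enabled.add(provider)

    def disable(self, provider):
        self.enabled.discard(provider)

    def post(self, provider, name, **params):
        # Events from disabled providers are dropped at the call site.
        if provider in self.enabled:
            stamp = time.perf_counter_ns()
            self.events.append((stamp, provider, name, params))


log = EventLog()
log.enable("iis")
log.post("iis", "RequestStart", tid=0xA)
log.post("sql", "QueryStart", tid=0xB)   # dropped: "sql" is not enabled yet
log.enable("sql")
log.post("sql", "QueryEnd", tid=0xB)
print([name for _, _, name, _ in log.events])
```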
Instrumentation points
- Existing ETW event providers
  - IIS, kernel
- App-specific hooks
  - IIS, ASP.NET, SQL Server
- Detours
  - Wrap DLLs to trap Win32 and WinSock2 calls
- WinPcap
  - Capture packets on the wire
CPU usage from kernel events
- The ETW kernel logger records every context switch
  - How do we know which cycles are used for which request?
- We can attribute cycles to a request by
  - An application-specific event which occurs within a delimited sector of CPU time, or
  - The current context of execution, e.g. thread id
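The second option above (attribution by current execution context) can be sketched as follows: each interval between two context switches is charged to whichever request owns the thread that was running. The function name, the event tuple layout and the thread-to-request map are assumptions made for this sketch, and the cycle counts are invented.

```python
from collections import defaultdict


def attribute_cycles(cswitches, owner, end_cycle):
    """Charge each interval between context switches to a request.

    cswitches: list of (cycle_count, new_thread_id), sorted by cycle_count
    owner:     dict thread_id -> request id; other threads are unattributed
    end_cycle: cycle count at which the trace ends
    """
    usage = defaultdict(int)
    for i, (start, tid) in enumerate(cswitches):
        # The thread runs until the next context switch (or end of trace).
        stop = cswitches[i + 1][0] if i + 1 < len(cswitches) else end_cycle
        usage[owner.get(tid, "unattributed")] += stop - start
    return dict(usage)


trace = [(0, 0xA), (400, 0xB), (1000, 0xA), (1500, 0xC)]
owner = {0xA: "request 1", 0xB: "request 2"}
usage = attribute_cycles(trace, owner, end_cycle=1600)
print(usage)
```

In a real trace the thread-to-request map changes over time, which is what the bind and unbind events described later in the talk are for; the fixed map here keeps the sketch short.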
Example: protocol processing in a DPC
[Figure: a time axis showing cswitch, DPC start, pkt recv and DPC end events, with the cycle counts between them attributed to Request 1 and Request 2]
Application and middleware events
- Cover points where flow of control moves between components
- Cover points where resources are multiplexed and demultiplexed
  - E.g. user-level scheduling primitives
- Propagation of a global request id is not required!
  - Magpie used to do this but not any more
Instrumenting a web service
[Figure: the e-commerce site architecture annotated with instrumentation points. Component labels: Web Server, Kernel, http.sys, IIS, Filter, CLR, ASP.NET, ADO.NET, Application Logic, Static Content, SQL Server, Stored procedures, Data. Instrumentation labels: ISAPI Filter, HTTPModule, CLR profiler, Wrappers, Extended SPs, WinSock2 API Intercept, Event Tracing for Windows, Packet capture]
Generic request extraction
- No inbuilt assumptions about the system or the application
  - No common unique identifier
- Schema specifies semantics of events
  - Easy to add new event types
- Parser stitches events into requests based on event semantics
Terminology
- Namespace
  - Event parameter which references an entity in the system, e.g. thread id
- Timeline
  - Instantiation of a namespace with a unique value, e.g. thread id = 0xa
- Events bind or unbind requests to timelines
  - Bindings capture the semantics of each event for a particular request type
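To make the terminology concrete, the sketch below stitches a log into requests using only bind and unbind rules, with no request id carried in any event. The schema format, the event names (`HttpStart`, `Send`, `Recv`, `HttpEnd`) and the parameter names are hypothetical; Magpie's real schema language is not shown in these slides.

```python
# Hypothetical schema: for each event type, the namespaces whose timelines the
# event binds together, the ones it unbinds afterwards, and whether the event
# may start a new request.
SCHEMA = {
    "HttpStart": {"bind": ["tid"], "unbind": [], "starts": True},
    "Send": {"bind": ["tid", "connid"], "unbind": ["tid"], "starts": False},
    "Recv": {"bind": ["connid", "tid"], "unbind": ["connid"], "starts": False},
    "HttpEnd": {"bind": ["tid"], "unbind": ["tid"], "starts": False},
}


def extract_requests(events, schema):
    bound = {}      # timeline (namespace, value) -> request id
    requests = []   # request id -> list of event types, in log order
    for ev in events:
        rule = schema[ev["type"]]
        timelines = [(ns, ev[ns]) for ns in rule["bind"]]
        owners = [bound[t] for t in timelines if t in bound]
        if owners:
            rid = owners[0]          # sketch: ignores conflicting owners
        elif rule["starts"]:
            rid = len(requests)
            requests.append([])
        else:
            continue                 # event belongs to no tracked request
        requests[rid].append(ev["type"])
        for t in timelines:
            bound[t] = rid           # every named timeline joins the request
        for ns in rule["unbind"]:
            bound.pop((ns, ev[ns]), None)
    return requests


events = [
    {"type": "HttpStart", "tid": 0xA},
    {"type": "Send", "tid": 0xA, "connid": 0xD},
    {"type": "HttpStart", "tid": 0xA},           # thread reused by a new request
    {"type": "Recv", "tid": 0xB, "connid": 0xD},
    {"type": "HttpEnd", "tid": 0xB},
    {"type": "HttpEnd", "tid": 0xA},
]
requests = extract_requests(events, SCHEMA)
print(requests)
```

The first request moves from thread 0xa to thread 0xb by way of connection 0xd, while thread 0xa is reused for the second request; adding a new event type only needs a new schema entry.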
Example: connecting events
[Figure: timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, connected by the events Enter Recv, cswitch, DPC start, TCP pkt, DPC end and Recv returns, with segments labelled Request 1 and Request 2]
End-to-end request extraction
- An instance of the request parser runs on each machine in the distributed system
  - Online or offline mode
- Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
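The offline joining step can be sketched as grouping fragments that mention the same packet identifier. The fragment layout and the function name `stitch` are assumptions, and the sketch assumes identifiers are unique within the trace being processed (the IPv4 identification field is only 16 bits wide, so a real tool would also need to handle reuse).

```python
def stitch(fragments):
    """Merge per-machine request fragments that share a packet identifier.

    fragments: list of (machine, fragment_name, set_of_packet_ids)
    Returns a sorted list of groups, each a sorted list of (machine, name).
    """
    parent = list(range(len(fragments)))   # union-find over fragment indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}                        # packet id -> first fragment index
    for i, (_, _, packets) in enumerate(fragments):
        for p in packets:
            if p in first_seen:
                parent[find(i)] = find(first_seen[p])
            else:
                first_seen[p] = i

    groups = {}
    for i, (machine, name, _) in enumerate(fragments):
        groups.setdefault(find(i), []).append((machine, name))
    return sorted(sorted(g) for g in groups.values())


fragments = [
    ("web", "w1", {101, 102}),
    ("sql", "s1", {101}),
    ("web", "w2", {201}),
    ("sql", "s2", {201, 202}),
]
stitched = stitch(fragments)
print(stitched)
```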
Clustering for workload generation
- Target the Indy performance modelling tool
  - Calculates throughput, bottlenecks
  - Needs transaction mix, resource consumption
- Previously: microbenchmark approach
  - Run 10000 of each “transaction type” (URL)
  - Divide aggregate resource usage by 10000
- Aim: provide realistic workload models
  - From real, mixed workloads
  - Derive transaction “types” automatically
Single request: cartoon view
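- Partial ordering of events
- Annotated with resource usage
[Figure: one request drawn as a partial order of events across IIS CPU, ASP.NET CPU, SQL Server CPU, Disk and Network. Annotations: 5ms, 6ms, 1ms, 3ms, 6ms, 2ms, 3ms, 6ms; 6k, 1k; 192k read; 24k read; 12k, 1k]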
Behavioural clustering of requests
- Represent requests as event strings
  - “Flatten” out any concurrency
- Use Levenshtein string edit distance
  - Modified to factor in resource usage vectors
- Cluster requests based on this distance
  - Linear-time algorithm
- Each cluster is a request “type”
  - Select representative from near centroid
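The slide does not say how the edit distance is modified or which linear-time algorithm is used, so the following is only one plausible sketch. It assumes a single scalar of resource usage per event (rather than a vector), weights a substitution between same-named events by their relative difference in usage, and swaps in simple one-pass leader clustering, which uses the first member of a cluster as its representative rather than a near-centroid member.

```python
def distance(a, b):
    """Edit distance between two requests.

    Each request is a list of (event_name, resource_usage) pairs. Insertions
    and deletions cost 1. Substituting an event for one with the same name
    costs the relative difference in usage (0..1); substituting a different
    event costs 1. This weighting is an assumption.
    """
    def sub(x, y):
        if x[0] != y[0]:
            return 1.0
        hi = max(x[1], y[1])
        return abs(x[1] - y[1]) / hi if hi else 0.0

    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + sub(x, y)))  # substitute
        prev = cur
    return prev[-1]


def cluster(requests, threshold):
    """Leader clustering: one pass, join the first cluster whose leader is close."""
    clusters = []   # list of (leader, members)
    for r in requests:
        for leader, members in clusters:
            if distance(leader, r) <= threshold:
                members.append(r)
                break
        else:
            clusters.append((r, [r]))
    return [members for _, members in clusters]


r1 = [("iis", 5), ("asp", 6), ("sql", 3)]
r2 = [("iis", 5), ("asp", 8), ("sql", 3)]
r3 = [("iis", 2), ("static", 1)]
groups = cluster([r1, r2, r3], threshold=0.5)
print(distance(r1, r2), len(groups))
```

Here `r1` and `r2` share an event string and differ only slightly in usage, so they fall into one cluster, while `r3` forms its own.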
Build a workload model by clustering similar requests
Requests in the same cluster often have different URLs, and one URL may appear in many clusters
[Figure: five behavioural clusters A to E, each shown by a representative request annotated with resource usage. Share of requests: A 7%, B 10%, C 15%, D 5%, E 63%]
Taking it further: work-in-progress
- Online and incremental modelling:
  - Detect component failure
  - Detect sudden shifts in workload
- More sophisticated models
  - Learn the probabilistic state machine for each request
  - cf. flowcharts annotated with performance information
- “Bayesian watchdogs”
  - Compute the likelihood of a request’s behaviour as it moves through the system
  - Deal with “unlikely” requests appropriately
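One simple way to picture the watchdog idea is to learn transition probabilities between components from observed requests and then score a new request's path. The sketch below uses a first-order Markov chain as a stand-in for the probabilistic state machine; the slides describe this as work in progress and give no algorithm, so the model, the function names and the floor value for unseen transitions are all assumptions.

```python
import math
from collections import Counter, defaultdict


def learn(paths):
    """Estimate transition probabilities from observed component sequences."""
    counts = defaultdict(Counter)
    for path in paths:
        for a, b in zip(["START"] + path, path + ["END"]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}


def log_likelihood(model, path, floor=1e-6):
    """Log-probability of a path; unseen transitions get a small floor value."""
    total = 0.0
    for a, b in zip(["START"] + path, path + ["END"]):
        total += math.log(model.get(a, {}).get(b, floor))
    return total


observed = [["IIS", "ASP.NET", "SQL"]] * 3 + [["IIS", "Static"]]
model = learn(observed)
common = log_likelihood(model, ["IIS", "ASP.NET", "SQL"])
odd = log_likelihood(model, ["IIS", "SQL"])   # transition never observed
print(common, odd)
```

A request whose score falls far below that of typical requests would be the kind of "unlikely" request the slide refers to; how to act on it is left open here, as it is in the talk.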
Current status
- Recent focus has been developing a generic request extraction scheme
- Prototype for 2-machine e-commerce site
  - TPC-W style workload
- Prototype for single machine SQL Server 2000
  - Challenge is user mode scheduler
  - TPC-C workload
- Other applications on the way
  - Large-scale
  - “Real” systems with “real” performance problems
Conclusion
- Magpie is a tool for performance analysis in a distributed system
- Bottom up, per-request approach
- Complementary to existing techniques:
  - Performance counters
  - Program profiling
- Feeds into performance debugging and prediction tools