Post on 12-Feb-2017
Dynamic Data Access to the GT/CERCS Linux Mirror Site
Mohamed Mansour, Matthew Wolf, Karsten Schwan
HPGC - IPDPS 2004 2
Motivation
• Testing (benchmarking) high-performance distributed streaming applications
– Scientific domain
– Enterprise applications
Scientific Data Stream
[Figure: pipeline of services connected by data channels. Services: Molecular Dynamics; Bonds (calculate bonds and radial dist.); openGL Visualization server. Data channels: co-ordinates; co-ordinates + bonds; radial dist. data; openGL triangular data.]
Application Specific Workloads
• Margo Seltzer et al. [1999] - Test and evaluate systems with realistic workloads
– Avoid over-designing the system
– Provide rigorous insights into system capabilities
Goal
• Understand user interactions with large streaming data repositories
– Analyze FTP traces of the GT/CERCS mirror site
• A tool to replay such workloads
– StreamGen workload generation tool
Example
[Figure: the scientific data stream pipeline with a load generator attached. Services: Bonds (calculate bonds and radial dist.); openGL Visualization server #1; openGL Visualization server #2; StreamPerf load generator. Data channels: co-ordinates + bonds; radial dist. data; openGL triangular data (one per visualization server).]
Outline
• Overview and definitions
• Method of analysis
• Results
• Summary
• Q&A
Non-Striped Traffic
[Figure: non-striped download of file_xxxx.rpm.]
Striped Traffic – Download Accelerators
[Figure: striped download of file_xxxx.rpm.]
Traffic Traces
[Figure: transfers of file_xxxx.rpm involving the GT CERCS Linux Mirror.]
Striping Factor
$$\text{striping\_factor} = \frac{\text{bytes\_downloaded}}{\text{bytes\_total}}$$
[Figure: striping_factor illustrated graphically with file_xxxx.rpm.]
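The ratio above can be sketched in a few lines of Python. This is only an illustration of the definition on the slide: the function name, the argument names and the sample byte counts are assumptions, not part of the original analysis code.

```python
def striping_factor(bytes_downloaded: int, bytes_total: int) -> float:
    """Return the share of a file fetched from this server, in percent."""
    if bytes_total <= 0:
        raise ValueError("bytes_total must be positive")
    if bytes_downloaded < 0:
        raise ValueError("bytes_downloaded must not be negative")
    return 100.0 * bytes_downloaded / bytes_total


# Hypothetical example: two segments (300 and 150 bytes) of a 1000-byte
# file were served by this mirror; the rest came from elsewhere.
segments = [300, 150]
print(striping_factor(sum(segments), 1000))  # 45.0
```

A download that fetches the whole file from one server gives 100%.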
Striping Factor – Examples
[Figure: two downloads of file_xxxx.rpm, one with striping_factor = 100% and one with striping_factor = 45%.]
Method of Analysis
• Reconstruct user sessions from xferlog traces
• Metadata, site heuristics and assumptions
– Limit of two concurrent connections per host
– ls-lr files with relative path information
– Idle timeout of 2 hours
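As a rough illustration of the session reconstruction step, the sketch below groups xferlog records by remote host and starts a new session whenever a host is silent for more than the 2-hour idle timeout. It assumes the common wu-ftpd xferlog field order (five timestamp tokens, transfer time, remote host, byte count, file name, ...). The function names and sample log lines are hypothetical, and the slides do not say that the original analysis worked exactly this way. It also ignores the connection limit and the ls-lr metadata, and it measures gaps between logged timestamps only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(hours=2)  # idle timeout stated on the slide


def parse_xferlog_line(line):
    """Parse one xferlog line, assuming the usual wu-ftpd field order."""
    f = line.split()
    when = datetime.strptime(" ".join(f[:5]), "%a %b %d %H:%M:%S %Y")
    return {"time": when, "host": f[6], "bytes": int(f[7]), "path": f[8]}


def reconstruct_sessions(lines):
    """Group transfers per host, splitting on gaps longer than IDLE_TIMEOUT."""
    by_host = defaultdict(list)
    for line in lines:
        if line.strip():
            rec = parse_xferlog_line(line)
            by_host[rec["host"]].append(rec)

    sessions = []
    for recs in by_host.values():
        recs.sort(key=lambda r: r["time"])
        current = [recs[0]]
        for prev, rec in zip(recs, recs[1:]):
            if rec["time"] - prev["time"] > IDLE_TIMEOUT:
                sessions.append(current)  # close the idle session
                current = []
            current.append(rec)
        sessions.append(current)
    return sessions


# Hypothetical log excerpt: hostA returns after a 3.5 hour pause.
sample_log = [
    "Mon Mar 01 10:00:00 2004 5 hostA 1000 /pub/a.rpm b _ o a anon@ ftp 0 * c",
    "Mon Mar 01 10:30:00 2004 5 hostA 2000 /pub/b.rpm b _ o a anon@ ftp 0 * c",
    "Mon Mar 01 14:00:00 2004 5 hostA 3000 /pub/c.rpm b _ o a anon@ ftp 0 * c",
    "Mon Mar 01 10:10:00 2004 5 hostB 500 /pub/a.rpm b _ o a anon@ ftp 0 * c",
]
print(len(reconstruct_sessions(sample_log)))  # 3
```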
User Sessions
[Figure: Redhat 7.1 traffic histogram (bin size = 1 day). X-axis: Time (days), 0 to 700. Y-axis: Sessions, 0 to 700. Series: non-striped traffic, striped traffic.]
Striping Factor Distribution
[Figure: X-axis: Fraction of data downloaded from GA TECH server (%), 0 to 100. Y-axis: Fraction of request trains (%), 0 to 30. Series: SuSE 7.3, SuSE 8.0, SuSE 8.1.]
Single File Domination
[Figure: chart with a 0% to 100% axis for RedHat 7.1, RedHat 7.2, RedHat 7.3, RedHat 8.0, SuSE 7.3, SuSE 8.0, SuSE 8.1, Debian Potato and Debian Woody. Series: Striped, Non-Striped.]
Single File Distribution (striped)
[Figure: Redhat 7.3, single file downloads, parallel download. X-axis: Downloaded data (bytes), log scale, 1 to 1E+09. Y-axis: Frequency of downloads, log scale, 1 to 1000.]
Single File Distribution (non-striped)
[Figure: Redhat 7.3, single file downloads, no download accelerator. X-axis: Downloaded data (bytes), log scale, 1 to 1E+09. Y-axis: Frequency of downloads, log scale, 1 to 10000.]
Results
• Strong similarity between striped and non-striped behavior
– Correlation factor between 70% and 98%
• Download accelerators are common
– Only 20-25% of users do not use them
• Striping factor uniformly distributed over the range of 10-90%
• 7-25% ‘null’ requests
• Requesting a single file is the most common pattern
– Download accelerators exhibit distinctive access patterns
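The slides do not say which correlation measure was used or which series were compared. As one plausible reading, the sketch below computes a Pearson correlation coefficient between two equally binned histograms, for example per-day session counts of striped and non-striped traffic. The function name and the sample counts are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-day session counts (not taken from the traces).
striped = [10, 20, 30, 40]
non_striped = [12, 18, 33, 41]
print(f"correlation: {100 * pearson(striped, non_striped):.0f}%")
```

A value near 100% means the two histograms rise and fall together, which is the kind of similarity the first bullet describes.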
Contributions
• Traffic traces
– Reconstructed from real traces
• StreamGen – a library to generate streaming workloads
– Derived from httperf
– Replays traffic traces, or generates statistical patterns
Future Directions
• More in-depth analysis of striped behavior
– Modified FTP server to collect offset data
• Use traces as realistic traffic models
References
• V. Oleson, K. Schwan, G. Eisenhauer, B. Plale, C. Pu, and D. Amin, “Operational information systems - an example from the airline industry”, In First Workshop on Industrial Experiences with Systems Software (WIESS)
• Matthew Wolf, Zhongtang Cai, Weiyun Huang and Karsten Schwan, “SmartPointers: personalized scientific data portals in your hand”, In Proc. of the 2002 ACM/IEEE conference on Supercomputing, Baltimore, Maryland, 2002, pp. 1-16
• Margo Seltzer, David Krinsky, Keith Smith and Xiaolan Zhang, “The Case for Application-Specific Benchmarking”, In Proceedings of the 1999 Workshop on Hot Topics in Operating Systems, Rio Rico, AZ, 1999
• D. Mosberger and T. Jin, “httperf: A tool for measuring web server performance”, WISP, ACM, Madison, WI, June 1998, pp. 59-67
• http://www.cc.gatech.edu/~mansour
Q&A
mansour@cc.gatech.edu