Storage Characterization for Unstructured Data in Online Services Applications

Sriram Sankar and Kushagra Vaid
Global Foundation Services (GFS), Microsoft Corporation

2009 IEEE International Symposium on Workload Characterization (IISWC), Austin, TX, USA, October 4-6, 2009
978-1-4244-5156-2/09/$26.00 ©2009 IEEE

Abstract— Mega datacenters hosting large scale web services have unique workload attributes that need to be taken into account for optimal service scalability. Provisioning compute and storage resources to provide a seamless user experience is challenging since customer traffic loads vary widely across time and geographies, and the servers hosting these applications have to be rightsized to provide both performance within a single server and across a scale-out cluster. Typical user-facing web services have a three tiered hierarchy – front-end web servers, middle-tier application logic, and back-end data storage and processing layer. In this paper, we address the challenge of disk subsystem design for back-end servers hosting large amounts of unstructured (also called blob) data. Examples of typical content hosted on such servers include user generated content such as photos, email messages, videos, and social networking updates. Specific server applications analyzed in this paper correspond to the message store of a large scale email application, image tile storage for a large scale geo-mapping application, and user content storage for Web 2.0 type applications. We analyze the storage subsystems for these web services in a live production environment and provide an overview of the disk traffic patterns and access characteristics for each of these applications. We then explore time-series characteristics and derive probabilistic models showing state transitions between locations on the data volumes for these applications. We then explore how these probabilistic models could be extended into a framework for synthetic benchmark generation for such applications. Finally, we discuss how this framework can be used for storage subsystem rightsizing for optimal scalability of such backend storage clusters.

I. INTRODUCTION

User-driven content on the internet has been growing at a tremendous rate over the past few years. IDC estimates that web service providers hosting such content accounted for over 1.4 Exabytes of new storage capacity in 2008, and by 2012 will account for 16.4 Exabytes [1, 2]. The explosive growth in these content stores is fueling a massive buildout of servers in mega datacenters to host and deliver the content. This user-driven content is very different from traditional schemes which use structured storage that is typically organized using databases. Some examples of user-driven content are pictures, videos, emails, social networking updates, and even online file storage. The unstructured nature of this content requires a totally different approach to server and storage subsystem design, with the primary design metrics being cost efficiency ($/GB) and power efficiency (Watts/GB).

In this paper, we focus on understanding the disk subsystem workload characteristics for such unstructured data stores hosted in large datacenter environments, and also present a methodology for articulating the workload via a probabilistic state transition model that can be adapted to represent varying degrees of detail depending on the representation granularity chosen by the user. We discuss how this framework can be useful for server design and performance analysis for achieving the goals of cost and power efficiency across scale-out server environments.

We chose three different web services for this analysis: the message store for a large-scale email service, image tile storage for a large-scale geo-mapping service, and a blob storage service hosting massive amounts of user-driven content. Disk traces were collected from production servers in datacenters hosting these services and were then analyzed to identify patterns for workload randomness over time categorized by several key storage parameters. This is one of our major contributions in the paper since it provides insight into storage workload analysis for emerging web services applications in large mega datacenter environments.

As part of the trace analysis, we isolated key storage parameters (such as workload randomness, address ranges, Inter-arrival rates, and blocksizes) and represented these via probabilistic state transition diagrams. The intent was to derive a single visual representation of the workload for comparison across various applications. We extended this framework to make these state diagrams hierarchical, so that the depth of the workload details could also be represented, and made the state diagrams machine readable by generating an XML schema that captures the key workload information in these state diagrams. The XML configuration file can be used by stress test tools or by synthetic workload generators for replaying the workload characteristics on a variety of production server hardware configurations for performance analysis, tuning, and server rightsizing purposes.

The remainder of the paper is organized as follows: Section II describes related work in the field of storage workload characterization. Section III explains our tracing and analysis infrastructure, while Section IV provides the background information for these applications. Section V presents the trace-based workload characterization, including the summary characteristics, temporal and probabilistic models for each of the applications. Section VI explains our framework for workload representation based on a state transition model and
explores possible extensions to represent additional information. Section VII then shows how this framework can be utilized for server configuration optimization. Section VIII discusses future work, and Section IX concludes the paper.

II. RELATED WORK

A. Storage Characterization for Real Industry Workloads

Availability of storage server workload traces has always been a tough challenge for researchers. Storage traces from HP Labs including Cello [3, 5] and Openmail [4], and file block traces from U.C. Berkeley (Snake [3] and WEB [6]) are the most popular ones that are publicly available. A recent effort made several production traces publicly available through SNIA [7, 25]. In addition to actual production traces, we can also obtain traces from publicly available benchmarks like TPC-C [8], TPC-D [9], TPC-H [10], FileBench [11], DBT-2 [12], Postmark [13], and AM-utils [14]. Traces collected from benchmarks have the advantage of being gathered in a controlled environment where we can vary parameters and collect traces for verification purposes. However, they do not provide accurate information that can be correlated to actual production environments. Hence there is a need not only to obtain more production traces, but also to represent them in a convenient fashion that is easy to reproduce without sacrificing too much information about the actual trace. Our methodology aims to provide such flexibility of representation for storage workload characterization.

B. Representation Models

Production traces typically require a large amount of archival storage. Hence, it is beneficial to convert a trace into a representative model of a workload that can be used for analysis of different storage subsystems. However, it is challenging to define a representative model for production disk access patterns, since the spatial and temporal properties need to be preserved [20]. Fractional ARIMA [21], fractal models [22, 23] and On/Off models [24] capture spatial locality but do not represent temporal locality adequately. The PQRS model [15] defines a statistical model to capture burstiness and correlation of spatio-temporal disk traffic. The argument of self-similarity [16, 17, 18] in workload patterns also suggests that models could be used to represent real workloads. However, interesting state transition information is lost in converting a spatio-temporal trace into representative numbers. Our model provides the flexibility to represent the entire trace at different levels of granularity. At one end, it can represent the trace via a massive state transition diagram which captures all the key information between any two disk accesses. At the other end, the model can be set up to sacrifice detail and instead represent the trace at a coarser granularity, showing the high-level characteristics for transitions between aggregated clusters of disk access ranges.

III. EXPERIMENTAL SETUP

We use the tracing functionality provided with the Microsoft Windows operating system, called Event Tracing for Windows (ETW) [27]. ETW is a general-purpose, high-speed and scalable tracing facility that can provide disk and file I/O traces for profiling storage subsystem activity. A kernel-provided buffering and logging mechanism is leveraged to provide event-based tracing for events raised by both user-mode applications and kernel-mode device drivers. We capture the following information from production servers for storage events: disk event type (Disk Read Start, Disk Write Start, Completion), timestamp of the request, process issuing the request, thread id, virtual address of the kernel data structure corresponding to the specific IO, request offset, size of the request in bytes, time elapsed, disk number as viewed by the OS, flags, disk service time, priority, and file I/O details such as filename, object ID, etc.

With the above level of detail at the storage subsystem, we are able to obtain information about the access profiles for the workloads and create representative models for them.
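For concreteness, the captured fields can be held in a small record type for downstream analysis. The Python sketch below is illustrative only: the field names and the comma-separated export layout are our assumptions for this paper's discussion, not the actual ETW event schema.

```python
from dataclasses import dataclass

@dataclass
class DiskIOEvent:
    """One disk I/O event, mirroring (a subset of) the fields captured via ETW.
    Field names are illustrative, not the real ETW schema."""
    timestamp_us: int      # timestamp of the request, in microseconds
    is_write: bool         # Disk Write Start vs. Disk Read Start
    offset: int            # request offset (logical byte address on the volume)
    size: int              # size of the request in bytes
    disk_number: int       # disk number as viewed by the OS
    service_time_us: int   # disk service time

def parse_trace_line(line: str) -> DiskIOEvent:
    """Parse one comma-separated line of a hypothetical trace export."""
    ts, op, off, sz, disk, svc = line.strip().split(",")
    return DiskIOEvent(int(ts), op == "W", int(off), int(sz), int(disk), int(svc))
```

Records of this shape are all the later analyses in this paper need: offsets and sizes for spatial analysis, timestamps for temporal analysis, and service times for latency summaries.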

IV. WORKLOAD SERVER INFRASTRUCTURE

We selected three large-scale web services applications built on unstructured storage: the message store for a large-scale email service (MSG-EMAIL), image tile storage for a large-scale geo-mapping service (MAPS), and a blob storage service hosting massive amounts of user-driven content (USER CONTENT). The logical architectures of the three applications are presented in Figure 1. As can be seen, there are typically three tiers: a web server tier, a middle tier where the application processes requests from the upper tier, and finally the data storage tier. For MSG-EMAIL, once the request is forwarded to the protocol processing layer, there is a metadata lookup to identify which server the user data resides on. We characterize accesses to a single message store volume for the email storage server. For the MAPS application, the user request is routed to the back-end image tile servers to retrieve the appropriate map tiles, and the mid-tier application layer performs tile composition and metadata overlays to construct the final displayed map before sending it back to the user. The image tile server uses a striped volume spanning multiple disk drives. The USER CONTENT application has a custom API for blob manipulation, and the storage tier stores these blobs in striped disk volumes.

V. CHARACTERIZATION

This section describes three distinct aspects of workload characterization for the web services being studied. First, we present the summary storage access characteristics of the three applications chosen for this analysis. Next, we delve into the spatio-temporal properties of these workloads with time-series analysis. Finally, we capture state transition diagrams for a section of the trace to represent the spatio-temporal nature of the workloads.

A. Summary Characteristics

Data access requests issued from an application are typically transformed as the requests make their way to the underlying storage volume. This is a result of the intermediate OS and storage hardware elements, which typically buffer, coalesce and parallelize the application-generated request stream, turning it into a string of I/O requests that access different locations on a rotating media platter. Since the traces we collected are via a provider within the OS, we see accesses to logical addresses, which are then translated to actual physical locations on the disk by the hardware storage controller. There are several patterns and differentiating characteristics that can define an application at the storage layer. These can be broadly classified into:

a. Block sizes and Random access statistics: We find that for our specific applications, a majority of the block accesses fall between 4K and 64K block sizes and we tabulate the random nature of the workload.

b. Performance metrics: We also measure the IOPS, MBs/sec and Latency for these three applications.

Table 1 shows the block access distribution (we have excluded block sizes that account for less than 1% of the total accesses, so the total percentage is less than 100%) and the randomness of each application. If two consecutive requests from the application access subsequent logical block numbers, then the access is considered to be sequential. From the table, we can see that MSG-EMAIL has 93% random access dominated by the 4K block size. This is explained by the fact that a single server hosts email messages from several users, and email access patterns in the aggregate have a random access nature. The MAPS workload is 27% random (hinting that most of the accesses are sequential) and is mostly reads (19.8 RD:WR ratio). This observation fits with the profile of the MAPS workload since the majority of time is spent reading map tiles from disk for serving composite maps to internet users. The USER CONTENT workload is dominated by 4K block accesses and is 91% random. Since this workload includes multiple simultaneous users hosted on the same server, all accessing different sized user content (photos, videos, etc.), the randomness observed fits with our expectations.
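The sequentiality rule above is mechanical, so a short sketch may help make it precise. The function below (ours, not the authors' tooling) applies that definition to a stream of (offset, size) pairs and returns the percentage of random accesses, as tabulated in Table 1.

```python
def randomness_percent(events):
    """Percentage of accesses that are random, per the paper's definition:
    an access is sequential if it starts exactly where the previous request
    ended; otherwise it is random. `events` is a list of (offset, size)
    pairs in issue order, offsets and sizes in bytes."""
    if len(events) < 2:
        return 0.0
    random_count = 0
    for (prev_off, prev_size), (off, _) in zip(events, events[1:]):
        if off != prev_off + prev_size:   # not contiguous with previous request
            random_count += 1
    return 100.0 * random_count / (len(events) - 1)
```

For example, three back-to-back 4K reads at offsets 0, 4096 and 8192 score 0% random, while widely scattered offsets score 100%.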


Table 2 shows the IOPS (IOs per second), MB/sec and the latency for the servers on which the traces were captured. This table provides a comprehensive numerical representation of the performance that is currently observed for these three applications.

From the table it can be observed that MSG-EMAIL has the highest average IOPS requirement. However, it should be noted that MAPS has a higher MB/sec because of its larger 64K block size, compared to the 4K block size of MSG-EMAIL.
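The Table 2 metrics are simple aggregates over the trace. The following sketch shows one plausible way to compute them; the tuple layout for a trace event is our assumption, not the paper's format.

```python
def summary_metrics(events):
    """Compute IOPS, MB/sec and mean latency from a trace.
    `events` is a list of (timestamp_us, size_bytes, latency_us) tuples,
    sorted by timestamp. Assumes the trace spans at least two timestamps."""
    duration_s = (events[-1][0] - events[0][0]) / 1e6
    total_bytes = sum(sz for _, sz, _ in events)
    mean_latency_ms = sum(lat for _, _, lat in events) / len(events) / 1000.0
    return {
        "iops": len(events) / duration_s,                      # IOs per second
        "mb_per_sec": total_bytes / (1024 * 1024) / duration_s,  # throughput
        "latency_ms": mean_latency_ms,                          # mean latency
    }
```

This also illustrates the block-size effect noted above: at equal IOPS, a workload issuing 64K requests moves sixteen times the MB/sec of one issuing 4K requests.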

B. Temporal Analysis

The summary characteristics adequately represent the overall nature of the workload, but leave out important information about application behavior over time. To address this issue, we used trace information to plot several parameters including the Logical Block Number (LBN), Inter-arrival times and Outstanding I/O requests at the storage subsystem.

LBN vs time charts

Figure 2 shows the plots of LBN vs time for all three profiled applications. LBN is plotted on the Y-axis, whereas the X-axis shows time at microsecond granularity. A horizontal line in these graphs represents sequential accesses, whereas a pattern that has no horizontal lines is representative of a random workload.

Figure 2: LBN Time-Series Analysis

The MSG-EMAIL data presented in Figure 2 corresponds to a single disk volume comprising several SATA hard drives hosting user email content. As can be seen from the LBN vs time graph, there are several places with horizontal patterns indicating sequential access over time; however, these are short-lived and do not constitute a major portion of the graph. There is also an interesting pattern that goes across the LBNs in a sweeping fashion over time – this is from a batch activity that reads all LBNs in the disk volume periodically. It is also evident that the workload is fairly random, a conclusion which is consistent with the data presented in Table 1. When we look at the LBN vs time graph for MAPS, we can clearly see horizontal lines throughout the plot, indicating sequential activity for bands of LBN ranges. This indicates high traffic to popular geographic regions for tiles representing several zoom levels and view types (road/hybrid/aerial). The chart for USER-CONTENT shows a very interesting pattern: the access pattern is very dense and completely random with respect to time. Note that the observations made from the time-series charts for all three workloads are fairly consistent with the data derived earlier for the overall workload, as presented in Table 1.

Inter-arrival rates vs time charts

The next set of graphs plot Inter-arrival time in milliseconds on the Y-axis vs time on the X-axis. Inter-arrival time represents the spacing in time between two successive requests issued to the disk subsystem, and has a direct impact on the disk service times and outstanding requests in the disk queue. These graphs allow us to understand workload intensity and burstiness over time. A large range of Inter-arrival time values for a given time period indicates sparsely generated disk requests resulting from a slowdown in user activity, while densely clustered Inter-arrival times indicate burstiness driven by heavy user activity.
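The inter-arrival series plotted in these charts is just the first difference of the request timestamps. A minimal sketch (the burstiness threshold below is our illustrative choice, not a value from the paper):

```python
def interarrival_times(timestamps_us):
    """Inter-arrival time in milliseconds between successive requests,
    given request timestamps in microseconds, sorted ascending."""
    return [(b - a) / 1000.0 for a, b in zip(timestamps_us, timestamps_us[1:])]

def burst_fraction(timestamps_us, threshold_ms=1.0):
    """Fraction of gaps shorter than `threshold_ms`: a crude burstiness
    measure over the trace. The threshold is an assumption for illustration."""
    gaps = interarrival_times(timestamps_us)
    return sum(1 for g in gaps if g < threshold_ms) / len(gaps)
```

Plotting `interarrival_times` against time reproduces the kind of density bands discussed for Figure 3.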

The MSG-EMAIL graph in Figure 3 shows distinct regions of time where the Inter-arrival rate densities are different. The middle region shows much burstier behavior than the start and end regions. The MAPS graph shows consistent workload intensity over time, except towards the end where there is a brief period of slow user activity. The USER-CONTENT graph shows consistent activity throughout the sampled time period.

Figure 3: Inter-arrival Time-Series Analysis

Outstanding IOs vs time charts

We also plot the outstanding IOs in the system in Figure 4 to gain an understanding of how the Inter-arrival rate affected the storage subsystem latency through queuing effects. The message store of the email system had a large number of outstanding IOs, but this was a rare event. Even under normal working conditions, the average outstanding IOs for MSG-EMAIL were higher than for the other two applications. This could be attributed to how data is organized on the system and the access pattern at the time the trace was measured. When we look at the graph for MAPS, we observe a vertical spike at one instance where numerous requests were issued to almost the entire LBN range (correlating this event in Figure 4 with Figure 2 for MAPS, we observe a vertical pattern in both graphs, suggesting that multiple requests were issued at the same time and hence caused queuing at the disk subsystem).
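An outstanding-IO series of this kind can be derived from the trace by sweeping over start and completion events. The sketch below (ours, under the assumption that each request's start and completion timestamps are available) computes the peak queue depth.

```python
def max_outstanding(intervals):
    """Maximum number of IOs simultaneously in flight, from a list of
    (start_us, completion_us) pairs. Sweep-line over +1/-1 events;
    completions at the same instant as a start are processed first,
    so back-to-back IOs do not count as overlapping."""
    events = [(s, +1) for s, _ in intervals] + [(c, -1) for _, c in intervals]
    events.sort(key=lambda e: (e[0], e[1]))  # (-1) sorts before (+1) at ties
    depth = peak = 0
    for _, delta in events:
        depth += delta
        peak = max(peak, depth)
    return peak
```

Recording `depth` at each event time instead of only its maximum yields the full time series plotted in Figure 4.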


Figure 4: Outstanding IO Timeseries Analysis

The outstanding IOs for USER CONTENT follow a uniform pattern throughout the trace time period and do not show the alarming spikes seen in the other two applications. This represents an application that is fairly balanced with respect to the time IOs spend waiting for the IO subsystem, and is consistent with the corresponding Inter-arrival rate graph, which showed consistency throughout the time interval.

In this section we have observed different patterns with respect to LBN, Inter-arrival time and Outstanding IOs, and we were able to correlate the different charts for the same application and identify patterns over the traced time duration. While this analysis is useful for understanding workload characteristics at a high level, it is not sufficient for workload reproduction using synthetic benchmark tools. Detailed information about offsets, think time, and blocksize distributions between successive requests is not captured by the analysis shown so far. The next section elaborates on our model to represent this detailed information via the use of probabilistic transition diagrams.

C. Probabilistic State Transitions

In our effort to represent workloads visually, we decided to use a probabilistic state transition model, as it provides an intuitive way to understand and represent the nature of workload access patterns. The application-generated requests to the disk subsystem are essentially a sequence of accesses to different LBNs. Based on this observation, we chose to represent LBN ranges as discrete states. The transition edges between the states represent the storage access activity.

Our model is created as follows: to make it visually comprehensible for purposes of this paper, we divide the LBN range into 4 parts and label them States 1-4. Note that this can be changed to a different granularity according to user needs. An application-generated request stream can be perceived as a state machine which transitions between the LBN ranges (the available address space). We also tag important workload information on the transition edges: the access pattern (whether it is random or sequential), the percentage that such an access constitutes of the total accesses in the application, and the composition of that particular access (block sizes and their absolute percentages). Each edge weight is a percentage of the accesses over the entire graph and is not localized to a particular vertex alone. For instance, in Figure 5, which represents the MSG-EMAIL application, we show 4 LBN ranges. The State1-State2 edge implies that each time a state change occurs from State1 to State2, it is a non-sequential (Random) offset with an overall probability of 3.2%, with a block size distribution probability of 1.7% for 4K Reads and 0.8% for 4K Writes (note: we do not show all smaller block size accesses here for clarity, but record them in the actual model). The self-loops signify accesses that occur within the same state and provide the probability for that self transition.

Figure 5 shows the probabilistic state transition diagram for MSG-EMAIL. From the edge transition labels we can see that State3 has the highest transition probability into it and hence would be the most accessed of all the states. This signifies that this region of LBNs is hotter for this application (note how the state transition diagram can be used to isolate hot states). Another thing we can note from the transition edges is that 4K block accesses are the predominant access size throughout the state diagram, and there are very few “SEQ” (sequential) edges compared to “RND” (random) edges. Hence 4K random accesses make up most of this workload. In the next section we will see how we can encode temporal properties in the same diagram, when we extend our probabilistic state transition framework to include more information about the workload.
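The edge weights of such a diagram can be derived directly from the trace. The following simplified sketch builds only the transition probabilities over equal LBN-range states; the paper's full model additionally tags each edge with its random/sequential type and block-size composition.

```python
def transition_matrix(offsets, volume_size, n_states=4):
    """Build a probabilistic state transition matrix: split the offset
    range [0, volume_size) into `n_states` equal parts and count the
    transitions between consecutive accesses. Each entry is a probability
    over the entire graph (all transitions), as in Figure 5."""
    def state(off):
        return min(n_states - 1, off * n_states // volume_size)
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(offsets, offsets[1:]):
        counts[state(a)][state(b)] += 1   # includes self-loops (a, b in same state)
    total = len(offsets) - 1
    return [[c / total for c in row] for row in counts]
```

Summing a column of this matrix gives the total incoming probability of a state, which is how hot states such as State3 for MSG-EMAIL can be identified programmatically.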

Figure 6 provides the probabilistic state transition diagrams for both MAPS and USER CONTENT. We can observe similar hot LBN ranges from the incoming edges into each state: State4 seems to be the hot state for MAPS and State2 for USER CONTENT. Note that the probabilistic state transition diagram provides the block access information as part of its structure (through edge transitions). It also provides locality data, which the earlier analysis was not able to encode, in the form of its states. In the next section we shall see how this framework can be extended to provide more granularity and depth of data in an elegant manner.

VI. REPRESENTATION FRAMEWORK

In the previous section we created probabilistic state transition diagrams for visually representing the workloads. In this section, we describe how these state diagrams can be converted into a hierarchical framework that can be understood by synthetic workload generators. This framework is flexible and can represent storage workloads at different levels of granularity. We also show an XML representation of the hierarchical state diagram, which can be parsed by different workload generators.

A. Hierarchical Extension to State Diagrams

There is a tradeoff between the amount of information that can be encapsulated in a model and the corresponding ease of representation. A modular approach to workload representation is preferable since, depending on the need for granularity, automated tools can algorithmically traverse the representation. This approach led us to develop a hierarchical model for probabilistic transition diagrams, where each state can be further divided into 4 sub-states in a tree-like fashion. Continuous application of this algorithm could eventually result in the entire trace being represented as a massive probabilistic state transition diagram.

We provide an illustration of this hierarchical application in Figure 7(a) for State1 of MSG-EMAIL, shown earlier in Figure 5. As can be seen, the self-loop on State1 in Figure 5 is now more granular. Earlier we observed that State1 of the workload covered the 0-25% LBN range. Now, we are able to zoom in and observe the transitions between the LBNs within this range. This provides valuable information to a regeneration tool, since the tool now knows how “random” the workload actually was by hierarchically traversing the representation.
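One level of this zoom-in can be sketched as follows. The function (ours, a non-recursive single step of the hierarchical scheme) takes the accesses that constitute a parent state's self-loop and rebuilds a 4-way transition matrix over that sub-range, in the spirit of Figure 7(a).

```python
def zoom_state(offsets, lo, hi, n_sub=4):
    """Hierarchical zoom-in on one state: keep only consecutive accesses
    that both fall in [lo, hi) -- i.e. the parent state's self-loop --
    split that sub-range into `n_sub` equal sub-states, and compute the
    transition probabilities among them. The real model applies this
    subdivision recursively, state by state."""
    width = hi - lo
    def sub(off):
        return min(n_sub - 1, (off - lo) * n_sub // width)
    counts = [[0] * n_sub for _ in range(n_sub)]
    total = 0
    for a, b in zip(offsets, offsets[1:]):
        if lo <= a < hi and lo <= b < hi:   # part of the parent self-loop
            counts[sub(a)][sub(b)] += 1
            total += 1
    if total == 0:
        return counts
    return [[c / total for c in row] for row in counts]
```

Applying `zoom_state` to each sub-state in turn yields the tree-like refinement described above.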

B. Incorporating Temporal Information

The transition state diagrams presented so far adequately represent the spatial information of a workload and the transitions between locality ranges. However, they do not yet capture the temporal aspect of each transition. To encode temporal information, we tag each transition edge with the average inter-arrival (IA) time for that transition. For instance, in Figure 7(b) for MSG-EMAIL, the transition edges now carry the IA information. (The block sizes and weights are the same as in Figure 4 and are omitted for clarity.) A regeneration tool could use the inter-arrival information as think time between IO requests to reproduce an application more faithfully. Additional depth of detail could be provided for each transition; for example, IA time could be modeled as a distribution function, and queuing theory models [26] could be utilized.
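A minimal sketch of how such edge annotations might be derived from a timestamped trace; the (timestamp, LBN) tuple format and function name are our assumptions, not the paper's tooling:

```python
from collections import defaultdict

def mean_interarrival(trace, lo, hi, n_states=4):
    """trace: list of (timestamp, lbn) pairs in arrival order.
    Returns {(src_state, dst_state): mean inter-arrival time}, i.e. the
    IA annotation for each transition edge of the state diagram."""
    width = (hi - lo) / n_states
    state = lambda b: min(int((b - lo) / width), n_states - 1)
    ia_sum, ia_cnt = defaultdict(float), defaultdict(int)
    for (t0, b0), (t1, b1) in zip(trace, trace[1:]):
        edge = (state(b0), state(b1))
        ia_sum[edge] += t1 - t0     # accumulate IA time on this edge
        ia_cnt[edge] += 1
    return {e: ia_sum[e] / ia_cnt[e] for e in ia_sum}
```

Keeping the full list of IA samples per edge instead of a running sum would allow fitting the distribution functions mentioned above.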

C. Representing as XML schema

A representation in visual format should also be translatable into one that is machine-readable. In this section, we provide a method to represent the transition state diagrams in an XML schema that a synthetic workload generator could use to recreate the workload patterns. Figure 8 shows an illustration of such a schema for State1 in Figure 5 for the MSG-EMAIL workload. Note that distributions for inter-arrival time can be represented within the schema for more levels of detail. Providing the workload specification in an extensible, machine-readable format is useful for any synthetic generator regenerating the traces.
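A sketch of emitting such a schema programmatically via Python's standard `xml.etree.ElementTree`. The element and attribute names (`state`, `transition`, `probability`, `meanIA`) are illustrative guesses at a schema of this shape, not the paper's actual Figure 8 schema:

```python
import xml.etree.ElementTree as ET

def state_to_xml(state_id, lbn_range, edges):
    """edges: {dst_state_id: (transition_probability, mean_ia_ms)}.
    Serializes one state of the transition diagram to an XML fragment."""
    state = ET.Element("state", id=str(state_id), lbnRange=lbn_range)
    for dst, (prob, ia) in sorted(edges.items()):
        ET.SubElement(state, "transition", to=str(dst),
                      probability=f"{prob:.2f}", meanIA=f"{ia:.1f}ms")
    return ET.tostring(state, encoding="unicode")

# e.g. state_to_xml(1, "0-25%", {1: (0.85, 4.0), 2: (0.15, 12.5)})
```

A generator would parse these fragments back into a transition table and walk the chain, emitting IOs with the tagged probabilities and think times.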

VII. SERVER CONFIGURATION OPTIMIZATION

The objective of collecting traces and building a representation model for them is to guide server designs tailored to the applications running in the respective mega datacenters. We show how these methodologies can be utilized in rightsizing servers for file-based storage at two levels of design: (a) server component design and (b) datacenter architecture design.

A. Server Component Design

From our earlier analysis of the MAPS workload we know that a majority of the accesses are 64K reads, and that most of them are sequential. Hence we can optimize the storage controller prefetch mechanism to exploit the sequential patterns in the workload, intelligently prefetching 64K blocks to exploit locality of reference in the disk cache. By the same argument, MSG-EMAIL is a random workload; given that it services many customers concurrently, increasing the controller cache size may not be a good design choice for random reads. Our representation model can be used to generate workloads that evaluate the effect of a 256MB controller cache versus a 512MB one. We conducted this experiment and validated an operating range for file servers, given our observations about the number of outstanding IOs and the IOPS required of the storage system (as mentioned in Section V). We found that for lower queue depths and a random read workload, there is virtually no difference between a 256MB controller cache and a 512MB one. Choosing the smaller cache therefore saves $/server; though a small design decision in isolation, it adds up to significant investment and power savings at mega datacenter scale.
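The intuition behind that result can be reproduced with a toy model: uniform random reads over a working set far larger than the cache see a near-zero hit rate at either cache size, so doubling the cache buys almost nothing. The LRU policy, block counts, and address range below are our illustrative assumptions, not the controllers actually measured:

```python
import random
from collections import OrderedDict

def lru_hit_rate(accesses, cache_blocks):
    """Hit rate of a simple LRU block cache -- a crude stand-in for a
    storage controller's read cache."""
    cache, hits = OrderedDict(), 0
    for blk in accesses:
        if blk in cache:
            hits += 1
            cache.move_to_end(blk)          # mark most recently used
        else:
            cache[blk] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(accesses)

random.seed(0)
# Uniform random reads over a volume far larger than either cache.
reads = [random.randrange(1_000_000) for _ in range(200_000)]
small = lru_hit_rate(reads, 4096)   # stand-in for the 256MB cache
large = lru_hit_rate(reads, 8192)   # stand-in for the 512MB cache
```

Both hit rates come out negligible, echoing the 256MB-vs-512MB finding for random read workloads; a sequential trace run through the same model would behave very differently.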

B. Datacenter Architecture Design

Storage workload characterization also helps us understand the storage requirements of an application and how they change over time. An application written for a smaller memory or cache size may perform completely differently when new servers with larger memory or faster disk drives become available. This becomes a critical server selection parameter when a service has to be hosted on several thousand servers in a datacenter.

Given that demand for unstructured data is increasing at a tremendous pace, we need to pay close attention to the challenge of providing web service scalability alongside storage cost efficiency for these large scale online applications. From the analysis in Section V, we observe that the IOPS requirement of these applications is not as demanding as that of traditional transactional applications (e.g. databases). This argues for simpler storage subsystems built from commodity high-capacity SATA drives for optimal $/GB, scaled out for load distribution and performance across storage clusters. The analysis in this paper provides a framework to study these tradeoffs for future server deployments for these applications.

VIII. FUTURE WORK

The probabilistic state diagram shown in this paper is an elegant way to represent storage workload characteristics. However, it has the caveat that the analysis is effective primarily for steady state workload behavior, not for workloads whose traffic patterns vary widely over time and hence exhibit "choppy" behavior. The analysis is also representative only of the captured trace segment, which is an artifact of the particular server architecture, the associated storage components, and the time interval at which the trace was taken. To address these concerns, future work will focus on developing a methodology for picking representative trace segments from a longer capture of the



workload. If the workload exhibits phases, separate trace captures will be needed to represent each workload phase, and associated state diagrams will be needed for each phase. Identifying representative trace segments can be done manually but is a tedious task. Automating this activity will ensure easy availability of valid trace captures for further detailed analysis. There are precedents for such automation based on trace capture analysis done for CPU bound workloads [19].
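One simple heuristic along these lines (our sketch, not a method from the paper or from [19]): slice the trace into fixed windows, compute a per-window statistic such as the fraction of sequential accesses, and flag boundaries where that statistic shifts abruptly as candidate phase changes:

```python
def phase_boundaries(lbns, window=1000, threshold=0.3):
    """Return indices of windows whose fraction of sequential accesses
    (next LBN == previous LBN + 1) differs from the prior window by more
    than `threshold` -- a crude change-point cue for workload phases."""
    fracs = []
    for i in range(0, len(lbns) - 1, window):
        chunk = lbns[i:i + window + 1]      # overlap by one for the edge pair
        seq = sum(b == a + 1 for a, b in zip(chunk, chunk[1:]))
        fracs.append(seq / max(len(chunk) - 1, 1))
    return [i for i in range(1, len(fracs))
            if abs(fracs[i] - fracs[i - 1]) > threshold]
```

A segment between two flagged boundaries would then be a candidate representative trace for its phase; richer statistics (request size mix, read/write ratio, IA time) would make the cue more robust.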

Our future work will also address the challenge of correlation and correction: ensuring that the workload reproduced by a synthetic tool (which takes the state diagram representation as input) is iteratively tuned so that the replayed test patterns match the captured trace patterns of the original workload. We are also looking into mechanisms for adapting the state diagram to represent different storage system characteristics. This is especially interesting for studying emerging storage technologies such as solid state drives (SSDs): we could capture traces from existing disk-based subsystems and annotate them with the changes expected if the storage system were SSD-based. The result could then be used to analyze different flash-based system architectures for the workloads of interest.

IX. CONCLUSION

In this paper we analyzed storage workload characteristics for three unique user-facing web services hosted in mega datacenter environments. We also demonstrated a methodology for representing the detailed workload data via probabilistic state transition diagrams and an XML schema, enabling synthetic tools to regenerate the workload behavior in a test environment. We believe this body of work provides an extensible characterization framework that can be used for storage subsystem design, analysis, and tuning activities.

X. ACKNOWLEDGEMENTS

We thank Dileep Bhandarkar, Bruce Worthington, and Swaroop Kavalanekar for their valuable inputs. We would like to acknowledge the Hotmail, Windows Live Messenger, and Bing Maps teams for their help and clarifications throughout the trace collection and analysis phase. We would also like to thank the reviewers from the Server Standards team, Microsoft Research, and the external reviewers for their valuable feedback.

REFERENCES

[1] R. L. Villars, "IDC's Enterprise Disk Storage Consumption Model: Analytics and Content Depots Provide a New Perspective on the Future of Storage Solutions," IDC report #214066, Aug. 2008.

[2] http://itknowledgeexchange.techtarget.com/storage-soup/idc-unstructured-data-will-become-the-primary-task-for-storage/

[3] C. Ruemmler and J. Wilkes, "UNIX Disk Access Patterns," Proceedings of the Winter 1993 USENIX Conference, Jan. 1993, pp. 405-420.

[4] K. Keeton, A. Veitch, D. Obal, and J. Wilkes, "I/O Characterization of Commercial Workloads," Proceedings of the 3rd Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), Jan. 2000.

[5] T. Wong and J. Wilkes, “My Cache or Yours? Making Storage More Exclusive,” Proceedings of the USENIX Annual Technical Conference (USENIX), June 2002, pp. 161-175.

[6] D. Roselli, J. Lorch, and T. Anderson, “A Comparison of File System Workloads,” Proceedings of the 2000 USENIX Annual Technical Conference, June 2000, pp. 41-54.

[7] IOTTA Repository, Storage Networking Industry Association, http://iotta.snia.org/.

[8] “TPC Benchmark C, Standard Specification,” June 2007. Available: http://tpc.org/tpcc/spec/tpcc_current.pdf.

[9] “TPC Benchmark D (Decision Support), Standard Specification,” Feb. 1998. Available: http://tpc.org/tpcd/spec/tpcd_current.pdf.

[10] “TPC Benchmark H (Decision Support), Standard Specification,” Feb. 2008. Available: http://tpc.org/tpch/spec/tpch_262.pdf.

[11] R. McDougal, “FileBench: A Prototype Model Based Workload for File Systems, Work in Progress.” Available: http://www.solarisinternals.com/si/tools/filebench/filebench_nasconf.pdf.

[12] Database Test Suite, Database Test 2 (DBT-2). Available: http://osdldbt.sourceforge.net/#dbt2.

[13] A. Aranya, C. Wright, and E. Zadok, “Tracefs: A File System to Trace Them All,” Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST ), May 2004, pp. 129-143.

[14] J. S. Pendry, N. Williams, and E. Zadok. Am-utils User Manual, 6.1b3 edition, July 2003. Available: http://www.am-utils.org.

[15] M. Wang, A. Ailamaki, and C. Faloutsos, “Capturing the Spatio-Temporal Behavior of Real Traffic Data,” IFIP Intl. Symp. on Computer Performance Modeling, Measurement, and Evaluation (Performance), Sep. 2002, pp 147-163.

[16] M. E. Gomez and V. Santonja, "Analysis of Self-Similarity in I/O Workload Using Structural Modeling," Proceedings of the 7th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Oct. 1999.

[17] M. E. Gomez, and V. Santonja, “A New Approach in the Analysis and Modeling of Disk Access Patterns,” Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2000, pp. 172-177.

[18] M. E. Gomez and V. Santonja, "A New Approach in the Modeling and Generation of Synthetic Disk Workload," Proceedings of the 8th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2000.

[19] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Chicago, IL, June 2005.

[20] G. R. Ganger. “Generating representative synthetic workloads: An unsolved problem”, Proceedings of the Computer Management Group (CMG) Conference, pages 1263-1269, 1995.

[21] Mark W. Garrett and Walter Willinger. “Analysis, modeling and generation of self-similar VBR video traffic”. SIGCOMM, pages 269-280, 1994.

[22] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, "On the Self-Similar Nature of Ethernet Traffic," Proceedings of ACM SIGCOMM, San Francisco, CA, 1993, pp. 183-193.

[23] R. H. Riedi, M. S. Crouse, V. J. Ribeiro, and R. G. Baraniuk. “A multifractal wavelet model with application to network traffic”. IEEE Transactions on Information Theory, 45(4):992-1018, 1999.

[24] R. Riedi and J. Vehel. “Multifractal Properties of TCP Traffic: a Numerical Study”. IEEE Transactions on Networking, October 1997.

[25] S. Kavalanekar, B. Worthington, Q. Zha, V. Sharda “Characterization of storage workload traces from production Windows servers”. In Proc. IEEE International Symposium on Workload Characterization (IISWC), Seattle, WA, Sept. 2008.

[26] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, Upper Saddle River, NJ, 1984.

[27] I. Park, R. Buch, “Improve Debugging and Performance Tuning with ETW”, Microsoft Corporation, April 2007 http://msdn.microsoft.com/en-us/magazine/cc163437.aspx


