
Testing and Analyzing Performance from the User’s Perspective

Analyzing Performance for e-business

Authors: High Performance On Demand Solutions (Web address: ibm.com/websphere/developer/zones/hipods)
Technical contacts: Lu Sheng ([email protected]), Luis Ostdiek ([email protected]), Christopher Roach ([email protected]), Sun Ke ([email protected])
Management contact: Meng Ye/China/IBM@IBMCN
Date: September 25, 2006
Status: Version 1.0

Abstract: Many tests are available to help understand and improve the performance of a Web site. Tests that are performed from a user’s point of view have their benefits and limitations. This paper reviews tests from the user side that are useful in analyzing Web site performance; it includes a detailed discussion of the data that needs to be collected and analyzed, as well as available tools.

© Copyright IBM Corporation 2006 2 Testing Web Site Performance from the User Side

Executive summary

Performance is a critical issue for Internet-based e-business. Performance is measured by a variety of metrics, such as client-side latency for downloading a Web page and server CPU utilization. These metrics are influenced by internal and environmental factors, such as the number of concurrent users, quality of application code, network condition, design of Web pages, and number of servers. Almost all of these factors are interrelated and generally function as a whole. For example, increasing the number of servers decreases the number of concurrent users per server, which may reduce the CPU utilization of each server and the latency that users experience.

Sometimes a business owner encounters a Web site performance issue and finds it hard to determine which factor, or combination of factors, is the cause. Some factors, such as CPU utilization, can only be observed in the data center, while other factors, such as the condition of the last mile of the network, are only visible at the user location. Selecting the wrong physical location from which to perform the diagnosis may result in missing the actual cause.

This paper aims to present a systematic picture of what information can be acquired by tests from the perspective of the actual users (as opposed to, for example, employees) and how to disentangle factors from the data collected at the user side. The experience reported in this paper was derived from several customer cases, including local and global businesses. Some of the analysis methods proposed here are implemented as tools that can relieve analysts of unnecessary, arduous work.


Note: Before using this information, read the information in “Notices” on the last page.

Contents

Executive summary
Contents
Introduction
Performance metrics of a Web site
    Questions answered by tests at the user side
    Overall framework of metrics that can be tested from the user's point of view
    Latencies of Web access steps
    Web quality
    Network problems
Run test
    Automation and sample selection
Analysis
    Estimation
Summary
References
Acknowledgements
Notices


Introduction

Of the many methods and corresponding tools that measure network performance or data center performance, only a few measure performance from the user's perspective. One of the most widely used such tools is Gomez PeerReview [1], which employs about 10,000 end-user desktops at different connection rates to test end-to-end performance, network latencies, Internet service provider (ISP) and backbone condition, and so on. But its coverage is not wide enough; for instance, a customer of PeerReview may not be able to get a report about user latency in China's major cities. Thus performance analysts have to know what information can be obtained from a user side test, collect enough data, and analyze the data themselves.

This paper introduces a set of metrics to gather by testing at the user side and the benefits of this kind of test. In addition, it describes the metrics that need to be measured in and out of the data center. The data that needs to be collected and related tools are introduced.

Performance metrics of a Web site

Web site performance is measured from different points of view, each with related metrics. This paper focuses on performance metrics defined by where the metric is measured: inside or outside of the data center. Inside the data center, performance is measured by:

• Workload: the work a server (or a group of servers) handles, including the number of concurrent users, user arrival rate, and the number of concurrent connections.
• Latency: the time between when the user sends a request and when the server sends a response. Latency can be mapped to different tiers for more detailed analysis.
• System status: the utilization of server resources while handling workload. System status metrics include CPU utilization and memory utilization.

Performance testing tools like Mercury Interactive LoadRunner [2] can be used inside the data center; LoadRunner measures overall performance, including throughput and download time. Profiling tools are used to measure latency among tiers. Many business owners monitor system status with tools such as those from BMC [3] and those that implement the RMON [4] protocol. Figure 1 shows the difference between metrics measured inside and outside the data center.

Figure 1. Performance metrics inside and outside the data center

[Figure: a user reaches the data center across the Internet; metrics measured on the user's side of the boundary are "outside the data center," those beyond it "inside the data center."]


The boundary between the metrics that can be measured inside and outside the data center is not always clear. Workload may be generated at the user side or just in front of the data center by performance test tools like LoadRunner. Web page design quality and Web integrity tests can also be performed either at the user side or just in front of the data center. The following sections describe user metrics and how they should be measured.

Questions answered by tests at the user side

From a user's perspective, these questions can be asked:

• How good is the Web page design? The Web site organization?
• How good is the Web application code?
• How effective is the data center IT infrastructure? How much workload can it handle? Is the server up and available?
• Which server receives user requests?
• Where are the users located? What connection type do users use?
• What is the capacity of the network between the users and the server?

Some of these questions can be answered by tests at the user side, but data center tests are more effective for questions about workload and application code. Questions about Web page quality and Web site organization can be answered either at the user side or just in front of the data center. Some of these questions, such as user location and the condition of the last mile of the network, can only be answered by tests performed at the user side.

Tests performed at the user side can tell us:

• What is the user experience? What will a user experience at a given time, location, connection type, and ISP?
• Can we improve user side performance without changing data center settings? Some possible approaches: using a cache (and deciding where to place it); working around troubled links for most users; adjusting parameters for users, such as the DNS time to live (TTL) setting.

Overall framework of metrics that can be tested from the user's point of view

Performance at the user side can be measured from four perspectives:

• Workload: many users can generate a heavy workload on the servers.
• Location of user and server:
  • User location: city, connection type (dialup or T1 connection), ISP
  • Server location: replica location (if using a content delivery network or reverse proxy), server (or load balancer) location
• Web quality: can the Web site be improved? Factors include:
  • Web site organization
  • Web page design
  • Error rate
• Latency caused by the steps to access the Web site, which include:
  • Resolve DNS
  • Establish connection
  • Download page


In fact, any of these factors can be influenced by the others. For example, users in different cities or with different ISPs will experience different latencies; users connecting through different ISPs will also experience different error rates; Web page design and workload both contribute to latency; and so on. Among these perspectives, latency is the metric to which all the other factors are ultimately mapped. Although workload can also be tested from the user side, testing it in the data center is more suitable and effective.

Latencies of Web access steps

The many sources of latency at the user side are shown in Figure 2. In the figure, the time factors are:

• TDNS is the time required to resolve the domain name
• Tconnection is the time required to establish a TCP connection
• Tdownload is the time required to download an object
• Tprocess is the time required for the Web server to generate the Web page (in a data center test, it is the latency between request reception and response sending)
• Treplicate is the time required to replicate the object content to a cache server
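The per-step times visible from the user side (TDNS, Tconnection, Tdownload) can be measured with ordinary standard-library calls. The sketch below is a minimal illustration, not the paper's RemoteEye tool: it times one plain-HTTP object fetch, the host and path arguments are whatever the tester supplies, and a real browser that pipelines many objects and reuses connections would behave differently.

```python
import socket
import time
import http.client

def time_request(host, path="/", port=80):
    """Measure T_DNS, T_connection, and T_download for one HTTP request.

    A rough sketch: browsers fetch many objects and may reuse connections,
    so these numbers only approximate a single-object fetch.
    """
    t0 = time.perf_counter()
    ip = socket.gethostbyname(host)          # T_DNS: resolve the name
    t1 = time.perf_counter()

    conn = http.client.HTTPConnection(ip, port, timeout=10)
    conn.connect()                           # T_connection: TCP handshake
    t2 = time.perf_counter()

    conn.request("GET", path, headers={"Host": host})
    body = conn.getresponse().read()         # T_download: first byte + transfer
    t3 = time.perf_counter()
    conn.close()

    return {"t_dns": t1 - t0, "t_connection": t2 - t1,
            "t_download": t3 - t2, "bytes": len(body)}
```

Splitting t_download further into first byte time and transfer time would require timing the response headers and body reads separately.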

Figure 2 Reasons for latency

Figure 2 shows how the latency of one HTTP request results from three consecutive steps: DNS resolution, connection establishment, and content download. A real case may be more complex. Some requests do not need the DNS step because the IP address for the domain name may be cached locally. For secure sockets layer (SSL) connections, the connection establishment step requires more interaction between server and client. Replication is unnecessary for Web sites without any cache server.

[Figure: the client program (browser) resolves the name through the local DNS server, root DNS servers, gTLD servers, and other DNS servers (TDNS); establishes a TCP connection to the Web or cache server (Tconnection); then exchanges the HTTP request and response (Tdownload). The Web server's work with the application server and database takes Tprocess; a cache server adds Treplicate + Tprocess.]


Each step can be divided into substeps. Figure 3 illustrates how DNS resolution time is composed of the latencies of all the DNS resolution queries. CNAME and A are DNS response types: CNAME indicates a DNS alias; A is an IP address. The domain names and IP addresses in the figure do not really exist; they were created to demonstrate the DNS resolution steps.

Figure 3. Multiphase DNS resolution path

In Figure 3, the local DNS server or client computer digs through the Internet DNS infrastructure (searching with many queries) from the root down to the DNS server holding the queried name, sending out several requests along the way. Though it is hard to get the detailed resolution time of each query from the user computer, it is possible to estimate their relative importance from the TTL settings. The answer to a DNS request carries TTL values that indicate how long the results will be cached; the shorter the TTL, the more often that query must be repeated, and the more important it is to overall DNS latency.

The time required to establish a TCP connection is determined by network condition and server workload. If we ignore the server processing time due to TCP queuing, the TCP connection establishment time is roughly the round trip time (RTT) between the user and the server (Web server or cache server) to which the user is connected. The section Network problems explains the causes of latency in this step.

The time required to download content (or the request/response time) is the latency from the HTTP request until the page object is completely downloaded. It is composed of the time before the first byte of the response arrives and the transmission time for the page content. Gomez PeerReview calls the time before the first byte arrives, which is the combination of RTT and server processing time, the first byte time. If there are replicas in between (for example, a content delivery network), the replicate time should also be added.
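The link between TTL and query importance can be made concrete: a record with TTL t must be re-resolved roughly every t seconds, so its query cost recurs more often over a test window. The sketch below is our own illustration of that arithmetic; the TTL values are invented, chosen only to mirror the common pattern where a CDN alias carries a short TTL so traffic can be re-steered quickly.

```python
def dns_cost_weights(ttls, window_seconds=3600):
    """Estimate how often each query in a resolution chain must be
    repeated over a test window, given the TTL (seconds) of each answer.

    A record with TTL t expires about window/t times per window; the
    normalized weights show which queries dominate long-run DNS latency.
    """
    repeats = {name: max(1, window_seconds // ttl) for name, ttl in ttls.items()}
    total = sum(repeats.values())
    return {name: r / total for name, r in repeats.items()}

# Hypothetical chain mirroring Figure 3: the CDN's A record has a
# short TTL, so it accounts for almost all repeated resolutions.
ttls = {"www.comexample.com (CNAME)": 86400,
        "a21.cn.cacheprovider.com (A)": 60}
weights = dns_cost_weights(ttls)
```

Here the 60-second record is re-resolved about 60 times per hour versus once for the day-long CNAME, so it carries nearly all of the weight.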

[Figure: www.comexample.com resolves through two aliases before yielding an address: www.comexample.com → CNAME www.comexample.com.cachedns.com → CNAME a21.cn.cacheprovider.com → A 1.2.3.4. Each name in the chain is resolved top-down: get com. from the root, get the zone's server from com., then get the CNAME or IP from the authoritative server; for the last name, an extra step gets cn.cacheprovider.com. from cacheprovider.com.]


Web quality

There are at least three levels of Web quality:

• Web site organization: for example, whether the time to download the home page is acceptable
• Web page design: for example, whether there are unnecessary characters, whether a script is inefficient, or whether some objects are too big
• Error: for example, whether the page contains broken links, or whether the application produces an unexpected error message

Several tools check Web quality. The Page Detailer component of IBM® WebSphere® Studio [5] is one useful tool for analyzing Web page quality. Figure 4 is an example of a Page Detailer analysis.

Figure 4 Sample of Page Detailer page analysis

Figure 4 shows the objects making up a Web page: DNS resolution time (light blue bar), TCP connection time (yellow bar), first byte response time (dark blue bar), and content download time (green bar) are all displayed intuitively. However, Page Detailer is limited in that it does not provide a detailed analysis of DNS cost or of TCP connection time caused by network problems; these measures are not the focus of the tool.
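A small, scriptable piece of what Page Detailer does is simply enumerating the embedded objects a page forces the browser to fetch; big or numerous objects are a Web page design problem, and unreachable ones are broken links. The sketch below only parses HTML for object references (object sizes would come from separate fetches), and the tag/attribute list is a deliberate simplification, not how Page Detailer itself works.

```python
from html.parser import HTMLParser

class ObjectLister(HTMLParser):
    """Collect the URLs of embedded objects a browser would fetch.

    Simplified: only img/script/link tags are inspected; CSS url()
    references, iframes, and dynamically loaded objects are ignored.
    """
    SRC_TAGS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        attr = self.SRC_TAGS.get(tag)
        if attr:
            for name, value in attrs:
                if name == attr and value:
                    self.objects.append(value)

def list_page_objects(html):
    """Return the object URLs referenced by an HTML page, in order."""
    parser = ObjectLister()
    parser.feed(html)
    return parser.objects
```

Feeding each listed URL to an HTTP HEAD request would then yield per-object sizes and error statuses, approximating the per-object view in Figure 4.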


Network problems

Network problems cause network latency. A network problem may come from bottlenecks such as busy routers and congested links. Route discovery tools can be used to locate the trouble between the user and the server. However, because the WHOIS database cannot be fully trusted, even when the path is traced out we do not know for sure which router belongs to whom and where it is [6]. The thrashing of core routers' route tables is another challenge in addressing network issues: a route may be quite dynamic, so it is hard to say which route the request/response packets actually follow. Sometimes topology information is required instead of a single route path.

Traceroute is the most commonly used tool to discover the path between two end points (user and server) and the RTT to each router in between. There are also bandwidth testing tools that determine the bottleneck and network utilization of each link between the user and the server. These tools use Internet control message protocol (ICMP) packets, which are filtered by many routers, so the tools cannot work in many circumstances. The bandwidth testing tools use a method named network self timing, which sends back-to-back large packets to measure link utilization and latency. Network self timing can only determine the bandwidth and utilization from the higher bandwidth link to the lower bandwidth link, so it is best used at the server side rather than the user side.
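When ICMP is filtered, a practical user-side fallback is to sample the TCP connect time repeatedly (as in the earlier latency discussion, connect time approximates RTT) and summarize the samples. The statistics below, minimum as a queue-free RTT estimate and the median-minus-minimum gap as a congestion indicator, are a common measurement convention we are adopting, not something the paper prescribes.

```python
import statistics

def summarize_rtts(samples_ms):
    """Summarize repeated RTT samples (e.g., TCP connect times in ms).

    The minimum approximates the propagation RTT with no queuing; the
    gap between the median and the minimum hints at path congestion.
    None entries mark timed-out probes and feed the loss-rate estimate.
    """
    ok = [s for s in samples_ms if s is not None]
    loss = 1 - len(ok) / len(samples_ms)
    return {"min_ms": min(ok),
            "median_ms": statistics.median(ok),
            "jitter_ms": statistics.median(ok) - min(ok),
            "loss_rate": loss}
```

A large jitter_ms or loss_rate from one test point, when other points to the same server look clean, suggests the trouble lies near that user rather than near the data center.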

Run test

User side testing can collect enough data to answer the questions listed in the previous section. Different purposes require different data, the most important of which is the latency measurement. Ideally, latency should be separated into its steps: DNS resolution time, TCP connection time, first byte response time, and content download time should be recorded separately. In addition to latency, error counts and server response status are useful. Route changes and the RTT of each route can also provide valuable clues for identifying network problems.

Automation and sample selection

Sample selection is the most important factor when testing at the user side. A business has its own user community, so testing with users all over the world may not be a reasonable choice. For example, one customer of the HiPODS team serves the China market, and we assume its users reside in China's major cities; that customer ought to select sample users from China's major cities with major ISPs. Another HiPODS customer serves China's rural market, in which case users from small cities and rural areas should be sampled. The samples should cover real users and their behavior.

Almost all well-known providers of user side testing use automatic tools installed on user computers worldwide. That works fine for some global businesses. Businesses with a special user community, however, need to select testers who match that community. If the business is already running, analyzing the Web logs is the best way to find out where the users are located; user behavior can also be learned from the logs before testing begins. For example, if most users of the Web site come from New York, it is better to test the user side with users from New York.
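The log-analysis step can start as simply as counting requests per client network. The sketch below groups a Common Log Format access log by IP prefix as a stand-in for a real IP-to-geography lookup, which the paper assumes but does not name; mapping prefixes to cities and ISPs would require a separate geo/WHOIS database.

```python
from collections import Counter

def requests_by_prefix(log_lines, prefix_octets=2):
    """Count access-log requests per client IP prefix (/16 by default).

    Assumes Common Log Format, where the client IP is the first
    space-separated field of each line.
    """
    counts = Counter()
    for line in log_lines:
        ip = line.split(" ", 1)[0]
        prefix = ".".join(ip.split(".")[:prefix_octets])
        counts[prefix] += 1
    return counts
```

The most frequent prefixes indicate the networks, and after geo lookup the cities and ISPs, from which test users should be recruited.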


The HiPODS China team has developed RemoteEye, a tool that gathers and analyzes data from the user side while the user is connected to the Internet. Figure 5 shows the RemoteEye interface. The tool's user enters the duration of the test period, the list of Web addresses to test, the output log file, and the test interval; the tool then collects data and records it to the log file automatically.

Figure 5 Test interface of RemoteEye


Analysis

All the aspects mentioned in the section Performance metrics of a Web site are analyzed. We believe some combinations of these aspects may provide more useful information, and all the metrics can be organized into a time chart. Figure 6 shows how RemoteEye reports latency distribution information.

Figure 6 Client-side latency distribution of one point


Figure 7 compares the latency of two Web addresses from one test point.

Figure 7 Latency comparison

Estimation

Data collected at the user side can be used to estimate further quantities, such as server processing time. The first byte response time is the sum of the RTT, server processing time, and replicate time. For requests that do not go through a cache, the replicate time can be ignored. Because the RTT is close to the TCP connection time, server processing time can be approximated as first byte time minus TCP connection time. But this is only a rough estimate: because of network jitter, it can even come out less than zero. Testing in the data center remains a better way to estimate server processing time.

If the Web site uses SSL over TCP, more handshakes are required than just the one that establishes the TCP connection. In that case, the TCP connection establishment time and the SSL connection setup time should be tested separately to get more useful data.

The cache replicate time, which cannot be determined by data center tests, can also be estimated with the same approach. If two tests are performed, one against the data center and one against the cache, we have:

Tcache = Tconnection2cache + Treplicate + Tprocess    (1)
Tdatacenter = Tconnection2datacenter + Tprocess    (2)

From formula (2), Tprocess can be estimated; substituting it into formula (1) then gives Treplicate.
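Formulas (1) and (2) combine by direct substitution. The helper below just performs that arithmetic on measured first-byte and connect times; as the text cautions, network jitter can drive either estimate negative, so the results should be treated as rough.

```python
def estimate_times(t_cache, t_conn_cache, t_datacenter, t_conn_datacenter):
    """Estimate T_process and T_replicate from paired user-side tests.

    From formula (2): T_process   = T_datacenter - T_connection2datacenter
    From formula (1): T_replicate = T_cache - T_connection2cache - T_process
    All arguments and results are in the same time unit (e.g., ms).
    """
    t_process = t_datacenter - t_conn_datacenter
    t_replicate = t_cache - t_conn_cache - t_process
    return t_process, t_replicate
```

For example, a 260 ms first byte time against the data center with a 60 ms connect time gives T_process = 200 ms; a 300 ms first byte time against the cache with a 40 ms connect time then leaves T_replicate = 60 ms.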


Summary

Testing the performance of a Web site from the user side can reveal information that is not available when testing is restricted to the data center. Examples include the performance of the last mile of the network, which server is really connected to which user, the DNS resolution time and network RTT between users and the server, and how well the content delivery network is working. These performance metrics can be organized by time, location, connection type, ISP, and so on. User side testing can help improve Web site quality and Web page design. Finally, it can improve the understanding of performance under different workloads and numbers of concurrent users.

User side testing has limitations. For example, it is not possible to obtain server performance metrics such as CPU and memory use, or to determine which server or process tier is the bottleneck. Careful selection of user samples is the most important factor in avoiding bias in test results. Lastly, maintaining the user panel can be difficult.

This paper organizes these benefits and limitations into an easily understood framework that can help performance analysts know how best, when, and where to test performance from the user side.

References

[1] Learn about Gomez PeerReview at www.gomez.com/
[2] Learn about LoadRunner at www.mercury.com/us/products/loadrunner/
[3] Learn about BMC, Inc. at www.bmc.com/
[4] S. Waldbusser, Remote Network Monitoring Management Information Base, Internet Request for Comments RFC 2819, May 2000.
[5] Learn about Page Detailer at www.alphaworks.ibm.com/tech/
[6] K. C. Claffy, "Internet measurement: what have we learned?", SANE 2006, May 2006.

Acknowledgements

I would like to express my most sincere thanks to Marsha Brundage, Susan Holic, Sun Ke, Ye Meng, Luis Ostdiek, Li Hai Ping, and Christopher Roach for their help and knowledge in developing case studies about customer engagements.


Notices

Trademarks

The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM, WebSphere. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Other company, product, and service names may be trademarks or service marks of others.

Special Notice

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Performance data contained in this document were determined in various controlled laboratory environments and are for reference purposes only. Customers should not adapt these performance numbers to their own environments as system performance standards. The results obtained in other operating environments may vary significantly. Users of this document should verify the applicable data for their specific environment.

