
Parallelism Analysis of Prominent Desktop Applications: An 18-Year Perspective

Siying Feng, Subhankar Pal, Yichen Yang, Ronald G. Dreslinski
University of Michigan, Ann Arbor, MI, USA
{fengsy, subh, yangych, rdreslin}@umich.edu

Abstract—Improvements in clock speed and exploitation of Instruction-Level Parallelism (ILP) hit a roadblock in the mid-2000s. This, coupled with the demise of Dennard scaling, led to the rise of multi-core machines. Today, multi-core processors are ubiquitous, and architects have moved to specialization to work around the walls hit by single-core performance and chip Thermal Design Power (TDP). The pressure of innovation in the aftermath of Dennard scaling is shifting to software developers, who are required to write programs that make the most effective use of the underlying hardware. This work presents quantitative and qualitative analyses of how software has evolved to reap the benefits of multi-core and heterogeneous computers, compared to state-of-the-art systems in 2000 and 2010. We study a wide spectrum of commonly used applications on a state-of-the-art desktop machine and analyze two important metrics, Thread-Level Parallelism (TLP) and GPU utilization.

We compare the results to prior work over the last two decades, which concluded that 2-3 CPU cores are sufficient for most applications and that the GPU is usually under-utilized. Our analyses show that the harnessed parallelism has improved and that emerging workloads show good utilization of hardware resources. The average TLP across the applications we study is 3.1, with most applications attaining the maximum instantaneous TLP of 12 during execution. The GPU is over-provisioned for most applications, but workloads such as cryptocurrency mining utilize it to the fullest. Overall, we conclude that the effectiveness of software in utilizing the underlying hardware has improved, but there is still scope for optimization.

Index Terms—Benchmarking, Multi-Core, Desktop Applications, Thread-Level Parallelism, GPU Utilization, Virtual Reality, Cryptocurrency Mining, Characterization

I. INTRODUCTION

Innovation in improving single-threaded performance hit a plateau in the early 21st century. Anticipating the end of Dennard scaling, which states that the power density of a chip remains almost constant across technology nodes [10], the hardware industry swiftly pivoted towards multi-core processors. The post-Dennard scaling era is plagued by the problem of dark silicon, caused by improvements in cooling technology failing to keep up with the Thermal Design Power (TDP) requirements of chips in newer technology nodes [11]. Today, the hardware world is experiencing a move to specialization, trading away silicon area for gains in energy efficiency [32, 37]. Modern systems are almost ubiquitously heterogeneous, combining the CPU with a GPU and/or fixed-function accelerators. Common desktop systems in homes and offices have 4-8 logical CPUs with a discrete GPU connected via PCI-Express.

User requirements are significantly diverse. Gamers, for example, require a high-end machine with an advanced GPU and efficient cooling. On the other hand, for a user who uses their system primarily to browse the web and watch videos, it is cost-inefficient to own a system with a high-end GPU. Thus, it is worthwhile to analyze the characteristics of commonly used applications tailored toward different users.

We repeat some of the experiments of an eight-year-old study by Blake et al. [3], characterizing the parallelism exploited by software on desktop workstations. To analyze how a wide spectrum of commonly used desktop applications has evolved to utilize the available hardware, we use the metrics of Thread-Level Parallelism (TLP) and GPU utilization. Our application suite consists of a diverse choice of traditional desktop applications, such as web browsers, video authoring utilities and media players, as well as emerging applications, such as personal assistants, cryptocurrency miners and virtual reality (VR) games. In addition to the metrics mentioned earlier, we analyze the effect of core scaling and simultaneous multi-threading (SMT) on these applications.

This work attempts to answer the following questions:
• Have modern versions of legacy software, and the newer software that has replaced them, kept up with advances in hardware technology?
• How well do contemporary and emerging applications utilize the parallelism in the underlying hardware?
• What is the impact of core scaling, SMT and the GPU on the performance of applications?

The rest of the paper is organized as follows. Section II introduces the parallelism analyses done 18 and 10 years ago [3, 13, 14], and the emerging applications that have evolved from advancements in hardware since then. Section III describes the system used for benchmarking, the metrics we study, the trace collection methodology and the automation technique used to obtain consistent measurements. Section IV details each testbench and Section V evaluates trends in parallelism. Section VI presents work related to the characterization of different workloads. We discuss key takeaways and suggestions for software developers to better harness the hardware in Section VII and conclude in Section VIII.

II. BACKGROUND AND MOTIVATION

This study provides an 18-year perspective on the evolution of parallelism in desktop workloads. In early 2000, when uniprocessors were prevalent, Flautner et al. [13, 14] evaluated the TLP of existing desktop applications on a symmetric multiprocessor (SMP) with 2-4 cores. The average TLP observed across all benchmarks was lower than 2, and only specific workloads, such as video encoding, benefited from more processing cores. However, a second processor improved the responsiveness of interactive applications. 10 years later, Blake et al. [3] presented a study of TLP for commercial desktop applications on an 8-core processor with SMT. They concluded that 2-3 processor cores were still more than sufficient for most applications and that the GPU was mostly underutilized.

Another 8 years have passed since then, and desktop machines have evolved into a combination of CPU, GPU, and fixed-function hardware. The prevalence of multiprocessors begs the question of how software developers have been catching up with the advancements in hardware. We analyze the TLP and GPU utilization of a wide variety of commonly used applications on a state-of-the-art desktop with 6 SMT cores. GPU utilization measures the average amount of GPU usage over time, and TLP characterizes the amount of concurrency with idle time factored out.

We also evaluate emerging workloads that have gained popularity in recent years, including virtual reality (VR) games, cryptocurrency miners, and personal assistants. The first commercial VR headset was not released until 2016, and currently there are more than 150 million active VR users worldwide [23]. Cryptocurrency mining has experienced tremendous growth over the past decade, reaching a total market capitalization of over 200 billion USD [1]. However, both the immersive gaming experience provided by VR and the computational complexity of cryptocurrency mining have non-trivial hardware requirements. For personal assistant applications, user demands have been scaling since Apple introduced Siri in 2011 [19]. Although personal assistant applications rely heavily on datacenters to offload the complex part of the workload, it is worthwhile to explore how much parallelism is exploited by the work performed locally.

III. METHODOLOGY

A. System Setup

Moore’s law has continued, albeit at a slower pace, led by incremental advances in manufacturing technology and hardware architecture over the past decade. The system used by Blake et al. [3] employed a dual-socket CPU with four 2.26 GHz 4-way out-of-order cores per socket, along with an 8 MB last-level cache and 6 GB of RAM. We built a desktop machine with state-of-the-art hardware components, representative of a high-end gaming rig by current standards. Table I shows the specifications of our system. A dual-socket setup is not used in this work, since it is more prevalent in server-class machines and generally over-provisioned for desktop workloads. The processor, operating at 3.70 GHz with Turbo Boost to 4.70 GHz, consists of six superscalar cores with a 12 MB last-level cache, and the system has 64 GB of RAM. Each core supports 2-way hyper-threading, providing us with a maximum of 12 logical cores. The processor is also equipped with an integrated graphics processor and specialized hardware blocks, such as Quick Sync Video (QSV) [21].

TABLE I: Specifications of the benchmarking desktop system.
CPU:      Intel Core i7-8700K, 3.70-4.70 GHz, 6 cores / 12 threads
Graphics: NVIDIA GTX 1080 Ti, 1481 MHz, 3584 CUDA cores
RAM:      64 GB (16 GB × 4) DDR4 @ 3200 MHz
Storage:  2 TB (1 TB × 2) PCIe NVMe SSD
OS:       Windows 10 Education Version 1803

Despite the processor having an integrated GPU, we evaluate a discrete GPU, which is the more practical setup in desktop systems. The GTX 285, used by Blake et al., consists of 240 CUDA cores operating at 648 MHz [30]. This work uses the GTX 1080 Ti, which has 3584 CUDA cores (∼15× more), running at 1481 MHz (∼2× faster) [29]. Some benchmarks are also tested with the GTX 680, which operates at 1006 MHz with 1536 cores, to evaluate the differences in performance and utilization between a high-end and a mid-range GPU [31].

Windows 10 is used in this work, as it supports a wide range of commercial applications that are commonly used by desktop consumers. Windows accounts for more than 80% of the global desktop operating system market share [36]. Moreover, some applications in the benchmark suite, such as VR games, are only supported on Windows.

B. Metrics

The formula for TLP is shown in Equation 1, where c_i denotes the fraction of execution time during which i threads run simultaneously on a system with n logical CPUs, and c_0 represents the idle time of the application.

    TLP = (∑_{i=1}^{n} c_i · i) / (1 − c_0)    (1)

The measurements reflect application-level TLP, which measures the TLP of only those processes that pertain to the application under consideration, in contrast to the system-wide TLP measured by Blake et al. [3] and Flautner et al. [13, 14]. Unlike system TLP, application TLP exposes the application's behavior directly.
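As an illustration of Equation 1, the following minimal sketch computes TLP from a measured concurrency distribution. The distribution values below are hypothetical, not taken from the paper's measurements.

```python
def tlp(c):
    """Compute Thread-Level Parallelism from a concurrency distribution.

    c[i] is the fraction of execution time during which exactly i threads
    run simultaneously (c[0] is the idle fraction); entries sum to 1.
    """
    idle = c[0]
    if idle >= 1.0:
        raise ValueError("application was idle the entire time")
    # Weighted average of active thread counts, with idle time factored out.
    return sum(i * ci for i, ci in enumerate(c)) / (1.0 - idle)

# Hypothetical distribution on a 4-logical-CPU system:
# idle 20% of the time, 1 thread 40%, 2 threads 30%, 4 threads 10%.
example = [0.2, 0.4, 0.3, 0.0, 0.1]
print(round(tlp(example), 2))  # (0.4*1 + 0.3*2 + 0.1*4) / 0.8 = 1.75
```

Note that because idle time is excluded from the denominator, a mostly idle interactive application can still report a TLP near its burst concurrency.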

For GPU utilization, we consider the amount of time spent by work packets actually running over a period of time, where a packet is defined as a large collection of Application Programming Interface (API) calls packaged into a command stream. GPU utilization is measured by aggregating, over all packets, the ratio of packet running time to total time.
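The packet-based aggregation described above can be sketched as follows. The packet intervals are hypothetical; since each packet's ratio is summed individually, concurrently executing packets can push the figure to 100% and beyond, which matches the footnote for PhoenixMiner in Table II.

```python
def gpu_utilization(packets, total_time):
    """Aggregate GPU utilization (%) over a trace window.

    packets: list of (start, finish) times for each work packet, where a
    packet is a command stream of batched API calls.  Each packet's running
    time is divided by the total trace time and the ratios are summed.
    """
    busy = sum(finish - start for start, finish in packets)
    return 100.0 * busy / total_time

# Hypothetical 10-second trace with three non-overlapping packets.
print(gpu_utilization([(0.0, 2.0), (3.0, 6.5), (8.0, 9.0)], 10.0))  # 65.0
```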

C. Trace Collection

Event Tracing for Windows (ETW) is a kernel-level tracing feature that allows logging of application-defined events. We use UIforETW [33], a wrapper around ETW, to collect Event Trace Log (ETL) traces, after ending unrelated background processes. The traces are then analyzed using the built-in Windows Performance Analyzer (WPA). Within WPA, we extract the relevant data columns from the CPU Usage (Precise) Timeline by CPU analysis for TLP, and from the GPU Utilization (FM) analysis for GPU utilization. wpaexporter is used to automate the extraction of relevant data from WPA. Lastly, we use custom scripts to process the outputs of wpaexporter. Figure 1 summarizes the measurement workflow for collecting TLP and GPU data. We cross-validate the GPU data against the values reported by WPA.

Fig. 1: Trace collection workflow to extract TLP and GPU utilization. (The diagram shows user inputs (mouse, keyboard, voice, VR) driving the application while UIforETW records a trace. The saved .etl file is processed by WPA: the TLP collection flow extracts the Process, CPU, Ready Time and Switch-In Time columns to gather the number of active cores at each point in time, while the GPU utilization collection flow extracts the Process, Start Execution and Finished columns to gather the amount of active GPU time in each interval; both flows emit .csv files.)

D. Testing Automation

While conducting multiple iterations of experiments with the same applications, it is crucial to maintain consistency across events such as mouse clicks and keystrokes. AutoIt is an automation language designed for Windows that can simulate keystrokes and mouse activities at user-specified times. Testing automation mitigates the variations created by user interactions across different test iterations. For each application that can be automated by AutoIt, we construct an automation script that launches the application and performs a carefully designed sequence of mouse and keyboard activities.

We inspect the effect of AutoIt on TLP and GPU utilization by comparing experiments on an application with a high amount of user interaction (PowerDirector) and one with non-trivial GPU utilization (VLC Media Player). The TLP for manual testing was 3.3% lower than with automated testing, and the GPU utilization was 2.4% lower with AutoIt than with manual testing. This demonstrates that the use of AutoIt does not significantly distort the results in this work.

E. Manual Testing

Some applications accept user inputs that cannot be precisely reproduced by automation tools. Personal assistants, for instance, require audio inputs. We therefore apply a fixed sequence of requests and questions with strict timing constraints and use the same person's voice for all test iterations.

VR games have a diverse set of inputs to track the position and actions of the player, such as signals from motion sensors and controllers. Moreover, VR game scenes vary in a non-deterministic manner, requiring players to take different actions to survive in the game; repeating the exact same actions can cause larger variation across test iterations. Therefore, it is both infeasible and undesirable to provide identical inputs to VR games. To maximize the similarity between test iterations, we choose the campaign mode for each game to create similar scenarios and survive for a predefined amount of time. For games without a campaign mode, we start from a fixed checkpoint and perform similar actions each time.

IV. BENCHMARKS

We construct a suite of common desktop applications based on their popularity among users, and create as much overlap as possible with the benchmarks used in [3, 13, 14] to understand how software has adapted to advancements in hardware. The version number of each application is specified in Table II.

A. Image Authoring

Image authoring exhibits a high degree of parallelism during rendering. We experiment on a 2D image editor, a 2D/3D CAD application, and a 3D animation modeling application.

Adobe Photoshop: Photoshop is an advanced graphics editing and design tool. Five custom filters are applied serially to a 100-megapixel photograph.

Autodesk AutoCAD: AutoCAD is a CAD tool used in architecture, engineering, etc. We import a floorplan, pan, zoom, draw, fillet the edges, mirror, and enter text.

Autodesk Maya 3D: Maya is a 3D graphics software package used for making animated movies, 3D modeling, etc. We open a complex model, smooth it, perform a software render with raytracing followed by a hardware render with fog, motion blur and anti-aliasing, then rotate, pan and zoom the camera.

B. Office

Office productivity applications are ubiquitous. We include a suite of applications that support common office tasks.

Adobe Acrobat Pro: Acrobat Pro is designed for PDF editing. We scan documents, combine different files into one PDF, manipulate the pages, insert links, watermarks and signatures, and export the PDF into slides.

Microsoft Excel: Excel is a popular spreadsheet editor. We open a spreadsheet containing 1 million rows of text and numbers, copy multiple columns, zoom, pan, change the layout, compute means, sort and filter rows, and plot a histogram.

Microsoft Outlook: Outlook is a desktop email client by Microsoft. We compose a new email, save and delete the draft, search for and reply to a specific email, delete and recover an email from the inbox, move an email in and out of the junk folder, and finally categorize emails and perform a filter operation.

Microsoft PowerPoint: PowerPoint is a tool that lets users create slides and deliver presentations. We open a complex template, add bullet points and format them, add shapes and animate them, add a picture, scale and rotate it, and finally create a table and populate it with text and numbers.

Microsoft Word: Word is Microsoft's flagship document-processing utility, used for tasks such as writing letters and preparing reports. We create a new document, add and delete text, change formatting, and insert, delete, scale and move images.

C. Multimedia

Despite the booming popularity of online video streaming and sharing, multimedia players remain in wide use. This work analyzes QuickTime Player, Windows Media Player and VLC Media Player. For each application, a 480p and a 1080p version of the same video are played in succession.


D. Video Authoring and Transcoding

Video authoring and transcoding utilities enable users to edit videos and convert high-definition videos into various formats. GPUs are typically used in such workloads to assist in performing highly parallel, compute-intensive tasks.

CyberLink PowerDirector: PowerDirector is a video editing software package that allows composing and editing video clips. We import three clips into PowerDirector, add transitions, titles and color correction, and render the result with and without CUDA support.

Adobe Premiere Pro: Premiere Pro is a video editor geared toward professionals. We repeat the same operations as for PowerDirector, with slight differences in filters and transitions.

HandBrake: HandBrake is an open-source video transcoder. We use it to transcode part of a 3840×2160-resolution high-quality video at 50 frames per second (FPS) to a 1920×1080-resolution MP4 video at 30 FPS.

WinX HD Video Converter: WinX is a video transcoding utility supporting GPU acceleration. We repeat the same test sequences that were used for HandBrake.

E. Web Browsing

The popular web browsers Chrome, Firefox, and Edge are chosen for the benchmark suite. Chrome and Firefox make up almost 80% of the desktop web browser market [35], and Edge is the built-in browser for Windows 10 that succeeded Internet Explorer. We perform tests similar to those in the prior work by Blake et al. [3] to study the impact of improved software.

For the first two tests, we watch a random video on YouTube, then browse ESPN, CNN and BestBuy, and finally play a flash game. We use a different tab for each website in the first test and a single tab to browse all websites in the second test. In the third test, we browse ESPN, which has plenty of active content (e.g., ads and videos), and in the final test, we browse Wikipedia, which has little active content.

F. Virtual Reality (VR) Gaming

VR headsets provide players with an immersive gaming experience that traditional 3D games cannot offer. The diverse set of sensors and the advanced rendering techniques necessary for VR gaming place non-trivial pressure on the hardware. We select a collection of VR games that have large player communities, high user ratings and intensive graphics, and use the highest settings for all games to stress the GPU to the maximum.

Arizona Sunshine: We play the game in single-player Horde mode, surviving multiple waves of zombies.

Fallout 4: We continue from a saved checkpoint where the character has escaped from the nuclear fallout shelter.

Serious Sam 3: We play the game in survival mode; due to the difficulty of surviving continuously for 3 minutes, we continue playing after being killed and respawned.

Space Pirate Trainer: We play the game in “old school” mode, which involves surviving multiple waves of pirate bots.

Project CARS 2: We start a quick race with the default car and track, and race 1-2 laps against multiple other drivers.

RAW Data: We play the game in campaign mode, surviving waves of attacking humanoid robots and protecting an object.

G. Cryptocurrency Mining

Cryptocurrency miners validate transactions by performing computations on blockchains. The benchmark suite encompasses four miners: Bitcoin Miner and EasyMiner for Bitcoin, and PhoenixMiner and Windows Ethereum Miner for Ethereum. Each one is run for a predefined amount of time.

H. Personal Assistant

The prevalence of machine learning and natural language processing has given rise to personal assistant applications. Apart from Cortana, which is built into Windows, we also analyze Braina, a multi-functional interactive AI software package. The tested queries cover requests for daily news, weather forecasts, alarm/reminder management, and questions about general knowledge, word definitions and simple math problems.

V. EVALUATION

Detailed analyses of our benchmarks are presented in this section. We summarize the TLP and GPU utilization of all the applications in our suite and compare them against prior work [3, 13, 14]. We then evaluate the impact of core scaling and SMT, and the role of the GPU in accelerating modern applications. We also perform an in-depth analysis of specific workloads, including web browsing and VR gaming. Details of our experiments and results are available in a public repository¹.

A. Overall Results

Table II summarizes the TLP on the 6-core processor with SMT enabled, and the GPU utilization of the GTX 1080 Ti, for all the applications. The “execution time” columns illustrate the percentage of time when 0, 1, ..., 12 logical cores are active simultaneously; the color of the heat map region corresponding to c_i indicates the percentage of execution time when i threads execute simultaneously. The “TLP” column shows the average and standard deviation of the TLP derived from 3 test iterations (of the same duration) for each application. Similarly, the “GPU utilization” column contains the average and standard deviation of the measured GPU utilization. Based on the low standard deviations, we conclude that our experimental results are consistent. The last two columns in Table II present the average TLP and GPU utilization for each category, respectively. In summary, every application exploits parallelism to some extent, with a few applications showing more concurrency than others. For categories like office, multimedia playback, personal assistant and web browsing, the degree of parallelism exploited is quite low, as concluded from the average TLP of around 2. VR gaming displays moderate concurrency, with an average TLP ranging from 2 to 4. The TLP is expected to be similar within a category, but some categories are exceptions, including image authoring, video authoring, and cryptocurrency mining. There also exist applications that effectively utilize most of the available cores; e.g., applications for video transcoding exhibit an average TLP over 9. Overall, the average TLP across all benchmarks is 3.1, and 6 out of 30 applications have an average TLP higher than 4.

¹https://github.com/SiyingFeng1995/Desktop Parallelism Analysis 2018


TABLE II: Summary of the TLP and GPU utilization of all applications in the benchmarking suite (mean and standard deviation σ over three test iterations). The per-application execution-time heat map (percentage of time when 0-12 logical cores are active) from the original table is not reproducible here.

Image Authoring (category avg.: TLP 4.2, GPU util. 6.8%)
  Adobe Photoshop CC               TLP 8.6 (σ 0.10)   GPU util. 1.6% (σ 0.2)
  Autodesk Maya 3D 2019            TLP 2.7 (σ 0.08)   GPU util. 9.9% (σ 0.2)
  Autodesk AutoCAD LT              TLP 1.2 (σ 0.02)   GPU util. 9.0% (σ 0.9)

Office (category avg.: TLP 1.4, GPU util. 1.7%)
  Adobe Acrobat Pro DC             TLP 1.3 (σ 0.00)   GPU util. 0.0% (σ 0.0)
  Microsoft Excel 2016             TLP 2.1 (σ 0.03)   GPU util. 2.1% (σ 0.0)
  Microsoft PowerPoint 2016        TLP 1.2 (σ 0.01)   GPU util. 4.0% (σ 0.1)
  Microsoft Word 2016              TLP 1.3 (σ 0.01)   GPU util. 1.7% (σ 0.0)
  Microsoft Outlook 2016           TLP 1.3 (σ 0.05)   GPU util. 2.5% (σ 0.2)

Multimedia Playback (category avg.: TLP 1.4, GPU util. 16.0%)
  QuickTime Player 7.7.9           TLP 1.1 (σ 0.02)   GPU util. 16.4% (σ 0.1)
  Windows Media Player 12.0        TLP 1.3 (σ 0.19)   GPU util. 16.1% (σ 0.0)
  VLC Media Player 3.0.3           TLP 1.8 (σ 0.18)   GPU util. 15.7% (σ 0.9)

Video Authoring (category avg.: TLP 3.1, GPU util. 3.4%)
  CyberLink PowerDirector v16      TLP 4.3 (σ 0.03)   GPU util. 6.3% (σ 0.1)
  Adobe Premiere Pro CC            TLP 1.8 (σ 0.02)   GPU util. 0.6% (σ 0.0)

Video Transcoding (category avg.: TLP 9.3, GPU util. 7.0%)
  HandBrake 1.1.0                  TLP 9.4 (σ 0.04)   GPU util. 0.4% (σ 0.0)
  WinX HD Video Converter 5.12.1   TLP 9.2 (σ 0.02)   GPU util. 13.6% (σ 0.1)

Web Browsing (category avg.: TLP 2.1, GPU util. 5.9%)
  Firefox v60                      TLP 2.2 (σ 0.13)   GPU util. 8.6% (σ 0.5)
  Chrome v60                       TLP 2.2 (σ 0.13)   GPU util. 5.1% (σ 0.6)
  Edge 42.17134.1.0                TLP 2.0 (σ 0.02)   GPU util. 4.0% (σ 0.2)

VR Gaming (category avg.: TLP 3.1, GPU util. 76.3%)
  Arizona Sunshine 1.5.11046       TLP 3.4 (σ 0.23)   GPU util. 68.2% (σ 0.8)
  Fallout 4 VR 1.2                 TLP 4.0 (σ 0.15)   GPU util. 84.9% (σ 1.7)
  RAW Data 1.1.0                   TLP 2.6 (σ 0.13)   GPU util. 90.9% (σ 1.4)
  Serious Sam VR BFE 341433        TLP 2.4 (σ 0.10)   GPU util. 72.2% (σ 1.7)
  Space Pirate Trainer 1.01        TLP 2.7 (σ 0.11)   GPU util. 61.6% (σ 0.5)
  Project CARS 2 1.7.1.0           TLP 3.8 (σ 0.16)   GPU util. 80.2% (σ 2.1)

Cryptocurrency Mining (category avg.: TLP 4.8, GPU util. 98.7%)
  Bitcoin Miner 1.54.0             TLP 5.4 (σ 0.15)   GPU util. 98.9% (σ 1.1)
  EasyMiner v0.87                  TLP 11.9 (σ 0.02)  GPU util. 96.1% (σ 0.4)
  PhoenixMiner 3.0c                TLP 1.0 (σ 0.01)   GPU util. *100.0% (σ 0.1)
  Windows Ethereum Miner 1.5.27    TLP 1.0 (σ 0.01)   GPU util. 99.7% (σ 0.1)

Personal Assistant (category avg.: TLP 1.3, GPU util. 1.4%)
  Cortana                          TLP 1.4 (σ 0.04)   GPU util. 2.7% (σ 0.0)
  Braina 1.43                      TLP 1.1 (σ 0.02)   GPU util. 0.0% (σ 0.0)

*For PhoenixMiner, two packets were simultaneously executing on the GPU throughout the experiment.

GPU utilization values are lower than 10% for most applications. Video authoring and transcoding applications exhibit moderate GPU usage. VR games and cryptocurrency miners, however, show significant utilization of the GPU, achieving an average GPU utilization over 90%. In general, the GPU is underutilized under most circumstances, except for graphics-intensive and cryptocurrency mining applications.

B. Evolution of Concurrency

The experimental results are compared to those collected from similar applications in prior work in 2000 [13, 14] and 2010 [3]. Figures 2 and 3 show the comparisons of TLP and GPU utilization, respectively. Although the TLP of benchmarks in media playback and video authoring has decreased (by 0.5-1.0), possibly due to enhancements in single-core performance, most applications present either comparable or higher TLP. The significantly larger number of inputs (from sensors) and the escalation in computational complexity of VR games result in a noticeable rise in TLP compared to 3D games. Applications that showed a large amount of concurrency in previous work, e.g., HandBrake, see a further increase in TLP. Even applications with little growth in average TLP exhibit progress. For example, Excel only has an average TLP of 2, yet its instantaneous TLP reaches the maximum of 12 during execution, which was not the case 8 years ago.

On the other hand, all benchmarks, except for those in VR gaming, show lower GPU utilization. This can be attributed to advancements in GPU hardware, since a higher utilization of an older GPU, with fewer resources, is comparable to a lower utilization of a newer GPU, with more resources. The GPU utilization of VR games is commensurate with that of traditional 3D games. Since the current GPU has 15× more cores, a viable explanation is that the amount of offloaded work has also increased by an order of magnitude.

C. Architectural Decisions

1) Core Scaling: Experiments are performed on the processor with 4, 8 and 12 active logical cores to analyze the TLP behavior when more resources are available. Figure 4 shows the TLP characteristics of the application with the highest average TLP in each category. For applications exhibiting a low degree of parallelism, including Chrome, VLC, Excel and Cortana, the TLP stays pinned around 2, since there is not much parallelism to exploit. On the contrary, EasyMiner assigns independent threads to each of the logical cores, so its TLP scales linearly with the number of active cores. The TLP of the other applications scales sub-linearly, depending on the amount of parallel work. The measured TLP, clearly showing these trends, is shown in Figures 5-7.
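Restricting the machine to 4, 8, or 12 active logical cores, as in these experiments, can be approximated in software by pinning the process's CPU affinity. The sketch below uses Linux's `os.sched_setaffinity`; on a Windows setup like the one in this paper, the equivalent Win32 call is `SetProcessAffinityMask`. This is an illustrative sketch, not the authors' actual methodology:

```python
import os

def limit_logical_cores(n):
    """Pin the current process to its first n allowed logical CPUs.

    Linux-only sketch (uses the sched_*affinity calls); Windows
    would use SetProcessAffinityMask instead.
    """
    allowed = sorted(os.sched_getaffinity(0))
    os.sched_setaffinity(0, allowed[:n])
    return os.sched_getaffinity(0)
```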

HandBrake presents a highly parallel workload. The TLP is mostly at its maximum, but drops periodically due to serialization. Increasing the core count results in more fluctuations in instantaneous TLP. This is consistent with the HandBrake documentation [18], which states that it scales up to 6 cores and presents diminishing returns beyond that. The frame rate also scales up with the number of cores, and the time spent transcoding the same length of video decreases in proportion.

Fig. 2: Comparison between TLP of desktop applications for 2000 [13, 14], 2010 [3] and 2018 [this work].

Fig. 3: Comparison between GPU utilization of desktop applications for 2010 [3] and 2018 [this work].

Fig. 4: TLP of the applications with the highest TLP in each category for 4-12 logical cores with SMT, showing the impact of core scaling.

Fig. 5: Instantaneous TLP and GPU utilization over time for HandBrake for different numbers of cores with SMT. Note that video transcoding shows proportional scaling with core count, and thus reduced runtime for transcoding the same video clip.

Photoshop involves a significant amount of user interaction, which leads to a non-trivial amount of idle time waiting for user inputs. However, as mentioned in Section II, idle time is not considered while calculating average TLP. User input processing exhibits a low TLP, whereas the TLP of filter rendering scales linearly with the number of active cores and can reach a maximum of 12 when all cores are enabled. The runtime is bottlenecked by user response time, so it shrinks with an increasing number of cores, but is still far from linear scaling, in compliance with Amdahl's law [2].
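The sub-linear runtime scaling observed here follows directly from Amdahl's law [2]; a small illustration (the parallel fraction chosen below is arbitrary, not a measured Photoshop value):

```python
def amdahl_speedup(p, n):
    """Amdahl's law [2]: speedup on n cores for a workload whose
    fraction p is perfectly parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% of the work parallel, twelve cores yield well under
# a 12x speedup, because the serial 10% dominates:
# amdahl_speedup(0.9, 12) ~= 5.7
```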

Project CARS 2, similar to Photoshop, involves significant user interaction, while also continuously processing sensor data and rendering graphics. The instantaneous TLP reaches the maximum in bursts, but mostly remains between 2 and 6. The average TLP saturates around 5, due to serialized work.

Fig. 6: Instantaneous TLP and GPU utilization over time for Photoshop for different numbers of cores with SMT. Note that filter rendering scales linearly with core count, yielding a shorter runtime, whereas user interaction processing does not exhibit much scalability.

Fig. 7: Instantaneous TLP and GPU utilization over time for Project CARS 2 (Rift) for different numbers of cores with SMT, illustrating moderate scalability as the active core count increases.

Overall, the performance gains provided by increasing the amount of hardware resources depend heavily on the volume of parallel tasks. The performance of parallel workloads scales up with a growing number of cores, leading to shorter execution times for the same tasks. Interactive benchmarks with a significant amount of parallelism can also benefit from more processor cores, though the processing of user interactions is usually serialized. However, non-bursty workloads limited by low TLP do not benefit much from extra processor cores. Therefore, further exploitation of parallelism is necessary to take advantage of the available hardware resources.


Fig. 8: Transcode rate (GTX 1080 Ti only) and GPU utilization of HandBrake and WinX for 2 to 6 logical cores, illustrating the effect of SMT and GPU offloading. The transcode rates for GTX 680 are not shown as they overlap exactly with those for GTX 1080 Ti.

2) Simultaneous Multi-Threading (SMT): SMT is aimed at exploiting more parallelism and improving functional unit (FU) utilization in CPUs by allowing a physical core to run multiple threads simultaneously [38]. Prior work states that SMT boosts performance as threads bring useful data on-chip for each other [3]. However, Figure 8 shows that the transcode rates of both HandBrake and WinX decrease when SMT is enabled. This is because SMT helps when data is reused among threads within a physical core, but it also limits the hardware resources (e.g. functional units) available to each thread. Statistics from Intel VTune Amplifier show that for HandBrake, enabling SMT causes a decrease in Last-Level Cache (LLC) misses and in the time spent waiting on main memory, as threads fetch data for one another [22]. However, the percentage of time a core spends stalled on the L1 cache, without missing in it, increases from 5.3% to 10.7%. This is explained by thread contention for computation resources within a physical core. For example, an old store may be waiting for available functional units to resolve its address, blocking a newer load. The performance degradation due to resource contention outweighs the benefits from lower pressure on the LLC/off-chip memory. It also hurts the utilization of the GPU, leading to a non-negligible reduction in transcode rate. This implies that as software exploits parallelism by distributing computation among threads, SMT may have no impact, or even a detrimental one, on performance.

D. GPU Analysis

The dramatic breakthrough in GPU hardware over the past couple of decades has made it crucial to understand how GPUs are used to effectively assist compute-intensive tasks and whether GPUs are exploited to their full potential.

1) GPU Offloading: The performance and GPU utilization of HandBrake and WinX with the high-end GTX 1080 Ti and the mid-end GTX 680 are shown in Figure 8. HandBrake does not offload tasks to the GPU, so the utilization stays below 1%, regardless of the number of active cores and GPU settings. WinX, on the other hand, supports hardware acceleration with CUDA/NVENC. The transcode rates for the different GPUs are almost the same (the plots for GTX 680 are omitted as they overlap with those for GTX 1080 Ti). In order to achieve similar performance, the GTX 680, which is inferior to the 1080 Ti, sustains a much higher utilization. With an even lower-end GPU, we expect the GPU utilization to increase further, and performance to start degrading once the GPU utilization saturates at its maximum.

Logical   Transcode Rate        TLP            GPU Utilization (%)
Cores     No GPU    GPU     No GPU    GPU      No GPU    GPU
4            9       14       4.0     3.8        0.0      5.2
8           19       27       7.9     7.0        0.0     10.0
12          28       37      11.5     9.1        0.0     13.9

TABLE III: Transcode rate, TLP, and GPU utilization of WinX with and without NVIDIA CUDA/NVENC. Enabling the GPU improves the transcode rate and lowers the TLP.

The transcode rate, TLP and GPU utilization of WinX, with and without GPU acceleration, are shown in Table III. The CPU offloads compute-intensive transcoding tasks to the GPU through specific Application Programming Interfaces (APIs), and the amount of offloading, indicated by the GPU utilization, grows almost linearly with the increase in TLP. With CUDA/NVENC enabled, the transcode rate of WinX improves by 1.43× on average, while the TLP decreases by up to 22%. GPU acceleration not only increases performance, but also relieves stress on the CPU, making it available for other tasks and protecting it from thermal throttling. Similar offloading behavior is observed for Premiere Pro while exporting video with CUDA support, as shown in Figure 9. The assistance of the GPU does not cause a significant change in runtime, but slightly lowers the instantaneous TLP.
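The averages quoted here can be reproduced directly from the Table III numbers; a quick check (variable names are ours, for illustration):

```python
# WinX figures from Table III, for 4, 8 and 12 logical cores.
no_gpu_rate = [9, 19, 28]      # transcode rate (FPS), CPU only
gpu_rate    = [14, 27, 37]     # transcode rate (FPS), CUDA/NVENC on
no_gpu_tlp  = [4.0, 7.9, 11.5]
gpu_tlp     = [3.8, 7.0, 9.1]

speedups = [g / c for g, c in zip(gpu_rate, no_gpu_rate)]
mean_speedup = sum(speedups) / len(speedups)   # ~1.43x on average

tlp_drops = [(c - g) / c for c, g in zip(no_gpu_tlp, gpu_tlp)]
max_drop = max(tlp_drops)                      # ~0.21, at 12 cores
```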

Fig. 9: GPU utilization of GTX 680 and 1080 Ti for Premiere Pro. Video export with CUDA support shows higher utilization and lower TLP than without CUDA, and the utilization is higher for GTX 680.

2) GPU Utilization: As shown in Table II, the GPU is under-utilized for most of the applications, possibly because the computational power of the GPU greatly exceeds what is demanded from it. The GPU does execute substantial tasks in various applications, such as hardware rendering in Maya and video export in PowerDirector, yet both exhibit GPU utilizations lower than 10%. Even for WinX Video Converter, which uses CUDA/NVENC on the GPU for transcoding, the average GPU utilization is 13.6%. On the other hand, there are applications that utilize the GPU much more efficiently, such as VR games and cryptocurrency miners. We measure the GPU utilization of the mid-end GPU for video-related applications and cryptocurrency miners, as these use the GPU more than the others, and compare it to the utilization of the high-end GPU. Most applications see a notable improvement in utilization, except for cryptocurrency mining. Both GPUs show utilizations of up to 100% for Bitcoin Miner and EasyMiner, but, as expected, the hash rate of the GTX 680 is at least 2× lower despite the assistance of the CPU. Windows Ethereum Miner, however, has a higher GPU utilization with the superior GPU, since NVIDIA's Kepler architecture in the GTX 680, released before the prevalence of cryptocurrency, is not optimized for mining workloads.

Fig. 10: GPU utilization of GTX 680 and 1080 Ti for applications that show substantial use of the GPU. VR is excluded as it requires a GPU better than the GTX 970. PhoenixMiner does not support the GTX 680.

In summary, a mid-end GPU is sufficient for most applications, including video editors and transcoders. However, for applications such as VR gaming and mining, which perform intensive computations on the GPU, a high-end GPU is indispensable, as a mid-end GPU causes a significant performance loss.

E. Web Browsing Workload Analysis

Among the web browsers, Chrome and Firefox share common features. While the number of processes created by Chrome is 10× larger than that created by Firefox, Firefox uses far more GPU resources to match the performance. Edge claims to have the best power efficiency, with Chrome and Firefox consuming 36% and 53% more power, respectively [40], which is consistent with its low TLP and GPU utilization.

The TLP and GPU utilization of the web browsing test benches are illustrated in Figure 11. The tests using multiple tabs have similar or higher TLP compared to those using a single tab, which contradicts the results from Blake et al. [3]. In the past, web browsers ran the entire application in a single process, and the overhead of garbage collection when a user navigated to another website resulted in higher TLP for single-tab tests. Current web browsers use multi-process models to separate websites from each other and from the browser itself, so that web contents are loaded in parallel to improve responsiveness. This also prevents failures in one webpage's content from crashing the entire browser [7, 27]. Inactive tabs run as background processes in the system. Web browsing using multiple tabs spawns more processes and threads than using a single tab, resulting in higher TLP. However, the increase is not significant, because browsers throttle inactive tabs after a certain amount of time [8]. Chrome generates the largest number of processes and shows the least difference in the number of processes, as well as in TLP, between the two tests. The overhead of garbage collection is also reduced, as it is scheduled to take place during idle time to avoid degrading the user experience [9].

In terms of the ESPN tests, Chrome attains the highest TLP, while Firefox and Edge do not exhibit much difference in TLP. Chrome generally creates a rendering process for each instance of the website, and the large amount of active content in ESPN makes Chrome spawn more processes for webpage rendering, leading to a higher TLP. Firefox and Edge, on the other hand, do not show any apparent increase in the number of processes. All web browsers use more GPU while rendering ESPN, suggesting that graphics-intensive work is offloaded to the GPU when possible, as expected.

Fig. 11: (a) TLP and (b) GPU utilization for web browsing tests using multiple vs. a single tab, and browsing tests on ESPN vs. Wikipedia.

Overall, web browsers have shown improvements in exploiting parallelism over the last two decades. Although TLP is bottlenecked by waiting for and processing user interaction, the improved parallelism enhances user experience in terms of both the responsiveness and the stability of web browsers.

F. Virtual Reality Workload Analysis

Oculus Rift and HTC Vive were released in 2016, and the HTC Vive Pro launched in 2018. As shown in Figure 12, the Rift achieves the highest TLP, especially for graphics-intensive games like Project CARS and Fallout 4. The Vive and Vive Pro have almost the same TLP. Furthermore, the Rift achieves the most stable frame rate of the three headsets (Figure 13). The specified frame rate for all the headsets is 90 FPS, but if only 4 logical cores are available, the actual frame rate of the Rift is clamped to 45 FPS due to asynchronous spacewarp (ASW). ASW compromises the frame rate when the system cannot handle rendering at the full rate, to lower the minimum system requirements for the headset. This is consistent with the reduction in both TLP and GPU utilization (Figure 7). The Vive and Vive Pro instead apply asynchronous reprojection to improve user experience. This technique pushes the GPU to render at 90 FPS, and inserts an adjusted frame when the GPU fails to render the next frame in time. So, with 4 logical cores, the frame rate oscillates between 90 and 45 FPS, and only slight variations appear in the TLP and GPU utilization.

The GPU utilization correlates with the resolution of the headset. For all games except Fallout 4, the Vive Pro, which has the highest resolution, achieves the highest GPU utilization. The Rift and Vive have the same resolution and show comparable GPU utilizations. Fallout 4 exhibits a different trend in hardware utilization than the other games: the GPU utilization for the Vive Pro is the lowest, and a lower frame rate for the Vive Pro is observed in the game.

VI. RELATED WORK

Prior work has explored characterizing simulated as well as real systems. Eyerman and Eeckhout [12] evaluated a variety of multi-core systems to find the optimal design with limited hardware resources. Lorenzon et al. [24] investigated the TLP and energy consumption of Application Programming Interfaces (APIs) on embedded systems and general purpose systems. Our work analyzes commercial applications on a real desktop machine, allowing us to obtain realistic data.

Fig. 12: (a) TLP and (b) GPU utilization for VR games across Oculus Rift, HTC Vive, and HTC Vive Pro.

Fig. 13: Instantaneous frame rate for Project CARS 2 for Oculus Rift, HTC Vive, and HTC Vive Pro with 6 SMT cores. The frame rate of the Rift is more stable than that of the Vive and Vive Pro.

Plenty of hardware reviews of VR headsets, cryptocurrency mining and desktops are available through tech channels [4, 5, 20, 26, 34, 39]. They measure gaming performance through frame rate and mining performance through hash rate. The study by Magaki et al. [25] illustrated that ASICs have better energy and cost efficiency than GPUs for mining. This work instead performs an analysis from a parallelism perspective and evaluates TLP and GPU utilization.

Extensive characterization work has also been done on mobile devices [17, 28, 41]. Gao et al. [15, 16] studied how multi-core mobile devices are utilized by common mobile apps. Chen et al. [6] characterized mobile augmented reality applications from the system and architecture perspectives.

VII. DISCUSSION

TLP and GPU utilization can act as useful guidelines for end-users on the amount of hardware resources to invest in. A key takeaway from this work is that employing many processing cores and a high-end GPU does not always bring benefits. For users who primarily spend time online or on office applications, 2-3 cores are sufficient to achieve maximal performance. For professional users who use their desktops for video transcoding or image editing, performance scales roughly linearly with an increasing number of cores. For gaming and cryptocurrency mining, a better GPU leads to much greater performance gains than a CPU with a large core count.

Multi-core scaling has hit a plateau, and software developers are now, more than ever, expected to write hardware-aware programs. This is because TLP and GPU utilization are not fundamental to an application, but highly dependent on how it is implemented in software. Based on our 18-year perspective analysis, we discuss potential areas where software can be improved to further exploit parallelism.

• Applications exhibiting complementary TLP characteristics can be scheduled to execute concurrently to achieve the best utilization of the processor. For example, HandBrake exhibits high TLP with short periods of TLP drop. The OS could schedule another task during troughs in TLP, thus trading off fairness for better overall utilization.

• The GPU can be further exploited to assist compute-intensive tasks. Although the GPU suffers from poor single-threaded/latency-sensitive performance, tasks that have lower priority or relaxed latency requirements could be offloaded to the GPU. For instance, if the user is editing an image in Photoshop and transcoding videos in the background, the transcoding task can be offloaded to the GPU while Photoshop is using the CPU for rendering.

• Idle time and periods of low activity can be used to predict future user tasks and perform them speculatively. This can lead to a performance boost, at the risk of wasted work and energy. However, power is not a major constraint for desktops. Interactive applications are good candidates for this kind of speculation, since low TLP is attained while user inputs are being processed. For example, when a Photoshop user selects a blur filter, the system can speculate that the next task will be blur filter rendering, and the core can start fetching off-chip data locally while the user is specifying filter configurations.

• As the trend of innovation shifts further away from parallelism toward heterogeneity, future software could offload kernels within an application to dedicated hardware/accelerators that execute each kernel most efficiently.
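The first suggestion, co-scheduling applications with complementary TLP, can be sketched as a toy simulation (the greedy fill policy and the traces below are ours, purely for illustration):

```python
def co_schedule(tlp_a, tlp_b, cores):
    """Greedy toy scheduler: each time step, app A gets the threads
    it demands; app B's backlog fills whatever cores are left over."""
    occupancy, backlog_b = [], 0
    for a, b in zip(tlp_a, tlp_b):
        backlog_b += b
        grant_a = min(a, cores)
        grant_b = min(backlog_b, cores - grant_a)
        backlog_b -= grant_b
        occupancy.append(grant_a + grant_b)
    return occupancy

# A HandBrake-like trace with periodic TLP troughs, plus a steady
# low-TLP background task, fills a 12-core machine far better than
# either app alone:
# co_schedule([12, 2, 12, 2], [4, 4, 4, 4], 12) -> [12, 10, 12, 10]
```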

VIII. CONCLUSION

Major advancements have taken place in desktop hardware over the past decade. In this work, we analyzed the TLP, GPU utilization, and the effects of core scaling and SMT on traditional and emerging applications commonly used in contemporary desktop systems. Our results showed that software has improved to take advantage of the parallelism available in the hardware compared to the work in 2010 [3]. Noticeable increases were seen in many applications, including those reputed for effective utilization of processor cores, like HandBrake and Photoshop. Applications with only slight changes in average TLP over the past 18 years still exhibited efforts to exploit available parallelism by achieving high instantaneous TLP during execution. For example, Excel spent 3.7% of its time using the maximum number of available logical cores concurrently, and web browsers have shifted from single-process models to multi-process models, resulting in better responsiveness and reliability. Emerging applications also demonstrated good utilization of hardware resources. The average TLP of VR gaming is twice that of traditional 3D gaming, and cryptocurrency miners involving CPU mining have a TLP higher than that of over 80% of the benchmarks. In addition, SMT is beneficial when threads running on the same physical core work on the same data with sufficient computation resources; otherwise, it becomes detrimental to performance.


On the other hand, overall GPU utilization was lower than that observed in 2010. This shows that the amount of available resources in the GPU has been growing at a faster pace than the parallelism harnessed by software. However, emerging workloads, e.g. VR games and cryptocurrency miners, exhibited great potential, as they fully exploited the computational power of the GPU.

In conclusion, appreciable progress has been made by software in exploiting parallelism. However, there is still ample scope for software to further improve hardware utilization.

REFERENCES

[1] “Cryptocurrency market capitalizations.” [Online]. Available: https://coinmarketcap.com/

[2] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, ser. AFIPS ’67 (Spring). New York, NY, USA: ACM, 1967, pp. 483–485.

[3] G. Blake, R. G. Dreslinski, T. Mudge, and K. Flautner, “Evolution of thread-level parallelism in desktop applications,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp. 302–313.

[4] BuriedONE Blockchain, “GPU Mining Hashrates.” [Online]. Available: https://www.buriedone.com/hashrates.html

[5] K. Carbotte, “HTC Vive Pro headset review: A high bar for premium VR.” [Online]. Available: https://www.tomshardware.com/reviews/htc-vive-pro-headset-vr,5549.html

[6] H. Chen, Y. Dai, H. Meng, Y. Chen, and T. Li, “Understanding the characteristics of mobile augmented reality applications,” in 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2018, pp. 128–138.

[7] Chromium Blog, “Multi-process architecture.” [Online]. Available: https://blog.chromium.org/2008/09/multi-process-architecture.html

[8] C. Davenport, “Google introduces background tab throttling in Chrome 57.” [Online]. Available: https://www.androidpolice.com/2017/03/14/google-introduces-background-tab-throttling-chrome-57-desktop/

[9] U. Degenbaev, J. Eisinger, M. Ernst, R. McIlroy, and H. Payer, “Idle time garbage collection scheduling,” SIGPLAN Not., vol. 51, no. 6, pp. 570–583, Jun. 2016.

[10] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-implanted MOSFET’s with very small physical dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974.

[11] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Computer Architecture (ISCA), 2011 38th Annual International Symposium on. IEEE, 2011, pp. 365–376.

[12] S. Eyerman and L. Eeckhout, “The benefit of SMT in the multi-core era: Flexibility towards degrees of thread-level parallelism,” SIGARCH Comput. Archit. News, vol. 42, no. 1, pp. 591–606, Feb. 2014.

[13] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge, “Thread-level parallelism and interactive performance of desktop applications,” SIGOPS Oper. Syst. Rev., vol. 34, no. 5, pp. 129–138, Nov. 2000.

[14] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge, “Thread-level parallelism of desktop applications,” Workshop on Multi-threaded Execution, Architecture and Compilation, 2000.

[15] C. Gao, A. Gutierrez, R. G. Dreslinski, T. Mudge, K. Flautner, and G. Blake, “A study of thread level parallelism on mobile devices,” in 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2014, pp. 126–127.

[16] C. Gao, A. Gutierrez, M. Rajan, R. G. Dreslinski, T. Mudge, and C. J. Wu, “A study of mobile device utilization,” in 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2015, pp. 225–234.

[17] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver, “Full-system analysis and characterization of interactive smartphone applications,” in 2011 IEEE International Symposium on Workload Characterization, Nov 2011, pp. 81–90.

[18] HandBrake, “Video encoding speed.” [Online]. Available: https://handbrake.fr/docs/en/latest/technical/video-encoding-performance.html

[19] J. Hauswald, M. A. Laurenzano, Y. Zhang, C. Li, A. Rovinski, A. Khurana, R. G. Dreslinski, T. Mudge, V. Petrucci, L. Tang, and J. Mars, “Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers,” SIGPLAN Not., vol. 50, no. 4, pp. 223–238, Mar. 2015.

[20] T. Hochstenbach, “Project CARS 2 review: benchmarks with 23 graphics cards.” [Online]. Available: https://us.hardware.info/reviews/7614/project-cars-2-review-benchmarks-with-23-graphics-cards

[21] Intel, “Intel Core i7-8700K processor.” [Online]. Available: https://ark.intel.com/products/126684/Intel-Core-i7-8700K-Processor-12M-Cache-up-to-4-70-GHz-

[22] Intel, “Intel VTune Amplifier.” [Online]. Available: https://software.intel.com/en-us/vtune-amplifier-help

[23] KZero, “Number of Active Virtual Reality Users Worldwide from 2014 to 2018 (in Millions).” [Online]. Available: https://www.statista.com/statistics/426469/active-virtual-reality-users-worldwide/

[24] A. F. Lorenzon, M. C. Cera, and A. C. Schneider Beck, “Performance and energy evaluation of different multi-threading interfaces in embedded and general purpose systems,” Journal of Signal Processing Systems, vol. 80, no. 3, pp. 295–307, Sep 2015.

[25] I. Magaki, M. Khazraee, L. V. Gutierrez, and M. B. Taylor, “ASIC clouds: Specializing the datacenter,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 178–190.

[26] M. Hanson, “Best mining GPU 2018: the best graphics cards for mining Bitcoin, Ethereum and more.” [Online]. Available: https://www.techradar.com/news/best-mining-gpu

[27] MDN web docs, “Multiprocess Firefox.” [Online]. Available: https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/

[28] G. Narancic, P. Judd, D. Wu, I. Atta, M. Elnacouzi, J. Zebchuk, J. Albericio, N. E. Jerger, A. Moshovos, K. Kutulakos, and S. Gadelrab, “Evaluating the memory system behavior of smartphone workloads,” in 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), July 2014, pp. 83–92.

[29] NVIDIA, GeForce GTX 1080 Ti. [Online]. Available: https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/

[30] NVIDIA, GeForce GTX 285. [Online]. Available: https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-285

[31] NVIDIA, GeForce GTX 680. [Online]. Available: https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680

[32] S. Pal, J. Beaumont, D. Park, A. Amarnath, S. Feng, C. Chakrabarti, H. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “OuterSPACE: An outer product based sparse matrix multiplication accelerator,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2018, pp. 724–736.

[33] I. Park and R. Buch, “Event tracing: Improve debugging and performance tuning with ETW,” MSDN Magazine, April 2007.

[34] R. Smith, “The NVIDIA GeForce GTX 1080 Ti Founder’s Edition review: Bigger Pascal for better performance.” [Online]. Available: https://www.anandtech.com/show/11180/the-nvidia-geforce-gtx-1080-ti-review

[35] Statista, “Global market share held by leading desktop internet browsers.” [Online]. Available: https://www.statista.com/statistics/544400/market-share-of-internet-browsers-desktop/

[36] Statista, “Global operating systems market share for desktop PCs.” [Online]. Available: https://www.statista.com/statistics/218089/global-market-share-of-windows-7/

[37] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in DAC Design Automation Conference 2012, June 2012, pp. 1131–1136.

[38] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous multithreading: Maximizing on-chip parallelism,” in ACM SIGARCH Computer Architecture News, vol. 23, no. 2, 1995, pp. 392–403.

[39] S. Walton, “Intel Core i7-8700K review: The new gaming king.” [Online]. Available: https://www.techspot.com/review/1497-intel-core-i7-8700k/page1.html

[40] J. Weber, “Get more out of your battery with Microsoft Edge.” [Online]. Available: https://blogs.windows.com/windowsexperience/2016/06/20/more-battery-with-edge/

[41] Y. Zhang, X. Wang, X. Liu, Y. Liu, L. Zhuang, and F. Zhao, “Towards better CPU power management on multicore smartphones,” in Proceedings of the Workshop on Power-Aware Computing and Systems, ser. HotPower ’13. New York, NY, USA: ACM, 2013, pp. 11:1–11:5.

