Center for Comprehensive Informatics

Technical Report

High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms

George Teodoro

Tony Pan Tahsin M. Kurc

Jun Kong Lee A. D. Cooper

Norbert Podhorszki Scott Klasky Joel H. Saltz

CCI-TR-2012-9 December 23, 2012

Page 2: High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platformsamza/ece1747h/papers/CCI... · 2013-09-12 · High-throughput Analysis of Large Microscopy

High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms

George Teodoro1, Tony Pan1, Tahsin M. Kurc1, Jun Kong1, Lee A. D. Cooper1, Norbert Podhorszki2, Scott Klasky2, and Joel H. Saltz1

1 Center for Comprehensive Informatics, Emory University, Atlanta, GA
2 Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN

Abstract—Analysis of large pathology image datasets offers significant opportunities for biomedical researchers to investigate the morphology of disease, but the resource requirements of image analyses limit the scale of those studies. Motivated by such a study, we propose and evaluate a parallel image analysis application pipeline for high throughput computation of large datasets of high resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we built runtime support that allows us to express our cancer image analysis application as a hierarchical pipeline, in which the application is implemented as a coarse-grain pipeline of stages and each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance aware scheduling techniques along with several optimizations, including architecture aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize utilization of the aggregate computing power of CPUs and GPUs and to minimize data copy overheads. The results, obtained with the analysis application for the study of brain tumors, show that cooperative use of CPUs and GPUs achieves significant improvements over GPU-only versions (up to 1.6×) and that executing the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than the coarser-grain, monolithic implementations used in other works. Moreover, the cancer image analysis pipeline was able to process an image dataset consisting of 36,848 4K×4K-pixel image tiles (about 1.8TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system.

Keywords-Image Segmentation Pipelines; GPGPU; CPU-GPU platforms;

I. INTRODUCTION

Analysis of large datasets is a critical yet challenging component of scientific studies because of dataset sizes and the computational requirements of analysis applications. Sophisticated image scanner technologies developed in the past decade have revolutionized biomedical researchers' ability to perform high resolution microscopy imaging of tissue specimens. With a state-of-the-art scanner, a researcher can rapidly capture color images of up to 100K×100K pixels. This allows a research project to collect datasets consisting of thousands of images, each of which can be tens of gigabytes in size. Processing of an image consists of several steps of data and computation intensive operations such as normalization, segmentation, feature computation, and classification. Analyzing a single image on a workstation can take several hours, and processing a large dataset can take a very long time. Moreover, a dataset may be analyzed multiple times with different analysis parameters and algorithms to explore different scientific questions, to carry out sensitivity studies, and to quantify uncertainty and errors in analysis results. These requirements create obstacles to the utilization of microscopy imaging in research and healthcare environments and significantly limit the scale of microscopy imaging studies.

The processing power and memory capacity of graphics processing units (GPUs) have rapidly and significantly improved in recent years. Contemporary GPUs provide extremely fast memories and massive multi-processing capabilities, exceeding those of multi-core CPUs. The application and performance benefits of GPUs for general purpose processing have been demonstrated for a wide range of applications [1]. As a result, hybrid systems with multi-core CPUs and multiple GPUs are emerging as viable high performance computing platforms for scientific computation [2]. This trend is also fueled by the availability of programming abstractions and frameworks, such as CUDA (http://nvidia.com/cuda) and OpenCL (http://www.khronos.org/opencl/), that have reduced the complexity of porting computational kernels to GPUs. Nevertheless, taking advantage of hybrid platforms for scientific computing remains a challenging problem. An application developer needs to deal with the efficient distribution of the computational workload not only across cluster nodes but also among the multiple CPU cores and GPUs on a hybrid node. The developer also has to take into account potential performance variability across application operations. Operations ported to the GPU will not all see the same performance gains; some operations are more suitable for massive parallelism and generally achieve higher GPU-vs-CPU speedups than other operations. In addition, the application developer has to minimize data copy overheads when data have to be exchanged between application operations. These challenges often lead to underutilization of the power of hybrid platforms.

In this work, we investigate efficient parallelization strategies and runtime support for efficient execution of large scale microscopy image analyses on hybrid cluster systems. Our approach combines the coarse-grain dataflow pattern with the bag-of-tasks pattern in order to facilitate the implementation of the image analysis application from a set of operations on data. The runtime supports hierarchical pipelines, in which a processing component can itself be a pipeline of operations. It implements optimizations for efficient coordinated use of CPUs and GPUs on a computing node as well as for the distribution of computations across multiple nodes. The optimizations studied include data locality conscious and performance variation aware task assignment, data prefetching, asynchronous data copy, and architecture aware placement of control processes on a computation node. The fine-grain operations that constitute an analysis pipeline typically involve different data access and processing patterns. Consequently, variability in the amount of GPU acceleration across operations is likely to exist. This requires the use of performance aware scheduling techniques in order to optimize the use of CPUs and GPUs based on the speedups attained by each operation.

We have evaluated the image analysis application parallelization coupled with the runtime support optimizations using image datasets for the study of brain tumors on a state-of-the-art hybrid cluster, where each node has multi-core CPUs and multiple GPUs. Experimental results show that coordinated use of CPUs and GPUs along with the runtime optimizations results in significant performance improvements over CPU-only and GPU-only deployments. In addition, multi-level pipeline scheduling and execution is faster than a monolithic implementation, since it can leverage the hybrid infrastructure better. Application of all of these optimizations makes it possible to process an image dataset at 150 tiles/second on 100 hybrid compute nodes.

II. APPLICATION DESCRIPTION

The motivation for our work is in silico studies of brain tumors [3]. These studies are conducted to find better tumor classification strategies and to understand the biology of brain tumors, using complementary datasets of high-resolution whole tissue slide images (WSIs), gene expression data, clinical data, and radiology images. WSIs are captured by taking high resolution color (RGB) pictures of tissue specimens stained and fixated on glass slides. Our group has developed image analysis applications to extract and classify morphology and texture information from WSIs, with the objective of exploring correlations between tissue morphology features, genomic signatures, and clinical data [3].

The WSI analysis applications share a common workflow which consists of the following core stages: 1) image preprocessing tasks such as color normalization, 2) segmentation of micro-anatomic objects such as cells and nuclei, 3) characterization of the shape and texture features of the segmented objects, and 4) machine-learning methods that integrate information from features to classify the images and objects. In terms of computation cost, the preprocessing and classification stages (stages 1 and 4) are inexpensive relative to the segmentation and feature computation stages (stages 2 and 3). The current implementation of the classification stage works at the image and patient level and includes significant data reduction prior to the actual classification operation, which decreases data and computational requirements. The segmentation and feature computation stages, on the other hand, may operate on hundreds to thousands of images with resolutions ranging from 50K×50K to 100K×100K pixels and 10^5 to 10^7 micro-anatomic objects (e.g., cells and nuclei) per image. Thus, we target the segmentation and feature computation stages in this paper.

The segmentation stage detects cells and nuclei and delineates their boundaries. It consists of several component operations, forming a dataflow graph (see Figure 1). The operations in the segmentation stage include morphological reconstruction to identify candidate objects, watershed segmentation to separate overlapping objects, and filtering to eliminate candidates that are unlikely to be nuclei based on object characteristics. The feature computation stage derives quantitative attributes in the form of a feature vector for the entire image or for individual segmented objects. The feature types include pixel statistics, gradient statistics, edge, and morphometry. Most of the features can be computed concurrently in a multi-threaded or parallel environment.

III. APPLICATION PARALLELIZATION FOR HIGH THROUGHPUT EXECUTION

We have developed the high performance version of the application in several stages. First, we have implemented GPU-enabled versions, as well as CPU versions, of individual operations in the segmentation and feature computation steps (Section III-A). Second, we have developed a strategy and runtime middleware that combine the bag-of-tasks and coarse-grain dataflow patterns for parallelization across multiple nodes and within each CPU-GPU node (Section III-B). Finally, we have incorporated a set of runtime optimizations to reduce computation time and to use CPUs and GPUs in a coordinated manner on each compute node (Section III-C).

A. GPU-based Implementations of Operations

We used existing implementations from OpenCV or from other research groups, or implemented our own if no efficient implementations were available. The Morphological Open operation, for example, is available as part of OpenCV [4], which uses the NVIDIA Performance Primitives (NPP) [5]. The Watershed operation, on the other hand, has only a CPU implementation in the OpenCV library; we used the GPU version by Korbes et al. [6] for this operation. The main compute intensive operations, along with the sources of their CPU/GPU implementations, are listed in Table I.

Figure 1. Pipeline for segmenting nuclei in a whole slide tissue image and computing their features. The input to the pipeline is an image or image tile. The output is a set of features for each segmented nucleus.

Table I. Sources of CPU and GPU implementations of operations in the segmentation and feature computation stages.

Pipeline operation | CPU source | GPU source
RBC detection | OpenCV and Vincent [7] Morph. Reconstruction (MR) | Implemented
Morph. Open | OpenCV (by a 19x19 disk) | OpenCV
ReconToNuclei | Vincent [7] MR | Implemented
AreaThreshold | Implemented | Implemented
FillHoles | Vincent [7] MR | Implemented
Pre-Watershed | Vincent [7] MR and OpenCV for distance transformation | Implemented
Watershed | OpenCV | Korbes [6]
BWLabel | Implemented | Implemented
Features comp. | Implemented; OpenCV (Canny) | Implemented; OpenCV (Canny)

Several of the methods we developed in the segmentation stage are irregular. The Morphological Reconstruction (MR) [7] and Distance Transform algorithms are used as building blocks in a number of these methods. These algorithms can be efficiently executed on a CPU using a queue structure. In these algorithms, only the computation performed on a subset of the elements (active elements) from the input data domain effectively contributes to the output results. Therefore, to avoid wasting computation time, the active elements are tracked using a container, e.g., a queue, so that only that subset of elements is processed. When an element is selected for computation, it is removed from the set of active elements. Further, the computation of a given active element involves its neighboring elements on a grid, and one or more neighbors may be included in the set of active elements as a result of the computation. This process continues until stability is reached, i.e., the container of active elements is empty. To port these algorithms to GPUs, we have implemented a hierarchical and scalable queue to store elements (pixels) in fast GPU memories, along with several optimizations to reduce execution time. The implementations are detailed in a technical report [8]. The queue-based implementation resulted in significant performance improvements over previously published GPU-enabled versions of the MR algorithm [9]. Our implementation of the distance transform results in a distance map equivalent to that of Danielsson's algorithm [10].
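The queue-driven propagation described above can be illustrated with a small CPU sketch. The code below is not the paper's implementation; it assumes a flat 8-bit marker/mask representation with 4-connectivity and that the initial set of active pixels has already been placed in the queue (for example by the raster scans of Vincent's hybrid algorithm [7]).

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// Queue-based propagation phase of grayscale morphological reconstruction
// (marker J dilated under mask I, 4-connectivity). Illustrative sketch only.
static void reconstruct_propagate(std::vector<uint8_t>& J,        // marker, updated in place (J[p] <= I[p])
                                  const std::vector<uint8_t>& I,  // mask
                                  int width, int height,
                                  std::queue<int>& active)        // indices of active pixels
{
    const int dx[4] = {-1, 1, 0, 0};
    const int dy[4] = {0, 0, -1, 1};
    while (!active.empty()) {                    // stability: queue becomes empty
        int p = active.front();
        active.pop();
        int px = p % width, py = p / width;
        for (int k = 0; k < 4; ++k) {
            int qx = px + dx[k], qy = py + dy[k];
            if (qx < 0 || qx >= width || qy < 0 || qy >= height) continue;
            int q = qy * width + qx;
            // Propagate only while the neighbor is below the wavefront and
            // has not yet reached its mask value; it then becomes active.
            if (J[q] < J[p] && J[q] < I[q]) {
                J[q] = std::min(J[p], I[q]);
                active.push(q);
            }
        }
    }
}
```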

The connected component labeling operation (BWLabel) was implemented using the union-find pattern [11]. Conceptually, BWLabel with union-find first constructs a forest where each pixel is its own tree. It then merges adjacent trees by putting one tree as a branch of the other. Tree merges occur only when the adjacent pixels have the same mask pixel value. During a merge, the roots of the two trees are compared by their label values. The root with the smaller label value remains the root, while the other is grafted onto the new root. After all the pixels have been visited, pixels belonging to the same component are on the same label tree. The labels can then be extracted by flattening the trees and reading the labels. The output of this operation in the computation pipeline, shown in Figure 1, is a labeled mask with all segmented nuclei.
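As an illustration of the union-find labeling just described, the following minimal CPU sketch labels a binary mask with 4-connectivity. It is not the paper's GPU implementation [11]; the function names are ours, and the smaller root index stands in for the "smaller label value" in the description.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Find the root of a pixel's tree, with path halving to keep trees shallow.
static int find_root(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// Merge two trees; the smaller root index remains the root.
static void merge_trees(std::vector<int>& parent, int a, int b) {
    int ra = find_root(parent, a), rb = find_root(parent, b);
    if (ra == rb) return;
    if (rb < ra) std::swap(ra, rb);
    parent[rb] = ra;
}

// Label connected foreground components of a binary mask; background is 0,
// components receive compact labels 1..N after flattening the trees.
std::vector<int> bwlabel(const std::vector<uint8_t>& mask, int width, int height) {
    const int n = width * height;
    std::vector<int> parent(n);
    for (int i = 0; i < n; ++i) parent[i] = i;   // every pixel starts as its own tree

    // Merge each foreground pixel with its left and upper foreground neighbors.
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            int p = y * width + x;
            if (!mask[p]) continue;
            if (x > 0 && mask[p - 1])     merge_trees(parent, p, p - 1);
            if (y > 0 && mask[p - width]) merge_trees(parent, p, p - width);
        }

    // Flatten: read each pixel's root and assign compact labels.
    std::vector<int> label(n, 0), compact(n, 0);
    int next = 0;
    for (int p = 0; p < n; ++p) {
        if (!mask[p]) continue;
        int r = find_root(parent, p);
        if (compact[r] == 0) compact[r] = ++next;
        label[p] = compact[r];
    }
    return label;
}
```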

The operations in the feature computation stage consist of pixel/neighborhood based transformations that are applied to the input image (color deconvolution, Canny, and gradient) and computations on individual objects (e.g., nuclei) segmented in the segmentation stage. The feature computations on objects are generally more regular and compute intensive than the operations in the segmentation stage. This characteristic of the feature computation operations leads to better GPU acceleration [12].
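As a small illustration of these two kinds of operations, the sketch below applies a pixel/neighborhood transformation (Canny edge detection) to a grayscale tile and then computes simple per-object statistics (mean intensity and edge pixel count) from the labeled mask produced by the segmentation stage. It uses the OpenCV API but is not the paper's feature code; the thresholds and the choice of features are assumptions for the example.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

struct ObjectFeatures {
    double mean_intensity = 0.0;  // mean gray value inside the object
    int    edge_pixels    = 0;    // Canny edge pixels inside the object
    int    area           = 0;    // object size in pixels
};

std::vector<ObjectFeatures> compute_features(const cv::Mat& gray,    // CV_8UC1 tile
                                             const cv::Mat& labels,  // CV_32SC1 mask, 0 = background
                                             int num_objects)
{
    cv::Mat edges;
    cv::Canny(gray, edges, 50.0, 150.0);   // pixel/neighborhood transformation (example thresholds)

    std::vector<ObjectFeatures> feat(num_objects + 1);
    std::vector<double> sum(num_objects + 1, 0.0);
    for (int y = 0; y < gray.rows; ++y)
        for (int x = 0; x < gray.cols; ++x) {
            int l = labels.at<int>(y, x);
            if (l <= 0 || l > num_objects) continue;       // skip background
            sum[l] += gray.at<unsigned char>(y, x);
            feat[l].area += 1;
            if (edges.at<unsigned char>(y, x)) feat[l].edge_pixels += 1;
        }
    for (int l = 1; l <= num_objects; ++l)
        if (feat[l].area > 0) feat[l].mean_intensity = sum[l] / feat[l].area;
    return feat;
}
```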

B. Parallelization on Distributed CPU-GPU machines

The image analysis application encapsulates multiple processing patterns. First, each image can be partitioned into rectangular tiles, and the segmentation and feature computation steps can be executed on each tile independently. This leads to a bag-of-tasks style processing pattern. Second, the processing of a single tile can be expressed as a hierarchical coarse-grain dataflow pattern, where the segmentation and feature computation stages are the first level of the dataflow structure. The second level is the set of fine-grain operations within each of the coarse-grain stages. This formulation is illustrated in Figure 1.

The hierarchical representation lends itself to a separation of concerns and enables the use of different scheduling approaches at each level. For instance, it allows for the possibility of exporting second level operations (fine-grain operations) to a local scheduler on a hybrid node, as opposed to describing each pipeline stage as a single monolithic task that should be assigned entirely to a GPU or a CPU. In this way, the scheduler can control tasks at a smaller granularity and can account for performance variations across the finer grain tasks within a node, assigning them to the most appropriate device.


In order to allow this representation, our implementation is built on top of a Manager-Worker model, shown in Figure 2, that combines the bag-of-tasks style of execution with the coarse-grain dataflow execution pattern. The application Manager creates stage instances, each of which is represented by a tuple, (input data chunk, processing stage), and builds the dependencies among them to enforce correct pipeline execution. This dependency graph is not completely known prior to execution and is therefore built dynamically at runtime. For instance, after the segmentation of a tile, the feature extraction for that particular chunk of data is only dispatched for computation if a certain number of objects were segmented. Since stage instances may be created as a consequence of the computation of other stage instances, it is possible to create loops, as the dependency graph may be reinstantiated dynamically.
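A hypothetical sketch of this Manager-side behavior is shown below: a stage instance is a (data chunk, stage) tuple, Workers pull instances on demand, and the feature-computation instance for a tile is only created after the segmentation instance reports how many objects it found. The class, names, and dispatch threshold are illustrative assumptions, not the paper's API.

```cpp
#include <queue>
#include <string>

// Illustrative Manager bookkeeping: stage instances are created dynamically
// as results arrive, so the dependency graph grows at runtime.
struct StageInstance {
    std::string stage;    // "segmentation" or "feature-computation"
    int tile_id;          // identifier of the input data chunk
};

class Manager {
public:
    void submit_tile(int tile_id) { ready_.push({"segmentation", tile_id}); }

    // Called when a Worker reports that a stage instance has finished.
    void on_completed(const StageInstance& done, int objects_segmented) {
        if (done.stage == "segmentation" && objects_segmented >= min_objects_)
            ready_.push({"feature-computation", done.tile_id});  // created only if needed
    }

    // Demand-driven assignment: Workers pull the next ready instance.
    bool next(StageInstance& out) {
        if (ready_.empty()) return false;
        out = ready_.front();
        ready_.pop();
        return true;
    }
private:
    std::queue<StageInstance> ready_;
    int min_objects_ = 1;   // illustrative dispatch threshold
};
```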

Figure 2. Overview of the multi-node parallelization strategy.

The granularity of tasks assigned to application Worker nodes is that of stage instances. The scheduling of tasks to Workers is carried out using a demand-driven approach. Stage instances are assigned to Workers for execution in the same order in which the instances are created, and the Workers continuously request work as they finish the execution of previous instances (see Figure 2). In practice, a single Worker is able to execute multiple application stages concurrently, and the sets of Workers shown in Figure 2 are not necessarily disjoint. All communication among the processes is done using MPI.

Since a Worker (see Figure 3) is able to use all CPU cores and GPUs in a node concurrently, it may ask for multiple stage instances from the Manager in order to keep all computing devices busy. The maximum number of stage instances assigned to a Worker at a time is a configurable value (window size). The Worker may request multiple stage instances in one request or in multiple requests; in the latter case, the assignment of a stage instance and the retrieval of the necessary input data chunks can be overlapped with the processing of an already assigned stage instance.

The Worker Communication Controller (WCC) module runs on one of the CPU cores and is responsible for performing any necessary communication with the Manager. All computing devices used by a Worker are controlled by a local Worker Resource Manager (WRM).

Figure 3. A Worker is a multi-thread process. It uses all the devices in a hybrid node via the local Worker Resource Manager, which coordinates the scheduling and mapping of operation instances assigned to the Worker to CPU cores and GPUs.

When a Worker receives stage instances from the application Manager, it instantiates the pipeline of finer-grain operations in each of them, and each of the fine-grain operation instances, (input data, operation), is dispatched for execution with the local WRM. The WRM maps the (input data, operation) tuples to the local computing devices as the dependencies between the operations are resolved. In this model of a Worker, one computing thread is assigned to manage each available CPU computing core or GPU. The threads notify the WRM whenever they become idle. The WRM then selects one of the tuples ready for execution whose operation has an implementation matching the processor managed by that particular thread. When all the operations in the pipeline related to a given stage instance have been executed, a callback function is invoked to notify the WCC. The WCC then notifies the Manager about the end of that stage instance and requests more stage instances. During the stage instance destruction phase, the Worker may also instantiate other stage instances as necessary.

C. Efficient Cooperative Execution on CPUs and GPUs

This section describes a set of optimizations that address the smart assignment of operations to CPUs and GPUs and data movement between those devices.

1) Performance Aware Task Scheduling (PATS): The stage instances (Segmentation or Feature Computation) assigned to a Worker create many finer-grain operation instances. The operation instances need to be mapped to the available CPU cores and GPUs efficiently in order to fully utilize the computing capacity of a node. Several recent efforts on task scheduling in heterogeneous environments have targeted machines equipped with CPUs and GPUs [13], [14], [15]. These works address the problem of partitioning and mapping tasks between CPUs and GPUs for applications in which operations (or tasks) achieve consistent speedups when executed on a GPU vs. on a CPU.


The previous efforts differ mainly in whether they use off-line, on-line, or automated scheduling approaches. However, when there are multiple types of operations in an application, the operations may have different processing and data access patterns and attain different amounts of speedup on a GPU.

In order to use this performance variability to our advantage, we have developed a strategy, referred to here as PATS (formerly PRIORITY scheduling) [12]. This strategy assigns tasks to CPU cores or GPUs based on an estimate of the relative performance gain of each task on a GPU compared to its performance on a CPU core and on the computational loads of the CPUs and GPUs. In this work, we have extended the PATS scheduler to take into account dependencies between operations in an analysis workflow.

The PATS scheduler uses a queue of operation instances, i.e., (data element, operation) tuples, sorted based on the relative speedup expected for each tuple. As more tuples are created for execution with each Worker and pending operation dependencies are resolved, more operations are queued for execution. Each new operation is inserted into the queue such that the queue remains sorted (see Figure 3). During execution, when a CPU core or a GPU becomes idle, one of the tuples from the queue is assigned to the idle device. If the idle device is a CPU core, the tuple with the minimum estimated speedup value is assigned to it. If the idle device is a GPU, the tuple with the maximum estimated speedup is assigned to it. The PATS scheduler relies on maintaining the correct relative order of speedup estimates rather than on the accuracy of individual speedup estimates. Even if the speedup estimates of two tasks are not accurate with respect to their real speedup values, the scheduler will correctly assign the tasks to the computing devices on the node, as long as the order of the speedup values is correct.
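The device-selection rule can be sketched as follows, assuming a per-Worker queue of ready (data element, operation) tuples kept sorted by estimated GPU-vs-CPU speedup. The class and names are illustrative rather than the paper's implementation: an idle GPU takes the tuple with the highest estimated speedup, and an idle CPU core takes the one with the lowest.

```cpp
#include <iterator>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

struct OperationInstance {
    std::string operation;   // e.g. "Watershed", "AreaThreshold"
    int         tile_id;     // the data element this instance works on
};

// Ready queue sorted by estimated GPU-vs-CPU speedup (PATS-style selection).
class PatsQueue {
public:
    void push(double estimated_speedup, OperationInstance op) {
        std::lock_guard<std::mutex> lk(m_);
        ready_.emplace(estimated_speedup, std::move(op));
    }
    // Called by an idle device thread. A GPU takes the tuple with the maximum
    // estimated speedup; a CPU core takes the one with the minimum.
    std::optional<OperationInstance> pop(bool caller_is_gpu) {
        std::lock_guard<std::mutex> lk(m_);
        if (ready_.empty()) return std::nullopt;
        auto it = caller_is_gpu ? std::prev(ready_.end()) : ready_.begin();
        OperationInstance op = std::move(it->second);
        ready_.erase(it);
        return op;
    }
private:
    std::multimap<double, OperationInstance> ready_;  // kept sorted by speedup
    std::mutex m_;
};
```

Because only the relative order of the speedup estimates matters for this selection, the scheduler behaves correctly even when individual estimates are off, as long as the ordering is preserved.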

Time based scheduling strategies, e.g., heterogeneous earliest finish time, have been shown to be very efficient for heterogeneous environments. The main reason we do not use a time based scheduling strategy is that most operations in our case have irregular computation patterns and data dependent execution times. Estimating execution times for those operations would be very difficult. Thus, our scheduling approach uses relative GPU-vs-CPU speedup values, which we have observed are easier to estimate, have less variance, and lead to better scheduling.

Although we provide both CPU and GPU implementations of each operation in our implementation, this is not necessary for correct execution. When there is only a CPU or a GPU implementation of an operation, the scheduler can restrict the assignment of that operation to the appropriate type of computing device.

2) Data Locality Conscious Task Assignment (DL): The benefits of using a GPU for a certain computation are strongly impacted by the cost of data transfers between the GPU and the CPU before the GPU kernel can be started. In our execution model, input and output data are well defined, as they refer to the input and output streams of each stage and operation. Leveraging this structure, we have extended the base scheduler to promote data reuse and avoid penalties due to excessive data movement. After an operation assigned to a GPU has finished, the scheduler explores the operation dependency graph and searches for operations ready for execution that can reuse the data already in the GPU memory. If the operation speedups are not known, the scheduler always chooses to reuse data instead of selecting another operation that does not reuse data. When speedup estimates for the operations are available, the scheduler searches for tasks that reuse data in the dependency graph, but it additionally takes into consideration the other operations ready for execution. Although those operations may not reuse data, it may be worthwhile to pay the data transfer penalties if they benefit more from execution on a GPU than the operations that can reuse the data. To choose which operation instance to execute in this situation, the speedup of the dependent operation with the best speedup (Sd) is compared to that of the operation with the best speedup (Sq) that does not reuse the data. The dependent operation is chosen for execution if Sd ≥ Sq × (1 − transferImpact). Here, transferImpact is a real value between 0 and 1 and represents the fraction of the operation execution time spent in data transfer.
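This decision rule can be captured in a few lines. The function below is an illustrative sketch with names of our choosing: given the best speedup among dependent operations whose data are already resident on the GPU (Sd), the best speedup among other ready operations (Sq), and the fraction of the latter's execution time spent in data transfer, it decides whether to keep the data-reusing operation on the GPU.

```cpp
// Data locality conscious choice between a dependent operation that reuses
// data already resident on the GPU and another ready operation that does not.
// Illustrative sketch of the rule Sd >= Sq * (1 - transferImpact).
bool prefer_data_reusing_operation(double sd,               // best speedup among data-reusing candidates
                                   double sq,               // best speedup among other ready operations
                                   double transfer_impact)  // fraction of sq's execution time spent in data transfer, in [0, 1]
{
    return sd >= sq * (1.0 - transfer_impact);
}

// Example: sd = 6, sq = 10, transfer_impact = 0.5  ->  6 >= 5, so the
// data-reusing operation is scheduled even though sq is nominally larger.
```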

3) Data Prefetching and Asynchronous Data Copy: Data locality conscious task assignment reduces data transfers between CPUs and GPUs for successive operations in a pipeline. However, there are moments in the execution when data still have to be exchanged between these devices because of scheduling decisions. In those cases, data copy overheads can be reduced by employing prefetching and asynchronous data copy. New data can be copied to the GPU in parallel with the execution of the computation kernel on previously copied data [16]. In a similar way, results from previous computations may be copied to the CPU in parallel with a kernel execution. In order to employ both data prefetching and asynchronous data copy, we modified the runtime system to perform the computation and communication of pipelined operations in parallel. The execution of each operation using a GPU in this mode involves three phases: uploading, processing, and downloading. Each GPU manager thread and the WRM pipeline multiple operations through these three phases. Any input data needed for another operation waiting to execute, as well as the results from a completed operation, are copied to and from the GPU in parallel with the ongoing computation on the GPU.
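A minimal CUDA sketch of this three-phase pipelining is shown below; it is not the paper's runtime code. It assumes pinned host buffers (allocated with cudaMallocHost) and a placeholder kernel. Consecutive operations are issued on alternating streams with double-buffered device memory, so the upload and download of one operation can overlap the processing of another.

```cpp
#include <cuda_runtime.h>

// Placeholder per-pixel kernel standing in for an actual pipeline operation.
__global__ void process(const unsigned char* in, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

// Upload/process/download pipelining with two streams and double buffering.
// host_in/host_out: per-operation pinned host buffers; n: bytes per buffer.
void pipelined_run(unsigned char* const* host_in, unsigned char* const* host_out,
                   int num_ops, int n)
{
    cudaStream_t streams[2];
    unsigned char *d_in[2], *d_out[2];
    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&streams[b]);
        cudaMalloc(&d_in[b], n);
        cudaMalloc(&d_out[b], n);
    }
    for (int i = 0; i < num_ops; ++i) {
        int b = i % 2;                               // alternate buffers and streams
        cudaStream_t s = streams[b];
        // Phase 1: upload the input of operation i (overlaps the other stream's kernel).
        cudaMemcpyAsync(d_in[b], host_in[i], n, cudaMemcpyHostToDevice, s);
        // Phase 2: process; ordered after the upload because both use stream s.
        process<<<(n + 255) / 256, 256, 0, s>>>(d_in[b], d_out[b], n);
        // Phase 3: download the result (overlaps the next operation's upload).
        cudaMemcpyAsync(host_out[i], d_out[b], n, cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();
    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(streams[b]);
        cudaFree(d_in[b]);
        cudaFree(d_out[b]);
    }
}
```

Because operations issued on the same stream execute in order, reusing a buffer two iterations later is safe without extra synchronization, while work on different streams is free to overlap.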

IV. EXPERIMENTAL EVALUATION

A. Experimental Setup

We have evaluated the proposed application parallelization and runtime system optimizations using a distributed memory hybrid cluster, called Keeneland [2].


Keeneland is a National Science Foundation Track2D Experimental System and has 120 nodes in its current configuration. Each computation node is equipped with a dual socket Intel X5660 2.8 GHz Westmere processor, 3 NVIDIA Tesla M2090 (Fermi) GPUs, and 24GB of DDR3 RAM (see Figure 4). The nodes are connected to each other through a QDR Infiniband switch.

Figure 4. Architecture of a Keeneland node.

Image datasets used in the evaluation were obtained from brain tumor studies [3]. Each image was partitioned into tiles of 4K×4K pixels. The codes were compiled using gcc 4.1.2 with the "-O3" optimization flag, OpenCV 2.3.1, and NVIDIA CUDA SDK 4.0. The experiments were repeated 3 times; the standard deviation in performance results was not observed to be higher than 2%. The input data were stored in the Lustre filesystem.

B. Performance of Application Operations on GPU

This section presents the performance gains on the GPU for the individual pipeline operations. Figure 5 shows the performance gains achieved by each of the fine-grain operations as compared to its single core CPU counterpart. The speedup values in the figure represent the performance gains (1) when only the computation phase is considered (computation-only) and (2) when the cost of data transfer between CPU and GPU is included (computation+data transfer). The figure also shows the percentage of the overall computation time spent in each operation on one CPU core.

The results show that there are significant variations in performance gains among operations, as expected. The most time consuming stages are the ones with the best speedup values; this is in part because we have focused on optimizing the GPU implementations of those operations to reduce overall execution time. The feature computation stage stands out as having better GPU acceleration than the segmentation stage. This is a consequence of the former stage's more regular and compute intensive nature.

This performance evaluation indicates that the task scheduling approach should take these performance variations into consideration to maximize performance on hybrid CPU-GPU platforms. We evaluate the performance impact on pipelined execution of using PATS for scheduling operations in Section IV-C.

Figure 5. Evaluation of the GPU-based implementations of application components (operations).

C. Cooperative Pipeline Execution using CPUs and GPUs

This section presents the experimental results when multiple CPU cores and GPUs are used together to execute the brain cancer image analysis pipeline. In these experiments, two versions of the application workflow are used: (i) pipelined, which refers to the version described in Section II, where the operations performed by the application are organized as a hierarchical pipeline; and (ii) non-pipelined, which bundles the entire computation of an input tile into a single monolithic task that is executed either by a CPU or a GPU. The comparison between these versions is important to understand the performance impact of pipelining application operations.

Two scheduling strategies were employed for mapping tasks to CPUs or GPUs: (i) FCFS, which does not take performance variation into consideration; and (ii) PATS, which uses the expected speedups achieved by an operation in the scheduling decision. When PATS is used, the speedup estimates for each of the operations are those presented in Figure 5.

Figure 6. Application scalability when multiple CPUs and GPUs are used via the PATS and FCFS scheduling strategies.

The results for the various configurations, using the three images, are presented in Figure 6. In all cases, the CPU speedup using 12 cores is about 9; the sub-linear speedups are a result of the application's high memory bandwidth requirements. The 3-GPU execution achieved about 1.8× speedup on top of the 12 CPU core version for all images. The coordinated use of CPUs and GPUs improved performance over the 3-GPU executions.


We should note that only up to 9 CPU cores are used for computation in the multi-device experiments, because 3 cores are dedicated to GPU control threads. In the non-pipelined version of the application, the potential performance gains from using CPUs and GPUs together are limited by load imbalance. If a tile is assigned to a CPU core near the end of the execution, the GPUs will sit idle waiting until the CPU core finishes, which reduces the benefits of cooperative use of computing devices. The performance of PATS for the non-pipelined version is similar to that of FCFS. In this case, PATS scheduling is not able to make better decisions than FCFS, because the non-pipelined version bundles all the internal operations of an application stage into a single task; hence the performance variations of the operations are not exposed to the runtime system.

The CPU-GPU execution of the pipelined version of the application with FCFS (3 GPUs + 9 CPU cores - FCFS pipelined) also improved on the 3-GPU execution, reaching performance similar to that of the non-pipelined execution. This version of the application requires that data are copied to and from a GPU before and after an operation in the pipeline is assigned to the GPU. This introduces a performance penalty due to the data transfer overheads, which are about 13% of the computation time as shown in Figure 5, and limits the performance improvements of the pipelined version. The advantage of using the pipelined version in this situation is that load imbalance among CPUs and GPUs is reduced. The assignment of computation to CPUs or GPUs occurs at a finer grain; that is, application operations in the second level of the pipeline make up the tasks scheduled to CPUs and GPUs, instead of the entire computation of a tile as in the non-pipelined version. Figure 6 also presents the performance of PATS scheduling for the pipelined version of the application. As seen in the figure, processing of tiles using PATS is about 1.33× faster than using FCFS with either the non-pipelined or the pipelined version of the application. The performance gains result from the ability of PATS to assign the application's internal operations to the most suitable computing devices.

Figure 7. Execution profile (% of tasks processed by CPU or GPU) using PATS, per pipeline stage.

For instance, Figure 7 presents the percentage of tasks that PATS assigned to the CPUs or GPUs for each pipeline stage. As shown, the execution of components with lower speedups is mostly performed on the CPUs, while the GPUs are kept occupied with operations that achieve higher speedups. For reference, when using FCFS with the pipelined version, about 62% of the tasks for each operation are assigned to GPUs and the rest to CPUs, regardless of the performance variations between the operations.

D. Data Locality Conscious Scheduling/Data Prefetching

This section evaluates the performance impact of the data locality conscious task assignment (DL) and of data prefetching and asynchronous data download (Prefetching). Figure 8 presents the performance improvements with these optimizations for both the PATS and FCFS policies. For reference, the GPU-only and CPU-only performance for each of the images is the same as that presented in the previous section (Figure 6). As shown, the pipelined version with FCFS and DL is able to improve on the performance of the non-pipelined version by about 1.1× for all input images. When Prefetching is used in addition to FCFS and DL ("3 GPUs + 9 CPU cores - pipelined FCFS + DL + Prefetching"), there are no significant performance improvements. The main reason is that DL already avoids unnecessary CPU-GPU data transfers; therefore, Prefetching will only be effective in reducing the cost of uploading the input tile to the GPU and downloading the final results from the GPU. These costs are small and limit the performance gains resulting from Prefetching.

Figure 8. Performance impact of data locality conscious mapping and asynchronous data copy optimizations.

Figure 8 also shows the performance results for PATS when DL and Prefetching are employed. The use of DL improves the performance of PATS as well, but the gains achieved with DL (1.04×) are smaller than those for FCFS. In this case, the estimated speedups for the operations are available, so PATS will check whether it is worthwhile to download the operation results in order to map another operation to the GPU. The number of uploads/downloads avoided by using DL is also smaller than when FCFS is used, which explains the difference in performance gain. Prefetching with DL results in an additional 1.03× performance improvement.


This optimization was more effective in this case because the volume of data transferred between the CPU and the GPU is higher than when FCFS with DL is employed.

E. Impact of Worker Request Window Size

This section analyzes the effect of the demand-driven window size between the Manager and the Workers (i.e., the number of pipeline stage instances concurrently assigned to a Worker) on the CPU-GPU scheduling strategies utilized by the Worker. For this evaluation, we used 3 GPUs and 9 CPU cores (with 3 CPU cores allocated to the GPU manager threads) with FCFS and PATS. The window size is varied from 12 until no significant performance changes are observed.

Table II. Execution time (secs.) for different request window sizes and scheduling policies using 3 GPUs and 9 CPU cores.

Scheduling | Demand-Driven Window Size
           | 12   | 13   | 14   | 15   | 16   | 17   | 18   | 19
FCFS       | 75.1 | 73.4 | 74.9 | 73.7 | 75.3 | 74.9 | 73.2 | 73.5
PATS       | 75.1 | 61.0 | 56.9 | 53.1 | 54.1 | 51.5 | 51.2 | 50.7

Table II presents the execution times. FCFS scheduling is little impacted by variation in the window size. The PATS scheduler performance, on the other hand, is limited for small window sizes. In the scenario where the window size is 12, FCFS and PATS tend to make the same scheduling decisions, because usually only a single operation is available when a processor requests work. This makes the decision trivial and equal for both strategies. When the window size is increased, however, the scheduling decision space becomes larger, providing PATS with opportunities to make better task assignments. As shown in the table, with a window size of 15, PATS already achieves nearly its best performance. This is another good property of PATS, since very large window sizes can create load imbalance among Workers.

The profile of the execution (% of tasks processed by the GPU) as the window size is varied is displayed in Figure 9. As the window size increases, PATS changes the assignment of tasks, and operations with higher speedups are more likely to be executed by GPUs. The FCFS profile is not presented in the figure, but it is similar to that of PATS with a window size of 12 for all configurations.

F. Sensitivity to Inaccurate Speedup Estimation

In this section, we empirically evaluate the sensitivity of the PATS scheduler to errors in the GPU-vs-CPU speedup estimation of operations. For the sake of this analysis, we intentionally inserted errors into the estimated speedup values of the application operations in a controlled manner.

Figure 9. Execution scheduling profile for different window sizes and the PATS strategy.

In order to effectively confound the method, the estimated speedup values of the operations with lower speedups that are mostly scheduled to the CPUs (Morph. Open, AreaThreshold, FillHoles, and BWLabel) were increased, while those of the other operations were decreased. The changes were calculated as a percentage of an operation's original estimated speedup, and the variation range was from 0% to 100%.

Figure 10. Performance of PATS when errors in speedup estimation for the pipeline operations are introduced.

The execution times for different error rates are presented in Figure 10. The results show that PATS is capable of performing well even with high error rates in the speedup estimates. For instance, with a 60% estimation error, the performance of the pipeline is only 10% worse than in the initial case (0% speedup estimation error). At 70% and 80% error, PATS performance is more impacted, as a result of a mis-ordering of the pipeline operations that are mostly processed by the CPU (AreaThreshold, FillHoles, and BWLabel) with respect to ReconToNuclei and Watershed. Consequently, those stages with lower speedups are scheduled for execution on a GPU. Nevertheless, PATS still performs better than FCFS, because the operations in the feature computation stage are not mis-ordered. To emulate a 100% estimation error, we set to 0 the speedups of all substages that in practice have higher speedups, and doubled the estimated speedups of the other stages that in reality have lower speedup values. This forces PATS to preferably assign operations with low speedups to the GPU and the ones with high speedups to the CPU.


Even with this level of error, the execution times are only about 10% worse than those using FCFS.

G. Multi-node Scalability

This section presents the performance evaluation of the cancer image analysis application when multiple computation nodes are used. The evaluation was carried out using 340 glioblastoma brain tumor WSIs, which were partitioned into a total of 36,848 4K×4K tiles. Similarly to the other experiments, the input data tiles were stored as image files in the Lustre filesystem. Therefore, the results presented in this manuscript represent real end-to-end executions of the application, which include the overheads of reading input data.

The strong scaling evaluation is presented in Figure 11. First, Figure 11(a) shows the execution times for all configurations of the application when the number of computing nodes is varied from 8 to 100. All the application versions achieved improvements as the number of nodes increased, and the comparison of the approaches shows that cooperative CPU-GPU execution resulted in speedups of up to 2.7× and 1.7× over the "12-CPU cores non-pipelined" (CPU-only) version for PATS and FCFS, respectively. As shown in the figure, PATS with optimizations achieved the best performance for all numbers of computing nodes.

(a) Execution times.

(b) Parallelization efficiency.

Figure 11. Multi-node scalability: strong scaling evaluation.

Further, Figure 11(b) presents the parallelization efficiency for all versions of the application. As may be noticed, the parallelization efficiency decreases at different rates for the application versions as the number of nodes increases. For instance, the efficiency on 100 nodes is about 85% for the CPU-only version of the application, while it is nearly 70% for the CPU-GPU cooperative executions. The main limiting factor and bottleneck for better parallelization efficiency is the I/O overhead of reading image tiles. As the number of nodes increases, I/O operations become more expensive, because more clients access the file system in parallel. The strategies that use cooperative CPU-GPU execution have lower efficiency simply because they are faster and, consequently, require more I/O operations per unit of time. If only the computation times were measured, the efficiency of those versions would increase to about 93%. Even with the I/O overheads, the application achieved good scalability and was able to process the entire set of 36,848 tiles in less than four minutes when 100 nodes were employed, using a total of 1,200 CPU cores and 300 GPUs in cooperation with PATS. This represents a huge improvement in data processing capabilities. Currently, as discussed in Section VI, we are evaluating efficient I/O mechanisms to improve the performance of this component of the application on large scale machines.

V. RELATED WORK

The use of hybrid accelerated computing environments is growing in HPC leadership supercomputing machines [2]. The appropriate utilization of these hybrid systems, however, typically requires complex software instruments to deal with a number of peculiar aspects of the different processors available. These challenging problems have motivated a number of languages and runtime frameworks [14], [13], [15], [17], [18], [19], [20], [21], [22], [23], [24], [25], specialized libraries [4], and compiler techniques [26].

Mars [14] and Merge [13] evaluated the cooperative use of CPUs and GPUs to speed up MapReduce computations. Mars performed an initial evaluation of the benefits of statically partitioning Map and Reduce tasks between CPU and GPU. Merge extended that approach with dynamic distribution of work at runtime. The Qilin [15] system further proposed an automated methodology to map computation tasks to CPUs and GPUs. The Qilin strategy is based on an early profiling phase, in which performance data for the target application are collected, to build a performance model that is used to estimate the best work division. None of these solutions (Mars, Merge, and Qilin), however, is able to take advantage of distributed systems.

Other projects have focused on execution on distributed CPU-GPU equipped platforms [22], [23], [24], [25], [21]. Ravi et al. [23], [25] proposed techniques for automatic translation of generalized reductions to CPU-GPU environments via compiler techniques, coupled with runtime support to coordinate execution. The runtime system techniques introduced auto-tuning approaches to dynamically partition tasks among CPUs and GPUs.


The work by Hartley et al. [24] is contemporary with that of Ravi and proposed similar runtime strategies for divisible workloads.

DAGuE [22] and StarPU [17] are frameworks that focus on the execution of regular linear algebra applications on CPU-GPU machines. These systems represent the application as a DAG of operations and ensure that dependencies are respected. They offer different scheduling policies, including those that prioritize computation of critical paths in the dependency graph in order to maximize parallelism. StarPU does not handle task dependencies across nodes in a distributed environment; this is left to the programmer, who must perform MPI based inter-node communication to resolve dependencies. Though DAGuE includes support for multi-node execution, it assumes that the application DAG is static and known before execution. This structure fits well in regular linear algebra applications, but it is a limitation that prevents its use in irregular and dynamic applications. In our application, the dependency graph representing the application must be dynamically built during the execution, as the computation of the next stage of the analysis pipeline may depend on the results of the current stage. For instance, the feature computation should not be executed if no nuclei are found in the segmentation stage. Additionally, neither of these solutions allows for the representation of the application as a multi-level pipeline, which is a key feature for achieving high performance with fine-grain task management. These limitations motivated the development of the infrastructure needed to execute our application.

Our work targets a scientific data analysis pipeline, including the GPU/CPU implementations of several challenging irregular operations. In addition, we develop support for the execution of applications that can be described as a multi-level pipeline of operations, where coarse-grain stages are divided into fine-grain operations. This facilitates leveraging the variability in the amount of GPU acceleration of fine-grain operations, which was not possible in previous works. We also investigate a set of optimizations that includes data locality aware task assignment. This optimization dynamically groups operations that present good performance according to the current set of tasks ready to execute on a machine, instead of doing so statically prior to execution as in our previous work [12]. Data prefetching and asynchronous data transfer optimizations are also employed in order to maximize computational resource utilization.

VI. CONCLUSIONS AND FUTURE DIRECTIONS

Hybrid CPU-GPU cluster systems offer significant computing and memory capacity to address the computational needs of large scale scientific analyses. In this paper, we have developed an image analysis application that can fully exploit such platforms to achieve high-throughput data processing rates. We have shown that significant performance improvements are achieved when an analysis application can be assembled as pipelines of fine-grain operations, as compared to bundling all internal operations into one or two monolithic methods. The former exposes application processing patterns more accurately to the runtime environment and empowers the middleware system to make better scheduling decisions. Performance aware task scheduling coupled with function variants enables efficient coordinated use of CPU cores and GPUs in pipelined operations. Performance gains can be further increased on hybrid systems through additional runtime optimizations such as locality conscious task mapping, data prefetching, and asynchronous data copy. Employing a combination of these optimizations, our application implementation has achieved a processing rate of about 150 tiles per second when 100 nodes, each with 12 CPU cores and 3 GPUs, are used. These levels of processing speed make it feasible to process very large datasets and would enable a scientist to explore different scientific questions rapidly and/or carry out algorithm sensitivity studies.

The current implementation of the classification step (step 4 in Section II) clusters images into groups based on average feature values per image. The average feature values can be computed by maintaining on each compute node a running sum of the feature values of the segmented objects for each image. The partial sums on the nodes can then be accumulated in a global sum operation, and the average feature values per image can be computed. Thus, the amount of data transferred from the feature computation step to the classification step is relatively small.

However, there are cases when the output from a stage, or even from an operation, needs to be staged to disk. For example, studying the sensitivity of the segmentation stage output to input parameters and algorithm variations would require us to execute multiple runs. It might not be possible, due to time and resource constraints, to maintain the output from a run in memory until all the runs have been completed. The output from a stage in a single run or in multiple runs may also need to be stored on disk for inspection or visualization at a later time. As future work, in order to support the I/O requirements in such cases, we are developing an I/O component based on a stream-I/O approach, drawing from filter-stream networks [27], [28], [29], [30] and data staging [31], [32]. This implementation provides flexibility: the I/O processes can be placed on different physical processors in the system. For example, if a system had separate machines for I/O purposes, the I/O nodes could be placed on those machines. Moreover, the implementation allows us to leverage different I/O sub-systems. In addition to POSIX I/O, in which each I/O process writes out its buffers independently of other I/O nodes, we have integrated ADIOS [33] for data output. ADIOS has been shown to be efficient, portable, and scalable on supercomputing platforms and for a range of applications. We are in the process of carrying out initial performance evaluations of the I/O component.


ACKNOWLEDGMENT

This work was supported in part by HHSN261200800001E from the National Cancer Institute, R24HL085343 from the National Heart Lung and Blood Institute, R01LM011119-01 and R01LM009239 from the National Library of Medicine, RC4MD005964 from the National Institutes of Health, and PHS UL1RR025008 from the Clinical and Translational Science Awards program. This research used resources of the Keeneland Computing Facility at the Georgia Institute of Technology, which is supported by the National Science Foundation under Contract OCI-0910735.

REFERENCES

[1] NVIDIA, "GPU Accelerated Applications," 2012. [Online]. Available: http://www.nvidia.com/object/gpu-accelerated-applications.html

[2] J. S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis, S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford, and S. Yalamanchili, "Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community," Computing in Science and Engineering, vol. 13, 2011.

[3] L. A. D. Cooper, J. Kong, D. A. Gutman, F. Wang, S. R. Cholleti, T. C. Pan, P. M. Widener, A. Sharma, T. Mikkelsen, A. E. Flanders, D. L. Rubin, E. G. V. Meir, T. M. Kurc, C. S. Moreno, D. J. Brat, and J. H. Saltz, "An integrative approach for in silico glioma research," IEEE Trans Biomed Eng., vol. 57, no. 10, pp. 2617-2621, 2010.

[4] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.

[5] NVIDIA, NVIDIA Performance Primitives (NPP), 11 February 2011. [Online]. Available: http://developer.nvidia.com/npp

[6] A. Korbes, G. B. Vitor, R. de Alencar Lotufo, and J. V. Ferreira, "Advances on watershed processing on GPU architecture," in Proceedings of the 10th International Conference on Mathematical Morphology, ser. ISMM'11, 2011.

[7] L. Vincent, "Morphological grayscale reconstruction in image analysis: Applications and efficient algorithms," IEEE Transactions on Image Processing, vol. 2, pp. 176-201, 1993.

[8] G. Teodoro, T. Pan, T. M. Kurc, L. Cooper, J. Kong, and J. H. Saltz, "A Fast Parallel Implementation of Queue-based Morphological Reconstruction using GPUs," Emory University, Center for Comprehensive Informatics Technical Report CCI-TR-2012-2, January 2012.

[9] P. Karas, "Efficient Computation of Morphological Greyscale Reconstruction," in MEMICS, ser. OASICS, vol. 16. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2010.

[10] P.-E. Danielsson, "Euclidean distance mapping," Computer Graphics and Image Processing, vol. 14, pp. 227-248, 1980.

[11] V. M. A. Oliveira and R. de Alencar Lotufo, "A Study on Connected Components Labeling algorithms using GPUs," in SIBGRAPI, 2010.

[12] G. Teodoro, T. M. Kurc, T. Pan, L. A. Cooper, J. Kong, P. Widener, and J. H. Saltz, "Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems," in 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012.

[13] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng, "Merge: a programming model for heterogeneous multi-core systems," SIGPLAN Not., vol. 43, no. 3, pp. 287-296, 2008.

[14] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce Framework on Graphics Processors," in Parallel Architectures and Compilation Techniques, 2008.

[15] C.-K. Luk, S. Hong, and H. Kim, "Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping," in 42nd International Symposium on Microarchitecture (MICRO), 2009.

[16] T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and D. I. August, "Automatic CPU-GPU communication management and optimization," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '11, 2011, pp. 142-151.

[17] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, "StarPU: A unified platform for task scheduling on heterogeneous multicore architectures," in Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing, 2009, pp. 863-874.

[18] G. F. Diamos and S. Yalamanchili, "Harmony: an execution model and runtime for heterogeneous many core systems," in Proceedings of the 17th International Symposium on High Performance Distributed Computing, ser. HPDC '08. New York, NY, USA: ACM, 2008, pp. 197-200.

[19] G. Teodoro, R. Sachetto, O. Sertel, M. Gurcan, W. M. Jr., U. Catalyurek, and R. Ferreira, "Coordinating the use of GPU and CPU for improving performance of compute intensive applications," in IEEE Cluster, 2009.

[20] N. Sundaram, A. Raghunathan, and S. T. Chakradhar, "A framework for efficient and scalable execution of domain-specific templates on GPUs," in IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, 2009, pp. 1-12.

[21] G. Teodoro, T. D. R. Hartley, U. Catalyurek, and R. Ferreira, "Run-time optimizations for replicated dataflows on heterogeneous environments," in Proc. of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC), 2010.

[22] G. Bosilca, A. Bouteiller, T. Herault, P. Lemarinier, N. Saengpatsa, S. Tomov, and J. Dongarra, "Performance Portability of a GPU Enabled Factorization with the DAGuE Framework," in 2011 IEEE International Conference on Cluster Computing (CLUSTER), Sept. 2011, pp. 395-402.

[23] V. Ravi, W. Ma, D. Chiu, and G. Agrawal, "Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations," in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010, pp. 137-146.

[24] T. D. R. Hartley, E. Saule, and U. V. Catalyurek, "Automatic dataflow application tuning for heterogeneous systems," in HiPC. IEEE, 2010, pp. 1-10.

[25] X. Huo, V. Ravi, and G. Agrawal, "Porting irregular reductions on heterogeneous CPU-GPU configurations," in 18th International Conference on High Performance Computing (HiPC), Dec. 2011, pp. 1-10.

[26] S. Lee, S.-J. Min, and R. Eigenmann, "OpenMP to GPGPU: a compiler framework for automatic translation and optimization," in PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009, pp. 101-110.

[27] R. H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D. E. Culler, J. M. Hellerstein, D. A. Patterson, and K. Yelick, "Cluster I/O with River: Making the Fast Case Common," in IOPADS '99: Input/Output for Parallel and Distributed Systems, Atlanta, GA, May 1999.

[28] B. Plale and K. Schwan, "Dynamic Querying of Streaming Data with the dQUOB System," IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 4, pp. 422-432, 2003.

[29] V. S. Kumar, P. Sadayappan, G. Mehta, K. Vahi, E. Deelman, V. Ratnakar, J. Kim, Y. Gil, M. W. Hall, T. M. Kurc, and J. H. Saltz, "An integrated framework for performance-based optimization of scientific workflows," in HPDC, 2009, pp. 177-186.

[30] T. Tavares, G. Teodoro, T. Kurc, R. Ferreira, D. Guedes, W. J. Meira, U. Catalyurek, S. Hastings, S. Oster, S. Langella, and J. Saltz, "An efficient and reliable scientific workflow system," IEEE International Symposium on Cluster Computing and the Grid, vol. 0, pp. 445-452, 2007.

[31] C. Docan, M. Parashar, and S. Klasky, "Dataspaces: an interaction and coordination framework for coupled simulation workflows," in HPDC, 2010, pp. 25-36.

[32] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky, K. Schwan, and F. Zheng, "Datastager: scalable data staging services for petascale applications," Cluster Computing, vol. 13, no. 3, pp. 277-290, 2010.

[33] J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, and C. Jin, "Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS)," in CLADE, 2008, pp. 15-24.

