
SCIDAC CENTER FOR ENABLING DISTRIBUTED PETASCALE SCIENCE

STATUS REPORT FOR THE PERIOD

MAY 01, 2007 THROUGH NOVEMBER 30, 2008

Principal Investigator: Ian Foster

The Center for Enabling Distributed Petascale Science Team:

Andrew Baranovski [1], Joshua Boverhof [2], Ann Chervenak [3], Ian Foster [4], Dan Gunter [2], Kate Keahey [4], Carl Kesselman [3], Miron Livny [5], Tim Freeman [4], Robert Schuler [3], Ravi K Madduri [4], Rajkumar Kettimuthu [4], John Bresnahan [4], Michael Link [4], Nick LeRoy [5]

[1] Fermi National Accelerator Laboratory
[2] Lawrence Berkeley National Laboratory
[3] University of Southern California, Information Sciences Institute
[4] Argonne National Laboratory
[5] University of Wisconsin, Madison


TABLE OF CONTENTS

1. Executive Summary
2. Highlights
   2.1. Data Area Highlights
   2.2. Scalable Services Area Highlights
   2.3. Troubleshooting Area Highlights
3. Data Area Progress
   3.1. High-performance Transport: GridFTP
   3.2. Data Replication and Placement
   3.3. Resource Management: LotMan and Lease Manager
   3.4. dCache Improvements
4. Services Area Progress
   4.1. Nimbus
   4.2. Service Construction Tools
5. Troubleshooting Area Progress
   5.1. Collection and Archive Service Improvements
   5.2. Log Data Analysis
   5.3. Integration of Log Data with MDS4
6. Collaborations
   6.1. National Energy Research Supercomputing Center (NERSC)
   6.2. Argonne Leadership Computing Facility (ALCF)
   6.3. Earth System Grid (ESG) Center for Enabling Technology
   6.4. Pegasus Group, USC/ISI
   6.5. Scientific Data Management Group, LBNL
   6.6. Globus Team, ANL
   6.7. Solenoidal Tracker At RHIC (STAR) Experiment
   6.8. TechX Corporation
   6.9. Open Science Grid (OSG)
   6.10. Advanced Photon Source (APS)
   6.11. Nuclear Physics Groups
7. Presentations and Publications
   7.1. Talks
   7.2. Tutorials
   7.3. Papers
   7.4. Posters


Center for Enabling Distributed Petascale Science
Progress Report, May 2007 - November 2008

1. EXECUTIVE SUMMARY

The SciDAC-funded Center for Enabling Distributed Petascale Science (CEDPS) was established to address technical challenges that arise due to the frequent geographic distribution of data producers (in particular, supercomputers and scientific instruments) and data consumers (people and computers) within the DOE laboratory system. Its goal is to produce technical innovations that meet DOE end-user needs for (a) rapid and dependable placement of large quantities of data within a distributed high-performance environment, and (b) the convenient construction of scalable science services that provide for the reliable and high-performance processing of computation and data analysis requests from many remote clients. The Center is also addressing (c) the important problem of troubleshooting these and other related ultra-high-performance distributed activities from the perspective of both performance and functionality.

This report summarizes work carried out by the CEDPS-CET during the period May 2007 through November 2008. It includes discussion of highlights, overall progress, period goals, collaborations, papers, and presentations. The CEDPS-CET team brings together researchers and scientists with diverse domain knowledge, whose home institutions include three DOE laboratories and two universities: Argonne National Laboratory (ANL), Fermi National Accelerator Laboratory (FNAL), Lawrence Berkeley National Laboratory (LBNL), the University of Wisconsin (Wisc), and the University of Southern California, Information Sciences Institute (USC/ISI). The CEDPS-CET PI is Ian Foster, ANL; the area leads are Ann Chervenak (USC/ISI), Ravi Madduri (ANL), and Dan Gunter (LBNL). All work is accomplished in close collaboration with the project's stakeholders and domain researchers and scientists. To learn more about our project, please visit the CEDPS website (http://cedps-scidac.org).

2. HIGHLIGHTS

The CEDPS-CET team is working in three sub-areas: Data (CEDPS-Data), Scalable Services (CEDPS-Services), and Troubleshooting (CEDPS-Troubleshooting). We list highlights for each area in this section, and then provide details in the section that follows. While for convenience we present each area separately, there are numerous cross-connections among the different activities, as we make clear in the text that follows.


2.1. DATA AREA HIGHLIGHTS

Data work in the CEDPS project takes place in several collaborating groups, including the GridFTP team at ANL, the data replication and placement team at ISI, the storage allocation and placement team at UW, and the dCache team at Fermi.

The work of this team is focused on enabling reliable, high-performance data placement within high-end distributed systems. The word placement here is used to denote policy-driven data movement—for example, to ensure that data is moved from an Advanced Photon Source beamline to an end-user laboratory in a timely manner, or to ensure that data produced by a supercomputer simulation is replicated to collaborator sites. Challenges addressed in this work include efficient end-to-end transport over high-speed networks; the management of scarce resources, such as space and bandwidth; detection and recovery from failures; and high-level specification of user policies. A workhorse for much of this work is the GridFTP data movement system, which provides the basic data transport capabilities.

Highlights for this year include the following:

- An optimization for Lots of Small Files (LOSF) transfers, allowing multiple files in transit at the same time. This optimization can improve performance by an order of magnitude or more in some situations. The Advanced Photon Source has used the concurrency optimization in conjunction with pipelining to transfer terabytes of data (partitioned into lots of small files) to a user in Australia at a rate 30 times faster than standard FTP.

- Capabilities to dynamically improve the scalability of GridFTP servers. Additional data mover nodes can be added to the GridFTP server at run time to handle more transfer requests. This work addressed problems reported by DOE users on Open Science Grid and a range of non-DOE users on TeraGrid.

- Deployment of GridFTP on the HPSS storage system at the Argonne Leadership Computing Facility (ALCF), enabling high-performance remote access to and from ALCF storage.

- The design of tools for supporting data replication capabilities, with the goal of supporting the data mirroring needs of application communities including the Earth System Grid, the STAR physics experiment, and the Spallation Neutron Source (SNS).

- Continued support for the Replica Location Service, which is used by a variety of scientific collaborations, including the Earth System Grid and the NorduGrid ATLAS high energy physics application.

- Implementation of LotMan, lightweight storage allocation software with a plug-in interface to GridFTP. (This work is an essential step towards enabling space management in distributed systems.) The LotMan software was integrated into the Virtual Data Toolkit (VDT), providing convenient access for Open Science Grid and other users.


- Hardening and productization of data placement code that uses two components, the Stork data placement service and a newly developed Lease Manager component, to provide dynamic matchmaking for data placement jobs.

- Modifications to the dCache system to provide robust, end-to-end data integrity verification. This work has proven beneficial to the CMS high energy physics application, which uses the capability to verify checksums on approximately 10 Terabytes per day of data downloaded from CERN in Geneva, Switzerland, to Fermilab in Illinois.

2.2. SCALABLE SERVICES AREA HIGHLIGHTS

Work in the scalable services area is motivated by the fact that moving data to computation is not always feasible, and may be expected to be far less feasible in the future as data volumes continue to grow. Thus, we seek methods for enabling remote access to code and for moving computation easily to remote computers. The two major initiatives are the grid Resource Allocation and Virtualization Environment (gRAVI) tools for wrapping science applications as services, and the Nimbus infrastructure-as-a-service (IaaS, aka "cloud") software.

Major accomplishments for the CEDPS-Services area are a full implementation of rapid creation and deployment of application services using gRAVI, and several releases of the Nimbus Toolkit, which provides tools that enable scientists to easily leverage cloud computing capabilities. We also successfully integrated gRAVI with Nimbus, thus providing a full spectrum of scalable services functionality to our user communities.

A major application success was enabling the first production run of the nuclear physics STAR applications on Amazon’s EC2 cloud computing infrastructure, in September 2007. The deployment of the STAR cluster on EC2 was orchestrated by the Nimbus Context Broker service that enables automatic and secure deployment of “turnkey” virtual clusters, bridging the gap between functionality provided by EC2 and the “end product” that scientific communities need to deploy their applications. Scientific production runs require careful and involved environment preparation and reliability: this run was a significant step towards convincing the broad STAR community that real science can be done using cloud computing.

The gRAVI tools have been adopted enthusiastically by DOE groups at the Advanced Photon Source at Argonne National Laboratory and the NERSC group at Lawrence Berkeley Lab, enabling rapid virtualization and provisioning of applications. The tools we developed are also being used and adopted in communities such as the NIH-sponsored cancer Biomedical Informatics Grid (caBIG) project and the Cardiovascular Research Grid project, and by the OMII-UK team.

2.3. TROUBLESHOOTING AREA HIGHLIGHTS

The CEDPS-Troubleshooting area finished implementation of the prototype tools that parse existing logs and load them into a SQL database, collectively called the "log pipeline". The log parser framework was populated with over a dozen new parsers covering software components from PBS, SGE, Condor, Globus, BeStMan, and HSI. Documentation and internal logging were improved dramatically across the board. Together, these improvements prepared the CEDPS troubleshooting tools to be deployed and used on OSG.

Collaborations with a variety of groups have been very fruitful. The troubleshooting tools are now in active use by at least four groups:

- The CEDPS log pipeline was deployed on the NERSC Parallel Distributed Systems Facility (PDSF). Analysis of data from the STAR BeStMan transfers has revealed unexpectedly poor network performance, which has triggered upgrades and configuration changes on PDSF.

- The NERSC Project Accounts team is using CEDPS log parsing to normalize its logs and log database in order to perform traceability analysis.

- The Pegasus team at USC/ISI uses the CEDPS pipeline for large computational seismology workflows. The CEDPS tools were able to efficiently analyze execution logs of earthquake science workflows consisting of upwards of a million tasks.

- The Tech-X STAR job submission portal uses the CEDPS log database to drill down to site-specific information for a portal job. A prototype of this functionality was demonstrated at SC08.

Enhancements to the NetLogger log summarization library, developed as part of CEDPS, are used in GridFTP to implement a "bottleneck detection" algorithm that answers the increasingly important question of whether disk or network is the data transfer bottleneck.

As part of our collaboration with ESG, we released a dramatically improved version of the administrative interface to the MDS Trigger Service. This work was done in response to feedback from ESG, who requested this capability to support a portal that they plan to develop, as well as to provide a simpler command-line interface.

3. DATA AREA PROGRESS

This year, CEDPS-Data had accomplishments in several important areas.

3.1. HIGH-PERFORMANCE TRANSPORT: GRIDFTP

Optimization of many small file transfers. GridFTP has long been used to move large files rapidly over wide area networks, with methods such as striping, parallelism, and alternative protocols used to achieve high performance. Unfortunately, scientific data is often partitioned into many small files. For example, microtomographic data produced at the Advanced Photon Source is typically organized as a large number of slice files. In these circumstances, GridFTP suffered from lower transfer rates due to synchronization costs. The GridFTP team developed a pipelining solution last year to address this problem. Though pipelining improved performance significantly, there was room for further optimization. This year, the team developed an additional optimization for Lots of Small Files (LOSF) transfers: concurrency. Concurrency refers to having multiple control channel connections between the client and the server, and thus having multiple files in transit at the same time. This is equivalent to starting up n different clients for n different files and having them all running at the same time. The Advanced Photon Source has used the concurrency optimization in conjunction with pipelining to transfer terabytes of data (partitioned into lots of small files) to a user in Australia at a rate 30 times faster than standard FTP. In addition, the LIGO project has used these optimizations to transfer large volumes of data on a non-LHC type of network from Milwaukee to Germany at a sustained rate of 80 MBytes/sec.
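To make the combination concrete, the sketch below shows how a client-side script might request pipelining and concurrency together. It is a minimal illustration, not CEDPS code: the endpoint URLs are placeholders, and the -pp (pipelining), -cc (concurrency), and -p (parallel streams) options are assumed to match the globus-url-copy flags documented for Globus releases of this era.

    #!/usr/bin/env python
    """Sketch: drive a lots-of-small-files (LOSF) transfer with
    pipelining and concurrency via globus-url-copy.  Endpoints are
    placeholders; flag names are an assumption noted above."""

    import subprocess

    SRC = "gsiftp://beamline.aps.example.gov/data/scan42/"      # placeholder
    DST = "gsiftp://storage.user.example.edu/incoming/scan42/"  # placeholder

    cmd = [
        "globus-url-copy",
        "-r",        # recurse into the directory of small slice files
        "-pp",       # pipelining: issue the next transfer command early
        "-cc", "8",  # concurrency: 8 files in flight at once
        "-p", "4",   # 4 parallel TCP streams per file
        SRC, DST,
    ]

    # Raises CalledProcessError if the transfer client exits non-zero.
    subprocess.run(cmd, check=True)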

Instrumentation with NetLogger and automated bottleneck detection. The team also added a capability to instrument GridFTP with the NetLogger performance measurement tools. This capability has been helpful to DOE ESNet users and TeraGrid users. The GridFTP server now logs messages that can be postprocessed using NetLogger tools and collected using syslog-ng logging records. Fine-grained disk and network I/O characteristics can be visualized and analyzed. The commonly used GridFTP client globus-url-copy takes advantage of this feature by telling its user which one of the following is the bottleneck for a transfer: Disk Read, Network Write, Network Read, or Disk Write.
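The following minimal sketch illustrates the idea behind this bottleneck detection: once per-operation wait times have been summarized from the instrumented events, the dominant stage is the one the transfer spent the most time blocked on. The field names and numbers are illustrative, not the actual NetLogger event schema.

    """Sketch of the bottleneck-detection idea: given per-stage time
    totals summarized from instrumented GridFTP events, report the
    stage that dominated.  Names and values are illustrative."""

    def classify_bottleneck(seconds_blocked: dict) -> str:
        """Return the stage the transfer spent the most time waiting on."""
        return max(seconds_blocked, key=seconds_blocked.get)

    # Example summary for one transfer (illustrative numbers).
    summary = {
        "disk_read": 812.4,
        "network_write": 103.7,
        "network_read": 2.1,
        "disk_write": 1.9,
    }
    print("bottleneck:", classify_bottleneck(summary))  # -> disk_read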

Dynamic scaling of GridFTP servers. Open Science Grid (OSG) participants reported that their single biggest problem with running GridFTP servers is that the servers can overwhelm the transfer host and/or the underlying storage system. TeraGrid reported that their major problem with running striped GridFTP servers is the disruption caused by the failure of one of the data mover nodes. In response to these problem reports, the GridFTP team developed capabilities to dynamically improve the scalability of GridFTP servers. Additional data mover nodes can be added to the GridFTP server at run time to handle more transfer requests. The GridFTP group also improved the resiliency of striped GridFTP servers. The GridFTP server now continues to operate after any data mover node failure as long as at least one of the data mover nodes is alive. The GridFTP server with these new features was released as part of OSG's VDT.

Deployment of GridFTP at a leadership-class computing center. The GridFTP team has worked closely with the Argonne Leadership Computing Facility on deploying GridFTP on their HPSS storage system. A number of issues in IBM's Parallel I/O interface for HPSS and in GridFTP's HPSS DSI have been uncovered, and most of them have been fixed.

Advertisement of GridFTP server properties. To further address the issue of overwhelming transfer hosts, and to improve the overall quality of service of data transfers for DOE applications, the GridFTP team prototyped a GridFTP information provider service. The service publishes information such as server load, the number of open connections, and the maximum number of connections allowed for a GridFTP server, for use by higher-level services to improve QoS for data transfers.
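As a rough illustration of what such a provider might publish, the sketch below emits a load record that a higher-level broker could consult before dispatching a transfer. The record format and field names are hypothetical; the prototype's actual schema is not reproduced in this report.

    """Sketch of the kind of record a GridFTP information provider
    might publish for higher-level services.  The JSON layout and
    all field names are hypothetical."""

    import json, time

    def server_info(open_connections: int, max_connections: int,
                    load_1min: float) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "open_connections": open_connections,
            "max_connections": max_connections,
            "load_1min": load_1min,
            # A broker could refuse new transfers when at capacity.
            "accepting": open_connections < max_connections,
        })

    print(server_info(open_connections=37, max_connections=64, load_1min=3.2))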

Year 2 Milestones

- MS2.2. Work with VDT team to include MOPS 1.0 in release
  o Status: Complete
- MS2.5. Prototype a non-striped connection management capability using NeST
  o Status: Complete. GridFTP is capable of passing the attributes of a connection to NeST.
- MS2.8. Document use cases and performance of ways to manage transfers
  o Status: Complete. Preliminary results are available at http://www.mcs.anl.gov/~bresnaha/gridftp_memory/
- MS2.9. Design and prototype implementation of common interface to storage
  o Status: Partially complete. We have created a design document, but the prototype implementation is not complete.
- MS2.10. Prototype methods of incorporating troubleshooting into MOPS
  o Status: Complete. The integrated software is available as part of Globus 4.2. More information is available at http://www.cedps.net/index.php/Gridftp-netlogger
- MS2.12. Deliver a MOPS 2.0 release that includes additional optimizations
  o Status: Complete. This functionality has been released as enhancements to GridFTP as part of Globus 4.2.x.

Our plans for the coming year include the following:

- Augment the prototype information provider service and create a production implementation, suitable for use by higher-level data movement services.
- Prototype a GridFTP control channel brokering service that controls access to a GridFTP server. Such a service is needed to provide a better-than-best-effort data movement service.
- Investigate ways to integrate bandwidth reservation services with GridFTP and higher-level data movement services such as the Reliable File Transfer (RFT) service.
- Continue to work with APS to deploy new capabilities and obtain feedback.
- Work with the Spallation Neutron Source (SNS) to identify how our data movement tools can help SNS scientists and users.
- Work with the services team on prototyping a simple storage cloud using GridFTP as the underlying data transfer mechanism.

3.2. DATA REPLICATION AND PLACEMENT

During the past year, the data replication and placement group has focused on three main areas. In the area of data replication and placement services, the team did a significant re-evaluation of the functionality that we should provide based on our interactions with DOE users. We reoriented our work toward providing simple replication utilities rather than higher-level data placement services, because the former better met the needs and requirements of DOE application communities. In particular, we are focused on supporting the needs of the Earth System Grid, which currently uses the Replica Location Service to manage location information for data sets.

ESG is working on the next generation of replica management functionality for their application domain, and the CEDPS group is participating in the design of replication services that will be used by ESG. In addition, we had extensive discussions with the STAR and SNS applications. All three application communities (ESG, STAR, and SNS) identified a need for data mirroring functionality, and their requirements will drive the ongoing implementation of data mirroring tools. Our design and development work is focused on providing this functionality.

In addition, we continued to do research on data placement policies in two areas. First, we looked at the requirements of DOE virtual organizations, such as the high-energy physics community, to disseminate data according to policies at the Virtual Organization level, such as the tiered distribution of data produced by the LHC at CERN. We looked at whether we could use a policy engine to enforce similar policies and had some initial success, as reflected in a poster at the SC2008 conference.

A second area of increasing interest to DOE communities is the use of workflow engines to manage complex scientific workflows. We have done research to characterize realistic scientific workflows (resulting in a paper at the Workshop on Workflows in Support of Large-Scale Science, WORKS08). Based on those scientific workflows, we have simulated a variety of data placement strategies that could work in conjunction with a workflow management system to improve the efficiency of execution of scientific workflows. A paper on this work was recently submitted to the DADC 2009 Workshop.

Year 2 Milestones

- MS2.4. Design and prototype of reliable distribution service
  o Status: Complete. The BuTrS service is available at http://www.cedps.net/index.php/Dps10ReleaseNotes
- MS2.13. Release version 2.0 of the DPS with additional functionality
  o Status: Based on our discussions with DOE user communities, we re-oriented our design and development efforts toward providing a simple data mirroring capability. We produced a design document for the initial phase of this work, and the implementation is in progress. An initial prototype of this capability will be available in early 2009 (first quarter).
- MS2.14. Work with troubleshooting to include additional data in logs for placement services
  o Status: This work is pending, since we are still in an implementation stage in providing new data mirroring functionality. We are committed to incorporating CEDPS troubleshooting interfaces into future data services.

Plans for 2009 include working closely with the Earth System Grid, SNS, and other application communities to understand their requirements in the areas of data replication and mirroring, and providing functionality that allows these groups to manage their data better. Initially, we will provide a very simple data replication capability based on existing GridFTP and SRM functionality. Over the coming year, we plan to add features to the data replication and mirroring capabilities to provide richer functionality to DOE science communities.

3.3. RESOURCE MANAGEMENT: LOTMAN AND LEASE MANAGER

During the past year, we integrated LotMan, lightweight storage allocation software with a plug-in interface to GridFTP, into the Virtual Data Toolkit (VDT). Through the VDT, this functionality was made available to many groups, including the Open Science Grid (OSG). Basic tests of LotMan have been added to confirm its functionality.


The other primary effort of this team has been hardening and productizing the data placement code initially developed for the Supercomputing 2005 conference. This code uses two components, the Stork data placement service and a newly developed Lease Manager component, to provide dynamic matchmaking for data placement jobs. Significant effort has been put into converting the Lease Manager from a proof-of-concept prototype into a mature component. The Lease Manager has been integrated into Condor, and it is built and tested regularly as part of Condor's nightly build and test.

Stork development has continued in parallel in two different groups. In addition to the CEDPS team, a group at Louisiana State University led by Tevfik Kosar, the former UW student responsible for Stork's initial development, has also continued to add functionality to Stork. Kosar's group recently released Stork 1.0. At the same time, the UW team has been working to harden the Stork / Lease Manager interface and the dynamic matching of data placement jobs. Several additional uses for the Lease Manager as a part of Condor are planned. Going forward, the UW team will work with Kosar's group to integrate the two lines of Stork development, with the goal of providing a single Stork distribution in the future.

Year 2 Milestones

- MS2.7. Develop a managed storage capability for non-striped MOPS.

Future plans in this area involve additional testing of both LotMan and the Lease Manager. Further development of LotMan will include an external interface, perhaps via a Web Services interface, to allow storage allocations to be more easily created and managed. Work on the Lease Manager will include performance measurements and enhancements to the Lease Manager / Stork interface. Finally, the UW team will work with Kosar's group to provide a single Stork release.

3.4. DCACHE IMPROVEMENTS

The CEDPS team has worked to modify the dCache system to provide robust end-to-end data integrity verification. This work has proven beneficial to the CMS high energy physics application, which uses the capability to verify checksums on approximately 10 Terabytes per day of data downloaded to Fermilab. CMS has not been willing to reveal the number of what would otherwise have been undetected errors that were identified by this method, but we understand that it is greater than zero.

dCache provides a system for storing, retrieving, and managing petabytes of data distributed among a large number of heterogeneous server nodes. dCache supports a variety of management and access protocols, such as GridFTP, SRM, dccp, and xrootd, all presenting a single virtual filesystem tree. The project is a joint effort between DESY (Deutsches Elektronen-Synchrotron) in Hamburg and FNAL (Fermi National Accelerator Laboratory) near Chicago, and is aimed at serving the data needs of US- and European-based LHC (Large Hadron Collider) experiments. The core of the dCache functionality is combining separate disk storage systems of several hundred terabytes into a uniformly accessible filesystem tree. To make this process manageable, dCache performs load balancing among data nodes and data integrity verification, detects failing hardware, and attempts to ensure the existence of important data in multiple replicas.

End-to-end data integrity verification in dCache is designed to prevent propagation of incorrect data. In order to implement this feature, the following work has been done:

- Storing of checksum values and their types inside the dCache metadata catalog
- Implementation of GridFTP version 2 standard extensions, specifically those that communicate checksum data between client/server and server/server
- Server-to-server negotiation of the checksum algorithm used to verify the integrity of a subsequent transfer
- Extension of the algorithms that calculate data checksum (or file digest) values, specifically adding support for MD5, MD4, and CRC

Before starting the transfer of a file to dCache, the client computes the checksum value over the original data file on its local disk. This value is sent to dCache using the GridFTP checksum protocol extensions. After the transfer completes, dCache verifies the received data for consistency with the client checksum and either rejects or accepts the transaction.

Before a file is read from dCache, a client or other server on its behalf negotiates the checksum algorithm with dCache to ensure that dCache supports the type of checksum consistent with client requirements. This process ends when dCache determines and sends to the client the value of the checksum that reflects the true content of the original file. After the file is transferred to the client, the client verifies the checksum of the data on its local disk and either accepts or rejects the transaction.

This end-to-end data integrity verification process ensures that server-to-client, client-to-server, and server-to-server data movement operations preserve the content of the file originally stored by a user.
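The client-side half of this protocol amounts to streaming the file through a digest and comparing the result with the value dCache reports. The sketch below illustrates that step, assuming MD5 and placeholder file and checksum values; it is not dCache code.

    """Sketch of client-side end-to-end integrity checking: compute a
    checksum over the local file and compare it with the server's
    value.  Paths and the expected checksum are placeholders."""

    import hashlib

    def file_checksum(path: str, algorithm: str = "md5",
                      chunk_size: int = 1 << 20) -> str:
        """Stream the file through the digest so large files fit in memory."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # After a download, compare against the value the server negotiated.
    local = file_checksum("/data/run1234.root")           # placeholder path
    server_reported = "9e107d9d372bb6826bd81d3542a419d6"  # placeholder value
    if local != server_reported:
        raise IOError("checksum mismatch: rejecting transfer")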

In addition, dCache checksum failures on particular hardware are routinely used to flag hardware for preemptive replacement or maintenance.

The first deployment of this checksum functionality revealed several deficiencies in a partnering storage system in Europe. Enabling this new functionality triggered further development in other storage products, such as CASTOR at CERN, which now ensure better quality of served data. Currently, the data integrity verification code verifies approximately 10TB of data a day incoming to the FNAL storage system.

An additional effort of the CEDPS data team focused on research into quality of service and opportunistic use. Based on experience improving the efficiency of a large-scale data reconstruction effort in the OSG opportunistic storage environment, the Fermi group delivered a document outlining ideas and the further work needed to virtualize grid storage, with the goal of providing data storage with a predefined and sustainable level of service.

Year 2 Milestones

- MS2.11. Produce design document on incorporating mechanisms for quality of service
  o Status: Complete

Our plans for the coming year: the scale of the CMS experiment's data requirements has shown that existing implementations of algorithms for replicating high-demand data are too simplistic and create substantial inefficiencies during peak usage of the storage system. With varying degrees of success, these inefficiencies are currently addressed manually on a case-by-case basis. Our future work in this area will focus on researching and then adding automation to optimally replicate "hot" data. This should reduce the need for continuous, and hence costly, manual parameter adjustments to the dCache system.

4. SERVICES AREA PROGRESS

During the report period the services area has developed and applied tools for the construction, operation, and provisioning of scalable science services.

4.1. NIMBUS

The Nimbus system provides mechanisms for the dynamic allocation of virtual machine images: what is sometimes referred to as a “private cloud.” It also provides mechanisms for the creation of the required images, for the creation of virtual clusters based around virtual machine images, and other management tasks.

During the evaluation period the focus of the Nimbus team has been on working with application communities and providing tools that enable scientists to easily leverage cloud computing capabilities.

Our particular focus was interaction with DOE-related communities as follows:

- We enabled the first production run of the nuclear physics STAR applications on Amazon's EC2 cloud computing infrastructure. This took place in September 2007. The deployment of the STAR cluster on EC2 was orchestrated by the Nimbus Context Broker service, which enables automatic and secure deployment of "turnkey" virtual clusters, bridging the gap between the functionality provided by EC2 and the "end product" that scientific communities need to deploy their applications. Scientific production runs require careful and involved environment preparation and reliability: this run was a significant step towards convincing the broad STAR community that real science can be done using cloud computing. We further worked with STAR on evaluating I/O on EC2 STAR instances, which, at 5 MB/sec, was deemed adequate for production runs of I/O-intensive applications. We continue to collaborate with the project to enable further runs.

- Using the Context Broker, we also implemented a proof of concept that enabled the integration of dynamically provisioned environments (e.g., on EC2 or on clouds created in the scientific domain) for the ALICE HEP experiment at CERN (07/08, CHEP submission pending). This work was done in collaboration with the CERNVM project, which produces VM images that support all four LHC experiments. Our prototype dynamically deployed VMs that were automatically added to the ALICE AliEn infrastructure, registering their availability for job submission.

- We interacted internationally with multiple members of the ATLAS HEP experiment. Ian Gable's group at the University of Victoria (UVIC) has long been a demanding user of Nimbus, contributing bug fixes and thorough testing of Nimbus capabilities. In the fall of 2008 they contributed a Nagios-based monitoring component, required to better adapt the project to their needs, and as a result we invited them to join the committer team. We also initiated a collaboration with a group of ATLAS scientists at the Max Planck Institute who are interested in an open-source implementation of EC2 to facilitate moving environments from their resources to EC2.

In terms of software development, this project supported the following developments:

- It contributed to the development of an EC2 gateway (06/07), allowing scientists to submit resource requests to Amazon using grid interfaces and the credentials and credits associated with a specific project. This gateway enabled scientists to move seamlessly between clouds configured in scientific space and a commercial target (in this case EC2) for overflow demand.

- It contributed to early design and development of the Context Broker technology, enabling automated creation of "turnkey" virtual clusters.

In addition, the collaborations above contributed requirements and informed the design of the following developments:

- They helped us define requirements that informed six software releases of the Nimbus Toolkit between 05/07 and 11/08. Among other features, the releases contained the Context Broker service; the "workspace pilot", a non-invasive adaptation of batch schedulers that facilitates Nimbus adoption on existing scientific platforms; EC2-compatible interfaces to our technology; improved extensibility; and better configuration tools.

- From 03/08 onwards, we worked with site administrators at UC and other sites to configure Science Clouds, cloud computing platforms available to science; see http://workspace.globus.org/clouds.

- We added several new images to the "workspace bookshelf", including contextualizable images that can be used for the creation of virtual clusters, most recently an OSG virtual cluster (10/08).

Year 2 Milestones

- Develop protocols for specifying targets for scalable services, including performance and resource provisioning targets; continue the implementation of workspace-based provisioning.
- Further work on the "workspace bookshelf": developing schemas for describing and identifying execution environments.
- Release the first version of services for on-demand provisioning of workspaces.

4.2. SERVICE CONSTRUCTION TOOLS

The Grid Remote Application Virtualization Interfaces (gRAVI) is an extension to the Introduce grid service authoring tool that adds capabilities for wrapping legacy applications. The first stable release of gRAVI occurred last April and was well received by the APS (DOE) and caBIG (NIH) communities, who were able to make immediate use of it and began building and deploying services. It has subsequently also been adopted at NERSC. Another release was made this past August that included several improvements suggested by users.

One feature added in the last release was a simple portal, generated as part of the authored gRAVI application, which can be deployed into a Tomcat servlet container to enable users to interact with the corresponding Web Service through a web browser. This feature is based on feedback from early users of the gRAVI tool, who said it would be useful if gRAVI generated a simple web/portal client that could be deployed in a web server and shared with the community quickly. After providing users with a proof of concept, we began gathering requirements for a production environment with a large user community. Out of these discussions evolved a collaboration with NERSC's Open Software & Programming Group to develop tools for generating portals that expose various scientific applications on NERSC resources. In late November, an initial portal was in place and was well received by group lead David Skinner, who pointed out several improvements that would need to be made.

Year 2 Milestones

- Develop a preliminary architecture document integrating the Web service application infrastructure with provisioning backends.
  o Status: Complete. PhD student Ioan Raicu conducted extensive investigations and experiments in this area.
- Work with biology applications on creating science services using initial AHS job management based as well as resource management based solutions wherever appropriate.
  o Status: Complete. We conducted a promising study with the Argonne computational biology team.
- Continued development of pyGridWare to support new protocol versions.
  o Status: Modified; complete. We chose to focus effort on gRAVI rather than pyGridWare, and thus this milestone has been deleted. Equivalent functionality is provided by gRAVI.
- Develop a version of PyCLST that supports wrapping non-command-line applications.
  o Status: Modified; complete. We chose to focus effort on gRAVI rather than PyCLST, and thus this milestone has been deleted. Equivalent functionality is provided by Introduce.
- Continue to deploy new services on OSG, ESG, and others.
  o Status: Good results achieved with APS, caBIG, and NERSC. (OSG and ESG have proved to be less interested in the technology, for the moment at least.)
  o Researchers at the APS used gRAVI to generate secure grid services for controlling a beamline experiment, data analysis, visualization, and modeling.
  o The caBIG initiative (funded by the National Cancer Institute) is using gRAVI to create its "caGrid" infrastructure for creating, registering, discovering, and invoking analytical routines.
  o NERSC is using gRAVI portal generation tools to create and deploy a science gateway.

5. TROUBLESHOOTING AREA PROGRESS

In this reporting period, CEDPS-Troubleshooting began several new collaborations and spent a significant amount of time engaging with these collaborators. Details are in the Collaborations section (Section 6).

Development work has focused on providing a production-ready version of the log parsing and database loading tools. Software enhancements and bug fixes have proceeded in parallel with deployment and feedback from collaborators who are using the tools.

5.1. COLLECTION AND ARCHIVE SERVICE IMPROVEMENTS

We further designed, developed, documented, and tested the "log pipeline", which continuously parses and loads log data into a database, and we wrote a variety of parsers to normalize log data from Grid middleware. The log pipeline has three main components: a manager, a log parser, and a database loader. Major improvements to the components in this reporting period are given below. This development work was a necessary prerequisite to deploying the software on OSG.

The major improvement in the manager component was the addition of a simple UDP messaging protocol to allow it to tell the managed loader/parser components when to roll over, re-parse configurations, or shut down cleanly. This is more robust and easier to distribute than the previously used UNIX signals, though these still work.
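A minimal sketch of this kind of control channel is shown below: the manager sends a one-word datagram, and the managed component acts on it between parsing batches. The port number, command vocabulary, and wire format are illustrative assumptions, not the pipeline's actual protocol.

    """Sketch of a manager-to-component UDP control channel.
    Port, commands, and wire format are illustrative."""

    import socket

    CONTROL_PORT = 15000  # illustrative
    COMMANDS = {b"rollover", b"reconfig", b"shutdown"}

    def send_command(host: str, command: bytes) -> None:
        """Manager side: fire a one-word control datagram."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(command, (host, CONTROL_PORT))

    def receive_command() -> bytes:
        """Component side: block until one valid command arrives.
        A real component would poll this alongside its parsing work."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.bind(("", CONTROL_PORT))
            data, _addr = s.recvfrom(64)
            return data if data in COMMANDS else b"ignored"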

Major improvements to the log parser included a number of new parser modules, including ones for Condor and Globus components; improved error handling; the ability to "throttle" the parser so it does not consume all the host CPU when pointed at a very large input file; and other enhancements. The availability of a simple framework that makes developing new parsers easy has proven very useful: it has encouraged contributions from NERSC and has been the primary interaction point with other systems. We now have 15 parsers in all, covering software components from PBS, SGE, Condor, Globus, SRM, and HSI. Some relatively straightforward extensions to the framework made quick work of a tool for the Pegasus workflow logs that could traverse tens of directories of hundreds of files each, loading them all into the same database for analysis.
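To give a flavor of what a parser module does, the sketch below normalizes a single, made-up batch-scheduler log line into name=value pairs in the spirit of the CEDPS "Logging Best Practices" format. The input layout and the parse_line helper are hypothetical; the framework's real plugin API is not shown here.

    """Sketch of a log-normalizing parser.  The input line format is
    invented (not real SGE accounting), and the output merely follows
    the ts/event/name=value spirit of the best-practices guidelines."""

    import time

    def parse_line(raw: str) -> str:
        # made-up input: "1226577600 job_end 4242 user=alice status=0"
        epoch, event, jobid, *pairs = raw.split()
        ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(int(epoch)))
        fields = [f"ts={ts}", f"event=sched.{event}", f"job.id={jobid}"]
        fields.extend(pairs)  # remaining tokens are already name=value
        return " ".join(fields)

    print(parse_line("1226577600 job_end 4242 user=alice status=0"))
    # -> ts=2008-11-13T12:00:00Z event=sched.job_end job.id=4242 user=alice status=0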

Major improvements to the log loader included PostgreSQL (www.postgresql.org) support, greatly improved performance for the SQLite (www.sqlite.org) module, CPU throttling, and a more thorough treatment of the performance/safety tradeoff involved in unique integrity constraints. PostgreSQL support is important for wider deployments; in the near term, PostgreSQL is also the database of choice for the NERSC Project Accounts work. The performance tradeoffs of data loading are particularly important both for the Pegasus team, which needs speed at all costs, and for long-running OSG deployments, which need consistency and robustness across system restarts.
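The sketch below illustrates one way the loader's tradeoff can be handled, using SQLite: batching rows into a single transaction for speed, with an INSERT OR IGNORE against a unique key so that re-loading after a restart stays idempotent. The table layout is illustrative, not the actual CEDPS schema.

    """Sketch of a batch loader balancing speed and safety.
    Table and column names are illustrative."""

    import sqlite3

    conn = sqlite3.connect("logs.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS event (
        guid TEXT PRIMARY KEY,  -- unique event id: duplicates are skipped
        ts   TEXT NOT NULL,
        name TEXT NOT NULL,
        body TEXT)""")

    def load(rows):
        # One transaction per batch: crash-safe, and far faster than
        # committing once per row.
        with conn:
            conn.executemany(
                "INSERT OR IGNORE INTO event VALUES (?, ?, ?, ?)", rows)

    load([("e1", "2008-11-13T12:00:00Z", "sched.job_end", "status=0"),
          ("e1", "2008-11-13T12:00:00Z", "sched.job_end", "status=0")])
    print(conn.execute("SELECT COUNT(*) FROM event").fetchone()[0])  # -> 1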

Year 2 Milestones

- Add authentication and authorization capability to the log Collection Service and log Archive Service.
  o Status: In progress. We chose to use the OGSA-DAI technology to perform this function, but did not make much progress implementing it. This was serendipitous, as in the meantime the OGSA-DAI software has developed a much fuller implementation of the required "view" and distributed join functionality.
- Develop tools to filter and feed log data from the Collection Service to the Archive Service.
  o Status: Complete (see above for details).
- Continue to deploy new services on OSG, ESG, and others.
  o Status: Ongoing. We have deployed on NERSC PDSF and are packaging with VDT to deploy on other OSG resources in the near future.
- Continue outreach to Grid application developers to instrument their applications.
  o Status: Ongoing. We have had very positive interactions in this regard with members of the Pegasus team (see Section 6.4) and also the SDM group at LBNL (see Section 6.5).

In addition to finishing the necessary parts of the milestones above, there are a few minor development tasks and one large one in the near future. The minor tasks are tools and scripts for more robust operation of the log pipeline, including log and database rotation, and self-monitoring. All these new capabilities will be documented and packaged using the Virtual Data Toolkit (VDT) so they can be easily deployed on the Open Science Grid (OSG).

The major task is an OGSA-DAI interface that allows fully authenticated cross-site database access. The state-of-the-art today is to either keep the logs on the site or to centralize a subset of them (as OSG does today with most of its monitoring). Neither approach is adequate for troubleshooting, as many current and potential users have pointed out. What we are aiming for with OGSA-DAI is a way to control who views which logs at a fine grain using the existing Grid credentials, and to be able to combine information easily from more than one site in the process. We are fortunate to have a project like OGSA-DAI which has done much of the difficult groundwork in mapping Grid credentials to database roles and views, but there is still a considerable amount of work to be done.


5.2. LOG DATA ANALYSIS

The data analysis work has focused on three tasks: analysis of complex Pegasus workflows from the SCEC CyberShake computations, profiling of BeStMan data transfers, and correlation of Sun Grid Engine and Globus Job Manager logs for the TechX STAR job submission portal. These projects are all discussed in more detail in the collaboration sections below: Pegasus workflows in Section 6.4, BeStMan in Section 6.5, and TechX/STAR in Section 6.7.

Year 2 Milestones

- Use the Archive Service to establish performance baselines, and trigger events if performance deviates too much from the baseline.
  o Status: In progress. We have demonstrated the ability of the Archive Service to profile complex workflows in collaboration with the Pegasus team. The resulting profiles form, in a sense, a baseline. We are still working on the triggering of events (see below) based on this information.

Future plans include establishing performance baselines for BeStMan transfers to and from NERSC PDSF systems (primarily for the STAR project), and continuing the analysis of Pegasus workflows for SCEC CyberShake and other users of the Pegasus technology. We are working towards a real-time view of the status of Pegasus workflows for SCEC/CyberShake, which will be a huge improvement over their current "run and come back nine hours later" situation. These tools should apply, with only small modifications, to other users of the Pegasus workflow tools, and pieces of the functionality should be useful for monitoring Condor job DAGs, so we expect a much broader impact from what started as very focused work.
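A status view of this kind reduces, at bottom, to a query over the loaded logs that counts tasks by their most recent state. The sketch below shows one such query against an SQLite database; the task_event table and its columns are hypothetical stand-ins for the real log schema.

    """Sketch of a workflow status summary over loaded logs.
    Assumes the pipeline has loaded rows into a (hypothetical)
    task_event(task_id, state, ts) table."""

    import sqlite3

    conn = sqlite3.connect("workflow_logs.db")  # illustrative path
    QUERY = """
    SELECT state, COUNT(*) AS n
      FROM task_event te
     WHERE ts = (SELECT MAX(ts) FROM task_event WHERE task_id = te.task_id)
     GROUP BY state
    """
    # The latest event per task decides its current state.
    for state, n in conn.execute(QUERY):
        print(f"{state:>10}: {n}")
    # e.g.  queued: 51234 / running: 2048 / done: 947000 / failed: 12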

5.3. INTEGRATION OF LOG DATA WITH MDS4

To integrate MDS4 with the CEDPS Log Collection and Log Archive services, we have invested significant time in enhancing the administrative interface to the MDS Trigger Service. This work lays the groundwork for "action scripts" that can query the CEDPS log database and trigger alarms based on the results.
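The sketch below illustrates the shape such an action script might take: query the log database for recent error events and emit an alert when a threshold is crossed. The schema, threshold, and alert mechanism are all illustrative assumptions.

    #!/usr/bin/env python
    """Sketch of an action script: count recent ERROR-level events in
    the log database and alert if there are too many.  Schema,
    threshold, and alert mechanism are illustrative."""

    import sqlite3

    THRESHOLD = 10  # alarm if more errors than this in the last hour

    conn = sqlite3.connect("logs.db")  # illustrative path
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM event "
        "WHERE level = 'ERROR' AND ts >= datetime('now', '-1 hour')"
    ).fetchone()

    if count > THRESHOLD:
        # A real script might open a trouble ticket or send email here.
        print(f"ALARM: {count} ERROR events in the last hour")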

Year 2 Milestones

- Develop MDS4 Trigger Service action scripts to securely restart failed services.
  o Status: Dropped. Preliminary discussions with system administrators indicated that this is not properly in the scope of the CEDPS project. We are instead investigating better integration with site "trouble ticket" systems.
- Develop MDS4 triggers for missing log events (based on the NetLogger anomaly detection tool).
  o Status: In progress. MDS triggers for log events are a work in progress. However, as part of our collaboration with ESG, we have released a dramatically improved version of the administrative interface to the MDS Trigger Service. See the ESG collaboration (Section 6.3) for details.
- Integration of the Log Collection Service with MDS4 to provide a log file location service.
  o Status: Deferred. Because we plan to use OGSA-DAI as the access mechanism for log data, we will implement the log file location service by having each OGSA-DAI server register to a central MDS Index server. Because OGSA-DAI already publishes resource properties, no additional development work is needed to provide this functionality; this is a deployment task that we plan to do as we deploy OGSA-DAI.

6. COLLABORATIONS

As is appropriate for a SciDAC Center for Enabling Technology, collaborations are the heart of the CEDPS project, defining requirements for technology R&D, and providing the context within which new technologies are evaluated. We detail some of our major collaborations in the following.

6.1. NATIONAL ENERGY RESEARCH SUPERCOMPUTING CENTER (NERSC)

CEDPS-Troubleshooting worked with the National Energy Research Supercomputing Center (NERSC) on two fronts: log collection on PDSF, and help with the Project Accounts auditing tasks.

The CEDPS log collection tools were deployed on the NERSC Parallel Distributed Systems Facility (PDSF). Parsers have been developed for the Sun Grid Engine (SGE) scheduler used on PDSF. The site administrators at PDSF use this information to track the resource consumption "tokens" claimed by PDSF users. This deployment is also used for tracking STAR data transfers and providing troubleshooting to the TechX/STAR job submission portal, both discussed below.

The NERSC Project Accounts team is designing and implementing a framework that allows NERSC users to run jobs under a shared, or project, account while retaining full traceability to the actual user involved. This model would improve the usability of NERSC resources for many users. In order to perform the necessary auditing of users' actions on these systems, the Project Accounts team is using CEDPS log parsing to normalize the logs and the log database and perform traceability analysis.
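As an illustration of what traceability analysis over normalized logs can look like, the sketch below joins job records made under a project account to the authentication events that identify the real user. Every table and column name here is hypothetical; the actual Project Accounts schema belongs to NERSC.

    """Sketch of a traceability query: map work done under a shared
    project account back to the authenticated individual.  All schema
    names are hypothetical."""

    import sqlite3

    conn = sqlite3.connect("logs.db")  # illustrative path
    QUERY = """
    SELECT j.job_id, j.project_account, a.real_user, j.start_ts
      FROM job_event j
      JOIN auth_event a
        ON a.session_id = j.session_id  -- shared key recovered by the parsers
     WHERE j.project_account = ?
    """
    for row in conn.execute(QUERY, ("star_prod",)):  # hypothetical account
        print(row)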

CEDPS-Services is developing portal generation tools that are being used to create science gateways for NERSC resources. The initial target application is VASP, a package for performing ab initio quantum mechanical molecular dynamics, which has a relatively large user community at NERSC. To date, the portal has been available only internally to NERSC staff for testing; early next year we plan to roll it out to a select group of community testers.


6.2. ARGONNE LEADERSHIP COMPUTING FACILITY (ALCF)

The focus of our collaboration here is on enabling ALCF to achieve high-speed remote access to and from their HPSS mass store system. This work has involved close collaboration with ALCF staff on the design, deployment, and evaluation of their GridFTP solution, including extensive work with the HPSS interface. Initial results are extremely promising.

6.3. EARTH SYSTEM GRID (ESG) CENTER FOR ENABLING TECHNOLOGY

ESG-CET is a major user of CEDPS data movement technologies: in particular, GridFTP and RLS. It is also a driver for CEDPS work on data replication.

ESG-CET has continued to make aggressive use of GridFTP, and thus has benefited from the significant performance and functionality enhancements to GridFTP that have been developed under the CEDPS project. ESG-CET uses the Storage Resource Manager from LBNL to perform wide area bulk data movement of large (terabyte) data sets, and SRM in turn calls GridFTP to perform these transfers among ESG sites. In addition, OpenDAP-G uses GridFTP directly for high-performance transfers. The ESG-CET and CEDPS projects also collaborate in several other areas related to data management.

ESG uses the Replica Location Service to track and catalog data sets. For the next-generation ESG architecture, we are working with the ESG team to provide data mirroring functionality. This new functionality will allow key sites on several continents to host large-capacity (terabyte) mirror sites for ESG data sets. We are investigating whether data mirroring tools being developed under CEDPS can help ESG to manage their data replication.

For monitoring, we have released a redesigned version of the MDS Trigger Service for Globus Toolkit version 4.2. This work provides an improved service interface for administrative tasks such as modifying, enabling, and disabling existing triggers. It was done in response to feedback from ESG, who requested this capability to support a portal that they plan to develop, as well as to provide a simpler command-line interface.

In our initial plans, we also identified ESG server-side processing as an important driver for CEDPS-Services work. However, while ESG continues to view server-side processing as important for their long-term plans, they have not been able to prioritize effort in this area, and thus collaboration there has not yet materialized.


6.4. PEGASUS GROUP, USC/ISI

The Pegasus group at USC/ISI provides a workflow engine called Pegasus-WMS that is used by the Southern California Earthquake Center (SCEC) to run computational simulations on their CyberShake platform. By combining Pegasus-WMS and CEDPS logging tools, we were able to efficiently process execution logs of earthquake science workflows consisting of hundreds of thousands to one million tasks. In an accepted poster for the SC08 conference [17], we show results of processing logs of CyberShake, a workflow application running on the TeraGrid.

Although workflow analysis was not in our list of milestones for this year, it has turned out to be a fruitful area. Just as one can view any distributed job as a type of "workflow", solutions to the problems of scale and correlation found in the Pegasus workflow logs are reusable in the context of single Grid submissions. For example, the same types of queries we developed for mining Pegasus logs were also used to correlate the TechX job submissions with the SGE scheduler information (see Section 6.8, below).

We also collaborated with the Pegasus team on exploring Infrastructure-as-a-Service (IaaS) cloud computing for scientific communities, in which a platform can be flexibly provisioned from an academic or commercial provider in response to a developing resource need in a workflow. Our first exploration, comparing the performance of workflow-based scientific applications on local platforms and on platforms available in the Science Clouds, took place in the summer of 2008 and was recently published [13].

6.5. SCIENTIFIC DATA MANAGEMENT GROUP, LBNL

CEDPS-Troubleshooting collaborated with Arie Shoshani’s SDM group, the developers of the BeStMan implementation of the Storage Resource Manager (SRM) protocol, to improve and normalize their log information. The short-term goal of this collaboration has been to collect SRM logs from the STAR project's transfers between Brookhaven National Laboratory and PDSF. 

There is now a deployed version of BeStMan on PDSF that contains new and improved logging, and the CEDPS log collection tools now include the requisite parsers to process these logs and load them into our database for analysis. Impacts on the STAR project are discussed below.

6.6. GLOBUS TEAM, ANL

CEDPS-Troubleshooting continued the effort to help Globus Toolkit logs follow the CEDPS "Logging Best Practices" guidelines. Although the initial target for this was GT 4.1.3, deployment realities have pushed it back to GT 4.2.1. We continue to engage with and provide feedback on Globus Toolkit logging in order to make the logs as useful as possible when GT 4.2.1 is deployed through OSG and elsewhere.

6.7. SOLENOIDAL TRACKER AT RHIC (STAR) EXPERIMENT

The Solenoidal Tracker At RHIC (STAR) nuclear physics experiment, based at Brookhaven National Laboratory (BNL), performs data analysis at several sites, including PDSF. The input and output data are transferred between a node at BNL and a node at PDSF using the BeStMan implementation of the SRM data management protocol.

CEDPS-Troubleshooting has begun analyzing the actual end-to-end throughput experienced by these transfers, and has so far found some surprising numbers: transfers from BNL to PDSF showed reasonable throughput (100-200 Mbps), whereas transfers from PDSF back to BNL were one to two orders of magnitude slower. GridFTP logs at BNL and BeStMan logs at PDSF were correlated by the CEDPS logging tools to verify this result. Until the underlying network bottleneck can be removed, this information has been fed back to tune the number of concurrent streams in the BeStMan deployment and thereby improve the transfer rate.
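
The sketch below illustrates the directional comparison: aggregate bytes and elapsed time per (source, destination) pair and convert to megabits per second. The record format and numbers are made up for illustration; real input would come from the correlated GridFTP and BeStMan entries in the log database.

    from collections import defaultdict

    # (source, dest, bytes transferred, seconds elapsed) -- made-up numbers
    transfers = [
        ("BNL", "PDSF", 2.0e9, 100.0),
        ("BNL", "PDSF", 1.5e9, 90.0),
        ("PDSF", "BNL", 2.0e9, 9000.0),
    ]

    totals = defaultdict(lambda: [0.0, 0.0])    # direction -> [bytes, seconds]
    for src, dst, nbytes, secs in transfers:
        totals[(src, dst)][0] += nbytes
        totals[(src, dst)][1] += secs

    for (src, dst), (nbytes, secs) in totals.items():
        mbps = nbytes * 8 / secs / 1e6          # megabits per second
        print("%s -> %s: %.1f Mbps" % (src, dst, mbps))
    # BNL -> PDSF: ~147 Mbps; PDSF -> BNL: ~1.8 Mbps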

In addition, CEDPS-Data has had a series of discussions with the STAR team to understand their data mirroring and replication requirements. The STAR team indicated that simple replication tools would be very useful to their project in the future. These requirements are feeding into current CEDPS-Data design work.

CEDPS-Services began its collaboration with STAR early in the project and continued it in the evaluation period. The objective of this collaboration is to demonstrate that STAR applications can use available IaaS cloud resources for production runs, and to evaluate the paradigm's usefulness to the STAR community in comparison to existing resources. Working with STAR scientists, we developed contextualization tools allowing for the automatic creation of tightly-coupled clusters on IaaS platforms (such as Amazon's EC2 or the Science Clouds). This tool enabled us to prepare the first significant STAR run on EC2 in September 2007, providing a platform for production codes. Subsequently, we worked with STAR community members to evaluate the performance impact of cloud I/O operations on STAR applications (found to be within 10% and deemed acceptable) and on preparations for another STAR production run in the cloud – this time, for a critical code that will generate publication-worthy results.
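
The following is a minimal conceptual sketch of the contextualization pattern: each newly booted VM registers its address with a broker, and once the expected cluster size is reached every member can retrieve the full roster (e.g., to build hosts files and start cluster daemons). It illustrates the idea only; it is not the Nimbus context broker's actual interface.

    class ContextBroker:
        def __init__(self, expected_size):
            self.expected_size = expected_size
            self.members = {}                  # hostname -> ip

        def register(self, hostname, ip):
            """Called by each VM as it boots."""
            self.members[hostname] = ip

        def roster(self):
            """Return the full member list once the cluster is complete."""
            if len(self.members) < self.expected_size:
                return None                    # still waiting for stragglers
            return dict(self.members)

    broker = ContextBroker(expected_size=3)
    for i, ip in enumerate(["10.0.0.11", "10.0.0.12", "10.0.0.13"], start=1):
        broker.register("node%d" % i, ip)

    # Every node can now write its hosts file and start cluster daemons.
    print(broker.roster())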

6.8. TECHX CORPORATION

CEDPS-Troubleshooting has coordinated with the TechX team (www.txcorp.com), which has developed a STAR job submission portal and has added custom application-level monitoring to STAR middleware. The STAR production managers would like to be able to "drill down" beyond the job-start/job-end monitoring provided by TechX to site-specific logs and errors. The CEDPS log database provides this information, which we have agreed to supply to the TechX portal.

A prototype version of the Tech-X portal can query the SGE log information stored in the CEDPS log database. We are working together to finish a prototype of this functionality by SC08. When complete, this will greatly enhance the usability of the portal for STAR jobs.

CEDPS-Services worked with the TechX team on coordinating and enhancing their collaboration with STAR. We are currently exploring two collaborations. The first supplies cloud computing expertise for the development of an elastic cloud computing infrastructure to enhance access to a nuclear physics relational database developed by TechX in collaboration with STAR. The second explores the integration of cloud computing technology into the current OSG fabric, following up on successful demonstrations of STAR production runs on IaaS infrastructure.

6.9. OPEN SCIENCE GRID (OSG)

CEDPS-Troubleshooting is continuing its collaboration with OSG, with the goal of deploying the CEDPS troubleshooting tools to provide a central log database for troubleshooting and an early warning system that detects deviations from baseline performance. Progress on this front has been slowed by delays in the release of CEDPS logging in Globus Toolkit components, and also by slower-than-expected progress on the integration of CEDPS logging with MDS.
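
A simple form of such baseline checking is to flag measurements that fall more than k standard deviations from a historical mean, as in the sketch below; the threshold and data are illustrative assumptions.

    import statistics

    def deviations(baseline, current, k=3.0):
        """Return the values in `current` more than k sigma from the baseline mean."""
        mean = statistics.mean(baseline)
        sigma = statistics.stdev(baseline)
        return [x for x in current if abs(x - mean) > k * sigma]

    # Historical transfer rates (Mbps) vs. today's observations.
    baseline = [145.0, 150.2, 148.7, 152.1, 149.5, 151.0]
    today = [149.0, 151.3, 2.4]           # the 2.4 Mbps sample should be flagged

    print(deviations(baseline, today))    # [2.4]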

In a separate activity, the CEDPS-Troubleshooting log normalization tools are in production use for the OSG accounting function to analyze GridFTP activity.

CEDPS-Services collaborates with OSG by providing tools that facilitate IaaS exploration by OSG scientists. Specifically, in October 2008 we made available an OSG virtual cluster that can be deployed on Science Clouds resources as well as, via the Nimbus contextualization tools, on EC2; this allows scientists to easily deploy an OSG cluster in the cloud. In addition, we interact with OSG through participation in OSG events (e.g., we organized a cloud computing BOF at the OSG All-Hands Meeting in March 2008) to explain and popularize cloud computing ideas.

6.10. ADVANCED PHOTON SOURCE (APS)

Researchers at APS beamlines have used gRAVI in a project aimed at automating large parts of the end-to-end experiment operation, data analysis, data visualization, and data-driven modeling workflows that define their work processes. This work has involved the use of gRAVI to generate secure Web Services for controlling a beamline experiment, and for data analysis, visualization and modeling. The results of this work have been profiled within APS, presented at meetings, and highlighted on the DOE web site.

6.11. NUCLEAR PHYSICS GROUPS

CEDPS-Services established several collaborations with international nuclear physics groups. In the summer of 2008 we enabled seamless integration of VMs deployed on the Science Clouds platforms for the ALICE HEP experiment at CERN (July 2008; a CHEP submission is pending). This is significant because it shows how cloud computing can be integrated into existing community computing mechanisms: the VMs served as platforms for jobs drawn, first come first served, from the ALICE production queue. Further, we began a collaboration with multiple members of the ATLAS HEP experiment. We are working with a group of scientists at the Max Planck Institute to evaluate Nimbus as an open source EC2-compatible platform, and we continued our collaboration with ATLAS scientists at the University of Victoria (UVIC) exploring Nimbus as a platform for their community. These interactions have resulted, among other outcomes, in open source contributions from the UVIC group, who have joined us on the committer team.

7. PRESENTATIONS AND PUBLICATIONS

7.1. TALKS

1. I. Foster, “Enabling Distributed Petascale Science,” Scientific Discovery through Advanced Computing Conference, Boston, Mass., May 2007.

2. I. Foster, “Services for Science,” Keynote talk at INGRID Conference, Ischia, Italy, April 2008.

3. K. Keahey and T. Freeman, "Cloud Computing and Virtualization with Globus (Tutorial)," Open Source Grid and Cluster Conference, Oakland, CA, May 2008.

4. K. Keahey, "Globus Virtual Workspaces," HEPiX Fall 2007 Meeting, St. Louis, MO, November 2007.

5. R. Kettimuthu, "Data Movement Tools for Distributed Petascale Science," Maseeh College of Engineering and Computer Science, Portland State University, Portland, OR, September 2008.

6. R. Kettimuthu, "Reliable Data Movement Framework for Distributed Science Environments," The 2008 International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, NV, July 2008.

7. R. Kettimuthu, "Globus GridFTP and RFT: An Overview and New Features," National Energy Research Scientific Computing Center (NERSC), Oakland, CA, May 2008.

7.2. TUTORIALS

8. R. Kettimuthu, J. Bresnahan and M. Link, "Configuring and Deploying GridFTP for Managing Data Movement in Grid/HPC Environments," SC 2008, Austin, TX, November 2008.

9. R. Kettimuthu and J. Bresnahan, "Managing Data Movement Using GridFTP in Distributed Environments," Open Source Grid and Cluster Conference, Oakland, CA, May 2008.

7.3. PAPERS

10. “Characterization of Scientific Workflows,” Shishir Bharathi, Ann Chervenak, Ewa Deelman, Gaurang Mehta, Mei-Hui Su, Karan Vahi, The 3rd Workshop on Workflows in Support of Large-Scale Science (WORKS08), in conjunction with the Supercomputing (SC08) Conference, Austin, Texas, November 2008.

11. “Enabling petascale science: data management, troubleshooting, and scalable science services,” A. Baranovski, K. Beattie, S. Bharathi, J. Boverhof, J. Bresnahan, A. Chervenak, I. Foster, T. Freeman, D. Gunter, K. Keahey, C. Kesselman, R. Kettimuthu, N. Leroy, M. Link, M. Livny, R. Madduri, G. Oleynik, L. Pearlman, R. Schuler and B. Tierney, Journal of Physics: Conference Series, Volume 125, 2008. (Also appeared in Proceedings of the SciDAC 2008 Conference, 13-17 July 2008, Seattle, Washington, USA.)

12. “Reducing Time-to-Solution Using Distributed High-Throughput Mega-Workflows – Experiences from SCEC CyberShake,” Scott Callaghan, Phil Maechling, Ewa Deelman, Karan Vahi, Gaurang Mehta, Gideon Juve, Kevin Milner, Robert Graves, Edward Field, David Okaya, Dan Gunter, Keith Beattie, Thomas Jordan, Fourth IEEE International Conference on eScience (eScience 2008), Indianapolis, IN, USA, December 2008.

13. "Exploration of the Applicability of Cloud Computing to Large-Scale Scientific Workflows", Hoffa, C., T. Freeman, G. Metha, E. Deelman, and K. Keahey,. to be submitted to SWBES08: Challenging Issues in Workflow Applications, 2008.

14. "Virtual Workspaces for Scientific Applications", Keahey, K., T. Freeman, J. Lauret, D. Olson. SciDAC 2007 Conference, Boston, MA. June 2007

7.4. POSTERS

15. “Center for Enabling Distributed Petascale Science,” SciDAC Conference, July 2008.

16. “Policy-Driven Data Management for Distributed Scientific Collaborations Using a Rule Engine,” Sara Alspaugh, Ann Chervenak, Ewa Deelman, Supercomputing (SC08) Conference, Austin, Texas, November 2008. Received the Best Undergraduate Student Poster award in the ACM Student Poster competition.

17. "When Workflow Management Systems and Logging Systems Meet: Analyzing Large-Scale Execution Traces". Dan Gunter, Scott Callaghan, Gaurang Mehta, Gideon Juve, Keith Beattie, Ewa Deelman, Phil Maechling, Brian Tierney, Karan Vahi. SC08, Austin, TX

18. "Virtual Workspaces for Scientific Applications", Kate Keahey. SciDAC 200 Conference, Boston, MA. June 2007
