Progress on TeraGrid Stability for the LEAD project.

Date post: 28-Dec-2015
Upload: godfrey-curtis-matthews

History

•Reliability problems for LEAD
– 2006 Unidata workshop
– Spring 2007 Weather Challenge
– Continued “heroics” needed by staff every time more than a handful of users used the gateway

•10/25/08 ARCH call topic raised by Dane
  •“I think it's time to raise this discussion in the broader venue of the ARCH meeting. We have been raising the profile of this investigation and trying to come to a persistent resolution of the problem. We're putting attention on this among the management and it should be reflected to the working teams. There also continue to be misunderstanding and different expectations and it would be good to set those clearly.”

Gateway-debug calls initiated 10/25/07

•Goal
– Stable systems for LEAD to conduct student Weather Challenge with 67 universities
  •Runs start 1/28/08
– Improve stability of grid services for all users at all TeraGrid sites
  •Eliminate need for staff heroics

Get the right staff on the gateway-debug calls

•Original request for
– knowledgeable LEAD rep
– knowledgeable Globus rep
– knowledgeable NCSA RP rep
– knowledgeable IU RP rep
– knowledgeable GRAM rep
– knowledgeable GridFTP rep
– knowledgeable Inca rep
– knowledgeable TG operations rep

gateway-debug activities

•Understand the problems
– Suresh creates http://www.teragridforum.org/mediawiki/index.php?title=LEAD

With some humor

•Overloaded GridFTP servers

•http://www.youtube.com/v/4wp3m1vg06Q&hl=en

•Create testbed where we can implement solutions rapidly
– Only at sites LEAD was trying to use
  •ANL, NCSA, IU
– Software and hardware configuration changes on the testbed
  •Non-striped GridFTP servers
  •Globus 4.0.7, which includes GRAM scalability improvements
  •RFT improvements
•Develop tests that simulate what LEAD does
– GRAM, GridFTP, Java CoG

Inca

•Use Inca to run LEAD tests
– Inca runs once per day on production sites
  •Version tests, limited functionality tests
– Frequency greatly increased for testbed
  •Every 5 min. “are you alive” tests
  •Once an hour “can I get a job into a queue” test
  •These can be tuned to back off when a service proves it is stable
– Automatic admin notification
– These last two were the key!!
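
The tuning idea above (test often, back off once a service proves stable, notify admins on failure) can be sketched in a few lines of Python. This is only an illustration, not how Inca is configured: the class name `ProbeSchedule`, the doubling rule, the one-hour cap, and the threshold of 12 consecutive passes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeSchedule:
    """Hypothetical schedule for one service test (all values are illustrative)."""
    base_interval_s: int = 300    # start with a test every 5 minutes
    max_interval_s: int = 3600    # assumed cap: never less often than hourly
    stable_after: int = 12        # assumed: passes in a row before backing off
    interval_s: int = field(init=False, default=0)
    passes: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Begin at the most frequent setting.
        self.interval_s = self.base_interval_s

    def record(self, ok: bool) -> bool:
        """Record one test result; return True if admins should be notified."""
        if ok:
            self.passes += 1
            if self.passes >= self.stable_after:
                # Service has proven stable: test half as often, up to the cap.
                self.interval_s = min(self.interval_s * 2, self.max_interval_s)
                self.passes = 0
            return False
        # Any failure resets to the most frequent interval and triggers a notice.
        self.passes = 0
        self.interval_s = self.base_interval_s
        return True
```

A caller would run the test, pass the result to `record()`, send a notification when it returns True, and sleep for `interval_s` before the next run.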

Inca results reviewed at each call

•http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
– Still lots of errors this past week
•Summary sent before gateway-debug
– Issues addressed on the call
– Follow-up on actions from previous week

Gateway-debug work moving to ops-wg

•Maintain testbed
– For now, maintain as stable infrastructure for LEAD
  •Having trouble today with testbed stability
– In the future
  •Use testbed and Inca structure to verify reliability of new versions of CTSS before they go into production
  •Improve simulated scalability tests and produce benchmark (before asking users/gateways to participate)
•Turn focus on production systems
– Increase testing frequency enough to be able to determine stability
  •Once per day is not enough
– Automatic notification of sys admins
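
A rough way to see why once per day is not enough: the test interval bounds how long a failure can go unnoticed and how many data points a weekly review has to work with. The helper names below are hypothetical, and the model assumes each test runs on a fixed interval and detects a failure the first time it runs after the failure occurs.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def samples_per_week(interval_minutes: int) -> int:
    """Number of test results available for one weekly review."""
    return MINUTES_PER_WEEK // interval_minutes


def worst_case_detection_minutes(interval_minutes: int) -> int:
    """Longest time a failure can go unnoticed under the fixed-interval assumption."""
    return interval_minutes


for label, interval in [("once per day", 1440), ("hourly", 60), ("every 5 min", 5)]:
    print(label, samples_per_week(interval), worst_case_detection_minutes(interval))
# once per day 7 1440
# hourly 168 60
# every 5 min 2016 5
```

Under these assumptions a daily test yields 7 results per week and can miss an outage for up to 24 hours, while a 5-minute test yields 2016 results and bounds the delay at 5 minutes.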

Let’s learn from this experience

•Increased testing
•Automatic sys admin notification
•Having the right staff on the calls as needed
•Weekly reviews of test results

•The above items are what moved us along
•We need to continue paying attention if we expect to have a stable environment for Gateways and users of grid services
•Stay tuned for progress in ops-wg

Thank You

To lots of folks, but especially

•Suresh Marru
•Doru Marcusiu
•Kate Ericson, Shava Smallen
•Derek Simmel, Robert Budden
•Mike Lowe, Jenette Tillotson
•Stu Martin, Dan Fraser
•Raj Kettimuthu, John Bresnahan
•Ravi Madduri
