www.chameleoncloud.org
CHAMELEON: OPERATIONAL LESSONS
Kate Keahey, Jason Anderson (ANL, UC), Paul Ruth (RENCI), Jacob Colleran (UC, ANL), Cody Hammock (TACC), Joe Stubbs (TACC), Zhuo Zhen (UC, ANL)
{keahey, jasonanderson}@uchicago.edu
July 29, 2019, HARC Workshop, Chicago, IL
CHAMELEON IN A NUTSHELL
• We like to change: testbed that adapts itself to your experimental needs
  - Deep reconfigurability (bare metal) and isolation (CHI), but also ease of use (KVM)
  - CHI: power on/off, reboot, custom kernel, serial console access, etc.
• We want to be all things to all people: balancing large-scale and diverse
  - Large-scale: large homogenous partition (~15,000 cores), 5 PB of storage distributed over 2 sites (now +1!) connected with 100G network…
  - …and diverse: ARMs, Atoms, FPGAs, GPUs, Corsa switches, etc.
• Cloud on cloud: leveraging mainstream cloud technologies
  - Powered by OpenStack with bare metal reconfiguration (Ironic) + “special sauce”
  - Chameleon team contribution recognized as official OpenStack component
• We live to serve: open, production testbed for Computer Science Research
  - Started in 10/2014, testbed available since 07/2015, renewed in 10/2017
  - Currently 3,000+ users, 500+ projects, 100+ institutions
CLOUDS VERSUS HPC RESOURCES
• Traditional HPC resources: interfaces, complexity, efficiency & cost
• Clouds: interfaces, complexity, efficiency & cost
• Differences in complexity:
  - Operational complexity: networking, security, and others
  - Greater sharing of artifacts: appliance management
  - Relative immaturity of the paradigm
• Cloud systems are more complex because they solve a more complex problem
EXPERIMENTAL INSTRUMENTS VERSUS CLOUDS
• Bare-Metal Infrastructures: security, fewer layers of abstraction, relative immaturity of infrastructure
• Networking: access to L2 for all, complexity/automation, integration with commercial offerings
• Chasing the Research Frontier and Adaptation: emphasis on development/adding new features, closer collaboration with user community
…AND IT NEEDS TO SCALE
[Figure: chart with axes “accessibility” and “# of diverse CS experiments you can run”, positioning traditional open HPC resources, open cloud resources, manually configurable closed testbeds, and Chameleon]
WHAT DOES IT MEAN TO PEOPLE?
• Operators
  - Very high level of skill: more diverse and deeper expertise
    - Significant learning curve
    - Teams of operators with different specialties
    - Development experience is critical
  - More effort
    - Many moving parts, immature parts, new parts, unexpected parts
  - Close interaction with user community
    - Users are increasingly less customers and increasingly more partners
HELPING HUMANS IN THE LOOP
• Researchers and instructors (users)
  - Make interfaces to cloud more intuitive (or at least similar to commercial clouds)
  - Facilitate creation of ecosystem for sharing knowledge
  - Direct instruction and guidance
• Host institutions, service providers (operators)
  - Reduce cost of running Chameleon as low as possible
  - Enable plugging into existing ecosystems
• Ourselves
  - Enable team members of variable expertise to be productive
  - Give insight into usage and health of Chameleon
  - Add force-multipliers to make team have outsized impact
MONITORING: THREE PILLARS
Quantify
• Symptom-based metrics
  - Prometheus
  - Chameleon-specific metrics
• Log indexing and search
  - Elasticsearch, Fluentd
  - Kolla-Ansible
Detect
• Metric-based alerts
  - Prometheus, Alertmanager
• “Black-box” probes
  - Periodic checks for external connectivity to public APIs
• “Smoke tests”
  - Suite of Jenkins tests, run nightly (expensive) or hourly (cheap)
  - Checks “happy path” through system
React
• Runbooks
  - Documentation of known errors and mitigations (for operators)
  - Helps new team members be productive
• Hammers
  - Automated solutions for known errors
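Illustrative sketch (not from the talk): one way the Detect and React pillars could fit together. A black-box probe turns an HTTP check of a public API into a symptom label, and a dispatcher either runs a "hammer" for a known symptom or points the operator at a runbook. The function names `probe` and `react`, the symptom labels, and the runbook IDs are all hypothetical; the talk does not describe Chameleon's actual implementation.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def probe(url: str, fetch: Callable = urllib.request.urlopen,
          timeout: float = 5.0) -> Optional[str]:
    """Black-box check: return None if healthy, else a symptom label.

    `fetch` is injectable so the probe can be exercised without a network.
    """
    try:
        with fetch(url, timeout=timeout) as resp:
            return None if 200 <= resp.status < 400 else f"http_{resp.status}"
    except urllib.error.HTTPError as exc:  # server answered with 4xx/5xx
        return f"http_{exc.code}"
    except OSError:  # URLError, DNS failure, refused connection, timeout
        return "unreachable"


def react(symptom: Optional[str],
          hammers: Dict[str, Callable[[], None]],
          runbooks: Dict[str, str]) -> str:
    """Run the automated fix for a known symptom, else fall back to a runbook."""
    if symptom is None:
        return "ok"
    hammer = hammers.get(symptom)
    if hammer is not None:
        hammer()  # automated solution for a known error
        return f"hammer:{symptom}"
    # No automation yet: hand the operator the documented mitigation, if any.
    return "runbook:" + runbooks.get(symptom, "none (write one after the incident)")
```

A scheduler (cron, Jenkins, or similar) could call `react(probe(url), hammers, runbooks)` periodically. Symptoms that keep landing in the runbook branch are natural candidates for a new hammer.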
IS IT WORTH THE TIME TO AUTOMATE?
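Illustrative sketch (not from the talk): the slide text poses the question without giving a formula. One common way to frame it is a break-even estimate: time recovered over a planning horizon, minus ongoing upkeep of the automation, minus the one-time cost of building it. The function and parameter names, and the numbers in the comment, are assumptions chosen for illustration.

```python
def net_minutes_saved(minutes_per_run: float, runs_per_week: float,
                      horizon_weeks: float, minutes_to_automate: float,
                      upkeep_minutes_per_week: float = 0.0) -> float:
    """Minutes recovered over the horizon, net of build cost and upkeep."""
    weekly_gain = minutes_per_run * runs_per_week - upkeep_minutes_per_week
    return weekly_gain * horizon_weeks - minutes_to_automate


def worth_automating(minutes_per_run: float, runs_per_week: float,
                     horizon_weeks: float, minutes_to_automate: float,
                     upkeep_minutes_per_week: float = 0.0) -> bool:
    """True when the automation pays for itself within the horizon."""
    return net_minutes_saved(minutes_per_run, runs_per_week, horizon_weeks,
                             minutes_to_automate, upkeep_minutes_per_week) > 0


# A 5-minute task done 10 times a week, over 52 weeks, with 600 minutes
# spent automating it: 5 * 10 * 52 - 600 = 2000 minutes recovered.
example = net_minutes_saved(5, 10, 52, 600)
```

The estimate counts only operator time. It leaves out benefits such as repeatability and fewer manual mistakes, so it is a lower bound on the case for automating.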
AUTOMATION
• New appliance releases
  - No “snowflake” images: expressed in code and built with diskimage-builder
• New system releases
  - Patches are tested, built into a new (Docker) container, then pushed to a local registry for release
  - A job is triggered to deploy new container version using controlled process (Kolla-Ansible)
  - Nobody has to learn how to build/install packages! Downside: somebody should know how to fix problems with pipeline.
• Maintenance processes
  - Taking node out of production, attaching metadata for operators
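Illustrative sketch (not from the talk): the system-release flow above (test, build container, push to registry, deploy) can be modeled as an ordered list of gated steps, where a failure stops the release before anything later runs. The `run_release` helper and the placeholder step actions are hypothetical. A real pipeline would invoke the test suite, the container build, the registry push, and the Kolla-Ansible deploy job in their place.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that reports success (True) or failure (False).
Step = Tuple[str, Callable[[], bool]]


def run_release(steps: List[Step]) -> Tuple[List[str], str]:
    """Run steps in order and stop at the first failure.

    Returns (names of completed steps, name of the failed step or "").
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name  # later steps (e.g. deploy) never run
        completed.append(name)
    return completed, ""


# Placeholder actions standing in for the real commands.
pipeline: List[Step] = [
    ("test", lambda: True),
    ("build-container", lambda: True),
    ("push-to-registry", lambda: True),
    ("deploy", lambda: True),
]
completed, failed_at = run_release(pipeline)
```

Returning the name of the failed step gives whoever maintains the pipeline a starting point, which is the downside the slide mentions.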
PACKAGING CHAMELEON
• What is CHI-in-a-Box?
  - Install Chameleon on your own infrastructure with set of provisioning scripts + software bundles
[Figure labels: traditional software, OpenStack]
CHI-IN-A-BOX USE CASES
• Chameleon Associate
  - Resources added directly to Chameleon, while retaining project identity
  - Chameleon provides user management (and user support!), resource discovery and appliance catalog
  - Jointly maintained by Chameleon staff and associate site partnership
• Chameleon Part-time Associate
  - Similar to above, but all resources are expected to be taken offline at times
• Independent Testbed
  - Associate site deploys Chameleon, but operates user management and support themselves
  - First site already deployed at NU
PACKAGING: OUR APPROACH
• Distribute a set of provisioning scripts, build on commodity technology
  - Kolla, Kolla-Ansible, Ansible
  - All infrastructure expressed as versioned code. Infrastructure can be built from scratch repeatably (good for disaster recovery).
• Provide installation and support documentation
  - Install guide, troubleshooting, runbooks
• We “dogfood” CHI-in-a-Box internally
  - Being consumers of our own product improves quality
• Focus on reducing coupling between sites for reliability
USERS: THE FINAL FRONTIER
• Tickets vs. support lists
  - Dedicated, trackable communication versus discoverable, noisy communications
• Covenant between users and operators
  - Everything works better when users are educated about “proper use”
• Education and outreach
  - OpenStack documentation is mixed blessing
  - Chameleon docs are updated with each release
  - Live webinars and face-to-face meetups most impactful
• Incentivizing an ecosystem
  - Sharing is much more powerful in the cloud!
PARTING THOUGHTS
• Physical environment: Chameleon is a rapidly evolving experimental platform
  - From “adapts to the needs of your experiment”…
  - …to “adapts to the changing research frontier”
• Clouds are hard to operate because they solve a complex problem, and experimental facilities even more so
  - More skilled personnel, more effort, and especially in testbeds more development
• Towards an ecosystem: a meeting place of users sharing resources and research
  - Testbeds are more than just experimental platforms: common/shared platform is a “common denominator” that can eliminate much complexity that goes into systematic experimentation, sharing, and reproducibility
  - Working with other operators via CHI-in-a-Box and BYOH initiatives
  - Working with users via providing sharing mechanisms and fostering community development
WE’RE HERE TO CHANGE
CHAMELEON:
www.chameleoncloud.org