HEPiX Fall 2014 - Thursday October 16 4:00pm
James Pryor - [email protected]
RHIC and ATLAS Computing Facility at Brookhaven National Laboratory
Configuration Management, Change Management, and Culture Management
James Pryor - RACF BNL 2
Past
Present
Future Plans & Desires
Credits / Discussion
Many ways to do OS/application deployment and configuration
OS deploy: a single PXE/kickstart server in 2007. Lots of kickstart files: at best 5:1, at worst 1:1. We used a post-install RPM file with base-level config files & scripts.
Started with Cfengine in 2008 on RH Linux and Solaris
Used it for very basic OS configuration and some application configuration
Past
Identified some problems, both local & global.
Kickstart file is installation only. Once installed, changes over time make unique configurations.
We found Cfengine 2.x and Cfengine 3.x not ideal
On a wider scope: We (humans) don't scale. Separate pools of knowledge. Files, internal web, internal mail, our minds, external web.
More than one way to create/fix/diagnose it.
Errors & issues are hard to debug/replicate. We can not do it all or remember it all.
Past: 2010
Photo: Shawn Hoke, https://www.flickr.com/photos/shawnhoke/14908288722, Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
We can not do it all. We don't have super powers.
Policies & procedures (as formal as ITIL, or as informal as DevOps) used to control changes made to production/development/test systems and to keep a historical record of those changes.
Uncontrolled change can work, but will often cause self-inflicted problems, future firefighting episodes & upgrade nightmares.
Without it, servers/applications become like snowflakes: they start out identical, but over time, configuration drift eventually makes each one unique.
Past: Change Management
Photos: Alexey Kljatov - Creative Commons (CC BY-NC 2.0)
https://www.flickr.com/photos/chaoticmind75/6715743931
https://www.flickr.com/photos/chaoticmind75/6922463361
https://www.flickr.com/photos/chaoticmind75/6737065985
They are pretty to look at but hard to manage.
Increase agility. Do more with less. Shorten time to solution.
Standardize! We want to stop duplicating work and effort.
Almost everyone is using change management.
Align ourselves to be congruent with other entities (Labs, Univ, Corp) so that we can build upon their success.
Follow proven methods of improvement and change management to shift staff time from perpetual reactive mode (firefighting) to more proactive work (fire prevention)
No unauthorized changes, no "cowboy" or "superhero" type behavior tolerated
Past: Change Management
Moved to Puppet for configuration management: easy to learn, facts, modular, idempotence, reporting.
Intro level training in Aug 2010
Fall 2010, we chose Cobbler for Linux OS deployment via PXE/Kickstart. CLI & Web manages distros, systems, and repos. Modularity through variables, templates, and profiles.
Selected GLPI for asset management, and FusionInventory Agent to collect server details
Past: 2010
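Idempotence, one of the Puppet properties listed above, means that applying the same desired state repeatedly changes the system only the first time. The following is a minimal Python sketch of that idea and not Puppet code; the function `ensure_line`, the file name, and the config line are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def ensure_line(path, line):
    """Ensure `line` is present in the file at `path`.

    Returns True if the file had to be changed, False if it was already in
    the desired state.
    """
    content = ""
    if os.path.exists(path):
        with open(path) as handle:
            content = handle.read()
    if line in content.splitlines():
        return False  # desired state already reached: change nothing
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if content == "" or content.endswith("\n") else "\n"
    with open(path, "a") as handle:
        handle.write(prefix + line + "\n")
    return True


def demo():
    """Apply the same desired state three times to a scratch file."""
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "example.conf")  # hypothetical file
        return [ensure_line(target, "PermitRootLogin no") for _ in range(3)]
```

Only the first call in `demo()` reports a change; later calls find the file already in the desired state, which is what makes repeated agent runs safe.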
Converted the kickstarts into a single Cobbler template that supports just about all use cases, and converted the post-install RPM's configuration shell scripts into Puppet code.
Developed front-end Perl web scripts to manage SSL CA certs, and act as change management 'gateway' to back-end git/cgit
Puppet External Node Classifier: GLPI, Puppet Dashboard, Linux worker node pool custom inventory DB
Created a foundational base class that encapsulates the desired default configuration, then worked on managing services with puppet.
Started working on Desktop Centralized Management
Past: 2011 to 2012
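A Puppet External Node Classifier (ENC) is an executable that receives a node's certname as its only argument and prints a YAML document naming the classes for that node. The sketch below shows the general shape only. The `INVENTORY` table, the host names, and the class names are hypothetical stand-ins for a lookup in GLPI or a custom inventory DB, and it is not the RACF implementation.

```python
import sys

# Hypothetical inventory; a real ENC would query GLPI or another database.
INVENTORY = {
    "web01.example.org": {"classes": ["base", "apache"], "environment": "production"},
    "wn0001.example.org": {"classes": ["base", "workernode"], "environment": "production"},
}
# Hypothetical fallback for nodes that are not in the inventory.
DEFAULT = {"classes": ["base"], "environment": "production"}


def classify(certname):
    """Return the YAML document an ENC would print for one node."""
    node = INVENTORY.get(certname, DEFAULT)
    lines = ["---", "classes:"]
    lines += ["  - " + name for name in node["classes"]]
    lines.append("environment: " + node["environment"])
    return "\n".join(lines) + "\n"


# When run as a script with exactly one argument, behave like an ENC.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(classify(sys.argv[1]))
```

Keeping classification in a function separate from the command-line entry point makes it easy to test the mapping without a Puppet master.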
This flowchart is from a presentation made in late 2010
Cobbler – System view - racf-min.ks is our shared templated kickstart
GLPI – Server details. Note the circled Custom tab.
GLPI – Custom tab details. Note the circled base class.
Base class is 'community property' where we build the foundation for infrastructure servers. Code edited/resized for screenshot
This is the linuxfarm group's workernode class
This shows we have one client host cert that needs to be signed.
I need to clean a client host cert, so I can search for that host with regex support.
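A regex search over certificate names could look roughly like the sketch below. The web front end shown on the slide is written in Perl; this Python version, the function `find_certs`, and the host names are assumptions used only to illustrate the filtering step.

```python
import re


def find_certs(pattern, certnames):
    """Return the certnames that match the regular expression, sorted."""
    regex = re.compile(pattern)
    return sorted(name for name in certnames if regex.search(name))


# Hypothetical list of signed client certificates.
SIGNED = ["wn0002.example.org", "db01.example.org", "wn0001.example.org"]
WORKER_NODES = find_certs(r"^wn\d+\.", SIGNED)
```

Here `WORKER_NODES` holds the two `wn` hosts, so a single pattern can select one host or a whole family of hosts for cleaning.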
It is cleaning one host cert and removing any exported resources tied to that host.
Present: Make a change to production
pryor@mydesktop (production) ~/gitree/catalog/common/cvesecurity/manifests $ git push origin production
<trimmed>
remote: diff-tree:
remote: :100644 100644 04eb0bc... 7bf289b... M common/cvesecurity/manifests/cve_2014_7169.pp
remote: Merge results:
remote: Updating ce24fcc..d8a7860
remote: Fast-forward
remote: common/cvesecurity/manifests/cve_2014_7169.pp | 1 +
remote: 1 file changed, 1 insertion(+)
remote: Note: Your updates to the production branch are waiting for approval in branch: pending-production-pryor-ce24fcc-20141009T201915UTC
remote:
remote: error: hook declined to update refs/heads/production
To https://webdocs.racf.bnl.gov/git/puppet/catalog
 ! [remote rejected] production -> production (hook declined)
error: failed to push some refs to 'https://webdocs.racf.bnl.gov/git/puppet/catalog'
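The pending branch name in the output above appears to combine the target branch, the user, the abbreviated commit the branch pointed to before the push, and a UTC timestamp. The sketch below rebuilds such a name. The layout is inferred from this single example, and the function `pending_branch` is hypothetical, not the actual hook code.

```python
from datetime import datetime, timezone


def pending_branch(target, user, old_sha, when):
    """Build a pending-approval branch name.

    `when` must be a timezone-aware datetime; it is converted to UTC.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SUTC")
    return "pending-{}-{}-{}-{}".format(target, user, old_sha[:7], stamp)


EXAMPLE = pending_branch(
    "production",
    "pryor",
    "ce24fcc",
    datetime(2014, 10, 9, 20, 19, 15, tzinfo=timezone.utc),
)
```

Encoding the user, base commit, and time in the name keeps concurrent pushes from different people in separate pending branches.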
Puppet Approval Change Management Portal showing my commit to production
Diff of my commit as seen in cgit
Do you want to merge this?
Puppet Change Approval Committee. Note the regex under the Authorization column.
RACF Puppet Catalog. The regex from the previous slide matches up with the directories (in blue).
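One way to picture the regex-based authorization is a table that maps each approver to a pattern over catalog paths. In the sketch below the approver names, the patterns, and the rule that an approver must match every changed path are all assumptions; the slides do not say whether the portal requires a match on every path or on any path.

```python
import re

# Hypothetical authorization table: approver -> regex over catalog paths.
AUTHORIZATIONS = {
    "alice": r"^common/",
    "bob": r"^(linuxfarm|common/cvesecurity)/",
}


def approvers_for(changed_paths, authorizations=AUTHORIZATIONS):
    """Return approvers whose regex matches every changed path."""
    if not changed_paths:
        return []  # nothing changed, so nothing to approve
    return sorted(
        user
        for user, pattern in authorizations.items()
        if all(re.search(pattern, path) for path in changed_paths)
    )
```

With this table, a change to `common/cvesecurity/manifests/cve_2014_7169.pp` could be approved by either hypothetical user, while a commit touching both `common/` and `linuxfarm/` would have no single approver.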
Merge results
Approval scripts also record who approved what for auditing purposes
Summary of all log messages making up the branch merge in cases where there are many commits in that branch
Present: Reporting, Performance Metrics
Puppet Dashboard is deprecated. We now use The Foreman and PuppetDB for Puppet client run reporting.
The Foreman runs on a RHEL 6.5 VM on the RHEV cluster, with 8 vCPUs and 8GB RAM assigned.
A single Puppet Master server: Dell PowerEdge R610 with Dual CPU Intel Xeon X5660 2.80GHz, 96GB RAM, RAID10 900GB, RHEL 6.5
It runs just about everything: Apache, Perl Puppet CA, Perl Puppet Approval, git/cgit, PuppetDB (w/ PostgreSQL), GLPI (w/ MySQL)
Puppet client agent check-in method & interval
[Diagram: client groups checking in to the Puppet master, which reports to The Foreman]
Client groups: Linux farm worker nodes; Farm Infrastructure, DB; Grid, Infrastructure, Storage, Cloud hosts (under 600 machines)
Check-in methods: Puppet daemon check-in once per hour; manual check_puppet.py run without Puppet daemon
Puppet Master: Dell PowerEdge R610, Dual CPU Intel Xeon X5660 2.80GHz, 96GB RAM, RAID10 900GB, RHEL 6.5, running Apache, Perl Puppet CA, Perl Puppet Approval, git/cgit, PuppetDB (w/ PostgreSQL), GLPI (w/ MySQL)
The Foreman: VM on RHEV, with PostgreSQL
Present: Load test & Metrics
We tested our Puppet master with client agents checking in at about 1 Hz. PuppetDB reported over 2500 agents checking in during this test, which ran a little over an hour.
Load (top command output) as seen about 12 minutes after starting a manual puppet agent run (14:45) on the Linux Farm worker nodes.
top - 14:57:23 up 3:55, 2 users, load average: 4.22, 4.22, 3.03
Tasks: 551 total, 4 running, 547 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.9%us, 2.1%sy, 0.0%ni, 76.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 99051792k total, 31167540k used, 67884252k free, 340332k buffers
Swap: 8388604k total, 0k used, 8388604k free, 4798864k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9178 puppet    20   0  302m 195m 3404 S 96.5  0.2   4:19.72 ruby
 7361 puppet    20   0  291m 183m 3404 S 86.6  0.2   4:24.94 ruby
 8992 puppet    20   0  362m 254m 3404 R 74.6  0.3  13:43.31 ruby
 6993 puppetdb  20   0 14.6g 1.3g  12m S 71.3  1.3  21:26.15 java
  434 puppet    20   0  352m 244m 3404 R 54.7  0.3   8:32.26 ruby
 9070 puppet    20   0  276m 169m 3404 S 43.8  0.2   4:34.72 ruby
 9077 puppet    20   0  306m 199m 3400 S 33.2  0.2   5:03.86 ruby
 7808 postgres  20   0 24.4g 575m 569m S 26.5  0.6   0:14.28 postmaster
 8571 postgres  20   0 24.4g 670m 662m S  9.6  0.7   0:14.15 postmaster
Linux farm manual run started approx 14:45.
Over a bit more than an hour, we saw between 40% and 60% CPU usage with about 2500 agents checking in.
PuppetDB metrics during the load test
Present: Puppet Code Statistics
Project name: RACF Puppet Catalog
Generated: 2014-10-10 05:05:07 (in 6 seconds)
Generator: GitStats (version ), git version 1.8.5.3
Report Period: 2011-01-04 13:54:30 to 2014-10-09 16:59:44
Age: 1375 days, 877 active days (63.78%)
Total Files: 2620
Total Lines of Code: 327595 (1209686 added, 882091 removed)
Average file size: 9068.73 bytes
Total Commits: 13396 (average 15.3 commits per active day, 9.7 per all days)
Authors: 31 (average 432.1 commits per author)
146 modules, 431 different .pp files, totaling nearly 34k lines of Puppet code.
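The derived figures in the GitStats report follow directly from the raw counts. The short check below recomputes them; only the variable names are introduced here, and every number is taken from the report above.

```python
# Raw counts from the GitStats report.
DAYS_TOTAL = 1375
DAYS_ACTIVE = 877
COMMITS = 13396
AUTHORS = 31
LINES_ADDED = 1209686
LINES_REMOVED = 882091

SUMMARY = {
    "active_days_pct": round(DAYS_ACTIVE / DAYS_TOTAL * 100, 2),
    "commits_per_active_day": round(COMMITS / DAYS_ACTIVE, 1),
    "commits_per_day": round(COMMITS / DAYS_TOTAL, 1),
    "commits_per_author": round(COMMITS / AUTHORS, 1),
    "net_lines": LINES_ADDED - LINES_REMOVED,
}
```

The results agree with the reported 63.78% active days, 15.3 and 9.7 commits per day, 432.1 commits per author, and 327595 total lines.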
Present: Puppet code style guide
In Aug & Sept of 2014, we used puppet-lint and Geppetto to update all our Puppet code to comply with the style guide suggested by Puppet Labs.
The style guide is now enforced on push to the Git server via puppet-lint.
puppet-lint --with-context --no-80chars-check init.pp
WARNING: class not documented on line 1
ERROR: trailing whitespace found on line 66
  }
   ^
WARNING: indentation of => is not properly aligned on line 81
  source => $allow_from,
         ^
WARNING: unquoted file mode on line 53
  mode => 644,
          ^
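A push-time check could read puppet-lint output like the sample above and decide whether to accept the push. The sketch below only parses text in that `KIND: message on line N` shape. How the RACF hook actually invokes puppet-lint is not shown in the slides, and the `fail_on_warnings` policy switch is a hypothetical parameter.

```python
import re

# Matches lines such as "ERROR: trailing whitespace found on line 66".
LINT_LINE = re.compile(r"^(WARNING|ERROR): (.+) on line (\d+)$")


def parse_lint(output):
    """Return (kind, line_number, message) tuples; context lines are skipped."""
    problems = []
    for raw in output.splitlines():
        match = LINT_LINE.match(raw.strip())
        if match:
            kind, message, line_no = match.groups()
            problems.append((kind, int(line_no), message))
    return problems


def push_allowed(output, fail_on_warnings=True):
    """Decide whether a push passes the style check."""
    problems = parse_lint(output)
    if fail_on_warnings:
        return not problems
    return not any(kind == "ERROR" for kind, _, _ in problems)


SAMPLE = """WARNING: class not documented on line 1
ERROR: trailing whitespace found on line 66
  }
   ^
WARNING: unquoted file mode on line 53
  mode => 644,
          ^
"""
```

With `SAMPLE`, the parser finds three problems and the push is refused under either policy because one of them is an error.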
Present: Culture and CM Adoption
Discovered that changing servers is easy. Changing people's workflow, and ultimately their minds, is much harder.
At first there was a mixed reception and some resistance. Most people within our group adopted this workflow & toolset. Others did not adopt it at all.
Image: http://en.wikipedia.org/wiki/File:DiffusionOfInnovation.png, Attribution 2.5 Generic (CC BY 2.5)
Present: Puppet Use / Culture Management
Had some accidental & potentially dangerous commits to production branch.
For the Change Management Approval gateway, implemented a self-approve delay, mandatory scroll & approval authorizations
Never manually change production servers or commit untested code to production. Use test environments & servers.
We host a tree of the Puppet code base (“common”) that is now shared and used upstream by the Physics and IT departments.
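The self-approve delay mentioned above can be pictured as a simple time check: another approver may act immediately, but the author must wait before approving their own change. The sketch below is an assumption about how such a rule might be expressed. The 15-minute value and the function `may_approve` are hypothetical, since the slides give neither the delay nor the implementation.

```python
from datetime import datetime, timedelta

SELF_APPROVE_DELAY = timedelta(minutes=15)  # hypothetical value


def may_approve(author, approver, pushed_at, now, delay=SELF_APPROVE_DELAY):
    """Return True if `approver` may approve the change at time `now`."""
    if approver != author:
        return True  # a second person reviewing: no waiting period
    return now - pushed_at >= delay  # self-approval only after the delay


PUSHED = datetime(2014, 10, 9, 20, 19, 15)
```

A forced pause gives the author a chance to reread the diff before it reaches production, which targets the accidental commits described on this slide.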
If your infrastructure is now done in code, do you test your code? Or do you just push it to production and see if it works?
Dos Equis Man “The Most Interesting Man in the World” is a character & property of Cervecería Cuauhtémoc-Moctezuma. Used without permission on assumption of Fair Use / Parody.
He is not someone to admire. Don't be like him.
You must test your code.
Future Plans and Desires
We look to adopt software development practices for our Puppet code: smoke testing, unit testing, acceptance testing.
An automatic testing (Continuous Integration) system built on the Jenkins CI tool will manage our Puppet testing process. It runs Puppet on a pool of RHEV VMs, and all pending changes to production must pass this validation process before they can be approved and merged into production. See the talk on this topic at a future HEPiX and/or CHEP 2015.
Use The Foreman beyond just reporting: as both ENC & as OS provisioner.
Future Plans and Desires
Work toward a community-shared Puppet code base beyond RACF/Physics/Lab ITD. This is desirable but is at least 1 to 2 years away from being realized.
Puppet with Hiera: a hierarchical data store that keeps site-specific data out of your manifests. It lets us avoid repetition (duplicating similar blocks of modular code) and use public Puppet modules. With a public module, you don't need to edit the code; just put the necessary data in Hiera.
Requires a rewrite/refactor of our code. This is a non-trivial project.
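The hierarchical lookup that Hiera provides can be illustrated with a first-match search from the most specific data source to the most general. The sketch below is plain Python and not Hiera itself; the hierarchy levels, fact names, keys, and values are all hypothetical.

```python
# Hypothetical hierarchy, most specific level first.
HIERARCHY = ["nodes/{fqdn}", "groups/{group}", "common"]

# Hypothetical data sources, keyed by the interpolated level name.
DATA = {
    "nodes/web01.example.org": {"ntp_server": "ntp-web.example.org"},
    "groups/linuxfarm": {"ntp_server": "ntp-farm.example.org"},
    "common": {"ntp_server": "ntp.example.org", "ssh_port": 22},
}


def lookup(key, facts, hierarchy=HIERARCHY, data=DATA):
    """Return the value of `key` from the first level that defines it."""
    for level in hierarchy:
        source = level.format(**facts)  # e.g. "nodes/web01.example.org"
        values = data.get(source, {})
        if key in values:
            return values[key]
    raise KeyError(key)


WEB01 = {"fqdn": "web01.example.org", "group": "grid"}
WN0001 = {"fqdn": "wn0001.example.org", "group": "linuxfarm"}
```

The same module code can then serve every node: `web01` gets its node-level override, a worker node gets the group value, and anything not overridden falls through to `common`.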
Future Plans and Desires
MCollective: a framework for building server orchestration and parallel job execution on clusters of servers. Not simply a fancy SSH "for loop": it provides granularity and reporting.
Integrate Monitoring and Puppet: rewrite existing “nagios” puppet class to support use of exported resources. Both the target node to be monitored and the Nagios server would execute Puppet code in a sort of conversation:
"Hey Nagios server. I'm a node and have a new service. You need to monitor it." Then it would be monitored.
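The monitoring "conversation" between a node and the Nagios server relies on exported resources: a node declares a resource, a shared store keeps it, and the Nagios server collects everything that was exported. The toy model below captures only that flow. The class `ExportedResources`, the host names, and the check naming are hypothetical, and in Puppet the store would be PuppetDB rather than an in-memory dictionary.

```python
class ExportedResources:
    """In-memory stand-in for the store that holds exported resources."""

    def __init__(self):
        self._resources = {}

    def export(self, node, service):
        """A node announces a service that should be monitored."""
        self._resources[(node, service)] = "check_" + service

    def collect(self):
        """The monitoring server gathers every exported check."""
        return sorted(
            (node, service, check)
            for (node, service), check in self._resources.items()
        )

    def clean(self, node):
        """Drop all resources exported by one node."""
        self._resources = {
            key: check for key, check in self._resources.items() if key[0] != node
        }


STORE = ExportedResources()
STORE.export("web01.example.org", "http")
STORE.export("db01.example.org", "mysql")
COLLECTED = STORE.collect()
```

Because checks are derived from what nodes export, adding a service to a node is enough to get it monitored, and cleaning a node removes its checks as well.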
Credit / Questions
Dr. Jason A. Smith, Mizuki Karasawa, John S. De Stefano Jr.,
William Strecker-Kellogg, Christopher Hollowell, James Pryor
Credit
"Bernard De Chartres used to compare us to [puny] dwarfs perched on the shoulders of giants. He pointed out that we see more and
farther than our predecessors, not because we have keener vision or greater height, but because we are lifted up and borne aloft on
their gigantic stature."
John of Salisbury - 1159