Configuration Management, Change Management, and Culture Management

James Pryor - [email protected]

RHIC and ATLAS Computing Facility at Brookhaven National Laboratory

HEPiX Fall 2014 - Thursday October 16 4:00pm
James Pryor - RACF BNL 2

Past

Present

Future Plans & Desires

Credits / Discussion

Many ways to do OS/application deployment and configuration

OS deploy: a single PXE/kickstart server in 2007. Lots of kickstart files. At best 5:1, worst 1:1. We used a post-install RPM file with base-level config files & scripts.

Started with Cfengine in 2008 on RH Linux and Solaris

Used it for very basic OS configuration and some application configuration

Past

Identified some problems local & global.

Kickstart file is installation only. Once installed, changes over time make unique configurations.

We found Cfengine 2.x and Cfengine 3.x not ideal

On a wider scope: We (humans) don't scale. Knowledge sits in separate pools: files, internal web, internal mail, our minds, external web.

More than one way to create/fix/diagnose it.

Errors & issues are hard to debug/replicate. We can not do it all or remember it all.

Past: 2010

Image: Shawn Hoke - https://www.flickr.com/photos/shawnhoke/14908288722 - Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)

We can not do it all. We don't have super powers.

Policy & procedures (as formal as ITIL, or as informal as DevOps) used to control changes to production/development/test systems and to keep a historical record of those changes

Uncontrolled change can work, but will often cause self-inflicted problems, future firefighting episodes & upgrade nightmares

Without it, servers/applications become like snowflakes: they start out identical, but over time, configuration drift eventually makes each one unique.

Past: Change Management

Images: Alexey Kljatov - Creative Commons (CC BY-NC 2.0)
https://www.flickr.com/photos/chaoticmind75/6715743931
https://www.flickr.com/photos/chaoticmind75/6922463361
https://www.flickr.com/photos/chaoticmind75/6737065985

They are pretty to look at but hard to manage.

Increase agility. Do More with less. Shorten time to solution.

Standardize! We want to stop duplicating work and effort.

Almost everyone is using change management.

Align ourselves to be congruent with other entities (Labs, Univ, Corp) so that we can build upon their success.

Follow proven methods of improvement and change management to shift staff time from perpetual reactive mode (firefighting) to more proactive work (fire prevention)

No unauthorized changes, no "cowboy" or "superhero" type behavior tolerated

Past: Change Management

Moved to Puppet for configuration management: easy to learn, facts, modular, idempotence, reporting.

Intro level training in Aug 2010

Fall 2010, we chose Cobbler for Linux OS deployment via PXE/Kickstart. CLI & Web manages distros, systems, and repos. Modularity through variables, templates, and profiles.

Selected GLPI for asset management, and FusionInventory Agent to collect server details

Past: 2010

Converted the kickstarts into a single Cobbler template that supports just about all use cases, and converted the post-install RPM's configuration shell scripts into Puppet code.

Developed front-end Perl web scripts to manage SSL CA certs, and act as change management 'gateway' to back-end git/cgit

Puppet External Node Classifier: GLPI, Puppet Dashboard, Linux worker node pool custom inventory DB

Created a foundational base class that encapsulates the desired default configuration, then worked on managing services with puppet.

Started working on Desktop Centralized Management

Past: 2011 to 2012
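For readers unfamiliar with the External Node Classifier interface: Puppet runs an executable with the node's name as its only argument and expects YAML on standard output listing the node's classes and, optionally, its environment. The following is a minimal sketch of that contract in Python, not the RACF classifier itself. The `INVENTORY` dictionary, the host names, and the helper names `classify`, `to_enc_yaml`, and `main` are hypothetical stand-ins for a lookup against GLPI or an inventory DB; `base` and `workernode` simply echo class names mentioned in this talk.

```python
import sys

# Hypothetical inventory standing in for a GLPI / inventory-DB lookup.
INVENTORY = {
    "wn0001.example.org": {"classes": ["workernode"], "environment": "production"},
    "web01.example.org": {"classes": [], "environment": "test"},
}


def classify(hostname, inventory=INVENTORY):
    """Return the ENC data for a host, or None if the host is unknown."""
    record = inventory.get(hostname)
    if record is None:
        return None
    # Every known host gets the shared base class first, then its own classes.
    return {
        "classes": ["base"] + list(record["classes"]),
        "environment": record["environment"],
    }


def to_enc_yaml(data):
    """Render the small classes/environment structure as YAML text."""
    lines = ["---", "classes:"]
    lines += ["  - {}".format(name) for name in data["classes"]]
    lines.append("environment: {}".format(data["environment"]))
    return "\n".join(lines) + "\n"


def main(argv):
    """Entry point: argv[1] is the node name handed over by Puppet."""
    if len(argv) != 2:
        return 2
    data = classify(argv[1])
    if data is None:
        return 1  # a non-zero exit status signals that classification failed
    sys.stdout.write(to_enc_yaml(data))
    return 0
```

A real script would finish by calling `sys.exit(main(sys.argv))`; it is left out here so the functions can be imported and exercised on their own.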

This flowchart is from a presentation made in late 2010

Cobbler – System view - racf-min.ks is our shared templated kickstart

GLPI – Server details. Note the circled Custom tab.

GLPI – Custom tab details. Note the circled base class.

Base class is 'community property' where we build the foundation for infrastructure servers. Code edited/resized for screenshot

This is the linuxfarm group's workernode class

This shows we have one client host cert that needs to be signed.

I need to clean a client host cert, so I can search for that host with regex support.

It is cleaning one host cert and removing any exported resources tied to that host.

Present: Make a change to production

pryor@mydesktop (production) ~/gitree/catalog/common/cvesecurity/manifests $ git push origin production
<trimmed>
remote: diff-tree:
remote: :100644 100644 04eb0bc... 7bf289b... M common/cvesecurity/manifests/cve_2014_7169.pp
remote: Merge results:
remote: Updating ce24fcc..d8a7860
remote: Fast-forward
remote: common/cvesecurity/manifests/cve_2014_7169.pp | 1 +
remote: 1 file changed, 1 insertion(+)
remote: Note: Your updates to the production branch are waiting for approval in branch: pending-production-pryor-ce24fcc-20141009T201915UTC
remote:
remote: error: hook declined to update refs/heads/production
To https://webdocs.racf.bnl.gov/git/puppet/catalog
 ! [remote rejected] production -> production (hook declined)
error: failed to push some refs to 'https://webdocs.racf.bnl.gov/git/puppet/catalog'
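The push above is rejected by a server-side hook, and the change is parked in a pending branch whose name appears to encode the target branch, the user, the old tip of the branch, and a UTC timestamp. The hook itself is not shown in the slides, so the sketch below only reconstructs that naming pattern from the single example in the output; the function name `pending_branch_name` and its parameters are hypothetical.

```python
from datetime import datetime, timezone


def pending_branch_name(target, user, old_sha, pushed_at):
    """Build a name like pending-<target>-<user>-<old sha>-<UTC timestamp>.

    pushed_at must be a timezone-aware datetime; it is converted to UTC.
    """
    stamp = pushed_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SUTC")
    # Only the abbreviated (7-character) commit id appears in the example name.
    return "pending-{}-{}-{}-{}".format(target, user, old_sha[:7], stamp)


# Reproduces the branch name from the rejected push shown in the slide.
example = pending_branch_name(
    "production",
    "pryor",
    "ce24fcc",
    datetime(2014, 10, 9, 20, 19, 15, tzinfo=timezone.utc),
)
```

Encoding the user and timestamp in the branch name keeps pending changes from different people from colliding, which is presumably why the name is built this way.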

Puppet Approval Change Management Portal showing my commit to production

Diff of my commit as seen in cgit

Do you want to merge this?

Puppet Change Approval Committee. Note the regex under the Authorization column.

RACF Puppet Catalog. The regex from the previous slide matches up with the directories (in blue).
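The idea of tying each approver to a regex over catalog directories can be sketched as follows. This is an illustration only: the approver names, the patterns, and the `linuxfarm/` directory are hypothetical (the real list is visible only in the screenshot), and the rule that every changed path must match the approver's pattern is an assumption about how such a check would likely work.

```python
import re

# Hypothetical approvers and patterns; the real committee list is not reproduced here.
APPROVERS = {
    "alice": r"^common/",
    "bob": r"^(common|linuxfarm)/",
}


def may_approve(approver, changed_paths, approvers=APPROVERS):
    """Return True if the approver's regex matches every changed path."""
    pattern = approvers.get(approver)
    if pattern is None or not changed_paths:
        return False
    # Assumption: a single unmatched path is enough to deny the approval.
    return all(re.search(pattern, path) for path in changed_paths)
```

For example, `may_approve("alice", ["common/cvesecurity/manifests/cve_2014_7169.pp"])` is true, while a change that also touches `linuxfarm/` would need an approver such as the hypothetical `bob`.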

Merge results

Approval scripts also record who approved what for auditing purposes

Summary of all log messages making up the branch merge in cases where there are many commits in that branch

Present: Reporting, Performance Metrics

Puppet Dashboard is deprecated. We now use The Foreman and PuppetDB for Puppet client run reporting.

The Foreman is a RHEL 6.5 VM on the RHEV cluster, and assigned 8 vCPUs and 8GB RAM.

A single Puppet Master server: Dell PowerEdge R610 with Dual CPU Intel Xeon X5660 2.80GHz, 96GB RAM, RAID10 900GB, RHEL 6.5

It runs just about everything: Apache, Perl Puppet CA, Perl Puppet Approval, git/cgit, PuppetDB (w/ PostgreSQL), GLPI (w/ MySQL)

Diagram: Puppet client agent check-in method & interval

Clients: Linux farm worker nodes; Farm Infrastructure, DB; Grid, Infrastructure, Storage, Cloud hosts (under 600 machines)

Check-in methods: Puppet daemon check-in once per hour; manual check_puppet.py run without Puppet daemon

Servers: The Foreman (VM on RHEV, PostgreSQL); Puppet Master (Dell PowerEdge R610, Dual CPU Intel Xeon X5660 2.80GHz, 96GB RAM, RAID10 900GB, RHEL 6.5) running Apache, Perl Puppet CA, Perl Puppet Approval, git/cgit, PuppetDB (w/ PostgreSQL), GLPI (w/ MySQL)

Present: Load test & Metrics

We tested our Puppet master with client agents checking in at about 1Hz. PuppetDB reports that over 2500 agents checked in during this test, which ran a little over an hour.

Load (top command output) as seen at about 12 minutes after starting a manual puppet agent run (14:45) on the Linux Farm worker nodes.

top - 14:57:23 up 3:55, 2 users, load average: 4.22, 4.22, 3.03
Tasks: 551 total, 4 running, 547 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.9%us, 2.1%sy, 0.0%ni, 76.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 99051792k total, 31167540k used, 67884252k free, 340332k buffers
Swap: 8388604k total, 0k used, 8388604k free, 4798864k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9178 puppet    20   0  302m 195m 3404 S 96.5  0.2  4:19.72 ruby
 7361 puppet    20   0  291m 183m 3404 S 86.6  0.2  4:24.94 ruby
 8992 puppet    20   0  362m 254m 3404 R 74.6  0.3 13:43.31 ruby
 6993 puppetdb  20   0 14.6g 1.3g  12m S 71.3  1.3 21:26.15 java
  434 puppet    20   0  352m 244m 3404 R 54.7  0.3  8:32.26 ruby
 9070 puppet    20   0  276m 169m 3404 S 43.8  0.2  4:34.72 ruby
 9077 puppet    20   0  306m 199m 3400 S 33.2  0.2  5:03.86 ruby
 7808 postgres  20   0 24.4g 575m 569m S 26.5  0.6  0:14.28 postmaster
 8571 postgres  20   0 24.4g 670m 662m S  9.6  0.7  0:14.15 postmaster

Linux farm manual run started approx 14:45.

Over a bit more than an hour, we saw between 40% and 60% CPU usage with about 2500 agents checking in.

PuppetDB metrics during the load test

Present: Puppet Code Statistics

Project name: RACF Puppet Catalog

Generated: 2014-10-10 05:05:07 (in 6 seconds)

Generator: GitStats (version ), git version 1.8.5.3

Report Period: 2011-01-04 13:54:30 to 2014-10-09 16:59:44

Age: 1375 days, 877 active days (63.78%)

Total Files: 2620

Total Lines of Code: 327595 (1209686 added, 882091 removed)

Average file size: 9068.73 bytes

Total Commits: 13396 (average 15.3 commits per active day, 9.7 per all days)

Authors: 31 (average 432.1 commits per author)

146 modules, 431 different .pp files, totaling nearly 34k lines of Puppet code.
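The derived figures in the GitStats report follow directly from the raw counts. As a quick sanity check, the short sketch below recomputes them; the dictionary keys and the `derived` helper are names chosen here for illustration, while the numbers are taken from the report above.

```python
# Raw counts as reported by GitStats above.
stats = {
    "age_days": 1375,
    "active_days": 877,
    "commits": 13396,
    "authors": 31,
    "lines_added": 1209686,
    "lines_removed": 882091,
}


def derived(s):
    """Recompute the percentages and averages quoted in the report."""
    return {
        "active_pct": round(100 * s["active_days"] / s["age_days"], 2),
        "commits_per_active_day": round(s["commits"] / s["active_days"], 1),
        "commits_per_day": round(s["commits"] / s["age_days"], 1),
        "commits_per_author": round(s["commits"] / s["authors"], 1),
        "net_lines": s["lines_added"] - s["lines_removed"],
    }


summary = derived(stats)
```

The results agree with the report: 63.78% active days, 15.3 commits per active day, 9.7 per calendar day, 432.1 per author, and 327595 net lines.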

Present: Puppet code style guide

In Aug & Sept of 2014, we used puppet-lint and Geppetto to update all our Puppet code to comply with Puppet Labs' suggested style guide.

Style guide is now enforced on push to the Git server via puppet-lint

puppet-lint --with-context --no-80chars-check init.pp
WARNING: class not documented on line 1
ERROR: trailing whitespace found on line 66
  }
  ^
WARNING: indentation of => is not properly aligned on line 81
  source => $allow_from,
  ^
WARNING: unquoted file mode on line 53
  mode => 644,
  ^
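One way a push-time check could consume output like the above is to parse the `WARNING:` and `ERROR:` lines and decide whether to accept the push. The sketch below assumes the line format shown in the slide and is not the actual RACF hook; the helper names and the `fail_on_warnings` switch are hypothetical, and the slides do not say whether warnings alone block a push.

```python
import re

# Matches lines such as "ERROR: trailing whitespace found on line 66".
LINT_LINE = re.compile(r"^(WARNING|ERROR): (.+) on line (\d+)$")


def parse_lint_output(text):
    """Return (kind, message, line_number) tuples; context lines are skipped."""
    problems = []
    for line in text.splitlines():
        match = LINT_LINE.match(line.strip())
        if match:
            kind, message, line_no = match.groups()
            problems.append((kind, message, int(line_no)))
    return problems


def push_allowed(problems, fail_on_warnings=False):
    """Reject on any ERROR; optionally (an assumption) on any WARNING too."""
    blocking = {"ERROR", "WARNING"} if fail_on_warnings else {"ERROR"}
    return not any(kind in blocking for kind, _, _ in problems)


sample = (
    "WARNING: class not documented on line 1\n"
    "ERROR: trailing whitespace found on line 66\n"
    "  }\n"
    "  ^\n"
    "WARNING: unquoted file mode on line 53\n"
    "  mode => 644,\n"
    "  ^\n"
)
problems = parse_lint_output(sample)
```

With the sample above, `problems` holds three entries and `push_allowed(problems)` is false because of the trailing-whitespace error.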

Present: Culture and CM Adoption

Discovered that changing servers is easy. Changing people's workflow and ultimately their minds is much harder.

At first, reception was mixed, with some resistance. Most people within our group adopted this workflow & toolset; others did not adopt it at all.

Image: http://en.wikipedia.org/wiki/File:DiffusionOfInnovation.png - Attribution 2.5 Generic (CC BY 2.5)

Present: Puppet Use / Culture Management

Had some accidental & potentially dangerous commits to production branch.

For the Change Management Approval gateway, implemented a self-approve delay, mandatory scroll & approval authorizations

Never manually change production servers or commit untested code to production. Use test environments & servers.

We host a tree of the Puppet code base (“common”) that is now shared and used upstream by the Physics and IT departments.
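The self-approve delay mentioned above can be pictured as a simple time check: someone approving their own change must wait a cooling-off period, while a different approver may act immediately. The sketch below is only an illustration of that rule. The 30-minute value, the function name `can_approve`, and its parameters are hypothetical (the slides do not state the delay), and it assumes the approver has already been authorized for the changed files.

```python
from datetime import datetime, timedelta

# Hypothetical delay; the actual value is not given in the slides.
SELF_APPROVE_DELAY = timedelta(minutes=30)


def can_approve(author, approver, pushed_at, now, delay=SELF_APPROVE_DELAY):
    """Allow other approvers at once; make authors wait before self-approving."""
    if approver != author:
        return True
    return now - pushed_at >= delay


pushed = datetime(2014, 10, 9, 20, 19, 15)
too_soon = can_approve("pryor", "pryor", pushed, pushed + timedelta(minutes=5))
after_wait = can_approve("pryor", "pryor", pushed, pushed + timedelta(minutes=30))
```

The delay gives the author time to re-read the diff, which is the kind of safeguard that helps against accidental commits to production.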

If your infrastructure is now done in code, do you test your code? Or do you just push it to production and see if it works?

Dos Equis Man “The Most Interesting Man in the World” is a character & property of Cervecería Cuauhtémoc-Moctezuma. Used without permission on assumption of Fair Use / Parody

He is not someone to admire. Don't be like him.

You must test your code.

Future Plans and Desires

We look to adopt software development practices for our Puppet code: smoke testing, unit testing, acceptance testing

An automatic testing (Continuous Integration) system, built on the Jenkins CI tool, to manage our Puppet testing process. It runs Puppet on a pool of RHEV VMs, and all pending changes to production must pass this validation process before they can be approved and merged into production. See the talk on this topic at a future HEPiX and/or CHEP 2015.
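The gating rule described here (a pending change becomes eligible for approval only after every test stage passes) can be sketched independently of Jenkins. The stage names come from the list above; the `run_pipeline` helper, the stage order, and the idea of passing each stage in as a callable are assumptions made for illustration.

```python
# Stage order is an assumption: cheapest checks first.
STAGES = ("smoke", "unit", "acceptance")


def run_pipeline(checks, stages=STAGES):
    """Run each stage in order and stop at the first failure.

    checks maps a stage name to a zero-argument callable returning True/False.
    Returns (passed, results), where results holds only the stages that ran.
    """
    results = {}
    for stage in stages:
        results[stage] = bool(checks[stage]())
        if not results[stage]:
            return False, results
    return True, results


# Hypothetical checks: the unit stage fails, so acceptance never runs.
outcome = run_pipeline({
    "smoke": lambda: True,
    "unit": lambda: False,
    "acceptance": lambda: True,
})
```

Only a change whose pipeline returns `passed == True` would then be offered to the approval committee.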

Use The Foreman beyond just reporting: as both ENC & as OS provisioner.

Future Plans and Desires

Work toward a community shared Puppet code base beyond RACF/Physics/Lab ITD. This is desirable but is at least 1 – 2 years away from being realized.

Puppet with Hiera: a hierarchical data store that keeps site-specific data out of your manifests. It helps avoid repetition (duplicating similar blocks of modular code) and makes public Puppet modules easier to use: you don’t need to edit their code, just put the necessary data in Hiera.

Requires a rewrite/refactor of our code. This is a non-trivial project.
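To illustrate why a hierarchical data store removes the need to edit module code, here is a minimal sketch of a Hiera-style lookup: levels are searched from most specific to least specific and the first match wins. It is written in Python rather than as real Hiera configuration; the hierarchy, keys, host names, and values are all hypothetical, and Python format strings stand in for Hiera's own interpolation syntax.

```python
# Hypothetical hierarchy, most specific level first.
HIERARCHY = ["nodes/{hostname}", "groups/{group}", "common"]

# Hypothetical data sources keyed by hierarchy level.
DATA = {
    "common": {"ntp::servers": ["ntp1.example.org"], "ssh::permit_root_login": False},
    "groups/linuxfarm": {"ntp::servers": ["ntp-farm.example.org"]},
    "nodes/wn0001": {"ssh::permit_root_login": True},
}


def lookup(key, facts, hierarchy=HIERARCHY, data=DATA, default=None):
    """Return the value from the first hierarchy level that defines key."""
    for level in hierarchy:
        source = level.format(**facts)
        values = data.get(source, {})
        if key in values:
            return values[key]
    return default


farm_node = {"hostname": "wn0001", "group": "linuxfarm"}
other_node = {"hostname": "web01", "group": "web"}
```

Here `lookup("ntp::servers", farm_node)` returns the farm-specific server while `other_node` falls through to the `common` value, so the same module code serves both without modification.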

Future Plans and Desires

MCollective: a framework for building server orchestration and parallel job execution on clusters of servers. It is not simply a fancy SSH "for loop"; it provides granularity and reporting.

Integrate Monitoring and Puppet: rewrite existing “nagios” puppet class to support use of exported resources. Both the target node to be monitored and the Nagios server would execute Puppet code in a sort of conversation:

"Hey Nagios server. I'm a node and have a new service. You need to monitor it." Then it would be monitored.
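The conversation above can be modelled with a small in-memory store: monitored nodes export a description of each service, and the Nagios server collects everything that has been exported. This is a toy model of the exported-resources pattern in Python, not Puppet code; the `ResourceStore` class, its methods, and the host and check names are hypothetical, and the dictionary stands in for the database that would hold exported resources.

```python
class ResourceStore:
    """Toy stand-in for the backend that holds exported resources."""

    def __init__(self):
        self._exported = {}

    def export(self, node, service, check_command):
        """Called on behalf of a monitored node: announce a service to monitor."""
        # Keyed by (node, service), so repeated runs do not create duplicates.
        self._exported[(node, service)] = {
            "host_name": node,
            "service_description": service,
            "check_command": check_command,
        }

    def collect(self):
        """Called on behalf of the Nagios server: gather all exported services."""
        return [self._exported[key] for key in sorted(self._exported)]

    def deactivate(self, node):
        """Drop everything a node exported, e.g. when the host is retired."""
        for key in [k for k in self._exported if k[0] == node]:
            del self._exported[key]


store = ResourceStore()
store.export("web01", "http", "check_http")
store.export("web01", "http", "check_http")  # second run, same result
store.export("db01", "postgres", "check_pgsql")
services = store.collect()
```

After these calls `services` holds two entries, one per node and service; calling `store.deactivate("web01")` removes that host's entry from the next collection.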

Credit / Questions

Dr. Jason A. Smith, Mizuki Karasawa, John S. De Stefano Jr.

William Strecker-Kellogg, Christopher Hollowell, James Pryor

Credit

"Bernard De Chartres used to compare us to [puny] dwarfs perched on the shoulders of giants. He pointed out that we see more and farther than our predecessors, not because we have keener vision or greater height, but because we are lifted up and borne aloft on their gigantic stature."

John of Salisbury - 1159

