Fabric, Cuisine and Watchdog for server administration in Python

Post on 08-Sep-2014

description

Presents Fabric, Cuisine and Watchdog, three Python tools that will help you set up, administer and monitor your servers.

transcript

ffunctioninc.

Fabric, Cuisine & Watchdog

Sébastien Pierre, ffunction inc. @ Montréal Python, February 2011

www.ffctn.com

How to use Python for Server Administration

Thanks to Fabric, Cuisine* & Watchdog*

* custom tools

The way we use servers has changed

The era of dedicated servers

WEB SERVER, DATABASE SERVER, EMAIL SERVER

Hosted in your server room or in colocation

Sysadmins typically SSH in and configure the servers live

The servers are conservatively managed, updates are risky

The era of slices/VPS

[Diagram: slices hosted at Linode.com and Amazon EC2, plus DEDICATED SERVER 1 and DEDICATED SERVER 2 at IWeb.com]

We now have multiple small virtual servers (slices/VPS)

Often located in different data centers

...and sometimes with different providers

We even sometimes still have physical, dedicated servers

The challenge

ORDER SERVER, SET UP SERVER, DEPLOY APPLICATION

MAKE THIS PROCESS AS FAST (AND SIMPLE) AS POSSIBLE

Create users and groups, customize config files, install base packages

Install app-specific packages, deploy application, start services

Quickly integrate your new server in the existing architecture

...and make sure it's running!

Today's menu

FABRIC: Interact with your remote machines as if they were local

CUISINE: Takes care of users, groups, packages and configuration of your new machine

WATCHDOG: Ensures that your servers and services are up and running

Part 1

Fabric - http://fabfile.org

application deployment & systems administration tasks

Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.

Wait... what does that mean?

Streamlining SSH

By hand:

    version = os.popen("ssh myserver 'cat /proc/version'").read()

Using Fabric:

    from fabric.api import *
    env.hosts = ["myserver"]
    version = run("cat /proc/version")

You can specify multiple hosts and run the same commands across them

Connections will be lazily created and pooled

Failures ($STATUS) will be handled just like in Make

Example: Installing packages

    sudo("aptitude install nginx")

    if run("dpkg -s %s | grep 'Status:' ; true" % package).find("installed") == -1:
        sudo("aptitude install '%s'" % (package))

It's easy to take action depending on the result

Note that we add true so that the run() always succeeds*

* there are other ways...
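One of the "other ways" could be to parse the `Status:` line strictly instead of searching for the substring "installed", which also appears in messages such as "is not installed". The sketch below is plain Python and not part of Fabric; `is_installed` and the sample strings are hypothetical, and in a fabfile the argument would be the text returned by run().

```python
def is_installed(dpkg_output):
    """Return True if `dpkg -s` output reports the package as installed.

    Only the 'Status:' line is inspected, and its last word must be
    exactly 'installed', so error messages do not count as a match.
    """
    for line in dpkg_output.splitlines():
        line = line.strip()
        if line.startswith("Status:"):
            words = line[len("Status:"):].split()
            return bool(words) and words[-1] == "installed"
    return False


# Illustrative samples modelled on typical dpkg messages (assumed, not captured)
installed = is_installed("Package: nginx\nStatus: install ok installed\n")
removed = is_installed("Status: deinstall ok config-files\n")
unknown = is_installed("dpkg-query: package 'nginx' is not installed\n")
print(installed, removed, unknown)
```

The design choice here is to keep the decision in a pure function, so it can be checked locally without touching a server.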

Example: retrieving system status

    disk_usage = run("df -kP")
    mem_usage = run("cat /proc/meminfo")
    cpu_usage = run("cat /proc/stat")

    print disk_usage, mem_usage, cpu_usage

Very useful for getting live information from many different servers
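The raw text returned by run("df -kP") usually needs parsing before it can be compared across servers. A minimal sketch, assuming the POSIX `df -P` column layout; `parse_df` and the sample text are hypothetical and only illustrate what could be done with the returned string.

```python
def parse_df(output):
    """Parse `df -kP` output into {mount_point: fraction_used}."""
    usage = {}
    for line in output.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) < 6:
            continue                          # ignore malformed lines
        capacity = fields[4]                  # e.g. "45%"
        mount = " ".join(fields[5:])          # mount points may contain spaces
        if capacity.endswith("%"):
            usage[mount] = int(capacity[:-1]) / 100.0
    return usage


# Illustrative sample in POSIX output format (assumed values)
sample = """Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 20511356 8614532 10831916 45% /
tmpfs 1024000 0 1024000 0% /dev/shm
"""
usage = parse_df(sample)
print(usage)
```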

Fabfile.py

    from fabric.api import *
    from mysetup import *

    env.hosts = ["server1.myapp.com"]

    def setup():
        install_packages("...")
        update_configuration()
        create_users()
        start_daemons()

    $ fab setup

Just like Make, you write rules that do something

...and you can specify on which servers the rules will run

Multiple hosts

    env.hosts = [
        "db1.myapp.com",
        "db2.myapp.com",
        "db3.myapp.com"
    ]

    @hosts("db1.myapp")
    def backup_db():
        run(...)

Roles

    env.roledefs = {
        'web': ['www1', 'www2', 'www3'],
        'dns': ['ns1', 'ns2']
    }

    $ fab -R web setup

Will run the setup rule only on hosts that are members of the web role.
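Conceptually, selecting a role amounts to expanding role names into a host list. The following is a sketch of that idea only, not Fabric's implementation; `hosts_for_roles` is a hypothetical helper.

```python
def hosts_for_roles(roledefs, roles):
    """Expand role names into an ordered host list without duplicates."""
    hosts = []
    for role in roles:
        for host in roledefs[role]:
            if host not in hosts:
                hosts.append(host)
    return hosts


roledefs = {
    'web': ['www1', 'www2', 'www3'],
    'dns': ['ns1', 'ns2'],
}
web_hosts = hosts_for_roles(roledefs, ['web'])
print(web_hosts)
```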

What's good about Fabric?

Low-level: Basically an ssh() command that returns the result

Simple primitives: run(), sudo(), get(), put(), local(), prompt(), reboot()

No magic: No DSL, no abstraction, just a remote command API

What could be improved?

Ease common admin tasks: User, group creation. Files, directory operations.

Abstract primitives: Like install package, so that it works with different OSes

Templates: To make creating/updating configuration files easy

Cuisine: Chef-like functionality for Fabric

Part 2

Cuisine

What is Opscode's Chef?

Recipes: Scripts/packages to install and configure services and applications

API: A DSL-like Ruby API to interact with the OS (create users, groups, install packages, etc.)

Architecture: Client-server or "solo" mode to push and deploy your new configurations

http://wiki.opscode.com/display/chef/Home

What I liked about Chef

Flexible: You can use the API or shell commands

Structured: Helped me have a clear decomposition of the services installed per machine

Community: Lots of recipes already available from http://cookbooks.opscode.com/

What I didn't like

Too many files and directories: Code is spread out, hard to get the big picture

Abstraction overload: API not very well documented, frequent fallbacks to plain shell scripts within the recipe

No "smart" recipe: Recipes are applied all the time, even when it's not necessary

The question that kept coming...

Django recipe: 5 files, 2 directories

What it does, in essence:

    sudo aptitude install apache2 python python-django

Is this really necessary for what I want to do?

What I loved about Fabric

Bare metal: ssh() function, simple and elegant set of primitives

No magic: No abstraction, no model, no compilation

Two-way communication: Easy to change the rule's behaviour according to the output (e.g. do not install something that's already installed)

What I needed

Fabric + File I/O + User/Group Management + Package Management + Text processing & Templates

How I wanted it

Simple "flat" API: [object]_[operation] where operation is one of "create", "read", "update", "write", "remove", "ensure", etc.

Driven by need: Only implement a feature if I have a real need for it

No magic: Everything is implemented using sh-compatible commands

No unnecessary structure: Everything fits in one file, no imposed file layout

Cuisine: Example fabfile.py

    from cuisine import *

    env.hosts = ["server1.myapp.com"]

    def setup():
        package_ensure("python", "apache2", "python-django")
        user_ensure("admin", uid=2000)
        upstart_ensure("django")

    $ fab setup

Fabric's core functions are already imported by "from cuisine import *"

package_ensure, user_ensure and upstart_ensure are Cuisine's API calls

File I/O

Cuisine: File I/O

● file_exists    does the remote file exist?
● file_read      reads remote file
● file_write     writes data to remote file
● file_append    appends data to remote file
● file_attribs   chmod & chown
● file_remove

Supports owner/group and mode change

Cuisine: File I/O (directories)

● dir_exists     does the remote directory exist?
● dir_ensure     ensures that a directory exists
● dir_attribs    chmod & chown
● dir_remove

Cuisine: File I/O +

● file_update(location, updater=lambda _:_)

    package_ensure("mongodb-snapshot")

    def update_configuration(text):
        res = []
        for line in text.split("\n"):
            if line.strip().startswith("dbpath="):
                res.append("dbpath=/data/mongodb")
            elif line.strip().startswith("logpath="):
                res.append("logpath=/data/logs/mongodb.log")
            else:
                res.append(line)
        return "\n".join(res)

    file_update("/etc/mongodb.conf", update_configuration)

This replaces the values for configuration entries dbpath and logpath

The remote file will only be changed if the content is different
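The "only changed if the content is different" behaviour can be illustrated on a local file. This is a sketch under assumptions, not Cuisine's code: `file_update_local` and `set_dbpath` are hypothetical names, and the real file_update works on a remote file through Fabric.

```python
import os
import tempfile


def file_update_local(path, updater=lambda text: text):
    """Rewrite `path` with updater(content); return True if the file changed."""
    with open(path) as f:
        old = f.read()
    new = updater(old)
    if new == old:
        return False          # nothing to do: leave the file untouched
    with open(path, "w") as f:
        f.write(new)
    return True


def set_dbpath(text):
    """Force the dbpath entry to /data/mongodb, keep other lines as they are."""
    return "\n".join(
        "dbpath=/data/mongodb" if line.strip().startswith("dbpath=") else line
        for line in text.split("\n"))


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("dbpath=/var/lib/mongodb\nport=27017\n")

first = file_update_local(path, set_dbpath)    # content differs: file rewritten
second = file_update_local(path, set_dbpath)   # already up to date: skipped
with open(path) as f:
    content = f.read()
os.remove(path)
print(first, second)
```

Running the same update twice is safe, which is the property that makes such rules cheap to re-apply.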

User Management

Cuisine: User Management

● user_exists    does the user exist?
● user_create    create the user
● user_ensure    create the user if it doesn't exist

Cuisine: Group Management

● group_exists         does the group exist?
● group_create         create the group
● group_ensure         create the group if it doesn't exist
● group_user_exists    does the user belong to the group?
● group_user_add       adds the user to the group
● group_user_ensure

Package Management

Cuisine: Package Management

● package_exists       is the package available?
● package_installed    is it installed?
● package_install      install the package
● package_ensure       ... only if it's not installed
● package_upgrade      upgrades the/all package(s)

Text & Templates

Cuisine: Text transformation

text_ensure_line(text, lines)

    file_update("/home/user/.profile", lambda _: text_ensure_line(_,
        "PYTHONPATH=/opt/lib/python:${PYTHONPATH};"
        "export PYTHONPATH"
    ))

Ensures that the PYTHONPATH variable is set and exported. If not, these lines will be appended.
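The "append only if missing" idea can be sketched as a pure text function. This is an illustration under assumptions, not Cuisine's implementation; `ensure_lines` is a hypothetical name and the whitespace-insensitive comparison is a choice made for this sketch.

```python
def ensure_lines(text, *lines):
    """Append each of `lines` to `text` unless an equivalent line is present.

    Lines are compared after stripping surrounding whitespace.
    """
    present = set(line.strip() for line in text.splitlines())
    missing = []
    for line in lines:
        if line.strip() not in present:
            missing.append(line)
            present.add(line.strip())
    if not missing:
        return text                      # already satisfied: text unchanged
    if text and not text.endswith("\n"):
        text += "\n"
    return text + "\n".join(missing) + "\n"


profile = "set -o vi\n"
once = ensure_lines(profile, "export PYTHONPATH")
twice = ensure_lines(once, "export PYTHONPATH")   # second call is a no-op
print(repr(once))
```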

Cuisine: Text transformation

text_replace_line(text, old, new, find=.., process=...)

    configuration = local_read("server.conf")
    for key, value in variables.items():
        configuration, replaced = text_replace_line(
            configuration,
            key + "=",
            key + "=" + repr(value),
            process=lambda text: text.split("=")[0].strip()
        )

Replaces lines that look like VARIABLE=VALUE with the actual values from the variables dictionary.

The process lambda transforms input lines before comparing them. Here the lines are stripped of spaces and of their value.
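How a process function makes `key=` match `key = old value` can be shown with a small stand-in. This sketch assumes that both the candidate line and the `old` argument go through `process` before being compared; `replace_line` and `key_only` are hypothetical names, not Cuisine's API.

```python
def replace_line(text, old, new, process=lambda line: line):
    """Replace every line whose processed form equals process(old) with `new`.

    Returns a (new_text, number_of_replacements) pair.
    """
    target = process(old)
    out, count = [], 0
    for line in text.split("\n"):
        if process(line) == target:
            out.append(new)
            count += 1
        else:
            out.append(line)
    return "\n".join(out), count


def key_only(line):
    """Keep only the part before '=', without surrounding spaces."""
    return line.split("=")[0].strip()


conf = "host = localhost\nport = 80"
conf, replaced = replace_line(conf, "port=", "port=8080", process=key_only)
print(conf, replaced)
```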

Cuisine: Text transformation

text_strip_margin(text)

    file_write(".profile", text_strip_margin("""
        |export PATH="$HOME/bin":$PATH
        |set -o vi
    """))

Everything after the | separator will be output as content. This makes it easy to embed text templates within functions.
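A margin-stripping helper takes only a few lines. The version below is a sketch of the idea and not Cuisine's code; `strip_margin` is a hypothetical name, and dropping lines that have no margin character is an assumption of this sketch.

```python
def strip_margin(text, margin="|"):
    """Keep, for each line containing `margin`, only what follows the first one."""
    out = []
    for line in text.split("\n"):
        _, sep, rest = line.partition(margin)
        if sep:                      # lines without a margin are dropped
            out.append(rest)
    return "\n".join(out)


profile = strip_margin("""
    |export PATH="$HOME/bin":$PATH
    |set -o vi
""")
print(profile)
```

The indentation before the margin character belongs to the Python source, not to the generated file, which is what lets the template sit inside an indented function body.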

Cuisine: Text transformation

text_template(text, variables)

    text_template(text_strip_margin("""
        |cd ${DAEMON_PATH}
        |exec ${DAEMON_EXEC_PATH}
    """), dict(
        DAEMON_PATH="/opt/mongodb",
        DAEMON_EXEC_PATH="/opt/mongodb/mongod"
    ))

This is a simple wrapper around Python's (safe) string.Template substitution
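For reference, this is how the standard library's string.Template behaves on its own. The `${EXTRA_ARGS}` placeholder is added here purely to show the difference between the safe and the strict substitution; it is not in the original slides.

```python
import string

template = string.Template(
    "cd ${DAEMON_PATH}\nexec ${DAEMON_EXEC_PATH} ${EXTRA_ARGS}")
variables = dict(
    DAEMON_PATH="/opt/mongodb",
    DAEMON_EXEC_PATH="/opt/mongodb/mongod",
)

# safe_substitute leaves unknown placeholders untouched
rendered = template.safe_substitute(variables)

# substitute raises KeyError when a placeholder has no value
try:
    template.substitute(variables)
    strict_failed = False
except KeyError:
    strict_failed = True

print(rendered)
```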

Cuisine: Goodies

● ssh_keygen        generates DSA keys
● ssh_authorize     authorizes your key on the remote server
● mode_sudo         run() always uses sudo
● upstart_ensure    ensures the given daemon is running

& more!

Why use Cuisine?

● Simple API for remote-server manipulation: Files, users, groups, packages

● Shell commands for specific tasks only: Avoid problems with your shell commands by only using run() for very specific tasks

● Cuisine tasks are not stupid: *_ensure() commands won't do anything if it's not necessary

Limitations

● Limited to sh-shells: Operations will not work under csh

● Only written/tested for Ubuntu Linux: Contributors could easily port commands

Get started!

On GitHub: http://github.com/sebastien/cuisine

1 short Python file, documented API

Part 3

Watchdog

Server and services monitoring

The problem

Low disk space: archive files, rotate logs, purge cache

HTTP server has high latency: restart HTTP server

System load is too high: re-nice important processes

We want to be notified when incidents happen

We want automatic actions to be taken whenever possible

(Some of the) existing solutions

Monit, God, Supervisord, Upstart: Focus on starting/restarting daemons and services

Munin, Cacti: Focus on visualization of RRDTool data

Collectd: Focus on collecting and publishing data

The ideal tool

Wide spectrum: Data collection, service monitoring, actions

Easy setup and deployment: No complex installation or configuration

Flexible server architecture: Can monitor local or remote processes

Customizable and extensible: From restarting daemons to monitoring whole servers

Hello, Watchdog!

SERVICE: A service is a collection of RULES

RULE: HTTP request; CPU, disk, mem %; process status; I/O bandwidth

Each rule retrieves data and processes it. Rules can SUCCEED or FAIL

ACTION: Logging; XMPP, email notifications; start/stop process...

Actions are bound to a rule, triggered on rule SUCCESS or FAILURE

Execution Model

[Diagram: MONITOR runs each RULE (frequency in ms) of a SERVICE DEFINITION, then its ACTIONs on SUCCESS or FAILURE]

Services are registered in the monitor

Rules defined in the service are executed every N ms (frequency)

If the rule SUCCEEDS, success actions will be sequentially executed

If the rule FAILS, failure actions will be sequentially executed
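The execution model can be sketched in a few lines of plain Python. This is a toy illustration under assumptions, not Watchdog's implementation: `Rule`, `tick` and the use of an exception to signal failure are hypothetical, and time is passed in explicitly so the example runs instantly.

```python
class Rule(object):
    """A check run every `freq_ms`, with actions for success and failure."""

    def __init__(self, check, freq_ms, success=(), fail=()):
        self.check = check
        self.freq_ms = freq_ms
        self.success = list(success)
        self.fail = list(fail)
        self.last_run = None

    def is_due(self, now_ms):
        return self.last_run is None or now_ms - self.last_run >= self.freq_ms

    def run(self, now_ms):
        self.last_run = now_ms
        try:
            result, ok = self.check(), True
        except Exception as error:       # a raised exception means FAILURE
            result, ok = error, False
        for action in (self.success if ok else self.fail):
            action(result)               # actions run sequentially
        return ok


def tick(rules, now_ms):
    """Run every rule that is due at `now_ms`; return their outcomes."""
    return [rule.run(now_ms) for rule in rules if rule.is_due(now_ms)]


def failing_check():
    raise IOError("timed out")


log = []
rules = [
    Rule(lambda: 42, freq_ms=1000,
         success=[lambda result: log.append(("ok", result))]),
    Rule(failing_check, freq_ms=2000,
         fail=[lambda error: log.append(("fail", str(error)))]),
]
for now in (0, 1000, 2000):
    tick(rules, now)
print(log)
```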

Monitoring a remote machine

    #!/usr/bin/env python
    from watchdog import *
    Monitor(
        Service(
            name = "google-search-latency",
            monitor = (
                HTTP(
                    GET="http://www.google.ca/search?q=watchdog",
                    freq=Time.s(1),
                    timeout=Time.ms(80),
                    fail=[
                        Print("Google search query took more than 50ms")
                    ]
                )
            )
        )
    ).run()

A monitor is like the "main" for Watchdog. It actively monitors services.

Don't forget to call run() on it

The service monitors the rules

The HTTP rule lets you test a URL

And we display a message in case of failure

If there is a 4XX or it times out, the rule will fail and display an error message

    $ python example-service-monitoring.py

    2011-02-27T22:33:18 watchdog --- #0 (runners=1,threads=2,duration=0.57s)
    2011-02-27T22:33:18 watchdog [!] Failure on HTTP(GET="www.google.ca:80/search?q=watchdog",timeout=0.08) : Socket error: timed out
    Google search query took more than 50ms
    2011-02-27T22:33:19 watchdog --- #1 (runners=1,threads=2,duration=0.73s)
    2011-02-27T22:33:20 watchdog --- #2 (runners=1,threads=2,duration=0.54s)
    2011-02-27T22:33:21 watchdog --- #3 (runners=1,threads=2,duration=0.69s)
    2011-02-27T22:33:22 watchdog --- #4 (runners=1,threads=2,duration=0.77s)
    2011-02-27T22:33:23 watchdog --- #5 (runners=1,threads=2,duration=0.70s)

Sending Email Notification

    send_email = Email(
        "notifications@ffctn.com",
        "[Watchdog]Google Search Latency Error", "Latency was over 80ms",
        "smtp.gmail.com", "myusername", "mypassword"
    )

    […]
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            send_email
        ]
    )

The Email action will send an email to notifications@ffctn.com when triggered

This is how we bind the action to the rule failure

Sending Email+Jabber Notification

    send_xmpp = XMPP(
        "notifications@jabber.org",
        "Watchdog: Google search latency over 80ms",
        "myuser@jabber.org", "mypassword"
    )

    […]
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            send_email, send_xmpp
        ]
    )

Monitoring incident: when something fails repeatedly during a given period of time

You don't want to be notified all the time, only when it really matters.

Detecting incidents

    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            Incident(
                errors = 5,
                during = Time.s(10),
                actions = [send_email, send_xmpp]
            )
        ]
    )

An incident is a "smart" action: it will only do something when the condition is met

When at least 5 errors happen over a 10-second period, the Incident action will trigger the given actions
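The "N errors within a period" condition is essentially a sliding window over failure timestamps. A minimal sketch, assuming nothing about Watchdog's internals: `IncidentSketch` is a hypothetical class, timestamps are passed in explicitly, and clearing the window after firing is a design choice of this example.

```python
from collections import deque


class IncidentSketch(object):
    """Fire `actions` once `errors` failures occur within `during_s` seconds."""

    def __init__(self, errors, during_s, actions):
        self.errors = errors
        self.during_s = during_s
        self.actions = list(actions)
        self.timestamps = deque()

    def failure(self, now_s):
        """Record one failure at `now_s`; return True if the incident fired."""
        self.timestamps.append(now_s)
        # forget failures that fell out of the observation window
        while now_s - self.timestamps[0] > self.during_s:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.errors:
            for action in self.actions:
                action()
            self.timestamps.clear()      # start counting afresh after firing
            return True
        return False


notifications = []
incident = IncidentSketch(errors=3, during_s=10,
                          actions=[lambda: notifications.append("email")])
fired = [incident.failure(t) for t in (0, 20, 21, 22)]
print(fired, notifications)
```

An isolated failure at t=0 is forgotten by t=20, so only the burst at 20, 21 and 22 triggers a notification.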

Example: Ensuring a service is running

    from watchdog import *
    Monitor(
        Service(
            name="myservice-ensure-up",
            monitor=(
                HTTP(
                    GET="http://localhost:8000/",
                    freq=Time.ms(500),
                    fail=[
                        Incident(
                            errors=5,
                            during=Time.s(5),
                            actions=[
                                Restart("myservice-start.py")
                            ]
                        )
                    ]
                )
            )
        )
    ).run()

We test if we can GET http://localhost:8000 within 500ms

If we can't reach it for 5 seconds

We kill and restart myservice-start.py

Example: Monitoring system health

from watchdog import *Monitor (

Service(name = "system-health",monitor = (

SystemInfo(freq=Time.s(1),success = (

LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]),LogResult("myserver.system.disk", extract=lambda

r,_:reduce(max,r["diskUsage"].values())),LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]),

)),Delta(

Bandwidth("eth0", freq=Time.s(1)),extract = lambda v:v["total"]["bytes"]/1000.0/1000.0,success = [LogResult("myserver.system.eth0.sent")]

),SystemHealth(

cpu=0.90, disk=0.90, mem=0.90,freq=Time.s(60),fail=[Log(path="watchdog-system-failures.log")]

),)

)).run()

SystemInfo will retrieve system information and return it as a dictionary

We log each result by extracting the given value from the result dictionary (memoryUsage, diskUsage, cpuUsage)
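To make the extract functions concrete, here they are applied to a made-up result dictionary. Only the keys memoryUsage, diskUsage and cpuUsage come from the slides. The sample values, the mount-point keys inside diskUsage and the None passed as second argument are assumptions for illustration. Note that reduce is a built-in in Python 2 but has to be imported from functools in Python 3.

```python
from functools import reduce  # built-in in Python 2, in functools in Python 3

# Hypothetical sample of what a system-information probe might return
info = {
    "memoryUsage": 0.42,
    "diskUsage": {"/": 0.61, "/home": 0.87},  # assumed: one entry per mount point
    "cpuUsage": 0.15,
}

# Same extract lambdas as on the slide; the second argument is unused there,
# so None is passed here as a placeholder
extractors = {
    "myserver.system.mem": lambda r, _: r["memoryUsage"],
    "myserver.system.disk": lambda r, _: reduce(max, r["diskUsage"].values()),
    "myserver.system.cpu": lambda r, _: r["cpuUsage"],
}

lines = [
    "%s %s" % (name, extract(info, None))
    for name, extract in sorted(extractors.items())
]
for line in lines:
    print(line)
```

The disk extractor keeps only the fullest disk, which is usually the one worth alerting on.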

Bandwidth collects network interface live traffic information

But we don't want the total amount, we just want the difference. Delta does just that.
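The idea behind Delta can be sketched in a few lines: remember the previous cumulative value and report only the change. DeltaSketch and its update method are hypothetical names, not Watchdog's API. The extract function and the shape of the sample dictionary are taken from the slide.

```python
class DeltaSketch:
    """Hypothetical helper: turns a cumulative counter into differences."""

    def __init__(self, extract=lambda v: v):
        self.extract = extract
        self.previous = None

    def update(self, sample):
        value = self.extract(sample)
        # No difference can be computed from the very first sample
        delta = None if self.previous is None else value - self.previous
        self.previous = value
        return delta


# Same conversion as on the slide: total bytes expressed in megabytes
to_megabytes = lambda v: v["total"]["bytes"] / 1000.0 / 1000.0

delta = DeltaSketch(extract=to_megabytes)
# Made-up cumulative byte counters, one sample per second
samples = [{"total": {"bytes": b}} for b in (1000000, 3000000, 3500000)]
results = [delta.update(s) for s in samples]
```

Sampled once per second, each difference reads directly as megabytes per second.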

We print the result as before

SystemHealth will fail whenever the usage is above the given thresholds
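A threshold check of this kind boils down to comparing each usage ratio with its limit. The function below is a hypothetical stand-in, not Watchdog's SystemHealth. The 0.90 limits are the ones from the slide, and the usage figures are made up.

```python
def exceeded_thresholds(usage, **thresholds):
    """Hypothetical helper: names of the metrics whose usage is above its limit."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if usage.get(name, 0.0) > limit
    )


# Made-up usage ratios between 0.0 and 1.0
usage = {"cpu": 0.35, "disk": 0.93, "mem": 0.91}
failing = exceeded_thresholds(usage, cpu=0.90, disk=0.90, mem=0.90)

if failing:
    # Stand-in for the fail=[...] actions of the slide
    print("system health check failed: " + ", ".join(failing))
```

Here disk and mem are above 90%, so the check fails and the failure actions would run.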

We'll log failures in a log file

Watchdog: Overview

Monitoring DSL: Declarative programming to define monitoring strategy

Wide spectrum: From data collection to incident detection

Flexible: Does not impose a specific architecture

Watchdog: Use cases

Ensure service availability: Test and stop/restart when problems occur

Collect system statistics: Log or send data through the network

Alert on system or service health: Take action when system stats are above a threshold

Get started!

On GitHub: http://github.com/sebastien/watchdog

1 Python file

Documented API

Thank you!

www.ffctn.com

sebastien@ffctn.com

github.com/sebastien