Date post: | 05-Dec-2014 |
Category: | Technology |
Upload: | sandinmyjoints |
Towards 100% Uptime with Node.js
9M uniques / month.
75K+ users, some are paid subscribers.
( We | you | users ) hate downtime.
Important, but out of scope:
Redundant infrastructure.
Backups.
Disaster recovery.
In scope:
Application errors.
Deploys.
Node.js stuff:
Domains.
Cluster.
Express.
Keys to 100% uptime.
1. Sensibly handle uncaught exceptions.
2. Use domains to catch and contain errors.
3. Manage processes with cluster.
4. Gracefully terminate connections.
1. Sensibly handle uncaught exceptions.
Uncaught exceptions happen when:
An exception is thrown but not caught.
An error event is emitted but nothing is listening for it.
From node/lib/events.js:
EventEmitter.prototype.emit = function(type) {
  // If there is no 'error' event listener then throw.
  if (type === 'error') {
    ...
    } else if (er instanceof Error) {
      throw er; // Unhandled 'error' event
    } else {
      ...
An uncaught exception crashes the process.
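The second case is easy to reproduce with a bare EventEmitter. A minimal sketch; the `tryEmit` helper is hypothetical and exists only to keep the demo from crashing:

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: emits 'error' and reports whether emit() threw.
// Without the try/catch, the throw would be an uncaught exception.
function tryEmit(emitter) {
  try {
    emitter.emit('error', new Error('uh-oh'));
    return 'handled';
  } catch (ex) {
    return 'thrown: ' + ex.message;
  }
}

var silent = new EventEmitter();        // nobody listens for 'error'
var guarded = new EventEmitter();
guarded.on('error', function(err) {});  // a listener absorbs the error

var silentResult = tryEmit(silent);
var guardedResult = tryEmit(guarded);
console.log(silentResult);
console.log(guardedResult);
```

In a real server there is usually no try/catch around the emit, so the first case takes the process down.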
If the process is a server:
x 100s??
It starts with...
Domains.
2. Use domains to catch and contain errors.
try/catch doesn't do async.
try {
  var f = function() {
    throw new Error("uh-oh");
  };
  setTimeout(f, 100);
} catch (ex) {
  console.log("try / catch won't catch", ex);
}
Domains are a bit like try/catch for async.
var d = require('domain').create();
d.on('error', function (err) {
  console.log("domain caught", err);
});
var f = d.bind(function() {
  throw new Error("uh-oh");
});
setTimeout(f, 100);
The active domain is domain.active.
var domain = require('domain');
var d = domain.create();
console.log(domain.active); // <-- null
var f = d.bind(function() {
  console.log(domain.active === d); // <-- true
  console.log(process.domain === domain.active); // <-- true
  throw new Error("uh-oh");
});
New EventEmitters bind to the active domain.
EventEmitter.prototype.emit = function(type) {
  if (type === 'error') {
    if (this.domain) { // This is important!
      ...
      this.domain.emit('error', er);
    } else if ...
Log the error.
Helpful additional fields:
error.domain
error.domainEmitter
error.domainBound
error.domainThrown
Then it's up to you.
Ignore.
Retry.
Abort (e.g., return 500).
Throw (becomes an unknown error).
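To see those fields in practice, here is a small sketch in which an emitter created inside a domain emits an error that nothing else handles. The `report` object is a hypothetical stand-in for a real log entry:

```javascript
var domain = require('domain');
var EventEmitter = require('events').EventEmitter;

var d = domain.create();
var emitter;
var report = null;

d.on('error', function(err) {
  // Collect the extra fields the domain attached to the error.
  report = {
    message: err.message,
    sameDomain: err.domain === d,
    fromEmitter: err.domainEmitter === emitter,
    thrown: err.domainThrown
  };
  console.log('domain caught', report);
  // From here: ignore, retry, abort, or rethrow.
});

d.run(function() {
  // Created while d is active, so the emitter is bound to d.
  emitter = new EventEmitter();
});

// No 'error' listener on the emitter, so the error is routed to d.
emitter.emit('error', new Error('uh-oh'));
```

`domainThrown` is false here because the error was emitted rather than thrown.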
Do I have to create a new domain every time I do an async operation?
Use middleware.
More convenient.
In Express, this might look like:
var domain = require('domain');
var domainWrapper = function(req, res, next) {
  var reqDomain = domain.create();
  reqDomain.add(req);
  reqDomain.add(res);
  reqDomain.once('error', function(err) {
    res.send(500); // or next(err);
  });
  reqDomain.run(next);
};
Based on https://github.com/brianc/node-domain-middleware
https://github.com/mathrawka/express-domain-errors
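The same per-request pattern can be sketched without Express, using a plain (req, res) handler of the kind core `http` expects. `withDomain` is a hypothetical name, and this is an illustration rather than what either linked package does:

```javascript
var domain = require('domain');

// Hypothetical wrapper: runs a plain (req, res) handler inside a
// per-request domain, mirroring the Express middleware above.
function withDomain(handler) {
  return function(req, res) {
    var reqDomain = domain.create();
    reqDomain.add(req);
    reqDomain.add(res);

    reqDomain.once('error', function(err) {
      console.error('request failed:', err.message);
      // Abort: answer with a 500 if nothing has been sent yet.
      if (!res.headersSent) {
        res.statusCode = 500;
        res.end('Internal Server Error');
      }
    });

    reqDomain.run(function() {
      handler(req, res);
    });
  };
}

// Usage (not started here):
//   require('http').createServer(withDomain(myHandler)).listen(3000);
```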
Domain methods.
add: bind an EE to the domain.
run: run a function in context of domain.
bind: bind one function.
intercept: like bind but handles 1st arg err.
dispose: cancels IO and timers.
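The difference between bind and intercept is easiest to see side by side. A minimal sketch with made-up callbacks:

```javascript
var domain = require('domain');

var d = domain.create();
var caught = null;
d.on('error', function(err) { caught = err; });

var results = [];

// bind: the callback still receives err as its first argument.
var bound = d.bind(function(err, data) {
  results.push(['bound', err, data]);
});

// intercept: an Error in the first argument goes to the domain's
// 'error' handler; otherwise the callback gets the remaining arguments.
var intercepted = d.intercept(function(data) {
  results.push(['intercepted', data]);
});

bound(null, 'a');
intercepted(null, 'b');
intercepted(new Error('read failed')); // routed to d, callback skipped

console.log(results, caught && caught.message);
```

With intercept, the usual `if (err) return ...` boilerplate at the top of each callback can be dropped.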
Domains are great
until they're not.
node-mongodb-native does not play well with active domain.
console.log(domain.active); // a domain
AppModel.findOne(function(err, doc) {
  console.log(domain.active); // undefined
  next();
});
See https://github.com/LearnBoost/mongoose/pull/1337
Fix with explicit binding.
console.log(domain.active); // a domain
AppModel.findOne(domain.active.bind(function(err, doc) {
  console.log(domain.active); // still a domain
  next();
}));
What other operations don't play well with domain.active?
Good question!
Package authors could note this.
If you find one, let package author know.
Can 100% uptime be achieved just by using domains?
No.
Not if only one instance of your app is running.
3. Manage processes with cluster.
Cluster module.
Node = one thread per process.
Most machines have multiple CPUs.
One process per CPU = cluster.
master / workers
1 master process forks n workers.
Master and workers communicate state via IPC.
When workers want to listen to a socket, master registers them for it.
Each new connection to socket is handed off to a worker.
No shared application state between workers.
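The master / worker split above might be sketched as follows. `startMaster` and `startWorker` are hypothetical names, and the cluster object is passed in as a parameter only so the logic can be exercised with a stub:

```javascript
var cluster = require('cluster');
var http = require('http');
var os = require('os');

// Master side: fork the requested number of workers and replace any
// that die. This naive version re-forks on every exit.
function startMaster(clusterApi, workerCount) {
  for (var i = 0; i < workerCount; i++) {
    clusterApi.fork();
  }
  clusterApi.on('exit', function(worker, code, signal) {
    console.log('worker died (' + (signal || code) + '), forking another');
    clusterApi.fork();
  });
}

// Worker side: every worker listens on the same port; the master
// hands each new connection to one of them.
function startWorker(port) {
  return http.createServer(function(req, res) {
    res.end('handled by pid ' + process.pid + '\n');
  }).listen(port);
}

// Typical entry point (not executed here):
//   if (cluster.isMaster) startMaster(cluster, os.cpus().length);
//   else startWorker(3000);
```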
What about when a worker isn't working anymore?
Some coordination is needed.
1. Worker tells cluster master it's done accepting new connections.
2. Cluster master forks replacement.
3. Worker dies.
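One way these three steps could be wired over IPC is sketched below. The `{ cmd: 'retiring' }` message shape and both function names are assumptions made for illustration, not the protocol of any particular process manager:

```javascript
// Hypothetical message shape; any value both sides agree on works.
var RETIRING = { cmd: 'retiring' };

// Master side: fork a worker and replace it as soon as it announces
// that it has stopped accepting connections.
function forkManaged(clusterApi) {
  var worker = clusterApi.fork();
  var replaced = false;
  worker.on('message', function(msg) {
    if (!replaced && msg && msg.cmd === RETIRING.cmd) {
      replaced = true;
      forkManaged(clusterApi);      // 2. master forks a replacement
    }
  });
  return worker;
}

// Worker side: announce retirement, stop accepting, exit when drained.
function retire(server) {
  if (process.send) process.send(RETIRING);  // 1. tell the master
  server.close(function() {                  // 3. worker dies
    process.exit(0);
  });
}
```

The replacement is forked before the old worker exits, so capacity does not drop while connections drain.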
Another use case for cluster:
Deployment.
Want to replace all existing servers.
Something must manage that = cluster master process.
Zero downtime deployment.
When master starts, give it a symlink to worker code.
After deploy new code, update symlink.
Send signal to master: fork new workers!
Master tells old workers to shut down, forks new workers from new code.
Master process never stops running.
Signals.
A way to communicate with running processes.
SIGHUP: reload workers (some like SIGUSR2).
$ kill -s HUP <pid>
$ service <node-service-name> reload
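On the master side, the reload triggered by that signal might look roughly like this. `reloadWorkers` is a hypothetical name, and the sketch assumes the worker script the master forks is the symlink, so newly forked workers pick up the new code:

```javascript
// Master side: start a fresh worker for each current one, and retire
// the old worker once its replacement is listening.
function reloadWorkers(clusterApi) {
  // Object.keys takes a snapshot, so workers forked below are not revisited.
  Object.keys(clusterApi.workers).forEach(function(id) {
    var oldWorker = clusterApi.workers[id];
    var newWorker = clusterApi.fork();
    newWorker.once('listening', function() {
      oldWorker.disconnect();   // asks the old worker to close its servers
    });
  });
}

// Wiring in the cluster master (not executed here):
//   process.on('SIGHUP', function() {
//     reloadWorkers(require('cluster'));
//   });
```

Waiting for 'listening' before disconnecting the old worker means there is always at least one worker accepting connections.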
Process management options.
Forever
github.com/nodejitsu/forever
Has been around...forever.
No cluster awareness; used on a single process.
Simply restarts the process when it dies.
More comparable to Upstart or Monit.
Naught
github.com/superjoe30/naught
Newer.
Cluster aware.
Zero downtime errors and deploys.
Runs as daemon.
Handles log compression, rotation.
Recluster
github.com/doxout/recluster
Newer.
Cluster aware.
Zero downtime errors and deploys.
Does not run as daemon.
Log agnostic.
Simple, relatively easy to reason about.
We went with recluster.
Happy so far.
I have been talking about starting / stopping workers as if it's atomic.
It's not.
4. Gracefully terminate connections when needed.
Don't call process.exit too soon!
Give it a grace period to clean up.
Need to clean up:
In-flight requests.
HTTP keep-alive (open TCP) connections.
Revisiting our middleware from earlier:
var domainWrapper = function(afterErrorHook) {
  return function(req, res, next) {
    var reqDomain = domain.create();
    reqDomain.add(req);
    reqDomain.add(res);
    reqDomain.once('error', function(err) {
      next(err);
      if (afterErrorHook) afterErrorHook(err); // Hook.
    });
    reqDomain.run(next);
  };
};
1. Call server.close.
var afterErrorHook = function(err) {
  server.close(); // <-- ensure no new connections
};
2. Shut down keep-alive connections.
var afterErrorHook = function(err) {
  app.set("isShuttingDown", true); // <-- set state
  server.close();
};
var shutdownMiddle = function(req, res, next) {
  if (app.get("isShuttingDown")) { // <-- check state
    req.connection.setTimeout(1); // <-- kill keep-alive
  }
  next();
};
Idea from https://github.com/mathrawka/express-graceful-exit
3. Then call process.exit in server.close callback.
var afterErrorHook = function(err) {
  app.set("isShuttingDown", true);
  server.close(function() {
    process.exit(1); // <-- all clear to exit
  });
};
Set a timer.
If timeout period expires and server is still around, call process.exit.
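The timer can be folded into the same hook. A minimal sketch; `exitGracefully`, `graceMs`, and the injectable `exit` argument are hypothetical, and the grace period is whatever your traffic tolerates:

```javascript
// Hypothetical helper: close the server, exit when it has drained, and
// force the exit if draining takes longer than graceMs.
function exitGracefully(server, graceMs, exit) {
  exit = exit || function(code) { process.exit(code); };
  var done = false;

  function finish(code) {
    if (done) return;       // exit only once
    done = true;
    clearTimeout(timer);
    exit(code);
  }

  var timer = setTimeout(function() {
    console.error('grace period expired, forcing exit');
    finish(1);
  }, graceMs);
  timer.unref();            // the timer alone should not keep the process alive

  server.close(function() {
    finish(1);              // all connections closed in time
  });
}

// Usage (not executed here):
//   var afterErrorHook = function(err) {
//     app.set("isShuttingDown", true);
//     exitGracefully(server, 10000);
//   };
```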
Summing up:
Our ideal server.
On startup:
Cluster master comes up (for example, via Upstart).
Cluster master forks workers from symlink.
Each worker's server starts accepting connections.
On deploy:
Point symlink to new version.
Send signal to cluster master.
Master tells existing workers to stop accepting new connections.
Master forks new workers from new code.
Existing workers shut down gracefully.
On error:
Server catches it via domain.
Next action depends on you: retry? abort? rethrow? etc.
On uncaught exception:
??
// The infamous "uncaughtException" event!
process.on('uncaughtException', function(err) {
  // ??
});
Back to where we started:
1. Sensibly handle uncaught exceptions.
We have minimized these by using domains.
But they can still happen.
Node docs say not to keep running.
An unhandled exception means your application - and by extension node.js itself - is in an undefined state. Blindly resuming means anything could happen. You have been warned.
http://nodejs.org/api/process.html#process_event_uncaughtexception
What to do?
First, log the error so you know what happened.
Then, you've got to kill the process.
It's not so bad. We can now do so with minimal trouble.
On uncaught exception:
Log error.
Server stops accepting new connections.
Worker tells cluster master it's done.
Master forks a replacement worker.
Worker exits gracefully when all connections are closed, or after timeout.
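Put together, the worker's side of that sequence might be sketched like this. Every name in `opts` is hypothetical, as is the `{ cmd: 'retiring' }` message; the collaborators are passed in so the order of steps is easy to follow and to stub:

```javascript
// Builds an 'uncaughtException' handler from its collaborators.
function makeUncaughtHandler(opts) {
  var dying = false;
  return function(err) {
    if (dying) return;                       // already shutting down
    dying = true;

    opts.log('uncaught exception', err);     // log error

    var timer = setTimeout(function() {      // ...or exit after timeout
      opts.exit(1);
    }, opts.graceMs);
    timer.unref();

    opts.server.close(function() {           // stop accepting new connections
      clearTimeout(timer);
      opts.exit(1);                          // exit once connections are closed
    });

    opts.notifyMaster({ cmd: 'retiring' });  // master forks a replacement
  };
}

// Wiring in a worker (not executed here):
//   process.on('uncaughtException', makeUncaughtHandler({
//     log: console.error,
//     server: server,
//     graceMs: 10000,
//     notifyMaster: function(m) { if (process.send) process.send(m); },
//     exit: function(code) { process.exit(code); }
//   }));
```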
What about the request that killed the worker?
How does the dying worker gracefully respond to it?
Good question!
People are also under the illusion that it is possible to trace back [an uncaught] exception to the http request that caused it...
-felixge, https://github.com/joyent/node/issues/2582
This is too bad, because you always want to return a response,
even on error.
This is Towards 100% Uptime b/c these approaches don't guarantee a response for every request.
But we can get very close.
Fortunately, given what we've seen, uncaughts shouldn't happen often.
And when they do, only one connection will be left hanging.
Must restart cluster master when:
Upgrade Node.
Cluster master code changes.
During timeout periods, might have:
More workers than CPUs.
Workers running different versions (old/new).
Should be brief. Probably preferable to downtime.
Tip:
Be able to produce errors on demand on your dev and staging servers.
(Disable this in production.)
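One possible shape for such a switch is an Express-style middleware that throws asynchronously on a special path. The `errorOnDemand` name, the `/__crash` path, and the injectable `schedule` argument are all made up for this sketch:

```javascript
// Hypothetical middleware factory. In production it is a no-op.
// Elsewhere, a request to crashPath throws from a later tick, where a
// try/catch around the handler cannot see it.
function errorOnDemand(env, crashPath, schedule) {
  schedule = schedule || setImmediate;
  return function(req, res, next) {
    if (env === 'production' || req.url !== crashPath) {
      return next();
    }
    schedule(function() {
      throw new Error('error on demand: ' + req.url);
    });
  };
}

// Usage (not executed here):
//   app.use(errorOnDemand(process.env.NODE_ENV, '/__crash'));
```

Mounted after a domain middleware, the throw should land in the request's domain; mounted before it, it should surface as an uncaught exception, which lets both paths be rehearsed.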
Tip:
Keep cluster master simple.
It needs to run for a long time without being updated.
Things change.
I've been talking about:
{
  "node": "~0.10.20",
  "express": "~3.4.0",
  "connect": "~2.9.0",
  "mongoose": "~3.6.18",
  "recluster": "=0.3.4"
}
The Future:
Node 0.11 / 0.12
For example, cluster module has some changes.
Cluster is experimental.
Domains are unstable.
Good reading:
Node.js Best Practice Exception Handling (some answers more helpful than others)
Remove uncaught exception handler?
Isaacs stands by killing on uncaught
Domains don't incur performance hits compared to try catch
Rejected PR to add domains to Mongoose, with discussion
Don't call enter / exit across async
Comparison of naught and forever
What's changing in cluster
[email protected]/sandinmyjoints/towards-100-pct-uptime
github.com/sandinmyjoints/towards-100-pct-uptime-examples