node.js in production: Reflections on three years of riding the unicorn

SVP, Engineering

Bryan Cantrill

@bcantrill

Tuesday, December 3, 2013

Production systems

• Production systems are ones doing real work: when they misbehave, users or other systems are affected

• Production systems value reliability, performance and ease of deployment — usually in that order

• Contrast with development systems, which value ease of development and speed of development, in that order

• These values can be in tension: new languages and environments typically arise for their development values, not their production ones

• Would node.js be any different?


node.js advantages

• In terms of production suitability, node.js had — and still has — a couple of major advantages going for it:

• It’s not a new language

• It leverages extant (Unix) abstractions

• It’s built on a VM (V8) that itself was designed for performance

• Its pure event-oriented model aligns ease of programming with scalability with respect to load

• As the stewards of both node and SmartOS, Joyent had another advantage: we could change, improve or leverage SmartOS to accommodate node in production


node.js challenges

• But node.js also has a couple of major challenges:

• Single-threaded execution of JavaScript means that compute-bound code can entirely impede progress

• JavaScript closures make it easy to accidentally reference memory

• Because node.js is often used to connect backend components, failure to propagate back pressure can induce memory explosion and death

• High performance VM also implies inscrutable core dumps and very limited instrumentation
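The back-pressure hazard in the list above can be made concrete. What follows is a minimal sketch, assuming the modern Node.js stream API (the `Writable` class and its 'drain' event, parts of which postdate the releases discussed in this talk); `slowSink` and `copyWithBackPressure` are hypothetical names. The producer stops writing whenever `write()` returns false and resumes on 'drain', so the amount of buffered data stays bounded instead of growing with the speed mismatch.

```javascript
const { Writable } = require('stream');

// Hypothetical slow consumer: completes each write on a later tick and
// reports "full" once about `highWaterMark` bytes are buffered.
function slowSink(received) {
  return new Writable({
    highWaterMark: 4,
    write(chunk, encoding, callback) {
      received.push(chunk.toString());
      setImmediate(callback);
    }
  });
}

// Write every chunk to `sink`, pausing whenever write() returns false and
// resuming on 'drain'. Calls done(pauses) once everything has been flushed.
function copyWithBackPressure(chunks, sink, done) {
  let i = 0;
  let pauses = 0;
  function pump() {
    while (i < chunks.length) {
      const ok = sink.write(chunks[i++]);
      if (!ok && i < chunks.length) {
        pauses++;
        sink.once('drain', pump); // propagate back pressure to the producer
        return;
      }
    }
    sink.end(() => done(pauses));
  }
  pump();
}
```

A producer that ignores the return value of `write()` would still work in a test like this one, which is what makes the failure mode easy to ship: the unbounded buffering only shows up under production load.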


August 2010: DTrace in node.js

• Added simple user-level statically defined tracing (USDT) probes for node.js on platforms that support DTrace (e.g., Mac OS X, SmartOS)

• Probes were around connection establishment, serving HTTP requests, etc.

• Allowed questions to be dynamically asked of running, production node.js servers, e.g.:

    dtrace -n 'node*:::http-server-request{ printf("%s of %s from %s\n", args[0]->method, args[0]->url, args[1]->remoteAddress)}'

    dtrace -n http-server-request'{ @[args[1]->remoteAddress] = count()}'

    dtrace -n gc-start'{self->ts = timestamp}' \
        -n gc-done'/self->ts/{@ = quantize(timestamp - self->ts)}'


August 2010: Deploying 0.2.x

• In August 2010, we deployed our first node.js-based service into production: a Node Knockout leaderboard that used node.js DTrace probes to geolocate connections to contestants in real time

• Results were promising; surprisingly easy to develop and deploy a node.js based service — and service consumed very little CPU

• Watching the Node Knockout contestants in production revealed they were all light on CPU:

• But there was a storm cloud...


August 2010: Deploying 0.2.x, cont.

• We had a memory leak that resulted in heap exhaustion after several hours under heavy load

• Our service was stateless and load balanced for HA, so this was more disconcerting than debilitating...

• ...but we also had quite a few contestants that would run their RSS up and crash; there was clearly a larger issue:


February 2011: 0.4.0

• In February 2011, we deployed our first major node.js-based service (on 0.4.0)

• Service was able to be built remarkably quickly — but with some pain-points around Connect

• Despite being potentially a compute-bound service, CPU consumption was (again) a non-issue

• And with an updated node (and many fixed node leaks), memory consumption wasn’t necessarily as acute...

• …but we hit our first “spinning black hole” problem


January 2011: node-dtrace-provider

• Our DTrace probes in node were proving to be too low-level for higher-level services — we needed to allow USDT probes to be expressed in JavaScript

• Fortunately, DTrace community member Chris Andrews extended his libusdt to node.js, allowing statically defined probes in JavaScript, e.g.:

    var d = require('dtrace-provider');   // the node-dtrace-provider module
    var dtp = d.createDTraceProvider('foo');
    var probe = dtp.addProbe('foo-start');

    probe.fire(function(p) { return ([ { bar: 123, baz: 'bar' } ]); });


April 2011: Restify

• Based on our experiences with Connect/Express, we wanted to build a node module that was purpose-built to implement HTTP-based API endpoints

• Based on Chris Andrews’ work, we wanted to have first class support for DTrace

• Joyent’s Mark Cavage developed node-restify, which quickly became the foundation for all of our services

• Built-in DTrace support allows full observability into per-route/per-handler latency — a capability that we could not live without at this point


November 2011: MDB support for V8

• In mid-2011, Joyent’s Dave Pacheco dared to dream the impossible dream: full postmortem support for V8 for MDB, the debugger native to SmartOS

• Several unspeakable layer violations later, mdb_v8 brought postmortem debugging to node.js

• ::jsstack prints full stack including both native C++ frames and JavaScript frames

• ::jsprint prints JavaScript objects — from the dump

• Thanks to mdb_v8, we were able to go back to a core dump from that infinite loop in our service deployed several months earlier — and nail it


December 2011: DTrace ustack helper

• mdb_v8 was actually a way station to an even bolder dream: a DTrace ustack helper for node.js

• A ustack helper is a bit of code that accompanies a binary and assists DTrace in probe context to resolve stack frames to their higher-level names

• Once completed, allows user-level stack traces to be associated with in-kernel events — like profiling events

• Can use the DTrace profile provider to determine how a node.js program is consuming CPU via stack sampling


December 2011: Flame graphs

• When poring over raw stack traces, hot functions can be difficult to visualize

• Joyent’s Brendan Gregg developed flame graphs, which allow us to easily visualize thousands of sampled stacks:


January 2012: Bunyan

• Logging was becoming more and more of a problem for us — especially as we were developing distributed systems in node.js

• Joyent’s Trent Mick developed node-bunyan, a simple and fast JSON logging library for node.js

• Provides standardized, JSON, line-based log output that can be easily processed with JSON tools, e.g.:

    {"name":"moray","hostname":"d1cfb6c7-c975-4ed8-a689-fb18f94b6bfc","pid":8393,"component":"manatee","path":"/manatee/sdc/election","level":20,"db":{"available":2,"max":15,"size":2,"waiting":0},"options":{"async":false,"read":true},"msg":"pg: entered","time":"2013-12-03T02:54:24.565Z","v":0}

• Also includes command line tool, bunyan, for displaying Bunyan logs
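Because every record is one JSON object per line, processing these logs needs nothing beyond `JSON.parse`. A minimal sketch, assuming a hypothetical helper named `filterByLevel`; the first sample line is an abbreviated form of the record shown above, and the second is invented purely for illustration. The numeric levels are Bunyan's standard ones.

```javascript
// Bunyan's numeric levels.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

// Parse newline-delimited JSON log text and keep the records at or above
// `minLevel`. Lines that are not valid JSON are skipped rather than fatal.
function filterByLevel(text, minLevel) {
  const records = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    let rec;
    try {
      rec = JSON.parse(line);
    } catch (e) {
      continue;
    }
    if (rec !== null && typeof rec === 'object' && rec.level >= LEVELS[minLevel]) {
      records.push(rec);
    }
  }
  return records;
}

const sample = [
  // Abbreviated from the slide's example record.
  '{"name":"moray","pid":8393,"level":20,"msg":"pg: entered","time":"2013-12-03T02:54:24.565Z","v":0}',
  // Hypothetical record, not from the talk.
  '{"name":"moray","pid":8393,"level":40,"msg":"hypothetical warning","time":"2013-12-03T02:54:25.000Z","v":0}',
  'not json'
].join('\n');

console.log(filterByLevel(sample, 'warn').map((r) => r.msg));
```

In practice the bunyan command line tool covers this kind of filtering; the sketch only shows why the line-per-record format keeps ad hoc tooling cheap.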


February 2012: npm shrinkwrap

• npm allows for fine-grained semver control over package dependencies, but we found that nested dependencies could result in non-replicable installs

• “npm shrinkwrap” generates a file that shrinkwraps all nested dependencies into npm-shrinkwrap.json, thereby locking down all nested versions

• Guarantees that all installs will have same semver versions of dependencies

• Doesn’t necessarily guarantee identical installs, however; for this, one needs private npm repositories


April 2012: node-vasync

• There are a number of modules that deal with some of the mechanics of asynchronous control flow…

• But we found that we needed one that emphasized debugging

• node-vasync captures a number of popular flow patterns and allows state to be inspected via MDB
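The idea of keeping control-flow state in a plain object, where a debugger can find it, can be sketched without the library. This is not node-vasync's implementation; `parallelWithState` and its field names are hypothetical, chosen only to illustrate the approach of returning a status object that is updated in place as operations complete.

```javascript
// Run every function in `funcs` concurrently. Each takes a Node-style
// callback(err, result). The returned `state` object is mutated as
// operations finish, so a core dump (or a log statement) shows exactly
// which operations were still pending.
function parallelWithState(funcs, done) {
  const state = {
    operations: funcs.map((fn) => ({
      funcname: fn.name || '(anon)',
      status: 'pending',
      err: undefined,
      result: undefined
    })),
    ndone: 0,
    nerrors: 0
  };

  if (funcs.length === 0) {
    setImmediate(done, null, state);
    return state;
  }

  funcs.forEach((fn, i) => {
    fn((err, result) => {
      const op = state.operations[i];
      if (op.status !== 'pending') return; // ignore double callbacks
      op.status = err ? 'fail' : 'ok';
      op.err = err;
      op.result = result;
      state.ndone++;
      if (err) state.nerrors++;
      if (state.ndone === funcs.length) {
        const failure = state.nerrors > 0
          ? new Error(state.nerrors + ' operation(s) failed')
          : null;
        done(failure, state);
      }
    });
  });
  return state;
}
```

With ad hoc callback counting, a hung request leaves nothing to look at; here, an operation whose callback never fires remains visible as `status: 'pending'` under its function name.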


May 2012: ::findjsobjects

• Building on Dave Pacheco’s mdb_v8, we implemented a debugger command that iterates over all of memory in a core dump, looking for JavaScript objects

• Entirely brute force, but allows one to take a swing at a nasty node.js issue: semantic memory leaks

    > ::findjsobjects
      OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROPS
    95709ac1      195      3 Object: socket, type, handle
    957093f9       66      9 Object: uid, windowsVerbatimArguments, stdio, …
    95f13181      130      5 <anonymous> (as exports.StringDecoder): …
    8432ff55      222      3 Buffer: length, offset, parent
    843304dd       91      9 Object: refreservation, creation, name, type, …
    8432cc55       99      9 Object: time, msg, level, hostname, pid, action, …
    95f08545       66     14 ChildProcess: _closesNeeded, stdio, …
    8432f2e1      546      2 Array
    9570cafd       47     24 Object: <sliced string>, <sliced string>, …
    8432be95      415      3 Array
    8432fb09       67     19 Socket: errorEmitted, _bytesDispatched, …


May 2012: ::findjsobjects -p

• Searching by property name allows one to find particular objects in the JavaScript heap, e.g.:

    > ::findjsobjects -p ip4addr | ::findjsobjects | ::jsprint -a
    8432b109: {
        ip4addr: 9aee115d: "10.88.88.200",
        VLAN: 9aee1199: "0",
        Host Interface: 9aee1185: "e1000g0",
        Link Status: 9aee1175: "up",
        MAC Address: 9aee113d: "02:08:20:47:93:82",
    }
    …

• While designed for postmortem debugging, this allows mdb_v8 to be used for in situ debugging in development

• Also guides one to a best practice: towards unique property names (which we have historically done in the operating system via structure prefixing)
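The property-naming practice is easy to illustrate. A small sketch, assuming a hypothetical `nic_` prefix and a toy `findByProperty` function that stands in for a by-property search over a set of objects; it is not how the debugger command works internally.

```javascript
// Generic names: a search for "addr" or "state" matches many unrelated
// kinds of object in a large heap.
const generic = { addr: '10.88.88.200', state: 'up' };

// Prefixed names (hypothetical "nic_" prefix): a search by property name
// matches only objects of this kind.
const prefixed = { nic_ip4addr: '10.88.88.200', nic_linkstate: 'up' };

// Toy stand-in for a search by property name.
function findByProperty(objects, prop) {
  return objects.filter((o) => Object.prototype.hasOwnProperty.call(o, prop));
}

const heap = [generic, prefixed, { addr: '/tmp/sock', state: 'closed' }];
console.log(findByProperty(heap, 'addr').length);        // 2
console.log(findByProperty(heap, 'nic_ip4addr').length); // 1
```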


July 2012: node-fast

• While HTTP makes it very easy to put together a distributed system, parsing and connection management can become prohibitively expensive

• In building Manta, we found that we needed something lighter/faster; Joyent’s Mark Cavage built node-fast

• Only what you need: fully async/duplex/persistent connections, simple on-wire protocol (JSON), etc.

• None of what you don’t want: no IDL madness, no object model, no binary translation madness, etc.

• Deliberately light and limited — HTTP is still the right answer until it isn’t
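To give a feel for what a simple JSON-based protocol over a persistent connection involves, here is a minimal framing sketch. It is not node-fast's actual wire format, which the slides do not describe; the 4-byte length prefix and the names `encodeMessage` and `createDecoder` are assumptions made for illustration.

```javascript
// Encode one message as a 4-byte big-endian length followed by UTF-8 JSON.
function encodeMessage(obj) {
  const body = Buffer.from(JSON.stringify(obj), 'utf8');
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// Return a function that accepts arbitrary chunks (as delivered by a
// socket's 'data' event) and calls onMessage once per complete frame.
function createDecoder(onMessage) {
  let pending = Buffer.alloc(0);
  return function push(chunk) {
    pending = Buffer.concat([pending, chunk]);
    while (pending.length >= 4) {
      const len = pending.readUInt32BE(0);
      if (pending.length < 4 + len) break; // wait for the rest of the frame
      const body = pending.subarray(4, 4 + len).toString('utf8');
      pending = pending.subarray(4 + len);
      onMessage(JSON.parse(body));
    }
  };
}
```

Wiring this to a connection would be a matter of `socket.on('data', createDecoder(handler))` and `socket.write(encodeMessage(msg))`; many messages can then share one long-lived connection in both directions, with no per-request header parsing.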


October 2012: Bunyan + DTrace

• With all of our services using Bunyan, we could enable dynamic logging by adding DTrace USDT probes

• Can use the raw DTrace probes:

    # dtrace -qn log-debug'{printf("%s\n", copyinstr(arg0))}' -x strsize=8k
    {"name":"wf-moray-backend","hostname":"414ffb35-adee-47b7-bdf4-d21cb039386c","pid":10952,"component":"MorayClient","host":"10.99.99.17","port":2020,"req_id":"bddb180f-1770-edcf-8df2-b3a81d97e9b1","level":20,"bucket":"wf_runners","key":"414ffb35-adee-47b7-bdf4-d21cb039386c","value":{"active_at":"2013-12-03T07:22:25.125Z","idle":false},"msg":"putObject: entered","time":"2013-12-03T07:22:25.135Z","v":0}
    ...

• Added the json() subroutine to DTrace to make this easier to process

• Can also use “bunyan -p” and avoid the lower-level DTrace details entirely


May 2013: --abort-on-uncaught-exception

• Crash dumps are great — but aborting after an uncaught exception makes it very difficult to determine the true origin of the exception

• Dave Pacheco implemented a V8 patch to induce a process abort (and a core dump) on an uncaught exception

• This allows us to use postmortem debugging to debug our everyday logic errors

• Available starting in 0.10.x — we use it wherever we have it!


July 2013: Thoth

• One of the most important systems we have built in node is Manta, our object store featuring in situ compute

• Manta is an excellent platform for building data-based services — especially for large data objects

• We built manta-thoth, a platform for core and crash dump analysis that allows us to debug core dumps without moving them

• Thoth has become critically important for us to track and automatically debug production node.js services


December 2013: Dump analysis on Linux

• Postmortem debugging has been a (the) tremendous breakthrough for node.js in production…

• ...but despite node’s postmortem support all being open source, it has been limited to SmartOS

• Some have toyed with porting MDB to Linux; this is in principle possible, but will be rough sledding

• Joyent’s TJ Fontaine (of node core fame) observed what we had done with dump analysis on Manta and had a simpler idea…

• What about making Linux dumps consumable on SmartOS — and therefore Manta?


December 2013: Linux support in libproc

• Over the course of a multiday engineering hackathon, TJ and Joyent’s Max Bruning added support for Linux crash dumps in SmartOS’s libproc

• Fortunately, because of the way the postmortem work was done by Dave Pacheco, it Just Works

• Do this yourself: https://gist.github.com/tjfontaine/de104fe058300a51f7cf

• For Linux users: put your Linux dumps to Manta, and you can finally debug those pesky leaks and crashes!

• Use --abort-on-uncaught-exception and you can use Manta and postmortem debugging to debug more quotidian programming errors!


Node.js in production!

• For us at Joyent, the tooling that we have built into node.js has resulted in what we believe to be the best dynamic environment for production use

• Yes, even when compared to much older platforms like Java and Erlang...

• There is still work to be done, especially around add-on development (see TJ’s shim work!) and potentially better bundling of objects…

• We will continue to emphasize production deployment and use in our stewardship of node.js!


Thank you

• @dapsays, the Patron Saint of node.js in production, for DTrace support, MDB support, node-vasync, Manta, etc.

• @mcavage for node-restify, node-fast, Manta, etc.

• @trentmick for node-bunyan

• @chrisandrews for node-dtrace-provider

• @brendangregg for flame graphs

• @tjfontaine for bringing postmortem debugging to an entirely new audience with Linux support for libproc!


Date post:	06-May-2015
Category:	Technology
Upload:	bcantrill