Date post: | 16-Apr-2017 |
Category: | Data & Analytics |
Upload: | continuum-analytics |
© 2016 Continuum Analytics- Confidential & Proprietary
Visualizing a Billion Points with Bokeh Datashader
Peter Wang, Continuum Analytics CTO & Co-Founder
@pwang
Double Feature!
Bokeh: Interactive web visualization library for Python (“d3 for Python”, “Shiny for Python”) http://bokeh.pydata.org
Datashader: Library for statistically driven visualization of extremely large datasets http://github.com/bokeh/datashader
Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• No need to write JavaScript
http://bokeh.pydata.org
Versatile Plotting Capabilities
Linked plots, tools
• Easy to show multiple plots and link them
• Easy to link data selections between plots
• Can easily customize the kind of linkage straight from Python, without needing to fiddle around with JS
Large data
• With easy WebGL support, can scale to 500k points or so
• Bottlenecks are browser performance, JSON encoding, network transport
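To get a rough feel for the JSON-encoding and transport cost named above, the sketch below serializes synthetic (x, y) points with Python's standard json module and reports the payload size. It is not how Bokeh serializes data internally; the helper name json_payload_bytes and the point counts are illustrative assumptions.

```python
import json
import random


def json_payload_bytes(n_points, seed=0):
    """Return the size in bytes of n_points random (x, y) pairs encoded as JSON."""
    rng = random.Random(seed)
    data = {
        "x": [rng.random() for _ in range(n_points)],
        "y": [rng.random() for _ in range(n_points)],
    }
    return len(json.dumps(data).encode("utf-8"))


if __name__ == "__main__":
    # Payload grows linearly with the number of points sent to the browser.
    for n in (1_000, 10_000, 100_000):
        size = json_payload_bytes(n)
        print(f"{n:>7} points -> {size / 1e6:.2f} MB ({size / n:.1f} bytes/point)")
```

Because the payload grows linearly with the point count, every extra point has to be encoded, transported, and drawn, which is why shipping raw points to the browser stops scaling.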
rBokeh
Plays well with R ecosystem: HTMLwidget, RMarkdown…
http://hafen.github.io/rbokeh
rBokeh with RStudio & Shiny
Plays well with R ecosystem: HTMLwidget, RMarkdown…
Bokeh Apps: Shiny for Python
• Fully interactive data web apps
• Streaming data, dynamic data
• Easy-to-write pure Python charts, widgets, event handlers
• Open source (BSD licensed), including server
• Enterprise on-prem version in Anaconda Enterprise, with Active Directory/LDAP auth
Example Apps
Easy Streaming Apps
This demo shows how the Bokeh server makes it easy to visualize streaming and dynamic data.
• A minimal example with < 50 LOC
• Demonstrates ease of pushing data from Python code into the browser
Embeds Well
http://cecp.mit.edu
For more information on Bokeh Apps
• Webinar: http://www.slideshare.net/continuumio/hassle-free-data-science-apps-with-bokeh-webinar
• PyData Videos, Tutorials
Community & Adoption
GitHub
• 4100+ stars
• 860+ forks
Mailing list
• 400+ members
• 150+ posts in November
Downloads
• 45,000 / month (conda)
• 4,000 / month (pip)
Billions and billions…
Data Shading Main Points
• When trying to visualize millions of points, browser vs. rich client doesn’t really matter
• A raft of common problems is usually ignored: overdraw, over- and under-saturation, clipping, coarse binning
• Statistical transformations of data are a first-class aspect of the visualization
• Rapid iteration of visual styles & configs, interactive selections and filtering are key concerns in data exploration
When data is large, you don’t know when the viz is lying.
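The over- and under-saturation problem listed above can be illustrated with a little arithmetic. Assuming marks are drawn with ordinary "over" alpha compositing (an assumption; the slides do not name a blending model), n overlapping marks of opacity alpha combine to an opacity of 1 - (1 - alpha)^n. The function name stacked_opacity is hypothetical.

```python
def stacked_opacity(alpha, n):
    """Opacity of n overlapping marks, each drawn with the same alpha ("over" compositing)."""
    return 1.0 - (1.0 - alpha) ** n


if __name__ == "__main__":
    # Each row shows how quickly a pixel saturates as more points land on it.
    for alpha in (0.5, 0.1, 0.01):
        row = ", ".join(
            f"n={n}: {stacked_opacity(alpha, n):.3f}" for n in (1, 10, 100, 1000)
        )
        print(f"alpha={alpha}: {row}")
```

With a high alpha, a pixel holding 100 points and one holding 1,000 are both effectively opaque and cannot be told apart. With a low alpha, sparse regions and outliers nearly vanish. No single alpha value works across the whole density range, which is one way a plot of large data can mislead without any visible sign.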
Data Shading Pipeline
[Figure: pipeline diagram. Data → Project / Synthesize → Scene → Sample / Raster → Aggregates → Transfer → Image]
[Figure: visual abstraction diagram. Labels: Data Transforms, Visual Mappings, View Transforms; Data Tables, Source Data Views; Selection, Aggregation, Transfer; Significant Set, Aggregates]
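The aggregate and transfer stages named in the pipeline can be sketched in plain Python. This is a minimal illustration of the idea, not Datashader's implementation or API: the function names aggregate_counts and transfer_log are hypothetical, and the log-scale transfer is just one possible choice of transfer function.

```python
import math


def aggregate_counts(points, x_range, y_range, width, height):
    """Bin (x, y) points into a height x width grid of counts (the "aggregates")."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point falls outside the current viewport
        # Clamp so that points exactly on the upper edge land in the last bin.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def transfer_log(grid):
    """Map counts to 0..255 intensities on a log scale; empty bins stay 0."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(c) * scale) for c in row] for row in grid]


if __name__ == "__main__":
    pts = [(0.1, 0.1)] * 9 + [(0.9, 0.9)]
    agg = aggregate_counts(pts, (0, 1), (0, 1), width=2, height=2)
    print(agg)
    print(transfer_log(agg))
```

The aggregate grid has a fixed size set by the output resolution, not by the number of points, so only the grid (or the image derived from it) needs to reach the display. Restyling the image means re-running only the cheap transfer step.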
Dataset 1: Overview
This demo shows how traditional plotting tools break down for large datasets, and how to use datashading to make even large datasets practical to explore interactively.
• Data for 10 million New York City taxi trips
• Even 100,000 points get slow in a scatterplot
• Parameters usually need adjusting for every zoom
• True relationships within the data are not visible in a standard plot
Datashading automatically reveals the entire dataset, including outliers, hot spots, and missing data
Categorical data: 2010 US Census
• One point per person
• 300 million total
• Categorized by race
• Datashading shows faithful distribution per pixel
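One way to read "faithful distribution per pixel" is that each pixel keeps a separate count per category and its color is a count-weighted mix of the category colors. The sketch below illustrates that idea in plain Python; the function names, the category labels "a" and "b", and the color key are hypothetical, and Datashader's actual color mixing may differ in detail.

```python
from collections import Counter


def aggregate_by_category(points, width, height):
    """Count points per category in each pixel; points are (x, y, category), x and y in [0, 1)."""
    grid = [[Counter() for _ in range(width)] for _ in range(height)]
    for x, y, cat in points:
        grid[int(y * height)][int(x * width)][cat] += 1
    return grid


def mix_color(counts, color_key):
    """Average the category colors, weighted by each category's count in the pixel."""
    total = sum(counts.values())
    if total == 0:
        return None  # empty pixel: leave it transparent
    return tuple(
        round(sum(color_key[cat][ch] * n for cat, n in counts.items()) / total)
        for ch in range(3)
    )


if __name__ == "__main__":
    color_key = {"a": (255, 0, 0), "b": (0, 0, 255)}  # hypothetical categories
    pts = [(0.1, 0.1, "a")] * 3 + [(0.2, 0.2, "b")]
    grid = aggregate_by_category(pts, width=2, height=2)
    print(mix_color(grid[0][0], color_key))
```

Because the mix is computed from counts rather than from draw order, the result does not depend on which category happens to be plotted last, a common source of misleading categorical scatterplots.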
OSM Dataset: 3 Billion Points
Because Datashader decouples the data-processing from the visualization, it can handle arbitrarily large data
• About 3 billion GPS coordinates
• https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/
• This image was rendered in one minute on a standard MacBook with 16 GB RAM
• Renders in 7 seconds on a 128GB Amazon EC2 instance
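Decoupling aggregation from display is what lets the data exceed memory: counts are additive, so the data can be processed chunk by chunk and the per-chunk grids summed. The sketch below shows this in plain Python under that assumption; the function names are hypothetical and this is not Datashader's out-of-core machinery.

```python
def count_chunk(chunk, width, height):
    """Aggregate one chunk of (x, y) points, x and y in [0, 1), into a count grid."""
    grid = [[0] * width for _ in range(height)]
    for x, y in chunk:
        grid[int(y * height)][int(x * width)] += 1
    return grid


def add_grids(a, b):
    """Combine two count grids of the same shape element-wise."""
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]


def aggregate_stream(chunks, width, height):
    """Fold an iterable of chunks into one grid; only one chunk is held at a time."""
    total = [[0] * width for _ in range(height)]
    for chunk in chunks:
        total = add_grids(total, count_chunk(chunk, width, height))
    return total


if __name__ == "__main__":
    chunks = iter([[(0.1, 0.1), (0.6, 0.6)], [(0.1, 0.2)], []])
    print(aggregate_stream(chunks, width=2, height=2))
```

Memory use is bounded by the chunk size plus the fixed-size grid, regardless of the total number of points. The same additivity also means chunks could be aggregated in parallel and merged afterwards.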
Contact Information and Additional Details
• Contact [email protected] for more information about Anaconda subscriptions and about becoming an early adopter for Data Explorer. Help make sure our product fits your needs!
• View documentation and examples at github.com/bokeh/datashader and bokeh.pydata.org
• View demo notebooks on Anaconda Cloud: notebooks.anaconda.org/jbednar/
Thank you
Email: [email protected]
Twitter: @ContinuumIO
Peter Wang
Twitter: @pwang
Bokeh
Twitter: @bokehplots