The Big Data Exploratorium
A guided tour of open source data analysis tools
Noah Pepper (@noahmp)
Devin Chalmers (@qwzybug)
#exploratorium #osb11
Thursday, June 23, 2011
Hi,
• We’re here because...
• We are...
• Data Exploration Is...
• Example 1: Patents
• (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)
• Example 2: Health Care
• (Pepper et al. Visweek 2010)
Hi,
• Exploratorium #1
• Patent citation networks
• Graphviz
• NetworkX
• Exploratorium #2
• Reddit comment word usages
Hi,
• Get the code & data samples:
• git clone git@github.com:peppern/exploratorium.git
We’re here because...
• There is a really amazing OSS community in the data space.
• This is fantastic news for academics, hobbyists, and professionals alike.
• We want to show what you can do with open source tools, and show you the ones we like.
• We’d love to hear what YOUR favorites are: tweet #exploratorium to tell us.
• Data exploration is fun...
We are...
• Academic Data Junkies
Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining.
• We’re Sorta Lucky
Our startup, where we build data exploration platforms.
Noah Pepper - @noahmp
Devin Chalmers - @qwzybug
We Build Data Exploration Tools!
map.clearhealthcosts.com
What is data exploration, and what is an exploratorium?
• Narrow Definition
• Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.
• Why do I say visualization instead of the more general ‘representation’?
exploratorium |ikˌsplôrəˈtôrēəm| noun [usu. in names] a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.
Yes! That means there’s code and data
Data Exploration Example
• study evolution of technology in patent records
– technology is a window on culture
– patents are a window on technology
Patent Networks
Citation Analysis of Patents
Time Series Text Analysis
Some explorations are more open ended
Pointwise Mutual Information (PMI)
(# = number of patents that contain both words x and y)
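A minimal sketch of PMI computed from patent counts, assuming the standard definition PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), with each probability estimated as the fraction of patents containing the word(s). The function name `pmi` and the toy `patents` data are hypothetical and are not taken from the talk's code.

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information of words x and y over a list of word sets."""
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)               # patents containing x
    n_y = sum(1 for d in docs if y in d)               # patents containing y
    n_xy = sum(1 for d in docs if x in d and y in d)   # patents containing both
    if n_xy == 0:
        # The words never co-occur, so the log of the ratio diverges.
        return float("-inf")
    return math.log((n_xy / n) / ((n_x / n) * (n_y / n)))

# Toy stand-in for patent texts: one set of words per patent (made up).
patents = [
    {"optical", "fiber", "the"},
    {"optical", "lens", "the"},
    {"cultivar", "plant", "the"},
    {"engine", "valve", "the"},
]

print(pmi(patents, "optical", "fiber"))
print(pmi(patents, "optical", "the"))
```

In this toy data a word that appears in every patent, like "the", gets a PMI of zero with any other word, while words that only appear together score higher.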
PMI distributions
- see clusters
- different kinds of clusters
PMI Comparison: Plotting a different way
[Figure: PMI plots for “the”, “optical”, and “cultivar”; labels: PMI integral, halfway rank]
- generality of content?
btw, these are older graphs; now we use ggplot2
Previous Work in Health Care...
.... with @homerstrong at Qmedtrix Systems Inc.
[Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend: placement in distribution of billed, bottom 5% to upper 5%]
Previous Work in Health Care...
... @hadleywickham is a #ballR http://had.co.nz
[Figure: two panels over a log-scale Amount ($) axis from 10^1 to 10^7: bill volume (0 to 120,000) and dollar density (0 to 1.4e+09); series: Billed, First Audit, Second Audit]
Health Care Data & Code Samples...
...Hahaha Just Kidding
But actually:
• Qmedtrix R&D team members made source contributions, see:
• Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)
• Kevin Lynagh https://github.com/lynaghk (Keming Labs)
Exploratorium #1 Patent Networks
citations amongst top 10k most cited patents
Graphviz Art is Pretty!
Grab the graph data: ~/exploratorium/patents/toplinks.dot
Graphviz can graph really big graphs... but they get hard to use ->
<- Psychedelic Patents
Graphviz - Play with Graphs (http://www.graphviz.org)
• sudo port install graphviz or sudo apt-get install graphviz
• graphing commands: dot, neato, twopi, circo, fdp
• dot -Tpdf file.dot -o file.pdf
• More options here:
• http://www.graphviz.org/content/command-line-invocation
• Fun options are in the .dot file:
• http://www.graphviz.org/content/dot-language
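A small sketch of driving the layout commands above from Python, so the same .dot file can be tried with several engines. The helper names `render_command` and `render` are hypothetical; the only Graphviz options used are `-T` (output format) and `-o` (output file).

```python
import shutil
import subprocess

# Layout engines named on the slide.
LAYOUT_ENGINES = ("dot", "neato", "twopi", "circo", "fdp")

def render_command(dot_path, out_path, engine="dot", fmt="pdf"):
    """Build the argument list for one Graphviz layout engine."""
    if engine not in LAYOUT_ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    return [engine, f"-T{fmt}", dot_path, "-o", out_path]

def render(dot_path, out_path, engine="dot", fmt="pdf"):
    """Run Graphviz if it is installed; return True when it was invoked."""
    if shutil.which(engine) is None:
        return False  # Graphviz is not on PATH, nothing rendered
    subprocess.run(render_command(dot_path, out_path, engine, fmt), check=True)
    return True
```

For example, `render("toplinks.dot", "toplinks-neato.pdf", engine="neato")` would lay out the same graph with neato instead of dot.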
Styling dots
• node [shape=point, width="0.15",color="#0000001c"];
• edge [arrowsize="0.50", color="#0000001c"];
• There are tons, read the docs and have fun
• You can also try more complex things
• Like constraints, time for example
• Sometimes too many constraints makes GraphViz unhappy...
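A sketch of generating a styled .dot file with a time constraint, assuming that "time" means placing patents granted in the same year on the same rank. It uses the `node` and `edge` defaults from this slide plus `rank=same` groups, which apply to the dot layout engine. The function `to_dot` and the sample data are hypothetical.

```python
def to_dot(edges, year_of):
    """Return DOT source for a citation graph, grouping patents by grant year."""
    lines = [
        "digraph citations {",
        '  node [shape=point, width="0.15", color="#0000001c"];',
        '  edge [arrowsize="0.50", color="#0000001c"];',
    ]
    # Collect the patents granted in each year.
    by_year = {}
    for patent, year in year_of.items():
        by_year.setdefault(year, []).append(patent)
    # One rank=same group per year keeps same-year patents on one level.
    for year in sorted(by_year):
        members = " ".join(f'"{p}";' for p in sorted(by_year[year]))
        lines.append(f"  {{ rank=same; {members} }}  // granted {year}")
    # Citation edges point from the citing patent to the cited patent.
    for citer, citee in edges:
        lines.append(f'  "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines)

# Made-up sample: three citations among four patents.
edges = [("a", "b"), ("c", "b"), ("b", "d")]
year_of = {"a": 1995, "c": 1995, "b": 1990, "d": 1985}
dot_source = to_dot(edges, year_of)
print(dot_source)
```

Each extra rank group is one more constraint for the layout to satisfy, which is where very large graphs tend to get slow or fail.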
UbiGraph
• We loved UbiGraph, but don’t know an OSS alternative
• Renders many nodes (50k+) in 3D with a real-time force-directed layout.
• On a Mac Pro with 16 GB of RAM
• Shout out to Apple: thank you for supporting our research!
• It’s ‘free’ but development has stalled and since it’s closed source we can’t build on it!
• Alternatives?
Exploratorium #2
• Making graphs of language using Python, Redis, R and a bunch of awesome libraries
• Thanks
• @hadleywickham
• @homerstrong
• @antirez
• Bryan Lewis (http://illposed.net/)
...how?
Mine — Munge — Visualize
github.com/peppern/exploratorium
[ brew | apt-get | port ] install redis
www.r-project.org
github.com/qwzybug/rredis
redis TTR package
Best show on TV
Mine the data
• gutenberg.org
• google.com/ngrams
• APIs — Twitter, etc.
• http://code.google.com/apis/socialgraph/
• Scrape
Store the data
Postgres is not too shabby
-- Ten most-cited patents
SELECT cite AS patent_num, count
FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1
ORDER BY t1.count DESC
LIMIT 10
-- Citations to patent 12345, counted per grant year of the citing patent
-- (backticks replaced: Postgres quotes identifiers with double quotes)
SELECT cite, count(*), "year"
FROM citations
INNER JOIN (SELECT date_part('year', grantdate) AS "year", patent_num FROM patents) AS t1 USING (patent_num)
WHERE cite IN (12345)
GROUP BY "year", cite
-- Top 50 terms by number of 1990s patents (longer than 10 words) where the term's tf-idf exceeds 0.05
SELECT term, count
FROM (
  SELECT term, count(*)
  FROM (SELECT patent_num, term FROM tfidfs WHERE tfidf > 0.05) AS "t1"
  INNER JOIN (
    SELECT *
    FROM (SELECT patent_num FROM patent_lengths WHERE wordcount > 10) AS "t1"
    INNER JOIN (SELECT * FROM patents WHERE grantdate > '1990-01-01' AND grantdate < '2000-01-01') AS "t2"
    USING ("patent_num")
  ) AS "t2" USING ("patent_num")
  GROUP BY "term"
) AS "t3"
ORDER BY count DESC
LIMIT 50;
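To experiment with the most-cited query without a Postgres server, here is a sketch that runs it against an in-memory SQLite database from Python. SQLite is a stand-in here, not what the talk used. The two-column `citations(patent_num, cite)` layout is an assumption inferred from the queries, and the rows are made up.

```python
import sqlite3

# In-memory stand-in database; the talk itself used Postgres.
conn = sqlite3.connect(":memory:")
# Assumed layout: patent_num is the citing patent, cite is the cited patent.
conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("a", "b"), ("c", "b"), ("b", "d")],  # made-up sample rows
)

# Same shape as the "most cited patents" query on the slide.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, count
    FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1
    ORDER BY t1.count DESC
    LIMIT 10
    """
).fetchall()

print(top_cited)
```

The later queries rely on `date_part`, which is Postgres-specific, so they would need adapting before running on SQLite.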
NoSQL is a good fit for web data
Reshape the data
citer citee
a b
c b
b d
citer → citees: { a : [b], c : [b], b : [d] }
citee → citers: { b : [a, c], d : [b] }
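The reshaping above can be sketched in a few lines of Python: one pass over the (citer, citee) pairs builds both the forward and the inverted mapping. The function name `adjacency` is hypothetical.

```python
from collections import defaultdict

def adjacency(pairs):
    """Turn (citer, citee) rows into forward and inverted adjacency dicts."""
    cites = defaultdict(list)      # citer -> patents it cites
    cited_by = defaultdict(list)   # citee -> patents that cite it
    for citer, citee in pairs:
        cites[citer].append(citee)
        cited_by[citee].append(citer)
    return dict(cites), dict(cited_by)

# The three rows from the slide's table.
pairs = [("a", "b"), ("c", "b"), ("b", "d")]
cites, cited_by = adjacency(pairs)
print(cites)
print(cited_by)
```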
Redis
In-Memory Data Structure Server
• HSET key name value
• SADD key value
• ZUNIONSTORE
• HSETNX
• BRPOPLPUSH
• …
Redis
Global variable for all your programs
Memcached with structure
Really fast
Really really fast
Really, really, astonishingly fast
No, faster than that
• Count words by hour
• Comment network
• User network
ZSET subreddit:2011-06-21:12 → word [count]
SET thread_id:comments → “parent_id:child_id”
SET thread_id:users → “parent_id:child_id”
SET subreddit:threads → thread_id
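A sketch of how one Reddit comment might be turned into Redis commands under the key layout above. It only builds the command tuples and does not talk to a server. With a client such as redis-py, each tuple could be sent with `execute_command(*cmd)`. The function names, the whitespace tokenizer, and the sample values are all hypothetical. Treating the members of `thread_id:users` as parent-author and child-author names is an assumption about what the slide's second “parent_id:child_id” means.

```python
from datetime import datetime, timezone

def word_count_commands(subreddit, created_utc, text):
    """ZINCRBY commands that bump the per-hour word counts for one comment."""
    hour = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m-%d:%H")
    key = f"{subreddit}:{hour}"  # e.g. subreddit:2011-06-21:12
    # Naive whitespace tokenizer; a real pipeline would strip punctuation too.
    return [("ZINCRBY", key, 1, word) for word in text.lower().split()]

def network_commands(subreddit, thread_id, parent_id, child_id, parent_user, child_user):
    """SADD commands recording one reply edge, its authors, and its thread."""
    return [
        ("SADD", f"{thread_id}:comments", f"{parent_id}:{child_id}"),
        ("SADD", f"{thread_id}:users", f"{parent_user}:{child_user}"),
        ("SADD", f"{subreddit}:threads", thread_id),
    ]

# Made-up comment posted at 12:30 UTC on 2011-06-21.
ts = datetime(2011, 6, 21, 12, 30, tzinfo=timezone.utc).timestamp()
commands = word_count_commands("programming", ts, "Redis is fast fast")
commands += network_commands("programming", "t3_abc", "c1", "c2", "alice", "bob")
for cmd in commands:
    print(cmd)
```

Because ZINCRBY creates the sorted set and the member on first use, no setup step is needed before counting.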
(demo)
Go forth and graph!
#exploratorium #osb11
We will hire you.
For reals.
You Are Now Leaving the Big Data Exploratorium
Please ensure you have your valuables.
Noah Pepper @noahmp
Devin Chalmers @qwzybug
#exploratorium #osb11