Posted: 28-Jan-2018 | Category: Technology | Uploaded by: paco-nathan
Paco Nathan, Concurrent, Inc.
[email protected], @pacoid
[Diagram: Cascading workflow. Nodes: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin (Left, RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
Copyright @2012, Concurrent, Inc.
“A Data Scientist And A Log File Walk Into A Bar…”
Unstructured Data meets Enterprise Scale
opportunity
1. backstory: how we got here
2. overview: typical use cases
3. example: a Cascading app
1. backstory: how we got here
Intro to Data Science
inflection point
• huge Internet successes after the 1997 holiday season… AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
• consider this metric: annual revenue per customer / amount of data stored, which dropped 100x within a few years after 1997
• storage and processing costs plummeted, now we must work much smarter to extract ROI from Big Data… our methods must adapt
• “conventional wisdom” of RDBMS and BI tools became less viable; business cadre still focused on pivot tables and pie charts… tends toward inertia!
• MapReduce and the Hadoop open source stack grew directly out of that contention… but only solve portions
massive disruption in retail, advertising, etc., “All of Fortune 500 is now on notice over the next 10-year period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
[Timeline: 1997, 1998, 2004]
the world before…
BI, SQL, and highly optimized code
[Diagram: Customers, Web App, transactions, RDBMS, SQL Query, result sets, BI Analysts, Excel pivot tables, PowerPoint slide decks, Stakeholder, Product, strategy, Engineering, requirements, optimized code]
data innovation: circa 1996
the world after…
machine learning, leveraging log files
[Diagram: Customers, Web Apps, customer transactions, RDBMS, SQL Query, result sets, Logs, event history, aggregation, DW, ETL, Algorithmic Modeling, recommenders + classifiers, Middleware, servlets, models, dashboards, Stakeholder, Product, Engineering, UX]
data innovation: circa 2001
the world ahead…
what our customers are doing now
[Diagram: Customers; Web Apps, Mobile, etc.; transactions, content, social interactions; Data Apps, History, Log Events, RDBMS, In-Memory Data Grid, Hadoop, etc., Cluster Scheduler, Workflow, Planner, "real time" and batch services, DW, Data Access Patterns, endpoints. Roles: Domain Expert, Data Scientist, App Dev, Ops, Prod, Eng. Activities: discovery + modeling, data science, s/w dev, business process, dashboard metrics, optimized capacity]
data innovation: circa 2013
a key difference…
statistical thinking
employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way… however, both systems engineers and data scientists must!
[Diagram labels: Process, Variation, Data, Tools]
references
by Leo Breiman
Statistical Modeling: The Two Cultures
Statistical Science, 2001
http://bit.ly/eUTh9L
also check out RStudio: http://rstudio.org/ and http://rpubs.com/
most valuable skills
• approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.
• unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up
• most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
the rest of the skills – modeling, algorithms, etc. – those are secondary
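One common way to "estimate the confidence for reported results" is a percentile bootstrap. The sketch below is only an illustration, not a method named on the slide: the class name `BootstrapCI`, the sample values, and the fixed seed are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Random;

// Percentile bootstrap for the mean: resample with replacement, recompute the
// mean each time, then read the interval off the sorted resampled means.
final class BootstrapCI {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Returns {lower, upper} for the given confidence level, e.g. 0.95.
    static double[] interval(double[] sample, int resamples, double level, long seed) {
        Random rng = new Random(seed);
        double[] means = new double[resamples];
        double[] draw = new double[sample.length];
        for (int r = 0; r < resamples; r++) {
            for (int i = 0; i < draw.length; i++) {
                draw[i] = sample[rng.nextInt(sample.length)];
            }
            means[r] = mean(draw);
        }
        Arrays.sort(means);
        double tail = (1.0 - level) / 2.0;
        int lo = (int) Math.floor(tail * (resamples - 1));
        int hi = (int) Math.ceil((1.0 - tail) * (resamples - 1));
        return new double[] { means[lo], means[hi] };
    }

    public static void main(String[] args) {
        // hypothetical daily conversion rates (percent), made up for illustration
        double[] rates = { 2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.6, 2.3, 2.7 };
        double[] ci = interval(rates, 2000, 0.95, 42L);
        System.out.printf("mean=%.2f, 95%% CI=[%.2f, %.2f]%n", mean(rates), ci[0], ci[1]);
    }
}
```

Reporting the interval next to the point estimate makes it visible when a result rests on too little data to act on.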
D3
team process
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver products at scale to customers
apps: leverage smarts in product features
systems: keep infrastructure running, cost-effective
Gephi
matrix: usage
stakeholder
scientist
developer
ops
conceptual tool for managing Data Science teams
overlay your project requirements (needs) with your team’s strengths (roles)
that will show very quickly where to focus
NB: bring in individuals who cover 2-3 needs, particularly for team leads
[Matrix figure: the roles above plotted against needs: discovery, modeling, integration, apps, systems]
building teams
[Matrix figure: roles (stakeholder, scientist, developer, ops) plotted against needs (discovery, modeling, integration, apps, systems)]
references
by DJ Patil
Data Jujitsu, O’Reilly, 2012
http://www.amazon.com/dp/B008HMN5BE
Building Data Science Teams, O’Reilly, 2011
http://www.amazon.com/dp/B005O4U3ZE
2. overview: typical use cases
in a nutshell, what we do…
• estimate probability
• calculate analytic variance
• manipulate order complexity
• make use of learning theory
• collab with DevOps, Stakeholders
using science in data science
Unique Registration
Launched games lobby
NUI:TutorialMode
Birthday Message
Chat PublicRoom voice
Launched heyzap game
ConnectivityTest: test suite started
Create New Pet
Movie View Started: client, community
NUI:MovieMode
Buy an Item: web
Put on Clothing
Address space remaining: 512M
Customer Made Purchase Cart Page Step 2
Feed Pet
Play Pet
Chat Now
Edit Panel
Client Inventory Panel Flip Product Over
Add Friend
Open 3D Window
Change Seat
Type a Bubble
Visit Own Homepage
Take a Snapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Address space remaining: 1G
Leave a Message
NUI:ChatMode
NUI:FriendsModedv
Website Login
Add Buddy
NUI:PublicRoomMode
NUI:MyRoomMode
Client Inventory Panel Remove Product
Client Inventory Panel Apply Product
NUI:DressUpMode
use case: marketing funnel
• must optimize a very large ad spend
• different vendors report different metrics
• seasonal variation distorts performance
• some campaigns are much smaller than others
• hard to predict ROI for incremental spend
approach:
• log aggregation, followed with cohort analysis
• Bayesian point estimates compare different-sized ad tests
• customer lifetime value quantifies ROI of new leads
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• linear programming models estimate elasticity of demand
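To make the "Bayesian point estimates" bullet concrete, here is a minimal sketch using a Beta-Binomial posterior mean. The slide does not say which estimator was used, so the Beta(2, 8) prior, the campaign numbers, and the class name `BetaBinomial` are assumptions for illustration only.

```java
// Posterior mean of a conversion rate under a Beta(alpha, beta) prior.
// Small tests get pulled toward the prior; large tests are barely moved.
final class BetaBinomial {

    static double posteriorMean(long conversions, long trials, double alpha, double beta) {
        if (conversions < 0 || trials < conversions) {
            throw new IllegalArgumentException("need 0 <= conversions <= trials");
        }
        return (alpha + conversions) / (alpha + beta + trials);
    }

    public static void main(String[] args) {
        // assumed prior Beta(2, 8): a prior belief centered on a 20% rate
        double alpha = 2.0, beta = 8.0;
        // hypothetical ad tests of very different sizes
        double small = posteriorMean(3, 10, alpha, beta);       // raw rate 30%
        double large = posteriorMean(250, 1000, alpha, beta);   // raw rate 25%
        System.out.printf("small test: %.4f, large test: %.4f%n", small, large);
    }
}
```

With these made-up numbers the 10-impression test is shrunk from a raw 30% to 25%, while the 1000-impression test stays within a tenth of a point of its raw 25%, which is what lets tests of different sizes be compared on one scale.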
Wikipedia
use case: ecommerce fraud
• sparse data means lots of missing values
• “needle in a haystack” lack of training cases
• answers are available in large-scale batch, results are needed in real-time event processing
• not just one pattern to detect – many, ever-changing
approach:
• random forest (RF) classifiers predict likely fraud
• subsampled data to re-balance training sets
• impute missing values based on density functions
• train on massive log files, run on in-memory grid
• adjust metrics to minimize customer support costs
• detect novelty – report anomalies via notifications
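The "subsampled data to re-balance training sets" step can be sketched as random downsampling of the majority class. This is one possible reading of the bullet, not the deck's actual pipeline; `Rebalance`, the row labels, and the seed are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class Rebalance {

    // Keep every minority-class row; keep a random subset of the majority
    // class so that both classes end up the same size.
    static <T> List<T> downsample(List<T> majority, List<T> minority, long seed) {
        List<T> shuffled = new ArrayList<>(majority);   // leave the input untouched
        Collections.shuffle(shuffled, new Random(seed));
        int keep = Math.min(minority.size(), shuffled.size());
        List<T> balanced = new ArrayList<>(shuffled.subList(0, keep));
        balanced.addAll(minority);
        Collections.shuffle(balanced, new Random(seed + 1));  // mix the two classes
        return balanced;
    }

    public static void main(String[] args) {
        // hypothetical rows: 1000 legitimate orders, 10 known fraud cases
        List<String> ok = new ArrayList<>();
        for (int i = 0; i < 1000; i++) ok.add("ok-" + i);
        List<String> fraud = new ArrayList<>();
        for (int i = 0; i < 10; i++) fraud.add("fraud-" + i);
        System.out.println(downsample(ok, fraud, 7L).size() + " rows in the balanced set");
    }
}
```

Downsampling throws away majority-class rows, so in practice it is often repeated with different seeds, which fits naturally with training many trees in a random forest.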
stat.berkeley.edu
use case: customer segmentation
• many millions of customers, hard to determine which features resonate
• multi-modal distributions get obscured by the practice of calculating an “average”
• not much is known about individual customers
approach:
• connected components for sessionization, determining uniques from logs
• estimates for age, gender, income, geo, etc.
• clustering algorithms to group into market segments
• social graph infers “unknown” relationships
• covariance/heat maps visualize segments vs. feature sets
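"Connected components for sessionization" can be sketched with a union-find structure: every identifier seen in the logs is a node, every log line that shows two identifiers together is an edge, and each component counts as one unique customer. The identifier formats and the class name `IdentityGraph` below are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Union-find over identifiers (cookies, emails, devices) observed in logs.
final class IdentityGraph {
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative of the component containing id.
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) root = parent.get(root);
        String cur = id;                       // path compression
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Records that two identifiers were seen together in one log event.
    void link(String a, String b) {
        String ra = find(a), rb = find(b);
        if (!ra.equals(rb)) parent.put(ra, rb);
    }

    // Number of connected components, i.e. estimated unique customers.
    int uniques() {
        Set<String> roots = new HashSet<>();
        for (String id : new ArrayList<>(parent.keySet())) roots.add(find(id));
        return roots.size();
    }

    public static void main(String[] args) {
        IdentityGraph g = new IdentityGraph();
        g.link("cookie:a", "email:x");   // hypothetical log events
        g.link("email:x", "device:1");
        g.link("cookie:b", "email:y");
        System.out.println("uniques: " + g.uniques());
    }
}
```

At log scale the same idea is usually run as an iterative batch job rather than in a single in-memory map, but the component-per-customer logic is unchanged.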
MathWorks
use case: monetizing content
• need to suggest relevant content which would otherwise get buried in the back catalog
• big disconnect between inventory and limited performance ad market
• enormous amounts of text, hard to categorize
approach:
• text analytics glean key phrases from documents
• hierarchical clustering of character frequencies detects language
• Latent Dirichlet allocation (LDA) reduces dimension to topic models
• recommenders suggest similar topics to customers
• collaborative filters connect known users with less known
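As a rough illustration of the character-frequency idea, the sketch below builds a letter-frequency profile per document and compares two profiles with cosine similarity; pairwise similarities like this are the kind of input a hierarchical clustering step would consume. The clustering itself is not shown, and `CharProfile` and the choice of cosine similarity are assumptions rather than the deck's stated method.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CharProfile {

    // Relative frequency of each letter in the text (case-folded, letters only).
    static Map<Character, Double> of(String text) {
        Map<Character, Double> freq = new HashMap<>();
        int letters = 0;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetter(c)) {
                freq.merge(c, 1.0, Double::sum);
                letters++;
            }
        }
        final int total = Math.max(letters, 1);
        freq.replaceAll((ch, n) -> n / total);
        return freq;
    }

    // Cosine similarity between two profiles; 0.0 if either is empty.
    static double cosine(Map<Character, Double> a, Map<Character, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<Character, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // two short sample strings, chosen only to exercise the code
        Map<Character, Double> p = of("the quick brown fox jumps over the lazy dog");
        Map<Character, Double> q = of("el veloz zorro marrón salta sobre el perro");
        System.out.printf("similarity: %.3f%n", cosine(p, q));
    }
}
```

Short strings give noisy profiles; the approach is more plausible on full documents, where letter frequencies settle toward language-typical values.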
Digital Humanities
plus some great tools…
scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
apps: Cascading, Scalding, Cascalog, R markdown, SWF
analytics/modeling: R, Weka, Matlab, PMML, GLPK
hadoop: EMR, HW, MapR, EMC, Azure, Compute
key/val: Redis, Membase, MySQL
index: Lucene/Solr, ElasticSearch
durable storage: S3, ASV, GCS, Riak, Couch
imdg: Spark, Storm, Gigaspaces
visualization: ggplot2, D3, Gephi
graph: Gremlin, GraphLab, Neo4J
column: Vertica, HBase, Drill, Dynamo
text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
relational: usual suspects
reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
machine data: Splunk, collectd, Nagios
3. example: a Cascading app
getting started
cascading.org/category/impatient/
business process
API language
optimize / schedule
physical plan
compute substrate
machine data
Scala, Clojure, Python, Ruby, Java, etc. … envision whatever else runs in a JVM
composition of a workflow
Splunk, Nagios, Collectd, etc.
major changes in technology now
domain expertise, business trade-offs, market position, operating parameters, etc.
Apache Hadoop, in-memory local mode … envision GPUs, other frameworks, etc.
“assembler” code
1: copy
Source
Sink
M
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main {
  public static void main( String[] args ) {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
  }
}

1 mapper, 0 reducers, 10 lines of code
ten lines of code for a file copy…seems like a lot.
wait!
same JAR, any scale…
Your Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
Production Cluster: TBs of data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours – days
MegaCorp Enterprise IT: PBs of data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+
2: word count
[Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]
1 mapper, 1 reducer, 18 lines of code
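The same tokenize, group-by, count shape can be sketched in plain Java for a single in-memory document. This is not the 18-line Cascading app counted on the slide: it only mirrors the stages with the standard library, and the `\W+` tokenizer regex and the class name `WordCountSketch` are assumptions.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCountSketch {

    static Map<String, Long> count(String document) {
        return Arrays.stream(document.split("\\W+"))        // Tokenize
            .filter(token -> !token.isEmpty())              // drop empty leading token
            .collect(Collectors.groupingBy(                 // GroupBy token
                Function.identity(),
                TreeMap::new,
                Collectors.counting()));                    // Count
    }

    public static void main(String[] args) {
        // sample text made up for the example
        count("rain shadow rain, a dry area in the rain shadow")
            .forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

On a cluster the tokenize stage runs in mappers and the group-by/count stage in reducers, which is where the "1 mapper, 1 reducer" figure comes from.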
3: City of Palo Alto open data
github.com/Cascading/CoPA/wiki
• GIS export for parks, roads, trees (unstructured / open data)
• log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
• curated metadata, used to enrich the dataset
• could extend via mash-up with many available public data APIs
Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
[Diagram: CoPA workflow. Inputs: CoPA GIS export, GPS logs, Tree Metadata, Road Metadata. Operations: Regex parser and Regex filter (tsv, tree, road, park), Scrub species, Geohash, HashJoin (Left, RHS), Estimate Albedo, Road Segments, CoGroup, Tree Distance, Filter tree_dist, GroupBy tree_name, Checkpoint (tsv, shade), Failure Traps. Output: reco. M and R mark the map and reduce phases]
log events
• addr: 115 HAWTHORNE AVE
• lat/lng: 37.446, -122.168
• geohash: 9q9jh0
• tree: 413 site 2
• species: Liquidambar styraciflua
• avg height 23 m
• road albedo: 0.12
• distance: 10 m
• a short walk from my train stop ✔
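The geohash field is what lets trees, road segments, and GPS tracks be joined on a shared key. Below is a minimal sketch of the standard geohash encoding; the class name `Geohash` is hypothetical and this is not the CoPA app's own implementation. Because the coordinates on the slide are rounded to three decimals, encoding them matches the leading characters `9q9j` of the slide's value, but the trailing characters may differ.

```java
// Standard geohash: bisect longitude and latitude alternately, emit one
// base-32 character for every 5 bits.
final class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lng, int length) {
        double latLo = -90.0, latHi = 90.0, lngLo = -180.0, lngHi = 180.0;
        StringBuilder hash = new StringBuilder();
        boolean useLng = true;          // bits alternate, starting with longitude
        int bits = 0, value = 0;
        while (hash.length() < length) {
            if (useLng) {
                double mid = (lngLo + lngHi) / 2;
                if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
                else            { value = value << 1;       lngHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else            { value = value << 1;       latHi = mid; }
            }
            useLng = !useLng;
            if (++bits == 5) {          // 5 bits make one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // rounded lat/lng from the log event above
        System.out.println(encode(37.446, -122.168, 6));
    }
}
```

Nearby points usually share a geohash prefix, so grouping or joining on a fixed-length prefix is a cheap stand-in for a spatial index; points that fall just across a cell boundary are the known weak spot.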
[Figure: Estimated Tree Height (meters). x-axis: avg_height (0 to 50); y-axis: density (0.00 to 0.12); shading: count (0 to 300)]
example results
blog, code/wiki/gists, jars, list, DevOps products:
cascading.org/
github.com/Cascading/
conjars.org/
goo.gl/KQtUL
concurrentinc.com/
drill-down