Posted on 14-Jul-2015
Building Guerrilla Analytics Teams
Presented by:
Enda Ridge, PhD
People, Process and Technology for Doing Data Science
Copyright Enda Ridge 2014
What this talk is about
• Data Science: expectations and reality
• 3 Drivers for doing Data Science
• Why Data Science projects are so challenging
• Introduction to Guerrilla Analytics
• Building Guerrilla Analytics Capability
Guerrilla Analytics
People
Process
Tech
What we hear about Data Science
“Data is the new science. Big data holds the answers.”
“the sexy job in the next 10 years will be statisticians”
“Data Scientist: The Sexiest Job of the 21st Century”
“Information is the oil of the 21st century, and analytics is the combustion engine.”
http://www.gapminder.org/
http://www.statistics.com/data-science-quotes/
https://github.com/mbostock/d3/wiki/Gallery
What we really want from Data Science
• “I have made data available, now how do I use it?”
Leverage
• “I want to make data available or buy a data product. How do I know it will be worth it?”
Justify
• “I think I have a fraud problem / security breach / etc.”
• “Help me better understand my customers”
Ad-hoc
My background
PhD Computer Science
• “Design of Experiments for Tuning Algorithms”
Boutique Consultancy
• Social Network Analysis for Fraud
Forensic Data Analytics
• Professional Services
Senior Manager
• Data Science Consulting & Data Product Development
Misconception about how we do Data Science
Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22
Reality – Guerrilla Analytics
• Disruptions
• Data
• Requirements
• Resources
• Business Rules
• Constraints
• Time
• Toolsets
• People
• Repeatable
• Explainable
• Tested
Guerrilla Analytics Workflow
Data
• Extract
• Receive
• Load
Analytics
• Transform
• Algorithm
• Consolidate
Insight
• Reports
• Work Products
Disruptions
Some Guerrilla Analytics Principles
• Prefer simple project structures over heavily documented and complex ones.
• Prefer automation with program code over manual graphical approaches.
• Link data on the file system, to data in the analytics environment, to data in work products.
• Version control changes to program code AND data.
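One way to act on the last two principles above is to fingerprint every raw data file, so that a work product can cite exactly the data it was built from and any change to the data becomes visible. The following is a minimal sketch, not a method taken from the talk: the function names `fingerprint` and `build_manifest`, and the idea of one folder holding all raw data, are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file under data_dir (relative POSIX path) to its digest."""
    return {
        path.relative_to(data_dir).as_posix(): fingerprint(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }
```

Storing the manifest alongside the program code in version control gives a lightweight record of which data version each code version ran against, without committing large data files themselves.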
Building Guerrilla Analytics Capability
Leverage
Justify
Ad-hoc
Guerrilla Analytics
People
Process
Tech
People Capability
People
Hard Skills
Programming
Software Engineering
Visualization
Maths / Stats
Soft Skills
Communication
Domain Knowledge
Mindset
Capability: Data Programming
“Using a programming language to describe and execute data manipulations, data analyses, data visualizations”
Guerrilla Environment
• Wide variety of data
• Poor quality data
• Evolving understanding
• Reproduce and repeat
Benefit
• Flexibility
• Consolidation
• Knowledge transfer
• Self describing
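As an illustration of data programming applied to poor quality data, the sketch below parses a delimited extract in code rather than by hand, and keeps the rows it could not parse instead of silently dropping them. The column names `id` and `amount` and the function name `load_amounts` are hypothetical; the talk does not prescribe a language or a data layout.

```python
import csv
import io


def load_amounts(csv_text: str) -> tuple[list[dict], list[dict]]:
    """Parse rows, returning (clean_rows, rejected_rows).

    Assumes hypothetical columns 'id' and 'amount'.
    """
    clean, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Tolerate thousands separators and stray whitespace.
        raw = (row.get("amount") or "").replace(",", "").strip()
        try:
            clean.append({"id": row["id"], "amount": float(raw)})
        except ValueError:
            # Keep the original row so the rejection can be explained later.
            rejected.append(row)
    return clean, rejected
```

Because the cleaning rules live in code, they describe themselves, can be rerun when a new data delivery arrives, and can be handed over to another team member.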
Capability: Software Engineering
“the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software”
Guerrilla Environment
• Changing data
• Iterations of work products
• Reproduce despite pace
• Correctness despite complexity
Benefit
• Version control
• Testing
• Automation
• Issue/bug tracking
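Testing in this setting often means testing the data as much as the code. A minimal sketch of an automated check that could run after each transformation step follows; the function name `check_transform` and the two rules it applies (no duplicate keys, no lost keys) are illustrative assumptions rather than checks listed in the talk.

```python
def check_transform(before: list[dict], after: list[dict], key: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    before_keys = [row[key] for row in before]
    after_keys = [row[key] for row in after]
    # A join that fans out typically shows up as duplicated keys.
    if len(after_keys) != len(set(after_keys)):
        problems.append(f"duplicate {key} values after transform")
    # A filter or inner join that drops records shows up as missing keys.
    missing = set(before_keys) - set(after_keys)
    if missing:
        problems.append(f"{len(missing)} {key} values lost in transform")
    return problems
```

Running such checks automatically on every iteration helps keep work products correct despite changing data and time pressure.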
Capability: Domain Knowledge & Communication
Prefer analytics skills with great communication
Analytics
Forensic Accounting
Forensic Accountant
Data Scientist
Capability: Mindset
Guerrilla Environment
• Changing requirements
• Poorly understood data
• Constraints
• Time pressure
• Iterations
• Dead Ends
Required Capability
• Tenacity
• Curiosity
• Problem solving
• Communication
The attitude and approach to work that best matches Guerrilla Analytics
Common Misconceptions about Technology
“If we use this tech, my team don’t need to code”
“We can productionise all possible data science scenarios”
“We need to invest in a platform to get value from our data”
“We need Big Data technology X”
Technology Capability
People
Agility
Data Manipulation Environment
Scripting & Command Line
Shared Space
Visualization
Consolidate
Code Libraries
Machine Images
Project Wiki
Process Support
Source Code Control
Issue Tracking
Security
Guerrilla Analytics Workflow
Data
• Extract
• Receive
• Load
Analytics
• Transform
• Algorithm
• Consolidate
Insight
• Reports
• Work Products
Disruptions
Common Misconceptions about Process
“We must document everything”
“We can completely plan a data science job”
“We should track everything in a traditional top-down way”
“Work products must be right first time”
Process Capability
Data
• Extract
• Receive
• Load
Analytics
• Transform
• Algorithm
• Consolidate
Insight
• Reports
• Work Products
Log Data Receipt
Track Work Product Versions
Track Work Product Release
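The three tracking steps can be supported with very little tooling. The sketch below keeps a data receipt log in memory and derives a work product label that names the data receipts it was built from. Everything specific here is an assumption for illustration: the `D001` and `WP007_v02` naming schemes, the `data/` folder, and the names `DataReceipt`, `log_receipt` and `work_product_name` do not come from the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataReceipt:
    receipt_id: str    # e.g. "D001"; the ID scheme is an assumption
    source: str        # who supplied the data
    received_on: date
    location: str      # where the raw files are stored, unmodified


def log_receipt(log: list[DataReceipt], source: str, received_on: date) -> DataReceipt:
    """Append a receipt with the next sequential ID and return it."""
    receipt_id = f"D{len(log) + 1:03d}"
    receipt = DataReceipt(receipt_id, source, received_on, f"data/{receipt_id}")
    log.append(receipt)
    return receipt


def work_product_name(product_id: int, version: int, receipts: list[DataReceipt]) -> str:
    """Build a label that ties a work product version to its input data."""
    inputs = "-".join(r.receipt_id for r in receipts)
    return f"WP{product_id:03d}_v{version:02d}_{inputs}"
```

With a label of this kind on every released file, a question about a delivered number can be traced back to a specific work product version and the data deliveries behind it.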
Summary
• Leverage
• Justify
• Ad-hoc
Data Science Aims
• Disruptions
• Constraints
• Reproducible, Testable, Explainable
Guerrilla Analytics
• Hard Skills
• Soft Skills
People Capability
• Analytics Agility
• Consolidation
• Process Support
Technology Capability
• Tracking Data (Inputs)
• Tracking Work Products Creation
• Tracking Outputs
Process Capability
Keep in Touch!
@Enda_Ridge
GuerrillaAnalytics@gmail.com
www.guerrilla-analytics.net