Post on 18-Mar-2018
@GAINESK
@ITCAMPRO #ITCAMP17 Community Conference for IT Professionals
From Developer to Data Scientist
Gaines Kergosien
Executive Director, Music City Code
Associate Director, UBS
Many thanks to our sponsors & partners!
GAINES KERGOSIEN
Leader, Speaker, Problem Solver
• National gap for analytical expertise at 140k+ by 2017. –McKinsey 2011
• Shortage of 100k Data Scientists by 2020. –Gartner 2012
• 90% of clients need expertise, 40% cite lack of talent. –Accenture 2014
• Survey finds 83% of data scientists see shortage. –Crowdflower 2016
• “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.” –Google’s Chief Economist
• Data Scientist the #1 job in America for 2016 AND 2017! –GlassDoor
The Demand
The Maturity
http://www.burtchworks.com/files/2016/04/Burtch-Works-Study_DS-2016-final.pdf
The Salary
The Industry
A data scientist is a job title for an employee or
business intelligence (BI) consultant who excels at
analyzing data, particularly large amounts of data, to
help a business gain a competitive edge.
–WhatIs.com
The Definition
The Definition
The Recipe
• Classification – Is this A or B?
• Anomaly Detection – Is this weird?
• Regression – How much -or- how many?
• Clustering – How is this organized?
• Reinforcement Learning – What should I do next?
The Five Questions
• Educate the business
• Look for problems to solve
• Research new techniques
• Collate data for analysis (ETL)*
• Implement algorithms
• Design big data-capable architecture
• Present insights
The Job
• Big Data
• Fast Data
• Dark Data
• Unstructured Data
• Data Mining
• Data Visualization
• Predictive Analytics
• [Deep] Neural Network
The Buzzwords
Big Data
Volume
The Volume
Big Data
Volume
• Records
• Transactions
• Tables & Files
Variety
• Structured
• Unstructured
• Semi-structured
Unstructured Text
• Books
• Blog Posts
• Comments
• Tweets
• Photos
• Video
• Audio
The Variety
Big Data
Volume
• Records
• Transactions
• Tables & Files
Variety
• Structured
• Unstructured
• Semi-structured
Velocity
• Real Time
• Near Time
• Batch
• Streams
The Velocity
• 6,000 tweets per second
• 500 million tweets/day
• 300 million photos/day
NY Stock Exchange
• captures 1TB of trade information each session
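The two tweet figures above agree with each other, as a quick back-of-the-envelope check suggests. This sketch assumes the 6,000 tweets per second rate is a sustained average over a full day, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

tweets_per_second = 6_000  # figure from the slide
tweets_per_day = tweets_per_second * SECONDS_PER_DAY

# Roughly 500 million, matching the daily figure on the slide.
print(f"{tweets_per_day:,} tweets/day")  # 518,400,000 tweets/day
```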
Data
• Define
• Collect
• Store
• Explore
The Breakdown
Science
• Hypothesis
• Plan Approach
• Analysis
• Report Results
The Skills
Subject Matter Expertise
• Values
• Goals
• Constraints
Statistics
• Choose Procedures
• Diagnose Problems
• Develop Procedures
Hacking Expertise
• Technical Skills
• Creativity
Machine Learning
Traditional Research
Traditional Software
Data Science
The Skills
Subject Matter Expertise
Hacking Expertise
Social Sciences
Statistics
Machine Learning
Traditional Software
Data Science
Traditional Research
Traditional Research
Holistic Research
Socially Unaware
Domain Unaware
Holistic Software
The Overlap
Data Science
Big Data (Volume, Variety, Velocity)
Big Data Science
The Analysis Tools
The Tool Trends
Python
KNIME
RapidMiner
R
SPSS
SAS
Hadoop
• SQL
• Excel
• Python
• R
• MySQL
The Top Tools
• SPSS
• Matlab
• Julia
• Kafka/Storm
• R
• Python
• Java/Scala
• Stata
• SAS
The Languages
http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
The Languages – SAS, Python or R?
http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/
The Languages – Trends
http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/
The Languages – Industries
http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/
The Languages – Education
The Languages – Trends
The Languages – The Future
• R Statistical Programming Language
• Based on the S programming language
• R Development Environment
• Statistical and Visual Analysis
• Cross-Platform
• Free Open Source
• Active User Community
• Over 9,000 Extension Packages
The R
• Created in 1991 to emphasize productivity and code readability
• Easier learning curve than R
• Free Open Source
• Active User Community
The Python
• Hadoop Distributed File System (HDFS)
• MapReduce vs. YARN
• Pig
• Hive
• HBase
• Storm
• Spark
• etc.
The Hadoop Collective
Sample the Data
• Random
• Stratified
Reconcile Missing Data
• Discard
• Infer
Normalize Numeric Values
• Standard Unit of Measure
• Subtract Average (Mean = 0)
• Divide by Standard Deviation
The Wrangling
Reduce Dimensionality
• Irrelevant Input Variables
• Redundant Input Variables
Add Derivative Values
• Generalize Attributes
• Discretize Attributes to Categories
• Binarize Categorical Attributes
Design Training Data
• Select
• Combine
• Aggregate
Power and Log transformation
• Approximate Normal Distribution
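A few of the wrangling steps above can be sketched in plain Python. This is a minimal illustration, not the speaker's code: the toy data and the helper names (`impute_mean`, `standardize`, `one_hot`) are hypothetical, and mean imputation is only one of several ways to "infer" a missing value.

```python
import statistics

# Hypothetical toy column with one missing value (None).
heights_cm = [150.0, 160.0, None, 170.0, 180.0]

def impute_mean(values):
    """Infer missing entries by replacing them with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Subtract the mean (so mean = 0) and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(labels):
    """Binarize a categorical attribute: one 0/1 column per category."""
    categories = sorted(set(labels))
    return categories, [[int(label == c) for c in categories] for label in labels]

filled = impute_mean(heights_cm)    # the None becomes 165.0
z_scores = standardize(filled)      # [-1.5, -0.5, 0.0, 0.5, 1.5]
categories, encoded = one_hot(["red", "green", "red"])

print(filled)
print(z_scores)
print(categories, encoded)
```

In practice a library such as pandas or scikit-learn would usually handle these steps; the standard library is used here only to keep the sketch self-contained.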
• basic statistics (e.g., p-value)
• statistical modeling
• statistical tests
• experiment design
• distributions
• maximum likelihood estimators
• probability theory
• linear algebra
• multivariable calculus
The Math
• Tableau (enterprise visualization products) - www.tableau.com
• ggvis (R visualization package) - ggvis.rstudio.com
• ggplot (plotting system) - ggplot.yhathq.com
• D3.js (declarative DOM manipulation) - d3js.org
• Vega (visualization grammar) - trifacta.github.com/vega
• Rickshaw (charting library) - code.shutterstock.com/rickshaw
• modest maps (map library) - modestmaps.com
• Chart.js (plotting library) - www.chartjs.org
The Visualization Tools
Concepts
• k-nearest neighbors
• random forests
• ensemble methods
• …use Python libraries!
Tools
• Weka - www.cs.waikato.ac.nz/ml/weka/
The Machine Learning
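To make the first concept concrete, here is a minimal k-nearest neighbors classifier in plain Python. It is a sketch under stated assumptions: the function name `knn_predict`, the Euclidean distance metric, and the toy data are all illustrative choices, and a real project would more likely follow the slide's advice and use an existing Python library.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

print(knn_predict(train, (1.1, 1.0)))  # A
print(knn_predict(train, (5.1, 5.0)))  # B
```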
• Report
• Presentation
• Demo
• Prototype
• Component
The Results
• Data Analyst (A)
• Data Engineer (B)
• Academic (Ab)
• Generalist (AB)
The Skills
The Expertise
The Degree
The Degree – Trending
1. Fundamentals
2. Statistics
3. Programming
4. ML
5. Text Mining
6. Visualization
7. Big Data
8. Data Munging
9. Toolbox
The Path
1. Matrices & Linear Algebra
2. Hash Functions, Binary Tree
3. Relational Algebra, DB Basics
4. Inner, Outer, Cross, Theta Join
5. CAP Theorem
6. Tabular Data
7. Data Frames & Series
8. Sharding
9. OLAP
The Fundamentals
10. Multidimensional Data Model
11. ETL
12. Reporting vs BI vs Analytics
13. JSON & XML
14. NoSQL
15. Regex
16. Vendor Landscape
17. Environment Setup
The Fundamentals
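Item 4 of the fundamentals (inner, outer, cross, and theta joins) can be illustrated without a database. The sketch below models tables as lists of dicts; the table contents and function names are hypothetical, and the nested-loop approach is chosen for clarity rather than performance.

```python
from itertools import product

# Hypothetical toy tables.
employees = [
    {"name": "Ann", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cy", "dept_id": 4},
]
departments = [
    {"dept_id": 1, "dept": "Analytics"},
    {"dept_id": 3, "dept": "Ops"},
]

def theta_join(left, right, predicate):
    """Keep every (left, right) pair for which an arbitrary predicate holds."""
    return [(l, r) for l, r in product(left, right) if predicate(l, r)]

def cross_join(left, right):
    """Every left row paired with every right row (a theta join that is always true)."""
    return theta_join(left, right, lambda l, r: True)

def inner_join(left, right, key):
    """A theta join whose predicate is equality on `key`."""
    return theta_join(left, right, lambda l, r: l[key] == r[key])

def left_outer_join(left, right, key):
    """Inner join plus each unmatched left row paired with None."""
    rows = []
    for l in left:
        matches = [(l, r) for r in right if l[key] == r[key]]
        rows.extend(matches if matches else [(l, None)])
    return rows

print(len(cross_join(employees, departments)))            # 6
print(inner_join(employees, departments, "dept_id"))      # only Ann matches
print(left_outer_join(employees, departments, "dept_id")) # Bob and Cy pair with None
```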
1. Pick a Dataset
2. Descriptive Statistics
3. Exploratory Data Analysis
4. Histograms
5. Percentiles and Outliers
6. Probability Theory
7. Bayes' Theorem
8. Random Variables
9. Cumulative Distribution Function (CDF)
The Statistics
The Statistics
10. Continuous Distributions
11. Skewness
12. ANOVA
13. Probability Density Function (PDF)
14. Central Limit Theorem
15. Monte Carlo Method
16. Hypothesis Testing
17. p-Value
…
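Items 15 to 17 fit together naturally: a Monte Carlo permutation test estimates the p-value for a hypothesis test by simulation. The sketch below is one possible illustration, not a method taken from the talk; the function name `permutation_p_value`, the measurements, and the choice of a two-sided difference-in-means statistic are all assumptions.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means by Monte Carlo.

    Under the null hypothesis the group labels are exchangeable, so we shuffle
    the pooled data and count how often a difference at least as extreme as
    the observed one appears.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples

# Hypothetical measurements.
control = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
treated = [11.0, 11.4, 10.9, 11.2, 11.1, 11.3]

p = permutation_p_value(control, treated)
print(f"estimated p-value: {p:.4f}")
```

A small p-value here means that a difference this large would rarely arise from relabeling alone, which is the usual grounds for rejecting the null hypothesis.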
1. Python Basics
2. Working in Excel
3. R Setup / RStudio
4. R Basics
5. Expressions
6. Variables
7. Vectors
8. Matrices
9. Arrays
The Programming
10. Factors
11. Lists
12. Data Frames
13. Reading CSV Data
14. Reading Raw Data
15. Subsetting Data
16. Manipulate Data Frames
17. Functions
18. Factor Analysis
19. Install Packages
The Programming
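Most of the items above are R topics, but reading CSV data and subsetting it (items 13 and 15) look much the same in Python. This sketch uses only the standard library; the CSV content and column names are hypothetical, and in practice the data would come from a file rather than an in-memory string.

```python
import csv
import io

# Hypothetical CSV content; a real file would be opened with open(path, newline="").
raw = "name,age,city\nAnn,34,Nashville\nBob,27,Cluj\nCy,45,Nashville\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# The csv module yields strings, so numeric columns are converted explicitly.
for row in rows:
    row["age"] = int(row["age"])

# Subsetting: keep rows matching a condition, then select a single column.
nashville = [row for row in rows if row["city"] == "Nashville"]
ages = [row["age"] for row in nashville]

print(ages)  # [34, 45]
```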
• Coursera - www.coursera.org
• edX - www.edx.org
• Udacity - www.udacity.com
• Kaggle - www.kaggle.com
• YouTube - projects.iq.harvard.edu/stat110/youtube
• Boot Camps
The Training
Q & A
Slides at DotNetDude.net