Post on 10-Jan-2017
Data Analysis in R for Beginners
Alton Alexander
Data Science Consultant
Why R?
• R is open source, like Python and unlike SAS
• Out of the box, R is a single-machine, in-memory statistical computing engine
– Download from https://www.r-project.org/
• Use an IDE
– RStudio https://www.rstudio.com/
– Revolution Analytics (MSFT)
– Jupyter (IPython)
RStudio
Download
Overview
Essential Learning Resources
A new book for learning R
Q: What have you tried and what works?
Topics
• Data ingestion
• Manipulation
• Summary and exploration
• Writing Reports
• Interactive visualization and dashboarding
• Predictive Modeling & Forecasting
• Big Data Integrations
Demo
Options data
RStudio
Data ingestion
• Load data
– read.csv()
– library(RJDBC)
– library(RODBC)
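A minimal sketch of loading a flat file with base R's read.csv(). The file contents and column names (symbol, strike, price) are made up for illustration and are not the options data from the demo. RJDBC and RODBC need a live database connection, so they are not shown here.

```r
# Write a small made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("symbol,strike,price",
             "AAA,100,2.5",
             "BBB,105,1.2"), csv_path)

# read.csv() parses the file into a data frame, one column per CSV field.
options_df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types and the first rows.
str(options_df)
head(options_df)
```

For a real file, pass its path in place of csv_path.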
Data Structures and Manipulation
• Another major reason for using R
– Ability to work with data in Data Frames
– Like pandas in Python and data tables in SAS
• Reasons for doing data manipulation (munging)
– Feature extraction
– ETL
– Data cleansing
– Pivots, stack/unstack, aggregate, groupby, reshape
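The munging steps listed above could look roughly like this in base R. The sales data frame and its columns are invented for illustration; packages such as dplyr or reshape2 offer other ways to do the same thing.

```r
# A small made-up data frame with one missing value.
sales <- data.frame(
  region  = c("west", "west", "east", "east", "east"),
  product = c("a", "b", "a", "b", "a"),
  units   = c(10, 4, 7, NA, 3),
  price   = c(2, 5, 2, 5, 2)
)

# Data cleansing: drop rows with a missing unit count.
clean <- sales[!is.na(sales$units), ]

# Feature extraction: derive revenue from existing columns.
clean$revenue <- clean$units * clean$price

# Group-by style aggregation: total revenue per region.
by_region <- aggregate(revenue ~ region, data = clean, FUN = sum)

# Pivot: regions as rows, products as columns, summed revenue in the cells.
pivot <- xtabs(revenue ~ region + product, data = clean)

print(by_region)
print(pivot)
```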
Set Theory
SQL joins and their results
merge, sqldf in R
http://www.r-bloggers.com/manipulating-data-frames-using-sqldf-a-brief-overview/
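As a sketch of how the SQL join types map onto base R's merge(), using two made-up tables. The sqldf package, which accepts SQL strings directly, is an extra install and is not shown.

```r
# Two made-up tables that share an "id" key; ids 2 and 3 appear in both.
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(2, 3, 4), total = c(20, 35, 50))

# INNER JOIN: only ids present in both tables.
inner <- merge(customers, orders, by = "id")

# LEFT JOIN: every customer, with NA where there is no order.
left <- merge(customers, orders, by = "id", all.x = TRUE)

# RIGHT JOIN: every order, with NA where there is no customer.
right <- merge(customers, orders, by = "id", all.y = TRUE)

# FULL OUTER JOIN: every id from either table.
full <- merge(customers, orders, by = "id", all = TRUE)

print(full)
```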
Summary and Exploration
• Powerful summary functions for programmatically quantifying datasets
• Functions include:
– summary(), hist(), levels(), aggregate()
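A short sketch of these four functions on R's built-in iris dataset, which is chosen here only because it ships with R.

```r
data(iris)

# summary(): min, quartiles, mean, and max for a numeric column.
print(summary(iris$Sepal.Length))

# levels(): the distinct categories of a factor column.
print(levels(iris$Species))

# hist(): bin a numeric column. plot = FALSE returns the bins without drawing;
# drop that argument to see the chart.
h <- hist(iris$Sepal.Length, plot = FALSE)
print(h$counts)

# aggregate(): a per-group statistic, here the mean sepal length per species.
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(means)
```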
Interactive Visualization and Dashboarding
• Shiny from RStudio
• Like Tableau
– Local and server options
• Much more customizable, more coding, no GUI or click to edit
• But you can bring in powerful libraries to build web apps comparatively fast
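To give a feel for how much code a Shiny app takes, here is a minimal sketch. It assumes the shiny package is installed and uses the built-in faithful dataset. The slider and plot names ("bins", "hist") are arbitrary choices for this example.

```r
library(shiny)  # install.packages("shiny") if needed

# UI: a slider that controls the number of histogram bins, plus a plot area.
ui <- fluidPage(
  titlePanel("Old Faithful waiting times"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider value changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

# Build the app object. Nothing is served until it is run.
app <- shinyApp(ui = ui, server = server)
```

In an interactive session, runApp(app) serves the app locally. The same code can be deployed to a Shiny server.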
Predictive Modeling & Forecasting
• Examples
– Customer segmentation: unsupervised classification
– Marketing mix models: explain the coefficients
– Attribution modeling: supervised time series of events
– Multivariate testing (A/B tests with statistical significance, ANOVA)
– Lead scoring: P2B models, topic of interest, propensity to buy, expected spend
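As one illustration of the first example, customer segmentation can be sketched with k-means clustering from base R. The customer data is synthetic, and the choice of k-means with two segments is an assumption for this example, not necessarily the method used in the talk.

```r
set.seed(42)

# Synthetic customers: 50 low spenders followed by 50 high spenders.
customers <- data.frame(
  annual_spend = c(rnorm(50, mean = 200, sd = 20),
                   rnorm(50, mean = 1000, sd = 50)),
  visits       = c(rnorm(50, mean = 2, sd = 0.5),
                   rnorm(50, mean = 12, sd = 1))
)

# Scale the features so neither one dominates the distance calculation.
scaled <- scale(customers)

# Unsupervised step: ask k-means for two segments, with several random starts.
fit <- kmeans(scaled, centers = 2, nstart = 10)
customers$segment <- fit$cluster

# Profile each segment by its average spend and visit count.
profile <- aggregate(cbind(annual_spend, visits) ~ segment,
                     data = customers, FUN = mean)
print(profile)
```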
5 Libraries for Machine Learning
Allowing the machine to capture complexity:
1. gbm [Gradient Boosting Machine]
2. randomForest [Random Forest]
3. e1071 [Support Vector Machines]
Taking advantage of high-cardinality categorical or text data:
4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models]
5. tau [Text Analysis Utilities]
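A minimal sketch of one of these libraries in use, assuming the randomForest package is installed. The iris dataset, the 100/50 train/test split, and ntree = 200 are arbitrary choices for illustration.

```r
library(randomForest)  # install.packages("randomForest") if needed

set.seed(1)

# Hold out 50 of the 150 iris rows for testing.
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit a random forest that predicts species from the four measurements.
fit <- randomForest(Species ~ ., data = train, ntree = 200)

# Score the held-out rows and measure the share predicted correctly.
pred <- predict(fit, newdata = test)
accuracy <- mean(pred == test$Species)
print(accuracy)

# Variable importance: which measurements the forest relied on most.
print(importance(fit))
```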
Big Data Integration
• Single laptop is often sufficient
– Millions of rows on a 32 GB i7 laptop
• Scale using a larger server
– Often sufficient but has limitations (100s of GB)
• Clustered compute engine
– Algorithm considerations that affect performance
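A rough way to judge whether a dataset fits on one machine is to measure a sample and extrapolate. The sketch below assumes all-numeric columns at 8 bytes per value. The helper estimate_gb() is hypothetical and written for this example. R often copies data during operations, so working memory needs are usually higher than the raw size.

```r
# Measure a data frame with one million rows and three numeric columns.
n_rows <- 1e6
df <- data.frame(
  x = runif(n_rows),
  y = runif(n_rows),
  z = runif(n_rows)
)
size_mb <- as.numeric(object.size(df)) / 1024^2
print(size_mb)

# Hypothetical helper: raw size in GB of an all-numeric table,
# assuming 8 bytes per value and ignoring overhead and copies.
estimate_gb <- function(rows, cols) {
  rows * cols * 8 / 1024^3
}

# Example: 100 million rows by 10 numeric columns.
print(estimate_gb(1e8, 10))
```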
RServer
• For datasets that don't fit in memory, or for convenience, there is a server option
– A shared compute engine
– Shares resources
– Think 100+ GB of RAM
Big Data Integration - Frameworks
• H2O.ai
• SparkR
• Revolution Analytics
• In-DB processing
– Applying lead score or segmentation model in real time
– Spark, Teradata, Vertica
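Applying a lead score comes down to fitting a model once and then calling predict() on new records. The sketch below does this in plain R with a logistic regression on synthetic leads. The column names and coefficients are invented, and pushing the scoring into Spark, Teradata, or Vertica is not shown.

```r
set.seed(7)

# Synthetic historical leads with a made-up relationship to purchase.
n <- 500
leads <- data.frame(
  page_views   = rpois(n, lambda = 5),
  opened_email = rbinom(n, size = 1, prob = 0.4)
)
p <- plogis(-3 + 0.3 * leads$page_views + 1.2 * leads$opened_email)
leads$bought <- rbinom(n, size = 1, prob = p)

# Fit the lead-scoring model: a logistic regression on the two features.
model <- glm(bought ~ page_views + opened_email,
             data = leads, family = binomial)

# Score new leads as they arrive: predicted probability of buying.
new_leads <- data.frame(page_views = c(1, 12), opened_email = c(0, 1))
new_leads$score <- predict(model, newdata = new_leads, type = "response")
print(new_leads)
```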
Why R? In High Demand Nationally
Get Alton’s FREE Reports!
Go to http://frontanalysis.com/bigdatameetup/
Complete the survey including your email
I’ll email you the two reports:
1. Anonymized Summary of the Survey
2. LinkedIn Job Suggestions for a Utah Data Scientist