Post on 14-Feb-2017
Introduction to Basic statistics & R programming
History of R
• Research Project in New Zealand
• 1995: Open Source Project
• 1997: R-Core Group
• 2000: R-1.0.0 released
• 2003: R Foundation
• 2004: First international Conf.
• 2015: R-3.2.5 and R Consortium
What is R?
Language
Platform
Community
Ecosystem
• A programming language for statistics, analytics, and data science
• A data visualization framework
• Provided as Open Source
• Used by 2.5M+ data scientists, statisticians and analysts
• Taught in most university statistics programs
• Active and thriving user groups across the world
• CRAN: 7000+ freely available algorithms, test data and evaluation
• Many of these are applicable to big data if scaled
• New and recent graduates prefer it
Start working with R
• Install R
Go to https://cran.r-project.org/, choose the Windows download page, select the 'base' sub-directory, and then click on 'Download R for Windows'.
• Install RStudio (an IDE for R): http://www.rstudio.com
• Installing packages
install.packages("<package name>")
• Loading a package
library(<package name>)
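The install-then-load steps can be sketched as below. This is a minimal sketch, not an official recipe: `ensure_package` is a hypothetical helper name, and "stats" is used only as a stand-in because it ships with R, so nothing is downloaded. Swap in the name of any CRAN package.

```r
# Hypothetical helper (not part of R): install a package only if it is missing,
# then report whether it can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs network access
  }
  requireNamespace(pkg, quietly = TRUE)
}

ok <- ensure_package("stats")  # "stats" ships with R, so no download happens
library(stats)                 # attach the package to the search path
```

Note that install.packages() takes the name as a quoted string, while library() also accepts it unquoted.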
R Interfaces
Importing data from different mediums
• Flat files (text, csv)
• Excel files
• Relational databases
• Web
• Other statistical software
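As an illustration of the flat-file case only, here is a minimal sketch using base R's read.csv(). The file contents and column names are made up for the example. The other sources in the list usually need add-on packages (for instance readxl for Excel files or DBI for relational databases), which are not shown here.

```r
# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Asha,82", "Ben,75", "Chen,91"), csv_path)

# read.csv() returns a data frame; stringsAsFactors = FALSE keeps text as character.
scores <- read.csv(csv_path, stringsAsFactors = FALSE)

str(scores)          # show the structure of the imported data frame
mean(scores$score)   # average of the numeric column
```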
Data Structures in R
• Vectors - A vector consists of one or more elements, all of the same data type. The c() function is used to create a vector.
• Matrix - A matrix is a two-dimensional rectangular data set. It can be created by passing a vector to the matrix() function. All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
• Arrays - While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim argument that sets the size of each dimension.
• Dataframes - A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
• List - A list is an R object that can contain elements of many different types, such as vectors, functions and even other lists.
• Factors - A factor stores nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.
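The six structures above can be sketched in a few lines of base R. All values and names here are made up for illustration.

```r
v <- c(2, 4, 6)                          # vector: same data type throughout
m <- matrix(1:6, nrow = 2, ncol = 3)     # matrix: filled column by column from a vector
a <- array(1:24, dim = c(2, 3, 4))       # array: three dimensions here
df <- data.frame(id = 1:3,               # data frame: columns of different modes
                 city = c("Pune", "Oslo", "Lima"),
                 stringsAsFactors = FALSE)
l <- list(numbers = v, fn = mean, nested = list(1, "a"))  # list: mixed element types
f <- factor(c("low", "high", "low", "mid"))               # factor: nominal values

as.integer(f)   # integer codes stored by the factor
levels(f)       # character strings the codes map to
```

By default the levels of a factor are sorted alphabetically, so the integer codes follow that order rather than the order of first appearance.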
R Charts and Graphs
• Histogram
• Dot Plot
• Pie Chart
• Box Plot
• Scatter Plot
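Each of these chart types has a base R graphics function. The sketch below uses the built-in mtcars data set, which is an assumption of this example and not part of the slides, and draws to a null PDF device so it runs without opening a window.

```r
pdf(NULL)  # null device: nothing is written to disk or shown on screen

h <- hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
dotchart(mtcars$mpg[1:10], labels = rownames(mtcars)[1:10], main = "Dot plot of mpg")
pie(table(mtcars$cyl), main = "Cars by number of cylinders")
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cylinders")
plot(mtcars$wt, mtcars$mpg, main = "Weight vs mpg", xlab = "Weight", ylab = "mpg")

invisible(dev.off())  # close the device
```

In an interactive session such as RStudio, drop the pdf(NULL) and dev.off() lines to see the plots.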
Basic Statistics
• Inferential vs Descriptive
• Sample vs population
• Central tendencies
1. Mean
2. Median
3. Mode
• Measures of Dispersion
1. Range
2. Interquartile Range and outliers
3. Variance
4. Standard deviation
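The measures listed above map onto base R functions almost one to one. A minimal sketch on a made-up sample follows. Base R has no built-in statistical mode, so `stat_mode` is a hypothetical helper, and the 1.5 * IQR cut-off for outliers is a common rule of thumb rather than something the slides specify.

```r
x <- c(4, 8, 15, 16, 23, 42, 8)   # made-up sample

mean(x)      # arithmetic mean
median(x)    # middle value of the sorted data

# Hypothetical helper: most frequent value (first one wins on ties).
stat_mode <- function(values) {
  counts <- table(values)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(x)

diff(range(x))  # range: maximum minus minimum
IQR(x)          # interquartile range
var(x)          # sample variance (divides by n - 1)
sd(x)           # sample standard deviation

# Rule of thumb: points beyond 1.5 * IQR from the quartiles count as outliers.
q <- quantile(x, c(0.25, 0.75))
outliers <- x[x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)]
```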
Example for Basic statistics
Let's look at a demo of what we have covered till now!
Random variables
• Defined as a set of possible values from a random experiment
• Types – Discrete vs continuous
• Expected value of random variables
• The Law of large numbers
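A small simulation can make the expected value and the law of large numbers concrete. The fair six-sided die is an assumed example, not taken from the slides.

```r
set.seed(42)  # fixed seed so the simulation is reproducible

# Discrete random variable: the face shown by a fair six-sided die.
faces <- 1:6
expected_value <- sum(faces * (1 / 6))   # E[X] = sum of value * probability = 3.5

# Law of large numbers: the running mean of many rolls approaches E[X].
rolls <- sample(faces, size = 100000, replace = TRUE)
running_mean <- cumsum(rolls) / seq_along(rolls)

running_mean[c(10, 1000, 100000)]  # sample means after 10, 1,000 and 100,000 rolls
```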
Understanding Data distribution
Things to look for:
• Continuous or discrete
• Symmetry
• The upper and lower limits
• Likelihood of observing extreme values
• Probability of occurrence
Binomial Distribution
Basic assumptions:
1. Discrete distribution
2. Number of trials is fixed in advance
3. Just two outcomes for each trial
4. Trials are independent
5. All trials have the same probability of success
Uses include:
1. Estimating the probabilities of an outcome in any set of success or failure trials
2. Number of defective items in a batch size of n
3. Election results
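For the defective-items use case, base R's dbinom(), pbinom() and rbinom() cover the usual questions. The batch size and defect probability below are assumed values chosen for illustration.

```r
n <- 20     # batch size (number of trials, fixed in advance)
p <- 0.05   # assumed probability that any single item is defective

dbinom(0, size = n, prob = p)        # P(exactly 0 defective items)
pbinom(2, size = n, prob = p)        # P(at most 2 defective items)
1 - pbinom(2, size = n, prob = p)    # P(more than 2 defective items)
n * p                                # expected number of defective items

set.seed(1)
rbinom(5, size = n, prob = p)        # simulate defect counts for 5 batches
```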
Poisson Distribution
Basic assumptions:
1. Discrete distribution
2. The expected number of occurrences is proportional to the length of the time interval
3. Events occur at a constant average rate
4. Occurrences are independent
Uses include:
1. Number of events in an interval of time (or area) when the events are occurring at a constant rate
2. Call drop rate in telecom
3. Number of people arriving at a queue in a bank
4. Number of hits on a website
5. The number of typos in a book
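For the bank-queue use case, dpois() and ppois() give the probabilities directly. The rate of 3 arrivals per 10-minute interval is an assumed value for illustration.

```r
lambda <- 3   # assumed average number of arrivals per 10-minute interval

dpois(0, lambda)          # P(no arrivals in an interval)
dpois(5, lambda)          # P(exactly 5 arrivals)
ppois(5, lambda)          # P(at most 5 arrivals)
1 - ppois(5, lambda)      # P(more than 5 arrivals)

# The mean scales with interval length: over 30 minutes it is 3 * lambda.
dpois(0, 3 * lambda)      # P(no arrivals in 30 minutes)
```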
Normal Distribution
Basic assumptions:
1. Symmetrical distribution about the mean (bell-shaped curve)
2. Commonly used in inferential statistics
3. Family of distributions characterized by μ (the mean) and σ (the standard deviation)
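Uses include:
1. Probabilistic assessments of distribution of time between independent events occurring at a constant rate
2. Shape can be used to describe failure rates that are constant as a function of usage
Note: both of these uses are properties of the exponential distribution rather than the normal distribution.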
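A member of the normal family is fixed by its mean and standard deviation, and pnorm() and qnorm() answer probability and percentile questions about it. The mean of 100 and standard deviation of 15 below are assumed values for illustration.

```r
mu <- 100     # assumed mean
sigma <- 15   # assumed standard deviation

pnorm(115, mean = mu, sd = sigma)                 # P(X <= 115)
pnorm(115, mu, sigma) - pnorm(85, mu, sigma)      # P(85 <= X <= 115), about 68%
qnorm(0.975, mean = mu, sd = sigma)               # 97.5th percentile

# Symmetry about the mean: the two tails beyond mu +/- 10 are equal.
left_tail <- pnorm(mu - 10, mu, sigma)
right_tail <- 1 - pnorm(mu + 10, mu, sigma)
```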
Correlation and Regression Analysis
• Pearson’s r
• Also known as the correlation coefficient between two variables.
• Measures the strength and direction of linear correlation.
• Value is between -1 and +1
• +1 is a perfect positive correlation and -1 is a perfect negative correlation.
• Plotting the regression line (Linear regression)
1. $y = a + bx$; a is the intercept and b is the slope
2. $b = r \cdot (s_y / s_x)$ and $a = \bar{y} - b\bar{x}$, where $s_x$ and $s_y$ are the standard deviations of x and y
3. Note: Correlation is not causation
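The relationship between Pearson's r and the regression line can be checked numerically: the slope is r times the ratio of the standard deviations (s_y / s_x), and the intercept is mean(y) minus slope times mean(x). The sketch below uses the built-in mtcars data set, which is an assumption of this example, and compares the hand-computed values with lm().

```r
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # miles per gallon

r <- cor(x, y)                 # Pearson's r (the default method)
b <- r * sd(y) / sd(x)         # slope: b = r * (s_y / s_x)
a <- mean(y) - b * mean(x)     # intercept: a = mean(y) - b * mean(x)

fit <- lm(y ~ x)               # least-squares fit for comparison
coef(fit)                      # intercept and slope, matching a and b

pdf(NULL)                      # null device so the plot needs no display
plot(x, y, xlab = "Weight", ylab = "mpg")
abline(fit)                    # draw the regression line
invisible(dev.off())
```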
Big Data and R
A basic definition of Big Data is data whose size exceeds RAM capacity, which matters because R stores data in memory. So there are 3 ways to use R for Big Data:
• Extract Data as a sample/subset/summary
• Compute on the parts, repeat computation and combine results
• Compute on the whole
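The "compute on the parts and combine" approach can be sketched with plain base R by reading a file in chunks and keeping only running totals. This is an illustrative sketch: the temporary file stands in for a data set too large for RAM, and the chunk size of 1,000 rows is an arbitrary choice.

```r
set.seed(7)
# Stand-in for a file too large for RAM (small here so the sketch runs anywhere).
big_path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = rnorm(10000)), big_path, row.names = FALSE)

con <- file(big_path, open = "r")
invisible(readLines(con, n = 1))      # skip the header line
total <- 0
count <- 0
repeat {
  lines <- readLines(con, n = 1000)   # read one chunk of 1,000 rows
  if (length(lines) == 0) break       # stop when the file is exhausted
  values <- as.numeric(lines)
  total <- total + sum(values)        # compute on the part
  count <- count + length(values)
}
close(con)

chunked_mean <- total / count         # combine the partial results
```

Only one chunk is held in memory at a time, so the same pattern works when the whole file would not fit.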
Working with Big Data in R
• R can be integrated with many data warehouses and platforms, such as Hadoop, SAP HANA, SQL, Oracle etc.
• Store data in a data warehouse that has the capacity, then pass subsets from the warehouse to R or pass the R code to the data warehouse.
• Nowadays major data warehouses support R code, and that is treated as one of their selling points.
• If the data warehouse does not support R, we can still use R with the help of API packages like dplyr.
• Advantages of an API package like dplyr:
- Built-in SQL backend
- Connects to DBMS
- Transforms R code to SQL and passes it to the DBMS
- Collects results from DBMS to R
- Flexible enough to add your own SQL backend
Challenges of open source R
Lack of scalability
Inadequate access to important business data
Insufficient business agility
Limited business value
R from Microsoft brings
• Flexibility and agility
• Mindset
• Efficiency
• Speed and scalability
R Product Suite
• MS R Open
- free, open source R distribution
• MS R Server
- secure, scalable and supported distribution on top of R Open
• SQL Server 2016 R Services
- building applications in R and deploying them to production using the T-SQL interface
CRAN R, MRO and MRS Comparison
Feature | CRAN R | Microsoft R Open (MRO) | Microsoft R Server (MRS)
Data Size | In-memory | In-memory | In-memory or disk based
Speed of Analysis | Single threaded | Multi-threaded | Multi-threaded, parallel processing 1:N servers
Support | Community | Community | Community + Commercial
Analytic Breadth & Depth | 8000+ innovative analytic packages | 8000+ innovative analytic packages | 8000+ innovative packages + commercial parallel high-speed functions
License | Open Source | Open Source | Commercial license, supported release with indemnity
Microsoft R Server Platform
(Diagram: R Open and Microsoft R Server, with the components Enhanced R Interpreter, R+CRAN, DistributedR, ScaleR, ConnectR, DeployR and DevelopR)
ConnectR
• High-speed & direct connectors
• HDFS, Teradata, SAS, SPSS, EDWs, ODBC
ScaleR
• Fully-parallelized analytics
• Data prep & data distillation
• Variety of big data stats, predictive modeling & machine learning
• User tools for distributing customized R algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R
• 100% compatible with existing R scripts, functions and packages
RevoScaleR
• High-performance Math Kernel Library (MKL) to speed up linear algebra functions
SQL Server R Services: Enterprise R Analytics in SQL Server 2016
Model & Deploy In SQL16:
• Support Entire Analytics Lifecycle
• Enable R Users to Run R Inside SQL 2016
• Enable SQL Users to Extend BI Applications Using R Analytics
Advantages:
• Scale By Eliminating Movement
• Scale Using Parallelized Analytics
• Reduced Security Exposure
• SQL Skill Reuse for Data Engineering
• SQL Skill Reuse for App development
• Improved Operational Stability for Applications
SQL 2016: Prepare, Model, Operationalize
R’s popularity is growing rapidly
• R Usage Growth: Rexer Data Miner Survey, 2007-2013
• Language Popularity: IEEE Spectrum Top Programming Languages (IEEE Spectrum, July 2015; rankings for 2015 and 2014)
#9: R
Bibliography
• Datacamp tutorials
• Coursera and EdX sites
• Download the ‘swirl’ package from the CRAN repository for hands-on practice
• Subscribe to www.r-bloggers.com
• For basic statistics: www.stattrek.com