Introduction to basic statistics

Post on 14-Feb-2017


Introduction to Basic statistics & R programming

History of R

• Research project in New Zealand
• 1995: Open source project
• 1997: R-Core Group
• 2000: R-1.0.0 released
• 2003: R Foundation
• 2004: First international conference
• 2015: R-3.2.5 and R Consortium

What is R?

Language, Platform, Community, Ecosystem

• A programming language for statistics, analytics, and data science
• A data visualization framework
• Provided as open source
• Used by 2.5M+ data scientists, statisticians and analysts
• Taught in most university statistics programs
• Active and thriving user groups across the world
• CRAN: 7000+ freely available algorithms, test data and evaluations
• Many of these are applicable to big data if scaled
• New and recent graduates prefer it

Start working with R

• Install R: go to https://cran.r-project.org/, click on 'Download R for Windows', and then select the 'base' sub-directory
• Install RStudio (an IDE for R): http://www.rstudio.com
• Installing packages: install.packages("<package name>")
• Loading a package: library(<package name>)

R Interfaces

Importing data from different sources
• Flat files (text, CSV)
• Excel files
• Relational databases
• Web
• Other statistical software
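As a minimal sketch of the flat-file case, the snippet below writes a tiny CSV to a temporary file and reads it back with base R's read.csv(). The column names and values are made up for illustration. For the other sources, add-on packages such as readxl (Excel), DBI (relational databases) and haven (other statistical software) are commonly used, but they are not shown here.

```r
# Create a small CSV in a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Asha,82", "Ben,75", "Chen,91"), csv_path)

# read.csv() returns a data frame; keep text columns as character, not factor
scores <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure: one character column and one integer column
str(scores)
```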

Data Structures in R

• Vectors - A vector holds one or more elements, all of the same data type. The c() function is used to create a vector.

• Matrix - A matrix is a two-dimensional rectangular data set. It can be created by passing a vector to the matrix() function. All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.

• Arrays - While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim argument that sets the size of each dimension.

• Dataframes - A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).

• List - A list is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it.

• Factors - The factor stores the nominal values as a vector of integers in the range [ 1... k ] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
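The six structures above can be sketched in a few lines of base R. All values and variable names here are illustrative only.

```r
v <- c(10, 20, 30)                      # vector: every element has the same type
m <- matrix(1:6, nrow = 2, ncol = 3)    # matrix: filled column by column from a vector
a <- array(1:24, dim = c(2, 3, 4))      # array: here with three dimensions

# data frame: columns may have different modes
df <- data.frame(id = 1:3, grade = c("A", "B", "A"), stringsAsFactors = FALSE)

# list: can hold vectors, functions and even other lists
lst <- list(numbers = v, fn = mean, nested = list(1, "a"))

# factor: integer codes plus a character vector of levels
f <- factor(c("low", "high", "low", "medium"))
levels(f)        # levels are sorted alphabetically: "high" "low" "medium"
as.integer(f)    # codes into the levels: 2 1 2 3
```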

R Charts and Graphs
• Histogram
• Dot Plot
• Pie Chart
• Box Plot
• Scatter Plot
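Each chart type listed above has a base R graphics function. The sketch below uses the built-in mtcars data set (an assumption, since the slides do not name a data set) and writes the plots to a temporary PDF so it runs without a display.

```r
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)   # send all plots to a PDF file, one per page

hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
dotchart(mtcars$mpg, labels = rownames(mtcars), main = "Dot plot of mpg")
pie(table(mtcars$cyl), main = "Cars by number of cylinders")
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by number of cylinders")
plot(mtcars$wt, mtcars$mpg, main = "mpg vs weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

dev.off()        # close the device so the file is written
```

In an interactive session such as RStudio, the pdf() and dev.off() calls can be dropped and the plots appear in the plot pane.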

Basic Statistics
• Inferential vs descriptive
• Sample vs population
• Central tendencies
  1. Mean
  2. Median
  3. Mode
• Measures of dispersion
  1. Range
  2. Interquartile range and outliers
  3. Variance
  4. Standard deviation
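The measures listed above map onto base R functions as sketched below. The sample values are invented. Base R has no function for the statistical mode (mode() returns the storage type), so the mode is computed from a frequency table, and the 1.5 * IQR rule for outliers is one common convention rather than the only one.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # illustrative sample

# Central tendencies
mean(x)
median(x)                                                   # 5
counts <- table(x)
stat_mode <- as.numeric(names(counts)[which.max(counts)])   # most frequent value: 4

# Measures of dispersion
value_range <- diff(range(x))   # max minus min: 28
iqr <- IQR(x)                   # Q3 - Q1 with R's default quantile method: 4
var(x)                          # sample variance (divides by n - 1)
sd(x)                           # sample standard deviation

# Flag outliers with the 1.5 * IQR rule
q <- quantile(x, c(0.25, 0.75))
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- x[x < lower | x > upper]   # 30
```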

Example for Basic statistics

Let's look at a demo of what we have covered so far!

Random variables
• Defined as a set of possible values from a random experiment
• Types: discrete vs continuous
• Expected value of random variables
• The law of large numbers
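A short simulation can illustrate the expected value and the law of large numbers. The fair six-sided die and the seed are assumptions chosen for illustration.

```r
faces <- 1:6
probs <- rep(1 / 6, 6)

# Expected value of a discrete random variable: sum of value * probability
expected <- sum(faces * probs)   # 3.5

# Simulate many rolls and track the running average
set.seed(42)
rolls <- sample(faces, size = 10000, replace = TRUE)
running_mean <- cumsum(rolls) / seq_along(rolls)

# The running average drifts toward the expected value as rolls accumulate
running_mean[c(10, 100, 1000, 10000)]
```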

Understanding Data Distribution
Things to look for:
• Continuous or discrete
• Symmetry
• The upper and lower limits
• Likelihood of observing extreme values
• Probability of occurrence

Binomial Distribution

Basic assumptions:
1. Discrete distribution
2. The number of trials is fixed in advance
3. Just two outcomes for each trial
4. Trials are independent
5. All trials have the same probability of success

Uses include:
1. Estimating the probabilities of an outcome in any set of success or failure trials
2. Number of defective items in a batch of size n
3. Election results
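The defective-items use case can be sketched with R's binomial functions. The batch size and defect probability below are hypothetical values chosen for illustration.

```r
n_items  <- 20     # hypothetical batch size (number of trials)
p_defect <- 0.05   # hypothetical probability that a single item is defective

# P(exactly 2 defective items in the batch)
p_exactly_2 <- dbinom(2, size = n_items, prob = p_defect)

# P(at most 2 defective items in the batch)
p_at_most_2 <- pbinom(2, size = n_items, prob = p_defect)

# The probabilities over all possible counts 0..n sum to 1
total_prob <- sum(dbinom(0:n_items, size = n_items, prob = p_defect))

# Simulate the number of defective items in 5 batches
set.seed(1)
rbinom(5, size = n_items, prob = p_defect)
```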

Poisson Distribution

Basic assumptions:
1. Discrete distribution
2. The expected number of occurrences is proportional to the length of the time interval
3. Events occur at a constant average rate
4. Occurrences are independent

Uses include:
1. Number of events in an interval of time (or area) when the events are occurring at a constant rate
2. Call drop rate in telecom
3. Number of people arriving at a queue in a bank
4. Number of hits on a website
5. The number of typos in a book
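The bank-queue use case can be sketched with R's Poisson functions. The arrival rate below is a hypothetical value.

```r
rate <- 3   # hypothetical average number of arrivals per 10-minute interval

# P(no arrivals in an interval); for a Poisson variable this equals exp(-rate)
p_none <- dpois(0, lambda = rate)

# P(at most 5 arrivals) and P(more than 5 arrivals)
p_at_most_5   <- ppois(5, lambda = rate)
p_more_than_5 <- 1 - p_at_most_5

# Simulate arrivals for 8 intervals
set.seed(1)
rpois(8, lambda = rate)
```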

Normal Distribution

Basic assumptions:
1. Symmetrical distribution about the mean (bell-shaped curve)
2. Commonly used in inferential statistics
3. Family of distributions characterized by μ (mean) and σ (standard deviation)

Uses include:
1. Probabilistic assessments of distribution of time between independent events occurring at a constant rate
2. Shape can be used to describe failure rates that are constant as a function of usage
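The role of the two parameters can be sketched with R's normal distribution functions. The mean and standard deviation below are hypothetical values.

```r
mu    <- 100   # hypothetical mean
sigma <- 15    # hypothetical standard deviation

# Probability of falling within 1 and 2 standard deviations of the mean
within_1sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)       # about 0.68
within_2sd <- pnorm(mu + 2 * sigma, mean = mu, sd = sigma) -
  pnorm(mu - 2 * sigma, mean = mu, sd = sigma)   # about 0.95

# Value below which 97.5% of observations fall (about mu + 1.96 * sigma)
upper_975 <- qnorm(0.975, mean = mu, sd = sigma)

# Symmetry: the density is equal at the same distance on either side of the mean
dnorm(mu - 10, mu, sigma) == dnorm(mu + 10, mu, sigma)
```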

Correlation and Regression Analysis

• Pearson's r
  • Also known as the correlation coefficient between two variables.
  • Measures the strength and direction of linear correlation.
  • Value is between -1 and +1.
  • +1 is a perfect positive correlation and -1 is a perfect negative correlation.

• Plotting the regression line (linear regression)
  1. $y = a + bx$; a is the intercept and b is the slope
  2. $b = r \cdot (s_y / s_x)$ and $a = \bar{y} - b\bar{x}$, where $s_x$ and $s_y$ are the standard deviations of x and y, and $\bar{x}$ and $\bar{y}$ are their means
  3. Note: Correlation is not causation
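The relationship between Pearson's r and the least-squares line can be checked in R. The sketch below uses the built-in mtcars data set (an assumption; the slides do not name a data set) and compares the slope and intercept computed from r, the standard deviations and the means with the coefficients returned by lm().

```r
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # miles per gallon

# Pearson's r: strength and direction of the linear relationship
r <- cor(x, y)

# Slope and intercept from r, the standard deviations and the means
b <- r * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# The same line fitted by least squares
fit <- lm(y ~ x)
coef(fit)   # intercept and slope, matching a and b

# In an interactive session, draw the points and the fitted line with:
#   plot(x, y); abline(fit)
```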

Big Data and R

A basic definition of Big Data is data whose size exceeds RAM capacity, whereas R stores data in memory. There are three ways to use R for Big Data:
• Extract data as a sample, subset, or summary
• Compute on the parts, repeat the computation, and combine the results
• Compute on the whole
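The "compute on the parts and combine" approach can be sketched in base R by reading a file in chunks and accumulating partial results. The file here is a small temporary CSV standing in for one too large for memory, and the chunk size is arbitrary.

```r
# Stand-in for a large file: one numeric column with 1000 rows
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), csv_path, row.names = FALSE)

con <- file(csv_path, open = "r")
invisible(readLines(con, n = 1))   # skip the header line

total <- 0   # running sum across chunks
n <- 0       # running count across chunks
repeat {
  lines <- readLines(con, n = 100)   # read the next chunk of up to 100 rows
  if (length(lines) == 0) break      # stop at end of file
  values <- as.numeric(lines)
  total <- total + sum(values)
  n <- n + length(values)
}
close(con)

# Combine the partial results into the overall mean
chunked_mean <- total / n
```

Only one chunk is held in memory at a time, so the same pattern applies to any statistic that can be built from per-chunk summaries, such as sums, counts, minima and maxima.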

Working with Big Data in R
• R can be integrated with many data stores, such as Hadoop, SAP HANA, SQL databases, and Oracle.
• Store the data in a data warehouse that has the capacity, then pass subsets from the warehouse to R or pass the R code to the data warehouse.
• Major data warehouses now support R code, and this is treated as one of their selling points.
• If the data warehouse does not support R, we can still use R with the help of interface packages like dplyr.
• Advantages of an interface package like dplyr:
  • Built-in SQL backend
  • Connects to the DBMS
  • Transforms R code to SQL and passes it to the DBMS
  • Collects results from the DBMS into R
  • Flexible enough to add your own SQL backend
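A rough sketch of the dplyr workflow described above follows. It assumes the dplyr, dbplyr, DBI and RSQLite packages are installed, and uses an in-memory SQLite database holding the built-in mtcars data as a stand-in for a real warehouse.

```r
library(dplyr)   # dbplyr must also be installed; dplyr uses it for SQL backends

# Connect to the DBMS (here an in-memory SQLite database as a stand-in)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "cars", mtcars)

# Refer to the remote table; no data is pulled into R yet
cars_tbl <- tbl(con, "cars")

# Ordinary dplyr code, translated to SQL behind the scenes
query <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

show_query(query)          # print the SQL that will be sent to the DBMS
result <- collect(query)   # run the query and bring the results into R

DBI::dbDisconnect(con)
```

The query is only executed when collect() is called, so the grouping and averaging happen inside the database rather than in R's memory.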

Challenges of open source R
• Lack of scalability
• Inadequate access to important business data
• Insufficient business agility
• Limited business value

R from Microsoft brings
• Flexibility and agility
• Mindset
• Efficiency
• Speed and scalability

R Product Suite
• MS R Open - free, open source R distribution
• MS R Server - secure, scalable and supported distribution on top of R Open
• SQL Server 2016 R Services - building applications in R and deploying them to production using the T-SQL interface

CRAN R, Microsoft R Open (MRO) and Microsoft R Server (MRS) Comparison

• Data size
  • CRAN R: In-memory
  • MRO: In-memory
  • MRS: In-memory or disk-based
• Speed of analysis
  • CRAN R: Single-threaded
  • MRO: Multi-threaded
  • MRS: Multi-threaded, parallel processing across 1:N servers
• Support
  • CRAN R: Community
  • MRO: Community
  • MRS: Community + commercial
• Analytic breadth and depth
  • CRAN R: 8000+ innovative analytic packages
  • MRO: 8000+ innovative analytic packages
  • MRS: 8000+ innovative packages + commercial parallel high-speed functions
• License
  • CRAN R: Open source
  • MRO: Open source
  • MRS: Commercial license, supported release with indemnity

Microsoft R Server Platform

R Open, Microsoft R Server

Components: Enhanced R Interpreter, R+CRAN, DistributedR, ScaleR, ConnectR, DeployR, DevelopR

• ConnectR
  • High-speed and direct connectors
  • HDFS, Teradata, SAS, SPSS, EDWs, ODBC
• ScaleR
  • Fully parallelized analytics
  • Data prep and data distillation
  • Variety of big data stats, predictive modeling and machine learning
  • User tools for distributing customized R algorithms across nodes
• DistributedR
  • Distributed computing framework
  • Delivers cross-platform portability
• R+CRAN
  • Open source R
  • 100% compatible with existing R scripts, functions and packages
• RevoScaleR
  • High-performance Math Kernel Library (MKL) to speed up linear algebra functions

SQL Server R Services: Enterprise R Analytics in SQL Server 2016

Model & Deploy In SQL16:

• Support Entire Analytics Lifecycle

• Enable R Users to Run R Inside SQL 2016

• Enable SQL Users to Extend BI Applications Using R Analytics

Advantages:

• Scale By Eliminating Movement

• Scale Using Parallelized Analytics

• Reduced Security Exposure

• SQL Skill Reuse for Data Engineering

• SQL Skill Reuse for App development

• Improved Operational Stability for Applications

SQL 2016: Prepare, Model, Operationalize

Language Popularity
• IEEE Spectrum Top Programming Languages, 2015 and 2014 rankings (IEEE Spectrum, July 2015)
• #9: R

R's popularity is growing rapidly
• R Usage Growth: Rexer Data Miner Survey, 2007-2013

Bibliography
• DataCamp tutorials
• Coursera and edX sites
• Download the 'swirl' package from the CRAN repository for hands-on practice
• Subscribe to www.r-bloggers.com
• For basic statistics: www.stattrek.com