
Introduction to basic statistics

Date post: 14-Feb-2017
Upload: ibm
Transcript
Page 1: Introduction to basic statistics

Introduction to Basic statistics & R programming

Page 2: Introduction to basic statistics

History of R

• Research project in New Zealand
• 1995: Open source project
• 1997: R-Core Group
• 2000: R-1.0.0 released
• 2003: R Foundation
• 2004: First international conference
• 2015: R-3.2.5 and R Consortium

Page 3: Introduction to basic statistics

What is R?

Language, Platform, Community, Ecosystem

• A programming language for statistics, analytics, and data science
• A data visualization framework
• Provided as Open Source
• Used by 2.5M+ data scientists, statisticians and analysts
• Taught in most university statistics programs
• Active and thriving user groups across the world
• CRAN: 7000+ freely available algorithms, test data and evaluation
• Many of these are applicable to big data if scaled
• New and recent graduates prefer it

Page 4: Introduction to basic statistics

Start working with R
• Install R
Go to https://cran.r-project.org/, click on 'Download R for Windows', and then select the 'base' sub-directory
• Install RStudio: http://www.rstudio.com
• Installing packages
install.packages("<package name>")
• Loading a package
library(<package name>)
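The install-then-load steps above can be combined into one small helper. This is only a sketch: the name `ensure_package` is hypothetical, and `stats` is used in the demonstration because it ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # fetches the package from a CRAN mirror
  }
  library(pkg, character.only = TRUE)  # attach it to the search path
  invisible(TRUE)
}

# "stats" is part of every R installation, so this call only loads it.
ensure_package("stats")
```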

Page 5: Introduction to basic statistics

R Interfaces

Importing data from different mediums
• Flat files (text, csv)
• Excel files
• Relational databases
• Web
• Other statistical software
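As an illustration of the flat-file case, the sketch below writes a tiny CSV to a temporary file and reads it back with base R's `read.csv()`. The file contents and the column names `name` and `score` are made up for the example.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,82", "Bob,75", "Cid,91"), csv_path)

# read.csv() returns a data frame; stringsAsFactors = FALSE keeps text as character.
scores <- read.csv(csv_path, stringsAsFactors = FALSE)
str(scores)
```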

Page 6: Introduction to basic statistics

Data Structures in R
• Vectors - A vector consists of one or more elements of the same datatype. The c() function is used to create a vector.
• Matrix - A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function. All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
• Arrays - While matrices are confined to two dimensions, arrays can have any number of dimensions. The array function takes a dim attribute which creates the required number of dimensions.
• Dataframes - A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
• List - A list is an R object which can contain many different types of elements, such as vectors, functions and even another list.
• Factors - A factor stores nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
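A minimal sketch of each structure described above, using base R only. All variable names and values are illustrative.

```r
v <- c(10, 20, 30)                          # vector: same type throughout
m <- matrix(1:6, nrow = 2, ncol = 3)        # matrix: filled column by column
a <- array(1:24, dim = c(2, 3, 4))          # array: three dimensions here
df <- data.frame(id = 1:3,
                 grade = c("a", "b", "a"),
                 stringsAsFactors = FALSE)  # columns of different modes
l <- list(numbers = v, fun = mean, nested = list(1, "x"))
f <- factor(c("low", "high", "low", "mid"))

levels(f)       # the unique values, stored once as character strings
as.integer(f)   # integer codes that index into levels(f)
```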

Page 7: Introduction to basic statistics

R Charts and Graphs
• Histogram
• Dot Plot
• Pie Chart
• Box Plot
• Scatter Plot
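Each chart type listed above has a base-graphics function. The sketch below draws all five from the built-in `mtcars` data set; the choice of columns is arbitrary. The `pdf(NULL)` call is only there so the script runs without opening a window and can be dropped in an interactive session.

```r
pdf(NULL)  # null graphics device; remove this line to see the plots interactively

h <- hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
dotchart(mtcars$mpg, labels = rownames(mtcars), cex = 0.6, main = "Dot plot of mpg")
pie(table(mtcars$cyl), main = "Cars by cylinder count")
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cylinders")
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "mpg", main = "Scatter plot")

invisible(dev.off())  # close the device
```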

Page 8: Introduction to basic statistics

Basic Statistics
• Inferential vs Descriptive
• Sample vs population
• Central tendencies
1. Mean
2. Median
3. Mode
• Measures of Dispersion
1. Range
2. Interquartile Range and outliers
3. Variance
4. Standard deviation
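The measures listed above map onto base R functions, as sketched here on a small made-up sample. Two things are assumptions of this example: R has no built-in statistical mode, so `stat_mode` is a hypothetical helper, and outliers are flagged with the common 1.5 × IQR rule.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)  # made-up sample

# Hypothetical helper: most frequent value (first one wins on ties).
stat_mode <- function(v) {
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

# Central tendencies
mean(x)
median(x)
stat_mode(x)

# Measures of dispersion
diff(range(x))                   # range as a single number
iqr <- IQR(x)                    # interquartile range
q <- quantile(x, c(0.25, 0.75))  # first and third quartiles
outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
var(x)
sd(x)
```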

Page 9: Introduction to basic statistics

Example for Basic statistics

Let's look at a demo of what we have covered so far!

Page 10: Introduction to basic statistics

Random variables
• Defined as a set of possible values from a random experiment
• Types – Discrete vs continuous
• Expected value of random variables
• The Law of large numbers
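Expected value and the law of large numbers can be illustrated by simulating a fair six-sided die. The die, the seed and the sample size are all choices made for this sketch. The running mean of the rolls should drift toward the expected value as the number of rolls grows.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

faces <- 1:6
expected <- sum(faces * (1 / 6))  # expected value of a fair die

rolls <- sample(faces, size = 100000, replace = TRUE)
running_mean <- cumsum(rolls) / seq_along(rolls)

# Sample mean after 10, 1,000 and 100,000 rolls
running_mean[c(10, 1000, 100000)]
```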

Page 11: Introduction to basic statistics

Understanding Data distribution
Things to look for:
• Continuous or discrete
• Symmetry
• The upper and lower limits
• Likelihood of observing extreme values
• Probability of occurrence

Page 12: Introduction to basic statistics

Binomial Distribution

Basic assumptions:
1. Discrete distribution
2. Number of trials is fixed in advance
3. Just two outcomes for each trial
4. Trials are independent
5. All trials have the same probability of success

Uses include:
1. Estimating the probabilities of an outcome in any set of success or failure trials
2. Number of defective items in a batch of size n
3. Election results
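The defective-items use case can be sketched with R's `dbinom()` and `pbinom()`. The batch size of 10 and the 10% defect probability are assumed values chosen only for illustration.

```r
n <- 10   # batch size (assumed)
p <- 0.1  # probability that any one item is defective (assumed)

p_none <- dbinom(0, size = n, prob = p)       # P(X = 0): no defective items
p_at_most_2 <- pbinom(2, size = n, prob = p)  # P(X <= 2)
p_at_least_1 <- 1 - p_none                    # P(X >= 1)
```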

Page 13: Introduction to basic statistics

Poisson Distribution

Basic assumptions:
1. Discrete distribution
2. Occurrences are proportional to the length of the time interval
3. Events occur at a constant average rate
4. Occurrences are independent

Uses include:
1. Number of events in an interval of time (or area) when the events are occurring at a constant rate
2. Call drop rate in telecom
3. Number of people arriving at a queue in a bank
4. Number of hits on a website
5. The number of typos in a book
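For the queue-arrival use case, `dpois()` and `ppois()` give point and tail probabilities. The rate of 3 arrivals per interval is an assumed value for this sketch.

```r
lambda <- 3  # assumed average number of arrivals per interval

p_zero <- dpois(0, lambda)                             # P(X = 0)
p_more_than_5 <- ppois(5, lambda, lower.tail = FALSE)  # P(X > 5)
```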

Page 14: Introduction to basic statistics

Normal Distribution

Basic assumptions:
1. Symmetrical distribution about the mean (bell-shaped curve)
2. Commonly used in inferential statistics
3. Family of distributions characterized by μ (mean) and σ (standard deviation)

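Uses include:
1. Probabilistic assessments of the distribution of time between independent events occurring at a constant rate
2. Shape can be used to describe failure rates that are constant as a function of usage

Note: both of these uses describe the exponential distribution, not the normal distribution. Typical uses of the normal distribution include modeling measurement error and approximating the sampling distribution of a mean.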
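A short sketch of working with a normal distribution in R through `pnorm()` and `qnorm()`. The mean of 100 and standard deviation of 15 are assumed values. The first calculation gives the probability of falling within one standard deviation of the mean.

```r
mu <- 100    # assumed mean
sigma <- 15  # assumed standard deviation

# Probability of a value within one standard deviation of the mean
within_1sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)

# Standard normal quantile that leaves 2.5% in the upper tail
z_975 <- qnorm(0.975)
```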

Page 15: Introduction to basic statistics

Correlation and Regression Analysis
• Pearson's r
• Also known as the correlation coefficient between two variables.
• Measures the strength and direction of linear correlation.
• Value is between -1 and +1
• +1 is a perfect positive correlation and -1 is a perfect negative correlation.
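Pearson's r is available in base R as `cor()`. The sketch below compares it with the textbook formula on five made-up points.

```r
x <- c(1, 2, 3, 4, 5)  # made-up data
y <- c(2, 4, 5, 4, 5)

r <- cor(x, y)  # Pearson's r is the default method

# Same value from the definition: covariance over the product of standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
```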

Page 16: Introduction to basic statistics

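• Plotting the regression line (Linear regression)
1. $y = a + bx$; a is the intercept and b is the slope
2. $b = r \cdot (s_y / s_x)$ and $a = \bar{y} - b\bar{x}$, where $s_x$ and $s_y$ are the standard deviations of x and y
3. Note: Correlation is not causation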
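The slope and intercept formulas can be checked against R's `lm()` on a small made-up data set. The sketch computes b and a by hand from r, the standard deviations and the means, then fits the same line with `lm()`.

```r
x <- c(1, 2, 3, 4, 5)  # made-up data
y <- c(2, 4, 5, 4, 5)

# Slope and intercept from the correlation coefficient
b <- cor(x, y) * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# The same line fitted by least squares
fit <- lm(y ~ x)
coef(fit)  # intercept first, then slope
```

To draw the line, `plot(x, y)` followed by `abline(fit)` adds it to a scatter plot.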

Page 17: Introduction to basic statistics

Big Data and R
A basic definition of Big Data is data whose size exceeds RAM capacity, while R stores data in memory. So there are 3 ways to use R for Big Data:
• Extract data as a sample/subset/summary
• Compute on the parts, repeat computation and combine results
• Compute on the whole
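The "compute on the parts and combine" approach can be sketched with a chunked mean. Here a small in-memory vector stands in for data that would not fit in RAM, and the chunk size of 100 is arbitrary. In practice each chunk would be read from disk or a database in turn.

```r
set.seed(1)       # arbitrary seed
x <- rnorm(1000)  # stands in for data too large to load at once

# Split into ten chunks of 100 values each
chunks <- split(x, ceiling(seq_along(x) / 100))

# Compute on the parts: keep only a sum and a count per chunk
partial <- lapply(chunks, function(ch) c(sum = sum(ch), n = length(ch)))

# Combine the partial results into the overall mean
totals <- Reduce(`+`, partial)
chunked_mean <- unname(totals["sum"] / totals["n"])
```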

Page 18: Introduction to basic statistics

Working with Big Data in R
• R can be integrated with a lot of other data warehouses like Hadoop, SAP HANA, SQL, Oracle etc.
• Store data in a data warehouse that has the capacity, then pass subsets from the warehouse to R or pass the R code to the data warehouse.
• Nowadays major data warehouses support R code, and that is treated as one of the selling points.
• If the data warehouse does not support R, we can still use R with the help of API packages like dplyr.
• Advantages of an API package like dplyr:
- Built-in SQL backend
- Connects to DBMS
- Transforms R code to SQL and passes it to the DBMS
- Collects results from DBMS to R
- Flexible enough to add your own SQL backend

Page 19: Introduction to basic statistics

Challenges of open source R

Lack of scalability

Inadequate access to important business data

Insufficient business agility

Limited business value

Page 20: Introduction to basic statistics

R from Microsoft brings
• Flexibility and agility
• Mindset
• Efficiency
• Speed and scalability

Page 21: Introduction to basic statistics

R Product Suite
• MS R Open
- Free, open source R distribution
• MS R Server
- Secure, scalable and supported distribution on top of R Open
• SQL Server 2016 R Services
- Building applications in R and deploying them to production using T-SQL interface

Page 22: Introduction to basic statistics

CRAN R, MRO and MRS Comparison
(values listed as CRAN R / Microsoft R Open / Microsoft R Server)
• Data Size: In-memory / In-memory / In-memory or disk based
• Speed of Analysis: Single threaded / Multi-threaded / Multi-threaded, parallel processing 1:N servers
• Support: Community / Community / Community + Commercial
• Analytic Breadth & Depth: 8000+ innovative analytic packages / 8000+ innovative analytic packages / 8000+ innovative packages + commercial parallel high-speed functions
• License: Open Source / Open Source / Commercial license, supported release with indemnity

Page 23: Introduction to basic statistics

Microsoft R Server Platform
R Open, Microsoft R Server
Components: Enhanced R Interpreter, R+CRAN, DistributedR, ScaleR, ConnectR, DeployR, DevelopR

ConnectR
• High-speed & direct connectors
• HDFS, Teradata, SAS, SPSS, EDWs, ODBC

ScaleR
• Fully-parallelized analytics
• Data prep & data distillation
• Variety of big data stats, predictive modeling & machine learning
• User tools for distributing customized R algorithms across nodes

DistributedR
• Distributed computing framework
• Delivers cross-platform portability

R+CRAN
• Open source R
• 100% compatible with existing R scripts, functions and packages

RevoScaleR
• High-performance Math Kernel Library (MKL) to speed up linear algebra functions

Page 24: Introduction to basic statistics

SQL Server R Services: Enterprise R Analytics in SQL Server 2016

Model & Deploy In SQL16:

• Support Entire Analytics Lifecycle

• Enable R Users to Run R Inside SQL 2016

• Enable SQL Users to Extend BI Applications Using R Analytics

Advantages:

• Scale By Eliminating Movement

• Scale Using Parallelized Analytics

• Reduced Security Exposure

• SQL Skill Reuse for Data Engineering

• SQL Skill Reuse for App development

• Improved Operational Stability for Applications

SQL 2016: Prepare, Model, Operationalize

Page 25: Introduction to basic statistics

R's popularity is growing rapidly
• Language popularity: IEEE Spectrum Top Programming Languages, IEEE Spectrum July 2015 (2015 and 2014 rankings; #9: R)
• R usage growth: Rexer Data Miner Survey, 2007-2013

Page 26: Introduction to basic statistics

Bibliography
• Datacamp tutorials
• Coursera and EdX sites
• Download the 'swirl' package from the CRAN repository for hands-on practice
• Subscribe to www.r-bloggers.com
• For basic statistics: www.stattrek.com

