clusteRing the woRld

Post on 27-Jul-2020

clusteRing the woRld

basic machine leaRning with R

beRnaRdus aRi kuncoRo

V3.0

woRkshop

about mE

Hi, I am Ari!

Work

School

quiZ

How many R letters are written on my _____ slide?

A. 3    B. 6    C. 9    D. 12

downloaD

http://arikuncoro.xyz/fff

agendA

Quick intro to Dataset and R

Data Exploration

Data Preparation

Clustering

Use case

quotE

“Learning from data is virtually universally useful. Master it and you will be welcomed anywhere.”

John Elder, Elder Research

evidencE

“In 2017 more than 10 companies approached me to work as a Data Scientist.”

quick intro to dataset & R

Part - 1

dataseT

World Happiness Report: https://www.kaggle.com/unsdsn/world-happiness

Download

1-2-3 in R

Install R & RStudio

Install Packages

Code

What is R?

R is a language and environment for statistical computing and graphics.

R was created by Ross Ihaka and Robert Gentleman at Univ of Auckland.

Runs on a wide variety of UNIX platforms (Linux, macOS, FreeBSD) and on Windows.

Source: https://www.r-project.org/about.html
Source: https://edu.kpfu.ru/mod/page/view.php?id=35064

How to Install?

• Install R
  • Download the R installer: https://cran.r-project.org
  • Choose a version (Windows, Mac, Linux available)
  • Run the R installer. Leave all default settings in the installation options.

• Install RStudio as the integrated development environment (IDE) for R
  • Download RStudio from https://www.rstudio.com/products/rstudio/download (choose a version and install it)
  • Leave all default settings in the installation options

RStudio

Source: http://dss.princeton.edu/training/RStudio101.pdf
Learn more: https://www.rstudio.com/online-learning/

Some Useful R Links

Find out the link that suits you! ☺

• Official page: www.r-project.org
• CRAN download page: https://cran.r-project.org
• Microsoft R Open official page: https://mran.revolutionanalytics.com/open
• RStudio IDE: www.rstudio.com
• Streams of news and articles: http://www.r-bloggers.com
• Question and answer: http://stackoverflow.com/tags/r
• Some useful R links: http://stats.stackexchange.com/questions/138/free-resources-for-learning-r
• ... need more? Google it

Online courses from:
- Datacamp
- Coursera
- edX
- Udemy
- And many more…

R packages for data science

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.

Install the complete tidyverse with

install.packages("tidyverse")

::tidyversE

Please also install these packages for our clustering work: “corrplot”, “plotly”, “cluster”, “fpc”.
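Before the demo it can help to check which of these packages are still missing. The snippet below is only a sketch: the variable names `needed` and `not.installed` are made up for illustration, and it prints the install command instead of running it.

```r
# Packages named on the slide
needed <- c("corrplot", "plotly", "cluster", "fpc")

# requireNamespace() returns TRUE when a package can be loaded
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not.installed <- needed[!installed]

if (length(not.installed) > 0) {
  # Show the command to run; nothing is installed automatically
  message("Run: install.packages(c(",
          paste0('"', not.installed, '"', collapse = ", "), "))")
} else {
  message("All clustering packages are already installed.")
}
```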

Demo: Getting Started

Basic R Syntax & Data Types

Variable Assignment: <- or =

Type “Hello World!” in the console, then hit Enter. Or, assign it to a variable:

> my.string <- "Hello World!"
or
> my.string = "Hello World!"

Display the value: > print(my.string) or simply > my.string

[1] "Hello World!"

Arithmetic in R
Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%
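The operators can be tried directly in the console. A small sketch; the values of `a` and `b` are made up for illustration:

```r
# Hypothetical values chosen only for illustration
a <- 7
b <- 2

sum_ab  <- a + b    # addition
diff_ab <- a - b    # subtraction
prod_ab <- a * b    # multiplication
quot_ab <- a / b    # division: 7 / 2 gives 3.5, not 3
pow_ab  <- a ^ b    # exponentiation: 7 squared
mod_ab  <- a %% b   # modulo: remainder of 7 divided by 2

print(c(sum_ab, diff_ab, prod_ab, quot_ab, pow_ab, mod_ab))
```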

Data Types: class()

• Decimal values like 5.6 are called numerics.
• Natural numbers like 7 are called integers. Integers are also numerics.
• Boolean values (TRUE or FALSE) are called logical.
• Text (or string) values are called characters.
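A quick check of these types with class(). One detail worth knowing when trying this: R stores a bare 7 as a numeric, and only the `L` suffix makes it a true integer.

```r
# class() reports the data type of a value
class(5.6)       # "numeric"
class(7)         # "numeric": a bare 7 is stored as a decimal number
class(7L)        # "integer": the L suffix asks for a true integer
class(TRUE)      # "logical"
class("Hello")   # "character"

# Integers are also numerics: is.numeric() is TRUE for both
is.numeric(5.6)
is.numeric(7L)
```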

Special Values

Value: NA
Description: Stands for not available. NA is a placeholder for a missing value.
Example:
> num <- c(NA, 5, 4, 3)
> mean(num)
> mean(num, na.rm = TRUE)

Value: NULL
Description: Empty set. It has no class (its class is NULL) and does not take up any space in a vector.
Example:
> num <- c(NULL, 1, 2, 3)
> mean(num)

Value: Inf
Description: Stands for infinity and only applies to vectors of class numeric.
Example:
> num <- 1/0
> num

Value: NaN
Description: Stands for not a number. This is generally the result of a calculation of which the result is unknown, but it is surely not a number.
Example:
> 0/0
> Inf - Inf

Source: Nurandi (DSI Bootcamp Slide)
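Running the special-value examples shows what each one returns. The sketch below repeats them with the results as comments; `num2` and the is.*() checks at the end are additions for illustration.

```r
num <- c(NA, 5, 4, 3)
mean(num)                 # NA: one missing value makes the mean missing
mean(num, na.rm = TRUE)   # 4: NA is dropped before averaging

num2 <- c(NULL, 1, 2, 3)
length(num2)              # 3: NULL takes up no space in the vector
mean(num2)                # 2

1 / 0                     # Inf
0 / 0                     # NaN
Inf - Inf                 # NaN

# Helper functions for detecting special values
is.na(num)                # TRUE FALSE FALSE FALSE
is.nan(0 / 0)             # TRUE
is.finite(1 / 0)          # FALSE
```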

Exercise #1

Using the Console in RStudio:

#1 Calculate 3 + 4

#2 Assign “aku cinta Indonesia” as my_char

#3 Find out the data type of my_char

#4 Assign FALSE as my_logical

#5 Find out the data type of my_logical

Data Structure

• Vector: one-dimensional array of the same mode

• Matrix: two-dimensional array

• Array: multi-dimensional (more than two dimensions) array

• Data Frame: tabular data objects

• List: collection of objects, where the elements can be of different types

• *Factor: vector along with the distinct values of the elements

• *Function: object that performs a specific operation

Source: Nurandi (DSI Bootcamp Slide)

How to Create a Vector, Matrix, or Array?

numbers <- 1:11

colors <- c("red", "green", "blue")

height <- c(leo = 170, jon = 167, alex = 185)

colors[1]

colors[2:3]

colors[-3]

height["alex"]

c(numbers, height)

my.matrix <- matrix(c(1:12), nrow = 3)

my.matrix[2, ]

my.matrix[, 3:4]

my.matrix[2, 4]

my.matrix[2:4]

my.array <- array(c(1:24), dim = c(4,3,2))

my.array

my.array[2, ,]

my.array[2:3, ,]

my.array[1:10]

Source: Nurandi (DSI Bootcamp Slide)
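The matrix examples mix two kinds of indexing, which is easy to miss: with a comma you pick rows and columns, without a comma you pick elements in column order. A short sketch; the `by.row` matrix and the `byrow = TRUE` argument are not on the slide and are added only for contrast.

```r
my.matrix <- matrix(1:12, nrow = 3)   # filled column by column

my.matrix[2, 4]    # row 2, column 4 -> 11
my.matrix[2:4]     # no comma: elements 2 to 4 in column order -> 2 3 4
my.matrix[, 3:4]   # all rows of columns 3 and 4 (still a matrix)

# byrow = TRUE fills the matrix row by row instead
by.row <- matrix(1:12, nrow = 3, byrow = TRUE)
by.row[2, 4]       # -> 8
```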

How to Create a Data Frame?

bmi <- data.frame(
  gender = c("Female", "Male", "Female"),
  single = c(FALSE, FALSE, TRUE),
  height = c(155, 170, 165.5),
  weight = c(64, 65, 48.5),
  age = c(42, 38, 26)
)

bmi
bmi[1, ]
bmi[, 3]
bmi[2, 3]
bmi$gender
bmi[c("height", "weight")]

newdata <- subset(bmi, age >= 30, select = c(height, weight))

Source: Nurandi (DSI Bootcamp Slide)

List, Factor, & Function

my.list <- list(colors, bmi, my.matrix, 4)

my.list[[1]]

my.list[[1]][2]

my.list[[2]]["single"]

edu <- rep(c("SD", "SMP", "SMA"), 3)

edu <- factor(edu)

edu

Source: Nurandi (DSI Bootcamp Slide)

# Define ratio() function

ratio <- function(x, y) {

x/y

}

# Call ratio() with arguments 3 and 4

ratio(3,4)
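A factor stores its distinct values as levels, and by default R sorts them alphabetically. The sketch below inspects the `edu` factor from this slide. SD, SMP and SMA are Indonesian school levels (elementary, junior high, senior high); the ordered version `edu.ordered` is an addition and assumes that this is the intended order.

```r
edu <- factor(rep(c("SD", "SMP", "SMA"), 3))

levels(edu)       # distinct values, sorted alphabetically: "SD" "SMA" "SMP"
table(edu)        # how often each level occurs: 3 times each
as.integer(edu)   # the integer codes stored behind the labels

# Assumption: SD < SMP < SMA is the intended order of the school levels
edu.ordered <- factor(edu, levels = c("SD", "SMP", "SMA"), ordered = TRUE)
edu.ordered[1] < edu.ordered[3]   # TRUE: "SD" ranks below "SMA"
```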

Exercise #2

#1 Create a vector that contains the following strings, assign it as name_vector!

"Ahmad" "Hani" "Andri" "Dian"

#2 Create a vector that contains the following integers, assign it as nim_vector!

1506772460 1506699232 1506772561 1506699453

#3 Construct a matrix with 6 rows that contain the numbers 1 up to 36, assign it as my_array!

#4 Construct a data frame from two vectors: name_vector and nim_vector! Assign it as data_mahasiswa!

#5 Rename the columns using names() function into: “Nama” and “NIM”!

#6 Write a function selisih() that takes arguments x and y and returns their difference, x - y!

data exploration

Part - 2

No. / To do / My recommendation, please utilize

1. To do: Name your code title, load packages and dataset
   Please utilize: (Optional) Code title: World Happiness Exploration
   Package name: tidyverse
   Function: read.csv()

2. To do: Know the data: number of files, dimensions, column names, merging data frames, statistical summary of the data, which country was happiest in 2017, etc.
   Please utilize: Function: dim(), colnames(), summary()

3. To do: Know the rank changes
   Please utilize: Function: mutate()

4. To do: Correlation between variables
   Please utilize: Function: corrplot(), cor()

5. To do: Visualize the data with a boxplot
   Please utilize: Function: boxplot()

explorE

Demo: Data Exploration

explorE

#load packages
library(tidyverse)
library(corrplot)
library(plotly)

#set working directory
setwd("D:/data/world-happiness-report")

#import data
whr_2015 <- read.csv("2015.csv", stringsAsFactors = FALSE)
whr_2016 <- read.csv("2016.csv", stringsAsFactors = FALSE)
whr_2017 <- read.csv("2017.csv", stringsAsFactors = FALSE)

#check data
head(whr_2015)
head(whr_2016)
head(whr_2017)

explorE

#check the data dimensions (you can also see them in the Environment pane at the top right)
dim(whr_2017)

#check the column names
colnames(whr_2015)
colnames(whr_2016)
colnames(whr_2017)

#check the distribution of Happiness Score for WHR 2015
summary(whr_2015$Happiness.Score)

explorE

#Question:

From 2016 to 2017, which country's happiness rank rose the most?

By how many places did it rise?

From which rank to which rank?

explorE

#Find the change in rank from 2016 to 2017

#1) join the two data frames by Country
whr_all <- merge(whr_2016[, c(1, 3)],
                 whr_2017[, c(1, 2)],
                 by.x = "Country", by.y = "Country")

#2) name the columns of whr_all, then create a Rank Change column with the mutate function
colnames(whr_all) <- c("Country", "Happiness Rank 2016", "Happiness Rank 2017")

whr_all1 <- whr_all %>%
  mutate(`Rank Change` = `Happiness Rank 2016` - `Happiness Rank 2017`)

#3) show which country had the largest rise in rank
whr_all1[whr_all1$`Rank Change` == max(whr_all1$`Rank Change`), ][, 1]

#Answer: Bulgaria, up 24 places, from rank 129 to rank 105.
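The join-then-subtract idea is easier to follow on a tiny table. The sketch below uses base R only instead of mutate(), and its data are invented for illustration (countries "A", "B", "C" are not real WHR rows).

```r
# Toy data, invented for illustration only (not the real WHR ranks)
rank_2016 <- data.frame(Country = c("A", "B", "C"),
                        Happiness.Rank = c(1, 2, 3))
rank_2017 <- data.frame(Country = c("A", "B", "C"),
                        Happiness.Rank = c(2, 3, 1))

# Join on Country; the suffixes keep the two rank columns apart
both <- merge(rank_2016, rank_2017, by = "Country",
              suffixes = c(".2016", ".2017"))

# A positive change means the country moved up the ranking
both$Rank.Change <- both$Happiness.Rank.2016 - both$Happiness.Rank.2017

# Row of the country with the largest climb
both[which.max(both$Rank.Change), ]
```

Rank 1 is the best rank, so subtracting the 2017 rank from the 2016 rank gives a positive number for a country that improved.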

explorE

#Correlation between variables

test3<-cor(as.matrix(whr_2017[,-c(1,2)]))

corrplot::corrplot(test3,type = "upper",method = "square",mar = c(0,0,1,0))

[Figure: correlation plot, 2017]
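What cor() hands to corrplot() is a square matrix of pairwise correlations. A minimal sketch using `mtcars`, a dataset that ships with R and stands in here for the happiness data; the three chosen columns are arbitrary.

```r
# mtcars ships with R; it stands in for the happiness data here
vars <- mtcars[, c("mpg", "hp", "wt")]

cor.matrix <- cor(as.matrix(vars))   # Pearson correlation by default
round(cor.matrix, 2)

# Every variable correlates perfectly with itself
diag(cor.matrix)

# The matrix is symmetric, which is why plotting only the
# upper triangle (type = "upper") loses no information
isSymmetric(cor.matrix)
```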

explorE

#The 2017 data has no Region column, so we join by Country to bring in the Region column.

whr_2017_new <- whr_2017 %>% left_join(whr_2016[, c(1, 2)], by = "Country")

plot_ly(whr_2017_new, x = ~Region, y = ~Happiness.Score, type = "box",
        boxpoints = "all", pointpos = -1.8, color = ~Region) %>%
  layout(xaxis = list(showticklabels = FALSE),
         margin = list(b = 100))

data preparation

Part - 3

preparE

Goal: Cluster Analysis World Happiness Report 2017

01 Select Features

02 Normalize dataset

03 Cluster data

Demo: Data Preparation

preparE

#Your goal is to cluster on the variables that make up the Happiness Score. Suppose we choose 6 variables:
#GDP, Family, Life Expectancy, Freedom, Trust, Generosity

clean_data <- whr_2017[, c(1, 6:11)]

# Look at the spread of each variable
par(mfrow = c(1, 6))
for (i in 2:7) {
  boxplot(clean_data[, i], main = names(clean_data)[i])
}

preparE

#remove rows with missing (NA) data
clean_data <- na.omit(clean_data) # listwise deletion of missing
clean_data$Country <- as.character(clean_data$Country)

#normalize data (optional, since the boxplots show the spreads are already fairly even)
clean_data_2 <- scale(clean_data[, 2:7]) # standardize variables
a <- data.frame(clean_data_2)
clean_data_3 <- cbind(clean_data$Country, a)

#view the boxplots after scaling/normalizing (this normalized result is not used; we use clean_data)
par(mfrow = c(1, 6))
for (i in 2:7) {
  boxplot(clean_data_3[, i], main = names(clean_data)[i])
}
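To see what scale() does to each column, it helps to check the result on a small built-in dataset. A sketch using `mtcars` as a stand-in for clean_data; the column choice is arbitrary.

```r
# mtcars ships with R; it stands in for clean_data here
raw <- mtcars[, c("mpg", "hp", "wt")]

# scale() subtracts each column's mean and divides by its standard deviation
scaled <- scale(raw)

round(colMeans(scaled), 10)   # every column mean is 0 after centering
apply(scaled, 2, sd)          # every column sd is 1 after scaling

# scale() returns a matrix; convert back for data-frame workflows
scaled.df <- as.data.frame(scaled)
```

Standardizing matters for clustering because distances are otherwise dominated by whichever variable has the largest numeric range.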

clustering

Part - 4

No. / To do / My recommendation, please utilize

1. To do: Use these clustering algorithms:
   - Partitioning: K-Means
   - Hierarchical Clustering
   Please utilize: Function: kmeans(), hclust(), cutree()

2. To do: Plotting cluster solution
   Please utilize:
   # vary parameters for most readable graph
   library(cluster)
   clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
            labels=2, lines=0)

   # Centroid Plot against 1st 2 discriminant functions
   library(fpc)
   plotcluster(mydata, fit$cluster)

https://www.statmethods.net/advstats/cluster.html
https://datascienceplus.com/k-means-clustering-in-r/
https://datascienceplus.com/hierarchical-clustering-in-r/

Demo: Clustering

clusterinG

# K Means Clustering

# mydata: the numeric columns of clean_data (Country dropped)
mydata <- clean_data[, -1]

# Determine number of clusters
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

# K-Means Cluster Analysis
fit1 <- kmeans(mydata, 3) # 3 cluster solution
# get cluster means
aggregate(mydata, by = list(fit1$cluster), FUN = mean)
# append cluster assignment
mydata_kmeans <- data.frame(clean_data, fit1$cluster)
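The same elbow procedure can be tried without the Kaggle files. A sketch on `iris`, which ships with R and stands in for mydata; `set.seed()` and `nstart = 10` are additions that are not on the slide, used here so the result can be reproduced.

```r
# iris ships with R; its four numeric columns stand in for mydata
x <- iris[, 1:4]

set.seed(42)   # k-means starts from random centers; fix the seed to reproduce

# Total within-group sum of squares for k = 1 ... 10
wss <- numeric(10)
wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))   # k = 1: total sum of squares
for (k in 2:10) {
  # nstart = 10 (an assumption, not on the slide) keeps the best of 10 starts
  wss[k] <- kmeans(x, centers = k, nstart = 10)$tot.withinss
}

# Look for the "elbow" where adding clusters stops helping much
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")

fit <- kmeans(x, centers = 3, nstart = 10)
table(fit$cluster)   # cluster sizes
fit$centers          # cluster means
```

The first value of wss needs no k-means run: with a single cluster, the within-group sum of squares is simply the total sum of squares of the data.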

clusterinG

# Ward Hierarchical Clustering
rownames(clean_data) <- clean_data[, 1]
mydata <- clean_data[, -1]

d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method = "ward.D")
plot(fit2) # display dendrogram
groups <- cutree(fit2, k = 3) # cut tree into 3 clusters

# draw dendrogram with red borders around the 3 clusters
rect.hclust(fit2, k = 3, border = "red")
mydata_hclust <- data.frame(clean_data, groups)
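A small stand-alone run shows what cutree() returns. The sketch uses `USArrests`, a dataset that ships with R, in place of mydata; scaling it first is a choice made for this example.

```r
# USArrests ships with R; it stands in for mydata here
x <- scale(USArrests)

d   <- dist(x, method = "euclidean")   # pairwise distances between rows
fit <- hclust(d, method = "ward.D")    # same linkage method as on the slide

groups <- cutree(fit, k = 3)   # named integer vector: one cluster label per row
table(groups)                  # cluster sizes
head(groups)                   # labels carry the row names (states)

# Mean of each variable within each cluster
aggregate(as.data.frame(x), by = list(cluster = groups), FUN = mean)
```

Because the labels keep the row names, setting rownames(clean_data) to the country names is what makes the dendrogram readable.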

clusterinG

# Cluster Plot against 1st 2 principal components, and vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit1$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit1$cluster)
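If the cluster package is not available, a plot against the first two principal components can be approximated with base R. This is only a rough sketch of the idea, not a reimplementation of clusplot(); it uses `iris` as a stand-in for mydata, and prcomp() with `scale. = TRUE` is a choice made for this example.

```r
# iris ships with R; its four numeric columns stand in for mydata
x <- iris[, 1:4]

set.seed(42)
fit <- kmeans(x, centers = 3, nstart = 10)

# Project the data onto its principal components
pca    <- prcomp(x, scale. = TRUE)
scores <- pca$x[, 1:2]   # coordinates on the first two components

# Share of the total variance that the 2-D picture captures
explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)

# One point per observation, colored by k-means cluster
plot(scores, col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = sprintf("Clusters on 2 components (%.0f%% of variance)",
                    100 * explained))
```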

q&A

Thank You