1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6a, February 25, 2014, SAGE 3101 kNN, K-Means, Clustering and Bayesian Inference
Transcript
Page 1:

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 6a, February 25, 2014, SAGE 3101

kNN, K-Means, Clustering and Bayesian Inference

Page 2:

Contents

Page 3:

Did you get to create the neighborhood map?

table(mapcoord$NEIGHBORHOOD)

mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)

geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD) # this one is easier

Page 4:

Page 5:

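KNN! Did you loop over k?

# The slide omits the loop header; k in 1:10 is assumed from the ten error values printed below.
knntesterr <- numeric(10)
for (k in 1:10) {
  knnpred <- knn(mapcoord[trainid,3:4], mapcoord[testid,3:4], cl=mapcoord[trainid,2], k=k)
  # mappred$class is presumably the true class of each test row (defined in an earlier lecture)
  knntesterr[k] <- sum(knnpred != mappred$class)/length(testid)
}

knntesterr

[1] 0.1028037 0.1308411 0.1308411 0.1588785 0.1401869 0.1495327 0.1682243 0.1962617 0.1962617 0.1869159

What do you think?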
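The course data (mapcoord, trainid, testid) is not included in this transcript, so here is a minimal self-contained sketch of the same idea on the built-in iris data. The 70/30 split, the seed, and the range 1:10 for k are assumptions chosen for illustration, not values from the lecture.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(42)                                   # assumed seed, for reproducibility only
n       <- nrow(iris)
trainid <- sample(n, size = round(0.7 * n))    # hypothetical 70/30 train/test split
testid  <- setdiff(seq_len(n), trainid)

ks         <- 1:10                             # assumed range of k
knntesterr <- numeric(length(ks))
for (i in seq_along(ks)) {
  knnpred <- knn(train = iris[trainid, 1:4],
                 test  = iris[testid, 1:4],
                 cl    = iris[trainid, 5],
                 k     = ks[i])
  # misclassification rate on the held-out rows
  knntesterr[i] <- mean(knnpred != iris[testid, 5])
}
print(knntesterr)
```

Plotting knntesterr against ks is one simple way to judge which k to keep.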

Page 6:

What else could you classify?

• SALE.PRICE?

– If so, how would you measure error?

# I added SALE.PRICE as 5th column in adduse…

> pcolor<- color.scale(log(mapcoord[,5]),c(0,1,1),c(1,1,0),0)

> geoPlot(mapcoord,zoom=12,color=pcolor)

• TAX.CLASS.AT.PRESENT?

• TAX.CLASS.AT.TIME.OF.SALE?

• measure error?
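One possible answer to the error question, offered as a sketch rather than the lecture's own method: a numeric target such as SALE.PRICE calls for a distance between predicted and actual values, while a categorical target such as a tax class can still use the misclassification rate. The vectors below are made-up values for illustration only.

```r
# Hypothetical actual and predicted sale prices (not course data)
actual <- c(100, 200, 300)
pred   <- c(110, 190, 330)

rmse <- sqrt(mean((pred - actual)^2))   # root mean squared error
mae  <- mean(abs(pred - actual))        # mean absolute error

# Hypothetical actual and predicted tax classes
actual_class <- factor(c("1", "2", "2", "4"))
pred_class   <- factor(c("1", "2", "4", "4"), levels = levels(actual_class))
miscl <- mean(pred_class != actual_class)   # misclassification rate

print(c(rmse = rmse, mae = mae, miscl = miscl))
```

Because sale prices are heavily skewed, computing the same measures on log(price), as the colour scale above does, may be more informative.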

Page 7:

Summing up ‘knn’

• Advantages

– Robust to noisy training data (especially if we use inverse square of weighted distance as the “distance”)

– Effective if the training data is large

• Disadvantages

– Need to determine value of parameter K (number of nearest neighbors)

– With distance-based learning it is not clear which type of distance to use and which attributes to use to produce the best results. Shall we use all attributes or certain attributes only?

• Friday – yet more KNN: weighted KNN…
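As a preview of the weighting idea, here is a minimal base-R sketch in which each of the k nearest neighbours votes with weight 1/distance². Reading the "inverse square" remark this way is an assumption, and the function name weighted_knn and its eps guard are hypothetical, not part of any package.

```r
# Classify one query point by inverse-square-distance weighted voting.
weighted_knn <- function(train, labels, query, k = 5, eps = 1e-9) {
  train <- as.matrix(train)
  query <- as.numeric(unlist(query))
  d  <- sqrt(rowSums(sweep(train, 2, query)^2))   # Euclidean distance to every training row
  nn <- order(d)[seq_len(k)]                      # indices of the k nearest rows
  w  <- 1 / (d[nn]^2 + eps)                       # eps avoids division by zero for exact matches
  votes <- tapply(w, as.character(labels[nn]), sum)
  names(votes)[which.max(votes)]                  # class with the largest total weight
}

# Usage: hold out the first iris row and classify it from the rest
weighted_knn(iris[-1, 1:4], iris$Species[-1], iris[1, 1:4], k = 5)
```

Closer neighbours dominate the vote, which makes the result less sensitive to the exact choice of k than a plain majority vote.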

Page 8:

K-Means!

> mapmeans<-data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')

> # 5 clusters, 5 random starts; Hartigan-Wong is the default algorithm
> mapobj<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm="Hartigan-Wong")

> # method="centers" returns the fitted cluster centre for each row
> fitted(mapobj,method="centers")

Page 9:

Return object

cluster A vector of integers (from 1:k) indicating the cluster to which each point is allocated.

centers A matrix of cluster centres.

totss The total sum of squares.

withinss Vector of within-cluster sum of squares, one component per cluster.

tot.withinss Total within-cluster sum of squares, i.e., sum(withinss).

betweenss The between-cluster sum of squares, i.e. totss-tot.withinss.

size The number of points in each cluster.
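The relationships between these components can be checked directly. The sketch below uses the built-in iris measurements rather than the course data; the seed and the choice of 3 centres are assumptions for illustration.

```r
set.seed(1)                                   # assumed seed; kmeans starts are random
km <- kmeans(iris[, 1:4], centers = 3, nstart = 5)

length(km$cluster)                            # one cluster label per row
dim(km$centers)                               # one row per cluster, one column per variable
km$size                                       # points per cluster

# The sum-of-squares identities listed above
all.equal(km$tot.withinss, sum(km$withinss))
all.equal(km$betweenss, km$totss - km$tot.withinss)
```

The ratio betweenss/totss is a quick summary of how much of the total variation the clustering accounts for.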

Page 10:

> plot(mapmeans,mapobj$cluster)

[Figure: plot of mapmeans. Variables: ZIP.CODE, NEIGHBORHOOD, TOTAL.UNITS, LAND.SQUARE.FEET, GROSS.SQUARE.FEET, SALE.PRICE, latitude, longitude]

> mapobj$size

[1] 432 31 1 11 56

Page 11:

> mapobj$centers

  adduse.ZIP.CODE as.numeric.mapcoord.NEIGHBORHOOD. adduse.TOTAL.UNITS adduse.LAND.SQUARE.FEET
1        10464.09                          19.47454           1.550926                2028.285
2        10460.65                          16.38710          25.419355               11077.419
3        10454.00                          20.00000           1.000000               29000.000
4        10463.45                          10.90909          42.181818               10462.273
5        10464.00                          17.42857           4.714286               14042.214

  adduse.GROSS.SQUARE.FEET adduse.SALE.PRICE adduse..querylist.latitude. adduse..querylist.longitude.
1                 1712.887          279950.4                    40.85280                    -73.87357
2                26793.516         2944099.9                    40.85597                    -73.89139
3                87000.000        24120881.0                    40.80441                    -73.92290
4                40476.636         6953345.4                    40.86009                    -73.88632
5                 9757.679          885950.9                    40.85300                    -73.87781

Page 12:

Plotting clusters

require(cluster)

clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

Page 13:

Simpler K-Means!

> mapmeans<-data.frame(as.numeric(mapcoord$NEIGHBORHOOD), adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')

> # same settings as before, on fewer variables
> mapobjnew<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm="Hartigan-Wong")

> fitted(mapobjnew,method="centers")

Page 14:

Plot

Page 15:

Clusplot (k=17)

Page 16:

Dendrogram for this = tree of the clusters:

Highly supported by data?

Okay, this is a little complex – perhaps something simpler?

Page 17:

Hierarchical clustering

> d <- dist(as.matrix(mtcars))

> hc <- hclust(d)

> plot(hc)
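A dendrogram can be turned into flat cluster labels by cutting the tree. The sketch below repeats the three lines above without plotting and then uses cutree(); the choice of k = 3 groups is an arbitrary assumption for illustration.

```r
d  <- dist(as.matrix(mtcars))   # Euclidean distances between cars
hc <- hclust(d)                 # complete linkage by default

groups <- cutree(hc, k = 3)     # cut the tree into 3 groups (assumed k)
table(groups)                   # how many cars fall in each group
head(groups)                    # labels are named by car
```

Calling rect.hclust(hc, k = 3) after plot(hc) draws the same cut on the dendrogram.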

Page 18:

Decision tree (example)

> require(party) # don’t get me started!

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

Page 19:

> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46
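To see what the printed splits mean, they can be written out by hand as nested conditions. This base-R sketch does not use the party package; it only re-applies the thresholds shown above to iris and counts the rows reaching each terminal node (the variable name leaf is hypothetical).

```r
# Terminal node reached by each row, using the node numbers from the printout
leaf <- with(iris,
             ifelse(Petal.Length <= 1.9, 2,          # node 2
             ifelse(Petal.Width  >  1.7, 7,          # node 7
             ifelse(Petal.Length <= 4.8, 5, 6))))    # nodes 5 and 6

table(leaf)                 # sizes should correspond to the 'weights' printed above
table(leaf, iris$Species)   # species mix inside each terminal node
```

The second table shows which terminal nodes are pure and which mix versicolor and virginica.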

Page 20:

plot(iris_ctree)

Page 21:

However… there is more

Page 22:

Bayes

> cl <- kmeans(iris[,1:4], 3)

> table(cl$cluster, iris[,5])

    setosa versicolor virginica
  2      0          2        36
  1      0         48        14
  3     50          0         0

#

> require(e1071) # provides naiveBayes()

> m <- naiveBayes(iris[,1:4], iris[,5])

> table(predict(m, iris[,1:4]), iris[,5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

Page 23:

Using a contingency table

> data(Titanic)

> mdl <- naiveBayes(Survived ~ ., data = Titanic)

> mdl

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.formula(formula = Survived ~ ., data = Titanic)

A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035

Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159

        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560

        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122
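The probabilities in this printout are just normalised counts from the contingency table. As a sketch, the a-priori probabilities and the Class conditionals can be recomputed with base R alone, without e1071 (the variable names prior and cond_class are hypothetical).

```r
data(Titanic)   # 4-way table: Class x Sex x Age x Survived

# A-priori probabilities: share of passengers by Survived (dimension 4)
prior <- prop.table(margin.table(Titanic, 4))
prior

# Conditional probabilities of Class given Survived:
# collapse to a Survived x Class table, then normalise each row
cond_class <- prop.table(margin.table(Titanic, c(4, 1)), 1)
round(cond_class, 8)
```

Replacing dimension 1 with 2 or 3 gives the Sex and Age tables in the same way.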

Page 24:

Using a contingency table

> predict(mdl, as.data.frame(Titanic)[,1:3])

[1] Yes No No No Yes Yes Yes Yes No No No No Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes No

[26] No No No Yes Yes Yes Yes

Levels: No Yes

Page 25:

Naïve Bayes – what is it?

• Example: testing for a specific item of knowledge that 1% of the population has been informed of (don’t ask how).

• An imperfect test:

– 99% of knowledgeable people test positive

– 99% of ignorant people test negative

• If a person tests positive – what is the probability that they know the fact?

Page 26:

Naïve approach…

• We have 10,000 representative people

• 100 know the fact/item, 9,900 do not

• We test them all:

– Get 99 knowing people testing knowing

– Get 9,801 not knowing people testing not knowing

– But 99 not knowing people testing as knowing

• Testing positive (knowing) – equally likely to know or not = 50%
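The counting argument can be reproduced in a few lines of R. The variable names (sens for the 99% hit rate among those who know, spec for the 99% correct-negative rate among those who do not) are introduced here for illustration.

```r
n_people <- 10000
p_know   <- 0.01    # 1% of the population knows the fact
sens     <- 0.99    # knowing people who test positive
spec     <- 0.99    # not-knowing people who test negative

know      <- n_people * p_know                  # people who know
true_pos  <- know * sens                        # know and test positive
false_neg <- know * (1 - sens)                  # know but test negative
false_pos <- (n_people - know) * (1 - spec)     # do not know but test positive
true_neg  <- (n_people - know) * spec           # do not know and test negative

# Among everyone who tests positive, the share who actually know
posterior <- true_pos / (true_pos + false_pos)
print(round(c(true_pos, false_neg, false_pos, true_neg, posterior), 4))
```

Changing p_know shows how strongly the answer depends on the base rate rather than on the quality of the test alone.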

Page 27:

Tree diagram

10000 ppl
  1% know (100ppl)
    99% test to know (99ppl)
    1% test not to know (1 person)
  99% do not know (9900ppl)
    1% test to know (99ppl)
    99% test not to know (9801ppl)

Page 28:

Relation between probabilities

• For outcomes x and y there are probabilities p(x) and p(y) that either happened

• If there’s a connection then the joint probability that both happen = p(x,y)

• Or x happens given y happens = p(x|y), or vice versa; then:

– p(x|y)*p(y)=p(x,y)=p(y|x)*p(x)

• So p(y|x)=p(x|y)*p(y)/p(x) (Bayes’ Law)

• E.g. p(know|+ve)=p(+ve|know)*p(know)/p(+ve)= (.99*.01)/(.99*.01+.01*.99) = 0.5
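The denominator in the example comes from the law of total probability, which the slide uses without stating. Written out, with K for "knows the fact" and + for a positive test (notation introduced here for illustration):

```latex
\begin{aligned}
p(+) &= p(+\mid K)\,p(K) + p(+\mid \neg K)\,p(\neg K) \\
     &= 0.99 \times 0.01 + 0.01 \times 0.99 = 0.0198, \\
p(K \mid +) &= \frac{p(+\mid K)\,p(K)}{p(+)} = \frac{0.0099}{0.0198} = 0.5 .
\end{aligned}
```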

Page 29:

How do you use it?

• If the population contains x what is the chance that y is true?

• p(SPAM|word)=p(word|SPAM)*p(SPAM)/p(word)

• Base this on data:

– p(spam) counts proportion of spam versus not

– p(word|spam) counts prevalence of spam containing the ‘word’

– p(word|!spam) counts prevalence of non-spam containing the ‘word’
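The three counts above are enough to evaluate the formula. Here is a minimal sketch on a made-up corpus of eight messages; the messages, labels, and the word "win" are all hypothetical.

```r
# Hypothetical toy corpus: four spam and four non-spam messages
msgs <- c("win money now", "meeting at noon", "win a prize", "did we win the bid",
          "project update", "cheap pills", "lunch tomorrow", "free money")
is_spam <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)

word     <- "win"
has_word <- grepl(word, msgs, fixed = TRUE)      # plain substring match

p_spam            <- mean(is_spam)               # proportion of spam versus not
p_word_given_spam <- mean(has_word[is_spam])     # prevalence of the word in spam
p_word_given_ham  <- mean(has_word[!is_spam])    # prevalence of the word in non-spam

# p(word) by total probability, then Bayes' Law
p_word            <- p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word <- p_word_given_spam * p_spam / p_word
print(p_spam_given_word)
```

On a corpus this small the answer can be confirmed by counting directly: mean(is_spam[has_word]) gives the same value.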

Page 30:

Or..

• What is the probability that you are in one class (i) over another class (j) given another factor (X)?

• Invoke Bayes:

• Maximize p(X|Ci)p(Ci)/p(X) (p(X)~constant and p(Ci) are equal if not known)

• So: conditional indep -
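The formula that followed "conditional indep" did not survive in this transcript. The standard naive Bayes assumption, which is presumably what the slide showed, is that the components x_k of X are independent given the class. The symbols n (number of attributes) and \hat{C} (the predicted class) are introduced here for illustration:

```latex
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i),
\qquad
\hat{C} = \arg\max_{i} \; P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)
```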

Page 31:

• P(xk | Ci) is estimated from the training samples

– Categorical: Estimate P(xk | Ci) as percentage of samples of class i with value xk

• Training involves counting percentage of occurrence of each possible value for each class

– Numeric: Actual form of density function is generally not known, so “normal” density is often assumed
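For the numeric case, the "normal density" assumption can be sketched by hand in base R: estimate a mean and standard deviation per class and attribute, then pick the class with the largest log prior plus summed log densities. This is an illustrative sketch on iris, not the e1071 implementation, and all variable names are hypothetical.

```r
X <- iris[, 1:4]
y <- iris$Species
classes <- levels(y)

# Training: class priors, plus per-class mean and sd of every attribute
prior <- setNames(as.numeric(table(y)) / length(y), classes)
mu <- sapply(classes, function(cl) colMeans(X[y == cl, ]))        # attributes x classes
sg <- sapply(classes, function(cl) apply(X[y == cl, ], 2, sd))    # attributes x classes

# Prediction: log p(Ci) + sum_k log p(xk | Ci), with a normal density per attribute
log_post <- sapply(classes, function(cl) {
  ll <- sapply(names(X), function(v) dnorm(X[[v]], mu[v, cl], sg[v, cl], log = TRUE))
  log(prior[[cl]]) + rowSums(ll)
})
pred <- factor(classes[max.col(log_post)], levels = classes)

table(pred, y)      # confusion table on the training data
mean(pred == y)     # training accuracy
```

Working in logs avoids numerical underflow when many small densities are multiplied together.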

Page 32:

Thus..

• Supervised or training set needed

• We will explore this more on Friday

Page 33:

Tentative assignments

• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ March 7. 15% (10% written and 5% oral; individual);

• Assignment 5: Term project proposal. Due ~ March 18. 5% (0% written and 5% oral; individual);

• Term project (6). Due ~ week 13. 30% (25% written, 5% oral; individual).

• Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 9/10. 20% (15% written and 5% oral; individual);

Page 34:

Coming weeks

• I will be out of town Friday March 21 and 28

• On March 21 you will have a lab – attendance will be taken – to work on assignments (term (6) and assignment 7). Normal lecture on March 18.

• On March 28 you will have a lecture on SVM, thus the Tuesday March 25 will be a lab.

• Back to regular schedule in April (except 18th)

Page 35:

Admin info (keep/ print this slide)

• Class: ITWS-4963/ITWS 6965

• Hours: 12:00pm-1:50pm Tuesday/ Friday

• Location: SAGE 3101

• Instructor: Peter Fox

• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)

• Contact hours: Monday** 3:00-4:00pm (or by email appt)

• Contact location: Winslow 2120 (sometimes Lally 207A announced by email)

• TA: Lakshmi Chenicheri [email protected]

• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014

– Schedule, lectures, syllabus, reading, assignments, etc.
