
An extended K-means++ with mixed attributes for outlier detection

Presented by Miss Sarunya Kanjanawattana

Examination Committee

Dr. Sumanta Guha (Chairperson)
Prof. Dr. Phan Minh Dung (Committee)
Dr. Matthew N. Dailey (Committee)

:: Agenda ::

• Background
• Literature review
• Methodologies

Background
• Problem statement
• Objective of the study
• Scope and Limitation
• Contribution

« Background »

• Data mining:
– Huge volumes of data and information are collected in databases.
– This volume of data far exceeds the human ability to analyze it and extract valuable information for decision-making support.

“data mining helps to transform the collected data into valuable information”

« Background »

• Outlier detection:
– Outlier clustering is a popular methodology used to detect fraud in data sets.
– It identifies each data point as “normal” or “outlier”.

Outlier data point => fraudulent sample

« Background »

• Fraud detection
– Health insurance fraud detection is a beneficial and challenging task.
– Detection helps to reveal patterns of fraud and abuse.

Example: Health insurance fraud led by institutions or health professionals includes the falsification of information on forms.

« Background »

• The National Health Security Office (NHSO)
– is an autonomous state agency, officially founded in 2002 under the National Health Security Act.
– The vital duties of the NHSO are to manage the health security fund and to allocate the subsidiary budget to 236 clinics and 963 hospitals, in order to promote and develop a good health care system for all Thai people.

« Problem statement »

• Fraud and abuse
– lead to significant additional expense in the health care system.

• A case study: NHSO database
– The database holds a large volume of data.
– Many transactions arrive constantly, every hour of every day.
– The volume is too large for human inspection to detect fraud.

• Outlier clustering approach:
– A faster and more accurate algorithm is needed to monitor outliers.

« Objective of the study »

• To provide a process for extracting fraud instances and uncovering unusual activities in the NHSO.

• To extend K-means++, a variation of the standard K-means algorithm, so that it handles datasets with mixed attributes for outlier detection.

• To answer what is the optimal “”.

« Scope and Limitation »

• The data source covers only 4 provinces in Thailand:
– Nakhonratchasima, Chaiyaphom, Burirum and Surin.

• The transactions come from the group of high-cost diseases.
– Fraudulent behavior is more likely to occur in this group than in other groups of diseases.

« Contribution »

• The proposed study provides a methodology to detect fraud and abuse in the NHSO, Thailand, and presents results of outlier clustering.

• This study proposes a novel algorithm based on extended K-means++ to work with mixed attributes and detect outliers.

Literature review
• Fraud detection
• The process of data mining

« Literature review »

Yi et al. 2006:
• Aim to understand and detect suspicious health care fraud in large databases using clustering techniques.

• Compare two clustering tools: SAS EM and CLUTO.

• The experimental results indicate that CLUTO is faster than SAS EM, while SAS EM provides more useful clusters than CLUTO.

Fraud detection


Liou, Tang, and Chen 2008:
• Apply data mining techniques to detect fraudulent or abusive reporting by healthcare providers, using their invoices for diabetic outpatient services.

• Methods compared: logistic regression, neural network, classification trees.

• The classification tree model performs best, with an overall correct identification rate of 99%.

Fraud detection


• Data preprocessing
– Data obtained from real databases are often incomplete, noisy and inconsistent.
– The goal of data preprocessing is to clean a raw data set in order to improve accuracy.
– The process of data preprocessing:
  • data cleaning, data transformation and integration, and data reduction.

The process of data mining


• Data preprocessing
Wang and Chiang 2009:
– Present an efficient data preprocessing procedure for support vector clustering (SVC) that reduces the size of the training dataset.



• K-means algorithm
– The benefits of K-means
  • Speed and simplicity: the algorithm is easy to understand and implement.
– The shortcomings of K-means
  • number of clusters dependency
  • degeneracy



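• K-means++ algorithm
• Arthur and Vassilvitskii 2007
– Faster and more efficient than K-means
  • K-means: O(i * n * k) running time for i iterations, n points and k clusters
  • K-means++: the seeding gives a clustering that is O(log k)-competitive with the optimal one in expectation
– Not well suited to a dataset that combines categorical and numerical attributes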



• K-means++ algorithm
• Example (k=3)

D(x) = the shortest distance from a data point x to the closest center we have already chosen.

[Figure: K-means++ seeding example with k=3. After the first center is chosen, the candidate points are labeled D² = 8² + 4², D² = 7² + 3², D² = 1² + 7² and D² = 2² + 1². After the second center is chosen, the remaining candidates are labeled D² = 1² + 1², D² = 1² + 7² and D² = 2² + 1².]
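The figure's labels can be turned into selection probabilities. The sketch below reads each label as a squared distance (for example D² = 8² + 4²) and assumes the standard K-means++ rule that a point is picked as the next center with probability proportional to D². The variable names are illustrative only.

```python
# Offsets (dx, dy) read from the first set of labels in the example;
# each candidate's squared distance is D^2 = dx^2 + dy^2.
offsets = [(8, 4), (7, 3), (1, 7), (2, 1)]

d2 = [dx * dx + dy * dy for dx, dy in offsets]

# Assumed K-means++ rule: P(x) = D(x)^2 / sum of all D^2.
total = sum(d2)
probs = [value / total for value in d2]

for (dx, dy), value, p in zip(offsets, d2, probs):
    print(f"D^2 = {dx}^2 + {dy}^2 = {value:3d}  ->  P = {p:.3f}")
```

Under this reading the farthest candidate (D² = 80) is the most likely next center, while the nearest (D² = 5) is rarely chosen.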


• Y-means algorithm
• Guan, Ghorbani, and Belacel 2003
– Based on the K-means algorithm
– Overcomes two shortcomings of K-means:
  • number of clusters dependency and degeneracy



• Koufakou, Ortiz, Georgiopoulos, Anagnostopoulos, and Reynolds 2007
– Introduced a strategy named “Attribute Value Frequency (AVF)”.
– AVF is a fast and scalable outlier detection strategy for categorical data.
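As a rough illustration of the AVF idea, the sketch below scores each record by the average frequency of its attribute values, so records built from rare values get low scores and are outlier candidates. The helper name `avf_scores` is hypothetical, and the sample rows are the two categorical columns from the preprocessing example used in these slides.

```python
from collections import Counter


def avf_scores(rows):
    """Return one AVF score per row: the mean frequency of the row's attribute values."""
    n_attrs = len(rows[0])
    # One frequency table per attribute (column).
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(counts[j][row[j]] for j in range(n_attrs)) / n_attrs
        for row in rows
    ]


rows = [("A", "C"), ("A", "C"), ("A", "D"), ("B", "D"),
        ("B", "C"), ("B", "E"), ("A", "D")]
scores = avf_scores(rows)
lowest = min(range(len(rows)), key=lambda i: scores[i])
print(rows[lowest], scores[lowest])
```

Here the record (B, E) scores lowest because E occurs only once in its column.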


Methodologies
• Methodology
• Data collection
• Data evaluation
• Tasks and timeline

« Methodologies »

• The method is divided into 3 phases.
• Phase 1: Data preprocessing
– Convert categorical data to numeric data
• Phase 2: Clustering
– Performed with the K-means++ algorithm
• Phase 3: Outlier detection
– Local and global outliers
– Determine which clusters are outliers

« Methodologies »

• Overview of the extended K-means++ algorithm

« Methodologies »

• Phase 1: Data preprocessing

« Methodologies »

• Phase 1: Data preprocessing
1) Normalize the numeric attributes' values into the range of 0 to 1.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             100           100
A             C             300           900
A             D             800           800
B             D             900           200
B             C             200           800
B             E             600           900
A             D             700           100

« Methodologies »

• Phase 1: Data preprocessing
1) Normalize the numeric attributes' values into the range of 0 to 1.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             0.1           0.1
A             C             0.3           0.9
A             D             0.8           0.8
B             D             0.9           0.2
B             C             0.2           0.8
B             E             0.6           0.9
A             D             0.7           0.1
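The normalized values in this example are consistent with dividing each raw value by 1000 (100 becomes 0.1, 900 becomes 0.9); min-max scaling would instead map 100 to 0. The slides do not state the rule, so the sketch below treats the upper bound of 1000 as an assumption, and `scale_to_unit` is a hypothetical helper name.

```python
def scale_to_unit(values, upper=1000.0):
    """Scale non-negative values into [0, 1] by dividing by an assumed upper bound."""
    return [v / upper for v in values]


# Raw Attribute Y and Attribute Z columns from the example table.
attr_y = [100, 300, 800, 900, 200, 600, 700]
attr_z = [100, 900, 800, 200, 800, 900, 100]

norm_y = scale_to_unit(attr_y)
norm_z = scale_to_unit(attr_z)
print(norm_y)
print(norm_z)
```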

« Methodologies »

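• Phase 1: Data preprocessing
2) The categorical attribute with the largest number of distinct items is selected as the base attribute.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             0.1           0.1
A             C             0.3           0.9
A             D             0.8           0.8
B             D             0.9           0.2
B             C             0.2           0.8
B             E             0.6           0.9
A             D             0.7           0.1

Attribute W has 2 items (A, B); Attribute X has 3 items (C, D, E), so Attribute X is the base attribute.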

« Methodologies »

• Phase 1: Data preprocessing
3) Count the frequency of co-occurrence between items, represented by matrix M.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             0.1           0.1
A             C             0.3           0.9
A             D             0.8           0.8
B             D             0.9           0.2
B             C             0.2           0.8
B             E             0.6           0.9
A             D             0.7           0.1

Matrix M =

      A   B   C   D   E
A     4   0   2   2   0
B     0   3   1   1   1
C     0   0   3   0   0
D     0   0   0   3   0
E     0   0   0   0   1
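The counting step can be sketched as follows: the diagonal of M holds how often each item occurs, and the upper triangle holds how often two items appear in the same row. The variable names are illustrative; the rows are the categorical columns of the example table.

```python
from itertools import combinations

# Categorical columns (Attribute W, Attribute X) of the example table.
rows = [("A", "C"), ("A", "C"), ("A", "D"), ("B", "D"),
        ("B", "C"), ("B", "E"), ("A", "D")]

items = ["A", "B", "C", "D", "E"]
index = {item: i for i, item in enumerate(items)}
M = [[0] * len(items) for _ in items]

for row in rows:
    # Diagonal: occurrence count of each item.
    for item in row:
        M[index[item]][index[item]] += 1
    # Upper triangle: co-occurrence count of each pair in the row.
    for a, b in combinations(row, 2):
        i, j = sorted((index[a], index[b]))
        M[i][j] += 1

for item, line in zip(items, M):
    print(item, line)
```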

« Methodologies »

• Phase 1: Data preprocessing
4) Calculate the similarity between items, given by $D_{xy} = M_{xy} / (M_{xx} + M_{yy} - M_{xy})$.

Matrix M =

      A   B   C   D   E
A     4   0   2   2   0
B     0   3   1   1   1
C     0   0   3   0   0
D     0   0   0   3   0
E     0   0   0   0   1

Similarity   Calculated value
D_AC         2 / (4 + 3 - 2) = 0.4
D_AD         2 / (4 + 3 - 2) = 0.4
D_AE         0 / (4 + 1 - 0) = 0
D_BC         1 / (3 + 3 - 1) = 0.2
D_BD         1 / (3 + 3 - 1) = 0.2
D_BE         1 / (3 + 1 - 1) = 0.33
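The calculated values follow a Jaccard-style ratio: the co-occurrence count divided by the two occurrence counts minus the co-occurrence count. The sketch below applies that reading to matrix M, taking each item's occurrence count from the diagonal (so E counts as 1); the function name `similarity` is illustrative.

```python
items = "ABCDE"
M = [
    [4, 0, 2, 2, 0],
    [0, 3, 1, 1, 1],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 0, 0, 1],
]


def similarity(matrix, i, j):
    """Co-occurrence of items i and j over the size of their union."""
    co = matrix[i][j]
    return co / (matrix[i][i] + matrix[j][j] - co)


# Pair each non-base item (A, B) with each base item (C, D, E).
D = {
    (items[i], items[j]): similarity(M, i, j)
    for i in (0, 1)
    for j in (2, 3, 4)
}
for pair, value in D.items():
    print(pair, round(value, 2))
```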

« Methodologies »

• Phase 1: Data preprocessing
5) Find the within-group variance of each numeric attribute, grouped by base item: $SS_w = \sum_{x \in g} (x - \bar{x}_g)^2$ for each group $g$.

Y attribute

Base item   Mean                          SSw
C           (0.1 + 0.3 + 0.2)/3 = 0.2     0.01 + 0.01 + 0 = 0.02
D           (0.8 + 0.9 + 0.7)/3 = 0.8     0 + 0.01 + 0.01 = 0.02
E           0.6/1 = 0.6                   0

Z attribute

Base item   Mean                          SSw
C           (0.1 + 0.9 + 0.8)/3 = 0.6     0.25 + 0.09 + 0.04 = 0.38
D           (0.8 + 0.2 + 0.1)/3 = 0.37    0.185 + 0.029 + 0.073 = 0.287
E           0.9/1 = 0.9                   0

Σ SSw(Y) = 0.04
Σ SSw(Z) = 0.667

<< Select Y (the smaller total)
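A minimal sketch of this selection step: group the rows by base item, sum the squared deviations from each group mean, and keep the numeric attribute with the smaller total. The helper name `total_within_ss` is hypothetical. Run on the example table, it gives totals of about 0.04 for Y and about 0.667 for Z, so Y is selected.

```python
# (base item, Attribute Y, Attribute Z) for each row of the normalized table.
rows = [
    ("C", 0.1, 0.1), ("C", 0.3, 0.9), ("D", 0.8, 0.8), ("D", 0.9, 0.2),
    ("C", 0.2, 0.8), ("E", 0.6, 0.9), ("D", 0.7, 0.1),
]


def total_within_ss(rows, col):
    """Sum over base items of the squared deviations from the group mean."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row[col])
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


ss_y = total_within_ss(rows, 1)
ss_z = total_within_ss(rows, 2)
selected = "Y" if ss_y <= ss_z else "Z"
print(round(ss_y, 3), round(ss_z, 3), selected)
```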

« Methodologies »

• Phase 1: Data preprocessing
6) Every base item is quantified by assigning it the mean of its mapped values in the selected numeric attribute.

Y attribute

Base item   Mean
C           (0.1 + 0.3 + 0.2)/3 = 0.2
D           (0.8 + 0.9 + 0.7)/3 = 0.8
E           0.6/1 = 0.6

Attribute W   Attribute X   Attribute Y   Attribute Z
A             0.2 (C)       0.1           0.1
A             0.2 (C)       0.3           0.9
A             0.8 (D)       0.8           0.8
B             0.8 (D)       0.9           0.2
B             0.2 (C)       0.2           0.8
B             0.6 (E)       0.6           0.9
A             0.8 (D)       0.7           0.1

« Methodologies »

• Phase 1: Data preprocessing
7) All other categorical items are quantified by applying the function $F(x) = \sum_{b} D_{xb} \cdot \bar{y}_b$, where $b$ ranges over the base items and $\bar{y}_b$ is the value assigned to base item $b$ in step 6.

F(A) = 0.4 * 0.2 + 0.4 * 0.8 + 0 * 0.6 = 0.4

F(B) = 0.2 * 0.2 + 0.2 * 0.8 + 0.33 * 0.6 = 0.398

Attribute W   Attribute X   Attribute Y   Attribute Z
0.4 (A)       0.2 (C)       0.1           0.1
0.4 (A)       0.2 (C)       0.3           0.9
0.4 (A)       0.8 (D)       0.8           0.8
0.398 (B)     0.8 (D)       0.9           0.2
0.398 (B)     0.2 (C)       0.2           0.8
0.398 (B)     0.6 (E)       0.6           0.9
0.4 (A)       0.8 (D)       0.7           0.1

*All data in the data set are now numeric.
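The two worked values F(A) and F(B) suggest that a non-base item is quantified as a similarity-weighted sum of the base items' values. The sketch below reproduces them under that reading; `quantify` is a hypothetical helper, and the similarity for (B, E) is the rounded 0.33 used on the slide.

```python
# Values assigned to the base items (group means of Attribute Y).
base_value = {"C": 0.2, "D": 0.8, "E": 0.6}

# Similarity between each non-base item and each base item.
similarity = {
    "A": {"C": 0.4, "D": 0.4, "E": 0.0},
    "B": {"C": 0.2, "D": 0.2, "E": 0.33},
}


def quantify(item):
    """Similarity-weighted sum of the base items' numeric values."""
    return sum(similarity[item][b] * base_value[b] for b in base_value)


f_a = quantify("A")
f_b = quantify("B")
print(round(f_a, 3), round(f_b, 3))
```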

« Methodologies »

• Phase 2: Clustering

Probability: each data point x is chosen as the next center with probability $D(x)^2 / \sum_{x'} D(x')^2$.

D(x): the shortest distance from a data point x to the closest center we have already chosen.
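For reference, the standard K-means++ seeding (Arthur and Vassilvitskii 2007) picks the first center uniformly at random and each later center with probability proportional to D(x)². The sketch below is one way to write that rule in plain Python; the function names and sample points are illustrative, and it assumes the data contain at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng):
    """Choose k initial centers using D(x)^2 weighting."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance to the closest center chosen so far.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # Already-chosen points have weight 0, so they are not picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
seeds = kmeanspp_seeds(points, 3, random.Random(0))
print(seeds)
```

After seeding, the usual K-means assignment and update iterations run unchanged.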

« Methodologies »

• Phase 2: Clustering
– Define initial values:
• = Cluster width
– used to detect local outliers
– set to 2.32, following a previous study
• = Cluster population ratio
– used to detect global outliers
– my assumption: 0.9

The detection rate and false negative rate should reach their highest values at the optimal “”.

« Methodologies »

• Phase 3: Outlier detection

« Methodologies »

• Phase 3: Outlier detection
– There are 2 stages
• Local outlier detection:
• = cluster width

« Methodologies »

• Phase 3: Outlier detection
– There are 2 stages
• Global outlier detection
• = population ratio
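The slides name the two stages and their parameters but the exact decision rules are not shown, so the following is only a sketch under stated assumptions. It assumes a local outlier is a point farther from its cluster center than the cluster width times the cluster's root-mean-square distance, and that global outliers are the clusters left over once the largest clusters cover the population ratio. Both rules and both function names are hypothetical.

```python
import math


def local_outliers(points, center, width=2.32):
    """Indices of points beyond width * RMS distance from the center (assumed rule)."""
    dists = [math.dist(p, center) for p in points]
    sigma = math.sqrt(sum(d * d for d in dists) / len(dists))
    return [i for i, d in enumerate(dists) if d > width * sigma]


def global_outlier_clusters(cluster_sizes, ratio=0.9):
    """Cluster ids not needed for the largest clusters to cover `ratio` (assumed rule)."""
    total = sum(cluster_sizes.values())
    covered = 0
    outliers = []
    by_size = sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for cluster_id, size in by_size:
        if covered >= ratio * total:
            outliers.append(cluster_id)
        covered += size
    return outliers


cluster = [(1.0, 0.0)] * 19 + [(10.0, 0.0)]
local = local_outliers(cluster, center=(0.0, 0.0))
global_ids = global_outlier_clusters({0: 60, 1: 35, 2: 5})
print(local, global_ids)
```

With these toy inputs the single distant point and the smallest cluster are flagged; real thresholds would need to be tuned against labeled data.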

« Data collection »

• A real dataset provided by the National Health Security Office of Thailand is used to demonstrate the effectiveness of the proposed method.

• The primary data are gathered from the NHSO database, especially the statement information, which contains all financial transactions.

« Data collection »

• Overview of data set

« Data evaluation »

• Outlier Detection Accuracy rate, which is the number of outliers correctly identified by this approach as outliers

• False Positive rate, reflecting the number of normal points erroneously identified as outliers.
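Both measures can be computed from labeled data. The sketch below expresses them as fractions, which is an assumption since the slides describe them as counts: detection rate is the share of true outliers that were flagged, and false positive rate is the share of normal points that were flagged. The function name and the sample labels are illustrative.

```python
def detection_metrics(actual, predicted):
    """Return (detection rate, false positive rate) from boolean outlier labels."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    n_outliers = sum(1 for a in actual if a)
    n_normal = len(actual) - n_outliers
    detection_rate = true_pos / n_outliers if n_outliers else 0.0
    false_positive_rate = false_pos / n_normal if n_normal else 0.0
    return detection_rate, false_positive_rate


# True = outlier, False = normal.
actual = [True, True, False, False, False, False]
predicted = [True, False, True, False, False, False]
rates = detection_metrics(actual, predicted)
print(rates)
```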

« Tasks and timeline »

Thank you

Dr. Sumanta Guha (Chairperson)
Prof. Dr. Phan Minh Dung (Committee)
Dr. Matthew N. Dailey (Committee)

Questions?