Erroneous Distribution Data Identification Using Outlier Detection Techniques

Date post: 06-Feb-2016
Description:
Erroneous Distribution Data Identification Using Outlier Detection Techniques. W. Zhuang, Y. Zhang, J.F. Grassle, Rutgers, the State University of New Jersey, USA. Overview: review of OBIS DQ issues; review of existing DQ methods; case study: detecting outliers in multidimensional data.
Transcript
Page 1: Erroneous Distribution Data Identification Using Outlier Detection Techniques

Erroneous Distribution Data Identification Using Outlier Detection Techniques

W. Zhuang, Y. Zhang, J.F. Grassle

Rutgers, the State University of New Jersey, USA

Page 2

Overview

Review of OBIS DQ issues
Review of existing DQ methods
Case study: detecting outliers in multidimensional data
Discussion and future directions

Page 3

Data Quality (DQ)

DQ problems can be generated in every step of the data life cycle:

Page 4

DQ problems (I)

Data gathering:
  instrument failures; false identifications
  geo-referencing
Data storage:
  key metadata missing
  erroneous data entry; database default values masquerading as real values

Page 5

DQ problems (II)

Data delivery: data corruption due to encoding conversion
Data integration: duplicated records
Data retrieval: missing values
Data analysis/cleaning: inappropriate models used, etc.

Page 6

DQ solving: a process-based approach

DQ solving is an essential component of data analysis and thus part of the data life cycle.

A. It builds a foundation for analysis and modeling
B. It provides feedback to improve the whole data life cycle
C. It could lead to more DQ problems if not carefully executed

Page 7

DQ solving methods

Harvest metadata close to data
Built-in integrity check and double data entry
Model-based approach:
  a) statistical
  b) heuristic
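
As an illustration of a built-in integrity check, here is a minimal Python sketch. It is not the authors' implementation. The record layout (a dict with "name", "lat", "lon") and the sentinel values in SENTINELS are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical sentinel values that a database might store as defaults.
SENTINELS = {0.0, -9999.0, 9999.0}


def check_record(rec):
    """Return a list of problem labels for one record with 'lat' and 'lon' keys."""
    lat, lon = rec.get("lat"), rec.get("lon")
    if lat is None or lon is None:
        return ["missing coordinate"]
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= lon <= 180.0:
        problems.append("longitude out of range")
    # A pair of sentinel values may be a default masquerading as a real position.
    if lat in SENTINELS and lon in SENTINELS:
        problems.append("possible default value")
    return problems


def find_duplicates(records):
    """Return the indices of records that share the same (name, lat, lon)."""
    keys = [(r.get("name"), r.get("lat"), r.get("lon")) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] > 1]
```

Checks like these only produce candidates: a record at latitude 0, longitude 0 may be genuine, so the labels are prompts for review rather than verdicts.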

Page 8

OBIS DQ Study

Metadata-related problems
DQ on scientific names
Integrity checking
Redundant records detection
Outlier detection: a case study
  Outliers sometimes represent erroneous data
  We are examining data mining tools for detecting erroneous data points

Page 9

DBSCAN: a clustering tool

DBSCAN is density-based in feature space
It deals with high-dimensional data
There is no need to specify the number of clusters
It identifies outliers during the clustering process
It is a fast algorithm and freely available
M. Ester, H.P. Kriegel, J. Sander and Xu. A density-based algorithm for discovering clusters in large spatial databases

Page 10

A diagram of DBSCAN

[Figure: DBSCAN diagram labeling core, border, and outlier points; neighborhood radius = 1 unit, MinPts = 5]
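
The core, border, and outlier roles can be made concrete with a small pure-Python sketch of the DBSCAN idea. This is a simplified illustration, not the freely available implementation mentioned in the talk. It assumes plain Euclidean distance and a brute-force neighbor search; the names dbscan, region_query, and NOISE are chosen for this example.

```python
import math

NOISE = -1  # label given to outliers


def region_query(points, i, eps):
    """Indices of all points within eps of point i (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point and starts a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core, so keep expanding
                queue.extend(j_neighbors)
    return labels
```

With the diagram's settings (radius 1 unit, MinPts = 5), five points packed around the origin form a cluster, a point at (1.4, 0) joins it as a border point because it lies within one unit of the core point (0.5, 0), and a distant point such as (10, 10) is labeled NOISE. For latitude and longitude data, Euclidean distance ignores the wrap-around at longitude 180 and the shrinking of longitude degrees toward the poles, so a real application would likely need a geographic distance instead.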

Page 11

Total points distribution

[Figure: map of the whole dataset; latitude -90 to 90, longitude -180 to 180]

Page 12

Result from DBSCAN

[Figure: map of the DBSCAN result; latitude -90 to 90, longitude -180 to 180; legend: cluster points, outliers]

Page 13

Limitation of the method

Geographical outliers may be used to identify erroneous points in survey data, but may not be good for museum collections or literature-based data records.

Other methods to identify erroneous distribution data? How about using environmental data as proxies?
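
One way to picture the environmental-proxy idea is a robust statistical screen on a single environmental variable. The sketch below is an assumption for illustration, not the method used in the study: it takes a list of values for one species (for example, a hypothetical depth at each record location) and flags those whose modified z-score, based on the median and the median absolute deviation (MAD), exceeds a threshold. The 0.6745 constant and the 3.5 cutoff are a common convention for this score.

```python
import statistics


def modified_z_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]
```

The median and MAD are used instead of the mean and standard deviation so that the outliers being hunted do not distort the yardstick used to find them.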

Page 14

Can we get some more information?

[Figure: map; latitude -90 to 90, longitude -180 to 180; legend: dcsn, dcso, dosn, doso]

Page 15

Limitations of using environmental variables

Risk of imposing a rigid model at the time of pre-processing
Risk of losing valuable outliers
Risk of circular logic in later analyses

Page 16

Discussions

Why don’t you use more environmental variables?
Can you use DBSCAN on environmental variables directly?

Page 17

Possible improvements

Define multiple methods as DQ components
Assign bootstrap weights
Present outlier candidates to experts
Update weights based on user feedback
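
The four steps above could fit together roughly as in the following Python sketch. It is one possible reading, not a design from the talk: it assumes that "bootstrap weights" means initial weights assigned to each method, that each method returns a yes/no flag per record, and that feedback is applied with a simple multiplicative penalty. The function names and the rate parameter are hypothetical.

```python
def combined_score(flags, weights):
    """Weighted share of methods that flag a record as a candidate error.

    flags: dict mapping method name to True/False for one record.
    weights: dict mapping method name to a positive weight.
    """
    total = sum(weights.values())
    return sum(weights[m] for m, flagged in flags.items() if flagged) / total


def update_weights(weights, flags, expert_says_error, rate=0.5):
    """Shrink the weight of every method that disagreed with the expert."""
    return {
        m: w if flags[m] == expert_says_error else w * (1 - rate)
        for m, w in weights.items()
    }
```

For example, with three equally weighted methods of which two flag a record, the combined score is 2/3; if an expert confirms the record is erroneous, the dissenting method's weight is halved and the same flags then score 0.8.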

Page 18

Summary

Many data quality problems can arise during the whole data life cycle.
Preliminary checking can eliminate a lot of simple errors.
Expert knowledge should be integrated and be the decisive factor when it comes to DQ solving.
Data mining techniques may act as metal detectors so that experts can focus on a narrowed-down group of candidates.

