
Data Mining

Marc M. VAN HULLE
K.U.Leuven, Faculteit Geneeskunde
Laboratorium voor Neuro- en Psychofysiologie
Campus Gasthuisberg, Herestraat, B-3000 Leuven, BELGIUM
E-mail: [email protected]

DEHASPE
Oncolmethylome BVBA
Gaston Geenslaan 1, B-3001 Heverlee, BELGIUM
E-mail: [email protected]

Contents:
1 Introduction
2 Data Transformation
3 Feature Selection and Feature Extraction
4 Frequent Pattern Discovery
5 Graph Mining
6 Data Stream Mining
7 Visual Mining

1 Introduction

Overview:
1.1 Knowledge discovery in databases vs. data mining
1.2 Steps in knowledge discovery
1.3 Data Mining
1.3.1 Data Mining Objectives
1.3.2 Building blocks of Data Mining
1.3.3 Data Mining Topics of this Course
1.3.4 Data Mining Tool Classification
1.3.5 Overview of Available Data Mining Tools
1.4 Data Preprocessing
1.4.1 Definition
1.4.2 Outlier removal

1.4.3 Noise removal
1.4.4 Missing data handling
1.4.5 Unlabeled data handling

1.1 Knowledge discovery in databases vs. data mining

Knowledge discovery in databases (KDD) = the non-trivial process of identifying valid, novel, potentially useful & understandable patterns & relationships in data (knowledge = patterns & relationships)
• pattern: expression describing facts about a data set
• relation: expression describing dependencies between data and/or patterns
• process: KDD is a multistep process, involving data preparation, data cleaning, data mining, ... (see further)
• valid: discovered patterns & relationships should be valid on new data with some certainty (or correctness, below an error level)
• novel: not yet known (to the KDD system)
• potentially useful: should lead to potentially useful actions (lower costs, increased profit, ...)
• understandable: provide knowledge that is understandable to humans, or that leads to a better understanding of the data set

1.1 KDD vs. data mining (Cont'd)

Data mining =

the step in the KDD process aimed at discovering patterns & relationships in preprocessed & transformed data

Hence: knowledge discovery = data preparation + data mining + evaluation/interpretation of the discovered patterns/relationships

Note: nowadays, data mining ≈ KDD; data preparation, ... = part of the data mining tool box

1.2 Steps in knowledge discovery

[Figure: the KDD process: data → (selection) → target data → (preprocessing) → preprocessed data → (transformation) → transformed data → (data mining) → detected patterns → (interpretation) → knowledge]

Steps in the KDD process:
1. develop an understanding of the application domain, relevant prior knowledge, and the goals of the end-user
2. create a target data set (= subset)
3. data cleaning and preprocessing: remove noise & outliers/wildshots, handle missing data & unlabeled data, ...
4. transform the data (dimensionality reduction & data projection):

   find useful features with which to represent the data more efficiently
5. select the data mining task

1.2 Steps in knowledge discovery (Cont'd)

[Figure: the KDD process (repeated)]

Steps in the KDD process (Cont'd):
6. choose the data mining algorithm
7. data mining
8. interpret the mined patterns & relationships; possibly return to steps 1-7
9. consolidate the discovered knowledge

1.3 Data Mining

Core process in KDD = data mining

[Figure: the KDD process, with the data mining step highlighted]

1.3.1 Data Mining Objectives

Two high-level objectives of Data Mining: prediction & description
• Prediction of unknown or future values of selected variables
• Description in terms of (human-interpretable) patterns
  – Description → gain "insights"
  – Prediction → support, improve & automate decision-making

[Figure: Data mining objectives: engineering mathematical models of (business) processes from (typically large and numerical) data sets; actors such as customers, products, services, sales reps, marketing, distribution, suppliers and actions relate to sales, profits, time, effectivity, costs, quality, events, ...]

1.3.1 Data Mining Objectives (Cont'd)

• Prediction & Description involve modeling the data set
• Differing degrees of model complexity:
  – simplest: OLAP
  – intermediate: linear regression, decision tree, clustering, classification
  – complex: neural networks
  (OLAP = On-Line Analytical Processing)

1.3.2 Building blocks of Data Mining

1.3.2.1 OLAP

On-Line Analytical Processing (OLAP) =

a set of tools for providing multi-dimensional analysis of data warehouses

Data warehouse = database that contains subject-oriented, integrated, and historical data, primarily used in analysis and decision-support environments
• requires collecting & cleaning transactional data & making it available for on-line retrieval: a formidable task, especially in mergers with ≠ database architectures!
• OLAP = superior to SQL in computing summaries and breakdowns along dimensions
  (SQL (Structured Query Language) = script language for interrogating (manually) large databases such as Oracle)
• OLAP requires substantial interaction from the users to identify interesting patterns (clusters, trends)
• also: OLAP often confirms the user's hunch ≠ looking for real hidden patterns & relations
• OLAP is now integrated into more advanced data mining tools
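A minimal sketch of such a summary/breakdown along dimensions as a pivot table, assuming pandas (the table and its column names are invented for illustration):

```python
# Hypothetical OLAP-style breakdown along two dimensions (pandas assumed;
# the data set and column names are invented for this sketch).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120.0, 80.0, 95.0, 130.0, 60.0],
})

# Summary along region x product; margins=True adds the roll-up totals.
cube = sales.pivot_table(values="revenue", index="region",
                         columns="product", aggfunc="sum", margins=True)
print(cube)
```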

1.3.2.2 Linear regression

• Consider income/debt data

[Figure: linear regression example, debt vs. income]

Assumptions:
1. independent variable = income
2. dependent variable = debt
3. relation between income & debt = linear

1.3.2.2 Linear regression (Cont'd)

• Failure: when the relation ≠ linear

[Figure: debt vs. income data with a non-linear relation]

[Figure: non-linear regression with a semi-circle, debt vs. income]

[Figure: non-linear regression with a smoothing spline, debt vs. income]

• Hence: need for non-linear regression capabilities
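A minimal sketch contrasting a straight-line fit with a smoothing spline on curved (semi-circle-like) data, assuming NumPy/SciPy; the toy data are invented:

```python
# Linear vs. non-linear regression on invented income/debt-style data
# with a curved relationship (NumPy and SciPy assumed).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
income = np.sort(rng.uniform(0.0, 1.0, 200))
debt = np.sqrt(np.clip(0.25 - (income - 0.5) ** 2, 0.0, None))  # semi-circle
debt += rng.normal(0.0, 0.02, income.size)                      # noise

# Linear fit: misses the curvature entirely.
slope, intercept = np.polyfit(income, debt, deg=1)
lin_pred = slope * income + intercept

# Smoothing spline: s controls the smoothness/fidelity trade-off.
spline = UnivariateSpline(income, debt, s=0.1)
spl_pred = spline(income)

print("linear MSE :", np.mean((debt - lin_pred) ** 2))
print("spline MSE :", np.mean((debt - spl_pred) ** 2))
```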

1.3.2.2 Linear regression (Cont'd)

• Types of regression models:
1. functional (descriptive) models: the purpose is to summarize the data compactly, not to explain the system that generated the data
2. structural (mechanistic) models: the purpose is to account for the physics, statistics, ... of the system that generated the data
• best results are obtained with a priori knowledge of the system/process that generated the data (structural modeling)

Example: estimate the probability that drug X will cure a patient
→ a better model if it incorporates knowledge of:
1) the components of the drug that have curative/side effects
2) the certainty of the patient's diagnosis (alternative diagnoses?)
3) the patient's track record of reactions to the drug components

1.3.2.2 Linear regression (Cont'd)

What is the right level of model detail?
• more parameters ≠ a better model!
  more parameters = more data needed to estimate them!
• more parameters = risk of overfitting!
  a complex regression model goes through all data points, but fails to correctly model new data points!
• "With 5 parameters you can fit an elephant. And with 6 parameters you can make it blink!"
• more parameters ≠ a deeper understanding of the system/process that generated the data
  cf. Occam's razor: the simplest model that explains the data = preferred; leads to better understanding
• a good model has good predictive power (e.g., tested on new data points)
• a good model provides confidence levels or degrees of certainty with its regression results
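A minimal overfitting sketch, assuming NumPy (the true system and noise level are invented): a high-degree polynomial matches the training points almost exactly yet generalizes worse than a line.

```python
# Overfitting demo: many parameters fit the training data but not new data.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 10)
x_test = np.linspace(0.0, 1.0, 100)

def f(x):                                   # true (linear) system, invented
    return 2.0 * x + 1.0

y_train = f(x_train) + rng.normal(0.0, 0.1, x_train.size)
y_test = f(x_test) + rng.normal(0.0, 0.1, x_test.size)

for deg in (1, 9):                          # few vs. many parameters
    coef = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    print(f"degree {deg}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```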

1.3.2.3 Decision trees

Decision tree = technique that recursively splits/divides the space into ≠ (sub-)regions with decision hyperplanes orthogonal to the coordinate axes
• Decision tree = root node + successive directional links/branches to other nodes, until a leaf node is reached
• at each node: ask for the value of a particular property, e.g., color? → green
• continue until there are no further questions = leaf node
• a leaf node carries a class label
• a test pattern gets the class label of the leaf node it reaches

[Figure: decision tree for fruit classification: the root asks color? (green/yellow/red, level 0); deeper levels ask size?, shape?, taste?; leaves (levels 1-3): watermelon, apple, grape, grapefruit, lemon, cherry, banana]

1.3.2.3 Decision trees (Cont'd)

• Advantages:
  – easy to interpret
  – rules can be derived from the tree, e.g., Apple = (medium size AND NOT yellow color)
• Disadvantages:
  – devours data at a rate exponential with depth; hence, to uncover complex structure, extensive data is needed
  – crude partitioning of the space: corresponds to a (hierarchical) classification problem in which each variable has a different constant value for each class, independently from the other variables
    (Note: classification = partitioning of a set into subsets based on knowledge of class membership)
• Introduced by the work on Classification And Regression Trees (CART) of Breiman et al. (1984)
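A minimal sketch of such a tree and its derived rules, assuming scikit-learn; the toy fruit-like data set and feature encoding are invented:

```python
# Decision tree on an invented toy fruit data set (scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [size (0=small, 1=medium, 2=big), yellow (0/1)]
X = [[1, 0], [1, 0], [2, 1], [0, 1], [1, 1], [2, 0]]
y = ["apple", "apple", "grapefruit", "lemon", "lemon", "watermelon"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Rules can be read off the tree, cf. "apple = medium size AND NOT yellow".
print(export_text(tree, feature_names=["size", "yellow"]))
print(tree.predict([[1, 0]]))   # -> ['apple']
```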

1.3.2.4 Clustering

• Clustering = a way to detect subsets of similar data
• Example: customer database:
  1. how many types of customers are there?
  2. what is the typical customer profile for each type?
     e.g., cluster mean or median, ... = cluster prototype = typical profile

[Figure: clustering & cluster prototypes: three clusters in the debt/income plane, each with its prototype]

1.3.2.4 Clustering (Cont'd)

• a wealth of clustering algorithms exists
  for overviews, see: Duda & Hart (1973), Duda et al. (2001), Theodoridis and Koutroumbas (1998)
• major types of clustering algorithms:
1. distortion-based clustering = the most widely used technique
   here: k-means & fuzzy k-means clustering
2. density-based clustering: local peaks in the density surface indicate the cluster(-centers)
   here: hill-climbing

1.3.2.4 Clustering (Cont'd)
1. Distortion-based clustering

k-means clustering
• Assume: we have c clusters & a sample set $D = \{v_i\}$
• Goal: find the mean vectors of the clusters, $\mu_1, \dots, \mu_c$

• Algorithm:
  1. initialize $\mu_1, \dots, \mu_c$
  2. do: for each $v_j \in D$, determine $\arg\min_i \|v_j - \mu_i\|$ (assign $v_j$ to the closest cluster);
     recompute $\mu_i$: $\mu_i \leftarrow$ average{all samples $\in$ cluster$_i$}
  3. until no change in $\mu_i$, $\forall i$
  4. stop
• Converges in fewer iterations than the number of samples
• Each sample belongs to exactly 1 cluster
• Underlying idea: minimize the mean squared error (MSE) distortion (squared Euclidean distance) between the cluster means & the cluster samples:
  $$J_{k\text{-means}} = \sum_{i=1}^{c} \sum_{j=1}^{n} M_i(v_j)\, \|v_j - \mu_i\|^2$$
  with $M_i(v_j) = 1$ if $i = \arg\min_k \|v_j - \mu_k\|$, else $= 0$
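A minimal sketch of this procedure, assuming NumPy; the initialization scheme and toy data are invented, and empty-cluster handling is omitted:

```python
# Minimal k-means following the algorithm above (NumPy assumed).
import numpy as np

def kmeans(V, c, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = V[rng.choice(len(V), size=c, replace=False)]   # 1. initialize mu_1..mu_c
    for _ in range(n_iter):
        # 2. assign each v_j to the closest prototype (squared Euclidean)
        labels = np.argmin(((V[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each mu_i as the average of its cluster (assumes none empty)
        new_mu = np.array([V[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_mu, mu):                     # 3. until no change
            break
        mu = new_mu
    return mu, labels

rng = np.random.default_rng(1)
V = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in ((0, 0), (1, 0), (0, 1))])
mu, labels = kmeans(V, c=3)
print(mu)
```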

1.3.2.4 Clustering (Cont'd)
1. Distortion-based clustering (Cont'd)

k-means clustering (Cont'd)
• Cluster membership function:
  $$c = \arg\min_{\text{all clusters } i} \|v_j - \mu_i\|$$
  i.e., the closest cluster prototype in Euclidean-distance terms
• Result: partitioning of the input space into non-overlapping regions, i.e., quantization regions
• Shape of the quantization regions = convex polytopes; the boundaries are bisector planes of the lines joining pairs of prototypes
• Partitioning = Voronoi tessellation or Dirichlet tessellation

[Figure: Voronoi tessellation of the input space induced by a set of labeled prototypes]

1.3.2.4 Clustering (Cont'd)
1. Distortion-based clustering (Cont'd)

k-means clustering (Cont'd)

Example

[Figure: trajectories of the prototypes & Voronoi tessellation of the k-means clustering procedure on 3 clusters in 2D (debt/income) space, showing initial, intermediate, and final prototype positions]

1.3.2.4 Clustering (Cont'd)

1. Distortion-based clustering (Cont'd)

Fuzzy k-means clustering
• Assume: we have c clusters & a sample set $D = \{v_i\}$
• Each sample belongs in probability to a cluster: each sample has a graded or fuzzy cluster membership
• Define: $P(\omega_i | v_j, \hat\Theta)$ is the probability that sample $v_j$ belongs to cluster $\omega_i$, given the parameter vector of the membership functions, $\hat\Theta = \{\hat\theta_1, \dots, \hat\theta_c\}$ (we further omit $\hat\Theta$ in our notation)
• Note that: $\sum_{i=1}^{c} P(\omega_i | v_j) = 1, \ \forall v_j \in D$ (i.e., normalized)
• Goal is to minimize:
  $$J_{\mathrm{fuzz}} = \sum_{i=1}^{c} \sum_{j=1}^{n} [P(\omega_i | v_j)]^b \, \|v_j - \mu_i\|^2$$
  by gradient descent on $J_{\mathrm{fuzz}}$ w.r.t. $\mu_i$
  Note: b = 0 → cf. MSE minimization; b > 1 → each pattern belongs to several classes, e.g., b = 2

1.3.2.4 Clustering (Cont'd)
1. Distortion-based clustering (Cont'd)

Fuzzy k-means clustering (Cont'd)
• Result of the gradient descent: $\mu_i$ is computed at each iteration step as:
  $$\mu_i \leftarrow \frac{\sum_{j=1}^{n} [P(\omega_i|v_j)]^b \, v_j}{\sum_{j=1}^{n} [P(\omega_i|v_j)]^b}, \quad \forall i$$
  and $P(\omega_i|v_j)$ is computed as:
  $$P(\omega_i|v_j) \leftarrow \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}, \qquad d_{ij} = \|v_j - \mu_i\|^2$$
• Algorithm:
  1. initialize $\mu_1, \dots, \mu_c$ and $P(\omega_i|v_j)$, $\forall i, j$
  2. do: recompute $\mu_i$, $\forall i$; recompute $P(\omega_i|v_j)$, $\forall i, j$
  3. until only a small change in $\mu_i$, $\forall i$, and in $P(\omega_i|v_j)$, $\forall i, j$
  4. stop

1.3.2.4 Clustering (Cont'd)
1. Distortion-based clustering (Cont'd)

Advantages/disadvantages
• advantage: simple to implement + a wealth of heuristics
• disadvantages:
  1. assumes the cluster distribution = spherical around the center
  2. assumes the number of clusters = known
     → heuristics have been developed for determining the number of clusters
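A minimal sketch of these fuzzy updates, assuming NumPy; the tolerance, the small eps guard against division by zero, and the toy data are invented:

```python
# Fuzzy k-means: alternate the membership update P(omega_i|v_j) and the
# weighted-mean update of mu_i, per the equations above (NumPy assumed).
import numpy as np

def fuzzy_kmeans(V, c, b=2.0, n_iter=100, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.random((c, len(V)))
    P /= P.sum(axis=0)                                   # normalized memberships
    for _ in range(n_iter):
        W = P ** b
        mu = (W @ V) / W.sum(axis=1, keepdims=True)          # update mu_i
        d = ((V[None, :, :] - mu[:, None, :]) ** 2).sum(-1)  # d_ij = ||v_j - mu_i||^2
        inv = (1.0 / (d + eps)) ** (1.0 / (b - 1.0))
        P_new = inv / inv.sum(axis=0)                        # update P(omega_i|v_j)
        if np.allclose(P_new, P, atol=1e-6):                 # small change -> stop
            break
        P = P_new
    return mu, P

rng = np.random.default_rng(1)
V = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in ((0, 0), (1, 1))])
mu, P = fuzzy_kmeans(V, c=2)
print(mu)
```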

1.3.2.4 Clustering (Cont'd)
1. Distortion-based clustering (Cont'd)

Heuristics for the optimal number of clusters

1) Statistical folklore technique:
1. cluster dispersion = sum of squared Euclidean distances between pairs of samples of the same cluster:
   $$W_c = \sum_{r=1}^{c} \frac{1}{2 n_r} \sum_{i,j \in \mathrm{cluster}_r} \|v_i - v_j\|^2$$
   with $n_r$ = the number of samples in the $r$-th cluster
2. plot the cluster dispersion as a function of the number of clusters
3. look for the point where the dispersion starts decreasing at a slower rate
4. choose this point as the optimal number of clusters

[Figure: left panel, 2 clusters in the debt/income plane; right panel, cluster dispersion as a function of the number of clusters]
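A minimal sketch of this heuristic, assuming NumPy and scikit-learn's KMeans to supply the cluster labels; the toy data are invented:

```python
# Cluster dispersion W_c as a function of c: look for the "elbow".
import numpy as np
from sklearn.cluster import KMeans

def dispersion(V, labels, c):
    # W_c = sum_r (1 / (2 n_r)) * sum_{i,j in cluster_r} ||v_i - v_j||^2
    W = 0.0
    for r in range(c):
        Vr = V[labels == r]
        diff = Vr[:, None, :] - Vr[None, :, :]
        W += (diff ** 2).sum() / (2 * len(Vr))
    return W

rng = np.random.default_rng(0)
V = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in ((0, 0), (1, 0), (0, 1))])
for c in range(1, 7):
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(V)
    print(c, round(dispersion(V, labels, c), 3))
```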

1.3.2.4 Clustering (Cont'd)
1. Distortion-based clustering (Cont'd)

Heuristics for the optimal number of clusters (Cont'd)

2) Gap statistic (Tibshirani et al., 2000):
1. determine the gap statistic:
   $$\mathrm{Gap}(c) = E^*(\log(W_c)) - \log(W_c)$$
   $E^*(\log(W_c))$ = the dispersion expected for a uniform distribution with the same range as the original data set (in fact, $E^*(\log(W_c))$ is the average over $B$ uniform data sets)
2. plot $\mathrm{Gap}(c)$ as a function of the number of clusters $c$
3. optimal number of clusters:
   $c_{\mathrm{opt}}$ = the $c$ for which $\mathrm{Gap}(c) \ge \mathrm{Gap}(c+1) - s_{c+1}$
   with $s_{c+1} = \mathrm{sd}_{c+1}\sqrt{1 + 1/B}$ and
   $$\mathrm{sd}_{c+1} = \left[ \frac{1}{B} \sum_{b=1}^{B} \left( \log(W^b_{c+1}) - \frac{1}{B} \sum_{b=1}^{B} \log(W^b_{c+1}) \right)^2 \right]^{1/2}$$
   (i.e., the standard deviation of the dispersions observed)

[Figure: left panel, 2 clusters in the debt/income plane; right panel, the gap statistic as a function of the number of clusters]
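A minimal sketch of the gap statistic, assuming NumPy and scikit-learn; the toy data, B, and the within-cluster sum-of-squares form of $W_c$ (equivalent to the pairwise form above) are implementation choices of this sketch:

```python
# Gap statistic: compare log(W_c) on the data with its average over B
# uniform reference sets drawn in the data's bounding box.
import numpy as np
from sklearn.cluster import KMeans

def log_dispersion(V, c, seed=0):
    labels = KMeans(n_clusters=c, n_init=10, random_state=seed).fit_predict(V)
    # within-cluster sum of squares == the pairwise dispersion W_c above
    W = sum(((V[labels == r] - V[labels == r].mean(0)) ** 2).sum()
            for r in range(c))
    return np.log(W)

def gap(V, c, B=10):
    rng = np.random.default_rng(0)
    lo, hi = V.min(axis=0), V.max(axis=0)
    ref = [log_dispersion(rng.uniform(lo, hi, V.shape), c) for _ in range(B)]
    s = np.std(ref) * np.sqrt(1.0 + 1.0 / B)        # s_c = sd_c * sqrt(1 + 1/B)
    return np.mean(ref) - log_dispersion(V, c), s

rng = np.random.default_rng(1)
V = np.vstack([rng.normal(m, 0.1, (60, 2)) for m in ((0, 0), (1, 0), (0, 1))])
gaps = {c: gap(V, c) for c in range(1, 6)}
for c in range(1, 5):                 # Gap(c) >= Gap(c+1) - s_{c+1} ?
    if gaps[c][0] >= gaps[c + 1][0] - gaps[c + 1][1]:
        print("optimal number of clusters:", c)
        break
```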

1.3.2.4 Clustering (Cont'd)
2. Density-based clustering
• Determine a density estimate at the data points in the set $D = \{v_i\}$
• Hill-climbing:
  1. since the density estimate can be noisy, with irrelevant bumps: for each data point $v_i$, look for a higher density estimate within a certain distance K from $v_i$
  2. move to this new data point $v_j$; repeat the operation until no further progress is possible: the resulting data point = the locus of a local density peak
• The set of these loci = the cluster prototypes; the set of data points converging to the same density peak = the data points of one cluster
• Algorithm with hill-climbing:
  1. determine the density estimate given D; HC ← D, HC2 ← ∅
  2. do until HC does not change
  3.   for all data points ∈ HC: look for the data point $v_j$ in range K with a higher density; HC2 ← HC2 ∪ {$v_j$}
       HC ← HC2; HC2 ← ∅
  4. stop
• Note: hill-climbing ≠ gradient descent
• Other algorithms: valley-seeking approach (Koontz & Fukunaga, 1972), SKeleton by Influence Zones (SKIZ) (Serra, 1982), ...
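A minimal sketch of hill-climbing on a density estimate, assuming NumPy; the Gaussian-kernel density estimate, its bandwidth bw, the range K, and the toy data are invented:

```python
# Density-based clustering by hill-climbing: estimate the density at every
# sample, then repeatedly hop to the densest neighbor within range K;
# samples converging to the same peak form one cluster.
import numpy as np

def hill_climb_clusters(V, K=0.3, bw=0.15):
    n = len(V)
    d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)   # pairwise squared dists
    dens = np.exp(-d2 / (2.0 * bw * bw)).sum(axis=1)      # kernel density per point
    peak = np.empty(n, dtype=int)
    for i in range(n):
        j = i
        while True:                       # climb to a local density peak
            nbrs = np.where(d2[j] <= K * K)[0]
            k = nbrs[np.argmax(dens[nbrs])]
            if dens[k] <= dens[j]:        # no denser neighbor in range: peak
                break
            j = k
        peak[i] = j                       # points sharing a peak share a cluster
    return peak

rng = np.random.default_rng(0)
V = np.vstack([rng.normal(m, 0.08, (40, 2)) for m in ((0, 0), (1, 1))])
print("clusters found:", len(set(hill_climb_clusters(V))))
```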

1.3.2.4 Clustering (Cont'd)
2. Density-based clustering (Cont'd)

Hill-climbing example

[Figure: left panel, the distribution from which the samples are drawn; right panel, the estimate of the distribution determined from the sample set (density over v1, v2)]

[Figure: clusters & boundaries determined with the hill-climbing algorithm in the (v1, v2) plane]

1.3.2.4 Clustering (Cont'd)
2. Density-based clustering (Cont'd)

Advantages/disadvantages

• advantage: no assumptions about the shape of the density distribution; less heuristics-based (more objective)
• disadvantages:
  1. vulnerable in high-dimensional spaces: it is difficult to estimate the density when the dimensionality ↑
  2. how to determine the optimal smoothness of the density estimate? → complex/tricky procedures

1.3.2.4 Clustering (Cont'd)

Hierarchical clustering
Clusters are arranged in a tree
• divisive clustering (progressive subdivision of the data set)
• agglomerative clustering (progressively merge clusters)

[Figure: divisive clustering]
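A minimal agglomerative-clustering sketch, assuming SciPy; the linkage method, cut level, and toy data are invented:

```python
# Agglomerative clustering: progressively merge clusters into a tree,
# then cut the tree at a chosen number of clusters (SciPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
V = np.vstack([rng.normal(m, 0.1, (30, 2)) for m in ((0, 0), (1, 0), (0, 1))])

Z = linkage(V, method="ward")                    # bottom-up merge tree
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(np.bincount(labels)[1:])                   # cluster sizes
```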

1.3.2.5 Classification

1) Nearest-Neighbor classification
• the simplest classification technique
• algorithm:
  1. given a set of samples (prototypes) & corresponding labels: $F = \{(\mu_i, \mathrm{label}_i)\}$
  2. what is the label of sample $v_j$?
  3. determine the closest prototype: $c = \arg\min_i \|v_j - \mu_i\|$
  4. assign label $c$ to sample $v_j$

[Figure: debt/income plane with labeled prototypes; the unlabeled sample receives the label of the closest prototype]

• assumes a Voronoi tessellation of the input space
• misclassification rate > the Bayes rate (= the minimal misclassification rate)
  for # prototypes → ∞: misclassification rate < 2 × the Bayes rate

1.3.2.5 Classification (Cont'd)

2) k-Nearest-Neighbor classification
• for k > 1: closer to the Bayes rate than the k = 1 case
• algorithm:
  1. given a set of samples (prototypes) & corresponding labels: $F = \{(\mu_i, \mathrm{label}_i)\}$
  2. what is the label of sample $v_j$?
  3. determine the k closest prototypes
  4. label sample $v_j$ according to the majority of the labels of the k closest prototypes

[Figure: k = 10 nearest-neighbor classification in the debt/income plane: among the 10 nearest neighbors of the unlabeled sample, 7 are of one class and 3 of the other; hence, the sample gets the majority label]
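A minimal sketch of both variants, assuming NumPy; the prototypes and labels are invented, and k = 1 gives plain nearest-neighbor classification:

```python
# (k-)nearest-neighbor classification with labeled prototypes.
import numpy as np
from collections import Counter

def knn_label(v, prototypes, labels, k=1):
    d = ((prototypes - v) ** 2).sum(axis=1)       # squared Euclidean distances
    nearest = np.argsort(d)[:k]                   # the k closest prototypes
    return Counter(labels[nearest]).most_common(1)[0][0]  # majority label

prototypes = np.array([[0.2, 0.1], [0.3, 0.2], [0.8, 0.9], [0.7, 0.8]])
labels = np.array(["low-debt", "low-debt", "high-debt", "high-debt"])

print(knn_label(np.array([0.25, 0.15]), prototypes, labels, k=1))
print(knn_label(np.array([0.60, 0.70]), prototypes, labels, k=3))
```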

1.3.2.6 Neural Networks

(Artificial) Neural Network = a network of nodes (neurons) that compute their states numerically through interactions via directional links (synaptic connections/synapses) with other nodes
• The synapses are modified so that the network performs a given task; the modification is done with a learning algorithm
• Network architecture + learning algorithm specify the type of task the network can solve
• in Data Mining, ANNs are used for: regression, classification, clustering, (time-series) prediction, feature extraction (PCA, ICA), manifold projection (topographic maps, visual mining)
• for more information: see the courses Neural Computing (H02B3A) & Artificial Neural Networks (H02C4A)

1.3.3 Data Mining Topics of this Course

Data mining techniques:
• Frequent pattern discovery
  Find all patterns for which there are sufficient examples in the sample data.
  In contrast, k-optimal pattern discovery techniques find the k patterns that optimize a user-specified measure of interest. The value k is also specified by the user (e.g., k-means clustering).
• Graph mining
  Structure mining, or structured data mining, is the process of finding and extracting useful information from semi-structured data sets. Graph mining is a special case of structured data mining.
• Data stream mining
  Paradigms for knowledge discovery from evolving data.
• Visual mining
  Exploratory data analysis by visualization of high-dimensional data:
  – data transformation techniques (PCA, ICA, MDS, SOM, GTM)
  – iconic display techniques (abstract & metaphoric glyphs)

Data preparation techniques:
• Data transformation (dimensionality reduction)
• Feature extraction & selection

1.3.4 Data Mining Tool Classification

Data mining tool vendors offer ≠ products:
• extended business intelligence (BI) suites
• generic data mining tool suites
• specialized stand-alone products
• proprietary solutions

[Figure: Data Mining tool classification]

1.3.4 DM Tool Classification (Cont'd)

1) Generic, application-independent tools
• wide array of data mining & visualization techniques: from clustering & regression to neural networks
• requires skilled staffing: one has to know how to prepare the data, what technique to use for a given task, how to use the technique, how to validate the results, and how to apply it in business
• examples: IBM's Intelligent Miner, SPSS Clementine, SAS Institute's Enterprise Miner

2) Algorithm-specific tools
• similar to generic tools, but use a specific set of algorithms
• yield better results if the algorithms suit the given problem
• examples focusing on decision trees: CART from Salford Systems, KnowledgeSeeker from Angoss Software, Alice from Isoft
• examples focusing on neural networks: NeuralWorks Professional from NeuralWare, Viscovery by Eudaptics, GS Textplorer by Gurusoft

1.3.4 DM Tool Classification (Cont'd)

3) Application-specific tools
• customized environment, e.g., for marketing, churn prediction, customer relationship management (CRM)
• strength = provide the user with guidance in setting up a data mining project through question-and-answer dialogs
• hence, less expertise is required from the user
• examples: IBM Intelligent Miner for relationship marketing, Unica's Model 1, SLP Infoware's Churn/CPS for churn prediction, Quadstone's Decisionhouse for CRM

4) Embedded data mining tools
• added to database management systems (DBMS) products & BI suites
• mostly decision trees only
• easy to use, less specialized staff needed

• restricted flexibility affects the quality of the results (less accurate); limited functionality for validating the results

1.3.4 DM Tool Classification (Cont'd)

5) Analytical programming tools
• targeted at generic analytical tasks, not data mining specifically
• lots of graphics, database access facilities, statistics
• work only for smaller data sets
• require experienced staff (savvy users)
• examples for the business analyst: SPSS's Base, SQL tools, Excel
• examples for the statistical analyst: SAS Macro Language, Matlab, S-Plus

6) Data mining solutions & support from an external services provider (ESP)
• from advice to development & on-/off-site implementation
• drawback = loss of control by the customer, e.g., loss of personnel at the ESP → project follow-up?
• projects can also fail due to bad model selection, wrong business constraints, unknown regulations, ...
• examples: PriceWaterhouseCoopers (PWC), IBM Global Business Solutions, Data4S, ...

1.3.4 DM Tool Classification (Cont'd)

Bottom line:
• One should know the data mining project requirements + the benefits of all options offered for the project
• When the data mining objectives are unclear → choose class 1
• However: trade-off between the quality of the results & the required skill level, flexibility, time-to-solution
• the classes are not mutually exclusive (they overlap) + can complement each other, in particular for companies with complex or numerous data mining objectives

1.3.4 DM Tool Classification (Cont'd)

Evaluation of data mining tool categories:

Class         Ease of deployment  Quality of results  Time-to-solution  Flexibility
Generic       2-3                 3-4                 2-4               3-4
Alg-spec      2-3                 4-5                 2-4               2-3
Appl-spec     4-5                 2-4                 4-5               1-2
Embedded      3-4                 1-3                 3-5               2-3
Analytical    1-5                 2-5                 1-5               5
DM solutions  3-5                 2-5                 2-5               3-5

Ratings: 1 = worst, 5 = best. Ease of deployment is inversely proportional to the in-house skills required.

1.3.5 Overview of Available DM Tools

Commercial (com) & public-domain (pd) systems
• Classification
  – Decision-tree approach:
    pd: C4.5, GAtree, IND, Mangrove, OC1, ODBCMINE,

    PC4.5, SMILES, ...
    com: AC2, Alice d'Isoft, Angoss KnowledgeSEEKER, Angoss StrategyBUILDER, C5.0, CART 5.0, DTREG, Decisionhouse, SPSS AnswerTree, Xpertrule Miner, ...
  – Neural network approach:
    pd: NN FAQ free software list, NuClass7, Sciengy RPF, ...
    com: Alyuda NeuroIntelligence, BioComp iModel(tm), COGNOS 4Thought, BrainMaker, KINOsuite, MATLAB Neural Net Toolbox, MemBrain, NeuroSolutions, NeuroXL, NeuralWorks Predict, SPSS Neural Connection 2, STATISTICA Neural Networks, Synapse, Tiberius, Eudaptics Viscovery, Gurusoft GS Textplorer
  – Rule discovery approach:
    pd: CBA, DM-II system, KINOsuite-PR, PNC2 Rule Induction System
    com: Compumine Rule Discovery System, Datamite, DMT

    Nuggets, PolyAnalyst, SuperQuery, WizWhy, XpertRule Miner
  – Genetic programming approach:
    com: Evolver, GAtree, Genalytics GA3, GenIQ Model, Gepsoft GeneXproTools 4.0
  – Other approaches:
    pd: Grobian, Rough Set Exploration System (RSES)
    com: BLIASoft Knowledge Discovery software, Datalogic

1.3.5 Overview of Available DM Tools (Cont'd)

Commercial (com) & public-domain (pd) systems (Cont'd)
• Classification: decision tree methods, rule-based methods, neural networks
• Clustering
  pd: Autoclass C, Databionic ESOM Tools, MCLUST/EMCLUST, PermutMatrix, Snob, SOM in Excel, StarProbe
  com: BayesiaLab, ClustanGraphics3, CViz Cluster Visualization, IBM Intelligent Miner for Data, Neusciences aXi.Kohonen, PolyAnalyst, StarProbe, Viscovery explorative data mining modules, Visipoint
• for more information, see: http://www.kdnuggets.com/siftware.html

1.3.5 Overview of Available DM Tools (Cont'd)

Neural Networks in Data Mining:
1. Additional tool in a generic data mining suite, or 1-of-several algorithms in an algorithm-specific DM tool
• (mostly) regression/classification/(time-series) prediction

• Techniques used:
  – Multilayer Perceptron (MLP), trained with Backprop
  – Radial Basis Function (RBF) networks: 2-layered NN,

    input layer = RBFs, mostly radially-symmetric Gaussians; output layer = 1-layered perceptron
  – Support Vector Machines (SVM): optimal choice of classification boundaries by weight vectors (support vectors); also used for regression purposes
• for more information on RBF, SVM: see the course Artificial Neural Networks (H02C4A)
  Examples: SPSS Clementine, Thinking Machines' Darwin, Right Information Systems' 4Thought, Vienna Univ. Techn. INSPECT
• sometimes (rudimentary) use of the SOM algorithm for clustering & (high-dimensional) data visualization

2. Prime tool in specialized stand-alone tools, using topographic maps (SOM) for clustering, high-dimensional data visualization, regression, classification
• for more information on SOM: see further
  Examples: Viscovery by Eudaptics, Databionic ESOM Tools, GS Textplorer by Gurusoft, Neusciences aXi.Kohonen by Solutions4Planning

1.4 Data Preprocessing

1.4.1 Introduction

• Data Mining is rarely performed on raw data
• Reasons:
  1. data can be noisy
     e.g., noisy time series: apply temporal filtering
     underlying idea: 1) small fluctuations are not important, but trends are; 2) fluctuations can be due to an imperfect measuring device
  2. data can contain outliers/wildshots (= unlikely samples)
     e.g., human error in processing poll results, errors by customers filling out forms in a marketing inquiry, etc.
     however: outliers could be the nuggets one is looking for...
  3. data can be incomplete
     e.g., not every question in a poll is answered
  4. data can be unlabeled

     e.g., the outcome of not every clinical trial is known
  5. not enough data to perform, e.g., a clustering analysis
     no clear aggregates observed which could lead to clusters
  6. too high-dimensional data to do, e.g., regression
     not enough data to estimate the many parameters

Hence, prior to DM:
• data preprocessing (points 1...4) → this chapter
• data transformation (points 5 & 6) → the next 2 chapters:
  – project on a subspace/manifold (Data Transformation)
  – select a subset of features or inputs (Feature Selection)

1.4.1 Introduction (Cont'd)

[Figure: the KDD process, with the preprocessing step highlighted]

Data Preprocessing = remove noise & outliers, handle missing data & unlabeled data, ...
• Outlier removal → distinguish informative from uninformative patterns
• Noise removal → suppress noise by appropriate filtering
• Missing data handling →

  deal with missing entries in tables
• Unlabeled data handling → deal with missing labels in classification

1.4.2 Outlier removal

• What is an outlier?
• Statistical definition: a new pattern $v_i$ is an outlier of data set $D$ if the probability $P(v_i \in D) < 10^{-6}$, e.g.
• Information-theoretic definition: a new pattern = an outlier when it is difficult to predict by a model trained on previously seen data
  → outlier = informative pattern
• A pattern is informative when it is surprising
• Example: 2-class problem, labels {0, 1}
  probability that the estimated label $\hat{y}_k$ of a new pattern $v_k$, given the classification model, = the correct label $y_k$:
  $$I(k) = -\log P(\hat{y}_k = y_k) = -y_k \log P(\hat{y}_k = 1) - (1 - y_k) \log(1 - P(\hat{y}_k = 1))$$
  (Shannon information gain)
• In the information-theoretic sense: pattern $v_k$ = most informative when $I(k)$ > threshold, else it is uninformative

1.4.2 Outlier removal (Cont'd)

1.4.2.1 Data cleaning
• Garbage patterns can also be informative!!!
• Data cleaning: sort out good/bad outliers; sort out the nuggets from the garbage
• Purely manual cleaning = tedious
• Hence: computer-aided tools for data cleaning, applicable to classification, regression, data density modeling
• On-line algorithm:
  1. train a model (classifier) that provides $I$ estimates

     on a small, clean subset
  2. draw a labeled pattern $(v_i, y_i)$ from the raw database
  3. check the information gain $I(i)$ against the threshold:
     – if $I(i)$ < threshold → OK → use for training the model
     – if $I(i)$ > threshold → a human operator checks: the pattern = garbage (→ discard) or acceptable (→ use for training the model)
  4. stop when all data has been processed
  Disadvantage: dependence on the order in which the patterns are presented
  Question: what is the optimal threshold?
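A minimal sketch of the information-gain check above, assuming NumPy; the threshold value and example probabilities are invented:

```python
# Shannon information gain I(k): flag a labeled pattern for operator review
# when I(k) exceeds a threshold (to be tuned, cf. the optimal-threshold step).
import numpy as np

def information_gain(y_true, p_hat, eps=1e-12):
    # I(k) = -y_k log P(y_hat=1) - (1 - y_k) log(1 - P(y_hat=1))
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -(y_true * np.log(p_hat) + (1 - y_true) * np.log(1.0 - p_hat))

threshold = 3.0                         # invented; tune via validation error
y = np.array([1, 0, 1, 0])              # correct labels
p = np.array([0.9, 0.2, 0.02, 0.6])     # model's P(y_hat = 1) per pattern

for k, ik in enumerate(information_gain(y, p)):
    print(f"pattern {k}: I = {ik:.2f} ->",
          "check by operator" if ik > threshold else "use for training")
```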

1.4.2 Outlier removal (Cont'd)

1.4.2.1 Data cleaning (Cont'd)
• Batch algorithm:
  1. train the model on all data (garbage as well)
  2. sort the data according to the information gain $I$
  3. a human operator checks the patterns $v_i$ with $I(i)$ > threshold; remove if garbage
  4. retrain the model
  5. sort the data according to the information gain $I$
  6. the human operator removes the garbage patterns
  7. ...
  Question: what is the optimal threshold?
• Optimal threshold:
  1. perform the data cleaning several times, for different thresholds, i.e., for a series of increasing threshold values
  2. determine the model errors on a test set (validation error)
  3. choose the model for which the validation error is lowest → optimal threshold

1.4.3 Noise Removal

1. lowpass filtering = linear filtering
   principle: replace the signal value at time t, s[t], by a weighted average of the signal values in a Gaussian region around t

   example: $s[t] \leftarrow 0.5\,s[t] + 0.25\,s[t-1] + 0.25\,s[t+1]$
   advantage: removes noise when the cut-off frequency of the Gaussian filter < the expected lowest frequency in the noise
   disadvantage: blurs sharp transitions

[Figure: debt vs. income data with a sharp transition]

2. median filtering
3. regression

1.4.3 Noise Removal (Cont'd)

1. lowpass filtering
2. median filtering = non-linear filtering
   principle: replace the signal value at time t, s[t], by the median of the signal values in a region R around t
   advantages: less blurring than lowpass filtering; very effective if the noise consists of spike-like components (outliers); often better suited for noise removal than lowpass filtering
3. regression (curve fitting): the signal can be fitted by a polynomial or parametric function with a small enough # of parameters
   How? cf. generalization performance in NN training vs. network complexity (i.e., # weights)
4. ...

1.4.4 Missing Data Handling

1. Mean substitution = the most used technique
   How? replace the missing entries in a column by the column's mean
   → crude but easy to implement
2. Cluster center substitution
   look for the nearest cluster center $\mu_c$, by ignoring the missing entries, & substitute (Kohonen, 1995; Samad & Harp, 1992)
3. Expectation Maximization (EM) technique

   a more sophisticated technique (Dempster et al., 1977); statistics-based: replace the missing entries by their most likely values
4. ...
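A minimal sketch of mean substitution, assuming NumPy and NaN-coded missing entries; the toy table is invented:

```python
# Mean substitution: replace each missing entry (NaN) by its column's mean
# over the observed values.
import numpy as np

X = np.array([[1.0,    2.0],
              [np.nan, 4.0],
              [3.0,    np.nan],
              [5.0,    6.0]])

col_means = np.nanmean(X, axis=0)          # per-column mean, ignoring NaNs
missing = np.isnan(X)
X[missing] = np.take(col_means, np.where(missing)[1])
print(X)
```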

1.4.5 Unlabeled Data Handling

Two basic strategies:
1. discard the unlabeled entries
2. use the unlabeled + labeled entries & model the data density; then use the density model to develop the classification model
   (Hence, in this way, all the data is used as much as possible)

