Statistical Data Mining
2. How large is large?
3. Data mining Software
4. What is new?
Challenge: $!%#?! data that is noisy, badly behaved, with many missing values
- Tree Methods
- Cluster Analysis
- Regression Analysis
5. Paradigm for data mining: Selection of interesting subsets
[Figure: diagram with axes labeled Cases and Variables and regions labeled Good Data and Bad Data]
6. Role of visualization
Data visualization methods are attractive tools for analyzing such datasets for several reasons:
- They show many features (expected and unexpected) of a dataset at once and, as such, are well equipped to pick up subtle structures of interest and anomalies as well as clear patterns.
- They allow (in fact, encourage) flexible interaction with the data.
- They can be more readily understood by non-statisticians (although their properties may not be).
- Good user-friendly graphics software is becoming more readily available.
8. [Figure: two panels, Scatter Plot and Image Plot]
12. Feature recognition methods
- variable and case selection
- cluster analysis (unsupervised pattern recognition)
  - partitioning methods (e.g., k-means, k-medoids)
  - hierarchical methods (e.g., agglomerative nesting)
  - fuzzy analysis
- classification (supervised pattern recognition, discriminant analysis)
  - trees (e.g., CART, C5, FIRM, Tree)
  - model-based methods (e.g., logistic regression, sliced inverse regression)
  - artificial neural networks
- role of robust methods / diagnostics
13. Trees
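[Figure: two panels. "Function f(X,Y)": a partition of the (X,Y) plane into regions of constant value. "Tree form of f(X,Y)": the same function as a binary tree with splits Y < 4, X < 3, and Y < 2 and leaf values 0, 2, 3, and 5.]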
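The point of this slide, that a piecewise-constant function of X and Y can be written as a binary tree of threshold splits, can be sketched in Python. The split thresholds (X < 3, Y < 4, Y < 2) and the leaf values (0, 2, 3, 5) appear on the slide, but the figure layout did not survive, so which split is the root and which leaf sits under which branch is an assumption made purely for illustration. The names `TREE`, `evaluate`, and `f` are hypothetical.

```python
# Tree form: an internal node is (variable, threshold, left, right);
# a leaf is a plain number.
# ASSUMPTION: the arrangement below (root X < 3, then Y < 4 on the left and
# Y < 2 on the right, with leaves 0, 2, 3, 5 in that order) is a guess.
TREE = (
    "x", 3,
    ("y", 4, 0, 2),  # branch taken when x < 3
    ("y", 2, 3, 5),  # branch taken when x >= 3
)


def evaluate(node, point):
    """Walk the tree: go left when point[var] < threshold, otherwise right."""
    while isinstance(node, tuple):
        var, threshold, left, right = node
        node = left if point[var] < threshold else right
    return node


def f(x, y):
    """The same function written directly as nested conditions."""
    if x < 3:
        return 0 if y < 4 else 2
    return 3 if y < 2 else 5


if __name__ == "__main__":
    for x, y in [(1, 1), (1, 5), (4, 1), (4, 5)]:
        print((x, y), evaluate(TREE, {"x": x, "y": y}), f(x, y))
```

Both forms describe the same partition of the plane into four rectangles; the tree form makes the sequence of splits explicit, which is what tree-growing methods such as CART search over.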
15. Classification Tree
(node), split, n, deviance, yval; * denotes terminal node)
1) root 1000 249.300 0.4730
2) x … 0.931225 160 31.900 0.2750 *
ICLASS: Interval, % Success, % Interval
-0.026 … 0.5 …
[Tree output, partially recovered]
… 1279 1440.0 0.5506
34) KNEE96 … 1.36514 365 685.0 0.9898 *
9) RBEDS>2.77141 159 739.0 1.9480
18) ADM … 4.87542 112 563.8 2.3880 *
5) HIP96>2.01527 1059 2992.0 1.5690
10) FEMUR96 … 2.28992 668 2002.0 1.7840 *
3) HIP95>2.52265 1885 7672.0 2.9690
6) KNEE95 … 2.96704 792 3022.0 3.6250
14) SIR … 9.85983 426 1530.0 3.9790 *
20. Interval-focused regression (IREG)
For the j-th descriptor variable x_j, an interesting subset {a …
Rapid technology advances (advances in database technology as well as advances in the biological sciences) have led to an explosive growth in the number of massive high-dimensional datasets appearing in pharmaceutical situations.
CASE STUDY: Data mining in Biopharmaceutical Research
23. The structure activity database (SADB) consisted of several measurements of in vivo and in vitro activity on 876 compounds, together with several physicochemical properties of these compounds. Each concentration of each compound was recorded as 1 (response) or 0 (no response) for each assay, each route, and each species (there were also a large number of missing values).
In vivo measurements of biological activity
- Anti Convulsant Assay (AC): positive effect
- Horizontal Screen Assay (HS): adverse effect
- rats and mice
- ip and po
24. In vitro results
- IC50
- G-shift
- H-coefficient
Physicochemical properties
- Molecular weight
- Molecular volume
- CMR
- MlogP
- Energy
- Total dipole moment
- Rotational bonds
- Atom types
- Connectivities between molecules
25. Objective
To determine whether the physicochemical properties, together with the in vitro activity measurements, are predictive of biological activity (and selectivity and bioavailability). Given the data {(Y_ti, x_ji), i = 1, ..., N, t = 1, ..., q, j = 1, ..., p}, this involves studying the relationship between {Y_ti} and {x_ji}.
Initial considerations
- Simplification: Z = I(Y … 4)
- Re-expression: log(IC50)
- Check for clear outliers in individual variables (remedy: winsorize)
27. Results of Interval CART Analysis
[Figure: one panel per descriptor (LOG(IC50), G-shift, Hill, MLOGP, MOLWT, ENERGY, MOLVOL, ROTBONDS), each showing interval cut points, percent success, and PROB]
28. ICLASS Tree
[Figure: tree with interval splits. LOGIC50 between -2.63 and 1.66 (78/262); MOLVOL between 244 and 276; 407 < MVOL < 467; node counts 20/38 and (24/34)]
30. Case Study: Pima Indians Diabetes
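32. Classification Tree
(node), split, n, deviance, yval; * denotes terminal node)
1) root 768 174.500 0.34900
2) PLASMA … 127.5 283 67.020 0.61480
6) BODY … 29.95 207 41.300 0.72460
14) PLASMA … 157.5 92 10.430 0.86960 *
34. First Split of ICLASS CART Tree
35. ICLASS Tree
[Figure: tree with interval splits. PLASMA between 167 and 189 (54/61); PLASMA between 154 and 164; PEDIGREE between 0.328 and 1.154; PLASMA between 123 and 152; remaining node counts (14/18), (28/42), (14/18), (34/35), (5/8), (28/42), (3/7), (78/437), (90/201), (0/2)]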
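The fractions attached to ICLASS nodes, such as (54/61) for the PLASMA interval from 167 to 189, read naturally as successes over cases falling inside the interval. That reading is an assumption, since the slides do not define the notation. Under it, the quantity for a given interval, and a brute-force search for the interval with the highest success rate, can be sketched as below. This is not the ICLASS or IREG algorithm, whose definition is not recoverable from the slides; the function names, the `min_cases` parameter, and the toy data are all hypothetical.

```python
import math


def interval_success(x, z, lo, hi):
    """Return (successes, cases) among observations with lo < x < hi.

    x holds descriptor values (None or NaN mark missing values and are
    skipped); z holds the matching 0/1 responses.
    """
    successes = cases = 0
    for xi, zi in zip(x, z):
        if xi is None or math.isnan(xi):
            continue
        if lo < xi < hi:
            cases += 1
            successes += zi
    return successes, cases


def best_interval(x, z, min_cases=3):
    """Exhaustive search for the interval with the highest success rate.

    Candidate end points are midpoints between neighbouring observed values
    (plus -inf and +inf); intervals holding fewer than min_cases cases are
    ignored. Returns (rate, lo, hi, successes, cases), or None if no
    interval qualifies.
    """
    values = sorted({xi for xi in x if xi is not None and not math.isnan(xi)})
    cuts = [-math.inf]
    cuts += [(a + b) / 2 for a, b in zip(values, values[1:])]
    cuts += [math.inf]
    best = None
    for i, lo in enumerate(cuts):
        for hi in cuts[i + 1:]:
            successes, cases = interval_success(x, z, lo, hi)
            if cases >= min_cases:
                rate = successes / cases
                if best is None or rate > best[0]:
                    best = (rate, lo, hi, successes, cases)
    return best


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    z = [0, 0, 1, 1, 1, 0, 0, 0]
    print(interval_success(x, z, 2.5, 5.5))
    print(best_interval(x, z))
```

The exhaustive search is quadratic in the number of distinct values, which is fine for a sketch but would need a smarter strategy on large datasets.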