Post on 21-Dec-2015
Spanish Inquisition
Final Project Week 2 - 4/29/09
Breast Cancer Gene Expression Data
Leon Kay, Yan Tran, Chris Thomas
Chris
Yan
Leon
Weka Filtering
• Used CFS with BestFirst Search
• Reduced the number of attributes from 1544 to 125
• CFS stands for Correlation-based Feature Selection. Basic hypothesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” [1]
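The hypothesis above is captured by the merit heuristic given in [1]: for a subset of k features, merit = k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The following is a minimal sketch of that formula only, not Weka's implementation; the function name `cfs_merit` and the correlation values are made up for illustration.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset under the CFS heuristic from [1].

    feature_class_corr:   one correlation per feature (feature vs. class).
    feature_feature_corr: one correlation per pair of features in the subset.
    """
    k = len(feature_class_corr)
    if k == 0:
        return 0.0
    r_cf = sum(feature_class_corr) / k
    # A single feature has no pairs, so the inter-correlation term is zero.
    r_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)


# Illustrative values (assumptions, not taken from the breast cancer data):
single = cfs_merit([0.8], [])                  # one predictive feature
redundant = cfs_merit([0.8, 0.8], [1.0])       # two perfectly correlated features
complementary = cfs_merit([0.8, 0.8], [0.0])   # two uncorrelated features
print(single, redundant, complementary)
```

With these toy numbers, adding a fully redundant feature leaves the merit unchanged, while adding an equally predictive but uncorrelated feature raises it, which is the behaviour the hypothesis describes.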
CFS Algorithm - Searching
• Any search algorithm can be plugged into CFS. The author describes three: forward selection, backward elimination, and best first. All three are essentially greedy heuristic searches. Searching greedily avoids evaluating every possible subset, which reduces the complexity of generating the feature subset.
• “Best first can start with either no features or all features. In the former, the search progresses forward through the search space adding single features; in the latter the search moves backward through the search space deleting single features. To prevent the best first search from exploring the entire feature subset search space, a stopping criterion is imposed. The search will terminate if five consecutive fully expanded subsets show no improvement over the current best subset.” [1]
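The forward variant of the search quoted above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Weka's BestFirst code: `best_first_forward`, `toy_score`, and `GOOD` are hypothetical names, and `toy_score` stands in for the CFS merit of a subset.

```python
import heapq
from itertools import count


def best_first_forward(n_features, evaluate, max_stale=5):
    """Forward best-first search over feature subsets.

    evaluate:  maps a frozenset of feature indices to a score (higher is better).
    max_stale: stop after this many consecutive expansions with no improvement.
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    open_list = [(-best_score, next(tie), start)]
    visited = {start}
    stale = 0
    while open_list and stale < max_stale:
        # Expand the most promising subset that has not been expanded yet.
        _, _, subset = heapq.heappop(open_list)
        improved = False
        for f in range(n_features):
            if f in subset:
                continue
            child = subset | {f}
            if child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_list, (-score, next(tie), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score


# Toy scoring function (an assumption): features 0 and 2 help, the rest hurt.
GOOD = {0, 2}


def toy_score(subset):
    return len(subset & GOOD) - 0.5 * len(subset - GOOD)


subset, score = best_first_forward(4, toy_score)
print(sorted(subset), score)
```

Unlike plain forward selection, the open list lets the search back up to an earlier subset when the current path stops improving; the `max_stale` counter plays the role of the five-expansion stopping criterion.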
Accuracy (Error Rate) of algorithms before and after applying CFS/BestFirst filtering
Algorithm        Before* (%)  After** (%)  Error Rate Reduction (%)
J48              32.17        28.02        12.92
Bagging (J48)    18.26        16.38        10.30
Boosting (J48)   20.87        16.38        21.52
Random Forests   15.65        14.22         9.12
SMO (SVM)        15.22        14.22         6.53
* From Week 1, all 1544 attributes
** After applying CFS/BestFirst filtering, 125 attributes
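The last column appears to be the relative reduction in error rate, (before - after) / before, expressed as a percentage. The short check below recomputes it from the rounded error rates in the table; the results agree with the listed values to within a few hundredths, and the small differences presumably come from the original figures being computed on unrounded rates (an assumption).

```python
# Error rates in percent, copied from the table: (before, after).
error_rates = {
    "J48": (32.17, 28.02),
    "Bagging (J48)": (18.26, 16.38),
    "Boosting (J48)": (20.87, 16.38),
    "Random Forests": (15.65, 14.22),
    "SMO (SVM)": (15.22, 14.22),
}


def relative_reduction(before, after):
    """Relative error-rate reduction, in percent of the original error rate."""
    return 100.0 * (before - after) / before


reductions = {name: relative_reduction(b, a)
              for name, (b, a) in error_rates.items()}

for name, value in reductions.items():
    print(f"{name:15s} {value:6.2f}")
```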
ROC – Receiver Operating Characteristic
• ROC graphs “depict the tradeoff between hit rates and false alarm rates of classifiers” [2]
• “one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first” [2]
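• The Area Under the Curve (AUC) condenses a ROC curve into a single number between 0 and 1 that can be used to compare classifiers: a higher AUC means positives are ranked above negatives more often, and 0.5 corresponds to random guessing.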
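AUC equals the probability that a classifier scores a randomly chosen positive example above a randomly chosen negative one [2]. The sketch below computes it directly from that definition by comparing every positive-negative pair. The function name `auc` and the scores are illustrative, and this is not how Weka computes the figure internally. For a multi-class problem, a per-class AUC would typically treat one class as positive and all others as negative (an assumption about how the per-class values were produced).

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: classifier confidence that each example is positive.
    labels: 1 for positive examples, 0 for negative examples.
    Tied scores count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: one positive (0.4) is ranked below one negative (0.6).
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; sorting-based implementations give the same value more efficiently.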
ROC Data – Area under Curve
Subtype             J48     Bagging (J48)  Boosting (J48)  Random Forests  SMO (SVM)
Basal-like          0.8978  0.9851         0.9883          0.9939          0.9802
Claudin-low         0.9515  0.9993         0.9975          0.9979          0.9977
HER2+/ER-           0.8137  0.9614         0.964           0.9476          0.9313
Luminal A           0.856   0.9558         0.9497          0.9735          0.9418
Luminal B           0.7842  0.93           0.9183          0.9563          0.9336
Normal Breast-like  0.7676  0.9731         0.922           0.9772          0.955
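One way to read the table is to ask which classifier has the highest AUC for each subtype, and which has the highest average across subtypes. The snippet below does this with the values copied from the table; the unweighted mean is an illustrative summary chosen here, not a figure from the original analysis, and it ignores how many samples each subtype has.

```python
classifiers = ["J48", "Bagging (J48)", "Boosting (J48)",
               "Random Forests", "SMO (SVM)"]

# AUC per subtype, in the same column order as `classifiers`.
auc_table = {
    "Basal-like":         [0.8978, 0.9851, 0.9883, 0.9939, 0.9802],
    "Claudin-low":        [0.9515, 0.9993, 0.9975, 0.9979, 0.9977],
    "HER2+/ER-":          [0.8137, 0.9614, 0.964, 0.9476, 0.9313],
    "Luminal A":          [0.856, 0.9558, 0.9497, 0.9735, 0.9418],
    "Luminal B":          [0.7842, 0.93, 0.9183, 0.9563, 0.9336],
    "Normal Breast-like": [0.7676, 0.9731, 0.922, 0.9772, 0.955],
}

# Classifier with the highest AUC for each subtype.
best = {
    subtype: classifiers[max(range(len(row)), key=row.__getitem__)]
    for subtype, row in auc_table.items()
}

# Unweighted mean AUC per classifier across the six subtypes.
mean_auc = {
    clf: sum(row[i] for row in auc_table.values()) / len(auc_table)
    for i, clf in enumerate(classifiers)
}

for subtype, clf in best.items():
    print(f"{subtype:20s} {clf}")
for clf, value in mean_auc.items():
    print(f"{clf:15s} {value:.4f}")
```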
GATA3
• GATA3 levels are a known indicator of breast cancer prognosis. (Basal-like has a worse prognosis than Luminal.)
• Associated with estrogen receptor alpha, which is often highly expressed in the early stages of breast cancer.
FLJ13710
• Mentioned in a paper on finding prognostic signatures for breast cancer.
• Couldn’t find any in-depth studies on this gene.
References
1) Mark Hall, “Correlation-based Feature Selection for Machine Learning”, http://www.cs.waikato.ac.nz/~mhall/thesis.pdf
2) Tom Fawcett, “An introduction to ROC analysis”, doi:10.1016/j.patrec.2005.10.010 – enter into http://dx.doi.org/
3) Wilson, Brian J., Giguère, Vincent. “Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway”, Molecular Cancer 2008, 7:49. http://www.molecular-cancer.com/content/7/1/49
4) Hayashi, SI., et al. “The expression and function of estrogen receptor alpha and beta in human breast cancer and its clinical application”, http://erc.endocrinology-journals.org/cgi/content/abstract/10/2/193
5) “Suppl. Table 2: List of probe sets significantly differentially expressed between luminal cell lines and basal cell lines. Probe sets are ordered according to decreasing DS (discriminating score).” www.nature.com/onc/journal/v25/n15/extref/1209254x4.xls
6) Carrivick, L., et al. “Identification of Prognostic Signatures in Breast Cancer Microarray Data using Bayesian Techniques.” http://www.enm.bris.ac.uk/cig/pubs/2005/rs4.pdf