An Integrated and Comprehensive Data Mining System for Studying Environmental
Impact of Nanomaterials: NEIMiner
Nano Working Group Presentation
10/13/2011
Kaizhi Tang, Ph.D., David Mihalcik, Thomas Wavering, Roger Xu
Intelligent Automation Inc
Prof. Stacey Harper, OSU
Sue Pan, SAIC
Sponsor Agency: Dr. Jeff Steevens, Army ERDC
Outline
Motivation and proposed approach
NEI modeling framework
Design of NEIMiner information system
NEIMiner
Motivation and proposed approach of NEIMiner
NEED: To reduce the risk of nanomaterials in military use, nanomaterial (NM) environmental impact analysis requires a comprehensive NEI modeling framework, a centralized NEI database, a powerful model discovery tool, and an integrated model composition strategy.
KEY COMPONENTS OF THE PROPOSED APPROACH
• Flexible data integration based on the ETL (Extract, Transform, Load) strategy of data warehousing
• Integrated and collaborative data management utilizing a modern content management system
• Optimized data mining process across many algorithms and parameters, which carries a huge computational burden
• Flexible model composition based on a unified model abstraction, reusing FRAMES
DELIVERABLES
• Conceptual framework of NEI analysis
• Collaborative NEI information system with model discovery and composition capability
VALUE TO THE CUSTOMER / TRANSITION CUSTOMER
• Environmental impact estimation tool for nanomaterials
• Easy access to a large amount of NEI data in a centralized data warehouse and the available model generation tool
• Potentially useful evaluation models of NEI
Collaboratory of Structural Nanobiology
NEI Data
NEI Data Mining Models
Scope of NEI Modeling
NEIMiner System Architecture
Available NEI Data and Schemas
Nanomaterial-Biological Interactions Knowledgebase
– http://nbi.oregonstate.edu/
Cancer Nanotechnology Laboratory portal (caNanoLab)
– NCI, https://cananolab.nci.nih.gov/caNanoLab/
ICON: International Council on Nanotechnology
– Rice University, http://icon.rice.edu
Nano-Tab
– tab-delimited spreadsheet type based on EBI and ISA-TAB
NanoParticle Ontology (NPO)
– Implemented in OWL
Most complete characterization capture
Largest number of publications, limited characterization capture
Wide range of characterization and health impact data
Other Data and Schemas
OECD Database on Research into Safety of Manufactured Nanomaterials
– http://webnet.oecd.org
National Institute for Occupational Safety and Health (NIOSH)
– http://www.cdc.gov/niosh/topics/nanotech/NIL.html
SAFENANO - Institute of Occupational Health (UK)
– http://www.safenano.org/AdvancedSearch.aspx
University of Wisconsin - Madison: Nanoscale Science and Engineering Center
– http://www.nanoceo.net/nanorisks
National Reference Center for Bioethics Literature - Georgetown University, Kennedy Institute of Ethics
– http://bioethics.georgetown.edu/
Nanomedicine Research Portal
– http://www.nano-biology.net/
Center on Nanotechnology and Society (Chicago-Kent College of Law in the Illinois Institute of Technology)
– http://www.nano-and-society.org/
Data Extraction Methods
Data extraction via web services
– Example: caNanoLab
Data extraction via web scraping
– Examples: ICON, NBI
– Approaches
  • Human copy-and-paste
  • HTTP programming
  • Text grepping and regular expression matching
  • HTML parsers
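The HTML-parser approach to web scraping can be sketched as follows. This is a minimal illustration using Python's standard html.parser module, not the extraction code used in NEIMiner. The class name TableRowParser, the helper extract_rows, and the sample table (materials and sizes) are all hypothetical placeholders; a real page would first be fetched over HTTP.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page content, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Material</th><th>Size (nm)</th></tr>
  <tr><td>Gold</td><td>10</td></tr>
  <tr><td>Silver</td><td>25</td></tr>
</table>
"""


def extract_rows(html):
    """Return all table rows found in the given HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
header = rows[0]
records = [dict(zip(header, row)) for row in rows[1:]]
```

Compared with text grepping, a parser of this kind follows the tag structure of the page, so it is less likely to break when whitespace or attribute order changes.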
Design Philosophy of the NEI Data Warehouse
Data Warehouse
– Centralized data from multiple data sources for analysis
  => multiple nano risk related data sources with different formats
– Consists of an ETL tool, a database, a reporting tool, and data modeling
  => tools useful for NM data integration and mining
– Subject-oriented data organization
  => risk assessment for nanomaterials
– Multi-dimensional
  => various nanomaterial properties
– Star schema
  => extendible schema design
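A star schema of this kind can be sketched with an in-memory SQLite database. This is only an illustration of the idea (one fact table referencing several dimension tables), not the actual NEIMiner schema. All table names, column names, and data values below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables describe the "who/what/how"; the fact table holds the
# measured response and points at one row in each dimension.
conn.executescript("""
CREATE TABLE dim_material (
    material_id   INTEGER PRIMARY KEY,
    material_type TEXT,
    shape         TEXT
);
CREATE TABLE dim_exposure (
    exposure_id INTEGER PRIMARY KEY,
    route       TEXT,
    duration_h  REAL
);
CREATE TABLE dim_source (
    source_id INTEGER PRIMARY KEY,
    name      TEXT
);
CREATE TABLE fact_response (
    material_id   INTEGER REFERENCES dim_material(material_id),
    exposure_id   INTEGER REFERENCES dim_exposure(exposure_id),
    source_id     INTEGER REFERENCES dim_source(source_id),
    mortality_pct REAL
);
""")

# Placeholder rows standing in for data loaded by an ETL step.
conn.executemany("INSERT INTO dim_material VALUES (?, ?, ?)",
                 [(1, "gold", "sphere"), (2, "silver", "rod")])
conn.executemany("INSERT INTO dim_exposure VALUES (?, ?, ?)",
                 [(1, "waterborne", 24.0), (2, "dietary", 120.0)])
conn.executemany("INSERT INTO dim_source VALUES (?, ?)",
                 [(1, "source A"), (2, "source B")])
conn.executemany("INSERT INTO fact_response VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 10.0), (1, 2, 2, 30.0), (2, 1, 1, 50.0)])

# A typical analysis query: aggregate the fact table along one dimension.
avg_by_material = conn.execute("""
    SELECT m.material_type, AVG(f.mortality_pct)
    FROM fact_response AS f
    JOIN dim_material AS m ON m.material_id = f.material_id
    GROUP BY m.material_type
    ORDER BY m.material_type
""").fetchall()
```

Adding a new nanomaterial property or a new data source means adding a column or a dimension table without touching existing fact rows, which is the sense in which the design is extendible.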
NEI Model Discovery
• Physical properties
  • Material type
  • Particle size distribution
  • PDI
  • Shape
  • Structure
• Chemical properties
  • Surface reactivity
  • Surface charge
  • Water solubility
• Exposure and study scenario
  • Duration
  • Continuity
  • Exposure route
  • Number of nanoparticles
  • Number of ligands
• Biological properties
  • Species, age, gender, weight
• Environmental ecosystem response
  • Fate and transport
  • Bioavailability and uptake
  • Biomagnification
• Biological response
  • Genomic response
  • Cell death
Correlation?
Prediction?
Interesting Mining Problems and Solutions
How to handle missing data
– Median on numerical values
– Median-frequency categories
– Classification or regression using existing data
How to determine attribute significance
– Compare gain ratio for classification
– Compare relief ratio for numerical prediction
How to select algorithms and their parameters for training
– Meta-optimization on algorithms and parameters
How to split the data sets for high-quality models
– Comparing various splitting strategies
– Clustering as a preprocessing step
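Two of the ideas above, simple imputation of missing values and gain ratio as an attribute-significance score, can be sketched in plain Python. This is a minimal illustration, not the NEIMiner implementation. The function names and the toy data are hypothetical, and the categorical fill uses the most frequent category, which is an assumption about what the slide intends.

```python
import math
from collections import Counter
from statistics import median


def impute(values, numeric):
    """Replace None with the median (numeric) or most frequent value (categorical)."""
    known = [v for v in values if v is not None]
    fill = median(known) if numeric else Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def entropy(labels):
    """Shannon entropy, in bits, of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(attribute, labels):
    """Information gain of the attribute divided by its split information."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    # Expected class entropy after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data, for illustration only.
sizes = impute([10.0, None, 30.0, 20.0], numeric=True)
shapes = impute(["sphere", "rod", None, "sphere"], numeric=False)

labels = ["toxic", "toxic", "safe", "safe"]
informative = gain_ratio(["a", "a", "b", "b"], labels)
uninformative = gain_ratio(["a", "b", "a", "b"], labels)
```

An attribute that separates the classes perfectly scores higher than one that is independent of the class, so comparing scores across attributes gives a ranking of their significance for classification.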