© 2016 KNIME.com AG. All Rights Reserved.
Data Science for Everyone
Greg Landrum
Rosaria Silipo
KNIME
Introduction to the characters
• The scientist (chemist, business analyst, domain expert, etc.).
– Deep domain knowledge
– Strong analytics needs (questions that need to be answered!)
• The data scientist (analyst, modeler, informatician, etc.).
– Deep knowledge of analytics, data processing
– Knows KNIME (and other tools)
The specific scenario/problem
The scientist:
“I’m trying to discover a new anti-malaria medicine. I’ve got a new dataset from a high-throughput screen against a malaria target. Doing the next experiments is expensive. I want to pick the right compounds from our inventory to try next.”
The scenario/problem
• Given a new dataset, clean it up so that a model can be built
• Build and validate a model from that dataset
• Use the model to prioritize a set of items from a catalog
• Let the user pick from that prioritized list
The steps for doing this
• Cleaning up the data
• Building and validating a model
• Ranking a set of new items from a catalog
• Letting the user pick the items they are interested in
• Providing an Excel file
This is a familiar pattern; we know how to do this.
A guided analytics solution
• The data scientist builds a data preparation and modeling workflow in KNIME capturing their most robust approach along with a solid validation protocol that won’t let a low-quality model pass.
• The data scientist deploys this workflow as a web application using the KNIME Server.
• The scientist can then upload their data, build and validate a model, and then apply it to generate predictions for the items in their catalog in order to decide which experiments to do next.
Data Cleaning
The 80% problem
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
Data Cleaning & Data Scientists
https://twitter.com/mrogati/status/601538814746628096
7 Techniques for Dimensionality Reduction
Column Reduction based on:
1. Missing values
2. High correlation
3. Low standard deviation
4. PCA
5. Infrequent choice in random forest shallow trees
6. Backward Feature Elimination
7. Forward Feature Construction
Whitepaper on KNIME web site https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf
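The first three criteria in the list can be sketched in a few lines of plain Python. This is only an illustration, not the KNIME workflow or the whitepaper's implementation: the function names, the `None`-as-missing convention, and all three threshold values are assumptions chosen for the example.

```python
import math
from statistics import pstdev

def missing_ratio(values):
    """Fraction of entries that are None (None stands in for a missing value)."""
    return sum(v is None for v in values) / len(values)

def pearson(x, y):
    """Pearson correlation over the rows where both columns have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def reduce_columns(table, max_missing=0.4, min_std=0.01, max_corr=0.9):
    """Return the names of the columns that survive all three filters.

    table maps a column name to a list of numbers (or None).
    The thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for name, values in table.items():
        if missing_ratio(values) > max_missing:
            continue  # too many missing values
        present = [v for v in values if v is not None]
        if len(present) < 2 or pstdev(present) < min_std:
            continue  # (nearly) constant column
        if any(abs(pearson(values, table[other])) > max_corr for other in kept):
            continue  # redundant with a column that was already kept
        kept.append(name)
    return kept

table = {
    "age": [23, 35, 41, 52, 60],
    "age_months": [276, 420, 492, 624, 720],       # 12 * age, fully redundant
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "mostly_missing": [None, None, None, 4.0, None],
    "income": [30, 20, 50, 45, 25],
}
print(reduce_columns(table))
```

A standard-deviation threshold only makes sense when the columns share a comparable scale, so in practice the columns would usually be normalized before this filter is applied.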
Dataset Quality Measures
Additional Techniques for Data Dimensionality Reduction:
• Low Skewness
• Outlier Removal
Measure Dataset Quality Before and After:
• Average Error (%) from Cross-Validation
• Normalized Cronbach Alpha
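As a rough illustration of the second quality measure, the sketch below computes the plain Cronbach's alpha coefficient for a set of numeric columns. The slide does not say how the "normalized" variant is derived, so this is the textbook coefficient only; the function name and the sample data are assumptions.

```python
from statistics import pvariance

def cronbach_alpha(columns):
    """Cronbach's alpha for a list of equally long numeric columns.

    alpha = k / (k - 1) * (1 - sum(var(column)) / var(row totals))
    """
    k = len(columns)
    if k < 2:
        raise ValueError("need at least two columns")
    row_totals = [sum(row) for row in zip(*columns)]
    total_variance = pvariance(row_totals)
    if total_variance == 0:
        raise ValueError("row totals have zero variance")
    column_variance = sum(pvariance(column) for column in columns)
    return k / (k - 1) * (1 - column_variance / total_variance)

# Two columns that mostly move together (hypothetical data).
a = [1, 2, 3, 4]
b = [2, 1, 4, 3]
alpha = cronbach_alpha([a, b])
print(round(alpha, 3))
```

Computing the coefficient once on the uploaded data and once on the cleaned data gives the before/after comparison the slide describes.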
Data Cleaning as a Process
• Reliable (cross-domain)
• Repeatable (not automatic)
• Interactive (human expert supervised)
• From a Web Browser (no KNIME expertise)
• On demand
CRM Dataset
• Customer Data
• Task is Upselling prediction
• Product is lawyer insurance
• Target is Lawyer Insurance (0/1)
• If lawyer insurance was bought, a lawyer was assigned a little while later
• 10K data rows x 33 data columns
From KNIME WebPortal: Login
From KNIME WebPortal: Start
From KNIME WebPortal: Upload File
Only .table and .csv files
From KNIME WebPortal: Initial Dataset Quality
From KNIME WebPortal: Missing Values
From KNIME WebPortal: Outliers
From KNIME WebPortal: Low Standard Deviation
From KNIME WebPortal: Low Skewness & High Correlation
From KNIME WebPortal: Final Dataset Quality
From KNIME WebPortal: Back to Refine
From KNIME WebPortal: Final Dataset Quality Again
From KNIME WebPortal: Workflow successful
Workflow
Metanode “Dataset Quality”
Summary of Dataset Quality
Malaria Dataset
• Compound screening data
• Task is predicting Pf3D7_ps_hit (yes/no)
• Primary & secondary readouts, SMILES, experiment date, sample
• Many primary readouts are missing (?)
• 6675 data rows x 8 data columns
From KNIME WebPortal: Initial Dataset Quality
From KNIME WebPortal: Missing Values
From KNIME WebPortal: Outliers
From KNIME WebPortal: Low Standard Deviation
From KNIME WebPortal: Low Skewness & High Correlation
From KNIME WebPortal: Final Dataset Quality
That was easy!
Happy scientist!
Model building
The modeling and prediction workflow
Reading the cleaned data and adding the chemistry-specific details
Building a model
Evaluating the model
Ranking and picking new items
Robust learning: use multiple models and representations
• Multiple models:
– Random forest (representation 2)
– Gradient boosting (representation 1)
– Fingerprint Bayes (representation 1)
– Logistic regression (representation 1)
– Logistic regression (representation 2)
• Combine predictions using "model fusion"
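The slide does not say which fusion rule the workflow uses. One common and simple choice, shown here purely as an assumption, is to average the predicted class probabilities of the individual models, optionally with weights. The function name and the probabilities are made up for the example.

```python
def fuse_probabilities(model_probs, weights=None):
    """Combine per-model P(active) lists into one fused list by (weighted) averaging.

    model_probs: one list of probabilities per model, all of the same length.
    weights: optional per-model weights; equal weights are used when omitted.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_items = len(model_probs[0])
    if any(len(probs) != n_items for probs in model_probs):
        raise ValueError("all models must score the same items")
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("need exactly one weight per model")
    total_weight = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total_weight
        for i in range(n_items)
    ]

# Hypothetical outputs of three models for two compounds.
random_forest = [0.9, 0.2]
gradient_boosting = [0.7, 0.4]
logistic_regression = [0.8, 0.3]
fused = fuse_probabilities([random_forest, gradient_boosting, logistic_regression])
print([round(p, 3) for p in fused])
```

Averaging tends to dampen the mistakes of any single model or representation, which is the point of building several of them.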
Validation
• The model will be used for ranking new items
• To ensure that it is useful, we evaluate it on both overall accuracy (using Cohen’s Kappa) and the accuracy of the early picks (using enrichment)
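Both measures are easy to state in code. The sketch below is a generic illustration, not the Scorer or ROC node implementation: Cohen's kappa compares observed agreement with the agreement expected by chance, and the enrichment factor compares the hit rate among the top-ranked items with the hit rate of the whole set. The function names, the 0.5 cutoff, and the data are assumptions.

```python
def cohens_kappa(actual, predicted):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    expected = sum(
        (actual.count(label) / n) * (predicted.count(label) / n) for label in labels
    )
    if expected == 1:
        return 0.0  # only one label in use, kappa is undefined; report 0
    return (observed - expected) / (1 - expected)

def enrichment_factor(actual, scores, fraction):
    """Hit rate in the top-scored fraction divided by the overall hit rate.

    actual holds 1 for a hit and 0 otherwise.
    """
    n = len(actual)
    total_hits = sum(actual)
    if total_hits == 0:
        raise ValueError("no hits in the data")
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(hit for _, hit in ranked[:n_top])
    return (top_hits / n_top) / (total_hits / n)

# Hypothetical validation set: 2 hits among 10 items.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]

kappa = cohens_kappa(actual, predicted)
ef = enrichment_factor(actual, scores, fraction=0.2)
print(round(kappa, 3), round(ef, 3))
```

An enrichment factor above 1 means the top of the ranked list contains hits more often than a random pick would, which is what matters when only a few items can be tested.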
Validation
• Parameters from the Scorer node, adapted to model fusion
• Accuracy parameters from the ROC node
When the model isn’t good enough
Accuracy thresholds are set by the data scientist when building the workflow
The workflow ends here. No sense continuing with a model that's unreliable/misleading.
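The gate itself amounts to comparing each validation metric with its minimum and stopping when any of them falls short. A minimal sketch follows; the exception class, the function name, the metric names, and the threshold values are all hypothetical.

```python
class ModelRejected(Exception):
    """Raised when a validated model misses a required quality threshold."""

def require_quality(metrics, thresholds):
    """Raise ModelRejected if any metric is below its minimum."""
    failed = [
        f"{name}: {metrics[name]:.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if metrics[name] < minimum
    ]
    if failed:
        raise ModelRejected("model rejected (" + "; ".join(failed) + ")")

# Thresholds chosen by the data scientist (values here are made up).
thresholds = {"kappa": 0.4, "enrichment": 2.0}

require_quality({"kappa": 0.55, "enrichment": 3.1}, thresholds)  # passes silently

try:
    require_quality({"kappa": 0.25, "enrichment": 3.1}, thresholds)
    rejected = False
except ModelRejected as error:
    rejected = True
    print(error)
```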
Making predictions
Read items from catalog
Generate predictions
Show histogram and ask for number of items to consider
Interactive selection
Download Excel file
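The steps above can be sketched as a small ranking-and-export pipeline. This is an illustration only: the catalog records and field names are invented, the interactive selection is reduced to a set of chosen ids, and the output is CSV text (which Excel can open) rather than the Excel file the KNIME workflow produces.

```python
import csv
import io

def top_ranked(catalog, n):
    """Return the n catalog items with the highest predicted score."""
    return sorted(catalog, key=lambda item: item["score"], reverse=True)[:n]

def keep_selected(rows, selected_ids):
    """Keep only the rows the user picked, preserving the ranking order."""
    return [row for row in rows if row["id"] in selected_ids]

def to_csv(rows, fieldnames):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical catalog items with model predictions attached.
catalog = [
    {"id": "C1", "score": 0.15},
    {"id": "C2", "score": 0.92},
    {"id": "C3", "score": 0.67},
    {"id": "C4", "score": 0.81},
]

shortlist = top_ranked(catalog, 3)               # number of items to consider
picked = keep_selected(shortlist, {"C2", "C3"})  # stands in for the interactive selection
report = to_csv(picked, ["id", "score"])
print(report)
```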
Interactive selection
Create images for the table
Create plots
Keep only rows that are selected in the table
The output, Excel at last!
That’s it!
• Whitepapers & workflows for the two different parts coming soon!
• For more information, email: [email protected]
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.