© 2016 KNIME.com AG. All Rights Reserved.

Data Science for Everyone

Greg Landrum

Rosaria Silipo

KNIME

Introduction to the characters

• The scientist (chemist, business analyst, domain expert, etc.).

– Deep domain knowledge

– Strong analytics needs (questions that need to be answered!)

• The data scientist (analyst, modeler, informatician, etc.)

– Deep knowledge of analytics, data processing

– Knows KNIME (and other tools)


The specific scenario/problem

The scientist:

“I’m trying to discover a new anti-malaria medicine. I’ve got a new dataset from a high-throughput screen against a malaria target. Doing the next experiments is expensive. I want to pick the right compounds from our inventory to try next.”


The scenario/problem

• Given a new dataset, clean it up so that a model can be built

• Build and validate a model from that dataset

• Use the model to prioritize a set of items from a catalog

• Let the user pick from that prioritized list


The steps for doing this

• Cleaning up the data

• Building and validating a model

• Ranking a set of new items from a catalog

• Letting the user pick the items they are interested in

• Providing an Excel file

This is a familiar pattern; we know how to do this.


A guided analytics solution

• The data scientist builds a data preparation and modeling workflow in KNIME capturing their most robust approach along with a solid validation protocol that won’t let a low-quality model pass.

• The data scientist deploys this model as a web application using the KNIME server.

• The scientist can then upload their data, build and validate a model, and then apply it to generate predictions for the items in their catalog in order to decide which experiments to do next.


Data Cleaning


The 80% problem

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.


Data Cleaning & Data Scientists

https://twitter.com/mrogati/status/601538814746628096


7 Techniques for Dimensionality Reduction

Column Reduction based on:

1. Missing values

2. High correlation

3. Low standard deviation

4. PCA

5. Infrequent choice in random forest shallow trees

6. Backward Feature Elimination

7. Forward Feature Construction
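The first three criteria in the list can be sketched in a few lines of plain Python. This is only an illustration and not the KNIME workflow itself: the function names, the table layout (a dict of column name to values, with None for a missing value), and the threshold defaults are all assumptions made for this example.

```python
import math
from statistics import pstdev


def missing_ratio(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def reduce_columns(table, max_missing=0.5, min_std=0.01, max_corr=0.95):
    """Return the names of the columns that survive all three filters.

    Thresholds are hypothetical defaults, not values from the slides.
    """
    kept = []
    for name, values in table.items():
        # 1. Drop columns with too many missing values.
        if missing_ratio(values) > max_missing:
            continue
        # 3. Drop (near-)constant columns.
        present = [v for v in values if v is not None]
        if len(present) < 2 or pstdev(present) < min_std:
            continue
        # 2. Drop columns highly correlated with one already kept.
        if any(abs(pearson(values, table[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept
```

A standard-deviation threshold only makes sense when the columns share a scale, so in practice the columns would likely be normalized before the second filter is applied.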

Whitepaper on KNIME web site https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf


Dataset Quality Measures

Additional Techniques for Data Dimensionality Reduction:

• Low Skewness

• Outlier Removal
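As a rough illustration of these two measures, the sketch below computes the skewness of a column and drops outliers with an interquartile-range fence. The slides do not say which skewness estimator or outlier rule the workflow uses, so the moment-based skewness and the 1.5 × IQR fence here are assumptions, as are the function names.

```python
from statistics import mean, quantiles


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0.0 for a constant column)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def remove_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]
```

For example, in the column 10, 11, 12, 13, 14, 15, 16, 100 the value 100 falls outside the fence and is removed, and the skewness of the original column is positive because of that long right tail.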

Measure Dataset Quality Before and After:

• Average Error (%) from Cross-Validation

• Normalized Cronbach Alpha
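Cronbach's alpha for k columns is k/(k-1) × (1 − sum of column variances / variance of the row totals). The sketch below implements that standard formula. The slides do not define what "normalized" means, so z-scoring each column first is only one possible reading and is labeled as an assumption; the function names are likewise made up for this example.

```python
from statistics import mean, stdev, variance


def zscore(column):
    """Scale a column to zero mean and unit sample standard deviation."""
    mu, sd = mean(column), stdev(column)
    if sd == 0:
        raise ValueError("a constant column cannot be z-scored")
    return [(v - mu) / sd for v in column]


def cronbach_alpha(columns, normalize=True):
    """Cronbach's alpha for a list of equally long numeric columns.

    normalize=True z-scores each column first (an assumed interpretation
    of "normalized"; the slides do not spell it out).
    """
    if len(columns) < 2:
        raise ValueError("need at least two columns")
    if normalize:
        columns = [zscore(c) for c in columns]
    k = len(columns)
    item_variances = sum(variance(c) for c in columns)
    totals = [sum(row) for row in zip(*columns)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when the row totals are constant")
    return k / (k - 1) * (1 - item_variances / total_variance)
```

Computing the value once on the uploaded table and once on the cleaned table gives the before/after comparison described above.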


Data Cleaning as a Process

• Reliable (cross-domain)

• Repeatable (not automatic)

• Interactive (human expert supervised)

• From a Web Browser (no KNIME expertise)

• On demand


CRM Dataset

• Customer Data

• Task is Upselling prediction

• Product is lawyer insurance

• Target is Lawyer Insurance (0/1)

• If lawyer insurance was bought, a lawyer was assigned a little while later

• 10K data rows x 33 data columns


From KNIME WebPortal: Login


From KNIME WebPortal: Start


From KNIME WebPortal: Upload File

Only .table and .csv files


From KNIME WebPortal: Initial Dataset Quality


From KNIME WebPortal: Missing Values


From KNIME WebPortal: Outliers


From KNIME WebPortal: Low Standard Deviation


From KNIME WebPortal: Low Skewness & High Correlation


From KNIME WebPortal: Final Dataset Quality


From KNIME WebPortal: Back to Refine


From KNIME WebPortal: Final Dataset Quality Again


From KNIME WebPortal: Workflow successful


Workflow


Metanode “Dataset Quality”

Summary of Dataset Quality


Malaria Dataset

• Patient Data

• Task is Pf3D7_ps_hit = yes/no

• Primary & secondary readouts, SMILES, experiment date, sample

• Many primary readouts are missing (shown as ?)

• 6675 data rows x 8 data columns


From KNIME WebPortal: Initial Dataset Quality


From KNIME WebPortal: Missing Values


From KNIME WebPortal: Outliers


From KNIME WebPortal: Low Standard Deviation


From KNIME WebPortal: Low Skewness & High Correlation


From KNIME WebPortal: Final Dataset Quality


That was easy!

Happy scientist!


Model building


The modeling and prediction workflow

Reading the cleaned data and adding the chemistry-specific details

Building a model

Evaluating the model

Ranking and picking new items


Robust learning: use multiple models and representations

• Multiple models:

– Random forest (representation 2)

– Gradient boosting (representation 1)

– Fingerprint Bayes (representation 1)

– Logistic regression (representation 1)

– Logistic regression (representation 2)

• Combine predictions using "model fusion"
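One simple way to combine the per-model predictions is to aggregate the predicted probabilities item by item. The sketch below is a minimal illustration of that idea; the slides do not state which fusion rule the workflow uses, so the choice of mean (with median, max, and min as alternatives), the function name, and the model labels are assumptions.

```python
from statistics import mean, median


def fuse_predictions(model_probs, method="mean"):
    """Combine per-model probabilities into one score per item.

    model_probs maps a model name to a list of P(active), one per item.
    """
    combine = {"mean": mean, "median": median, "max": max, "min": min}[method]
    columns = list(model_probs.values())
    n_items = len(columns[0])
    if any(len(column) != n_items for column in columns):
        raise ValueError("all models must score the same items")
    # zip(*columns) yields, for each item, the tuple of scores from all models.
    return [combine(per_item) for per_item in zip(*columns)]


# Hypothetical scores for two items from three of the models.
example = {
    "random_forest": [0.9, 0.2],
    "gradient_boosting": [0.7, 0.4],
    "logistic_regression": [0.8, 0.0],
}
fused = fuse_predictions(example)
```

Averaging dampens the effect of a single over-confident model, which is one reason to train several models on different representations in the first place.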


Validation

• The model will be used for ranking new items

• To ensure that it is useful we will evaluate it based both on overall accuracy (using Cohen’s Kappa) and how accurate early picks are (using enrichment)
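Both measures are easy to state in code. The sketch below uses the textbook definition of Cohen's kappa and one common definition of enrichment (the hit rate among the top-ranked fraction divided by the overall hit rate). The exact enrichment definition used in the workflow is not given in the slides, so treat that part, the function names, and the "yes" label as assumptions.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts, predicted_counts = Counter(actual), Counter(predicted)
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return 0.0  # only one label present; kappa carries no information
    return (observed - expected) / (1 - expected)


def enrichment_factor(actual, scores, fraction=0.1, positive="yes"):
    """Hit rate in the top-scored fraction relative to the overall hit rate."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    hits_top = sum(label == positive for _, label in ranked[:n_top])
    hits_all = sum(label == positive for label in actual)
    if hits_all == 0:
        return 0.0
    return (hits_top / n_top) / (hits_all / len(actual))
```

A kappa near 0 means the model is no better than chance, while an enrichment above 1 means the early picks contain more hits than a random selection would.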


Validation

• Parameters from the Scorer node, adapted to model fusion

• Accuracy parameters from the ROC node


When the model isn’t good enough

Accuracy thresholds are set by the data scientist when building the workflow

The workflow ends here. No sense continuing with a model that's unreliable/misleading.
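The same stop-early behaviour can be sketched as a small quality gate. The metric names and threshold values below are hypothetical; in the workflow they are whatever the data scientist configured.

```python
class ModelRejected(Exception):
    """Raised when a model misses a required quality threshold."""


def require_quality(metrics, thresholds):
    """Return True if every metric meets its minimum, else raise ModelRejected."""
    failures = [
        f"{name}={metrics.get(name)} (minimum {minimum})"
        for name, minimum in thresholds.items()
        if metrics.get(name) is None or metrics[name] < minimum
    ]
    if failures:
        raise ModelRejected("; ".join(failures))
    return True


# Hypothetical thresholds set by the data scientist.
THRESHOLDS = {"kappa": 0.4, "enrichment": 2.0}
```

Raising an exception rather than returning False makes it hard for later steps to run by accident on a rejected model.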


Making predictions

Read items from catalog

Generate predictions

Show histogram and ask for number of items to consider

Interactive selection

Download Excel file
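The ranking and export steps can be sketched as follows. This is an illustration only: the function names are made up, and the sketch writes CSV with the standard library as a stand-in for the Excel file that the workflow produces.

```python
import csv
import io


def rank_catalog(items, scores, top_n):
    """Return the top_n (item, score) pairs, highest score first."""
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def to_csv(rows):
    """Serialize (item, score) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item", "score"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical catalog items and fused model scores.
top = rank_catalog(["A", "B", "C"], [0.2, 0.9, 0.5], top_n=2)
report = to_csv(top)
```

In the guided workflow the user chooses the number of items after looking at the score histogram, which corresponds to the top_n argument here.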


Interactive selection

Create images for the table

Create plots

Keep only rows that are selected in the table


The output, Excel at last!


That’s it!

• Whitepapers & workflows for the two different parts coming soon!

• For more information, email: [email protected]


The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States.

KNIME® is also registered in Germany.

Date post:	31-Dec-2020