GETTING DATA ANALYTICS INTO PRODUCTION: BRIDGING THE GAP BETWEEN DATA SCIENCE AND PRODUCT DEVELOPMENT
Unleashing Data Excellence 2016, Amsterdam
Michelle Gregory
7 November 2016
• £6 billion 2015 revenue
• 7,200 employees
• 170 countries served
• 2,500 journals
• 30,000 books
• 3,500 healthcare institutions
What we do: we help scientists and health professionals to get better outcomes and become more productive
We combine content and data with analytics and technology to help:
• RESEARCHERS: to make new discoveries and have more impact on society
• CLINICIANS: to treat patients better and save more lives
• NURSES: to get jobs and help save lives
Where we are going: our products and services are becoming decision support tools, built on high quality content and data
[Diagram: users move from "searching" (Answers: ROS) through "doing" (Data) to "reading" (Content). Segments: Research, Corporate R&D, Clinical practitioners, Nursing students. Content and products include Woodhead, Gray's Anatomy, Fundamentals, Cell, ScienceDirect, Knovel, ClinicalKey, Evolve, Knovel Materials, Infermed, Sherpath.]

Enabling technologies:
1. Standard architecture, next-gen search & recommendation
2. Access hubs for user applications
3. Big data platforms
4. Semantic enrichment & knowledge graphs
5. Machine learning
Our capabilities: best content, advanced data analytics, product development
[Diagram: the intersection of content & data, product, and state-of-the-art technology.]
• For a traditional B2B company, the transition to data and analytics as a service has not always been intuitive
• Backend processing was separated from product usage
• Quality of content and tools determined independently from end products
• No shared metrics of success
• Does not allow for iterative testing of analytic products
Data does not get pushed to the products to be exposed to users, nor is it gathered from platforms and analyzed offline.
We cannot separate our data scientists from product development and marketing.
Unleashing data: a combination of data, advanced analytics and product development
Outline
1. Accuracy versus quality: chemical entity extraction
2. Shared metrics: topic identification
3. How to get data where data doesn't exist: academic family trees
Reaxys: Automatic chemical entity recognition
Challenge: Content from 450 journals is manually annotated for chemical entities and their properties
Approach: apply advanced NLP and ML to automate extraction
Hurdles
• Are there existing tools that are good enough?
• QA functions question the accuracy required from machines
• The product assumes humans are always correct
• Suppliers are not incentivized to use modern analytic methods
• To get any improvements in, entire workflows will have to change
[Pipeline: new articles/patents (unannotated) → prediction model → predictions]
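The unannotated-text → prediction-model → predictions pipeline can be sketched with a toy dictionary matcher standing in for the trained NLP model (the lexicon and example text are purely illustrative):

```python
import re

# Toy stand-in for the prediction model: a dictionary matcher.
# A production system would use a trained NER model instead.
CHEMICAL_LEXICON = {"benzene", "toluene", "sodium chloride"}

def predict_entities(text):
    """Return (entity, start, end) spans for known chemical names."""
    found = []
    for name in CHEMICAL_LEXICON:
        for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            found.append((m.group(0), m.start(), m.end()))
    # Sort by position so predictions read in document order.
    return sorted(found, key=lambda span: span[1])

unannotated = "The solvent was toluene; benzene was avoided."
predictions = predict_entities(unannotated)
```

In the real workflow these predictions would feed the QA process rather than go straight into the product.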
Solutions
• Are there existing tools that are good enough? We identified third-party tools and in-house expertise.
• QA functions question the accuracy required from machines. A QA process is different from accuracy: accuracy for NLP is measured in terms of F-scores, and a QA claim of 96% accuracy is not the same as a 96% F-score.
• The product assumes humans are always correct. When algorithms predict differently than humans, they are thought to be wrong. However, when humans are measured by the same scientific standards, we see they don't always agree with each other.
• Suppliers are not incentivized to use modern analytic methods. Efficiency gains can only be realized if contracts with suppliers are renegotiated.
• To get any improvements in, entire workflows will have to change. To find the right balance between humans and machines, we need to incorporate automation without taking the human out of the loop. We also need to learn from the work that humans do.
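The accuracy-versus-F-score distinction can be made concrete: when entities are rare, a high raw accuracy can coexist with a much lower F-score. The confusion counts below are illustrative, not from the Reaxys project:

```python
# Illustrative confusion counts for an entity tagger on 1,000 tokens,
# only 50 of which are actually chemical entities.
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.97
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.60
f1 = 2 * precision * recall / (precision + recall)  # ~0.67
```

A "97% accurate" claim here hides that the tagger misses 40% of the entities it is supposed to find, which is exactly the gap between a QA claim and an F-score.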
Outcomes
• Quality. Within 6 months we had tools of good enough quality to perform the extraction automatically.
• Cost. Within 8 months, contracts were renegotiated with suppliers, resulting in 1 million in annualized savings.
• Scalability. Within 1 year, we were able to scale chemical compound extraction from 450 journals to over 16,000.
Reaxys: Automatic chemical entity recognition
Major takeaways
• Organization. While the data scientists saw the value of this approach, feasibility required data scientists to work side by side with domain experts, QA functions, and product.
• Quality. There was a lot of cross-education on quality: what it means and how it is measured. Accuracy and fit-for-purpose are related, but not synonymous.
ScienceDirect: Topics and definitions

ScienceDirect (SD): a platform for researchers to search for articles in Elsevier content.

Problem, user need, and opportunity: a new product designed to make researchers' workflow easier; researchers need answers to questions.

Proposed solution: integrate book and journal content on ScienceDirect by leveraging our Smart Content capabilities, to provide content in context, aligned to the problem it solves for the researcher.
Use cases:
• "I need to quickly get authoritative information on words or concepts that are new to me"
• "I want to better understand the article"
• "I need both the foundational information and the latest developments in this area"
Hurdles: what are our metrics of success? How accurate are we in tagging data? How good are our algorithms at finding definitions and methods in book content? What if we have the best algorithms but the UI is not very useful?
Approach
[Diagram: article page → new topic page → chapter page → usage data and analytics]
• Relevant concepts in journal articles are highlighted and hyperlinked to new, machine-generated "topic pages". Topic pages are also indexed by web search engines.
• Free content extracted from books features a definition and links to chapter pages containing relevant book content; we are experimenting to decide the minimum amount of content.
• Links lead to book-chapter full text (subscribed) or abstract (unsubscribed) on SD.
• Analytics: subscribed usage and unsubscribed turnaways drive 'value based' selling and commissioning.
Buyer response
• n% think the new features and usage statistics would increase their e-book purchasing from ScienceDirect
• n% think that the integrated content would increase e-book usage
• n% would expect an increase in the value of their purchases

User response
• 89% of users found the topic page helpful
• "This would be great, I would read all of this. I would have been pleased with this page [when writing my paper], it would have saved me a lot of time" (M, Senior Research Associate, Neuroscience)
• "This is exactly the kind of thing I would be looking for. I'm used to bland Wikipedia, this is more on-point and technical, …it would certainly save me a lot of time" (J, Senior Research Associate, Neuroscience & Pharmacology)

Use cases and engagement have been validated.
Analytics results (quantitative and qualitative)
[Chart: click-through rate (CTR) by release, V0–V7, ranging roughly 0–12%]
Outcomes
• Quality. A pilot phase indicated that an accuracy of only 77% was useful to 89% of users. There was no need to focus further on topic-identification accuracy.
• Usage data had a direct impact on defining quality for the analytics.
• Benefits. CTR confirmed revenue targets during the pilot period.
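The pilot's headline metric, CTR, is simply clicks over impressions per release; a minimal sketch with hypothetical counts (the real V0–V7 figures are not in this deck):

```python
# Hypothetical click/impression counts per release version.
events = {
    "V0": {"clicks": 40, "impressions": 2000},
    "V1": {"clicks": 90, "impressions": 1800},
    "V2": {"clicks": 150, "impressions": 1500},
}

# CTR per release: clicks divided by impressions.
ctr = {v: e["clicks"] / e["impressions"] for v, e in events.items()}
```

Tracking this per release is what lets the team tie algorithm and UI changes to a shared, customer-facing KPI rather than to offline accuracy alone.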
SD: Topic pages and foundational content
Major takeaways
• Quality. Quality metrics include many facets, but each has to have a quantifiable effect on the customer. We need an end-to-end understanding of what quality is.
• Shared metrics. The product team, UI developers, and data scientists all need to be working toward the same KPIs. We don't pursue speed and accuracy for their own sake; they have to have direct customer value.
• Usage data is necessary for determining data-analytics methods and accuracy.
How to get data where data doesn’t exist: Academic Family Trees
• Need: academic family trees
  • Be aware of conflicts of interest, for example in reviewer selection or funding panelists
  • Recommendations such as articles/people in Mendeley or ROS Communities
• Dilemma:
  • How can I get it right if I do not have data?
  • What would be the optimal roadmap to grow a hackathon model into a full-scale product?
• High quality: only if you can get it right
• Chicken-and-egg problem: most data-analytic ideas are killed in infancy because (evaluation) data does not exist or cannot be collected cheaply

art clip taken from https://teamupstartup.com
[Roadmap: heuristic model (baseline) → email campaign (crowd sourcing) → supervised ML model → background enhancement of existing product → ML model enhanced with user data or subscription info → new product I: collective data value → new product II: individual data point]
Approach: evolve through models, platforms, and products to grow the data and the analytical models in an agile way, along the least costly path.
• A heuristic model suggesting a guess simplifies the question from "who?" to "yes or no"
• Different ML models vary in the amount of training data needed and in the cost to train
• Each model generates (the best available) guesses for the next round of manual data collection; the model is improved iteratively with better and larger data
• Click-through data can offer additional data, which may be low quality but sufficient for initial stages

art clip taken from http://oneguestatatime.com/blog-2/organic-growth
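The bootstrap loop above can be sketched in miniature. Everything here is hypothetical: the feature (co-authorship counts), the heuristic, and the toy threshold "model" merely illustrate heuristic-first, iteratively retrained labeling:

```python
def heuristic_is_mentor(pair):
    """Baseline guess: a frequent co-author on someone's early papers
    is likely the mentor. Turns 'who is the mentor?' into yes/no."""
    return pair["coauthored_early_papers"] >= 2

def train(verified):
    """Toy 'model': learn a decision threshold from verified labels.
    A real system would fit a supervised ML model here."""
    yes = [p["coauthored_early_papers"] for p, label in verified if label]
    threshold = min(yes) if yes else 2
    return lambda pair: pair["coauthored_early_papers"] >= threshold

pairs = [
    {"coauthored_early_papers": 3},
    {"coauthored_early_papers": 1},
    {"coauthored_early_papers": 2},
]

# Round 0: heuristic guesses seed the first manual verification round
# (e.g. an email campaign asking researchers to confirm their mentor).
guesses = [heuristic_is_mentor(p) for p in pairs]

# Round 1: a small verified sample trains a model whose guesses
# drive the next, larger collection round, and so on iteratively.
verified = [(pairs[0], True), (pairs[1], False)]
model = train(verified)
guesses = [model(p) for p in pairs]
```

Each round produces better guesses to verify, so the training set grows without requiring the full dataset to exist up front, which is the escape from the chicken-and-egg problem.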
Hosting products to grow the model
• Acquire data/users from different business sectors to accelerate data growth. Different products have different engagement levels, numbers of users, opt-in behaviors, and adoption/development costs.
• Understand the impact of accuracy on different products:
  1. Products benefitting from the model without explicitly showing results (implicit application)
  2. Products explicitly showing collective data
  3. Products explicitly showing individual data
Outcomes
• Quality. Within 4 months we achieved 70% accuracy in mentor detection. We grew from no training data to 10K+ unbiased data points.
• Cost. Staging and transitional plans ensured that data-scientist and developer costs were minimized before getting buy-in from business stakeholders for the next phase.
ROS Communities: Academic Family Trees
Major takeaways
• Data. As with a business plan, high-quality data and ML models require a survival roadmap to gain momentum and maturity in terms of size or adoption. This may require combining heuristics, ML, crowd-sourcing, A/B testing, … as well as merging data and moving models across different business sectors.
• Quality. Data insights can be staged as a behind-the-scenes support service before reaching the critical quality to be delivered individually in new products.
Summary
Quality
• Need a holistic approach that includes (1) quality of content; (2) accuracy of algorithms; and (3) fit for purpose
• Shared understanding of what quality is
Shared KPIs
• Data scientists, software developers, and product people need to be working toward the same goal
• If a product metric is to increase revenue, then there have to be metrics in place to demonstrate how algorithms contribute to that
Product usage informs analytics
• User testing informs data quality needs
• User testing provides data where none was previously available
• Usage data confirms KPIs
Organizational structure
• Data scientists need to be closely linked to product and software development
• All functions need to have an incentive to work together (shared KPIs)