Empowering PDT Analytics through Databricks & Spark Structured Streaming
05/05/2021
Arnav Chaudhary (he/him)
Digital Product Manager, Takeda
Jonathan E. Yee (he/him)
Data and Analytics Executive, EY
Jeff Cubeta (they/them)
Clinical Intelligence Executive, Algernon Solutions
Databricks on Takeda’s Enterprise Data Backbone
The EDB (Enterprise Data Backbone) is Takeda’s integrated data platform responsible for combining global data assets into a single source of truth and enabling tools to provide insights via analytics.
Domains
• Data Ingestion
• Data Processing
• Advanced Analytics
Global Regions
• US
• Europe
• Japan
Applications
• Python
• RStudio
Specialized Deployment
• MIT researchers (ongoing collaboration)
Databricks on AWS is used heavily by Takeda across the business
• 200,000 DBUs of monthly compute
• 600+ monthly active users
• 50+ validated schemas with 100s of tables
• 15 advanced analytics teams using Databricks
PDT Analytics Program
What are we solving for?
• Increased access to a greater volume of plasma donors
• Drive improved plasma yield
• Reduce cost per liter
• Harvest the value of PDT’s data assets
Expected Outcomes
• Gain access to a larger share of the donor market to reduce cost per liter (CPL)
• Increase yield by improving retention and the conversion funnel
• Reduce manual processes and increase automation to improve operational efficiency
• Improved data, analytics, and process layers for PDT analytics
PDT Donor Portal Application & Analytics Foundation
Previously…
• 153 disparate existing center systems
• Reliance on 3rd party for marketing insights
• Manual report generation
• Lack of real-time information for quick decision making
• Reactive decision-making process
Going forward…
• Data consolidated into one operational data store (ODS)
• Near-real-time data transmission
• Data lake to store years of information
• Analytics platform allowing data scientists to perform data mining, create predictive models, and generate actionable insights
• Reduction of manual reports
PDT/BioLife Data Backbone
• PDT is the first adopter of the newly developed Takeda Enterprise Data Backbone platform in the cloud
• Supports analytics, operational use, and the data needs of other products (e.g. Donor Engagement, Fuji Innovation Engine)
• Daily batch jobs
• Manual report generation
• Limited access to data
• Structured, typed SQL data
• API-returned JSON
• Scheduled CSV uploads
• 4 enterprise data systems
• 151 collection centers
• 250 SQL tables
• ~1 TB historic data
• ~0.5 GB/hr ongoing CDC
We designed opportunities to drive value and address the core pain point themes for PDT
Real Time Data
• Spark Structured Streams
• Low latency data processing
• Standardized event streams to empower downstream apps
Unified Data Schema
• Single presence for Donors
• Cross system relationships
• Business process data entities
Lakehouse Model
• Uniform ingestion process
• Configuration driven operations
• S3 Delta Tables
• Data served to SQL DB for low latency, high volume querying
Three Key Pain Points with PDT Data Analytics
• Data Isolation
• Latency of Analytics
• Narrow Audience
Unified Data Schema
Configuration Driven Process
Lakehouse Model
Key Design Details
• Uniform ingestion platform
• Improved accessibility to data
• Delta Tables backing each layer
• Structured Streams between layers
• Support for big data analysis through serving Delta Tables
• Support for high volume, low latency querying using SQL based tools
• Extensible design to allow expansion
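The design details above (Delta Tables backing each layer, Structured Streams between layers, configuration driven operations) can be illustrated with a minimal sketch. This is not the PDT implementation: `TableConfig`, its field names, `load_configs`, and `start_stream` are hypothetical, and the sketch assumes an existing `SparkSession` with Delta Lake available and an append-only upstream table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableConfig:
    """One entry per table; every field name here is hypothetical."""
    name: str              # logical table name
    source_path: str       # Delta path in the upstream layer
    target_path: str       # Delta path in the downstream layer
    checkpoint_path: str   # Structured Streaming checkpoint location
    key_columns: list = field(default_factory=list)  # business keys


def load_configs(raw):
    """Turn plain dicts (e.g. parsed from JSON or YAML) into TableConfig objects."""
    return [TableConfig(**entry) for entry in raw]


def start_stream(spark, cfg):
    """Start one Delta-to-Delta stream for a configured table.

    `spark` is an existing SparkSession. Reading a Delta table as a stream
    assumes the upstream table is append-only unless extra options are set.
    """
    return (
        spark.readStream.format("delta")
        .load(cfg.source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", cfg.checkpoint_path)
        .outputMode("append")
        .start(cfg.target_path)
    )


def start_all(spark, configs):
    """Start every configured stream and return the running queries."""
    return [start_stream(spark, cfg) for cfg in configs]
```

With this shape, onboarding another table means adding one configuration entry rather than writing a new job, which is one way to read "uniform ingestion" and "extensible design".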
Real Time Data
Using foreachBatch to Fork and Serve Streaming CDC Data
Using the Delta Table merge construct within _serve
Writing the CDC stream
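The code shown on the original slide is not part of this text. As a stand-in, here is a minimal sketch of a Delta merge inside a `_serve` batch handler and of writing the CDC stream through it. Only the name `_serve` comes from the slide; its signature, `build_merge_condition`, `write_cdc_stream`, and all parameters are assumptions. The sketch also assumes each micro-batch holds at most one change per key, since a merge fails when several source rows match the same target row, and it handles upserts only, not deletes.

```python
from functools import partial


def build_merge_condition(key_columns, target_alias="t", source_alias="s"):
    """Build the join condition that matches a CDC row to its target row."""
    return " AND ".join(
        f"{target_alias}.{col} = {source_alias}.{col}" for col in key_columns
    )


def _serve(batch_df, batch_id, *, spark, target_path, key_columns):
    """Upsert one micro-batch of CDC rows into the Delta table at `target_path`."""
    # Imported here so the helper above stays usable without delta-spark installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(batch_df.alias("s"), build_merge_condition(key_columns))
        .whenMatchedUpdateAll()       # existing key: overwrite with the new values
        .whenNotMatchedInsertAll()    # new key: insert the row
        .execute()
    )


def write_cdc_stream(spark, cdc_stream_df, target_path, key_columns, checkpoint_path):
    """Route every micro-batch of the CDC stream through `_serve`."""
    handler = partial(
        _serve, spark=spark, target_path=target_path, key_columns=key_columns
    )
    return (
        cdc_stream_df.writeStream
        .foreachBatch(handler)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

For key columns `["donor_id", "center_id"]`, the merge condition is `t.donor_id = s.donor_id AND t.center_id = s.center_id`.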
Within the foreachBatch function, we target multiple sinks:
• Delta Table
• SQL Database
• Event Bridge
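A minimal sketch of forking one micro-batch to the sinks listed above follows. All function names are hypothetical. The event sink takes a caller-supplied `put_events` callable, on the assumption that "Event Bridge" refers to Amazon EventBridge, where the boto3 client's `put_events` method would fit. The chunk size of 10 reflects that API's per-request entry limit, and collecting rows to the driver assumes small micro-batches.

```python
def delta_sink(path):
    """Append each micro-batch to the Delta table at `path`."""
    def write(batch_df, batch_id):
        batch_df.write.format("delta").mode("append").save(path)
    return write


def jdbc_sink(url, table, properties):
    """Append each micro-batch to a SQL database table over JDBC."""
    def write(batch_df, batch_id):
        batch_df.write.jdbc(url=url, table=table, mode="append", properties=properties)
    return write


def event_sink(put_events, source, detail_type, chunk_size=10):
    """Publish each row as one event via the supplied `put_events` callable."""
    def write(batch_df, batch_id):
        rows = batch_df.toJSON().collect()  # JSON strings, collected to the driver
        for start in range(0, len(rows), chunk_size):
            entries = [
                {"Source": source, "DetailType": detail_type, "Detail": row}
                for row in rows[start:start + chunk_size]
            ]
            put_events(Entries=entries)
    return write


def make_fork(sinks):
    """Return a foreachBatch handler that sends the same batch to every sink."""
    def fork(batch_df, batch_id):
        batch_df.persist()  # avoid recomputing the batch once per sink
        try:
            for sink in sinks:
                sink(batch_df, batch_id)
        finally:
            batch_df.unpersist()
    return fork
```

The handler would be attached with `stream_df.writeStream.foreachBatch(make_fork([...]))`. Because `foreachBatch` gives at-least-once delivery by default, a batch can be replayed after a failure, so each sink should tolerate duplicates or use `batch_id` to skip batches it has already written.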
© 2019 Takeda Pharmaceutical Company Limited. All rights reserved
Thank you for attending! We will do our best to answer any questions.