Empowering PDT Analytics through Databricks & Spark Structured Streaming

05/05/2021

Arnav Chaudhary (he/him)

Digital Product Manager, Takeda

Jonathan E. Yee (he/him)

Data and Analytics Executive, EY

Jeff Cubeta (they/them)

Clinical Intelligence Executive, Algernon Solutions

Databricks on Takeda’s Enterprise Data Backbone

The EDB (Enterprise Data Backbone) is Takeda’s integrated data platform responsible for combining global data assets into a single source of truth and enabling tools to provide insights via analytics.

Domains

• Data Ingestion
• Data Processing
• Advanced Analytics

Global Regions

• US
• Europe
• Japan

Applications

• Python
• R Studio

Specialized Deployment

• MIT Researches as an ongoing collaboration

Databricks on AWS is used heavily by Takeda across the business

200,000 DBUs of monthly compute

600+ Monthly Active Users

50+ validated schemas with 100s of tables

15 advanced analytics teams using Databricks

PDT Analytics Program

Drive improved plasma yield

Increased access to a greater volume of plasma donors

What are we solving for? | Expected Outcomes

Gain access to a larger share of the donor market to reduce CPL

Increase yield by improving retention and the conversion funnel

Reduce manual processes and increase automation to improve operations efficiency

Harvest the value of PDT’s data assets

Reduce cost per liter

Improved data, analytics, and process layers for PDT analytics

PDT Donor Portal Application & Analytics Foundation

Previously…

• Existing 153 disparate center systems

• Reliance on 3rd party for marketing insights

• Manual report generation

• Lack of real-time information for quick decision making

• Reactive decision-making process

Going forward…

• Consolidated data into one operational data store (ODS)

• Near-real-time data transmission

• Data lake to store years of information

• Analytics platform allowing data scientists to perform data mining, create predictive models, and generate actionable insights

• Reduction of manual reports

PDT/BioLife Data Backbone

• PDT is the first business unit to use the newly developed Takeda Enterprise Data Backbone platform in the cloud

• Supporting Analytics, Operational Use, and other Products data needs (e.g. Donor Engagement, Fuji Innovation Engine)

• Daily Batch Jobs
• Manual Report Generation
• Limited Access to Data

• Structured Typed SQL Data
• API Returned JSON
• Scheduled CSV Uploads

• 4 Enterprise Data Systems
• 151 collection centers
• 250 SQL Tables

• ~1 TB Historic Data
• ~0.5 GB/hr Ongoing CDC

We designed opportunities to drive value and address the core pain point themes for PDT

Real Time Data

• Spark Structured Streams
• Low latency data processing
• Standardized event streams to empower downstream apps

Unified Data Schema

• Single presence for Donors
• Cross system relationships
• Business process data entities

Lakehouse Model

• Uniform ingestion process
• Configuration driven operations
• S3 Delta Tables
• Data served to SQL DB for low latency, high volume querying

Three Key Pain Points with PDT Data Analytics

• Data Isolation
• Latency of Analytics
• Narrow Audience

Unified Data Schema

Configuration Driven Process

Lakehouse Model

Key Design Details

• Uniform ingestion platform
• Improved accessibility to data
• Delta Tables backing each layer
• Structured Streams between layers
• Support for big data analysis through serving Delta Tables
• Support for high volume, low latency querying using SQL based tools
• Extensible design to allow expansion
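
To make "Delta Tables backing each layer" and "Structured Streams between layers" concrete, here is a minimal, configuration-driven sketch in Python. It is not the presenters' implementation. The layer names (bronze, silver), the bucket path, and the TableConfig class are hypothetical names introduced only for illustration, and start_layer_stream assumes a Spark session that already has Delta Lake configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableConfig:
    """Per-table settings; one entry like this would exist for each ingested table."""
    name: str
    keys: tuple
    lake_root: str = "s3://example-bucket/lake"  # hypothetical location

    def path(self, layer):
        # Delta table location for this table in the given layer.
        return f"{self.lake_root}/{layer}/{self.name}"

    def checkpoint(self, src, dst):
        # Each layer-to-layer stream needs its own checkpoint location.
        return f"{self.lake_root}/_checkpoints/{src}_to_{dst}/{self.name}"


def start_layer_stream(spark, cfg, src="bronze", dst="silver"):
    """Start a Structured Stream that appends new rows from one Delta layer to the next."""
    return (
        spark.readStream.format("delta").load(cfg.path(src))
        .writeStream.format("delta")
        .option("checkpointLocation", cfg.checkpoint(src, dst))
        .outputMode("append")
        .start(cfg.path(dst))
    )


# Paths are derived purely from configuration, so adding a table means adding a config entry.
donor = TableConfig(name="donor", keys=("donor_id",))
print(donor.path("bronze"))
print(donor.checkpoint("bronze", "silver"))
```

Because every path is derived from the config object, the same function can be started once per table in a loop, which is one way to read "configuration driven operations".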

Real Time Data

Using foreachBatch to Fork and Serve Streaming CDC Data

Using the Delta Table merge construct within _serve

Writing the CDC stream

Within the foreachBatch function, we target multiple sinks:

• Delta Table
• SQL Database
• Event Bridge
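
The slide's code is not reproduced in this text, so the following is only a sketch of the pattern it describes: a foreachBatch handler that forks each CDC micro-batch to a Delta table (via merge), a SQL database, and EventBridge. It is not the presenters' _serve function. The function name serve_batch, the columns op and change_seq, the event source and detail type strings, and all parameters are assumptions made for illustration. It also assumes PySpark 3.3 or later (for batch_df.sparkSession), the delta-spark and boto3 packages, a target Delta table with the same columns as the CDC records, and JDBC credentials supplied separately.

```python
def merge_condition(key_cols, target="t", source="s"):
    """Build the ON clause for a Delta merge from a list of key columns."""
    return " AND ".join(f"{target}.{c} = {source}.{c}" for c in key_cols)


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def serve_batch(batch_df, batch_id, *, delta_path, key_cols,
                jdbc_url, jdbc_table, event_bus):
    """Fork one CDC micro-batch to three sinks (hypothetical names throughout)."""
    # Imported lazily so the pure helpers above work without Spark installed.
    import boto3
    from delta.tables import DeltaTable
    from pyspark.sql import Window, functions as F

    batch_df.persist()  # the batch is consumed once per sink
    try:
        # Keep only the latest change per key: merge fails if several
        # source rows match the same target row.
        newest_first = Window.partitionBy(*key_cols).orderBy(F.col("change_seq").desc())
        latest = (batch_df
                  .withColumn("_rn", F.row_number().over(newest_first))
                  .filter("_rn = 1")
                  .drop("_rn"))

        # Sink 1: upsert into the Delta table; 'D' marks a delete (assumed encoding).
        (DeltaTable.forPath(batch_df.sparkSession, delta_path).alias("t")
            .merge(latest.alias("s"), merge_condition(key_cols))
            .whenMatchedDelete(condition="s.op = 'D'")
            .whenMatchedUpdateAll(condition="s.op <> 'D'")
            .whenNotMatchedInsertAll(condition="s.op <> 'D'")
            .execute())

        # Sink 2: append the changes to a SQL database table over JDBC.
        (latest.write.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", jdbc_table)
            .mode("append")
            .save())

        # Sink 3: publish each change as an EventBridge event.
        # Collecting to the driver assumes small micro-batches.
        events = boto3.client("events")
        rows = latest.toJSON().collect()
        for chunk in chunked(rows, 10):  # PutEvents accepts at most 10 entries per call
            events.put_events(Entries=[
                {"Source": "pdt.cdc", "DetailType": "cdc-change",
                 "Detail": row, "EventBusName": event_bus}
                for row in chunk
            ])
    finally:
        batch_df.unpersist()


# Usage (needs a live streaming DataFrame, here called cdc_stream; values are placeholders):
# from functools import partial
# handler = partial(serve_batch, delta_path="s3://example-bucket/lake/serve/donor",
#                   key_cols=["donor_id"], jdbc_url="jdbc:sqlserver://example-host",
#                   jdbc_table="dbo.donor", event_bus="example-bus")
# (cdc_stream.writeStream.foreachBatch(handler)
#     .option("checkpointLocation", "s3://example-bucket/lake/_checkpoints/serve/donor")
#     .start())
```

Persisting the batch avoids recomputing it for each sink. Note that foreachBatch gives at-least-once delivery, so a retried batch can reach the SQL and EventBridge sinks twice; the Delta merge is idempotent for a replayed batch, but the other two sinks would need their own deduplication.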

© 2019 Takeda Pharmaceutical Company Limited. All rights reserved

Thank you for attending! We will do our best to answer any questions.
