Using Big Data To Design & Manage Clinical Trials
An Architect’s Perspective
Manoj Vig
https://www.linkedin.com/in/manojvig Twitter - #manojvig
Disclaimer
I am an employee of Shire Pharmaceuticals. The statements and opinions expressed within this session are my own and do not represent those of Shire.
There are some references to technical design patterns being implemented within Shire, but the explanations of those implementations provided in this session are purely technical.
This presentation outlines general technology direction and trend analysis. Shire has no obligation to pursue any approaches outlined in this document or use any functionality documented or discussed in today’s session.
What is Big Data
Volume (Petabytes of Data)
Variety (Structured, Unstructured, Images, Sounds)
Velocity (Batch, sub-second response, stream, changes in data)
Characteristics of a big data system
Handle large volumes of data
Designed for Scalability & Failover
Support multiple workloads
Security, multi-tenancy & privacy
Cost effective
Technology framework for Innovation
1. Mobility
2. Social
3. Apache Hadoop (Multiple workloads / Distributed Computing)
Arrival of Mobile Age
Participant Recruitment
Adherence & Engagement
User Interaction
Frequent Data Generation
Remote Data Exchange
Data Generation
Power of Social media
Participant engagement
Patient & Site Identification
Social Listening
Distributed Scale Unstructured Velocity Security Access
Big Data Processing Systems
Using Twitter – Implementation Pattern
Twitter
Twitter API (Multi-threaded data acquisition)
Curation
Filter Algorithms Rank
Location Profile
Distributed, Scalable, Fast & Economical
Key Decision Makers
Targeted Ads
Visualizations
Web/Mobile
Delivery Channels
Automated Process
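The pattern on this slide (multi-threaded acquisition, then curation by filtering and ranking) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation discussed in the session: `fetch_posts` is a hypothetical stand-in for a real Twitter API client and only returns canned data, and the keyword filter and follower-count ranking are assumed placeholders for the filter and rank algorithms.

```python
from concurrent.futures import ThreadPoolExecutor

# Canned data standing in for API responses (hypothetical, illustration only).
SAMPLE_POSTS = {
    "clinical trial": [
        {"user": "a", "text": "Enrolling in a clinical trial next week",
         "followers": 120, "location": "Boston"},
        {"user": "b", "text": "Clinical trial results look promising",
         "followers": 5400, "location": "London"},
    ],
    "rare disease": [
        {"user": "c", "text": "Living with a rare disease",
         "followers": 800, "location": "Boston"},
        {"user": "d", "text": "Unrelated post about lunch",
         "followers": 9000, "location": ""},
    ],
}

def fetch_posts(query):
    """Hypothetical stand-in for a call to a Twitter API client."""
    return SAMPLE_POSTS.get(query, [])

def acquire(queries, max_workers=4):
    """Acquisition: fetch every query concurrently on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = pool.map(fetch_posts, queries)
        return [post for batch in batches for post in batch]

def curate(posts, keywords, location=None):
    """Curation: keep posts that mention a keyword (and optionally match a
    location), then rank them by follower count, highest first."""
    kept = [
        p for p in posts
        if any(k in p["text"].lower() for k in keywords)
        and (location is None or p["location"] == location)
    ]
    return sorted(kept, key=lambda p: p["followers"], reverse=True)

KEYWORDS = ["clinical trial", "rare disease"]
posts = acquire(KEYWORDS)
ranked = curate(posts, KEYWORDS)
print([p["user"] for p in ranked])
```

Passing `location="Boston"` to `curate` narrows the ranked list to one site, which is one way the Location and Profile boxes could feed patient and site identification.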
Apache Hadoop
Security, governance, privacy and Audit
BI Reports & Dashboards
Data Analysts
Data Scientists
Apps(Web + Mobile)
Devices
Data Feeds
Data Service : Multiple data sources, multiple processing workloads and multiple delivery channels
Impala / Tez(Interactive)
HDFS(Hadoop Distributed File System)
MR(Batch)
Spark(Stream, ETL, DS)
Hive(DW)
Robust Cloud Infrastructure(e.g. AWS EC2)
Governance, Security & Audit
YARN (Cluster Resource Manager)
HBase (NoSQL)
Solr (Search)
Spark (MLlib, Graph)
Custom/Proprietary/Visualization Apps
CTMS
Common Data Ingestion
ClinicalTrials.gov
Metadata Data Quality
Searchable Data Catalog
Streaming
CRO Data Feed
Genomic Data
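One way to picture the common ingestion layer with its metadata, data quality, and searchable catalog pieces is the sketch below. It is an illustration in plain Python under stated assumptions, not a description of any Hadoop component: the `Catalog` class, the required metadata fields, and the sample dataset names are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed minimum metadata every dataset must carry (hypothetical rule).
REQUIRED_FIELDS = ("source", "owner", "format")

@dataclass
class Catalog:
    """Hypothetical in-memory catalog; a real one would live on the cluster."""
    entries: dict = field(default_factory=dict)

    def ingest(self, name, metadata):
        """Register a dataset only if it passes the metadata quality check."""
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            raise ValueError(f"{name}: missing metadata {missing}")
        self.entries[name] = dict(metadata)

    def search(self, term):
        """Return sorted names of datasets whose name or metadata mention term."""
        term = term.lower()
        return sorted(
            name for name, meta in self.entries.items()
            if term in name.lower()
            or any(term in str(value).lower() for value in meta.values())
        )

catalog = Catalog()
catalog.ingest("cro_feed", {"source": "CRO Data Feed",
                            "owner": "clinical-ops", "format": "csv"})
catalog.ingest("ctgov_studies", {"source": "ClinicalTrials.gov",
                                 "owner": "r-and-d", "format": "xml"})
print(catalog.search("clinical"))
```

Rejecting a feed at ingestion when its metadata is incomplete keeps the catalog searchable later, since every registered dataset is guaranteed to carry the same descriptive fields.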
Information Overload Problem – Apache Solr
CTMS
Streaming
ClinicalTrials.gov
UK Clinical Trials Gateway
Other R&D Datasets
SAS Datasets
Genomic Datasets
Apache Solr Running on Hadoop Cluster
HDFS(Data Landing)
Apache Solr
Data Indexing
Information Extraction(Spark)
Pattern Recognition(Spark)
Machine Learning(Spark)
Metadata Driven Ontology (HBase)
Data Indexing
Solr APIs
Web UI
Mobile Apps
Desktop Widgets
Dashboards
Data Sources
Consumption
HBase APIs
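On the consumption side, a client such as a web UI or dashboard might reach the indexed data through Solr's standard `/select` search handler. The sketch below only builds the request URL and does not contact a server. The host, the `clinical_docs` collection name, and the `source` field are assumptions made for illustration; `q`, `fq`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_solr_query(base_url, collection, text, source=None, rows=10):
    """Build a URL for Solr's standard /select search handler."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if source:
        # fq (filter query) narrows the result set without changing scoring.
        # The "source" field name is hypothetical.
        params.append(("fq", f'source:"{source}"'))
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"

url = build_solr_query(
    "http://localhost:8983",   # assumed local Solr instance
    "clinical_docs",           # hypothetical collection name
    "phase 3 hypertension",
    source="ClinicalTrials.gov",
)
print(url)

# Decode the query string again to confirm what the server would receive.
print(parse_qs(urlsplit(url).query))
```

Keeping the source restriction in `fq` rather than in `q` means the same free-text search can be reused across CTMS, ClinicalTrials.gov, and the other feeds by changing one parameter.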
Technology is here to stay
Data Generation speed will accelerate
Data Access will get easier
Device connectivity will increase
Technological disruption is inevitable
Conclusion
Questions?
Further Reading
Are Recommender Systems Now Mainstream? ◦ https://icrunchdatanews.com/recommender-systems-now-mainstream/
The Impact of Real-time Computing Systems – Part 1 ◦ https://icrunchdatanews.com/impact-real-time-computing-systems-part-1/
The Impact of Real-time Computing Systems – Part 2 ◦ https://icrunchdatanews.com/impact-real-time-computing-systems-part-2/
ASCOT: a text mining-based web-service ◦ http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3339391/