Explore Digits Inc
● Data Engineering and Data Science solutions
● 5+ years of experience with TIBCO Data Science and AWS
● Experience providing TIBCO-based solutions supporting initiatives such as:
○ Identifying factors contributing to health issues in long-term space travel
○ Defining policies to immediately manage disease outbreaks
○ Optimizing monetary flow in the economy
Agenda
1. Challenges mature organizations face on the AI/ML journey.
2. Best practices and solutions that can address these challenges.
3. Some key discriminators you may want to consider.
Journey from Descriptive Analytics to AI/ML
Use Case              Machine Learning                      Artificial Intelligence
Space Agency          Predict whether the person talking    Detect that an astronaut is under stress
                      is an astronaut                       and recommend actions to reduce it
Global Health Agency  Detect the cause of an epidemic       Suggest policies to control its spread
Descriptive & Diagnostic Analytics: How are we doing?
  Analytics: Report the number of services customers have used
  Advanced Analytics: Cluster customers based on patterns of services used
Predictive Analytics: What might happen in the future?
  Machine Learning: Predict which services a customer will use
Prescriptive Analytics: Suggest a course of action
  Artificial Intelligence: Detect a customer need and recommend a solution
As-Is Fraud Prevention Environment
Tools: Teradata, SAS
Model development & investigation environment
Requirements → Implementation
Challenge #1: Ineffective model deployment, since development and execution are two different environments/solutions
Challenge #2: Significant effort in managing data pipelines
Challenge #3: Significant effort in O&M of a data environment based on Hadoop and IBM
Challenge #5: Legacy tools limit capability; e.g., SAS is legacy and limiting, and IDR only supports RDBMS-based analytics
Execution environment (Hadoop):
Data files, Hive metastore, MapReduce jobs (Pig programs), IBM SPSS + IBM rules
Data sources: Payment System, Claims, Provider, Beneficiary, Drug, Quality, Third-party data
Data Center
Challenge #6: Fixed, expensive on-premise data center
Challenge #4: Scalability issues in handling thousands of models, especially ML- and AI-based models such as deep neural networks
What is the Biggest Challenge in Accelerating Data Analytics?

Data sources: multiple databases, analytics data, external feeds, mainframe apps data
80%-90% of usage patterns: ML/AI analytics

“It’s impossible to overstress this: 80%-90% of the work in any data project is in getting the data.”
Proprietary & Confidential.
Data Sharing:
• Legacy data pipelines built since the 1990s
• Legacy, inflexible, and proprietary data repositories
• Significant investment in legacy analytics tools such as SAS and client-server based BI tools
Analysis:
• Data warehouse and data lake
• AI/ML
• Business intelligence
Data preparation stages: Discovering → Structuring → Cleaning → Enriching → Validating → Deploying
Roles: Data Analyst, Data Engineer, Data Scientist
And it Impacts Your Entire Organization...
Moving to The Cloud Aggravates The Problem
Legacy Tools Are Insufficient to Tackle Data Challenges in the Cloud
• Not integrated with cloud platforms
• Unable to handle diverse data in the cloud
• Lack of self-service for business users
• Rigid design, poor scalability, and high cost
Result: poor analytics outcomes in the cloud
New Tools and Migration process needed for Cloud
Roadmap over time:
1. Build out a foundation that is “cloud & big data enabled”
2. Migrate data and analytics to reap the benefits of low cost and high performance (key)
3. Improve analytics, including self-service and collaboration

Offerings:
• Operationalized AI and ML data environment: an intelligent, powerful data foundation architected for the cloud
• Analytics environment: leveraging open source and/or best-of-breed, ready-to-use analytics platforms
• Data and analytics migration services: automated migration of legacy analytics and data pipelines to modern platforms
• User enablement for accelerated transition
Nationwide Healthcare Quality Improvement
Foundational Principles:
• Enable innovation
• Data-driven decision and policy making
• Foster learning organizations
• Eliminate disparities; strengthen infrastructure and data systems
Goals
Make care safer
Strengthen person and family engagement
Promote effective communication and coordination of care
Promote effective prevention and treatment
Promote best practices for healthy living
Make care affordable
Continuous Improvement Activities
Root Cause Analysis: Identify challenges/problems in real-time and effectively
apply problem-solving techniques.
Measurement Strategies: Utilize data to develop healthcare quality measures
and/or a measurement strategy.
Data Driven: Collect, aggregate, analyze, display, and openly share data (i.e., to
achieve or work toward transparency).
Data Collection: Large and diverse data sets. Internal and External data.
Data Analytics & Reporting: Monitor, educate, and develop policies.
Transparency: Collaboratively share methods and results.
Copyright © 2019, Explore Digits Inc. All rights reserved.
Addressing Hospital Readmissions: Real-World Scenario
Identify the 60% of the population that needs to be targeted for readmission
improvement.
Understand network/dependencies and conduct root cause analysis.
Contact potential healthcare providers, convey goals, and possible causes.
Recruit providers to participate in the program.
Monitor progress during the period; hence, access to closer-to-real-time claims
data becomes important.
Legacy Environments
On-premise Unix-based workbench with the following major capabilities:
Centralized file system: data is made available in SAS data format; limited file storage capacity.
SAS server tool: limited processing capability, used mainly to extract data.
Observations
80%-90% of the effort is data preparation.
Based on conversations, only a limited set of statistical and analytics algorithms is used.
Thousands of legacy SAS modules measure and monitor healthcare quality.
Drawbacks of Legacy Environments
Need diverse compute and consolidated data storage capability
Workbench performance and capacity issues
Low productivity
Expensive investments in local data centers
Need for closer to real-time data
Delayed data makes interaction with impacted users ineffective
Need more data for effective analytics
Not built for modern ML/AI
New employees are not skilled in the legacy stack, making labor expensive
Need to close gaps in reporting and data visualization tooling
Data & Analytics Modernization
Business Goals
Improve Analytic experience for users through advanced functionality
Modernize current analytic platforms and data repositories
Increase collaboration and data sharing for internal/external stakeholders
Improve data timeliness
Simplify/automate data access
Reduce data duplication, mature data governance and increase data quality and security
Objectives
Establish a Centralized Data Repository (CDR) and Analytics Platform (AP) in the cloud; create a centralized repository that includes enterprise data
Make program-specific datasets available for analytics and sharing with multiple consumers
Implement an analytics toolset
Move analytic operations
Assist the transition to the cloud
[Diagram: Cloud platform. Sources (databases, log files, spreadsheets, IoT sensors, apps) feed a data lake and data warehouse through ETL/ELT; a processing engine and access control sit between storage and the consuming services: AI & ML, BI reporting, enterprise data service, and analysis.]
Solution mapped to Needs
[Diagram: Central Data Repository, Open Analytics Platform (OAP), ML-based advisor, analytics (SAS) migration service, database migration service, and data pipeline migration service.]
Modernization Vision: Cloud-Based Data and Analytics Solution
[Diagram: A central data repository stores data objects in Amazon S3 with a Hive metastore (AWS Glue Data Catalogue planned) and provides data de-identification, data catalogue, data hydration, and data preparation services. Source data includes billing, clinical, provider, patients, vendors, and enrollment. An enterprise cluster (Apache HDFS/Spark, Apache Hive, TIBCO Data Science portal) and Programs 1-3 each run in their own VPC with an Apache HDFS/Spark cluster on on-demand Amazon EMR, local Amazon S3 and Apache Hive storage, and reporting, data prep, or TIBCO portals. Access is controlled with AWS IAM; AWS Lambda and AWS DMS support processing and migration.]
Modernized Architecture
[Diagram: Internal data sources and external data feeds are ingested into a central Amazon EMR cluster holding central data, a central data catalogue, a de-identification step, and a Presto SQL engine exposing data APIs. Program clusters (Program 1 with its own Presto SQL engine, Program 2 with AWS Redshift) keep local data and catalogues, each in its own VPC with IAM-controlled access. TIBCO Data Science and reporting tools consume the data.]

Benefits:
• Eliminate the need to replicate data or download data to local machines
• Reduce data centers and data movement
• Improve data quality

Data integration:
• Use of AWS Lake Formation for the data repository, including Amazon S3 (storage), AWS Glue Catalogue (Hive), AWS Glue ETL or AWS DMS, and EDOS tools
• Use of customized AWS Lake Formation, or connect to an existing environment
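The de-identification step in the architecture above can be illustrated with a minimal sketch. This is an assumption about how such a service might work (keyed hashing of direct identifiers); the deck does not describe the actual implementation, and the field names and key below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; a real service would pull this from a managed secret store.
SECRET_KEY = b"rotate-me"

def deidentify(record, id_fields=("beneficiary_id", "provider_id")):
    """Replace direct identifiers with keyed HMAC-SHA256 pseudonyms."""
    out = dict(record)
    for field in id_fields:
        if field in out:
            out[field] = hmac.new(
                SECRET_KEY, str(out[field]).encode(), hashlib.sha256
            ).hexdigest()[:16]  # same input -> same pseudonym, so joins still work
    return out
```

Because the pseudonym is deterministic for a given key, analysts can still join de-identified datasets on the hashed column without ever seeing the raw identifier.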
Open Analytics Platform
Vision: Increase productivity with data science. Provide a self-service data analytics solution that is scalable, affordable, and leverages big data.

Challenges:
• Time to data insight is too long
• Data scientist skills are expensive
• 80% of time is spent finding and cleaning data
• Analytics limited to sample data sets
• Applying analytics to the operational environment requires recoding
• Significant dependency on IT teams, causing delays
• Lack of collaboration for data science, resulting in data analytics silos
• Volume, velocity, and variety of data

Solution: The TIBCO platform is part of an integrated, best-of-breed big data analytics solution on AWS.
• Works with Apache Hadoop and Apache Spark
• An 80%-ready solution for most customers
• Allows focus on the mission and the benefit of continuous innovation
Open Analytics Platform: Tool Choices
Multiple tools to support various needs and offer cost savings:
• Data visualization: Arcadia
• Data transformation: Trifacta
• Advanced modeling: TIBCO
• Analytics toolset: SAS Viya
• Bring-Your-Own-Analytics (BYOA): open source tools
OAP: Open Analytics Platform Capabilities
Core Capability: Data Storage and Processing
Technology: AWS S3, AWS EMR, AWS Glue
Benefits: Agile, open-architecture solution for storing and processing massive volume, velocity, and variety of data. Enables data integration, streaming ingest, and storing and processing unstructured data.

Core Capability: Event Framework
Technology: AWS Lambda; TIBCO Data Science
Benefits: Real-time processing of new data on event triggers; microservices-based bots that push analytics in real time for maximum benefit.

Core Capability: Data Governance
Technology: AWS Glue Data Catalog
Benefits: Smart data catalog that automatically discovers, organizes, and surfaces high-quality information, making it easy to find and use data.

Core Capability: Predictive Transformation
Technology: TIBCO Data Science
Benefits: Self-service data preparation using a suggestive approach to rapidly develop data processing jobs.

Core Capability: Data Visualization
Technology: Arcadia Data / Tableau
Benefits: Self-service approach to rapidly building and sharing interactive data visualizations. Link analytics enables greater insight into data relationships to isolate suspicious entities. Geospatial analytics supports iterative, spatial analysis to solve complex problems related to the physical world.

Core Capability: Advanced Analytics & AI/ML
Technology: TIBCO Data Science
Benefits: Provides a collaborative, visual environment to create and deploy analytics workflows and predictive models.
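The event framework above can be sketched as an AWS Lambda-style handler that reacts when new data lands in S3. This is a minimal illustration, not the deck's actual implementation; the event shape follows the standard S3 put-notification format, and the bucket/key names in the example are hypothetical.

```python
def handler(event, context=None):
    """Collect (bucket, key) pairs for newly arrived S3 objects so a
    downstream scoring/analytics job can be triggered for each one."""
    arrivals = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            arrivals.append((bucket, key))
    return arrivals

# Example S3 event, trimmed to the fields the handler reads:
event = {"Records": [{"s3": {"bucket": {"name": "claims-raw"},
                             "object": {"key": "2019/07/claims.csv"}}}]}
# handler(event) -> [("claims-raw", "2019/07/claims.csv")]
```

In a real deployment the returned pairs would be handed to the analytics engine; here the handler just extracts them, which is the part of the pattern the slide describes.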
AI/ML Environment
[Diagram: Data sources feed an AI platform spanning investigation, data discovery & governance, a common analytic AI platform, and operationalized, production-ready AI services, with enterprise metadata search, infrastructure logs, knowledge management, collaboration, crowd sourcing, and external data alongside. Data scientists, analysts, and business/operations users work under monitoring, governance, and control; models are developed and then deployed as AI bots through an AI engine, generating digital exhaust.]
Challenge #1 addressed: Effective model deployment, since we now have seamless deployment
Challenge #2 addressed: The compute-to-data pattern eradicates data silos and hence the effort spent getting data
Challenge #3 addressed: AWS managed services reduce the burden of infrastructure O&M
Challenge #5 addressed: Modern tools attract a new generation of data scientists
Challenge #6 addressed: The fixed, expensive on-premise data center is eliminated
Challenge #4 addressed: Distributed clusters such as Apache Spark and the use of AWS SageMaker enable a highly scalable, high-performing execution engine
SAS to Python

SAS: Proprietary and closed source.
Python: Open source.

SAS: Recurring licensing costs on usage and any expansion.
Python: No license fee on usage or expansion.

SAS: Specialized skillset or training needed for resources.
Python: Widely adopted; easy to find resources with the needed skillset.

SAS: Deployment is constrained by licensing and hard to deploy anywhere without first licensing the software.
Python: Easy to deploy across in-house or cloud infrastructures.
Manually convert SAS to Python:
• Manual effort by teams of data scientists; resources must know both SAS and Python
• Time consuming and error prone
• Very expensive in terms of resources
Solution: Automate conversion
We use a solution that automatically* converts code written in the SAS language to open-source Python 3.x, with the goal of enabling data scientists to use the modern machine learning and deep learning packages available via Python.
Written by a team of data scientists and engineers to convert SAS to Python correctly.
* 80% automation is usually seen, and it increases with familiarity with the SAS code being converted
Simple language constructs
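The original slide's code example did not survive extraction. As a hedged illustration of the kind of conversion meant here, assume a typical SAS DATA step with IF/THEN logic (the variable names are hypothetical):

```python
# SAS original (assumed example):
#   data work.out;
#     set work.in;
#     if age >= 65 then grp = "senior";
#     else grp = "adult";
#   run;

def convert_row(row):
    """Python equivalent of the DATA step's per-row IF/THEN/ELSE logic."""
    out = dict(row)
    out["grp"] = "senior" if out["age"] >= 65 else "adult"
    return out

work_in = [{"age": 70}, {"age": 40}]
work_out = [convert_row(r) for r in work_in]
# work_out[0]["grp"] == "senior", work_out[1]["grp"] == "adult"
```

A production conversion would more likely target pandas, but the per-row mapping above is the core of how simple DATA-step constructs translate.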
Datalines
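The slide's DATALINES example was lost in extraction. As a hedged sketch (the dataset and column names are hypothetical), SAS embeds inline data in the program, which maps naturally to an inline string in Python:

```python
# SAS original (assumed example):
#   data patients;
#     input id $ age;
#   datalines;
#   p1 70
#   p2 40
#   ;
#   run;

raw = """\
p1 70
p2 40
"""

# Parse each dataline into a record, converting age to an integer.
patients = [
    {"id": pid, "age": int(age)}
    for pid, age in (line.split() for line in raw.splitlines())
]
# patients == [{"id": "p1", "age": 70}, {"id": "p2", "age": 40}]
```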
Statistical Calculations
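The statistical example on this slide was also lost. As a hedged sketch (the data is made up), a SAS PROC MEANS call translates to the Python standard library's `statistics` module:

```python
import statistics

# SAS original (assumed example):
#   proc means data=patients n mean std;
#     var age;
#   run;

ages = [70, 40, 55, 61]
summary = {
    "n": len(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),  # sample std dev, matching PROC MEANS' default
}
# summary["mean"] == 56.5
```

Heavier statistical procedures would typically map to NumPy, SciPy, or statsmodels rather than the standard library.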
ETL code
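The ETL example on this slide was lost as well. As a hedged sketch (table and column names are hypothetical), a SAS PROC SQL inner join translates to a dictionary-based merge in plain Python:

```python
# SAS original (assumed example):
#   proc sql;
#     create table joined as
#     select c.claim_id, c.provider, p.amount
#     from claims c inner join payments p
#       on c.claim_id = p.claim_id;
#   quit;

claims = [
    {"claim_id": "c1", "provider": "prov-A"},
    {"claim_id": "c2", "provider": "prov-B"},
]
payments = [{"claim_id": "c1", "amount": 100.0}]

# Index payments by join key, then keep only claims with a match (inner join).
pay_by_id = {p["claim_id"]: p["amount"] for p in payments}
joined = [
    {**c, "amount": pay_by_id[c["claim_id"]]}
    for c in claims
    if c["claim_id"] in pay_by_id
]
# joined == [{"claim_id": "c1", "provider": "prov-A", "amount": 100.0}]
```

At scale, the same join would be expressed in pandas or Spark SQL; the logic is identical.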
Questions?