Advanced Analytics in Banking
Juan M. Huerta
Global Decision Management
VP Advanced Analytics
Citibank
I will talk about…
• Big Data Adoption process at Citi
• Realizing the Technical Value of Big Data
• Global Solutions
140 countries
200 million accounts
Citi: A Customer Centered Organization
As a customer-centered bank, the goal of our Big Data strategy is to shift the focus from independent vertical silos to Common Horizontal Solutions focused around Citi’s 200 million customer accounts
Big Data Adoption Stakeholders
• Lines of Business
• Strategy & Decision Management Organizations: cross LOB & Geo, global
• Data Innovation Office: Governance & Regulatory
• CitiData – Big Data & Analytics Engineering
Big Data Adoption Roadmap
Adoption will not occur at once. The level of capability maturity across the organization will vary significantly.
In theory, we think in terms of the Staged Competencies of a Big Data Maturity Model.
In practice, a hybrid process, which fits the level of maturity of participants, is needed.
• Common Data
• Common Analytic Platform
• Common Tools & Techniques
• Common Solutions
• Common Focus
• Strategy
Big Data Adoption Hybrid Participation Model
• Novice: Proof of Concept
• Expert: R&D Environment
• Shadowed
End-to-end Analytic Process for a POC Project
This is one component of the hybrid model
Ideas and Hypotheses
• Pipeline of ideas to use data for competitive advantage
Information Asset Inventory Navigator (“IAIN”)
• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
• Preliminary assessment for business value, data safekeeping and alignment to business practices
Data Transformation & Provisioning
• Transformation rules executed to normalize and conform production data
• Conformed data set made available in production environment
Production Model Development
• Develop scalable, productizable analytics
Model Deployment
• Exploit insights and analyses across the enterprise to maximize value
• Models measured for quality / usage
• Formal approval process through Business Steering Committee based on understanding expected use of production data
R&D process
R&D Project Approval
Product Approval
Engineering / Production process
Analytics Knowledge Management
• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
Data Set Preparation & Provisioning
• Basic preparation of data set (e.g., consolidation, conformation)
• Permission-based provisioning of data set into a Big Data Analytics environment
Analytics Execution
• Advanced analytic tools mine business insight from large volumes of data
Analytics Peer Review
• Data scientist peers review model findings and results
Data Acquisition
• Where necessary, acquire new data sets to support R&D project
Advanced Global Solutions
• A global solution is a tested algorithm or analytic model that carries out a particular business analysis and which is leveraged at a global scale
• A big data global solution enables the interplay of complex algorithms and large datasets
• When a global solution is built upon big data approaches, a delivery roadmap should be considered
• In the exploratory process, a Global Solution is developed in the Innovation R/D environment and validated through a POC process
• Alignment with Innovation, UAT, PRD environments
Technical Value of Big Data:
Benchmarks and Analysis
The Boom Driving Big Data is Technological
Heebyung Koh, Christopher L. Magee. A functional approach for studying technological progress: Extension to energy technology. Technological Forecasting and Social Change, Volume 75, Issue 6, July 2008, Pages 735–758.
The Quadrant Of Analytic Opportunity
Run Time is affected by Data Size and Algorithmic Complexity
[Figure: algorithms and data sets positioned by Algorithmic Complexity and Data Size (10^5 to 10^10)]
• Data sets shown: Database Interaction; Mtg+Cards+Banking; Accounts Transaction features; Accounts Transactions; Branches Transactions; Accounts Summary Stats.; Employees Summary Stats.; GL-GOCS GL-Entries; Branches Summary Stats.
• Algorithm families: Traditional Statistical; Machine Learning; Big Data/Pattern Mining; Vector based Approaches
• Algorithms shown: Sequence Mining; Predictive filtering; Latent Dirichlet Allocation; HMM Baum-Welch, O(ns nf nt); CART, O(nf ns log ns); Iterative SVD-CF; K-means; Logistic Regression; PCA; PageRank; Self-Org. Maps; Neural Nets; Collaborative Filtering (CF); HMM; Conditional Random Fields; Support Vector Machines
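To make the "Run Time is affected by Data Size and Algorithmic Complexity" point concrete, here is a minimal sketch that compares how the two complexity expressions annotated in the figure, O(ns nf nt) and O(nf ns log ns), grow as the data size increases. The slide does not define ns, nf, or nt, so the sketch treats them as plain size parameters; the function names and the example sizes are illustrative assumptions, and constant factors are ignored.

```python
import math

def baum_welch_ops(ns, nf, nt):
    """Operation count proportional to O(ns * nf * nt) (constants ignored)."""
    return ns * nf * nt

def cart_ops(ns, nf):
    """Operation count proportional to O(nf * ns * log ns) (constants ignored)."""
    return nf * ns * math.log2(ns)

def growth(cost, small_args, large_args):
    """How many times more work the larger problem needs than the smaller one."""
    return cost(*large_args) / cost(*small_args)

# Hypothetical sizes: grow ns from 10^5 to 10^7 while holding nf (and nt) fixed.
linear_growth = growth(baum_welch_ops, (10**5, 10, 20), (10**7, 10, 20))
cart_growth = growth(cart_ops, (10**5, 10), (10**7, 10))

print(f"O(ns nf nt) work grows by a factor of {linear_growth:.0f}")
print(f"O(nf ns log ns) work grows by a factor of {cart_growth:.0f}")
```

With these assumed sizes, a 100x increase in ns costs 100x more work for the expression that is linear in ns, and somewhat more than that for the log-linear one, which is why both axes of the quadrant matter.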
Breaking down the gains of P13n:
A Controlled Incremental Benchmark on a Workstation grade processor (x500)
Implemented an incremental-SVD (Netflix Cup) predictive model that runs on midsize datasets…
• Compiled code (vs. interpreted): x30
• In memory (vs. disk access): x4
• Multithread (vs. single thread): x3.12
• Workstation grade processor: x1.3
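The headline x500 figure is roughly what you get by multiplying the four individual factors. The short sketch below does that arithmetic; it assumes the gains compound independently, which the slide implies but does not state explicitly.

```python
from math import prod

# Speedup factors as listed on the slide.
factors = {
    "compiled code (vs. interpreted)": 30.0,
    "in memory (vs. disk access)": 4.0,
    "multithread (vs. single thread)": 3.12,
    "workstation grade processor": 1.3,
}

def combined_speedup(speedups):
    """Multiply the individual speedups, assuming they compound independently."""
    return prod(speedups.values())

total = combined_speedup(factors)
print(f"combined speedup: x{total:.1f}")
```

The product comes to about x487, which the slide rounds to x500.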
Basic Map Reduce Benchmarks
[Figure: Impact of overhead as a function of input volume. Vertical axis 0 to 0.7; six input volumes (1 to 6).]
[Figure: Relative Map throughput as a function of # Mappers. Relative Map CPU time speedup (0 to 25) vs. Number of Maps (0 to 20), with linear fit. Series legend: 0.003351955 0.032258065 0.319148936 1 2.631578947 21.12676056]
[Figure: Tokens per Wall Clock Second (0 to 1600) vs. Number of Maps (0 to 20), with linear fit.]
HAMSTER: Hadoop Multi-signature Search
for Text-based Entity Retrieval
• Core algorithm: String Edit Distance, O(mnk^2)
• Baseline runs at 100 matches per day
• HAMSTER speedup: 33x (5 node speedup) × 60x (Java speedup) ≈ 2000x faster
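For readers unfamiliar with the core algorithm, the following is a minimal sketch of string edit distance and of the all-pairs matching that gives rise to a cost on the order of m × n × k^2 (m source items, n target items, strings of length about k). It uses the standard Levenshtein formulation as an assumption; the slides do not say which edit-distance variant HAMSTER uses, and `closest_source` is a hypothetical helper, not part of HAMSTER.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def closest_source(target: str, sources: list[str]) -> str:
    """Hypothetical helper: the source item nearest to one target item."""
    return min(sources, key=lambda s: edit_distance(s, target))

print(edit_distance("kitten", "sitting"))
print(closest_source("Jon Smith", ["John Smith", "Jane Smyth"]))
```

Because every target is compared against every source, the work splits naturally into independent chunks, which is what makes the problem a good fit for map tasks.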
Source Items | Target Items | Source items per target | Input Size | MAP Records | Cluster Max Map Tasks | Effective Map Tasks | CPU map (secs) | Wall time
34k | 618k | 100 | 4.40GB | 345 | 33 | 33 | 196k | 2h 14 secs
34k | 618k | 50 | 8.8GB | 690 | 40 | 66 | 196k | 1h 47min
34k | 618k | 30 | 14.6GB | 1,149 | 40 | 110 | 199k | 1h 39 min
Leveraging Big Data Global Solutions
Creating Global Big Data solutions
Our goal is to evolve from Big Data algorithms to Big Data Solutions
Example of Advanced Global Solution Matrix
[Figure: solution matrix]
• Rows: Outlier Detection; Multivariate Segmentation; Sequence Matching; Network Analysis; Structured Prediction
• Columns: Customer Action; Contextual Marketing; Risk/Fraud; Clickstream Digital
• Example cell: K-Medoids Clustering
Example: Transactional Time Series
Anomalous Behavior
On Demand Simulation: Generate Branches’ DNA
• Case Scenario: Unusual number of cash advances by 2 tellers.
[Figure panels: Single day fraud; Multi day fraud; Original branch (August)]
Creating Regions of Interest based on On-Demand-Simulation
Minimum-Spanning-Tree based branch association for region of interest generation
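One plausible reading of minimum-spanning-tree based branch association is sketched below: branches are represented as feature vectors, a minimum spanning tree is built over their pairwise distances, and the region of interest around a branch is the set of branches linked to it in the tree. The feature vectors, the Euclidean distance, and the one-hop neighborhood rule are all assumptions made for illustration; the slides do not specify them.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm over Euclidean distances; returns tree edges (i, j)."""
    if not points:
        return []
    # best[j] = (distance to the tree, tree node achieving that distance)
    best = {j: (math.dist(points[0], points[j]), 0)
            for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])  # closest node outside the tree
        _, i = best.pop(j)
        edges.append((i, j))
        for k in best:  # relax remaining nodes against the newly added node
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

def region_of_interest(edges, branch):
    """Assumed rule: the branch plus its direct neighbors in the tree."""
    region = {branch}
    for i, j in edges:
        if i == branch:
            region.add(j)
        elif j == branch:
            region.add(i)
    return region

# Hypothetical branch feature vectors (e.g., simulated activity profiles).
branches = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
tree = minimum_spanning_tree(branches)
print(tree)
print(region_of_interest(tree, 1))
```

In this toy data the outlying fourth branch attaches to the tree by a single long edge, so it stays out of the region of interest around branch 1.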
Multi-day fraud simulation
Original branch
Region of interest
• Numbers shown are randomized indices
Conclusion: Lessons Learned
• One Size does not fit all
• Follow a Hybrid Approach
• Leverage Analytic patterns: Global Solutions
• Big Data is about Parallelization
• The future: expensive Algorithms applied to large datasets
• Global Solutions are the combination of algorithmic building blocks applied to specific business problems
Thank You!