Advanced Analytics in Banking

Juan M. Huerta

Global Decision Management

VP Advanced Analytics

Citibank

I will talk about…

• Big Data Adoption process at Citi

• Realizing the Technical Value of Big Data

• Global Solutions

140 countries

200 million accounts

Citi: A Customer Centered Organization

As a customer-centered bank, the goal of our Big Data strategy is to shift the focus from independent vertical silos to Common Horizontal Solutions centered on Citi’s 200 million customer accounts

Big Data Adoption Stakeholders

• Lines of Business

• Strategy & Decision Management Organizations: cross LOB & Geo, global

• Data innovation Office: Governance & Regulatory

• CitiData – Big Data & Analytics Engineering

Big Data Adoption Roadmap

Adoption will not occur at once. The level of capability maturity across the organization will vary significantly.

In theory, we think in terms of the Staged Competencies of a Big Data Maturity Model.

In practice, a hybrid process, which fits the level of maturity of participants, is needed.

• Common Data

• Common Analytic Platform

• Common Tools & Techniques

• Common Solutions

• Common Focus

• Strategy

Big Data Adoption Hybrid Participation Model

• Novice: Proof of Concept

• Expert: R&D Environment

• Shadowed

End-to-end Analytic Process for a POC Project

This is one component of the hybrid model

Ideas and Hypotheses

Information Asset Inventory Navigator (“IAIN”)

• Pipeline of ideas to use data for competitive advantage

• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis

• Preliminary assessment for business value, data safekeeping and alignment to business practices

Data Transformation & Provisioning

• Transformation rules executed to normalize and conform production data

• Conformed data set made available in production environment

Production Model Development

• Develop scalable, productizable analytics

Model Deployment

• Exploit insights and analyses across the enterprise to maximize value

• Models measured for quality / usage

• Formal approval process through Business Steering Committee based on understanding expected use of production data

R&D process

R&D Project Approval

Product Approval

Engineering / Production process

Analytics Knowledge Management

• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis

Data Set Preparation & Provisioning

• Basic preparation of data set (e.g., consolidation, conformation)

• Permission-based provisioning of data set into a Big Data Analytics environment

Analytics Execution

• Advanced analytic tools mine business insight from large volumes of data

• Data scientist peers review model findings and results

Analytics Peer Review

Data Acquisition

• Where necessary, acquire new data sets to support R&D project

Advanced Global Solutions

• A global solution is a tested algorithm or analytic model that carries out a particular business analysis and which is leveraged at a global scale

• A big data global solution enables the interplay of complex algorithms and large datasets

• When a global solution is built upon big data approaches, a delivery roadmap should be considered

• In the exploratory process, a Global Solution is developed in the Innovation R/D environment and validated through a POC process

• Alignment with Innovation, UAT, PRD environments

Technical Value of Big Data:

Benchmarks and Analysis

The Boom Driving Big Data is Technological

Heebyung Koh, Christopher L. Magee. A functional approach for studying technological progress: Extension to energy technology. Technological Forecasting and Social Change, Volume 75, Issue 6, July 2008, Pages 735–758

The Quadrant Of Analytic Opportunity

Run Time is affected by Data Size and Algorithmic Complexity

[Figure: quadrant chart of Data Size versus Algorithmic Complexity]

• Axes: Algorithmic Complexity; Data Size (ticks from 10^5 to 10^10)

• Labels along the Data Size axis: Database Interaction; Mtg + Cards + Banking; Accounts Transaction features; Accounts Transactions; Branches Transactions; Accounts Summary Stats.; Employees Summary Stats.; GL-GOCS GL-Entries; Branches Summary Stats.

• Algorithm families: Traditional Statistical; Machine Learning; Big Data/Pattern Mining

• Algorithms plotted: Sequence Mining; Predictive filtering; Latent Dirichlet Allocation; HMM Baum-Welch, O(ns nf nt); CART, O(nf ns log ns); Iterative SVD-CF; K-means; Logistic Regression; PCA; PageRank; Self-Org. Maps; Neural Nets; Collaborative Filtering (CF); Vector based Approaches; HMM; Conditional Random Fields; Support Vector Machines
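
The chart annotates two algorithms with complexity expressions, O(ns nf nt) for HMM Baum-Welch and O(nf ns log ns) for CART. The sketch below gives a rough sense of how such expressions scale between the smallest and largest data sizes on the chart. It assumes that ns is the number of samples, nf the number of features, and nt a third size factor; the slide does not define these symbols. The function names are hypothetical and constant factors are ignored.

```python
import math


def cart_cost(n_samples, n_features):
    """Operation count proportional to nf * ns * log(ns) (constants ignored)."""
    return n_features * n_samples * math.log2(n_samples)


def baum_welch_cost(n_samples, n_features, n_t):
    """Operation count proportional to ns * nf * nt (constants ignored)."""
    return n_samples * n_features * n_t


def growth(cost_fn, small_n, large_n, **fixed):
    """How many times more work the larger data set needs, other factors fixed."""
    return cost_fn(large_n, **fixed) / cost_fn(small_n, **fixed)


# Compare the smallest and largest data sizes shown on the chart (10^5 vs 10^10).
# The values for n_features and n_t are illustrative assumptions.
cart_growth = growth(cart_cost, 10**5, 10**10, n_features=20)
hmm_growth = growth(baum_welch_cost, 10**5, 10**10, n_features=20, n_t=50)
print(f"CART work grows by a factor of about {cart_growth:.3g}")
print(f"Baum-Welch work grows by a factor of about {hmm_growth:.3g}")
```

Under these assumptions, the linear expression grows in step with the data, while the extra log factor makes the CART expression grow somewhat faster than the data itself.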

Breaking down the gains of P13n: A Controlled Incremental Benchmark on a Workstation grade processor (x500)

Implemented an incremental-SVD (Netflix Cup) predictive model that runs on midsize datasets…

• x30: Compiled code (vs. interpreted)

• x4: In memory (vs. disk access)

• x3.12: Multithread (vs. single thread)

• x1.3: Workstation grade processor
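
The headline figure of roughly x500 is consistent with multiplying the four individual gains listed above. A minimal sketch of that arithmetic follows, assuming the gains are independent and compound multiplicatively; the dictionary keys are simply labels taken from the bullets.

```python
import math

# Individual gains as listed on the slide.
speedups = {
    "compiled code (vs. interpreted)": 30.0,
    "in memory (vs. disk access)": 4.0,
    "multithread (vs. single thread)": 3.12,
    "workstation grade processor": 1.3,
}

# Assumption: the gains compound multiplicatively.
total = math.prod(speedups.values())
print(f"combined speedup: x{total:.2f}")

# Share of each factor on a log scale, so that the shares sum to 1.
shares = {name: math.log(gain) / math.log(total) for name, gain in speedups.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the total gain (log scale)")
```

The product is about x487, which the slide rounds to x500. On a log scale, the move from interpreted to compiled code accounts for more than half of the total gain.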

Basic Map Reduce Benchmarks

[Figure: Impact of overhead as a function of input volume. Axis ticks run from 0 to 0.7 and from 1 to 6.]

[Figure: Relative Map throughput as a function of # Mappers. Horizontal axis: Number of Maps (0 to 20). Vertical axis: Relative Map CPU time speedup (0 to 25), with a linear trend line. Series label as extracted: 0.003351955 0.032258065 0.319148936 1 2.631578947 21.12676056]

[Figure: Tokens per wall clock second as a function of the number of maps. Horizontal axis: Number of Maps (0 to 20). Vertical axis: Tokens per Wall Clock Second (0 to 1600), with a linear trend line.]

HAMSTER: Hadoop Multi-signature Search for Text-based Entity Retrieval

• Core algorithm: String Edit Distance, O(mnk^2)

• Baseline runs at 100 matches per day

• HAMSTER speedup: 33x (5-node speedup) × 60x (Java speedup) ≈ 2000x faster
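
For readers unfamiliar with the core algorithm, the sketch below shows a plain Levenshtein edit distance computed with dynamic programming, plus a hypothetical `best_match` helper that picks the closest candidate name. This is the textbook O(mn) version for a single pair of strings, not the HAMSTER implementation; the slide's O(mnk^2) bound and the multi-signature search are not reproduced here, and the example names are made up.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming table."""
    if len(source) < len(target):
        # The distance is symmetric; keep the shorter string in the inner loop.
        source, target = target, source
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def best_match(query: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest edit distance to the query."""
    return min(candidates, key=lambda candidate: edit_distance(query, candidate))


# Hypothetical entity names, for illustration only.
print(edit_distance("kitten", "sitting"))
print(best_match("Citibnak", ["Citibank", "Citigroup", "Banamex"]))
```

Each source item has to be compared against many target items, so the pairwise comparisons are independent of one another, which is what makes the workload a natural fit for map tasks.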

Source items | Target items | Source items per target | Input size | Map records | Cluster max map tasks | Effective map tasks | CPU map (secs) | Wall time

34k | 618k | 100 | 4.40GB | 345 | 33 | 33 | 196k | 2h 14 secs

34k | 618k | 50 | 8.8GB | 690 | 40 | 66 | 196k | 1h 47min

34k | 618k | 30 | 14.6GB | 1,149 | 40 | 110 | 199k | 1h 39 min

Leveraging Big Data Global Solutions

Creating Global Big Data solutions

Our goal is to evolve from Big Data algorithms to Big Data Solutions

Example of Advanced Global Solution Matrix

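[Figure: matrix of analytic techniques against business domains]

• Rows (analytic techniques): Outlier Detection; Multivariate Segmentation; Sequence Matching; Network Analysis; Structured Prediction

• Columns (business domains): Customer Action; Contextual Marketing; Risk/Fraud; Clickstream Digital

• Example cell entry: K-Medoids Clustering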

Example: Transactional Time Series

Anomalous Behavior

On Demand Simulation: Generate Branches’ DNA

• Case Scenario: Unusual number of cash advances by 2 tellers.
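
A case like this can be illustrated with a simple outlier test on per-teller counts. The sketch below uses the modified z-score (median and median absolute deviation, with the conventional 0.6745 scaling and 3.5 cutoff). It is an assumed stand-in and not the method behind the slide; the teller IDs and counts are invented for illustration.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-scores based on the median and median absolute deviation."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # All values are (nearly) identical; nothing stands out.
        return [0.0 for _ in values]
    return [0.6745 * (v - center) / mad for v in values]


def flag_outliers(counts_by_teller, threshold=3.5):
    """Return tellers whose count is unusually high relative to their peers."""
    tellers = list(counts_by_teller)
    scores = robust_z_scores([counts_by_teller[t] for t in tellers])
    return [t for t, z in zip(tellers, scores) if z > threshold]


# Hypothetical daily cash-advance counts per teller at one branch.
cash_advances = {
    "T01": 3, "T02": 4, "T03": 2, "T04": 5,
    "T05": 3, "T06": 21, "T07": 4, "T08": 19,
}
flagged = flag_outliers(cash_advances)
print(flagged)
```

The median-based score is used here because a mean and standard deviation would themselves be inflated by the two unusual tellers, which can hide the very behavior being looked for.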

[Figure panels: Original branch (August); Single day fraud; Multi day fraud]

Creating Regions of Interest based on On-Demand Simulation

Minimum-Spanning-Tree based branch association for region of interest generation

[Figure labels: Multi-day fraud simulation; Original branch; Region of interest]

• Numbers shown are randomized indices
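
One way to read "minimum-spanning-tree based branch association" is sketched below: build a minimum spanning tree over branch feature vectors with Prim's algorithm, drop the long edges, and take the branches still connected to a seed branch as its region of interest. The slides do not say how the association is actually computed, so the distance measure, the `max_edge` cutoff, the function names, and the branch coordinates are all assumptions.

```python
import heapq
import math


def minimum_spanning_tree(points):
    """Prim's algorithm over named points; returns (from, to, distance) edges."""
    names = list(points)
    if not names:
        return []
    start = names[0]
    visited = {start}
    heap = [(math.dist(points[start], points[n]), start, n) for n in names[1:]]
    heapq.heapify(heap)
    edges = []
    while heap and len(visited) < len(names):
        distance, a, b = heapq.heappop(heap)
        if b in visited:
            continue  # stale entry: b was already reached by a shorter edge
        visited.add(b)
        edges.append((a, b, distance))
        for other in names:
            if other not in visited:
                heapq.heappush(heap, (math.dist(points[b], points[other]), b, other))
    return edges


def region_of_interest(edges, seed, max_edge):
    """Branches connected to the seed through tree edges no longer than max_edge."""
    adjacency = {}
    for a, b, distance in edges:
        if distance <= max_edge:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    region, stack = {seed}, [seed]
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in region:
                region.add(neighbor)
                stack.append(neighbor)
    return region


# Hypothetical two-dimensional feature vectors per branch.
branches = {
    "B1": (0.0, 0.0), "B2": (1.0, 0.0), "B3": (1.0, 1.0),
    "B4": (5.0, 5.0), "B5": (6.0, 5.0),
}
tree = minimum_spanning_tree(branches)
print(region_of_interest(tree, "B4", max_edge=2.0))
```

Cutting the long edges of a spanning tree is equivalent to single-linkage clustering, so the region for a flagged branch is simply the cluster of branches that behave most like it.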

Conclusion: Lessons Learned

• One Size does not fit all

• Follow a Hybrid Approach

• Leverage Analytic patterns: Global Solutions

• Big Data is about Parallelization

• The future: expensive Algorithms applied to large datasets

• Global Solutions are the combination of algorithmic building blocks applied to specific business problems

Thank You!
