Demystifying Data Science - Society of Actuaries in Ireland · 2018-10-13 · Demystifying Data...

Demystifying Data Science 19th September 2018
Transcript
Page 1: Demystifying Data Science - Society of Actuaries in Ireland · 2018-10-13 · Demystifying Data Science. 34 Python and R •Python is a high level, general purpose programming language

Demystifying Data Science

19th September 2018

Disclaimer

The views expressed in these presentations are those of the presenter(s) and not necessarily of the Society of Actuaries in Ireland

Welcome

• Pedro Ecija Serrano, Chair, Data Analytics Subcommittee

• First of a series of three presentations

Disclaimer: The material, content and views in the following presentation are those of the presenter(s).

• What is Data Science?

• Why has it Grown So Quickly?

• Opportunities and Threats

• Open Source vs Closed Source

• Buzzwords

• Example: Machine Learning Model

• Practical Examples

Demystifying Data Science

What is Data Science?

“Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms”

Source: Wikipedia

What is Data Science?

“Data science is the study of how to make data-driven decisions”

What is Data Science?

The more data you have, the better your decisions should be

Data Science Map

Data Science

Data Science Map: Insurance Industry

Data Scientists

Actuaries Optimal?

Data Storage Costs

Digitalization

Number of Wifi-Connected Devices

Volume of Data

Computer Speeds

Data Science Tools

Machine Learning

Is Data an Asset?

Why is it a Big Deal Now?

Q: Is data an asset?

A: Yes

Q: How can companies extract value from their data?

A: Data Science

Q: Who will actually analyse this data?

A: Data Scientists

Data Science Process

Obtain Data + Develop Plan

Model

Clean + Reformat

Explore Data

Summarise Results

Make Data-Driven Decisions

Traditional Actuarial Process

Obtain Data

Model

Clean + Reformat

Explore / Check

Summarise Results

Make Data-Driven Decisions

[Diagram: tools and data sources attached to the steps above: Excel; Policyholder Database; Other Databases; Database; Excel 1; CSV; Market Data; Results Database; Excel 2; Proprietary Model; Excel; Excel Models; Out-of-Model Adjustments; Summary Spreadsheets; Proprietary Model Reformat; Database; MI / BI]

Data Science Process

Obtain Data

Model

Clean + Reformat

Explore / Check

Summarise Results

Make Data-Driven Decisions

[Diagram: tools and data sources attached to the steps above: Excel; Policyholder Database; Other Databases; Python; CSV; Market Data; Python Automated Summary; Python Audit Trail / Run log; MI / BI; Python]
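As a rough illustration of the Python-based process, the sketch below reads a small extract, cleans it, produces an automated summary and writes a run log at each step. The data, the column names and the `run_pipeline` function are all hypothetical and are not taken from the presentation; only the Python standard library is used.

```python
import csv
import io
import logging
import statistics

# Hypothetical policyholder extract; in practice this would come from a file or database.
RAW_CSV = """policy_id,age,sum_assured
P001,34,100000
P002,,250000
P003,51,not available
P004,45,150000
"""

def run_pipeline(raw_text, log):
    """Obtain, clean and summarise the data, writing each step to the run log."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    log.info("Obtained %d rows", len(rows))

    clean = []
    for row in rows:
        try:
            clean.append({
                "policy_id": row["policy_id"],
                "age": int(row["age"]),
                "sum_assured": float(row["sum_assured"]),
            })
        except ValueError:
            # Record every rejected row so the run can be audited later.
            log.warning("Rejected row %s: %r", row["policy_id"], row)
    log.info("Kept %d rows after cleaning", len(clean))

    summary = {
        "policies": len(clean),
        "mean_age": statistics.mean(r["age"] for r in clean),
        "total_sum_assured": sum(r["sum_assured"] for r in clean),
    }
    log.info("Summary: %s", summary)
    return summary

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
summary = run_pipeline(RAW_CSV, logging.getLogger("run_log"))
```

Because every step is code, the same run can be repeated on next month's extract without manual copying between spreadsheets, and the log records which rows were dropped and why.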

Opportunities for Actuaries (1)

• Streamline your processes using open-source data science tools

• Improve efficiency and reduce time costs

• Reduced risk of manual error

• Spend time on value-added work rather than manual labour

Opportunities for Actuaries (2)

• The ultimate wider field?

• Opportunity to drive revenue growth

• (e.g. using policyholder-level predictive modelling)

• Opportunity to work in different industries

• Powerful new tools to solve real-world problems

• Already familiar with handling data and building complex models

• CDO Roles

• Superstar salaries for top researchers

Source: Indeed.com, November 2017

Opportunities for Actuaries: Chief Data Officers

Source: VisualCapitalist.com: The Rise of the Chief Data Officer

Threats for Actuaries

• Increased competition from data scientists

• Who have strong computer skills

• Who have powerful predictive models

• Strong ability to handle data and extract information from the Company’s data

• Particularly for younger actuaries

Threats

Data Scientists

Actuaries

Threat Mitigation

• Improve data science skills within each actuarial team

• Mainly by improving computer skills and learning about machine learning models

• Gain access to open-source data science tools at work

• Overcome internal challenges to open-source software

• e.g. the IT department might be reluctant to use new software

Opportunities for Companies

• Extract value from their data asset

• Make better data-driven decisions

• Better understanding of risks and opportunities by doing quick, novel analyses of the data

• Streamline operations

Threats for Companies

• New companies could develop massive structural advantages over incumbents?

• E.g. Amazon have massive structural advantages over traditional retailers

Python and R

• Python is a high level, general purpose programming language with readable syntax

• R is a statistical programming language designed by statisticians for statisticians

• Both are widely used for data science

• Both have similar market-leading functionality

Trends

Open-Source

Open-source software:

Users have the ability to:

• Run

• Study

• Modify

• Improve

• Copy

• Distribute to anyone and for any purpose

The Python Data Science Stack

• Programming Language

• Numerical and scientific calculations

• Organising data, merging data, doing calculations

• Graphs

• Big Data

• Machine learning

• Artificial intelligence and ultra-fast calculations

Open Source vs Closed Source

Feature | Open Source | Closed Source
Source Code | Open | Hidden
Redistributable? | Yes | No
Modifiable? | Yes | No
Licence and Subscription Fees? | No | Yes
Documentation, Helpdesk and Tutorials | Online (Google / Stackoverflow) | Provided by Provider (for a fee)
Responsiveness to bugs and market | Quick to respond | Depends on Provider
Version Control Systems | Available | Depends on Provider

Open-Source Advantages

• Fast

• Scalable

• Capable of full automation

• No licensing fees

• Auditability

• Flexibility

• Sustainability

• Easy to find or train developers

• Fast Learning Curve

Open-Source Misconceptions

• Not secure

• Too hard to learn

• No documentation / bad documentation

• Not as good as proprietary software

Closed Source Advantages

• It’s the Standard / Well Known

• Easier for Unskilled Users

• Guaranteed Support (for a fee)

• Managers prefer buying Software as a Service rather than building own systems?

• Warranties and Indemnity Liability

• Unlikely to Become Obsolete?

Closed Source Risks

• Expensive

• Restrictive licences

• Lock-in / Capture

• Time-consuming / Hard to learn

• Management Incentives (Planned obsolescence / cash cow)

• Bankruptcy

• Unknown code quality

• Unknown level of security

• No incentive to provide good documentation

Data Science Process: Buzzwords

Obtain Data

Model

Clean + Reformat

Explore / Check

Summarise Results

Make Data-Driven Decisions

Big Data

Big Data

Big data: data sets that are too big and complex for traditional data processing software

Need to use new software which can distribute the storage and calculations across different machines

Data Science Process

Obtain Data

Model

Clean + Reformat

Explore / Check

Summarise Results

Make Data-Driven Decisions

Exploratory Data Analysis

Exploratory Data Analysis

EDA: Analyzing data sets to find their main characteristics
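A first pass at exploratory data analysis is often just a handful of summary statistics. The sketch below profiles a made-up list of claim amounts using Python's standard `statistics` module; the figures are hypothetical and chosen only to show how a single large claim pulls the mean well above the median.

```python
import statistics

# Hypothetical claim amounts in euro, made up for illustration.
claims = [1200, 950, 4300, 1100, 870, 15000, 1300, 990]

profile = {
    "count": len(claims),
    "min": min(claims),
    "max": max(claims),
    "mean": statistics.mean(claims),
    "median": statistics.median(claims),
    "stdev": statistics.stdev(claims),
}

# Print a one-line-per-statistic profile of the data set.
for name, value in profile.items():
    print(f"{name:>6}: {value:,.1f}")
```

A gap between mean and median like the one here is a prompt to look at the distribution before fitting any model.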

Data Science Process

Obtain Data

Model

Clean + Reformat

Explore / Check

Summarise Results

Make Data-Driven Decisions

Exploratory Data Analysis

Data Mining

Data Mining

Data Mining is the process of finding patterns and relationships in large datasets

Goal = to extract valuable, understandable information from data

Data Science Process

Obtain Data

Model

Clean + Reformat

Explore / Check

Summarise Results

Make Data-Driven Decisions

Business Intelligence and Management Information

Business Intelligence and Management Information

Analyzing data and presenting information to help executives make informed business decisions

Data Science Process

Obtain Data

Model

Clean + Reformat

Explore / Check

Summarise Results

Make Data-Driven Decisions

Statistical Models

Predictive Analytics

Predictive Modelling

Machine Learning

Statistics vs Predictive Analytics vs Machine Learning

Statistics is about data:

• Collection

• Organisation

• Analysis

• Interpretation

• Presentation

Predictive Analytics

Predictive Analytics is a set of statistical techniques that make predictions about future unknown events

For example:

• Data mining

• Traditional predictive models

• Machine learning models

Predictive Modelling

Predictive models are models which make predictions about future unknown events.

• Using current and historical data

• Allowing for relationships among many factors

• Make predictions about every example in the dataset

• These predictions can be used to guide decision making

Predictive Modelling

Two main types:

• Traditional predictive models

• Machine learning models

Traditional Predictive Models

Characteristics of traditional predictive models:

• Explainable and interpretable

• Grounded in maths and statistics

• All parameters derived manually using closed form mathematical solutions or simple algorithms

• Lots of manual effort required to build high accuracy models

Machine Learning Models

Machine learning models are predictive models which have the ability to learn from data without being explicitly programmed

Learning = progressively improving performance on a specific task

Machine Learning Models

Characteristics of machine-learning models:

• Automatic

• May be explainable or a black box

• Grounded in computer science

• Most parameters derived automatically using a machine learning algorithm

• Little manual effort required to build high accuracy models

ML Models

Many possible datasets: Policyholder Datafiles; Claims Datafiles; Time Series Data; Text Files; Pictures; Videos; Audio

Machine Learning Model (Many Different Models)

Many possible predictions: Policy Reserves; Price; Fraud / Not Fraud; Risk of Lapsing: High/Medium/Low; Rating from 1-5

Digital Photos

Source: Openframeworks.cc

• Digital Photos are stored as arrays of numbers
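To make this concrete, the sketch below represents a tiny greyscale picture as a nested list of brightness values. The 4x4 image is invented for illustration; real photos are far larger, and colour photos typically hold three numbers (red, green, blue) per pixel.

```python
# A hypothetical 4x4 greyscale image: each number is a pixel brightness
# (0 = black, 255 = white).
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [255, 255, 0,   0],
    [255, 255, 0,   0],
]

height = len(image)
width = len(image[0])
mean_brightness = sum(sum(row) for row in image) / (height * width)

# Because the picture is just numbers, inverting it is simple arithmetic.
negative = [[255 - pixel for pixel in row] for row in image]
```

Once a picture is in this form, a model can take the numbers as inputs in the same way it would take columns from a policyholder file.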

Digital Audio Files

Source: ch.mathworks.com

• Digital Audio files are stored as a time series of arrays

• Each array contains information on pitch and loudness

Digital Text

Source: ch.mathworks.com

• Can be converted to vectors of numbers

• GloVe

• Word2Vec

• Word Embeddings
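The idea behind word vectors can be sketched with a toy example: each word maps to a list of numbers, and words with related meanings end up with vectors pointing in similar directions. The three-dimensional vectors below are invented by hand for illustration; real embeddings such as Word2Vec or GloVe are learned from large bodies of text and have many more dimensions.

```python
import math

# Toy vectors invented for illustration, not taken from any trained model.
vectors = {
    "claim":   [0.9, 0.1, 0.0],
    "payout":  [0.8, 0.2, 0.1],
    "holiday": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_related = cosine_similarity(vectors["claim"], vectors["payout"])
sim_unrelated = cosine_similarity(vectors["claim"], vectors["holiday"])
```

With these made-up numbers, "claim" scores much closer to "payout" than to "holiday", which is the kind of relationship a downstream model can exploit.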

General Examples of Predictive Models

Self-Driving Cars

Speech-to-text

Recommender Systems

Game Playing

Reducing Electricity Costs

Machine translation

Chatbots

Text-to-Speech

Fraud Detection

Credit Risk

Pricing

Customer Retention

Proxy Models

Sales Forecasting

Anti-Money Laundering

Call-Centre Routing

Sentiment Analysis

Geographic Analysis

Analysing Satellite Photos

Reading X-rays

Example: Machine Translation as Predictive Model

“Je Suis” → Predictive Model → “I am”

• The model tries to predict what words a human translator would use

Example: Captioning

Red dress with White Spots and Black Belt

Red sweater with white stripes on arms and

Gingerbread man with Christmas Hat

Train Model

• The model takes the picture and predicts what the caption should be

Example: Self-Driving Cars

Source: https://clipartxtras.com/

Good Driving

Bad Driving

Train Model

Model predicts what a good driver would do in the current circumstances

Example: Fraud Detection

Claim isn’t Fraudulent

Claim is Fraudulent

Train Model

The model will predict whether each incoming claim is fraudulent or non-fraudulent

Practical Example: Traditional Modelling and Machine Learning

How much is a 1000 square foot house?

Eyeball approach:

Around €90k

Linear Regression Predictive Model

• Linear Regression Model:

• Price = €101,955

• Slope = 108

• Intercept = -5,700

• MSE = 258 million

• But how do you find the slope and intercept?

Approach 1: Normal Equation

Problem with normal equation:

• Only works if XᵀX is invertible

• Doesn’t work on other models

• Doesn’t work well on large datasets
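For a straight line, the normal equation (XᵀX)θ = Xᵀy reduces to a 2x2 system that can be solved by hand. The sketch below does this in plain Python on made-up house data (the sizes and prices are hypothetical, not the data set from the slides) and shows where the invertibility condition bites.

```python
# Hypothetical data: house sizes (square feet) and prices (euro), made up for illustration.
sizes = [600, 800, 1000, 1200, 1400]
prices = [60000, 80000, 105000, 120000, 150000]

def fit_normal_equation(xs, ys):
    """Solve (X^T X) theta = X^T y for a straight line, where X has columns [1, x]."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # X^T X = [[n, sum_x], [sum_x, sum_xx]]; it is invertible only if
    # its determinant is non-zero.
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("X^T X is not invertible (e.g. every house has the same size)")

    intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
    slope = (n * sum_xy - sum_x * sum_y) / det
    return slope, intercept

slope, intercept = fit_normal_equation(sizes, prices)
mse = sum((slope * x + intercept - y) ** 2 for x, y in zip(sizes, prices)) / len(sizes)
```

On this toy data the fit is a slope of 110 and an intercept of -7,000. With many features the same calculation needs a full matrix inversion, which is where the cost on large datasets comes from.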

Approach 2: Gridsearch

• Problem with gridsearch: Very inefficient

• Only works for models with a handful of parameters
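A grid search simply tries every combination of candidate parameter values and keeps the one with the lowest MSE. The sketch below uses made-up house data and hand-picked candidate ranges; both are assumptions for illustration, not values from the slides.

```python
# Hypothetical data: house sizes (square feet) and prices (euro), made up for illustration.
sizes = [600, 800, 1000, 1200, 1400]
prices = [60000, 80000, 105000, 120000, 150000]

def mse(slope, intercept):
    """Mean squared error of the line price = slope * size + intercept."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(sizes, prices)) / len(sizes)

# Candidate values chosen by hand; the ranges and step sizes are assumptions.
slopes = range(50, 201, 10)               # 50, 60, ..., 200
intercepts = range(-20000, 20001, 1000)   # -20000, -19000, ..., 20000

# Evaluate every (slope, intercept) pair and keep the one with the smallest MSE.
best_mse, best_slope, best_intercept = min(
    (mse(s, i), s, i) for s in slopes for i in intercepts
)
evaluations = len(slopes) * len(intercepts)
```

Two parameters already need 656 evaluations here. With k candidate values for each of p parameters the count is k to the power p, which is why the approach only suits models with a handful of parameters.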

Approach 3: Stochastic Gradient Descent

1. You don’t know the slope and intercept, so randomly choose them

2. Therefore you start at a random point

3. Calculate the slope of the MSE loss surface at that point

4. Take a step downhill

5. Repeat 3 and 4 until you reach the lowest point on the loss surface
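The steps above can be sketched in a few lines of Python. This is a minimal illustration on made-up house data, not the code shown in the presentation: the data, the `sgd_fit` function, the learning rate, the batch size and the standardisation step are all assumptions. The "stochastic" part is that each step estimates the slope of the loss surface from a random subset of the data; setting `batch_size` to the full data set turns it into ordinary gradient descent.

```python
import random

# Hypothetical data: house sizes (square feet) and prices (euro), made up for illustration.
sizes = [600, 800, 1000, 1200, 1400]
prices = [60000, 80000, 105000, 120000, 150000]

def sgd_fit(xs, ys, batch_size=2, learning_rate=0.05, steps=2000, seed=0):
    """Fit price = slope * size + intercept by stochastic gradient descent on the MSE."""
    rng = random.Random(seed)
    n = len(xs)

    # Standardise both variables so one learning rate suits both parameters
    # (an assumption of this sketch; the slides do not say how the data was scaled).
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    x_std = (sum((x - x_mean) ** 2 for x in xs) / n) ** 0.5
    y_std = (sum((y - y_mean) ** 2 for y in ys) / n) ** 0.5
    zs = [(x - x_mean) / x_std for x in xs]
    ts = [(y - y_mean) / y_std for y in ys]

    # Steps 1-2: choose the parameters randomly, i.e. start at a random point.
    b, a = rng.uniform(-1, 1), rng.uniform(-1, 1)

    # Step 5: repeat (a fixed number of steps here, rather than a convergence test).
    for _ in range(steps):
        # "Stochastic": estimate the gradient from a random subset of the data.
        batch = rng.sample(range(n), batch_size)
        # Step 3: slope of the MSE loss surface with respect to each parameter.
        grad_b = sum(2 * (b * zs[i] + a - ts[i]) * zs[i] for i in batch) / batch_size
        grad_a = sum(2 * (b * zs[i] + a - ts[i]) for i in batch) / batch_size
        # Step 4: take a step downhill.
        b -= learning_rate * grad_b
        a -= learning_rate * grad_a

    # Convert the fitted line back to the original units.
    slope = b * y_std / x_std
    intercept = y_mean + a * y_std - slope * x_mean
    return slope, intercept

# Stochastic run: lands close to, but not exactly on, the lowest point.
slope, intercept = sgd_fit(sizes, prices)
# Full-batch run: plain gradient descent, which settles on the least-squares line.
gd_slope, gd_intercept = sgd_fit(sizes, prices, batch_size=len(sizes))
```

On this toy data the full-batch run converges to a slope of 110 and an intercept of -7,000, the least-squares line. With a constant learning rate the stochastic run keeps jittering around that point; shrinking the learning rate over time is one common way to tighten it.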

SGD gives the same answer as the Normal Equation in this example

SGD: Python Code

SGD: Cubic Polynomial

SGD: Exponential Model

SGD: Exponential Curve

SGD: Exponential Plus Cubic Model

SGD: Exponential Plus Cubic Model

SGD: Sine Regression

SGD: Python Code

SGD: Mathematical Background

SGD: Python Code

Benefits of SGD

• It is straightforward to calibrate predictive models

• You can build models with thousands of parameters

• Can work on huge data sets

• Can achieve human-level accuracy on some tasks

• You can build models for all different types of data

– Pictures

– Videos

– Audio

– Text

– Policyholder data files

Benefits of SGD

• It works very well in practice

– You can choose models which are a good fit to the data

– Rather than choosing models which you are able to fit to the data
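
As an illustration of choosing the model to suit the data, the sketch below fits an exponential curve y = a * exp(b * x), which a straight-line Normal Equation cannot fit directly. The data, the starting point, the learning rate and the number of steps are assumptions, and full-batch gradient descent is used to keep the example deterministic.

```python
import math
import random

def fit_exponential(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = a * exp(b * x) by gradient descent on the MSE."""
    n = len(xs)
    a, b = 1.0, 0.0                       # arbitrary starting point
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            curve = math.exp(b * x)
            error = a * curve - y
            grad_a += 2.0 * error * curve / n          # d(MSE)/da
            grad_b += 2.0 * error * a * x * curve / n  # d(MSE)/db
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Made-up data generated from a = 1.5, b = 0.8 plus a little noise.
rng = random.Random(0)
xs = [i / 49 for i in range(50)]
ys = [1.5 * math.exp(0.8 * x) + rng.gauss(0.0, 0.02) for x in xs]

a, b = fit_exponential(xs, ys)
```

Only the model formula and its two gradient lines are specific to the exponential curve. Swapping in a different model means changing those lines and nothing else.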

Machine Learning Models

Neural Network Models

Machine Learning Models in Scikit-Learn

Demystifying Data Science

• What is Data Science?

• Why has it Grown So Quickly?

• Opportunities and Threats

• Open Source vs Closed Source

• Buzzwords

• Example: Machine Learning Model

• Practical Examples

Practical Examples – Getting started

• Big Data

– More Data

– More Computing Power

– More Analysis

• Computers in Actuarial Work

• A Word on Terminology

• Association Rule Mining

• Unsupervised Learning

The role of Computers in Actuarial Work

• Mainframe Systems

• Valuation Software

• Spreadsheets

• A precise answer…

• ...given assumptions

• Computers may be able to ‘solve’ problems

• Or at least give valuable insights

Example 1 - Four Colour problem solved

• Proved in 1976

• First major theorem proved by computer

Example 2 - Fermat’s Last Theorem solved (almost)

• $x^n + y^n = z^n$ has no positive integer solutions for $n > 2$

• Verified by computer for all prime exponents $n$ up to 4,000,000

Correlation and Causation!

• Results always need to be interpreted!

http://tylervigen.com/spurious-correlations

A word on Terminology

• Actuaries didn’t get here first!

• P = A / ä

– Periodic Policy Amount = Bounded Risk Benefit / Contribution Vector

• Terminology not intuitive...

• ...concepts are

What we’re looking to cover

This presentation

• Association Rule Mining (Amazon, Tesco)

• Unsupervised Learning

– Letting the data tell its own story

Next presentation

• Supervised Learning

– Where we propose a model

Final presentation

• Deep Learning (Neural Nets)

Association Rule Mining 1

• Purchasing datasets

[Table: one row per customer (Customer 1 ... Customer n), one column per product (Bread, Milk, Eggs, ..., Yoghurt, Tuna, Fruit); an x marks each product purchased]

• Very very sparse

• Think of Amazon

Association Rule Mining 2

• Question of interest: which items occur together?

• Because a purchasing dataset is very sparse, the ideas are illustrated with a medical dataset

– 240 Patients

– 6 Symptoms

Association Rule Mining Dataset

• Illustrative dataset

[Table: one row per patient (Patient 1 ... Patient 240), one column per symptom (1 to 6); an x marks each symptom present]

Symptom   1     2     3     4     5     6

Total     19    157   55    85    58    181

• Less sparse

Association Rule Mining Investigation

• Which symptoms occur together?

• Three key concepts...

For symptoms A & B

1) Support = P(A ∩ B) = P(A,B)

2) Confidence = P(B|A) = P(A,B) / P(A)

3) Lift = P(A,B) / [P(A) × P(B)]
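
The three measures can be computed directly from their definitions. The sketch below uses a hypothetical set of eight patient records (not the 240-patient dataset from the slides), each holding the symptom numbers present for that patient.

```python
def rule_metrics(records, a, b):
    """Support, confidence and lift for the rule 'A implies B'.

    records: list of sets, one per patient, holding the symptoms present.
    """
    n = len(records)
    p_a = sum(1 for r in records if a in r) / n
    p_b = sum(1 for r in records if b in r) / n
    p_ab = sum(1 for r in records if a in r and b in r) / n
    support = p_ab                    # P(A,B)
    confidence = p_ab / p_a           # P(B|A)
    lift = p_ab / (p_a * p_b)         # P(A,B) / [P(A) x P(B)]
    return support, confidence, lift

# Hypothetical records: symptoms present for each of 8 patients.
records = [{1, 2}, {1, 2}, {1, 3}, {2}, {3}, {1, 2, 3}, set(), {2}]

support, confidence, lift = rule_metrics(records, a=1, b=2)
# support = 3/8 = 0.375, confidence = 0.375 / 0.5 = 0.75, lift = 0.375 / (0.5 * 0.625) = 1.2
```

A lift above 1 means the two symptoms occur together more often than they would if they were independent. A lift of exactly 1 indicates no association.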

Association Rule Mining Result 1

Association Rule Mining Result 2

Association Rule Summary

• Concepts are not difficult

• Terminology and visualisation can be confusing at first

• Basic analysis can be enhanced by adding bounds and standardising results

• Very sophisticated algorithms can be developed but speed is an issue

What we’re looking to cover, a reminder

Unsupervised Learning

No y value, multiple x values

Supervised Learning

We do have a y value & multiple x values

Unsupervised Learning 1

• Old Faithful Geyser

• 272 data points on Waiting & Eruption Times

Unsupervised Learning 2

• Old Faithful Geyser

• 272 data points on Waiting & Eruption Times

Unsupervised Learning 3

Unsupervised Learning 4

Unsupervised Learning 5

‘Elbow’
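
The slide shows only the label ‘Elbow’. Assuming it refers to the usual elbow method for choosing the number of clusters in k-means (the slides do not name the algorithm), a minimal sketch looks like this. The two-cluster data is synthetic and is not the Old Faithful dataset.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean_point(cluster):
    """Centroid of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans_wcss(points, k, rng, iterations=20):
    """Run k-means once and return the within-cluster sum of squares (WCSS)."""
    centres = rng.sample(points, k)            # start from k random data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to its nearest centre
            nearest = min(range(k), key=lambda j: sq_dist(p, centres[j]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [mean_point(c) if c else centres[j] for j, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centres) for p in points)

# Synthetic data: two well-separated groups of 100 points each.
rng = random.Random(0)
points = ([(rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)) for _ in range(100)] +
          [(rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)) for _ in range(100)])

# Best of 5 random restarts for each k from 1 to 5.
wcss = {k: min(kmeans_wcss(points, k, rng) for _ in range(5)) for k in range(1, 6)}
```

Plotting `wcss` against k gives a curve that drops sharply from k = 1 to k = 2 and then flattens. The bend in that curve is the elbow, and it suggests two clusters for this data.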

Unsupervised Learning 6

• Resulting Segmentation

• Can be exploratory or detective

Another Grouping (Clustering) Example 1

Another Grouping (Clustering) Example 2

Another Grouping (Clustering) Example 3

Another Grouping (Clustering) Example 4

• Accuracy 88%

• ‘First pass’ result

• Readily implementable

• Methodology generalisable to n dimensions

• Where could this give more insight?

– Segmentation (Distribution Channel)

– Any homogeneous group selection

– Deconstructing portfolios

– Model point building

– Outlier identification (Fraud etc.)

– Trend analysis

Deconstructing Trend Analysis 1

http://www.rdatamining.com/

Deconstructing Trend Analysis 2

• Constructed dataset

• 6 x 100 sub-series

Deconstructing Trend Analysis 3

Deconstructing Trend Analysis 4

                   Predicted Group

Actual Group     1     2     3     4     5     6

     1          97     3     0     0     0     0

     2           1    99     0     0     0     0

     3           0     0    81     0    19     0

     4           0     0     0    63     0    37

     5           0     0    16     0    84     0

     6           0     0     0     1     0    99

• Accuracy 87%!
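
The accuracy figure can be checked from the table: it is the share of sub-series on the diagonal, where the predicted group equals the actual group. A short sketch, reading rows as actual groups and columns as predicted groups:

```python
# Confusion matrix from the slide: rows = actual group, columns = predicted group.
confusion = [
    [97,  3,  0,  0,  0,  0],
    [ 1, 99,  0,  0,  0,  0],
    [ 0,  0, 81,  0, 19,  0],
    [ 0,  0,  0, 63,  0, 37],
    [ 0,  0, 16,  0, 84,  0],
    [ 0,  0,  0,  1,  0, 99],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))   # diagonal entries
total = sum(sum(row) for row in confusion)                      # all 6 x 100 sub-series
accuracy = correct / total                                      # 523 / 600

# Share of each actual group that was assigned to the right cluster.
per_group = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]
```

The per-group figures show where the errors sit: group 4 is the weakest at 63%, with the remainder assigned to group 6.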

Deconstructing Trend Analysis 5

• Accuracy 87%!!!

• Where could this give more insight?

– Claim rates

– Seasonal / Selection Effects

– Investment performance analysis

– Stochastic model analysis

– Trend analysis

Unsupervised Learning Summary

• Can help identify patterns in data

• Can help identify homogeneous groups

• Using computer power

• Relatively unsophisticated

• Possible to get answers quickly

• Perfect insight not possible

• Improved understanding may result

Any Questions?

• What is Data Science?

• Why has it Grown So Quickly?

• Opportunities and Threats

• Open Source vs Closed Source

• Buzzwords

• Example: Machine Learning Model

• Practical Examples

