Date posted: 16-Apr-2017
Category: Data & Analytics
Uploaded by: gramener
1
Exploratory Data Analysis
Kathirmani Sukumar
Data Scientist @ Gramener
2
How do I start doing analysis?
3
Exploratory Data Analysis might help you…!!!
4
CASE STUDIES
5
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons.
We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is our inexperience with the tools and analyses needed to identify such patterns.”
ENERGY UTILITY
6
AN ENERGY UTILITY DETECTED BILLING FRAUD
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.
Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific usage level (in units, or kWh).
Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.
An energy utility (with over 50 million subscribers) had 10 years worth of customer billing data available.
Most fraud detection software failed to load the data, and sampled data revealed little or no insight.
This can happen in one of two ways.
First, people may be monitoring their usage very carefully, and turning off their lights and fans the instant their usage hits the slab boundary.
Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that it stays exactly at the slab boundary, giving them the advantage of a lower price.
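One way to check for this kind of clustering is to compare the count at each slab boundary with the counts at nearby usage levels. The sketch below is only an illustration: the boundaries (100 and 200 units), the window, the threshold ratio and the readings are all made-up assumptions, not the utility's data or method.

```python
from collections import Counter

def boundary_spikes(readings, boundaries, window=5, ratio=2.0):
    """Return {boundary: count / neighbour average} for every boundary whose
    count is at least `ratio` times the average count of the usage levels
    within `window` units on either side."""
    counts = Counter(readings)
    flagged = {}
    for b in boundaries:
        neighbours = [counts.get(b + d, 0)
                      for d in range(-window, window + 1) if d != 0]
        baseline = sum(neighbours) / len(neighbours)
        if baseline > 0 and counts.get(b, 0) >= ratio * baseline:
            flagged[b] = counts[b] / baseline
    return flagged

# Hypothetical readings: 10 customers at each level from 90 to 110 units,
# plus 40 extra customers sitting exactly on the 100-unit boundary.
readings = list(range(90, 111)) * 10 + [100] * 40
print(boundary_spikes(readings, boundaries=[100, 200]))
```

A boundary with no readings around it (200 here) is skipped rather than flagged, since there is no baseline to compare against.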
7
PREDICTING MARKS
“ What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?”
EDUCATION
8
TN CLASS X: ENGLISH
[Chart: distribution of marks; x-axis 0 to 99, y-axis 0 to 40,000]
9
TN CLASS X: SOCIAL SCIENCE
[Chart: distribution of marks; x-axis 0 to 99, y-axis 0 to 40,000]
10
TN CLASS X: LANGUAGE
[Chart: distribution of marks; x-axis 0 to 99, y-axis 0 to 40,000]
11
TN CLASS X: SCIENCE
[Chart: distribution of marks; x-axis 0 to 99, y-axis 0 to 40,000]
12
TN CLASS X: MATHEMATICS
[Chart: distribution of marks; x-axis 0 to 99, y-axis 0 to 40,000]
13
ICSE 2013 CLASS XII: TOTAL MARKS
14
CBSE 2013 CLASS XII: ENGLISH MARKS
15
Based on the results of the 20 lakh students who took the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.
June-borns score the lowest
The marks shoot up for Aug-borns
… and peak for Sep-borns
120 marks out of 1,200 explainable by month of birth
An identical pattern was observed in 2009 and 2010…
… and across districts, gender, subjects, and class X & XII.
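A comparison like this boils down to grouping students by birth month and averaging their totals. The sketch below shows that step on a handful of invented (month, marks) records. The numbers are placeholders for illustration only and do not reproduce the Tamil Nadu results.

```python
from collections import defaultdict

def mean_marks_by_month(records):
    """records: iterable of (birth_month, total_marks).
    Returns {birth_month: mean marks}, ordered by month."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of marks, students]
    for month, marks in records:
        totals[month][0] += marks
        totals[month][1] += 1
    return {m: s / n for m, (s, n) in sorted(totals.items())}

# Hypothetical records: (birth month, total marks out of 1,200)
records = [(6, 700), (6, 710), (8, 760), (9, 790), (9, 800)]
means = mean_marks_by_month(records)
spread = max(means.values()) - min(means.values())

for month, mean in means.items():
    print(f"month {month}: {mean:.1f}")
print(f"gap between best and worst month: {spread:.1f}")
```

Repeating the same grouping per year, district, gender or subject is how one would check whether the pattern holds across those cuts.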
“It’s simply that in Canada the eligibility cut-off for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.”
-- Malcolm Gladwell, Outliers
16
LET’S LOOK AT 15 YEARS OF US BIRTH DATA
This is a dataset (1975 – 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.
For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the C-section option to move dates?
• Is there any day of the month that has unusually high or low births?
• Are there any months with relatively high or low births?
Very high births in September. But this is fairly well known. Most conceptions happen during the winter holiday season.
Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day.
Most people prefer not to have children on the 13th of any month, given that it’s an unlucky day.
Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular.
[Chart legend: more births / fewer births, on average, for each day of the year (from 1975 to 1990)]
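The "more births / fewer births on average, for each day of the year" view can be approximated by averaging births per calendar day across years and dividing by the overall daily average. The sketch below assumes daily counts are available as (date, births) pairs. The six data points are invented for illustration and are not taken from the US dataset.

```python
from collections import defaultdict
from datetime import date

def relative_births_by_day(daily_births):
    """Map (month, day) to the average births on that calendar day divided by
    the overall daily average. Values above 1 mean more births than usual."""
    totals = defaultdict(int)
    years_seen = defaultdict(int)
    for day, births in daily_births:
        key = (day.month, day.day)
        totals[key] += births
        years_seen[key] += 1
    overall = sum(totals.values()) / sum(years_seen.values())
    return {key: (totals[key] / years_seen[key]) / overall
            for key in sorted(totals)}

# Hypothetical daily counts for three calendar days over two years
daily_births = [
    (date(1975, 9, 12), 130), (date(1976, 9, 12), 130),
    (date(1975, 9, 13), 100), (date(1976, 9, 13), 100),
    (date(1975, 12, 25), 70), (date(1976, 12, 25), 70),
]
ratios = relative_births_by_day(daily_births)
for (month, day), ratio in ratios.items():
    print(f"{month:02d}-{day:02d}: {ratio:.2f}")
```

Laying these ratios out as a month-by-day grid and colouring each cell gives a chart of the kind described on this slide.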
17
THE PATTERN IN INDIA IS QUITE DIFFERENT
This is a birth date dataset obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns.
For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
Very few children are born from August onwards. Most births are concentrated in the first half of the year.
We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month, that is, round-numbered dates.
Such round-numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.
[Chart legend: more births / fewer births, on average, for each day of the year (from 2007 to 2013)]
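Both observations can be expressed as simple shares: the fraction of recorded birthdays that fall on round-numbered dates, and the fraction that fall in January to June. The sketch below compares them with what a uniform spread over a 365-day year would give. The uniform baseline is a simplifying assumption and the eight birthdays are invented for illustration.

```python
ROUND_DAYS = {5, 10, 15, 20, 25}

# Baselines if births were spread evenly over a non-leap year (an assumption):
# 5 round-numbered days in each of 12 months, and 181 days from Jan to Jun.
EXPECTED_ROUND_SHARE = 60 / 365
EXPECTED_FIRST_HALF_SHARE = 181 / 365

def birth_date_shares(birthdays):
    """birthdays: list of (month, day).
    Returns (share on round-numbered days, share in January to June)."""
    total = len(birthdays)
    round_share = sum(1 for _, day in birthdays if day in ROUND_DAYS) / total
    first_half_share = sum(1 for month, _ in birthdays if month <= 6) / total
    return round_share, first_half_share

# Hypothetical recorded birthdays as (month, day)
birthdays = [(1, 5), (2, 10), (3, 15), (4, 7), (5, 20), (6, 25), (7, 3), (9, 12)]
round_share, first_half_share = birth_date_shares(birthdays)

print(f"round-numbered days: {round_share:.1%} vs {EXPECTED_ROUND_SHARE:.1%} expected")
print(f"first half of year:  {first_half_share:.1%} vs {EXPECTED_FIRST_HALF_SHARE:.1%} expected")
```

Shares far above the uniform baseline are a prompt to look closer, not proof of manipulation on their own.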
18
EDA PROCESS
UNDERSTAND
• Identify relevant data & sources
• Map context
• Prepare metadata
• Label & clean data
DERIVE
• New metrics from business
• Metrics from patterns (binning, comparison, ratios, attributes, transformation)
QUESTION
• Stakeholder inputs (who would benefit from the analysis)
• Based on patterns (top groups by a metric, maximise a metric, bivariate relationships)
INTERACT
• Filter by a group value
• Compare against a value or a derived metric
• Sort by a dimension
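The derive and interact steps can be sketched on a tiny table as follows. The column names (`district`, `marks`), the bin width and the rows are hypothetical, chosen only to show binning, filtering by a group value, comparing against a derived metric, and sorting.

```python
def bin_value(value, width):
    """Lower edge of the fixed-width bin that contains `value`."""
    return (value // width) * width

# Hypothetical rows; in practice these would come from the cleaned dataset.
rows = [
    {"district": "A", "marks": 62},
    {"district": "A", "marks": 88},
    {"district": "B", "marks": 45},
    {"district": "B", "marks": 71},
]

# DERIVE: a binned metric per row and an overall average to compare against
for row in rows:
    row["marks_bin"] = bin_value(row["marks"], 10)
overall_mean = sum(row["marks"] for row in rows) / len(rows)

# INTERACT: filter by a group value
district_a = [row for row in rows if row["district"] == "A"]

# INTERACT: compare against a derived metric
above_average = [row for row in rows if row["marks"] > overall_mean]

# INTERACT: sort by a dimension
ranked = sorted(rows, key=lambda row: row["marks"], reverse=True)

print(overall_mean, [row["marks"] for row in above_average])
```

Plain lists of dicts keep the example dependency-free. A dataframe library would express the same steps more compactly on real data.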
19
LIVE DEMO
20
THANK YOU
21
Reaching out…
Kathirmani Sukumar
Email: [email protected]
Twitter: @skathirmani
LinkedIn: https://in.linkedin.com/in/skathirmani