®© 2014 MapR Technologies 1
Ted Dunning June 9, 2015
Practical Computing with Chaos
Ted Dunning, Chief Applications Architect, MapR Technologies
Email [email protected] [email protected] Twitter @Ted_Dunning
e-book available courtesy of MapR
Also at MapR booth
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take into account real-world business settings:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of team to develop and to maintain system
• Look for approaches with big benefits for the effort expended
Agenda
• Monty Hall
• Randomized geo-coding
• Thompson sampling
– Bayesian Bandits
– Targeting
– Bayesian ranking
• Dithering (sound, signals)
• Synthetic data (preview)
Let’s Start with Trouble
• Monty Hall problem (oops, done)
• Three doors, one with a fabulous prize
• You pick one
• Monty shows you that one of the remaining doors is empty
• You can switch at this point to the other door or not
• Should you switch?
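One way to settle the question without an argument is to simulate it. The sketch below is not from the talk; it assumes Monty always opens an empty door that you did not pick, and the function names (`play`, `win_rate`) are made up for illustration.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: Monty always opens a door that is neither your pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither the original pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, trials=100_000, seed=42):
    """Estimate the probability of winning for a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

Under these assumptions the estimates should land near 1/3 for staying and 2/3 for switching.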
The Real Problem
• Doing the math isn’t too hard
• Convincing somebody you have the right answer is really hard
Live Coding With REAL Chaos
Geo-coding
Geo-coding
• Some databases have disk locality ⇔ key locality
• The primary key is totally ordered
• Embedding a total ordering of the points in a plane is possible
– But loses some distance information
– A line is not a square!
• We want to do proximity searches
– This gets harder in the polar regions for most codings
Space Filling Curve
[Figure: space-filling curve through quadrants labeled 0, 1, 2, 3]
Space Filling Curve
[Figure: space-filling curve, second view; quadrants labeled 0, 1, 2, 3]
Z-coding – Interleave Bits
x = 010, y = 011, geo = 00.11.01
[Figure: 8 × 8 grid, both axes labeled 000 to 111, cells labeled with two-bit quadrant codes]
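The interleaving can be sketched in a few lines. This is an illustration only: the function name `z_code` and the dotted output format are chosen to match the slide's example, and the x-bit-before-y-bit order is inferred from x = 010, y = 011 giving 00.11.01.

```python
def z_code(x, y, bits=3):
    """Interleave the bits of x and y, most significant first, x bit before y bit."""
    pairs = []
    for i in reversed(range(bits)):
        x_bit = (x >> i) & 1
        y_bit = (y >> i) & 1
        pairs.append(f"{x_bit}{y_bit}")
    # Dots separate bit pairs, one pair per level of subdivision.
    return ".".join(pairs)

example = z_code(0b010, 0b011)
print(example)  # 00.11.01
```

Each bit pair picks one of four quadrants, so a longer shared prefix means two points sit in the same smaller square.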
Neighbors Often Share Prefix
[Figure: the same 8 × 8 grid, with cells 00.11.11, 10.01.01, and 00.11.01 labeled]
Often, not always
[Figure: keys 13, 15, and 37 on a line, marked Close and Far]
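The three labeled cells make the point concrete. Read as integers, 00.11.01, 00.11.11, and 10.01.01 are 13, 15, and 37. The sketch below assumes those cells are at (x, y) = (2, 3), (3, 3), and (4, 3), which is what de-interleaving the codes gives; the helper names are hypothetical.

```python
def z_value(x, y, bits=3):
    """Interleave bits (x bit first, then y bit) into a single integer key."""
    v = 0
    for i in reversed(range(bits)):
        v = (v << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return v

def dotted(v, bits=3):
    """Render a key as dot-separated bit pairs, e.g. 13 -> '00.11.01'."""
    s = format(v, f"0{2 * bits}b")
    return ".".join(s[i:i + 2] for i in range(0, len(s), 2))

def shared_pairs(a, b, bits=3):
    """Count the leading bit pairs that two keys have in common."""
    n = 0
    for i in reversed(range(bits)):
        if (a >> (2 * i)) & 3 != (b >> (2 * i)) & 3:
            break
        n += 1
    return n

# Three cells side by side in the same row of the grid.
left, middle, right = z_value(2, 3), z_value(3, 3), z_value(4, 3)
for a, b in [(left, middle), (middle, right)]:
    print(dotted(a), dotted(b),
          "shared pairs:", shared_pairs(a, b),
          "key gap:", abs(a - b))
```

The first pair of neighbors shares two of three bit pairs and sits 2 apart in key order. The second pair is just as close in the plane but shares no prefix and sits 22 apart, because it straddles the middle of the grid.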
Random Sampling to Derive Keys
[Figure: the same 8 × 8 grid of Z-coded cells]
"00.01.01" "00.01.10" "00.01.11" "00.11.00" "00.11.01" "00.11.10" "00.11.11" "01.00.10" "01.10.00" "01.10.10"
[Figure: the same 8 × 8 grid of Z-coded cells]
"00.01.10" - "00.01.11" "00.11.00" - "00.11.11" "01.00.10" "01.10.00" - "01.10.10"
[Figure: the same 8 × 8 grid of Z-coded cells]
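A rough sketch of how such a key list and its ranges might be produced: sample random points inside the search region, keep the distinct cell keys they hit, then merge consecutive keys into ranges that can each be scanned in one pass. The disc-shaped region, its center and radius, and all function names are assumptions for illustration, not details from the talk.

```python
import random

def z_value(x, y, bits=3):
    """Interleave bits (x bit first, then y bit) into a single integer key."""
    v = 0
    for i in reversed(range(bits)):
        v = (v << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return v

def dotted(v, bits=3):
    """Render a key as dot-separated bit pairs."""
    s = format(v, f"0{2 * bits}b")
    return ".".join(s[i:i + 2] for i in range(0, len(s), 2))

def sample_keys(cx, cy, radius, samples=2000, bits=3, seed=1):
    """Sample points in a disc and return the set of cell keys they fall in."""
    rng = random.Random(seed)
    size = 1 << bits
    keys = set()
    for _ in range(samples):
        px = rng.uniform(cx - radius, cx + radius)
        py = rng.uniform(cy - radius, cy + radius)
        inside_disc = (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2
        inside_grid = 0 <= px < size and 0 <= py < size
        if inside_disc and inside_grid:
            keys.add(z_value(int(px), int(py), bits))
    return keys

def to_ranges(keys):
    """Merge consecutive integer keys into inclusive (low, high) ranges."""
    ranges = []
    for v in sorted(set(keys)):
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1][1] = v
        else:
            ranges.append([v, v])
    return [(lo, hi) for lo, hi in ranges]

# Hypothetical search region: a disc centered at (2.5, 3.0) with radius 1.5.
keys = sample_keys(cx=2.5, cy=3.0, radius=1.5)
ranges = to_ranges(keys)
for lo, hi in ranges:
    print(dotted(lo) if lo == hi else f"{dotted(lo)} - {dotted(hi)}")
```

Sampling can miss a cell that the region only clips at a corner, so the result is approximate; more samples make a miss less likely.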
Dithering
• 4 bit sine wave (listen for artifacts as volume decreases)
• White dithering (artifacts gone, we hear through the noise)
• Noise shaping (noise is easier to hear through)
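The effect of white dithering can be sketched numerically. The example below is a simplification, not the audio demo from the talk: it assumes a quantizer that rounds to whole integer levels, a sine wave of amplitude 0.4 (smaller than one level), and uniform noise one level wide. Noise shaping is not covered.

```python
import math
import random

def quantize(v):
    """Round to the nearest integer level (a stand-in for a coarse quantizer)."""
    return math.floor(v + 0.5)

def averaged(amplitude, dither, repeats=4000, points=16, seed=7):
    """Quantize each sample of one sine period many times and average the results."""
    rng = random.Random(seed)
    result = []
    for k in range(points):
        s = amplitude * math.sin(2 * math.pi * k / points)
        total = 0
        for _ in range(repeats):
            # White dither: uniform noise spanning one quantization step.
            noise = rng.uniform(-0.5, 0.5) if dither else 0.0
            total += quantize(s + noise)
        result.append(total / repeats)
    return result

signal = [0.4 * math.sin(2 * math.pi * k / 16) for k in range(16)]
plain = averaged(0.4, dither=False)
dithered = averaged(0.4, dither=True)
worst = max(abs(d - s) for d, s in zip(dithered, signal))
print("without dither:", plain[:5])
print("with dither:   ", [round(v, 2) for v in dithered[:5]])
print("worst error after averaging:", round(worst, 3))
```

Without dither every sample rounds to zero and the signal is lost. With dither each individual sample is noisier, but the average tracks the original sine closely.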
[Figure: signal amplitude (−4 to 4) plotted against Time (0 to 6)]
The Shape of the Noise
[Figure: histogram of Noise (−0.4 to 0.4) against Frequency]
The Effect After Averaging
[Figure: signal amplitude (−4 to 4) plotted against Time (0 to 6)]
Thompson Sampling
Learning in the Real World
• In the real world we get to pick our training examples
– Do we try this restaurant or not?
• Learning has real and opportunity costs
• Not learning has real and opportunity costs as well
• Every sub-optimal choice we make incurs regret
– We would like to minimize this
– But we can’t quantify regret without incurring regret!
An Example
• Pick one of five options
– Purple, blue, green, red, yellow
– Each has a random payoff
• If you pick a bad option, regret = mean(best) - mean(yours)
• The best known algorithm uses randomization
– Best = minimal regret + minimal code complexity
Demo – The Algorithm
Synthetic Data
select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD
  ,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr ,IR.HIC_NBR
  ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC ,IR.PROVIDER_NBR ,IR.provider_type
  ,IR.PROVIDER_SOURCE_CD ,IR.cms_provider_ty e ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd
  ,IR.rev_cd_desc ,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY
  ,IR.rec_state_cd ,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC
FROM (SELECT distinct
    enc.encounter_key as ENC_KEY,
    enc.encounter_nbr as ENCOUNTER_,
    typ.encounter_type_cd as ETYPE,
    bt.bill_type,
    cnt.contract_nbr as CONTR_,
    ds.SOURCE_CD,
    enc.sub_source_cd,
    enc.HP_CD,
    lob.LOB_CD,
    enc.new_min_dt as FDOS,
    substr(enc.new_max_dt, 1, 10) as TDOS,
    enc.member_Nbr,
    m.HIC_NBR,
    m.MEMBER_SOURCE_CD,
    eerr.error_cd as HDR_ERRCD,
    eerr.ERROR_DESC as HDR_ERRDESC,
    enc.PROVIDER_NBR,
    prv.provider_type,
    prv.PROVIDER_SOURCE_CD,
    diag.cms_provider_type,
    sp.specialty_cd as SPEC_CD,
    sp.specialty_desc as SPEC_DESC,
    svc.rev_cd,
    rev.rev_cd_desc,
    svc.proc_cd,
    dgcd.diag_cd,
    dgcd.DIAG_CD_KEY,
    diag.DIAGNOSIS_KEY,
    st.rec_state_cd,
    sts.rec_status_cd,
    derr.error_cd as DG_ERRCD,
    derr.error_desc as DG_ERRDESC
  FROM oicpcuhg.ir_encounter enc
Can You See the Problem?
  INNER JOIN oicpcuhg.ir_encountertype typ ON (typ.encounter_type_key = enc.encounter_type_key)
  LEFT OUTER JOIN oicpcuhg.ir_billtype bt ON (bt.bill_type_key = enc.bill_type_key)
  LEFT OUTER JOIN oicpcuhg.ir_contract cnt ON (cnt.contract_key = enc.contract_key)
  LEFT OUTER JOIN oicpcuhg.ir_datasource ds ON (ds.source_key = enc.data_source_key)
  LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob ON (lob.lob_key = enc.lob_key)
  INNER JOIN oicpcuhg.ir_member m
    ON ( m.hp_cd = enc.hp_cd
      AND m.member_source_cd = enc.member_source_cd
      AND m.member_nbr = enc.member_nbr)
  LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
    ON (eerror.encounter_key = enc.encounter_key and eerror.active_flg = 'Y')
  LEFT OUTER JOIN oicpcuhg.ir_error eerr ON (eerr.error_key = eerror.error_key)
  LEFT OUTER JOIN oicpcuhg.ir_provider prv
    ON (prv.hp_cd = enc.hp_cd
      and prv.provider_source_cd = enc.provider_source_cd
      and prv.provider_nbr = enc.provider_nbr)
  LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp ON (esp.encounter_key = enc.encounter_key)
  LEFT OUTER JOIN oicpcuhg.ir_specialty sp ON (sp.specialty_key = esp.specialty_key)
  LEFT OUTER JOIN oicpcuhg.ir_service svc ON (svc.encounter_key = enc.encounter_key)
  LEFT OUTER JOIN oicpcuhg.ir_revenue rev ON (rev.rev_cd = svc.rev_cd)
  LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag ON (diag.encounter_key = enc.encounter_key)
  INNER JOIN oicpcuhg.ir_diagcd dgcd ON (dgcd.diag_cd_key = diag.diag_cd_key)
  INNER JOIN oicpcuhg.ir_recordstate st ON (st.rec_state_key = diag.rec_state_key)
  INNER JOIN oicpcuhg.ir_recordstatus sts ON (sts.rec_status_key = diag.rec_status_key)
  LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
    ON (derror.diagnosis_key = diag.diagnosis_key and derror.active_flg = 'Y')
  LEFT OUTER JOIN oicpcuhg.ir_error derr ON (derr.error_key = derror.error_key)) IR
INNER JOIN oicpcuhg.umr_req_inbound umr
  ON (trim(umr.member_nbr) = IR.member_Nbr
    AND trim(umr.hhc_from_ccyymmdd) = IR.TDOS
    AND trim(umr.sub_mcare_mbr) = IR.HIC_NBR
    AND trim(umr.diag1) = IR.diag_cd)
One Attack
• The customer can’t give you the data
– They can’t trust you, by law
• But they can probably summarize the data
– How many columns
– What types
– Perhaps statistical summaries
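As a rough illustration of working from a summary alone, the sketch below generates fake rows from a description of column names, types, and simple statistics. The summary format, the column names, and every number in it are hypothetical; this is not the tool used in the talk.

```python
import random

# Hypothetical summary a customer might be able to share; all values are made up.
summary = {
    "visits": {"type": "int", "min": 0, "max": 40},
    "charge": {"type": "float", "mean": 120.0, "stdev": 35.0},
    "status": {"type": "category",
               "values": {"open": 0.7, "closed": 0.25, "error": 0.05}},
}

def fake_rows(summary, n, seed=0):
    """Generate n rows whose columns follow the summarized types and statistics."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in summary.items():
            if spec["type"] == "int":
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "float":
                row[name] = rng.gauss(spec["mean"], spec["stdev"])
            else:
                # Categorical column: draw a label using the reported frequencies.
                labels = list(spec["values"])
                weights = list(spec["values"].values())
                row[name] = rng.choices(labels, weights=weights, k=1)[0]
        rows.append(row)
    return rows

rows = fake_rows(summary, 1000)
print(rows[0])
```

Each column here is drawn independently from a simple distribution, so a bug that depends on relationships between columns or tables may not show up without a richer generator.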
Bug Replication Without Security Violation
[Diagram: Customer and You columns with boxes labeled Data and Fake, and parameters x, y, α, ξ]
The Upshot
• So random numbers are useful
• But simple distributions not so much
• How can YOU generate cool data?
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Last October: Time Series Databases by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
Coming in February: Real World Hadoop by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
Thank you for coming today!