@davidcoallier: Part of an amazing team at Barricade.io
Data Science is Hard
Data Hacking is “Easy”
Data Analysis is “Easy”
Data Expertise is “Easy”
Got all? Having all three is really hard!
Is that it? Well, don’t forget your purpose.
You are not an economist. /ɪˈkɒnəmɪst/: Someone with all the answers, and none of the questions.
The Data Scientific Method
Find a question.
Use the data you have
Features & Tests
Analyse Results: You will be sad.
Conversate: Talk about your findings.
Good Chats: Imply egoless and collaborative data scientists.
Recap.
1. Hacking 2. Maths & Stats 3. Expertise
And
1. Question 2. Be Pragmatic 3. Features 4. Analyse 5. Share.
A team! Rarely a single-person effort.
An Example: Fraud Prevention - Business Prevention
I knew better. Obviously… duh
We didn’t share. Science has historically been shared.
Not with p-values
Empathise. Use human language, not lingo.
For us at Barricade
Doing this at scale is hard.
We’re still small: About a billion data points a day.
Humble Beginnings: Typically… a Queue and an API.
This had issues. Hard to scale, hard to decouple, etc.
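To put "a billion data points a day" in per-second terms, here is a small back-of-the-envelope calculation. It assumes the load is spread evenly over 24 hours, which real traffic rarely is, so treat the result as an average rather than a peak.

```python
# "About a billion data points a day", taken from the slide.
POINTS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Average ingest rate if the load were perfectly uniform (an assumption).
average_rate = POINTS_PER_DAY / SECONDS_PER_DAY

print(f"{average_rate:,.0f} data points per second on average")
# prints "11,574 data points per second on average"
```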
Enter the Lambda Architecture.
Speed Layer
Batch Layer
Speed Layer: ∪ new behaviour from new data
Batch Layer: All classified behaviour since T
Serving Layer
Serving Layer: Batch Layer ∪ Speed Layer
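The three layers can be sketched as plain set operations. This is a minimal illustration, not Barricade's implementation: it assumes a "behaviour" is a hashable (source, label) pair, and the names `SpeedLayer`, `ingest` and `serve` are hypothetical.

```python
# Assumption: a "behaviour" is a hashable (source, label) pair.
Behaviour = tuple[str, str]


class SpeedLayer:
    """Keeps a running union of behaviour seen in data the batch job has not processed yet."""

    def __init__(self) -> None:
        self.view: set[Behaviour] = set()

    def ingest(self, new_behaviour: list[Behaviour]) -> None:
        # view = view ∪ new behaviour from new data
        self.view |= set(new_behaviour)


def serve(batch_view: set[Behaviour], speed_view: set[Behaviour]) -> set[Behaviour]:
    """Serving layer: batch layer ∪ speed layer."""
    return batch_view | speed_view


# Hypothetical batch output: everything classified by the last batch run.
batch_view = {("10.0.0.1", "scanner"), ("10.0.0.2", "crawler")}

speed = SpeedLayer()
speed.ingest([("10.0.0.3", "brute-force")])
speed.ingest([("10.0.0.2", "crawler")])  # already known to the batch view

merged = serve(batch_view, speed.view)
print(len(merged))  # prints 3: the duplicate collapses in the union
```

Because the merge is a set union, a behaviour that shows up in both layers is served once, so the speed layer can overlap with the batch layer without producing double counts.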
Cache Layer
On Amazon AWS
Identifying an Attack.
Ahh! What’s that?
Kafka Queue: Distributed messaging system; append-only log; consumers have offsets; partition for parallelism; replicate for redundancy; message order guaranteed, per-partition.
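The properties above can be illustrated with a toy in-memory model. This is not the Kafka API: `ToyLog` and `ToyConsumer` are hypothetical names, replication is left out, and the key-to-partition hashing is only an assumption about how one might route messages.

```python
from collections import defaultdict
from zlib import crc32


class ToyLog:
    """In-memory stand-in for a partitioned, append-only log (not the Kafka API)."""

    def __init__(self, partitions: int) -> None:
        self.partitions: list[list[str]] = [[] for _ in range(partitions)]

    def append(self, key: str, message: str) -> int:
        # Same key -> same partition, so order is preserved per partition.
        p = crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append(message)  # append-only: never rewritten
        return p

    def read(self, partition: int, offset: int) -> list[str]:
        return self.partitions[partition][offset:]


class ToyConsumer:
    """Tracks its own offset per partition, so the log itself stays untouched."""

    def __init__(self, log: ToyLog) -> None:
        self.log = log
        self.offsets: dict[int, int] = defaultdict(int)

    def poll(self, partition: int) -> list[str]:
        messages = self.log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(messages)
        return messages


log = ToyLog(partitions=2)
p = log.append("ip-1", "login failed")
log.append("ip-1", "login failed again")

consumer = ToyConsumer(log)
first = consumer.poll(p)   # both messages, in the order they were appended
second = consumer.poll(p)  # nothing new since the stored offset

log.append("ip-1", "login ok")
third = consumer.poll(p)   # only the message appended after the last poll
```

A second `ToyConsumer` on the same log would start from offset 0 and see every message again, which is the point of keeping offsets on the consumer side rather than deleting messages on read.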