Date post: | 10-May-2015 |
Category: | Technology |
Upload: | hadoop-summit |
© 2014 MapR Technologies 1
© MapR Technologies, confidential
Anomaly Detection, Time-Series Data Bases and More
How to Find What You Didn’t Know to Look For
Agenda
• What is anomaly detection?
• Some examples
• Some generalization
• Compression == Truth
• Deep dive into deep learning
• Why this matters for time series databases
This goes beyond what was described in the abstract…
Who I am
• Ted Dunning, Chief Application Architect, MapR [email protected]
Twitter @ted_dunning
• Committer, mentor, champion, PMC member on several Apache projects
• Mahout, Drill, Zookeeper, Spark and others
• Hashtag for today’s talk #HadoopSummit
Who we are
• MapR makes the technology-leading distribution that includes Hadoop
• MapR integrates real-time data semantics directly into a system that also runs Hadoop programs seamlessly
• The biggest and best choose MapR
– Google, Amazon
– Largest credit card, retailer, health insurance, telco
– Ping me for info
A New Look at Anomaly Detection
• O’Reilly series Practical Machine Learning 2nd volume
• Print copies available at Hadoop Summit at MapR booth
• eBook available at http://bit.ly/1jQ9QuL
• Book signing by both authors at 4pm Wed (today)
What is Anomaly Detection?
• What just happened that shouldn’t?
– but I don’t know what failure looks like (yet)
• Find the problem before other people see it
– especially customers and CEOs
• But don’t wake me up if it isn’t really broken
Looks pretty anomalous to me
Will the real anomaly please stand up?
What Are We Really Doing
• We want action when something breaks (dies/falls over/otherwise gets in trouble)
• But action is expensive
• So we don’t want false alarms
• And we don’t want false negatives
• We need to trade off costs
A Second Look
99.9%-ile
How Hard Can it Be?
[Diagram: input x feeds an online summarizer that estimates the 99.9%-ile as threshold t; if x > t, raise an alarm]
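The threshold loop in this diagram can be sketched in a few lines. This is a minimal illustration, not the Mahout estimator: the class name `QuantileAlarm`, the `warmup` parameter and the keep-everything sorted list are all assumptions made here for clarity. A real deployment would likely replace the sorted list with a bounded-memory streaming quantile sketch.

```python
import bisect
import random


class QuantileAlarm:
    """Alarm when a sample exceeds a running high-quantile estimate.

    Hypothetical helper: keeps every sample in a sorted list, which is
    exact but uses O(n) memory.
    """

    def __init__(self, quantile=0.999, warmup=1000):
        self.quantile = quantile
        self.warmup = warmup          # samples to observe before alarming
        self.sorted_samples = []

    def threshold(self):
        # Empirical quantile: value at rank floor(q * n), clamped to the end.
        n = len(self.sorted_samples)
        index = min(n - 1, int(self.quantile * n))
        return self.sorted_samples[index]

    def update(self, x):
        """Return True if x exceeds the current threshold, then record x."""
        alarm = (
            len(self.sorted_samples) >= self.warmup
            and x > self.threshold()
        )
        bisect.insort(self.sorted_samples, x)
        return alarm


def demo():
    rng = random.Random(42)
    detector = QuantileAlarm(quantile=0.999, warmup=1000)
    for _ in range(5000):
        detector.update(rng.gauss(0.0, 1.0))
    # An extreme value should trip the alarm; a typical value should not.
    return detector.update(50.0), detector.update(0.0)
```

Note that the threshold adapts to whatever the stream has looked like so far, which is the point of the slide: nobody has to pick t by hand.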
On-line Percentile Estimates
• Apache Mahout has on-line percentile estimator
– very high accuracy for extreme tails
– new in version 0.9 !!
• What’s the big deal with anomaly detection?
• This looks like a solved problem
Already Done? Etsy Skyline?
What About This?
Spot the Anomaly
Anomaly?
Maybe not!
Where’s Waldo?
This is the real anomaly
Normal Isn’t Just Normal
• What we want is a model of what is normal
• What doesn’t fit the model is the anomaly
• For simple signals, the model can be simple …
• The real world is rarely so accommodating
We Do Windows
Windows on the World
• The set of windowed signals is a nice model of our original signal
• Clustering can find the prototypes
– Fancier techniques available using sparse coding
• The result is a dictionary of shapes
• New signals can be encoded by shifting, scaling and adding shapes from the dictionary
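As a rough sketch of how such a dictionary might be built: cut the signal into half-overlapping tapered windows, normalize them, and cluster with plain k-means. The function names, the sine-squared taper, the window size and the value of k are all assumptions chosen for illustration; the talk does not specify them here.

```python
import math
import random


def hann(n):
    # Sine-squared taper (assumed choice); copies shifted by n/2 sum to 1.
    return [math.sin(math.pi * i / n) ** 2 for i in range(n)]


def windows(signal, size):
    """Cut the signal into half-overlapping, tapered windows."""
    taper = hann(size)
    step = size // 2
    return [
        [signal[start + i] * taper[i] for i in range(size)]
        for start in range(0, len(signal) - size + 1, step)
    ]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns unit-length centroids (the dictionary)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum(
                    (a - b) ** 2 for a, b in zip(p, centroids[j])
                ),
            )
            groups[nearest].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [
                    sum(col) / len(members) for col in zip(*members)
                ]
    return [normalize(c) for c in centroids]


# Usage: learn 4 prototype shapes from a synthetic two-tone signal.
signal = [
    math.sin(2 * math.pi * i / 25) + 0.5 * math.sin(2 * math.pi * i / 7)
    for i in range(400)
]
shapes = [normalize(w) for w in windows(signal, 32)]
dictionary = kmeans(shapes, k=4)
```

Normalizing before clustering means the prototypes capture shape rather than amplitude, which is what lets a later encoding step rescale them.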
© 2014 MapR Technologies 37
Most Common Shapes (for EKG)
© 2014 MapR Technologies 38
Reconstructed signal
[Figure: original signal, reconstructed signal and reconstruction error; the error costs < 1 bit / sample]
© 2014 MapR Technologies 39
An Anomaly
The original technique for finding 1-d anomalies also works on the reconstruction error
© 2014 MapR Technologies 40
Close-up of anomaly
Not what you want your heart to do.
And not what the model expects it to do.
© 2014 MapR Technologies 41
A Different Kind of Anomaly
© 2014 MapR Technologies 42
Model Delta Anomaly Detection
[Diagram: the model's output is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile as threshold t; if δ > t, raise an alarm]
© 2014 MapR Technologies 43
The Real Inside Scoop
• The model-delta anomaly detector is really just a sum of random variables
– the model we know about already
– and a normally distributed error
• The output (delta) is (roughly) the log probability of the sum distribution (really δ²)
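One way to see the link between δ² and log probability, assuming the error is zero-mean Gaussian with variance σ² (σ is a symbol introduced here, not on the slide):

```latex
p(\delta) = \frac{1}{\sqrt{2\pi\sigma^2}}
            \exp\!\left(-\frac{\delta^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\log p(\delta) = \frac{\delta^2}{2\sigma^2}
                + \tfrac{1}{2}\log\left(2\pi\sigma^2\right)
```

Under that assumption, −log p is just δ² rescaled and shifted by constants, so thresholding δ² is the same as thresholding −log p.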
• Thinking about probability distributions is good
Example: Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to expected values
• Want alert as soon as possible
Poisson Distribution
• Time between events is exponentially distributed
• This means that long delays are exponentially rare
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
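When λ is known, the threshold follows directly from the exponential tail, P(gap > t) = exp(−λt). A minimal sketch, with hypothetical function names and with λ passed in as `rate`:

```python
import math


def gap_threshold(rate, tail_probability):
    """Gap length exceeded with probability `tail_probability`.

    Assumes a Poisson process with `rate` events per unit time, so
    P(gap > t) = exp(-rate * t)  =>  t = -ln(tail_probability) / rate.
    """
    return -math.log(tail_probability) / rate


def gap_anomaly_score(gap, rate):
    """-log P(gap' > gap): larger means a more surprising silence."""
    return rate * gap


# Usage: with 2 events per second, how long a silence is a 1-in-1000 event?
t_999 = gap_threshold(rate=2.0, tail_probability=0.001)
```

The score here is the negative log of the tail probability rather than of the density, which is an illustrative choice: it grows linearly with the length of the silence and is zero for a gap of zero.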
Converting Event Times to Anomaly
99.9%-ile
99.99%-ile
Recap (out of order)
• Anomaly detection is best done with a probability model
• -log p is a good way to convert to anomaly measure
• Adaptive quantile estimation works for auto-setting thresholds
Recap
• Different systems require different models
• Continuous time-series
– sparse coding to build signal model
• Events in time
– rate model based on variable rate Poisson
– segregated rate model
• Events with labels
– language modeling
– hidden Markov models
But Wait! Compression is Truth
• Maximizing log π_k is minimizing compressed size
– (each symbol takes -log π_k bits on average)
• Maximizing log π_k happens where π_k = p_k
– (maximum likelihood principle)
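A short derivation of why the optimum sits at π_k = p_k, assuming symbols are drawn independently with true probabilities p_k and coded with model probabilities π_k (H and D_KL are standard notation, not taken from the slide):

```latex
\mathbb{E}\left[-\log \pi_k\right]
  = -\sum_k p_k \log \pi_k
  = \underbrace{-\sum_k p_k \log p_k}_{H(p)}
  + \underbrace{\sum_k p_k \log \frac{p_k}{\pi_k}}_{D_{\mathrm{KL}}(p \,\|\, \pi)\ \ge\ 0}
```

The divergence term vanishes only when π = p, so the expected code length is smallest, and the expected log likelihood largest, exactly when the model matches the true distribution.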
But Auto-encoders Find Max Likelihood
• Minimal error => maximum likelihood
• Maximum likelihood => maximum compression
• So good anomaly detectors give good compression
In Case You Want the Details
Pause To Reflect on Clustering
• Use windowing to apportion signal
– Hamming windows add up to 1
• Find nearest cluster for each window
– Can use dot product because all clusters normalized
• Scale cluster to right size
– Dot product again
• Subtract from original signal
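The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the function names are hypothetical, the dictionary is assumed to hold unit-length shapes, and the taper is a sine-squared (Hann) window rather than the Hamming window named on the slide, because its half-overlapped copies sum exactly to 1.

```python
import math


def hann(n):
    # Sine-squared taper (assumed choice); copies shifted by n/2 sum to 1.
    return [math.sin(math.pi * i / n) ** 2 for i in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruct(signal, dictionary, size):
    """Approximate `signal` as a sum of scaled dictionary shapes.

    `dictionary` holds unit-length shapes of length `size`.
    Returns (reconstruction, error).
    """
    taper = hann(size)
    step = size // 2
    approx = [0.0] * len(signal)
    for start in range(0, len(signal) - size + 1, step):
        # Apportion: tapered windows at half overlap sum back to the signal.
        window = [signal[start + i] * taper[i] for i in range(size)]
        # Nearest shape: for unit-length shapes, the largest |dot product|
        # leaves the smallest residual after optimal scaling.
        best = max(dictionary, key=lambda shape: abs(dot(window, shape)))
        # Scale: the dot product is the least-squares amplitude.
        scale = dot(window, best)
        for i in range(size):
            approx[start + i] += scale * best[i]
    # Subtract: what is left is the reconstruction error.
    error = [s - a for s, a in zip(signal, approx)]
    return approx, error


# Usage: a constant signal and a one-shape dictionary made from the taper.
size = 8
taper = hann(size)
norm = math.sqrt(dot(taper, taper))
shape = [t / norm for t in taper]
approx, error = reconstruct([1.0] * 32, [shape], size)
```

In this toy case the interior of the signal is reconstructed exactly; only the first and last half-window, which are covered by a single taper, show nonzero error.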
Auto-encoding - Information Bottleneck
Clustering as Neural Network
Hidden layer is 1 of k
Could be m of k
Sparsity allows k >> 100
Overlapping Networks
Time series input
Reconstructed time series
Deep Learning
What About the Database?
• We don’t have to keep the reconstruction
• We can keep the first level nodes
– And the reconstruction error
• To keep the first level nodes
– We can keep the second level nodes
– Plus the reconstruction error
What Does it Matter?
• Even one level of auto-encoding compresses
– 30-50x in EKG example with k-means
• Multiple levels compress more
– Understanding => Truth => Compression
• Higher levels give semantic search
How Do I Build Such a System
• The key is to combine real-time and long-time
– real-time evaluates data stream against model
– long-time is how we build the model
• Extended Lambda architecture is my favorite
• See my other talks on slideshare.net for info
• Ping me directly
Hadoop is Not Very Real-time
[Timeline diagram along t up to now: fully processed data, the latest full period, and unprocessed data; a Hadoop job takes this long for this data]
Real-time and Long-time together
[Timeline diagram along t up to now: Hadoop works great on the older data, Storm works on the most recent data, and a blended view combines both]
Who I am
• Ted Dunning, Chief Application Architect, MapR [email protected]
@ted_dunning
• Committer, mentor, champion, PMC member on several Apache projects
• Mahout, Drill, Zookeeper and others