
Real-Time Anomaly Detection at Scale: 19 Billion Events per Day

Introduction

Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, and system health monitoring. Streaming data (events) from these applications is inspected for anomalies or irregularities, and when an anomaly is detected, alerts are raised either to trigger an automated process to handle the exception or for manual intervention. The logic for deciding whether an event is an anomaly depends on the application, but typically such systems look either for historically known patterns that were previously classified as anomalies (supervised anomaly detection) or for events that are significantly different from past events (unsupervised detection). Anomaly detection systems involve a combination of technologies such as machine learning, statistical analysis, algorithmic optimisation techniques and data layer technologies to ingest, process, analyse, disseminate and store streaming data. When such applications operate at massive scale, generating millions or billions of events, they pose significant computational challenges for anomaly detection algorithms, and performance and scalability challenges for data layer technologies. In this paper, we explore the architecture of an example streaming data pipeline and demonstrate the scalability, performance and cost effectiveness of data layer technologies like Apache Kafka® and Apache Cassandra® for use in anomaly detection applications.

White Paper

Paul Brebner, Kaushik Mysur


Anomaly Detection

A simple type of unsupervised anomaly detection is break or changepoint analysis. We use a simple CUSUM (CUmulative SUM) algorithm, which takes a stream of events and analyses them to see if the most recent event(s) are "different" from previous ones.
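
For example, here is a minimal sketch of a two-sided CUSUM-style check in Java. This is illustrative only; the class name, threshold and drift parameters are assumptions, not the exact code used in the experiment.

    import java.util.List;

    // Minimal two-sided CUSUM changepoint check (illustrative sketch).
    public class CusumDetector {

        private final double threshold; // raise an alarm when the cumulative deviation exceeds this
        private final double drift;     // slack subtracted from each deviation to ignore small noise

        public CusumDetector(double threshold, double drift) {
            this.threshold = threshold;
            this.drift = drift;
        }

        // Returns true if the latest value looks anomalous relative to the history.
        public boolean isAnomaly(List<Double> history, double latest) {
            if (history.isEmpty()) {
                return false; // not enough observations yet
            }
            double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);

            // Accumulate positive and negative deviations from the historical mean.
            double posSum = 0.0;
            double negSum = 0.0;
            for (double v : history) {
                posSum = Math.max(0.0, posSum + (v - mean - drift));
                negSum = Math.max(0.0, negSum + (mean - v - drift));
            }
            // Include the latest observation and compare against the threshold.
            posSum = Math.max(0.0, posSum + (latest - mean - drift));
            negSum = Math.max(0.0, negSum + (mean - latest - drift));
            return posSum > threshold || negSum > threshold;
        }
    }

The detector flags the latest event when the cumulative positive or negative deviation from the historical mean exceeds the threshold.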


The diagram below shows the logical steps in the anomaly detection pipeline (step 4 in the diagram is the database itself; a code sketch of the loop follows the list):

1. Events arrive in a stream
2. Get the next event from the stream
3. Write the event to the database (4)
5. Query the historic data from the database (4)
6. If there are sufficient observations, run the anomaly detector
7. Was a potential anomaly detected? Take appropriate action
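
A minimal sketch of this loop in Java, assuming hypothetical EventStream and Database interfaces standing in for the real Kafka and Cassandra components (all names here are placeholders, not the actual application code; CusumDetector is the sketch from the previous section):

    import java.util.List;

    // Hypothetical placeholders for the real Kafka and Cassandra components.
    interface EventStream { Event next(); }                    // steps 1-2: the incoming stream
    interface Database {
        void write(Event e);                                   // step 3: store the event
        List<Double> lastValues(long id, int n);               // step 5: read historic data
    }
    record Event(long id, double value) {}

    class PipelineLoop {
        static final int MIN_OBSERVATIONS = 50;

        static void run(EventStream stream, Database db, CusumDetector detector) {
            while (true) {
                Event e = stream.next();                                        // 2: get the next event
                db.write(e);                                                    // 3: write it to the database
                List<Double> history = db.lastValues(e.id(), MIN_OBSERVATIONS); // 5: query historic data
                if (history.size() >= MIN_OBSERVATIONS                          // 6: enough observations?
                        && detector.isAnomaly(history, e.value())) {            //    run the anomaly detector
                    System.out.println("Potential anomaly for id " + e.id());   // 7: take appropriate action
                }
            }
        }
    }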


Architecture and Technology Choices

We selected Apache Kafka and Cassandra as the two technologies to implement an affordable, fast, and scalable anomaly detector, as they are naturally complementary in a number of ways.

Kafka is a good choice of technology for fast, scalable ingestion of streaming data. It supports multiple heterogeneous data sources and is linearly scalable. It supports data persistence and replication by design, which ensures that no data is lost even if some nodes fail. Because Kafka uses a store-and-forward design, it can act as a buffer between volatile external data sources and the Cassandra database, preventing Cassandra from being overwhelmed by large data surges and ensuring no data is lost. Once the data is in Kafka, it's easy to send elsewhere (e.g. to Cassandra) and to continuously process the streaming data in real time. This graph shows a spike in the load coming into Kafka and subsequently dropping off, while Cassandra continues to process the events at a steady rate until completion.
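
For illustration, a minimal Kafka producer for this kind of <id, value> event stream might look like the following sketch (the topic name, broker addresses and serializer choices are assumptions, not necessarily those used in Anomalia Machina):

    import java.util.Properties;
    import java.util.Random;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.LongSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.DoubleSerializer");

            try (KafkaProducer<Long, Double> producer = new KafkaProducer<>(props)) {
                Random random = new Random();
                // Simulated event generator: random <id, value> pairs.
                for (int i = 0; i < 1_000_000; i++) {
                    long id = random.nextInt(1_000_000);
                    double value = random.nextGaussian();
                    // Keyed by id, so all events for one id land in the same partition.
                    producer.send(new ProducerRecord<>("event-stream", id, value));
                }
            }
        }
    }

Because Kafka persists and buffers the records, a producer like this can run well ahead of the downstream Cassandra writes during a load spike.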

Cassandra is a good choice for rapidly storing high velocity streaming data (particularly time series data) as it’s optimised for writes. It’s also a good choice for reading the data again, as it supports random access queries using a sophisticated primary key consisting of a partition key (simple or composite), and zero or more clustering keys (which determine the order of data returned). Cassandra is also linearly scalable and copes with failures without loss of data.

We combined Kafka, Cassandra, and the application to form a simple type of "Lambda" architecture, with Kafka and the streaming data pipeline as the "speed layer", and Cassandra as the "batch" and "serving" layer. However, given that Kafka already has an immutable copy of the data, we wondered whether an even simpler "Kappa" architecture was possible. Could we remove Cassandra and run the anomaly detection pipeline as a Kafka Streams application?


The answer was "no", as Kafka is only efficient when replaying previous data contiguously and in order; it doesn't have indexes or support random access queries. For a large range of IDs (e.g. billions), the consumer would have to read many irrelevant records to filter for the matching IDs. The alternatives (which won't scale) are to have one partition per ID, or to cache the data using Kafka Streams state stores (but as we're dealing with "Big" data we can't realistically or reliably keep it all in RAM).

Application Design

Now that we have established the rationale for the choice of technology and architecture, let's look at the data model and application design in this experiment.

For the data model, we used a simple numeric <key, value> pair, with a Kafka "ingestion" timestamp embedded in the Kafka metadata. The data is stored in Cassandra as a time series to enable efficient reading of the last N (50 in our experiment) values for a given key (id). The table uses a compound primary key with id as the partition key and time as the clustering key, which enables rows to be retrieved in descending time order:

CREATE TABLE event_stream (
    id bigint,
    time timestamp,
    value double,
    PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);

This is the Cassandra query we use to retrieve up to the last 50 records for a specific key, for processing by the detection algorithm:

SELECT * FROM event_stream WHERE id=123456789 LIMIT 50;
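
In the application, these two statements are typically prepared once and executed per event. A minimal sketch using the DataStax Java driver is shown below (the driver version, class name, and session setup are assumptions, not a copy of the Anomalia Machina code):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import com.datastax.oss.driver.api.core.cql.Row;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    public class EventStore {
        private final CqlSession session;
        private final PreparedStatement insert;
        private final PreparedStatement select;

        public EventStore(CqlSession session) {
            this.session = session;
            // Prepared once, bound and executed for every event.
            this.insert = session.prepare(
                    "INSERT INTO event_stream (id, time, value) VALUES (?, ?, ?)");
            this.select = session.prepare(
                    "SELECT value FROM event_stream WHERE id = ? LIMIT 50");
        }

        // Write the event as a new row in the time series.
        public void write(long id, double value) {
            session.execute(insert.bind(id, Instant.now(), value));
        }

        // Read up to the last 50 values for this id, newest first
        // (the clustering order on the table makes this a fast, contiguous read).
        public List<Double> lastValues(long id) {
            List<Double> values = new ArrayList<>();
            for (Row row : session.execute(select.bind(id))) {
                values.add(row.getDouble("value"));
            }
            return values;
        }
    }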


To scale the application we can increase the number of threads, application instances, and server resources available for each component.

This diagram shows the anomaly detector application design. The main components are the event generator (using a Kafka producer), Kafka cluster, anomaly detection pipeline, and Cassandra cluster. The anomaly detection pipeline consists of two components:

1. The Kafka consumer (in its own thread pool), which reads messages from the Kafka cluster, and,

2. The processing stages (in another thread pool), which, for each event received from the consumer component, write it to Cassandra, read the historic data back from Cassandra, and run the anomaly detection algorithm to decide whether the event has a high risk of being anomalous (a code sketch of this design follows).
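
A minimal sketch of this two-component design (the consumer group, topic name, and pool sizes are illustrative assumptions; in the real application each component is separately tunable and runs many threads):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import com.datastax.oss.driver.api.core.CqlSession;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DetectorPipeline {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "anomaly-detector");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.DoubleDeserializer");

            // Component 2: the processing stages run in their own thread pool.
            ExecutorService processors = Executors.newFixedThreadPool(16);
            EventStore store = new EventStore(CqlSession.builder().build()); // see EventStore sketch above
            CusumDetector detector = new CusumDetector(10.0, 0.5);

            // Component 1: the Kafka consumer in its own thread (one of several in practice).
            try (KafkaConsumer<Long, Double> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("event-stream"));
                while (true) {
                    ConsumerRecords<Long, Double> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<Long, Double> record : records) {
                        processors.submit(() -> {
                            store.write(record.key(), record.value());             // write to Cassandra
                            List<Double> history = store.lastValues(record.key()); // read historic data
                            if (history.size() >= 50
                                    && detector.isAnomaly(history, record.value())) {
                                System.out.println("High anomaly risk for id " + record.key());
                            }
                        });
                    }
                }
            }
        }
    }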


We built the Kafka and Cassandra clusters on the Instaclustr managed platform (on AWS), which enables rapid and painless creation of arbitrarily sized clusters (across multiple cloud providers, node types and node counts), easy management of their operations, and comprehensive monitoring. To automate provisioning, deployment and scaling of the application, we used Kubernetes on AWS EKS. We also needed observability into the application once it was running in Kubernetes, to monitor, debug and tune all stages of the pipeline, and then to report scalability and performance metrics for the anomaly detection application. To do this, we used the open source Prometheus for metrics monitoring, and OpenTracing and Jaeger for distributed tracing.
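
As an illustration of the Prometheus instrumentation, a sketch using the Prometheus Java simpleclient might look like this (the metric names, port, and structure are hypothetical, not necessarily the metrics used in the experiment):

    import io.prometheus.client.Counter;
    import io.prometheus.client.Summary;
    import io.prometheus.client.exporter.HTTPServer;

    public class DetectorMetrics {
        // Counter: total number of anomaly checks performed by this Pod.
        static final Counter CHECKS = Counter.build()
                .name("anomaly_checks_total")
                .help("Total anomaly checks run.")
                .register();

        // Summary: duration of each detector run, used for rate/duration graphs.
        static final Summary DURATION = Summary.build()
                .name("anomaly_check_duration_seconds")
                .help("Duration of each anomaly check.")
                .register();

        public static void main(String[] args) throws Exception {
            // Expose /metrics so the Prometheus server can scrape each Pod.
            HTTPServer server = new HTTPServer(1234);

            // Wrap each detector call with the metrics.
            Summary.Timer timer = DURATION.startTimer();
            try {
                // ... run the anomaly detector for one event here ...
            } finally {
                timer.observeDuration();
                CHECKS.inc();
            }
        }
    }

Prometheus then scrapes each Pod's /metrics endpoint, and the resulting rates and durations can be explored and graphed as described below.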

Production Application Deployment on Kubernetes

To enable application scalability, we deployed the event generator and anomaly detector components on AWS using Kubernetes (AWS EKS). AWS EKS and Kubernetes have a non-trivial learning curve, and getting everything working smoothly (getting AWS EKS running, creating worker nodes, and configuring and deploying the application onto Kubernetes) takes some effort. But once done, the process is repeatable, and the application can easily be changed, reconfigured, tuned, and run in as many Pods (the Kubernetes unit of concurrency) as needed to scale.


Automation and Instrumentation

Our high-level cloud deployment context looked like this:


We used VPC peering to enable secure communication via private IP addresses between the application running in Kubernetes on AWS EKS and the Instaclustr clusters (all running in the same AWS region). The Instaclustr Management Console and APIs support initiating VPC peering requests, which can then be accepted and configured in AWS. The Instaclustr provisioning API was used to enable the application running in Kubernetes Pods to dynamically discover the private IP addresses of the cluster nodes and connect to them.

We instrumented the application with Prometheus, ran the Prometheus operator on the Kubernetes cluster, configured and ran the Prometheus server to monitor all the application Pods, and enabled ingress to the server from outside the cluster. Once deployed, this configuration automatically allows Prometheus to discover and monitor any number of Pods as they come and go.

Here’s what the final system looks like, with Prometheus running in the Kubernetes cluster, enabling you to connect to the Prometheus Server URL with a browser and then explore, discover and report metrics.


This architecture allowed us to observe the application by selecting, writing expressions over, and graphing the Prometheus metrics (e.g. detector rate and duration) in Grafana:

After also instrumenting the application with OpenTracing, here’s the Jaeger dependencies view (there are other views which show single traces in detail) which shows the topology of the system, including tracing across process boundaries (producers to consumers):


Scaling the Anomaly Detection Pipeline

Now that we have architected and deployed the anomaly detection pipeline, the next big question is: how well does it scale? Rather than jump straight in and try to run the system on very big clusters, we took an incremental approach, monitoring, debugging, tuning and re-tuning the various moving parts of the anomaly detection pipeline. Using the metrics collected by Prometheus, we tuned the application and clusters (by changing the number of Pods running the application, thread pool sizes, Kafka consumers and partitions, and Cassandra connections) to optimise performance as the Cassandra and Kafka clusters grew (more nodes in each cluster). One of the findings was that to maximise throughput with increasing Pods, we needed to minimise the number of Kafka partitions and Cassandra connections.

Adding more nodes to an existing cluster is easy with an Instaclustr managed Cassandra cluster. This made it feasible to create incrementally bigger clusters, and helped us develop a tuning methodology that worked with increasing cluster sizes to achieve near-linear scalability. This graph shows the near-linear scalability of the system with increasing total numbers of cores (total system cores vs. billions (10^9) of anomaly checks per day):


Final Results

For the "Kafka as a buffer" use case, we wanted a Kafka cluster that would process at least 2 million writes/s for a few minutes to cope with event load spikes, while enabling the rest of the anomaly detection pipeline to scale and run at maximum capacity, processing the backlog of events as fast as possible without being impacted by the load spikes.

Initial testing revealed that 9 Kafka producer Pods were sufficient to exceed this target of 2M writes/s using a 9-node Kafka cluster (8 CPU cores per node) with 200 partitions.

Here are the specifications of the largest clusters we used to achieve our biggest results.

Cluster Details (all running in AWS, US East North Virginia)

Instaclustr managed Kafka cluster - EBS: high throughput 1500, 9 x r4.2xlarge-1500 (1,500 GB disk, 61 GB RAM, 8 cores), Apache Kafka 2.1.0, Replication Factor=3

Instaclustr managed Cassandra cluster - Extra Large, 48 x i3.2xlarge (1,769 GB SSD, 61 GB RAM, 8 cores), Apache Cassandra 3.11.3, Replication Factor=3

AWS EKS Kubernetes cluster Worker Nodes - 2 x c5.18xlarge (72 cores, 144 GB RAM, 25 Gbps network), Kubernetes Version 1.10, Platform Version eks.3

This stacked Grafana graph (of the Prometheus metrics) shows the Kafka producer ramping up (from 1 to 9 Kubernetes Pods), with a 2-minute load time, peaking at 2.3M events/s.

This stacked graph shows the anomaly check rate reaching a sustainable 220,000 events/s and continuing until all the events are processed. Prometheus is gathering this metric from the pipeline application running on 100 Kubernetes Pods.


To summarise, the peak Kafka write rate was 2.3M/s, while the rest of the pipeline ran at a sustainable 220,000 anomaly checks/s: a massive 19 billion events processed per day (19 x 10^9).
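
As a quick sanity check, the daily figure follows directly from the sustained rate: 220,000 checks/s x 86,400 s/day ≈ 1.9 x 10^10, i.e. roughly 19 billion anomaly checks per day.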

The system used 574 cores in total. That is a lot of cores! Cassandra had the most (384 cores), the Kubernetes cluster worker nodes had 118 cores, and Kafka the least with 72 cores. 109 application Pods (Kafka producer and detector pipeline) were running on the Kubernetes cluster (along with several Prometheus Pods). The system was processing roughly 400 anomaly checks per second per core.


The Cassandra cluster is more than 5 times bigger than the Kafka cluster, even though the Kafka cluster can process an order of magnitude larger load spike (2.3M/s) than the Cassandra cluster (220,000/s). It is obviously more practical to use “Kafka as a buffer” to cope with load spikes rather than to increase the size of the Cassandra cluster by an order of magnitude (from 48 to 480 nodes!) in a hurry.

How do these results compare with others? Results published last year for a similar system achieved 200 anomaly checks/s using 240 cores. That system used supervised anomaly detection, which required training of the classifiers (once a day), so it used Apache Spark (for ML, feature engineering, and classification) as well as Kafka and Cassandra. Taking into account resource differences, our result is around 500 times higher throughput, with lower real-time latency. That system had more overhead due to the "feature engineering" phase, and using Spark to run the classifier introduced up to 200s of latency, making it unsuitable for real-time use. With a detection latency under 1s (average 500ms), our solution is fast enough to provide real-time anomaly detection and blocking. If the incoming load exceeds the capacity of the pipeline for brief periods, the processing time increases, and potentially anomalous events detected during that time may need to be handled differently (e.g. by putting a freeze on the account as a whole and contacting the customer).

Affordability at Scale

We have shown that our system can scale to process 19 billion events a day, more than adequate for even a large business. So, what is the cost to run an anomaly detection system of this size? It costs only around $1,000 a day for the basic infrastructure using on-demand AWS instances. However, the total cost of ownership would be higher (including R&D of the anomaly detection application, ongoing maintenance of the application, managed service costs, etc.). Assuming a more realistic total cost of $10,000 a day (10x the infrastructure cost), the system can run anomaly checks on 1.9 million events per dollar spent.

The system can easily be scaled up or down to match different business requirements for emerging applications, and the infrastructure costs will scale proportionally. For example, the smallest system we ran checked 1.5 billion events per day, for a cost of around $100/day for the AWS infrastructure.


For Further Information

For more technical information, see our blog series Anomalia Machina: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra:

Anomalia Machina 1 - Introduction
Anomalia Machina 2 - Automatic Provisioning of Cassandra and Kafka clusters
Anomalia Machina 3 - Kafka Load Generation
Anomalia Machina 4 - Prototype application
Anomalia Machina 5 - Monitoring with Prometheus
Anomalia Machina 6 - Application Tracing with OpenTracing and Jaeger
Anomalia Machina 7 - Kubernetes Cluster Creation and Application Deployment
Anomalia Machina 8 - Production Application Deployment with Kubernetes
Anomalia Machina 9 - Anomaly Detection at Scale
Anomalia Machina 10 - Final Results

The sample Open Source Anomalia Machina application code is available from the Instaclustr GitHub.

The Instaclustr Managed Platform Advantage

The Instaclustr managed platform enables anomaly detection at scale and in near real time, making it possible to prevent anomalies, not just detect them long after they happen.

Instaclustr offers a managed platform and services for Apache Kafka and Apache Cassandra, massively scalable data layer technologies, enabling performant and reliable data ingestion, processing, dissemination and storage at scale.

Apache Kafka and Apache Cassandra are open source technologies, which means there are no license fees and no vendor lock-in. As technologies evolve and newer, cutting-edge technologies emerge, you can easily change the technology used under the hood without being tied to a specific licensed technology vendor.

The Instaclustr managed platform enables rapid creation of clusters of different sizes and types on multiple cloud providers and in many regions, for both Kafka and Cassandra. It also offers multiple ways of expanding your cluster (including permanently increasing the number of nodes and dynamic node resizing). This is critical for a use case like anomaly detection, where the event load can vary significantly over days or even within a single day. Being able to scale up or down seamlessly (within minutes) enables an anomaly detection system to be effective and near real-time.


About Instaclustr


Instaclustr delivers reliability at scale through our integrated data platform of open source technologies such as Apache Cassandra®, Apache Kafka®, Apache Spark™ and Elasticsearch.

Our expertise stems from delivering more than 40 million node hours under management, allowing us to run the world's most powerful data technologies effortlessly. We provide a range of managed, consulting and support services to help our customers develop and deploy solutions around open source technologies. Our integrated data platform, built on open source technologies, powers mission-critical, highly available applications for our customers and helps them achieve scalability, reliability and performance for their applications.

Apache Cassandra®, Apache Spark™, Apache Kafka®, Apache Lucene Core®, Apache Zeppelin™ are trademarks of the Apache Software Foundation in the United States and/or other countries. Elasticsearch and Kibana are trademarks of Elasticsearch BV, registered in the U.S. and in other countries.

All the complexity of deploying and managing operations at scale is handled by Instaclustr, so our customers can keep their focus on building applications and running their businesses rather than coping with the complexity of the data layer technologies.

With over 40 million node hours of operational experience and over 2 PB of data under management, Instaclustr provides comprehensive 24x7 support for cluster provisioning, monitoring, and operations management, relieving the burden from businesses; alternatively, you can use our self-serve console and REST APIs for provisioning and monitoring.

Instaclustr is a SOC 2 certified managed service provider (the only one in the market for Cassandra). We are also in the process of making our platform PCI certified. The Instaclustr managed platform is highly secure, and our support team employs stringent security measures to ensure there is little or no room for security threats or risks.

Instaclustr supports seamless, secure integration between your applications (e.g. running on AWS EKS) and the managed clusters; in this case we used AWS VPC peering, initiated from the Instaclustr console and configured in AWS.

If you are interested in knowing more about how Apache Kafka and Apache Cassandra can meet your data layer requirements, and how they can be deployed and operated as your business scales (be it for anomaly detection or any other purpose), please contact [email protected].

