Description:
At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak, and close to 200K QPS across all browser calls. We build analytics on these streams of data. Two applications require significant computational effort: sessionization and fraud detection. Sessionization links a series of requests from the same browser into a single record; a session can consist of 5 or more requests spread over 15-30 minutes, all of which need to be linked to each other. Fraud detection looks at various signals in browser requests, together with substantial historical evidence data, and classifies each ad impression as either legitimate or fraudulent. We have been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time sensitive and far more meaningful when done in near real time.

This talk is about our experience migrating offline batch processing of impression data on Hadoop to in-memory stream processing using Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used in this solution. Hadoop is no longer the only, or always the preferred, option in the Big Data space. In-memory stream processing can be more effective for time-series data preparation and aggregation, and the ability to scale at a significantly lower cost means more customers, better accuracy and better business practices. Since only in-stream processing allows for low-latency data and insight delivery, it opens up entirely new opportunities. However, transitioning non-trivial data pipelines raises a number of questions previously hidden by the offline nature of batch processing: How do you join several data feeds? How do you implement failure recovery?

In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:

• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limits on data loss
• Maintainability

The system produces complex cross-correlation analysis of several data feeds and aggregations for client analytics, with an input feed frequency of up to 100K msg/sec. This presentation will benefit anyone interested in an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra, and it will also highlight optimization patterns that can be useful in similar situations. Illustrative sketches of sessionization, the topology shape, and the Cassandra-backed join follow below.
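To make the sessionization step concrete, here is a minimal, self-contained Java sketch of the windowing logic. The class and method names are hypothetical, and the in-process HashMap is only a stand-in: in the actual pipeline this state is partitioned across Storm bolts, with Cassandra holding the durable copy.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal sessionization sketch: link requests from the same browser
    // into one session record, closing the session after 30 minutes of
    // inactivity. All names here are illustrative.
    public class Sessionizer {
        static final long WINDOW_MS = 30 * 60 * 1000L; // 30-minute inactivity window

        static class Session {
            final String browserId;
            final List<String> requests = new ArrayList<>();
            long lastSeenMs;
            Session(String browserId, long ts) { this.browserId = browserId; this.lastSeenMs = ts; }
        }

        private final Map<String, Session> open = new HashMap<>();

        // Called for every incoming browser request.
        public void onRequest(String browserId, String requestId, long tsMs) {
            Session s = open.get(browserId);
            if (s != null && tsMs - s.lastSeenMs > WINDOW_MS) {
                emit(s); // previous session went quiet; flush it as one record
                s = null;
            }
            if (s == null) {
                s = new Session(browserId, tsMs);
                open.put(browserId, s);
            }
            s.requests.add(requestId);
            s.lastSeenMs = tsMs;
        }

        private void emit(Session s) {
            open.remove(s.browserId);
            System.out.println("session " + s.browserId + " -> " + s.requests);
        }

        public static void main(String[] args) {
            Sessionizer sz = new Sessionizer();
            sz.onRequest("b1", "impression", 0);
            sz.onRequest("b1", "click", 5_000);
            sz.onRequest("b1", "view", 40 * 60 * 1000L); // 40 min later: starts a new session
        }
    }

Linking 5+ requests over 15-30 minutes thus reduces to keeping per-browser state alive across that window, which is exactly what makes the batch-to-streaming transition non-trivial at 200K QPS.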
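At the pipeline level, the Storm side can be sketched as a spout feeding a sessionization bolt, partitioned by browser ID so that all requests from one browser land on the same bolt instance. This is an assumed shape rather than our exact topology; RequestSpout below is a hypothetical stand-in for a Kafka spout reading the impression topic, and package names follow current org.apache.storm releases.

    import java.util.Map;
    import java.util.Random;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class ImpressionTopology {

        // Hypothetical stand-in for a Kafka spout reading raw browser requests.
        public static class RequestSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final Random random = new Random();

            @Override
            public void open(Map<String, Object> conf, TopologyContext ctx,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                collector.emit(new Values("browser-" + random.nextInt(100),
                                          System.currentTimeMillis()));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("browserId", "timestamp"));
            }
        }

        // Sessionization bolt: fields grouping guarantees every request from
        // a given browser reaches the same instance, so per-browser session
        // state stays local (with Cassandra as the durable copy).
        public static class SessionBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String browserId = tuple.getStringByField("browserId");
                System.out.println("request from " + browserId);
                // hand off to a Sessionizer-like component here
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sessionRecord"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("requests", new RequestSpout(), 4);
            builder.setBolt("sessionize", new SessionBolt(), 8)
                   .fieldsGrouping("requests", new Fields("browserId"));

            Config conf = new Config();
            conf.setNumWorkers(2);
            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("sessionization", conf, builder.createTopology());
                Thread.sleep(10_000); // let the local cluster run briefly
            }
        }
    }

Fields grouping on browserId is what makes stateful sessionization feasible under this fan-in: the stream is partitioned the same way the state is.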
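The "joining multiple streams in memory using Cassandra" idea can likewise be sketched as a TTL-bounded join buffer: each feed writes its half of the record keyed by the join key, and whichever half arrives second reads the other back and emits the joined result. The keyspace, table, and feed names here are hypothetical; the calls are from the DataStax Java driver (3.x).

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    // Sketch of a two-feed join buffered in Cassandra. Assumed schema:
    //   CREATE TABLE analytics.join_buffer (
    //     impression_id text, feed text, payload text,
    //     PRIMARY KEY (impression_id, feed));
    public class StreamJoiner implements AutoCloseable {
        private final Cluster cluster;
        private final Session session;
        private final PreparedStatement insert;
        private final PreparedStatement lookup;

        public StreamJoiner(String contactPoint) {
            cluster = Cluster.builder().addContactPoint(contactPoint).build();
            session = cluster.connect("analytics");
            // TTL bounds state growth: halves that never find a match
            // expire after an hour instead of accumulating forever.
            insert = session.prepare("INSERT INTO join_buffer (impression_id, feed, payload)"
                    + " VALUES (?, ?, ?) USING TTL 3600");
            lookup = session.prepare(
                    "SELECT payload FROM join_buffer WHERE impression_id = ? AND feed = ?");
        }

        // Called for every event from either feed ("impression" or "activity").
        public void onEvent(String impressionId, String feed, String payload) {
            session.execute(insert.bind(impressionId, feed, payload));
            String other = feed.equals("impression") ? "activity" : "impression";
            Row match = session.execute(lookup.bind(impressionId, other)).one();
            if (match != null) {
                // Note: near-simultaneous arrivals on both feeds can emit the
                // join twice; deduplicate downstream if exactly-once matters.
                emitJoined(impressionId, payload, match.getString("payload"));
            }
        }

        private void emitJoined(String id, String a, String b) {
            System.out.println("joined " + id + ": " + a + " + " + b);
        }

        @Override
        public void close() {
            session.close();
            cluster.close();
        }
    }

Because the buffer lives in Cassandra rather than in bolt memory, a restarted worker loses no pending join state, which is one way to address the recovery-time and data-loss considerations listed above.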