Mining Big Data in Real Time

Post on 08-May-2015

description

Analyzing streaming data in real time is becoming the fastest and most efficient way to obtain useful knowledge about what is happening now, allowing organizations to react quickly when problems appear and to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years: every two days we now create as much data as we created from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis. We discuss current and future trends in mining evolving data streams, and the challenges the field will have to overcome in the coming years.

transcript

Mining Big Data in Real Time

Albert Bifet

Turing/SLAIS 2012 Conference

BIG Data

BIG DATA: Measure and React

Motivation

Source: IDC’s Digital Universe Study (EMC), June 2011

Data is growing

Motivation

Memory unit        Size     Binary size
kilobyte (kB/KB)   10^3     2^10
megabyte (MB)      10^6     2^20
gigabyte (GB)      10^9     2^30
terabyte (TB)      10^12    2^40
petabyte (PB)      10^15    2^50
exabyte (EB)       10^18    2^60
zettabyte (ZB)     10^21    2^70
yottabyte (YB)     10^24    2^80

Data is growing
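The two size columns of the memory unit table follow a simple rule: the n-th unit is 10^(3n) bytes in the decimal convention and 2^(10n) bytes in the binary one. The short Python sketch below recomputes both columns and shows how the gap between them widens; the names `UNITS` and `unit_sizes` are illustrative only.

```python
# Decimal and binary sizes for the memory units, kilobyte to yottabyte.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def unit_sizes():
    """Return (name, decimal_bytes, binary_bytes) for each unit."""
    return [(name, 10 ** (3 * i), 2 ** (10 * i))
            for i, name in enumerate(UNITS, start=1)]

for name, decimal, binary in unit_sizes():
    # The binary size exceeds the decimal one by a growing factor.
    print(f"{name:<10} binary/decimal = {binary / decimal:.3f}")
```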

Streaming Data

Big Data & Real Time

Big Data

McKinsey Global Institute (MGI) Report on Big Data, 2011.

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

BIG Data

- Volume
- Variety
- Velocity

3 Vs

Methodology

Sampling and distributed systems
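The slide names sampling without saying which method is meant. As one possible illustration, reservoir sampling (an assumption here, not something the talk specifies) keeps a uniform random sample of fixed size from a stream of unknown length, using memory proportional to the sample size only. The function name and parameters below are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=5, seed=42)
print(sample)
```

The stream is read once and never stored, which matches the constraint that time and memory are the limiting resources.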

Methodology

Paolo Boldi, Facebook: Four degrees of separation

Big Data does not need big machines, it needs big intelligence

Real time analytics

We want to analyze what is happening now.

Time and Memory

Number 8 Wire Mentality

Time and memory are the resource dimensions of the process.

Algorithms

Classification, Regression, Clustering, Frequent Pattern Mining.

Applications

- sensor data: industry, cities
- telecom data
- social networks: Twitter, Facebook, Yahoo
- marketing: sales business

Data may come from humans, sensors, or machines.

New applications: social networks

Twitter: A Massive Data Stream

- Micro-blogging service
- Built to discover what is happening at any moment in time, anywhere in the world.
- 3 billion requests a day via its API.

MOA-TweetReader: a real-time system to

- read tweets in real time
- detect changes
- find the terms whose frequency changed
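The last task, finding terms whose frequency changed, can be sketched by comparing relative term frequencies in an older and a newer window of tweets. This is not MOA-TweetReader's actual method, which the slides do not describe; the function names and the `min_change` threshold are assumptions made for illustration.

```python
from collections import Counter

def term_frequencies(tweets):
    """Relative frequency of each whitespace-separated term in a window."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}

def changed_terms(old_tweets, new_tweets, min_change=0.05):
    """Terms whose relative frequency moved by at least min_change."""
    old = term_frequencies(old_tweets)
    new = term_frequencies(new_tweets)
    changes = {}
    for term in set(old) | set(new):
        delta = new.get(term, 0.0) - old.get(term, 0.0)
        if abs(delta) >= min_change:
            changes[term] = delta
    # Largest absolute change first.
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)

print(changed_terms(["rain today", "rain again"], ["goal scored", "goal goal"]))
```

A real-time system would update the counts incrementally instead of recomputing both windows, but the comparison step is the same.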

Sentiment Analysis on Twitter

Sentiment analysis: classifying messages into two categories depending on whether they convey positive or negative feelings.

Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification.

Positive emoticons   Negative emoticons
:)                   :(
:-)                  :-(
: )                  : (
:D
=)

Table: List of positive and negative emoticons.
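Using emoticons as class labels can be sketched as a simple lookup over the emoticons listed in the table. The function name is hypothetical, and leaving out messages that contain both kinds of emoticon (or none) is an assumption of this sketch, not something the slides state.

```python
# Emoticon lists taken from the table of positive and negative emoticons.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")

def emoticon_label(message):
    """Return 'positive', 'negative', or None when no clear label exists."""
    has_pos = any(e in message for e in POSITIVE)
    has_neg = any(e in message for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or a mix of both: leave unlabeled.
        return None
    return "positive" if has_pos else "negative"

for text in ["great match :)", "missed the bus :-(", "just landed"]:
    print(text, "->", emoticon_label(text))
```

Labels obtained this way cost nothing to collect, which is what makes them attractive for training a classifier on a fast stream.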

New problem: structured classification

New methods for structured classification

[Figure: labeled trees with nodes A, B, C, D as inputs and outputs of a classifier; a,b → class1, class2]

- sequences, trees, graphs
- frequent pattern mining techniques
- multi-label data mining
- Example: Lord of the Rings → Action, Adventure, Fantasy

New Techniques: Distributed Systems

Hadoop, S4 and Storm

Hadoop

Hadoop architecture

Apache Mahout

Mahout: open source framework

Pig

Pig: Similar to SQL

Pig

    A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
    B = GROUP A BY f1;
    -- COUNT takes a bag, so count the tuples of A in each group
    C = FOREACH B GENERATE COUNT(A);
    DUMP C;
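For readers who do not know Pig, the script loads rows with three integer fields, groups them by the first field, and counts the rows in each group. A rough single-machine equivalent in Python is sketched below; the sample rows and the function name are made up for illustration.

```python
from collections import Counter

def count_by_first_field(rows):
    """Group rows by their first field and count the rows in each group."""
    return Counter(f1 for f1, _f2, _f3 in rows)

# Hypothetical contents of the 'data' file: one (f1, f2, f3) tuple per line.
rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
print(count_by_first_field(rows))
```

Pig compiles the same logic into MapReduce jobs so that the grouping and counting run across a Hadoop cluster.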

Apache S4

Storm

Storm from Twitter

Storm

Stream, Spout, Bolt, Topology
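The four Storm concepts can be pictured with a single-process analogy: a stream is an unbounded sequence of tuples, a spout is a source of a stream, a bolt consumes streams and emits new ones, and a topology is the wiring between them. The sketch below uses Python generators and is not Storm's API; all names are hypothetical.

```python
from collections import Counter

def sentence_spout(sentences):
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in sentences:
        yield sentence

def split_bolt(stream):
    """Bolt: consumes a stream of sentences and emits a stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: keeps running state and emits the updated count per word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

def run_topology(sentences):
    """Topology: spout -> split bolt -> count bolt."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        final[word] = count
    return final

print(run_topology(["big data real time", "real time analytics"]))
```

In Storm itself each spout and bolt runs as many parallel tasks spread over a cluster; the generators here only show the data flow.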

Tools

“Lambda Architecture” (Runaway complexity in Big Data, Nathan Marz, 2012)

[Figure: all data feeds a precomputed batch view (Hadoop); the new data stream feeds a precomputed realtime view (Storm); a query reads both views.]

Other tools shown: ElephantDB, Voldemort, Cassandra, Riak, HBase, Kafka
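The query path of the Lambda Architecture can be sketched as follows: a batch view is precomputed from all data, a realtime view covers only the data that arrived since the last batch run, and a query combines the two. The example below uses page-view counts as a made-up workload; the function names and data are assumptions.

```python
from collections import Counter

def build_view(events):
    """Precompute a view: here, a count per event key."""
    return Counter(events)

def query(key, batch_view, realtime_view):
    """Answer a query by merging the batch and realtime views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

# Hypothetical data: events already covered by the last batch run,
# and events that arrived afterwards on the new data stream.
all_data = ["page_a", "page_b", "page_a"]
new_stream = ["page_a", "page_c"]

batch_view = build_view(all_data)       # the batch layer's role (Hadoop)
realtime_view = build_view(new_stream)  # the realtime layer's role (Storm)

print(query("page_a", batch_view, realtime_view))
```

When the next batch run absorbs the recent events, the realtime view for that period can be discarded, which keeps the incremental part small.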

Data Streams

Big Data & Real Time

Thanks!