SALSA Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data 1 Thesis Defense Xiaoming Gao Advisor: Prof. Judy Qiu 01/21/2015

Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data


Thesis Defense

Xiaoming Gao

Advisor: Prof. Judy Qiu

01/21/2015


Outline

Introduction - emerging Big Data characteristics and challenges

Storage substrate: challenge and contributions

Batch analysis module: challenge and contributions

Streaming analysis module: challenge and contributions

Summary


Introduction – Big Data Challenges

Volume: Large size of datasets (TBs, PBs, …).

Velocity: Data size is a function of time. Moreover, speed may also be a function of time.

Variety: Various types of structured and unstructured data.

Big Data - emerging characteristics

Velocity

Volume

Variety

Streaming data becoming more and more important

Analyses focusing on “interesting” data subsets

Sensor data streams, stock price streams, etc.

Gene sequence analysis, news data analysis, etc.

10s to 100s of millions of streaming social updates per day

Data subsets about social events/activities

Social media data (e.g. Twitter streaming API)

Social media data – an example data record

{
  "text": "RT @sengineland: My Single Best... ",
  "created_at": "Fri Apr 15 23:37:26 +0000 2011",
  "retweet_count": 0,
  "id_str": "59037647649259521",
  "entities": {
    "user_mentions": [{
      "screen_name": "sengineland",
      "id_str": "1059801",
      "name": "Search Engine Land"
    }],
    "hashtags": [],
    "urls": [{
      "url": "http:\/\/selnd.com\/e2QPS1",
      "expanded_url": null
    }]
  },
  "user": {
    "created_at": "Sat Jan 22 18:39:46 +0000 2011",
    "friends_count": 63,
    "id_str": "241622902",
    ...
  },
  "retweeted_status": {
    "text": "My Single Best... ",
    "created_at": "Fri Apr 15 21:40:10 +0000 2011",
    "id_str": "59008136320786432",
    ...
  },
  ...
}

Introduction – thesis research goal

A scalable architecture in the Cloud to address related research challenges

Storage substrate

Batch analysis module

Streaming analysis module


Outline

Introduction - emerging Big Data characteristics and challenges

Storage substrate: challenge and contributions

Batch analysis module: challenge and contributions

Streaming analysis module: challenge and contributions

Summary


Storage substrate - requirements

Scalable storage solution based on data characteristics:
- large size, high speed
- fine-grained data records with evolving structures
- mostly write-once-read-many

Proper indexing to support efficient queries:
- constraints on both text content and social context

Varied level of indexing support on NoSQL databases

Single-dimensional indices:
- Sorted (B+ tree): single-field, composite
- Unsorted (hash): single-field
- Inverted index (Lucene): single-field, composite

Multidimensional indices:
- R tree (PostGIS)
- K-d tree (SciDB)
- Quad tree

HBase

Cassandra

Riak

MongoDB

Yes Yes

Yes Yes

Yes Yes Yes Yes Yes

Storage substrate – query characteristics

A query q = {t, s, g}
- t: constraints on text content, e.g. #euro2012, @iusoic
- s: constraints on social context, e.g. [06/08/12, 07/01/12]
- g: a tag telling what social information to get, e.g. retweet network

Example queries:
- get-tweets-with-text(occupy*; [10/08/11, 12/01/11])
- meme-cooccurrence-count(#occupy; [10/08/11, 12/01/11])
- get-retweet-edges(occupyIN,occupyWS; [10/08/11, 12/01/11])
- user-post-count(occupyIN,occupyWS; [10/08/11, 12/01/11])

Query evaluation with traditional text indices

get_tweets_with_text(occupy*, time_window)

- Text index → IDs of tweets for occupy*
- Time index → IDs of tweets for time window
- The two ID sets are combined to produce the results

Text index (10^3 ~ 10^4 tweet IDs per day):
- occupyIN: 1234 2346 … (tweet id)
- occupyWS: …

Time index (~10^8 tweet IDs per day):
- 2011-10-01: 7890 3345 … (tweet id)
- 2011-10-02: …

Problem:
- Complexity of query evaluation = O(max(|textIndex|, |timeIndex|))
- Time window is normally in months – large
- Stores frequency and position information for ranking top-N “most relevant” documents
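The cost argument on this slide can be illustrated with a minimal Python sketch. It is not the Riak or Lucene implementation; the dictionaries `text_index` and `time_index` and every tweet ID in them are toy assumptions that only mimic the two posting lists shown above.

```python
# Toy stand-ins for the two separate indices; all IDs here are illustrative.
text_index = {
    "occupyIN": {1234, 2346},
    "occupyWS": {3417},
}
time_index = {
    "2011-10-01": {7890, 3345, 1234},
    "2011-10-02": {3417, 5566},
}


def get_tweets_with_text(prefix, days):
    """Evaluate a text + time-window query by intersecting two ID sets."""
    ids_by_text = set()
    for term, ids in text_index.items():
        if term.startswith(prefix):
            ids_by_text |= ids

    ids_by_time = set()
    for day in days:
        ids_by_time |= time_index.get(day, set())

    # Both sets are materialized in full before the intersection, so the
    # work is dominated by the larger index, which here is the time index.
    return ids_by_text & ids_by_time


result = get_tweets_with_text("occupy", ["2011-10-01", "2011-10-02"])
print(sorted(result))
```

With a time window of months, `ids_by_time` grows to the size of the whole stream for that period even when only a few thousand tweets match the text constraint.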

More suitable index structure

Example index (from the slide figure): keys occupyIN and occupyWS; entries 2011-10-01|1234, 2011-10-02|3417, 2011-10-03|4532; included fields userID: 333, userID: 444, userID: 555.

- Index on multiple (text and non-text) columns
- Included columns
- Index on computed columns

Customizability is necessary!

Customizable indexing framework

Abstract index structure (figure): index keys key1 … key4, each pointing to a list of entries of the form (Entry ID, Field1, Field2).

- A sorted list of index keys
- Each key associated with multiple entries sorted by unique entry IDs
- Each entry contains multiple additional fields
- Key, entry ID, and entry fields are customizable through a configuration file
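The abstract structure described above can be sketched in a few lines of Python. This is an in-memory illustration only, not the IndexedHBase implementation; the class name `SortedIndex`, its methods, and the sample records are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right, insort


class SortedIndex:
    """Sorted index keys, each mapping entry IDs to additional fields."""

    def __init__(self):
        self.keys = []     # index keys, kept in sorted order
        self.entries = {}  # key -> {entry_id: fields}

    def insert(self, key, entry_id, fields):
        if key not in self.entries:
            insort(self.keys, key)
            self.entries[key] = {}
        self.entries[key][entry_id] = fields

    def range_scan(self, low, high):
        """Yield (key, entry_id, fields) for low <= key <= high, in order."""
        start = bisect_left(self.keys, low)
        end = bisect_right(self.keys, high)
        for key in self.keys[start:end]:
            for entry_id in sorted(self.entries[key]):
                yield key, entry_id, self.entries[key][entry_id]


index = SortedIndex()
index.insert("occupyWS", 3417, {"time": "2011-10-02"})
index.insert("occupyIN", 2346, {"time": "2011-10-01"})
index.insert("occupyIN", 1234, {"time": "2011-10-01"})
index.insert("american", 339330, {"time": "2012-09-24"})

# A prefix query such as occupy* becomes a range scan over the sorted keys.
hits = list(index.range_scan("occupy", "occupy~"))
```

Because the creation time is stored as an entry field, a time-window constraint can be checked while scanning the matching keys, without consulting a separate time index.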

Demonstration of customizability

• Inverted index for text data - store frequency/position for ranking
  - american: (doc id 339330, frequency 1), (doc id, frequency)
  - outrage: (doc id 339330, frequency 1), (doc id, frequency)

• Composite index on both text and non-text fields - not supported by any current NoSQL databases
  - occupyIN: (tweet id 339330, time 2012-09-24), (tweet id, time)
  - occupyWS: (tweet id 339330, time 2012-09-24), (tweet id, time)

• Join index: Get-tweets-by-user-desc(iu*, [2014-05-01, 2014-05-28])
  User-description-tweet Index (entry ID: Tweet ID; fields: Uid, time)
  - iusoic: (Tweet ID 123456, Uid 565, 2014-04-02)
  - ivy: (Tweet ID 228765, Uid 676, 2014-05-02)

Implementation

- Requirements for scalable index storage and efficient indexing speed
- NoSQL databases: scalable storage and efficient random access for their data model
- Mapping abstract index structure to underlying data model
- Batch/online indexing mechanisms and parallel query evaluation strategies

Figure: the abstract Text Index (keys "american" and "outrage", each with entries of the form Entry ID 339330, Field1 2012-09-24, Field2) mapped to a Text Index Table with one row per key and, under "entries", one column per entry ID (e.g. 339330) holding the entry fields (2012-09-24, Field2).

Data loading and query performance

Real Twitter data and queries from Truthy

Historical data loading comparison

• One month’s data in .json.gz files

• IndexedHBase: MapReduce program for parallel loading and indexing

• Riak: distributed loaders using native text indexing support (distributed Lucene)

System | Loading time (hours) | Loaded total data size (GB) | Loaded index data size (GB)
Riak | 294.11 | 3258 | 667
IndexedHBase | 45.47 | 1167 | 212
Comparative ratio of Riak / IndexedHBase | 6.47 | 2.79 | 3.15

Query evaluation performance comparison

Comparison with related work

Temporal-text queries, longitudinal analytics on web archives, etc.

Online text indexing and incremental index maintenance

O2, PostgreSQL, ANDA

Hadoop++, HAIL, Eagle-Eyed Elephant

Xiaoming Gao, Vaibhav Nachankar, Judy Qiu. Experimenting Lucene Index on HBase in an HPC Environment. Proc. 1st workshop on High-Performance Computing meets Databases (HPCDB 2011) at Supercomputing 2011.

Xiaoming Gao, Evan Roth, Karissa McKelvey, Clayton Davis, Andrew Younge, Emilio Ferrara, Filippo Menczer, and Judy Qiu. Supporting a Social Media Observatory with Customizable Index Structures - Architecture and Performance. Book chapter in Cloud Computing for Data Intensive Applications, Springer Publisher, 2015.


Outline

Introduction - emerging Big Data characteristics and challenges

Storage substrate: challenge and contributions

Batch analysis module: challenge and contributions

Streaming analysis module: challenge and contributions

Summary


Efficient execution of integrated workflows

Characteristics of workflows:
- multiple stages and analysis tasks
- computation/communication patterns suitable for different frameworks
- requirement for dynamic adoption of various processing frameworks
- requirement for efficient individual algorithms

Integrated analysis stack based on YARN

- Dynamic adoption of different processing frameworks
- Integrates queries and analysis tasks

Analysis algorithms for composing workflows

Algorithm | Key feature | Time complexity
Related hashtag mining | Mostly relies on index; only accesses a small portion of original data. | O(H*M + N)
Meme daily frequency generation | Totally based on parallel scan of customized index. | O(N)
Domain name entropy computation | Totally based on parallel scan of customized index. | O(N)
Graph layout | First parallel implementation on iterative MapReduce; near-linear scalability. | O(I*N^2)

Related Hashtag mining

- Jaccard coefficient: $\sigma(S, T) = \frac{|S \cap T|}{|S \cup T|}$
- S: set of tweets containing seed hashtag s
- T: set of tweets containing target hashtag t
- σ > threshold means t is related to s

Example (2012 presidential election, hashtags #p2, #mitt2012, #vote, #obama): mappers scan the tweet IDs stored under the seed hashtag #p2 in the meme index table and collect the co-occurring hashtags (#vote, …, #obama); reducers output the related hashtags with their coefficients, e.g. #vote: 0.54…, #obama: 0.38…
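As a small worked illustration of the Jaccard test above, the following Python sketch computes σ for each candidate hashtag against a seed. It runs sequentially on in-memory sets rather than as the MapReduce job over the meme index table; the function names, the sample tweet IDs, and the threshold value are assumptions for the example.

```python
def jaccard(s, t):
    """Jaccard coefficient |S ∩ T| / |S ∪ T| of two sets of tweet IDs."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def related_hashtags(seed, tweets_by_hashtag, threshold):
    """Return {hashtag: sigma} for hashtags whose sigma exceeds the threshold."""
    seed_tweets = tweets_by_hashtag[seed]
    related = {}
    for tag, tweet_ids in tweets_by_hashtag.items():
        if tag == seed:
            continue
        sigma = jaccard(seed_tweets, tweet_ids)
        if sigma > threshold:
            related[tag] = sigma
    return related


# Illustrative data only: hashtag -> set of tweet IDs containing it.
tweets_by_hashtag = {
    "#p2": {1, 2, 3, 4},
    "#vote": {2, 3, 4, 5},
    "#obama": {4, 6, 7},
    "#nfl": {8, 9},
}
related = related_hashtags("#p2", tweets_by_hashtag, threshold=0.1)
print(related)
```

Here #vote shares three of five distinct tweets with #p2 and #obama shares one of six, so both pass the 0.1 threshold, while #nfl shares none and is dropped.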

Domain name entropy computation

- For each user, find the domain names posted during certain time
- Compute entropy based on the domain name distribution

Meme Index Table (2012-06):
- “http://truthy.indiana.edu/” → tweets: 12393 13496 … (tweet ids); (time: user ID) 2012-06-01: 3213409, 2012-06-05: 6918355, …

Map() output: 3213409, truthy.indiana.edu …

Reduce() output (user ID, entropy): 3213409, 0.693147 …
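The map and reduce steps above can be mimicked in plain Python. This sketch is sequential and assumes Shannon entropy with the natural logarithm, a choice inferred from the slide's example value 0.693147, which equals ln 2 (the entropy of two equally frequent domains). The `posts` list, including the domain `example.org`, is made up for illustration.

```python
import math
from collections import Counter, defaultdict


def domain_entropy(domains):
    """Shannon entropy (natural log) of a user's domain name distribution."""
    counts = Counter(domains)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# "Map" output: (user ID, domain name) pairs extracted from index entries.
posts = [
    (3213409, "truthy.indiana.edu"),
    (3213409, "example.org"),
    (6918355, "truthy.indiana.edu"),
]

# "Reduce": group the domains by user, then compute one entropy per user.
domains_by_user = defaultdict(list)
for user_id, domain in posts:
    domains_by_user[user_id].append(domain)

entropy_by_user = {u: domain_entropy(d) for u, d in domains_by_user.items()}
print(entropy_by_user)
```

A user who posts links to a single domain gets entropy 0, and the value grows as the user's links spread more evenly over more domains.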

Force-directed graph layout algorithm

Iterative MapReduce implementation of Fruchterman-Reingold

- force-directed graph layout algorithm, complexity O(I * N^2)

- Twister-Ivy (now Harp)

- parallel force computation within iteration

- chain model broadcast across iteration
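To show where the O(I * N^2) cost comes from, here is a sequential Python sketch of a single Fruchterman-Reingold iteration using the standard force definitions (repulsion k^2/d between every pair, attraction d^2/k along edges, movement capped by a temperature). It is not the Twister-Ivy implementation; the function name, parameters, and the three-node graph are assumptions for illustration.

```python
import math


def fr_iteration(pos, edges, k, temperature):
    """Run one Fruchterman-Reingold iteration and return the new positions."""
    nodes = list(pos)
    disp = {v: [0.0, 0.0] for v in nodes}

    # Repulsive forces between every pair of nodes: the O(N^2) part.
    for i, v in enumerate(nodes):
        for u in nodes[i + 1:]:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = k * k / dist
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force

    # Attractive forces along edges pull connected nodes together.
    for v, u in edges:
        dx = pos[v][0] - pos[u][0]
        dy = pos[v][1] - pos[u][1]
        dist = math.hypot(dx, dy) or 1e-9
        force = dist * dist / k
        disp[v][0] -= dx / dist * force
        disp[v][1] -= dy / dist * force
        disp[u][0] += dx / dist * force
        disp[u][1] += dy / dist * force

    # Move each node along its displacement, capped by the temperature.
    new_pos = {}
    for v in nodes:
        dx, dy = disp[v]
        length = math.hypot(dx, dy) or 1e-9
        step = min(length, temperature)
        new_pos[v] = (pos[v][0] + dx / length * step,
                      pos[v][1] + dy / length * step)
    return new_pos


positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (10.0, 0.0)}
new_positions = fr_iteration(positions, edges=[("b", "c")], k=1.0, temperature=0.1)
```

The pairwise repulsion loop is independent for each node, which is the part a parallel implementation can split across workers within an iteration; the updated positions then have to be shared before the next iteration starts.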

Composition of workflows

Reproduced results for 2010, extended to 2012 with a 20 times larger network

Performance analysis

Performance analysis

- Near linear scalability for Fruchterman-Reingold on Twister-Ivy
- Per-iteration on sequential R for 2012 network: 6035 seconds

Xiaoming Gao, Judy Qiu. Social Media Data Analysis with IndexedHBase and Iterative MapReduce. MTAGS 2013.

Xiaoming Gao, Judy Qiu. Supporting Queries and Analyses of Large-Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL Databases. CCGRID 2014.


Outline

Introduction - emerging Big Data characteristics and challenges

Storage substrate: challenge and contributions

Batch analysis module: challenge and contributions

Streaming analysis module: challenge and contributions

Summary


Streaming analysis module - introduction

Non-trivial parallel stream processing algorithms with global synchronization

Clustering of social media streams

Recent progress in learning data representations and similarity metrics

High-dimensional vectors: textual and network information

Expensive similarity computation: 43.4 hours to cluster 1 hour’s data with sequential algorithm

Goal: meet real-time constraint through parallelization

Sequential algorithm for clustering tweet stream

Online K-Means with sliding time window and outlier detection

Group tweets as protomemes: hashtags, mentions, URLs, and phrases.

Cluster protomemes using similarity measurement:

- Common user similarity:

- Common tweet ID similarity:

- Content similarity:

- Diffusion similarity:

- Combinations:

(Posting + mentioned + retweeting)

Online K-Means clustering

(1) Slide time window by one time step

(2) Delete old protomemes out of time window from their clusters

(3) Generate protomemes for tweets in this step

(4) For each new protomeme:

Sequential clustering algorithm

Final step statistics for a sequential run over 6 minutes of data:

Time Step Length (s) | Total Length of Content Vector | Similarity Compute Time (s) | Centroids Update Time (s)
10 | 47749 | 33.305 | 0.068
20 | 76146 | 78.778 | 0.113
30 | 128521 | 209.013 | 0.213

Parallelization with Storm - challenges

DAG organization of parallel workers: hard to synchronize cluster information

Topology components (figure): tweet stream, Protomeme Generator Spout, Clustering Bolts (several per worker process), Synchronization Coordinator Bolt, ActiveMQ Broker

- Spout initiated synchronization
- Clustering bolt initiated synchronization
- Sync coordinator initiated synchronization

Parallelization with Storm - challenges

Data point 1:
- Content_Vector: [“step”:1, “time”:1, “nation”: 1, “ram”:1]
- Diffusion_Vector: ……

Data point 2:
- Content_Vector: [“lovin”:1, “support”:1, “vcu”:1, “ram”:1]
- Diffusion_Vector: ……

Centroid of the cluster:
- Content_Vector: [“step”:0.5, “time”:0.5, “nation”: 0.5, “ram”:1.0, “lovin”:0.5, “support”:0.5, “vcu”:0.5]
- Diffusion_Vector: ……

Sparsity of high-dimensional vectors makes traditional synchronization expensive

- Cluster-delta synchronization strategy
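The centroid in the example above is the element-wise mean of the two sparse content vectors. A minimal Python sketch, assuming each sparse vector is represented as a dictionary from term to weight (the function name `centroid` is illustrative), reproduces the numbers on the slide:

```python
from collections import defaultdict


def centroid(points):
    """Element-wise mean of sparse vectors stored as {term: weight} dicts."""
    totals = defaultdict(float)
    for point in points:
        for term, weight in point.items():
            totals[term] += weight
    n = len(points)
    return {term: total / n for term, total in totals.items()}


point1 = {"step": 1, "time": 1, "nation": 1, "ram": 1}
point2 = {"lovin": 1, "support": 1, "vcu": 1, "ram": 1}

center = centroid([point1, point2])
print(len(point1), len(point2), len(center))
```

Each data point has four non-zero terms but the centroid already has seven, because it holds the union of the terms of its members. With many protomemes per cluster the centroid vectors become long, which is why shipping full centroids on every synchronization is costly.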

Solution – enhanced Storm topology

Topology components (figure): tweet stream, Protomeme Generator Spout, Clustering Bolts (several per worker process), Synchronization Coordinator Bolt, ActiveMQ Broker

Bootstrap Information: produced by a Sequential or Parallel Batch Clustering Algorithm

Message types shown in the figure: SYNCINIT, CDELTAS, PMADD, OUTLIER, SYNCREQ

Scalability comparison

- 1 hour’s data for testing, first 10 mins for bootstrap
- 33 mins to process 50 mins’ data

Scalability comparison

Full-centroids synchronization

Number of clustering bolts | Total processing time (sec) | Compute time / sync time | Sync time per batch (sec) | Avg. length of sync message
3 | 67603 | 30.3 | 6.71 | 22,113,520
6 | 35207 | 15.1 | 6.71 | 21,595,499
12 | 19295 | 7.0 | 7.32 | 22,066,473
24 | 11341 | 3.2 | 8.24 | 22,319,413
48 | 7395 | 1.5 | 9.15 | 21,489,950
96 | 6965 | 0.7 | 12.93 | 21,536,799

Cluster-delta synchronization

Number of clustering bolts | Total processing time (sec) | Compute time / sync time | Sync time per batch (sec) | Avg. length of sync message
3 | 50381 | 252.6 | 0.62 | 2,525,896
6 | 22949 | 96.4 | 0.73 | 2,529,779
12 | 11560 | 42.2 | 0.81 | 2,532,349
24 | 6221 | 21.7 | 0.81 | 2,544,095
48 | 3490 | 8.4 | 1.08 | 2,559,221
96 | 2494 | 2.5 | 2.17 | 2,590,857
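One way to read the two tables is to compute the speedup of each configuration relative to its own 3-bolt run. The short Python sketch below does that arithmetic using the total processing times reported on this slide; the variable and function names are illustrative.

```python
# (number of clustering bolts, total processing time in seconds) from the slide.
full_centroids = [(3, 67603), (6, 35207), (12, 19295),
                  (24, 11341), (48, 7395), (96, 6965)]
cluster_delta = [(3, 50381), (6, 22949), (12, 11560),
                 (24, 6221), (48, 3490), (96, 2494)]


def speedups(rows):
    """Speedup of each run relative to the first (smallest) configuration."""
    _, base_time = rows[0]
    return {bolts: base_time / seconds for bolts, seconds in rows}


full_speedup = speedups(full_centroids)
delta_speedup = speedups(cluster_delta)

for bolts in (3, 6, 12, 24, 48, 96):
    print(bolts, round(full_speedup[bolts], 2), round(delta_speedup[bolts], 2))
```

Going from 3 to 96 bolts, the full-centroids strategy speeds up by roughly 9.7 times while the cluster-delta strategy speeds up by roughly 20.2 times, which matches the trend in the compute time / sync time column: with full centroids, synchronization outweighs computation at 96 bolts.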

Comparison with related work

Projected/subspace clustering, density-based approaches [Aggarwal 04], [Amini 12].

Parallel sequential leader clustering over tweet streams [Wu 14]

Aurora, Borealis. [Cherniack 03], [Abadi 05].

Xiaoming Gao, Emilio Ferrara, Judy Qiu. Parallel Clustering of High-Dimensional Social Media Data Streams. To appear at 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2015).

Summary of contributions

Storage substrate (2011-2012)
- customizable indexing framework over NoSQL databases
- data loading/indexing faster by multiple times
- queries faster by one to two orders of magnitude

Batch analysis module (2013-2014)
- integrated analysis stack based on YARN
- index-based analysis algorithms multiple to 10s of times faster than data scanning solutions
- first iterative MapReduce Fruchterman-Reingold, near-linear scalability

Streaming analysis module (2014-2015)
- novel cluster-delta synchronization to achieve scalability
- real-time processing of 10% Twitter stream

Publications

Thesis Related Publications:

[1] Xiaoming Gao, Emilio Ferrara, Judy Qiu. Parallel Clustering of High-Dimensional Social Media Data Streams. To appear at CCGRID 2015.
[2] Xiaoming Gao, Judy Qiu. Supporting Queries and Analyses of Large-Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL Databases. CCGRID 2014.
[3] Xiaoming Gao, Evan Roth, Karissa McKelvey, Clayton Davis, Andrew Younge, Emilio Ferrara, Filippo Menczer, and Judy Qiu. Supporting a Social Media Observatory with Customizable Index Structures - Architecture and Performance. Book chapter in Cloud Computing for Data Intensive Applications.
[4] Xiaoming Gao, Judy Qiu. Social Media Data Analysis with IndexedHBase and Iterative MapReduce. MTAGS ’13 at Super Computing 2013.
[5] Xiaoming Gao, Vaibhav Nachankar, Judy Qiu. Experimenting Lucene Index on HBase in an HPC Environment. HPCDB ’11 at Supercomputing 2011.

Other Publications:

[6] Xiaoming Gao, Yu Ma, Marlon Pierce, Mike Lowe, Geoffrey Fox. Building a Distributed Block Storage System for Cloud Infrastructure. CloudCom 2010.
[7] Xiaoming Gao, Mike Lowe, Yu Ma, Marlon Pierce. Supporting Cloud Computing with the Virtual Block Store System. eScience 2009.
[8] Robert Granat, Xiaoming Gao, Marlon Pierce. The QuakeSim Web Portal Environment for GPS Data Analysis. Proc. Workshop on Sensor Networks for Earth and Space Science Applications, 2009.
[9] Yehuda Bock, Brendan Crowell, Linette Prawirodirdjo, Paul Jamason, Ruey-Juin Chang, Peng Fang, Melinda Squibb, Marlon E. Pierce, Xiaoming Gao, Frank Webb, Sharon Kedar, Robert Granat, Jay Parker, Danan Dong. Modeling and On-the-Fly Solutions for Solid Earth Sciences: Web Services and Data Portal for Earthquake Early Warning System. Proc. IEEE International Geoscience & Remote Sensing Symposium, 2008.
[10] Marlon E. Pierce, Xiaoming Gao, Sangmi L. Pallickara, Zhenhua Guo, Geoffrey C. Fox. QuakeSim Portal and Services: New Approaches to Science Gateway Development Techniques. Concurrency & Computation: Practice & Experience, 2010.
[11] Marlon E. Pierce, Geoffrey C. Fox, Jong Y. Choi, Zhenhua Guo, Xiaoming Gao, and Yu Ma. Using Web 2.0 for Scientific Applications and Scientific Communities. Concurrency and Computation: Practice and Experience, 2009.

Awards and Honors:

- Contributions to the grant proposal of NSF XPS: Rapid Prototyping HPC Environment for Deep Learning.
- NSF Student Travel Grant for IEEE/ACM CCGrid 2014.
- Best poster award for "A Survey of Cloud Storage Systems" at CloudCom 2010.
- Best student poster award for "The Virtual Block Store System" at TeraGrid 2009.

Acknowledgements

Committee members: Prof. Judy Qiu, Prof. Fil Menczer, Prof. Dirk Van Gucht, Prof. Geoffrey C. Fox.

Colleagues in SALSAHPC and PTI: Bingjing Zhang, Stephen Wu, Yang Ruan, Andrew Younge, Jerome Mitchell, Saliya Ekanayake, Supun Kamburugamuve, Thomas Wiggins, Zhenghao Gu, Jaliya Ekanayake, Thilina Gunarathne, Yuduo Zhou, Fei Teng, Zhenhua Guo, Tao Huang, Marlon Pierce, Yu Ma, Jun Wang, Robert Granat.

Collaborators from CNETS: Emilio Ferrara, Clayton Davis, Mohsen JafariAsbagh, Onur Varol, Karissa McKelvey, Giovanni L. Ciampaglia, Alessandro Flammini.

Professors and staff of SOIC: Prof. Yuqing M. Wu, Koji Tanaka, Allan Streib, Rob Henderson, Gary Miksik, Lynne Mikolon, Patty Reyes-Cooksey, Becky Curtis, and Christi Pike.

My family and dear friends!

Future work

Extend customizable indexing framework to other NoSQL databases

Integrate more processing frameworks such as Giraph and Harp

Integration with high-level languages such as Pig

Integrate Harp communication into parallel stream processing

Approach the speed of full Twitter stream

Implementation on HBase - IndexedHBase

• Region split and dynamic load balancing for index table

Figure: distributed indexers write to the Text Index Table, whose regions (a - k, l - r, s - z) are hosted on region servers coordinated by the HMaster; the a - k region is split into a - f and g - k.

Scalable historical data loading

• Measure total loading time for two months’ data with different cluster sizes on Alamo

- Total data size: 719 GB compressed, ~1.3 billion tweets

- Online indexing when loading each tweet


Query evaluation time with separate meme and time indices on Riak

Query evaluation time with customized meme index on IndexedHBase


• SQL query for user-post-count:

SELECT event_memes.meme_id AS meme, events.user_id AS user, COUNT(*) AS tweetCount
FROM (SELECT meme_id
      FROM event_memes
      INNER JOIN events ON events.id = event_memes.event_id
      WHERE DATE_FORMAT(events.time_stamp, '%Y-%m-%d') BETWEEN __fromDay__ AND __toDay__
      GROUP BY meme_id
      HAVING COUNT(*) BETWEEN __minMemeSize__ AND __maxMemeSize__) MemeSize
INNER JOIN event_memes ON event_memes.meme_id = MemeSize.meme_id
INNER JOIN events ON events.id = event_memes.event_id
WHERE DATE_FORMAT(events.time_stamp, '%Y-%m-%d') BETWEEN __fromDay__ AND __toDay__
GROUP BY event_memes.meme_id, events.user_id

General Customizable Indexer

Components shown in the architecture diagram:
- Client Application, with operations insert(dataRecord), index(dataRecord), and search(indexConstraints)
- Index Configuration File
- User Defined Indexers
- Basic Index Operator and User Defined Index Operator
- HBase

Client application

Abstract data model and index structure
- Mapping to table ops: HBase
- Mapping to column family ops: Cassandra
- Mapping to document ops: MongoDB
- …

Suggested mappings for other NoSQL databases

Feature needed: Fast real time insertion and updates of index entries
- Cassandra: Yes. Index key as row key and entry ID as column name, or index key + entry ID as row key.
- Riak: Yes. Index key + entry ID as object key.
- MongoDB: Yes. Index key + entry ID as “_id” of document.

Feature needed: Fast real time read of index entries
- Cassandra: Yes. Index key as row key and entry ID as column name, or index key + entry ID as row key.
- Riak: Yes. Index key + entry ID as object key.
- MongoDB: Yes. Index key + entry ID as “_id” of document.

Feature needed: Scalable storage and access speed of index entries
- Cassandra: Yes.
- Riak: Yes.
- MongoDB: Yes.

Feature needed: Efficient range scan on index keys
- Cassandra: Yes with order preserving hash function, but “not recommended”.
- Riak: Doable with a secondary index on an attribute whose value is object key, but performance unknown.
- MongoDB: Doable with Index key + entry ID as “_id” of document, but performance unknown.

Feature needed: Efficient range scan on entry IDs
- Cassandra: Yes with order preserving hash function and index entry ID as column name.
- Riak: Doable with a secondary index on an attribute whose value is object key, but performance unknown.
- MongoDB: Doable with Index key + entry ID as “_id” of document, but performance unknown.

Customizable indexing framework

Customizability through index configuration file

<index-config>
  <source-recordset>tweets</source-recordset>
  <index-name>textIndex</index-name>
  <index-key sourcetype="full-text">{source-record}.text</index-key>
  <index-entry-id>{source-record}.id</index-entry-id>
  <index-entry-field>{source-record}.created_at</index-entry-field>
</index-config>
<index-config>
  <source-recordset>users</source-recordset>
  <index-name>snameIndex</index-name>
  <indexer-class>iu.pti.hbaseapp.truthy.UserSnameIndexer</indexer-class>
</index-config>

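As a rough illustration of how such a configuration file could be consumed, the Python sketch below reads the two index definitions shown on this slide with the standard-library XML parser. It is not the framework's own loader: the wrapping `<index-configs>` root element and the function name `load_index_configs` are assumptions added so that the fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# The two <index-config> elements from the slide, wrapped in a hypothetical
# <index-configs> root so that the text is one well-formed XML document.
CONFIG = """<index-configs>
  <index-config>
    <source-recordset>tweets</source-recordset>
    <index-name>textIndex</index-name>
    <index-key sourcetype="full-text">{source-record}.text</index-key>
    <index-entry-id>{source-record}.id</index-entry-id>
    <index-entry-field>{source-record}.created_at</index-entry-field>
  </index-config>
  <index-config>
    <source-recordset>users</source-recordset>
    <index-name>snameIndex</index-name>
    <indexer-class>iu.pti.hbaseapp.truthy.UserSnameIndexer</indexer-class>
  </index-config>
</index-configs>"""


def load_index_configs(xml_text):
    """Return one dict per <index-config>, with None for absent elements."""
    configs = []
    for node in ET.fromstring(xml_text).findall("index-config"):
        key = node.find("index-key")
        configs.append({
            "recordset": node.findtext("source-recordset"),
            "name": node.findtext("index-name"),
            "key": key.text if key is not None else None,
            "key_sourcetype": key.get("sourcetype") if key is not None else None,
            "entry_id": node.findtext("index-entry-id"),
            "entry_fields": [f.text for f in node.findall("index-entry-field")],
            "indexer_class": node.findtext("indexer-class"),
        })
    return configs


configs = load_index_configs(CONFIG)
for cfg in configs:
    print(cfg["name"], cfg["key"], cfg["indexer_class"])
```

The first definition is fully declarative (key, entry ID, and entry field are taken from fields of the source record), while the second delegates everything to a user-defined indexer class.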
Scalable indexing of streaming data

• Test potential data rate faster than current stream
• Split 2013-07-03.json.gz into fragments distributed across all nodes
• HBase cluster size: 8
• Average loading and indexing speed observed on one loader: 2ms per tweet


Storage substrate – parallel DBMS vs NoSQL

(Kyu-Young Whang, 2011 International Conference on Database Systems for Advanced Applications)


Parallel query evaluation strategy


[Figure: evaluation of get-retweet-edges(#p2, <2012-09-01, 2012-10-29>).
Tweet ID search phase: rows for #p2 in memeIndexTable-2012-09 and memeIndexTable-2012-10.
Parallel evaluation phase: Mappers -> Reducers -> output edges, e.g. "1568 -> 2334 : 8", "3677 -> 2099 : 5".]
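The two phases can be sketched in plain Python with in-memory stand-ins for the index tables and the tweet table. This is a simplified illustration, not the MapReduce implementation from the thesis: the data layout, the field names `user` and `retweeted_user`, the edge direction, and the round-robin split across mappers are all assumptions.

```python
from collections import Counter
from itertools import chain


def search_tweet_ids(index_tables, meme, start, end):
    """Phase 1: collect tweet IDs for `meme` from each monthly index table."""
    ids = []
    for table in index_tables:
        for tweet_id, created_at in table.get(meme, []):
            # ISO dates compare correctly as strings.
            if start <= created_at <= end:
                ids.append(tweet_id)
    return ids


def map_edges(tweet_ids, tweets):
    """Mapper: emit ((from_user, to_user), 1) for each retweet in its split."""
    for tid in tweet_ids:
        t = tweets[tid]
        if t.get("retweeted_user") is not None:
            yield (t["user"], t["retweeted_user"]), 1


def reduce_edges(pairs):
    """Reducer: sum the counts per edge."""
    counts = Counter()
    for edge, n in pairs:
        counts[edge] += n
    return dict(counts)


def get_retweet_edges(index_tables, tweets, meme, start, end, num_mappers=2):
    ids = search_tweet_ids(index_tables, meme, start, end)
    # Phase 2: split the IDs across mappers, then reduce their output.
    splits = [ids[i::num_mappers] for i in range(num_mappers)]
    mapped = chain.from_iterable(map_edges(s, tweets) for s in splits)
    return reduce_edges(mapped)
```

Each key in the result is an edge and each value its weight, matching the "1568 -> 2334 : 8" form shown in the figure.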


Correctness verification


• Ground truth dataset: 1 week of tweets containing trending hashtags
• Run sequential and parallel algorithms with trending hashtags removed
• Compute LFK-NMI: normalized mutual information, in [0, 1]

Parallel vs sequential: 0.728
Sequential vs ground truth: 0.169
Parallel vs ground truth: 0.185


Comparison with related work


Indices for queries in relational and NoSQL databases

HadoopDB

Shark and Spark

Xiaoming Gao, Judy Qiu. Social Media Data Analysis with IndexedHBase and Iterative MapReduce. Proc. Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS 2013) at Super Computing 2013.

Xiaoming Gao, Judy Qiu. Supporting Queries and Analyses of Large-Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL Databases. Proc. 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2014).


Scalability comparison


Madrid: non-peak time, 33 mins to process 50 mins’ data
Moe: peak time, larger batch size, 39 mins to process 50 mins’ data


Apply customized indices in analysis algorithms

• Hashtag daily frequency generation
- Can be done by only scanning the index
- MapReduce over HBase index tables

[Figure: Meme Index Table. Entry for “#p2” holding tweet IDs (259227, 339330, …) with tweet creation times (2012-09-23, 2012-09-24, …); the tweet IDs refer to the tweets table.]

Output:
#p2 : 2012-09-01|2344, 2012-09-02|32001, …
#tcot : 2012-09-01|5536, 2012-09-02|8849, …
…
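The index-only computation can be sketched as follows. The sketch uses a nested dict as a stand-in for the meme index table and runs in a single process rather than as a MapReduce job over HBase; the function names and the sample data are assumptions made for illustration.

```python
from collections import Counter


def daily_frequency(meme_index):
    """Count tweets per day for each hashtag.

    meme_index: {hashtag: {tweet_id: creation_date}}. Only the index is
    read; the tweets themselves are never fetched.
    """
    result = {}
    for tag, entries in meme_index.items():
        per_day = Counter(entries.values())
        result[tag] = sorted(per_day.items())
    return result


def format_row(tag, counts):
    """Render one hashtag in the 'tag : day|count, ...' form."""
    return f"{tag} : " + ", ".join(f"{day}|{n}" for day, n in counts)
```

Because the creation time is stored in the index entry itself, the count needs no lookup in the data table, which is the point the slide makes.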


Online indexing and batch indexing mechanisms

[Figure, two panels.
Online indexing for streaming data: Twitter streaming API -> Stream input client -> Stream distribution mechanism -> Loader 1 … Loader N (each: Construct input data records -> General Customizable Indexer) -> HBase (Data tables, Index tables).
Batch indexing for existing data tables: Node 1 … Node N (each: Data table region -> mapper -> Construct input data records -> General Customizable Indexer) -> HBase (Text Index table, Meme Index table).]


Streaming and historical data loading mechanisms

[Figure, two panels.
Streaming data loading: Twitter streaming API -> Stream input client -> Stream distribution mechanism -> Loader 1 … Loader N (each: Construct input data records -> General Customizable Indexer) -> HBase (Data tables, Index tables).
Historical data loading: .json.gz files -> mappers on Loader 1 … Loader N (each: Construct input data records -> General Customizable Indexer) -> HBase (Data tables, Index tables).]

