+ All Categories
Home > Documents > Big Data Analysis Technology

Big Data Analysis Technology

Date post: 25-Feb-2016
Category:
Upload: raziya
View: 44 times
Download: 4 times
Share this document with a friend
Description:
Big Data Analysis Technology. University of Paderborn L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English) Summer semester 2013 June 12, 2013 Tobias Hardes (6687549) – [email protected]. Table of content. Introduction Definitions Background Example - PowerPoint PPT Presentation
Popular Tags:
69
Big Data Analysis Technology University of Paderborn L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English) Summer semester 2013 June 12, 2013 Tobias Hardes (6687549) – [email protected]
Transcript
Page 1: Big Data Analysis Technology

Big Data Analysis Technology

University of PaderbornL.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English)Summer semester 2013June 12, 2013

Tobias Hardes (6687549) – [email protected]

Page 2: Big Data Analysis Technology

2June 12, 2013

Table of content

Introduction Definitions

Background Example

Related Work Research

Main Approaches Association Rule Mining MapReduce Framework

Conclusion

Page 3: Big Data Analysis Technology

3June 12, 2013

4 Big keywords

Page 4: Big Data Analysis Technology

4June 12, 2013

Big Data vs. Business Intelligence

How can we predict cancer early enough to treat it successfully?

How Can I make significant profit on the stock market next month?

Which is the most profitable branch of our supermarket? In a specific country? During a specific period of time

Docs.oralcle.com

Page 5: Big Data Analysis Technology

5June 12, 2013

Background

home.web.cern.ch

Page 6: Big Data Analysis Technology

6June 12, 2013

Big Science – The LHC

600 million times per second, particles collide within the Large Hadron Collider (LHC)

Each collision generate new particles Particles decay in complex way Each collision is detected The CERN Data Center reconstruct this collision event

15 petabytes of data stored every year Worldwide LHC Computing Grid (WLCG) is

used to crunch all of the data

home.web.cern.ch

Page 7: Big Data Analysis Technology

7June 12, 2013

Data Stream Analysis

- Just in time analysis of data.- Sensor networks

- Analysis for a certain time (last 30 seconds)

http://venturebeat.com

Page 8: Big Data Analysis Technology

8June 12, 2013

Complex event processing (CEP)

- Provides queries for streams- Usage of „Event Processing Languages“ (EPL)

- select avg(price) from StockTickEvent.win:time(30 sec)

https://forge.fi-ware.eu

Tumbling Window(Slide = WindowSize) Sliding Window

(Slide < WindowSize)

Window Slide

Page 9: Big Data Analysis Technology

9June 12, 2013

Complex Event Processing - Areas of application

- Just in time analysis Complexity of algorithms- CEP is used with Twitter:

- Identify emotional states of users

- Sarcasm?

Page 10: Big Data Analysis Technology

10June 12, 2013

Related Work

Page 11: Big Data Analysis Technology

11June 12, 2013

Big Data in companies

Page 12: Big Data Analysis Technology

12June 12, 2013

Principles

- Statistics- Probability theory- Machine learning

Data Mining- Association rule learning- Cluster analysis- Classificiation

Page 13: Big Data Analysis Technology

13June 12, 2013

Association Rule Mining – Cluster analysis

Association Rule Mining-Relationships between items-Find associations, correlations or causal structures

-Apriori algorithm-Frequent Pattern (FP)-Growth algorithm

Is soda purchased with bananas?

Page 14: Big Data Analysis Technology

14June 12, 2013

Cluster analysis – Classification

Cluster Analysis-Classification of similar objects into classes-Classes are defined during the clustering

-k-Means-K-Means++

Page 15: Big Data Analysis Technology

16June 12, 2013

Research and future work

- Performance, performance, performance…- Passes of the data source- Parallelization- NP-hard problems- ….

- Accuracy- Optimized solutions

Page 16: Big Data Analysis Technology

17June 12, 2013

Example

- Apriori algorithm: n+1 database scans- FP-Growth algorithm: 2 database scans

Page 17: Big Data Analysis Technology

18June 12, 2013

Distributed computing – Motivation

- Complex computational tasks- Serveral terabytes of data- Limited hardware resources

Google‘s MapReduce frameworkProf. Dr. Erich Ehses (FH Köln)

Page 18: Big Data Analysis Technology

20June 12, 2013

Main approaches

http://ultraoilforpets.com

Page 19: Big Data Analysis Technology

21June 12, 2013

Structure

- Association rule mining- Apriori algorithm- FP-Growth algorithm

- Googles MapReduce

Page 20: Big Data Analysis Technology

22June 12, 2013

Association rule mining

- Identify items that are related to other items- Example: Analysis of baskets in an online shop

or in a supermarket

http://img.deusm.com/

Page 21: Big Data Analysis Technology

23June 12, 2013

Terminology

- A stream or a database with n elements: S - Item set: - Frequency of occurrence of an item set: Φ(A)

- Association rule B :

- Support: - Confidence:

Page 22: Big Data Analysis Technology

24June 12, 2013

Example

- Rule: „If a basket contains cheese and chocolate, then it also contains bread“

- 6 of 60 transactions contains cheese and chocolate

- 3 of the 6 transactions contains bread

Page 23: Big Data Analysis Technology

25June 12, 2013

Common approach

- Disjoin the problem into two tasks:

1. Generation of frequent item sets• Find item sets that satisfy a minimum support value

2. Generation of rules• Find Confidence rules using the item sets

𝐦𝐢𝐧𝐬𝐮𝐩≤𝐬𝐮𝐩 ( 𝑨 )= 𝜱 (𝑨)¿𝑺∨¿ ¿

𝒎𝒊𝒏𝒄𝒐𝒏𝒇 ≤𝒄𝒐𝒏𝒇 ¿

Page 24: Big Data Analysis Technology

26June 12, 2013

Aprio algorithm – Frequent item set

Input:Minimum support: min_supDatasource: S

Page 25: Big Data Analysis Technology

27June 12, 2013

Apriori – Frequent item sets (I)

Generation of frequent item sets : min_sup = 2TID Transaction1 (B,C)

2 (B,C)

3 (A,C,D)

4 (A,B,C,D)

5 (B,D)

{}

A B C D2 341 12 21 3 122 3 4 24

https://www.mev.de/

Page 26: Big Data Analysis Technology

28June 12, 2013

Apriori – Frequent item sets (II)

Generation of frequent item sets : min_sup = 2TID Transaction1 (B,C)

2 (B,C)

3 (A,C,D)

4 (A,B,C,D)

5 (B,D)

{}

A B C D

AB AC AD BC BD CD

4 342

1 2 2 3 2 2

ACD BCD

Candidates

Candidates 2 1

https://www.mev.de/

L1

L2

L3

Page 27: Big Data Analysis Technology

29June 12, 2013

Apriori Algorithm – Rule generation

- Uses frequent item sets to extract high-confidence rules- Based on the same principle as the item set generation- Done for all

frequent item set Lk

Page 28: Big Data Analysis Technology

30June 12, 2013

Example: Rule generation

TID ItemsT1 {Coffee; Pasta; Milk}

T2 {Pasta; Milk}

T3 {Bread; Butter}

T4 {Coffee; Milk; Butter}

T5 {Milk; Bread; Butter}𝐵𝑢𝑡𝑡𝑒𝑟❑

⇒𝑀𝑖𝑙𝑘

¿ (𝐵𝑢𝑡𝑡𝑒𝑟❑⇒𝑀𝑖𝑙𝑘)=Φ(𝐵𝑢𝑡𝑡𝑒𝑟∪𝑀𝑖𝑙𝑘)

¿𝑆∨¿=25=40%¿

conf (𝐵𝑢𝑡𝑡𝑒𝑟❑⇒𝑀𝑖𝑙𝑘)=𝑠𝑢𝑝(𝐵𝑢𝑡𝑡𝑒𝑟∪𝑀𝑖𝑙𝑘)

¿ (𝐵𝑢𝑡𝑡𝑒𝑟 )=40%60%

=66%

Page 29: Big Data Analysis Technology

31June 12, 2013

Summary Apriori algorithm

- n+1 scans of the database- Expensive generation of the candidate item set- Implements level-wise search using frequent

item property.

- Easy to implement- Some opportunities for specialized optimizations

Page 30: Big Data Analysis Technology

32June 12, 2013

FP-Growth algorithm

- Used for databases- Features:

- Requires 2 scans of the database- Uses a special data structure – The FP-Tree1. Build the FP-Tree2. Extract frequent item sets

- Compression of the database- Devide this database and apply data mining

Page 31: Big Data Analysis Technology

33June 12, 2013

Construct FP-Tree

TID Items

1 {a,b}

2 {b,c,d}

3 {a,c,d,e}

4 {a,d,e}

5 {a,b,c}

6 {a,b,c,d}

7 {a}

8 {a,b,c}

9 {a,b,d}

10 {b,c,e}

d:1

Page 32: Big Data Analysis Technology

34June 12, 2013

Extract frequent itemsets (I)

- Bottom-up strategy

- Start with node „e“- Then look for „de“- Each path is processed

recursively- Solutions are merged

Page 33: Big Data Analysis Technology

35June 12, 2013

Extract frequent itemsets (II)

Φ(e) = 3 – Assume the minimum support was set to 2

- Is e frequent?- Is de frequent?

- …- Is ce frequent?

- ….- Is be frequent?

- ….- Is ae frequent?

- …..Using subproblems to identify frequent itemsets

Page 34: Big Data Analysis Technology

36June 12, 2013

Extract frequent itemsets (III)

1. Update the support count along the prefix path

2. Remove Node e3. Check the frequency of the paths

Find item sets withde, ce, ae or be

Page 35: Big Data Analysis Technology

37June 12, 2013

Apriori vs. FP-Growth

- FP-Growth has some advantages- Two scans of the database- No expensive computation of candidates- Compressed datastructure- Easier to parallelize

W. Zhang, H. Liao, and N. Zhao, “Research on the fp growth algorithmabout association rule mining

Page 36: Big Data Analysis Technology

45June 12, 2013

MapReduce

- Map and Reduce functions are expressed by a developer

- map(key, val)- Emits new key-values p

- reduce(key, values) - Emits an arbitrary output- Usually a key with one value

Page 37: Big Data Analysis Technology

46June 12, 2013

MapReduce – Word count

Page 38: Big Data Analysis Technology

User Programm

Master

worker

worker

worker

worker

worker

worker

worker

worker

Input files Mapphase

Intermediate files Shuffle Reduce

phase Output files

Worker for red keys

Worker for blue keys

Worker for yellow keys

(1)fork (1)fork(1)fork

(2) assign (2) assign

(3) read (4) local write (5) RPC

(6) write

(7) return

Page 39: Big Data Analysis Technology

48June 12, 2013

Conclusion: MapReduce (I)

- MapReduce is design as a batch processing framework

- No usage for ad-hoc analysis- Used for very large data sets- Used for time intensive computations

- OpenSource implementation: Apache Hadoop

http://hadoop.apache.org/

Page 40: Big Data Analysis Technology

49June 12, 2013

Conclusion

Page 41: Big Data Analysis Technology

50June 12, 2013

Conclusion (I)

- Big Data is important for research and in daily business

- Different approaches- Data Stream analysis

- Complex event processing- Rule Mining

- Apriori algorithm- FP-Growth algorithm

Page 42: Big Data Analysis Technology

51June 12, 2013

Conclusion (II)

- Clustering- K-Means- K-Means++

- Distributed computing- MapReduce

- Performance / Runtime- Multiple minutes- Hours- Days…- Online analytical processing for Big Data?

Page 43: Big Data Analysis Technology

Thank you for your attention

Page 44: Big Data Analysis Technology

Appendix

Page 45: Big Data Analysis Technology

54June 12, 2013

Big Data definitions

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.(Gartner Inc.)

Every day, we create 2.5 quintillion bytes of …. . This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.(IBM Corporate ) Big data” refers to datasets whose size is

beyond the ability of typical database software tools to capture, store, manage, and analyze.(McKinsey & Company)

Page 46: Big Data Analysis Technology

55June 12, 2013

Big Data definitions

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.(Gartner Inc.)

Every day, we create 2.5 quintillion bytes of …. . This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.(IBM Corporate ) Big data” refers to datasets whose size is

beyond the ability of typical database software tools to capture, store, manage, and analyze.(McKinsey & Company)

Page 47: Big Data Analysis Technology

56June 12, 2013

Complex Event Processing – Windows

Tumbling Window-Moves as much as the window size

Sliding Window-Slides in time-Buffers the last x elements

Tumbling Window(Slide = WindowSize) Sliding Window

(Slide < WindowSize)

Window Slide

Page 48: Big Data Analysis Technology

57June 12, 2013

MapReduce vs. BigQuery

Page 49: Big Data Analysis Technology

58June 12, 2013

Apriori Algorithm (Pseudocode)

- for (

- for each do- - for each do

- end for- end for - if then

- end if- end for- return

Page 50: Big Data Analysis Technology

59June 12, 2013

Apriori Algorithm (Pseudocode)

- for (

- for each do- - for each do

- end for- end for - if then

- end if- end for- return

Page 51: Big Data Analysis Technology

60June 12, 2013

Apriori Algorithm (Pseudocode)

- for (

- for each do- - for each do

- end for- end for - if then

- end if- end for- return

Page 52: Big Data Analysis Technology

61June 12, 2013

Apriori Algorithm (Pseudocode)

- for (

- for each do- - for each do

- end for- end for - if then

- end if- end for- return

Page 53: Big Data Analysis Technology

62June 12, 2013

Page 54: Big Data Analysis Technology

63June 12, 2013

Distributed computing of Big Data

CERN‘s Worldwide LHC Computing Grid (WLCG) launched in 2002

Stores, distributes and analyse the 15 petabytes of data 140 centres across 35 countries

Page 55: Big Data Analysis Technology

64June 12, 2013

Apriori Algorithm – 𝑎𝑝𝑟𝑖𝑜𝑟𝑖𝐺𝑒𝑛 Join

- Do not generate not too many candidate item sets, but making sure to not lose any that do turn out to be large.

- Assume that the items are ordered (alphabetical)

- {a1, a2 , … ak-1} = {b1, b2 , … bk-1}, and ak < bk, {a1, a2 , … ak, bk} is a candidate k+1-itemset.

Page 56: Big Data Analysis Technology

65June 12, 2013

Big Data vs. Business Intelligence

Big Data Large and complex data sets Temporal, historical, … Difficult to process and to

analyse Used for deep analysis and

reporting: How can we predict cancer

early enough to treat it successfully?

How Can I make significant profit on the stock market next month?

Business Intelligence Transformed Data Historical view Easy to process and to

analyse Used for reporting:

Which is the most profitable branch of our supermarket?

Which postcodes suffered the most dropped calls in July?

Page 57: Big Data Analysis Technology

66June 12, 2013

Improvement approaches

- Selection of startup parameters for algorithms

- Reducing the number of passes over the database

- Sampling the database

- Adding extra constraints for patterns

- Parallelization

Page 58: Big Data Analysis Technology

67June 12, 2013

Improvement approaches – Examples

 

Page 59: Big Data Analysis Technology

68June 12, 2013

Example: FA-DMFI

- Algorithm for Discovering frequent item sets- Read the database once

- Compress into a matrix- Frequent item sets are generated by cover relations Further costly computations are avoided

Page 60: Big Data Analysis Technology

69June 12, 2013

K-Means algorithm

1. Select k entities as the initial centroids.2. (Re)Assign all entities to their closest centroids.3. Recompute the centroid of each newly

assembled cluster.4. Repeat step 2 and 3 until the centroids do not

change or until the maximum value for the iterations is reached

Page 61: Big Data Analysis Technology

70June 12, 2013

Solving approaches

- K-Means cluster is NP-hard- Optimization methods to handle NP-hard

problems (K-Means clustering)

Page 62: Big Data Analysis Technology

71June 12, 2013

Examples

- Apriori algorithm: n+1 database scans- FP-Growth algorithm: 2 database scans

- K-Means: Exponential runtime- K-Means++: Improve startup parameters

Page 63: Big Data Analysis Technology

72June 12, 2013

Google‘s BigQuery

UploadUpload the data set to the Google Storage

http://glenn-packer.net/

Analyse

Import data to tablesProcess

Run queries

Page 64: Big Data Analysis Technology

73June 12, 2013

The Apriori algorithm

- Most known algorithm for rule mining- Based on a simple principle:

- „If an item set is frequent, then all subsets of this item are also frequent“

- Input:- Minimum confidence: min_conf- Minimum support: min_sup- Data source: S

Page 65: Big Data Analysis Technology

74June 12, 2013

Apriori Algorithm – aprioriGen

- Generates a candidate item set that might by larger

- Join: Generation of the item set- Prune: Elimination of item sets with

Page 66: Big Data Analysis Technology

75June 12, 2013

Apriori Algorithm – Rule generation -- Example

- {Butter, milk, bread} {cheese}- {Butter, meat, bread} {cola}

{Butter, bread} {cheese, cola}

Page 67: Big Data Analysis Technology

76June 12, 2013

How to improve the Apriori algorithm

- Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.

- Sampling: mining on a subset of given data- Dynamic itemset counting:

Page 68: Big Data Analysis Technology

77June 12, 2013

Construction of FP-Tree

- Compressed representation of the database- First scan

- Get the support of every item and sort them by the support count

- Second scan- Each transaction is mapped to a path- Compression is done if overlapping path are

detected- Generate links between same nodes

- Each node has a counter Number of mapped transactions

Page 69: Big Data Analysis Technology

78June 12, 2013

FP-Growth algorithm

Calculate the support count of

each item in S

Sort items in decreasing support

counts

Read transaction t

Create new nodes labeled with the

items in t

Set the frequency count to 1

No overlappedprefix found

Increment the frequency count for

each overlapped item

Overlapped prefix found

Create new nodes for none overlapped

items

Create additional path to common

items

hasNext

return


Recommended