
Big data foundations - POSTECH

Date post: 15-Jan-2022
Big data foundations, Hwanjo Yu, POSTECH
Transcript
Page 1: Big data foundations - POSTECH

Big data foundations
Hwanjo Yu
POSTECH

1Hwanjo Yu, POSTECH

Page 2: Big data foundations - POSTECH

Big data in real-world

Big data in movies

Big data in sports

Big data in hospitals
• Q: What would the volume and financial impact be if we were to hire another cardiovascular surgeon?
• Q: What are the re-admission patterns for heart failure patients?
• Q: For a specific diagnosis, what are the core interventions that improve outcomes?

Page 3: Big data foundations - POSTECH

Big data in real-world

Government
• “Pillbox” project in the US -> reduces expenses by 50 million USD per year
• Customized employment services using big data in Germany -> reduces expenses by 10 billion euros over 3 years
• Open competition by NIH -> detect regional epidemic diseases via Twitter analysis

Industry
• Google: predict the geographical spread (trajectory) of flu epidemics via search engine log analysis
• Google: provide a real-time road traffic service
• Volvo: find early faults in newly released vehicles via SNS and blog analysis (prevent a recall of 50 thousand vehicles)
• Hertz: review customers’ evaluations via big data analysis
• Posco: determine the purchase time and price of raw materials
• Watcha: recommend movies via taste analysis (no. 1, larger than Naver Movie)
• Xerox: recruit via SNS analysis

Benefits: prompt response to changes in market conditions, improved credibility and image, reduced expenses, improved productivity, easier administration, etc.

Page 4: Big data foundations - POSTECH

Big data as proper noun

Extremely large data (Wikipedia, McKinsey)
• Too large to store, manage, and analyze in existing ways using existing storage and existing DBMS software

Government and Industry
• Information technology to predict trends and respond proactively.
• Technology to collect, store, manage, search, and analyze large-scale data.

Page 5: Big data foundations - POSTECH

Evolution of science

Evolution of Science (Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002)
• Before 1600: empirical science
• 1600~1950: theoretical science
• 1950~1990: computational science
• 1990~: data science <= DB, DW (data warehouse)
• 2010~: Mobile Computing and Big Data
  • SNS, UCC, RFID, sensors, …
  • Twitter 1TB/day, Facebook 15TB/day, …
  • Size of data accumulated over the last 2 years > that of the previous 10 years
  • 800 exabytes in 2009 -> 40% increase every year -> 35 zettabytes in 2020
• (2010~: Deep learning and AI)

Page 6: Big data foundations - POSTECH

History of Big data

Relational database management -> Data warehousing -> Data mining -> Big data

Page 7: Big data foundations - POSTECH

Business intelligence

[Figure: layered pyramid; the potential to support business decisions increases toward the top]

Layers, from top to bottom:
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 8: Big data foundations - POSTECH

All sciences are data sciences!

…
2. Michel J-P, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182. doi: 10.1126/science.1199644.
3. Lieberman E, Michel J-P, Jackson J, Tang T, Nowak MA (2007) Quantifying the evolutionary dynamics of language. Nature 449: 713–716. doi: 10.1038/nature06137.
4. Pagel M, Atkinson QD, Meade A (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449: 717–720. doi: 10.1038/nature06176.
…
6. DeWall CN, Pond RS Jr, Campbell WK, Twenge JM (2011) Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics. Psychology of Aesthetics, Creativity and the Arts 5: 200–207. doi: 10.1037/a0023195.
…

“The necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences – arguably the key scientific theme of our times.”

Francis X. Diebold
Paul F. and Warren S. Miller Professor of Economics
School of Arts and Sciences, University of Pennsylvania

Page 9: Big data foundations - POSTECH

Big data demands

KDnuggets report, 2017
• Data scientist was named the sexiest job of the 21st century by Harvard Business Review.
• From 2014 to 2024, the data scientist career path is expected to grow by 11%–14%, faster than the average for all occupations.

Page 10: Big data foundations - POSTECH

Drew Conway’s data science Venn diagram


• If you’re a DBA, you need to learn to deal with unstructured data

• If you’re a statistician, you need to learn to deal with data that does not fit in memory

• If you’re a software engineer, you need to learn statistical modeling and how to communicate results.


Page 11: Big data foundations - POSTECH

[Figure: how the fields relate to one another]
• Artificial Intelligence: Algorithm, A* search, CSP, Logics
• Machine learning: SVM, Decision Trees, Ensembles, Bayesian Learning
• Deep learning
• Big Data: Preprocessing, Database, Hadoop, DFS

Page 12: Big data foundations - POSTECH

New challenges?

• Big data is not new?

• Very Large Database (VLDB) has been an important issue in research communities.

• Parallel processing has been a major research problem for the last century of computer science.

• What are new challenges?


Page 13: Big data foundations - POSTECH

IPA: Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Network [ICDE 2013, best poster award]

1. 10x faster than PMIA (the state-of-the-art algorithm)
2. Uses much less memory than PMIA
   • IPA successfully produces results on graphs of millions of nodes using 4GB of memory, where PMIA fails with 24GB of memory.
3. Accurately approximates influence spread
   • IPA’s accuracy is close to that of Greedy solutions with 20k MC simulations and is higher than that of PMIA overall.
4. Can be applied to all IC-based models
   • PMIA cannot be applied to the CT-IC model.
5. Easily parallelized
   • Parallel IPA speeds up linearly as the number of CPU cores increases, and more speed-up is achieved for larger data sets.

Page 14: Big data foundations - POSTECH

New challenges?

Scaling up to Billion-Nodes Network using Map-Reduce?

Very Hard !

That something is easily parallelized does NOT mean it can be easily “map-reduced”.

Big data processing ≠ Parallel data processing

How different?


Page 15: Big data foundations - POSTECH

[Figure: from an enterprise DBMS to a big data analysis system]

Enterprise DBMS
• Structured data: RDBMS, DW
• SQL

3V: data Volume, Variety, and Velocity increase =>
• Storage (DAS, NAS, SAN) cost increases
• Analysis is hard (unstructured >> structured)

Big Data Analysis System
• Apps
• Hive, Pig, R
• Hadoop, HBase
• HDFS, Swift
• Scale-out cluster
• Structured + unstructured data

Page 16: Big data foundations - POSTECH

Big Data Analysis System

[Figure: the layers of the system, bottom to top]
• Storage: scale-out cluster
• Distributed File System: HDFS, Swift
• DB or Data Access: Hadoop, HBase
• High-level Language: Hive, Pig, R
• Apps

Data: structured + unstructured

Page 18: Big data foundations - POSTECH

Data Trend (Big Data)

Storage Trend (Distributed): inexpensive scale-out, but expensive data transfer!

Need a new programming model to minimize data transfer

Move operations instead of data!

MapReduce by Google

Hadoop and many subprojects

Page 19: Big data foundations - POSTECH

Design Tips
• Lower the work of reduce
  • Use combine if possible
  • Compressing map’s output helps decrease network overhead
• Minimize iterations and broadcasting
  • Keep shared information to a minimum
• Use bulk reading
  • Too many invocations of map may incur too many function calls
• Design the algorithm to have enough reduce functions
  • Having only a single reduce will not speed things up
• …

MapReduce Principles
• Run operations on data nodes: move operations to data
• Minimize data transfer

A straightforward extension of the parallel IPA algorithm produces too many iterations and heavy data transfer from map to reduce.
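The "use combine if possible" tip can be illustrated with a small word-count sketch. This is not Hadoop code: `map_words`, `combine`, and the sample input split are hypothetical names and data chosen for illustration. It only shows how pre-aggregating on the map side shrinks the number of pairs that would cross the network.

```python
from collections import Counter

def map_words(document):
    """Map step: emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in document.lower().split()]

def combine(pairs):
    """Combiner: pre-aggregate pairs on the map side before they are sent."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

# Hypothetical input split held by a single map worker.
split = ["big data big cluster", "big data small data"]

raw_pairs = [pair for doc in split for pair in map_words(doc)]
combined_pairs = combine(raw_pairs)

print(len(raw_pairs), "pairs to shuffle without a combiner")
print(len(combined_pairs), "pairs to shuffle with a combiner")
```

The saving grows with the amount of repetition inside each split, which is why a combiner tends to help most for skewed keys such as common words.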

Page 20: Big data foundations - POSTECH

Big data subprojects

• Big data programming framework
  • MapReduce (Batch): HDFS & Hadoop, Dryad
  • MapReduce (Iterative): HaLoop, Twister
  • MapReduce (Streaming): Storm (Twitter), S4 (Yahoo), InfoSphere Streams (IBM), HStreaming
• NoSQL DB
  • HBase (master, slaves), Cassandra (P2P, “Gossip”, no master server), Dynamo (Amazon), MongoDB (for text)
• Graph processing engine
  • Pregel, Giraph, Trinity, Neo4j, TurboGraph
• IoT platform
  • NoSQL DB + analytics solutions
  • AllSeen, Predix

Page 21: Big data foundations - POSTECH

Big Data subprojects: MapReduce, NoSQL DB

[Figure: three-layer stack (App, SW Platform, Storage HW Infra) with the issues for each layer]

App: Search, Recommendation, BI, Social Network, Bio
• Minimize data transfer
• Which platform?
• Generalization
• Feasible? Approximate?
• Storage-aware mining

SW Platform
• Minimize data transfer
• Tasks: Search, Recommendation, ..
• Data: Text, Graph, Multimedia, ..
• Processing: Batch, Streaming
• Storage-aware platform

Storage HW Infra
• Scalability
• Scale-out cost
• Energy efficiency
• Load balancing
• Heterogeneous storage

Page 22: Big data foundations - POSTECH

Reality

• Big data systems are complex and slow.
• Big data is rare.
• Active data is small.

Page 23: Big data foundations - POSTECH

What is data?

query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
chr_4[480001-580000].287 | 4500 | | | | | |
chr_4[560001-660000].1 | 3556 | | | | | |
chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein
chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN
chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein
chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf
chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf
chr_24[160001-260000].65 | 3542 | | | | | |
chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf
chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hyd
chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and
chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and
chr_11[1-100000].70 | 2886 | | | | | |
chr_11[80001-180000].100 | 1523 | | | | | |

Page 24: Big data foundations - POSTECH

Where to store data?


Page 25: Big data foundations - POSTECH

Data type and representation

1. Table and record
   • Relational database, transaction data
   • Matrix, cross table
   • Text documents as term-frequency vectors
2. Graph and network
   • World Wide Web
   • Social or information networks
   • Molecular structures
3. Ordered data or sequence
   • Time-series, temporal data, sequence data
   • Data streams, sensor data
   • Natural language and text data
4. Spatial, Multimedia
   • Spatial data (map), spatiotemporal data
   • Multimedia: image, video

           | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2
Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0
Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0

TID Items

1 Bread, Coke, Milk

2 Beer, Bread

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Coke, Diaper, Milk
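
As a rough illustration of the "text documents as term-frequency vector" representation listed above, the sketch below counts how often each vocabulary term occurs in a document. The function name, the shortened vocabulary, and the sample sentence are made up for this example; real systems usually also normalize word forms.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count occurrences of each vocabulary term in the text (whitespace tokens)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary (a subset of the slide's terms) and a made-up document.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
doc = "Team play wins the game and the coach said the team will play the next game"
vector = term_frequency_vector(doc, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each document becomes one row of a table whose columns are the vocabulary terms, so text can be stored and processed like any other record.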


Page 26: Big data foundations - POSTECH

What is a data model?

Three components of a data model
1. Structures
   • rows and columns?
   • nodes and edges?
   • key-value pairs?
   • a sequence of bytes?
2. Constraints
   • all rows must have the same number of columns
   • all values in one column must have the same type
   • a child cannot have two parents
3. Operations
   • find the value of key x
   • find the rows where column “lastname” is “Jordan”
   • get the next N bytes
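
The three components can be seen together in a toy table class. This is a minimal sketch under assumed names (`Table`, `insert`, `select`, and the sample rows are hypothetical), not how any particular DBMS is implemented.

```python
class Table:
    """A tiny row-and-column data model: structure, constraints, operations."""

    def __init__(self, columns):
        # Structure: named, typed columns, e.g. {"lastname": str}.
        self.columns = dict(columns)
        self.rows = []

    def insert(self, row):
        # Constraint: every row must have the same number of columns.
        if len(row) != len(self.columns):
            raise ValueError("wrong number of columns")
        # Constraint: all values in one column must have the same type.
        for value, (name, col_type) in zip(row, self.columns.items()):
            if not isinstance(value, col_type):
                raise TypeError(f"column {name!r} expects {col_type.__name__}")
        self.rows.append(tuple(row))

    def select(self, column, value):
        # Operation: find the rows where `column` equals `value`.
        index = list(self.columns).index(column)
        return [row for row in self.rows if row[index] == value]

# Hypothetical sample data.
people = Table({"firstname": str, "lastname": str})
people.insert(("Pat", "Jordan"))
people.insert(("Sam", "Lee"))
print(people.select("lastname", "Jordan"))
```

A key-value store or a byte stream would swap in a different structure, different constraints, and different operations, but the same three questions apply.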

Page 27: Big data foundations - POSTECH

What is a database?

A database is a collection of information organized to provide efficient retrieval.

http://www.usg.edu/galileo/skills/unit04/primer04_01.phtml

Page 28: Big data foundations - POSTECH

Why do we want a database?

What problems do they solve?
1. Sharing
   • Support concurrent access by multiple readers and writers
2. Data model enforcement
   • Make sure all applications see clean, organized data
3. Scalability
   • Work with datasets too large to fit in memory
4. Flexibility
   • Use the data in new, unanticipated ways

Page 29: Big data foundations - POSTECH

Questions to consider

• How is the data physically organized on disk?

• What kinds of queries are efficiently supported by this organization and what kinds are not?

• How hard is it to update the data or add new data?

• What happens when I encounter new queries that I didn’t anticipate? Do I reorganize the data? How hard is that?


Page 30: Big data foundations - POSTECH

Historical example: network database


Page 31: Big data foundations - POSTECH

Historical example: hierarchical database


• Works great if you want to find all orders for a particular customer.

• What if you want to find all customers who ordered a Nail?
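
The asymmetry between the two questions above can be sketched with a nested structure. The customers, items, and function names here are hypothetical; the point is only that one query follows the hierarchy while the other has to visit every record.

```python
# Hypothetical hierarchical store: each customer record owns its orders.
customers = {
    "Alice": ["Hammer", "Nail"],
    "Bob": ["Screw"],
    "Carol": ["Nail", "Saw"],
}

def orders_for(customer):
    """Follows the hierarchy: a single direct lookup."""
    return customers[customer]

def customers_who_ordered(item):
    """Goes against the hierarchy: every customer record must be visited."""
    return sorted(name for name, orders in customers.items() if item in orders)

print(orders_for("Bob"))
print(customers_who_ordered("Nail"))
```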


Page 32: Big data foundations - POSTECH

Relational database (Codd 1970)

“Relational Database Management Systems were invented to let you use one set of data in multiple ways, including ways that are unforeseen at the time the database is built and the 1st applications are written.” (Curt Monash, analyst/blogger)

Page 33: Big data foundations - POSTECH

Relational database (Codd 1970)

• Data is represented as a table.
• A database is represented as a set of tables.
• Every row in a table has the same columns.
• Relationships between tables are implicit: no pointers
• Queries in either direction are processed the same way:
  • “find names registered for CSE344”
  • “find courses that Jane registered”
• Row: record, tuple, instance, object, …
• Column: attribute, field, feature, dimension, …

Course | Student Id
CSE 344 | 223…
CSE 344 | 244…
CSE 514 | 255..
CSE 514 | 244…

Student Id | Student Name
223… | Jane
244… | Joe
255.. | Susan
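
The two example queries can be tried with Python's built-in `sqlite3` module. This is a sketch: the table names `student` and `registration` and their column names are assumptions, and the student IDs are kept as the truncated strings shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE registration (course TEXT, student_id TEXT);
""")
# IDs are stored exactly as the truncated strings from the slide.
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [("223…", "Jane"), ("244…", "Joe"), ("255..", "Susan")])
conn.executemany("INSERT INTO registration VALUES (?, ?)",
                 [("CSE 344", "223…"), ("CSE 344", "244…"),
                  ("CSE 514", "255.."), ("CSE 514", "244…")])

JOIN = "FROM registration r JOIN student s ON r.student_id = s.student_id"

def names_registered_for(course):
    rows = conn.execute(
        f"SELECT s.name {JOIN} WHERE r.course = ? ORDER BY s.name", (course,))
    return [name for (name,) in rows]

def courses_registered_by(name):
    rows = conn.execute(
        f"SELECT r.course {JOIN} WHERE s.name = ? ORDER BY r.course", (name,))
    return [course for (course,) in rows]

print(names_registered_for("CSE 344"))
print(courses_registered_by("Jane"))
```

Both functions use the same join over the same two tables; only the filter changes, which is the sense in which neither direction is privileged by the storage layout.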

Page 34: Big data foundations - POSTECH

Attribute type

Attribute types are
• Nominal (or Categorical), e.g. type of car, color name
• Binary, e.g. gender, whether or not one has a car
• Ordinal, e.g. grade
• Numerical, e.g. height, temperature

Numerical could be
• Discrete, e.g. integer
• Continuous, e.g. real

Page 35: Big data foundations - POSTECH

Relational database in practice

• Pre-relational: if your data changed, your application broke.
• Early RDBMSs were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” (Codd 1979)

• Key idea: programs that manipulate tabular data exhibit an algebraic structure that allows reasoning and manipulation independently of the physical data representation.

Page 36: Big data foundations - POSTECH

Key idea: “Physical data independence”


Page 37: Big data foundations - POSTECH

Size of data

[Figure: tools by size of data, from smaller to larger]
• R, Matlab, SAS, Excel, …
• SQLite, MySQL, …
• Hadoop, Spark, NoSQL, …

Page 38: Big data foundations - POSTECH

What does “scalable” mean?

Operationally:
• In the past: “Works even if data doesn’t fit in main memory”
• Now: “Can make use of 1000s of cheap computers”

Algorithmically:
• In the past: “If you have $N$ data items, you must do no more than $N^m$ operations” -- polynomial-time algorithms
• Now: “If you have $N$ data items, you must do no more than $N^m/k$ operations”, for some large $k$
  • Polynomial-time algorithms must be parallelized
• Soon: “If you have $N$ data items, you should do no more than $N \log N$ operations”
  • As data sizes go up, you may only get one pass at the data
  • The data is streaming -- you had better make that one pass count
  • Ex: Large Synoptic Survey Telescope (30TB / night)

Page 39: Big data foundations - POSTECH

Example: Find matching DNA sequences

• Given a set of sequences
• Find all sequences equal to “GATTACGATATTA”

[Figure: a set of DNA sequences such as GATTACGATATTA and TACCTGCCGTAA]

Page 40: Big data foundations - POSTECH

Example: Find matching DNA sequences

time = 1

TACCTGCCGTAA = GATTACGATATTA?

No.

Page 41: Big data foundations - POSTECH

Example: Find matching DNA sequences

time = 2

CCCCCAATGAC = GATTACGATATTA?

No.

Page 42: Big data foundations - POSTECH

Example: Find matching DNA sequences

time = 17

GATTACGATATTA = GATTACGATATTA?

Yes! Send it to the output.

Page 43: Big data foundations - POSTECH

Example: Find matching DNA sequences

40 records, 40 comparisons
$N$ records, $N$ comparisons
The algorithmic complexity is $O(N)$
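
The scan walked through on the preceding slides can be sketched as a plain loop that counts its comparisons. The function name and the short sample list are illustrative; only the sequences themselves come from the slides.

```python
def linear_search(sequences, query):
    """Scan every record once; return (matches, number of comparisons)."""
    matches, comparisons = [], 0
    for seq in sequences:
        comparisons += 1
        if seq == query:
            matches.append(seq)
    return matches, comparisons

# A few sequences taken from the slides; a real dataset would have N records.
sequences = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA", "CTGTACACAACCT"]
matches, comparisons = linear_search(sequences, "GATTACGATATTA")
print(matches, comparisons)
```

The comparison count always equals the number of records, whether or not a match is found, which is what makes the scan $O(N)$.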

Page 44: Big data foundations - POSTECH

Example: Find matching DNA sequences

What if we sort the sequences?

[Figure: the sequences in sorted order on a 0% to 100% scale, starting with AAAATCCTGCA, AAACGCCTGCA and ending with TTTACGTCAA, TTTTCGTAATT; the query is GATTACGATATTA]

Page 45: Big data foundations - POSTECH

Example: Find matching DNA sequences

time = 0

Start at the 50% mark: CTGTACACAACCT

CTGTACACAACCT < GATTACGATATTA

No match. Skip to the 75% mark.

Page 46: Big data foundations - POSTECH

Example: Find matching DNA sequences

time = 1

GGATACACATTTA > GATTACGATATTA

No match. Go back to the 62.5% mark.

Page 47: Big data foundations - POSTECH

Example: Find matching DNA sequences

GATATTTTAAGC < GATTACGATATTA

No match. Skip ahead to the 68.75% mark.

Page 48: Big data foundations - POSTECH

Example: Find matching DNA sequences

GATTACGATATTA = GATTACGATATTA

Match!

Page 49: Big data foundations - POSTECH

Example: Find matching DNA sequences

How many comparisons did we do?

40 records, only 4 comparisons
$N$ records, $\log N$ comparisons
This algorithm is $O(\log N)$: far better scalability
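
The halving procedure from the preceding slides is a binary search over the sorted sequences. The sketch below counts comparisons the same way the slides do; the function name is hypothetical, and the eight sample sequences are the ones that appear on the slides rather than the full 40-record set.

```python
def binary_search(sorted_seqs, query):
    """Return (index or -1, number of comparisons) for a sorted list."""
    lo, hi, comparisons = 0, len(sorted_seqs) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_seqs[mid] == query:
            return mid, comparisons
        if sorted_seqs[mid] < query:
            lo = mid + 1   # the query can only be in the upper half
        else:
            hi = mid - 1   # the query can only be in the lower half
    return -1, comparisons

sequences = sorted([
    "TTTACGTCAA", "GATTACGATATTA", "AAAATCCTGCA", "GGATACACATTTA",
    "CTGTACACAACCT", "GATATTTTAAGC", "AAACGCCTGCA", "TTTTCGTAATT",
])
index, comparisons = binary_search(sequences, "GATTACGATATTA")
print(index, comparisons)
```

Each comparison discards half of the remaining records, so doubling the dataset adds only one more comparison. The price is that the data must be kept sorted.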

Page 50: Big data foundations - POSTECH

Relational database

• Databases are especially effective at “finding a needle in a haystack” by using indexes.

CREATE INDEX seq_index ON sequence(seq);

• Indexes are easily built and automatically used when appropriate.

SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';
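
One way to see an index being "automatically used" is to ask the database for its query plan. The sketch below uses Python's built-in `sqlite3` module with the table and index names from the slide; the three sample rows are illustrative, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany("INSERT INTO sequence VALUES (?)",
                 [("TACCTGCCGTAA",), ("CCCCCAATGAC",), ("GATTACGATATTA",)])
conn.execute("CREATE INDEX seq_index ON sequence(seq)")

query = "SELECT seq FROM sequence WHERE seq = ?"
rows = conn.execute(query, ("GATTACGATATTA",)).fetchall()

# Ask SQLite how it plans to run the query; the last column describes each step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("GATTACGATATTA",)).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
print(rows)
print(plan_text)
```

The query text never mentions the index; the planner decides to use it, which is an instance of the physical data independence discussed earlier.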

Page 51: Big data foundations - POSTECH

New task: Read trimming

• Given a set of DNA sequences
• Trim the final $n$ bps of each sequence
• Generate a new dataset

Page 52: Big data foundations - POSTECH

New task: Read trimming

time = 1

TACCTGCCGTAA becomes TACCT

Page 53: Big data foundations - POSTECH

New task: Read trimming

time = 2

CCCCCAATGAC becomes CCCCC

Page 54: Big data foundations - POSTECH

New task: Read trimming

time = 3

GATTACGATATTA becomes GATTA

Page 55: Big data foundations - POSTECH

New task: Read trimming

55

Can we use an index?

No. We have to touch every record no matter what.

The task is fundamentally O(N)

Can we do any better?


Page 56: Big data foundations - POSTECH


Page 57: Big data foundations - POSTECH

57

time = 1


Page 58: Big data foundations - POSTECH

58

time = 2


Page 59: Big data foundations - POSTECH

59

time = 3


Page 60: Big data foundations - POSTECH

time = 7

How much time did this take? 7 cycles

40 records, 6 workers

O(N/k)

Page 61: Big data foundations - POSTECH

Schematic of a parallel “Read Trimming” task

1. You are given short “reads”: genomic sequences of about 35-75 characters each
2. Distribute the reads among k computers
3. f is a function to trim a read; apply it to every item
4. Now we have a big distributed set of trimmed reads
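
The schematic can be mimicked on one machine with a thread pool standing in for the k computers. This is a sketch under assumptions: `trim`, the value of `n`, and the three sample reads are illustrative, and threads are used only to show the shape of the computation, not to obtain real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil

def trim(read, n=7):
    """Drop the final n bases of a read (n is an illustrative choice)."""
    return read[:-n] if n > 0 else read

reads = ["TACCTGCCGTAA", "GATTACGATATTA", "CTGTACACAACCT"]
k = 3  # number of workers; threads stand in for the k computers

with ThreadPoolExecutor(max_workers=k) as pool:
    # pool.map applies trim to every read and keeps the input order.
    trimmed = list(pool.map(trim, reads))

# With N records and k workers, the busiest worker handles ceil(N / k) records.
cycles = ceil(40 / 6)
print(trimmed, cycles)
```

The last line reproduces the earlier count of 7 cycles for 40 records and 6 workers: the work per record is unchanged, but the elapsed time drops to roughly N/k.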

Page 62: Big data foundations - POSTECH

New task: Convert 405k TIFF images to PNG

1. You are given TIFF images
2. Distribute the images among k computers
3. f is a function to convert TIFF to PNG; apply it to every item
4. Now we have a big distributed set of converted images

Page 63: Big data foundations - POSTECH

New task: Run thousands of simulations

1. You have sets of parameters to optimize by running thousands of simulations
2. Divide the parameter sets among k computers
3. f runs the simulation and produces some output; apply it to every item
4. Now we have a big distributed set of simulation results

Page 64: Big data foundations - POSTECH

New task: Find the most common word in each document

1. You have millions of documents
2. Distribute the documents among k computers
3. f finds the most common word in a single document
4. Now we have a big distributed list of (doc_id, word) pairs

Page 65: Big data foundations - POSTECH

Compute word frequency of every word in a document

Abridged Declaration of Independence

A Declaration By the Representatives of the United States of America, in General Congress Assembled.

When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change.

We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

(people, 2) (government, 6) (assume, 1) (history, 2) …

Page 66: Big data foundations - POSTECH

New task: Compute word frequency of 5M documents

1. You have millions of documents
2. Distribute the documents among k computers
3. For each document, f returns a set of (word, freq) pairs
4. Now we have a big distributed list of sets of word freqs

Page 67: Big data foundations - POSTECH

Map function

There’s a pattern here…
• A function that maps a read to a trimmed read
• A function that maps a TIFF image to a PNG image
• A function that maps a set of parameters to a simulation result
• A function that maps a document to its most common word
• A function that maps a document to a histogram of word frequencies
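
The pattern is that each task supplies a different function f and applies it independently to every item. A small sketch with Python's built-in `map` shows this for the last two examples; the function names and the two sample documents are made up for illustration.

```python
from collections import Counter

def word_histogram(document):
    """Map a document to a histogram of word frequencies."""
    return Counter(document.lower().split())

def most_common_word(document):
    """Map a document to its most common word."""
    return word_histogram(document).most_common(1)[0][0]

# Hypothetical documents; in the slides there would be millions of them.
documents = ["the cat saw the dog", "dog bites dog"]

# The same driver works for any f, because items are processed independently.
histograms = list(map(word_histogram, documents))
top_words = list(map(most_common_word, documents))
print(top_words)
```

Because no call to f depends on any other call, the items can be split across k computers without coordination.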

Page 68: Big data foundations - POSTECH

What if we want to compute word frequency across all documents?

Documents: US Constitution, Declaration of Independence, Articles of Confederation

(people, 78) (government, 123) (assume, 23) (history, 38)

Page 69: Big data foundations - POSTECH

New task: Compute word frequency across 5M documents

1. You have millions of documents
2. Distribute the documents among k computers
3. map: for each document, return a set of (word, freq) pairs
4. Now what?

But we don’t want a bunch of little histograms; we want one big histogram.

How can we make sure that a single computer has access to every occurrence of a given word, regardless of which document it appeared in?

Condition: we have to avoid bottlenecks as much as possible!

Page 70: Big data foundations - POSTECH

Compute word frequency across 5M documents

1. Distribute the documents among k computers
2. map: for each document, return a set of (word, freq) pairs
3. Now we have a big distributed list of sets of word freqs
4. reduce: now just count the occurrences of each word
5. We have our distributed histogram

Page 71: Big data foundations - POSTECH

MapReduce: A distributed algorithm framework

Map -> (Shuffle) -> Reduce
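
The three phases can be simulated in a single process to show how data flows between them. This is a sketch, not Hadoop's API: the function names, the hash-based partitioning, and the three sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, freq) pairs for one document."""
    counts = defaultdict(int)
    for word in document.lower().split():
        counts[word] += 1
    return list(counts.items())

def shuffle_phase(mapped, num_reducers):
    """Shuffle: route every pair for a given word to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for word, freq in pairs:
            partitions[hash(word) % num_reducers][word].append(freq)
    return partitions

def reduce_phase(partition):
    """Reduce: sum the frequencies collected for each word."""
    return {word: sum(freqs) for word, freqs in partition.items()}

# Hypothetical documents standing in for the 5M in the slides.
documents = ["we the people", "the people of the states", "government of the people"]
mapped = [map_phase(doc) for doc in documents]
partitions = shuffle_phase(mapped, num_reducers=2)
reduced = [reduce_phase(p) for p in partitions]

# Each reducer holds a disjoint slice of the histogram; merge only for display.
histogram = {}
for part in reduced:
    histogram.update(part)
print(histogram)
```

The shuffle answers the earlier question: partitioning by word guarantees that one reducer sees every occurrence of that word, while the words themselves are spread across reducers so that no single machine becomes the bottleneck.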

Page 72: Big data foundations - POSTECH

Taxonomy of parallel architecture

[Figure: parallel architectures, annotated as follows]
• Easiest to program, but expensive ($$)
• Scales to 1000s of computers

Page 73: Big data foundations - POSTECH

Cluster computing

• Large number of commodity servers, connected by a commodity network
• Rack: holds a small number of servers
• Data center: holds many racks
• Massive parallelism:
  • 100s, 1000s, or 10,000s of servers
• Failure:
  • If the mean time between failures is 1 year,
  • then 10,000 servers have about one failure per hour
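
The failure estimate follows from simple arithmetic, sketched below. The function name is hypothetical, and the calculation assumes servers fail independently at a constant rate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def hours_between_failures(num_servers, mtbf_years=1.0):
    """Expected hours between failures somewhere in the cluster.

    Assumes independent servers, each failing once per `mtbf_years` on average.
    """
    failures_per_hour = num_servers / (mtbf_years * HOURS_PER_YEAR)
    return 1.0 / failures_per_hour

gap = hours_between_failures(10_000)
print(f"about one failure every {gap:.3f} hours")
```

With 10,000 servers the gap is a little under one hour, so software at this scale has to treat failure as a routine event rather than an exception.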

Page 74: Big data foundations - POSTECH

Distributed file system (DFS)

• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically 64MB
• Each chunk is replicated several times (>=3) on different racks for fault tolerance
• Implementations:
  • Google’s DFS: GFS, proprietary
  • Hadoop’s DFS: HDFS, open source
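
A back-of-the-envelope sketch of chunking and replication is shown below. The round-robin `place_replicas` policy and the rack names are hypothetical and much simpler than what GFS or HDFS actually do; they only illustrate putting each replica of a chunk on a different rack.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB, the typical chunk size cited above
REPLICAS = 3

def num_chunks(file_size_bytes):
    """Number of chunks needed for a file (the last chunk may be partial)."""
    return (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id, racks, replicas=REPLICAS):
    """Hypothetical round-robin placement: each replica on a different rack."""
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    start = chunk_id % len(racks)
    return [racks[(start + i) % len(racks)] for i in range(replicas)]

one_tb = 2**40
chunks = num_chunks(one_tb)
stored_copies = chunks * REPLICAS
print(chunks, stored_copies, place_replicas(5, ["r1", "r2", "r3", "r4"]))
```

A 1 TB file becomes 16,384 chunks and 49,152 stored copies, so losing one rack still leaves at least two copies of every chunk.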

Page 75: Big data foundations - POSTECH

Large-scale data processing

• Many tasks process big data, produce big data
• Want to use hundreds or thousands of CPUs
  • ... but this needs to be easy
  • Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes.
• MapReduce is a lightweight framework, providing:
  • Automatic parallelization and distribution
  • Fault-tolerance
  • Status and monitoring

Page 76: Big data foundations - POSTECH

NoSQL and related systems

Year | System/Paper | Scale to 1000s | Primary Index | Secondary Indexes | Transactions | Joins/Analytics | Integrity Constraints | Views | Language/Algebra | Data model | my label
1971 | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | sql-like
2003 | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | nosql
2004 | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | batch
2005 | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | nosql
2006 | BigTable/Hbase | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | nosql
2007 | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | nosql
2007 | Dynamo | ✔ | ✔ | O | O | O | O | O | O | ext. record | nosql
2008 | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | sql-like
2008 | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like
2008 | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | nosql
2009 | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | nosql
2009 | Riak | ✔ | ✔ | ✔ | EC, record | MR | O | | | key-val | nosql
2009 | Redis | ✔ | ✔ | ✔ | group | O | O | O | ✔ | key-val | nosql
2010 | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | Tables | sql-like
2011 | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | Tables | nosql
2011 | Tenzing | ✔ | O | O | O | O | ✔ | ✔ | ✔ | Tables | sql-like
2011 | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | Tables | sql-like
2012 | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | Tables | sql-like
2013 | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | Tables | sql-like
2014 | MS Cosmos | ✔ | ✔ | O | EC | O | O | O | ✔ | document | nosql

Page 77: Big data foundations - POSTECH

NoSQL: distributed data management system

No ACID, but eventual consistency
• In the absence of updates, all replicas converge towards identical copies
• What the application sees in the meantime is sensitive to replication mechanics and difficult to predict

Page 78: Big data foundations - POSTECH

Eventual consistency example

User: Sue
Friends: Joe, Kai, …
Status: “Headed to new Bond flick”
Wall: “…”, “…”

User: Joe
Friends: Sue, …
Status: “I’m sleepy”
Wall: “…”, “…”

User: Kai
Friends: Sue, …
Status: “Done for tonight”
Wall: “…”, “…”

Write: Update Sue’s status. Who sees the new status, and who sees the old one?

RDBMS: “Everyone MUST see the same thing, either old or new, no matter how long it takes.”

NoSQL: “For large applications, we can’t afford to wait that long, and maybe it doesn’t matter anyway”
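
The status-update scenario can be simulated with a few in-memory replicas. This is a deliberately simplified sketch: the `Replica` class, the `write` function, and the "pending queue" replication are hypothetical and ignore conflicts, ordering, and failures.

```python
class Replica:
    """One copy of Sue's status, with a queue of updates not yet applied."""

    def __init__(self, status):
        self.status = status
        self.pending = []

    def read(self):
        return self.status

    def apply_pending(self):
        for status in self.pending:
            self.status = status
        self.pending.clear()

def write(replicas, primary_index, status):
    """Apply the write at one replica and queue it for the others."""
    for i, replica in enumerate(replicas):
        if i == primary_index:
            replica.status = status
        else:
            replica.pending.append(status)

replicas = [Replica("old status") for _ in range(3)]
write(replicas, 0, "Headed to new Bond flick")

reads_before = [r.read() for r in replicas]   # replicas disagree for a while
for r in replicas:
    r.apply_pending()                          # replication catches up
reads_after = [r.read() for r in replicas]     # all replicas have converged
print(reads_before, reads_after)
```

Which friend sees the old status depends on which replica serves the read before replication catches up, which is exactly the unpredictability the slide describes.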

Page 79: Big data foundations - POSTECH

NoSQL: pros and cons

For whom?
• “I started with MySQL, but had a hard time scaling out in a distributed environment”
• “My data doesn’t conform to a rigid schema”

Cons:
• No ACID, so it is a poor fit for mission-critical data that must not be corrupted.
• Low-level query languages are hard to maintain.
• Distributed systems are hard to maintain.
• NoSQL means no standards!
• A typical large enterprise has thousands of databases!

