Big data foundations - POSTECH

Big data foundations
Hwanjo Yu, POSTECH


Big data in real-world

Big data in the movies

Big data in the sports

Big data in the hospitals
• Q: What would the volume and financial impact be if we were to hire another cardiovascular surgeon?
• Q: What are the re-admission patterns for heart failure patients?
• Q: For a specific diagnosis, what are the core interventions that improve outcomes?


Big data in real-world

Government
• “Pillbox” project in the US -> reduces expenses by 50 million USD per year
• Customized employment using big data in Germany -> saves 10 billion euros over 3 years
• Open competition by NIH -> detect geographical epidemic diseases via Twitter analysis

Industry
• Google: predict the geographical spread (trajectory) of epidemic flu via search engine log analysis
• Google: provide real-time road traffic service
• Volvo: find initial faults in newly released vehicles via SNS and blog analysis (prevented a recall of 50 thousand vehicles)
• Hertz: review customers’ evaluations by big data analysis
• Posco: determine the purchase time and price of raw materials
• Watcha: recommend movies via taste analysis (no. 1, larger than Naver movie)
• Xerox: recruit via SNS analysis

Benefits: prompt response to changes in commercial conditions, improved credibility and image, reduced expenses, improved productivity, easier administration, etc.


Big data as proper noun

Extremely large data (Wikipedia, McKinsey)
• Too large to store, manage, and analyze in existing ways using existing storage and existing DBMS software

Government and industry
• Information technology to predict trends and respond proactively.
• Technology to collect, store, manage, search, and analyze large-scale data.


Evolution of science

Evolution of science (Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002)
• Before 1600: empirical science
• 1600~1950: theoretical science
• 1950~1990: computational science
• 1990~: data science <= DB, DW (data warehouse)
• 2010~: mobile computing and big data
  • SNS, UCC, RFID, sensors, …
  • Twitter 1 TB/day, Facebook 15 TB/day, …
  • Size of data accumulated over the last 2 years > that of the previous 10 years
  • 80 exabytes in 2009 -> 40% increase every year -> 35 zettabytes in 2020
• (2010~: deep learning and AI)


History of Big data

[Figure: timeline with the stages Relational database management, Data warehousing, Data mining, Big data]

Business intelligence

[Figure: business intelligence pyramid; the potential to support business decisions increases toward the top]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA

All sciences are data sciences!

…
2. Michel J-P, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182. doi: 10.1126/science.1199644.
3. Lieberman E, Michel J-P, Jackson J, Tang T, Nowak MA (2007) Quantifying the evolutionary dynamics of language. Nature 449: 713–716. doi: 10.1038/nature06137.
4. Pagel M, Atkinson QD, Meade A (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449: 717–720. doi: 10.1038/nature06176.
…
6. DeWall CN, Pond RS Jr, Campbell WK, Twenge JM (2011) Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics. Psychology of Aesthetics, Creativity and the Arts 5: 200–207. doi: 10.1037/a0023195.
…

“The necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences – arguably the key scientific theme of our times.”

Francis X. Diebold
Paul F. and Warren S. Miller Professor of Economics
School of Arts and Sciences, University of Pennsylvania

Big data demands

KDnuggets report, 2017
• Harvard Business Review named data scientist the sexiest job of the 21st century.
• From 2014 to 2024, the data scientist career path is expected to grow 11%–14% faster than the average for all occupations.

Drew Conway’s data science Venn diagram


• If you’re a DBA, you need to learn to deal with unstructured data

• If you’re a statistician, you need to learn to deal with data that does not fit in memory

• If you’re a software engineer, you need to learn statistical modeling and how to communicate results.


[Figure: how deep learning, machine learning, artificial intelligence, and big data relate]
• Machine learning: Deep learning, SVM, Decision Trees, Ensembles, Bayesian Learning, …
• Artificial Intelligence: Algorithm, A* search, CSP, Logics, …
• Big Data: Preprocessing, Database, Hadoop, DFS, …

New challenges?

• Big data is not new?

• Very Large Database (VLDB) has been an important issue in research communities.

• Parallel processing has been a major research problem for the last century of computer science.

• What are new challenges?


IPA: Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Network [ICDE 2013, best poster award]

1. 10 times faster than PMIA (the state-of-the-art algorithm)
2. Uses much less memory than PMIA
   • IPA successfully produces results on graphs of millions of nodes using 4 GB of memory, where PMIA fails with 24 GB of memory.
3. Accurately approximates influence spread
   • IPA’s accuracy is close to that of Greedy solutions with 20k MC simulations and is higher than that of PMIA overall.
4. Can be applied to all IC-based models
   • PMIA cannot be applied to the CT-IC model.
5. Easily parallelized
   • The parallel IPA speeds up linearly as the number of CPU cores increases, and more speed-up is achieved for larger data sets.


New challenges?

Scaling up to Billion-Nodes Network using Map-Reduce?

Very Hard !

That something is easily parallelized does NOT mean it can be easily “map-reduced”.

Big data processing ≠ Parallel data processing

How different?


[Figure: from an enterprise DBMS to a big data analysis system]
• Enterprise DBMS: structured data (RDBMS, DW), queried with SQL
• 3V: data Volume, Variety, and Velocity increase => storage (DAS, NAS, SAN) cost increases, and analysis is hard (unstructured >> structured)
• Big data analysis system: structured + unstructured data on a scale-out cluster; HDFS, Swift; Hadoop, HBase; Hive, Pig, R; apps on top

[Figure: layers of a big data analysis system, from apps down to storage]
• App
• High-level language: Hive, Pig, R
• DB or data access: Hadoop, HBase
• Distributed file system: HDFS, Swift
• Storage: scale-out cluster (structured + unstructured data)

Big data => need scalability

Scale-up
• Proprietary, highly reliable HW => expensive
• Centralized storage: SAN, NAS (servers and storage connected by network, RAID)
• => Fast data transfer

Scale-out
• Commodity HW => inexpensive
• Distributed storage (servers connected by network, distributed file system)
• => Slow data transfer => need a new programming model!

http://www1.ap.dell.com/content/products/compare.aspx?c=kr&cs=krbsd1&id=tower&l=ko&s=bsd






• Data trend: big data
• Storage trend: distributed. Inexpensive scale-out, but expensive data transfer!
• Need a new programming model to minimize data transfer: move operations instead of data!
• MapReduce by Google
• Hadoop and many subprojects

Design Tips
• Lower the work of reduce
  • Use combine if possible
  • Compressing map’s output helps decrease network overhead
• Minimize iterations and broadcasting
  • Sharing of information is minimized
• Use bulk reading
  • Too many invocations of map may incur too many function calls
• Design the algorithm to have enough reduce functions
  • Having only a single reduce will not speed things up
• …

MapReduce Principles
• Run operations on data nodes: move operations to data
• Minimize data transfer

A straightforward extension of the parallel IPA algorithm produces too many iterations and heavy data transfer from map to reduce.
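The tip "use combine if possible" can be illustrated with a small word-count simulation. This is only a sketch in plain Python, not Hadoop's API: the function names (`map_words`, `combine`, `reduce_counts`) and the toy documents are assumptions made for illustration. It counts how many (word, count) pairs would cross the network with and without a map-side combiner.

```python
from collections import Counter

def map_words(document):
    """Map step: emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in document.lower().split()]

def combine(pairs):
    """Combiner: pre-aggregate pairs on the map side, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_counts(pairs):
    """Reduce step: sum the counts received for each word."""
    total = Counter()
    for word, count in pairs:
        total[word] += count
    return dict(total)

# Toy input (hypothetical): each string stands for the data held by one map task.
documents = ["to be or not to be", "to do is to be"]

raw = [map_words(d) for d in documents]
combined = [combine(pairs) for pairs in raw]

# Number of pairs that would be transferred from map to reduce.
pairs_without_combiner = sum(len(pairs) for pairs in raw)
pairs_with_combiner = sum(len(pairs) for pairs in combined)

result_raw = reduce_counts(pair for pairs in raw for pair in pairs)
result_combined = reduce_counts(pair for pairs in combined for pair in pairs)

print(pairs_without_combiner, pairs_with_combiner)
print(result_combined)
```

The final counts are the same either way; the combiner only shrinks what is sent to the reducers, which is where the saving in data transfer comes from.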

Big data subprojects

• Big data programming framework
  • MapReduce (Batch): HDFS & Hadoop, Dryad
  • MapReduce (Iterative): HaLoop, Twister
  • MapReduce (Streaming): Storm (Twitter), S4 (Yahoo), InfoSphere Streams (IBM), HStreaming
• NoSQL DB
  • HBase (master, slaves), Cassandra (P2P, “Gossip”, no master server), Dynamo (Amazon), MongoDB (for text)
• Graph processing engine
  • Pregel, Giraph, Trinity, Neo4J, TurboGraph
• IoT platform
  • NoSQL DB + analytics solutions
  • Allseen, Predix


Big Data subprojects: MapReduce, NoSQL DB

[Figure: three layers, App (Search, Recommendation, BI, Social Network, Bio), SW Platform, and Storage HW infra, annotated with the issue lists below]

• Minimize data transfer
• Tasks: search, recommendation, ..
• Data: text, graph, multimedia, ..
• Processing: batch, streaming
• Storage-aware platform

• Scalability
• Scale-out cost
• Energy efficiency
• Load balancing
• Heterogeneous storage

• Minimize data transfer
• Which platform?
• Generalization
• Feasible? Approximate?
• Storage-aware mining

Reality

• Big data systems are complex and slow.
• Big data is rare.
• Active data is small.


What is data?

###query                      length  COG hit#1  e-value#1  identity#1  score#1  hit length#1  description#1
chr_4[480001-580000].287      4500
chr_4[560001-660000].1        3556
chr_9[400001-500000].503      4211    COG4547    2.00E-04   19          44.6     620           Cobalamin biosynthesis protein
chr_9[320001-420000].548      2833    COG5406    2.00E-04   38          43.9     1001          Nucleosome binding factor SPN
chr_27[320001-404298].20      3991    COG4547    5.00E-05   18          46.2     620           Cobalamin biosynthesis protein
chr_26[320001-420000].378     3963    COG5099    5.00E-05   17          46.2     777           RNA-binding protein of the Puf
chr_24[160001-260000].65      3542
chr_9[160001-260000].243      3002    COG5077    1.00E-25   26          114      1089          Ubiquitin carboxyl-terminal hyd
chr_12[720001-820000].86      2895    COG5032    2.00E-09   30          60.5     2105          Phosphatidylinositol kinase and
chr_12[800001-900000].109     1463    COG5032    1.00E-09   30          60.1     2105          Phosphatidylinositol kinase and
chr_11[1-100000].70           2886
chr_11[80001-180000].100      1523

Where to store data?


Data type and representation

1. Table and record
   • Relational database, transaction data
   • Matrix, cross table
   • Text documents as term-frequency vectors
2. Graph and network
   • World Wide Web
   • Social or information networks
   • Molecular structures
3. Ordered data or sequence
   • Time series, temporal data, sequence data
   • Data streams, sensor data
   • Natural language and text data
4. Spatial, multimedia
   • Spatial data (map), spatiotemporal data
   • Multimedia: image, video

Term-frequency matrix (one row per document, one column per term):

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
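Representing a text document as a term-frequency vector can be sketched as below. This is a minimal illustration, assuming whitespace tokenization and a fixed vocabulary; the function name `term_frequency_vector` and the sample sentence are hypothetical and not taken from the slides.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text.

    Assumption: terms are separated by whitespace and compared case-insensitively.
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, chosen only to show the shape of the result.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "Coach said the team will play and play until the team can score"

vector = term_frequency_vector(doc, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each document becomes one fixed-length row, so a collection of documents turns into a table (matrix) whose columns are terms.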

TID Items

1 Bread, Coke, Milk

2 Beer, Bread

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Coke, Diaper, Milk


What is data model?

Three components of a data model
1. Structures
   • rows and columns?
   • nodes and edges?
   • key-value pairs?
   • a sequence of bytes?
2. Constraints
   • all rows must have the same number of columns
   • all values in one column must have the same type
   • a child cannot have two parents
3. Operations
   • find the value of key x
   • find the rows where column “lastname” is “Jordan”
   • get the next N bytes


What is database?

A database is a collection of information organized to provide efficient retrieval.


http://www.usg.edu/galileo/skills/unit04/primer04_01.phtml


Why do we want a database?

What problems do they solve?
1. Sharing
   • Support concurrent access by multiple readers and writers
2. Data model enforcement
   • Make sure all applications see clean, organized data
3. Scalability
   • Work with datasets too large to fit in memory
4. Flexibility
   • Use the data in new, unanticipated ways


Questions to consider

• How is the data physically organized on disk?

• What kinds of queries are efficiently supported by this organization and what kinds are not?

• How hard is it to update the data or add new data?

• What happens when I encounter new queries that I didn’t anticipate? Do I reorganize the data? How hard is that?


Historical example: network database


Historical example: hierarchical database


• Works great if you want to find all orders for a particular customer.

• What if you want to find all customers who ordered a Nail?


Relational database (Codd 1970)

“Relational Database Management Systems were invented to let you use one set of data in multiple ways, including ways that are unforeseen at the time the database is built and the 1st applications are written.” (Curt Monash, analyst/blogger)


Relational database (Codd 1970)

• Data is represented as a table.
• A database is represented as a set of tables.
• Every row in a table has the same columns.
• Relationships between tables are implicit: no pointers.
• Processing is equivalent for
  • “find names registered for CSE344”
  • “find courses that Jane registered”
• Row: record, tuple, instance, object, …
• Column: attribute, field, feature, dimension, …

Course    Student Id
CSE 344   223…
CSE 344   244…
CSE 514   255..
CSE 514   244…

Student Id  Student Name
223…        Jane
244…        Joe
255..       Susan
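The two queries named on the slide ("find names registered for CSE344" and "find courses that Jane registered") can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names (`student`, `registration`, `student_id`, ...) are invented here, and the student ids are kept as the elided strings shown on the slide rather than filled in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id TEXT PRIMARY KEY, student_name TEXT);
    CREATE TABLE registration (course TEXT, student_id TEXT);
""")

# Ids are the truncated values from the slide, stored verbatim as text.
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [("223…", "Jane"), ("244…", "Joe"), ("255..", "Susan")])
conn.executemany("INSERT INTO registration VALUES (?, ?)",
                 [("CSE 344", "223…"), ("CSE 344", "244…"),
                  ("CSE 514", "255.."), ("CSE 514", "244…")])

# "Find names registered for CSE344": start from the course, follow ids to names.
names_in_cse344 = [row[0] for row in conn.execute("""
    SELECT s.student_name
    FROM registration r JOIN student s ON s.student_id = r.student_id
    WHERE r.course = 'CSE 344'
    ORDER BY s.student_name""")]

# "Find courses that Jane registered": same join, filtered from the other side.
courses_of_jane = [row[0] for row in conn.execute("""
    SELECT r.course
    FROM registration r JOIN student s ON s.student_id = r.student_id
    WHERE s.student_name = 'Jane'
    ORDER BY r.course""")]

print(names_in_cse344, courses_of_jane)
```

Both questions use the same join over the same two tables; neither direction is privileged, because the relationship lives in matching values rather than in pointers.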

Attribute type

Attribute types are
• Nominal (or categorical), e.g. type of car, color name
• Binary, e.g. gender, whether one owns a car or not
• Ordinal, e.g. grade
• Numerical, e.g. height, temperature

Numerical could be
• Discrete, e.g. integer
• Continuous, e.g. real


Relational database in practice

• Pre-relational: if your data changed, your application broke.
• Early RDBMSs were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” (Codd 1970)

• Key idea: programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of the physical data representation.


Key idea: “Physical data independence”


Size of data

[Figure: tools by size of data, from small to large]
• R, Matlab, SAS, Excel, …
• SQLite, MySQL, …
• Hadoop, Spark, NoSQL, …

What does “scalable” mean?

Operationally:
• In the past: “Works even if data doesn’t fit in main memory”
• Now: “Can make use of 1000s of cheap computers”

Algorithmically:
• In the past: “If you have $N$ data items, you must do no more than $N^m$ operations” -- polynomial-time algorithms
• Now: “If you have $N$ data items, you must do no more than $N^m/k$ operations”, for some large $k$
  • Polynomial-time algorithms must be parallelized
• Soon: “If you have $N$ data items, you should do no more than $N \log N$ operations”
  • As data sizes go up, you may only get one pass at the data
  • The data is streaming -- you better make that one pass count
  • Ex: Large Synoptic Survey Telescope (30 TB / night)


Example: Find matching DNA sequences

• Given a set of sequences
• Find all sequences equal to “GATTACGATATTA”

[Figure: an unsorted list of sequences, scanned one record at a time for GATTACGATATTA]
• time = 1: TACCTGCCGTAA = GATTACGATATTA? No.
• time = 2: CCCCCAATGAC = GATTACGATATTA? No.
• …
• time = 17: GATTACGATATTA contains GATTACGATATTA? Yes! Send it to the output.

40 records, 40 comparisons
N records, N comparisons
The algorithmic complexity is $O(N)$
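The record-by-record scan can be written down in a few lines. This is a sketch only: the function name `linear_scan` is made up here, and the three records are the sequences named on the slides rather than the full 40-record set.

```python
def linear_scan(records, query):
    """Compare every record against the query; return matches and the comparison count."""
    matches = []
    comparisons = 0
    for record in records:
        comparisons += 1          # one comparison per record, match or not
        if record == query:
            matches.append(record)
    return matches, comparisons

# Sequences that appear on the slides; the real data set is larger.
records = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA"]

matches, comparisons = linear_scan(records, "GATTACGATATTA")
print(matches, comparisons)
```

Because the scan cannot stop early when it must find all matches, the number of comparisons always equals the number of records, which is the O(N) behavior described above.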


What if we sort the sequences?

[Figure: the sequences in sorted order from the 0% mark to the 100% mark: AAAATCCTGCA, AAACGCCTGCA, …, TTTACGTCAA, TTTTCGTAATT]


Search the sorted sequences for GATTACGATATTA:
• time = 0: Start at the 50% mark. CTGTACACAACCT < GATTACGATATTA. No match. Skip to the 75% mark.
• time = 1: GGATACACATTTA > GATTACGATATTA. No match. Go back to the 62.5% mark.
• GATATTTTAAGC < GATTACGATATTA. No match. Skip to the 68.75% mark.
• GATTACGATATTA = GATTACGATATTA. Match!

How many comparisons did we do?
40 records, only 4 comparisons
$N$ records, $\log N$ comparisons
This algorithm is $O(\log N)$: far better scalability
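The halving search over sorted sequences is ordinary binary search. The sketch below counts probes explicitly; the function name `binary_search` is hypothetical, and the eight records are sequences quoted on the slides, not the full 40-record set, so the probe count differs from the slide's 4.

```python
def binary_search(sorted_records, query):
    """Return (index of query or None, number of records probed)."""
    lo, hi = 0, len(sorted_records) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2      # probe the middle of the remaining range
        probes += 1
        if sorted_records[mid] == query:
            return mid, probes
        if sorted_records[mid] < query:
            lo = mid + 1          # query sorts later: discard the lower half
        else:
            hi = mid - 1          # query sorts earlier: discard the upper half
    return None, probes

records = sorted([
    "AAAATCCTGCA", "AAACGCCTGCA", "CTGTACACAACCT", "GATATTTTAAGC",
    "GATTACGATATTA", "GGATACACATTTA", "TTTACGTCAA", "TTTTCGTAATT",
])

index, probes = binary_search(records, "GATTACGATATTA")
print(index, probes)
```

Each probe halves the remaining range, so doubling the number of records adds only one more probe.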

Relational database

• Databases are especially effective at “finding a needle in a haystack” by using indexes.

CREATE INDEX seq_index ON sequence(seq);

• Indexes are easily built and automatically used when appropriate.

SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';
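The index statements above can be tried end to end with Python's built-in `sqlite3` module. This is a sketch assuming SQLite as the database and a one-column `sequence` table filled with three sequences from the slides; other engines have their own ways of inspecting a query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany("INSERT INTO sequence VALUES (?)",
                 [("TACCTGCCGTAA",), ("CCCCCAATGAC",), ("GATTACGATATTA",)])

# Build the index once; the database decides when to use it.
conn.execute("CREATE INDEX seq_index ON sequence(seq)")

rows = conn.execute(
    "SELECT seq FROM sequence WHERE seq = ?", ("GATTACGATATTA",)
).fetchall()

# Ask SQLite how it intends to run the query (wording varies by SQLite version).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT seq FROM sequence WHERE seq = ?",
    ("GATTACGATATTA",),
).fetchall()

print(rows)
for step in plan:
    print(step)
```

The query text never mentions the index; inspecting the plan is how one checks whether the engine chose an index search over a full table scan.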


New task: Read trimming

• Given a set of DNA sequences

• Trim the final $n$ bps of each sequence
• Generate a new dataset

Process the records one at a time:
• time = 1: TACCTGCCGTAA becomes TACCT
• time = 2: CCCCCAATGAC becomes CCCCC
• time = 3: GATTACGATATTA becomes GATTA



Can we use an index?

No. We have to touch every record no matter what.

The task is fundamentally O(N)

Can we do any better?

[Figure: 6 workers trim the records in parallel; frames shown at time = 1, 2, 3, and 7]

How much time did this take?
• 7 cycles
• 40 records, 6 workers
• O(N/k)
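The 7-cycle figure follows from dividing 40 records among 6 workers and rounding up. The sketch below shows both the arithmetic and the "apply the same function to every record" pattern. It is illustrative only: the `trim` rule (keep the first 5 bases) is inferred from the slide examples, and a thread pool on one machine stands in for k separate computers.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def trim(read, keep=5):
    """Keep the first `keep` bases of a read (assumed rule, matching the slide examples)."""
    return read[:keep]

reads = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA"]
workers = 6

# Each worker applies the same function to its share of the records.
with ThreadPoolExecutor(max_workers=workers) as pool:
    trimmed = list(pool.map(trim, reads))   # results come back in input order

# With N records and k workers, the slowest worker handles ceil(N / k) records.
cycles = math.ceil(40 / workers)

print(trimmed, cycles)
```

No record depends on any other, so the work splits cleanly and the running time drops from N steps to about N/k.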

Schematic of a parallel “Read Trimming” task
1. You are given short “reads”: genomic sequences of about 35-75 characters each.
2. Distribute the reads among k computers.
3. f is a function to trim a read; apply it to every item.
4. Now we have a big distributed set of trimmed reads.

New task: Convert 405k TIFF images to PNG
1. You are given TIFF images.
2. Distribute the images among k computers.
3. f is a function to convert TIFF to PNG; apply it to every item.
4. Now we have a big distributed set of converted images.

New task: Run thousands of simulations
1. You have sets of parameters to optimize by running thousands of simulations.
2. Divide the parameter sets among k computers.
3. f runs the simulation and produces some output; apply it to every item.
4. Now we have a big distributed set of simulation results.

New task: Find the most common word in each document
1. You have millions of documents.
2. Distribute the documents among k computers.
3. f finds the most common word in a single document.
4. Now we have a big distributed list of (doc_id, word) pairs.

Abridged Declaration of Independence

A Declaration By the Representatives of the United States of America, in General Congress Assembled.

When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change.

We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government.
the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

Compute word frequency of every word in a document:

(people, 2) (government, 6) (assume, 1) (history, 2) …

New task: Compute word frequency of 5M documents
1. For each document, f returns a set of (word, freq) pairs.
2. Now we have a big distributed list of sets of word freqs.

Map function

There’s a pattern here…
• A function that maps a read to a trimmed read
• A function that maps a TIFF image to a PNG image
• A function that maps a set of parameters to a simulation result
• A function that maps a document to its most common word
• A function that maps a document to a histogram of word frequencies
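The common pattern is "one function, applied independently to every item". A small sketch of the last two bullets in plain Python follows; the function names and the two toy documents are assumptions made for illustration, and whitespace tokenization is a simplification.

```python
from collections import Counter

def most_common_word(document):
    """Map a document to its most frequent word (whitespace tokens, case-insensitive)."""
    return Counter(document.lower().split()).most_common(1)[0][0]

def word_histogram(document):
    """Map a document to a {word: frequency} histogram."""
    return dict(Counter(document.lower().split()))

# Hypothetical documents keyed by document id.
documents = {
    "doc1": "the cat saw the dog",
    "doc2": "dog bites dog",
}

# Each call looks at exactly one document, so the calls could run on different machines.
top_words = [(doc_id, most_common_word(text)) for doc_id, text in documents.items()]
histograms = list(map(word_histogram, documents.values()))

print(top_words)
print(histograms)
```

Because no call needs the result of another, the list of items can be split among k computers without any coordination between them.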


What if we want to compute word frequency across all documents?

Documents: US Constitution, Declaration of Independence, Articles of Confederation

(people, 78) (government, 123) (assume, 23) (history, 38) …

New task: Compute word frequency across 5M documents
• map: for each document, return a set of (word, freq) pairs.
• Now what? We don’t want a bunch of little histograms; we want one big histogram.
• How can we make sure that a single computer has access to every occurrence of a given word, regardless of which document it appeared in?
• Condition: we have to avoid bottlenecks as much as possible!

Compute word frequency across 5M documents
1. map: for each document, return a set of (word, freq) pairs.
2. Now we have a big distributed list of sets of word freqs.
3. reduce: now just count the occurrences of each word.
4. We have our distributed histogram.

MapReduce: A distributed algorithm framework

Map -> (Shuffle) -> Reduce
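The three phases can be mimicked on one machine to see how the shuffle answers the question of getting every occurrence of a word to a single place. This is a single-process sketch, not Hadoop's API: the function names, the hash-based partitioning rule, and the toy documents are all assumptions for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in text.lower().split()]

def shuffle(mapped_pairs, num_reducers):
    """Shuffle: route all pairs for the same word to the same reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in mapped_pairs:
        partitions[hash(word) % num_reducers][word].append(count)
    return partitions

def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)

# Hypothetical input documents.
documents = {
    "d1": "we the people",
    "d2": "the people the government",
}

mapped = [pair for doc_id, text in documents.items()
          for pair in map_phase(doc_id, text)]
partitions = shuffle(mapped, num_reducers=2)
histogram = dict(reduce_phase(word, counts)
                 for part in partitions
                 for word, counts in part.items())

print(histogram)
```

Partitioning by a hash of the word spreads the words over several reducers, so no single machine has to see every pair, while each individual word still ends up in exactly one place.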

Taxonomy of parallel architecture

[Figure: parallel architectures, annotated from “Easiest to program, but $$” at one end to “Scales to 1000s of computers” at the other]

Cluster computing
• Large number of commodity servers, connected by a commodity network
• Rack: holds a small number of servers
• Data center: holds many racks
• Massive parallelism:
  • 100s, 1000s, or 10,000s of servers
• Failure:
  • If the mean time between failures is 1 year, then 10,000 servers have about one failure per hour
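The "one failure per hour" figure is a back-of-the-envelope estimate, which can be checked in a couple of lines. The sketch assumes servers fail independently at a constant rate; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def expected_failures_per_hour(num_servers, mtbf_years=1.0):
    """Expected failures per hour for a cluster.

    Assumption: servers fail independently, each once per `mtbf_years` on average.
    """
    return num_servers / (mtbf_years * HOURS_PER_YEAR)

rate = expected_failures_per_hour(10_000)
print(round(rate, 2))
```

With 10,000 servers and 8,760 hours in a year the estimate comes out slightly above one failure per hour, which is why failure handling has to be built into cluster software rather than treated as an exception.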

Distributed file system (DFS)
• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically 64 MB
• Each chunk is replicated several times (>=3) on different racks for fault tolerance
• Implementations:
  • Google’s DFS: GFS, proprietary
  • Hadoop’s DFS: HDFS, open source
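Chunking and replication can be made concrete with a little arithmetic. The sketch below is a toy model, not how GFS or HDFS actually place data: the round-robin `place_replicas` rule and the rack names are invented for illustration, and only the 64 MB chunk size and the replication factor of 3 come from the slide.

```python
import math

CHUNK_SIZE = 64 * 1024**2  # 64 MB, the typical chunk size quoted above

def chunk_count(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold a file of the given size."""
    return math.ceil(file_size_bytes / chunk_size)

def place_replicas(chunk_id, racks, replicas=3):
    """Pick `replicas` distinct racks for a chunk (hypothetical round-robin rule).

    Requires replicas <= len(racks) so that every copy lands on a different rack.
    """
    return [racks[(chunk_id + i) % len(racks)] for i in range(replicas)]

one_tb = 1024**4
chunks = chunk_count(one_tb)

racks = ["rack-a", "rack-b", "rack-c", "rack-d"]  # hypothetical rack names
placement = place_replicas(3, racks)

print(chunks, chunks * 3, placement)
```

A 1 TB file becomes thousands of chunks, and with three copies of each on different racks, losing one server or one rack still leaves two copies of every chunk.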

Large-scale data processing
• Many tasks process big data and produce big data
• Want to use hundreds or thousands of CPUs
  • ... but this needs to be easy
  • Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes.
• MapReduce is a lightweight framework, providing:
  • Automatic parallelization and distribution
  • Fault tolerance
  • Status and monitoring

NoSQL and related systems

Year | System/Paper | Scale to 1000s | Primary index | Secondary indexes | Transactions | Joins/Analytics | Integrity constraints | Views | Language/Algebra | Data model | My label
1971 | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | sql-like
2003 | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | nosql
2004 | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | batch
2005 | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | nosql
2006 | BigTable/Hbase | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | nosql
2007 | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | nosql
2007 | Dynamo | ✔ | ✔ | O | O | O | O | O | O | ext. record | nosql
2008 | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | sql-like
2008 | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like
2008 | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | nosql
2009 | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | nosql
2009 | Riak | ✔ | ✔ | ✔ | EC, record | MR | O |  |  | key-val | nosql
2009 | Redis | ✔ | ✔ | ✔ | group | O | O | O | ✔ | key-val | nosql
2010 | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | Tables | sql-like
2011 | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | Tables | nosql
2011 | Tenzing | ✔ | O | O | O | O | ✔ | ✔ | ✔ | Tables | sql-like
2011 | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | Tables | sql-like
2012 | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | Tables | sql-like
2013 | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | Tables | sql-like
2014 | MS Cosmos | ✔ | ✔ | O | EC | O | O | O | ✔ | document | nosql

NoSQL: distributed data management system

No ACID, but eventual consistency
• In the absence of updates, all replicas converge towards identical copies
• What the application sees in the meantime is sensitive to replication mechanics and difficult to predict


Eventual consistency example

User: Sue
Friends: Joe, Kai, …
Status: “Headed to new Bond flick”
Wall: “…”, “…”

User: Joe
Friends: Sue, …
Status: “I’m sleepy”
Wall: “…”, “…”

User: Kai
Friends: Sue, …
Status: “Done for tonight”
Wall: “…”, “…”

Write: update Sue’s status. Who sees the new status, and who sees the old one?
• RDBMS: “Everyone MUST see the same thing, either old or new, no matter how long it takes.”
• NoSQL: “For large applications, we can’t afford to wait that long, and maybe it doesn’t matter anyway”

NoSQL: pros and cons

For whom?
• “I started with MySQL, but had a hard time scaling out in a distributed environment”
• “My data doesn’t conform to a rigid schema”

Cons:
• No ACID, so it is a poor fit for mission-critical data that must not be corrupted.
• A low-level query language is hard to maintain.
• A distributed system is hard to maintain.
• NoSQL means no standards!
• A typical large enterprise has thousands of databases!

