Big data foundations
Hwanjo Yu, POSTECH
Big data in the real world
Big data in the movies
Big data in sports
Big data in hospitals
• Q: What would the volume and financial impact be if we were to hire another cardiovascular surgeon?
• Q: What are the re-admission patterns for heart failure patients?
• Q: For a specific diagnosis, what are the core interventions that improve outcomes?
Big data in the real world
Government
• “Pillbox” project in the US -> reduce expenses by 50 million USD per year
• Customized employment services using Big data in Germany -> reduce expenses by 10 billion euros over 3 years
• Open competition by NIH -> detect geographical epidemic diseases via Twitter analysis
Industry
• Google: predict the geographical spread (trajectory) of flu epidemics via search engine log analysis
• Google: provide real-time road traffic service
• Volvo: find early defects in newly released vehicles via SNS and blog analysis (prevented a recall of 50 thousand vehicles)
• Hertz: review customers’ evaluations via Big data analysis
• Posco: determine the purchase time and price of raw materials
• Watcha: recommend movies via taste analysis (no. 1, larger than Naver movie)
• Xerox: recruit via SNS analysis
Benefits: prompt response to changes in business conditions, improved credibility and image, reduced expenses, improved productivity, easier administration, etc.
Big data as proper noun
Extremely large data (Wikipedia, McKinsey)
• Too large to store, manage, and analyze in existing ways using existing storage and existing DBMS software
Government and Industry
• Information technology to predict trends and respond proactively.
• Technology to collect, store, manage, search, and analyze large-scale data.
Evolution of science
Evolution of Science (Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002)
• Before 1600: empirical science
• 1600~1950: theoretical science
• 1950~1990: computational science
• 1990~: data science <= DB, DW (data warehouse)
• 2010~: Mobile Computing and Big Data
  • SNS, UCC, RFID, sensors, …
  • Twitter 1TB/day, Facebook 15TB/day, …
  • Size of data accumulated over the last 2 years > that of the previous 10 years
  • 80 exabytes in 2009 -> 40% increase every year -> 35 zettabytes in 2020
• (2010~: Deep learning and AI)
History of Big data
Relational database management -> Data warehousing -> Data mining -> Big data
Business intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Layers, from top to bottom:
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
All sciences are data sciences!
…
2. Michel J-P, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182. doi: 10.1126/science.1199644.
3. Lieberman E, Michel J-P, Jackson J, Tang T, Nowak MA (2007) Quantifying the evolutionary dynamics of language. Nature 449: 713–716. doi: 10.1038/nature06137.
4. Pagel M, Atkinson QD, Meade A (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449: 717–720. doi: 10.1038/nature06176.
…
6. DeWall CN, Pond RS Jr, Campbell WK, Twenge JM (2011) Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics. Psychology of Aesthetics, Creativity and the Arts 5: 200–207. doi: 10.1037/a0023195.
…
“The necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences – arguably the key scientific theme of our times.”
Francis X. Diebold, Paul F. and Warren S. Miller Professor of Economics, School of Arts and Sciences, University of Pennsylvania
Big data demands
KDnuggets report, 2017
• Data scientist was named the sexiest job of the 21st century by Harvard Business Review
• From 2014 to 2024, the data scientist career path is expected to grow by 11%–14%, faster than the average for all occupations.
Drew Conway’s data science Venn diagram
• If you’re a DBA, you need to learn to deal with unstructured data
• If you’re a statistician, you need to learn to deal with data that does not fit in memory
• If you’re a software engineer, you need to learn statistical modeling and how to communicate results.
[Figure: how Artificial Intelligence, Machine learning, Deep learning, and Big Data relate]
• Artificial Intelligence: Algorithm, A* search, CSP, Logics, …
• Machine learning: Deep learning, SVM, Decision Trees, Ensembles, Bayesian Learning, …
• Big Data: Preprocessing, Database, Hadoop, DFS, …
New challenges?
• Big data is not new?
• Very Large Database (VLDB) has been an important issue in research communities.
• Parallel processing has been a major research problem for the last century of computer science.
• What are new challenges?
IPA: Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Network [ICDE 2013, best poster award]
1. 10 times faster than PMIA (the state-of-the-art algorithm)
2. Uses much less memory than PMIA
  • IPA successfully produces results on graphs of millions of nodes using 4GB of memory, where PMIA fails with 24GB of memory.
3. Accurately approximates influence spread
  • IPA’s accuracy is close to that of Greedy solutions with 20k MC simulations and is higher than that of PMIA overall.
4. Can be applied to all IC-based models
  • PMIA cannot be applied to the CT-IC model.
5. Easily parallelized
  • The parallel IPA speeds up linearly as the number of CPU cores increases, and more speed-up is achieved for larger data sets.
New challenges?
Scaling up to a billion-node network using MapReduce?
Very hard!
That something is easily parallelized does NOT mean it can be easily “map-reduced”.
Big data processing ≠ Parallel data processing
How are they different?
[Figure: from an enterprise DBMS to a Big data analysis system]
Enterprise DBMS
• Structured data: RDBMS, DW
• SQL
3V: data Volume, Variety, and Velocity increase
=> Storage (DAS, NAS, SAN) cost increases, and analysis is hard (unstructured >> structured)
Big Data Analysis System
• Structured + unstructured data
• Scale-out cluster
• HDFS, Swift
• Hadoop HBase
• Hive, Pig, R
• Apps
Big Data Analysis System, by layer
• Storage: scale-out cluster
• Distributed File System: HDFS, Swift
• DB or Data Access: Hadoop HBase
• High level Language: Hive, Pig, R
• Apps on structured + unstructured data
Big data => Need scalability
• Proprietary, highly reliable HW (servers + storage) => Scale-up: Expensive
  • Centralized storage: SAN, NAS (network, RAID) => Fast data transfer
• Commodity HW => Scale-out: Inexpensive
  • Distributed storage (network, distributed file system) => Slow data transfer
=> Need new programming model!
Data Trend (Big Data)
Storage Trend (Distributed): Inexpensive scale-out, but expensive data transfer!
Need New Programming Model to Minimize Data Transfer
Move operations instead of data!
MapReduce by Google
Hadoop and many subprojects
Design Tips
• Lower the work of reduce
  • Use combine if possible
  • Compressing map’s output helps decrease network overhead
• Minimize iterations and broadcasting
  • Keep shared information to a minimum
• Use bulk reading
  • Too many invocations of map may incur too many function calls
• Design the algorithm to have enough reduce functions
  • Having only a single reduce will not speed things up
• …
MapReduce Principles
• Run operations on data nodes: move operations to data
• Minimize data transfer
A straightforward extension of the parallel IPA algorithm produces too many iterations and heavy data transfer from map to reduce.
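The tip "use combine if possible" can be illustrated with a small single-process Python sketch. This is not Hadoop code: `map_words`, `combine`, `reduce_counts`, and the sample input splits are hypothetical names chosen for illustration. The sketch counts how many (word, count) pairs would be shipped from map to reduce with and without local pre-aggregation.

```python
from collections import Counter

def map_words(split):
    """Map: emit one (word, 1) pair per word in an input split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combine: pre-aggregate pairs locally, on the node that ran the map."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_counts(all_pairs):
    """Reduce: sum the counts per word over the pairs received from all maps."""
    total = Counter()
    for word, count in all_pairs:
        total[word] += count
    return dict(total)

# Hypothetical input splits, one per map task.
splits = ["a b a a", "b b a c"]

# Pairs that cross the network without and with a combine step.
plain = [pair for s in splits for pair in map_words(s)]
combined = [pair for s in splits for pair in combine(map_words(s))]

print(len(plain), len(combined))
print(reduce_counts(plain) == reduce_counts(combined))
```

The final histogram is the same either way; only the amount of intermediate data changes, which is the quantity the MapReduce principles ask us to minimize.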
Big data subprojects
• Big data programming framework
  • MapReduce (Batch): HDFS & Hadoop, Dryad
  • MapReduce (Iterative): HaLoop, Twister
  • MapReduce (Streaming): Storm (Twitter), S4 (Yahoo), InfoSphere Streams (IBM), HStreaming
• NoSQL DB
  • HBase (master, slaves), Cassandra (P2P, “Gossip”, no master server), Dynamo (Amazon), MongoDB (for text)
• Graph processing engine
  • Pregel, Giraph, Trinity, Neo4J, TurboGraph
• IoT platform
  • NoSQL DB + Analytics solutions
  • Allseen, Predix
Big Data subprojects: MapReduce, NoSQL DB
[Figure: three-layer stack (App, SW Platform, Storage HW Infra) with the issues at each layer]
• App: Search, Recommendation, BI, Social Network, Bio
  • Minimize Data Transfer
  • Which platform?
  • Generalization
  • Feasible? Approximate?
  • Storage-aware mining
• SW Platform
  • Minimize Data Transfer
  • Tasks: Search, Recommendation, ..
  • Data: Text, Graph, Multimedia, ..
  • Processing: Batch, Streaming
  • Storage-aware platform
• Storage HW Infra
  • Scalability
  • Scale-out cost
  • Energy efficiency
  • Load balancing
  • Heterogeneous storage
Reality
• Big data systems are complex and slow.
• Big data is rare.
• Active data is small.
What is data?
###query length COG hit#1 e-value#1 identity#1 score #1 hit length#1 description#1
chr_4[480001-580000].287 4500
chr_4[560001-660000].1 3556
chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalaminbiosynthesisprotein
chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 NucleosomebindingfactorSPN
chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalaminbiosynthesisprotein
chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-bindingproteinof thePuf
chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-bindingproteinof thePuf
chr_24[160001-260000].65 3542
chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-bindingproteinof thePuf
chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitincarboxyl-terminalhyd
chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositolkinaseand
chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositolkinaseand
chr_11[1-100000].70 2886
chr_11[80001-180000].100 1523
Where to store data?
Data type and representation
1. Table and record
  • Relational database, transaction data
  • Matrix, cross table
  • Text documents as term-frequency vector
2. Graph and network
  • World Wide Web
  • Social or information networks
  • Molecular structures
3. Ordered data or sequence
  • Time-series, temporal data, sequence data
  • Data streams, sensor data
  • Natural language and text data
4. Spatial, Multimedia
  • Spatial data (map), spatiotemporal data
  • Multimedia: Image, video
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
What is data model?
Three components of a data model
1. Structures
  • rows and columns?
  • nodes and edges?
  • key-value pairs?
  • a sequence of bytes?
2. Constraints
  • all rows must have the same number of columns
  • all values in one column must have the same type
  • a child cannot have two parents
3. Operations
  • find the value of key x
  • find the rows where column “lastname” is “Jordan”
  • get the next N bytes
What is database?
A database is a collection of information organized to provide efficient retrieval.
http://www.usg.edu/galileo/skills/unit04/primer04_01.phtml
Why do we want a database?
What problems do they solve?
1. Sharing
  • Support concurrent access by multiple readers and writers
2. Data model enforcement
  • Make sure all applications see clean, organized data
3. Scalability
  • Work with datasets too large to fit in memory
4. Flexibility
  • Use the data in new, unanticipated ways
Questions to consider
• How is the data physically organized on disk?
• What kinds of queries are efficiently supported by this organization and what kinds are not?
• How hard is it to update the data or add new data?
• What happens when I encounter new queries that I didn’t anticipate? Do I reorganize the data? How hard is that?
Historical example: network database
Historical example: hierarchical database
• Works great if you want to find all orders for a particular customer.
• What if you want to find all customers who ordered a Nail?
Relational database (Codd 1970)
“Relational Database Management Systems were invented to let you use one set of data in multiple ways, including ways that are unforeseen at the time the database is built and the 1st applications are written.” (Curt Monash, analyst/blogger)
Relational database (Codd 1970)
• Data is represented as a table.
• A database is represented as a set of tables.
• Every row in a table has the same columns.
• Relationships between tables are implicit: no pointers
• Processing is equivalent for
  • “find names registered for CSE344”
  • “find courses that Jane registered”
• Row: record, tuple, instance, object, …
• Column: attribute, field, feature, dimension, …

Course    Student Id
CSE 344   223…
CSE 344   244…
CSE 514   255..
CSE 514   244…

Student Id   Student Name
223…         Jane
244…         Joe
255..        Susan
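The point that both queries are processed in the same way can be tried with Python's built-in `sqlite3` module. The sketch below is illustrative only: the table names `student` and `registration` and the helper functions are assumptions, and the student ids 223, 244, and 255 are shortened placeholders because the slide elides the full ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE registration (course TEXT, student_id INTEGER);
    -- Placeholder ids: the slide shows only the leading digits.
    INSERT INTO student VALUES (223, 'Jane'), (244, 'Joe'), (255, 'Susan');
    INSERT INTO registration VALUES
        ('CSE 344', 223), ('CSE 344', 244), ('CSE 514', 255), ('CSE 514', 244);
""")

def names_in(course):
    """Find the names of students registered for a course."""
    rows = conn.execute(
        "SELECT s.name FROM registration r JOIN student s "
        "ON s.student_id = r.student_id WHERE r.course = ? ORDER BY s.name",
        (course,))
    return [name for (name,) in rows]

def courses_of(name):
    """Find the courses a student registered for: same join, other direction."""
    rows = conn.execute(
        "SELECT r.course FROM registration r JOIN student s "
        "ON s.student_id = r.student_id WHERE s.name = ? ORDER BY r.course",
        (name,))
    return [course for (course,) in rows]

print(names_in("CSE 344"))
print(courses_of("Jane"))
```

Neither table stores a pointer to the other; the relationship exists only through matching `student_id` values, so neither direction of the question is privileged.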
Attribute type
Attribute types are
• Nominal (or Categorical), e.g. type of car, color name
• Binary, e.g. gender, whether one has a car or not
• Ordinal, e.g. grade
• Numerical, e.g. height, temperature
Numerical could be
• Discrete, e.g. integer
• Continuous, e.g. real
Relational database in practice
• Pre-relational: if your data changed, your application broke.
• Early RDBMSs were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” (Codd 1979)
• Key idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
Key idea: “Physical data independence”
Size of data
[Figure: tools suited to increasing data size, from left to right]
R, Matlab, SAS, Excel, … -> SQLite, MySQL, … -> Hadoop, Spark, NoSQL, …
What does “scalable” mean?
Operationally:
• In the past: “Works even if data doesn’t fit in main memory”
• Now: “Can make use of 1000s of cheap computers”
Algorithmically:
• In the past: “If you have N data items, you must do no more than N^m operations” -- polynomial-time algorithms
• Now: “If you have N data items, you must do no more than N^m/k operations”, for some large k
  • Polynomial-time algorithms must be parallelized
• Soon: “If you have N data items, you should do no more than N log N operations”
  • As data sizes go up, you may only get one pass at the data
  • The data is streaming -- you better make that one pass count
  • Ex: Large Synoptic Survey Telescope (30TB / night)
Example: Find matching DNA sequences
• Given a set of sequences
• Find all sequences equal to “GATTACGATATTA”
[Figure: the sequences are scanned one at a time and compared with the query GATTACGATATTA]
time = 1: TACCTGCCGTAA = GATTACGATATTA? No.
time = 2: CCCCCAATGAC = GATTACGATATTA? No.
…
time = 17: GATTACGATATTA contains GATTACGATATTA? Yes! Send it to the output.
40 records, 40 comparisons
N records, N comparisons
The algorithmic complexity is O(N)
Example: Find matching DNA sequences
What if we sort the sequences?
[Figure: sorted sequences laid out from 0% to 100%, e.g. AAAATCCTGCA, AAACGCCTGCA, …, TTTACGTCAA, TTTTCGTAATT; the query is GATTACGATATTA]
• time = 0: Start at the 50% mark. CTGTACACAACCT < GATTACGATATTA. No match. Skip to the 75% mark.
• time = 1: GGATACACATTTA > GATTACGATATTA. No match. Go back to the 62.5% mark.
• GATATTTTAAGC < GATTACGATATTA. No match. Skip ahead to the 68.75% mark.
• GATTACGATATTA = GATTACGATATTA. Match!
How many comparisons did we do?
40 records, only 4 comparisons
N records, log N comparisons
This algorithm is O(log N). Far better scalability.
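The two search strategies above can be compared with a minimal Python sketch. The function names and the generated three-letter records are illustrative assumptions, not part of the slides; the sketch counts one probe per record examined rather than individual character comparisons.

```python
from itertools import product

def linear_search(records, query):
    """Scan every record; return (matching indices, records examined)."""
    matches = [i for i, seq in enumerate(records) if seq == query]
    return matches, len(records)

def binary_search(sorted_records, query):
    """Halve the search range each step; return (index or None, probes)."""
    lo, hi, probes = 0, len(sorted_records), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_records[mid] == query:
            return mid, probes
        if sorted_records[mid] < query:
            lo = mid + 1   # query sorts after the probe: search the upper half
        else:
            hi = mid       # query sorts before the probe: search the lower half
    return None, probes

# 40 distinct, already sorted toy sequences (hypothetical data).
records = ["".join(p) for p in product("ACGT", repeat=3)][:40]

print(linear_search(records, "GAT"))
print(binary_search(records, "GAT"))
```

With 40 records the linear scan always examines all 40, while the binary search needs at most 6 probes, which is the O(N) versus O(log N) gap the slides describe. The binary search only works because the records are kept sorted.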
Relational database
• Databases are especially effective at “finding a needle in a haystack” by using indexes.
CREATE INDEX seq_index ON sequence(seq);
• Indexes are easily built and automatically used when appropriate.
SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';
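As a rough check that an index is "automatically used when appropriate", the statements above can be run against an in-memory SQLite database from Python. This is a sketch under the assumption that SQLite stands in for the RDBMS; the three sample rows are taken from the earlier slides, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany(
    "INSERT INTO sequence VALUES (?)",
    [("TACCTGCCGTAA",), ("CCCCCAATGAC",), ("GATTACGATATTA",)],
)
conn.execute("CREATE INDEX seq_index ON sequence(seq)")

query = "SELECT seq FROM sequence WHERE seq = ?"
matches = conn.execute(query, ("GATTACGATATTA",)).fetchall()

# Ask SQLite how it would run the query; the last column of each plan row
# is a human-readable description of the step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("GATTACGATATTA",)).fetchall()
plan_details = [row[-1] for row in plan]

print(matches)
print(plan_details)
```

The query text never mentions `seq_index`; the planner chooses it on its own, which is the physical data independence discussed earlier.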
New task: Read trimming
• Given a set of DNA sequences
• Trim the final n bps of each sequence
• Generate a new dataset
time = 1: TACCTGCCGTAA becomes TACCT
time = 2: CCCCCAATGAC becomes CCCCC
time = 3: GATTACGATATTA becomes GATTA
Can we use an index?
No. We have to touch every record no matter what.
The task is fundamentally O(N)
Can we do any better?
[Figure: the records are divided among 6 workers that trim in parallel; frames show time = 1, 2, 3, …, 7]
How much time did this take? 7 cycles
40 records, 6 workers
O(N/k)
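The 7-cycle figure follows from dividing 40 records among 6 workers and rounding up. The sketch below assumes each worker trims one record per cycle and uses a thread pool purely to show the shape of the computation; `trim`, `cycles`, and `parallel_trim` are hypothetical names, and CPython threads do not give real CPU parallelism for this kind of work.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def trim(read, n):
    """Drop the final n base pairs of one read."""
    return read[:-n] if n > 0 else read

def cycles(num_records, num_workers):
    """Cycles needed if every worker handles one record per cycle."""
    return math.ceil(num_records / num_workers)

def parallel_trim(reads, n, num_workers):
    """Apply trim to every read with a pool of workers; order is preserved."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda read: trim(read, n), reads))

print(cycles(40, 6))
print(parallel_trim(["TACCTGCCGTAA", "GATTACGATATTA"], 7, 6))
```

Every record is still touched once, so the total work stays O(N); only the elapsed time drops to O(N/k).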
Schematic of a parallel “Read Trimming” task
• You are given short “reads”: genomic sequences of about 35-75 characters each
• Distribute the reads among k computers
• f is a function to trim a read; apply it to every item
• Now we have a big distributed set of trimmed reads
New task: Convert 405k TIFF images to PNG
• You are given TIFF images
• Distribute the images among k computers
• f is a function to convert TIFF to PNG; apply it to every item
• Now we have a big distributed set of converted images
New task: Run thousands of simulations
• You have sets of parameters to optimize by running thousands of simulations
• Divide the parameter sets among k computers
• f runs the simulation and produces some output; apply it to every item
• Now we have a big distributed set of simulation results
New task: Find the most common word in each document
• You have millions of documents
• Distribute the documents among k computers
• f finds the most common word in a single document
• Now we have a big distributed list of (doc_id, word) pairs
Abridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in General Congress Assembled.When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their justpower from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. 
the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
(people, 2) (government, 6) (assume, 1) (history, 2) …
Compute word frequency of every word in a document
New task: Compute word frequency of 5M documents
• You have millions of documents
• Distribute the documents among k computers
• For each document, f returns a set of (word, freq) pairs
• Now we have a big distributed list of sets of word freqs
Map function
There’s a pattern here…
• A function that maps a read to a trimmed read
• A function that maps a TIFF image to a PNG image
• A function that maps a set of parameters to a simulation result
• A function that maps a document to its most common word
• A function that maps a document to a histogram of word frequencies
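The shared pattern is "apply one function independently to every item". A minimal Python sketch of three of these functions, using the built-in `map`, might look as follows; the function names, the default `n=7`, and the sample documents are assumptions made for illustration.

```python
from collections import Counter

def trim_read(read, n=7):
    """Map a read to a trimmed read (n=7 is an arbitrary illustrative value)."""
    return read[:-n]

def most_common_word(document):
    """Map a document to its most common word."""
    return Counter(document.lower().split()).most_common(1)[0][0]

def word_histogram(document):
    """Map a document to a histogram of word frequencies."""
    return dict(Counter(document.lower().split()))

reads = ["TACCTGCCGTAA", "GATTACGATATTA"]
documents = ["the cat saw the dog", "big data is big and data is big"]

# Each call to f depends on one item only, so the items can be
# spread over k computers without any communication between them.
print(list(map(trim_read, reads)))
print(list(map(most_common_word, documents)))
print(list(map(word_histogram, documents)))
```

Because no call needs the result of another, the same code shape works whether the items sit in one list or on thousands of machines.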
What if we want to compute word frequency across all documents?
Documents: US Constitution, Declaration of Independence, Articles of Confederation
(people, 78) (government, 123) (assume, 23) (history, 38) …
New task: Compute word frequency across 5M documents
• You have millions of documents
• Distribute the documents among k computers
• map: for each document, return a set of (word, freq) pairs
• Now what? We don’t want a bunch of little histograms – we want one big histogram.
• How can we make sure that a single computer has access to every occurrence of a given word regardless of which document it appeared in?
• Condition: We have to avoid bottlenecks as much as possible!
Compute word frequency across 5M documents
• Distribute the documents among k computers
• map: for each document, return a set of (word, freq) pairs
• Now we have a big distributed list of sets of word freqs.
• reduce: now just count the occurrences of each word
• We have our distributed histogram
MapReduce: A distributed algorithm framework
Map -> (Shuffle) -> Reduce
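The map, shuffle, and reduce steps for the cross-document word count can be sketched in a single Python process. This is a toy model of the idea, not the Hadoop API: `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names, and hashing the word to pick a reducer is one common way to make sure all occurrences of a word meet on the same machine.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit (word, freq) pairs for one document."""
    freqs = defaultdict(int)
    for word in text.lower().split():
        freqs[word] += 1
    return list(freqs.items())

def shuffle(mapped_pairs, num_reducers):
    """Shuffle: route every pair for a given word to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, freq in mapped_pairs:
        partitions[hash(word) % num_reducers][word].append(freq)
    return partitions

def reduce_phase(partition):
    """Reduce: sum the per-document frequencies of each word."""
    return {word: sum(freqs) for word, freqs in partition.items()}

def word_count(documents, num_reducers=2):
    mapped = [pair for doc_id, text in documents.items()
              for pair in map_phase(doc_id, text)]
    histogram = {}
    for partition in shuffle(mapped, num_reducers):
        histogram.update(reduce_phase(partition))
    return histogram

docs = {"d1": "we the people", "d2": "the people the government"}
print(word_count(docs))
```

No single reducer sees all the data, which addresses the bottleneck condition: each reducer owns a disjoint subset of the words, and the union of their outputs is the one big histogram.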
Taxonomy of parallel architecture
[Figure: parallel architectures, annotated at one end “Easiest to program, but $$” and at the other “Scales to 1000s of computers”]
Cluster computing
• Large number of commodity servers, connected by a commodity network
• Rack: holds a small number of servers
• Data center: holds many racks
• Massive parallelism:
  • 100s, 1000s, or 10,000s of servers
• Failure:
  • If the mean time between failures is 1 year,
  • then 10,000 servers have about one failure per hour
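The failure estimate is simple arithmetic, sketched here in Python. It assumes servers fail independently at a constant rate, which is a simplification; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def cluster_failures_per_hour(num_servers, mtbf_hours):
    """Expected failures per hour across the cluster,
    assuming independent servers with a constant failure rate."""
    return num_servers / mtbf_hours

rate = cluster_failures_per_hour(10_000, HOURS_PER_YEAR)
print(round(rate, 2))  # 1.14
```

With 10,000 servers and a one-year mean time between failures per server, the cluster sees a little more than one failure per hour, so failure handling has to be routine rather than exceptional.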
Distributed file system (DFS)
• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically 64MB
• Each chunk is replicated several times (>=3) on different racks for fault tolerance
• Implementations:
  • Google’s DFS: GFS, proprietary
  • Hadoop’s DFS: HDFS, open source
Large-scale data processing
• Many tasks process big data and produce big data
• Want to use hundreds or thousands of CPUs
  • ... but this needs to be easy
  • Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes.
• MapReduce is a lightweight framework, providing:
  • Automatic parallelization and distribution
  • Fault-tolerance
  • Status and monitoring
NoSQL and related systems
Year | System/Paper   | Scale to 1000s | Primary index | Secondary indexes | Transactions  | Joins/Analytics | Integrity constraints | Views | Language/Algebra | Data model  | My label
1971 | RDBMS          | O | ✔ | ✔ | ✔             | ✔              | ✔ | ✔ | ✔ | tables      | sql-like
2003 | memcached      | ✔ | ✔ | O | O             | O              | O | O | O | key-val     | nosql
2004 | MapReduce      | ✔ | O | O | O             | ✔              | O | O | O | key-val     | batch
2005 | CouchDB        | ✔ | ✔ | ✔ | record        | MR             | O | ✔ | O | document    | nosql
2006 | BigTable/HBase | ✔ | ✔ | ✔ | record        | compat. w/MR   | / | O | O | ext. record | nosql
2007 | MongoDB        | ✔ | ✔ | ✔ | EC, record    | O              | O | O | O | document    | nosql
2007 | Dynamo         | ✔ | ✔ | O | O             | O              | O | O | O | ext. record | nosql
2008 | Pig            | ✔ | O | O | O             | ✔              | / | O | ✔ | tables      | sql-like
2008 | HIVE           | ✔ | O | O | O             | ✔              | ✔ | O | ✔ | tables      | sql-like
2008 | Cassandra      | ✔ | ✔ | ✔ | EC, record    | O              | ✔ | ✔ | O | key-val     | nosql
2009 | Voldemort      | ✔ | ✔ | O | EC, record    | O              | O | O | O | key-val     | nosql
2009 | Riak           | ✔ | ✔ | ✔ | EC, record    | MR             | O |   |   | key-val     | nosql
2009 | Redis          | ✔ | ✔ | ✔ | group         | O              | O | O | ✔ | key-val     | nosql
2010 | Dremel         | ✔ | O | O | O             | /              | ✔ | O | ✔ | tables      | sql-like
2011 | Megastore      | ✔ | ✔ | ✔ | entity groups | O              | / | O | / | tables      | nosql
2011 | Tenzing        | ✔ | O | O | O             | O              | ✔ | ✔ | ✔ | tables      | sql-like
2011 | Spark/Shark    | ✔ | O | O | O             | ✔              | ✔ | O | ✔ | tables      | sql-like
2012 | Spanner        | ✔ | ✔ | ✔ | ✔             | ?              | ✔ | ✔ | ✔ | tables      | sql-like
2013 | Impala         | ✔ | O | O | O             | ✔              | ✔ | O | ✔ | tables      | sql-like
2014 | MS Cosmos      | ✔ | ✔ | O | EC            | O              | O | O | ✔ | document    | nosql
NoSQL: distributed data management system
No ACID, but eventual consistency
• In the absence of updates, all replicas converge towards identical copies
• What the application sees in the meantime is sensitive to replication mechanics and difficult to predict
Eventual consistency example
• User: Sue; Friends: Joe, Kai, …; Status: “Headed to new Bond flick”; Wall: “…”, “…”
• User: Joe; Friends: Sue, …; Status: “I’m sleepy”; Wall: “…”, “…”
• User: Kai; Friends: Sue, …; Status: “Done for tonight”; Wall: “…”, “…”
Write: Update Sue’s status. Who sees the new status, and who sees the old one?
RDBMS: “Everyone MUST see the same thing, either old or new, no matter how long it takes.”
NoSQL: “For large applications, we can’t afford to wait that long, and maybe it doesn’t matter anyway”
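The behaviour in this example can be mimicked with a toy Python model of three replicas. It is only a sketch of the idea: the `Replica` class, the last-writer-wins version counter, and the `sync` pass are assumptions chosen for illustration, and real NoSQL systems use more elaborate replication protocols.

```python
class Replica:
    """One copy of Sue's status, tagged with a version number."""

    def __init__(self, value, version=0):
        self.value, self.version = value, version

    def apply(self, value, version):
        # Keep whichever write carries the highest version (last-writer-wins).
        if version > self.version:
            self.value, self.version = value, version

def write(replicas, index, value, version):
    """Accept a write at one replica only; the others stay stale for now."""
    replicas[index].apply(value, version)

def sync(replicas):
    """Background pass that spreads the newest version to every replica."""
    newest = max(replicas, key=lambda r: r.version)
    for replica in replicas:
        replica.apply(newest.value, newest.version)

replicas = [Replica("I'm at home") for _ in range(3)]
write(replicas, 0, "Headed to new Bond flick", version=1)
before = [r.value for r in replicas]   # readers may see old or new status
sync(replicas)
after = [r.value for r in replicas]    # with no further updates, all agree

print(before)
print(after)
```

Between the write and the sync, which status a friend sees depends on which replica answers the read; once updates stop, the copies converge.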
NoSQL: pros and cons
For whom?
• “I started with MySQL, but had a hard time scaling out in a distributed environment”
• “My data doesn’t conform to a rigid schema”
Cons:
• No ACID, so it is a poor fit for mission-critical data that must not be corrupted.
• Low-level query languages are hard to maintain.
• Distributed systems are hard to maintain.
• NoSQL means no standards!
• A typical large enterprise has thousands of databases!