2018. 10 KAIST/DGIST
Copyright © 2018 by Kyu-Young Whang
October 23rd, 2018
Recent Trends of Big Data Platforms and Applications*
Kyu-Young Whang
ACM Fellow, IEEE Life Fellow
KAIST Distinguished Professor, Professor Emeritus
School of Computing, KAIST
DGIST Visiting Chair Professor
Department of Information and Communication Engineering
* A joint work with Jae-Gil Lee, Prof., Dept. of Industrial and Systems Eng., KAIST
ER 2018 keynote
Contents
Part I: Big Data—Introduction
Part II: Big Data Platforms
Part III: Big Data Applications
Conclusions
Part I: Big Data—Introduction
Big Data: Definitions
“A term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with” [Wik18a]
“Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions” (the highlighted answer of Google Search as of Sept. 2018)
Volume, Velocity, and Variety, plus Veracity and Value (a.k.a. the Three V’s and the Five V’s) [Lan01, Goe14, Wik18a]
Source: http://www.visualcapitalist.com/internet-minute-2018/
Main Sources of Big Data
Internet-of-Things (IoT)
Web 2.0 (e.g., social media)
Scientific experiments
Mobile data
Cloud data
Healthcare data, …
Big Data from IoT
Abundant data are being generated by the many sensors in our daily lives
 Example: GPS data from our cell phones
More and more devices are being connected to the Internet (via wireless networks)
 Example: smart TVs, Blu-ray players, gaming consoles, streaming media hubs (Apple TV), refrigerators
All these are part of what’s called Internet of Things (IoT)
Big Data from Web 2.0
Users can provide the data on a Web 2.0 site and exercise some control over that data
Web 2.0 is bidirectional, i.e., users are creators of user-generated content as well as consumers
Examples of Web 2.0 include:
 Social networking sites (e.g., Facebook, Twitter)
 Blogs (e.g., Tumblr)
 Wikis (e.g., Wikipedia)
 Photo/video sharing sites (e.g., Flickr, YouTube)
 …
Social networking sites are attracting significant interest worldwide and producing big data
The data are modeled as a graph, where a node is a person and an edge is a relationship between two people (followers, friends, etc.)
[Figure: a social graph; nodes are users (e.g., U1) carrying user-generated data, and edges are relationships]
Big Data from Scientific Experiments
Many scientists are using and producing vast amounts of data through scientific simulations and observations [Gra02]
Scientists try to discover patterns, trends, hidden messages, or even truth from this vast amount of scientific data through intensive analysis
Here comes the notion of data science, which “uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms” [Wik18g]
(Data science deals with any type of data, not only scientific data)
Turing Award winner Jim Gray envisioned data science as the fourth paradigm of science, following the empirical, theoretical, and computational sciences in human history [Gra02]
[Timeline of the paradigms of science]
 ~1000 AD: empirical science
 ~1500–1950s: theoretical science
 ~1950s–1990s: computational science
 ~1990–now: data science
Examples of Big Data Sources from Scientific Experiments
CERN particle accelerator
 The research proved the existence of the Higgs boson, a particle predicted to exist 50 years earlier
 The researchers evaluated massive amounts of data (25 petabytes/year) collected during the three years of testing [Wik18h]
DNA sequencing
 The amount of DNA sequence data has reached a quarter of the size of YouTube's yearly data production (Source: Washington Post, July 7, 2015)
Illumina DNA Sequencer
Recent Rise of Big Data
Computing power improvement ⇒ both CPU and data storage capacity
Accessibility improvement ⇒ in terms of computing power and data storage, through Cloud services
Easy distributed, parallel programming ⇒ such as MapReduce and Hadoop on distributed platforms
Computing Power Improvement
Moore’s Law: “…the number of transistors in a dense integrated circuit doubles about every two years” [Wik18b]
Source: https://en.wikipedia.org/wiki/Moore%27s_law
Accessibility Improvement
Users can now lease computing power and data storage from cloud services at low cost
 Example: Amazon EC2, Microsoft Azure, Google Cloud
Create accounts for the virtual machine and storage through the Web
Distributed, Parallel Programming: Introduction
Computing (for analysis)
 The amount of work required becomes greater than the capacity of a single CPU. Thus, to fill the gap, we need to employ parallelism.
Data storage
 Data are too big to store on one node and too slow to read from a single node. The obvious solution is to store data on multiple nodes and read from them simultaneously, in a distributed fashion.
Source: http://www.spiral.net/
Distributed, Parallel Programming: Challenges
However, there are several problems in working with multiple machines
 Coordination among multiple nodes
 Dealing with frequent hardware failures when we work with a large number of inexpensive processors and storage devices → replication
Nevertheless, programmers do not want to think about these complexities
MapReduce
“A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster” [Wik18c]
Typically for batch-oriented large-scale parallelization
Inspired by functional programming’s map() and reduce() functions
Proposed by Jeffrey Dean and Sanjay Ghemawat [Dea04] at Google in 2004
 Cited more than 25,000 times as of Sept. 2018
MapReduce: Map & Reduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase
 Map usually performs filtering and sorting
 Reduce usually performs a summary operation
 The output of map is provided as the input of reduce
 Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer
MapReduce: Example
Example: count the frequency of each unique word in a large document collection
 Required for building the inverted index of a search engine
1. Map: go through all the words in a split and produce a set of key-value pairs <word, frequency>
2. Reduce: aggregate the frequencies for the same key
3. Finally, produce the counts of all unique words in the output
[Figure: word-count data flow]
 Split 1 “big … big … data … data …” → Map → <big,1>, <big,1>, <data,1>, <data,1>
 Split 2 “big … data … ER … ER” → Map → <big,1>, <data,1>, <ER,1>, <ER,1>
 Split 3 “big … data …” → Map → <big,1>, <data,1>
 Reduce → <big,4>, <data,4>, <ER,2> (final output)
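The data flow above can be sketched in a few lines of plain Python. This is a toy simulation of the map, shuffle, and reduce phases, not the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a <word, 1> pair for every word in the split
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Group values by key (done by the framework between map and reduce)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one key
    return key, sum(values)

splits = ["big big data data", "big data ER ER", "big data"]
mapped = [pair for s in splits for pair in map_phase(s)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 4, 'data': 4, 'ER': 2}
```

In Hadoop, the shuffle step between the two phases is performed by the framework itself.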
Hadoop
A software framework for distributed storage and processing of big data using the MapReduce programming model
 The most popular open-source implementation of MapReduce
 Developed as a top-level Apache project
Significance: why Hadoop?
 Because Hadoop takes care of the complexities behind parallel distributed processing, programmers do not have to worry about them
In addition,
Hadoop Distributed File System (HDFS) supports fault tolerance and load balancing through replication (default: 3)
Hadoop provides horizontal scalability by using inexpensive commodity machines
Hadoop: Main Components
Hadoop Distributed File System (HDFS): a distributed file system that stores data on a massive number of commodity machines
Hadoop YARN: a platform for managing and scheduling computing resources in clusters (resource manager)
Hadoop MapReduce: an implementation of the MapReduce programming model
Hadoop Common: libraries and utilities
Source: http://www.dataintegration.ninja/big-data-and-hadoop-features-and-core-architecture/
Hadoop: Limitations and Other Platforms
Iterative processing → enhancements: Spark, Flink, …
Stream processing → Spark Streaming, Flink, …
Graph processing → Giraph, Turi, … (not covered in this talk)
High-level functionality, such as SQL, schemas, transactions, and indexes → SQL-on-Hadoop, NewSQL
…
Iterative Processing
Hadoop is optimized for a single batch job, whereas applications such as machine learning typically need iterative computation that repeats until convergence
Limitation of Hadoop for iterative processing
 The intermediate output (the result of the i-th iteration) is repeatedly written to disk and read back from disk at the next iteration, causing excessive disk I/Os and degrading performance
[Figure: iterative processing in Hadoop]
 Input → Job 1 (Map, Reduce) → HDFS → Job 2 (Map, Reduce) → HDFS → Job 3 (Map, Reduce) → …
 The output of each iteration is written to HDFS and read back by the next iteration
Spark for Iterative Processing
Solutions of Spark
 In Spark, a main architectural component is the Resilient Distributed Dataset (RDD), a main-memory structure representing a working set (i.e., intermediate results). It is distributed over a cluster of nodes and is fault-tolerant.
 By keeping RDDs in memory across iterations, we can eliminate expensive disk I/Os.
RDD’s are stored in main memory
Main components of Spark (Source: http://datastrophic.io/ )
RDD
 A read-only main-memory structure distributed over a cluster of nodes
 Acts as a cache for an input, an intermediate result, or an output
 Speeds up iterative operations (transformations)
 Allows parallel execution according to a DAG structure representing the data flow
 Fault-tolerant: an RDD (or chunks of an RDD) is made fault-tolerant by keeping track of its lineage and reconstructing it in case of data loss
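The lineage idea can be illustrated with a toy, pure-Python stand-in for an RDD. This is not the Spark API; the name `ToyRDD` and its methods are made up for illustration:

```python
class ToyRDD:
    """A toy, in-memory stand-in for an RDD: read-only data plus the
    lineage (parent + transformation) needed to rebuild it after a loss."""
    def __init__(self, data, lineage=None):
        self._data = list(data)      # cached working set in main memory
        self._lineage = lineage      # (parent RDD, function) or None

    def map(self, fn):
        # A transformation returns a new RDD and records its lineage
        return ToyRDD([fn(x) for x in self._data], lineage=(self, fn))

    def recover(self):
        # Fault tolerance: recompute the data from the parent via the lineage
        parent, fn = self._lineage
        self._data = [fn(x) for x in parent._data]

    def collect(self):
        return list(self._data)

base = ToyRDD([1, 2, 3])
doubled = base.map(lambda x: 2 * x)   # stays in memory, no disk I/O
doubled._data = None                  # simulate losing the cached partition
doubled.recover()                     # rebuild from lineage
print(doubled.collect())              # [2, 4, 6]
```

Because each iteration's working set stays in memory, no intermediate result has to round-trip through a distributed file system as in Hadoop.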
Stream Processing
Hadoop is designed for jobs that require (almost) all input data to be ready before processing starts, whereas real-time applications need to return results as soon as the input stream arrives
Limitations of Hadoop for stream processing
 The entire input should reside on HDFS (disks) before processing
 The reducers start only after all the mappers have completed, i.e., after all data splits (128 MB each) have been read and processed by Map
 Therefore, real-time processing is not feasible
Storm for Stream Processing
Solutions of Storm for stream processing
 Storm: a streaming engine, or a data stream management system (DSMS)
 In Storm, data are processed in real time as they arrive
Topology: defines a (continuous) query in the form of a directed acyclic graph (DAG) consisting of Spouts, Bolts, and Streams (edges)
 Spout: defines a stream source
 Bolt: defines the work logic executed for each record (default) or each micro-batch
 (A micro-batch is a set of data records collected over a very short period of time)
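A hypothetical spout-and-bolt pipeline can be mimicked with Python generators to show how each record is processed the moment it arrives. This is a sketch of the idea, not the actual Storm API:

```python
def spout():
    # Spout: a stream source; records arrive one by one
    for record in ["big", "data", "big", "ER"]:
        yield record

def counting_bolt(stream):
    # Bolt: processes each record as soon as it arrives and
    # emits the running count downstream
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# A two-node topology: spout -> bolt. Results appear per record,
# not after the whole input has been read (unlike a batch MapReduce job).
results = list(counting_bolt(spout()))
print(results)  # [('big', 1), ('data', 1), ('big', 2), ('ER', 1)]
```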
[Figure: a Storm topology of spouts and bolts] (source: http://storm.apache.org)
2018. 10 KAIST/DGIST
Part II: Big Data Platforms
Evolution of Data Management Systems [Wha18]
When MapReduce (and NoSQL) initially came about in 2004, we lost much of the high-level functionality of the relational DBMS, such as SQL, indexing, schemas, and transactions, in return for scalability
Since then, there have been many efforts to restore that functionality
Two distinct trends
 SQL-on-Hadoop
 NewSQL initiatives
[Figure: evolution of data management systems, plotted as functionality vs. scalability: O/S file system → RDBMS (high functionality), NoSQL (high scalability), and SQL-on-Hadoop / NewSQL (restoring functionality at scale)]
NoSQL
Provides a mechanism for storage and retrieval of data other than in tabular form (typically, in key-value format)
Mostly uses low-level query languages instead of SQL
Typically compromises consistency in favor of scalability and speed
Optimized for horizontal scaling to a cluster of machines
 Add more machines to the cluster when needed, without architectural changes
Examples:
Bigtable, …
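A minimal sketch of the key-value, horizontally-partitioned storage model. The node names are hypothetical, and real NoSQL stores typically add consistent hashing and replication on top of this idea:

```python
import hashlib

NODES = ["node0", "node1", "node2"]    # hypothetical cluster members
store = {n: {} for n in NODES}         # each node holds its own key-value shard

def node_for(key):
    # Hash-partition keys across the cluster: each key lives on the node
    # its hash maps to (no central schema, no SQL layer)
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def put(key, value):
    store[node_for(key)][key] = value

def get(key):
    return store[node_for(key)].get(key)

put("user:42", {"name": "Lee"})
assert get("user:42") == {"name": "Lee"}
```

Scaling out means adding entries to `NODES`; production systems use consistent hashing so that only a fraction of the keys move when a node is added.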
Advantages and Disadvantages of NoSQL
Advantages
 Scalability: a large address space for big data spanning a number of independent nodes (machines) connected through a network (i.e., a distributed file system)
 High performance of sequential scan: good for batch-oriented jobs (such as OLAP workloads)
 Fault tolerance and high availability: through (low-level) replication (with inexpensive commodity hardware)
Disadvantages
 Lack of transactions and consistency (i.e., ACIDity): not adequate for OLTP workloads (short transactions with typically random access, requiring ACIDity)
 Lack of high-level functionality such as SQL, schemas, and secondary indexes
SQL-on-Hadoop
Providing an SQL-like high-level language on top of the parallel execution layer of MapReduce/Hadoop
Suitable for processing OLAP-like batch operations
Examples [Abo15]: Hadapt (HadoopDB), Presto (Facebook), Impala, Spark SQL, Phoenix, …
NewSQL
“…provide the same scalable performance of NoSQL systems for on-line transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system” [Wik18d]
Providing high-level functionality (e.g., SQL, transactions, schemas, and secondary indexes) of conventional DBMSs
Base architecture: the “shared-nothing parallel DBMS”
Examples [Asl11][Kat13]: Google Spanner, VoltDB, MySQL Cluster, …
SQL-on-Hadoop vs. NewSQL: Representative Systems (by no means exhaustive)
[Figure: layered comparison of representative systems (language & model, parallel execution, key-value/relational, cache, and storage layers)]
 NoSQL/SQL-on-Hadoop side (Google, Facebook): Hadapt (HadoopDB), Presto, Impala, Spark SQL, Phoenix at the language & model layer; Hadoop (MapReduce) and Spark at the parallel execution layer; HBase, Cassandra, Bigtable, and DBMSs for storage at the key-value/relational layer; Memcached/TAO at the cache layer; GFS and HDFS at the storage layer
 NewSQL side (shared-nothing parallel DBMSs): F1/Spanner, VoltDB, MySQL Cluster, and ODYS Parallel-IR over the Odysseus DBMS (KAIST); Google Global Cache (GGC) at the cache layer; Colossus and local disks at the storage layer
Shared-Nothing Parallel DBMSs
Parallel DBMSs can be as good as, or even better than, MapReduce in performance
 Stonebraker et al. [Sto10] have shown that parallel DBMSs are (linearly) scalable and capable of processing petabyte-scale databases and large-scale query loads
 Floratou et al. [Flo11] have shown that parallel DBMSs outperform MapReduce, providing high performance and scalability by partitioning and storing tables on multiple nodes configured in a shared-nothing manner
Drawbacks of parallel DBMSs
 Expensive
 Too heavy, having much functionality that is not needed in practical large-scale applications, including the capability of processing global transactions with general workloads
 Not suitable where faults occur frequently
 Hard to set up and use
NewSQL Features
Improvements were made in various aspects toward inexpensive, massively-parallel DBMSs
 Shared-nothing architecture ⇒ all NewSQL systems
  Row/column partitioning, hash partitioning
 Lightweight (slimmed down) ⇒ VoltDB
  Single-threaded: minimal locking, no latches
  Simple logs: snapshots + command logging (to disk)
  Fault tolerance, availability: relying on replicas (automatic data replication)
 Main-memory DBMS ⇒ VoltDB, SAP HANA
 Concurrency control
  Timestamp-based (minimizing the use of locks)
  Multi-version concurrency control (MVCC) ⇒ Spanner, VoltDB, SAP HANA
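The timestamp-based MVCC idea can be sketched as follows. This is a toy single-process model, not the implementation of any of the systems above:

```python
import itertools

class MVCCStore:
    """A minimal multi-version store: writers append new versions tagged
    with a timestamp; a reader sees the latest version not newer than its
    snapshot, so reads never block writes (no read locks)."""
    def __init__(self):
        self._versions = {}                  # key -> [(ts, value), ...]
        self._clock = itertools.count(1)     # logical timestamp generator

    def write(self, key, value):
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def snapshot(self):
        # A read transaction fixes its snapshot timestamp when it starts
        return next(self._clock)

    def read(self, key, snapshot_ts):
        visible = [(ts, v) for ts, v in self._versions.get(key, [])
                   if ts <= snapshot_ts]
        return visible[-1][1] if visible else None

db = MVCCStore()
db.write("x", 10)            # version written at ts = 1
snap = db.snapshot()         # reader starts at ts = 2
db.write("x", 20)            # concurrent writer creates ts = 3
assert db.read("x", snap) == 10   # the reader still sees its snapshot
```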
ODYS Search Engine [Wha13]—A NewSQL Approach
A massively-parallel search engine [Wha13] using the Odysseus Object-Relational DBMS developed at KAIST [Wha02,05,15]
Efficiency
 DB-IR tight integration [Wha02]: IR features are implemented directly in the core of the DBMS, speeding up search performance
Scalability
 Shared-nothing architecture
 Each slave is capable of indexing 100 million Web documents
 Supports massive databases (64-bit architecture)
High-level functionality of the ODYS search engine
 SQL (especially, DB-IR integrated queries)
  Easy implementation of query interfaces by translating keyword queries into SQL queries
  Selection, aggregation, limited join (where only one table is partitioned), etc.
 Schemas
  Activation of advanced query processing methods (i.e., attribute embedding or IR index join) by controlling the schema
 Indexes
  B+-tree indexes for structured data and IR indexes for unstructured data (i.e., text)
 Transactions/consistency
  ACIDity for local transactions in a single machine
  Global transactions possible with 2-phase commit (not fully implemented)
  Immediate update allowed
Odysseus DB-IR-Spatial Tightly-Integrated DBMS
An Object-Relational DBMS developed at KAIST for over 26 years (1990 – 2016) [Wha02, 03, 05, 07, 10, 12, 13, 15]
An earlier version of this technology played a vital role in starting up NaverCom Co. (currently, Naver Co.) in 1996-2000, which has been the number one portal in Korea
Best Demonstration Award at the IEEE 21st Int’l Conf. on Data Engineering (ICDE), Tokyo, Japan, Apr. 5-8, 2005 [Wha05]
Tight integration of IR features (U.S. patented [Wha02]) as well as spatial database features with the DBMS
 A DBMS and, at the same time, a search engine [Wha02, Wha03, Wha05, Wha15]
 A DBMS and, at the same time, a GIS engine [Wha07, Wha10]
Concurrency control and recovery
 Coarse-granularity locking version: the shadow-page deferred-update recovery method (U.S. patented [Wha12])
 Fine-granularity locking version: the ARIES recovery method [Moh92]
Having many commercial applications
Consists of approximately 600,000 lines of C/C++ code
Open source released (600,000+ lines of C, C++; only the coarse-granularity locking version has been released as of Aug. 2016)
 https://github.com/odysseus-oosql/ODYSSEUS-OOSQL
 https://dblab.kaist.ac.kr/Open-Software/ODYSSEUS/main.html
Structure of the IR Index for DB-IR Tight Integration (U.S. Patented) [Wha02]
[Figure: IR index structure]
 A B+-tree on keywords points to posting lists; a long posting list is stored as a large object tree
 Each posting list records the number of postings; each posting consists of <docId, OID, number of occurrences, offsets>
 Each large posting list has a subindex (a B+-tree) for fast access within the list
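A simplified sketch of the posting lists such an IR index maintains, with plain Python dictionaries standing in for the B+-trees and large object trees of the figure:

```python
from collections import defaultdict

def build_ir_index(docs):
    # Build posting lists: keyword -> [(docId, [offsets]), ...],
    # the same information kept in each posting of the figure above
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(offset)
    return {word: sorted(postings.items()) for word, postings in index.items()}

docs = {1: "big data platforms", 2: "big data applications for big data"}
index = build_ir_index(docs)
print(index["big"])   # [(1, [0]), (2, [0, 4])]
```

In Odysseus, the analogous structure is stored inside the DBMS itself (keyword B+-tree, large-object posting lists, and per-list subindexes), which is what makes the DB-IR integration "tight".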
Implementation of the IR Index [Wha15]
The IR index is a composite structure consisting of a relation, B+-trees, and sub-indexes
[Figure: IR index implementation example]
 A pageInfo relation has attributes siteId (integer), title (text), content (text), and URL (varchar)
 An IR index is created for each text attribute: pageInfo_title_Inverted and pageInfo_content_Inverted
 Each IR index is a relation with attributes keyword, reverseKeyword, nPostings, and postingList (a large object), with B+-trees on keyword and reverseKeyword and a sub-index on each large posting list (an index on a tuple instance)
 (Reversed strings need not be stored; the reverseKeyword field does not contain a value)
ODYS Search Engine: Shared-Nothing Parallel Architecture Using Odysseus DBMS [Wha13]
[Figure: ODYS shared-nothing parallel architecture]
 ODYS Parallel-IR masters (parent processes) invoke the slaves (child processes) through asynchronous calls, connected via gigabit hubs (Hub1 … Hubnh) with one LAN card per hub
 Each slave machine runs the Odysseus DBMS with a shared buffer over a disk array (Disk1 … Diskw)
 Slaves are grouped per hub (Slave1 … Slavens per hub, nh·ns slaves in total)
Experimental Setting (10-Node ODYS Prototype)
Master †
One Linux machine (one Quad-Core 3.0GHz CPU, 6GB RAM)
Slaves ‡ (10 slaves)
 Four Linux machines (two Dual-Core 3.0GHz CPUs, 4GB RAM)
 One Linux machine (one Quad-Core 2.5GHz CPU, 4GB RAM)
 Five Linux machines (one Quad-Core 2.4GHz CPU, 8GB RAM)
 Four disk arrays (AS-2400~AS-2500, 0.9TB~3.9TB, RAID5, 200MB/s bandwidth, 512MB~1GB cache, average 59.5MB/s disk transfer rate, 13 disks (arms) + 1 parity disk + 1 hot spare)
 One disk array (TN-6416S, 13TB, RAID5, 4Gbit/s bandwidth, 512MB cache, average 83.3MB/s disk transfer rate, 13 disks (arms) + 1 parity disk)
 Five internal disk arrays (B110i, 5TB, 768MB/s bandwidth, 81.2MB/s disk transfer rate, 10 disks (arms) + 1 parity disk)
Network †‡
 Eleven gigabit LAN cards (Intel 82574L dual-port (1), Intel 82541GI single-port (5), HP NC326i dual-port (5))
 A gigabit hub (HP 1410-24G, 1000Mbps, 24 ports)
Data
 114 million Web documents × 2 (duplicated)
 Size of loaded data: Web pages (1.55TB) and IR index (1.84TB) for 228 million Web documents
 Each slave indexes 22.8 million Web documents (Note: a slave is capable of indexing 100 million documents)
† The master (ODYS Parallel-IR) consists of 58,000 lines of C and C++ code
‡ The slave (Odysseus DBMS) consists of 600,000 lines of C and C++ code
†‡ We use socket-based RPC consisting of 17,000 lines of C, C++, and Python code developed by the authors
Performance Projection for 300-Node Real-World-Scale ODYS
300-node real-world-scale ODYS
 One ODYS set consists of 4 masters and 300 slaves
 300 slaves are capable of indexing 30 billion Web pages
 Performance projection through performance modelling and experimental data from the 10-node ODYS prototype
Estimated Average Total Query Response Time (300-node ODYS) [Wha13] (measured with a 10-node ODYS and extended to a 300-node one through performance modelling)
[Graph: average total query response time (ms) vs. arrival rate (million queries/day/set), curves TOTAL-EST-300 and SLAVE-MAX-EST-300; response time is 194ms at 7 million queries/day/set and 148ms at 3.5 million queries/day/set]
 − Query load: 1 billion queries/day † (QUERY-MIX1)
 − Web pages indexed: 6.84 (30) billion; 22.8 (100) million Web pages/slave; ( ) indicates max capacity
 − Nodes required: 43,472 (143 sets of 304 nodes, for 194ms/query)
 − Nodes required: 86,944 (286 sets of 304 nodes, for 148ms/query)
† Google Search Statistics [Goo18] indicates Google Search handles 3 billion queries/day as of May 30th, 2018; Nielsenwire [Nie10] reports that Google handled 214 million queries/day in the U.S. in Feb. 2010
Summary of ODYS
We have shown that a massively-parallel search engine can be implemented using a DB-IR tightly-integrated parallel DBMS
 Capable of handling real-world-scale data and query loads
 Providing high-level functionality
We have shown a detailed implementation of the ODYS search engine
 Capable of indexing 100 million Web pages per node with a shared-nothing architecture → high scalability
 Tightly-integrated DB-IR capability → high performance
 SQL, schemas, and indexes → high-level functionality
Part III: Big Data Applications
IBM Watson
Recommender Systems
Intelligent Personal Assistants
AlphaGo
IBM Watson
A question-answering computer system that can answer questions posed in natural language
 In 2011, Watson competed on Jeopardy! against legendary champions Brad Rutter and Ken Jennings and won first place
Source: https://www.youtube.com/watch?v=P18EdAKuC1U
The knowledge was built from a vast amount of data
 Encyclopedias, dictionaries, thesauri, newswire articles, and literary works, as well as databases, taxonomies, and ontologies (e.g., DBpedia, WordNet, and YAGO) ← tens of millions of documents [Wik18i]
<Image Source: [Fer10]>
IBM Watson for Oncology
A solution for cancer treatment
 A Q&A system similar to the Jeopardy! system
 Fueled by Big Data from relevant guidelines, best practices, and medical journals and textbooks (by the way, doctors can hardly afford the time to read all these articles and documents)
 Evaluates medical evidence and displays potential treatment options ranked by level of confidence
Image source: http://www.ibm.com/
However, it was also reported that IBM's Watson gave unsafe recommendations for treating cancer
 “IBM’s Watson Hasn’t Beaten Cancer, But A.I. Still Has Promise”
 “…But in the documents obtained by STAT (a medical web site), doctors who had tried to use Watson to help them design treatment complained that the system wasn’t ready to practice medicine.”
<Source: Bloomberg, August 25, 2018, https://www.bloomberg.com/view/articles/2018-08-24/ibm-s-watson-failed-against-cancer-but-a-i-still-has-promise >
Possible problems: quality of the information sources
Recommender Systems
A platform or engine that suggests items to users by predicting how they would rate the items
 35% of consumer purchases on Amazon and 75% of video watches on Netflix come from recommendations [Mac13]
 Example: Amazon.com's recommender system
Basic approach: collaborative filtering
 Example: two users U1 and U3 have provided similar feedback (preferences) on other items, so it is reasonable to recommend I3 to U3
 Here, again, Big Data plays a vital role in making the recommender system produce intelligent answers with high confidence (we have a large number of U1's to make the confidence level high)
[Table: a sparse user–item matrix of purchases, ratings, or preferences (users U1–U7, items I1–I7); U3's rating of I3 is unknown (?)]
Big Data: abundant data for recommendation from 26.5M transactions/sec in Amazon
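A minimal user-based collaborative filtering sketch. The small ratings matrix below is hypothetical (made up for illustration); since only U1 has rated I3 here, the prediction for U3 equals U1's rating:

```python
import math

# Hypothetical sparse ratings; we want to predict U3's rating of I3
ratings = {
    "U1": {"I1": 5.0, "I2": 5.0, "I3": 4.0, "I5": 3.0},
    "U2": {"I2": 3.0, "I6": 2.5},
    "U3": {"I1": 4.5, "I2": 4.5, "I5": 3.0},
}

def cosine_sim(a, b):
    # Similarity computed over the items both users have rated
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict(user, item):
    # Similarity-weighted average of other users' ratings for the item
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = cosine_sim(ratings[user], r)
            num += s * r[item]
            den += s
    return num / den if den else None

print(predict("U3", "I3"))  # 4.0 (U1 is the only rater of I3)
```

With Big Data, many highly similar users contribute to the weighted average, which is exactly what raises the confidence of the prediction.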
Recently, deep learning has been actively adopted for recommendation
 Given user profiles and item features, the probability of a user u buying an item i is trained by deep learning techniques
 Amazon DSSTNE (pronounced “Destiny”) is an open-source software library for training and deploying recommendation models
[Figure: a deep neural network; input: user profiles and item features; output: user–item probability]
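A single logistic unit is the smallest possible stand-in for such a network. The numbers below are toy values for illustration only (this is not DSSTNE or any production model):

```python
import math

def predict_probability(user_profile, item_features, weights, bias):
    # One logistic unit: input is the concatenation of user profile and
    # item features; output is the probability that the user buys the item
    x = user_profile + item_features
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid

# Hypothetical toy features: 2 user features + 2 item features
user = [1.0, 0.0]       # e.g., age bucket, past purchases in this category
item = [0.5, 1.0]       # e.g., price bucket, popularity
weights = [0.8, -0.3, 0.4, 1.1]
p = predict_probability(user, item, weights, bias=-1.0)
assert 0.0 < p < 1.0    # a valid probability
```

A deep network replaces this single unit with stacked layers, but the input/output contract of the figure is the same.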
Intelligent Personal Assistant
“A mobile software agent that can perform tasks, or services, on behalf of an individual based on a combination of user input, location awareness, and the ability to access information from a variety of online sources…” [Wik18e]
Example:
 Apple Siri
 Samsung
 Amazon
Example: Google Assistant
“Google Assistant” exploits Big Data, such as search keywords, locations visited, e-mails, and calendar entries, to provide suitable answers and recommendations
 Example: it prompts you about half an hour before you leave to let you know the approximate drive time based on current traffic conditions (by looking at your calendar entries!)
Example: Amazon Echo
“Echo” is a device connected to Amazon's intelligent personal assistant, Alexa
 Amazon is selling Echo at a very low price ($30) to customers
 Over 5M Echo devices have been sold in the last 2 years
 All those people asking Alexa to order kitchen supplies, turn on the lights, or play music give Amazon a valuable stockpile of data (adding to Big Data)
By using this Big Data, Amazon builds a “360-degree view” of their customers’ buying habits.
Amazon Echo
Big data are the most valuable assets
 Facebook knows the people you know and the places you go; Google knows the things you use and search for on the Internet; Amazon knows the items you buy online (or even offline, through Amazon Go, etc.)
AlphaGo
A computer program that plays Go, developed by Google DeepMind
 In March 2016, AlphaGo beat Lee Sedol by a score of 4 to 1
 In May 2017, AlphaGo beat Ke Jie, the world's No. 1 ranked player at the time, by a score of 3 to 0
Source: https://www.inverse.com/article/30681-alphago-documentary-tribeca-film-festival
[Sil16]
“Configuration and Strength” [Wik18f], “Power Consumption” [Dee18]
Learning [Wik18f]
 AlphaGo was initially trained to mimic human play by attempting to match the moves of expert players from recorded historical games
 It was trained using a KGS Go Server database (Big Data) of around 30 million moves from 160,000 games played by 6- to 9-dan human players
 Supervised learning + reinforcement learning
 In this version, without the Big Data of human games, it would not have been possible to reach the level of human players
AlphaGo Zero
A version created without using data from human games, which nevertheless became stronger than all previous versions [Sil17]
 By playing games against itself (self-play), AlphaGo Zero surpassed the strength of AlphaGo Lee in three days by winning 100 games to 0, reached the level of AlphaGo Master in 21 days, and exceeded all the old versions in 40 days [Dee18]
 Training was done solely by reinforcement learning, without recorded moves from human games
Where is the role of Big Data? Answer: AlphaGo Zero generates its own Big Data through an enormous number of self-play games
[Sil17]
Conclusions
Many distributed data processing platforms for Big Data have been actively developed in industry and academia
The ODYS search engine, developed at KAIST, has shown that a massively-parallel search engine with higher functionality can be implemented using a DB-IR tightly-integrated parallel DBMS.
Emerging applications are realizing big data intelligence
The boom of artificial intelligence is fueled by recent Big Data technologies
Big Data is essential for training deep neural networks
References
[Abi05] Serge Abiteboul et al., “The Lowell Database Research Self-Assessment,” Comm. of the ACM, Vol. 48, No. 5, pp. 111-118, May 2005.
[Abo15] Daniel Abadi et al., “Tutorial: SQL-on-Hadoop Systems,” In Proc. 41st Int’l Conf. on Very Large Data Bases, pp. 2050-2051, Kohala Coast, Hawaii, Aug. 2015.
[Asl11] Matt Aslett, "How Will the Database Incumbents Respond to NoSQL and NewSQL?," Technical Report, the 451 Group, Apr. 2011. (available at https://451research.com/report-short?entityId=66963)
[CRW05] Surajit Chaudhuri, Raghu Ramakrishnan, and Gerhard Weikum, “Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?,” In Proc. 2nd Biennial Conf. on Innovative Data Systems Research, Asilomar, California, pp. 1-12, Jan. 2005.
[Dea04] Dean, J. and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters,” In Proc. 6th Symposium on Operating System Design and Implementation (OSDI), pp. 137-150, Dec. 2004.
[Fer10] Ferrucci, D. et al., "Building Watson: An Overview of the DeepQA Project,” AI Magazine, Vol. 31, No. 3, pp. 59-79, July 2010.
[Flo11] Floratou, A., Patel, J. M., Shekita, E. J., and Tata, S., “Column-oriented Storage Techniques for MapReduce,” In Proc. of the VLDB Endowment, Vol. 4, No. 7, pp. 419-429, 2011.
[Goe14] Goes, P., "Design Science Research in Top Information Systems Journals,” MIS Quarterly: Management Information Systems, Vol. 38, No. 1, 2014.
[Goo18] Google Search Statistics - Internet Live Stats, www.internetlivestats.com, retrieved 2018-05-30.
[Gra02] Gray, J. and Szalay, A., “The World Wide Telescope: An Archetype for Online Science,” Comm. ACM, Vol. 45, No. 11, pp. 50-54, Nov. 2002.
[Kat13] Grolinger, K., Higashino, W. A., Tiwari, A., and Capretz, M. A. M., “Data Management in Cloud Environments: NoSQL and NewSQL Data Stores,” Journal of Cloud Computing: Advances, Systems and Applications, Vol. 2, No. 22, 2013.
[Lan01] Laney, D.,"3D Data Management: Controlling Data Volume, Velocity and Variety,” META Group Research Note, Vol. 6, No. 70, 2001.
[Len04] Lentz, A., “MySQL Storage Engine Architecture,” In MySQL Developer Articles, MySQL AB, May 2004.
[Mac13] MacKenzie, I., Meyer, C., and Noble, S., "How Retailers Can Keep up with Consumers," McKinsey & Company Report, Oct. 2013 (https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers).
[Moh92] Mohan, C. et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,” ACM Trans. Database Systems, Vol. 17, No. 1, pp. 94-162, 1992.
[Nie10] Nielsenwire, “Nielsen Reports February 2010 U.S. Search Rankings,” Technical Report, Mar. 15, 2010 (available at http://blog.nielsen.com/nielsenwire/online_mobile/nielsen-reports-february-2010-u-s-search-rankings/).
[Sil16] Silver, D. et al., "Mastering the Game of Go with Deep Neural Networks and Tree Search," Nature, Vol. 529, pp. 484-489, Jan. 2016.
[Sil17] Silver, D. et al., "Mastering the Game of Go without Human Knowledge," Nature, Vol. 550, pp. 354-359, Oct. 2017.
[Sto10] Stonebraker, M. et al., “MapReduce and Parallel DBMSs: Friends or Foes?,” Communications of the ACM (CACM), pp. 64-71, Jan. 2010.
[Weik07] Gerhard Weikum, “DB&IR: Both Sides Now,” In Proc. 2007 ACM SIGMOD Int’l Conf. on Management of Data, pp. 25-30, Beijing, China, June 2007.
[Wha02] Whang, K. et al., “An Inverted Index Storage Structure Using Subindexes and Large Objects for Tight Coupling of Information Retrieval with Database Management Systems,” U.S. Patent No. 6,349,308, Feb. 19, 2002, Application No. 09/250,487, Feb. 15, 1999.
[Wha03] Whang, K., “Tight Coupling: A Way of Building High-Performance Application Specific Engines,” a presentation at the panel Next-Generation Web Technology and Database Issues, the 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), Kyoto, Japan, URL:http://db-www.aist-nara.ac.jp/dasfaa2003/ppt.html, Mar. 2003.
[Wha05] Whang, K., Lee, M., Lee, J., Kim, M., and Han, W., “Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features,” In Proc. IEEE Int'l Conf. on Data Engineering, Tokyo, Japan, pp. 1104-1105, Apr. 2005. This paper received the Best Demonstration Award.
[Wha07] Whang, K., Lee, J., Kim, M., Lee, M., and Lee, K., “Odysseus: a High-Performance ORDBMS Tightly-Coupled with Spatial Database Features,” In Proc. IEEE Int'l Conf. on Data Engineering, Istanbul, Turkey, pp. 1493-1494, Apr. 2007.
Copyright © 2018 by Kyu-Young Whang 65
2018. 10 KAIST/DGIST
[Wha10] Whang, K., Lee, J., Kim, M., Lee, M., Lee, K., Han, W., and Kim, J., “Tightly-Coupled Spatial Database Features in the Odysseus/OpenGIS DBMS for High-Performance,” GeoInformatica, Vol. 14, No. 4, pp. 425-446, 2010.
[Wha12] Whang, K. et al., “A Method for Recovering Data in a Storage System,” U.S. Patent No. 8,108,356, Jan. 31, 2012, Application No. 12/208,014, Sept. 10, 2008.
[Wha13] Kyu-Young Whang, Tae-Seob Yun, Yeon-Mi Yeo, Il-Yeol Song, Hyuk-Yoon Kwon, and In-Joong Kim, “ODYS: an Approach to Building a Massively-Parallel Search Engine Using a DB-IR Tightly-Integrated Parallel DBMS for Higher-Level Functionality,” In Proc. 2013 ACM Int’l Conf. on Management of Data (SIGMOD), pp. 313-324, June 2013.
[Wha15] Whang, K., Lee, J., Lee, M., Han, W., Kim, M., and Kim, J., “DB-IR Integration Using Tight-Coupling in the Odysseus DBMS,” World Wide Web, Vol. 18, No. 3, pp. 491-520, 2015.
[Wha18] Whang, K., Yun, T., Park, J., Cho, K., Kim, S., Yi, I., Na, I., and Lee, B., Building Social Networking Services Systems Using the Relational Shared-Nothing Parallel DBMS, Tech. Report CS-TR-2018-419, School of Computing, KAIST, Aug. 2018.
[Wik18a] Wikipedia, “Big Data,” https://en.wikipedia.org/wiki/Big_data
[Wik18b] Wikipedia, “Moore’s Law,” https://en.wikipedia.org/wiki/Moore%27s_law
[Wik18c] Wikipedia, “MapReduce,” https://en.wikipedia.org/wiki/MapReduce
[Wik18d] Wikipedia, “NewSQL,” https://en.wikipedia.org/wiki/NewSQL
[Wik18e] Wikipedia, “Automated Personal Assistant,” https://en.wikipedia.org/wiki/Automated_personal_assistant
[Wik18f] Wikipedia, “AlphaGo,” https://en.wikipedia.org/wiki/AlphaGo
[Wik18g] Wikipedia, “Data Science,” https://en.wikipedia.org/wiki/Data_science
[Wik18h] Wikipedia, “Higgs Boson,” https://en.wikipedia.org/wiki/Higgs_boson
[Wik18i] Wikipedia, “Watson (computer),” https://en.wikipedia.org/wiki/Watson_(computer)
[Dee18] DeepMind, “AlphaGo Zero: Learning from Scratch,” https://deepmind.com/blog/alphago-zero-learning-scratch/
Thanks!