
Lecture 3 – Foundations for Big Data Systems and Programming

CS 626 Large Scale Data Science Jun Zhang Originally prepared by Dr. Licong Cui Department of Computer Science University of Kentucky January 28, 2020 Lecture 3 – Foundations for Big Data Systems and Programming
Transcript
Page 1: Lecture 3 – Foundations for Big Data Systems and Programming - Computer …jzhang/CS626/Lecture3.pdf · 2020. 1. 30. · Department of Computer Science University of Kentucky January

CS 626 Large Scale Data Science

Jun Zhang
Originally prepared by Dr. Licong Cui

Department of Computer Science
University of Kentucky

January 28, 2020

Lecture 3 – Foundations for Big Data Systems and Programming


Review: Five P’s of Data Science

People

Purpose

Process

Platforms

Programmability


Review: Steps in the Data Science Process

Acquire → Prepare → Analyze → Report → Act

Big Data Engineering Computational Big Data Science


Basic Scalable Computing Concepts

Distributed File Systems

Scalable Computing over the Internet

Programming Models for Big Data


Traditional File System

[Figure: storage capacities 64 GB, 256 GB, 512 GB, 1 TB]


Copy data to an external hard drive?

Buy a bigger disk?

Work Personal

Distribute data on multiple computers?


Store Data in Server?


Cluster of Machines


Distributed File System (DFS)

File system spreads over multiple, autonomous computers

Distributed File System

Rack


Data Replication

[Figure: a data set split into blocks 1 to 5, with copies of each block stored on several nodes of a rack]
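The replication idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular file system places replicas: the round-robin policy, the node names, the replication factor of 3, and the function `place_replicas` are all assumptions made for the example.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only; real distributed file
    systems use their own placement rules.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        # Block i goes to nodes i, i+1, ..., wrapping around the node list.
        for r in range(replication):
            node = nodes[(i + r) % len(nodes)]
            placement[node].append(block)
    return placement


blocks = [1, 2, 3, 4, 5]  # the five data blocks from the slide
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # assumed names
placement = place_replicas(blocks, nodes)

for node, stored in placement.items():
    print(node, sorted(stored))
```

Every block ends up on three different nodes, so no single node holds the only copy of any block.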


Fault Tolerance

[Figure: the same replicated blocks 1 to 5 in a rack; the data remains available when a node fails]
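Why replication gives fault tolerance can be checked with a small simulation. The sketch below assumes a hypothetical placement in which each of blocks 1 to 5 is stored on three of five nodes; the node names and the helper `readable_blocks` are invented for the example.

```python
# Hypothetical placement: node name -> set of block ids stored on that node.
# Each block appears on exactly three nodes.
placement = {
    "node-a": {1, 4, 5},
    "node-b": {1, 2, 5},
    "node-c": {1, 2, 3},
    "node-d": {2, 3, 4},
    "node-e": {3, 4, 5},
}


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    live = set()
    for node, blocks in placement.items():
        if node not in failed:
            live |= blocks
    return live


all_blocks = {1, 2, 3, 4, 5}

# With three replicas per block, any two node failures are survivable.
after_two_failures = readable_blocks(placement, failed={"node-a", "node-b"})

# Losing all three holders of block 1 makes that block unreadable.
after_three_failures = readable_blocks(
    placement, failed={"node-a", "node-b", "node-c"}
)

print(after_two_failures == all_blocks)
print(sorted(all_blocks - after_three_failures))
```

With a replication factor of r, data is only lost when all r nodes holding the same block fail together.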


High Concurrency

[Figure: replicated blocks 1 to 5 in a rack, with Reader 1, Reader 2, and Reader 3 accessing the data at the same time]


Distributed File System

Rack

Data Partitioning

Data Replication

Data Scalability

Fault Tolerance

High Concurrency


Scalable Computing Over the Internet

Single compute node


Parallel Computer

Expensive


Commodity Cluster

Affordable

Less-specialized

Distributed computing over the Internet

Reduced computing cost


Architecture of a Commodity Cluster

Network

Rack


Distributed Computing

[Figure: several racks, each attached to a network]

• Enables data-parallelism
• Move computation to data
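"Move computation to data" can be illustrated with a mean computed over partitions that stay where they are. In this sketch the dictionary `partitions` stands in for data stored on three nodes, and `local_summary` is a hypothetical function that would run on each node; only the small (sum, count) pairs travel over the network, never the raw values.

```python
# Assumed example data: each node holds one partition of the values.
partitions = {
    "node-a": [3, 1, 4],
    "node-b": [1, 5, 9],
    "node-c": [2, 6],
}


def local_summary(values):
    """Runs where the data lives and returns a small (sum, count) pair."""
    return (sum(values), len(values))


# Each node computes its own summary in parallel (simulated sequentially here).
partials = [local_summary(values) for values in partitions.values()]

# The coordinator combines the small partial results.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count

print(total, count, mean)
```

The same pattern works for any computation that can be expressed as independent per-partition work followed by a cheap combine step.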


Programming Models for Big Data

Runtime Libraries Programming Languages

Programming Model = abstractions

Distributed File System

Rack

Infrastructure:


Requirements for Big Data Programming Models

1. Support Big Data Operations

o Split large volumes of data

o Access data fast

o Distribute computations to nodes
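The split and distribute operations listed above can be sketched on a single machine. This is only an analogy: worker threads from Python's `concurrent.futures` stand in for cluster nodes, and the record list, chunk size, and the functions `split` and `count_long_words` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_long_words(chunk):
    """Per-chunk computation: count words longer than four characters."""
    return sum(1 for word in chunk if len(word) > 4)


records = ["big", "data", "systems", "hadoop", "map", "reduce", "cluster", "rack"]
chunks = split(records, chunk_size=3)

# Each worker thread plays the role of one node processing its own chunk.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_long_words, chunks))

long_words = sum(partial_counts)
print(partial_counts, long_words)
```

A real programming model hides the splitting and scheduling from the programmer, who only supplies the per-chunk function and the way to combine results.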


Requirements for Big Data Programming Models (cont.)

2. Handle Fault Tolerance

o Replicate data partitions

o Recover files when needed


Requirements for Big Data Programming Models (cont.)

3. Enable Adding More Racks

[Figure: data blocks 1 to 5 stored in one rack, with additional racks added alongside it]


Requirements for Big Data Programming Models (cont.)

4. Optimized for specific data types

Document, Table, Key-value, Graph, Multimedia, Stream


Example – Suits of Cards



Key Programming Model

MapReduce

A programming model for Big Data

Many implementations


Programming Model = abstractions

Runtime Libraries Programming Languages

Support large data volumes

Provide fault tolerance

Enable scale out

MapReduce


What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of:

a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)

a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
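The student-name example can be written out as a tiny in-memory program. This is a sketch of the model only, not of Hadoop or any other implementation: the functions `map_phase`, `shuffle`, and `reduce_phase` and the sample names are assumptions, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student."""
    for name in students:
        yield (name, 1)


def shuffle(pairs):
    """Group values by key: one 'queue' per first name."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: summarize each queue by counting its entries."""
    return {key: sum(values) for key, values in groups.items()}


students = ["Ana", "Ben", "Ana", "Cy", "Ben", "Ana"]  # assumed sample data
name_counts = reduce_phase(shuffle(map_phase(students)))
print(name_counts)
```

In a distributed setting the map calls run on many nodes at once, the grouping step moves pairs with the same key to the same node, and the reduce calls then run independently per key.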


Google Distributed System


Google File System Architecture


Google MapReduce


Hadoop Ecosystem - History

Google

Yahoo! released Hadoop in 2005

More open-source projects

2004


Hadoop Ecosystem – Layer Diagram


What is Hadoop?

Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

Goals/Requirements

Abstract and facilitate the storage and processing of large and/or rapidly growing data sets

High scalability and availability

Use commodity hardware (cheap!)

Fault-tolerance

Move computation to data


Hadoop Architecture



Hadoop Architecture (cont.)

HDFS: Name Node, Data Node

MapReduce: Job Tracker, Task Tracker
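The division of labor between the Name Node and the Data Nodes can be pictured with a toy model: the Name Node keeps only metadata (which blocks make up a file and where the replicas live), while the Data Nodes hold the block contents. The classes and the `read_file` helper below are hypothetical and are not the HDFS API; they only mirror the roles named on the slide.

```python
class DataNode:
    """Toy stand-in for a node that stores block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> data

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy stand-in for the metadata service: it stores no file data."""

    def __init__(self):
        self.files = {}  # path -> list of (block id, [DataNode, ...])

    def add_file(self, path, block_locations):
        self.files[path] = block_locations

    def locate(self, path):
        return self.files[path]


def read_file(namenode, path):
    """Ask the name node where the blocks are, then read them from data nodes."""
    parts = []
    for block_id, replicas in namenode.locate(path):
        parts.append(replicas[0].read(block_id))  # first replica, for simplicity
    return "".join(parts)


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store("blk_0", "hello ")
dn2.store("blk_0", "hello ")  # second replica of the same block
dn2.store("blk_1", "world")

nn = NameNode()
nn.add_file("/demo.txt", [("blk_0", [dn1, dn2]), ("blk_1", [dn2])])

content = read_file(nn, "/demo.txt")
print(content)
```

The Job Tracker and Task Tracker follow the same coordinator and worker split on the compute side: one service schedules work and the per-node services carry it out.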


Reminder: Downloading and Installing Hadoop

Download and Install VirtualBox
https://www.virtualbox.org/wiki/Downloads

Download and Install Cloudera QuickStart VM
https://www.cloudera.com/downloads/quickstart_vms/5-13.html

Launch the Cloudera VM


Reminder: Import Appliance in VirtualBox


Reminder: Amazon Web Services (AWS) Educate Sign Up

AWS Educate
https://aws.amazon.com/education/awseducate/

