Massive Data Algorithmics - Sharif University of...

Date post: 06-Aug-2020
Massive Data Algorithmics - Sharif University of Technology · ce.sharif.edu/.../root/massivedata.slides.pdf

Massive Data Algorithmics

Lecture 1: Introduction


.. Massive Data

Massive datasets are being collected everywhere

Storage management software is a billion-dollar industry


.. Examples

Phone: AT&T 20TB phone call database, wireless tracking

Consumer: WalMart 70TB database, buying patterns

Web: Web crawl of 200M pages and 2000M links; Akamai stores 7 billion clicks per day

Geography: NASA satellites generate 1.2TB per day


.. Grid Terrain Data

Appalachian Mountains (800km x 800km)

100m resolution ⇒ ∼64M cells ⇒ ∼128MB raw data (∼500MB when processing)

∼1.2GB at 30m resolution (NASA SRTM mission acquired 30m data for 80% of the earth land mass)

∼12GB at 10m resolution (much of US available from USGS)

∼1.2TB at 1m resolution (selected, mostly military)
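The size estimates above follow from simple arithmetic, which the sketch below reproduces. It assumes 2 bytes per cell (inferred from 64M cells ⇒ 128MB; the slides do not state the cell size), and the function name `grid_raw_size` is made up for illustration. The slide figures are rounded, so the computed values only match approximately (for example, this gives roughly 1.4GB at 30m).

```python
def grid_raw_size(side_km: float, resolution_m: float, bytes_per_cell: int = 2):
    """Return (cell count, raw bytes) for a square grid terrain.

    bytes_per_cell=2 is an assumption inferred from 64M cells => 128MB.
    """
    cells_per_side = int(side_km * 1000 / resolution_m)
    cells = cells_per_side ** 2
    return cells, cells * bytes_per_cell


# 800km x 800km region at the resolutions mentioned on the slide.
for resolution in (100, 30, 10, 1):
    cells, size = grid_raw_size(800, resolution)
    print(f"{resolution:>4} m: {cells:.3g} cells, {size / 1e9:.3g} GB raw")
```

Each tenfold improvement in resolution multiplies the cell count by 100, which is why the data quickly outgrows main memory.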


.. LIDAR Terrain Data

Massive (irregular) point sets (1-10m resolution)

Appalachian Mountains: between 50GB and 5TB


.. Application Example: Flooding Prediction


.. Random Access Machine Model

Standard theoretical model of computation:

Infinite memory

Uniform access cost

Simple model crucial for success of computer industry


.. Hierarchical Memory

Modern machines have a complicated memory hierarchy

Levels get larger and slower further away from the CPU

Data is moved between levels using large blocks


.. Slow IO

Disk access is $10^6$ times slower than main memory access

"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)

Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 Kbytes)

Important to store/access data to take advantage of blocks (locality)
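The effect of locality can be illustrated with a toy simulation. The sketch below is not from the slides: it assumes a buffer that holds exactly one block and counts how many block transfers a given access order triggers. The names `count_block_reads`, `sequential_order`, and `strided_order` are hypothetical.

```python
def count_block_reads(access_order, block_size):
    """Count block transfers, assuming a buffer that holds exactly one block."""
    reads = 0
    current_block = None
    for index in access_order:
        block = index // block_size
        if block != current_block:  # item is not in the buffered block
            reads += 1
            current_block = block
    return reads


def sequential_order(n):
    """Visit items 0, 1, 2, ..., n-1."""
    return range(n)


def strided_order(n, block_size):
    """Visit items 0, B, 2B, ..., then 1, B+1, ...: every access changes block."""
    return (i for start in range(block_size) for i in range(start, n, block_size))


n, b = 100_000, 100  # hypothetical sizes
print(count_block_reads(sequential_order(n), b))  # one read per block
print(count_block_reads(strided_order(n, b), b))  # one read per item
```

Both orders touch every item exactly once, yet the strided order costs a factor of B more block transfers in this simplified setting.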


.. Scalability Problems

Most programs are developed in the RAM model. They still run on large datasets because the OS moves blocks as needed

A modern OS utilizes sophisticated paging and prefetching strategies. But if a program makes scattered accesses, even a good OS cannot take advantage of block access


.. External Memory Model (Cache-Aware Model)

N = # of items in the problem instance

B = # of items per disk block

M = # of items that fit in main memory

T = # of items in output

I/O: Move block between memory and disk

We assume (for convenience) that $M > B^2$


.. Fundamental Bounds

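Operation    Internal      External
Scanning     $N$           $\frac{N}{B}$
Sorting      $N \log N$    $\frac{N}{B} \log_{M/B} \frac{N}{B}$
Permuting    $N$           $\min\left(N, \frac{N}{B} \log_{M/B} \frac{N}{B}\right)$
Searching    $\log N$      $\log_B N$

Note:

Linear I/O: $O(N/B)$

Permuting is not linear

Permuting and sorting bounds are equal in all practical cases

The $B$ factor is VERY important: $\frac{N}{B} < \frac{N}{B} \log_{M/B} \frac{N}{B} \ll N$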
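To get a feel for these bounds, the sketch below evaluates them for one concrete parameter choice. The function names and the values of N, M, and B are assumptions for illustration, and constant factors hidden by the asymptotic notation are ignored.

```python
import math


def scan_ios(n, b):
    """Scanning bound: N/B block transfers."""
    return n / b


def sort_ios(n, m, b):
    """Sorting bound: (N/B) * log_{M/B}(N/B)."""
    blocks = n / b
    return blocks * math.log(blocks, m / b)


def permute_ios(n, m, b):
    """Permuting bound: min(N, sorting bound)."""
    return min(n, sort_ios(n, m, b))


def search_ios(n, b):
    """Searching bound: log_B N."""
    return math.log(n, b)


# Hypothetical instance: 2^32 items, 2^20 items of memory, 2^8 items per block.
# Note that M > B^2 holds here (2^20 > 2^16).
N, M, B = 2**32, 2**20, 2**8
print(f"scan:    {scan_ios(N, B):.3g}")
print(f"sort:    {sort_ios(N, M, B):.3g}")
print(f"permute: {permute_ios(N, M, B):.3g}")
print(f"search:  {search_ios(N, B):.3g}")
print(f"N:       {N:.3g}")
```

With these values the logarithmic factor is about 2, so sorting costs only about twice as many I/Os as scanning, and both are far below N.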


.. Cache-Oblivious Model

A cache-oblivious algorithm is an algorithm designed to take advantage of a CPU cache without having the size of the cache as an explicit parameter

A cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes.

The idea for cache-oblivious algorithms was conceived by Charles E. Leiserson as early as 1996 and first published by Harald Prokop in his master's thesis at the Massachusetts Institute of Technology in 1999.
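A commonly cited example of the cache-oblivious style is recursive matrix transposition, sketched below. It is not taken from the slides. The point is that the code never mentions a block or cache size: it keeps halving the larger dimension, so at some recursion depth the subproblem fits in whatever cache exists. The function names are hypothetical, and with Python lists the sketch shows only the recursion structure, not realistic cache behaviour.

```python
def _transpose_rec(a, b, r0, r1, c0, c1):
    """Transpose the submatrix a[r0:r1][c0:c1] into b by recursive halving."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 1 and cols <= 1:
        # Base case: at most one element left to copy.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
        return
    if rows >= cols:
        mid = r0 + rows // 2  # split the longer dimension (rows)
        _transpose_rec(a, b, r0, mid, c0, c1)
        _transpose_rec(a, b, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2  # split the longer dimension (columns)
        _transpose_rec(a, b, r0, r1, c0, mid)
        _transpose_rec(a, b, r0, r1, mid, c1)


def cache_oblivious_transpose(a):
    """Return the transpose of a rectangular list-of-lists matrix."""
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    b = [[None] * n_rows for _ in range(n_cols)]
    _transpose_rec(a, b, 0, n_rows, 0, n_cols)
    return b


print(cache_oblivious_transpose([[1, 2, 3], [4, 5, 6]]))
```

A practical implementation would likely stop the recursion at a small fixed-size base case to reduce call overhead; that constant is independent of the cache parameters.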


.. Streaming Model

In the streaming model, input data are not available for random access from disk or memory, but rather arrive as one or more continuous data streams.

Performance of an algorithm is measured by three basic factors:

Number of passes the algorithm must make over the stream.

The available memory.

The running time of the algorithm.
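As an illustration of these three measures, the sketch below uses the Boyer-Moore majority vote, a classic streaming technique that is not mentioned in the slides. It makes one pass over the stream, keeps two variables of memory, and does constant work per item. The function name `majority_candidate` is chosen for this example.

```python
def majority_candidate(stream):
    """One pass, O(1) memory: return the only possible majority element.

    The result is guaranteed correct only if some element occurs in more
    than half of the stream; confirming that would need a second pass.
    """
    candidate = None
    count = 0
    for item in stream:
        if count == 0:
            candidate = item  # adopt a new candidate
            count = 1
        elif item == candidate:
            count += 1
        else:
            count -= 1  # cancel one occurrence of the candidate
    return candidate


# The stream is consumed item by item and never stored.
print(majority_candidate(iter([1, 2, 1, 3, 1, 1])))
```

The trade-off is typical for the model: a second pass over the stream buys a verified answer, while a single pass gives only a candidate.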


.. Grading

Midterm + Final: 14 points

Homework: 3 points

Presentation: 3 points
