Introduction to Big Data, mostly from www.cs.kent.edu/~jin/BigData
Course in big data, Spring 2014, Ruoming Jin
What’s Big Data?

No single definition; here is one from Wikipedia:
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
• The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared with separate smaller sets of the same total size, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
Some of the numbers behind "big":
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3 billion in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009; 200 million by 2014
Variety (Complexity)

• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
  – Social networks, Semantic Web (RDF), …
• Streaming data
  – You can only scan the data once
• A single application can generate/collect many types of data
• Big public data (online, weather, finance, etc.)

To extract knowledge, all these types of data need to be linked together.
Velocity (Speed)

• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions mean missed opportunities
• Examples
  – E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
  – Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
Real-time/Fast Data

• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
Real-Time Analytics/Decision Requirements

Ways to influence customer behavior in real time:
• Product recommendations that are relevant and compelling
• Friend invitations to join a game or activity that expands business
• Preventing fraud as it is occurring, and preventing more proactively
• Learning why customers switch to competitors and their offers, in time to counter
• Improving the marketing effectiveness of a promotion while it is still in play
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (data warehousing)
• RTAP: Real-Time Analytics Processing (big data architecture & technology)
The Model Has Changed…

• The model of generating/consuming data has changed
  – Old model: a few companies generate data; all others consume it
  – New model: all of us generate data, and all of us consume it
THE EVOLUTION OF BUSINESS INTELLIGENCE

• 1990s: BI reporting with OLAP and data warehouses (Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools)
• 2000s: interactive business intelligence and in-memory RDBMSs (QlikView, Tableau, HANA)
• 2010s: big data, both batch processing on distributed data stores (Hadoop/Spark; HBase/Cassandra) and real-time, single-view systems, including graph databases

Each generation pushed further along the speed and scale axes.
Big Data Analytics
• Big data is more real-time in nature than traditional DW applications
• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
• Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps
Cloud Computing

• IT resources provided as a service
  – Compute, storage, databases, queues
• Clouds leverage economies of scale of commodity hardware
  – Cheap storage, high-bandwidth networks, and multicore processors
  – Geographically distributed data centers
• Offerings from Microsoft, Amazon, Google, …
Topic 2: Hadoop/MapReduce Programming & Data Processing

• Architecture of Hadoop, HDFS, and YARN
• Programming on Hadoop
• Basic data processing: sort and join
• Information retrieval using Hadoop
• Data mining using Hadoop (k-means + histograms)
• Machine learning on Hadoop (EM)
• Hive/Pig
• HBase and Cassandra
References

• Hadoop: The Definitive Guide, Tom White, O’Reilly
• Hadoop in Action, Chuck Lam, Manning
• Doing Data Science, Rachel Schutt and Cathy O’Neil, O’Reilly
• Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer (www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf)
• Good tutorial presentation & examples: http://research.google.com/pubs/pub36249.html
• The definitive original paper: http://research.google.com/archive/mapreduce.html
Cloud Resources

• Hadoop on your local machine
• Hadoop in a virtual machine on your local machine (pseudo-distributed on Ubuntu)
• Hadoop in the clouds with Amazon EC2
Introduction to MapReduce/Hadoop

From Ruoming Jin’s slides, themselves adapted from Jimmy Lin’s slides (at UMD)
Key Ideas

• Scale “out”, not “up”
  – Limits of SMP and large shared-memory machines
• Move processing to the data
  – The cluster may have limited bandwidth
• Process data sequentially; avoid random access
  – Seeks are expensive; disk throughput is reasonable
• Seamless scalability
  – From the mythical man-month to the tradable machine-hour
The datacenter is the computer!
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Apache Hadoop

• A scalable, fault-tolerant distributed system for big data:
  – Data storage
  – Data processing
  – A virtual big data machine
  – Borrowed concepts/ideas from Google; open source under the Apache license
• Core Hadoop has two main systems:
  – Hadoop MapReduce: distributed big data processing infrastructure (abstraction/paradigm, fault tolerance, scheduling, execution)
  – HDFS (Hadoop Distributed File System): fault-tolerant, high-bandwidth, high-availability distributed storage
Example: word counts

Millions of documents in; word counts out: brown, 2; fox, 2; how, 1; now, 1; the, 3; …

In practice, before MapReduce and related technologies: the first 10 computers are easy; the first 100 computers are hard; the first 1000 computers are impossible. But now, with MapReduce, engineers at Google often use 10,000 computers!

What’s wrong with 1000 computers?

Some will crash while you’re working…
If the probability of a crash is 0.001, then the probability that all 1000 machines stay up is (1 − 0.001)^1000 ≈ 0.37.
MapReduce expects crashes, tracks partial work, and keeps going.
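The 0.37 figure follows from assuming crashes are independent across machines; a quick sketch (ours, not from the slides) checks the arithmetic and shows how fast the all-up probability collapses as the cluster grows:

```java
public class UptimeProbability {
    // Probability that all n machines stay up, given per-machine crash probability p.
    static double allUp(double p, int n) {
        return Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        // With p = 0.001 and 1000 machines, only ~37% chance nothing crashes.
        System.out.printf("P(all 1000 up)  = %.3f%n", allUp(0.001, 1000));
        // With 10000 machines the chance is effectively zero.
        System.out.printf("P(all 10000 up) = %.6f%n", allUp(0.001, 10000));
    }
}
```

This is why the framework must treat failure as the normal case rather than the exception.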
Typical Large-Data Problem

• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output

Key idea: provide a functional abstraction for these two operations, Map and Reduce.

(Dean and Ghemawat, OSDI 2004)
MapReduce

• Programmers specify two functions:
  map (k, v) → [(k’, v’)]
  reduce (k’, [v’]) → [(k’, v’’)]
  – All values with the same key (k’) are sent to the same reducer, in k’ order for each reducer
  – Here [] means a sequence
• The execution framework handles everything else…
“Hello World”: Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1)

Reduce(String term, Iterator<Int> values):
  int sum = 0
  for each v in values:
    sum += v
  Emit(term, sum)
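The pseudocode above can be simulated on one machine (a sketch of ours, with a sorted map standing in for the framework's shuffle-and-sort):

```java
import java.util.*;

public class WordCountLocal {
    // Simulates map -> shuffle (group and sort by key) -> reduce for word count.
    static Map<String, Integer> wordCount(List<String> docs) {
        // Map phase: emit (word, 1) pairs; the TreeMap groups them by key
        // in sorted order, mimicking the shuffle-and-sort step.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : docs)
            for (String w : doc.split("\\s+"))
                grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);

        // Reduce phase: sum the list of counts for each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the quick brown fox", "the fox ate the mouse", "how now brown cow");
        System.out.println(wordCount(docs));
        // {ate=1, brown=2, cow=1, fox=2, how=1, mouse=1, now=1, quick=1, the=3}
    }
}
```

The real framework does the same thing, but with the map calls, the grouping, and the reduce calls spread across thousands of machines.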
MapReduce “Runtime”

• Handles scheduling
  – Assigns workers to map and reduce tasks
• Handles “data distribution”
  – Moves processes to data
• Handles synchronization
  – Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
  – Detects worker failures and restarts them
• Everything happens on top of a distributed FS (later)
MapReduce

• Programmers specify two functions:
  map (k, v) → [(k’, v’)]
  reduce (k’, [v’]) → [(k’, v’’)]
  – All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite… usually, programmers also specify:
  partition (k’, number of partitions) → partition for k’
  – Often a simple hash of the key, e.g., hash(k’) mod n
  – Divides up the key space for parallel reduce operations, and governs the eventual delivery of results to particular partitions
  combine (k’, [v’]) → [(k’, v’’)]
  – Mini-reducers that run in memory after the map phase
  – Used as an optimization to reduce network traffic
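The hash(k’) mod n partitioner can be sketched in a few lines (our sketch; `Math.floorMod` keeps the result non-negative even when `hashCode()` is negative, similar in effect to how Hadoop's HashPartitioner masks the sign bit):

```java
public class SimplePartitioner {
    // Assigns key k' to one of numPartitions reducers: hash(k') mod n.
    // floorMod guarantees a result in [0, numPartitions), even for
    // keys whose hashCode() is negative.
    static int partition(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        for (String k : new String[]{"brown", "fox", "the"})
            System.out.println(k + " -> reducer " + partition(k, 2));
    }
}
```

Because the same key always hashes to the same partition, every value for a given key lands on the same reducer, which is exactly the guarantee reduce depends on.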
[Figure: MapReduce data flow with combiners. Map tasks turn input pairs (k1, v1) … (k6, v6) into intermediate (key, value) pairs; combiners pre-aggregate repeated keys within each map; partitioners route each key to a reducer. Shuffle and sort then aggregate values by key (e.g., a → [1, 5], b → [2, 7], c → [2, 9, 8]), and the reducers emit the final results.]
Word Count Execution

[Figure: Input → Map → Shuffle & Sort → Reduce → Output. Three map tasks read the splits “the quick brown fox”, “the fox ate the mouse”, and “how now brown cow” and emit (word, 1) pairs. After shuffle and sort, one reducer receives brown: 1,1; fox: 1,1; how: 1; now: 1; the: 1,1,1 and emits brown, 2; fox, 2; how, 1; now, 1; the, 3; the other receives ate: 1; cow: 1; mouse: 1; quick: 1 and emits each of those words with count 1.]
MapReduce Implementations

• Google has a proprietary implementation in C++
  – Bindings in Java, Python
• Hadoop is an open-source implementation in Java
  – Development led by Yahoo!, used in production
  – Now an Apache project
  – Rapidly expanding software ecosystem
• Lots of custom research implementations
  – For GPUs, Cell processors, etc.
Hadoop History

• Dec 2004 – Google GFS paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes a Lucene subproject
• Apr 2007 – Yahoo! runs it on a 1000-node cluster
• Jan 2008 – Becomes an Apache top-level project
• Jul 2008 – A 4000-node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
• Feb 2009 – The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data used in every Yahoo! Web search query
• June 2009 – On June 10, 2009, Yahoo! made available the source code of the Hadoop version it runs in production
• 2010 – Facebook claimed the largest Hadoop cluster in the world, with 21 PB of storage; on July 27, 2011, they announced it had grown to 30 PB
Who uses Hadoop?

• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Example Word Count (Map)

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Example Word Count (Reduce)

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Example Word Count (Driver)

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
An Optimization: The Combiner

• A combiner is a local aggregation function for repeated keys produced by the same map
• Works for associative operations like sum, count, max
• Decreases the size of the intermediate data
• Example: local counting for word count:

def combiner(key, values):
    output(key, sum(values))
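A minimal sketch (ours) of what the combiner does inside one map task: repeated keys are summed in memory before anything is shipped across the network.

```java
import java.util.*;

public class CombinerDemo {
    // Emits (word, 1) pairs for one map split, then applies a sum combiner
    // locally, so the split "the fox ate the mouse" ships (the, 2) instead
    // of two separate (the, 1) pairs.
    static Map<String, Integer> mapWithCombiner(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : split.split("\\s+"))
            local.merge(w, 1, Integer::sum);   // combine repeated keys in memory
        return local;
    }

    public static void main(String[] args) {
        Map<String, Integer> out = mapWithCombiner("the fox ate the mouse");
        System.out.println(out);                      // {ate=1, fox=1, mouse=1, the=2}
        System.out.println(out.size() + " pairs shipped instead of 5");
    }
}
```

Because sum is associative, combining partial counts locally and summing them again in the reducer gives the same final answer with less network traffic.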
Word Count with Combiner

[Figure: Input → Map & Combine → Shuffle & Sort → Reduce → Output. The same three splits are mapped, but the combiner pre-aggregates within each map: the split “the fox ate the mouse” now ships (the, 2) instead of two (the, 1) pairs. The reducers’ final output is unchanged: brown, 2; fox, 2; how, 1; now, 1; the, 3; and ate, 1; cow, 1; mouse, 1; quick, 1.]
[Figure: MapReduce execution overview, adapted from (Dean and Ghemawat, OSDI 2004). (1) The user program submits the job to the master. (2) The master schedules map and reduce tasks onto workers. (3) Map workers read the input splits (split 0 through split 4) and (4) write intermediate files to their local disks. (5) Reduce workers remotely read the intermediate files and (6) write the output files (output file 0, output file 1).]
Distributed File System

• Don’t move data to workers… move workers to the data!
  – Store data on the local disks of nodes in the cluster
  – Start up the workers on the node that has the data local
• Why?
  – Not enough RAM to hold all the data in memory
  – Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
  – GFS (Google File System) for Google’s MapReduce
  – HDFS (Hadoop Distributed File System) for Hadoop
Another example of MapReduce

• Clickstream-like data: for each ad viewing, user info and whether the user clicked on the ad:
• {userid, ip, zip, adnum, clicked}
• Goal: for each zip code, the number of unique users who saw an ad and the number who clicked

First Try

• First try: use zip as the key
• Map emits {90210, {0, 1}} if the user saw the ad but failed to click, and {90210, {1, 1}} if the user saw it and clicked
• Reduce receives, say: {90210, [{1, 1}, {0, 1}, {0, 1}]}
• This shows three views and one click, but we don’t know whether these views were by different users, so we can’t count unique users
Second try

• We need to preserve user identity longer
• Use {zip, userid} as the key
• Value: again {0, 1} for a view without a click, {1, 1} for a view with a click
• Map emits {{90210, user123}, {0, 1}}, etc.
• Each reducer call gets info on one user in one zip: {{90210, user123}, [{0, 1}, {1, 1}]}
• The reducer can process the list and emit {{90210, user123}, {1, 2}}: one click, two views
• But we’re not done yet…
Second MapReduce pass

• The pass-1 reducer emits {{90210, user123}, {1, 2}}
• The second map reads this and emits the user’s contribution to the zip’s statistics (one unique user saw ads and clicked): {90210, {1, 1}}
• The second reduce counts up unique users and their clicks: it emits {90210, {1056, 2210}}, meaning 2210 unique users viewed ads in zip 90210 and 1056 of them clicked
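The two passes collapse naturally onto one machine; this sketch (class and field names are ours, and ip/adnum are omitted) mirrors them: pass 1 folds each (zip, userid) group into a single did-this-user-ever-click flag, pass 2 re-aggregates those flags by zip.

```java
import java.util.*;

public class UniqueClicks {
    // One ad impression row: {userid, zip, clicked}.
    static final class View {
        final String userid, zip;
        final boolean clicked;
        View(String userid, String zip, boolean clicked) {
            this.userid = userid; this.zip = zip; this.clicked = clicked;
        }
    }

    // Pass 1 (key = {zip, userid}): reduce a user's views to one clicked? flag.
    // Pass 2 (key = zip): count unique users and how many of them clicked.
    static Map<String, int[]> uniqueUsersByZip(List<View> views) {
        Map<String, Boolean> perUser = new HashMap<>();   // "zip|userid" -> clicked?
        for (View v : views)
            perUser.merge(v.zip + "|" + v.userid, v.clicked, Boolean::logicalOr);

        Map<String, int[]> byZip = new TreeMap<>();       // zip -> {clickers, uniqueUsers}
        for (Map.Entry<String, Boolean> e : perUser.entrySet()) {
            String zip = e.getKey().substring(0, e.getKey().indexOf('|'));
            int[] agg = byZip.computeIfAbsent(zip, z -> new int[2]);
            if (e.getValue()) agg[0]++;                   // this user clicked at least once
            agg[1]++;                                     // one more unique user
        }
        return byZip;
    }

    public static void main(String[] args) {
        List<View> views = Arrays.asList(
            new View("user123", "90210", false),
            new View("user123", "90210", true),   // same user: still one unique user
            new View("user456", "90210", false));
        int[] agg = uniqueUsersByZip(views).get("90210");
        System.out.println(agg[0] + " clickers out of " + agg[1] + " unique users");
    }
}
```

The key point the slides make survives here: only by keying on (zip, userid) first can repeated views by the same user be collapsed before counting.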
Compare to SQL

• Table T of {userid, ip, zip, adnum, clicked}
• Using a trick, we can do this in one select:

select zip, count(distinct userid), count(distinct clicked*userid)
from T group by zip, clicked having clicked = 1

• Assumes clicked = 0 or 1 in each T row
• Note that DB2, Oracle, and MySQL can do count(distinct expr), though Entry SQL92 only requires count(distinct column)
Compare to SQL

• Table T of {userid, ip, zip, adnum, clicked}
• Closer to MapReduce processing:

select zip, userid, count(clicked) cc
from T group by zip, userid

• Put the results into table T1 (zip, userid, cc), then:

select zip, count(*), sum(sign(cc))
from T1 group by zip

• The scalar function sign(x) = -1, 0, +1 is available in Oracle, DB2, and MySQL, but not in Entry SQL92
Do it in SQL92?

• CASE is the conditional value capability in SQL92, but it is not required for Entry SQL92 (it is supported by all respectable DBs)
• sign(x) as a CASE expression:

case when x < 0 then -1
     when x > 0 then 1
     else 0
end