
CS346: Advanced Databases. Graham Cormode (G.Cormode@warwick.ac.uk). MapReduce and Hadoop.

Page 1

CS346: Advanced Databases
Graham Cormode, G.Cormode@warwick.ac.uk

MapReduce and Hadoop

Page 2

Outline

Reading: find resources online, or pick from:
– Data-Intensive Text Processing with MapReduce, Chapters 1-3 (Jimmy Lin and Chris Dyer, Morgan & Claypool)
– www.coreservlets.com/hadoop-tutorial/ (Marty Hall)
– Hadoop: The Definitive Guide (Tom White, O'Reilly Media)

Outline: data is big and getting bigger, and new tools are emerging
• Hadoop: a file system and processing paradigm (MapReduce)
• Hbase: a way of storing and retrieving large amounts of data
• Pig and Hive: high-level abstractions to make Hadoop easier

Page 3

Why: Data is Massive

• Data is growing faster than our ability to store or index it
• There are 3 billion telephone calls in the USA each day, 30 billion emails daily, 1 billion SMS and IMs
• Scientific data: NASA's observation satellites each generate billions of readings per day
• IP network traffic: up to 1 billion packets per hour per router, and each ISP has many (hundreds of) routers!
• Whole genome sequences for many species are now available, each megabytes to gigabytes in size

Page 4

Massive Data Management

Must perform queries on this massive data:
• Scientific research (monitor environment, species)
• System management (spot faults, drops, failures)
• Customer research (association rules, new offers)
• Revenue protection (phone fraud, service abuse)
Else, why even collect this data?

Page 5

Hadoop

• Hadoop is an open-source architecture for handling big data
– First developed by Doug Cutting (and named after a toy elephant)
– Currently managed by the Apache Software Foundation
– Google's original implementations are not publicly available

• Many tools/products now reside on top of Hadoop
– Hbase: a (non-relational) distributed database
– Hive: data warehouse infrastructure developed at Facebook
– Pig: a high-level language that compiles to Hadoop, from Yahoo
– Mahout: machine learning algorithms in Hadoop

• Hadoop is widely used in technology-based businesses:
– Facebook, LinkedIn, Twitter, IBM, Amazon, Adobe, Ebay
– Offered as part of: Amazon EC2, Cloudera, Microsoft Azure

Page 6

Hadoop Cluster

• A Hadoop cluster implements the MapReduce framework
– Many commodity (off-the-shelf) machines, with a fast network
– Placed in physical proximity (to allow fast communication)
– Typically rack-mounted hardware

• Expect and tolerate failures
– Disks have an MTBF of 10 years
– When you have 10,000 disks...
– ...expect 3 to fail per day
– Jobs can last minutes to hours
– So the system must cope with failure!

Data is replicated, and tasks that fail are retried: the developer does not have to worry about this.

Page 7

Building Blocks

[Figure omitted. Source: Barroso and Hölzle (2009)]

Page 8

Hadoop philosophy

• "Scale-out, not scale-up"
– Don't upgrade; add more hardware to the system instead
– The end of Moore's law means CPUs are not getting faster
– Individual disk sizes are not growing fast either
– So add more machines/disks (scale-out)
– Allow hardware addition/removal mid-job

• "Move code to data, not vice-versa"
– Data is big and distributed, while code is fairly small
– So do the processing locally, where the data resides
– May have to move results across the network, though

Page 9

Hadoop versus the RDBMS

• Hadoop and the RDBMS are not in direct competition
– They solve different problems on different kinds of data

• Hadoop: data processing on huge, distributed data (TB-PB)
– Batch approach: data is not modified frequently, and results take time
– No guarantees of resilience, no real-time response, no locking
– Data is not in relations, but key-value pairs

• RDBMS: resilient, reliable processing of large data (MB-GB)
– Provides a high-level language (SQL) to deal with structured data
– Hits a ceiling when scaling up beyond tens of TB

• But the gaps between the two are narrowing
– Lots of work to make Hadoop look like a DB (Hive, Hbase...)
– Hadoop and the RDBMS can coexist, e.g. a DB front-end with Hadoop log analysis

Page 10

Running Hadoop

• Different parts of the Hadoop ecosystem have incompatibilities
– They require certain versions to play well together

• This led to Hadoop distributions (like Linux distributions)
– Curated releases, e.g. cloudera.com/hadoop
– Available as a Linux package or a virtual machine image

• How to run Hadoop?
– Run on your own (multi-core) machine (for development/testing)
– Use a local cluster that you have access to
– Go to the cloud ($$$): Amazon EC2, Cloudera, Microsoft Azure

See Peng's lecture

Page 11

HDFS: the Hadoop Distributed File System

• The Hadoop Distributed File System is an important part of Hadoop
– Good for storing truly massive data

• Some HDFS numbers:
– Suitable for files in the TB, PB range
– Can store millions to billions of files
– Suits a 100MB+ minimum size per file

• Assumptions about the data
– Assume that the data will be written once, read many times
– Assume no dynamic updates: append only
– Optimize for streaming (sequential) reads, not random access

• Not good for low-latency reads, small files, or multiple writers

Page 12

HDFS Daemons

• Namenode: manages the file system's namespace
– Maps from file names to where the data is stored, like other file systems
– Can be a single point of failure in the system

• Datanodes: store and retrieve data blocks
– Each datanode reports to the namenode

• Secondary namenode: does housekeeping (checkpointing, logging)
– Not a backup for the namenode!

Page 13

Files and Blocks

• Files are broken into blocks, just like in traditional file systems
– But each block is much larger: 64MB or 128MB
– This ensures that the time to seek << the time to transfer
– Compare a 10ms access time with a 100MB/s read rate: a 64MB block takes about 0.64s to read, so the seek adds under 2% overhead

• Blocks are replicated across different datanodes
– The default replication level is 3, all managed by the namenode

Page 14

Replication and Reliability

• The namenode is "rack aware": it knows how machines are arranged
– The second replica is on the same rack as the first, but a different machine
– The third replica is on a different rack
– This balances performance (failover time) vs. reliability (independence)

• The namenode does not directly read/write data
– The client gets data locations from the namenode
– The client interacts directly with datanodes to read/write data

• The namenode keeps all block metadata in (fast) memory
– This puts a constraint on the number of files stored: millions of large files
– Future iterations of Hadoop expect to remove these constraints

Page 15

Using the HDFS file system

• HDFS gives similar control to a traditional file system
– Paths in the form of directories below a root
– Can ls (list directory), cat (read file), cd, rm, cp, etc.
– put: copy a file from the local file system to HDFS
– get: copy a file from HDFS to the local file system
– File permissions are similar to unix/linux

• Some HDFS-specific commands, e.g. changing a file's replication level
– Can rebalance data: ensure datanodes are similarly loaded
– There is a Java API to read/write HDFS files (see the sketch below)

• The original use for HDFS: storing data for MapReduce
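As a hedged illustration of the Java API just mentioned (a minimal sketch: the file path is a placeholder, and the cluster settings are assumed to come from the usual Hadoop configuration files on the classpath):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // Cluster settings are read from the Hadoop configuration files
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path: any HDFS file would do
            Path path = new Path("/data/example.txt");

            // A streaming (sequential) read, matching HDFS's access assumptions
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }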

Page 16

MapReduce and Big Data

• MapReduce is a popular paradigm for analyzing massive data
– Used when the data is much too big for one machine
– Allows the parallelization of computations over many machines

• Introduced by Jeffrey Dean and Sanjay Ghemawat in the early 2000s
– The MapReduce model was implemented by the MapReduce system at Google
– Hadoop MapReduce implements the same ideas

• Allows a large computation to be distributed over many machines
– Brings the computation to the data, not vice-versa
– The system manages data movement, machine failures, and errors
– The user just has to specify what to do with each piece of data

Page 17

Motivating MapReduce

• Many computations over big data follow a common outline:
– The data is formed of many (many) simple records
– Iterate over each record and extract a value
– Group together intermediate results with the same properties
– Aggregate these groups to get final results
– Possibly, repeat this process with different functions

• The MapReduce framework abstracts this outline
– Iterate over records = Map
– Aggregate the groups = Reduce

Page 18

What is MapReduce?

• MapReduce draws inspiration from functional programming
– Map: apply the "map" function to every piece of data
– Reduce: form the mapped data into groups and apply a function to each

• Designed for efficiency
– Process the data in whatever order it is stored, avoiding random access
  (random access can be very slow over large data)
– Split the computation over many machines
  (can Map the data in parallel, and Reduce each group in parallel)
– Resilient to failure: if a Map or Reduce task fails, just run it again
  (this requires that tasks are idempotent: they can be repeated on the same input)

Page 19

Programming in MapReduce

• Data is assumed to be in the form of (key, value) pairs
– E.g. (key = "CS346", value = "Advanced Databases")
– E.g. (key = "111-222-3333", value = "(male, 29 years, married...)")

• Abstract view of programming MapReduce. Specify:
– A Map function: take a (k, v) pair, and output some number of (k', v') pairs
– A Reduce function: take all (k', v') pairs with the same key k', and output a new set of (k'', v'') pairs
– The "type" of the output (key, value) pairs can differ from that of the input

• Many other options/parameters in practice (see the sketch below):
– Can specify a "partition" function for how to map k' to reducers
– Can specify a "combine" function that aggregates the output of Map
– Can share some information with all nodes via the distributed cache
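To make these options concrete, here is a hedged sketch of a driver that wires them up, reusing the word-count Mapper and Reducer shown on page 21 (the class names follow that example; HashPartitioner is Hadoop's default partitioner, set explicitly here only for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(MyMapper.class);             // the Map function
            job.setCombinerClass(MyReducer.class);          // aggregate Map output locally
            job.setPartitionerClass(HashPartitioner.class); // how k' is mapped to reducers
            job.setReducerClass(MyReducer.class);           // the Reduce function

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Reusing the reducer as the combiner is safe here because summing counts is associative and commutative.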

Page 20

MapReduce schematic

[Figure: input pairs (k1,v1) ... (k6,v6) flow through four parallel map tasks, which emit intermediate pairs (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8). The shuffle-and-sort phase aggregates values by key: a → (1,5), b → (2,7), c → (2,3,6,8). Three parallel reduce tasks then produce the final outputs (r1,s1), (r2,s2), (r3,s3).]

Page 21

"Hello World": Word Count

• The generic MapReduce computation that's always used...
– Count the occurrences of each word in a (massive) document collection

In pseudocode:

    Map(String docid, String text):
        for each word w in text:
            Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
        int sum = 0;
        for each v in values:
            sum += v;
        Emit(term, sum);

The corresponding Hadoop Java code (source: lintool.github.io/MapReduce-course-2013s/syllabus.html):

    private static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final static Text WORD = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize the line and emit (word, 1) for each word
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                WORD.set(itr.nextToken());
                context.write(WORD, ONE);
            }
        }
    }

    private static class MyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final static IntWritable SUM = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts received for this word
            Iterator<IntWritable> iter = values.iterator();
            int sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
            }
            SUM.set(sum);
            context.write(key, SUM);
        }
    }

Page 22

"Hello World"

Page 23

MapReduce and Graphs

• MapReduce is a powerful way of handling big graph data
– Graph: a network of nodes linked by edges
– Many big graphs: the web, (social network) friendships, citations
– Often millions of nodes and billions of edges; Facebook has > 1 billion nodes and 100 billion edges

• Many complex calculations are needed over large graphs
– Rank the importance of nodes (for web search)
– Predict which links will be added soon / suggest links (social networks)
– Label nodes based on classification over graphs (recommendation)

• MapReduce allows computation over big graphs
– Represent each edge as a value in a key-value pair

Page 24

MapReduce example: compute degree

• The degree of a node is the number of edges incident on it
– Here, assume undirected edges

• To compute degrees in MapReduce:
– Map: for each edge (E, (v, w)), output (v, 1) and (w, 1)
– Reduce: for (v, (c1, c2, ..., cn)), output (v, c1 + c2 + ... + cn)

• Advanced: could use "combine" to compute partial sums
– E.g. Combine((A, 1), (A, 1), (B, 1)) = ((A, 2), (B, 1))

Worked example on the graph with nodes A, B, C, D and edges A-B, A-C, A-D, B-C:

    Input:   (E1, (A,B)), (E2, (A,C)), (E3, (A,D)), (E4, (B,C))
    Map:     (A,1), (B,1), (A,1), (C,1), (A,1), (D,1), (B,1), (C,1)
    Shuffle: (A, (1,1,1)), (B, (1,1)), (C, (1,1)), (D, (1))
    Reduce:  (A, 3), (B, 2), (C, 2), (D, 1)
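A hedged Java sketch of this computation (the input format, one edge per line as "edgeId v w" separated by whitespace, is an assumption; the reducer doubles as the combiner since summing is associative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Degree {
        // Map: for edge (E, (v, w)), emit (v, 1) and (w, 1)
        public static class EdgeMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text node = new Text();

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assumed line format: edgeId v w
                String[] parts = value.toString().split("\\s+");
                node.set(parts[1]);
                context.write(node, ONE);
                node.set(parts[2]);
                context.write(node, ONE);
            }
        }

        // Reduce (also usable as the combiner): sum the counts for each node
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }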

Page 25

MapReduce Criticism (circa 2008)

Two prominent DB leaders (DeWitt and Stonebraker) complained:

• MapReduce is a step backward in database access:
– Schemas are good
– Separation of the schema from the application is good
– High-level access languages are good

• MapReduce only allows poor implementations
– Brute force and only brute force (no indexes, for example)

• MapReduce is missing features
– Bulk loader, indexing, updates, transactions...

• MapReduce is incompatible with DBMS tools

Much subsequent debate and development has sought to remedy these points.

Source: blog post by DeWitt and Stonebraker (http://craig-henderson.blogspot.co.uk/2009/11/dewitt-and-stonebrakers-mapreduce-major.html)

Page 26

Relational Databases vs. MapReduce

• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions [see later]
– Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
– Support SQL (and SQL integration, e.g. JDBC)
– Automatic SQL query optimization

• MapReduce (Hadoop):
– Designed for large clusters; fault tolerant
– Data is accessed in its "native format"
– Supports many developing query languages (but not full SQL)
– Programmers retain control over performance

Source: O'Reilly blog post by Joseph Hellerstein (11/19/2008)

Page 27

Database operations in MapReduce

For SQL-like processing in MapReduce, we need the relational operations.

• PROJECT in MapReduce is easy
– Map over the tuples, emitting new tuples with the appropriate attributes
– No reducers, unless for regrouping or resorting tuples
– Or pipeline: perform it in a reducer, after some other processing

• SELECT in MapReduce is easy
– Map over the tuples, emitting only the tuples that meet the criteria
– No reducers, unless for regrouping or resorting tuples
– Or pipeline: perform it in a reducer, after some other processing

A combined sketch of both appears below.
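As a hedged illustration (the comma-separated tuple layout and the predicate on field 2 are assumptions, not from the slides), SELECT and PROJECT together form a map-only job: emit only the attributes you want from the tuples that pass the predicate, and set the number of reducers to zero:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Roughly: SELECT f0, f1 FROM tuples WHERE f2 > 100 (schema assumed)
    public class SelectProjectMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text out = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // SELECT: keep only tuples meeting the criterion
            if (Integer.parseInt(fields[2]) > 100) {
                // PROJECT: emit only the attributes we want
                out.set(fields[0] + "," + fields[1]);
                context.write(out, NullWritable.get());
            }
        }
    }
    // In the driver: job.setNumReduceTasks(0);  // map-only: no reducers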

Page 28

Group by... Aggregation

• Example: what is the average time spent per URL?
– Given data recording the time spent on each visit to a URL

• In SQL: SELECT url, AVG(time) FROM visits GROUP BY url;

• In MapReduce (sketched below):
– Map over the tuples, emitting the time, keyed by url
– MapReduce automatically groups by keys
– Compute the average in the reducer
– Optimize with combiners
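A hedged sketch of this job (the input format, "url,time" lines, is an assumption). Note that AVG itself is not associative, so the reducer below cannot be reused directly as a combiner; a combiner would instead emit partial (sum, count) pairs:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AverageTimePerUrl {
        public static class VisitMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assumed input line: url,time
                String[] f = value.toString().split(",");
                context.write(new Text(f[0]), new LongWritable(Long.parseLong(f[1])));
            }
        }

        public static class AvgReducer
                extends Reducer<Text, LongWritable, Text, DoubleWritable> {
            @Override
            public void reduce(Text url, Iterable<LongWritable> times, Context context)
                    throws IOException, InterruptedException {
                // The shuffle has already grouped all the times for this url
                long sum = 0, count = 0;
                for (LongWritable t : times) {
                    sum += t.get();
                    count++;
                }
                context.write(url, new DoubleWritable((double) sum / count));
            }
        }
    }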

Page 29

Join Algorithms in MapReduce

• Joins are more difficult to do well
– We can easily do a join as a cartesian product followed by a select
– But this will kill your system for even moderate data sizes

• We will exploit some "extensions" of MapReduce
– These allow extra ways to access data (e.g. the distributed cache)

• Several approaches to join in MapReduce:
– Reduce-side join
– Map-side join
– In-memory join

Page 30

Reduce-side Join

• Basic idea: group by the join key
– Map over both sets of tuples
– Emit each tuple as the value, with its join key as the intermediate key
– Hadoop brings together tuples sharing the same key
– Perform the actual join in the reducer
– Similar to a "sort-merge join"

• Different variants, depending on how the join goes (sketched below):
– 1-to-1 joins
– 1-to-many and many-to-many joins
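A hedged sketch of the reduce-side join (the tuple formats and the tagging scheme are assumptions): the mapper tags each tuple with its source relation so that the reducer can separate the two sides and pair them up:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReduceSideJoin {
        // Assumed input lines: "R,joinKey,rest" or "S,joinKey,rest"
        public static class TagMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] f = value.toString().split(",", 3);
                // Intermediate key = join key; the value keeps the source tag
                context.write(new Text(f[1]), new Text(f[0] + ":" + f[2]));
            }
        }

        public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text joinKey, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // Buffer the tuples for this key; for many-to-many joins this is
                // exactly where memory becomes the limiting factor
                List<String> rTuples = new ArrayList<>();
                List<String> sTuples = new ArrayList<>();
                for (Text v : values) {
                    String s = v.toString();
                    if (s.startsWith("R:")) rTuples.add(s.substring(2));
                    else sTuples.add(s.substring(2));
                }
                for (String r : rTuples)
                    for (String s : sTuples)
                        context.write(joinKey, new Text(r + "," + s));
            }
        }
    }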

Page 31

Reduce-side Join: 1-to-1

[Figure: tuples R1, R4, S2 and S3 are mapped to (key, value) pairs keyed by their join keys; after the shuffle, each reducer receives the matching R tuple and S tuple under the same key and joins them.]

Note: extra work is needed if we want the attributes ordered!

Page 32

Reduce-side Join: 1-to-many

[Figure: tuple R1 shares its join key with tuples S2, S3, ..., S9; after the shuffle, the reducer receives R1 together with all the matching S tuples and pairs R1 with each of them.]

Extra work is needed to get the tuple from R out first.

Page 33

Reduce-side Join: many-to-many

• Follow a similar outline in the many-to-many case
– Need enough memory to store all the tuples from one relation

• Not particularly efficient
– We end up sending all the data over the network in the shuffle step

Page 34

Map-side Join: Basic Idea

Assume the two datasets are sorted by the join key:
R1, R2, R3, R4 and S1, S2, S3, S4

A sequential scan through both datasets performs the join (equivalent to a merge join).

This doesn't seem to fit the MapReduce model?

Page 35

Map-side Join: Parallel Scans

• If the datasets are sorted by the join key, then just scan over both

• How can we accomplish this in parallel?
– Partition and sort both datasets with the same ordering

• In MapReduce:
– Map over one dataset, reading from the corresponding partition of the other
– This requires reading from the (distributed) data inside Map
– No reducers are necessary (unless to repartition or resort)

• Requires the data to be organized just how we want it
– If not, fall back to a reduce-side join

Page 36

In-Memory Join

• Basic idea: load one dataset into memory, and stream over the other
– Works if R << S, and R fits into memory
– Equivalent to a hash join

• MapReduce implementation (sketched below)
– Distribute R to all nodes: use the distributed cache
– Map over S; each mapper loads R into memory, hashed by the join key
– For every tuple in S, look up its join key in R
– No reducers, unless for regrouping or resorting tuples

• Striped variant (like a single-loop join): if R is too big for memory
– Divide R into R1, R2, R3, ... such that each Rn fits into memory
– Perform an in-memory join for each n: Rn ⋈ S
– Take the union of all the join results
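A hedged sketch of the in-memory join (the file names, the "#R.csv" symlink convention of the distributed cache, and the "joinKey,payload" tuple format are assumptions): R is shipped via the distributed cache, loaded into a hash map in setup(), and probed during map():

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver side (assumed): job.addCacheFile(new URI("/data/R.csv#R.csv"));
    public class InMemoryJoinMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> rTable = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Load R (the small relation) from the cache, hashed by join key
            try (BufferedReader in = new BufferedReader(new FileReader("R.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",", 2); // assumed: joinKey,payload
                    rTable.put(f[0], f[1]);
                }
            }
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Stream over S and probe the hash table, as in a hash join
            String[] f = value.toString().split(",", 2); // assumed: joinKey,payload
            String rPayload = rTable.get(f[0]);
            if (rPayload != null) {
                context.write(new Text(f[0]), new Text(rPayload + "," + f[1]));
            }
        }
    }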

Page 37

Summary: Relational Processing in Hadoop

• MapReduce algorithms for processing relational data:
– Group-by, sorting, and partitioning are handled automatically by the shuffle/sort in MapReduce
– Selection, projection, and other computations (e.g. aggregation) are performed either in the mapper or the reducer
– There are multiple strategies for relational joins: prefer in-memory over map-side over reduce-side (reduce-side is the most general, in-memory the most restricted)

• Complex operations will need multiple MapReduce jobs
– Example: the top ten URLs in terms of average time spent
– Opportunities for automatic optimization

Page 38

Hbase

• Hbase (the Hadoop Database) is a column-oriented data store
– An example of a "NoSQL" database: not the full relational model
– Open source, written in Java
– Does allow update operations (unlike HDFS...)

• Hbase is designed to handle very large tables
– Billions of rows, millions of columns
– Inspired by "BigTable", internal to Google

Page 39

Suitability of Hbase

• Hbase suits applications where:
– You don't need the full power of a relational database
– You have a large enough cluster (5+ nodes)
– The data is very large (obviously): 100M to billions of rows
– You don't need real-time response: it can be slow to respond (latency)
– You have many clients
– The access pattern is mostly selects or range scans by key
– The data is sparse (many attributes, mostly null)
– You don't want to do group-by/join etc.

Page 40

Hbase data model

• The Hbase data model is similar to the relational model:
– Data is stored in tables, which have rows
– Each row is identified/referenced by a unique key value
– Rows have columns, which are grouped into column families

• Data (bytes) is stored in cells
– Each cell is identified by (row, column-family, column)
– Limited support for secondary indexes on non-key values
– Cell contents are versioned: multiple values are stored (default: 3)
– Optimized to provide access to the most recent version
– Old versions can be accessed by timestamp

Page 41

Hbase data storage

• Rows are kept in sorted order of key

• Example of a (logical) data layout:
[Figure: a table whose rows are sorted by key, with values laid out under column families; not reproduced here]

• Data is stored in Hfiles, usually under HDFS
– Empty cells are not explicitly stored, which allows very sparse data

Page 42

Hfiles

• Since HDFS does not allow updates, Hbase uses some tricks:
– Data is stored in Hfiles (still kept in HDFS)
– Newly added data is stored in a Write-Ahead Log (WAL)
– Delete markers are used to indicate records to delete
– When data is accessed, the Hfile and the WAL are merged

• HBase periodically applies compaction to the Hfiles
– Minor compaction: merge together multiple Hfiles (fast)
– Major compaction: more extensive merging and deletion

• Management of the data relies on a "distributed coordination service"
– Provided by Zookeeper (similar to Google's Chubby)
– Maps names to locations

Page 43

Hbase column families and columns

• Columns are grouped into families to organize the data
– Referenced as family:column, e.g. user:first_name

• Family definitions are static: rarely added to or changed
– Expect a small number of families

• Columns are not static and can be updated dynamically
– Can have millions of columns per family

Page 44

Hbase application example

• Use Hbase to store and retrieve a large number of articles

• Example schema: two column families
– info, containing the columns 'title', 'author', 'date'
– content, containing the column 'post'

• Can then access the data (sketched below):
– Get: retrieve a single row (or columns from a row, or other versions)
– Scan: retrieve a range of rows
– Edit and delete data
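A hedged sketch using the HBase Java client (the table name "articles", the row keys, and a reasonably recent client version with Scan.withStartRow/withStopRow are assumptions; the column families match the example schema above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ArticleStore {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("articles"))) {

                // Put: write one article as a row
                Put put = new Put(Bytes.toBytes("article-001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("title"),
                              Bytes.toBytes("MapReduce and Hadoop"));
                put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("post"),
                              Bytes.toBytes("..."));
                table.put(put);

                // Get: retrieve a single row and read one cell from it
                Result row = table.get(new Get(Bytes.toBytes("article-001")));
                byte[] title = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("title"));
                System.out.println(Bytes.toString(title));

                // Scan: retrieve a range of rows by key
                Scan scan = new Scan().withStartRow(Bytes.toBytes("article-000"))
                                      .withStopRow(Bytes.toBytes("article-100"));
                try (ResultScanner results = table.getScanner(scan)) {
                    for (Result r : results) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }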

Page 45

Hbase conclusions

• Hbase is best suited to storing/retrieving large amounts of data
– E.g. managing a very large blogging network
– Facebook uses Hbase to store users' messages (since 2010):
  www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919

• Need to think about how to design the data storage
– E.g. one row per blog, or one row per article?
– A "tall-narrow" design (1 row per article) works well: it fits better with the way Hbase structures Hfiles, and it scales better when blogs have many articles

• Can use Hadoop for heavy-duty processing
– Hbase can be the input (and output) for a Hadoop job

Page 46

Hive and Pig

• Hive: a data warehousing application on Hadoop
– The query language is HQL, a variant of SQL
– Tables are stored on HDFS with different encodings
– Developed by Facebook, now open source

• Pig: a large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– The programmer focuses on data transformations
– Developed by Yahoo!, now open source

• Common idea:
– Provide a higher-level language to facilitate large-data processing
– The higher-level language "compiles down" to Hadoop jobs

Page 47

Pig

• Pig is a "platform for analyzing large datasets"
– High-level (declarative) language (Pig Latin)
– Compiled into MapReduce for execution on a Hadoop cluster
– Developed at Yahoo; used by Twitter, Netflix...

• Aim: make MapReduce coding easier for non-programmers
– Data analysts, data scientists, statisticians...

• Various use-cases suggested:
– Extract, Transform, Load (ETL): analyze large log data (clean, join)
– Analyze "raw" unstructured data from multiple sources, e.g. user logs

Page 48

Pig concepts

• Field: a piece of data

• Tuple: an ordered set of fields
– Example: (10.4, 5, word, 4, field1)

• Bag: a collection of tuples
– { (10.4, 5, word, 4, field1), (this, 1, blah) }
– Similar to tables in a relational DB
– But bags don't require that all tuples have the same arity
– Can be nested: a tuple can contain a bag, e.g. (a, {(1), (2), (3), (4)})

• Standard set of datatypes available:
– int, long, float, double, chararray (string), bytearray (blob)

Page 49

Pig Latin

• The Pig Latin language sits somewhere between SQL and imperative code

• LOAD data AS schema;
– t = LOAD 'mylog' AS (userId:chararray, timestamp:long, query:chararray);

• DUMP displays results to the screen; STORE saves them to disk
– DUMP t: (u1, 12:34, "database"), (u3, 12:36, "work"), (u1, 12:37, "abc")...

• GROUP tuples BY field;
– Creates new tuples, one for each different value of the field
– E.g. g = GROUP t BY userId;
– Will generate a bag of timestamp and query tuples for each user
– DUMP g: (u1, {(12:34, "database"), (12:37, "abc")}), (u3, {(12:36, "work")})

Page 50

Pig: Foreach

Recall:
t: (u1, 12:34, "database"), (u3, 12:36, "work"), (u1, 12:37, "abc")
g: (u1, {(12:34, "database"), (12:37, "abc")}), (u3, {(12:36, "work")})

• FOREACH bag GENERATE data: iterate over all elements in a bag
– r = FOREACH t GENERATE timestamp;
– DUMP r: (12:34), (12:36), (12:37)

• GENERATE can also apply various builtin functions to the data
– s = FOREACH g GENERATE group, COUNT(t);
– DUMP s: (u1, 2), (u3, 1)

• Several builtin functions to manipulate data
– TOKENIZE: break strings into words
– FLATTEN: remove structure, e.g. convert a bag of bags into a bag
– Can also use User Defined Functions (UDFs) in Java, Python...

• The "word count" problem can be done easily with these tools (see the sketch below)
– All the commands correspond to simple Map, Reduce or MapReduce tasks
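For instance, the word count might look like the following Pig Latin sketch (the file names are placeholders; TOKENIZE, FLATTEN, GROUP and COUNT are the builtins described above):

    -- one (line) tuple per input line
    lines  = LOAD 'input.txt' AS (line:chararray);
    -- split each line into words and flatten into one word per tuple
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words together, then count each group's bag
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO 'wordcounts';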

Page 51

Joins in Pig

• Pig supports joins between two bags

• JOIN bag1 BY field1, bag2 BY field2;
– Performs an equijoin, with the condition field1 = field2

• Can perform the join on a tuple of fields
– E.g. join on (date, time): tuples only join if both match

• Implemented via the join algorithms seen earlier

Page 52

Pig: Example

Visits:

    User | Url        | Time
    Amy  | cnn.com    | 8:00
    Amy  | bbc.com    | 10:00
    Amy  | flickr.com | 10:05
    Fred | cnn.com    | 12:00

Url Info:

    Url        | Category | PageRank
    cnn.com    | News     | 0.9
    bbc.com    | News     | 0.8
    flickr.com | Photos   | 0.7
    espn.com   | Sports   | 0.9

Task: find the top 10 most visited pages in each category.

Pig slides adapted from Olston et al. (SIGMOD 2008)

Page 53

Pig script for the example query

    visits      = load '/data/visits' as (user, url, time);
    gVisits     = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo     = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls     = foreach gCategories generate top(visitCounts, 10);
    store topUrls into '/data/topUrls';

Pig slides adapted from Olston et al. (SIGMOD 2008)

Page 54

Pig Query Plan for Hadoop Execution

[Figure: the dataflow Load Visits → Group by url → Foreach url, generate count feeds, together with Load Url Info, into Join on url, then Group by category → Foreach category, generate top10(urls). The plan is executed as three chained MapReduce jobs (Map1/Reduce1, Map2/Reduce2, Map3/Reduce3).]

Pig slides adapted from Olston et al. (SIGMOD 2008)

Page 55

Hive

• Hive is a data warehouse built on top of Hadoop
– Originated at Facebook in 2007; now part of Apache Hadoop
– Provides an SQL-like language called HiveQL

• Hive gives a simple interface for queries and analysis
– Access to files stored via HDFS and Hbase
– Does not give fast "real-time" responses; this is inherited from Hadoop
– The minimum response time may be minutes: it is designed to scale

• Example use case at Netflix: log data analysis
– 0.6TB of log data per day, analyzed by 50+ nodes
– Test quality: how well is the network performing?
– Statistics: how many streams/day, errors/session, etc.

Page 56

HiveQL to Hive

• Hive translates a HiveQL query into a set of MapReduce jobs and executes them

• To support persistent schemas, Hive keeps metadata in an RDBMS
– Known as the metastore (implemented by the Apache Derby DBMS)

Page 57

Hive concepts

• Hive presents a view of the data similar to a relational DB:
– A database is a set of tables
– Tables are formed from rows with the same schema (attributes)
– A row of a table is a single record
– A column in a row is an attribute of the record

Page 58

HiveQL examples: Create and Load

• CREATE TABLE posts (user STRING, post STRING, time BIGINT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

• LOAD DATA LOCAL INPATH 'data/user-posts.txt'
  OVERWRITE INTO TABLE posts;

• SELECT COUNT(1) FROM posts;
  Total MapReduce jobs = 1
  Launching Job 1 out of 1 [...]
  Total MapReduce CPU Time Spent: 2 seconds 640 msec
  4
  Time taken: 14.204 seconds

Page 59

HiveQL examples: querying

• SELECT * FROM posts WHERE user = "u1";
– Similar to SQL syntax

• SELECT * FROM posts WHERE time <= 1343182133839 LIMIT 2;
– Only returns the first 2 matching results

• GROUP BY and HAVING allow aggregation as in SQL
– SELECT category, count(1) AS cnt FROM items GROUP BY category HAVING cnt > 10;

• Can also specify how results are sorted
– ORDER BY (totally ordered) and SORT BY (sorted within each reducer)

• Can specify how tuples are allocated to reducers
– Via the DISTRIBUTE BY keyword, as sketched below
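For example (a sketch: the query itself is an assumption, reusing the posts table from the previous slide), DISTRIBUTE BY and SORT BY combine like this:

    -- Send all rows for the same user to the same reducer,
    -- then sort each reducer's rows by user and by newest first
    SELECT user, post, time
    FROM posts
    DISTRIBUTE BY user
    SORT BY user, time DESC;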

Page 60

Hive: Bucketing and Partitioning

• Can use one column to partition the data
– Each partition is stored in a separate file
– E.g. partition by country
– No difference in query syntax, but querying on the partitioned attribute is fast

• Can cluster data into buckets: randomly hash the data into buckets
– Allows parallelization in MapReduce: one mapper per bucket
– Use buckets to evaluate a query on a sample (one bucket)

A sketch of such a table definition follows below.
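A hedged sketch of the corresponding table definition (the table and column names are placeholders):

    -- Partitioned: one directory per country, so filtering by country is fast
    -- Bucketed: rows are hashed on user into 32 files, enabling one mapper
    -- per bucket and cheap sampling (query a single bucket)
    CREATE TABLE visits (user STRING, url STRING, time BIGINT)
    PARTITIONED BY (country STRING)
    CLUSTERED BY (user) INTO 32 BUCKETS;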

Page 61

Summary

• There is a large, complex ecosystem for data management around Hadoop
– We have barely scratched the surface of this world

• We began with Hadoop and HDFS for MapReduce
– Hbase for storage/retrieval of large data
– Hive and Pig for more high-level programming abstractions

Reading:
– www.coreservlets.com/hadoop-tutorial/
– Data-Intensive Text Processing with MapReduce, Chapters 1-3
– Hadoop: The Definitive Guide

