CS 626 Large Scale Data Science
Jun Zhang
March 5, 2020
Originally prepared by Licong Cui
Lecture 11 – Apache Hive
Review: Hadoop Ecosystem
Review: Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Pig Engine
Pig Latin
Review: WordCount Using Pig
lines = LOAD 'cs626/words.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
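For comparison, the same data flow can be sketched in plain Python. This is only an illustration of what the Pig script computes, not how Pig runs it on a cluster; the function name and the sample lines are hypothetical.

```python
from collections import Counter

def word_count(lines):
    """Count words the way the Pig script does: tokenize, group, count."""
    # TOKENIZE + FLATTEN: split every line into words, one record per word.
    # Pig's TOKENIZE also splits on some punctuation; plain whitespace
    # splitting is a simplification made for this sketch.
    words = [word for line in lines for word in line.split()]
    # GROUP BY word + COUNT: tally the occurrences of each distinct word.
    return Counter(words)

if __name__ == "__main__":
    sample = ["hello hive", "hello pig"]  # hypothetical stand-in for words.txt
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```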
Pig vs MapReduce vs Hive
Source: https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163
Outline
What is Hive?
Data Types
Data Models
Hive Architecture
Hive vs Traditional Database
Hive vs Pig
Hive
Originally developed at Facebook
Data warehousing infrastructure based on Hadoop
Designed to enable
  Easy data summarization
  Ad-hoc querying
  Analysis of large volumes of data
HiveQL - Hive's query language
Hive
Organizes data into tables
Metastore: stores the metadata (table schemas)
Run Hive
Interactive mode
  Hive shell: hive
Non-interactive mode
  Local: hive -f script.q
Hive web interface
JDBC (Java Database Connectivity)
HiveQL Data Types
Primitive data types
Complex data types
Primitive Data Types
Boolean
Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String: STRING, VARCHAR
Timestamp: TIMESTAMP, DATE
Complex Data Types
Example Complex Data Types
HiveQL Example
Data Model
Tables
Partitions
Buckets
Tables
Analogous to relational tables
Hive moves data to its warehouse directory
Each table has a corresponding directory in HDFS
Hive does not check at load time that the files in the table directory conform to the schema
Example:
  hdfs://user/tom/data.txt ==> hdfs://user/hive/warehouse/managed_table
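The move shown above can be imitated on a local file system. The sketch below is a toy model under stated assumptions: it uses a temporary local directory instead of HDFS, and `load_into_managed_table` and `demo` are hypothetical helper names, not Hive APIs.

```python
import shutil
import tempfile
from pathlib import Path

def load_into_managed_table(source_file, warehouse_dir, table_name):
    """Mimic loading a file into a managed table: a move, with no schema check."""
    table_dir = Path(warehouse_dir) / table_name
    table_dir.mkdir(parents=True, exist_ok=True)  # one directory per table
    target = table_dir / Path(source_file).name
    shutil.move(str(source_file), str(target))    # the file leaves its old location
    return target

def demo():
    """Return (source still exists, target exists, target file name)."""
    with tempfile.TemporaryDirectory() as root:
        source = Path(root) / "user" / "tom" / "data.txt"
        source.parent.mkdir(parents=True)
        # The content is never inspected, so any text loads successfully.
        source.write_text("not,validated,at,load,time\n")
        warehouse = Path(root) / "user" / "hive" / "warehouse"
        target = load_into_managed_table(source, warehouse, "managed_table")
        return source.exists(), target.exists(), target.name

if __name__ == "__main__":
    print(demo())
```

Because loading is only a file move, it is fast, and any mismatch between the file and the schema surfaces later, at query time.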
External Tables
Does not move data to Hive’s warehouse directory
Point to existing data directories in HDFS
Data is assumed to be in Hive-compatible format
Dropping an external table drops only the metadata; the data files stay in place
Example:
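As an illustrative sketch (not the slide's original example), the difference in drop behavior can be modeled in memory. `ToyWarehouse` and its methods are hypothetical names; the model assumes that dropping a managed table removes both its metadata and its data, while dropping an external table removes only the metadata.

```python
class ToyWarehouse:
    """In-memory stand-in for the metastore plus the files in HDFS."""

    def __init__(self):
        self.metastore = {}  # table name -> (location, is_external)
        self.files = set()   # paths that currently exist in the toy "HDFS"

    def create_table(self, name, location, external):
        self.metastore[name] = (location, external)
        self.files.add(location)

    def drop_table(self, name):
        location, external = self.metastore.pop(name)  # metadata always goes
        if not external:
            self.files.discard(location)  # managed table: data is deleted too

if __name__ == "__main__":
    w = ToyWarehouse()
    w.create_table("managed_table", "/user/hive/warehouse/managed_table", external=False)
    w.create_table("external_table", "/user/tom/external_table", external=True)
    w.drop_table("managed_table")
    w.drop_table("external_table")
    print(w.files)  # only the external table's data location remains
```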
Partitions
Hive organizes tables into partitions
Partitions determine how data is distributed among subdirectories of the table directory
Example:
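As an illustrative sketch (not the slide's original example), the subdirectory layout can be computed as follows. It assumes the usual `column=value` naming of partition directories; the table name `logs` and the partition columns `dt` and `country` are hypothetical.

```python
from pathlib import PurePosixPath

def partition_dir(table_dir, partition_spec):
    """Build the subdirectory for one partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one nested directory level named 'column=value'.
    """
    path = PurePosixPath(table_dir)
    for column, value in partition_spec:
        path = path / f"{column}={value}"
    return str(path)

if __name__ == "__main__":
    print(partition_dir("/user/hive/warehouse/logs",
                        [("dt", "2001-01-01"), ("country", "GB")]))
```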
Partitions (cont.)
Example:
SELECT statements:
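A query that filters on a partition column only needs to read the matching subdirectories. The sketch below illustrates that idea in plain Python; `parse_partition` and `prune` are hypothetical helpers, and the paths assume the `column=value` directory naming.

```python
def parse_partition(path):
    """Extract {column: value} from path components shaped like 'column=value'."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)

def prune(partition_paths, **wanted):
    """Keep only the partition directories whose values match every filter."""
    return [
        path for path in partition_paths
        if all(parse_partition(path).get(col) == val for col, val in wanted.items())
    ]

if __name__ == "__main__":
    paths = [
        "/user/hive/warehouse/logs/dt=2001-01-01/country=GB",
        "/user/hive/warehouse/logs/dt=2001-01-01/country=US",
        "/user/hive/warehouse/logs/dt=2001-01-02/country=GB",
    ]
    # Analogous to a SELECT with WHERE country = 'GB': the US directory is skipped.
    print(prune(paths, country="GB"))
```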
Buckets
Data in each partition is divided into buckets
Each bucket is stored as a file in the partition directory
Hash function: H(column) mod num_buckets = bucket_number
Example:
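As an illustrative sketch (not the slide's original example), the bucket assignment rule can be written out directly. `toy_hash` is a hypothetical hash function chosen for this sketch (integers hash to themselves, strings use a Java-hashCode-style polynomial); it is not claimed to match Hive's implementation exactly.

```python
def toy_hash(value):
    """Hypothetical hash: integers map to themselves, strings use a 31-based polynomial."""
    if isinstance(value, int):
        return value
    h = 0
    for ch in str(value):
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the running hash within 32 bits
    return h

def bucket_number(value, num_buckets):
    """H(column) mod num_buckets = bucket_number."""
    return toy_hash(value) % num_buckets

def assign_buckets(rows, column, num_buckets):
    """Group rows by bucket; each bucket would be stored as one file."""
    buckets = {b: [] for b in range(num_buckets)}
    for row in rows:
        buckets[bucket_number(row[column], num_buckets)].append(row)
    return buckets

if __name__ == "__main__":
    rows = [{"id": 0}, {"id": 1}, {"id": 4}]
    print(assign_buckets(rows, "id", 4))
```

Rows with the same value in the bucketing column always land in the same bucket, which is what makes bucketed tables useful for sampling and joins.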
Hive Architecture
Thrift Server
Framework for cross-language services
Server written in Java
Supports clients written in different languages
  JDBC (Java), ODBC (C++), PHP, Perl, Python scripts
Metastore
System catalog that contains metadata about the Hive tables
Stored in an RDBMS or the local file system
  HDFS is too slow (not optimized for random access)
  Derby, MySQL
Objects of the Metastore
  Database - namespace of tables
  Table - list of columns, types, owner, storage, SerDe
  Partition - partition-specific columns, SerDe, and storage
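The three kinds of objects listed above can be pictured as nested records. The sketch below is only a mental model: the field names, the sample table, and the SerDe name `ExampleSerDe` are hypothetical and do not reflect the Metastore's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    values: dict    # partition-specific column values, e.g. {"dt": "2001-01-01"}
    location: str   # storage location of this partition
    serde: str      # a partition can carry its own SerDe

@dataclass
class Table:
    name: str
    columns: list   # list of (column name, type) pairs
    owner: str
    location: str   # storage location of the table
    serde: str
    partitions: list = field(default_factory=list)

@dataclass
class Database:
    name: str       # namespace of tables
    tables: dict = field(default_factory=dict)

if __name__ == "__main__":
    db = Database("default")
    logs = Table("logs", [("ts", "BIGINT"), ("line", "STRING")],
                 owner="tom", location="/user/hive/warehouse/logs",
                 serde="ExampleSerDe")  # hypothetical SerDe name
    logs.partitions.append(
        Partition({"dt": "2001-01-01"},
                  "/user/hive/warehouse/logs/dt=2001-01-01", "ExampleSerDe"))
    db.tables[logs.name] = logs
    print(db)
```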
Driver
Driver
  Maintains the lifecycle of a HiveQL statement
Query Compiler
  Compiles HiveQL into a DAG of MapReduce tasks
Executor
  Executes the task plan generated by the compiler in proper dependency order
  Interacts with the underlying Hadoop instance
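Running tasks "in proper dependency order" amounts to a topological ordering of the DAG. The sketch below uses Kahn's algorithm as one standard way to obtain such an order; it is not taken from Hive's source, and the stage names are hypothetical.

```python
from collections import deque

def execution_order(dependencies):
    """Return tasks in an order that respects their dependencies.

    dependencies maps each task to the set of tasks it must wait for.
    Every task has to appear as a key, even if it waits for nothing.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    ready = deque(sorted(t for t, deps in remaining.items() if not deps))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Finishing a task may unblock the tasks that were waiting for it.
        for other in sorted(remaining):
            deps = remaining[other]
            if task in deps:
                deps.remove(task)
                if not deps:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("cycle detected: the task graph is not a DAG")
    return order

if __name__ == "__main__":
    plan = {
        "stage1": set(),
        "stage2": set(),
        "stage3": {"stage1", "stage2"},  # e.g. a join over two earlier stages
        "stage4": {"stage3"},
    }
    print(execution_order(plan))
```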
Hive vs Traditional Database
Schema on Read Versus Schema on Write
Traditional database: schema on write
  The table's schema is enforced at data load time
Hive: schema on read
  Does not verify the data when it is loaded, but rather when a query is issued
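Schema on read can be illustrated with a toy loader and reader. In this sketch, `load` and `query` are hypothetical helpers, the comma delimiter is an assumption, and a field that does not fit the schema becomes `None`, mirroring Hive returning NULL for such fields.

```python
def load(table_files, raw_text):
    """Loading only stores the raw text; nothing is parsed or validated."""
    table_files.append(raw_text)

def query(table_files, schema, delimiter=","):
    """Apply the schema while reading; fields that fail to parse become None."""
    rows = []
    for raw_text in table_files:
        for line in raw_text.splitlines():
            fields = line.split(delimiter)
            row = {}
            for i, (name, cast) in enumerate(schema):
                try:
                    row[name] = cast(fields[i])
                except (ValueError, IndexError):
                    row[name] = None  # malformed or missing field
            rows.append(row)
    return rows

if __name__ == "__main__":
    files = []
    load(files, "1,alice\nx,bob\n3")  # the bad lines load without complaint
    for row in query(files, [("id", int), ("name", str)]):
        print(row)
```

The trade-off is a very fast initial load in exchange for discovering malformed data only at query time.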
Hive vs Traditional Database (cont.)
Updates, Transactions, and Indexes
  Mainstays of traditional databases
HDFS does not provide in-place file updates
  Changes resulting from inserts, updates, and deletes are stored in small delta files
  Delta files are periodically merged into the base table files by MapReduce jobs
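The base-plus-delta idea can be sketched as a small merge routine. This is a toy model: the `(operation, row_id, value)` record layout and the function name are assumptions made for illustration, and the real merge runs as MapReduce jobs over files rather than over in-memory dictionaries.

```python
def merge_deltas(base_rows, delta_files):
    """Fold delta files, oldest first, into a new base keyed by row id.

    base_rows: dict mapping row id -> value.
    delta_files: list of deltas; each delta is a list of
                 (operation, row_id, value) tuples.
    """
    merged = dict(base_rows)  # the old base is left untouched
    for delta in delta_files:
        for operation, row_id, value in delta:
            if operation == "delete":
                merged.pop(row_id, None)
            else:  # "insert" or "update"
                merged[row_id] = value
    return merged

if __name__ == "__main__":
    base = {1: "a", 2: "b"}
    deltas = [
        [("update", 1, "A"), ("insert", 3, "c")],  # older delta file
        [("delete", 2, None)],                     # newer delta file
    ]
    print(merge_deltas(base, deltas))
```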
Hive vs. RDBMS
SQL vs HiveQL
A cheat sheet is available on the class webpage
Pig vs Hive
Pig: procedural data flow language
Hive: declarative, SQL-like language
Pig vs Hive (cont.)
Pig
  Mainly for data transformations and processing
  Unstructured and structured data
Hive
  Mainly for data warehousing and querying data
  Structured data
  Lower learning curve than Pig or MapReduce
  HiveQL is much closer to SQL than Pig Latin is
Hive, Pig, and Hadoop Benchmark
Versions: Hadoop 0.18x, Pig 786346, Hive 786346
Bigger Picture
Store large amounts of data in HDFS
Process raw data using Pig
Build the schema using Hive
Query data using Hive
References
Hadoop: The Definitive Guide (by Tom White)
https://courses.engr.illinois.edu/cs525/sp2015/
http://www.slideshare.net/chirag064/hive-warehousing-over-hadoop
https://acadgild.com/blog/working-with-hive-complex-data-types/