Date post: | 11-Nov-2014 |
Category: |
Technology |
Upload: | clogeny-technologies |
Clogeny’s Hadoop Developer Training Series
An Introduction to Hive
Madhur [email protected]
Cloud Computing
Private & Public Clouds Big Data
Storage
DevOps
Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
What is Hive?
• A data warehousing infrastructure based on Hadoop
• Provides easy data summarization
• Provides ad-hoc querying and analysis of large volumes of data
• Comes with HiveQL, a query language based on SQL
• Allows custom mappers and reducers to be plugged in
What Hive is NOT
• Not suitable for small datasets due to high latency
• Cannot be compared to systems like Oracle
• Does not offer real-time queries and row-level updates
Hive Architecture
Data Model Types - Tables
Tables
• Made up of the actual data and the associated metadata
• Actual data is stored in a Hadoop filesystem
• Metadata is always stored in a relational database such as MySQL
• Managed Tables
Hive physically moves the data into its warehouse directory
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External Tables
Hive refers to data at an existing location in HDFS
hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
Data Model Types - Partitions
Partitions
• A way to divide a table into coarse-grained parts
• Data is partitioned based on the value of a partition column
• Supports multiple dimensions (more than one partition column)
• Defined at table creation time using the PARTITIONED BY clause
• At the filesystem level, partitions are simply nested subdirectories of the table directory
Data Model Types - Partitions
hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
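To make the "nested subdirectories" idea concrete, here is a small Python sketch (an illustration only, not Hive code) that builds the directory for one partition of the logs table. It assumes the warehouse directory is /user/hive/warehouse, as used later in this deck; the helper name partition_path is hypothetical.

```python
def partition_path(warehouse_dir, table, partition_spec):
    """Build the directory that holds one partition of a table.

    partition_spec is a list of (column, value) pairs, given in the
    same order as the columns in the PARTITIONED BY clause.
    """
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return "/".join([warehouse_dir.rstrip("/"), table] + parts)


# Assumed warehouse location; adjust to hive.metastore.warehouse.dir.
path = partition_path("/user/hive/warehouse", "logs",
                      [("dt", "2001-01-01"), ("country", "GB")])
print(path)  # /user/hive/warehouse/logs/dt=2001-01-01/country=GB
```

Each additional partition column adds one more directory level, which is why queries that filter on partition columns only need to read the matching subdirectories.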
Data Model Types - Buckets
Buckets
• Divide a table or partition into a fixed number of parts, based on the hash of a column's value
• Enable more efficient queries by working with smaller buckets of data rather than an entire partition
• Make sampling more efficient
hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
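A rough Python model of how rows could be assigned to the 4 buckets above. It rests on a simplifying assumption: the hash of an INT column is the integer itself, and the bucket is that hash (made non-negative) modulo the number of buckets. The function name bucket_for and the sample ids are made up for illustration.

```python
def bucket_for(id_value, num_buckets):
    """Return the 0-based bucket index for an integer clustering column."""
    # Assumption: the hash of an INT is the integer itself; the mask keeps
    # the value non-negative before taking the remainder.
    return (id_value & 0x7FFFFFFF) % num_buckets


buckets = {}
for user_id in [0, 2, 3, 4]:  # hypothetical ids
    buckets.setdefault(bucket_for(user_id, 4), []).append(user_id)

print(buckets)  # {0: [0, 4], 2: [2], 3: [3]}
```

Because the bucket depends only on the column value, rows with the same id always land in the same bucket.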
Column Data Types
Primitives
TYPE | DESCRIPTION | EXAMPLE
TINYINT | 8-bit signed integer | 1
SMALLINT | 16-bit signed integer | 1
INT | 32-bit signed integer | 1
BIGINT | 64-bit signed integer | 1
FLOAT | 32-bit single-precision floating-point number | 1.0
DOUBLE | 64-bit double-precision floating-point number | 1.0
BOOLEAN | true/false value | TRUE
STRING | Character string | 'a', "a"
TIMESTAMP | Timestamp with nanosecond precision | '2012-01-02 03:04:05.123456789'
Column Data Types
Complex Data Types
TYPE | DESCRIPTION | EXAMPLE
ARRAY | An ordered collection of fields. The fields must all be of the same type. | array(1, 2)
MAP | An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must be the same type, and the values must be the same type. | map('a', 1, 'b', 2)
STRUCT | A collection of named fields. The fields may be of different types. | struct('a', 1, 1.0)
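For readers more familiar with general-purpose languages, the three complex types map loosely onto common Python structures. This is only an analogy, and the struct field names col1, col2, col3 are an assumption made for this sketch.

```python
from collections import namedtuple

# ARRAY: ordered, all elements of the same type -> list
array_value = [1, 2]

# MAP: keys of one primitive type, values of one type -> dict
map_value = {"a": 1, "b": 2}

# STRUCT: named fields that may differ in type -> namedtuple
# (field names are assumed here for illustration)
Struct = namedtuple("Struct", ["col1", "col2", "col3"])
struct_value = Struct("a", 1, 1.0)

print(array_value[0], map_value["b"], struct_value.col3)  # 1 2 1.0
```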
Metastore
A central repository of Hive metadata
Comprises two parts:
• Metastore service
• Backing store for the data
Metastore deployment modes
1: Embedded Mode
This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database.
Both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process.
This mode requires the least amount of effort to configure.
But it can support only one active user at a time and is not certified for production use.
Metastore deployment modes
2: Local Mode
In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host.
The embedded metastore service communicates with the metastore database over JDBC.
Metastore deployment modes
3: Remote Mode
In this mode the Hive metastore service runs in its own JVM process; other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property).
Metastore Properties
Property Name | Type | Description
hive.metastore.warehouse.dir | URI | The directory in HDFS where managed tables are stored
hive.metastore.local | Boolean | Whether the metastore is embedded or local (true) or remote (false)
hive.metastore.uris | Comma-separated URIs | List of remote metastore URIs
javax.jdo.option.ConnectionURL | URI | The JDBC URL of the metastore database
javax.jdo.option.ConnectionDriverName | String | The JDBC driver class name
javax.jdo.option.ConnectionUserName | String | The JDBC username
javax.jdo.option.ConnectionPassword | String | The JDBC password
Hive Packages
The following packages are needed by Hive:
• hive – base package that provides the complete language and runtime (required)
• hive-metastore – provides scripts for running the metastore as a standalone service (optional)
• hive-server – provides scripts for running the original HiveServer as a standalone service (optional)
• hive-server2 – provides scripts for running the new HiveServer2 as a standalone service (optional)
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
• In a traditional database, a table's schema is enforced at data load time
• If the data being loaded doesn't conform to the schema, it is rejected
• Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query is issued
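The difference can be sketched in a few lines of Python. This is a toy model, not how either system is implemented: the schema-on-write loader rejects a malformed row at load time, while the schema-on-read query keeps the raw lines and yields None (standing in for NULL) for a field it cannot parse. The function names and the sample rows are hypothetical.

```python
def parse_int(text):
    """Return an int, or None (standing in for NULL) if the field is malformed."""
    try:
        return int(text)
    except ValueError:
        return None


def load_schema_on_write(lines):
    """Validate every row at load time; reject the load on the first bad row."""
    rows = []
    for line in lines:
        user, raw_id = line.split("\t")
        value = parse_int(raw_id)
        if value is None:
            raise ValueError("row rejected at load time: %r" % line)
        rows.append((user, value))
    return rows


def query_schema_on_read(lines):
    """Leave the stored lines untouched; apply the schema only at query time."""
    for line in lines:
        user, raw_id = line.split("\t")
        yield (user, parse_int(raw_id))


raw = ["Joe\t2", "Hank\tfour"]  # hypothetical file contents
print(list(query_schema_on_read(raw)))  # [('Joe', 2), ('Hank', None)]
```

Loading is fast under schema on read because nothing is parsed up front; the cost of interpreting the data is paid on every query instead.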
Updates, Transactions, and Indexes
• Updates, transactions, and indexes are mainstays of traditional databases
• Until recently, these features have not been considered a part of Hive's feature set
Installing Hive
We will install Hive with the metastore as a standalone service.
For this, install the hive and hive-metastore packages:
$ yum -y install hive hive-metastore
Hive Configuration
Default configuration in
• /etc/hive/conf/hive-default.xml
(Re)define properties in
• /etc/hive/conf/hive-site.xml
Use $HIVE_CONF_DIR to specify an alternate conf dir location
You can override Hadoop configuration properties in Hive's configuration
• e.g.: mapred.reduce.tasks=1
Configure Metastore database
Step 1: Install and start MySQL if you have not already done so
$ yum install mysql-server
$ service mysqld start
Step 2: Configure the MySQL Service and Connector
$ yum install mysql-connector-java
$ ln -s /usr/share/java/mysql-connector-java-5.1.17.jar /usr/lib/hive/lib/mysql-connector-java-5.1.17.jar
Step 3: To set the MySQL root password:
Configure Metastore database cont…
Step 4: To make sure the MySQL server starts at boot
• $ /sbin/chkconfig mysqld on
Step 5: Create the Database and User
• Create the initial database schema using the hive-schema-0.10.0.mysql.sql file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.
• Create a user for hive with the hostname of the metastore.
• Grant proper privileges to the user.
Configure Metastore database cont…
Step 6: Configure the Metastore Service to Communicate with the MySQL Database
• This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the MySQL database, and provides sample settings.
• Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.
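As a rough illustration of what such settings look like, the Python sketch below generates the property entries for hive-site.xml. Every value is a placeholder assumption (host name metastore-host.example.com, database name metastore, user hive, port 9083, and the Connector/J driver class); substitute the values for your own environment. The helper build_hive_site is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Return hive-site.xml text for a dict of property name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values (assumptions): replace host, database, user and password.
metastore_properties = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://metastore-host.example.com/metastore",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
    "hive.metastore.uris": "thrift://metastore-host.example.com:9083",
}

hive_site_xml = build_hive_site(metastore_properties)
print(hive_site_xml)
```

On client and HiveServer hosts only the hive.metastore.uris entry is needed; the javax.jdo.option.* entries matter only on the metastore host.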
Configure Metastore database cont…
Step 7: Create the hive user directory in HDFS
$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chmod og+rw /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chown -R hive /user/hive
Step 8: Set Environment Variables:
• Add the following to the .bashrc file
$ vim ~/.bashrc
export HADOOP_HOME="/usr/lib/hadoop"
export PATH=$PATH:"/usr/lib/hadoop/bin"
• Run the command "bash" at the command prompt
$ bash
Starting the Metastore
You can run the metastore from the command line:
$ hive --service metastore
Ensure that the above does not give any error.
Use Ctrl-c to stop the metastore process running from the command line.
To run the metastore as a daemon, the command is:
$ service hive-metastore start
Starting the Hive Console
To start the Hive console:
$ hive
To confirm that Hive is working, issue the SHOW TABLES; command to list the Hive tables; be sure to use a semicolon after the command:
hive> SHOW TABLES;
Hive CLI Commands
Set a Hive or Hadoop conf property:
hive> set propkey=value;
List all properties and values:
hive> set -v;
Hive CLI Commands
Creating managed table
$ cat input/hive/tables/data.txt
$ hive
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE managed_table;
hive> select * from managed_table;
$ hadoop fs -cat /user/hive/warehouse/managed_table/data.txt
Hive CLI Commands
Creating external table
• Select a location in HDFS for the table
• Ensure other users have write access to it
$ sudo -u hdfs hadoop fs -mkdir /user/joe/table
$ sudo -u hdfs hadoop fs -chmod a+w /user/joe/table
• Create the external table and load data into it:
hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/joe/table';
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE external_table;
hive> select * from external_table;
• Check that the data file was placed in the external directory
$ sudo -u hdfs hadoop fs -cat /user/joe/table/data.txt
Hive CLI Commands
Create Partitioned table
hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Load data into the table, specifying the partitions
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file2' INTO TABLE logs PARTITION (dt='2001-01-01', country='US');
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file3' INTO TABLE logs PARTITION (dt='2001-01-02', country='US');
See the table contents
hive> select * from logs;
List all the partitions
hive> SHOW PARTITIONS logs;
Hive CLI Commands
Create Bucket:
• Create a normal table users and populate a bucketed table named bucketed_users from it
hive> set hive.enforce.bucketing=true;
hive> CREATE TABLE users (id INT, name STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/users.txt' INTO TABLE users;
hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
• Check the contents of the whole table, then of a single bucket
hive> select * from bucketed_users;
hive> select * from bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
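TABLESAMPLE(BUCKET 1 OUT OF 4 ON id) can be thought of as keeping only the rows whose id falls into the first of four buckets. The Python sketch below models that idea under a simplifying assumption (the hash of an INT is the integer itself); the user rows and the function name table_sample are hypothetical.

```python
def table_sample(rows, bucket, out_of, key):
    """Keep rows whose key falls in the given 1-based bucket."""
    return [row for row in rows
            if (key(row) & 0x7FFFFFFF) % out_of == bucket - 1]


users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]  # hypothetical rows
sample = table_sample(users, bucket=1, out_of=4, key=lambda row: row[0])
print(sample)  # [(0, 'Nat'), (4, 'Ann')]
```

When the table is already bucketed on id into 4 buckets, a sample like this only needs to read one bucket file rather than scan the whole table.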
Joins
Prerequisites
• Create 2 tables, sales and things, and load data from files
hive> CREATE TABLE sales (user STRING, id INT) row format delimited fields terminated by '\t' stored as textfile;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/sales.txt' INTO TABLE sales;
hive> select * from sales;
hive> CREATE TABLE things (id INT, name STRING) row format delimited fields terminated by '\t' stored as textfile;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/things.txt' INTO TABLE things;
hive> select * from things;
Joins
Inner Join
hive> SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
Joins
Left Outer Join
hive> SELECT sales.*, things.* FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Joins
Right Outer Join
hive> SELECT sales.*, things.* FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
Joins
Full Outer Join
hive> SELECT sales.*, things.* FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
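The four join types differ only in which unmatched rows they keep. The Python sketch below models that on small in-memory lists; the sample rows are made up for illustration (the contents of sales.txt and things.txt are not shown in this deck), None stands in for NULL, and row order is not meaningful.

```python
# Hypothetical sample data: sales rows are (user, id), things rows are (id, name).
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def join(left, right, how="inner"):
    """Join (user, id) rows with (id, name) rows on id."""
    rows = []
    matched_right = set()
    for lrow in left:
        matches = [rrow for rrow in right if rrow[0] == lrow[1]]
        for rrow in matches:
            rows.append(lrow + rrow)
            matched_right.add(rrow)
        # Left and full outer joins keep left rows that found no partner.
        if not matches and how in ("left", "full"):
            rows.append(lrow + (None, None))
    # Right and full outer joins keep right rows that found no partner.
    if how in ("right", "full"):
        for rrow in right:
            if rrow not in matched_right:
                rows.append((None, None) + rrow)
    return rows


for how in ("inner", "left", "right", "full"):
    print(how, len(join(sales, things, how)))
# inner 4
# left 5
# right 5
# full 6
```

With this data, the left outer join adds the unmatched sale by Ali, the right outer join adds the unsold Scarf, and the full outer join adds both.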
Joins
Semi Joins
• Hive releases before 0.13 do not support IN subqueries, so a query like this fails:
hive> SELECT * from things WHERE things.id IN (SELECT id from sales);
• The solution is a semi join:
hive> SELECT * from things LEFT SEMI JOIN sales ON (sales.id = things.id);
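A semi join returns each row of the left table at most once when at least one matching row exists on the right, which is the same result an IN subquery would give. A minimal Python model, using made-up sample rows and a hypothetical helper name:

```python
def left_semi_join(left, right, left_key, right_key):
    """Return each left row once if any right row has a matching key."""
    right_keys = {right_key(row) for row in right}
    return [row for row in left if left_key(row) in right_keys]


# Hypothetical sample data: things rows are (id, name), sales rows are (user, id).
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

result = left_semi_join(things, sales, lambda t: t[0], lambda s: s[1])
print(result)  # [(2, 'Tie'), (4, 'Coat'), (3, 'Hat')]
```

Note that Tie appears only once even though two sales rows reference id 2, and that only columns of the left table are returned.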
Joins
Map Joins
• Used when one table is small enough to fit in memory. No reducers are used
hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
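The idea behind a map join can be sketched as an in-memory hash join: the small table is loaded into a lookup table once, and the large table is streamed past it, so no shuffle or reduce step is needed. The Python below is a simplified single-process model, not Hive's implementation; the sample rows and the name map_join are hypothetical.

```python
def map_join(big_rows, small_rows, big_key, small_key):
    """Join by loading the small table into an in-memory hash table."""
    lookup = {}
    for row in small_rows:  # small table: built once, kept in memory
        lookup.setdefault(small_key(row), []).append(row)
    for row in big_rows:    # big table: streamed row by row
        for match in lookup.get(big_key(row), []):
            yield row + match


# Hypothetical sample data: sales rows are (user, id), things rows are (id, name).
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]

joined = list(map_join(sales, things, lambda s: s[1], lambda t: t[0]))
print(joined)
# [('Joe', 2, 2, 'Tie'), ('Hank', 4, 4, 'Coat'), ('Eve', 3, 3, 'Hat'), ('Hank', 2, 2, 'Tie')]
```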
Other Commands
CREATE TABLE…AS SELECT
hive> CREATE TABLE target AS SELECT id from things;
Altering Tables
hive> ALTER TABLE target RENAME TO source;
hive> ALTER TABLE source ADD COLUMNS (col2 STRING);
Other Commands
Dropping Tables
• For managed tables, both data and metadata are deleted
• For external tables, only metadata is deleted
hive> drop table <table_name>;
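The different drop behaviour can be pictured with a toy Python model in which one dictionary stands in for HDFS and another for the metastore. The class name ToyWarehouse and its methods are invented for this sketch; the paths reuse the ones from the earlier slides.

```python
class ToyWarehouse:
    """Toy model: 'files' stands in for HDFS, 'metastore' for table metadata."""

    def __init__(self):
        self.files = {}      # path -> contents
        self.metastore = {}  # table name -> (location, is_external)

    def create_table(self, name, location, external=False):
        self.metastore[name] = (location, external)

    def drop_table(self, name):
        # Metadata is removed for both kinds of table.
        location, external = self.metastore.pop(name)
        if not external:
            # Managed table: the data under the table directory goes too.
            for path in [p for p in self.files if p.startswith(location + "/")]:
                del self.files[path]


wh = ToyWarehouse()
wh.create_table("managed_table", "/user/hive/warehouse/managed_table")
wh.create_table("external_table", "/user/joe/table", external=True)
wh.files["/user/hive/warehouse/managed_table/data.txt"] = "x"
wh.files["/user/joe/table/data.txt"] = "x"

wh.drop_table("managed_table")
wh.drop_table("external_table")
print(sorted(wh.files))  # ['/user/joe/table/data.txt']
```

After both drops neither table is known to the metastore, but the external table's data file is still in place and can be attached to a new table definition.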
References
• Hadoop: The Definitive Guide, 3rd Edition
• Hive Community page: http://hive.apache.org/