+ All Categories
Home > Technology > Hive Tutorials - Hive Data Types

Hive Tutorials - Hive Data Types

Date post: 28-Nov-2014
Category:
Upload: prashant-pandey
View: 964 times
Download: 0 times
Share this document with a friend
Description:
 
6
Apache Hive Tutorials Hive Data Types Prashant Kumar Pandey
Transcript
Page 2: Hive Tutorials - Hive Data Types

Learn more with us at http://www.thriveschool.blogspot.in/

Before we get into details of creating tables, loading data and executing queries in hive to learn

power of hive as a data warehouse, let’s learn some basics. We have already understood what

is a database or schema in hive however I will cover some more details about it in this tutorial.

Then we will understand hive data types. So let’s start.

First concept related to hive is database. In hive, database is just a namespace of tables. They

are mainly used to organize tables into logical groups. For hive, SCHEMA and DATABASE are

synonymous, which means that you can use the keyword SCHEMA instead of DATABASE in all

the database-related commands explained below.

Most simple syntax for creating a database is CREATE DATABASE <database_name>; which we

have already seen earlier in getting started with hive.

When you are creating a script, it is valuable to avoid error which may come if database already

exist. To accomplish this, hive supports following syntax.

CREATE DATABASE IF NOT EXISTS <database_name>;

To get list of databases that already exist you can use SHOW DATABASES; command. This

command also support regular expression search using SHOW DATABASES LIKE 'pk*';

All databases are created under a top-level directory specified by the property

hive.metastore.warehouse.dir. You can override this default using location clause in created

database command as shown in below example.

CREATE DATABASE pkdb LOCATION '/pkp/mydatabases';

You can use DESCRIBE DATABASE <database_name>; command to describe it. Describe

command will show database directory location and a comment about database which you can

create using COMMENT clause as shown in below example.

CREATE DATABASE pkdb COMMENT 'My default database';

Comment is a nice way to provide in place documentation. Similarly you can also associate key-

value properties with the database which will be displayed if you use DESCRIBE DATABASE

EXTENDED <database_name>; command.

Syntax for associating key-value pairs is shown in below example.

CREATE DATABASE pkdb WITH DBPROPERTIES ('Created_By' = 'Prashant', 'Date' = '2013-01-

01');

USE <database_name>; command is used to make any database as current active database.

Page 3: Hive Tutorials - Hive Data Types

Learn more with us at http://www.thriveschool.blogspot.in/

To drop a database we can use DROP DATABASE IF EXISTS <database_name> CASCADE;

command. Where IF EXISTS clause is optional which is used to suppress warning message if

database does not exist.

CASCADE is also optional. If CASCADE is present, it will drop all tables present in specified

database. If CASCADE is not present, hive will not allow you to drop database unless it is empty.

As of now, hive does not support altering database once created except DBPROPERTIES. You

can change key-value pairs associated with you database using alter command as shown in

below example.

ALTER DATABASE pkdb SET DBPROPERTIES (' Date' = '2013-01-02');

Hive Data Types Unlike other popular relational databases, hive supports collection data types along with many

primitive data types. Most of the popular relational databases take control on how storage is

managed on disk. But hive leave this control on us which make it easier to process data

generated using various tools. We will get into those details later and let’s focus on data types

for now.

Primitive Data Types

Hive supports 4 integers, 2 floating points, Boolean and string data types as shown in below

table. For readability, I used uppercase but its case insensitive. Since all of it is internally

implemented in JAVA, you can expect similarities in their behavior.

Data Type Size Example

TINYINT 1 Byte 10, -10

SMALLINT 2 Byte 10, -10

INT 4 Byte 10, -10

BIGINT 8 Byte 10, -10

FLOAT Single precision float 10.8932

DOUBLE Double precision float

10.8932

BOOLEAN Boolean true or false TRUE

STRING Sequence of characters

‘Sample string’

TIMESTAMP Integer, float or string values.

129357385 8929245.879395 ‘2013-01-01 12:00:00.123456789’

BINARY Array of bytes

Page 4: Hive Tutorials - Hive Data Types

Learn more with us at http://www.thriveschool.blogspot.in/

TINYINT, SMALLINT, INT, BIGINT are four integer data types with only differences in their size.

FLOAT and DOUBLE are two floating point data types. BOOLEAN is to store true or false.

STRING is to store character strings. Note that, in hive, we do not specify length for STRING like

in other databases. It’s more flexible and variable in length.

TIMESTAMP can be an integer which is interpreted as seconds since UNIX epoch time. It may be

a float where number after decimal is nanosecond. It may be string which is interpreted

according to the JDBC date string format i.e. YYYY-MM-DD hh:mm:ss.fffffffff. Time component

is interpreted as UTC time. Hive provides various functions to convert into specific time zones.

BINARY is used to place raw bytes which will not be interpreted by hive. It is suitable for binary

data.

If you need to cast one type to another like a string to an integer, you can use cast (str as int)

function in hive.

Collection Data Types

Hive supports columns that are STRUCT, MAP, and ARRAY unlike other relational databases.

If you are familiar with C language, you might already know what STRUCT is. Hive STRUCT is

analogous to STRUCT in C.

MAP is a collection of key-value pairs.

ARRAY is obviously an ordered sequence of same data type of variable which are accessed using

zero based index.

We understand that hive is very flexible in offering collection data types but a genuine concern

is how data is physically stored in flat files. Hive gives that flexibility as well in our hands. Let’s

take an example to understand it.

Page 5: Hive Tutorials - Hive Data Types

Learn more with us at http://www.thriveschool.blogspot.in/

I am assuming an employee table with below structure in hive.

CREATE TABLE employees (name STRING,

salary FLOAT,

subordinates ARRAY<STRING>,

deductions MAP<STRING, FLOAT>

address STRUCT<stree:STRING,

city:STRING,

zip:INT>

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ‘\001’

COLLECTION ITEMS TERMINATED BY '\002'

MAP KEYS TERMINATED BY '\003'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;

Note that \001 is an octal code for ^A, \002 is ^B and \003 is ^C. for further explanation, we will

use text instead of octal code. Let’s assume one record as shown in below table.

So, we have an employee named as Prashant. Salary is 10000. He has two subordinates Purna

and Sudarshan. From his monthly salary, there are two deductions i.e. 750 for PF and 1000 as

income tax. His address is Steet= Marathalli, City is Bangalore and zip code is 560037.

Let’s build this record in hive table format as described by hive table structure above.

Fields are terminated by ^A. We have five fields Name, Salary, Subordinates, Deductions and

Address. You might see it like below.

Name^ASalary^ASubordinates^ADeductions^A Address

Let’s replace it with actual values for first two fields because they are straight.

Prashant^A10000^ASubordinates^ADeductions^A Address

Page 6: Hive Tutorials - Hive Data Types

Learn more with us at http://www.thriveschool.blogspot.in/

Now, we have two subordinates and we know that collection items will be delimited by ^B

giving below record. Same will apply to address parts.

Prashant^A10000^APurna Dash^BSudarshan Raju ^ADeductions^A

Marathalli^BBangalore^B560037

Deduction is bit complex than others. You need to understand that deductions has two

collection items (PF=750 and IT=1000) and each of those two collection items is a MAP. MAP

will be delimited by ^C (PF^C750 and IT^C1000) then collection will be delimited by ^B giving

below record.

Prashant^A10000^APurna Dash^BSudarshan Raju ^A PF^C750^B IT^C1000^A

Marathalli^BBangalore^B560037

There are other file formats but for now you can consider that’s how data in Hive table will be

stored. It’s a series of records in the above format where each record is terminated by a new

line as specified in LINES TERMINATED BY '\n' clause of create table.

All delimiters used in above example are default delimiters for hive. If you do not specify them,

default values will be assumed by hive. You are free to change these values for each table you

have.

That’s kind of flexibility and looseness makes hive powerful I utilizing data generated by various

ETL tools.


Recommended