
Hive Tutorial - Managed Tables

Apache Hive Tutorials: Hive Managed Tables
By Prashant Kumar Pandey

Learn more with us at http://www.thriveschool.blogspot.in/

The syntax for creating a table is fairly simple and may not need much explanation. Tables created using the method explained below are called Hive managed tables, which means that if you drop the table, Hive removes both the table definition and its data.

CREATE TABLE IF NOT EXISTS pkdb.employees (
    name          STRING                COMMENT 'Employee Name',
    salary        FLOAT                 COMMENT 'Employee Salary',
    subordinates  ARRAY<STRING>         COMMENT 'Names of Subordinates',
    deductions    MAP<STRING, FLOAT>    COMMENT 'Deductions',
    address       STRUCT<street:STRING, city:STRING, zip:INT> COMMENT 'Home Address'
)
COMMENT 'Employee Table'
TBLPROPERTIES ('Created_By'='Prashant', 'Created_Date'='2013-01-01');

Few Notes

1. IF NOT EXISTS is an optional clause. It checks only the table name and ignores the table structure.

2. You can optionally prefix the database name if you want to create the table in a database other than the current one.

3. You can optionally give comments to columns.

4. You can optionally give a comment to the table.

5. You can optionally attach key-value pairs as table properties.

6. Hive automatically adds two table properties: last_modified_by, which holds the username of the last user who modified the table, and last_modified_time, which holds the time of that modification.

7. You can optionally add other clauses which we learned in previous lessons, such as ROW FORMAT DELIMITED, LINES TERMINATED BY and LOCATION.

If you already have several tables in your database, you can use the SHOW TABLES command to list all tables in your current database, or SHOW TABLES IN <database_name> to list all tables in a specified database.
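For example, from the Hive prompt:

SHOW TABLES;            -- tables in the current database
SHOW TABLES IN pkdb;    -- tables in the pkdb database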

You can use the DESCRIBE EXTENDED <table_name>; command to show details about a table. As you might have guessed, EXTENDED is optional.

You can drop a table using DROP TABLE <table_name>;. If the table is in a database other than your current one, you can prefix the database name, as in pkdb.mytable.
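For example, with the pkdb.employees table created above:

DESCRIBE EXTENDED pkdb.employees;   -- column and storage details
DROP TABLE pkdb.employees;          -- removes both the definition and the data for a managed table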


Now that you know how to create a table, let's create one and load data into it. We will take a simple example to understand how data is physically stored in Hive managed tables.

I have a database named pkdb, which is empty as of now, as shown in the screen below.

Let's re-check it using the show tables; command. As shown in the screen below, there are no tables. I will create a new table named employees, overriding the default record delimiters: fields are terminated by a comma, collection items are terminated by a colon, map key-value pairs are terminated by a hyphen, and each record is terminated by a newline character. The table is created and we are ready to load data into it.
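The screen itself is not reproduced here, but a sketch of such a statement, reusing the employees columns from the previous page and assuming the delimiters just described, would look like this:

CREATE TABLE IF NOT EXISTS pkdb.employees (
    name          STRING,
    salary        FLOAT,
    subordinates  ARRAY<STRING>,
    deductions    MAP<STRING, FLOAT>,
    address       STRUCT<street:STRING, city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','            -- fields separated by comma
    COLLECTION ITEMS TERMINATED BY ':'  -- array/struct items separated by colon
    MAP KEYS TERMINATED BY '-'          -- map key-value pairs separated by hyphen
    LINES TERMINATED BY '\n'            -- one record per line
STORED AS TEXTFILE;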

Before we load data into the employees table, let's go to our database directory and check what was created there as a result of our CREATE TABLE statement. You will find a new directory named employees under the pkdb.db directory. There are no files yet, because there is no data in our table.
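You can verify this from the Hive prompt itself using the dfs command (the path below assumes the default warehouse location /user/hive/warehouse; your hive.metastore.warehouse.dir setting may point elsewhere):

dfs -ls /user/hive/warehouse/pkdb.db;            -- shows the employees directory
dfs -ls /user/hive/warehouse/pkdb.db/employees;  -- empty, no data loaded yet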


We need a data file to load into our table. I have created a sample file with just two records to test the data load into the Hive table. You will notice that it is a plain text file and its record structure complies with the structure we defined while creating the table, i.e. fields are terminated by a comma and so on. For your reference, the content of my file is displayed in the screen below.

It's time to load the data from the file into the Hive table. The command is very simple, as shown in the screen below.
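The screens are not reproduced here; a minimal sketch, assuming a hypothetical local file /tmp/employees.txt whose records follow the delimiters defined above, might look like this:

-- example record in /tmp/employees.txt (purely illustrative):
-- John Doe,100000.0,Jane:Bob,Tax-0.2:PF-0.05,MG Road:Bangalore:560001

LOAD DATA LOCAL INPATH '/tmp/employees.txt' INTO TABLE pkdb.employees;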


Data is now loaded into the employees table. Let's execute a SELECT statement to ensure that the table is holding data.
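For example:

SELECT * FROM pkdb.employees;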

So, we have created a table in Hive and loaded data into it. We executed a SELECT statement and got results as well. Let's go back to our database directory and check what was created there. Simply, our data file was copied into the table's directory under the HDFS database directory. Let's drop the table and check whether the directory and the data file are removed as well. I hope you will check this yourself.
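A quick way to check, assuming the same default warehouse path as before:

DROP TABLE pkdb.employees;
dfs -ls /user/hive/warehouse/pkdb.db;   -- the employees directory and its data file should be gone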

The next important thing to understand about Hive is the partitioned table. Let's assume we are a global company with millions of customers across the globe. We have a large amount of data about our customers from all over the world and we want to load it into a data warehouse. The combined data file is many terabytes, or maybe a couple of petabytes. You can imagine how your queries will perform in such a situation if you have not partitioned your table smartly and your queries do not include appropriate partition filters. If you have any experience writing queries on very large databases, you are probably already aware of such problems. This makes partitioning a basic feature for any database table, whether it is Apache Hive or any other database. However, Hive's implementation of table partitioning is completely different from that of most other databases. We will understand it using an example. I will use a very small data size for this demonstration, but to explain the concept behind it, size does not matter.

Let's create a customers table which is partitioned on two keys, country and state. The syntax for creating a partitioned table is shown in the screen below. You will notice that I haven't created any columns for country and state. There is a reason for this. Since the country and state information is encoded into the partition, why repeat that data in the table? That would be redundant data, so we do not create country and state columns in our table. Any partition itself is able to provide that information, and all data within a specific partition belongs to the same value.
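The screen is not reproduced here; a minimal sketch of such a table, with hypothetical customer columns (name and city are assumptions, not taken from the original), could look like this:

CREATE TABLE IF NOT EXISTS pkdb.customers (
    name  STRING,
    city  STRING
)
PARTITIONED BY (country STRING, state STRING)   -- partition columns are not repeated in the column list
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
STORED AS TEXTFILE;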

After creating the partitioned table, I loaded data into it for one partition only, i.e. country=india and state=karnataka. There are two things to note here. First, I need to tell Hive which partition I am loading, using the partition(country='india', state='karnataka') clause. Second, my data file kar-data.txt only contains data for india/karnataka. The data is loaded and now you are ready to execute queries on this table. A partitioned table does not place any restriction on querying data; you can still execute your queries for the entire world, but if you query a specific partition, Hive does not need to read all partitions, making your query faster.
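A sketch of the load and a partition-filtered query, assuming kar-data.txt is available on the local file system:

LOAD DATA LOCAL INPATH 'kar-data.txt'
INTO TABLE pkdb.customers
PARTITION (country='india', state='karnataka');

SELECT * FROM pkdb.customers
WHERE country = 'india' AND state = 'karnataka';   -- Hive reads only this partition's directory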


Now, as always, we need to go back to our database directory and check what was created there. Check the screen below and you will notice the following.

1. My database 'pkdb' is actually a directory named 'pkdb.db'.

2. My table 'customers' is again a directory inside the 'pkdb.db' directory.

3. My country partition for India is also a directory, named 'country=india', under the 'customers' table directory. This means that if I load data for Japan, Hive will create another directory named 'country=japan' under the 'customers' table directory. Isn't it simple?

4. My sub-partition for Karnataka is again a directory, named 'state=karnataka', under my main partition for India.

5. Finally, my data file is placed under my sub-partition for Karnataka.

This makes life simple for Hive. If I want to read data for all of India, Hive will read all files under the directory 'country=india', and similarly, if I want to read data only for Karnataka, Hive will read all files under the directory 'state=karnataka'.
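Putting that together, the layout on HDFS looks roughly like this (again assuming the default warehouse location):

/user/hive/warehouse/
    pkdb.db/
        customers/
            country=india/
                state=karnataka/
                    kar-data.txt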

You might wonder why I said "all files under the directory 'state=karnataka'" when we want to read all data for Karnataka. Shouldn't that be just one file with all the data for Karnataka? That's the catch. You know that HDFS is a distributed file system and MapReduce is a distributed processing framework. Hive is built on top of these two to utilize their power. This brings some limitations which Hive had to handle. HDFS is essentially designed for write-once, read-many-times access. So if I load some more data for Karnataka, it won't update the old file; instead, it will create another file in the same directory. And when Hive has to read all data for Karnataka, it has to read both files.

Do you see any problem in this? You might be thinking about the impact on performance. If you do, then you may not yet appreciate what MapReduce has to offer as a parallel processing framework and how your files are actually broken into blocks which are spread across various DataNodes in Hadoop. The directory and file names are just a namespace managed at the NameNode. Read my posts about HDFS and MapReduce to understand this correctly.
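To see this behavior yourself, you could load a second file into the same partition (the file name more-kar-data.txt is hypothetical); Hive simply places it alongside the first one:

LOAD DATA LOCAL INPATH 'more-kar-data.txt'
INTO TABLE pkdb.customers
PARTITION (country='india', state='karnataka');

dfs -ls /user/hive/warehouse/pkdb.db/customers/country=india/state=karnataka;
-- both kar-data.txt and more-kar-data.txt should now appear in this partition directory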

