Hive Performance With Different File Formats

Date post: 09-Dec-2015
Performance metrics of Hive Queries

Version 1.0

Comparison of Performance Metrics of Hive Queries Using Text, Avro, and Parquet File Formats

Table of Contents

1 Introduction
1.1 Overview
1.2 Introduction to Apache Avro file format
1.3 Introduction to Apache Parquet file format

2 Case Study
2.1 Objective
2.2 Extracting the data from Teradata
2.3 Loading the data extracted from sqoop into Hive tables
2.4 Conclusion


1 Introduction

1.1 Overview

The purpose of this document is to discuss the performance trade-offs of Hive queries using different file formats and to suggest the best file format to use for Hive storage.

1.2 Introduction to Apache Avro file format

Avro is an Apache open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Using Avro, big data can be exchanged between programs written in any language.

Using the serialization service, programs can efficiently serialize data into files or into messages. The data storage is compact and efficient. Avro stores both the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message. Avro stores the data definition in JSON format, making it easy to read and interpret; the data itself is stored in binary format, making it compact and efficient. Avro files include markers that can be used to split large data sets into subsets suitable for MapReduce processing. Some data exchange services use a code generator to interpret the data definition and produce code to access the data. Avro doesn't require this step, making it ideal for scripting languages.

Avro supports a rich set of primitive data types, including numeric, binary data and strings, and a number of complex types, including arrays, maps, enumerations and records. A sort order can also be defined for the data. A key feature of Avro is robust support for data schemas that change over time, often called schema evolution. Avro cleanly handles schema changes like missing fields, added fields and changed fields; as a result, old programs can read new data and new programs can read old data. Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored using Avro can easily be passed from a program written in one language to a program written in another language, even from a compiled language like C to a scripting language like Pig.

1.3 Introduction to Apache Parquet file format

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. Parquet was designed to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. This approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. It allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and Parquet does not favor any one of them: an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.


2 Case Study

2.1 Objective

Load the data from a Teradata table of approximately 40 GB into Hive tables, and compare query execution times across the different file formats.

2.2 Extracting the data from Teradata

The data was extracted from the Teradata table in different storage formats, both with and without compression.

Sqoop Commands:

Sqoop can extract the data in four variants: text file, text file with compression, Avro, and Avro with compression.
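The four extracts can be sketched as Sqoop invocations along the following lines. This is an illustrative sketch, not the exact commands used in the case study. The host, database, table, user, password file and target directories are hypothetical placeholders. Going through the generic JDBC route with the Teradata driver class is an assumption (a vendor connector could be used instead), and Snappy is assumed as the codec for both compressed variants.

```shell
#!/usr/bin/env bash
# Sketch only: every value below is a hypothetical placeholder.
TD_HOST="${TD_HOST:-td-host.example.com}"
TD_DB="${TD_DB:-sales_db}"
TD_TABLE="${TD_TABLE:-transactions}"
TD_USER="${TD_USER:-etl_user}"
TD_PASSWORD_FILE="${TD_PASSWORD_FILE:-/user/etl_user/.td_password}"
SNAPPY="org.apache.hadoop.io.compress.SnappyCodec"

# Options shared by every import. The caller passes the target directory
# first, followed by the format-specific flags.
sqoop_import() {
  local target_dir="$1"; shift
  sqoop import \
    --connect "jdbc:teradata://${TD_HOST}/DATABASE=${TD_DB}" \
    --driver com.teradata.jdbc.TeraDriver \
    --username "${TD_USER}" \
    --password-file "${TD_PASSWORD_FILE}" \
    --table "${TD_TABLE}" \
    --target-dir "${target_dir}" \
    "$@"
}

# One function per storage variant compared in this document.
import_text() {
  sqoop_import /data/poc/text --as-textfile
}
import_text_snappy() {
  sqoop_import /data/poc/text_snappy --as-textfile \
    --compress --compression-codec "${SNAPPY}"
}
import_avro() {
  sqoop_import /data/poc/avro --as-avrodatafile
}
import_avro_snappy() {
  sqoop_import /data/poc/avro_snappy --as-avrodatafile \
    --compress --compression-codec "${SNAPPY}"
}
```

Calling `import_text`, `import_text_snappy`, `import_avro` and `import_avro_snappy` in turn would produce the four data sets. If the source table has no primary key, Sqoop additionally needs either `--split-by <column>` or `-m 1`.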

Since the Sqoop version used here does not support importing data in Parquet format (Sqoop 1.4.6 and later add an --as-parquetfile option), the data extracted by Sqoop has to be converted into Parquet format afterwards.

Two approaches to convert the data into Parquet format are listed below:

i. Converting a CSV file into Parquet data

1) Get the data file from the Teradata database in CSV format

2) Process the data file using Pig to convert the CSV file into Parquet file format

3) Once the Conversion is done, create Parquet Hive table based on the schema

4) Load the converted Parquet file into Parquet Table.

5) Description of Parquet Table

6) Simple Select Query on Parquet Table

7) Complex Select Query on Parquet Table

8) Table Size
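Steps 2 to 5 above can be sketched as follows. This is a minimal sketch under stated assumptions, not the scripts used in the case study. The paths, the jar location, the table name and the three-column schema are hypothetical. `parquet.pig.ParquetStorer` assumes the pre-Apache parquet-pig bundle, matching the `parquet.*` package names used elsewhere in this document. `STORED AS PARQUET` assumes Hive 0.13 or later; older versions would need the explicit SerDe and input/output format classes instead.

```shell
#!/usr/bin/env bash
# Sketch only: paths, jar location, table and column names are hypothetical.
CSV_DIR="${CSV_DIR:-/data/poc/csv}"
PARQUET_DIR="${PARQUET_DIR:-/data/poc/parquet_from_csv}"
PARQUET_PIG_JAR="${PARQUET_PIG_JAR:-/opt/parquet/parquet-pig-bundle.jar}"

# Step 2: Pig script that reads the CSV data and stores it as Parquet.
# PigStorage(',') does a plain split on commas and does not handle
# quoted fields.
pig_script() {
  cat <<EOF
REGISTER ${PARQUET_PIG_JAR};
rows = LOAD '${CSV_DIR}' USING PigStorage(',')
       AS (id:long, name:chararray, amount:double);
STORE rows INTO '${PARQUET_DIR}' USING parquet.pig.ParquetStorer();
EOF
}

# Steps 3 to 5: create a Hive table matching the Pig schema, load the
# converted files into it, and describe the result.
hive_script() {
  cat <<EOF
CREATE TABLE parquet_from_csv (id BIGINT, name STRING, amount DOUBLE)
STORED AS PARQUET;
LOAD DATA INPATH '${PARQUET_DIR}' INTO TABLE parquet_from_csv;
DESCRIBE FORMATTED parquet_from_csv;
EOF
}

# Write both scripts to a scratch directory and run them in order.
convert_and_load() {
  local workdir
  workdir="$(mktemp -d)" || return 1
  pig_script  > "${workdir}/csv_to_parquet.pig"
  hive_script > "${workdir}/load_parquet.hql"
  pig -f "${workdir}/csv_to_parquet.pig" &&
    hive -f "${workdir}/load_parquet.hql"
}
```

Running `convert_and_load` executes the Pig conversion first and the Hive statements only if the conversion succeeds. `pig_script` and `hive_script` can also be called on their own to review the generated scripts before running anything.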

ii. Converting an Avro data file into Parquet data file

1) Use the source code below to create a jar that converts an Avro data file into a Parquet data file

Main Java Code

File Name : Avro2Parquet.java

    package com.cloudera.science.avro2parquet;

    import java.io.InputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    import parquet.avro.AvroParquetOutputFormat;
    import parquet.avro.AvroSchemaConverter;
    import parquet.hadoop.metadata.CompressionCodecName;

    public class Avro2Parquet extends Configured implements Tool {

        // Arguments: <avro schema file> <avro input path> <parquet output path>
        public int run(String[] args) throws Exception {
            Path schemaPath = new Path(args[0]);
            Path inputPath = new Path(args[1]);
            Path outputPath = new Path(args[2]);

            Job job = Job.getInstance(getConf());
            job.setJarByClass(getClass());
            Configuration conf = job.getConfiguration();

            // Read the Avro schema (.avsc) and close the stream afterwards.
            FileSystem fs = FileSystem.get(conf);
            Schema avroSchema;
            try (InputStream in = fs.open(schemaPath)) {
                avroSchema = new Schema.Parser().parse(in);
            }

            // Print the Parquet schema derived from the Avro schema.
            System.out.println(new AvroSchemaConverter().convert(avroSchema).toString());

            FileInputFormat.addInputPath(job, inputPath);
            job.setInputFormatClass(AvroKeyInputFormat.class);

            job.setOutputFormatClass(AvroParquetOutputFormat.class);
            AvroParquetOutputFormat.setOutputPath(job, outputPath);
            AvroParquetOutputFormat.setSchema(job, avroSchema);
            AvroParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
            AvroParquetOutputFormat.setCompressOutput(job, true);

            /* Impala likes Parquet files to have only a single row group.
             * Setting the block size to a larger value helps ensure this to
             * be the case, at the expense of buffering the output of the
             * entire mapper's split in memory.
             *
             * It would be better to set this based on the files' block size,
             * using fs.getFileStatus or fs.listStatus.
             */
            AvroParquetOutputFormat.setBlockSize(job, 500 * 1024 * 1024);

            // Map-only job: each Avro record is written straight to Parquet.
            job.setMapperClass(Avro2ParquetMapper.class);
            job.setNumReduceTasks(0);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new Avro2Parquet(), args);
            System.exit(exitCode);
        }
    }


Mapper Java Code

File Name : Avro2ParquetMapper.java

    package com.cloudera.science.avro2parquet;

    import java.io.IOException;

    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Avro2ParquetMapper
            extends Mapper<AvroKey<GenericRecord>, NullWritable, Void, GenericRecord> {

        // Pass each Avro record through unchanged; the Parquet output
        // format ignores the key, so it is written as null.
        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(null, key.datum());
        }
    }


2) Compile the code into a jar using the commands below. The -d . option writes the class files into the com/cloudera/science/avro2parquet package directory, which is then packaged into the jar.

    javac -d . -classpath "$(hadoop classpath):." Avro2Parquet.java Avro2ParquetMapper.java

    jar -cvf avro2parquet.jar com

3) Use the jar to convert from Avro data format into Parquet data format

    hadoop jar <avro2parquet jar file> \
        com.cloudera.science.avro2parquet.Avro2Parquet \
        <and generic options to the JVM> \
        hdfs:///path/to/avro/schema.avsc \
        hdfs:///path/to/avro/data \
        hdfs:///output/path

The stats below indicate the file size comparison across the different file storage types.

File size decreases drastically when moving from plain text to Snappy-compressed text and from plain Avro to compressed Avro. Parquet, even without any additional compression, brings the file size down to 85% of the original size.
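A size comparison of this kind can be reproduced with `hdfs dfs -du -s`, which prints a directory's size in bytes in its first column. The sketch below expresses each converted data set as a percentage of a baseline directory, rounded to a whole percent. The function names are hypothetical helpers introduced here for illustration, and the directories passed in are whatever paths the extracts were written to.

```shell
#!/usr/bin/env bash
# Size of an HDFS directory in bytes (first column of `hdfs dfs -du -s`).
hdfs_bytes() {
  hdfs dfs -du -s "$1" | awk '{print $1}'
}

# Percentage of the original size that a converted copy occupies,
# rounded to a whole percent. Arguments: <original bytes> <converted bytes>
pct_of_original() {
  awk -v orig="$1" -v conv="$2" 'BEGIN { printf "%.0f\n", 100 * conv / orig }'
}

# Compare every directory after the first against the first (baseline).
# Prints one "<directory><TAB><percent>%" line per compared directory.
compare_sizes() {
  local base="$1"; shift
  local base_bytes dir
  base_bytes="$(hdfs_bytes "$base")"
  for dir in "$@"; do
    printf '%s\t%s%%\n' "$dir" \
      "$(pct_of_original "$base_bytes" "$(hdfs_bytes "$dir")")"
  done
}
```

For example, `compare_sizes <text dir> <snappy text dir> <avro dir> <parquet dir>` lists each format's size relative to the uncompressed text extract, which makes explicit whether a figure means "reduced to" or "reduced by" a given percentage.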

2.3 Loading the data extracted from sqoop into the Hive tables

1) Loading the text file into Hive Table

This is a straightforward load of the data into a Hive table.

2) Loading the Avro file into Hive Table

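    CREATE EXTERNAL TABLE avro_table
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '<hdfs_path>'
    TBLPROPERTIES ('avro.schema.url'='<hdfs_path>/avro_schema.avsc');

The avro.schema.url property takes the location of the schema file. The avro.schema.literal property would instead require the schema JSON itself to be embedded in the statement.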

As shown above, creating an Avro Hive table requires specifying the Avro schema. Defining the Avro schema by hand can be difficult because it requires a thorough knowledge of JSON. The schema can instead be extracted from the Avro data file itself, which avoids defining it manually for each table.
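One common way to extract the schema is the `getschema` command of the Avro tools jar, which prints the schema embedded in an Avro data file's header. The sketch below is an assumption about how this step could be done, not necessarily the procedure used in the case study. The jar location and the HDFS paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the jar location is a hypothetical placeholder.
AVRO_TOOLS_JAR="${AVRO_TOOLS_JAR:-/opt/avro/avro-tools.jar}"

# Copy one Avro part file to local disk, read the schema embedded in its
# header, and publish the schema to HDFS so a Hive table can point at it.
# Arguments: <avro part file on HDFS> <schema destination on HDFS>
extract_avro_schema() {
  local avro_file="$1" schema_dest="$2" workdir status
  workdir="$(mktemp -d)" || return 1
  hdfs dfs -get "${avro_file}" "${workdir}/sample.avro" &&
    java -jar "${AVRO_TOOLS_JAR}" getschema "${workdir}/sample.avro" \
      > "${workdir}/schema.avsc" &&
    hdfs dfs -put -f "${workdir}/schema.avsc" "${schema_dest}"
  status=$?
  rm -rf "${workdir}"   # clean up the scratch directory either way
  return "${status}"
}
```

A call such as `extract_avro_schema <hdfs_path>/part-m-00000.avro <schema_path>/avro_schema.avsc` would leave the schema file on HDFS, where the table definition can reference it through the avro.schema.url table property. Only one part file is needed, because every file produced by the same import carries the same schema.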

3) Loading the data into Parquet based Hive Table

The Parquet Hive table is created with the same structure as the Avro table, so that it uses the same schema. The WHERE 1=2 condition copies the column definitions without copying any rows.

    CREATE TABLE Parquet_table
    ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerde'
    STORED AS
      INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
      OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    AS SELECT * FROM AVRO_TABLE WHERE 1=2;

Load the converted Parquet data into the newly created Parquet Hive table.

The stats below indicate the query response time with the different file storage types.

2.4 Conclusion

Of the three formats compared, Parquet compresses the data files the most and also gives much faster query response times than the other two, so Parquet is the clear winner in this scenario.

Because the Parquet format is columnar, it may not be as efficient as it was in this use case when queries need to access entire rows.

Revision History

Date Version Description Author

27-Nov-14 1.0 Created Mohammed Danesh Guard

