Parquet and Impala overview


Evaluation of Hadoop data formats, with a focus on Parquet, plus a brief intro and overview of Impala.


1

Parquet data format & Impala overview

2

Agenda

• Objective

• Various data formats

• Use case

• Parquet

• Impala

3

Objective

• Two-fold:

• Quest for a more performant data format than Avro for nested data

• Understand and test new data formats in general

4

Hadoop data formats

• Sequence file. Stores key-value pairs of data in a flat binary file; rows are stored as values.

• ORC. Stores column-oriented data. Adds RLE and dictionary encoding, per-file statistics, and single-file output. Bloom filter support is planned.

• Avro. Data serialization framework: a serialization format and exchange service, usable from any language. Data is accompanied by its schema (in JSON). Supports schema evolution.
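For comparison, a minimal Hive sketch of how each format is declared (table and column names are made up; the STORED AS AVRO shorthand needs a recent Hive version, while older versions require spelling out the AvroSerDe):

• HIVE> create table logs_seq (userId string, cpuTime bigint) stored as sequencefile;

• HIVE> create table logs_orc (userId string, cpuTime bigint) stored as orc;

• HIVE> create table logs_avro (userId string, cpuTime bigint) stored as avro;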

5

Parquet

• Columnar storage

• Automatic dictionary encoding and run-length encoding. Encoding is kept separate from compression.

• Run-length encoding: replaces sequences ("runs") of consecutive repeated characters (or other units of data) with a single character and the length of the run.

• Dictionary encoding takes the distinct values present in a column and represents each one in a compact 2-byte form.
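As a rough illustration (made-up values): run-length encoding stores the column values AAAABBBCC as 4×A, 3×B, 2×C, while dictionary encoding stores the distinct strings {US, FR, DE} once in a dictionary and replaces every occurrence in the column with a small integer index into that dictionary.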

6

Parquet

• Parquet can handle multiple schemas and supports schema evolution.

• LogType A : organizationId, userId, timestamp, recordId, cpuTime

• LogType V : userId, organizationId, timestamp, foo, bar

• Can be used by any project in the Hadoop ecosystem. Integrations provided for M/R, Pig, Hive, Cascading and Impala.

7

Parquet

• SELECT vs INSERT

• Parquet tables require relatively little memory to query, because a query reads and decompresses data in 8MB chunks.

• Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (with a maximum size of 1GB) is stored in memory until encoded, compressed, and written to disk.
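A hedged sketch of limiting that per-file buffer at insert time (parquet_table_name and text_table_name are placeholders; the availability and default of the PARQUET_FILE_SIZE query option, and whether SET is accepted as a SQL statement, depend on the Impala version):

• IMPALA> set PARQUET_FILE_SIZE=268435456;

• IMPALA> insert overwrite table parquet_table_name select * from text_table_name;

With the smaller target file size, each open data file buffers less data in memory before being encoded, compressed, and written out.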

8

Parquet

• Memory issues (heap space errors) resolved by:

• Reducing parquet.block.size. The block size is the size of a row group buffered in memory; its default value is 256 MB.

•The total memory allocated was around 1 GB.

• Using multiple Hive partitions meant multiple buffers were created (one for writing into each partition).

• So writing data as Parquet will always have a high memory requirement.

• Hive's DISTRIBUTE BY was the workaround to the memory issues (see the sketch below).
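A minimal Hive sketch of that workaround (table, column, and partition names are made up; exact property names can vary by Hive/Parquet version):

• HIVE> set parquet.block.size=67108864;

• HIVE> set hive.exec.dynamic.partition=true;

• HIVE> set hive.exec.dynamic.partition.mode=nonstrict;

• HIVE> insert overwrite table logs_parquet partition (day) select organizationId, userId, cpuTime, day from logs_text distribute by day;

DISTRIBUTE BY routes all rows of a given partition to the same reducer, so each writer keeps only one Parquet row-group buffer open instead of one per partition.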

9

Parquet vs other formats

Performance test with 100 GB of data over multiple queries

Parquet wins

10

Impala overview

• MPP (massively parallel processing) implementation of a query engine

• Impala vs Hive: Impala targets interactive, exploratory SQL analytics on large data sets, whereas Hive runs as batch jobs.

• Not using M/R – but uses HDFS

• Not CEP – closer to an RDBMS.

• Impala uses the same metadata store as Hive to record information about table structure and properties

11

Impala overview

• Can create a table in Hive and use it in Impala

• E.g. Impala doesn’t support Avro, but Hive does

• The query language is a mix of SQL and HiveQL

• Requires a lot of memory (128 GB minimum per node)

• Initial load of data via REFRESH; can take a lot of time

• REFRESH loads the block location data for newly added data files (see the example below)
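A hedged example of that flow (table name and columns are made up; on the Impala versions discussed here a brand-new Hive table needs INVALIDATE METADATA, while REFRESH picks up new files in an already-known table):

• HIVE> create table logs_raw (userId string, cpuTime bigint) row format delimited fields terminated by ',' stored as textfile;

• IMPALA> invalidate metadata;

• IMPALA> refresh logs_raw;

• IMPALA> select userId, sum(cpuTime) from logs_raw group by userId;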

12

Impala overview

• Shortcomings

• Impala doesn’t support nested types at this point (version 1.2.3) – a table cannot contain nested types such as array, map, or struct.

• Impala currently does not "spill to disk" if intermediate results being processed on a node exceed the memory reserved for Impala on that node.

• No Custom Serializer/Deserializer classes (SerDes)

• Impala cancels a running query if any host on which that query is executing fails

13

Impala overview

• Example. To create a PARQUET table in IMPALA there are 3 ways:

• -> PARQUET table created in HIVE (with no nested data types).

• -> Create a normal text table in IMPALA and load it with data, then:

• IMPALA> create table parquet_table_name LIKE text_table_name STORED AS PARQUET LOCATION '/user/hdfs/..';

• -> Create a Parquet-format table and then insert into it from the normal text table:

• IMPALA> insert overwrite table parquet_table_name select * from text_table_name;
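Put together, a minimal end-to-end sketch of the text-table route (all table names and the CSV path are placeholders):

• IMPALA> create table text_table_name (userId string, cpuTime bigint) row format delimited fields terminated by ',' stored as textfile;

• IMPALA> load data inpath '/user/hdfs/logs.csv' into table text_table_name;

• IMPALA> create table parquet_table_name like text_table_name stored as parquet;

• IMPALA> insert overwrite table parquet_table_name select * from text_table_name;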

14

Use Case

• Can't query the Avro table in Impala because it has nested columns.

• An Avro table created through Hive can be used in Impala as long as it contains only Impala-compatible data types

• (it cannot contain nested types such as array, map, or struct).
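A hedged sketch of such an Impala-compatible (flat) Avro table defined through Hive (table name and schema are made up; on newer Hive versions STORED AS AVRO is a shorter equivalent):

• HIVE> create table logs_avro_flat
        row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
        tblproperties ('avro.schema.literal'='{"type":"record","name":"Log","fields":[{"name":"userId","type":"string"},{"name":"cpuTime","type":"long"}]}');

Because the schema uses only scalar types (no array, map, or struct), Impala can query this table after an INVALIDATE METADATA.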

15

Use Case

• How to deal with nested XML data in Hadoop?

• There is no direct mapping from XML to Avro. The process goes:

• Parse XML and convert to Avro: parse the XML using XMLStreamReader, perform JAXB unmarshalling, and create Avro records from the JAXB objects. A Java class needs to be written for this.

• Tried using Parquet/Avro:

• Tested: process XML – first convert into Avro and then store in Parquet format using the parquet-avro APIs.

• The problem is that the schema provided has some arrays whose type is a union of string and null.

• Currently the AvroSchemaConverter is not able to handle such an Avro schema and throws an exception.

• Tested: Impala 1.2.3 on CDH 4.5

• Impala doesn’t support nested types at this point

16

Thank you