Date posted: 26-Jan-2015
Category: Technology
Uploaded by: hadoopsummit
Efficient processing of large and complex XML documents in Hadoop
Sujoe Bose, Senior Principal, Sabre Holdings, June 2013
Presentation Outline

§ Motivation
§ ETL vs. ELT
§ Avro format
§ Mapping from XML to Avro
§ Interfaces to access Avro
§ Performance and storage considerations
§ Other types of storage/processing formats

confidential 2
You will learn about …

§ A method to store and process complex XML data in Hadoop as Avro files
§ Interfaces to access and analyze data in Avro from Hive, Java, and Pig
§ Variations of the method and their relative trade-offs in storage and processing
Motivation

§ Prevalence of XML and its derivatives
  – Spurred by Web Services and SOA
  – Preferred communication format until newer formats entered
  – Data and logs represented in XML
§ XML: metadata combined with data
  – Flexibility vs. complexity
§ Can be arbitrarily nested and large
§ Volumes of documents – Big Data
Challenges

§ Parsing XML is CPU-intensive
§ Certain parsers/parsing methods result in higher memory consumption
§ Repeated parsing for each query
§ Large and deeply nested XML documents make the problem worse
§ Presence of tags in the data results in high I/O due to storage size
§ Special handling of optional fields
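The tag-overhead point above can be made concrete: in a small, made-up FIXML-like fragment, most of the bytes are markup rather than payload. The snippet below is a hypothetical illustration, not the presenters' data or tooling.

```python
# Rough illustration of XML tag overhead: compare the total document
# size with the size of the actual data (attribute values and text).
# The fragment is a made-up FIXML-like snippet, not real order data.
import xml.etree.ElementTree as ET

xml_doc = '<Order ID="123" Acct="ACCT9" TrdDt="2013-06-01"><Qty>100</Qty></Order>'
root = ET.fromstring(xml_doc)

# Collect only the payload: attribute values and element text.
payload = list(root.attrib.values())
for child in root.iter():
    if child.text and child.text.strip():
        payload.append(child.text.strip())
payload_bytes = sum(len(v) for v in payload)

# Everything else is tag/markup overhead that pre-parsing avoids re-reading.
overhead = len(xml_doc) - payload_bytes
print(payload_bytes, len(xml_doc), overhead)
```

Here the markup is more than twice the size of the data it carries, which is why pre-parsing into a compact binary format pays off on repeated scans.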
ETL vs. ELT

§ Hadoop is generally built for EL-T, aka Schema-on-Read
  – Load as-is
  – Transform on access/query
§ Compare with data warehouse ETL, aka Schema-on-Write
  – Transform and load
  – Queries are a lot simpler
  – Transformation and cleansing done a priori
Mix of ETL and ELT

ELT:
§ Generally better in flexibility
§ More suitable for simpler and well-defined formats
§ More applicable for experimentation
§ XML data parsed on demand for every query

ETL:
§ Generally better in performance
§ More suitable when substantial cleansing and reformatting is needed
§ Repetitive queries and production workloads
§ XML data pre-parsed to minimize resource usage
Approaches
[Diagram: XML files are either pre-parsed (ETL) into Avro files via a Pig UDF and an Avro schema, or parsed on demand (ELT). Interfaces for processing the data: Hive SerDe, MapReduce, Pig UDF.]
ELT

[Same diagram, highlighting the on-demand path: XML files parsed at query time.]
ETL

[Same diagram, highlighting the pre-parsing path: XML files converted to Avro files up front.]
XML Pre-parsing

§ Nested elements and attributes
§ Representation of the parsed XML structure
§ Enter Avro!
Avro

§ Data serialization system
§ Specifically designed for Hadoop, but also used in other environments
§ Rich data structures: arrays, records, maps, etc.
§ Compact, fast, binary data format
§ Metadata stored at the file level, not the record level
§ Splittable – ideal for MapReduce
Avro APIs

§ Generic objects and pre-generated objects
  – Easy API including simple gets and puts
§ APIs in several languages
  – Java
  – C#
  – C/C++
  – Python
  – Ruby
Use-case

§ FIXML – Financial Information eXchange
  – http://www.fixprotocol.org/specifications/
§ XML database benchmark
  – http://tpox.sourceforge.net/
§ Provides sample data for benchmarking
§ Data generator for generating large and predictable datasets
FIXML

§ XML data generator
  – http://tpox.sourceforge.net/tpoxdata.htm
§ Order: buy and sell orders of securities
Simple mapping

XML                                        | Avro   | Pig
-------------------------------------------|--------|------
Elements with repeated nested elements     | Array  | Bag
Elements with attributes and text elements | Record | Tuple
Attributes and text elements               | Field  | Field
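The mapping rules above can be sketched in a few lines of stdlib Python: repeated child elements become an array, an element with attributes or children becomes a record, and a plain attribute or text element becomes a scalar field. This helper is a hypothetical illustration, not the presenters' actual converter.

```python
# Sketch of the XML-to-record mapping rules from the table above:
#   repeated nested elements          -> array (list)
#   element with attributes/children  -> record (dict)
#   attribute or leaf text element    -> field (scalar)
# Hypothetical helper, not the deck's actual XML-to-Avro tool.
import xml.etree.ElementTree as ET
from collections import defaultdict

def element_to_record(elem):
    # Leaf element with no attributes or children -> plain field.
    if not elem.attrib and len(elem) == 0:
        return (elem.text or "").strip()
    record = dict(elem.attrib)              # attributes become fields
    groups = defaultdict(list)
    for child in elem:
        groups[child.tag].append(element_to_record(child))
    for tag, values in groups.items():
        # Repeated elements become an array; a single one, a nested field.
        record[tag] = values if len(values) > 1 else values[0]
    return record

doc = ET.fromstring(
    '<Order ID="1"><Pty R="7"/><Pty R="3"/><Qty>100</Qty></Order>'
)
print(element_to_record(doc))
# {'ID': '1', 'Pty': [{'R': '7'}, {'R': '3'}], 'Qty': '100'}
```

The repeated `Pty` elements land in a list (Avro array / Pig bag), while the single `Qty` collapses to a scalar field, matching the table.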
Avro Schema
{ "type": "record", "name": "FIXOrder", "namespace": "com.sabre.fixml", "doc": "Definition and mapping for FIX Orders", "mapping": "/FIXML", "fields": [ { "name":"v", "type":"string", "mapping":"@v"}, { "name":"r", "type":"string", "mapping":"@r"}, { "name":"s", "type":"string", "mapping":"@s"}, { "name":"Order", "mapping":"Order", "type": { "name":"OrderRecord", "mapping":"Order", "type": "record", "fields": [ { "name":"ID", "type":"string", "mapping":"@ID"}, { "name":"ID2", "type":"string", "mapping":"@ID2"}, { "name":"OrignDt", "type":"string", "mapping":"@OrignDt"}, { "name":"TrdDt", "type":"string", "mapping":"@TrdDt"}, { "name":"Acct", "type":"string", "mapping":"@Acct"}, { "name":"AcctTyp", "type":"string", "mapping":"@AcctTyp"}, { "name":"DayBkngInst", "type":"string", "mapping":"@DayBkngInst"}, { "name":"BkngUnit", "type":"string", "mapping":"@BkngUnit"}, { "name":"PreallocMeth", "type":"string", "mapping":"@PreallocMeth"}, { "name":"AllocID", "type":"string", "mapping":"@AllocID"}, { "name":"CshMgn", "type":"string", "mapping":"@CshMgn"}, { "name":"ClrFeeInd", "type":"string", "mapping":"@ClrFeeInd"},
...
confidenBal 17
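The `"mapping"` properties in the schema above are a custom convention tying each Avro field to an XML location (standard Avro readers ignore unknown schema properties). Since an Avro schema is plain JSON, a small hypothetical walker can flatten these into field/path pairs:

```python
# Walk an Avro schema (plain JSON) and collect the custom "mapping"
# properties into (avro_field_path, xml_mapping) pairs. The "mapping"
# key is this deck's convention, not part of the Avro specification.
import json

schema_json = """
{
  "type": "record", "name": "FIXOrder", "mapping": "/FIXML",
  "fields": [
    {"name": "v", "type": "string", "mapping": "@v"},
    {"name": "Order", "mapping": "Order",
     "type": {"type": "record", "name": "OrderRecord",
              "fields": [{"name": "ID", "type": "string", "mapping": "@ID"}]}}
  ]
}
"""

def collect_mappings(schema, prefix=""):
    pairs = []
    if isinstance(schema, dict):
        for field in schema.get("fields", []):
            path = f"{prefix}{field['name']}"
            if "mapping" in field:
                pairs.append((path, field["mapping"]))
            # Recurse into nested record types.
            pairs.extend(collect_mappings(field.get("type"), path + "."))
    return pairs

pairs = collect_mappings(json.loads(schema_json))
print(pairs)
# [('v', '@v'), ('Order', 'Order'), ('Order.ID', '@ID')]
```

A walker like this is one way the XML-to-Avro mapping step could be driven generically from the schema rather than hand-coded per document type.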
Pig Schema

FIXOrder: tuple (
  v: chararray,
  r: chararray,
  s: chararray,
  Order: tuple (
    ID: chararray,
    ID2: chararray,
    OrignDt: chararray,
    TrdDt: chararray,
    Acct: chararray,
    AcctTyp: chararray,
    DayBkngInst: chararray,
    BkngUnit: chararray,
    PreallocMeth: chararray,
    AllocID: chararray,
    CshMgn: chararray,
    ClrFeeInd: chararray,
Avro – Access Methods

§ Direct support for access from Hive (using a SerDe):

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc');

§ Access via Pig – AvroStorage
§ Avro API – Java MapReduce
Test Data

§ Base securities order file: 500,000 records
§ Replicated for volume:
  – 15x – 7.5 million records
  – 30x – 15 million records
  – 45x – 22.5 million records
  – 60x – 30 million records
  – 75x – 37.5 million records
Comparison
File sizes: Orders

§ Base data
  – XML file size as-is: 749,337,916 bytes (750 MB)
  – gzip-compressed: 182,687,654 bytes (183 MB)
§ After Avro conversion
  – Avro + Snappy: 151,647,926 bytes (152 MB)
  – Avro + gzip: 107,898,177 bytes (108 MB)
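From the byte counts above, the relative sizes work out as follows (simple arithmetic over the slide's figures, no new measurements):

```python
# Size ratios relative to the raw XML, using the figures on this slide.
raw_xml = 749_337_916
sizes = {
    "xml+gzip":    182_687_654,
    "avro+snappy": 151_647_926,
    "avro+gzip":   107_898_177,
}
for name, size in sizes.items():
    print(f"{name}: {size / raw_xml:.1%} of raw XML")
# xml+gzip: 24.4%, avro+snappy: 20.2%, avro+gzip: 14.4%
```

So Avro with gzip cuts storage to roughly a seventh of the raw XML, noticeably better than gzipping the XML itself, because the repeated tags are gone before compression.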
Storage Size Comparison
Test Environment

§ 18 nodes
§ Node configuration:
  – 12 cores per node
  – 48 GB memory
  – 36 TB storage (12 disks of 3 TB each)
§ CDH 4.1.2
Sample Query

§ Security orders per account:

order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
    ------- Pig schema goes here -------
);
order_projection = FOREACH order_records
    GENERATE Order.Acct AS Account, Order.OrdQty.Qty AS Quantity;
order_group = GROUP order_projection BY Account;
order_count = FOREACH order_group
    GENERATE group, SUM(order_projection.Quantity);
STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');
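The Pig script above groups orders by account and sums their quantities. Over already-parsed records, the same aggregation in plain Python looks like this (the sample records are made up for illustration):

```python
# Plain-Python equivalent of the Pig GROUP ... / SUM(...) query above,
# run over already-parsed order records. Sample records are made up.
from collections import defaultdict

orders = [
    {"Acct": "A1", "Qty": 100},
    {"Acct": "A2", "Qty": 250},
    {"Acct": "A1", "Qty": 50},
]

qty_per_account = defaultdict(int)
for order in orders:                     # FOREACH ... GENERATE Acct, Qty
    qty_per_account[order["Acct"]] += order["Qty"]   # GROUP ... + SUM(...)

print(dict(qty_per_account))             # {'A1': 150, 'A2': 250}
```

This is the step that has to be repeated for every query, which is why the cost of getting from XML to parsed records dominates the comparison that follows.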
Run Types

§ Pre-parsed approach:
  – XML to Avro materialization: xml-to-avro
    • XML to Avro is run only once on the data
  – Avro to Pig via UDF: avro-to-pig
§ Parse on demand:
  – XML parsing using a Pig UDF: xml-to-pig
Run time in Seconds

[Chart comparing run times for: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
CPU Usage Comparison

[Chart comparing CPU usage for: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
Memory Usage Comparison: Total Memory Used (GB)

[Chart comparing memory usage for the same three runs: XML to Pig, XML to Avro, and Avro to Pig.]
Results

§ Analysis on pre-parsed data compared to raw XML:
  – Runtime reduced by more than 50%
  – Memory and CPU consumption reduced by about 50%
§ The pre-parsing stage takes more resources and time than on-demand parsing
§ Repetitive queries benefit from one-time pre-parsing
Caveats

§ Not all fields were extracted from the XML input (optional elements)
§ Challenge in keeping up with versions/changes of the XML
§ Performance numbers can depend on the type of data and the mapping used
Alternatives

§ Formats other than Avro may be more suitable
§ Record Columnar formats (RCFiles and ORC Files)
§ Trevni: a column file format supporting Avro
§ Parquet: another columnar storage format for Hadoop
Motivation for Columnar Format

§ MapReduce capability
§ Column projections reduce I/O
§ Column compression, due to similarity of data, further reduces I/O
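The column-projection argument can be sketched concretely: storing the same records column-wise lets a query read only the bytes of the columns it needs. The layouts below are toy serializations, not the actual RCFile/ORC/Parquet on-disk formats.

```python
# Toy illustration of column projection: reading one column from a
# column-oriented layout touches far fewer bytes than scanning rows.
# Conceptual sketch only, not a real RCFile/ORC/Parquet layout.
import json

records = [{"Acct": f"A{i}", "Qty": i, "Note": "x" * 50} for i in range(100)]

# Row layout: one serialized record per line (row-oriented file).
row_layout = "\n".join(json.dumps(r) for r in records)

# Column layout: each column stored contiguously (column-oriented file).
col_layout = {k: json.dumps([r[k] for r in records]) for k in records[0]}

bytes_scanned_rows = len(row_layout)          # must read every record
bytes_scanned_cols = len(col_layout["Qty"])   # read only the Qty column
print(bytes_scanned_rows, bytes_scanned_cols)
```

A query touching only `Qty` skips the wide `Note` column entirely in the columnar layout; grouping similar values together also compresses better, which is the second I/O win named above.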
Summary

§ The materialized version is well suited for repeated queries
§ For ad-hoc/experimental queries, parse-on-demand is better
§ Mapping from XML to Avro can be automated
§ Hive, Pig, and MapReduce interfaces to access Avro files
§ Relative trade-offs between flexibility and performance/storage