
Efficient processing of large and complex XML documents in Hadoop

Description:
Many systems capture XML data in Hadoop for analytical processing. When XML documents are large and have complex nested structures, processing them repeatedly is inefficient: parsing XML is CPU-intensive, and storing XML in its native form wastes space. The problem is compounded in the Big Data space, where millions of such documents must be processed and analyzed within a reasonable time. This talk proposes an efficient method that leverages the Avro storage and communication format, which is flexible, compact and built specifically for modeling complex data structures in Hadoop environments. XML documents are parsed and converted into Avro on load, and the result can then be accessed via Hive using a SQL-like interface, via Java MapReduce, or via Pig. A concrete use case validates this approach, along with variations of the method and their relative trade-offs.
Transcript
Page 1: Efficient processing of large and complex XML documents in Hadoop

Efficient processing of large and complex XML documents in Hadoop

Sujoe Bose, Senior Principal, Sabre Holdings. June 2013

Page 2: Efficient processing of large and complex XML documents in Hadoop

Presentation Outline

§ Motivation
§ ETL vs. ELT
§ Avro Format
§ Mapping from XML to Avro
§ Interfaces to access Avro
§ Performance and storage considerations
§ Other types of storage/processing formats

Page 3: Efficient processing of large and complex XML documents in Hadoop

You will learn about …

§ A method to store and process complex XML data in Hadoop as Avro files
§ Interfaces to access and analyze data in Avro from Hive, Java and Pig
§ Variations of the method and their relative trade-offs in storage and processing

Page 4: Efficient processing of large and complex XML documents in Hadoop

Motivation

§ Prevalence of XML and its derivatives
  – Spurred by Web Services and SOA
  – Preferred communication format until newer formats entered
  – Data and logs represented in XML
§ XML: metadata combined with data
  – Flexibility vs. complexity
§ Could be arbitrarily nested and large
§ Volumes of documents – Big Data

Page 5: Efficient processing of large and complex XML documents in Hadoop

Challenges  

§ Parsing XML is CPU-intensive
§ Certain parsers/parsing methods consume more memory
§ Repeated parsing for each query
§ Large and deeply nested XML makes the problem worse
§ Tags stored alongside the data inflate storage size and hence I/O
§ Special handling of optional fields

Page 6: Efficient processing of large and complex XML documents in Hadoop

ETL vs. ELT

§ Hadoop is generally built for EL-T
  – aka Schema-on-Read
  – Load as-is
  – Transform on access/query
§ Compare with data warehouse ETL
  – aka Schema-on-Write
  – Transform and load
  – Queries are a lot simpler
  – Transformation and cleansing done a priori

Page 7: Efficient processing of large and complex XML documents in Hadoop

Mix of ETL and ELT

ELT:
§ Generally better in flexibility
§ More suitable for simpler and well-defined formats
§ More applicable for experimentation
§ XML data parsed on demand for every query

ETL:
§ Generally better in performance
§ More suitable when substantial cleansing and reformatting is needed
§ Repetitive queries and production workloads
§ XML data pre-parsed to minimize resource usage

Page 8: Efficient processing of large and complex XML documents in Hadoop

Approaches  

[Diagram: two paths over the same data. ETL pre-parsing: XML files are converted to Avro files, guided by an Avro schema, and the Avro files are then accessed through interfaces (Hive SerDe, MapReduce, Pig). On-demand parsing: XML files are parsed at query time through a Pig UDF, Hive SerDe or MapReduce.]

Page 9: Efficient processing of large and complex XML documents in Hadoop

ELT  

[Diagram: the Approaches diagram repeated, with the on-demand parsing (ELT) path emphasized.]

Page 10: Efficient processing of large and complex XML documents in Hadoop

ETL  

[Diagram: the Approaches diagram repeated, with the ETL pre-parsing path emphasized.]

Page 11: Efficient processing of large and complex XML documents in Hadoop

XML Pre-parsing

§ Nested elements and attributes
§ Representation of the parsed XML structure
§ Enter Avro! (sketched below)
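As a concrete illustration of the pre-parsing step, here is a minimal sketch. It is not the talk's actual converter: it assumes a string-only Avro schema whose field names match the XML attribute names, and ignores nested and repeated elements (which would map to Avro records and arrays).

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.w3c.dom.Element;

public class XmlToAvroSketch {
  // Copy each schema field from the matching XML attribute.
  static GenericRecord toRecord(Element order, Schema orderSchema) {
    GenericRecord rec = new GenericData.Record(orderSchema);
    for (Schema.Field f : orderSchema.getFields()) {
      rec.put(f.name(), order.getAttribute(f.name()));
    }
    return rec;
  }

  public static void main(String[] args) throws Exception {
    Element root = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new File(args[0])).getDocumentElement();
    Schema schema = new Schema.Parser().parse(new File(args[1]));
    Element order = (Element) root.getElementsByTagName("Order").item(0);
    System.out.println(toRecord(order, schema)); // prints the record as JSON
  }
}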


Page 12: Efficient processing of large and complex XML documents in Hadoop

Avro  

§ Data serialization system
§ Specifically designed for Hadoop, but also used in other environments
§ Rich data structures: arrays, records, maps, etc.
§ Compact, fast, binary data format
§ Metadata stored at the file level, not per record
§ Splittable – ideal for MapReduce (see the write-side sketch below)


Page 13: Efficient processing of large and complex XML documents in Hadoop

Avro APIs

§ Generic objects and pre-generated objects
  – Easy API with simple gets and puts (illustrated below)
§ APIs in several languages
  – Java
  – C#
  – C/C++
  – Python
  – Ruby


Page 14: Efficient processing of large and complex XML documents in Hadoop

Use-case

§ FIXML – Financial Information eXchange
  – http://www.fixprotocol.org/specifications/
§ XML Database Benchmark
  – http://tpox.sourceforge.net/
§ Provides sample data for benchmarking
§ Data generator for producing large and predictable datasets

Page 15: Efficient processing of large and complex XML documents in Hadoop

FIXML

§ XML data generator
  – http://tpox.sourceforge.net/tpoxdata.htm
§ Order: buy and sell orders of securities

Page 16: Efficient processing of large and complex XML documents in Hadoop

Simple mapping


XML                                        | Avro   | Pig
Elements with repeated nested elements     | Array  | Bag
Elements with attributes and text elements | Record | Tuple
Attributes and text elements               | Field  | Field
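To make the first two rows of the table above concrete, here is a hedged sketch using Avro's SchemaBuilder (available in recent Avro releases). The Alloc element and the field names are hypothetical; FIXML orders carry pre-allocation attributes like PreallocMeth and AllocID in the schema below, but the exact element names may differ.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class MappingSketch {
  // An element with attributes becomes a record; a repeated nested
  // element becomes an array of such records.
  static Schema orderSchema() {
    Schema alloc = SchemaBuilder.record("Alloc").fields()
        .requiredString("Acct")                                 // attribute -> field
        .endRecord();
    return SchemaBuilder.record("Order").fields()
        .requiredString("ID")                                   // attribute -> field
        .name("Alloc").type().array().items(alloc).noDefault()  // repeated element -> array
        .endRecord();
  }
}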

Page 17: Efficient processing of large and complex XML documents in Hadoop

Avro Schema

{ "type": "record", "name": "FIXOrder", "namespace": "com.sabre.fixml", "doc": "Definition and mapping for FIX Orders", "mapping": "/FIXML", "fields": [ { "name":"v", "type":"string", "mapping":"@v"}, { "name":"r", "type":"string", "mapping":"@r"}, { "name":"s", "type":"string", "mapping":"@s"}, { "name":"Order", "mapping":"Order", "type": { "name":"OrderRecord", "mapping":"Order", "type": "record", "fields": [ { "name":"ID", "type":"string", "mapping":"@ID"}, { "name":"ID2", "type":"string", "mapping":"@ID2"}, { "name":"OrignDt", "type":"string", "mapping":"@OrignDt"}, { "name":"TrdDt", "type":"string", "mapping":"@TrdDt"}, { "name":"Acct", "type":"string", "mapping":"@Acct"}, { "name":"AcctTyp", "type":"string", "mapping":"@AcctTyp"}, { "name":"DayBkngInst", "type":"string", "mapping":"@DayBkngInst"}, { "name":"BkngUnit", "type":"string", "mapping":"@BkngUnit"}, { "name":"PreallocMeth", "type":"string", "mapping":"@PreallocMeth"}, { "name":"AllocID", "type":"string", "mapping":"@AllocID"}, { "name":"CshMgn", "type":"string", "mapping":"@CshMgn"}, { "name":"ClrFeeInd", "type":"string", "mapping":"@ClrFeeInd"},

...  


Page 18: Efficient processing of large and complex XML documents in Hadoop

Pig Schema

FIXOrder: tuple (
  v: chararray,
  r: chararray,
  s: chararray,
  Order: tuple (
    ID: chararray,
    ID2: chararray,
    OrignDt: chararray,
    TrdDt: chararray,
    Acct: chararray,
    AcctTyp: chararray,
    DayBkngInst: chararray,
    BkngUnit: chararray,
    PreallocMeth: chararray,
    AllocID: chararray,
    CshMgn: chararray,
    ClrFeeInd: chararray,


Page 19: Efficient processing of large and complex XML documents in Hadoop

Avro – Access Methods

§ Direct support for access from Hive (using the Avro SerDe):

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')

§ Access via Pig: AvroStorage
§ Avro API: Java MapReduce (see the sketch below)
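A minimal sketch of the Java MapReduce path, not the talk's code: it emits one count per account from Avro input, reusing the assumed Order/Acct field names from the earlier sketches.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class OrdersPerAccount {
  public static class OrderMapper
      extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, LongWritable> {
    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable ignore, Context ctx)
        throws java.io.IOException, InterruptedException {
      GenericRecord order = (GenericRecord) key.datum().get("Order");
      ctx.write(new Text(order.get("Acct").toString()), new LongWritable(1L));
    }
  }

  static void configure(Job job, Schema schema) {
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, schema); // records arrive as AvroKey<GenericRecord>
    job.setMapperClass(OrderMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
  }
}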


Page 20: Efficient processing of large and complex XML documents in Hadoop

Test Data

§ Base securities order file: 500,000 records
§ Replicated for volume:
  – 15x – 7.5 million records
  – 30x – 15 million records
  – 45x – 22.5 million records
  – 60x – 30 million records
  – 75x – 37.5 million records

Page 21: Efficient processing of large and complex XML documents in Hadoop

Comparison  

[Diagram: the Approaches diagram repeated, setting up the comparison between the ETL pre-parsing and on-demand parsing paths.]

Page 22: Efficient processing of large and complex XML documents in Hadoop

File sizes: Orders

§ Base data
  – XML file size as-is: 749,337,916 bytes (750 MB)
  – Gzip compressed: 182,687,654 bytes (183 MB)
§ After Avro conversion
  – Avro + Snappy: 151,647,926 bytes (152 MB)
  – Avro + Gzip: 107,898,177 bytes (108 MB)

Page 23: Efficient processing of large and complex XML documents in Hadoop

Storage Size Comparison


Page 24: Efficient processing of large and complex XML documents in Hadoop

Test Environment

§ 18 nodes
§ Node configuration:
  – 12 cores per node
  – 48 GB memory
  – 36 TB with 12 disks of 3 TB each
§ CDH 4.1.2

Page 25: Efficient processing of large and complex XML documents in Hadoop

Sample Query

§ Security orders per account

order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
    ------- Pig schema goes here ----------
);

order_projection = FOREACH order_records
    GENERATE Order.Acct AS Account, Order.OrdQty.Qty AS Quantity;

order_group = GROUP order_projection BY Account;

order_count = FOREACH order_group
    GENERATE group, SUM(order_projection.Quantity);

STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');

Page 26: Efficient processing of large and complex XML documents in Hadoop

Run Types

§ Pre-parsed approach:
  – XML to Avro materialization: xml-to-avro
    • XML to Avro is run only once on the data
  – Avro to Pig via UDF: avro-to-pig
§ Parse on demand:
  – XML parsing using Pig UDF: xml-to-pig

Page 27: Efficient processing of large and complex XML documents in Hadoop

Run time in seconds

[Chart: run times for analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]

Page 28: Efficient processing of large and complex XML documents in Hadoop

CPU Usage Comparison

[Chart: CPU usage for analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]

Page 29: Efficient processing of large and complex XML documents in Hadoop

Memory Usage Comparison: Total Memory Used (GB)

[Chart: memory usage for analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]

Page 30: Efficient processing of large and complex XML documents in Hadoop

Results  

§ Analysis on pre-parsed data compared with raw XML:
  – Runtime reduced by more than 50%
  – Memory and CPU consumption reduced by about 50%
§ The pre-parsing stage takes more resources and time than on-demand parsing
§ Repetitive queries benefit from one-time pre-parsing

Page 31: Efficient processing of large and complex XML documents in Hadoop

Caveats  

§ Not all fields were extracted from the XML input (optional elements)
§ Challenge in keeping up with versions of, and changes to, the XML
§ Performance numbers can depend on the type of data and the mapping used

Page 32: Efficient processing of large and complex XML documents in Hadoop

Alternatives

§ Formats other than Avro may be more suitable
§ Record Columnar formats (RCFile and ORC File)
§ Trevni: a columnar file format supporting Avro
§ Parquet: another columnar storage format for Hadoop

Page 33: Efficient processing of large and complex XML documents in Hadoop

Motivation for Columnar Format

§ MapReduce capability
§ Column projections reduce I/O
§ Column compression, due to the similarity of data within a column, further reduces I/O

Page 34: Efficient processing of large and complex XML documents in Hadoop

Summary  

§ The materialized version is well suited for repeated queries
§ For ad-hoc/experimental queries, parse-on-demand is better
§ Mapping from XML to Avro can be automated
§ Hive, Pig and MapReduce interfaces are available to access Avro files
§ Relative trade-offs exist between flexibility and performance/storage

Page 35: Efficient processing of large and complex XML documents in Hadoop

Questions & Comments


Thanks for listening. [email protected]

