HADOOP RECORD READER IN PYTHON
HUG: Nov 18 2009
Paul Tarjan
http://paulisageek.com
@ptarjan
http://github.com/ptarjan/hadoop_record
Hey Jute…
Tabs and newlines are good and all. For lots of data, don't do that.
don’t make it bad...
Hadoop has a native data storage format called Hadoop Record or “Jute”
org.apache.hadoop.record
http://en.wikipedia.org/wiki/Jute
take a data structure…
There is a Data Definition Language!

module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    };
}
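For comparison, a record like the Link above maps naturally onto a plain Python class. This is a hypothetical sketch to show the shape of the data; the real rcc-generated bindings (C++/Java) look nothing like this.

```python
# Hypothetical Python equivalent of the Link record in the DDL above.
class Link:
    def __init__(self, url, is_relative, anchor_text):
        self.url = url                  # ustring URL
        self.is_relative = is_relative  # boolean isRelative
        self.anchor_text = anchor_text  # ustring anchorText

link = Link("http://example.com/a", False, "a page")
```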
and make it better…
And a compiler:

$ rcc -l c++ inclrec.jr testrec.jr

namespace inclrec {
    class RI : public hadoop::Record {
    private:
        int32_t I32;
        double D;
        std::string S;
Remember: you can only use C++ or Java

$ rcc --help
Usage: rcc --language [java|c++] ddl-files
then you can start to make it better…
I wanted it in Python. Need 2 parts:
1. Parsing library
2. DDL translator
I only did the first part. If you need the second part, let me know.
Hey Jute don't be afraid…
you were made to go out and get her… http://github.com/ptarjan/hadoop_record
the minute you let her under your skin…
I bet you thought I was done with "Hey Jude" references, eh?
How I built it
Ply == lex and yacc
Parser == 234 lines including tests!
Outputs generic data types
You have to do the class transform yourself
You can use my lex and yacc stuff in your language of choice
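The talk's parser uses PLY (Python lex & yacc). As a stdlib-only illustration of what the lexing stage does, here is a minimal tokenizer sketch over the same DDL using `re`; the token names are illustrative, not the project's actual grammar.

```python
import re

# One alternative per token class: keywords, identifiers, punctuation.
# Keywords come first so they win over the identifier rule.
TOKEN_RE = re.compile(r'\s*(?:(module\b|class\b)|([A-Za-z_]\w*)|([{};]))')

def tokenize(text):
    """Turn a DDL snippet into (token_type, value) pairs."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError("bad character at position %d" % pos)
        keyword, ident, punct = m.groups()
        if keyword:
            tokens.append((keyword.upper(), keyword))
        elif ident:
            tokens.append(('ID', ident))
        else:
            tokens.append((punct, punct))
        pos = m.end()
    return tokens

toks = tokenize('module links { class Link { ustring URL; }; }')
```

A real PLY lexer expresses the same rules as `t_`-prefixed functions and hands the token stream to a yacc-style parser.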
and any time you feel the pain…
Parsing the binary format is hard. Vector vs struct???
struct = "s{" record *("," record) "}"
vector = "v{" [record *("," record)] "}"
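A recursive-descent sketch of the struct/vector grammar above (simplified: leaf records are bare tokens with no escaping, and empty structs are allowed; the real serialized format has more rules):

```python
def parse_record(s, pos=0):
    """Parse one record starting at pos; return (value, next_pos)."""
    if s.startswith('s{', pos):
        return parse_seq(s, pos + 2, tag='struct')
    if s.startswith('v{', pos):
        return parse_seq(s, pos + 2, tag='vector')
    # Leaf record: read up to the next delimiter.
    end = pos
    while end < len(s) and s[end] not in ',}':
        end += 1
    return s[pos:end], end

def parse_seq(s, pos, tag):
    """Parse comma-separated records up to the closing '}'."""
    items = []
    if s[pos] != '}':                  # a vector may be empty
        while True:
            item, pos = parse_record(s, pos)
            items.append(item)
            if s[pos] == ',':
                pos += 1
            else:
                break
    return (tag, items), pos + 1       # skip the '}'

value, _ = parse_record('s{http://a.com,v{1,2,3},hi}')
# value == ('struct', ['http://a.com', ('vector', ['1', '2', '3']), 'hi'])
```

The ambiguity the slide complains about shows up here: without the `s{`/`v{` prefixes you could not tell a struct from a vector while scanning.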
LazyString – don't decode if not needed
99% of my Hadoop time was decoding strings I didn't need
Binary on disk -> CSV -> Python == wasteful
Hadoop unpacks zip files – name it .mod
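The LazyString idea sketched in Python: hold the raw bytes straight off disk and only pay the decode cost on first access. The class name matches the slide; the methods here are illustrative, not the library's real API.

```python
class LazyString:
    def __init__(self, raw):
        self._raw = raw          # undecoded bytes straight off disk
        self._decoded = None

    def get(self):
        if self._decoded is None:            # decode at most once
            self._decoded = self._raw.decode('utf-8')
        return self._decoded

s = LazyString(b'anchor text')
# Nothing has been decoded yet; the decode happens on the first get().
```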
na na na na na
Future work
- DDL converter
- Integrate it officially
- Record writer (should be easy)
- SequenceFileAsOutputFormat
- Integrate your feedback