HADOOP RECORD READER IN PYTHON
HUG: Nov 18 2009
Paul Tarjan
http://paulisageek.com
@ptarjan
http://github.com/ptarjan/hadoop_record
Hey Jute…
Tabs and newlines are good and all. For lots of data, don't do that.
don’t make it bad...
Hadoop has a native data storage format called Hadoop Record or “Jute”
org.apache.hadoop.record
http://en.wikipedia.org/wiki/Jute
take a data structure…
There is a Data Definition Language!

module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    };
}
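For comparison, a record like the Link above maps naturally onto a plain Python class. This is a hypothetical sketch to show the shape of the data; the real rcc-generated bindings (C++/Java) look nothing like this.

```python
# Hypothetical Python equivalent of the Link record in the DDL above.
class Link:
    def __init__(self, url, is_relative, anchor_text):
        self.url = url                  # ustring URL
        self.is_relative = is_relative  # boolean isRelative
        self.anchor_text = anchor_text  # ustring anchorText

link = Link("http://example.com/a", False, "a page")
```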
and make it better…
And a compiler:

$ rcc -l c++ inclrec.jr testrec.jr

namespace inclrec {
    class RI : public hadoop::Record {
    private:
        int32_t I32;
        double D;
        std::string S;
Remember: you can only use C++ or Java

$ rcc --help
Usage: rcc --language [java|c++] ddl-files
then you can start to make it better…
I wanted it in Python. Need 2 parts:
1. Parsing library
2. DDL translator
I only did the first part. If you need the second part, let me know.
Hey Jute don't be afraid…
you were made to go out and get her… http://github.com/ptarjan/hadoop_record
the minute you let her under your skin…
I bet you thought I was done with "Hey Jude" references, eh?
How I built it
Ply == lex and yacc
Parser == 234 lines including tests!
Outputs generic data types
You have to do the class transform yourself
You can use my lex and yacc stuff in your language of choice
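The talk's parser uses PLY (Python lex & yacc). As a stdlib-only illustration of what the lexing stage does, here is a minimal tokenizer sketch over the same DDL using `re`; the token names are illustrative, not the project's actual grammar.

```python
import re

# One alternative per token class: keywords, identifiers, punctuation.
# Keywords come first so they win over the identifier rule.
TOKEN_RE = re.compile(r'\s*(?:(module\b|class\b)|([A-Za-z_]\w*)|([{};]))')

def tokenize(text):
    """Turn a DDL snippet into (token_type, value) pairs."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError("bad character at position %d" % pos)
        keyword, ident, punct = m.groups()
        if keyword:
            tokens.append((keyword.upper(), keyword))
        elif ident:
            tokens.append(('ID', ident))
        else:
            tokens.append((punct, punct))
        pos = m.end()
    return tokens

toks = tokenize('module links { class Link { ustring URL; }; }')
```

A real PLY lexer expresses the same rules as `t_`-prefixed functions and hands the token stream to a yacc-style parser.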
and any time you feel the pain…
Parsing the binary format is hard. Vector vs struct???
struct = "s{" record *("," record) "}"
vector = "v{" [record *("," record)] "}"
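A recursive-descent sketch of the struct/vector grammar above (simplified: leaf records are bare tokens with no escaping, and empty structs are allowed; the real serialized format has more rules):

```python
def parse_record(s, pos=0):
    """Parse one record starting at pos; return (value, next_pos)."""
    if s.startswith('s{', pos):
        return parse_seq(s, pos + 2, tag='struct')
    if s.startswith('v{', pos):
        return parse_seq(s, pos + 2, tag='vector')
    # Leaf record: read up to the next delimiter.
    end = pos
    while end < len(s) and s[end] not in ',}':
        end += 1
    return s[pos:end], end

def parse_seq(s, pos, tag):
    """Parse comma-separated records up to the closing '}'."""
    items = []
    if s[pos] != '}':                  # a vector may be empty
        while True:
            item, pos = parse_record(s, pos)
            items.append(item)
            if s[pos] == ',':
                pos += 1
            else:
                break
    return (tag, items), pos + 1       # skip the '}'

value, _ = parse_record('s{http://a.com,v{1,2,3},hi}')
# value == ('struct', ['http://a.com', ('vector', ['1', '2', '3']), 'hi'])
```

The ambiguity the slide complains about shows up here: without the `s{`/`v{` prefixes you could not tell a struct from a vector while scanning.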
LazyString – don't decode if not needed
99% of my Hadoop time was decoding strings I didn't need
Binary on disk -> CSV -> Python == wasteful
Hadoop unpacks zip files – name it .mod
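The LazyString idea sketched in Python: hold the raw bytes straight off disk and only pay the decode cost on first access. The class name matches the slide; the methods here are illustrative, not the library's real API.

```python
class LazyString:
    def __init__(self, raw):
        self._raw = raw          # undecoded bytes straight off disk
        self._decoded = None

    def get(self):
        if self._decoded is None:            # decode at most once
            self._decoded = self._raw.decode('utf-8')
        return self._decoded

s = LazyString(b'anchor text')
# Nothing has been decoded yet; the decode happens on the first get().
```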
na na na na na
Future work
- DDL converter
- Integrate it officially
- Record writer (should be easy)
- SequenceFileAsOutputFormat
- Integrate your feedback