Big Data, Big Mess?
By Bertrand Dechoux
Hadoop Experience
• first contact with Hadoop in early 2010
• Hadoop consultant and trainer @ Xebia
Agenda
Hadoop MapReduce 101
Java API, Hadoop Streaming
Hive, Pig and Cascading
What about the data?
Hadoop MapReduce 101
One problem, one solution
Objectives:
• distributed computing
• high data volumes

Choices:
• commodity hardware
• local reads
Map and Reduce
[Diagram: input DATA is split across parallel map tasks; their output is regrouped and fed to reduce tasks, which produce the final DATA.]
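In functional terms, the two primitives are usually summarized by their signatures: map transforms each input record independently, and reduce folds together all the values sharing the same key.

map:    (k1, v1)        -> list(k2, v2)
reduce: (k2, list(v2))  -> list(k3, v3)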
What you are given
• primitives
• in Java
• functional
• for distributed batch processing
Java API, Hadoop Streaming
The Java API
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // emits (token, 1) for every token of the input line
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // sums all the 1s emitted for a given word
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        // ship the jar containing these classes to the cluster
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
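Once packaged, the job is submitted from the command line; the jar name and HDFS paths below are illustrative:

hadoop jar wordcount.jar WordCount /data/input /data/output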
Simple Industrialization
• dependencies -> Maven
• tests -> MRUnit + JUnit + Maven (sketch below)
• releases -> Maven + Jenkins + Nexus
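A minimal test sketch, assuming MRUnit 1.x and the Map class from the WordCount example above (the test class name is ours):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapTest {

    @Test
    public void mapEmitsOneCountPerToken() throws Exception {
        // MapDriver runs the mapper in memory, without any cluster
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCount.Map());
        driver.withInput(new LongWritable(0), new Text("big data big mess"))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("data"), new IntWritable(1))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("mess"), new IntWritable(1))
              .runTest();
    }
}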
A classic use case
• log centralization
• how do the operations teams actually use the logs?
Beyond Java: Hadoop Streaming
• reads from stdin, writes to stdout (example below)
• easy integration of legacy code
• simple jobs only
• industrialization is straightforward
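For example, a streaming job can reuse standard Unix tools as mapper and reducer. The jar location varies across Hadoop versions and distributions, and the HDFS paths here are illustrative:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /logs/raw \
    -output /logs/wordcount \
    -mapper /bin/cat \
    -reducer /usr/bin/wc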
Hive, Pig and Cascading
Hive and Pig
Pig (word count in both below):
• Pig Latin
• 'eats anything'
• DAG

Hive:
• HiveQL
• structured
• tree
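As a sketch of the contrast, here is word count in both languages; paths and table names are hypothetical. Pig loads raw lines directly, while Hive first needs the data declared as a table.

Pig Latin:
lines   = LOAD '/data/docs' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO '/data/wordcount';

HiveQL:
CREATE TABLE docs (line STRING);
SELECT word, COUNT(1) AS occurrences
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) tokenized
GROUP BY word;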
Industrialization?
• dependencies -> Maven
• tests -> JUnit + Maven
• releases -> Maven + Jenkins + Nexus
Painful Industrialization
• 1 MapReduce job -> at least 10 seconds
• 1 query -> ??? (a single Hive or Pig query may compile to several chained jobs; at 5 jobs, that is already close to a minute of scheduling overhead alone)
• n queries -> too long for a fast test feedback loop
Cascading
• same principle as Hive and Pig
• a higher-level API on top of MapReduce, in Java (sketch below)
• or in Scala: Scalding
• or in Clojure: Cascalog

• Hadoop is not the only execution platform
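A word count sketch with Cascading, assuming Cascading 2.x package names (input and output paths come from the command line):

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CascadingWordCount {
    public static void main(String[] args) {
        // taps describe where data is read from and written to
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // the pipe assembly: extract words, group them, count each group
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\S+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count());

        // the connector plans the assembly into one or more MapReduce jobs
        Flow flow = new HadoopFlowConnector(new Properties())
                .connect(source, sink, assembly);
        flow.complete();
    }
}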
What about the data?
The files
File types: text, SequenceFile, Avro
• text: maximal interoperability
• SequenceFile: performance, but Hadoop-specific
• Avro: both interoperability and performance
The filesystem: HDFS
• few files
• large files
• optimized for streaming (sequential) reads, as in the sketch below
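A sketch of what sequential reading looks like with the HDFS Java client (the path argument and buffer size are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSequentialRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // HDFS performs best when a big file is read from start to end
        // in large chunks, not accessed randomly in small pieces
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            byte[] buffer = new byte[64 * 1024];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read; // process the chunk here
            }
            System.out.println("bytes read: " + total);
        }
    }
}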
The database: HBase
• a BigTable clone
• essentially a map with sorted keys (sketch below)
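A sketch of the 'sorted map' view with the classic HBase client API of that era (table name, column family, and keys are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SortedMapExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");

        // write: (row key, column family, qualifier) -> value
        Put put = new Put(Bytes.toBytes("user42#2012-05-11"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("action"), Bytes.toBytes("login"));
        table.put(put);

        // read back by key, like a map lookup; rows are stored sorted
        // by key, which makes range scans over adjacent keys cheap
        Result result = table.get(new Get(Bytes.toBytes("user42#2012-05-11")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("action"))));

        table.close();
    }
}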
Data Management
HCatalog:
• inspired by the Hive metastore
• describes the data sets

Avro (schema example below):
• a file carrying its own description
• performant
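To illustrate 'a file carrying its own description': every Avro data file embeds its schema, which is plain JSON. This record definition is a hypothetical example:

{
  "type": "record",
  "name": "LogEvent",
  "namespace": "fr.xebia.logs",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level",     "type": "string"},
    {"name": "message",   "type": "string"}
  ]
}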
Data Management
• management = coordination
• data steward / data custodian
Is all of this important?
Any questions?
Thank you!