MongoDB – Hadoop Integration
Massimo Brignoli
Senior Solutions Architect, MongoDB Inc.
#massimobrignoli
We will Cover…
• A quick briefing on what MongoDB and Hadoop are
• The Mongo-Hadoop connector:
  – What it is
  – How it works
  – A tour of what it can do
MongoDB and Hadoop Overview
MongoDB
• document-oriented database with dynamic schema
• stores data in JSON-like documents:
  { _id: "mike",
    age: 21,
    location: { state: "NY", zip: "11222" },
    favorite_colors: ["red", "green"] }
MongoDB
• Scales horizontally
• With sharding to handle lots of data and load
Hadoop
• Java-based framework for Map/Reduce
• Excels at batch processing on large data sets by taking advantage of parallelism
Mongo-Hadoop Connector - Why
• Lots of people using Hadoop and Mongo separately, but need integration
• Need to process data across multiple sources
• Custom code, or slow and hacky import/export scripts, is often used to get data in and out
• Scalability and flexibility with changes in Hadoop or MongoDB configurations
[Diagram: Input data (MongoDB collection -or- .BSON file) → Hadoop Cluster → Results (MongoDB collection -or- .BSON file)]
Mongo-Hadoop Connector
• Turn MongoDB into a Hadoop-enabled filesystem: use as the input or output for Hadoop
• New Feature: As of v1.1, also works with MongoDB backup files (.bson)
Benefits and Features
• Takes advantage of full multi-core parallelism to process data in Mongo
• Full integration with Hadoop and JVM ecosystems
• Can be used with Amazon Elastic MapReduce
• Can read and write backup files from local filesystem, HDFS, or S3
Benefits and Features
• Vanilla Java MapReduce
• or, if you don’t want to use Java, support for Hadoop Streaming
• lets you write MapReduce code in other languages (e.g. Python, Ruby, JavaScript)
Benefits and Features
• Support for Pig
  – high-level scripting language for data analysis and building map/reduce workflows
• Support for Hive
  – SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems
How It Works
• Adapter examines the MongoDB input collection and calculates a set of splits from the data
• Each split gets assigned to a node in the Hadoop cluster
• In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally
• Hadoop merges results and streams output back to MongoDB or BSON
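As a rough sketch of how the pieces above fit together in code, a minimal job driver might look like the following. It assumes Hadoop 2.x and reuses the property names shown later on the "Read from MongoDB" / "Write Output to MongoDB" slides; EnronMapper, EnronReducer, and MailPair are hypothetical placeholders for the classes in Example 1, not code from this deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class EnronJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // same property names as on the config slides later in this deck
        conf.set("mongo.input.uri", "mongodb://my-db:27017/enron.messages");
        conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

        Job job = Job.getInstance(conf, "enron-sender-recipient-counts");
        job.setJarByClass(EnronJob.class);
        job.setInputFormatClass(MongoInputFormat.class);   // splits are computed from the input collection
        job.setOutputFormatClass(MongoOutputFormat.class); // results are streamed back into MongoDB
        job.setMapperClass(EnronMapper.class);              // hypothetical: the Mapper from Example 1
        job.setReducerClass(EnronReducer.class);            // hypothetical: the Reducer from Example 1
        job.setMapOutputKeyClass(MailPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}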
Tour of Mongo-Hadoop, by Example
• Using Java MapReduce with Mongo-Hadoop
• Using Hadoop Streaming
• Pig and Hive with Mongo-Hadoop
• Elastic MapReduce + BSON
{
"_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
"body" : "Here is our forecast\n\n ",
"filename" : "1.",
"headers" : {
"From" : "[email protected]",
"Subject" : "Forecast Info",
"X-bcc" : "",
"To" : "[email protected]",
"X-Origin" : "Allen-P",
"X-From" : "Phillip K Allen",
"Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
"X-To" : "Tim Belden ",
"Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
"Content-Type" : "text/plain; charset=us-ascii",
"Mime-Version" : "1.0"
}
}
Input Data: Enron e-mail corpus (501k records, 1.75 GB). The "From" header is the message sender; the "To" header lists the recipients.
The Problem
Let’s use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair
The Output Required
[Diagram: graph of senders → recipients (alice, bob, charlie, eve) with the number of messages exchanged on each edge, matching the documents below]
{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 14}{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 9}{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 99}{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 48}{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 20}
Example 1 - Java MapReduce
Map Phase
Map phase - each input doc gets passed through a Mapper function
@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    // each MongoDB document is passed into Hadoop MapReduce as a BSONObject
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            // emit a (sender, recipient) pair with a count of 1
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
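The MailPair key used above is a custom composite key whose implementation isn't shown in this deck. A minimal sketch of what such a WritableComparable might look like (an assumption, not the deck's actual class):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical (from, to) key that Hadoop can serialize, compare, and group on during the shuffle.
public class MailPair implements WritableComparable<MailPair> {
    String from;
    String to;

    public MailPair() {}  // no-arg constructor required by Hadoop for deserialization
    public MailPair(String from, String to) { this.from = from; this.to = to; }

    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }

    public int compareTo(MailPair other) {
        int cmp = from.compareTo(other.from);
        return cmp != 0 ? cmp : to.compareTo(other.to);
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof MailPair)) return false;
        MailPair p = (MailPair) o;
        return from.equals(p.from) && to.equals(p.to);
    }

    @Override public int hashCode() { return from.hashCode() * 31 + to.hashCode(); }
}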
Reduce Phase
Reduce phase - outputs of Map are grouped together by key and passed to Reducer
public void reduce(final MailPair pKey,                 // the {to, from} key
                   final Iterable<IntWritable> pValues, // list of all the values collected under the key
                   final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = BasicDBObjectBuilder.start()
        .add("f", pKey.from)
        .add("t", pKey.to)
        .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    // output is written back to MongoDB (or BSON files)
    pContext.write(pkeyOut, new IntWritable(sum));
}
Read from MongoDB
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages
Read from BSON
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir= file:///tmp/messages.bson
hdfs:///tmp/messages.bson
s3:///tmp/messages.bson
Write Output to MongoDB
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out
Write Output to BSON
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir= file:///tmp/results.bson
hdfs:///tmp/results.bson
s3:///tmp/results.bson
Results
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : [email protected], "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : [email protected], "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : [email protected], "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : [email protected], "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : [email protected], "f" : "[email protected]" }, "count" : 4 }
{ "_id" : { "t" : [email protected], "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : [email protected], "f" : "[email protected]" }, "count" : 1 }
...
has more
Example 2 - Hadoop Streaming
• Let’s do the same Enron Map/Reduce job with Python instead of Java
$ pip install pymongo_hadoop
Example 2 - Hadoop Streaming
• Hadoop passes data to an external process via STDOUT/STDIN
[Diagram: Hadoop (JVM) exchanges map(k, v) key/value pairs with an external Python / Ruby / JS interpreter over STDIN and STDOUT]
import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Mapper in Python
import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
Reducer in Python
Example 3 - Mongo-Hadoop with Pig and Hive
Mongo-Hadoop and Pig
• Let’s do the same thing yet again, but this time using Pig
• Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts
• Can perform JOIN, GROUP, and execute user-defined functions (UDFs)
Loading/Writing Data
Pig directives for loading data: BSONLoader and MongoLoader
data = LOAD 'mongodb://localhost:27017/db.collection'
       USING com.mongodb.hadoop.pig.MongoLoader;
Writing data out: BSONStorage and MongoInsertStorage
STORE records INTO 'file:///output.bson'
      USING com.mongodb.hadoop.pig.BSONStorage;
Datatype Conversion
• Pig has its own special datatypes:– Bags– Maps– Tuples
• Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes
raw = LOAD 'hdfs:///messages.bson'
      USING com.mongodb.hadoop.pig.BSONLoader('', 'headers:[]');
send_recip = FOREACH raw GENERATE $0#'From' AS from, $0#'To' AS to;
send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE from AS from, TRIM(FLATTEN(TOKENIZE(to))) AS to;
send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) AS count;
STORE send_recip_counted INTO 'file:///enron_results.bson'
      USING com.mongodb.hadoop.pig.BSONStorage;
Code in Pig
Mongo-Hadoop and Hive
Similar idea to Pig - process your data without needing to write Map/Reduce code from scratch
...but with SQL as the language of choice
db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
Sample Data: db.users
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES ( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users" );
Declare Collection in Hive
SELECT name,age FROM mongo_users WHERE id > 100 ;
SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;
SELECT * FROM mongo_users T1
JOIN user_emails T2
ON (T1.id = T2.id);
Run SQL on it
DROP TABLE old_users;
INSERT OVERWRITE TABLE old_users SELECT id,name,age FROM mongo_users WHERE age > 100 ;
Write The Results…
Example 4: Amazon MapReduce
Example 4
• Usage with Amazon Elastic MapReduce
• Run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster.
Bootstrap
• First, make a "bootstrap" script that fetches dependencies (the mongo-hadoop jar and Java drivers):
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar
Bootstrap
Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it.
s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Launch the Job!
• ...then launch the job from the command line, pointing to your S3 locations
$ elastic-mapreduce --create --jobflow ENRON000 \
    --instance-type m1.xlarge \
    --num-instances 5 \
    --bootstrap-action s3://$S3_BUCKET/bootstrap.sh \
    --log-uri s3://$S3_BUCKET/enron_logs \
    --jar s3://$S3_BUCKET/enron-example.jar \
    --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat \
    --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson \
    --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT \
    --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    # (any additional parameters here)
So why Amazon?
• Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster
• Turn up the “num-instances” to make jobs complete faster
• Logs get captured into S3 files
• (Pig, Hive, and streaming work on EMR, too!)
Example 5: MongoUpdateWritable
Example 5: New Feature
• In previous examples, we wrote job output data by inserting into a new collection
• ... but we can also modify an existing output collection
• Works by applying MongoDB update modifiers: $push, $pull, $addToSet, $inc, $set, etc.
• Can be used to do incremental Map/Reduce or “join” two collections
Sample of Data
Let’s say we have two collections.
{"_id": ObjectId("51b792d381c3e67b0a18d0ed"), "name": "730LsRkX","type": "pressure","owner": "steve”}{"_id": ObjectId("51b792d381c3e67b0a18d678"), "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),"value": 3328.5895416489802,"timestamp”: ISODate("2013-05-18T13:11:38.709-0400"), "loc": [-175.13,51.658]}
sensors
log events
Sample of Data
For each owner and sensor type, we want to count how many log events were recorded by that owner's sensors of that type.
Bob’s sensors for temperature have stored 1300 readings
Bob’s sensors for pressure have stored 400 readings
Alice’s sensors for humidity have stored 600 readings
Alice’s sensors for temperature have stored 700 readings
etc...
Stage 1 - Map/Reduce on sensors collection
[Diagram: Sensors (MongoDB collection) → map/reduce → Results (MongoDB collection); the log events collection is not used in this stage]
• map: for each sensor, emit {key: owner+type, value: _id}
• reduce: group the data from map() under each key and output {key: owner+type, val: [list of _ids]}
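As a hedged illustration only (the deck doesn't show the stage-1 code), a minimal Java sketch of the mapper and reducer just described might look like this; class names and the exact key/value types are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import org.bson.types.BasicBSONList;
import org.bson.types.ObjectId;
import com.mongodb.hadoop.io.BSONWritable;

// Hypothetical stage-1 job: group sensor _ids by "owner type".
public class SensorsByOwnerType {

    public static class SensorMapper extends Mapper<Object, BSONObject, Text, Text> {
        @Override
        public void map(Object key, BSONObject val, Context ctx)
                throws IOException, InterruptedException {
            // emit {key: owner+type, value: _id} for each sensor document
            String owner = (String) val.get("owner");
            String type  = (String) val.get("type");
            ctx.write(new Text(owner + " " + type), new Text(val.get("_id").toString()));
        }
    }

    public static class SensorReducer extends Reducer<Text, Text, Text, BSONWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> ids, Context ctx)
                throws IOException, InterruptedException {
            // collect all sensor _ids that share this owner/type into one array
            BasicBSONList sensors = new BasicBSONList();
            for (Text id : ids) {
                sensors.add(new ObjectId(id.toString()));
            }
            BSONObject out = new BasicBSONObject("sensors", sensors);
            ctx.write(key, new BSONWritable(out)); // key becomes the output doc's _id, e.g. "alice pressure"
        }
    }
}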
Stage 1 - Results
After stage one, the output docs look like:
{
  "_id": "alice pressure",
  "sensors": [ ObjectId("51b792d381c3e67b0a18d475"),
               ObjectId("51b792d381c3e67b0a18d16d"),
               ObjectId("51b792d381c3e67b0a18d2bf"), ... ]
}
"_id" is the sensor's owner and type; "sensors" is the list of _ids of all sensors with that owner and type.
Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
Stage 2 - Map/Reduce on events collection
[Diagram: Log events (MongoDB collection) → map/reduce → update() on existing records in the Results collection in MongoDB]
• map: for each log event, emit {key: sensor_id, value: 1}
• reduce: group the data from map() under each key; for each value under that key, run update({sensors: key}, {$inc: {logs_count: 1}})
context.write(null, new MongoUpdateWritable(
    query,    // which documents to modify
    update,   // how to modify them ($inc)
    true,     // upsert
    false));  // multi
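The query and update documents passed to MongoUpdateWritable aren't shown on the slide; a minimal sketch of how they might be built with the MongoDB Java driver, matching the $inc update in the stage-2 description (sensorId and sum are assumed local variables in the reducer):

import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

// match stage-1 result docs whose "sensors" array contains this sensor's _id ...
DBObject query  = new BasicDBObject("sensors", sensorId);
// ... and increment their logs_count by the number of log events counted for that sensor
DBObject update = new BasicDBObject("$inc", new BasicDBObject("logs_count", sum));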
Stage 2 - Results
Result after stage 2
{"_id": "1UoTcvnCTz temp", "sensors": [ ObjectId("51b792d381c3e67b0a18d475"), ObjectId("51b792d381c3e67b0a18d16d"), ObjectId("51b792d381c3e67b0a18d2bf"), ...],"logs_count": 1050616 }
now populated with correct count
Conclusions
Recap
• Mongo-Hadoop - use Hadoop to do massive computations on big data sets stored in Mongo/BSON
• MongoDB becomes a Hadoop-enabled filesystem
• Tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc.
Examples
Examples can be found on github:
https://github.com/mongodb/mongo-hadoop/tree/master/examples