MongoDB and Hadoop
Luke Lovett
Software Engineer, MongoDB
Agenda
• Complementary Approaches to Data
• MongoDB & Hadoop Use Cases
• MongoDB Connector Overview and Features
• Demo
Complementary Approaches to Data
Operational: MongoDB
[Diagram: use cases arranged on an operational-to-analytical spectrum — Real-Time Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis]
MongoDB
• Fast, frequent reads and writes
• Easy administration
• Built-in analytical tools
– Aggregation framework
– JavaScript MapReduce
– Geo/text indexes
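As a taste of the built-in aggregation framework, here is a minimal sketch using the MongoDB Java driver (database, collection, and field names are hypothetical):

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.eq;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;

public class AggregationDemo {
    public static void main(String[] args) {
        // Hypothetical "shop" database with an "orders" collection.
        MongoCollection<Document> orders = MongoClients
            .create("mongodb://localhost:27017")
            .getDatabase("shop").getCollection("orders");

        // Filter and group entirely inside the database:
        // total order amounts per customer, completed orders only.
        orders.aggregate(Arrays.asList(
            match(eq("status", "complete")),
            group("$customerId", sum("total", "$amount"))
        )).forEach(doc -> System.out.println(doc.toJson()));
    }
}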
Analytical: Hadoop
[Same operational-to-analytical use-case diagram as above]
Hadoop
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics
Operational vs. Analytical: Lifecycle
[Same operational-to-analytical use-case diagram as above]
MongoDB & Hadoop Use Cases
Batch Aggregation
[Diagram: applications powered by MongoDB; analysis powered by Hadoop; the two linked by the MongoDB Connector for Hadoop]
● Need more than MongoDB aggregation
● Need offline processing
● Results sent back to MongoDB
● Can be left as BSON on HDFS for further analysis
Commerce
[Diagram: commerce applications powered by MongoDB; analysis powered by Hadoop; linked by the MongoDB Connector for Hadoop]
Powered by MongoDB:
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Powered by Hadoop:
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
Fraud Detection
[Diagram: online payments processing writes payment data to MongoDB; a nightly analysis job moves it through the MongoDB Connector for Hadoop into fraud modeling, alongside 3rd-party data sources; results are written back to a MongoDB results cache, which the fraud detection service accesses query-only]
MongoDB Connector for Hadoop
Connector Overview
[Diagram: MongoDB (single node, replica set, or sharded cluster) and BSON/text files on HDFS or S3 feed through the Hadoop Connector into MapReduce, Hive, Pig, and Spark, running on Apache Hadoop, Cloudera CDH, Hortonworks HDP, or Amazon EMR]
Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS:
• Dynamic queries see the most recent data, but put load on the operational database
• Snapshots move that load to Hadoop; taking them adds only predictable load to MongoDB
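A minimal sketch of the two input configurations, using mongo-hadoop's standard property names (URIs, paths, and the query filter are illustrative):

import org.apache.hadoop.conf.Configuration;

public class DataMovementOptions {
    public static void main(String[] args) {
        // Option 1: dynamic queries against the live cluster —
        // always-fresh data, but the load lands on the operational database.
        Configuration live = new Configuration();
        live.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        live.set("mongo.input.query", "{\"status\": \"complete\"}"); // optional filter

        // Option 2: read a BSON snapshot from HDFS —
        // processing load moves to Hadoop; only the scheduled snapshot
        // itself touches MongoDB, at a predictable time.
        Configuration snapshot = new Configuration();
        snapshot.set("mapred.input.dir", "hdfs:///snapshots/collection1.bson");
    }
}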
Connector Operation
1. Split according to the given InputFormat
- many options available for reading from a live cluster
- configure key pattern, split strategy
2. Write splits file
3. Output to BSON files or live MongoDB
- BSON file splits written automatically for future tasks
- MongoDB insertions round-robin across collections
Getting Splits
• Split on a sharded cluster
– Split by chunk
– Split by shard
• Splits on replica set/standalone
– splitVector command
• BSON files
– specify max docs
– split per input file
[Diagram: sharded cluster — a mongos router, config servers, and three shards, each holding chunks]
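The splitVector command mentioned above can also be invoked by hand; a hedged sketch from the Java driver (namespace and chunk size are illustrative, and which database accepts the command can vary by server version):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class SplitVectorDemo {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients
            .create("mongodb://localhost:27017")
            .getDatabase("db1");

        // Ask the server for split points that carve db1.collection1
        // into ~8 MB ranges on _id — the same primitive the connector
        // uses to compute splits on a replica set or standalone node.
        Document result = db.runCommand(
            new Document("splitVector", "db1.collection1")
                .append("keyPattern", new Document("_id", 1))
                .append("maxChunkSize", 8));
        System.out.println(result.toJson());
    }
}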
MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
MapReduce Configuration
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
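Tying the two configurations together, a minimal sketch of a map-only job that reads a live collection and archives it as BSON on HDFS (class name, URI, and paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.bson.BSONObject;
import com.mongodb.hadoop.BSONFileOutputFormat;
import com.mongodb.hadoop.MongoInputFormat;

public class ExportJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read documents from a live MongoDB collection...
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");

        Job job = Job.getInstance(conf, "mongo-to-bson-export");
        job.setJarByClass(ExportJob.class);
        job.setInputFormatClass(MongoInputFormat.class);
        // ...pass them straight through (identity map, no reduce)...
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Object.class);
        job.setOutputValueClass(BSONObject.class);
        // ...and write them out as BSON files on HDFS.
        job.setOutputFormatClass(BSONFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}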
Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using the SparkContext Hadoop file API
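A minimal sketch of that pattern in Java, assuming the mongo-hadoop core jar is on the classpath (URIs and the app name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class SparkDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local").setAppName("mongo-spark-demo"));

        // Configuration carrying the input data URI.
        Configuration inputConf = new Configuration();
        inputConf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");

        // Load: each record is an (_id, document) pair.
        JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
            inputConf, MongoInputFormat.class, Object.class, BSONObject.class);

        // ... transformations on `docs` go here ...

        // Save back to MongoDB; MongoOutputFormat ignores the file path,
        // so a placeholder is passed.
        Configuration outputConf = new Configuration();
        outputConf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");
        docs.saveAsNewAPIHadoopFile("file:///placeholder", Object.class,
            BSONObject.class, MongoOutputFormat.class, outputConf);
    }
}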
Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONSerDe
Hive Support
MongoDB → Hive
• Primitive type (int, String, etc.) → Primitive type (int, float, etc.)
• Document → Row
• Sub-document → Struct, Map, or exploded field
• Array → Array or exploded field
● Types given by schema
● May use structs to project fields out of documents and ease access
● Can explode nested fields to make them top-level: {"customer": {"name": "Bart"}} can be accessed as "customer.name"
Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage
Pig Mappings
MongoDB → Pig
• Primitive type (int, String, etc.) → Primitive type (int, chararray, etc.)
• Document → Tuple (schema given)
• Document → Tuple containing a Map (no schema)
• Sub-document → Map
• Array → Bag
● Organize and prune documents by specifying a schema
● Access full document in a Map without needing a schema
Demo!
Questions?