
Apache Sqoop

Connecting Couchbase with Hadoop

Arvind Prabhakar, Cloudera Inc. July 29, 2011

Agenda

• Background and Motivation

• Design of Sqoop

• Couchbase Plugin

• Demo

Apache Hadoop

• A framework for Data Intensive and Distributed Applications.

• Inspired by Google’s MapReduce and Google File System Papers

Diagram: Hadoop architecture, with HDFS for data storage (Name Node, Data Nodes 1-3) and MapReduce (Job Tracker, Task Trackers 1-3).

Hadoop:

• Data Archival

• Open Data Formats

• Healthy Ecosystem

Data storage is costly. Deleting data may be costlier!

Data Analysis

• Structured Data Stores

• Semi-Structured Data Stores

• Ad-hoc structured Data

• Unstructured Data

Introducing Sqoop

• Easily Import Data into Hadoop

• Generate Datatypes for use in MapReduce Applications

• Integrate with Hive and HBase

• Easily Export Data from Hadoop

Sqoop Motivation

Without Sqoop

• Requires direct access to data from within Hadoop

• Loss of efficiency due to network overhead

• Impedance mismatch: MapReduce requires fast data access.

• Can overwhelm external systems

Using Sqoop

• Data Locality

• Efficient operation for

• Integration with Hadoop based systems – Hive, HBase

• Optimized transfer speeds based on native tools

Key Features

• Command Line Interface

– Scriptable

• Integrates with Hadoop Ecosystem

– Hive, HBase, Oozie

• Automatic code generation

– Use your data in MapReduce workflows

• Connector based architecture

– Support for connector specific optimizations
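
As a rough illustration of the scriptable CLI and Hive integration, a single command can pull a table into Hadoop and register it in Hive (the JDBC URL and table name below are hypothetical):

$ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --hive-import

The same invocation can be placed in a shell script or an Oozie workflow action.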

Design Overview

Diagram: Sqoop (1) performs a Metadata Lookup against the datastore, (2) generates the Sqoop Record code, and (3) submits a MapReduce job whose map tasks read from the datastore and write to HDFS.

Design Overview

Map-Only Implementation

• InputFormat:

– Selects Input Source

– Defines Splits

– Creates Record Readers

• OutputFormat:

– Selects Destination

– Creates Record Writers

Diagram: the InputFormat produces Splits and RecordReaders feeding parallel Map tasks, whose output is written through the OutputFormat.
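
As a sketch of how the splits translate into parallelism, the number of map tasks and the split column can be chosen on the command line (connection string and column name are hypothetical):

$ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --split-by order_id --num-mappers 8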

Metadata Management

• Sqoop Record

– Dynamically generated

– Independently packaged

• May be used without Sqoop

– Maintains type mapping

– Different Serial Formats

• Text

• Binary

• Avro Data File
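
A minimal sketch of generating the record class independently with Sqoop's codegen tool (JDBC URL, table, and output directories are hypothetical):

$ sqoop codegen --connect jdbc:mysql://db.example.com/sales --table orders --outdir ./generated-src --bindir ./generated-classes

The generated class and jar can then be packaged with MapReduce jobs without running a full import.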

Import Operation

• Generate SqoopRecord

– Or use provided SqoopRecord

• Create Input Splits

• Spin Mappers to consume splits

• Direct output to HDFS or HBase

– Control compression and file type based on user input

• Populate Hive Metastore
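
A minimal sketch of controlling the output file type and compression, and of populating the Hive metastore (the source database and tables are hypothetical):

$ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --as-sequencefile --compress

$ sqoop import --connect jdbc:mysql://db.example.com/sales --table customers --hive-import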

Export Operation

• Generate SqoopRecord

– Or use provided SqoopRecord

• Spin Mappers to consume input files

• Each Mapper writes straight to external store

– Optionally stage data before final export
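
A minimal sketch of an export that stages data before the final write (target table, staging table, and HDFS directory are hypothetical):

$ sqoop export --connect jdbc:mysql://db.example.com/sales --table order_totals --export-dir /user/analyst/order_totals --staging-table order_totals_stage --clear-staging-table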

Typical Workflow

• Data imported from external systems

– Periodic / Incremental imports for new data

• Hadoop Analytics Processing

– Hive / HBase tables

– MapReduce Processing

• Processed Data exported to external systems

– Periodic / Incremental exports for new data

• Workflow automation using Oozie
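
As a sketch of a periodic incremental import, Sqoop can be asked to pick up only rows added since the last run (check column and last value are hypothetical):

$ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --incremental append --check-column order_id --last-value 1000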

Connectors

• Drop-in Sqoop Extension

• Specializes in connectivity with a particular system

• Provides optimal data transfer mechanism

• Based on Connector Mechanism of Sqoop

– Varying degree of control

Couchbase Plugin

• Based on the Couchbase Tap Interface

• Allows importing and exporting of the entire database or of future key mutations

Diagram: Couchbase and HDFS

1. Data imported via Tap mechanism

2. Hadoop Processing

3. Data exported back to Couchbase

Couchbase Import

$ sqoop import --connect http://localhost:8091/pools --table DUMP

$ sqoop import --connect http://localhost:8091/pools --table BACKFILL_5

$ sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir DUMP

• For Imports, table must be:

– DUMP: All keys currently in Couchbase

– BACKFILL_n: All key mutations for n minutes

• For Exports, table option is ignored

• Specified --username maps to bucket

– By default set to “default” bucket
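
For example, a hypothetical import of ten minutes of key mutations from a non-default bucket would combine the BACKFILL table name with the --username bucket mapping described above:

$ sqoop import --connect http://localhost:8091/pools --table BACKFILL_10 --username beer-sample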

Demo

Thank You!

• Couchbase: www.couchbase.com

• Hadoop: hadoop.apache.org

• Sqoop: incubator.apache.org/projects/sqoop.html

• Cloudera: www.cloudera.com

