Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited

Uncovering the mysteries of data access in Hadoop

Costin Leau @costinl

Elasticsearch

Elasticsearch = OSS Search & Analytics engine

Provides native integration with Hadoop

Elasticsearch Hadoop

Native integration w/ Hadoop eco-system

Hadoop

Hadoop Distributed File System (HDFS)

Map Reduce Framework (M/R)

Map/Reduce

Data Access

Reading Data

Focusing on Data Access

Data Locality

Data

- Critical

- Persistent

- Big

Code

- Small

- Stateless

- Transient

I/O is expensive; CPU/RAM is cheap
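Because code is small and data is big, the scheduler prefers to run a task on a node that already holds the data. A minimal sketch of that preference, assuming a hypothetical `pick_host` helper (this is an illustration of the idea, not Hadoop's actual scheduler):

```python
def pick_host(replica_hosts, free_hosts):
    """Return (host, is_local) for a task that reads one chunk of data.

    replica_hosts: hosts that store a copy of the chunk (hypothetical input).
    free_hosts:    set of hosts that currently have a free task slot.
    """
    # Prefer a host that already stores the data: no network I/O needed.
    for host in replica_hosts:
        if host in free_hosts:
            return host, True
    # Otherwise fall back to any free host and pay for the remote read.
    if free_hosts:
        return sorted(free_hosts)[0], False
    # Nothing is free: the task has to wait.
    return None, False
```

Moving the small, stateless code to the data is the cheap direction; the fallback branch is the expensive one.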

Compression & Codecs

Saves disk space (by using the free CPU)

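Name | Extension | Splittable
Gzip | .gz | No
Bzip2 | .bz2 | Yes
Snappy | .snappy | No (unless used inside a container format such as SequenceFile)
LZO | .lzo | No (yes once indexed)
LZ4 | .lz4 | No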
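The trade-off of spending CPU to save disk can be seen with the codecs that ship in the Python standard library. This is only a sketch: the log line is made-up sample data, and the ratios depend entirely on the input.

```python
import bz2
import gzip


def compressed_sizes(payload: bytes) -> dict:
    """Compress the same payload with two codecs and report sizes in bytes."""
    return {
        "raw": len(payload),
        "gzip": len(gzip.compress(payload)),
        "bz2": len(bz2.compress(payload)),
    }


# Hypothetical, highly repetitive log data (compresses very well).
sample = b"2013-10-10 INFO request served in 12ms\n" * 10_000
sizes = compressed_sizes(sample)
print(sizes)
```

Whether a codec is splittable matters separately from its ratio: a non-splittable file has to be read by a single task no matter how large it is.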

Serialization

Converts Objects to byte streams (and back)

Hadoop Writable

JDK serialization

Avro

Protocol Buffers

Thrift

Kryo

MsgPack

JSON/Smile
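As a small illustration of "objects to byte streams (and back)", here is a round trip using JSON, one of the formats listed above. The helper names `to_bytes` and `from_bytes` are made up for this sketch; binary formats such as Avro or Protocol Buffers follow the same pattern with a schema and a more compact encoding.

```python
import json


def to_bytes(record: dict) -> bytes:
    """Serialize a record (a plain dict) into a UTF-8 encoded JSON byte string."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def from_bytes(raw: bytes) -> dict:
    """Deserialize the byte string back into a dict."""
    return json.loads(raw.decode("utf-8"))


original = {"user": "costinl", "hits": 3}
wire = to_bytes(original)
restored = from_bytes(wire)
```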

Main APIs

OutputFormat

RecordWriter

Main APIs

Allow Hadoop to read/write data

Tied to the data source

Handle data format (serialization, protocol, etc.)

Are bundled with the Hadoop job

- restrictions on size, state, configuration

Input/Output Format

Fundamental for data retrieval/store

Handle splitting

Understand the data format

- File based (relies on the Hadoop FS)

  - HDFS, S3, webfs, etc.

- Protocol based

  - HTTP/REST, JDBC/RDBMS, ...

InputSplits

Divide the data into fragments

- format / application-specific bounds

- processed separately

- self-contained

- have no dependency on one another

Critical for scalability (and performance)

Drive the number of tasks running in parallel

Ideally are data aware
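A minimal sketch of the splitting idea for a file-based source, assuming fixed-size byte ranges. The function name and the (offset, length) representation are illustrative only; real formats also take block boundaries and data layout into account.

```python
def compute_splits(total_size: int, split_size: int) -> list:
    """Cut total_size bytes into independent (offset, length) fragments."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        (offset, min(split_size, total_size - offset))
        for offset in range(0, total_size, split_size)
    ]


# Each fragment can be handed to a separate task; the number of fragments
# is the upper bound on how many tasks can run in parallel.
splits = compute_splits(total_size=71, split_size=20)
```

Here a 71-byte input with 20-byte splits yields four fragments, so at most four tasks can work on it at once.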

Record Reader/Writer

Translate the split into objects (map of k/v)

Responsible for:

- object creation

- data structure parsing

- progress monitoring
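A sketch of how a line-oriented reader can turn a byte-range split into (offset, line) records. It follows the usual convention for text input: a reader whose split does not start at byte 0 skips the first (partial) line, and every reader finishes the last line it started even if that line runs past the end of its split. The function name and the in-memory `bytes` input are assumptions made for this example.

```python
def read_records(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) pairs for the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split's reader owns the line that crosses `start`.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


sample = (
    b"Short sentence.\n"
    b"The quick brown fox jumps over the lazy dog.\n"
    b"Last line\n"
)
# Read the sample through four 20-byte splits, as four separate tasks would.
records = [
    record
    for start in range(0, len(sample), 20)
    for record in read_records(sample, start, 20)
]
```

Although the second line spans three splits, it is emitted exactly once, by the reader of the split in which it starts.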

Formats and Records

Short sentence.

InputFormat

1 Short sentence

RecordReader

The quick brown fox jumps over the lazy dog.

1 The quick brown fox jumps over the lazy dog.

Out of the box

Type | Format | Records
File | Text | Line
SequenceFile | Binary | Key/Value
RCFile/ORCFile | Binary | Column Groups

Adding a data store

Sharding is critical

External data stores features

Scalable / Sharding ?

Data locality ?

Streaming ?

Collocation (with Hadoop) ?

Data Source Example - RDBMS

Sharding – no

Data Locality – none

Streaming – supported by some

Collocation – no

DB Formats and Records

InputFormat

RecordReader

Using an RDBMS w/ Hadoop

JDBC-based (I|O)Format/Record(R|W)

- available in Hadoop out of the box

Usage discouraged due to:

- Lack of batching

  - Multiple, short queries

  - Multiple, short transactions

Alternatives

RDBMS love batching

Export data from the RDBMS to HDFS (Apache Sqoop)

Import data from HDFS back into the RDBMS
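Why batching matters can be sketched with the standard-library `sqlite3` module standing in for an RDBMS: all rows are sent as one batch inside a single transaction instead of as many short statements and commits. The table name, columns, and `load_batched` helper are made up for this example.

```python
import sqlite3


def load_batched(conn: sqlite3.Connection, rows: list) -> None:
    """Insert all rows as one batch inside a single transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO events (id, msg) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, msg TEXT)")
load_batched(conn, [(i, f"event-{i}") for i in range(1000)])
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

A per-record writer would instead issue a thousand separate statements, which is the access pattern the JDBC-based formats are discouraged for.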

Data Source Example - HBase

Distributed, scalable column datastore

Excellent for high rates of row-level updates

Based on HDFS (can use Map/Reduce)

Data Source Example - HBase

Sharding – yes

Data Locality – yes

Streaming – partial

Collocation – yes (no need, it's all HDFS)

Elasticsearch Hadoop

Sharding – yes

Data Locality – yes

Streaming – partial

Collocation – yes

Elasticsearch Hadoop

InputFormat

RecordReader

{ "key": "value", "key": "value" } {...}

Hadoop Eco-system

Library | Format | Record
Cascading | Tap | Scheme
Pig | Load/StoreFunction | Record
Hive | HiveStorageHandler / SerDe | SerDe / Record

Built upon (I|O)Format and Record(R|W)

Handle type conversion

Implement complex operations

Wrap-up

Focus on data store capabilities first

Dump data to HDFS for stores w/o sharding

Thank you! @costinl
