Page 1: Big Data File Ingestion using Apex

Apache Apex Meetup

Big Data File Ingestion using Apex

Sandeep Deshmukh, PhD

[email protected]

Page 2: Big Data File Ingestion using Apex


Contents

● What is Big Data Ingestion
● Challenges in file copy @ scale
● Ingestion using Apex
   ○ Input
   ○ Output
   ○ Key features
● Demo
● Summary

Page 3: Big Data File Ingestion using Apex


Directed Acyclic Graph (DAG)

● A Stream is a sequence of data tuples.
● An Operator takes one or more input streams, performs computations, and emits one or more output streams.
   ○ Each operator is your custom business logic in Java, or a built-in operator from the open source library.
   ○ An operator can have many instances that run in parallel, and each instance is single-threaded.
● A Directed Acyclic Graph (DAG) is made up of operators and streams (a code sketch follows the figure below).

[Figure: Apex Application Programming Model - operators connected by streams; tuples flow through operators that produce filtered and enriched output streams]
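To make the programming model above concrete, here is a minimal, self-contained sketch of an Apex application (not from the talk). The core API (StreamingApplication, DAG, BaseOperator, DefaultInputPort/DefaultOutputPort) is Apache Apex's; the NumberGenerator and Printer operators and their logic are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;

@ApplicationAnnotation(name = "HelloApexDag")
public class HelloApexDag implements StreamingApplication {

  /** Illustrative input operator: emits one tuple per call to emitTuples(). */
  public static class NumberGenerator extends BaseOperator implements InputOperator {
    public final transient DefaultOutputPort<Long> output = new DefaultOutputPort<>();
    private long n;

    @Override
    public void emitTuples() {
      output.emit(n++);
    }
  }

  /** Illustrative generic operator: receives tuples on its input port. */
  public static class Printer extends BaseOperator {
    public final transient DefaultInputPort<Long> input = new DefaultInputPort<Long>() {
      @Override
      public void process(Long tuple) {
        System.out.println(tuple);
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    // addOperator creates the vertices of the DAG; addStream creates the edges.
    NumberGenerator gen = dag.addOperator("Generator", new NumberGenerator());
    Printer printer = dag.addOperator("Printer", new Printer());
    dag.addStream("numbers", gen.output, printer.input);
  }
}
```

populateDAG is where the DAG from the slide is declared: operators are the vertices, streams the edges between output and input ports.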

Page 4: Big Data File Ingestion using Apex


What is Ingestion

Data ingestion
● The process of obtaining, importing, and processing data for later use or storage in a database

Big Data ingestion
● Discovering the data sources
● Importing the data
● Processing the data to produce intermediate data
● Sending the data out to durable data stores

Page 5: Big Data File Ingestion using Apex


Challenges in File copy @ scale

● Failure recovery
● Copying big files in parallel
● Copying a large number of small files
● Processing
   ○ Encryption
   ○ Compression
   ○ Compaction

Page 6: Big Data File Ingestion using Apex


DAG - Components

Read Data → Process → Write Data

Page 7: Big Data File Ingestion using Apex


DAG - Read Data : Requirements

● Independent of input file type
   ○ HDFS
   ○ S3
   ○ FTP
   ○ NFS
● Scale to large data
   ○ Large files
   ○ Large number of small files
● Configurable bandwidth usage

Page 8: Big Data File Ingestion using Apex


DAG - Read Data

File Splitter
   Purpose: Break the whole task into smaller sub-tasks
   Steps: Connect to the input and scan for available data; assign smaller tasks to downstream operators

Block Reader
   Purpose: Work on the sub-tasks given by Operator 1, one at a time
   Steps: Connect to the source and read data as smaller tasks, one by one; pass the read data on to the downstream operator

File Writer
   Purpose: Write the file
   Steps: Save the data read by Operator 2

Page 9: Big Data File Ingestion using Apex


DAG - Simple Design

[Diagram: File Splitter →(BlockMetaData)→ Block Reader →(Data)→ File Writer]

Challenges
● Reading a single file in parallel is not possible
   ○ Multiple Block Readers and File Writers can work on multiple files in parallel, but a single file can't be read by two Block Readers
● Failure recovery is hard

Page 10: Big Data File Ingestion using Apex


DAG - Read Data

File Splitter
   Purpose: Break the whole task into smaller sub-tasks
   Steps: Connect to the input and scan for available data; assign smaller tasks to downstream operators

Block Reader
   Purpose: Work on the sub-tasks given by Operator 1, one at a time
   Steps: Connect to the source and read data as smaller tasks, one by one; pass the read data on to the downstream operator

File Writer
   Purpose: Write the file
   Steps: Save the data read by Operator 2

Synchronizer
   Purpose: Check for completeness
   Steps: Make sure all the smaller tasks for a file have been completed by the upstream operators, then send the file-merger trigger

Page 11: Big Data File Ingestion using Apex


DAG - Input

[Diagram: Input DAG]
File Splitter →(BlockMetaData)→ Block Reader →(Data, BlockMetaData)→ Block Writer
File Splitter →(FileMetaData)→ Synchronizer
Block Writer →(BlockMetaData)→ Synchronizer
(Block Reader and Block Writer are shown as multiple parallel instances.)

Page 12: Big Data File Ingestion using Apex


Input DAG - FileSplitter

Scan input files/directories and create smaller sub-tasks (see the splitting sketch after the parameter list).

[Diagram: File Splitter. Outputs: FileMetaData, BlockMetaData]

● Parameters
   ○ Input files/directories to copy data from
   ○ Recursive: Yes / No
   ○ Polling: Yes / No
   ○ Bandwidth: MB/sec
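To illustrate how a file is "virtually split" into smaller sub-tasks (this is not the FileSplitter's actual code; the class and field names are invented), a file of a given length can be divided into fixed-size block ranges:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: compute the virtual block ranges for one file. */
public class BlockPlanner {

  /** Hypothetical description of one block: byte offsets [start, end) within the file. */
  public static class BlockRange {
    public final long blockId;
    public final long start;
    public final long end;

    public BlockRange(long blockId, long start, long end) {
      this.blockId = blockId;
      this.start = start;
      this.end = end;
    }
  }

  /** Split a file of fileLength bytes into blocks of at most blockSize bytes. */
  public static List<BlockRange> split(long fileLength, long blockSize) {
    List<BlockRange> blocks = new ArrayList<>();
    long blockId = 0;
    for (long start = 0; start < fileLength; start += blockSize) {
      long end = Math.min(start + blockSize, fileLength);
      blocks.add(new BlockRange(blockId++, start, end));
    }
    return blocks;
  }

  public static void main(String[] args) {
    // e.g. a 1 GB file split into 128 MB virtual blocks
    List<BlockRange> blocks = split(1073741824L, 134217728L);
    System.out.println("blocks = " + blocks.size());
  }
}
```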

Page 13: Big Data File Ingestion using Apex


Input DAG - FileSplitter

● For each file in the directory (illustrated by the sketch after this list):
   ■ [output] FileMetaData: file information
      ● Name, e.g. InputFile.txt
      ● Size, e.g. 1073741824 (1 GB)
      ● Relative path, e.g. input/data/InputFile.txt
      ● Block IDs into which the file is virtually split, e.g. [0,1,2,3,4,5,6,7,8]
   ■ [output] BlockMetaData: block information
      ● BlockID, e.g. 1
      ● Start position, e.g. 134217728
      ● End position, e.g. 268435456 (128 MB)
      ● File URL, e.g. hdfs://node18:8020/user/sandeep/input
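A sketch of what these two output tuples could look like as plain Java classes; the field names mirror the slide, and the actual library classes may differ:

```java
/** Illustrative shape of the FileSplitter outputs; not the exact library classes. */
public class Metadata {

  /** One tuple per discovered file. */
  public static class FileMetaData {
    public String name;          // e.g. InputFile.txt
    public long size;            // e.g. 1073741824 (1 GB)
    public String relativePath;  // e.g. input/data/InputFile.txt
    public long[] blockIds;      // virtual blocks the file is split into
  }

  /** One tuple per virtual block of a file. */
  public static class BlockMetaData {
    public long blockId;         // e.g. 1
    public long startOffset;     // e.g. 134217728
    public long endOffset;       // e.g. 268435456
    public String fileUrl;       // e.g. hdfs://node18:8020/user/sandeep/input
  }
}
```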

Page 14: Big Data File Ingestion using Apex


Input DAG - BlockReader

Read the block from the remote location and emit Data (see the read sketch below).

[Diagram: Block Reader. Input: BlockMetaData. Outputs: Data, BlockMetaData]

● Parameters
   ○ Input URL, e.g. hdfs://node18:8020/user/hduser/input
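A minimal sketch of reading one virtual block with the standard Hadoop FileSystem API; the offsets follow the hypothetical BlockMetaData shape above, and this is not the Malhar block reader itself:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReadSketch {

  /** Read the byte range [startOffset, endOffset) of the given file. */
  public static byte[] readBlock(String fileUrl, long startOffset, long endOffset)
      throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(fileUrl);
    try (FileSystem fs = FileSystem.get(path.toUri(), conf);
         FSDataInputStream in = fs.open(path)) {
      byte[] buffer = new byte[(int) (endOffset - startOffset)];
      // Positioned read: fetch exactly this block's bytes from the remote file.
      in.readFully(startOffset, buffer);
      return buffer;
    }
  }
}
```

A real operator would emit the block in smaller chunks instead of one array, but the seek-and-read idea is the same.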

Page 15: Big Data File Ingestion using Apex


Input DAG - BlockWriter

Write the block data to local HDFS; the data is saved in the application's directory (see the write sketch below).

[Diagram: Block Writer. Inputs: BlockMetaData, Data. Output: BlockMetaData]
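A sketch of writing one block into a per-application blocks directory on HDFS (the directory layout and names are assumptions):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockWriteSketch {

  /** Write the bytes of one block to <appDir>/blocks/<blockId>. */
  public static Path writeBlock(String appDir, long blockId, byte[] data) throws IOException {
    Configuration conf = new Configuration();
    Path blockPath = new Path(appDir, "blocks/" + blockId);
    try (FileSystem fs = FileSystem.get(blockPath.toUri(), conf);
         FSDataOutputStream out = fs.create(blockPath, true)) {
      out.write(data);
    }
    return blockPath;
  }
}
```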

Page 16: Big Data File Ingestion using Apex


Input DAG - Synchronizer

Track the blocks for each file and send a trigger once all the blocks for that file are available (a bookkeeping sketch follows).

[Diagram: Synchronizer. Inputs: FileMetaData, BlockMetaData. Output: FileMetaData]
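The bookkeeping the Synchronizer performs can be sketched as a per-file counter of outstanding blocks; this is an illustration, not the actual operator, and it assumes a file's metadata arrives before its blocks:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative block-per-file tracking; signals when a file's last block has been written. */
public class SynchronizerSketch {

  // File URL -> number of blocks still outstanding for that file.
  private final Map<String, Integer> pendingBlocks = new HashMap<>();

  /** Register a newly discovered file and how many blocks it was split into. */
  public void onFileMetaData(String fileUrl, int blockCount) {
    pendingBlocks.put(fileUrl, blockCount);
  }

  /** Record a completed block; returns true when the whole file is ready to be merged. */
  public boolean onBlockWritten(String fileUrl) {
    // Assumes onFileMetaData was seen first for this file.
    int remaining = pendingBlocks.merge(fileUrl, -1, Integer::sum);
    if (remaining == 0) {
      pendingBlocks.remove(fileUrl);
      return true; // the caller would now emit the FileMetaData trigger downstream
    }
    return false;
  }
}
```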

Page 17: Big Data File Ingestion using Apex


DAG - Input

[Diagram: Input DAG]
File Splitter →(BlockMetaData)→ Block Reader →(Data, BlockMetaData)→ Block Writer
File Splitter →(FileMetaData)→ Synchronizer
Block Writer →(BlockMetaData)→ Synchronizer

Page 18: Big Data File Ingestion using Apex


Output DAG - FileMerger

Merge the blocks to recreate the original file (a simple merge sketch follows).

[Diagram: File Merger. Input: FileMetaData]

● Parameters
   ○ Output directory to copy data to
   ○ Overwrite: Yes / No
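A plain (non-fast) merge can be sketched with the standard Hadoop FileSystem API: copy each block file, in block order, into the destination file. Paths, names, and the overwrite flag are illustrative, not the product's FileMerger:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileMergeSketch {

  /** Concatenate the block files, in block-ID order, into the final output file. */
  public static void merge(FileSystem fs, List<Path> blockPaths, Path outputFile,
      boolean overwrite) throws IOException {
    try (FSDataOutputStream out = fs.create(outputFile, overwrite)) {
      for (Path block : blockPaths) {
        try (FSDataInputStream in = fs.open(block)) {
          // Append this block's bytes onto the end of the output file.
          IOUtils.copyBytes(in, out, 4096, false);
        }
      }
    }
  }
}
```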

Page 19: Big Data File Ingestion using Apex


Output DAG - FileMerger - FastMerge Magic

[Figure: a file is a sequence of blocks B1, B2, ..., Bn; the individual blocks are spread across DataNodes 1-4 of the HDFS cluster]

Page 20: Big Data File Ingestion using Apex


Output DAG - FileMerger - FastMerge Magic: conditions (see the sketch below)

● Same replication factor
● On the same HDFS cluster
● Same block size for all files
● Size of all files (except the last) is a multiple of the block size
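These conditions resemble the restrictions HDFS places on its concat operation, which stitches existing blocks onto a target file at the NameNode without copying any bytes. The slide does not name the mechanism, so treat the following as an assumption about how such a metadata-only "fast merge" could be done:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FastMergeSketch {

  /**
   * Metadata-only merge: append the blocks of srcBlockFiles onto target without
   * moving data. All paths must live on the same HDFS cluster, and the block-size
   * and alignment constraints listed above must hold.
   */
  public static void fastMerge(Path target, Path[] srcBlockFiles) throws IOException {
    FileSystem fs = FileSystem.get(target.toUri(), new Configuration());
    // Supported by DistributedFileSystem; other FileSystem implementations may throw
    // UnsupportedOperationException, in which case a byte-copy merge is the fallback.
    fs.concat(target, srcBlockFiles);
  }
}
```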

Page 21: Big Data File Ingestion using Apex


DAG - Complete

[Diagram: Complete DAG]
File Splitter →(BlockMetaData)→ Block Reader →(Data, BlockMetaData)→ Block Writer
File Splitter →(FileMetaData)→ Synchronizer
Block Writer →(BlockMetaData)→ Synchronizer
Synchronizer →(FileMetaData)→ File Merger
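Expressed in the programming model from the start of the talk, the complete DAG could be wired roughly as below. The operator classes here are self-contained stubs whose names and port layout only mirror the diagram; they are not the actual Malhar/DataTorrent ingestion operators, and the real logic is reduced to comments.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;

@ApplicationAnnotation(name = "FileIngestSketch")
public class FileIngestSketch implements StreamingApplication {

  // Stub operators with the port layout from the diagram; real logic omitted.
  public static class FileSplitter extends BaseOperator implements InputOperator {
    public final transient DefaultOutputPort<String> blockMetaData = new DefaultOutputPort<>();
    public final transient DefaultOutputPort<String> fileMetaData = new DefaultOutputPort<>();
    @Override
    public void emitTuples() { /* scan the input, emit file and block metadata */ }
  }

  public static class BlockReader extends BaseOperator {
    public final transient DefaultOutputPort<byte[]> data = new DefaultOutputPort<>();
    public final transient DefaultInputPort<String> blockMetaData = new DefaultInputPort<String>() {
      @Override
      public void process(String block) { /* read the block's byte range, emit on data */ }
    };
  }

  public static class BlockWriter extends BaseOperator {
    public final transient DefaultOutputPort<String> blockMetaData = new DefaultOutputPort<>();
    public final transient DefaultInputPort<byte[]> data = new DefaultInputPort<byte[]>() {
      @Override
      public void process(byte[] bytes) { /* write the block, then emit its metadata */ }
    };
  }

  public static class Synchronizer extends BaseOperator {
    public final transient DefaultOutputPort<String> mergeTrigger = new DefaultOutputPort<>();
    public final transient DefaultInputPort<String> fileMetaData = new DefaultInputPort<String>() {
      @Override
      public void process(String file) { /* remember how many blocks the file expects */ }
    };
    public final transient DefaultInputPort<String> blockMetaData = new DefaultInputPort<String>() {
      @Override
      public void process(String block) { /* when a file is complete, emit mergeTrigger */ }
    };
  }

  public static class FileMerger extends BaseOperator {
    public final transient DefaultInputPort<String> fileMetaData = new DefaultInputPort<String>() {
      @Override
      public void process(String file) { /* merge the file's blocks into the output file */ }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    FileSplitter splitter = dag.addOperator("FileSplitter", new FileSplitter());
    BlockReader reader = dag.addOperator("BlockReader", new BlockReader());
    BlockWriter writer = dag.addOperator("BlockWriter", new BlockWriter());
    Synchronizer sync = dag.addOperator("Synchronizer", new Synchronizer());
    FileMerger merger = dag.addOperator("FileMerger", new FileMerger());

    dag.addStream("blocks", splitter.blockMetaData, reader.blockMetaData);
    // In the real app, block metadata accompanies the data so the writer knows where each chunk belongs.
    dag.addStream("data", reader.data, writer.data);
    dag.addStream("writtenBlocks", writer.blockMetaData, sync.blockMetaData);
    dag.addStream("files", splitter.fileMetaData, sync.fileMetaData);
    dag.addStream("mergeTrigger", sync.mergeTrigger, merger.fileMetaData);
  }
}
```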

Page 22: Big Data File Ingestion using Apex


Other features: Optional processing

● Compression
   ○ Gzip and LZO
● Encryption
   ○ PKI and AES (a combined compression + encryption sketch follows this list)
● Compaction
   ○ Size based
● Dedup
● Dimension computation & aggregation
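To illustrate the compression and encryption steps on the write path (plain JDK code, not the product's implementation; key handling and cipher mode are simplified), the output stream can be layered so bytes are gzip-compressed and then AES-encrypted:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import java.util.zip.GZIPOutputStream;

import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class ProcessingSketch {

  /** Wrap a raw output stream so written bytes are gzip-compressed, then AES-encrypted. */
  public static OutputStream compressAndEncrypt(OutputStream raw, SecretKey key)
      throws IOException, GeneralSecurityException {
    // Defaults to AES/ECB/PKCS5Padding; a real deployment would pick an explicit mode and manage keys/IVs.
    Cipher cipher = Cipher.getInstance("AES");
    cipher.init(Cipher.ENCRYPT_MODE, key);
    // Order matters: compress first, then encrypt (encrypted data does not compress well).
    return new GZIPOutputStream(new CipherOutputStream(raw, cipher));
  }

  public static SecretKey newAesKey() throws GeneralSecurityException {
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(128);
    return keyGen.generateKey();
  }
}
```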

Page 23: Big Data File Ingestion using Apex


Page 24: Big Data File Ingestion using Apex


Summary

● Easy to use
   ○ Configure and run
● Unified for batch and continuous ingestion
● Handles
   ○ Large files
   ○ Large number of small files

Page 25: Big Data File Ingestion using Apex


Page 26: Big Data File Ingestion using Apex

Resources


● Apache Apex - http://apex.apache.org/
● Subscribe - http://apex.apache.org/community.html
● Download - https://www.datatorrent.com/download/
● Twitter
   ○ @ApacheApex; Follow - https://twitter.com/apacheapex
   ○ @DataTorrent; Follow - https://twitter.com/datatorrent
● Meetups - http://www.meetup.com/topics/apache-apex
● Webinars - https://www.datatorrent.com/webinars/
● Videos - https://www.youtube.com/user/DataTorrent
● Slides - http://www.slideshare.net/DataTorrent/presentations
● Startup Accelerator Program - full featured enterprise product
   ○ https://www.datatorrent.com/product/startup-accelerator/

Page 27: Big Data File Ingestion using Apex

© 2016 DataTorrent

We Are Hiring


[email protected]
● Developers/Architects
● QA Automation Developers
● Information Developers
● Build and Release
● Community Leaders

