
Introduction to Apache Apex

Date post: 08-Jan-2017
Upload: apache-apex
Introduction to Apache Apex, Priyanka Gugale ([email protected]), September 30th 2016
Transcript
Page 1: Introduction to Apache Apex

Introduction to Apache Apex

Priyanka Gugale ([email protected])
September 30th 2016

Page 2: Introduction to Apache Apex

Next Gen Stream Data Processing

• Data from a variety of sources (IoT, Kafka, files, social media etc.)
• Unbounded, continuous data streams
  ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to a variety of sinks or destinations
  ᵒ Streaming application can also serve data with very low latency

[Figure: example pipeline with the stages Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, Kafka]

Page 3: Introduction to Apache Apex

Apache Apex

• In-memory, distributed stream processing
  • Application logic broken into components called operators that run in a distributed fashion across your cluster
• Natural programming model
  • Unobtrusive Java API to express (custom) logic
  • Maintain state and metrics in your member variables
• Scalable, high throughput, low latency
  • Operators can be scaled up or down at runtime according to the load and SLA
  • Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
  • Automatically recover from node outages without having to reprocess from beginning
  • State is preserved, checkpointing, incremental recovery
  • End-to-end exactly-once
• Operability
  • System and application metrics, record/visualize data
  • Dynamic changes

Page 4: Introduction to Apache Apex

Apex Platform Overview


Page 5: Introduction to Apache Apex

Native Hadoop Integration


• YARN is the resource manager

• HDFS for storing persistent state

Page 6: Introduction to Apache Apex

Application Development Model

▪ A Stream is a sequence of data tuples
▪ A typical Operator takes one or more input streams, performs computations & emits one or more output streams
  • Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
  • An Operator has many instances that run in parallel, and each instance is single-threaded
▪ A Directed Acyclic Graph (DAG) is made up of operators and streams

[Figure: Directed Acyclic Graph (DAG) of operators connected by streams (Filtered Stream, Enriched Stream, Output Stream), with tuples flowing along each stream]

Page 7: Introduction to Apache Apex

DAG Components

• Tuple
  ● Atomic data that flows over a stream
• Operator
  ● Basic compute unit, invoked per tuple
• Stream
  ● Connector abstraction between operators
  ● Tuples flow over this

[Figure: Operator1 connected to Operator2 by a Stream carrying tuple 1, tuple 2, tuple 3]
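
The three components above can be sketched in plain Java. This is a minimal illustration and not the Apex API: `SketchOperator`, `FilterOperator`, `EnrichOperator` and `DagSketch` are hypothetical names, integers stand in for tuples, and a lambda hand-off stands in for a stream. It also shows an operator keeping state in a member variable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for an operator: consumes one tuple, may emit any number downstream. */
interface SketchOperator<I, O> {
    void process(I tuple, Consumer<O> output);
}

final class DagSketch {
    /** Stateless operator: drops negative readings. */
    static final class FilterOperator implements SketchOperator<Integer, Integer> {
        @Override
        public void process(Integer tuple, Consumer<Integer> output) {
            if (tuple >= 0) {
                output.accept(tuple);
            }
        }
    }

    /** Stateful operator: tags each reading with a running count held in a member variable. */
    static final class EnrichOperator implements SketchOperator<Integer, String> {
        private long seen = 0; // operator state

        @Override
        public void process(Integer tuple, Consumer<String> output) {
            seen++;
            output.accept("reading#" + seen + "=" + tuple);
        }
    }

    /** Wires source -> filter -> enrich -> sink; each hand-off plays the role of a stream. */
    static List<String> run(List<Integer> source) {
        FilterOperator filter = new FilterOperator();
        EnrichOperator enrich = new EnrichOperator();
        List<String> sink = new ArrayList<>();
        for (Integer tuple : source) {
            filter.process(tuple, filtered -> enrich.process(filtered, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(java.util.Arrays.asList(5, -1, 7)));
    }
}
```

In a real deployment the operators would run in separate processes and the streams would cross the network; the single loop here only shows the data flow.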

Page 8: Introduction to Apache Apex

DAG Example

Page 9: Introduction to Apache Apex

Operator Library

RDBMS
• Vertica
• MySQL
• Oracle
• JDBC

NoSQL
• Cassandra, HBase
• Aerospike, Accumulo
• Couchbase / CouchDB
• Redis, MongoDB
• Geode

Messaging
• Kafka
• Solace
• Flume, ActiveMQ
• Kinesis, NiFi

File Systems
• HDFS / Hive
• NFS
• S3

Parsers
• XML
• JSON
• CSV
• Avro
• Parquet

Transformations
• Filters
• Rules
• Expression
• Dedup
• Enrich

Analytics
• Dimensional Aggregations (with state management for historical data + query)

Protocols
• HTTP
• FTP
• WebSocket
• MQTT
• SMTP

Other
• Elasticsearch
• Script (JavaScript, Python, R)
• Solr
• Twitter

Page 10: Introduction to Apache Apex


Platform Features

Page 11: Introduction to Apache Apex

Windowing in Apex

● Data flows with respect to time
● Computers understand time
● Use the time axis as a reference
● Break the stream into finite time slices

⇒ Streaming Windows

Page 12: Introduction to Apache Apex

Windowing in Apex

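[Figure: streaming windows WN, WN+1, WN+2 flow from the Input Operator through Operator 1, Operator 2 and Operator 3 as time progresses; each window (for example Window N+1) consists of a Begin Window marker, data tuples, and an End Window marker]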
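
The Begin Window / data tuple / End Window sequence in the figure can be sketched as three callbacks on an operator. The class below is a standalone illustration with hypothetical names (`WindowedCounter`, `emitted`), not Apex code; it counts the tuples seen in each streaming window and emits one result when the window closes.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowedCounter {
    private long currentWindowId;
    private int countInWindow;
    private final List<String> emitted = new ArrayList<>();

    /** Begin Window marker: reset the per-window state. */
    void beginWindow(long windowId) {
        currentWindowId = windowId;
        countInWindow = 0;
    }

    /** Data tuple: update state; nothing is emitted yet. */
    void process(String tuple) {
        countInWindow++;
    }

    /** End Window marker: emit the aggregate for the window that just closed. */
    void endWindow() {
        emitted.add("window=" + currentWindowId + " count=" + countInWindow);
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        WindowedCounter counter = new WindowedCounter();
        counter.beginWindow(1);
        counter.process("a");
        counter.process("b");
        counter.endWindow();
        counter.beginWindow(2);
        counter.process("c");
        counter.endWindow();
        System.out.println(counter.emitted());
    }
}
```

Because results are tied to window boundaries rather than to wall-clock reads inside the operator, every operator in the DAG agrees on which tuples belong to which slice of the stream.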

Page 13: Introduction to Apache Apex

Checkpointing

▪ Application window
▪ Sliding window and tumbling window
▪ Checkpoint window
▪ No artificial latency
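
The slide only names the window types, so the following is a generic sketch of the difference between tumbling and sliding windows rather than a description of how Apex configures them. It assumes each array element is the value produced by one streaming window and that a larger window spans `size` of them; `WindowSums` and its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class WindowSums {
    /** Tumbling: sum each non-overlapping group of `size` streaming windows. */
    static List<Integer> tumbling(int[] perWindow, int size) {
        List<Integer> out = new ArrayList<>();
        int sum = 0;
        for (int i = 0; i < perWindow.length; i++) {
            sum += perWindow[i];
            if ((i + 1) % size == 0) {
                out.add(sum);
                sum = 0;
            }
        }
        return out; // a trailing incomplete group is not emitted
    }

    /** Sliding: after every streaming window, sum the most recent `size` windows. */
    static List<Integer> sliding(int[] perWindow, int size) {
        List<Integer> out = new ArrayList<>();
        Deque<Integer> recent = new ArrayDeque<>();
        int sum = 0;
        for (int value : perWindow) {
            recent.addLast(value);
            sum += value;
            if (recent.size() > size) {
                sum -= recent.removeFirst(); // oldest window slides out
            }
            if (recent.size() == size) {
                out.add(sum);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] perWindow = {1, 2, 3, 4};
        System.out.println(tumbling(perWindow, 2)); // [3, 7]
        System.out.println(sliding(perWindow, 2));  // [3, 5, 7]
    }
}
```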

Page 14: Introduction to Apache Apex

Scalability

NxM Partitions

[Figure: Logical DAG with operators 0, 1, 2, 3]
[Figure: Logical Diagram with operators 0, 1, 2]
[Figure: Physical Diagram with operator 1 with 3 partitions (1a, 1b, 1c) merged by a Unifier]
[Figure: Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck]
[Figure: Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier]
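
The partition-and-unify pattern in these diagrams can be sketched as follows. This is an illustration under stated assumptions, not Apex code: tuples are routed to a partition by key hash, each partition keeps its own partial counts, and a unifier merges the partial results. `PartitionSketch`, `CountPartition` and `unify` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionSketch {
    /** One partition of a counting operator; it sees only the keys routed to it. */
    static final class CountPartition {
        final Map<String, Integer> counts = new HashMap<>();

        void process(String key) {
            counts.merge(key, 1, Integer::sum);
        }
    }

    /** Unifier: merges the partial results of all partitions into a single result. */
    static Map<String, Integer> unify(List<CountPartition> partitions) {
        Map<String, Integer> merged = new HashMap<>();
        for (CountPartition partition : partitions) {
            partition.counts.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }

    /** Routes each tuple to a partition by key hash, then unifies the partial counts. */
    static Map<String, Integer> run(List<String> tuples, int partitionCount) {
        List<CountPartition> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new CountPartition());
        }
        for (String tuple : tuples) {
            int index = Math.floorMod(tuple.hashCode(), partitionCount);
            partitions.get(index).process(tuple);
        }
        return unify(partitions);
    }

    public static void main(String[] args) {
        System.out.println(run(java.util.Arrays.asList("a", "b", "a", "c", "a"), 3));
    }
}
```

The unified result is the same for any partition count, which is what lets the number of partitions change without changing the application logic. The single `unify` step is also where the bottleneck in the last diagram comes from: every partition's output funnels through it.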

Page 15: Introduction to Apache Apex

Fault Tolerance

• Operator state is checkpointed to persistent store
  ᵒ Automatically performed by engine, no additional coding needed
  ᵒ Asynchronous and distributed
  ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
  ᵒ Heartbeat mechanism
  ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
  ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
  ᵒ Snapshot of physical (and logical) plan
  ᵒ Execution layer change log
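
Checkpoint-and-replay recovery can be sketched in a few lines. The simulation below is illustrative only and makes several assumptions: state is a single running sum, a checkpoint is taken every `interval` windows, the `windows` list stands in for the upstream buffer that retains data for replay, and `failAfter` is a hypothetical knob that injects one crash. None of these names come from Apex.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckpointSketch {
    /** Stateful operator: a running sum. */
    static final class SumOperator {
        long sum;

        void process(int tuple) {
            sum += tuple;
        }
    }

    /** Saved copy of the operator state and the index of the last window it covers. */
    static final class Checkpoint {
        final int windowIndex;
        final long sum;

        Checkpoint(int windowIndex, long sum) {
            this.windowIndex = windowIndex;
            this.sum = sum;
        }
    }

    /**
     * Processes all windows, checkpointing every `interval` windows. If failAfter >= 0,
     * the operator crashes once after that window, is restored from the last checkpoint,
     * and the windows after the checkpoint are replayed.
     */
    static long run(List<int[]> windows, int interval, int failAfter) {
        SumOperator operator = new SumOperator();
        Checkpoint last = new Checkpoint(-1, 0); // nothing processed yet
        boolean failed = false;
        int w = 0;
        while (w < windows.size()) {
            for (int tuple : windows.get(w)) {
                operator.process(tuple);
            }
            if ((w + 1) % interval == 0) {
                last = new Checkpoint(w, operator.sum);
            }
            if (!failed && w == failAfter) {
                failed = true;
                operator = new SumOperator(); // in-memory state is lost with the container
                operator.sum = last.sum;      // restore from the checkpoint
                w = last.windowIndex + 1;     // replay buffered windows after it
                continue;
            }
            w++;
        }
        return operator.sum;
    }

    public static void main(String[] args) {
        List<int[]> windows = new ArrayList<>();
        windows.add(new int[] {1, 2});
        windows.add(new int[] {3});
        windows.add(new int[] {4, 5});
        windows.add(new int[] {6});
        System.out.println(run(windows, 2, -1)); // no failure
        System.out.println(run(windows, 2, 2));  // crash after the third window
    }
}
```

Both runs end with the same sum: the work done between the last checkpoint and the crash is discarded and redone from the buffered windows, so nothing is lost and nothing is counted twice.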

Page 16: Introduction to Apache Apex

Buffer Server

• In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency

[Figure: Operator 1 in Container 1 on Node 1, with a Buffer Server, feeding Operator 2 in Container 2 on Node 2]
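
The "stores results until committed" behaviour can be sketched with an ordered in-memory map. This is a simplified, hypothetical model (`BufferSketch`, `publish`, `replayFrom`, `committed` are illustrative names); it ignores the network, backpressure and spillover to disk, and only shows retention, ordered replay, and purging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class BufferSketch {
    // Tuples grouped by window id; TreeMap keeps the windows in order.
    private final TreeMap<Long, List<String>> byWindow = new TreeMap<>();

    /** Publisher side: retain an emitted tuple under its window. */
    void publish(long windowId, String tuple) {
        byWindow.computeIfAbsent(windowId, id -> new ArrayList<>()).add(tuple);
    }

    /** Subscriber side: tuples from `fromWindowId` onward, in window order then publish order. */
    List<String> replayFrom(long fromWindowId) {
        List<String> out = new ArrayList<>();
        for (List<String> tuples : byWindow.tailMap(fromWindowId, true).values()) {
            out.addAll(tuples);
        }
        return out;
    }

    /** Drops every window up to and including the committed one. */
    void committed(long windowId) {
        byWindow.headMap(windowId, true).clear();
    }

    int retainedWindows() {
        return byWindow.size();
    }

    public static void main(String[] args) {
        BufferSketch buffer = new BufferSketch();
        buffer.publish(1, "a");
        buffer.publish(1, "b");
        buffer.publish(2, "c");
        buffer.publish(3, "d");
        System.out.println(buffer.replayFrom(2));
        buffer.committed(2);
        System.out.println(buffer.retainedWindows());
    }
}
```

Replaying the same window always yields the same tuples in the same order, which is what a restarted downstream operator needs in order to reproduce its earlier results.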

Page 17: Introduction to Apache Apex

Industrial IoT applications

GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.

Business Need
• Ingest and analyze high-volume, high-speed data from thousands of devices and sensors per customer in real time without data loss
• Predictive analytics to reduce costly maintenance and improve customer service
• Unified monitoring of all connected sensors and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business and application workloads

Apex based Solution
• Ingestion application using DataTorrent Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built operators
• Management UI console

Client Outcome
• Helps GE improve performance and lower cost by enabling real-time Big Data analytics
• Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices
• Enables faster innovation with short application development cycle
• No data loss and 24x7 availability of applications
• Helps GE adjust to scalability needs with auto-scaling

Page 18: Introduction to Apache Apex

Resources for the use cases

• Pubmatic
  ᵒ https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
  ᵒ https://www.youtube.com/watch?v=hmaSkXhHNu0
  ᵒ http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-apache-apex-hadoop
• SilverSpring Networks
  ᵒ https://www.youtube.com/watch?v=8VORISKeSjI
  ᵒ http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-silver-spring-networks

Page 19: Introduction to Apache Apex

Q&A
