
Apache Apex Kafka Input Operator

Date post: 26-Jan-2017

Kafka Input Operator

Siyuan Hua, DataTorrent, Committer Apache Apex
Apr 6, 2016

Feature Overview

Apache Apex Meetup

Feature             | 0.8 (Simple Consumer)                 | 0.9
LoC                 | 5900                                  | 2406
Fault-Tolerant      | Yes (at least once, exactly once)     | Yes (at least once, exactly once)
Scalability         | Scale with Kafka (static and dynamic) | Scale with Kafka (static and dynamic)
Multi-Cluster/Topic | Yes                                   | Yes
Throughput throttle | Yes                                   | Yes
Idempotent          | Yes                                   | Yes

Feature Overview

Feature            | 0.8 (Simple Consumer)                    | 0.9
Offset Management  | Customized management                    | Implicit but out-of-the-box management
Partition Strategy | 1:1, 1:M, Dynamic (unstable), Customized | 1:1, 1:M, Customized
Dependency         | Both public and internal API             | Public API
Metrics report     | Using old Counters API                   | Using new Apex @AutoMetric

0.8 Kafka Input Operator

● Only the Simple Consumer can deliver all features

● The High-level Consumer doesn’t support a customized assignor or sticky partitions

● Metadata changes have to be handled in operator code

● One shared consumer per broker model

● 2.5 years old! (Tested and mature)

0.9 Kafka Input Operator

● Uses the assign API that comes with the 0.9 Consumer class

● The assign API is a good replacement for the Simple Consumer in the new Kafka Input Operator

● Partitions are explicitly assigned to each operator instance

● One consumer is shared by all assigned partitions

● The operator doesn’t need to handle metadata changes or broker failures

● 2 months old!

Workflow

Partition Strategy

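[Diagram: 1 to 1 Partition (one operator instance per Kafka partition) and 1 to N Partition (one operator instance consumes several Kafka partitions)]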
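The two mappings named above can be sketched in a few lines of plain Java. This is only an illustration: the class and method names are hypothetical, Kafka partitions are reduced to integer ids, and the round-robin spread used for the 1 to N case is an assumption rather than the operator's actual policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: each Set in the result is the work of one operator instance.
class PartitionAssignmentSketch {

    // 1 to 1: every Kafka partition gets its own operator instance.
    static List<Set<Integer>> oneToOne(int kafkaPartitions) {
        List<Set<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < kafkaPartitions; p++) {
            Set<Integer> single = new LinkedHashSet<>();
            single.add(p);
            assignment.add(single);
        }
        return assignment;
    }

    // 1 to N: a fixed number of operator instances share the partitions.
    // Round-robin distribution is an assumption made for this sketch.
    static List<Set<Integer>> oneToMany(int kafkaPartitions, int operatorInstances) {
        if (operatorInstances <= 0) {
            throw new IllegalArgumentException("operatorInstances must be positive");
        }
        List<Set<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < operatorInstances; i++) {
            assignment.add(new LinkedHashSet<>());
        }
        for (int p = 0; p < kafkaPartitions; p++) {
            assignment.get(p % operatorInstances).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(oneToOne(4));     // [[0], [1], [2], [3]]
        System.out.println(oneToMany(8, 3)); // [[0, 3, 6], [1, 4, 7], [2, 5]]
    }
}
```

With 8 Kafka partitions and 3 instances, the last instance ends up with one partition fewer than the others, which is the kind of imbalance a customized strategy might want to control.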

Customized Partition Strategy

public abstract class AbstractKafkaPartitioner {
    ...
    abstract List<Set<PartitionMeta>> assign(Map<String, Map<String, List<PartitionInfo>>> metadata)
    ...
    void partitioned(Map<Integer, Partition<AbstractKafkaInputOperator>> map)
    ...
    Response processStats(BatchedOperatorStats batchedOperatorStats)
}
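As a rough idea of what a customized assign step might look like, the sketch below groups all partitions of a topic onto one operator instance. It does not extend the real class: the types are plain-Java stand-ins, partitions are encoded as "cluster:topic:id" strings, and reading the nested map as cluster name, then topic name, is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for a custom assign(): one operator instance per topic.
class PerTopicAssignSketch {

    // metadata is assumed to be: cluster name -> topic name -> partition ids.
    // Each returned Set lists the partitions one operator instance would consume.
    static List<Set<String>> assign(Map<String, Map<String, List<Integer>>> metadata) {
        List<Set<String>> result = new ArrayList<>();
        // Sorted copies keep the output order deterministic.
        Map<String, Map<String, List<Integer>>> clusters = new TreeMap<>(metadata);
        for (Map.Entry<String, Map<String, List<Integer>>> cluster : clusters.entrySet()) {
            Map<String, List<Integer>> topics = new TreeMap<>(cluster.getValue());
            for (Map.Entry<String, List<Integer>> topic : topics.entrySet()) {
                Set<String> oneInstance = new TreeSet<>();
                for (int partition : topic.getValue()) {
                    oneInstance.add(cluster.getKey() + ":" + topic.getKey() + ":" + partition);
                }
                result.add(oneInstance);
            }
        }
        return result;
    }
}
```

Returning one set per operator instance mirrors the List<Set<PartitionMeta>> shape of the signature on the slide; everything else here is illustrative.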

Partition Strategy (Cont’d)

● Sticky partition (each operator instance consumes only from the Kafka partitions assigned to it by the AM) is the BEST practice!

Offset Checkpointing

[Diagram: a stream of offsets split into windows, where W_i is the last offset in window i. Labels: current offset; downstream operator window; checkpointed offsets with window id (windows i, j, k). The operator can resume from the offsets of any of the checkpointed windows.]
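The idea of keeping offsets per window id and resuming from any recorded window can be sketched as follows. The class, its method names, and the use of partition-name strings are all hypothetical; the real operator relies on Apex checkpointing rather than a standalone map like this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "checkpointed offsets with window id".
class WindowOffsetCheckpointSketch {

    // window id -> (partition name -> last offset consumed in that window)
    private final TreeMap<Long, Map<String, Long>> offsetsByWindow = new TreeMap<>();

    // Record the last offsets seen when a window ends.
    void endWindow(long windowId, Map<String, Long> lastOffsets) {
        offsetsByWindow.put(windowId, new HashMap<>(lastOffsets));
    }

    // Offsets to resume from after recovering to the given window:
    // the record for that window, or the closest earlier one if it is missing.
    Map<String, Long> resumeFrom(long windowId) {
        Map.Entry<Long, Map<String, Long>> entry = offsetsByWindow.floorEntry(windowId);
        if (entry == null) {
            return new HashMap<>(); // nothing recorded yet
        }
        return new HashMap<>(entry.getValue());
    }
}
```

Because every recorded window keeps its own offset map, recovery to window i, j, or k only needs a lookup by window id.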

Offset Commitment (0.8 Operator)

[Diagram: W_i is the last offset in window i. Labels: current offset; commit window i; report to AM; Application Master; Offset Manager.]


Offset Commitment (0.8 Operator)

public interface OffsetManager {
    ...
    public Map<KafkaPartition, Long> loadInitialOffsets();
    ...
    public void updateOffsets(Map<KafkaPartition, Long> offsetsOfPartitions);
}
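To make the contract of these two methods concrete, here is a minimal in-memory sketch. The interface and KafkaPartition are re-declared as local stand-ins (nested in a hypothetical OffsetManagerSketch class) so the example compiles without the operator library; a real implementation would persist offsets somewhere durable instead of a HashMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class OffsetManagerSketch {

    // Hypothetical stand-in for the operator's KafkaPartition type.
    static final class KafkaPartition {
        final String topic;
        final int partitionId;

        KafkaPartition(String topic, int partitionId) {
            this.topic = topic;
            this.partitionId = partitionId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof KafkaPartition)) {
                return false;
            }
            KafkaPartition other = (KafkaPartition) o;
            return partitionId == other.partitionId && topic.equals(other.topic);
        }

        @Override
        public int hashCode() {
            return Objects.hash(topic, partitionId);
        }
    }

    // Local copy of the two methods shown on the slide.
    interface OffsetManager {
        Map<KafkaPartition, Long> loadInitialOffsets();

        void updateOffsets(Map<KafkaPartition, Long> offsetsOfPartitions);
    }

    // Keeps the latest committed offset per partition in memory only.
    static final class InMemoryOffsetManager implements OffsetManager {
        private final Map<KafkaPartition, Long> store = new HashMap<>();

        @Override
        public synchronized Map<KafkaPartition, Long> loadInitialOffsets() {
            return new HashMap<>(store); // defensive copy
        }

        @Override
        public synchronized void updateOffsets(Map<KafkaPartition, Long> offsetsOfPartitions) {
            store.putAll(offsetsOfPartitions);
        }
    }
}
```

Offsets passed to updateOffsets would be read back through loadInitialOffsets on the next start, which is the round trip the interface implies.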

Offset Commitment (0.9 Operator)

[Diagram: W_i is the last offset in window i. Labels: current offset; commit window i. The offset is saved in Kafka, and the offset topic contains the app name.]

Some important properties


0.8 Operator

● initialOffset

● consumer.topic

● consumer.zookeeper

● strategy

● maxTuplesPerWindow

● initialPartitionCount

● offsetManager

0.9 Operator

● initialOffset

● topics

● clusters

● strategy

● maxTuplesPerWindow

● initialPartitionCount

● consumerProps
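Of these properties, maxTuplesPerWindow is the one behind the throughput throttle feature. The sketch below shows one way a per-window cap could work; it is an assumption about the mechanism, not the operator's code, and the class, buffer, and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of a per-window emit cap.
class WindowThrottleSketch {
    private final int maxTuplesPerWindow;
    private final Queue<String> buffer = new ArrayDeque<>();
    private int emittedInWindow;

    WindowThrottleSketch(int maxTuplesPerWindow) {
        if (maxTuplesPerWindow < 0) {
            throw new IllegalArgumentException("maxTuplesPerWindow must not be negative");
        }
        this.maxTuplesPerWindow = maxTuplesPerWindow;
    }

    // Messages fetched from Kafka wait here until they may be emitted.
    void offer(String message) {
        buffer.add(message);
    }

    // A new window resets the budget.
    void beginWindow() {
        emittedInWindow = 0;
    }

    // Emits buffered messages until the window's budget is used up.
    List<String> emitTuples() {
        List<String> out = new ArrayList<>();
        while (emittedInWindow < maxTuplesPerWindow && !buffer.isEmpty()) {
            out.add(buffer.poll());
            emittedInWindow++;
        }
        return out;
    }
}
```

Messages beyond the cap stay buffered and go out in later windows, so the cap bounds the rate without dropping data.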

MapR Streams support


● MapR Streams is compatible with the 0.9 Kafka client API

● The 0.9 Input Operator has been tested with the MapR sandbox, and all major features work without any code change

● Use the MapR Streams client library instead of the Kafka one

● Leave the “clusters” property empty because MapR doesn’t require broker host name settings

● The special character “/” is supported in topic names because a MapR Streams topic name is just the path to the topic file

● Multi-cluster is not supported

Performance: Kafka Input Operator

● 4 Kafka Brokers - 8 partitions

● 1 Zookeeper

● Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

● 256GB RAM

● 10 GigE between nodes

● Uses the Yahoo streaming benchmark application (https://github.com/yahoo/streaming-benchmarks)

● 940,567 msg/s at 245 bytes/msg for the 0.8 Input Operator

● 850,000 msg/s at 245 bytes/msg for the 0.9 Input Operator
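The message rates above can be converted to byte and bit rates with simple arithmetic, which helps compare them with the 10 GigE links. The sketch below only multiplies the figures from the slide; the class and method names are hypothetical, and the slide does not say whether the rates are aggregate or per operator instance.

```java
// Hypothetical helper that converts the slide's message rates to bandwidth.
class ThroughputArithmeticSketch {

    static long bytesPerSecond(long messagesPerSecond, long bytesPerMessage) {
        return messagesPerSecond * bytesPerMessage;
    }

    static double gigabitsPerSecond(long bytesPerSecond) {
        return bytesPerSecond * 8 / 1e9;
    }

    public static void main(String[] args) {
        long v08 = bytesPerSecond(940_567, 245); // 230,438,915 bytes/s
        long v09 = bytesPerSecond(850_000, 245); // 208,250,000 bytes/s
        System.out.println("0.8 operator: " + v08 + " bytes/s, " + gigabitsPerSecond(v08) + " Gbit/s");
        System.out.println("0.9 operator: " + v09 + " bytes/s, " + gigabitsPerSecond(v09) + " Gbit/s");
    }
}
```

That works out to roughly 230 MB/s (about 1.84 Gbit/s) for the 0.8 operator and about 208 MB/s (about 1.67 Gbit/s) for the 0.9 operator.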

Q & A

Follow Apex meetups: http://apex.incubator.apache.org/announcements.html

Learn more about Apex: http://apex.incubator.apache.org/docs.html
