Druid, or There and Back Again

Is it about a woods creature? Nope. It’s about…

a data store designed for OLAP queries on event (time-series) data.

2

Boring facts
• Open-source, community-driven project

• ~3400 stars, ~7300 commits on GitHub ATM

• Written in Java

• Very modular / extensible thanks to Guice

3

What does it do?
• Ingests streaming / batch time-series data and splits it into segments, each representing a configured time interval

• During ingestion, it performs aggregation using the provided algorithms to create metrics

• Allows you to perform several types of queries over the served segments via a nice HTTP API (this is where you do simple aggregations over metrics; see the sketch below)
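For illustration, here is a minimal sketch of such a query: a timeseries aggregation POSTed to a Broker assumed to be listening on the default localhost:8082, against a hypothetical "events" datasource with a "value" metric (datasource name, metric name and interval are made up).

```python
# Minimal sketch of a Druid timeseries query over metrics.
# Assumptions: a Broker on localhost:8082 (default port) serving a
# hypothetical "events" datasource that has a "value" metric.
import json
import urllib.request

query = {
    "queryType": "timeseries",
    "dataSource": "events",                  # hypothetical datasource
    "granularity": "hour",
    "intervals": ["2017-04-01/2017-04-02"],  # made-up interval
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "doubleSum", "name": "value_sum", "fieldName": "value"},
    ],
}

req = urllib.request.Request(
    "http://localhost:8082/druid/v2/?pretty",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Prints one aggregated row per hour of the requested interval.
print(urllib.request.urlopen(req).read().decode("utf-8"))
```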

4

How does it work?

5

How does it work?
Druid consists of several types of services (hereinafter, nodes) which together make up a complete working system:

• Broker

• Historical

• Real-time

• Coordinator

• Overlord, Middle Manager, Peons, etc. (aka boring nodes)

6

How does it work?
It also has three external dependencies:

• Apache ZooKeeper for configuration management, leader election and data flow organisation (e.g. Coordinator / Historical communication)

• Metadata Storage, such as MySQL or PostgreSQL, used to store (guess what?) various metadata about the system

• Deep Storage, where all the compressed data is stored (S3, HDFS, local FS)

7

Nodes: Real-time
It is responsible for ingestion of streaming data.

It also exposes an HTTP API for querying, so data is available right after processing / aggregation.

That is how Druid allows querying of real-time* data.

*: by real-time here we mean something which happened ~0-15 seconds ago.

8

Nodes: Historical
This guy is responsible for loading data (hereinafter, segments) from Deep Storage and serving it via an HTTP API (same as the Real-time dude).

The Historical Node uses ZooKeeper to learn which segments it should load.

It uncompresses segment data and caches it locally on FS.

9

Nodes: Coordinator
The node which manages segments and coordinates their distribution across Historical Nodes. It uses ZooKeeper for:

• getting current cluster state

• assigning segments to Historical Nodes

Loading and dropping of segments is managed via Rules stored in the Metadata Storage, as sketched below.
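As a rough sketch of what such Rules look like, assuming a Coordinator on the default localhost:8081 and the same hypothetical "events" datasource: keep the last 30 days loaded on Historical Nodes and drop everything older.

```python
# Minimal sketch: posting load/drop Rules for a datasource.
# Assumptions: Coordinator on localhost:8081 (default port),
# hypothetical "events" datasource.
import json
import urllib.request

rules = [
    # Keep the last 30 days, 2 replicas in the default tier.
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"_default_tier": 2}},
    # Drop everything older than that.
    {"type": "dropForever"},
]

req = urllib.request.Request(
    "http://localhost:8081/druid/coordinator/v1/rules/events",
    data=json.dumps(rules).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)  # the Coordinator persists the rules in Metadata Storage
```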

10

Nodes: Broker
This is the only node which is usually touched by client applications, because it exposes the HTTP API for querying segment data.

It splits an incoming query into smaller ones based on the segment information stored in ZooKeeper and queries the corresponding Historical and Real-time Nodes.

11

Nodes: Overlord
The Indexing Service powers Druid’s batch data ingestion. It consists of three node types: Overlord, Middle Manager and Peon.

The Overlord accepts indexing tasks and distributes them to Middle Manager Nodes. It communicates with the latter through ZooKeeper.

It provides a simple HTTP API to create, shut down and view the status of indexing tasks (see the sketch below).
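A minimal sketch of that API, assuming an Overlord on the default localhost:8090 and an already-prepared batch ingestion spec in a local file; "index_task.json" is a placeholder name.

```python
# Minimal sketch: submit an indexing task to the Overlord and poll its status.
# Assumptions: Overlord on localhost:8090 (default port); "index_task.json"
# is a placeholder for whatever batch ingestion spec you have prepared.
import json
import urllib.request
from urllib.parse import quote

with open("index_task.json", "rb") as f:
    task_spec = f.read()

req = urllib.request.Request(
    "http://localhost:8090/druid/indexer/v1/task",
    data=task_spec,
    headers={"Content-Type": "application/json"},
)
task_id = json.loads(urllib.request.urlopen(req).read())["task"]

# View the task status (RUNNING, SUCCESS, FAILED, ...).
status_url = ("http://localhost:8090/druid/indexer/v1/task/"
              + quote(task_id, safe="") + "/status")
print(urllib.request.urlopen(status_url).read().decode("utf-8"))
```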

12

Nodes: Middle Manager
The Middle Manager executes submitted indexing tasks. It creates separate JVMs, i.e. Peon Node processes, for this. Each Peon runs one task at a time.

The MM retrieves indexing tasks from ZooKeeper. It then stores each indexing task as a JSON file and runs a Peon, providing it with the path to the task file. After processing, the Peon stores segment data in Deep Storage.

13

Live example
• Run all nodes locally

• Local FS is used for Deep Storage

• Derby is used for Metadata Storage

• ZooKeeper … well, it’s ZooKeeper, gotta run it

• Kafka 0.8.7.6.5.4.3.2.1 for streaming data <3

• Various Bash and Node scripts for loading data

• Imply Pivot for visualisation of queried results

14

Problems?
Everybody can tell you the good sides of Druid. Issues time!

• DevOps effort is needed to spin up all the Nodes and dependencies (Did we clone Dmytro already?)

• Limited number of aggregations

• Druid loves cookies… no, space, it needs space, more space, and it’s never enough anyway.

15

Thank you

Questions?

Fun time?

Beer for me?

16

