Elasticsearch Data Analyses

transcript

Elasticsearch

Elasticsearch Timed Data Analyses

By Alaa Elhadba@aelhadba

Table of Contents

- Hot-Cold Architecture

- Data High Availability

- Data design at large scale

- Search Execution

- Time framed indices

- Aggregations

Hot-Cold Architecture

Hot Data Nodes

Perform indexingHold most recent dataUse SSD storage, Writing is an Intensive IO operation

Cold Data Nodes

Handle read only operationsCan use large spinning disks

Hot-Cold Configuration

node.box_type: hot

elasticsearch.yaml

Shard 2

Shard 1

node.box_type: cold

elasticsearch.yaml

Data Availability

Availability Zone 1

Availability Zone 2

Data Availability

Availability Zone 1

Availability Zone 2

Data Availability

Availability Zone 1

Availability Zone 2Availability Zone / Rack failure ? Shard Allocation Awareness

Shard Allocation Awareness

Availability Zone 1

Availability Zone 2

Availability Zone 1

Availability Zone 2

cluster.routing.allocation.awareness.attributes: rack_1

● Data replication is spanned across AZs

● No two copies of same shard on the same rack

● Elasticsearch is fully aware of shard distribution

● Awareness can be set based cluster or index

● Elasticsearch will prefer using local shards

● Always balance your nodes across AZs

● Routing Allocation Awareness can be updated

on a live cluster

Availability Zone 1 Availability Zone 2

on a live cluster

● Use Forced Awareness to avoid the extra load

of reallocation of missing shards

on a live cluster

● Use Forced Awareness to avoid the extra load

of reallocation of missing shards

Make sure you can handle the load with less nodes!

Forced Awareness

● Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.

● Avoid extra of reallocating unassigned shards after rack failure.

● Allow no single point of failure for your system.● Make sure you can handle the load with less nodes.

cluster.routing.allocation.awareness.force.zone.values: zone1,zone2

cluster.routing.allocation.awareness.attributes: rack1,zone1

Data design at large scale

Searching

Shard 4

Shard 2

Result

Shard 3

Shard 1

Searching

Shard 4

Shard 2

Result

Shard 3

Shard 1

How to avoid asking all shards ?

Searching

Shard 4

Shard 2

Result

Shard 3

Shard 1

How to avoid asking all shards ? Routing

I know my shards!

Routing

PUT my_index/my_type/my_id?routing=shard1

GET my_index/_search?routing=shard1,shard2

● Avoid calling all shards● Dedicated shards per purpose● Talk to one dedicated shard● Eliminate Network Traffic● Better Performance● Handle sharding on your own

Routing

PUT my_index/my_type/my_id?routing=shard1

GET my_index/_search?routing=shard1,shard2

● Avoid calling all shards● Dedicated shards per purpose● Talk to one dedicated shard● Eliminate Network Traffic● Better Performance● Handle sharding on your own

But, Once in, Never out● Routing must be always specified

Routing

1 2 3 1 2 3 1 2

21.06.2016 20.06.2016 19.06.2016

Routing

1 2 3 1 2 3 1 2

21.06.2016 20.06.2016 19.06.2016

I MUST KNOW EVERYTHING!

Talking to data

Aliasing

1 2 3 1 2 3 1 2

21.06.2016 20.06.2016 19.06.2016

today yesterday 3_days_ago

Aliasing

1 2 3 1 2 3 1 2

21.06.2016 20.06.2016 19.06.2016

22.06.2016

Aliasing

1 2 3 1 2 3

21.06.2016 20.06.2016

22.06.2016

Aliasing

1 2 3 1 2 3

21.06.2016 20.06.2016

22.06.2016

I MUST KNOW!it’s Better Performance

Aliasing

1 2 3 1 2 3

21.06.2016 20.06.2016

22.06.2016

It’s a Data Problem!

Aliasing + Routing

1 2 3 1 2 3

21.06.2016 20.06.2016

22.06.2016

It’s a Data Problem!

today yesterday 3_days_agotoday_returns recent_returns

Aliasing + Routing + Search

IndexIndex Shard

Shard slice

Search Execution Preference

Elasticsearch targets shards and replicas in round-robin manner. Each shard is queried similarly

_primary Query only primary shards (latest info from index or optimize for writing path)

_primary_first Query primary first in available

_replica Query replica shard only

_replica_first Query replica first in available

_local Query shards available on the current node

_only_node:node_id Query a specific node

_only_nodes:* Query only a set of nodes

_prefer_node:node_id Query a prefered noe

_shards:1,3 e,g _shards:1,3;_local Query specific shards with a preference

PUT _search?preference=_replica

Time Framed Indices

Data Flow

HOT Cold Closed

Backed_up

Trashed

Closing/Opening Index

➔ Closing an index

◆ Removes all shard allocations from the cluster ◆ But keeps the index data around ◆ Helps reduce the resources used on the cluster ◆ Consumes only disk space

➔ Opening an index

◆ Allows to open a closed index ◆ Note, those are not “milliseconds” time operation, opening an index can take a few seconds

to a couple of minutes ◆ Flushing before closing will reduce the opening time

Index Templates

- Order allows you to override other templates

- Settings allows you to scale anytime

- Aliases can be defined on index creation

Index Templates

Time framed indices lifecycle

1. Use Index templates to generate mappings for new indices2. Use aliases to decouple your application from data logic3. Use hot nodes for fresh data4. Move old data to cold nodes5. Close old indices before deletion6. Change your time frame at any point to scale (Monthly, Weekly….)7. Use Routing if you have too many shards in a big cluster

Data Flow

HOT Cold Closed

Backed_up

Trashed

Aggregations

Aggregations Types

Buckets Metrics Pipeline

Nested Bucket Aggregations

Aggregation Query

Better cachingFetch relevant documents

First segmentation

Nested segmentation

Doc Values

- Why do we need this?

- Sorting, Aggregations, Some Scripting

- Doc Values

- Build columnar style data structure on disk

- Created at indexing time, stored as part of the segment

- Read like other pieces of the Lucene index

- Don't take up heap space

- Uses file system cache

- Default for not_analyzed string and numeric fields in 2.0+

Raw Fields

- Use customer_name.raw for aggregations

- Use customer_name for search

Aggregations Types

Buckets Metrics Pipeline

Metrics Aggregations

- Avg Aggregation

- Cardinality Aggregation

- Extended Stats Aggregation

- Max Aggregation

- Min Aggregation

- Percentiles Aggregation

- Percentile Ranks Aggregation

- Scripted Metric Aggregation

- Stats Aggregation

- Sum Aggregation

- Top hits Aggregation

- Value Count Aggregation

Extended Stats Aggregation

Aggregation Search

Shard 4

Shard 2

Result

Shard 3

Shard 1

Scripted Metric Aggregation

- Init_script Executed first. Allows initialization of variables.- map_script Executed once after each document is collected. - combine_script Executed once on each shard after document collection is complete. - reduce_script Executed once on the coordinating node after all shards have returned their results.

Buckets Aggregations

- Children Aggregation

- Date Histogram Aggregation

- Date Range Aggregation

- Filter Aggregation

- Filters Aggregation

- Global Aggregation

- Histogram Aggregation

- Missing Aggregation

- Range Aggregation

- Reverse nested Aggregation

- Sampler Aggregation

- Significant Terms Aggregation

- Terms Aggregation

Date Histogram Aggregation

Date Range Aggregation

Don’t forget!

Round your dates

Missing Aggregations

Range agg

Histogram Aggregation

Pipeline Aggregations

Pipeline

Pipeline Aggregations

Parent

- Able to compute new buckets or new aggregations to a parent aggregation.

Sibling

- Able to compute new buckets or new aggregation on the same level.

Siblings Aggregation

- min_bucket

- max_bucket

- sum_bucket

- avg_bucket

- stats_bucket

- extended_stats_bucket

- percentiles_bucket

Average Aggregation

Parent Pipeline Aggregation

- moving_avg

- derivative

- cumulative_sum

- bucket_script

- bucket_selector

- serial_diff

Cumulative Sum Aggregation

Derivative Aggregation

Moving Average Aggregation

Prediction

Bucket Selector Aggregation

Bucket Script Aggregation

The End

Elasticsearch Data Analyses

Documents