Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica
Search and Time Series Databases
Corso di Sistemi e Architetture per Big Data A.A. 2016/17
Valeria Cardellini
The reference Big Data stack
Valeria Cardellini - SABD 2016/17
Resource Management
Data Storage
Data Processing
High-level Interfaces Support / Integration
Why search platforms?
• How to find documents that match queries?
  – With full-text search, which is faster than in an RDBMS
• How to obtain specific features?
  – Such as highlighting, spatial search, suggestions, guided navigation, …
Search engines
• Most popular search platforms:
  – Apache Solr
  – Elasticsearch
• ETL process
Apache Solr
• Scalable, highly reliable and open-source framework for searching data
• Built on Apache Lucene
  – Open-source library for indexing and search
  – Used by Solr for full-text search
• Can index documents written in XML, JSON, CSV and binary formats
• Runs as a Java Web application
• Provides a REST-like web service that exposes services to manage the lifecycle of documents in the index (indexing, querying, …)
• Used by many popular Web apps (Apple, Instagram, LinkedIn, …)
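The indexing and search that Lucene provides rest on an inverted index, which maps each term to the documents containing it. The following is a minimal sketch of that idea in plain Python, not Lucene's or Solr's actual implementation; the function names and the sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Solr is built on Apache Lucene",
    2: "Elasticsearch is based on Apache Lucene",
    3: "InfluxDB is a time series database",
}
index = build_index(docs)
print(sorted(search(index, "apache lucene")))  # [1, 2]
print(sorted(search(index, "time series")))    # [3]
```

A lookup touches only the posting sets of the query terms, which is why this kind of structure can answer text queries without scanning every document.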
Solr: key features
• Faceting
  – To group the results based on a specific field or defined criteria, providing the count of each subset
  – Example: a shopping site can provide facets to narrow search results by manufacturer or price
• Auto-suggest
  – To present a list of possible query terms
• Spell check
  – To suggest corrected spelling of query terms
• Highlighting
• Document clustering
  – To group related documents in the search results
• Spatial search
  – To filter search results based on location
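Faceting, as described in the list above, amounts to counting the matching documents per field value or per range. The sketch below illustrates this on an in-memory result list; it is not Solr code, and the product records, field names and price ranges are hypothetical.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many matching documents carry each value of `field`."""
    return Counter(doc[field] for doc in results if field in doc)


def range_facet(results, field, boundaries):
    """Count documents per [low, high) range of a numeric field."""
    counts = {}
    for low, high in zip(boundaries, boundaries[1:]):
        counts[(low, high)] = sum(
            1 for doc in results if low <= doc[field] < high
        )
    return counts


# Hypothetical search results for a shopping site.
results = [
    {"name": "Laptop A", "manufacturer": "Acme", "price": 900},
    {"name": "Laptop B", "manufacturer": "Acme", "price": 1200},
    {"name": "Laptop C", "manufacturer": "Globex", "price": 700},
]

by_manufacturer = facet_counts(results, "manufacturer")
by_price = range_facet(results, "price", [0, 1000, 2000])
print(dict(by_manufacturer))  # {'Acme': 2, 'Globex': 1}
print(by_price)               # {(0, 1000): 2, (1000, 2000): 1}
```

The counts are what a shopping site would display next to each manufacturer or price band so that the user can narrow the result set.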
Solr: key features
• Pagination and ranking of search results
• Results grouping
  – To group the results based on a grouping field and return the top documents in each group
• Near real-time search
  – To search documents immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
• More Like This
  – To identify other documents that are similar to one in a result set
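Pagination and results grouping, from the list above, can be sketched on a scored result list as follows. This is an illustrative in-memory version, not how Solr implements these features; the hit records and their "score" and "category" fields are assumptions.

```python
from collections import defaultdict


def rank(results):
    """Order hits by descending relevance score."""
    return sorted(results, key=lambda doc: doc["score"], reverse=True)


def paginate(ranked, start, rows):
    """Return one page of the ranked results."""
    return ranked[start:start + rows]


def group_top(results, group_field, limit):
    """Return the ids of the top `limit` hits in each group."""
    groups = defaultdict(list)
    for doc in rank(results):
        bucket = groups[doc[group_field]]
        if len(bucket) < limit:
            bucket.append(doc["id"])
    return dict(groups)


# Hypothetical scored hits.
hits = [
    {"id": 3, "score": 0.7, "category": "news"},
    {"id": 1, "score": 0.9, "category": "news"},
    {"id": 4, "score": 0.6, "category": "news"},
    {"id": 2, "score": 0.8, "category": "sports"},
]

second_page = [doc["id"] for doc in paginate(rank(hits), start=2, rows=2)]
grouped = group_top(hits, "category", limit=2)
print(second_page)  # [3, 4]
print(grouped)      # {'news': [1, 3], 'sports': [2]}
```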
Solr feature example
Solr components
Solr components
• Request Handlers: handle a request at a URL
  – E.g.: /select
• Search Components: part of a Search Handler, a componentized request handler
  – Includes: Query, Faceting, Highlighting, Debug, …
  – Distributed Search capable
• Update Handlers: handle an indexing request
• Update Processors chain: per-handler componentized chain that handles updates
• Query Parser plugins
  – Mix and match query types in a single request
  – Function plugins for Function Query
• Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
• Response Writers: serialize and stream response to client
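A request to the /select handler is an HTTP GET whose query parameters choose the search components and the response writer. The sketch below only builds such a URL and sends nothing; the host, the collection name "products" and the field names are assumptions, while q, start, rows, wt, facet and facet.field are standard Solr request parameters.

```python
from urllib.parse import urlencode


def select_url(base, collection, query, facet_field=None, start=0, rows=10):
    """Build the URL of a query against the /select request handler."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Enable the faceting search component for one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical server address, collection and fields.
url = select_url(
    "http://localhost:8983", "products", "name:laptop",
    facet_field="manufacturer",
)
print(url)
```

Here wt=json selects the JSON response writer, and start/rows drive pagination of the ranked results.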
Scaling Solr: SolrCloud
• How to provide distributed indexing and search capabilities?
  – Up to millions of users and millions of indexed documents
• SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers
  – Enables and simplifies horizontal scaling of a search index through replication and sharding
  – Sharding: incoming queries are distributed to the shards in the collection, and the partial results are merged into a single response
  – Replication: to handle higher concurrent query load by spreading the requests to multiple servers
• No master node to allocate nodes, shards and replicas
• SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
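The sharding behaviour described above follows a scatter-gather pattern: every shard answers the query over its own slice of the index, and the partial results are merged into one ranked list. The sketch below illustrates the pattern only; the term-frequency scoring, the shard contents and all names are assumptions, not SolrCloud internals.

```python
import heapq


def search_shard(shard_docs, term, k):
    """Return this shard's top-k (score, id) hits for the term."""
    hits = []
    for doc in shard_docs:
        # Toy relevance score: number of occurrences of the term.
        score = doc["text"].lower().split().count(term)
        if score > 0:
            hits.append((score, doc["id"]))
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k):
    """Query every shard, then merge the partial results into a global top-k."""
    partial = [search_shard(docs, term, k) for docs in shards]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _score, doc_id in merged]


# Hypothetical collection split into two shards.
shards = [
    [{"id": "a", "text": "solr solr cloud"}, {"id": "b", "text": "zookeeper"}],
    [{"id": "c", "text": "solr search"}, {"id": "d", "text": "solr solr solr"}],
]
print(distributed_search(shards, "solr", k=2))  # ['d', 'a']
```

Because each shard returns at most k hits, the merge step handles at most k times the number of shards candidates regardless of index size.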
Solr distributed architecture
Elasticsearch
• Distributed, multitenant-capable and scalable full-text search engine with REST-based interface and schema-free JSON documents
• Search engine based on Apache Lucene
• Developed in Java
• Distributed
  – Indices can be divided into shards and each shard can have zero or more replicas
  – Each server hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s)
  – Rebalancing and routing are done automatically
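Routing a document to "the correct shard" is typically done by hashing a routing key (by default the document id) modulo the number of primary shards. The sketch below illustrates that idea together with a simple placement of shard copies on nodes. It uses CRC32 as a stand-in hash and a round-robin placement, both of which are assumptions for illustration and not Elasticsearch's actual hash function or allocation algorithm.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard from a hash of the routing key (CRC32 as a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_copies(num_shards, replicas, nodes):
    """Assign each shard's primary and replicas to distinct nodes."""
    if replicas >= len(nodes):
        raise ValueError("need more nodes than replicas per shard")
    return {
        shard: [nodes[(shard + copy) % len(nodes)] for copy in range(replicas + 1)]
        for shard in range(num_shards)
    }


# Hypothetical cluster: 3 primary shards, 1 replica each, 3 nodes.
layout = place_copies(num_shards=3, replicas=1, nodes=["n1", "n2", "n3"])
print(layout)  # {0: ['n1', 'n2'], 1: ['n2', 'n3'], 2: ['n3', 'n1']}

shard = shard_for("doc-42", 3)
print(f"doc-42 is stored on shard {shard}, hosted by {layout[shard]}")
```

Any node can act as coordinator: it computes the shard for the request and forwards the operation to a node that hosts a copy of that shard.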
Elastic (ELK) Stack
• Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack, previously known as ELK)
• Logstash
  – Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
• Kibana
  – Data visualization platform
Solr vs. Elasticsearch
• Solr
  – Mature, widely deployed product
  – Active and large developer community
  – Provides a highly detailed functional environment; a wide range of plug-ins is available
• Elasticsearch
  – Newer, but already very widely used
  – Focus on extracting value from data generally, and not just on search
  – Part of ELK stack
  – Schema-free and document-oriented
• Elasticsearch vs Solr on Google Trends
Time series database (TSDB)
• How to analyze DevOps monitoring, application metrics, IoT sensor data?
  – Time series databases (TSDBs) provide an effective and lightweight solution
• Optimized for handling high-volume time series data
  – Time series: a sequence of data points (arrays of numbers) indexed by time (a datetime or a datetime range), e.g.:
    • Time series of stock prices (price curve)
    • Time series of energy consumption (load profile)
    • Log of temperature values (temperature trace)
• Optimized for providing complex logic to analyze time series data
  – Queries for historical data, replete with time ranges, rollups and arbitrary time zone conversions, are difficult in a traditional DBMS
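A rollup, as mentioned above, replaces the raw points of each fixed time window with one aggregated value. The sketch below shows the idea on a list of (timestamp, value) pairs; the function name, the sample load profile and the 60-second window are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def rollup(points, window_seconds, aggregate=mean):
    """Aggregate (timestamp, value) points into fixed-size time windows."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of the window it falls into.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


# Hypothetical load profile: (seconds since start, kW).
load_profile = [(0, 1.0), (20, 3.0), (59, 2.0), (60, 4.0), (119, 6.0), (180, 5.0)]

print(rollup(load_profile, 60))       # [(0, 2.0), (60, 5.0), (180, 5.0)]
print(rollup(load_profile, 60, max))  # [(0, 3.0), (60, 6.0), (180, 5.0)]
```

Windows with no data (here the one starting at 120) simply produce no output row in this sketch; a real system may instead fill such gaps.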
TSDB: overview
• Create, enumerate, update and destroy various time series and organize them in some fashion
  – Series may be organized hierarchically and have companion metadata
  – Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
  – Filter on arbitrary patterns (e.g., day of the week, low value, high value)
  – Provide additional statistical functions that are targeted to time series data
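Two of the operations listed above, combining series into a new series and filtering on a pattern such as the day of the week, can be sketched with plain dictionaries keyed by timestamp. The series names and values below are hypothetical, and a real TSDB would offer these as built-in query functions rather than user code.

```python
import operator
from datetime import datetime


def combine(series_a, series_b, op):
    """Apply `op` point-wise on the timestamps present in both series."""
    common = sorted(series_a.keys() & series_b.keys())
    return {ts: op(series_a[ts], series_b[ts]) for ts in common}


def filter_weekday(series, weekday):
    """Keep the points that fall on the given weekday (Monday is 0)."""
    return {ts: v for ts, v in series.items() if ts.weekday() == weekday}


def filter_values(series, predicate):
    """Keep the points whose value satisfies the predicate."""
    return {ts: v for ts, v in series.items() if predicate(v)}


# Hypothetical daily energy production series (2015-06-11 was a Thursday).
solar = {datetime(2015, 6, 11): 3.0, datetime(2015, 6, 12): 4.0}
wind = {
    datetime(2015, 6, 11): 1.5,
    datetime(2015, 6, 12): 2.0,
    datetime(2015, 6, 13): 2.5,
}

total = combine(solar, wind, operator.add)
fridays = filter_weekday(total, 4)
low = filter_values(total, lambda v: v < 5.0)
print(total)    # totals 4.5 and 6.0 for 11 and 12 June
print(fridays)  # only the 12 June point
print(low)      # only the 11 June point
```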
TSDB: some products
• Some open-source products
  – CrateDB https://crate.io
  – Chronix http://www.chronix.io
  – Graphite https://graphiteapp.org
    • Stores numeric time-series data and renders graphs of this data on demand
  – InfluxDB https://www.influxdata.com
  – KairosDB https://kairosdb.github.io
    • Stores its time series in Cassandra
  – OpenTSDB http://opentsdb.net
    • Stores its time series in HBase
  – Riak-TS http://basho.com/products/riak-ts/
InfluxDB
• Written in Go
• Supports high write loads and large data set storage
• Conserves space through downsampling and by automatically expiring and deleting unwanted data
  – Also supports backup and restore
• Provides easy-to-use SQL-like query language for interacting with data
• Provides simple, high performing write and query HTTP(S) APIs, e.g.:
  – To create a database
      curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
  – To write data
      curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
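The two HTTP calls above can also be prepared from a program. The sketch below builds the equivalent requests with Python's standard library and deliberately does not send them; the helper names are assumptions, and actually sending the requests would require a running InfluxDB 1.x server at the given address.

```python
from urllib.parse import urlencode
from urllib.request import Request


def create_database_request(base_url, name):
    """Build the POST to /query that creates a database."""
    body = urlencode({"q": f"CREATE DATABASE {name}"}).encode("utf-8")
    return Request(f"{base_url}/query", data=body, method="POST")


def write_request(base_url, db, line):
    """Build the POST to /write that stores one point given as a text line."""
    url = f"{base_url}/write?{urlencode({'db': db})}"
    return Request(url, data=line.encode("utf-8"), method="POST")


base = "http://localhost:8086"
create = create_database_request(base, "mydb")
write = write_request(
    base, "mydb",
    "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000",
)
print(create.get_method(), create.full_url, create.data)
print(write.get_method(), write.full_url)
# To send a request (needs a reachable server):
#   urllib.request.urlopen(write)
```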
InfluxDB datastore
• Data organized by time series, which contain a measured value, like “cpu_load” or “temperature”
• Time series have zero to many points, one for each discrete sample of the metric
• Points consist of:
  – time (a timestamp)
  – a measurement (e.g., “cpu_load”)
  – at least one key-value field (the measured value itself, e.g., “value=0.64” or “temperature=21.2”)
  – zero to many key-value tags containing any metadata about the value (e.g., “host=server01”, “region=EMEA”, “dc=Frankfurt”)
InfluxDB datastore
• General format of points:
    <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]
• Examples of points:
  – cpu,host=serverA,region=us_west value=0.64
  – payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
  – stock,symbol=AAPL bid=127.46,ask=127.48
  – temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
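The general point format above can be produced by a small serializer. The following is a simplified sketch under stated assumptions: the function name is hypothetical, tags are emitted sorted by key, and escaping of spaces, commas and equals signs inside names and tag values is omitted, so it is only suitable for simple identifiers like those in the examples.

```python
def format_point(measurement, fields, tags=None, timestamp_ns=None):
    """Serialize one point as <measurement>[,tags] <fields> [timestamp]."""
    if not fields:
        raise ValueError("a point needs at least one field")

    def field_value(value):
        if isinstance(value, bool):   # check bool first: bool is a subclass of int
            return "true" if value else "false"
        if isinstance(value, int):    # integers carry an 'i' suffix
            return f"{value}i"
        if isinstance(value, float):
            return repr(value)
        return '"' + str(value).replace('"', '\\"') + '"'  # strings are quoted

    line = measurement
    for key, value in sorted((tags or {}).items()):
        line += f",{key}={value}"
    line += " " + ",".join(f"{k}={field_value(v)}" for k, v in fields.items())
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


print(format_point("cpu", {"value": 0.64},
                   tags={"host": "serverA", "region": "us_west"}))
# cpu,host=serverA,region=us_west value=0.64
print(format_point("payment", {"billed": 33.0, "licenses": 3},
                   tags={"device": "mobile", "product": "Notepad", "method": "credit"},
                   timestamp_ns=1434067467100293230))
# payment,device=mobile,method=credit,product=Notepad billed=33.0,licenses=3i 1434067467100293230
```

When the timestamp is omitted, as in the first call, the server assigns its own time to the point.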
InfluxDB datastore
• A measurement is like a SQL table, where the primary index is time
• Compared with a traditional DBMS:
  – No need to define schemas up-front
  – Null values are not stored
InfluxDB stack
• Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)
References
• Apache Solr Reference Guide, http://bit.ly/2scksQF
• InfluxDB Version 1.2 Documentation, http://bit.ly/2ryagFT
• Dunning and Friedman, “Time Series Databases”, O’Reilly, 2015.