keyvi the key value index @ Cliqz

Post on 14-Apr-2017


keyvi - the key value index

Or

How did we build a large scale low-latency search-engine with keyvi?

Hendrik Muhs <hendrik.muhs@gmail.com>

BASED IN MUNICH

MAJORITY-OWNED BY HUBERT BURDA MEDIA

INTERNATIONAL TEAM OF 90 EXPERTS

WE COMBINE THE POWER OF DATA, SEARCH, AND BROWSERS TO REDESIGN THE INTERNET FOR THE USER

WE REDESIGN THE INTERNET

http://cliqz.com/

A key-value index based on finite state machines, so basically an immutable key-value store.

Licence: Apache 2.0 (just keyvi, 3rdparty)

Language: C++ (core), Python (binding)

Runs on: Linux, MacOSX (not tested on Windows)

Link: www.keyvi.org

Author: me ;-)

Cliqz Search Backend

Elasticsearch used in the early days (2014)

→ Redis with our own cluster implementation (before Redis Cluster), at peak over 100 Redis instances in 1 cluster, > 5 TB of data, all on AWS

→ keyvi: drop-in replacement for Redis, significantly reduced size (2 TB) and number of machines

! Whether redis or keyvi: average latency of 55ms at backend !

Why replace redis?

Size

extremely efficient at storing values

low-level access: msgpack & Redis fork to compress even more (zlib)

implementation of auto-completion is expensive and slow

Runtime

single threaded → contention, queuing, timeouts

Persistence

memory only, loading times of several minutes

Why replace redis?

→ Redis is great! We still use it a lot! But for 1 of our - and only 1 of our - use cases, we can do better!

started as auto-completion engine

caching layer for Redis

now providing the complete index (>2 TB)

distributed across multiple machines

multi-process, fast, reliable, stable


shared memory model (mmap)

multi-core, reliable, no loading (un-serializing)

space efficient

compact key-space, FSA minimization

BUT:

keyvi is an immutable store, hence an "index" (as is the underlying data structure of Lucene)
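To make "compact key-space, FSA minimization" concrete, here is a minimal sketch in plain Python. It is not keyvi's implementation: the dict-based states and the helper names (`build_trie`, `minimize`, `count_states`, `contains`) are assumptions made for illustration. It builds an ordinary trie and then merges states that accept the same set of suffixes, which is the idea behind a minimized automaton.

```python
def build_trie(words):
    """Build a plain trie; each state has a 'final' flag and outgoing 'edges'."""
    root = {"final": False, "edges": {}}
    for word in words:
        state = root
        for ch in word:
            state = state["edges"].setdefault(ch, {"final": False, "edges": {}})
        state["final"] = True
    return root


def count_states(root):
    """Count the distinct state objects reachable from the root."""
    seen = set()
    stack = [root]
    while stack:
        state = stack.pop()
        if id(state) in seen:
            continue
        seen.add(id(state))
        stack.extend(state["edges"].values())
    return len(seen)


def minimize(state, registry):
    """Merge equivalent states bottom-up.

    Two states are equivalent when they agree on 'final' and every
    outgoing edge leads to the same, already minimized, state.
    """
    for ch, child in list(state["edges"].items()):
        state["edges"][ch] = minimize(child, registry)
    signature = (
        state["final"],
        tuple(sorted((ch, id(child)) for ch, child in state["edges"].items())),
    )
    # Reuse the first state seen with this signature, otherwise register this one.
    return registry.setdefault(signature, state)


def contains(root, word):
    """Exact lookup: follow the edges and check that we end in a final state."""
    state = root
    for ch in word:
        state = state["edges"].get(ch)
        if state is None:
            return False
    return state["final"]


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
before = count_states(trie)
minimized = minimize(trie, {})
after = count_states(minimized)
print(before, after)  # 8 5
```

The shared suffixes ("p", "ps") are stored once after minimization, while lookups behave exactly as before. Because states are shared, the structure cannot be updated in place cheaply, which fits the "immutable" trade-off named above.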

vs. Redis

Workflow has 2 steps:

compile/build the index using keyvicompiler or via the Python bindings

dump/query using the C++ or Python API

Note: There is no SegmentWriter/Merger/Reader (yet)!
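The two-step workflow and the mmap model can be sketched with a toy store. This is an assumption-laden illustration, not keyvi's API or file format: `compile_index`, `ToyIndex`, and the sorted-table layout are hypothetical (keyvi stores a finite state structure instead). Step 1 writes an immutable file once; step 2 maps it read-only and searches it in place, so nothing is un-serialized at startup and several processes can share the same pages.

```python
import mmap
import os
import struct
import tempfile


def compile_index(pairs, path):
    """Step 1: write all key/value pairs, sorted by key, into one file."""
    items = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in pairs.items())
    header_size = 4 + 4 * len(items)  # record count + one offset per record
    offsets = []
    records = b""
    for key, value in items:
        offsets.append(header_size + len(records))
        records += struct.pack("<HI", len(key), len(value)) + key + value
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(items)))
        f.write(struct.pack("<%dI" % len(items), *offsets))
        f.write(records)


class ToyIndex:
    """Step 2: query the compiled file through a read-only memory map."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (self._count,) = struct.unpack_from("<I", self._map, 0)

    def _record(self, i):
        (offset,) = struct.unpack_from("<I", self._map, 4 + 4 * i)
        key_len, value_len = struct.unpack_from("<HI", self._map, offset)
        start = offset + 6  # skip the 2 + 4 byte length fields
        key = self._map[start:start + key_len]
        value = self._map[start + key_len:start + key_len + value_len]
        return key, value

    def get(self, key, default=None):
        """Binary search over the sorted records."""
        target = key.encode("utf-8")
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            found, value = self._record(mid)
            if found == target:
                return value.decode("utf-8")
            if found < target:
                lo = mid + 1
            else:
                hi = mid
        return default

    def close(self):
        self._map.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "toy.idx")
compile_index({"munich": "de", "paris": "fr", "rome": "it"}, path)
index = ToyIndex(path)
print(index.get("paris"))   # fr
print(index.get("berlin"))  # None
```

Opening the index only reads a 4-byte header; the operating system pages in the rest on demand.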

Usage

exact matching / simple entity recognition:

values can be None, integer, string or JSON

approximate matching:

close/near match e.g. for Geo applications

scoring based: Levenshtein & Co

completion matching:

prefix, multi-word, fuzzy
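The matching modes above can be sketched over a plain sorted list of keys. This is only an illustration under simplifying assumptions: the function names are hypothetical, and the fuzzy match scans every key, whereas a finite state index can prune candidates while traversing the automaton.

```python
import bisect


def complete_prefix(sorted_keys, prefix, limit=10):
    """Completion matching: all keys that start with the given prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    results = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(results) >= limit:
            break
        results.append(key)
    return results


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def approximate_matches(keys, query, max_distance):
    """Approximate matching: keys within max_distance edits, best first."""
    scored = ((levenshtein(query, key), key) for key in keys)
    return sorted((d, k) for d, k in scored if d <= max_distance)


keys = sorted(["munich", "munster", "paris", "prague", "rome"])
print(complete_prefix(keys, "mun"))            # ['munich', 'munster']
print(approximate_matches(keys, "pariz", 1))   # [(1, 'paris')]
```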

more on Features

it's fast! extremely fast!

it scales:

it's compact/small, which enables indexing GBs of data

it brings FSTs to the level of more established data structures like hash tables and B-trees on one side …

… and enables applications that are not, or hardly, possible with them (completions, approximate matching, etc.)

the gist

http://www.keyvi.org

Lots of content, from crash course to in-depth

check it out!

Questions?

Comments!

Feedback.

Contact: hendrik.muhs@gmail.com
