keyvi the key value index @ Cliqz


keyvi - the key value index

How did we build a large scale low-latency search-engine with keyvi?

Hendrik Muhs <hendrik.muhs@gmail.com>

BASED IN MUNICH

MAJORITY-OWNED BY HUBERT BURDA MEDIA

INTERNATIONAL TEAM OF 90 EXPERTS

WE COMBINE THE POWER OF DATA, SEARCH, AND BROWSERS TO REDESIGN THE INTERNET FOR THE USER

WE REDESIGN THE INTERNET

http://cliqz.com/

Key-value index based on finite-state machines, so basically an immutable key-value store.

Licence: Apache 2.0 (just keyvi, 3rdparty)

Language: C++ (core), Python (binding)

Runs on: Linux, MacOSX (not tested on Windows)

Link: www.keyvi.org

Author: me ;-)

Cliqz Search Backend

Elasticsearch used in the early days (2014)

→ Redis: own cluster implementation (before Redis Cluster), at peak over 100 Redis instances in 1 cluster, > 5 TB of data, all on AWS

→ keyvi: drop-in replacement for Redis, significantly reduced size (2 TB) and number of machines

! Whether Redis or keyvi: average latency of 55 ms at backend !

Why replace Redis?

extremely efficient storing values

low-level access: msgpack & Redis fork to compress even more (zlib)

implementation of auto-completion is expensive and slow

Runtime

single threaded → contention, queuing, timeouts

Persistence

memory only, loading times of several minutes

Why replace Redis?

→ Redis is great! We still use it a lot! But for 1 of our use cases (and only 1), we can do better!

started as auto-completion engine

caching layer for Redis

now providing the complete index (>2 TB)

distributed across multiple machines

multi-process, fast, reliable, stable

shared memory model (mmap)

multi-core, reliable, no loading (un-serializing)

space efficient

compact key-space, FSA minimization
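The shared memory model above can be illustrated with a minimal sketch. It is not keyvi's file format or API: the fixed-width record layout and the `build`/`lookup` helpers are hypothetical, chosen only to show how a read-only memory-mapped file can be queried in place, without a loading or un-serializing step, and shared by several processes through the OS page cache.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 16-byte zero-padded key, unsigned 32-bit value.
RECORD = struct.Struct("<16sI")


def build(path, items):
    """Write records sorted by key so lookups can binary-search the file."""
    with open(path, "wb") as f:
        for key, value in sorted(items.items()):
            f.write(RECORD.pack(key.encode("utf-8"), value))


def lookup(mapped, key):
    """Binary search directly over the mapped bytes; nothing is loaded up front."""
    target = key.encode("utf-8").ljust(16, b"\0")
    lo, hi = 0, len(mapped) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw_key, value = RECORD.unpack_from(mapped, mid * RECORD.size)
        if raw_key == target:
            return value
        if raw_key < target:
            lo = mid + 1
        else:
            hi = mid
    return None


path = os.path.join(tempfile.mkdtemp(), "toy.idx")
build(path, {"munich": 1, "berlin": 2, "hamburg": 3})

with open(path, "rb") as f:
    # ACCESS_READ maps the file read-only; every process mapping the same
    # file reads the same physical pages.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    found = lookup(mapped, "berlin")
    missing = lookup(mapped, "paris")
    mapped.close()
```

Because the file is never modified after `build`, any number of worker processes can map it concurrently without locks, which is the property the multi-process setup relies on.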

keyvi is an immutable store, hence "index"

(like the underlying data structure of Lucene)

vs. Redis

Workflow has 2 steps:

compile/build index using keyvicompiler or via Python bindings

dump/query using C++ or Python API

Note: There is no SegmentWriter/Merger/Reader (yet)!
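The two-step workflow can be sketched with a toy stand-in. `ToyCompiler` and `ToyIndex` are hypothetical names, not keyvi's classes, and a sorted dictionary stands in for the finite-state structure; the sketch only shows the split between a build step that accepts writes and an immutable result that can only be queried or dumped.

```python
from types import MappingProxyType


class ToyCompiler:
    """Step 1: collect key/value pairs, then freeze them into an index."""

    def __init__(self):
        self._pending = {}

    def add(self, key, value=None):
        self._pending[key] = value

    def compile(self):
        # Sorting once at build time; the result is never modified again.
        return ToyIndex(dict(sorted(self._pending.items())))


class ToyIndex:
    """Step 2: a read-only view supporting point queries and a full dump."""

    def __init__(self, items):
        # MappingProxyType rejects item assignment, so the index is immutable.
        self._items = MappingProxyType(items)

    def get(self, key):
        return self._items.get(key)

    def dump(self):
        return list(self._items.items())


compiler = ToyCompiler()
compiler.add("keyvi", 1)
compiler.add("cliqz", {"city": "Munich"})
compiler.add("redis")  # a key without a value

index = compiler.compile()
value = index.get("keyvi")
dumped_keys = [key for key, _ in index.dump()]
```

Updating such an index means compiling a new one and swapping it in, which is why the absence of a segment writer/merger matters.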

exact matching / simple entity recognition:

values can be None, integer, string or JSON

approximate matching:

close/near match e.g. for Geo applications

scoring based: Levenshtein & Co
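As a rough illustration of scoring-based approximate matching, the sketch below ranks keys by Levenshtein distance to a query. It scans a plain list, which is an assumption made for brevity; the function names are hypothetical and this is not how keyvi traverses its automaton.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def approximate_matches(keys, query, max_distance=1):
    """Return (distance, key) pairs within max_distance, best match first."""
    scored = ((levenshtein(query, key), key) for key in keys)
    return sorted(pair for pair in scored if pair[0] <= max_distance)


matches = approximate_matches(
    ["munich", "zurich", "munchen", "berlin"], "munick", max_distance=1
)
```

Here `matches` contains only `(1, "munich")`, since the other keys are more than one edit away from the misspelled query.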

completion matching:

prefix, multi-word, fuzzy
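Prefix and multi-word completion can be sketched over a sorted key list. The helpers below are hypothetical and use binary search instead of a finite-state traversal; the multi-word variant assumes that every word-aligned suffix of a key is indexed and points back to the full key. Fuzzy completion is left out.

```python
from bisect import bisect_left


def prefix_complete(sorted_keys, prefix, limit=10):
    """Return up to `limit` keys starting with `prefix`."""
    results = []
    for key in sorted_keys[bisect_left(sorted_keys, prefix):]:
        if not key.startswith(prefix) or len(results) == limit:
            break
        results.append(key)
    return results


def build_multiword_entries(keys):
    """Index every word-aligned suffix of a key, pointing back to the key."""
    entries = set()
    for key in keys:
        words = key.split()
        for i in range(len(words)):
            entries.add((" ".join(words[i:]), key))
    return sorted(entries)


def multiword_complete(entries, prefix, limit=10):
    """Complete a prefix that may start at any word of a key."""
    results = []
    # (prefix,) sorts before every (suffix, key) whose suffix starts with prefix.
    for suffix, key in entries[bisect_left(entries, (prefix,)):]:
        if not suffix.startswith(prefix) or len(results) == limit:
            break
        if key not in results:
            results.append(key)
    return results


keys = sorted(["new york", "new york city", "york", "newcastle"])
by_prefix = prefix_complete(keys, "new")
by_word = multiword_complete(build_multiword_entries(keys), "york")
```

With these keys, `by_prefix` holds the three keys beginning with "new", while `by_word` also finds "new york" and "new york city" for the query "york".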

