Post on 14-Apr-2017
keyvi - the key value index
Or
How did we build a large scale low-latency search-engine with keyvi?
Hendrik Muhs <hendrik.muhs@gmail.com>
BASED IN MUNICH
MAJORITY-OWNED BY HUBERT BURDA MEDIA
INTERNATIONAL TEAM OF 90 EXPERTS
WE COMBINE THE POWER OF DATA, SEARCH, AND
BROWSERS TO REDESIGN THE INTERNET
FOR THE USER
WE REDESIGN THE INTERNET
http://cliqz.com/
Key value index based on finite state technology, so basically an immutable key value store.
Licence: Apache 2.0 (just keyvi, 3rdparty)
Language: C++ (core), Python (binding)
Runs on: Linux, MacOSX (not tested on Windows)
Link: www.keyvi.org
Author: me ;-)
Cliqz Search Backend
Elasticsearch used in the early days (2014)
→ Redis with our own cluster implementation (before Redis Cluster), at peak over 100 Redis instances in 1 cluster, > 5 TB of data, all on AWS
→ keyvi: drop-in replacement for Redis, significantly reduced size (2 TB) and number of machines
! Whether Redis or keyvi: average latency of 55 ms at the backend !
Why replace redis?
Size
extremely efficient at storing values
low-level access: msgpack & Redis fork to compress even more (zlib)
implementation of auto-completion is expensive and slow
Runtime
single threaded → contention, queuing, timeouts
Persistence
memory only, loading times of several minutes
Why replace redis?
→ Redis is great! We still use it a lot! But for 1 of our use cases (and only 1), we can do better!
started as auto-completion engine
caching layer for Redis
now providing the complete index (>2 TB)
distributed across multiple machines
multi-process, fast, reliable, stable
shared memory model (mmap)
multi-core, reliable, no loading (un-serializing)
space efficient
compact key-space, FSA minimization
BUT:
keyvi is an immutable store, hence "index"
(as the underlying data structure of Lucene is),
in contrast to Redis
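To illustrate the "compact key-space, FSA minimization" point, here is a toy sketch in Python. It is not keyvi's implementation; the function names are made up for this example. It builds a plain trie, then merges equivalent sub-trees bottom-up, which is the basic idea behind a minimal acyclic finite state automaton: shared prefixes and shared suffixes are each stored only once.

```python
def build_trie(keys):
    """Build a plain trie; the empty string "" marks the end of a key."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-key marker
    return root


def count_trie_states(node):
    """Number of states in the unminimized trie."""
    return 1 + sum(
        count_trie_states(child) for ch, child in node.items() if ch != ""
    )


def minimize(node, register):
    """Assign each state an equivalence class id.

    Two states are equivalent if they agree on finality and their outgoing
    edges lead to equivalent states. `register` maps a state signature to
    its class id, so len(register) is the state count after minimization.
    """
    signature = tuple(sorted(
        (ch, True if ch == "" else minimize(child, register))
        for ch, child in node.items()
    ))
    return register.setdefault(signature, len(register))


keys = ["tap", "taps", "top", "tops"]
trie = build_trie(keys)
register = {}
minimize(trie, register)
print(count_trie_states(trie), "trie states ->", len(register), "minimized states")
```

In this example the branches below "ta" and "to" are identical, so they collapse into one shared path.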
Workflow has 2 steps:
compile/build the index using keyvicompiler or via the Python bindings
dump/query using the C++ or Python API
Note: There is no SegmentWriter/Merger/Reader (yet)!
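The two-step workflow (compile once, then query a read-only file through mmap) can be sketched with the Python standard library alone. This is a simplified stand-in, not the keyvi API or file format: `compile_index`, `ReadOnlyIndex` and the binary layout are assumptions made up for illustration, and lookups use binary search over sorted records rather than a finite state automaton.

```python
import mmap
import struct


def compile_index(pairs, path):
    """Step 1: write an immutable, sorted index file.

    Assumed layout: record count, then one offset per record, then the
    records themselves (key length, value length, key bytes, value bytes).
    """
    items = sorted((k.encode(), v.encode()) for k, v in pairs.items())
    records = [struct.pack("<HI", len(k), len(v)) + k + v for k, v in items]
    offset = 4 + 4 * len(records)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(records)))
        for record in records:
            f.write(struct.pack("<I", offset))
            offset += len(record)
        for record in records:
            f.write(record)


class ReadOnlyIndex:
    """Step 2: query the file via mmap; nothing is parsed or loaded up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mem = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (self._count,) = struct.unpack_from("<I", self._mem, 0)

    def _record(self, i):
        (off,) = struct.unpack_from("<I", self._mem, 4 + 4 * i)
        key_len, val_len = struct.unpack_from("<HI", self._mem, off)
        start = off + 6  # size of the "<HI" record header
        key = self._mem[start:start + key_len]
        value = self._mem[start + key_len:start + key_len + val_len]
        return key, value

    def get(self, key):
        """Binary search over the sorted records; returns None if absent."""
        target = key.encode()
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            k, v = self._record(mid)
            if k == target:
                return v.decode()
            if k < target:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mem.close()
        self._file.close()
```

Because the file is mapped read-only, several processes can open the same index and share its pages through the operating system's page cache, which is the property the shared memory model above relies on.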
Usage
exact matching / simple entity recognition:
values can be None, integer, string or JSON
approximate matching:
close/near match e.g. for Geo applications
scoring based: Levenshtein & Co
completion matching:
prefix, multi-word, fuzzy
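As a rough illustration of two of the matching modes above, here is a brute-force sketch over a sorted list of keys. It does not use keyvi, which answers such queries by traversing its finite state structure; `complete`, `levenshtein` and `fuzzy_match` are hypothetical helper names chosen for this example.

```python
from bisect import bisect_left


def complete(sorted_keys, prefix, limit=10):
    """Prefix completion: jump to the first candidate, then scan forward."""
    start = bisect_left(sorted_keys, prefix)
    results = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(results) >= limit:
            break
        results.append(key)
    return results


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def fuzzy_match(keys, query, max_distance=1):
    """Approximate matching: keep keys within max_distance, best first."""
    scored = ((levenshtein(query, key), key) for key in keys)
    return sorted((d, key) for d, key in scored if d <= max_distance)


keys = sorted(["berlin", "munich", "munster", "music"])
print(complete(keys, "mun"))
print(fuzzy_match(keys, "munch"))
```

The fuzzy lookup here compares the query against every key, so it only serves to show the scoring idea; it says nothing about how fast a real index answers the same query.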
more on Features
it's fast! extremely fast!
it scales:
it's compact/small, enables indexing GBs of data
it brings FSTs to the level of more established data structures like hash tables and B-trees on one side …
… and enables applications that are not, or hardly, possible with them (completions, approximate matching, etc.)
the gist
http://www.keyvi.org
Lots of content, from crash course to in-depth
check it out!
Questions?
Comments!
Feedback.
Contact: hendrik.muhs@gmail.com