Date post: | 07-Jul-2015 |
Upload: | lucidworks |
Search Architecture at Evernote: Not Your Typical Big Data Problem
CHRISTIAN KOHLSCHÜTTER, Sr. Search Researcher
Augmented Intelligence @ Evernote
We are the workspace.
Write, Collect, Find, Present
Serving 100+ Million Users Worldwide
• 559 Shards (200k users per shard), Linux/Tomcat/MySQL
• 3.2 PB WebDAV-based Storage
• 224 TB SSD capacity for System, MySQL and Lucene
• 3.1 Billion Notes stored, 3.8 Bn Notes ever created
• 115 Million Notes created or edited last week
• 26 Million API calls to Context last week
• 1 Lucene index per user
Evernote’s Three Laws of Data Protection
• Your Data is Yours
• Your Data is Protected
• Your Data is Portable
We are not a “big data” company and do not try to make
money from your content.
Technical Debt
• I/O over Lucene 2.9 indexes became a bottleneck
• Code was woven into our “NoteStore” platform
• Index changes had to be backwards-compatible
• Complex re-indexing would require taking down a shard
• Needed to rethink the entire architecture, but keep public API
• Make search faster vs. Make us move faster
From Lucene 2.9 to 4.x and beyond
• Large refactoring of search code
• Lucene is no longer a direct dependency of “NoteStore”
• Design-by-Contract
• Can now run multiple Lucene versions concurrently in one VM
• … and one specific version / schema per user
• Migrated all users to Lucene 4.5, avg. downtime/user < 1 min
Separate the What from the How
Separation of Concerns
• API: UserIndexManager, UserIndexFactory, UserIndex
• Implementation: Lucene29UserIndexImpl, Lucene4UserIndexImpl
• Also shown: Caching UserIndex, Benchmarking UserIndex
• Caller: NoteStore, ...
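As an illustration of this split, here is a minimal Java sketch. Only the type names UserIndex, UserIndexFactory, Lucene29UserIndexImpl and Lucene4UserIndexImpl come from the slides. The method signatures, the string keys passed to the factory, and the idea that the caching variant is a decorator are assumptions made for the example, and no real Lucene calls are involved.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// API side: the only type a caller such as NoteStore would need to see.
// Both methods are hypothetical; the slides do not show the real contract.
interface UserIndex {
    List<String> search(String query);   // returns note IDs (placeholder strings here)
    String implementationName();
}

// Implementation side: one class per Lucene generation (stubs, no Lucene code).
final class Lucene29UserIndexImpl implements UserIndex {
    public List<String> search(String query) {
        return Collections.singletonList("lucene29-hit:" + query);
    }
    public String implementationName() { return "lucene29"; }
}

final class Lucene4UserIndexImpl implements UserIndex {
    public List<String> search(String query) {
        return Collections.singletonList("lucene4-hit:" + query);
    }
    public String implementationName() { return "lucene4"; }
}

// Assumed decorator: adds result caching without the caller or the
// wrapped implementation knowing about it.
final class CachingUserIndex implements UserIndex {
    private final UserIndex delegate;
    private final Map<String, List<String>> cache = new HashMap<>();
    int delegateCalls = 0;   // exposed only so the sketch can be checked

    CachingUserIndex(UserIndex delegate) { this.delegate = delegate; }

    public List<String> search(String query) {
        return cache.computeIfAbsent(query, q -> {
            delegateCalls++;
            return delegate.search(q);
        });
    }
    public String implementationName() { return delegate.implementationName(); }
}

// Factory: the single place that maps an index type to a concrete class.
final class UserIndexFactory {
    static UserIndex create(String indexType) {
        switch (indexType) {
            case "lucene29": return new Lucene29UserIndexImpl();
            case "lucene4":  return new Lucene4UserIndexImpl();
            default: throw new IllegalArgumentException("unknown index type: " + indexType);
        }
    }
}
```

Because callers depend only on the interface, swapping the implementation for one user, or wrapping it for caching or benchmarking, would not require changes in the calling code.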
Hide Lucene behind ClassLoaders
• One Maven artifact per major Lucene version,
build profiles for code-reuse between minor updates
• Code is packaged with dependencies into one common fat-jar with prefixes for each
implementation:
- lucene29/org/apache/lucene/...
lucene29/com/evernote/search/lucene2/…
- lucene43/org/apache/lucene/...
lucene43/com/evernote/search/lucene4/…
- lucene45/org/apache/lucene/…
lucene45/com/evernote/search/lucene4/…
• ResourcePrefixClassLoader called from outside code strips prefix,
uses fat-jar as the only dependency
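The prefix-stripping idea can be sketched with a small custom class loader. This is not Evernote's ResourcePrefixClassLoader, whose code is not shown in the slides. The class name PrefixClassLoaderSketch, the constructor arguments, and the helper resourcePath are hypothetical; only standard JDK calls are used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Loads class "a.b.C" from the resource "<prefix>a/b/C.class" inside a fat-jar,
// so several copies of the same package can coexist under different prefixes.
final class PrefixClassLoaderSketch extends ClassLoader {
    private final String prefix;       // e.g. "lucene45/"
    private final ClassLoader fatJar;  // any loader that can see the fat-jar's resources

    PrefixClassLoaderSketch(String prefix, ClassLoader fatJar, ClassLoader parent) {
        super(parent);
        this.prefix = prefix;
        this.fatJar = fatJar;
    }

    // Maps a binary class name to its prefixed location in the fat-jar.
    String resourcePath(String className) {
        return prefix + className.replace('.', '/') + ".class";
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        try (InputStream in = fatJar.getResourceAsStream(resourcePath(name))) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            byte[] bytes = buf.toByteArray();
            // The class is defined under its original, unprefixed name.
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```

Note that the default loadClass asks the parent first, so this only isolates versions if the parent cannot see an unprefixed org.apache.lucene package itself. Each instance of the loader defines its own copy of the classes, which is what would let two Lucene versions live in one VM.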
New Index Structure
• Each user’s index now comes with a properties file that
describes its internal structure, such as index type and
version, so the code can handle each variant differently.
• Changes to the index schema? Just increase the index version
and handle the rest in code
• Automatically trigger re-indexing if necessary
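A per-index properties file of this kind could be read as in the sketch below. The key names index.type and index.version, the class IndexDescriptor, and the comparison rule in needsReindex are assumptions for illustration; the slides do not show the real file layout.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Describes one user's index as read from its properties file.
final class IndexDescriptor {
    final String type;   // e.g. "lucene45" (hypothetical value)
    final int version;   // schema version of this index

    IndexDescriptor(String type, int version) {
        this.type = type;
        this.version = version;
    }

    // Parses the text of a properties file; missing keys fall back to
    // values that always compare as outdated.
    static IndexDescriptor parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new IndexDescriptor(
                p.getProperty("index.type", "unknown"),
                Integer.parseInt(p.getProperty("index.version", "0")));
    }

    // Re-index when the implementation differs or the schema version is older.
    boolean needsReindex(String targetType, int targetVersion) {
        return !type.equals(targetType) || version < targetVersion;
    }
}
```

With this shape, a schema change only requires raising the target version; every index that still reports an older version is picked up for re-indexing.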
Index Auto-Migration
• Target Default Index Implementation centrally set by DevOps
• Triggered upon UserIndex access
• UserIndex facade determines whether re-index is necessary
• “Cruise Control” automates off-peak access
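The access-triggered migration can be pictured as follows. This is a sketch under stated assumptions: MigratingIndexManager, its methods, and the in-memory map are hypothetical stand-ins, and the actual re-indexing work is reduced to a counter.

```java
import java.util.HashMap;
import java.util.Map;

// Keeps track of which implementation each user's index currently uses and
// migrates lazily, the first time the index is accessed after the target changes.
final class MigratingIndexManager {
    private String targetImpl;                                   // centrally configured default
    private final Map<Integer, String> implByUser = new HashMap<>();
    int reindexCount = 0;                                        // exposed for the sketch only

    MigratingIndexManager(String targetImpl) { this.targetImpl = targetImpl; }

    synchronized void setTarget(String newTarget) { this.targetImpl = newTarget; }

    synchronized void register(int userId, String currentImpl) {
        implByUser.put(userId, currentImpl);
    }

    // Called on every index access; returns the implementation now in use.
    synchronized String access(int userId) {
        String current = implByUser.get(userId);
        if (current == null) {
            throw new IllegalArgumentException("unknown user: " + userId);
        }
        if (!current.equals(targetImpl)) {
            reindex(userId);
        }
        return implByUser.get(userId);
    }

    // Placeholder for the real re-indexing work.
    private void reindex(int userId) {
        reindexCount++;
        implByUser.put(userId, targetImpl);
    }
}
```

A background job that walks through users during off-peak hours and calls access for each one would play the role the slides call “Cruise Control”; how the real job throttles itself is not described in the source.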
Phase 1: Migration to Lucene 4
• Change in disk I/O (CPU usage correlates)
overall: -81%
searchRelatedNotes: -87%
keyword-based search: -96%
Saves TBs of I/O
Phase 2: Add Compression
• User index sizes and access patterns are skewed
• Optimize large accounts
• Directory-level compression
• Compress segment files, invisible to the IndexReader
• Only when re-indexing / every 3 months
• In-memory Caching
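To make the compress-on-write, decompress-on-read idea concrete, here is a round-trip sketch. It uses the JDK's built-in DEFLATE (java.util.zip) purely as a stand-in, since the LZ4 and Snappy codecs mentioned in this talk are not part of the JDK. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Round-trips a segment file's bytes through a general-purpose compressor.
final class SegmentCompressionSketch {

    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);  // favor speed, like LZ4 would
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

In a directory-level scheme the reader never calls these methods directly; the directory wrapper would apply them when segment files are written and read.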
LuceneTransform
• https://code.google.com/p/lucenetransform by Mitja Lenič
• We ported it to Lucene 4.5 (now available upstream for 4.9)
• Improved LRU caching, added LZ4/Snappy compression
• We will contribute our changes soon
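An LRU cache for decompressed blocks can be sketched with LinkedHashMap in access order. This is not the LuceneTransform code; LruBlockCache and its loader callback are hypothetical. As a simplification, the capacity is counted in blocks, whereas a real cache would be bounded in bytes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Caches decompressed blocks by block number and evicts the least recently used one.
final class LruBlockCache {
    private final LinkedHashMap<Long, byte[]> blocks;
    long hits = 0;
    long misses = 0;

    LruBlockCache(final int capacity) {
        // accessOrder = true makes iteration order follow recency of access.
        this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns the cached block, or loads (e.g. reads and decompresses) it on a miss.
    synchronized byte[] get(long blockNo, Function<Long, byte[]> loader) {
        byte[] block = blocks.get(blockNo);
        if (block != null) {
            hits++;
            return block;
        }
        misses++;
        block = loader.apply(blockNo);
        blocks.put(blockNo, block);
        return block;
    }

    synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

Every hit avoids both a disk read and a decompression, which is why the hit rate matters more than the raw speed of the codec.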
OverlayDirectory
on disk:
_23.cfe
_23.si
c$_23.cfs
segments.gen
segments_2
visible to IndexReader:
_23.cfe
_23.si
_23.cfs
segments.gen
segments_2
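The name mapping between the two listings above can be expressed in a few lines. The "c$" prefix is taken from the on-disk listing; the class OverlayNamesSketch and its methods are hypothetical and only model file names, not the actual I/O.

```java
import java.util.ArrayList;
import java.util.List;

// Translates on-disk file names into the names the index reader should see.
final class OverlayNamesSketch {
    static final String COMPRESSED_PREFIX = "c$";

    // True if the on-disk name marks a compressed file.
    static boolean isCompressed(String onDiskName) {
        return onDiskName.startsWith(COMPRESSED_PREFIX);
    }

    // Strips the marker so the reader sees the original segment file name.
    static String visibleName(String onDiskName) {
        return isCompressed(onDiskName)
                ? onDiskName.substring(COMPRESSED_PREFIX.length())
                : onDiskName;
    }

    // Produces the directory listing as presented to the reader.
    static List<String> listVisible(List<String> onDiskNames) {
        List<String> visible = new ArrayList<>();
        for (String name : onDiskNames) {
            visible.add(visibleName(name));
        }
        return visible;
    }
}
```

When the reader opens _23.cfs, a wrapper built on this mapping would look for c$_23.cfs first and route reads through the decompressor, falling back to the plain file otherwise.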
Results
• Compressed the largest 5% of all indexes using LZ4
• 1.9 TB index space saved
• 100 MB LRU Cache hit rate: 79% on avg (67% to 93%)
• Saved 0.5 PB disk reads/week
• The cache is effective enough that a stronger but slower compression
algorithm could be used, and compression could be applied to more users
Saves PBs of I/O
Bugs, Bugs, Bugs :-)
• We’ve been warned:
- “VInt bug”
- “background merge hit exception”
- JVM segfaults
• And then this happened, too:
- SPI / ContextClassLoaders … LUCENE-4713
- Deadlocks / over-optimistic locking
- Unclosed resources / Too many open file handles => HousekeepingDirectory
- Issues with FieldCache singleton => LUCENE-831, LUCENE-2133, …
- …
• UserIndex tracks “broken” state; allows self-healing (rebuild)
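One way to picture the “broken” flag and self-healing is the sketch below. The class SelfHealingIndex, the withIndex method, and the policy of rebuilding on the next access are assumptions; the slides only state that the state is tracked and that a rebuild is possible.

```java
import java.util.function.Supplier;

// Wraps index operations: a failure marks the index as broken,
// and the next access rebuilds it before running the operation.
final class SelfHealingIndex {
    private boolean broken = false;
    int rebuilds = 0;   // exposed for the sketch only

    synchronized boolean isBroken() { return broken; }

    synchronized <T> T withIndex(Supplier<T> operation) {
        if (broken) {
            rebuild();
        }
        try {
            return operation.get();
        } catch (RuntimeException e) {
            broken = true;   // remember the failure, then let the caller see it
            throw e;
        }
    }

    // Placeholder for re-indexing the user's notes from the primary data store.
    private void rebuild() {
        rebuilds++;
        broken = false;
    }
}
```

This works because the index is derived data: the notes themselves live elsewhere, so a damaged index can always be regenerated.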
Conclusion
• Design-By-Contract, Separation of Concerns
• Per-user Search Implementation / Multiple Lucene versions
• Migrated 60M users, without noticeable downtime
• Migration allowed index changes, saves TBs of disk I/O
• Block-level Index Compression, saves PBs of disk I/O
• This is just the beginning.