Post on 15-Jan-2015
Distributed Search - Solutions and Comparison
Ngọc Bùi
trungngoc.bui@vtc.vn
Facts
Facebook:
750 million active users
3 billion photos uploaded each month; a record 750 million photos were uploaded over New Year’s weekend
14 million videos uploaded each month
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
Terabytes of log data daily
HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?
Centralized Search – PROBLEM?
Lucene is great: a high-performance, full-featured search library
Incremental indexing
Boolean Query, Fuzzy Query, Range Query, MultiPhrase Query, Wildcard Query, etc.
It’s great, BUT:
Slow if the index is very big
The index can grow bigger than one HDD
No load balancing
No failover
GOAL
Reliable index serving, through failover (master and nodes)
Scalable for traffic and index size by adding nodes
Distributed TF-IDF
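Distributed TF-IDF means scoring with collection-wide statistics instead of per-shard ones. A minimal sketch, assuming each shard can report its document count and per-term document frequency. The function names and the smoothed IDF formula are illustrative assumptions, not taken from Lucene or from these slides:

```python
import math
from collections import Counter

def shard_stats(docs):
    """Return (document count, document frequency per term) for one shard."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    return len(docs), df

def global_idf(stats):
    """Merge per-shard statistics into one IDF table shared by all shards."""
    total_docs = sum(n for n, _ in stats)
    total_df = Counter()
    for _, df in stats:
        total_df.update(df)
    # Smoothed IDF; the exact formula is an assumption for illustration.
    return {t: math.log((1 + total_docs) / (1 + d)) + 1 for t, d in total_df.items()}

shard_a = ["lucene search index", "hbase random access"]
shard_b = ["lucene on hbase", "katta shards", "elastic search shards"]
idf = global_idf([shard_stats(shard_a), shard_stats(shard_b)])
```

With a shared IDF table, a term that is rare overall gets the same weight on every node, so scores from different shards stay comparable when merged.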
Solution:
Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is sent to multiple machines and executed on them in parallel.
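The search side of this solution is a scatter-gather pattern. A minimal sketch, assuming each shard is an in-memory dict and scoring is a raw term count. Both are simplifications for illustration, and the names `search_shard` and `distributed_search` are hypothetical:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term, k):
    """Score the documents of one shard by raw term count (toy scoring)."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def distributed_search(shards, term, k=3):
    """Send the query to every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]

shards = [
    {"a1": "needle in a haystack", "a2": "hay hay hay"},
    {"b1": "needle needle needle", "b2": "needle and needle thread"},
]
results = distributed_search(shards, "needle", k=2)
```

Each shard returns only its own top k hits, so the merge step handles at most k hits per shard regardless of index size.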
Choices:
Katta
Elastic Search
HbaseDirectory (our choice)
Katta
Katta is a distributed application running on many commodity hardware servers
An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.
The distributed configuration and locking system Zookeeper is used for master-node communication.
Pros and Cons
Pros:
Copies and distributes shards automatically on slaves.
Supports distributing queries and aggregating results.
Cons:
No indexing support.
Incremental index updates are hard.
Resharding is too expensive.
Elastic Search (www.elasticsearch.org)
Elastic Search is an open source, distributed, RESTful search engine built on top of Lucene
Automatic shard allocation
Automatic index sharding and index updates
Network interface (HTTP) for indexing, searching, and administration through a purely RESTful API
Schema free
Integrates well with Hadoop/MapReduce
Behind Elastic
automatic shard allocation
There is no need for a load balancer in elasticsearch: each node can receive a request, and if it can’t handle it, it automatically delegates it to the appropriate node(s).
If you want to scale out search, you can simply add more shards and more replicas per shard.
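The delegation idea can be sketched as follows: any node computes which shard owns a document and forwards the request if it holds no copy. This is an illustration only. Elasticsearch's real routing hash and cluster-state handling differ, and the node names, shard layout, and MD5-based hash below are assumptions:

```python
import hashlib

NUM_SHARDS = 4
# Hypothetical layout: shard number -> nodes holding a copy (primary first).
SHARD_TO_NODES = {
    0: ["node-1", "node-2"],
    1: ["node-2", "node-3"],
    2: ["node-3", "node-1"],
    3: ["node-1", "node-3"],
}

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Map a document id to a shard with a stable hash (MD5 for illustration only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def route(receiving_node, doc_id):
    """Return the node that serves the request, handling it locally when possible."""
    nodes = SHARD_TO_NODES[shard_for(doc_id)]
    return receiving_node if receiving_node in nodes else nodes[0]

target = route("node-2", "user-42")
```

Adding replicas extends the node list of a shard, which gives more nodes the chance to answer locally.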
HbaseDirectory – What?
[Diagram: Directory, Indexing Phase, Searching Phase]
Is Directory distributed? No, but making it so is not impossible.
How to distribute it? Use Directory on top of a distributed storage system.
HDFS: too slow
Hbase: our choice, since it is optimized for random access, which suits the way a Lucene index is accessed.
Hbase Directory: treat Hbase as a logical “Directory”.
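A minimal sketch of the "logical Directory" idea, with a plain dict standing in for an Hbase table. The class name, the block size, and the `(file name, block number)` row-key layout are assumptions for illustration, not the actual HbaseDirectory design:

```python
class KeyValueDirectory:
    """Directory-like file store over a key-value table (a dict stands in for Hbase)."""

    def __init__(self, block_size=4):
        self.table = {}      # (file name, block number) -> bytes
        self.lengths = {}    # file name -> total length in bytes
        self.block_size = block_size

    def write_file(self, name, data):
        """Split a file into fixed-size blocks, one table cell per block."""
        for i in range(0, len(data), self.block_size):
            self.table[(name, i // self.block_size)] = data[i:i + self.block_size]
        self.lengths[name] = len(data)

    def read(self, name, offset, length):
        """Random-access read that touches only the blocks it needs."""
        end = min(offset + length, self.lengths[name])
        out = bytearray()
        pos = offset
        while pos < end:
            block_no, start = divmod(pos, self.block_size)
            chunk = self.table[(name, block_no)][start:start + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)

    def list_all(self):
        return sorted(self.lengths)

d = KeyValueDirectory(block_size=4)
d.write_file("segments_1", b"0123456789")
part = d.read("segments_1", 3, 5)
```

Because a read fetches blocks by key, a seek into the middle of an index file costs a few lookups instead of a sequential scan, which is the property that makes a random-access store attractive here.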
Two Modes
Hbase Directory: lazy mode
Keep the Lucene index file structures and port them to Hbase
Only rewrite 2 classes: FSDirectory & RAMDirectory (Directory interface)
Hbase Directory: active mode
Redesign the index structure to utilize Hbase’s strengths
Rewrite: the 2 above + IndexReader & IndexWriter
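The slides do not spell out the redesigned structure for active mode. One plausible layout, shown purely as an assumption, stores the inverted index directly as table rows: one row per term, one column per document. A dict stands in for the Hbase table, and `TermRowIndex` is a hypothetical name:

```python
from collections import defaultdict

class TermRowIndex:
    """Inverted index laid out as one row per term, one column per document."""

    def __init__(self):
        # row key (term) -> {column (doc id): term frequency}
        self.rows = defaultdict(dict)

    def add_document(self, doc_id, text):
        for term in text.lower().split():
            self.rows[term][doc_id] = self.rows[term].get(doc_id, 0) + 1

    def search(self, term):
        """One row lookup returns the whole posting list for the term."""
        postings = self.rows.get(term.lower(), {})
        # Highest term frequency first; ties broken by document id.
        return sorted(postings, key=lambda d: (-postings[d], d))

index = TermRowIndex()
index.add_document("doc1", "hbase stores lucene index")
index.add_document("doc2", "lucene lucene everywhere")
hits = index.search("lucene")
```

In this layout an incremental update is a write to a few rows, with no segment files to merge, which is why such a redesign would also touch the index reader and writer.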
Lucene index flow – Hbase flow
Performance & Conclusion
Refer to the Excel file.
HbaseDirectory – active mode is the correct choice. Improvement needed.