Distributed search solutions and comparison



Ngọc Bùi (trungngoc.bui@vtc.vn)

Facebook (FB):
750 million active users
3 billion photos uploaded each month
Record: 750 million photos uploaded to FB over New Year's weekend
14 million videos uploaded each month
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
Terabytes of log data generated daily

HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?

Centralized Search – PROBLEM?

Lucene is great:
High-performance, full-featured search library
Incremental indexing
Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wildcard Query, etc.

It's great, BUT:

Slow if the index is very big
The index can grow bigger than a single HDD
No load balancing
No failover

Reliable index serving through failover (master and nodes)
Scalable for traffic and index size by adding nodes
Distributed TF-IDF
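A minimal sketch of what "Distributed TF-IDF" implies: each shard only knows its own document frequencies, so the statistics have to be merged before IDF is computed, otherwise the same term would be weighted differently on each machine. The function names (`shard_stats`, `global_idf`) and the sample documents are hypothetical, and the formula is the textbook `log(N / df)`, not Lucene's exact scoring formula.

```python
import math
from collections import Counter

def shard_stats(docs):
    """Return (document count, document frequency per term) for one shard."""
    df = Counter()
    for text in docs:
        # Count each term once per document, not once per occurrence.
        df.update(set(text.lower().split()))
    return len(docs), df

def global_idf(all_stats):
    """Merge per-shard statistics into cluster-wide IDF values."""
    total_docs = sum(n for n, _ in all_stats)
    total_df = Counter()
    for _, df in all_stats:
        total_df.update(df)
    # Textbook IDF: log(N / df), computed over the whole cluster.
    return {term: math.log(total_docs / count) for term, count in total_df.items()}

# Two hypothetical shards living on different machines.
shard_a = ["lucene search library", "lucene index"]
shard_b = ["distributed search cluster", "hadoop cluster", "zookeeper"]

idf = global_idf([shard_stats(shard_a), shard_stats(shard_b)])
```

With the merged statistics, "search" gets the same weight on both shards, so scores returned by different machines are comparable when results are combined.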

Solution:

Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is fanned out to multiple machines and executed in parallel.
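The fan-out described above can be sketched as a scatter-gather search. This is only an illustration under simplifying assumptions: each "shard" is an in-memory dict standing in for a machine, threads stand in for network calls, and the score is a raw term count. The names `search_shard` and `distributed_search` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term):
    """Score documents in one shard by raw term count (a toy scoring rule)."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits

def distributed_search(shards, term, top_k=3):
    """Query every shard in parallel, then merge partial hits into one ranking."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term), shards))
    merged = [hit for part in partials for hit in part]
    # Keep only the best top_k hits across all shards.
    return heapq.nlargest(top_k, merged)

# Two hypothetical shards, each holding part of the document set.
shards = [
    {"a1": "search search engine", "a2": "photo album"},
    {"b1": "distributed search", "b2": "search search search"},
]
results = distributed_search(shards, "search")
```

Each shard returns only its own hits, so the merge step is the only place that sees the whole result set.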

Choices:
Katta
Elastic Search
HbaseDirectory (our choice)

Katta is a distributed application running on many commodity hardware servers

An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.

The distributed configuration and locking system Zookeeper is used for master-node communication.

Pros and Cons

Pros:
Copies and distributes shards automatically to slaves.
Supports distributing queries and aggregating results.

Cons:
No indexing support.
Incremental index updates are hard.
Resharding is too expensive.

Elastic Search (www.elasticsearch.org)

Elastic Search is an open source, distributed, RESTful search engine built on top of Lucene.
Automatic shard allocation
Automatic index sharding and index updates
Network interface (HTTP) for indexing, searching, and administration through a purely RESTful API
Schema free
Integrates well with Hadoop/MapReduce

Behind Elastic

automatic shard allocation

elasticsearch does not need a load balancer: each node can receive a request and, if it can’t handle it, automatically delegates it to the appropriate node(s).
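The delegation idea can be illustrated with a toy model. This is not elasticsearch's implementation: the `Node` and `Cluster` classes are hypothetical, a CRC32 hash stands in for the real routing function, and documents are mapped directly to nodes rather than to shards. It only shows why a client can talk to any node and still get the right answer.

```python
import zlib

class Node:
    """A toy cluster member that stores documents and can forward requests."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.docs = {}

    def handle_get(self, doc_id):
        owner = self.cluster.owner_of(doc_id)
        if owner is self:
            # This node holds the document: answer directly.
            return self.name, self.docs.get(doc_id)
        # Otherwise delegate to the node that owns the document.
        return owner.handle_get(doc_id)

class Cluster:
    """Maps document IDs to nodes with a deterministic hash (an assumption)."""

    def __init__(self, names):
        self.nodes = [Node(n, self) for n in names]

    def owner_of(self, doc_id):
        return self.nodes[zlib.crc32(doc_id.encode()) % len(self.nodes)]

    def index(self, doc_id, body):
        self.owner_of(doc_id).docs[doc_id] = body

cluster = Cluster(["n1", "n2", "n3"])
cluster.index("doc-42", "hello")

# Ask every node for the same document: all of them give the same answer.
answers = {node.handle_get("doc-42") for node in cluster.nodes}
```

Because every node knows how to locate the owner, no external component has to decide where a request goes.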

If you want to scale out search, you can simply add more shards and more replicas per shard.

HbaseDirectory – What?
