Big Search 4 Big Data War Stories


Big Search 4 Big Data

Enterprise Search Summit Europe 2013, London

Eric Pugh | epugh@o19s.com | @dep4b

Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development

CO-AUTHOR

Working on 4.0!

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

What is Big Search?

Background for Client X’s Project

• Big Data is any data set that is primarily at rest due to the difficulty of working with it.

• 100’s of millions of documents to search

• Aggressive timeline.

• All the data must be searched per query.

• Limited selection of tools available.

• On Solr 3.x line

• Prototyping

Boy meets Girl Story

[Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr ×4]

Bash Rocks

• Remote Solr stop/start scripts

• Remote Indexer stop/start scripts

• Performance Monitoring

• Content Extraction scripts (+Java)

• Ingestor Scripts (+Java)

• Artifact Deployment (CM)

Lesson: Don’t get captured by your environment

Make it easy to change approach

Make it easy to change sharding

IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
    "com.o19s.solr.ModShardIndexStrategy").newInstance();
indexStrategy.configure(options);
for (SolrInputDocument doc : docs) {
    indexStrategy.addDocument(doc);
}
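
The slide names a ModShardIndexStrategy but does not show its internals. As a rough sketch of what "mod sharding" usually means, the snippet below routes each document id to a shard by hashing it modulo the shard count. The ShardRouter and ModShardRouter names are hypothetical stand-ins for illustration, not the classes from the project.

```java
// Hypothetical stand-in for the routing part of an index strategy.
interface ShardRouter {
    int shardFor(String docId);
}

// Assumption: "mod shard" means hash(id) mod shardCount.
final class ModShardRouter implements ShardRouter {
    private final int shardCount;

    ModShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    @Override
    public int shardFor(String docId) {
        // floorMod keeps the result in [0, shardCount) even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }
}

final class ModShardDemo {
    public static void main(String[] args) {
        ShardRouter router = new ModShardRouter(4);
        String[] ids = {"DOC45242", "DOC45243", "DOC45244"};
        for (String id : ids) {
            System.out.println(id + " -> shard " + router.shardFor(id));
        }
    }
}
```

Keeping the routing rule behind a small interface is what makes it cheap to swap sharding schemes: loading a different class by name changes the layout without touching the ingest loop.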

Lesson: Sharpen your axe

Go Wide Quickly

shard1 shard1 shard1 shard1 :8983

search1.o19s.com

search2.o19s.com

search3.o19s.com

Lesson: Hardware is cheap/devs expensive

Why so many pipelines?

Simple Pipeline

• Simple pipeline

• mv is atomic
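
One way to lean on "mv is atomic" from Java is to write each file into a staging directory and then rename it into the directory the next stage watches, so a reader never sees a half-written file. This is only a sketch: the AtomicHandoff class and the staging/ready directory names are made up for illustration, and it assumes both directories sit on the same file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicHandoff {
    // Write the content under a temporary name, then atomically move it into readyDir.
    static Path publish(Path stagingDir, Path readyDir, String name, String content)
            throws IOException {
        Path tmp = stagingDir.resolve(name + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        Path target = readyDir.resolve(name);
        // ATOMIC_MOVE fails with an exception rather than falling back to copy-and-delete.
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("pipeline");
        Path staging = Files.createDirectory(base.resolve("staging"));
        Path ready = Files.createDirectory(base.resolve("ready"));
        Path out = publish(staging, ready, "doc1.xml", "<doc/>");
        System.out.println("published " + out);
    }
}
```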

Lesson: Simple

Don’t Move Files

• SCP across machines is slow/error prone

• NFS share, single point of failure.

• Clustered file system like GFS (Global File System) can have “fencing” issues

• HDFS shines here.

• ZooKeeper shines here.

• Map/Reduce

Can you test your changes?

JVM tuning is a black art

-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC

Run, don’t Walk

Lesson: Testing needs to be easy

• Prototyping

• Application Development

Using Solr as key/value store

[Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr ×4]

Solr Key/Value Cache

• Thousands of queries per second without real-time get.

• How fast with real-time get?

http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html

http://localhost:8983/solr/get?id=DOC45242&fl=entities,html
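
The two URLs above differ only in how the document is addressed: a query on the id field through /select, versus a direct id parameter to /get. A small helper can build both forms for a side-by-side timing test. The SolrLookupUrls class and its method names are hypothetical, and the sketch assumes plain alphanumeric ids that need no query-syntax escaping. Note that URLEncoder percent-encodes the colon and comma, so the output is equivalent to the slide's URLs but not character-for-character identical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SolrLookupUrls {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Lookup through the normal search handler: a query on the id field.
    static String selectUrl(String base, String id, String fields) {
        return base + "/select?q=" + enc("id:" + id) + "&fl=" + enc(fields);
    }

    // Lookup through the get handler: the id is passed directly, no query parsing.
    static String getUrl(String base, String id, String fields) {
        return base + "/get?id=" + enc(id) + "&fl=" + enc(fields);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";
        System.out.println(selectUrl(base, "DOC45242", "entities,html"));
        System.out.println(getUrl(base, "DOC45242", "entities,html"));
    }
}
```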

Using Solr as key/value store

Lesson: Use what you have at hand

Don’t do expensive things in Solr

• Tika content extraction aka Solr Cell

• UpdateRequestProcessorChain

• Don’t duplicate work

Tika as a pipeline?

• Auto detects content type

• Metadata structure has all the key/value needed for Solr

• Allows us to scale up with Behemoth project.

• Ingest multiple XML formats as well as CSV and EDI

• Prototyping

• Maintaining Your Big Search Indexes

Indexing is Easy and Quick

CHEAP AND CHEERFUL

NRT versus BigData

The tension between scale and update rate

[Chart labels: 10 million, 100’s of millions, Bad Place]

Grim Reaper

Grim Reaper “Death of Mice”

Especially if you are on a cloud platform: providers build their servers on the cheapest commodity hardware.

Lesson: Embrace failure, don’t fear it

Provisioning

• Chef/Puppet

• ZooKeeper

• Have you versioned everything to build an index over again?

Lesson: Automate Everything!

TRADITIONAL ENVIRONMENT

POOLED ENVIRONMENT

Lesson: T

Building a Patents Index

5 days 3 days 30 Minutes

What happens when we want to index 2 million patents in 30 minutes?
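
To put that target in perspective, 2 million documents in 30 minutes works out to roughly 1,111 documents per second sustained across the whole cluster. The back-of-the-envelope helper below does that arithmetic and splits it across parallel indexers. The worker count of 16 is an arbitrary illustration, not a figure from the project.

```java
final class IndexThroughput {
    // Sustained rate needed to index totalDocs within the given number of minutes.
    static double docsPerSecond(long totalDocs, long minutes) {
        if (minutes <= 0) {
            throw new IllegalArgumentException("minutes must be positive");
        }
        return totalDocs / (minutes * 60.0);
    }

    // Rate each worker must sustain if the load is spread evenly.
    static double docsPerSecondPerWorker(long totalDocs, long minutes, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        return docsPerSecond(totalDocs, minutes) / workers;
    }

    public static void main(String[] args) {
        long patents = 2_000_000L;
        long minutes = 30;
        int workers = 16; // hypothetical number of parallel indexers
        System.out.printf("cluster: %.1f docs/s%n", docsPerSecond(patents, minutes));
        System.out.printf("per worker: %.1f docs/s%n",
                docsPerSecondPerWorker(patents, minutes, workers));
    }
}
```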

Amazon AWS is Good but...

• EC2 is costly for your “base” load

• Issues of access to internal data

• Firewall and security

Do I need Failover?

• Can I build quickly?

• Do I have a reliable cluster of servers?

• Am I spread across data centers?

• Is sooo 90’s....

• Think shared nothing cluster!

Lesson: No!

• Prototyping

One more thought...

Measuring the impact of our algorithm changes is just getting harder with Big Data.

www.solrpa.nl

Project SolrPanl

We need a motivated beta tester!

Thank you!

Questions?

• epugh@o19s.com

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask me later!
