Big Search 4 Big Data War Stories

Post on 11-May-2015

502 views 0 download

Tags:

description

Some lessons that we learned in rolling out a search engine across a very big set of data.

transcript

Big Search 4 Big DataEnterprise Search Summit Europe 2013 London

Eric Pugh | epugh@o19s.com | @dep4b

1

Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development

2

CO-AUTHORW

orking on 4.0!

3

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

war^

4

Background for Client X’s Project

• Big Data is any data set that is primarily at rest due to the difficulty of working with it.

• 100’s of millions of documents to search

• Aggressive timeline.

• All the data must be searched per query.

• Limited selection of tools available.

• On Solr 3.x line

6

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

7

Boy meets Girl Story

Metadata

Content Files

IngestPipeline

SolrSolrSolrSolr

8

Bash Rocks

9

Bash Rocks

• Remote Solr stop/start scripts

• Remote Indexer stop/start scripts

• Performance Monitoring

• Content Extraction scripts (+Java)

• Ingestor Scripts (+Java)

• Artifact Deployment (CM)

10

Lesson: Don’t get

captured by your

environment

11

Make it easy to change approach

12

Make it easy to change sharding

IndexStrategy indexStrategy = (IndexStrategy) Class.forName( "com.o19s.solr.ModShardIndexStrategy").newInstance(); indexStrategy.configure(options); for (SolrInputDocument doc:docs){ indexStrategy.addDocument(doc); }

Lesson: Sharpen

your axe

13

Go Wide Quickly

14

shard1shard1shard1shard1 :8983

shard1shard1shard1shard8 :8984

shard1shard1shard1shard12 :8985

search1.o19s.com

shard1shard1shard1shard12 :8985

shard1shard1shard1shard1 :8983

search1.o19s.com

shard1shard1shard1shard8 :8983

search2.o19s.com

shard1shard1shard1shard12 :8983

search3.o19s.com

Lesson: Hardw

are

is cheap/devs

expensive

15

Why so many pipelines?

16

Simple Pipeline

• Simple pipeline

• mv is atomic

Lesson: Simple

Works

17

Don’t Move Files

• SCP across machines is slow/error prone

• NFS share, single point of failure.

• Clustered file system like GFS (Global File System) can have “fencing” issues

• HDFS shines here.

• ZooKeeper shines here.

• Map/Reduce

18

Can you test your changes?

19

JVM tuning is black art-verbose:gc-XX:+PrintGCDetails-server-Xmx8G-Xms8G-XX:MaxPermSize=256m-XX:PermSize=256m-XX:+AggressiveHeap-XX:+DisableExplicitGC-XX:ParallelGCThreads=16-XX:+UseParallelOldGC

20

21

Run, don’t WalkLesson: Testing

needs to be easy

22

Telling some stories

• Prototyping

•Application Development

• Maintaining Your Big Search Indexes

23

Using Solr as key/value store

Metadata

Content Files

IngestPipeline

SolrSolrSolrSolr

Solr Key/Value Cache

24

• thousands of queries per second without real time get.

• how fast with real time get?

http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html

http://localhost:8983/solr/get?id=DOC45242&fl=entities,html

Using Solr as key/value store

Lesson: Use w

hat

you have at hand

25

Don’t do expensive things in Solr

• Tika content extraction aka Solr Cell

• UpdateRequestProcessorChain

• Don’t duplicate work

26

Tika as a pipeline?

• Auto detects content type

• Metadata structure has all the key/value needed for Solr

• Allows us to scale up with Behemoth project.

• Ingest multiple XML formats as well as CSV and EDI

27

Telling some stories

• Prototyping

• Application Development

•Maintaining Your Big Search Indexes

28

Indexing is Easy and Quick

29

CHEAP AND CHEERFUL

><

30

NRT versus BigData

31

The tension between scale and update rate

10 million 100’s of millionsBad Place

32

Grim Reaper33

Grim Reaper “Death of Mice”

Especially if you are on cloud platform. They implement their servers on the cheapest commodity hardware

Lesson: Embrace

failure, don’t fear

it

34

Provisioning

• Chef/Puppet

• ZooKeeper

• Have you versioned everything to build an index over again?

Lesson: Autom

ate

Everything!

35

TRADITIONAL ENVIRONMENT

36

POOLED ENVIRONMENTLesson: T

hink

Cloud

37

Building  a  Patents  Index

0

75

150

225

300

5 days 3 days 30 Minutes

1 5

300

Mac

hine

Cou

nt

What  happens  when  we  want  to  index  2  million  patents  in  30  minutes?

38

Amazon  AWS  is  Good  but...

•EC2  is  costly  for  your  “base”  load• Issues  of  access  to  internal  data•Firewall  and  security

39

Do I need Failover?

• Can I build quickly?

• Do I have a reliable cluster of servers?

• Am I spread across data centers?

• Is sooo 90’s....

• Think shared nothing cluster!

Lesson: No!

40

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

41

One more thought...

42

Measuring the impact of our algorithms

changes is just getting harder with Big Data.

43

www.solrpa.nl

Project SolrPanlProject SolrPanl

We need a

motivated beta

tester!

44

Thank you!

Questions?

• epugh@o19s.com

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask

me later!

45