
Big Search 4 Big Data War Stories

Date post: 11-May-2015
Upload: opensource-connections
Description:
Some lessons that we learned in rolling out a search engine across a very big set of data.
Page 1: Big Search 4 Big Data War Stories

Big Search 4 Big Data

Enterprise Search Summit Europe 2013 London

Eric Pugh | [email protected] | @dep4b

1

Page 2: Big Search 4 Big Data War Stories

Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development

2

Page 3: Big Search 4 Big Data War Stories

CO-AUTHOR

Working on 4.0!

3

Page 4: Big Search 4 Big Data War Stories

Telling some war stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

4

Page 6: Big Search 4 Big Data War Stories

Background for Client X’s Project

• Big Data is any data set that is primarily at rest due to the difficulty of working with it.

• 100’s of millions of documents to search

• Aggressive timeline.

• All the data must be searched per query.

• Limited selection of tools available.

• On Solr 3.x line

6

Page 7: Big Search 4 Big Data War Stories

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

7

Page 8: Big Search 4 Big Data War Stories

Boy meets Girl Story

Metadata

Content Files

Ingest Pipeline

Solr  Solr  Solr  Solr

8

Page 9: Big Search 4 Big Data War Stories

Bash Rocks

9

Page 10: Big Search 4 Big Data War Stories

Bash Rocks

• Remote Solr stop/start scripts

• Remote Indexer stop/start scripts

• Performance Monitoring

• Content Extraction scripts (+Java)

• Ingestor Scripts (+Java)

• Artifact Deployment (CM)

10

Page 11: Big Search 4 Big Data War Stories

Lesson: Don’t get captured by your environment

11

Page 12: Big Search 4 Big Data War Stories

Make it easy to change approach

12

Page 13: Big Search 4 Big Data War Stories

Make it easy to change sharding

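// Load the sharding strategy by class name so it can be swapped without touching the indexer.
IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
    "com.o19s.solr.ModShardIndexStrategy").newInstance();
indexStrategy.configure(options);
// Hand every document to the strategy, which decides where it goes.
for (SolrInputDocument doc : docs) {
    indexStrategy.addDocument(doc);
}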

Lesson: Sharpen your axe
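The slide does not show what ModShardIndexStrategy does internally. The following is a minimal sketch, assuming the name means "route each document by its id modulo the shard count". ShardRouter, ModShardRouter, and partition are hypothetical names, and plain string ids stand in for SolrInputDocument so the example runs without Solr on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

class ModShardSketch {

    /** Hypothetical routing contract; not the IndexStrategy interface from the slide. */
    interface ShardRouter {
        int shardFor(String docId);
    }

    /** Assumed behaviour: hash the document id and take it modulo the shard count. */
    static final class ModShardRouter implements ShardRouter {
        private final int numShards;

        ModShardRouter(int numShards) {
            if (numShards <= 0) {
                throw new IllegalArgumentException("numShards must be positive");
            }
            this.numShards = numShards;
        }

        @Override
        public int shardFor(String docId) {
            // floorMod keeps the result in [0, numShards) even for negative hash codes.
            return Math.floorMod(docId.hashCode(), numShards);
        }
    }

    /** Groups ids into one bucket per shard, preserving input order within a bucket. */
    static List<List<String>> partition(List<String> docIds, ShardRouter router, int numShards) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String id : docIds) {
            buckets.get(router.shardFor(id)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        ShardRouter router = new ModShardRouter(4);
        List<String> ids = List.of("DOC45242", "DOC45243", "DOC45244");
        System.out.println(partition(ids, router, 4));
    }
}
```

Because the indexer only sees the ShardRouter contract, trying a different sharding scheme means supplying a different implementation, which is the point of the slide.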

13

Page 14: Big Search 4 Big Data War Stories

Go Wide Quickly

14

Page 15: Big Search 4 Big Data War Stories

[Diagram: shards start on separate ports of one host, then spread across hosts]

search1.o19s.com
  shard1  :8983
  shard8  :8984
  shard12 :8985

search1.o19s.com  shard1  :8983
search2.o19s.com  shard8  :8983
search3.o19s.com  shard12 :8983

Lesson: Hardware is cheap/devs expensive

15

Page 16: Big Search 4 Big Data War Stories

Why so many pipelines?

16

Page 17: Big Search 4 Big Data War Stories

Simple Pipeline

• Simple pipeline

• mv is atomic

Lesson: Simple Works
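The "mv is atomic" point can be sketched in Java, the language of the indexer code shown earlier. This is an illustration only, assuming a staging directory and a ready directory on the same filesystem; AtomicHandoffSketch, publish, and roundTrip are hypothetical names and do not come from the talk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicHandoffSketch {

    /**
     * Writes the file in stagingDir, then renames it into readyDir.
     * A consumer polling readyDir never sees a half-written file.
     * Both directories are assumed to be on the same filesystem.
     */
    static Path publish(Path stagingDir, Path readyDir, String name, String content) {
        try {
            Path tmp = Files.createTempFile(stagingDir, name, ".tmp");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            // ATOMIC_MOVE fails with an exception instead of silently copying
            // when the filesystem cannot do an atomic rename.
            return Files.move(tmp, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: publishes one file under a fresh temp directory and reads it back. */
    static String roundTrip(String content) {
        try {
            Path root = Files.createTempDirectory("handoff");
            Path staging = Files.createDirectories(root.resolve("staging"));
            Path ready = Files.createDirectories(root.resolve("ready"));
            Path published = publish(staging, ready, "batch-0001.xml", content);
            return new String(Files.readAllBytes(published), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<doc id=\"DOC45242\"/>"));
    }
}
```

The same guarantee is what a shell pipeline gets from mv within one filesystem: the next stage either sees the whole file or does not see it at all.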

17

Page 18: Big Search 4 Big Data War Stories

Don’t Move Files

• SCP across machines is slow/error prone

• NFS share, single point of failure.

• Clustered file system like GFS (Global File System) can have “fencing” issues

• HDFS shines here.

• ZooKeeper shines here.

• Map/Reduce

18

Page 19: Big Search 4 Big Data War Stories

Can you test your changes?

19

Page 20: Big Search 4 Big Data War Stories

JVM tuning is black art

-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC
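One way to make a flag change testable is to have the process report what it actually received. The sketch below is an assumption about how that check might look, not something shown in the talk; JvmFlagCheck and its methods are hypothetical names. It uses only standard JDK calls. Note that the flags on this slide target the JVMs of the time; the PermGen options, for example, are ignored from Java 8 onward.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmFlagCheck {

    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Arguments passed to the JVM itself, such as -Xmx8G or -verbose:gc. */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** True if the exact flag string was passed on the command line. */
    static boolean hasFlag(String flag) {
        return jvmArguments().contains(flag);
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB): " + maxHeapBytes() / (1024 * 1024));
        System.out.println("jvm arguments: " + jvmArguments());
        System.out.println("-verbose:gc set: " + hasFlag("-verbose:gc"));
    }
}
```

Running this with and without a candidate flag set gives a quick confirmation that a tuning change reached the process before any load test starts.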

20

Page 21: Big Search 4 Big Data War Stories

21

Page 22: Big Search 4 Big Data War Stories

Run, don’t Walk

Lesson: Testing needs to be easy

22

Page 23: Big Search 4 Big Data War Stories

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

23

Page 24: Big Search 4 Big Data War Stories

Using Solr as key/value store

Metadata

Content Files

Ingest Pipeline

Solr  Solr  Solr  Solr

Solr Key/Value Cache

24

Page 25: Big Search 4 Big Data War Stories

Using Solr as key/value store

• Thousands of queries per second without real time get.

• How fast with real time get?

http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html

http://localhost:8983/solr/get?id=DOC45242&fl=entities,html

Lesson: Use what you have at hand
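The two URLs on this slide differ only in the handler and the parameter that carries the key. A minimal sketch of building both from the same inputs follows; SolrKeyValueUrls and its methods are hypothetical helper names, and the base URL and field names are simply the ones on the slide. It constructs strings only and makes no network call.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrKeyValueUrls {

    /** Lookup through the normal search handler: a query on the id field. */
    static String selectUrl(String solrBase, String id, String... fields) {
        // Only URL-encodes the id. Ids containing Solr query syntax characters
        // would also need query escaping, which this sketch does not attempt.
        return solrBase + "/select?q=id:" + encode(id) + "&fl=" + String.join(",", fields);
    }

    /** Lookup through the real time get handler: the id is passed directly. */
    static String realTimeGetUrl(String solrBase, String id, String... fields) {
        return solrBase + "/get?id=" + encode(id) + "&fl=" + String.join(",", fields);
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";
        System.out.println(selectUrl(base, "DOC45242", "entities", "html"));
        System.out.println(realTimeGetUrl(base, "DOC45242", "entities", "html"));
    }
}
```

Keeping both builders behind the same signature makes it easy to time one against the other, which is the question the slide asks.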

25

Page 26: Big Search 4 Big Data War Stories

Don’t do expensive things in Solr

• Tika content extraction aka Solr Cell

• UpdateRequestProcessorChain

• Don’t duplicate work

26

Page 27: Big Search 4 Big Data War Stories

Tika as a pipeline?

• Auto detects content type

• Metadata structure has all the key/value needed for Solr

• Allows us to scale up with Behemoth project.

• Ingest multiple XML formats as well as CSV and EDI

27

Page 28: Big Search 4 Big Data War Stories

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

28

Page 29: Big Search 4 Big Data War Stories

Indexing is Easy and Quick

29

Page 30: Big Search 4 Big Data War Stories

CHEAP AND CHEERFUL


30

Page 31: Big Search 4 Big Data War Stories

NRT versus BigData

31

Page 32: Big Search 4 Big Data War Stories

The tension between scale and update rate

[Figure: scale axis running from "10 million" to "100’s of millions" of documents, with a region marked "Bad Place"]

32

Page 33: Big Search 4 Big Data War Stories

Grim Reaper

33

Page 34: Big Search 4 Big Data War Stories

Grim Reaper “Death of Mice”

Especially if you are on a cloud platform: providers build their servers on the cheapest commodity hardware.

Lesson: Embrace failure, don’t fear it

34

Page 35: Big Search 4 Big Data War Stories

Provisioning

• Chef/Puppet

• ZooKeeper

• Have you versioned everything to build an index over again?

Lesson: Automate Everything!

35

Page 36: Big Search 4 Big Data War Stories

TRADITIONAL ENVIRONMENT

36

Page 37: Big Search 4 Big Data War Stories

POOLED ENVIRONMENT

Lesson: Think Cloud

37

Page 38: Big Search 4 Big Data War Stories

Building a Patents Index

[Chart: Machine Count (axis 0 to 300) by time to build the index]

Time to build    Machine Count
5 days           1
3 days           5
30 Minutes       300

What happens when we want to index 2 million patents in 30 minutes?
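The question on this slide comes down to simple arithmetic. The sketch below works it through, assuming the chart's 300 machines correspond to the 30-minute build of 2 million patents; that pairing is read off the chart labels, not stated outright. IndexThroughputSketch and its methods are hypothetical names.

```java
class IndexThroughputSketch {

    /** Documents per second needed to finish the whole build in the given time. */
    static double docsPerSecond(long documents, long seconds) {
        return (double) documents / seconds;
    }

    /** Documents per second each machine must sustain if work is spread evenly. */
    static double docsPerSecondPerMachine(long documents, long seconds, int machines) {
        return docsPerSecond(documents, seconds) / machines;
    }

    public static void main(String[] args) {
        long patents = 2_000_000L;   // from the slide
        long seconds = 30 * 60L;     // 30 minutes
        int machines = 300;          // assumed from the chart's 30-minute bar

        System.out.printf("%.1f docs/sec overall%n", docsPerSecond(patents, seconds));
        System.out.printf("%.1f docs/sec per machine%n",
                docsPerSecondPerMachine(patents, seconds, machines));
    }
}
```

Under those assumptions the cluster needs roughly 1,111 documents per second overall, or about 3.7 per machine, which is why going wide for a short burst is attractive.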

38

Page 39: Big Search 4 Big Data War Stories

Amazon AWS is Good but...

• EC2 is costly for your “base” load

• Issues of access to internal data

• Firewall and security

39

Page 40: Big Search 4 Big Data War Stories

Do I need Failover?

• Can I build quickly?

• Do I have a reliable cluster of servers?

• Am I spread across data centers?

• Failover is sooo 90’s...

• Think shared nothing cluster!

Lesson: No!

40

Page 41: Big Search 4 Big Data War Stories

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

41

Page 42: Big Search 4 Big Data War Stories

One more thought...

42

Page 43: Big Search 4 Big Data War Stories

Measuring the impact of our algorithm changes is just getting harder with Big Data.

43

Page 44: Big Search 4 Big Data War Stories

www.solrpa.nl

Project SolrPanl

We need a motivated beta tester!

44

Page 45: Big Search 4 Big Data War Stories

Thank you!

Questions?

[email protected]

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask me later!

45

