
Solr rug

Date post: 27-Jun-2015
Upload: phoet
Description:
Digging into Solr
Digging into solr Rails Usergroup Hamburg 13. April 2011
Transcript

Digging into solr
Rails Usergroup Hamburg, 13 April 2011

Overview

● What is solr
● Solr integration into Rails
● Challenges for the search
● Experiences

What is solr

● Matthew 7:7b / Luke 11:9b (Sermon on the Mount)
● "Seek and you will find"

What is solr

[Diagram: Solr architecture. Labels: HTTP Request Servlet, Admin, Update Servlet, different request handlers, XML update, Solr Core, Lucene, config, schema, caching, concurrency, replication.]

What is solr

● Unstructured rows
● Denormalization of data
● Dynamic fields
● Schema → Tokenizer, Filters, etc.
● Tons of XML

What is solr

[Diagram: the index in the centre. Indexing side: Strings, Tokenizer, Filter, Token. Query side: Query, Tokenizer, Filter, Results.]
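The analysis chain on this slide can be illustrated with a small sketch. It is not Solr code: the regex tokenizer, the lowercase filter and the in-memory dictionary are simplified stand-ins chosen for illustration. The point is only that the same tokenizer and filter run at indexing time and at query time.

```python
import re


def tokenize(text):
    """Split text into word tokens (stand-in for a Solr tokenizer)."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Stand-in for a token filter: normalise case."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the same chain for indexing and for querying."""
    return lowercase_filter(tokenize(text))


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every analysed query token."""
    tokens = analyze(query)
    if not tokens:
        return set()
    return set.intersection(*[index.get(token, set()) for token in tokens])


docs = {1: "Chipotle BBQ", 2: "Pizza and BBQ"}
index = build_index(docs)
print(search(index, "bbq"))
print(search(index, "chipotle BBQ"))
```

Because both sides share `analyze`, the query 'chipotle BBQ' finds the entry regardless of case. If the two chains differed, query tokens could miss the indexed tokens.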

What is solr

● GET requests

hl.fragsize=0&spellcheck=true&spellcheck.extendedResults=true&qf=everything_phonetic_wa^1+display_name_phonetic_wa^2+comment_en_wa^4+review_en_wa^8+everything_en_wa^16+everything_wa^32+display_name_en_wa^64+display_name_wa^128&spellcheck.collate=true&wt=ruby&hl=true&rows=100&fl=pk_i,score&start=0&q=chipotle+bbq&spellcheck.dictionary=spell_en&bf=linear(en_rating_points_i,100,0)&spellcheck.count=1&qt=dismax&fq=closed_b:false+AND+domain_id_s:uki*+AND+(type_s:Place)
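A query string like the one above is easier to read when assembled from named parameters. The sketch below uses only a subset of the parameters on the slide and Python's standard library. It does not contact a Solr server, and the shortened `qf` and `fq` values are a simplification for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Subset of the parameters from the slide, with shortened values.
params = {
    "q": "chipotle bbq",                            # user input
    "qt": "dismax",                                 # request handler
    "qf": "display_name_wa^128 everything_wa^32",   # fields with boosts
    "fq": "closed_b:false AND type_s:Place",        # filter query
    "fl": "pk_i,score",                             # fields to return
    "rows": 100,
    "start": 0,
    "wt": "ruby",                                   # response format
}

query_string = urlencode(params)
print(query_string)

# Decoding the string again gives back the original values.
decoded = parse_qs(query_string)
print(decoded["qf"])
```

`urlencode` escapes spaces as `+`, which matches the form used on the slide. It also percent-encodes characters such as `^` and `:`, which the slide shows unescaped.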

What is solr

● Response type
– XML
– Ruby
– JSON
– XML + XSLT
– etc.

Solr integration into Rails

● Sunspot
● acts_as_solr
● Qype → acts_as_solr
● Optimized queries for solr

● Monkey patching
● Defined queries without dynamic fields
● Names of search fields differ from AR names

Solr integration into Rails

● Data consistency
● Synchronous
– AR stores in MySQL and solr
– Longer response times
– Not really synchronous in case of replication
● Asynchronous
– AR stores in MySQL
– Data import via MySQL requests by solr master
– Out of sync for some minutes
– Deletion by flag, physical deletion later
– JavaScript preprocessing of data possible
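The "deletion by flag" point in the asynchronous variant can be sketched as follows. Rows stay in the database with a flag set so that the next import can still see them and remove them from the index. The row layout and the `deleted` column name are hypothetical, and the real import ran inside Solr against MySQL rather than in Python.

```python
def plan_sync(rows):
    """Split database rows into documents to index and ids to delete.

    The 'deleted' flag is a hypothetical column name; flagged rows are
    kept in the database until the index has dropped them.
    """
    to_index, to_delete = [], []
    for row in rows:
        if row["deleted"]:
            to_delete.append(row["id"])
        else:
            to_index.append({"pk_i": row["id"], "display_name": row["name"]})
    return to_index, to_delete


rows = [
    {"id": 1, "name": "Europa Passage", "deleted": False},
    {"id": 2, "name": "Closed place", "deleted": True},
]
to_index, to_delete = plan_sync(rows)
print(to_index)
print(to_delete)
```

Only after the delete has reached the index would the flagged row be removed physically from the database.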

Challenges - Spellchecking

● Pool of words for spellchecking
● Words from real data
● Beeeeeeer
● 9 languages
● New → spellchecker for different kinds of data
● Suggestion → Locator → Facet → best match?
● Similar word → fuzzy search vs. spellchecking

?

Challenges - Spellchecking

CC BY-ND 2.0 - JM3

Challenges - Spellchecking

Chipotle BBQ (CC BY-ND 2.0 - Meindert Arnold Jacob)

Chinese Baby (CC BY-ND 2.0 - joshDubya)

CC BY-ND 2.0 - raybdbomb

shingles! (CC BY-ND 2.0 - michael clarke stuff)

Challenges – Stemming

● Stemming vs. lemmatizing
● 9 languages
● Hafen – Hafer (harbor – oat)
● Performance
● Stemming → solr SnowballPorterFilterFactory
● Polish → lemmatizing → OpenOffice
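The Hafen/Hafer example shows why stemming and lemmatizing can give different results. The sketch below uses a deliberately crude suffix stripper and a three-entry lemma dictionary, both invented for illustration. It is not the Snowball algorithm and not the OpenOffice word lists.

```python
# Toy suffix list; a real stemmer applies language-specific rules.
SUFFIXES = ("en", "er", "e", "n")

# Toy lemma dictionary (hypothetical entries for this example only).
LEMMAS = {"hafen": "hafen", "häfen": "hafen", "hafer": "hafer"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the dictionary; unknown words stay unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


for word in ("Hafen", "Hafer", "Häfen"):
    print(word, toy_stem(word), toy_lemma(word))
```

In this toy version both 'Hafen' and 'Hafer' are cut down to the same stem, so a search for harbors would also match oats. The dictionary lookup keeps them apart but needs a word list per language.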

Challenges – Synonyms

● 9 languages
● OpenOffice rules!
● Not all languages available → NL is missing

Challenges – NGrams

● Huge index
● Tee matches Steeb
● EdgeNGrams
● Bar → Sofabar → Barmbek
● The unmatched remainder should itself be a word → performance
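The difference between n-grams and edge n-grams on this slide can be reproduced with a few lines. This is a plain-Python sketch of the idea, not Solr's filter implementation, and the gram sizes are assumed values.

```python
def ngrams(word, min_n=3, max_n=3):
    """All substrings of length min_n..max_n, from any position."""
    word = word.lower()
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }


def edge_ngrams(word, min_n=3, max_n=10):
    """Only the prefixes of length min_n..max_n."""
    word = word.lower()
    return {word[:n] for n in range(min_n, min(max_n, len(word)) + 1)}


# Plain n-grams: 'tee' matches inside 'Steeb', 'bar' inside 'Sofabar'.
print("tee" in ngrams("Steeb"))
print("bar" in ngrams("Sofabar"))

# Edge n-grams: 'bar' matches 'Barmbek' but no longer 'Sofabar'.
print("bar" in edge_ngrams("Barmbek"))
print("bar" in edge_ngrams("Sofabar"))
```

Plain n-grams store every position of every word, which is what makes the index huge and produces matches such as 'Tee' in 'Steeb'. Edge n-grams only store prefixes, which removes those matches but also loses 'Sofabar' for the query 'Bar'.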

Challenges – Phrases

● Boost matching of phrases → whole entry
– 'Europa Passage'

● Boost matching of phrases → left-sided
– 'Galeria Kaufhof in Hamburg'
– 'Boutique in Galeria Kaufhof'
– JavaScript preprocessing

● Boost matching of a phrase somewhere in the entry
– How to handle matches of only some words in the given phrase?

Challenges – Whitespace in index

● Index: 'Ping Pong'
● Search word: 'Pingpong'
● JavaScript preprocessing
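One way to make 'Pingpong' find 'Ping Pong' is to index a whitespace-free variant next to the original value. The slide only says that preprocessing was done in JavaScript during the import, so the functions below are an assumed approach written in Python for illustration.

```python
def index_variants(name):
    """Values to index for a name: lowercased form plus a form without spaces."""
    lowered = name.lower()
    collapsed = "".join(lowered.split())
    return {lowered, collapsed}


def matches(search_word, name):
    """Compare the whitespace-free search word against the indexed variants."""
    collapsed_query = "".join(search_word.lower().split())
    return collapsed_query in index_variants(name)


print(index_variants("Ping Pong"))
print(matches("Pingpong", "Ping Pong"))
print(matches("Ping", "Ping Pong"))
```

The extra variant costs one more value per entry at index time, but the query side stays a simple exact lookup.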

CC BY-ND 2.0 - zimpenfish

CC BY-ND 2.0 - Ewan-M

Experiences – server setup

[Diagram: server setup for Live, Staging and Dev. Labels as extracted: Loadbalancer; Slave, Slave, Slave; Master; Slave; DB Slave; Solr queries; Replication; Import; Slave; Master; DB Slave; iMac; Solr & MySql.]

Experiences – size of indices

● Staging system → Sunday evening
● Places in simple format: 712 MB
● Previews in simple format: 5.519 GB
● Places, previews, comments extended: 3.5 GB
● Big spellchecker: 16 GB
● New combined index: 15 GB
– Index: 14 GB
– Spellchecker: 1 GB

Experiences – server setup

● Live servers
● 2 x 8 cores, 2 x 16 cores
● 32 GB RAM
● Max. CPU usage: up to 500 %
● Solr loves RAM → 32 GB filled with cache

Experiences – Solr loves RAM

● Dev → 1 Gig
● Staging → 4.5 Gig (no load)
● Import → 11 Gig and more
● Production → 14 Gig

Experiences – Solr loves RAM prod. slave

Experiences – accesses

● More than ~60 requests per second are not recommended
● A maximum of 40 requests per second is OK

Experiences – accesses

Experiences – CPU load

● Last import → up to 250 %
● Production (slave):

Experiences – Response times

Experiences – Response times

● Spellchecking 'pizzt', big index (staging):
– 1502 / 48 / 47 / 48 / 31 ms
● Spellchecking 'pizzt', small index (staging):
– 603 / 12 / 8 / 9 / 9 ms

Experiences – Response times

● Facet for spellchecking:

facet=true&facet.mincount=0&facet.limit=1&wt=ruby&rows=0&fl=pk_i,score&facet.query=comment_de_wa:"pizza"+OR+review_de_wa:"pizza"+OR+everything_de_wa:"pizza"+OR+everything_wa:"pizza"+OR+display_name_de_wa:"pizza"+OR+display_name_wa:"pizza"+OR+display_name_ngram:"pizza"&facet.query=comment_de_wa:"pizze"+OR+review_de_wa:"pizze"+OR+everything_de_wa:"pizze"+OR+everything_wa:"pizze"+OR+display_name_de_wa:"pizze"+OR+display_name_wa:"pizze"+OR+display_name_ngram:"pizze"&facet.query=comment_de_wa:"pizz"+OR+review_de_wa:"pizz"+OR+everything_de_wa:"pizz"+OR+everything_wa:"pizz"+OR+display_name_de_wa:"pizz"+OR+display_name_wa:"pizz"+OR+display_name_ngram:"pizz"&facet.query=comment_de_wa:"pizzi"+OR+review_de_wa:"pizzi"+OR+everything_de_wa:"pizzi"+OR+everything_wa:"pizzi"+OR+display_name_de_wa:"pizzi"+OR+display_name_wa:"pizzi"+OR+display_name_ngram:"pizzi"&facet.query=comment_de_wa:"pizzs"+OR+review_de_wa:"pizzs"+OR+everything_de_wa:"pizzs"+OR+everything_wa:"pizzs"+OR+display_name_de_wa:"pizzs"+OR+display_name_wa:"pizzs"+OR+display_name_ngram:"pizzs"&facet.query=comment_de_wa:"pizzo"+OR+review_de_wa:"pizzo"+OR+everything_de_wa:"pizzo"+OR+everything_wa:"pizzo"+OR+display_name_de_wa:"pizzo"+OR+display_name_wa:"pizzo"+OR+display_name_ngram:"pizzo"&facet.query=comment_de_wa:"pizzy"+OR+review_de_wa:"pizzy"+OR+everything_de_wa:"pizzy"+OR+everything_wa:"pizzy"+OR+display_name_de_wa:"pizzy"+OR+display_name_wa:"pizzy"+OR+display_name_ngram:"pizzy"&facet.query=comment_de_wa:"pizzn"+OR+review_de_wa:"pizzn"+OR+everything_de_wa:"pizzn"+OR+everything_wa:"pizzn"+OR+display_name_de_wa:"pizzn"+OR+display_name_wa:"pizzn"+OR+display_name_ngram:"pizzn"&facet.query=comment_de_wa:"pezzt"+OR+review_de_wa:"pezzt"+OR+everything_de_wa:"pezzt"+OR+everything_wa:"pezzt"+OR+display_name_de_wa:"pezzt"+OR+display_name_wa:"pezzt"+OR+display_name_ngram:"pezzt"&facet.query=comment_de_wa:"pizzä"+OR+review_de_wa:"pizzä"+OR+everything_de_wa:"pizzä"+OR+everything_wa:"pizzä"+OR+display_name_de_wa:"pizzä"+OR+display_name_wa:"pizzä"+OR+display_name_ngram:"pizzä"&q=*:*&qt=standard&fq=closed_b:false+AND+domain_id_s:de600-hamburg*+AND+(type_s:Place)

● 10 facets: 231 / 5 / 4 / 22 / 3 (→ xml) ms
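The facet request above repeats one pattern per candidate word, so it lends itself to being generated. The sketch below rebuilds that pattern with the field names from the slide. The function names are made up for this example, and the filter query (`fq`) is left out for brevity.

```python
from urllib.parse import parse_qs, urlencode

# Field names as they appear in the request on the slide.
FIELDS = [
    "comment_de_wa",
    "review_de_wa",
    "everything_de_wa",
    "everything_wa",
    "display_name_de_wa",
    "display_name_wa",
    "display_name_ngram",
]


def facet_query(word):
    """One facet.query value: the word as a phrase in any of the fields."""
    return " OR ".join(f'{field}:"{word}"' for field in FIELDS)


def spellcheck_facet_params(candidates):
    """Request parameters with one facet.query per candidate word."""
    params = [
        ("q", "*:*"),
        ("qt", "standard"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.mincount", "0"),
        ("facet.limit", "1"),
        ("wt", "ruby"),
    ]
    params += [("facet.query", facet_query(word)) for word in candidates]
    return params


candidates = ["pizza", "pizze", "pizz"]
query_string = urlencode(spellcheck_facet_params(candidates))
print(query_string)
```

Each `facet.query` returns a count for one candidate, so a single request shows which spelling suggestions actually occur in the filtered data.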

Experiences – Response times

● Warming up → Staging vs. Production
● Staging: slow
● Production: fast

Experiences – Response times

● Staging / index schema on prod
● Standard query 'pizza': 106 / 0 / 0 (9122)
● Fuzzy (pizza~0.3): 4440 / 663 / 0 (40149)
● Fuzzy (pizza~0.5): 822 / 0 / 0 (12129)
● Fuzzy (pizza~0.8): 34 / 1 / 0 (9122)
● Wildcard (rest*): 39 / 0 / 0 (41031)
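The fuzzy thresholds 0.3, 0.5 and 0.8 can be explored with a small edit-distance sketch. It assumes that similarity is defined as 1 minus the edit distance divided by the length of the shorter term. That is an approximation of how fuzzy queries of that era scored terms, not a copy of Lucene's code, and the term list is invented.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed similarity: 1 - distance / length of the shorter term."""
    return 1 - edit_distance(a, b) / min(len(a), len(b))


def fuzzy_matches(query, terms, min_similarity):
    """Terms whose similarity to the query reaches the threshold."""
    return [t for t in terms if similarity(query, t) >= min_similarity]


terms = ["pizza", "pizze", "plaza", "pasta", "piano"]  # invented term list
for threshold in (0.3, 0.5, 0.8):
    print(threshold, fuzzy_matches("pizza", terms, threshold))
```

A lower threshold admits more terms, so more terms have to be expanded and scored per query.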

Experiences - Monitoring

● Munin
● New Relic

