Date posted: 27-Jun-2015
Category: Technology
Uploaded by: phoet
What is solr
[Architecture diagram with the components: HTTP Request Servlet, Admin, Update Servlet, different Request Handlers, XML Update, Solr Core (config, schema, caching, concurrency), Replication, Lucene]
What is solr
● Unstructured rows
● Denormalization of data
● Dynamic fields
● Schema → Tokenizer, Filters, etc.
● Tons of XML
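What is solr
[Diagram: at indexing time, strings pass through a tokenizer and filters and end up as tokens in the index; at query time, the query passes through a tokenizer and filter before results are returned]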
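The indexing and query flow on this slide can be illustrated with a few lines of Python. This is only a sketch of the idea (same analysis chain at index time and at query time), not how Solr or Lucene implement it; the names `tokenize`, `lowercase_filter`, `analyze`, `build_index` and `search` are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into word tokens (stand-in for a tokenizer)."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Normalize case (stand-in for a token filter)."""
    return [t.lower() for t in tokens]

def analyze(text):
    """The same chain is applied at index time and at query time."""
    return lowercase_filter(tokenize(text))

def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every analyzed query token."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {1: "Chipotle BBQ Grill", 2: "Pizza Place", 3: "BBQ and Pizza"}
index = build_index(docs)
print(search(index, "bbq"))        # {1, 3}
print(search(index, "PIZZA bbq"))  # {3}
```

Because the query goes through the same chain as the indexed strings, 'PIZZA' and 'Pizza' end up as the same token.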
What is solr
● Get Requests
hl.fragsize=0&spellcheck=true&spellcheck.extendedResults=true&qf=everything_phonetic_wa^1+display_name_phonetic_wa^2+comment_en_wa^4+review_en_wa^8+everything_en_wa^16+everything_wa^32+display_name_en_wa^64+display_name_wa^128&spellcheck.collate=true&wt=ruby&hl=true&rows=100&fl=pk_i,score&start=0&q=chipotle+bbq&spellcheck.dictionary=spell_en&bf=linear(en_rating_points_i,100,0)&spellcheck.count=1&qt=dismax&fq=closed_b:false+AND+domain_id_s:uki*+AND+(type_s:Place)
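A query string like the one on this slide is easier to maintain when it is assembled from structured data. The following is a minimal sketch using only the Python standard library; the helper name `build_dismax_query` is hypothetical, and only a subset of the parameters from the slide is included. The parameter names (`qt`, `q`, `qf`, `fq`, `fl`, `rows`, `start`, `wt`) are taken from the slide.

```python
from urllib.parse import urlencode, parse_qs

def build_dismax_query(q, boosts, filters, rows=100, start=0):
    """Assemble a dismax query string (hypothetical helper, subset of the slide's params)."""
    params = [
        ("qt", "dismax"),
        ("q", q),
        # qf: space-separated list of field^boost pairs
        ("qf", " ".join(f"{field}^{boost}" for field, boost in boosts)),
        # fq: all filters combined with AND
        ("fq", " AND ".join(filters)),
        ("fl", "pk_i,score"),
        ("rows", rows),
        ("start", start),
        ("wt", "ruby"),
    ]
    return urlencode(params)

query_string = build_dismax_query(
    "chipotle bbq",
    boosts=[("display_name_en_wa", 64), ("display_name_wa", 128)],
    filters=["closed_b:false", "domain_id_s:uki*", "(type_s:Place)"],
)
print(query_string)
```

`urlencode` percent-encodes characters such as `^` and `:`, so the output looks less readable than the slide's version but decodes to the same values.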
Solr integration into Rails
● Sunspot
● acts_as_solr
● Qype → acts_as_solr
● Optimized Queries for solr
● Monkey patching
● Defined queries without dynamic fields
● Names of search fields differ from ActiveRecord (AR) names
Solr integration into Rails
● Data consistency
● Synchronous
– AR stores in mysql and solr
– Longer response times
– Not really synchronous in case of replication
● Asynchronous
– AR stores in mysql
– Data import via mysql requests by solr master
– Out of sync for some minutes
– Deletion by flag, later physically
– Javascript preprocessing of data possible
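The asynchronous variant (import of changed rows, deletion by flag first and physically later) can be sketched as follows. This is a toy in-memory model, not the actual import mechanism used in the talk; the row layout (`id`, `name`, `deleted`, `updated_at`) and the function names are assumptions made for illustration.

```python
def import_changes(rows, index, last_import):
    """Pull rows changed since last_import into the index; return the new watermark."""
    newest = last_import
    for row in rows:
        if row["updated_at"] <= last_import:
            continue  # already imported
        if row["deleted"]:
            index.pop(row["id"], None)  # flagged rows leave the index
        else:
            index[row["id"]] = row["name"]
        newest = max(newest, row["updated_at"])
    return newest

def purge_deleted(rows, last_import):
    """Physically drop flagged rows once the index has seen the deletion."""
    return [r for r in rows
            if not (r["deleted"] and r["updated_at"] <= last_import)]

rows = [
    {"id": 1, "name": "Ping Pong Bar", "deleted": False, "updated_at": 10},
    {"id": 2, "name": "Europa Passage", "deleted": False, "updated_at": 11},
]
index = {}
last_import = import_changes(rows, index, 0)

# The application flags row 2 as deleted; the index is out of sync until the next import.
rows[1]["deleted"] = True
rows[1]["updated_at"] = 20
stale = 2 in index

last_import = import_changes(rows, index, last_import)
rows = purge_deleted(rows, last_import)
print(stale, sorted(index), [r["id"] for r in rows])
```

Keeping the flagged row until the next import is what lets the importer notice the deletion at all; a row that vanished from the database right away would simply stay in the index.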
Challenges - Spellchecking
● Pool of words for spellchecking
● Words from real data
● Beeeeeeer
● 9 Languages
● New → Spellchecker for different kinds of data
● Suggestion → Locator → Facet → best match?
● Similar word → fuzzy search vs. spellchecking
Challenges - Spellchecking
?
[Image: CC BY-ND 2.0 - JM3]
Challenges - Spellchecking
[Image: Chipotle BBQ, CC BY-ND 2.0 - Meindert Arnold Jacob]
[Image: Chinese Baby, CC BY-ND 2.0 - joshDubya]
[Image: CC BY-ND 2.0 - raybdbomb]
[Image: shingles!, CC BY-ND 2.0 - michael clarke stuff]
Challenges – Stemming
● Stemming vs. Lemmatizing
● 9 Languages
● Hafen – Hafer (Harbor – Oat)
● Performance
● Stemming → solr SnowballPorterFilterFactory
● Polish → Lemmatizing → OpenOffice
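The Hafen / Hafer problem can be reproduced with a deliberately naive suffix stripper. The sketch below is not the Snowball algorithm; the suffix list and the tiny lemma dictionary are invented for the example. It only shows why rule-based stemming can merge unrelated words, while a dictionary-based lemmatizer keeps them apart.

```python
# Toy suffix list for illustration only, NOT the Snowball rule set.
SUFFIXES = ("en", "er", "e", "n", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w

# Hypothetical mini dictionary mapping inflected forms to their lemma.
LEMMAS = {"häfen": "hafen", "hafens": "hafen", "hafers": "hafer"}

def lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    w = word.lower()
    return LEMMAS.get(w, w)

print(toy_stem("Hafen"), toy_stem("Hafer"))    # haf haf
print(lemmatize("Häfen"), lemmatize("Hafer"))  # hafen hafer
```

With stemming, a search for 'Hafen' would also hit entries about 'Hafer'; the lemmatizer avoids that at the cost of needing a dictionary per language.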
Challenges – Synonyms
● 9 Languages
● OpenOffice rules!
● Not all languages available → NL is missing
Challenges – NGrams
● Huge index
● Tee matches Steeb
● EdgeNGrams
● Bar → Sofabar → Barmbek
● Not matched string shall be a word → performance
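The difference between plain n-grams and edge n-grams, using the examples from this slide, can be shown in a few lines. This is a sketch of the general technique, not Solr's filter implementation; the function names and the gram sizes are chosen for the example.

```python
def ngrams(word, min_n=2, max_n=3):
    """All substrings of length min_n..max_n, from any position in the word."""
    w = word.lower()
    return {w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)}

def edge_ngrams(word, min_n=2, max_n=10):
    """Only prefixes of length min_n..max_n."""
    w = word.lower()
    return {w[:n] for n in range(min_n, min(max_n, len(w)) + 1)}

# Plain n-grams: 'tee' is found inside 'Steeb', 'bar' inside 'Sofabar'.
print("tee" in ngrams("Steeb", 3, 3))    # True
print("bar" in ngrams("Sofabar", 3, 3))  # True

# Edge n-grams: only words that start with the term match.
print("tee" in edge_ngrams("Steeb"))     # False
print("bar" in edge_ngrams("Sofabar"))   # False
print("bar" in edge_ngrams("Barmbek"))   # True
```

Edge n-grams also produce far fewer terms per word than n-grams taken from every position, which is relevant for the index size.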
Challenges – Phrases
● Boost matching of phrases → whole entry
● 'Europa Passage'
● Boost matching of phrases → left sided
● 'Galeria Kaufhof in Hamburg'
● 'Boutique in Galeria Kaufhof'
● Javascript pre processing
● Boost matching of phrase somewhere in entry
● How to handle matches of some words in given phrase?
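The phrase cases from these slides (whole entry, left sided, somewhere in the entry, only some words) can be told apart with simple word-list comparisons. The sketch below is one possible classification, not the approach used in the talk; the function name and the boost values in `BOOSTS` are made up, and treating a partial word match as the weakest case is just one answer to the open question on the slide.

```python
# Hypothetical boost factors per match type.
BOOSTS = {"whole": 8, "left": 4, "inside": 2, "partial": 1, "none": 0}

def phrase_match_type(entry, phrase):
    """Classify how a search phrase matches an entry, word by word."""
    e = entry.lower().split()
    p = phrase.lower().split()
    if not p:
        return "none"
    if e == p:
        return "whole"    # phrase is the whole entry
    if e[:len(p)] == p:
        return "left"     # entry starts with the phrase
    if any(e[i:i + len(p)] == p for i in range(1, len(e) - len(p) + 1)):
        return "inside"   # phrase appears later in the entry
    if set(p) & set(e):
        return "partial"  # only some words of the phrase match
    return "none"

for entry in ["Europa Passage", "Galeria Kaufhof in Hamburg",
              "Boutique in Galeria Kaufhof", "Kaufhof Restaurant"]:
    for phrase in ["Europa Passage", "Galeria Kaufhof"]:
        kind = phrase_match_type(entry, phrase)
        print(f"{phrase!r} in {entry!r}: {kind} (boost {BOOSTS[kind]})")
```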
Challenges – Whitespace in index
● Index: 'Ping Pong'
● Search word: 'Pingpong'
● Javascript pre processing
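One way to make 'Pingpong' find 'Ping Pong' is to add a whitespace-free variant of the name during pre-processing. The slide mentions Javascript for this step; the sketch below uses Python for consistency with the other examples, and the function names are hypothetical.

```python
def collapsed_variant(name):
    """Return the name lowercased with all whitespace removed."""
    return "".join(name.split()).lower()

def index_terms(name):
    """Terms to index for a name: the single words plus the collapsed variant."""
    words = [w.lower() for w in name.split()]
    terms = set(words)
    if len(words) > 1:
        terms.add(collapsed_variant(name))
    return terms

terms = index_terms("Ping Pong")
print(sorted(terms))                 # ['ping', 'pingpong', 'pong']
print("pingpong" in terms)           # True
```

The extra term is only added for multi-word names, so single-word entries do not grow.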
CC BY-ND 2.0 - zimpenfish
CC BY-ND 2.0 - Ewan-M
Experiences – server setup
[Diagram: three environments, Live, Staging and Dev.
Live: Loadbalancer; Slave, Slave, Slave; Master; Slave; DB Slave. Arrow labels: Solr queries, Replication, Import.
Staging: Slave; Master; DB Slave.
Dev: iMac; Solr & MySql.]
Experiences – size of indices
● Staging System → Sunday evening
● Places in simple format: 712 MB
● Previews simple format: 5.519 GB
● Places Previews Comments extended: 3.5 GB
● Big Spellchecker: 16 GB
● New combined index: 15 GB
– Index: 14 GB
– Spellchecker: 1 GB
Experiences – server setup
● Live Servers
● 2 x 8 Cores, 2 x 16 Cores
● 32 GB RAM
● Max. CPU usage: up to 500%
● Solr loves RAM → 32 GB full with cache
Experiences – Solr loves RAM
● Dev → 1 GB
● Staging → 4.5 GB (no load)
● Import → 11 GB and more
● Production → 14 GB
Experiences – accesses
● More than ~60 requests per second are not recommended
● Max of 40 requests per second is OK
Experiences – Response times
● Spellchecking 'pizzt' big index (staging): 1502 / 48 / 47 / 48 / 31 ms
● Spellchecking 'pizzt' small index (staging): 603 / 12 / 8 / 9 / 9 ms
Experiences – Response times
● Facet for spellchecking:
facet=true&facet.mincount=0&facet.limit=1&wt=ruby&rows=0&fl=pk_i,score&facet.query=comment_de_wa:"pizza"+OR+review_de_wa:"pizza"+OR+everything_de_wa:"pizza"+OR+everything_wa:"pizza"+OR+display_name_de_wa:"pizza"+OR+display_name_wa:"pizza"+OR+display_name_ngram:"pizza"&facet.query=comment_de_wa:"pizze"+OR+review_de_wa:"pizze"+OR+everything_de_wa:"pizze"+OR+everything_wa:"pizze"+OR+display_name_de_wa:"pizze"+OR+display_name_wa:"pizze"+OR+display_name_ngram:"pizze"&facet.query=comment_de_wa:"pizz"+OR+review_de_wa:"pizz"+OR+everything_de_wa:"pizz"+OR+everything_wa:"pizz"+OR+display_name_de_wa:"pizz"+OR+display_name_wa:"pizz"+OR+display_name_ngram:"pizz"&facet.query=comment_de_wa:"pizzi"+OR+review_de_wa:"pizzi"+OR+everything_de_wa:"pizzi"+OR+everything_wa:"pizzi"+OR+display_name_de_wa:"pizzi"+OR+display_name_wa:"pizzi"+OR+display_name_ngram:"pizzi"&facet.query=comment_de_wa:"pizzs"+OR+review_de_wa:"pizzs"+OR+everything_de_wa:"pizzs"+OR+everything_wa:"pizzs"+OR+display_name_de_wa:"pizzs"+OR+display_name_wa:"pizzs"+OR+display_name_ngram:"pizzs"&facet.query=comment_de_wa:"pizzo"+OR+review_de_wa:"pizzo"+OR+everything_de_wa:"pizzo"+OR+everything_wa:"pizzo"+OR+display_name_de_wa:"pizzo"+OR+display_name_wa:"pizzo"+OR+display_name_ngram:"pizzo"&facet.query=comment_de_wa:"pizzy"+OR+review_de_wa:"pizzy"+OR+everything_de_wa:"pizzy"+OR+everything_wa:"pizzy"+OR+display_name_de_wa:"pizzy"+OR+display_name_wa:"pizzy"+OR+display_name_ngram:"pizzy"&facet.query=comment_de_wa:"pizzn"+OR+review_de_wa:"pizzn"+OR+everything_de_wa:"pizzn"+OR+everything_wa:"pizzn"+OR+display_name_de_wa:"pizzn"+OR+display_name_wa:"pizzn"+OR+display_name_ngram:"pizzn"&facet.query=comment_de_wa:"pezzt"+OR+review_de_wa:"pezzt"+OR+everything_de_wa:"pezzt"+OR+everything_wa:"pezzt"+OR+display_name_de_wa:"pezzt"+OR+display_name_wa:"pezzt"+OR+display_name_ngram:"pezzt"&facet.query=comment_de_wa:"pizzä"+OR+review_de_wa:"pizzä"+OR+everything_de_wa:"pizzä"+OR+everything_wa:"pizzä"+OR+display_name_de_wa:"pizzä"+OR+display_name_wa:"pizzä"+OR+display_name_ngram:"pizzä"&q=*:*&qt=standard&fq=closed_b:false+AND+domain_id_s:de600-hamburg*+AND+(type_s:Place)
● 10 facets: 231 / 5 / 4 / 22 / 3 (->xml) ms
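The facet request on this slide repeats the same `facet.query` pattern once per spelling candidate, so it lends itself to being generated. The following sketch builds such a request with the Python standard library. The field names and fixed parameters are taken from the slide; the helper names `facet_query` and `spellcheck_facet_params` are hypothetical, and the candidate list is only an excerpt.

```python
from urllib.parse import urlencode, parse_qsl

# Field names as they appear in the slide's request.
FIELDS = [
    "comment_de_wa", "review_de_wa", "everything_de_wa", "everything_wa",
    "display_name_de_wa", "display_name_wa", "display_name_ngram",
]

def facet_query(word, fields=FIELDS):
    """One facet.query value: the candidate word OR-ed over all fields."""
    return " OR ".join(f'{field}:"{word}"' for field in fields)

def spellcheck_facet_params(candidates, filter_query):
    """Query string with one facet.query per candidate spelling."""
    params = [
        ("q", "*:*"), ("qt", "standard"), ("rows", 0),
        ("facet", "true"), ("facet.mincount", 0), ("facet.limit", 1),
        ("wt", "ruby"), ("fq", filter_query),
    ]
    params += [("facet.query", facet_query(word)) for word in candidates]
    return urlencode(params)

candidates = ["pizza", "pizze", "pizz", "pizzä"]
query_string = spellcheck_facet_params(
    candidates,
    "closed_b:false AND domain_id_s:de600-hamburg* AND (type_s:Place)",
)
print(query_string)
```

Each `facet.query` comes back with a document count, so the candidate with the highest count within the filtered domain would presumably be the best suggestion.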
Experiences – Response times
● Staging / index schema on prod
● Standard Query 'pizza': 106 / 0 / 0 (9122)
● Fuzzy (pizza~0.3): 4440 / 663 / 0 (40149)
● Fuzzy (pizza~0.5): 822 / 0 / 0 (12129)
● Fuzzy (pizza~0.8): 34 / 1 / 0 (9122)
● Wildcard (rest*): 39 / 0 / 0 (41031)
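Why a lower fuzzy threshold matches so many more terms can be seen from a small similarity calculation. The sketch below assumes the older Lucene-style definition, similarity = 1 - edit distance / length of the shorter term, with a term matching when its similarity is strictly greater than the threshold; that definition is an assumption here, not something stated on the slide, and the vocabulary is invented.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Assumed definition: 1 - distance / length of the shorter term."""
    return 1.0 - levenshtein(a, b) / min(len(a), len(b))

def fuzzy_matches(term, vocabulary, min_similarity):
    """Terms whose similarity is strictly greater than the threshold."""
    return [w for w in vocabulary if similarity(term, w) > min_similarity]

# Invented vocabulary for illustration.
vocabulary = ["pizza", "pizze", "pizzeria", "plaza", "piazza", "pasta"]
for threshold in (0.3, 0.5, 0.8):
    print(threshold, fuzzy_matches("pizza", vocabulary, threshold))
```

Under this assumption, a threshold of 0.8 on a five-letter term leaves no room for even a single edit, which would be consistent with pizza~0.8 showing the same number in parentheses as the standard query on the slide.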