Date posted: 17-Jul-2015
Category: Technology
Uploaded by: sander-kieft
Search - Find the rabbit
24-4-2015 © Sanoma Media
Agenda
• Search Basics
• Features
• Search solutions
  » MySQL (Full-Text search and Sphinx)
  » Solr
  » ElasticSearch
• Sanoma Content Library
• Common gotchas
Basics - ABC of search
High level components
Filtering, Indexing, Querying, Ranking
High level components - Filtering techniques
• Tokenizing
• Stop Words
• Synonyms
• Stemming
• Term occurrence
• Phonetics
Filtering techniques - Tokenizing
The quick brown fox jumps over a lazy dog
[The] [quick] [brown] [fox] [jumps] [over] [a] [lazy] [dog]
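A minimal sketch of the tokenizing step, assuming tokens are runs of letters or digits and that everything is lowercased first. The function name `tokenize` and both of those choices are illustrative assumptions, not something the slides prescribe.

```python
import re


def tokenize(text):
    """Lowercase the text and return every run of letters or digits as a token."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("The quick brown fox jumps over a lazy dog"))
print(tokenize("Canon Power-Shot SD500"))
```

With this particular rule a hyphen splits a word in two, which is exactly the kind of decision about special characters, case and numbers that a real tokenizer has to make deliberately.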
Filtering techniques - Tokenizing
• Special characters: + - $ ^ & etc.
  » IBM
• Case and numeric changes
  » PowerShot, TransAM, SD500, iPod
• Decide what you want to happen with
  » Canon Power-Shot SD500 (Canon Power shot SD-500, Canon Powershot SD 500)
  » O'neill's
Filtering techniques - Stop Words
• Keep stop words out of the index
• They add no value because they are too common
The quick brown fox jumps over a lazy dog
quick brown fox jumps lazy dog
Stop words: a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, got, had, have, h…
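Stop-word removal can be sketched as a simple filter over the token list. The stop-word set below is a small illustrative one (an assumption, since the list on the slide is cut off), chosen so that the fox sentence comes out the way the slide shows it.

```python
# Illustrative stop-word set; a real deployment would load a full,
# language-specific list.
STOP_WORDS = {"a", "an", "and", "the", "over", "to", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (comparison is case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


sentence = "The quick brown fox jumps over a lazy dog"
print(remove_stop_words(sentence.split()))
```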
Filtering techniques - Synonyms
• De-duplicate words that mean the same thing
  » bicycle, cycle, bike
  » i-pod, ipot => iPod
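One way to picture the synonym filter is a lookup table that maps every variant onto a single canonical term. The table contents and the choice of `bicycle` and `ipod` as canonical forms are assumptions made for this sketch.

```python
# Hypothetical synonym table: variant -> canonical term.
SYNONYMS = {
    "cycle": "bicycle",
    "bike": "bicycle",
    "i-pod": "ipod",
    "ipot": "ipod",
}


def apply_synonyms(tokens, synonyms=SYNONYMS):
    """Lowercase each token and replace known variants by their canonical term."""
    return [synonyms.get(token.lower(), token.lower()) for token in tokens]


print(apply_synonyms(["Bike", "cycle", "bicycle"]))
print(apply_synonyms(["i-pod", "ipot", "iPod"]))
```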
Filtering techniques - Stemming
• Determine the stem of a word
  » Dogs => dog
  » Recharging => recharg
  » Rechargeable => recharg
• Language specific
  » Porter for English (-s, -ed, -ly, -ing, etc.)
  » Snowball/Porter or Kraaij-Pohlmann for Dutch (ge-, -en, etc.)
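The idea of stemming can be shown with a deliberately tiny suffix stripper. This is not the Porter algorithm, only a toy with a hand-picked suffix list (an assumption) that happens to reproduce the three examples above; real stemmers apply many more rules and conditions.

```python
# Toy suffix list, checked in this order; NOT the Porter rule set.
SUFFIXES = ("able", "ing", "ed", "ly", "s")


def toy_stem(word):
    """Strip one known suffix, then a trailing 'e', keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


for example in ("Dogs", "Recharging", "Rechargeable"):
    print(example, "=>", toy_stem(example))
```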
Filtering techniques - Term occurrence
• Options for limiting the size of the index
  » Minimum term frequency
  » Minimum term length
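A sketch of how the two limits might be applied before terms reach the index. It assumes "term frequency" means the total count across the whole collection and uses made-up thresholds; engines differ on both points.

```python
from collections import Counter


def select_terms(documents, min_frequency=2, min_length=3):
    """Return the terms that occur often enough and are long enough to index.

    `documents` is a list of token lists.
    """
    counts = Counter(token for tokens in documents for token in tokens)
    return {
        term
        for term, count in counts.items()
        if count >= min_frequency and len(term) >= min_length
    }


docs = [["quick", "fox", "a"], ["quick", "dog", "a"]]
print(select_terms(docs))
```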
Filtering techniques - Phonetics
• Handling "sounds like" queries
  » Robert => R163 <= Rupert
  » Smith => (SM0, XMT) ∩ (XMT, SMT) <= Schmith
• Various methods available
  » DoubleMetaphone
  » Metaphone
  » Soundex
  » RefinedSoundex
  » Caverphone
  » BeiderMorse
• Levenshtein distance can be used during querying
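The Robert/Rupert example can be reproduced with a compact Soundex implementation, and the Levenshtein point with a standard edit-distance routine. Both functions below are sketches written for this text (the names `soundex` and `levenshtein` are not from the slides); the other encoders in the list are considerably more involved.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(letter, "")  # vowels give "" and reset `previous`
        if code and code != previous:
            encoded += code
        previous = code
    return (encoded + "000")[:4]


def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        row = [i]
        for j, char_b in enumerate(b, 1):
            row.append(min(previous_row[j] + 1,                      # deletion
                           row[j - 1] + 1,                           # insertion
                           previous_row[j - 1] + (char_a != char_b)))  # substitution
        previous_row = row
    return previous_row[-1]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(levenshtein("Smith", "Schmith"))       # 2
```

Phonetic codes are computed once per term and stored in the index, whereas an edit distance is usually evaluated against candidate terms at query time.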
High level components - Apply the same filters when indexing and querying
Same filters (stop words, stemming, synonyms, etc.)
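Why the same filters matter on both sides can be sketched with a tiny inverted index: documents and queries go through one shared `analyze` function, so "Dogs" in a query finds "dogs" in a document. The filter chain here (lowercasing, a three-word stop list, stripping a plural "s") is a stand-in chosen for brevity, not a recommended configuration.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "over"}


def analyze(text):
    """The single filter chain shared by indexing and querying."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    # Crude plural stripping as a placeholder for real stemming.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def build_index(documents):
    """Map every analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing all analyzed query terms."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index({1: "The quick brown fox", 2: "Lazy dogs sleep"})
print(search(index, "The Dogs"))
```

If the query skipped `analyze`, the raw term "Dogs" would not match the indexed term "dog" and the search would come back empty.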
High level components - Indexing
High level components - Querying
DEMO - Stemming, Phonetics
High level components - Ranking
TF-IDF - Term Frequency-Inverse Document Frequency
How often does the search term occur in the text?
How many words are in the entire text?
High level components - Ranking – TF-IDF
3/12 = 0.25 (more relevant)
5/24 = 0.21
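The two fractions are term frequencies: occurrences of the search term divided by the length of the text. The sketch below recomputes them and adds one common form of the inverse document frequency, `log(N / df)`; engines use different variants of that formula, so treat it as an assumption rather than what any particular product does.

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the text."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, documents):
    """log(N / df): rarer terms across the collection get a higher weight."""
    matching = sum(1 for tokens in documents if term in tokens)
    if matching == 0:
        return 0.0
    return math.log(len(documents) / matching)


# Two made-up texts with the word counts from the slide.
doc_a = ["search"] * 3 + ["filler"] * 9    # 3 hits in 12 words
doc_b = ["search"] * 5 + ["filler"] * 19   # 5 hits in 24 words

print(round(term_frequency("search", doc_a), 2))
print(round(term_frequency("search", doc_b), 2))
```

Although the second text contains the term more often in absolute numbers, the first one scores higher because the term makes up a larger share of it.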
USER PATTERNS
• Features should be adjusted to your users and the usage patterns you're seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search
User pattern - Quit
User pattern - Pogosticking
User pattern - Thrashing
User pattern - Narrow
User patterns – Others
• Pearl Growing
• Expand
Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791
Search Features
• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: did you mean
• Geospatial: "bike repair" in area of [long,lat],[long,lat]
• Boosting: when title is more relevant than content
• Elevation: always get a certain result at position n; get the current weather or current traffic at 1st position, or inject ads
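Boosting and elevation can be illustrated with a toy ranker. The scoring rule (raw hit counts, title hits weighted by a hypothetical `title_boost`) and the decision to pin the elevated result at the first position are assumptions made for this sketch; real engines expose these features through their own query parameters.

```python
def score(document, term, title_boost=2.0):
    """Count term hits, weighting hits in the title more than hits in the content."""
    title_hits = document["title"].lower().split().count(term)
    content_hits = document["content"].lower().split().count(term)
    return title_boost * title_hits + content_hits


def rank(documents, term, elevated=None):
    """Order matching documents by score; an elevated document is pinned first."""
    matches = [d for d in documents if score(d, term) > 0]
    ranked = sorted(matches, key=lambda d: score(d, term), reverse=True)
    if elevated is not None:
        ranked = [elevated] + [d for d in ranked if d is not elevated]
    return ranked


docs = [
    {"title": "Shop", "content": "bike parts"},
    {"title": "Bike repair", "content": "we fix things"},
]
ad = {"title": "Sponsored", "content": "advertisement"}

print([d["title"] for d in rank(docs, "bike")])
print([d["title"] for d in rank(docs, "bike", elevated=ad)])
```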
Search Features - Faceting
From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
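In code terms, a facet is a count of field values over the current result set, and drilling down is a filter on one of those values. The document shape (a dict with a `tags` list) is an assumption for this sketch.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the given multi-valued field."""
    counts = Counter()
    for document in results:
        counts.update(document.get(field, []))
    return counts


def drill_down(results, field, value):
    """Restrict the results to documents that have the chosen facet value."""
    return [d for d in results if value in d.get(field, [])]


results = [
    {"id": 1, "tags": ["fruit", "yellow"]},
    {"id": 2, "tags": ["fruit"]},
    {"id": 3, "tags": ["yellow"]},
]
print(facet_counts(results, "tags"))
print([d["id"] for d in drill_down(results, "tags", "fruit")])
```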
Search Features - Autocomplete
Search Features - More like this
• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^5.6974 body:column^5.7123 body:oracle^6.1915
Term      Instances of term in document   Documents matching term   IDF value    Score
pre       18                              26                        4.609916     82.978
username  10                              23                        4.7276993    47.276
column    9                               13                        5.266696     47.400264
oracle    9                               8                         5.7085285    51.376
alter     7                               1                         7.212606     50.488
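The numbers in the table are consistent with score = term count × IDF, and the boosts in the example query are close to each score scaled against the top score times 10. The sketch below recomputes them from the table values. The `min_doc_freq` cutoff is an assumption introduced here to explain why "alter" (which matches only one document) is absent from the query; the slides do not state it.

```python
# (term, occurrences in the source document, documents matching the term, IDF value)
TABLE = [
    ("pre", 18, 26, 4.609916),
    ("username", 10, 23, 4.7276993),
    ("column", 9, 13, 5.266696),
    ("oracle", 9, 8, 5.7085285),
    ("alter", 7, 1, 7.212606),
]


def term_boosts(rows, min_doc_freq=5, scale=10.0):
    """Score each frequent-enough term as tf * idf and scale against the best score."""
    scores = {term: tf * idf for term, tf, df, idf in rows if df >= min_doc_freq}
    top = max(scores.values())
    return {term: scale * value / top for term, value in scores.items()}


def boosted_query(boosts, field="body"):
    """Build a query string; the top term carries no explicit boost suffix."""
    ordered = sorted(boosts.items(), key=lambda item: item[1], reverse=True)
    parts = []
    for position, (term, boost) in enumerate(ordered):
        suffix = "" if position == 0 else f"^{boost:.4f}"
        parts.append(f"{field}:{term}{suffix}")
    return " ".join(parts)


print(boosted_query(term_boosts(TABLE)))
```

The recomputed boosts agree with the slide to about three decimals; the last digit can differ because the slide values appear to be truncated rather than rounded.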
Search Features - Highlighting
• Highlighting the search terms
• Includes stemming and other logic
DEMO - SOLR
SOLUTIONS
Services - Common search options
• MySQL based
  » Native Full-Text search
  » Sphinx Search Plugin
• Lucene based (Java)
  » Apache Lucene/Solr
  » ElasticSearch
[Chart: the common search options plotted by ease of use against power]
MySQL Based - Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables before MySQL 5.6 (InnoDB gained full-text indexes in 5.6), and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
Querying is easy
• MySQL Full-Text query
    SELECT * FROM articles WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
    SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;
• Sphinx query: the index is a separate table
    SELECT id, created_time, weight FROM my_sphinx_index
    WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone')
    ORDER BY weight DESC, created_time DESC;
Lucene based
ElasticSearch
• A simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and realtime
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposing all of the Lucene power
• Clustering possible but harder
• Focus on complete and customizable
• Defaults
• Configuration = 3000 lines
Solr vs ElasticSearch - search time in ms (lower is better)
[Chart: Search Fresh Index While Idle]
[Chart: Search Fresh Index While Indexing, 1 doc/3 sec]
[Chart: Search Full Index While Indexing, 1 doc/3 sec]
[Chart: Solr vs ElasticSearch summary for Idle, Indexing, and Full + Indexing]
Querying with Solr and ElasticSearch
Solr
• Normal query
    http://.../solr/?q=field:banana
• Faceting
    http://.../solr/?q=field:banana&facet=on&facet.field=tags
ElasticSearch
• Normal query
    http://.../_search?q=field:value
• Advanced queries via POST with a JSON body
    POST http://.../collection/_search
    {
      "query": {"query_string": {"query": "T"}},
      "facets": {"tags": {"terms": {"field": "tags"}}}
    }
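The two request styles can be assembled with the Python standard library without contacting a server. The base URL below is a hypothetical placeholder, and the helper names are made up for this sketch. The `facets` section mirrors the request body on the slide; ElasticSearch 2.0 and later removed facets in favour of aggregations, so that part applies only to the versions current when the deck was written.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; substitute your own host, port and core.
SOLR_BASE = "http://localhost:8983/solr/articles/select"


def solr_facet_url(field, value, facet_field):
    """Build a Solr query URL that also asks for facet counts on one field."""
    params = [("q", f"{field}:{value}"), ("facet", "on"), ("facet.field", facet_field)]
    return f"{SOLR_BASE}?{urlencode(params)}"


def elasticsearch_facet_body(query_text, facet_field):
    """Build the JSON body for a query_string search with a terms facet."""
    body = {
        "query": {"query_string": {"query": query_text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }
    return json.dumps(body)


print(solr_facet_url("field", "banana", "tags"))
print(elasticsearch_facet_body("banana", "tags"))
```

Solr keeps everything in URL parameters, while ElasticSearch moves anything beyond a simple `q=` search into a structured JSON document.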
ElasticSearch
SANOMA CONTENT LIBRARY
Sanoma Content Library
• Search
  » in site
  » in cluster
  » in network
  » Elevation (ads)
  » Faceting
• Related
  » More like this
  » Relevant ads
  » Products
• Reuse
  » Sharing
  » Variants
  » (simple) DRM
  » Images
• Analyse
  » Sentiment
  » Named Entities
  » Tagging
  » Classification
  » Key phrases
Services - Content Library
[Architecture diagram of the Content Library. Labels shown: Analyse Pipeline, NER, Sentiment, Keyphrase extractor, Classifier, Crawler, Indexer, Search index, API, Edge, Redirects, Loader, Solr, Mongo, CMS, JCR]
• Search: nu.nl, wtf
• Related: Vrouwen, Kieskeurig
• Relevant: Txel
• Integration: Vrouwen, Wordpress, SAS
Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
  » Timestamps
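One reading of the timestamp gotcha is that indexing full-precision timestamps creates a unique term for almost every document. A possible mitigation, sketched here under that assumption, is to round the value to the resolution people actually search on before it is indexed; the function name and the supported resolutions are illustrative.

```python
from datetime import datetime


def index_value(timestamp, resolution="day"):
    """Reduce a timestamp to the coarser value that will be stored in the index."""
    if resolution == "day":
        return timestamp.strftime("%Y-%m-%d")
    if resolution == "hour":
        return timestamp.strftime("%Y-%m-%dT%H")
    raise ValueError(f"unsupported resolution: {resolution}")


stamps = [
    datetime(2015, 4, 24, 9, 15, 3),
    datetime(2015, 4, 24, 9, 47, 59),
    datetime(2015, 4, 24, 17, 0, 0),
]
print({index_value(s) for s in stamps})
print({index_value(s, "hour") for s in stamps})
```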
END
22442015 copy Sanoma Media
Agenda
bull Search Basics
bull Features
bull Search solutions
raquoMySQL (Full-Text search and Sphinx)
raquoSolr
raquoElasticSearch
bull Sanoma Content Library
bull Common gotcharsquos
BasicsABC of search
High level components
Filtering Indexing Querying Ranking
42442015 copy Sanoma Media
High level componentsFiltering techniques
Filtering Indexing Querying Ranking
54242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
High level componentsFiltering techniques Filtering Indexing Querying Ranking
64242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
The quick brown fox jumps over a lazy dog
Thequickbrown
foxjumpsover
alazydog
High level componentsFiltering techniques Filtering Indexing Querying Ranking
74242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Special characters +-$^amp etc
raquoIBM
bull Case and numeric changes
raquoPowerShot TransAM SD500 iPod
bull Decide what you want to happened with
raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)
raquoOrsquoneillrsquos
bull Remove stop words from being indexed
bull No value since theyrsquore to common
Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog
Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh
High level componentsFiltering techniques Filtering Indexing Querying Ranking
84242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
High level componentsFiltering techniques Filtering Indexing Querying Ranking
94242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull De-duplicate various words
raquobicycle cycle bike
raquoi-pod ipot =gt iPod
High level componentsFiltering techniques Filtering Indexing Querying Ranking
104242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Determine the stem of a word
raquoDogs =gt dog
raquoRecharging =gt recharg
raquoRechargeable =gt recharg
bull Language specific
raquoPorter for English (-s -ed -ly -ing etc)
raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)
High level componentsFiltering techniques Filtering Indexing Querying Ranking
114242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Options for limiting the size of the index
raquoMinimum Term frequency
raquoMinimum Term Length
High level componentsFiltering techniques Filtering Indexing Querying Ranking
124242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Handling sounds like queries
raquo Robert =gt R163 lt= Rupert
raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith
bull Various methods available
raquo DoubleMetaphone
raquo Metaphone
raquo Soundex
raquo RefinedSoundex
raquo Caverphone
raquo BeiderMorse
bull Levenstein can be used during quering
High level componentsApply the filters on Filtering and querying
Filtering Indexing Querying Ranking
132442015 copy Sanoma Media
Same filters
Sto
p w
ord
s
ste
mm
ing
synonym
s
etc
Filters
High level componentsIndexing
Filtering Indexing Querying Ranking
142442015 copy Sanoma Media
High level componentsQuerying
Filtering Indexing Querying Ranking
152442015 copy Sanoma Media
DEMOStemming Phonetics
162442015 copy Sanoma Media
High level componentsRanking
Filtering Indexing Querying Ranking
172442015 copy Sanoma Media
TF-IDFTerm Frequency-Inverse Document Frequency
How often does the search term occur in the text
How many words are in the entire text
High level componentsRanking ndash TF-IDF
Filtering Indexing Querying Ranking
182442015 copy Sanoma Media
312 = 025 524 = 021
More relevant
USER PATTERNS
192442015 copy Sanoma Media
User patterns
bull Features should be adjusted to the user and usage patterns your seeing
bull What are users searching for on your site
bull How are they searching for it
bull Use web analytics to track and improve your search behavior
202442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User pattern - Quit
212442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Pogosticking
222442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Thrashing
232442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Narrow
242442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Others
bull Pearl Growing
bull Expand
252442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
SearchFeatures
Search Features
bull Faceting
bull Autocomplete
bull More like this
bull Highlighting
bull Spellcheckingdid you mean
bull Geospatialldquobike repairrdquo in area of [longlat][longlat]
bull Boostingwhen title is more relevant then content
bull Elevationalways get a certain result at position nget the current weather current traffic at 1st
position or ingest ads
272442015 copy Sanoma Media
Search Features - Faceting
282442015 copy Sanoma Media
From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets
Search Features - Autocomplete
292442015 copy Sanoma Media
Search Features - More like this
302442015 copy Sanoma Media
bull Give you the related items based on a document
bull Compares the Term Vectors of various documents
bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915
TermNumber of Instances of Term in Document
Number of DocumentsMatching Term
IDF value Score
pre 18 26 4609916 82978
username 10 23 47276993 47276
column 9 13 5266696 47400264
oracle 9 8 57085285 51376
alter 7 1 7212606 50488
Search Features - Highlighting
312442015 copy Sanoma Media
bull Highlighting the search terms
bull Includes stemming and other logic
DEMO SOLR
322442015 copy Sanoma Media
SOLUTIONS
332442015 copy Sanoma Media
ServicesCommon search options
bull MySQL based
raquoNative Full-Text search
raquoSphinx Search Plugin
bull Lucene based (Java)
raquoApache LuceneSolr
raquoElasticSearch
342442015 copy Sanoma Media
ServicesCommon search options
352442015 copy Sanoma Media
Ease of use
Power
MySQL BasedNative Full-Text vs Sphinx
MySQL Full-Text search
bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields
bull Only standard English stop words
bull Limited query capabilities
bull Slow on large collections (1GB+)
bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo
bull No stemming no synonyms no custom flieds no highlighting
Sphinx
bull External plugin
bull All storage engines
bull Also on numeric field types
bull ~3x faster on index and query
bull Simple stemming and synonyms
bull No custom fields no highlighting
362442015 copy Sanoma Media
Querying is easy
bull MySQL Full-Text query
SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)
bull Getting the score
SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles
bull Sphinx query index is separate table
SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)
ORDER by weight DESCcreated_time DESC
372442015 copy Sanoma Media
Lucene based
ElasticSearch
bull Simpler Solr
bull No need for a schema
bull Easy to cluster
bull Focus on scaling and realtime
bull Go with the defaults
bull Configuration = 3 lines
bull Percolation
bull Versions and TTLs
Solr
bull Exposing all of the lucenepower
bull Clustering possible but harder
bull Focus on complete and customizable
bull Defaults
bull Configuration = 3000 lines
382442015 copy Sanoma Media
Solr vs ElasticSearchSearch Fresh Index While Idle
0
10
20
30
40
50
60
Search
tim
e i
n m
s
ElasticSearch
Solr
392442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec
0
50
100
150
200
250
Search
tim
e i
n m
s
ElasticSearch
Solr
402442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
412442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
422442015 copy Sanoma MediaLower is better
Idle Indexing Full + Indexing
Solr vs ElasticSearch
432442015 copy Sanoma Media
Lower is better
SOLR ElasticSearch
Querying with Solr and ElasticSearch
Solr
bull Normal query
httpsolrq=fieldbanana
bull Facetting
httpsolrq=fieldbananaampfacet=onampfacetfield=tags
ElasticSearch
bull Normal query
http_searchq=fieldvalue
bull Advanced queries via PUT
POST httpcollectionseach
query query_string query T
facets
tags terms field tags
442442015 copy Sanoma Media
ElasticSearch
452442015 copy Sanoma Media
SANOMA CONTENT LIBRARY
462442015 copy Sanoma Media
Sanoma Content Library
Search
in site
in cluster
in network
Elevation (ads)
Facetting
Related
More like this
Relevant ads
Products
Reuse
Sharing
Variants
(simple) Drm
Images
Analyse
Sentiment
Named Entities
Tagging
Classificatie
Key phrases
474242015 copy Sanoma Media
Services Content Library
482442015 copy Sanoma Media
Content Library
Analyse Pipeline
NER Sentiment
Crawler
Indexer
Searchindex
Search- nunl- wtf
Related- Vrouwen- Kieskeurig
Relevant- Txel
API
Edge
Redirects
Loader
Solr
Mongo
Integration- Vrouwen- Wordpress- SAS
CMS
JCR
Keyphraseextractor
Classifier
Common gotcharsquos
bull Use right settings for your language stopwords and stemming
bull Indexing too much or too detailed
raquoTimestamps
492442015 copy Sanoma Media
END
502442015 copy Sanoma Media
BasicsABC of search
High level components
Filtering Indexing Querying Ranking
42442015 copy Sanoma Media
High level componentsFiltering techniques
Filtering Indexing Querying Ranking
54242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
High level componentsFiltering techniques Filtering Indexing Querying Ranking
64242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
The quick brown fox jumps over a lazy dog
Thequickbrown
foxjumpsover
alazydog
High level componentsFiltering techniques Filtering Indexing Querying Ranking
74242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Special characters +-$^amp etc
raquoIBM
bull Case and numeric changes
raquoPowerShot TransAM SD500 iPod
bull Decide what you want to happened with
raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)
raquoOrsquoneillrsquos
bull Remove stop words from being indexed
bull No value since theyrsquore to common
Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog
Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh
High level componentsFiltering techniques Filtering Indexing Querying Ranking
84242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
High level componentsFiltering techniques Filtering Indexing Querying Ranking
94242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull De-duplicate various words
raquobicycle cycle bike
raquoi-pod ipot =gt iPod
High level componentsFiltering techniques Filtering Indexing Querying Ranking
104242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Determine the stem of a word
raquoDogs =gt dog
raquoRecharging =gt recharg
raquoRechargeable =gt recharg
bull Language specific
raquoPorter for English (-s -ed -ly -ing etc)
raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)
High level componentsFiltering techniques Filtering Indexing Querying Ranking
114242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Options for limiting the size of the index
raquoMinimum Term frequency
raquoMinimum Term Length
High level componentsFiltering techniques Filtering Indexing Querying Ranking
124242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Handling sounds like queries
raquo Robert =gt R163 lt= Rupert
raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith
bull Various methods available
raquo DoubleMetaphone
raquo Metaphone
raquo Soundex
raquo RefinedSoundex
raquo Caverphone
raquo BeiderMorse
bull Levenstein can be used during quering
High level componentsApply the filters on Filtering and querying
Filtering Indexing Querying Ranking
132442015 copy Sanoma Media
Same filters
Sto
p w
ord
s
ste
mm
ing
synonym
s
etc
Filters
High level componentsIndexing
Filtering Indexing Querying Ranking
142442015 copy Sanoma Media
High level componentsQuerying
Filtering Indexing Querying Ranking
152442015 copy Sanoma Media
DEMOStemming Phonetics
162442015 copy Sanoma Media
High level componentsRanking
Filtering Indexing Querying Ranking
172442015 copy Sanoma Media
TF-IDFTerm Frequency-Inverse Document Frequency
How often does the search term occur in the text
How many words are in the entire text
High level componentsRanking ndash TF-IDF
Filtering Indexing Querying Ranking
182442015 copy Sanoma Media
312 = 025 524 = 021
More relevant
USER PATTERNS
192442015 copy Sanoma Media
User patterns
bull Features should be adjusted to the user and usage patterns your seeing
bull What are users searching for on your site
bull How are they searching for it
bull Use web analytics to track and improve your search behavior
202442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User pattern - Quit
212442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Pogosticking
222442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Thrashing
232442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Narrow
242442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Others
bull Pearl Growing
bull Expand
252442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
SearchFeatures
Search Features
bull Faceting
bull Autocomplete
bull More like this
bull Highlighting
bull Spellcheckingdid you mean
bull Geospatialldquobike repairrdquo in area of [longlat][longlat]
bull Boostingwhen title is more relevant then content
bull Elevationalways get a certain result at position nget the current weather current traffic at 1st
position or ingest ads
272442015 copy Sanoma Media
Search Features - Faceting
282442015 copy Sanoma Media
From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets
Search Features - Autocomplete
292442015 copy Sanoma Media
Search Features - More like this
302442015 copy Sanoma Media
bull Give you the related items based on a document
bull Compares the Term Vectors of various documents
bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915
TermNumber of Instances of Term in Document
Number of DocumentsMatching Term
IDF value Score
pre 18 26 4609916 82978
username 10 23 47276993 47276
column 9 13 5266696 47400264
oracle 9 8 57085285 51376
alter 7 1 7212606 50488
Search Features - Highlighting
312442015 copy Sanoma Media
bull Highlighting the search terms
bull Includes stemming and other logic
DEMO SOLR
322442015 copy Sanoma Media
SOLUTIONS
332442015 copy Sanoma Media
ServicesCommon search options
bull MySQL based
raquoNative Full-Text search
raquoSphinx Search Plugin
bull Lucene based (Java)
raquoApache LuceneSolr
raquoElasticSearch
342442015 copy Sanoma Media
ServicesCommon search options
352442015 copy Sanoma Media
Ease of use
Power
MySQL BasedNative Full-Text vs Sphinx
MySQL Full-Text search
bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields
bull Only standard English stop words
bull Limited query capabilities
bull Slow on large collections (1GB+)
bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo
bull No stemming no synonyms no custom flieds no highlighting
Sphinx
bull External plugin
bull All storage engines
bull Also on numeric field types
bull ~3x faster on index and query
bull Simple stemming and synonyms
bull No custom fields no highlighting
362442015 copy Sanoma Media
Querying is easy
bull MySQL Full-Text query
SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)
bull Getting the score
SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles
bull Sphinx query index is separate table
SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)
ORDER by weight DESCcreated_time DESC
372442015 copy Sanoma Media
Lucene based
ElasticSearch
bull Simpler Solr
bull No need for a schema
bull Easy to cluster
bull Focus on scaling and realtime
bull Go with the defaults
bull Configuration = 3 lines
bull Percolation
bull Versions and TTLs
Solr
bull Exposing all of the lucenepower
bull Clustering possible but harder
bull Focus on complete and customizable
bull Defaults
bull Configuration = 3000 lines
382442015 copy Sanoma Media
Solr vs ElasticSearchSearch Fresh Index While Idle
0
10
20
30
40
50
60
Search
tim
e i
n m
s
ElasticSearch
Solr
392442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec
0
50
100
150
200
250
Search
tim
e i
n m
s
ElasticSearch
Solr
402442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
412442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
422442015 copy Sanoma MediaLower is better
Idle Indexing Full + Indexing
Solr vs ElasticSearch
432442015 copy Sanoma Media
Lower is better
SOLR ElasticSearch
Querying with Solr and ElasticSearch
Solr
bull Normal query
httpsolrq=fieldbanana
bull Facetting
httpsolrq=fieldbananaampfacet=onampfacetfield=tags
ElasticSearch
bull Normal query
http_searchq=fieldvalue
bull Advanced queries via PUT
POST httpcollectionseach
query query_string query T
facets
tags terms field tags
442442015 copy Sanoma Media
ElasticSearch
452442015 copy Sanoma Media
SANOMA CONTENT LIBRARY
462442015 copy Sanoma Media
Sanoma Content Library
Search
in site
in cluster
in network
Elevation (ads)
Facetting
Related
More like this
Relevant ads
Products
Reuse
Sharing
Variants
(simple) Drm
Images
Analyse
Sentiment
Named Entities
Tagging
Classificatie
Key phrases
474242015 copy Sanoma Media
Services Content Library
482442015 copy Sanoma Media
Content Library
Analyse Pipeline
NER Sentiment
Crawler
Indexer
Searchindex
Search- nunl- wtf
Related- Vrouwen- Kieskeurig
Relevant- Txel
API
Edge
Redirects
Loader
Solr
Mongo
Integration- Vrouwen- Wordpress- SAS
CMS
JCR
Keyphraseextractor
Classifier
Common gotcharsquos
bull Use right settings for your language stopwords and stemming
bull Indexing too much or too detailed
raquoTimestamps
492442015 copy Sanoma Media
END
502442015 copy Sanoma Media
High level components
Filtering Indexing Querying Ranking
42442015 copy Sanoma Media
High level componentsFiltering techniques
Filtering Indexing Querying Ranking
54242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
High level componentsFiltering techniques Filtering Indexing Querying Ranking
64242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
The quick brown fox jumps over a lazy dog
Thequickbrown
foxjumpsover
alazydog
High level componentsFiltering techniques Filtering Indexing Querying Ranking
74242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Special characters +-$^amp etc
raquoIBM
bull Case and numeric changes
raquoPowerShot TransAM SD500 iPod
bull Decide what you want to happened with
raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)
raquoOrsquoneillrsquos

Filtering techniques: Stop words
• Keep stop words out of the index
• They add no value because they are too common
Before: The quick brown fox jumps over a lazy dog
After: quick brown fox jumps lazy dog
Stop words: a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, got, had, have, h...
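
Stop-word removal can be sketched in a few lines of Python. The tiny `STOP_WORDS` set is an assumption, picked only so that the fox sentence comes out as on the slide; a real list is longer and language specific.

```python
# Illustrative stop-word set, not a complete list.
STOP_WORDS = {"a", "an", "and", "the", "over"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


sentence = "The quick brown fox jumps over a lazy dog"
print(remove_stop_words(sentence.split()))
```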

Filtering techniques: Synonyms
• De-duplicate different words for the same thing
» bicycle, cycle, bike
» i-pod, ipot => iPod
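
A synonym filter is essentially a lookup table applied to every token. The sketch below assumes a hand-written mapping; the dictionary contents and the name `apply_synonyms` are illustrative only.

```python
# Map each variant to one canonical term (illustrative entries).
SYNONYMS = {
    "cycle": "bicycle",
    "bike": "bicycle",
    "i-pod": "ipod",
    "ipot": "ipod",
}


def apply_synonyms(tokens, synonyms=SYNONYMS):
    """Lowercase each token and replace known variants by their canonical term."""
    return [synonyms.get(t.lower(), t.lower()) for t in tokens]


print(apply_synonyms(["Bike", "repair", "for", "my", "cycle"]))
print(apply_synonyms(["i-pod", "ipot", "iPod"]))
```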

Filtering techniques: Stemming
• Determine the stem of a word
» Dogs => dog
» Recharging => recharg
» Rechargeable => recharg
• Language specific
» Porter for English (-s, -ed, -ly, -ing etc.)
» SnowballPorter or Kraaij-Pohlmann for Dutch (ge-, -en etc.)
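
To show the idea only, here is a toy suffix stripper in Python. It is not the Porter algorithm: the suffix list, the minimum stem length and the trailing "e" rule are assumptions chosen so that the three examples above come out the same way.

```python
# Suffixes are checked in this order; only the first match is removed.
SUFFIXES = ("able", "ing", "ed", "ly", "s")


def toy_stem(word, min_stem=3):
    """Strip one common English suffix, then a trailing 'e'."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            w = w[: -len(suffix)]
            break
    if w.endswith("e") and len(w) - 1 >= min_stem:
        w = w[:-1]
    return w


for word in ("Dogs", "Recharging", "Rechargeable"):
    print(word, "=>", toy_stem(word))
```

Because both "Recharging" and "Rechargeable" reduce to the same stem, a query for one can match documents containing the other.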

Filtering techniques: Term occurrence
• Options for limiting the size of the index
» Minimum term frequency
» Minimum term length
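
Both options can be sketched as a filter over term counts. Reading "minimum term frequency" as "how often a term must occur before it is indexed" is an assumption, as are the default thresholds and the name `select_terms`.

```python
from collections import Counter


def select_terms(tokens, min_frequency=2, min_length=3):
    """Keep only terms that occur often enough and are long enough."""
    counts = Counter(tokens)
    return {
        term: count
        for term, count in counts.items()
        if count >= min_frequency and len(term) >= min_length
    }


tokens = "to be or not to be search search engine".split()
print(select_terms(tokens))
```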

Filtering techniques: Phonetics
• Handling "sounds like" queries
» Robert => R163 <= Rupert
» Smith => (SM0, XMT) ∩ (XMT, SMT) <= Schmith
• Various methods available
» DoubleMetaphone
» Metaphone
» Soundex
» RefinedSoundex
» Caverphone
» BeiderMorse
• Levenshtein distance can be used during querying
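
The Robert/Rupert example can be reproduced with a short Python version of classic American Soundex. This is a sketch of the textbook rules (equal adjacent codes collapse, "h" and "w" do not separate them, vowels do), not the implementation used by any particular search engine.

```python
# Standard American Soundex digit groups.
_GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are ignored and do not reset the previous code
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset the previous code to ""
    return (encoded + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
print(soundex("Smith"), soundex("Schmith"))
```

Indexing the code next to the original term lets a query for "Rupert" also find "Robert", at the cost of more false positives.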

High level components: Apply the same filters when indexing and when querying
• Same filters at both stages: stop words, stemming, synonyms, etc.
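
Why the same filters matter can be seen in a tiny inverted index. In this Python sketch one `analyze` function is shared by `add` and `search`; the stop-word set, the crude plural stripping and the class name `TinyIndex` are all assumptions for illustration.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "over"}  # illustrative


def analyze(text):
    """One filter chain, used for documents and for queries alike."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    # Crude plural stripping stands in for a real stemmer.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, text):
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every analyzed query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "The quick brown fox jumps over a lazy dog")
index.add(2, "Lazy dogs sleep")
print(index.search("dogs"))
print(index.search("the lazy dog"))
```

If the query skipped the filter chain, "dogs" would be looked up unstemmed and find nothing, because only "dog" is stored in the index.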

High level components: Indexing

High level components: Querying

DEMO: Stemming, Phonetics

High level components: Ranking
• TF-IDF: Term Frequency-Inverse Document Frequency
» How often does the search term occur in the text?
» How many words are in the entire text?

High level components: Ranking: TF-IDF
• 3/12 = 0.25 versus 5/24 = 0.21
• The text with the higher ratio is more relevant
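
The two ratios above are plain term frequencies. The Python sketch below computes them and adds an inverse document frequency; the smoothed formula 1 + ln(N / (df + 1)) is one common variant and is an assumption here, since the slides do not give a formula.

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the length of the text."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, documents):
    """Rarer terms get a higher value: 1 + ln(N / (df + 1))."""
    containing = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / (containing + 1))


short_text = ["fox"] * 3 + ["other"] * 9    # 3 hits in 12 words
long_text = ["fox"] * 5 + ["other"] * 19    # 5 hits in 24 words
print(round(term_frequency("fox", short_text), 2))
print(round(term_frequency("fox", long_text), 2))
```

The shorter text wins even though it has fewer hits, because the hits make up a larger share of its words.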

USER PATTERNS
• Features should be adjusted to the users and the usage patterns you're seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search
Image credits (user pattern slides): http://www.flickr.com/photos/morville/collections/72157604060564791

User patterns: Quit
User patterns: Pogosticking
User patterns: Thrashing
User patterns: Narrow
User patterns: Others
• Pearl Growing
• Expand

Search Features
• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: "did you mean"
• Geospatial: "bike repair" in area of [long,lat],[long,lat]
• Boosting: when the title is more relevant than the content
• Elevation: always get a certain result at position n, e.g. the current weather or current traffic at 1st position, or inject ads

Search Features: Faceting
From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
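
In code, a facet is a count per field value over the current result set, and drilling down is a filter on one of those values. A minimal Python sketch, assuming results are dictionaries with a multi-valued `tags` field (the field name and the sample documents are made up):

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    counts = Counter()
    for doc in results:
        for value in doc.get(field, []):
            counts[value] += 1
    return counts.most_common()


def drill_down(results, field, value):
    """Restrict the results to documents that have the chosen facet value."""
    return [doc for doc in results if value in doc.get(field, [])]


results = [
    {"id": 1, "tags": ["fruit", "yellow"]},
    {"id": 2, "tags": ["fruit"]},
    {"id": 3, "tags": ["yellow", "toy"]},
]
print(facet_counts(results, "tags"))
print([doc["id"] for doc in drill_down(results, "tags", "fruit")])
```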

Search Features: Autocomplete

Search Features: More like this
• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^5.6974 body:column^5.7123 body:oracle^6.1915

Term      Instances in document  Documents matching term  IDF value  Score
pre       18                     26                       4.609916   82.978
username  10                     23                       4.7276993  47.276
column    9                      13                       5.266696   47.400264
oracle    9                      8                        5.7085285  51.376
alter     7                      1                        7.212606   50.488
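
The numbers in the table can be recomputed with a few lines of Python. Two things below are inferred from the slide rather than stated on it: the score is the term count multiplied by the IDF value, and each boost is ten times the term's score relative to the best score. The `min_doc_freq=5` threshold is an assumption that would explain why "alter" is missing from the generated query.

```python
# (term, instances in document, documents matching term, IDF) as on the slide.
TERMS = [
    ("pre", 18, 26, 4.609916),
    ("username", 10, 23, 4.7276993),
    ("column", 9, 13, 5.266696),
    ("oracle", 9, 8, 5.7085285),
    ("alter", 7, 1, 7.212606),
]


def term_scores(terms):
    """Score = instances of the term in the document x IDF."""
    return {term: tf * idf for term, tf, _df, idf in terms}


def query_boosts(terms, min_doc_freq=5, scale=10.0):
    """Boost per term relative to the best score; very rare terms are skipped."""
    scores = term_scores(terms)
    kept = [term for term, _tf, df, _idf in terms if df >= min_doc_freq]
    best = max(scores[term] for term in kept)
    return {term: scale * scores[term] / best for term in kept}


for term, boost in query_boosts(TERMS).items():
    print(f"body:{term}^{boost:.4f}")
```

The computed boosts agree with the slide to about three decimals; the last digit can differ because the slide values appear to be truncated.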

Search Features: Highlighting
• Highlighting the search terms
• Includes stemming and other logic
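
A sketch of highlighting that goes through the stemmer, so that a query for "dog" also marks "dogs". The plural-stripping `stem` function and the `<em>` markers are assumptions for illustration; a search engine would reuse its own analysis chain.

```python
import re


def stem(word):
    """Crude stand-in for a real stemmer: strip a plural 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches the stem of a query term."""
    wanted = {stem(term) for term in query.split()}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if stem(word) in wanted else word

    return re.sub(r"[A-Za-z0-9]+", mark, text)


print(highlight("Two dogs and a dog", "dog"))
```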

DEMO: SOLR

SOLUTIONS

Services: Common search options
• MySQL based
» Native Full-Text search
» Sphinx Search Plugin
• Lucene based (Java)
» Apache Lucene/Solr
» ElasticSearch

Services: Common search options
[Figure: the search options plotted by ease of use against power]

MySQL based: Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables (InnoDB is also supported as of MySQL 5.6), and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting

Querying is easy
• MySQL Full-Text query
    SELECT * FROM articles
    WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
    SELECT id, MATCH (title, body) AGAINST ('Tutorial')
    FROM articles;
• Sphinx query: the index is a separate table
    SELECT id, created_time, weight
    FROM my_sphinx_index
    WHERE created_time BETWEEN X AND Y
      AND MATCH ('Android phone')
    ORDER BY weight DESC, created_time DESC;

Lucene based
ElasticSearch
• A simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and realtime
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposing all of the Lucene power
• Clustering possible but harder
• Focus on complete and customizable
• Defaults
• Configuration = 3000 lines

Solr vs ElasticSearch: Search Fresh Index While Idle
[Chart: search time in ms for ElasticSearch and Solr, axis 0 to 60; lower is better]

Solr vs ElasticSearch: Search Fresh Index While Indexing (1 doc/3 sec)
[Chart: search time in ms for ElasticSearch and Solr, axis 0 to 250; lower is better]

Solr vs ElasticSearch: Search Full Index While Indexing (1 doc/3 sec)
[Chart: search time in ms for ElasticSearch and Solr, axis 0 to 2500; lower is better]

Solr vs ElasticSearch
[Chart: Idle, Indexing and Full + Indexing compared for Solr and ElasticSearch; lower is better]

Querying with Solr and ElasticSearch
Solr
• Normal query
    http://solr/?q=field:banana
• Faceting
    http://solr/?q=field:banana&facet=on&facet.field=tags
ElasticSearch
• Normal query
    http://.../_search?q=field:value
• Advanced queries via POST
    POST http://.../collection/_search
    {"query": {"query_string": {"query": "T"}},
     "facets": {"tags": {"terms": {"field": "tags"}}}}
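
The two request styles can be put side by side in Python without touching a server: Solr takes everything as URL parameters, ElasticSearch takes a JSON body. The base URL in the example call is a hypothetical placeholder, and the function names are made up. The `facets` key follows the slide; newer ElasticSearch versions replaced facets with aggregations, so treat the body as a sketch of the 2015-era API.

```python
import json
from urllib.parse import urlencode


def solr_query_url(base_url, field, value, facet_field=None):
    """Build a Solr query URL, optionally asking for facet counts."""
    params = [("q", f"{field}:{value}")]
    if facet_field:
        params += [("facet", "on"), ("facet.field", facet_field)]
    return f"{base_url}?{urlencode(params)}"


def es_search_body(query_text, facet_field):
    """Build the JSON body for an ElasticSearch query_string search with one facet."""
    return json.dumps({
        "query": {"query_string": {"query": query_text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    })


# Hypothetical endpoint, used only to show the shape of the URL.
print(solr_query_url("http://localhost:8983/solr/articles/select", "field", "banana", "tags"))
print(es_search_body("T", "tags"))
```

Note that `urlencode` percent-encodes the colon in `field:banana`, which a server decodes back to the same query.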

ElasticSearch

SANOMA CONTENT LIBRARY

Sanoma Content Library
• Search
» in site
» in cluster
» in network
» Elevation (ads)
» Faceting
• Related
» More like this
» Relevant ads
» Products
• Reuse
» Sharing
» Variants
» (simple) DRM
» Images
• Analyse
» Sentiment
» Named Entities
» Tagging
» Classification
» Key phrases

Services: Content Library
[Architecture diagram; labels in the order they appear]
• Content Library, Analyse Pipeline (NER, Sentiment), Crawler, Indexer, Search index
• Search: nu.nl, wtf
• Related: Vrouwen, Kieskeurig
• Relevant: Txel
• API, Edge, Redirects, Loader, Solr, Mongo
• Integration: Vrouwen, Wordpress, SAS
• CMS, JCR, Keyphrase extractor, Classifier

Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or too detailed
» Timestamps
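
One reading of the timestamp gotcha, offered as an assumption since the slide gives no detail: a timestamp with second precision is almost always unique, so every document adds a new term to the index. Rounding before indexing keeps the number of distinct values small. A Python sketch with made-up sample data:

```python
from datetime import datetime


def round_to_day(timestamp):
    """Drop the time of day so all documents from one day share a value."""
    return timestamp.replace(hour=0, minute=0, second=0, microsecond=0)


stamps = [
    datetime(2015, 4, 24, 9, 15, 3),
    datetime(2015, 4, 24, 17, 42, 51),
    datetime(2015, 4, 25, 8, 0, 1),
]
print(len(set(stamps)), "distinct values before rounding")
print(len({round_to_day(s) for s in stamps}), "distinct values after rounding")
```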

END
High level componentsFiltering techniques Filtering Indexing Querying Ranking
74242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Special characters +-$^amp etc
raquoIBM
bull Case and numeric changes
raquoPowerShot TransAM SD500 iPod
bull Decide what you want to happened with
raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)
raquoOrsquoneillrsquos
bull Remove stop words from being indexed
bull No value since theyrsquore to common
Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog
Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh
High level componentsFiltering techniques Filtering Indexing Querying Ranking
84242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
High level componentsFiltering techniques Filtering Indexing Querying Ranking
94242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull De-duplicate various words
raquobicycle cycle bike
raquoi-pod ipot =gt iPod
High level componentsFiltering techniques Filtering Indexing Querying Ranking
104242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Determine the stem of a word
raquoDogs =gt dog
raquoRecharging =gt recharg
raquoRechargeable =gt recharg
bull Language specific
raquoPorter for English (-s -ed -ly -ing etc)
raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)
High level componentsFiltering techniques Filtering Indexing Querying Ranking
114242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Options for limiting the size of the index
raquoMinimum Term frequency
raquoMinimum Term Length
High level componentsFiltering techniques Filtering Indexing Querying Ranking
124242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Handling sounds like queries
raquo Robert =gt R163 lt= Rupert
raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith
bull Various methods available
raquo DoubleMetaphone
raquo Metaphone
raquo Soundex
raquo RefinedSoundex
raquo Caverphone
raquo BeiderMorse
bull Levenstein can be used during quering
High level componentsApply the filters on Filtering and querying
Filtering Indexing Querying Ranking
132442015 copy Sanoma Media
Same filters
Sto
p w
ord
s
ste
mm
ing
synonym
s
etc
Filters
High level componentsIndexing
Filtering Indexing Querying Ranking
142442015 copy Sanoma Media
High level componentsQuerying
Filtering Indexing Querying Ranking
152442015 copy Sanoma Media
DEMOStemming Phonetics
162442015 copy Sanoma Media
High level componentsRanking
Filtering Indexing Querying Ranking
172442015 copy Sanoma Media
TF-IDFTerm Frequency-Inverse Document Frequency
How often does the search term occur in the text
How many words are in the entire text
High level componentsRanking ndash TF-IDF
Filtering Indexing Querying Ranking
182442015 copy Sanoma Media
312 = 025 524 = 021
More relevant
USER PATTERNS
User patterns
• Features should be adjusted to the users and the usage patterns you're seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search
Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791
User patterns - Quit
User patterns - Pogosticking
User patterns - Thrashing
User patterns - Narrow
User patterns - Others
• Pearl Growing
• Expand
Search: Features
Search Features
• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: "did you mean"
• Geospatial: "bike repair" in area of [long, lat] [long, lat]
• Boosting: when title is more relevant than content
• Elevation: always get a certain result at position n, e.g. get the current weather or current traffic at 1st position, or inject ads
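Boosting and elevation are easy to confuse, so here is a small sketch of both, assuming documents are plain dicts with `title` and `body` strings. The scoring (boost times number of occurrences) and the names `score`, `search` and `elevated` are hypothetical simplifications for illustration.

```python
def score(doc, term, boosts):
    """Sum of per-field hit counts, each multiplied by that field's boost."""
    return sum(
        boost * doc.get(field, "").lower().split().count(term)
        for field, boost in boosts.items()
    )


def search(docs, term, boosts, elevated=None):
    """Rank by boosted score, then pin elevated documents at fixed positions.

    elevated maps a zero-based position to the document to show there.
    """
    term = term.lower()
    hits = [d for d in docs if score(d, term, boosts) > 0]
    hits.sort(key=lambda d: score(d, term, boosts), reverse=True)
    for position, doc in sorted((elevated or {}).items()):
        hits = [d for d in hits if d is not doc]  # avoid showing it twice
        hits.insert(position, doc)
    return hits


if __name__ == "__main__":
    docs = [
        {"id": "a", "title": "weather today", "body": "sunny"},
        {"id": "b", "title": "news", "body": "weather weather"},
    ]
    ad = {"id": "ad", "title": "sponsored", "body": ""}
    boosts = {"title": 3.0, "body": 1.0}
    print([d["id"] for d in search(docs, "weather", boosts, {0: ad})])
```

Boosting changes how the score is computed, so the ranking still depends on the query. Elevation ignores the score and forces a result into a slot.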
Search Features - Faceting
From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
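The two halves of that definition, counts per category and drilling down, can be sketched in a few lines. The document shape (dicts with a list-valued `tags` field) and the function names are assumptions for the example.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many result documents carry each value of the facet field."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.get(field, []))
    return counts


def drill_down(docs, field, value):
    """Restrict the result set to documents that have the chosen facet value."""
    return [doc for doc in docs if value in doc.get(field, [])]


if __name__ == "__main__":
    results = [
        {"id": 1, "tags": ["bike", "repair"]},
        {"id": 2, "tags": ["bike"]},
        {"id": 3, "tags": ["car"]},
    ]
    print(facet_counts(results, "tags").most_common())
    print([d["id"] for d in drill_down(results, "tags", "bike")])
```

A search engine computes these counts inside the index rather than by looping over results, which is what makes faceting cheap there and expensive to bolt on elsewhere.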
Search Features - Autocomplete
Search Features - More like this
• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915
Term | Number of instances of term in document | Number of documents matching term | IDF value | Score
pre | 18 | 26 | 4.609916 | 82.978
username | 10 | 23 | 4.7276993 | 47.276
column | 9 | 13 | 5.266696 | 47.400264
oracle | 9 | 8 | 5.7085285 | 51.376
alter | 7 | 1 | 7.212606 | 50.488
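The table and the boosted query are related by simple arithmetic: each score is the in-document count times the IDF value, and each boost is that score divided by the highest score. The sketch below rebuilds a query that way. The `min_doc_freq` cut-off of 5 is an assumption: Lucene's MoreLikeThis does have a minimum document frequency setting, and a threshold like this would explain why "alter" is absent from the query above, but the source does not say so.

```python
def more_like_this_query(field, term_stats, max_terms=4, min_doc_freq=5):
    """Build a boosted query string from per-term statistics.

    term_stats maps term -> (count in document, documents matching, idf).
    The thresholds are illustrative assumptions, not confirmed settings.
    """
    scored = {
        term: count * idf
        for term, (count, doc_freq, idf) in term_stats.items()
        if doc_freq >= min_doc_freq
    }
    top = sorted(scored, key=scored.get, reverse=True)[:max_terms]
    if not top:
        return ""
    best = scored[top[0]]
    parts = []
    for term in top:
        boost = scored[term] / best
        # The best term gets an implicit boost of 1 and needs no suffix.
        parts.append(f"{field}:{term}" if term == top[0] else f"{field}:{term}^{boost:.5f}")
    return " ".join(parts)


if __name__ == "__main__":
    stats = {
        "pre": (18, 26, 4.609916),
        "username": (10, 23, 4.7276993),
        "column": (9, 13, 5.266696),
        "oracle": (9, 8, 5.7085285),
        "alter": (7, 1, 7.212606),
    }
    print(more_like_this_query("body", stats))
```

The term order differs from the query shown above because this sketch sorts by descending score. The boosts can also differ in the last digit, since they are computed from unrounded scores.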
Search Features - Highlighting
• Highlighting the search terms
• Includes stemming and other logic
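A sketch of why highlighting "includes stemming": words are compared by stem, so a query for "jumping" also marks "jumps" and "jumped". The `stem` helper is a deliberately crude stand-in and the `<em>` markers are an assumed default. Real engines reuse the field's own analysis chain.

```python
import re


def stem(word):
    """Crude stand-in for a real stemmer: strip one common suffix."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches the stem of a query word."""
    wanted = {stem(t) for t in query.split()}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if stem(word) in wanted else word

    return re.sub(r"\w+", mark, text)


if __name__ == "__main__":
    print(highlight("The fox jumps. Dogs jumped too", "jumping dog"))
```

The original casing and punctuation of the text are preserved. Only the matching words get wrapped.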
DEMO: SOLR
SOLUTIONS
Services - Common search options
• MySQL based
» Native Full-Text search
» Sphinx Search Plugin
• Lucene based (Java)
» Apache Lucene/Solr
» ElasticSearch
Services - Common search options
[Chart axes: Ease of use, Power]
MySQL Based - Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables (InnoDB gained FULLTEXT support in MySQL 5.6) and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
Querying is easy
• MySQL Full-Text query
SELECT * FROM articles WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;
• Sphinx query: the index is a separate table
SELECT id, created_time, weight FROM my_sphinx_index WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone')
ORDER BY weight DESC, created_time DESC;
Lucene based
ElasticSearch
• Simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and realtime
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposing all of the Lucene power
• Clustering possible but harder
• Focus on complete and customizable
• Defaults
• Configuration = 3000 lines
Solr vs ElasticSearch - Search Fresh Index While Idle
[Bar chart: search time in ms, axis 0 to 60, ElasticSearch vs Solr. Lower is better.]
Solr vs ElasticSearch - Search Fresh Index While Indexing 1 doc/3 sec
[Bar chart: search time in ms, axis 0 to 250, ElasticSearch vs Solr. Lower is better.]
Solr vs ElasticSearch - Search Full Index While Indexing 1 doc/3 sec
[Bar chart: search time in ms, axis 0 to 2500, ElasticSearch vs Solr. Lower is better.]
Solr vs ElasticSearch
[Summary chart: Idle, Indexing, Full + Indexing, for Solr and ElasticSearch. Lower is better.]
Querying with Solr and ElasticSearch
Solr
• Normal query
http://solr/?q=field:banana
• Faceting
http://solr/?q=field:banana&facet=on&facet.field=tags
ElasticSearch
• Normal query
http://.../_search?q=field:value
• Advanced queries via POST
POST http://.../collection/_search
{ "query": { "query_string": { "query": "T" } },
  "facets": { "tags": { "terms": { "field": "tags" } } } }
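When these requests are built from code, it is safer to let a library do the URL encoding and JSON serialization. The sketch below only constructs the request strings and sends nothing. The base URL in the usage example and both function names are placeholders. Note also that the `facets` body reflects the ElasticSearch API of the time; later versions replaced facets with aggregations.

```python
import json
from urllib.parse import urlencode


def solr_select_url(base_url, query, facet_field=None):
    """Build a Solr query URL, optionally with faceting on one field."""
    params = [("q", query)]
    if facet_field:
        params += [("facet", "on"), ("facet.field", facet_field)]
    return f"{base_url}?{urlencode(params)}"


def es_search_body(query_text, facet_field):
    """Build the JSON body for a query_string search with a terms facet."""
    body = {
        "query": {"query_string": {"query": query_text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }
    return json.dumps(body)


if __name__ == "__main__":
    # Hypothetical endpoint, adjust to your own installation.
    print(solr_select_url("http://localhost:8983/solr/select", "field:banana", "tags"))
    print(es_search_body("T", "tags"))
```

The colon in `field:banana` comes out percent-encoded, which Solr decodes back before parsing the query.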
ElasticSearch
SANOMA CONTENT LIBRARY
Sanoma Content Library
• Search
» In site
» In cluster
» In network
» Elevation (ads)
» Faceting
• Related
» More like this
» Relevant ads
» Products
• Reuse
» Sharing
» Variants
» (Simple) DRM
» Images
• Analyse
» Sentiment
» Named entities
» Tagging
» Classification
» Key phrases
Services - Content Library
[Architecture diagram. Labels shown: Content Library, Analyse Pipeline, NER, Sentiment, Keyphrase extractor, Classifier, Crawler, Indexer, Search index, API, Edge, Redirects, Loader, Solr, Mongo, CMS, JCR. Consumers: Search (nu.nl, wtf), Related (Vrouwen, Kieskeurig), Relevant (Txel), Integration (Vrouwen, Wordpress, SAS).]
Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
» Timestamps
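One reading of the timestamp gotcha is that full-precision timestamps produce a nearly unique term per document, which bloats the index and defeats caching. A common remedy is to truncate the value before indexing. The helper name and the supported granularities below are assumptions for the sketch.

```python
from datetime import datetime


def truncate_for_index(ts: datetime, granularity: str = "day") -> datetime:
    """Round a timestamp down to the precision users actually search on."""
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


if __name__ == "__main__":
    published = datetime(2015, 4, 24, 13, 37, 59, 123456)
    print(truncate_for_index(published))
    print(truncate_for_index(published, "hour"))
```

If the exact time is still needed for display, it can be kept in a stored-only field while the truncated value is the one that is indexed.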
END
High level componentsFiltering techniques Filtering Indexing Querying Ranking
104242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Determine the stem of a word
raquoDogs =gt dog
raquoRecharging =gt recharg
raquoRechargeable =gt recharg
bull Language specific
raquoPorter for English (-s -ed -ly -ing etc)
raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)
High level componentsFiltering techniques Filtering Indexing Querying Ranking
114242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Options for limiting the size of the index
raquoMinimum Term frequency
raquoMinimum Term Length
High level componentsFiltering techniques Filtering Indexing Querying Ranking
124242015 copy Sanoma Media
bull Tokenizing
bull Stop Words
bull Synonyms
bull Stemming
bull Term occurrence
bull Phonetics
bull Handling sounds like queries
raquo Robert =gt R163 lt= Rupert
raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith
bull Various methods available
raquo DoubleMetaphone
raquo Metaphone
raquo Soundex
raquo RefinedSoundex
raquo Caverphone
raquo BeiderMorse
bull Levenstein can be used during quering
High level componentsApply the filters on Filtering and querying
Filtering Indexing Querying Ranking
132442015 copy Sanoma Media
Same filters
Sto
p w
ord
s
ste
mm
ing
synonym
s
etc
Filters
High level componentsIndexing
Filtering Indexing Querying Ranking
142442015 copy Sanoma Media
High level componentsQuerying
Filtering Indexing Querying Ranking
152442015 copy Sanoma Media
DEMOStemming Phonetics
162442015 copy Sanoma Media
High level componentsRanking
Filtering Indexing Querying Ranking
172442015 copy Sanoma Media
TF-IDFTerm Frequency-Inverse Document Frequency
How often does the search term occur in the text
How many words are in the entire text
High level componentsRanking ndash TF-IDF
Filtering Indexing Querying Ranking
182442015 copy Sanoma Media
312 = 025 524 = 021
More relevant
USER PATTERNS
192442015 copy Sanoma Media
User patterns
bull Features should be adjusted to the user and usage patterns your seeing
bull What are users searching for on your site
bull How are they searching for it
bull Use web analytics to track and improve your search behavior
202442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User pattern - Quit
212442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Pogosticking
222442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Thrashing
232442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Narrow
242442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Others
bull Pearl Growing
bull Expand
252442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
SearchFeatures
Search Features
bull Faceting
bull Autocomplete
bull More like this
bull Highlighting
bull Spellcheckingdid you mean
bull Geospatialldquobike repairrdquo in area of [longlat][longlat]
bull Boostingwhen title is more relevant then content
bull Elevationalways get a certain result at position nget the current weather current traffic at 1st
position or ingest ads
272442015 copy Sanoma Media
Search Features - Faceting
282442015 copy Sanoma Media
From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets
Search Features - Autocomplete
292442015 copy Sanoma Media
Search Features - More like this
302442015 copy Sanoma Media
bull Give you the related items based on a document
bull Compares the Term Vectors of various documents
bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915
TermNumber of Instances of Term in Document
Number of DocumentsMatching Term
IDF value Score
pre 18 26 4609916 82978
username 10 23 47276993 47276
column 9 13 5266696 47400264
oracle 9 8 57085285 51376
alter 7 1 7212606 50488
Search Features - Highlighting
312442015 copy Sanoma Media
bull Highlighting the search terms
bull Includes stemming and other logic
DEMO SOLR
322442015 copy Sanoma Media
SOLUTIONS
332442015 copy Sanoma Media
ServicesCommon search options
bull MySQL based
raquoNative Full-Text search
raquoSphinx Search Plugin
bull Lucene based (Java)
raquoApache LuceneSolr
raquoElasticSearch
342442015 copy Sanoma Media
ServicesCommon search options
352442015 copy Sanoma Media
Ease of use
Power
MySQL BasedNative Full-Text vs Sphinx
MySQL Full-Text search
bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields
bull Only standard English stop words
bull Limited query capabilities
bull Slow on large collections (1GB+)
bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo
bull No stemming no synonyms no custom flieds no highlighting
Sphinx
bull External plugin
bull All storage engines
bull Also on numeric field types
bull ~3x faster on index and query
bull Simple stemming and synonyms
bull No custom fields no highlighting
362442015 copy Sanoma Media
Querying is easy
bull MySQL Full-Text query
SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)
bull Getting the score
SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles
bull Sphinx query index is separate table
SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)
ORDER by weight DESCcreated_time DESC
372442015 copy Sanoma Media
Lucene based
ElasticSearch
bull Simpler Solr
bull No need for a schema
bull Easy to cluster
bull Focus on scaling and realtime
bull Go with the defaults
bull Configuration = 3 lines
bull Percolation
bull Versions and TTLs
Solr
bull Exposing all of the lucenepower
bull Clustering possible but harder
bull Focus on complete and customizable
bull Defaults
bull Configuration = 3000 lines
382442015 copy Sanoma Media
Solr vs ElasticSearchSearch Fresh Index While Idle
0
10
20
30
40
50
60
Search
tim
e i
n m
s
ElasticSearch
Solr
392442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec
0
50
100
150
200
250
Search
tim
e i
n m
s
ElasticSearch
Solr
402442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
412442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
422442015 copy Sanoma MediaLower is better
Idle Indexing Full + Indexing
Solr vs ElasticSearch
432442015 copy Sanoma Media
Lower is better
SOLR ElasticSearch
Querying with Solr and ElasticSearch
Solr
bull Normal query
httpsolrq=fieldbanana
bull Facetting
httpsolrq=fieldbananaampfacet=onampfacetfield=tags
ElasticSearch
bull Normal query
http_searchq=fieldvalue
bull Advanced queries via PUT
POST httpcollectionseach
query query_string query T
facets
tags terms field tags
442442015 copy Sanoma Media
ElasticSearch
452442015 copy Sanoma Media
SANOMA CONTENT LIBRARY
462442015 copy Sanoma Media
Sanoma Content Library
Search
in site
in cluster
in network
Elevation (ads)
Facetting
Related
More like this
Relevant ads
Products
Reuse
Sharing
Variants
(simple) Drm
Images
Analyse
Sentiment
Named Entities
Tagging
Classificatie
Key phrases
474242015 copy Sanoma Media
Services Content Library
482442015 copy Sanoma Media
Content Library
Analyse Pipeline
NER Sentiment
Crawler
Indexer
Searchindex
Search- nunl- wtf
Related- Vrouwen- Kieskeurig
Relevant- Txel
API
Edge
Redirects
Loader
Solr
Mongo
Integration- Vrouwen- Wordpress- SAS
CMS
JCR
Keyphraseextractor
Classifier
Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
  » Timestamps
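One way to read the timestamp gotcha is that full-precision timestamps create a unique term for almost every document. A minimal sketch, assuming day or hour granularity is enough for searching (the function name `coarse_timestamp` is hypothetical):

```python
from datetime import datetime, timezone


def coarse_timestamp(ts, granularity="day"):
    """Round a timestamp down before indexing so many documents share one term."""
    ts = ts.astimezone(timezone.utc)
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    if granularity == "hour":
        return ts.strftime("%Y-%m-%dT%H:00Z")
    raise ValueError(f"unsupported granularity: {granularity}")


# Nine documents published within three minutes of each other.
stamps = [
    datetime(2015, 4, 24, 9, minute, second, tzinfo=timezone.utc)
    for minute in range(3)
    for second in range(0, 60, 20)
]
unique_raw = len({s.isoformat() for s in stamps})
unique_day = len({coarse_timestamp(s) for s in stamps})
print(unique_raw, unique_day)
```

The raw values yield one index term per document, while the rounded values collapse into a single term that is still usable for date filtering.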
END
High level components: Filtering techniques (Term occurrence)
• Options for limiting the size of the index
  » Minimum term frequency
  » Minimum term length
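The two options above can be sketched as a vocabulary filter. This assumes "term frequency" means the number of occurrences across the whole collection; the function name and thresholds are illustrative only.

```python
from collections import Counter


def build_vocabulary(docs, min_term_freq=2, min_term_len=3):
    """Keep only terms that are frequent enough and long enough to be indexed."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    return {
        term
        for term, count in counts.items()
        if count >= min_term_freq and len(term) >= min_term_len
    }


docs = ["The quick brown fox", "the lazy dog", "a quick dog"]
vocab = build_vocabulary(docs)
print(sorted(vocab))
```

Raising either threshold shrinks the index, at the cost of no longer matching rare or very short terms.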
High level components: Filtering techniques (Phonetics)
• Handling "sounds like" queries
  » Robert => R163 <= Rupert
  » Smith => (SM0, XMT) ∩ (XMT, SMT) <= Schmith
• Various methods available
  » DoubleMetaphone
  » Metaphone
  » Soundex
  » RefinedSoundex
  » Caverphone
  » BeiderMorse
• Levenshtein distance can be used during querying
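As an illustration of the phonetic approach, here is a small sketch of classic (American) Soundex together with a plain Levenshtein edit distance. It is written from the textbook rules, not taken from any particular search engine's implementation.

```python
# Soundex digit for each consonant group; vowels, H, W and Y have no code.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the 4-character Soundex key: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = []
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (letters[0] + "".join(encoded) + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


print(soundex("Robert"), soundex("Rupert"))
print(levenshtein("Smith", "Schmith"))
```

The phonetic key is computed at index time and stored as a term. Edit distance is typically evaluated against candidate terms at query time.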
High level components: Apply the same filters when indexing and querying
Same filters: stop words, stemming, synonyms, etc.
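A minimal sketch of why the same filter chain has to run on both sides: the index only contains filtered terms, so a query that is not filtered the same way will not match. The stop word list, the synonym map and the crude suffix stemmer are stand-ins for real analyzers.

```python
STOP_WORDS = {"the", "a", "over"}   # illustrative, not a real stop word list
SYNONYMS = {"hound": "dog"}         # illustrative synonym mapping


def stem(token):
    """Very crude suffix stripper; real engines use e.g. Porter or Snowball stemmers."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """The shared filter chain: tokenize, drop stop words, map synonyms, stem."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(SYNONYMS.get(t, t)) for t in tokens]


def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "The quick brown fox jumps over a lazy dog", 2: "Foxes are jumping"}
index = build_index(docs)
hits = search(index, "jumping fox")
print(hits)
```

Because `analyze` is used by both `build_index` and `search`, "jumping fox" matches a document that only contains "fox jumps".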
High level components: Indexing
High level components: Querying
DEMO: Stemming, Phonetics
High level components: Ranking
High level components: Ranking - TF-IDF
TF-IDF: Term Frequency-Inverse Document Frequency
• How often does the search term occur in the text?
• How many words are in the entire text?
• 3/12 = 0.25 (more relevant) vs. 5/24 ≈ 0.21
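The slide's numbers cover the term frequency half of TF-IDF. The sketch below reproduces them and adds one common smoothed IDF variant; the exact formula differs per engine, so treat this one as an assumption. The toy documents are made up to match the 3-in-12 and 5-in-24 counts.

```python
import math


def term_frequency(term, tokens):
    """Share of the document's words that are the search term."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """Smoothed IDF: terms found in fewer documents weigh more."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log((1 + len(corpus)) / (1 + containing)) + 1


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


doc_a = ["rabbit"] * 3 + ["filler"] * 9     # 3 hits in 12 words
doc_b = ["rabbit"] * 5 + ["filler"] * 19    # 5 hits in 24 words
doc_c = ["filler"] * 10                     # no hits
corpus = [doc_a, doc_b, doc_c]

tf_a = term_frequency("rabbit", doc_a)
tf_b = term_frequency("rabbit", doc_b)
score_a = tf_idf("rabbit", doc_a, corpus)
score_b = tf_idf("rabbit", doc_b, corpus)
print(tf_a, round(tf_b, 2), score_a > score_b)
```

For a single-term query the IDF factor is identical for every document, so the ranking is decided by term frequency alone. IDF starts to matter once a query mixes common and rare terms.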
USER PATTERNS
User patterns
• Features should be adjusted to the users and the usage patterns you're seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search
Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791/
User patterns - Quit
User patterns - Pogosticking
User patterns - Thrashing
User patterns - Narrow
User patterns - Others
• Pearl Growing
• Expand
Search Features
• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: "did you mean"
• Geospatial: "bike repair" in the area of [long,lat],[long,lat]
• Boosting: when the title is more relevant than the content
• Elevation: always get a certain result at position n, e.g. the current weather or current traffic at 1st position, or insert ads
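Boosting and elevation from the list above can be sketched on a toy scorer. The field names, the boost factor of 2.0 and the sample documents are assumptions made for this example, not values from the deck.

```python
def score(doc, term, title_boost=2.0):
    """Count term hits, weighting a hit in the title higher than one in the body."""
    title_hits = doc["title"].lower().split().count(term)
    body_hits = doc["body"].lower().split().count(term)
    return title_boost * title_hits + body_hits


def search(docs, term, elevated=None):
    """Rank matching documents by score; an elevated result is pinned to position 1."""
    matches = [d for d in docs if score(d, term) > 0]
    ranked = sorted(matches, key=lambda d: score(d, term), reverse=True)
    if elevated is not None:
        ranked = [elevated] + [d for d in ranked if d is not elevated]
    return ranked


docs = [
    {"id": 1, "title": "City guide", "body": "rent a bike at the station"},
    {"id": 2, "title": "Bike repair shops", "body": "we fix flat tires"},
]
ad = {"id": "ad", "title": "Sponsored offer", "body": ""}

boosted_ids = [d["id"] for d in search(docs, "bike")]
elevated_ids = [d["id"] for d in search(docs, "bike", elevated=ad)]
print(boosted_ids, elevated_ids)
```

Boosting changes the score, so it still competes with other signals. Elevation bypasses scoring and fixes the position outright.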
Search Features - Faceting
From the user's perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
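The counting and drill-down described above can be sketched over an in-memory result list. The `tags` field and the sample documents are made up for illustration; a search engine computes these counts from its index.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the facet field."""
    counts = Counter()
    for doc in results:
        counts.update(doc.get(field, []))
    return counts


def drill_down(results, field, value):
    """Restrict the results to those carrying the chosen facet value."""
    return [doc for doc in results if value in doc.get(field, [])]


results = [
    {"id": 1, "tags": ["fruit", "yellow"]},
    {"id": 2, "tags": ["fruit", "red"]},
    {"id": 3, "tags": ["yellow"]},
]
counts = facet_counts(results, "tags")
narrowed = drill_down(results, "tags", "yellow")
print(dict(counts), [doc["id"] for doc in narrowed])
```

The counts are shown next to each category, and clicking a category re-runs the search with that value as an extra restriction.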
Search Features - Autocomplete
Search Features - More like this
• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term      Instances of term in document  Documents matching term  IDF value  Score
pre       18                             26                       4.609916   82.978
username  10                             23                       4.7276993  47.276
column    9                              13                       5.266696   47.400264
oracle    9                              8                        5.7085285  51.376
alter     7                              1                        7.212606   50.488
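The numbers in the table fit a simple rule: score = instances × IDF, and each boost is the term's score divided by the best score. The sketch below rebuilds the boosted query on that assumption. It is a reading of the table, not a copy of any engine's code, and the last digit can differ from the slide because of rounding.

```python
# (instances of term in document, IDF value) per term, as listed in the table.
TERMS = {
    "pre": (18, 4.609916),
    "username": (10, 4.7276993),
    "column": (9, 5.266696),
    "oracle": (9, 5.7085285),
    "alter": (7, 7.212606),
}


def term_boosts(terms):
    """Score each term as tf * idf and normalize by the best-scoring term."""
    scores = {term: tf * idf for term, (tf, idf) in terms.items()}
    best = max(scores.values())
    return {term: value / best for term, value in scores.items()}


def more_like_this_query(terms, field="body"):
    """Build a boosted query string; the best term needs no explicit boost."""
    clauses = []
    for term, boost in term_boosts(terms).items():
        if boost == 1.0:
            clauses.append(f"{field}:{term}")
        else:
            clauses.append(f"{field}:{term}^{boost:.5f}")
    return " ".join(clauses)


boosts = term_boosts(TERMS)
query = more_like_this_query(TERMS)
print(query)
```

The resulting query is run as a normal search, so documents sharing the source document's most distinctive terms rank highest.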
Search Features - Highlighting
• Highlighting the search terms
• Includes stemming and other logic
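A small sketch of highlighting that takes stemming into account, so that a query for "jump" also marks "jumps" and "jumping". The suffix stemmer and the `<em>` markers are illustrative assumptions.

```python
import re


def stem(token):
    """Very crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches the stem of a query term."""
    query_stems = {stem(t) for t in query.lower().split()}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if stem(word.lower()) in query_stems else word

    return re.sub(r"\w+", mark, text)


snippet = highlight("The fox jumps; foxes are jumping.", "jump fox")
print(snippet)
```

The original word forms stay in the output; only the comparison is done on stems.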
DEMO: Solr
SOLUTIONS
Services: Common search options
• MySQL based
  » Native Full-Text search
  » Sphinx Search plugin
• Lucene based (Java)
  » Apache Lucene/Solr
  » ElasticSearch
[Chart: common search options plotted by ease of use vs. power]
MySQL based: Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables, and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
Querying is easy
• MySQL Full-Text query
  SELECT * FROM articles
  WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
  SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;
• Sphinx query (the index is a separate table)
  SELECT id, created_time, weight FROM my_sphinx_index
  WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone')
  ORDER BY weight DESC, created_time DESC;
Lucene based
ElasticSearch
• Simpler than Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and real-time
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposes all of the Lucene power
• Clustering possible, but harder
• Focus on being complete and customizable
• Defaults
• Configuration = 3000 lines
High level componentsQuerying
Filtering Indexing Querying Ranking
152442015 copy Sanoma Media
DEMOStemming Phonetics
162442015 copy Sanoma Media
High level componentsRanking
Filtering Indexing Querying Ranking
172442015 copy Sanoma Media
TF-IDFTerm Frequency-Inverse Document Frequency
How often does the search term occur in the text
How many words are in the entire text
High level componentsRanking ndash TF-IDF
Filtering Indexing Querying Ranking
182442015 copy Sanoma Media
312 = 025 524 = 021
More relevant
USER PATTERNS
192442015 copy Sanoma Media
User patterns
bull Features should be adjusted to the user and usage patterns your seeing
bull What are users searching for on your site
bull How are they searching for it
bull Use web analytics to track and improve your search behavior
202442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User pattern - Quit
212442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Pogosticking
222442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Thrashing
232442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Narrow
242442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Others
bull Pearl Growing
bull Expand
252442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
SearchFeatures
Search Features
bull Faceting
bull Autocomplete
bull More like this
bull Highlighting
bull Spellcheckingdid you mean
bull Geospatialldquobike repairrdquo in area of [longlat][longlat]
bull Boostingwhen title is more relevant then content
bull Elevationalways get a certain result at position nget the current weather current traffic at 1st
position or ingest ads
272442015 copy Sanoma Media
Search Features - Faceting
282442015 copy Sanoma Media
From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets
Search Features - Autocomplete
292442015 copy Sanoma Media
Search Features - More like this
302442015 copy Sanoma Media
• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting:
  body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term      Instances of term   Documents        IDF value   Score
          in document         matching term
pre       18                  26               4.609916    82.978
username  10                  23               4.7276993   47.276
column    9                   13               5.266696    47.400264
oracle    9                   8                5.7085285   51.376
alter     7                   1                7.212606    50.488
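The Score column on this slide appears to be the number of instances multiplied by the IDF value, once the decimal points lost in extraction are restored (for example 18 × 4.609916 ≈ 82.978). The sketch below recomputes the scores under that assumption and divides each by the top score, which is consistent with the boost values in the example query if they are read as fractions. The function names are illustrative; this is not the Lucene MoreLikeThis implementation.

```python
# term: (instances in document, IDF value), read from the slide's table
# with decimal points restored (an assumption, see the lead-in).
TERMS = {
    "pre": (18, 4.609916),
    "username": (10, 4.7276993),
    "column": (9, 5.266696),
    "oracle": (9, 5.7085285),
    "alter": (7, 7.212606),
}


def term_scores(terms):
    """Score each term as term frequency times inverse document frequency."""
    return {term: tf * idf for term, (tf, idf) in terms.items()}


def boosts(scores):
    """Express each score as a fraction of the highest score."""
    top = max(scores.values())
    return {term: score / top for term, score in scores.items()}


scores = term_scores(TERMS)
relative = boosts(scores)
for term in TERMS:
    print(f"{term:<9} score={scores[term]:.3f} boost={relative[term]:.5f}")
```

The highest-scoring term gets a boost of 1, which may be why the first term in the example query carries no explicit boost.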
Search Features - Highlighting
• Highlighting the search terms
• Includes stemming and other logic
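Highlighting that takes stemming into account can be sketched as follows. The `toy_stem` function is a deliberately crude suffix stripper invented for this example (not a real stemmer such as Porter), and the `<em>` tags are an assumed markup choice.

```python
import re


def toy_stem(word):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, query, open_tag="<em>", close_tag="</em>"):
    """Wrap every word whose stem matches the stem of a query term."""
    stems = {toy_stem(term) for term in query.split()}

    def mark(match):
        word = match.group(0)
        return f"{open_tag}{word}{close_tag}" if toy_stem(word) in stems else word

    return re.sub(r"\w+", mark, text)


print(highlight("The quick brown fox jumps over a lazy dog", "jump foxes"))
```

Because both the text and the query are stemmed, a search for "jump foxes" still marks "jumps" and "fox" in the original sentence.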
DEMO SOLR
SOLUTIONS
Services - Common search options
• MySQL based
» Native Full-Text search
» Sphinx Search Plugin
• Lucene based (Java)
» Apache Lucene/Solr
» ElasticSearch
Services - Common search options
[Chart: the search options positioned on two axes, ease of use and power]
MySQL Based - Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables (InnoDB is also supported from MySQL 5.6 onward), and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
Querying is easy
• MySQL Full-Text query
  SELECT * FROM articles
  WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
  SELECT id, MATCH (title, body) AGAINST ('Tutorial')
  FROM articles;
• Sphinx query: the index is a separate table
  SELECT id, created_time, weight
  FROM my_sphinx_index
  WHERE created_time BETWEEN X AND Y
  AND MATCH ('Android phone')
  ORDER BY weight DESC, created_time DESC;
Lucene based
ElasticSearch
• Simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and real-time
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposes all of the Lucene power
• Clustering possible but harder
• Focus on complete and customizable
• Defaults
• Configuration = 3000 lines
Solr vs ElasticSearch - Search Fresh Index While Idle
[Chart: search time in ms (axis 0 to 60) for ElasticSearch and Solr; lower is better]
Solr vs ElasticSearch - Search Fresh Index While Indexing (1 doc/3 sec)
[Chart: search time in ms (axis 0 to 250) for ElasticSearch and Solr; lower is better]
Solr vs ElasticSearch - Search Full Index While Indexing (1 doc/3 sec)
[Chart: search time in ms (axis 0 to 2500) for ElasticSearch and Solr; lower is better]
Solr vs ElasticSearch
[Chart: Solr and ElasticSearch compared across Idle, Indexing, and Full + Indexing; lower is better]
Querying with Solr and ElasticSearch
Solr
• Normal query
  http://solr/?q=field:banana
• Faceting
  http://solr/?q=field:banana&facet=on&facet.field=tags
ElasticSearch
• Normal query
  http:///_search?q=field:value
• Advanced queries via a request body (POST)
  POST http://collection/_search
  {
    "query":  { "query_string": { "query": "T" } },
    "facets": { "tags": { "terms": { "field": "tags" } } }
  }
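The two query styles on this slide can be assembled programmatically. The sketch below only builds the request URL and body and does not send anything; the base URLs are placeholders, and the helper names are hypothetical. The `facets` key mirrors the slide; newer ElasticSearch releases use aggregations instead.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://solr/"                 # placeholder host, as on the slide
ES_SEARCH = "http://collection/_search"    # placeholder endpoint


def solr_url(field, value, facet_field=None):
    """Build a Solr-style query URL, optionally asking for facet counts."""
    params = [("q", f"{field}:{value}")]
    if facet_field:
        params += [("facet", "on"), ("facet.field", facet_field)]
    return SOLR_BASE + "?" + urlencode(params)


def es_body(query, facet_field):
    """Build a JSON request body in the style shown on the slide."""
    return json.dumps({
        "query": {"query_string": {"query": query}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    })


print(solr_url("field", "banana"))
print(solr_url("field", "banana", facet_field="tags"))
print("POST", ES_SEARCH)
print(es_body("T", "tags"))
```

Note that `urlencode` percent-encodes the colon in `field:banana`, which is the safe form to send over HTTP.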
ElasticSearch
SANOMA CONTENT LIBRARY
Sanoma Content Library
• Search
» in site
» in cluster
» in network
» Elevation (ads)
» Faceting
• Related
» More like this
» Relevant ads
» Products
• Reuse
» Sharing
» Variants
» (simple) DRM
» Images
• Analyse
» Sentiment
» Named Entities
» Tagging
» Classification
» Key phrases
Services - Content Library
[Architecture diagram of the Content Library; component labels:]
• Analyse Pipeline
• NER, Sentiment
• Crawler
• Indexer
• Search index
• Search: nu.nl, wtf
• Related: Vrouwen, Kieskeurig
• Relevant: Txel
• API
• Edge
• Redirects
• Loader
• Solr
• Mongo
• Integration: Vrouwen, Wordpress, SAS
• CMS
• JCR
• Keyphrase extractor
• Classifier
Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
» Timestamps
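One reading of the timestamp gotcha is that full-precision timestamps produce a distinct indexed value for almost every document. The sketch below, under that assumption, truncates timestamps to a coarser granularity before indexing; `coarsen` and the sample values are hypothetical.

```python
from datetime import datetime


def coarsen(ts, granularity="hour"):
    """Truncate a timestamp so that nearby documents share one indexed value."""
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


stamps = [
    datetime(2015, 4, 24, 9, 15, 3),
    datetime(2015, 4, 24, 9, 48, 59),
    datetime(2015, 4, 24, 17, 2, 11),
]

distinct_raw = len(set(stamps))
distinct_hour = len({coarsen(s, "hour") for s in stamps})
distinct_day = len({coarsen(s, "day") for s in stamps})
print(distinct_raw, distinct_hour, distinct_day)
```

Whether hour or day granularity is appropriate depends on how precisely users actually filter or sort on time.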
END
User patterns
bull Features should be adjusted to the user and usage patterns your seeing
bull What are users searching for on your site
bull How are they searching for it
bull Use web analytics to track and improve your search behavior
202442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User pattern - Quit
212442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Pogosticking
222442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Thrashing
232442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns - Narrow
242442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
User patterns ndash Others
bull Pearl Growing
bull Expand
252442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
SearchFeatures
Search Features
bull Faceting
bull Autocomplete
bull More like this
bull Highlighting
bull Spellcheckingdid you mean
bull Geospatialldquobike repairrdquo in area of [longlat][longlat]
bull Boostingwhen title is more relevant then content
bull Elevationalways get a certain result at position nget the current weather current traffic at 1st
position or ingest ads
272442015 copy Sanoma Media
Search Features - Faceting
282442015 copy Sanoma Media
From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets
Search Features - Autocomplete
292442015 copy Sanoma Media
Search Features - More like this
• Gives you related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term      Instances of term in document  Documents matching term  IDF value  Score
pre       18                             26                       4.609916   82.978
username  10                             23                       4.7276993  47.276
column    9                              13                       5.266696   47.400264
oracle    9                              8                        5.7085285  51.376
alter     7                              1                        7.212606   50.488
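The numbers in the table appear consistent with score = term frequency × IDF, and the boosts in the query with score divided by the best score. The sketch below reproduces that arithmetic. The query omits "alter"; the `min_doc_freq` cutoff used here to drop it is an assumption about why, and the function names are hypothetical.

```python
# (term, instances in document, documents matching term, IDF) from the table.
ROWS = [
    ("pre", 18, 26, 4.609916),
    ("username", 10, 23, 4.7276993),
    ("column", 9, 13, 5.266696),
    ("oracle", 9, 8, 5.7085285),
    ("alter", 7, 1, 7.212606),
]


def term_boosts(rows, min_doc_freq=2):
    """Score each term as tf * idf and normalise by the best score.

    min_doc_freq is an assumed cutoff that skips terms matching too few
    documents.
    """
    scores = {term: tf * idf for term, tf, df, idf in rows if df >= min_doc_freq}
    best = max(scores.values())
    return {term: score / best for term, score in scores.items()}


def build_query(boosts, field="body"):
    """Render the boosts as a Lucene-style query string."""
    parts = []
    for term, boost in boosts.items():
        if boost == 1.0:
            parts.append(f"{field}:{term}")          # top term needs no boost
        else:
            parts.append(f"{field}:{term}^{boost:.5f}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(term_boosts(ROWS)))
```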
Search Features - Highlighting
• Highlights the search terms in the results
• Takes stemming and other analysis logic into account
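A minimal sketch of stem-aware highlighting follows. The suffix stripper is a deliberately crude stand-in for a real stemmer, and the function names and `<em>` markers are assumptions for this example; a search engine reuses the analysis chain of the indexed field instead.

```python
import re


def naive_stem(word):
    """Very rough English suffix stripping; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches the stem of a query term."""
    stems = {naive_stem(term) for term in query.split()}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if naive_stem(word) in stems else word

    return re.sub(r"\w+", mark, text)


if __name__ == "__main__":
    print(highlight("The fox jumps over lazy dogs", "jumping dog"))
```

Because both the query and the text go through the same stemming step, a search for "jumping" still marks "jumps" in the result snippet.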
DEMO SOLR
SOLUTIONS
Services: Common search options
• MySQL based
» Native Full-Text search
» Sphinx Search Plugin
• Lucene based (Java)
» Apache Lucene/Solr
» ElasticSearch
[Figure: the common search options positioned by ease of use versus power]
MySQL Based: Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables (InnoDB gained full-text indexes in MySQL 5.6) and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
Querying is easy
• MySQL Full-Text query
SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
SELECT id, MATCH (title, body) AGAINST ('Tutorial') AS score
FROM articles;
• Sphinx query: the index is a separate table
SELECT id, created_time, weight
FROM my_sphinx_index
WHERE created_time BETWEEN X AND Y
AND MATCH ('Android phone')
ORDER BY weight DESC, created_time DESC;
Lucene based
ElasticSearch
• A simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and real-time search
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposes all of Lucene's power
• Clustering possible, but harder
• Focus on being complete and customizable
• Defaults
• Configuration = 3000 lines
Solr vs ElasticSearch: search time in ms (lower is better)
[Figure: Search Fresh Index While Idle; bars for ElasticSearch and Solr, y-axis 0 to 60 ms]
[Figure: Search Fresh Index While Indexing 1 doc/3 sec; bars for ElasticSearch and Solr, y-axis 0 to 250 ms]
[Figure: Search Full Index While Indexing 1 doc/3 sec; bars for ElasticSearch and Solr, y-axis 0 to 2500 ms]
[Figure: summary of Solr and ElasticSearch across Idle, Indexing, and Full + Indexing]
Querying with Solr and ElasticSearch
Solr
• Normal query
http://solr/?q=field:banana
• Faceting
http://solr/?q=field:banana&facet=on&facet.field=tags
ElasticSearch
• Normal query
http://.../_search?q=field:value
• Advanced queries via POST with a JSON body
POST http://.../collection/_search
{ "query": { "query_string": { "query": "T" } },
  "facets": { "tags": { "terms": { "field": "tags" } } } }
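The two request styles above can be put together from application code roughly as follows. This is a sketch that only builds the URL and the JSON body, without sending anything. The base URL and the helper names are assumptions, and the `facets` section follows the slide, so it targets the ElasticSearch versions of that time.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; adjust to your own host and core.
SOLR_SELECT = "http://localhost:8983/solr/articles/select"


def solr_url(query, facet_field=None, base=SOLR_SELECT):
    """Build a Solr query URL, optionally asking for facet counts on one field."""
    params = {"q": query}
    if facet_field:
        params["facet"] = "on"
        params["facet.field"] = facet_field
    return f"{base}?{urlencode(params)}"


def es_search_body(query, facet_field=None):
    """Build the JSON body for an ElasticSearch _search request."""
    body = {"query": {"query_string": {"query": query}}}
    if facet_field:
        body["facets"] = {facet_field: {"terms": {"field": facet_field}}}
    return json.dumps(body)


if __name__ == "__main__":
    print(solr_url("field:banana", facet_field="tags"))
    print(es_search_body("T", facet_field="tags"))
```

Building the parameters through `urlencode` keeps characters such as `:` and spaces correctly escaped, which is easy to get wrong when concatenating query strings by hand.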
ElasticSearch
SANOMA CONTENT LIBRARY
Sanoma Content Library
Search
• In site
• In cluster
• In network
• Elevation (ads)
• Faceting
Related
• More like this
• Relevant ads
• Products
Reuse
• Sharing
• Variants
• (Simple) DRM
• Images
Analyse
• Sentiment
• Named entities
• Tagging
• Classification
• Key phrases
Services: Content Library
[Architecture diagram. Components: Content Library, Analyse Pipeline (NER, Sentiment, Keyphrase extractor, Classifier), Crawler, Indexer, Search index, Loader, Solr, Mongo, API, Edge, Redirects, CMS, JCR. Consumers: Search (nu.nl, wtf), Related (Vrouwen, Kieskeurig), Relevant (Txel), Integration (Vrouwen, Wordpress, SAS)]
Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
» Timestamps
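One way to read the timestamp gotcha is that full-precision timestamps make almost every indexed value unique, which bloats the index and makes range filters and facets expensive. Under that assumption, which is an interpretation and not something stated on the slide, the sketch below lowers the precision before indexing. The function name and the granularity options are hypothetical.

```python
from datetime import datetime


def truncate_timestamp(ts, granularity="hour"):
    """Drop precision that search does not need before indexing a timestamp."""
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


if __name__ == "__main__":
    stamps = [
        datetime(2015, 4, 24, 10, 15, 30),
        datetime(2015, 4, 24, 10, 47, 2),
        datetime(2015, 4, 24, 11, 0, 0),
    ]
    # Fewer distinct values end up in the index after truncation.
    print(len(set(stamps)), len({truncate_timestamp(s) for s in stamps}))
```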
END
User patterns ndash Others
bull Pearl Growing
bull Expand
252442015 copy Sanoma Media
Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791
SearchFeatures
Search Features
bull Faceting
bull Autocomplete
bull More like this
bull Highlighting
bull Spellcheckingdid you mean
bull Geospatialldquobike repairrdquo in area of [longlat][longlat]
bull Boostingwhen title is more relevant then content
bull Elevationalways get a certain result at position nget the current weather current traffic at 1st
position or ingest ads
272442015 copy Sanoma Media
Search Features - Faceting
282442015 copy Sanoma Media
From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets
Search Features - Autocomplete
292442015 copy Sanoma Media
Search Features - More like this
302442015 copy Sanoma Media
bull Give you the related items based on a document
bull Compares the Term Vectors of various documents
bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915
TermNumber of Instances of Term in Document
Number of DocumentsMatching Term
IDF value Score
pre 18 26 4609916 82978
username 10 23 47276993 47276
column 9 13 5266696 47400264
oracle 9 8 57085285 51376
alter 7 1 7212606 50488
Search Features - Highlighting
312442015 copy Sanoma Media
bull Highlighting the search terms
bull Includes stemming and other logic
DEMO SOLR
322442015 copy Sanoma Media
SOLUTIONS
332442015 copy Sanoma Media
ServicesCommon search options
bull MySQL based
raquoNative Full-Text search
raquoSphinx Search Plugin
bull Lucene based (Java)
raquoApache LuceneSolr
raquoElasticSearch
342442015 copy Sanoma Media
ServicesCommon search options
352442015 copy Sanoma Media
Ease of use
Power
MySQL BasedNative Full-Text vs Sphinx
MySQL Full-Text search
bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields
bull Only standard English stop words
bull Limited query capabilities
bull Slow on large collections (1GB+)
bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo
bull No stemming no synonyms no custom flieds no highlighting
Sphinx
bull External plugin
bull All storage engines
bull Also on numeric field types
bull ~3x faster on index and query
bull Simple stemming and synonyms
bull No custom fields no highlighting
362442015 copy Sanoma Media
Querying is easy
bull MySQL Full-Text query
SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)
bull Getting the score
SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles
bull Sphinx query index is separate table
SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)
ORDER by weight DESCcreated_time DESC
372442015 copy Sanoma Media
Lucene based
ElasticSearch
bull Simpler Solr
bull No need for a schema
bull Easy to cluster
bull Focus on scaling and realtime
bull Go with the defaults
bull Configuration = 3 lines
bull Percolation
bull Versions and TTLs
Solr
bull Exposing all of the lucenepower
bull Clustering possible but harder
bull Focus on complete and customizable
bull Defaults
bull Configuration = 3000 lines
382442015 copy Sanoma Media
Solr vs ElasticSearchSearch Fresh Index While Idle
0
10
20
30
40
50
60
Search
tim
e i
n m
s
ElasticSearch
Solr
392442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec
0
50
100
150
200
250
Search
tim
e i
n m
s
ElasticSearch
Solr
402442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
412442015 copy Sanoma Media
Lower is better
Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec
0
500
1000
1500
2000
2500
Search
tim
e i
n m
s
ElasticSearch
Solr
422442015 copy Sanoma MediaLower is better
Idle Indexing Full + Indexing
Solr vs ElasticSearch
432442015 copy Sanoma Media
Lower is better
SOLR ElasticSearch
Querying with Solr and ElasticSearch
Solr
bull Normal query
httpsolrq=fieldbanana
bull Facetting
httpsolrq=fieldbananaampfacet=onampfacetfield=tags
ElasticSearch
bull Normal query
http_searchq=fieldvalue
bull Advanced queries via PUT
POST httpcollectionseach
query query_string query T
facets
tags terms field tags
442442015 copy Sanoma Media
ElasticSearch
452442015 copy Sanoma Media
SANOMA CONTENT LIBRARY
462442015 copy Sanoma Media
Sanoma Content Library
Search
in site
in cluster
in network
Elevation (ads)
Facetting
Related
More like this
Relevant ads
Products
Reuse
Sharing
Variants
(simple) Drm
Images
Analyse
Sentiment
Named Entities
Tagging
Classificatie
Key phrases
474242015 copy Sanoma Media
Services Content Library
482442015 copy Sanoma Media
Content Library
Analyse Pipeline
NER Sentiment
Crawler
Indexer
Searchindex
Search- nunl- wtf
Related- Vrouwen- Kieskeurig
Relevant- Txel
API
Edge
Redirects
Loader
Solr
Mongo
Integration- Vrouwen- Wordpress- SAS
CMS
JCR
Keyphraseextractor
Classifier
Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
» Timestamps
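One possible reading of the timestamp gotcha is that full-precision timestamps produce a distinct indexed value for almost every document. The sketch below, under that assumption, reduces precision before indexing; the function name `round_for_index` and the choice of hour or day granularity are hypothetical and should follow whatever your queries actually filter on.

```python
from datetime import datetime, timezone


def round_for_index(ts, granularity="hour"):
    """Drop the precision that searches will not need before indexing a timestamp."""
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


# Three made-up publication times, precise to the microsecond.
stamps = [
    datetime(2015, 4, 24, 10, 15, 30, 123456, tzinfo=timezone.utc),
    datetime(2015, 4, 24, 10, 47, 2, 987654, tzinfo=timezone.utc),
    datetime(2015, 4, 24, 11, 5, 0, 1, tzinfo=timezone.utc),
]
unique_raw = {s.isoformat() for s in stamps}
unique_hour = {round_for_index(s).isoformat() for s in stamps}
print(len(unique_raw), "distinct raw values;", len(unique_hour), "distinct hourly values")
```

Fewer distinct values generally means a smaller term dictionary and cheaper range filters, at the cost of not being able to query below the chosen granularity.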
END
Search Features
• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: "did you mean"
• Geospatial: "bike repair" in area of [long,lat],[long,lat]
• Boosting: when the title is more relevant than the content
• Elevation: always get a certain result at position n; get the current weather or current traffic at 1st position, or inject ads
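To make the elevation idea concrete, here is a minimal sketch that forces one item to a fixed position in an already ranked result list. The function name `elevate` and the document ids are invented for this example; search servers such as Solr provide their own elevation component, which this does not attempt to reproduce.

```python
def elevate(results, pinned, position=0):
    """Return a new result list with `pinned` forced to `position` (0-based)."""
    # Remove the pinned item if it already appears in the organic results.
    remaining = [r for r in results if r != pinned]
    # Clamp the position so it always falls inside the list.
    position = max(0, min(position, len(remaining)))
    return remaining[:position] + [pinned] + remaining[position:]


organic = ["doc-7", "doc-3", "doc-9", "doc-1"]
print(elevate(organic, "weather-widget", 0))
print(elevate(organic, "doc-9", 1))
```

The organic ranking of the other results is left untouched, so elevation stays independent of whatever scoring produced the list.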
Search Features - Faceting
From the user's perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
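The definition above can be illustrated with a small in-memory sketch: count the values of a field across the current results, then restrict the results to one value and count again. The documents and the `site` and `tags` field names are made up for the example; a real search server computes these counts from its index.

```python
from collections import Counter

# Hypothetical search results; field names are assumptions for this example.
DOCS = [
    {"id": 1, "site": "news", "tags": ["android", "phone"]},
    {"id": 2, "site": "news", "tags": ["android", "tablet"]},
    {"id": 3, "site": "reviews", "tags": ["phone"]},
    {"id": 4, "site": "reviews", "tags": ["camera"]},
]


def field_values(doc, field):
    """Return the values of a field as a list, whether it is single or multi-valued."""
    value = doc[field]
    return value if isinstance(value, list) else [value]


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    counts = Counter()
    for doc in docs:
        counts.update(set(field_values(doc, field)))
    return counts


def drill_down(docs, field, value):
    """Restrict the result set to documents matching one facet value."""
    return [d for d in docs if value in field_values(d, field)]


print(facet_counts(DOCS, "tags"))
phones = drill_down(DOCS, "tags", "phone")
print(facet_counts(phones, "site"))
```

Each drill-down narrows the result set, and the facet counts are recomputed on what remains, which is what lets the user keep refining.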
Search Features - Autocomplete
Search Features - More like this
• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term       Instances of term in document   Documents matching term   IDF value   Score
pre        18                              26                        4.609916    82.978
username   10                              23                        4.7276993   47.276
column     9                               13                        5.266696    47.400264
oracle     9                               8                         5.7085285   51.376
alter      7                               1                         7.212606    50.488
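The numbers in the table are consistent with Score = term frequency × IDF (for example 18 × 4.609916 ≈ 82.978), and the boosts in the generated query match each term's score divided by the highest score, up to rounding. The sketch below reproduces that arithmetic. It is an inference from the slide's figures, not a copy of any library's implementation, and the `min_doc_freq` threshold is an assumption introduced here to explain why "alter" (found in only one document) does not appear in the query.

```python
# Term statistics from the table: (term frequency, document frequency, IDF).
STATS = {
    "pre": (18, 26, 4.609916),
    "username": (10, 23, 4.7276993),
    "column": (9, 13, 5.266696),
    "oracle": (9, 8, 5.7085285),
    "alter": (7, 1, 7.212606),
}


def more_like_this_query(stats, field="body", min_doc_freq=5):
    """Score terms by TF * IDF and emit a boosted query string.

    min_doc_freq is an assumed threshold: terms found in fewer documents
    than this are left out of the generated query.
    """
    scores = {
        term: tf * idf
        for term, (tf, df, idf) in stats.items()
        if df >= min_doc_freq
    }
    top = max(scores.values())
    parts = []
    for term, score in scores.items():
        boost = score / top
        # The best term needs no explicit boost; the others are scaled relative to it.
        if boost == 1.0:
            parts.append(f"{field}:{term}")
        else:
            parts.append(f"{field}:{term}^{boost:.5f}")
    return " ".join(parts)


print(more_like_this_query(STATS))
```

The last digit of each boost may differ from the slide because this sketch rounds where the original output appears to truncate.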
Search Features - Highlighting
• Highlighting the search terms
• Includes stemming and other logic
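A small sketch of why highlighting has to share the stemming logic of the index: the query "jumping foxes" should still highlight "fox" and "jumps". The suffix-stripping `naive_stem` function is a deliberately crude stand-in written for this example, not a real stemmer such as Porter, and the `<em>` markers are an arbitrary choice.

```python
import re


def naive_stem(word):
    """Very rough English stemmer, for illustration only."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip a suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches a query stem in highlight markers."""
    stems = {naive_stem(t) for t in re.findall(r"\w+", query)}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if naive_stem(word) in stems else word

    return re.sub(r"\w+", mark, text)


print(highlight("The quick brown fox jumps over a lazy dog", "jumping foxes"))
```

Matching on stems rather than on the literal query words is what keeps the highlighted snippet consistent with the documents the search actually returned.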
DEMO SOLR
SOLUTIONS
Services: Common search options
• MySQL based
» Native Full-Text search
» Sphinx Search Plugin
• Lucene based (Java)
» Apache Lucene/Solr
» ElasticSearch
Services: Common search options
[Chart: the search options plotted by ease of use against power]
MySQL based: Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables (InnoDB also supports it from MySQL 5.6) and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
Querying is easy
• MySQL Full-Text query
    SELECT * FROM articles WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
    SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;
• Sphinx query; the index is a separate table
    SELECT id, created_time, weight FROM my_sphinx_index
    WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone')
    ORDER BY weight DESC, created_time DESC;
Lucene based
ElasticSearch
• Simpler than Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and real time
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposes all of the Lucene power
• Clustering possible, but harder
• Focus on complete and customizable
• Defaults
• Configuration = 3000 lines
SOLUTIONS
332442015 copy Sanoma Media
ServicesCommon search options
bull MySQL based
raquoNative Full-Text search
raquoSphinx Search Plugin
bull Lucene based (Java)
raquoApache LuceneSolr
raquoElasticSearch
342442015 copy Sanoma Media
ServicesCommon search options
352442015 copy Sanoma Media
Ease of use
Power
MySQL based: Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables (InnoDB gained FULLTEXT indexes in MySQL 5.6) and only on CHAR, VARCHAR, and TEXT columns
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
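The "hard" and "expensive" faceting point can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, so the example runs without a server) and a LIKE filter in place of MATCH ... AGAINST. The tables articles and article_tags and their rows are hypothetical. The point is only that every facet needs its own aggregate query over the matching rows, on top of the search query itself.

```python
import sqlite3


def facet_counts(conn, term):
    """Count tags over all articles whose body contains `term`.

    One such GROUP BY query is needed per facet field, which is
    what makes faceting in plain SQL costly on large tables.
    """
    return conn.execute(
        """
        SELECT t.tag, COUNT(*) AS n
        FROM articles a
        JOIN article_tags t ON t.article_id = a.id
        WHERE a.body LIKE ?
        GROUP BY t.tag
        ORDER BY n DESC, t.tag ASC
        """,
        (f"%{term}%",),
    ).fetchall()


def demo_connection():
    """Build a tiny in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE article_tags (article_id INTEGER, tag TEXT);
        INSERT INTO articles VALUES
            (1, 'Android phone review'),
            (2, 'Android tablet review'),
            (3, 'Banana bread recipe');
        INSERT INTO article_tags VALUES
            (1, 'mobile'), (1, 'review'),
            (2, 'review'),
            (3, 'food');
        """
    )
    return conn
```

Calling facet_counts(demo_connection(), "Android") returns the tag counts for the two matching articles. A dedicated search engine computes such counts as part of the search request instead.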
Querying is easy
• MySQL Full-Text query
SELECT * FROM articles WHERE MATCH (title, body) AGAINST ('database');
• Getting the score
SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;
• Sphinx query: the index is a separate table
SELECT id, created_time, weight FROM my_sphinx_index WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone')
ORDER BY weight DESC, created_time DESC;
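When the search term comes from user input, it is safer to pass it as a bound parameter than to paste it into the SQL text. The sketch below builds the two MySQL statements shown above as a (sql, params) pair. The function name build_fulltext_query is hypothetical, and the %s placeholder style is assumed because common Python MySQL drivers such as PyMySQL and mysql-connector-python use it. Table and column names cannot be bound, so they are checked against a conservative pattern instead.

```python
import re

# Identifiers cannot be passed as bound parameters, so only allow
# plain names made of letters, digits, and underscores.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_fulltext_query(table, columns, term, with_score=True):
    """Return (sql, params) for a MATCH ... AGAINST query."""
    for name in (table, *columns):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    match = f"MATCH ({', '.join(columns)}) AGAINST (%s)"
    if with_score:
        # The term is bound twice: once for the score column,
        # once for the WHERE filter.
        sql = (
            f"SELECT id, {match} AS score FROM {table} "
            f"WHERE {match} ORDER BY score DESC"
        )
        return sql, (term, term)
    return f"SELECT * FROM {table} WHERE {match}", (term,)
```

The returned pair can be handed to a driver's cursor.execute(sql, params), which takes care of quoting the term.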
Lucene based
ElasticSearch
• A simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and real time
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs
Solr
• Exposes all of Lucene's power
• Clustering possible but harder
• Focus on completeness and customizability
• Defaults
• Configuration = 3000 lines
[Chart: Solr vs ElasticSearch, search on a fresh index while idle. Y-axis: search time in ms (0 to 60). Series: ElasticSearch, Solr. Lower is better.]
[Chart: Solr vs ElasticSearch, search on a fresh index while indexing (1 doc/3 sec). Y-axis: search time in ms (0 to 250). Series: ElasticSearch, Solr. Lower is better.]
[Chart: Solr vs ElasticSearch, search on a full index while indexing (1 doc/3 sec). Y-axis: search time in ms (0 to 2500). Series: ElasticSearch, Solr. Lower is better.]
[Chart: Solr vs ElasticSearch, summary of the three scenarios: idle, indexing, full index + indexing. Series: Solr, ElasticSearch. Lower is better.]
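Search times like the ones in these charts can be collected with a simple timing loop. The following is a minimal sketch, not the setup used for the slides: measure_latency_ms is a hypothetical helper, and search stands for any callable that sends one query to the engine under test.

```python
import statistics
import time


def measure_latency_ms(search, queries, repeats=5):
    """Time each query `repeats` times and summarize in milliseconds.

    `queries` must not be empty, otherwise there is nothing to summarize.
    """
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            search(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Running the same loop once while the engine is idle and once while documents are being indexed gives figures comparable to the scenarios above. The median is reported because single slow requests would distort a mean.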
Querying with Solr and ElasticSearch
Solr
• Normal query
http://solr/?q=field:banana
• Faceting
http://solr/?q=field:banana&facet=on&facet.field=tags
ElasticSearch
• Normal query
http://.../_search?q=field:value
• Advanced queries via POST
POST http://.../collection/_search
{ "query": { "query_string": { "query": "T" } },
  "facets": { "tags": { "terms": { "field": "tags" } } } }
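The two request styles above can be assembled in a few lines. This is a sketch under assumptions: the helper names and the base URL in the usage note are hypothetical, and the request body follows the facets syntax shown on the slide, which later ElasticSearch versions replaced with aggregations. Nothing is sent over the network; the functions only build the URL and the JSON body.

```python
import json
from urllib.parse import urlencode


def solr_query_url(base_url, field, value, facet_field=None):
    """Build a Solr query URL, optionally with a facet on one field."""
    params = [("q", f"{field}:{value}")]
    if facet_field is not None:
        params += [("facet", "on"), ("facet.field", facet_field)]
    # urlencode percent-escapes the colon in "field:value".
    return f"{base_url}?{urlencode(params)}"


def es_facet_body(query, facet_field):
    """Build the JSON body for a query_string search with a terms facet.

    The facet is named after the field it counts, as on the slide.
    """
    return json.dumps({
        "query": {"query_string": {"query": query}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    })
```

For example, solr_query_url("http://localhost:8983/solr/select", "field", "banana", "tags") yields the faceted Solr URL, and es_facet_body("T", "tags") yields the body to POST to a _search endpoint.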
ElasticSearch
SANOMA CONTENT LIBRARY
Sanoma Content Library
• Search: in site, in cluster, in network, elevation (ads), faceting
• Related: more like this, relevant ads, products
• Reuse: sharing, variants, (simple) DRM, images
• Analysis: sentiment, named entities, tagging, classification, key phrases
Services: Content Library
[Architecture diagram. Labeled components:]
• Content Library
• Analyse Pipeline: NER, Sentiment
• Keyphrase extractor, Classifier
• Crawler, Indexer, Search index
• Search: nu.nl, wtf
• Related: Vrouwen, Kieskeurig
• Relevant: Txel
• API, Edge, Redirects, Loader
• Solr, Mongo
• Integration: Vrouwen, Wordpress, SAS
• CMS, JCR
Common gotchas
• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
» Timestamps
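Timestamps are the example given for indexing in too much detail: a value with second or sub-second precision is close to unique per document, so it adds many distinct values to the index without helping search. One way to handle this is to coarsen the value before indexing. The sketch below assumes Python datetime values; the function name and the granularity names are hypothetical.

```python
from datetime import datetime

# Fields to reset for each supported granularity.
_GRANULARITY = {
    "minute": dict(second=0, microsecond=0),
    "hour": dict(minute=0, second=0, microsecond=0),
    "day": dict(hour=0, minute=0, second=0, microsecond=0),
}


def coarsen_timestamp(ts, granularity="hour"):
    """Round a timestamp down to the start of its minute, hour, or day."""
    return ts.replace(**_GRANULARITY[granularity])
```

With hourly granularity, every document created within the same hour shares one indexed value, which is usually precise enough for date filters and date facets.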
END