Search Basics

Date post: 17-Jul-2015
Upload: sander-kieft

Search: Find the rabbit

24-4-2015 © Sanoma Media

Agenda

• Search Basics
• Features
• Search solutions
  » MySQL (Full-Text search and Sphinx)
  » Solr
  » ElasticSearch
• Sanoma Content Library
• Common gotchas

Basics: ABC of search

High level components

Filtering, Indexing, Querying, Ranking

High level components: Filtering techniques

• Tokenizing
• Stop Words
• Synonyms
• Stemming
• Term occurrence
• Phonetics

High level components: Filtering techniques - Tokenizing

The quick brown fox jumps over a lazy dog

The | quick | brown | fox | jumps | over | a | lazy | dog
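A minimal sketch of the tokenizing step, assuming a tokenizer that simply splits on anything that is not an ASCII letter or digit. The function name `tokenize` is illustrative and does not come from any particular search engine.

```python
import re


def tokenize(text):
    """Split text into tokens on every character that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


tokens = tokenize("The quick brown fox jumps over a lazy dog")
print(tokens)
```

Real tokenizers make more careful choices about punctuation, case and numbers, which is what the next slide is about.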

• Special characters: + - $ ^ & etc.
  » IBM
• Case and numeric changes
  » PowerShot, TransAM, SD500, iPod
• Decide what you want to happen with
  » Canon Power-Shot SD500 (Canon Power shot SD-500, Canon Powershot SD 500)
  » O'neill's

High level components: Filtering techniques - Stop Words

• Remove stop words from being indexed
• No value, since they're too common

The quick brown fox jumps over a lazy dog => quick brown fox jumps lazy dog

Stop words: a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, got, had, have, h...
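The normalization and stop word decisions above could be sketched as follows. This is only one possible set of choices: the tiny `STOP_WORDS` set and the rule "lower-case and drop every non-alphanumeric character" are assumptions made for the example, not the behaviour of a specific engine.

```python
import re

# Illustrative stop word set; real lists are longer and language specific.
STOP_WORDS = {"a", "an", "and", "over", "the"}


def normalize(token):
    """Lower-case a token and drop characters that are not ASCII letters or digits."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def analyze(text):
    """Split on whitespace, normalize each token, then drop stop words."""
    tokens = (normalize(raw) for raw in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


print(analyze("The quick brown fox jumps over a lazy dog"))
print(analyze("Canon Power-Shot SD500"))
print(analyze("Canon Power shot SD-500"))
```

With these rules "Power-Shot" and "PowerShot" end up as the same term, but "Power shot" still yields two terms, so the spelling variants on the slide are not all unified. That is the kind of decision the slide asks you to make up front.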


High level components: Filtering techniques - Synonyms

• De-duplicate various words
  » bicycle, cycle, bike
  » i-pod, ipot => iPod
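A small sketch of synonym handling, assuming a hand-written table that maps every variant to one canonical term. The table contents and the names `SYNONYMS`, `canonical` and `apply_synonyms` are hypothetical; the canonical forms are lower-cased here for simplicity.

```python
# Hypothetical synonym table: each variant maps to a single canonical term.
SYNONYMS = {
    "cycle": "bicycle",
    "bike": "bicycle",
    "i-pod": "ipod",
    "ipot": "ipod",
}


def canonical(token):
    """Return the canonical form of a token, or the lower-cased token itself."""
    key = token.lower()
    return SYNONYMS.get(key, key)


def apply_synonyms(tokens):
    """Map every token in a list to its canonical form."""
    return [canonical(t) for t in tokens]


print(apply_synonyms(["Bike", "repair"]))
print(apply_synonyms(["i-pod", "ipot", "iPod"]))
```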

High level components: Filtering techniques - Stemming

• Determine the stem of a word
  » Dogs => dog
  » Recharging => recharg
  » Rechargeable => recharg
• Language specific
  » Porter for English (-s, -ed, -ly, -ing, etc.)
  » Snowball/Porter or Kraaij-Pohlmann for Dutch (ge-, -en, etc.)
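To make the idea concrete, here is a toy suffix stripper that reproduces the three examples above. It is deliberately naive and is not the Porter algorithm; the suffix list and the length threshold are assumptions chosen only for this illustration.

```python
# Suffixes tried in order; the first one that matches is removed.
SUFFIXES = ("able", "ing", "ed", "ly", "s")


def stem(word):
    """Strip one common English suffix, then a trailing 'e', from a word."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    if w.endswith("e") and len(w) > 3:
        w = w[:-1]
    return w


for word in ("Dogs", "Recharging", "Rechargeable"):
    print(word, "=>", stem(word))
```

A production stemmer handles many more cases and exceptions, which is why language-specific stemmers are normally used instead of hand-rolled rules like these.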

High level components: Filtering techniques - Term occurrence

• Options for limiting the size of the index
  » Minimum term frequency
  » Minimum term length
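One way to read these two options is as thresholds applied before a term is admitted to the index. The sketch below assumes that reading; the function name `select_terms` and the default thresholds are hypothetical.

```python
from collections import Counter


def select_terms(tokens, min_frequency=2, min_length=3):
    """Keep only terms that occur often enough and are long enough."""
    counts = Counter(tokens)
    return {
        term: n
        for term, n in counts.items()
        if n >= min_frequency and len(term) >= min_length
    }


tokens = ["search", "search", "index", "db", "db", "db", "solr"]
print(select_terms(tokens))
```

Here "db" is dropped for being too short and "index" and "solr" for occurring only once, which shrinks the index at the cost of making those terms unsearchable.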

High level components: Filtering techniques - Phonetics

• Handling "sounds like" queries
  » Robert => R163 <= Rupert
  » Smith => (SM0, XMT) ∩ (XMT, SMT) <= Schmith
• Various methods available
  » DoubleMetaphone
  » Metaphone
  » Soundex
  » RefinedSoundex
  » Caverphone
  » BeiderMorse
• Levenshtein can be used during querying
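As an illustration of the simplest method in the list, the following is a compact sketch of classic (American) Soundex. It assumes ASCII input and treats any unmapped letter like a vowel; it is meant to show the principle rather than to replace a library implementation.

```python
# Map consonants to Soundex digits; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


print(soundex("Robert"), soundex("Rupert"))
print(soundex("Smith"), soundex("Schmith"))
```

Both pairs from the slide collapse to the same code, so a query for one spelling can match documents containing the other.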

High level components: Apply the filters on indexing and querying

Same filters: stop words, stemming, synonyms, etc.
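The reason for using the same filters on both sides can be sketched with a tiny inverted index. Everything here (the `analyze` chain, the stop word set, the plural rule) is a simplified assumption; the point is only that documents and queries go through one shared function.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the"}  # illustrative only


def analyze(text):
    """Shared filter chain: lower-case, drop stop words, strip a plural 's'."""
    terms = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms


def build_index(documents):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "The quick brown fox", 2: "A lazy dog", 3: "Dogs and foxes"}
index = build_index(docs)
print(search(index, "the dogs"))
```

If the query were not analyzed, "dogs" would never match the indexed term "dog", even though both documents are clearly relevant.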

High level components: Indexing

High level components: Querying

DEMO: Stemming, Phonetics

High level components: Ranking

High level components: Ranking - TF-IDF

TF-IDF: Term Frequency-Inverse Document Frequency

(How often does the search term occur in the text) / (How many words are in the entire text)

3/12 = 0.25 (more relevant)    5/24 = 0.21
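The two fractions on the slide are the term-frequency half of TF-IDF. The sketch below reproduces them and adds one common formulation of the inverse-document-frequency half; the IDF formula and the function names are assumptions, since the slide does not spell them out.

```python
import math


def term_frequency(term_count, total_words):
    """Share of the words in a text that are the search term."""
    return term_count / total_words


def inverse_document_frequency(total_docs, docs_with_term):
    """A common IDF variant: log(collection size / documents containing the term)."""
    return math.log(total_docs / docs_with_term)


def tf_idf(term_count, total_words, total_docs, docs_with_term):
    """Combine both parts: frequent in this text, rare in the collection."""
    return term_frequency(term_count, total_words) * inverse_document_frequency(
        total_docs, docs_with_term
    )


tf_a = term_frequency(3, 12)
tf_b = term_frequency(5, 24)
print(round(tf_a, 2), round(tf_b, 2))
```

The text with 3 hits in 12 words ranks above the one with 5 hits in 24 words because the term makes up a larger share of it.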

USER PATTERNS

• Features should be adjusted to the users and the usage patterns you're seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search

Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791

User patterns - Quit

User patterns - Pogosticking

User patterns - Thrashing

User patterns - Narrow

User patterns - Others

• Pearl Growing
• Expand

Search Features

• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: "did you mean"
• Geospatial: "bike repair" in area of [long,lat],[long,lat]
• Boosting: when title is more relevant than content
• Elevation: always get a certain result at position n, e.g. get the current weather or current traffic at 1st position, or inject ads

Search Features - Faceting

From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
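A minimal in-memory sketch of that idea, assuming each search result is a dictionary with a multi-valued `tags` field. The sample documents and the helper names `facet_counts` and `drill_down` are invented for the example; a search engine computes the same counts from its index.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    counts = Counter()
    for doc in results:
        for value in doc.get(field, []):
            counts[value] += 1
    return counts


def drill_down(results, field, value):
    """Restrict the results to those that carry the chosen facet value."""
    return [doc for doc in results if value in doc.get(field, [])]


results = [
    {"id": 1, "tags": ["fruit", "yellow"]},
    {"id": 2, "tags": ["fruit", "red"]},
    {"id": 3, "tags": ["vegetable", "red"]},
]

print(facet_counts(results, "tags"))
print([doc["id"] for doc in drill_down(results, "tags", "red")])
```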

Search Features - Autocomplete


Search Features - More like this

• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term       Instances of term in document   Documents matching term   IDF value    Score
pre        18                              26                        4.609916     82.978
username   10                              23                        4.7276993    47.276
column     9                               13                        5.266696     47.400264
oracle     9                               8                         5.7085285    51.376
alter      7                               1                         7.212606     50.488
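The numbers in the table can be reproduced with a short sketch: the score is the term count multiplied by the IDF value, and each boost is the score relative to the best-scoring term. The minimum document frequency cutoff used to leave out `alter` is an assumption; the slide does not say why that term is missing from the query.

```python
# (instances in document, documents matching term, IDF value), copied from the table.
TERM_STATS = {
    "pre": (18, 26, 4.609916),
    "username": (10, 23, 4.7276993),
    "column": (9, 13, 5.266696),
    "oracle": (9, 8, 5.7085285),
    "alter": (7, 1, 7.212606),
}


def score_terms(stats):
    """Score each term as term frequency times IDF."""
    return {term: tf * idf for term, (tf, _df, idf) in stats.items()}


def build_query(stats, field="body", min_doc_freq=5):
    """Build a boosted query from terms that occur in enough documents."""
    kept = {
        term: tf * idf
        for term, (tf, df, idf) in stats.items()
        if df >= min_doc_freq  # assumed cutoff, see the note above
    }
    if not kept:
        return ""
    best = max(kept.values())
    parts = []
    for term, score in kept.items():
        boost = score / best
        if boost == 1.0:
            parts.append(f"{field}:{term}")
        else:
            parts.append(f"{field}:{term}^{boost:.4f}")
    return " ".join(parts)


print(score_terms(TERM_STATS))
print(build_query(TERM_STATS))
```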

Search Features - Highlighting

• Highlighting the search terms
• Includes stemming and other logic
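A rough sketch of stem-aware highlighting, assuming a toy stemmer and `<em>` markers. Both are choices made for this example; search engines run the highlighter through the same analysis chain that was used for indexing.

```python
import re


def stem(word):
    """Toy stemmer: strip one common suffix so 'Dogs' and 'dog' compare equal."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches a query term's stem."""
    wanted = {stem(term) for term in re.findall(r"\w+", query)}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if stem(word) in wanted else word

    return re.sub(r"\w+", mark, text)


print(highlight("Dogs chasing a dog", "dog"))
```

Because matching happens on stems, the plural "Dogs" is highlighted for the query "dog" while the original spelling in the text is preserved.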

DEMO: Solr

SOLUTIONS

Services: Common search options

• MySQL based
  » Native Full-Text search
  » Sphinx Search Plugin
• Lucene based (Java)
  » Apache Lucene/Solr
  » ElasticSearch

Services: Common search options

[Chart: the search options positioned by ease of use and power]

MySQL based: Native Full-Text vs Sphinx

MySQL Full-Text search

• Only for MyISAM tables, and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting

Sphinx

• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting

Querying is easy

• MySQL Full-Text query

SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('database');

• Getting the score

SELECT id, MATCH (title, body) AGAINST ('Tutorial')
FROM articles;

• Sphinx query: the index is a separate table

SELECT id, created_time, weight
FROM my_sphinx_index
WHERE created_time BETWEEN X AND Y
AND MATCH ('Android phone')
ORDER BY weight DESC, created_time DESC;

Lucene based

ElasticSearch

• Simpler than Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and realtime
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs

Solr

• Exposes all of the Lucene power
• Clustering possible, but harder
• Focus on complete and customizable
• Defaults
• Configuration = 3000 lines

Solr vs ElasticSearch: Search Fresh Index While Idle

[Chart: search time in ms (axis 0 to 60), ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch: Search Fresh Index While Indexing (1 doc/3 sec)

[Chart: search time in ms (axis 0 to 250), ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch: Search Full Index While Indexing (1 doc/3 sec)

[Chart: search time in ms (axis 0 to 2500), ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch

[Chart: Idle, Indexing, and Full + Indexing compared for Solr and ElasticSearch. Lower is better.]

Querying with Solr and ElasticSearch

Solr

• Normal query

http://.../solr/?q=field:banana

• Faceting

http://.../solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

http://.../_search?q=field:value

• Advanced queries via POST

POST http://.../collection/_search

{
  "query": { "query_string": { "query": "T" } },
  "facets": {
    "tags": { "terms": { "field": "tags" } }
  }
}
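The two request styles can be assembled in code without contacting a server. In this sketch the base URL passed in is a placeholder and the helper names are hypothetical; only the `q`, `facet` and `facet.field` parameters and the JSON keys come from the slide. The `facets` key mirrors the slide; later ElasticSearch releases replaced facets with aggregations.

```python
import json
from urllib.parse import urlencode


def solr_facet_url(base_url, field, value, facet_field):
    """Build a Solr query URL with faceting switched on."""
    params = {"q": f"{field}:{value}", "facet": "on", "facet.field": facet_field}
    return f"{base_url}?{urlencode(params)}"


def es_facet_body(query_text, facet_field):
    """Build the JSON body for a query_string search with a terms facet."""
    body = {
        "query": {"query_string": {"query": query_text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }
    return json.dumps(body)


# "solr.example" is a placeholder host, not a real endpoint.
print(solr_facet_url("http://solr.example/solr/select", "field", "banana", "tags"))
print(es_facet_body("T", "tags"))
```

Note that `urlencode` percent-encodes the colon in `field:banana`, which is what a browser would also send.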

ElasticSearch

SANOMA CONTENT LIBRARY

Sanoma Content Library

• Search
  » in site
  » in cluster
  » in network
  » Elevation (ads)
  » Faceting
• Related
  » More like this
  » Relevant ads
  » Products
• Reuse
  » Sharing
  » Variants
  » (simple) DRM
  » Images
• Analyse
  » Sentiment
  » Named Entities
  » Tagging
  » Classification
  » Key phrases

Services: Content Library

[Architecture diagram: Content Library. Components shown: Analyse Pipeline, NER, Sentiment, Keyphrase extractor, Classifier, Crawler, Indexer, Search index, API, Edge, Redirects, Loader, Solr, Mongo, CMS, JCR. Consumers shown: Search (nu.nl, wtf), Related (Vrouwen, Kieskeurig), Relevant (Txel), Integration (Vrouwen, Wordpress, SAS).]

Common gotchas

• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
  » Timestamps
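One possible reading of the timestamp gotcha is that full-precision timestamps create a distinct index term for almost every document. The sketch below assumes that reading and shows how rounding to a coarser granularity before indexing reduces the number of distinct values; the function name and the granularity options are hypothetical.

```python
from datetime import datetime


def index_timestamp(ts, granularity="day"):
    """Format a timestamp at the precision that search actually needs."""
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    if granularity == "hour":
        return ts.strftime("%Y-%m-%dT%H")
    return ts.isoformat()  # full precision: nearly every value is unique


stamps = [
    datetime(2015, 4, 24, 9, 15, 3),
    datetime(2015, 4, 24, 17, 42, 51),
]

print({index_timestamp(ts) for ts in stamps})
print({index_timestamp(ts, granularity="second") for ts in stamps})
```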

END

22442015 copy Sanoma Media

Agenda

bull Search Basics

bull Features

bull Search solutions

raquoMySQL (Full-Text search and Sphinx)

raquoSolr

raquoElasticSearch

bull Sanoma Content Library

bull Common gotcharsquos

BasicsABC of search

High level components

Filtering Indexing Querying Ranking

42442015 copy Sanoma Media

High level componentsFiltering techniques

Filtering Indexing Querying Ranking

54242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

64242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

The quick brown fox jumps over a lazy dog

Thequickbrown

foxjumpsover

alazydog

High level componentsFiltering techniques Filtering Indexing Querying Ranking

74242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Special characters +-$^amp etc

raquoIBM

bull Case and numeric changes

raquoPowerShot TransAM SD500 iPod

bull Decide what you want to happened with

raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)

raquoOrsquoneillrsquos

bull Remove stop words from being indexed

bull No value since theyrsquore to common

Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog

Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh

High level componentsFiltering techniques Filtering Indexing Querying Ranking

84242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

94242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull De-duplicate various words

raquobicycle cycle bike

raquoi-pod ipot =gt iPod

High level componentsFiltering techniques Filtering Indexing Querying Ranking

104242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Determine the stem of a word

raquoDogs =gt dog

raquoRecharging =gt recharg

raquoRechargeable =gt recharg

bull Language specific

raquoPorter for English (-s -ed -ly -ing etc)

raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)

High level componentsFiltering techniques Filtering Indexing Querying Ranking

114242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Options for limiting the size of the index

raquoMinimum Term frequency

raquoMinimum Term Length

High level componentsFiltering techniques Filtering Indexing Querying Ranking

124242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Handling sounds like queries

raquo Robert =gt R163 lt= Rupert

raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith

bull Various methods available

raquo DoubleMetaphone

raquo Metaphone

raquo Soundex

raquo RefinedSoundex

raquo Caverphone

raquo BeiderMorse

bull Levenstein can be used during quering

High level componentsApply the filters on Filtering and querying

Filtering Indexing Querying Ranking

132442015 copy Sanoma Media

Same filters

Sto

p w

ord

s

ste

mm

ing

synonym

s

etc

Filters

High level componentsIndexing

Filtering Indexing Querying Ranking

142442015 copy Sanoma Media

High level componentsQuerying

Filtering Indexing Querying Ranking

152442015 copy Sanoma Media

DEMOStemming Phonetics

162442015 copy Sanoma Media

High level componentsRanking

Filtering Indexing Querying Ranking

172442015 copy Sanoma Media

TF-IDFTerm Frequency-Inverse Document Frequency

How often does the search term occur in the text

How many words are in the entire text

High level componentsRanking ndash TF-IDF

Filtering Indexing Querying Ranking

182442015 copy Sanoma Media

312 = 025 524 = 021

More relevant

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

bull Simpler Solr

bull No need for a schema

bull Easy to cluster

bull Focus on scaling and realtime

bull Go with the defaults

bull Configuration = 3 lines

bull Percolation

bull Versions and TTLs

Solr

bull Exposing all of the lucenepower

bull Clustering possible but harder

bull Focus on complete and customizable

bull Defaults

bull Configuration = 3000 lines

382442015 copy Sanoma Media

Solr vs ElasticSearchSearch Fresh Index While Idle

0

10

20

30

40

50

60

Search

tim

e i

n m

s

ElasticSearch

Solr

392442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec

0

50

100

150

200

250

Search

tim

e i

n m

s

ElasticSearch

Solr

402442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

412442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

422442015 copy Sanoma MediaLower is better

Idle Indexing Full + Indexing

Solr vs ElasticSearch

432442015 copy Sanoma Media

Lower is better

SOLR ElasticSearch

Querying with Solr and ElasticSearch

Solr

bull Normal query

httpsolrq=fieldbanana

bull Facetting

httpsolrq=fieldbananaampfacet=onampfacetfield=tags

ElasticSearch

bull Normal query

http_searchq=fieldvalue

bull Advanced queries via PUT

POST httpcollectionseach

query query_string query T

facets

tags terms field tags

442442015 copy Sanoma Media

ElasticSearch

452442015 copy Sanoma Media

SANOMA CONTENT LIBRARY

462442015 copy Sanoma Media

Sanoma Content Library

Search

in site

in cluster

in network

Elevation (ads)

Facetting

Related

More like this

Relevant ads

Products

Reuse

Sharing

Variants

(simple) Drm

Images

Analyse

Sentiment

Named Entities

Tagging

Classificatie

Key phrases

474242015 copy Sanoma Media

Services Content Library

482442015 copy Sanoma Media

Content Library

Analyse Pipeline

NER Sentiment

Crawler

Indexer

Searchindex

Search- nunl- wtf

Related- Vrouwen- Kieskeurig

Relevant- Txel

API

Edge

Redirects

Loader

Solr

Mongo

Integration- Vrouwen- Wordpress- SAS

CMS

JCR

Keyphraseextractor

Classifier

Common gotcharsquos

bull Use right settings for your language stopwords and stemming

bull Indexing too much or too detailed

raquoTimestamps

492442015 copy Sanoma Media

END

502442015 copy Sanoma Media

BasicsABC of search

High level components

Filtering Indexing Querying Ranking

42442015 copy Sanoma Media

High level componentsFiltering techniques

Filtering Indexing Querying Ranking

54242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

64242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

The quick brown fox jumps over a lazy dog

Thequickbrown

foxjumpsover

alazydog

High level componentsFiltering techniques Filtering Indexing Querying Ranking

74242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Special characters +-$^amp etc

raquoIBM

bull Case and numeric changes

raquoPowerShot TransAM SD500 iPod

bull Decide what you want to happened with

raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)

raquoOrsquoneillrsquos

bull Remove stop words from being indexed

bull No value since theyrsquore to common

Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog

Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh

High level componentsFiltering techniques Filtering Indexing Querying Ranking

84242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

94242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull De-duplicate various words

raquobicycle cycle bike

raquoi-pod ipot =gt iPod

High level componentsFiltering techniques Filtering Indexing Querying Ranking

104242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Determine the stem of a word

raquoDogs =gt dog

raquoRecharging =gt recharg

raquoRechargeable =gt recharg

bull Language specific

raquoPorter for English (-s -ed -ly -ing etc)

raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)

High level componentsFiltering techniques Filtering Indexing Querying Ranking

114242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Options for limiting the size of the index

raquoMinimum Term frequency

raquoMinimum Term Length

High level componentsFiltering techniques Filtering Indexing Querying Ranking

124242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Handling sounds like queries

raquo Robert =gt R163 lt= Rupert

raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith

bull Various methods available

raquo DoubleMetaphone

raquo Metaphone

raquo Soundex

raquo RefinedSoundex

raquo Caverphone

raquo BeiderMorse

bull Levenstein can be used during quering

High level componentsApply the filters on Filtering and querying

Filtering Indexing Querying Ranking

132442015 copy Sanoma Media

Same filters

Sto

p w

ord

s

ste

mm

ing

synonym

s

etc

Filters

High level componentsIndexing

Filtering Indexing Querying Ranking

142442015 copy Sanoma Media

High level componentsQuerying

Filtering Indexing Querying Ranking

152442015 copy Sanoma Media

DEMOStemming Phonetics

162442015 copy Sanoma Media

High level componentsRanking

Filtering Indexing Querying Ranking

172442015 copy Sanoma Media

TF-IDFTerm Frequency-Inverse Document Frequency

How often does the search term occur in the text

How many words are in the entire text

High level componentsRanking ndash TF-IDF

Filtering Indexing Querying Ranking

182442015 copy Sanoma Media

312 = 025 524 = 021

More relevant

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

bull Simpler Solr

bull No need for a schema

bull Easy to cluster

bull Focus on scaling and realtime

bull Go with the defaults

bull Configuration = 3 lines

bull Percolation

bull Versions and TTLs

Solr

bull Exposing all of the lucenepower

bull Clustering possible but harder

bull Focus on complete and customizable

bull Defaults

bull Configuration = 3000 lines

382442015 copy Sanoma Media

High level components

Filtering, Indexing, Querying, Ranking

High level components: Filtering techniques

• Tokenizing
• Stop words
• Synonyms
• Stemming
• Term occurrence
• Phonetics

Filtering techniques: Tokenizing

The quick brown fox jumps over a lazy dog

The | quick | brown | fox | jumps | over | a | lazy | dog

• Special characters: + - $ ^ & etc.
  » IBM
• Case and numeric changes
  » PowerShot, TransAM, SD500, iPod
• Decide what you want to happen with
  » Canon Power-Shot SD500 (Canon Power shot SD-500, Canon Powershot SD 500)
  » O'neill's
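The tokenizing choices above (special characters, case changes, letter/digit boundaries) can be sketched in a few lines of Python. This is only an illustration under assumed rules, not how any particular engine tokenizes; the name `tokenize` and both regular expressions are made up for this example.

```python
import re

# Pieces are runs of ASCII letters and digits; everything else separates tokens.
_PIECE = re.compile(r"[A-Za-z0-9]+")
# Inside a piece, split on case changes and on letter/digit boundaries.
_SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def tokenize(text):
    """Return lowercased tokens, splitting on punctuation, case and digits."""
    tokens = []
    for piece in _PIECE.findall(text):
        tokens.extend(part.lower() for part in _SUBWORD.findall(piece))
    return tokens

print(tokenize("Canon Power-Shot SD500"))
print(tokenize("Canon Powershot SD 500"))
```

With these rules "Power-Shot" and "PowerShot" both become "power" and "shot", while "Powershot" stays one token, which is exactly the kind of decision the slide asks you to make up front.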

Filtering techniques: Stop words

• Remove stop words so they are not indexed
• They add no value since they're too common

The quick brown fox jumps over a lazy dog => quick brown fox jumps lazy dog

Stop words: a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, got, had, have, h…
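A minimal sketch of stop word removal, assuming a small hand-picked set. `STOP_WORDS` here is an illustrative subset, not the full list from the slide; "over" is included only because the slide's example drops it.

```python
# Illustrative subset of stop words (assumption, not the slide's complete list).
STOP_WORDS = {"a", "an", "and", "the", "over", "for", "from"}

def remove_stop_words(tokens):
    """Drop tokens that are stop words, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

sentence = "The quick brown fox jumps over a lazy dog"
filtered = remove_stop_words(sentence.split())
print(filtered)
```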


Filtering techniques: Synonyms

• De-duplicate various words
  » bicycle, cycle, bike
  » i-pod, ipot => iPod
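One simple way to picture synonym handling is a lookup table that maps every variant to a single canonical term. The table below and the name `normalize_synonyms` are assumptions for illustration; real engines usually load such mappings from a configuration file.

```python
# Variant -> canonical term (hypothetical mapping built from the slide's examples).
SYNONYMS = {
    "cycle": "bicycle",
    "bike": "bicycle",
    "i-pod": "ipod",
    "ipot": "ipod",
}

def normalize_synonyms(tokens):
    """Lowercase each token and replace known variants by their canonical form."""
    return [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]

print(normalize_synonyms(["Bike", "i-pod", "repair"]))
```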

Filtering techniques: Stemming

• Determine the stem of a word
  » Dogs => dog
  » Recharging => recharg
  » Rechargeable => recharg
• Language specific
  » Porter for English (-s, -ed, -ly, -ing, etc.)
  » Snowball/Porter or Kraaij-Pohlmann for Dutch (ge-, -en, etc.)
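To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm (which applies ordered rule sets with conditions on the remaining stem); the suffix list and `naive_stem` are assumptions chosen so that the slide's three examples come out the same.

```python
# Longest suffixes first, so "eable" wins over "able" for "rechargeable".
SUFFIXES = ("eable", "able", "ing", "ed", "ly", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("Dogs", "Recharging", "Rechargeable"):
    print(w, "=>", naive_stem(w))
```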

Filtering techniques: Term occurrence

• Options for limiting the size of the index
  » Minimum term frequency
  » Minimum term length
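Both limits can be sketched as a filter over term counts. The function name `prune_terms` and the default thresholds are assumptions for this example only.

```python
from collections import Counter

def prune_terms(tokens, min_frequency=2, min_length=3):
    """Keep only terms that occur often enough and are long enough to index."""
    counts = Counter(tokens)
    return {
        term: n
        for term, n in counts.items()
        if n >= min_frequency and len(term) >= min_length
    }

print(prune_terms(["fox", "fox", "ox", "ox", "dog"]))
```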

Filtering techniques: Phonetics

• Handling "sounds like" queries
  » Robert => R163 <= Rupert
  » Smith => (SM0, XMT) ∩ (XMT, SMT) <= Schmith
• Various methods available
  » DoubleMetaphone
  » Metaphone
  » Soundex
  » RefinedSoundex
  » Caverphone
  » BeiderMorse
• Levenshtein distance can be used during querying
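The Robert/Rupert example uses Soundex codes. The sketch below follows the common American Soundex rules (keep the first letter, encode consonants as digits, collapse adjacent equal codes, pad to four characters); the function name is an assumption and library implementations may differ in edge cases.

```python
# Letter groups that share one Soundex digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code of a name, or '' if it has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in ("h", "w"):
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))
```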

High level components: Apply the filters on indexing and querying

[Diagram: the same filters (stop words, stemming, synonyms, etc.) sit in front of both indexing and querying]
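Why the same filters matter on both sides can be seen in a toy inverted index. Everything here (`analyze`, `build_index`, `search`, the tiny stop word set and the crude plural stripping) is an assumption for illustration, not a real engine's pipeline.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "over"}  # illustrative subset

def analyze(text):
    """Shared filter chain: lowercase, split, drop stop words, strip a plural 's'."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "The quick brown fox jumps over a lazy dog", 2: "Dogs chase the fox"}
index = build_index(docs)
print(search(index, "dogs"))
```

Because the query "dogs" goes through the same chain as the documents, it matches both "dog" and "Dogs"; skipping the filters at query time would look up a term that was never indexed.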

High level components: Indexing

High level components: Querying

DEMO: Stemming, Phonetics

High level components: Ranking

High level components: Ranking - TF-IDF

TF-IDF: Term Frequency-Inverse Document Frequency

• How often does the search term occur in the text?
• How many words are in the entire text?

3/12 = 0.25 versus 5/24 = 0.21: the first text is more relevant
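The two ratios above are term frequencies. A small sketch of the arithmetic follows, with an IDF factor added for completeness; the IDF formula shown is one common smoothed variant and is an assumption here, since the slide does not spell one out. The sample texts are synthetic.

```python
import math

def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of words."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, documents):
    """One common smoothed IDF variant: 1 + ln(N / (df + 1))."""
    matching = sum(1 for tokens in documents if term in tokens)
    return 1 + math.log(len(documents) / (matching + 1))

def tf_idf(term, tokens, documents):
    return term_frequency(term, tokens) * inverse_document_frequency(term, documents)

# Synthetic texts: 3 hits in 12 words versus 5 hits in 24 words.
short_text = ["rabbit"] * 3 + ["filler"] * 9
long_text = ["rabbit"] * 5 + ["filler"] * 19
documents = [short_text, long_text]

print(term_frequency("rabbit", short_text), term_frequency("rabbit", long_text))
```

Although the longer text contains more hits, the shorter one scores higher because a larger share of its words match.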

USER PATTERNS

User patterns

• Features should be adjusted to the users and the usage patterns you're seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search

Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791

User patterns - Quit

User patterns - Pogosticking

User patterns - Thrashing

User patterns - Narrow

User patterns - Others

• Pearl Growing
• Expand

Search Features

• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: did you mean
• Geospatial: "bike repair" in area of [long, lat], [long, lat]
• Boosting: when title is more relevant than content
• Elevation: always get a certain result at position n; get the current weather or current traffic at 1st position, or inject ads
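Boosting and elevation from the list above can be illustrated with a toy ranker. The scoring (title hits count double by default) and the `elevated` mapping of position to document id are assumptions for this sketch; engines expose these as query-time boosts and elevation configuration instead.

```python
def rank(docs, term, title_boost=2.0, elevated=None):
    """Rank matching docs by boosted term counts, then pin elevated ids."""
    def score(doc):
        in_title = doc["title"].lower().split().count(term)
        in_body = doc["body"].lower().split().count(term)
        return title_boost * in_title + in_body

    matches = sorted((d for d in docs if score(d) > 0), key=score, reverse=True)
    ids = [d["id"] for d in matches]
    # Elevation: force specific ids (for example an ad) to fixed positions.
    for position, doc_id in sorted((elevated or {}).items()):
        if doc_id in ids:
            ids.remove(doc_id)
        ids.insert(position, doc_id)
    return ids

docs = [
    {"id": "a", "title": "Weather", "body": "rain rain"},
    {"id": "b", "title": "Rain today", "body": "rain later"},
]
print(rank(docs, "rain"))
print(rank(docs, "rain", elevated={0: "ad"}))
```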

Search Features - Faceting

From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to "drill down" or further restrict their search results based on those facets.
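The two halves of that definition, counts per category and drilling down, can be sketched as follows. The result documents, the `tags` field and both function names are made up for the example.

```python
from collections import Counter

def facet_counts(results, field):
    """Count how many results carry each value of a multi-valued field."""
    counts = Counter()
    for doc in results:
        counts.update(doc.get(field, []))
    return counts

def drill_down(results, field, value):
    """Restrict the results to those having the chosen facet value."""
    return [doc for doc in results if value in doc.get(field, [])]

results = [
    {"title": "Banana bread", "tags": ["fruit", "baking"]},
    {"title": "Banana split", "tags": ["fruit", "dessert"]},
]
print(facet_counts(results, "tags"))
print(drill_down(results, "tags", "dessert"))
```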

Search Features - Autocomplete

Search Features - More like this

• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^5.6974 body:column^5.7123 body:oracle^6.1915

Term       Instances of term in document   Documents matching term   IDF value   Score
pre        18                              26                        4.609916    82.978
username   10                              23                        4.7276993   47.276
column     9                               13                        5.266696    47.400264
oracle     9                               8                         5.7085285   51.376
alter      7                               1                         7.212606    50.488
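The table can be reproduced with a few lines of arithmetic: score = term count × IDF. Two things in this sketch are inferred from the numbers rather than stated on the slide, so treat them as assumptions: the IDF column matches 1 + ln(N / (df + 1)) for a collection of N = 998 documents, and the boosts in the query match each score divided by the best score, times 10.

```python
import math

# (term, instances of term in document, documents matching term) from the table.
rows = [("pre", 18, 26), ("username", 10, 23), ("column", 9, 13),
        ("oracle", 9, 8), ("alter", 7, 1)]
N_DOCS = 998  # assumption: inferred from the IDF column, not given on the slide

def idf(doc_freq, n_docs=N_DOCS):
    """Smoothed inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1 + math.log(n_docs / (doc_freq + 1))

scores = {term: tf * idf(df) for term, tf, df in rows}
best = max(scores.values())
# Assumed boost rule: score relative to the best-scoring term, scaled to 10.
boosts = {term: 10 * score / best for term, score in scores.items()}

for term, tf, df in rows:
    print(f"{term:10} idf={idf(df):.4f} score={scores[term]:.3f} boost={boosts[term]:.4f}")
```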

Search Features - Highlighting

• Highlights the search terms
• Includes stemming and other logic
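A rough sketch of why highlighting needs the stemming logic: the word in the text is marked when its stem matches the stem of a query term, even if the surface forms differ. `crude_stem` (plural stripping only) and the `<em>` markers are assumptions for this example.

```python
import re

def crude_stem(word):
    """Very rough stand-in for a stemmer: lowercase and strip a plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches a query term's stem."""
    wanted = {crude_stem(t) for t in query.split()}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if crude_stem(word) in wanted else word

    return re.sub(r"\w+", mark, text)

print(highlight("Dogs chase the dog", "dog"))
```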

DEMO SOLR

SOLUTIONS

Services: Common search options

• MySQL based
  » Native Full-Text search
  » Sphinx Search Plugin
• Lucene based (Java)
  » Apache Lucene/Solr
  » ElasticSearch

[Chart: the common search options plotted by ease of use against power]

MySQL based: Native Full-Text vs Sphinx

MySQL Full-Text search

• Only for MyISAM tables (InnoDB as of MySQL 5.6) and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting

Sphinx

• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting

Querying is easy

• MySQL Full-Text query

    SELECT * FROM articles
    WHERE MATCH (title, body) AGAINST ('database');

• Getting the score

    SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;

• Sphinx query (the index is a separate table)

    SELECT id, created_time, weight FROM my_sphinx_index
    WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone')
    ORDER BY weight DESC, created_time DESC;

Lucene based

ElasticSearch

• A simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and real time
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs

Solr

• Exposes all of Lucene's power
• Clustering possible but harder
• Focus on being complete and customizable
• Defaults
• Configuration = 3000 lines

Solr vs ElasticSearch: search time in ms (lower is better)

[Chart: Search, fresh index, while idle (axis 0 to 60 ms)]
[Chart: Search, fresh index, while indexing 1 doc/3 sec (axis 0 to 250 ms)]
[Chart: Search, full index, while indexing 1 doc/3 sec (axis 0 to 2500 ms)]
[Chart: summary for Solr and ElasticSearch across Idle, Indexing, and Full + Indexing]

Querying with Solr and ElasticSearch

Solr

• Normal query

    http://solr/?q=field:banana

• Faceting

    http://solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

    http://…/_search?q=field:value

• Advanced queries via POST

    POST http://…/collection/_search
    { "query": { "query_string": { "query": "T" } },
      "facets": { "tags": { "terms": { "field": "tags" } } } }
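The requests above can be assembled with the Python standard library. The host names, ports and paths in `SOLR_SELECT` and `ES_SEARCH` are placeholders assumed for this sketch, and the `facets` body mirrors the slide's era of the ElasticSearch API (later versions use aggregations instead). Nothing is sent over the network here.

```python
import json
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"        # assumed endpoint
ES_SEARCH = "http://localhost:9200/collection/_search"   # assumed endpoint

def solr_url(query, facet_field=None):
    """Build a Solr query URL, optionally asking for facet counts on a field."""
    params = [("q", query)]
    if facet_field:
        params += [("facet", "on"), ("facet.field", facet_field)]
    return SOLR_SELECT + "?" + urlencode(params)

def es_facet_body(query, facet_field):
    """Build the JSON body for a query_string search with a terms facet."""
    return json.dumps({
        "query": {"query_string": {"query": query}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    })

print(solr_url("field:banana"))
print(solr_url("field:banana", facet_field="tags"))
print("POST", ES_SEARCH, es_facet_body("T", "tags"))
```

Note that `urlencode` percent-encodes the colon in `field:banana` as `%3A`, which servers decode back to the same query.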

ElasticSearch

SANOMA CONTENT LIBRARY

Sanoma Content Library

• Search: in site, in cluster, in network, elevation (ads), faceting
• Related: more like this, relevant ads, products
• Reuse: sharing, variants, (simple) DRM, images
• Analyse: sentiment, named entities, tagging, classification, key phrases

Services: Content Library

[Architecture diagram. Labels recovered from the figure, in their original order:]
• Content Library, Analyse Pipeline, NER, Sentiment, Crawler, Indexer, Search index
• Search: nu.nl, wtf
• Related: Vrouwen, Kieskeurig
• Relevant: Txel
• API, Edge, Redirects, Loader, Solr, Mongo
• Integration: Vrouwen, Wordpress, SAS
• CMS, JCR, Keyphrase extractor, Classifier

Common gotchas

• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
  » Timestamps

END

High level componentsFiltering techniques

Filtering Indexing Querying Ranking

54242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

64242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

The quick brown fox jumps over a lazy dog

Thequickbrown

foxjumpsover

alazydog

High level componentsFiltering techniques Filtering Indexing Querying Ranking

74242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Special characters +-$^amp etc

raquoIBM

bull Case and numeric changes

raquoPowerShot TransAM SD500 iPod

bull Decide what you want to happened with

raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)

raquoOrsquoneillrsquos

bull Remove stop words from being indexed

bull No value since theyrsquore to common

Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog

Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh

High level componentsFiltering techniques Filtering Indexing Querying Ranking

84242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

94242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull De-duplicate various words

raquobicycle cycle bike

raquoi-pod ipot =gt iPod

High level componentsFiltering techniques Filtering Indexing Querying Ranking

104242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Determine the stem of a word

raquoDogs =gt dog

raquoRecharging =gt recharg

raquoRechargeable =gt recharg

bull Language specific

raquoPorter for English (-s -ed -ly -ing etc)

raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)

High level componentsFiltering techniques Filtering Indexing Querying Ranking

114242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Options for limiting the size of the index

raquoMinimum Term frequency

raquoMinimum Term Length

High level componentsFiltering techniques Filtering Indexing Querying Ranking

124242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Handling sounds like queries

raquo Robert =gt R163 lt= Rupert

raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith

bull Various methods available

raquo DoubleMetaphone

raquo Metaphone

raquo Soundex

raquo RefinedSoundex

raquo Caverphone

raquo BeiderMorse

bull Levenstein can be used during quering

High level componentsApply the filters on Filtering and querying

Filtering Indexing Querying Ranking

132442015 copy Sanoma Media

Same filters

Sto

p w

ord

s

ste

mm

ing

synonym

s

etc

Filters

High level componentsIndexing

Filtering Indexing Querying Ranking

142442015 copy Sanoma Media

High level componentsQuerying

Filtering Indexing Querying Ranking

152442015 copy Sanoma Media

DEMOStemming Phonetics

162442015 copy Sanoma Media

High level componentsRanking

Filtering Indexing Querying Ranking

172442015 copy Sanoma Media

TF-IDFTerm Frequency-Inverse Document Frequency

How often does the search term occur in the text

How many words are in the entire text

High level componentsRanking ndash TF-IDF

Filtering Indexing Querying Ranking

182442015 copy Sanoma Media

312 = 025 524 = 021

More relevant

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

bull Simpler Solr

bull No need for a schema

bull Easy to cluster

bull Focus on scaling and realtime

bull Go with the defaults

bull Configuration = 3 lines

bull Percolation

bull Versions and TTLs

Solr

bull Exposing all of the lucenepower

bull Clustering possible but harder

bull Focus on complete and customizable

bull Defaults

bull Configuration = 3000 lines

382442015 copy Sanoma Media

Solr vs ElasticSearchSearch Fresh Index While Idle

0

10

20

30

40

50

60

Search

tim

e i

n m

s

ElasticSearch

Solr

392442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec

0

50

100

150

200

250

Search

tim

e i

n m

s

ElasticSearch

Solr

402442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

412442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

422442015 copy Sanoma MediaLower is better

Idle Indexing Full + Indexing

Solr vs ElasticSearch

432442015 copy Sanoma Media

Lower is better

SOLR ElasticSearch

Querying with Solr and ElasticSearch

Solr

bull Normal query

httpsolrq=fieldbanana

bull Facetting

httpsolrq=fieldbananaampfacet=onampfacetfield=tags

ElasticSearch

bull Normal query

http_searchq=fieldvalue

bull Advanced queries via PUT

POST httpcollectionseach

query query_string query T

facets

tags terms field tags

442442015 copy Sanoma Media

ElasticSearch

452442015 copy Sanoma Media

SANOMA CONTENT LIBRARY

462442015 copy Sanoma Media

Sanoma Content Library

Search

in site

in cluster

in network

Elevation (ads)

Facetting

Related

More like this

Relevant ads

Products

Reuse

Sharing

Variants

(simple) Drm

Images

Analyse

Sentiment

Named Entities

Tagging

Classificatie

Key phrases

474242015 copy Sanoma Media

Services Content Library

482442015 copy Sanoma Media

Content Library

Analyse Pipeline

NER Sentiment

Crawler

Indexer

Searchindex

Search- nunl- wtf

Related- Vrouwen- Kieskeurig

Relevant- Txel

API

Edge

Redirects

Loader

Solr

Mongo

Integration- Vrouwen- Wordpress- SAS

CMS

JCR

Keyphraseextractor

Classifier

Common gotcharsquos

bull Use right settings for your language stopwords and stemming

bull Indexing too much or too detailed

raquoTimestamps

492442015 copy Sanoma Media

END

502442015 copy Sanoma Media

High level componentsFiltering techniques Filtering Indexing Querying Ranking

64242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

The quick brown fox jumps over a lazy dog

Thequickbrown

foxjumpsover

alazydog

High level componentsFiltering techniques Filtering Indexing Querying Ranking

74242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Special characters +-$^amp etc

raquoIBM

bull Case and numeric changes

raquoPowerShot TransAM SD500 iPod

bull Decide what you want to happened with

raquoCanon Power-Shot SD500(Canon Power shot SD-500 Canon Powershot SD 500)

raquoOrsquoneillrsquos

bull Remove stop words from being indexed

bull No value since theyrsquore to common

Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog

Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh

High level componentsFiltering techniques Filtering Indexing Querying Ranking

84242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

94242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull De-duplicate various words

raquobicycle cycle bike

raquoi-pod ipot =gt iPod

High level componentsFiltering techniques Filtering Indexing Querying Ranking

104242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Determine the stem of a word

raquoDogs =gt dog

raquoRecharging =gt recharg

raquoRechargeable =gt recharg

bull Language specific

raquoPorter for English (-s -ed -ly -ing etc)

raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)

High level componentsFiltering techniques Filtering Indexing Querying Ranking

114242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Options for limiting the size of the index

raquoMinimum Term frequency

raquoMinimum Term Length

High level componentsFiltering techniques Filtering Indexing Querying Ranking

124242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Handling sounds like queries

raquo Robert =gt R163 lt= Rupert

raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith

bull Various methods available

raquo DoubleMetaphone

raquo Metaphone

raquo Soundex

raquo RefinedSoundex

raquo Caverphone

raquo BeiderMorse

bull Levenstein can be used during quering

High level componentsApply the filters on Filtering and querying

Filtering Indexing Querying Ranking

132442015 copy Sanoma Media

Same filters

Sto

p w

ord

s

ste

mm

ing

synonym

s

etc

Filters

High level componentsIndexing

Filtering Indexing Querying Ranking

142442015 copy Sanoma Media

High level componentsQuerying

Filtering Indexing Querying Ranking

152442015 copy Sanoma Media

DEMOStemming Phonetics

162442015 copy Sanoma Media

High level componentsRanking

Filtering Indexing Querying Ranking

172442015 copy Sanoma Media

TF-IDFTerm Frequency-Inverse Document Frequency

How often does the search term occur in the text

How many words are in the entire text

High level componentsRanking ndash TF-IDF

Filtering Indexing Querying Ranking

182442015 copy Sanoma Media

312 = 025 524 = 021

More relevant

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

bull Simpler Solr

bull No need for a schema

bull Easy to cluster

bull Focus on scaling and realtime

bull Go with the defaults

bull Configuration = 3 lines

bull Percolation

bull Versions and TTLs

Solr

bull Exposing all of the lucenepower

bull Clustering possible but harder

bull Focus on complete and customizable

bull Defaults

bull Configuration = 3000 lines

382442015 copy Sanoma Media

Solr vs ElasticSearchSearch Fresh Index While Idle

0

10

20

30

40

50

60

Search

tim

e i

n m

s

ElasticSearch

Solr

392442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec

0

50

100

150

200

250

Search

tim

e i

n m

s

ElasticSearch

Solr

402442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

412442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

422442015 copy Sanoma MediaLower is better

Idle Indexing Full + Indexing

Solr vs ElasticSearch

432442015 copy Sanoma Media

Lower is better

SOLR ElasticSearch

Querying with Solr and ElasticSearch

Solr

bull Normal query

httpsolrq=fieldbanana

bull Facetting

httpsolrq=fieldbananaampfacet=onampfacetfield=tags

ElasticSearch

bull Normal query

http_searchq=fieldvalue

bull Advanced queries via PUT

POST httpcollectionseach

query query_string query T

facets

tags terms field tags

442442015 copy Sanoma Media

ElasticSearch

452442015 copy Sanoma Media

SANOMA CONTENT LIBRARY

462442015 copy Sanoma Media

Sanoma Content Library

Search

in site

in cluster

in network

Elevation (ads)

Facetting

Related

More like this

Relevant ads

Products

Reuse

Sharing

Variants

(simple) Drm

Images

Analyse

Sentiment

Named Entities

Tagging

Classificatie

Key phrases

474242015 copy Sanoma Media

Services Content Library

482442015 copy Sanoma Media

Content Library

Analyse Pipeline

NER Sentiment

Crawler

Indexer

Searchindex

Search- nunl- wtf

Related- Vrouwen- Kieskeurig

Relevant- Txel

API

Edge

Redirects

Loader

Solr

Mongo

Integration- Vrouwen- Wordpress- SAS

CMS

JCR

Keyphraseextractor

Classifier

Common gotcharsquos

bull Use right settings for your language stopwords and stemming

bull Indexing too much or too detailed

raquoTimestamps

492442015 copy Sanoma Media

END

502442015 copy Sanoma Media


High level components: Filtering techniques – Synonyms

• De-duplicate various words

» bicycle, cycle, bike

» i-pod, ipot => iPod
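A minimal sketch of synonym normalization, assuming a hand-made synonym table. The `SYNONYMS` mapping and the `normalize` helper are hypothetical names for illustration, not part of any search engine API. Each variant is mapped to one canonical term before indexing, so all variants end up under the same index entry:

```python
# Hypothetical synonym table: every variant maps to one canonical term.
SYNONYMS = {
    "cycle": "bicycle",
    "bike": "bicycle",
    "i-pod": "ipod",
    "ipot": "ipod",
}


def normalize(tokens):
    """Lowercase each token and replace known variants with their canonical term."""
    return [SYNONYMS.get(token.lower(), token.lower()) for token in tokens]


print(normalize(["Bike", "repair", "i-pod", "iPod"]))
```

A lookup table like this only handles one-to-one replacements; expanding a term into several synonyms at query time is a different design with different index-size trade-offs.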

High level components: Filtering techniques – Stemming

• Determine the stem of a word

» Dogs => dog

» Recharging => recharg

» Rechargeable => recharg

• Language specific

» Porter for English (-s, -ed, -ly, -ing, etc.)

» Snowball/Porter or Kraaij-Pohlmann for Dutch (ge-, -en, etc.)
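To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter or Snowball algorithm (those apply ordered rule sets with extra conditions); the suffix list and the `naive_stem` name are assumptions chosen so that the three examples above come out the same way:

```python
# Naive suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("eable", "able", "ing", "ed", "ly", "s")  # longest first
MIN_STEM_LENGTH = 3


def naive_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for word in ("Dogs", "Recharging", "Rechargeable"):
    print(word, "=>", naive_stem(word))
```

Note that a stem such as "recharg" is not a dictionary word; it only has to be the same for all forms of the word at index time and at query time.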

High level components: Filtering techniques – Term occurrence

• Options for limiting the size of the index

» Minimum term frequency

» Minimum term length
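A small sketch of both options, assuming the frequency is counted over the token list that is passed in. The function name and the default thresholds are hypothetical:

```python
from collections import Counter


def select_index_terms(tokens, min_frequency=2, min_length=3):
    """Return the terms that occur often enough and are long enough to index."""
    counts = Counter(tokens)
    return {
        term: count
        for term, count in counts.items()
        if count >= min_frequency and len(term) >= min_length
    }


tokens = "search engine search index engine a a a".split()
print(select_index_terms(tokens))
```

Here "index" is dropped because it occurs only once and "a" is dropped because it is too short, which is the kind of pruning that keeps the index small.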

High level components: Filtering techniques – Phonetics

• Handling "sounds like" queries

» Robert => R163 <= Rupert

» Smith => (SM0, XMT) ∩ (XMT, SMT) <= Schmith

• Various methods available

» DoubleMetaphone

» Metaphone

» Soundex

» RefinedSoundex

» Caverphone

» BeiderMorse

• Levenshtein distance can be used during querying
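The Soundex example (Robert and Rupert both map to R163) and the Levenshtein remark can be sketched in a few lines. This is a minimal sketch of classic American Soundex and of edit distance, written from the standard algorithm descriptions rather than taken from a particular library; the function names are assumptions:

```python
# Soundex digit for each consonant group; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]


def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions from a to b."""
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        row = [i]
        for j, char_b in enumerate(b, 1):
            row.append(min(previous_row[j] + 1,                       # deletion
                           row[j - 1] + 1,                            # insertion
                           previous_row[j - 1] + (char_a != char_b)))  # substitution
        previous_row = row
    return previous_row[-1]


print(soundex("Robert"), soundex("Rupert"))
print(levenshtein("Smith", "Schmith"))
```

Phonetic codes are typically computed once at index time, while an edit distance compares the query against candidate terms at query time, which is why the two are listed separately.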

High level components: Apply the same filters when indexing and querying

[Diagram: the same filters (stop words, stemming, synonyms, etc.) sit in front of both the indexing step and the querying step]
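Why the filters have to be shared can be shown with a toy inverted index. This is a sketch under simple assumptions (whitespace tokenizing, a three-word stop list, plural "s" stripping); `analyze`, `build_index` and `search` are hypothetical helper names:

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "over"}  # tiny illustrative list


def analyze(text):
    """Shared filter chain: lowercase, split on whitespace, drop stop words, strip a plural 's'."""
    terms = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms


def build_index(documents):
    """Map every analyzed term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


documents = {1: "The quick brown fox jumps over a lazy dog", 2: "Lazy dogs sleep"}
index = build_index(documents)
print(search(index, "the DOGS"))
```

Because the query "the DOGS" goes through the same `analyze` function as the documents, it is reduced to the term "dog" and matches both documents; with a different chain on the query side it would match neither.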

High level components: Indexing

High level components: Querying

DEMO: Stemming, Phonetics

High level components: Ranking

TF-IDF: Term Frequency-Inverse Document Frequency

Term frequency = (how often the search term occurs in the text) / (how many words are in the entire text)

High level components: Ranking – TF-IDF

3/12 = 0.25 versus 5/24 = 0.21

More relevant: the text with the higher term frequency (0.25)
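The two fractions can be reproduced with a few lines of code. The term-frequency part follows the definition given here; the inverse-document-frequency part uses log(N / df), which is one common variant and an assumption on my side, since engines differ in the exact formula:

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of words in the text."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, documents):
    """log(N / df): rarer terms get a higher weight (assumed variant)."""
    matching = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / matching) if matching else 0.0


def tf_idf(term, tokens, documents):
    return term_frequency(term, tokens) * inverse_document_frequency(term, documents)


doc_a = ["rabbit"] * 3 + ["filler"] * 9    # 3 of 12 words
doc_b = ["rabbit"] * 5 + ["filler"] * 19   # 5 of 24 words
print(term_frequency("rabbit", doc_a), round(term_frequency("rabbit", doc_b), 2))
```

The shorter text wins even though it contains the term fewer times, because the count is normalized by the length of the text.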

USER PATTERNS

User patterns

• Features should be adjusted to the users and usage patterns you're seeing

• What are users searching for on your site?

• How are they searching for it?

• Use web analytics to track search behavior and improve your search

Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791

User patterns – Quit

User patterns – Pogosticking

User patterns – Thrashing

User patterns – Narrow

User patterns – Others

• Pearl Growing

• Expand

Search: Features

Search Features

• Faceting

• Autocomplete

• More like this

• Highlighting

• Spellchecking: "did you mean"

• Geospatial: "bike repair" in the area of [long, lat] [long, lat]

• Boosting: when the title is more relevant than the content

• Elevation: always get a certain result at position n, e.g. put the current weather or current traffic at the 1st position, or insert ads
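Of the features listed, elevation is the easiest to show in isolation. The sketch below pins items to fixed positions in an already ranked result list; the `elevate` function and the item names are hypothetical, and real engines implement this inside the query pipeline rather than as a post-processing step:

```python
def elevate(results, pinned):
    """Return the results with each pinned item forced to its 1-based position."""
    pinned_items = set(pinned.values())
    # Drop pinned items from the organic list so they are not shown twice.
    merged = [item for item in results if item not in pinned_items]
    for position in sorted(pinned):
        merged.insert(position - 1, pinned[position])
    return merged


organic = ["article-1", "article-2", "article-3"]
print(elevate(organic, {1: "weather-widget", 3: "ad-42"}))
```

Inserting in ascending position order keeps earlier pins in place when later ones are added.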

Search Features – Faceting

From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to "drill down" or further restrict their search results based on those facets.
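In code, a facet is little more than a count per field value over the current result set, and drilling down is a filter on one of those values. A minimal in-memory sketch, with made-up documents and hypothetical helper names:

```python
from collections import Counter

documents = [
    {"title": "Banana bread", "tags": ["recipe", "fruit"]},
    {"title": "Banana split", "tags": ["recipe", "dessert"]},
    {"title": "Banana republic", "tags": ["politics"]},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of a multi-valued field."""
    return Counter(value for doc in docs for value in doc[field])


def drill_down(docs, field, value):
    """Restrict the result set to documents that have the chosen facet value."""
    return [doc for doc in docs if value in doc[field]]


print(facet_counts(documents, "tags"))
print([doc["title"] for doc in drill_down(documents, "tags", "recipe")])
```

A search engine computes the same counts from its index instead of scanning documents, which is what makes faceting cheap there and expensive in a plain SQL setup.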

Search Features – Autocomplete

Search Features – More like this

• Gives you the related items based on a document

• Compares the term vectors of various documents

• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term       Instances of term in document   Documents matching term   IDF value   Score
pre        18                              26                        4.609916    82.978
username   10                              23                        4.7276993   47.276
column     9                               13                        5.266696    47.400264
oracle     9                               8                         5.7085285   51.376
alter      7                               1                         7.212606    50.488
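The relation between the table and the boosted query can be sketched as follows. The numbers are those of the table, read with the decimal point placed so that score = instances × IDF. Two things are assumptions on my side: each boost is the term's score divided by the best score, and terms matching fewer than 5 documents are skipped, which would explain why "alter" is absent from the query. This is an illustration, not the actual Lucene implementation:

```python
# (term, instances in document, documents matching term, IDF value)
TERMS = [
    ("pre", 18, 26, 4.609916),
    ("username", 10, 23, 4.7276993),
    ("column", 9, 13, 5.266696),
    ("oracle", 9, 8, 5.7085285),
    ("alter", 7, 1, 7.212606),
]


def more_like_this_query(terms, field="body", min_doc_freq=5):
    """Build a boosted query from a document's terms, scored as instances * IDF."""
    scored = sorted(
        ((tf * idf, term) for term, tf, df, idf in terms if df >= min_doc_freq),
        reverse=True,
    )
    top_score = scored[0][0]
    clauses = []
    for score, term in scored:
        boost = score / top_score
        # The best term carries no explicit boost; the others are scaled relative to it.
        clauses.append(f"{field}:{term}" if boost == 1.0 else f"{field}:{term}^{boost:.3f}")
    return " ".join(clauses)


print(more_like_this_query(TERMS))
```

The generated query is then run like any other search, so the "related items" are simply the best matches for the source document's most characteristic terms.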

Search Features – Highlighting

• Highlighting the search terms

• Includes stemming and other logic

DEMO: SOLR

SOLUTIONS

Services: Common search options

• MySQL based

» Native Full-Text search

» Sphinx Search Plugin

• Lucene based (Java)

» Apache Lucene/Solr

» ElasticSearch

Services: Common search options

[Chart: the search options plotted by ease of use versus power]

MySQL based: Native Full-Text vs Sphinx

MySQL Full-Text search

• Only for MyISAM tables (InnoDB is also supported as of MySQL 5.6) and only on CHAR, VARCHAR and TEXT fields

• Only standard English stop words

• Limited query capabilities

• Slow on large collections (1 GB+)

• Building faceting is "hard" and "expensive"

• No stemming, no synonyms, no custom fields, no highlighting

Sphinx

• External plugin

• All storage engines

• Also on numeric field types

• ~3x faster on index and query

• Simple stemming and synonyms

• No custom fields, no highlighting

Querying is easy

• MySQL Full-Text query

SELECT * FROM articles WHERE MATCH (title, body) AGAINST ('database');

• Getting the score

SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;

• Sphinx query (the index is a separate table)

SELECT id, created_time, weight FROM my_sphinx_index WHERE created_time BETWEEN X AND Y AND MATCH('Android phone') ORDER BY weight DESC, created_time DESC;

Lucene based

ElasticSearch

• Simpler than Solr

• No need for a schema

• Easy to cluster

• Focus on scaling and realtime

• Go with the defaults

• Configuration = 3 lines

• Percolation

• Versions and TTLs

Solr

• Exposes all of the Lucene power

• Clustering possible but harder

• Focus on complete and customizable

• Defaults

• Configuration = 3000 lines

Solr vs ElasticSearch: Search Fresh Index While Idle

[Bar chart: search time in ms (axis 0 to 60), ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch: Search Fresh Index While Indexing 1 doc/3 sec

[Bar chart: search time in ms (axis 0 to 250), ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch: Search Full Index While Indexing 1 doc/3 sec

[Bar chart: search time in ms (axis 0 to 2500), ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch

[Chart: Solr and ElasticSearch compared for Idle, Indexing, and Full + Indexing. Lower is better.]

Querying with Solr and ElasticSearch

Solr

• Normal query

http://.../solr/?q=field:banana

• Faceting

http://.../solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

http://.../_search?q=field:value

• Advanced queries via POST

POST http://.../collection/_search

{ "query": { "query_string": { "query": "T*" } },

"facets": { "tags": { "terms": { "field": "tags" } } } }
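The Solr and ElasticSearch requests above can be assembled programmatically. The sketch below only builds the URLs and the JSON body and does not send anything; the host names, ports, the `/solr/select` path and the `collection` index name are assumptions for illustration, so substitute your own:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoints; replace with your own hosts, core and index names.
SOLR_URL = "http://localhost:8983/solr/select"
ES_URL = "http://localhost:9200/collection/_search"

solr_query = SOLR_URL + "?" + urlencode({"q": "field:banana"})
solr_facet_query = SOLR_URL + "?" + urlencode(
    {"q": "field:banana", "facet": "on", "facet.field": "tags"}
)
es_query = ES_URL + "?" + urlencode({"q": "field:value"})

# Request body for the advanced query; it would be sent with POST.
es_body = json.dumps(
    {
        "query": {"query_string": {"query": "T*"}},
        "facets": {"tags": {"terms": {"field": "tags"}}},
    }
)

print(solr_facet_query)
print(es_body)
```

`urlencode` percent-encodes the colon in `field:banana`, which both engines decode transparently. The `facets` element reflects the ElasticSearch API of the time; later ElasticSearch versions replaced facets with aggregations.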

ElasticSearch


SANOMA CONTENT LIBRARY

Sanoma Content Library

• Search

» in site

» in cluster

» in network

» Elevation (ads)

» Faceting

• Related

» More like this

» Relevant ads

» Products

• Reuse

» Sharing

» Variants

» (simple) DRM

» Images

• Analyse

» Sentiment

» Named Entities

» Tagging

» Classification

» Key phrases

Services: Content Library

[Architecture diagram of the Content Library. Labels shown: Analyse Pipeline, NER, Sentiment, Keyphrase extractor, Classifier, Crawler, Indexer, Search index, API, Edge, Redirects, Loader, Solr, Mongo, CMS, JCR. Consumers: Search (nu.nl, wtf), Related (Vrouwen, Kieskeurig), Relevant (Txel), Integration (Vrouwen, Wordpress, SAS).]

Common gotchas

• Use the right settings for your language: stop words and stemming

• Indexing too much or in too much detail

» Timestamps
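One reading of the timestamp gotcha is that indexing full-precision timestamps creates a nearly unique term per document. Under that assumption, a sketch of reducing the precision before indexing could look like this (the `truncate_to_day` helper and the sample dates are made up):

```python
from datetime import datetime


def truncate_to_day(timestamp):
    """Drop the time of day so all documents of one day share a single indexed value."""
    return timestamp.replace(hour=0, minute=0, second=0, microsecond=0)


published = [
    datetime(2015, 4, 24, 9, 15, 42),
    datetime(2015, 4, 24, 17, 3, 7),
    datetime(2015, 4, 25, 8, 0, 0),
]
distinct_before = len(set(published))
distinct_after = len({truncate_to_day(ts) for ts in published})
print(distinct_before, "distinct values before,", distinct_after, "after")
```

Whether day, hour or minute precision is right depends on the queries you need to answer; the point is to pick the coarsest precision that still supports them.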

END


bull Remove stop words from being indexed

bull No value since theyrsquore to common

Thequick quickbrown brownfox foxjumps jumpsoveralazy lazydog dog

Stop wordsaableaboutacrossafterallalmostalsoamamonganandanyareasatbebecausebeenbutbycancannotcoulddeardiddodoeseitherelseevereveryforfromgothadhaveh

High level componentsFiltering techniques Filtering Indexing Querying Ranking

84242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

High level componentsFiltering techniques Filtering Indexing Querying Ranking

94242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull De-duplicate various words

raquobicycle cycle bike

raquoi-pod ipot =gt iPod

High level componentsFiltering techniques Filtering Indexing Querying Ranking

104242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Determine the stem of a word

raquoDogs =gt dog

raquoRecharging =gt recharg

raquoRechargeable =gt recharg

bull Language specific

raquoPorter for English (-s -ed -ly -ing etc)

raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)

High level componentsFiltering techniques Filtering Indexing Querying Ranking

114242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Options for limiting the size of the index

raquoMinimum Term frequency

raquoMinimum Term Length

High level componentsFiltering techniques Filtering Indexing Querying Ranking

124242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Handling sounds like queries

raquo Robert =gt R163 lt= Rupert

raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith

bull Various methods available

raquo DoubleMetaphone

raquo Metaphone

raquo Soundex

raquo RefinedSoundex

raquo Caverphone

raquo BeiderMorse

bull Levenstein can be used during quering

High level componentsApply the filters on Filtering and querying

Filtering Indexing Querying Ranking

132442015 copy Sanoma Media

Same filters

Sto

p w

ord

s

ste

mm

ing

synonym

s

etc

Filters

High level componentsIndexing

Filtering Indexing Querying Ranking

142442015 copy Sanoma Media

High level componentsQuerying

Filtering Indexing Querying Ranking

152442015 copy Sanoma Media

DEMOStemming Phonetics

162442015 copy Sanoma Media

High level componentsRanking

Filtering Indexing Querying Ranking

172442015 copy Sanoma Media

TF-IDFTerm Frequency-Inverse Document Frequency

How often does the search term occur in the text

How many words are in the entire text

High level componentsRanking ndash TF-IDF

Filtering Indexing Querying Ranking

182442015 copy Sanoma Media

312 = 025 524 = 021

More relevant

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

bull Simpler Solr

bull No need for a schema

bull Easy to cluster

bull Focus on scaling and realtime

bull Go with the defaults

bull Configuration = 3 lines

bull Percolation

bull Versions and TTLs

Solr

bull Exposing all of the lucenepower

bull Clustering possible but harder

bull Focus on complete and customizable

bull Defaults

bull Configuration = 3000 lines

382442015 copy Sanoma Media

Solr vs ElasticSearchSearch Fresh Index While Idle

0

10

20

30

40

50

60

Search

tim

e i

n m

s

ElasticSearch

Solr

392442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec

0

50

100

150

200

250

Search

tim

e i

n m

s

ElasticSearch

Solr

402442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

412442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

422442015 copy Sanoma MediaLower is better

Idle Indexing Full + Indexing

Solr vs ElasticSearch

432442015 copy Sanoma Media

Lower is better

SOLR ElasticSearch

Querying with Solr and ElasticSearch

Solr

bull Normal query

httpsolrq=fieldbanana

bull Facetting

httpsolrq=fieldbananaampfacet=onampfacetfield=tags

ElasticSearch

bull Normal query

http_searchq=fieldvalue

bull Advanced queries via PUT

POST httpcollectionseach

query query_string query T

facets

tags terms field tags

442442015 copy Sanoma Media

ElasticSearch

452442015 copy Sanoma Media

SANOMA CONTENT LIBRARY

462442015 copy Sanoma Media

Sanoma Content Library

Search

in site

in cluster

in network

Elevation (ads)

Facetting

Related

More like this

Relevant ads

Products

Reuse

Sharing

Variants

(simple) Drm

Images

Analyse

Sentiment

Named Entities

Tagging

Classificatie

Key phrases

474242015 copy Sanoma Media

Services Content Library

482442015 copy Sanoma Media

Content Library

Analyse Pipeline

NER Sentiment

Crawler

Indexer

Searchindex

Search- nunl- wtf

Related- Vrouwen- Kieskeurig

Relevant- Txel

API

Edge

Redirects

Loader

Solr

Mongo

Integration- Vrouwen- Wordpress- SAS

CMS

JCR

Keyphraseextractor

Classifier

Common gotcharsquos

bull Use right settings for your language stopwords and stemming

bull Indexing too much or too detailed

raquoTimestamps

492442015 copy Sanoma Media

END

502442015 copy Sanoma Media

High level componentsFiltering techniques Filtering Indexing Querying Ranking

94242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull De-duplicate various words

raquobicycle cycle bike

raquoi-pod ipot =gt iPod

High level componentsFiltering techniques Filtering Indexing Querying Ranking

104242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Determine the stem of a word

raquoDogs =gt dog

raquoRecharging =gt recharg

raquoRechargeable =gt recharg

bull Language specific

raquoPorter for English (-s -ed -ly -ing etc)

raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)

High level componentsFiltering techniques Filtering Indexing Querying Ranking

114242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Options for limiting the size of the index

raquoMinimum Term frequency

raquoMinimum Term Length

High level componentsFiltering techniques Filtering Indexing Querying Ranking

124242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Handling sounds like queries

raquo Robert =gt R163 lt= Rupert

raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith

bull Various methods available

raquo DoubleMetaphone

raquo Metaphone

raquo Soundex

raquo RefinedSoundex

raquo Caverphone

raquo BeiderMorse

bull Levenstein can be used during quering

High level componentsApply the filters on Filtering and querying

Filtering Indexing Querying Ranking

132442015 copy Sanoma Media

Same filters

Sto

p w

ord

s

ste

mm

ing

synonym

s

etc

Filters

High level componentsIndexing

Filtering Indexing Querying Ranking

142442015 copy Sanoma Media

High level componentsQuerying

Filtering Indexing Querying Ranking

152442015 copy Sanoma Media

DEMOStemming Phonetics

162442015 copy Sanoma Media

High level componentsRanking

Filtering Indexing Querying Ranking

172442015 copy Sanoma Media

TF-IDFTerm Frequency-Inverse Document Frequency

How often does the search term occur in the text

How many words are in the entire text

High level componentsRanking ndash TF-IDF

Filtering Indexing Querying Ranking

182442015 copy Sanoma Media

312 = 025 524 = 021

More relevant

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

bull Simpler Solr

bull No need for a schema

bull Easy to cluster

bull Focus on scaling and realtime

bull Go with the defaults

bull Configuration = 3 lines

bull Percolation

bull Versions and TTLs

Solr

bull Exposing all of the lucenepower

bull Clustering possible but harder

bull Focus on complete and customizable

bull Defaults

bull Configuration = 3000 lines

382442015 copy Sanoma Media

Solr vs ElasticSearchSearch Fresh Index While Idle

0

10

20

30

40

50

60

Search

tim

e i

n m

s

ElasticSearch

Solr

392442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec

0

50

100

150

200

250

Search

tim

e i

n m

s

ElasticSearch

Solr

402442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

412442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

422442015 copy Sanoma MediaLower is better

Idle Indexing Full + Indexing

Solr vs ElasticSearch

432442015 copy Sanoma Media

Lower is better

SOLR ElasticSearch

Querying with Solr and ElasticSearch

Solr

bull Normal query

httpsolrq=fieldbanana

bull Facetting

httpsolrq=fieldbananaampfacet=onampfacetfield=tags

ElasticSearch

bull Normal query

http_searchq=fieldvalue

bull Advanced queries via PUT

POST httpcollectionseach

query query_string query T

facets

tags terms field tags

442442015 copy Sanoma Media

ElasticSearch

452442015 copy Sanoma Media

SANOMA CONTENT LIBRARY

462442015 copy Sanoma Media

Sanoma Content Library

Search

in site

in cluster

in network

Elevation (ads)

Facetting

Related

More like this

Relevant ads

Products

Reuse

Sharing

Variants

(simple) Drm

Images

Analyse

Sentiment

Named Entities

Tagging

Classificatie

Key phrases

474242015 copy Sanoma Media

Services Content Library

482442015 copy Sanoma Media

Content Library

Analyse Pipeline

NER Sentiment

Crawler

Indexer

Searchindex

Search- nunl- wtf

Related- Vrouwen- Kieskeurig

Relevant- Txel

API

Edge

Redirects

Loader

Solr

Mongo

Integration- Vrouwen- Wordpress- SAS

CMS

JCR

Keyphraseextractor

Classifier

Common gotcharsquos

bull Use right settings for your language stopwords and stemming

bull Indexing too much or too detailed

raquoTimestamps

492442015 copy Sanoma Media

END

502442015 copy Sanoma Media

High level componentsFiltering techniques Filtering Indexing Querying Ranking

104242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Determine the stem of a word

raquoDogs =gt dog

raquoRecharging =gt recharg

raquoRechargeable =gt recharg

bull Language specific

raquoPorter for English (-s -ed -ly -ing etc)

raquoSnowballPorter or Kraaij-Pohlmannfor Dutch (ge- -en etc)

High level componentsFiltering techniques Filtering Indexing Querying Ranking

114242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Options for limiting the size of the index

raquoMinimum Term frequency

raquoMinimum Term Length

High level componentsFiltering techniques Filtering Indexing Querying Ranking

124242015 copy Sanoma Media

bull Tokenizing

bull Stop Words

bull Synonyms

bull Stemming

bull Term occurrence

bull Phonetics

bull Handling sounds like queries

raquo Robert =gt R163 lt= Rupert

raquo Smith =gt (SM0XMT) cap (XMTSMT) lt= Schmith

bull Various methods available

raquo DoubleMetaphone

raquo Metaphone

raquo Soundex

raquo RefinedSoundex

raquo Caverphone

raquo BeiderMorse

bull Levenstein can be used during quering

High level componentsApply the filters on Filtering and querying

Filtering Indexing Querying Ranking

132442015 copy Sanoma Media

Same filters

Sto

p w

ord

s

ste

mm

ing

synonym

s

etc

Filters

High level componentsIndexing

Filtering Indexing Querying Ranking

142442015 copy Sanoma Media

High level componentsQuerying

Filtering Indexing Querying Ranking

152442015 copy Sanoma Media

DEMOStemming Phonetics

162442015 copy Sanoma Media

High level componentsRanking

Filtering Indexing Querying Ranking

172442015 copy Sanoma Media

TF-IDFTerm Frequency-Inverse Document Frequency

How often does the search term occur in the text

How many words are in the entire text

High level componentsRanking ndash TF-IDF

Filtering Indexing Querying Ranking

182442015 copy Sanoma Media

312 = 025 524 = 021

More relevant

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

Demo: Solr

SOLUTIONS

Services: Common search options

• MySQL based

» Native Full-Text search

» Sphinx Search Plugin

• Lucene based (Java)

» Apache Lucene/Solr

» ElasticSearch

Chart: the common search options compared on ease of use and power

MySQL based: Native Full-Text vs Sphinx

MySQL Full-Text search

• Only for MyISAM tables (InnoDB is also supported from MySQL 5.6 on) and only on CHAR, VARCHAR, and TEXT fields

• Only standard English stop words

• Limited query capabilities

• Slow on large collections (1 GB+)

• Building faceting is "hard" and "expensive"

• No stemming, no synonyms, no custom fields, no highlighting

Sphinx

• External plugin

• All storage engines

• Also on numeric field types

• ~3x faster on index and query

• Simple stemming and synonyms

• No custom fields, no highlighting

Querying is easy

• MySQL Full-Text query

SELECT * FROM articles
WHERE MATCH (title, body)
AGAINST ('database');

• Getting the score

SELECT id, MATCH (title, body) AGAINST ('Tutorial')
FROM articles;

• Sphinx query: the index is a separate table

SELECT id, created_time, weight
FROM my_sphinx_index
WHERE created_time BETWEEN X AND Y
AND MATCH ('Android phone')
ORDER BY weight DESC, created_time DESC;

Lucene based

ElasticSearch

• Simpler than Solr

• No need for a schema

• Easy to cluster

• Focus on scaling and realtime

• Go with the defaults

• Configuration = 3 lines

• Percolation

• Versions and TTLs

Solr

• Exposes all of the Lucene power

• Clustering possible, but harder

• Focus on complete and customizable

• Defaults

• Configuration = 3000 lines

Chart: Solr vs ElasticSearch, search time in ms on a fresh index while idle (lower is better)

Chart: Solr vs ElasticSearch, search time in ms on a fresh index while indexing at 1 doc/3 sec (lower is better)

Chart: Solr vs ElasticSearch, search time in ms on a full index while indexing at 1 doc/3 sec (lower is better)

Chart: Solr vs ElasticSearch, search time across the three scenarios: idle, indexing, and full + indexing (lower is better)

Querying with Solr and ElasticSearch

Solr

• Normal query

http://solr/?q=field:banana

• Faceting

http://solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

http://.../_search?q=field:value

• Advanced queries via POST

POST http://.../collection/_search

{
  "query": { "query_string": { "query": "T" } },
  "facets": {
    "tags": { "terms": { "field": "tags" } }
  }
}
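
The requests above can be assembled with the standard library, as in the sketch below. Nothing is sent over the network. The host names, ports, the `articles` collection, and the `/select` path are assumptions for the example; the `facets` body follows the slide, and newer ElasticSearch releases (2.0 and later) replaced it with aggregations.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoints, chosen only for this sketch.
SOLR_SELECT = "http://localhost:8983/solr/articles/select"
ES_SEARCH = "http://localhost:9200/articles/_search"


def solr_facet_url(field, value, facet_field):
    """Build a Solr query URL that also asks for facet counts on `facet_field`."""
    params = {"q": f"{field}:{value}", "facet": "on", "facet.field": facet_field}
    return f"{SOLR_SELECT}?{urlencode(params)}"


def es_facet_body(query_text, facet_field):
    """Build the JSON body for the POST request, mirroring the slide's structure."""
    body = {
        "query": {"query_string": {"query": query_text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }
    return json.dumps(body)


solr_url = solr_facet_url("field", "banana", "tags")
es_body = es_facet_body("banana", "tags")
```

Using `urlencode` takes care of escaping, so the colon in `field:banana` is sent as `%3A` rather than breaking the URL.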

ElasticSearch

SANOMA CONTENT LIBRARY

Sanoma Content Library

Search

in site

in cluster

in network

Elevation (ads)

Faceting

Related

More like this

Relevant ads

Products

Reuse

Sharing

Variants

(simple) DRM

Images

Analyse

Sentiment

Named Entities

Tagging

Classification

Key phrases

Services Content Library

Content Library

Analyse Pipeline

NER Sentiment

Crawler

Indexer

Search index

Search: nu.nl, wtf

Related: Vrouwen, Kieskeurig

Relevant: Txel

API

Edge

Redirects

Loader

Solr

Mongo

Integration: Vrouwen, Wordpress, SAS

CMS

JCR

Keyphrase extractor

Classifier

Common gotchas

• Use the right settings for your language: stop words and stemming

• Indexing too much or too detailed

» Timestamps
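
One way to read the timestamp gotcha is that full-precision timestamps create a separate index term for almost every document. The sketch below illustrates that reading by truncating timestamps before they are indexed. The `truncate` helper and the sample data are made up for the example, and the right granularity depends on how the field is actually queried.

```python
from datetime import datetime, timedelta


def truncate(ts, granularity="hour"):
    """Drop precision from a timestamp before indexing it (hypothetical helper)."""
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


# 1000 made-up publication times, 37 seconds apart.
start = datetime(2015, 4, 24, 12, 0, 0)
stamps = [start + timedelta(seconds=37 * i) for i in range(1000)]

unique_raw = len(set(stamps))                                 # one term per document
unique_hours = len({truncate(ts, "hour") for ts in stamps})   # a handful of terms
unique_days = len({truncate(ts, "day") for ts in stamps})
```

Fewer distinct values means a smaller term dictionary and cheaper range queries and facets on that field.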

END

High level components: Filtering techniques - Term occurrence

• Options for limiting the size of the index

» Minimum term frequency

» Minimum term length
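
The two options above can be sketched as a vocabulary filter applied before indexing. The thresholds and the whitespace tokenizer are assumptions for the example; "minimum term frequency" is interpreted here as the number of occurrences across the whole collection.

```python
from collections import Counter


def prune_vocabulary(docs, min_term_freq=2, min_term_len=3):
    """Keep only terms that are long enough and occur often enough to be worth indexing."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    return {
        term
        for term, count in counts.items()
        if count >= min_term_freq and len(term) >= min_term_len
    }


docs = ["the quick brown fox", "a quick red fox", "an ox"]
vocabulary = prune_vocabulary(docs)
```

Only "quick" and "fox" survive: every other term is either too rare or too short under these thresholds.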

High level components: Filtering techniques - Phonetics

• Handling "sounds like" queries

» Robert => R163 <= Rupert

» Smith => (SM0, XMT) ∩ (XMT, SMT) <= Schmith

• Various methods available

» DoubleMetaphone

» Metaphone

» Soundex

» RefinedSoundex

» Caverphone

» BeiderMorse

• Levenshtein can be used during querying
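
Two of the techniques named above are small enough to sketch directly. The first function is a plain implementation of classic American Soundex, which yields the R163 code shown for Robert and Rupert; the Smith/Schmith codes look like Double Metaphone output and are not reproduced here. The second is the standard dynamic-programming Levenshtein edit distance. Both are illustrative sketches, not the code of any particular search engine.

```python
# Soundex digit for each coded consonant; vowels, h, w and y have no code.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    """Classic American Soundex: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        # h and w do not separate letters with the same code; vowels do.
        if letter not in "hw":
            previous = code
    return (result + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]
        for j, char_b in enumerate(b, start=1):
            current_row.append(min(
                previous_row[j] + 1,                       # deletion
                current_row[j - 1] + 1,                    # insertion
                previous_row[j - 1] + (char_a != char_b),  # substitution
            ))
        previous_row = current_row
    return previous_row[-1]


same_sound = soundex("Robert") == soundex("Rupert")
distance = levenshtein("Smith", "Schmith")
```

Phonetic codes are computed once at index time, while an edit-distance check like this is typically applied to the query terms at search time.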

High level components: Apply the filters on indexing and querying

Use the same filters (stop words, stemming, synonyms, etc.) at index time and at query time
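
A small sketch of why the same filters matter on both sides: the index only contains filtered terms, so a query that skips the filters looks up terms that were never stored. The stop-word list, the toy stemmer, and the AND-only matching are assumptions made for the example.

```python
import re

STOP_WORDS = {"the", "a", "over"}  # tiny illustrative list


def stem(word):
    """Toy stemmer that strips a few English suffixes (illustration only)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """The shared filter chain: lowercase, tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


def build_index(docs):
    """Map each filtered term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, analyzer=analyze):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = analyzer(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = ["The quick brown fox jumps over a lazy dog", "Lazy dogs sleep"]
index = build_index(docs)

with_filters = search(index, "Jumping foxes")
without_filters = search(index, "Jumping foxes", analyzer=lambda q: q.lower().split())
```

With the shared chain the query finds the first document; with a raw, unfiltered query the lookup for "jumping" and "foxes" comes back empty.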

High level components: Indexing

High level components: Querying

Demo: Stemming, Phonetics

High level components: Ranking

TF-IDF: Term Frequency-Inverse Document Frequency

Term frequency = (how often the search term occurs in the text) / (how many words are in the entire text)

High level components: Ranking - TF-IDF

3/12 = 0.25 (more relevant)    5/24 = 0.21
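
The two fractions above can be reproduced in a few lines. The term-frequency part follows the slide directly; the inverse-document-frequency part uses log(N / df), which is one common variant and an assumption here, since the slide does not give a formula for it. The documents are synthetic token lists built to match the 3/12 and 5/24 example.

```python
import math


def term_frequency(term, tokens):
    """How often the term occurs, divided by the number of words in the text."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, docs):
    """log(N / df): rarer terms get a higher weight (one common variant)."""
    containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / containing) if containing else 0.0


# Synthetic documents: 3 hits in 12 words, 5 hits in 24 words, and one without the term.
doc_a = ["rabbit"] * 3 + ["word"] * 9
doc_b = ["rabbit"] * 5 + ["word"] * 19
doc_c = ["word"] * 10
docs = [doc_a, doc_b, doc_c]

tf_a = term_frequency("rabbit", doc_a)
tf_b = term_frequency("rabbit", doc_b)
idf = inverse_document_frequency("rabbit", docs)
score_a, score_b = tf_a * idf, tf_b * idf
```

Although the second text contains the term more often in absolute numbers, the shorter text has the higher term frequency and therefore ranks first.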

USER PATTERNS

User patterns

• Features should be adjusted to the users and the usage patterns you're seeing

• What are users searching for on your site?

• How are they searching for it?

• Use web analytics to track search behavior and improve your search

User patterns - Quit


Crawler

Indexer

Searchindex

Search- nunl- wtf

Related- Vrouwen- Kieskeurig

Relevant- Txel

API

Edge

Redirects

Loader

Solr

Mongo

Integration- Vrouwen- Wordpress- SAS

CMS

JCR

Keyphraseextractor

Classifier

Common gotcharsquos

bull Use right settings for your language stopwords and stemming

bull Indexing too much or too detailed

raquoTimestamps

492442015 copy Sanoma Media

END

502442015 copy Sanoma Media

High level components: Querying

Filtering Indexing Querying Ranking

DEMO: Stemming and Phonetics

High level components: Ranking

Filtering Indexing Querying Ranking

High level components: Ranking – TF-IDF

Filtering Indexing Querying Ranking

TF-IDF: Term Frequency-Inverse Document Frequency

Term frequency = (how often the search term occurs in the text) / (how many words are in the entire text)

Example: 3/12 = 0.25 versus 5/24 = 0.21. The text scoring 0.25 is the more relevant one.
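The ratio on this slide can be checked with a few lines of Python. This is a minimal sketch, not the formula of any particular engine: the function names are made up for illustration, the IDF variant (natural log of total documents over matching documents) is the textbook one, and the corpus sizes are hypothetical.

```python
import math


def term_frequency(term_count: int, total_words: int) -> float:
    """Share of the words in a text that are the search term."""
    return term_count / total_words


def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """Textbook IDF (assumed variant): rarer terms get a higher weight."""
    return math.log(total_docs / docs_with_term)


# The two texts from the slide: 3 hits in 12 words versus 5 hits in 24 words.
tf_a = term_frequency(3, 12)
tf_b = term_frequency(5, 24)
print(round(tf_a, 2), round(tf_b, 2))  # 0.25 0.21

# Hypothetical corpus: 1000 documents, 10 of which contain the term.
idf = inverse_document_frequency(1000, 10)
print(tf_a * idf > tf_b * idf)  # True: the shorter text still ranks higher
```

Because both texts are scored for the same term, the IDF factor is identical and only the term frequency decides the order.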

USER PATTERNS

User patterns

• Features should be adjusted to the users and the usage patterns you're seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search

Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791

User patterns – Quit

User patterns – Pogosticking

User patterns – Thrashing

User patterns – Narrow

User patterns – Others

• Pearl Growing
• Expand

Search Features

• Faceting
• Autocomplete
• More like this
• Highlighting
• Spellchecking: "did you mean"
• Geospatial: "bike repair" in area of [long,lat],[long,lat]
• Boosting: when the title is more relevant than the content
• Elevation: always get a certain result at position n, e.g. put the current weather or current traffic at 1st position, or insert ads
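The boosting and elevation bullets can be illustrated with a toy ranker. This is only a sketch under stated assumptions: the scoring (raw hit counts, title hits weighted by a hypothetical `title_boost`) and the `elevated` mapping of position to document id are invented for illustration and are not how Solr or ElasticSearch implement these features.

```python
def score(doc, term, title_boost=2.0):
    """Count term hits, weighting title hits more than body hits (boosting)."""
    title_hits = doc["title"].lower().split().count(term)
    body_hits = doc["body"].lower().split().count(term)
    return title_boost * title_hits + body_hits


def search(docs, term, elevated=None):
    """Rank matching docs by score, then pin elevated doc ids at fixed positions."""
    elevated = elevated or {}
    pinned = set(elevated.values())
    ranked = sorted(
        (d for d in docs if score(d, term) > 0 and d["id"] not in pinned),
        key=lambda d: score(d, term),
        reverse=True,
    )
    by_id = {d["id"]: d for d in docs}
    # Elevation: insert the pinned documents at their requested positions.
    for position, doc_id in sorted(elevated.items()):
        ranked.insert(position, by_id[doc_id])
    return [d["id"] for d in ranked]


docs = [
    {"id": 1, "title": "Bike repair shops", "body": "where to fix a bike"},
    {"id": 2, "title": "Cycling news", "body": "bike races and bike tours"},
    {"id": 3, "title": "Weather today", "body": "rain expected"},
]

print(search(docs, "bike"))                   # [1, 2]
print(search(docs, "bike", elevated={0: 3}))  # [3, 1, 2]
```

Document 1 beats document 2 despite having fewer hits in total, because its title hit counts double. The elevated document is shown first even though it does not match the query at all.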

Search Features – Faceting

From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
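The definition above boils down to two operations: counting values per category, and filtering on a chosen value. A minimal in-memory sketch follows; the sample documents and the `tags` field are hypothetical, and a real engine computes these counts from its index rather than by looping over results.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of a multi-valued field."""
    counts = Counter()
    for doc in results:
        counts.update(doc.get(field, []))
    return counts


def drill_down(results, field, value):
    """Restrict the results to those that carry the chosen facet value."""
    return [doc for doc in results if value in doc.get(field, [])]


results = [
    {"title": "Canon PowerShot SD500", "tags": ["camera", "canon"]},
    {"title": "iPod", "tags": ["audio", "apple"]},
    {"title": "Canon printer", "tags": ["printer", "canon"]},
]

print(facet_counts(results, "tags")["canon"])  # 2
narrowed = drill_down(results, "tags", "canon")
print([doc["title"] for doc in narrowed])
# ['Canon PowerShot SD500', 'Canon printer']
```

After drilling down, the facet counts would be recomputed on `narrowed`, which is what lets a user keep restricting the result set step by step.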

Search Features – Autocomplete

Search Features – More like this

• Gives you the related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term      Instances in document  Documents matching term  IDF value  Score
pre       18                     26                       4.609916   82.978
username  10                     23                       4.7276993  47.276
column    9                      13                       5.266696   47.400264
oracle    9                      8                        5.7085285  51.376
alter     7                      1                        7.212606   50.488
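The numbers in this table can be reproduced with simple arithmetic. The sketch below assumes two things inferred from the figures rather than stated on the slide: that the score is the term count multiplied by the IDF value, and that each boost is the term's score divided by the top score. Both assumptions match the slide's values to about four decimals.

```python
# (term, instances in document, IDF value) as listed in the table.
rows = [
    ("pre", 18, 4.609916),
    ("username", 10, 4.7276993),
    ("column", 9, 5.266696),
    ("oracle", 9, 5.7085285),
    ("alter", 7, 7.212606),
]

# Assumed scoring: term count times IDF.
scores = {term: count * idf for term, count, idf in rows}

# Assumed boosting: each score relative to the highest score.
top_score = max(scores.values())
boosts = {term: s / top_score for term, s in scores.items()}


def build_query(field, terms):
    """Build a boosted query string; the top term carries no explicit boost."""
    parts = []
    for term in terms:
        boost = boosts[term]
        parts.append(f"{field}:{term}" if boost == 1.0 else f"{field}:{term}^{boost:.5f}")
    return " ".join(parts)


query = build_query("body", ["pre", "username", "column", "oracle"])
print(query)
```

The printed boosts can differ from the slide in the last digit, since the slide's values were presumably computed with single-precision floats.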

Search Features – Highlighting

• Highlighting the search terms
• Includes stemming and other logic
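Why highlighting has to include stemming is easiest to see in code: the query term and the word in the text rarely match character for character. The sketch below uses a deliberately naive suffix-stripping stemmer, which is an assumption for illustration and far cruder than the stemmers shipped with Lucene. The `<em>` markers are an arbitrary choice.

```python
import re


def naive_stem(word):
    """Toy stemmer: strip one common English suffix (not a real Porter stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches the stem of a query term."""
    stems = {naive_stem(term) for term in query.split()}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if naive_stem(word) in stems else word

    return re.sub(r"\w+", mark, text)


print(highlight("The fox jumps and jumped over lazy dogs", "jumping dog"))
# The fox <em>jumps</em> and <em>jumped</em> over lazy <em>dogs</em>
```

None of the highlighted words appears literally in the query. They match only because both sides are reduced to the same stem, the same analysis an engine applies at index and query time.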

DEMO: Solr

SOLUTIONS


Services: Common search options

• MySQL based
  » Native Full-Text search
  » Sphinx Search Plugin
• Lucene based (Java)
  » Apache Lucene/Solr
  » ElasticSearch

Services: Common search options

[Diagram: the search options plotted by ease of use against power]

MySQL based: Native Full-Text vs Sphinx

MySQL Full-Text search

• Only for MyISAM tables (InnoDB gained full-text indexes in MySQL 5.6), and only on CHAR, VARCHAR and TEXT fields
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1 GB+)
• Building faceting is "hard" and "expensive"
• No stemming, no synonyms, no custom fields, no highlighting

Sphinx

• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting

Querying is easy

• MySQL Full-Text query

    SELECT * FROM articles
    WHERE MATCH (title, body) AGAINST ('database');

• Getting the score

    SELECT id, MATCH (title, body) AGAINST ('Tutorial')
    FROM articles;

• Sphinx query: the index is a separate table (X and Y are placeholders for the time range)

    SELECT id, created_time, weight
    FROM my_sphinx_index
    WHERE created_time BETWEEN X AND Y
      AND MATCH ('Android phone')
    ORDER BY weight DESC, created_time DESC;

Lucene based

ElasticSearch

• A simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and realtime
• Go with the defaults
• Configuration = 3 lines
• Percolation
• Versions and TTLs

Solr

• Exposes all of Lucene's power
• Clustering possible but harder
• Focus on being complete and customizable
• Defaults
• Configuration = 3000 lines

Solr vs ElasticSearch: Search Fresh Index While Idle

[Chart: search time in ms, axis 0 to 60, ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch: Search Fresh Index While Indexing (1 doc/3 sec)

[Chart: search time in ms, axis 0 to 250, ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch: Search Full Index While Indexing (1 doc/3 sec)

[Chart: search time in ms, axis 0 to 2500, ElasticSearch vs Solr. Lower is better.]

Solr vs ElasticSearch

[Chart: Solr vs ElasticSearch for the three scenarios Idle, Indexing, and Full + Indexing. Lower is better.]

Querying with Solr and ElasticSearch

Solr

• Normal query

    http://solr/?q=field:banana

• Faceting

    http://solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

    http://.../_search?q=field:value

• Advanced queries via a JSON request body

    POST http://collection/_search
    {
      "query": { "query_string": { "query": "T" } },
      "facets": { "tags": { "terms": { "field": "tags" } } }
    }
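The two query styles above can be assembled programmatically. This is a sketch that only builds the request URL and body without sending anything; the host names, the `articles` core and index, and the helper names are hypothetical. The `facets` block follows the slide; later ElasticSearch versions replaced facets with aggregations, so treat the body as historical.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoints, for illustration only.
SOLR_URL = "http://localhost:8983/solr/articles/select"
ES_URL = "http://localhost:9200/articles/_search"


def solr_facet_url(field, value, facet_field):
    """Solr style: everything goes into URL parameters."""
    params = {"q": f"{field}:{value}", "facet": "on", "facet.field": facet_field}
    return f"{SOLR_URL}?{urlencode(params)}"


def es_facet_body(query, facet_field):
    """ElasticSearch style: the query and the facet request travel as JSON."""
    body = {
        "query": {"query_string": {"query": query}},
        "facets": {"tags": {"terms": {"field": facet_field}}},
    }
    return json.dumps(body)


url = solr_facet_url("field", "banana", "tags")
body = es_facet_body("T", "tags")
print(url)
print("POST", ES_URL, body)
```

Note that `urlencode` percent-encodes the colon in `field:banana`, which is what a browser or HTTP client would do with the URL shown on the slide.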

ElasticSearch


SANOMA CONTENT LIBRARY


Sanoma Content Library

• Search: in site, in cluster, in network, elevation (ads), faceting
• Related: more like this, relevant ads, products
• Reuse: sharing, variants, (simple) DRM, images
• Analyse: sentiment, named entities, tagging, classification, key phrases

Services: Content Library

[Architecture diagram of the Content Library. Component labels: Crawler, Loader, Indexer, Searchindex, Solr, Mongo, API, Edge, Redirects, CMS, JCR. Analyse pipeline: NER, Sentiment, Keyphrase extractor, Classifier. Search: nu.nl, wtf. Related: Vrouwen, Kieskeurig. Relevant: Txel. Integration: Vrouwen, Wordpress, SAS.]

Common gotchas

• Use the right settings for your language: stop words and stemming
• Indexing too much or in too much detail
  » Timestamps
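One possible reading of the timestamps gotcha is that indexing timestamps at full precision creates a nearly unique value per document, which is more detail than search usually needs. The sketch below shows the idea of reducing precision before indexing. The `coarsen` helper and the choice of day or hour granularity are assumptions for illustration, not something the slides prescribe.

```python
from datetime import datetime


def coarsen(ts: datetime, precision: str = "day") -> datetime:
    """Drop timestamp detail below the given precision before indexing."""
    if precision == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if precision == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported precision: {precision}")


published = [
    datetime(2015, 4, 24, 9, 15, 42),
    datetime(2015, 4, 24, 9, 47, 3),
    datetime(2015, 4, 24, 17, 30, 0),
]

# Three distinct values at full precision collapse to two per hour, one per day.
print(len(set(published)))                              # 3
print(len({coarsen(ts, "hour") for ts in published}))   # 2
print(len({coarsen(ts, "day") for ts in published}))    # 1
```

Which granularity is appropriate depends on how users filter and sort; range filters on publication date rarely need more than a day or an hour.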

END

USER PATTERNS

192442015 copy Sanoma Media

User patterns

bull Features should be adjusted to the user and usage patterns your seeing

bull What are users searching for on your site

bull How are they searching for it

bull Use web analytics to track and improve your search behavior

202442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User pattern - Quit

212442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Pogosticking

222442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Thrashing

232442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

• Gives you the related items based on a document

• Compares the term vectors of various documents

• Creates a query with boosting: body:pre body:username^0.56974 body:column^0.57123 body:oracle^0.61915

Term       Instances of term in document   Documents matching term   IDF value   Score
pre        18                              26                        4.609916    82.978
username   10                              23                        4.7276993   47.276
column     9                               13                        5.266696    47.400264
oracle     9                               8                         5.7085285   51.376
alter      7                               1                         7.212606    50.488
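The numbers in this table can be checked with a short calculation. The sketch assumes that the extraction dropped the decimal points (so the IDF of "pre" is 4.609916, not 4609916) and that a term's score is its count in the document times its IDF. The boost rule, score divided by the best score, is inferred from the figures and is not stated on the slide.

```python
# (term, occurrences in the document, IDF) with decimal points restored,
# on the assumption that the extraction dropped them.
TERMS = [
    ("pre", 18, 4.609916),
    ("username", 10, 4.7276993),
    ("column", 9, 5.266696),
    ("oracle", 9, 5.7085285),
]

# Score of a term = term frequency in the document x IDF.
scores = {term: tf * idf for term, tf, idf in TERMS}

# Boost = score relative to the best-scoring term (an inferred rule).
best = max(scores.values())
boosts = {term: s / best for term, s in scores.items()}

# Build a boosted query string; the top term carries no explicit boost.
query = " ".join(
    f"body:{term}" if boost == 1.0 else f"body:{term}^{boost:.4f}"
    for term, boost in boosts.items()
)
print(query)
```

Under these assumptions the computed boosts agree with the query on the slide to about four decimal places.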

Search Features - Highlighting

• Highlights the search terms in the results

• Takes stemming and other analysis logic into account
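A minimal sketch of stem-aware highlighting, assuming a deliberately crude suffix stripper in place of a real stemmer. The `stem` and `highlight` functions and the `<em>` markers are illustrative choices, not the API of any search engine.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer


def stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word whose stem matches a query stem in pre/post markers."""
    wanted = {stem(w) for w in re.findall(r"\w+", query)}

    def mark(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if stem(word) in wanted else word

    return re.sub(r"\w+", mark, text)


print(highlight("The quick brown fox jumps over a lazy dog", "jumping foxes"))
```

Because both the query and the text are reduced to stems before comparison, a query for "jumping foxes" marks "fox" and "jumps" even though neither word appears literally in the query.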

DEMO SOLR

SOLUTIONS

Services: Common search options

• MySQL based

» Native Full-Text search

» Sphinx Search Plugin

• Lucene based (Java)

» Apache Lucene/Solr

» ElasticSearch

[Chart: the common search options plotted by ease of use against power]

MySQL Based: Native Full-Text vs Sphinx

MySQL Full-Text search

• Only for MyISAM tables (InnoDB support arrived in MySQL 5.6) and only on CHAR, VARCHAR and TEXT fields

• Only standard English stop words

• Limited query capabilities

• Slow on large collections (1 GB+)

• Building faceting is "hard" and "expensive"

• No stemming, no synonyms, no custom fields, no highlighting

Sphinx

• External plugin

• All storage engines

• Also on numeric field types

• ~3x faster on index and query

• Simple stemming and synonyms

• No custom fields, no highlighting

Querying is easy

• MySQL Full-Text query

SELECT * FROM articles WHERE MATCH (title, body) AGAINST ('database');

• Getting the score

SELECT id, MATCH (title, body) AGAINST ('Tutorial') FROM articles;

• Sphinx query: the index is a separate table

SELECT id, created_time, weight FROM my_sphinx_index WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone')

ORDER BY weight DESC, created_time DESC;

Lucene based

ElasticSearch

• A simpler Solr

• No need for a schema

• Easy to cluster

• Focus on scaling and real time

• Go with the defaults

• Configuration = 3 lines

• Percolation

• Versions and TTLs

Solr

• Exposes all of Lucene's power

• Clustering possible but harder

• Focus on being complete and customizable

• Defaults

• Configuration = 3000 lines

Solr vs ElasticSearch: search time in ms, fresh index, while idle (chart, y-axis 0 to 60; lower is better)

Solr vs ElasticSearch: search time in ms, fresh index, while indexing 1 doc/3 sec (chart, y-axis 0 to 250; lower is better)

Solr vs ElasticSearch: search time in ms, full index, while indexing 1 doc/3 sec (chart, y-axis 0 to 2500; lower is better)

Solr vs ElasticSearch: summary of the idle, indexing, and full + indexing runs (chart; lower is better)

Querying with Solr and ElasticSearch

Solr

• Normal query

http://solr/?q=field:banana

• Faceting

http://solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

http://.../_search?q=field:value

• Advanced queries via POST

POST http://.../collection/_search

{ "query": { "query_string": { "query": "T" } },

"facets": { "tags": { "terms": { "field": "tags" } } } }
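The two request styles above can be put together programmatically. This is a sketch under stated assumptions: the host names, ports and the `articles` collection are hypothetical placeholders, and nothing is actually sent over the network. Note that the `facets` key used on the slide belongs to older ElasticSearch releases; it was removed in ElasticSearch 2.0 in favour of aggregations.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoints; substitute the hosts and paths of your own setup.
SOLR_URL = "http://localhost:8983/solr/articles/select"
ES_URL = "http://localhost:9200/articles/_search"


def solr_facet_url(field, value, facet_field):
    """Build a Solr query URL with faceting switched on for one field."""
    params = {"q": f"{field}:{value}", "facet": "on", "facet.field": facet_field}
    return f"{SOLR_URL}?{urlencode(params)}"


def es_query_body(text, facet_field):
    """Build the JSON body for an ElasticSearch query_string search.

    Uses the old "facets" key from the slide; newer releases use "aggs".
    """
    return json.dumps({
        "query": {"query_string": {"query": text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    })


solr_url = solr_facet_url("field", "banana", "tags")
es_body = es_query_body("banana", "tags")
print(solr_url)
print("POST", ES_URL)
print(es_body)
```

Solr keeps everything in URL parameters, whereas ElasticSearch moves anything beyond a simple query into a JSON request body.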

ElasticSearch

SANOMA CONTENT LIBRARY

Sanoma Content Library

Search: in site, in cluster, in network, elevation (ads), faceting

Related: more like this, relevant ads, products

Reuse: sharing, variants, (simple) DRM, images

Analyse: sentiment, named entities, tagging, classification, key phrases

Services Content Library

[Architecture diagram: Content Library]

Components: Crawler, Loader, Indexer, Searchindex, Solr, Mongo, API, Edge, Redirects, CMS, JCR

Analyse pipeline: NER, Sentiment, Keyphrase extractor, Classifier

Search: nu.nl, wtf

Related: Vrouwen, Kieskeurig

Relevant: Txel

Integration: Vrouwen, Wordpress, SAS

Common gotchas

• Use the right settings for your language: stop words and stemming

• Indexing too much or in too much detail

» Timestamps
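One reading of the timestamp gotcha is that second-precision values give almost every document its own unique term in the index. Assuming that interpretation, the sketch below rounds timestamps to the day before indexing; the `round_to_day` helper and the sample dates are hypothetical.

```python
from datetime import datetime, timezone


def round_to_day(ts):
    """Drop the time of day so documents from one day share one indexed value."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


# Hypothetical publication times of three documents.
published = [
    datetime(2015, 4, 24, 9, 15, 3, tzinfo=timezone.utc),
    datetime(2015, 4, 24, 17, 42, 58, tzinfo=timezone.utc),
    datetime(2015, 4, 25, 8, 0, 1, tzinfo=timezone.utc),
]

# Distinct values the index would have to store in each case.
exact_terms = {ts.isoformat() for ts in published}
daily_terms = {round_to_day(ts).isoformat() for ts in published}
print(len(exact_terms), len(daily_terms))
```

Whether day precision is enough depends on the queries you need to support; the point is to pick the coarsest granularity that still answers them.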

END

User patterns - Narrow

242442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

User patterns ndash Others

bull Pearl Growing

bull Expand

252442015 copy Sanoma Media

Image credits httpwwwflickrcomphotosmorvillecollections72157604060564791

SearchFeatures

Search Features

bull Faceting

bull Autocomplete

bull More like this

bull Highlighting

bull Spellcheckingdid you mean

bull Geospatialldquobike repairrdquo in area of [longlat][longlat]

bull Boostingwhen title is more relevant then content

bull Elevationalways get a certain result at position nget the current weather current traffic at 1st

position or ingest ads

272442015 copy Sanoma Media

Search Features - Faceting

282442015 copy Sanoma Media

From the user perspective faceted search (also called faceted navigation guided navigation or parametric search) breaks up search results into multiple categories typically showing counts for each and allows the user to drill down or further restrict their search results based on those facets

Search Features - Autocomplete

292442015 copy Sanoma Media

Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

• Simpler Solr

• No need for a schema

• Easy to cluster

• Focus on scaling and real-time

• Go with the defaults

• Configuration = 3 lines

• Percolation

• Versions and TTLs

Solr

• Exposing all of the Lucene power

• Clustering possible but harder

• Focus on complete and customizable

• Defaults

• Configuration = 3000 lines

[Chart: Solr vs ElasticSearch, search time in ms on a fresh index while idle (axis 0 to 60 ms); lower is better]

[Chart: Solr vs ElasticSearch, search time in ms on a fresh index while indexing 1 doc/3 sec (axis 0 to 250 ms); lower is better]

[Chart: Solr vs ElasticSearch, search time in ms on a full index while indexing 1 doc/3 sec (axis 0 to 2500 ms); lower is better]

[Chart: Solr vs ElasticSearch, summary of the three scenarios (idle, indexing, full + indexing); lower is better]

Querying with Solr and ElasticSearch

Solr

• Normal query

    http://solr/?q=field:banana

• Faceting

    http://solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

    http://.../_search?q=field:value

• Advanced queries via POST

    POST http://collection/_search

    {
      "query": { "query_string": { "query": "T" } },
      "facets": { "tags": { "terms": { "field": "tags" } } }
    }
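The two query styles can be put side by side in Python. This is a sketch only: the base URL, the function names, and the example values are assumptions for illustration, and nothing is sent over the network. Note that the "facets" request section belongs to the ElasticSearch 1.x era of these slides; later versions replaced it with aggregations.

```python
import json
from urllib.parse import urlencode


def solr_query_url(base_url, field, value, facet_field=None):
    """Build a Solr-style query URL; everything lives in the query string."""
    params = [("q", f"{field}:{value}")]
    if facet_field:
        params += [("facet", "on"), ("facet.field", facet_field)]
    return f"{base_url}?{urlencode(params)}"


def es_search_body(query_text, facet_field):
    """Build an ElasticSearch 1.x-style JSON body with a terms facet."""
    return {
        "query": {"query_string": {"query": query_text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }


if __name__ == "__main__":
    # "http://solr/" is a placeholder host, as on the slide.
    print(solr_query_url("http://solr/", "field", "banana", facet_field="tags"))
    print(json.dumps(es_search_body("banana", "tags"), indent=2))
```

The contrast is the point: Solr expresses the whole request as URL parameters, while ElasticSearch moves anything beyond a simple `q=` into a structured JSON body.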

ElasticSearch

SANOMA CONTENT LIBRARY

Sanoma Content Library

• Search: in site, in cluster, in network, elevation (ads), faceting

• Related: more like this, relevant ads, products

• Reuse: sharing, variants, (simple) DRM, images

• Analyse: sentiment, named entities, tagging, classification, key phrases

Services: Content Library

[Architecture diagram. Components shown: Content Library; Analyse Pipeline (NER, Sentiment, Keyphrase extractor, Classifier); Crawler; Indexer; Search index; API; Edge; Redirects; Loader; Solr; Mongo; CMS; JCR. Consumers shown: Search (nu.nl, wtf), Related (Vrouwen, Kieskeurig), Relevant (Txel), Integration (Vrouwen, Wordpress, SAS).]

Common gotchas

• Use the right settings for your language: stop words and stemming

• Indexing too much or in too much detail

» Timestamps
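One reading of the timestamps point is that indexing every timestamp at full precision creates a distinct value for almost every document, which bloats the index without helping search. Under that assumption, a small Python sketch of truncating timestamps before they are indexed could look like this; the function name and the choice of units are hypothetical.

```python
from datetime import datetime


def index_granularity(ts, unit="day"):
    """Truncate a timestamp so fewer distinct values end up in the index."""
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


if __name__ == "__main__":
    stamps = [
        datetime(2015, 4, 24, 9, 15, 3),
        datetime(2015, 4, 24, 17, 42, 59),
        datetime(2015, 4, 25, 8, 0, 1),
    ]
    distinct = {index_granularity(ts) for ts in stamps}
    print(len(stamps), "timestamps collapse to", len(distinct), "indexed values")
```

Whether day or hour precision is enough depends on how users actually filter and sort, so the unit is best decided per field.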

END

User patterns - Others

• Pearl Growing

• Expand

Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791/

Search Features

• Faceting

• Autocomplete

• More like this

• Highlighting

• Spellchecking: "did you mean"

• Geospatial: "bike repair" in the area of [long,lat],[long,lat]

• Boosting: when the title is more relevant than the content

• Elevation: always get a certain result at position n, for example the current weather or current traffic at 1st position, or inject ads
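Boosting and elevation are easy to confuse, so a toy Python sketch may help. It is an illustration under simple assumptions (whitespace tokenizing, a made-up title boost of 2.0, hypothetical documents and function names), not how Solr or ElasticSearch rank results.

```python
def score(doc, term, title_boost=2.0):
    """Count term hits, weighting title matches more than body matches."""
    needle = term.lower()
    title_hits = doc["title"].lower().split().count(needle)
    body_hits = doc["body"].lower().split().count(needle)
    return title_boost * title_hits + body_hits


def search(docs, term, elevated=None):
    """Rank by boosted score, then pin elevated document ids to the top."""
    elevated = elevated or []
    hits = [doc for doc in docs if score(doc, term) > 0]
    hits.sort(key=lambda doc: score(doc, term), reverse=True)
    pinned = [doc for doc_id in elevated for doc in docs if doc["id"] == doc_id]
    rest = [doc for doc in hits if doc["id"] not in elevated]
    return pinned + rest


DOCS = [  # hypothetical example data
    {"id": 1, "title": "Bike repair guide", "body": "how to fix a flat tire"},
    {"id": 2, "title": "City news", "body": "new bike lanes"},
    {"id": 3, "title": "Weather today", "body": "sunny"},
]

if __name__ == "__main__":
    print([doc["id"] for doc in search(DOCS, "bike")])
    print([doc["id"] for doc in search(DOCS, "bike", elevated=[3])])
```

Boosting changes how matching documents are scored, while elevation places a chosen document first regardless of whether it matches at all.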

Search Features - Faceting

From the user's perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to drill down or further restrict their search results based on those facets.
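The two halves of that definition, counting per category and drilling down, can be sketched in a few lines of Python. The documents, field names, and function names below are hypothetical; a real search engine computes these counts inside the index rather than over a list in memory.

```python
from collections import Counter

DOCS = [  # hypothetical search results
    {"title": "Canon PowerShot SD500", "brand": "Canon", "type": "camera"},
    {"title": "Canon EOS 5D", "brand": "Canon", "type": "camera"},
    {"title": "iPod nano", "brand": "Apple", "type": "player"},
]


def facet_counts(docs, field):
    """Count how many results fall into each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


def drill_down(docs, field, value):
    """Restrict the results to those matching the selected facet value."""
    return [doc for doc in docs if doc.get(field) == value]


if __name__ == "__main__":
    print(facet_counts(DOCS, "brand"))
    canon_only = drill_down(DOCS, "brand", "Canon")
    print(facet_counts(canon_only, "type"))
```

After a drill-down the facet counts are recomputed over the narrower result set, which is what lets the user keep refining.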

Search Features - Autocomplete

Search Features - More like this

• Gives you the related items based on a document

• Compares the term vectors of various documents






Search Features - More like this

302442015 copy Sanoma Media

bull Give you the related items based on a document

bull Compares the Term Vectors of various documents

bull Creates a query with boosting bodypre bodyusername^56974 bodycolumn^57123 bodyoracle^61915

TermNumber of Instances of Term in Document

Number of DocumentsMatching Term

IDF value Score

pre 18 26 4609916 82978

username 10 23 47276993 47276

column 9 13 5266696 47400264

oracle 9 8 57085285 51376

alter 7 1 7212606 50488

Search Features - Highlighting

312442015 copy Sanoma Media

bull Highlighting the search terms

bull Includes stemming and other logic

DEMO SOLR

322442015 copy Sanoma Media

SOLUTIONS

332442015 copy Sanoma Media

ServicesCommon search options

bull MySQL based

raquoNative Full-Text search

raquoSphinx Search Plugin

bull Lucene based (Java)

raquoApache LuceneSolr

raquoElasticSearch

342442015 copy Sanoma Media

ServicesCommon search options

352442015 copy Sanoma Media

Ease of use

Power

MySQL BasedNative Full-Text vs Sphinx

MySQL Full-Text search

bull Only for MyISAM tables and only on CHAR VARCHAR and TEXT fields

bull Only standard English stop words

bull Limited query capabilities

bull Slow on large collections (1GB+)

bull Building facetting is ldquohardrdquo and ldquoexpensiverdquo

bull No stemming no synonyms no custom flieds no highlighting

Sphinx

bull External plugin

bull All storage engines

bull Also on numeric field types

bull ~3x faster on index and query

bull Simple stemming and synonyms

bull No custom fields no highlighting

362442015 copy Sanoma Media

Querying is easy

bull MySQL Full-Text query

SELECT FROM articlesWHERE MATCH (titlebody)AGAINST (database)

bull Getting the score

SELECT id MATCH (titlebody) AGAINST (Tutorial)FROM articles

bull Sphinx query index is separate table

SELECT id created_time weightFROM my_sphinx_indexWHERE created_time BETWEEN (X AND Y)AND MATCH (Android phonersquo)

ORDER by weight DESCcreated_time DESC

372442015 copy Sanoma Media

Lucene based

ElasticSearch

bull Simpler Solr

bull No need for a schema

bull Easy to cluster

bull Focus on scaling and realtime

bull Go with the defaults

bull Configuration = 3 lines

bull Percolation

bull Versions and TTLs

Solr

bull Exposing all of the lucenepower

bull Clustering possible but harder

bull Focus on complete and customizable

bull Defaults

bull Configuration = 3000 lines

382442015 copy Sanoma Media

Solr vs ElasticSearchSearch Fresh Index While Idle

0

10

20

30

40

50

60

Search

tim

e i

n m

s

ElasticSearch

Solr

392442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Fresh Index While Indexing 1doc3sec

0

50

100

150

200

250

Search

tim

e i

n m

s

ElasticSearch

Solr

402442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

412442015 copy Sanoma Media

Lower is better

Solr vs ElasticSearchSearch Full Index While Indexing 1doc3sec

0

500

1000

1500

2000

2500

Search

tim

e i

n m

s

ElasticSearch

Solr

422442015 copy Sanoma MediaLower is better

Idle Indexing Full + Indexing

Solr vs ElasticSearch

432442015 copy Sanoma Media

Lower is better

SOLR ElasticSearch

Querying with Solr and ElasticSearch

Solr

• Normal query

http://solr/?q=field:banana

• Faceting

http://solr/?q=field:banana&facet=on&facet.field=tags

ElasticSearch

• Normal query

http://.../_search?q=field:value

• Advanced queries via POST

POST http://.../collection/_search

{
  "query": { "query_string": { "query": "T*" } },
  "facets": {
    "tags": { "terms": { "field": "tags" } }
  }
}
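The two request styles above can be put side by side in a short Python sketch: Solr takes everything as URL parameters, while ElasticSearch takes a JSON body for anything beyond a simple `q=`. The host names, the `articles` core and index, and both helper functions are assumptions made for this example. The `facets` key follows the slide; later ElasticSearch releases use aggregations instead.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URLs; replace them with your own hosts and index names.
SOLR_SELECT_URL = "http://localhost:8983/solr/articles/select"
ES_SEARCH_URL = "http://localhost:9200/articles/_search"


def solr_facet_url(field, value, facet_field):
    """Build a Solr query URL that also asks for facet counts on one field."""
    params = [
        ("q", f"{field}:{value}"),
        ("facet", "on"),
        ("facet.field", facet_field),
    ]
    # urlencode percent-encodes the colon in the query value.
    return SOLR_SELECT_URL + "?" + urlencode(params)


def es_facet_body(query, facet_field):
    """Build the JSON body for a query_string search with a terms facet."""
    body = {
        "query": {"query_string": {"query": query}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    print("GET ", solr_facet_url("field", "banana", "tags"))
    print("POST", ES_SEARCH_URL)
    print(es_facet_body("T*", "tags"))
```

Building the parameters with `urlencode` rather than by string concatenation keeps user input such as `&` or spaces from breaking the request.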


ElasticSearch


SANOMA CONTENT LIBRARY


Sanoma Content Library

• Search

» In site

» In cluster

» In network

» Elevation (ads)

» Faceting

• Related

» More like this

» Relevant ads

» Products

• Reuse

» Sharing

» Variants

» (Simple) DRM

» Images

• Analysis

» Sentiment

» Named entities

» Tagging

» Classification

» Key phrases


Services Content Library


Content Library architecture diagram. Labels: analysis pipeline, NER, sentiment, key phrase extractor, classifier, crawler, indexer, search index, API, edge, redirects, loader, Solr, Mongo, CMS, JCR. Search: nu.nl, wtf. Related: Vrouwen, Kieskeurig. Relevant: Txel. Integration: Vrouwen, Wordpress, SAS.

Common gotchas

• Use the right settings for your language: stop words and stemming

• Indexing too much or in too much detail

» Timestamps
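Timestamps are a typical case of indexing in too much detail: a value with second precision is close to unique per document, so the index collects a large number of distinct terms that nobody searches for. One possible mitigation, sketched in Python under the assumption that day precision is enough for filtering and sorting, is to round the value before it is indexed. The function names and the document fields are made up for this example.

```python
from datetime import datetime, timezone


def round_to_day(ts):
    """Drop the time of day so all documents from one day share a value."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def prepare_for_index(doc):
    """Return a copy of the document with a day-precision timestamp string."""
    prepared = dict(doc)
    prepared["created_day"] = round_to_day(doc["created_time"]).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Remove the full-precision value so it is not indexed.
    del prepared["created_time"]
    return prepared


if __name__ == "__main__":
    doc = {"id": 1, "created_time": datetime(2015, 4, 24, 13, 37, 5, tzinfo=timezone.utc)}
    print(prepare_for_index(doc))
```

If the exact time is still needed for display, it can be kept as a stored but unindexed field, assuming the search engine in use supports that distinction.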


END












