Post on 17-May-2015
Solr: BEYOND THE BASICS!
script: Ian Barber (phpir.com)
Art: the internet!
Editor: twitter.com/ianbarber
lettering: ian.barber@gmail.com
http://joind.in/2899
[Title slide artwork: fragments of the tf-idf formula: $tf_{i,j} \times idf_{i,j}$, $n_{i,j}$, $\sum_k n_{k,j}$, $|\{d : t_i \in d\}|$]
PREVIOUSLY....
My site search was slow and the results were bad, but Solr saved me!
security comes first!
/etc/solr/solr.xml
Core Core
CONF CONF
/var/solr/data
/var/solr/lib
<solr sharedLib="/var/solr/lib" persistent="true">
  <cores adminPath="/admin/cores">
    <core default="true" instanceDir="main" name="main">
    </core>
  </cores>
</solr>
solr.xml
<schema name="example" version="1.2">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.2" is Solr's version number for the schema syntax and semantics. It should
       not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
  -->
  <types>
    <!-- field type definitions. The "name" attribute is
         just a label to be used by field definitions. The "class"
         attribute and any other attributes determine the real
         behavior of the fieldType.
         Class names starting with "solr" refer to java classes in the
         org.apache.solr.analysis package.
    -->
    <!-- The StrField type is not analyzed, but indexed/stored verbatim.
         - StrField and TextField support an optional compressThreshold which
           limits compression (if enabled in the derived fields) to values which
           exceed a certain size (in characters).
    -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
    <!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
    <fieldtype name="binary" class="solr.BinaryField"/>
    <!-- The optional sortMissingLast and sortMissingFirst attributes are
<config>
  <!-- Set this to 'false' if you want solr to continue working after it has
       encountered an severe configuration error. In a production
       environment, you may want solr to keep working even if one handler is
       mis-configured. You may also set this to false using by setting the
       system property: -Dsolr.abortOnConfigurationError=false
  -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
  <!-- lib directives can be used to instruct Solr to load an Jars
       identified and use them to resolve any "plugins" specified in your
       solrconfig.xml or schema.xml (ie: Analyzers, Request Handlers, etc...).
       All directories and paths are resolved relative the instanceDir.
       If a "./lib" directory exists in your instanceDir, all files found in
       it are included as if you had used the following syntax...
       <lib dir="./lib" />
  -->
  <!-- A dir option by itself adds any files found in the directory to the
       classpath, this is useful for including all jars in a directory.
  -->
  <!--lib dir="../../contrib/extraction/lib" /-->
  <!-- When a regex is specified in addition to a directory, only the files
       in that directory which completely match the regex (anchored on both ends)
       will be included.
  -->
  <!--lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
  <lib dir="../../dist/" regex="apache-solr-clustering-\d.*\.jar" /-->
  <!-- If a dir option (with or without a regex) is used and nothing is
       found
Solr’s secret plan!
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr rocks</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
    <lst>
      <str name="q">from solrconfig.xml</str>
    </lst>
  </arr>
</listener>
cache warming!
Query
Index Configuration
Request Handlers
search components
Content Type
section
search types
field types
fields
The CMS!
TITLE
LEAD PARA
DATE
BODY
permalink
Category
Tags
Author
Scientific analysis!
how do we turn our text into tokens?
Field Type, Storage, Tokenisation,
Filters, and copy fields.
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>
schema.xml
ORIGINAL: O’Reilly’s wi-fi guide!
keyword: O’Reilly’s wi-fi guide!
Whitespace: O’Reilly’s | wi-fi | guide!
STANDARD: O | Reilly | S | wi | FI | GUIDE
stored: “My Phrase?” -> doc 1
INDEXED: my -> doc 1, phrase -> doc 1
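The stored/indexed split can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Lucene lays out its data: the `analyze` helper (lowercase, keep runs of letters and digits) and the dictionaries are hypothetical stand-ins for a real analyzer chain and index.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption): lowercase, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

stored = {}          # doc id -> original value, returned verbatim with results
inverted_index = {}  # token -> set of doc ids, used for matching queries

def add(doc_id, value):
    """Store the raw value and index its analyzed tokens."""
    stored[doc_id] = value
    for token in analyze(value):
        inverted_index.setdefault(token, set()).add(doc_id)

add("doc 1", "My Phrase?")

print(stored["doc 1"])         # the stored value is untouched
print(sorted(inverted_index))  # the tokens that searches actually match against
```

A field with stored="false" would simply skip the first dictionary: it could still be searched, but its text could not be shown in results.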
Ian barber
AN PRPR
<fieldtype name="phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>
IAIN BARBOUR
AN PRPR
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" generateNumberParts="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
delimiters
O | Reilly | S | wi | FI | GUIDE | OReillys | wifi
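As a rough illustration of what generateWordParts and catenateWords do, here is a minimal Python sketch. It is a simplification under stated assumptions, not the WordDelimiterFilterFactory implementation: the `word_delimiter` function is hypothetical, and it ignores number handling, case-change splitting and the other options.

```python
import re

def word_delimiter(token, generate_word_parts=True, catenate_words=True):
    """Split a token on non-alphanumeric characters, optionally adding the joined form."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    out = []
    if generate_word_parts:
        out.extend(parts)               # "wi-fi" -> "wi", "fi"
    if catenate_words and len(parts) > 1:
        out.append("".join(parts))      # "wi-fi" -> "wifi"
    return out

print(word_delimiter("wi-fi"))
print(word_delimiter("guide!"))
```

Emitting both the parts and the joined form lets a search for "wifi", "wi-fi" or "wi fi" reach the same document, which is the recall side of the trade-off on the next slide.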
precision versus recall
vs
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
stemming
O | Reilli | S | wi | FI | GUID | OReilli | wifi
I don't speak English!
TITLE
LEAD PARA BODY
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0" />
<fieldType name="lowercase" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
schema.xml
tags
Date
author
category
permalink
<fields>
  <field name="permalink" type="lowercase" required="true" />
  <field name="category" type="lowercase" />
  <field name="tag" type="lowercase" multiValued="true" />
  <field name="title" type="text" required="true"/>
  <field name="body" type="text" required="true" />
  <field name="author" type="lowercase" stored="false" multiValued="true" />
  <field name="date" type="tdate" multiValued="true" />
  <field name="lead_para" type="text" />
  <field name="phonetic" type="phonetic" />
  <field name="text" type="text" stored="false" multiValued="true" />
</fields>
schema.xml
<!-- Copy Fields -->
<copyField source="permalink" dest="text" />
<copyField source="category" dest="text" />
<copyField source="title" dest="text" />
<copyField source="lead_para" dest="text" />
<copyField source="body" dest="text" />
<copyField source="author" dest="text" />
<copyField source="category" dest="phonetic" />
<copyField source="title" dest="phonetic" />
<copyField source="lead_para" dest="phonetic" />
<copyField source="body" dest="phonetic" />
<copyField source="author" dest="phonetic" />
<!-- ID -->
<uniqueKey>permalink</uniqueKey>
from solr import *

# Connect to the "main" core
s = SolrConnection('http://localhost:8080/solr/main')

# One document, keyed by the fields defined in schema.xml
doc = dict(
    permalink = "http://fooweb.com/strategy/DCPO",
    category = "strategy",
    title = "DPCO: A Framework For Synergy",
    body = "DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality oriented management process [...]",
    author = "Sean Alison",
    date = "2011-03-01T00:00:00Z",
    source_site = "fooweb.com",
)
s.add(doc)
s.commit()
simpleadd.py
<add>
  <doc>
    <field name="body">DPCO, or Dynamic Performance Class [...]</field>
    <field name="category">strategy</field>
    <field name="permalink">http://fooweb.com/strategy/DCPO</field>
    <field name="source_site">fooweb.com</field>
    <field name="title">DPCO: A Framework For Synergy</field>
    <field name="date">2011-03-01T00:00:00Z</field>
    <field name="author">Sean Alison</field>
  </doc>
</add>
time for the gadgets!
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
solrconfig.xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/cms"
              user="root"
              password="password" />
  <document>
    <entity name="story"
            query="SELECT s.id, s.content, CONCAT (u.first_name, ' ', u.last_name) as author [...] s.status_id = 1"
            deltaImportQuery="SELECT s.id, s.content [...] AND s.id = ${dataimporter.delta.id}"
            deltaQuery="SELECT id FROM stories WHERE modified > ${dataimporter.last_index_time}"
            transformer="TemplateTransformer,HTMLStripTransformer">
data-config.xml
      <field column="permalink" name="permalink" template="http://fooweb.com/${story.slug}" />
      <field column="publish_date" name="date" />
      <field column="content" name="body" stripHTML="true" />
      <field column="source_site" template="cms" />
      [...]
      <entity name="topic" query="SELECT [...] st.item_id=${story.id}">
        <field column="category" />
      </entity>
    </entity>
  </document>
</dataConfig>
<response>
  <str name="command">full-import</str>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:14.979</str>
    <str name="Total Requests made">5523</str>
    <str name="Total Rows Fetched">5522</str>
    <str name="Total Documents Processed">2760</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-03-02 15:48:00</str>
  </lst>
</response>
http://SOLR:8080/solr/main/dataimport
The SOLR CELL!
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
solrconfig.xml
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
schema.xml
<dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>
can it be...schema free?!
Dynamic Fields
$ curl -v "http://localhost:8080/solr/main/update/extract?literal.source_site=files&literal.permalink=http://fooweb.com/arch.pdf&commit=true&fmap.content=body&fmap.Author=author" --data-binary @arch.pdf -H 'Content-Type:application/pdf'
A crawler!
Lucidimagination.com/blog/2009/03/09/nutch-solr
# skip some protocols
-^(https|telnet|file|ftp|mailto):
-[?*!@=]
# allow urls in defined domain
+^http://([a-z0-9\-A-Z]*\.)*fooweb.com/
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# deny anything else
-.
regex-urlfilter.txt
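To see how these rules combine, here is a small Python sketch that applies them in file order. It assumes the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is dropped. The `RULES` list and `accepts` function are illustrative names, not part of Nutch.

```python
import re

# (sign, pattern) pairs in file order; "+" accepts, "-" rejects.
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    ("-", r"[?*!@=]"),
    ("+", r"^http://([a-z0-9\-A-Z]*\.)*fooweb.com/"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("-", r"."),
]

def accepts(url):
    """Return True if the first rule found in the URL is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

for url in ("http://subsite.fooweb.com/about",
            "https://fooweb.com/",
            "http://fooweb.com/search?q=solr",
            "http://example.com/"):
    print(url, accepts(url))
```

Because the first match wins, the order of the lines matters as much as the patterns themselves.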
<mapping>
  <fields>
    <field dest="body" source="content" />
    <field dest="source_site" source="site" />
    <field dest="title" source="title" />
    <field dest="ignored_host" source="host" />
    <field dest="ignored_segment" source="segment" />
    <field dest="ignored_boost" source="boost" />
    <field dest="ignored_digest" source="digest" />
    <field dest="date" source="tstamp" />
    <field dest="permalink" source="url" />
  </fields>
  <uniqueKey>permalink</uniqueKey>
</mapping>
solrindex-mapping.xml
$ echo "http://subsite.fooweb.com" > urls/seed.txt
$ bin/nutch inject /var/nutch/crawldb urls
$ bin/nutch generate /var/nutch/crawldb /var/nutch/segments
$ export SEGMENT=/var/nutch/segments/`ls -tr /var/nutch/segments|tail -1`
$ bin/nutch fetch $SEGMENT -noParsing
$ bin/nutch parse $SEGMENT
$ bin/nutch updatedb $SEGMENT -filter -normalize
$ bin/nutch invertlinks /var/nutch/linkdb -dir /var/nutch/segments
$ bin/nutch solrindex http://localhost:8080/solr/main /var/nutch/crawldb /var/nutch/linkdb/ /var/nutch/segments/*
solr goes to work!
he has dismax!
<requestHandler name="dismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3
    </str>
    <str name="pf">
      text^0.2 title^4 author^1.8 body^1
    </str>
    <str name="mm">3&lt;60%</str>
  </lst>
</requestHandler>
solrconfig.xml
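The mm (minimum should match) value 3<60% reads as: with up to three optional clauses all of them must match, and beyond three only 60% of them must, rounded down. A minimal Python sketch of that single-condition rule follows. The `min_should_match` function is a hypothetical helper, and it does not cover the other mm syntaxes such as negative values or multiple conditions.

```python
def min_should_match(num_clauses, threshold=3, percent=60):
    """Minimum number of optional clauses that must match for mm="3<60%"."""
    if num_clauses <= threshold:
        return num_clauses                   # short queries: every clause is required
    return (num_clauses * percent) // 100    # longer queries: 60%, rounded down

for n in (2, 3, 4, 5, 10):
    print(n, "clauses ->", min_should_match(n), "must match")
```

So a two-word query behaves like an AND, while a ten-word query still returns documents that contain only six of the words.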
from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)

# Run the query through the default (dismax) handler and print each hit
response = s.query('idie manager')
for hit in response.results:
    print(hit['title'])
    print(hit['body'])
$ python simplequery.py
Overview of the IDIE manager
To help with those implementing IDIE [...]
IDIE: The 801g Of Talent Management
Inspiration-Direction-Influence [...]
<str name="bf"> recip(ms(NOW,date),3.16e-11,1,1)
</str>
FunctionQuery(1.0/(3.16E-11*float(ms(const(1299450070912),date(date)))+1.0)), product of:
  0.9974636 = 1.0/(3.16E-11*float(ms(const(1299450070912),date(date)=1299369600000))+1.0)
  1.0 = boost
  0.03730806 = queryNorm
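The number in the explain output can be reproduced by hand, since recip(x,m,a,b) evaluates a/(m*x+b). The Python sketch below plugs in the two timestamps from the output above; the variable names are illustrative.

```python
def recip(x, m, a, b):
    """Reciprocal function as used in the bf parameter: a / (m*x + b)."""
    return a / (m * x + b)

now_ms = 1299450070912    # NOW, the const(...) value in the explain output
doc_ms = 1299369600000    # the document's date field
age_ms = now_ms - doc_ms  # ms(NOW,date): the document's age in milliseconds

boost = recip(age_ms, 3.16e-11, 1, 1)
print(round(boost, 7))

# 3.16e-11 is roughly 1 / (milliseconds in a year), so a year-old
# document gets a value of about 0.5.
year_ms = 365.25 * 24 * 60 * 60 * 1000
print(round(recip(year_ms, 3.16e-11, 1, 1), 2))
```

With these constants a document's boost contribution is close to 1 when it is new and falls to about a half after a year.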
going beyond just
search results!
$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "badly drawn";
// Ask for a count of matching documents per category
$p = array(
    'facet' => "true",
    'facet.field' => 'category',
    'facet.mincount' => 1,
);
$r = $solr->search($query, 0, 5, $p);
foreach ($r->facet_counts->facet_fields->category as $cat => $count) {
    echo $cat, " ", $count, PHP_EOL;
}
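Underneath the client library, a facet request is an ordinary search request with a few extra parameters. The Python sketch below only builds the URL and sends nothing. The `/select` path is an assumption about the handler the client calls, and the host and core name are taken from the other examples.

```python
from urllib.parse import urlencode

# The same parameters as the PHP example, in request order.
params = [
    ("q", "badly drawn"),
    ("start", 0),
    ("rows", 5),
    ("facet", "true"),
    ("facet.field", "category"),
    ("facet.mincount", 1),
]

# Assumption: the core's standard /select handler.
url = "http://localhost:8080/solr/main/select?" + urlencode(params)
print(url)
```

Pasting a URL like this into a browser is a quick way to inspect the raw facet_counts section of the response.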
$query = "";
// Match everything, then count Reviews per month
$p = array(
    'q.alt' => "*:*",
    "facet" => "true",
    "facet.date" => 'date',
    "facet.date.start" => "NOW/YEAR-6MONTHS",
    "facet.date.end" => "NOW/YEAR",
    "facet.date.gap" => "+1MONTH",
    "fq" => "category: Reviews",
);
$r = $solr->search($query, 0, 0, $p);
foreach ($r->facet_counts->facet_dates->date as $date => $count) {
    echo $date, " ", $count, PHP_EOL;
}
$query = "";
// Count Reviews matching each arbitrary facet query
$p = array(
    'q.alt' => "*:*",
    'facet' => "true",
    'facet.mincount' => 1,
    "facet.query" => array("title:gig", "title:album"),
    "fq" => "category:Reviews",
);
$r = $solr->search($query, 0, 0, $p);
foreach ($r->facet_counts->facet_queries as $query => $count) {
    echo $query, " ", $count, PHP_EOL;
}
What Fields to facet?
what facets to show?
how to facet?
<requestHandler name="mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="defType">mlt</str>
    <str name="mlt">true</str>
    <str name="mlt.fl">body title</str>
    <str name="mlt.match.include">false</str>
  </lst>
</requestHandler>
solrconfig.xml
$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "Losing my backpacking virginity";
// Route the request to the MoreLikeThis handler
$p = array('qt' => "mlt");
$results = $solr->search($query, 0, 3, $p);
foreach ($results->response->docs as $doc) {
    echo $doc->title, PHP_EOL;
}
$ php mltquery.php
Backpacking across USA social media way
Safe solo travel on New York holidays
Cracking The Big Apple's Big 10
Thanks!
Some useful links!
http://wiki.apache.org/solr
http://nutch.apache.org/
http://lucidimagination.com/blog/
http://robotlibrarian.billdueber.com/
http://code.google.com/p/solr-php-client
http://pypi.python.org/pypi/solrpy
https://www.packtpub.com/solr-1-4-enterprise-search-server/book
http://github.com/ianbarber/SolrBTB-Talk
Bonus content!
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="buildOnCommit">true</str>
    <str name="spellcheckIndexDir">/var/lib/solr/spellchecker</str>
  </lst>
</searchComponent>
solrconfig.xml
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StandardFilterFactory" />
  </analyzer>
</fieldType>
schema.xml
[...]
    <int name="ps">10</int>
    <int name="qs">5</int>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
Dismax handler
$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
// Ask for a single collated suggestion for the whole query
$p = array(
    'spellcheck' => 'true',
    'spellcheck.collate' => 'true'
);
$results = $solr->search("roose", 0, 5, $p);
echo "Did you mean " . $results->spellcheck->suggestions->collation, PHP_EOL;
$ php spellquery.php
Did you mean rose
include_once "Apache/Solr/Service.php";
$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "album review";
// Sort on the untokenised, lowercased copy of the title
$p = array('sort' => 'title_sort desc');
$res = $solr->search($query, 0, 10, $p);
foreach ($res->response->docs as $doc) {
    echo $doc->title, PHP_EOL;
}
<field name="title_sort" type="lowercase" indexed="true" stored="false" />
<copyField source="title" dest="title_sort" />
$ php sortquery.php
Zola Jesus album review - Stridulum II
Zero 7 album review - Record
Zebra and Giraffe
Young Knives video interview part 2
Young Knives - Road to V winners on tour
You Me At Six @ Wembley Arena, London
You Me At Six - Hold Me Down
Yet again... Good Shoes @ ULU, London
Yelle: North American tour review
Yelle: interview with a French pop artiste
http://code.google.com/p/solr-php-client
<highlighting>
  <fragmenter name="regex" class="[..]highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">70</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
  </fragmenter>
  <formatter name="html" class="[...]highlight.HtmlFormatter" default="true">
    <lst name="defaults">
      <str name="hl.simple.pre"><![CDATA[<em>]]></str>
      <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
  </formatter>
</highlighting>
$so = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$q = "album review";
$r = $so->search($q, 0, 5, array('hl' => "true"));
// Highlights are keyed by the unique key (permalink), not returned inside the docs
foreach ($r->response->docs as $doc) {
    echo $r->highlighting->{$doc->permalink}->title[0], PHP_EOL;
}
$ php highlightquery.php
Fenech Soler <em>album</em> <em>review</em>
Weezer - Hurley <em>album</em> <em>review</em>
Feeder <em>album</em> <em>review</em> - Renegades
The masters of scaling are here!
Replication sharding caching
from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('ISO90210')
if response.results.numFound == '0':
    print("No results found!")
$ python simplefail.py
No results found!
IS SOLR DEFEATED?
http://solrurl:8080/solr/main/admin/analysis.jsp
<lst name="debug">
  <str name="rawquerystring">"iso 90210"</str>
  <str name="querystring">"iso 90210"</str>
  <str name="parsedquery">+DisjunctionMaxQuery((body:"iso 90210")~0.01) DisjunctionMaxQuery((body:"iso 90210")~0.01)</str>
/solr/select/?q="iso 90210"&debugQuery=true
<lst name="debug">
  <str name="rawquerystring">iso 90210</str>
  <str name="querystring">iso 90210</str>
  <str name="parsedquery">+((DisjunctionMaxQuery((body:iso)~0.01) DisjunctionMaxQuery((body:90210)~0.01))~2) DisjunctionMaxQuery((body:"iso 90210")~0.01)</str>
  <str name="parsedquery_toString">+(((body:iso)~0.01 (body:90210)~0.01)~2) (body:"iso 90210")~0.01</str>
/solr/select/?q=iso 90210&debugQuery=true
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (body:"iso 90210")
    0.0 = weight(body:"iso 90210" in 0), product of:
      0.6953707 = queryWeight(body:"iso 90210"), product of:
        3.8325815 = idf(body: iso=1 90210=1)
        0.18143663 = queryNorm
      0.0 = fieldWeight(body:"iso 90210" in 0), product of:
        0.0 = tf(phraseFreq=0.0)
        3.8325815 = idf(body: iso=1 90210=1)
        0.15625 = fieldNorm(field=body, doc=0)
&explainOther=90210
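Each "product of" line in the explain output is plain multiplication, so the zero score can be traced by hand. The Python sketch below uses only the numbers printed above; the variable names are illustrative.

```python
# Values copied from the explain output for body:"iso 90210"
idf = 3.8325815
query_norm = 0.18143663
field_norm = 0.15625
tf = 0.0  # phraseFreq=0.0: the phrase never occurs in the document

query_weight = idf * query_norm        # reported as 0.6953707
field_weight = tf * idf * field_norm   # reported as 0.0
weight = query_weight * field_weight   # the clause's score

print(round(query_weight, 7), field_weight, weight)
```

The query side is healthy; the whole score collapses because tf is zero, which says the indexed tokens for this document do not contain the phrase the query was turned into.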
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3
    </str>
    <str name="pf">
      text^0.2 title^4 author^1.8 body^1
    </str>
    <str name="mm">3&lt;60%</str>
    <int name="ps">10</int>
    <int name="qs">5</int>
  </lst>
solrconfig.xml
from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('ISO90210')
if response.results.numFound == '0':
    print("No results found!")
$ python simplefail.py
DPCO: A Framework For Synergy
DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality [...]