Date posted: 27-Jan-2015
What’s new in Lucene and Solr?
Grant Ingersoll
CTO, LucidWorks
Lucene/Solr Committer
Sink or Swim?
Search is good for…
• Traditional: Fast, fuzzy text matching across a large document collection
• De-normalized data
– “light” relational
• Top N problems
– Key-value (top 1)
– Recommendations, “Good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL
What’s New?
• Community
• Lucene
• Solr
Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search engine usages
– Object stores, Record linkage, Social, mobile -> web
• “The Apache Way”
– Meritocracy – Those who do, decide!
• Always Be Testing
– Randomized system tests are all the rage
– http://vimeo.com/32087114
• Patches Welcome!
Acceleration!
Coming Soon: Lucene and Solr 4.8
Java 1.7
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
– Per segment
– FieldCache can be controlled to only load new segments
– Soft commit -- faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
– Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
– Much improved data structure
– … means less memory and less garbage collection effort
Lucene: Flexibility
• Flexible Index Formats
– New posting list codecs: Block, Simple Text, HDFS, etc.
– Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
– Decoupled from TF/IDF
– Built in alternatives include BM25 & DFR, and others
• http://en.wikipedia.org/wiki/Okapi_BM25
• http://terrier.org/docs/v3.5/dfr_description.html
– Add your own
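To make the pluggable scoring bullet more concrete, here is a minimal sketch of the textbook Okapi BM25 formula in Python. It is not Lucene's implementation; the function name `bm25_score`, the toy corpus, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with textbook Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        # Smoothed IDF variant; the +1 inside the log keeps it non-negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus (illustrative only): each document is a list of tokens.
corpus = [
    ["lucene", "is", "a", "search", "library"],
    ["solr", "is", "a", "search", "server", "built", "on", "lucene"],
    ["cats", "sleep", "a", "lot"],
]
scores = [bm25_score(["lucene", "search"], d, corpus) for d in corpus]
```

With this corpus the first document outranks the second because both contain each query term once but the first is shorter, and the third scores zero because it contains neither term.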
FS(A|T)
• Keys:
– byte[] – write-once
– Linear time build of min. automata
– Compression, Reverse lookups
– Weights (used for auto-suggest)
– Pluggable Algebra
• Uses:
– Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
– FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
– http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
– “Smaller Representation of Finite State Automata”
• Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118–192.
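For readers unfamiliar with what a fuzzy query has to decide, the sketch below shows the matching criterion only: a term matches when it is within a small edit distance of the query term. This is a brute-force dynamic-programming scan over an in-memory term list, not the automaton-based approach the slide describes; the names `edit_distance` and `fuzzy_matches` and the sample terms are assumptions for illustration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_matches(term, dictionary, max_edits=2):
    """Return dictionary terms within max_edits of term (linear scan)."""
    return [t for t in dictionary if edit_distance(term, t) <= max_edits]


# Hypothetical term dictionary.
terms = ["lucene", "lucent", "lucid", "solr", "sole"]
matches = fuzzy_matches("lucen", terms, max_edits=1)
```

The scan compares the query against every term, which is the cost an automaton intersected with a sorted term dictionary is meant to avoid.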
Grab Bag
• Lots of new suggesters
– Available in Solr
• Doc Values
– Column oriented store
– Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
– Now look a lot like Terms
Solr 4: New Features
• Search/Faceting/Relevance
– New Relevance Function Queries (tf, df, others)
– Pivot Faceting
– Pseudo-join
– Improved Spatial (more later)
– Full support for Lucene Codecs, pluggable scoring
• Indexing
– New Update Processors, including scripting option
– Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI
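The cursor bullet refers to Solr's `cursorMark` request parameter: the client starts with `cursorMark=*`, sorts on a clause that includes the unique key, and feeds each response's `nextCursorMark` back in until it stops changing. The sketch below walks that loop in Python. The `fetch_page` callable and the in-memory `fake_fetch_page` are hypothetical stand-ins so the example runs without a server, and the unique key field is assumed to be `id`.

```python
def iterate_with_cursor(fetch_page, query, sort="id asc", rows=2):
    """Yield every matching document by following the cursorMark protocol.

    fetch_page is a caller-supplied (hypothetical) function that sends the
    params to a select handler and returns the parsed JSON response.
    """
    cursor = "*"  # "*" asks for a fresh cursor
    while True:
        params = {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor: nothing left to read
            break
        cursor = next_cursor


def fake_fetch_page(params):
    """In-memory stand-in for a server. Real cursor marks are opaque strings;
    here the mark is simply an offset encoded as text."""
    docs = [{"id": str(i)} for i in range(5)]
    start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    page = docs[start:start + params["rows"]]
    return {"response": {"docs": page}, "nextCursorMark": str(start + len(page))}


ids = [d["id"] for d in iterate_with_cursor(fake_fetch_page, "*:*")]
```

Unlike paging with a growing `start` offset, each request only carries the cursor forward, which is what makes deep result sets practical to walk.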
Geospatial improvements
• Index shapes other than points (circles, polygons, etc.)
• More complex interactions than point in a circle
• Indexing:
– "geo":"43.17614,-90.57341"
– "geo":"Circle(4.56,1.23 d=0.0710)"
– "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
– fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
– fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
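The filter queries above contain spaces, quotes, and parentheses, so they need URL encoding before being sent. A minimal sketch in Python, using only the standard library, of building such a query string; the helper name `spatial_filter_params` and the `q=*:*` default are assumptions, while the field name `geo` and the shapes come from the slide.

```python
from urllib.parse import urlencode, parse_qs


def spatial_filter_params(field, shape, predicate="Intersects", query="*:*"):
    """Build a URL-encoded query string with a spatial filter query (fq)."""
    fq = f'{field}:"{predicate}({shape})"'
    return urlencode({"q": query, "fq": fq})


# Rectangle and polygon shapes copied from the slide.
rect = spatial_filter_params("geo", "-74.093 41.042 -69.347 44.558")
poly = spatial_filter_params(
    "geo", "POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
)

# Decoding the query string recovers the readable filter.
decoded = parse_qs(rect)["fq"][0]
```

The resulting strings can be appended to a select URL after a `?`; the host, port, and collection name depend on the deployment and are deliberately left out.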
Scaling Solr
• Distributed/sharded indexing & search
– Automatically distributes updates and queries to appropriate shards
– Near Real Time (NRT) indexing capable
– Document routing extensions
• Dynamically scalable
– New SolrCloud instances add indexing and query capacity
– Supports re-balancing (shard-splitting)
• Reliable
– No single point of failure
– Transactions logged
– Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
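The idea behind distributing updates to the appropriate shard can be sketched as hash-range routing: hash the document ID, and let each shard own a contiguous slice of the hash space. The Python below is a simplified illustration under stated assumptions, not SolrCloud's router: it uses `zlib.crc32` as a stand-in hash, splits the 32-bit space evenly, and the name `shard_for` and the sample IDs are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index by splitting a 32-bit hash space evenly."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # unsigned 32-bit value in Python 3
    range_size = 2**32 // num_shards
    # Clamp so hashes in the remainder at the top of the space go to the last shard.
    return min(h // range_size, num_shards - 1)


# Group a few hypothetical document ids by the shard that would own them.
placement = {}
for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    placement.setdefault(shard_for(doc_id, 4), []).append(doc_id)
```

Because the mapping depends only on the ID and the shard ranges, any node can compute where a document belongs, and splitting a shard amounts to dividing its range in two.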
Solr as NoSQL
• Non-traditional data stores
• Not designed for SQL type queries
• Distributed fault tolerant architecture
• Document oriented, data format agnostic (JSON, XML, CSV, binary)
Go Deep!
APIs
• New APIs for Schema and Solr Config
– XML becoming more of an implementation detail
• Managed Schema mode
• Data-driven schema (aka schemaless)
• Synonyms, stopwords, request handlers
Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
Summary
• Lucene/Solr 4.x:
– Faster
– More Flexible
– Easier than ever scaling
– More reliable than ever
• Go forth and rank!
Resources
• Me
– [email protected]
– @gsingers on Twitter
• LucidWorks
– http://www.lucidworks.com
– http://www.lucidworks.com/support-services/ask-the-experts/