Date post: | 11-May-2015 |
Faceted Search with Lucene
Shai Erera, Researcher, IBM
• Working at IBM – Information Retrieval Research
• Lucene/Solr committer and PMC member
• http://shaierera.blogspot.com
• [email protected]
Who Am I
Lucene Facets 101
• Technique for accessing documents that were classified into a taxonomy of categories
  – Flat: Author/John Doe, Tags/Lucene, Popularity/High
  – Hierarchical: Computers/Software/Information Retrieval/Fulltext/Apache Lucene (ODP)
• Quick overview of the breakdown of the search results
  – How many documents are in category Committed Paths/lucene/core vs. Committed Paths/lucene/facet
• Simplifies interaction with the search application
  – Drill down to issues that were updated in Past 2 days by clicking a link
  – No knowledge required about search syntax and index schema
Faceted Search
http://jirasearch.mikemccandless.com
• Contributed by IBM in 2011, released in 3.4.0
• Major changes since 4.1.0
  – NRT support
  – Nearly 400% search speedups
  – Complete API revamp
  – New features (SortedSet, range faceting, drill-sideways)
• Two main indexing-time modes
  – Taxonomy-based: hierarchical facets, managed by a sidecar index, low NRT reopen cost
  – SortedSetDocValues: flat facets only, no sidecar index, higher NRT reopen cost
• Runtime modes
  – Range facets (on NumericDocValues fields)
• Other implementations: Solr, ElasticSearch, Bobo Browse
Lucene Facets
• TaxonomyWriter/Reader
  – Manage the taxonomy information
• FacetFields
  – Add facets information to documents (DocValues fields, drilldown terms)
• FacetRequest
  – Defines which facets to aggregate and the FacetsAggregator (aggregation function)
• FacetsCollector
  – Collects matching documents and computes the top-K categories for each facet request (invokes FacetsAccumulator)
• DrillDownQuery / DrillSideways
  – Execute drilldown and drill-sideways requests
Lucene Facet Components
// Builds the taxonomy as documents are indexed, multi-threaded, single instance
TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

// Adds facets information to a document, can be initialized once per thread
FacetFields facetFields = new FacetFields(taxoWriter);

// List of categories to add to the document
List<CategoryPath> cats = new ArrayList<CategoryPath>();
cats.add(new CategoryPath("Author", "Erik Hatcher"));
cats.add(new CategoryPath("Author/Otis Gospodnetić", '/'));
cats.add(new CategoryPath("Pub Date", "2004", "December", "1"));

Document bookDoc = new Document();
bookDoc.add(new TextField("title", "lucene in action", Store.YES));

// add categories fields (DocValues, Postings)
facetFields.addFields(bookDoc, cats);

// index the document
indexWriter.addDocument(bookDoc);
Sample Code – Indexing
// Open an NRT TaxonomyReader
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);

// Define the facets to aggregate (top-10 categories for each)
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));

// Collect both top-K facets and top-N matching documents
TopDocsCollector tdc = TopScoreDocCollector.create(10, true);
FacetsCollector fc = FacetsCollector.create(fsp, indexReader, taxoReader);
Query q = new TermQuery(new Term("title", "lucene"));
searcher.search(q, MultiCollector.wrap(tdc, fc));

// Traverse the top facets
for (FacetResult fres : fc.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  System.out.println(String.format("%s (%d)", root.label, root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    // components[0] is the dimension; components[1] is the child's own label
    System.out.println("  " + cat.label.components[1] + " (" + cat.value + ")");
  }
}
Sample Code – Search
• Drilldown adds a filter to the search
  – Multiple categories can be OR’d

// Drilldown – filter results to "Component/core/index";
// All other "Component/*" and "Component/core/*" get count 0
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(facetIndexingParams, base);
ddq.add(new CategoryPath("Component/core/index", '/'));

• Drill sideways allows drilldown, yet still aggregates “sideways” categories

// Drill-Sideways – drilldown on "Component/core/index";
// Other "Component/*" and "Component/core/*" are counted too
DrillSideways ds = new DrillSideways(searcher, taxoReader);
DrillSidewaysResult sidewaysRes = ds.search(null, ddq, 10, fsp);
Drilldown and Drill-Sideways
http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
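The difference between the two counting modes on the slide above can be sketched without Lucene. This is a toy model, not the DrillSideways implementation: documents are plain string arrays, and the Status dimension, the class name and the method name are all made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model (hypothetical): docs[i][d] is the value of dimension d in document i. */
final class DrillSidewaysSketch {

    /**
     * Counts the values of dimension dim. selected[d] is the drilldown value
     * on dimension d, or null if that dimension is unconstrained.
     */
    static Map<String, Integer> count(String[][] docs, String[] selected, int dim, boolean sideways) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] doc : docs) {
            boolean matches = true;
            for (int d = 0; d < selected.length; d++) {
                // Sideways counting ignores the drilldown on the dimension being counted.
                if (sideways && d == dim) continue;
                if (selected[d] != null && !selected[d].equals(doc[d])) matches = false;
            }
            if (matches) counts.merge(doc[dim], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Dimension 0 = Component, dimension 1 = Status (illustrative data only).
        String[][] docs = {
            {"core", "open"}, {"facet", "open"}, {"core", "closed"}, {"facet", "closed"}
        };
        String[] selected = {"core", "open"};
        System.out.println("drilldown: " + count(docs, selected, 0, false));
        System.out.println("sideways:  " + count(docs, selected, 0, true));
    }
}
```

With a plain drilldown the Component dimension only reports the selected value; with sideways counting the sibling values keep their counts, which is what lets a UI show the alternatives the user could switch to.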
• Range facets on NumericDocValues fields
  – Define the buckets of interest at query time
  – Supports any arbitrary ValueSource (Lucene 4.6.0)

// Aggregate matching documents into buckets
RangeAccumulator a = new RangeAccumulator(new RangeFacetRequest<LongRange>("field",
    new LongRange("1-5", 1L, true, 5L, true),
    new LongRange("6-20", 6L, true, 20L, true),
    new LongRange("21-100", 21L, true, 100L, true),
    new LongRange("over 100", 100L, false, Long.MAX_VALUE, true)));
Dynamic Facets
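What range faceting computes can be shown with a small stand-alone sketch. The Bucket class below is a hypothetical stand-in written for this example (it is not Lucene's LongRange), and the values array plays the role of the per-document numeric field of the matching documents.

```java
/** Stand-alone sketch of range bucketing; Bucket is a made-up helper, not a Lucene class. */
final class RangeCountSketch {

    static final class Bucket {
        final String label;
        final long min, max;
        final boolean minInclusive, maxInclusive;

        Bucket(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        boolean accepts(long v) {
            boolean aboveMin = minInclusive ? v >= min : v > min;
            boolean belowMax = maxInclusive ? v <= max : v < max;
            return aboveMin && belowMax;
        }
    }

    /** Returns, for each bucket, how many of the given values fall into it. */
    static int[] count(long[] values, Bucket[] buckets) {
        int[] counts = new int[buckets.length];
        for (long v : values) {
            for (int i = 0; i < buckets.length; i++) {
                if (buckets[i].accepts(v)) counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Bucket[] buckets = {
            new Bucket("1-5", 1L, true, 5L, true),
            new Bucket("over 100", 100L, false, Long.MAX_VALUE, true)
        };
        int[] counts = count(new long[] {3L, 100L, 250L}, buckets);
        for (int i = 0; i < buckets.length; i++) {
            System.out.println(buckets[i].label + " (" + counts[i] + ")");
        }
    }
}
```

Because the buckets are only consulted at query time, they can be changed from one request to the next without reindexing, which is the "dynamic" part of the slide title.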
• Not all facets are created equal
  – Categories added by an automatic categorization system, e.g. Category/Apache Lucene (0.74) (confidence level is 0.74)
  – Important metadata about the facet, e.g. Contracts/US ($5M) (total $$$ generated from contracts)
  – Complex structures, e.g. Users/Shai Erera (lastAccess=YYYY/MM/DD, numUpdates=8…)
• Categories can have values associated with them per document
  – They are later aggregated by these values
  – NOTE: ≠ NumericDocValuesFields!
• Facet associations are completely customizable – encoded as a byte[] per document
Facet Associations
http://shaierera.blogspot.com/2013/01/facet-associations.html
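Aggregating by association value instead of by count might look roughly like the sketch below. It assumes the simplest case, one float per (document, category) pair, held in parallel arrays; the real encoding is a byte[] per document as the slide says, and the class and method names here are invented for the example.

```java
/** Sketch: sum a per-document float association for each category ordinal. */
final class AssociationSumSketch {

    /**
     * docOrdinals[doc][i] is the i-th category ordinal of a matching document and
     * docValues[doc][i] is the value associated with it (e.g. a confidence level).
     */
    static float[] sum(int[][] docOrdinals, float[][] docValues, int taxonomySize) {
        float[] sums = new float[taxonomySize];
        for (int doc = 0; doc < docOrdinals.length; doc++) {
            for (int i = 0; i < docOrdinals[doc].length; i++) {
                sums[docOrdinals[doc][i]] += docValues[doc][i];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        int[][] ordinals = {{1}, {1, 2}};
        float[][] values = {{0.5f}, {0.25f, 1.0f}};
        float[] sums = sum(ordinals, values, 3);
        System.out.println("ordinal 1 -> " + sums[1] + ", ordinal 2 -> " + sums[2]);
    }
}
```

A plain count would report ordinal 1 as 2; summing the associations weighs each occurrence by its value instead.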
• Complements
  – Holds the count of each category in-memory, per IndexReader
  – When number of search results is >50% of the index, count the “complement set”
  – Useful for “overview” queries, e.g. MatchAllDocsQuery
• Sampling
  – Aggregate a sampled set of the search results
  – Optionally re-count top-K facets for accurate values
• Partitions
  – Partition the taxonomy space to control memory usage during faceted search
  – Useful for very big taxonomies (10s of millions of categories)
More Features
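The complements idea from the slide above can be sketched as follows: keep the total count of every category, and when more than half of the index matches, count the documents that did not match and subtract. This is an illustrative model only; the class name, the array-based document representation and the exact threshold check are assumptions of the sketch.

```java
import java.util.BitSet;

/** Sketch of complement counting over a toy index of maxDoc documents. */
final class ComplementSketch {

    /**
     * totalCounts[ord] is the cached count of ordinal ord over the whole index;
     * docOrdinals[doc] lists the ordinals attached to each document.
     */
    static int[] count(BitSet matches, int maxDoc, int[][] docOrdinals, int[] totalCounts) {
        boolean useComplement = matches.cardinality() * 2 > maxDoc;
        int[] counts = new int[totalCounts.length];
        for (int doc = 0; doc < maxDoc; doc++) {
            // Visit the smaller side: non-matching docs when complementing, matching docs otherwise.
            if (matches.get(doc) == useComplement) continue;
            for (int ord : docOrdinals[doc]) counts[ord]++;
        }
        if (useComplement) {
            for (int ord = 0; ord < counts.length; ord++) {
                counts[ord] = totalCounts[ord] - counts[ord];
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{1}, {1, 2}, {2}, {1}};
        int[] totals = {0, 3, 2};
        BitSet matches = new BitSet(4);
        matches.set(0, 3); // documents 0, 1 and 2 match: 75% of the index
        int[] counts = count(matches, 4, docOrdinals, totals);
        System.out.println("ordinal 1 -> " + counts[1] + ", ordinal 2 -> " + counts[2]);
    }
}
```

For a MatchAllDocsQuery the complement set is empty, so the answer is just the cached totals, which is why the slide calls this useful for "overview" queries.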
Lucene Facets Under the Hood
• The taxonomy maps categories to integer codes (referred to as ordinals)
  – Kind of like a Map<CategoryPath,Integer>, with hierarchy support
  – Provides taxonomy browsing services
  – DirectoryTaxonomyWriter is managed as a sidecar Lucene index
• Categories are broken down to their path components, e.g. Date/2012/March/20 becomes:
  – Date, with ordinal=1
  – Date/2012, with ordinal=2
  – Date/2012/March, with ordinal=3
  – Date/2012/March/20, with ordinal=4
The Taxonomy Index
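The "Map<CategoryPath,Integer> with hierarchy support" description can be made concrete with a toy in-memory taxonomy. This is only a sketch of the idea: the real taxonomy is a Lucene index on disk, and ToyTaxonomy and its methods are names invented here. Ordinal 0 is reserved for the root, which is why Date gets ordinal 1 as on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy taxonomy: maps category paths to ordinals and remembers each ordinal's parent. */
final class ToyTaxonomy {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Integer> parents = new ArrayList<>();

    ToyTaxonomy() {
        ordinals.put("", 0); // ordinal 0 is the root
        parents.add(-1);     // the root has no parent
    }

    /** Adds a path and every prefix of it; returns the ordinal of the full path. */
    int addCategory(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) path.append('/');
            path.append(component);
            Integer ord = ordinals.get(path.toString());
            if (ord == null) {
                ord = parents.size();
                ordinals.put(path.toString(), ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    int parentOf(int ordinal) { return parents.get(ordinal); }

    int size() { return parents.size(); }

    public static void main(String[] args) {
        ToyTaxonomy taxo = new ToyTaxonomy();
        int leaf = taxo.addCategory("Date", "2012", "March", "20");
        System.out.println("Date/2012/March/20 -> " + leaf + ", parent " + taxo.parentOf(leaf));
    }
}
```

Adding Date/2012 again returns the existing ordinal, so the same category always gets the same code no matter which document or segment it arrives with.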
• Categories are added as drilldown terms, e.g. for Date/2012/March/20:
  – $facets:Date
  – $facets:Date/2012
  – …
• All category ordinals associated with the document are added as a BinaryDocValuesField
  – The ordinals of all path components are added, not just the leaf’s
  – Encoded as VInt + gap for efficient compression and speed
    • Other compression methods attempted, but were slower to decode (LUCENE-4609)
  – Used during faceted search to read all the associated ordinals and aggregate accordingly (e.g. count)
The Search Index
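The "VInt + gap" encoding can be illustrated with a short sketch: sort the ordinals, store the difference to the previous one, and write each difference with 7 bits per byte. The byte order used here (low bits first, high bit as a continuation flag) is the common varint layout and is an assumption of the sketch; it is not claimed to be byte-for-byte what Lucene writes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Sketch of delta-gap plus variable-length int encoding of a document's ordinals. */
final class DGapVIntSketch {

    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int gap = ord - prev; // small gaps need fewer bytes than raw ordinals
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // 7 payload bits, high bit = more bytes follow
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) {
        int[] result = new int[bytes.length]; // at most one value per byte
        int count = 0, prev = 0, pos = 0;
        while (pos < bytes.length) {
            int gap = 0, shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap; // undo the delta
            result[count++] = prev;
        }
        return Arrays.copyOf(result, count);
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new int[] {1, 2, 3, 4});
        System.out.println(encoded.length + " bytes, decoded " + Arrays.toString(decode(encoded)));
    }
}
```

The path Date/2012/March/20 produces consecutive ordinals, so every gap is 1 and each ordinal costs a single byte, which hints at why this simple scheme compresses hierarchical facets well.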
• SortedSetFacetFields add SortedSetDocValuesFields and drilldown terms to documents
• Local-segment SortedSet ordinals are mapped to global ones through SortedSetDocValuesReaderState
• Use SortedSetDocValuesAccumulator to accumulate SortedSet facets
• Advantages:
  – Taxonomy representation requires less RAM (flat taxonomy)
  – No sidecar index
  – Tie-breaks by label-sort order
• Disadvantages:
  – Not full taxonomy
  – Overall uses more RAM (local-to-global ordinal mapping)
  – Adds NRT reopen cost
  – Slower than taxonomy-based facets
SortedSet Facets
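The local-to-global ordinal mapping mentioned above can be sketched with plain arrays. Each segment has its own sorted label list, so the same label may have a different ordinal in each segment; a per-segment int[] translates local ordinals to positions in a merged global list. The names below are hypothetical and the sketch says nothing about how SortedSetDocValuesReaderState is implemented.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Sketch: merge per-segment label ordinals into one global ordinal space. */
final class OrdinalMapSketch {

    /** Builds the sorted global label list from every segment's local labels. */
    static String[] globalLabels(String[][] segmentLabels) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] labels : segmentLabels) all.addAll(Arrays.asList(labels));
        return all.toArray(new String[0]);
    }

    /** For one segment, maps each local ordinal to its global ordinal. */
    static int[] localToGlobal(String[] localLabels, String[] global) {
        int[] map = new int[localLabels.length];
        for (int i = 0; i < localLabels.length; i++) {
            map[i] = Arrays.binarySearch(global, localLabels[i]);
        }
        return map;
    }

    /** Adds per-segment counts (indexed by local ordinal) into the global counts. */
    static void accumulate(int[] localCounts, int[] map, int[] globalCounts) {
        for (int local = 0; local < localCounts.length; local++) {
            globalCounts[map[local]] += localCounts[local];
        }
    }

    public static void main(String[] args) {
        String[][] segments = {{"Author/A", "Tags/Lucene"}, {"Author/B", "Tags/Lucene"}};
        String[] global = globalLabels(segments);
        System.out.println(Arrays.toString(global));
        System.out.println(Arrays.toString(localToGlobal(segments[1], global)));
    }
}
```

The map has one entry per local ordinal per segment and has to be rebuilt when segments change, which matches the extra RAM and NRT reopen cost listed under the disadvantages.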
• Per-segment integer codes (as used by the SortedSet approach) are less efficient
  – Different ordinals for same categories across segments
  – Hold in-memory codes map (e.g. local-to-global) – more RAM and less scalable
  – Resolve top-K on the String representation of categories – more CPU
• Global ordinals allow efficient per-segment faceting and aggregation
  – No translation maps required (no extra RAM, highly scalable)
  – Aggregation, top-K computation done on integer codes
• But, do not play well with IndexWriter.addIndexes(Directory…)
  – Must use IndexWriter.addIndexes(IndexReader…), so that the ordinals in the input search index are mapped to the destination’s
Global Ordinals
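Why a plain file-level merge is not enough can be seen in a small sketch: two taxonomies built independently assign different ordinals to the same category, so the ordinals stored in the incoming documents have to be rewritten. The list-based taxonomy (list position = ordinal) and the method names are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: merge a source taxonomy into a destination and remap document ordinals. */
final class TaxonomyMergeSketch {

    /** Adds missing source categories to dest; returns map[sourceOrdinal] = destOrdinal. */
    static int[] mergeInto(List<String> dest, List<String> source) {
        int[] map = new int[source.size()];
        for (int srcOrd = 0; srcOrd < source.size(); srcOrd++) {
            String path = source.get(srcOrd);
            int destOrd = dest.indexOf(path);
            if (destOrd < 0) {
                destOrd = dest.size();
                dest.add(path);
            }
            map[srcOrd] = destOrd;
        }
        return map;
    }

    /** Rewrites the ordinals stored with one source document into destination ordinals. */
    static int[] remap(int[] docOrdinals, int[] map) {
        int[] result = new int[docOrdinals.length];
        for (int i = 0; i < docOrdinals.length; i++) result[i] = map[docOrdinals[i]];
        return result;
    }

    public static void main(String[] args) {
        List<String> dest = new ArrayList<>(Arrays.asList("", "Author", "Author/A"));
        List<String> source = Arrays.asList("", "Tags", "Author", "Tags/Lucene");
        int[] map = mergeInto(dest, source);
        System.out.println(Arrays.toString(map) + " " + Arrays.toString(remap(new int[] {1, 3}, map)));
    }
}
```

In the example, Tags is ordinal 1 in the source but ordinal 1 already means Author in the destination, so copying the source segments unchanged would silently count the wrong categories.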
• FacetsCollector works in two steps:
  – Collects matching documents (and optionally their scores)
  – Invokes FacetsAccumulator to accumulate the top-K facets
• Performance tests show that this improves faceted search (LUCENE-4600)
  – Locality of reference?
• Useful for Sampling and Complements
  – Hard to do otherwise
Two-Phase Aggregation
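The two steps can be sketched as two separate passes: the first only records which documents matched, the second walks those documents in order and aggregates their ordinals. This is a simplified model with invented names; scores, per-segment handling and the top-K selection are left out.

```java
import java.util.BitSet;

/** Sketch of collect-then-aggregate faceting over a toy index. */
final class TwoPhaseSketch {

    /** Phase 1: remember which documents matched; no facet work happens here. */
    static BitSet collect(int[] matchingDocs, int maxDoc) {
        BitSet matches = new BitSet(maxDoc);
        for (int doc : matchingDocs) matches.set(doc);
        return matches;
    }

    /** Phase 2: visit the matches in docID order and count their category ordinals. */
    static int[] accumulate(BitSet matches, int[][] docOrdinals, int taxonomySize) {
        int[] counts = new int[taxonomySize];
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            for (int ord : docOrdinals[doc]) counts[ord]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{1, 2}, {1, 3}, {1, 2}, {4}};
        BitSet matches = collect(new int[] {2, 0}, docOrdinals.length);
        int[] counts = accumulate(matches, docOrdinals, 5);
        System.out.println("ordinal 1 -> " + counts[1] + ", ordinal 2 -> " + counts[2]);
    }
}
```

Having the full match set available before aggregation starts is what makes sampling and complements straightforward: both need to know how many documents matched before deciding which ones to visit.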
• Determine how facets are encoded
  – Partition size
  – Facet delimiter character (for drilldown terms, default \u001F)
  – CategoryListParams
• CategoryListParams holds parameters for a category list
  – Encoder/Decoder (default DGapVInt)
  – OrdinalPolicy (how path components are encoded): ALL_PARENTS, NO_PARENTS and ALL_BUT_DIMENSION (default)
• CategoryListParams can be used to group facets together
  – Default: all facets are put in the same “category list” (i.e. one BinaryDocValues field)
  – Expert: separate categories by dimension into different category lists
    • Useful when sets of categories are always aggregated together, but not with other categories
• FacetIndexingParams are currently not recorded per-segment and therefore you should be careful if you suddenly change them!
FacetIndexingParams
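The effect of the three OrdinalPolicy values can be sketched on a parent array, where parents[ord] is the ordinal of ord's parent and ordinal 0 is the root. The enum and method below are written for this example and only model which ordinals would be written for one category; that the leaf itself is always written is an assumption of the sketch.

```java
import java.util.Arrays;

/** Sketch: which ordinals of a category path get encoded under each policy. */
final class OrdinalPolicySketch {

    enum Policy { NO_PARENTS, ALL_PARENTS, ALL_BUT_DIMENSION }

    /** parents[ord] is the parent ordinal of ord; the root is ordinal 0 with parent -1. */
    static int[] ordinalsToEncode(int leaf, int[] parents, Policy policy) {
        int[] buffer = new int[parents.length];
        int count = 0;
        for (int ord = leaf; ord > 0; ord = parents[ord]) {
            boolean isLeaf = ord == leaf;
            boolean isDimension = parents[ord] == 0; // direct child of the root
            boolean keep;
            switch (policy) {
                case NO_PARENTS:
                    keep = isLeaf;
                    break;
                case ALL_BUT_DIMENSION:
                    keep = isLeaf || !isDimension;
                    break;
                default:
                    keep = true;
            }
            if (keep) buffer[count++] = ord;
        }
        int[] result = Arrays.copyOf(buffer, count);
        Arrays.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // Date=1, Date/2012=2, Date/2012/March=3, Date/2012/March/20=4
        int[] parents = {-1, 0, 1, 2, 3};
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> " + Arrays.toString(ordinalsToEncode(4, parents, policy)));
        }
    }
}
```

Writing fewer ordinals per document shrinks the category list, but any ordinal that is not written has to be derived from its children at search time, so the policy trades index size against aggregation work.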
Questions?