Faceted Search with Lucene


Shai Erera, Researcher, IBM

• Working at IBM, Information Retrieval Research
• Lucene/Solr committer and PMC member
• http://shaierera.blogspot.com

Who Am I

Lucene Facets 101

• Technique for accessing documents that were classified into a taxonomy of categories
  – Flat: Author/John Doe, Tags/Lucene, Popularity/High
  – Hierarchical: Computers/Software/Information Retrieval/Fulltext/Apache Lucene (ODP)

• Quick overview of the breakdown of the search results
  – How many documents are in category Committed Paths/lucene/core vs. Committed Paths/lucene/facet

• Simplifies interaction with the search application
  – Drill down to issues that were updated in Past 2 days by clicking a link
  – No knowledge required about search syntax and index schema

Faceted Search

http://jirasearch.mikemccandless.com


• Contributed by IBM in 2011, released in 3.4.0
• Major changes since 4.1.0
  – NRT support
  – Nearly 400% search speedups
  – Complete API revamp
  – New features (SortedSet, range faceting, drill-sideways)

• Two main indexing-time modes
  – Taxonomy-based: hierarchical facets, managed by a sidecar index, low NRT reopen cost
  – SortedSetDocValues: flat facets only, no sidecar index, higher NRT reopen cost

• Runtime modes
  – Range facets (on NumericDocValues fields)

• Other implementations: Solr, ElasticSearch, Bobo Browse

Lucene Facets

• TaxonomyWriter/Reader
  – Manage the taxonomy information
• FacetFields
  – Add facets information to documents (DocValues fields, drilldown terms)
• FacetRequest
  – Defines which facets to aggregate and the FacetsAggregator (aggregation function)
• FacetsCollector
  – Collects matching documents and computes the top-K categories for each facet request (invokes FacetsAccumulator)
• DrillDownQuery / DrillSideways
  – Execute drilldown and drill-sideways requests

Lucene Facet Components

// Builds the taxonomy as documents are indexed; multi-threaded, use a single instance
TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

// Adds facets information to a document; can be initialized once per thread
FacetFields facetFields = new FacetFields(taxoWriter);

// List of categories to add to the document
List<CategoryPath> cats = new ArrayList<CategoryPath>();
cats.add(new CategoryPath("Author", "Erik Hatcher"));
cats.add(new CategoryPath("Author/Otis Gospodnetić", '/'));
cats.add(new CategoryPath("Pub Date", "2004", "December", "1"));

Document bookDoc = new Document();
bookDoc.add(new TextField("title", "lucene in action", Store.YES));

// Add the category fields (DocValues, postings)
facetFields.addFields(bookDoc, cats);

// Index the document
indexWriter.addDocument(bookDoc);

Sample Code – Indexing

// Open an NRT TaxonomyReader
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);

// Define the facets to aggregate (top-10 categories for each)
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));

// Collect both top-K facets and top-N matching documents
TopDocsCollector tdc = TopScoreDocCollector.create(10, true);
FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), taxoReader);
Query q = new TermQuery(new Term("title", "lucene"));
searcher.search(q, MultiCollector.wrap(tdc, fc));

// Traverse the top facets
for (FacetResult fres : fc.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  System.out.println(String.format("%s (%d)", root.label, root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    System.out.println("  " + cat.label.components[0] + " (" + cat.value + ")");
  }
}

Sample Code – Search

• Drilldown adds a filter to the search
  – Multiple categories can be OR'd

// Drilldown: filter results to "Component/core/index".
// All other "Component/*" and "Component/core/*" get count 0
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(facetIndexingParams, base);
ddq.add(new CategoryPath("Component/core/index", '/'));

• Drill sideways allows drilldown, yet still aggregates "sideways" categories

// Drill-sideways: drill down on "Component/core/index".
// Other "Component/*" and "Component/core/*" are counted too
DrillSideways ds = new DrillSideways(searcher, taxoReader);
DrillSidewaysResult sidewaysRes = ds.search(null, ddq, 10, fsp);

Drilldown and Drill-Sideways

http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
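The difference between the two modes can be sketched on a toy in-memory data set. This is only an illustration under simplifying assumptions (flat dimensions, one value per dimension per document) and is not how Lucene's DrillSideways is implemented; the class DrillSidewaysSketch and its helper doc(...) are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model: a document is a map of dimension -> value. Hypothetical, not Lucene code. */
class DrillSidewaysSketch {

    /** Helper to build a map from alternating key/value arguments. */
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            m.put(keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    /**
     * Counts the values of dim. With sideways=false every drilldown filter applies.
     * With sideways=true the filter on dim itself is skipped, so sibling values of
     * the drilled-down dimension are still counted.
     */
    static Map<String, Integer> count(List<Map<String, String>> docs,
                                      Map<String, String> drillDowns,
                                      String dim, boolean sideways) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> d : docs) {
            boolean passes = true;
            for (Map.Entry<String, String> dd : drillDowns.entrySet()) {
                if (sideways && dd.getKey().equals(dim)) {
                    continue; // ignore the filter on the dimension being counted
                }
                if (!dd.getValue().equals(d.get(dd.getKey()))) {
                    passes = false;
                    break;
                }
            }
            String value = d.get(dim);
            if (passes && value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In this sketch a plain drilldown simply omits the sibling values (the equivalent of count 0 on the slide), while the sideways count keeps them.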


• Range facets on NumericDocValues fields
  – Define the buckets of interest at query time
  – Supports any arbitrary ValueSource (Lucene 4.6.0)

// Aggregate matching documents into buckets
RangeAccumulator a = new RangeAccumulator(
    new RangeFacetRequest<LongRange>("field",
        new LongRange("1-5", 1L, true, 5L, true),
        new LongRange("6-20", 6L, true, 20L, true),
        new LongRange("21-100", 21L, true, 100L, true),
        new LongRange("over 100", 100L, false, Long.MAX_VALUE, true)));

Dynamic Facets
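To make the inclusive/exclusive bounds concrete, here is a minimal plain-Java sketch of range bucketing. It does not use Lucene; the class RangeBucketSketch and its nested Bucket type are hypothetical and only mirror the (label, min, minInclusive, max, maxInclusive) shape of the ranges shown on the slide.

```java
/** Toy range bucketing over long values. Hypothetical, not Lucene code. */
class RangeBucketSketch {

    static final class Bucket {
        final String label;
        final long min, max;
        final boolean minInclusive, maxInclusive;

        Bucket(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        /** True if the value falls inside this bucket, honoring both bound flags. */
        boolean accept(long v) {
            boolean aboveMin = minInclusive ? v >= min : v > min;
            boolean belowMax = maxInclusive ? v <= max : v < max;
            return aboveMin && belowMax;
        }
    }

    /** Returns one count per bucket; a value may match zero, one, or several buckets. */
    static int[] count(Bucket[] buckets, long[] values) {
        int[] counts = new int[buckets.length];
        for (long v : values) {
            for (int i = 0; i < buckets.length; i++) {
                if (buckets[i].accept(v)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }
}
```

Because the buckets are defined only at query time, nothing about them needs to be known when the numeric field is indexed.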

• Not all facets are created equal
  – Categories added by an automatic categorization system, e.g. Category/Apache Lucene (0.74) (confidence level is 0.74)
  – Important metadata about the facet, e.g. Contracts/US ($5M) (total revenue generated from contracts)
  – Complex structures, e.g. Users/Shai Erera (lastAccess=YYYY/MM/DD, numUpdates=8…)

• Categories can have values associated with them per document
  – They are later aggregated by these values
  – NOTE: ≠ NumericDocValuesFields!

• Facet associations are completely customizable – encoded as a byte[] per document

Facet Associations

http://shaierera.blogspot.com/2013/01/facet-associations.html
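As a rough illustration of "a byte[] per document that is later aggregated", the sketch below packs (ordinal, float value) pairs into a byte array and sums the values per ordinal. The layout (4-byte int followed by 4-byte float) and the class name AssociationSketch are assumptions made for this example; Lucene's actual association encoding may differ.

```java
import java.nio.ByteBuffer;

/** Toy per-document association payload. Hypothetical layout, not Lucene's format. */
class AssociationSketch {

    /** Packs each (ordinal, value) pair as 4 bytes of int followed by 4 bytes of float. */
    static byte[] encode(int[] ordinals, float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(ordinals.length * 8);
        for (int i = 0; i < ordinals.length; i++) {
            buf.putInt(ordinals[i]);
            buf.putFloat(values[i]);
        }
        return buf.array();
    }

    /** Aggregation function: sums the associated values of every ordinal over all documents. */
    static float[] sum(byte[][] docs, int numOrdinals) {
        float[] sums = new float[numOrdinals];
        for (byte[] doc : docs) {
            ByteBuffer buf = ByteBuffer.wrap(doc);
            while (buf.remaining() >= 8) {
                int ordinal = buf.getInt();
                float value = buf.getFloat();
                sums[ordinal] += value;
            }
        }
        return sums;
    }
}
```

Swapping the sum for another function (max, weighted average) changes the aggregation without touching the encoded payload.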


• Complements
  – Holds the count of each category in-memory, per IndexReader
  – When the number of search results is >50% of the index, count the "complement set"
  – Useful for "overview" queries, e.g. MatchAllDocsQuery
• Sampling
  – Aggregate a sampled set of the search results
  – Optionally re-count top-K facets for accurate values
• Partitions
  – Partition the taxonomy space to control memory usage during faceted search
  – Useful for very big taxonomies (10s of millions of categories)

More Features
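The complements idea can be sketched in a few lines of plain Java: when more than half of the documents match, count the documents that did not match and subtract from cached totals. This is a simplified illustration that ignores deleted documents; the class ComplementSketch and its method names are hypothetical.

```java
import java.util.BitSet;

/** Toy complement counting. Hypothetical, not Lucene code. */
class ComplementSketch {

    /** Counts every ordinal attached to the documents set in docs. */
    static int[] countDirect(BitSet docs, int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /**
     * totalCounts holds the per-ordinal counts over the whole index (cached once).
     * If more than 50% of the documents match, visit the smaller complement set instead.
     */
    static int[] count(BitSet matches, int maxDoc, int[][] docOrdinals, int[] totalCounts) {
        if (matches.cardinality() * 2 <= maxDoc) {
            return countDirect(matches, docOrdinals, totalCounts.length);
        }
        BitSet complement = (BitSet) matches.clone();
        complement.flip(0, maxDoc);
        int[] missing = countDirect(complement, docOrdinals, totalCounts.length);
        int[] counts = totalCounts.clone();
        for (int ord = 0; ord < counts.length; ord++) {
            counts[ord] -= missing[ord];
        }
        return counts;
    }
}
```

For a MatchAllDocsQuery the complement set is empty, so the cached totals are returned without visiting any document.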

Lucene Facets Under the Hood

• The taxonomy maps categories to integer codes (referred to as ordinals)
  – Kind of like a Map<CategoryPath,Integer>, with hierarchy support
  – Provides taxonomy browsing services
  – DirectoryTaxonomyWriter is managed as a sidecar Lucene index
• Categories are broken down to their path components, e.g. Date/2012/March/20 becomes:
  – Date, with ordinal=1
  – Date/2012, with ordinal=2
  – Date/2012/March, with ordinal=3
  – Date/2012/March/20, with ordinal=4

The Taxonomy Index
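The "Map<CategoryPath,Integer> with hierarchy support" analogy can be made concrete with a small in-memory sketch. It assigns an ordinal to every path prefix and remembers each ordinal's parent. The class TinyTaxonomy is hypothetical and much simpler than DirectoryTaxonomyWriter; reserving ordinal 0 for the root is an assumption chosen so that the numbering matches the example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy taxonomy: one ordinal per path prefix, plus a parent array. Not Lucene code. */
class TinyTaxonomy {
    static final int ROOT = 0; // assumption: ordinal 0 is reserved for the root

    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Integer> parents = new ArrayList<>();

    TinyTaxonomy() {
        ordinals.put("", ROOT);
        parents.add(-1); // the root has no parent
    }

    /** Adds a category and all of its ancestors; returns the ordinal of the full path. */
    int addCategory(String... components) {
        int parent = ROOT;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) {
                path.append('/'); // components containing '/' are not handled in this sketch
            }
            path.append(component);
            Integer ord = ordinals.get(path.toString());
            if (ord == null) {
                ord = parents.size(); // next free ordinal
                ordinals.put(path.toString(), ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    /** Returns the ordinal of a '/'-delimited path, or -1 if it was never added. */
    int getOrdinal(String path) {
        Integer ord = ordinals.get(path);
        return ord == null ? -1 : ord;
    }

    /** Returns the parent ordinal, which is what enables hierarchical browsing. */
    int getParent(int ordinal) {
        return parents.get(ordinal);
    }
}
```

Adding Date/2012/April afterwards would reuse ordinals 1 and 2 and allocate only one new ordinal for the new leaf.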

• Categories are added as drilldown terms, e.g. for Date/2012/March/20:
  – $facets:Date
  – $facets:Date/2012
  – …
• All category ordinals associated with the document are added as a BinaryDocValuesField
  – The ordinals of all path components are added, not just the leaf's
  – Encoded as VInt + gap for efficient compression and speed
    • Other compression methods were attempted, but were slower to decode (LUCENE-4609)
  – Used during faceted search to read all the associated ordinals and aggregate accordingly (e.g. count)

The Search Index

https://issues.apache.org/jira/browse/LUCENE-4609
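The "VInt + gap" encoding can be sketched as follows: sort the ordinals, store each one as the difference from the previous one, and write every difference with a variable number of bytes (7 payload bits per byte). This is an illustration of the general technique only; the byte order used here (low-order bits first) and the class name DGapVIntSketch are assumptions, not a description of Lucene's on-disk format.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy delta-gap + variable-length int codec for non-negative ordinals. Not Lucene code. */
class DGapVIntSketch {

    /** Sorts the ordinals, then writes the gap to the previous ordinal as a VInt. */
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int gap = ord - prev;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // high bit set: more bytes follow
                gap >>>= 7;
            }
            out.write(gap);
            prev = ord;
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads VInts and accumulates the gaps back into ordinals. */
    static int[] decode(byte[] bytes) {
        List<Integer> result = new ArrayList<>();
        int prev = 0;
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;
            } else {
                prev += value;
                result.add(prev);
                value = 0;
                shift = 0;
            }
        }
        int[] ordinals = new int[result.size()];
        for (int i = 0; i < ordinals.length; i++) {
            ordinals[i] = result.get(i);
        }
        return ordinals;
    }
}
```

Since all path components of a category tend to receive nearby ordinals, the gaps are usually small and most of them fit in a single byte.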

• SortedSetFacetFields add SortedSetDocValuesFields and drilldown terms to documents
• Local-segment SortedSet ordinals are mapped to global ones through SortedSetDocValuesReaderState
• Use SortedSetDocValuesAccumulator to accumulate SortedSet facets
• Advantages:
  – Taxonomy representation requires less RAM (flat taxonomy)
  – No sidecar index
  – Tie-breaks by label-sort order
• Disadvantages:
  – Not full taxonomy
  – Overall uses more RAM (local-to-global ordinal mapping)
  – Adds NRT reopen cost
  – Slower than taxonomy-based facets

SortedSet Facets

• Per-segment integer codes (as used by the SortedSet approach) are less efficient
  – Different ordinals for the same categories across segments
  – Must hold an in-memory codes map (e.g. local-to-global): more RAM and less scalable
  – Must resolve top-K on the String representation of categories: more CPU
• Global ordinals allow efficient per-segment faceting and aggregation
  – No translation maps required (no extra RAM, highly scalable)
  – Aggregation and top-K computation are done on integer codes
• But they do not play well with IndexWriter.addIndexes(Directory…)
  – Must use IndexWriter.addIndexes(IndexReader…), so that the ordinals in the input search index are mapped to the destination's

Global Ordinals
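To show what the local-to-global translation costs, the sketch below builds such a map for per-segment ordinals and uses it to merge per-segment counts. It is a simplified model (each segment is just a sorted array of unique labels) with hypothetical names; it is not how SortedSetDocValuesReaderState is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy local-to-global ordinal mapping across segments. Hypothetical, not Lucene code. */
class OrdinalMapSketch {

    /**
     * segmentLabels.get(s)[local] is the label of local ordinal `local` in segment s.
     * Returns maps[s][local] = global ordinal, where global ordinals follow label sort order.
     */
    static int[][] buildLocalToGlobal(List<String[]> segmentLabels) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] labels : segmentLabels) {
            all.addAll(Arrays.asList(labels));
        }
        String[] global = all.toArray(new String[0]);
        int[][] maps = new int[segmentLabels.size()][];
        for (int seg = 0; seg < maps.length; seg++) {
            String[] labels = segmentLabels.get(seg);
            maps[seg] = new int[labels.length];
            for (int local = 0; local < labels.length; local++) {
                maps[seg][local] = Arrays.binarySearch(global, labels[local]);
            }
        }
        return maps;
    }

    /** Folds per-segment counts (indexed by local ordinal) into counts by global ordinal. */
    static int[] mergeCounts(int[][] localToGlobal, int[][] segmentCounts, int numGlobal) {
        int[] counts = new int[numGlobal];
        for (int seg = 0; seg < segmentCounts.length; seg++) {
            for (int local = 0; local < segmentCounts[seg].length; local++) {
                counts[localToGlobal[seg][local]] += segmentCounts[seg][local];
            }
        }
        return counts;
    }
}
```

The maps array is the extra RAM the slide refers to, and it has to be rebuilt whenever the set of segments changes. With taxonomy-wide ordinals the per-segment counts can be added together directly.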

• FacetsCollector works in two steps:
  – Collects matching documents (and optionally their scores)
  – Invokes FacetsAccumulator to accumulate the top-K facets
• Performance tests show that this improves faceted search (LUCENE-4600)
  – Locality of reference?
• Useful for Sampling and Complements
  – Hard to do otherwise

Two-Phase Aggregation

https://issues.apache.org/jira/browse/LUCENE-4600
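The two steps can be sketched in plain Java: first only remember which documents matched, then walk that list to count ordinals and pick the top K. The class TwoPhaseSketch and its methods are hypothetical stand-ins for the roles of the collector and the accumulator, not their actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy two-phase facet counting. Hypothetical, not Lucene code. */
class TwoPhaseSketch {

    /** Phase 1: record the matching document IDs; no facet work happens here. */
    static List<Integer> collect(boolean[] matches) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < matches.length; doc++) {
            if (matches[doc]) {
                docs.add(doc);
            }
        }
        return docs;
    }

    /** Phase 2a: count every ordinal attached to the collected documents. */
    static int[] accumulate(List<Integer> docs, int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc : docs) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Phase 2b: the k ordinals with the highest non-zero counts; ties go to the lower ordinal. */
    static int[] topK(int[] counts, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        int n = Math.min(k, ords.size());
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = ords.get(i);
        }
        return top;
    }
}
```

Having the full list of matches before aggregation starts is also what makes sampling and complements straightforward: both need to know the size of the result set up front.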

• Determine how facets are encoded
  – Partition size
  – Facet delimiter character (for drilldown terms, default \u001F)
  – CategoryListParams
• CategoryListParams holds parameters for a category list
  – Encoder/Decoder (default DGapVInt)
  – OrdinalPolicy (how path components are encoded): ALL_PARENTS, NO_PARENTS and ALL_BUT_DIMENSION (default)
• CategoryListParams can be used to group facets together
  – Default: all facets are put in the same "category list" (i.e. one BinaryDocValues field)
  – Expert: separate categories by dimension into different category lists
    • Useful when sets of categories are always aggregated together, but not with other categories
• FacetIndexingParams are currently not recorded per-segment and therefore you should be careful if you suddenly change them!

FacetIndexingParams
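One way to read the three OrdinalPolicy names is sketched below: given the ordinals along a category path (dimension first, leaf last), the policy decides which of them are written for the document. The interpretation in this sketch is an assumption based on the policy names (the leaf is always written, NO_PARENTS writes nothing else, ALL_BUT_DIMENSION skips only the top-level dimension); the class OrdinalPolicySketch is hypothetical.

```java
import java.util.Arrays;

/** Toy illustration of which path ordinals get encoded per policy. Not Lucene code. */
class OrdinalPolicySketch {

    enum Policy { ALL_PARENTS, NO_PARENTS, ALL_BUT_DIMENSION }

    /**
     * pathOrdinals holds the ordinals from the dimension (index 0) down to the leaf (last).
     * Returns the ordinals that would be written for the document under the given policy.
     */
    static int[] ordinalsToEncode(int[] pathOrdinals, Policy policy) {
        int length = pathOrdinals.length;
        switch (policy) {
            case NO_PARENTS:
                // assumption: only the leaf is stored; parents must be rolled up at search time
                return new int[] { pathOrdinals[length - 1] };
            case ALL_BUT_DIMENSION:
                // assumption: skip the dimension, but never drop the leaf itself
                return length == 1
                        ? pathOrdinals.clone()
                        : Arrays.copyOfRange(pathOrdinals, 1, length);
            default:
                return pathOrdinals.clone();
        }
    }
}
```

Under this reading, the policy trades index size against search-time work: fewer stored ordinals mean smaller documents, but parent counts then have to be derived from their children.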

Questions?

Date post:	11-May-2015
Category:	Technology
Upload:	lucenerevolution