
Lucene Boot Camp

Date post: 05-Apr-2018
  • 7/31/2019 Luce Ne Bootcamp

    1/83

    Lucene Boot Camp

    Grant Ingersoll

    Lucid Imagination

    Nov. 12, 2007

    Atlanta, Georgia


    Intro

    My Background

    Your Background

    Brief History of Lucene

    Goals for Tutorial

    Understand Lucene core capabilities

    Real examples, real code, real data

    Ask Questions!!!!!


    Schedule

    1. 10-10:10 Introducing Lucene and Search

    2. 10:10-12 Indexing, Analysis, Searching, Performance

    3. 12-12:05 Break

    4. 12:05-1 More on Indexing, Analysis, Searching, Performance

    5. 1-2:30 Lunch

    6. 2:30-2:40 Recap, Questions, Content

    7. 2:40-4 Class Example

    8. 4-4:20 Break

    9. 4:20-5 Class Example

    10. 5-5:20 Lucene Contributions (time permitting)

    11. 5:20-5:25 Open Discussion (time permitting)

    12. 5:25-5:30 Resources/Wrap Up


    Lucene is

    NOT a crawler

    See Nutch

    NOT an application

    See PoweredBy on the Wiki

    NOT a library for doing Google PageRank or other link analysis algorithms

    See Nutch

    A library for enabling text based search


    A Few Words about Solr

    HTTP-based Search Server

    XML Configuration

    XML, JSON, Ruby, PHP, Java support

    Caching, Replication

    Many, many nice features that Lucene users need

    http://lucene.apache.org/solr


    Search Basics

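    Goal: Identify documents that are similar to input query

    Lucene uses a modified Vector Space Model (VSM)

    Boolean + VSM

    TF-IDF

    The words in the document and the query each define a Vector in an n-dimensional space

    $\mathrm{Sim}(q_1, d_1) = \cos\theta$, where $\theta$ is the angle between the query vector $q_1$ and the document vector $d_1$

    $d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$

    $q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$

    w = weight assigned to term

    In Lucene, boolean approach restricts what documents to score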
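
The Boolean + VSM idea on this slide can be sketched in a few lines of plain Python. This is not Lucene code and not Lucene's exact scoring formula: the function names (`tf_idf_vectors`, `cosine`, `rank`) and the smoothed idf are assumptions made for illustration only.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return ([{term: weight} per document], idf table) for tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf; an illustrative choice, not Lucene's exact formula.
    idf = {t: 1.0 + math.log(n / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Score only documents that share a term with the query, best first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    # Boolean step: restrict which documents get scored at all.
    candidates = [i for i, d in enumerate(docs) if set(d) & set(query)]
    return sorted(((cosine(q, vectors[i]), i) for i in candidates), reverse=True)
```

Calling `rank(["robin", "hood"], docs)` on three tokenized titles returns `(score, doc_id)` pairs; documents with no query term are never scored.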


    Indexing

    Process of preparing and adding text to Lucene

    Optimized for searching

    Key Point: Lucene only indexes Strings

    What does this mean?

    Lucene doesn't care about XML, Word, PDF, etc.

    There are many good open source extractors available

    It's our job to convert whatever file format we have into something Lucene can use


    Indexing Classes

    Analyzer

    Creates tokens using a Tokenizer and filters them through zero or more TokenFilters

    IndexWriter

    Responsible for converting text into internal Lucene format


    Indexing Classes

    Directory

    Where the Index is stored

    RAMDirectory, FSDirectory, others

    Document

    A collection of Fields

    Can be boosted

    Field

    Free text, keywords, dates, etc.

    Defines attributes for storing, indexing

    Can be boosted

    Field Constructors and parameters

    Open up Fieldable and Field in IDE


    How to Index

    Create IndexWriter

    For each input

    Create a Document

    Add Fields to the Document

    Add the Document to the IndexWriter

    Close the IndexWriter

    Optimize (optional)


    Task 1.a

    From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start

    Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer

    Boost every 10 documents by 3

    Questions to Answer:

    What Fields should I define?

    What attributes should each Field have?

    What Fields should OMIT_NORMS?

    Pick a field to boost and give a reason why you think it should be boosted


    Use the Luke


    Searching

    Key Classes:

    Searcher

    Provides methods for searching

    Take a moment to look at the Searcher class declaration

    IndexSearcher, MultiSearcher, ParallelMultiSearcher

    IndexReader

    Loads a snapshot of the index into memory for searching

    Hits

    Storage/caching of results from searching

    QueryParser

    JavaCC grammar for creating Lucene Queries

    http://lucene.apache.org/java/docs/queryparsersyntax.html

    Query

    Logical representation of program's information need


    Query Parsing

    Basic syntax:

    title:hockey +(body:stanley AND body:cup)

    OR/AND must be uppercase

    Default operator is OR (can be changed)

    Supports fairly advanced syntax, see the website

    http://lucene.apache.org/java/docs/queryparsersyntax.html

    Doesn't always play nice, so beware

    Many applications construct queries programmatically or restrict syntax


    Task 1.b

    Using the ReutersIndexerTest.java skeleton in the boot camp files

    Search your newly created index using queries you develop

    Delete a Document by the doc id

    Hints:

    Use an IndexSearcher

    Create a Query using the QueryParser

    Display the results from the Hits

    Questions:

    What is the default field for the QueryParser?

    What Analyzer to use?


    Task 1 Results

    Locks

    Lucene maintains locks on files to prevent index corruption

    Located in same directory as index

    Scores from Hits are normalized

    Scores across queries are NOT comparable

    Lucene 2.3 has some transactional semantics for indexing, but is not a DB


    Deletion and Updates

    Deletions can be a bit confusing

    Both IndexReader and IndexWriter have delete methods

    Updates are always a delete and an add

    Updates are always a delete and an add

    Yes, that is a repeat!

    Nature of data structures used in search
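
A minimal sketch of why an update is a delete plus an add, assuming an append-only store where old entries are only marked as deleted. `ToyIndex` and its methods are hypothetical names for illustration, not Lucene classes.

```python
class ToyIndex:
    """Append-only document store with a deleted-id set (illustrative only)."""

    def __init__(self):
        self.docs = []        # documents are never modified in place
        self.deleted = set()  # ids of documents marked as deleted

    def add(self, doc):
        self.docs.append(doc)
        return len(self.docs) - 1  # the new document's id

    def delete_by_term(self, field, value):
        hits = [i for i, d in enumerate(self.docs)
                if i not in self.deleted and d.get(field) == value]
        self.deleted.update(hits)
        return len(hits)

    def update(self, field, value, new_doc):
        # An update is expressed as: delete the old version, add the new one.
        self.delete_by_term(field, value)
        return self.add(new_doc)

    def live_docs(self):
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]
```

Note that the updated document gets a new id; in this sketch the old entry still occupies space until something compacts the store.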


    Analysis

    Analysis is the process of creating Tokens to be indexed

    Analysis is usually done to improve results overall, but it comes with a price

    Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals

    See contrib/analyzers

    StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks

    Often times you want the same content analyzed in different ways

    Consider a catch-all Field in addition to other Fields


    Commonly Used Analyzers

    StandardAnalyzer

    WhitespaceAnalyzer

    PerFieldAnalyzerWrapper

    SimpleAnalyzer


    Indexing in a Nutshell

    For each Document

    For each Field to be tokenized

    Create the tokens using the specified Tokenizer

    Tokens consist of a String, position, type and offset information

    Pass the tokens through the chained TokenFilters where they can be changed or removed

    Add the end result to the inverted index

    Position information can be altered

    Useful when removing words or to prevent phrases from matching
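
The tokenizer-plus-filters chain can be sketched with Python generators. The `Token` fields, the regex tokenizer and the three-word stopword list are assumptions for illustration; Lucene's own Tokenizer and TokenFilter classes are not used here.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    start: int        # character offset where the token begins
    end: int          # character offset just past the token
    pos_inc: int = 1  # distance from the previous token's position

def tokenize(text):
    """Tokenizer: split on non-word characters, recording offsets."""
    for m in re.finditer(r"\w+", text):
        yield Token(m.group(), m.start(), m.end())

def lowercase(tokens):
    """Filter: change each token's text."""
    for t in tokens:
        yield Token(t.text.lower(), t.start, t.end, t.pos_inc)

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Filter: drop tokens, widening the gap so positions stay truthful."""
    skipped = 0
    for t in tokens:
        if t.text in stopwords:
            skipped += t.pos_inc
        else:
            yield Token(t.text, t.start, t.end, t.pos_inc + skipped)
            skipped = 0

def analyze(text):
    return list(remove_stopwords(lowercase(tokenize(text))))
```

For "The Wizard of Oz" the surviving tokens are `wizard` and `oz`, each with a position increment of 2, so an exact phrase query for "wizard oz" would not match across the removed word.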


    Inverted Index

    Documents:

    0  Little Red Riding Hood
    1  Robin Hood
    2  Little Women

    Term -> document IDs:

    aardvark
    hood      0, 1
    little    0, 2
    red       0
    riding    0
    robin     1
    women     2
    zoo
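
The table on this slide can be reproduced with a small sketch. `build_inverted_index` is a hypothetical helper, and lowercasing plus whitespace splitting stands in for real analysis.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a term repeated within one document is posted only once
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

titles = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(titles)
```

Looking up `index["hood"]` gives the ids of the two titles that contain the word, without scanning the documents themselves.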


    Tokenization

    Split words into Tokens to be processed

    Tokenization is fairly straightforward for most languages that use a space for word segmentation

    More difficult for some East Asian languages

    See the CJK Analyzer


    Modifying Tokens

    TokenFilters are used to alter the token stream to be indexed

    Common tasks:

    Remove stopwords

    Lower case

    Stem/Normalize (e.g. Wi-Fi -> Wi Fi)

    Add Synonyms

    StandardAnalyzer does things that you may not want


    Custom Analyzers

    Solution: write your own Analyzer

    Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects

    See Solr

    Tokenizers and TokenFilters must be newly constructed for each input


    Special Cases

    Dates and numbers need special treatment to be searchable

    o.a.l.document.DateTools

    org.apache.solr.util.NumberUtils

    Altering Position Information

    Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries

    Index synonyms at the same position so query can match regardless of synonym used
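
One reason dates and numbers need special treatment is that indexed terms are compared as strings. The sketch below shows the problem and a common workaround; `pad_number` and `date_to_term` are hypothetical helpers, not the DateTools or NumberUtils APIs named above.

```python
from datetime import date

def pad_number(n, width=10):
    """Zero-pad a non-negative integer so string order equals numeric order."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return str(n).zfill(width)

def date_to_term(d):
    """Encode a date as yyyymmdd, coarse enough for day-level ranges."""
    return d.strftime("%Y%m%d")

# Plain strings sort lexicographically, which is wrong for numbers.
naive = sorted(["9", "10", "100"])
padded = sorted(pad_number(n) for n in [9, 10, 100])
```

Choosing a coarse resolution such as whole days also keeps the number of distinct terms small, which matters for range queries.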


    5 minute Break


    Indexing Performance

    Behind the Scenes

    Lucene indexes Documents into memory

    At certain trigger points, memory (segments) are flushed to the Directory

    Segments are periodically merged

    Lucene 2.3 has significant performance improvements


    IndexWriter Performance

    Factors

    maxBufferedDocs

    Minimum # of docs before merge occurs and a new segment is created

    Usually, Larger == faster, but more RAM

    mergeFactor

    How often segments are merged

    Smaller == less RAM, better for incremental updates

    Larger == faster, better for batch indexing

    maxFieldLength

    Limit the number of terms in a Document


    Lucene 2.3 IndexWriter Changes

    setRAMBufferSizeMB

    New model for automagically controlling indexing factors based on the amount of memory in use

    Obsoletes setMaxBufferedDocs and setMergeFactor

    Takes storage and term vectors out of the merge process

    Turn off auto-commit if there are stored fields and term vectors

    Provides significant performance increase


    Index Threading

    IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization

    One open IndexWriter per Directory

    Parallel Indexing

    Index to separate Directory instances

    Merge using IndexWriter.addIndexes

    Could also distribute and collect


    Benchmarking Indexing

    contrib/benchmark

    Try out different algorithms between Lucene 2.2 and trunk (2.3)

    contrib/benchmark/conf:

    indexing.alg

    indexing-multithreaded.alg

    Info:

    Mac Pro 2 x 2GHz Dual-Core Xeon

    4 GB RAM

    ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M


    Benchmarking Results

                    Records/Sec   Avg. T Mem
    2.2                     421          39M
    Trunk                 2,122          52M
    Trunk-mt (4)          3,680          57M

    Your results will depend on analysis, etc.
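
For reference, the relative throughput implied by the Records/Sec column can be worked out directly. The figures are the ones on the slide; `speedup` is a hypothetical helper.

```python
# Records/sec as reported on the slide.
results = {"2.2": 421, "Trunk": 2122, "Trunk-mt (4)": 3680}

def speedup(results, baseline="2.2"):
    """Throughput of each run relative to the baseline, rounded to 2 places."""
    base = results[baseline]
    return {name: round(rate / base, 2) for name, rate in results.items()}

ratios = speedup(results)
```

On these numbers trunk indexes roughly five times as many records per second as 2.2, and the four-thread run close to nine times as many.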


    Searching

    Earlier we touched on basics of search using the QueryParser

    Now look at:

    Searcher/IndexReader Lifecycle

    Query classes

    More details on the QueryParser

    Filters

    Sorting


    Lifecycle

    Recall that the IndexReader loads a snapshot of index into memory

    This means updates made since loading the index will not be seen

    Business rules are needed to define how often to reload the index, if at all

    IndexReader.isCurrent() can help

    Loading an index is an expensive operation

    Do not open a Searcher/IndexReader for every search
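
The lifecycle advice amounts to caching one reader and reopening it only when the index has changed. The sketch below models that with toy classes; `ToyDirectory`, `ToyReader` and `CachedReader` are hypothetical, and the version counter is an assumed stand-in for whatever an `isCurrent()` check compares.

```python
class ToyDirectory:
    """Stand-in for an index location whose version bumps on every commit."""
    def __init__(self):
        self.version = 0
        self.docs = []

    def commit(self, doc):
        self.docs.append(doc)
        self.version += 1

class ToyReader:
    """Point-in-time snapshot; later commits are invisible to it."""
    opened = 0  # counts how many (expensive) snapshot loads happened

    def __init__(self, directory):
        self.directory = directory
        self.version = directory.version
        self.docs = list(directory.docs)
        ToyReader.opened += 1

    def is_current(self):
        return self.version == self.directory.version

class CachedReader:
    """Hands out one shared reader and reopens it only when it is stale."""
    def __init__(self, directory):
        self.directory = directory
        self.reader = ToyReader(directory)

    def acquire(self):
        if not self.reader.is_current():
            self.reader = ToyReader(self.directory)
        return self.reader
```

A real application would decide when to call the staleness check based on its own business rules (every request, on a timer, or after an indexing batch) rather than on every search.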


    Query Classes

    TermQuery is basis for all non-span queries

    BooleanQuery combines multiple Query instances as clauses

    should

    required

    PhraseQuery finds terms occurring near each other, position-wise

    slop is the edit distance between two terms

    Take 2-3 minutes to explore Query implementations
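
One way to picture slop is as the number of position moves needed to line the terms up as an exact phrase. The sketch below is a simplified model written for this note, not Lucene's PhraseQuery implementation; `phrase_matches` and `positions_of` are hypothetical helpers.

```python
from itertools import product

def positions_of(tokens, term):
    """All positions at which a term occurs in a token list."""
    return [i for i, t in enumerate(tokens) if t == term]

def phrase_matches(positions, slop=0):
    """positions: one list of token positions per phrase term, in phrase order."""
    for combo in product(*positions):
        # Shift each position back by its offset in the phrase; for an exact
        # phrase all shifted positions are equal.
        shifted = [p - i for i, p in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False

tokens = "lord stanley donated the cup".split()
phrase = [positions_of(tokens, "stanley"), positions_of(tokens, "cup")]
```

In this model two words stand between "stanley" and "cup", so the phrase matches only once slop reaches 2; swapping two adjacent terms also costs 2.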


    Spans

    Spans provide information about where matches took place

    Not supported by the QueryParser

    Can be used in BooleanQuery clauses

    Take 2-3 minutes to explore SpanQuery classes

    SpanNearQuery useful for doing phrase matching


    QueryParser

    MultiFieldQueryParser

    Boolean operators cause confusion

    Better to think in terms of required (+ operator) and not allowed (- operator)

    Check JIRA for QueryParser issues

    http://www.gossamer-threads.com/lists/lucene/java-user/40945

    Most applications either modify QP, create their own, or restrict to a subset of the syntax

    Your users may not need all the flexibility of the QP


    Sorting

    Lucene default sort is by score

    Searcher has several methods that take in a Sort object

    Sorting should be addressed during indexing

    Sorting is done on Fields containing a single term that can be used for comparison

    The SortField defines the different sort types available

    AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC


    Sorting II

    Look at Searcher, Sort and SortField

    Custom sorting is done with a SortComparatorSource

    Sorting can be very expensive

    Terms are cached in the FieldCache

    SortFilterTest.java example


    Filters

    Filters restrict the search space to a subset of Documents

    Use Cases

    Search within a Search

    Restrict by date

    Rating

    Security

    Author


    Filter Classes

    QueryWrapperFilter (QueryFilter)

    Restrict to subset of Documents that match a Query

    RangeFilter

    Restrict to Documents that fall within a range

    Better alternative to RangeQuery

    CachingWrapperFilter

    Wrap another Filter and provide caching

    SortFilterTest.java example
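
Conceptually a filter is a set of allowed document ids that is computed once and reused across searches. The classes below are toy stand-ins named for illustration; they are not the Lucene RangeFilter or CachingWrapperFilter, and the cache assumes a single unchanging snapshot of the documents.

```python
class ToyRangeFilter:
    """Allow documents whose field value lies in [low, high] (string order)."""
    def __init__(self, field, low, high):
        self.field, self.low, self.high = field, low, high

    def doc_ids(self, docs):
        return {i for i, d in enumerate(docs)
                if self.low <= d[self.field] <= self.high}

class ToyCachingFilter:
    """Wrap another filter and compute its id set only once."""
    def __init__(self, inner):
        self.inner = inner
        self._cache = None
        self.computed = 0  # how many times the inner filter actually ran

    def doc_ids(self, docs):
        if self._cache is None:
            self._cache = self.inner.doc_ids(docs)
            self.computed += 1
        return self._cache

def search(docs, term, flt=None):
    """Return ids of documents containing term, restricted by the filter."""
    allowed = flt.doc_ids(docs) if flt else set(range(len(docs)))
    return [i for i, d in enumerate(docs)
            if i in allowed and term in d["body"].split()]
```

Because the filter only restricts which documents are considered, the same cached id set can serve many different queries.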


    Expert Results

    Searcher has several expert methods

    Hits is not always what you need due to:

    Caching

    Normalized Scores

    Reexecutes Query repeatedly as results are accessed

    HitCollector allows low-level access to all Documents as they are scored

    TopDocs represents top n docs that match

    TopDocsTest in examples
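
The collector idea (see every hit as it is scored, keep only the best n) can be sketched with a bounded heap. `TopNCollector` is a hypothetical class written for this note; it mimics the role described for HitCollector and TopDocs but does not use their APIs.

```python
import heapq

class TopNCollector:
    """Receive (doc_id, score) pairs one by one and keep the n best."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.heap = []        # min-heap of (score, -doc_id); worst kept hit on top
        self.total_hits = 0

    def collect(self, doc_id, score):
        self.total_hits += 1
        entry = (score, -doc_id)  # negated id: on equal scores, lower id wins
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, entry)
        elif entry > self.heap[0]:
            heapq.heapreplace(self.heap, entry)

    def top_docs(self):
        """Best hits first, as (doc_id, raw score) pairs."""
        return [(-neg_id, score) for score, neg_id in sorted(self.heap, reverse=True)]
```

Scores here are raw, not normalized, and nothing is re-executed when results are read, which are the two drawbacks the slide lists for Hits.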


    Searchers

    MultiSearcher

    Search over multiple Searchables, including remote

    MultiReader

    Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes

    ParallelMultiSearcher

    Like MultiSearcher, but threaded

    RemoteSearchable

    RMI based remote searching

    Look at MultiSearcherTest in example code


    Search Performance

    Search speed is based on a number of factors:

    Query Type(s)

    Query Size

    Analysis

    Occurrences of Query Terms

    Optimize

    Index Size

    Index type (RAMDirectory, other)

    Usual Suspects

    CPU

    Memory

    I/O

    Business Needs


    Query Types

    Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards

    Avoid starting a WildcardQuery with wildcard

    Use ConstantScoreRangeQuery instead of RangeQuery

    Be careful with range queries and dates

    User mailing list and Wiki have useful tips for optimizing date handling
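
A rough sketch of why wildcard placement matters, assuming a sorted term dictionary: a literal prefix lets the scan start in the right place and stop early, while a leading wildcard forces a look at every term. `expand_wildcard` is a hypothetical helper built on Python's `fnmatch`, not Lucene's WildcardQuery.

```python
import bisect
import fnmatch

def expand_wildcard(pattern, sorted_terms):
    """Return (matching terms, number of dictionary entries examined)."""
    # Literal text before the first wildcard character, if any.
    prefix = pattern.split("*")[0].split("?")[0]
    start = bisect.bisect_left(sorted_terms, prefix) if prefix else 0
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        if prefix and not term.startswith(prefix):
            break                      # past the prefix range: stop early
        examined += 1
        if fnmatch.fnmatchcase(term, pattern):
            matches.append(term)
    return matches, examined

terms = sorted(["cat", "computer", "computing", "cup", "riding", "stanley"])
```

Every term in the returned list would become one clause of the rewritten query, so a broad pattern over a large dictionary gets expensive on both counts.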


    Query Size

    Stopword removal

    Search an all field instead of many fields with the same terms

    Disambiguation

    May be useful when doing synonym expansion

    Difficult to automate and may be slower

    Some applications may allow the user to disambiguate

    Relevance Feedback/More Like This

    Use most important words

    Important can be defined in a number of ways


    Usual Suspects

    CPU

    Profile your application

    Memory

    Examine your heap size, garbage collection approach

    I/O

    Cache your Searcher

    Define business logic for refreshing based on indexing needs

    Warm your Searcher before going live -- See Solr

    Business Needs

    Do you really need to support Wildcards?

    What about date range queries down to the millisecond?


    Explanations

    explain(Query, int) method is useful for understanding why a Document scored the way it did

    ExplainsTest in sample code

    Open Luke and try some queries and then use the explain button


    FieldSelector

    Prior to version 2.1, Lucene always loaded all Fields in a Document

    FieldSelector API addition allows Lucene to skip large Fields

    Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break

    Makes storage of original content more viable without large cost of loading it when not used

    FieldSelectorTest in example code


    Scoring and Similarity

    Lucene has sophisticated scoring mechanism designed to meet most needs

    Has hooks for modifying scores

    Scoring is handled by the Query, Weight and Scorer class


    Affecting Relevance

    FunctionQuery from Solr (variation in Lucene)

    Override Similarity

    Implement own Query and related classes

    Payloads

    HitCollector

    Take 5 to examine these


    Lunch

    1-2:30


    Recap

    Indexing

    Searching

    Performance

    Odds and Ends

    Explains

    FieldSelector

    Relevance


    Next Up

    Dealing with Content

    File Formats

    Extraction

    Large Task

    Miscellaneous

    Wrapping Up


    File Formats

    Several open source libraries, projects for extracting content to use in Lucene

    PDF: PDFBox

    http://www.pdfbox.org/

    Word: POI, Open Office, TextMining

    http://www.textmining.org/textmining.zip

    XML: SAX or Pull parser

    HTML: Neko, Jtidy

    http://people.apache.org/~andyc/neko/doc/html/

    http://jtidy.sourceforge.net/

    Tika

    http://incubator.apache.org/tika/

    Aperture

    http://aperture.sourceforge.net


    Aperture Basics

    Crawlers

    Data Connectors

    Extraction Wrappers

    POI, PDFBox, HTML, XML, etc.

    http://aperture.wiki.sourceforge.net/Extractors will give you info on what comes back from Aperture

    LuceneApertureCallbackHandler in example code


    Large Task

    Using the skeleton files in the com.lucenebootcamp.training.full package:

    Get some content:

    Web, file system

    Different file formats

    Index it

    Plan out your fields, boosts, field properties

    Support updates and deletes

    Optional: How fast can you make it go?

    Divide and conquer?

    Multithreaded?


    Large Task

    Search Content

    Allow for arbitrary user queries across multiple Fields via command line or simple web interface

    How fast can you make it?

    Support:

    Sort

    Filter

    Explains

    How much slower is it to retrieve an explanation?


    Large Task

    Document Retrieval

    Display/write out one or more documents

    Support FieldSelector


    Large Task

    Optional Tasks

    Hit Highlighting using contrib/Highlighter

    Multithreaded indexing and Search

    Explore other Field construction options

    Binary fields, term vectors

    Use Lucene trunk version and try out some of the changes in indexing

    Try out Solr or Nutch at http://lucene.apache.org/

    What do they offer that Lucene Java doesn't that you might need?


    Large Task Metadata

    Pair up if you want

    Ask questions

    2 hours

    Use Luke to check your index!

    Explore other parts of Lucene that you are interested in

    Be prepared to discuss/share with the class


    Large Task Post-Mortem

    Volunteers to share?


    Term Information

    TermEnum gives access to terms and how many Documents they occur in

    IndexReader.terms()

    IndexReader.termPositions()

    TermDocs gives access to the frequency of a term in a Document

    IndexReader.termDocs()

    Term Vectors give access to term frequency information in a given Document

    IndexReader.getTermFreqVector

    TermsTest in sample code


    Lucene Contributions

    Many people have generously contributed code to help solve common problems

    These are in contrib directory of the source

    Popular:

    Analyzers

    Highlighter

    Queries and MoreLikeThis

    Snowball Stemmers

    Spellchecker


    Open Discussion

    Multilingual Best Practices

    UNICODE

    One Index versus many

    Advanced Analysis

    Distributed Lucene

    Crawling

    Hadoop

    Nutch

    Solr


    Resources

    http://lucene.apache.org/

    http://en.wikipedia.org/wiki/Vector_space_model

    Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto

    Lucene In Action by Hatcher and Gospodnetić

    Wiki

    Mailing Lists

    [email protected]

    Discussions on how to use Lucene

    [email protected]

    Discussions on how to develop Lucene

    Issue Tracking

    https://issues.apache.org/jira/secure/Dashboard.jspa

    We always welcome patches

    Ask on the mailing list before reporting a bug


    Resources

    [email protected]


    Finally

    Please take the time to fill out a survey to help me improve this training

    Located in base directory of source

    Email it to me at [email protected]

    There are several Lucene related talks on Friday


    Extras


    Task 2

    Take 10-15 minutes, pair up, and write an Analyzer and Unit Test

    Examine results in Luke

    Run some searches

    Ideas:

    Combine existing Tokenizers and TokenFilters

    Normalize abbreviations

    Filter out all words beginning with the letter A

    Identify/Mark sentences

    Questions:

    What would help improve search results?


    Task 2 Results

    Share what you did and why

    Improving Results (in most cases)

    Stemming

    Ignore Case

    Stopword Removal

    Synonyms

    Pay attention to business needs


    Grab Bag

    Accessing Term Information

    TermEnum

    TermDocs

    Term Vectors

    FieldSelector

    Scoring and Similarity

    File Formats


    Task 6

    Count and print all the unique terms in the index and their frequencies

    Notes:

    Half of the class write it using TermEnum and TermDocs

    Other Half write it using Term Vectors

    Time your Task

    Only count the title and body content


    Task 6 Results

    Term Vector approach is faster on smaller collections

    TermEnum approach is faster on larger collections


    Task 4

    Re-index your collection

    Add in a rating field that randomly assigns a number between 0 and 9

    Write searches to sort by

    Date

    Title

    Rating, Date, Doc Id

    A Custom Sort

    Questions

    How to sort the title?

    How to sort multiple Fields?


    Task 4 Results

    Add stitle to use for sorting the title


    Task 5

    Create and search using Filters to:

    Restrict to all docs written on Feb. 26, 1987

    Restrict to all docs with the word computer in title

    Also:

    Create a Filter where the length of the body + title is greater than X


    Task 5 Results

    Solr has more advanced Filter mechanisms that may be worth using

    Cache filters


    Task 7

    Pair up if you like and take 30-40 minutes to:

    Pick two file formats to work on

    Identify content in that format

    Can you index contents on your hard drive?

    Project Gutenberg, Creative Commons, Wikipedia

    Combine w/ Reuters collection

    Extract the content and index it using the appropriate library

    Store the content as a Field

    Search the content

    Load Documents with and without FieldSelector and measure performance


    Task 7 (cont.)

    Include score and explanation in results

    Dump results to XML or HTML

    Be prepared to share with class what you did

    What libraries did you use?

    What content did you use?

    What is your Document structure?

    What issues did you have?


    20 Minute Break


    Task 7 Results

    Explain what your group did

    Build a Content Handler Framework

    Or help out with Tika


    Task 8

    Building on Task 7

    Incorporate one or more contrib packages into your solution

