1
The Fundamentals of Enterprise Search
www.searchtools.com/slides/kmw09/fundamentals-of-search.html
KMWorld 2009
Avi Rappoport, Search Tools Consulting
www.searchtools.com
2
What’s In This Workshop
• Overview of enterprise search, in context
• Search engine processes
  – Robot spiders, database access
  – Indexing
  – Security
  – Query parsing, retrieval, and relevance ranking
  – Usable search interfaces
  – Maintenance and analytics
• Methods for choosing a good search engine
Fundamentals of Search Engines 2009 / © Avi Rappoport, www.searchtools.com
3
About SearchTools
• Avi Rappoport is a librarian (MLIS from Berkeley)
  – Software developer and product manager
  – User interface designer
  – Long-time search consultant
• Editor & Publisher, www.searchtools.com
• Search Tools Consulting
  – Search needs analysis and recommendations
  – Enterprise search evaluation
  – Outsourced search administration
4
Defining Enterprise Search
• Large-scale web site search
  – Corporate sites
  – Institutional sites
  – Online stores
• Intranet search
  – Crossing departmental lines
  – Opening data silos
• Extranets
• Portal search
5
Similarities to Webwide Search
• Robot crawlers
• HTML over HTTP
• Scaling to millions of items
• Distributed processing
• Full-text indexing of content
• Simple query language
• Relevance ranking of results
  – TF-IDF (term frequency : inverse document frequency)
• Familiar results list
6
Differences from Web Search
• Limited scope
  – A site, set of sites, extranet, or intranet
• Few meaningful hyperlinks
  – PageRank and link analysis are less useful
• Security and access control issues
• Content in databases, CMSs, etc.
• More control
  – Index update scheduling
  – Some content is very valuable, other content is not
• No search spam
7
Text Search vs. Database Search
• Indexes multiple content sources
  – Database fields, files, web pages, feeds...
• Simple search commands instead of SQL
• Flexible indexing and retrieval
• Relevance ranking (this is a major issue)
• Does not compete for database resources
  – Easy to scale separately from DBMS
• New features: spellcheck, auto-complete, facets
• Works in the real world, from eBay to Google
8
Search and Information Architecture
• Information Architecture
  – The art and science of organizing information for access and use
• IA work enriches search
  – Creates order and systems
  – Provides standard vocabulary
  – Removes ROT (redundant, obsolete, trivial)
• Search supplements IA
  – Supports user vocabularies
  – Changes dynamically with new content
9
Search and Taxonomy
• Taxonomy creates categories
  – Labels and metadata
  – Improves quality of search results
  – Additional metadata extremely valuable
• Search crosses categories
  – Bypasses ambiguous topic labels
  – Useful for novices
  – Supports user vocabulary
  – Dynamic updates for new topics
10
Search & Knowledge Management
• KM is: “The process through which organizations generate value from their intellectual and knowledge-based assets.” (CIO Magazine)
  – Organizes information, processes and people
  – Offers collaboration and archiving tools
  – Attempts to regularize implicit knowledge
• Search mostly matches words
11
Two Main Types of Search
• Known-item search
  – Short queries
  – “Good-enough” answers
• Exploratory search
  – Research - finding unknowns
  – Scientific, legal, medical, business, sales
  – Conceptual overviews
  – Completeness - all possible relevant items
    • Law enforcement
    • Medicine
12
Search as an Iceberg
• All people see are the search box and results list
• Invisible functionality
  – Indexes
  – Query processing
  – Retrieval
  – Relevance ranking
• Search is a mystery
  – But it’s just software
13
Elements of Search Engines
• Automated tools to collect content
• Specialized storage for quick retrieval
• Query processing and expansion
• Retrieval (matching query to index content)
• Relevance ranking
• Search results interfaces
• Analytics, metrics and maintenance
14
Choosing Content To Index
• Information sites
  – Consider indexing every single page
  – Use search indexing as a discovery mechanism
• Online stores, catalogs
  – Product information: cost, color, size, materials
  – Other: return policies, CEO’s name, jobs listing
• Intranets
  – Intranet portal and core servers
  – May need archive servers and search
• Multimedia: images, audio, video
  – Metadata at least
15
(Near) Real Time Indexing
• Twitter has changed expectations
  – Even in intranets
• Index must support partial updates
  – Search engines finding limits at scale
  – Distribute indexing and indexes
• Trigger index updates (push vs. pull)
  – Continuous feed
  – Send web service message
  – Database trigger
  – Update watched URLs with new links
16
Indexing and Security
• Search can undermine “security by obscurity”
  – One link can expose a whole set of documents
• Work with your security team
  – List areas which contain sensitive content
  – Define words which trigger further analysis
  – Create a process for removing sensitive data
• Indexing encrypted content
  – Search engine uses SSL client for indexing
  – Encrypt search results before returning
  – Physical security on search servers
17
Search and Access Control
• Authentication and authorization in indexing
  – “Basic authentication” - user name and password
  – NT Security integration
  – ACLs and single sign-on
• Conform to security rules during indexing
  – Keep access control info as part of document store
• Showing results - who can see what?
  – Access to search engine itself
  – Collection-level access control
  – Locked results as teaser for subscription
  – Hit-level access control
    • Check before displaying results
18
Indexing: Sources of Content
– Web sites
– Intranets
– Extranets
– Blogs
– Wikis
– Mailing list archives & email public folders
– File systems & shared servers
  • NFS, SMB, AFP, GFS, ftp, WebDAV
– Content Management Systems
– Databases
– Legacy programs in silos
19
Indexing: Robot Spiders
• Start with base URL for all hosts
• For each page, repeat
  – Read text into internal format
  – Save document in cache
  – Save words into index
  – Extract all links and check the rules
  – If they are new URLs, add them to the list
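The crawl loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a production spider: the `fetch` and `allowed` callables, the in-memory `cache` and `index`, and the toy `PAGES` dictionary (standing in for real HTTP requests) are all hypothetical names chosen for this example.

```python
import re
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def crawl(base_urls, fetch, allowed):
    """Breadth-first crawl. fetch(url) returns HTML or None; allowed(url) applies the rules."""
    queue = deque(base_urls)
    seen = set(base_urls)
    cache, index = {}, {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        cache[url] = html                        # save document in cache
        text = TAG_RE.sub(" ", html).lower()     # read text into internal format
        for word in re.findall(r"\w+", text):    # save words into index
            index.setdefault(word, set()).add(url)
        for href in LINK_RE.findall(html):       # extract links and check the rules
            link = urljoin(url, href)
            if allowed(link) and link not in seen:
                seen.add(link)                   # new URL: add it to the list
                queue.append(link)
    return cache, index

# Toy "site" standing in for real HTTP fetches (hypothetical URLs).
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> home page',
    "http://example.com/a.html": '<a href="/">home</a> '
                                 '<a href="http://other.example/">x</a> about spiders',
}
cache, index = crawl(
    ["http://example.com/"],
    fetch=PAGES.get,
    allowed=lambda u: u.startswith("http://example.com/"),
)
```

The `seen` set is what keeps the robot from looping on pages that link back to each other; a real spider would also honor robots.txt and detect duplicates.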
20
Robot Indexing Spider
21
Common Problems With Robots
• Pages that are not linked from anywhere
• Spider disallowed by robots.txt or robots meta
• URLs with ? and & (all robots should handle these now)
• JavaScript, forms, and interactive dynamic links
  – Some robots can handle some of these
• Session IDs that change
• Duplicate detection
  – Multiple views of the same data (Lotus, wikis)
  – Symbolic links & bad redirects
  – Multiple copies of files or directories
22
Indexing: Other Data Sources
• RSS feeds: nice clean text
• File servers: SMB, file:/// etc.
• Content / Document Management Systems
• Email archives
• Databases via ODBC, JDBC, Oracle API
  – Full-text content
  – Metadata: library catalog records, yellow pages
• External sources using APIs (application programming interfaces)
  – News feeds (Reuters, AP)
  – Twitter
23
Indexing: Text Files
• Plain text is easy
• RTF export format: text is easy to find
• HTML: semi-structured text
  – Content is between tags and in attributes
  – Generated by JavaScript - hard to extract
  – Bad HTML, especially missing </ close tags
• XML files (structured)
  – Many tags are document-level
  – Content is between tags and in attributes
  – Complex tag hierarchy
• TEI (Text Encoding Initiative) & Semantic Web
• XQuery and XPath tools
24
Indexing: Binary File Formats
• PDF
  – Scanned, may not have any text
  – Bad PDF generators break words at columns
  – “Shadow” text effect duplicates letters
• SWF and Flash: API may not load dynamic text
• Office documents
  – Word processing files (may have hidden text from revisions)
  – Spreadsheets (hard to know what to grab)
  – Presentations
  – Note: new docx, xlsx, pptx are really XML file sets
• CAD and project files
• Metadata (properties, Adobe XMP)
25
Indexing: Tokenizing
• Lowercase all characters (aka ‘folding’)
• Tokenizing makes words searchable
  – Break on punctuation and spaces
  – Recognize special words: C++ @ [TS]
  – Typography issues: the ligature “ﬆ” is really “st”
  – HTML escaped text: m&ouml;chten = möchten
  – Special cases for structured strings
    • Numbers, prices, dates
• N-grams - an alternate approach
  – Break into short text patterns
  – Takes a lot of index space
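A rough Python sketch of these tokenizing steps follows. It is an illustration only: the function names `tokenize` and `char_ngrams` are invented for this example, the special-word handling covers just "C++", and Unicode NFKC normalization is assumed here as one way to expand typographic ligatures.

```python
import html
import re
import unicodedata

# "c++" is kept whole; everything else splits on punctuation and spaces.
TOKEN_RE = re.compile(r"c\+\+|\w+")

def tokenize(text):
    text = html.unescape(text)                   # m&ouml;chten -> möchten
    text = unicodedata.normalize("NFKC", text)   # expands ligatures, e.g. U+FB06 -> "st"
    text = text.lower()                          # case folding
    return TOKEN_RE.findall(text)

def char_ngrams(token, n=3):
    """Alternate approach: overlapping character n-grams of one token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]
```

For example, `char_ngrams("search")` yields four trigrams for a six-letter word, which hints at why n-gram indexes take so much space.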
26
Indexing: Character Set Issues
• World has many charsets (aka scripts, alphabets)
  – English has a simple alphabet: 26 letters, 10 numbers
  – Other Roman languages: extended (ç, î, ß)
  – Non-Roman one byte: Cyrillic, Arabic, Hebrew
  – Asian two bytes: Chinese, Japanese, Korean
• Identifying character sets
  – Unicode characters
  – Older usage: language “code pages”
  – HTTP header or <META http-equiv>
  – Statistical detection techniques
27
Indexing: Language Issues
• Text search works across languages
  – Simple pattern-matching, query to index
• Language-specific indexing improves search
  – Tokenizing using appropriate rules
    • Compound nouns (kindergarten)
  – Language rules for stemming
    • Singular version of thés is thé
• Language detection
  – Trusted tags
  – Bilingual dictionaries
  – Statistical matches, n-grams
• Documents may have mixed languages…
28
Indexing: Multimedia
• Images, photos, drawings, sound, scores, video
• External metadata
  – File name
  – Link text, surrounding words
• Internal metadata
  – ID3 tags for music
  – EXIF and other digital photo information
  – Subtitles (sometimes)
• Content
  – OCR to extract graphic text and closed captions
  – Audio: speech-to-text conversion, still buggy
• Use human judgment, not just automated systems
29
Inverted Index Diagram
• Inverted indexes work well
• Lots of IR research shows this
• Better than DBMS
• Alphabetical list of tokens
• Tokens not in paragraph order, thus, inverted
• Each token has ID of source
30
Richer Index Structures
• Store word position (for phrase matching)
• Enclosing tag or field
  – Document metadata
  – Database field names
  – Image (which attribute)
  – Named anchor text
  – Text markup tags (TEI, Semantic Web)
  – Extracted entities
    • Personal names, companies, geo locations, dates
• Anchor text from incoming links
  – Can be very descriptive
  – Add to index as if part of the target document
31
Example Inverted Index Structure
• For each word
  – Document ID
  – Position
  – Tag name
• For each document
  – ID
  – Title
  – URL
  – Description
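The two structures above might look like this in Python. It is a sketch under simple assumptions: documents arrive as dictionaries with `title`, `url`, `description`, and `body` keys (hypothetical field names), tokens are split on whitespace, and everything lives in memory.

```python
from collections import defaultdict

def build_index(docs):
    """Return (postings, doc_store) for a list of document dictionaries."""
    postings = defaultdict(list)   # word -> [(doc_id, position, tag_name), ...]
    doc_store = {}                 # doc_id -> {title, url, description}
    for doc_id, doc in enumerate(docs):
        doc_store[doc_id] = {k: doc[k] for k in ("title", "url", "description")}
        position = 0
        for tag in ("title", "body"):
            for word in doc[tag].lower().split():
                postings[word].append((doc_id, position, tag))
                position += 1
    # Alphabetical list of tokens, as in an inverted index.
    return dict(sorted(postings.items())), doc_store

DOCS = [
    {"title": "Lunch menu", "url": "/lunch", "description": "Cafeteria menu",
     "body": "soup and salad menu"},
    {"title": "Holiday schedule", "url": "/holidays", "description": "Office closures",
     "body": "no lunch service"},
]
postings, doc_store = build_index(DOCS)
```

Looking up `postings["lunch"]` returns one entry per occurrence, each pointing back to a document ID that resolves to a title and URL in `doc_store`.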
32
Indexing: Stopwords
• Stopwords - very common terms
  – Linguistic (a an the as he she it you new)
  – Ubiquitous (names, copyright, click here)
• Consequences of excluding stopwords:
  – Reduces the size of index files
  – Improves recall, finds more matching documents
  – Fails some queries
    • As You Like It, IT copyright policy
  – Problems matching phrases: “New York University”
• Solutions vary:
  – Index everything, pay the price in index size
  – CommonGrams: n-grams of frequent phrases
33
Stopwords Problems: Example
• Searching wordpress.com for whatever will be
• Finds all matches for whatever (stopwords ignored)
• Useless results ranking
• No matches for will be
• One ad gets it right
• External search finds over 3,000 pages on site with phrase
34
Indexing: Stemming
• Singular query should find plural words & vice versa
  – Shoe <=> shoes, cans <=> can, geese <=> goose
  – Statistical and probabilistic truncation rules
  – Linguistic rules
• Lemmatization - stemming based on part of speech
• Stemming before indexing
  – Improve recall: find all forms of a word
  – Reduce index size
• Consequences of extreme stemming
  – Short query problems
  – Search for Ran shouldn’t match Run, Lola, Run
• Other options
  – Index everything (makes indexes larger and queries slower)
  – New idea: CommonGrams (n-grams of frequent phrases)
35
Indexing: Document Store
• Minimum
  – ID (key for inverted index)
  – Unique location (URL / file path / record ID)
• Richer document store
  – Implicit metadata: filename, size, location
  – Explicit metadata
    • Title, date, keywords, author
    • Taxonomy labels, classification, user tagging
  – Language, character set
  – Access control settings
• Full text of the document
  – For snippets and caching
36
Indexing: Dealing with Duplicates
• Detecting duplicate documents
  – Exact match is fairly easy: checksums
  – Document similarity check: harder but worth it
• Choosing the primary copy
  – Most recent (if reliable)
  – Rules based on path or metadata
  – New web search “canonical” tag
• What to do with duplicates
  – Remove from the index: saves space
  – Hide in results unless requested
    • That’s the Google way
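Both detection approaches can be illustrated briefly. The slide does not name a similarity method, so the Jaccard overlap of word shingles used below is an assumption chosen for simplicity; the function names are likewise invented for this sketch.

```python
import hashlib
import re

def checksum(text):
    """Exact-duplicate key: hash of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a, b):
    """Jaccard similarity of word shingles, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

Two documents with equal checksums are exact duplicates; a similarity above some site-specific threshold (to be tuned on real content) would flag near-duplicates for the primary-copy rules.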
37
Indexing: Document Dates
• HTTP servers lie about dates
  – Frequent wrong settings: 1969, 2040
  – Dynamic pages send the current timestamp
• File systems lie about dates
• Applications lie about dates
• Indexers do the best they can
  – Metadata (date tag, property, tag DC.date)
  – Extract from page content
  – Checksum to see if file has changed since last index
  – Consider external metadata repository
38
Search Process Flow
39
Where the Queries Come From
• User-entered text in search fields
• Search navigation: moving around in results list
• Previous searches
  – May just be repeated clicks on URL
  – Save Search feature
  – Simplistic alerts
• Facet click to add a metadata filter
  – May re-issue search with additional terms
  – May be navigational, no text query
• Scripts or automated queries
  – Dynamic links (find all pictures by this artist)
  – Geographic information systems
40
Query Processing Steps
1. Try to recognize the character set and language
2. Tokenize the text by language rules
  – Break at spaces and punctuation
  – Same algorithm as index tokenizer
3. Check for operators
  – Internet Query Operators: + - "quotes"
  – Boolean Operators: AND OR NOT & | !
  – Others: NEAR, (parentheses)
4. Check for field names, zones, other filters
  – Example: title:lunch location=94703
5. Handle the rare natural language question
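Steps 3 and 4 can be sketched with a small parser. This is an illustration only: it handles the + and - prefixes, quoted phrases, and `field:value` or `field=value` filters, and leaves out Boolean keywords, NEAR, and parentheses. The function name and the shape of the returned dictionary are assumptions made for this example.

```python
import re

# Optional +/- sign, optional field name with : or =, then a quoted phrase or a bare word.
TOKEN_RE = re.compile(r'([+-]?)(?:(\w+)[:=])?(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required, excluded, and optional terms plus field filters."""
    parsed = {"required": [], "excluded": [], "optional": [], "filters": {}}
    for sign, field, phrase, word in TOKEN_RE.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue
        if field:
            parsed["filters"][field.lower()] = term
        elif sign == "+":
            parsed["required"].append(term)
        elif sign == "-":
            parsed["excluded"].append(term)
        else:
            parsed["optional"].append(term)
    return parsed
```

A query such as `title:lunch location=94703 +soup -"cold salad" bread` would come back with two filters, one required term, one excluded phrase, and one optional term.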
41
Query Expansion
• Stemming
  – Dependent on index stemming choices
  – Good to find singular/plural forms
• Word similarity searching - increases recall
  – Fuzzy matching
  – Phonetic, soundex, sound-alike
  – May overwhelm exact matches
• Synonym expansion, should be site-specific
  – bus => coach, ATM => Air Tasking Message
42
Search: Retrieval, Recall & Precision
• Retrieval
  – Finding the documents matching a particular query
• Recall
  – Finding every relevant document
• Precision
  – Finding only relevant documents
• Balance more recall vs. better precision
  – Use search logs and user studies to guide choices
• Use precision as part of relevance ranking
  – Top results should be more exact matches
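As a worked example, recall and precision can be computed for one query when the set of truly relevant documents is known (for instance from a test suite). The function below uses the standard definitions; its name and the ID-list inputs are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document ID collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

If a query retrieves documents 1 to 4 and the relevant set is {2, 4, 6}, half the results are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).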
43
One-Word Text Retrieval
– Fast binary search in inverted index
  • Check index updates on disk or in memory
  • If there are distributed indexes, merge results
– Store the related document information in a list
  • Document ID
  • Term frequency in document
  • Term positions in the document
  • Note: The document list is not yet sorted
– Frequent searches may be cached
  • “Short head” vs. “long tail”
44
Multi-Word Text Retrieval
• Relationship between words defines results
  – Boolean AND, + operator, find all default
    • Only documents which contain all terms
  – Boolean OR operator, find any default
    • All documents with any term
  – Boolean NOT, - operator
    • All documents with the first term but not next term
  – Phrase operators, quotes
    • Only documents with the words as a phrase
  – Also check for zones or field filters
  – Parentheses: use for order of processing
• Merge resulting lists
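These operators map naturally onto set operations over the per-term document lists. The sketch below uses a tiny hand-made positional index (`POSITIONS`, with invented document IDs) and hypothetical function names; real engines merge sorted posting lists rather than Python sets.

```python
# Toy positional index: term -> {doc_id: [word positions]}
POSITIONS = {
    "enterprise": {1: [0], 2: [3], 4: [1]},
    "search": {1: [1], 2: [0], 3: [2]},
    "engine": {2: [1], 3: [3]},
}

def docs_for(term):
    return set(POSITIONS.get(term, {}))

def find_all(terms):                      # Boolean AND
    sets = [docs_for(t) for t in terms]
    return set.intersection(*sets) if sets else set()

def find_any(terms):                      # Boolean OR
    return set().union(*(docs_for(t) for t in terms))

def find_not(first, other):               # first term but NOT the other
    return docs_for(first) - docs_for(other)

def find_phrase(words):                   # words must appear at consecutive positions
    matches = set()
    for doc_id in find_all(words):
        for start in POSITIONS[words[0]][doc_id]:
            if all(start + i in POSITIONS[w][doc_id] for i, w in enumerate(words)):
                matches.add(doc_id)
                break
    return matches
```

Phrase matching is the reason the index stores word positions: document 2 contains both "enterprise" and "search", but not adjacent and in that order.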
45
Relevance Ranking Algorithms
• Relevance
  – The likelihood that an item will fill an information need
  – Based on documents in retrieval list
• Most common algorithm: TF-IDF (term frequency : inverse document frequency)
  – How often is the query word in the document?
  – How often is the word in the index?
• Other relevance algorithms
  – Vectors and document-query similarity
  – Linguistic analysis and Natural Language Processing
  – Statistical and Bayesian analysis
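A small TF-IDF ranking sketch makes the two questions concrete. There are many TF-IDF variants; this one assumes length-normalized term frequency and a smoothed `log(1 + N/df)` inverse document frequency, and the function name and toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """docs: {doc_id: list of tokens}. Returns [(doc_id, score)] sorted best first."""
    n_docs = len(docs)
    doc_freq = Counter()                      # how many documents contain each word
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scored = []
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)               # frequency in this document
                idf = math.log(1 + n_docs / doc_freq[term])   # rarity across the index
                score += tf * idf
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

DOCS = {
    "a": ["lunch", "menu", "lunch"],
    "b": ["lunch", "policy", "travel", "policy"],
    "c": ["travel", "policy"],
}
```

For the query "lunch", document "a" outranks "b" because the word makes up a larger share of its text, and "c" is not retrieved at all.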
46
Relevance Heuristics
• Phrase matches for multiple query terms
  – Logs show most multi-word searches are phrases
• Query terms found in special sections
  – Title
  – Metadata
  – Top of document
• All terms matched in document
  – Even when not relevant, it’s transparent
  – Old systems gave excess weight to single rare terms
47
More on Relevance
• Relevance is task-specific
  – Results can never please all of the people
  – More like berry-picking than like hunting
• Link analysis (PageRank) not very useful
  – Intranet and site links tend to be navigational
• Situation-specific adjustments
  – Some areas more likely to be valuable
  – Current content
  – Local content
48
Federated Search and Relevance
• Send query to multiple search engines
  – May require special syntax
  – Response time often a factor
  – Receive results in relevance order for each
• Display results, two options
  – Separate sections for each search engine
  – Merged single relevance rank list
• Works if all search indexes are similar
• Problems where the sources are very different
49
Retrieval: Access Control
• Limit access to search itself
  – User enters password or other credentials
  – Search only accepts queries when authenticated
• Collection-level access control
  – Query filter only retrieves items from allowed groups
• Hit-level access control
  – Real-time check for user access on documents
  – Start with most relevant documents
  – Repeat until there are ten (may be slow)
  – Display top results, include estimate of how many more
  – Show helpful message if user can’t see any
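The hit-level loop can be sketched as below. The `can_view` callable stands in for whatever real-time permission check the security system provides, and the returned count of unchecked hits is only an upper bound on how many more the user might see; both are assumptions of this example.

```python
def visible_page(ranked_ids, can_view, page_size=10):
    """Walk results in relevance order, keeping only hits the user may see."""
    page, checked = [], 0
    for doc_id in ranked_ids:
        checked += 1
        if can_view(doc_id):              # real-time access check per document
            page.append(doc_id)
            if len(page) == page_size:    # stop once the page is full
                break
    remaining = len(ranked_ids) - checked  # unchecked hits: basis for "about N more"
    return page, remaining
```

Stopping as soon as the page is full limits the number of permission checks, which is where the slowness comes from when most top hits are restricted. An empty `page` is the cue to show the helpful message.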
50
Search User Experience
• Limit user interface complexity
  – Show the scope of the information covered
  – Expose query expansion and contraction
  – Use familiar UI elements
• User experience goes beyond interface
  – Index coverage
  – Query syntax
  – Retrieval quality and speed
  – Relevance ranking (first ten are vital)
51
Search Forms Interface
• Balance simplicity with functionality
• Put a search field in the navigation bar
  – Location should be consistent
  – Longer is better: short fields lead to short queries
• Simple Search forms: limit options
  – Zone or section
  – Dates
52
Search Field Auto-Complete
• Dropdown menu of matching words
• Base on search logs
• Smallish list, 7-10
  – Most popular
• Simple sort
  – Alphabetic
  – Price or size
  – Complete range (preferably lowest to highest)
53
Other Search Interfaces
• Heavily researched
• Natural language
  – Must keep typing
  – Defining a question is quite hard
• Interactive search
  – Guided interviews
  – But users want immediate results
• Avatars
  – Do not improve interaction
54
Simple vs. Advanced Search UI
• Most searches are simple
  – Short: one to three words
  – Fewer than 10% use any operators at all (maybe 1%)
  – Even experts prefer simple search
    • Will use advanced tools if simple doesn’t work
• Default to simple search, link to advanced search
  – Those are your power users: librarians, techies
  – Expose all possible options
  – Don’t spend huge resources on advanced UI
• Exploratory search is different
55
Advanced Search Fits Sometimes
• eBay
  – High motivation
  – Complex search requirements
  – Frequent use
• UX testing still required
56
Search Results: Page Elements
• Site context
  – General page layout, navigation links
  – Colors and design elements
• Results header
  – A search field, with the current search terms
  – Retrieval information - how many hits
• Results list in relevance order
  – Each result item with at least a linked title
• Facets: dynamic links for filtering results
• Results footer
57
Search Results: Good Example
• Full but readable
• White space
• Content blocks
• Site look-and-feel
• Navigation
• Familiar search results elements
58
Search Results: Not-So-Good Example
Site page has navigation, colors: search results should too
59
Search Results: Visualization
• Fascinating to look at, great demos
  – Star charts
  – Topographical displays
  – Interactive fly-throughs
  – Hyperbolic trees
• Require significant resources to run
• Good for exploratory & comprehensive research
  – Finding unexpected synergies
• Simple search is much cheaper for casual users
60
Search Results: Header Elements
• Search field, with the current query
  – Users often edit to be more or less restrictive
• Number of results found
• A few search options
  – Match Any Word / All Words / Exact Phrase
  – Filter by date option (if trustworthy)
  – Search zones
• Results navigation
• Best Bets
• Spelling suggestions
61
Search Results: Hits and Pages
• Show number of items matched
  – Be accurate
  – Do not give estimates for small numbers
    • (Google and SharePoint are bad this way)
• Pagination - results list navigation
  – Helps user calibrate content
  – Important for exploratory search
  – Follow web search conventions, example:
    < previous 1 2 3 4 ... 26 next >
  – Be accurate
62
Results Headers: Examples
63
Search Results: “Best Bets”
aka Search Suggestions, QuickLinks, KeyMatch, Recommendations
• Special-case links for problem queries
  – Internal topic landing pages
  – External sites when appropriate
  – New and better query to search
• Only implement for very frequent queries
  – Discover problems from users, log analysis
  – “Short head” - few very popular query terms
  – Allocate resources to keep them current
• Good search results are higher priority
64
Best Bets Example
• Best Bets are very clear
• Would not come first in normal search results
65
Search Results: List Sorting
• List of links to items matching the query
• Sorted by matching terms
  – Impossible to be relevant to every query
  – Variety of sources when possible
  – Transparency: why these items in this order
• Other sort orders - make very visible
  – By author’s last name
  – By date
  – By price
66
Search Results: Not Enough Variety
67
Search Results: Weird Sort
Sorted by: “Degrees away”
Labels too subtle:
• Hidden in header
• Degree icon should be on the left side
68
Result Items: Elements
• Information foraging: show hints about items
• Title of document, or name of product
• Location: URL, file path, database ID
  – May need to rewrite to user-accessible URLs
  – Hide location if it’s not meaningful
• Distinguishing data
  – Metadata: picture, product code, author name
• Show match terms in context (snippets)
  – Text before and after query term matches
  – Highlight the matches
69
Results Items: Not Enough Content
70
Results Items: Too Much Content
71
Results Items: Just Right
72
Results Items: Additional Data
• Date (if reliable)
• Size and file type
  – Avoid surprising launches of Acrobat or other app.
• Metadata
  – Author, department, brand, product...
• Access status: password required?
• Topics and subject headings
  – Taxonomy categories
  – Keywords and concept tags
  – User tags, folksonomy
73
Results Items: Rich Items Example
74
Results: Dynamic Clustering
• Uses search results text to infer topics
  – Groups by similarity in titles and results text
• Particularly good for portals and intranets
  – Unstructured, uncontrolled text
  – Dynamic, no preprocessing needed
• Can supplement categorization and taxonomies
75
Results: Clustering Example
76
Commerce and Catalog Results
• Picture or graphic if possible
• Important attributes
  – Price
  – Color
  – Size
  – Compatibility
  – Availability
• “Buy” button
  – Simplify process, save time
77
Online Store Results Example
78
Multimedia Search
• Image, audio, and video files
  – Audio and visual similarity search still theory
• Show context in results
  – Match terms from transcript or OCR
  – Text around image
  – Thumbnails or keyframes
79
Multimedia Results Example
80
Results: Faceted Metadata
• Better than forms for structured text data
  – Exposes attributes as part of search results
  – Leverages metadata
    • Topic names, taxonomy
    • Mundane stuff: color, date, size, author...
• Choices specifically relating to search results
  – Dynamically generates from metadata
  – Preview numbers offer users confidence in clicking
• Supported by extensive usability testing
• Used on a majority of large e-commerce sites
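Generating facets dynamically from result metadata, with preview numbers, can be sketched as follows. Results are assumed to be plain dictionaries of metadata; the function names and the sample records are hypothetical.

```python
from collections import Counter

def facet_counts(results, fields):
    """For each metadata field, count how many results carry each value."""
    return {f: Counter(r[f] for r in results if f in r) for f in fields}

def apply_facet(results, field, value):
    """Filter the current results when the user clicks a facet link."""
    return [r for r in results if r.get(field) == value]

RESULTS = [
    {"title": "Shirt", "color": "red", "size": "M"},
    {"title": "Scarf", "color": "red", "size": "L"},
    {"title": "Hat", "color": "blue", "size": "M"},
]
counts = facet_counts(RESULTS, ["color", "size"])
```

Because the counts are computed from the current results, a facet link never leads to an empty list: the preview number shown is exactly what `apply_facet` will return.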
81
Why Faceted Search is Better Than Forms
82
Faceted Metadata: Commerce Example
83
Faceted Metadata: Library Catalog
84
No Matches Queries: Causes
• Misspellings and typing errors
• Scope problem: nothing for that topic
• Vocabulary differences
  – Users may be less precise, or use competitor’s terms
  – Marketers may dominate content
• Restrictive search settings
  – Default may only match exact phrase or all words
  – Access control may disallow user
• Software/hardware/network failures
85
No Matches Queries: Responses
• Track queries with no matches in logs
• Use sessions, surveys & testing to find user intent
• Design the no-matches page carefully
  – Explain what is and isn’t on the site
  – Provide useful navigation links
• Add search engine help
  – Synonyms
  – Best Bets
  – Spelling
• Add terms to text
• Add content, topic pages
86
No Matches Queries: Spelling Issues
• Detect and address common problems
  – Spelling errors
  – Typos
  – Queries without spaces between words
• Use site-specific dictionary
  – Easy to build from search index
  – Never suggests any words not on the site
• Users familiar with did you mean....?
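A site-specific "did you mean" suggester can be approximated with Python's standard `difflib` module, using the index vocabulary as the dictionary. This is a sketch: the function name and the 0.75 similarity cutoff are arbitrary choices for illustration, and it does not handle queries that are missing spaces between words.

```python
import difflib

def did_you_mean(query, index_terms, cutoff=0.75):
    """Suggest a corrected query built only from words that occur in the index."""
    suggestion, changed = [], False
    for word in query.lower().split():
        if word in index_terms:
            suggestion.append(word)
            continue
        close = difflib.get_close_matches(word, index_terms, n=1, cutoff=cutoff)
        if close:
            suggestion.append(close[0])
            changed = True
        else:
            suggestion.append(word)       # nothing similar on the site: leave as typed
    return " ".join(suggestion) if changed else None

INDEX_TERMS = {"enterprise", "search", "taxonomy"}
```

Since every suggestion comes from `INDEX_TERMS`, the engine never proposes a word that would itself produce no matches.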
87
Good Example of No-Matches Page
88
Empty Searches
• Users click or press “enter” in the search box
• Test for this special case
• Should not find all items in the index
• Interaction options:
  • Do nothing
  • Go to a simple search page
  • Show an error dialog
89
Search Engine Maintenance
• Index maintenance
  – Obsolete content removal
  – Check for new content
  – Track technical problems (bad links, servers down)
• Search quality
  – Re-run test suite
  – Compare with original results
  – Add new test queries
• Track user feedback, surveys
  – Use metrics and log analysis to catch trends
90
Metrics for Search Engines
• Server uptime
• Errors: how often and how serious
• Index
  – Size on disc and in memory
  – Number of entries
  – Number and type of indexing errors
• Search traffic
  – Queries per minute (60 qpm is common)
  – Average clicks on results items per query
  – Average next-page views per query
  – Number and percent of no-match queries
91
Search Log Analysis
• Most frequent query terms
  – Short head: a few very popular terms
  – Long tail of unique queries
  – Lots of junk: URLs, spam, gibberish
• Frequent query terms not matched - fix somehow
• More esoteric analysis - need a lot of data
  – Frequent query terms with low click-through
  – Frequent query terms with high “next page” clicks
• Raw logs
  – Import into database for ad-hoc reports
  – Session analysis can be enlightening
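
A rough sketch of this kind of log analysis, assuming the raw log has already been reduced to `(query, match_count, clicked)` rows. The row layout, the function name `analyze_log`, and the thresholds `min_count` and `low_ctr` are all illustrative assumptions.

```python
from collections import Counter


def normalize(query):
    """Lowercase and collapse whitespace so variants count together."""
    return " ".join(query.lower().split())


def analyze_log(rows, min_count=2, low_ctr=0.2):
    """Summarize (query, match_count, clicked) rows from a search log."""
    counts, no_match, clicks = Counter(), Counter(), Counter()
    for query, match_count, clicked in rows:
        q = normalize(query)
        if not q:
            continue  # empty searches are a separate special case
        counts[q] += 1
        if match_count == 0:
            no_match[q] += 1
        if clicked:
            clicks[q] += 1
    frequent = [q for q, n in counts.most_common() if n >= min_count]
    return {
        "top_queries": counts.most_common(10),  # the short head
        "unique_queries": sum(1 for n in counts.values() if n == 1),  # long tail
        # Frequent queries that never matched anything.
        "frequent_no_match": [q for q in frequent if no_match[q] == counts[q]],
        # Frequent queries that matched but were rarely clicked.
        "frequent_low_click": [
            q for q in frequent
            if no_match[q] < counts[q] and clicks[q] / counts[q] <= low_ctr
        ],
    }
```

With enough data, the `frequent_no_match` list feeds directly into synonym, Best Bets, and content decisions, while `frequent_low_click` points at relevance problems.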
92
Choosing a Search Engine
• Find specific information needs
• Analyze content
  – Sources and formats
  – Rough number of pages / records / items
• Define platform, API, language requirements
• Buy (or use open source), don’t build
  – User surveys show problems with home-grown search
• Choose & compare likely candidates
  – Gathering, indexing, retrieval, relevance features
  – Scaling
  – Administration tools
  – Continuing development, support, user groups
  – Price
93
Information Needs Analysis
• What works already?
  – Don’t fix what’s not broken
• Where is the real pain?
  – Difficult search syntax
  – Data silos
  – New content not findable
• What requires more complex tools?
  – Exploratory search
  – Scientific & academic research
  – Business intelligence and data mining
  – Comprehensive legal discovery
94
Content Inventory
• Work with Information Architects
  – Use existing taxonomies and catalogs
• Learn what you have
  – Simple static HTML pages
  – Other formats: PDF, Office documents (which version)
  – CMS, document management, publishing systems
  – Databases and legacy systems
  – Multimedia audio and video files
• Identify more and less valuable data
  – Some content should be in archives
95
Search Engine Deployment Types
• Software
  – Controlled by local IT
  – Flexible installation
  – Open-source - several high quality packages
• Search Appliances
  – Server hardware/software combinations
  – Require very little technical attention
  – Check development and backup server pricing
• Remote Search Services (SaaS)
  – Index using robot spiders or remote access
  – Query goes to service, results go back to user
  – Low network, hosting, IT load
96
Scaling Search to Millions & Billions
• What are the largest installations for each?
  – Talk to them before committing
• Cache frequent queries
• Add query servers, automated load balancing
• Indexing at scale
  – Indexing on dedicated servers
  – Deal with new calls for near-real-time indexing
  – Distribute multiple clones of indexes
  – Segment indexes, parallel lookups, merge results
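
Two of these techniques, caching frequent queries and running parallel lookups over index segments before merging, can be sketched together. The `SEGMENTS` data, its `(score, doc_id)` layout, and the function names are hypothetical; a production engine would spread segments across servers, not threads in one process.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical index segments: each maps a term to (score, doc_id) hits.
SEGMENTS = [
    {"search": [(0.9, "doc1"), (0.4, "doc3")]},
    {"search": [(0.7, "doc5")], "index": [(0.8, "doc6")]},
]


def search_segment(segment, term, limit):
    """Return the best hits for a term from a single segment."""
    return heapq.nlargest(limit, segment.get(term, []))


@lru_cache(maxsize=1024)  # frequent queries are answered from the cache
def search(term, limit=10):
    """Look up all segments in parallel and merge into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(SEGMENTS)) as pool:
        partials = list(
            pool.map(lambda seg: search_segment(seg, term, limit), SEGMENTS)
        )
    merged = heapq.nlargest(limit, (hit for part in partials for hit in part))
    return tuple(merged)  # immutable, so cached values cannot be altered
```

Whenever the index is updated, the cache has to be emptied (here with `search.cache_clear()`), which is one reason near-real-time indexing complicates caching.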
97
Testing Search Indexing
• Choose 3-4 good candidates
• Index as much content as possible
  – Watch the robot, track errors
  – Try to index tricky data sources
  – Compare coverage among them
• Test index scaling
  – Make a really big index based on expected use
  – Speed of add / update / delete
  – Responsiveness during big update
98
Evaluating Search Results
• Create a query test suite
  – Use existing search logs if possible
  – Short, long, unusual, common (check cache)
  – Simple and complex queries
  – Spelling, typing and vocabulary errors
  – Many matches, few matches, no matches
• Perform searches against the test engines
  – Save results pages as HTML for later checking
• Analyze differences among them
  – Retrieval (and indexing): what’s found?
  – Relevance: are the top results good ones?
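
One way to start analyzing differences is to run the same test suite against two candidate engines and compare their top results. In this sketch, `engine_a` and `engine_b` are hypothetical callables that return a ranked list of URLs for a query. Overlap is measured with the Jaccard index, which is one reasonable choice, not a method prescribed by the workshop.

```python
def top_overlap(results_a, results_b, k=10):
    """Jaccard overlap between the top-k results of two engines."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # both engines agree that nothing matches
    return len(top_a & top_b) / len(top_a | top_b)


def compare_engines(suite, engine_a, engine_b, k=10):
    """Run every test query on both engines and report the differences."""
    report = {}
    for query in suite:
        a, b = engine_a(query)[:k], engine_b(query)[:k]
        report[query] = {
            "overlap": top_overlap(a, b, k),
            "only_a": [url for url in a if url not in b],
            "only_b": [url for url in b if url not in a],
        }
    return report
```

Low overlap only flags queries worth a closer look; judging whether the top results are good ones still takes a human review of the saved results pages.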
99
Search: Not a Black Box
• Simple search solves many enterprise problems
  – Dynamic access to local content
  – Familiar interface, expectations
  – User vocabulary
• Understand the real information needs
  – Index the right stuff
  – Work with content providers and IAs
• Link to specialty research engines
• Learn from users over time, make it better