
Fundamentals Of Search

Date post: 28-Jan-2015
Upload: search-tools-consulting
Description:
These slides are from my 2009 Fundamentals of Search workshop at KMWorld. Please contact me for information about search engines, consulting, workshops and training.
Transcript
Page 1: Fundamentals Of Search

1

The Fundamentals of Enterprise Search

www.searchtools.com/slides/kmw09/fundamentals-of-search.html

KMWorld 2009

Avi Rappoport, Search Tools Consulting

www.searchtools.com

[email protected]

Page 2: Fundamentals Of Search

2

What’s In This Workshop

• Overview of enterprise search, in context
• Search engine processes
  – Robot spiders, database access
  – Indexing
  – Security
  – Query parsing, retrieval, and relevance ranking
  – Usable search interfaces
  – Maintenance and Analytics
• Methods for choosing a good search engine

Fundamentals of Search Engines 2009 / © Avi Rappoport, www.searchtools.com

Page 3: Fundamentals Of Search

3

About SearchTools

• Avi Rappoport is a librarian (MLIS from Berkeley)
  – Software developer and product manager
  – User interface designer
  – Long-time search consultant
• Editor & Publisher, www.searchtools.com
• Search Tools Consulting
  – Search needs analysis and recommendations
  – Enterprise search evaluation
  – Outsourced search administration


Page 4: Fundamentals Of Search

4

Defining Enterprise Search

• Large scale web site search
  – Corporate sites
  – Institutional sites
  – Online stores
• Intranet search
  – Crossing departmental lines
  – Opening data silos
• Extranets
• Portal Search


Page 5: Fundamentals Of Search

5

Similarities to Webwide Search

• Robot crawlers
• HTML over HTTP
• Scaling to millions of items
• Distributed processing
• Full-text indexing of content
• Simple query language
• Relevance ranking of results
  – TF-IDF (term frequency : inverse document frequency)
• Familiar results list


Page 6: Fundamentals Of Search

6

Differences from Web Search

• Limited scope
  – A site, set of sites, extranet, or intranet
• Few meaningful hyperlinks
  – PageRank and link analysis are less useful
• Security and access control issues
• Content in databases, CMSs, etc.
• More control
  – Index update scheduling
  – Some content is very valuable, other content is not
• No search spam


Page 7: Fundamentals Of Search

7

Text Search vs. Database Search

• Indexes multiple content sources
  – Database fields, files, web pages, feeds...
• Simple search commands instead of SQL
• Flexible indexing and retrieval
• Relevance ranking (this is a major issue)
• Does not compete for database resources
  – Easy to scale separately from DBMS
• New features: spellcheck, auto complete, facets
• Works in the real world, from eBay to Google


Page 8: Fundamentals Of Search

8

Search and Information Architecture

• Information Architecture
  – The art and science of organizing information for access and use.
• IA work enriches search
  – Creates order and systems
  – Provides standard vocabulary
  – Removes ROT (redundant, obsolete, trivial)
• Search supplements IA
  – Supports user vocabularies
  – Changes dynamically with new content


Page 9: Fundamentals Of Search

9

Search and Taxonomy

• Taxonomy creates categories
  – Labels and metadata
  – Improves quality of search results
  – Additional metadata extremely valuable
• Search crosses categories
  – Bypasses ambiguous topic labels
  – Useful for novices
  – Supports user vocabulary
  – Dynamic updates for new topics


Page 10: Fundamentals Of Search

10

Search & Knowledge Management

• KM is: “The process through which organizations generate value from their intellectual and knowledge-based assets.” (CIO Magazine)
  – Organizes information, processes and people
  – Offers collaboration and archiving tools
  – Attempts to regularize implicit knowledge

• Search mostly matches words


Page 11: Fundamentals Of Search

11

Two Main Types of Search

• Known-item search
  – Short queries
  – “Good-enough” answers
• Exploratory search
  – Research - finding unknowns
  – Scientific, legal, medical, business, sales
  – Conceptual overviews
  – Completeness - all possible relevant items
    • Law enforcement
    • Medicine


Page 12: Fundamentals Of Search

12

Search as an Iceberg

• All people see are the search box and results list
• Invisible functionality
  – Indexes
  – Query processing
  – Retrieval
  – Relevance ranking
• Search is a mystery
  – But it’s just software


Page 13: Fundamentals Of Search

13

Elements of Search Engines

• Automated tools to collect content
• Specialized storage for quick retrieval
• Query processing and expansion
• Retrieval (matching query to index content)
• Relevance ranking
• Search results interfaces
• Analytics, metrics and maintenance


Page 14: Fundamentals Of Search

14

Choosing Content To Index

• Information sites
  – Consider indexing every single page
  – Use search indexing as a discovery mechanism
• Online stores, catalogs
  – Product information: cost, color, size, materials
  – Other: return policies, CEO’s name, jobs listing
• Intranets
  – Intranet portal and core servers
  – May need archive servers and search
• Multimedia: images, audio, video
  – Metadata at least


Page 15: Fundamentals Of Search

15

(Near) Real Time Indexing

• Twitter has changed expectations
  – Even in intranets
• Index must support partial updates
  – Search engines finding limits at scale
  – Distribute indexing and indexes
• Trigger index updates (push vs. pull)
  – Continuous feed
  – Send web service message
  – Database trigger
  – Update watched URLs with new links


Page 16: Fundamentals Of Search

16

Indexing and Security

• Search can undermine “security by obscurity”
  – One link can expose a whole set of documents
• Work with your security team
  – List areas which contain sensitive content
  – Define words which trigger further analysis
  – Create a process for removing sensitive data
• Indexing encrypted content
  – Search engine uses SSL client for indexing
  – Encrypt search results before returning
  – Physical security on search servers


Page 17: Fundamentals Of Search

17

Search and Access Control

• Authentication and authorization in indexing
  – “Basic authentication” - user name and password
  – NT Security integration
  – ACLs and single sign-on
• Conform to security rules during indexing
  – Keep access control info as part of document store
• Showing results - who can see what?
  – Access to search engine itself
  – Collection-level access control
  – Locked results as teaser for subscription
  – Hit-level access control
    • Check before displaying results


Page 18: Fundamentals Of Search

18

Indexing: Sources of Content

– Web sites
– Intranets
– Extranets
– Blogs
– Wikis
– Mailing list archives & email public folders
– File systems & shared servers
  • NFS, SMB, AFP, GFS, ftp, WebDAV
– Content Management Systems
– Databases
– Legacy programs in silos


Page 19: Fundamentals Of Search

19

Indexing: Robot Spiders

• Start with base URL for all hosts
• For each page, repeat
  – Read text into internal format
  – Save document in cache
  – Save words into index
  – Extract all links and check the rules
  – If they are new URLs, add them to the list
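The spider loop above can be sketched as a breadth-first crawl. This is only an illustration, not code from the workshop: `fetch` is a hypothetical callable standing in for an HTTP client, the in-memory `SITE` dictionary replaces a real web server, and "the rules" are reduced to a same-host check.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects lowercased words and href links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(base_url, fetch, max_pages=100):
    """Breadth-first crawl from base_url; returns (cache, index)."""
    allowed_host = urlparse(base_url).netloc
    queue = deque([base_url])
    seen = {base_url}
    cache = {}   # URL -> raw page text
    index = {}   # word -> set of URLs containing it
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:              # missing page or fetch error
            continue
        parser = LinkAndTextParser()
        parser.feed(page)             # read text into internal format
        parser.close()
        cache[url] = page             # save document in cache
        for word in parser.words:     # save words into index
            index.setdefault(word, set()).add(url)
        for href in parser.links:     # extract links and check the rules
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == allowed_host and link not in seen:
                seen.add(link)        # new URL: add it to the list
                queue.append(link)
    return cache, index


# Hypothetical three-page site used instead of real HTTP requests.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> home page',
    "http://example.com/a.html":
        '<a href="b.html">B</a> <a href="http://other.org/">x</a> alpha',
    "http://example.com/b.html": '<a href="/">home</a> beta',
}

cache, index = crawl("http://example.com/", SITE.get)
print(sorted(cache))
```

A production robot would also honor robots.txt, throttle requests, and detect duplicates; none of that is attempted here.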


Page 20: Fundamentals Of Search

20

Robot Indexing Spider


Page 21: Fundamentals Of Search

21

Common Problems With Robots

• Pages that are not linked from anywhere
• Spider disallowed by robots.txt or robots meta
• URLs with ? and & (all robots should handle these now)
• JavaScript, forms, and interactive dynamic links
  – Some robots can handle some of these
• Session IDs that change
• Duplicate detection
  – Multiple views of the same data (Lotus, wikis)
  – Symbolic links & bad redirects
  – Multiple copies of files or directories


Page 22: Fundamentals Of Search

22

Indexing: Other Data Sources

• RSS feeds: nice clean text
• File servers: SMB, file:/// etc.
• Content / Document Management Systems
• Email archives
• Databases via ODBC, JDBC, Oracle API
  – Full-text content
  – Metadata: library catalog records, yellow pages
• External sources using APIs (application programming interfaces)
  – News feeds (Reuters, AP)
  – Twitter


Page 23: Fundamentals Of Search

23

Indexing: Text Files

• Plain text is easy
• RTF export format: text is easy to find
• HTML semi-structured text
  – Content is between tags and in attributes
  – Generated by JavaScript - hard to extract
  – Bad HTML, especially missing </ close tags
• XML files (structured)
  – Many tags are document-level
  – Content is between tags and in attributes
  – Complex tag hierarchy
    • TEI (Text Encoding Initiative) & Semantic Web
    • XQuery and XPath tools


Page 24: Fundamentals Of Search

24

Indexing: Binary File Formats

• PDF
  – Scanned, may not have any text
  – Bad PDF generators break words at columns
  – “Shadow” text effect duplicates letters
• SWF and Flash: API may not load dynamic text
• Office documents
  – Word processing files (may have hidden text from revisions)
  – Spreadsheets (hard to know what to grab)
  – Presentations
  – Note: new docx, xlsx, pptx are really XML file sets
• CAD and project files
• Metadata (properties, Adobe XMP)

Page 25: Fundamentals Of Search

25

Indexing: Tokenizing

• Lowercase all characters (aka ‘folding’)
• Tokenizing makes words searchable
  – Break on punctuation and spaces
  – Recognize special words: C++ @ [TS]
  – Typography issues: the ligature ﬆ is really “st”
  – HTML escaped text: möchten = m&ouml;chten
  – Special cases for structured strings
    • Numbers, Prices, Dates
• N-grams - an alternate approach
  – Break into short text patterns
  – Takes a lot of index space
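A minimal sketch of these tokenizing steps, using only the Python standard library. The `SPECIAL_WORDS` list and the rule that keeps digits joined across `.`, `,`, `/` and `-` are illustrative assumptions, not rules from any particular search engine.

```python
import html
import re
import unicodedata

# Tokens that a plain "break on punctuation" rule would destroy.
# This list is an assumption; a real engine would make it configurable.
SPECIAL_WORDS = ("c++", "c#", "at&t")

# Special words are tried first; otherwise take a run of word characters,
# keeping structured strings such as 19.99 or 2009-11-17 in one piece.
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(w) for w in SPECIAL_WORDS) + r"|\w+(?:[.,/-]\d+)*"
)


def tokenize(text):
    """Return the list of searchable tokens for a piece of text."""
    text = html.unescape(text)                   # m&ouml;chten -> möchten
    text = unicodedata.normalize("NFKC", text)   # ligature U+FB06 -> "st"
    text = text.lower()                          # case folding
    return TOKEN_PATTERN.findall(text)


def char_ngrams(token, n=3):
    """Overlapping character n-grams, the alternate approach."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


print(tokenize("M&ouml;chten Sie C++ for $19.99 on 2009-11-17?"))
print(char_ngrams("search"))
```

The n-gram output shows why that approach costs index space: a six-letter word becomes four index entries instead of one.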


Page 26: Fundamentals Of Search

26

Indexing: Character Set Issues

• World has many charsets (aka scripts, alphabets)
  – English has a simple alphabet: 26 letters, 10 numbers
  – Other Roman languages: extended (ç, î, ß)
  – Non-Roman one byte: Cyrillic, Arabic, Hebrew
  – Asian two bytes: Chinese, Japanese, Korean
• Identifying character sets
  – Unicode characters
  – Older usage: language “code pages”
  – HTTP header or <META http-equiv>
  – Statistical detection techniques


Page 27: Fundamentals Of Search

27

Indexing: Language Issues

• Text search works across languages
  – Simple pattern-matching, query to index
• Language-specific indexing improves search
  – Tokenizing using appropriate rules
    • Compound nouns (kindergarten)
  – Language rules for stemming
    • Singular version of thés is thé
• Language detection
  – Trusted tags
  – Bilingual dictionaries
  – Statistical matches, n-grams
• Documents may have mixed languages…

Page 28: Fundamentals Of Search

28

Indexing: Multimedia

• Images, photos, drawings, sound, scores, video
• External metadata
  – File name
  – Link text, surrounding words
• Internal metadata
  – ID3 tags for music
  – EXIF and other digital photo information
  – Subtitles (sometimes)
• Content
  – OCR to extract graphic text and closed captions
  – Audio: Speech-to-text conversion, still buggy
• Use human judgment, not just automated systems


Page 29: Fundamentals Of Search

29

Inverted Index Diagram

• Inverted indexes work well
• Lots of IR research shows this
• Better than DBMS
• Alphabetical list of tokens
• Tokens not in paragraph order, thus, inverted
• Each token has ID of source


Page 30: Fundamentals Of Search

30

Richer Index Structures

• Store word position (for phrase matching)
• Enclosing tag or field
  – Document metadata
  – Database field names
  – Image (which attribute)
  – Named anchor text
  – Text markup tags (TEI, Semantic Web)
  – Extracted entities
    • Personal names, companies, geo locations, dates
• Anchor text from incoming links
  – Can be very descriptive
  – Add to index as if part of the target document


Page 31: Fundamentals Of Search

31

Example Inverted Index Structure

• For each word
  – Document ID
  – Position
  – Tag name
• For each document
  – ID
  – Title
  – URL
  – Description
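The two structures listed above can be sketched with plain dictionaries. The input layout (a `fields` mapping of tag name to text) and the sample documents are assumptions made for the example; tokenizing is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_index(documents):
    """Build (postings, doc_store) from a list of document dictionaries.

    postings maps each word to a list of (document ID, position, tag name).
    doc_store maps each document ID to its title, URL and description.
    """
    postings = defaultdict(list)
    doc_store = {}
    for doc in documents:
        doc_store[doc["id"]] = {
            "title": doc["title"],
            "url": doc["url"],
            "description": doc["description"],
        }
        position = 0                      # word offset within the document
        for tag, text in doc["fields"].items():
            for word in text.lower().split():
                postings[word].append((doc["id"], position, tag))
                position += 1
    # Alphabetical list of tokens, as in an inverted index.
    return dict(sorted(postings.items())), doc_store


# Hypothetical sample documents.
DOCS = [
    {"id": 1, "title": "Lunch menu", "url": "http://example.com/lunch",
     "description": "Cafeteria lunch menu",
     "fields": {"title": "Lunch menu", "body": "Soup and salad for lunch"}},
    {"id": 2, "title": "Travel policy", "url": "http://example.com/travel",
     "description": "Expense rules",
     "fields": {"title": "Travel policy", "body": "Lunch is reimbursed"}},
]

postings, doc_store = build_index(DOCS)
print(postings["lunch"])
```

Storing the position supports phrase matching, and storing the tag name lets ranking treat a title match differently from a body match.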


Page 32: Fundamentals Of Search

32

Indexing: Stopwords

• Stopwords - very common terms
  – Linguistic (a an the as he she it you new)
  – Ubiquitous (names, copyright, click here)
• Consequences of excluding stopwords:
  – Reduces the size of index files
  – Improves recall, finds more matching documents
  – Fails some queries
    • As You Like It, IT copyright policy
  – Problems matching phrases: “New York University”
• Solutions vary:
  – Index everything, pay the price in index size
  – CommonGrams: n-grams of frequent phrases
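A small sketch of the trade-off, using the linguistic stopword list from this slide. The `common_grams` function is a simplified take on the CommonGrams idea (joining a stopword to its neighbor into one extra token); it is an assumption for illustration, not the implementation used by any specific engine.

```python
STOPWORDS = {"a", "an", "the", "as", "he", "she", "it", "you", "new"}


def index_terms(text, drop_stopwords=True):
    """Terms that would be stored in the index for this text."""
    words = text.lower().split()
    if drop_stopwords:
        return [w for w in words if w not in STOPWORDS]
    return words


def common_grams(text):
    """Keep every word, and add a joined token for each adjacent pair
    that includes a stopword, so such phrases can be matched as units."""
    words = text.lower().split()
    terms = list(words)
    for left, right in zip(words, words[1:]):
        if left in STOPWORDS or right in STOPWORDS:
            terms.append(left + "_" + right)
    return terms


print(index_terms("As You Like It"))        # only "like" survives
print(index_terms("New York University"))   # "new" is lost
print(common_grams("As You Like It"))
```

With stopwords dropped, the play title is indexed as the single word "like", which is why such queries fail; the joined tokens keep the phrase findable at the cost of a larger index.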


Page 33: Fundamentals Of Search

33

Stopwords Problems: Example

• Searching wordpress.com for whatever will be
• Finds all matches for whatever (stopwords ignored)
• Useless results ranking
• No matches for will be
• One ad gets it right

• External search finds over 3,000 pages on site with phrase


Page 34: Fundamentals Of Search

34

Indexing: Stemming

• Singular query should find plural words & vice versa
  – Shoe <=> shoes, cans <=> can, geese <=> goose
  – Statistical and probabilistic truncation rules
  – Linguistic rules
    • Lemmatization - stemming based on part of speech
• Stemming before indexing
  – Improve recall: find all forms of a word
  – Reduce index size
• Consequences of extreme stemming
  – Short query problems
  – Search for Ran shouldn’t match Run, Lola, Run
• Other options
  – Index everything (makes indexes larger and queries slower)
  – New idea: CommonGrams (n-grams of frequent phrases)


Page 35: Fundamentals Of Search

35

Indexing: Document Store

• Minimum
  – ID (key for inverted index)
  – Unique location (URL / file path / record ID)
• Richer document store
  – Implicit metadata: filename, size, location
  – Explicit metadata
    • Title, date, keywords, author
    • Taxonomy labels, classification, user tagging
  – Language, character set
  – Access control settings
• Full text of the document
  – For snippets and caching


Page 36: Fundamentals Of Search

36

Indexing: Dealing with Duplicates

• Detecting duplicate documents
  – Exact match is fairly easy: checksums
  – Document similarity check: harder but worth it
• Choosing the primary copy
  – Most recent (if reliable)
  – Rules based on path or metadata
  – New web search “canonical” tag
• What to do with duplicates
  – Remove from the index: saves space
  – Hide in results unless requested
    • That’s the Google way
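Both detection approaches can be sketched briefly. The checksum uses SHA-256 from the standard library. For near-duplicates, the slide does not name a method; word shingles compared with Jaccard similarity are used here as one common technique, and the shingle size of 3 is an arbitrary choice.

```python
import hashlib


def checksum(text):
    """Exact-duplicate fingerprint: identical text gives identical digests."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping word sequences ("shingles") of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
near_copy = "the quick brown fox jumps over the lazy cat"

print(checksum(original) == checksum(near_copy))   # False: not exact copies
print(similarity(original, near_copy))             # 0.75: six of eight shingles shared
```

A one-word change defeats the checksum completely but barely moves the similarity score, which is why the similarity check is "harder but worth it".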


Page 37: Fundamentals Of Search

37

Indexing: Document Dates

• HTTP servers lie about dates
  – Frequent wrong settings: 1969, 2040
  – Dynamic pages send the current timestamp
• File systems lie about dates
• Applications lie about dates
• Indexers do the best they can
  – Metadata (date tag, property, tag DC.date)
  – Extract from page content
  – Checksum to see if file has changed since last index
  – Consider external metadata repository


Page 38: Fundamentals Of Search

38

Search Process Flow


Page 39: Fundamentals Of Search

39

Where the Queries Come From

• User-entered text in search fields
• Search navigation: moving around in results list
• Previous searches
  – May just be repeated clicks on URL
  – Save Search feature
  – Simplistic alerts
• Facet click to add a metadata filter
  – May re-issue search with additional terms
  – May be navigational, no text query
• Scripts or automated queries
  – Dynamic links (find all pictures by this artist)
  – Geographic information systems


Page 40: Fundamentals Of Search

40

Query Processing Steps

1. Try to recognize the character set and language
2. Tokenize the text by language rules
   – Break at spaces and punctuation
   – Same algorithm as index tokenizer
3. Check for operators
   – Internet Query Operators: + - "quotes"
   – Boolean Operators: AND OR NOT & | !
   – Others: NEAR, (parentheses)
4. Check for field names, zones, other filters
   – Example: title:lunch location=94703
5. Handle the rare natural language question
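Steps 3 and 4 can be sketched with one regular expression. This covers only the `+`, `-`, quoted-phrase and `field:value` / `field=value` forms from the slide; Boolean keywords, NEAR and parentheses are left out, and the clause dictionary layout is an assumption made for the example.

```python
import re

QUERY_TOKEN = re.compile(
    r'(?P<sign>[+-]?)'                            # + (required) or - (excluded)
    r'(?:(?P<field>\w+)[:=])?'                    # optional field name, e.g. title:
    r'(?:"(?P<phrase>[^"]*)"|(?P<word>[^\s"]+))'  # quoted phrase or bare word
)


def parse_query(query):
    """Split a query string into a list of clause dictionaries."""
    clauses = []
    for match in QUERY_TOKEN.finditer(query):
        text = match.group("phrase")
        is_phrase = text is not None
        if not is_phrase:
            text = match.group("word")
        clauses.append({
            "text": text.lower(),
            "phrase": is_phrase,
            "field": match.group("field"),
            "required": match.group("sign") == "+",
            "excluded": match.group("sign") == "-",
        })
    return clauses


for clause in parse_query('title:lunch location=94703 +"free food" -meeting'):
    print(clause)
```

Each clause can then be handed to the retrieval step: field clauses become filters, required and excluded clauses become AND and NOT operations, and phrase clauses need position matching.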


Page 41: Fundamentals Of Search

41

Query Expansion

• Stemming
  – Dependent on index stemming choices
  – Good to find singular/plural forms
• Word similarity searching - increases recall
  – Fuzzy matching
  – Phonetic, soundex, sound-alike
  – May overwhelm exact matches
• Synonym expansion, should be site-specific
  – bus => coach, ATM => Air Tasking Message


Page 42: Fundamentals Of Search

42

Search: Retrieval, Recall & Precision

• Retrieval
  – Finding the documents matching a particular query
• Recall
  – Finding every relevant document
• Precision
  – Finding only relevant documents
• Balance more recall vs. better precision
  – Use search logs and user studies to guide choices
• Use precision as part of relevance ranking
  – Top results should be more exact matches
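The two measures above have standard definitions that are easy to compute once someone has judged which documents are relevant. The document IDs below are made up for illustration.

```python
def precision_and_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Five documents came back; four documents in the collection are relevant.
precision, recall = precision_and_recall(
    retrieved=[1, 2, 3, 4, 5],
    relevant=[2, 4, 6, 8],
)
print(precision, recall)   # 0.4 0.5
```

Query expansion usually raises recall and lowers precision, which is the balance the slide asks you to tune with logs and user studies.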


Page 43: Fundamentals Of Search

43

One-Word Text Retrieval

– Fast binary search in inverted index
  • Check index updates on disk or in memory
  • If there are distributed indexes, merge results
– Store the related document information in a list
  • Document ID
  • Term frequency in document
  • Term positions in the document
  • Note: The document list is not yet sorted
– Frequent searches may be cached
  • “Short head” vs. “long tail”


Page 44: Fundamentals Of Search

44

Multi-Word Text Retrieval

• Relationship between words defines results
  – Boolean AND, + operator, find all default
    • Only documents which contain all terms
  – Boolean OR operator, find any default
    • All documents with any term
  – Boolean NOT, - operator
    • All documents with the first term but not next term
  – Phrase operators, quotes
    • Only documents with the words as a phrase
  – Also check for zones or field filters
  – Parentheses: use for order of processing
• Merge resulting lists
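These operators map onto set operations over the per-term document lists. The sketch below assumes a positional index shaped as term -> {document ID: [positions]}; that layout and the three sample documents are assumptions for the example.

```python
def docs_with(index, term):
    """Set of document IDs whose posting list contains the term."""
    return set(index.get(term, {}))


def match_all(index, terms):
    """Boolean AND, + operator: documents containing every term."""
    sets = [docs_with(index, t) for t in terms]
    return set.intersection(*sets) if sets else set()


def match_any(index, terms):
    """Boolean OR: documents containing at least one term."""
    return set().union(*(docs_with(index, t) for t in terms))


def match_not(index, first, excluded):
    """Boolean NOT, - operator: the first term but not the next term."""
    return docs_with(index, first) - docs_with(index, excluded)


def match_phrase(index, terms):
    """Documents where the terms occur at consecutive positions."""
    results = set()
    for doc_id in match_all(index, terms):
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                results.add(doc_id)
                break
    return results


# Doc 1: "new york university"; doc 2: "york ... new"; doc 3: "york university"
INDEX = {
    "new": {1: [0], 2: [3]},
    "york": {1: [1], 2: [0], 3: [0]},
    "university": {1: [2], 3: [1]},
}

print(match_all(INDEX, ["new", "york"]))      # {1, 2}
print(match_phrase(INDEX, ["new", "york"]))   # {1}
```

Document 2 contains both words but not next to each other, so it satisfies AND and fails the phrase test.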

Page 45: Fundamentals Of Search

45

Relevance Ranking Algorithms

• Relevance
  – The likelihood that an item will fill an information need
  – Based on documents in retrieval list
• Most common algorithm: TF-IDF (term frequency : inverse document frequency)
  – How often is the query word in the document?
  – How often is the word in the index?
• Other relevance algorithms
  – Vectors and document-query similarity
  – Linguistic analysis and Natural Language Processing
  – Statistical and Bayesian analysis
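The two TF-IDF questions above translate into a short scoring function. This sketch uses one common textbook weighting (term count divided by document length, times the natural log of total documents over documents containing the term); real engines use many variants and normalizations, so treat the exact formula as an assumption.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Rank documents by the sum of tf * idf over the query terms.

    documents maps a document ID to its list of tokens.
    Returns (doc_id, score) pairs, highest score first.
    """
    total_docs = len(documents)
    doc_freq = Counter()                 # how many documents contain each word
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)                # in the document
                idf = math.log(total_docs / doc_freq[term])    # in the index
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tokenized documents.
DOCS = {
    "a": ["search", "engine", "search", "index"],
    "b": ["search", "results", "page"],
    "c": ["lunch", "menu"],
}

for doc_id, score in tf_idf_scores(["search", "index"], DOCS):
    print(doc_id, round(score, 3))
```

Document "a" ranks first because it repeats "search" and also contains "index", a rarer word that carries a higher weight.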


Page 46: Fundamentals Of Search

46

Relevance Heuristics

• Phrase matches for multiple query terms
  – Logs show most multi-word searches are phrases
• Query terms found in special sections
  – Title
  – Metadata
  – Top of document
• All terms matched in document
  – Even when not relevant, it’s transparent
  – Old systems gave excess weight to single rare terms


Page 47: Fundamentals Of Search

47

More on Relevance

• Relevance is task-specific
  – Results can never please all of the people
  – More like berry-picking than like hunting
• Link analysis (PageRank) not very useful
  – Intranet and site links tend to be navigational
• Situation-specific adjustments
  – Some areas more likely to be valuable
  – Current content
  – Local content


Page 48: Fundamentals Of Search

48

Federated Search and Relevance

• Send query to multiple search engines
  – May require special syntax
  – Response time often a factor
  – Receive results in relevance order for each
• Display results, two options
  – Separate sections for each search engine
  – Merged single relevance rank list
    • Works if all search indexes are similar
    • Problems where the sources are very different


Page 49: Fundamentals Of Search

49

Retrieval: Access Control

• Limit access to search itself
  – User enters password or other credentials
  – Search only accepts queries when authenticated
• Collection-level access control
  – Query filter only retrieves items from allowed groups
• Hit-level access control
  – Real-time check for user access on documents
  – Start with most relevant documents
  – Repeat until there are ten (may be slow)
  – Display top results, include estimate of how many more
  – Show helpful message if user can’t see any
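The hit-level steps can be sketched as a filtering loop over the ranked list. `can_view` is a hypothetical callback standing in for a real-time check against the access control system, the `ACL` table is made-up sample data, and the "how many more" estimate is a crude proportional guess, not a method given on the slide.

```python
def visible_results(ranked_doc_ids, user, can_view, page_size=10):
    """Walk the ranked list, keeping only documents the user may see.

    Returns (page, checked): the visible hits and how many were checked.
    """
    page = []
    checked = 0
    for doc_id in ranked_doc_ids:        # most relevant documents first
        checked += 1
        if can_view(user, doc_id):       # real-time access check
            page.append(doc_id)
            if len(page) == page_size:   # stop once the page is full
                break
    return page, checked


def estimate_more(page, checked, total):
    """Rough guess at further visible hits, assuming the same pass rate."""
    if checked == 0:
        return 0
    return round(len(page) / checked * (total - checked))


# Hypothetical access control list: user -> documents they may read.
ACL = {"alice": {1, 2, 5, 8}}


def can_view(user, doc_id):
    return doc_id in ACL.get(user, set())


ranked = [1, 2, 3, 4, 5, 6, 7, 8]
page, checked = visible_results(ranked, "alice", can_view, page_size=3)
print(page, checked, estimate_more(page, checked, len(ranked)))

page, checked = visible_results(ranked, "bob", can_view, page_size=3)
if not page:
    print("No results you are allowed to see; try signing in.")
```

The loop shows why this approach "may be slow": a user with few permissions forces a check on every document in the list.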


Page 50: Fundamentals Of Search

50

Search User Experience

• Limit user interface complexity
  – Show the scope of the information covered
  – Expose query expansion and contraction
  – Use familiar UI elements
• User experience goes beyond interface
  – Index coverage
  – Query syntax
  – Retrieval quality and speed
  – Relevance ranking (first ten are vital)


Page 51: Fundamentals Of Search

51

Search Forms Interface

• Balance simplicity with functionality
• Put a search field in the navigation bar
  – Location should be consistent
  – Longer is better: short fields lead to short queries
• Simple Search forms: limit options
  – Zone or section
  – Dates


Page 52: Fundamentals Of Search

52

Search Field Auto-Complete

• Dropdown menu of matching words
• Base on search logs
• Smallish list, 7-10
  – Most popular
• Simple sort
  – Alphabetic
  – Price or size
  – Complete range (preferably lowest to highest)
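A log-based suggestion list along these lines can be sketched in a few lines. The sample log is invented, and matching on the start of the whole query (rather than on any word in it) is a simplifying assumption.

```python
from collections import Counter


def suggest(prefix, logged_queries, limit=10):
    """Most frequent logged queries that start with the typed prefix."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    counts = Counter(q.strip().lower() for q in logged_queries)
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Most popular first; ties broken alphabetically for a stable menu.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


# Hypothetical search log.
LOG = [
    "vacation policy", "vacation policy", "vacation request", "vpn",
    "vacation policy", "vacation request", "vacancies",
]

print(suggest("vac", LOG, limit=2))
print(suggest("v", LOG))
```

Recounting the whole log on every keystroke would be too slow in production; a real implementation would precompute the counts.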


Page 53: Fundamentals Of Search

53

Other Search Interfaces

• Heavily researched
• Natural language
  – Must keep typing
  – Defining a question is quite hard
• Interactive search
  – Guided interviews
  – But users want immediate results

• Avatars – do not improve interaction


Page 54: Fundamentals Of Search

54

Simple vs. Advanced Search UI

• Most searches are simple
  – Short: one to three words
  – Fewer than 10% use any operators at all (maybe 1%)
  – Even experts prefer simple search
    • Will use advanced tools if simple doesn’t work
• Default to simple search, link to advanced search
  – Those are your power users: librarians, techies
  – Expose all possible options
  – Don’t spend huge resources on advanced UI

• Exploratory search is different


Page 55: Fundamentals Of Search

55

Advanced Search Fits Sometimes

• eBay
  – High motivation
  – Complex search requirements
  – Frequent use

• UX testing still required


Page 56: Fundamentals Of Search

56

Search Results: Page Elements

• Site context
  – General page layout, navigation links
  – Colors and design elements
• Results header
  – A search field, with the current search terms
  – Retrieval information - how many hits
• Results list in relevance order
  – Each result item with at least a linked title
• Facets: dynamic links for filtering results
• Results footer


Page 57: Fundamentals Of Search

57

Search Results: Good Example

• Full but readable
• white space
• content blocks
• Site look-and-feel
• Navigation
• Familiar search results elements


Page 58: Fundamentals Of Search

58

Search Results: Not-So-Good Example

Site page has navigation, colors: search results should too

Page 59: Fundamentals Of Search

59

Search Results: Visualization

• Fascinating to look at, great demos
  – Star charts
  – Topographical displays
  – Interactive fly-throughs
  – Hyperbolic trees
• Require significant resources to run
• Good for exploratory & comprehensive research
  – Finding unexpected synergies
• Simple search is much cheaper for casual users


Page 60: Fundamentals Of Search

60

Search Results: Header Elements

• Search field, with the current query
  – Users often edit to be more or less restrictive
• Number of results found
• A few search options
  – Match Any Word / All Words / Exact Phrase
  – Filter by date option (if trustworthy)
  – Search zones
• Results navigation
• Best Bets
• Spelling suggestions


Page 61: Fundamentals Of Search

61

Search Results: Hits and Pages

• Show number of items matched
  – Be accurate
  – Do not give estimates for small numbers
    • (Google and SharePoint are bad this way)
• Pagination - results list navigation
  – Helps user calibrate content
  – Important for exploratory search
  – Follow web search conventions, example: < previous 1 2 3 4 ... 26 next >
  – Be accurate


Page 62: Fundamentals Of Search

62

Results Headers: Examples


Page 63: Fundamentals Of Search

63

Search Results: “Best Bets”

aka Search Suggestions, QuickLinks, KeyMatch, Recommendations

• Special-case links for problem queries
  – Internal topic landing pages
  – External sites when appropriate
  – New and better query to search
• Only implement for very frequent queries
  – Discover problems from users, log analysis
  – “Short head” - few very popular query terms
  – Allocate resources to keep them current

• Good search results are higher priority


Page 64: Fundamentals Of Search

64

Best Bets Example

• Best Bets are very clear

• Would not come first in normal search results


Page 65: Fundamentals Of Search

65

Search Results: List Sorting

• List of links to items matching the query
• Sorted by matching terms
  – Impossible to be relevant to every query
  – Variety of sources when possible
  – Transparency: why these items in this order
• Other sort orders - make very visible
  – By author’s last name
  – By date
  – By price


Page 66: Fundamentals Of Search

66

Search Results: Not Enough Variety


Page 67: Fundamentals Of Search

67

Search Results: Weird Sort

Sorted by: “Degrees away”

Labels too subtle:
• Hidden in header
• Degree icon should be on the left side


Page 68: Fundamentals Of Search

68

Result Items: Elements

• Information foraging: show hints about items
• Title of document, or name of product
• Location: URL, file path, database ID
  – May need to rewrite to user-accessible URLs
  – Hide location if it’s not meaningful
• Distinguishing data
  – Metadata: picture, product code, author name
• Show match terms in context (snippets)
  – Text before and after query term matches
  – Highlight the matches
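Match-in-context snippets can be sketched as "find the first match, take some text on each side, highlight". The `**` marker, the `width` parameter and the sample sentence are assumptions for the example; a real results page would emit HTML highlighting and would escape the text first.

```python
import re


def snippet(text, query_terms, width=30, marker="**"):
    """Text around the first query term match, with matches highlighted.

    Returns None when there are no terms or none of them occur in the text.
    """
    if not query_terms:
        return None
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in query_terms) + r")\b",
        re.IGNORECASE,
    )
    first = pattern.search(text)
    if first is None:
        return None
    # Widen the window to whole words so no word is cut in half.
    start = text.rfind(" ", 0, max(0, first.start() - width)) + 1
    end = text.find(" ", min(len(text), first.end() + width))
    if end == -1:
        end = len(text)
    excerpt = pattern.sub(lambda m: marker + m.group(0) + marker,
                          text[start:end])
    return ("..." if start > 0 else "") + excerpt + ("..." if end < len(text) else "")


TEXT = ("The travel policy covers airfare, hotels, and lunch "
        "during business trips abroad.")

print(snippet(TEXT, ["lunch"], width=10))
```

Showing the surrounding words lets the user judge the hit without opening the document, which is the "information foraging" hint the slide describes.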


Page 69: Fundamentals Of Search

69

Results Items: Not Enough Content


Page 70: Fundamentals Of Search

70

Results Items: Too Much Content


Page 71: Fundamentals Of Search

71

Results Items: Just Right


Page 72: Fundamentals Of Search

72

Results Items: Additional Data

• Date (if reliable)
• Size and File type
  – Avoid surprising launches of Acrobat or other app.
• Metadata
  – Author, department, brand, product...
• Access status: password required?
• Topics and subject headings
  – Taxonomy categories
  – Keywords and concept tags
  – User tags, folksonomy


Page 73: Fundamentals Of Search

73

Results Items: Rich Items Example


Page 74: Fundamentals Of Search

74

Results: Dynamic Clustering

• Uses search results text to infer topics
  – Groups by similarity in titles and results text
• Particularly good for portals and intranets
  – Unstructured, uncontrolled text
  – Dynamic, no preprocessing needed

• Can supplement categorization and taxonomies


Page 75: Fundamentals Of Search

75

Results: Clustering Example


Page 76: Fundamentals Of Search

76

Commerce and Catalog Results

• Picture or graphic if possible
• Important attributes
  – Price
  – Color
  – Size
  – Compatibility
  – Availability
• “Buy” button
  – Simplify process, save time


Page 77: Fundamentals Of Search

77

Online Store Results Example


Page 78: Fundamentals Of Search

78

Multimedia Search

• Image, audio, and video files
  – Audio and visual similarity search still theory
• Show context in results
  – Match terms from transcript or OCR
  – Text around image
  – Thumbnails or keyframes

Page 79: Fundamentals Of Search

Multimedia Results Example

Page 80: Fundamentals Of Search

Results: Faceted Metadata

• Better than forms for structured text data
  – Exposes attributes as part of search results
  – Leverages metadata
    • Topic names, taxonomy
    • Mundane stuff: color, date, size, author...
• Choices specifically relating to search results
  – Dynamically generated from metadata
  – Preview numbers offer users confidence in clicking
• Supported by extensive usability testing
• Used on a majority of large e-commerce sites
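
A minimal sketch of how facet choices and preview numbers could be generated from the metadata of the current result set. The field names and the dictionary-per-item layout are assumptions for illustration; real engines compute these counts inside the index.

```python
from collections import Counter

def facet_counts(results, fields):
    """Count each metadata value across the current result set.

    `results` is a list of dicts (one per matching item); `fields` names
    the metadata attributes to expose as facets.
    """
    counts = {field: Counter() for field in fields}
    for item in results:
        for field in fields:
            value = item.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts

def narrow(results, field, value):
    """Apply one facet choice: keep only items with that metadata value."""
    return [item for item in results if item.get(field) == value]
```

Because the counts come from the result set itself, every choice shown leads to at least one item, and the preview number matches what the user sees after clicking.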

Page 81: Fundamentals Of Search

Why Faceted Search is Better Than Forms

Page 82: Fundamentals Of Search

Faceted Metadata: Commerce Example

Page 83: Fundamentals Of Search

Faceted Metadata: Library Catalog

Page 84: Fundamentals Of Search

No Matches Queries: Causes

• Misspellings and typing errors
• Scope problem: nothing for that topic
• Vocabulary differences
  – Users may be less precise, or use competitor’s terms
  – Marketers may dominate content
• Restrictive search settings
  – Default may only match exact phrase or all words
  – Access control may disallow user
• Software/hardware/network failures

Page 85: Fundamentals Of Search

No Matches Queries: Responses

• Track queries with no matches in logs
• Use sessions, surveys & testing to find user intent
• Design the no-matches page carefully
  – Explain what is and isn’t on the site
  – Provide useful navigation links
• Add search engine help
  – Synonyms
  – Best Bets
  – Spelling
• Add terms to text
• Add content, topic pages

Page 86: Fundamentals Of Search

No Matches Queries: Spelling Issues

• Detect and address common problems
  – Spelling errors
  – Typos
  – Queries without spaces between words
• Use site-specific dictionary
  – Easy to build from search index
  – Never suggests any words not on the site
• Users are familiar with “Did you mean...?”
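
A rough sketch of a site-specific "Did you mean" helper, assuming the list of indexed terms can be exported from the search index. It handles single-word queries only, and the function names and the 0.8 similarity cutoff are illustrative choices. The fuzzy matching uses `difflib.get_close_matches` from the Python standard library rather than any search engine's own speller.

```python
import difflib

def build_dictionary(indexed_terms):
    """Site-specific dictionary: only words that actually occur in the index."""
    return sorted(set(term.lower() for term in indexed_terms))

def split_run_together(query, dictionary):
    """Return 'word1 word2' if the query is two site words typed without a space."""
    words = set(dictionary)
    for i in range(1, len(query)):
        left, right = query[:i], query[i:]
        if left in words and right in words:
            return left + " " + right
    return None

def did_you_mean(query, dictionary):
    """Suggest a replacement drawn from the site's own vocabulary, or None."""
    q = query.lower().strip()
    if q in dictionary:
        return None  # already a known site term, nothing to suggest
    split = split_run_together(q, dictionary)
    if split:
        return split
    matches = difflib.get_close_matches(q, dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else None
```

Since every suggestion comes from the dictionary, the helper cannot propose a word that is absent from the site.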

Page 87: Fundamentals Of Search

Good Example of No-Matches Page

Page 88: Fundamentals Of Search

Empty Searches

• Users click the search button or press “enter” with an empty search box
• Test for this special case
• Should not find all items in the index
• Interaction options:
  – Do nothing
  – Go to a simple search page
  – Show an error dialog
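
One way to test for the empty-query case before the query reaches the engine, sketched with a hypothetical `handle_query` function and made-up action names. This version picks the "go to a simple search page" option from the list above.

```python
def handle_query(raw_query):
    """Route a submitted query, treating empty or whitespace-only input specially."""
    query = (raw_query or "").strip()
    if not query:
        # Never pass an empty query to the engine: some engines would
        # respond by matching every item in the index.
        return {"action": "show_search_page", "results": []}
    return {"action": "search", "query": query}
```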

Page 89: Fundamentals Of Search

Search Engine Maintenance

• Index maintenance
  – Obsolete content removal
  – Check for new content
  – Track technical problems (bad links, servers down)
• Search quality
  – Re-run test suite
  – Compare with original results
  – Add new test queries
• Track user feedback, surveys
  – Use metrics and log analysis to catch trends
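
The "re-run test suite, compare with original results" step could look something like this sketch. It assumes the baseline and the current run are both stored as a mapping from query to a ranked list of document IDs. That storage format and the function name are hypothetical.

```python
def changed_queries(baseline, current, k=5):
    """Report test queries whose top-k results differ from the saved baseline."""
    changes = {}
    for query, old in baseline.items():
        new = current.get(query, [])
        old_top, new_top = old[:k], new[:k]
        if old_top != new_top:
            changes[query] = {
                "dropped": [doc for doc in old_top if doc not in new_top],
                "added": [doc for doc in new_top if doc not in old_top],
            }
    return changes
```

A query that appears with empty "dropped" and "added" lists has the same top documents in a different order, which may still be worth a look.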

Page 90: Fundamentals Of Search

Metrics for Search Engines

• Server uptime
• Errors: how often and how serious
• Index
  – Size on disk and in memory
  – Number of entries
  – Number and type of indexing errors
• Search traffic
  – Queries per minute (60 qpm is common)
  – Average clicks on results items per query
  – Average next-page views per query
  – Number and percent of no-match queries
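
The search traffic numbers above can be derived from a query log. The following is a sketch under the assumption that each log entry has already been parsed into a dict with `time`, `matches`, `clicks`, and `next_pages` keys. Those key names are invented for the example, and real log formats vary by engine.

```python
from datetime import datetime

def search_metrics(log):
    """Summarize search traffic from parsed log entries.

    Each entry is a dict with 'time' (datetime), 'matches' (int),
    'clicks' (int) and 'next_pages' (int).
    """
    total = len(log)
    if total == 0:
        return None
    times = [entry["time"] for entry in log]
    # Treat very short logs as covering at least one minute.
    minutes = max((max(times) - min(times)).total_seconds() / 60, 1.0)
    no_match = sum(1 for entry in log if entry["matches"] == 0)
    return {
        "queries_per_minute": total / minutes,
        "avg_clicks_per_query": sum(e["clicks"] for e in log) / total,
        "avg_next_pages_per_query": sum(e["next_pages"] for e in log) / total,
        "no_match_count": no_match,
        "no_match_percent": 100.0 * no_match / total,
    }
```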

Page 91: Fundamentals Of Search

Search Log Analysis

• Most frequent query terms
  – Short head: a few very popular terms
  – Long tail of unique queries
  – Lots of junk: URLs, spam, gibberish
• Frequent query terms not matched - fix somehow
• More esoteric analysis - need a lot of data
  – Frequent query terms with low click-through
  – Frequent query terms with high “next page” clicks
• Raw logs
  – Import into database for ad-hoc reports
  – Session analysis can be enlightening
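
As an illustration of importing raw logs into a database for ad-hoc reports, here is a small sketch using Python's built-in `sqlite3` module. The table layout (session, query, matches, clicks) is an assumption for the example and would need to be adapted to the fields a given engine actually logs.

```python
import sqlite3

def load_log(rows):
    """Load (session, query, matches, clicks) tuples into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE searches (session TEXT, query TEXT, matches INTEGER, clicks INTEGER)"
    )
    conn.executemany("INSERT INTO searches VALUES (?, ?, ?, ?)", rows)
    return conn

def top_queries(conn, limit=10):
    """Most frequent queries, normalized for case and surrounding spaces."""
    return conn.execute(
        "SELECT lower(trim(query)) AS q, COUNT(*) AS n FROM searches "
        "GROUP BY lower(trim(query)) ORDER BY n DESC, q LIMIT ?",
        (limit,),
    ).fetchall()

def frequent_no_match(conn, min_count=2):
    """Queries that repeatedly found nothing: candidates for synonyms or new content."""
    return conn.execute(
        "SELECT lower(trim(query)) AS q, COUNT(*) AS n FROM searches "
        "WHERE matches = 0 GROUP BY lower(trim(query)) "
        "HAVING COUNT(*) >= ? ORDER BY n DESC, q",
        (min_count,),
    ).fetchall()
```

Keeping the session column makes later session analysis possible, for example listing every query a user tried before giving up.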

Page 92: Fundamentals Of Search

Choosing a Search Engine

• Find specific information needs
• Analyze content
  – Sources and formats
  – Rough number of pages / records / items
• Define platform, API, language requirements
• Buy (or use open source), don’t build
  – User surveys show problems with home-grown search
• Choose & compare likely candidates
  – Gathering, indexing, retrieval, relevance features
  – Scaling
  – Administration tools
  – Continuing development, support, user groups
  – Price

Page 93: Fundamentals Of Search

Information Needs Analysis

• What works already?
  – Don’t fix what’s not broken
• Where is the real pain?
  – Difficult search syntax
  – Data silos
  – New content not findable
• What requires more complex tools?
  – Exploratory search
  – Scientific & academic research
  – Business intelligence and data mining
  – Comprehensive legal discovery

Page 94: Fundamentals Of Search

Content Inventory

• Work with Information Architects
  – Use existing taxonomies and catalogs
• Learn what you have
  – Simple static HTML pages
  – Other formats: PDF, Office documents (which version)
  – CMS, document management, publishing systems
  – Databases and legacy systems
  – Multimedia audio and video files
• Identify more and less valuable data
  – Some content should be in archives

Page 95: Fundamentals Of Search

Search Engine Deployment Types

• Software
  – Controlled by local IT
  – Flexible installation
  – Open-source - several high quality packages
• Search Appliances
  – Server hardware/software combinations
  – Require very little technical attention
  – Check development and backup server pricing
• Remote Search Services (SaaS)
  – Index using robot spiders or remote access
  – Query goes to service, results go back to user
  – Low network, hosting, IT load

Page 96: Fundamentals Of Search

Scaling Search to Millions & Billions

• What are the largest installations for each candidate engine?
  – Talk to them before committing
• Cache frequent queries
• Add query servers, automated load balancing
• Indexing at scale
  – Indexing on dedicated servers
  – Deal with new calls for near-real-time indexing
  – Distribute multiple clones of indexes
  – Segment indexes, parallel lookups, merge results
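
A toy sketch of two ideas from this slide: caching frequent queries, and looking up a term in several index segments in parallel before merging the results. The in-memory `SEGMENTS` data, the score values, and the function names are all made up for illustration. A real deployment would query separate index servers rather than Python dictionaries.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from itertools import islice

# Hypothetical index segments: each maps a term to (score, doc_id) pairs.
SEGMENTS = [
    {"search": [(0.9, "doc1"), (0.4, "doc7")]},
    {"search": [(0.8, "doc12")], "index": [(0.7, "doc15")]},
    {"search": [(0.6, "doc23")]},
]

def lookup(segment, term):
    """Return one segment's matches, best score first."""
    return sorted(segment.get(term, []), reverse=True)

@lru_cache(maxsize=1024)  # cache frequent queries
def search(term, limit=10):
    """Look up every segment in parallel, then merge into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(SEGMENTS)) as pool:
        partials = list(pool.map(lambda seg: lookup(seg, term), SEGMENTS))
    # Each partial list is already sorted, so a k-way merge keeps global order.
    merged = heapq.merge(*partials, reverse=True)
    return tuple(islice(merged, limit))
```

In this sketch the cache would have to be cleared (`search.cache_clear()`) whenever a segment is updated, which is one reason near-real-time indexing complicates caching.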

Page 97: Fundamentals Of Search

Testing Search Indexing

• Choose 3-4 good candidates
• Index as much content as possible
  – Watch the robot, track errors
  – Try to index tricky data sources
  – Compare coverage among them
• Test index scaling
  – Make a really big index based on expected use
  – Speed of add / update / delete
  – Responsiveness during big update

Page 98: Fundamentals Of Search

Evaluating Search Results

• Create a query test suite
  – Use existing search logs if possible
  – Short, long, unusual, common (check cache)
  – Simple and complex queries
  – Spelling, typing and vocabulary errors
  – Many matches, few matches, no matches
• Perform searches against the test engines
  – Save results pages as HTML for later checking
• Analyze differences among them
  – Retrieval (and indexing): what’s found?
  – Relevance: are the top results good ones?
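
One possible way to put numbers on "are the top results good ones?" is precision at k: the fraction of the top k results that a human judged relevant. The slides do not prescribe a metric, so this is a swapped-in technique. The sketch assumes each test query comes with a hand-made set of relevant document IDs and that each candidate engine can be wrapped in a function returning ranked IDs, both of which are assumptions.

```python
def precision_at_k(results, relevant, k=10):
    """Fraction of the top-k result slots filled by a judged-relevant document."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in results[:k] if doc in relevant) / k

def compare_engines(test_suite, engines, k=10):
    """Average precision-at-k per engine over the whole query test suite.

    `test_suite` maps each query to its set of relevant document IDs;
    `engines` maps an engine name to a function: query -> ranked document IDs.
    """
    report = {}
    for name, search in engines.items():
        scores = [precision_at_k(search(query), relevant, k)
                  for query, relevant in test_suite.items()]
        report[name] = sum(scores) / len(scores) if scores else 0.0
    return report
```

A single average hides a lot, so it is still worth reading the saved results pages for the queries where the engines disagree most.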

Page 99: Fundamentals Of Search

Search: Not a Black Box

• Simple search solves many enterprise problems
  – Dynamic access to local content
  – Familiar interface, expectations
  – User vocabulary
• Understand the real information needs
  – Index the right stuff
  – Work with content providers and IAs
• Link to specialty research engines
• Learn from users over time, make it better
