INLS 509: Introduction to Information Retrieval
Jaime Arguello
jarguell@email.unc.edu
January 8, 2014
Transcript
Page 1: INLS 509: Introduction to Information Retrieval · INLS 509: Introduction to Information Retrieval Jaime Arguello jarguell@email.unc.edu January 8, 2014 Friday, January 10, 14

INLS 509: Introduction to Information Retrieval

Jaime Arguello
jarguell@email.unc.edu

January 8, 2014

Page 2

• Hello, my name is ______.

• However, I’d rather be called ______. (optional)

• I’m in the ______ program.

• I’m taking this course because I want to ______.

Introductions

Page 3

What is Information Retrieval?

• Information retrieval (IR) is the science and practice of developing and evaluating systems that match information seekers with the information they seek.

Page 4

• This course mainly focuses on search engines

• Given a query and a corpus, find relevant items

query: a user’s expression of their information need

corpus: a repository of retrievable items

relevance: satisfaction of the user’s information need

What is Information Retrieval?
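The search task above (query, corpus, relevance) can be sketched in a few lines of Python. The toy corpus and the term-overlap scoring rule here are illustrative assumptions, not something the slides prescribe:

```python
# A toy search task: given a query and a corpus, rank items by a crude
# relevance guess (the number of shared terms). Real systems use far
# richer evidence; this only illustrates the query/corpus/relevance roles.

def score(query, document):
    """Count of query terms that appear in the document (a naive
    proxy for relevance)."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

def search(query, corpus):
    """Return corpus items ranked in descending order of score."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)

corpus = [
    "train schedules for the airport",
    "how to bathe a cat",
    "history of the domestic cat",
]
results = search("bathing a cat", corpus)
```

Note that "bathing" does not match "bathe" under this naive rule, which already hints at the vocabulary-mismatch problem raised later in the slides.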

Page 5

What is Information Retrieval?

• Gerard Salton, 1968:

Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information.

Page 6

Information Retrieval: structure

Page 7

Information Retrieval: document structure

Page 8

Information Retrieval: document structure

However, the main content of the page is in the form of natural language text, which has little structure that a computer can understand.

Page 9

Information Retrieval: document structure

However, the main content of the page is in the form of natural language text, which has little structure that a computer can understand.

As it turns out, a computer does not need to understand natural language text in order to determine that this document is likely to be relevant to a particular query (e.g., “Gerard Salton”).

Page 10

Information Retrieval: collection structure

Page 11

Information Retrieval: analysis (classification and information extraction)

Page 12

Information Retrieval: organization (cataloguing)

http://www.dmoz.org

Page 13

http://www.dmoz.org

Information Retrieval: organization (cataloguing)

Page 14

Information Retrieval: analysis and organization (reading level)

Page 15

Information Retrieval: organization (recommendations)

http://www.yelp.com/biz/cosmic-cantina-chapel-hill (not actual page)

Page 16

Information Retrieval: storage

• How might a web search engine view these pages differently in terms of storage?

Page 17

Information Retrieval: retrieval

• Efficiency: retrieving results in this lifetime (or, better yet, in 0.18 seconds)

• Effectiveness: retrieving results that satisfy the user’s information need (more on this later)

• We will focus more on effectiveness

• However, we will also discuss in some detail how search engines retrieve results as fast as they do

Page 18

Many Types of Search Engines

Page 19

Many Types of Search Engines

Page 20

• Given a query and a corpus, find relevant items

query: user’s expression of their information need

corpus: a repository of retrievable items

relevance: satisfaction of the user’s information need

The Search Task

Page 21

Search Engines: web search

query

corpus

web pages

results

Page 22

query

results

corpus

scientific publications

Search Engines: digital library search

Page 23

query

corpus

news articles

results

Search Engines: news search

Page 24

curated/synthesized business listings

query

corpus

results

Search Engines: local business search

Page 25

query

corpus

files in my laptop

results

Search Engines: desktop search

Page 26

query

corpus

tweets

results

Search Engines: micro-blog search

Page 27

query

corpus

profiles

Search Engines: people/profile search

results

Page 28

digital library search

web search

enterprise search

news search

local business search

image search

video search

(micro-)blog search

community Q&A search

Information Retrieval Tasks and Applications

desktop search

question-answering

federated search

social search

expert search

product search

patent search

recommender systems

opinion mining

Page 29

• Given a query and a corpus, find relevant items

query: user’s expression of their information need

corpus: a repository of retrievable items

relevance: satisfaction of the user’s information need

The Search Task

Page 30

• Given a query and a corpus, find relevant items

query: user’s expression of their information need

‣ a textual description of what the user wants

corpus: a repository of retrievable items

‣ a collection of textual documents

relevance: satisfaction of the user’s information need

‣ the document contains information the user wants

The Search Task: in this course

Page 31

• Information retrieval is an uncertain process

‣ users don’t know what they want

‣ users don’t know how to convey what they want

‣ computers can’t elicit information like a librarian

‣ computers can’t understand natural language text

‣ the search engine can only guess what is relevant

‣ the search engine can only guess if a user is satisfied

‣ over time, we can only guess how users adjust their short- and long-term behavior for the better

Why is IR fascinating?

Page 32

Queries and Relevance

Page 33

Queries and Relevance

(AOL query-log)

‣ soft surroundings

‣ trains interlocking dog sheets

‣ belly dancing music

‣ christian dior large bag

‣ best western airport sea tac

‣ www.bajawedding.com

‣ marie selby botanical gardens

‣ big chill down coats

‣ www.magichat.co.uk

‣ marie selby botanical gardens

‣ broadstone raquet club

‣ seadoo utopia

‣ seasons white plains condo

‣ priority club.com

‣ aircat tools

‣ epicurus evil

‣ instructions

‣ hinds county city of jackson

‣ last searches on aol a to z

‣ riverbank run

Page 34

• A query is an impoverished description of the user’s information need

• Highly ambiguous to anyone other than the user

Queries and Relevance

Page 35

Queries and Relevance

(from TREC 2005 HARD Track)

what is in the user’s head

the input to the system

• Query 435: curbing population growth

• Description: What measures have been taken worldwide and what countries have been effective in curbing population growth? A relevant document must describe an actual case in which population measures have been taken and their results are known. Reduction measures must have been actively pursued. Passive events such as disease, which involuntarily reduce population, are not relevant.

Page 36

Queries and Relevance

• Query 435: curbing population growth

• Description: ???

(from TREC 2005 HARD Track)

Page 37

Queries and Relevance

• Query 435: curbing population growth

• Can we imagine a relevant document without all these query terms?

Page 38

Queries and Relevance

• Query 435: curbing population growth

• The same concept can be expressed in different ways

Page 39

Queries and Relevance

• Query 435: curbing population growth

• Can we imagine a non-relevant document with all these query terms?

Page 40

Queries and Relevance

• Query 435: curbing population growth

• The query concept can have different “senses”

Page 41

Queries and Relevance

• This is why IR is difficult (and fascinating!)

• Croft, Metzler, & Strohman:

Understanding how people compare text and designing computer algorithms to accurately perform this comparison is at the core of information retrieval.

• IR does not seek a deep “understanding” of the document text

• It uses statistical properties of the text to predict whether a document is relevant to a query

‣ easier and oftentimes sufficient

Page 42

• What types of evidence can we use to predict that a document is relevant to a query?

‣ query-document evidence: a property of the query-document pair (e.g., a measure of similarity)

‣ document evidence: a property of the document (same for all queries)


Predicting Relevance

Page 43

...

• Query: bathing a cat


Query-Document Evidence

Page 44

• Query: bathing a cat

• The important query terms occur frequently

• Both terms occur

• Terms occur close together

• Terms occur in the title

• Terms occur in the URL: www.wikihow.com/bathe-your-cat

• Any other ideas?

...

Query-Document Evidence
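The bullets above can be read as a feature-extraction recipe. Below is a minimal Python sketch; the document (title, URL, body tokens) is hypothetical and the feature names are my own labels, not the slides':

```python
# Extract simple query-document evidence for a two-term query:
# per-term frequency, coverage, term proximity, and presence in
# the title and URL. The example document is made up.

def features(query_terms, title, url, body_tokens):
    positions = {t: [i for i, tok in enumerate(body_tokens) if tok == t]
                 for t in query_terms}
    covered = [t for t in query_terms if positions[t]]
    # Smallest gap between occurrences of two *different* query terms.
    proximity = min(
        (abs(i - j) for a in covered for b in covered if a != b
         for i in positions[a] for j in positions[b]),
        default=None,
    )
    return {
        "term_counts": {t: len(positions[t]) for t in query_terms},
        "all_terms_occur": len(covered) == len(query_terms),
        "min_term_gap": proximity,
        "terms_in_title": [t for t in query_terms if t in title.lower()],
        "terms_in_url": [t for t in query_terms if t in url.lower()],
    }

f = features(
    ["bathe", "cat"],
    title="How to Bathe Your Cat",
    url="www.wikihow.com/bathe-your-cat",
    body_tokens="wet the cat slowly then bathe the cat with shampoo".split(),
)
```

A learned ranker would typically combine features like these into a single relevance score.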

Page 45

• Terms occur in hyperlinks pointing to the page

• Same language as query

• Other terms semantically related to query-terms (e.g., feline, wash)

...

Query-Document Evidence

Page 46

• Does not contain “.com”

• [verb] [article] [noun]

• Not one of the most popular queries

• Does not contain the term “news”

...

Query-Document Evidence

Page 47

• We can also use previous user interactions, e.g.:

• The query is similar to other queries associated with clicks on this document

• The document is similar to other documents associated with clicks for this query

...

Query-Document Evidence

Page 48

...

• Lots of in-links (endorsements)

• Non-spam properties:

‣ grammatical sentences

‣ no profanity

• Has good formatting

• Any other ideas?


Document Evidence

Page 49

...

• Author attributes

• Peer-reviewed by many

• Reading-level appropriate for user community

• Has pictures

• Recently modified (fresh)

• Normal length

• From domain with other high-quality documents


Document Evidence

Page 50

• IR does not require a deep “understanding” of information

• We can get by using shallow sources of evidence, which can be generated from the query-document pair or just the document itself.

Predicting Relevance

Page 51

• Output: a ranking of items in descending order of predicted relevance (simplifies the task)

• Assumption: the user scans the results from top to bottom and stops when he/she is satisfied or gives up


The Search Task
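The two assumptions above (ranked output, top-to-bottom scan until satisfied) can be simulated directly. The scores and relevance labels below are made up for illustration:

```python
# The search task's output: items sorted in descending order of
# predicted relevance, plus a user model that scans top-down and
# stops at the first satisfying (relevant) item.

def rank(scored_items):
    """Sort (item, predicted_score) pairs by descending score."""
    return [item for item, _ in
            sorted(scored_items, key=lambda p: p[1], reverse=True)]

def rank_of_first_relevant(ranking, relevant):
    """1-based position where the scanning user is first satisfied,
    or None if they give up (no relevant item in the ranking)."""
    for pos, item in enumerate(ranking, start=1):
        if item in relevant:
            return pos
    return None

ranking = rank([("d1", 0.2), ("d2", 0.9), ("d3", 0.5)])
stop_at = rank_of_first_relevant(ranking, relevant={"d3"})
```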

Page 52

Evaluating a Ranking

• So, how good is a particular ranking?

• Suppose we know which documents are truly relevant to the query...

Page 53

A B

Evaluating a Ranking

• Which ranking is better?

Page 54

• In general, a ranking with all the relevant documents at the top is best (A is better than B)


A B

Evaluating a Ranking

Page 55

A B

• Which ranking is better?

Evaluating a Ranking

Page 56

A B

• Oftentimes the (relative) quality of a ranking is unclear and depends on the task

Evaluating a Ranking

Page 57

• Web search: ??????


A B

Evaluating a Ranking

Page 58

• Web search: A is better than B

• Many documents (redundantly) satisfy the user; the higher the first relevant document, the better

A B

Evaluating a Ranking
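A metric that captures exactly this preference is reciprocal rank: one over the position of the first relevant document, and zero if none is retrieved. The rankings A and B below are illustrative stand-ins for the slides' figures:

```python
# Reciprocal rank: rewards only how high the FIRST relevant
# document appears, matching the web-search user model above.

def reciprocal_rank(ranking, relevant):
    for pos, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / pos
    return 0.0

relevant = {"r1", "r2"}
A = ["r1", "n1", "n2", "r2"]   # first relevant at position 1
B = ["n1", "n2", "r1", "r2"]   # first relevant at position 3
```

Under this metric A scores higher than B, as the slide argues.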

Page 59

• Patent search: ??????


A B

Evaluating a Ranking

Page 60

• Patent search: B is better than A

• User wants to see everything in the corpus that is related to the query (high cost in missing something)

A B

Evaluating a Ranking
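For recall-oriented tasks like patent search, a natural metric is recall at a cutoff k: the fraction of all relevant documents that appear in the top k. The rankings below are illustrative, not from the slides:

```python
# Recall@k: of all relevant documents in the corpus, how many
# did we retrieve in the top k? High cost to missing anything.

def recall_at_k(ranking, relevant, k):
    retrieved = set(ranking[:k]) & relevant
    return len(retrieved) / len(relevant)

relevant = {"r1", "r2", "r3"}
A = ["r1", "n1", "n2", "n3", "r2", "r3"]
B = ["n1", "r1", "r2", "r3", "n2", "n3"]
```

At k = 4, B retrieves all three relevant documents while A retrieves only one, so B wins under this metric even though A places a relevant document first.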

Page 61

• Exploratory search: ??????


A B

Evaluating a Ranking

Page 62

• Exploratory search: A is better than B

• Satisfying the information need requires information found in different documents

A B

Evaluating a Ranking

Page 63

• Given a ranking with known relevant/non-relevant documents, an evaluation metric outputs a quality score

• Many, many metrics

• Different metrics make different assumptions

• Choosing the “right one” requires understanding the task

• Often, we use several (sanity check)

Evaluating a Ranking: evaluation metrics
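As one concrete example of such a metric, here is average precision for a ranking with binary relevance judgments; the documents and labels are made up for illustration:

```python
# Average precision: the mean of precision@k taken at the rank of
# each relevant document. Rewards putting relevant documents high
# AND finding many of them, so it blends the earlier intuitions.

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for pos, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / pos   # precision at this cutoff
    return total / len(relevant) if relevant else 0.0

relevant = {"r1", "r2"}
ap = average_precision(["r1", "n1", "r2", "n2"], relevant)
```

Here the hits fall at ranks 1 and 3, giving (1/1 + 2/3) / 2 = 5/6, which is the kind of number such metrics report per query before averaging over a query set.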

Page 64

• The goal of information retrieval is to match information-seekers with the information they seek.

• IR involves analysis, organization, storage, and retrieval

• There are many types of search engines

• There is uncertainty at every step of the search process

• Simple heuristics don’t work, so IR systems make predictions about relevance!

• IR systems use “superficial” evidence to make predictions

• Users expect different things, depending on the task

• Evaluation requires understanding the user community.

• My goal is to convince you that IR is a fascinating science

Summary

Page 65

Course Overview

Jaime Arguello
jarguell@email.unc.edu

January 8, 2014

Page 66

• How do search engines work?

‣ effectiveness and efficiency

• How do users behave with them?

‣ how do users determine usefulness of information?

‣ how can a search engine mimic this process?

• Why do search engines fail?

‣ the user? the corpus? the system? something else?

• How can they be evaluated (off-line)?

• How can they be monitored and tuned (on-line)?

Course Objectives

Page 67

• Most of the world’s information is in natural language text

‣ the world wide web

‣ scientific publications

‣ books

‣ social media interactions

• The amount of this information is growing quickly; human capacity is not (evolution doesn’t move that fast)

• We need smarter tools

• IR provides tools for analyzing and organizing content to facilitate search, discovery, and learning

Why are these important questions?

Page 68

• Information retrieval is an interdisciplinary problem

Course Structure

[Venn diagram: people who want to understand people; people who want to understand how computers can solve problems; overlap: people who care about information retrieval]

• We need to understand both ends of the spectrum

Page 69

• IR: computer-based solutions to a human problem

Course Structure

the user / the system

first half of the semester / second half of the semester

• Understanding IR systems requires math!

Page 70

Road Map

• Introduction to ad-hoc retrieval

‣ controlled vocabularies

‣ full-text indexing

• Boolean retrieval

• Indexing and query processing

• Statistical Properties of Text

• Document Representation

• Retrieval Models

‣ vector space model

‣ language modeling

‣ others (depending on how quickly we progress)

Page 71

Road Map

• Evaluation

‣ test-collection construction

‣ evaluation metrics

‣ experimentation

‣ user studies

‣ search-log analysis

• Studies of search behavior

• Federated Search

• Clustering

• Text Classification

Page 72

Grading

• 30% homework

‣ 10% each

• 15% midterm

• 15% final exam

• 30% literature review

‣ 5% proposal

‣ 10% presentation

‣ 15% paper

• 10% (and chocolates) participation

Page 73

Grading for Graduate Students

• H: 95-100%

• P: 80-94%

• L: 60-79%

• F: 0-59%

Page 74

Grading for Undergraduate Students

• A+: 97-100%

• A: 94-96%

• A-: 90-93%

• B+: 87-89%

• B: 84-86%

• B-: 80-83%

• C+: 77-79%

• C: 74-76%

• C-: 70-73%

• D+: 67-69%

• D: 64-66%

• D-: 60-63%

• F: <= 59%

Page 75

Homework vs. Midterm vs. Final

• The homework will be challenging. It should be; you have more time.

Page 76

Literature Review

• See description on the syllabus

• Form groups of 2 or 3

• Choose an IR task (next slide)

• Write a short proposal (mostly for feedback)

• Review the literature

‣ not just the different solutions to the problem

‣ the best solutions to the problem!

• Write a paper (~30 pages double-spaced)

• Make a presentation

‣ 10 minute presentation + 5 minutes Q&A

Page 77

• Personalized information retrieval

• Session-based information retrieval

• Clustering of search results

• Book search

• Multimedia search (over items not inherently associated with text)

• Social-media data for forecasting and event-detection

• Query-log analysis for forecasting and event-detection

• Faceted search

• Federated search

Literature Review: example tasks

Page 78

• Be thorough

• Be scientific

‣ don’t focus on the writing of the papers you review

‣ focus on the science (the method and the evaluation)

• Be constructive

• Contribute new insight and structure

‣ your literature review shouldn’t read like a “list”

‣ connect dots that haven’t been connected

• Say what you think!

Literature Review: tips

Page 79

• Work hard

• Do the assigned readings

• Do other readings

• Be patient and have reasonable expectations

‣ you’re not supposed to understand everything we cover in class during class

• Seek help sooner rather than later

‣ office hours: Manning 305, T/Th 9:30-10:30am

‣ questions via email

• Keep laptop usage to a minimum (live in the present)

Course Tips

Page 80

Questions?
