Date post: 19-Dec-2015
Information Retrieval

Unit 1

Seema Chandak

Unit 1 : Objective & Content

Objective: To deal with IR, that is, the representation, storage, and organization of, and access to, information items.

Unit 1 : Content (contd.)

• Content: Basic Concepts of IR, Data Retrieval & Information Retrieval, IR system block diagram, Automatic Text Analysis, Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighting, Probabilistic Indexing.

Unit 1 : Content (contd.)

• Automatic Classification, Measures of Association, Different Matching Coefficients, Classification Methods, Cluster Hypothesis, Clustering Algorithms, Single Pass Algorithm, Single Link Algorithm, Rocchio's Algorithm, Dendrogram.

What is IR

Information retrieval: a subfield of computer science that deals with the automated retrieval of information (especially text) based on its content and context.

The term "Information Retrieval" was coined by Calvin Mooers (1950). "It is concerned with the representation, storage, and organization and accessing of information items."

Need for IR

• Information is considered the most important resource for most activities.
• Example: timely weather reports.
• Timely sharing of information.
• The timely retrieval of information plays a major role, in keeping with the motto "right information at the right time".

Types of IR

– Structured (all database management systems)
– Unstructured (search engines)
– Semi-structured (data warehouses)

IR Based on Structured Data

• Recollect terms related to DBMS:
– Data organization in the form of schema, keys, index, metadata, …
– Query structure
– Result set
– …
– …

Why IR? Why not Database?

What are some limitations of Database Systems?

IR Vs. DR

Information Retrieval System: a system that allows a user to retrieve documents that match her "information need" from a large corpus. Example: Get documents about Java, except for the ones that are about Java coffee.

Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus. Example: Get all documents containing the term "Java" but not containing the term "coffee".

IR Vs. DR

1. Matching
– In data retrieval we are normally looking for an exact match, that is, we are checking whether an item is or is not present in the file.
– E.g. SELECT * FROM Student WHERE per >= 75.0
– In information retrieval we more generally want to find those items which partially match the request and then select from those a few of the best matching ones.
– E.g. "Students of PICT college having 75 percent or more."
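The contrast between exact and partial matching can be sketched in a few lines of Python. This is only an illustration: the student records, the sample documents, and the term-overlap score are assumptions made for the example, not part of the original slides.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    college: str
    per: float  # percentage, as in the SQL example


# Hypothetical records, invented for illustration.
STUDENTS = [
    Student("Asha", "PICT", 81.0),
    Student("Ravi", "PICT", 74.5),
    Student("Meena", "COEP", 90.0),
]

# Hypothetical free-text documents, invented for illustration.
DOCS = {
    "d1": "students of pict college with distinction",
    "d2": "pict college admission notice",
    "d3": "weather report for pune",
}


def exact_match(students, threshold=75.0):
    """Data retrieval: an item either satisfies the condition or it does not."""
    return [s.name for s in students if s.per >= threshold]


def best_match(query, docs, top_k=2):
    """Information retrieval: score partial matches, return the best few."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:  # a partial match is enough to be a candidate
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]
```

Calling `exact_match(STUDENTS)` keeps only the records that satisfy the condition, while `best_match("pict college students distinction", DOCS)` returns the documents that share the most terms with the request, even though none of them contains all of it.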

IR Vs. DR

2. Inference
– In data retrieval, inference is of the simple deductive kind, that is, if aRb and bRc then aRc.
– In information retrieval, inference is inductive.
– Relations are only specified with a degree of certainty or uncertainty, and hence our confidence in the inference is variable.

3. Model
– Data retrieval is deterministic but information retrieval is probabilistic.
– Frequently Bayes' Theorem is invoked to carry out inferences in IR, but in DR probabilities do not enter into the processing.

IR Vs. DR

4. Classification
– In DR a monothetic classification is most likely to be used, that is, one with classes defined by objects possessing attributes both necessary and sufficient to belong to a class.
– In IR such a classification is not very useful; a polythetic classification is mostly used.
– Each individual in a class will possess only a proportion of all the attributes possessed by all the members of that class.
– Hence no attribute is either necessary or sufficient for membership of a class.
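A small Python sketch may make the two kinds of class membership concrete. The attribute sets and the 50% threshold are assumptions chosen for illustration; the slides do not fix how large the "proportion" must be.

```python
# Hypothetical attributes shared by the members of one class.
CLASS_ATTRS = {"index", "query", "ranking", "relevance"}


def monothetic_member(item, required):
    """Member only if the item has every defining attribute."""
    return required <= item  # subset test: all required attributes present


def polythetic_member(item, class_attributes, min_share=0.5):
    """Member if the item has at least a given share of the class attributes.

    min_share is an assumed threshold, not a value from the source.
    """
    shared = len(item & class_attributes)
    return shared / len(class_attributes) >= min_share


# A document that has only two of the four class attributes.
doc = {"query", "ranking", "stemming"}
```

Here `monothetic_member(doc, CLASS_ATTRS)` is false because two defining attributes are missing, whereas `polythetic_member(doc, CLASS_ATTRS)` is true because the document shares half of the class attributes.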

IR Vs. DR

5. Query Language
– The query language for DR is one with restricted syntax and vocabulary.
– In IR we prefer to use natural language, although there are some notable exceptions.

6. Query Specification
– In DR the query is generally a complete specification of what is wanted.
– In IR it is invariably incomplete.

IR Vs. DR

7. Items wanted
– In IR we are searching for relevant documents, as opposed to exactly matching items in DR.

8. Error response
– DR is more sensitive to error, in the sense that an error in matching will fail to retrieve the wanted item, which implies a total failure of the system.
– In IR small errors in matching generally do not affect the performance of the system significantly.

IR Vs. DR

                      Data Retrieval (DR)                    Information Retrieval (IR)
Matching              Exact match                            Partial match, best match
Inference             Deduction                              Induction
Model                 Deterministic                          Probabilistic
Classification        Monothetic                             Polythetic
Data                  Database tables, structured            Free text, unstructured
Query language        Artificial, SQL, relational algebras   Natural, keywords, free text
Query specification   Complete                               Incomplete
Items wanted          Matching                               Relevant

IR vs. DR

                 Information Retrieval    Data Retrieval
Error response   Insensitive              Sensitive
Results          Approximate matches      Exact matches
Results          Ordered by relevance     Unordered
Accessibility    Non-expert humans        Knowledgeable users or automatic processes

Information Retrieval deals with uncertainty and vagueness in information systems.

• Uncertainty: the available representation typically does not reflect the true semantics/meaning of objects (text, images, video, etc.)

• Vagueness: the user's information need lacks clarity and is only vaguely expressed in the query, feedback or user actions.

• Differs conceptually from database queries!

Issues with Information Retrieval?

Recall the Definition

• What is IR?
• "Finding some desired information in large data sets or a store of information"

• Means:
– Searching for documents
– Searching for information in documents
– Searching for metadata which describes documents
– Searching within databases

• Web search engines like Google and Lycos are the most visible IR applications.
• IR systems are used to reduce information overload.

Definition

Automatic Information Retrieval

Automatic – as against 'manual'.
Information – as against 'data'.

Defn: An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry.

It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.

Media – Where Does Information Reside?

• Text documents: web pages, books, articles, papers, emails, etc.
• Manuscripts
• Graphics & Images
• Speech & Video
• Maps & Satellite Imagery
• Local Information, Yellow Pages
• Mismatch: given representation in a specific medium vs. semantic description of information (semantic gap)

Scale - How Much Information is out there?

• World Wide Web: tens or hundreds of billions of documents? At approx. 10 KB/doc, that is 100s of TB.
• Then there is everything else: email, personal files, proprietary databases, broadcast media, print.
• Estimated 5 exabytes p.a. (growing at 30%)
• 800 MB per person p.a.
• Web is just a tiny starting point…
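The scale figures above can be checked with simple arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes) and takes the slide's figures at face value; the function names are made up for the example.

```python
# Decimal byte units; the choice of decimal over binary units is an assumption.
KB, MB, TB, EB = 10**3, 10**6, 10**12, 10**18


def web_size_tb(num_docs, bytes_per_doc=10 * KB):
    """Total size in TB of num_docs documents of about 10 KB each."""
    return num_docs * bytes_per_doc / TB


def implied_population(total_bytes=5 * EB, bytes_per_person=800 * MB):
    """Number of people implied by 5 EB per year at 800 MB per person."""
    return total_bytes / bytes_per_person
```

Under these assumptions, ten billion documents come to about 100 TB and a hundred billion to about 1,000 TB, so the "100s of TB" estimate sits in between. Dividing 5 exabytes by 800 MB gives about 6.25 billion, which suggests that the per-person figure is the yearly total spread over the world population.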

IR problem

It mainly deals with a very large, mostly unstructured data set.

The IR problem consists of:
– building efficient indexes.
– processing user queries with high performance.
– improving the 'quality' of the answer set.

Basic Concepts

• Information retrieval is directly affected by:
– User tasks
– Logical view of the documents

User Tasks

• Interaction of the user with retrieval system.

[Figure: the user's two tasks on the documents: Retrieval and Browsing]

User Tasks

• Classical information retrieval systems allow information retrieval.
• Hypertext systems are usually tuned for quick browsing.
• Modern digital libraries and Web interfaces might attempt to combine these tasks.

Logical view of the document

• The representation of documents by either keywords or index terms is known as the logical view of the documents.
• Keywords are either extracted directly from the text of the document or specified by a human.
• Modern computers represent a document by its set of:
– Full words (the full text).
– A smaller set of words, obtained by:
  • Stopwords: elimination of articles and connectives.
  • Stemming: reduction of distinct words to their common grammatical roots.
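The two reductions named above, stopword elimination and reduction of words to a common root, can be sketched as follows. The stopword list and the suffix rules are toy assumptions made for illustration; a real system would use a proper conflation (stemming) algorithm.

```python
import re

# A tiny, assumed stopword list (articles and connectives).
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "is", "are"}

# Toy suffix rules, checked in this order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def logical_view(text):
    """Return the reduced, sorted set of index terms for a piece of text."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    return sorted({crude_stem(w) for w in kept})
```

For the text "The indexing of documents and the indexed document", `logical_view` drops "the", "of" and "and", and maps the remaining four words onto the two terms "document" and "index".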

Introduction…

• Information Retrieval System:

[Figure: A typical IR system. Blocks: Input (Queries, Documents), Processor, Output, with Feedback.]

Sample retrieval

Introduction…

• Information Retrieval System:
– Input: Only a representation of the document (or query) is stored, which means that the text of a document is lost once it has been processed for the purpose of generating its representation.
– A document representative could be a list of extracted words considered to be significant.
– The user has to express the needed information in the query language of the system.
– Processor: Performs the actual retrieval function, executing the search strategy in response to a query.
– Feedback: Improves the subsequent run after a sample retrieval.
– Output: A set of document numbers, on which the evaluation can be done.
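The input, processor and output roles described above can be sketched with a small inverted index. This is a minimal illustration under stated assumptions: the sample documents are invented, and the search strategy (return documents containing every query term) is just one possible choice, which the slides do not prescribe.

```python
from collections import defaultdict

# Hypothetical collection, keyed by document number.
DOCUMENTS = {
    1: "information retrieval systems",
    2: "database systems",
    3: "retrieval of information items",
}


def build_index(documents):
    """Input: keep only a representation (term -> set of document numbers)."""
    index = defaultdict(set)
    for doc_no, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_no)
    return index


def process(index, query):
    """Processor: return the numbers of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # Intersect the posting sets; unknown terms contribute an empty set.
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


INDEX = build_index(DOCUMENTS)
```

The output of `process(INDEX, "information retrieval")` is a list of document numbers rather than document text, which matches the point that only the representation is stored once the input has been processed.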

Introduction

[Figure: Information Retrieval Process. Labels: Information need, text input, Query, Parse, Pre-process, Index, Collections, Rank. Callouts: "How is the query constructed?", "How is the text processed?"]

Definitions

• Searching: Seeking specific information within a body of information. The result of a search is a set of hits.

• Browsing: Unstructured exploration of a body of information.

• Linking: Moving from one item to another following links, such as citations, references, etc.

The Basics of Information Retrieval

Query: A string of text, describing the information that the user is seeking. Each word of the query is called a search term.

A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.

Full text searching: Methods that compare the query with every word in the text, without distinguishing the function of the various words.

Fielded searching: Methods that search on specific bibliographic or structural fields, such as author or heading.
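The difference between full text searching and fielded searching can be sketched as below. The records and field names are hypothetical, chosen only to show that the same term can hit more records when the function of the words is ignored.

```python
import re

# Hypothetical bibliographic records, invented for illustration.
RECORDS = [
    {"author": "Luhn", "heading": "Automatic abstracts",
     "body": "Word frequency measures significance."},
    {"author": "Salton", "heading": "Indexing",
     "body": "Luhn proposed frequency based term selection."},
]


def tokens(text):
    """Lower-case a text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_search(records, term):
    """Compare the term with every word of every field, ignoring field roles."""
    term = term.lower()
    return [i for i, rec in enumerate(records)
            if any(term in tokens(value) for value in rec.values())]


def fielded_search(records, field, term):
    """Compare the term only with the words of one named field."""
    term = term.lower()
    return [i for i, rec in enumerate(records)
            if term in tokens(rec.get(field, ""))]
```

Searching for "Luhn" in the full text finds both records, because the second one mentions the name in its body. Restricting the search to the `author` field finds only the first.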

SORTING AND RANKING HITS

When a user submits a query to a search system, the system returns a set of hits. With a large collection of documents, the set of hits may be very large.

The value to the user depends on the order in which the hits are presented.

Three main methods:

• Sorting the hits, e.g., by date

• Ranking the hits by similarity between query and document

• Ranking the hits by the importance of the documents
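The three ways of ordering hits can be compared on a small example. Everything specific here is an assumption made for illustration: the hits are invented, similarity is measured by Jaccard overlap of terms, and importance is approximated by a count of incoming links.

```python
from datetime import date

# Hypothetical hits returned for one query.
HITS = [
    {"id": "d1", "date": date(2014, 5, 1),
     "text": "information retrieval basics", "inlinks": 3},
    {"id": "d2", "date": date(2015, 1, 9),
     "text": "retrieval of weather information reports", "inlinks": 10},
    {"id": "d3", "date": date(2013, 7, 20),
     "text": "information retrieval and ranking", "inlinks": 1},
]


def sort_by_date(hits):
    """Sorting: newest hit first, independent of the query."""
    return [h["id"] for h in sorted(hits, key=lambda h: h["date"], reverse=True)]


def rank_by_similarity(hits, query):
    """Ranking by Jaccard overlap between query terms and document terms."""
    q = set(query.lower().split())

    def score(hit):
        d = set(hit["text"].lower().split())
        return len(q & d) / len(q | d)

    return [h["id"] for h in sorted(hits, key=score, reverse=True)]


def rank_by_importance(hits):
    """Ranking by an assumed importance signal (number of incoming links)."""
    return [h["id"] for h in sorted(hits, key=lambda h: h["inlinks"], reverse=True)]
```

The same three hits come back in a different order under each method, which is why the choice of ordering matters so much when the hit set is large.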

Examples of Search Systems

Find file on a computer system (Spotlight for Macintosh).

Library catalog for searching bibliographic records about books and other objects (Library of Congress catalog).

Abstracting and indexing system for finding research information about specific topics (Medline for medical information).

Web search service for finding web pages (Google).

