Information Retrieval:
Introduction
Romi Satria Wahono
[email protected] | http://romisatriawahono.net
0878-804804-85
Born in Madiun, October 2, 1974
SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara, Magelang (1993)
S1, S2, and S3 (on leave), Department of Computer Sciences, Saitama University, Japan (1994-2004)
Core Competence: Software Engineering, Computational Intelligence
Founder and Coordinator of IlmuKomputer.Com
CEO PT Brainmatics Cipta Informatika
Learning Methods
Lecture
Discussion
Case Study
Practice
Textbook
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
References
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press, 2010
Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2009
David A. Grossman and Ophir Frieder, Information Retrieval: Algorithms and Heuristics 2nd edition, Springer, 2004
Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, and Carol L Barry, Text Information Retrieval Systems Third Edition, Library and Information Science, 2007
Course Contents
1. Introduction
2. Boolean Retrieval
3. The Term Vocabulary
4. Dictionaries and Tolerant Retrieval
5. Index Construction
6. Index Compression
7. Vector Space Model
8. Computing Scores
9. Evaluation in Information Retrieval
10. Relevance Feedback and Query Expansion
11. XML Retrieval
12. Probabilistic Information Retrieval
13. Language Models for Information Retrieval
14. Text Classification and Naive Bayes
15. Vector Space Classification
16. Support Vector Machines and Machine Learning on Documents
17. Flat Clustering
18. Hierarchical Clustering
19. Latent Semantic Indexing
20. Web Search
21. Web Crawling and Indexes
22. Link Analysis
INTRODUCTION
History of Information Retrieval (IR)
1940-
late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans
1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly
1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds
1950-
1950s: mechanized literature searching systems (Allen Kent et al.) and the invention of citation indexing (Eugene Garfield)
1950: The term "information retrieval" appears to have been coined by Calvin Mooers
1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master thesis at MIT
1955: Kent and colleagues published a paper in American Documentation describing precision and recall, the classic IR evaluation measures
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
1960-
early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell
1960: Melvin Earl (Bill) Maron and John Lary Kuhns published "On relevance, probabilistic indexing, and information retrieval" in the Journal of the ACM 7(3):216–244, July 1960
1962: Cyril W. Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation
1963: Joseph Becker and Robert M. Hayes published text on information retrieval. Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
1964: Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR.
1960- continued
mid-1960s: National Library of Medicine developed MEDLARS Medical Literature Analysis and Retrieval System, the first major machine-readable database and batch-retrieval system (Project Intrex at MIT)
1965: J. C. R. Licklider published Libraries of the Future.
1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval
1968: Gerard Salton published Automatic Information Organization and Retrieval. John W. Sammon, Jr.'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model
1970
early 1970s: First online systems—NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT. Theodor Nelson promoting concept of hypertext, published Computer Lib/Dream Machines.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis." (Information Storage and Retrieval, 7(5), pp. 217–240, December 1971)
1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
• A Theory of Indexing (Society for Industrial and Applied Mathematics)
• A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
• A Vector Space Model for Automatic Indexing (CACM 18:11)
1978: The First ACM SIGIR conference
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy emphasis on probabilistic models
1980-
1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. This was an important concept, though their automated analysis tool proved ultimately disappointing
1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models
mid-1980s: Efforts to develop end-user versions of commercial IR systems
1989: First World Wide Web proposals by Tim Berners-Lee at CERN
1990
1992: First TREC conference
1997: Publication of Korfhage's Information Storage and Retrieval with emphasis on visualization and multi-reference point systems
mid-1990s: Searching FTPable documents on the Internet (Archie, WAIS) and searching the World Wide Web (Lycos, Yahoo, Altavista)
late 1990s: Web search engines implement many features formerly found only in experimental IR systems. Search engines become the most common, and perhaps best, instantiation of IR models, research, and implementation
2000-
Link analysis for Web Search (Google)
Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
Question Answering
• TREC Q/A track
Automated Text Categorization & Clustering
Recommender Systems (Ringo, Amazon, NetPerceptions)
2000- continued
Multimedia IR
• Image
• Video
• Audio and music
Cross-Language IR
• DARPA TIDES
Document Summarization
Related Areas
Database Management
Library and Information Science
Artificial Intelligence
Natural Language Processing
Machine Learning
Database Management
Focused on structured data stored in relational tables rather than free-form text
Focused on efficient processing of well-defined queries in a formal language (SQL)
Clearer semantics for both data and queries
Recent move towards semi-structured data (XML) brings it closer to IR
Library and Information Science
Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization)
Concerned with effective categorization of human knowledge
Concerned with citation analysis and bibliometrics (structure of information)
Recent work on digital libraries brings it closer to CS & IR
Artificial Intelligence
Focused on the representation of knowledge, reasoning, and intelligent action
Formalisms for representing knowledge and queries:
• First-order Predicate Logic
• Bayesian Networks
Recent work on web ontologies and intelligent information agents brings it closer to IR
Natural Language Processing
Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse
Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords
Natural Language Processing: IR Directions
Methods for determining the sense of an ambiguous word based on context (word sense disambiguation)
Methods for identifying specific pieces of information in a document (information extraction)
Methods for answering specific NL questions from document corpora
Machine Learning
Focused on the development of computational systems that improve their performance with experience
Automated classification of examples based on learning concepts from labeled training examples (supervised learning)
Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning)
Machine Learning: IR Directions
Text Categorization
• Automatic hierarchical classification (Yahoo)
• Adaptive filtering/routing/recommending
• Automated spam filtering
Text Clustering
• Clustering of IR query results
• Automatic formation of hierarchies (Yahoo)
Learning for Information Extraction
Text Mining
Basic Concepts
Information Retrieval (IR)
Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
Structured vs Unstructured Data
Structured data tends to refer to information in "tables"

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith
Structured vs Unstructured Data in 1996
Structured vs Unstructured Data in 2009
IR Fields: Clustering and Classification
Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic.
Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class, if any, each of a set of documents belongs to
IR Fields: Operation Scale
1. Web Search:
• provide search over billions of documents stored on millions of computers
• distinctive issues: gathering documents for indexing, building systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as exploiting hypertext and not being fooled by site providers who manipulate page content to boost their search engine rankings, given the commercial importance of the web
2. Personal information retrieval:
• consumer operating systems have integrated information retrieval (Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search). Email programs usually not only provide search but also text classification: they at least provide a spam filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders
• Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner
3. Enterprise, Institutional, and Domain-Specific Search
• provided for collections such as a corporation’s internal documents, a database of patents, or research articles on biochemistry
• the documents will typically be stored on centralized file systems and one or a handful of dedicated machines will provide search over the collection
Comparison of IRS, IS, and AI

           Data Object       Function                   Database Size
IRS        Documents         Retrieval (probabilistic)  Small to large
IS (DBMS)  Tables            Retrieval (deterministic)  Small to large
AI         Logic statements  Inference                  Small
What is a Document?
Examples:
• web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, IM sessions, etc.
Common properties
• Significant text content
• Some structure (e.g., title, author, date for papers; subject, sender, destination for email)
Documents vs Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
• e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
Easy to compare fields with well-defined semantics to queries in order to find matches
Text is more difficult
Documents vs Records
Example bank database query
• Find records with balance > $50,000 in branches located in Amherst, MA.
• Matches easily found by comparison with field values of records
Example search engine query
• bank scandals in western mass
• This text must be compared to the text of entire news stories
Comparing Text
Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
Exact matching of words is not enough
• Many different ways to write the same thing in a “natural language” like English
• e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
• Some stories will be better matches than others
Information Retrieval System (IRS)
A system whose function is to find information relevant to the user's needs
The information processed is contained in textual documents
Information retrieval concerns the representation, storage of, and access to documents
A retrieved document is not guaranteed to be relevant to the information need the user expressed in the query
Information Systems and the IRS
Hierarchical:
• Transaction Processing Systems
• Management Information Systems
• Executive Information Systems
Functional:
• Marketing Information Systems
• Personnel Information Systems
• Financial Information Systems
• etc.
Not tied to hierarchy or function:
• Decision Support Systems
• Artificial Intelligence Systems
• Information Retrieval Systems
• Library Information Systems
• etc.
IRS Components
(Diagram: the USER submits a query against the DOCUMENT COLLECTION; a MATCHING FUNCTION selects the RETRIEVED DOCUMENTS, whose RELEVANCE is then determined)
IRS User Categories
Novice: beginner users
• Do not yet have a clear information need
• Still want to browse information
Intermediate: users who have started to learn
• Have an information need, but still a somewhat vague one
• Want both to browse and to search
Expert: expert users
• Have a clearly defined information need
• Search directly for the information they need
Big Issues in IR
Relevance
Evaluation
User and Information Needs
Big Issues in IR: Relevance
Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style
Topical relevance (same topic) vs. user relevance (everything else)
Retrieval models define a view of relevance
Ranking algorithms used in search engines are based on retrieval models
Most models describe statistical properties of text rather than linguistic
• i.e. counting simple text features such as words instead of parsing and analyzing the sentences
• Statistical approach to text processing started with Luhn in the 50s
• Linguistic features can be part of a statistical model
Big Issues in IR: Evaluation
Experimental procedures and measures for comparing system output with user expectations
• Originated in Cranfield experiments in the 60s
IR evaluation methods now used in many fields
Typically use test collection of documents, queries, and relevance judgments
• Most commonly used are TREC collections
Recall and precision are two examples of effectiveness measures
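The effectiveness measures just mentioned can be computed directly from sets; a minimal sketch in Python, using hypothetical documents and relevance judgments:

```python
# Hypothetical relevance judgments and a retrieved set for one query.
relevant = {"d1", "d3", "d5", "d7"}     # documents judged relevant
retrieved = {"d1", "d2", "d3", "d4"}    # documents the system returned

true_positives = relevant & retrieved   # relevant AND retrieved

# Precision: fraction of retrieved documents that are relevant.
precision = len(true_positives) / len(retrieved)
# Recall: fraction of relevant documents that were retrieved.
recall = len(true_positives) / len(relevant)

print(precision, recall)  # 0.5 0.5
```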
Big Issues in IR: User and Information Needs
Search evaluation is user-centered
Keyword queries are often poor descriptions of actual information needs
Interaction and context are important for understanding user intent
Query refinement techniques such as query expansion, query suggestion, relevance feedback improve ranking
IR and Search Engines
A search engine is the practical application of information retrieval techniques to large scale text collections
Web search engines are best-known examples, but many others
• Open source search engines are important for research and development
• e.g., Lucene, Lemur/Indri, Galago
IR and Search Engines
Information Retrieval issues:
• Relevance: effective ranking
• Evaluation: testing and measuring
• Information needs: user interaction
Search Engine issues:
• Performance: efficient search and indexing
• Incorporating new data: coverage and freshness
• Scalability: growing with data and users
• Adaptability: tuning for applications
• Specific problems: e.g., spam
Search Engine Issues
Performance
• Measuring and improving the efficiency of search
e.g., reducing response time, increasing query throughput, increasing indexing speed
• Indexes are data structures designed to improve search efficiency
designing and implementing them are major issues for search engines
Dynamic data
• The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
e.g., web pages
• Acquiring or “crawling” the documents is a major task
Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed)
• Updating the indexes while processing queries is also a design issue
Scalability
• Making everything work with millions of users every day, and many terabytes of documents
• Distributed processing is essential
Adaptability
• Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications
Spam
For Web search, spam in all its forms is one of the major issues
Affects the efficiency of search engines and, more seriously, the effectiveness of the results
Many types of spam
• e.g. spamdexing or term spam, link spam, “optimization”
New subfield called adversarial IR, since spammers are “adversaries” with different goals
Information Retrieval Model (Techniques)
The Taxonomy of IR Model (Kuropka, 2004)
How Users Find Information (User Tasks)
Browsing
• For users who are not yet "sure" what information they are looking for
• Browsing can be done randomly or in a structured (menu-based) way
Searching
• For users who already know what information they are looking for
• Uses keywords
Classic Retrieval Models (Techniques)
1. Boolean Model
• Fuzzy
• Extended Boolean
2. Vector Model
• General vector space
• Latent semantic indexing
• Neural network
3. Probabilistic Model
• Inference network
• Neural network
Characteristics of the Classic Models
Documents are represented using index terms
Binary values are used as index term weights
Index term weights indicate specificity for a particular document
Computation is done with mathematical and statistical approaches
Boolean Model
This model is based on set theory and Boolean algebra
A document is a set of terms
A query is a Boolean expression written over terms
A document is predicted to be either relevant or not
The model uses Boolean operators
Terms in a query are connected using the operators AND, OR, or NOT
This is the method most often used in search engines because of its speed
Boolean OR
OR is a boolean operator used to broaden your search by retrieving any, some, or all of the keywords used in the search statement
OR helps you make sure you aren't missing anything valuable. Query: College OR University (I would like information about college)
We retrieve records in which AT LEAST ONE of the search terms is present
We are searching on the terms college and also university since documents containing either of these words might be relevant
Query and Result (OR)
Search terms                     Results
college                          17,320,770
university                       33,685,205
college OR university            33,702,660
college OR university OR campus  33,703,082
X OR Y
X Y  X OR Y
1 0    1
0 1    1
1 1    1
0 0    0
Boolean AND
AND is a boolean operator used to narrow your search by ensuring that all keywords used appear in the search results
Since the Web is already huge, it is important you use AND effectively.
Query: Poverty AND Crime (I'm interested in the relationship between poverty and crime)
We retrieve records in which both of the search terms are present
Notice how we do not retrieve any records with only "poverty" or only "crime."
Query and Result (AND)
Search terms                  Results
poverty                       783,447
crime                         2,962,165
poverty AND crime             1,677
poverty AND crime AND gender  76
X AND Y
X Y  X AND Y
1 0    0
0 1    0
1 1    1
0 0    0
Leap Year Condition
1. year % 400 == 0
OR
2. (year % 4 == 0) AND NOT (year % 100 == 0)
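The leap year condition is just nested Boolean operators; a minimal sketch of the standard Gregorian rule:

```python
def is_leap_year(year: int) -> bool:
    # (divisible by 400) OR (divisible by 4 AND NOT divisible by 100)
    return (year % 400 == 0) or (year % 4 == 0 and year % 100 != 0)

print(is_leap_year(2000), is_leap_year(1900), is_leap_year(2024))  # True False True
```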
Boolean NOT
NOT is a boolean operator used to eliminate an unwanted concept or word in your search statement.
Query: Pets NOT Cats
I want to see information about pets, but I want to avoid seeing anything about cats.
We retrieve records in which the first search term is present and the second is absent.
No records are retrieved in which the word "cats" appears, even if the word "pets" appears there, too
Query and Result (NOT)
Search terms   Results
pets           4,556,515
cats           3,651,252
pets NOT cats  81,497
Nesting
A method of combining Boolean operators in a logical order
When using Boolean Operators in combination, however, it is important to "nest" them
Nesting means putting operators in parentheses in order to tell the library catalog, database, or Internet search engine how it should search for your terms
Example of Nesting
(A OR B) AND C -- finds concept A or B where they intersect with concept C. Example: (ford OR chevrolet) AND recall -- finds Fords or Chevrolets and the recalls on each
(A OR B OR C) AND (D OR E) -- finds concept A, B, or C, then finds concept D or E, and then combines them. Example: (smoking OR tobacco OR nicotine) AND (adolescents OR teenagers) -- finds smoking, tobacco, or nicotine for adolescents or for teenagers
(A OR B) AND (C NOT D) -- finds concept A or B, then finds concept C but not D, and then combines them. Example: (treatment OR outcomes) AND (anorexia NOT bulimia) -- finds the treatment or outcomes for anorexia but not for bulimia
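Nested Boolean queries map directly onto set operations over an inverted index; a minimal sketch with a hypothetical toy index:

```python
# Hypothetical inverted index: each term maps to the set of document IDs
# that contain it.
index = {
    "ford":      {1, 2, 5},
    "chevrolet": {3, 5},
    "recall":    {2, 3, 4},
}
universe = {1, 2, 3, 4, 5}  # all document IDs; needed for NOT

# (ford OR chevrolet) AND recall -> set union, then intersection
result = (index["ford"] | index["chevrolet"]) & index["recall"]
print(sorted(result))  # [2, 3]

# NOT ford -> set complement against the whole collection
not_ford = universe - index["ford"]
print(sorted(not_ford))  # [3, 4]
```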
Boolean Model Example
A AND B: retrieves D1AB, D2AB, ... with d1AB > d2AB > ..., where dAB = min(dA, dB)
A OR B: retrieves D1AB, D2AB, ... with d1AB > d2AB > ..., where dAB = max(dA, dB)
NOT A: U - dA
• where dA denotes the weight of term A in document D
• these term weights are obtained from the indexing process
• min(dA, dB) means a document is retrieved with a weight equal to the smallest of its term weights
• max(dA, dB) means a document is retrieved with a weight equal to the largest of its term weights
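A minimal sketch of this weighted Boolean evaluation, with hypothetical term weights in [0, 1]:

```python
# Hypothetical per-document term weights produced by indexing.
weights = {
    "D1": {"A": 0.8, "B": 0.3},
    "D2": {"A": 0.5, "B": 0.9},
}

def rsv_and(doc):    # A AND B -> min(dA, dB)
    return min(weights[doc]["A"], weights[doc]["B"])

def rsv_or(doc):     # A OR B -> max(dA, dB)
    return max(weights[doc]["A"], weights[doc]["B"])

def rsv_not_a(doc):  # NOT A -> complement of dA for weights in [0, 1]
    return 1 - weights[doc]["A"]

# Rank documents for the query "A AND B", highest weight first.
ranked = sorted(weights, key=rsv_and, reverse=True)
print(ranked)  # ['D2', 'D1']
```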
Exercise
Build the truth tables for:
1. (A OR B) AND C
2. (A OR B OR C) AND (D OR E)
3. (A OR B) AND (NOT C)
(A OR B) AND C
A B C  A OR B  (A OR B) AND C
0 0 0    0          0
0 0 1    0          0
0 1 0    1          0
0 1 1    1          1
1 0 0    1          0
1 0 1    1          1
1 1 0    1          0
1 1 1    1          1
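Truth tables like the one above can be generated mechanically; a minimal sketch for the first exercise query:

```python
from itertools import product

# Enumerate all 8 combinations of A, B, C and evaluate (A OR B) AND C.
rows = [(a, b, c, (a or b) and c) for a, b, c in product([0, 1], repeat=3)]
for a, b, c, z in rows:
    print(a, b, c, z)
```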
Advantages of the Boolean Model
The Boolean model is a simple model based on elementary set theory, so it is easy to implement
The Boolean model can be extended with proximity operators and wildcard operators
Cost considerations discourage changing the software and database structure, especially in commercial systems
Weaknesses of the Boolean Model
The Boolean model cannot rank the retrieved documents
Only documents that fully satisfy the given Boolean expression or query are retrieved (exact match)
As a result, the retrieved set can be very large or very small, which makes decision-making difficult
Boolean expressions can be complex, so users need knowledge of Boolean queries to search efficiently
It cannot handle partial matching of the query
Exercise
Read the article Teknik Pencarian Efektif dengan Google (Romi Satria Wahono)
Try it out by searching with keywords on Google
Extended Boolean Model
The Extended Boolean technique, based on the p-norm model, is a further development of the Boolean model
It uses operators computed with the formulas of Savoy (1993)
Query and Retrieval Status Value (RSV)
A OR<p> B:   RSV = ((Wia^p + Wib^p) / 2)^(1/p)
A AND<p> B:  RSV = 1 - (((1 - Wia)^p + (1 - Wib)^p) / 2)^(1/p)
NOT A:       RSV = 1 - Wia
where:
• p is the p-norm value supplied in the query
• Wia is the weight of term A in the index of document Di
• Wib is the weight of term B in the index of document Di
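A minimal sketch of the p-norm operators, with hypothetical weights: at p = 1 both operators reduce to the average of the weights, while large p approaches the strict min/max fuzzy Boolean operators.

```python
def rsv_or(w_a, w_b, p):
    # OR<p>: ((Wia^p + Wib^p) / 2)^(1/p)
    return ((w_a ** p + w_b ** p) / 2) ** (1 / p)

def rsv_and(w_a, w_b, p):
    # AND<p>: 1 - (((1 - Wia)^p + (1 - Wib)^p) / 2)^(1/p)
    return 1 - (((1 - w_a) ** p + (1 - w_b) ** p) / 2) ** (1 / p)

print(round(rsv_or(0.8, 0.2, 1), 3))   # 0.5   (the average)
print(round(rsv_or(0.8, 0.2, 10), 3))  # 0.746 (approaching max = 0.8)
```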
Extended Boolean Ranking
Either directly sort the documents (largest to smallest) by the document weight obtained from the RSV (retrieval status value) formula
Or apply the Learning Scheme formula:
RSV(Di) = RSVinit(Di) + Σk αik · RSVinit(Dk), for i = 1, 2, ..., n
where:
• RSVinit(Di) is the retrieval status value of document i, computed with the p-norm retrieval formula
• αik is the normalized association weight between documents i and k, obtained from the relevance-link values produced by a learning process
Boolean vs. Extended Boolean (Search: Citra AND Komputer)

Boolean              Extended Boolean
1. S048 2.000000     1. S005 0.099570
2. S005 1.000000     2. S048 0.039120
3. S006 1.000000     3. T044 0.031300
4. S030 1.000000     4. S006 0.026080
5. S067 1.000000     5. T005 0.022350
6. T005 1.000000     6. S030 0.013040
7. T044 1.000000     7. S067 0.013040
Vector Space Model
The vector model is based on keyterms
The vector model supports partial matching and document ranking
Basic principles of the vector model:
• Documents are represented as keyterm vectors
• The dimensionality of the space is determined by the keyterms
• Queries are represented as keyterm vectors
• Document-query similarity is computed from the distance between the vectors
The vector model requires:
• Keyterm weights for the document vectors
• Keyterm weights for the query
• A distance computation for the document and query vectors
Vector Space Model Procedure
1. Document indexing
2. Index weighting, to produce relevant documents
3. Ranking documents by a similarity measure
Advantages of the Vector Space Model
Very efficient
• Uses sparse matrix methods
• Uses simple linear algebra
• Easy to represent
• Can be applied to document matching
Flexible
• Used in query resolution
• Supports document-to-document similarity
• Supports clustering
Very popular and widely used
Disadvantages of the Vector Space Model
Its theoretical framework is unclear
The resulting indexes lie close together
It assumes index terms are independent
Document Indexing
Some words in a document do not describe its content, such as the, is, a, etc.
These are known as stop words. With automatic document indexing, these stop words are removed from the document
The index can be built based on:
• The frequency of term occurrences in a document
• Non-linguistic methods: probabilistic indexing
Index Weighting (Term Weighting)
Term weighting in the vector space model is based on single-term statistics. There are three main factors in vector space term weighting:
1. Term frequency factor
2. Collection frequency factor
3. Length normalization factor
These three factors are multiplied together to produce the term weight
The most common weighting scheme for a term in a document uses its frequency of occurrence
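A minimal sketch of multiplying a term frequency factor by a collection frequency factor (a tf-idf style weight) over a hypothetical mini-corpus; length normalization is typically applied afterwards by dividing by the vector norm:

```python
import math

# Hypothetical mini-corpus: each document is a list of tokens.
docs = [
    ["information", "retrieval", "retrieval"],
    ["database", "retrieval"],
    ["information", "database", "database"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                    # term frequency factor
    df = sum(1 for d in docs if term in d)  # document frequency of the term
    idf = math.log(N / df)                  # collection frequency factor
    return tf * idf                         # the factors are multiplied

w = tf_idf("retrieval", docs[0])
print(round(w, 3))  # 0.811
```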
Document Ranking
Term similarity in the vector space model is determined by an associative coefficient based on the inner product of the document vector and the query vector, where word overlap indicates similarity
The inner product is usually normalized
The most popular similarity measure is the cosine coefficient, which measures the angle between the document vector and the query vector
Other similarity measures include the Jaccard and Dice coefficients
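A minimal sketch of the cosine coefficient over hypothetical weight vectors: the inner product of the document and query vectors, normalized by their lengths:

```python
import math

def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))  # inner product
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_d * norm_q)              # length normalization

doc = [2.0, 1.0, 0.0]    # term weights over a 3-term vocabulary
query = [1.0, 1.0, 0.0]
print(round(cosine(doc, query), 3))  # 0.949
```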
Probabilistic Model
Estimates the relevance of a page using probabilities
Has a clear theoretical framework
• Based on statistical principles
• Document relevance can be updated
• Through feedback from the user
Basic idea:
• A query can produce the correct answers
• Uses index terms
• Uses an initial estimate
• Uses initial results
• User feedback can improve the probability of relevance
References
1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
2. Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press, 2010
3. Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2009
4. David A. Grossman and Ophir Frieder, Information Retrieval: Algorithms and Heuristics 2nd edition, Springer, 2004
5. Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, and Carol L Barry, Text Information Retrieval Systems Third Edition, Library and Information Science, 2007