Information Retrieval:
Introduction
Romi Satria Wahono
[email protected] | http://romisatriawahono.net
0878-804804-85
Born in Madiun, October 2, 1974
SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara, Magelang (1993)
S1, S2, and S3 (on leave), Department of Computer Sciences, Saitama University, Japan (1994-2004)
Core Competence: Software Engineering, Computational Intelligence
Founder and Coordinator of IlmuKomputer.Com
CEO PT Brainmatics Cipta Informatika
Learning Methods
Lecture
Discussion
Case Study
Practice
Textbook
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
References
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press, 2010
Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2009
David A. Grossman and Ophir Frieder, Information Retrieval: Algorithms and Heuristics 2nd edition, Springer, 2004
Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, and Carol L Barry, Text Information Retrieval Systems Third Edition, Library and Information Science, 2007
Course Contents
1. Introduction
2. Boolean Retrieval
3. The Term Vocabulary
4. Dictionaries and Tolerant Retrieval
5. Index Construction
6. Index Compression
7. Vector Space Model
8. Computing Scores
9. Evaluation in Information Retrieval
10. Relevance Feedback and Query Expansion
11. XML Retrieval
12. Probabilistic Information Retrieval
13. Language Models for Information Retrieval
14. Text Classification and Naive Bayes
15. Vector Space Classification
16. Support Vector Machines and Machine Learning on Documents
17. Flat Clustering
18. Hierarchical Clustering
19. Latent Semantic Indexing
20. Web Search
21. Web Crawling and Indexes
22. Link Analysis
INTRODUCTION
History of Information Retrieval (IR)
1940-
late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans
1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly
1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds
1950-
1950s: mechanized literature searching systems (Allen Kent et al.) and the invention of citation indexing (Eugene Garfield)
1950: The term "information retrieval" appears to have been coined by Calvin Mooers
1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master thesis at MIT
1955: Kent and colleagues published a paper in American Documentation describing precision and recall, the classic IR evaluation measures
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
1960-
early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell
1960: Melvin Earl (Bill) Maron and John Lary Kuhns published "On relevance, probabilistic indexing, and information retrieval" in the Journal of the ACM 7(3):216–244, July 1960
1962: Cyril W. Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation
1963: Joseph Becker and Robert M. Hayes published text on information retrieval. Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
1964: Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR.
1960- continued
mid-1960s: National Library of Medicine developed MEDLARS Medical Literature Analysis and Retrieval System, the first major machine-readable database and batch-retrieval system (Project Intrex at MIT)
1965: J. C. R. Licklider published Libraries of the Future.
1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval
1968: Gerard Salton published Automatic Information Organization and Retrieval. John W. Sammon, Jr.'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model
1970
early 1970s: First online systems—NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT. Theodor Nelson promoting concept of hypertext, published Computer Lib/Dream Machines.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis." (Information Storage and Retrieval, 7(5), pp. 217–240, December 1971)
1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
• A Theory of Indexing (Society for Industrial and Applied Mathematics)
• A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
• A Vector Space Model for Automatic Indexing (CACM 18:11)
1978: The First ACM SIGIR conference
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy emphasis on probabilistic models
1980-
1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. This was an important concept, though their automated analysis tool proved ultimately disappointing
1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models
mid-1980s: Efforts to develop end-user versions of commercial IR systems
1989: First World Wide Web proposals by Tim Berners-Lee at CERN
1990
1992: First TREC conference
1997: Publication of Korfhage's Information Storage and Retrieval with emphasis on visualization and multi-reference point systems
mid-1990s: Searching FTPable documents on the Internet (Archie, WAIS) and searching the World Wide Web (Lycos, Yahoo, Altavista)
late 1990s: Web search engines implement many features formerly found only in experimental IR systems. Search engines become the most common, and perhaps best, instantiation of IR models, research, and implementation
2000-
Link analysis for Web Search (Google)
Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
Question Answering
• TREC Q/A track
Automated Text Categorization & Clustering
Recommender Systems (Ringo, Amazon, NetPerceptions)
2000- continued
Multimedia IR
• Image
• Video
• Audio and music
Cross-Language IR
• DARPA TIDES
Document Summarization
Related Areas
Database Management
Library and Information Science
Artificial Intelligence
Natural Language Processing
Machine Learning
Database Management
Focused on structured data stored in relational tables rather than free-form text
Focused on efficient processing of well-defined queries in a formal language (SQL)
Clearer semantics for both data and queries
Recent move towards semi-structured data (XML) brings it closer to IR
Library and Information Science
Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization)
Concerned with effective categorization of human knowledge
Concerned with citation analysis and bibliometrics (structure of information)
Recent work on digital libraries brings it closer to CS & IR
Artificial Intelligence
Focused on the representation of knowledge, reasoning, and intelligent action
Formalisms for representing knowledge and queries:
• First-order Predicate Logic
• Bayesian Networks
Recent work on web ontologies and intelligent information agents brings it closer to IR
Natural Language Processing
Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse
Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords
Natural Language Processing: IR Directions
Methods for determining the sense of an ambiguous word based on context (word sense disambiguation)
Methods for identifying specific pieces of information in a document (information extraction)
Methods for answering specific NL questions from document corpora
Machine Learning
Focused on the development of computational systems that improve their performance with experience
Automated classification of examples based on learning concepts from labeled training examples (supervised learning)
Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning)
Machine Learning: IR Directions
Text Categorization
• Automatic hierarchical classification (Yahoo)
• Adaptive filtering/routing/recommending
• Automated spam filtering
Text Clustering
• Clustering of IR query results
• Automatic formation of hierarchies (Yahoo)
Learning for Information Extraction
Text Mining
Basic Concepts
Information Retrieval (IR)
Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
Structured vs Unstructured Data
Structured data tends to refer to information in "tables"

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith
Structured vs Unstructured Data in 1996
Structured vs Unstructured Data in 2009
IR Fields: Clustering and Classification
Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic.
Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class, if any, each of a set of documents belongs to
IR Fields: Operation Scale
1. Web Search:
• provide search over billions of documents stored on millions of computers
• distinctive issues: gathering documents for indexing, building systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as exploiting hypertext and not being fooled by site providers who manipulate page content to boost their search engine rankings, given the commercial importance of the web
2. Personal information retrieval:
• consumer operating systems have integrated information retrieval (Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search). Email programs usually not only provide search but also text classification: they at least provide a spam filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders
• Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner
3. Enterprise, Institutional, and Domain-Specific Search
• provided for collections such as a corporation’s internal documents, a database of patents, or research articles on biochemistry
• the documents will typically be stored on centralized file systems and one or a handful of dedicated machines will provide search over the collection
Comparison of IRS, IS, and AI

           Data Object       Function                   Database Size
IRS        Documents         Retrieval (probabilistic)  Small to large
IS (DBMS)  Tables            Retrieval (deterministic)  Small to large
AI         Logic statements  Inference                  Small
What is a Document?
Examples:
• web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, IM sessions, etc.
Common properties
• Significant text content
• Some structure (e.g., title, author, date for papers; subject, sender, destination for email)
Documents vs Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
• e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
Easy to compare fields with well-defined semantics to queries in order to find matches
Text is more difficult
Documents vs Records
Example bank database query
• Find records with balance > $50,000 in branches located in Amherst, MA.
• Matches easily found by comparison with field values of records
Example search engine query
• bank scandals in western mass
• This text must be compared to the text of entire news stories
Comparing Text
Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
Exact matching of words is not enough
• Many different ways to write the same thing in a “natural language” like English
• e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
• Some stories will be better matches than others
Information Retrieval System (IRS)
A system whose function is to find information relevant to the user's needs
The information processed is contained in textual documents
Information retrieval concerns the representation, storage of, and access to documents
A retrieved document is not guaranteed to be relevant to the information need the user expressed in the query
Information Systems and the IRS
Hierarchical:
• Transaction Processing Systems
• Management Information Systems
• Executive Information Systems
Functional:
• Marketing Information Systems
• Personnel Information Systems
• Financial Information Systems
• etc.
Not tied to hierarchy or function:
• Decision Support Systems
• Artificial Intelligence Systems
• Information Retrieval Systems
• Library Information Systems
• etc.
IRS Components
(Diagram: the USER submits a query against the DOCUMENT COLLECTION; a MATCHING FUNCTION selects the RETRIEVED DOCUMENTS, whose RELEVANCE is then determined)
IRS User Categories
Novice: beginner users
• Do not yet have a clear information need
• Still want to browse information
Intermediate: users who have started to learn
• Have an information need, but still a somewhat vague one
• Want both to browse and to search
Expert: expert users
• Have a clearly defined information need
• Search directly for the information they need
Big Issues in IR
Relevance
Evaluation
User and Information Needs
Big Issues in IR: Relevance
Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style
Topical relevance (same topic) vs. user relevance (everything else)
Retrieval models define a view of relevance
Ranking algorithms used in search engines are based on retrieval models
Most models describe statistical properties of text rather than linguistic
• i.e. counting simple text features such as words instead of parsing and analyzing the sentences
• Statistical approach to text processing started with Luhn in the 50s
• Linguistic features can be part of a statistical model
Big Issues in IR: Evaluation
Experimental procedures and measures for comparing system output with user expectations
• Originated in Cranfield experiments in the 60s
IR evaluation methods now used in many fields
Typically use test collection of documents, queries, and relevance judgments
• Most commonly used are TREC collections
Recall and precision are two examples of effectiveness measures
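The effectiveness measures just mentioned can be computed directly from sets; a minimal sketch in Python, using hypothetical documents and relevance judgments:

```python
# Hypothetical relevance judgments and a retrieved set for one query.
relevant = {"d1", "d3", "d5", "d7"}     # documents judged relevant
retrieved = {"d1", "d2", "d3", "d4"}    # documents the system returned

true_positives = relevant & retrieved   # relevant AND retrieved

# Precision: fraction of retrieved documents that are relevant.
precision = len(true_positives) / len(retrieved)
# Recall: fraction of relevant documents that were retrieved.
recall = len(true_positives) / len(relevant)

print(precision, recall)  # 0.5 0.5
```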
Big Issues in IR: User and Information Needs
Search evaluation is user-centered
Keyword queries are often poor descriptions of actual information needs
Interaction and context are important for understanding user intent
Query refinement techniques such as query expansion, query suggestion, relevance feedback improve ranking
IR and Search Engines
A search engine is the practical application of information retrieval techniques to large scale text collections
Web search engines are best-known examples, but many others
• Open source search engines are important for research and development
• e.g., Lucene, Lemur/Indri, Galago
IR and Search Engines
Information Retrieval issues:
• Relevance: effective ranking
• Evaluation: testing and measuring
• Information needs: user interaction
Search Engine issues:
• Performance: efficient search and indexing
• Incorporating new data: coverage and freshness
• Scalability: growing with data and users
• Adaptability: tuning for applications
• Specific problems: e.g., spam
Search Engine Issues
Performance
• Measuring and improving the efficiency of search
e.g., reducing response time, increasing query throughput, increasing indexing speed
• Indexes are data structures designed to improve search efficiency
designing and implementing them are major issues for search engines
Dynamic data
• The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
e.g., web pages
• Acquiring or “crawling” the documents is a major task
Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed)
• Updating the indexes while processing queries is also a design issue
Scalability
• Making everything work with millions of users every day, and many terabytes of documents
• Distributed processing is essential
Adaptability
• Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications
Spam
For Web search, spam in all its forms is one of the major issues
Affects the efficiency of search engines and, more seriously, the effectiveness of the results
Many types of spam
• e.g. spamdexing or term spam, link spam, “optimization”
New subfield called adversarial IR, since spammers are “adversaries” with different goals
Information Retrieval Model (Techniques)
The Taxonomy of IR Model (Kuropka, 2004)
How Users Find Information (User Tasks)
Browsing
• For users who are not yet "sure" what information they are looking for
• Browsing can be done randomly or in a structured (menu-based) way
Searching
• For users who already know what information they are looking for
• Uses keywords
Classic Retrieval Models (Techniques)
1. Boolean Model
• Fuzzy
• Extended Boolean
2. Vector Model
• General vector space
• Latent semantic indexing
• Neural network
3. Probabilistic Model
• Inference network
• Neural network
Characteristics of the Classic Models
Documents are represented using index terms
Binary values are used as index term weights
Index term weights indicate specificity for a particular document
Computation is done with mathematical and statistical approaches
Boolean Model
This model is based on set theory and Boolean algebra
A document is a set of terms
A query is a Boolean expression written over terms
A document is predicted to be either relevant or not
The model uses Boolean operators
Terms in a query are connected using the operators AND, OR, or NOT
This is the method most often used in search engines because of its speed
Boolean OR
OR is a boolean operator used to broaden your search by retrieving any, some, or all of the keywords used in the search statement
OR helps you make sure you aren't missing anything valuable. Query: College OR University (I would like information about college)
We retrieve records in which AT LEAST ONE of the search terms is present
We are searching on the terms college and also university since documents containing either of these words might be relevant
Query and Result (OR)
Search terms                     Results
college                          17,320,770
university                       33,685,205
college OR university            33,702,660
college OR university OR campus  33,703,082
X OR Y
X Y  X OR Y
1 0    1
0 1    1
1 1    1
0 0    0
Boolean AND
AND is a boolean operator used to narrow your search by ensuring that all keywords used appear in the search results
Since the Web is already huge, it is important you use AND effectively.
Query: Poverty AND Crime (I'm interested in the relationship between poverty and crime)
We retrieve records in which both of the search terms are present
Notice how we do not retrieve any records with only "poverty" or only "crime."
Query and Result (AND)
Search terms                  Results
poverty                       783,447
crime                         2,962,165
poverty AND crime             1,677
poverty AND crime AND gender  76
X AND Y
X Y  X AND Y
1 0    0
0 1    0
1 1    1
0 0    0
Leap Year Condition
1. year % 400 == 0
OR
2. (year % 4 == 0) AND NOT (year % 100 == 0)
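The leap year condition is just nested Boolean operators; a minimal sketch of the standard Gregorian rule:

```python
def is_leap_year(year: int) -> bool:
    # (divisible by 400) OR (divisible by 4 AND NOT divisible by 100)
    return (year % 400 == 0) or (year % 4 == 0 and year % 100 != 0)

print(is_leap_year(2000), is_leap_year(1900), is_leap_year(2024))  # True False True
```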
Boolean NOT
NOT is a boolean operator used to eliminate an unwanted concept or word in your search statement.
Query: Pets NOT Cats
I want to see information about pets, but I want to avoid seeing anything about cats.
We retrieve records in which the first search term is present and the second is absent.
No records are retrieved in which the word "cats" appears, even if the word "pets" appears there, too
Query and Result (NOT)
Search terms   Results
pets           4,556,515
cats           3,651,252
pets NOT cats  81,497
Nesting
A method of combining Boolean operators in a logical order
When using Boolean Operators in combination, however, it is important to "nest" them
Nesting means putting operators in parentheses in order to tell the library catalog, database, or Internet search engine how it should search for your terms
Example of Nesting
(A OR B) AND C -- finds concept A or B where they intersect with concept C. Example: (ford OR chevrolet) AND recall -- finds Fords or Chevrolets and the recalls on each
(A OR B OR C) AND (D OR E) -- finds concept A, B, or C, then finds concept D or E, and then combines them. Example: (smoking OR tobacco OR nicotine) AND (adolescents OR teenagers) -- finds smoking, tobacco, or nicotine for adolescents or for teenagers
(A OR B) AND (C NOT D) -- finds concept A or B, then finds concept C but not D, and then combines them. Example: (treatment OR outcomes) AND (anorexia NOT bulimia) -- finds the treatment or outcomes for anorexia but not for bulimia
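Nested Boolean queries map directly onto set operations over an inverted index; a minimal sketch with a hypothetical toy index:

```python
# Hypothetical inverted index: each term maps to the set of document IDs
# that contain it.
index = {
    "ford":      {1, 2, 5},
    "chevrolet": {3, 5},
    "recall":    {2, 3, 4},
}
universe = {1, 2, 3, 4, 5}  # all document IDs; needed for NOT

# (ford OR chevrolet) AND recall -> set union, then intersection
result = (index["ford"] | index["chevrolet"]) & index["recall"]
print(sorted(result))  # [2, 3]

# NOT ford -> set complement against the whole collection
not_ford = universe - index["ford"]
print(sorted(not_ford))  # [3, 4]
```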
Boolean Model Example
A AND B: retrieves D1AB, D2AB, ... with d1AB > d2AB > ..., where dAB = min(dA, dB)
A OR B: retrieves D1AB, D2AB, ... with d1AB > d2AB > ..., where dAB = max(dA, dB)
NOT A: U - dA
• where dA denotes the weight of term A in document D
• these term weights are obtained from the indexing process
• min(dA, dB) means a document is retrieved with a weight equal to the smallest of its term weights
• max(dA, dB) means a document is retrieved with a weight equal to the largest of its term weights
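A minimal sketch of this weighted Boolean evaluation, with hypothetical term weights in [0, 1]:

```python
# Hypothetical per-document term weights produced by indexing.
weights = {
    "D1": {"A": 0.8, "B": 0.3},
    "D2": {"A": 0.5, "B": 0.9},
}

def rsv_and(doc):    # A AND B -> min(dA, dB)
    return min(weights[doc]["A"], weights[doc]["B"])

def rsv_or(doc):     # A OR B -> max(dA, dB)
    return max(weights[doc]["A"], weights[doc]["B"])

def rsv_not_a(doc):  # NOT A -> complement of dA for weights in [0, 1]
    return 1 - weights[doc]["A"]

# Rank documents for the query "A AND B", highest weight first.
ranked = sorted(weights, key=rsv_and, reverse=True)
print(ranked)  # ['D2', 'D1']
```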
Exercise
Build the truth tables for:
1. (A OR B) AND C
2. (A OR B OR C) AND (D OR E)
3. (A OR B) AND (NOT C)
(A OR B) AND C
A B C  A OR B  (A OR B) AND C
0 0 0    0          0
0 0 1    0          0
0 1 0    1          0
0 1 1    1          1
1 0 0    1          0
1 0 1    1          1
1 1 0    1          0
1 1 1    1          1
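Truth tables like the one above can be generated mechanically; a minimal sketch for the first exercise query:

```python
from itertools import product

# Enumerate all 8 combinations of A, B, C and evaluate (A OR B) AND C.
rows = [(a, b, c, (a or b) and c) for a, b, c in product([0, 1], repeat=3)]
for a, b, c, z in rows:
    print(a, b, c, z)
```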
Advantages of the Boolean Model
The Boolean model is a simple model based on elementary set theory, so it is easy to implement
The Boolean model can be extended with proximity operators and wildcard operators
Cost considerations discourage changing the software and database structure, especially in commercial systems
Weaknesses of the Boolean Model
The Boolean model cannot rank the retrieved documents
Only documents that fully satisfy the given Boolean expression or query are retrieved (exact match)
As a result, the retrieved set can be very large or very small, which makes decision-making difficult
Boolean expressions can be complex, so users need knowledge of Boolean queries to search efficiently
It cannot handle partial matching of the query
Exercise
Read the article Teknik Pencarian Efektif dengan Google (Romi Satria Wahono)
Try it out by searching with keywords on Google
Extended Boolean Model
The Extended Boolean technique, based on the p-norm model, is a further development of the Boolean model
It uses operators computed with the formulas of Savoy (1993)
Query and Retrieval Status Value (RSV)
A OR<p> B:   RSV = ((Wia^p + Wib^p) / 2)^(1/p)
A AND<p> B:  RSV = 1 - (((1 - Wia)^p + (1 - Wib)^p) / 2)^(1/p)
NOT A:       RSV = 1 - Wia
where:
• p is the p-norm value supplied in the query
• Wia is the weight of term A in the index of document Di
• Wib is the weight of term B in the index of document Di
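A minimal sketch of the p-norm operators, with hypothetical weights: at p = 1 both operators reduce to the average of the weights, while large p approaches the strict min/max fuzzy Boolean operators.

```python
def rsv_or(w_a, w_b, p):
    # OR<p>: ((Wia^p + Wib^p) / 2)^(1/p)
    return ((w_a ** p + w_b ** p) / 2) ** (1 / p)

def rsv_and(w_a, w_b, p):
    # AND<p>: 1 - (((1 - Wia)^p + (1 - Wib)^p) / 2)^(1/p)
    return 1 - (((1 - w_a) ** p + (1 - w_b) ** p) / 2) ** (1 / p)

print(round(rsv_or(0.8, 0.2, 1), 3))   # 0.5   (the average)
print(round(rsv_or(0.8, 0.2, 10), 3))  # 0.746 (approaching max = 0.8)
```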
Extended Boolean Ranking
Either directly sort the documents (largest to smallest) by the document weight obtained from the RSV (retrieval status value) formula
Or apply the Learning Scheme formula:
RSV(Di) = RSVinit(Di) + Σk αik · RSVinit(Dk), for i = 1, 2, ..., n
where:
• RSVinit(Di) is the retrieval status value of document i, computed with the p-norm retrieval formula
• αik is the normalized association weight between documents i and k, obtained from the relevance-link values produced by a learning process
Boolean vs. Extended Boolean (Search: Citra AND Komputer)

Boolean              Extended Boolean
1. S048 2.000000     1. S005 0.099570
2. S005 1.000000     2. S048 0.039120
3. S006 1.000000     3. T044 0.031300
4. S030 1.000000     4. S006 0.026080
5. S067 1.000000     5. T005 0.022350
6. T005 1.000000     6. S030 0.013040
7. T044 1.000000     7. S067 0.013040
Vector Space Model
The vector model is based on keyterms
The vector model supports partial matching and document ranking
Basic principles of the vector model:
• Documents are represented as keyterm vectors
• The dimensionality of the space is determined by the keyterms
• Queries are represented as keyterm vectors
• Document-query similarity is computed from the distance between the vectors
The vector model requires:
• Keyterm weights for the document vectors
• Keyterm weights for the query
• A distance computation for the document and query vectors
Vector Space Model Procedure
1. Document indexing
2. Index weighting, to produce relevant documents
3. Ranking documents by a similarity measure
Advantages of the Vector Space Model
Very efficient
• Uses sparse matrix methods
• Uses simple linear algebra
• Easy to represent
• Can be applied to document matching
Flexible
• Used in query resolution
• Supports document-to-document similarity
• Supports clustering
Very popular and widely used
Disadvantages of the Vector Space Model
Its theoretical framework is unclear
The resulting indexes lie close together
It assumes index terms are independent
Document Indexing
Some words in a document do not describe its content, such as the, is, a, etc.
These are known as stop words. With automatic document indexing, these stop words are removed from the document
The index can be built based on:
• The frequency of term occurrences in a document
• Non-linguistic methods: probabilistic indexing
Index Weighting (Term Weighting)
Term weighting in the vector space model is based on single-term statistics. There are three main factors in vector space term weighting:
1. Term frequency factor
2. Collection frequency factor
3. Length normalization factor
These three factors are multiplied together to produce the term weight
The most common weighting scheme for a term in a document uses its frequency of occurrence
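A minimal sketch of multiplying a term frequency factor by a collection frequency factor (a tf-idf style weight) over a hypothetical mini-corpus; length normalization is typically applied afterwards by dividing by the vector norm:

```python
import math

# Hypothetical mini-corpus: each document is a list of tokens.
docs = [
    ["information", "retrieval", "retrieval"],
    ["database", "retrieval"],
    ["information", "database", "database"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                    # term frequency factor
    df = sum(1 for d in docs if term in d)  # document frequency of the term
    idf = math.log(N / df)                  # collection frequency factor
    return tf * idf                         # the factors are multiplied

w = tf_idf("retrieval", docs[0])
print(round(w, 3))  # 0.811
```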
Document Ranking
Term similarity in the vector space model is determined by an associative coefficient based on the inner product of the document vector and the query vector, where word overlap indicates similarity
The inner product is usually normalized
The most popular similarity measure is the cosine coefficient, which measures the angle between the document vector and the query vector
Other similarity measures include the Jaccard and Dice coefficients
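A minimal sketch of the cosine coefficient over hypothetical weight vectors: the inner product of the document and query vectors, normalized by their lengths:

```python
import math

def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))  # inner product
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_d * norm_q)              # length normalization

doc = [2.0, 1.0, 0.0]    # term weights over a 3-term vocabulary
query = [1.0, 1.0, 0.0]
print(round(cosine(doc, query), 3))  # 0.949
```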
Probabilistic Model
Estimates the relevance of a page using probabilities
Has a clear theoretical framework
• Based on statistical principles
• Document relevance can be updated
• Through feedback from the user
Basic idea:
• A query can produce the correct answers
• Uses index terms
• Uses an initial estimate
• Uses initial results
• User feedback can improve the probability of relevance
References
1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
2. Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press, 2010
3. Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2009
4. David A. Grossman and Ophir Frieder, Information Retrieval: Algorithms and Heuristics 2nd edition, Springer, 2004
5. Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, and Carol L Barry, Text Information Retrieval Systems Third Edition, Library and Information Science, 2007