Adding Semantics to Information Retrieval
By Kedar Bellare
20th April 2003
Motivation
- Current IR techniques are term-based
- Semantics of document and query are not considered
- Problems like polysemy and synonymy
- Many advances in NLP and statistical modeling of semantics
- Is semantic IR really required?
Organization
- Traditional IR
- Statistics for Semantics – Latent Semantic Indexing
- Semantic Resources for Semantics – use of Semantic Nets, Conceptual Graphs, WordNet, etc. in IR
- Conclusion
Information Retrieval
An information retrieval system does not inform the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request. – van Rijsbergen [14]
A Typical IR System
Current IR
- Preprocessing of documents: inverted index, stopword removal, and stemming
- Representation of documents: Vector Space Model – TF and IDF; document clustering
- Improvements to the above: better weighting of document vectors; link analysis – PageRank and anchor text
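As an illustrative sketch (not from the slides), the vector space model with TF-IDF weighting and cosine similarity fits in a few lines; the toy documents and the log-based IDF formula here are assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF-IDF vector (term -> weight) for each document."""
    n = len(docs)
    df = Counter()                            # document frequency of each term
    for doc in docs:
        df.update(set(doc.split()))
    vecs = []
    for doc in docs:
        tf = Counter(doc.split())             # raw term frequency
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["car engine repair", "automobile engine", "jaguar animal habitat"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # > 0: the documents share "engine"
print(cosine(vecs[0], vecs[2]))   # 0.0: no shared terms at all
```

Note how a purely term-based similarity gives exactly zero when no terms are shared – the synonymy problem the next slides address.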
Latent Semantic Indexing
- Problems with traditional approaches: synonymy – "automobile" and "car"; polysemy – "jaguar" means both a car and an animal
- LSI – linear algebra for capturing the "latent semantics" of documents
- A method of dimensionality reduction
LSI
- Compares document vectors in the latent semantic space
- Two documents can have a high similarity value even if they share no terms
- Attempts to remove minor differences in terminology during indexing
- Truncated SVD is used to construct the latent semantic space
Singular Value Decomposition
- Given a t × d term-document matrix A, SVD factors it into the product of three matrices T (t × r), S (r × r), and D (d × r) such that A = T S D^T
- T and D have orthonormal columns, S is diagonal, and r is the rank of A
- The reduced space corresponds to the axes of greatest variation
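These properties can be checked numerically with NumPy's `linalg.svd`; the term-document matrix below is a made-up toy example:

```python
import numpy as np

# Made-up t x d term-document matrix (5 terms, 4 documents).
A = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# Thin SVD: A = T S D^T with column-orthonormal T, D and diagonal S.
T, s, Dt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

assert np.allclose(A, T @ S @ Dt)                   # exact reconstruction
assert np.allclose(T.T @ T, np.eye(T.shape[1]))     # T has orthonormal columns
assert np.allclose(Dt @ Dt.T, np.eye(Dt.shape[0]))  # so does D
```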
What does LSI do?
- Uses the truncated SVD: instead of the full r-dimensional space, it keeps only a factor k < r
- Ā = T_k S_k D_k^T, where T_k is t × k, S_k is k × k, and D_k is d × k
- The truncated SVD captures the underlying structure in the association of terms and documents
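A sketch of the truncation step, on an assumed toy matrix and with an illustrative choice of k:

```python
import numpy as np

# Toy t x d term-document matrix (5 terms, 4 documents).
A = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # keep only the k largest factors
Tk, Sk, Dtk = T[:, :k], np.diag(s[:k]), Dt[:k, :]
A_bar = Tk @ Sk @ Dtk                          # rank-k approximation of A

assert A_bar.shape == A.shape
assert np.linalg.matrix_rank(A_bar) == k
```

Every document (and term) now lives in a k-dimensional space, regardless of the vocabulary size.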
Using the SVD model
- Comparison of terms – entries of the matrix T S^2 T^T
- Comparison of documents – entries of the matrix D S^2 D^T
- Comparison of a term and a document – entries of the matrix T S D^T
- Query in the SVD model – folded in as q' = q^T T S^-1
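A sketch of query folding with the formula above, again on an assumed toy matrix; the query vector and the choice of k are illustrative:

```python
import numpy as np

# Toy t x d term-document matrix (5 terms, 4 documents).
A = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
Tk, Sk = T[:, :k], np.diag(s[:k])
Dk = Dt[:k, :].T                     # document coordinates, one row per document

# Fold the query into the latent space: q' = q^T T_k S_k^-1
q = np.array([1, 0, 0, 0, 0], dtype=float)   # query containing only the first term
q_latent = q @ Tk @ np.linalg.inv(Sk)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank documents by similarity to the folded-in query.
sims = [cos(q_latent, Dk[j]) for j in range(Dk.shape[0])]
```

Unlike the term-based model, a document can score well here even if it does not contain the literal query term.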
Example of LSI
Why does LSI work?
- Despite a lot of empirical evidence, there is no concrete proof of why LSI works
- No major degradation – the Eckart–Young theorem states that the rank-k truncation Ā is the rank-k matrix closest to A (the distance between the two matrices is minimal)
- This still does not explain the improvements in recall and precision
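The Eckart–Young property can be verified numerically; the matrices below are arbitrary random examples, not data from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))                     # arbitrary matrix
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

err_svd = np.linalg.norm(A - A_k)          # Frobenius distance to the truncation
# It equals the root of the sum of the discarded squared singular values ...
assert np.isclose(err_svd, np.sqrt((s[k:] ** 2).sum()))
# ... and no other rank-k matrix can get closer (compare an arbitrary one).
B = rng.random((8, k)) @ rng.random((k, 6))
assert err_svd <= np.linalg.norm(A - B) + 1e-9
```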
Why does LSI work? (contd.)
- Papadimitriou et al.: assume documents are generated from a set of topics with disjoint vocabularies; if the term-document matrix A is perturbed, they prove that LSI recovers the topic information and removes the noise
- Kontostathis et al.: essentially claim that LSI's ability to trace term co-occurrences is what helps improve recall
Advantages & Disadvantages
- Advantages: synonymy, term dependence
- Disadvantages: storage, efficiency
Semantic Resources
- Semantic Nets – e.g., "John gave Mary the book"
- Applied in UNL – e.g., "Only a few farmers could use information technology in the early 1990s"
Semantic Resources (contd.)
- Conceptual Graphs – e.g., "A bird is singing in a sycamore tree"
- Conceptual Dependency – e.g., "I gave the man a book"
- Lexical Resources – WordNet
Applications of Semantic Resources in IR
- UNL: used to improve document vectors
- Conceptual Graphs: graph matching of query and document
- Conceptual Dependencies (CDs): FERRET – comparison of CD patterns
- WordNet: query expansion using WordNet
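The query-expansion idea can be sketched in a few lines; the tiny hand-made thesaurus below is an assumption standing in for WordNet synsets:

```python
# Tiny hand-made thesaurus standing in for WordNet synsets (an assumption;
# a real system would look these up in WordNet).
SYNONYMS = {
    "car": {"automobile", "auto"},
    "automobile": {"car", "auto"},
}

def expand_query(terms):
    """Naive query expansion: add known synonyms of each query term."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

print(expand_query(["car", "engine"]))
# A query for "car engine" now also matches documents that say "automobile".
```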
Conclusion
- Various things need to be considered before applying these methods to the Web: storage, efficiency, and the knowledge content of the query
- Clearly, a semantic method is needed to handle synonymy and polysemy
- Currently, traditional models with minor hacks serve the purpose
- In conclusion: a statistical or conceptual model of document semantics (or a combination of both) is definitely required
References
[1] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pages 573–595, 1995.
[2] S. Chakrabarti. Mining the Web - Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, San Francisco, 2002.
[3] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the Society for Information Science 41 (6), pages 391–407, 1990.
[4] A. Kontostathis and W. M. Pottenger. A mathematical view of Latent Semantic Indexing: Tracing Term Co-occurrences. Technical report, Lehigh University, 2002.
[5] R. Mandala, T. Takenobu, and T. Hozumi. The use of WordNet in Information Retrieval. In COLING/ACL Workshop on the Usage of WordNet in Natural Language Processing Systems, pages 31–37, 1998.
References (contd.)
[6] M. L. Mauldin. Retrieval performance in FERRET: a conceptual information retrieval system. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347–355. ACM Press, 1991.
[7] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3 (4), pages 235–244, 1990.
[8] M. Montes-y-Gomez, A. Lopez, and A. F. Gelbukh. Information retrieval with Conceptual Graph matching. In Database and Expert Systems Applications, pages 312–321, 2000.
[9] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent Semantic Indexing: A probabilistic analysis. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), pages 159–168, 1998.
[10] E. Rich and K. Knight. Artificial Intelligence. Tata McGraw-Hill Publishers, New Delhi, 2002.
References (contd.)
[11] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[12] C. Shah, B. Chowdhary, and P. Bhattacharyya. Constructing better Document Vectors using Universal Networking Language (UNL). In Proceedings of the International Conference on Knowledge-Based Computer Systems (KBCS). NCST, Navi Mumbai, India, 2002.
[13] H. Uchida, M. Zhu, and S. T. Della. UNL: A gift for a millennium. Technical report, The United Nations University, 2000.
[14] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.