Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.

Date post: 04-Jan-2016
Collocations and Information Management Applications

Gregor Erbach

Saarland University

Saarbrücken

Outline

• Information Management Applications

• Information Retrieval Techniques

• Categorization, Clustering

• Summarization

• Information Extraction

• Question Answering

• Points for discussion

Well-known Applications of Collocations

• Lexicography

• Machine Translation

• NL Generation

• NL Parsing

• Terminology Extraction

• Foreign Language Teaching

• Speech Recognition

Information Management Applications

• Information Retrieval

• Text Categorisation (by language, topic, author, genre ...)

• Clustering

• Summarisation / Keyword Extraction

• Information Extraction

• Question Answering

Information Retrieval

• Most IR systems don't retrieve information, but documents

• Boolean retrieval: an unordered set of documents is returned as the result of a query

• Ranked retrieval: an ordered list of documents is returned; relevance of documents is determined by matching with a query
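
The difference between the two retrieval modes can be sketched in a few lines. This is only an illustration: the toy collection, the whitespace tokenizer and the scoring rule (summed query-term counts) are assumptions made for the example, not part of the original slides.

```python
from collections import Counter

# Toy collection; the texts are invented for illustration.
DOCS = {
    "d1": "information retrieval systems retrieve documents",
    "d2": "boolean retrieval returns a set of documents",
    "d3": "ranked retrieval of information orders documents by relevance to the information need",
}

def tokenize(text):
    """Lower-case and split on whitespace (no stemming)."""
    return text.lower().split()

def boolean_and(query, docs):
    """Boolean retrieval: the unordered set of documents containing every query term."""
    terms = set(tokenize(query))
    return {doc_id for doc_id, text in docs.items() if terms <= set(tokenize(text))}

def ranked(query, docs):
    """Ranked retrieval: document ids ordered by summed query-term counts."""
    terms = tokenize(query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

print(boolean_and("information retrieval", DOCS))
print(ranked("information retrieval", DOCS))
```

Note that the ranked list also contains d2, which the Boolean AND query excludes because it lacks one of the terms.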

IR System Model

[Figure: documents and the query are each mapped to a representation (RD and RQ); matching the two representations yields a score in [0, 1].]

Query Languages

• Co-occurrence within document: information AND retrieval

• Negation: information AND (NOT retrieval)

• Multi-word expression: "information retrieval"

• Proximity operators: information NEAR retrieval
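
A minimal sketch of how these four operators could be evaluated over a positional index. The function names, the toy documents and the default NEAR window of three positions are all hypothetical; real query languages define these details differently.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def docs_with(index, term):
    """Ids of all documents containing the term."""
    return set(index.get(term, {}))

def q_and(index, a, b):
    """a AND b: both terms occur somewhere in the document."""
    return docs_with(index, a) & docs_with(index, b)

def q_and_not(index, a, b):
    """a AND (NOT b): a occurs, b does not."""
    return docs_with(index, a) - docs_with(index, b)

def q_phrase(index, first, second):
    """"first second": second occurs immediately after first."""
    hits = set()
    for doc_id in q_and(index, first, second):
        second_positions = set(index[second][doc_id])
        if any(pos + 1 in second_positions for pos in index[first][doc_id]):
            hits.add(doc_id)
    return hits

def q_near(index, a, b, window=3):
    """a NEAR b: the terms occur within `window` positions, in either order."""
    hits = set()
    for doc_id in q_and(index, a, b):
        if any(abs(i - j) <= window
               for i in index[a][doc_id] for j in index[b][doc_id]):
            hits.add(doc_id)
    return hits

# Invented example documents.
DOCS = {
    "d1": "information retrieval is fun",
    "d2": "retrieval of relevant information",
    "d3": "information theory",
    "d4": "information about cats and dogs and many other things including retrieval",
}
INDEX = build_positional_index(DOCS)
print(q_phrase(INDEX, "information", "retrieval"))
print(q_near(INDEX, "information", "retrieval"))
```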

Evaluation

• Precision

• Recall

• Precision/recall graphs

• 11-point average precision

• TREC (Text Retrieval Conference)

• TREC Tasks: ad-hoc, web, spoken documents, multimedia, cross-language ...
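
The first four measures can be made concrete with a small sketch. It assumes the usual interpolated definition of 11-point average precision (at each recall level 0.0, 0.1, ..., 1.0, take the best precision reached at that recall or higher); the ranking and relevance judgements are invented.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_average_precision(ranking, relevant):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for level in (i / 10 for i in range(11)):
        # Interpolation: best precision at any recall >= this level.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11

ranking = ["d1", "d2", "d3", "d4"]   # system output, best first
relevant = {"d1", "d3"}              # assessor judgements
print(precision_recall(ranking[:2], relevant))
print(eleven_point_average_precision(ranking, relevant))
```

The list of `points` is also what a precision/recall graph would plot.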

Relevance

• Relevance is matching of a document with an information need expressed through a query

• Relevance is considered as binary and determined by human assessors for document-query pairs

• Relevance is modelled by a similarity measure that compares query and document representations

Document and Query Representations in IR

• Documents and queries are generally represented as vectors of term weights

• Documents are treated as bags of words

• Preprocessing: stemming or morphological analysis

• POS, chunking, syntax did not improve information retrieval performance

Similarity Measures

• Term weighting: TF, TF/ICF, TF/IDF

• Similarity measures determine how close two documents are, or how alike a document and a query are

• A common similarity measure is the cosine of the angle between the vector representations
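
As an illustration of term weighting, the sketch below computes one common TF/IDF variant, weight = tf * log(N / df). The exact formula is an assumption; many variants exist and the slides do not commit to one.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()  # number of documents each term occurs in
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # raw term frequency within this document
        vectors[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return vectors

# Invented two-document collection.
vectors = tf_idf_vectors({"a": "strong tea strong", "b": "weak tea"})
print(vectors["a"])
```

A term that occurs in every document ("tea" here) gets weight zero, since it does not help to tell documents apart.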

Cosine Similarity

[Figure: vectors plotted in a two-dimensional term space with axes term1 and term2.]
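
A minimal sketch of the cosine measure over sparse term-weight vectors. The example vectors are invented and are not the ones drawn on the slide.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

doc = {"term1": 4, "term2": 2}
query = {"term1": 8, "term2": 4}     # same direction as doc, twice the length
other = {"term3": 1}                 # shares no term with doc
print(cosine(doc, query), cosine(doc, other))
```

Because only the angle matters, a long document and a short query pointing in the same direction are treated as maximally similar.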

Result Ranking

• Adjacency or proximity of search terms can be taken into account in ranking of retrieval results

• This accounts for phrases and collocations

• Search terms occurring near each other (e.g. within a paragraph) are more likely to be related than search terms occurring in different parts of a document

Latent Semantic Indexing

• LSI: Singular value decomposition, dimensionality reduction

• LSI associates terms that share the same contexts, i.e. terms that can be substituted for one another

• Applications: information retrieval, cross-language IR, language learning, text categorisation, vocabulary tests
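
In the usual formulation (the notation is assumed here, it does not appear on the slide), the term-document matrix $A$ is approximated by its rank-$k$ truncated singular value decomposition, and a query vector $q$ is folded into the same reduced space before it is compared with the documents:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\mathrm{sim}(q, d_j) = \cos\bigl(\hat{q}, \hat{d}_j\bigr)
```

Here $\hat{d}_j$ is the $j$-th row of $V_k$, and $k$ is chosen much smaller than the vocabulary size, which is where the dimensionality reduction comes from.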

Query Expansion

• Query expansion with related terms (e.g. from WordNet or a thesaurus)

• Relevance Feedback: Query expansion with terms from relevant documents

• Blind Relevance Feedback: Query expansion with terms from top-ranking documents. Expansion with co-occurring terms improves precision/recall.
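
Blind relevance feedback can be sketched as follows. The selection rule (most frequent new terms in the top-ranked documents) and all parameter names are assumptions for illustration; practical systems usually weight candidate terms more carefully.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=2, stopwords=frozenset()):
    """Add the most frequent new terms from the top_k ranked documents to the query."""
    counts = Counter()
    for text in ranked_docs[:top_k]:
        counts.update(t for t in text.lower().split()
                      if t not in query_terms and t not in stopwords)
    # Descending count, then alphabetical, so ties are resolved deterministically.
    new_terms = sorted(counts, key=lambda t: (-counts[t], t))[:n_new_terms]
    return list(query_terms) + new_terms

# Invented ranking for the query "collocation", best document first.
ranked_docs = [
    "collocation extraction from corpus",
    "corpus based collocation association measures",
    "web hosting service",
]
print(expand_query(["collocation"], ranked_docs, top_k=2, n_new_terms=1))
```

The top-ranked documents are simply assumed to be relevant, which is why the method is called blind.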

Language Models for IR

• Language models generate queries from documents

• Estimate probability that a given query was generated by a particular document

• Uni-gram language models

• (special case of probabilistic IR)
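
A sketch of query-likelihood scoring with a unigram model. The smoothing scheme (linear interpolation with a collection model, weight `lam`) is an assumption added so that unseen query terms do not zero out the score; the slides do not name a smoothing method.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) under a unigram model interpolated with the collection model."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    log_p = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc) if doc else 0.0
        p_coll = coll_counts[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        log_p += math.log(p)
    return log_p

# Invented two-document collection, already tokenized.
d1 = "information retrieval".split()
d2 = "speech recognition".split()
collection = d1 + d2
query = ["retrieval"]
print(query_log_likelihood(query, d1, collection))
print(query_log_likelihood(query, d2, collection))
```

Documents are then ranked by this score; each query term is generated independently, which is the unigram assumption.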

Cross-language IR

• Methods: document translation, query translation, parallel/comparable corpora

Document Categorization

• Similar techniques to IR (document representation, similarity measures)

• Document base contains categorized documents

• New document as query which retrieves the best matching documents from database

• Support Vector Machines achieve very good performance on various text categorization tasks
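
The "new document as query" idea corresponds to a nearest-neighbour classifier, sketched below. This is not a Support Vector Machine; the bag-of-words counts, the cosine measure and the default of a single neighbour are assumptions for the example.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two Counters (missing terms count as 0)."""
    dot = sum(w * v[t] for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def categorize(new_text, labelled_docs, k=1):
    """Label new_text by majority vote over its k most similar labelled documents."""
    query = bow(new_text)
    scored = sorted(((cosine(query, bow(text)), label) for text, label in labelled_docs),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Invented document base of (text, category) pairs.
LABELLED = [
    ("stock market profit shares", "finance"),
    ("football match goal", "sport"),
    ("profit and revenue rise", "finance"),
]
print(categorize("the match ended with a late goal", LABELLED))
```

With larger `k` the vote can be dominated by the most frequent category in the document base, so `k` needs to be chosen with the category sizes in mind.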

Document Clustering

• Similar techniques to IR (document representation, similarity measures)

• Each cluster is represented by a centroid

• Iterative hierarchical grouping of similar documents
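
One way to read these bullets is as centroid-based agglomerative clustering: start with one cluster per document and repeatedly merge the two clusters whose centroids are most similar. The sketch below makes that reading concrete; the stopping criterion (a fixed number of clusters) and the toy vectors are assumptions.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def agglomerate(vectors, n_clusters):
    """Merge the two most similar clusters (by centroid) until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # each document starts alone
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroid([vectors[i] for i in clusters[a]]),
                             centroid([vectors[i] for i in clusters[b]]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Invented term-weight vectors for four documents over two terms.
vectors = [[4, 0], [5, 1], [0, 3], [1, 4]]
print(agglomerate(vectors, 2))
```

Recording the order of merges instead of stopping at a fixed count would give the full hierarchy.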

Summarization

• Two approaches: extraction (of sentences or keywords) and abstraction (summary generation)

• Indicative vs. informative summaries

• Query-independent vs. query-biased summaries

• Evaluation criteria: informativeness, coherence

Information Extraction

• Tasks: named entity extraction, coreference, template extraction

• Named entities: person, organisation, location, time, date, money, percentage

• Methods: finite-state grammars, finite-state transducers

• Evaluation: precision, recall, f-measure
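
A small sketch of the finite-state flavour of named entity extraction and of the evaluation measures. Regular expressions stand in for finite-state grammars here; the two patterns, the example sentence and the gold annotations are hypothetical, and the balanced F-measure (F1) is assumed.

```python
import re

# Hypothetical patterns for two entity types.
PATTERNS = {
    "percentage": re.compile(r"\b\d+(?:\.\d+)?\s?%"),
    "money": re.compile(r"\$\s?\d+(?:[.,]\d+)*(?:\s(?:million|billion))?"),
}

def extract_entities(text):
    """Return a set of (entity type, surface string) pairs found in text."""
    return {(label, match.group())
            for label, pattern in PATTERNS.items()
            for match in pattern.finditer(text)}

def f_measure(predicted, gold):
    """Precision, recall and balanced F-measure of predicted against gold entities."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

text = "Revenue rose 12.5% to $3.2 million in 2001."
gold = {("percentage", "12.5%"), ("money", "$3.2 million"), ("date", "2001")}
predicted = extract_entities(text)
print(predicted)
print(f_measure(predicted, gold))
```

The date is missed because no pattern covers it, which lowers recall but not precision.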

Question Answering

• Answer extraction (passage retrieval) vs. Information extraction + answer generation

• Combination of IR-based and NLP-based approaches (semantic concepts, dependency relations).

• TREC open domain QA evaluation: extract 50-word passage containing the answer to a factual question

Co-location spaces

• Linear (speech, text)

• document (as bag of words)

• hierarchical structure (tree, dependency relations)

• semantic/conceptual space (e.g. WordNet)

• cyberspace (hyperlinks)

Collocations and IM

Collocations are

• multi-word units

• with statistical associations

• with restricted semantic compositionality

Are they useful for information management applications?
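
The "statistical association" property can be illustrated with pointwise mutual information (PMI) over adjacent word pairs. PMI is only one of several association measures, and the frequency threshold and toy text below are assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """PMI(x, y) = log2(P(x, y) / (P(x) * P(y))) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Invented toy text.
tokens = "strong tea and strong tea with the strong man and the tea".split()
print(pmi_scores(tokens))
```

A positive score means the pair occurs together more often than independence of the two words would predict.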

Collocations and Document Representations

• Common representations treat terms in the document and query as independent

• Collocations research shows that they are not independent

• Implications?

Collocations / Query Formulation

• Use of collocations for query expansion? (e.g. collocation, corpus, association ... vs. collocation, facility, service, server, hosting ...)

• Automatic or interactive?

Collocations / Categorisation & Clustering

• How much can category-specific collocations improve performance?

• Collocations for identification of genre, author, dialect?

Collocations / IE

• IE techniques (finite-state shallow parsing) for collocation identification

• Use of collocation in IE grammars (Gewinn machen, Umsatz erzielen ...)

Collocations / QA

• Use of collocations for finding answers (e.g. function-proper_name)

Collocations and Summarisation

• Keyword / key phrase extraction

• Evaluation of coherence: which association measures can be used?

Questions for Discussion

• Are collocations a useful level of representation for indexing and retrieval?

• Or are they only useful in establishing semantic representations?

