Text Data Mining: Information Retrieval

Transcript
Slide 1: Text Data Mining, Part I: Information Retrieval (IR)
Slide 2: Text Mining Applications
- Information retrieval: query-based search of large text archives, e.g., the Web
- Text classification: automated assignment of topics to Web pages (e.g., Yahoo, Google); automated classification of email into spam and non-spam
- Text clustering: automated organization of search results into categories in real time; discovery of clusters and trends in technical literature (e.g., CiteSeer)
- Information extraction: extracting standard fields from free text; extracting names and places from reports and newspapers
Slide 3: Information Retrieval - Definition
- Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
- IR deals with the representation, storage, and organization of, and access to, information items.
- General objective of modern IR: minimize the overhead for a user locating needed information.
Slide 4: Information Retrieval Is Not Database Search
Information retrieval:
- Processes stored documents
- Searches for documents relevant to user queries
- No standard for how queries should be expressed
- Query results may contain errors or inaccurate items
Database:
- Normally no processing of the data
- Searches for records matching queries
- Standard query language: SQL
- Query results must be 100% accurate: zero tolerance for errors
Slide 5: Information Retrieval Is Not Data Mining
- Information retrieval: the user's target is existing relevant data entries
- Data mining: the user's target is knowledge (rules, etc.) implied by the data, not the individual data entries themselves
- Many techniques and models are shared and related, e.g., classification of documents
Slide 6: Is Information Retrieval a Form of Text Mining?
- What is the principal computer specialty for processing documents and text? Information retrieval (IR).
- The task of IR is to retrieve relevant documents in response to a query.
- The fundamental technique of IR is measuring similarity: a query is examined and transformed into a vector of values to be compared with stored documents.
Slide 7: Is Information Retrieval a Form of Text Mining? (cont.)
- In the prediction problem, similar documents are retrieved and their properties measured, e.g., counting the class labels to see which label should be assigned to a new document.
- The objectives of prediction can be posed in the form of an IR model in which documents relevant to a query are retrieved, where the query is a new document.
Slide 8: Key Steps in Information Retrieval vs. Predictive Text Mining
Key steps in information retrieval:
- Specify a query
- Search the document collection
- Return a subset of relevant documents
Key steps in predictive text mining:
- Examine the document collection
- Learn classification criteria
- Apply the criteria to new documents
Slide 9: Key Steps in IR - Predicting from Retrieved Documents
- Specify a query
- Vector-match against the document collection
- Get a subset of relevant documents
- Examine document properties (simple criteria such as document labels)
- Predict from the retrieved documents
Slide 10: Information Retrieval (IR)
- Conceptually, IR is the study of finding needed information; i.e., IR helps users find information that matches their information needs, expressed as queries.
- Historically, IR is about document retrieval, emphasizing the document as the basic unit: finding documents relevant to user queries.
- Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
Slide 11: Information Retrieval Cycle
[Figure: the cycle runs source selection (resource) -> query formulation (query) -> search (results) -> selection (documents) -> examination (information) -> delivery, with feedback loops for source reselection and for system, vocabulary, concept, and document discovery.]
Slide 12: Abstract IR Architecture
[Figure: the query and the documents each pass through a representation function, yielding a query representation and document representations; a comparison function matches the query representation against the index to produce hits. Document indexing happens offline; query processing happens online.]
Slide 13: IR Architecture
[Figure only.]
Slide 14: IR Queries
- Keyword queries
- Boolean queries (using AND, OR, NOT)
- Phrase queries
- Proximity queries
- Full document queries
- Natural language questions
Slide 15: Information Retrieval Models
- An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
- Main models: Boolean model, vector space model, statistical language model, etc.
Slide 16: Elements in Information Retrieval
- Processing of documents
- Acceptance and processing of queries from users
- Modelling, searching, and ranking of documents
- Presenting the search results
Slide 17: Process of Retrieving Information
[Figure only.]
Slide 18: Document Processing
- Removing stopwords: words that appear frequently but carry little meaning, e.g., "the", "of"
- Stemming: recognizing different words with the same grammatical root
- Noun groups: common combinations of words
- Indexing: for fast location of documents
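A minimal Python sketch of the stopword-removal and stemming steps, using a toy stopword list and naive suffix stripping (a real system would use a full stopword list and a proper stemmer such as Porter's):

```python
# Toy document processing: lowercase, strip punctuation, drop
# stopwords, then crudely "stem" by stripping common suffixes.
STOPWORDS = {"the", "of", "a", "an", "and", "or", "in", "to", "is"}

def preprocess(text):
    tokens = [t.lower().strip(".,;:!?()") for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "es", "s"):  # naive stemming
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Indexing documents speeds the locating of documents."))
# -> ['index', 'document', 'speed', 'locat', 'document']
```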
Slide 19: Processing Queries
- Define a language for queries: syntax, operators, etc.
- Modify the queries for better search: ignore meaningless parts (punctuation, conjunctions, etc.); append synonyms, e.g., e-business -> e-commerce
- Emerging technology: natural language queries
Slide 20: Modelling/Ranking of Documents
- Model the relevance (usefulness) of documents against the user query Q
- The model represents a function Rel(Q, D), where D is a document, Q is a user query, and Rel(Q, D) is the relevance of document D to query Q
- Many models are available: algebraic models, probabilistic models, set-theoretic models
Slide 21: Basic Vector Space Model
- Define a set of words and phrases as terms
- Text is represented by a vector of terms; a user query is converted to a vector too
- Measure the vector distance between a document vector and the query vector
Example: with the term set (business, computer, PowerPoint, presentation, user, web), the document "We are doing an e-business presentation in PowerPoint." becomes (1, 0, 1, 1, 0, 0) and the query "computer presentation" becomes (0, 1, 0, 1, 0, 0), giving
  Distance = sqrt((1-0)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2) = sqrt(3)
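The slide's example as a short Python sketch (Euclidean distance between binary term vectors):

```python
import math

# Term set: (business, computer, powerpoint, presentation, user, web)
doc   = [1, 0, 1, 1, 0, 0]  # "We are doing an e-business presentation in PowerPoint."
query = [0, 1, 0, 1, 0, 0]  # "computer presentation"

distance = math.sqrt(sum((d - q) ** 2 for d, q in zip(doc, query)))
print(distance)  # sqrt(3) ~ 1.732
```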
Slide 22: Probabilistic Models Overview
- Ranking: the probability that a document is relevant to a query, often denoted Pr(R | D, Q)
- In the actual measure, the log-odds transformation is used: log[ Pr(R | D, Q) / Pr(not R | D, Q) ]
- Probability values are estimated in applications
Slide 23: Information Retrieval
- Given: a source of textual documents and a well-defined, limited query (text based)
- Find: sentences with relevant information; extract the relevant information and ignore non-relevant information (important!); link related information and output it in a predetermined format
- Examples: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc.
- Information items can be in the form of text, image, video, audio, numbers, etc.
Slide 24: Information Retrieval (cont.)
Two basic information retrieval (IR) processes:
- Browsing or navigation system: the user skims the document collection, jumping from one document to another via hypertext or hypermedia links, until a relevant document is found
- Classical IR system (question-answering system): the query is a question in natural language; the answer is extracted directly from the text of the document collection
Text-based information retrieval:
- Information item (document): in text format (written/spoken) or having a textual description
- Information need (query)
Slide 25: Classical IR System Process
[Figure only.]
Slide 26: General Concepts in IR
- Representation language: typically a vector of d attribute values, e.g., a set of color, intensity, and texture features characterizing images, or word counts for text documents
- Data set D of N objects: typically represented as an N x d matrix
- Query Q: the user poses a query to search D; the query is typically expressed in the same representation language as the data, e.g., each text document is a set of words that occur in the document
Slide 27: Query by Content
- Traditional DB query: exact matches, e.g., Q = [level = MANAGER] AND [age < 30], or a Boolean match on text, e.g., Query = Irvine AND fun returns all docs containing "Irvine" and "fun"; not useful when there are many matches, e.g., "data mining" in Google returns 60 million documents
- Query-by-content query: more general / less precise, e.g., which record is most similar to a query Q? For text data this is often called information retrieval (IR); it can also be used for images, sequences, video, etc.; Q can itself be an object (e.g., a document) or a shorter version (e.g., one word)
- Goal: match the query Q to the N objects in the database and return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
Slide 28: Issues in Query by Content
- What representation language to use
- How to measure similarity between Q and each object in D
- How to compute the results in real time (for interactive querying)
- How to rank the results for the user
- How to allow user feedback (query modification)
- How to evaluate and compare different IR algorithms/systems
Slide 29: The Standard Approach
- Fixed-length (d-dimensional) vector representation, for both query (1-by-d vector Q) and database (n-by-d matrix X) objects
- Use domain-specific higher-level features (vs. raw data): for images, a bag of features such as color (e.g., RGB) and texture (e.g., Gabor or Fourier coefficients); for text, a bag of words with a frequency count for each word in each document (also known as the vector space model)
- Compute distances between the vectorized representations
- Use k-NN to find the k vectors in X closest to Q
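A minimal sketch of this approach, assuming Euclidean distance and a tiny made-up document matrix:

```python
import math

def knn(Q, X, k=2):
    """Indices of the k rows of X closest to the query vector Q."""
    dists = sorted((math.dist(Q, row), i) for i, row in enumerate(X))
    return [i for _, i in dists[:k]]

X = [[1, 0, 1, 1],   # n-by-d document-term matrix (toy data)
     [0, 1, 0, 1],
     [1, 1, 1, 0]]
Q = [1, 0, 1, 0]     # 1-by-d query vector
print(knn(Q, X, k=2))  # -> [0, 2]
```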
Slide 30: Text Retrieval
- Document: book, paper, WWW page, ...
- Term: word, word pair, phrase (often 50,000+ terms)
- Query: Q = set of terms, e.g., "data" + "mining"
- Full NLP (natural language processing) is too hard, so we want a (vector) representation for text that retains maximum useful semantics and supports efficient distance computation between documents and Q
- Term weights: Boolean (e.g., term in document or not: bag of words); real-valued (e.g., frequency of a term in a document, possibly relative to all documents); ...
Slide 31: Practical Issues
- Tokenization: convert a document to word counts; a word token is any nonempty sequence of characters; for HTML (etc.) the formatting must be removed
- Canonical forms: remove capitalization
- Stopwords: remove very frequent words (a, the, and); a standard list can be used; very rare words can also be removed
- Stemming (next slide)
- Data representation: e.g., a 3-column list of sorted (term, document, count) pairs; an inverted index (faster) is useful for finding the documents containing certain terms and is equivalent to a sparse representation of the term x document matrix
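A minimal inverted-index sketch: map each term to the set of documents containing it, then intersect posting sets to answer a conjunctive keyword query (toy corpus):

```python
from collections import defaultdict

docs = {
    0: "data mining finds patterns in data",
    1: "text mining mines text documents",
    2: "databases store data",
}

index = defaultdict(set)          # term -> ids of docs containing it
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def and_query(*terms):
    """Ids of documents containing ALL the given terms."""
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(and_query("data", "mining"))  # -> [0]
```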
Slide 32: Intelligent Information Retrieval
- Meaning of words: synonyms (buy / purchase); ambiguity (bat: baseball vs. mammal)
- Order of words in the query: "hot dog stand in the amusement park" vs. "hot amusement stand in the dog park"
Slide 33: Keyword Search
- The technical goal of prediction is to classify new, unseen documents
- Prediction and IR are unified by the computation of similarity between documents
- IR is based on traditional keyword search through a search engine
- So we should recognize that using a search engine is a special instance of the prediction concept
Slide 34: Keyword Search (cont.)
- We enter keywords into a search engine and expect relevant documents to be returned
- These keywords are words in a dictionary created from the document collection, and can themselves be viewed as a small document
- So we want to measure how similar the new document (the query) is to the documents in the collection
Slide 35: Keyword Search (cont.)
- The notion of similarity is thus reduced to finding documents with the same keywords as those posed to the search engine
- But the objective of the search engine is to rank the documents, not to assign a label
- So we need additional techniques to break the expected ties (all retrieved documents match the search criteria)
Slide 36: Keyword Search (cont.)
- In full-text retrieval, all the words in each document are considered to be keywords; we use the word "term" to refer to the words in a document
- Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not; ands are implicit, even if not explicitly specified
- Ranking of documents on the basis of estimated relevance to a query is critical; relevance ranking is based on factors such as:
  - Term frequency: frequency of occurrence of the query keyword in the document
  - Inverse document frequency: how many documents the query keyword occurs in (fewer documents give more importance to the keyword)
  - Hyperlinks to the documents
Slide 37: Relevance Ranking Using Terms
- TF-IDF (term frequency / inverse document frequency) ranking. Let n(d) = number of terms in document d, and n(d, t) = number of occurrences of term t in document d.
- Relevance of a document d to a term t:
  TF(d, t) = log(1 + n(d, t) / n(d))
  The log factor is there to avoid giving excessive weight to frequent terms.
- Relevance of a document d to a query Q:
  r(d, Q) = sum over t in Q of TF(d, t) / n(t)
  where n(t) is the number of documents containing term t.
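A direct Python sketch of the two formulas above over a toy corpus, with n(t) computed here as the number of documents containing t:

```python
import math

docs = [
    "data mining finds patterns in data".split(),
    "text mining mines text".split(),
    "databases store data".split(),
]

def tf(d, t):                 # TF(d, t) = log(1 + n(d,t)/n(d))
    return math.log(1 + d.count(t) / len(d))

def n(t):                     # number of documents containing t
    return sum(1 for d in docs if t in d)

def r(d, query):              # r(d, Q) = sum of TF(d,t)/n(t) over t in Q
    return sum(tf(d, t) / n(t) for t in query if n(t) > 0)

for i, d in enumerate(docs):
    print(i, round(r(d, ["data", "mining"]), 3))
# Document 0 scores highest: it contains both query terms.
```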
Slide 38: Relevance Ranking Using Terms (cont.)
Most systems add refinements to the above model:
- Words that occur in the title, author list, section headings, etc. are given greater importance
- Words whose first occurrence is late in the document are given lower importance
- Very common words such as "a", "an", "the", "it", etc. are eliminated; these are called stop words
- Proximity: if the query keywords occur close together in the document, the document ranks higher than if they occur far apart
- Documents are returned in decreasing order of relevance score; usually only the top few documents are returned, not all
Slide 39: Similarity-Based Retrieval
- Similarity-based retrieval: retrieve documents similar to a given document
- Similarity may be defined on the basis of common words, e.g., find the k terms in A with the highest TF(d, t) / n(t) and use these terms to find the relevance of other documents
- Relevance feedback: similarity can be used to refine the answer set of a keyword query; the user selects a few relevant documents from those retrieved, and the system finds other documents similar to these
- Vector space model: define an n-dimensional space, where n is the number of words in the document set; the vector for document d goes from the origin to the point whose i-th coordinate is TF(d, t) / n(t); the cosine of the angle between the vectors of two documents is used as a measure of their similarity
Slide 40: Relevance Using Hyperlinks
- The number of documents relevant to a query can be enormous if only term frequencies are taken into account
- Using term frequencies also makes spamming easy, e.g., a travel agency can add many occurrences of the word "travel" to its page to make its rank very high
- Most of the time people are looking for pages from popular sites
- Idea: use the popularity of a Web site (e.g., how many people visit it) to rank the site's pages that match the given keywords
Slide 41: Relevance Using Hyperlinks (cont.)
- Solution: use the number of hyperlinks to a site as a measure of its popularity or prestige; count only one hyperlink from each site; the popularity measure is for the site, not for individual pages
- But most hyperlinks point to the root of a site; also, the concept of a site is difficult to define, since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity
- Refinement: when computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige; the definition is circular, so set up and solve a system of simultaneous linear equations
Slide 42: Relevance Using Hyperlinks (cont.)
- There are connections to social-networking theories that rank the prestige of people: e.g., the president of the U.S.A. has high prestige since many people know him, and someone known by multiple prestigious people has high prestige
- Hub- and authority-based ranking: a hub is a page that stores links to many pages (on a topic); an authority is a page that contains actual information on a topic
- Each page gets a hub prestige based on the prestige of the authorities it points to; each page gets an authority prestige based on the prestige of the hubs that point to it
- Again, the prestige definitions are cyclic, and can be obtained by setting up and solving a system of simultaneous linear equations
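A minimal sketch of the cyclic hub/authority definitions, resolved by iterating to a fixed point; the tiny link graph below is made up for illustration:

```python
links = {           # page -> pages it links to (hypothetical graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["B", "C"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):
    # Authority prestige: sum of hub prestige of pages linking to it.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub prestige: sum of authority prestige of pages it links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize each round so the scores stay bounded.
    sa, sh = sum(auth.values()), sum(hub.values())
    auth = {p: v / sa for p, v in auth.items()}
    hub = {p: v / sh for p, v in hub.items()}

print({p: round(v, 2) for p, v in auth.items()})  # "C" is the top authority
```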
Slide 43: Nearest-Neighbor Methods
- A method that compares vectors and measures similarity
- In prediction: the nearest-neighbor methods collect the k most similar documents and then look at their labels
- In IR: the nearest-neighbor methods determine whether a satisfactory response to the search query has been found
Slide 44: Measuring Similarity
- These measures examine how similar documents are; the output is a numerical measure of similarity
- Three increasingly complex measures: shared word count; word count and bonus; cosine similarity
Slide 45: Shared Word Count
- Counts the words shared between documents
- The words: in IR we have a global dictionary in which all potential words are included, with the exception of stopwords; in prediction it is better to preselect the dictionary relative to the label
Slide 46: Computing Similarity by Shared Words
- Look at all the words in the new document; for each document in the collection, count how many of these words appear in it
- No weighting is used, just a simple count; the dictionary contains true keywords (weak words removed)
- The results of this measure are clearly intuitive: no one will question why a document was retrieved
Slide 47: Computing Similarity by Shared Words (cont.)
- Each document is represented as a vector of keywords (zeros and ones)
- The similarity of two documents is the product of the two vectors: if both documents contain the same keyword, that word is counted (1 * 1)
- The performance of this measure depends mainly on the dictionary used
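With binary keyword vectors, the shared word count is just the dot product of the two vectors, as in this short sketch:

```python
def shared_words(a, b):
    """Number of keywords occurring in both binary vectors."""
    return sum(x * y for x, y in zip(a, b))

doc   = [1, 0, 1, 1, 0]
query = [1, 0, 0, 1, 1]
print(shared_words(doc, query))  # -> 2 shared keywords
```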
Slide 48: Computing Similarity by Shared Words (cont.)
- Shared word count is an exact search: a document is either retrieved or it is not
- No weighting can be done on the terms: in the query "A and B", you can't specify that A is more important than B
- Every retrieved document is treated equally
Slide 49: Word Count and Bonus
- TF (term frequency): number of times a term occurs in a document
- DF (document frequency): number of documents that contain the term
- IDF (inverse document frequency) = log(N / df), where N is the total number of documents
- A vector is a numerical representation of a point in a multi-dimensional space: (x1, x2, ..., xn); the dimensions of the space need to be defined, and a measure on the space needs to be defined
Slide 50: Word Count and Bonus (cont.)
- Each indexing term is a dimension; each document is a vector: Di = (ti1, ti2, ti3, ti4, ..., tik)
- Document similarity is defined as
  Sim = sum from j = 1 to K of w(j)
  where w(j) = 1 + 1/df(j) if word j occurs in both documents, and 0 otherwise; K is the number of words
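The same definition as a short Python sketch: each shared word contributes 1 + 1/df(j), so rarer words earn a larger bonus (the vectors and document frequencies below are toy values):

```python
def bonus_similarity(doc, query, df):
    """doc, query: binary vectors; df[j]: document frequency of word j."""
    return sum(
        1 + 1 / df[j]
        for j in range(len(doc))
        if doc[j] and query[j] and df[j] > 0
    )

df    = [4, 3, 2, 2, 3]   # document frequencies of the five words
doc   = [1, 0, 1, 0, 1]
query = [1, 0, 1, 1, 0]
print(bonus_similarity(doc, query, df))  # (1 + 1/4) + (1 + 1/2) = 2.75
```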
Slide 51: Word Count and Bonus (cont.)
- The bonus 1/df(j) is a variant of IDF; thus, if a word occurs in many documents, its bonus is small
- This measure is better than the shared word count because it discriminates between weakly and strongly predictive words
Slide 52: Word Count and Bonus - Example
[Figure: computing similarity scores with bonus. A document space is defined by five terms: hardware, software, user, information, index; the query is "hardware, user, information". Each row of a labeled spreadsheet holds a document's binary term vector, and a similarity score with bonus is computed between each document and the new-document (query) vector.]
Slide 53: Cosine Similarity - The Vector Space
A document is represented as a vector (W1, W2, ..., Wn), where the weights can be defined in several ways:
- Binary: Wi = 1 if the corresponding term is in the document, Wi = 0 if it is not
- TF (term frequency): Wi = tfi, where tfi is the number of times term i occurred in the document
- TF*IDF (with inverse document frequency): Wi = tfi * idfi = tfi * (1 + log(N / dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection
Slide 54: Cosine Similarity - The Vector Space (cont.)
- vec(D) = (w1, w2, ..., wt)
- Sim(d1, d2) = cos(theta) = (vec(d1) . vec(d2)) / (|vec(d1)| * |vec(d2)|) = (sum over j of w_d1(j) * w_d2(j)) / (|d1| * |d2|)
- Since w(j) > 0 whenever term j occurs in di, we have 0 <= Sim(d1, d2) <= 1
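A short Python sketch of the cosine measure over (for example) term-frequency weight vectors:

```python
import math

def cosine(d1, d2):
    """Cosine of the angle between two non-negative weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0

d1 = [2, 0, 1, 3]   # e.g., term-frequency weights
d2 = [1, 1, 0, 2]
print(round(cosine(d1, d2), 3))  # -> 0.873
```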
