Text-Mining: analysis of text data

Dunja Mladenić

J.Stefan Institute, Ljubljana, Slovenia and Carnegie Mellon University, USA

http://www-ai.ijs.si/DunjaMladenic/

http://www.cs.cmu.edu/~dunja/

Web user profiling

• imagine the user browsing the Web, most of the time by clicking hyperlinks

• goal: provide help by highlighting the hyperlinks the user is likely to click (we assume that the user clicks on interesting hyperlinks)

– induce a profile for each user separately

– the profile can be used to predict clicking on hyperlinks (in our case), to collect interesting Web-pages, to compare different users and share knowledge between them (collaborative agents)

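Structure of the personal browsing assistant - Personal WebWatcher

[Diagram: the User, with a User profile, sends a URL to the Personal WebWatcher proxy (adviser); the proxy requests the URL from The Web, receives the page, and returns a modified page to the User.]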

Personal WebWatcher in action (1996)

Highlight interesting hyperlinks

Data Pyramid

• Data

• Information: Data plus context

• Knowledge: Information plus rules

• Wisdom: Knowledge plus experience

What is Data Mining?

• Data mining (knowledge discovery in databases - KDD, business intelligence):

– finding interesting (non-trivial, hidden, previously unknown and potentially useful) regularities in large datasets

• “Say something interesting about the data.”

• “Describe this data.”

Data Mining: Potential usage

• Market analysis

• Risk analysis

• Fraud detection

• Text Mining

• Web Mining

• ...

Why text analysis?

• The amount of text data on electronic media is growing daily

– e-mail, business documents, the Web, organized databases of documents,...

• There is a lot of information contained in the text

• Available methods and approaches make it possible to solve interesting and non-trivial problems

Problem description (I)

• Text information filtering

• Help with browsing the Web

• Generation and analysis of user profiles

• Automatic document categorization and keyword assignment to documents

• Document clustering

• Document visualization

• Document authorship detection

• Document copying identification

• Language identification in text

Document categorization

[Diagram: labeled documents are used to build a Document Classifier, which assigns a document category (label) to an unlabeled document.]

Yahoo! page for one category

Automatic document categorization

• Problem: we are given a set of content categories filled with documents.

• The goal is to automatically insert a new document (assign one or more relevant categories to a new document).

• Content categories can be structured (e.g., Yahoo, Medline) or unstructured (e.g., Reuters)

• The problem is similar to assigning keywords to documents

Document to categorize: CFP for CoNLL-2000

Some predicted categories

Our approach to document categorization

• Data is obtained from an existing collection of manually categorized documents, where the content categories are structured

• Using Text Mining methods, we constructed a model that captures the manual work of editors

• The model is used to automatically assign content categories and the corresponding keywords to new, previously unseen documents
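The slides do not name the learning algorithm behind the model. As a minimal sketch of what learning from manually categorized documents can look like, the block below uses a multinomial naive Bayes classifier; the choice of naive Bayes, the add-one smoothing, the function names, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (text, category) pairs taken from an existing collection."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    for text, category in labeled_docs:
        word_counts[category].update(text.lower().split())
        doc_counts[category] += 1
    return word_counts, doc_counts


def classify(text, model):
    """Return the category with the highest naive Bayes log-score."""
    word_counts, doc_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # category prior
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # Add-one smoothing (an assumption) avoids log(0).
                score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[category] = score
    return max(scores, key=scores.get)


# Toy stand-in for a manually categorized collection (hypothetical data).
training = [
    ("neural networks and machine learning", "Computers"),
    ("learning algorithms for text", "Computers"),
    ("football league results", "Sports"),
    ("tennis match results", "Sports"),
]
model = train(training)
predicted = classify("machine learning for text", model)
```

Returning the few best-scoring categories instead of only the top one would correspond to assigning several relevant categories to a document.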

System architecture

[Diagram: labeled documents (from the Yahoo! hierarchy, collected from the Web) pass through feature construction (vectors of n-grams), subproblem definition, feature selection, and classifier construction, producing the Document Classifier, which maps an unlabeled document to a document category (label).]

Summary of experiments and results

• learning from categorization hierarchy: considering only promising categories during the classification (5%-15% of categories)

• extended document representation: new features for sequences of two words

• feature subset selection: Odds ratio using 50-100 best features (0.2%-5%)

• More can be found at our project page www.cs.cmu.edu/~TextLearning/pww/yplanet.html
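A minimal sketch of two of the ideas above: features for sequences of two words, and Odds ratio scoring to keep only the best features. The score uses the common log form log[P(w|pos)(1-P(w|neg)) / ((1-P(w|pos))P(w|neg))]; the smoothing, the function names, and the toy documents are assumptions, not the project's actual code.

```python
import math
from collections import Counter


def features(text):
    """Return the set of single words and two-word sequences in a text."""
    words = text.lower().split()
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return set(words) | set(pairs)


def odds_ratio_scores(pos_docs, neg_docs):
    """Score each feature by its log odds ratio between positive and negative documents."""
    pos_df = Counter(f for d in pos_docs for f in features(d))
    neg_df = Counter(f for d in neg_docs for f in features(d))
    scores = {}
    for f in set(pos_df) | set(neg_df):
        # Add-one smoothing (an assumption) keeps probabilities away from 0 and 1.
        p_pos = (pos_df[f] + 1) / (len(pos_docs) + 2)
        p_neg = (neg_df[f] + 1) / (len(neg_docs) + 2)
        scores[f] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_features(pos_docs, neg_docs, k):
    """Keep the k features with the highest Odds ratio score."""
    scores = odds_ratio_scores(pos_docs, neg_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical documents inside (pos) and outside (neg) one category.
pos = ["machine learning for text", "text learning methods", "machine learning course"]
neg = ["football match results", "cooking course for beginners", "football for beginners"]
scores = odds_ratio_scores(pos, neg)
top = select_features(pos, neg, 5)
```

Features typical of the category get a positive score, features typical of other categories a negative one, and features equally common in both score close to zero.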

Document authorship detection

• Problem: based on a database of documents and authors, assign the most probable author to a new document

• Solution is based on the fact that each author uses a characteristic frequency distribution over words and phrases
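A minimal sketch of the idea, assuming each author is represented by a relative word-frequency profile and a new document goes to the author with the most similar profile. Cosine similarity stands in here for a real probability estimate, and the function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def profile(texts):
    """Relative word-frequency distribution over a list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def most_probable_author(document, docs_by_author):
    """Pick the author whose word-frequency profile is closest to the document's."""
    doc_profile = profile([document])
    author_profiles = {a: profile(ts) for a, ts in docs_by_author.items()}
    return max(author_profiles, key=lambda a: cosine(doc_profile, author_profiles[a]))


# Hypothetical database of documents and authors.
corpus = {
    "author_a": ["indeed the results are indeed quite clear", "indeed we see quite clear trends"],
    "author_b": ["basically the numbers basically speak", "basically it all works out fine"],
}
guess = most_probable_author("the trends are indeed quite clear", corpus)
```

Adding phrase (word sequence) counts to the profile would follow the slide more closely.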

Document copying identification

• Problem: predict probability that a given document was copied (partially or completely) from some other document(s) from our database

• Algorithm uses complex indexing methods on (different length) parts of documents and compares them against the given document
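The slides do not describe the indexing methods themselves. As a simple stand-in, the sketch below indexes overlapping k-word parts ("shingles") of every database document and reports what fraction of a new document's shingles occur in the index. That fraction is only a crude proxy for the probability of copying, and the names, the value k=3, and the example texts are assumptions.

```python
from collections import defaultdict


def shingles(text, k):
    """Set of all overlapping k-word sequences in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_index(database, k=3):
    """Map each k-word shingle to the ids of database documents containing it."""
    index = defaultdict(set)
    for doc_id, text in database.items():
        for s in shingles(text, k):
            index[s].add(doc_id)
    return index


def copied_fraction(document, index, k=3):
    """Fraction of the document's shingles found in the database, plus the matching sources."""
    doc_shingles = shingles(document, k)
    if not doc_shingles:
        return 0.0, set()
    hits = [s for s in doc_shingles if s in index]
    sources = set().union(*(index[s] for s in hits)) if hits else set()
    return len(hits) / len(doc_shingles), sources


database = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "text mining finds regularities in large text collections",
}
index = build_index(database)
score, sources = copied_fraction("yesterday the quick brown fox jumps over a fence", index)
```

Building several indexes with different values of k would cover the "different length" parts mentioned above.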

Natural language identification

• Text data analysis systems commonly use some natural language dependent methods

• This creates a need to identify the natural language a document is written in

• Problem: for a given text, identify the natural language it is written in, selecting among the predefined languages

Algorithm for natural language identification

• Basic algorithms are simple: for each language, build a characteristic frequency table of pairs and triples of letters, which can then be used to identify a document's language (TextCat, a publicly available system, covers 60 languages)

• The problem is with short documents: in this case we can use mechanisms for detecting language-dependent stop-words (stop-words are frequent in all languages)
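A minimal sketch of the basic algorithm: one frequency table of letter pairs and triples per language, and a new text assigned to the language with the most similar table. This does not reproduce TextCat; comparing tables by cosine similarity is an assumption chosen for brevity, and the one-sentence training samples are far too small for real use.

```python
import math
from collections import Counter


def ngram_table(text, sizes=(2, 3)):
    """Frequency table of letter pairs and triples in a text."""
    text = " ".join(text.lower().split())  # normalize whitespace
    return Counter(text[i:i + n] for n in sizes for i in range(len(text) - n + 1))


def similarity(a, b):
    """Cosine similarity between two n-gram frequency tables."""
    dot = sum(c * b[g] for g, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def identify_language(text, tables):
    """Return the predefined language whose table is closest to the text's table."""
    table = ngram_table(text)
    return max(tables, key=lambda lang: similarity(table, tables[lang]))


# Hypothetical training samples, one per predefined language.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
tables = {lang: ngram_table(text) for lang, text in samples.items()}
detected = identify_language("the dog sleeps in the house", tables)
```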

Problem description (II)

• Topic identification and tracking in time series of documents

• Document indexing based on content and not only keywords

• Content segmentation of text

• Document summarization

• Link analysis

• Information extraction

Topic identification and tracking in time series of documents

• Problem: we are given a time-sequence of documents (news); based on this document sequence we want to:

– identify the document that introduces a new topic

– from the sequence of new documents, identify documents about existing topics and connect them into a topic sequence
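A minimal sketch of both tasks, assuming documents are bags of words processed in time order: a document whose best cosine similarity to every existing topic falls below a threshold introduces a new topic, otherwise it is appended to the closest topic sequence. The threshold value, the centroid representation, and the example news items are assumptions.

```python
import math
from collections import Counter


def vec(text):
    """Bag-of-words vector of a document."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(c * b[w] for w, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def track_topics(documents, threshold=0.3):
    """Assign each document, in time order, to an existing topic or start a new one."""
    topics = []  # each topic: {"centroid": summed word counts, "docs": document indices}
    for i, text in enumerate(documents):
        v = vec(text)
        best = max(topics, key=lambda t: cosine(v, t["centroid"]), default=None)
        if best is None or cosine(v, best["centroid"]) < threshold:
            # No sufficiently similar topic: this document introduces a new topic.
            topics.append({"centroid": v, "docs": [i]})
        else:
            # Document continues an existing topic; extend its sequence.
            best["centroid"] = best["centroid"] + v
            best["docs"].append(i)
    return [t["docs"] for t in topics]


news = [
    "earthquake hits coastal city",
    "election results announced in capital",
    "rescue teams reach earthquake city",
    "capital celebrates election results",
]
sequences = track_topics(news)
```

Each returned list is one topic sequence; its first index is the document that introduced the topic.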

Text segmentation based on content

• Problem: divide text that has no given structure (table of contents, paragraphs, etc.) into segments with similar content

• Example applications:

– topic tracking in news (spoken news)

– identification of topics in large, unstructured text databases

Algorithm for text segmentation

• Algorithm:

– Divide text into sentences

– Represent each sentence with the words and phrases it contains

– Calculate similarity between the pairs of sentences

– Find a segmentation (sequence of delimiters), so that the similarity between the sentences inside the same segment is maximized and the similarity between the segments is minimized
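The four steps above can be sketched as follows for short texts. The sketch assumes words only (no phrases), cosine similarity, a known number of segments, and an exhaustive search over delimiter positions scored as mean within-segment similarity minus mean between-segment similarity; all of these choices are assumptions, and the search is only practical for a handful of sentences.

```python
import itertools
import math
import re
from collections import Counter


def cosine(a, b):
    dot = sum(c * b[w] for w, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def mean(values):
    return sum(values) / len(values) if values else 0.0


def segment(text, n_segments=2):
    """Split text into n_segments runs of sentences (needs at least n_segments sentences)."""
    # Step 1: divide text into sentences.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    # Step 2: represent each sentence by the words it contains.
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    # Step 3: similarity between all pairs of sentences.
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

    def score(cuts):
        # label[i] is the segment number of sentence i for this set of delimiters.
        label = [sum(i >= c for c in cuts) for i in range(n)]
        within, between = [], []
        for i, j in itertools.combinations(range(n), 2):
            (within if label[i] == label[j] else between).append(sim[i][j])
        return mean(within) - mean(between)

    # Step 4: try every placement of delimiters and keep the best-scoring one.
    best = max(itertools.combinations(range(1, n), n_segments - 1), key=score)
    bounds = [0, *best, n]
    return [sentences[a:b] for a, b in zip(bounds, bounds[1:])]


text = ("The bank raised interest rates. Higher interest rates worry the bank customers. "
        "The football team won the match. Fans celebrated the football match.")
segments = segment(text)
```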

Text Summarization

• Task: Given a text document, create a summary reflecting the document’s contents

• Three main phases:

– Analyzing the source text

– Determining its important points

– Synthesizing an appropriate output

• Most methods adopt a linear weighting model, in which each text unit (sentence) is assessed by:

– Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)

• …output consists of the topmost text units (sentences)
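A minimal sketch of the linear weighting model. The slides name the four terms but not how each is computed, so every term below is a simple assumed heuristic: location favors the first sentence, the cue phrase list is hypothetical, statistics is the average document frequency of the sentence's words, and additional presence is the overlap with title words.

```python
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion", "importantly")  # hypothetical cue list


def summarize(text, title="", k=2):
    """Return the k highest-weighted sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    top_freq = max(freq.values(), default=1)
    title_words = set(re.findall(r"[a-z]+", title.lower()))

    def weight(i):
        words = tokenized[i]
        # LocationInText(U): assumed to reward only the opening sentence.
        location = 1.0 if i == 0 else 0.0
        # CuePhrase(U): 1 if the sentence contains any listed cue phrase.
        cue = 1.0 if any(p in sentences[i].lower() for p in CUE_PHRASES) else 0.0
        # Statistics(U): average word frequency, scaled to the range 0..1.
        statistics = sum(freq[w] for w in words) / (top_freq * len(words)) if words else 0.0
        # AdditionalPresence(U): share of title words that occur in the sentence.
        presence = len(title_words & set(words)) / len(title_words) if title_words else 0.0
        return location + cue + statistics + presence

    ranked = sorted(range(len(sentences)), key=weight, reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


text = ("Text mining analyzes large text collections. The weather was nice that day. "
        "Many methods exist. In summary, text mining finds useful regularities in text.")
summary = summarize(text, title="Text mining", k=2)
```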

Information extraction

• Collect a set of Home pages from the Web and build a “soft” database of people (name, address, coworkers, research areas and publications, biography...)

• Collect electronic seminar announcements and extract location (room number), start and end time, name of the speaker

Where are we now?

• Growing interest and need for handling large collections of text

• The area has been present in Slovenia for over 5 years, with strong international connections

– joint R&D projects with Microsoft Research and with European and American research institutions, cooperation with Boeing

• Organization of international events focused on Text Mining (ICML-99, KDD-2000, ICDM-2001)

Instead of conclusions...

• Text Mining enables solving some problems that are often not expected to be addressed by computers:

– document authorship detection, identification of related content or finding “interesting” people, document segmentation and organization, automatic collection of officer names for the selected sector companies, finding experts in some area, who is involved with whom (discovering social networks), ...

To find more information check:

• <http://www-personal.umich.edu/~wfan/text_mining.html>

• <http://ai.about.com/library/weekly/aa102899.htm>

• <http://extractor.iit.nrc.ca/bibliographies/ml-applied-to-ir.html>

• <http://www.content-analysis.de/>

• get research papers at <http://www.researchindex.com>

• KDD-2000 Text Mining Workshop <http://www.cs.cmu.edu/~dunja/WshKDD2000.html>

• ECAI-2000 ML for Information Extraction <http://www.dcs.shef.ac.uk/~fabio/ecai-workshop.html>

• PRICAI-2000 Text and Web Mining Workshop <http://textmining.krdl.org.sg/cfp.html>

• IJCAI-2001 Adaptive Text Extraction and Mining Workshop <http://www.smi.ucd.ie/ATEM2001/>, Text Learning: Beyond Supervision <http://www.cs.cmu.edu/~mccallum/textbeyond/>

• ICDM-2001 Text Mining Workshop <http://www-ai.ijs.si/DunjaMladenic/TextDM01/>

• ECML/PKDD-2001 Text Mining tutorial <http://www-ai.ijs.si/DunjaMladenic/TextDM01/Tutorial.ps>

Link Analysis

• Mechanisms for detecting which vertices in the graph (pages on the web) are more important on the basis of link structure:

– HITS algorithm (Hubs & Authorities) (Kleinberg 1998)

– PageRank (Page 1999) weighting (used by Google to better rank good pages)
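A minimal sketch of the Hubs & Authorities iteration on a small directed graph: a vertex's authority score is the sum of the hub scores of vertices linking to it, its hub score is the sum of the authority scores of vertices it links to, and both are normalized after each round. The edge list, the fixed iteration count, and the function name are assumptions for illustration.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a list of (source, target) links."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the vertices that link to this vertex.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # Hub: sum of authority scores of the vertices this vertex links to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        # Normalize both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical link structure: c is linked from a, b and d; a also links to d.
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("d", "c")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```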

Link analysis on Amazon data

• We downloaded product pages from the Amazon.com web site:

– …products are connected by a cross-sell relation (“customers who bought this product also bought following products…”)

– 130,000 books and 32,000 music CDs connected into a graph

• Question: which products (books or CDs) are the most important?

• …we used the HITS algorithm to calculate the weights: Harry Potter and the Beatles came out on top.

Popular books

1. Harry Potter and the Goblet of Fire (Book 4): J K Rowling, Mary Grandpre

2. The Beatles Anthology: The Beatles, Paul McCartney, George Harrison, Ringo Starr, Lennon, John Lennon

3. Prodigal Summer: Barbara Kingsolver

4. Harry Potter and the Sorcerer's Stone (Book 1): J K Rowling

5. The Mark : The Beast Rules the World (Left Behind #8): Tim LaHaye, Jerry B Jenkins

6. Harry Potter and the Chamber of Secrets (Book 2): J K Rowling

7. Harry Potter and the Prisoner of Azkaban (Book 3): J K Rowling, Mary Grandpre

8. The Sibley Guide to Birds (Audubon Society Nature Guides Ser.): David Allen Sibley

• ...

Popular CDs

1. The Beatles

2. A Day Without Rain: Enya

3. Lovers Rock: Sade

4. All That You Can't Leave Behind: U2

5. Riding With The King: Eric Clapton, BB King

6. Black and Blue: Backstreet Boys

7. Sailing To Philadelphia: Mark Knopfler

8. You're The One: Paul Simon

9. Kid A: Radiohead

10. Music: Madonna

11. Red Dirt Girl: Emmylou Harris

12. Renee Fleming

• ...
