Page 1:

Text Mining
Text Classification
Text Clustering

2004. 11.

Page 2:

Contents

- Introduction
- Related Technologies
  - Feature selection
  - Text classification
  - Text clustering

Page 3:

Introduction (1/2)

Text classification (categorization)
- Sorting new items into existing structures
  - general topic hierarchies
  - email folders
  - general file system
- Information filtering/push
  - Mail filtering (spam vs. not)
  - Customized push service

Page 4:

Categorization

Document Categorization

Page 5:

Clustering (topic discovery)

Document Clustering

Page 6:

Introduction (2/2)

Difference with data mining
- Analyze both raw data and textual information at the same time
- Require complicated feature selection technologies
- May include linguistic, lexical, and contextual techniques

Page 7:

Text classification example: e-mail

[Figure labels: sample documents; 1. learning; Classifier; unknown documents; 2. classification; categories A. fun, B. business, C. private]

Page 8:

Process

- Construction of vocabulary (optional)
- Extraction
  - Keep incoming documents in the system
- Parsing
  - Stemming
  - Vector model, bag-of-words
  - Feature selection (reduction)
- Learning
  - Off-line process: Build model parameters
- Categorization
  - On-line process
- Re-learning
  - On-line process

Page 9:

Extraction

Databases
- Documents
  - Incoming/Training/Categorized documents
- Dictionary
  - Stopwords
  - Korean particles (josa), endings (eomi)
  - …

Page 10:

Stemming

- Table Lookup
  - Record every stem related to the search terms in a table
- N-gram stemmer
- Affix removal
  - Remove prefixes, suffixes, endings, etc. to extract the root
  - Porter algorithm
    - 'ies' → 'y'
    - 'es' → 's'
    - 's' → NULL
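The three suffix rules on this slide can be tried out with a minimal sketch. It is not the full Porter algorithm, which has many more rules and conditions; the function name `strip_plural` and the choice to apply only the first matching rule (longest suffix first) are assumptions made for illustration.

```python
# Suffix-stripping sketch using only the three rules listed on the slide.
# Rules are tried in order, longest suffix first; the first match wins.
SUFFIX_RULES = [("ies", "y"), ("es", "s"), ("s", "")]


def strip_plural(word: str) -> str:
    """Apply the first matching suffix rule; leave the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least one character in front of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["queries", "cats", "boxes", "text"]:
        print(w, "->", strip_plural(w))
```

Taken literally, the 'es' → 's' rule turns "boxes" into "boxs", which suggests why a practical stemmer needs extra conditions or further passes beyond these three rules.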

Page 11:

Extraction

- Index terms
  - Mainly nouns (noun phrases)
  - Otherwise adjectives (adjective phrases) and verbs (verb phrases)
- Indexing methods
  - Statistical techniques
    - Use word occurrence statistics (term frequency)
  - Linguistic techniques
    - Morphological analysis, syntactic analysis

Page 12:

Extraction

- Characteristics of Korean documents
  - Word spacing is flexible
    - Compound noun decomposition problem
    - 대학생선교회 → 대학 + 생선 + 교회 (university + fish + church) or 대학생 + 선교회 (university student + mission society)
  - Conjugation and contraction of predicates
    - Syllable analysis is needed
  - Stopword handling
  - Spelling handling
- Korean patterns suitable as index terms
  - Noun: ex) 정보 (information)
  - Noun + noun: ex) 정보검색 (information retrieval)
  - Noun + particle + noun: ex) 정보의 검색 (retrieval of information)
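The compound noun decomposition problem can be illustrated with a small dictionary-based sketch that enumerates every way to split an unspaced string into known words. The function `segmentations` and the toy dictionary `VOCAB` are hypothetical and cover only the slide's example; a real system would use a morphological analyzer and a way to rank the candidates.

```python
def segmentations(text, vocabulary):
    """Return every way to split `text` into consecutive vocabulary words."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            # Extend each segmentation of the remainder with this head word.
            for rest in segmentations(text[end:], vocabulary):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary covering only the slide's example.
VOCAB = {"대학", "생선", "교회", "대학생", "선교회"}

if __name__ == "__main__":
    for parts in segmentations("대학생선교회", VOCAB):
        print(" + ".join(parts))
```

With this dictionary both readings from the slide are produced, so the ambiguity has to be resolved by something other than dictionary lookup alone.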

Page 13:

Feature Selection (reduction): Curse of dimensionality

- Removal of stopwords
- Feature Selection
  - Zipf's Law
  - DF (document frequency)-based
  - χ² Statistics-based
  - Mutual Information
  - …
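As a rough illustration of DF-based selection, the sketch below keeps only terms whose document frequency falls inside a band. The function name and the thresholds `min_df` and `max_df_ratio` are assumptions chosen for the example, not values from the slides.

```python
from collections import Counter


def select_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that occur in at least min_df documents and in at most
    max_df_ratio of all documents. `docs` is a list of token lists."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t for t, n in df.items() if n >= min_df and n / n_docs <= max_df_ratio}


if __name__ == "__main__":
    docs = [
        ["the", "text", "mining"],
        ["the", "text", "clustering"],
        ["the", "spam", "filter"],
    ]
    print(select_by_document_frequency(docs))
```

Very rare terms are dropped by the lower bound and near-universal terms (which behave like stopwords) by the upper bound, which shrinks the vocabulary before learning.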

Page 14:

Feature Selection (reduction): Curse of dimensionality

Stopwords

Page 15:

Feature Selection (reduction): Curse of dimensionality

Zipf's Law

Page 16:

Feature Selection (reduction): Curse of dimensionality

χ² statistics-based

$$
\chi^2(c, t) = \frac{n\,(ps - qr)^2}{(p + r)(p + q)(q + s)(r + s)}
$$

Contingency table of term t against category C, with n = p + q + r + s:

        C     ¬C
  t     p     r
  ¬t    q     s
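A minimal sketch of the χ² score for one term and one category, computed from the four contingency counts p, q, r, s used on this slide (p and s on the diagonal, q and r off it, n their sum). The function name and the zero-denominator convention are assumptions for illustration.

```python
def chi_square(p, q, r, s):
    """Chi-square statistic of a 2x2 term/category contingency table.

    p: documents in the category that contain the term
    r: documents outside the category that contain the term
    q: documents in the category without the term
    s: documents outside the category without the term
    """
    n = p + q + r + s
    denom = (p + r) * (p + q) * (q + s) * (r + s)
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return n * (p * s - q * r) ** 2 / denom


if __name__ == "__main__":
    print(chi_square(10, 0, 0, 10))  # term perfectly aligned with the category
    print(chi_square(5, 5, 5, 5))    # term independent of the category
```

Terms can then be ranked by this score per category and only the highest-scoring ones kept as features.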

Page 17:

Parsing: representing documents

Vector Representation
- term frequency
- document frequency
- weights
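One common way to combine term frequency and document frequency into weights is tf-idf. The slide does not name a weighting scheme, so the choice of tf × log(N / df) here is an assumption, as is the function name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into tf-idf vectors over a shared, sorted vocabulary."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: one count per document
    vocab = sorted(df)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = tfidf_vectors([["a", "b", "a"], ["b", "c"]])
    print(vocab)
    print(vectors)
```

A term that appears in every document gets weight zero under this scheme, so it contributes nothing to similarity between documents.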

Page 18:

Classification Model: machine learning approach

[Figure labels: Observed training documents; Learner; Model (hypothesis) parameters; Unknown documents; Classifier; Categorized documents]

Page 19:

Classification Model: machine learning approach

- Naïve Bayesian Classification
- Nearest Neighbor Classification

$$
\mathrm{classifier}(q) = \arg\max_{c} \Pr(c \mid q) = \arg\max_{c} \prod_{w \in d_q} \Pr(w \mid c)\,\Pr(c)
$$
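Nearest neighbor classification can be sketched as a majority vote among the k training documents most similar to the query. Cosine similarity over raw term counts, the value of k, and the function names are assumptions for this example; the training data in the test is made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(query_tokens, training, k=3):
    """Label a query by majority vote of its k most similar training documents.

    training: list of (tokens, label) pairs.
    """
    q = Counter(query_tokens)
    scored = sorted(
        ((cosine(q, Counter(tokens)), label) for tokens, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Unlike the Naïve Bayes model, this approach has no separate learning step: the training documents themselves are kept and compared against each query.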

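Page 20:

Classification example

Input document (Korean news text, in translation): The researchers claimed that this is the first time in the world that so-called therapeutic cloning, in which an embryo is created from the patient's own genes and then used to culture healthy cells in the laboratory for injection back into the patient, has been proven by experiment; the method has drawn the attention of the medical community because the body shows no rejection response to the injected cells.

Extracted index terms: patient, self, patient-self, gene, use, embryo, use, laboratory, health, cell, culture, patient, injection, therapeutic cloning, experiment, proof, this time, world, first, world-first, researchers, claim, method, injection, cell, human body, rejection, response, medical community, interest

Category scores:
- Veterinary medicine: 0.191149
- Medicine, biotechnology, pharmacy: 0.134847
- Dentistry: 0.114641
- Biology, microbiology: 0.109833
- Sex: 0.099062
- Disease, symptoms, death: 0.084554
- . . .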

Page 21:

Learning the text classifier

- Before system starts
  - Define category (class, topic)
  - Learning representative documents for each defined category
- During system operation
  - Incremental learning for each classifier
  - Define new categories by clustering uncategorized documents

Page 22:

Machine Learning based approach

(Basic architecture)

Page 23:

Machine Learning Methods

- Similarity-based
  - K-Nearest Neighbor
- Decision Trees
- Statistical Learning
  - Naïve Bayes
  - Bayes Nets
- Support Vector Machines
- Artificial Neural Networks
- . . .
- Others
  - Hierarchical classification
  - Expectation-Maximization technique
  - Variants of Boosting
  - Active learning

Page 24:

Naïve Bayes Text Classifier

Classification model of NB classifiers

$$
\begin{aligned}
f_{NB}(d_i) &= \arg\max_{c_j \in C} \Pr(c_j \mid d_i) \\
&= \arg\max_{c_j \in C} \frac{\Pr(c_j)\,\Pr(d_i \mid c_j)}{\Pr(d_i)} \\
&= \arg\max_{c_j \in C} \Pr(c_j)\,\Pr(d_i \mid c_j) \\
&= \arg\max_{c_j \in C} \Pr(c_j) \prod_{k=1}^{|d_i|} \Pr(w_{ik} \mid c_j)
\end{aligned}
$$

Class prior estimate

$$
\hat{\Pr}(c_j) = \frac{1}{|D^{tl}|} \sum_{i=1}^{|D^{tl}|} \Pr(c_j \mid d_i), \qquad \Pr(c_j \mid d_i) \in \{0, 1\}
$$

Word probability estimate

$$
\hat{\Pr}(w_{ik} \mid c_j) = \frac{1 + tf_{c_j}(w_{ik})}{|V| + \sum_{w \in V} tf_{c_j}(w)}
$$
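The Naïve Bayes model on this slide can be sketched in a few lines: the class prior is the fraction of training documents in each class (hard labels, so Pr(c|d) is 0 or 1), and the word probability uses add-one smoothing over the vocabulary. Working in log space and skipping query words outside the training vocabulary are assumptions of this sketch, as are the function names.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns (log_prior, log_word, vocab)."""
    class_counts = Counter(label for _, label in labeled_docs)
    tf = defaultdict(Counter)  # per-class term frequencies
    for tokens, label in labeled_docs:
        tf[label].update(tokens)
    vocab = {t for tokens, _ in labeled_docs for t in tokens}
    n_docs = len(labeled_docs)
    # Class prior: share of training documents carrying each label.
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_word = {}
    for c in class_counts:
        total = sum(tf[c].values())
        # Add-one smoothed estimate: (1 + tf_c(w)) / (|V| + sum of tf_c over V).
        log_word[c] = {
            w: math.log((1 + tf[c][w]) / (len(vocab) + total)) for w in vocab
        }
    return log_prior, log_word, vocab


def classify_nb(tokens, model):
    """Return the class maximizing log Pr(c) + sum of log Pr(w|c)."""
    log_prior, log_word, vocab = model

    def score(c):
        return log_prior[c] + sum(log_word[c][w] for w in tokens if w in vocab)

    return max(log_prior, key=score)
```

Summing logarithms instead of multiplying probabilities gives the same arg max and avoids numeric underflow on long documents.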

Page 25:

Uses of Clustering in IR

- Clustering as Representation (abstraction)
  - Clustering is unsupervised learning of the underlying structure, classes
  - Clustering can be used to transform representations
    - documents are represented by class membership as well as individual terms
  - Can be viewed as dimensionality reduction
    - especially term clustering (e.g., word variant clusters)
- Clustering for Browsing
  - Clustering has been proposed as a technique for organizing documents for browsing, interaction and visualization
    - constructing hypertext
    - clustering the results of searches
    - iterative clustering of the collection (e.g., Scatter/Gather)
    - clustering the web
  - Also has been used to group terms for browsing
    - automatic thesauri
    - topic summaries
- Clustering for topic discovery

Page 26:

Introduction

Text clustering
- Summarization of large text data
- Discovering new categories

Page 27:

Abstraction of a set of documents

Documents within a cluster are “relevant”

Page 28:

Information Retrieval (browsing)

Clustering of Query Results (Scatter/Gather)

[Figure: Scatter & Gather]

Page 29:

Clustering for Topic discovery (Evolution of topic hierarchy)

[Figure: two topic hierarchies, A and B, each under “Movie & Film”. Node labels: “Plays”, “Film Festivals”, “Screen Plays”, “Movie”, “Genres”, “Film Festival”, “Horror”, “Science Fiction”. Annotations: Reorganization, New topic discovery, Concept drift, Change of viewpoint.]

Page 30:

Clustering Algorithm

Two general methodologies
- Hierarchical
  - pairs of items or clusters are successively linked to produce larger clusters (agglomerative)
  - or start with the whole set as a cluster and successively divide sets into smaller partitions (divisive)
- Non-hierarchical: divide a set of N items into M clusters (top-down)
- Graph partitioning
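The agglomerative variant of hierarchical clustering can be sketched as repeatedly merging the two closest clusters until the desired number remains. Single linkage (distance between the closest pair of members), the stopping rule, and the function name are assumptions of this sketch; the slides do not specify a linkage criterion.

```python
def single_link_clusters(points, n_clusters, distance):
    """Agglomerative clustering with single linkage.

    Starts with one cluster per item and merges the closest pair of
    clusters until only n_clusters remain.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    pts = [1.0, 1.2, 5.0, 5.1, 9.0]
    print(single_link_clusters(pts, 3, lambda a, b: abs(a - b)))
```

For documents, `distance` could be one minus the cosine similarity of their term vectors; recording the merge order instead of stopping early would yield the full hierarchy.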

Page 31:

Supervised Clustering (for Topic Discovery)

[Figure labels: Document Collection; Clustering; Clusters A’ B’ C’ D’ E’; Human Knowledge; Topics (categories)]

