Comparing TR-Classifier and kNN by using Reduced Sizes of Vocabularies

Mourad Abbas

Citala 2009

Topic Identification: Definition

Topic identification: what does it mean?

It aims to assign a topic label to a stream of textual data.


T.I. applications

- Document categorization,
- Machine Translation,
- Selecting documents for web search engines,
- Speech recognition systems, etc.

Our aim in studying topic identification is to enhance the performance of speech recognition systems. How can we do that? That is what we explain in the next slides.


Speech Recognition

According to Bayes' formula, P(W|X) is defined as below:

P(W|X) = P(X|W) . P(W) / P(X)

- P(X|W): the probability of observing the sequence of acoustic vectors X when a sequence of words W is emitted; it is given by an acoustic model.

- P(W): the probability of the sequence of words W in the used language; it is given by a language model, decomposed as

P(W) = Π i=1..n P(wi | w1, …, wi-1)


[Diagram: description of the recognition process. Speech is parametrized into a sequence of vectors X = {x1, …, xT}; the search then computes arg max over W = {w1, …, wT} of P(X|W).P(W), combining the acoustic model, which supplies P(X|W), and the language model, which supplies P(W), to output the sequence of recognized words.]
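As a toy illustration of this search step (the candidate hypotheses and scores below are hypothetical, not taken from the paper), the decoder simply returns the word sequence W maximizing P(X|W).P(W):

```python
# Hypothetical candidate transcriptions with stand-in model scores;
# a real recognizer obtains these from the acoustic and language models.
candidates = {
    "recognize speech": {"acoustic": 0.020, "lm": 0.00030},
    "wreck a nice beach": {"acoustic": 0.025, "lm": 0.00001},
}

def decode(candidates):
    """Pick the word sequence W maximizing P(X|W) . P(W)."""
    return max(candidates, key=lambda w: candidates[w]["acoustic"] * candidates[w]["lm"])

print(decode(candidates))  # -> "recognize speech"
```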


Speech Recognition

Statistical language models are essential for speech recognition with large vocabularies. They allow estimating the a priori probability P(W) of emitting a sequence of words W from a training corpus.

Nevertheless, in many cases the language model is not able to find the correct choice.

That is why language model adaptation is needed.


Language model adaptation

One of the language model adaptation methods consists in dividing the training documents into classes.

Each class represents a subset of the language which groups the documents that share the same characteristics.

In our case these subsets are known as topics.

[Diagram: a corpus divided into topic classes, e.g. Culture, Politics, Religion.]


This allows constructing from these topics a language model which is able to describe the characteristics of each topic.

The aim is then to:
- find out the topic of the recognized uttered sentences;
- use the model derived from the detected topic.
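As a minimal sketch of this idea (the topic models, their scores, and the identify_topic argument are placeholders, not the paper's implementation), adaptation amounts to detecting the topic and switching to its language model:

```python
# Stand-in topic-specific language models: each maps a hypothesis W to a
# probability P(W) estimated on that topic's training documents.
topic_lms = {
    "Culture": lambda words: 0.002,
    "Sports": lambda words: 0.0005,
}

def adapted_lm_probability(hypothesis, recognized_so_far, topic_lms, identify_topic):
    """Detect the topic of the sentences recognized so far, then take
    P(W) for the new hypothesis from that topic's language model."""
    topic = identify_topic(recognized_so_far)  # e.g. TR-Classifier or kNN
    return topic_lms[topic](hypothesis)

# Toy usage with a classifier stub that always answers "Sports".
print(adapted_lm_probability("the team won the match",
                             ["previously recognized sentences"],
                             topic_lms,
                             lambda sentences: "Sports"))
```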


Building the vocabulary

Starting from the training corpus, the vocabulary is built. Using the vocabulary, a document is represented; if a word of the vocabulary does not exist in the document, the attributed value is zero.

To construct the vocabulary, several methods could be used:
- Term Frequency;
- Document Frequency;
- Mutual Information;
- Transition Point technique.

We have used Term Frequency because it is simple and leads to good results.

Words whose frequency does not exceed 3 are discarded.

The non-content words are discarded too: they do not bring any information with regard to the sense of the text.

The vocabulary should be representative of the corpus.

Example (Arabic): a 9-word sentence reduces to the 5 content words اجتماعات تمنع عقد مجلس وطني after discarding non-content words.
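A minimal sketch of this vocabulary-building step, assuming a tokenized corpus and a stopword list (both placeholders here): keep only content words whose frequency exceeds 3.

```python
from collections import Counter

def build_vocabulary(docs, stopwords, min_count=4):
    """docs: list of token lists. Discard words whose frequency does not
    exceed 3 (i.e. keep count >= 4) and non-content (stop) words."""
    counts = Counter(w for doc in docs for w in doc)
    return {w for w, c in counts.items() if c >= min_count and w not in stopwords}

# Placeholder corpus and stopword list, for illustration only.
docs = [["the", "national", "council", "meeting"], ["the", "council", "met"]]
stopwords = {"the", "a", "of", "and"}
vocabulary = build_vocabulary(docs, stopwords)
```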


One Arabic word can be equivalent to 4 English words, as in the following example:

Arabic    English
و         and
ب         by
علاقات    relations
ها        her


Fig 3. Illustrative example: the bag-of-words method.


Role of the vocabulary in representation

Each document d = {w1, w2, …, wn} is represented by a vector V = {f1, f2, …, fn} with fn = TF(wn, d) . IDF(wn).

[Diagram: a document vector of size |V| (the size of the vocabulary), one component per vocabulary word (Word 1, Word 2, …, Word n), each holding a real value such as 0, 0, 4, 3, 1.]

We put 0 in the case where the word cannot be found in the document.
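A minimal sketch of this representation; the slide does not give the exact IDF formula, so the common log(N/df) convention is assumed here:

```python
import math
from collections import Counter

def tfidf_vector(doc, vocabulary, all_docs):
    """doc and all_docs entries are token lists. Component f_n is
    TF(w_n, d) . IDF(w_n); an absent vocabulary word gets the value 0."""
    tf = Counter(doc)
    n_docs = len(all_docs)
    vector = []
    for w in vocabulary:  # vocabulary: ordered list of the |V| words
        if tf[w] == 0:
            vector.append(0.0)  # word of the vocabulary not in the document
        else:
            df = sum(1 for d in all_docs if w in d)  # >= 1 for corpus words
            vector.append(tf[w] * math.log(n_docs / df))
    return vector
```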


kNN

To identify a topic-unknown document d, kNN ranks the neighbours of d among the training document vectors, and uses the topics of the k nearest neighbours to predict the topic of the test document d.
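A minimal sketch of this kNN step; the similarity measure is not specified on the slide, so cosine similarity between document vectors is assumed:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_topic(test_vector, training, k=5):
    """training: list of (vector, topic) pairs. Rank the neighbours of the
    test document and vote among the topics of the k nearest ones."""
    nearest = sorted(training, key=lambda vt: cosine(test_vector, vt[0]),
                     reverse=True)[:k]
    return Counter(topic for _, topic in nearest).most_common(1)[0][0]
```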


TR-Classifier

Triggers of a word wk are the set of words that have a high degree of correlation with it.

The main idea of the TR-Classifier is based on computing the average mutual information of each couple of words which belong to the vocabulary Vi.

Couples of words, or "triggers", that are considered important for a topic identification task are those which have the highest average mutual information (AMI) values.

Each topic is then endowed with a number M of selected triggers, calculated using the training corpora of topic Ti.


TR-Classifier

The AMI of two words a and b is given by:

AMI(a,b) = P(a,b).log[P(a,b) / (P(a).P(b))]
         + P(a,¬b).log[P(a,¬b) / (P(a).P(¬b))]
         + P(¬a,b).log[P(¬a,b) / (P(¬a).P(b))]
         + P(¬a,¬b).log[P(¬a,¬b) / (P(¬a).P(¬b))]

where ¬a (resp. ¬b) denotes the absence of the word a (resp. b).

AMI measures the association between words, using the following values:
- the number of documents in which a and b are found together;
- the number of documents in which b is found without a;
- the number of documents that contain the word b;
- the number of documents in which neither a nor b is found;
- the number of documents that do not contain the word b.
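A minimal sketch of the AMI computation from these document counts; probabilities are estimated as relative document frequencies, which the slide implies but does not state:

```python
import math

def ami(n_ab, n_a_only, n_b_only, n_neither):
    """AMI(a, b) from document counts: both words together, a without b,
    b without a, and neither word present."""
    n = n_ab + n_a_only + n_b_only + n_neither
    n_a, n_b = n_ab + n_a_only, n_ab + n_b_only

    def term(n_xy, n_x, n_y):
        if n_xy == 0:
            return 0.0  # convention: 0 . log(0) = 0
        p_xy = n_xy / n
        return p_xy * math.log(p_xy / ((n_x / n) * (n_y / n)))

    return (term(n_ab, n_a, n_b)                 # P(a,b) term
            + term(n_a_only, n_a, n - n_b)       # P(a,¬b) term
            + term(n_b_only, n - n_a, n_b)       # P(¬a,b) term
            + term(n_neither, n - n_a, n - n_b)) # P(¬a,¬b) term
```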


TR-Classifier

Identifying topics by using the TR-method consists in:

• Giving the corresponding triggers for each word wk Є Vi, where Vi is the vocabulary of a topic Ti.

• Selecting the best M triggers which characterize the topic Ti.

• In the test step, extracting for each word wk from the test document its corresponding triggers.

• Computing the Qi values by using the TR-distance given by the equation:

Qi = [ Σ l=0..n-1 AMI(wk, wk,i) ] / n


where i stands for the i-th topic, and the denominator presents a normalization of the AMI computation.

The wk,i are triggers included in the test document d and characterizing the topic Ti.

A decision for labeling the test document with topic Ti is obtained by choosing arg max Qi.

The TR-Classifier uses topic vocabularies which are composed of words ranked according to their frequencies, from the maximum to the minimum.
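A minimal sketch of this decision rule under stated assumptions: triggers_i maps a word to its topic-Ti trigger words, ami_i stores the AMI values, and the normalization is taken as the number of matched triggers (the slide only says the denominator normalizes the sum):

```python
def tr_distance(test_words, triggers_i, ami_i):
    """Q_i: normalized sum of AMI(wk, wk,i) over triggers of topic T_i
    that are matched by the words of the test document."""
    total, matched = 0.0, 0
    for wk in test_words:
        for wki in triggers_i.get(wk, ()):
            total += ami_i[(wk, wki)]
            matched += 1
    return total / matched if matched else 0.0

def label_document(test_words, topics):
    """topics: {T_i: (triggers_i, ami_i)}. Label with arg max Q_i."""
    return max(topics, key=lambda t: tr_distance(test_words, *topics[t]))
```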


The ten best triggers which characterize the topic Culture

Arabic              English
ثقافة - ملتقى       Culture - Meeting
شاعر - قصيدة        Poet - Poem
رواية - قصة         Novel - Story
شخصية - مسلسل       Personage - Serial
جمهور - أفلام       Public - Movies
معرض - تشكيلي       Exposition - Plastic
فنان - مسلسل        Artist - Serial
تشكيلي - لوحة       Plastic - Painting
مسرح - فرقة         Theater - Group
سينما - أفلام       Cinema - Movies


Evaluation of the methods

For a topic Tn, the method is evaluated using the following measures:

Recall: R = number of documents correctly labelled Tn / total number of documents belonging to the topic Tn.

Precision: P = number of documents correctly labelled Tn / number of documents labelled Tn by the method.

The combination of R and P gives F1, which measures in a single value how well documents are correctly labelled:

F1 = 2RP / (R + P)
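A minimal sketch of these measures for one topic; the parallel-list encoding of true and predicted labels is an assumption for illustration:

```python
def evaluate_topic(true_labels, predicted_labels, topic):
    """Recall, precision and F1 for one topic T_n, given parallel lists
    of true and predicted topic labels."""
    correct = sum(t == p == topic for t, p in zip(true_labels, predicted_labels))
    belonging = sum(t == topic for t in true_labels)      # documents of T_n
    labelled = sum(p == topic for p in predicted_labels)  # documents labelled T_n
    recall = correct / belonging if belonging else 0.0
    precision = correct / labelled if labelled else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1
```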


Experiments and results


Corpus gathering

The WinHTTrack software allowed us to collect many web pages; we just have to fill in the address of the source.


Corpus source

The source of the used corpus is the Arabic newspaper Alwatan (Sultanate of Oman).


Size of the corpus

Topic          Number of words
Culture            1,359,210
Religion           3,122,565
International        855,945
Economy            1,460,462
Local              1,555,635
Sports             1,423,549
Total              9,813,366

Source: Alwatan newspaper.


TR-Classifier performance

Topic          Recall (%)   Precision (%)   F1 (%)
Culture        82.66        80.55           81.59
Religion       96.33        83.56           89.49
Economy        83.50        84.05           83.77
Local          86.25        82.53           84.35
International  93.33        90.66           91.97
Sports         96.00        97.33           96.66
Total          89.67        86.44           88.02


[Figure: recall versus the number of triggers, for a vocabulary of size 300. The maximal value R = 89.67% is reached with 250 triggers.]


kNN performance

Topic          Recall (%)   Precision (%)
Culture        76.00        49.78
Religion       75.33        94.95
Economy        68.66        81.74
Local          69.33        70.27
International  80.00        85.11
Sports         84.66        92.70
Average        75.66        70.09


TR versus kNN


Conclusion

The experiments were carried out using an Arabic corpus.

The strong point of the TR-Classifier is its ability to achieve better performance using reduced sizes of topic vocabularies, compared to kNN.

The reason behind that is the significance of the information present in the longer-distance history that the TR-Classifier uses.

Despite the small corpus used (800 words), the performance of kNN is relatively acceptable (~76% in terms of recall).

As future work, we aim to enhance the TR-Classifier performance by using larger vocabularies, even though it already outperforms kNN by 14%.

