+ All Categories
Home > Documents > Ljubljana, November 18JOTA - Faculty of Arts1 Topic Segmentation Gaël Dias and Elsa Alves Human...

Ljubljana, November 18JOTA - Faculty of Arts1 Topic Segmentation Gaël Dias and Elsa Alves Human...

Date post: 28-Dec-2015
Category:
Upload: lisa-newman
View: 213 times
Download: 0 times
Share this document with a friend
Popular Tags:
50
Ljubljana, November 18 JOTA - Faculty of Arts 1 Topic Segmentation Gaël Dias and Elsa Alves Human Language Technology Interest Group Department of Computer Science University of Beira Interior - Portugal http://hultig.di.ubi.pt
Transcript

Ljubljana, November 18 JOTA - Faculty of Arts 1

Topic Segmentation

Gaël Dias and Elsa Alves

Human Language Technology Interest GroupDepartment of Computer Science

University of Beira Interior - Portugal

http://hultig.di.ubi.pt

Ljubljana, November 18 JOTA - Faculty of Arts 2

Guidelines

Introduction TextTiling Lexical Cohesion Profile DotPlotting Link Set Median Problems Asas Conclusions Problems and Future Work

Ljubljana, November 18 JOTA - Faculty of Arts 3

Introduction

The concept of Topic Segmentation is the task

of breaking documents into

topically coherent multi-paragraph

subparts.

Topic Segmentation

Ljubljana, November 18 JOTA - Faculty of Arts 4

Introduction

Topic Segmentation is important for applications like:

Information Retrieval

Indeed, you should prefer a document in which the occurrences of a word are concentrated into one or two paragraphs since such a concentration is most likely to

contain a definition of what is the word.

Text Summarization

Topic Segmentation is one of the three phases of Automatic Abstracting: Topic Segmentation, Sentence Extraction and

Sentence Compression.

Ljubljana, November 18 JOTA - Faculty of Arts 5

TextTiling

TextTiling (Hearst and Plaunt 1993) is one of the most famous system for Topic Segmentation.

The basic idea of this algorithm is to search for parts of a text where the vocabulary shifts from one subtopic to another. These points are then interpreted as the boundaries of multi-paragraph units.

TextTiling is divided into 4 phases:(1) Segmentation of the Text

(2) Cohesion Scorer(3) Depth Scorer

(4) Boundary Selector

Ljubljana, November 18 JOTA - Faculty of Arts 6

TextTiling: Segmentation

Sentence length can vary considerably. Therefore the text is first divided into small

fixed units (20 words), the token sequences.

Indeed, you should prefer a document in which the occurrences of a word are concentrated into one or two paragraphs since such a concentration is most likely to contain a definition of what is the word. Topic Segmentation is one of the three phases of Automatic Abstracting: Topic Segmentation, Sentence Extraction and Sentence Compression. ….

gaps20 words

Ljubljana, November 18 JOTA - Faculty of Arts 7

TextTiling: Cohesion

The cohesion scorer measures the amount of “topic continuity” or cohesion at each gap, i.e. the amount of evidence that the same is prevalent on both sides of the gap. Intuitively, we want to consider gaps with low cohesion as possible

segmentation points.

In fact, we want to infer the breaking points from the distribution of the words in the text i.e. when there is a

change of topic, previous words seem to disappear from the rest of the text.

Ljubljana, November 18 JOTA - Faculty of Arts 8

TextTiling: Cohesion

Ljubljana, November 18 JOTA - Faculty of Arts 9

TextTiling: Cohesion

In order to calculate this cohesion, Textiling uses the cosine similarity measure where the weights

of the words are the frequency in the text.

p

k

jk

p

k

ik

p

k

jkik

jiij

XX

XX

XXS

11

1

22

,cos

Ljubljana, November 18 JOTA - Faculty of Arts 10

TextTiling: Cohesion

The similarity calculation can be illustrated as the following figure where the x-axis is the gap

number.

Ljubljana, November 18 JOTA - Faculty of Arts 11

TextTiling: Depth Scorer

The depth scorer assigns a depth score to each gap depending on how low its cohesion score is

compared to the surrounding gaps.

If cohesion at the gap is lower than at surrounding gaps, then the depth score is high.

Conversely, if cohesion is about the same at surrounding gaps, then the depth score is low.

However, this situation is not so linear. There may be slight shifts and big shifts. Only the latter

may be considered as breaking points.

Ljubljana, November 18 JOTA - Faculty of Arts 12

TextTiling: Depth Scorer

g1 g2 g3

s1

s2

s3

Indeed, you should prefer a document in which the occurrences of a word are concentrated into one or two paragraphs since such a concentration is most likely to contain a definition of what is the word. Topic Segmentation is one of the three phases of Automatic Abstracting: Topic Segmentation, Sentence Extraction and Sentence Compression. ….

s1s3

s2

Ljubljana, November 18 JOTA - Faculty of Arts 13

TextTiling: Depth Scorer

The gap depth score is computed by summing the heights of the two sides of the valley it is

located in.

For instance, for text 1, we would have: depth(g2)=(s1-s2)+(s3-s2).

In a text with rapid fluctuations of topic vocabulary, only the most radical changes will

be accorded the status of segment boundaries.

Ljubljana, November 18 JOTA - Faculty of Arts 14

TextTiling: Depth Scorer

g1 g2 g3 g1 g2 g3 g4 g5

s1

s2

s3 s1

s2

s3

s4

s5

text1 text2

Ljubljana, November 18 JOTA - Faculty of Arts 15

TextTiling: Depth Scorer

For a practical implementation, several enhancements of the basic algorithm are

needed. First, we need to smooth cohesion scores to address situations like in text 2.Intuitively, the difference (s1-s2) should

contribute to the depth score of g4.

This is achieved by smoothing scores using a low pass filter. For example, the cohesion score

si for gi is replaced by (si-1 + si + si+1)/3.

Ljubljana, November 18 JOTA - Faculty of Arts 16

TextTiling: Depth Scorer

g1 g2 g3

s1 s2 s3

s2=(s1+s2+s3)/3smoothing smoothing

b1 b2 b3 b4

Ljubljana, November 18 JOTA - Faculty of Arts 17

TextTiling: Boundary Selector

The boundary selector is the module that looks at the depth scores and selects the gaps that are

the best segmentation points.

The module estimates the average μ and the standard deviation σ of the depth scores and

selects all gaps as boundaries that have a depth score higher that μ-c.σ where c is some

constant.

Ljubljana, November 18 JOTA - Faculty of Arts 18

Lexical Cohesion Profile

The Lexical Cohesion Profile (LCP) has been proposed by (Kozima, 1993).

The basic idea is that the words in a segment are linked together via lexical cohesion relations. So, LCP records

mutual similarity of words in a sequence of text.

The similarity of words, which represents their cohesiveness, is computed using a predefined semantic network

automatically built from the LDOCE (English Dictionary).

Ljubljana, November 18 JOTA - Faculty of Arts 19

Lexical Cohesion Profile

The basic idea is that when a block shifting from left to right looses in lexical cohesion, then, it should evidence a topic change.

Ljubljana, November 18 JOTA - Faculty of Arts 20

Lexical Cohesion Profile

Some Results …

LCPHuman Reader

Ljubljana, November 18 JOTA - Faculty of Arts 21

Dotplotting

The Dotplotting methodology has been proposed for finding discourse boundaries by (Reynar 1994)

The idea is based on lexical item repetition.

If a particular word appears in position x and y in a text, then 4 points corresponding to the Cartesian product should be plotted on the dotplot: (x,x) (x,y) (y,x) (y,y).

Ljubljana, November 18 JOTA - Faculty of Arts 22

Dotplotting

x y

x

y

Text Segment = Lexical Cohesion based on repetition

Ljubljana, November 18 JOTA - Faculty of Arts 23

Dotplotting In order to find the |P| desired boundaries, we repeat the minimization of the following measure:

where Va,b represents a vector containing the word counts between a through b.

This technique has the advantage to compare all possible segments with all the other ones and not just the surrounding ones.

Ljubljana, November 18 JOTA - Faculty of Arts 24

Dotplotting

Ljubljana, November 18 JOTA - Faculty of Arts 25

Link Set Median Procedure

The idea of the Link Set Median (LSM), proposed by (Sardinha1999) is to look at the similarity between all the sentences with which each adjacent

sentence shares lexical items.

The set of sentences with which each sentence has links can be seen to form a link set.

Ljubljana, November 18 JOTA - Faculty of Arts 26

Link Set Median Procedure

Ljubljana, November 18 JOTA - Faculty of Arts 27

Link Set Median ProcedureALGORITHM

1. Identify the links for all sentences in the text2. Create the link sets3. Compute the median for each link set4. Calculate the median difference for each link set and itspredecessor5. Compute the average (mean) median difference for the text6. Compare each (link set) median difference to the (text) averagemedian difference7. If the median difference is higher than the average, insert asegment boundary8. Locate the section boundaries in the text and disregard sectionsstarting with sentence 19. Compare segment and section boundaries

Ljubljana, November 18 JOTA - Faculty of Arts 28

Link Set Median Procedure

Ljubljana, November 18 JOTA - Faculty of Arts 29

Problems

Lexical repetition shows reliability problems.

Systems based on lexical cohesion use existing linguistic resources (dictionary, thesaurus, ontology) that are usually available only for dominating languages like English, French or German, and as a consequence do not apply to less favored languages.

Ljubljana, November 18 JOTA - Faculty of Arts 30

Asas The idea of the Informative Similarity-based Topic Segmentation System (Asas), proposed by (Dias and Alves,

2005) is to look at:

(1) The importance of words in global context

(2) The importance of words in local context

(3) The global importance of words

(4) The Informative Similarity between Sentences and Blocks of sentences

(5) The Selection of Boundaries

Ljubljana, November 18 JOTA - Faculty of Arts 31

Asas

Text to Segment

Texts of Context

Asas

Segmented Text

Ljubljana, November 18 JOTA - Faculty of Arts 32

Asas: Global Importance

First, it is necessary to find what are the important words in the text to segment. Only these should be

taken into account. For that reason, we apply the well-known tf.idf measure introduced by (Salton, 1975).

)(

log||

;,. 2 wdf

N

d

dwtfdwidftf

Ljubljana, November 18 JOTA - Faculty of Arts 33

Asas: Local importance

Then, it is necessary to find what are the important words for the sake of segmentation. Indeed, if a word occurs in all the sentences of the text, it is of no use for the task of segmentation. For that reason, we apply the well-known

tf.idf measure to sentences that we call tf.isf.

)(

log||

;,. 2 wsf

Ns

s

swstfswisftf

Ljubljana, November 18 JOTA - Faculty of Arts 34

Asas: Local importance

Useless for Topic Segmentation

Ljubljana, November 18 JOTA - Faculty of Arts 35

Asas: Local Importance

Finally, the more dense a word is, the more important it is for the sake of segmentation. Indeed, if a word occurs many times in a small portion of the text,

it is of great use for segmentation. For that purpose, (Dias and Alves, 2005) proposed a density measure based on the distance (in terms of words) of

consecutive occurrences of a word.

1||

11,ln

1,

w

kekoccurkoccurdist

dwdens

Ljubljana, November 18 JOTA - Faculty of Arts 36

Asas: Local Importance

From moon to star

not so strong

Ljubljana, November 18 JOTA - Faculty of Arts 37

Asas: Weight

So, in order to give a weight to each word for each sentence in the text to segment, we propose this

simple measure:

where all measure have been normalized so that they can be joined into a single measure.

dwdensswisftfdwidftfdwweight ,,.,.,

Ljubljana, November 18 JOTA - Faculty of Arts 38

Asas: Similarity

Unlike in TextTiling and LCP, we use the sentence as the basic unit of segmentation. The basic idea is to observe whether a given

sentence is more similar than the previous group of sentences or more similar to the next group of sentences.

Si

Si-1Si-2Si-3

Si+3Si+2Si+1

SimilarityBlock 1

SimilarityBlock 2

Ljubljana, November 18 JOTA - Faculty of Arts 39

Asas: SimilarityFor that purpose, we introduce an informative similarity measure (Dias and Alves, 2005) in

order to avoid the need of lexical databases like thesauri or dictionaries.

The informative similarity measure is based on the cosine similarity measure but integrating any word co-occurrence association measure in its body.

p

k

jk

p

k

ik

p

k

jkik

jiij

XX

XX

XXS

11

1

22

,cos

The cosine measure

Where:

Xi=the vector representing a webpageXj=the vector representing the website

Xik=the weight of word in index k in the vector Xi

Ljubljana, November 18 JOTA - Faculty of Arts 40

Asas: Similarity

(1) Ronaldo defeated the goalkeeper once more.(2) Real Madrid striker scored again.

Ljubljana, November 18 JOTA - Faculty of Arts 41

Asas: Similarity

Each block and the sentence in focus are represented as vectors of weights and the association measure used is the

Equivalence Index (Muller et al., 1997).

p

k

p

kl

jljkjk

p

k

p

kl

ilikik

p

k

jlik

p

l

jlik

jiij

WWEIXWWEIX

WWEIXX

XXinfosimbaS

1 11 1

1 1

22

,,

,

,

The informative similarity measure

where, qfwf

qwfwqpqwpqwEI

2,||,

Ljubljana, November 18 JOTA - Faculty of Arts 42

Asas: Selection of Boundary

1

1

,

,log

ii

iii

XXinfosimba

XXinfosimbaXps

In order to compare the similarity between blocks and the sentence in focus, we propose the following solution.

So, if Xi is more similar to Xi-1, ps(Xi) will give a positive number. On the contrary, if Xi is more similar to Xi-2, ps(Xi) will give a negative number. In the case, where it is similar to both blocks, the value of ps(X i) will be 0.

Ljubljana, November 18 JOTA - Faculty of Arts 43

Asas: Selection of BoundaryResults with Equivalence Index

Ljubljana, November 18 JOTA - Faculty of Arts 44

Asas: Selection of Boundary

Each time the link value goes from positive to negative between two consecutive sentences, there exits a topic shift. We will

call this phenomenon a downhill.

downhills

Ljubljana, November 18 JOTA - Faculty of Arts 45

Asas: Selection of Boundary

A downhill is simply defined as follows whenever the value of the ps score goes from positive to negative between two consecutive

sentences Xi and Xi+1.

11, iiii XpsXpsXXdownhill

Ljubljana, November 18 JOTA - Faculty of Arts 46

Asas: Selection of Boundary

Once all downhills in the text have been calculated, their mean and standard deviation are evaluated. The topic boundaries are then elected if they satisfy the following constraint

2, 1 xXXdownhill ii

Ljubljana, November 18 JOTA - Faculty of Arts 47

Asas: Flexibility

This architecture has the advantage to accept different association measures so that better tuning can be obtained.

And by choosing different context length (one word, k words, k sentences, k paragraphs, k texts) for the calculation of association measures between two words, different applications can be obtained for our architecture:

- Segmenting different news articles from a list of articles (larger context).- Segmenting a technical text about one topic where each segment is about a subtopic (small context).

Ljubljana, November 18 JOTA - Faculty of Arts 48

Conclusion

Language-independent unsupervised Topic Segmentation system based on word-co-occurrence.

Avoids the accessibility to existing linguistic resources such as electronic dictionaries or lexico-semantic databases such as thesauri or ontology.

Solves the problems evidenced systems based uniquely on lexical repetition that show reliability problems.

Ljubljana, November 18 JOTA - Faculty of Arts 49

Problems and Future Work

Existence of three main parameters: the block size, the window size to calculate the association measure and the topic discovery threshold.

Exhaustive Evaluation: different association measures, different similarity measures, different languages etc …

Comparison with systems that use linguistic resources.

More work must be done on the automatic boundary detection algorithm. We are convinced that better algorithms may be proposed based on the transformation of the representation of ps function into a graph or network.

The system will be downloadable at http://asas.di.ubi.pt/ under GPL License when completely tested.

Ljubljana, November 18 JOTA - Faculty of Arts 50

The End

Thanks for your attention and patience


Recommended