A New Approach to Unsupervised Text Summarization
Agenda
Introduction
The Approach
Diversity-Based Summarization
Test Data and Evaluation Procedure
Results and Discussion
Conclusion and Future Work
Introduction
Supervised methods typically make use of human-made summaries or
extracts to find features or parameters of summarization algorithms.
Problem: the human-made summaries must be reliable enough.
Unsupervised methods determine the relevant parameters without regard
to human-made summaries.
Introduction (cont’d)
Validity?
Introduction (cont’d)
Experiment: a large group of university students was asked to identify
the 10% of sentences in a text (drawn from various domains of a
newspaper corpus) that they believed to be most important.
The study reported a rather modest 25% agreement among their choices.
Problems: 1. Reliability
2. Portability
The Approach
Evaluate summaries: not in terms of how well they match human-made extracts.
Not in terms of how much time it takes humans to make relevance
judgments on them.
But in terms of how well they represent source documents in usual
IR tasks such as document retrieval and text categorization.
The Approach (cont’d)
Extraction: lacks fluency and cohesion.
But humans are able to perform as well reading 20%-30% extracts
as reading the original full text.
Diversity-Based Summarization
Problem: which sentences are the most important, i.e., best represent
the text?
Katz makes an important observation that the number of occurrences
of content words in a document does not depend on the document's
length:
the frequencies per document of individual content words do not
grow proportionally with the length of a document.
Diversity-Based Summarization (cont’d)
Two important properties of text:
1. Redundancy – how repetitive the concepts are.
2. Diversity – how many different concepts are in the text.
Much of the prior work focuses on redundancy; few take up the
problem of diversity.
One approach that does is MMR (maximal marginal relevance).
Diversity-Based Summarization (cont’d)
Method: 1. Find-Diversity – find diverse topic areas in the text.
2. Reduce-Redundancy – from each topic area, identify the
most important sentence and take that sentence as a
representative of the area.
A summary is then the set of sentences generated by Reduce-
Redundancy.
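The two-step method above can be sketched in code. Reduce-Redundancy amounts to an argmax per cluster; the cluster labels and sentence weights below are illustrative stand-ins for the outputs of Find-Diversity and the sentence weighting model.

```python
def reduce_redundancy(labels, weights):
    """From each topic area (cluster), keep the highest-weight sentence."""
    best = {}  # cluster label -> (sentence index, weight)
    for idx, (lab, w) in enumerate(zip(labels, weights)):
        if lab not in best or w > best[lab][1]:
            best[lab] = (idx, w)
    # The summary is the set of representative sentences, in text order.
    return sorted(i for i, _ in best.values())

# Toy data: 5 sentences, 3 topic areas found by Find-Diversity.
labels = [0, 0, 1, 2, 1]
weights = [0.2, 0.9, 0.5, 0.3, 0.7]
print(reduce_redundancy(labels, weights))  # [1, 3, 4]
```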
Diversity-Based Summarization (cont’d)
Find-Diversity: built upon the K-means clustering algorithm, extended
with a Minimum Description Length (MDL) version of X-means.
X-means is an extension of K-means with the added functionality of
estimating K, which in standard K-means must be supplied by the user.
Diversity-Based Summarization (cont’d)
μj – the coordinates of the centroid with index j.
xi – the coordinates of the i-th data point.
(i) – the index of the centroid closest to data point i.
E.g., μ(j) denotes the centroid associated with data point j.
ci – a cluster with index i.
Diversity-Based Summarization (cont’d)
K-means: a hard clustering algorithm that partitions the input data
points into K disjoint subsets.
It starts with some randomly chosen initial centers; a bad choice of
initial centers can have adverse effects on clustering performance.
The best solution is one that minimizes distortion.
Diversity-Based Summarization (cont’d)
Define distortion as the averaged sum of squares of the Euclidean
distances between the objects of a cluster and its centroid.
For some clustering solution S = {c1, . . . , ck}, its distortion is

Distortion(S) = (1/k) Σ_{i=1..k} (1/|ci|) Σ_{xj ∈ ci} ||xj − μ(i)||²

where
ci - a cluster
xj - an object in ci
μ(i) - the centroid of ci
| ・ | - the cardinality function
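The distortion just defined can be computed directly; a minimal sketch in pure Python, on illustrative data:

```python
def distortion(clusters, centroids):
    """Averaged sum of squared Euclidean distances between each cluster's
    objects and its centroid, averaged again over the clusters."""
    total = 0.0
    for c, mu in zip(clusters, centroids):
        total += sum(sum((a - b) ** 2 for a, b in zip(x, mu)) for x in c) / len(c)
    return total / len(clusters)

# Two clusters of 2-D points with their centroids (toy data).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
centroids = [(1.0, 0.0), (5.0, 5.0)]
print(distortion(clusters, centroids))  # 0.5
```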
Diversity-Based Summarization (cont’d)
Problems of K-means: the user must supply the number of clusters K.
It is prone to getting trapped in local minima.
Diversity-Based Summarization (cont’d)
X-means: globally searches the space of centroid locations to find the
best way of partitioning the input data.
It resorts to a model selection criterion known as the Bayesian
Information Criterion (BIC) to decide whether to split a cluster:
when the information gain from splitting a cluster, as measured by
BIC, is greater than the gain from keeping that cluster as it is,
it splits.
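The split test can be sketched with a spherical-Gaussian BIC in the style of the original X-means work; the data points, centroids, and the exact free-parameter count here are illustrative assumptions, not the paper's implementation.

```python
import math

def bic(clusters, centroids):
    """BIC = log-likelihood - (p/2) * log(n) for a spherical Gaussian
    mixture with a single MLE variance shared across clusters."""
    n = sum(len(c) for c in clusters)
    k, d = len(clusters), len(centroids[0])
    ss = sum(sum((a - b) ** 2 for a, b in zip(x, mu))
             for c, mu in zip(clusters, centroids) for x in c)
    var = max(ss / max(n - k, 1), 1e-12)  # shared spherical variance (MLE)
    ll = 0.0
    for c, mu in zip(clusters, centroids):
        nj = len(c)
        ssj = sum(sum((a - b) ** 2 for a, b in zip(x, mu)) for x in c)
        ll += (nj * math.log(nj / n)                      # mixing weight
               - nj * d / 2 * math.log(2 * math.pi * var)  # Gaussian normalizer
               - ssj / (2 * var))                           # fit term
    p = (k - 1) + k * d + 1  # mixing weights, means, shared variance
    return ll - p / 2 * math.log(n)

# Two well-separated pairs: splitting the parent cluster should raise BIC.
parent = [[(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]]
children = [[(0.0, 0.0), (0.2, 0.0)], [(5.0, 5.0), (5.2, 5.0)]]
print(bic(children, [(0.1, 0.0), (5.1, 5.0)]) > bic(parent, [(2.6, 2.5)]))  # True
```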
Diversity-Based Summarization (cont’d)
Modification of X-means: replacing BIC with MDL.
Diversity-Based Summarization (cont’d)
Reduce-Redundancy: use a simple sentence weighting model (the
Z-model), taking the weight of a given sentence s as the sum of the
tf・idf values of the index terms in that sentence:

W(s) = Σ_{x ∈ s} tf(x)・idf(x)

x - an index term
tf(x) - the frequency of term x in the document
idf(x) - the inverse document frequency of x
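A minimal sketch of this sentence weight, assuming idf(x) = log(N / df(x)) (a common definition; the slide does not fix one) and illustrative counts:

```python
import math

def sentence_weight(sentence_terms, doc_tf, df, n_docs):
    """W(s) = sum over index terms x in s of tf(x) * idf(x),
    with tf counted in the whole document and idf(x) = log(N / df(x))."""
    return sum(doc_tf[x] * math.log(n_docs / df[x])
               for x in set(sentence_terms))

doc_tf = {"tax": 3, "cut": 2, "news": 5}   # term frequencies in the document
df = {"tax": 10, "cut": 20, "news": 100}   # document frequencies in the collection
print(round(sentence_weight(["tax", "cut"], doc_tf, df, 100), 3))  # 10.127
```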
Diversity-Based Summarization (cont’d)
Z-model sentence selection: 1. Determine the weights of sentences in the text.
2. Sort them in decreasing order.
3. Select the top sentences.
The sentence weight is further normalized by sentence length.
Find the sentence with the best W(s) score and take it as the
representative of its cluster.
This minimizes the loss of the resulting summary's relevance to a
potential query.
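The three selection steps can be sketched as follows; normalizing by raw sentence length is an assumption here, since the slide does not give the exact form of the normalization.

```python
def z_model_select(weights, lengths, rate=0.3):
    """Normalize each weight by sentence length, sort decreasingly,
    and keep the top fraction given by the compression rate."""
    scored = sorted(((w / l, i) for i, (w, l) in enumerate(zip(weights, lengths))),
                    reverse=True)
    k = max(1, round(len(scored) * rate))
    return sorted(i for _, i in scored[:k])

weights = [10.0, 9.0, 2.0, 8.0]   # W(s) per sentence (illustrative)
lengths = [5, 3, 1, 4]            # sentence lengths in index terms
print(z_model_select(weights, lengths, rate=0.5))  # [1, 3]
```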
Diversity-Based Summarization (cont’d)
Problem: the extraction process does not preserve the statistical
properties of a source text; the frequencies of terms in an extract
differ systematically from those in the source.
Solution: extrapolate the frequencies of index terms in extracts in
order to estimate their true frequencies in the source texts.
Diversity-Based Summarization (cont’d)
Extrapolation formula (the conditional expectation of a term's frequency):

E(k | k ≥ m) = Σ_{r ≥ m} r・pr / Σ_{r ≥ m} pr

pr - the probability of a given word occurring r times in the document, m ≥ 0.
In these experiments, only index terms with two or more occurrences in the document are used, so the extrapolation is E(k | k ≥ 2).
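Read as a conditional expectation, which is one plausible reading of the missing formula, the extrapolation can be sketched as:

```python
def extrapolate(p, m):
    """E(k | k >= m): expected number of occurrences of a word given that
    it occurs at least m times; p[r] is the probability of exactly r
    occurrences (illustrative distribution below)."""
    num = sum(r * pr for r, pr in enumerate(p) if r >= m)
    den = sum(pr for r, pr in enumerate(p) if r >= m)
    return num / den

# Toy occurrence distribution: p[r] for r = 0..3.
p = [0.5, 0.3, 0.1, 0.1]
print(extrapolate(p, 2))  # 2.5
```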
Test Data and Evaluation Procedure
BMIR-J2: Benchmark for Japanese IR systems, version 2 – a test
collection of 5,080 news articles published in Japan in 1994.
Test Data and Evaluation Procedure
F-measure

F = 2PR / (P + R)

P - precision
R - recall
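For completeness, the F-measure as the harmonic mean of precision and recall:

```python
def f_measure(p, r):
    """F = 2PR / (P + R); defined as 0 when both P and R are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

print(f_measure(0.8, 0.4))  # ≈ 0.533
```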
Test Data and Evaluation Procedure
Two sets of experiments: the strict relevance scheme (SRS) takes only
A-labeled documents as relevant to the query.
The moderate relevance scheme (MRS) takes both A- and B-labeled
documents as relevant.
Test Data and Evaluation Procedure
Summarization methods: 1. Z-model
2. diversity-based summarizer with the standard K-means
(DBS/K)
3. diversity-based summarizer with X-means (DBS/XM)
Compression rates range from 20% to 50%.
Test Data and Evaluation Procedure
Experiment procedure: 1. At each compression rate, run the Z-model,
DBS/K, and DBS/XM on the entire BMIR-J2 collection to produce
respective pools of extracts.
2. For each query from BMIR-J2, perform a search on each pool
generated, and score performance with the uninterpolated average
F-measure.
Results and Discussion
Conclusion and Future Work
Diversity-based summarization (DBS/XM) was found to be superior to
relevance-based summarization (the Z-model) when the loss of
information in extracts is measured in terms of retrieval performance.
Future Work: extending the current DBS framework to deal with
multi-document summarization.
Speech summarization with audio input and output.
Text categorization.