
Corpora and Statistical Methods Lecture 12

Albert Gatt

Part 2: Automatic summarisation

The task
Given a single document or collection of documents, return an abridged version that distils the most important information (possibly for a particular task/user).

Summarisation systems perform:
- Content selection: choosing the relevant information in the source document(s), typically in the form of sentences/clauses.
- Information ordering: deciding the order in which the selected material is presented.
- Sentence realisation: cleaning up the sentences to make them fluent.

Note the similarity to NLG architectures. The main difference: summarisation input is text, whereas NLG input is non-linguistic data.

Types of summaries
Extractive vs. abstractive:
- Extractive: select informative sentences/clauses in the source document and reproduce them; most current systems (and our focus today).
- Abstractive: summarise the subject matter (usually using new sentences); much harder, as it involves deeper analysis & generation.

Dimensions
- Single-document vs. multi-document

Context
- Query-specific vs. query-independent

Extracts vs. abstracts: Lincoln's Gettysburg Address

Source: Jurafsky & Martin (2009), p. 823

[Figure: an extract vs. an abstract of the Gettysburg Address, shown side by side.]

A Summarization Machine

[Figure: the Summarization Machine. Inputs: DOC(s) and QUERY. Parameters: extract vs. abstract; indicative vs. informative; generic vs. query-oriented; 'just the news' vs. background; compression to 10%, 50% or 100%; very brief, brief, long or headline length. Outputs: extracts and abstracts (plus multi-doc variants), alongside case frames, templates, core concepts, core events, relationships, clause fragments and index terms.]
Adapted from: Hovy & Marcu (1998). Automated text summarization. COLING-ACL Tutorial. http://www.isi.edu/~marcu/

The Modules of the Summarization Machine

[Figure: the machine's modules. EXTRACTION produces extracts from the source DOC (or multi-doc extracts from several documents); INTERPRETATION turns extracts into case frames, templates, core concepts, core events, relationships, clause fragments and index terms; FILTERING and GENERATION then yield abstracts.]

Unsupervised single-document summarisation I: bag-of-words approaches

Basic architecture for single-doc

Content selection: the central task in single-document summarisation. Can be supervised or unsupervised.
Information ordering: less critical; since we have only one document, we can rely on the order in which sentences occur in the source itself.

Unsupervised content selection I: topic signatures
Simplest unsupervised algorithm:
1. Split the document into sentences.
2. Select those sentences which contain the most salient/informative words.
Salient term = a term in the topic signature (the words that are crucial to identifying the topic of the document).

Topic signature detection:
1. Represent sentences (documents) as word vectors.
2. Compute the weight of each word.
3. Weight sentences by the average weight of their (non-stop) words.

Vector space revisited
Document collection: a key terms x documents matrix.
Doc 1: To make fried chicken, take the chicken, chop it up and put it in a pan until golden. Remove the fried chicken pieces and serve hot.
Doc 2: To make roast chicken, take the chicken and put in the oven until golden. Remove the chicken and serve hot.

Columns = documents; rows = term frequencies.
NB: a stop list removes very high-frequency words!

Term weighting: tf-idf
A common term-weighting scheme from the information retrieval literature:
- tf (term frequency) = frequency of the term in the document
- idf (inverse document frequency) = log(N / n_i), where N = no. of documents and n_i = no. of documents in which term i occurs
- tf-idf score = tf x idf
Method: count the frequency of the term in the document being considered, count the inverse document frequency over the whole collection, and multiply (a minimal sketch follows).
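A minimal sketch of tf-idf over a toy collection like the chicken recipes above; the whitespace tokeniser and the stop list are simplifying assumptions.

```python
# A minimal sketch of tf-idf term weighting over a toy collection.
import math
from collections import Counter

docs = ["to make fried chicken take the chicken chop it up and put it in a pan",
        "to make roast chicken take the chicken and put in the oven until golden"]
stop = {"to", "the", "it", "and", "in", "a", "up", "until"}

tokenised = [[w for w in d.split() if w not in stop] for d in docs]
N = len(tokenised)
df = Counter(w for toks in tokenised for w in set(toks))   # document frequency

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                # term frequency in this document
    idf = math.log(N / df[term])               # inverse document frequency
    return tf * idf

print(tf_idf("fried", tokenised[0]))    # positive: "fried" occurs in only one doc
print(tf_idf("chicken", tokenised[0]))  # zero: "chicken" occurs in every doc
```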

Term weighting: log-likelihood ratio (LLR)
Requirements: a background corpus.
In our case, for a term w, the likelihood ratio compares:
- the probability of observing w in the input corpus, and
- the probability of observing w in the background corpus.
Since -2 log lambda is asymptotically chi-square distributed, if the value is significant we treat the term as a key term. Chi-square values tend to be significant at p = .001 (with 1 degree of freedom) if they are greater than 10.8.
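A hedged sketch of a Dunning-style log-likelihood ratio for topic-signature detection; the counts and corpus sizes in the example are invented.

```python
import math

def ll(k, n, p):
    """Binomial log-likelihood, with the k=0 and k=n limits handled."""
    return (k * math.log(p) if k else 0.0) + ((n - k) * math.log(1 - p) if n > k else 0.0)

def llr(k1, n1, k2, n2):
    """-2 log lambda for a term occurring k1 times in an input corpus of n1
    tokens and k2 times in a background corpus of n2 tokens."""
    p = (k1 + k2) / (n1 + n2)        # H1: one shared occurrence probability
    p1, p2 = k1 / n1, k2 / n2        # H2: corpus-specific probabilities
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))

# 10 occurrences in a 1,000-word input vs. 20 in a 100,000-word background:
print(llr(10, 1000, 20, 100000))     # well above the 10.8 cutoff: a key term
```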

Sentence centrality
Instead of weighting sentences by averaging individual term weights, we can compute the pairwise distance between sentences and choose those sentences which are, on average, closest to each other.
Example: represent sentences as tf-idf vectors and compute the cosine for each sentence x in relation to all other sentences y:

centrality(x) = (1/K) * sum over y of tf-idf-cosine(x, y), where K = total no. of sentences
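A sketch of this centrality computation; representing each sentence as a sparse {term: weight} dict is a simplifying assumption here.

```python
import math

def cosine(u, v):
    """Cosine between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centrality(vectors):
    """Average cosine of each sentence vector to every other sentence."""
    K = len(vectors)
    return [sum(cosine(x, y) for y in vectors if y is not x) / K
            for x in vectors]
```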

Unsupervised single-document summarisation II: using rhetorical structure

Rhetorical Structure Theory
RST (Mann and Thompson 1988) is a theory of text structure:
- not about what texts are about, but
- about how bits of the underlying content of a text are structured so as to hang together in a coherent way.

The main claim of RST:
- Parts of a text are related to each other in predetermined ways.
- There is a finite set of such relations.
- Relations hold between two spans of text: a nucleus and a satellite.

A small example
You should visit the new exhibition. It's excellent. It got very good reviews. It's completely free.
[Figure: RST tree for the example, with MOTIVATION, EVIDENCE and ENABLEMENT relations linking the four clauses.]

An RST relation definition: MOTIVATION
- The nucleus represents an action which the hearer is meant to do at some point in future ("You should go to the exhibition").
- The satellite represents something which is meant to make the hearer want to carry out the nucleus action ("It's excellent. It got a good review.").
- The effect is to increase the hearer's desire to perform the nucleus action.
Note: the satellite need not be a single clause. In our example, the satellite has two clauses, which are themselves related to each other by the EVIDENCE relation.

RST relations more generally
An RST relation is defined in terms of:
- the nucleus + constraints on the nucleus (e.g. the nucleus of MOTIVATION is some action to be performed by the hearer)
- the satellite + constraints on the satellite
- the desired effect.

Other examples of RST relations:
- CAUSE: the nucleus is the result; the satellite is the cause
- ELABORATION: the satellite gives more information about the nucleus

Some relations are multi-nuclear: they do not relate a nucleus and a satellite, but two or more nuclei (i.e. pieces of information of the same status).
Example (SEQUENCE): John walked into the room. He turned on the light.

Some more on RST
RST relations are neutral with respect to their realisation, e.g. you can express EVIDENCE in lots of different ways:

- It's excellent. It got very good reviews.

- You can see that it's excellent from its great reviews.

- Its excellence is evidenced by the good reviews it got.

- It must be excellent, since it got good reviews.

RST for unsupervised content selection
- Compute coherence relations between units (= clauses). One can use a discourse parser and/or rely on cue phrases; corpora annotated with RST relations exist.
- Use the intuition that the nucleus of a relation is more central to the content than the satellite to identify the set of salient units Sal:
  - Base case: if n is a leaf node, then Sal(n) = {n}
  - Recursive case: if n is a non-leaf node, then Sal(n) = the union of Sal(c) over every child c of n that is a nucleus of the relation at n

Rank the units in Sal: the higher up the tree the node of which a unit is a nucleus (i.e. the higher the unit is promoted), the more salient it is. A minimal sketch of this rule follows.
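A minimal sketch of the promotion rule above, assuming a hypothetical Node class; the toy tree reuses the exhibition example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                      # clause text for leaves, relation name otherwise
    nucleus: bool = True            # is this node a nucleus of its parent's relation?
    children: list = field(default_factory=list)

def sal(n):
    """Salient units of n: leaves promoted through nuclear children."""
    if not n.children:              # base case: a leaf promotes itself
        return {n.label}
    units = set()
    for c in n.children:            # recursive case: union over nuclear children
        if c.nucleus:
            units |= sal(c)
    return units

tree = Node("MOTIVATION", children=[
    Node("You should visit the new exhibition.", nucleus=True),
    Node("EVIDENCE", nucleus=False, children=[
        Node("It's excellent.", nucleus=True),
        Node("It got very good reviews.", nucleus=False),
    ]),
])
print(sal(tree))   # {'You should visit the new exhibition.'}
```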

Rhetorical structure: example
[Figure: an example rhetorical structure tree over numbered text units.]

Ranking of nodes: 2 > 8 > 3 ...

Supervised content selection

Basic idea
Input: a training set consisting of documents + human-produced (extractive) summaries, so that sentences in each document can be marked with a binary feature (1 = included in the summary; 0 = not included).

Train a machine learner to classify sentences as 1 (extract-worthy) or 0, based on features.

Features
- Position: important sentences tend to occur early in a document (but this is genre-dependent; e.g. in news articles, the most important sentence is the title).
- Cue phrases: sentences with phrases like "to summarise" give important summary information (again genre-dependent: different genres have different cue phrases).
- Word informativeness: words in the sentence which belong to the document's topic signature.
- Sentence length: we usually want to avoid very short sentences.
- Cohesion: we can use lexical chains (series of words that are indicative of the document's topic) to count how many words in a sentence also occur in the document's lexical chains.

Algorithms
Once we have the feature set F, we want to compute:

P(extract-worthy = 1 | F)

Many methods we've discussed will do:
- Naive Bayes
- Maximum Entropy
- ...
(A toy sketch follows.)
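A hedged sketch of the supervised setup using scikit-learn's GaussianNB; the three features and the tiny training set are toy stand-ins for the feature set above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def features(sentence, position, n_sentences, signature):
    toks = sentence.lower().split()
    return [position / n_sentences,                                  # relative position
            len(toks),                                               # sentence length
            sum(t in signature for t in toks) / max(len(toks), 1)]   # informativeness

signature = {"summarisation", "summary", "sentences"}    # hypothetical topic signature
train = [("Summarisation systems select sentences for a summary.", 0, 1),
         ("It was a rainy Tuesday in March.", 5, 0)]     # (sentence, position, label)
X = np.array([features(s, p, 6, signature) for s, p, _ in train])
y = np.array([label for _, _, label in train])
clf = GaussianNB().fit(X, y)
print(clf.predict_proba(X))   # estimates of P(extract-worthy | F)
```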

Which corpus?
There are some corpora with extractive summaries, but often we come up against the problem of not having the right data. Many types of text themselves contain summaries, e.g. scientific articles have abstracts. But these are not purely extractive (though people tend to include sentences in abstracts that are very similar to the sentences in their text).
Possible method: align sentences in an abstract with sentences in the document by computing their overlap (e.g. using n-grams), as in the sketch below.
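A minimal sketch of that alignment using unigram Jaccard overlap; the 0.5 threshold is an illustrative guess.

```python
def tokens(sentence):
    return set(sentence.lower().split())

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / (len(ta | tb) or 1)

def align(abstract_sents, doc_sents, threshold=0.5):
    """Pair each abstract sentence with its most-overlapping document sentence;
    an aligned document sentence can then be labelled extract-worthy."""
    pairs = []
    for a in abstract_sents:
        best = max(doc_sents, key=lambda d: jaccard(a, d))
        if jaccard(a, best) >= threshold:
            pairs.append((a, best))
    return pairs
```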

Realisation: sentence simplification

With single-document summarisation, realisation isn't a big problem (we are reproducing sentences from the source). But we may want to simplify (or compress) the sentences. The simplest method is to use heuristics that drop material such as (a naive regex sketch follows the list):
- Appositives: Rajam, 28, an artist who lives in Philadelphia, found inspiration in the back of city magazines.
- Sentential adverbs: As a matter of fact, this policy will be ruinous.
A lot of current research on simplification/compression uses parsers to identify dependencies that can be omitted with little loss of information. Realisation is much more of an issue in multi-document summarisation.
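Purely for illustration, a naive regex version of these two heuristics; the adverbial list and both patterns are invented here and will over-trigger on real text (actual systems use parsers).

```python
import re

# hypothetical, very incomplete list of sentential adverbs
ADVERBIALS = r"^(As a matter of fact|In fact|However|Of course),\s*"

def simplify(sentence):
    s = re.sub(ADVERBIALS, "", sentence)              # drop a sentential adverb
    # drop comma-bounded material between the subject and the last comma
    s = re.sub(r"^([A-Z]\w*),\s.*,\s", r"\1 ", s)
    return s[:1].upper() + s[1:]                      # restore capitalisation

print(simplify("As a matter of fact, this policy will be ruinous."))
# -> This policy will be ruinous.
print(simplify("Rajam, 28, an artist who lives in Philadelphia, found inspiration."))
# -> Rajam found inspiration.
```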

Multi-document summarisation

Why multi-document?
Very useful when:
- queries return multiple documents from the web
- several articles talk about the same topic (e.g. a disease)
- ...

The steps are the same as for single-doc summarisation, but:
- we are selecting content from more than one source
- we cannot rely on the source documents alone for ordering
- realisation is required to ensure coherence.

Content selection
Since we have multiple documents, we have a problem with redundancy: information repeated in several documents; overlapping words, sentences, phrases... We can modify sentence scoring methods to penalise redundancy, by comparing a candidate sentence to sentences already selected.

Methods:
1. Modify the sentence score to penalise redundancy (the candidate sentence is compared to sentences already chosen for the summary). One common formulation, maximal marginal relevance (MMR), is:
   score'(s) = lambda * score(s) - (1 - lambda) * max over s' in Summary of sim(s, s')
   (a sketch follows this list).

2. Use clustering to group related sentences, and then perform selection on clusters. (More on clustering next week.)
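A sketch of method 1 as greedy MMR-style selection; sim() is a toy Jaccard stand-in for the tf-idf cosine, and lam = 0.7 is an arbitrary setting.

```python
def sim(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def mmr_select(candidates, scores, k, lam=0.7):
    """Greedily pick k sentences, trading relevance against redundancy
    with respect to the sentences already chosen."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda s: lam * scores[s]
                   - (1 - lam) * max((sim(s, t) for t in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```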

Information ordering
If sentences are selected from multiple documents, we risk creating an incoherent document.

Cues to coherence:
- Rhetorical structure:
  *Therefore, I slept. I was tired.
  I was tired. Therefore, I slept.
- Lexical cohesion:
  *We had chicken for dinner. Paul was late. It was roasted.
  We had chicken for dinner. It was roasted. Paul was late.
- Referring expressions:
  *He said that ... . George W. Bush was speaking at a meeting.
  George W. Bush said that ... . He was speaking at a meeting.

These heuristics can be combined. We can also do information ordering during the content selection process itself.

Information ordering based on reference
Referring expressions (NPs that identify objects) include pronouns, names, definite NPs...
Centering Theory (Grosz et al. 1995): every discourse segment has a focus (what the segment is about). Entities are salient in discourse depending on their position in the sentence: SUBJECT >> OBJECT >> OTHER. A coherent discourse is one which, as far as possible, maintains smooth transitions between sentences.

Information ordering based on lexical cohesion
Sentences which are about the same things tend to occur together in a document.

Possible method:
- use tf-idf cosine to compute pairwise similarity between selected sentences
- attempt to order sentences so as to maximise the similarity between adjacent pairs (a greedy sketch follows).
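A hedged sketch of the greedy version of this idea; finding the best global order is TSP-like, so this is only an approximation, and sim() is any similarity function (e.g. the tf-idf cosine above).

```python
def order_sentences(sentences, sim):
    """Start from the first sentence and repeatedly append the remaining
    sentence most similar to the last one placed."""
    remaining = list(sentences)
    ordered = [remaining.pop(0)]
    while remaining:
        nxt = max(remaining, key=lambda s: sim(ordered[-1], s))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered
```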

Realisation
Compare: [Figure: two versions of a multi-document summary, before and after realisation repairs.]
Source: Jurafsky & Martin (2009), p. 835

Uses of realisation
Since sentences come from different documents, we may end up with infelicitous NP orderings (e.g. a pronoun occurring before the definite NP it refers to). One possible solution is to run a coreference resolver on the extracted summary:
- identify reference chains (NPs referring to the same entity)
- replace or reorder NPs if they violate coherence, e.g. use the full name before the pronoun.

Another interesting problem is sentence aggregation or fusion, where different phrases (from different sources) are combined into a single phrase.

Evaluating summarisation

Evaluation baselines
Random sentences: if we're producing summaries of length N, we use as a baseline a random extractor that pulls out N sentences. Not too difficult to beat.

Leading sentences: choose the first N sentences. Much more difficult to beat! A lot of informative sentences are at the beginning of documents.

Some terminology (reminder)
Intrinsic evaluation: evaluation of output in its own right, independent of a task (e.g. compare output to human output).

Extrinsic evaluation: evaluation of output in a particular task (e.g. humans answer questions after reading a summary).

We've seen the uses of BLEU (intrinsic) for realisation in NLG. A similar metric in summarisation is ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

BLEU vs. ROUGE

BLEU:
- Precision-oriented.
- Looks at n-gram overlap for different values of n, up to some maximum.
- Measures the average n-gram overlap between an output text and a set of reference texts.

ROUGE:
- Recall-oriented.
- The n-gram length is fixed: ROUGE-1, ROUGE-2, etc. (for different n-gram lengths).
- Measures how many of the reference summary's n-grams the output summary contains (a minimal sketch follows below).

ROUGE
- Generalises easily to any n-gram length.
- Other versions:
  - ROUGE-L: measures the longest common subsequence between the reference summary and the output
  - ROUGE-SU: uses skip bigrams
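A minimal single-reference sketch of ROUGE-n recall; real ROUGE also handles multiple references, stemming and stopword options.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system, reference, n=2):
    """Fraction of the reference's n-grams found in the system summary,
    with matches clipped to the reference counts."""
    sys_counts = ngrams(system.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    overlap = sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))  # 0.6
```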

Intrinsic vs. extrinsic again
Problem: ROUGE assumes that reference summaries are gold standards, but people often disagree about summaries, including wording.

The same questions arise as for NLG (and MT): to what extent does this metric actually tell us about the effectiveness of a summary? Some recent work has shown that the correlation between ROUGE and a measure of relevance given by humans is quite low.

See: Dorr et al. (2005). A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate? Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 1-8, Ann Arbor, June 2005.

The Pyramid method (Nenkova et al.)
Also intrinsic, but relies on semantic content units (SCUs) instead of n-grams.

1. Human annotators label SCUs in sentences from human summaries. This is based on identifying the content of different sentences and grouping together sentences from different summaries that talk about the same thing, so it goes beyond surface wording.
2. Find the SCUs in the automatic summaries.
3. Weight the SCUs (an SCU's weight is the number of human summaries in which it appears).
4. Compute the ratio of the sum of weights of the SCUs in the automatic summary to the weight of an optimal summary of roughly the same length (a minimal sketch follows).
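A minimal sketch of the final ratio, assuming SCU identification and weighting have already been done by hand; the example weights are invented.

```python
def pyramid_score(peer_weights, all_scu_weights):
    """peer_weights: weights of the SCUs found in the automatic summary.
    all_scu_weights: weights of every SCU in the pyramid (from human summaries)."""
    observed = sum(peer_weights)
    x = len(peer_weights)
    # an optimal summary of the same size expresses the x highest-weighted SCUs
    optimal = sum(sorted(all_scu_weights, reverse=True)[:x])
    return observed / optimal if optimal else 0.0

# hypothetical pyramid built from 4 human summaries (weights 1..4)
print(pyramid_score([4, 2, 1], [4, 3, 3, 2, 2, 1, 1]))  # 7/10 = 0.7
```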

