Document-level Text Quality: Models for Organization and ...

Document-level Text Quality: Models for Organization and Reader Interest Annie Louis October 16, 2015 Université Catholique de Louvain Joint work with Ani Nenkova
Page 1: Document-level Text Quality: Models for Organization and ...

Document-level Text Quality: Models for Organization and Reader Interest

Annie Louis
October 16, 2015

Université Catholique de Louvain

Joint work with Ani Nenkova

Page 2

People spontaneously respond to differences in writing

Page 3

“Finnegans Wake is long, dense, and linguistically knotty, yet hugely rewarding, if you're willing to learn how to read it...”

http://www.publishersweekly.com

Page 4

“My Faith: Why I don't sing the 'Star Spangled Banner'”

“What a poorly written article. Strays off topic and hardly even addresses the point of the article.

The only brief mention of why they don’t play the national anthem is that they believe in church and state. This just was one long rant about his religion.”


http://www.cnn.com

Page 5

http://www.vocabula.com

Page 6

Text Quality Prediction

This article is well-written. Next one..

Can we teach computers to make similar judgements?

o How to formulate the task?

o Get suitable data with distinctions

o Find correlates in text

Page 7

Why do we care?

• Information retrieval, article recommendation
  – All articles are not of the same quality
  – Can filter by quality in addition to relevance

• Authoring support, educational assessment
  – Automatic assessment is cheap, consistent and quick
  – Spelling and grammar correction are commercially successful

• Text generation systems
  – Systems can understand how to generate coherent text
  – Can evaluate system output

Page 8

This talk

• Defining text quality and creating a corpus of overall article ratings
  – Large scale realistic sample of writing differences

• Two models
  – A model for organization using syntax patterns
  – A model for reader interest

• Document-level quality prediction
  – In contrast to spelling and grammar
  – Often not a binary, correct/incorrect distinction

Page 9

>> Defining Text Quality

• Aspects of quality
• Who is the audience?

Page 10

Aspects of quality

• We adopt a definition from the education field

Page 11

Six Traits [Spandel 2004]

Ideas and development
Organization (Smooth transitions)
Voice (Personal touch)
Word choice (vivid, lively)
Sentence fluency (Rhythm)
Conventions (Mechanics)

Details and their presentation
Flow between sentences
Interesting nature, beautiful writing
Spelling, grammar, layout

Page 12

Audience for text quality – An expert

low competency    high competency

Experienced NLP researcher

Reader of machine-generated text

Adult reader of newspaper

• Increased focus on linguistic properties of the text

Page 13

Relationship to readability

• Readability has a strong focus on comprehension

Grade level 1
Grade level 2
..
Grade level 12

• Audience distinctions
  – child vs. adult, novice vs. expert, cognitive disability or not

Page 14

>> A Corpus for Document-level Quality

Louis & Nenkova, Discourse and Dialogue, 2013

Page 15

Science journalism: example snippet

Sarah Lewis is fluent in firefly.

On this night she walks through a farm field in eastern Massachusetts, watching the first fireflies of the evening rise into the air and begin to blink on and off.

Dr. Lewis, an evolutionary ecologist at Tufts University...


Page 16

Category 1: VERY GOOD articles

• Seed set = 63 New York Times articles that appeared in the Best American Science Writing series

• We choose only the NYT articles
  – We use the NYT Corpus to expand our category
  – Normalize for differences in writing due to source

Page 17

Topics in the seed set

Tag                             Appearance
Medicine and Health             22
Space                           14
Physics                         10
Biology and Biochemistry        8
Genetics and Heredity           8
Archaeology and Anthropology    7
Computers and the Internet      4

Page 18

Expanding the VERY GOOD set

• Assume: ~40 authors of the seed set are excellent writers

• Other articles from the NYT written by the same authors
  – which are research related
  – during the same 10 year period
  – on similar topics
  – similar lengths

Page 19

Category 2: TYPICAL writing in the NYT

• Other science articles around the same time, but not written by the popular authors

The general corpus:

Category     Total articles
VERY GOOD    3,530
TYPICAL      20,242

Page 20

A topic-paired corpus

• The general categories mix different topics
  – geography, biology, astronomy, linguistics…

• But an IR system compares articles on the same topic

• For each VERY GOOD article, get 10 most similar TYPICAL articles (based on the content)

• Enumerate all pairs of (VERY GOOD, TYPICAL)

• 35,300 pairs
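The pairing step can be sketched as follows. This is only an illustration: the slide says similarity is "based on the content" without naming a measure, so the bag-of-words cosine, the function names, and the toy articles here are all assumptions.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector (assumed content representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_pairs(very_good, typical, k=10):
    """For each VERY GOOD article, pair it with its k most similar TYPICAL articles."""
    typ_vecs = [bow(t) for t in typical]
    pairs = []
    for gi, g in enumerate(very_good):
        gv = bow(g)
        ranked = sorted(range(len(typical)),
                        key=lambda ti: cosine(gv, typ_vecs[ti]),
                        reverse=True)
        pairs.extend((gi, ti) for ti in ranked[:k])
    return pairs

# Hypothetical toy articles, not from the NYT corpus.
very_good = ["fireflies blink in the field at night",
             "a new telescope maps distant galaxies"]
typical = ["fireflies and beetles in the field",
           "galaxies seen by a telescope",
           "stock markets fell",
           "the telescope budget"]
pairs = topic_pairs(very_good, typical, k=2)
```

With k = 10 and 3,530 VERY GOOD articles, this enumeration yields 3,530 × 10 = 35,300 pairs, the count given on the slide.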

Page 21

Two quality prediction tasks

'Any-topic' – is this article VERY GOOD or TYPICAL?
  2 categories: GOOD (~3500), TYPICAL (~3500)

'Same-topic' – which article in the pair is the VERY GOOD one?
  Topically similar pairs <VERY GOOD, TYPICAL>
  ~35,000 pairs

Page 22

Properties of the dataset

• Distinguishes average writing from very good

• Allows a focus on aspects such as beautiful writing
  – Less likely to have spelling and grammar errors

• Large scale and realistic sample of writing differences
  – Previous work often used machine generated text or artificially manipulated text

Page 23

>> Predicting organization quality

Louis & Nenkova, EMNLP 2012

Page 24

Some sequences of sentence types convey the overall purpose better

Motivation: Solving X is useful for many applications.

Introduce approach: We present a new approach to address X.

Results: Results show that our method works well.

Page 25

Intentional structure of an article

• Every text has a purpose that the author wishes to convey

• Influential early theories discuss it at length [Grosz & Sidner 1986]

• Particularly for academic writing, it is popular to see articles as a sequence of intentions [Swales 1990, Teufel 2000]

[Figure labels: Narrative, Explanation, Critique]

Page 26

Oracle model of intentional structure

• Using manual annotations of intentions on ACL articles

Markov Chain on Introduction sections [corpus by Teufel, 2000]

[Figure: Markov chain with states START, Background, Aim, Own, Contrast, Textual, Others work, END; transition probabilities shown on edges: 0.7, 0.1, 0.3, 0.8, 0.4, 0.1]
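A Markov chain of this kind can be estimated by counting transitions between annotated sentence types. The sketch below is a minimal illustration; the three annotated sequences are invented, and the probability floor for unseen transitions is an assumption, not something stated in the talk.

```python
import math
from collections import Counter, defaultdict

def fit_markov_chain(sequences):
    """Maximum-likelihood transition probabilities, with START and END states."""
    counts = defaultdict(Counter)
    for seq in sequences:
        path = ["START"] + list(seq) + ["END"]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

def log_prob(seq, chain, floor=1e-6):
    """Log-probability of a label sequence; unseen transitions get a small floor (assumed)."""
    path = ["START"] + list(seq) + ["END"]
    return sum(math.log(chain.get(a, {}).get(b, floor))
               for a, b in zip(path, path[1:]))

# Hypothetical annotated introductions (labels follow the state names on the slide).
annotated = [
    ["Background", "Aim", "Own"],
    ["Background", "Contrast", "Aim", "Own"],
    ["Background", "Aim", "Textual"],
]
chain = fit_markov_chain(annotated)
good = log_prob(["Background", "Aim", "Own"], chain)
odd = log_prob(["Own", "Aim", "Background"], chain)
```

A sequence that follows the usual ordering of intentions scores higher than a scrambled one, which is the property the oracle model relies on.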

Page 27

Main idea of the work

• Annotating sentence types is hard. Pre-defining the set of sentence types is harder

• Assume: Syntax ~ rough proxy for sentence type

Page 28

Syntactic patterns in explanations

• An aqueduct is a water supply or navigable channel constructed to convey water. In modern engineering, the term is used for any system of pipes, canals, tunnels, and other structures used for this purpose.

• A cytokine receptor is a receptor that binds cytokines. In recent years, the cytokine receptors have come to demand more attention because their deficiency has now been directly linked to certain debilitating immunodeficiency states.

Definitions look like this

Descriptive articles look like this

[Figure labels: indefinite article; term to define; is/are NP; relative clause; more specific: topicalized PP]

Page 29

Syntax-based HMM model*

“Definitions”
  VP → VBZ NP
  NP → DT ADJP
  NP → NP PP
  ….

“Citations”
  NP → NNP CC NNP
  NP → CD
  NP → NP , NP

“Speculations”
  VP → MD VP
  VP → VB VP
  VP → VB PP

[Figure: HMM with START and END and the three example states above; transition probabilities shown on edges: 0.5, 0.3, 0.2, 0.3, 0.2]

* Uses grammatical productions
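To make "grammatical productions" concrete, the sketch below reads the non-lexical productions off a bracketed constituency parse. The parse string is hand-written for illustration, and the helper names are hypothetical; a real pipeline would take trees from a parser.

```python
import re
from collections import Counter

def parse(s):
    """Parse a bracketed tree into (label, children); leaf words are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def productions(tree):
    """List productions such as 'VP -> VBZ NP', skipping POS-to-word rules."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return []
    rules = [label + " -> " + " ".join(c[0] for c in children)]
    for c in children:
        rules.extend(productions(c))
    return rules

# Hand-written parse of a definition-style sentence (an assumption, not parser output).
tree = parse("(S (NP (DT An) (NN aqueduct)) (VP (VBZ is) (NP (DT a) (NN channel))) (. .))")
rules = productions(tree)
bag = Counter(rules)
```

Each sentence is then represented by its bag of productions, which is what the HMM states on the slide emit.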

Page 30

A second model: based on d-sequences

• More information about adjacent constituents
• A POS tag sequence loses all abstraction

• D-sequence
  – control abstraction using a parameter “depth” (d)

[Figure: parse tree of the example sentence]

POS tag sequence: [ “ DT VBZ JJ NN , ” NNP NNP VBD . ]

Page 31

Step 1 – depth cutoff

“That’s good news,” Dr. Leak said.

[Figure: parse tree of the sentence, rooted at ROOT]

• Choose a depth d
• Terminate tree at d
• Read off new leaves from left to right

d=2: “ S , ” NP VP .

d=3: “ NP VP , ” NNP NNP VBD .

Page 32

Step 2: Node augmentation

[Figure: parse tree of the sentence, rooted at ROOT]

For phrasal nodes in d-sequence,
  - Annotate with leftmost leaf in full tree

d=2: “ S_DT , ” NP_NNP VP_VBD .

d=3: “ NP_DT VP_VBZ , ” NNP NNP VBD .
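The two steps above (depth cutoff, then node augmentation) can be sketched in a few lines. The bracketed tree below is reconstructed from the d=2 and d=3 sequences on the slides and uses Penn Treebank quote tags (`` and ''); that reconstruction, and the function names, are assumptions.

```python
import re

def parse(s):
    """Parse a bracketed tree into (label, children); leaf words are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def is_preterminal(tree):
    return all(isinstance(c, str) for c in tree[1])

def leftmost_tag(tree):
    """POS tag of the leftmost leaf under this node in the full tree."""
    while not is_preterminal(tree):
        tree = tree[1][0]
    return tree[0]

def d_sequence(tree, d, augment=False, depth=0):
    """Cut the tree at depth d (ROOT is depth 0) and read the new leaves left to right."""
    label, children = tree
    if is_preterminal(tree):
        return [label]                # POS tags are kept as they are
    if depth == d:
        return [label + "_" + leftmost_tag(tree)] if augment else [label]
    seq = []
    for child in children:
        seq.extend(d_sequence(child, d, augment, depth + 1))
    return seq

tree = parse("(ROOT (S (`` ``) (S (NP (DT That)) (VP (VBZ 's) (NP (JJ good) (NN news))))"
             " (, ,) ('' '') (NP (NNP Dr.) (NNP Leak)) (VP (VBD said)) (. .)))")

print(" ".join(d_sequence(tree, 2)))                # `` S , '' NP VP .
print(" ".join(d_sequence(tree, 3, augment=True)))  # `` NP_DT VP_VBZ , '' NNP NNP VBD .
```

A large d recovers the plain POS tag sequence, while a small d keeps only coarse phrasal labels, so d controls the level of abstraction.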

Page 33

Evaluation task on academic writing

• ACL anthology corpus
  – abstract, introduction, related work

• Approximate distinction for organization quality
  – Original article → well-organized
  – Random permutation of original → poorly-organized
  – Create pairs <original, permutation>

• Task: identify the original version in the pair
  – Baseline 50% accuracy
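The evaluation setup can be sketched as below: build <original, permutation> pairs and count how often a model scores the original higher. The scoring function here is a toy stand-in (it counts out-of-order sentence pairs), not the talk's HMM; the names and the number of permutations per document are assumptions.

```python
import random

def make_pairs(sentences, n_perms, seed=0):
    """Create <original, permutation> pairs, skipping shuffles equal to the original."""
    rng = random.Random(seed)
    pairs = []
    attempts = 0
    while len(pairs) < n_perms and attempts < 100 * n_perms:
        attempts += 1
        perm = sentences[:]
        rng.shuffle(perm)
        if perm != sentences:
            pairs.append((sentences, perm))
    return pairs

def pairwise_accuracy(pairs, score):
    """Fraction of pairs where the original gets the higher score."""
    correct = sum(1 for original, perm in pairs if score(original) > score(perm))
    return correct / len(pairs)

def toy_score(doc):
    """Stand-in for a model likelihood: penalize sentence pairs that are out of order."""
    inversions = sum(1 for i in range(len(doc))
                     for j in range(i + 1, len(doc)) if doc[i] > doc[j])
    return -inversions

# Hypothetical five-sentence document; sentence ids stand in for sentences.
document = ["s1", "s2", "s3", "s4", "s5"]
pairs = make_pairs(document, n_perms=20)
accuracy = pairwise_accuracy(pairs, toy_score)
```

A scorer that ignores order would be right about half the time, which is where the 50% baseline on the slide comes from.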

Page 34

Summary of results on academic writing

• Correct = higher likelihood for original article
  – versus permuted article

• D-seq model

ACL conference    Accuracy
Abstract          62.9
Introduction      68.8
Related work      72.7

Baseline = 50%

Page 35

Do sentence types distinguish VERY GOOD and TYPICAL science news?

• Create the HMM on VERY GOOD training articles

• Get likelihood and most likely state sequence for a new article
  – Compute features based on these

• A classifier is trained to predict the VERY GOOD article
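The likelihood and most likely state sequence can be computed with the standard forward and Viterbi algorithms. The sketch below uses a tiny hand-set HMM: the two states, all probabilities, the one-symbol-per-sentence simplification, and the particular features are assumptions for illustration, since the talk does not list them.

```python
import math

def forward_loglik(obs, states, start, trans, emit):
    """Log-likelihood of an observation sequence (forward algorithm)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return math.log(sum(alpha.values()))

def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence; assumes all probabilities are non-zero."""
    delta = {s: math.log(start[s] * emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            delta[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy model: each sentence is reduced to one symbol (a simplifying assumption).
STATES = ["definition", "speculation"]
START = {"definition": 0.6, "speculation": 0.4}
TRANS = {"definition": {"definition": 0.7, "speculation": 0.3},
         "speculation": {"definition": 0.4, "speculation": 0.6}}
EMIT = {"definition": {"VBZ": 0.9, "MD": 0.1},
        "speculation": {"VBZ": 0.2, "MD": 0.8}}

def features(obs):
    """Length-normalized log-likelihood plus the share of sentences in each state."""
    path = viterbi(obs, STATES, START, TRANS, EMIT)
    feats = {"loglik_per_sentence":
             forward_loglik(obs, STATES, START, TRANS, EMIT) / len(obs)}
    for s in STATES:
        feats["frac_" + s] = path.count(s) / len(path)
    return feats

article = ["VBZ", "VBZ", "MD"]
path = viterbi(article, STATES, START, TRANS, EMIT)
feats = features(article)
```

Feature dictionaries like `feats` would then be fed to the classifier mentioned in the last bullet.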

Page 36

Results on our corpus

Any Topic: Given an article, is it “VERY GOOD” or “TYPICAL”?

System               Accuracy
Baseline (random)    50%
HMM-productions      61%

Same Topic: Given a pair of articles on the same topic, which one is “VERY GOOD”?

System               Accuracy
Baseline (random)    50%
HMM-productions      63%

• 10 fold cross validation results
• SVM classifier

Page 37

>> Predicting reader interest

Louis & Nenkova, TACL 2013

Page 38

Predicting interest: A new task

• A lot of work on identifying what is wrong with a text
  – Spelling mistakes, grammar errors, incoherent writing

• It is not known how to characterize writing that is engaging, interesting and nice

Page 39

Approach to feature development

• Focus on interpretable features
  – Only 41 features
  – Each feature is a composite one: indicates an aspect directly
  – Linguistically interesting

• Confirm that features represent the intended aspect
  – Tune by checking feature values on random snippets of text

Page 40

1. Unusual words and phrases
Is the phrasing and language use unique?

• Word-based
  – high perplexity under a phoneme n-gram model
  – Eg: ‘undersheriff’, ‘powwow’, ‘chihuahua’, ‘qipao’

• Word pairs-based
  – adjective-noun, noun-noun, adverb-verb, subject-verb pairs
  – perplexity under a language model
  – Eg: ‘plasticky woman’, ‘so-called superkids’
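The word-based feature can be illustrated with a small n-gram perplexity sketch. Note the substitution: the slide names a phoneme n-gram model, while this example uses character bigrams with add-one smoothing so that it needs no pronunciation dictionary. The training list and names are invented.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz$"   # "$" marks the end of a word

def train(words):
    """Count character bigrams, with "^" as the start-of-word context."""
    bigrams, contexts = Counter(), Counter()
    for w in words:
        chars = "^" + w.lower() + "$"
        for a, b in zip(chars, chars[1:]):
            bigrams[a, b] += 1
            contexts[a] += 1
    return bigrams, contexts

def perplexity(word, model):
    """Per-character perplexity of a word under the add-one-smoothed bigram model."""
    bigrams, contexts = model
    chars = "^" + word.lower() + "$"
    logp = 0.0
    for a, b in zip(chars, chars[1:]):
        p = (bigrams[a, b] + 1) / (contexts[a] + len(ALPHABET))
        logp += math.log(p)
    return math.exp(-logp / (len(chars) - 1))

# Tiny invented training list of common words; a real model would use a large lexicon.
common = ["the", "then", "there", "other", "mother", "father",
          "that", "this", "these", "those"]
model = train(common)
usual = perplexity("the", model)
unusual = perplexity("qipao", model)
```

Words whose letter (or phoneme) sequences are rare in the training data, such as ‘qipao’, get a higher perplexity than familiar ones.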

Page 41

2. Visual nature
Is there scene setting?

• Creating a large lexicon of visual terms
  – Source: an image-tagged corpus
  – Large source of potentially visual words, but noisy

• Create LDA-based topics on the tag set
  – Use the manual MRC terms to filter out non-visual topics

Example topics:
grass, mountain, green, hill, blue, field, sand...
round, ball, circles, logo, dots, square, sphere...
silver, white, diamond, gold, necklace, chain...
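Once a visual lexicon exists, features such as "visual words at the beginning and end of an article" reduce to counting. The sketch below is an assumption about how such a feature might look: the lexicon is just the first example topic above, and the 20% edge window is an arbitrary choice.

```python
import re

# Tiny stand-in lexicon taken from the first example topic; the real lexicon is much larger.
VISUAL = {"grass", "mountain", "green", "hill", "blue", "field", "sand"}

def visual_density(tokens):
    """Share of tokens that appear in the visual lexicon."""
    return sum(t in VISUAL for t in tokens) / len(tokens) if tokens else 0.0

def visual_features(text, edge=0.2):
    """Visual-word density in the opening, the closing, and the whole text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n = max(1, int(len(tokens) * edge))   # size of the opening/closing window (assumed)
    return {
        "begin": visual_density(tokens[:n]),
        "end": visual_density(tokens[-n:]),
        "total": visual_density(tokens),
    }

# Invented example text with a scene-setting opening.
text = ("Green grass covered the hill. Researchers measured the samples in the lab "
        "and reported the results to the team")
feats = visual_features(text)
```

Here the opening is dense in visual words while the rest is not, so the "begin" value is well above the overall density.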

Page 42

Human interest and text structure

3. Use of people in the story
Does the story revolve around a person?
  – animacy information from NEs, pronouns, ngram patterns

4. Sub-genre
Is the article a narrative, interview or dialog?
  – Eg: narrative score ~ past tense verbs, pronouns, proper names

Page 43

Sentiment and Research

5. Affect
Is there an emotional angle to the story?
  – using sentiment word dictionaries

6. Research content
How much explicit research description is present?
  – using a hand-built dictionary of research words

Page 44

How the features vary in a random sample of very good and typical articles (t-test)

Higher values in VERY GOOD set

✓ Visual words in beginning and end of articles
☓ Total visual words
☓ Animacy counts
✓ Unusual words and phrases
☓ Narrative, interview or dialog format
✓ Sentiment words, negative polarity
✓ Research words
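A comparison like this can be run per feature with a two-sample t statistic. The slide only says "t-test"; the Welch (unequal-variance) form used below is an assumption, and the feature values are made up.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic; positive when sample a has the higher mean."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Invented values of one feature in the two article categories.
very_good = [0.30, 0.35, 0.40, 0.32]
typical = [0.20, 0.22, 0.25, 0.21]
t = welch_t(very_good, typical)
```

A large positive statistic corresponds to a check mark in the list above (higher values in the VERY GOOD set); significance would still need the matching degrees of freedom and a t-distribution table or library.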

Page 45

Accuracies on the two tasks

Any Topic: Given an article, is it “VERY GOOD” or “TYPICAL”?

System                          Accuracy
Baseline (random)               50%
Interesting-science features    75%

Same Topic: Given a pair of articles on the same topic, which one is “VERY GOOD”?

System                          Accuracy
Baseline (random)               50%
Interesting-science features    68%

• 10 fold cross validation results
• SVM classifier

Page 46

Combining interest with other aspects

Feature set                                            Any topic   Same topic
Interesting science                                    75.3        68.0

Previous methods for predicting other aspects:
Readable, 16 features                                  65.5        63.0
  (article length, language model, cohesion, syntax)
Well-written, 23 features                              59.1        59.9
  (entity grid [BL08], discourse relations [PN08])
Interesting-fiction [ML09], 22 features                67.9        62.8

Combination of all features:
All writing aspects                                    76.7        74.7

Different aspects of writing have complementary strengths

Genre-specific measures are stronger than generic ones

Page 47

Conclusions

• Text quality is an interesting and challenging task

• More success on the topic recently
  – application to novels, tweets, essays

• Future work
  – A lot to be done in terms of formalizing the tasks, collecting data, models and evaluation
  – Transferring the knowledge to generating texts

Page 48

Thank you!

