Homing in on the Text-Initial Cluster
Mike Scott
School of English
University of Liverpool
Aston Corpus Symposium
Friday May 4th 2007
This presentation is at
www.lexically.net/downloads/corpus_linguistics
Starting Questions
1. Are clusters like “Once upon a time” and “lived happily ever after” oddities in marking text position?
2. Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text?
3. If so, are there any common patterns in text-initial clusters?
Context
Textual Priming Project, University of Liverpool
Michael Hoey
Michaela Mahlberg
Matthew O’Donnell
Mike Scott
Textual Priming Project: Aims
to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position
to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts.
to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis.
from O’Donnell et al 2007
Hard News Corpus
“Home News” sections of the Guardian and Observer, 1998 to 2004
115,654 articles divided thus:
headline & lead
1st sentence of 1st paragraph (TISC)
all other sentences
TISC contains 3.2 million tokens
The rest: 51.2 million tokens
About 470 words per article
Research Questions
Using the hard news corpus,
1. How many 3-5 word clusters are found to be key in TISC sections?
2. How many are positively and how many are negatively key?
3. What recurrent patterns can be found in the two types of key cluster?
Methods (1)
1. Format the corpus in XML and separate out all TISC sections (done by Matt O’Donnell)
2. Use WordSmith’s WordList tool to compute wordlist indexes of
   1. all the text
   2. all the TISC sections
3. Using WordList, compute 3-5 word clusters for each index, save as .lst
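Step 3 can be pictured with a minimal sketch in Python. This is not WordSmith's implementation: the tokenisation rule, the function names and the two toy sentences are all assumptions made purely for illustration. It counts every contiguous 3-5 word cluster without crossing sentence boundaries.

```python
from collections import Counter
import re

def tokenize(text):
    """Uppercase the text and pull out word tokens (a deliberately crude, assumed rule)."""
    return re.findall(r"[A-Z0-9']+", text.upper())

def count_clusters(sentences, n_min=3, n_max=5):
    """Count contiguous n-grams of n_min..n_max words, one sentence at a time."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Invented sentences standing in for TISC sections (not taken from the corpus).
toy_tisc = [
    "A man was jailed for life yesterday for the murder of his wife.",
    "A teenager was jailed for six years yesterday.",
]
clusters = count_clusters(toy_tisc)
print(clusters.most_common(3))
```

Running the same function over all the text and over the TISC sections alone would give the two cluster lists that the later keyness comparison needs.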
Top clusters, all sections
GUARDIAN CO UK
ONE OF THE
A HREF HTTP, WWW GUARDIAN CO and similar web links
THE PRIME MINISTER
THE END OF
AS WELL AS
THE NUMBER OF
THERE IS A
SOME OF THE
THERE IS NO
Top clusters, TISC
ONE OF THE
ACCORDING TO A
LAST NIGHT AFTER
FOR THE FIRST
THE FIRST TIME
IS TO BE
FOR THE FIRST TIME
THE MURDER OF
ARE TO BE
THE DEATH OF
OF THE MOST
THE HOME SECRETARY
WAS LAST NIGHT
IT EMERGED YESTERDAY
AS PART OF
AN ATTEMPT TO
THE UNITED STATES
THE NUMBER OF
ONE OF THE MOST
ACCORDING TO THE
Methods (2)
4. Use KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus
5. Identify patterns in the KW clusters
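The keyness comparison in step 4 can be sketched as follows. This uses a common two-term form of the log likelihood statistic and may differ in detail from what the KeyWords tool computes. The function names and the example frequencies are invented; the token counts come from the corpus slide, with "all the text" assumed to be 3.2 + 51.2 million tokens.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Signed log likelihood for one cluster: positive if relatively more
    frequent in the study corpus, negative if relatively less frequent."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    positively_key = freq_study / size_study >= freq_ref / size_ref
    return g2 if positively_key else -g2

def p_value(g2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(abs(g2) / 2))

TISC_TOKENS = 3_200_000   # from the corpus slide
ALL_TOKENS = 54_400_000   # assumption: 3.2 + 51.2 million tokens

# Invented frequencies for a single cluster, for illustration only.
g2 = log_likelihood(900, TISC_TOKENS, 1_500, ALL_TOKENS)
print(g2, p_value(g2))
```

A cluster would then count as key when its p value falls below the chosen threshold and its frequency meets the minimum, with the sign of the score separating positively from negatively key items.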
TISC key clusters
ACCORDING TO A
LAST NIGHT AFTER
IT EMERGED YESTERDAY
WAS LAST NIGHT
ARE TO BE
THE MURDER OF
LAST NIGHT WHEN
THE GOVERNMENT YESTERDAY
LAST NIGHT AS
IS TO BE
WERE LAST NIGHT
YESTERDAY AFTER A
TONY BLAIR YESTERDAY
COURT HEARD YESTERDAY
WAS TOLD YESTERDAY
WAS JAILED FOR
THE DEATH OF
YEAR OLD BOY
YESTERDAY WHEN THE
WITH THE MURDER OF
Numbers of Key Clusters
RQs 1 & 2: Numbers of KW clusters
using a p value of 0.0000001 and minimum frequency of 3 and log likelihood statistic,
8,132 key clusters altogether (in 3.2 million words of text)
of which 7,631 were positively key and 501 negatively key
though there is repetition as these are 3-5 word n-grams
Research Question 2
Repetition
YESTERDAY FOUND GUILTY
YESTERDAY FOUND GUILTY OF
YESTERDAY FROM A
YESTERDAY FROM THE
YESTERDAY GAVE A
YESTERDAY GAVE HIS
YESTERDAY GAVE THE
YESTERDAY GIVEN A
YESTERDAY GIVEN THE
YESTERDAY GIVEN THE GO
YESTERDAY GIVEN THE GO AHEAD
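One possible way to gauge this repetition is to keep only clusters that are not contained in a longer cluster on the same list. The talk does not say that any such collapsing was done, so the sketch below is only an assumption about how it could be approached; the function names are invented.

```python
def is_contained(short, long):
    """True if the words of `short` occur contiguously inside the longer `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def maximal_clusters(clusters):
    """Drop every cluster that is wholly contained in a longer cluster on the list."""
    return [c for c in clusters
            if not any(is_contained(c, other) for other in clusters)]

repeated = [
    "YESTERDAY FOUND GUILTY",
    "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY GAVE A",
    "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO",
    "YESTERDAY GIVEN THE GO AHEAD",
]
print(maximal_clusters(repeated))
```

On these six clusters from the slide, three survive: the two longest chains and the one cluster that has no longer variant in the sample.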
Negatively key:
A LOT OF
A SPOKESMAN FOR
THERE IS NO
HE SAID THE
SAID IT WAS
THERE IS A
THIS IS A
THE FACT THAT
AS WELL AS
IT WOULD BE
SPOKESMAN FOR THE
PER CENT OF
WE HAVE TO
SAID THAT THE
BUT IT IS
AT A TIME
A SPOKESMAN FOR THE
SAID HE WAS
IT IS NOT
THERE WAS NO
RQ 1: Numbers of KW clusters
Is 8 thousand a large number of distinct key text-initial clusters?
In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether…
about one in 10 is associated with text initial position at the .0000001 level of significance
RQ 1, continued
… is 1 in 10 a large number to be key?
In the case of SISC (sentences from paragraphs with only one sentence in), we get 507 thousand clusters, of which 2,192 are key (1,747 positively and 445 negatively), which is about 1 in 230
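The "1 in N" figures follow directly from the counts on the slides. A quick check, treating the rounded totals (84 thousand and 507 thousand) as exact, which is an assumption:

```python
def one_in(total_clusters, key_clusters):
    """Express the share of key clusters as 'one in N'."""
    return total_clusters / key_clusters

# TISC: 8,132 key clusters (7,631 positive + 501 negative) out of about 84 thousand.
tisc_key = 7_631 + 501
tisc_ratio = one_in(84_000, tisc_key)

# SISC: 2,192 key clusters (1,747 positive + 445 negative) out of about 507 thousand.
sisc_key = 1_747 + 445
sisc_ratio = one_in(507_000, sisc_key)

print(f"TISC: about 1 in {tisc_ratio:.1f}")
print(f"SISC: about 1 in {sisc_ratio:.1f}")
```

On these figures key clusters are a little over twenty times more common, proportionally, in TISC than in SISC.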
PATTERNS
RQ 3: patterns
recency: in the top 200, seventy express time,
generally using yesterday or last night
Recency clusters
COURT HEARD YESTERDAY
TONY BLAIR YESTERDAY
YESTERDAY AFTER A
WERE LAST NIGHT
LAST NIGHT AS
THE GOVERNMENT YESTERDAY
LAST NIGHT WHEN
WAS LAST NIGHT
IT EMERGED YESTERDAY
LAST NIGHT AFTER
YESTERDAY IN A
IT EMERGED LAST NIGHT
A COURT HEARD YESTERDAY
YESTERDAY WHEN A
YESTERDAY AFTER THE
EMERGED LAST NIGHT
LAST NIGHT TO
YESTERDAY AS THE
YESTERDAY WHEN THE
WAS TOLD YESTERDAY
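Picking out recency clusters from a ranked key-cluster list can be approximated with a simple pattern match. The talk does not say how the seventy time expressions were identified, so the regular expression and function name below are assumptions; the sample is the first five TISC key clusters from the slides.

```python
import re

# Assumed markers of recency; the slides mention "yesterday" and "last night".
RECENCY = re.compile(r"\b(YESTERDAY|LAST NIGHT)\b")

def recency_share(clusters):
    """Return the clusters that contain a recency marker and their share of the list."""
    hits = [c for c in clusters if RECENCY.search(c)]
    return hits, len(hits) / len(clusters)

sample = [
    "ACCORDING TO A",
    "LAST NIGHT AFTER",
    "IT EMERGED YESTERDAY",
    "WAS LAST NIGHT",
    "ARE TO BE",
]
hits, share = recency_share(sample)
print(hits, share)
```

A pattern match of this kind only finds the explicit markers, so time expressions worded differently would still need manual inspection.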
Superlatives
ONE OF BRITAIN'S MOST
ONE OF THE MOST
OF THE WORLD'S
THE FIRST TIME
OF BRITAIN'S MOST
FOR THE FIRST
FOR THE FIRST TIME
Research, Report etc.
ACCORDING TO A REPORT
A COURT HEARD (YESTERDAY)
ACCORDING TO RESEARCH
TO A SURVEY
IT EMERGED LAST NIGHT
IT WAS ANNOUNCED YESTERDAY
IT WAS REVEALED YESTERDAY
A REPORT PUBLISHED
ACCORDING TO A STUDY
TO RESEARCH PUBLISHED
Attention-grabbers
IT EMERGED THAT
OBSERVER CAN REVEAL
THE OBSERVER CAN REVEAL
Indefinite articles positively key….
A BABY GIRL
A BAN ON
A BEACH IN
A BID TO
A BITTER ROW
A BLACK MAN
A BLISTERING ATTACK ON
A JURY WAS TOLD YESTERDAY
A LABOUR MP
A LANDMARK RULING
A LAST DITCH ATTEMPT TO
A LAST MINUTE
A LEADING BRITISH
A LEADING SCIENTIST
A LEGAL BATTLE
A LEGAL CHALLENGE
Indefinite articles negatively key
A COUPLE OF
A GREAT DEAL
A KIND OF
A LOT MORE
IT + reporting verb – positively key
IT WAS ANNOUNCED LAST NIGHT
IT WAS CLAIMED LAST NIGHT
IT WAS CONFIRMED LAST NIGHT
IT IS REVEALED TODAY
IT otherwise negatively key:
IT IS A
IT IS ABOUT
IT IS EXPECTED
IT IS GOING
IT IS ONLY
IT IS POSSIBLE
IT SEEMS TO
SAID YESTERDAY – positively key
SAID YESTERDAY AFTER
SAID YESTERDAY THAT HE
SAID YESTERDAY THEY HAD
SAID without time – negatively key
SAID AT THE
SAID HE HAD
SAID HE WOULD
SAID THE GOVERNMENT
SAID THERE WAS NO
Conclusions
The “once upon a time” syndrome seems to be much more common than might be thought.
In text-initial sections of 115 thousand hard news stories (3.2 m. words), about 8 thousand of the 84 thousand 3-5 word clusters, roughly 1 in 10, had text-initial significance
whereas in non-text-initial sections (SISC) only about 1 in 230 was key
Other patterns
recency
superlatives
research, report
attention-grabbers
indefinite articles
IT + reporting verb; SAID + time
References
O’Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming) ‘When the text counts’: Exploring the Implications of ‘text’ as unit in corpus linguistics. Paper presented at PALC, Łódź, April 2007.