Claudiu Mihăilă
The National Centre for Text Mining
The University of Manchester
News Search Using Discourse Analytics
Data
Exponential growth of data
Information overload
Growing
Data deluge
Pouring
Can we process a deluge of data
in a useful manner?
Processing
Searching
Give a query as input
Obtain a set of relevant articles
Keywords vs. semantics
– Synonyms
– Hyponyms
– Spelling variants
– Inflections
– Relations between query terms
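The gap between keyword and semantic matching can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the VARIANTS table, the sample documents and the function names are hypothetical, and a real system would draw synonyms, hyponyms and inflections from a lexical resource rather than a hand-written dictionary.

```python
import re

# Hypothetical, hand-written variant table (assumption for illustration only):
# inflections, a spelling variant, a synonym and two hyponyms of "crime".
VARIANTS = {
    "crime": {"crime", "crimes", "offence", "offences", "offense",
              "burglary", "murder"},
}

def tokens(text):
    """Return the set of lower-cased word tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_match(query_term, text):
    """Plain keyword matching: the exact term must occur in the text."""
    return query_term in tokens(text)

def expanded_match(query_term, text):
    """Match after expanding the term with its known variants."""
    candidates = VARIANTS.get(query_term, {query_term})
    return bool(candidates & tokens(text))

# Hypothetical document titles.
docs = [
    "Crime rate in Sandwich",
    "Two offences reported in Sandwich",
    "Murder at Sandwich",
]

keyword_hits = [d for d in docs if keyword_match("crime", d)]
expanded_hits = [d for d in docs if expanded_match("crime", d)]
```

With these sample titles, plain keyword matching finds only the first document, while the expanded query also reaches the two that never use the word "crime".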
Searching
Crimes in the town of Sandwich
Keywords
– Crime Sandwich by Click Bang
Productions on SoundCloud
– Sandwich Crime - Topix
– Crime on rye: Four accused of
stealing $10 sandwich from car
– Crime Scene Sandwich Bags
– Crime rate in Sandwich, Illinois (IL):
murders, rapes, robberies
– Ham Sandwich Nation: Due Process
When Everything is a Crime
Searching
Crimes in the town of Sandwich
Semantics
– Kent Police issue warning after fake
£20 notes reported in Sandwich
– Trio jailed for total of 30 years after
crime spree in Sandwich
– Murder at Sandwich - Kent
Semantic search engine
Specification of semantic types of
search terms: town:Sandwich
Normalisation of semantic entities:
Sandwich, Kent = Sandwich, UK
Relations between search terms to
describe events: location:Sandwich
Restrictions on discourse context of
retrieved events
Features
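The first three features (semantic types, normalisation, and relations between terms) might look roughly like this in code. It is a sketch under assumptions, not a description of any actual engine: the GAZETTEER table, the entity identifiers, the Event class and the search function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical gazetteer: several surface forms map to one canonical id,
# so "Sandwich, Kent" and "Sandwich, UK" denote the same typed entity.
GAZETTEER = {
    "sandwich, kent": "town:Sandwich_UK",
    "sandwich, uk": "town:Sandwich_UK",
    "sandwich, illinois": "town:Sandwich_IL",
}

def normalise(mention):
    """Map a surface mention to its canonical entity id (None if unknown)."""
    return GAZETTEER.get(mention.strip().lower())

@dataclass(frozen=True)
class Event:
    event_type: str  # what happened, e.g. "crime"
    location: str    # canonical id of the entity filling the location role

def search(events, event_type, location_mention):
    """Return events of the given type whose location role is the
    normalised form of the mention."""
    location = normalise(location_mention)
    return [e for e in events
            if e.event_type == event_type and e.location == location]

# Hypothetical index of already-extracted events.
events = [
    Event("crime", "town:Sandwich_UK"),
    Event("crime", "town:Sandwich_IL"),
    Event("election", "town:Sandwich_UK"),
]

hits = search(events, "crime", "Sandwich, Kent")
```

Because the query is resolved to a typed, normalised entity before matching, the Illinois town and the unrelated election event are both excluded.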
Structured events
The event
Discourse interpretation
Karl Munro may have killed Sunita in Weatherfield in 2013.
According to Karl Munro, Craig Tinker set Sunita on fire in Weatherfield in 2013.
Karl Munro said he will kill Sunita.
Karl Munro didn’t fail to kill Sunita in Weatherfield in 2013.
Stella Price condemned all of Karl’s wrongdoings.
The story
ACE corpus
599 news-domain documents
– News articles
– Transcripts of broadcast news
– Transcripts of broadcast conversation
– Conversational telephone speech
– Weblogs
– Discussion fora
Polarity
Tense
Specificity
Modality
Source type
Subjectivity
2005 version
Discourse-related attributes
Discourse context of events
Scheme
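Filtering events by their discourse context can be sketched as below. The attribute names follow the list above, but the values attached to each sentence are assumptions made for illustration, not annotations taken from the ACE corpus, and filter_events is a hypothetical helper.

```python
# Three of the example sentences, with assumed (illustrative) attribute values.
events = [
    {"text": "Karl Munro may have killed Sunita in Weatherfield in 2013.",
     "polarity": "positive", "tense": "past", "modality": "other"},
    {"text": "Karl Munro said he will kill Sunita.",
     "polarity": "positive", "tense": "future", "modality": "other"},
    {"text": "Karl Munro didn’t fail to kill Sunita in Weatherfield in 2013.",
     "polarity": "positive", "tense": "past", "modality": "asserted"},
]

def filter_events(events, **restrictions):
    """Keep only events whose attributes equal every given restriction."""
    return [e for e in events
            if all(e.get(name) == value
                   for name, value in restrictions.items())]

# Restrict to past events that the text presents as fact.
asserted_past = filter_events(events, tense="past", modality="asserted")
```

Under these assumed values, only the third sentence survives the restriction: the first is past but speculative, and the second refers to a future event.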
New York Times corpus
20 years' worth of news articles: 1.8M
Includes annotations of
– Metadata
– Named entities
– Normalisation
Facilitates diachronic studies
– Language evolution
– Social change
– Development of events
Digital archive
ISHER
Web-based
User-friendly interface
Intuitive query-building mechanism
Refining/filtering according to facets
Semantically enabled searching
Automatic Event Recognition - EventMine
Miwa, Thompson, Ananiadou (2012). Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics, 28(13), 1759-1765.
Web-based interface – “Coronation Street”
Semantic clustering
Lingo – 3rd party
NaCTeM clustering
Semantic clustering
Cluster summarisation
Metadata in the NYT corpus
Entities
Events
Prime Minister Tony Blair’s election last month
Final remarks
The same technique can be adapted to other domains
Previously developed
– EUPMC: medical journal articles
– ASCOT: clinical trials
Other domains
Final remarks
Summary
Enhanced access to information within
digital heritage archives (NYT)
Identified discourse phenomena to
search for and filter events
Created ISHER, a semantic search
engine to access the NYT corpus
Future work
Apply to new domains and institutional
repositories
Customise towards social unrest
Diachronic studies
Other languages in danger of digital
extinction – Meta-Net
Thank you!