Beatrice Alex
Edinburgh Language Technology Group
School of [email protected]
@bea_alex
Big Data Approaches to Intellectual, Cultural and Linguistic History, Helsinki, 01/12/2014
Text mining big data: potential and challenges
LTG
The Edinburgh Language Technology Group
Research and development of natural language processing techniques and technology.
Collaboration in projects with partners in a range of different disciplines (biodiversity, biomedicine, history and literature).
Aggregating, text mining, geo-parsing and linking data.
LTG
Ongoing projects:
Palimpsest (Mining Literary Edinburgh, AHRC)
UK Connectivity (Analysis of social media, British Council)
BotaniTours (Information aggregation and presentation of botanical points of interest in the Scottish Borders, Smart Tourism and dot.rural).
Trading Consequences (Text mining large 19th century text collections for trends in commodity trading, Jisc, ESRC, AHRC).
New: Text mining brain scan reports for clinical neurologists.
TEXT MINING
Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources.
Turns unstructured text into structured data (e.g. relational database or linked data).
Is very useful for analysing large text collections automatically (overcoming data paralysis).
Goal: Analyse large amounts of textual data to enable scholars to discover novel patterns and explore hypotheses.
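As a rough illustration of turning unstructured text into structured data, here is a minimal sketch that finds place-name mentions by dictionary lookup and emits one record per mention. The place list, the `Mention` record and the function name are assumptions made for this example; they are not the LTG pipeline, which uses far richer linguistic processing.

```python
import re
from dataclasses import dataclass

# Hypothetical mini list of place names; a real system would use a full gazetteer.
PLACE_NAMES = ["Grassmarket", "Edinburgh", "Castle"]


@dataclass
class Mention:
    text: str   # surface form found in the text
    start: int  # character offset where the mention starts
    end: int    # character offset just past the mention
    label: str  # entity type


def find_place_mentions(text):
    """Return one structured record per place-name mention, in text order."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, PLACE_NAMES)) + r")\b")
    return [Mention(m.group(0), m.start(), m.end(), "LOCATION")
            for m in pattern.finditer(text)]


sentence = "Now they were in a great square, the Grassmarket, with the Castle."
rows = find_place_mentions(sentence)
for row in rows:
    print(row)
```

Each record carries character offsets, so a row in a database or a linked-data triple can always point back to the exact place in the source text.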
TYPES OF ANALYSES
Named entity recognition.
Grounding, e.g. geo-referencing.
Relation extraction.
Clustering, e.g. topic modelling.
Sentiment analysis.
TYPES OF ANALYSES
Location mention recognition and geo-parsing output for Picturesque Notes by R.L.Stevenson (Palimpsest, http://palimpsest.blogs.edina.ac.uk/ @LitPalimpsest).
Trading Consequences visualisation interface (@digtrade http://tradingconsequences.blogs.edina.ac.uk)
TYPES OF ANALYSES
UK Connectivity: Sentiment analysis of tweets on the Commonwealth Games Opening Ceremony in Glasgow 2014.
[Figure: sentiment chart; labels: pipes, drums, Queen; dogs; home nations; baton; Rod]
TYPES OF ANALYSES
UK Connectivity project: person names and sentiment in Twitter data about Ukraine for one week each in March, June and July 2014.
TYPES OF ANALYSES
Geo-referenced user location data of tweeters talking about Ukraine (panels: early March 2014, end of June 2014, mid July 2014).
TYPES OF ANALYSES
<pal-snippet end="s432" start="s432" score="0.43">"When we go to Edinburgh," she said, "remind me while we're there to go and visit Miss Brodie's grave."</pal-snippet>
...
<pal-snippet end="s547" start="s546" score="0.71">Now they were in a great square, the Grassmarket, with the Castle, which was in any case everywhere, rearing between a big gap in the houses where the aristocracy used to live. It was Sandy's first experience of a foreign country, which intimates itself by its new smells and shapes and its new poor.</pal-snippet>
Snippet analysis to rank by “interestingness”.
The Prime of Miss Jean Brodie (Muriel Spark)
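Ranking snippets in this format could look roughly like the sketch below, which parses `pal-snippet` elements and sorts them by their `score` attribute. It assumes that a higher score means a more interesting snippet; the snippet texts are shortened from the slide, and the wrapper element `snippets` and the function name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two snippets in the pal-snippet format, with the text shortened for the example.
raw = (
    '<pal-snippet end="s432" start="s432" score="0.43">'
    '"When we go to Edinburgh," she said</pal-snippet>'
    '<pal-snippet end="s547" start="s546" score="0.71">'
    'Now they were in a great square, the Grassmarket</pal-snippet>'
)


def rank_snippets(fragment):
    """Parse sibling pal-snippet elements and sort them by score, highest first."""
    # Wrap in a dummy root so that several sibling elements parse as one document.
    root = ET.fromstring("<snippets>" + fragment + "</snippets>")
    snippets = [
        {
            "start": el.get("start"),
            "end": el.get("end"),
            "score": float(el.get("score")),
            "text": el.text,
        }
        for el in root.iter("pal-snippet")
    ]
    return sorted(snippets, key=lambda s: s["score"], reverse=True)


ranked = rank_snippets(raw)
for snippet in ranked:
    print(snippet["score"], snippet["start"], snippet["text"])
```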
POTENTIAL
Making text collections more accessible.
Enabling distant reading.
Can be applied in an assisted curation setting.
Discovery.
Linking data sets.
Can be optimised with user input.
TM output is more accessible to users through visualisations.
ACCESSIBILITY
More and bigger available data sets.
Manual analysis becomes difficult, if not impossible.
Text mining makes textual data more accessible.
It can improve search but also provide other entry points into data, e.g. via visualisations.
ACCESSIBILITY
Text mining can be used to analyse trends in large collections and thereby enable distant reading.
At the same time it can be used to point readers to individual examples and direct them back to original sources (close reading).
Source: http://www.nassrgrads.com/online-academics-questions-for-grad-students/
ASSISTED CURATION
Automatic processing of the bulk of information, followed by more careful manual annotation, correction, selection.
Important when going from big to small data where quality matters.
Can produce high precision and recall.
ASSISTED CURATION
Text mining output visualised in the Palimpsest assisted curation interface developed at SACHI, University of St Andrews.
DISCOVERY
LINKING
Text mining often involves linking data sets.
Location mention -> gazetteer entry with lat/lon
Person name -> Wikipedia page
Gene mention -> unique identifier in gene ontology
Plant -> Wikispecies page
This can help discovery.
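The first kind of link above (location mention to gazetteer entry with lat/lon) can be sketched as a simple lookup. The gazetteer below is a hypothetical three-entry stand-in with rounded, approximate coordinates, and the function names are invented for the example; a real geo-parser also has to choose between several candidate places with the same name, which this sketch does not attempt.

```python
# Hypothetical mini-gazetteer; coordinates are rounded and approximate.
GAZETTEER = {
    "edinburgh": {"name": "Edinburgh", "lat": 55.95, "lon": -3.19},
    "glasgow": {"name": "Glasgow", "lat": 55.86, "lon": -4.25},
    "helsinki": {"name": "Helsinki", "lat": 60.17, "lon": 24.94},
}


def link_mention(mention, gazetteer=GAZETTEER):
    """Return the gazetteer entry for a mention, or None if it cannot be linked."""
    return gazetteer.get(mention.strip().lower())


def link_all(mentions):
    """Map each mention string to its gazetteer entry (or None)."""
    return {mention: link_mention(mention) for mention in mentions}


linked = link_all(["Edinburgh", "Helsinki", "Atlantis"])
for mention, entry in linked.items():
    print(mention, "->", entry)
```

Keeping unlinked mentions as `None` rather than dropping them makes it easy to see, during error analysis, what the gazetteer does not cover.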
LINKING
BotaniTours: Combining various data sets as well as linking out to other existing sites containing plant species information.
http://groups.inf.ed.ac.uk/BotaniTours
USER INPUT
Iterative development (meetings, prototyping, interviews, manual annotation, continuous feedback).
USER INPUT
User workshop to improve the functionality of the Trading Consequences interface at CHESS 2013 organised by Prof. Colin Coates and colleagues (Hinrichs et al., DH 2014)
USER INPUT
Annotations of brain scan reports to develop a text mining pipeline for this data. Collaboration with Dr. William Whiteley.
DATA VISUALISATION
Photo by: Daniel Belasco Rogers
DATA VISUALISATION
Trading Consequences location tree visualisation developed by Uta Hinrichs at the University of St Andrews.
CHALLENGES
Availability of data sources
Gazillions of formats
Data quality
Data size
Limitations of text mining
AVAILABILITY
Even though original printed sources might be out of copyright, their electronic copies often aren’t.
Even if they are out of copyright, it can take a long time before a collection is actually available.
Make it freely available and you’ll be amazed what other people can do with it. Go on!
“open but not free” :-(
FORMATS
Various gazetteers combined into one Edinburgh gazetteer for the Palimpsest project.
NOISY DATA
QUALITY RATING
Alex and Burns, DATeCH 2014.
Quality Distribution
We need more clarity on the quality of existing digitised collections.
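One simple proxy for the quality of digitised text is the share of tokens that appear in a word list. The sketch below illustrates that idea only; it is an assumption for this example and not necessarily the rating method of Alex and Burns (DATeCH 2014). The tiny lexicon and the two sample strings are made up.

```python
import re

# Tiny stand-in lexicon; a real check would load a full word list.
LEXICON = {"the", "price", "of", "sugar", "rose", "in", "london"}


def quality_score(text, lexicon=LEXICON):
    """Return the fraction of alphabetic tokens found in the lexicon (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


clean = "The price of sugar rose in London"
noisy = "Tbe prlce of fugar rofe in Londcn"  # simulated OCR errors

print(round(quality_score(clean), 2))
print(round(quality_score(noisy), 2))
```

Scores like these could be binned into a rating per document, which is one way to make the quality of a collection explicit before mining it.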
NATURE OF DATA
BotaniTours: Geo-referenced flowering plant data from GBIF for the Scottish Borders.
DATA SIZE
We can process hundreds of thousands of documents, millions of pages, billions of words, millions of tweets (but often less), usually gigabytes (or less, rarely more). Is that big data?
The entire British Library Nineteenth Century Books collection is 16 TB (1 TB of text, 15 TB of images).
Parallelisation is important.
Shallow text processing can be done relatively quickly but deeper semantic analysis still takes too long to be practical.
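Shallow processing parallelises well because each document can be handled independently and the partial results merged afterwards. The sketch below shows that pattern with token counting as a stand-in for a real pipeline step. The function names and sample documents are invented for the example, and the thread pool is only a default chosen to keep the example self-contained; for CPU-bound work in CPython one would typically pass `ProcessPoolExecutor` instead.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_tokens(document):
    """Shallow per-document step: lowercase, split on whitespace, count tokens."""
    return Counter(document.lower().split())


def count_tokens_parallel(documents, workers=4, executor_cls=ThreadPoolExecutor):
    """Process documents independently, then merge the partial counts."""
    total = Counter()
    # executor_cls can be swapped for concurrent.futures.ProcessPoolExecutor
    # (called from under an `if __name__ == "__main__":` guard) for CPU-bound steps.
    with executor_cls(max_workers=workers) as pool:
        for partial in pool.map(count_tokens, documents):
            total.update(partial)
    return total


docs = ["coal from Newcastle", "tea from India", "coal and tea"]
counts = count_tokens_parallel(docs)
print(counts.most_common(3))
```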
LIMITATIONS OF TM
Text mining is not 100% accurate, especially if the data contains errors or isn’t running text.
Intrinsic evaluation (using a gold standard) and error analysis are important.
Openness about performance helps to manage users' expectations and lets them understand the strengths and weaknesses of our technology.
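Intrinsic evaluation against a gold standard usually comes down to precision, recall and F1 over the annotations. The sketch below computes them for exact matches of (start, end, label) tuples; the gold and predicted annotations are hypothetical and are not results from any LTG project.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones, using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations as (start offset, end offset, label).
gold = {(0, 9, "LOCATION"), (20, 26, "COMMODITY"), (40, 47, "LOCATION")}
predicted = {
    (0, 9, "LOCATION"),
    (20, 26, "LOCATION"),  # right span, wrong label: counts as an error
    (40, 47, "LOCATION"),
    (60, 65, "DATE"),      # spurious mention
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The items that fall outside the intersection of the two sets are the starting point for error analysis: missed gold annotations on one side, spurious or mislabelled predictions on the other.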
Trading Consequences: Intrinsic evaluation of the prototype and an improved system (Klein et al. 2014).
And geo-referencing evaluation when varying the number of GeoNames candidates considered. (Alex et al., to appear)
SUMMARY
We are consumers and creators of data: we use existing text collections, mine and enrich them.
Text mining can be useful for assisting and speeding up manual analysis.
We often work with visualisations of mined data in order to facilitate their analysis.
We like to tailor our technology to users and ask them for feedback to improve its performance.
SUMMARY
Better processes need to be put in place to get access to data. Some providers (e.g. HathiTrust) already do this very well.
Data standards can be useful. Format conversion takes up huge amounts of my time and I haven’t seen this improve much yet.
It’s important to make the quality of data more explicit. We need to think about how to deal with low quality and fuzzy data.
Text mining is not going to replace human analysis!
LTG STAFF AND STUDENTS
Prof. Ewan Klein
Prof. Jon Oberlander
Dr. Claire Grover
Dr. Colin Matheson
Dr. Beatrice Alex
Richard Tobin
Dr. Kate Byrne
Dr. Michael Roth
Xuri Tang (visiting)
Clare Llewellyn
Daniel Duma
Paolo Pareti
Amy Isard
LTG TOOLS
The Edinburgh Geoparser: a tool for geo-referencing text.
LT-XML2 and LT-TTT2: XML-based software for shallow linguistic processing of text.
THANK YOU
Questions?
Contact: [email protected]
Website: http://homepages.inf.ed.ac.uk/balex/
Twitter: @bea_alex
Next LaTeCH at ACL 2015 in Beijing!