Digital History and Big Data: text mining historical documents on trade in the British empire

Date post: 25-Dec-2014
Upload: beatrice-alex
Description: Beatrice Alex [email protected] Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013. Digital history and big data: Text mining historical documents on trade in the British Empire
Transcript
Page 1: Digital History and Big Data: text mining historical documents on trade in the British empire

Beatrice Alex [email protected]

Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013

Digital history and big data: Text mining historical documents on trade in the British Empire

Page 2: Digital History and Big Data: text mining historical documents on trade in the British empire

Overview

What is text mining?

Text Mining in digital history

Trading Consequences

“Big data”

Visualisation

Challenge of noisy data

Collaborating with historians

Page 3: Digital History and Big Data: text mining historical documents on trade in the British empire

Text Mining

Describes a set of linguistic, statistical and machine learning techniques that model and structure the information content of textual resources.

Turns unstructured text into structured data (e.g. relational database or linked data).

Is very useful for analysing large text collections automatically.

Page 4: Digital History and Big Data: text mining historical documents on trade in the British empire

Text Mining

TM methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation, syntactic parsing (chunking).

Currently our focus is on named entity recognition, entity grounding and relation extraction.
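
The first two pre-processing steps named above, sentence detection and tokenisation, can be sketched in a few lines of Python. This is a deliberately naive, regex-based illustration and not the pipeline used in the project; the function names and the sample text are made up for the example.

```python
import re

def split_sentences(text):
    """Naive sentence detection: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    """Naive tokenisation: numbers (with , or . inside), words, or single punctuation marks."""
    return re.findall(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]", sentence)

# Invented sample text, used only to exercise the two functions.
text = "Cassia bark was shipped from Padang. The cargo weighed 6,127 piculs."
sentences = split_sentences(text)
tokens = [tokenise(s) for s in sentences]
```

Real historical text needs more care (abbreviations such as "Dec." break the sentence splitter, for instance), which is one reason later steps such as part-of-speech tagging and chunking are usually left to dedicated tools.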

Page 5: Digital History and Big Data: text mining historical documents on trade in the British empire

TM in Digital History

Goal: By analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses.

Methods: linguistic text analysis, named entity recognition, geo-grounding and relation extraction to transform the text into structured data.

A sea change from the methods used in ‘traditional’ history.

Page 6: Digital History and Big Data: text mining historical documents on trade in the British empire

“Traditional” Historical Research

Cinchona plantations in George King’s A Manual of Cinchona Cultivation in India (1880).

Global Fats Supply 1894-98

Page 7: Digital History and Big Data: text mining historical documents on trade in the British empire

Trading Consequences

Digging into Data II project (till Dec. 2013)

Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex, Dr. Claire Grover, Clare Llewellyn, Richard Tobin, James Reid, Nicola Osborne, Ian Fieldhouse

Page 8: Digital History and Big Data: text mining historical documents on trade in the British empire

TRADING CONSEQUENCES

Page 9: Digital History and Big Data: text mining historical documents on trade in the British empire

Trading Consequences

What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century?

Scope: global, but with focus on Canadian natural resources.

Example questions:

‣ What were the routes and volumes of international trade in resource commodities in the nineteenth century?

‣ What were the local environmental consequences of this demand for these resources?

Page 10: Digital History and Big Data: text mining historical documents on trade in the British empire

Document Collections

Big data for historians:

Page 11: Digital History and Big Data: text mining historical documents on trade in the British empire

Mined Information

Example sentence:

Page 12: Digital History and Big Data: text mining historical documents on trade in the British empire

Mined Information

Example sentence:

Extracted entities:
commodity: cassia bark
date: 1871
location: Padang
location: America
quantity + unit: 6,127 piculs
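
A rough idea of how such entities could be pulled out of a sentence is sketched below, using small hand-made lexicons and regular expressions. Everything here is an assumption for illustration: the slide's example sentence is not preserved in this transcript, so `SENTENCE` is an invented stand-in chosen to contain the entities listed above, and the lexicons and patterns are not the project's actual recogniser.

```python
import re

# Tiny hand-made lexicons (assumptions for illustration only).
COMMODITIES = ["cassia bark", "cinchona"]
LOCATIONS = ["Padang", "America"]
UNITS = ["piculs", "tons"]

# A four-digit year that is not part of a larger comma-grouped number.
YEAR = re.compile(r"(?<![\d,])(1[5-9]\d{2})(?!,?\d)")
LOCATION = re.compile(r"\b(?:" + "|".join(map(re.escape, LOCATIONS)) + r")\b")
QUANTITY = re.compile(r"(\d{1,3}(?:,\d{3})*|\d+)\s+(" + "|".join(UNITS) + r")\b")

def extract_entities(sentence):
    """Return (label, text) pairs for commodities, dates, locations and quantities."""
    entities = []
    lowered = sentence.lower()
    for term in COMMODITIES:
        if term in lowered:
            entities.append(("commodity", term))
    for m in YEAR.finditer(sentence):
        entities.append(("date", m.group(1)))
    for m in LOCATION.finditer(sentence):
        entities.append(("location", m.group(0)))
    for m in QUANTITY.finditer(sentence):
        entities.append(("quantity + unit", f"{m.group(1)} {m.group(2)}"))
    return entities

# Invented sentence, NOT the one shown on the original slide.
SENTENCE = "In 1871, 6,127 piculs of cassia bark were exported from Padang to America."
entities = extract_entities(SENTENCE)
```

A lexicon lookup like this only finds what it already knows about, so it is best read as a picture of the output format rather than of the recognition method.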

Page 13: Digital History and Big Data: text mining historical documents on trade in the British empire

Mined Information

Example sentence:

Normalised and grounded entities:
commodity: cassia bark
date: 1871 (year=1871)
location: Padang (lat=-0.94924; long=100.35427; country=ID)
location: America (lat=39.76; long=-98.50; country=n/a)
quantity + unit: 6,127 piculs
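
Grounding can be pictured as a lookup from a place-name mention to a gazetteer record. The sketch below uses a toy in-memory gazetteer holding only the two entries shown on this slide; the class name, function names and dictionary are hypothetical, and a real system would query a full gazetteer and disambiguate between candidate places.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GroundedLocation:
    name: str
    lat: float
    long: float
    country: Optional[str]  # ISO country code, or None where the slide shows n/a

# Toy gazetteer (assumption): only the two places from the slide.
GAZETTEER = {
    "padang": GroundedLocation("Padang", -0.94924, 100.35427, "ID"),
    "america": GroundedLocation("America", 39.76, -98.50, None),
}

def ground_location(mention):
    """Return the gazetteer record for a place-name mention, or None if unknown."""
    return GAZETTEER.get(mention.strip().lower())

def normalise_year(mention):
    """Turn a four-digit year string into an integer, or None if it is not one."""
    return int(mention) if mention.isdigit() and len(mention) == 4 else None
```

For example, `ground_location("Padang")` returns the record with `country="ID"`, while an unknown name returns `None` so that ungrounded mentions stay visible to the user.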

Page 14: Digital History and Big Data: text mining historical documents on trade in the British empire

Mined Information

Example sentence:

Extracted entity attributes and relations:
origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
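
Once relations like these are extracted, they can be stored as rows in a relational database and queried. The following sketch uses Python's built-in `sqlite3` module with an in-memory database; the table layout, the column names and the choice to store origin/destination as a `role` on the location relation are assumptions made for the example, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE commodity_relation (
           commodity TEXT NOT NULL,
           relation  TEXT NOT NULL,   -- 'date' or 'location'
           value     TEXT NOT NULL,
           role      TEXT             -- 'origin' / 'destination' for locations
       )"""
)

# The relations listed on the slide, one row each.
rows = [
    ("cassia bark", "date", "1871", None),
    ("cassia bark", "location", "Padang", "origin"),
    ("cassia bark", "location", "America", "destination"),
]
conn.executemany("INSERT INTO commodity_relation VALUES (?, ?, ?, ?)", rows)

def locations_for(commodity):
    """Return (place, role) pairs related to a commodity, in insertion order."""
    cur = conn.execute(
        "SELECT value, role FROM commodity_relation "
        "WHERE commodity = ? AND relation = 'location' ORDER BY rowid",
        (commodity,),
    )
    return cur.fetchall()
```

With the data in this shape, a question such as "where was cassia bark shipped from and to?" becomes a single query: `locations_for("cassia bark")`.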

Page 15: Digital History and Big Data: text mining historical documents on trade in the British empire

Commodity Ontology

Page 16: Digital History and Big Data: text mining historical documents on trade in the British empire

Improved Search & Visualisations

Page 17: Digital History and Big Data: text mining historical documents on trade in the British empire

Improved Search & Visualisations

Page 18: Digital History and Big Data: text mining historical documents on trade in the British empire

Improved Search & Visualisations

Page 19: Digital History and Big Data: text mining historical documents on trade in the British empire

Noisy Data

Optical character recognition (OCR) output contains many errors, and the structure of the page layout is often lost. Contributing factors include:

Sophistication of the OCR engine and scanning equipment.

Quality of the original print and paper.

Use of historical language.

Information in page margins (header, page numbers, etc.).

Information in tables.

Language of the text.

Page 20: Digital History and Big Data: text mining historical documents on trade in the British empire

Fixing Noisy Data

Text normalisation and correction:

End-of-line soft hyphen removal

Dehyphenate all token-splitting hyphens using a dictionary-based approach.

“False f”-to-s conversion

Convert all false f characters to s using a corpus.

Example: the number of words unrecognised by a spell checker fell from 61 to 21 (reported as a 67% reduction); on average, word error rate fell by 12% in a random sample (Alex et al., 2012).
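
Both corrections can be sketched with a word list standing in for the dictionary or corpus. The code below is a minimal illustration under that assumption: `VOCAB` is a toy vocabulary, the function names are invented, and the logic is a simplification rather than the method evaluated in Alex et al. (2012).

```python
import re
from itertools import product

# Toy word list standing in for a real dictionary or corpus vocabulary (assumption).
VOCAB = {"commodity", "vessel", "ship", "first", "fame", "same"}

def dehyphenate(text, vocab=VOCAB):
    """Rejoin words split by a hyphen at a line end if the joined form is a known word."""
    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in vocab:
            return joined
        # Unknown when joined: treat as a genuine hyphenated compound.
        return f"{left}-{right}"
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)

def fix_false_f(token, vocab=VOCAB):
    """Replace 'f' with 's' where that turns an unknown token into a known word.

    Tokens that are already known words (e.g. 'fame') are left untouched.
    Tries every f/s combination, so it is only suitable for short tokens.
    """
    if token.lower() in vocab:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for choice in product("sf", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate.lower() in vocab:
            return candidate
    return token
```

For instance, `dehyphenate("com-\nmodity prices")` gives `"commodity prices"`, and `fix_false_f("veffel")` gives `"vessel"`. Checking against a vocabulary before changing anything is what keeps real words containing "f" from being damaged.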

Page 21: Digital History and Big Data: text mining historical documents on trade in the British empire

Fixing Noisy Data

Page 22: Digital History and Big Data: text mining historical documents on trade in the British empire

Fixing Noisy Data

Page 23: Digital History and Big Data: text mining historical documents on trade in the British empire

How Noisy Is Too Noisy?

Extract from document 10.2307/60238580 in FCOC:

qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx
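
One simple way to put a number on "how noisy" is the share of tokens that a word list does not recognise, in the spirit of the spell-checker count used earlier in the talk. The sketch below assumes a toy word list and an invented function name; where to set the cut-off for "too noisy" is left open, as it is on the slide.

```python
import re

# Toy word list; a real check would use a full dictionary or spell checker (assumption).
KNOWN_WORDS = {"the", "trade", "in", "cassia", "bark", "from", "padang", "was", "large"}

def unrecognised_ratio(text, known_words=KNOWN_WORDS):
    """Fraction of alphabetic tokens not found in the word list (0.0 for empty input)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in known_words)
    return unknown / len(tokens)

clean_ratio = unrecognised_ratio("The trade in cassia bark from Padang was large")
noisy_ratio = unrecognised_ratio("qBiu si aq} uirepo")  # fragment of the extract above
```

A document scoring close to 1.0, like the extract above, is unlikely to yield useful entities, whereas one close to 0.0 can go through the normal pipeline.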

Page 24: Digital History and Big Data: text mining historical documents on trade in the British empire

The Users (Historians)

Involvement of historians:

Everything is based on use cases and builds on users’ hypotheses and research questions.

They are responsible for identification of relevant collections and are involved in the ontology development.

They provide feedback so that we can improve the technology iteratively: partners at York use the prototype for their research and track errors, and a workshop was held at CHESS 2013 with a group of independent historians.

Clarity on the text mining accuracy is IMPORTANT.

Page 25: Digital History and Big Data: text mining historical documents on trade in the British empire

Summary

Text mining historic documents in Trading Consequences.

Processing “big data”.

Power of visualising structured data.

Fixing noisy data.

Importance of two-way collaboration between technology experts and users in digital history.

Page 26: Digital History and Big Data: text mining historical documents on trade in the British empire

Thank you

Questions? Fire away or contact me at: [email protected]
