BiographyNed

eScience Center 21 March 2013

Why a good case for eScience?

• Involves big data with high complexity
• Rich metadata joining diverse textual sources and selections of data
• Incomplete and noisy
• Potential to investigate difficult questions, e.g.:
– How did the current Dutch elite develop from the colonial past?
• Biographies may represent different views and realities and thus answers to questions:
– hero or villain
– 2.8 textual sources per person

What will we do?

• Develop generic text mining technology that converts textual data to structured data
– Taking into account the nature of historical text

• Enrich and externally link data repository of Dutch biographies

• Develop visualizations and interactions on the data set to support historical research

• Develop a range of cases that demonstrate the possibilities and impossibilities of the data set and technology

Nature of eHumanities

[Diagram labels: Patterns in data, Interpretation, Value, Narratives, Cases: persons/objects/events]

Examples shown in the diagram:
• Line composition in paintings / Cubism
• Twitter patterns during elections / Democratic participation
• 19th-century Japanese prints / The rise of the Japanese middle class
• Biographical descriptions of Prince Bernhard / German nobles in the Interbellum

Statistics on available information

[Bar chart: individuals with available information (%), per category: Name, Category, Gender, Date of Death, Date of Birth, Place of Birth, Place of Death, Occupation, Religion, Father, Mother, Claim to Fame, Partner, Text]

Textual Information per person

Information                          Numbers
Average XML files per individual     2.79
Texts                                78.75%
Words (total/person)                 288.83
Words (longest text/person)          229.04
Words (total/text)                   366.76
Words (longest text)/texts           290.83
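The per-person and per-text word counts in this table appear to be linked by the "Texts 78.75%" row. The check below is a minimal sketch under the assumption (not stated on the slide) that 78.75% is the share of individuals who have at least one text. The variable names are illustrative.

```python
# Figures copied from the table.
share_with_text = 0.7875          # "Texts 78.75%"
words_total_per_text = 366.76     # "Words (total/text)"
words_longest_per_text = 290.83   # "Words (longest text)/texts"

# Assumption: averaging over all individuals, including those without any
# text, scales the per-text average by the share of individuals with text.
words_total_per_person = share_with_text * words_total_per_text
words_longest_per_person = share_with_text * words_longest_per_text

print(round(words_total_per_person, 2))    # close to the reported 288.83
print(round(words_longest_per_person, 2))  # close to the reported 229.04
```

If the assumption holds, the small remaining differences would come from rounding in the reported figures.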

Availability of Information in the portal

[Bar chart, axis from 0 to 90,000, per category: Partner, Mother, Father, Claim to Fame, Religion, Occupation, Date of birth, Place of birth, Date of death, Place of death, Category, Name. Legend: Information Absent; Text available]

Presence of information for governors of Dutch Indies (% of 71 individuals)

[Bar chart, 0 to 100%, comparing metadata and text for: marriage, multiple marriage, partners, children (number), children (names), age (start function), place of birth, place of death, studies, previous career, reason job end, last job, family connections, religion]

The Historical Perspective

• History and Biography

• Where do eScience and History meet?

• Use Cases

Historical Research

The Art and Science of History: Drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.

Building Blocks and Concrete

• Building blocks: facts derived mainly from archival findings and existing literature

• Concrete: the methods historians use to put them together into a narrative/synthesis.

• The Narrative: a historical synthesis which cannot be scientifically proven (only made likely), based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative.

Example: Grand Pensionary Johan de Witt (1625-1672)

• Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder

• Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan

• Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning

The House of History

The Importance of Provenance

The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically.

It is therefore highly important to know exactly what information comes from where.

Our Sources Here

• The Metadata: building blocks

• The entries in biographical dictionaries themselves: short historical narratives

Status of Biography in Academia and Society

• Despite improved efforts this century to embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, finding it too anecdotal and limited.

• Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)

Where do eScience and History meet? (I)

“And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.”

(Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)

Where do eScience and History meet? (II)

A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal.

B. Finding relations/networks between people which are otherwise hard to detect

Where do eScience and History meet? (III)

C. Insight into historiography and historical selectivity. Who was described/included and why? “Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, head of the Biography Portal and main author of 1001 vrouwen)

D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?

BiographyNed Use Cases

In the initial stages of the research, a list of possible historical questions was drawn up (subject to change), each falling within one of those four themes. The demonstrator should be able to answer these questions, or at least point in a direction or trend.

Case I: Making life easier: Group portrait of the Governors-General

• Highest official in the Dutch Indies, 1610-1949
• 71 men (still a relatively small group)
• What can we say about these men as a group?
• Who was appointed and what qualities did he have to have?
• Etc.

Case I: data mining

• Family connections (parents/wife/children, other relevant connections <= patronage)

• Place of Birth
• Education
• Religion
• Career (patterns)
• Age at appointment
• Duration of holding the office
• Reason for leaving the office
• Place of Death

Case I: Time and Effort

It takes more than one full week to manually mine this information from the Biography Portal. Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?

Case II: Making things possible: The Dutch Nation & Identity

• Who were selected to be included in National Biographical Dictionaries and why? (what was their claim to fame?)

• Are there different perspectives on the same person over time, and how can this be explained?

• Who was deemed most important? (based on the length of the entries)

• What time periods are most represented?

• Is there a difference in claim to fame for people from different periods in history, or between men and women?

• Which words are used most often and can we link them to national identities?

Case II: More Questions …

• What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves?

• What are the differences in the answers to these questions between several national biographical dictionaries?

• Are people and events described or appreciated differently over time? Does the perspective change?

• How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?

Conversion to Linked Data

A crash course on Linked Data

Online machine-readable data with links
• Simple facts called ‘RDF Triples’

Thorbecke > hasBirthPlace > Zwolle

Some technology concepts:
• Schemas: To structure LD
• RDF Stores: To store LD
• SPARQL: To access LD

Huge growth in the past years:
• More than 300 data sources
• More than 30 billion triples
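The triple `Thorbecke > hasBirthPlace > Zwolle` and the roles of a store and a query language can be illustrated with a toy example. This is a minimal sketch, not an RDF implementation. Identifiers are plain strings rather than URIs, and the `match` helper is a hypothetical stand-in for a SPARQL triple pattern.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# A toy in-memory "RDF store": a set of (subject, predicate, object) facts.
# The triple itself comes from the slide; real RDF would use URIs.
store: Set[Triple] = {
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
}


def match(
    triples: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard,
    much like a variable in a SPARQL triple pattern."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly: SELECT ?place WHERE { Thorbecke hasBirthPlace ?place }
places = [o for (_, _, o) in match(store, s="Thorbecke", p="hasBirthPlace")]
print(places)  # ['Zwolle']
```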

Purely syntactic conversion
• Preserve the original structure of the data
• Prevent loss of information
• Allow for reinterpretation of the original data in the future

The conversion process

Data Preservation

Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
– Using ClioPatria and the XMLRDF tool for ClioPatria
• RDF restructuring
• Linking to other sources
– Essential step in the ‘Linked Data’ philosophy
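What a purely syntactic, ‘crude’ conversion might look like can be sketched in a few lines. This is an illustration only. It does not use ClioPatria or its XMLRDF tool, and the XML element and attribute names are invented, since the portal's actual format is not shown here. The point is that every element, attribute and text value becomes a triple without any interpretation, so nothing from the original structure is lost.

```python
import xml.etree.ElementTree as ET
from itertools import count
from typing import List, Tuple

# Hypothetical XML fragment; element and attribute names are invented for
# illustration and do not reflect the Biography Portal's actual format.
SAMPLE_XML = """
<biography source="NNBW">
  <person>
    <name>Thorbecke</name>
    <event type="birth" when="1798-01-14" place="Zwolle"/>
  </person>
</biography>
"""

Triple = Tuple[str, str, str]


def xml_to_triples(xml_text: str) -> List[Triple]:
    """Map every element to a node, and every attribute, text value and
    parent-child link to a triple, without interpreting the content."""
    ids = count(1)
    triples: List[Triple] = []

    def visit(element: ET.Element) -> str:
        node = f"_:n{next(ids)}"
        triples.append((node, "rdf:type", element.tag))
        for key, value in element.attrib.items():
            triples.append((node, key, value))
        text = (element.text or "").strip()
        if text:
            triples.append((node, "rdf:value", text))
        for child in element:
            triples.append((node, child.tag, visit(child)))
        return node

    visit(ET.fromstring(xml_text))
    return triples


for triple in xml_to_triples(SAMPLE_XML):
    print(triple)
```

Restructuring and linking would then operate on these triples, leaving the crude output available for later reinterpretation.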


Data schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.
• Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms


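BiographyNed schema

[Diagram: the person Thorbecke is connected to two Biographical Descriptions. The first has Provenance Meta Data (NNBW), Person Meta Data (name “Thorbecke”; Event: Birth, 1798) and Biography Parts holding the text “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798, on 14 January, in Zwolle and comes from a half-German…”). The second is an Enrichment from an NLP Tool, whose Person Meta Data holds Event: Birth, Zwolle, 1798-01-14.]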
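The idea of coupling several descriptions of one person while keeping track of where each value came from can be sketched with plain data classes. This is a rough illustration under assumptions. The class and field names (`Person`, `BiographicalDescription`, `birth_date`, `birth_place`) are hypothetical and are not the project's actual schema, which is expressed in RDF.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BiographicalDescription:
    provenance: str                          # e.g. a source dictionary or a tool
    person_metadata: Dict[str, str] = field(default_factory=dict)
    text: str = ""


@dataclass
class Person:
    identifier: str
    descriptions: List[BiographicalDescription] = field(default_factory=list)

    def values(self, key: str) -> Dict[str, str]:
        """Collect one metadata field across descriptions, keyed by provenance,
        so that differing values stay traceable to their origin."""
        return {
            d.provenance: d.person_metadata[key]
            for d in self.descriptions
            if key in d.person_metadata
        }


thorbecke = Person("Thorbecke")
thorbecke.descriptions.append(
    BiographicalDescription(
        provenance="NNBW",
        person_metadata={"birth_date": "1798"},
        text="Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…",
    )
)
# Enrichment derived from the text by a (hypothetical) NLP tool.
thorbecke.descriptions.append(
    BiographicalDescription(
        provenance="NLP tool",
        person_metadata={"birth_date": "1798-01-14", "birth_place": "Zwolle"},
    )
)

print(thorbecke.values("birth_date"))
```

Keeping the enrichment as a separate description, rather than overwriting the original metadata, leaves the source data uncompromised.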

Retrieving Information from Text

The texts in the Biography Portal

• Collection of biographical dictionaries
• Dutch, including from the 19th and early 20th century and even older quotes
• Sources (different dictionaries/collections) have their own style
• Metadata available (though large differences in completeness)

Challenges and Advantages

• Challenges:
– Little work on NLP and biographies
– Performance of Dutch NLP tools on variations of Dutch
• Advantages:
– High-quality metadata covering several categories of information (supervised machine learning)
– Within sources, clear and similar structure of texts

General Approach

• Start by using advantages:
– Use metadata to label information
– A basic IR system can be built using sentence number and lemmas as features
• Enhance performance with NLP tools
• Build upon information retrieved in the first steps to tackle more challenging tasks
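How metadata could label training data, with sentence number and lemmas as features, can be sketched as follows. This is a simplified illustration, not the project's system. Lowercased tokens stand in for lemmas, a sentence counts as positive when it literally contains the metadata value, and the example sentences are invented English rather than Dutch portal text.

```python
import re
from typing import Dict, List, Tuple


def tokenize(sentence: str) -> List[str]:
    # Lowercased word tokens stand in for lemmas; a real system would use a
    # Dutch lemmatizer here.
    return re.findall(r"\w+", sentence.lower())


def label_sentences(
    sentences: List[str], metadata_value: str
) -> List[Tuple[Dict[str, object], bool]]:
    """Return (features, label) pairs for one biography and one metadata field.

    A sentence is labelled positive when it literally contains the metadata
    value, which is a simplifying assumption for this sketch."""
    examples = []
    for number, sentence in enumerate(sentences):
        features: Dict[str, object] = {"sentence_number": number}
        for token in tokenize(sentence):
            features[f"lemma={token}"] = True
        label = metadata_value.lower() in sentence.lower()
        examples.append((features, label))
    return examples


# Invented English example; the portal's texts are Dutch.
biography = [
    "Thorbecke was born in Zwolle.",
    "He later became a statesman.",
]
examples = label_sentences(biography, metadata_value="Zwolle")
print([label for _, label in examples])  # [True, False]
```

The resulting feature dictionaries could be fed to any supervised classifier. The choice of classifier is left open here.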

A Basic System

• Supervised Machine Learning
• Two-step identification process (Wu and Weld 2007; 2010, Fader et al. 2011)
– Identify sentence that contains information
– Sequence tagging to identify information within the sentence
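For the second step, a sequence tagger needs token-level training labels. One way these might be derived from metadata is sketched below using the common BIO scheme. This does not reproduce the cited approaches. The function name `bio_tags` and the label `BIRTHPLACE` are illustrative assumptions.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], value_tokens: List[str], label: str) -> List[Tuple[str, str]]:
    """Tag the first occurrence of value_tokens in tokens with B-/I- tags
    and everything else with O."""
    tags = ["O"] * len(tokens)
    n = len(value_tokens)
    for start in range(len(tokens) - n + 1):
        if n and tokens[start:start + n] == value_tokens:
            tags[start] = f"B-{label}"
            for i in range(start + 1, start + n):
                tags[i] = f"I-{label}"
            break
    return list(zip(tokens, tags))


# Step 1 (not shown) is assumed to have selected this sentence.
sentence = "Thorbecke was born in Zwolle".split()
print(bio_tags(sentence, ["Zwolle"], "BIRTHPLACE"))
```

Sentences tagged this way could serve as training data for a sequence model that then finds the value in texts lacking metadata.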

Adding NLP

• Location & Date recognition (GeoNames)
• (other) Named Entities (VIAF enhanced with names from metadata)
• Depending on performance of the system, we’ll work on:
– Chunking, multiword recognition
– Parsing
– Word Sense Disambiguation

Metadata & Project Goals

• Duplicate detection (metadata and text)

• Events/Network discovery
– Education (begin, end, location)
– Occupation (begin, end, location)
– Relations (parents, partners)

• Temporal relations between events

Output first system

• Better coverage of categories mentioned above

• A timeline for a person’s life (birth, education, occupation, locations, death)

• Named Entities in text (dates, locations, persons)
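A timeline of this kind can be thought of as a list of dated events sorted chronologically. The sketch below is only illustrative. The `LifeEvent` class and its fields are assumptions, and the example dates are the Johan de Witt facts given earlier in this presentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LifeEvent:
    kind: str                      # e.g. "birth", "education", "occupation"
    start: str                     # four-digit year or ISO date, e.g. "1798-01-14"
    location: Optional[str] = None


def timeline(events: List[LifeEvent]) -> List[LifeEvent]:
    # Four-digit years and ISO dates sort chronologically as plain strings.
    return sorted(events, key=lambda event: event.start)


events = [
    LifeEvent("occupation", "1653"),            # appointed grand pensionary
    LifeEvent("death", "1672", "The Hague"),
    LifeEvent("birth", "1625"),
]
for event in timeline(events):
    print(event.start, event.kind, event.location or "")
```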

Beyond the first system

The information provided by the first system can be used to:

1. Identify alternative descriptions of events (same time, location and/or participants)

2. Identify relations between events (same locations & time, consequent events, same participants, etc.)

3. Initial networks of people
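Identifying alternative descriptions of the same event can be sketched as grouping extracted events by a shared key. This is a minimal illustration under assumptions. The `Event` fields and source labels are hypothetical, and the sketch requires kind, time, location and participants to all match, which is stricter than the "and/or" in point 1.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Tuple


class Event(NamedTuple):
    source: str                    # which biography the description came from
    kind: str
    year: str
    location: str
    participants: FrozenSet[str]


def alternative_descriptions(events: List[Event]) -> List[List[Event]]:
    """Group events that share kind, time, location and participants,
    keeping only groups with more than one description."""
    groups: Dict[Tuple, List[Event]] = defaultdict(list)
    for event in events:
        key = (event.kind, event.year, event.location, event.participants)
        groups[key].append(event)
    return [group for group in groups.values() if len(group) > 1]


# "source A" and "source B" are placeholder source names.
events = [
    Event("source A", "birth", "1798", "Zwolle", frozenset({"Thorbecke"})),
    Event("source B", "birth", "1798", "Zwolle", frozenset({"Thorbecke"})),
    Event("source A", "death", "1672", "The Hague", frozenset({"Johan de Witt"})),
]
groups = alternative_descriptions(events)
for group in groups:
    print([event.source for event in group])  # ['source A', 'source B']
```

Loosening the key, for example to time and participants only, would surface candidate matches that differ in location and may need manual inspection.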

Methodological issues and text interpretation

• Results should be reproducible
– Code release (including scripts, configurations, …)
– Documentation
– Open source data

• The setup should be modular
– Combine output of different tools
– Flexible choice of methods used

Evaluation Challenges (1/2)

• How to evaluate the extraction tools?
• Partial evaluation using metadata (10-fold cross-validation), but:
1. No precise indication of precision or recall (incomplete metadata…)
2. Biographies with rich metadata are not necessarily representative

Manually annotated data needed!
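Both parts of this evaluation setup can be sketched briefly: splitting the data into ten folds, and scoring predictions against metadata. The second part also shows why incomplete metadata distorts the scores. This is a minimal sketch with invented values, and the helper names are illustrative.

```python
from typing import List, Set, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_items) into k (train, test) index pairs."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


splits = k_fold_indices(n_items=25, k=10)
print(len(splits), len(splits[0][1]))  # 10 3

# Invented example: the tool extracts two occupations that are both correct,
# but the metadata lists only one, so measured precision is too low.
print(precision_recall(predicted={"professor", "statesman"}, gold={"statesman"}))  # (0.5, 1.0)
```

A manually annotated sample avoids this distortion because its gold labels are complete for the texts it covers.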

Evaluation Challenges (2/2)

• How to compare performance of NLP tools?
– Little work on biographies, little or none on Dutch ones…
– How hard are older texts? Can we quantify?

Systematic comparison:
• English biographies (Wikipedia)
• Dutch biographies (Wikipedia)
• Biographies from the portal

Reproducibility/Replication

• What do results mean if they cannot be reproduced?

• What variation in results can be expected based on details not mentioned in papers?

• Which information is needed to replicate results or find the origin of differences?

Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)

Representations (tools)

• How to represent and combine output of different tools?
– Compatibility (easy to convert output of external NLP tools)
– Flexibility (be able to contain alternative representations and interpretations)

Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)

Representation (events)

• How to combine knowledge from the NLP community and Linked Data community?
– Combination of textual information with external resources
– Complete representation of information from text (location, retrieval method)

Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)

Current state of affairs

• Basic system using sentence number and lemmas for main categories metadata (evaluation ongoing)

• Module for labeling locations and dates in text (adaptations to be made for modularity)

• Annotation effort started for evaluation (selection of approximately 700 texts)

Demonstrator

Interface: Focus

• The interface should be easy to use
• The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research
• The interface should allow users to ‘fine tune’ the results returned upon an initial action

Interface: Options

• Query composition
• Faceted browsing
• A combination

Interface: Query composition

• Drop-down boxes to select ‘Verbs’, data elements and relations

Interface: Faceted browsing

• No explicit querying, but convergence of the data through browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier adjustment of the selected data

Interface: A combination

• Query composition combined with faceted browsing
• Create new facets by defining a query
– The result of the query is available as a subset of the data by selecting the defined facet
– As such, combinable with other facets
• Method to integrate ‘open’ querying of the data into a general interface and visualization
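The combination of query composition and faceted browsing can be sketched by treating every facet as a predicate over records, so that a query result becomes just another facet. This is an illustration under assumptions, not the demonstrator's design. The records, field names and helper functions are all invented.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Facet = Callable[[Record], bool]

# Invented records; field names are illustrative only.
records: List[Record] = [
    {"name": "Person A", "birth_place": "Zwolle", "occupation": "statesman"},
    {"name": "Person B", "birth_place": "Dordrecht", "occupation": "statesman"},
    {"name": "Person C", "birth_place": "Zwolle", "occupation": "painter"},
]


def value_facet(field: str, value: str) -> Facet:
    """An ordinary facet: select records whose field equals value."""
    return lambda record: record.get(field) == value


def select(data: List[Record], *facets: Facet) -> List[Record]:
    """Keep the records that satisfy every selected facet."""
    return [r for r in data if all(facet(r) for facet in facets)]


facets: Dict[str, Facet] = {
    "born in Zwolle": value_facet("birth_place", "Zwolle"),
    "statesmen": value_facet("occupation", "statesman"),
}
# A user-defined query is registered as a new facet and can then be
# combined with the ordinary ones.
facets["query: not a painter"] = lambda r: r["occupation"] != "painter"

result = select(records, facets["born in Zwolle"], facets["query: not a painter"])
print([r["name"] for r in result])  # ['Person A']
```

Because each facet is only a predicate, adding or removing one adjusts the selection directly, which matches the kind of feedback faceted browsing is meant to give.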

Interface: Demonstrator

[Diagram labels: Question Analysis, Selection Process, Results, Data, Facets. Time and place are primary elements.]

Results

Questions?
