BiographyNed
eScience Center 21 March 2013
Why a good case for eScience?
• Involves big data with high complexity
• Rich metadata joining diverse textual sources and selections of data
• Incomplete and noisy
• Potential to investigate difficult questions, e.g.:
– How did the current Dutch elite develop from the colonial past?
• Biographies may represent different views and realities and thus answers to questions:
– hero or villain
– 2.8 textual sources per person
What will we do?
• Develop generic text mining technology that converts textual data to structured data
– Taking into account the nature of historical text
• Enrich and externally link data repository of Dutch biographies
• Develop visualizations and interactions on the data set to support historical research
• Develop a range of cases that demonstrate the possibilities and impossibilities of the data set and technology
Nature of eHumanities
[Diagram: Patterns in data, Value, Interpretation; Narratives; Cases: persons/objects/events]
• Line composition in paintings / Cubism
• Twitter patterns during elections / Democratic participation
• 19th-century Japanese prints / The rise of the Japanese middle class
• Biographical descriptions of Prince Bernhard / German nobles in the Interbellum
Statistics on available information
[Bar chart: individuals with available information (%), axis 0 to 120, per category: Name, Category, Gender, Date of Death, Date of Birth, Place of Birth, Place of Death, Occupation, Religion, Father, Mother, Claim to Fame, Partner, Text]
Textual Information per person
Information                          Numbers
Average XML-files per individual     2.79
Texts                                78.75%
Words (total/person)                 288.83
Words (longest text/person)          229.04
Words (total/text)                   366.76
Words (longest text)/texts           290.83
Availability of Information in the portal
[Bar chart: number of individuals (axis 0 to 90000) per category: Partner, Mother, Father, Claim to Fame, Religion, Occupation, Date of birth, Place of birth, Date of death, Place of death, Category, Name. Legend as extracted: Information Absent, Text available]
Presence of information for governors of Dutch Indies (% on 71 individuals)
[Bar chart: percentage (axis 0 to 100) per category: Marriage, Multiple marriage, Partners, Children (number), Children (names), Age (start function), Place of Birth, Place of Death, Studies, Previous career, Reason job end, Last job, Family connections, Religion. Legend: metadata, text]
The Historical Perspective
• History and Biography
• Where do eScience and History meet?
• Use Cases
Historical Research
The Art and Science of History: drawing up a narrative from primary and secondary sources that approximates historical reality as closely as possible.
Building Blocks and Concrete
• Building blocks: facts derived mainly from archival findings and existing literature
• Concrete: the methods historians use to put them together into a narrative/synthesis.
• The Narrative: a historical synthesis that cannot be scientifically proven (only made likely), based on facts that can be proven or falsified. Drawing up a narrative necessarily involves a creative element.
Example: Grand Pensionary Johan de Witt (1625-1672)
• Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder
• Concrete: (logic) based on these last facts, it is likely that William ordered the death of Johan
• Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning
The House of History
The Importance of Provenance
The only way to falsify presented historical facts is to go back to the original source(s) and examine them critically.
It is therefore highly important to know exactly which information comes from which source.
Our Sources Here
• The Metadata: building blocks
• The entries in biographical dictionaries themselves: short historical narratives
Status of Biography in Academia and Society
• Despite increased efforts this century to embed biography in academic theories and methods, some scholars (e.g. some social historians) still do not consider it a worthy academic discipline, finding it too anecdotal and limited.
• Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)
Where do eScience and History meet? (I)
“And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.”
(Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)
Where do eScience and History meet? (II)
A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal.
B. Finding relations/networks between people which are otherwise hard to detect
Where do eScience and History meet? (III)
C. Insight into historiography and historical selectivity. Who was described/included and why?
“Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.”
(Els Kloek, head of the Biography Portal and main author of 1001 vrouwen)
D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?
BiographyNed Use Cases
In the initial stages of the research, a list of possible historical questions within one of those four themes was drawn up (subject to change). The demonstrator should be able to answer these questions, or at least point towards a direction or trend.
Case I: Making life easier: Group portrait of the Governors-General
• Highest official in the Dutch Indies, 1610-1949
• 71 men (still a relatively small group)
• What can we say about these men as a group?
• Who was appointed and what qualities did he have to have?
• Etc.
Case I: data mining
• Family connections (parents/wife/children, other relevant connections <= patronage)
• Place of Birth
• Education
• Religion
• Career (patterns)
• Age at appointment
• Duration of holding the office
• Reason for leaving the office
• Place of Death
Case I: Time and Effort
Manually mining this information from the Biography Portal takes more than one full week. Can a historian do this with (almost) the same results in under one hour with the help of the demonstrator?
Case II: Making things possible: The Dutch Nation & Identity
• Who was selected for inclusion in national biographical dictionaries and why? (What was their claim to fame?)
• Are there different perspectives on the same person over time, and how can this be explained?
• Who was deemed most important? (based on the length of the entries)
• What time periods are most represented?
• Is there a difference in claim to fame for people from different periods in history, or between men and women?
• Which words are used most often and can we link them to national identities?
Case II: More Questions …
• What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves?
• What are the differences in the answers to these questions between several national biographical dictionaries?
• Are people and events described or appreciated differently over time? Does the perspective change?
• How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?
Conversion to Linked Data
A crash course on Linked Data
Online machine-readable data with links
• Simple facts called ‘RDF Triples’
Thorbecke > hasBirthPlace > Zwolle
Some technology concepts:
• Schemas: To structure LD
• RDF Stores: To store LD
• SPARQL: To access LD
Huge growth in the past years:
• More than 300 data sources
• More than 30 billion triples
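As a rough illustration of the triple notation on the crash-course slide, the sketch below stores facts as plain Python tuples and matches them against a pattern with wildcards, which is loosely what a SPARQL query does against an RDF store. It is not real RDF: the `match` helper and the predicate names `hasBirthDate` and `hasDeathPlace` are assumptions made for the example, and a real setup would use URIs and a proper triple store.

```python
# Facts taken from the slides; predicate names other than hasBirthPlace are assumed.
TRIPLES = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthDate", "1798-01-14"),
    ("JohanDeWitt", "hasDeathPlace", "The Hague"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Who was born in Zwolle?" expressed as a triple pattern.
born_in_zwolle = [s for s, _, _ in match(TRIPLES, p="hasBirthPlace", o="Zwolle")]
# "What do we know about Thorbecke?"
about_thorbecke = match(TRIPLES, s="Thorbecke")
print(born_in_zwolle, about_thorbecke)
```

Leaving more positions open widens the query; fixing more positions narrows it.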
The conversion process
Purely syntactic conversion
• Preserve the original structure of the data
• Prevent loss of information
• Allow for reinterpretation of the original data in the future
The conversion process
Data Preservation
Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
– Using ClioPatria and the XMLRDF tool for ClioPatria
• RDF restructuring
• Linking to other sources
– Essential step in the ‘Linked Data’ philosophy
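The idea of a purely syntactic, 'crude' conversion can be sketched in a few lines: walk the XML tree and emit one triple per attribute, text value and child element, without interpreting anything. This is a stand-in written with the Python standard library, not the XMLRDF tool for ClioPatria; the XML record, the node identifiers and the `value` predicate are all hypothetical, and the real Biography Portal format may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record, invented for this sketch.
XML = """<biography id="bio1">
  <person>
    <name>Thorbecke</name>
    <birth place="Zwolle" when="1798-01-14"/>
  </person>
  <source>NNBW</source>
</biography>"""


def xml_to_triples(elem, node_id, triples=None):
    """Emit one triple per attribute, text value and child, keeping the XML structure."""
    if triples is None:
        triples = []
    for key, value in elem.attrib.items():
        triples.append((node_id, key, value))
    text = (elem.text or "").strip()
    if text:
        triples.append((node_id, "value", text))
    for i, child in enumerate(elem):
        child_id = f"{node_id}/{child.tag}{i}"
        triples.append((node_id, child.tag, child_id))
        xml_to_triples(child, child_id, triples)
    return triples


triples = xml_to_triples(ET.fromstring(XML), "bio1")
for t in triples:
    print(t)
```

Because nothing is interpreted at this stage, the original structure survives and later restructuring or linking steps can be redone without going back to the XML.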
The conversion process
Data schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.
• Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms
BiographyNed schema
[Diagram of the BiographyNed schema, with nodes: Thorbecke; Biographical Description; Provenance Meta Data (NNBW); Person Meta Data (“Thorbecke”); Biography Parts; Event: Birth, 1798; Biographical Description; Enrichment (NLP Tool); Person Meta Data; Event: Birth, Zwolle, 1798-01-14]
Example text: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”)
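One way to picture the schema requirements (several descriptions per person, original data kept apart from enrichments, provenance on every description) is the small data model below. It is only a sketch: the class and field names are assumptions for this example and do not reproduce the actual BiographyNed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str        # e.g. "birth"
    date: str = ""   # year or ISO date, as far as known
    place: str = ""


@dataclass
class BiographicalDescription:
    provenance: str              # a source dictionary or the name of a tool
    is_enrichment: bool = False  # True when produced by a tool rather than a source
    text: str = ""
    events: list[Event] = field(default_factory=list)


@dataclass
class Person:
    name: str
    descriptions: list[BiographicalDescription] = field(default_factory=list)

    def events_of(self, kind):
        """All events of one kind, each paired with where it came from."""
        return [(d.provenance, e) for d in self.descriptions
                for e in d.events if e.kind == kind]


original = BiographicalDescription("NNBW", events=[Event("birth", "1798")])
enriched = BiographicalDescription(
    "NLP tool", is_enrichment=True,
    events=[Event("birth", "1798-01-14", "Zwolle")],
)
thorbecke = Person("Thorbecke", [original, enriched])
births = thorbecke.events_of("birth")
print(births)
```

The enrichment is added next to the original description instead of overwriting it, so the source data stays intact and every claim can be traced back to its origin.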
Retrieving Information from Text
The texts in the Biography Portal
• Collection of biographical dictionaries
• Dutch, including texts from the 19th and early 20th century and even older quotes
• Sources (different dictionaries/collections) have their own style
• Metadata available (though large differences in completeness)
Challenges and Advantages
• Challenges:
– Little work on NLP and biographies
– Performance of Dutch NLP tools on variations of Dutch
• Advantages:
– High-quality metadata covering several categories of information (supervised machine learning)
– Within sources, clear and similar structure of texts
General Approach
• Start by using advantages:
– Use metadata to label information
– A basic IR system can be built using sentence number and lemmas as features
• Enhance performance with NLP tools
• Build upon information retrieved in the first steps to tackle more challenging tasks
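The first two points of the approach can be sketched as follows: a metadata value is used to label sentences automatically, and each sentence is turned into features made of its sentence number and its words. This is a minimal sketch under stated assumptions: the sentence splitter is naive, lowercased tokens stand in for lemmas, the first example sentence is shortened from the Thorbecke text on the slides and the second sentence is invented.

```python
import re


def split_sentences(text):
    """Naive splitter on sentence-final punctuation; a real system would use an NLP tool."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased word tokens, used here as a crude stand-in for lemmas."""
    return re.findall(r"\w+", sentence.lower())


def label_sentences(text, metadata_value):
    """Label each sentence 1 if it mentions the metadata value, else 0."""
    examples = []
    for number, sentence in enumerate(split_sentences(text)):
        features = {"sentence_number": number}
        features.update({f"lemma={t}": 1 for t in tokens(sentence)})
        label = int(metadata_value.lower() in sentence.lower())
        examples.append((features, label))
    return examples


# Second sentence is made up for the example.
TEXT = ("Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle. "
        "Hij studeerde in Leiden.")
examples = label_sentences(TEXT, "Zwolle")
for features, label in examples:
    print(label, sorted(features))
```

The resulting (features, label) pairs are the kind of training data a supervised classifier could consume, with no manual annotation involved.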
A Basic System
• Supervised Machine Learning
• Two-step identification process (Wu and Weld 2007, 2010; Fader et al. 2011)
– Identify the sentence that contains the information
– Sequence tagging to identify the information within the sentence
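A toy version of the two-step process might look like the sketch below. Step 1 learns per-word weights from labelled sentences and picks the most promising sentence; step 2 tags tokens inside that sentence. Both parts are simplifications chosen for brevity: the scorer is a smoothed log-odds word count rather than the models in the cited papers, step 2 is a rule-based stand-in for a trained sequence tagger, and all training sentences are invented.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    return re.findall(r"\w+", sentence)


def train_sentence_scorer(labelled):
    """Learn per-token log-odds of occurring in a positive sentence (add-one smoothing)."""
    pos, neg = Counter(), Counter()
    for sentence, label in labelled:
        (pos if label else neg).update(t.lower() for t in tokens(sentence))
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {
        t: math.log((pos[t] + 1) / (n_pos + len(vocab)))
        - math.log((neg[t] + 1) / (n_neg + len(vocab)))
        for t in vocab
    }


def pick_sentence(sentences, weights):
    """Step 1: choose the sentence most likely to contain the information."""
    return max(sentences,
               key=lambda s: sum(weights.get(t.lower(), 0.0) for t in tokens(s)))


def tag_place(sentence):
    """Step 2 (rule-based stand-in for a trained tagger): a capitalised token
    directly after 'in' gets B-PLACE, everything else gets O."""
    toks = tokens(sentence)
    tags = ["O"] * len(toks)
    for i in range(1, len(toks)):
        if toks[i - 1].lower() == "in" and toks[i][0].isupper():
            tags[i] = "B-PLACE"
    return list(zip(toks, tags))


# Invented training sentences: 1 = mentions a birthplace, 0 = does not.
TRAIN = [("Hij werd geboren in Utrecht", 1), ("Zij werd geboren in Delft", 1),
         ("Hij studeerde rechten", 0), ("Zij schreef gedichten", 0)]
weights = train_sentence_scorer(TRAIN)
sentences = ["Hij studeerde rechten in Leiden",
             "Thorbecke werd in 1798 geboren op 14 januari in Zwolle"]
best = pick_sentence(sentences, weights)
places = [tok for tok, tag in tag_place(best) if tag == "B-PLACE"]
print(best, places)
```

Splitting the task this way keeps the tagger from firing on every place name in the text: only the sentence selected in step 1 is searched for a birthplace.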
Adding NLP
• Location & Date recognition (GeoNames)
• (Other) Named Entities (VIAF enhanced with names from metadata)
• Depending on performance of the system, we’ll work on:
– Chunking, multiword recognition
– Parsing
– Word Sense Disambiguation
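Location and date recognition can be approximated with a pattern for Dutch dates and a gazetteer lookup, as in the sketch below. The three-entry gazetteer is a hand-made stand-in for a GeoNames lookup, the helper names are assumptions, and the example sentence is a rephrased, invented variant of the Thorbecke text.

```python
import re

DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5, "juni": 6,
    "juli": 7, "augustus": 8, "september": 9, "oktober": 10,
    "november": 11, "december": 12,
}

# Tiny hand-made gazetteer standing in for a GeoNames lookup.
GAZETTEER = {"Zwolle", "Leiden", "Den Haag"}

DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(DUTCH_MONTHS) + r") (\d{4})\b")


def find_dates(text):
    """Return ISO dates for 'day month year' expressions such as '14 januari 1798'."""
    return [f"{int(y):04d}-{DUTCH_MONTHS[m]:02d}-{int(d):02d}"
            for d, m, y in DATE_RE.findall(text)]


def find_places(text):
    """Return gazetteer names that occur in the text, in order of appearance."""
    hits = [(text.find(name), name) for name in GAZETTEER if name in text]
    return [name for _, name in sorted(hits)]


text = "Thorbecke werd op 14 januari 1798 geboren in Zwolle en studeerde in Leiden."
print(find_dates(text), find_places(text))
```

The sentence on the slides separates the year from the day and month ("in 1798 geboren op 14 januari"), which this single pattern would miss; historical texts will need more patterns, or a trained recogniser, to cope with such variation.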
Metadata & Project Goals
• Duplicate detection (metadata and text)
• Events/Network discovery
– Education (begin, end, location)
– Occupation (begin, end, location)
– Relations (parents, partners)
• Temporal relations between events
Output first system
• Better coverage of categories mentioned above
• A timeline for a person’s life (birth, education, occupation, locations, death)
• Named Entities in text (dates, locations, persons)
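Once events carry dates, a timeline for a person's life is essentially a sort. The sketch below uses the facts about Johan de Witt given earlier in the slides; the `Event` structure and the `render` helper are assumptions for this example.

```python
from typing import NamedTuple


class Event(NamedTuple):
    date: str        # "YYYY" or "YYYY-MM-DD"; such strings sort chronologically as text
    kind: str
    place: str = ""


def build_timeline(events):
    """Order events by date."""
    return sorted(events, key=lambda e: e.date)


def render(events):
    """One readable line per event, in chronological order."""
    return [f"{e.date}: {e.kind}" + (f" ({e.place})" if e.place else "")
            for e in build_timeline(events)]


de_witt = [
    Event("1672", "death", "The Hague"),
    Event("1625", "birth"),
    Event("1653", "occupation: grand pensionary"),
]
lines = render(de_witt)
print("\n".join(lines))
```

Sorting on the date string only works while all dates use four-digit years; undated or approximate events would need separate handling.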
Beyond the first system
The information provided by the first system can be used to:
1. Identify alternative descriptions of events (same time, location and/or participants)
2. Identify relations between events (same locations & time, consequent events, same participants, etc.)
3. Initial networks of people
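Finding alternative descriptions of the same event can start from a simple grouping key, as sketched below: event descriptions that agree on type, year and place are put together, whatever their source. The key choice, the record layout and the place attached to the NNBW record are assumptions for illustration; real data would need fuzzier matching.

```python
from collections import defaultdict


def group_alternatives(events):
    """Group event descriptions that share kind, year and place;
    keep only groups described by more than one source."""
    groups = defaultdict(list)
    for source, kind, date, place in events:
        groups[(kind, date[:4], place)].append(source)
    return {key: sources for key, sources in groups.items() if len(sources) > 1}


# (source, kind, date, place); illustrative records based on the slide examples.
EVENTS = [
    ("NNBW", "birth", "1798", "Zwolle"),
    ("NLP tool", "birth", "1798-01-14", "Zwolle"),
    ("NNBW", "death", "1672", "The Hague"),
]
alternatives = group_alternatives(EVENTS)
print(alternatives)
```

Comparing only the year lets a coarse date ("1798") match a precise one ("1798-01-14"), at the cost of occasionally merging distinct events that fall in the same year and place.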
Methodological issues and text interpretation
• Results should be reproducible
– Code release (including scripts, configurations, …)
– Documentation
– Open source data
• The setup should be modular
– Combine output of different tools
– Flexible choice of methods used
Evaluation Challenges (1/2)
• How to evaluate the extraction tools?
• Partial evaluation using metadata (10-fold cross-validation), but:
1. No precise indication of precision or recall (incomplete metadata…)
2. Biographies with rich metadata are not necessarily representative
Manually annotated data needed!
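The evaluation setup and its first caveat can be made concrete with a short sketch: a 10-fold split over biographies, and precision/recall computed against metadata. The fact tuples and helper names are assumptions for the example. Note how a correct extraction that is missing from the metadata counts as an error, which is exactly why metadata alone gives no precise picture.

```python
def k_folds(items, k=10):
    """Split items into k folds; fold i takes every k-th item starting at i."""
    return [items[i::k] for i in range(k)]


def precision_recall(predicted, gold):
    """Compare a set of extracted facts with a set of reference facts."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


biographies = list(range(25))  # stand-ins for 25 biography ids
folds = k_folds(biographies)
splits = []
for i, test_fold in enumerate(folds):
    train = [b for j, fold in enumerate(folds) if j != i for b in fold]
    splits.append((train, test_fold))

# The extracted birth date is right, but the (incomplete) metadata lacks it.
extracted = {("Thorbecke", "birth_place", "Zwolle"),
             ("Thorbecke", "birth_date", "1798-01-14")}
metadata = {("Thorbecke", "birth_place", "Zwolle")}
precision, recall = precision_recall(extracted, metadata)
print(precision, recall)
```

Here the measured precision is 0.5 although both extracted facts are correct, so scores against metadata are best read as a lower bound until manually annotated data is available.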
Evaluation Challenges (2/2)
• How to compare performance of NLP tools?
– Little work on biographies, little or none on Dutch ones…
– How hard are older texts? Can we quantify?
Systematic comparison:
• English biographies (Wikipedia)
• Dutch biographies (Wikipedia)
• Biographies from the portal
Reproducibility/Replication
• What do results mean if they cannot be reproduced?
• What variation in results can be expected based on details not mentioned in papers?
• Which information is needed to replicate results or find the origin of differences?
Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)
Representations (tools)
• How to represent and combine output of different tools?
– Compatibility (easy to convert output of external NLP tools)
– Flexibility (be able to contain alternative representations and interpretations)
Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)
Representation (events)
• How to combine knowledge from the NLP community and Linked Data community?
– Combination of textual information with external resources
– Complete representation of information from text (location, retrieval method)
Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)
Current state of affairs
• Basic system using sentence number and lemmas for the main metadata categories (evaluation ongoing)
• Module for labeling locations and dates in text (adaptations to be made for modularity)
• Annotation effort started for evaluation (selection of approximately 700 texts)
Demonstrator
Interface: Focus
• The interface should be easy to use
• The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research
• The interface should allow users to fine-tune the results returned after an initial action
Interface: Options
• Query composition
• Faceted browsing
• A combination
Interface: Query composition
• Drop-down boxes to select ‘Verbs’, data elements and relations
Interface: Faceted browsing
• No explicit querying, but convergence of the data through browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier adjustment of the selected data
Interface: A combination
• Query composition combined with faceted browsing
• Create new facets by defining a query
– The result of the query is available as a subset of the data by selecting the defined facet
– As such, combinable with other facets
• Method to integrate ‘open’ querying of the data into a general interface and visualization
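The combination of query composition and faceted browsing can be thought of in terms of sets: each facet, whether predefined or created from a query, is a subset of the data, and selecting several facets intersects them. The sketch below illustrates that idea only; the record fields and the `facet` helper are assumptions, not the demonstrator's design.

```python
# Illustrative records built from facts mentioned in the slides.
PEOPLE = [
    {"name": "Thorbecke", "birth_year": 1798, "places": ["Zwolle"]},
    {"name": "Johan de Witt", "birth_year": 1625, "places": ["The Hague"]},
]


def facet(records, predicate):
    """A facet is the subset of record names that satisfy a predicate."""
    return {r["name"] for r in records if predicate(r)}


# A predefined facet on time...
born_before_1700 = facet(PEOPLE, lambda r: r["birth_year"] < 1700)
# ...and a new facet defined by an 'open' query on place.
linked_to_the_hague = facet(PEOPLE, lambda r: "The Hague" in r["places"])

# Selecting both facets narrows the data to their intersection.
combined = born_before_1700 & linked_to_the_hague
print(combined)
```

Because a query result is just another subset, it can be mixed freely with existing facets, which is what lets open querying live inside a browsing interface.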
Interface: Demonstrator
[Diagram: Question, Analysis, Selection, Process, Results, Data, Facets. Time and place are primary elements.]
Results
Questions?