Automatic Extraction of Topic Maps based Argumentation Trails
Text Mining Services Conference
Leipzig, 2009/03/25
Marco Büchler, Lutz Maicher, Frederik Baumgardt, Benjamin Bock
Natural Language Processing Group
Department of Computer Science
University of Leipzig
Starting Point: Panionion
Computation of argumentation trails on fragmentary texts
Added value of, and relation between, Topic Maps and argumentation trails
Results
Further work / conclusion
Agenda
Technical details
Text source
Co-occurrence as underlying graph
- de Saussure (1898/1916):
Structuralism assumes that meaning is the result of structural relations between word forms
The fundamental structural relations are syntagmatic and paradigmatic relations [Heyer & Bordag 2007]
Argumentation trails vs. Lexical Chaining
- fragmentary texts
Underlying graph
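As a rough illustration of a co-occurrence graph used as the underlying graph, the following minimal sketch links two word forms whenever they share a sentence and then enumerates simple paths of a given length. It assumes sentence-level co-occurrence without any significance filter and treats a trail as a simple path; the names `build_cooccurrence_graph` and `trails` and the toy sentences are hypothetical, not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Link two word forms whenever they appear in the same sentence."""
    graph = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def trails(graph, start, length):
    """Enumerate simple paths with `length` edges that begin at `start`."""
    paths = [[start]]
    for _ in range(length):
        # Extend every path by each neighbour that is not already on it.
        paths = [p + [n] for p in paths for n in sorted(graph[p[-1]]) if n not in p]
    return paths


# Toy fragments, invented for illustration only.
sentences = ["alpha beta", "beta gamma", "gamma delta"]
graph = build_cooccurrence_graph(sentences)
two_step = trails(graph, "alpha", 2)
three_step = trails(graph, "alpha", 3)
```

Here "alpha" and "gamma" never occur in the same fragment, yet a trail connects them through "beta", which is the kind of indirect link that matters on fragmentary texts.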
“Definition/Motivation”: What's the average path length in a graph?
Average path length is typically not larger than 7.
Simple proof of concept (using XING):
Every person among my contacts has on average about 73 contacts (1st and 2nd level).
log_73(6,800,000,000) = 5.28
Small World
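The small-world estimate can be checked with a short calculation. This sketch only reproduces the logarithm from the slide; the variable names are illustrative.

```python
import math

world_population = 6_800_000_000  # population figure used on the slide
avg_contacts = 73                 # average number of contacts per person (XING estimate)

# Number of hops needed if every hop multiplies the reachable people by avg_contacts.
steps = math.log(world_population, avg_contacts)
print(round(steps, 2))
```

The result stays below the bound of 7 stated above.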
Methodology
Topic Maps
Data model of Topic Maps (Topics)
Nikolaikirche
variant
St. Nicholas Church
St. Nikolai
name
English
scope
1165
occurrence
www.nikolaikirche-leipzig.de/
occurrence
foundation
type
website
type
Data model of Topic Maps (Associations)
St. Nikolai Leipzig
association
container-containee
ass. role
role player
container
containee
role type
Data model of Topic Maps (Summary)
one topic represents one subject in a data source
− names represent the names of the subject (names might have variants)
− occurrences represent properties of the subject
− associations represent relationships between subjects (flexibility through roles, n-ary associations)
− all types and scopes are (set of) topics in a topic map: everything is a topic
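The data model summarised above can be sketched with a few plain data classes. This is a simplified illustration, not the ISO 13250 API: the class and field names are assumptions, and the assignment of the labels from the example slides (which string is a name, which scope applies where, which topic plays which role) is a guess made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    subject: str
    names: list = field(default_factory=list)        # Name objects
    occurrences: list = field(default_factory=list)  # Occurrence objects


@dataclass
class Name:
    value: str
    variants: list = field(default_factory=list)  # alternative forms of this name
    scope: list = field(default_factory=list)     # topics limiting where the name applies


@dataclass
class Occurrence:
    value: str
    type: Topic  # the occurrence type is itself a topic


@dataclass
class Role:
    type: Topic    # role type, e.g. "container"
    player: Topic  # topic playing that role


@dataclass
class Association:
    type: Topic
    roles: list  # any number of roles, which allows n-ary associations


# Types and scopes are topics too ("everything is a topic").
english, foundation, website = Topic("English"), Topic("foundation"), Topic("website")
container, containee = Topic("container"), Topic("containee")

church = Topic(
    "Nikolaikirche",
    names=[Name("St. Nikolai"), Name("St. Nicholas Church", scope=[english])],
    occurrences=[
        Occurrence("1165", foundation),
        Occurrence("www.nikolaikirche-leipzig.de/", website),
    ],
)
leipzig = Topic("Leipzig")

located_in = Association(
    Topic("container-containee"),
    roles=[Role(container, leipzig), Role(containee, church)],
)
```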
What are Topic Maps (ISO 13250)?
Topic Maps are highly-networked data sources
− one topic for each subject
− relationships of subjects are associations between topics
Topic Maps have a human-centric data model
− vocabulary for documenting information fits human cognition
− network resembles human cognition
Topic Maps have an integration model
− whenever two topics represent the same subject, they have to be merged
− always one information access hub for each subject
− high terminological flexibility and schema-free
− use in knowledge federation and sensemaking
Topic Maps is an international industry standard (ISO 13250)
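The integration model, where topics representing the same subject are merged into one access hub, can be illustrated with a minimal sketch. It assumes each topic arrives as a tuple of a subject identifier, names, and occurrences; the function `merge_topics` and the example identifier are hypothetical, and the full merging rules of the standard are more detailed than this.

```python
def merge_topics(topics):
    """Merge (subject_id, names, occurrences) records that share a subject identifier."""
    merged = {}
    for subject_id, names, occurrences in topics:
        hub = merged.setdefault(subject_id, {"names": set(), "occurrences": set()})
        hub["names"].update(names)
        hub["occurrences"].update(occurrences)
    return merged


# Two sources describe the same subject under different names (hypothetical identifier).
topics = [
    ("http://example.org/nikolaikirche", {"Nikolaikirche"}, {"1165"}),
    ("http://example.org/nikolaikirche", {"St. Nicholas Church"}, set()),
]
hubs = merge_topics(topics)
```

After merging, a single hub carries both names, which is what gives the model its terminological flexibility.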
Extraction of typed significant terms
The corpus is categorized according to several classification schemes.
The corpus is split into several sub-corpora.
Medusa
age gender geography
....
Categorized co-occurrences/terms
Tomcat/Prefuse
Age
gender
geography
(Source: taken from the bachelor thesis slides of Marcus Puchalla.)
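Splitting a categorized corpus into sub-corpora can be sketched as follows. The sketch assumes every document carries one category per classification scheme (age, gender, geography); the function `split_corpus`, the document layout, and all category values are invented for illustration and say nothing about how Medusa performs this step.

```python
from collections import defaultdict


def split_corpus(documents, scheme):
    """Group document texts by their category under one classification scheme."""
    sub_corpora = defaultdict(list)
    for doc in documents:
        sub_corpora[doc[scheme]].append(doc["text"])
    return dict(sub_corpora)


# Hypothetical documents; the category values are invented for illustration.
documents = [
    {"text": "first text", "age": "18-30", "gender": "f", "geography": "north"},
    {"text": "second text", "age": "18-30", "gender": "m", "geography": "south"},
    {"text": "third text", "age": "31-50", "gender": "f", "geography": "north"},
]

# One set of sub-corpora per classification scheme.
by_scheme = {s: split_corpus(documents, s) for s in ("age", "gender", "geography")}
```

Co-occurrences or significant terms could then be computed per sub-corpus and labelled with the category they came from.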
Results
Several graph properties
Columns:
(1) Complete graph
(2) w_id>=100 && freq(word)>1
(3) w_id>=300 && freq(word)>1
(4) w_id>=500 && freq(word)>1
(5) Named Entities
(6) Normalised Named Entities
(7) Normalised Text and Named Entities

Graph properties                                           (1)         (2)         (3)         (4)         (5)         (6)         (7)
Number of nodes                                        538,572     388,929     363,359     353,618       1,149       4,487       2,178
Number of co-occurrences                            57,762,474  34,818,138  25,615,956  21,004,538      15,436     126,188     152,856
Number of significant co-occurrences                30,382,422  21,739,476  17,687,582  15,462,940      14,876      69,858      84,124
Percentage                                                0.53        0.62        0.69        0.74        0.96        0.55        0.55
Average degree                                           56.41       55.90       48.68       43.73       12.95       15.57       38.62

Argumentation trail properties                             (1)         (2)         (3)         (4)         (5)         (6)         (7)
Number of trails                                        > 10^8      > 10^8      > 10^8      > 10^8     361,094   7,958,240   3,087,581
Average degree                                           15.34        9.93        7.70        6.79        7.03        7.77        9.93
Average degree of internal node (trail length 2)         31.34       21.08       14.33       11.45        7.02       10.15       12.31
Average degree of internal node (trail length 3)        301.38      362.56      285.86      231.39       55.66       76.06       81.86
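The two derived rows of the graph properties can be recomputed from the counts. The sketch below assumes that "Percentage" is the share of significant co-occurrences among all co-occurrences and that "Average degree" is the number of significant co-occurrences divided by the number of nodes; both readings are inferred from the numbers rather than stated on the slide.

```python
# Counts from the table, one entry per column (complete graph first).
nodes = [538_572, 388_929, 363_359, 353_618, 1_149, 4_487, 2_178]
cooccurrences = [57_762_474, 34_818_138, 25_615_956, 21_004_538, 15_436, 126_188, 152_856]
significant = [30_382_422, 21_739_476, 17_687_582, 15_462_940, 14_876, 69_858, 84_124]

# Assumed definitions: share of significant co-occurrences, and
# significant co-occurrences per node.
percentage = [round(s / c, 2) for s, c in zip(significant, cooccurrences)]
average_degree = [round(s / n, 2) for s, n in zip(significant, nodes)]

print(percentage)
print(average_degree)
```

Under these assumptions the computed values match the "Percentage" and "Average degree" rows of the graph properties.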
Visualisation of two argumentation trails
Marco Büchler
onotoa.topicmapslab.de
Topic Maps ontology for the argumentation trails
Topic Maps and Argumentation Trails
- Reduction of graph complexity
  - e.g. by semantic pre-clustering or
  - author restrictions
- Weighting of argumentation trails
  - e.g. trails containing hubs should be weighted lower
- Improvements in visualisation
  - Clustering of similar trails into a bundle of semantically similar trails
- Improvements in typing nodes and especially edges
Further work / conclusion
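One possible way to weight trails containing hubs lower is to divide by the degree of every internal node. This is only a sketch of the idea named in the list above, not a scheme proposed in the talk; the function `trail_weight` and the toy graph are hypothetical.

```python
def trail_weight(graph, trail):
    """Weight a trail by the inverse degrees of its internal nodes, so hubs lower it."""
    weight = 1.0
    for node in trail[1:-1]:  # start and end nodes are not penalised
        weight /= len(graph[node])
    return weight


# Toy undirected graph as adjacency sets; "hub" has the highest degree.
graph = {
    "a": {"hub", "x"},
    "b": {"hub", "x"},
    "c": {"hub"},
    "d": {"hub"},
    "x": {"a", "b"},
    "hub": {"a", "b", "c", "d"},
}

via_hub = trail_weight(graph, ["a", "hub", "b"])
via_x = trail_weight(graph, ["a", "x", "b"])
```

Both trails connect "a" and "b", but the one through the low-degree node "x" receives the higher weight.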