A spatio-temporal visual analysis tool for historical dictionaries.

Post on 11-Apr-2017

Alejandro Benito abenito@usal.es

Universidad de Salamanca (Spain)

A visual historical exploration tool for lexical resources: the case of Austrian language

Antonio Losada alosada@usal.es

Universidad de Salamanca (Spain)

Roberto Therón theron@usal.es

Universidad de Salamanca (Spain)

Eveline Wandl-Vogt eveline.wandl-vogt@oeaw.ac.at

Austrian Academy of Sciences (Austria)

DIGITAL HUMANITIES

Aims of our research

● Rethink the string-based, text-only workflows traditionally employed by academics

● Propose a new visual, interactive framework for language data exploration

● Provide a new perspective of data

● Speed up knowledge extraction

Approach: (3 + 1) Computational Pillars

NLP: Text mining and information retrieval techniques

GIS: Natural approach to data

SNA: Applying graph theory to data to perform relationship and pattern detection

DATAVIS: Central pillar. Unleashes the computational power of the other three.

Where do dictionaries come from?

- Questionnaires are distributed among the population

- They are collected and analysed in search of particularities

- Complementary fieldwork:

* Personal interviews

* Drawings & Maps

* Notes

* Other artefacts

- Cards are generated for each word found

- Further research is performed afterwards using the cards and artefacts

WBÖ, dbo@ema & exploreAT!

* WBÖ is an initiative with almost 100 years of history

* The DBÖ project (1993) starts creating a digital databank out of WBÖ data

* dbo@ema: Attempt to expose DBÖ data to the general public by using the Web (1st Dataset)

* TUSTEP-XML format is employed (2nd Dataset)

* In the process, historical geocoding data is also generated

exploreAT! 4 Core Topics:

• Digital infrastructures

• E-Lexicography

• Visual analysis tools

• Citizen Science

Timeline

Timeline created by: A. Dorn; clip art: www.clker.com

Our approach

Documental search engine

• Full support of string-based queries, typically used in lexicography (NLP)

• On top of that we add more dimensions:

* Spatial (GIS) and temporal searches

* Fuzzy + word distance queries
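As an illustration of what a word distance query can look like, here is a minimal sketch based on Levenshtein (edit) distance. The slides do not say which distance measure the tool uses, so the choice of Levenshtein, the helper name `fuzzySearch` and the sample headwords are all assumptions made for this example.

```javascript
// Classic dynamic-programming Levenshtein (edit) distance between two strings.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 characters of a and the first j of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: keep the headwords within maxDistance edits of the query.
function fuzzySearch(query, headwords, maxDistance) {
  const q = query.toLowerCase();
  return headwords.filter((w) => levenshtein(q, w.toLowerCase()) <= maxDistance);
}

// Made-up sample headwords, not taken from the dataset.
const sample = ["rot", "rod", "roth", "brot", "kraut"];
const hits = fuzzySearch("rot", sample, 1);
console.log(hits);
```

With a maximum distance of 1 the query keeps close spelling variants and drops unrelated words; raising the threshold widens the match set at the cost of more noise.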

Visualisation

• Linked-view layout that brings the computational power of GIS, SNA and NLP to the novice user

• SNA visualisations employing different techniques to foster pattern identification
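A linked-view layout usually relies on some shared state that every view listens to, so that a selection made in one view updates the others. The following is only a sketch of that idea; `createSelection`, the field names and the two stand-in views are hypothetical and do not describe the actual prototype.

```javascript
// Hypothetical shared selection that every view subscribes to.
function createSelection() {
  const listeners = new Set();
  let current = { place: null, yearRange: null };
  return {
    // Register a view callback; returns a function that unsubscribes it.
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener);
    },
    // Merge the change into the current selection and notify every view.
    update(patch) {
      current = { ...current, ...patch };
      for (const listener of listeners) listener(current);
    },
    get() {
      return current;
    },
  };
}

const selection = createSelection();
const log = [];

// Two stand-in "views": a real map or timeline would redraw here.
selection.subscribe((s) => log.push(`map shows ${s.place}`));
selection.subscribe((s) => log.push(`timeline shows ${s.place}`));

// Selecting a place in any view propagates to all of them.
selection.update({ place: "Wien" });
console.log(log);
```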

Software methodology

There is no predefined, by-the-book architectural model in DH

1. Microprototyping is necessary to gain initial insight into the data

2. Periodic exchange meetings with the team of lexicography experts

3. Progressive extraction, refinement and integration of requirements in the final prototype

Bernard et al. (2015)

Prototyping

Problems with the two datasets

DBO@EMA

Too tight relational approach

Cumbersome schema, difficult to work with

Slow string query response times

Models a traditional concept of dictionary

TUSTEP-XML

Too slow for creating a responsive visualisation for the web

Only contains textual information

Does not hold the required dimensions

Loose formatting.

/(1\d{3})(-\d{2})*|(1\d{1}.x):(\d{2})*-(\d{2})*/g

ort = ort.replace(/\/[A-Za-z]+/i, "");

SELECT * FROM gemeinde WHERE nameKurz LIKE ? OR nameKurz LIKE ? OR originaldaten LIKE ? OR originaldaten LIKE ?
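To make the heuristic rules more concrete, the sketch below applies the date pattern and the place-name clean-up shown on the slide to a couple of strings. The input strings and the function names `extractDates` and `cleanPlace` are invented for this example, and only the first alternative of the date pattern is exercised, since the slides do not document the format that the second alternative targets.

```javascript
// Date pattern copied from the slide (global flag, so every occurrence is found).
const DATE_PATTERN = /(1\d{3})(-\d{2})*|(1\d{1}.x):(\d{2})*-(\d{2})*/g;

// Hypothetical helper: return every date-like substring found in the text.
function extractDates(text) {
  return Array.from(text.matchAll(DATE_PATTERN), (match) => match[0]);
}

// Hypothetical helper: drop the first "/letters" suffix from a place string,
// as the slide's `ort.replace(...)` call does.
function cleanPlace(ort) {
  return ort.replace(/\/[A-Za-z]+/i, "");
}

// Made-up inputs, not taken from the TUSTEP-XML source.
const dates = extractDates("belegt 1920-25, wieder 1931");
const place = cleanPlace("Wien/Umgebung");

console.log(dates); // the two date-like substrings
console.log(place); // the place name without its suffix
```

A cleaned place string like this could then be bound to the `?` placeholders of the `gemeinde` query; how the four patterns are built is not shown in the slides, so that step is left out here.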

TUSTEP-XML

HEURISTIC RULES & SUPERVISED PROCESS

dbo@ema dataset + historical geocoding data

● Time

● Space

● Other features

Some numbers

A total of 2.206.227 records (95.3%) were imported out of the initial 2.314.031

The remaining 4.7% were discarded because of formatting errors in the data source

In the imported set of 2.2M records:

* 26.6% have a temporal but no spatial dimension

* 32.4% have a spatial but no temporal dimension

* 9.8% have both spatial and temporal dimensions

* 31.1% have neither a spatial nor a temporal dimension
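The breakdown above amounts to sorting every record into one of four groups, depending on whether it carries a date, a place, both or neither. A minimal sketch of that bookkeeping follows; the record shape (`year`, `place`) and the four demo records are assumptions, while the two totals are the ones quoted on the slide.

```javascript
// Hypothetical record shape: { year: number | null, place: string | null }.
function classify(record) {
  const hasTime = record.year != null;
  const hasSpace = record.place != null;
  if (hasTime && hasSpace) return "both";
  if (hasTime) return "temporalOnly";
  if (hasSpace) return "spatialOnly";
  return "neither";
}

// Percentage of records falling into each of the four groups.
function shares(records) {
  const counts = { both: 0, temporalOnly: 0, spatialOnly: 0, neither: 0 };
  for (const record of records) counts[classify(record)] += 1;
  const result = {};
  for (const [key, n] of Object.entries(counts)) {
    result[key] = records.length ? (100 * n) / records.length : 0;
  }
  return result;
}

// Made-up demo records, one per group.
const demo = [
  { year: 1913, place: "Wien" },
  { year: 1920, place: null },
  { year: null, place: "Graz" },
  { year: null, place: null },
];
const demoShares = shares(demo);

// Import rate computed from the totals on the slide.
const importedShare = (100 * 2206227) / 2314031;
console.log(demoShares, importedShare.toFixed(1));
```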

Our architectural model in DH

1. There is not a standard architecture in DH

2. We learnt from other researchers and the dbo@ema initiative

3. Our proposal is oriented towards:

a. Dealing with large amounts of data (>1M records)

b. Enhancing the user experience

c. Reactive components

d. Supporting visualisations that respond at interactive speeds

Resolution change

AND

OR

Case studies

1. Visual exploration of the usage of the word “red” and possible referents

2. Popular plant names ending in “-kraut” (herb)
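The second case study boils down to a suffix query whose hits are then aggregated by place for the map view. The sketch below shows that combination on a handful of invented records; the record shape and the function name `countBySuffixAndPlace` are assumptions, not the tool's actual API.

```javascript
// Made-up records; real entries would come from the imported dataset.
const records = [
  { lemma: "Johanniskraut", place: "Graz" },
  { lemma: "Schöllkraut", place: "Graz" },
  { lemma: "Krautkopf", place: "Wien" },
  { lemma: "Tausendguldenkraut", place: "Wien" },
];

// Hypothetical helper: count the lemmas ending in the suffix, grouped by place.
function countBySuffixAndPlace(entries, suffix) {
  // Accept the suffix with or without the leading hyphen ("-kraut" or "kraut").
  const s = suffix.replace(/^-/, "").toLowerCase();
  const counts = new Map();
  for (const { lemma, place } of entries) {
    if (!lemma.toLowerCase().endsWith(s)) continue;
    counts.set(place, (counts.get(place) ?? 0) + 1);
  }
  return counts;
}

const krautByPlace = countBySuffixAndPlace(records, "-kraut");
console.log(krautByPlace);
```

"Krautkopf" is skipped because it only starts with the string, which is the distinction a suffix query is meant to capture.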

Discussion

Provide a new concept of historical dictionary

Experts’ validation was very positive 

Future lines of work

Incorporate other data sources (OpenLink)

Visual representation of fuzzy results

Deal with areas of terrain instead of aggregations only

Ability to update/validate the dataset using citizen science

Results

That was about it for now.

Thanks for listening!

Questions?