
Mitglied der

DGD 2.0: A Web-based Navigation Platform for the Visualization, Presentation and Retrieval of German Speech Corpora

Joachim Gasch
E-mail: [email protected]


1. Introduction
1.1 The Collection of German Speech Corpora at the IDS
1.2 The Standardization Approach for cross-Corpus Information Management

2. The Online Navigation Platform
2.1 The Navigation Interface – Design Principles
2.2 The Visualization and Presentation of Speech Corpus Content
2.2.1 Generic Visualization of the XML Meta-Information of Speech Corpora
2.2.2 Transcript Visualization and Presentation
2.2.3 Media Presentation

3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components
3.1 The Full-Text Search Module
3.2 XQuery Information Retrieval in structured XML Documents

4. Summary and Outlook


1. Introduction
1.1 The Collection of German Speech Corpora at the IDS

The IDS is hosting a wide range of historical and contemporary German speech corpora

Many historical corpora can be (partially) accessed online via the Database for Spoken German (DGD)

=> Main objectives of the current DGD 2.0 project:

Generic, cross-corpus approach to speech corpus management

Normalized integration of historical and recent speech corpora

Sustainability of speech corpus data components

Object-oriented user interface (based on document structures) for corpus exploration and querying


1.2 The Standardization Approach for cross-Corpus Information Management

The speech corpus system manages meta-information of media source signals

Different corpora: the information structures of data components may vary considerably due to different linguistic research questions, e.g. represented genres, degree of content restriction, physical data structure, research field (natural vs. elicited speech)

=> Web-based speech corpus navigation platform:

Standardization concept: cross-corpus solution for large speech corpus collections rather than for particular speech corpus projects

Definition of a generic, system-wide data model containing the following components (systematically interlinked):

+ structured XML documentation instances on corpus, event and speaker level
+ unstructured, semi-structured or structured transcripts (time-aligned, multi-dimensional)
+ media source files
+ optional: unstructured secondary documents
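The interlinked components listed above could be modelled roughly as follows. This is only a minimal sketch: all class names, field names and file names are assumptions made for illustration and are not taken from the DGD 2.0 implementation. Only the corpus label DH and the event ID DH--_E_00167 appear elsewhere in these slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transcript:
    transcript_id: str
    structure: str  # "unstructured", "semi-structured" or "structured"


@dataclass
class MediaFile:
    path: str  # hypothetical location of a media source file


@dataclass
class Speaker:
    speaker_id: str
    documentation_xml: str  # speaker-level XML documentation instance


@dataclass
class Event:
    event_id: str
    documentation_xml: str  # event-level XML documentation instance
    speakers: List[Speaker] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)
    media_files: List[MediaFile] = field(default_factory=list)


@dataclass
class Corpus:
    corpus_id: str
    documentation_xml: str  # corpus-level XML documentation instance
    events: List[Event] = field(default_factory=list)
    secondary_documents: List[str] = field(default_factory=list)  # optional

    def find_event(self, event_id: str) -> Optional[Event]:
        """Follow the corpus -> event link; return None if the ID is unknown."""
        for event in self.events:
            if event.event_id == event_id:
                return event
        return None


# Hypothetical file names, for illustration only.
corpus = Corpus("DH", "DH.xml")
corpus.events.append(
    Event(
        "DH--_E_00167",
        "DH--_E_00167.xml",
        media_files=[MediaFile("DH--_E_00167.wav")],
    )
)
```

Keeping the links as plain object references makes the cross-component navigation (corpus to event to media file) explicit, independently of how a particular corpus structures its transcripts.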


Interlinked components of the normalized speech corpus data model


2. The Online Navigation Platform
2.1 The Navigation Interface - Design Principles

Object-oriented, document-centric interaction paradigm: based on document structures to be managed by the system

Provision of adaptive views of speech corpus data components

=> The application menu:

+ Flat structure of the navigation menu
+ Fixed position at the top of the screen
+ Permanent, homogeneous access to application components
+ Indication of flat / hierarchically subdivided menu entry points by the symbols ► and ▼


=> Classifying icons:

Intuitive user orientation by marking specific types of corpus data components with their corresponding icons

=> "Breadcrumb" navigation:

Helps users identify their current position in the navigation tree


2.2 The Visualization and Presentation of Speech Corpus Content

2.2.1 Generic Visualization of the XML Meta-information

Native XML database storage of documentation instances

Use of a generic XML rendering module to avoid corpus-specific instance visualizations, providing:

+ expandable / collapsible document nodes
+ node level selection functionality
+ direct access to hyperlinks

=> The cross-corpus display method (independent of any single corpus) for corpus, event and speaker documentation offers an ergonomic navigation experience (especially for large data-centric XML instances)
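One possible way to render arbitrary XML instances generically is to walk the element tree and emit nested, collapsible HTML. The sketch below is an assumption about how such a module might work, not the DGD 2.0 rendering code. It covers expandable nodes and hyperlinks only; node level selection, attributes and mixed content are left out, and the sample element names and the URL are invented.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_node(element: ET.Element) -> str:
    """Render one XML element (and its subtree) as nested HTML list items."""
    text = (element.text or "").strip()
    children = list(element)
    label = escape(element.tag)
    if not children:
        # Leaf node: show the value and turn URLs into hyperlinks.
        if text.startswith(("http://", "https://")):
            value = f'<a href="{escape(text, quote=True)}">{escape(text)}</a>'
        else:
            value = escape(text)
        return f"<li><b>{label}</b>: {value}</li>"
    # Inner node: <details> provides expand / collapse without scripting.
    inner = "".join(render_node(child) for child in children)
    return (
        f"<li><details open><summary>{label}</summary>"
        f"<ul>{inner}</ul></details></li>"
    )


def render_document(xml_source: str) -> str:
    """Render a whole XML document without any corpus-specific template."""
    root = ET.fromstring(xml_source)
    return f"<ul>{render_node(root)}</ul>"


# Invented sample instance, for illustration only.
sample = """
<event>
  <id>DH--_E_00167</id>
  <location>
    <city>St. Gallen</city>
    <link>https://example.org/st-gallen</link>
  </location>
</event>
"""
html_output = render_document(sample)
print(html_output)
```

Because the renderer depends only on the tree shape and not on element names, the same function can display corpus, event and speaker documentation alike.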


Generic XML document rendering


=> Documentation of geocodes:

The geographic coordinates of event locations may be documented in specific speech corpus projects

A geographic map can be displayed on demand: the example shows the geographic map for the event DH--_E_00167 (latitude 47.423336, longitude 9.377225), which took place in St. Gallen (Switzerland)
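The slides do not say which map service is used. As a hedged illustration, the documented geocodes could be turned into a map link like this, assuming OpenStreetMap's marker URL scheme; the function name and the zoom level are arbitrary choices for this sketch.

```python
from urllib.parse import urlencode


def map_url(latitude: float, longitude: float, zoom: int = 14) -> str:
    """Build an OpenStreetMap link with a marker at the given coordinates."""
    if not (-90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0):
        raise ValueError("coordinates out of range")
    query = urlencode({"mlat": latitude, "mlon": longitude})
    return f"https://www.openstreetmap.org/?{query}#map={zoom}/{latitude}/{longitude}"


# Geocodes documented for the event DH--_E_00167 (St. Gallen).
url = map_url(47.423336, 9.377225)
print(url)
```

Validating the coordinate ranges before building the link catches swapped or malformed geocodes in the event documentation early.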


Geographic map showing the event location (based on documented geocodes)


2.2.2 Transcript Visualization and Presentation

For larger speech corpus collections, a common concept of "transcript" becomes fuzzy:

+ Annotation of distinct phenomena
+ Use of heterogeneous (transcript-editor-specific) data formats

Historical speech corpora:

+ Unstructured transcript data formats (layout-oriented only)

Contemporary speech corpora:

+ Use of current annotation tools: structured data formats, but no cross-corpus structural homogeneity

Cross-corpus visualization is possible for the transcript-related part of the event documentation via the menu item "Transkripte" ("Transcripts"), which provides corpus-specific transcript access lists


Corpus-specific transcript list for the speech corpus DS


2.2.3 Media Presentation

Speech corpora may include different types of interdependent media files:

+ One event is related to one or more source files: the raw material recorded for an event (originating directly from an audio device)
+ An event can be composed of several speech events: further segmentation of the source files into speech-event-specific recordings

All relevant information regarding the different media file types is maintained in the meta-documentation of the corresponding event and can be accessed via the list under the menu item "Aufnahmen" ("Recordings")


Corpus-specific list of source recordings for the speech corpus DH


3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components

Media file content can only be located via descriptive meta-information:

+ meta data (schema-valid XML instances)
+ transcript data (unstructured, semi-structured, structured)

Transcript data in speech corpus collections varies widely in its degree of structuring

Retrieval strategies depend on this degree: from simple full-text search to complex layer-aware query processing

Transcript incompatibilities within a single corpus (worst-case scenario):

+ Signal segmentation without precise segmentation guidelines (e.g. phones, words, phrases or turns)
+ No or insufficient naming conventions for the different transcript layer descriptors (e.g. no unique descriptor used for the orthographic transcription layer)
+ No exact semantic layer definition available, or semantic mix-up of layer content (e.g. orthographic and phonetic markup mixed in one single layer)
+ No exact syntactic definition of layer content available, or syntactic mix-up of layer content (e.g. inconsistent punctuation or capitalization conventions in the orthographic layer)
+ Violation of cross-layer time relations (e.g. caused by interval changes made with multi-layer transcript editors without layer inheritance control)
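The last incompatibility in the list, violated cross-layer time relations, lends itself to an automatic check. The sketch below assumes a simple model in which every interval of a dependent layer (e.g. words) must lie completely inside one interval of its parent layer (e.g. turns); the layer names and time values are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def find_time_violations(
    parent: List[Interval], child: List[Interval]
) -> List[Interval]:
    """Return child intervals that are empty, inverted, or not fully
    contained in any single parent interval."""
    violations = []
    for start, end in child:
        contained = any(
            p_start <= start and end <= p_end for p_start, p_end in parent
        )
        if start >= end or not contained:
            violations.append((start, end))
    return violations


# Invented example: two turns and four word intervals.
turns = [(0.0, 2.5), (2.5, 6.0)]
words = [(0.0, 1.0), (1.0, 2.5), (2.4, 3.0), (5.5, 6.2)]
print(find_time_violations(turns, words))  # [(2.4, 3.0), (5.5, 6.2)]
```

Here the word interval (2.4, 3.0) straddles a turn boundary and (5.5, 6.2) runs past the end of the last turn, which is the kind of drift that editing one layer without inheritance control can produce.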


3.1 The Full-Text Search Module

No structured data is required (but can optionally be included)

Advantages: short query response times, easy-to-use interface

The full-text search functionality is implemented using Oracle Text

Examples of the provided full-text query features:

+ The single- and multi-character wildcards "_" and "%":
  _ind matches e.g. "Kind" and "Wind"
  %wind matches e.g. "Nordwind" or "Südwind"
+ The operators AND and OR build logical relations between search terms:
  Nordwind AND Südwind matches only documents with occurrences of both terms
+ The NOT operator excludes a specific search term:
  Nordwind NOT Südwind matches only documents containing "Nordwind" but not "Südwind"
+ The NEAR operator finds documents depending on the word distance of search terms:
  NEAR((Schule, Kirche), 4, TRUE) matches documents where both search terms occur, in the given order, with a maximum word distance of 4 words
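To make the wildcard semantics concrete, the following sketch emulates them on single words with regular expressions. It is not Oracle Text and does not call it; the function names are invented, and case-insensitive matching as well as "%" also matching an empty run of characters are assumptions of this emulation.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate "_" (one character) and "%" (any run of characters)
    into an equivalent regular expression."""
    parts = []
    for char in pattern:
        if char == "_":
            parts.append(".")   # exactly one arbitrary character
        elif char == "%":
            parts.append(".*")  # zero or more arbitrary characters
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern: str, word: str) -> bool:
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


for pattern, word in [("_ind", "Kind"), ("_ind", "Wind"),
                      ("%wind", "Nordwind"), ("%wind", "Südwind"),
                      ("%wind", "Windrad")]:
    print(pattern, word, matches(pattern, word))
```

The last pair shows the difference between a leading and a trailing wildcard: "%wind" requires the word to end in "wind", so "Windrad" is not a match.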


Full-text search in semi-structured transcript data with search results (KWIC-list)
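A KWIC (keyword in context) list shows each hit together with a few words to its left and right. The following is a minimal sketch of how such a list could be built from plain transcript text; the tokenization by whitespace, the bracket notation and the sample sentence are assumptions made for illustration.

```python
from typing import List


def kwic(text: str, keyword: str, width: int = 3) -> List[str]:
    """Return one line per hit, with up to `width` context words per side."""
    tokens = text.split()
    lines = []
    for index, token in enumerate(tokens):
        # Ignore surrounding punctuation and case when comparing.
        if token.strip('.,;:!?"').lower() == keyword.lower():
            left = " ".join(tokens[max(0, index - width):index])
            right = " ".join(tokens[index + 1:index + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample sentence, not taken from any corpus.
sample = "Der Nordwind weht kalt, aber der Südwind bringt Wärme."
for line in kwic(sample, "Südwind", width=2):
    print(line)  # aber der [Südwind] bringt Wärme.
```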


3.2 XQuery Information Retrieval in structured XML Documents

The full-text search option is not sufficient for retrieval in fine-grained XML instances (such as meta data or time-aligned multi-dimensional transcripts)

XQuery allows the implementation of context-sensitive queries on the hierarchically interdependent information units of XML-structured data:

+ criteria-specific information selection and filtering
+ joining of data from document selections
+ sorting, grouping, aggregating, transforming and restructuring of data
+ arithmetic calculations on numbers and dates

Powerful queries can be defined, but detailed knowledge of the underlying information structures is necessary

=> Two different approaches for the implementation of Web-based XQuery retrieval interfaces:

+ HTML form with a graphical representation of the XML tree (easy to use, but limited flexibility for query definition)
+ HTML form providing a text area field to enter the XQuery as plain text (intended for system experts only; complex queries on data-centric instances or cross-structural joins are also possible)
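The kinds of operations listed for XQuery (selection, sorting, aggregation) can be illustrated on a small XML instance. On the platform these would be written in XQuery against the XML database; Python's ElementTree with its limited XPath support is used here only to keep the sketch runnable without a database. All element names and values are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample instance, for illustration only.
SAMPLE = """
<corpus id="DEMO">
  <event id="E1"><city>Mannheim</city><duration_min>12</duration_min></event>
  <event id="E2"><city>St. Gallen</city><duration_min>30</duration_min></event>
  <event id="E3"><city>Mannheim</city><duration_min>45</duration_min></event>
</corpus>
"""

root = ET.fromstring(SAMPLE)

# Selection and filtering: events whose <city> child equals "Mannheim".
mannheim_ids = [e.get("id") for e in root.findall("./event[city='Mannheim']")]

# Sorting: event ids ordered by descending duration.
by_duration = sorted(
    root.findall("./event"),
    key=lambda e: int(e.findtext("duration_min")),
    reverse=True,
)
sorted_ids = [e.get("id") for e in by_duration]

# Grouping and aggregation: total duration per city.
totals = {}
for event in root.findall("./event"):
    city = event.findtext("city")
    totals[city] = totals.get(city, 0) + int(event.findtext("duration_min"))

print(mannheim_ids)  # ['E1', 'E3']
print(sorted_ids)    # ['E3', 'E2', 'E1']
print(totals)        # {'Mannheim': 57, 'St. Gallen': 30}
```

Each query depends on knowing that events carry `city` and `duration_min` children, which illustrates why detailed knowledge of the underlying information structures is needed.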


HTML form providing a graphical XQuery composition interface


HTML form for XQuery plain text submission


4. Summary and Outlook

Media source files become analyzable via their appropriate meta-information

Contemporary speech corpus systems have to close the gap between the processing of binary media data and related meta-information

The need for standardization of speech corpus components is commonly accepted

But: the identification of all necessary parameters for a cross-corpus standardization still remains an outstanding goal

Evolving technologies such as the MPEG-7 standard might in future provide appropriate logic to achieve the standardized integration of the different audiovisual information types (potentially involved in media corpora):

+ Audio
+ Voice
+ Video
+ Images
+ Graphs
+ 3D models

=> Questions? Suggestions?
