Date posted: 26-Mar-2015
Mitglied der
DGD 2.0: A Web-based Navigation Platform for the Visualization, Presentation and Retrieval of German Speech Corpora
Joachim Gasch
E-mail: [email protected]
1. Introduction
1.1 The Collection of German Speech Corpora at the IDS
1.2 The Standardization Approach for cross-Corpus Information Management
2. The Online Navigation Platform
2.1 The Navigation Interface - Design Principles
2.2 The Visualization and Presentation of Speech Corpus Content
2.2.1 Generic Visualization of the XML Meta-Information of Speech Corpora
2.2.2 Transcript Visualization and Presentation
2.2.3 Media Presentation
3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components
3.1 The Full-Text Search Module
3.2 XQuery Information Retrieval in structured XML Documents
4. Summary and Outlook
1. Introduction
1.1 The Collection of German Speech Corpora at the IDS
The IDS is hosting a wide range of historical and contemporary German speech corpora
Many historical corpora can be (partially) accessed online via the Database for Spoken German (DGD)
=> Main objectives of the current DGD 2.0 project:
Generic, cross-corpus approach to speech corpus management
Normalized integration of historical and recent speech corpora
Sustainability of speech corpus data components
Object-oriented user interface (based on document structures) for corpus exploration and querying
1.2 The Standardization Approach for cross-Corpus Information Management
The speech corpus system manages meta-information of media source signals
Different corpora: the information structures of data components may vary considerably because the corpora serve different linguistic research questions, e.g. in the genres represented, the degree of content restriction, the physical data structure and the research field (natural vs. elicited speech)
=> Web-based speech corpus navigation platform:
Standardization concept: cross-corpus solution for large speech corpus collections rather than for particular speech corpus projects
Definition of a generic, system-wide data model containing the following components (systematically interlinked):
+ structured XML documentation instances on corpus, event and speaker level
+ unstructured, semi-structured or structured transcripts (time aligned, multi-dimensional)
+ media source files
+ optional: unstructured secondary documents
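The interlinked components listed above can be pictured as a small object model. The following is only a minimal sketch under stated assumptions: the class and field names (Corpus, Event, Speaker, Transcript, MediaFile, find_event) are hypothetical and are not taken from the DGD 2.0 implementation, which stores its documentation instances in a native XML database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transcript:
    transcript_id: str
    structure: str  # "unstructured", "semi-structured" or "structured"


@dataclass
class MediaFile:
    file_name: str  # media source file belonging to an event


@dataclass
class Speaker:
    speaker_id: str
    documentation_xml: str  # structured XML documentation on speaker level


@dataclass
class Event:
    event_id: str
    documentation_xml: str  # structured XML documentation on event level
    speakers: List[Speaker] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)
    media_files: List[MediaFile] = field(default_factory=list)


@dataclass
class Corpus:
    corpus_id: str
    documentation_xml: str  # structured XML documentation on corpus level
    events: List[Event] = field(default_factory=list)
    secondary_documents: List[str] = field(default_factory=list)  # optional

    def find_event(self, event_id: str) -> Optional[Event]:
        """Follow the corpus -> event link; return None if the event is unknown."""
        for event in self.events:
            if event.event_id == event_id:
                return event
        return None
```

With such a model, every transcript and media file is reachable from its event, and every event from its corpus, which is the kind of systematic interlinking the slide describes.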
Interlinked components of the normalized speech corpus data model
2. The Online Navigation Platform
2.1 The Navigation Interface - Design Principles
Object-oriented, document-centric interaction paradigm: based on document structures to be managed by the system
Provision of adaptive views of speech corpus data components
=> The application menu:
Flat structure of the navigation menu
Fixed position at the top of the screen
Permanent, homogeneous access to application components
Indication of flat / hierarchically subdivided menu entry points by the symbols ► and ▼
=> Classifying icons:
Intuitive user orientation by marking specific types of corpus data components with their corresponding icons
=> "Breadcrumb" navigation:
Helps users identify their current position in the navigation tree
2.2 The Visualization and Presentation of Speech Corpus Content
2.2.1 Generic Visualization of the XML Meta-information
Native XML database storage of documentation instances
Use of a generic XML rendering module to avoid corpus-specific instance visualizations, providing:
+ expandable / collapsible document nodes
+ node level selection functionality
+ direct access to hyperlinks
=> The cross-corpus (i.e. independent of any single corpus) display method for corpus, event and speaker documentation offers an ergonomic navigation experience (especially for large data-centric XML instances)
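To illustrate what schema-independent rendering with collapsible nodes can look like, here is a minimal Python sketch. It is not the DGD 2.0 rendering module: the function name render, the max_depth parameter (standing in for node level selection) and the sample element names are assumptions, and reusing the symbols ► and ▼ for collapsed and expanded nodes is a choice made for this example only.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical documentation instance; the element names are illustrative only.
SAMPLE = (
    "<event><location><city>St. Gallen</city>"
    "<country>Switzerland</country></location></event>"
)


def render(element: ET.Element, max_depth: int, depth: int = 0) -> List[str]:
    """Return an indented outline for any XML element, without schema knowledge.

    Nodes with children at or below max_depth are shown collapsed (►);
    expanded nodes with children are marked with ▼. Attributes are ignored.
    """
    indent = "  " * depth
    has_children = len(element) > 0
    if has_children and depth >= max_depth:
        # Collapsed node: its children stay hidden until a deeper level is chosen.
        return [f"{indent}► {element.tag}"]
    marker = "▼ " if has_children else ""
    text = (element.text or "").strip()
    line = f"{indent}{marker}{element.tag}"
    if text:
        line += f": {text}"
    lines = [line]
    for child in element:
        lines.extend(render(child, max_depth, depth + 1))
    return lines


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print("\n".join(render(root, max_depth=2)))
```

Because the function only walks the element tree, the same code can display corpus, event and speaker documentation alike, which is the point of a generic renderer.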
Generic XML document rendering
=> Documentation of geocodes:
The geographic coordinates of event locations may be documented in specific speech corpus projects
A geographic map can be displayed on demand: the example shows the geographic map for the event DH--_E_00167 (with geographic latitude 47.423336 and longitude 9.377225), which took place in St. Gallen (Switzerland)
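As a rough sketch of how documented geocodes can be turned into a map view, the Python function below validates a coordinate pair and builds a link to a web map. The slides do not say which map service DGD 2.0 uses; OpenStreetMap and its marker URL parameters are chosen here purely as an assumption, and the function name map_url is hypothetical.

```python
def map_url(latitude: float, longitude: float, zoom: int = 14) -> str:
    """Build an OpenStreetMap link with a marker at the given coordinates.

    Assumption: OpenStreetMap is used only for illustration; the map
    service behind DGD 2.0 is not named in the source.
    """
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    return (
        "https://www.openstreetmap.org/"
        f"?mlat={latitude}&mlon={longitude}#map={zoom}/{latitude}/{longitude}"
    )


if __name__ == "__main__":
    # Coordinates documented for the event DH--_E_00167 (St. Gallen).
    print(map_url(47.423336, 9.377225))
```

Validating the ranges first keeps malformed geocodes in the documentation from producing a misleading map.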
Geographic map (based on documented geocodes showing the event location)
2.2.2 Transcript Visualization and Presentation
For larger speech corpus collections, a common concept of "transcript" becomes fuzzy:
+ Annotation of distinct phenomena
+ Use of heterogeneous (transcript editor specific) data formats
Historical speech corpora:
+ Unstructured transcript data formats (only layout oriented)
Contemporary speech corpora:
+ Use of the annotation tools available today: structured data formats but no cross-corpus structural homogeneity
Cross-corpus visualization is possible for the transcript-related part of the event documentation via the menu item "Transkripte" ("Transcripts"; corpus-specific transcript access lists)
Corpus-specific transcript list for the speech corpus DS
2.2.3 Media Presentation
Speech corpora may include different types of interdependent media files:
+ One event is related to one or more source files: the raw material recorded for an event (originating directly from an audio device)
+ An event can be composed of several speech events: further segmentation of the source files into speech event specific recordings
All relevant information regarding the different media file types is maintained in the meta-documentation of the corresponding event and can be accessed via the list under the menu item "Aufnahmen" ("Recordings")
Corpus-specific list of source recordings for the speech corpus DH
3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components
Media file content can only be located via descriptive meta-information:
+ meta data (schema-valid XML instances)
+ transcript data (unstructured, semi-structured, structured)
The transcript data of speech corpus collections varies widely in its degree of structuring
Retrieval strategies depend on this degree: from simple full-text search to complex layer-aware query processing
Single corpus transcript incompatibilities (worst case scenario):
+ Signal segmentation without precise segmentation guidelines (e.g. phones, words, phrases or turns)
+ No or insufficient naming conventions applied for the different transcript layer descriptors (e.g. no unique descriptor used for the orthographic transcription layer)
+ No exact semantic layer definition available or semantic mix-up of layer content (e.g. mix-up of orthographic and phonetic markup in one single layer)
+ No exact syntactic definition of layer content available or syntactic mix-up of layer content (e.g. mix-up of punctuation or capitalization conventions in the orthographic layer)
+ Violation of cross-layer time relations (e.g. caused by interval changes that were made with multi-layer transcript editors without layer inheritance control)
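The last incompatibility, violated cross-layer time relations, lends itself to an automatic check. The sketch below assumes a simple representation that is not taken from the source: each transcript layer is a list of (start, end) intervals in seconds, and every interval of a finer layer (for example words) must lie completely inside one interval of the coarser layer (for example turns). The name find_time_violations is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def find_time_violations(
    parent: List[Interval], child: List[Interval]
) -> List[Interval]:
    """Return the child intervals that break the cross-layer time relation.

    A child interval is valid if it has positive length and is fully
    contained in at least one parent interval.
    """
    violations = []
    for start, end in child:
        contained = any(
            p_start <= start and end <= p_end for p_start, p_end in parent
        )
        if start >= end or not contained:
            violations.append((start, end))
    return violations


if __name__ == "__main__":
    turns = [(0.0, 2.5), (2.5, 5.0)]
    words = [(0.0, 1.0), (1.0, 2.5), (2.4, 3.0), (4.0, 5.5)]
    # The last two word intervals cross or exceed the turn boundaries.
    print(find_time_violations(turns, words))
```

A check of this kind could be run when transcripts are imported, so that layer-aware queries are only offered for transcripts whose layers are consistent.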
3.1 The Full-Text Search Module
No structured data is required (but can be optionally included)
Advantages: short query response times, easy user interface handling
The full-text search functionality is implemented using Oracle Text
Examples of the provided full-text query features:
+ The single and multiple character wildcards "_" and "%":
_ind matches e.g. "Kind" and "Wind"
%wind matches e.g. "Nordwind" or "Südwind"
+ The operators AND and OR build logical relations between search terms:
Nordwind AND Südwind matches only documents with occurrences of both terms
+ The NOT operator excludes a specific search term:
Nordwind NOT Südwind matches only documents containing "Nordwind" but not containing "Südwind"
+ The NEAR operator finds documents depending on the word distance of search terms:
NEAR((Schule, Kirche), 4, TRUE) matches documents where both search terms occur, in the given order, with a (maximum) word distance of 4 words
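The semantics of the wildcard and NEAR features can be approximated in a few lines of Python. This is only an emulation for illustration and does not call Oracle Text: the function names matches and near are hypothetical, matching is made case-insensitive by assumption, and the way Oracle Text measures the span between terms may differ from the simple token-position difference used here.

```python
import re
from typing import List


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate "_" (exactly one character) and "%" (any run of characters)."""
    parts = []
    for char in pattern:
        if char == "_":
            parts.append(".")
        elif char == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern: str, word: str) -> bool:
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


def near(
    tokens: List[str], first: str, second: str, max_distance: int, ordered: bool
) -> bool:
    """True if both terms occur within max_distance token positions.

    With ordered=True, `first` must precede `second`.
    """
    first_pos = [i for i, t in enumerate(tokens) if t.lower() == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t.lower() == second.lower()]
    for i in first_pos:
        for j in second_pos:
            if ordered and j < i:
                continue
            if 0 < abs(j - i) <= max_distance:
                return True
    return False


if __name__ == "__main__":
    print(matches("_ind", "Kind"), matches("%wind", "Nordwind"))
    # Made-up example sentence, tokenized on whitespace.
    sentence = "Die Schule steht neben der Kirche".split()
    print(near(sentence, "Schule", "Kirche", 4, True))
```

In the example sentence, "Schule" and "Kirche" are four token positions apart, so the check succeeds with a maximum distance of 4 and fails with 3.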
Full-text search in semi-structured transcript data with search results (KWIC-list)
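A KWIC (keyword in context) list of the kind named in this caption shows each hit together with a few words of left and right context. The Python sketch below is a generic illustration, not the DGD 2.0 implementation; the function name kwic, the whitespace tokenization and the window size are assumptions.

```python
from typing import List, Tuple


def kwic(
    tokens: List[str], keyword: str, window: int = 3
) -> List[Tuple[str, str, str]]:
    """Return (left context, hit, right context) for every keyword occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    # Made-up token sequence for illustration.
    tokens = "der Nordwind und der Südwind".split()
    for left, hit, right in kwic(tokens, "der", window=1):
        print(f"{left:>10} | {hit} | {right}")
```

Keeping the three parts separate lets a result page align the hits in one column, which is what makes a KWIC list quick to scan.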
3.2 XQuery Information Retrieval in structured XML Documents
The full-text search option is not sufficient for retrieval in fine-grained XML instances (like meta data or time-aligned multi-dimensional transcripts)
XQuery allows the implementation of context-sensitive queries for the hierarchically interdependent informational units of XML-structured data:
+ criteria-specific information selection and filtering
+ joining of data from document selections
+ sorting, grouping, aggregating, transforming and restructuring of data
+ arithmetic calculations on numbers and dates
Powerful queries can be defined, but detailed knowledge of the underlying information structures is necessary
=> Two different approaches for the implementation of Web-based XQuery retrieval interfaces:
+ HTML form with a graphical representation of the XML tree (easy to use but limited flexibility for query definition)
+ HTML form providing a text area field to enter the XQuery as plain text (intended for system experts only; complex queries on data-centric instances or cross-structural joins are also possible)
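The kinds of operations listed for XQuery (selection and filtering, sorting, aggregation) can be illustrated on a tiny XML instance. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in; it is not XQuery and not the DGD 2.0 interface, and the element names, attribute names and sample values are hypothetical. In XQuery itself, each function would correspond roughly to one FLWOR expression.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List

# Hypothetical meta data instance; structure and values are illustrative only.
SAMPLE = """
<corpus id="DH">
  <event id="E3"><city>St. Gallen</city><speakers>1</speakers></event>
  <event id="E1"><city>St. Gallen</city><speakers>2</speakers></event>
  <event id="E2"><city>Mannheim</city><speakers>3</speakers></event>
</corpus>
"""


def events_in_city(root: ET.Element, city: str) -> List[str]:
    """Select events by criterion (city) and return their ids in sorted order."""
    selected = [e for e in root.iter("event") if e.findtext("city") == city]
    return sorted(e.get("id") for e in selected)


def speakers_per_city(root: ET.Element) -> Dict[str, int]:
    """Group events by city and sum the documented number of speakers."""
    totals: Counter = Counter()
    for event in root.iter("event"):
        totals[event.findtext("city")] += int(event.findtext("speakers"))
    return dict(totals)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(events_in_city(root, "St. Gallen"))
    print(speakers_per_city(root))
```

Both functions depend on knowing that events carry city and speakers elements, which mirrors the remark above that detailed knowledge of the underlying information structures is necessary.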
HTML form providing a graphical XQuery composition interface
HTML form for XQuery plain text submission
4. Summary and Outlook
Media source files become analyzable via their appropriate meta-information
Contemporary speech corpus systems have to close the gap between the processing of binary media data and related meta-information
The need for standardization of speech corpus components is commonly accepted
But: the identification of all necessary parameters for a cross-corpus standardization still remains an outstanding goal
Evolving technologies like the MPEG-7 standard might in future provide appropriate logic to achieve the standardized integration of the different audiovisual information types (potentially involved in media corpora):
+ Audio
+ Voice
+ Video
+ Images
+ Graphs
+ 3D models
=> Questions? Suggestions?