D3.1 Retrieval System Functionality and Specifications

Abstract

The EASAIER Retrieval System of Work Package 3 allows retrieval of documents in a sound archive through a combination of feature extraction (search by similarity) and metadata descriptors. It exploits the features described and extracted in the Sound Object Representations of Work Package 4, and the ontology devised in Work Package 2. The retrieval system has four components: music retrieval, speech retrieval, cross-media retrieval and an additional vocal query interface. This document describes the functionality of each of these systems, how they are specified, and how they will be integrated into a complete retrieval system for the EASAIER prototype. It is based on the documents published on the EASAIER restricted website and prepared by WP3 contributors.

D3.1 Retrieval System Functionality and Specifications

Version 1.12
Date: November 1, 2006
Editor: ALL
Contributors: ALL, DIT, NICE, QMUL, RSAMD


Table of Contents

1. Executive Summary
2. User Needs and Requirements
3. Relationship to Other Work Packages
4. Music Retrieval
   4.1. Content-based Similarity
   4.2. Metadata Searches
5. Speech Retrieval
   5.1. Search in Digitised Speech Archives
   5.2. The Architecture of Retrieval
   5.3. The Technology of Phonematization and Phoneme Level Recognition
   5.4. Speech Retrieval Technical Specification
6. Indexing and Speaker Retrieval
   6.1. Introduction
   6.2. Speaker Characteristic Indexing - Voice Print, Gender, Emotion
   6.3. Speaker Characteristic Retrieval
   6.4. Sample Test User Interface
7. Cross-Media Retrieval
   7.1. Introduction
   7.2. Uploading Content and Content Analysis
   7.3. Similarity Searching and Indexing
8. Vocal Query Interface
   8.1. Technology Description
   8.2. Vocal Query - User Approach
9. Retrieval System Integration and Knowledge Management
   9.1. The Integrated Retrieval System
   9.2. Ontology-based Knowledge Representation
   9.3. Interface to Information Retrieval Systems
10. Example Retrieval System Mock-ups and Interfaces
References


1. Executive Summary

In WP3 the partners specify and develop 3 separate retrieval systems and an interface:

• T3.1 is responsible for the specification and development of a music retrieval system, led by QMUL. This task addresses searching and organizing audio collections according to their relevance to music-related queries, using high-level features to produce a ranked list of audio files related to an audio query, for example by examining melodic and harmonic similarity.

• T3.2 is responsible for the specification and development of a speech retrieval system, led by ALL.

• T3.3 is responsible for the development of the cross-media retrieval system, led by QMUL.

• T3.4 relates to T3.2; using the same methods, it aims to specify and develop the vocal query interface.

The work is done in parallel; however, close coordination ensures that the result of each task will complement the other tasks, leading to a robust solution to the problem. Initially, the parts will be specified separately. Even the development of the prototype will be done separately, producing three test user interfaces with a common design. Towards the end of the project, and in conjunction with other Work Packages, they will be integrated into a complete retrieval system.

This document describes the specification and functionality of all the retrieval system components. Section 2 describes the User Needs and Requirements which have already been established and which are used to determine the features of the retrieval system. Fictional use-case scenarios are also provided throughout the document to illustrate various concepts and issues. Section 3 describes how the retrieval system is related to other components of the EASAIER system. The music retrieval system, which incorporates both abstract music similarity measures and search by metadata descriptors, is discussed in Section 4. Discussion of the speech retrieval components is provided in Sections 5 and 6. Section 5 deals with speech retrieval where the query is provided as text which may be uttered by a speaker; this section relates to technologies provided by ALL. Section 6, which relates to NICE's contribution, deals with speaker retrieval, where one wishes to retrieve a known speaker, or speech with certain characteristics, regardless of the words which are uttered. Section 7 provides a description of the cross-media retrieval system, where different media may be retrieved using a combination of feature extraction and descriptor searches. Section 8 describes the vocal query interface, which allows certain queries to be spoken rather than entered as text. Section 9 then explains how the various retrieval system components are integrated together, and how the retrieval system relates to the knowledge management layer. Finally, Section 10 provides various mock-ups and example interfaces for retrieval system components (though example interfaces are also provided in other sections where appropriate).


2. User Needs and Requirements

Example use-case scenarios for the retrieval system are provided throughout this document. They serve to illustrate the situations for which the EASAIER retrieval system is intended. These situations demonstrate realistic queries, and the associated EASAIER functionality, from the perspective of music students, musicians, lecturers, amateur musicians and a music enthusiast.

Use-Case Scenario 1 - The amateur musician

Greg is a guitarist in a band consisting of old school friends. A vinyl enthusiast, the pride of Greg's collection is a complete set of Pink Floyd and Led Zeppelin albums. On borrowing his girlfriend's MP3 player, Greg discovered the large amount of material available digitally. Hearing an alternative version of Led Zep's Black Dog on a late night radio show, Greg became interested in finding alternative and live recordings of the songs played by his rock'n'roll heroes, but he finds it difficult searching the internet for such tracks as he can rarely listen to them without buying them first. Following a Google search, Greg logs onto a classic rock'n'roll archive that uses the EASAIER system. He enters 'Pink Floyd' in the author/title field and also puts 'live' in the keyword field. To his delight, his search returns an alternative, live version of a segment from Atom Heart Mother with a pared-down orchestration, much sparser than the original studio mix. The metadata displays the distinctive cover art of the Atom Heart Mother album (showing a large cow named Lulubelle III) and a picture of a Pink Floyd performance from around the time the album came out. Happy with his musical research, Greg plays this segment again, and looks for further alternative versions of his favourite tracks. Greg sends his fellow band members an email with a link to the website employing the EASAIER system.

A significant problem with prior research into audio processing, sound archive access and semantics is that it has often not taken into account the wealth of user needs studies in this area. Previous work within the digital library community has identified strong demand for specific tools. The EASAIER Retrieval System will address user needs for sound archives which have already been identified in [1-4], as well as recent work [5-7] which provided a systematic study of what end users want from music retrieval systems and the types of queries that they make.

User needs studies [3] and extensive research by the JISC [4, 8, 9] have identified a number of key features that are required in order to enrich sound and music archives. These findings stress the need for web-based access, integration of other media, and enriched access and playback tools. These studies confirm and build on previous work on user needs for digitized audio collections, such as the exploratory studies carried out by the Library of Congress [10] or, in the mid 1990s, by the European-funded Jukebox and Harmonica projects [11, 12]. The development of Indiana University's Variations project [1] was founded on close analysis of users' needs, particularly music students' need for annotation and visualisation tools to help them learn with digital music content. That project aims to establish a digital music library testbed containing music in a variety of formats, involving research and development in system architecture, metadata standards, component-based application architecture, and network services. The music is in multiple media and formats: audio, video, musical scores, computerized score notation. Users listen to sound recordings, display printed scores, and search the collection in all formats from a single search engine. The project integrates information systems design with research focused in the areas of usability, intellectual property, music pedagogy, network performance and monitoring, and information systems architecture.

However, a recent Scottish study into training needs analysis in e-learning [13] reported that audio is still an under-used technology. The UK Arts and Humanities Research Council's ICT Programme has recently funded work surveying the needs of the research community in searching and analysis tools for audio streams [14]. But these studies often do not have complementary development projects, and vice-versa. Thus user requirements are not addressed, and tools are often created without satisfying any need. Most audio archives are very limited in terms of access and interaction [2, 9]. In essence, they give the user the option of discovering and requesting a given resource via metadata. If online listening access is available, it is usually restricted, constrained not by the material in the corpus but by the imposed functionality of the interface. EASAIER will remove these barriers, thus allowing the user to choose their means of access and presentation.


Studies from within the consortium have also contributed significant findings. HOTBED (Handing On Tradition By Electronic Dissemination) [3] evaluated the use of networked digital sound materials in the conservatoire curriculum. The project built a networked collection of sound and video resources and tools to manipulate them, investigating through close user needs analysis how best to exploit these to enhance learning and teaching. The HOTBED project identified a strong user need for other media in addition to audio. The JISC User Requirements Study for a Moving Pictures and Sound Portal [4] also identified the need to have "cross-searching between still and time-based collections." Furthermore, the HOTBED study identified the use of video as having a strong impact on aural learning. Thus, it is clear that such a collection must incorporate video and other media as well as audio, and provide significant interaction between the media.

Finally, studies such as HOTBED, and personal communications with the National Library of Scotland, who serve on the Expert User Advisory Board of EASAIER, have also identified the need for incorporation of speech retrieval tools in music archives. This is significant in that many audio researchers may ignore the speech that exists in almost any audio archive. The HOTBED study, however, noted that many of the spoken recordings in archives, especially those intended for traditional music, were of exceptionally poor quality and included relatively obscure dialects and accents [15]. Thus the performance of speech retrieval tools in such a scenario is extremely limited.

Collating the findings from these studies, we have established the following user needs and requirements.

• Search for media fitting a wide variety of metadata - This necessitates, where possible, the automatic extraction of semantically meaningful metadata, and the means to search across multiple and sometimes ambiguous selections of metadata.

• Relevant near neighbour query-by-example searches - This is to be distinguished from those systems that allow the user to enter an audio sample and find if it exists in a database. Here, we will produce ranked lists of related audio files, which may be similar in musical or timbral qualities.

• Speech retrieval tools - Music archives deal primarily with music, and their spoken portions unfortunately cannot always be managed with speech retrieval systems, because they are often spoken in dialects. However, most sound archives also include more standard spoken materials. Traditional libraries (books on CD, DVD) and broadcast materials are usually of high quality and are suitable for speech retrieval.

• A cross-media information retrieval system - We will implement a cross-media information retrieval system, which allows the user to enter a piece of media as a query and retrieve a different type of media as a related document.

• Enriched access and interaction tools - While this is an important feature of the EASAIER system, this functionality is treated separately from the core components of the Retrieval System. It is addressed in Work Packages 5 and 6.

Use-Case Scenario 2 - The music student

Chen-yin Mae, a harp student at the RSAMD, recently performed the sonata for harp, flute and viola by Debussy. She and her two colleagues wish to put together a concert programme based around this work with the same unusual instrumentation. Chen-yin accesses a sound archive using the EASAIER system. She puts 'harp, flute, viola' in the orchestration/instrumentation field. The group would like to make the programme all 20th-century, so Chen-yin enters '19**' in the year field. Her search returns several works, including the Elegiac trio by Arnold Bax and a transcription of Ravel's Sonatine (originally composed for the piano) by Carlos Salzedo. Having already heard the Bax piece in concert, Chen-yin listens to the Ravel transcription. Afterwards, she returns to the list and selects a later piece, Tre Ecloghe (1984), written for the same group of instruments by the composer Jay Anthony Gach. There is a link to Gach's website displayed in the metadata. Together with the Debussy, Chen-yin thinks this selection of three pieces will form a balanced programme, and sets about marking the technically difficult passages in the audio track using the EASAIER enriched access tools (WP5). Later, as her flat has wireless Internet, Chen-yin can play back these marked passages to herself in her practising, and to her ensemble when they rehearse.


3. Relationship to other Work Packages

The EASAIER project comprises 8 work packages, of which Work Packages 2-7 have strong technical components and are highly interrelated. The following table summarises the important features of the other technical Work Packages as they relate to WP3.

Table 1. A summary of how other EASAIER Work Packages relate to the Retrieval System.

WP2 - Media Semantics and Ontologies
  Description: Provides the structure of the descriptors describing all resources and the relationships between them.
  Inputs to the Retrieval System: The ontology will describe how resources are organised and connected, and all queries will exploit the ontology to retrieve relevant resources.
  Outputs from the Retrieval System: The retrieval system functionality may necessitate slight modifications to the ontology.

WP4 - Sound Object Representation
  Description: Provides for extraction of metadata and low-level features for all audio objects.
  Inputs to the Retrieval System: These features represent the fields, metadata and representations on which searches are performed.
  Outputs from the Retrieval System: Retrieval system functionality determines many of the features which will be extracted.

WP5 - Enriched Access Tools
  Description: Construction of tools enabling a more enriched experience when accessing the media in the sound archive.
  Inputs to the Retrieval System: None.
  Outputs from the Retrieval System: Determines the type of resources on which these tools can be applied.

WP6 - Intelligent Interfaces
  Description: Integration of software components from other Work Packages and development of a unified interface.
  Inputs to the Retrieval System: The system architecture approach may constrain the retrieval system specification.
  Outputs from the Retrieval System: The retrieval system functionality and the method of returning results will determine the interface and presentation.

WP7 - Evaluation and Benchmarking
  Description: Assess all aspects of the EASAIER system.
  Inputs to the Retrieval System: User needs studies, wish lists, benchmarking and critique.
  Outputs from the Retrieval System: Mock-ups and prototypes of each component of the retrieval system.

From Table 1 it is clear that several components of a fully functional retrieval system are dealt with primarily by other work packages. The database structure falls within the remit of Work Package 2, which also describes links between descriptors and objects, and enables sophisticated, intelligent queries. The automatic creation of metadata, and feature extraction for query-by-example, is primarily addressed in Work Package 4. Access to the retrieved media is performed in WP5, and the interface for the entire system, including retrieval and presentation, is part of WP6. Input requirement studies and output evaluation studies are provided by WP7. Thus, the retrieval system (WP3) exploits certain features (WP4), connected through the ontology (WP2), in order for the user to access various media (WP5, WP6) based on the assessment of WP7.

Understanding of the retrieval system is, for the most part, independent of knowledge about other work packages. Even the ontological framework need not be understood in its entirety to perform queries and retrieve results. However, the Sound Object Representations of WP4 largely determine the full capabilities of the retrieval system. That work package deals with the identification of features within the archived audio assets. As far as musical audio is concerned, the tools will enable the extraction of high- and mid-level descriptors for classification and search purposes.


4. Music Retrieval

T3.1 works on searching and organizing audio collections according to their relevance to music-related queries, using metadata and automatically extracted features to produce a ranked list of audio files related to an audio query. QMUL will lead this task and, in conjunction with DIT, will provide algorithms and research for this subtask.

Use-Case Scenario 3 - The lecturer

Dr Goran Kryllic is a lecturer in choral conducting. He has come to believe, through personal experience, that when choirs sing a cappella in the key of F major, they tend to go flat. Dr Kryllic, when he directs an a cappella work for choir written in F major, will always transpose it up or down a semitone. However, he is finding it difficult to demonstrate the rationale behind this to the students on his choral conducting course. Logging onto an EASAIER system, Dr Kryllic fills in various fields: 'F major' in the key field, 'a cappella' in the keyword field, 'choral' in the genre field. As he wants to find works of both fast and slow tempi, he leaves the tempo range field blank. Of the twelve tracks displayed, eight display the undesirable musical behaviour Dr Kryllic believes typical of choirs singing a cappella in the key of F major. Using the EASAIER enriched access tools (WP5), he marks the various tracks in order to show his students the particular stages in the works where the choir goes flat. He uses the EASAIER sound representation (WP4) to find the places where the choirs sing at their loudest, most quiet, highest or lowest. In this way he can begin to explore what it is particularly about F major that makes choirs go flat. To further prove his point Dr Kryllic enters the same search information specifying all a cappella works for choir not in F major. He listens to these and not one of the examples goes out of tune. After his time using the EASAIER system, Dr Kryllic realises he has not only got the basis for a fascinating lecture, but also the starting point for an exciting research project.

The audio data goes through a number of modules, developed in WP4, for the extraction of musical features, which will be included in the metadata associated with the audio file under analysis for classification and search purposes. The path taken by an audio file undergoing the process of archiving is shown in Figure 1. The PCM data is pre-processed by algorithms developed under both WP4 and WP5 that allow, whenever possible, the separation of individual sound sources and, whenever required, the restoration and de-noising of damaged audio assets.

Figure 1. Internal architecture of archiving module for musical audio.


Following this stage, the original and de-mixed sound objects are analysed by feature extraction algorithms organised hierarchically and divided into three broad categories: low-, mid- and high-level feature extractors. Low-level extractors perform a frame-based analysis of the audio objects in order to gather raw spectral and time-domain data, which is then processed by mid-level extractors that return musically relevant, time-synchronous information such as harmonic and timbral profiles, chord sequences and the position of beats. The outputs of the mid-level algorithms are then further processed by the high-level feature extractors to obtain global, and mostly single-valued, information such as tempo, meter, global key or the presence of a particular instrument within the audio file.

Whilst mid-level descriptors are particularly suitable for transcription purposes and similarity-based searches within the EASAIER archives, high-level extractors can be employed to perform parameter-based searches such as: "find an audio file exhibiting a tempo of 120 bpm at 4/4 time signature and containing the instrument conga". Unfortunately, high-level feature extractors are not robust enough at this stage of development to guarantee absolute consistency, hence we envisage the use of a "reliability metric" that can prompt the operator to double-check the results and, if necessary, to manually populate the relevant high-level tags.

The end user will be able to access the content of the EASAIER archive by means of an application that can retrieve an audio asset and its associated metadata using a variety of non-mutually exclusive query methodologies, such as:

- Queries based on general tags: e.g. find material by author/title, genre and year.
- Musical parameter-based queries: e.g. find songs by key, orchestration or tempo range.
- Similarity-based queries: e.g. once a musical audio asset has been retrieved, find other assets that exhibit some degree of similarity in terms of macroscopic structure, timbre and harmonic profile.

The audio is delivered by the server (either by streaming or download of the entire compressed file) to the client application and then buffered and converted to a suitable format for further processing and visualization. As well as information for other Work Packages, the metadata also contains general and music-specific tags providing comprehensive information regarding the audio asset under analysis (displayed in the "Browsing and Searching UI"). Although high- and mid-level musical descriptors are generated by the archiving application on the server side, the EASAIER system can be enhanced by also offering similarity-based searches using audio files residing on the client's hard drive. This functionality will require the deployment of a scaled-down version of the archiving application, allowing the generation of data that can be used to search the contents of the EASAIER server, as shown in Figure 2.

The challenge is not in devising a music retrieval system, per se, but in making it robust and versatile, and integrating it into the EASAIER framework. The task may be subcategorized into content-based similarity searching, metadata searching (including automatically extracted metadata), and integrated music retrieval which exploits both approaches.
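As an illustration of how these non-mutually exclusive query types could be combined in a single client request, the following sketch composes a tag-based, parameter-based and similarity-based query. The class, field names and example values are illustrative assumptions only; they are not part of the EASAIER specification.

```python
# Hypothetical sketch of a combined query mixing general tags, musical
# parameters and a similarity request. Names and fields are assumptions,
# not EASAIER identifiers.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MusicQuery:
    # General tags
    title: Optional[str] = None
    genre: Optional[str] = None
    year: Optional[str] = None
    # Musical parameters
    key: Optional[str] = None
    orchestration: Optional[str] = None
    tempo_range_bpm: Optional[Tuple[int, int]] = None
    # Similarity: id of an already retrieved asset to rank results against
    similar_to_asset: Optional[str] = None

# Example: choral a cappella works in F major, ranked by similarity
# to a previously retrieved recording.
query = MusicQuery(
    genre="choral",
    key="F major",
    tempo_range_bpm=None,           # leave the tempo range open
    similar_to_asset="asset:0042",  # hypothetical asset identifier
)
print(query)
```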


Figure 2. Retrieval of musical audio assets from the client side application.

4.1. Content-based Similarity

The practical usefulness of any content-based similarity measure, as opposed to increasingly popular social recommenders, will depend on the success of real-world applications rather than laboratory experiments. We therefore expect that low computational cost, in particular fast comparison time to enable searching and browsing large collections, will be an essential attribute of an effective similarity measure. Low memory requirements are also likely to be important for online searching of large collections and to enable embedding search functionality into applications. Just as with fingerprinting, a compact feature representation will be important in the design of content-based search interfaces intended to be queried by remote clients. Last but not least, the cost of content-based similarity will have to be low for applications where timbral distance is just one source of information to be combined with textual or community metadata.

The music retrieval system is based on the methods used in SoundBite [16], a content-based music search engine developed by QMUL. This allows users to experience and evaluate a variety of similarity measures in a realistic application context. The system incorporates automatic segmentation and thumbnailing, enabling easier and faster user evaluation of search results by presenting a short representative thumbnail audio clip for each track returned. SoundBite manages a database of tracks, allowing the entry of simple metadata (album title, track name, artist, genre, etc.) as each new track is added, while automatically extracting and saving segmentation information and a thumbnail segment, as well as MFCCs and other features required to support various similarity measures. Tracks in the database can be browsed and searches can be made by similarity to a chosen track; in all cases the results are presented as a search engine-style list, with thumbnail audio immediately available for playback for each track in the list.

The music similarity algorithm used by SoundBite involves a pre-processing stage which annotates the audio content of the sound archive. It takes raw audio as input and returns labelled segments. For a given track, the space of possible timbres is divided into N timbre types, each of which generates timbre features according to a Gaussian distribution. The sequence of timbre features through the track is modelled by an N-state Hidden Markov Model where the hidden states correspond to the N timbre types. The most likely sequence of timbre types to have generated the features is Viterbi-decoded from the HMM, and the most likely segmentation is found by clustering histograms of the timbre types. The feature vector consists of the first 20 PCA components extracted from the normalised constant-Q spectrum of the audio under analysis, along with the normalised envelope. The analysis hop size is chosen as the estimated beat length of the audio under analysis.

The overall system architecture in SoundBite, which will be maintained for this Task, is shown in Figure 3. The software design defines a simple interface to be implemented for each similarity measure. The measure then becomes available to the user in a pull-down menu. Different similarity measures can be selected at will for each search, and searches can also be constrained by textual metadata, for example to search only within the labelled genre of the query track.
In a practical search application implementing MFCC-based similarity measures, this method can search a collection of 50,000 tracks in under 4ms on a standard consumer laptop, and with a memory footprint for the features of only 8MB.
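To make the query-by-example flow concrete, the sketch below ranks tracks by distance between compact per-track features. It uses mean MFCC vectors and Euclidean distance as stand-ins for SoundBite's actual features and similarity measures, so it should be read as an illustration under those assumptions rather than the implemented system.

```python
# Minimal sketch of a query-by-example timbre search over precomputed
# track features. Mean MFCC vectors and Euclidean distance are stand-ins
# for the actual SoundBite similarity measures described above.
import numpy as np

def build_feature_matrix(track_mfccs):
    """track_mfccs: dict mapping track id -> (n_frames, n_coeffs) MFCC array.
    Returns (ids, matrix) where each row is a compact per-track feature."""
    ids = list(track_mfccs)
    matrix = np.vstack([track_mfccs[t].mean(axis=0) for t in ids])
    return ids, matrix

def most_similar(query_id, ids, matrix, top_n=10):
    """Rank tracks by distance to the query track's compact feature."""
    q = matrix[ids.index(query_id)]
    dists = np.linalg.norm(matrix - q, axis=1)
    order = np.argsort(dists)
    return [(ids[i], float(dists[i])) for i in order if ids[i] != query_id][:top_n]

# Example with random features standing in for real MFCC analyses
rng = np.random.default_rng(0)
demo = {f"track{i}": rng.normal(size=(500, 20)) for i in range(100)}
ids, matrix = build_feature_matrix(demo)
print(most_similar("track0", ids, matrix, top_n=5))
```

Keeping one compact vector per track is what makes comparison times and memory footprints of the order quoted above plausible for large collections.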


Figure 3. The SoundBite Framework.

4.2. Metadata searches

Music retrieval will also incorporate search on the metadata described in WP4. This includes both automatically extracted metadata and metadata that is manually entered by content managers. The metadata will include standard media data (year, performer, composer, etc.) and metadata suggested by user studies and for which there are robust extraction methods.

Use-Case Scenario 4 - The DJ

Club DJ Flyguy McJim normally starts his Saturday night set at the Edinburgh university student union bar on campus just after 11 o'clock. This particular Saturday night has seen the Scottish football team defeat their bitter rivals England in a World Cup qualifying match. The bar is packed, and the crowd has been singing Scottish songs all night. As the dance floor quickly fills up with happy revellers, Flyguy realises a patriotic flavour might be the way to go. As the second track kicks in, he uses his laptop's Internet connection to log onto a traditional Scottish music archive that employs the EASAIER system. Typing in 'Scotland the Brave', 'Pipe band' and 'D major' in the author/title, orchestration and key fields (the drum'n'bass groove McJim has in mind works best with tracks in D major), he puts 120-132 bpm in the tempo range field. Unfortunately, no results are displayed. However, Flyguy remembers that pipe bands often play in strange keys, so he searches again leaving the key field blank. This search brings up several suitable tracks. Selecting one, he uses EASAIER's enriched access tools (WP5) to create a loop sample, bump up the pipes' drone from B quarter sharp to D, and set the playback tempo at 126 bpm. Mixing in a slightly heavier and faster drumbeat, Flyguy McJim drops the EASAIER loop of 'Scotland the Brave' between the verse and chorus of the Proclaimers' song 500 Miles. On hearing the pipe band, the crowd on the dance floor goes wild. McJim drops the loop in several more times to equally rousing effect. Impressed with the EASAIER audio material, so readily to hand and so easily accessed and manipulated, McJim finishes his set thinking about how he might further use the resources to provide a unique element to his DJ sets. He makes a mental note to see how many other sound archives use the EASAIER system.

Table 2 lists the likely metadata that would be used for music retrieval. Note that this also includes the low-level features used for content-based similarity as described in Section 4.1. In this table, the columns describe the following attributes of the descriptors:

1) "Entry Type" refers to whether the metadata is automatically or manually generated.
2) "Value Type" attempts a generalised description of the type of tags, making a distinction between tags that require a "scalar" (e.g. global tempo) or a "text" entry and those tags that have an associated time-stamped sequence of values (lists, vectors, multi-dimensional vectors).
3) The "Query or Similarity Tag" column discriminates between tags that can be used for parameter-based searches (e.g. "Rhythm"), those that can be used for free text queries (e.g. "Title") and those that may be used for similarity searches (e.g. "Harmonic Content" and "Timbre Profile").
4) The "Notes on Automated Entry" column is for comments on the general characteristics of the algorithms used to extract the features.


Table 2. Suggested Metadata and Descriptors for use in Music Retrieval.

General Tags
  <Artist name>/<item name> - Entry: Manual; Value: Text; Use: Query
  <Title> - Entry: Manual; Value: Text; Use: Query
  <Year> - Entry: Manual; Value: Text; Use: Query
  <Narrative> (general notes and text) - Entry: Manual; Value: Text; Use: Query
  <People> (musician names, cast names, ...) - Entry: Manual; Value: Text; Use: Query
  <Media Location>
  <Content type> (speech, music, video, image) - Entry: Manual; Value: Category (single value); Use: Query

Musical Audio Tags
  <Instrumentation> (horn, fiddle, voice etc.) - Entry: Manual/Auto; Value: List; Use: Query; Notes: the automated entry can be generated from the source separation/instrument recognition algorithms
  <Orchestration> (solo or ensemble) - Entry: Manual/Auto; Value: Single value; Use: Query; Notes: deduced from the "Instrumentation" entry
  <Key> (C, Db, D, Eb etc.) - Entry: Manual/Auto; Value: Single value; Use: Query; Notes: derived from the "Key Estimation" mid-level feature
  <Mode> (major, minor) - Entry: Manual/Auto; Value: Single value; Use: Query; Notes: derived from key and chord estimation (to be confirmed)
  <Tempo> (120 bpm etc.) - Entry: Manual/Auto; Value: Single value; Use: Query; Notes: can be derived from the "Tempo Profile" mid-level feature
  <Rhythm> (e.g. time signature) - Entry: Manual/Auto; Value: Single value; Use: Query; Notes: can be derived from the "Meter Profile" mid-level feature

Low / Mid-level Features (each should have an associated header describing the extraction technique and analysis parameters such as number of bands/bins, temporal resolution, hop size, etc.)
  <Melody> (some representation of melodic shape, transcription) - Entry: Manual/Auto; Value: Vector; Use: Similarity; Notes: either a mid-level representation or a global value (to be decided)
  <Tempo Profile> - Entry: Auto; Value: Vector; Use: Similarity search & generation of high-level descriptor; Notes: derived from the output of the beat tracker
  <Meter Profile> - Entry: Auto; Value: Vector; Use: Similarity search & generation of high-level descriptor; Notes: derived from the output of the beat tracker
  <Chord Estimation> - Entry: Auto; Value: Vector; Use: Similarity search & generation of high-level descriptor; Notes: estimate of chords from a constant-Q spectrum
  <Key Estimation> - Entry: Auto; Value: Vector; Use: Similarity search & generation of high-level descriptor; Notes: derived from the output of the chord estimation algorithm
  <Harmonic Change Detection Function> - Entry: Auto; Value: Vector; Use: Similarity search & generation of high-level descriptor (harmonic complexity); Notes: detection of harmonic change in musical audio files
  <Harmonic Content Profile> - Entry: Auto; Value: Vector; Use: Similarity search; Notes: feature vector for the description of tactus-based harmonic movement
  <Timbre Profile> - Entry: Auto; Value: Multi-dimensional vector; Use: Similarity search; Notes: generation of timbre features for use within SoundBite
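Each row of Table 2 can be thought of as a descriptor record in the archive's metadata store. The sketch below is a hypothetical representation of such a record; the class, enumeration and field names mirror the table's columns but are assumptions, not EASAIER identifiers.

```python
# Hypothetical sketch of a descriptor record mirroring the columns of Table 2.
# Class and field names are illustrative assumptions, not EASAIER identifiers.
from dataclasses import dataclass
from enum import Enum
from typing import Any

class EntryType(Enum):
    MANUAL = "manual"
    AUTO = "auto"
    MANUAL_OR_AUTO = "manual/auto"

class ValueType(Enum):
    TEXT = "text"
    SINGLE_VALUE = "single value"
    LIST = "list"
    VECTOR = "vector"
    MULTI_DIM_VECTOR = "multi-dimensional vector"

class UseTag(Enum):
    QUERY = "query"
    SIMILARITY = "similarity"

@dataclass
class Descriptor:
    tag_name: str
    entry_type: EntryType
    value_type: ValueType
    use: UseTag
    value: Any = None
    notes: str = ""

# Example: the <Tempo> tag, derived from the "Tempo Profile" mid-level feature
tempo = Descriptor(
    tag_name="Tempo",
    entry_type=EntryType.MANUAL_OR_AUTO,
    value_type=ValueType.SINGLE_VALUE,
    use=UseTag.QUERY,
    value=120,
    notes="Derived from the Tempo Profile mid-level feature",
)
print(tempo)
```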

5. Speech Retrieval

5.1. Search in digitised speech archives

As with text retrieval systems, the most elementary problem for a speech retrieval system is to find the occurrences of one word in a large digitised speech archive. Although solving this problem can be considered the minimal requirement for any retrieval system, it is also sufficient for implementing many more sophisticated features. Queries ranging from simple word lists to ones with boolean operators, NEAR, etc. can all be answered easily if we are able to search for single words. More generally, most of the techniques of text retrieval systems are readily applicable in our case if we substitute the text word search module with its speech counterpart. The following paragraphs briefly summarize the techniques employed by ALL to solve this elementary problem.

Use-Case Scenario 5 - Speech Retrieval (Editorial/Publishing)

Janet is working to a tight deadline on a CD-ROM to accompany a 'how-to' book for young actors. She is looking for a striking example of a male actor addressing a crowd. She decides on the actor Richard Burton, and logs on to a sound archive specialising in theatre performances from the second half of the 20th century. This sound archive uses the EASAIER system. When Janet enters 'Richard Burton' in the performer field, a large number of results are displayed. She clicks on search within results, and enters 'Royal Shakespeare' in the keyword field. This brings back a smaller number, but still too many to search through given the short amount of time. Janet recalls her favourite speech from Henry V, and within the search within results enters 'Cry God for Harry' in the speech recognition field. Having searched through the audio records for Richard Burton saying the words she typed, the EASAIER system presents five audio records - four from performances with the Royal Shakespeare at the National Theatre, and one anonymous four-minute recording of Burton rehearsing the famous rallying call. This last example shows Burton demonstrating a wide range of potential inflections. Janet calls her editor into the room, and plays the example back on her computer. Both agree that it is a great find, and perfect for the CD-ROM. As the record is rights-cleared and easily downloaded, she copies the audio example onto the CD-ROM mock-up.

5.2. The architecture of retrieval

The search engine is based on a phoneme-level speech recognition system. This system outputs the most probable phoneme sequence for a given utterance. Of course, we have to be prepared for erroneous recognition results, and be able to give a result quickly. To this end, the search engine consists of two well-distinguishable parts.

The first part generates a set of priors based on an index. A prior is a small speech segment in the archive that is a candidate for containing the given word. This part has to be very fast: its response time should increase very slowly, roughly logarithmically, as the size of the archive grows (though of course it can be linear in the number of its results). The precision of these priors can be lower than required for the final result, but the recall should be good enough, or even better than the recall we expect at the user end of the whole system. We will refer to this part as the index searcher.

The second part of the system is the prior refiner. Its job is to examine all the priors one by one, decide whether the given word is really contained in them, and select the best priors for the final result. From this, we will also be able to find the exact location of the word within the segments that contain it. The amount of time consumed by this layer is obviously linear in the number of priors, so the search time per result element in this phase is determined by the precision of the priors. This architecture explains the requirements on the index searcher stated above: the recall must be high, because no new hits will be added in the second phase, but the precision can be lower, as it can be greatly improved by selecting only the best priors, though this affects the speed of the system as a whole.
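The two-stage design can be captured by the following skeleton, with a recall-oriented index searcher feeding a precision-oriented prior refiner. The interfaces and names are hypothetical; the actual ALL components may be structured differently.

```python
# Skeleton of the two-stage search described above: a fast index searcher
# proposes candidate segments ("priors"), then a prior refiner keeps only
# the best ones. Interfaces are hypothetical, not the actual ALL components.
from typing import List, Tuple

class IndexSearcher:
    """Fast, recall-oriented lookup; response time should grow roughly
    logarithmically with archive size."""
    def candidates(self, phonemes: List[str]) -> List[str]:
        raise NotImplementedError

class PriorRefiner:
    """Precision-oriented check of each candidate segment."""
    def score(self, phonemes: List[str], segment_id: str) -> float:
        raise NotImplementedError

def search(word_phonemes: List[str],
           index: IndexSearcher,
           refiner: PriorRefiner,
           max_results: int = 20) -> List[Tuple[str, float]]:
    priors = index.candidates(word_phonemes)             # stage 1: high recall
    scored = [(seg, refiner.score(word_phonemes, seg))   # stage 2: refine
              for seg in priors]
    scored.sort(key=lambda pair: pair[1])                # smaller score = better
    return scored[:max_results]
```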


The index searcher

The most common solution for text retrieval is an index based on words: for every word occurring in our corpus, we store the list of documents that contain it. To get the search result, all we have to do is load the list for the word we are looking for. The notion of a word can be defined in this technique as a character sequence delimited by white space. One important aspect of the text retrieval problem is that the number of occurring words is relatively low, roughly upper-bounded by the number of words in the language (this helps less for Hungarian, for example, because of the inflections, but a morphological analyzer can help in this case).

This concept cannot be used for speech retrieval. We don't have natural word boundaries, but instead we have a lot of recognition errors. The lack of boundaries makes the above simple definition of a word meaningless. We can try to define a word as an element of a predefined dictionary, but that limits our search system to this dictionary. The other possibility is to consider every subsequence of the recognized phoneme sequence as a word. But in this case we also have to include sequences that are similar to these subsequences to compensate for recognition errors. ("Similar" can be defined, for example, in terms of edit distance; see the next section.) These two factors make the number of "words" possibly contained in a segment of speech really large. Although this solution would work without white-space boundaries, and would try to compensate for recognition errors, the number of character sequences that must be stored makes it impossible in practice.

Our solution is to build an index of phoneme 3-grams. A phoneme 3-gram is just a sequence of three phonemes. We split the archive into sufficiently small segments, and for each segment we extract all the phoneme 3-grams contained in its phonetic transcription. Then for each 3-gram, we store the list of segments that contain it. In the search phase, the index is used in the following way: the 3-grams of the phonematized search word are extracted, and we look for the segments that contain a sufficiently large portion of this 3-gram set. This is based on the assumption that even when recognition errors are present, some of the 3-grams of a word's phonemes remain intact.
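A minimal sketch of such a 3-gram index is shown below. Segment granularity, the phonematizer and the match threshold are illustrative assumptions rather than the values used by ALL's engine.

```python
# Minimal sketch of the phoneme 3-gram index described above. Segment ids,
# the phonematization and the match threshold are illustrative assumptions.
from collections import defaultdict
from typing import Dict, List, Set

def trigrams(phonemes: List[str]) -> Set[tuple]:
    return {tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)}

def build_index(segments: Dict[str, List[str]]) -> Dict[tuple, Set[str]]:
    """segments: segment id -> recognised phoneme sequence.
    Returns: 3-gram -> set of segment ids containing it."""
    index = defaultdict(set)
    for seg_id, phonemes in segments.items():
        for gram in trigrams(phonemes):
            index[gram].add(seg_id)
    return index

def candidate_segments(index, query_phonemes, min_fraction=0.5) -> List[str]:
    """Return segments containing at least `min_fraction` of the query's
    3-grams; tolerant to some recognition errors."""
    query_grams = trigrams(query_phonemes)
    counts = defaultdict(int)
    for gram in query_grams:
        for seg_id in index.get(gram, ()):
            counts[seg_id] += 1
    needed = max(1, int(min_fraction * len(query_grams)))
    return [seg for seg, c in counts.items() if c >= needed]

# Toy example with already-phonematized data (single letters stand for phonemes)
segments = {"seg1": list("bcdefgh"), "seg2": list("xyzbcde")}
index = build_index(segments)
print(candidate_segments(index, list("bcde")))
```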

The prior refiner

To perform the prior refinement, we have to measure how similar the phonematized search word is to a part of the transcription of a prior. For this, we define the edit distance of two sequences as the number of insert and delete operations necessary to get the second sequence starting from the first. For example, the edit distance between 'bet' and 'bye' is two, by deleting the 't' and inserting the 'y'. We use a simple dynamic programming approach to find the substring of a transcription that has the smallest edit distance from the phonematized version of the query string. This smallest edit distance is used to decide whether the given prior should be included. The threshold used for this decision also depends on the length of the search word, as we want to permit a larger edit distance for longer words.
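The dynamic programme can be sketched as follows; it finds the best-matching substring of a prior's transcription under insert/delete costs, and the length-dependent acceptance threshold is an assumed policy, not ALL's exact choice.

```python
# Sketch of the dynamic-programming prior refinement: find the substring of
# a prior's phonetic transcription with the smallest insert/delete edit
# distance to the phonematized query. The threshold policy is an assumption.
from typing import List

def best_substring_distance(query: List[str], transcription: List[str]) -> int:
    """Minimum number of insertions and deletions needed to turn some
    substring of `transcription` into `query`."""
    m, n = len(query), len(transcription)
    # prev[j] = cost of matching the query prefix against a substring ending at j
    prev = [0] * (n + 1)            # empty query matches anywhere for free
    for i in range(1, m + 1):
        curr = [i] + [0] * n        # matching against empty substring: i deletions
        for j in range(1, n + 1):
            if query[i - 1] == transcription[j - 1]:
                curr[j] = prev[j - 1]              # phonemes agree, no cost
            else:
                curr[j] = min(prev[j] + 1,         # delete query phoneme
                              curr[j - 1] + 1,     # skip transcription phoneme
                              prev[j - 1] + 2)     # mismatch = delete + insert
        prev = curr
    return min(prev)                # best ending position anywhere

def accept_prior(query, transcription, per_phoneme_tolerance=0.4) -> bool:
    """Longer words are allowed a proportionally larger edit distance."""
    threshold = max(1, int(per_phoneme_tolerance * len(query)))
    return best_substring_distance(query, transcription) <= threshold

print(best_substring_distance(list("bet"), list("xxbyexx")))  # 2, as in the text
```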

5.3. The technology of phonematization and phoneme level recognition

Summary of architecture

In our system the input for training is a set of parallel segments of sound and text, while the input for phoneme-level recognition is the sound alone. For text, phonematization is used to produce data for the algorithms; for sound, signal processing methods are used to produce data for the training and recognition methods. Using phonematization, the text is converted to the phoneme model of the system. The signal processing produces the feature vectors, which are managed by the similarity model of the system. The feature vectors are classified into phoneme model entities (phoneme, allophone, morpheme). The similarity model supports the classification, comparison and distinction of the phoneme model's entities. The time scale is mapped by the dynamic model. The recognition is supported by the language model, which aids the recognition done by the dynamic model with information describing the language itself. Our dynamic model is based on the Hidden Markov Model (HMM): phoneme-level recognition is performed by the Viterbi algorithm running on HMMs which are trained by the Baum-Welch algorithm.

In our experiments we examine the elements of our system described above. We tune the parameterization of the elements on the basis of the classification below:

• Signal processing
• Phonematization
• Phoneme model
• Dynamic model
• Similarity model
• Language model

Phonematization

Phonematization maps the written text to the phoneme model of our system. This is a multi-step process:

• First we produce the phoneme sequence that corresponds to the text, using language-specific rules in the case of Hungarian. In the case of English this step is completely different: we have to use a dictionary and additionally trained HMMs.
• The generated phoneme sequence is translated to the phonemes defined in the phoneme model.
• This phoneme sequence is transcribed using the transition rules and variant rules defined in the phoneme model.

The steps of the Hungarian phonematization

Transcript to phoneme sequence

A fortunate characteristic of the Hungarian language is that the pronunciation can be predicted very well from the written text. We can prepare a rule-based program which converts the text to a phoneme sequence.

Modelling of silence

We use two different phonemes to model silence: SIL (silence) is the long one, and SP (short pause) is the short silence symbol. Punctuation is transcribed to the long silence; between words the short one is placed, but only optionally. The optional short silence means that our model treats the two variants, one with a short silence and one without, in the same way, and assumes that they appear with the same probability.

Phonemes

The items of the phoneme sequence are matched to the phonemes defined by the phoneme model. The phonemes of the phoneme model correspond to the structures of the similarity model and the dynamic model (HMMs). This dynamic model is depicted in Figure 4. The system learns the parameters of these structures, and uses them later in the course of recognition. We match each phoneme sign to exactly one phoneme of the phoneme model. After the training we analyze the typical errors of the recognitions; the errors are classified and summarized, and on this basis we can refine the phoneme model.

Phoneme clusters, allophones

On the basis of typical mistakes we form phoneme clusters. They can be considered as guidelines describing which phonemes and phoneme groups have to be addressed by special separation methods. We used experiments to measure the effect of phoneme cluster differentiation on recognition. In these experiments we unified the first few clusters, each as a single phoneme; for example, the 'A' and 'O' phonemes were considered a single artificial 'AO' phoneme. Examining the results of these experiments, we realized that it is worth adding new phonemes to the phoneme model where a phoneme depends on its context. These are the phoneme's allophones.


Figure 4. Alternative phonematization in Hungarian with and without assimilation rules.

The steps of the English phonematization

In the case of English, rule-based phonematization cannot be used. Basically we have to use a dictionary, which contains the written form of the words and several phonematized pronunciation variants. However, words missing from the dictionary have to be phonematized too. We use a phonematized text corpus to train a phonematization model, which can be used to supplement the dictionary. The silence between words is optional in this case too. We do not have to manage the context-dependent pronunciation variants explicitly. In summary:

• The text is converted to a phoneme sign sequence on the basis of the dictionary.
• The out-of-dictionary words are phonematized by the statistical phonematization model.
• The elements of the phoneme sign sequence are transcribed to the phonemes of the phoneme model.
• This phoneme sequence is modified according to the transition and variant rules defined by the phoneme model.

The HMM model of the phonemes

Monophone model

According to the state of the art, the phoneme model has the same structure for all phonemes: a 3-state HMM. This is justified by the rough heuristic that during speech each pronounced phoneme has a start, a mid and an end phase. From each phase (state) there is a possible transition to itself or to the next phase. Each state is characterized by a normal distribution or a mixture of normal distributions, whose dimension is the same as the dimension of the feature vectors.
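A minimal sketch of such a 3-state, left-to-right monophone HMM with Gaussian emissions follows; the particular transition probabilities and parameters are illustrative assumptions only.

```python
# Minimal sketch of a 3-state left-to-right monophone HMM as described above.
# Transition probabilities and emission parameters are illustrative assumptions.
import numpy as np

N_STATES = 3          # start, mid and end phase of the phoneme
FEATURE_DIM = 39      # e.g. a 39-dimensional MFCC feature vector

# Each state may loop on itself or move to the next state (left-to-right).
transitions = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],   # final state: exit handled by the decoder in practice
])

# Single-Gaussian emissions per state: mean vector and diagonal covariance,
# with the same dimensionality as the feature vectors.
means = np.zeros((N_STATES, FEATURE_DIM))
variances = np.ones((N_STATES, FEATURE_DIM))

def log_emission(state: int, feature: np.ndarray) -> float:
    """Log-density of a diagonal Gaussian emission for one state."""
    diff = feature - means[state]
    return -0.5 * np.sum(np.log(2 * np.pi * variances[state])
                         + diff ** 2 / variances[state])
```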

Diphone model

An obvious objection is that during continuous speech, phonemes pronounced after each other alter each other's start and end states. This idea is realized by the diphone model, in which we explicitly model the transition by adding a transition model to each subsequent phoneme.

Signal processing

The wave signals are converted by signal processing methods to feature vectors. These feature vectors incorporate significant parameters of the voice. These parameters enable the similarity model to distinguish the phonemes using different statistical methods.

MFCC feature vectors

According to the state of the art, the most widely used feature is the mel-cepstrum (MFCC) feature vector. The mel-cepstrum features consist of the logarithmic energies taken on 12 disjoint frequency intervals, together with an overall energy term. The dynamic change of continuous speech is modelled by adding the first and second derivatives of each coordinate to the mel-cepstrum features; the derivatives describe the change of the logarithmic energies. The result is a 39-dimensional MFCC feature vector.
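A sketch of computing such 39-dimensional features is shown below, assuming the librosa library is available; the EASAIER front end itself may use a different implementation and parameter set.

```python
# Sketch of computing a 39-dimensional MFCC feature vector sequence
# (static coefficients plus first and second derivatives), assuming the
# librosa library is available; the project's own front end may differ.
import librosa
import numpy as np

def mfcc_39(path: str, sr: int = 16000, hop_length: int = 160) -> np.ndarray:
    """Return an (n_frames, 39) array of MFCC + delta + delta-delta features."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    delta = librosa.feature.delta(mfcc)            # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second derivatives
    return np.vstack([mfcc, delta, delta2]).T      # frames as rows

# features = mfcc_39("utterance.wav")  # hypothetical input file
```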

Voiced / breathed, formant, pitch, pitch-synchronous energy

It is well known that the energy at certain frequencies in the spectrum of some phonemes, mainly the vowels, is significantly higher. These frequencies are the formants (see Figure 5), alongside the pitch (fundamental frequency). Another important parameter is whether the phoneme is voiced or breathed; phoneme assimilation is mainly voiced-breathed assimilation.


Figure 5. Visualization of sound files: the waveform and its Fourier spectrum. The spectrum shows the formants.

Similarity model

Initially, we used a normal distribution in the states of the HMMs representing the phonemes. However, it may be observed that the normal distribution does not match the empirical distribution calculated from the feature vectors well, so we examined other representations of the distribution to improve our performance. The distribution representations listed below are considered for trial investigations:

• Normal distribution
• Mixture of normal distributions (a linear combination of normal distributions using weights; the sum of the weights is 1)
• Neural network (see below)
• SVM (Support Vector Machine)
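To make the second option concrete, the sketch below evaluates the log-likelihood of a feature vector under a mixture of normal distributions whose weights sum to 1. It relies on scipy and is purely an illustration; the actual similarity model parameters are learned during training.

```python
# Sketch of a mixture-of-Gaussians emission density: a weighted combination
# of normal distributions whose weights sum to 1, evaluated in the log domain.
# An illustration only, not the project's actual similarity model.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(x, weights, means, covariances) -> float:
    """x: feature vector; weights sum to 1; one mean/covariance per component."""
    log_terms = [np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=c)
                 for w, m, c in zip(weights, means, covariances)]
    return float(logsumexp(log_terms))

# Toy 2-component mixture over 39-dimensional features
dim = 39
weights = [0.3, 0.7]
means = [np.zeros(dim), np.ones(dim)]
covariances = [np.eye(dim), 2.0 * np.eye(dim)]
print(gmm_log_likelihood(np.zeros(dim), weights, means, covariances))
```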

Applying Artificial Neural Networks for Similarity Models The problem of automatic speech recognition (ASR) can be formulated as follows: how can an input sequence (a sequence of spectral feature vectors) be properly explained in terms of an output sequence (a sequence of phonemes)? Although Artificial Neural Networks (ANNs) have been shown to be quite powerful in static pattern classification, their formalism is not very well suited to ASR, because the sound patterns are primarily sequential and dynamical. In ASR there is a time dimension, or sequential dimension, which is highly variable and difficult to handle directly in ANNs. We therefore applied and compared various current state-of-the-art ANN methods to improve the accuracy of our baseline Hidden Markov Model / Gaussian Mixture Model (HMM/GMM) system.

Several neural network architectures have been developed for (time) sequence classification, including recurrent networks that accept input vectors sequentially and use a recurrent internal state that is a function of the current input and the previous internal state. The Echo State Network (ESN), developed by Jaeger [17], is a static, randomly generated recurrent neural network; the internal states of the ESN may be thought of as a rich set of dynamical basis functions constructed from the input time series. In our research, we have modified the ESN such that the forced output is given by other classifiers, for example a GMM or a standard feed-forward neural network trained with back-propagation.

Another approach to handling the time dimension of ASR is the Time-Delay Neural Network (TDNN) proposed by Waibel et al. [18]. A TDNN generalizes not only from the current training data, but also from past data and even future data; there is therefore a time delay between real time and phoneme recognition. Using a wide window in a TDNN results in a large number of features, so TDNNs require a large number of hidden-layer neurons to attain an acceptable recognition rate. This leads to long training times, and even with powerful computers training could take weeks. To solve this problem, we aggregated the feature vectors of a given window into one vector, and this method also gave us good accuracy in a reasonable running time.

Our attention focused on system combinations, especially the hybrid HMM/ANN models. While the HMM is the dominant approach in most state-of-the-art speaker-independent, continuous speech recognition systems, ANNs are widely known as one of the most powerful nonlinear methods for pattern recognition, time series prediction, optimization and forecasting. HMM/ANN combines the advantages of both approaches by using an ANN, instead of GMMs, to estimate the state-dependent observation probabilities of an HMM, while the temporal aspects of speech are dealt with by left-to-right HMM models. We have also evaluated tandem systems, which transform the cepstral features into posterior probabilities of phonemes using ANNs; these are then processed to form input features for conventional speech recognition systems, and they have been shown to perform better than conventional systems using cepstral features alone. Our best tandem model was realized by a combination in which the posterior probabilities were determined by voting techniques using different ANN and GMM models trained in various ways.
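The window aggregation mentioned above can be pictured as simple frame stacking; the sketch below is indicative only (the context width and the edge padding are our assumptions).

```python
import numpy as np

def stack_frames(features: np.ndarray, context: int = 4) -> np.ndarray:
    """Aggregate each frame with `context` frames on either side into one input vector.

    features: array of shape (n_frames, n_dims), e.g. 39-dimensional MFCCs.
    Returns an array of shape (n_frames, (2 * context + 1) * n_dims).
    """
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + 2 * context + 1].ravel() for i in range(len(features))]
    return np.asarray(windows)

# Example: 100 frames of 39-dim features become 100 vectors of 351 dimensions.
stacked = stack_frames(np.random.randn(100, 39), context=4)
print(stacked.shape)  # (100, 351)
```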

Parameter estimation After choosing the similarity model, the parameters of the model are first initialized and then estimated during training. We can estimate the parameters of the similarity model in one of two ways:

• Independently of the dynamic model, using empirical estimation
• Together with the estimation of the parameters of the dynamic model, by training

Integrating the different parameters of the system

Weighting the different probabilities The system calculates a series of different probabilities during training and recognition. Instead of using the plain product of this series, we can calculate a weighted product: technically, each probability enters the product raised to the power of its weight rather than with its plain value.
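Expressed as a formula (our notation, added only for clarity), with component probabilities p_1, ..., p_n and weights w_1, ..., w_n:

\[ P_{\mathrm{combined}} = \prod_{i=1}^{n} p_i^{\,w_i} = \exp\Big( \sum_{i=1}^{n} w_i \log p_i \Big) \]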

Weighting by coordinate To distinguish different phonemes, we analyse the importance of the distinct coordinates of the feature vectors.

Dynamic model The system manages the time scale of the speech in the dynamic model. The similarity model alone can provide only static classification: depending on the concrete model and on the basis of the trained parameters, it decides the probability that a feature vector matches a phoneme. The dynamic model can modify this classification by considering the chronological order of the feature vector sequence. The basis of the dynamic model is the Hidden Markov Model (HMM). The HMM modifies the similarity model's decision by the probability of its state transitions; the decision of the similarity model is represented by the emission distributions in the HMM's states.
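To make the interaction between the emission distributions and the state transitions concrete, a generic log-domain Viterbi pass over precomputed emission scores might look like the sketch below (a textbook formulation, not the project's implementation).

```python
import numpy as np

def viterbi(log_emissions: np.ndarray, log_trans: np.ndarray, log_start: np.ndarray):
    """Most likely state sequence for one utterance.

    log_emissions: (n_frames, n_states) emission log-probabilities from the similarity model.
    log_trans:     (n_states, n_states) state-transition log-probabilities of the HMM.
    log_start:     (n_states,) initial state log-probabilities.
    """
    n_frames, n_states = log_emissions.shape
    delta = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)

    delta[0] = log_start + log_emissions[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans          # (from_state, to_state)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(n_states)] + log_emissions[t]

    # Backtrack the best path.
    path = [int(np.argmax(delta[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```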

Language model The language model realizes the language-specific knowledge that relates not to the voice but to the syntactic and semantic content of the speech. Thanks to the flexibility of the HMM, in many cases the knowledge of the language model can be embedded directly into the dynamic model. The simplest language model is the phoneme trigram model (3-element phoneme sequences). The prevalence of the different phoneme trigrams follows a distribution typical of the particular language. The discrete distribution of the phoneme trigrams can be estimated from the phonematized form of large text corpora.
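For illustration, a phoneme trigram model of this kind could be estimated with a simple count over a phonematized corpus; the add-one smoothing in the sketch below is our assumption, not the project's chosen smoothing scheme.

```python
from collections import Counter
import math

def train_trigram_lm(phoneme_sequences):
    """Estimate smoothed trigram log-probabilities from phonematized text."""
    trigrams, contexts, phonemes = Counter(), Counter(), set()
    for seq in phoneme_sequences:
        phonemes.update(seq)
        for i in range(len(seq) - 2):
            trigrams[tuple(seq[i:i + 3])] += 1
            contexts[tuple(seq[i:i + 2])] += 1
    vocab = len(phonemes)

    def log_prob(a, b, c):
        # Add-one (Laplace) smoothed P(c | a, b)
        return math.log((trigrams[(a, b, c)] + 1) / (contexts[(a, b)] + vocab))

    return log_prob

# Example with a toy phonematized corpus.
lm = train_trigram_lm([["h", "e", "l", "o"], ["h", "e", "l", "p"]])
print(lm("h", "e", "l"))
```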


5.4. Speech retrieval technical specification
The quality of accepted sound archives is limited. We implement our algorithms to provide a general solution, but adequate performance depends on the quality of the recordings. We expect broadcast-quality recordings, in English and in Hungarian separately. A recording can consist of speech and non-speech parts; music, and overlapped music and speech, are omitted from the speech processing. The accepted wave quality is 16 kHz, 16-bit, slightly noisy recordings.

The retrieval task consists of two phases. The first is the offline preprocessing phase: the identification of speech parts, the phoneme-level recognition and the indexing. The technical background of the phoneme-level recognition is very complex and is discussed in another chapter. We work continuously on methods and algorithms to improve the performance of the phoneme-level recognition. This work is mainly research and experimentation, and it is the most risky part of our task: we cannot predict the performance of the final outcome.

The offline indexing works with documents. The documents are stored in the file system; this solution ensures easy deletion and insertion of documents. The format of the documents is uncompressed wave. This is somewhat wasteful, but the algorithms work with this format. Other file formats can be supported; it is only a question of the time needed to unpack the file before preprocessing and when seeking to hits at retrieval time. The speed of the indexing is real time on an average desktop machine; faster and parallel indexing can be carried out using faster servers and server farms. The size of the index is 10% of the original uncompressed sound document.

During speech retrieval, the user request is answered online, so the retrieval itself has to be fast enough independently of the size of the searched data. The user enters the query in the form of text, which can contain one or more words. The system produces the phonematization of the text; this process is detailed in another chapter for the Hungarian and English languages. In Hungarian the phoneme transcription is unambiguous; in English more than one transcription can be produced. The retrieval is performed on the indexes. The index ensures that the response time is more or less constant: the first hit, if there is any, is presented within 1-2 seconds. If there is no hit, the response time can be longer, because the user does not get a reply until the index has been fully browsed. The search is asynchronous: it continues working while the first hits are presented. The retrieval is enhanced with fault-tolerance algorithms. The hits are ordered according to their relevance, and each hit is tagged with a document name and a time position inside the document. The hits can be browsed and listened to or watched.
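As an illustration of the offline indexing and online lookup described above, the sketch below builds an inverted index from phoneme 3-grams to (document, time) positions; the data structure and the voting threshold are simplifications of the idea, not the deployed implementation.

```python
from collections import defaultdict

def trigrams(phonemes):
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

class PhonemeIndex:
    """Offline index: phoneme 3-gram -> list of (document, time in seconds)."""

    def __init__(self):
        self.index = defaultdict(list)

    def add_document(self, doc_id, recognised_phonemes, phoneme_times):
        # recognised_phonemes: output of the phoneme-level recogniser for one document
        # phoneme_times: start time of each recognised phoneme
        for i, tri in enumerate(trigrams(recognised_phonemes)):
            self.index[tri].append((doc_id, phoneme_times[i]))

    def search(self, query_phonemes, min_matches=2):
        # Count how many query 3-grams hit each (document, approximate position)
        votes = defaultdict(int)
        for tri in trigrams(query_phonemes):
            for doc_id, t in self.index[tri]:
                votes[(doc_id, round(t))] += 1
        hits = [(key, count) for key, count in votes.items() if count >= min_matches]
        return sorted(hits, key=lambda kv: kv[1], reverse=True)  # ranked by relevance
```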


6. Indexing and Speaker Retrieval

6.1. Introduction

General Speaker retrieval is a distinct task within the world of speaker recognition applications. In a speaker retrieval application, a multi-speaker audio repository is searched for a specific speaker (the target). This application is very useful, for example, in a call centre (or audio archive) where a large number of calls are collected for each specific line or extension. The target speaker for a specific line, however, participates in only a small part of these calls, while the rest of the calls contain other speakers. Currently, a human listener has to go over all the calls in order to locate the calls of the target speaker. With a speaker retrieval engine, all the calls in the database are scored against a single reference recording of the target's speech, and the human listener checks only the top-scoring calls.

State of the art speaker retrieval system The speaker retrieval system is based on a Gaussian Mixture Model (GMM) classifier and is suitable for text-independent tasks. Figure 6 shows the speaker retrieval system, which consists of training, testing (detection/hunting) and memory. Training involves the acquisition of all the speakers' speech in the database to be searched (the training database). The signals are pre-processed and features are extracted; using these sequences of feature vectors, the speakers' models are estimated. In the hunting (test) stage, the input is the target (hunted) speaker's utterance, which undergoes pre-processing and feature extraction. A pattern-matching scheme is used to calculate the probabilistic scores, which represent the match between the tested target's speech and the models in the database. Score alignment is then performed, and the top scores represent the models that best match the input target's speech.

[Figure 6 block diagram: speech from the multi-speaker database is pre-processed, features are extracted and speaker models 1...N are estimated (training, held in memory); the target speaker's speech is pre-processed, features are extracted and scores are calculated against the stored models, followed by score alignment to produce sorted speech segments (testing / hunting / detection).]

Figure 6. State of the art system for Speaker Retrieval.
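As a rough illustration of this training and hunting pipeline, the sketch below uses scikit-learn GMMs; the number of mixture components and the use of the mean per-frame log-likelihood as the score are our assumptions, not the project's specification.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_call_models(call_features):
    """Fit one GMM per call in the multi-speaker database.

    call_features: dict mapping call id -> (n_frames, n_dims) MFCC array.
    """
    models = {}
    for call_id, feats in call_features.items():
        gmm = GaussianMixture(n_components=16, covariance_type="diag")
        models[call_id] = gmm.fit(feats)
    return models

def hunt(target_features, models, top_n=10):
    """Score the target speaker's utterance against every call model and rank the calls."""
    scores = {call_id: gmm.score(target_features)  # mean log-likelihood per frame
              for call_id, gmm in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```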


Front end processing Several processing steps occur in the front-end analysis. First, the speech is segmented into frames by a 20-ms window progressing at a 10-ms frame rate. A speech activity detector, which is a self-normalizing, energy-based detector, is then used to discard silence, noise and music frames. Next, Mel-Frequency Cepstral Coefficient (MFCC) feature vectors are extracted from the speech frames. The MFCC is the discrete cosine transform of the log-spectral energies of the speech segment, where the spectral energies are calculated over logarithmically spaced filters with increasing bandwidths (mel filters). All cepstral coefficients except the zeroth value (the DC level of the log-spectral energies) are retained in the processing. Delta cepstra (DMFCC) are then computed using a first-order orthogonal polynomial temporal fit over the feature vectors two to the left and two to the right (over time) of the current vector. Finally, the feature vectors are channel-normalized to remove linear channel convolutional effects: since we are using cepstral features, linear convolutional effects appear as additive biases, and cepstral mean subtraction (CMS) has been used successfully.
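Cepstral mean subtraction itself is a one-line operation once the MFCC matrix is available; a minimal sketch, assuming the features are arranged as frames by coefficients:

```python
import numpy as np

def cepstral_mean_subtraction(mfcc: np.ndarray) -> np.ndarray:
    """Remove the per-utterance cepstral mean, cancelling linear channel effects."""
    return mfcc - mfcc.mean(axis=0, keepdims=True)  # mfcc: (n_frames, n_coefficients)
```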

Audio database The audio database used for testing purposes in the development of the speaker retrieval and indexing algorithms will consist of telephony-quality recordings (PCM, 8 kHz, 16-bit) in wav format. This is lower quality than typically expected in sound archives, but it is readily available for analysis, and any processing algorithms that are effective on this archive should also be effective on higher-quality archives. Clean speech is preferred but not a requirement. Separate repositories will be used for different languages. For testing, the audio samples will consist of one speaker only; segmentation into different speakers is addressed separately in Workpackage 4.

Speech/non-speech detection The pre-processing phase will include a speech detection module which will eliminate any non-speech segments prior to the indexing phase. The engine will save the time tags of the speech segments (in the database) and will not further classify the non-speech segments (music, tones or other).

6.2. Speaker characteristic indexing - Voice Print, Gender, Emotion
In this phase the wav files from the speech repository are processed by four different algorithms: voice print extraction, gender detection, laughter detection and emotion detection. Table 3 summarizes the speaker characteristics output for each speech segment.

Table 3. Speaker descriptors.

Characteristics (Descriptors) | *RT | Type | Values | Size | Example
Voice Print | 1:30 | File | Binary | ~100 KB | call.gmm
Gender | 1:200 | Boolean | 0 = female, 1 = male | 1 byte | 0 (female)
Emotion | 1:200 | Numerical structure | Time tags and likelihood score (0-100) | A few hundred bytes | 00:01 00:20 80; 00:42 00:56 65; 01:44 01:58 76; ...
Laughter | 1:200 | Numerical structure | Time tags and likelihood score (0-100) | A few hundred bytes | 00:11 00:20 90; 01:42 01:56 55; 02:44 02:58 66; ...

* 1:x means that x hours of audio are analysed in 1 hour of processing time.

The speaker characteristic indexing requires a file retention mechanism to store all voice prints. This data can be stored in a relational database and accessed through the EASAIER ontology. The other speaker characteristics, such as gender, emotion and laughter, will be saved in the database as simple descriptors.

6.3. Speaker characteristic retrieval
The speaker retrieval system will be able to search according to the rules given in Table 4.


Table 4. Speaker retrieval.

Search | What does it mean? | Query
Known speaker | Voice print exists in the VP repository | Search term
Unknown speaker | Voice print does not exist | Advanced search: load a speaker voice sample. Optional: edit the voice sample and save the VP in the VP repository.
Emotion | Find all audio files with emotional segments above a given confidence level | Advanced search
Laughter | Find all audio files with laughter segments above a given confidence level | Advanced search
Gender | Find all audio files with the given gender | Advanced search

Fields of research activity
• Improve the voice print indexing RT. Currently the RT is 1:30; we will need to research an efficient algorithm with minor performance degradation. Our target is to reach a 1:100 RT.
• Decrease the size of the voice print file. Currently we need to save 100 KB per voice print. If the database includes millions of files, this will need a huge amount of storage and will also become the dominant factor in the search RT. Our target is to reach 50 KB.
• Improve the speaker search speed. The ad hoc search should be very fast so that the user experience is acceptable, and we will need to research innovative methods to improve the speaker retrieval speed. Currently, with the state-of-the-art speaker recognition algorithm, searching over 100,000 files takes approximately 2 hours! Our target is to search 100,000 files in 1 second.
• Improve the detection versus false-alarm (FA) rate.

6.4. Sample test user interface
Figure 7 depicts a basic interface for speaker identification and retrieval, where the details are hidden from the user. The user selects Search from a browser-like toolbar and is directed to an EASAIER-based homepage for a sound archive. The user may then select either a basic textual keyword search on the media files or move to an Advanced or Query by Example search. In this example the user has selected only speech having the words 'George Bush' in a metadata field, and the returned results are ranked according to similarity score. The user may then select one of the results for access and playback. Because the search-in field in this example is 'voice print', the system searches using the George Bush voice print rather than the lexical meaning of the term. The search will begin only if the respective voice print exists in the speakers' voice print repository. If the speaker search term is new to the system (i.e. it does not have a stored voice print), the system should give the following message: "Search term does not have a voice print; please use the advanced search mode".

Figure 7. Example for speaker retrieval using the basic interface for EASAIER retrieval system.


If the speaker voice print is not included in the voice print repository, the advanced search form gives the user the option to add a new voice print to the repository or to search by a specific speaker voice sample.

Advanced search: Speaker Retrieval In the text box we browse for a specific speaker voice sample. The system will search the audio indexing repository for speakers that are similar to this voice sample. In addition, we can edit (clean and pre-process) the voice sample and save the voice print with the assigned search term, as depicted in Figure 8.

[Figure 8 mock-up: a "Speaker-Specific Search" form with a "Find speaker similar to the file" field, a Browse control (e.g. C:\Temp\Josh_Reiss.wav) and Edit / Save Voice Print buttons.]

Figure 8. An advanced search for speaker retrieval.

Advanced search: Speaker Retrieval – Edit option When the Edit button is pressed, an advanced tool to create a voice print will be opened. Such a tool, like the one shown in Figure 9, would let us perform the following tasks:

• Playback the audio excerpt
• Mark only the desired speaker utterance in the audio
• Delete noise and music segments
• In case we have a multi-speaker audio file, activate the segmentation algorithm and mark the desired speaker (see Figure 10)
• After the pre-processing and cleaning have completed, save the new wav file

Figure 9. An advanced edit tool for playback, voice print creation, cleaning and pre-processing.


Figure 10. The visualization of the audio input after activating the unsupervised segmentation algorithm.

Advanced search: Speaker Retrieval – Save option The user can assign a search term to the voice sample that was created. The next time the user searches for this term using the 'Voice Print' field, the system will automatically use the stored voice sample and start the search process.

Advanced search: Speaker Characteristics Retrieval This allows the user to search for audio samples that match specific characteristics above a given similarity threshold (0-100). All marked criteria are combined with an "and" operation. For example, if the user assigns the value 80 to Emotion, the value 80 to Laughter, no value to Gender, and a call length greater than 30 seconds and less than 3 minutes, the search results will present only audio segments that are longer than 30 seconds, shorter than 3 minutes, and include emotion > 80 and laughter > 80.

[Figure 11 mock-up form: Emotion, Gender and Laughter fields ("only return results that are greater than ...") and a Length field ("greater than ... and less than ... min").]

Figure 11. An advanced search for Speaker Characteristics Retrieval.
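The AND combination of thresholds described above amounts to a simple filter over the stored descriptors; the record layout in the sketch below is invented for illustration only and is not the EASAIER database schema.

```python
def filter_segments(segments, emotion_min=None, laughter_min=None,
                    gender=None, min_len=None, max_len=None):
    """Return segments matching every criterion that was actually specified.

    Each segment is a dict such as:
    {"doc": "call_17.wav", "length": 95, "gender": 1, "emotion": 83, "laughter": 12}
    (hypothetical descriptor layout).
    """
    results = []
    for seg in segments:
        if emotion_min is not None and seg["emotion"] <= emotion_min:
            continue
        if laughter_min is not None and seg["laughter"] <= laughter_min:
            continue
        if gender is not None and seg["gender"] != gender:
            continue
        if min_len is not None and seg["length"] <= min_len:
            continue
        if max_len is not None and seg["length"] >= max_len:
            continue
        results.append(seg)
    return results

# Example from the text: emotion > 80 and laughter > 80, length between 30 s and 3 min.
# hits = filter_segments(all_segments, emotion_min=80, laughter_min=80, min_len=30, max_len=180)
```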

7. Cross-media Retrieval

7.1. Introduction

Objectives The investigators plan to construct a cross-media information retrieval system that combines feature extraction techniques, metadata, and optimised multidimensional search methods. Although there has been a tremendous amount of research into many aspects of such a system, very little work has been done on cross-media information retrieval. Such a system would allow the user to enter a piece of media as a query and retrieve an entirely different type of media as a related document. For instance, one could envisage an entertainment application where an image of an actor is entered and film clips of that actor are retrieved. A more practical application could be one where a fingerprint is entered and likely sound clips and photos of the person are retrieved. The goal of this work is to show that such a system is effective and plausible. To illustrate the concept of cross-media retrieval, this section presents a simple user scenario that it will be possible to fulfil with the EASAIER system.

Use-Case Scenario 6 – Cross-media retrieval
Ian McGinley is just about to finish high school. In keeping with his origins, he plays the guitar as well as various traditional Celtic instruments. He loves rock music, but he is also interested in traditional music. He has not yet decided about college, but he would like to continue studying music.

With this in mind, his father played him one of his favourite vinyl records by Van Morrison and The Chieftains, "Irish Heartbeat". Ian liked it on first listening; he especially likes the third song on the album, "Ta Mo Chleamhnas Deanta (My Match It Is Made)". He wants to find a digital version of his favourite song and any other information related to it for further study. In order to find this multimedia information, Ian goes to a digital library through an EASAIER system, enters the name of the song and the names of the authors, and clicks the search button.

On the search results page, Ian finds various multimedia documents. At the top of the list are an audio file containing the song, a video file of a live performance of the song, and the song lyrics. The EASAIER system also presents Ian with a number of related materials, such as the album cover of "Irish Heartbeat", the text of an interview with Van Morrison and The Chieftains about the album, a link to their live performance on RTE, a link to a video segment of a documentary about Irish traditional music in which Van Morrison talks about the song, and links to different versions of the song performed by other artists.

Satisfied with the quality of the information he has retrieved, Ian chooses to read the song lyrics, texts and interviews about the song, to hear the song, or to see the live performance. Beyond that, the EASAIER system also gives him the opportunity to search for musically similar files, using the EASAIER music retrieval system, or to analyse the song using the EASAIER enriched access tools (Workpackage 5).

We believe that this scenario presents an interesting, yet solvable challenge. There has been considerable research on the various components of such a system, but there remain many technical and theoretical difficulties in linking all the components together. Thus, this work represents a worthwhile endeavour both as a useful application and as an advance in frontier research. Each of the primary technological components is described below, along with the key challenges and how we plan to solve the related issues.

7.2. Uploading content and content analysis

Metadata In order to support true cross-media queries, it is necessary to provide a means whereby one can say that two documents of different media types are related. As an example, it is clear that an audio recording of a person speaking and a photo of that person are related, yet this information is in no way revealed through feature extraction. To relate such documents, some knowledge, such as the editorial information depicted in Table 5, needs to be either explicitly entered into the EASAIER system's semantic web knowledge environment by the content manager during the uploading process, or retrieved later through the semantic web (Workpackage 2). These descriptors and this semantic knowledge are then stored in the EASAIER knowledge base with a number of relations between them [19], so that images, audio, video and text related to the same subject can be pooled together. An ontology depicting the relationships between musical objects is shown in Figure 12. These relations between different media and different abstract pieces of knowledge (such as a musical performance which was filmed and recorded) can be used for retrieval and for ranking the results, so that an obscure relation, such as that between the photo of a group of people and a video of one person in the group, might be ranked lower than audio and video of the same person in the same context. We thus define a similarity measure that utilises the ontological structure defined in Workpackage 2. This similarity measure can be used in series or in parallel with the feature-based measure, hence allowing full cross-media queries. An example is given in Figure 14.

Table 5. An example of editorial information that can be used in the EASAIER retrieval system.

Descriptor | Tag name | Automatically generated | Manually generated
1 | <Artist name> / <item name> | | X
2 | <Title> | | X
3 | <Year> | | X
4 | <Narrative> (general notes and text) | | X
5 | <People> (musician names, cast names, ...) | | X
6 | <Media Time> | | X
7 | <Media Location> | | X
8 | <Content type> (speech, music, video, image) | | X

Figure 12. An example of relations between knowledge classes in a music ontology.

Feature Extraction Feature extraction on media, whether images, audio, video or otherwise, involves the analysis of a file, or a portion of a file, to extract a small set of quantifiable features which represent the most relevant properties of the media. The benefit of extracting these features is that a set of features is much easier to compare, analyse and manipulate than the huge amount of information in the media file. Since we wish the retrieval system to be able to associate media files in various different ways, a variety of different feature sets will be created. This allows, for instance, images to be described by their colour distribution or via edge detection algorithms. Similarly, music audio may be described by timbral features or melodic features. By providing several feature sets that can be combined in different ways, we can search the database using different similarity measures. We will describe the EASAIER analysis and feature extraction module on an uploaded multimedia stream (Figure 13) as an example of the most complex content that can be entered into the system. First, the video and audio streams will be extracted from the multimedia stream. The audio stream will be automatically segmented and all segments will be labelled as music, speech or indefinable. All music/speech segments will be sent to the corresponding modules defined in previous chapters for further analysis and feature extraction.


Figure 13. Segmentation of a multimedia stream

The video stream will also be automatically segmented, using a module provided by QMUL, into video shots, and a representative frame (keyframe) from each video shot will be stored for further analysis. Keyframes will be analysed in the same way as other images uploaded into the system, for the set of standard MPEG-7 [20] low-level features given in Table 6. The low-level features are then stored and related to the corresponding image/keyframe. All information about the time segmentation of the multimedia stream (video, music and speech segments) is stored and related to the corresponding document for future analysis. It is important to note that all these computationally expensive operations are done off-line during the uploading process and then stored in the system, so they will not influence the execution time of the search and retrieval task.

Table 6. Image analysis – MPEG-7 low-level features.

Descriptor | Tag name | Automatically generated | Manually generated
1 | <SpatialDecomposition> | X |
2 | <ScalableColor> | X |
3 | <ColorLayout> | X |
4 | <ColorStructure> | X |
5 | <DominantColor> | X |
6 | <EdgeHistogram> | X |
7 | <HomogeneousTexture> | X |

7.3. Similarity Searching and Indexing
The metadata and extracted features described in the previous sections are a valuable basis for making the use-case scenario described in section 7.1 possible. In this scenario, metadata will be used to find in the database, and rank, all documents related to the song and the authors' names. The EASAIER system will present with highest relevance the audio file containing the song, but the ranking of the other related documents remains a challenging problem. The features extracted from the documents will be used to enhance this task. In the given scenario, a video related to Van Morrison can be presented to the user with high relevance if it contains a segment with a performance of the song; this can be found by computing the similarity between the music segments of the video and the audio recording of the song. In a similar way, computing visual similarity on keyframes can identify a video clip of an interview with Van Morrison in which the same video segment is used.

Due to the combined use of metadata and features, the information retrieval system must support several radically different types of internal search methods. First, relationships based on features require a multidimensional similarity index. There is a large literature on this, but it remains to be seen which index is most suited to such a problem and what modifications it would require. Furthermore, the metadata gives rise to complex relationships between documents. Ranking the relationships between media and abstract pieces of information may give rise to nonmetric relationships (an image may come from a film which features a song, but the image is only tangentially related to the song). Thus the appropriate way to design an information retrieval system on top of this knowledge, which can be either ground truth or derived, remains a challenging task. Semantic web technologies, graph theory and small-world networks may be highly applicable to this problem.
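One simple way to combine the two measures is a weighted sum of a feature-based similarity and a metadata/ontology-based similarity; the sketch below is schematic only (the two scoring functions and the weight alpha are placeholders rather than a specified design).

```python
def combined_rank(query, candidates, feature_score, metadata_score, alpha=0.5):
    """Rank candidate documents by a weighted mix of the two similarity measures.

    feature_score(query, doc)  -> similarity in [0, 1] derived from extracted features
    metadata_score(query, doc) -> similarity in [0, 1] derived from the ontology / metadata
    alpha                      -> weight of the feature-based measure (assumed value)
    """
    scored = []
    for doc in candidates:
        score = alpha * feature_score(query, doc) + (1 - alpha) * metadata_score(query, doc)
        scored.append((doc, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```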

Computational Cost Computational costs are incurred in several different places. Since this is planned as a system with feature extraction on the query (in the case of a similarity search on a document that is not stored in the database) and a multidimensional search on the data, large query documents and large databases can both result in excessive retrieval times. Thus, some optimisation is necessary. The researchers will investigate several schemes with the goal of minimising the time it takes to construct the database (feature extraction on all documents, creation of metadata, construction of a search index) and the time it takes to retrieve documents (feature extraction on the query document, searching the index using both metadata and feature similarity, and ordering and presentation of results).

Figure 14. A flowchart depicting how a cross-media query can be performed using a combination of a feature similarity measure and a metadata similarity measure.


8. Vocal Query Interface

8.1. Technology description
It is a convenient feature for the user, and an interesting challenge for the researcher, to provide a vocal query interface for a speech archive. A vocal query interface means that the system allows the user to search for words by pronouncing them into a microphone attached to the computer. The basis of this technology is a phoneme-level speech recognition (SR) system. In the simplest case, it returns the most likely sequence of phonemes for a given utterance. It is also possible for the SR system to return a phoneme lattice to represent several possible transcriptions. When the user initiates a vocal query, the signal from the microphone is pre-processed, the speech part is selected, and it is transcribed by the SR system.

Unfortunately, the recognition of the user input is not perfect, especially if we do not want to fix a predefined dictionary of possible search queries, so simply starting a search with the phoneme sequence output by the SR layer is a very inefficient solution. This section describes the techniques we use to overcome this problem.

In the index search phase, we have to generate a set of 3-grams, which is then matched against the index of 3-grams of our archive. When we get a text-based query, we phonematize the word and extract all of its 3-grams, that is, all of its consecutive phoneme sub-sequences of length 3. In the vocal query case, we can do the same thing with the output of the SR layer. This already provides some level of robustness against recognition errors, because we can hope that some 3-grams remain intact, and we can find the word by this trace. The results improve significantly if the SR layer returns a phoneme lattice and we use all the trigrams of all the paths through the lattice. Of course, a larger query 3-gram set means more hits, but if we raise the threshold levels accordingly, we can find a recall/precision rate that fulfils our needs.

The prior refinement phase is also easily adaptable to the vocal query setting. If we have a prior, we can find the subsequence of its transcription that is closest to the query transcription in edit distance. We typically have to raise the acceptance threshold compared to the text query interface, because now we have to be prepared for recognition errors both in the recognition of the archive segment and in that of the query word.

Better performance can be achieved if one is willing to restrict the search to a predefined dictionary. The trivial way to do this is to force the SR layer to restrict its output to the words of the dictionary, which yields better recognition performance. Another way to use a dictionary is to present the user with the word of the dictionary that was found to be the most probable by the SR layer. If it is not the correct query word, the user can interactively change it, or repeat the query. The idea of repeating and interactive correction can also be used without a dictionary. If the user repeats the same query, the system is able to find a better transcription by matching the possible transcriptions to both utterances at the same time. Interactive correction is also possible, but only at the phoneme level: the user is presented with the transcription of his/her utterance, and then (s)he can either repeat it or change some phonemes on screen.

8.2. Vocal query – User approach
The user can record the query instead of using the keyboard to enter it as text. It can be a word or a simple phrase. The system performs the phoneme-level recognition and optionally lists different variations. The user can choose the best matches, or even correct the phoneme sequence. The user can then save the best phoneme transcription together with the recorded query; later, the user can find this recording when browsing the already recorded queries. The queries are stored with their phoneme-level transcriptions. By recording and accepting a query, or by choosing a previously saved one, the user can initiate a search. More than one phoneme sequence can be used for the retrieval. The retrieval process and the presentation of hits are the same as described for the text-based search.


9. Retrieval System Integration and Knowledge management

9.1. The Integrated Retrieval System

In the previous sections we have described the various components of the EASAIER Retrieval System; however, the manner in which the different components integrate has not yet been discussed. An integrated retrieval system would have the following framework:

a. Firstly, the media must be classified. This is done manually by the content manager who imports audio, movies and images separately.

b. Then, audio content is segmented into music and speech, allowing for separate retrieval methods on each set of segments. Though this segmentation is performed in WP4, it is important to mention that it is a prerequisite for the music and speech retrieval systems.

c. A common basic interface must also be provided, such as the ones provided in the current deliverable document.

d. The ability to retrieve documents is the key feature of all retrieval systems. Retrieval is determined by a similarity match between a segment (as described in step b above) and a query consisting of a keyword search, a restricted set of parameters and/or a query document.

e. Advanced search capabilities will require adaptable speech, music and cross-media interfaces. Integrated retrieval will also be possible using the vocal query interface.

It is clear from this framework that the distinction outlined in previous sections between music and speech retrieval may not need to be a formal one. That is, all audio documents may have associated fields such as, for instance, speaker and musical instrument. If the audio is speech only, then many music-related descriptors are left blank, but a speech-related search may still be performed across all audio documents and segments. This further allows us to retrieve both speech and music documents fitting a set of search parameters without having to exploit the more advanced technology of the Cross-media Retrieval System. For example, searching for the keyword lute may retrieve both lute performances and a spoken lecture on the instrument. In other words, all the information retrieval systems access the same knowledge base; thus, integration of these systems is just integration of user interfaces. This integrated interface approach is perhaps most easily demonstrated through the use of mock-ups such as those depicted in section 10. The knowledge base and its relationship to the retrieval systems are described in the following subsections, which deal with the interface between the different information retrieval systems and the knowledge management architecture defined in Workpackage 2. First the knowledge representation being used is briefly presented, and then the interface itself is described.

9.2. Ontology-based knowledge representation

An ontology provides a way to describe a restricted world we are interested in, in a logical language (description logics, and in a semantic web context, OWL), allowing automatic reasoning. It is far more than just a metadata scheme, as expressed in the following key features of ontology-based knowledge representation:

• Automatic reasoning – Formally defined, an ontology allows automatic reasoning of objects in the described domain. For example, one can query an ontology-based system for all recordings involving “wind instruments” and gain access to those involving “flute,” “oboe”, ... and not only the ones directly tagged with “wind instrument”. When several pieces of knowledge exist in different places (such as, the output of a feature extraction algorithm, an “editorial” knowledge base, and so on), it is possible to bind all this knowledge together automatically. Without an a-priori knowledge of what is available in a database, it is possible to identify similar objects. Moreover, ontologies allow inference about terms; for example, “a quartet is a musical group involving four members.” Reasoning can be divided into two parts: consistency checks (“is the world which is accessible consistent?”) and inference (“from the knowledge available, what can be deduced?”).


• Cross-Media knowledge management - Each type of multimedia object is represented and described in the same knowledge representation language. For example, by using an ontology, both the video of a performance and the related recording, as well as the lyrics, can be accessed.

• Flexible knowledge representation - For example, using an ontology, one can perfectly recognise the existence of an object representing a particular performance of a piece, without the related recording. This is impossible with a standard metadata approach.

• Distributed multimedia repositories - Using OWL, multimedia files are identified by a URI. This means that resources can be located on an FTP server, on an HTTP server, streamed, or even on a peer-to-peer network; the corresponding URI just has to be resolvable.

• Exporting multiple metadata standards / MPEG7 link - By building a particular interpretation of the theory held by the ontology, it is possible to export some knowledge in several metadata standards, from really poorly expressive schemes such as ID3, to highly expressive ones such as MPEG7.

Ontology design For the EASAIER system, we need to handle both editorial information and information that is part of the production process of a particular recording (either ground truth or reverse-engineered through feature extraction algorithms). The editorial information is attached to the product of an archiving process (an audio file) using a unique identifier for tracks. Thus, one can attach to an archived signal some knowledge about the related release, the name of the track, the name of the band/artist associated with this release, and a release type, amongst other things. Workpackage 2 will combine these two types of information by specifying a relevant ontology.

9.3. Interface to information retrieval systems

Accessing the knowledge field

The first task is to provide a query interface in order to access the knowledge held by the ontology. For this, SPARQL is the preferred choice: it is well implemented and well supported, although still a candidate recommendation from the W3C. An example of a SPARQL query, retrieving all performances associated with at least one instrument:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX mu: <http://purl.org/NET/c4dm/music.owl#>
SELECT ?p ?i
WHERE {
  ?p rdf:type mu:Performance.
  ?p mu:hasFactor ?i.
}

Once a query interface has been set up, it provides a basis for the information retrieval systems, which are able to compute similarity measures between documents held by the knowledge environment or to use its ontological structure.
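As an illustration, such a query could be issued programmatically; the sketch below uses rdflib over a local RDF dump, which is an assumption on our part, since the deployed system may instead expose a remote SPARQL endpoint.

```python
from rdflib import Graph

# Load a (hypothetical) RDF dump of the knowledge base; in a deployed system
# the query would more likely be sent to a SPARQL endpoint.
g = Graph()
g.parse("easaier_knowledge_base.rdf")  # hypothetical file name

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX mu: <http://purl.org/NET/c4dm/music.owl#>
SELECT ?p ?i
WHERE {
  ?p rdf:type mu:Performance.
  ?p mu:hasFactor ?i.
}
"""

for performance, instrument in g.query(query):
    print(performance, instrument)
```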

Populating the knowledge base

For automatically populating the knowledge base (automatic descriptor extraction), an algorithm which creates RDF triples according to the available ontologies is specified. These triples consist of subject, predicate and object. Once uploaded to the knowledge base, the RDF triples will be merged through reasoning with the other triples there. QMUL has developed a basic ontology for expressing the mapping between a predicate and a set of RDF triples, so such information could be held in the main knowledge base. This allows us to design the archivist interface on top of the knowledge base (as well as the IR tools). However, this ontology will have to be adapted according to the algorithm encapsulation process. We will employ a REST [21], or REpresentational State Transfer, encapsulation of algorithms in order to keep the system simple. This means that an algorithm could be called by a single HTTP GET request, which outputs the corresponding RDF triples. This should allow one to easily describe algorithms in the knowledge base (an algorithm is unambiguously represented by a URL). A proposed architecture for the interface between the knowledge management system and the information retrieval system is depicted in Figure 15.
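A minimal sketch of this REST-style call pattern is given below; the endpoint URL, its parameter and the returned serialization are hypothetical, since the actual encapsulation is still to be defined.

```python
import requests  # assumption: algorithms are exposed over plain HTTP

# Hypothetical feature-extraction service: a single GET request returns RDF triples
# (e.g. serialized as N-Triples) describing the analysed audio file.
ALGORITHM_URL = "http://example.org/easaier/algorithms/key_detection"  # hypothetical

def run_algorithm(audio_uri: str) -> str:
    """Call an encapsulated algorithm and return its RDF output as text."""
    response = requests.get(ALGORITHM_URL, params={"input": audio_uri})
    response.raise_for_status()
    return response.text  # RDF triples, ready to be merged into the knowledge base

if __name__ == "__main__":
    print(run_algorithm("http://example.org/archive/track_42.wav"))
```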


Figure 15. The interface between the retrieval system and the knowledge management layer.

10. Example Retrieval System Mock-ups and Interfaces
In this section, we demonstrate several mock-up interfaces for the retrieval system. Figure 16 depicts a basic interface for the combined retrieval system, where the details are hidden from the user. The user selects Search from a browser-like toolbar and is directed to an EASAIER-based homepage for a sound archive. The user may then select either a basic textual-keyword search on the media files or move to an Advanced or Query by Example search. In this example the user has selected only music having the word 'guitar' in a metadata field, and a ranked order presentation of the returned results is depicted in Figure 17. The user may then select one of the results for access and playback.

Figure 16. A basic interface for the retrieval system.


Figure 17. Retrieval results interface.

The design of the interface and the presentation of results become even more important when one considers that the goal of providing a multimedia and cross-media retrieval system is to uncover previously unknown relationships between query documents and documents in the database. Thus it is not sufficient merely to say that an audio clip is related to, for example, two audio files, an image, another audio file, a video clip, and so on. Since ground truth information is already incorporated into the database, the results should be presented in a more structured manner. A more relevant presentation should reveal, for instance, that an audio clip is related to the audio stream from a certain video, with these related images, and to another audio clip that relates to a certain subject. The choices for how to present the retrieved results are numerous, and user testing is necessary to determine the most effective approach.

A simple example of the cross-media retrieval interface is given in Figure 18. When accessing an audio file, the user may then select ‘Related Media.’ This will activate the cross-media retrieval system, allowing the user to find images, video, and other media which are associated with the original query file.

Figure 18. Activating the cross-media retrieval system while accessing an audio resource.

Finally, if the user had selected the Advanced Search or Query by Example option, as opposed to the Basic Search in Figure 16, then the user would be directed to a query interface similar to that provided in Figure 19. Note that the full range of options here will depend on the available features from Work Package 4, and may also depend on the available content and how the specific deployment of EASAIER was configured.

Figure 19. An advanced interface for music retrieval.

The speech retrieval works using the same pattern described above and the same common UI. On the first web page, shown in Figure 16, the user can instead choose “Speech” and can enter a text based query. This query will be processed by the speech retrieval system as described in the Speech retrieval chapter. The results will be presented in a ranked list, as shown in Figure 20.

Figure 20. Presentation of speech retrieval hits.

The vocal query interface would enable the user to upload a previously recorded query. The user must record the wave file on his/her desktop machine; this recording must be quite small and contain one word or a phrase. The system could accept several sound file formats. The retrieval is then identical to the scenario described above, except that now the query file may be presented and played back within the same interface as the returned results, as depicted in Figure 21.


Figure 21. Presentation of the vocal query user interface, and the result of search.


References
[1] J. W. Dunn, M. W. Davidson, J. R. Holloway, and G. Bernbom, "The Variations and Variations2 Digital Music Library Projects at Indiana University," in Digital Libraries: Policy, Planning and Practice, J. Andrews and D. Law, Eds.: Ashgate Publishing, 2004, pp. 189-211.
[2] A. Smith, D. R. Allen, and K. Allen, Survey of the State of Audio Collections in Academic Libraries. Washington, D.C.: Council on Library and Information Resources, 2004.
[3] S. Barrett, C. Duffy, and K. Marshalsay, "HOTBED (Handing On Tradition By Electronic Dissemination)," Royal Scottish Academy of Music and Drama, Glasgow, Report, March 2004. www.hotbed.ac.uk
[4] M. Asensio, "JISC User Requirement Study for a Moving Pictures and Sound Portal," The Joint Information Systems Committee, Final Report, November 2003. www.jisc.ac.uk/index.cfm?name=project_study_picsounds
[5] S. Downie and S. J. Cunningham, "Toward a theory of music information retrieval queries: System design implications," Proceedings of the Third International Conference on Music Information Retrieval (ISMIR), Paris, France, 2002.
[6] D. Bainbridge, S. J. Cunningham, and S. Downie, "How people describe their music information needs: A grounded theory analysis of music queries," Proceedings of the Fourth International Conference on Music Information Retrieval (ISMIR), Baltimore, Maryland, 2003.
[7] S. J. Cunningham, N. Reeves, and M. Britland, "An ethnographic study of music information seeking: implications for the design of a music digital library," Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Houston, Texas, pp. 5-16, 2003.
[8] "British Library/JISC Online Audio Usability Evaluation Workshop," Joint Information Systems Committee (JISC), London, UK, 11 October 2004. www.jisc.ac.uk/index.cfm?name=workshop_html
[9] S. Dempster, "Report on the British Library and Joint Information Systems Committee Usability Evaluation Workshop, 20th October 2004," JISC Moving Pictures and Sound Working Group, London, UK, 20 October 2004.
[10] "Final Report of the American Memory User Evaluation, 1991-1993," American Memory Evaluation Team, Library of Congress, Washington, DC, 1993. http://memory.loc.gov/ammem/usereval.html
[11] E. Fønss-Jørgensen, "Applying Telematics Technology to Improve Public Access to Audio Archives (JUKEBOX)," Århus, Denmark, 1997. www.statsbiblioteket.dk/Jukebox/finalrep.html
[12] R. Tucker, "Harmonised Access & Retrieval for Music-Oriented Networked Information (HARMONICA)," 1997-2000. http://projects.fnb.nl/harmonica/
[13] "Higher Education Training Needs Analysis (HETNA)," Scottish Higher Education Funding Council (SHEFC), Sheffield, UK, November 2004. www.shefc.ac.uk/about_us/departments/learning_teaching/hetna/hetna.html
[14] A. Marsden, "ICT Tools for Searching, Annotation and Analysis of Audio-Visual Media," UK Arts and Humanities Research Council, Lancaster, UK, September 2005. www.ahrbict.rdg.ac.uk/activities/marsden.htm
[15] C. Duffy and K. Marshalsay, "HOTBED: an innovative electronic resource," Proceedings of the Digital Resources in the Humanities (DRH) Conference, Edinburgh, UK, 2002.
[16] M. Levy and M. Sandler, "Application of Segmentation and Thumbnailing to Music Browsing and Searching," Proceedings of the AES 120th Convention, Paris, France, 2006.
[17] H. Jaeger, "The 'Echo State' Approach to Analysing and Training Recurrent Neural Networks," GMD Forschungszentrum Informationstechnik GmbH, Institut für autonome intelligente Systeme, Technical Report 148, 2001. www.faculty.iubremen.de/hjaeger/courses/SeminarSpring04/ESNStandardSlides.pdf
[18] A. H. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, "Phoneme recognition using time-delay neural networks," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 37, pp. 328-339, 1989.
[19] S. A. Abdallah, Y. Raimond, and M. Sandler, "An ontology-based approach to information management for music analysis systems," Proceedings of the 120th AES Convention, Paris, France, 2006.
[20] "Multimedia Content Description Interface - Part 3: Visual," ISO/IEC 15938-3:2001, Version 1, 2001.
[21] R. T. Fielding, "Architectural Styles and the Design of Network-based Software Architectures," PhD Thesis, University of California, Irvine, 2000. www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
