Enabling Access to Sound Archives through Integration, Enrichment and Retrieval

WP3 – Retrieval systems

12 Month Review Meeting

Project #033902

Introduction to Workpackage: Overview

Objectives:
  To provide retrieval systems offering the ability to search by various musical similarity measures.
  To search for spoken words or phrases.
  To search across different media for associated content.
  Queries may be text-based, spoken or audio examples.

Tasks:
  T3.1: Music retrieval
  T3.2: Speech retrieval
  T3.3: Cross-media retrieval
  T3.4: Vocal query interface

Introduction to Workpackage: Participants & schedule

Participants and their contributions:
  QMUL: 22 mm – music & cross-media retrieval
  DIT: 5 mm – music retrieval
  ALL: 24 mm – speech retrieval & vocal queries
  LFUI: 3 mm – integration of retrieval engines
  NICE: 12 mm – retrieval for attributes of speech

Schedule:
  T3.1 Music retrieval: month 9 – month 20
  T3.2 Speech retrieval: month 8 – month 20
  T3.3 Cross-media retrieval: month 1 – month 6 and month 21 – month 26
  T3.4 Vocal query interface: month 7 – month 20

Introduction to Workpackage: Task 3.1 - Music retrieval

Searching and organizing music collections:
  Adequate representation for the audio in the query
  Textual and keyword: e.g. author, title, date, genre, etc.
  Automatic feature extraction
  Low-level acoustic similarity measures
  Mid-level features – characterize the rhythmic structure
  High-level features
    musically relevant parameters
    visualisation of key events along the assets' timeline

Introduction to Workpackage: Task 3.2 - Speech retrieval

Retrieving the content of speech corpuses in English and Hungarian
Building test corpuses
Levels of recognition:
  Phoneme level recognition
  Pronunciation dictionary filter or morphological analysis
  Text corpus based language model
Phoneme and word level indexing
Fast retrieval
Improve the performance

Introduction to Workpackage: Task 3.3 - Cross-media retrieval

Searching media in various formats (audio recordings, video recordings, notated scores, images)
Using metadata
Feature extraction
Similarity measures
Optimised multidimensional search methods
Video analysis
A user may enter a piece of media as a query and retrieve an entirely different type of media

Introduction to Workpackage: Task 3.4 - Vocal query interface

Voice initiated media retrieval (without natural language processing)
Recording of query
Phoneme level recognition
Pronunciation dictionary based word(s) identification
Speaker adaptation

Deliverables

D3.1 Report outlining retrieval system functionality and specification (Month 6)

D3.2 Prototype on speech and music retrieval systems with vocal query interface (Month 20)

D3.3 Prototype on cross-media retrieval system (Month 20)

Deliverable D3.1 – Report outlining retrieval system functionality and specification

Topics described:
  User requirements
  Relations to other work packages
  Music retrieval
  Speech retrieval
  Indexing and speaker retrieval
  Cross-media retrieval
  Vocal query interface
  Retrieval system integration and knowledge management – role of ontology
  Example user interfaces

Contributors: ALL, DIT, NICE, QMUL, RSAMD

Milestones

M3.1 Initial vocal query system tested, initial speech and music retrieval algorithms developed (Month 8)

M3.2 Vocal query is fully-functional, speech and music retrieval implemented, cross-media retrieval method finalized (Month 14)

M3.3 Vocal query finished, speech and music retrieval systems established, basic cross-media retrieval implemented (Month 20)

M3.4 Cross-media retrieval fully functional, further work is only refinement and optimization (Month 26)

Milestones M3.1 – Speech retrieval & Vocal query

Vocal query system:
  Ready for demonstration in Hungarian and in English
  Phoneme level recognition implemented and tested; performance improvement in progress
  Hungarian triphone management is under development
  English pronunciation dictionary embedded, with multiple pronunciation variants
  Morphological analyzer implemented for Hungarian pronunciation
  The Hungarian version performs better than the English one; the reason is under investigation

Speech retrieval:
  See above
  Language model established for the Hungarian version; for English we are looking for a good text corpus
  Word level recognition implemented and under testing; performance depends on the content/domain of the speech
  Phoneme based search finished; word based search is under implementation

Milestones M3.1 – Music retrieval

Music retrieval:
  Extractors for tempo, key changes and mode detection have been implemented as VAMP plugins.
  SoundBite similarity retrieval is fully functional and available as a Mac OS application.
  A segmenter based on SoundBite has been implemented as a VAMP plugin.
  A framework for the automatic extraction of audio features has been built. It uses VAMP plugins and outputs descriptors directly in RDF format, allowing easy integration with the ontology.
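
To illustrate what "outputs descriptors directly in RDF format" could look like in practice, here is a minimal Python sketch that serialises a dictionary of extracted descriptors as N-Triples. It is not the EASAIER framework: the function name, the example.org vocabulary and the descriptor names are all hypothetical, and a real system would use the project ontology's own terms.

```python
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def to_ntriples(asset_uri, descriptors, vocab="http://example.org/features#"):
    """Serialise descriptors for one asset as N-Triples lines.

    `vocab` is a placeholder namespace (an assumption), not the real ontology.
    """
    lines = []
    for name, value in sorted(descriptors.items()):
        predicate = f"<{vocab}{name}>"
        if isinstance(value, (int, float)):
            # Numeric descriptors become typed literals.
            obj = f'"{float(value)}"^^<{XSD_DOUBLE}>'
        else:
            # Everything else becomes a plain string literal with basic escaping.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{asset_uri}> {predicate} {obj} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples("http://example.org/asset/1",
                      {"tempo": 120, "key": "C major"}))
```

Each descriptor becomes one triple with the asset as subject, which is what makes the output easy to load into a triple store alongside ontology instances.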

Workpackage Progress

Parallel development of the separate retrieval engines is progressing well

Results are according to the schedule of the technical annex

We aim to demonstrate our results by running software modules

Scientific research carries risk when targeting performance improvements

Integration of different retrieval engines into a common architecture and user interface will be challenging

Utilization of the power of the ontology centric approach

Workpackage Progress: Music retrieval - 1

Music retrieval system

Searches assets according to their relevance to music-related queries, using various metadata and automatically extracted features to produce a ranked list of audio files.

Search methods:
  Textual and keyword: e.g. author, title, date, genre, etc.
  Similarity based on automatically extracted low-level features
  Music-related parameters using automatically extracted descriptors: instrument, orchestration, tempo, key, etc.
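
The combination of keyword search and parametric search over extracted descriptors can be sketched as follows. This is only an illustration under assumptions: the record fields (author, title, genre, tempo), the function name and the ranking rule (number of matching keywords) are hypothetical and not taken from the EASAIER system.

```python
def search(assets, keywords=None, tempo_range=None):
    """Return asset titles ranked by the number of matching keywords.

    assets: list of dicts with hypothetical keys author/title/genre/tempo.
    keywords: optional list of words matched against the textual metadata.
    tempo_range: optional (low, high) filter on an extracted tempo descriptor.
    """
    results = []
    for asset in assets:
        text = " ".join(
            str(asset.get(field, "")) for field in ("author", "title", "genre")
        ).lower()
        hits = sum(1 for word in (keywords or []) if word.lower() in text)
        if keywords and hits == 0:
            continue  # textual query given, but nothing matched
        if tempo_range is not None:
            low, high = tempo_range
            # Assets without a tempo descriptor fail the comparison (NaN).
            if not low <= asset.get("tempo", float("nan")) <= high:
                continue
        results.append((hits, asset["title"]))
    # More keyword hits first; ties broken alphabetically by title.
    results.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in results]


if __name__ == "__main__":
    catalogue = [
        {"title": "Blue Train", "author": "Coltrane", "genre": "jazz", "tempo": 140},
        {"title": "Jazz Suite", "author": "Shostakovich", "genre": "classical jazz", "tempo": 90},
        {"title": "Nocturne", "author": "Chopin", "genre": "classical", "tempo": 60},
    ]
    print(search(catalogue, keywords=["jazz", "classical"]))
    print(search(catalogue, tempo_range=(100, 150)))
```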

Workpackage Progress: Music retrieval – 2 – Music Analysis module

[Figure: Music Analysis module, archive application (musical audio). Inputs: input audio file (PCM); manually entered tags/data. Processing blocks: de-noising / restoration, source separation, mid-level feature extractors, high-level feature extractors, compression, reliability metric. Outputs: audio to the PCM & compressed audio assets repository; to the metadata repository, high-level features (parametric search), mid-level descriptors (similarity search), optimal source separation & denoising parameters, manual tags & manual high-level features.]

Music Analysis module

Musically relevant descriptors are automatically extracted by a module running a series of “VAMP” plugins.

Descriptors returned by the plugins can be classified as:

Mid-level features used by the retrieval system to search for similar audio assets (e.g. timbre profile).

High-level features enable the user to search for audio assets using musically relevant parameters.

High level descriptors are also used for the visualisation of key events along the assets' timeline (e.g. position of beats, bars, key changes and instruments), providing a considerable aid in the analysis of a piece of music.

Similarity Search

Based on the SoundBite algorithm, which allows simultaneous segmentation, thumbnailing and modelling of an audio asset.
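
The slides do not describe SoundBite's internals, so the following is not the SoundBite algorithm. It is a generic sketch of feature-based similarity search, assuming each asset has already been reduced to a fixed-length timbre vector (a hypothetical representation) and ranking by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector, library, top_n=3):
    """Return the names of the top_n library assets closest to the query.

    library: dict mapping asset name -> feature vector (assumed precomputed).
    """
    scored = [(cosine(query_vector, vec), name) for name, vec in library.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_n]]


if __name__ == "__main__":
    library = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(most_similar([1.0, 0.1], library))
```

Any other distance between per-asset models could be substituted for `cosine` without changing the ranking logic.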

Workpackage Progress: Speech retrieval - 1

Research and testing are carried out on well-prepared corpuses
  Text and corresponding speech needed: 10-20 hours
  Quality is very important:
    Low noise
    Mixed sound sources omitted
    Accent sensitive
    Field interviews, phone conversations
    Quality of the recording
  Silence to silence segmentation - automated and manual
  Half of the corpus for training
  Half of the corpus for testing

Hungarian corpus
  Hungarian radio station 'Kossuth', broadcast quality
  More than 20 hours - segmentation enhanced manually

English corpus
  TIMIT for research purposes
  US Supreme Court recordings
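
Automated silence to silence segmentation can be approximated with a simple energy threshold. The sketch below is an assumption about how such a segmenter might work, not the project's implementation: the frame length, threshold and minimum silence length are hypothetical parameters that would need tuning per corpus.

```python
def segment_by_silence(samples, frame_len, threshold, min_silence_frames=2):
    """Return (start, end) sample indices of speech segments between silences.

    A frame is 'silent' when its mean squared amplitude is below `threshold`.
    A segment ends once `min_silence_frames` silent frames follow each other.
    """
    segments = []
    start = None
    silent_run = 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i * frame_len  # speech begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Close the segment at the first frame of the silent run.
                end = (i - silent_run + 1) * frame_len
                segments.append((start, end))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended while still inside a segment; drop trailing silence.
        segments.append((start, (n_frames - silent_run) * frame_len))
    return segments


if __name__ == "__main__":
    audio = [0] * 4 + [1] * 4 + [0] * 4 + [1] * 2 + [0] * 2
    print(segment_by_silence(audio, frame_len=2, threshold=0.5))
```

Manual correction, as mentioned for the Hungarian corpus, would then adjust the boundaries this kind of automatic pass produces.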

Workpackage Progress: Speech retrieval - 2

Preparation for speech recognition
  Training on segmented corpuses
  Importance of same accent and same speech content domain

Layers of speech recognition
  Phoneme level recognition -> acoustic score
  Pronunciation dictionary/morphological analysis based recognition -> word exists or not
  Language model -> final probability

Mixed, phoneme and word based indexing keeping probabilities

Index based fast retrieval with score value
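
An index that keeps probabilities and supports fast lookup by score can be sketched as an inverted index. This is an illustrative assumption, not the binary index format used by the speech retrieval engine; the "w:" and "p:" term prefixes for word and phoneme entries are hypothetical conventions.

```python
from collections import defaultdict


def build_index(hypotheses):
    """Build an inverted index from recognition hypotheses.

    hypotheses: iterable of (segment_id, term, probability), where a term is
    either a word ("w:court") or a phoneme string ("p:k ao r t").
    """
    index = defaultdict(list)
    for segment_id, term, probability in hypotheses:
        index[term].append((probability, segment_id))
    for postings in index.values():
        # Highest probability first, so retrieval needs no further sorting.
        postings.sort(key=lambda p: (-p[0], p[1]))
    return index


def lookup(index, term, min_probability=0.0):
    """Return (segment_id, probability) pairs for a term, best first."""
    return [
        (segment_id, probability)
        for probability, segment_id in index.get(term, [])
        if probability >= min_probability
    ]


if __name__ == "__main__":
    idx = build_index([
        ("seg1", "w:court", 0.6),
        ("seg2", "w:court", 0.9),
        ("seg2", "p:k ao r t", 0.7),
    ])
    print(lookup(idx, "w:court"))
```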

Workpackage Progress: Speech retrieval - 3

Phoneme level recognition
  ALL spent many months on research to refine its algorithms
  Successful results:
    Using Gaussian mixture models in HMM nodes
    Triphone and allophone identification and management
    Speaker clustering
  Our phoneme level recognition exceeded the 60% hit rate
  This solution was not good enough to build reliable speech retrieval on
  The workflow:
    Input: silence to silence segments of wave
    Output: probable phoneme sequences with acoustic score
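
The phrase "Gaussian mixture model in HMM nodes" refers to scoring each acoustic feature vector against a mixture of Gaussians attached to an HMM state. The sketch below shows that scoring step only, assuming diagonal covariances; the parameter layout and function name are hypothetical and nothing here reflects ALL's actual recogniser.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    weights: mixture weights summing to 1.
    means, variances: one vector per mixture component, same length as x.
    """
    log_terms = []
    for weight, mean, var in zip(weights, means, variances):
        log_p = math.log(weight)
        for xi, mi, vi in zip(x, mean, var):
            # Log density of a one-dimensional Gaussian, summed over dimensions.
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


if __name__ == "__main__":
    print(gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]]))
```

In a full recogniser, scores like this would be accumulated along HMM state paths to produce the acoustic score attached to each candidate phoneme sequence.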

Workpackage Progress: Speech retrieval - 4

Dictionary level
  Filtering of feasible phoneme sequences by
    a pronunciation dictionary in English (custom based pronunciation)
    a morphological analyzer in Hungarian (rule based pronunciation)
  The workflow:
    Input: phoneme sequences with acoustic score
    Output: feasible word sequences with kept acoustic score

Language level
  On the basis of big text corpuses we rank the probability of the word sequences
  The solution performs much better on domain specific speech (legal, medical domain)
  The workflow:
    Input: word sequences with acoustic score
    Output: word sequences with modified acoustic score
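
The two workflows above can be sketched as a filter followed by a rescoring step. This is a simplified illustration under assumptions: it handles single words rather than word sequences, uses a unigram probability table as a stand-in for the language model, and all names and the `lm_weight` parameter are hypothetical.

```python
import math


def decode(candidates, pronunciations, unigram_probs, lm_weight=1.0):
    """Filter phoneme hypotheses by dictionary, then rescore with a language model.

    candidates: list of (phoneme_sequence, acoustic_log_score).
    pronunciations: dict mapping phoneme tuples to words (dictionary level).
    unigram_probs: dict mapping words to probabilities (language level stand-in).
    Returns (combined_score, word) pairs, best first.
    """
    scored = []
    for phonemes, acoustic in candidates:
        word = pronunciations.get(tuple(phonemes))
        if word is None:
            continue  # not a feasible word: dropped at the dictionary level
        # Unseen words get a small floor probability instead of zero.
        lm_log_prob = math.log(unigram_probs.get(word, 1e-9))
        scored.append((acoustic + lm_weight * lm_log_prob, word))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


if __name__ == "__main__":
    hypotheses = [
        (("k", "ae", "t"), -2.0),
        (("k", "ae", "p"), -1.5),
        (("x",), -0.1),
    ]
    lexicon = {("k", "ae", "t"): "cat", ("k", "ae", "p"): "cap"}
    print(decode(hypotheses, lexicon, {"cat": 0.5, "cap": 0.01}))
```

In the example, "cap" has the better acoustic score but "cat" wins after rescoring, which mirrors why a domain-matched text corpus improves results.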

Integration of speech retrieval into the EASAIER architecture

Speech retrieval works as a black box
Relies on binary indexes
Business logic layer needed
Temporal RDF triplets generated
User initiated retrieval performed

[Figure: components are the EASAIER User Interface, the Business Logic, the Speech Retrieval Engine with its Speech Index Files, and the ontology concepts & instances stored in the RDF Triplet Database. Labelled interactions: retrieving data; adding RDF triplets temporally; generating complex SPARQL query; getting SPARQL query result.]
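
One possible reading of the integration flow is that the business logic turns speech retrieval hits into short-lived RDF triples and then issues a SPARQL query that joins them with the stored metadata. The sketch below follows that reading, which is an interpretation of "temporal" as "temporary"; the namespace, the property names `speechScore` and `title`, and both function names are hypothetical.

```python
NS = "http://example.org/easaier#"  # placeholder namespace, not the real ontology
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def temporary_triples(hits):
    """Turn (asset_uri, score) hits from the speech engine into N-Triples."""
    return [
        f'<{uri}> <{NS}speechScore> "{score}"^^<{XSD_DOUBLE}> .'
        for uri, score in hits
    ]


def build_query(min_score):
    """Build a SPARQL query joining speech scores with stored titles."""
    return "\n".join([
        f"PREFIX ex: <{NS}>",
        "SELECT ?asset ?title ?score WHERE {",
        "  ?asset ex:speechScore ?score ;",
        "         ex:title ?title .",
        f"  FILTER(?score >= {min_score})",
        "}",
        "ORDER BY DESC(?score)",
    ])


if __name__ == "__main__":
    for triple in temporary_triples([("http://example.org/asset/7", 0.87)]):
        print(triple)
    print(build_query(0.5))
```

Under this reading, the triples would be removed from the store again once the query result has been returned to the user interface.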

Workpackage Progress: Cross media retrieval - 1

Cross-media retrieval
  The CM retrieval engine and its functionalities were specified in Deliverable 3.1
  Video analysis modules necessary for CMR are specified in the internal EASAIER software modules document and consist of:
    Video Transcoding Module
    Shot Detection and Key-Frame Extraction Module
    Low-Level Feature Extraction Module

Workpackage Progress: Cross media retrieval – 2 – Video Analysis module

[Figure: Video Analysis module. Processing blocks: input video file, compression, audio stream extraction, audio stream analysis, video segmentation and keyframe extraction, keyframe analysis, manual annotation. Data shown: original video file, streaming video file (e.g. MPEG-4), keyframes, keyframe (KF) temporal data, video segments temporal data, video segments metadata, KF extracted features. Repositories: multimedia assets repository, metadata repository.]

Workpackage Progress: Cross media retrieval - 3

For transcoding purposes we chose and tested the ffmpeg software

A first version of Shot Detection and Key-Frame Extraction was developed at QMUL and is ready for integration

The MPEG-7 eXperimentation Model (XM) will be integrated for Low-Level Feature Extraction.
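
The slides do not say how the QMUL shot detector works. As a generic illustration only, a common baseline compares intensity histograms of consecutive frames and declares a shot boundary when they differ strongly. In the sketch below, frames are flat lists of 8-bit grey values, and the bin count and threshold are hypothetical choices.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalised intensity histogram of a frame given as a flat pixel list."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in counts]


def detect_shots(frames, threshold=0.5):
    """Return indices of frames that start a new shot (excluding frame 0)."""
    boundaries = []
    previous = None
    for index, frame in enumerate(frames):
        current = histogram(frame)
        if previous is not None:
            # Half the L1 distance lies in [0, 1]: 0 = identical, 1 = disjoint.
            difference = sum(abs(a - b) for a, b in zip(current, previous)) / 2
            if difference > threshold:
                boundaries.append(index)
        previous = current
    return boundaries


def key_frames(frames, threshold=0.5):
    """Pick the first frame of each shot as its key frame (a simple heuristic)."""
    return [0] + detect_shots(frames, threshold) if frames else []


if __name__ == "__main__":
    video = [[10] * 8, [12] * 8, [250] * 8, [245] * 8]
    print(detect_shots(video), key_frames(video))
```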

Workpackage Progress: Vocal query

Technology based on the first two layers of speech recognition (see above)
  Phoneme level recognition
  Pronunciation dictionary or morphological analyzer

Process
  Phoneme level recognition with acoustic score
  Matching to the dictionary
  Solution is the item with the best acoustic score which is also found in the dictionary

Technology will be demonstrated
  Speed has to be tuned
  Performance under evaluation
  Performance difference between English and Hungarian

Contributions and Connections with Other Workpackages

Upcoming Work Plan Months 12-24 – Music related

Up to month 20
  Music retrieval - prototype available; remaining time will be spent on integration, testing, and continuing development of music retrieval based on other similarity measures.

Month 20
  Deliver D3.2

Between month 20 and 26
  Music retrieval - only further work is refinement based on user studies (WP7)

Month 26
  Deliver D3.3

Upcoming Work Plan Months 12-24 – Speech related

Before D3.2 – Month 20
  Phoneme level recognition
    Speed tuning
    Language specific performance improvement
    Triphone implementation in English
    Speaker clustering
  Dictionary level
    More pronunciation variants in English
  Language model
    Bigger text corpuses in English and in Hungarian
  Indexing and retrieval
    Reimplementation for mixed phoneme and word search

After Month 20
  Testing and refinement
  Performance tuning

Upcoming Work Plan Months 12-24 – Cross media related

All cross-media retrieval effort will be focused on coding and testing the software.

Material produced for working documents (System Architecture, Software Modules, Metadata) specifies the software to be developed and integrated, and how this will be done.

The required routines have for the most part been developed at a 'proof-of-concept' level (ontological relationships between different media, feature extraction for images and video, key frame display for retrieved video)

Month 26
  M3.4 - Cross-media retrieval fully functional, further work is only refinement and optimization.
  D3.3 Prototype on cross-media retrieval system

Between month 20 and 26, QMUL will work closely with SILOGIC to integrate cross-media retrieval into the EASAIER system, which we intend to use as the prototype for this.

Demonstration: Overview

Retrieval engines and tools are demonstrated separately

Speech related demo
  Segmentation user interface
  Vocal query (isolated word search) in Hungarian
  Vocal query (isolated word search) in English

Music related demo
  SoundBite – timbre-based music similarity embedded in iTunes

Demonstration: Segmentation

User interface for manual segmentation
  Synchronizing silence to silence speech segments to the text
  Checking automated silence to silence segmentation
  Synchronizing word boundaries to the text
  Synchronizing phoneme boundaries to the text

Demonstration: Segmentation – screen shot

Demonstration: Vocal query

Searching for a spoken word in a dictionary
  Recording input
  Phoneme level recognition
  Matching probable phoneme sequences to content of the pronunciation dictionary
  Display of the most probable solution from the dictionary

Demonstration: Vocal query – screen shot

Demonstration: Music retrieval – screen shot
