Post on 23-Dec-2015
transcript
Enabling Access to Sound Archives through Integration, Enrichment and Retrieval
WP3 – Retrieval systems
12 Month Review Meeting
Project #033902
Introduction to Workpackage: overview
Objectives:
  To provide retrieval systems offering the ability to search by various musical similarity measures.
  To search for spoken words or phrases.
  To search across different media for associated content.
  Queries may be text-based, spoken or audio examples.
Tasks:
  T3.1: Music retrieval
  T3.2: Speech retrieval
  T3.3: Cross-media retrieval
  T3.4: Vocal query interface
Introduction to Workpackage: participants & schedule
Participants and their contributions:
  QMUL: 22 mm – music & cross-media retrieval
  DIT: 5 mm – music retrieval
  ALL: 24 mm – speech retrieval & vocal queries
  LFUI: 3 mm – integration of retrieval engines
  NICE: 12 mm – retrieval for attributes of speech
Schedule:
  T3.1 Music retrieval: month 9 – month 20
  T3.2 Speech retrieval: month 8 – month 20
  T3.3 Cross-media retrieval: month 1 – month 6, month 21 – month 26
  T3.4 Vocal query interface: month 7 – month 20
Introduction to Workpackage: Task 3.1 - Music retrieval
Searching and organizing music collections:
  Adequate representation for the audio in the query
  Textual and keyword: e.g. Author, title, date, genre, etc.
  Automatic feature extraction
    Low-level acoustic similarity measures
    Mid-level features – characterize the rhythmic structure
    High-level features
      musically relevant parameters
      visualisation of key events along the assets' timeline
Introduction to Workpackage: Task 3.2 - Speech retrieval
Retrieving the content of speech corpuses in English and Hungarian
Building test corpuses
Levels of recognition
  Phoneme level recognition
  Pronunciation dictionary filter or morphological analysis
  Text corpus based language model
Phoneme and word level indexing
Fast retrieval
Improve the performance
Introduction to Workpackage: Task 3.3 - Cross-media retrieval
Searching media in various formats (audio recordings, video recordings, notated scores, images)
  Using metadata
  Feature extraction
  Similarity measures
  Optimised multidimensional search methods
  Video analysis
A user can enter a piece of media as a query and might retrieve an entirely different type of media
Introduction to Workpackage: Task 3.4 - Vocal query interface
Voice initiated media retrieval (without natural language processing)
  Recording of query
  Phoneme level recognition
  Pronunciation dictionary based word(s) identification
  Speaker adaptation
Deliverables
D3.1 Report outlining retrieval system functionality and specification (Month 6)
D3.2 Prototype on speech and music retrieval systems with vocal query interface (Month 20)
D3.3 Prototype on cross-media retrieval system (Month 26)
Deliverable D3.1 – Report outlining retrieval system functionality and specification
Topics described:
  User requirements
  Relations to other work packages
  Music retrieval
  Speech retrieval
  Indexing and speaker retrieval
  Cross-media retrieval
  Vocal query interface
  Retrieval system integration and knowledge management – role of ontology
  Example user interfaces
Contributors: ALL, DIT, NICE, QMUL, RSAMD
Milestones
M3.1 Initial vocal query system tested, initial speech and music retrieval algorithms developed (Month 8)
M3.2 Vocal query is fully-functional, speech and music retrieval implemented, cross-media retrieval method finalized (Month 14)
M3.3 Vocal query finished, speech and music retrieval systems established, basic cross-media retrieval implemented (Month 20)
M3.4 Cross-media retrieval fully functional, further work is only refinement and optimization (Month 26)
Milestones M3.1 – Speech retrieval & Vocal query
Vocal query system:
  Is ready for demonstration in Hungarian and in English
  Phoneme level recognition implemented and tested, performance improvement in progress
  Hungarian tri-phone management is under development
  English pronunciation dictionary embedded, with multiple versions of pronunciation
  Morphological analyzer implemented for the Hungarian pronunciation
  The performance of the Hungarian version is better than the English, the reason is under investigation
Speech retrieval:
  See above
  Language model established for the Hungarian version, for English we are looking for a good text corpus
  Word level recognition implemented, under testing, the performance depends on the content/domain of the speech
  Phoneme based search finished, word based is under implementation
Milestones M3.1 – Music retrieval
Music retrieval:
  Extractors for tempo, key changes and mode detection have been implemented as VAMP plugins.
  SoundBite similarity retrieval is fully functional and available as a Mac OS application.
  A segmenter based on SoundBite has been implemented as a VAMP plugin.
  A framework for the automatic extraction of audio features has been built. It uses VAMP plugins and outputs descriptors directly in RDF format, allowing easy integration with the ontology.
Workpackage Progress
Parallel development of separate retrieval engines is progressing well
Results are in line with the schedule of the technical annex
We aim to demonstrate our results by running software modules
Scientific research induces risk when targeting improvement of performance
Integration of different retrieval engines into a common architecture and user interface will be challenging
Utilization of the power of the ontology centric approach
Workpackage Progress: Music retrieval - 1
Music retrieval system
Searches assets according to their relevance to music-related queries, using various metadata and automatically extracted features to produce a ranked list of audio files.
Search methods:
  Textual and keyword: e.g. Author, title, date, genre, etc.
  Similarity based on automatically extracted low-level features
  Music-related parameters using automatically extracted descriptors: instrument, orchestration, tempo, key, etc.
Workpackage Progress: Music retrieval – 2 – Music Analysis module
[Figure: block diagram of the Music Analysis module in the archive application (musical audio). Inputs: input audio file (PCM); manual entry of tags/data. Processing blocks: de-noising / restoration; source separation; mid-level feature extractors; high-level feature extractors; compression; reliability metric. Data labels: high-level features (parametric search); mid-level descriptors (similarity search); optimal source separation & denoising parameters; manual tags & manual high-level features. Destinations: PCM & compressed audio assets repository; metadata repository.]
Workpackage Progress: Music retrieval - 3
Music Analysis module
Musically relevant descriptors are automatically extracted by a module running a series of “VAMP” plugins.
Descriptors returned by the plugins can be classified as:
Mid-level features used by the retrieval system to search for similar audio assets (e.g. Timbre profile).
High-level features enable the user to search for audio assets using musically relevant parameters.
High level descriptors are also used for the visualisation of key events along the assets' timeline (e.g. position of beats, bars, key changes and instruments), providing a considerable aid in the analysis of a piece of music.
Workpackage Progress: Music retrieval - 4
Similarity Search
Based on the SoundBite algorithm: it allows simultaneous segmentation, thumbnailing and modelling of an audio asset
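The slides do not describe how SoundBite compares assets. As a stand-in only, the sketch below ranks a library by cosine similarity between fixed-length feature vectors (for example a timbre profile). The function names, the vector layout and the choice of cosine similarity are all assumptions made for illustration, not the SoundBite method.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(
    query: Sequence[float],
    library: Dict[str, Sequence[float]],  # hypothetical: asset name -> feature vector
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_n assets most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]
```

The result is a ranked list of (asset, similarity) pairs, which matches the "ranked list of audio files" the retrieval system is said to produce.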
Workpackage Progress: Speech retrieval - 1
Research and testing are carried out on well prepared corpuses
  Text and corresponding speech needed
  10-20 hours
  Quality is very important
    Low noise
    Mixed sound source omitted
    Accent sensitive
    Field interviews, phone conversations
    Quality of the recording
  Silence to silence segmentation - automated and manual
  Half of the corpus for training
  Half of the corpus for testing
Hungarian corpus
  Hungarian radio station ‘Kossuth’
  Broadcast quality
  More than 20 hours - segmentation enhanced manually
English corpus
  TIMIT for research purpose
  US Supreme Court recordings
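The automated silence to silence segmentation is not specified in the slides. A minimal sketch, assuming a simple frame-energy threshold, is given below. The function name, the frame size, the threshold and the minimum silence length are hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def segment_on_silence(
    samples: List[float],
    frame_size: int = 160,           # hypothetical: 10 ms at 16 kHz
    energy_threshold: float = 0.01,  # hypothetical mean-square energy threshold
    min_silence_frames: int = 3,     # hypothetical: shorter pauses do not split a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of speech segments between silences."""
    segments = []
    start = None     # first sample of the current speech segment, if any
    silent_run = 0   # consecutive silent frames seen inside the current segment
    n_frames = len(samples) // frame_size
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        energy = sum(x * x for x in frame) / frame_size
        if energy >= energy_threshold:
            if start is None:
                start = i * frame_size
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Close the segment at the first silent frame of this run.
                segments.append((start, (i - silent_run + 1) * frame_size))
                start = None
                silent_run = 0
    if start is not None:
        # The signal ended inside a segment; drop any short trailing silence.
        segments.append((start, (n_frames - silent_run) * frame_size))
    return segments
```

Segments found this way would still need the manual checking and correction that the slide mentions.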
Workpackage Progress: Speech retrieval - 2
Preparation for speech recognition
  Training on segmented corpuses
  Importance of same accent and same speech content domain
Layers of speech recognition
  Phoneme level recognition -> acoustic score
  Pronunciation dictionary/morphological analysis based recognition -> word exists or not
  Language model -> final probability
Mixed, phoneme and word based indexing keeping probabilities
Index based fast retrieval with score value
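One way to picture a mixed phoneme and word index that keeps probabilities is an inverted index keyed by recognition level and token. The sketch below is an in-memory illustration only. The class name, the key layout and the use of a plain dictionary are assumptions, whereas the project's engine is said to rely on binary index files.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A posting is (segment id, recognition score).
Posting = Tuple[str, float]


class MixedIndex:
    """Toy inverted index over phoneme level and word level recognition results."""

    def __init__(self) -> None:
        self._postings: Dict[Tuple[str, str], List[Posting]] = defaultdict(list)

    def add(self, level: str, token: str, segment_id: str, score: float) -> None:
        """Store one recognised unit; level is 'phoneme' or 'word'."""
        self._postings[(level, token)].append((segment_id, score))

    def search(self, level: str, token: str) -> List[Posting]:
        """Return the segments containing the token, best score first."""
        hits = self._postings.get((level, token), [])
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A lookup is a single dictionary access followed by a sort of the matching postings, so retrieval time does not depend on the size of the audio collection.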
Workpackage Progress: Speech retrieval - 3
Phoneme level recognition
  ALL spent many months on research to refine its algorithms
  Successful results
    Using Gaussian mixture models in HMM nodes
    Triphone and allophone identification and management
    Speaker clustering
  Our phoneme level recognition exceeded a 60% hit rate
  This solution was not good enough to build reliable speech retrieval on
  The workflow:
    Input: silence to silence segments of wave
    Output: probable phoneme sequences with acoustic score
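To make "Gaussian mixture model in HMM nodes" concrete: each HMM state scores an incoming feature vector with a weighted sum of Gaussian densities, and those scores feed the acoustic score of a phoneme sequence. The sketch below computes that emission log-likelihood. Diagonal covariances and the data layout are assumptions, since the slides do not give the model details.

```python
import math
from typing import List, Sequence, Tuple

# One mixture component: (weight, means, variances); diagonal covariance assumed.
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def gmm_log_likelihood(x: Sequence[float], components: List[Component]) -> float:
    """Log of the weighted sum of component densities, via log-sum-exp for stability."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))
```

Working in the log domain avoids numerical underflow when many frame scores are accumulated along a path through the HMM.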
Workpackage Progress: Speech retrieval - 4
Dictionary level
  Filtering of feasible phoneme sequences by
    a pronunciation dictionary in English (custom based pronunciation)
    a morphological analyzer in Hungarian (rule based pronunciation)
  The workflow:
    Input: phoneme sequences with acoustic score
    Output: feasible word sequences with kept acoustic score
Language level
  On the basis of big text corpuses we rank the probability of the word sequences
  The solution performs much better on domain specific speech (legal, medical domain)
  The workflow:
    Input: word sequences with acoustic score
    Output: word sequences with modified acoustic score
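The language level step can be sketched as rescoring: each candidate word sequence keeps its acoustic score and gains a language model term estimated from a text corpus. The slides do not say which model is used, so the add-one smoothed bigram model, the log-domain combination and the `lm_weight` parameter below are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Set, Tuple


class BigramModel:
    """Add-one smoothed bigram model estimated from a tokenised text corpus."""

    def __init__(self, sentences: Sequence[Sequence[str]]) -> None:
        self.vocab: Set[str] = set()
        self.contexts: Counter = Counter()  # how often a word precedes another
        self.bigrams: Counter = Counter()
        for words in sentences:
            padded = ["<s>"] + list(words)
            self.vocab.update(words)
            self.contexts.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))

    def log_prob(self, words: Sequence[str]) -> float:
        """Smoothed log probability of a word sequence."""
        padded = ["<s>"] + list(words)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + len(self.vocab)
            total += math.log(num / den)
        return total


def rescore(
    hypotheses: List[Tuple[List[str], float]],  # (word sequence, log acoustic score)
    model: BigramModel,
    lm_weight: float = 1.0,                     # hypothetical weighting factor
) -> List[Tuple[List[str], float]]:
    """Return the hypotheses with modified scores, best first."""
    rescored = [(words, score + lm_weight * model.log_prob(words)) for words, score in hypotheses]
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

A model trained on text from the same domain as the speech assigns higher probability to the word sequences that actually occur there, which is consistent with the better results reported for domain specific speech.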
Workpackage Progress: Speech retrieval - 5
Integration of speech retrieval into the EASAIER architecture
  Speech retrieval works as a black box
  Relies on binary indexes
  Business logic layer needed
  Temporal RDF triplets generated
  User initiated retrieval performed
[Figure: integration of the speech retrieval engine into the EASAIER architecture. Components: EASAIER User Interface; Ontology concepts & instances stored in RDF Triplet Database; Business Logic; Speech Retrieval Engine; Speech Index Files. Labelled flows: Query; Result set; Speech Retrieval Request; Speech Retrieval Result Set; Retrieving data; Adding RDF triplets temporally; Generating complex SPARQL query; Getting SPARQL query result.]
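The business logic step described above, turning a speech retrieval result set into RDF triplets and then querying them together with the stored metadata, might be sketched as follows. The slides give no vocabulary, so the `http://example.org/easaier#` namespace and the property names `inAsset`, `startsAt`, `score` and `title` are hypothetical. The code only builds N-Triples and SPARQL text and does not talk to any triple store.

```python
from typing import List, Tuple

EX = "http://example.org/easaier#"          # hypothetical namespace, not from the project
XSD = "http://www.w3.org/2001/XMLSchema#"


def hits_to_triples(hits: List[Tuple[str, float, float]]) -> List[str]:
    """Turn (asset id, start time in seconds, score) hits into N-Triples lines."""
    lines = []
    for i, (asset, start, score) in enumerate(hits):
        hit = f"<{EX}hit{i}>"
        lines.append(f"{hit} <{EX}inAsset> <{EX}{asset}> .")
        lines.append(f'{hit} <{EX}startsAt> "{start}"^^<{XSD}double> .')
        lines.append(f'{hit} <{EX}score> "{score}"^^<{XSD}double> .')
    return lines


def build_query(min_score: float) -> str:
    """Build a SPARQL query joining speech hits with asset titles held in the store."""
    return (
        f"PREFIX ex: <{EX}>\n"
        "SELECT ?asset ?title ?start ?score WHERE {\n"
        "  ?hit ex:inAsset ?asset ; ex:startsAt ?start ; ex:score ?score .\n"
        "  ?asset ex:title ?title .\n"
        f"  FILTER(?score >= {min_score})\n"
        "}\n"
        "ORDER BY DESC(?score)"
    )
```

In this picture the speech engine stays a black box: only its result set crosses into the RDF store, and the user interface sees a single SPARQL result.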
Workpackage Progress: Cross media retrieval - 1
Cross-media retrieval
  The CM retrieval engine and its functionalities were specified in Deliverable 3.1
  Video analysis modules necessary for CMR are specified in the internal EASAIER software modules document and consist of:
    Video Transcoding Module
    Shot Detection and Key-Frame Extraction Module
    Low-Level Feature Extraction Module
Workpackage Progress: Cross media retrieval – 2 – Video Analysis module
[Figure: block diagram of the Video Analysis module. Input: video file. Processing blocks: compression; audio stream extraction; video segmentation and keyframe extraction; keyframe analysis; manual annotation; audio stream analysis. Data labels: keyframes; PCM; original video file; streaming video file (e.g. MPEG-4); KF temporal data; video segments temporal data; metadata, features and temporal data; video segments metadata; KF extracted features. Destinations: multimedia assets repository; metadata repository.]
Workpackage Progress: Cross media retrieval - 3
For transcoding purposes we chose and tested the ffmpeg software.
A first version of the Shot Detection and Key-Frame Extraction module was developed at QMUL and is ready for integration.
The MPEG-7 eXperimentation Model (XM) will be integrated for the purpose of Low-Level Feature Extraction.
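The slides do not describe how the QMUL module detects shots. A common baseline, shown here only as an illustration and not as the project's method, compares grey-level histograms of consecutive frames and starts a new shot when they differ by more than a threshold. The frame representation (flat lists of 0-255 pixel values), the bin count, the threshold and the middle-frame rule for key frames are all assumptions.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalised grey-level histogram of a frame given as 0-255 pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def detect_shots(frames: Sequence[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Return the indices of frames that start a new shot."""
    boundaries = [0] if frames else []
    prev = None
    for i, frame in enumerate(frames):
        hist = histogram(frame)
        if prev is not None:
            # L1 distance between consecutive histograms, between 0 and 2.
            diff = sum(abs(a - b) for a, b in zip(hist, prev))
            if diff > threshold:
                boundaries.append(i)
        prev = hist
    return boundaries


def key_frames(boundaries: List[int], n_frames: int) -> List[int]:
    """Pick the middle frame of each shot as its key frame."""
    ends = boundaries[1:] + [n_frames]
    return [(start + end - 1) // 2 for start, end in zip(boundaries, ends)]
```

Real footage would first be decoded into frames, for example with the ffmpeg software mentioned above, before a step like this could run.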
Workpackage Progress: Vocal query
Technology based on the first two layers of speech recognition (see above)
  Phoneme level recognition
  Pronunciation dictionary or morphological analyzer
Process
  Phoneme level recognition with acoustic score
  Matching to the dictionary
  Solution is the item with the best acoustic score which is also found in the dictionary
Technology will be demonstrated
  Speed has to be tuned
  Performance under evaluation
  Performance difference between English and Hungarian
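The selection rule in the process above, taking the candidate with the best acoustic score that is also found in the dictionary, can be sketched in a few lines. The data layout is an assumption: hypotheses are (phoneme sequence, acoustic score) pairs, and the dictionary maps each pronunciation variant to its word, so a word with several pronunciations has several entries.

```python
from typing import Dict, List, Optional, Tuple

Phonemes = Tuple[str, ...]


def best_dictionary_match(
    hypotheses: List[Tuple[Phonemes, float]],  # (phoneme sequence, acoustic score)
    dictionary: Dict[Phonemes, str],           # pronunciation variant -> word
) -> Optional[Tuple[str, float]]:
    """Return (word, score) for the best-scoring hypothesis found in the dictionary."""
    best = None
    for phonemes, score in hypotheses:
        word = dictionary.get(phonemes)
        if word is not None and (best is None or score > best[1]):
            best = (word, score)
    return best
```

Hypotheses with a high acoustic score but no dictionary entry are simply skipped, and the function returns `None` when nothing matches.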
Contributions and Connections with Other Workpackages
Upcoming Work Plan Months 12-24 – Music related
Up to month 20
  Music retrieval - prototype available, remaining time will be integration, testing, and continuing development of music retrieval based on other similarity measures.
Month 20
  Deliver D3.2
Between month 20 and 26
  Music retrieval - only further work is refinement based on user studies (WP7)
Month 26
  Deliver D3.3
Upcoming Work Plan Months 12-24 – Speech related
Before D3.2 – Month 20
  Phoneme level recognition
    Speed tuning
    Language specific performance improvement
    Triphone implementation in English
    Speaker clustering
  Dictionary level
    More pronunciation variants in English
  Language model
    Bigger text corpuses in English and in Hungarian
  Indexing and retrieval
    Reimplementation for mixed phoneme and word search
After Month 20
  Testing and refinement
  Performance tuning
Upcoming Work Plan Months 12-24 – Cross media related
All cross-media retrieval effort will be focused on coding and testing the software.
Material produced for working documents (System Architecture, Software Modules, Metadata) specifies the software to be developed and integrated, and how this will be done.
The required routines have for the most part been developed at a 'proof-of-concept' level (ontological relationships between different media, feature extraction for images and video, key frame display for retrieved video).
Month 26
  M3.4 - Cross-media retrieval fully functional, further work is only refinement and optimization.
  D3.3 Prototype on cross-media retrieval system
Between month 20 and 26, QMUL will work closely with SILOGIC to integrate cross-media retrieval into the EASAIER system, which we intend to use as the prototype for this.
Demonstration: Overview
Retrieval engines and tools are demonstrated separately
Speech related Demo
  Segmentation user interface
  Vocal query (isolated word search) in Hungarian
  Vocal query (isolated word search) in English
Music related Demo
  SoundBite – timbre-based music similarity embedded in iTunes
Demonstration: Segmentation
User interface for manual segmentation
  Synchronizing silence to silence speech segments to the text
  Checking automated silence to silence segmentation
  Synchronizing word boundaries to the text
  Synchronizing phoneme boundaries to the text
Demonstration: Segmentation – screen shot
Demonstration: Vocal query
Searching for a spoken word in a dictionary
  Recording input
  Phoneme level recognition
  Matching probable phoneme sequences to content of the pronunciation dictionary
  Display of the most probable solution from the dictionary
Demonstration: Vocal query – screen shot
Demonstration: Music retrieval – screen shot