1
Broadcast News Segmentation using Metadata and Speech-To-Text
Information to Improve Speech Recognition
Sebastien Coquoz,
Swiss Federal Institute of Technology (EPFL)
International Computer Science Institute (ICSI)
March 16, 2004
2
Outline
• General idea
• ASR system used
• Exploratory work
• Strategies
• Results
• Conclusion
3
General idea
Use Metadata (SUs) and Speech-To-Text (STT) information to improve later STT passes (feedback loop)
4
Why segmentation?
Why segment the audio stream?
• Important to give « linguistically coherent » pieces to the language model
• Remove « non-speech » (e.g. long silences, laughter, music, other noises, …)
Why use MDE?
• MDE gives information about sentence and speaker breaks
• Speaker labels improve the efficiency of the acoustic model and sentences improve the efficiency of the language model
• BBN’s error analysis of Broadcast News recognition revealed a higher error rate at segment boundaries; this may be caused by missing the true sentence boundaries
5
Metadata and STT information
MDE object used:
Sentence-like units (SUs): each SU expresses a thought or idea and generally corresponds to a sentence. Each SU has a confidence measure, timing information (starting point and duration) and a cluster label.
STT object used:
Lexemes: describe the words that were assumed to
be uttered. Each word has timing information (beginning
and duration).
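The two object types above can be pictured as simple records. The following is a minimal sketch only: the class names, field names and the midpoint rule in `words_in_su` are assumptions made for illustration, not the actual MDE or STT file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Lexeme:
    """One recognized word with its timing (field names are hypothetical)."""
    word: str
    start: float      # beginning, in seconds
    duration: float   # in seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


@dataclass
class SU:
    """One sentence-like unit (field names are hypothetical)."""
    start: float       # starting point, in seconds
    duration: float    # in seconds
    confidence: float  # confidence measure of the SU
    cluster: str       # cluster label

    @property
    def end(self) -> float:
        return self.start + self.duration


def words_in_su(su: SU, lexemes: List[Lexeme]) -> List[Lexeme]:
    """Assign a word to the SU if the word's midpoint lies inside the SU (an assumed rule)."""
    return [w for w in lexemes if su.start <= w.start + w.duration / 2 < su.end]


su = SU(start=0.0, duration=2.0, confidence=0.9, cluster="spk1")
lexemes = [Lexeme("hello", 0.1, 0.4), Lexeme("world", 0.6, 0.5), Lexeme("next", 2.1, 0.3)]
print([w.word for w in words_in_su(su, lexemes)])
```

Keeping the timing on both objects is what lets later steps combine them, for example to measure pauses between words inside one SU.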
6
ASR system used
The system used is a simplified SRI BN evaluation system.
Recognition steps:
1. Segment the waveforms
2. Cluster the segments into « pseudo-speakers »
3. Compute and normalize features (Mel cepstrum)
4. Do first pass recognition with non-crossword acoustic models and bigram language model
5. Generate lattices
6. Expand lattices using 5-gram language model
7. Adapt acoustic models for each « pseudo-speaker »
8. Generate new lattices using the adapted acoustic models
9. Expand new lattices using 5-gram language model
10. Score the resulting hypotheses
7
Types of segmentation
Baseline
• Classifies frames into « speech » and « non-speech » using a 2-state HMM
• Uses inter-word silences and speaker turns to segment the BN shows
MDE-based
• Uses sentence and speaker breaks to define an initial segmentation
• Further processes the segments using different strategies presented later
Baseline vs. MDE-based segmentation
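To illustrate the first baseline bullet, here is a minimal sketch of a 2-state HMM that labels frames as « speech » or « non-speech » with Viterbi decoding. Everything specific in it is an assumption for illustration: the single energy-like feature, the Gaussian emission parameters and the transition probability are hypothetical and are not taken from the SRI system.

```python
import math
from typing import List, Sequence


def viterbi_speech_nonspeech(
    energies: Sequence[float],
    means: Sequence[float] = (0.0, 1.0),  # assumed mean energy for non-speech, speech
    std: float = 0.3,                     # assumed shared standard deviation
    stay_prob: float = 0.9,               # assumed probability of staying in a state
) -> List[int]:
    """Return one label per frame: 0 = non-speech, 1 = speech."""
    if not energies:
        return []

    def log_emit(x: float, state: int) -> float:
        # Gaussian log-likelihood up to a constant shared by both states.
        z = (x - means[state]) / std
        return -0.5 * z * z

    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    # Initialization with a uniform prior over the two states.
    score = [math.log(0.5) + log_emit(energies[0], s) for s in (0, 1)]
    backptr = []

    # Recursion: keep the best predecessor for each state at each frame.
    for x in energies[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[p] + (log_stay if p == s else log_switch) for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + log_emit(x, s))
        score = new_score
        backptr.append(ptr)

    # Backtrace from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi_speech_nonspeech([0.0, 0.1, 0.9, 1.0, 1.1, 0.0, 0.05]))
```

The transition penalty is what distinguishes this from a per-frame threshold: an isolated frame with moderately high energy is smoothed away rather than producing a very short « speech » segment.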
8
Baseline experiments
Comments:
• The baseline segmentation is the one presented above
• The results (shown later) obtained are:
• the current best results
• the baselines that ultimately have to be improved
• No additional processing step is applied to modify the segments
9
« Cheating » experiments (1)
Why?
• See if there is room for improvement when using MDE-based segmentation
How?
• Use transcripts written by humans to segment the Broadcast News audio stream and apply processing strategies to improve recognition (i.e. use true information)
10
« Cheating » experiments (2)
Results: Baseline vs. « Cheating » experiments
Segmentation                WER (%), wtd avg on 6 shows
Baseline seg                14.0
Cheating seg (using SU)     14.2
Cheating seg (SU+proc)      13.0
There is room for improvement!
11
Overview of the processing steps
Broadcast News Shows
0. Segmentation using SUs
1. First strategy: splitting of long segments
2. Second strategy: concatenation of short segments
3. Third strategy: addition of time pads
Final segmentation
12
First strategy: splitting of long segments
Why?
• Segments that are too long may cover more than one sentence, which is confusing for the language model
How?
• Use automatically generated transcripts and MDE
• Segments that are already too short mustn’t be processed, since short segments are bad for the efficiency of the language model
• Take two features into account for the decision tree:
• The duration of segments
• The pause between words
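A decision of this kind can be sketched as follows. This is only an illustration under stated assumptions: the function name, the three threshold values and the rule "split at the longest qualifying pause" are hypothetical and are not the decision tree used in the experiments.

```python
from typing import List, Tuple

Word = Tuple[float, float]  # (start, end) of one recognized word, in seconds


def split_long_segment(
    words: List[Word],
    max_duration: float = 15.0,  # hypothetical: segments longer than this are split
    min_pause: float = 0.3,      # hypothetical: only split at pauses at least this long
    min_piece: float = 2.0,      # hypothetical: never create a piece shorter than this
) -> List[List[Word]]:
    """Recursively split a segment at its longest inter-word pause."""
    if len(words) < 2:
        return [words]
    seg_start, seg_end = words[0][0], words[-1][1]
    # Feature 1: duration of the segment.
    if seg_end - seg_start <= max_duration:
        return [words]

    best_idx, best_pause = None, min_pause
    for i in range(1, len(words)):
        # Feature 2: pause between consecutive words.
        pause = words[i][0] - words[i - 1][1]
        left = words[i - 1][1] - seg_start
        right = seg_end - words[i][0]
        if pause >= best_pause and left >= min_piece and right >= min_piece:
            best_idx, best_pause = i, pause

    if best_idx is None:
        return [words]  # no acceptable split point; keep the segment as it is
    return (
        split_long_segment(words[:best_idx], max_duration, min_pause, min_piece)
        + split_long_segment(words[best_idx:], max_duration, min_pause, min_piece)
    )


long_segment = [(0.0, 4.0), (4.0, 8.0), (8.5, 12.0), (12.0, 16.5)]
print(split_long_segment(long_segment))
```

The `min_piece` guard reflects the remark that segments which are too short should not be produced or processed.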
13
Second strategy: concatenation of short segments
Why?
• Short segments are not optimal for the language model
• Short segments increase the WER because all their words are close to the boundaries (cf. BBN’s error analysis)
How?
• Take three features into account for the decision tree:
• Pause between segments
• Sum of the duration of two neighbors
• Cluster label
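The three features can be combined in a simple merge rule, sketched below. The `Segment` type, the threshold values and the left-to-right greedy merging are assumptions for illustration, not the decision tree actually used.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float   # seconds
    end: float     # seconds
    cluster: str   # cluster (pseudo-speaker) label


def merge_short_segments(
    segments: List[Segment],
    max_pause: float = 0.5,   # hypothetical threshold on the pause between segments
    max_total: float = 10.0,  # hypothetical threshold on the summed duration
) -> List[Segment]:
    """Greedily concatenate neighboring segments, scanning left to right."""
    merged: List[Segment] = []
    for seg in segments:
        if merged:
            prev = merged[-1]
            pause = seg.start - prev.end                             # feature 1
            total = (prev.end - prev.start) + (seg.end - seg.start)  # feature 2
            same_cluster = prev.cluster == seg.cluster               # feature 3
            if pause <= max_pause and total <= max_total and same_cluster:
                merged[-1] = Segment(prev.start, seg.end, prev.cluster)
                continue
        merged.append(seg)
    return merged


segments = [
    Segment(0.0, 1.0, "A"),
    Segment(1.2, 2.0, "A"),  # short pause, same cluster: merged with the previous one
    Segment(2.1, 3.0, "B"),  # different cluster: kept separate
    Segment(5.0, 6.0, "B"),  # long pause: kept separate
]
print(merge_short_segments(segments))
```

Requiring the same cluster label keeps each merged segment within one « pseudo-speaker », which matters for the later acoustic model adaptation.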
14
Third strategy: Addition of time pads
Why?
• Prevent words from only being partially included
• The windowing in the front end has a scope of up to 8 frames (4 on each side), so it is better to have enough padding
How?
• Take one feature into account for the decision tree:
• The pause between segments
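One way to use the pause between segments when padding is sketched below. The rule (pad by at most half of the neighboring pause, capped at `max_pad`) and the default value are assumptions for illustration; the thresholds used in the experiments are not given here.

```python
from typing import List, Tuple


def add_time_pads(
    segments: List[Tuple[float, float]],
    max_pad: float = 0.1,  # hypothetical maximum pad, in seconds
) -> List[Tuple[float, float]]:
    """Extend each (start, end) segment without overlapping its neighbors."""
    padded = []
    for i, (start, end) in enumerate(segments):
        if i > 0:
            # Share the pause with the previous segment: take at most half of it.
            gap_before = max(start - segments[i - 1][1], 0.0)
            pad_before = min(max_pad, gap_before / 2)
        else:
            pad_before = min(max_pad, start)  # do not go before time 0
        if i + 1 < len(segments):
            gap_after = max(segments[i + 1][0] - end, 0.0)
            pad_after = min(max_pad, gap_after / 2)
        else:
            pad_after = max_pad  # the end of the audio is unknown in this sketch
        padded.append((start - pad_before, end + pad_after))
    return padded


print(add_time_pads([(0.25, 1.0), (1.5, 2.0), (4.0, 5.0)], max_pad=0.5))
```

Splitting the pause in half means two adjacent segments can touch after padding but never overlap, so no audio is decoded twice.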
15
Examples of improvements (1)
1) Real sentence: … and strictly limits state authority over how and when water is used …
Recognized sentence:
With baseline segmentation (cuts in middle of sentence):
… and stricter limits data arty over how and when watery hues …
With MDE-based segmentation:
… and strict_ limits state authority over how and when water issues …
[Figure: both hypotheses aligned on a time axis; legend: segmentation points marked, errors shown in red]
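The difference between the two hypotheses can be quantified by counting word errors against the real sentence. The sketch below uses a plain word-level edit distance (substitutions, insertions and deletions each cost 1); it is not the scoring tool used in the experiments, and writing « strict_ » as the word "strict" is an assumption about what the underscore marks. Because the fragments are excerpts, the counts are illustrative only.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, insertions and deletions."""
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


reference = "and strictly limits state authority over how and when water is used"
baseline = "and stricter limits data arty over how and when watery hues"
mde_based = "and strict limits state authority over how and when water issues"

n_ref = len(reference.split())
for name, hyp in [("baseline", baseline), ("MDE-based", mde_based)]:
    errors = word_errors(reference, hyp)
    print(f"{name}: {errors} errors on {n_ref} reference words")
```

On this fragment the baseline hypothesis has twice as many word errors as the MDE-based one.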
16
Examples of improvements (2)
2) Real sentence: … I didn’t know if we would pull off the games. I didn’t know if this community would ever rally around the Olympics again. …
Recognized sentence:
With baseline segmentation (doesn’t cut at end of sentence):
… pull off the games that had not this community would ever rally around …
With MDE-based segmentation:
… pull off the game_ I didn’t know _ this community would ever rally around …
[Figure: both hypotheses aligned on a time axis; legend: segmentation points marked, errors shown in red]
17
Results for the development set
Segmentation                  WER (%), wtd avg on 6 shows
Baseline seg                  14.0
Step 0: SU seg                14.4
SU seg + step 1               14.2
SU seg + steps 1 & 2          14.0
SU seg + steps 1 & 2 & 3      13.3
The improvement is 0.7% absolute and 5% relative!
18
Results for the evaluation set
Segmentation                  WER (%), wtd avg on 6 shows
Baseline seg                  18.7
Step 0: SU seg                19.8
SU seg + step 1               19.7
SU seg + steps 1 & 2          19.6
SU seg + steps 1 & 2 & 3      18.4
The improvement is 0.3% absolute and 1.6% relative!
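The absolute and relative improvements quoted for the two sets follow directly from the baseline and final WER values. A small sketch of the arithmetic (the function name is chosen here for illustration):

```python
from typing import Tuple


def improvement(baseline_wer: float, new_wer: float) -> Tuple[float, float]:
    """Return (absolute, relative) WER improvement, both in percent, rounded to 0.1."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return round(absolute, 1), round(relative, 1)


dev_result = improvement(14.0, 13.3)   # development set: baseline vs. steps 1 & 2 & 3
eval_result = improvement(18.7, 18.4)  # evaluation set: baseline vs. steps 1 & 2 & 3
print("dev:", dev_result)
print("eval:", eval_result)
```

The relative figure divides by the baseline WER, which is why the same absolute gain counts for less on the evaluation set, where the baseline is higher.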
19
Dev results vs. Eval results
Observations:
• No « cheating » information is available for the eval set, so it is not clear how well the SU detection is working
• Improvements from step 0 (SU segmentation) to final segmentation are similar for dev set and eval set: 1.1% absolute (7.6% relative) for dev set and 1.3% absolute (6.6% relative) for eval set; note that the SU information was not optimized for eval
• Respective improvements are quite uneven for each show, which suggests that the strategies are show dependent, not channel dependent
20
Future work
• Further optimize the thresholds for the three strategies
• Find a representation to choose a specific value of the thresholds for each show individually (i.e. fully adapt the decision trees to each show)
• Use Metadata objects such as the confidence measure of each SU and diarization to further improve the strategies
21
Conclusion
• Development of a new segmentation method based on Metadata and Speech-To-Text information
• Use features given by MDE and STT information in decision trees for each processing step
• Results indicate the promise of this approach
• There still seems to be room for improvement through further development
22
Acknowledgments
I would like to thank:
• Prof. Bourlard & Prof. Morgan
• Barbara & Andreas
• Yang
• IM2 for supporting my experience