Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News
Gina-Anne Levow
University of Chicago
SIGHAN
July 25, 2004
Roadmap
• The Problem: Mandarin Story Segmentation
• The Tools: Prosodic and Text Cues
  – Mandarin Chinese
• Individual Results
• Integrating Cues
• Conclusion & Future Work
The Problem: Mandarin Speech Topic Segmentation
• Separate audio stream into component topics
Why Segment?
• Enables language understanding tasks
  – Information Retrieval
    • Only regions of interest
  – Summarization
    • Cover all main topics
  – Reference Resolution
    • Pronouns tend to refer within segments
The Challenge
• How do we define/measure topicality?
  – Are two regions on the same topic?
  – Fundamentally requires full understanding
    • How can we approach with partial understanding?
• How do we identify boundaries sharply?
  – Association of sentences may be ambiguous
    • Especially “filler”
The Tools: Prosodic and Text Cues
• Represent local changes at boundaries with audio
  – Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
• Represent topicality with text
  – Component words in audio stream
    • Possibly noisy
    • Many possible models (Hearst 94, Beeferman 99, ...)
• Combining Prosody and Text
  – Human annotators are more accurate and confident if they use BOTH transcribed text and original audio! (Swerts 97)
  – English broadcast news (Tur et al., 2001)
Data and Processing
• Broadcast News
  – Topic Detection and Tracking TDT3 corpus
  – Voice of America broadcast news
    • ASR transcription
    • Manually segmented: known boundaries
  – ~4,000 stories, ~750K words
• Acoustic analysis (Praat)
  – Automatic pitch, intensity tracking
    • Smoothed, speaker-normalized, per-word
Acoustic-Prosodic Cues
• Languages differ in use of intonation
  – E.g. English: declarative fall, question rise
  – Chinese: pitch contour determines word meaning
• At segment boundaries?
  – Surprisingly similar, though not identical
  – Significantly lower pitch at end of segment
  – Significantly lower amplitude at end of segment
  – Significantly longer duration at end of segment
Acoustic-Prosodic Contrasts
[Figure: Mandarin normalized pitch and Mandarin normalized intensity, non-final vs. final words; vertical axis from -0.25 to 0]
Learning Boundaries
• Decision tree classifier (Quinlan C4.5)
  – Classification problem
    • For each word, classify as final/non-final
• Features
  – Acoustic-Prosodic:
    • Duration, Pitch, Loudness, Silence
      – Word average, Between-word difference
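The per-word acoustic-prosodic features listed above might be assembled roughly as follows. This is a minimal sketch, not the feature extraction used in the talk: the `Word` record, its field names, and the `prosodic_features` helper are hypothetical, and the pitch and intensity values are assumed to be already smoothed and speaker-normalized per word.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    """Hypothetical per-word record; all field names are assumptions."""
    start: float      # word start time in seconds
    end: float        # word end time in seconds
    pitch: float      # speaker-normalized mean pitch over the word
    intensity: float  # speaker-normalized mean intensity over the word


def prosodic_features(words: List[Word], i: int) -> Dict[str, float]:
    """Word-average and between-word difference features for word i."""
    word = words[i]
    feats = {
        # Word-level values.
        "duration": word.end - word.start,
        "pitch": word.pitch,
        "intensity": word.intensity,
        # Between-word values default to 0.0 for the last word.
        "silence": 0.0,
        "pitch_diff": 0.0,
        "intensity_diff": 0.0,
    }
    if i + 1 < len(words):
        nxt = words[i + 1]
        # Silence: gap between the end of this word and the start of the next.
        feats["silence"] = max(0.0, nxt.start - word.end)
        # Differences: following word minus current word.
        feats["pitch_diff"] = nxt.pitch - word.pitch
        feats["intensity_diff"] = nxt.intensity - word.intensity
    return feats
```

Each resulting dictionary would serve as one training or test instance for the final/non-final classifier.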
Text Boundary Features
• Text
  – Information retrieval style
    • Cosine similarity between weighted term vectors
      » tf*idf in 50-word windows
  – Cue phrases
    • N-gram features
      » Identified by BoosTexter (Schapire & Singer, 2000)
    • E.g. “Voice of America”, “Audience”, “Reporting”
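The text similarity feature can be illustrated with a small sketch: compare the tf*idf vector of the 50 words ending at a candidate boundary with that of the 50 words following it. The function names, the raw-count tf weighting, and the default idf of 1.0 for unseen terms are assumptions made for illustration; the slides do not give the exact weighting.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Raw term count times idf; unseen terms get idf 1.0 (an assumption)."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 1.0) for term, count in counts.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def window_similarity(tokens: List[str], i: int, idf: Dict[str, float],
                      window: int = 50) -> float:
    """Similarity of the window ending at word i and the window after it."""
    left = tokens[max(0, i - window + 1): i + 1]
    right = tokens[i + 1: i + 1 + window]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

A low similarity at word `i` suggests a vocabulary shift after that word, which is the kind of evidence of a topic change this feature is meant to capture.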
Classification Results
• Balanced training and test sets
  – Results on held-out subsets
• Acoustic cues only
  – 95.6% accuracy
• Text cues (+ silence)
  – 95.6% accuracy
• Combined text and prosody
  – 96.4% accuracy
• Typically, false alarms are twice as common as misses
Joint Decision Tree
[Figure: decision tree diagram]
Feature Assessment
• Role of silence
  • Useful in both text and acoustic classifiers
  • More necessary for text
• Text captures topicality, not locality
  • Cannot identify boundaries sharply
• Prosodic cues:
  • Localize boundaries
  • Multiple supporting cues: intensity, pitch: contrastive use
Issue: False Alarms
• Evaluate representative sample
  – Boundary <<< Non-boundary
  – 95.6% accuracy
    • 2% miss, 4.4% false alarms
• Non-boundary frequent
• False alarms frequent
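The miss and false alarm figures can be computed from per-word labels as sketched below. The definitions are assumptions, since the slides do not spell them out: miss rate is taken over true boundaries and false alarm rate over true non-boundaries. Under these definitions, accuracy on a sample dominated by non-boundaries stays close to one minus the false alarm rate, which appears consistent with the figures quoted above.

```python
from typing import Dict, List


def error_rates(gold: List[bool], predicted: List[bool]) -> Dict[str, float]:
    """Miss rate, false alarm rate, and accuracy for boundary labels.

    True marks a segment-final word (boundary), False a non-final word.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    n_boundary = sum(1 for g in gold if g)
    n_non_boundary = len(gold) - n_boundary
    misses = sum(1 for g, p in pairs if g and not p)
    false_alarms = sum(1 for g, p in pairs if p and not g)
    correct = sum(1 for g, p in pairs if g == p)
    return {
        # Fraction of true boundaries that were not detected.
        "miss": misses / n_boundary if n_boundary else 0.0,
        # Fraction of true non-boundaries labeled as boundaries.
        "false_alarm": false_alarms / n_non_boundary if n_non_boundary else 0.0,
        "accuracy": correct / len(gold) if gold else 0.0,
    }
```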
Voting Against False Alarms
• Error analysis:
  – Construct per-feature classifiers:
    • Prosody-only, text-only, silence-only
  – Compare classifiers: per-feature, joint
    • Joint boundary supported by only 0 or 1 per-feature classifiers: FALSE ALARM
• Approach: Voting
  – Require joint + 2 per-feature classifiers
• Result: 1/3 reduction in false alarms
  – ~97% accuracy: 2.8% miss, 3.15% false alarm
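The voting rule can be sketched as a simple filter on the joint classifier's output: a boundary is accepted only if the joint classifier proposes it and at least two of the per-feature classifiers agree. The function name, the dictionary keys, and the `min_support` parameter are hypothetical names chosen for illustration.

```python
from typing import Dict


def vote_boundary(joint: bool, per_feature: Dict[str, bool],
                  min_support: int = 2) -> bool:
    """Accept a boundary only with joint + min_support per-feature votes."""
    if not joint:
        # The vote only filters boundaries proposed by the joint classifier.
        return False
    support = sum(1 for vote in per_feature.values() if vote)
    return support >= min_support


# Example: joint classifier fires, but only the text classifier agrees.
votes = {"prosody": False, "text": True, "silence": False}
accepted = vote_boundary(True, votes)  # rejected as a likely false alarm
```

Because the vote can only remove proposed boundaries, it trades some additional misses for fewer false alarms, matching the direction of the miss and false alarm figures reported above.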
Conclusion
• Mandarin broadcast news segmentation
  – Identify topicality and boundary locality
• Integrate text and acoustic cues
  – Text similarity: vector space model, n-gram cues
  – Prosodic cues: Silence, intensity, pitch, duration
    » Robust across range of languages
• Provide supporting and orthogonal information
• Majority agreement of per-feature classifiers:
  – 1/3 fewer false alarms
Current & Future Work
• Improving the model of topicality
  – Richer text similarity models; broader acoustic models
• Alternative classifiers
  – Preliminary experiments:
    • Boosting, Boosted Decision trees, MaxEnt
      – Comparable
  – Alternative integration strategies
• Hierarchical subtopic segmentation
  – Broadcast news
  – Dialogue: human-computer, human-human
• Integration with multi-modal features: e.g. gesture, gaze
Acoustic-Prosodic Contrasts
[Figure: normalized pitch and normalized intensity for Mandarin and English, non-final vs. final words; vertical axis from -0.25 to 0]
Text Decision Tree
Prosodic Decision Tree
The Problem: Speech Topic Segmentation
• Separate audio stream into component topics
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit. ||