Statistical automatic identification of microchiroptera from echolocation calls
Lessons learned from human automatic speech recognition
Mark D. Skowronski
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida
Gainesville, FL, USA
December 1, 2004

Overview
• Motivations for bat acoustic research
• Review bat call classification methods
• Contrast with 1970s human ASR
  – Machine learning vs. expert knowledge
• Experiments
• Conclusions and future work

Bat research motivations
• Bats are among:
  – the most diverse (25% of all mammal species),
  – the most endangered,
  – and the least studied mammals.
• Close relationship with insects
  – agricultural impact
  – disease vectors
• Acoustical research
  – non-invasive (compared to netting)
  – significant domain (echolocation)

More motivations
• Calls simple compared to human speech
• Same goals as human ASR
  – Detection
  – Feature extraction
  – Classification
  – Noise-robust performance
• Easier to design/develop models
• Domain between toy problems and ASR

Bat echolocation
• Ultrasonic, brief chirps (~active sonar)
• Determine range, velocity of nearby objects (clutter, prey, other bats)
• Tailored for task, environment
Tadarida brasiliensis (Mexican free-tailed bat)
Listen to 10x time-expanded search calls: [embedded audio not included]
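
The range and velocity bullets rest on two standard active-sonar relations: range is half the round-trip echo delay times the speed of sound, and a closing target shifts the echo frequency by roughly 2·v·f/c when v is much smaller than c. A minimal sketch follows; the speed of sound, the 10 ms delay, the 5 m/s closing speed and the 25 kHz call frequency are illustrative assumptions, not values from the talk.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed conditions)


def echo_range(delay_s: float, c: float = SPEED_OF_SOUND) -> float:
    """Target range in metres from the round-trip echo delay in seconds."""
    return c * delay_s / 2.0


def doppler_shift(closing_speed: float, freq_hz: float,
                  c: float = SPEED_OF_SOUND) -> float:
    """Approximate two-way Doppler shift in Hz, valid for closing_speed << c."""
    return 2.0 * closing_speed * freq_hz / c


if __name__ == "__main__":
    # Hypothetical numbers chosen only to show the orders of magnitude.
    print(f"range for a 10 ms echo delay: {echo_range(0.010):.3f} m")
    print(f"Doppler shift at 5 m/s, 25 kHz: {doppler_shift(5.0, 25_000.0):.0f} Hz")
```

A frequency-modulated sweep gives a sharp timing reference for the delay, while a constant-frequency segment makes the small Doppler shift easier to measure, which is the split the next slide describes.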

Echolocation calls
• Two characteristics
  – Frequency modulated (range information)
  – Constant frequency (velocity information)
• Features (holistic)
  – Freq. extrema
  – Duration
  – Shape
  – # harmonics
  – Call interval
Mexican free-tailed calls, concatenated
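
To make the holistic features concrete, the sketch below synthesizes a downward frequency-modulated sweep and reads off duration and frequency extrema from zero-crossing intervals. The sweep parameters (50 kHz to 25 kHz over 5 ms), the 1 MHz sampling rate and all function names are assumptions made for illustration; real calls are noisy and would need detection and smoothing first.

```python
import math


def synth_fm_call(f_start, f_end, duration, fs):
    """Linear FM sweep from f_start to f_end (Hz) lasting `duration` seconds."""
    n = round(duration * fs)
    rate = (f_end - f_start) / duration  # sweep rate in Hz per second
    return [
        math.sin(2 * math.pi * (f_start * (i / fs) + 0.5 * rate * (i / fs) ** 2))
        for i in range(n)
    ]


def zero_crossing_times(x, fs):
    """Times (s) of sign changes, refined by linear interpolation."""
    times = []
    for i in range(1, len(x)):
        if (x[i - 1] < 0) != (x[i] < 0):
            frac = x[i - 1] / (x[i - 1] - x[i])
            times.append((i - 1 + frac) / fs)
    return times


def holistic_features(x, fs):
    """Duration and frequency extrema from successive zero crossings.

    Each interval between crossings is half a period, so f = 0.5 / interval.
    """
    t = zero_crossing_times(x, fs)
    freqs = [0.5 / (b - a) for a, b in zip(t, t[1:])]
    return {
        "duration_s": t[-1] - t[0],
        "f_min_hz": min(freqs),
        "f_max_hz": max(freqs),
    }


if __name__ == "__main__":
    fs = 1_000_000.0  # hypothetical sampling rate for the synthetic call
    call = synth_fm_call(50_000.0, 25_000.0, 0.005, fs)
    print(holistic_features(call, fs))
```

Shape, harmonic count and call interval are not covered here; they need a time-frequency representation or a sequence of detected calls.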

Current classification methods
• Expert sonogram readers
  – Manual or automatic feature extraction
    • Griffin 1958, Fenton and Bell 1981
  – Comparison with exemplar sonograms
  – Decision trees
• Automatic classification
  – Discriminant function analysis
    • By far the most popular method in literature
    • Available in statistical software packages (SAS, SPSS)
  – Others
    • Artificial neural networks, Parsons 2001
    • Spectrogram correlation, Pettersson Elektronik AB
Parallels the 1970s acoustic-phonetic approach to human ASR.

Acoustic phonetics
• Bottom up paradigm
  – Frames, boundaries, groups, phonemes, words
  – Mimics techniques of expert spectrogram readers
• Manual or automatic feature extraction
  – Formants, voicing, duration, intensity, transitions
• Classification
  – Decision tree, discriminant functions, neural network, Gaussian mixture model, Viterbi path
[Figure: phoneme labels DH AH F UH T B AO L G EY EM IH Z OW V ER]

Acoustic phonetics limitations
• Variability of conversational speech
  – Complex rules, difficult to train
• Boundaries difficult to define
  – Coarticulation, reduction
• Feature estimates brittle
  – Variable noise robustness
• Hard decisions, errors accumulate
Human ASR shifted to the machine learning paradigm by the 1980s: it is better able to account for the variability of speech and noise.

Machine learning ASR
• Data-driven models
  – Non-parametric: dynamic time warp (DTW)
  – Parametric: hidden Markov model (HMM)
• Frame-based
  – Identical features from every frame
  – Expert information in feature extraction
  – Models account for feature, temporal variabilities
Machine learning dominates state-of-the-art ASR.
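
As an illustration of the non-parametric, frame-based idea, here is a minimal dynamic time warping sketch: every call is a sequence of identical per-frame feature vectors, and a query is assigned the label of the closest stored template. The Euclidean local cost, the unconstrained step pattern and the toy frequency tracks (in kHz) are assumptions; the talk does not specify its DTW configuration.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature frames.

    Each frame is a tuple of floats; the local cost is Euclidean distance.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Allow a match, an insertion, or a deletion at each step.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(query, templates):
    """Return the label of the nearest template; templates are (label, sequence)."""
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]


if __name__ == "__main__":
    # Hypothetical one-dimensional frequency tracks in kHz.
    templates = [
        ("down", [(50.0,), (40.0,), (30.0,)]),
        ("flat", [(30.0,), (30.0,), (30.0,)]),
    ]
    query = [(50.0,), (50.0,), (41.0,), (39.0,), (30.0,)]
    print(classify(query, templates))
```

Because the query is compared against every stored template, there is no training step beyond storing examples, but classification cost grows with the size of the template set.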

Data collection
• UF Bat House, home to 60,000 bats
  – Mexican free-tailed bat (vast majority)
  – Evening bat
  – Southeastern myotis
• Continuous recording
  – 90 minutes around sunset
  – ~20,000 calls
• Equipment:
  – B&K mic (4939), 100 kHz
  – B&K preamp (2670)
  – Custom amp/AA filter
  – NI 6036E 200 kS/s A/D card
  – Laptop, Matlab
  – Portable

Experiment design
• Hand labels as ground truth
  – Narrowband spectrogram
  – 436 calls (2% of data) in 3 hours (80x real time)
  – Four classes, a priori: 34, 40, 20, 6%
  – All experiments on hand-labeled data only
  – No hand-labeled calls excluded from experiments
[Figure: panels labeled 1, 2, 3, 4]

Methods
• Baseline, from the literature
  – Features
    • Duration
    • Zero crossing: Fmin, Fmax, Fmax_energy
    • MUSIC super resolution frequency estimator
  – Classifier
    • Discriminant function analysis, quadratic boundaries
• DTW and HMM
  – Features
    • Frequency (MUSIC), log energy, Δs (HMM only)
  – HMM
    • 5 states/model
    • 4 Gaussian mixtures/state, diagonal covariances
• Tests
  – Leave one out
  – Repeated trials: 25% test data, 1000 trials
  – Test on train data (HMM only)
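
The two held-out test protocols can be sketched independently of the classifier. In the code below, the nearest-mean classifier and the one-dimensional toy data are stand-ins chosen for brevity (assumptions, not the discriminant function analysis, DTW or HMM models of the talk); only the evaluation loops are the point.

```python
import random
import statistics


def nearest_mean_trainer(train):
    """Toy stand-in classifier: assign the label whose training mean is closest."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    means = {y: statistics.fmean(xs) for y, xs in by_label.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def leave_one_out(data, trainer):
    """Train on all samples but one, test on the held-out sample, repeat for each."""
    hits = 0
    for i, (x, y) in enumerate(data):
        predict = trainer(data[:i] + data[i + 1:])
        hits += predict(x) == y
    return hits / len(data)


def repeated_holdout(data, trainer, test_fraction=0.25, trials=1000, seed=0):
    """Random train/test splits; returns (mean, std) of accuracy over the trials."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(data)))
    scores = []
    for _ in range(trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = trainer(train)
        scores.append(sum(predict(x) == y for x, y in test) / n_test)
    return statistics.fmean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Hypothetical, well-separated feature values for two classes.
    data = [(v, "a") for v in (1.0, 1.1, 0.9, 1.2)]
    data += [(v, "b") for v in (5.0, 5.1, 4.9, 5.2)]
    print(leave_one_out(data, nearest_mean_trainer))
    print(repeated_holdout(data, nearest_mean_trainer, trials=100))
```

Testing on the training data, the third protocol listed, skips the split entirely and therefore tends to give an optimistic accuracy.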

Results
• Baseline, zero crossing
  – Leave one out: 72.5% correct
  – Repeated trials: 72.5 ± 4% (mean ± std)
• Baseline, MUSIC
  – Leave one out: 79.1%
  – Repeated trials: 77.5 ± 4%
• DTW
  – Leave one out: 74.5%
  – Repeated trials: 74.1 ± 4%
• HMM
  – Test on train: 85.3%
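
A rough sanity check on the ± 4% spread: if each repeated trial tests on about 109 calls (25% of the 436 hand-labeled calls) and each decision is treated as an independent Bernoulli outcome, the binomial standard deviation of the accuracy comes out close to 4%. The independence assumption is a simplification, since trials share data, so this is only an order-of-magnitude check.

```python
import math


def binomial_std(accuracy, n_test):
    """Standard deviation of an accuracy estimated from n_test independent trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


N_TEST = round(0.25 * 436)  # calls per test split, from the figures on the slides

if __name__ == "__main__":
    for name, acc in [("zero crossing", 0.725), ("MUSIC", 0.775), ("DTW", 0.741)]:
        print(f"{name}: {100 * binomial_std(acc, N_TEST):.1f}%")
```

Under this model, differences of a few percentage points between methods in a single split are within sampling noise, which is one reason to average over many trials.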

Confusion matrices
Rows: hand-labeled class. Columns: classifier output. Last column: percent of the row classified correctly.

Baseline, zero crossing
        1     2     3     4
  1   107    38     1     2    72.3%
  2    21   134    16     4    76.6%
  3     2    29    57     0    64.8%
  4     4     3     0    18    72.0%
  Overall: 72.5%

Baseline, MUSIC
        1     2     3     4
  1   110    36     1     1    74.3%
  2    12   149    12     2    85.1%
  3     4    18    66     0    75.0%
  4     3     2     0    20    80.0%
  Overall: 79.1%

DTW
        1     2     3     4
  1   115    29     0     4    77.7%
  2    32   131    11     1    74.9%
  3     5    20    63     0    71.6%
  4     5     4     0    16    64.0%
  Overall: 74.5%

HMM
        1     2     3     4
  1   118    25     0     5    79.7%
  2    10   154     5     6    88.0%
  3     1    12    75     0    85.2%
  4     0     0     0    25    100%
  Overall: 85.3%
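
The percentages attached to each matrix follow directly from the counts: each per-class figure is the diagonal entry divided by its row total, and the overall figure is the diagonal sum divided by all 436 calls. The sketch below reproduces this for the HMM matrix, assuming rows are hand labels and columns are classifier outputs (the row totals match the stated class priors); the function names are illustrative.

```python
def per_class_accuracy(matrix):
    """Fraction of each row (true class) that lies on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Diagonal sum divided by the total number of classified calls."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return correct / total


# HMM counts as given on the slide (rows: hand label, columns: output).
HMM_MATRIX = [
    [118, 25, 0, 5],
    [10, 154, 5, 6],
    [1, 12, 75, 0],
    [0, 0, 0, 25],
]

if __name__ == "__main__":
    print([f"{100 * a:.1f}%" for a in per_class_accuracy(HMM_MATRIX)])
    print(f"{100 * overall_accuracy(HMM_MATRIX):.1f}%")
```

The same two functions applied to the other three matrices give their per-class and overall figures.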

Comments
• Experiments
  – Weakness: accuracy of class labels
  – No labeled calls excluded, realistic
  – HMM most accurate, but undertrained
  – MUSIC frequency estimate robust, but 1000x slower than zero-crossing analysis, ZCA (20x real time)
• Machine learning
  – Expert information still necessary
    • Feature extraction (dimensionality reduction)
    • Model parameters
  – DTW: fast training, slow classification
  – HMM: slow training, fast classification (real time)

Future work
• Ultimate goal
  – Real-time portable system for species ID
  – Commercial product possibilities
• Feature extraction
  – Robust to
    • Broadband noise
    • Echoes
    • Unknown distance between bat and microphone
  – Chirp model, echo model
  – Faster frequency estimates
  – Match assumptions of classifiers

More future work
• Detection
  – Replace energy-based method with principled statistical methods using frame-based features
• Classification
  – Accurate class labels for training
    • Netting
    • Record from known bat roosts (preferred)
  – Pseudo-sinusoidal input
    • Oscillator network
    • Echo state network
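
For context on the detection item, a generic energy-based detector can be sketched as a threshold on frame log energy. The talk does not describe its detector, so everything here is an assumption: the frame length, hop, 10 dB margin, and the use of the median frame energy as the noise floor (which presumes calls occupy a minority of frames).

```python
import math


def frame_log_energy(x, frame_len, hop):
    """Log energy in dB of each frame; a small floor avoids log(0)."""
    out = []
    for start in range(0, len(x) - frame_len + 1, hop):
        e = sum(s * s for s in x[start:start + frame_len]) / frame_len
        out.append(10.0 * math.log10(e + 1e-12))
    return out


def detect_calls(x, frame_len=64, hop=32, margin_db=10.0):
    """Return (first_frame, last_frame) runs whose energy exceeds floor + margin.

    The noise floor is estimated as the median frame energy.
    """
    energy = frame_log_energy(x, frame_len, hop)
    floor = sorted(energy)[len(energy) // 2]
    events, start = [], None
    for i, e in enumerate(energy):
        active = e > floor + margin_db
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(energy) - 1))
    return events


if __name__ == "__main__":
    # Hypothetical signal: low-level alternating noise with one loud burst.
    x = [0.001 * (-1) ** i for i in range(2000)]
    for i in range(800, 1000):
        x[i] = math.sin(2 * math.pi * 0.1 * i)
    print(detect_calls(x))
```

A fixed energy margin is sensitive to changing noise levels and to quiet calls, which is the kind of weakness a statistical detector built on frame-based features would aim to address.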