Post on 31-Mar-2015

University of Joensuu, Dept. of Computer Science, P.O. Box 111, FIN-80101 Joensuu, Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi

Automatic Speaker Recognition for Series 60 Mobile Devices

University of Joensuu, Department of Computer Science

Specom’2004, Sep 20, 2004

Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, and Pasi Fränti

Background

• Project in the national FENIX programme
– New Methods and Applications in Speech Technology
• 7 research institutes
• Project partners: NRC, Lingsoft, National Bureau of Investigation, etc.
• Joensuu: Speaker Recognition
• http://cs.joensuu.fi/pages/pums

Research Group

PUMS project

• Pasi Fränti, Professor
• Juhani Saastamoinen, Project manager
• Evgeny Karpov, Project researcher
• Ville Hautamäki, Project researcher
• Tomi Kinnunen, Researcher
• Ismo Kärkkäinen, Clustering algorithms

Application Scenarios

[Figure: speaker recognition divides into two tasks]

• Speaker Verification: "Is this Bob’s voice?" The input is a voice sample plus an identity claim; a rejected claim means an imposter.
• Speaker Identification: "Whose voice is this?"

Project Goal

Port speaker recognition to a Series 60 mobile phone

Symbian Phones

• Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– ARM processor
– No floating-point unit!!!

[Figure: Symbian phone platforms: Series 80, Series 60, UIQ]

Symbian OS

• Defined by Symbian consortium
• Based on EPOC
• Operating system for mobile phones
– Real-time system
– Long uptime required
• Multitasking, multithreading

Problems of Porting

• Usual considerations when porting to a phone
– GUI event-driven programming
– Platform-specific programming model
– Real-time system, exceptions
• Application-specific porting problems
– Number crunching without a floating-point unit!!!
– Signal processing is numerically challenging

Identification System

[Figure: block diagram of the identification system]

• Speech audio → signal processing (feature extraction) → feature vectors
• Speaker modelling: create a speaker profile; speaker profiles are added to the speaker profile database during training
• Speaker recognition: classify input speech based on existing profiles; all profiles are read and used during recognition
• Output: decision
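
The training and recognition paths of this diagram can be sketched with vector quantization, which a later slide names (VQ/GLA) as the modelling method. This is a minimal illustration under assumptions, not the project's code: the function names, the toy 2-D feature vectors, the codebook size, the iteration count and the deterministic initial codebook (the first vectors of the training set) are all hypothetical. A profile is a codebook trained with the generalized Lloyd algorithm, and the decision is the profile with the smallest average quantization distortion.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_codebook(vectors, size, iterations=10):
    """Generalized Lloyd algorithm (GLA).

    Assumption: the initial codebook is simply the first `size` vectors,
    which keeps the example deterministic.
    """
    codebook = [list(v) for v in vectors[:size]]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)),
                          key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, cell in enumerate(cells):
            if cell:  # keep the old code vector if its cell is empty
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest code vector."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(vectors, profiles):
    """Return the name of the profile that matches the input vectors best."""
    return min(profiles, key=lambda name: distortion(vectors, profiles[name]))


# Training: add speaker profiles to the "database" (a dict here).
profiles = {
    "alice": train_codebook([[0, 0], [1, 0], [0, 1], [1, 1]], size=2),
    "bob": train_codebook([[10, 10], [11, 10], [10, 11], [11, 11]], size=2),
}

# Recognition: read and use all profiles, then decide.
decision = identify([[0.5, 0.4], [0.9, 0.2]], profiles)
```

In a real system the vectors would be the 12-dimensional MFCC feature vectors rather than toy 2-D points.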

MFCC Signal Processing

[Figure: MFCC processing chain] Digital speech signal frame → Pre-emphasis → Time windowing → DFT → Abs → Filter bank → Log → DCT → Feature vector

• Pre-emphasis coefficient 0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, output 12 MFCCs
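
As a floating-point reference for the chain above, the following sketch processes one frame with the stated parameters (pre-emphasis 0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, 12 output coefficients). Everything else is an assumption made for illustration: the 8 kHz sample rate, the mel-scale formula, dropping the zeroth DCT coefficient, the small floor added before the logarithm, and the function names. A naive DFT stands in for the FFT to keep the example short.

```python
import cmath
import math


def hz_to_mel(f):
    # A common mel-scale formula (assumed; the slides do not give one).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(frame, sample_rate=8000, n_filters=30, n_coeffs=12, pre=0.97):
    n = len(frame)
    # Pre-emphasis: y[i] = x[i] - 0.97 * x[i-1]
    emph = [frame[0]] + [frame[i] - pre * frame[i - 1] for i in range(1, n)]
    # Time windowing with a Hamming window
    win = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
           for i, s in enumerate(emph)]
    # DFT followed by Abs (naive O(n^2) DFT, only the non-negative frequencies)
    half = n // 2 + 1
    mag = [abs(sum(win[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
           for k in range(half)]
    # Filter bank: triangular filters with edges equally spaced on the mel scale,
    # expressed in units of DFT bins
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) * n / sample_rate
             for i in range(n_filters + 2)]
    log_energy = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        e = 0.0
        for k in range(half):
            if lo < k <= mid:
                e += mag[k] * (k - lo) / (mid - lo)
            elif mid < k < hi:
                e += mag[k] * (hi - k) / (hi - mid)
        # Log: base 2, with a small floor so an empty filter does not give log(0)
        log_energy.append(math.log2(e + 1e-10))
    # DCT-II of the log filter-bank outputs; keep coefficients 1..n_coeffs
    return [sum(le * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, le in enumerate(log_energy))
            for c in range(1, n_coeffs + 1)]


# Example: one 256-sample frame of a 440 Hz tone.
frame = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(256)]
features = mfcc_frame(frame)
```

Each stage of this function corresponds to one box of the chain, which is also the order in which the fixed-point port has to replace floating-point operations.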

Fixed-Point Implementation

• Numerical analysis needed for fixed-point arithmetic implementation
• Truncation and re-scaling to avoid overflows in the converted algorithm
• Minimize information loss caused by computation in fixed-point arithmetic
– Minimize relative error
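
Why relative error is the quantity to watch can be seen from a single fixed-point multiplication. The sketch below assumes a plain Q15 format (15 fractional bits), which is an illustrative choice and not taken from the slides; the function names are hypothetical. The product of two scaled integers carries twice the fractional bits, so it is shifted back (re-scaled), and the bits shifted out are lost.

```python
def to_fixed(x, frac_bits):
    """Convert a real number to a scaled integer with `frac_bits` fractional bits."""
    return int(round(x * (1 << frac_bits)))


def fixed_mul(a, b, frac_bits):
    """Multiply two scaled integers and re-scale the result.

    The full product has 2 * frac_bits fractional bits; the right shift
    truncates it back to frac_bits.
    """
    return (a * b) >> frac_bits


def relative_error(x, y, frac_bits):
    """Relative error of the fixed-point product x * y against the exact product."""
    exact = x * y
    approx = fixed_mul(to_fixed(x, frac_bits), to_fixed(y, frac_bits), frac_bits)
    return abs(approx / (1 << frac_bits) - exact) / abs(exact)


err_large = relative_error(0.5, 0.5, 15)    # operands use most of the available bits
err_small = relative_error(0.001, 0.5, 15)  # small operand keeps only a few bits
```

With the same absolute truncation step, the small operand ends up with a much larger relative error than the large one, which is why the scaling of intermediate values matters. Python integers do not overflow, so a port to 32-bit integers would additionally need to check that every product fits.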

FFT, Fixed-Point

• Frequency spectrum of speech
– Biggest source of numerical error
– Butterflies have multiplications
– Layers repeat truncation errors
• Fixed number of bits per element
– 32, native integer size in many systems
• Reference implementation: FFTGEN
– http://www.jjj.de/fft/fftgen.tgz

FFTGEN (16/16)

• Multiplication: the result of a 32 x 32-bit multiplication must fit in 32 bits, so the inputs are truncated

• FFTGEN: truncate inputs to 16/16 bits

[Figure: 16/16 multiplication. FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) gives a 32-bit multiplication result with 16 used bits and 16 crop-off bits. The FFT layer output (part of it) is a 16-bit integer; crop-off for next layer: 16 bits!]

Info Preserving FFT (22/10)

• Approximate DFT operator F with G
• Accept a larger operator error ||F-G|| in order to preserve more signal information
– Minimize the maximum relative error in scaled sine values with respect to the scale; 980 is good for FFT sizes up to 1024
– Truncate multiplication inputs to 22/10 bits (signal/op)
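
One possible reading of the scale criterion is sketched below: for a given integer scale, round each scaled sine value to an integer and record the worst relative error over the twiddle angles of the FFT. The exact error definition and the set of angles used by the authors are not given in the slides, so both are assumptions here, as is the function name. The sketch only evaluates a candidate scale and makes no claim that a search with it reproduces the value 980.

```python
import math


def max_relative_error(scale, fft_size=1024):
    """Worst relative rounding error of round(scale * sin) / scale.

    Assumption: the angles are 2*pi*k/fft_size for k = 1 .. fft_size/4,
    which covers the distinct non-zero sine magnitudes of the twiddle factors.
    """
    worst = 0.0
    for k in range(1, fft_size // 4 + 1):
        s = math.sin(2.0 * math.pi * k / fft_size)
        approx = round(scale * s) / scale
        worst = max(worst, abs(approx - s) / s)
    return worst


err_980 = max_relative_error(980)
err_1024 = max_relative_error(1024)
```

Under these assumptions the scale 980 has a smaller worst-case relative error than the power of two 1024, mainly because the smallest sine values land closer to integers after scaling. Candidate scales could be compared with, for example, `min(range(900, 1025), key=max_relative_error)`.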

[Figure: 22/10 multiplication. FFT layer input (32-bit integer, 22 bits used) x FFT twiddle factor (16-bit integer, 10 bits used) gives a 32-bit multiplication result with 22 used bits and 10 crop-off bits. The FFT layer output (part of it) is a 32-bit integer; crop-off for next layer: 10 bits.]
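
The effect of moving bits from the operator to the signal can be illustrated with a single butterfly multiplication. The sketch below is a simplification under stated assumptions: it uses power-of-two scaling (the slides use the scale 980 for the sine values), it places the sign bit in the signal word, and the function names and the sample values are hypothetical. The signal value is truncated to `signal_bits`, the twiddle factor is rounded to `op_bits`, and `op_bits` bits are cropped off the 32-bit product.

```python
def truncated_product(x, w, signal_bits, op_bits):
    """Multiply signal x by twiddle factor w (reals with |x| < 1, |w| < 1)
    using signal_bits for the signal and op_bits for the operator."""
    assert signal_bits + op_bits == 32
    xs = int(x * (1 << (signal_bits - 1)))   # signal, truncated toward zero
    ws = int(round(w * (1 << op_bits)))      # twiddle factor, rounded
    prod = xs * ws
    assert -(1 << 31) <= prod < (1 << 31)    # product must fit in 32 bits
    out = prod >> op_bits                    # crop off the operator bits
    return out / (1 << (signal_bits - 1))    # back to a real number for comparison


def relative_error(x, w, signal_bits, op_bits):
    exact = x * w
    return abs(truncated_product(x, w, signal_bits, op_bits) - exact) / abs(exact)


# A small-amplitude signal value multiplied by a twiddle factor near sqrt(2)/2.
x, w = 0.001234, 0.7071
err_16_16 = relative_error(x, w, 16, 16)
err_22_10 = relative_error(x, w, 22, 10)
```

For this small signal value the 22/10 split gives a clearly smaller relative error than 16/16: the coarser twiddle factor costs less than the extra signal bits gain. In a full FFT this loss or gain repeats in every layer.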

FFT Spectrum, Fixed-Point

[Figure: FFT element absolute values for the original TIMIT signal and for the TIMIT signal x 4; panels: 16/16 abs values and 22/10 abs values]

• x-axis: fixed-point FFT element abs. values

• y-axis: correct FFT element abs. values

Scale of Error in Proposed FFT

[Figure: relative error in FFT elements; panels: 16/16 and 22/10]

Log10 of relative error in FFT elements

                      16/16     22/10
average              -0.775    -2.118
standard deviation    0.797     0.590

Magnitude Spectrum, Fixed-Point

• Compute complex absolute values using the maximum coordinate and the coordinate ratio
• Suppose |x| > |y| for z = x + iy, then

$$|z| = \sqrt{x^2 + y^2} = |x| \sqrt{1 + (y/x)^2}$$

• Denote the (squared) ratio y/x by t
• Approximate the square root by a polynomial P(t)
• Constant-time algorithm (vs. Newton)
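
A minimal integer sketch of this idea follows. It assumes that t is the squared ratio of the smaller to the larger coordinate, so t lies in [0, 1], and that P(t) is a quadratic interpolating sqrt(1 + t) at t = 0, 0.5 and 1. The polynomial actually used by the authors is not given in the slides, so the degree, the interpolation points, the 14 fractional bits and the function names are all illustrative assumptions.

```python
import math

FRAC = 14            # assumed number of fractional bits
ONE = 1 << FRAC


def fit_quadratic():
    """Coefficients of P(t) = c0 + c1*t + c2*t^2 through sqrt(1 + t)
    at t = 0, 0.5 and 1, returned as scaled integers."""
    f0, f1, f2 = 1.0, math.sqrt(1.5), math.sqrt(2.0)
    c0 = f0
    c2 = 2.0 * (f2 - 2.0 * f1 + f0)
    c1 = f2 - f0 - c2
    return [int(round(c * ONE)) for c in (c0, c1, c2)]


C0, C1, C2 = fit_quadratic()   # computed once, offline; only integers are used below


def magnitude(x, y):
    """Approximate |x + iy| for integers x, y with integer arithmetic only."""
    a, b = abs(x), abs(y)
    big, small = (a, b) if a >= b else (b, a)
    if big == 0:
        return 0
    # t = (small / big)^2 as a scaled integer in [0, ONE]
    t = ((small * small) << FRAC) // (big * big)
    # P(t), re-scaling after every multiplication
    p = C0 + ((C1 * t) >> FRAC) + ((((C2 * t) >> FRAC) * t) >> FRAC)
    return (big * p) >> FRAC


approx = magnitude(3000, 4000)   # exact value is 5000
```

The number of operations is the same for every input, which is the constant-time property the slide contrasts with Newton iteration. Python integers do not overflow; a 32-bit port would have to limit the operand widths of the squares and products.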

Logarithm, Fixed-Point

• Use base 2 instead of base 10
– Changing the base only multiplies the output by a constant
• Standard technique:
– Return the problem to the interval [1,2)
– Use linear interpolation from values stored in a look-up table
– 8 bits used for indexing the look-up table values
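
The standard technique can be sketched as follows. The 8 index bits come from the slide; the 16 fractional bits of the result, the 31-bit mantissa normalisation and the names are assumptions made for the example. The table is built with floating-point `math.log2` once, offline; the lookup itself uses only integer operations.

```python
import math

FRAC = 16          # assumed fractional bits of the result
INDEX_BITS = 8     # from the slide: 8 bits index the look-up table

# TABLE[i] = log2(1 + i/256) as a scaled integer; one extra entry so that
# the last interval can be interpolated as well.
TABLE = [int(round(math.log2(1.0 + i / (1 << INDEX_BITS)) * (1 << FRAC)))
         for i in range((1 << INDEX_BITS) + 1)]


def log2_fixed(v):
    """Approximate log2(v) for an integer v >= 1, scaled by 2**FRAC."""
    if v < 1:
        raise ValueError("log2_fixed needs a positive integer")
    # Write v = m * 2**e with m in [1, 2)
    e = v.bit_length() - 1
    # Normalise so that the leading one sits at bit 31
    m = v << (31 - e) if e <= 31 else v >> (e - 31)
    frac = m & 0x7FFFFFFF          # 31 fractional bits of the mantissa
    idx = frac >> 23               # top 8 bits select the table interval
    rem = frac & 0x7FFFFF          # remaining 23 bits position within it
    lo, hi = TABLE[idx], TABLE[idx + 1]
    interp = lo + (((hi - lo) * rem) >> 23)   # linear interpolation
    return (e << FRAC) + interp


value = log2_fixed(1000) / (1 << FRAC)   # compare with math.log2(1000)
```

The integer part of the result is the exponent e, and only the fractional part comes from the table, so the table stays small regardless of the input range.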

Rest of System, Fixed-Point

• No improvement needed in VQ/GLA
• Should apply a similar technique as with the FFT to the other signal processing steps
– Pre-emphasis: utilize the full 32 bits
– Time windowing: use fewer bits in the windowing function
– FB: use fewer bits in the frequency responses
– DCT: use fewer bits for the cosines

Effect of Signal Processing

• TIMIT data sets, varying number of speakers (N)
• For each N repeat (6x, 5x, 2x) train/recognize cycles (eliminate GLA initial solution randomness)
• FFTGEN: FFT with 16/16 multiplication
• Fixed-point: use proposed 22/10 FFT
• Mixed: floating-point DSP, fixed-point GLA/VQ

                  N=10 (6x)   N=20 (5x)   N=100 (2x)
FFTGEN              93.3%       68.0%       59.5%
Fixed-point         98.3%       95.0%       82.5%
Mixed              100.0%      100.0%      100.0%
Floating-point     100.0%      100.0%      100.0%

Effect of Signal Quality

• GSM/PC data: 16 aligned dual recordings

• All computations in floating-point arith.

• Signal recorded with laptop and PC mic gives average recognition rate 100%

• Signal recorded with Nokia 3660 results in average recognition rate 84.9%

                 13/16   14/16   15/16   16/16
Symbian audio      1       3       3      10
PC audio           0       0       0      17

Conclusion

• Speaker identification was ported to a Symbian Series 60 mobile phone

• 22/10 bit usage in multiplication proposed instead of “standard” 16/16

• Experiments indicate that recognition accuracy improves from 68% (FFTGEN, 16/16) to 95% (proposed 22/10) for N=20 TIMIT speakers