Transcript
  • IRISA/D5 Thematic Seminar

    Source separation

    Emmanuel Vincent

    Inria Nancy – Grand Est


  • Source separation

    Source separation is the problem of recovering the source signals underlying a given mixture.


  • Overview

    Hundreds of source separation systems have been designed in the last 20 years...

    ...but few are yet applicable to real-world audio, as illustrated by the recent Signal Separation Evaluation Campaigns (SiSEC).

    The wide variety of techniques boils down to four modeling paradigms:

    computational auditory scene analysis (CASA),

    beamforming and post-filtering,

    probabilistic linear modeling, including independent component analysis (ICA) and sparse component analysis (SCA),

    probabilistic variance modeling, including hidden Markov models (HMM) and nonnegative matrix factorization (NMF).


  • 1 Beamforming and post-filtering

    2 Probabilistic linear modeling

    3 Probabilistic variance modeling

    4 Summary


  • Beamforming and post-filtering

    Paradigm 1: separation of a target source in ambient noise

    Early studies on array processing focused on the extraction of a target point source in ambient noise.

    In each time-frequency bin (n, f), the model already used for source localization is assumed again:

    $X_{nf} = S_{nf}\, D_f + B_{nf}$

    X_nf: mixture STFT coefficients, S_nf: target STFT coefficient, D_f: steering vector, B_nf: ambient noise

    where the steering vector D_f encodes the ITDs τ_i and the IIDs g_i between the I microphones (a numerical sketch follows this slide):

    $D_f \propto \big(1,\; g_2\, e^{-2i\pi f \tau_2},\; \dots,\; g_I\, e^{-2i\pi f \tau_I}\big)^T$

  • Beamforming and post-filtering

    Beamforming and post-filtering

    The optimal linear estimator in the minimum mean square error (MMSE) sense is the multichannel Wiener filter

    $\hat{S}_{nf} = V_{S_{nf}}\, D_f^H\, (\Sigma^X_{nf})^{-1} X_{nf}$

    where V_{S_nf} is the variance of S_nf and Σ^X_nf the covariance of X_nf.

    This estimator is in fact the combination of

    a multichannel spatial filter known as the minimum variance distortionless response (MVDR) beamformer

    $Y_{nf} = \dfrac{D_f^H\, (\Sigma^X_{nf})^{-1} X_{nf}}{D_f^H\, (\Sigma^X_{nf})^{-1} D_f}$

    a single-channel spectral filter known as the Wiener post-filter

    $\hat{S}_{nf} = \dfrac{V_{S_{nf}}}{V_{Y_{nf}}}\, Y_{nf}$

    where V_{Y_nf} is the variance of Y_nf and V_{S_nf}/V_{Y_nf} is the SNR. (This decomposition is sketched in code below.)
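
    A minimal per-bin sketch of this decomposition (assuming the steering vector, the mixture covariance and the target variance are already available; names are illustrative):

        import numpy as np

        def mvdr_wiener(X, D, Sigma_X, V_S):
            # X: (I,) mixture STFT coefficients, D: (I,) steering vector,
            # Sigma_X: (I, I) mixture covariance, V_S: target variance estimate.
            Sigma_inv = np.linalg.inv(Sigma_X)
            num = D.conj() @ Sigma_inv                 # D^H Sigma_X^{-1}
            denom = np.real(num @ D)                   # D^H Sigma_X^{-1} D
            Y = (num @ X) / denom                      # MVDR beamformer output
            V_Y = 1.0 / denom                          # variance of Y under the model
            return (V_S / V_Y) * Y                     # Wiener post-filter -> MMSE estimate

        # The result equals the one-step multichannel Wiener filter V_S * D^H Sigma_X^{-1} X.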

  • Beamforming and post-filtering

    Estimation algorithms

    The steering vector D_f is derived from the spatial position of the target, obtained via a source localization algorithm.

    The covariance of the mixture Σ^X_nf is computed empirically by local averaging of squared STFT coefficients in the time-frequency plane.

    The variance of the target is often estimated by spectral subtraction

    $V_{S_{nf}} = \max\big(0,\; V_{Y_{nf}} - V_{B_{nf}}\big)$

    where V_{B_nf} is the assumed noise variance in V_{Y_nf}.

    V_{B_nf} is estimated, for example, by the MCRA method for silence detection. (A sketch of these steps follows below.)
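
    A possible sketch of the two estimation steps above; here the local average is taken over outer products of the multichannel STFT coefficients, the neighborhood size is arbitrary, and the noise variance V_B is assumed to come from an external estimator such as MCRA:

        import numpy as np
        from scipy.ndimage import uniform_filter

        def empirical_covariance(X, size=(5, 5)):
            # X: (N, F, I) mixture STFT coefficients.
            # Local average of the outer products X X^H over a time-frequency neighborhood.
            outer = X[..., :, None] * X[..., None, :].conj()          # (N, F, I, I)
            Sigma = np.empty_like(outer)
            for a in range(outer.shape[-2]):
                for b in range(outer.shape[-1]):
                    Sigma[..., a, b] = (uniform_filter(outer[..., a, b].real, size)
                                        + 1j * uniform_filter(outer[..., a, b].imag, size))
            return Sigma

        def spectral_subtraction(V_Y, V_B):
            # Target variance estimate V_S = max(0, V_Y - V_B).
            return np.maximum(0.0, V_Y - V_B)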

  • Beamforming and post-filtering

    Summary of beamforming and post-filtering

    The algorithms stemming from paradigm 1 exhibit two main limitations:

    performance is very sensitive to localization accuracy,

    the MCRA algorithm assumes quasi-stationary noise and fails in a multi-source context where V_B_nf varies a lot from one time frame to the next.

    In order to overcome the latter limitation, multiple sources must be explicitly modeled.

  • Beamforming and post-filtering

    1 Beamforming and post-filtering

    2 Probabilistic linear modeling

    3 Probabilistic variance modeling

    4 Summary


  • Probabilistic linear modeling

    Paradigm 2: linear modeling

    The established linear modeling paradigm relies on two assumptions:
    1 point sources
    2 low reverberation

    Under assumption 1, the sources and the mixing process can be modeled as single-channel source signals and a linear filtering process.

    Under assumption 2, this filtering process is equivalent to complex-valued multiplication in the time-frequency domain via the short-time Fourier transform (STFT).

    In each time-frequency bin (n, f)

    $X_{nf} = \sum_{j=1}^{J} S_{jnf}\, A_{jf}$

    X_nf: vector of mixture STFT coefficients, J: number of sources, S_jnf: jth source STFT coefficient, A_jf: jth mixing vector

  • Probabilistic linear modeling

    Modeling of the mixing vectors

    The mixing vectors A_jf encode the ITD and IID of each source at each frequency.

    For anechoic mixtures, Ajf is equal to the steering vector Djf .

    For echoic mixtures, ITDs and IIDs follow a smeared distribution P(Ajf |θj)

    [Figure: empirical distributions of ITD (µs) and IID (dB), shown as probability densities, for anechoic conditions and reverberation times RT = 50 ms, 250 ms and 1.25 s.]

  • Probabilistic linear modeling

    Sparsity of the source STFT coefficients

    Let us suppose for the moment that the source STFT coefficients S_jnf are independent and identically distributed (i.i.d.).

    These coefficients are sparse: at each frequency, a few coefficients are large and most are close to zero.

    [Figure: spectrogram of a speech source S_1nf (time n in s, frequency f in kHz, level in dB) and the distribution of its magnitude STFT coefficients.]

  • Probabilistic linear modeling

    Sparse i.i.d. modeling of the sources

    This property can be modeled in several ways:

    binary masking: a single active source j^act_nf in each time-frequency bin with, e.g., uniform P(j^act_nf),

    generalized exponential distribution (sketched in code below)

    $P(|S_{jnf}|\,|\,p, \beta_f) = \dfrac{p}{\beta_f\, \Gamma(1/p)}\, e^{-\left(|S_{jnf}|/\beta_f\right)^p}$

    p: shape parameter, β_f: scale parameter

    [Figure: distribution of the magnitude STFT coefficients |S_1nf| (scaled to unit variance), log probability density scale: empirical distribution vs. Gaussian (p = 2), Laplacian (p = 1) and generalized exponential (p = 0.4) fits.]
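
    A small numpy/scipy sketch of this density, which can be used to reproduce the comparison of shape parameters in the figure above:

        import numpy as np
        from scipy.special import gamma

        def generalized_exponential_pdf(s, p, beta):
            # Density of |S_jnf| with shape p and scale beta:
            # p = 2 is Gaussian-like, p = 1 Laplacian-like, p < 1 sparser still.
            return p / (beta * gamma(1.0 / p)) * np.exp(-(np.abs(s) / beta) ** p)

        s = np.linspace(0.0, 4.0, 400)
        curves = {p: generalized_exponential_pdf(s, p, beta=1.0) for p in (2.0, 1.0, 0.4)}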

  • Probabilistic linear modeling

    Inference algorithms

    Given the above priors, source separation is typically achieved by joint MAP estimation of the source STFT coefficients S_jnf and other latent variables (A_jf, g_j, τ_j, p, β_j) via alternating nonlinear optimization.

    This objective is called sparse component analysis (SCA).

    For typical values of p, the MAP source STFT coefficients are nonzero for at most I sources. (A simplified per-bin sketch follows below.)

    When the number of sources is J = I, SCA is renamed nongaussianity-based frequency-domain independent component analysis (FDICA).
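
    A simplified per-bin sketch of that MAP step, assuming the mixing vectors are already known: each subset of I sources is fitted exactly and the subset with the smallest l_p cost is kept (the full SCA alternates this with updates of the mixing vectors and the other latent variables):

        import numpy as np
        from itertools import combinations

        def sca_bin(X, A, p=1.0):
            # X: (I,) mixture STFT coefficients, A: (I, J) known mixing vectors.
            # Returns a (J,) vector with at most I nonzero source coefficients.
            I, J = A.shape
            best_cost, best_S = np.inf, np.zeros(J, dtype=complex)
            for subset in combinations(range(J), I):
                A_sub = A[:, list(subset)]
                S_sub, *_ = np.linalg.lstsq(A_sub, X, rcond=None)   # exact fit if A_sub is invertible
                cost = np.sum(np.abs(S_sub) ** p)                   # sparse l_p cost
                if cost < best_cost:
                    S = np.zeros(J, dtype=complex)
                    S[list(subset)] = S_sub
                    best_cost, best_S = cost, S
            return best_S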

  • Probabilistic linear modeling

    Practical illustration of separation using i.i.d. linear priors

    [Figure: spectrograms of the left, center and right sources S_1nf, S_2nf, S_3nf and of the mixture X_nf; the true predominant source pairs and the estimated nonzero source pairs (1+2, 1+3, 2+3) in each time-frequency bin; and the three estimated sources.]

    Time-frequency bins dominated by the center source are often erroneously associated with the two other sources.

  • Probabilistic linear modeling

    SiSEC results on music mixtures

    [Figure: SDR (dB) obtained on panned and recorded (RT = 250 ms) music mixtures using i.i.d. linear priors, compared to the ideal CASA mask (upper bound).]

    Audio examples: panned mixture and sources estimated using i.i.d. linear priors; recorded reverberant mixture and sources estimated using i.i.d. linear priors.

  • Probabilistic linear modeling

    SiSEC results on speech mixtures

    [Figure: SDR (dB) obtained on speech mixtures with 80°, 40° and 20° angular distance between sources, in an anechoic room and in an office room, using i.i.d. priors.]

    Audio examples: anechoic recording with 80° spacing and estimated sources; office recording with 80° spacing and estimated sources.

  • Probabilistic linear modeling

    Summary of probabilistic linear modeling

    Advantages:

    explicitly models multiple sources

    Limitations:

    restricted to mixtures of non-reverberated point sources

    the sources must have different spatial cues (ITD, IID)

    at most two sources can be separated in each time-frequency bin, and they are often badly identified due to the ambiguities of spatial cues

  • Probabilistic linear modeling

    1 Beamforming and post-filtering

    2 Probabilistic linear modeling

    3 Probabilistic variance modeling

    4 Summary


  • Probabilistic variance modeling

    Idea 1: from sources to source spatial images

    Diffuse or semi-diffuse sources cannot be modeled as single-channel signals, nor even as finite-dimensional signals.

    Instead of considering the signal produced by each source, one may consider its contribution to the mixture, a.k.a. its spatial image.

    Background noise becomes a source like any other.

    Source separation becomes the problem of estimating the spatial images of all sources.

    In each time-frequency bin (n, f)

    $X_{nf} = \sum_{j=1}^{J} C_{jnf}$

    X_nf: vector of mixture STFT coefficients, J: number of sources, C_jnf: jth source spatial image

  • Probabilistic variance modeling

    Idea 2: translation and phase invariance

    In order to overcome the ambiguities of spatial cues, additional spectral cues are needed, as shown by CASA.

    Most audio sources are translation- and phase-invariant: a given sound may be produced at any time with any relative phase across frequency.

  • Probabilistic variance modeling

    Paradigm 3: variance modeling

    Variance modeling combines these two ideas by modeling the STFT coefficients of individual source spatial images by a circular multivariate distribution whose parameters vary over time and frequency.

    The non-sparsity of source STFT coefficients over small time-frequency regions suggests the use of a non-sparse distribution.

    [Figure: spectrogram of a speech source S_1nf and generalized Gaussian shape parameter p estimated as a function of the time-frequency neighborhood size (Hz × s).]

  • Probabilistic variance modeling

    Choice of the distribution

    For historical reasons, several distributions have been preferred in a mono context, which can equivalently be expressed as divergence functions over the source magnitude/power STFT coefficients:

    Poisson ↔ Kullback-Leibler divergence aka I-divergence

    tied-variance Gaussian ↔ Euclidean distance

    log-Gaussian ↔ weighted log-Euclidean distance

    These distributions do not easily generalize to multichannel data.


  • Probabilistic variance modeling

    The multichannel Gaussian model

    The zero-mean Gaussian distribution is a simple multichannel model.

    $P(C_{jnf}\,|\,\Sigma_{jnf}) = \dfrac{1}{\det(\pi \Sigma_{jnf})}\, e^{-C_{jnf}^H \Sigma_{jnf}^{-1} C_{jnf}}$

    Σ_jnf: jth source covariance matrix

    The covariance matrix Σ_jnf of each source can be factored as the product of a scalar nonnegative variance V_jnf and a spatial covariance matrix R_jf, respectively modeling spectral and spatial properties:

    $\Sigma_{jnf} = V_{jnf}\, R_{jf}$

    Under this model, the mixture STFT coefficients also follow a Gaussian distribution whose covariance is the sum of the source covariances (its likelihood is sketched in code below):

    $P(X_{nf}\,|\,V_{jnf}, R_{jf}) = \dfrac{1}{\det\!\big(\pi \sum_{j=1}^{J} V_{jnf} R_{jf}\big)}\, e^{-X_{nf}^H \left(\sum_{j=1}^{J} V_{jnf} R_{jf}\right)^{-1} X_{nf}}$
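
    A direct numpy transcription of this likelihood for one time-frequency bin (variable names are illustrative):

        import numpy as np

        def mixture_log_likelihood(X, V, R):
            # X: (I,) mixture STFT coefficients, V: (J,) source variances V_jnf,
            # R: (J, I, I) spatial covariance matrices R_jf.
            Sigma_X = np.einsum('j,jab->ab', V, R)              # sum_j V_jnf R_jf
            _, logdet = np.linalg.slogdet(np.pi * Sigma_X)      # log det(pi Sigma_X)
            quad = np.real(X.conj() @ np.linalg.solve(Sigma_X, X))
            return -logdet - quad                               # log P(X | V, R)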

  • Probabilistic variance modeling

    General inference algorithm

    Independently of the priors over V_jnf and R_jf, source separation is typically achieved in two steps:

    joint MAP estimation of all model parameters using the expectation-maximization (EM) algorithm,

    MAP estimation of the source STFT coefficients conditionally on the model parameters by multichannel Wiener filtering (sketched in code below)

    $\hat{C}_{jnf} = V_{jnf} R_{jf} \left(\sum_{j'=1}^{J} V_{j'nf} R_{j'f}\right)^{-1} X_{nf}.$
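
    A per-bin sketch of the second step, assuming the EM step has already produced the variances and spatial covariances:

        import numpy as np

        def wiener_spatial_images(X, V, R):
            # X: (I,) mixture STFT coefficients, V: (J,) source variances,
            # R: (J, I, I) spatial covariances. Returns (J, I) spatial image estimates.
            Sigma_j = V[:, None, None] * R                      # per-source covariances V_jnf R_jf
            Sigma_X = Sigma_j.sum(axis=0)                       # mixture covariance
            gains = Sigma_j @ np.linalg.inv(Sigma_X)            # (J, I, I) Wiener gains
            return np.einsum('jab,b->ja', gains, X)             # estimates sum back to X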

  • Probabilistic variance modeling

    Rank-1 spatial covariance

    The spatial covariances R_jf encode the apparent spatial direction and spatial spread of sound in terms of

    ITD,

    IID,

    normalized interchannel correlation a.k.a. interchannel coherence.

    For non-reverberated point sources, the interchannel coherence is equal to 1, i.e., R_jf has rank 1:

    $R_{jf} = A_{jf} A_{jf}^H$

    In this case, the prior distributions P(A_jf|θ_j) used with linear modeling can be reused.

  • Probabilistic variance modeling

    Full-rank spatial covariance

    For reverberated or diffuse sources, the interchannel coherence is smaller than 1, i.e., R_jf has full rank.

    The theory of statistical room acoustics suggests the direct+diffuse model

    $R_{jf} \propto \lambda_j A_{jf} A_{jf}^H + B_f$

    λ_j: direct-to-reverberant ratio, A_jf: direct mixing vector, B_f: diffuse noise covariance

    with

    $A_{jf} = \sqrt{\dfrac{2}{1+g_j^2}} \begin{pmatrix} 1 \\ g_j e^{-2i\pi f \tau_j} \end{pmatrix}$   (τ_j: ITD of direct sound, g_j: IID of direct sound)

    $B_f = \begin{pmatrix} 1 & \mathrm{sinc}(2\pi f d/c) \\ \mathrm{sinc}(2\pi f d/c) & 1 \end{pmatrix}$   (d: microphone spacing, c: speed of sound)

    Modeling R_jf as an unconstrained full-rank matrix is also possible. (A numerical sketch of the direct-plus-diffuse model follows below.)
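
    A numerical sketch of this two-microphone direct-plus-diffuse model (all parameter values would come from localization and calibration; they are placeholders here):

        import numpy as np

        def full_rank_covariance(f, tau, g, lam, d, c=343.0):
            # f: frequency (Hz), tau: ITD (s), g: IID (linear), lam: direct-to-reverberant ratio,
            # d: microphone spacing (m), c: speed of sound (m/s).
            A = np.sqrt(2.0 / (1.0 + g**2)) * np.array([1.0, g * np.exp(-2j * np.pi * f * tau)])
            coh = np.sinc(2.0 * f * d / c)   # np.sinc(x) = sin(pi x)/(pi x), i.e. sinc(2 pi f d / c) here
            B = np.array([[1.0, coh], [coh, 1.0]])
            return lam * np.outer(A, A.conj()) + B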

  • Probabilistic variance modeling

    I.i.d. modeling of the source variances

    Baseline systems again model the source variances V_jnf as i.i.d. and locally constant within small time-frequency regions.

    It can then be shown that the MAP variances are nonzero for up to I² sources.

    Discrete priors constraining the number of nonzero variances to a smaller number have also been employed.

    When the number of sources is J = I, this model is also called nonstationarity-based FDICA.

  • Probabilistic variance modeling

    Benefit of exploiting interchannel coherence

    Interchannel coherence helps resolve some ambiguities of ITD and IID and identify the predominant sources more accurately.

    [Figure: schematic comparison of the linear model and the covariance model, with mixing vectors A_1, A_2, A_3, source coefficients S_j, source standard deviations V_j^{1/2} and mixture X.]

  • Probabilistic variance modeling

  • Practical illustration of separation using i.i.d. variance priors

    [Figure: spectrograms of the left source S_1nf (IID < 0), center source S_2nf (IID = 0), right source S_3nf (IID > 0) and mixture X_nf; the true predominant source pairs and the estimated nonzero source pairs (1+2, 1+3, 2+3); and the three estimated sources.]

  • Probabilistic variance modeling

    Spectral modeling using template spectra

    Variance modeling enables the design of phase-invariant spectral priors.

    The Gaussian mixture model (GMM) represents the variance V_jnf of each source at a given time by one of K template spectra w_jkf indexed by a discrete state q_jn (a state-decoding sketch follows this slide):

    $V_{jnf} = w_{j q_{jn} f} \quad\text{with}\quad P(q_{jn}=k) = \pi_{jk}$

    Different strategies have been proposed to learn these spectra:

    speaker-independent training on separate single-source data,

    speaker-dependent training on separate single-source data,

    MAP adaptation to the mixture using model selection or interpolation,

    MAP inference from a coarse initial separation.

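    As a simplified single-source sketch (the real inference is joint over all sources and the spatial model), the MAP state in each frame can be found by scoring every template spectrum under the zero-mean Gaussian variance model:

        import numpy as np

        def decode_states(P, W, log_prior=None):
            # P: (N, F) observed power spectrogram of the source, W: (K, F) template spectra.
            # Log-likelihood of state k in frame n: sum_f -log(pi W[k,f]) - P[n,f]/W[k,f].
            loglik = -(np.log(np.pi * W)[None, :, :] + P[:, None, :] / W[None, :, :]).sum(axis=2)
            if log_prior is not None:
                loglik += log_prior[None, :]        # log pi_jk
            return np.argmax(loglik, axis=1)        # MAP state q_jn per frame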

  • Probabilistic variance modeling

    Practical illustration of separation using template spectra

    [Figure: spectrograms of the piano source C_1nf, violin source C_2nf and mixture X_nf; the template spectra w_jkf (three per instrument); the estimated state sequences q_jn; the estimated piano, violin and mixture variances; and the estimated piano and violin sources.]

  • Probabilistic variance modeling

    Spectral modeling using basis spectra

    The GMM does not efficiently model polyphonic sound sources.

    The variance V_jnf of each source can be modeled instead as a linear combination of K basis spectra w_jkf multiplied by time activation coefficients h_jkn (an NMF sketch follows this slide):

    $V_{jnf} = \sum_{k=1}^{K} h_{jkn}\, w_{jkf}$

    This model is also called nonnegative matrix factorization (NMF).

    A range of strategies have been used to learn these spectra:

    instrument-dependent training on separate single-source data,

    MAP adaptation to the mixture using uniform priors,

    MAP adaptation to the mixture using trained priors.

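    A minimal standalone NMF sketch with multiplicative updates for the generalized KL (I-)divergence; within the multichannel Gaussian model the updates are usually embedded in EM and matched to a different divergence (e.g. Itakura-Saito):

        import numpy as np

        def nmf_kl(V, K, n_iter=200, eps=1e-9, seed=0):
            # V: (F, N) nonnegative spectrogram. Returns basis spectra W (F, K)
            # and time activations H (K, N) such that V is approximately W @ H.
            rng = np.random.default_rng(seed)
            F, N = V.shape
            W = rng.random((F, K)) + eps
            H = rng.random((K, N)) + eps
            ones = np.ones_like(V)
            for _ in range(n_iter):
                H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
                W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
            return W, H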

  • Probabilistic variance modeling

    Practical illustration of separation using basis spectra

    [Figure: spectrograms of the piano source C_1nf, violin source C_2nf and mixture X_nf; the basis spectra w_jkf (three per instrument); the estimated scale factors h_jkn; the estimated piano, violin and mixture variances; and the estimated piano and violin sources.]

  • Probabilistic variance modeling

    SiSEC results on music mixtures

    [Figure: SDR (dB) obtained on panned and recorded (RT = 250 ms) music mixtures using adapted basis spectra vs. i.i.d. linear priors.]

    Audio examples: panned mixture and recorded reverberant mixture, with sources estimated using adapted basis spectra and using i.i.d. linear priors.

  • Probabilistic variance modeling

    Constrained template/basis spectra

    MAP adaptation or inference of the template/basis spectra is often needed due to

    the lack of training data,

    the mismatch between training and test data.

    However, it is often inaccurate: additional constraints over the spectra are needed to further reduce overfitting.

  • Probabilistic variance modeling

    Harmonicity and spectral smoothness constraints

    For instance, harmonicity and spectral smoothness can be enforced by

    associating each basis spectrum with some a priori pitch p

    modeling w_jpf as the sum of fixed narrowband spectra b_plf representing adjacent partials at harmonic frequencies, scaled by spectral envelope coefficients e_jpl (see the sketch after this slide)

    $w_{jpf} = \sum_{l=1}^{L_p} e_{jpl}\, b_{plf}.$

    Parameter estimation now amounts to estimating the active pitches and their spectral envelopes instead of their full spectra.
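
    A sketch of such a harmonicity-constrained spectrum; the Gaussian bump shape and the example pitch and envelope values are assumptions made purely for illustration:

        import numpy as np

        def harmonic_spectrum(f_axis, pitch, envelope, width=50.0):
            # Fixed narrowband partial spectra b_plf (one smooth bump per harmonic l*pitch),
            # combined into w_jpf = sum_l e_jpl * b_plf.
            l = np.arange(1, len(envelope) + 1)[:, None]                     # partial indices (L, 1)
            B = np.exp(-0.5 * ((f_axis[None, :] - l * pitch) / width) ** 2)  # (L, F) narrowband spectra
            return np.asarray(envelope) @ B                                  # (F,) basis spectrum

        f_axis = np.linspace(0.0, 8000.0, 1024)
        w = harmonic_spectrum(f_axis, pitch=440.0, envelope=[1.0, 0.5, 0.25, 0.12, 0.06, 0.03])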

  • Probabilistic variance modeling

    Practical illustration of harmonicity constraints

    [Figure: six fixed narrowband partial spectra b_{p,l,f} on an ERB frequency scale, scaled by envelope coefficients e_{jp,l} = 0.756, 0.128, 0.041, 0.037, 0.011 and 0, and the resulting harmonic basis spectrum w_jpf.]

  • Probabilistic variance modeling

    A flexible spectral model

    We have built upon this idea and proposed a flexible framework enabling the joint exploitation of a wide range of cues by (sketched in code below):

    factorization of the variance assuming the excitation-filter model $V_{jnf} = V^{\mathrm{ex}}_{jnf}\, V^{\mathrm{ft}}_{jnf}$,

    further factorization of each part into basis spectra and time activation coefficients, e.g. $V^{\mathrm{ex}}_{jnf} = \sum_k h^{\mathrm{ex}}_{jkn}\, w^{\mathrm{ex}}_{jkf}$,

    further factorization of the basis spectra and time activation series into fine structure and envelope coefficients, e.g. $w^{\mathrm{ex}}_{jkf} = \sum_l e^{\mathrm{ex}}_{jlk}\, f^{\mathrm{ex}}_{jlf}$.
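
    A sketch of how these factors compose into the source variance (shapes and names are illustrative; the actual framework also supports constrained or fixed factors):

        import numpy as np

        def flexible_variance(H_ex, E_ex, F_ex, H_ft, E_ft, F_ft):
            # H_*: (K, N) time activations, E_*: (K, L) envelope coefficients,
            # F_*: (L, F) fine-structure spectra (e.g. harmonic combs or smooth bumps).
            W_ex = E_ex @ F_ex              # (K, F) excitation basis spectra
            W_ft = E_ft @ F_ft              # (K, F) filter basis spectra
            V_ex = H_ex.T @ W_ex            # (N, F) excitation part of the variance
            V_ft = H_ft.T @ W_ft            # (N, F) filter part of the variance
            return V_ex * V_ft              # V_jnf = V_ex * V_ft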

  • Probabilistic variance modeling

    Source-filter factorization


  • Probabilistic variance modeling

    Fine structure and envelope factorization


  • Probabilistic variance modeling

    SiSEC results on professional music mixtures

    [Figure: SDR (dB) per source (vocals, drums, bass, guitar, piano) obtained with the flexible framework on professional music mixtures.]

    Audio examples: Tamy (2 sources) and Bearlin (10 sources), with sources estimated using the flexible framework.

  • Probabilistic variance modeling

    Results on a speech mixture

    Audio examples: recorded mixture of 4 sources and sources estimated using a rank-1 mixing covariance, a full-rank mixing covariance, rank-1 with harmonicity, and full-rank with harmonicity.

  • Probabilistic variance modeling

    Separation of single-channel recordings

    The separation of single-channel recordings is more difficult than that of multichannel recordings since it relies on spectral cues only.

    A specific model must be learned a priori for each source.

    This makes it possible to separate the sources in each time frame (using pitch, for instance).

    For mixtures of 2 speakers,

    Schmidt & Olsson obtained an SDR of 8 dB with 5 min training signals,

    Smaragdis obtained an SDR of 5 dB with 30 s training signals.

    Grouping of the separated sources over time remains difficult and requires more sophisticated temporal evolution models, which are currently being studied.

  • Probabilistic variance modeling

    Exploitation of visual cues

    Two approaches exist to exploit visual cues:

    activity detection of each speaker and zeroing of inactive time intervals,

    lip feature extraction and joint modeling of audio and visual features by GMMs.

    The second approach performs better, but it cannot always be applied.

    Most of these algorithms were tested on mixtures with I ≥ J.

    In a single-channel scenario, Llagostera obtained performance comparable to Smaragdis but with much shorter training signals.

  • Probabilistic variance modeling

    Summary of probabilistic variance modeling

    Advantages:

    applicable to virtually any mixture, including diffuse sources

    no hard constraint on the number of sources per time-frequency bin

    the predominant sources are more accurately estimated by joint use of spatial, spectral and learned cues

    principled, flexible framework for the integration of additional cues

    Limitations:

    remaining musical noise artifacts

    remaining local optima of the estimation criterion

  • Probabilistic variance modeling

    1 Beamforming and post-filtering

    2 Probabilistic linear modeling

    3 Probabilistic variance modeling

    4 Summary


  • Summary

    Summary

    This state-of-the-art overview showed that

    variance modeling algorithms have greater potential due to the fusion of multiple cues,

    the separation quality is satisfactory for instantaneous noiseless mixtures: the handling of reverberation and noise remains a major challenge,

    single-channel separation remains difficult, especially when the sources have similar spectral cues,

    visual cues can improve performance but their use has been little studied.

    Existing systems are gradually finding their way into the industry, especially for remixing applications that can accommodate a certain amount of musical noise artifacts and partial user input/feedback.

  • Summary

    References

    E. Vincent, M.G. Jafari, S.A. Abdallah, M.D. Plumbley, and M.E. Davies, "Probabilistic modeling paradigms for audio source separation", in Machine Audition: Principles, Algorithms and Systems, IGI Global, pp. 162-185, 2010.

    E. Vincent, S. Araki, F.J. Theis, G. Nolte, P. Bofill, H. Sawada, A. Ozerov, B.V. Gowreesunker, D. Lutter, and N.Q.K. Duong, "The Signal Separation Evaluation Campaign (2007-2010): Achievements and remaining challenges", Signal Processing, 92, pp. 1928-1936, 2012.

    H.K. Maganti, D. Gatica-Perez, and I. McCowan, "Speech enhancement and recognition in meetings with an audio-visual sensor array", IEEE Transactions on Audio, Speech and Language Processing, 15(8), pp. 2257-2268, 2007.

    P. Smaragdis, "Convolutive speech bases and their application to supervised speech separation", IEEE Transactions on Audio, Speech and Language Processing, 15(1), pp. 1-12, 2007.

    A. Llagostera Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval, "Blind audiovisual source separation based on sparse redundant representations", IEEE Transactions on Multimedia, 12(5), pp. 358-371, 2010.

  • Summary

    Websites and software

    FASST: http://bass-db.gforge.inria.fr/fasst/
    Software framework for the implementation of source separation algorithms (Matlab)

    SiSEC: http://sisec.wiki.irisa.fr/
    Series of evaluation campaigns
