Linear Inverse Problems, Sparse Regularization, and Convex Optimization
Ivan Selesnick
March 23, 2017
Under-determined linear equations
Consider an under-determined system of equations

y = Ax (1)

A : M × N (M < N)
y : M × 1
x : N × 1

y = [y(0), …, y(M − 1)]ᵀ,   x = [x(0), …, x(N − 1)]ᵀ

The system has more unknowns than equations.

We assume AAᴴ is invertible; therefore (1) has infinitely many solutions.
Norms
We will use the ℓ₂ and ℓ₁ norms.

‖x‖₂² := ∑ₙ |x(n)|² (2)

‖x‖₁ := ∑ₙ |x(n)| (3)

‖x‖₂², i.e., the sum of squares, is referred to as the ‘energy’ of x.
Least squares
To solve y = Ax, it is common to minimize the energy of x.
x = arg min_x ‖x‖₂² (4a)
such that y = Ax. (4b)

The solution is

x = Aᴴ(AAᴴ)⁻¹y. (5)

When y is noisy, do not solve y = Ax exactly. Instead, find an approximate solution:

x = arg min_x { ‖y − Ax‖₂² + λ‖x‖₂² } (6)

The solution is

x = (AᴴA + λI)⁻¹Aᴴy. (7)

Large-scale systems → fast algorithms needed.
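Both closed-form solutions can be computed directly; a minimal MATLAB sketch, with a random A and y standing in for a real problem:

M = 20; N = 50; lam = 0.1;
A = randn(M, N);                             % example matrix (M < N)
y = randn(M, 1);                             % example data

x_ls  = A' * ((A * A') \ y);                 % minimum-energy solution (5)
x_reg = (A' * A + lam * eye(N)) \ (A' * y);  % regularized solution (7)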
Sparse solutions
Another approach to solving y = Ax:

x = arg min_x ‖x‖₁ (8a)
such that y = Ax (8b)

Problem (8) is the basis pursuit (BP) problem.

When y is noisy, do not solve y = Ax exactly. Instead, find an approximate solution:

x = arg min_x { ‖y − Ax‖₂² + λ‖x‖₁ } (9)

Problem (9) is the basis pursuit denoising (BPD) problem.

The BP/BPD problems cannot be solved in explicit form, only by iterative numerical algorithms.
Least squares & BP/BPD
Least squares and BP/BPD solutions are quite different. Why?
To minimize ‖x‖₂², the largest values of x must be made small, as they count much more than the smallest values.
⇒ Least-squares solutions have many small values, as they are relatively unimportant.
⇒ Least-squares solutions are not sparse.
[Figure: the quadratic penalty x² versus the absolute value penalty |x|]
Therefore, when it is known/expected that x is sparse, use the ℓ₁ norm, not the ℓ₂ norm.
Algorithms for sparse solutions
Objective function
• Non-differentiable
• Convex
• Large-scale
Algorithms
• MM: Majorization-Minimization
• ISTA: Iterative Shrinkage/Thresholding Algorithm (sketched below)
• FISTA: Fast ISTA
• SALSA (ADMM): Split Augmented Lagrangian Shrinkage Algorithm
• ‘Matrix-free’ algorithms
and more...
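As a minimal sketch (not from the original slides), ISTA for the BPD problem minimize ‖y − Ax‖₂² + λ‖x‖₁, assuming A, y, and lam are given:

soft = @(x, T) max(abs(x) - T, 0) .* sign(x);   % soft-threshold operator

alpha = max(eig(A' * A));     % step-size constant: any alpha >= max eig(A'*A)
x = zeros(size(A, 2), 1);     % initialization
for k = 1:100                 % fixed iteration count, for simplicity
    x = soft(x + (A' * (y - A * x)) / alpha, lam / (2 * alpha));
end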
Parseval frames
The columns of A form a Parseval frame if AAᴴ = pI with p > 0.

[Figure: a wide M × N matrix A times its conjugate transpose Aᴴ gives pI]

If AAᴴ = pI, then the solution to

x = arg min_x ‖x‖₂²
such that y = Ax

is

x = Aᴴ(AAᴴ)⁻¹y (10)
  = (1/p) Aᴴy. (11)

No matrix inversion needed.
Parseval frames
If

AAᴴ = pI (12)

then the solution to

x = arg min_x { ‖y − Ax‖₂² + λ‖x‖₂² } (13)

is

x = (AᴴA + λI)⁻¹Aᴴy (14)
  = 1/(λ + p) · Aᴴy (15)

using the matrix inverse lemma,

(λI + AᴴA)⁻¹ = (1/λ)I − (1/λ)Aᴴ(λI + AAᴴ)⁻¹A. (16)

• So, if AAᴴ = pI, then finding least-squares solutions is easy. No matrix inversion needed.

Some algorithms for BP/BPD also become computationally easier.
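A quick numerical check of (14) and (15), as a sketch: the first M rows of the inverse-DFT-type matrix (defined later in (19)) form a Parseval frame with p = N.

M = 8; N = 32; lam = 0.5; p = N;
F = fft(eye(N));
A = conj(F(1:M, :));                        % A(m,n) = exp(j*2*pi*m*n/N), so A*A' = N*I
y = randn(M, 1) + 1j * randn(M, 1);

x1 = (A' * A + lam * eye(N)) \ (A' * y);    % direct formula (14)
x2 = A' * y / (lam + p);                    % simplified formula (15)
disp(norm(x1 - x2))                         % ~0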
Example: Sparse Fourier coefficients using BP
The Fourier transform tells how to write the signal as a sum of sinusoids. But it is not the only way.

Basis pursuit gives a sparse spectrum.

Suppose the M-point signal y is written as

y(m) = ∑_{n=0}^{N−1} c(n) exp(j2πmn/N),   0 ≤ m ≤ M − 1 (17)

where c(n) is a length-N coefficient sequence, with M ≤ N.

y = Ac (18)

A_{m,n} = exp(j2πmn/N),   0 ≤ m ≤ M − 1,  0 ≤ n ≤ N − 1 (19)

c : length N

The coefficients c(n) are frequency-domain (Fourier) coefficients.
Example: Sparse Fourier coefficients using BP
1. If N = M, then A is the inverse N-point DFT matrix.
2. If N > M, then A is the first M rows of the inverse N-point DFT matrix.
⇒ A or Aᴴ can be implemented efficiently using the FFT. For example, in MATLAB, y = Ac is implemented as:

function y = A(c, M, N)
    % y = A*c : scaled inverse FFT, truncated to the first M samples
    v = N * ifft(c);
    y = v(1:M);
end
Similarly, Aᴴy can be obtained by zero-padding and computing the DFT. In MATLAB, c = Aᴴy is implemented as:

function c = AT(y, M, N)
    % c = A'*y : zero-pad y to length N, then take the FFT
    c = fft([y; zeros(N-M, 1)]);
end
⇒ Matrix-free algorithms.
3. Due to the orthogonality properties of complex sinusoids,

AAᴴ = N I_M (20)
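As a sketch, (20) can be verified numerically with the matrix-free operators above:

M = 100; N = 256;
y = randn(M, 1);
z = A(AT(y, M, N), M, N);    % A*(A'*y)
disp(norm(z - N * y))        % ~0, confirming A*A' = N*I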
Example: Sparse Fourier coefficients using BP
When N = M, the coefficients c satisfying y = Ac are uniquely determined.
When N > M, the coefficients c are not unique. Any vector c satisfying y = Ac can be considered a valid set of coefficients. To find a particular solution we can minimize either ‖c‖₂² or ‖c‖₁.

Least squares:

c = arg min_c ‖c‖₂² (21a)
such that y = Ac (21b)

Basis pursuit:

c = arg min_c ‖c‖₁ (22a)
such that y = Ac. (22b)
The two solutions can be quite different...
Example: Sparse Fourier coefficients using BP
[Figure: real and imaginary parts of the length-100 signal vs. time (samples)]
Least-squares solution:

c = Aᴴ(AAᴴ)⁻¹y = (1/N) Aᴴy (using AAᴴ = N I)

which is computed by:
1. zero-padding the length-M signal y to length N
2. computing its DFT

BP solution: computed using the SALSA algorithm.
Example: Sparse Fourier coefficients using BP
[Figure: (A) Fourier coefficients (DFT); (B) Fourier coefficients (least-squares solution); (C) Fourier coefficients (basis pursuit solution); all vs. frequency (index)]
The BP solution does not exhibit the leakage phenomenon.
Example: Sparse Fourier coefficients using BP
[Figure: cost function history, vs. iteration, of the algorithm for the basis pursuit solution]
Example: Denoising using BPD
Digital LTI filters are often used for noise reduction (denoising).
But, if
• the noise and signal overlap in the frequency domain, or
• the respective frequency bands are unknown,
then it is difficult to use LTI filters.

However, if the signal has sparse (or relatively sparse) Fourier coefficients, then BPD can be used for noise reduction.
Example: Denoising using BPD
Noisy speech signal y
y(m) = s(m) + w(m),   0 ≤ m ≤ M − 1,  M = 500 (23)

s : noise-free speech signal
w : noise sequence.
[Figure: noisy signal vs. time (samples); (A) Fourier coefficients (FFT) of the noisy signal vs. frequency (index)]
Example: Denoising using BPD
Assume the noise-free speech signal s has a sparse set of Fourier coefficients:
y = Ac + w

y : noisy speech signal, length M
A : M × N DFT matrix (19)
c : sparse Fourier coefficients, length N
w : noise, length M

As y is noisy, find c by solving the least-squares problem

c = arg min_c { ‖y − Ac‖₂² + λ‖c‖₂² } (24)

or the basis pursuit denoising (BPD) problem

c = arg min_c { ‖y − Ac‖₂² + λ‖c‖₁ }. (25)
Once c is found, an estimate of the speech signal is given by s = Ac.
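As a sketch (the slides use SALSA; ISTA is shown here for simplicity), the BPD problem (25) can be solved with the matrix-free FFT operators; since AAᴴ = N I, α = N bounds the eigenvalues of AᴴA:

alpha = N;                        % max eigenvalue of A'*A, since A*A' = N*I
soft = @(c, T) max(abs(c) - T, 0) .* sign(c);   % works for complex c in MATLAB

c = zeros(N, 1);
for k = 1:100
    c = soft(c + AT(y - A(c, M, N), M, N) / alpha, lam / (2 * alpha));
end
s_est = A(c, M, N);               % estimate of the speech signal, s = A*c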
Example: Denoising using BPD
Least-squares solution:

c = (AᴴA + λI)⁻¹Aᴴy (26)
  = 1/(λ + N) · Aᴴy (using AAᴴ = N I) (27)

using the matrix inverse lemma.

⇒ The least-squares estimate of the speech signal is

s = Ac = N/(λ + N) · y (least-squares solution).

But s is only a scaled version of the noisy signal!

No filtering is achieved.
Example: Denoising using BPD
BPD solution
[Figure: (B) Fourier coefficients (BPD solution) vs. frequency (index); denoised signal vs. time (samples)]

Obtained with the SALSA algorithm. Effective noise reduction, unlike least squares!
Example: Deconvolution using BPD
If the signal of interest x is not only noisy but is also distorted by an LTI system with impulse response h, then the available data y is

y(m) = (h ∗ x)(m) + w(m) (28)

where ‘∗’ denotes (linear) convolution and w is additive noise. Given the observed data y, we aim to estimate the signal x. We will assume that the sequence h is known.
Example: Deconvolution using BPD
y = Hx + w (29)

x : length N
h : length L
y : length M = N + L − 1

H is the M × N convolution (Toeplitz) matrix; for example, with L = 3 and N = 4,

H =
[ h0              ]
[ h1  h0          ]
[ h2  h1  h0      ]
[     h2  h1  h0  ]
[         h2  h1  ]
[             h2  ]   (30)

H is of size M × N with M > N (because M = N + L − 1).
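A sketch of building H in (30) with toeplitz and checking it against conv, for an example h:

h = [1 1 1 1]' / 4;                % example: 4-point moving average filter
N = 100; L = length(h); M = N + L - 1;
H = toeplitz([h; zeros(M - L, 1)], [h(1), zeros(1, N - 1)]);

x = randn(N, 1);
disp(norm(H * x - conv(h, x)))     % ~0: H implements linear convolution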
Example: Deconvolution using BPD
Sparse signal convolved with the 4-point moving average filter

h(n) = { 1/4,  n = 0, 1, 2, 3
       { 0,   otherwise
[Figure: sparse signal and observed (filtered, noisy) signal vs. time (samples)]
Example: Deconvolution using BPD
Due to noise, solve the regularized least-squares problem

x = arg min_x { ‖y − Hx‖₂² + λ‖x‖₂² } (31)

or the basis pursuit denoising (BPD) problem

x = arg min_x { ‖y − Hx‖₂² + λ‖x‖₁ }. (32)

Least-squares solution:

x = (HᴴH + λI)⁻¹Hᴴy. (33)
Example: Deconvolution using BPD
[Figure: deconvolution (least-squares solution) and deconvolution (BPD solution) vs. time (samples)]

The BPD solution, obtained using SALSA, is more faithful to the original signal.
Example: Filling in missing samples using BP
Due to data transmission/acquisition errors, some signal samples may be lost. Fill in the missing values for error concealment.

Part of a signal or image may be intentionally deleted (image editing, etc.). Convincingly fill in the missing values according to the surrounding area to do inpainting.
[Figure: incomplete signal vs. time (samples); 200 missing samples]
Example: Filling in missing samples using BP
We write the incomplete data y as

y = Sx (34)

x : length M
y : length K < M
S : ‘selection’ (or ‘sampling’) matrix of size K × M.

For example, if only the first, second, and last elements of a 5-point signal x are observed, then the matrix S is given by:

S =
[ 1 0 0 0 0 ]
[ 0 1 0 0 0 ]
[ 0 0 0 0 1 ].   (35)

Problem: Given y and S, find x such that y = Sx.

⇒ Under-determined system; infinitely many solutions.

The least-squares and BP solutions are very different...
Example: Filling in missing samples using BP
Properties of S:

1. SSᵀ = I (36)

where I is the K × K identity matrix. For example, with S in (35),

SSᵀ =
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ].

2. Sᵀy sets the missing samples to zero. For example, with S in (35),

Sᵀy =
[ 1 0 0 ]              [ y(0) ]
[ 0 1 0 ]   [ y(0) ]   [ y(1) ]
[ 0 0 0 ] · [ y(1) ] = [ 0    ]
[ 0 0 0 ]   [ y(2) ]   [ 0    ]
[ 0 0 1 ]              [ y(2) ]   (37)
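In code, S need not be formed explicitly; a sketch using an index vector of the observed positions:

M = 5;
keep = [1 2 5];              % indices of observed samples (S in (35))
x = (1:M)';                  % example signal

y = x(keep);                 % y = S*x  : keep the observed samples
z = zeros(M, 1);
z(keep) = y;                 % z = S'*y : zero-fill the missing samples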
Example: Filling in missing samples using BP
Suppose x has a sparse representation with respect to A,

x = Ac (38)

c : sparse vector, length N, with M ≤ N
A : size M × N.

The incomplete data y can then be written as

y = Sx (39a)
  = SAc. (39b)

Therefore, if c satisfies

y = SAc (40)

then we may estimate x as

x = Ac. (41)

Note that y is shorter than the coefficient vector c, so (40) has infinitely many solutions.
Example: Filling in missing samples using BP
Any vector c satisfying y = SAc can be considered a valid set of coefficients.
To find a particular solution, solve the least-squares (LS) problem

c = arg min_c ‖c‖₂² (42a)
such that y = SAc (42b)

or the basis pursuit (BP) problem

c = arg min_c ‖c‖₁ (43a)
such that y = SAc. (43b)

We will see... the LS and BP solutions are very different.

Let us assume A satisfies

AAᴴ = pI, (44)

for some positive real number p.
Example: Filling in missing samples using BP
The least-squares solution is

c = (SA)ᴴ((SA)(SA)ᴴ)⁻¹y (45)
  = AᴴSᵀ(SAAᴴSᵀ)⁻¹y (46)
  = AᴴSᵀ(p SSᵀ)⁻¹y (using AAᴴ = pI) (47)
  = AᴴSᵀ(p I)⁻¹y (using SSᵀ = I) (48)
  = (1/p) AᴴSᵀy. (49)

Hence, the least-squares estimate x is given by

x = Ac (50)
  = (1/p) AAᴴSᵀy (using (49)) (51)
  = Sᵀy. (using AAᴴ = pI) (52)

This estimate sets all the missing values to zero!

No estimation of the missing values. The least-squares solution is of no use here.
Example: Filling in missing samples using BP
Short segments of speech can be sparsely represented using the DFT; therefore we set A equal to the M × N DFT matrix (19) with N = 1024.
BP solution obtained using 100 iterations of SALSA:
[Figure: estimated coefficients vs. frequency (DFT index); estimated signal vs. time (samples)]
The missing samples have been filled in quite accurately.
Total Variation Denoising (TVD)
The signal x is observed in additive white Gaussian noise (AWGN)
y(n) = x(n) + w(n),   n ∈ {0, 1, …, N − 1}

Total variation denoising is defined by

x = arg min_{x∈ℝᴺ} { F(x) = (1/2) ∑ₙ |y(n) − x(n)|² + λ ∑ₙ |x(n) − x(n − 1)| },

written also as

x = arg min_{x∈ℝᴺ} { F(x) = (1/2)‖y − x‖₂² + λ‖Dx‖₁ }

where

D =
[ −1   1             ]
[     −1   1         ]
[         ⋱    ⋱     ]
[            −1   1  ].
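A sketch of D as a sparse matrix, so that D*x equals diff(x):

N = 6;
e = ones(N, 1);
D = spdiags([-e e], [0 1], N - 1, N);   % (N-1) x N first-order difference matrix

x = randn(N, 1);
disp(norm(D * x - diff(x)))             % ~0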
• L. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, pp. 259–268, 1992.
Total Variation Denoising
[Figure: noise-free signal; noisy signal; L2-filtered signal (λ = 3.00); L2-filtered signal (λ = 8.00); TV-filtered signal (λ = 3.00)]
TV denoising preserves discontinuities more accurately than linear filtering.
Total Variation Denoising - staircase artifacts
[Figure: ECG signal and its TV denoising vs. time (n)]
TVD has staircase artifacts.
Total Variation Denoising - staircase artifacts
[Figure: biosensor data and its TV denoising vs. time (n)]
TVD has staircase artifacts.
Sparse Singularity-Preserving Signal Smoothing (SIPS)

We assume the signal of interest is of the form

s = f + x₁ + x₂,   s, f, xᵢ ∈ ℝᴺ

where
• f is a low-pass signal
• x₁ is approximately piecewise constant
• x₂ is approximately piecewise linear
[Figure: s = f + x₁ + x₂; f (low-pass); x₁ (sparse derivative); x₂ (sparse 2nd derivative)]
Sparse Singularity-Preserving Signal Smoothing (SIPS)
Based on the signal model

s = f + x₁ + x₂,   s, f, xᵢ ∈ ℝᴺ

we minimize the objective function

J(x₁, x₂) = (1/2)‖Hy − H(x₁ + x₂)‖₂² + λ₁ ∑ₙ φ([Dx₁]ₙ) + λ₂ ∑ₙ φ([D²x₂]ₙ)

where H is a high-pass filter.

If φ is the absolute value function, the regularizer is λ₁‖Dx₁‖₁ + λ₂‖D²x₂‖₁.
Penalty Function
The penalty function φ can be taken to be

φ(x) = { (1/a) log(1 + a|x|),  a > 0
       { |x|,                  a = 0.

The parameter a ≥ 0 controls the non-convexity of φ.
[Figure: 1D parametric penalty function φ(x; a) for a = 0, 0.2, 0.5, 1.0]
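A sketch of φ as an elementwise MATLAB function:

function p = phi(x, a)
    % Parametric penalty: phi(x; a) = log(1 + a|x|)/a for a > 0,
    % reducing to |x| when a = 0.
    if a == 0
        p = abs(x);
    else
        p = log(1 + a * abs(x)) / a;
    end
end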
Total Variation Denoising (TVD)
[Figure: biosensor data and its TV denoising vs. time (n)]
TVD has staircase artifacts.
Sparse Singularity-Preserving Signal Smoothing (SIPS)
[Figure: biosensor data and its SIPS denoising vs. time (n)]
SIPS avoids staircase artifacts.
Sparse Singularity-Preserving Signal Smoothing (SIPS)
– Extension to higher-order singularities.
We assume the signal of interest is of the form
s = f + x₂ + x₃,   s, f, xᵢ ∈ ℝᴺ

where
• f is a low-pass signal
• x₂ is approximately piecewise linear
• x₃ is approximately piecewise quadratic

We minimize the objective function

J(x₂, x₃) = (1/2)‖Hy − H(x₂ + x₃)‖₂² + λ₂ ∑ₙ φ([D²x₂]ₙ) + λ₃ ∑ₙ φ([D³x₃]ₙ)
where H is a high-pass filter.
Total Variation Denoising (TVD)
[Figure: ECG signal and its TV denoising vs. time (n)]
TVD has staircase artifacts.
Sparse Singularity-Preserving Signal Smoothing (SIPS)
[Figure: ECG signal and its SIPS denoising vs. time (n)]
SIPS avoids staircase artifacts.
SIPS with L1 norm
[Figure: noisy signal; low-pass filter output (Ly); singularity-preserving smoothing (SPS) with L1 norm penalty; components u₂ and u₃]
SIPS with non-convex penalty
[Figure: noisy signal; low-pass filter output (Ly); singularity-preserving smoothing (SPS) with non-convex penalty; components u₂ and u₃]
Convex or non-convex: which is better for inverse problems?
Benefits of convex optimization
1. Absence of suboptimal local minima
2. Continuity of solution as a function of input data
3. Fewer complications when specifying regularization parameters
4. Availability of algorithms guaranteed to converge to a global optimum
But, non-convex regularization often performs better! Convex regularization under-estimates signal values (a ‘bias toward zero’).
Non-convex regularization induces sparsity more effectively and is a popularalternative to convex functions.
Can we exploit the strong sparsity-inducing properties of non-convex penaltieswithout forgoing the benefits of the convex approach?
Parameterized sparsity-inducing non-convex penalty

Let φ( · ; a) : ℝ → ℝ be a penalty function with parameter a ≥ 0 satisfying

1. φ is continuous on ℝ
2. φ is twice continuously differentiable, increasing, and concave on ℝ₊
3. φ(x; 0) = |x|
4. φ(0; a) = 0
5. φ(−x; a) = φ(x; a)
6. φ′(0⁺; a) = 1
7. φ″(x; a) ≥ −a for all x ≠ 0
[Figure: 1D parametric penalty function φ(x; a) for a = 0, 0.2, 0.5, 1.0]
Non-convex regularization, Convex optimization

Total variation denoising with convex regularization:

x = arg min_{x∈ℝᴺ} { F₀(x) = (1/2)‖y − x‖₂² + λ‖Dx‖₁ }

With non-convex regularization:

x = arg min_{x∈ℝᴺ} { Fₐ(x) = (1/2)‖y − x‖₂² + λ ∑ₙ φ([Dx]ₙ; a) }

Can we constrain φ so that Fₐ is convex?

Proposition
Fₐ is strictly convex if

inf_{x≠0} φ″(x) > −1/(4λ).

When φ satisfies the properties above, Fₐ is strictly convex if

0 ≤ a < 1/(4λ).
• I. W. Selesnick, A. Parekh, and I. Bayram, “Convex 1-D total variation denoising with non-convex regularization,” IEEE Signal Processing Letters, vol. 22, pp. 141–144, Feb. 2015.
Convex TVD with non-convex regularization

Set a to its maximal value to maximally induce sparsity of Dx while ensuring convexity of the objective function:

a = 1/(4λ)
[Figure: noisy data (σ = 0.50); TV denoising with convex penalty (λ = 2.00, RMSE = 0.318); TV denoising with non-convex penalty (atan) (λ = 2.00, RMSE = 0.247); denoising error, convex vs. non-convex]
Convex TVD with non-convex regularization
TVD with non-convex regularization:

x = arg min_{x∈ℝᴺ} { F(x) = (1/2)‖y − x‖₂² + λ ∑ₙ φ([Dx]ₙ; a) }

TVD with non-separable non-convex regularization:

x = arg min_{x∈ℝᴺ} { Fnonsep(x) = (1/2)‖y − x‖₂² + λ ∑ₙ ψ(([Dx]ₙ₋₁, [Dx]ₙ); a) }

where ψ : ℝ² → ℝ.
Convex TVD with non-convex regularization
[Figure: contours of (a) the L1 norm (separable, convex; a₁ = a₂ = 0), (b) a separable non-convex penalty (a₁ = a₂ > 0), and (c) the proposed non-separable non-convex penalty (a₁ > a₂ > 0)]
Convex TVD with non-convex regularization
Define ψ( · ; a) : ℝ² → ℝ as

ψ(x) = { (1 − r)[φ(x₁; α) + φ(x₂; α)] + r φ(x₁ + x₂; α),   x ∈ A₁
       { (1 + r) φ(x₁; a₂) + φ(r x₁ + x₂; α),              x ∈ A₂
       { (1 + r) φ(x₂; a₂) + φ(x₁ + r x₂; α),              x ∈ A₃
(53)

where the Aᵢ are subsets of ℝ² defined as

A₁ = {x ∈ ℝ² | x₁x₂ ≥ 0}, (54)
A₂ = {x ∈ ℝ² | x₁(x₁ + x₂) ≤ 0}, (55)
A₃ = {x ∈ ℝ² | x₂(x₁ + x₂) ≤ 0} (56)

and α and r are given by

α = (a₁ + a₂)/2,   r = { (a₁ − a₂)/(a₁ + a₂),  a₁ + a₂ > 0
                       { 0,                    a₁ = a₂ = 0. (57)

Proposition
Fnonsep is strictly convex if

a₁ ≤ 1/(2λ),   a₂ ≤ 1/(4λ).
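A direct MATLAB transcription of (53)–(57), as a sketch, for scalar inputs (phi is the parametric penalty defined earlier):

function p = psi(x1, x2, a1, a2)
    % Non-separable penalty psi(x; a) of (53), with alpha and r as in (57).
    alpha = (a1 + a2) / 2;
    if a1 + a2 > 0
        r = (a1 - a2) / (a1 + a2);
    else
        r = 0;
    end
    if x1 * x2 >= 0                              % region A1, (54)
        p = (1 - r) * (phi(x1, alpha) + phi(x2, alpha)) + r * phi(x1 + x2, alpha);
    elseif x1 * (x1 + x2) <= 0                   % region A2, (55)
        p = (1 + r) * phi(x1, a2) + phi(r * x1 + x2, alpha);
    else                                         % region A3, (56)
        p = (1 + r) * phi(x2, a2) + phi(x1 + r * x2, alpha);
    end
end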
[Figure: noisy data (σ = 0.50); TV denoising using the non-separable non-convex penalty; denoising error for L1 (RMSE = 0.318), separable non-convex (RMSE = 0.247), and non-separable non-convex (RMSE = 0.221)]
Structured Sparsity with Overlapping Groups
A simple nonlinear thresholding approach to denoising is basis pursuit denoising:

x = arg min_x { (1/2)‖y − x‖₂² + λ‖x‖₁ } = soft(y, λ)
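A sketch of the elementwise soft-threshold operator (in MATLAB, sign(y) = y/|y| for complex y, so this also handles complex data):

soft = @(y, lam) max(abs(y) - lam, 0) .* sign(y);

soft([-3 -0.5 0 2]', 1)     % example: returns [-2 0 0 1]'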
Thresholding does not capture group structure (clustering/grouping behavior).

To account for clustering/grouping, we may use

x = arg min_x { (1/2)‖y − x‖₂² + λ ∑ₙ √(|x(n)|² + |x(n + 1)|² + |x(n + 2)|²) }

for group size 3.
The optimization is more challenging due to coupling among all signal values x(n), but yields superior results for speech enhancement; see the sketch below.
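A sketch of evaluating the overlapping-group regularizer above, for a given signal vector x:

K = 3;                                  % group size
pen = 0;
for n = 1:length(x) - K + 1
    pen = pen + norm(x(n:n+K-1));       % ell-2 norm of each overlapping group
end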
• P.-Y. Chen and I. W. Selesnick, “Translation-invariant shrinkage/thresholding of group sparse signals,” Signal Processing, vol. 94, pp. 476–489, Jan. 2014.
Structured Sparsity with Overlapping Groups: Speech Enhancement
Speech signals exhibit structured sparsity in the time-frequency domain.
[Figure: spectrograms (frequency vs. time in seconds) of the noise-free signal and the noisy signal]
Structured Sparsity with Overlapping Groups: Speech Enhancement
Scalar thresholding produces spurious noise spikes and musical noise.
[Figure: spectrogram after scalar thresholding, with magnified view]
Structured Sparsity with Overlapping Groups: Speech Enhancement
The new overlapping group shrinkage/thresholding (OGS) algorithm reduces musical noise.
[Figure: spectrogram after the OGS algorithm, with magnified view]
Summary
1. Basis pursuit
2. Basis pursuit denoising
3. Sparse Fourier coefficients using BP
4. Denoising using BPD
5. Deconvolution using BPD
6. Filling in missing samples using BP
7. Total variation denoising (TVD)
8. Sparse singularity-preserving signal smoothing (SIPS)
9. Non-convex regularization, convex optimization
10. Group sparsity