Page 1: Ivan Selesnick March 23, 2017 - New York University

Linear Inverse Problems, Sparse Regularization, and Convex Optimization

Ivan Selesnick

March 23, 2017

1 / 59

Page 2: Ivan Selesnick March 23, 2017 - New York University

Under-determined linear equations

Consider an under-determined system of equations

y = Ax (1)

A : M × N (M < N)

y : M × 1

x : N × 1

y = [y(0), …, y(M − 1)]^T,   x = [x(0), …, x(N − 1)]^T

The system has more unknowns than equations.

We assume AA^H is invertible; therefore (1) has infinitely many solutions.

2 / 59

Page 3: Ivan Selesnick March 23, 2017 - New York University

Norms

We will use the ℓ₂ and ℓ₁ norms.

‖x‖₂² := ∑ₙ |x(n)|²   (2)

‖x‖₁ := ∑ₙ |x(n)|   (3)

‖x‖₂², i.e., the sum of squares, is referred to as the ‘energy’ of x.

3 / 59

Page 4: Ivan Selesnick March 23, 2017 - New York University

Least squares

To solve y = Ax, it is common to minimize the energy of x:

x = arg min_x ‖x‖₂²   (4a)
such that y = Ax.     (4b)

The solution is

x = A^H(AA^H)⁻¹ y.   (5)

When y is noisy, don’t solve y = Ax exactly. Instead, find an approximate solution:

x = arg min_x { ‖y − Ax‖₂² + λ‖x‖₂² }   (6)

The solution is

x = (A^HA + λI)⁻¹ A^H y.   (7)

Large-scale systems → fast algorithms needed.
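
As a concrete illustration (not from the slides), a minimal Matlab sketch of evaluating solution (7) for a small random system might look as follows; the sizes and λ are arbitrary choices.

M = 20; N = 50;                          % example sizes (arbitrary)
A = randn(M, N);                         % example under-determined matrix
y = randn(M, 1);                         % example data
lambda = 0.1;                            % regularization parameter
x = (A'*A + lambda*eye(N)) \ (A'*y);     % solution (7): (A^H A + lambda I)^{-1} A^H y

(In Matlab, A' is the conjugate transpose, matching A^H.)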

4 / 59

Page 5: Ivan Selesnick March 23, 2017 - New York University

Sparse solutions

Another approach to solving y = Ax:

x = arg min_x ‖x‖₁   (8a)
such that y = Ax     (8b)

Problem (8) is the basis pursuit (BP) problem.

When y is noisy, don’t solve y = Ax exactly. Instead, find an approximate solution:

x = arg min_x { ‖y − Ax‖₂² + λ‖x‖₁ }   (9)

Problem (9) is the basis pursuit denoising (BPD) problem.

The BP/BPD problems cannot be solved in explicit form, only by iterative numerical algorithms.

5 / 59

Page 6: Ivan Selesnick March 23, 2017 - New York University

Least squares & BP/BPD

Least squares and BP/BPD solutions are quite different. Why?

To minimize ‖x‖₂², the largest values of x must be made small, as they count much more than the smallest values. ⇒ Least-squares solutions have many small values, as these are relatively unimportant. ⇒ Least-squares solutions are not sparse.

[Figure: the quadratic penalty x² versus the absolute-value penalty |x|.]

Therefore, when it is known/expected that x is sparse, use the ℓ₁ norm, not the ℓ₂ norm.

6 / 59

Page 7: Ivan Selesnick March 23, 2017 - New York University

Algorithms for sparse solutions

Objective function

• Non-differentiable
• Convex
• Large-scale

Algorithms

• MM: Majorization-Minimization
• ISTA: Iterative Shrinkage/Thresholding Algorithm
• FISTA: Fast ISTA
• SALSA (ADMM): Split Augmented Lagrangian Shrinkage Algorithm
• ‘Matrix-free’ algorithms

and more... (a minimal ISTA sketch follows below)
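
Below is a minimal Matlab sketch of ISTA for the BPD problem (9), included for illustration only; the function name, the fixed step size based on max(eig(A'*A)), and the iteration count are assumptions, not part of the slides.

function x = ista_bpd(y, A, lambda, Nit)
% Illustrative ISTA sketch for min_x ||y - A*x||_2^2 + lambda*||x||_1.
L = max(eig(A'*A));                             % Lipschitz-related constant for the step size
x = zeros(size(A, 2), 1);
soft = @(v, T) max(abs(v) - T, 0) .* sign(v);   % soft-thresholding rule
for k = 1:Nit
    x = soft(x + (A'*(y - A*x)) / L, lambda / (2*L));   % gradient step, then shrinkage
end
end

For example, x = ista_bpd(y, A, 0.5, 200); returns an approximate BPD solution.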

7 / 59

Page 8: Ivan Selesnick March 23, 2017 - New York University

Parseval frames

The columns of A form a Parseval frame if AAH = pI with p > 0.

If AA^H = pI, then the solution to

x = arg min_x ‖x‖₂²   such that y = Ax

is

x = A^H(AA^H)⁻¹ y   (10)
  = (1/p) A^H y.    (11)

No matrix inversion needed.

8 / 59

Page 9: Ivan Selesnick March 23, 2017 - New York University

Parseval frames

If

AA^H = pI   (12)

then the solution to

x = arg min_x { ‖y − Ax‖₂² + λ‖x‖₂² }   (13)

is

x = (A^HA + λI)⁻¹ A^H y   (14)
  = 1/(λ + p) · A^H y     (15)

using the matrix inverse lemma,

(λI + A^HA)⁻¹ = (1/λ) I − (1/λ) A^H (λI + AA^H)⁻¹ A.   (16)

• So, if AA^H = pI then finding least-squares solutions is easy. No matrix inversion needed.

Some algorithms for BP/BPD also become computationally easier.
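
As an illustrative numerical check of the shortcut (15) (assumed example, not from the slides), one can build a tight frame from rows of an orthonormal matrix and compare the direct and shortcut solutions:

M = 8; N = 32; p = 1; lambda = 0.5;             % example sizes and parameters (arbitrary)
[Q, ~] = qr(randn(N));                          % orthonormal N x N matrix
A = sqrt(p) * Q(1:M, :);                        % M rows of Q, so A*A' = p*I (tight frame)
y = randn(M, 1);
x_direct = (A'*A + lambda*eye(N)) \ (A'*y);     % direct solution (14)
x_shortcut = A'*y / (lambda + p);               % shortcut (15)
max(abs(x_direct - x_shortcut))                 % agreement up to machine precision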

9 / 59

Page 10: Ivan Selesnick March 23, 2017 - New York University

Example: Sparse Fourier coefficients using BP

The Fourier transform tells how to write the signal as a sum of sinusoids. But it is not the only way.

Basis pursuit gives a sparse spectrum.

Suppose the M-point signal y is written as

y(m) = ∑_{n=0}^{N−1} c(n) exp(j (2π/N) m n),   0 ≤ m ≤ M − 1   (17)

where c(n) is a length-N coefficient sequence, with M ≤ N.

y = Ac   (18)

A_{m,n} = exp(j (2π/N) m n),   0 ≤ m ≤ M − 1,   0 ≤ n ≤ N − 1   (19)

c : length N

The coefficients c(n) are frequency-domain (Fourier) coefficients.

10 / 59

Page 11: Ivan Selesnick March 23, 2017 - New York University

Example: Sparse Fourier coefficients using BP

1. If N = M, then A is the inverse N-point DFT matrix.

2. If N > M, then A is the first M rows of the inverse N-point DFT matrix.
⇒ A or A^H can be implemented efficiently using the FFT. For example, in Matlab, y = Ac is implemented as:

function y = A(c, M, N)
% y = A*c : length-N inverse DFT of c, keeping the first M samples
v = N * ifft(c);
y = v(1:M);
end

Similarly, A^H y can be obtained by zero-padding and computing the DFT. In Matlab, c = A^H y is implemented as:

function c = AT(y, M, N)
% c = A^H*y : zero-pad y to length N, then take the DFT
c = fft([y; zeros(N-M, 1)]);
end

⇒ Matrix-free algorithms.

3. Due to the orthogonality properties of complex sinusoids,

AA^H = N I_M.   (20)
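
A quick sanity check of (20) using the matrix-free operators above (an illustrative snippet, not from the slides; A and AT are the functions just defined):

M = 100; N = 256;
y = randn(M, 1);
max(abs(A(AT(y, M, N), M, N) - N*y))   % ~ 0, since A*A^H = N*I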

11 / 59

Page 12: Ivan Selesnick March 23, 2017 - New York University

Example: Sparse Fourier coefficients using BP

When N = M, the coefficients c satisfying y = Ac are uniquely determined.

When N > M, the coefficients c are not unique. Any vector c satisfying y = Ac can be considered a valid set of coefficients. To find a particular solution we can minimize either ‖c‖₂² or ‖c‖₁.

Least squares:

c = arg min_c ‖c‖₂²   (21a)
such that y = Ac      (21b)

Basis pursuit:

c = arg min_c ‖c‖₁   (22a)
such that y = Ac.    (22b)

The two solutions can be quite different...

12 / 59

Page 13: Ivan Selesnick March 23, 2017 - New York University

Example: Sparse Fourier coefficients using BP

[Figure: the signal, real and imaginary parts, versus time (samples).]

Least-squares solution:

c = A^H(AA^H)⁻¹ y = (1/N) A^H y   (AA^H = N I)

which is computed by

1. zero-padding the length-M signal y to length N
2. computing its DFT
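
In code, the two steps above amount to one line (an illustrative snippet; y, M, N as in (17)):

c_ls = fft([y; zeros(N - M, 1)]) / N;   % least-squares coefficients: (1/N) A^H y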

BP solution: Compute using algorithm SALSA

13 / 59

Page 14: Ivan Selesnick March 23, 2017 - New York University

Example: Sparse Fourier coefficients using BP

[Figure: (A) Fourier coefficients (DFT); (B) Fourier coefficients (least-squares solution); (C) Fourier coefficients (basis pursuit solution); horizontal axes: frequency (index).]

The BP solution does not exhibit the leakage phenomenon.

14 / 59

Page 15: Ivan Selesnick March 23, 2017 - New York University

Example: Sparse Fourier coefficients using BP

[Figure: cost function history versus iteration of the algorithm for the basis pursuit solution.]

15 / 59

Page 16: Ivan Selesnick March 23, 2017 - New York University

Example: Denoising using BPD

Digital LTI filters are often used for noise reduction (denoising).

But, if

• the noise and signal overlap in the frequency domain, or
• the respective frequency bands are unknown,

then it is difficult to use LTI filters.

However, if the signal has sparse (or relatively sparse) Fourier coefficients, then BPD can be used for noise reduction.

16 / 59

Page 17: Ivan Selesnick March 23, 2017 - New York University

Example: Denoising using BPD

Noisy speech signal y

y(m) = s(m) + w(m),   0 ≤ m ≤ M − 1,   M = 500   (23)

s : noise-free speech signal
w : noise sequence

[Figure: the noisy signal versus time (samples), and (A) the Fourier coefficients (FFT) of the noisy signal versus frequency (index).]

17 / 59

Page 18: Ivan Selesnick March 23, 2017 - New York University

Example: Denoising using BPD

Assume the noise-free speech signal s has a sparse set of Fourier coefficients:

y = Ac + w

y : noisy speech signal, length M
A : M × N DFT matrix (19)
c : sparse Fourier coefficients, length N
w : noise, length M

As y is noisy, find c by solving the least-squares problem

c = arg min_c { ‖y − Ac‖₂² + λ‖c‖₂² }   (24)

or the basis pursuit denoising (BPD) problem

c = arg min_c { ‖y − Ac‖₂² + λ‖c‖₁ }.   (25)

Once c is found, an estimate of the speech signal is given by s = Ac.

18 / 59

Page 19: Ivan Selesnick March 23, 2017 - New York University

Example: Denoising using BPD

Least-squares solution:

c = (A^HA + λI)⁻¹ A^H y   (26)
  = 1/(λ + N) · A^H y   (AA^H = N I)   (27)

using the matrix inverse lemma.

⇒ The least-squares estimate of the speech signal is

s = Ac = N/(λ + N) · y   (least-squares solution).

But s is only a scaled version of the noisy signal!

No filtering is achieved.

19 / 59

Page 20: Ivan Selesnick March 23, 2017 - New York University

Example: Denoising using BPD

BPD solution

[Figure: (B) Fourier coefficients (BPD solution) versus frequency (index), and the denoised signal (denoising using BPD) versus time (samples).]

Obtained with algorithm SALSA. Effective noise reduction, unlike least squares!

20 / 59

Page 21: Ivan Selesnick March 23, 2017 - New York University

Example: Deconvolution using BPD

If the signal of interest x is not only noisy but is also distorted by an LTI system with impulse response h, then the available data y is

y(m) = (h ∗ x)(m) + w(m)   (28)

where ‘∗’ denotes (linear) convolution and w is additive noise. Given the observed data y, we aim to estimate the signal x. We will assume that the sequence h is known.

21 / 59

Page 22: Ivan Selesnick March 23, 2017 - New York University

Example: Deconvolution using BPD

y = Hx + w   (29)

x : length N
h : length L
y : length M = N + L − 1

H is the convolution (banded Toeplitz) matrix; for example, for a length-3 impulse response (h0, h1, h2) and N = 4,

H = [ h0   0    0    0
      h1   h0   0    0
      h2   h1   h0   0
      0    h2   h1   h0
      0    0    h2   h1
      0    0    0    h2 ]   (30)

H is of size M × N with M > N (because M = N + L − 1).
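
An illustrative way to build H in Matlab (not from the slides) is as a Toeplitz matrix; the check against conv is a quick sanity test:

h = [1; 1; 1; 1] / 4;                     % e.g., the 4-point moving average used on the next slide
N = 100; L = length(h); M = N + L - 1;
H = toeplitz([h; zeros(N-1, 1)], [h(1), zeros(1, N-1)]);   % M x N convolution matrix as in (30)
x = randn(N, 1);
max(abs(H*x - conv(h, x)))                % ~ 0: H*x equals linear convolution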

22 / 59

Page 23: Ivan Selesnick March 23, 2017 - New York University

Example: Deconvolution using BPD

Sparse signal convolved by the 4-point moving average filter:

h(n) = 1/4 for n = 0, 1, 2, 3;   h(n) = 0 otherwise.

[Figure: the sparse signal and the observed (filtered, noisy) signal versus time (samples).]

23 / 59

Page 24: Ivan Selesnick March 23, 2017 - New York University

Example: Deconvolution using BPD

Due to noise, solve the regularized least-squares problem

x = arg min_x { ‖y − Hx‖₂² + λ‖x‖₂² }   (31)

or the basis pursuit denoising (BPD) problem

x = arg min_x { ‖y − Hx‖₂² + λ‖x‖₁ }.   (32)

Least-squares solution:

x = (H^HH + λI)⁻¹ H^H y.   (33)

24 / 59

Page 25: Ivan Selesnick March 23, 2017 - New York University

Example: Deconvolution using BPD

[Figure: deconvolution results versus time (samples): the least-squares solution and the BPD solution.]

The BPD solution, obtained using SALSA, is more faithful to the original signal.

25 / 59

Page 26: Ivan Selesnick March 23, 2017 - New York University

Example: Filling in missing samples using BP

Due to data transmission/acquisition errors, some signal samples may be lost. Fill in missing values for error concealment.

Part of a signal or image may be intentionally deleted (image editing, etc.). Convincingly fill in missing values according to the surrounding area to do inpainting.

[Figure: incomplete signal versus time (samples); 200 missing samples.]

26 / 59

Page 27: Ivan Selesnick March 23, 2017 - New York University

Example: Filling in missing samples using BP

We write the incomplete data y as

y = Sx   (34)

x : length M
y : length K < M
S : ‘selection’ (or ‘sampling’) matrix of size K × M

For example, if only the first, second and last elements of a 5-point signal x are observed, then the matrix S is given by:

S = [ 1 0 0 0 0
      0 1 0 0 0
      0 0 0 0 1 ].   (35)

Problem: Given y and S, find x such that y = Sx

⇒ Underdetermined system, infinitely many solutions.

Least square and BP solutions are very different...

27 / 59

Page 28: Ivan Selesnick March 23, 2017 - New York University

Example: Filling in missing samples using BP

Properties of S:

1. S S^T = I   (36)

where I is the K × K identity matrix. For example, with S in (35),

S S^T = [ 1 0 0
          0 1 0
          0 0 1 ].

2. S^T y sets the missing samples to zero. For example, with S in (35),

S^T y = S^T [y(0), y(1), y(2)]^T = [y(0), y(1), 0, 0, y(2)]^T.   (37)
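
A small illustrative check of properties (36) and (37) in Matlab (assumed example, not from the slides):

S = [1 0 0 0 0; 0 1 0 0 0; 0 0 0 0 1];   % the selection matrix in (35)
disp(S*S')                                % 3 x 3 identity: property (36)
disp(S'*[10; 20; 30])                     % [10; 20; 0; 0; 30]: missing samples set to zero, as in (37)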

28 / 59

Page 29: Ivan Selesnick March 23, 2017 - New York University

Example: Filling in missing samples using BP

Suppose x has a sparse representation with respect to A,

x = Ac   (38)

c : sparse vector, length N, with M ≤ N
A : size M × N.

The incomplete data y can then be written as

y = Sx    (39a)
  = SAc.  (39b)

Therefore, if c satisfies

y = SAc   (40)

then we may estimate x as

x = Ac.   (41)

Note that y is shorter than the coefficient vector c, so (40) has infinitely many solutions.

29 / 59

Page 30: Ivan Selesnick March 23, 2017 - New York University

Example: Filling in missing samples using BP

Any vector c satisfying y = SAc can be considered a valid set of coefficients.

To find a particular solution, solve the least-squares (LS) problem

c = arg min_c ‖c‖₂²   (42a)
such that y = SAc     (42b)

or the basis pursuit (BP) problem

c = arg min_c ‖c‖₁   (43a)
such that y = SAc.   (43b)

We will see... the LS and BP solutions are very different.

Let us assume A satisfies

AA^H = pI,   (44)

for some positive real number p.

30 / 59

Page 31: Ivan Selesnick March 23, 2017 - New York University

Example: Filling in missing samples using BP

The least-squares solution is

c = (SA)^H((SA)(SA)^H)⁻¹ y            (45)
  = A^HS^T(SAA^HS^T)⁻¹ y              (46)
  = A^HS^T(p SS^T)⁻¹ y   (AA^H = pI)  (47)
  = A^HS^T(p I)⁻¹ y      (SS^T = I)   (48)
  = (1/p) A^HS^T y.                   (49)

Hence, the least-squares estimate x is given by

x = Ac                               (50)
  = (1/p) AA^HS^T y   [using (49)]   (51)
  = S^T y.            (AA^H = pI)    (52)

This estimate sets all the missing values to zero!

No estimation of the missing values. Least square solution of no use here.

31 / 59

Page 32: Ivan Selesnick March 23, 2017 - New York University

Example: Filling in missing samples using BP

Short segments of speech can be sparsely represented using the DFT; therefore we set A equal to the M × N DFT (19) with N = 1024.

BP solution obtained using 100 iterations of SALSA:

[Figure: estimated coefficients versus frequency (DFT index), and the estimated signal versus time (samples).]

The missing samples have been filled in quite accurately.

32 / 59

Page 33: Ivan Selesnick March 23, 2017 - New York University

Total Variation Denoising (TVD)

The signal x is observed in additive white Gaussian noise (AWGN)

y(n) = x(n) + w(n), n ∈ {0, 1, 2, . . . , N − 1}

Total variation denoising is defined by

x = arg min_{x ∈ R^N} { F(x) = (1/2) ∑ₙ |y(n) − x(n)|² + λ ∑ₙ |x(n) − x(n − 1)| },

written also as

x = arg min_{x ∈ R^N} { F(x) = (1/2) ‖y − x‖₂² + λ ‖Dx‖₁ }

where

D = [ −1  1
          −1  1
              ⋱  ⋱
                 −1  1 ].

• L. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, pp. 259–268, 1992.
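
The TVD problem has no closed-form solution, but a short iterative solver can be sketched. The following Matlab function is an illustrative sketch (assumed, not necessarily the algorithm used in the talk) based on projected gradient descent applied to the dual problem, where x = y − D^T z and the dual variable z is clipped to [−λ, λ]:

function x = tvd_dual(y, lambda, Nit)
% Illustrative TV denoising sketch: projected gradient on the dual problem.
y = y(:);
z = zeros(length(y) - 1, 1);              % dual variable, one entry per difference
alpha = 4;                                % >= maximum eigenvalue of D*D'
for k = 1:Nit
    x = y - ([0; z] - [z; 0]);            % x = y - D'*z
    z = z + (1/alpha) * diff(x);          % gradient step (D*x = diff(x))
    z = max(min(z, lambda), -lambda);     % clip: project onto |z(n)| <= lambda
end
x = y - ([0; z] - [z; 0]);                % final primal estimate
end

For example, x = tvd_dual(y, 3, 100); denoises the signal y with λ = 3.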

33 / 59

Page 34: Ivan Selesnick March 23, 2017 - New York University

Total Variation Denoising

[Figure: noise-free signal; noisy signal; L2-filtered signal (λ = 3.00); L2-filtered signal (λ = 8.00); TV-filtered signal (λ = 3.00).]

TV denoising preserves discontinuities more accurately than linear filtering.

34 / 59

Page 35: Ivan Selesnick March 23, 2017 - New York University

Total Variation Denoising - staircase artifacts

[Figure: ECG signal and its TV denoising, versus time (n).]

TVD has staircase artifacts.

35 / 59

Page 36: Ivan Selesnick March 23, 2017 - New York University

Total Variation Denoising - staircase artifacts

[Figure: biosensor data and its TV denoising, versus time (n).]

TVD has staircase artifacts.

36 / 59

Page 37: Ivan Selesnick March 23, 2017 - New York University

Sparse Singularity-Preserving Signal Smoothing (SIPS)

We assume the signal of interest is of the form

s = f + x₁ + x₂,   s, f, xᵢ ∈ R^N

where

• f is a low-pass signal
• x₁ is approximately piecewise constant
• x₂ is approximately piecewise linear

[Figure: example components: s = f + x₁ + x₂; f (low-pass); x₁ (sparse derivative); x₂ (sparse 2nd derivative).]

37 / 59

Page 38: Ivan Selesnick March 23, 2017 - New York University

Sparse Singularity-Preserving Signal Smoothing (SIPS)

Based on the signal model

s = f + x₁ + x₂,   s, f, xᵢ ∈ R^N

we minimize the objective function

J(x₁, x₂) = (1/2) ‖Hy − H(x₁ + x₂)‖₂² + λ₁ ∑ₙ φ([Dx₁]ₙ) + λ₂ ∑ₙ φ([D²x₂]ₙ)

where H is a high-pass filter.

If φ is the absolute value function, the regularizer is λ₁‖Dx₁‖₁ + λ₂‖D²x₂‖₁.

38 / 59

Page 39: Ivan Selesnick March 23, 2017 - New York University

Penalty Function

The penalty function φ can be taken to be

φ(x) = (1/a) log(1 + a|x|)  for a > 0,   and   φ(x) = |x|  for a = 0.

The parameter a > 0 controls the non-convexity of φ.
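
An illustrative Matlab definition of this penalty (an assumed helper, not from the slides):

function p = phi(x, a)
% Parametric penalty: (1/a)*log(1 + a*|x|) for a > 0, reducing to |x| for a = 0.
if a > 0
    p = log(1 + a*abs(x)) / a;
else
    p = abs(x);
end
end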

[Figure: the 1D parametric penalty function φ(x; a) for a = 0, 0.2, 0.5, 1.0.]

39 / 59

Page 40: Ivan Selesnick March 23, 2017 - New York University

Total Variation Denoising (TVD)

[Figure: biosensor data and its TV denoising, versus time (n).]

TVD has staircase artifacts.

40 / 59

Page 41: Ivan Selesnick March 23, 2017 - New York University

Sparse Singularity-Preserving Signal Smoothing (SIPS)

[Figure: biosensor data and its SIPS denoising, versus time (n).]

SIPS avoids staircase artifacts.

41 / 59

Page 42: Ivan Selesnick March 23, 2017 - New York University

Sparse Singularity-Preserving Signal Smoothing (SIPS)

– Extension to higher-order singularities.

We assume the signal of interest is of the form

s = f + x₂ + x₃,   s, f, xᵢ ∈ R^N

where

• f is a low-pass signal
• x₂ is approximately piecewise linear
• x₃ is approximately piecewise quadratic

We minimize the objective function

J(x₂, x₃) = (1/2) ‖Hy − H(x₂ + x₃)‖₂² + λ₂ ∑ₙ φ([D²x₂]ₙ) + λ₃ ∑ₙ φ([D³x₃]ₙ)

where H is a high-pass filter.

42 / 59

Page 43: Ivan Selesnick March 23, 2017 - New York University

Total Variation Denoising (TVD)

[Figure: ECG signal and its TV denoising, versus time (n).]

TVD has staircase artifacts.

43 / 59

Page 44: Ivan Selesnick March 23, 2017 - New York University

Sparse Singularity-Preserving Signal Smoothing (SIPS)

[Figure: ECG signal and its SIPS denoising, versus time (n).]

SIPS avoids staircase artifacts.

44 / 59

Page 45: Ivan Selesnick March 23, 2017 - New York University

SIPS with L1 norm

[Figure: noisy signal; low-pass filter output (Ly); singularity-preserving smoothing (SPS) with the L1 norm penalty; components u₂ and u₃.]

45 / 59

Page 46: Ivan Selesnick March 23, 2017 - New York University

SIPS with non-convex penalty

[Figure: noisy signal; low-pass filter output (Ly); singularity-preserving smoothing (SPS) with the non-convex penalty; components u₂ and u₃.]

46 / 59

Page 47: Ivan Selesnick March 23, 2017 - New York University

Convex or non-convex: which is better for inverse problems?

Benefits of convex optimization

1. Absence of suboptimal local minima

2. Continuity of solution as a function of input data

3. Fewer complications when specifying regularization parameters

4. Availability of algorithms guaranteed to converge to a global optimum

But, non-convex regularization often performs better! Convex regularization under-estimates signal values (a ‘bias toward zero’).

Non-convex regularization induces sparsity more effectively and is a popular alternative to convex functions.

Can we exploit the strong sparsity-inducing properties of non-convex penalties without forgoing the benefits of the convex approach?

47 / 59

Page 48: Ivan Selesnick March 23, 2017 - New York University

Parameterized sparsity-inducing non-convex penalty

Let φ( · ; a) : R → R be a penalty function with parameter a ≥ 0 satisfying

1. φ is continuous on R
2. φ is twice continuously differentiable, increasing, and concave on R+
3. φ(x; 0) = |x|
4. φ(0; a) = 0
5. φ(−x; a) = φ(x; a)
6. φ′(0+; a) = 1
7. φ′′(x; a) ≥ −a for all x ≠ 0

[Figure: the 1D parametric penalty function φ(x; a) for a = 0, 0.2, 0.5, 1.0.]

48 / 59

Page 49: Ivan Selesnick March 23, 2017 - New York University

Non-convex regularization, Convex optimization

Total variation denoising with convex regularization:

x = arg min_{x ∈ R^N} { F₀(x) = (1/2) ‖y − x‖₂² + λ ‖Dx‖₁ }

With non-convex regularization:

x = arg min_{x ∈ R^N} { Fₐ(x) = (1/2) ‖y − x‖₂² + λ ∑ₙ φ([Dx]ₙ; a) }

Can we constrain φ so that Fₐ is convex?

Proposition. Fₐ is strictly convex if

inf_{x ≠ 0} φ′′(x) > −1/(4λ).

When φ satisfies the properties above, Fₐ is strictly convex if

0 ≤ a < 1/(4λ).

• I. W. Selesnick, A. Parekh, and I. Bayram, “Convex 1-D total variation denoising with non-convex regularization,” IEEE Signal Processing Letters, vol. 22, pp. 141–144, Feb. 2015.

49 / 59

Page 50: Ivan Selesnick March 23, 2017 - New York University

Convex TVD with non-convex regularization

Set a to its maximal value to maximally induce sparsity of Dx while ensuring convexity of the objective function:

a = 1/(4λ)

[Figure: noisy data (σ = 0.50); TV denoising with the convex penalty (λ = 2.00, RMSE = 0.318); TV denoising with the non-convex (atan) penalty (λ = 2.00, RMSE = 0.247); denoising error for the convex and non-convex penalties.]

50 / 59

Page 51: Ivan Selesnick March 23, 2017 - New York University

Convex TVD with non-convex regularization

TVD with non-convex regularization:

x = arg min_{x ∈ R^N} { F(x) = (1/2) ‖y − x‖₂² + λ ∑ₙ φ([Dx]ₙ; a) }

TVD with non-separable non-convex regularization:

x = arg min_{x ∈ R^N} { F_nonsep(x) = (1/2) ‖y − x‖₂² + λ ∑ₙ ψ(([Dx]ₙ₋₁, [Dx]ₙ); a) }

where ψ : R² → R.

51 / 59

Page 52: Ivan Selesnick March 23, 2017 - New York University

Convex TVD with non-convex regularization

[Figure: surface and contour plots of (a) the separable, convex ℓ₁ norm (a₁ = a₂ = 0); (b) a separable non-convex penalty (a₁ = a₂ > 0); (c) the proposed non-separable non-convex penalty (a₁ > a₂ > 0).]

52 / 59

Page 53: Ivan Selesnick March 23, 2017 - New York University

Convex TVD with non-convex regularization

Define ψ( · ; a) : R² → R as

ψ(x) = (1 − r) [φ(x₁; α) + φ(x₂; α)] + r φ(x₁ + x₂; α),   x ∈ A₁
ψ(x) = (1 + r) φ(x₁; a₂) + φ(r x₁ + x₂; α),               x ∈ A₂
ψ(x) = (1 + r) φ(x₂; a₂) + φ(x₁ + r x₂; α),               x ∈ A₃     (53)

where the Aᵢ are subsets of R² defined as

A₁ = {x ∈ R² | x₁ x₂ ≥ 0},           (54)
A₂ = {x ∈ R² | x₁ (x₁ + x₂) ≤ 0},    (55)
A₃ = {x ∈ R² | x₂ (x₁ + x₂) ≤ 0}     (56)

and α and r are given by

α = (a₁ + a₂)/2,   r = (a₁ − a₂)/(a₁ + a₂) if a₁ + a₂ > 0,   r = 0 if a₁ = a₂ = 0.   (57)

Proposition. F_nonsep is strictly convex if

a₁ ≤ 1/(2λ),   a₂ ≤ 1/(4λ).

53 / 59

Page 54: Ivan Selesnick March 23, 2017 - New York University

[Figure: noisy data (σ = 0.50); TV denoising using the non-separable non-convex penalty; denoising error for the L1 penalty (RMSE = 0.318), the separable non-convex penalty (RMSE = 0.247), and the non-separable non-convex penalty (RMSE = 0.221).]

54 / 59

Page 55: Ivan Selesnick March 23, 2017 - New York University

Structured Sparsity with Overlapping Groups

A simple nonlinear thresholding approach to denoising is basis pursuit denoising

x = arg min_x { (1/2) ‖y − x‖₂² + λ‖x‖₁ } = soft(y, λ)
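
For reference, an illustrative one-line Matlab definition of the soft-thresholding rule used above (a standard form; the exact convention in the talk is assumed):

soft = @(y, T) max(1 - T./abs(y), 0) .* y;   % shrinks each entry toward zero by T
x = soft(y, lambda);                          % minimizer of (1/2)||y - x||_2^2 + lambda*||x||_1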

Thresholding does not capture group structure (clustering/grouping behavior).

To account for clustering/grouping, we may use

x = arg min_x { (1/2) ‖y − x‖₂² + λ ∑ₙ √( |x(n)|² + |x(n + 1)|² + |x(n + 2)|² ) }

for group size 3.

The optimization is more challenging due to coupling among all signal values x(n), but yields superior results for speech enhancement.

• P.-Y. Chen and I. W. Selesnick, “Translation-invariant shrinkage/thresholding of group sparse signals,” Signal Processing, vol. 94, pp. 476–489, Jan. 2014.

55 / 59

Page 56: Ivan Selesnick March 23, 2017 - New York University

Structured Sparsity with Overlapping Groups: Speech Enhancement

Speech signals exhibit structured sparsity in the time-frequency domain.

[Figure: spectrograms (frequency versus time in seconds, dB color scale) of the noise-free signal and the noisy signal.]

56 / 59

Page 57: Ivan Selesnick March 23, 2017 - New York University

Structured Sparsity with Overlapping Groups: Speech Enhancement

Scalar thresholding produces spurious noise spikes and musical noise.

[Figure: spectrogram after scalar thresholding, and a magnified view showing spurious noise spikes.]

57 / 59

Page 58: Ivan Selesnick March 23, 2017 - New York University

Structured Sparsity with Overlapping Groups: Speech Enhancement

New group shrinkage/thresholding (OGS) algorithm reduces musical noise.

[Figure: spectrogram after the OGS algorithm, and a magnified view; musical noise is reduced.]

58 / 59

Page 59: Ivan Selesnick March 23, 2017 - New York University

Summary

1. Basis pursuit

2. Basis pursuit denoising

3. Sparse Fourier coefficients using BP

4. Denoising using BPD

5. Deconvolution using BPD

6. Filling in missing samples using BP

7. Total variation denoising (TVD)

8. Sparse singularity-preserving signal smoothing (SIPS)

9. Non-convex regularization, convex optimization

10. Group sparsity

59 / 59

