
6.003: Signal Processing

Fourier-Based Audio Compression

• Review of Lossy Compression, Discrete Cosine Transform (DCT)

• Brief Introduction to MDCT

• Additional Considerations for Audio Encoding

2 May 2019

Today: Lossy Compression

As opposed to “lossless” compression (LZW, Huffman, zip, gzip, xz, ...), “lossy” compression achieves a decrease in file size by throwing away information from the original signal.

Goal: convey the “important” parts of the signal using as few bits as possible.

Lossy Compression

Key idea: throw away the “unimportant” bits (i.e., bits that won’t be noticed). Doing this involves knowing something about what it means for something to be noticeable.

Many aspects of human perception are frequency-based → many lossy formats use frequency-based methods (along with models of human perception).

Lossy Compression: High-level View

To Encode:

• Split signal into “frames”

• Transform each frame into Fourier representation

• Throw away (or attenuate) some coefficients

• Additional lossless compression (LZW, RLE, Huffman, etc.)

To Decode:

• Undo lossless compression

• Transform each frame back into its time/spatial representation

This is pretty standard! Both JPEG and MP3, for example, work roughly this way.

Given this, one goal is to get the “important” information in a signal into relatively few coefficients in the frequency domain (“energy compaction”).

Energy Compaction


It turns out the DFT has some problems in this regard. Consider the following signal, broken into 8-sample-long frames:

[Figure: the original signal, and one 8-sample “frame” extracted from it]

Why is the DFT undesirable in this case, given our goal of compression?

Discrete Cosine Transform

It is much more common to use the DCT (Discrete Cosine Transform) in compression applications. The DCT (or variants thereof) is used in JPEG, AAC, Vorbis, WMA, MP3, …

The DCT (more formally, the DCT-II) is defined by:

$$X_C[k] = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right)$$
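A direct implementation of this definition (a minimal sketch in Python/NumPy; the `dct2` name and the sanity check below are ours, not from the lecture):

```python
import numpy as np

# DCT-II exactly as defined above:
#   X_C[k] = (1/N) * sum_{n=0}^{N-1} x[n] * cos(pi/N * (n + 1/2) * k)
def dct2(x):
    N = len(x)
    n = np.arange(N)
    k = np.arange(N)
    C = np.cos(np.pi / N * np.outer(k, n + 0.5))  # N-by-N cosine basis
    return (C @ x) / N

# Sanity check: a constant signal puts all of its energy in X_C[0].
x = np.full(8, 3.0)
X = dct2(x)
print(X[0])                     # 3.0
print(np.allclose(X[1:], 0.0))  # True
```

(For large N, library routines such as `scipy.fft.dct` are preferable, but note that they use different normalization conventions than the one here.)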

DCT: Relationship to DFT

$$
\begin{aligned}
X_C[k] &= \frac{1}{N}\sum_{n=0}^{N-1} x[n]\cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right)\\
&= \frac{1}{2N}\sum_{n=0}^{N-1} x[n]\left(e^{j\frac{\pi}{N}(n+1/2)k} + e^{-j\frac{\pi}{N}(n+1/2)k}\right)\\
&= \frac{1}{2N}\,e^{-j\frac{\pi k}{2N}}\sum_{n=0}^{N-1} x[n]\left(e^{j\frac{\pi}{N}(n+1)k} + e^{-j\frac{\pi}{N}nk}\right)\\
&= \frac{1}{2N}\,e^{-j\frac{\pi k}{2N}}\left(\sum_{n=0}^{N-1} x[n]\,e^{-j\frac{2\pi}{2N}(-n-1)k} + \sum_{n=0}^{N-1} x[n]\,e^{-j\frac{2\pi}{2N}nk}\right)\\
&= \frac{1}{2N}\,e^{-j\frac{\pi k}{2N}}\sum_{n=-N}^{N-1} \tilde{x}[n]\,e^{-j\frac{2\pi}{2N}nk}\\
&= \left(e^{-j\frac{\pi k}{2N}}\right)\tilde{X}[k]
\end{aligned}
$$

where $\tilde{x}[\cdot]$ is given by the following, and the DFT coefficients $\tilde{X}[\cdot]$ are computed with an analysis window of length $2N$:

$$
\tilde{x}[n] = \tilde{x}[n+2N] = \begin{cases} x[n] & \text{if } 0 \le n < N\\ x[-n-1] & \text{if } -N \le n < 0 \end{cases}
$$
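This identity is easy to confirm numerically (a sketch; the variable names are ours). We build one period of x̃ by appending a mirrored copy of x, take a length-2N DFT with the 1/(2N) normalization used here, and undo the half-sample phase shift:

```python
import numpy as np

N = 8
rng = np.random.default_rng(1)
x = rng.standard_normal(N)

# Direct DCT-II with the 1/N normalization used in these notes.
n = np.arange(N)
k = np.arange(N)
Xc = np.cos(np.pi / N * np.outer(k, n + 0.5)) @ x / N

# One period of x~ over 0 <= n < 2N: x followed by its mirror image
# (for N <= n < 2N, x~[n] = x~[n - 2N] = x[2N - 1 - n]).
xt = np.concatenate([x, x[::-1]])
Xt = np.fft.fft(xt) / (2 * N)          # 2N-point DFT, 1/(2N) normalization

# X_C[k] = exp(-j pi k / (2N)) * X~[k] for 0 <= k < N.
phase = np.exp(-1j * np.pi * k / (2 * N))
print(np.allclose(phase * Xt[:N], Xc))  # True
```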

Discrete Cosine Transform

The DCT is commonly used in compression applications.

We can think about computing the DCT by first putting a mirrored copy of a windowed signal next to itself, and then computing the DFT of that new signal (shifted by 1/2 sample):

[Figure: an 8-sample “frame,” and the 16-sample shifted, mirrored frame formed from it]

Why is the DCT more appropriate, given our goals? How does this approach fix the issue(s) we saw with the DFT?

The Discrete Cosine Transform

$$X_C[k] = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\cos\left(\frac{\pi k\left(n+\frac{1}{2}\right)}{N}\right)$$

[Figure: for k = 0, …, 7, the real part Re(e^{j2πkn/N}) and imaginary part Im(e^{j2πkn/N}) of the DFT basis functions, alongside the DCT basis functions cos(π(k/N)(n+1/2))]

Energy Compaction Example: Ramp

For many authentic signals (photographs, etc.), the DCT has good “energy compaction”: most of the energy in the signal is represented by relatively few coefficients.

Consider the DFT vs. DCT of a “ramp,” x[n] = n for n = 0, …, 15:

[Figure: the ramp signal x[n], its DFT magnitudes |X[k]|, and its DCT magnitudes |XC[k]|]
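The compaction claim can be quantified directly (a sketch; the `energy_fraction` helper is ours, not from the lecture):

```python
import numpy as np

# Compare how much of the ramp's energy the largest DFT and DCT
# coefficients capture.
N = 16
x = np.arange(N, dtype=float)

Xf = np.abs(np.fft.fft(x)) / N                    # DFT magnitudes (1/N norm)
n = np.arange(N)
k = np.arange(N)
Xc = np.abs(np.cos(np.pi / N * np.outer(k, n + 0.5)) @ x / N)  # DCT magnitudes

def energy_fraction(mags, m):
    # Fraction of total coefficient energy in the m largest coefficients.
    e = np.sort(mags ** 2)[::-1]
    return e[:m].sum() / e.sum()

# The DCT packs substantially more of the energy into its top few
# coefficients than the DFT does.
print(energy_fraction(Xf, 3), energy_fraction(Xc, 3))
```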

Audio Compression

Last time, we looked at something akin to JPEG compression for images. At a high level, what we did was:

• 2D-DCT of 8-by-8 blocks of greyscale images

• In each block, zero out coefficients that are below some threshold

Let’s try the same approach with audio.
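That experiment might be sketched as follows (our own minimal version, using SciPy’s orthonormal DCT routines; the frame length and relative threshold are illustrative choices, not values from the lecture):

```python
import numpy as np
from scipy.fft import dct, idct

def compress_v1(x, frame_len=512, threshold=0.01):
    # Per-frame DCT; zero out coefficients below a relative threshold;
    # inverse DCT to reconstruct.
    y = np.zeros_like(x)
    for s in range(0, len(x) - frame_len + 1, frame_len):
        c = dct(x[s:s + frame_len], norm='ortho')
        c[np.abs(c) < threshold * np.max(np.abs(c))] = 0.0
        y[s:s + frame_len] = idct(c, norm='ortho')
    return y

# For example, a 440 Hz tone at a 44.1 kHz sampling rate:
x = np.sin(2 * np.pi * 440 * np.arange(4096) / 44100)
y = compress_v1(x)
```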

Audio Compression

That didn’t sound very good, really... :(

What were the most noticeable artifacts in the reconstructed version? Where did they come from? How did this compare to what we saw with JPEG?

Audio Compression v2

Let’s try a different approach:

Rather than zeroing out coefficients below the threshold, let’s quantize them differently (for example, use 8 bits for each coefficient below the threshold and 16 bits for each coefficient above the threshold).

How does this compare? What artifacts remain? How can we explain them?
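One way to sketch this two-level quantization (our own illustration; the uniform quantizer and the specific threshold are assumptions, not lecture code):

```python
import numpy as np

def quantize(c, bits, max_abs):
    # Uniform quantizer with 2**bits levels spanning [-max_abs, max_abs].
    step = 2 * max_abs / (2 ** bits)
    return np.round(c / step) * step

def encode_frame(coeffs, threshold):
    # 16 bits for coefficients at or above the threshold, 8 bits below,
    # instead of zeroing the small ones out entirely.
    out = np.empty_like(coeffs)
    scale = np.max(np.abs(coeffs))
    big = np.abs(coeffs) >= threshold
    out[big] = quantize(coeffs[big], 16, scale)
    out[~big] = quantize(coeffs[~big], 8, scale)
    return out

coeffs = np.array([10.0, 0.3, -7.0, 0.05])
out = encode_frame(coeffs, 1.0)
```

The small coefficients survive with coarse precision rather than vanishing outright, which is the point of the change.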

MDCT

The biggest issue with the last scheme was artifacts at the frame boundaries.

Many modern audio compression schemes (MP3, AAC, WMA, Vorbis, …) don’t use the DCT directly, but rather a related transform called the MDCT (Modified Discrete Cosine Transform), which mitigates these issues.

This is a lapped transform: 2N time-domain samples turn into N frequency-domain samples. By taking the transforms of overlapping windows and summing, we can reconstruct the original sequence exactly (similar to the overlap-add method we saw with the DFT). This principle is referred to as time-domain aliasing cancellation.

MDCT

[Figure: a signal x[n] (samples 0–500); two overlapping analysis windows; the MDCT and reconstruction of each windowed frame; and the sum of the reconstructions, which recovers the original signal]

MDCT

Formally, the MDCT is defined by:

$$X_M[k] = \frac{1}{2N}\sum_{n=0}^{2N-1} x[n]\cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right)$$

$$y[n] = \sum_{k=0}^{N-1} X_M[k]\cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right)$$

Including a window function on both x[·] and y[·] can avoid discontinuities at the endpoints. The MDCT is similar to the DCT in terms of energy compaction, but avoids issues with discontinuities at frame boundaries.
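These formulas can be implemented directly (a sketch; note that with the 1/(2N) forward normalization used here, the overlap-added sum comes out scaled by 1/2, so we multiply by 2 at the end; MDCT normalization conventions vary between references):

```python
import numpy as np

def mdct_basis(N):
    # cos(pi/N (n + 1/2 + N/2)(k + 1/2)), shape (2N, N)
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, k + 0.5))

def mdct(frame, N):
    return (frame @ mdct_basis(N)) / (2 * N)   # 2N samples -> N coefficients

def imdct(X, N):
    return mdct_basis(N) @ X                   # N coefficients -> 2N samples

N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(4 * N)

# Lapped transform: frames of 2N samples, hop size N (50% overlap).
# Each reconstructed frame contains time-domain aliasing, but summing
# the overlapping frames cancels it (TDAC).
y = np.zeros_like(x)
for s in range(0, len(x) - 2 * N + 1, N):
    y[s:s + 2 * N] += imdct(mdct(x[s:s + 2 * N], N), N)
y *= 2  # compensate for the 1/(2N) normalization (see note above)

print(np.allclose(y[N:-N], x[N:-N]))  # True in the fully overlapped region
```

Only the middle of the signal is recovered exactly; the first and last N samples are covered by a single frame, so their aliasing has nothing to cancel against.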

Audio Compression v3

Let’s look at a compression scheme that uses the MDCT.

What Else is There?

We have been able to achieve decent compression rates, but nothing close to MP3, for example. MP3 can achieve around a 6:1 compression ratio before expert listeners are able to distinguish between compressed and original audio.

This approach is actually somewhat similar to MP3, but we’re not quite there, so what are we missing?

Psychoacoustic Modeling

Importantly, our goal is ultimately to throw away information that is perceptually unimportant. To this end, MP3 includes a model of human perception of audio, including:

• Critical bands: neighborhoods of frequencies that excite the same nerve cells (∼25 distinct bands of varying bandwidth)

• Threshold of hearing: how loud must a signal be in order to hear it?

• Frequency masking: a loud component at a particular frequency “masks” nearby frequencies

• Temporal masking: when two tones are close together in time, one can mask the other.

High-level overview

MP3 encoding process broken down into steps:

1. Filter the audio signal into frequency sub-bands

2. Determine the amount of masking for each band caused by nearby bands (in time and in frequency) using the psychoacoustic model

3. If the signal is too small (or if it is “masked” by nearby frequencies), don’t encode it

4. Otherwise, determine the number of bits needed to represent it such that the noise introduced by quantization is not audible (below the masking effect)

5. Put these bits together into the proper file format
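Steps 3 and 4 above can be illustrated with a toy bit allocator (entirely our own sketch, not the real MP3 model; it assumes the common rule of thumb of roughly 6 dB of SNR per quantization bit):

```python
import math

def allocate_bits(signal_db, mask_db):
    # Step 3: if the band is at or below its masking threshold, skip it.
    if signal_db <= mask_db:
        return 0
    # Step 4: enough bits that the quantization noise stays below the
    # mask, assuming ~6 dB of SNR per bit.
    return math.ceil((signal_db - mask_db) / 6.0)

# (signal level dB, masking threshold dB) per band -- made-up numbers
bands = [(70, 40), (35, 45), (52, 50)]
print([allocate_bits(s, m) for s, m in bands])  # [5, 0, 1]
```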

Other Concerns

Other domain-specific codecs may use other strategies; for example, some audio codecs designed to compress speech (as opposed to music, etc.) will use something like LPC (discussed in the lecture on speech). They can then use a small number of bits to represent the parameters of the model, and use some additional bits to represent differences from that prediction.

