
Kamran Azadet and Chris J. Nicol

Bell Laboratories, Lucent Technologies

In many DSP-based high-speed modem applications, such as broadband modems for high-speed Internet access to the home

ceivers, channel equalization requires processing power so high that power consumption and clock speed become major design challenges. This article describes techniques to implement low-cost adaptive equalizers for ASIC implementations of broadband modems. Power consumption can be reduced using a careful selection of architectural, algorithmic, and VLSI circuit techniques. The derivation of a hybrid FIR filter structure is given that enables the designer to adjust both the speed and power consumption to suit an application. Furthermore, the architecture can be made programmable to target multiple applications in one piece of silicon while maintaining or even improving the efficiency of the architecture. Runtime techniques are shown that can minimize the power consumption for a given application or operating environment. In all cases, the power reduction techniques are supported by simulations and measurements made on a test integrated circuit.

The considerable interest in high-speed communication over unshielded twisted pair (UTP) channels is primarily due to the vast installed base of copper wiring to the desktop and into the home. As data transfer rates increase, so too do the digital signal processing (DSP) requirements of the solutions. Most of the DSP performance is needed in the receiver circuits where channel equalization is performed. This increases the cost of the transceiver. With software modems appearing on the market, the question arises of whether general-purpose microprocessors (or programmable DSPs) can be used to implement broadband modems. While this may be true for low- to medium-bit-rate asymmetric digital subscriber line (ADSL) modems [1], we argue that for very-high-bit-rate applications like very high rate DSL (VDSL), even the most state-of-the-art DSPs lack the performance required to enable their use in a cost-effective solution, so application-specific integrated circuit (ASIC) solutions are still the most viable. Furthermore, significant gains can be made over a VHDL synthesized approach by employing a full custom design methodology. An example of this is demonstrated by the use of small multiported register files for data and coefficient storage. These are found to be more efficient than using a series of flip-flops for the same task.

From our own experiments, a VDSL receiver requires approximately 3.5 billion multiply-accumulate operations per second to perform channel equalization. This is still an order of magnitude greater performance than can be achieved via maximum utilization of the fastest programmable DSP chips available today. Given that these DSPs cost hundreds of dollars, they are still too costly for use in a VDSL system. A programmable solution to VDSL is clearly not feasible in today's technology. The question facing transceiver designers is, "How do we build a DSP circuit that performs 3.5 billion multiply-accumulates per second and yet costs less than $10?" Given that the VDSL receiver circuit will reside in a consumer set-top box product, cost is paramount to success. To address this we need to break down the cost of communications chips. It turns out that the cost of a chip is dominated by silicon area and packaging. Naturally, the smaller the chip, the cheaper the package and therefore the solution. However, irrespective of chip area, if a chip dissipates > 1 W, it may require an expensive plastic package. If it dissipates > 2 W it is likely to be placed in a ceramic package, which makes the solution very expensive and noncompetitive. Taking only the multiplier-accumulator (MAC) requirements of the VDSL receiver circuit and manufacturing a chip that contains that number of multiplier circuits will already exceed the 2 W limit (according to our own studies). Clearly, significant power reduction techniques are required. In this article we describe many such techniques that combine to reduce the power of a VDSL (or related) receiver to less than 600 mW. This enables the use of a cheap package and leaves room for the integration of additional functions to further reduce cost.

The architecture of an all-digital transceiver for high-speed communication over UTP channels is shown in Fig. 1. On the transmit side, the data is scrambled, encoded, fed through a fixed coefficient shaping filter, and sent to a D/A converter. On the receive side, the incoming signal is fed through an A/D converter, processed by a feed-forward linear equalizer (FFLEQ) (the focus of this article), and then sent to a decision device (or slicer) and descrambler. A decision feedback equalizer (DFE) is sometimes used to remove post-cursor intersymbol interference (ISI). Depending on the configuration of the system, a near-end crosstalk canceller is sometimes used as well as an echo canceller (for full duplex transmission). A good system-level overview is given in [2].

Figure 1. Typical transceiver architecture for high-speed UTP communication.

0163-6804/98/$10.00 © 1998 IEEE. IEEE Communications Magazine, October 1998

Although there is much computation performed in many parts of the system, this article focuses on the implementation of the FFLEQ because it is typically the most computationally intensive. The FFLEQ is essentially an adaptive finite impulse response (FIR) filter and is sometimes “fractionally-spaced” (FSLEQ). Depending on the application, the computation needed in the coefficient updating circuits equals or exceeds that required in the FIR filter.

After a brief introduction on channel equalization, a modular adaptive FIR filter architecture is presented that provides programmability, high performance, and reduced power. Its low-power VLSI implementation is described in detail. A section is dedicated to power optimization using runtime error monitoring techniques. Finally, measured results of a fractionally spaced linear equalizer (FSLEQ) demonstrate the effectiveness of the above techniques.

CHANNEL EQUALIZATION

In typical UTP communication systems, the channel characteristics are unknown. An equalizer [3] is inserted into the path of the received signal before a decision device, as indicated in Fig. 1. The equalizer weights past values of the received signal with a filter response W(k) to compute the filter output:

$$y_k = \sum_{n=0}^{N-1} w(k,n)\, x(k-n) \qquad (1)$$

This describes an FIR filter with a span of N symbol periods and a set of N adaptive filter coefficients: $W(k) = \{w(k,0), w(k,1), \ldots, w(k,N-1)\}$. The vector X is the set of the N most recent samples of the received signal: $\{x(k), x(k-1), x(k-2), \ldots, x(k-(N-1))\}$. The decision device computes an estimate, $\hat{I}_k$, of the transmitted symbol $I_k$ using a minimum distance function based on $y_k$ and the set of available symbols. The equalizer adapts W(k) in order to maximize the probability that $\hat{I}_k = I_k$.

For example, in the Least Mean Square (LMS) updating algorithm [4] the coefficients are updated according to

$$w(k+1,n) = w(k,n) + \mu\, e_k\, x(k-n) \quad \text{for } n = 0, \ldots, N-1 \qquad (2)$$

where $e_k$ is the error signal and $\mu$, the adaptation step size, is a small positive constant. The error is essentially

$$e_k = \hat{I}_k - y_k \qquad (3)$$

Figure 2. An example of an LMS algorithm.

Figure 3. Performance of full-multiplier, power-of-two, and sign-sign LMS algorithms.
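Equations 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the decision-directed start from a pass-through filter, and the test channel in `demo_channel` (2-PAM symbols through a hypothetical channel h = [1, 0.4] with no noise) are all invented here.

```python
import random

def slicer(y, alphabet):
    """Minimum-distance decision: the alphabet symbol closest to y."""
    return min(alphabet, key=lambda s: abs(s - y))

def lms_equalizer(x, alphabet, num_taps=8, mu=0.02):
    """Decision-directed LMS equalizer following Eqs. 1-3.

    Returns the final taps and the per-symbol error sequence.
    """
    w = [1.0] + [0.0] * (num_taps - 1)   # assumed start: pass-through filter
    buf = [0.0] * num_taps               # buf[n] holds x(k-n)
    errors = []
    for sample in x:
        buf = [sample] + buf[:-1]
        y = sum(wn * xn for wn, xn in zip(w, buf))         # Eq. 1
        e = slicer(y, alphabet) - y                        # Eq. 3
        w = [wn + mu * e * xn for wn, xn in zip(w, buf)]   # Eq. 2
        errors.append(e)
    return w, errors

def demo_channel(num_symbols=4000, seed=1):
    """Hypothetical test case: 2-PAM symbols through h = [1, 0.4]."""
    rng = random.Random(seed)
    s = [rng.choice((-1.0, 1.0)) for _ in range(num_symbols)]
    x = [s[k] + (0.4 * s[k - 1] if k > 0 else 0.0) for k in range(num_symbols)]
    return s, x
```

With this mild channel the eye is open from the first symbol, so the slicer decisions can drive the adaptation without a training sequence; a harsher channel would need known training symbols in place of the slicer output during start-up.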

FRACTIONALLY SPACED EQUALIZATION

It is possible to combine the matched filtering operation and the synchronous equalizer in one transversal filter. Called a fractionally spaced linear equalizer (FSLEQ), such a filter has taps at spacings $\tau$ which are smaller than T [4] and is said to be M-spaced when operating with a sample period $\tau = T/M$. A filter with a span of N symbols and a spacing of M has NM taps. The samples for symbol k are $x(kM-m)$ where $m = 0, \ldots, M-1$. The filter output becomes

$$y_k^m = \sum_{n=0}^{NM-1} w(k,n)\, x(kM-m-n) \quad \text{for } m = 0, \ldots, M-1 \qquad (4)$$

This produces M filter outputs for each symbol from which one is arbitrarily selected, for example, $y_k = y_k^s$ where s is one of $\{0, \ldots, M-1\}$. The LMS updating then becomes

$$w(k+1,n) = w(k,n) + \mu\, e_k\, x(kM-s-n) \quad \text{for } n = 0, \ldots, NM-1. \qquad (5)$$

SIMPLIFICATIONS OF THE LMS UPDATING ALGORITHM

A simple graphical illustration of an adaptive algorithm is shown in Fig. 2 assuming a two-tap equalizer with $[w_1\ w_2]$ as coefficients. The LMS algorithm uses a steepest-descent method. The adaptation at iteration k causes the coefficients to move to a new point by an amount proportional to $[\mu e_k x_{k-1}\ \ \mu e_k x_{k-2}]$, which is a gradient of the cost function $|e|^2$ where e is described by Eq. 3. The cost of implementing the update equations is often reduced by using a simplified update scheme. Instead of using full-precision multipliers to calculate $\mu e_k x_{k-n}$ in Eq. 5, a power-of-two scheme can be used. By first taking the base-2 logarithm of either $e_k$ or $x_{k-n}$ (or both), the updating can be done with a barrel shifter and an adder. Further simplification is achieved by the sign-sign algorithm, where the incremental update is proportional to sgn(e) × sgn(x) [3], requiring only an adder.

As observed in [5] and illustrated in Fig. 3, the speed of convergence of the sign-sign algorithm is significantly slower than the full-precision algorithm. The power-of-two update does not suffer from such degradation. An intuitive explanation of this is that the sign-sign algorithm loses information about the step size and always takes a unit step size each update iteration, whereas the other algorithms utilize a reasonable estimate of the ideal step. To speed up the sign-sign algorithm a larger $\mu$ could be chosen, but this would give large steps when close to the optimum point, which would cause large noise and degrade the performance. Thus, to compare different algorithms the values of $\mu$ should be selected to give similar mean square errors when converged, as in Fig. 3.

It has been pointed out that implementations with limited precision may suffer from numerical problems giving a tap drift phenomenon. A potentially dangerous situation occurs when taps drift to the maximum amplitude and cause tap overflow. A proposed solution is the tap leakage algorithm [4]. In addition, tap saturation prevents disastrous tap overflow.

Figure 4. Canonical implementation for FIR filters: a) direct form; b) transposed form.

Figure 5. Systolic implementation of FIR filters.

Table 1. Coefficient latency for different FIR filter implementations (rows: output path, coefficient path; surviving coefficient path entries: 0, 15, 5, 10 for N = 16, n = 3).

Figure 6. An FIR filter using hybrid form I.

Figure 7. An FIR filter using hybrid form II.
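The three per-tap update rules discussed above, together with tap leakage and saturation, can be compared in a small sketch. The function names are hypothetical, the power-of-two rule is assumed to quantize only the error term, and the leakage is modeled as a constant step toward zero, which is one common form and not necessarily the one used in [4].

```python
import math

def sign(v):
    """Return -1, 0, or +1."""
    return (v > 0) - (v < 0)

def quantize_pow2(v):
    """Round |v| down to a power of two, keeping the sign (0 stays 0)."""
    if v == 0:
        return 0.0
    return math.copysign(2.0 ** math.floor(math.log2(abs(v))), v)

def update_full(w, e, x, mu):
    """Full-precision LMS step for one tap."""
    return w + mu * e * x

def update_pow2(w, e, x, mu):
    """Power-of-two step: multiplying by quantize_pow2(e) is a shift in hardware."""
    return w + mu * quantize_pow2(e) * x

def update_sign_sign(w, e, x, mu):
    """Sign-sign step: always moves by exactly mu (or not at all)."""
    return w + mu * sign(e) * sign(x)

def leak_and_saturate(w, leak, limit):
    """Pull the tap toward zero by `leak`, then clip it to [-limit, +limit]."""
    w = 0.0 if abs(w) <= leak else w - leak * sign(w)
    return max(-limit, min(limit, w))
```

The sign-sign rule discards the magnitude of both the error and the sample, which is the loss of step-size information that the text identifies as the cause of its slower convergence.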

LOW-POWER EQUALIZER

FIR FILTER ARCHITECTURES

Finite impulse response (FIR) filters have two well-known canonical implementations, called the direct and transposed forms, as shown in Fig. 4.

In high-speed applications both suffer from important drawbacks. The direct form (a) has a computation delay or critical path corresponding to the time required to complete all the multiply and add operations. This delay is an increasing function of the number of taps, and at the same time is subject to a very important constraint: it needs to be strictly less than a clock period. The transposed form overcomes this problem by inserting intermediate delays between each multiply/add operation. In this case the critical path flows through one multiplication and one addition only. For filters with many taps, however, the capacitance of the data input bus can limit performance.
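The difference between the two canonical forms can be made concrete with a cycle-by-cycle sketch. The function names are hypothetical and registers are assumed to start at zero. In `fir_direct` every output is a sum over all taps within one cycle, whereas in `fir_transposed` each partial sum needs only one multiply and one add per cycle.

```python
def fir_direct(w, x):
    """Direct form: delay line on the input, all taps summed in one cycle."""
    delay_line = [0.0] * len(w)          # delay_line[n] holds x(k-n)
    out = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        out.append(sum(wn * xn for wn, xn in zip(w, delay_line)))
    return out

def fir_transposed(w, x):
    """Transposed form: input broadcast to all taps, registers in the output path."""
    n_taps = len(w)
    regs = [0.0] * (n_taps + 1)          # regs[n] = last cycle's partial sum p[n]
    out = []
    for sample in x:
        # Each partial sum is one multiply plus one add (the critical path).
        p = [w[n] * sample + regs[n + 1] for n in range(n_taps)]
        out.append(p[0])
        regs = p + [0.0]                 # regs[n_taps] is always zero
    return out
```

For fixed coefficients both functions return the same sequence; the list comprehension over all taps in `fir_transposed` also hints at the broadcast input bus whose capacitance the text mentions.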

To make the computation delay independent of the number of taps, systolic architectures are used. This is a class of modular structures where extra pipeline "barriers" are introduced in order to minimize critical paths. Figure 5 shows two well-known forms of systolic FIR filters due to H. T. Kung [6].

Calling the z-transform of the input and output signals X(z) and Y(z), respectively:

for the case in Fig. 5a:

In case (b) the output of the filter is delayed by N clock cycles, where N is the number of taps. This is due to the introduction of extra registers. In many applications latency is not acceptable. In such applications the configuration of case (a) can be used instead. However, the clock frequency would have to double in order to keep the same throughput. This can be a serious problem when the clock rate is already very high, and makes this architecture less attractive.


Hybrid FIR Form - It is possible to derive a pipelined structure without introducing extra latency by implementing the FIR filters in a hybrid form (i.e., a mixture of direct form and transposed form). Figure 6 shows how the direct form can be transformed into a modular architecture called here the hybrid form. The circuit shown in Fig. 6 corresponds to the case of a 5-tap filter implemented as 3-tap modules (N = 5, n = 3). This architecture is similar to that proposed in [7].

Another version of the hybrid form is shown in Fig. 7. It is derived from the transposed form and is therefore closer to that form than the direct form. In contrast, the first hybrid form is closer to the direct form than the transposed one.

Figure 8. Direct form LMS adaptive FIR filter architecture.

The hybrid forms are modular in that they consist of a pipeline of identical stages as per the systolic architectures. At the same time they also provide zero latency and fewer registers than the systolic architectures. Hybrid forms are therefore excellent choices for high-speed/low-power applications and a very good compromise between the direct and transposed forms. It is interesting to notice that the extreme cases n = 1 and n = N correspond to the transposed and direct forms, respectively. Infinite impulse response (IIR) filters can also be implemented using hybrid forms. Hybrid form IIR filter architectures are included in the appendix.
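A behavioral sketch helps to see why a hybrid arrangement computes the same output as the canonical forms. The model below is an assumption and not a transcription of Fig. 6: each module evaluates a direct-form partial sum over n taps, and the modules are chained through n register delays in the output path. The function names are hypothetical and registers start at zero.

```python
def delay(signal, d):
    """Delay a sequence by d samples, assuming zero initial register state."""
    return ([0.0] * d + list(signal))[:len(signal)]

def fir_block_direct(w_block, x):
    """Direct-form partial sum over the taps of one module."""
    return [sum(w_block[j] * x[k - j] for j in range(len(w_block)) if k - j >= 0)
            for k in range(len(x))]

def fir_hybrid(w, x, n):
    """N-tap FIR built from n-tap direct-form modules in a transposed-style chain."""
    blocks = [w[i:i + n] for i in range(0, len(w), n)]
    acc = [0.0] * len(x)
    for w_block in reversed(blocks):     # start at the far end of the chain
        partial = fir_block_direct(w_block, x)
        acc = [p + a for p, a in zip(partial, delay(acc, n))]
    return acc
```

With n = 1 every module is a single multiplier followed by a register, which is the transposed form; with n = N there is one module and no output register, which is the direct form. Intermediate values of n trade critical-path length against the number of wide output-path registers.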

ADAPTIVE FIR USING A HYBRID FORM

Even though the two canonical realizations of an FIR filter (direct and transposed) are functionally equivalent when the coefficients have a fixed value, they do not perform the same operation when the coefficients vary. For example, in Fig. 4a coefficients w0, ..., w4 contribute to the output signal at the same time. In the implementation of Fig. 4b w0 is not delayed, w1 is delayed by one clock cycle, and so on up to w4, which is delayed by four clock cycles. Hence, in the transposed form, synchronizing the coefficient path requires delaying w0 by four clock cycles, w1 by three clock cycles, w2 by two clock cycles, and w3 by one; finally, w4 is not delayed. This allows us to "equalize" the delays in the paths from coefficients to filter output. In the example of the transposed form with 5 taps, the paths from coefficient to output signal are delayed by four clock cycles. In contrast, in the direct form the coefficient paths are not delayed at all. The same coefficient delay equalization principle can be applied to hybrid forms I and II. Table 1 summarizes the time delay for input/output paths and coefficient/output paths of the above four different implementations of an N-tap FIR filter.
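The coefficient delay equalization described above can be checked numerically. In this sketch (hypothetical function names, zero initial state) `coeffs[k][n]` is the value presented to multiplier n at clock k. Delaying tap n by N-1-n cycles in the transposed form gives the same output as a direct form whose coefficients are all delayed by N-1 cycles.

```python
def transposed_time_varying(coeffs, x):
    """Transposed form with coefficients that may change every clock."""
    n_taps = len(coeffs[0])
    regs = [0.0] * (n_taps + 1)
    out = []
    for k, sample in enumerate(x):
        p = [coeffs[k][n] * sample + regs[n + 1] for n in range(n_taps)]
        out.append(p[0])
        regs = p + [0.0]
    return out

def direct_time_varying(coeffs, x):
    """Direct form: all taps use the coefficient values of the current clock."""
    n_taps = len(coeffs[0])
    return [sum(coeffs[k][n] * x[k - n] for n in range(n_taps) if k - n >= 0)
            for k in range(len(x))]

def delay_taps(coeffs, delays):
    """Delay coefficient n by delays[n] clocks (zero before the first value)."""
    return [[coeffs[k - d][n] if k - d >= 0 else 0.0
             for n, d in enumerate(delays)]
            for k in range(len(coeffs))]
```

For a 5-tap filter the equalizing delays are [4, 3, 2, 1, 0], matching the four-cycle coefficient-to-output latency quoted in the text for the transposed form.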

Except for the direct form where coefficient delay is 0, hybrid form I has the lowest delay in the coefficient paths, about n times lower than the transposed form. Besides coefficient delay, the transposed form has the highest power consumption among the above structures, since the number of bits in the output path (accumulation path) is usually much greater than the number of bits in the input path. This typically translates into a larger number of register cells or flip-flops and hence increased power consumption. The direct form has both the lowest power and the lowest coefficient latency, but also the lowest speed. The transposed form has the highest speed, but also has the highest coefficient delay and power. It becomes clear that the hybrid forms are a good architectural compromise between direct and transposed forms for both power and speed.

The conventional LMS tap update in Eq. 2 is implemented using the filter architecture shown in Fig. 8.

In the case where all filter taps are delayed by D clock cycles [8], the LMS tap update is given by

$$W_k(n+1) = W_k(n) + \mu\, e(n-D)\, x_k(n-D). \qquad (6)$$

An implementation of such an update algorithm using a direct form FIR filter is shown in Fig. 10a. We first introduce the retiming technique, shown in Fig. 9, to derive a filter architecture based on the hybrid form. In a system [8], applying a unit delay to all inputs is equivalent to applying a unit delay to all outputs.

Figure 9. Retiming: basic principle.

Figure 10. Using retiming to derive the hybrid form LMS adaptive FIR filter.

Figure 11. Evolution of the FIR filter architecture: a) fractionally spaced FIR filter; b) using time-multiplexed multipliers; c) fractionally spaced hybrid FIR architecture.

Applying this retiming principle to the proper choice of subsystems in the graph of Fig. 10a, it is possible to derive an equivalent graph based on the hybrid form, as shown in Fig. 10b. Consider subsystem 1 in Fig. 10a, highlighted with dotted lines. This subsystem has three inputs, i1, i2, and i3, and two outputs, o1 and o2. A register is located at each input. By moving these input registers to outputs o1 and o2, the timing of the overall system is not modified. Repeating the same operation for subsystem 2 in Fig. 10a, the graph of Fig. 10b is obtained.

Equation 4 indicates that we need M × N multipliers operating at the sample rate, as shown in Fig. 11a. However, because of the decimation at the filter output, only one of every M samples at the filter output need be evaluated. An area-efficient implementation of this for a fractional spacing of M = 3 is shown in Fig. 11b. The accumulator at the filter output is cyclically reset during one sample every M samples. In the first cycle the accumulator value is

$$A_0 = w_2 x_9 + w_5 x_6 + w_8 x_3 + w_{11} x_0,$$

assuming the last x input sample is labeled $x_9$. In the next cycle, the accumulator evaluates

$$A_1 = w_1 x_{10} + w_4 x_7 + w_7 x_4 + w_{10} x_1 + A_0,$$

and finally,

$$A_2 = w_0 x_{11} + w_3 x_8 + w_6 x_5 + w_9 x_2 + A_1,$$

which is equivalent to Eq. 4 with m = 0.
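The accumulation schedule above generalizes to any M and N. The following sketch (hypothetical function names; the sample list is assumed to hold at least N·M values, with the newest one last) reproduces A0, A1, and A2 for M = 3 with four multipliers and compares the result with a direct evaluation of Eq. 4 for m = 0.

```python
def time_multiplexed_output(w, x, M):
    """One symbol-rate output computed with N multipliers over M clock cycles."""
    N = len(w) // M
    last = len(x) - 1
    acc = 0.0                            # accumulator reset at the start of a symbol
    for c in range(M):
        newest = last - (M - 1) + c      # index of the sample arriving in cycle c
        # Multiplier j uses coefficient w[(M-1-c) + j*M] in this cycle.
        acc += sum(w[(M - 1 - c) + j * M] * x[newest - j * M] for j in range(N))
    return acc

def full_rate_output(w, x):
    """Eq. 4 with m = 0, evaluated with one multiplier per tap."""
    last = len(x) - 1
    return sum(w[n] * x[last - n] for n in range(len(w)))
```

For 12 taps and samples x0 to x11, cycle c = 0 pairs w2, w5, w8, w11 with x9, x6, x3, x0, exactly as in the expression for A0.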

The adders in the output path of the filter are best implemented as carry-save adders (CSAs) which are smaller and faster and consume less power than carry-propagate adders (CPAs). In CPAs the carry bits "ripple" or "propagate" through the adder, causing spurious transitions and increased power consumption. In contrast, in CSAs the carry bits do not ripple through the adder, but instead create a separate output: the carry vector. The final sum and carry vectors are summed in the accumulator shown in Fig. 11b. The reduced delay of the CSA allows many additions to be performed in a single clock cycle. This enables us to make the architecture look more like the direct form by moving some of the delays from the output path back up to the input path where the word size is much smaller (due to the two flip-flops per bit of output precision required by carry-save arithmetic). In this example, three out of every four delays in the output path are relocated to the input path, giving a significant reduction in the number of flip-flops, as shown in the hybrid form in Fig. 11c. Note that only two registers are needed at the input path on the last multiplier because of the single delay in the output path. This single delay also means that the coefficients in the next stage of the filter (not shown) must be "rotated" so that the correct output is obtained.
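The carry-save principle can be illustrated at the bit level with Python integers. This is a sketch with hypothetical function names, restricted to non-negative operands; it models the arithmetic only and says nothing about delay or power.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three operands to a sum vector and a carry vector."""
    sum_vec = a ^ b ^ c                               # per-bit sum, no carry ripple
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carry, shifted up
    return sum_vec, carry_vec

def carry_save_accumulate(operands):
    """Accumulate many operands while staying in carry-save form."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s, c

def resolve(sum_vec, carry_vec):
    """Single carry-propagate addition performed once at the end."""
    return sum_vec + carry_vec
```

The pair (sum vector, carry vector) is why a carry-save register needs two flip-flops per bit of output precision, which motivates moving delays from the output path to the narrower input path.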

ADDING PROGRAMMABILITY WITHOUT INCREASING POWER

The input pipeline registers used to achieve fractional spacing in Fig. 11c consume a lot of power because every register partakes in an exchange of data in every cycle. Given that there is often little correlation in the input data from the channel, this results in the input data pipeline having a high switching activity factor. This can be reduced by replacing each group of registers with a small dual-ported register file configured as a FIFO. These FIFOs can be programmed to provide variable latency for programmability. The register file has separate read and write ports. The addresses for the ports are controlled via a global address generation unit that can be programmed to generate the address sequences for various types of filters. This approach was used in the test chip described in the last section to enable different fractional spacings (T/2, T/3, and T/4).

Figure 12. Scalable, programmable, hybrid FIR architecture with register files for data and coefficient storage.


We have since improved this architecture in an attempt to provide increased programmability. If the register files have "look-through" mode, and the port addresses are pipelined against the flow of data (as shown in Fig. 12), the filter can support various types of single- and multirate filtering algorithms with arbitrary time multiplexing of coefficients to the multipliers. The filter in Fig. 12 provides up to 64 taps. It is divided into four blocks (B0 ... B3), where each block contains four multiply-add units (M0 ... M3). Each multiply-add unit contains two register files, one for the input data (R1) and one for the storage of coefficients (R2). A delay is placed in the filter accumulation path between each block. This corresponds to the single delay in the output path in Fig. 11c (thereby making the hybrid form FIR). The addressing of the input data register files is different for the first multiply-add unit of each block (this is done to counteract the delay in the output path: every fourth register file must have one cycle less latency than the others). The DW and DA signals are the write enable and data addresses for the data register files, respectively. Normally, DA is used as the address for both reading and writing. However, in the first register file of every block, the DA address (and DW signal) is delayed by 1 cycle to create a "delayed write." These signals can be "tapped" from the previous block (because of the delays between the blocks). The CA address is used for the coefficient register files. These are also pipelined, removing the need to "rotate" the coefficients in the register files within each block. For adaptive filters the CA is used for both reading and writing. To preserve zero latency, the read/write addresses for the first data register file in the leftmost filter block can be supplied separately. To make an adaptive filter from this architecture, a similar pipeline is constructed to distribute the input data to the updating circuits. A more complete description of this architecture and of the techniques used to map filters onto it and to generate the appropriate sequences of DW, DA, and CA addresses is given in [9].

This architecture provides an excellent mix of scalability, programmability, performance, and efficiency for future broadband modem applications. It is conceivable that a single modem chip could be capable of being configured on the fly to demodulate signals from various sources. For example, the architecture can support T/2 with either 2 or 4 taps/multiplier as well as T-spaced with either 1, 2, 3, or 4 taps/multiplier. All this functionality is controlled via the addressing of the register files.

The circuit for a dual-ported register file is given in [10]. It provides master-slave operation and dissipates slightly more power than a single flip-flop but provides latency of up to four flip-flops. Therefore, the savings achieved by making this hybrid architecture more like the direct form are very significant. The registers are "absorbed" into the latency of the register files. The cost in terms of power of increasing the size of a register file is incremental, whereas the savings achieved by removing carry-save registers from the output path is significant.

LOW-POWER VLSI IMPLEMENTATION

FIR Filter - The use of carry-save arithmetic in the FIR filter enables the summation of the multiplier results in a tree of CSAs. The four multipliers in each block can essentially be merged into a single Wallace-tree multiplier structure containing four times the number of partial products. Wallace trees are known to consume less power than array multipliers: their balanced delay paths result in reduced spurious transitions, even though they contain increased routing capacitance.

Table 2. Conventional and low-power Booth encoding truth tables.

Input bits   Multiple   Conventional (x1 x2 NEG)   Low-power (x1 x2 NEG)
0 0 0         0         0 0 0                      0 0 0
0 0 1        +1x        1 0 0                      1 0 0
0 1 0        +1x        1 0 0                      1 0 0
0 1 1        +2x        0 1 0                      0 1 0
1 0 0        -2x        0 1 1                      0 1 1
1 0 1        -1x        1 0 1                      1 0 1
1 1 0        -1x        1 0 1                      1 0 1
1 1 1         0         0 0 1                      0 0 0

Figure 13. Measured power per multiplier in FIR filter employing time-multiplexed Booth recoded multipliers. (Horizontal axis: amplitude of time-multiplexed input, in bits; curves: lowpow = 0, lowpow = 1, and fixed coefficients.)

The cascaded input is added at the bottom of the tree to further minimize spurious transitions. The sign extension bits from each multiplier can be removed and combined as a single vector, then added to the cascaded input, where they are out of the critical path. This also helps to balance the logic depth in the multiplier tree.

Booth recoding is frequently used to halve the number of partial products in the multiplier, making it faster and smaller. The recoding of partial products in the multiplier input using the Booth algorithm involves taking groups of 3 bits (two consecutive groups overlap by one bit) and generating for each group a signed digit from the set (-2, -1, 0, +1, +2). Each recoded digit is used to select a multiple of the multiplicand to create a partial product. The multiplicand multiples are relatively easy to generate. One approach is to generate x1, x2, and NEG signals (Table 2). The x1 and x2 signals select the multiplicand and the multiplicand shifted left by one position, respectively. The NEG signal is used to conditionally complement the output to generate the partial product.
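The recoding just described can be sketched behaviorally. The function names and the `single_zero` flag are hypothetical; the flag mirrors the low-power column of Table 2, where the group 1 1 1 is encoded with NEG = 0. The multiplier operand is assumed to be a two's-complement value with an even number of bits, and negation is modeled arithmetically instead of as a bitwise complement plus carry.

```python
def booth_groups(value, bits):
    """Yield overlapping 3-bit groups (b_hi, b_mid, b_lo), least significant first."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    pattern = value & ((1 << bits) - 1)   # two's-complement bit pattern
    b_lo = 0                              # implicit 0 to the right of the LSB
    for i in range(0, bits, 2):
        b_mid = (pattern >> i) & 1
        b_hi = (pattern >> (i + 1)) & 1
        yield b_hi, b_mid, b_lo
        b_lo = b_hi                       # consecutive groups overlap by one bit

def booth_signals(b_hi, b_mid, b_lo, single_zero=False):
    """Return (x1, x2, NEG) for one group, following Table 2."""
    x1 = b_mid ^ b_lo
    x2 = int((b_hi, b_mid, b_lo) in ((0, 1, 1), (1, 0, 0)))
    neg = b_hi
    if single_zero and (b_hi, b_mid, b_lo) == (1, 1, 1):
        neg = 0                           # low-power variant: one encoding of zero
    return x1, x2, neg

def booth_multiply(multiplicand, multiplier, bits, single_zero=False):
    """Sum the selected partial products; `bits` is the multiplier word length."""
    product = 0
    for j, group in enumerate(booth_groups(multiplier, bits)):
        x1, x2, neg = booth_signals(*group, single_zero=single_zero)
        partial = x1 * multiplicand + x2 * (multiplicand << 1)
        if neg:
            partial = -partial
        product += partial << (2 * j)     # each digit has weight 4**j
    return product
```

Both encodings give the same product, because a zero partial product is unchanged by negation; they differ only in the switching they cause in hardware.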

It can be shown that the filter response can have significant impact on the power consumption of an FIR filter if the filter coefficients are applied to the Booth recoded inputs of the filter multipliers. The power of the Booth recoded multiplier depends essentially on the number of 0s in the Booth recoded input, since a zero digit selects a null partial product. The input samples are considered random (and scaled to full dynamic range), but only a few "primary" coefficients in a typical filter response require the full precision of the multiplier. Therefore, Booth recoding the coefficients will exploit the distribution of the coefficients in the filter response to minimize power. However, when using resource sharing in a programmable filter with time-multiplexed multipliers, the coefficient inputs typically cycle through a set of neighboring coefficients, and these often change sign (especially around the near-zero "tail" taps). To avoid significant power consumption caused by sign extension, the Booth recoding circuit should be made to have a single representation of 0. Table 2 shows the truth table for conventional and low-power Booth encoding techniques (note that they differ only by one bit in the last row).

Figure 14. The coefficient update circuit.

The graph in Fig. 13 shows the power consumption of a multiplier with both fixed and time-multiplexed coefficients for different coefficient amplitudes. The lowpow = 0 and lowpow = 1 curves are with the regular Booth and single-zero Booth recoding circuits [10], respectively. These results demonstrate that the power of a multiplier with time-multiplexed coefficient inputs can be made proportional to the magnitude of the coefficients. This will be exploited later for adaptive coefficient scaling.


Updating Circuits - The architecture for a symbol span of coefficient storage and update is shown in Fig. 14. This architecture was used in the FSLEQ chip described later. The coefficients are stored in a partitioned register file. The most significant bits (MSBs) of the taps are sent to the FIR filter block; however, the full precision of the coefficient is used in the updating. Power-of-two LMS updating is implemented using a shift-and-add circuit. The error from the slicer is used to control the shifting of the sample input in a barrel shifter. The result is added to $w_k$ to compute $w_{k+1}$. The overflow detection and clipping circuitry can be embedded within the register file where it is out of the critical path, as can the Booth recoding for the FIR multipliers. The tap leakage algorithm can be implemented using the carry-input of the adder to add or subtract 0.75 LSB from a tap every time it is updated.

It is important that coefficients can be read/written by a host processor without the need to interrupt the FIR filtering. The random access coefficient bus is used for this purpose. It can write any tap by waiting for it to be used by the FIR filter and replacing its contents (using the mux) during the update.

Low power dissipation in the equalizer update is achieved using a number of techniques. The main saving comes from using power-of-two updating (which removes a large high-precision multiplier from the circuit). The use of the register file is a lot more efficient than recirculating coefficients through a shift register because in each cycle, only one coefficient is read/written, resulting in less switching capacitance. Also, by partitioning the register file that stores the coefficients into MSB and LSB components, efficient coefficient "freezing" can be performed. When the updating is disabled, only the MSBs are read out and used for filtering.

RUNTIME POWER

Adding programmability to an equalizer enables it to be used in various applications. These applications have different requirements for some of the parameters, such as filter length, update rate, and bit precision. To be compliant with the application the receiver must satisfy these requirements. However, some applications have fewer requirements than others, and most transmission systems do not operate in a worst case environment. In addition, for a given application the requirements may change during runtime depending on changes in the operating environment. By observing a performance measure (signal-to-noise ratio, SNR, at the slicer), the equalizer parameters can be tuned to fulfill a given performance requirement and at the same time minimize power consumption.

THE ERROR MONITOR

The error in Eq. 3 computed by the slicer can be observed by an error monitor as shown in Fig. 15. This block takes the absolute value of each symbol error and accumulates these to get an averaging effect. When the error monitor is activated by an edge on the CheckErr signal, a register is loaded with the current accumulator value, and then the accumulator is reset. The dumped register value is compared to internally stored programmable levels, generating status signals depending on the average error value.

Figure 15. Control loop for runtime power reduction showing error monitor.
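The accumulate, dump, and compare behaviour of the error monitor can be modelled in a few lines of Python. This is a behavioural sketch only: the class name, method names, and the list-of-booleans status format are assumptions, and the real block is a hardware accumulator driven by the CheckErr edge.

```python
class ErrorMonitor:
    """Behavioural model: accumulate |error|, dump on CheckErr, compare to levels."""

    def __init__(self, thresholds):
        self.thresholds = sorted(thresholds)  # programmable comparison levels
        self.acc = 0                          # running sum of absolute errors
        self.dumped = 0                       # register loaded on CheckErr

    def observe(self, error):
        """Called once per symbol with the slicer error."""
        self.acc += abs(error)

    def check_err(self):
        """Model an edge on CheckErr: load the register, reset the accumulator,
        and return one status flag per threshold (True if the level is exceeded)."""
        self.dumped = self.acc
        self.acc = 0
        return [self.dumped > t for t in self.thresholds]
```

If CheckErr is asserted every fixed number of symbols, the dumped sum is proportional to the average absolute error, so the thresholds can be programmed as scaled averages without a divider.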

ADAPTIVE BIT PRECISION

The power consumption characteristic of a single multiplier in Fig. 13 indicates that it would be beneficial to use weights with as small a magnitude as possible. By adding a programmable gain at the FIR filter output as in Fig. 15, the filter function is changed to

$$y_{km} = A \sum_{n=0}^{NM-1} w_{k,n}\, x(kM - m - n) \qquad (7)$$

By increasing A, the amplitude of the weights, W, can be reduced while keeping the slicing levels constant. The cost of reducing the amplitude is that the effective bit precision is also reduced, giving increased noise levels. The error monitor is programmed with error thresholds according to the requirements of the application. An adaptive algorithm can then be used to minimize power. After initial convergence with gain A that gives full bit precision, A is reduced as long as the error is below the thresholds in the error monitor. To minimize the overhead cost, the gain can be implemented as a barrel shifter, limiting A to a power of two.

Figure 16. Power reduction techniques in an FSLEQ chip.
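The trade-off behind the programmable output gain can be illustrated numerically. The Python sketch below assumes a power-of-two gain A = 2**shift realized as a left shift, integer weights, and input samples that are already aligned with the taps; `scaled_fir_output` and its arguments are hypothetical names, not taken from the article.

```python
def scaled_fir_output(weights, samples, shift):
    """Evaluate A * sum(w_n * x_n) with A = 2**shift and weights stored as w / A.

    Returns (stored_weights, output). Storing w / A shrinks the weight
    magnitudes but discards `shift` bits of precision.
    """
    gain = 1 << shift
    stored = [round(w / gain) for w in weights]        # smaller-magnitude taps
    acc = sum(wq * x for wq, x in zip(stored, samples))
    return stored, acc << shift                        # barrel-shifter gain


weights = [100, -37, 12, -5]   # illustrative full-precision taps
samples = [3, -1, 4, 2]        # illustrative input samples

for shift in range(4):
    stored, y = scaled_fir_output(weights, samples, shift)
    print(f"A={1 << shift}: stored={stored} output={y}")
```

With these illustrative values the peak stored weight falls from 100 at A = 1 to 25 at A = 4, while the output moves from 375 to 376. The deviation is the rounding noise that the error monitor has to keep within the application's thresholds.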

BURST-MODE UPDATE

Once the equalizer has converged, it is possible to update the taps at a lower rate, assuming that the operating environment does not change quickly. Power reduction is achieved by shutting off most of the update section of the equalizer. The MSBs of the coefficients still need to be supplied to the FIR filter; however, most of the updating circuits can be disabled via clock gating. Measurements indicate that 80 percent of the power consumption of the update section can be saved. At initial convergence, the error is large, forcing the update to be on. Once the error drops below a level LUF programmed into the error monitor, the update section is frozen. As long as the error stays below HUF > LUF, the update section remains deactivated. If the operating environment changes sufficiently to drive the error above HUF, the updating is turned on until the error is again below LUF.
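The LUF/HUF rule is a two-threshold hysteresis, which the following Python sketch models. The function names and the numeric levels in the usage note are assumptions; only the thresholds LUF and HUF come from the text.

```python
def update_enabled(prev_enabled: bool, error: float, luf: float, huf: float) -> bool:
    """Hysteresis control for the update section (requires huf > luf).

    error > huf  -> updating is switched on
    error < luf  -> updating is frozen (clock gated)
    otherwise    -> the previous state is held
    """
    if error > huf:
        return True
    if error < luf:
        return False
    return prev_enabled


def run_controller(errors, luf, huf, enabled=True):
    """Apply the rule to a sequence of error-monitor readings."""
    states = []
    for e in errors:
        enabled = update_enabled(enabled, e, luf, huf)
        states.append(enabled)
    return states
```

For example, with the illustrative levels `luf=10` and `huf=20`, the readings `[50, 30, 8, 15, 19, 25, 12, 9]` keep the update on for the first two readings, freeze it for the next three, re-enable it when the error reaches 25, and freeze it again at 9. The gap between the two levels prevents the update section from toggling on every small fluctuation of the error.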

ADAPTIVE FILTER LENGTHS

The time span of the ISI depends on the channel characteristics and signaling scheme. For successful equalizer operation, the FIR filter needs to cover the ISI time span. Filter length is adaptively changed to achieve a given performance and simultaneously reduce power consumption. Because the amplitudes of the tail taps are usually small, zeroing these taps will not significantly change the equalizer transfer function. One way to achieve this is to have an update enable array with a single bit for each tap. If the bit is 1, the tap is enabled; otherwise, it is disabled, and the tap is replaced with a 0. Disabled taps are not updated. Adaptive filter lengths therefore reduce power in both the FIR filter and the update circuits. Reduced filter lengths can also assist in obtaining convergence when using blind training.
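A per-tap enable array can be sketched in Python as below. The magnitude threshold used to choose which taps to disable is an assumption made for the example; the article only states that small tail taps can be zeroed. The function names are likewise hypothetical.

```python
def enable_by_magnitude(weights, threshold):
    """Build an enable array: 1 keeps a tap, 0 replaces it with zero.

    The threshold rule is an illustrative choice, not the chip's policy.
    """
    return [1 if abs(w) >= threshold else 0 for w in weights]


def masked_fir(weights, enables, samples):
    """FIR output using only enabled taps; disabled taps contribute nothing,
    so their multipliers (and their updates) can stay idle."""
    return sum(w * x for w, en, x in zip(weights, enables, samples) if en)
```

With the illustrative taps `[2, -40, 100, -35, 3, 1]` and a threshold of 5, the enable array becomes `[0, 1, 1, 1, 0, 0]`, so half of the multiply-accumulate operations are skipped while the dominant taps are preserved.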

EFFECTIVENESS OF THE POWER REDUCTION TECHNIQUES

The FSLEQ chip of reference [10] contains several power reduction techniques that reduce average power consumption in worst-case operating conditions as well as taking advantage of non-worst-case environments. Figure 16 shows the cumulative benefits of each of the power reduction approaches described, in the VDSL application environment. Three of the power values are estimates; the rest are measurements obtained from the chip. The top bars correspond to the straightforward implementation using a standard-cell approach with full-custom array multipliers.

In the VDSL application, the chip operates with full-precision coefficients and 128 taps at an SNR of 38.5 dB, and dissipates 162 mW in the FIR filters. This corresponds to 5.1 mW/MAC and represents a 6x saving over the standard-cell approach. The runtime power reduction techniques can reduce this by a further factor of 1.8, to 2.8 mW/MAC. We measured a 3.3x saving in the update blocks using the power-of-two updating algorithm, and this can be reduced another 2.4x at runtime to 4.3 mW. The power in the adaptive filter core is reduced from 2.0 W to 500 mW, and further to 245 mW at runtime (representing a total saving of 8x). The rest of the chip combined consumes 35 mW.

CONCLUSION

This article describes techniques for implementing low-power adaptive equalizers for broadband modems. A programmable hybrid FIR filter architecture is shown to provide a good mix of performance and power consumption for supporting several high-speed modem applications. Adaptive techniques are introduced to reduce power at runtime. A CMOS test chip is presented that has demonstrated the effectiveness of these techniques in high-speed modem applications such as VDSL or ATM.

ACKNOWLEDGMENTS

The authors wish to thank Bryan Ackland, Patrick Larsson, and Jay O'Neill for contributions to the FSLEQ test chip, Tracy Denk for his contributions to mapping filtering algorithms to the programmable filter architecture, and Mark Yu for help with the development of our simulation platform.


REFERENCES

[1] K. Maxwell, "Asymmetric Digital Subscriber Line: Interim Technology for the Next Forty Years," IEEE Commun. Mag., Oct. 1996, pp. 100-6.

[2] D. A. Johns and D. Essig, "Integrated Circuits for Data Transmission over Twisted Pair Channels," IEEE Custom Integrated Circuits Conf., San Diego, CA, May 1996, pp. 5-12.

[3] S. U. H. Qureshi, "Adaptive Equalization," Proc. IEEE, vol. 73, no. 9, Sept. 1985, pp. 1349-86.

[4] R. D. Gitlin, J. F. Hayes, and S. B. Weinstein, Data Communications Principles. New York: Plenum, 1992.

[5] D. L. Duttweiler, "Adaptive Filter Performance with Nonlinearities in the Correlation Multiplier," IEEE Trans. Acoustics, Speech, and Sig. Processing, vol. 30, no. 4, Aug. 1982, pp. 578-86.

[6] S. Mitra and J. Kaiser, Handbook for Digital Signal Processing, Wiley, 1993.

[7] H. R. Lee, C. W. Jen, and C. M. Liu, "New Hardware-Efficient Architecture for Programmable FIR Filter," IEEE Trans. Circuits and Sys., vol. 43, no. 9, Sept. 1996, pp. 637-44.

[8] N. Shanbhag and K. Parhi, Pipelined Adaptive Digital Filters, Kluwer, 1994.

[9] T. C. Denk et al., "Reconfigurable Hardware for Efficient Implementation of Programmable FIR Filters," IEEE Int'l. Conf. Acoustics, Speech and Sig. Processing, Seattle, WA, May 1998.

[10] C. J. Nicol et al., "A Low Power 128 Tap Digital Adaptive Equalizer for Broadband Modems," IEEE J. Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1777-89.

[11] L. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice Hall, 1975.

ADDITIONAL READING

[1] G. H. Im and J. J. Werner, "Bandwidth-Efficient Digital Transmission over Unshielded Twisted-Pair Wiring," IEEE JSAC, vol. 13, no. 9, Dec. 1995, pp. 1643-55.

BIOGRAPHIES

KAMRAN AZADET received the engineering degree from Ecole Centrale de Lyon in 1990, and the Ph.D. degree from Ecole Nationale Superieure des Telecommunications, Paris, in 1994. From 1990 to 199, he was a research engineer with Matra MHS, Saint Quentin en Yvelines, France, where he was involved in the design of video filters for acquisition systems. Since 1994 he has been with Bell Laboratories in Holmdel, New Jersey. Since 1996 he has been a member of the IEEE 802.3ab Gigabit Ethernet 1000BaseT standard. He is currently a member of technical staff in the DSP and VLSI Systems Research Department. His technical interests include analog and mixed-mode circuit design, signal processing, and digital communication.

CHRIS J. NICOL received the B.Sc. and Ph.D. degrees from the University of New South Wales, Australia, in 1991 and 1995, respectively. His thesis was on the design of VLSI chips for real-time image processing. He spent 12 months at Bell Laboratories, Holmdel, New Jersey, in 1992-1993 in the DSP and VLSI Systems Research Department working on SRAM design. He is currently a member of technical staff at Bell Laboratories, Holmdel, New Jersey, working on the design of high-speed cache memories and digital signal processing architectures. His research interests include low-power circuit design techniques and arithmetic circuits.

APPENDIX: HYBRID IIR FILTERS

Similarly to hybrid FIR filters, IIR filters can be implemented using a modular architecture. Figure 17 shows two implementations of IIR filters [11].

Figure 17. Forms I and II of an IIR filter: a) IIR filter form I; b) IIR filter form II.

Using the retiming principle from the third section, hybrid forms based on the canonical forms I and II can be derived. Figure 18 illustrates the particular case where N = 5, n = 2.

Figure 18. Hybrid forms I and II of an IIR filter: a) IIR hybrid form I; b) IIR hybrid form II.

