8/2/2019 Low Power Techniques in Graphics Processing Units
Low Power Techniques in Graphics Processing Units
Deepak Verma
[email protected] of Computer Engineering
Syracuse University
Syracuse, NY 13210
Abstract
Power can be minimized at the system, architecture, algorithm, micro-architecture, gate, or circuit level. Here we discuss the levels at which power minimization has been done historically and the research currently going on to reduce power at the various levels.
Introduction
Most of the power consumption in a computer occurs in the Graphics Processing Unit, so here we study a few of the areas where research is being done in this field.
In section 1, the advancement done in the past to reduce chip size and cost is explained, whereas the other sections cover present research. Improvements at the algorithmic level are discussed in section 2, along with new technologies such as the low power video processor, high quality motion estimation, and real-time DVB-S2 Low-Density Parity-Check (LDPC) decoding for GPUs. Development at the architecture level is explained with the example of low power interconnects for SIMD computers in section 3. The last level of minimization, the hardware level, with the technologies of Hardware-Efficient Belief Propagation and an area optimized low power 1-bit full-adder, is covered in section 4.
History
1. Implementation of Low Power One-Chip MUSE Video Processor
In the past the emphasis was on logic minimization to reduce the power in a chip and thereby further decrease its cost. The low power design allowed the chip to be mounted in inexpensive plastic packages; chips consuming more than 1.5 W had to be mounted in expensive ceramic packages. Here we deal with circuit reduction: in previous chip sets, the 160 word * 234 bit RAM consisted of two parts, since each area of RAM existed on a separate chip. The one-chip implementation results in a reduced size of 480 word * 65 bit. This memory consists of the same 6-transistor memory cell as the previous single-port SRAM devices, but the individual cells are accessed through shift registers, as shown below.
By changing from address decoders with an address buffer and address transition detector to a shift register, the memory circuit size was reduced. We estimated the size of the memory and its peripherals. Since the access speed of single-port RAM is not high enough for access at 16 MHz, we use a 1-to-3 serial-to-parallel converter and a 3-to-1 parallel-to-serial converter, and needed three 160 word * 65 bit RAM blocks. Single-port RAM is larger than sequential access memory.
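The bank arrangement described above (a 1-to-3 serial-to-parallel split feeding three RAM blocks) can be sketched behaviorally. The helper names and sample values below are illustrative, not from the chip:

```python
# Sketch (with assumed parameters): three RAM banks accessed round-robin
# through a 1-to-3 serial-to-parallel split, so each bank only needs to
# run at one third of the 16 MHz stream rate.

STREAM_RATE_MHZ = 16.0
NUM_BANKS = 3

def interleave(samples, num_banks=NUM_BANKS):
    """Serial-to-parallel: distribute a sample stream across banks round-robin."""
    banks = [[] for _ in range(num_banks)]
    for i, s in enumerate(samples):
        banks[i % num_banks].append(s)
    return banks

def deinterleave(banks):
    """Parallel-to-serial: merge the banks back into the original order."""
    total = sum(len(b) for b in banks)
    return [banks[i % len(banks)][i // len(banks)] for i in range(total)]

samples = list(range(12))
banks = interleave(samples)
per_bank_rate = STREAM_RATE_MHZ / NUM_BANKS   # each bank runs near 5.33 MHz
assert deinterleave(banks) == samples
```

The round trip confirms that slower banks operating in parallel can sustain the full stream rate.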
Memory speed was increased from approximately 5 MHz to 16 MHz. Data output from the previous single-port SRAM blocks had to propagate through (1) the address counter, (2) the address buffer, (3) the pre-decoder, (4) the main decoder, (5) the memory cell, and (6) the amplifier. In the dedicated memory block, data only propagate through (1) the shift register, (2) the memory cell, and (3) the amplifier. Thus the access speed increases.
Power is reduced by lowering the operating voltage from 3.7 V to 3.3 V and reducing the chip area through circuit reduction.
Power: P = C * V^2 * f, where
C = capacitance (proportional to chip area),
V = operating voltage,
f = frequency.
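As a quick sanity check of this formula, the voltage reduction alone (3.7 V to 3.3 V, with C and f unchanged) accounts for roughly a 20% dynamic power saving:

```python
# Relative dynamic-power reduction from the voltage scaling described
# above (3.7 V -> 3.3 V); capacitance C and frequency f are held fixed
# (normalized to 1), so they cancel in the ratio.
def dynamic_power(c, v, f):
    return c * v**2 * f

p_old = dynamic_power(1.0, 3.7, 1.0)   # normalized C and f
p_new = dynamic_power(1.0, 3.3, 1.0)
reduction = 1 - p_new / p_old          # about 0.20, i.e. ~20% saved
```

The remaining savings reported for the chip come from the capacitance (area) reduction.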
Present
In this section we discuss the various levels at which research is being done to lower power.
2. Algorithmic level
2.1. Low Power Video Processor
Multiple power saving methods were applied to a video processor for color digital video and still cameras. Architectural-level methods failed to save power; however, changing the algorithm to work on pixel differences yielded a 3-15% power reduction in typical cases. The designer is often constrained by system-level specifications that cannot be changed, thus also prohibiting low power redesign at that level. What remains is for the designer to use best judgment at the architectural levels (RTL and behavioral) and at the algorithmic level.
Power Saving Methods That Were Rejected
Two power reduction methods were investigated and rejected:
Asynchronous Design:
Three factors are typically expected to reduce power: the clock network is eliminated, each module receives inputs only when it needs to compute, and dynamic voltage scaling may be employed. This method was shown to save up to 80% of total power during periods of low activity, when the processor may be slowed down. The bundled-data methodology with delay lines and full handshake interconnect was employed, but it was found that the extra power required by the delay lines and the handshake circuits far exceeded the power saved by eliminating the clock. This was due to the very low frequency of the clock (13.5 MHz, the video input/output rate).
Bus Switching Reduction:
This is possible by selecting between sending a value or its complement. Hamming distance logic on the sender side determines which of the value or its complement incurs less switching, compared to the previous value that is dynamically stored on the bus. Analysis shows that, for the average conditions of this video processor, the bus load must exceed 1 pF before this method shows any benefit. Thus, it is inapplicable to this small processor.
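Bus-invert coding of this kind can be sketched as follows; the 8-bit bus width is an assumption for illustration:

```python
# Sketch of bus-invert coding as described above: before each transfer,
# count the bit lines that would toggle relative to the value currently
# on the bus; if sending the complement toggles fewer lines, send it
# and raise an extra "invert" signal. The 8-bit width is assumed.
WIDTH = 8
MASK = (1 << WIDTH) - 1

def hamming(a, b):
    """Number of differing bits, i.e. lines that would switch."""
    return bin(a ^ b).count("1")

def encode(value, bus_state):
    """Return (word_to_drive, invert_flag) minimizing toggled lines."""
    direct = hamming(value, bus_state)
    inverted = hamming(value ^ MASK, bus_state)
    if inverted < direct:
        return value ^ MASK, True
    return value, False

def decode(word, invert_flag):
    return word ^ MASK if invert_flag else word

bus = 0b00000000
word, inv = encode(0b11111110, bus)   # complement toggles only 1 line
assert decode(word, inv) == 0b11111110
```

The extra invert line and the Hamming logic are exactly the overhead that makes the method unprofitable below roughly 1 pF of bus load.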
The Winner: Algorithmic Transformation
We have taken advantage of the facts that video pixels are often spatially correlated and that most of the processing algorithm is linear. Thus, we resorted to computing the difference of every two successive pixels and converting the linear section of the algorithm to work on those differences. The differences are mostly zero or 1-2 bit numbers, and the logic exploits this.
This observation is obviously false near edges in the image. Due to rounding errors, the difference algorithm performs poorly after any sharp image gradient. We have retained the original circuitry and employ it each time an edge is encountered. Once the gradient has subsided and relatively stationary pixel levels have been reestablished, the difference algorithm is turned back on and the original algorithm is shut off. The new combined original/difference algorithm has been designed to create output that deviates by no more than a single digital value from the original (a single LSB error), and simulations have verified this on all our test images.
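A minimal behavioral sketch of the combined original/difference scheme, assuming the linear section is a simple gain and using an illustrative edge threshold:

```python
# Minimal sketch of the combined original/difference processing: a
# linear operation (here an assumed scaling by GAIN) is applied to
# pixel deltas, which are mostly small; when a delta exceeds an assumed
# edge threshold, the "original" full-value path is used instead and
# the difference path restarts from the fresh value.
GAIN = 2             # assumed linear operation: y = GAIN * x
EDGE_THRESHOLD = 16  # assumed edge detector

def process(pixels):
    out = []
    prev_in = prev_out = 0
    for p in pixels:
        delta = p - prev_in
        if abs(delta) > EDGE_THRESHOLD:
            y = GAIN * p                  # original path at edges
        else:
            y = prev_out + GAIN * delta   # cheap difference path
        out.append(y)
        prev_in, prev_out = p, y
    return out

pixels = [10, 11, 11, 12, 200, 201]   # flat run, then a sharp edge
assert process(pixels) == [GAIN * p for p in pixels]
```

With an exactly linear operation the two paths agree bit for bit; rounding inside a real pipeline is what introduces the single-LSB deviation discussed above.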
Five images were simulated on the new circuit, yielding different portions of pixel differences (table below). Images A-C exhibit typical edge content, resulting in fewer than 50% of pixels that could be processed as differences and in up to 15% power saving. Image D has few edges, and flat image E contains but one value; both exhibit a very high ratio of pixel differences. However, they require less power in the baseline processor, so there is no saving in either case. Actually, there is some loss, due to the additional overhead of the more complex architecture. Ignoring these extreme cases, the new architecture is useful for power saving on more typical images.
Image | Baseline current (mA) | Pixel differences (%) | Reduced current (mA) | Power saving (%)
A     | 7.4                   | 5                     | 6.3                  | 15
B     | 5.9                   | 9                     | 5.4                  | 8
C     | 5.7                   | 37                    | 5.3                  | 7
D     | 4.2                   | 66                    | 4.3                  | -3
E     | 0.8                   | 98                    | 0.9                  | -12
2.2. High Quality Motion Estimation
Motion estimation is the most computationally expensive part of the video encoding process. The chip implements a new high performance motion estimation algorithm based on a modified genetic search strategy that can be enhanced by tracking motion vectors through successive frames.
For digital TV formats like the main level/main profile of MPEG-2, the requirements for both processing power and I/O bandwidth are extremely high if excellent encoding quality is required, as is the case in a studio environment. Therefore one main focus of the VLSI activities was to provide a solution for a front-end high quality motion estimator which satisfies these constraints and implements some coding optimizations; this VLSI aims to replace complex, FPGA-based prototype hardware.
The basic search approach is based on a genetic algorithm. It consists of six basic steps:
1. initialization, i.e. find a starting set of chromosomes, each corresponding to one motion vector;
2. evaluation of the fitness of the chromosomes, i.e. calculation of the MAE criterion of the corresponding motion vectors;
3. selection of the fittest chromosomes, i.e. the set of motion vectors with the lowest MAE;
4. crossover of the chromosomes, i.e. producing new motion vectors (children) from the selected set of motion vectors (parents);
5. mutation of the children, i.e. randomly changing the new vectors according to a defined probability; and
6. iteration of the creation of new populations (i.e. repeating steps 2 to 5) until a defined
convergence has been reached. This scheme has been used as the basis to develop the new high performance, low complexity motion estimation algorithm of the IMAGE chip.
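The six steps above can be sketched on a toy block-matching cost. The MAE stand-in, population size, search window, and mutation range below are assumptions for illustration, not the chip's parameters:

```python
# Toy sketch of the six-step genetic search: each "chromosome" is one
# candidate motion vector (dx, dy). The cost function is a stand-in for
# the block MAE, with an assumed ground-truth motion vector.
import random

random.seed(0)
TRUE_MV = (3, -2)   # assumed true motion for the toy cost

def mae(mv):
    # Stand-in for the block MAE: zero at the true motion vector.
    return abs(mv[0] - TRUE_MV[0]) + abs(mv[1] - TRUE_MV[1])

def genetic_search(generations=20, pop_size=9):
    # 1. initialization: the search-window center plus random candidates
    pop = [(0, 0)] + [(random.randint(-8, 8), random.randint(-8, 8))
                      for _ in range(pop_size - 1)]
    best = min(pop, key=mae)        # fittest vector kept across generations
    for _ in range(generations):
        # 2./3. evaluate fitness and select the fittest chromosomes
        pop.sort(key=mae)
        if mae(pop[0]) < mae(best):
            best = pop[0]
        parents = pop[:4]
        children = []
        for _ in range(pop_size):
            a, b = random.sample(parents, 2)
            # 4. crossover: mean of the two parent motion vectors
            child = ((a[0] + b[0]) // 2, (a[1] + b[1]) // 2)
            # 5. mutation: add a small random vector
            child = (child[0] + random.randint(-1, 1),
                     child[1] + random.randint(-1, 1))
            children.append(child)
        pop = children               # 6. iterate on the new population
    return best
```

Because the best vector ever evaluated is retained (as in the 9-fittest set described below), the result is never worse than the initialization.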
Each chromosome is represented directly by the motion vector with its two components concatenated. The spatial correlation of the motion vectors in the frame is exploited: instead of the search window center, the best motion vector of the previous macro-block in the same slice is used as the initialization of the search. A set of the 9 fittest chromosomes (best motion vectors so far) is always kept to ensure that the final motion vector is the one with the lowest MAE of all performed matchings, not only the best result of the last population. The mean of the two parent motion vectors is used as the crossover operation, and mutation is performed by adding small random vectors to the generated children.
Furthermore, in many cases the motion vector fields of consecutive frames are highly correlated. This characteristic can be exploited to significantly improve the results of our algorithm by applying a VECTOR TRACING technique. This corresponds to adding, at the initialization phase, the best vectors of the nine surrounding macro-blocks in the reference frame (with appropriate scaling for the B frames). These adaptations result in the Vector-traced Modified Genetic Search Algorithm (VT-MGS).
For very complex sequences (like basketball), the prediction quality for scenes with very fast motion can be further enhanced by applying a 2-phase vector tracing scheme. In the first phase, the sequence is treated like a low-delay coding sequence, and non-MPEG-2-conforming predictions in display order are performed with the only goal of calculating very exact tracing vectors by adding up partial results of this estimation (e.g. to form the tracing vector P->1, the non-MPEG-2 motion vectors B1->1, B2->B1, and P->B2 are added up and stored). In the second phase, MPEG-2-conforming motion estimation is done using the pre-calculated initialization vectors. This 2-phase motion estimation, referred to as VT-MGS2, of course implies a significant increase in processing power (by 50%) due to the additional first phase.
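The tracing-vector construction amounts to component-wise addition of the partial motion vectors; the vector values below are illustrative, not from the paper:

```python
# Sketch of the 2-phase tracing-vector construction described above:
# partial display-order estimates are added up to form an initialization
# vector, e.g. P->1 = (B1->1) + (B2->B1) + (P->B2). The vector values
# are assumptions for illustration.
def add_vectors(*mvs):
    """Compose motion vectors by component-wise addition."""
    return (sum(v[0] for v in mvs), sum(v[1] for v in mvs))

b1_to_1 = (2, 0)     # assumed phase-1 partial estimates
b2_to_b1 = (1, -1)
p_to_b2 = (3, 1)

p_to_1 = add_vectors(b1_to_1, b2_to_b1, p_to_b2)
assert p_to_1 == (6, 0)
```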
2.3. Real time DVB-S2 Low-Density Parity-Check (LDPC) for GPUs
Until now the computational power required to decode large code words in real time was not available. Although LDPC decoding solutions have recently been proposed for multi-core platforms, they mainly address short and regular codes. In this work, LDPC decoders based on Graphics Processing Units (GPUs) are proposed for the first time for the computationally demanding case of the irregular LDPC codes adopted in Digital Video Broadcasting (DVB-S2).
The irregular nature of these LDPC codes can impose memory access constraints, and this, associated with the large code size, creates challenges which are difficult to overcome. Also, the scheduling mechanism imposes important restrictions on the attempt to parallelize the algorithm. Thread-level and data-level parallelism can be conveniently exploited, together with the use of fast local memories, to harness the computational efficiency of these GPU-based signal processing algorithms.
The algorithms developed support multicodeword decoding and are scalable to future GPU generations, which are expected to have a higher number of cores. It is shown that real-time DVB-S2 LDPC decoding with throughputs above 90 Mbps is possible on ubiquitous GPU computing platforms.
The LDPC codes adopted in DVB-S2 have a periodic nature, which allows the exploitation of suitable representations of the data structures to attenuate their computational requirements. The properties of DVB-S2 codes are exploited for GPU parallel architectures. The parity-check matrix H has the form shown below.
H(N-K) x N = [ A(N-K) x K | B(N-K) x (N-K) ] =

  | a(0,0)     ...  a(0,K-1)     | 1 0 0 ... 0 0 0 |
  | a(1,0)     ...  a(1,K-1)     | 1 1 0 ... 0 0 0 |
  | a(2,0)     ...  a(2,K-1)     | 0 1 1 ... 0 0 0 |
  |    .       ...      .        |       ...       |
  | a(N-K-2,0) ...  a(N-K-2,K-1) | 0 0 0 ... 1 1 0 |
  | a(N-K-1,0) ...  a(N-K-1,K-1) | 0 0 0 ... 0 1 1 |
where A is sparse and B is a staircase lower triangular matrix. The periodicity constraints imposed on the pseudo-random generation of A allow a significant reduction in storage requirements without code performance loss.
The Min-Sum algorithm was adopted in this work to perform the decoding of these computationally intensive long LDPC codes.
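A minimal sketch of the Min-Sum check-node update (sign product and minimum magnitude over the other incoming messages); the LLR values below are illustrative:

```python
# Sketch of the Min-Sum check-node update: the message sent to each
# connected bit node takes the sign product and the minimum magnitude
# over all the *other* incoming messages. The incoming LLR values are
# assumptions for illustration.
def min_sum_check_update(incoming):
    """incoming: LLR messages from the bit nodes connected to one CN."""
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign
        out.append(sign * min(abs(m) for m in others))
    return out

msgs = [2.0, -1.5, 3.0, -0.5]
assert min_sum_check_update(msgs) == [0.5, -0.5, 0.5, -1.5]
```

In the GPU decoder described next, one thread performs this update per check node (kernel 1), with the analogous bit-node accumulation in kernel 2.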
PROPOSED PARALLEL ALGORITHM FOR MANY-CORE GPUs
A computing system with a GPU consists of a host, typically a Central Processing Unit (CPU), that is used for programming and controlling the operation of the GPU. The GPU is a massively parallel processing engine that can speed up processing by simultaneously performing the same operation on distinct data distributed across many arithmetic processing units.
GPUs are programmable, and one of the most widely used programming models is the NVIDIA Compute Unified Device Architecture (CUDA). The execution of a kernel on a GPU is distributed across a grid of thread blocks of adjustable size.
Parallel multithreaded LDPC decoder processing: (a) kernels 1 and 2 on the GPU using one thread per node of the Tanner graph, where, for example, (b) BNd, BNf, and BNk are BNs connected to CN0, and threads are grouped and processed in B blocks on the (c) GPU many-core architecture.
The algorithm developed attempts to exploit two major capabilities of GPUs: the massive use of thread and data parallelism, and the minimization of memory accesses, which often degrade performance in multi-core systems.
Multithread-based processing: In order to extract full thread-level parallelism from the GPU, the proposed DVB-S2 LDPC decoder exploits a thread-per-node approach (thread-per-row and thread-per-column based processing). The figure illustrates this strategy with 16 threads per block (represented by tc0..tc15) being processed in parallel inside block 0 for the check node processing indicated in kernel 1 of Algorithm 1. A similar approach is applied to the remaining threads tc16..tcN-K of kernel 1, which are grouped and executed in other blocks of the grid. Also, in kernel 2, threads tB0..tBN-1 perform the equivalent parallel bit node processing. The efficiency of this parallelism is achieved by adopting a flooding schedule strategy that eliminates data dependencies in the exchange of messages between BNs and CNs. Additionally, to fully exploit the massive processing power of the GPU, the algorithm performs multicodeword decoding by decoding 16 code words in parallel. Moreover, this solution uses 8 bits to represent data, which compares favorably with existing state-of-the-art VLSI DVB-S2 LDPC decoders that typically use 5 or 6 bits.
Coalesced accesses to data structures: In a GPU, parallel accesses to the slow global memory can kill performance and should, whenever possible, be minimized. To optimize this type of operation, data is contiguously aligned in memory, which allows coalescing to take effect and several threads to access corresponding data simultaneously, as depicted in the figure below. Nevertheless, modern GPU hardware can be more efficient at dealing with out-of-order memory accesses and related issues.
This is an efficient parallel algorithm for performing the massive decoding of DVB-S2 LDPC codes on GPUs, and high throughputs can be achieved for real-time applications, with values surpassing 90 Mbps.
3. Architecture level
3.1. Low Power Interconnects for SIMD Computers
A limit on the SIMD width of different architectures (like 3D graphics, high definition video, image processing, and wireless communications) is the scalability of the interconnect network between the processing elements, in terms of both area and power.
We use the XRAM, a low power, high performance matrix-style crossbar, instead of an SRAM-based design. One of the most power-efficient ways to utilize transistor area is to integrate multiple processing elements (PEs) within a die. This is represented in many architectures in the form of an increased number of single instruction multiple data (SIMD) lanes in processors and the shift from multi-core to many-core architectures. Network-on-chip studies show that the crossbar itself consumes between 30% and almost 50% of the total interconnect power. Another critical problem is that existing circuit topologies in traditional interconnects do not scale well, because of the complexity of the control wires and the control signal generation
logic, which directly affects the delay and power consumption.
One circuit technique that helps solve the control complexity problem is to embed the interconnect control within the cross points of a matrix-style crossbar using SRAM cells. This differs from the traditional technique where interconnections are set by an external controller. Other circuit techniques, like using the same output bus wires to program the cross-point control, help reduce the number of control wires needed within the XRAM. Finally, borrowing low voltage swing techniques currently used in SRAM arrays improves performance and lowers the energy used in driving the wires of the XRAM. Though these techniques help solve the performance and scaling problems of traditional interconnects, one drawback is flexibility: the XRAM can only store a certain number of swizzle configurations at a given time. A case study shows that the XRAM achieves 1.4x performance and consumes 2.5x less power in a color-space conversion algorithm.
XRAM fundamentals.
The input buses run horizontally while the output buses run vertically, creating an array of cross points. Each cross point contains a 6T SRAM bit cell whose state determines whether or not input data is passed onto the output bus at that cross point. Along a column, only one bit cell is programmed to store a logic high and create a connection to an input. Matrix-type crossbars incur a huge area overhead because of the quadratically increasing number of control signals required to set the connectivity at the cross points. To mitigate this, the XRAM uses techniques similar to those employed in SRAM arrays. In an SRAM array, the same bit line is used to read as well as write a bit cell. Until the XRAM is programmed, the output buses do not carry any useful data; hence, they can be used to configure the SRAM cells at the cross points without affecting functionality.
Along a channel (output bus), each SRAM cell is connected to a unique bit line of the bus. This allows the programming of multiple SRAM cells (as many as there are bit lines in the channel) simultaneously. XRAM re-uses the output channels for programming, improving silicon utilization to 45%. To further improve silicon utilization, multiple SRAM cells can be embedded at each cross point to cache more than one shuffle configuration. We find that many applications, especially in the signal processing domain, use only a small number of permutations over and over again. By caching the most frequently used patterns, the XRAM reduces power and latency by eliminating the need to reconfigure and reprogram the XRAM for those patterns.
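The cross-point array described above can be modeled behaviorally as follows; the 4x4 size, the class name, and the swizzle pattern are assumptions for illustration:

```python
# Behavioral sketch of the XRAM cross-point array: one SRAM bit per
# cross point, with exactly one bit set per output column so that each
# output bus is driven by a single input bus. Sizes are assumed.
NUM_INPUTS = 4
NUM_OUTPUTS = 4

class XramModel:
    def __init__(self):
        # cross[i][j] == True routes input bus i to output bus j
        self.cross = [[False] * NUM_OUTPUTS for _ in range(NUM_INPUTS)]

    def program(self, mapping):
        """mapping[j] = index of the input that drives output column j."""
        for j, i in enumerate(mapping):
            for row in self.cross:
                row[j] = False          # only one bit high per column
            self.cross[i][j] = True

    def route(self, inputs):
        """Drive each output bus from the input selected at its column."""
        out = []
        for j in range(NUM_OUTPUTS):
            driver = next(i for i in range(NUM_INPUTS) if self.cross[i][j])
            out.append(inputs[driver])
        return out

x = XramModel()
x.program([2, 0, 3, 1])                 # an assumed swizzle pattern
assert x.route(["a", "b", "c", "d"]) == ["c", "a", "d", "b"]
```

Caching multiple stored configurations, as described above, would correspond to keeping several such `mapping` patterns resident and switching among them without reprogramming.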
Compared to conventionally implemented crossbars, the area scales linearly with the product of the input and output ports while consuming almost 50% less energy.
Compared to conventional MUX-based implementations, the XRAM improves performance by 1.4x and achieves 1.5-2.5x lower power for applications such as color-space conversion.
4. Hardware level
4.1. Hardware-Efficient Belief Propagation
Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as a Markov random field (MRF), but it has high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make it difficult to parallelize the computation. Two techniques are proposed to address these issues. BP algorithms generally require a great amount of memory for storing the messages, typically on the order of tens to hundreds of times larger than the input data. Besides, since each message is processed hundreds of times, the saving and loading of messages consumes considerable bandwidth. Therefore, although BP may work on high-end platforms such as desktops, it is impractical for low-cost, power-limited devices. Because the messages are sequentially updated and each message is constructed via a sequential procedure, it is difficult to utilize hardware parallelism to accelerate BP.
Tile-based BP is used to address these issues. It splits the Markov random field into many tiles and only stores the messages across neighboring tiles. The memory and bandwidth required by this technique are only a fraction of those of ordinary BP algorithms, but the quality of the results, as tested on the publicly available Middlebury MRF benchmarks, is comparable to other efficient algorithms.
Tile-Based BP
Let us first consider the process of generating the outgoing messages of a node p. We need the four incoming messages toward p, the data costs of p, and the smoothness cost between p and its neighbor q. Beyond these, we do not need the data of nodes far away from p. This property is exploited in a bipartite graph, where the nodes are split into two sets so that every edge connects two nodes of different sets. To generate messages from the first set to the second one, we only need the messages from the second set to the first one. Therefore, only half of the messages are stored. An interesting question is: what data are required to generate the messages toward p? The answer is the data costs of p's neighbors and the messages sent from the neighbors of those neighbors, as shown in Fig. (b). Again, we do not have to access the variables outside this group of nodes.
This rule can easily be extended. To generate the messages from the shaded nodes in Fig. (b) to p's neighbors, we only need the messages from their neighbors. Therefore, if we have the messages from the boundary of a region, we can sequentially generate the messages inward. After we reach the region center, we can then sequentially generate the outward messages. The only required inputs are the data costs and smoothness costs of this region and the messages of the boundary nodes. This concept can be extended to multiple regions and iterations. For example, we can split the nodes of the MRF into two sets, as shown in the figure: one set N1 contains the nodes in a 3-by-3 tile, and the other set N2 contains all other nodes. When we perform BP in N2, without knowing the messages in N1 (dotted edges in the figure), we only need the messages coming from N1 to drive the propagation (outward arrows in the figure). All the messages inside the tile are irrelevant to the message passing outside the tile, because they are never used in its evaluation. The only messages that must always reside in memory are those sent from one subset to the other. When we perform message passing in one subset, the messages inside the other one can be discarded without affecting the operation. Given enough computation, the approximation quality is good enough to drive the propagation to converge.
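The message update at the heart of BP can be sketched on a 3-node chain MRF (min-sum form, not the tiling itself); the data costs and the |l - l'| smoothness cost below are assumptions. On a chain, the forward/backward sweep mirrors the inward/outward sweeps described above and yields the exact minimum-energy labeling:

```python
# Minimal min-sum message-passing sketch on a 3-node chain MRF. The
# data costs and smoothness cost are illustrative assumptions; on a
# chain (a tree) this sweep computes the exact MAP labeling.
LABELS = range(3)
DATA = [[0, 2, 4],   # data cost D[node][label]
        [3, 0, 3],
        [4, 2, 0]]

def smooth(a, b):
    return abs(a - b)

def pass_messages():
    n = len(DATA)
    fwd = [[0] * len(LABELS) for _ in range(n)]   # messages from the left
    bwd = [[0] * len(LABELS) for _ in range(n)]   # messages from the right
    for i in range(1, n):
        for l in LABELS:
            fwd[i][l] = min(DATA[i - 1][k] + fwd[i - 1][k] + smooth(k, l)
                            for k in LABELS)
    for i in range(n - 2, -1, -1):
        for l in LABELS:
            bwd[i][l] = min(DATA[i + 1][k] + bwd[i + 1][k] + smooth(k, l)
                            for k in LABELS)
    # belief at each node: data cost plus all incoming messages
    return [min(LABELS, key=lambda l: DATA[i][l] + fwd[i][l] + bwd[i][l])
            for i in range(n)]

assert pass_messages() == [0, 1, 2]
```

Tile-based BP applies the same update, but only the messages crossing tile boundaries (the `fwd`/`bwd` values at the ends, in this chain analogy) ever need to stay resident in memory.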
We can see that as the algorithm iterates (Ti = iterations), the energy keeps decreasing. This technique greatly reduces the memory, bandwidth, and computational costs of BP and enables parallel processing. With these techniques, BP becomes more suitable for low-cost and power-limited consumer electronics. The applicability of the proposed techniques has been demonstrated for various applications in the Middlebury MRF benchmark, and a VLSI circuit and a GPU program for stereo matching based on the proposed techniques have been developed. These techniques can also be applied to other parallel platforms.
4.2. AREA OPTIMIZED LOW POWER 1-BIT FULL-ADDER
A low power 1-bit full adder (FA) with 10 transistors is proposed and used in the implementation of an ALU. With this adder, the power and area are greatly reduced: by more than 70% compared to the conventional design and by 30% compared to transmission gates. So the design is attributed as an area-efficient and low power ALU. This design does not compromise speed, as the delay of the full adder, and thus the overall delay, is minimized. The leakage power of the design is also reduced by designing the full adder with fewer power-supply-to-ground connections. A conventional CMOS full adder consists of 28 transistors, but here a full adder is designed with only 10 transistors, which occupies much less area and also consumes much less power.
The table shows the carry status of the full adder. If both A and B are 1, a carry is generated, because summing A and B makes the output SUM 0 and CARRY 1. If both A
and B are 0, summing A and B gives 0, and any previous carry is added to this SUM, making the CARRY bit 0. This in effect deletes the carry.
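The adder behavior described by the table reduces to SUM = A xor B xor Cin and CARRY = majority(A, B, Cin), which can be checked against the full truth table:

```python
# Boolean sketch of the 1-bit full-adder behavior described above:
# SUM is the parity of the three inputs, CARRY their majority. The
# identity 2*CARRY + SUM == A + B + Cin verifies all eight input cases.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    carry = (a & b) | (a & cin) | (b & cin)
    return s, carry

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, carry = full_adder(a, b, cin)
            assert 2 * carry + s == a + b + cin
```

The 10T, 14T, and 28T circuits below are different transistor-level realizations of this same Boolean function.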
Static complementary CMOS adders using
28 transistors.
Fourteen Transistor (14T) Full adder with
Transmission Gates
10 Transistor Full Adder Design (10TFA)
The proposed 10TFA also takes the three inputs A, B, and Cin; the third input Cin represents the carry input to the first stage. The outputs are SUM and CARRY. The full adder circuit uses a 0.18 µm CMOS process technology, which provides transistors with three characteristics, namely high speed, low voltage, and low leakage. As the main target of this design is to minimize power, the transistors are selected accordingly. The typical supply voltage for this process is 1.8 V. The 10-transistor 1-bit full adder is designed at the transistor level using the 0.18 µm CMOS process. Based on simulation, the 10-transistor 1-bit full adder consumes 6.2995 µW of power, whereas a conventional full adder consumes 16.675 µW, which shows a 62.2% power saving.
The total power consumption of a CMOS circuit includes dynamic power, static power, and short-circuit power consumption. The last two are neglected due to their low contribution. The dominant factor is the dynamic power, based on the equation P = CL * f * Vdd^2.
The instantaneous power P(t) drawn from the power supply is proportional to the supply current Idd(t) and the supply voltage Vdd(t):
P(t) = Idd(t) * Vdd(t)
The energy consumed over a time interval T is the integral of the instantaneous power:
E = ∫ from 0 to T of Idd(t) * Vdd(t) dt
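A numerical sketch of these relations, with an assumed triangular supply-current pulse and the 1.8 V supply typical of this process:

```python
# Numerical sketch of the instantaneous-power relations above:
# P(t) = Idd(t) * Vdd(t), integrated over an interval T to get energy.
# The current waveform is an assumption for illustration.
VDD = 1.8      # V, typical supply for the 0.18 um process
T = 1e-6       # s, integration interval
STEPS = 1000

def idd(t):
    # Assumed triangular current pulse peaking at 1 mA mid-interval.
    return 1e-3 * (1 - abs(2 * t / T - 1))

dt = T / STEPS
energy = sum(idd((i + 0.5) * dt) * VDD * dt for i in range(STEPS))
# Closed form for this waveform: E = 0.5 * 1e-3 * VDD * T = 0.9 nJ
assert abs(energy - 0.9e-9) < 1e-12
```

The midpoint sum matches the closed-form integral because the assumed waveform is piecewise linear.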
S.No. | Design             | Cell                                | Energy (pJ) | Power (µW)
1     | CMOS               | 2 x 1 MUX                           | 0.9215      | 4.6075
2     |                    | 4 x 1 MUX                           | 3.0245      | 15.123
3     |                    | Logical MUX                         | 8.5212      | 42.606
      |                    | Conventional Full Adder             | 3.3342      | 16.675
4     | Transmission Gates | 2 x 1 MUX with Transmission gates   | 0.3625      | 1.6079
5     |                    | 4 x 1 MUX with Transmission gates   | 0.8525      | 4.2625
6     |                    | Logical MUX with Transmission gates | 7.4285      | 37.145
7     |                    | 10 Transistor Full Adder            | 1.2599      | 6.2995
Design      | ALU with CMOS gates | ALU with CMOS gates & 10-Transistor full adder | ALU with Transmission gates & 10-Transistor full adder
Energy (pJ) | 840.82              | 351.95                                         | 239.52
Power (µW)  | 4204.5              | 1759.5                                         | 1197.5
The leakage power is also very low, as the number of power-supply-to-ground connections is greatly reduced. The power consumption of the 16-bit ALU with the 10-transistor full adder is observed to be 1197.5 µW.
References
[1] Implementation of Low Power One-Chip MUSE Video Processor, by Tetsuo Aoki.
[2] A Low Power Video Processor, by Uzi Zangi and Ran Ginosar.
[3] IMAGE: A Low Cost, Low Power Video Processor, by Friederich Mombers.
[4] Real-Time DVB-S2 LDPC Decoding on Many-Core GPU Accelerators, by Gabriel Falcao, Joao Andrade, Vitor Silva, and Leonel Sousa.
[5] Low Power Interconnects for SIMD Computers, by Mark Woh.
[6] Hardware-Efficient Belief Propagation, by Chia-Kai Liang, Chao-Chung Cheng, and Yen-Chieh Lai.
[7] Area Optimized Low Power Arithmetic and Logic Unit, by T. Esther Rani.
[8] http://ieeexplore.ieee.org.libezproxy2.syr.edu/search/advsearch.jsp