8/2/2019 Low Power Techniques in Graphics Processing Units
Low Power Techniques in Graphics Processing Units
Deepak Verma
[email protected] of Computer Engineering
Syracuse University
Syracuse, NY 13210
Abstract
Power can be minimized at the system, architecture, algorithm, micro-architecture, gate, or circuit level. Here we discuss the levels at which power minimization has been done historically and the research currently going on to reduce power at the various levels.
Introduction
Most of the power consumption in a computer occurs in the Graphics Processing Unit, so here we study a few of the areas where research is being done in this field.
In section 1, the advancement done in the past to reduce chip size and cost is explained, whereas the other sections cover present research. Improvements at the algorithmic level are discussed in section 2, along with new technologies such as the low power video processor, high quality motion estimation, and real-time DVB-S2 Low-Density Parity-Check (LDPC) decoding for GPUs. Development at the architecture level is explained with the example of low power interconnects for SIMD computers in section 3. The last level of minimization, the hardware level, with the technologies of Hardware-Efficient Belief Propagation and an area optimized low power 1-bit full-adder, is covered in section 4.
History
1. Implementation of Low Power One-Chip MUSE Video Processor
In the past the emphasis was on logic minimization to reduce the power in a chip and thereby further decrease its cost. The low power design allowed the chip to be mounted in inexpensive plastic packages; chips consuming more than 1.5 W had to be mounted in expensive ceramic packages. Here we deal with circuit reduction: in previous chip sets, the 160 word * 234 bit RAM consisted of two parts, since each area of RAM existed on a separate chip. The one-chip implementation results in a reduced size of 480 word * 65 bit. This memory consists of the same 6-transistor memory cell as the previous single-port SRAM devices, but the individual cells are accessed through shift registers, as shown below.
By changing from address decoders with an address buffer and address transition detector to a shift register, the memory circuit size was reduced. We estimated the size of the memory and its peripherals. Since the access speed of single-port RAM is not high enough for access at 16 MHz, we use a 1-to-3 serial-to-parallel converter and a 3-to-1 parallel-to-serial converter, and needed three 160 word * 65 bit RAM blocks. Single-port RAM is larger than sequential access memory.
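The bank arrangement described above (a 1-to-3 serial-to-parallel split feeding three RAM blocks) can be sketched behaviorally. The helper names and sample values below are illustrative, not from the chip:

```python
# Sketch (with assumed parameters): three RAM banks accessed round-robin
# through a 1-to-3 serial-to-parallel split, so each bank only needs to
# run at one third of the 16 MHz stream rate.

STREAM_RATE_MHZ = 16.0
NUM_BANKS = 3

def interleave(samples, num_banks=NUM_BANKS):
    """Serial-to-parallel: distribute a sample stream across banks round-robin."""
    banks = [[] for _ in range(num_banks)]
    for i, s in enumerate(samples):
        banks[i % num_banks].append(s)
    return banks

def deinterleave(banks):
    """Parallel-to-serial: merge the banks back into the original order."""
    total = sum(len(b) for b in banks)
    return [banks[i % len(banks)][i // len(banks)] for i in range(total)]

samples = list(range(12))
banks = interleave(samples)
per_bank_rate = STREAM_RATE_MHZ / NUM_BANKS   # each bank runs near 5.33 MHz
assert deinterleave(banks) == samples
```

The round trip confirms that slower banks operating in parallel can sustain the full stream rate.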
Memory speed was increased from approximately 5 MHz to 16 MHz. Data output from the previous single-port SRAM blocks had to propagate through (1) the address counter, (2) the address buffer, (3) the pre-decoder, (4) the main decoder, (5) the memory cell, and (6) the amplifier. In the dedicated memory block, data only propagate through (1) the shift register, (2) the memory cell, and (3) the amplifier. Thus the access speed increases.
Power is reduced by lowering the operating voltage from 3.7 V to 3.3 V and reducing the chip area through circuit reduction.
Power: P = C * V^2 * f, where
C = capacitance (proportional to chip area),
V = operating voltage,
f = frequency.
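As a quick sanity check of this formula, the voltage reduction alone (3.7 V to 3.3 V, with C and f unchanged) accounts for roughly a 20% dynamic power saving:

```python
# Relative dynamic-power reduction from the voltage scaling described
# above (3.7 V -> 3.3 V); capacitance C and frequency f are held fixed
# (normalized to 1), so they cancel in the ratio.
def dynamic_power(c, v, f):
    return c * v**2 * f

p_old = dynamic_power(1.0, 3.7, 1.0)   # normalized C and f
p_new = dynamic_power(1.0, 3.3, 1.0)
reduction = 1 - p_new / p_old          # about 0.20, i.e. ~20% saved
```

The remaining savings reported for the chip come from the capacitance (area) reduction.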
Present
In this section we discuss the various levels at which research is being done to lower power.
2. Algorithmic level
2.1. Low Power Video Processor
Multiple power saving methods were applied to a video processor for color digital video and still cameras. Architectural-level methods failed to save power; however, changing the algorithm to work on pixel differences yielded a 3-15% power reduction in typical cases. The designer is often constrained by system-level specifications that cannot be changed, thus also prohibiting low power redesign at that level. What remains is for the designer to use best judgment at the architectural levels (RTL and behavioral) and at the algorithmic level.
Power Saving Methods That Were Rejected
Two power reduction methods were investigated and rejected:
Asynchronous Design:
Three factors are typically expected to reduce power: the clock network is eliminated, each module receives inputs only when it needs to compute, and dynamic voltage scaling may be employed. This method was shown to save up to 80% of total power during periods of low activity, when the processor may be slowed down. The bundled-data methodology with delay lines and full handshake interconnect was employed, but it was found that the extra power required by the delay lines and the handshake circuits far exceeded the power saved by eliminating the clock. This was due to the very low frequency of the clock (13.5 MHz, the video input/output rate).
Bus Switching Reduction:
This is possible by selecting between sending a value or its complement. Hamming distance logic on the sender side determines which of the value or its complement incurs less switching, compared to the previous value that is dynamically stored on the bus. Analysis shows that, for the average conditions of this video processor, the bus load must exceed 1 pF before this method shows any benefit. Thus, it is inapplicable to this small processor.
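Bus-invert coding of this kind can be sketched as follows; the 8-bit bus width is an assumption for illustration:

```python
# Sketch of bus-invert coding as described above: before each transfer,
# count the bit lines that would toggle relative to the value currently
# on the bus; if sending the complement toggles fewer lines, send it
# and raise an extra "invert" signal. The 8-bit width is assumed.
WIDTH = 8
MASK = (1 << WIDTH) - 1

def hamming(a, b):
    """Number of differing bits, i.e. lines that would switch."""
    return bin(a ^ b).count("1")

def encode(value, bus_state):
    """Return (word_to_drive, invert_flag) minimizing toggled lines."""
    direct = hamming(value, bus_state)
    inverted = hamming(value ^ MASK, bus_state)
    if inverted < direct:
        return value ^ MASK, True
    return value, False

def decode(word, invert_flag):
    return word ^ MASK if invert_flag else word

bus = 0b00000000
word, inv = encode(0b11111110, bus)   # complement toggles only 1 line
assert decode(word, inv) == 0b11111110
```

The extra invert line and the Hamming logic are exactly the overhead that makes the method unprofitable below roughly 1 pF of bus load.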
The Winner: Algorithmic Transformation
We have taken advantage of the facts that video pixels are often spatially correlated and that most of the processing algorithm is linear. Thus, we resorted to computing the difference of every two successive pixels and converting the linear section of the algorithm to work on those differences. The differences are mostly zero or 1-2 bit numbers, and the logic exploits this.
This observation is obviously false near edges in the image. Due to rounding errors, the difference algorithm performs poorly after any sharp image gradient. We have retained the original circuitry and employ it each time an edge is encountered. Once the gradient has subsided and relatively stationary pixel levels have been reestablished, the difference algorithm is turned back on and the original algorithm is shut off. The new combined original/difference algorithm has been designed to create output that deviates by no more than a single digital value from the original (a single LSB error), and simulations have verified this on all our test images.
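A minimal behavioral sketch of the combined original/difference scheme, assuming the linear section is a simple gain and using an illustrative edge threshold:

```python
# Minimal sketch of the combined original/difference processing: a
# linear operation (here an assumed scaling by GAIN) is applied to
# pixel deltas, which are mostly small; when a delta exceeds an assumed
# edge threshold, the "original" full-value path is used instead and
# the difference path restarts from the fresh value.
GAIN = 2             # assumed linear operation: y = GAIN * x
EDGE_THRESHOLD = 16  # assumed edge detector

def process(pixels):
    out = []
    prev_in = prev_out = 0
    for p in pixels:
        delta = p - prev_in
        if abs(delta) > EDGE_THRESHOLD:
            y = GAIN * p                  # original path at edges
        else:
            y = prev_out + GAIN * delta   # cheap difference path
        out.append(y)
        prev_in, prev_out = p, y
    return out

pixels = [10, 11, 11, 12, 200, 201]   # flat run, then a sharp edge
assert process(pixels) == [GAIN * p for p in pixels]
```

With an exactly linear operation the two paths agree bit for bit; rounding inside a real pipeline is what introduces the single-LSB deviation discussed above.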
Five images were simulated on the new circuit, yielding different portions of pixel differences (table below). Images A-C exhibit typical edge content, resulting in fewer than 50% of pixels that could be processed as differences and in up to 15% power saving. Image D has few edges, and flat image E contains but one value; both exhibit a very high ratio of pixel differences. However, they require less power in the baseline processor, so there is no saving in either case. Actually, there is some loss, due to the additional overhead of the more complex architecture. Ignoring these extreme cases, the new architecture is useful for power saving on more typical images.
Image | Baseline current (mA) | Pixel differences (%) | Reduced current (mA) | Power saving (%)
A     | 7.4                   | 5                     | 6.3                  | 15
B     | 5.9                   | 9                     | 5.4                  | 8
C     | 5.7                   | 37                    | 5.3                  | 7
D     | 4.2                   | 66                    | 4.3                  | -3
E     | 0.8                   | 98                    | 0.9                  | -12
2.2. High Quality Motion Estimation
Motion estimation is the most computationally expensive part of the video encoding process. The chip implements a new high performance motion estimation algorithm based on a modified genetic search strategy that can be enhanced by tracking motion vectors through successive frames.
For digital TV formats like the main level/main profile of MPEG-2, the requirements for both processing power and I/O bandwidth are extremely high if excellent encoding quality is required, as is the case in a studio environment. Therefore one main focus of the VLSI activities was to provide a solution for a front-end high quality motion estimator which satisfies these constraints and implements some coding optimizations; this VLSI aims to replace complex, FPGA-based prototype hardware.
The basic search approach is based on a genetic algorithm. It consists of six basic steps:
1. initialization, i.e. find a starting set of chromosomes, each corresponding to one motion vector;
2. evaluation of the fitness of the chromosomes, i.e. calculation of the MAE criterion of the corresponding motion vectors;
3. selection of the fittest chromosomes, i.e. the set of motion vectors with the lowest MAE;
4. crossover of the chromosomes, i.e. producing new motion vectors (children) from the selected set of motion vectors (parents);
5. mutation of the children, i.e. randomly changing the new vectors according to a defined probability; and
6. iteration of the creation of new populations (i.e. repeating steps 2 to 5) until a defined
convergence has been reached. This scheme has been used as the basis to develop the new high performance, low complexity motion estimation algorithm of the IMAGE chip.
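The six steps above can be sketched on a toy block-matching cost. The MAE stand-in, population size, search window, and mutation range below are assumptions for illustration, not the chip's parameters:

```python
# Toy sketch of the six-step genetic search: each "chromosome" is one
# candidate motion vector (dx, dy). The cost function is a stand-in for
# the block MAE, with an assumed ground-truth motion vector.
import random

random.seed(0)
TRUE_MV = (3, -2)   # assumed true motion for the toy cost

def mae(mv):
    # Stand-in for the block MAE: zero at the true motion vector.
    return abs(mv[0] - TRUE_MV[0]) + abs(mv[1] - TRUE_MV[1])

def genetic_search(generations=20, pop_size=9):
    # 1. initialization: the search-window center plus random candidates
    pop = [(0, 0)] + [(random.randint(-8, 8), random.randint(-8, 8))
                      for _ in range(pop_size - 1)]
    best = min(pop, key=mae)        # fittest vector kept across generations
    for _ in range(generations):
        # 2./3. evaluate fitness and select the fittest chromosomes
        pop.sort(key=mae)
        if mae(pop[0]) < mae(best):
            best = pop[0]
        parents = pop[:4]
        children = []
        for _ in range(pop_size):
            a, b = random.sample(parents, 2)
            # 4. crossover: mean of the two parent motion vectors
            child = ((a[0] + b[0]) // 2, (a[1] + b[1]) // 2)
            # 5. mutation: add a small random vector
            child = (child[0] + random.randint(-1, 1),
                     child[1] + random.randint(-1, 1))
            children.append(child)
        pop = children               # 6. iterate on the new population
    return best
```

Because the best vector ever evaluated is retained (as in the 9-fittest set described below), the result is never worse than the initialization.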
Each chromosome is represented directly by the motion vector with its two components concatenated. The spatial correlation of the motion vectors in the frame is exploited: instead of the search window center, the best motion vector of the previous macro-block in the same slice is used as the initialization of the search. A set of the 9 fittest chromosomes (best motion vectors so far) is always kept to ensure that the final motion vector is the one with the lowest MAE of all performed matchings, not only the best result of the last population. The mean of the two parent motion vectors is used as the crossover operation, and mutation is performed by adding small random vectors to the generated children.
Furthermore, in many cases the motion vector fields of consecutive frames are highly correlated. This characteristic can be exploited to significantly improve the results of our algorithm by applying a VECTOR TRACING technique. This corresponds to adding, at the initialization phase, the best vectors of the nine surrounding macro-blocks in the reference frame (with appropriate scaling for the B frames). These adaptations result in the Vector-traced Modified Genetic Search Algorithm (VT-MGS).
For very complex sequences (like basketball), the prediction quality for scenes with very fast motion can be further enhanced by applying a 2-phase vector tracing scheme. In the first phase, the sequence is treated like a low-delay coding sequence, and non-MPEG-2-conforming predictions in display order are performed with the only goal of calculating very exact tracing vectors by adding up partial results of this estimation (e.g. to form the tracing vector P->1, the non-MPEG-2 motion vectors B1->1, B2->B1, and P->B2 are added up and stored). In the second phase, MPEG-2-conforming motion estimation is done using the pre-calculated initialization vectors. This 2-phase motion estimation, referred to as VT-MGS2, of course implies a significant increase in processing power (by 50%) due to the additional first phase.
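The tracing-vector construction amounts to component-wise addition of the partial motion vectors; the vector values below are illustrative, not from the paper:

```python
# Sketch of the 2-phase tracing-vector construction described above:
# partial display-order estimates are added up to form an initialization
# vector, e.g. P->1 = (B1->1) + (B2->B1) + (P->B2). The vector values
# are assumptions for illustration.
def add_vectors(*mvs):
    """Compose motion vectors by component-wise addition."""
    return (sum(v[0] for v in mvs), sum(v[1] for v in mvs))

b1_to_1 = (2, 0)     # assumed phase-1 partial estimates
b2_to_b1 = (1, -1)
p_to_b2 = (3, 1)

p_to_1 = add_vectors(b1_to_1, b2_to_b1, p_to_b2)
assert p_to_1 == (6, 0)
```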
2.3. Real time DVB-S2 Low-Density Parity-Check (LDPC) for GPUs
Until now the computational power required to decode large code words in real time was not available. Although LDPC decoding solutions have recently been proposed for multi-core platforms, they mainly address short and regular codes. In this work, LDPC decoders based on Graphics Processing Units (GPUs) are proposed for the first time for the computationally demanding case of the irregular LDPC codes adopted in Digital Video Broadcasting (DVB-S2).
The irregular nature of these LDPC codes can impose memory access constraints, and this, associated with the large code size, creates challenges which are difficult to overcome. Also, the scheduling mechanism imposes important restrictions on the attempt to parallelize the algorithm. Thread-level and data-level parallelism can be conveniently exploited, together with the use of fast local memories, to harness the computational efficiency of these GPU-based signal processing algorithms.
The algorithms developed support multicodeword decoding and are scalable to future GPU generations, which are expected to have a higher number of cores. It is shown that real-time DVB-S2 LDPC decoding with throughputs above 90 Mbps is possible on ubiquitous GPU computing platforms.
The LDPC codes adopted in DVB-S2 have a periodic nature, which allows the exploitation of suitable representations of the data structures to attenuate their computational requirements. The properties of DVB-S2 codes are exploited for GPU parallel architectures. The parity-check matrix H has the form shown below.
H(N-K) x N = [ A(N-K) x K | B(N-K) x (N-K) ] =

  | a(0,0)     ...  a(0,K-1)     | 1 0 0 ... 0 0 0 |
  | a(1,0)     ...  a(1,K-1)     | 1 1 0 ... 0 0 0 |
  | a(2,0)     ...  a(2,K-1)     | 0 1 1 ... 0 0 0 |
  |    .       ...      .        |       ...       |
  | a(N-K-2,0) ...  a(N-K-2,K-1) | 0 0 0 ... 1 1 0 |
  | a(N-K-1,0) ...  a(N-K-1,K-1) | 0 0 0 ... 0 1 1 |
where A is sparse and B is a staircase lower triangular matrix. The periodicity constraints imposed on the pseudo-random generation of A allow a significant reduction in storage requirements without code performance loss.
The Min-Sum algorithm was adopted in this work to perform the decoding of these computationally intensive long LDPC codes.
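A minimal sketch of the Min-Sum check-node update (sign product and minimum magnitude over the other incoming messages); the LLR values below are illustrative:

```python
# Sketch of the Min-Sum check-node update: the message sent to each
# connected bit node takes the sign product and the minimum magnitude
# over all the *other* incoming messages. The incoming LLR values are
# assumptions for illustration.
def min_sum_check_update(incoming):
    """incoming: LLR messages from the bit nodes connected to one CN."""
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign
        out.append(sign * min(abs(m) for m in others))
    return out

msgs = [2.0, -1.5, 3.0, -0.5]
assert min_sum_check_update(msgs) == [0.5, -0.5, 0.5, -1.5]
```

In the GPU decoder described next, one thread performs this update per check node (kernel 1), with the analogous bit-node accumulation in kernel 2.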
PROPOSED PARALLEL ALGORITHM FOR MANY-CORE GPUs
A computing system with a GPU consists of a host, typically a Central Processing Unit (CPU), that is used for programming and controlling the operation of the GPU. The GPU is a massively parallel processing engine that can speed up processing by simultaneously performing the same operation on distinct data distributed across many arithmetic processing units.
GPUs are programmable, and one of the most widely used programming models is the NVIDIA Compute Unified Device Architecture (CUDA). The execution of a kernel on a GPU is distributed across a grid of thread blocks of adjustable size.
Parallel multithreaded LDPC decoder processing: (a) kernels 1 and 2 on the GPU using one thread per node of the Tanner graph, where, for example, (b) BNd, BNf, and BNk are BNs connected to CN0, and threads are grouped and processed in B blocks on the (c) GPU many-core architecture.
The algorithm developed attempts to exploit two major capabilities of GPUs: the massive use of thread and data parallelism, and the minimization of memory accesses, which often degrade performance in multi-core systems.
Multithread-based processing: In order to extract full thread-level parallelism from the GPU, the proposed DVB-S2 LDPC decoder exploits a thread-per-node approach (thread-per-row and thread-per-column based processing). The figure illustrates this strategy with 16 threads per block (represented by tc0..tc15) being processed in parallel inside block 0 for the check node processing indicated in kernel 1 of Algorithm 1. A similar approach is applied to the remaining threads tc16..tcN-K of kernel 1, which are grouped and executed in other blocks of the grid. Also, in kernel 2, threads tB0..tBN-1 perform the equivalent parallel bit node processing. The efficiency of this parallelism is achieved by adopting a flooding schedule strategy that eliminates data dependencies in the exchange of messages between BNs and CNs. Additionally, to fully exploit the massive processing power of the GPU, the algorithm performs multicodeword decoding by decoding 16 code words in parallel. Moreover, this solution uses 8 bits to represent data, which compares favorably with existing state-of-the-art VLSI DVB-S2 LDPC decoders that typically use 5 or 6 bits.
Coalesced accesses to data structures: In a GPU, parallel accesses to the slow global memory can kill performance and should, whenever possible, be minimized. To optimize this type of operation, data is contiguously aligned in memory, which allows coalescing to take effect and several threads to access corresponding data simultaneously, as depicted in the figure below. Nevertheless, modern GPU hardware can be more efficient at dealing with out-of-order memory accesses and related issues.
This is an efficient parallel algorithm for performing the massive decoding of DVB-S2 LDPC codes on GPUs, and high throughputs can be achieved for real-time applications, with values surpassing 90 Mbps.
3. Architecture level
3.1. Low Power Interconnects for SIMD Computers
A limit on the SIMD width of different architectures (like 3D graphics, high definition video, image processing, and wireless communications) is the scalability of the interconnect network between the processing elements, in terms of both area and power.
We use the XRAM, a low power, high performance matrix-style crossbar, instead of an SRAM-based design. One of the most power-efficient ways to utilize transistor area is to integrate multiple processing elements (PEs) within a die. This is represented in many architectures in the form of an increased number of single instruction multiple data (SIMD) lanes in processors and the shift from multi-core to many-core architectures. Network-on-chip studies show that the crossbar itself consumes between 30% and almost 50% of the total interconnect power. Another critical problem is that existing circuit topologies in traditional interconnects do not scale well, because of the complexity of the control wires and the control signal generation
logic, which directly affects the delay and power consumption.
One circuit technique that helps solve the control complexity problem is to embed the interconnect control within the cross points of a matrix-style crossbar using SRAM cells. This differs from the traditional technique where interconnections are set by an external controller. Other circuit techniques, like using the same output bus wires to program the cross-point control, help reduce the number of control wires needed within the XRAM. Finally, borrowing low voltage swing techniques currently used in SRAM arrays improves performance and lowers the energy used in driving the wires of the XRAM. Though these techniques help solve the performance and scaling problems of traditional interconnects, one drawback is flexibility: the XRAM can only store a certain number of swizzle configurations at a given time. A case study shows that the XRAM achieves 1.4x performance and consumes 2.5x less power in a color-space conversion algorithm.
XRAM fundamentals.
The input buses run horizontally while the output buses run vertically, creating an array of cross points. Each cross point contains a 6T SRAM bit cell whose state determines whether or not input data is passed onto the output bus at that cross point. Along a column, only one bit cell is programmed to store a logic high and create a connection to an input. Matrix-type crossbars incur a huge area overhead because of the quadratically increasing number of control signals required to set the connectivity at the cross points. To mitigate this, the XRAM uses techniques similar to those employed in SRAM arrays. In an SRAM array, the same bit line is used to read as well as write a bit cell. Until the XRAM is programmed, the output buses do not carry any useful data; hence, they can be used to configure the SRAM cells at the cross points without affecting functionality.
Along a channel (output bus), each SRAM cell is connected to a unique bit line of the bus. This allows the programming of multiple SRAM cells (as many as there are bit lines in the channel) simultaneously. XRAM re-uses the output channels for programming, improving silicon utilization to 45%. To further improve silicon utilization, multiple SRAM cells can be embedded at each cross point to cache more than one shuffle configuration. We find that many applications, especially in the signal processing domain, use only a small number of permutations over and over again. By caching the most frequently used patterns, the XRAM reduces power and latency by eliminating the need to reconfigure and reprogram the XRAM for those patterns.
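The cross-point array described above can be modeled behaviorally as follows; the 4x4 size, the class name, and the swizzle pattern are assumptions for illustration:

```python
# Behavioral sketch of the XRAM cross-point array: one SRAM bit per
# cross point, with exactly one bit set per output column so that each
# output bus is driven by a single input bus. Sizes are assumed.
NUM_INPUTS = 4
NUM_OUTPUTS = 4

class XramModel:
    def __init__(self):
        # cross[i][j] == True routes input bus i to output bus j
        self.cross = [[False] * NUM_OUTPUTS for _ in range(NUM_INPUTS)]

    def program(self, mapping):
        """mapping[j] = index of the input that drives output column j."""
        for j, i in enumerate(mapping):
            for row in self.cross:
                row[j] = False          # only one bit high per column
            self.cross[i][j] = True

    def route(self, inputs):
        """Drive each output bus from the input selected at its column."""
        out = []
        for j in range(NUM_OUTPUTS):
            driver = next(i for i in range(NUM_INPUTS) if self.cross[i][j])
            out.append(inputs[driver])
        return out

x = XramModel()
x.program([2, 0, 3, 1])                 # an assumed swizzle pattern
assert x.route(["a", "b", "c", "d"]) == ["c", "a", "d", "b"]
```

Caching multiple stored configurations, as described above, would correspond to keeping several such `mapping` patterns resident and switching among them without reprogramming.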
Compared to conventionally implemented crossbars, the area scales linearly with the product of the input and output ports while consuming almost 50% less energy.
Compared to conventional MUX-based implementations, the XRAM improves performance by 1.4x and achieves 1.5-2.5x lower power for applications such as color-space conversion.
4. Hardware level
4.1. Hardware-Efficient Belief Propagation
Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as a Markov random field (MRF), but it has high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make it difficult to parallelize the computation. Two techniques are proposed to address these issues. BP algorithms generally require a great amount of memory for storing the messages, typically on the order of tens to hundreds of times larger than the input data. Besides, since each message is processed hundreds of times, the saving and loading of messages consumes considerable bandwidth. Therefore, although BP may work on high-end platforms such as desktops, it is impractical for low-cost, power-limited devices. Because the messages are sequentially updated and each message is constructed via a sequential procedure, it is difficult to utilize hardware parallelism to accelerate BP.
Tile-based BP is used to address these issues. It splits the Markov random field into many tiles and only stores the messages across neighboring tiles. The memory and bandwidth required by this technique are only a fraction of those of ordinary BP algorithms, but the quality of the results, as tested on the publicly available Middlebury MRF benchmarks, is comparable to other efficient algorithms.
Tile-Based BP
Let us first consider the process of generating the outgoing messages of a node p. We need the four incoming messages toward p, the data costs of p, and the smoothness cost between p and its neighbor q. Beyond these, we do not need the data of nodes far away from p. This property is exploited in a bipartite graph, where the nodes are split into two sets so that every edge connects two nodes of different sets. To generate messages from the first set to the second one, we only need the messages from the second set to the first one. Therefore, only half of the messages are stored. An interesting question is: what data are required to generate the messages toward p? The answer is the data costs of p's neighbors and the messages sent from the neighbors of those neighbors, as shown in Fig. (b). Again, we do not have to access the variables outside this group of nodes.
This rule can easily be extended. To generate the messages from the shaded nodes in Fig. (b) to p's neighbors, we only need the messages from their neighbors. Therefore, if we have the messages from the boundary of a region, we can sequentially generate the messages inward. After we reach the region center, we can then sequentially generate the outward messages. The only required inputs are the data costs and smoothness costs of this region and the messages of the boundary nodes. This concept can be extended to multiple regions and iterations. For example, we can split the nodes of the MRF into two sets, as shown in the figure: one set N1 contains the nodes in a 3-by-3 tile, and the other set N2 contains all other nodes. When we perform BP in N2, without knowing the messages in N1 (dotted edges in the figure), we only need the messages coming from N1 to drive the propagation (outward arrows in the figure). All the messages inside the tile are irrelevant to the message passing outside the tile, because they are never used in its evaluation. The only messages that must always reside in memory are those sent from one subset to the other. When we perform message passing in one subset, the messages inside the other one can be discarded without affecting the operation. Given enough computation, the approximation quality is good enough to drive the propagation to converge.
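The message update at the heart of BP can be sketched on a 3-node chain MRF (min-sum form, not the tiling itself); the data costs and the |l - l'| smoothness cost below are assumptions. On a chain, the forward/backward sweep mirrors the inward/outward sweeps described above and yields the exact minimum-energy labeling:

```python
# Minimal min-sum message-passing sketch on a 3-node chain MRF. The
# data costs and smoothness cost are illustrative assumptions; on a
# chain (a tree) this sweep computes the exact MAP labeling.
LABELS = range(3)
DATA = [[0, 2, 4],   # data cost D[node][label]
        [3, 0, 3],
        [4, 2, 0]]

def smooth(a, b):
    return abs(a - b)

def pass_messages():
    n = len(DATA)
    fwd = [[0] * len(LABELS) for _ in range(n)]   # messages from the left
    bwd = [[0] * len(LABELS) for _ in range(n)]   # messages from the right
    for i in range(1, n):
        for l in LABELS:
            fwd[i][l] = min(DATA[i - 1][k] + fwd[i - 1][k] + smooth(k, l)
                            for k in LABELS)
    for i in range(n - 2, -1, -1):
        for l in LABELS:
            bwd[i][l] = min(DATA[i + 1][k] + bwd[i + 1][k] + smooth(k, l)
                            for k in LABELS)
    # belief at each node: data cost plus all incoming messages
    return [min(LABELS, key=lambda l: DATA[i][l] + fwd[i][l] + bwd[i][l])
            for i in range(n)]

assert pass_messages() == [0, 1, 2]
```

Tile-based BP applies the same update, but only the messages crossing tile boundaries (the `fwd`/`bwd` values at the ends, in this chain analogy) ever need to stay resident in memory.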
We can see that as the algorithm iterates (Ti = iterations), the energy keeps decreasing. This technique greatly reduces the memory, bandwidth, and computational costs of BP and enables parallel processing. With these techniques, BP becomes more suitable for low-cost and power-limited consumer electronics. The applicability of the proposed techniques has been demonstrated for various applications in the Middlebury MRF benchmark, and a VLSI circuit and a GPU program for stereo matching based on the proposed techniques have been developed. These techniques can also be applied to other parallel platforms.
4.2. AREA OPTIMIZED LOW POWER 1-BIT FULL-ADDER
A low power 1-bit full adder (FA) with 10 transistors is proposed and used in the implementation of an ALU. With this adder, the power and area are greatly reduced: by more than 70% compared to the conventional design and by 30% compared to transmission gates. So the design is attributed as an area-efficient and low power ALU. This design does not compromise speed, as the delay of the full adder, and thus the overall delay, is minimized. The leakage power of the design is also reduced by designing the full adder with fewer power-supply-to-ground connections. A conventional CMOS full adder consists of 28 transistors, but here a full adder is designed with only 10 transistors, which occupies much less area and also consumes much less power.
The table shows the carry status of the full adder. If both A and B are 1, a carry is generated, because summing A and B makes the output SUM 0 and CARRY 1. If both A
and B are 0, summing A and B gives 0, and any previous carry is added to this SUM, making the CARRY bit 0. This in effect deletes the carry.
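The adder behavior described by the table reduces to SUM = A xor B xor Cin and CARRY = majority(A, B, Cin), which can be checked against the full truth table:

```python
# Boolean sketch of the 1-bit full-adder behavior described above:
# SUM is the parity of the three inputs, CARRY their majority. The
# identity 2*CARRY + SUM == A + B + Cin verifies all eight input cases.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    carry = (a & b) | (a & cin) | (b & cin)
    return s, carry

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, carry = full_adder(a, b, cin)
            assert 2 * carry + s == a + b + cin
```

The 10T, 14T, and 28T circuits below are different transistor-level realizations of this same Boolean function.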
Static complementary CMOS adders using
28 transistors.
Fourteen Transistor (14T) Full adder with
Transmission Gates
10 Transistor Full Adder Design (10TFA)
The proposed 10TFA also takes the three inputs A, B, and Cin; the third input Cin represents the carry input to the first stage. The outputs are SUM and CARRY. The full adder circuit uses a 0.18 µm CMOS process technology, which provides transistors with three characteristics, namely high speed, low voltage, and low leakage. As the main target of this design is to minimize power, the transistors are selected accordingly. The typical supply voltage for this process is 1.8 V. The 10-transistor 1-bit full adder is designed at the transistor level using the 0.18 µm CMOS process. Based on simulation, the 10-transistor 1-bit full adder consumes 6.2995 µW of power, whereas a conventional full adder consumes 16.675 µW, which shows a 62.2% power saving.
The total power consumption of a CMOS circuit includes dynamic power, static power, and short-circuit power consumption. The last two are neglected due to their low contribution. The dominant factor is the dynamic power, based on the equation P = CL * f * Vdd^2.
The instantaneous power P(t) drawn from the power supply is proportional to the supply current Idd(t) and the supply voltage Vdd(t):
P(t) = Idd(t) * Vdd(t)
The energy consumed over a time interval T is the integral of the instantaneous power:
E = ∫ from 0 to T of Idd(t) * Vdd(t) dt
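A numerical sketch of these relations, with an assumed triangular supply-current pulse and the 1.8 V supply typical of this process:

```python
# Numerical sketch of the instantaneous-power relations above:
# P(t) = Idd(t) * Vdd(t), integrated over an interval T to get energy.
# The current waveform is an assumption for illustration.
VDD = 1.8      # V, typical supply for the 0.18 um process
T = 1e-6       # s, integration interval
STEPS = 1000

def idd(t):
    # Assumed triangular current pulse peaking at 1 mA mid-interval.
    return 1e-3 * (1 - abs(2 * t / T - 1))

dt = T / STEPS
energy = sum(idd((i + 0.5) * dt) * VDD * dt for i in range(STEPS))
# Closed form for this waveform: E = 0.5 * 1e-3 * VDD * T = 0.9 nJ
assert abs(energy - 0.9e-9) < 1e-12
```

The midpoint sum matches the closed-form integral because the assumed waveform is piecewise linear.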
S.No. | Design             | Cell                                | Energy (pJ) | Power (µW)
1     | CMOS               | 2 x 1 MUX                           | 0.9215      | 4.6075
2     |                    | 4 x 1 MUX                           | 3.0245      | 15.123
3     |                    | Logical MUX                         | 8.5212      | 42.606
      |                    | Conventional Full Adder             | 3.3342      | 16.675
4     | Transmission Gates | 2 x 1 MUX with Transmission gates   | 0.3625      | 1.6079
5     |                    | 4 x 1 MUX with Transmission gates   | 0.8525      | 4.2625
6     |                    | Logical MUX with Transmission gates | 7.4285      | 37.145
7     |                    | 10 Transistor Full Adder            | 1.2599      | 6.2995
Design      | ALU with CMOS gates | ALU with CMOS gates & 10-Transistor full adder | ALU with Transmission gates & 10-Transistor full adder
Energy (pJ) | 840.82              | 351.95                                         | 239.52
Power (µW)  | 4204.5              | 1759.5                                         | 1197.5
The leakage power is also very low, as the number of power-supply-to-ground connections is greatly reduced. The power consumption of the 16-bit ALU with the 10-transistor full adder is observed to be 1197.5 µW.
References
[1] Implementation of Low Power One-Chip MUSE Video Processor, by Tetsuo Aoki.
[2] A Low Power Video Processor, by Uzi Zangi and Ran Ginosar.
[3] IMAGE: A Low Cost, Low Power Video Processor, by Friederich Mombers.
[4] Real-Time DVB-S2 LDPC Decoding on Many-Core GPU Accelerators, by Gabriel Falcao, Joao Andrade, Vitor Silva, and Leonel Sousa.
[5] Low Power Interconnects for SIMD Computers, by Mark Woh.
[6] Hardware-Efficient Belief Propagation, by Chia-Kai Liang, Chao-Chung Cheng, and Yen-Chieh Lai.
[7] Area Optimized Low Power Arithmetic and Logic Unit, by T. Esther Rani.
[8] http://ieeexplore.ieee.org.libezproxy2.syr.edu/search/advsearch.jsp