Page 1: “12-bit Wallace Tree Multiplier” CMPEN 411 – Final Report ...kxc104/class/cmpen411/09s/pj/Poremba_FINAL.pdf · “12-bit Wallace Tree Multiplier” CMPEN 411 – Final Report

“12-bit Wallace Tree Multiplier”
CMPEN 411 – Final Report

Matthew Poremba
5/1/2009


Project Overview

This project was originally titled “Fast Fourier Transform Unit,” but due to space and time constraints it was reduced to a 12-bit integer multiplier. Twelve bits is enough precision for a fast Fourier transform application on processors that do not have a multiplication unit, such as the common 18Fxxxx series by Microchip, Inc.

A Wallace tree design was chosen, using 2:2, 3:2, and 6:3 compressors. These compressors are combinations of full adders and half adders. The adder selected was a carry select adder, which uses full adders, half adders, and multiplexers. All of these blocks are described in the next section. The layout of the Wallace tree uses two rows of 15 6:3 compressors in the first “stage.” The outputs of this stage are sent to the second stage, which sends its outputs to the third stage. The result finally reaches the fourth stage as only 2 rows of binary numbers, which are summed using a regular adder.

Figure 0 – Conceptual view of 12-bit Wallace tree multiplier with 6:3 compressors

The partial products highlighted in green form the upper block of the first stage, and those highlighted in orange form the lower block. In each stage, numbers surrounded by a rectangular border are the inputs to one compressor at that stage. Thus, stage 2 has 12 6:3 compressors, 8 3:2 compressors, and 1 2:2 compressor. When a 6:3 compressor is used with fewer than 6 inputs, the unused inputs are grounded. This is acceptable since little space would be saved by creating special compressors such as 5:3 or 4:3. Note also that the numbers in bold in the first stage are inverted partial products, which are necessary for the multiplication of 2's complement numbers.
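
The report does not name the scheme behind the inverted partial products. One common scheme that fits the description is Baugh-Wooley multiplication, and the following is a minimal sketch under that assumption. The function name `signed_multiply_bw` and the two correction constants are part of the assumed scheme, not taken from the report.

```python
def signed_multiply_bw(a, b, n=12):
    """Multiply two n-bit 2's complement integers using only AND-ed bits.

    Assumed scheme (Baugh-Wooley): partial products that involve exactly
    one sign bit are inverted, and two constant 1s are added.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            pp = a_bits[i] & b_bits[j]
            # Invert the partial product when exactly one operand bit is a sign bit.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            total += pp << (i + j)

    # Correction constants of the assumed scheme.
    total += (1 << n) + (1 << (2 * n - 1))

    # Keep 2n bits and reinterpret the result as a signed value.
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


# Exhaustive check on a reduced 4-bit width, plus a few 12-bit cases.
for x in range(-8, 8):
    for y in range(-8, 8):
        assert signed_multiply_bw(x, y, n=4) == x * y
assert signed_multiply_bw(-1, -1) == 1
assert signed_multiply_bw(-2048, 2047) == -2048 * 2047
```

Because every term is a plain bit at a known column, all of the partial products can be fed to the same compressors without any special handling for negative operands.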

In the final layout, the Vdd and Ground wires run in several columns on the metal 1 layer. These columns can be connected to the Vdd and Ground planes that surround the layout for the final packaging. Based on preliminary layouts, the Wallace tree design was very difficult to lay out efficiently, but it should compute much faster than other types of multipliers. This project is better described as a complete project than as a comparison between two basic functional units. As such, there is no comparison section at the end of this document.

Project Data

Small Building Blocks

Full Adder

The full adder sums three inputs, A, B, and Cin, and generates two outputs, the Sum and the Carry. The full adder was designed using standard static CMOS logic and has a total size of 52.5um x 27.6um. The worst case delay path is on the falling edge of the sum, with a delay of 770ps. The schematic was designed in two pieces: a carry generation block and a sum generation block. The carry generation block is designed to be fast, since in most cases the carry must ripple to the next adder. The worst case delay for the carry generation is 390ps. The peak power is 5.10mW.

Figure 1 – Schematic of Full Adder. Built using a Sum and Carry Block.

The sum and carry block circuits are as follows. The corresponding layouts for these blocks can be seen as the separate halves of the layout for the full adder.


Figure 2 – Carry generation block of full adder.

Figure 3 – Sum generation block of full adder.


Figure 4 – Layout of full adder. The left side is the carry generation, and the right is the sum generation block.

The full adder was simulated with all 8 of the 8 possible input combinations. The outputs were verified to be correct, and these simulation results were also used to determine the worst case delay path.

Figure 5 – Simulation results of full adder.

In the simulation, the first signal is input a, the second signal is input b, and the third signal is the carry in. The 4th and 5th signals are the sum and the carry out, respectively. The first three signals can simply be summed together to determine what the last two signals should be. Adding the input signals in binary, with the carry out being the most significant bit, it can be verified that the output signals are indeed correct.
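
The same check can be written down as a short behavioral model. This is only a logic-level sketch: the Boolean expressions are the textbook sum and carry equations, not a transcription of the transistor-level blocks in Figures 2 and 3, and the name `full_adder` is illustrative.

```python
from itertools import product


def full_adder(a, b, cin):
    """Return (sum, carry_out) for three 1-bit inputs."""
    total = a ^ b ^ cin
    carry_out = (a & b) | (cin & (a ^ b))
    return total, carry_out


# All 8 input combinations: carry_out is the most significant bit of a + b + cin.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin
```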


Half Adder

The purpose of the half adder is to sum two inputs and return the resulting sum and any carry. Half adders are important for the multiplication block, since they act as 2:2 compressors and can also be used in building 4:3 and 5:3 compressors. The inputs of the circuit are named A and B. The outputs are Sum and Cout. The circuit was designed using static CMOS logic and has a total area of 31.2um x 27.6um. The worst case delay path is the falling edge of the sum output, with a delay of 790ps. The peak power is 2.34mW.

Figure 6 – Half Adder schematic. Built using a separate sum and carry block.

The circuit was again designed using two separate blocks: a sum block and a carry block. The carry block is designed to be faster than the sum block. The schematics of each follow, and their layouts can be seen as pieces of the complete half adder layout.

Figure 7 – Half adder sum generation block.


Figure 8 – Half adder carry generation block.


Figure 9 – Half Adder Layout. The carry generation is on the left, and the sum generation is on the right side.

The half adder circuit was simulated with all 4 of the 4 possible input combinations. The results verified correct operation of the circuit's outputs and were also used to determine the worst case delay path of the circuit.

Figure 10 – Half adder simulation results.


The simulation results can be verified in the same way as the full adder simulation. The first two signals are the inputs to the circuit, and the last two signals are the sum and the carry out, respectively. When the two inputs are summed in binary, the sum is the least significant digit and the carry out is the most significant digit of the result.

Partial Product Generate (AND gate)

The partial product generate block produces the partial products that are input to the compressors in the Wallace tree multiplier. The circuit was designed using static CMOS logic, with a NAND gate followed by an inverter. The cellview symbol exposes the outputs of both the NAND gate and the inverter to allow for quicker schematic drawing of the various compressors. The worst case delay path is on the falling edge of the output, with a delay of 370ps. The peak power dissipation is 4.21mW.

Figure 11 – Partial Product Generate schematic. Built using a NAND and inverter.

Figure 12 – NAND gate schematic, used in the partial product generate.


Figure 13 – Inverter schematic, used in partial product generate.

Figure 14 – Partial Product Generate layout.


The partial product generate circuit was simulated with all 4 of the 4 possible input combinations. The results verified correct operation of the circuit's outputs and were also used to determine the worst case delay path of the circuit.

Figure 15 – Partial Product Generate simulation results.

The first two signals in the simulation are the inputs, A and B. The last signal is the output, X. Correctness can be verified by ANDing the two input signals together, since the circuit is exactly the same as an AND gate. The output signal should only be high when both input signals are high, which is true.

D Flip-Flop

The D flip-flop stores the 1-bit value present at the input signal D when the clock signal C transitions from low to high. The output of the flip-flop is Q, the value that was at input D at the most recent rising clock edge. The design is a static CMOS design and has a size of 56.4um x 29.4um. The worst-case delay was determined by simulation to be 980ps, when the output signal falls. The peak power dissipation is 8.37mW.

Figure 16 – D flip-flop schematic. The “mystery5” block is a D-latch.


Figure 17 – Schematics of the “mystery5” block.

Figure 18 – Layout of the D flip-flop.

The D flip-flop circuit was simulated using all possible input transitions, as well as transitions that test the behavior between clock edges. The results verified correct operation of the circuit's outputs and were also used to determine the worst case delay path of the circuit.


Figure 19 – Simulation of D flip-flop.

In the simulation, the first signal is the clock input, C, the second signal is the data input, D, and the last signal is the output, Q. The clock was transitioned from high to low and from low to high while the data input was low, to ensure the output did not change. The data signal was then transitioned to high, and the clock transitioned from high to low, then low to high. On this low to high transition, the flip-flop captured the data value on Q, as expected. The data signal was then transitioned low while the clock was still high, to ensure the output remained high.
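
The stimulus described above can be replayed against a small behavioral model of a positive-edge-triggered flip-flop. This is a sketch, not the latch-based circuit of Figures 16 and 17: the class name `DFlipFlop` is illustrative, and the initial state Q = 0 and the initial clock level are assumptions made so the trace is defined.

```python
class DFlipFlop:
    """Behavioral positive-edge-triggered D flip-flop (assumed initial Q = 0)."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0  # assumption: the clock starts low

    def step(self, clk, d):
        # Capture D only on a low-to-high clock transition.
        if self._prev_clk == 0 and clk == 1:
            self.q = d
        self._prev_clk = clk
        return self.q


ff = DFlipFlop()
# (clock, data) pairs following the sequence described in the text.
stimulus = [
    (1, 0), (0, 0), (1, 0),  # clock toggles while D is low: Q must stay low
    (1, 1),                  # D goes high with no clock edge: Q unchanged
    (0, 1), (1, 1),          # falling, then rising edge: Q captures 1
    (1, 0),                  # D falls while the clock is high: Q stays high
]
trace = [ff.step(clk, d) for clk, d in stimulus]
assert trace == [0, 0, 0, 0, 0, 1, 1]
```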

6:3 Compressor

The 6:3 compressor functions much like the 3:2 compressor, except that it adds 6 binary inputs instead of 3. The inputs are b0 through b5, and the outputs are the sum, carry, and super-carry, where the sum is the least significant digit, the carry is the middle digit, and the super-carry is the most significant digit. The 6:3 compressor is designed using 4 full adders. Two full adders add the six input bits in parallel, three bits each. A third full adder adds the sums of these two adders. The sum of this third full adder is the least significant digit and the “sum” output. A fourth and final full adder adds the carry outs of the previous 3 full adders. The sum of this full adder is the middle digit and the “carry” output, while its carry out is the most significant digit and the “super-carry” output. The total size of this circuit is 81.3um x 55.5um. The worst-case delay path was found to be 1.4ns, on the falling edge of the sum. The peak power dissipation is 34.73mW.


Figure 20 – Schematic of 6:3 compressor.

Figure 21 – Layout of 6:3 compressor.

The 6:3 compressor circuit was simulated with all 64 of the 64 possible input combinations. The results verified correct operation of the circuit's outputs and were also used to determine the worst case delay path of the circuit.

Figure 22 – Simulation results of 6:3 compressor.


The top 6 signals in the simulation result are the 6 binary inputs to be summed. The last three signals are the output signals: sum, carry, and super-carry, in that order. Correctness was verified by adding the six inputs and comparing the 3-digit binary result with the levels of the three output signals, using the sum as the least significant digit, the carry as the middle digit, and the super-carry as the most significant digit.
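
The four-adder structure and the exhaustive check can be sketched at the logic level as follows. With six inputs, the two first-level adders already consume b0 through b5, so this sketch assumes the third adder's remaining input is tied low; the report does not state how that input is wired. The function names are illustrative.

```python
from itertools import product


def full_adder(a, b, cin):
    """Return (sum, carry_out) for three 1-bit inputs."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def compressor_6_3(b0, b1, b2, b3, b4, b5):
    """Return (sum, carry, super_carry) for six 1-bit inputs."""
    s1, c1 = full_adder(b0, b1, b2)        # first-level adder
    s2, c2 = full_adder(b3, b4, b5)        # first-level adder, in parallel
    s, c3 = full_adder(s1, s2, 0)          # assumption: spare input tied low
    carry, super_carry = full_adder(c1, c2, c3)
    return s, carry, super_carry


# All 64 input combinations: the outputs encode the number of 1s among the inputs.
for bits in product((0, 1), repeat=6):
    s, carry, super_carry = compressor_6_3(*bits)
    assert 4 * super_carry + 2 * carry + s == sum(bits)
```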

Medium Sized Components

4-bit Register

The 4-bit register is built using 4 flip-flops and a buffer. The register's clock input drives the buffer, and the buffer output drives the clock inputs of the four flip-flops. This keeps the clock fan-out from becoming extremely high when several registers are combined into larger ones. The size is 60.15um x 128.4um.

Figure 23 – Schematic of 4-bit register.


Figure 24 – Layout of 4-bit register


24-bit Register

The 24-bit register is built using six 4-bit registers, with a buffer for each set of three 4-bit registers. This is also done to limit fan-out, and results in the final register having a maximum fan-out of 4 from the initial clock input.

Figure 25 – Schematic of 24-bit register

Figure 26 - Layout of 24-bit register


Large Sized Components

Wallace Tree, Stage 1

The upper piece of the multiplier generates the partial products for the first 6 rows of the partial product additions. It was designed as a separate piece to break up the final layout of the multiplier into smaller parts. Likewise, the lower piece generates the partial products for the second 6 rows. These two pieces are connected to create the entire stage 1. It is important to note that the input signals to the two pieces have a fan-out of up to twelve. Since the signals are distributed using the metal2 and metal3 layers, it would be difficult to create a buffer that can be placed under these layers to reduce the fan-out.

Figure 27 – Schematic of stage 1, upper block

Figure 28 – Schematic of stage 1, lower block

Figure 30 – Schematic of stage 1


Figure 31 – Layout of stage 1, upper block

Figure 32 – Layout of stage 1, lower block

Figure 33 – Layout of stage 1


Wallace Tree, Stage 2 & 3

The second and third stages are much like the first stage, except that partial product generation is not needed. Stage 2 is laid out based on the compression needs of the output of the first stage, and stage 3 is laid out based on the output of the second stage. The second stage uses both 6:3 and 3:2 compressors to reduce 6 rows of partial products to 3 rows. The third stage reduces these 3 rows further into 2 rows. These final two rows are summed in the final stage, stage 4, using a normal adder.
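
The row counts from stage to stage (12 rows split into two blocks of 6, then 6, 3, and 2) can be reproduced with a column-wise model of the reduction. This is a simplified sketch with several assumptions: it treats the operands as unsigned, it uses a single counter size per stage rather than the exact mix of 6:3, 3:2, and 2:2 compressors in the layout, and all function names are illustrative.

```python
from collections import defaultdict


def compress(columns, k):
    """Apply k-input counters to every column and return the new columns."""
    out = defaultdict(list)
    for col, bits in columns.items():
        for start in range(0, len(bits), k):
            count = sum(bits[start:start + k])   # number of 1s in this group
            for i in range(k.bit_length()):      # 6 -> 3 outputs, 3 -> 2 outputs
                out[col + i].append((count >> i) & 1)
    return out


def value(columns):
    """Sum every bit at its column weight (this plays the role of stage 4)."""
    return sum(bit << col for col, bits in columns.items() for bit in bits)


def height(columns):
    return max(len(bits) for bits in columns.values())


def wallace_multiply(a, b, n=12):
    """Unsigned n x n multiply; returns (product, row counts after stages 1-3)."""
    upper, lower = defaultdict(list), defaultdict(list)
    for j in range(n):                           # row j belongs to bit j of b
        block = upper if j < n // 2 else lower
        for i in range(n):
            block[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    stage1 = compress(upper, 6)                  # upper block of 6:3 compressors
    for col, bits in compress(lower, 6).items():
        stage1[col].extend(bits)                 # lower block of 6:3 compressors
    stage2 = compress(stage1, 6)
    stage3 = compress(stage2, 3)
    return value(stage3), (height(stage1), height(stage2), height(stage3))


result, heights = wallace_multiply(1234, 567)
assert result == 1234 * 567
assert heights == (6, 3, 2)
```

Each compression step preserves the weighted sum of the bits, which is why the final two rows can simply be added to obtain the product.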

Figure 34 – Stage 2 schematic

Figure 35 – Stage 3 schematic


Figure 36 – Stage 2 layout

Figure 37 – Layout of stage 3


Wallace Tree, Stage 4

This final stage of the Wallace tree multiplier is just a 24-bit ripple carry adder. While this type of adder is not the fastest choice, it is easy to build and relatively small. If only one optimization could be made to this project, this stage would be the best candidate. A faster adder would most likely increase performance by close to 33%, based on the worst case delay of the final Wallace tree layout shown later in this document. The layout of this stage was made to fit into empty spaces left by the other stages, which also places the adders closer to the outputs of the upper stages of the Wallace tree without disturbing the critical path of the circuit.
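
A bit-level sketch of a ripple carry adder shows why this stage dominates the delay: each bit position has to wait for the carry from the position below it. The function name `ripple_carry_add` and the default width are illustrative, and the full adder equations are the textbook ones rather than the report's transistor-level design.

```python
def ripple_carry_add(x, y, width=24):
    """Add two width-bit values one bit at a time; returns (sum, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s = a ^ b ^ carry                     # full adder sum
        carry = (a & b) | (carry & (a ^ b))   # full adder carry, feeds bit i + 1
        result |= s << i
    return result, carry


# A carry generated at bit 0 ripples through all 24 positions.
assert ripple_carry_add(0xFFFFFF, 1) == (0, 1)
assert ripple_carry_add(5, 7) == (12, 0)
```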

Figure 38 – Schematic of 24-bit ripple carry adder (stage 4)

Figure 39 – Layout of 24-bit ripple carry adder (stage 4)


12-bit Wallace Tree

The 12-bit Wallace tree puts all four stages together. The inputs of the block are the two numbers to be multiplied, a<0:11> and b<0:11>. The output of the block is the product d<0:23>. The size of the block is 500um x 880um. The peak power consumption is 520mW.

Figure 40 – 12-bit Wallace tree schematic

Figure 41 – 12-bit Wallace tree layout

Final Layout

The final layout connects the output of the 12-bit Wallace tree multiplier to the 24-bit register. The inputs of the 12-bit Wallace tree are drawn from the output pins on the pad frame. The outputs of the flip-flops are connected to the output pins of the bi-directional pads on the pad frame. The clock of the 24-bit register is connected to an input-only pin pad on the pad frame. An output enable pin was also added in order to switch the 24 bi-directional pads between input and output. The output enable is distributed throughout the circuit using a method similar to a clock network. This is done to reduce the fan-out of the input to at most 4. The size of the block is 628um x 880um.


Figure 42 – Layout of final, 100% completed chip

Simulation Chip

The simulation chip has the same layout as the final layout above, except that the pin pads were removed, since the layout cannot be extracted with the pin pads in place. All simulations were performed on this extracted layout. From this extraction, the worst case delay was determined to be 16.6ns, when -1 is multiplied by -1.


Figure 43 – Layout of final chip used to verify functionality using the simulator.

Figure 44 – Simulation of worst case delay on critical path


Figure 45 – Critical Path

The highlighted critical path is the longest path through the circuit. A 6:3 compressor has a delay equivalent to that of a 3-bit ripple carry adder, since it consists of 2 full adders in parallel, which are in series with two more full adders. Since the stages are in parallel, the path downward is 3 full adder delays, followed by 23 full adders across the bottom, for a total of 26 full adder delays. Extra delay also comes from the large fan-out of the multiplier inputs, the delay through the register, and the large capacitances of lengthy interconnects. Figure 44 shows the simulation of -1 x -1. This multiplication generates a carry in the 2nd output bit, which is propagated across the entire bottom ripple carry adder.
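
As a rough sanity check, the 26 full adder delays can be multiplied by the two full adder delays reported earlier (390ps for the carry and 770ps for the sum). This is only back-of-the-envelope arithmetic on the report's own numbers, not a timing analysis; the variable names are illustrative, and the real path mixes sum and carry transitions and includes the extra loading described above.

```python
# Figures taken from the report.
FA_CARRY_DELAY_PS = 390   # worst case carry generation delay of one full adder
FA_SUM_DELAY_PS = 770     # worst case sum delay of one full adder
MEASURED_DELAY_PS = 16_600

compressor_fa_delays = 3  # path downward through the compressors
adder_fa_delays = 23      # path across the bottom ripple carry adder
path_fa_delays = compressor_fa_delays + adder_fa_delays

# Two reference points: every adder at its carry delay, or at its sum delay.
all_carry_ps = path_fa_delays * FA_CARRY_DELAY_PS
all_sum_ps = path_fa_delays * FA_SUM_DELAY_PS

print(f"{path_fa_delays} full adder delays")
print(f"all-carry estimate: {all_carry_ps / 1000:.2f} ns")
print(f"all-sum estimate:   {all_sum_ps / 1000:.2f} ns")
print(f"measured:           {MEASURED_DELAY_PS / 1000:.2f} ns")
```

The measured 16.6ns falls between the two estimates, which is consistent with a path dominated by carry propagation plus additional fan-out, register, and interconnect delay.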


Optimizations & Conclusions

Optimizations of the project mainly consisted of buffers to decrease the fan-out of various inputs inside the circuit. No other optimizations were made. Since the final layout easily fit within the pad frame, no space enhancements were needed. As mentioned earlier, the best way to improve the project would be to change the 4th stage of the Wallace tree to use an adder faster than a ripple carry adder, such as a carry look-ahead adder. Replacing the 6:3 compressors with 4:2 counters could also reduce the delay of the critical path, although it would create more stages for a 12-bit multiplier. Booth recoding could also have been used to reduce the number of partial products by a factor of two. If this project were fabricated, all possible multiplication combinations could be tested exhaustively. Using a microcontroller which has a multiplication unit, the products could be compared, and all possible combinations could be tested in only a few seconds.
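
To illustrate the Booth recoding remark, the following sketch recodes a 12-bit 2's complement multiplier into six radix-4 digits, each of which would select one partial product. Radix-4 (modified) Booth recoding is an assumption here, since the report does not say which variant was considered, and the function name is illustrative.

```python
def booth_radix4_digits(b, n=12):
    """Recode an n-bit 2's complement value into n/2 digits in {-2, ..., 2}."""
    bits = [(b >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    for k in range(0, n, 2):
        # Each digit looks at an overlapping group of three bits.
        digits.append(bits[k] + prev - 2 * bits[k + 1])
        prev = bits[k + 1]
    return digits


# Every 12-bit signed value is represented exactly by 6 digits (least significant first).
for b in range(-2048, 2048):
    digits = booth_radix4_digits(b)
    assert len(digits) == 6
    assert all(-2 <= d <= 2 for d in digits)
    assert sum(d * 4**k for k, d in enumerate(digits)) == b
```

With six digits instead of twelve rows, the first stage of the tree would start from half as many partial products, at the cost of extra logic to form the multiples selected by each digit.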

