IEEE 2015 Project List Vlsi

Date post: 07-Sep-2015
Upload: igeeks
Description:
Greetings from IGeekS Technologies. Thank you for your enquiry regarding your academic project. We will provide the guidance you need to complete your project successfully.

IGeekS Technologies is a company located in Bangalore, India. We are recognized as a quality provider of hardware and software solutions for students carrying out their academic projects. We offer academic projects at various levels, from graduate to master's (Diploma, BCA, BE, M.Tech, MCA, M.Sc (CS/IT)). As part of the development training, we offer projects in Embedded Systems & Software to engineering college students in all major disciplines.

Academic Projects
As part of our vision to provide field experience to young graduates, we offer academic projects to MCA/B.Tech/BE/M.Tech/BCA students. Our project guidance normally starts with in-depth training, because a student cannot implement a project without knowing the technology. We have designed these courses around industry requirements.

Placements
Our support does not end with training. We maintain a dedicated consulting division with 5 HR executives to help our students find good opportunities. Once a student finishes the course and project, we collect their profile and contact companies. Since January 2010, more than 450 students have been placed with the help of our training, project assistance and placement support.

Facilities
• Project confirmation and completion certificate.
• Project base paper, synopsis and PPT.
• In-depth training by industry experts.
• Project guidance from experienced people.
• Regular seminars and group discussions.
• Lab facility.
• Good placement assistance.
• A CD containing all the required software and materials.
• Lab modules with hundreds of examples to improve students' programming skills.

Please visit our websites for further information:
www.makefinalyearproject.com
www.igeekstechnoloiges.com

We look forward to having you in our office for a detailed technical discussion and an in-depth understanding of the base paper and synopsis. Our training methodology first prepares candidates in the technology used in the selected project and then starts the project implementation; this gives the candidate the prerequisite knowledge to understand not only the project but also the code in which it is implemented. The program concludes with a project completion certificate issued by our organization.

We have attached the proposed project titles for the academic year 2015. Select the titles and we will send the synopsis and base paper. If you have your own topic (base paper), please send it to us; we will check and confirm the implementation. We will explain the base paper and synopsis. For technical discussion or admission, contact Mr. Nandu at 9590544567.
Transcript
  • VLSI IEEE Papers

    Copy Right Protected

    1. A fast and accurate network-on-chip timing

    simulator with a flit propagation model

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7059108&queryText=noc&sortType=desc_p_Publication_Year&searchField=Search_All

    Abstract:

    Network-on-chip (NoC) can be a simulation bottleneck in a many-core system. Traditional

    cycle-accurate NoC simulators need a long simulation time, as they synchronize all

    components (routers and FIFOs) every cycle to guarantee the exact behaviors. Also,

    a NoC simulation does not benefit from transaction-level modeling (TLM) in speed without

    any accuracy loss, because the transaction timings of a simulated packet depend on other

    packets due to wormhole switching. In this paper, we propose a novel NoC simulation

    method which can calculate cycle-accurate timings with wormhole switching. Instead of

    updating states of routers and FIFOs cycle-by-cycle, we use a pre-built model to calculate a

    flit's exact times at ports of routers in a NoC. The results of the proposed simulator are

    verified with NoC implementations (cycle-accurate at RTL) created by a

    commercial NoC compiler. All timing results match perfectly with packet waveforms

    generated by above NoCs (with 40-325 times speed up). As another comparison, the speed

    of the simulator is similar or faster (0.5-23X) than a TG2 NoC model, which is a SystemC

    and transaction-level model without timing accuracy (due to ignoring wormhole traffics).

    2. A Methodology for Cognitive NoC Design

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7128666&queryText=noc&sortType=desc_p_Publication_Year&searchField=Search_All

    Abstract:

    The number of cores in a multicore chip design has been increasing in the past two

    decades. The rate of increase will continue for the foreseeable future. With a large number

    of cores, the on-chip communication has become a very important design consideration.

    The increasing number of cores will push the communication complexity level to a point

    where managing such highly complex systems requires much more than what designers

    can anticipate for. We propose a new design methodology for implementing a cognitive

    network-on-chip that has the ability to recognize changes in the environment and to learn

    new ways to adapt to the changes. This learning capability provides a way for the network

    to manage itself. Individual network nodes work autonomously to achieve global system

    goals, e.g., low network latency, higher reliability, power efficiency, adaptability, etc. We use

    fault-tolerant routing as a case study. Simulation results show that the cognitive design has

    the potential to outperform the conventional design for large applications. With the great

    inherent flexibility to adopt different algorithms, the cognitive design can be applied to many

    applications.

    3. A packet-switched interconnect for many-core

    systems with BE and RT service

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7092531&queryText=noc&sortType=desc_p_Publication_Year&searchField=Search_All

    Abstract:

    A packet-switched interconnect design which supports real-time and best-effort services is

    proposed. This interconnect is different from traditional NoCs in that we use direction

    channels to replace the large input buffers and use less resource to realize the network

    transfer. The connection between our interconnect design and IP core is an on-chip

    memory management block named DME. The real-time service implies preferential transfer

    channel allocation, maximum delay bound and time stamping of every real-time packet. The

    solution is geared towards many-core systems, such as complex industrial control systems

    and communication devices, which require these features to facilitate efficient SW and

    application development.

    4. FPGA based design of low power reconfigurable

    router for Network on Chip (NoC)

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7092531&queryText=noc&sortType=desc_p_Publication_Year&searchField=Search_All

    Abstract:

    FPGA based design of reconfigurable router for NoC applications is proposed in the present

    work. Design entry of the proposed router is done using Verilog Hardware Description

    Language (Verilog HDL). The router designed in the present work has four channels

    (namely, east, west, north and south) and a crossbar switch. Each channel consists of First

    in First out (FIFO) buffers and multiplexers. FIFO buffers are used to store the data and the

    input and output of the data are controlled using multiplexers. Firstly, south channel is

    designed which includes the design of FIFO and multiplexers. After that, the crossbar switch

    and other three channels are designed. All these designed channels, FIFO buffers,

    multiplexers and crossbar switches are integrated to form the complete router architecture.

    The proposed design is simulated using Modelsim and the RTL view is obtained using Xilinx

    ISE 13.4. Xilinx SPARTAN-6 FPGAs are used for synthesis of proposed design. Power

    dissipation of the proposed reconfigurable router is reduced using Power gating technique.

    Total power is calculated by the use of XPower Analyzer tool. Obtained results show that

    the proposed design consumes less power compared to the previously designed

    reconfigurable routers.

    5. Reliable router architecture with elastic buffer for

    NoC architecture

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7050463&queryText=noc&sortType=desc_p_Publication_Year&pageNumber=5&searchField=Search_All

    Abstract:

    Router is the basic building block of the interconnection network. In this paper, new router

    architecture with elastic buffer is proposed which is reliable and also has less area and

    power consumption. The proposed router architecture is based on new error detection

    mechanisms appropriate for dynamic NoC architectures. It considers data packet error

    detection, correction and also routing errors. The uniqueness of the reliable router

    architecture is to focus on finding error sources accurately. This technique differentiates

    permanent and transient errors and also protects diagonal availabilities. Input and output

    buffers in router architectures are replaced by elastic buffers. Routers spend considerable

    area and power for router buffer. In this paper the proposed router architecture replaces

    FIFO buffers with the elastic buffers in order to reduce area, and power consumption and

    also to have better

    6. Design and analysis of 10 port router for network

    on chip (NoC)

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7087013&queryText=noc&sortType=desc_p_Publication_Year&pageNumber=5&searchField=Search_All

    Abstract:

    Network on chip is an emerging technology which provides data reliability and high speed

    with less power consumption. With the technological advancements a large number of

    devices can be integrated into a single chip. So the communication between these devices

    becomes vital. The network on chip (NoC) router is used for such communication. This

    paper focuses on the design analysis of 10 port router. The delay (2.571ns) and power

    (80.98mW) is minimized by using crossbar switch. The proposed architecture of 10 port

    router is simulated and synthesized in Xilinx ISE 14.4 software.

    7. Concentration and Its Impact on Mesh and

    Torus-Based NoC Performance

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7092745&queryText=noc&sortType=desc_p_Publication_Year&pageNumber=2&searchField=Search_All

    Abstract:

    This paper investigates the effects of concentration on the performance of k-ary n-cubes.

    Simulation results indicate that only large ratios of packet length-to-average hop-count are

    in favor of concentrated mesh and torus. The Cmesh takes full advantage of its high

    channel bandwidth to outperform Ctorus. Moreover, non-local traffic suffers more from

    performance bottleneck than local traffic at routers. Providing dedicated input ports, one for

    each IP, at routers, reduces the average packet latency compared to a configuration with a

    single input port shared by all IP cores of the cluster.

    8. Effect of core ordering on application mapping

    onto mesh based network-on-chip design

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7100274&queryText=noc&sortType=desc_p_Publication_Year&pageNumber=2&searchField=Search_All

    Abstract:

    This paper presents a mapping strategy onto mesh based Network-on-Chip (NoC)

    architecture by using combined techniques such as Particle Swarm Optimization (PSO) and

    constructive heuristic. To arrive at a better solution, the basic PSO has been augmented

    further. That is, it runs the PSOs multiple times. The mapping result has been compared, in

    terms of communication cost, with an exact method such as Integer Linear Programming

    (ILP) and other methods. Experiment results show improvement with other approaches.

    9. Merged switch allocation and transversal with

    dual layer adaptive error control for Network-on-Chip switches

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7050468&queryText=noc&sortType=desc_p_Publication_Year&pageNumber=2&searchField=Search_All

    Abstract:

    In this paper, we propose a Network on Chip router architecture with increased reliability,

    energy efficiency and with reduced area overhead. The proposed router architecture model

    adjusts dynamically to the error control strengths of the layers of NoC. In this paper, we

    target to optimize the combined operations of arbiter and multiplexer by using a Merged

    Arbiter Multiplexer (MARX) along with a dual layer cooperative error control protocol. By

    doing so, the number of pipe line stages, area and power consumed is reduced. We use XY

    Routing algorithm to send data from one router to the other when these routers are placed

    in network architecture. The proposed model outperforms the dual layer error control model

    without MARX unit. The router architecture with MARX unit has 22.7% less area and 2.4%

    less energy consumption than router architecture without MARX unit but has moderate

    increase in the delay.
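
The XY routing mentioned in this abstract can be illustrated with a short Python sketch. It assumes a 2D mesh whose routers are addressed by (x, y) coordinates; the function name `xy_route` and the coordinate convention are illustrative assumptions and are not taken from the paper. A packet first travels along the X dimension until it reaches the destination column, then along the Y dimension.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst.

    Dimension-ordered (XY) routing: move along X until the column
    matches the destination, then move along Y. Coordinates are a
    hypothetical addressing scheme for a 2D mesh.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: correct the X coordinate one hop at a time.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: correct the Y coordinate one hop at a time.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because every packet finishes its X hops before taking any Y hop, the route between two routers is fixed, which is why XY routing is commonly described as deterministic.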

    10. Argo: A Real-Time Network-on-Chip

    Architecture With an Efficient GALS

    Implementation

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7064728&queryText=noc&sortType=desc_p_Publication_Year&pageNumber=3&searchField=Search_All

    Abstract:

    In this paper, we present an area-efficient, globally asynchronous, locally synchronous

    network-on-chip (NoC) architecture for a hard real-time multiprocessor platform.

    The NoC implements message-passing communication between processor cores. It uses

    statically scheduled time-division multiplexing (TDM) to control the communication over a

    structure of routers, links, and network interfaces (NIs) to offer real-time guarantees. The

    area-efficient design is a result of two contributions: 1) asynchronous routers combined with

    TDM scheduling and 2) a novel NI microarchitecture. Together they result in a design in

    which data are transferred in a pipelined fashion, from the local memory of the sending core

    to the local memory of the receiving core, without any dynamic arbitration, buffering, and

    clock synchronization. The routers use two-phase bundled-data handshake latches based

    on the Mousetrap latch controller and are extended with a clock gating mechanism to

    reduce the energy consumption. The NIs integrate the direct memory access functionality

    and the TDM schedule, and use dual-ported local memories to avoid buffering, flow-control,

    and synchronization. To verify the design, we have implemented a 4 x 4 bitorus NoC in

    65-nm CMOS technology and we present results on area, speed, and energy consumption for

    the router, NI, NoC, and postlayout.

    11. High Speed Modified Booth Encoder

    Multiplier for Signed and Unsigned Numbers

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6205523&queryText=multiplier&newsearch=true&searchField=Search_All

    Abstract:

    This paper presents the design and implementation of signed-unsigned Modified Booth

    Encoding (SUMBE) multiplier. The present Modified Booth Encoding (MBE) multiplier and

    the Baugh-Wooley multiplier perform multiplication operation on signed numbers only. The

    array multiplier and Braun array multipliers perform multiplication operation on unsigned

    numbers only. Thus, the requirement of the modern computer system is a dedicated and

    very high speed unique multiplier unit for signed and unsigned numbers. Therefore, this

    paper presents the design and implementation of SUMBE multiplier. The modified Booth

    Encoder circuit generates half the partial products in parallel. By extending sign bit of the

    operands and generating an additional partial product the SUMBE multiplier is obtained.

    The Carry Save Adder (CSA) tree and the final Carry Look ahead (CLA) adder are used to

    speed up the multiplier operation. Since signed and unsigned multiplication operation is

    performed by the same multiplier unit the required hardware and the chip area reduces and

    this in turn reduces power dissipation and cost of a system.
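
A small Python model may help make the idea in this abstract concrete. It is a behavioural sketch under stated assumptions, not the paper's circuit: operands are passed as raw n-bit patterns, n is assumed even, and the function name `booth_radix4_multiply` and its `signed` flag are hypothetical. The multiplier is sign-extended (signed mode) or zero-extended (unsigned mode) by two bits, so radix-4 Booth recoding yields n/2 + 1 partial products, one more than a signed-only design.

```python
def booth_radix4_multiply(a, b, n=8, signed=True):
    """Multiply two n-bit patterns using radix-4 (modified) Booth recoding.

    a, b   : raw bit patterns in the range 0 .. 2**n - 1
    n      : operand width in bits (assumed even)
    signed : interpret both patterns as two's complement if True
    """
    if n % 2:
        raise ValueError("this sketch assumes an even operand width")
    mask = (1 << n) - 1
    a &= mask
    b &= mask

    def value(pattern):
        # Interpret the n-bit pattern as signed or unsigned.
        if signed and (pattern >> (n - 1)) & 1:
            return pattern - (1 << n)
        return pattern

    multiplicand = value(a)
    # Extend the multiplier to n + 2 bits (sign- or zero-extension).
    width = n + 2
    y = value(b) & ((1 << width) - 1)

    product = 0
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, width, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1  # Booth digit in {-2, -1, 0, 1, 2}
        product += (digit * multiplicand) << i  # one partial product
        prev = b1
    return product


if __name__ == "__main__":
    print(booth_radix4_multiply(0xFD, 0x05, signed=True))   # -3 * 5
    print(booth_radix4_multiply(0xFD, 0x05, signed=False))  # 253 * 5
```

In hardware the partial products would be summed by the CSA tree and final CLA adder named in the abstract; here an ordinary integer addition stands in for both.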

    12. Design and implementation of 16 × 16

    multiplier using Vedic mathematics

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7150925&queryText=multiplier&sortType=desc_p_Publication_Year&pageNumber=2&searchField=Search_All

    Abstract:

    This paper briefly describes the Urdhva-Tiryagbhyam Sutra of vedic mathematics and we

    have designed multiplier based on the sutra. Vedic Mathematics is the ancient system of

    mathematics which has a unique technique of calculations based on 16 Sutras which are

    discovered by Sri Bharti Krishna Tirthaji. In this era of digitalization, it is required to increase

    the speed of the digital circuits while reducing the on chip area and memory consumption.

    In various applications of digital signal processing, multiplication is one of the key

    component. Vedic technique eliminates the unwanted multiplication steps thus reducing the

    propagation delay in processor and hence reducing the hardware complexity in terms of

    area and memory requirement. We implement the basic building block: 16 × 16

    Vedic multiplier based on Urdhva-Tiryagbhyam Sutra. This Vedic multiplier is coded in

    VHDL and synthesized and simulated by using Xilinx ISE 10.1. Further the design of

    array multiplier in VHDL is compared with proposed multiplier in terms of speed and

    memory.
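
The Urdhva-Tiryagbhyam ("vertically and crosswise") method can be sketched in Python as follows. This is a bit-level behavioural model for unsigned operands, not the paper's VHDL design; the function name `urdhva_multiply` is an assumption. Each result column k collects every cross product a[i] AND b[k - i], and the column carry is passed to the next column.

```python
def urdhva_multiply(a, b, n=16):
    """Multiply two unsigned n-bit integers column by column.

    Column k sums all bit products a[i] & b[k - i] (the "crosswise"
    terms) plus the carry from column k - 1.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result = 0
    carry = 0
    for k in range(2 * n - 1):
        column = carry
        # Only index pairs (i, k - i) that stay inside both operands.
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            column += a_bits[i] & b_bits[k - i]
        result |= (column & 1) << k   # low bit is this column's result
        carry = column >> 1           # the rest moves to the next column
    result |= carry << (2 * n - 1)    # final carry is the top result bit
    return result


if __name__ == "__main__":
    print(urdhva_multiply(1234, 5678) == 1234 * 5678)
```

All column sums depend only on the operand bits, so in hardware they can be formed in parallel, which is the usual argument for the speed of this scheme.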

    13. Low power multiplier architectures using vedic mathematics in 45nm technology for high speed computing

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7045662&queryText=multiplier&sortType=desc_p_Publication_Year&pageNumber=4&searchField=Search_All

    Abstract:

    Speed and the overall performance of any digital signal processor are largely determined by

    the efficiency of the multiplier units present within. The use of Vedic mathematics has

    resulted in significant improvement in the performance of multiplier architectures used for

    high speed computing. This paper proposes 4-bit and 8-bit multiplier architectures based on

    Urdhva Tiryakbhyam sutra. These low power designs are realized in 45 nm CMOS Process

    technology using Cadence EDA tool.

    14. Design of area and power aware reduced

    Complexity Wallace Tree multiplier

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7087207&queryText=multiplier&sortType=desc_p_Publication_Year&pageNumber=4&searchField=Search_All

    Abstract:

    Multiplier is a vital block in high speed Digital Signal Processing Applications. With the more

    advance techniques in wireless communication and high-speed ULSI techniques in recent

    era, the more stress in modern ULSI design under which main constraints are Power,

    Silicon area and delay. In all the high-speed application to Very Large Scale Integration

    fields, fast speed and less area is required. There are two approaches to improve the speed

    of multipliers namely booth algorithm and other is Wallace tree algorithm.

    Generally, multipliers require high latency during the partial products addition and

    conventional multipliers have more stages so delay is more. However, in this paper, the

    work has been done to reduce the area by using energy efficient CMOS Full Adder. To

    implement the high-speed multiplier, Wallace tree multiplier is designed and it is a three-stage

    operation, which again leads to lesser number of stages and subsequently less

    number of transistors. Moreover the gate count is significantly reduced. Multipliers and their

    associated circuits like half adders, full adders and accumulators consume a significant

    portion of most high-speed applications. Therefore, it is necessary to increase their

    performance as well as size efficiency by customization. In order to reduce the hardware

    complexity which ultimately reduces an area and power, Energy Efficient full adders plays a

    vital role in Wallace tree multiplier. Reduced Complexity Wallace multiplier (RCWM) will

    have fewer adders than Standard Wallace multiplier (SWM). The Reduced complexity

    reduction method greatly reduces the number of half adders with 65-75 % reduction in an

    area of half adders than standard Wallace multipliers.
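
For orientation, the following Python sketch models the partial-product reduction of a standard Wallace tree at row level. It is not the reduced-complexity variant the paper proposes, and the names `carry_save_add` and `wallace_multiply` are illustrative assumptions. Groups of three rows are compressed into a sum row and a carry row (the job of a column of full adders) until two rows remain, which a final adder then combines.

```python
def carry_save_add(x, y, z):
    """3:2 compression: x + y + z == sum_row + carry_row."""
    sum_row = x ^ y ^ z
    carry_row = ((x & y) | (y & z) | (x & z)) << 1
    return sum_row, carry_row


def wallace_multiply(a, b, n=8):
    """Multiply unsigned n-bit integers; return (product, reduction_levels)."""
    # Step 1: generate one shifted partial-product row per multiplier bit.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    levels = 0
    # Step 2: compress rows three at a time until only two are left.
    while len(rows) > 2:
        grouped = len(rows) - len(rows) % 3
        next_rows = []
        for i in range(0, grouped, 3):
            next_rows.extend(carry_save_add(*rows[i:i + 3]))
        next_rows.extend(rows[grouped:])  # leftover rows pass through
        rows = next_rows
        levels += 1
    # Step 3: a conventional adder combines the last two rows.
    return sum(rows), levels


if __name__ == "__main__":
    print(wallace_multiply(200, 123))
```

For 8-bit operands the row count goes 8, 6, 4, 3, 2, so the sketch reports four reduction levels; carries only propagate in the final addition.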

    15. FPGA implementation of vedic floating point multiplier

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7091534&queryText=multiplier&sortType=desc_p_Publication_Year&pageNumber=4&searchField=Search_All

    Abstract:

    Most of the scientific operation involve floating point computations. It is necessary to

    implement faster multipliers occupying less area and consuming less power. Multipliers play

    a critical role in any digital design. Even though various multiplication algorithms have been

    in use, the performance of Vedic multipliers has not drawn a wider attention. Vedic

    mathematics involves application of 16 sutras or algorithms. One among these, the Urdhva

    tiryakbhyam sutra for multiplication has been considered in this work. An IEEE-754 based

    Vedic multiplier has been developed to carry out both single precision and double precision

    format floating point operations and its performance has been compared with Booth and

    Karatsuba based floating point multipliers. Xilinx FPGA has been made use of while

    implementing these algorithms and a resource utilization and timing performance based

    comparison has also been made.

    16. FPGA based design of low power

    reconfigurable router for Network on Chip (NoC)

    IEEE 2015

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7148581&queryText=router&sortType=desc_p_Publication_Year&pageNumber=3&searchField=Search_All

    Abstract:

    FPGA based design of reconfigurable router for NoC applications is proposed in the present

    work. Design entry of the proposed router is done using Verilog Hardware Description

    Language (Verilog HDL). The router designed in the present work has four channels

    (namely, east, west, north and south) and a crossbar switch. Each channel consists of First

    in First out (FIFO) buffers and multiplexers. FIFO buffers are used to store the data and the

    input and output of the data are controlled using multiplexers. Firstly, south channel is

    designed which includes the design of FIFO and multiplexers. After that, the crossbar switch

    and other three channels are designed. All these designed channels, FIFO buffers,

    multiplexers and crossbar switches are integrated to form the complete router architecture.

    The proposed design is simulated using Modelsim and the RTL view is obtained using Xilinx

    ISE 13.4. Xilinx SPARTAN-6 FPGAs are used for synthesis of proposed design. Power

    dissipation of the proposed reconfigurable router is reduced using Power gating technique.

    Total power is calculated by the use of XPower Analyzer tool. Obtained results show that

    the proposed design consumes less power compared to the previously designed

    reconfigurable routers.

    17. VHDL Implementation of Genetic Algorithm

    for 2-bit Adder

    Abstract:

    Future planetary and deep space exploration demands that the space vehicles should have robust system architectures and be reconfigurable in unpredictable environment. The Evolutionary design of electronic circuits, or Evolvable hardware (EHW), is a discipline that allows the user to automatically obtain the desired circuit design. The circuit configuration is under control of Evolutionary algorithms. The most commonly used evolutionary algorithm is Genetic Algorithm. The paper discusses on Cartesian Genetic Programming for evolving gate level designs and proposes Evolvable unit for 2-bit adder based on Genetic Algorithm

    18. An Area- and Energy-Efficient FIFO Design Using Error-Reduced Data Compression and Near-Threshold Operation for Image/Video Applications

    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    Abstract:

    Many image/video processing algorithms require FIFO for filtering. The FIFO size is proportional to the length of the filters and input data width, causing large area and power consumption. We have proposed an energy- and area-efficient FIFO design for image/video applications through FIFO with error-reduced data compression (FERDC) and near-threshold operation. On architecture level, FERDC technique is proposed to reduce the size and power consumption of the FIFO by utilizing the spatial correlation between neighboring pixels and performing error-reduced data compression together with quantization to minimize the mean square error (MSE). On circuit level, near-threshold operation is adopted to achieve further power reduction while maintaining the required performance. To demonstrate the proposed FIFO, it has been implemented using a 0.18-µm CMOS process technology. The implementation covers different FIFO length, including 128, 256, 512, and 1024. The experimental results show that the proposed FIFO operating at 0.5 V and 28.57 MHz achieves up to 99%, 65%, and 34.91% reduction in dynamic power, leakage power, and area, respectively, with a small MSE of 2.76, compared with the conventional FIFO design. The proposed FIFO can be applied to a wide range of image/video signal processing applications to achieve high area and energy efficiency.

    19. An Area- and Power-Efficient FIFO with Error-Reduced Data Compression for Image/Video Processing

    IEEE 2014

    Abstract:

    Filtering is a key component of many digital image/video processing algorithms. It often requires FIFO to temporarily buffer the pixels data for later usage. The FIFO size is proportional to the length of the filters and input data width, causing large area and power consumption. This paper presents a technique named FIFO with error-reduced data compression (FERDC) to reduce the FIFO size for various filters. The proposed FERDC significantly reduces the area and power consumption while keeping the error metrics such as mean square error (MSE) and peak signal to noise ratio (PSNR) in the acceptable range. Simulation results of a two dimensional wavelet filter shows that the proposed FERDC technique achieves the FIFO size reduction of up to 44.44% with PSNR values larger than 39 dB, which leads to the reduction of at least 31.6% in the dynamic power and 44.44% in the leakage power.
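
The two error metrics named in this abstract, MSE and PSNR, can be computed as in the Python sketch below. The lossy step used in the demonstration (dropping the two least significant bits of each pixel) is only a stand-in chosen for illustration; it is not the FERDC technique, whose details are not given here. The function names and the 8-bit peak value of 255 are assumptions.

```python
import math


def mse(original, reconstructed):
    """Mean square error between two equal-length pixel sequences."""
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared) / len(squared)


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB (peak=255 assumes 8-bit pixels)."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf  # identical signals
    return 10 * math.log10(peak ** 2 / error)


if __name__ == "__main__":
    pixels = [12, 13, 15, 200]
    # Stand-in lossy step: clear the two least significant bits.
    lossy = [(p >> 2) << 2 for p in pixels]
    print(mse(pixels, lossy), round(psnr(pixels, lossy), 2))
```

A smaller MSE gives a larger PSNR, which is why the abstract can quote a size reduction alongside a PSNR floor as its quality bound.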

    20. DESIGN AND ANALYSIS OF FIVE PORT ROUTER FOR NETWORK ON CHIP

    Abstract:

    With the technological advancements a large number of devices can be integrated into a single chip. So the communication between these devices becomes vital. The network

    on chip (NoC) is a technology used for such communication. A router is the fundamental component of a NoC. This paper focuses on the implementation and the verification of a five port router. The building blocks of the router are buffering registers, demultiplexer, First In First Out registers, and schedulers. The scheduler uses the round robin algorithm. The proposed architecture of five port router is simulated in Xilinx ISE 10.1 software. The source code is written in VHDL.
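
The round robin scheduling mentioned in this abstract can be sketched in Python as a simple arbiter model. It is a behavioural illustration, not the paper's VHDL; the class name `RoundRobinArbiter` and its interface are hypothetical. The search for the next grant starts just after the port that was served last, so every requesting port is eventually served.

```python
class RoundRobinArbiter:
    """Grant one requesting port per call, rotating the priority."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        # Start so that port 0 has the highest priority on the first call.
        self.last_granted = num_ports - 1

    def grant(self, requests):
        """requests: list of booleans, one per port. Returns a port or None."""
        for offset in range(1, self.num_ports + 1):
            port = (self.last_granted + offset) % self.num_ports
            if requests[port]:
                self.last_granted = port
                return port
        return None  # no port is requesting


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(5)
    requests = [True, False, True, False, False]
    print([arbiter.grant(requests) for _ in range(3)])
```

With ports 0 and 2 requesting continuously, the grants alternate between them instead of always favouring the lower-numbered port.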

    21. Design and verification of five port router for

    network on chip

    IEEE 2014

    Abstract:

    Traditional system on chip (SOC) designs offer integrated solutions to exigent design

    tribulations in areas which necessitate outsized computation and restriction in certain area.

    Because of the common bus architecture in SOC system, performance becomes sluggish

    which limits the processing speed. The network on chip (NOC), due to their characteristics

    such as scalability, flexibility, high bandwidth have been proposed as a valid approach to

    meet communication requirements in SoC, where common bus architecture replaced by

    network. The communication on network on chip is carried out by means of router, so for

    implementing better NOC, the router should be efficiently design. In this paper we present

    the design and verification of router for Mesh topology using Verilog HDL which supports

    five parallel connections at the same time. It uses store and forward type of flow control and

    FSM controller deterministic routing which improves the performance of router. Design unit

    is targeted to Spartan 3E xc3s500e-4fg320 FPGA device and simulated in XILINX 13.1

    Software.

    22. Hummingbird: Ultra-Lightweight Cryptography for Resource-Constrained Devices

    Abstract:

    Due to the tight cost and constrained resources of high-volume consumer devices such as RFID tags, smart cards, and wireless sensor nodes, it is desirable to employ lightweight and specialized cryptographic primitives for many security applications. Motivated by the design of the well-known Enigma machine, we present a novel ultra-lightweight cryptographic algorithm, referred to as Hummingbird, for resource-constrained devices. Hummingbird can provide the designed security with a small block size and is resistant to the most common attacks such as linear and differential cryptanalysis. Furthermore, we present efficient software implementations of Hummingbird on the 8-bit microcontroller ATmega128L from Atmel and the 16-bit microcontroller MSP430 from Texas Instruments. Our experimental results show that, after a system initialization phase, Hummingbird can achieve up to 147 and 4.7 times faster throughput for size-optimized and speed-optimized implementations, respectively, when compared to the state-of-the-art ultra-lightweight block cipher PRESENT [10] on similar platforms.

    23. Enhanced FPGA Implementation of the Hummingbird Cryptographic Algorithm

    Abstract:

    Hummingbird is a novel ultra-lightweight cryptographic algorithm aimed at resource-constrained devices. In this work, an enhanced hardware implementation of the Hummingbird cryptographic algorithm for the low-cost Spartan-3 FPGA family is described. The enhancement is due to the introduction of the coprocessor approach. Note that all Virtex and Spartan FPGAs contain many embedded memory blocks, and this work explores the use of these functional blocks. The intrinsic serialism of the algorithm is exploited so that each step performs just one operation on the data. We compare our performance results with other reported FPGA implementations of lightweight cryptographic algorithms. To the best of the authors' knowledge, this work presents the smallest and most efficient FPGA implementation of the Hummingbird cryptographic algorithm.

    24. FPGA-based High-Throughput and Area-Efficient Architectures of the Hummingbird Cryptography

    Abstract:

    Hummingbird is an ultra-lightweight cryptographic algorithm targeted at resource-constrained devices such as RFID tags, smart cards, and sensor nodes. It has been implemented across different target platforms. In this paper, we present two FPGA-based implementations of Hummingbird Cryptography (HC): a throughput-oriented (TO) design and an area-oriented (AO) design. The throughput-oriented design is optimized for operation speed, while the area-oriented design uses fewer area resources. Both proposed designs have been implemented on a low-cost Xilinx Spartan-3 XC3S200 FPGA. Compared with existing methods, the proposed designs use fewer FPGA slices while achieving the same throughput. The proposed architectures are designed to suit adding customizable security to embedded control systems.

    25. Remedying the Hummingbird Cryptographic Algorithm

    Abstract:

    Hummingbird is a recently proposed lightweight cryptographic algorithm for securing RFID systems. In 2011, Saarinen reported a chosen-IV, chosen-message attack on Hummingbird at FSE'11. In this paper, we propose a lightweight remedial scheme in response to Saarinen's attack. The scheme is efficient in both software and hardware since only two cyclic shifts are involved. Using this simple tweak, we can keep the compact design of Hummingbird while enhancing its security. Readers are welcome to attack the remedial Hummingbird.

    26. Low Power Implementation of Hummingbird Cryptographic Algorithm for RFID tag

    Abstract:

    Hummingbird is a newly proposed lightweight cryptographic algorithm targeted at low-cost RFID tags. In this paper, we present a hardware implementation of this algorithm using the SMIC 0.13 µm CMOS process. Unnecessary clock toggling and data toggling are reduced to lower dynamic power. Simulation results show that the total area of our design is 14,735 µm². It requires 16 clock cycles to encrypt 16-bit data (an additional 69 clock cycles are needed for initialization) and consumes 1.08 µW of power with a 1.2 V supply at 100 kHz.

    27. Merged Switch Allocation and Traversal in Network-on-Chip Switches

    Abstract:

    Large systems-on-chip (SoCs) and chip multiprocessors (CMPs), incorporating tens to hundreds of cores, create a significant integration challenge. Interconnecting a large number of architectural modules efficiently calls for scalable solutions that offer both high-throughput and low-latency communication. Switches are the basic building blocks of such interconnection networks, and their design critically affects the performance of the whole system. So far, innovation in switch design has relied mostly on architecture-level solutions that took for granted the characteristics of the main building blocks of the switch, such as the buffers, the routing logic, the arbiters, and the crossbar's multiplexers, and tried to reorganize them more efficiently without any further modifications. Although such pure high-level design has produced highly efficient switches, the question of how much better the switch would be if better building blocks were available remains to be investigated. In this paper, we try to partially answer this question by explicitly targeting the design from scratch of new soft macros that can handle arbitration and multiplexing concurrently and can be parameterized with the number of inputs, the data width, and the priority selection policy. With the proposed macros, switch allocation, which employs either standard round robin or more sophisticated arbitration policies with significant network-throughput benefits, and switch traversal can be performed simultaneously in the same cycle, while still offering energy-delay-efficient implementations.

    28. MIHST: A Hardware Technique for Embedded Microprocessor Functional On-Line Self-Test

    Abstract:

    Testing processor cores embedded in systems-on-chip (SoCs) is a major concern for industry nowadays. In this paper, we describe a novel solution which merges the SBST and BIST principles. The technique we propose forces the processor to execute a compact SBST-like test sequence by using a hardware module called MIcroprocessor Hardware Self-Test (MIHST) unit, which is intended to be connected to the system bus like a normal memory core, requesting no modification of the processor core internal structure. The benefit of using the MIHST approach is manifold: while guaranteeing the same or higher defect coverage of the traditional SBST approach, it reduces the time for test execution, better preserves the processor core Intellectual Property (IP), does not require the system memory to store the test program nor the test data, and can be easily adopted for non-concurrent on-line testing, since it minimizes the required system resources. The feasibility and effectiveness of the approach were evaluated on a couple of pipelined processors.

    29. A Practical NoC Design for Parallel DES Computation

    Abstract:

    The Network-on-Chip (NoC) is considered a new SoC paradigm for supporting a large number of processing cores in the next generation of designs. Combining a NoC with homogeneous processors to construct a Multi-Core NoC (MCNoC) is one way to achieve high computational throughput for specific purposes such as cryptography. Many studies use cryptography standards for performance demonstration but rarely discuss a suitable NoC for such standards. The goal of this paper is to present a practical methodology, without complicated virtual channel or pipeline technologies, to provide high-throughput Data Encryption Standard (DES) computation on FPGA. The results show that a mesh-based NoC with packet and Processing Element (PE) design according to the DES specification can achieve great performance over previous works. Moreover, the deterministic XY routing algorithm shows its competitiveness in high-throughput NoCs, and West-First routing offers the best performance among Turn-Model routings, which are representative of adaptive routing.
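
    Deterministic XY routing on a 2D mesh, as named in the abstract, can be illustrated with a short Python sketch. The coordinate convention, the function name `xy_route`, and the example nodes are assumptions for illustration; the paper's packet format and router hardware are not reproduced here.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst.

    The packet first travels along the X dimension until its column matches
    the destination, then along the Y dimension. Because the turn order is
    fixed, the path between two nodes is always the same.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# Hypothetical example: route from node (0, 0) to node (2, 1) on a mesh.
route = xy_route((0, 0), (2, 1))
```

    The number of hops equals the Manhattan distance between source and destination, and no routing table is needed, which is one reason XY routing is popular in simple mesh NoCs.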

    30. Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification

    Abstract:

    Packet classification is a vital and complicated task, as packets must be processed at a specified line speed. To classify a packet as belonging to a particular flow or set of flows, network nodes must search a set of filters using multiple fields of the packet as the search key. Packet matching should therefore be fast and simple to allow quick processing and classification. A hardware accelerator, or classifier, is proposed here using a modified version of the HyperCuts packet classification algorithm. A new pre-cutting process has been implemented to reduce the memory size so that it fits in an FPGA. This classifier can classify packets at high speed with a power consumption of less than 3 W. By replacing the region compaction scheme of HyperCuts with pre-cutting, this methodology removes the need for floating-point division while classifying packets, and it concentrates on classifying packets at the core of the network.

    31. Ultra-High Throughput Low-Power Packet Classification

    Abstract:

    Packet classification is used by networking equipment to sort packets into flows by comparing their headers to a list of rules, with packets placed in the flow determined by the matched rule. A flow is used to decide a packet's priority and the manner in which it is processed. Packet classification is a difficult task because all packets must be processed at wire speed and rulesets can contain tens of thousands of rules. The contribution of this paper is a hardware accelerator that can classify up to 433 million packets per second when using rulesets containing tens of thousands of rules, with a peak power consumption of only 9.03 W when using a Stratix III field-programmable gate array (FPGA). The hardware accelerator uses a modified version of the HyperCuts packet classification algorithm, with a new pre-cutting process used to reduce the amount of memory needed to save the search structure for large rulesets so that it is small enough to fit in the on-chip memory of an FPGA. The modified algorithm also removes the need for floating-point division when classifying a packet, allowing higher clock speeds and thus higher throughput.
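
    The basic task described above, matching header fields against an ordered rule list, can be sketched in Python as a linear-search baseline. This is not HyperCuts or the paper's accelerator: the two header fields, the `Rule` type, and the sample ruleset are hypothetical and only illustrate what decision-tree algorithms such as HyperCuts are designed to speed up.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Rule(NamedTuple):
    src_range: Tuple[int, int]  # inclusive range on a source field
    dst_range: Tuple[int, int]  # inclusive range on a destination field
    flow: str                   # flow assigned to matching packets


def classify(packet: Tuple[int, int], rules: Sequence[Rule]) -> Optional[str]:
    """Return the flow of the first (highest-priority) rule that matches."""
    src, dst = packet
    for rule in rules:
        src_ok = rule.src_range[0] <= src <= rule.src_range[1]
        dst_ok = rule.dst_range[0] <= dst <= rule.dst_range[1]
        if src_ok and dst_ok:
            return rule.flow
    return None


# Hypothetical ruleset: earlier rules have higher priority.
rules = [
    Rule((0, 1023), (80, 80), "web"),
    Rule((0, 65535), (0, 65535), "default"),
]
flow_a = classify((500, 80), rules)
flow_b = classify((5000, 80), rules)
```

    The cost of this baseline grows linearly with the number of rules, which is why rulesets with tens of thousands of entries call for tree-based search structures and hardware acceleration.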

    32. A STUDY & VHDL IMPLEMENTATION OF REED-SOLOMON ERROR CORRECTING CODES

    Abstract:

    In present-day communication systems, including wireless, satellite, and space communication, reducing errors is critical. Data may be corrupted during transfer, so the high bit error rate of wireless communication systems requires various coding methods for transferring data. Channel coding for error detection and correction helps communication system designs reduce the effect of noise during transmission [1]. In this paper, a Reed-Solomon (RS) encoder and decoder and their VHDL implementation using the ModelSim tool are analyzed. RS codes are non-binary cyclic error-correcting block codes. Redundant symbols are generated in the encoder using a generator polynomial g(x) and appended to the end of the message symbols. The RS decoder then determines the locations and magnitudes of errors in the received polynomial. The paper covers the RS encoding and decoding algorithms and simulation results.
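
    The encoding step described above (parity symbols derived from g(x) and appended to the message) can be sketched in Python. This is an illustrative software model, not the paper's VHDL: the field polynomial 0x11D, the choice of generator roots starting at alpha^0, and all function names are assumptions made for the example.

```python
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)

# Exponential and logarithm tables for GF(2^8) with primitive element 2.
EXP = [0] * 512
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= PRIM_POLY
for power in range(255, 512):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def generator_poly(n_parity):
    """g(x) = (x + a^0)(x + a^1)...(x + a^(n_parity-1)), highest degree first."""
    g = [1]
    for i in range(n_parity):
        root = EXP[i]
        shifted = g + [0]  # g(x) * x
        for j, coeff in enumerate(g):
            shifted[j + 1] ^= gf_mul(coeff, root)  # + g(x) * root
        g = shifted
    return g


def rs_encode(message, n_parity):
    """Systematic encoding: message symbols followed by the remainder of
    message(x) * x^n_parity divided by g(x), computed LFSR-style."""
    g = generator_poly(n_parity)
    remainder = [0] * n_parity
    for symbol in message:
        feedback = symbol ^ remainder[0]
        remainder = remainder[1:] + [0]
        if feedback:
            for j in range(n_parity):
                remainder[j] ^= gf_mul(g[j + 1], feedback)
    return list(message) + remainder


def poly_eval(poly, x):
    """Evaluate a polynomial (highest degree first) at x using Horner's rule."""
    result = 0
    for coeff in poly:
        result = gf_mul(result, x) ^ coeff
    return result


message = [0x12, 0x34, 0x56, 0x78]
codeword = rs_encode(message, 4)
```

    A valid codeword evaluates to zero at every root of g(x), which is the property the decoder's syndrome computation relies on when locating errors.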

    33. Design and Implementation of Reed Solomon Encoder on FPGA

    Abstract:

    Error-correcting codes are used for the detection and correction of errors in digital communication systems. Error-correcting coding is based on appending redundancy to the information message according to a prescribed algorithm. Reed-Solomon codes are a form of channel coding and withstand the effects of noise, interference, and fading. Galois field arithmetic is used for encoding and decoding Reed-Solomon codes. Galois field multipliers and linear feedback shift registers (LFSRs) are used for encoding the information data block. The design of a Reed-Solomon encoder is complex because of the use of LFSRs and Galois field arithmetic. The purpose of this paper is to design and implement a Reed-Solomon (255, 239) encoder with an optimized, smaller number of Galois field (GF) multipliers. A symmetric generator polynomial is used to reduce the number of GF multipliers. To increase the error-correction capability, convolutional interleaving will be used with the RS encoder. The design will be implemented on a Xilinx Spartan-II FPGA.

    34. Instruction-based high-efficient synchronization in a many-core Network-on-Chip processor

    IEEE 2014

    Abstract:

    Parallelized applications running on many-core Network-on-Chip (NoC) processors may consume a great part of execution time to synchronize threads mapped on multiple NoC nodes, if synchronization for NoC processors is not carefully designed. In this paper, we propose an instruction-based synchronization solution applied in a packet-switched many-core NoC processor with 2D mesh grid topology. Return links are added into the on-chip network to transmit acknowledgements of read requests, while a specific instruction SET is designed as an instruction set extension to the original pipeline to perform atomic read-modify-write operations. To support various synchronization schemes, a hardware unit SYNC containing globally addressable registers as shared variables is adopted to handle synchronization requests from both local and remote NoC nodes. Additionally, a FIFO located in the SYNC unit can store these synchronization requests to poll on shared variables locally. Thus, network contention due to busy-wait synchronization algorithms is greatly reduced. Synchronization schemes including spinlock, barrier, FIFO spinlock, and semaphore are implemented as inline assembly functions. Synthesis results under a 55 nm process suggest low area and power overhead of the hardware design. The performance of the synchronization schemes is evaluated and compared to results of conventional methods and prior works, showing that the proposed solution is of higher efficiency.

    35. Argo: A Time-Elastic Time-Division-Multiplexed NOC Using Asynchronous Routers

    IEEE 2014

    Abstract:

    In this paper we explore the use of asynchronous routers in a time-division-multiplexed (TDM) network-on-chip (NOC), Argo, that is being developed for a multi-processor platform for hard real-time systems. TDM inherently requires a common time reference, and existing TDM-based NOC designs are either synchronous or mesochronous. We use asynchronous routers to achieve a simpler, smaller, and more robust self-timed design. Our design exploits the fact that pipelined asynchronous circuits also behave as ripple FIFOs. Thus, it avoids the need for explicit synchronization FIFOs between the routers. Argo has interesting elastic timing properties that allow it to tolerate skew between the network interfaces (NIs). The paper presents the Argo NOC architecture and provides a quantitative analysis of its ability to absorb skew between the NIs. Using a signal transition graph model and realistic component delays derived from a 65 nm CMOS implementation, a worst-case analysis shows that a typical design can tolerate a skew of 1-5 cycles (depending on FIFO depths and NI clock frequency). Simulation results of a 2 × 2 NOC confirm this.

    36. Efficient round-robin multicast scheduling for input-queued switches

    IEEE 2014

    Abstract:

    The input-queued (IQ) switch architecture is favoured for designing multicast high-speed

    switches because of its scalability and low implementation complexity. However, using the

    first-in-first-out (FIFO) queueing discipline at each input of the switch may cause the head-

    of-line (HOL) blocking problem. Using a separate queue for each output port at an input to

    reduce the HOL blocking, that is, the virtual output queuing discipline, increases the

    implementation complexity, which limits the scalability. Given the increasing link speed and

    network capacity, a low-complexity yet efficient multicast scheduling algorithm is required

    for next generation high-speed networks. This study proposes the novel efficient round-

    robin multicast scheduling algorithm for IQ architectures and demonstrates how this

    algorithm can be implemented as a hardware solution, which alleviates the multicast HOL

    blocking issue by means of queue look-ahead. Simulation results demonstrate that

    this FIFO-based IQ multicast architecture is able to achieve significant improvements in

    terms of multicast latency requirements by searching through a small number of cells

    beyond the HOL cells in the input queues. Furthermore, hardware synthesis results show

    that the proposed algorithm can be very efficiently implemented in hardware to perform

    multicast scheduling at very high speeds with only modest resource requirements.

    37. An area- and power-efficient FIFO with error-reduced data compression for image/video processing

    IEEE 2014

    Abstract:

    Filtering is a key component of many digital image/video processing algorithms. It often requires a FIFO to temporarily buffer pixel data for later use. The FIFO size is proportional to the length of the filters and the input data width, causing large area and power consumption. This paper presents a technique named FIFO with error-reduced data compression (FERDC) to reduce the FIFO size for various filters. The proposed FERDC significantly reduces area and power consumption while keeping error metrics such as mean square error (MSE) and peak signal-to-noise ratio (PSNR) in the acceptable range. Simulation results for a two-dimensional wavelet filter show that the proposed FERDC technique achieves a FIFO size reduction of up to 44.44% with PSNR values larger than 39 dB, which leads to a reduction of at least 31.6% in dynamic power and 44.44% in leakage power.
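
    The two error metrics named in the abstract, MSE and PSNR, can be computed as in the following Python sketch. The sample pixel values are hypothetical, and an 8-bit peak value of 255 is assumed; the FERDC compression itself is not modeled.

```python
import math


def mse(original, reconstructed):
    """Mean square error between two equal-length pixel sequences."""
    total = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return total / len(original)


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB; infinite when the signals match."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)


# Hypothetical 8-bit pixels before and after lossy buffering.
original = [100, 120, 130, 140]
reconstructed = [101, 119, 130, 142]
sample_mse = mse(original, reconstructed)
sample_psnr = psnr(original, reconstructed)
```

    A higher PSNR means the reconstructed pixels are closer to the originals, so a threshold such as the 39 dB quoted above bounds how much distortion the compressed FIFO may introduce.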

    38. An Area- and Energy-Efficient FIFO Design Using Error-Reduced Data Compression and Near-Threshold Operation for Image/Video Applications

    IEEE 2014

    Abstract:

    Many image/video processing algorithms require a FIFO for filtering. The FIFO size is proportional to the length of the filters and the input data width, causing large area and power consumption. We have proposed an energy- and area-efficient FIFO design for image/video applications through FIFO with error-reduced data compression (FERDC) and near-threshold operation. At the architecture level, the FERDC technique is proposed to reduce the size and power consumption of the FIFO by utilizing the spatial correlation between neighboring pixels and performing error-reduced data compression together with quantization to minimize the mean square error (MSE). At the circuit level, near-threshold operation is adopted to achieve further power reduction while maintaining the required performance. To demonstrate the proposed FIFO, it has been implemented using a 0.18-µm CMOS process technology. The implementation covers different FIFO lengths, including 128, 256, 512, and 1024. The experimental results show that the proposed FIFO operating at 0.5 V and 28.57 MHz achieves up to 99%, 65%, and 34.91% reduction in dynamic power, leakage power, and area, respectively, with a small MSE of 2.76, compared with the conventional FIFO design. The proposed FIFO can be applied to a wide range of image/video signal processing applications to achieve high area and energy efficiency.

    39. Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip

    IEEE 2013

    http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6133316&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F92%2F6387661%2F06133316.pdf%3Farnumber%3D6133316

    Abstract:

    This paper presents the silicon-proven design of a novel on-chip network to support guaranteed traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage network topology. The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach offers a guarantee of permuted data, and its compact overhead enables the benefit of stacking multiple networks. A 0.13-µm CMOS test chip validates the feasibility and efficiency of the proposed design. Experimental results show that the proposed on-chip network

    40. UnSync: A Soft Error Resilient Redundant Multicore Architecture

    IEEE 2013

    Abstract:

    Reducing device dimensions, increasing transistor densities, and smaller timing windows expose the vulnerability of processors to soft errors induced by charge-carrying particles. Since these factors are only consequences of the inevitable advancement in processor technology, the industry has been forced to improve reliability on general-purpose Chip Multiprocessors (CMPs). With the availability of increased hardware resources, redundancy-based techniques are the most promising methods to eradicate soft error failures in CMP systems. In this work, we propose a novel customizable and redundant CMP architecture (UnSync) that utilizes hardware-based detection mechanisms (most of which are readily available in the processor) to reduce overheads during error-free executions. In the presence of errors (which are infrequent), the recovery mechanism, enabled by always-forward execution, provides resilience in the system. The inherent nature of our architecture framework supports customization of the redundancy and thereby provides means to achieve possible performance-reliability trade-offs in many-core systems. We provide a redundancy-based, soft-error-resilient CMP architecture for both write-through and write-back cache configurations. We design a detailed RTL model of our UnSync architecture and perform hardware synthesis to compare the hardware (power/area) overheads incurred. We compare these with those of the Reunion technique, a state-of-the-art redundant multi-core architecture. We also perform cycle-accurate simulations over a wide range of SPEC2000 and MiBench benchmarks to evaluate the performance efficiency achieved over that of the Reunion architecture. Experimental results show that our UnSync architecture reduces power consumption by 34.5% and improves performance by up to 20% with 13.3% less area overhead, when compared to the Reunion architecture for the same level of reliability.

    41. FPGA based asynchronous pipelined multiplier with intelligent delay controller

    IEEE 2008

    Abstract:

    In this paper, a novel scheme is proposed for the implementation of FPGA-based digital systems using an asynchronous pipelining technique. To control the asynchronous data flow between stages, an intelligent controller is designed which decides the delay of each stage depending on the magnitude of the input data (Data Dependent Delay). The intelligent controller has been designed using the NIOS II soft-core embedded processor in an ALTERA EP2C20F484C7 device. In this approach, however, the maximum operating frequency is limited by the excess logic elements consumed by the microcontroller and by the sequential execution of the C code. Hence, only the function of the NIOS processor that controls asynchronous data flow has been selected and implemented as equivalent hardware, INTASYCON (INTelligent ASYnchronous CONtroller), using a hardware description language, and the speed of the circuit was evaluated. To verify the efficacy of the proposed approach, an 8×8 Braun array multiplier is implemented as external logic to the INTASYCON. The INTASYCON processor calculates the completion time of each stage (based on the logic depth) and accordingly activates the respective dual-edge-triggered flip-flops to transfer data from one stage to the next. This approach consumes less power and also avoids the need for global clock signals and their consequences, such as skew problems.

    42. VLSI implementation of visible watermarking for secure digital still camera design

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1261070&queryText%3Dwatermarking+vlsi

    Abstract:

    Synopsis: Watermarking is the process that embeds data called a watermark into a multimedia object for its copyright protection. Digital watermarks can be visible to a viewer on careful inspection, or completely invisible so that they cannot be easily recovered without an appropriate decoding mechanism. Digital image watermarking is a computationally intensive task and can be sped up significantly by implementing it in hardware. In this work, we describe a new VLSI architecture for implementing two different visible watermarking schemes for images. The proposed hardware can insert either one or both watermarks into an image on the fly, depending on the application requirement. The proposed circuit can be integrated into any existing digital still camera framework. First, separate architectures are derived for the two watermarking schemes and then integrated into a unified architecture. A prototype CMOS VLSI chip implementing the proposed architecture was designed and verified, and is reported in this paper. To our knowledge, this is the first VLSI architecture for implementing visible watermarking schemes.

    43. Analysis and FPGA implementation of image restoration under resource constraints

    http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1183952

    Abstract:

    Programmable logic is emerging as an attractive solution for many digital signal processing

    applications. In this work, we have investigated issues arising due to the resource constraints of

    FPGA-based systems. Using an iterative image restoration algorithm as an example we have

    shown how to manipulate the original algorithm to suit it to an FPGA implementation.

    Consequences of such manipulations have been estimated, such as loss of quality in the output

    image. We also present performance results from an actual implementation on a Xilinx FPGA.

    Our experiments demonstrate that, for different criteria, such as result quality or speed, the

    best implementation is different as well.

    44. Design of high speed low power Viterbi decoder for TCM system

    IEEE 2013

    Abstract:

    High-speed, low-power design of Viterbi decoders for trellis coded modulation (TCM) systems is presented in this paper. It is well known that the Viterbi decoder (VD) is the dominant module determining the overall power consumption of TCM decoders. We propose a pre-computation architecture incorporating the T-algorithm for the VD, which can effectively reduce power consumption without degrading the decoding speed much. A general solution to derive the optimal pre-computation steps is also given. Implementation results for a VD for a rate-3/4 convolutional code used in a TCM system show that, compared with the full-trellis VD, the pre-computation architecture reduces power consumption by as much as 70% without performance loss, while the degradation in clock speed is negligible.

    45. CORDIC Designs for Fixed Angle of Rotation

    IEEE 2013

    Abstract:

    Rotation of vectors through fixed and known angles has wide applications in robotics, digital signal

    processing, graphics, games, and animation. But, we do not find any optimized coordinate rotation

    digital computer (CORDIC) design for vector-rotation through specific angles. Therefore, in this paper,

    we present optimization schemes and CORDIC circuits for fixed and known rotations with different

    levels of accuracy. For reducing the area- and time-complexities, we have proposed a hardwired pre-

    shifting scheme in barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are proposed for

    the fixed-angle rotations. In one of those cells, micro-rotations and scaling are interleaved, and in the

    other they are implemented in two separate stages. Pipelined schemes are suggested further for

    cascading dedicated single-rotation units and bi-rotation CORDIC units for

    high-throughput and reduced latency implementations. We have obtained the optimized set of micro-

    rotations for fixed and known angles. The optimized scale-factors are also derived and dedicated shift-

    add circuits are designed to implement the scaling. The fixed-point mean-squared-error of the proposed

    CORDIC circuit is analyzed statistically, and strategies for reducing the error

    are given. We have synthesized the proposed CORDIC cells by Synopsys Design Compiler using TSMC 90-

    nm library, and shown that the proposed designs offer higher throughput, less latency and less area-

    delay product than the reference CORDIC design for fixed and known angles of rotation. We find similar

    results of synthesis for different Xilinx field-programmable gate-array platforms.
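
    For a fixed and known angle, the directions of the CORDIC micro-rotations can be computed once and then reused for every input vector. The Python sketch below illustrates this idea with the conventional CORDIC iteration in floating point; it does not reproduce the paper's optimized micro-rotation sets, pre-shifting scheme, or scaling circuits, and the function names and the 30-degree example are assumptions.

```python
import math


def micro_rotation_signs(angle, iterations=16):
    """Precompute the direction (+1 or -1) of each micro-rotation for a
    fixed angle in radians (conventional CORDIC, |angle| below about 1.74)."""
    signs = []
    for i in range(iterations):
        sign = 1 if angle >= 0 else -1
        signs.append(sign)
        angle -= sign * math.atan(2.0 ** -i)
    return signs


def rotate_fixed(x, y, signs):
    """Rotate (x, y) using the precomputed micro-rotation directions.

    Each step only needs a shift (multiplication by 2^-i) and an add in
    hardware; the accumulated gain is divided out at the end.
    """
    gain = 1.0
    for i, sign in enumerate(signs):
        shift = 2.0 ** -i
        x, y = x - sign * y * shift, y + sign * x * shift
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain


# Hypothetical fixed angle of 30 degrees applied to the unit vector (1, 0).
signs = micro_rotation_signs(math.pi / 6)
rx, ry = rotate_fixed(1.0, 0.0, signs)
```

    Because the sign sequence depends only on the angle, a fixed-angle design can hardwire it, which removes the angle-accumulation datapath that a general-purpose CORDIC needs.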

    46. A 1.1 GHz 8B/10B encoder and decoder design

    IEEE 2010

    http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5604943&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5604943

    Abstract:

    This paper presents a design of an 8B/10B encoder and decoder with a new architecture. The proposed 8B/10B encoder and decoder are implemented based on pipelining and parallel processing. The decoder implements an error-undiffusing function. This 8B/10B encoder and decoder can be used in high-speed interconnection between chips. After being synthesized using a 90 nm CMOS process, the proposed encoder and decoder achieve an operating frequency over 1.1 GHz and occupy chip areas of 1798 µm² and 1261 µm², respectively. They consume 1.8 mW and 1.12 mW of power, respectively.

    47. An 8B/10B encoder with a modified coding table

    http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4746322&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4746322

    IEEE 2009

    Abstract:

    This paper presents a design of 8B/10B encoder with a modified coding table. The

    proposed encoder has been designed based on a reduced coding table with a modified

    disparity control block. After being synthesized using a 0.18 μm CMOS process, the

    proposed encoder shows an operating frequency of 343 MHz and occupies a chip area of

    1886 μm² with 189 logic gates. It consumes 2.74 mW of power. Compared to conventional

    approaches, the operating frequency is improved by 25.6% and chip area is decreased to

    43%.
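    The disparity control mentioned in the two 8B/10B abstracts above can be illustrated with a minimal sketch. It is not the coding table or the disparity block of either paper. It only shows the general 8B/10B rule that each 10-bit code word has a disparity of 0 or ±2, and that the encoder picks the variant that keeps the running disparity at -1 or +1. The function names are hypothetical; the K28.5 code words are the standard ones.

```python
def disparity(codeword: str) -> int:
    """Return (number of ones) - (number of zeros) for a 10-bit code word string."""
    if len(codeword) != 10 or set(codeword) - {"0", "1"}:
        raise ValueError("expected a 10-character string of 0s and 1s")
    return codeword.count("1") - codeword.count("0")


def select_codeword(running_disparity: int, rd_minus_word: str, rd_plus_word: str):
    """Pick the code word variant for the current running disparity (-1 or +1).

    Returns (codeword, next_running_disparity).
    """
    if running_disparity not in (-1, 1):
        raise ValueError("running disparity must be -1 or +1")
    word = rd_minus_word if running_disparity == -1 else rd_plus_word
    next_rd = running_disparity + disparity(word)
    if next_rd not in (-1, 1):
        raise ValueError("code word would break the running-disparity bound")
    return word, next_rd


# Standard K28.5 comma symbol: one variant per running-disparity state.
K28_5_RD_MINUS = "0011111010"
K28_5_RD_PLUS = "1100000101"

rd = -1
for _ in range(2):
    word, rd = select_codeword(rd, K28_5_RD_MINUS, K28_5_RD_PLUS)
    print(word, rd)
```

    Because K28.5 has nonzero disparity, sending it twice in a row alternates between the two variants and returns the running disparity to its starting value.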

    48. Configurable Pipelined Gabor Filter

    implementation for fingerprint image

    enhancement


    IEEE 2010

    Abstract:

    In this paper, a novel Gabor filter hardware scheme for fingerprint image enhancement is presented. For each pixel of the image, we use the accurate local frequency and orientation to generate the corresponding convolution kernel and thus achieve a better enhancement effect. Compared to previous works, our design yields a higher throughput owing to pipelining. Moreover, the proposed design can be reconfigured to meet different requirements.

    Evaluation results demonstrate that, when the convolution kernel size is 11×11, our design can achieve

    2 MPixels/s at 250 MHz, and the equivalent gate count is 63.8k at the SMIC 0.13 μm worst process corner.

    It is therefore well suited to embedded fingerprint recognition systems.
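    As a rough software illustration of generating a convolution kernel from a local frequency and orientation, the sketch below builds an even-symmetric Gabor kernel. The isotropic Gaussian envelope, the default `sigma`, and the function name are assumptions made for illustration; the abstract does not give the exact kernel formula or the hardware datapath.

```python
import math


def gabor_kernel(frequency, orientation, size=11, sigma=4.0):
    """Build a size x size even-symmetric Gabor kernel.

    frequency   -- local ridge frequency in cycles per pixel (assumed unit)
    orientation -- local ridge orientation in radians
    sigma       -- standard deviation of the Gaussian envelope (assumed value)
    """
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = size // 2
    cos_t, sin_t = math.cos(orientation), math.sin(orientation)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the local orientation frame.
            xr = x * cos_t + y * sin_t
            yr = -x * sin_t + y * cos_t
            envelope = math.exp(-0.5 * (xr * xr + yr * yr) / (sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * frequency * xr))
        kernel.append(row)
    return kernel


kernel = gabor_kernel(frequency=0.1, orientation=0.0)
print(len(kernel), len(kernel[0]), kernel[5][5])
```

    In an enhancement pipeline, a kernel like this would be regenerated for each pixel from its estimated frequency and orientation, then convolved with the 11×11 neighbourhood.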

    49. Fingerprint Verification Using Gabor Co-

    occurrence Features

    IEEE 2010

    Abstract:

    Biometric techniques based on face, iris and fingerprints are used to provide strong

    security. Of these, fingerprint identification produces far more positive identifications of

    persons worldwide than any other human identification procedure. The most widely used

    minutiae-based techniques have difficulty matching two fingerprints with unregistered

    minutia points, and it is also difficult to extract complete ridge structures

    from fingerprints automatically. This paper presents an efficient Gabor Wavelet Transform (GWT)

    based algorithm for fingerprint verification for personal identification. This GWT-based method

    captures local and global information in a fixed-length FingerCode. Fingerprint matching is

    done by finding the Euclidean distance between the two corresponding FingerCodes,

    and hence matching is extremely fast.

    Key words: Biometrics, FingerCode, fingerprint classification, Gabor filters
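    The matching step described above, a Euclidean distance between two fixed-length feature vectors, can be sketched as follows. The threshold value and the function names are hypothetical; the abstract does not state how the decision threshold is chosen.

```python
import math


def fingercode_distance(code_a, code_b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(code_a) != len(code_b):
        raise ValueError("FingerCodes must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(code_a, code_b)))


def is_match(code_a, code_b, threshold):
    """Declare a match when the distance does not exceed the threshold."""
    return fingercode_distance(code_a, code_b) <= threshold


enrolled = [0.0, 3.0]
probe = [4.0, 0.0]
print(fingercode_distance(enrolled, probe))
print(is_match(enrolled, probe, threshold=6.0))
```

    Because the codes have a fixed length, each comparison costs a fixed number of subtractions and multiplications, which is why this kind of matching is fast.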

    50. Finger-knuckle-print: A new biometric

    identifier

    IEEE 2009

    Abstract:

    This paper presents a new biometric identifier, namely finger-knuckle-print (FKP), for personal

    identity authentication. First a specific data acquisition device is constructed to capture the FKP

    images, and then an efficient FKP recognition algorithm is presented to process the acquired


    data. The local convex direction map of the FKP image is extracted, based on which a coordinate

    system is defined to align the images and a region of interest (ROI) is cropped for feature

    extraction. A competitive coding scheme, which uses 2D Gabor filters to extract the image local

    orientation information, is employed to extract and represent the FKP features. When matching,

    the angular distance is used to measure the similarity between two competitive code maps. An

    FKP database was established to examine the performance of the proposed system, and the

    experimental results demonstrated the efficiency and effectiveness of this new biometric

    characteristic.
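    One common way to compute an angular distance between two competitive code maps is sketched below. It assumes each pixel stores the index of its dominant orientation out of six Gabor orientations, and that the distance between two indices wraps around. The number of orientations and the normalisation are assumptions; the abstract does not specify them.

```python
def angular_distance(map_a, map_b, num_orientations=6):
    """Normalised angular distance between two orientation-index maps.

    Each map is a list of rows of integers in range(num_orientations).
    Returns 0.0 for identical maps and 1.0 for maximally different maps.
    """
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("maps must be non-empty and the same size")
    max_step = num_orientations // 2  # largest possible wrapped difference
    total = 0
    for p, q in zip(flat_a, flat_b):
        diff = abs(p - q)
        total += min(diff, num_orientations - diff)  # orientations wrap around
    return total / (max_step * len(flat_a))


print(angular_distance([[0, 5]], [[5, 0]]))
```

    A smaller value means the two maps are more similar, so a verification decision would compare this value against a threshold.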

    51. MIHST: A Hardware Technique for

    Embedded Microprocessor Functional On-Line

    Self-Test

    IEEE 2013

    Abstract:

    Testing processor cores embedded in Systems-on-Chip (SoCs) is a major concern for industry nowadays.

    In this paper, we describe a novel solution which merges the SBST and BIST principles. The technique we

    propose forces the processor to execute a compact SBST-like test sequence by

    using a hardware module called the MIcroprocessor Hardware Self-Test (MIHST) unit, which is intended to be

    connected to the system bus like a normal memory core, requesting no modification of the processor

    core internal structure. The benefit of using the MIHST approach is manifold: while

    guaranteeing the same or higher defect coverage of the traditional SBST approach, it reduces the time

    for test execution, better preserves the processor core Intellectual Property (IP), does not require the

    system memory to store the test program nor the test data, and can be easily adopted for non-

    concurrent on-line testing, since it minimizes the required system resources. The feasibility and

    effectiveness of the approach were evaluated on a couple of pipelined processors.

    52. Area and time efficient hardwired pre-

    shifted bi-rotation CORDIC design

    IEEE 2014

    Abstract:


    This paper presents optimization schemes and CORDIC circuits for fixed and known rotations with

    different levels of accuracy. To reduce area and time complexity, a hardwired

    pre-shifting technique is proposed for the barrel shifters of the proposed circuits. Two CORDIC cells are

    proposed for fixed-angle rotations. These cells implement the micro-rotations and scaling

    either interleaved or in two stages. A cascaded bi-rotation CORDIC is proposed for

    higher-throughput and reduced-latency implementation. An optimized set of

    micro-rotations is derived for fixed and known angles. Shift-and-add circuits are used to implement the scaling

    factor. The fixed-point mean-squared error is analyzed, and the error is reduced in this method. The

    proposed CORDIC cells are synthesized by Synopsys Design Compiler using the TSMC 90-nm library, and it is shown that

    the proposed designs offer higher throughput, less latency and less area-delay product than the

    reference CORDIC design for fixed and known angles of rotation. We find similar synthesis results

    for different Xilinx field-programmable gate-array platforms.

    53. Fixed-Point Analysis and Parameter

    Selections of MSR-CORDIC With

    Applications to FFT Designs

    IEEE 2012

    Abstract:

    Mixed-scaling-rotation (MSR) coordinate rotation digital computer (CORDIC) is an attractive approach to

    synthesizing complex rotators. This paper presents the fixed-point error analysis and parameter

    selections of MSR-CORDIC with applications to the fast Fourier transform (FFT). First, the fixed-point

    mean squared error of the MSR-CORDIC is analyzed by considering both the angle approximation error

    and signal round-off error incurred in the finite precision arithmetic. The signal to quantization noise

    ratio (SQNR) of the output of the FFT synthesized using MSR-CORDIC is thereafter estimated. Based on

    these analyses, two different parameter selection algorithms of MSR-CORDIC are proposed for general

    and dedicated MSR-CORDIC structures. The proposed algorithms minimize the number of adders and

    word-length when the SQNR of the FFT output is constrained. Design examples show that the

    FFT designed by the proposed method exhibits a lower hardware complexity than existing methods.

    54. Scalable pipelined CORDIC architecture

    design and implementation in FPGA

    IEEE 2009

    Abstract:

    In Digital Signal Processing, trigonometry and complex multiplications are used in many signal

    equations, such as synchronization and equalization. Therefore, a fast and efficient method to

    calculate trigonometric functions and complex multiplications is required. Coordinate Rotation Digital

    Computer (CORDIC) is a trigonometric algorithm that is used to transform data from rectangular to

    polar form and vice versa. CORDIC can also be used to compute several trigonometric functions,


    either directly or indirectly. The proposed CORDIC design is based on a pipelined datapath architecture.

    By using a pipelined architecture, the design is able to process continuous input, has high throughput,

    and doesn't need ROM or registers to store the constant angles of the CORDIC iterations. The design process

    starts by modelling the CORDIC function and designing the datapath and control unit, followed by coding in

    Verilog HDL, synthesis using Quartus II Version 7.2, and implementation

    on an ALTERA Cyclone II DE2 EP2C35F672C6N FPGA. Synthesis results show that the design is able to

    work at 81.31 MHz.
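    A floating-point software model of the CORDIC iterations may help make the abstract above concrete. This is a generic textbook sketch, not the Verilog design from the paper. The iteration count of 16 and all names are assumptions. The list `ANGLES` plays the role of the per-stage constants that a pipelined design can hard-wire into each stage instead of reading them from a ROM.

```python
import math

ITERATIONS = 16  # assumed number of pipeline stages / iterations
# Per-stage rotation angles atan(2^-i); one constant per stage.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Constant CORDIC gain, compensated once after the last stage.
GAIN = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(ITERATIONS))


def cordic_rotate(x, y, angle):
    """Rotation mode: rotate (x, y) by angle in radians (|angle| below about 1.74)."""
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the multiplications by 2^-i are wired shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


def cordic_to_polar(x, y):
    """Vectoring mode: convert (x, y) with x > 0 to (magnitude, angle)."""
    z = 0.0
    for i in range(ITERATIONS):
        d = -1.0 if y >= 0 else 1.0  # rotate so that y is driven towards zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, z


print(cordic_rotate(1.0, 0.0, math.pi / 6))  # approximates (cos 30 deg, sin 30 deg)
print(cordic_to_polar(3.0, 4.0))             # approximates (5, atan2(4, 3))
```

    Rotating the unit vector (1, 0) yields cosine and sine directly, which is one of the "direct" uses of CORDIC mentioned in the abstract.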

    55. Design and evaluation of a floating-point

    division operator based on CORDIC

    algorithm

    IEEE 2012

    Abstract:

    Design and evaluation of a CORDIC (COordinate Rotation DIgital Computer) algorithm for a floating-

    point division operation is presented in this paper. In general, division based

    on the CORDIC algorithm is limited in the range of inputs that can be processed by

    the CORDIC machine to give proper convergence and a precise division result. A hardware

    architecture of the CORDIC algorithm capable of processing broader input ranges is implemented and

    presented in this paper by using a pre-processing and a post-processing stage. The performance as

    well as the calculation error statistics over exhaustive sets of input tests are evaluated. The results

    show that the CORDIC algorithm converges well and gives precise division results

    with broader input ranges. The proposed hardware architecture is modeled in VHDL and synthesized

    on a CMOS standard-cell technology and an FPGA device, resulting in 1 GFlops on the CMOS and

    210.812 MFlops on the FPGA device.
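    The input-range limitation and the role of pre- and post-processing can be illustrated with a software sketch of linear-mode CORDIC division. The core loop only converges when the quotient magnitude is below 2, so this sketch first normalises both operands to mantissas in [0.5, 1) and restores the exponent afterwards. This normalisation is an assumption about how such stages could work, not the architecture from the paper; all names are hypothetical.

```python
import math


def cordic_divide_core(y, x, iterations=32):
    """Linear-mode vectoring CORDIC: approximates y / x for x > 0 and |y / x| < 2."""
    z = 0.0
    for i in range(iterations):
        step = 2.0 ** -i  # a wired shift in hardware
        if y >= 0:
            y -= x * step
            z += step
        else:
            y += x * step
            z -= step
    return z


def cordic_divide(numerator, denominator, iterations=32):
    """Divide with assumed pre-processing (normalise) and post-processing (rescale)."""
    if denominator == 0:
        raise ZeroDivisionError("division by zero")
    if numerator == 0:
        return 0.0
    sign = -1.0 if (numerator < 0) != (denominator < 0) else 1.0
    # Pre-processing: mantissas in [0.5, 1), so their ratio lies in (0.5, 2).
    man_n, exp_n = math.frexp(abs(numerator))
    man_d, exp_d = math.frexp(abs(denominator))
    quotient = cordic_divide_core(man_n, man_d, iterations)
    # Post-processing: restore the exponent difference and the sign.
    return sign * math.ldexp(quotient, exp_n - exp_d)


print(cordic_divide(10.0, 4.0))
print(cordic_divide(1.0, 1000.0))
```

    Without the normalisation step, a call such as `cordic_divide_core(10.0, 4.0)` would fall outside the convergence range, which is the limitation the abstract refers to.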

    56. C-Lock: Energy Efficient Synchronization for

    Embedded Multicore Systems

    IEEE 2013

    Abstract:

    Data synchronization among multiple cores has been one of the critical issues which must be resolved in order to optimize the parallelism of multicore architectures. Data synchronization schemes can be classified as lock-based methods (pessimistic) and lock-free methods (optimistic). However, none of these methods consider the nature of embedded systems which have demanding and sometimes conflicting requirements not only for high performance but also for low power consumption. As an answer to these problems, we propose C-Lock, an energy- and performance-efficient data


    synchronization method for multicore embedded systems. C-Lock achieves balanced energy- and performance-efficiency by combining the advantages of lock-based methods and transactional memory (TM) approaches; in C-Lock, the core is blocked only when true conflicts exist (advantage of TM), while avoiding roll-back operations which can cause huge overhead with regard to both performance and energy (this is an advantage of locks). Also, in order to save more energy, C-Lock disables the clocks of the cores which are blocked for the access to the

    shared data until the shared data become available. We compared our C-Lock approach against

    traditional locks and transactional memory systems, and found that C-Lock can reduce the energy-delay

    product by up to 1.94 times and 13.78 times compared to the baseline and TM, respectively.

    57. ViChaR: A Dynamic Virtual Channel

    Regulator for Network-on-Chip Routers

    IEEE 2009

    Abstract:

    The advent of deep sub-micron technology has recently highlighted the criticality of on-chip

    interconnects. As diminishing feature sizes have led to increases in global wiring delays, network-on-

    chip (NoC) architectures are viewed as a possible solution to the wiring challenge and have recently

    crystallized into a significant research thrust. Both NoC performance and energy budget depend heavily

    on the routers' buffer resources. This paper introduces a novel unified buffer structure, called the

    dynamic virtual channel regulator (ViChaR), which dynamically allocates virtual channels (VC) and buffer

    resources according to network traffic conditions. ViChaR maximizes throughput by dispensing a

    variable number of VCs on demand. Simulation results using a cycle-accurate simulator show a

    performance increase of 25% on average over an equal-size generic router buffer, or similar

    performance using a 50% smaller buffer. ViChaR's ability to provide similar performance with half the

    buffer size of a generic router is of paramount importance, since this can yield total area and power

    savings of 30% and 34%, respectively, based on synthesized designs in 90 nm technology.

    58. Virtualizing Virtual Channels for Increased

    Network-on-Chip Robustness and

    Upgradeability

    IEEE 2012

    Abstract:

    The Network-on-Chip (NoC) router buffers are instrumental in the overall operation of Chip Multi-

    Processors (CMP), because they facilitate the creation of Virtual Channels (VC). Both the NoC routing


    algorithm and the CMP's cache coherence protocol rely on the presence of VCs within the NoC for

    correct functionality. In this article, we introduce a novel concept that completely decouples the number

    of supported VCs from the number of VC buffers physically present in the

    design. Virtual Channel Renaming enables the virtualization of existing virtual channels, in order to

    support an arbitrarily large number of VCs. Hence, the CMP can (a) withstand the presence of faulty VCs,

    and (b) accommodate routing algorithms and/or coherence protocols with disparate VC requirements.

    The proposed VC Renamer architecture incurs minimal hardware overhead to existing NoC designs and

    is shown to exhibit excellent performance without affecting the router's critical path.

    59. Low-Cost Self-Test Techniques for Small RAMs in SOCs Using Enhanced IEEE 1500 Test Wrappers

    IEEE 2012

    Abstract:

    This paper proposes an enhanced IEEE 1500 test wrapper to support the testing and diagnosis of the

    single-port or multi-port RAM core attached to the enhanced IEEE 1500 test wrapper without incurring

    large area overhead to small memories. Effective test time reduction techniques for the proposed test

    scheme are also proposed. Simulation results show that the additional area cost for implementing the

    enhanced IEEE 1500 test wrapper is only about 0.58% for a 64 K-bit single-port RAM and only 0.57% for

    a 64 K-bit two-port RAM.

    60. Application-Aware Topology Reconfiguration

    for On-Chip Networks

    IEEE 2010

    Abstract:

    In this paper, we present a reconfigurable architecture for networks-on-chip (NoC) on which arbitrary

    application-specific topologies can be implemented. When a new application starts, the proposed NoC

    tailors its topology to the application traffic pattern by changing the inter-router connections to some

    predefined configuration corresponding to the application. It addresses one of the main drawbacks of

    the existing application-specific NoC optimization methods, i.e., optimization of NoCs based on the

    traffic pattern of a single application. Supporting multiple applications is a critical feature of an NoC

    when several different applications are integrated into a single modern and complex multicore system-


    on-chip or chip multiprocessor. The proposed reconfigurable NoC architecture supports multiple

    applications by appropriately configuring itself to a topology that matches the traffic pattern of the

    currently running application. This paper first introduces the proposed reconfigurable topology and then

    addresses the problems of core to network mapping and topology exploration. Further on, we evaluate

    the impact of different architectural attributes on the performance of the proposed NoC. Evaluations

    consider network latency, power consumption, and area complexity.

    61. Smart Reliable Network-on-Chip

    IEEE 2014

    Abstract:

    In this paper, we present a new network-on-chip (NoC) that handles accurate localizations of the faulty

    parts of the NoC. The proposed NoC is based on new error detection mechanisms suitable for dynamic

    NoCs, where the number and position of processor elements or faulty blocks vary during runtime.

    Indeed, we propose online detection of data packet and adaptive routing algorithm errors. Both

    presented mechanisms are able to distinguish permanent and transient errors and localize accurately

    the position of the faulty blocks (data bus, input port, output port) in the NoC routers, while preserving

    the throughput, the network load, and the data packet latency. We provide localization capacity analysis

    of the presented mechanisms, NoC performance evaluations, and field-programmable gate array

    synthesis.

    62. Headfirst sliding routing: A time-based

    routing scheme for bus-NoC hybrid 3-D

    architecture

    IEEE 2013

    Abstract:

    A contactless approach that connects chips in the vertical dimension has great potential to

    customize components in 3-D chip multiprocessors (CMPs), assuming card-style components

    inserted into a single cartridge communicate with each other wirelessly using inductive-coupling

    technology. To simplify the vertical communication interfaces, static Time Division Multiple

    Access (TDMA) is used for the vertical broadcast buses, while arbitrary or customized topologies

    can be used for intra-chip networks. In this paper, we propose the Headfirst sliding routing

    scheme to overcome the simple static TDMA-based vertical buses. Each vertical bus grants a

    communication time-slot for different chips at the same time periodically, which means these

    buses work with different phases. Depending on the current time, packets are routed toward

    the best vertical bus (elevator) just before the elevator acquires its communication time-slot.


    63. An Area Effective Parity-Based Fault

    Detection Technique for FPGAs

    IEEE 2013

    Abstract:

    Field programmable gate arrays (FPGAs) are highly successful platforms in a variety of niches, such as

    telecommunications and automotive applications. Their usage in critical systems for radiation

    environments, however, still depends on techniques able to provide increased reliability, since such

    devices are susceptible to single event upsets that may alter the specified functionality. Classical

    approaches such as duplication with comparison and triple modular redundancy are powerful in terms

    of fault detection and/or correction capabilities, and can be easily applied to a variety of circuits, but

    come with heavy area overheads. In this work we propose a parity-based concurrent error detection

    technique able to provide single error detection for combinational logic in FPGAs with reduced area

    when compared to the classical approaches. The proposed technique is automatically applied to a set of

    benchmark circuits and presents an average area reduction of 24.4% when compared to duplication

    with comparison, with no performance overhead.
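    The general idea of parity-based concurrent error detection can be sketched in software as follows. The half adder, the hand-derived parity predictor and the fault injection are illustrative assumptions; the abstract does not describe the actual circuits or the automatic flow used in the paper.

```python
def half_adder(a, b):
    """Example combinational block: returns (carry, sum) for one-bit inputs."""
    return (a & b, a ^ b)


def predicted_parity(a, b):
    """Independent predictor of the output parity: (a & b) ^ (a ^ b) equals a | b."""
    return a | b


def parity(bits):
    """XOR of all bits."""
    result = 0
    for bit in bits:
        result ^= bit
    return result


def error_detected(outputs, a, b):
    """Checker: flag an error when the observed and predicted parity differ."""
    return parity(outputs) != predicted_parity(a, b)


# Fault-free outputs pass the check; flipping any single output bit is flagged.
for a in (0, 1):
    for b in (0, 1):
        carry, total = half_adder(a, b)
        print(a, b, error_detected((carry, total), a, b),
              error_detected((carry ^ 1, total), a, b))
```

    Compared with duplicating the whole block and comparing outputs, only the parity function is replicated, which is where the area saving of such schemes comes from.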

    64. Vendor agnostic, high performance, double

    precision Floating Point division for FPGAs

    IEEE 2013

    Abstract:

    Double precision Floating Point (FP) arithmetic operations are widely used in many applications such as

    image and signal processing and scientific computing. Field Programmable Gate Arrays (FPGAs) are a

    popular platform for accelerating such applications due to their relatively high performance, flexibility and

    low power consumption compared to general purpose processors and GPUs. Increasingly scientists are

    interested in double precision FP operations implemented on FPGAs. FP division and square root are

    much more difficult to implement than addition and multiplication. In this paper we focus on a

    fast divider design for double precision floating point that makes efficient use of FPGA resources

    including embedded multipliers. The design is table based; we compare it to iterative and digit

    recurrence implementations. Our division implementation targets performance with balanced latency

    and high clock frequency. Our design has been implemented on both Xilinx and Altera FPGAs. The table

    based double precision floating point divider provides a good tradeoff between area and performance

    and produces good results when targeting both Xilinx and Altera FPGAs.

    65. Floating-Point Divider Design for FPGAs


    IEEE 2007

    Abstract:

    Growth in floating-point applications for field-programmable gate arrays (FPGAs) has made it critical

    to optimize floating-point units for FPGA technology. The divider is of particular interest because

    the design space is large and divider usage in applications varies widely. Obtaining the right balance

    between clock speed, latency, throughput, and area in FPGAs can be challenging. The designs presented

    here cover a range of performance, throughput, and area constraints. On a Xilinx Virtex4-11 FPGA, the

    range includes 250-MHz IEEE compliant double precision divides that are fully pipelined to 187-MHz

    iterative cores. Similarly, area requirements range from 4100 slices down to a mere 334 slices.

    66. Split-Path Fused Floating Point Multiply

    Accumulate (FPMAC)

    IEEE 2007

    Abstract:

    Floating point multiply-accumulate (FPMAC) unit is the backbone of modern processors and is a key

    circuit determining the frequency, power and area of microprocessors. FPMAC unit is used extensively in

    contemporary client microprocessors, further proliferated with ISA support for instructions like AVX and

    SSE and also extensively used in server processors employed for engineering and scientific applications.

    Consequently design of FPMAC is of vital consideration since it dominates the power and performance

    tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design which focuses on

    optimal computations in the critical path, making it the fastest FPMAC design in the

    literature to date. The design is based on the premise of isolating and optimizing the critical path computation in

    FPMAC operation. In this work we have three key innovations to create a novel double precision FPMAC

    with least ever gate stages in the timing critical path: a) Splitting near and far paths based on the

    exponent difference (d=Exy-Ez = {-2, -1, 0, 1} is near path and the rest is far path), b) Early injection of

    the accumulate add for near path into the Wallace tree for eliminating a 3:2 compressor from near path

    critical logic, exploiting the small alignment shifts in near path and sparse Wallace tree for 53 bit

    mantissa multiplication, c) Combined round and accumulate add for eliminating the completion adder

    from multiplier giving both timing and power benefits. Our design by premise of splitting consumes

    lesser power for each operation where only the required logic for each case is switching. Splitting the

    paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-

    20%) of the logic gates purely based on the exponent difference signals. We also demonstrate the

    support for all rounding modes to adhere to the IEEE standard for double precision FPMAC, which is critical

    for employment of this design in contemporary processor families. The

    demonstrated design outperforms the best known silicon implementation of IBM Power6 [6] by 14% in

    timing while having similar area and giving additional power benefits due to split handling. The design is

    also compared to best known timing design from Lang et al. [5] and outperforms it by 7% while being

    30% smaller in area than it.
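    The near/far path split stated in innovation (a) above can be written as a small selection function. It assumes unbiased integer exponents, so that the product exponent Exy is simply the sum of the two multiplicand exponents; the function and constant names are hypothetical.

```python
# Exponent differences d = Exy - Ez that the abstract assigns to the near path.
NEAR_PATH_DIFFS = frozenset({-2, -1, 0, 1})


def select_path(exp_x, exp_y, exp_z):
    """Choose the datapath for x * y + z from the exponent difference.

    Assumes unbiased exponents, so the product exponent is exp_x + exp_y.
    """
    d = (exp_x + exp_y) - exp_z
    return "near" if d in NEAR_PATH_DIFFS else "far"


for exp_z in (5, 6, 7, 9, 10):
    print(exp_z, select_path(3, 4, exp_z))
```

    Since only one path is needed per operation, the same signal could also drive clock or power gating of the unused path, as the abstract suggests.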


    67. FPGA Based High Performance Double-

    Precision Matrix Multiplication

    IEEE 2009

    Abstract:

    We present two designs (I and II) for IEEE 754 double precision floating point matrix multiplication, an

    important kernel in many tile-based BLAS algorithms, optimized for implementation on high-end FPGAs.

    The designs, both based on the rank-1 update scheme, can handle arbitrary matrix sizes, and are able to

    sustain their peak performance except during an initial latency period. Through these designs, the trade-

    offs involved in terms of local-memory and bandwidth for an FPGA implementation are demonstrated

    and an analysis is presented for the optimal choice of design parameters. The designs, implemented on

    a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements (PEs) with a less than 1%

    degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a

    sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s

    for design-II and 5.9 GB/s for design-I.
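    The rank-1 update scheme named in the abstract above computes the product as a sum of outer products, one per column of the first matrix and matching row of the second. The sketch below shows that ordering in plain Python and also checks the quoted peak figure, assuming each PE completes one multiply and one add per cycle. That assumption and the function names are not from the abstract.

```python
def matmul_rank1(a, b):
    """Multiply matrices (lists of rows) by accumulating rank-1 updates."""
    n, k, m = len(a), len(a[0]), len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for t in range(k):
        column = [a[i][t] for i in range(n)]  # t-th column of a
        row = b[t]                            # t-th row of b
        # Rank-1 update: c += outer(column, row).
        for i in range(n):
            for j in range(m):
                c[i][j] += column[i] * row[j]
    return c


def peak_gflops(num_pes, clock_hz, flops_per_pe_per_cycle=2):
    """Peak rate, assuming one multiply and one add per PE per cycle."""
    return num_pes * clock_hz * flops_per_pe_per_cycle / 1e9


print(matmul_rank1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
print(peak_gflops(40, 373e6))
```

    Under that assumption, 40 PEs at 373 MHz give 29.84 GFLOPS, which is consistent with the 29.8 GFLOPS quoted in the abstract.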

    68. An FPGA Implementation of a Fully Verified

    Double Precision IEEE Floating-Point Adder

    IEEE 2007

    Abstract:

    We report on the full gate-level verification and FPGA implementation of a

    highly optimized double precision IEEE floating-point adder. The proposed adder design incorporates

    many optimizations like a nonstandard separation into two paths, a simple rounding algorithm,

    unification of rounding cases for addition and subtraction, sign-magnitude computation of a difference

    based on one's complement subtraction, compound adders, and fast circuits for approximate counting

    of leading zeros from borrow-save representation. We formally verify a gate-level specification of the

    algorithm using theorem proving techniques in PVS. The PVS specification was then used to

    automatically generate a gate-level implementation that was synthesized using Altera Quartus II. The

    resulting implementation has a total latency of 13.6 ns on an Altera Stratix II device. We have partitioned

    the design into a 2 stage pipeline running at a frequency of 147 MHz.

    69. Low-power radix-8 divider

    IEEE 2008

    Abstract:


    This work describes the design of a double-precision radix-8 divider. Low-power techniques are applied

    in the design of the unit, and energy-delay tradeoffs considered. The energy dissipation in the divider

    can be reduced by up to 70% with respect to a standard implementation not optimized for energy,

    without penalizing the latency. The radix-8 divider is compared with the one obtained by overlapping

    three radix-2 stages and with a radix-4 divider. Results show that the latency of our divider is similar to

    that of the divider with overlapped stages, but the area is smaller. The speed-up of the radix-8 over the

    radix-4 is about 20% and the energy dissipated to complete a division is almost the same, although the

    area of the radix-8 is 50% larger.
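    To show what "radix-8" means for a divider, the sketch below performs integer division by digit recurrence, retiring three quotient bits per step. It uses simple restoring digit selection by comparison against multiples of the divisor, which is a simplification chosen for clarity; it is not the low-power design described in the abstract, and the names and the default width are assumptions.

```python
def radix8_divide(dividend, divisor, num_bits=54):
    """Restoring radix-8 digit-recurrence division on non-negative integers.

    Returns (quotient, remainder, steps). num_bits must be a multiple of 3
    and large enough to hold the dividend.
    """
    if divisor <= 0 or dividend < 0:
        raise ValueError("expected dividend >= 0 and divisor > 0")
    if num_bits % 3 != 0 or dividend >> num_bits:
        raise ValueError("num_bits must be a multiple of 3 and cover the dividend")
    quotient = 0
    remainder = 0
    steps = num_bits // 3
    for step in range(steps):
        shift = num_bits - 3 * (step + 1)
        # Bring down the next radix-8 digit (3 bits) of the dividend.
        remainder = (remainder << 3) | ((dividend >> shift) & 0b111)
        # Select the quotient digit 0..7 by comparing against divisor multiples.
        q_digit = 0
        while q_digit < 7 and remainder >= divisor * (q_digit + 1):
            q_digit += 1
        remainder -= divisor * q_digit
        quotient = (quotient << 3) | q_digit
    return quotient, remainder, steps


print(radix8_divide(1000, 7, num_bits=12))
```

    With the same operand width, a radix-2 recurrence would need three times as many steps and a radix-4 recurrence one and a half times as many, at the cost of a simpler digit selection per step.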

    70. Design and evaluation of a floating-point

    division operator based on CORDIC algorithm

    IEEE 2008

    Abstract:

    Design and evaluation of a CORDIC (COordinate Rotation DIgital Computer) algorithm for a floating-

    point division operation is presented in this paper. In general, division based

    on the CORDIC algorithm is limited in the range of inputs that can be processed by

    the CORDIC machine to give proper convergence and a precise division result. A hardware

    architecture of the CORDIC algorithm capable of processing broader input ranges is implemented and

    presented in this paper by using a pre-processing and a post-processing stage. The performance as well

    as the calculation error statistics over exhaustive sets of input tests are evaluated. The results show that

    the CORDIC algorithm can be we

