Reconfigurable IP Blocks : a MIMD Approach


Abstract

In this thesis, one of the very first published MIMD (Multiple-Instruction-Multiple-Data) based reprogrammable IP (Intellectual Property) blocks, called RAA (Reprogrammable Algorithm Accelerator), is presented. In RAA, the massively parallel coarse grain reconfigurable IP block consists of tiny DSP (Digital Signal Processor) cores, each coupled with local memories. The novel two-level communication mechanism of RAA is introduced, including a novel set of group addressing modes designed to remove the reconfiguration latency bottleneck. In addition, a yield increasing mechanism is shown.

The objective of the work was to raise the abstraction level of application specific hardware accelerator development. A context switching mechanism is implemented in an existing MIMD coarse grain reconfigurable IP block, RAA. Context switching is used not only to hide reconfiguration latency; its emphasis is on virtualizing the dimensions of an array of processors by folding the array into multiple configurations. The hardware extensions for configuration management are coded as a synthesizable RTL (Register Transfer Level) model, and area results are presented and compared to the original unextended RAA. In addition, simulation results are given to show the execution time penalty of virtualization. The given implementation is the first implemented virtual-size coarse grain reconfigurable IP array.

The configware flow from a parallel algorithm description to the RAA binaries is presented. The configware flow includes a graphical mapping tool, an automatic place & route tool, and an assembler. The NP-hard problems of mapping the configware to RAA and finding uses for the group addressing modes are solved with fast heuristics.

A scalable hardware architecture to implement RSA (Rivest, Shamir, Adleman) encryption on reconfigurable architectures is shown. The presented RSA architecture is implemented on an ASIC (Application Specific Integrated Circuit), a platform FPGA (Field Programmable Gate Array) and the coarse grain reconfigurable IP block RAA. The overall timing and area results for all three implementations are given. The original platform FPGA implementation was the fastest known FPGA-based 1024-bit RSA encryption at the time of publication.


In addition, an extensive state-of-the-art survey of reconfigurable IP blocks is presented. Both main reconfigurable categories, i.e. fine grain and coarse grain, are covered, and the most notable existing implementations are introduced in detail. Each architecture is categorized according to its computational granularity, communication topology and origin, i.e., academic vs. commercial.


Preface

The work presented in this thesis has been carried out at the Institute of Digital and Computer Systems at Tampere University of Technology during the years 2002-2005. I express my gratitude to my supervisor, Prof. Jari Nurmi, for his guidance and for providing an excellent research environment. I am grateful to the reviewers of my thesis, Prof. Gerard Smit and Prof. Jorma Skyttä, for providing constructive comments on the manuscript. I thank my parents, Rauno and Heidi Ristimäki, for their time in my first school years. Dr.Tech. Seppo Kuusisto and Dr.Tech. Petri Korpisaari are acknowledged for initializing a new doctoral track. I also thank my colleagues Mikko Alho M.Sc., Juha Pirttimäki M.Sc. and Tuukka Kasanko M.Sc., who have helped me in several matters during my research work, and Kelly Tai for valuable comments in language checking. This thesis was financially supported by the Tampere Graduate School in Information Science and Engineering (TISE), The Foundation of Research in Tampere, HPY Research Foundation, Ulla Tuominen Foundation, Tuula and Yrjö Neuvo Foundation and Nokia Foundation, which are all gratefully acknowledged. Minna, Aino, Väinö and Vieno, my wife and children: without your love and patience this work would have been neither started nor finished.

Vesilahti, September 2005

Tapio Ristimäki


Contents

ABSTRACT
PREFACE
CONTENTS
LIST OF PUBLICATIONS
ABBREVIATIONS
SYMBOLS
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION
   1.1. OBJECTIVE
   1.2. OUTLINE AND AUTHOR'S CONTRIBUTION

2. RECONFIGURABLE IP BLOCKS : A SURVEY
   2.1. INTRODUCTION
        2.1.1. Terms and organization
        2.1.2. Previous surveys in the field
   2.2. FINE GRAIN BLOCKS BASED ON LUTS
        2.2.1. FPGA fabric of Xilinx Virtex II
        2.2.2. Commercial fine grain IPs
   2.3. DATAFLOWS
        2.3.1. D-fabrix by Elixent
        2.3.2. XPP by PACT
        2.3.3. DRP by NEC
        2.3.4. CHESS architecture
        2.3.5. DReAM architecture
        2.3.6. Some very first academic CG IPs
   2.4. SYSTOLIC ARRAY
   2.5. MIMD
        2.5.1. ACM by QuickSilver
        2.5.2. Synputer by Synergetic Computing Systems
   2.6. HYBRIDS
        2.6.1. Silicon Hive by Philips
        2.6.2. Morphotech by Morphotechnologies
        2.6.3. Academic
   2.7. DISCUSSION

3. REPROGRAMMABLE ALGORITHM ACCELERATOR
   3.1. INTRODUCTION
   3.2. ARCHITECTURE OF REPROGRAMMABLE ALGORITHM ACCELERATOR
        3.2.1. Communication topology
        3.2.2. Interface
        3.2.3. Structure of node
   3.3. 8-BIT DATA SUPPORT EXTENSION
   3.4. YIELD INCREASE MECHANISM
   3.5. RESULTS
   3.6. SUMMARY

4. VIRTUALIZING DIMENSIONS OF COARSE GRAIN RECONFIGURABLE ARRAY
   4.1. INTRODUCTION
        4.1.1. Previous work
   4.2. MEMORY ARCHITECTURE
        4.2.1. Access to memories
        4.2.2. FIFOs
   4.3. SCHEDULER
        4.3.1. Virtualizing dimensions of an array
   4.4. PARTIALLY NON-DEADLOCKABLE FIFO
   4.5. MEMORY ACCESS ALLOCATING SYSTEM
   4.6. APPLICATION CASE STUDY : MATRIX MULTIPLICATIONS
   4.7. RESULTS
        4.7.1. Area penalty effect of virtualization structures
        4.7.2. Results of deadlock-free FIFO
   4.8. SUMMARY

5. CONFIGWARE FLOW OF RAA
   5.1. INTRODUCTION
        5.1.1. Previous work
   5.2. GUI MAPPING TOOL
   5.3. PLACE & ROUTE TOOL
        5.3.1. Some algorithms to solve the placement problem
        5.3.2. Iterative Lin-Kernighan to solve TSP
   5.4. RAA PLACE & ROUTE ALGORITHM
        5.4.1. Dijkstra's algorithm
        5.4.2. Route phase
   5.5. RAA ASSEMBLER
        5.5.1. Heuristic to select group addressing mechanism
   5.6. RESULTS
        5.6.1. GPS correlation background
   5.7. GUI AND ASSEMBLER RESULTS
   5.8. PLACE & ROUTE RESULTS
   5.9. SUMMARY

6. 3+ WAYS TO IMPLEMENT RSA ENCRYPTION
   6.1. SCALABLE RSA ENCRYPTION SUITABLE FOR HIGH RADIX RECONFIGURABLE STRUCTURES – INTRODUCTION
   6.2. RSA ALGORITHM
        6.2.1. Montgomery modular multiplication
   6.3. HARDWARE ARCHITECTURE
        6.3.1. Montgomery product
        6.3.2. Modular inverse
   6.4. IMPLEMENTATION IN 0.35µM ASIC TECHNOLOGY
   6.5. IMPLEMENTATION ON XILINX VIRTEX II PLATFORM FPGA CHIP
   6.6. IMPLEMENTATION ON RAA
        6.6.1. Results on single context RAA
        6.6.2. Results on multi-context RAA
        6.6.3. Comparison
   6.7. SUMMARY

7. CONCLUSIONS
   7.1. DISCUSSION

8. REFERENCES


List of Publications

This thesis is a monograph which contains unpublished material. However, it is mainly based on already published work. Copyright of the previously published material is owned by the copyright holders of the following publications.

[P1] T. Ristimäki, J. Nurmi, "Reprogrammable Algorithm Accelerator IP Block", Proc. of IFIP VLSI-SOC, 2003, pp. 228-232.

[P2] T. Ristimäki, J. Nurmi, "Virtualizing the Dimensions of a Coarse-Grained Reconfigurable Array", Proc. of Field Programmable Logic and its Applications, 2004, pp. 1130-1132.

[P3] T. Ristimäki, J. Nurmi, "Reconfigurable IP Blocks : a Survey", Proc. of International Symposium on System-on-Chip, 2004, pp. 117-122.

[P4] T. Ristimäki, J. Nurmi, "Fast 1024-bit RSA Encryption on Platform FPGA", Proc. of International Workshop on Design and Diagnostics of Electronic Circuits and Systems, 2003, pp. 277-284.

[P5] T. Ristimäki, J. Nurmi, "Implementing User and Application Specific Algorithms within IP-methodology: a Coarse-Grain Approach", Proc. of International Symposium on System-on-Chip, 2003, pp. 61-64.

[P6] T. Ristimäki, J. Nurmi, "3+ Ways to Design Reconfigurable Algorithm Accelerator IP Block", Proc. of IFIP IP Based SoC Design Workshop, 2003, pp. 223-225.


Abbreviations

ALU     Arithmetic Logic Unit
ASIC    Application Specific Integrated Circuit
BIST    Built-in Self-Test
CE      Computational Element
CG      Coarse Grain
CLB     Configurable Logic Block
CM      Configuration Manager
CMU     Configuration Memory Unit
CPU     Central Processing Unit
CSoC    Configurable System-on-Chip
DRAM    Dynamic RAM
DRP     Dynamically Reconfigurable Processor
DSP     Digital Signal Processor
EDF     Earliest Deadline First
FCFS    First-Come-First-Served
FG      Fine Grain
FIFO    First-In-First-Out
FPGA    Field Programmable Gate Array
GCU     Global Communication Unit
GPS     Global Positioning System
HDL     Hardware Description Language
HLL     High Level Language
HTM     Hardware Task Manager
IC      Integrated Circuit
IP      Intellectual Property
JTAG    Joint Test Action Group
LRT     Last Release Time
LST     Least-Slack-Time
LUT     Look-Up Table
M       Million
Mbps    Megabits per second
MIMD    Multiple-Instruction-Multiple-Data
MIN     Matrix Interconnection Network
MMU     Memory Management Unit
NML     Native Mapping Language
NP      Nondeterministic Polynomial
OS      Operating System
PA      Processing Array
PAE     Processing Array Element
PCB     Printed Circuit Board
PEG     Programmable Element Group
PN      Pseudo-Random Noise
RAA     Reprogrammable Algorithm Accelerator
RAM     Random-Access Memory
RISC    Reduced Instruction Set Computer
ROM     Read-Only Memory
RPU     Reconfigurable Processing Unit
RR      Round Robin
RSA     Rivest, Shamir, Adleman
RTL     Register Transfer Level
SCM     Supervising Configuration Manager
SCU     Communication Switching Unit
SHAPE   Silicon Hive Array Programming Environment
SIMD    Single-Instruction-Multiple-Data
SoC     System-on-Chip
SOP     Sum-of-Products
SPN     Shortest Process Next
SRAM    Static RAM
VLIW    Very-Long-Instruction-Word
VLSI    Very Large Scale Integration
4NN     4 Nearest Neighbors


Symbols

ar      a's counterpart in n-residue
C       encrypted message
e       encryption exponent
M       plain text
mod     modulo operator
n       modulus
n'      modular inverse of n
O       big 'O' notation
OR      OR gate
s       number of words in result
w       bits in word
X       X-coordinate
Y       Y-coordinate
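Several of these symbols (M, e, n, C) are tied together by the basic RSA relation C = M^e mod n, which Chapter 6 implements in hardware. The following sketch only illustrates that relation with toy numbers; it is not the thesis' hardware architecture, and the key values are a standard textbook example, not anything from the thesis.

```python
def rsa_encrypt(M: int, e: int, n: int) -> int:
    """Encrypt plain text M with public exponent e and modulus n: C = M^e mod n."""
    # Python's three-argument pow performs modular exponentiation efficiently
    # (square-and-multiply), which is what the hardware implements in parallel.
    return pow(M, e, n)

# Toy key: n = 61 * 53 = 3233, e = 17 (matching private exponent d = 2753).
C = rsa_encrypt(M=42, e=17, n=3233)
```

Real keys use 1024-bit or larger moduli, which is why the hardware decomposes the exponentiation into word-level Montgomery multiplications.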


List of Figures

1. The structure of the slice of Virtex II FPGAs.

2. Hierarchy levels of interconnection network of Virtex II FPGAs.

3. Architecture of Varicore FPGA IP.

4. Internal structure of Varicore FPGA IP.

5. The structure of basic cell of the eASIC architecture.

6. Interconnection topology of synthesizable FPGA core.

7. Theoretical picture for dataflow paradigm.

8. Structure of XPP architecture in different levels.

9. Structure of DRP architecture.

10. Physical placement of ALU and route resources of CHESS.

11. Top level architecture of DReAM.

12. Topology of Systolix architecture.

13. Hierarchical structure of communication architecture of ACM.

14. Basic communication topology of Synputer architecture.

15. Architecture of Avispa.

16. Hybrid structure of Morphotec architecture.

17. Mixed granularity paradigm of AMDREL project.


18. Topology of local FIFO connections.

19. Topology of RAA internal global bus.

20. Delay of OR-bus.

21. Parts in RAA internal virtual addresses.

22. Topology of single node in RAA.

23. Internal structure of DSP core used.

24. Block diagram of 8-bit addressing.

25. Yield increase 1.

26. Yield increase 2.

27. Distributed vs. centralized memory architecture.

28. Context switch extended node architecture.

29. Functionality of hash function used.

30. Partially deadlock free FIFO.

31. Area of a node capable of storing one, two and four different contexts.

32. Area of processor core and memory blocks in node.

33. Silicon area of partially deadlock free and normal FIFO.

34. The effect of the FIFO limit parameter on the execution time.

35. Execution time of correlation algorithm in RAA.

36. GUI placer before and after RSA algorithm placement.


37. Topology of correlation implementation on RAA.

38. Snake path in correlation configware.

39. Connection graph of test case.

40. Block diagram of hardware architecture of RSA.

41. Hardware implementation of the inner loop of the standard multiplication algorithm.

42. Cascade of basic blocks.

43. Modified basic block.

44. Architecture overview of Virtex-II platform FPGA.

45. Communication topology of 1024-bit RSA on RAA.

46. Code needed to implement scalable RSA architecture on RAA.

47. Execution time in clock cycles vs. key length used in RSA implementation.

48. Silicon area of single context RAA needed on RSA.

49. Execution time of 256 bit RSA as a function of contexts used.

50. Execution time of 512 bit RSA as a function of contexts used.

51. Area vs. execution time for 512 bit RSA implemented in virtual size RAA.

52. Fundamental algorithm processing architectures.


List of Tables

1. Reconfigurable IP world categorization.

2. Best known FG IP vendors.

3. Performance of eASIC core according to eASIC technology ltd.

4. Addressing mode bits.

5. Instruction set of DSP core used in RAA.

6. Sizes of memory elements used in RAA in tests.

7. Area of single RAA node.

8. Hypothetical SoC estimates.

9. Matrix multiplication simulation on virtual size RAA.

10. Enumeration of words in PN-code in GPS correlation.

11. Code sizes of RSA and GPS implementations after code optimization.

12. Results of FPGA implementation of RSA.

13. Results of three different RSA implementations.


1. Introduction

While the first planar chips in 1965 included only tens of transistors [1], it is nowadays possible to pack hundreds of millions into a single die [2]. We have enormous computational potential in a space of a few square millimeters. On the other hand, the complexity of the algorithms needed for next generation radio systems, encryption, learning, and video encoding and decoding grows even faster than the complexity of Integrated Circuits (IC) [3]. To take full advantage of such a small and complex component, there are challenges that must be overcome in modern IC design.

IC design methodologies are still more or less the same as they were ten years ago, when the main goal was to fit fully optimized structures into a limited space. Nowadays, however, the main goal is more frequently to get an enormous amount of functionality as fast as possible into an almost unlimited space. The trend is the same with desktop computers. While very limited hardware resources forced programs to be implemented at a very low assembly level ten years ago, High Level Language (HLL) compilers and application development tools are now used to implement huge programs fast. The implementations are naturally not optimal, but because resources are, in practice, unlimited, that is not a problem. The sequential nature of desktop computers has made it possible to develop HLL compilers that raise the abstraction level of application development. However, the extremely parallel nature of hardware engineering has hindered attempts to develop HLL design methods for IC development. Thus we are in a situation where we have the combination of unlimited space and low abstraction level development methods. This combination drives us to a situation where we cannot utilize all of the improvements in silicon technologies. The phenomenon is known as the productivity gap [4].
In a modern digital system there are typically one to a few processors, memories, interface blocks, special functional hardware blocks, i.e. hardware accelerators, and a communication infrastructure which connects the pieces together. The bottleneck in such a system is the communication between blocks. The penalty paid for getting signals out of an IC and linking them on a Printed Circuit Board (PCB) to another IC is high. Pads on an IC take up a large silicon area, and thus increase the price of the chip. On the other hand, the capacitance of the combination of PCB traces and IC pads inflicts an extremely high energy consumption and speed penalty compared to on-chip wires. Thus it is no surprise that progress in silicon technologies has been driven to pack the different pieces of a digital system inside a single IC. Such a chip is called a System-on-Chip (SoC) [5].

The well known design methodology to increase productivity in SoC design is to build up the chip from pre-designed and pre-verified blocks, i.e. Intellectual Property (IP) blocks, in the same manner as PCB level systems are built up from monolithic ICs. In IP design the abstraction level is lifted from Register Transfer Level (RTL) hardware design to system level block design. However, because of the very limited number of customers, IP block availability is limited to the most common blocks. Even blocks that are only a bit exotic need to be built from scratch. That is the case with application specific hardware intensive algorithm accelerators. [6,7,8,9]

To enable the implementation of application specific algorithm accelerators on SoC within the IP methodology, reconfigurable IP blocks have been proposed [10]. While the hardware of a reconfigurable IP block is fixed, and thus can be sold as a pre-verified, pre-designed block, the functionality of the block can be determined afterwards by reconfiguring it. There is no upper limit to the number of reconfiguration cycles, and thus the same silicon area can be used to accelerate several different algorithms by dynamic reconfiguration. For example, the very same block can execute GPS algorithms when global positioning is needed, and switch to accelerating encryption within a few microseconds if the user starts a new application. The possibility to reconfigure the block after the manufacturing phase can also be used to update the device with newer versions of computationally intensive algorithms, or to change the algorithm used.
The market for reconfigurable IP blocks has been predicted to explode, with total shipments exceeding $600M in 2006 [11]. The reason for the predicted exponential growth is obvious. A viable combination of reconfigurable hardware and a configware flow would make it possible to assemble all the hardware of a SoC from IPs, including the hardware needed by application specific algorithms. It would thus raise the functionality description mostly to the level of some kind of configware, software or alike, and would be one solution for narrowing the productivity gap.


1.1. Objective

The original objective of the work was to research ways to raise the abstraction level of implementing application specific algorithm accelerators. Reconfigurable IPs were recognized as a very promising approach. However, very little research had been done on this subject when this work was started. In particular, no research had tried to answer the question "How can the abstraction level of designing accelerators on reconfigurable IPs be lifted to such a level that the design can be done by a software engineer instead of a hardware engineer?". The author's hypothesis was that the combination of a simple architecture and an HLL compiler is impossible. The hypothesis is based on a number of unsuccessful trials in the genre of parallel computing. Another hypothesis was that an intuitive architecture and a rich set of back-end tools would be the solution to the problem. MIMD (Multiple-Instruction-Multiple-Data) was proposed as such an architecture.

Thus, the first grass-roots level objective of this work was to prove the feasibility of using a MIMD architecture as a reconfigurable IP block. The second objective was to develop a reconfigurable IP with such an abstraction level that the accelerator binary files would be binary compatible among differently parameterized versions of the same reconfigurable IP. The most important parameter to be virtualized was found to be the dimensions of the reconfigurable fabric. The third objective was to automate the configware flow from parallel algorithm description to the binary files used in reconfiguration, i.e., to find a fully functional back end for the proposed architecture fulfilling the first objective.

1.2. Outline and author’s contribution

In Chapter Two, a state-of-the-art survey of reconfigurable IP blocks is given. Both main reconfigurable categories, i.e. fine grain and coarse grain, are treated, and the most remarkable prevailing implementations are introduced extensively. Each of the architectures is categorized according to its computational granularity, communication topology and source, i.e. academic vs. commercial. The chapter is based on the author's publication [P3], which was the first published survey concentrating exclusively on reconfigurable IP blocks.


The MIMD based reprogrammable IP block RAA (Reprogrammable Algorithm Accelerator) is presented in Chapter Three. In RAA the massively parallel coarse grain reconfigurable IP block is constituted of tiny DSP cores, each coupled with local memories. The novel two-level communication mechanism of RAA is introduced, including a novel set of group addressing modes designed to remove the reconfiguration latency bottleneck. In addition, a yield increasing mechanism is shown. The chapter is concluded with gate level area and clock period approximations. This chapter is based on the author's publications [P1] and [P5], but also contains some unpublished material. When published [P6] in August 2003, RAA was one of the very first MIMD-based reconfigurable IP blocks. The chapter is contributed by the author alone.

In Chapter Four a context switching mechanism is implemented into the existing MIMD coarse grain reconfigurable IP block RAA. Context switching is not used only to hide reconfiguration latency; its emphasis is on virtualizing the dimensions of an array of processors by folding the array into multiple configurations. The hardware extensions for configuration management are coded as a synthesizable VHDL model, and area results are presented and compared to the original unextended RAA. In addition, simulation results are given to show the execution time penalty of virtualization. The results show that functionality equal to a virtual array four times bigger than the original physical array can be implemented in approximately 2.1 times the area of the original one. On the other hand, the execution time of a virtualized run scales at best linearly with the number of real processors; that is to say, the virtualization mechanism itself can add zero execution time overhead. The computational efficiency of a virtually bigger array is illustrated by matrix multiplication and GPS correlation case studies.
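The folding idea behind the virtualization can be illustrated with a toy model. The sketch below is not the RAA implementation; it only shows, under assumed names (`fold_virtual_array`, `run_folded`) and with node "programs" modeled as plain callables, how a V×V virtual array can be served by a P×P physical array holding (V/P)² configurations per node and switching between them:

```python
def fold_virtual_array(virtual, P):
    """Toy folding (not the RAA implementation): map a V x V grid of
    node programs onto a P x P physical array. Each physical node gets
    (V // P)**2 contexts, one per configuration."""
    V = len(virtual)
    assert V % P == 0, "virtual size must be a multiple of physical size"
    contexts = {(r, c): [] for r in range(P) for c in range(P)}
    for br in range(V // P):              # one block of rows per context
        for bc in range(V // P):          # ... and one block of columns
            for r in range(P):
                for c in range(P):
                    contexts[(r, c)].append(virtual[br * P + r][bc * P + c])
    return contexts

def run_folded(contexts, P):
    """Execute every context in sequence, emulating context switches."""
    results = []
    n_contexts = len(contexts[(0, 0)])
    for k in range(n_contexts):           # one pass per configuration
        for r in range(P):
            for c in range(P):
                results.append(contexts[(r, c)][k]())  # "programs" are callables
    return results

# A 4x4 virtual array folded onto a 2x2 physical array needs 4 contexts:
virtual = [[(lambda i=i, j=j: (i, j)) for j in range(4)] for i in range(4)]
contexts = fold_virtual_array(virtual, 2)
```

Each virtual node runs exactly once per full pass, which is why the slowdown of a folded run comes only from having fewer physical processors, not from the switching mechanism itself.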
This chapter is based on the author's publication [P2] and presents the first implemented coarse grain reconfigurable IP array size virtualization system. The chapter is contributed by the author alone.

The configware flow from parallel algorithm description to RAA binaries is given in Chapter Five. The presented configware flow includes a graphical mapping tool, an automatic place & route tool, and an assembler. In the graphical mapping tool the application developer can map the parallel parts of the accelerator configware onto the nodes of RAA. With the automatic place & route tool the mapping can be done automatically. In addition, the route tool uses nodes as route-through blocks, such that nodes separated by a distance greater than one can communicate with each other. In the back end, the assembler is used to translate assembly language into the binaries. In this chapter, the NP-hard problems of mapping the configware onto RAA and maximizing the use of group addressing modes are solved with fast heuristics. The mapping tool and assembler are tested with scalable RSA algorithm and GPS correlation implementations. The results of this chapter are previously unpublished. However, the decision to use an architecture intuitive to a software engineer together with low level programming tools, instead of trying to build a simple architecture and a compiler, is motivated by the author's publication [P6]. This chapter is contributed by the author alone.

In Chapter Six a scalable hardware architecture for implementing RSA (Rivest, Shamir, Adleman) encryption on reconfigurable architectures is shown. The presented RSA architecture is implemented on an ASIC (Application Specific Integrated Circuit), a platform FPGA (Field Programmable Gate Array), and the coarse grain reconfigurable IP block RAA. The overall timing and area results for these three implementations are given, as are the execution times on differently configured virtualized RAA arrays. The original platform FPGA implementation was the fastest known FPGA-based 1024-bit RSA encryption when published in [P4]. This chapter is based on the author's publications [P4] and [P6]. The chapter is contributed by the author alone.

Conclusions on the work are given in Chapter Seven.
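The hardware architectures of Chapter Six are not reproduced here, but the mathematical core that all of them accelerate, modular exponentiation by square-and-multiply, can be sketched in a few lines. The parameters below are well-known textbook toy values (p = 61, q = 53), far too small to be secure:

```python
def modexp(base, exp, mod):
    """Left-to-right square-and-multiply modular exponentiation,
    the core operation of RSA: returns base**exp mod mod."""
    result = 1
    for bit in bin(exp)[2:]:              # exponent bits, MSB first
        result = (result * result) % mod  # square on every bit
        if bit == "1":
            result = (result * base) % mod
    return result

# Toy RSA round trip (textbook parameters; illustration only):
n, e, d = 3233, 17, 2753                  # n = 61*53, d = e^-1 mod phi(n)
cipher = modexp(42, e, n)                 # encrypt the message 42
plain = modexp(cipher, d, n)              # decrypt: recovers 42
```

One squaring per exponent bit plus a conditional multiply is exactly the operation count that makes the algorithm attractive for a scalable hardware pipeline.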


2. Reconfigurable IP Blocks : a Survey

Abstract. An extensive survey concentrating exclusively on reconfigurable IP blocks is given. Both main reconfigurable categories, i.e. fine grain and coarse grain, are covered, and the most remarkable prevailing implementations are introduced extensively. Each of the architectures is categorized according to its computational granularity, communication topology and source, i.e. academic vs. commercial.

2.1. Introduction

The first scientific reconfigurable IPs were introduced in 1999 [12] and the first commercial one in 1998 [13]. After 2003 the field has exploded [14] and several new blocks have been introduced. After the first few years of rapid progress in the field, it is a good time to sum up where we are presently, in order to understand where we are going. Presenting the state-of-the-art advances in reconfigurable IPs also provides the background necessary to understand the contribution of the work presented in the later chapters of this thesis.

2.1.1. Terms and organization

Because of the immature nature of the field, there is some variation in terminology. Even the borders of the concept "reconfigurable IP block" are very fuzzy. Although our purpose is not to try to decide upon the correct terminology, it is necessary to fix some concepts within this thesis. Reconfigurability in Very Large Scale Integration (VLSI) can always be divided into two groups: fine grain (FG) and coarse grain (CG)1. Practically speaking, FG in the IP world is nowadays synonymous with structures consisting of up to 4-bit Look-Up-Tables (LUT), i.e. FPGAs are the de facto architecture. However, the exact definition of FG reconfigurable systems is a lot more permissive, i.e. all structures where the computational width of every element is small enough are FG structures. While FG reconfigurable systems provide the ability to map logic to the LUTs, CG ones allow mapping to several-bits-wide Computational Elements (CE)2. So any parallel topology containing CEs (like processors, ALUs, functional units etc.) connected together as one computational unit can be said to be a CG architecture. E.g. the homogeneous parallel topologies dataflow, Single-Instruction-Multiple-Data (SIMD) and Multiple-Instruction-Multiple-Data (MIMD) are in this category. From the definitions it can be seen that the practical design space of CGs is much wider than that of FGs, and indeed, that is the case for real applications in the field of reconfigurable IP blocks. In this survey we divide CG IPs into system classes according to the classical parallel computing classification: dataflow, systolic array, MIMD, and hybrids of the above. In addition we divide implementations into academic and commercial ones. Table 1 gives the organization of this chapter.

1 Depending on the author: coarse-grained, fine-grained, or reconfigurable logic, reconfigurable computing.
2 Or Processing Elements (PE), Functional Units (FU) etc., depending on the author and context.

Table 1. The Reconfigurable IP world categorization within this survey.

Topology          Paragraph
LUT based (FG)    2.2
Dataflow (CG)     2.3
Systolic (CG)     2.4
MIMD (CG)         2.5
Hybrids (CG)      2.6

2.1.2. Previous surveys in the field

At the circuit board level, reconfigurable systems were studied thoroughly in the 80s and 90s. On the CG side, Dr. R. Hartenstein has written especially exhaustive surveys, which can be found in [15,16,17]. Chip level reconfigurable platforms are covered in publications made in the late 90s. Among others, publications [18] and [19] cover the genre of Configurable System on Chips (CSoC). However, the field of reconfigurable IPs is so new that there is not much published material at the survey level. Publication [20] partly overlaps with this survey, though the overlapping parts are treated superficially in [20]. In addition, reconfigurable IP blocks are discussed at the methodology level in [21]. However, as far as the author knows, the author's publication [P6] was, when published, the first survey focusing completely and comprehensively on reconfigurable IPs and presenting the most interesting existing implementations.


2.2. Fine grain blocks based on LUTs

At the end of the 1990s, when reconfigurable IPs were planned for the first time, LUT based systems were the most obvious choice. This was the only prevailing architecture with a mature configware flow, and it had the status of de facto standard for implementing reconfigurable structures at the circuit board level. After a fast race, the first commercial application was introduced by Actel in 2001 [22]. Straight after that, LSI Logic came up with their own FPGA hard core, licensed from Adaptive Silicon [23]. Of these two pioneers, the product of Actel is still on the market, but LSI Logic has already stopped marketing embedded FPGAs because of the lack of customers [24]. As mentioned in Chapter 1, a huge amount of money is predicted to be spent on reconfigurable IP blocks in the near future, and thus it is no surprise that the first tries of Actel and LSI Logic were not the last. Instead, there are nowadays several small start-up companies trying to do business with this very same idea, some of which already have customers.

From the academic point of view it is very hard to say whether the progress and ideas in the FG area come mostly from academic research or from private company product development. However, it seems that on the FG side there is only a small number of academic publications, as will be shown later. Xilinx is the market leader in monolithic FPGAs. The state-of-the-art Xilinx Virtex II FPGA fabric [25] is a monolithic chip, but as a very advanced architecture it makes a good reference point for clarifying the differences between architectures. Thus the architecture of Virtex II is first presented here briefly, although it is not an IP block yet. However, Xilinx is also joining the embedded FPGA market in co-operation with IBM [26], and in particular with exactly the same Virtex II architecture presented here.

2.2.1. FPGA fabric of Xilinx Virtex II

When FPGAs were first introduced in the 1980s they consisted of cells and an interconnection matrix. Each cell consisted of a LUT and a flip-flop. The basic cell, the slice, of the state-of-the-art Virtex II series is shown in figure 1, and indeed, LUTs and flip-flops are still the main building blocks of FPGAs. However, several optimization methods have been incorporated. The most important and most used optimization is the carry bit chain, which is denoted with the labels CIN, COUT and MUXCY in figure 1. The carry chain makes it possible to build fast carry propagation for adders and subtractors wider than the bit width of a single slice. Add operations are also optimized such that the xor gate XORG can be used to implement a full adder in a single slice. The additional and gate MULTAND is dedicated to fast multiplication implementations.
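The role of the carry chain can be made concrete with a small behavioral sketch. This is not the Virtex II netlist; it is an assumed model in which each cell plays the parts of the LUT (propagate), XORG (sum) and MUXCY (carry select), and cells are chained through their CIN/COUT ports:

```python
def full_adder_cell(a, b, cin):
    """One slice-style full adder: the LUT produces the propagate signal
    a XOR b, an XORG-style xor gate forms the sum bit, and a MUXCY-style
    mux selects the carry out (cin if propagate else the generate bit)."""
    propagate = a ^ b                 # LUT output
    s = propagate ^ cin               # XORG: sum
    cout = cin if propagate else b    # MUXCY: when a == b, generate = b
    return s, cout

def ripple_carry_add(x, y, width=8):
    """Chain `width` cells through their CIN/COUT ports, the way slices
    are chained in a column."""
    carry, out = 0, 0
    for i in range(width):
        s, carry = full_adder_cell((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out, carry

# ripple_carry_add(100, 55) -> (155, 0)
```

The point of the dedicated chain in silicon is that the carry mux path between slices is much faster than routing the carry through the general interconnect.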

Figure 1. The structure of the slice of the Virtex II series [25].

The carry chain is not the only multiplexed chain provided. The often used Sum-of-Products (SOP) methodology is boosted with a dedicated multiplexed line denoted with the labels SOPIN and SOPOUT. In addition, the ORCY or gate is used as an enabling mechanism to implement SOP functions several slices wide. On top of that, the traditional LUTs and flip-flops are replaced with configurable functional units that can be used as, e.g., latches, RAMs (Random-Access-Memory), ROMs (Read-Only-Memory) etc. The trend in the communication matrix of FPGAs, connecting the slices together, has been to divide it into several hierarchical levels. The longer lines, on the higher hierarchy levels, are useful for propagating global signals, while the very short and fast local connections, on the lower hierarchy levels, are suitable for transferring intermediate values between slices. In the Xilinx architecture there are five different hierarchy levels of communication, illustrated in figure 2. The horizontal and vertical long lines are bi-directional communication lines across the whole chip. The horizontal and vertical hex lines are uni-directional lines connecting slices at distances of three and six from the source slice in all four directions. Double lines are similar to hex lines but connect at distances of one and two from the source. Direct connections connect a slice with its eight neighbours, and fast connections are feedback lines from LUT outputs to LUT inputs.

Figure 2. Five different hierarchy levels of interconnection network of Virtex II series [25].

Overall, the structure of a modern FPGA is filled with different kinds of optimization mechanisms to maximize the computational potential of the chip, and thus the architectures differ significantly from the original clean LUT arrays.

2.2.2. Commercial fine grain IPs

The commercial FG reconfigurable IP blocks are very similar to one another. They are all based on LUTs, and thus share the problem of being delivered as hard macros. The reason why LUT systems are not synthesizable in general is that the majority of the surface area consumption comes from memory cells. An SRAM (Static RAM) based memory cell can be implemented with six transistors and a DRAM (Dynamic RAM) cell with one. In a synthesizable structure the only way to implement a memory cell is to use a flip-flop, and a consumption of 20+ transistors can be assumed. On the other hand, the reconfigurable communication topology of FPGAs can be implemented with transistors used to connect wires, while in a synthesizable structure multiplexers or similar components have to be used. The hard macros are problematic not only because of the additional difficulties in implementation, place and route, timing and verification of a complete SoC, but also because it is difficult to create competition between silicon manufacturers or to update the system in the long term if the design includes manufacturer and process specific parts [27]. The best known FG IP vendors and their products are given in Table 2.

Table 2. The best known FG IP vendors

Vendor                                      Product
Integrated Circuit Technology Corporation   Point Configurable Silicon [28]
LeopardLogic                                Gladiator [29]
Actel                                       VariCore™ Embedded Programmable Gate Array Core [30]
M2000                                       flexEOS [31]
eASIC                                       eAsicCore [32]

2.2.2.1. Varicore by Actel

Because the structures of the first four architectures mentioned in Table 2 are similar in practice, only the details of the Varicore architecture are given. In the Varicore architecture LUTs are arranged hierarchically in three different segments. On the top level of the hierarchy there are from one to sixteen PEGs (Programmable Element Group), as shown in figure 3.

Figure 3. The architecture of the Varicore FG IP [33].


Each PEG consists of an 8×8 array of Functional Group (FGR) blocks, and each FGR consists of four Functional Units (FU). Finally, inside each FU there are the LUTs. In the Varicore architecture two three-input LUTs per FU are used, as illustrated in figure 4. The structure of the FU seems quite similar to those in the original FPGAs. Note that since the functionality of a 4-input LUT can be emulated with two 3-input LUTs and a multiplexer, as implemented in the Varicore architecture, the granularity has remained the same.

Figure 4. The internal structure of a single Varicore FU [30].
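The equivalence claimed above can be checked mechanically. The sketch below is an assumed behavioral model (function names and bit ordering are illustrative, not from the Varicore documentation): a 4-input LUT is Shannon-expanded on its top input into two 3-input LUTs plus a 2:1 mux, and the two forms are compared exhaustively:

```python
def lut(truth_table, *inputs):
    """Generic LUT model: truth_table is indexed by the input bits,
    with inputs[0] as the most significant index bit."""
    idx = 0
    for bit in inputs:
        idx = (idx << 1) | bit
    return truth_table[idx]

def lut4_from_two_lut3(tt4, a, b, c, d):
    """Emulate a 4-input LUT with two 3-input LUTs plus a 2:1 mux by
    Shannon expansion on input `a` (the Varicore-style FU arrangement)."""
    low, high = tt4[:8], tt4[8:]            # halves for a = 0 and a = 1
    return lut(high, b, c, d) if a else lut(low, b, c, d)

# Exhaustive check for the 4-input parity function:
tt4 = tuple(bin(v).count("1") % 2 for v in range(16))
for v in range(16):
    a, b, c, d = (v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1
    assert lut4_from_two_lut3(tt4, a, b, c, d) == lut(tt4, a, b, c, d)
```

Since the check holds for any 16-entry truth table, the two-LUT-plus-mux cell covers exactly the function space of one 4-input LUT, which is why the granularity is unchanged.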

Because the main target in designing state-of-the-art monolithic FPGAs is to achieve maximum computational potential, while the main goal in embedded reconfigurable IP blocks is to achieve minimum area consumption, the structure in figure 4 is much simpler than the structure of Virtex II given in figure 1. In particular, the execution time optimized structures have been cut down. The only execution time optimization shown in figure 4 is the carry chain, which makes it possible to implement fast addition mechanisms. In the Varicore architecture the array of LUTs is not the only functional resource provided; hard core RAM elements are implemented as well. For testing, debugging and programming purposes a JTAG (Joint Test Action Group) interface and BIST (Built-In-Self-Test) blocks are included.

2.2.2.2. eAsicCore by eASIC

The company eASIC also has a LUT-based IP, eAsicCore [32]. However, eAsicCore differs slightly from the other FG IPs because it lacks a dynamically reconfigurable interconnection network. In eAsicCore the interconnections are configured during the manufacturing phase. Therefore the area overhead of reconfigurable routing resources is sidestepped at the expense of flexibility. This can be seen in figure 5.

Figure 5. The structure of the basic cell of the eASIC architecture [32].

In figure 5 the dotted vertical lines denote the potential communication wires. In the processing phase each of the vias, denoted by circles in the figure, either is or is not connected, and thus the application specific communication network is established. Because the reconfigurable communication architecture is the biggest cause of the power consumption and clock cycle time penalties of FPGAs, these simplifications are claimed to boost the architecture to a performance level near that of a standard cell implementation, as shown in table 3. However, the fixed communication radically diminishes the number of practical configurations.

Table 3. The performance of the eASIC core according to eASIC Technology Ltd. [32].

2.2.3. Academic fine grain IPs

In the academic world the interest in LUT based IPs has not been as strong as on the commercial side. At the University of California at Berkeley an academic FPGA fabric [34] was designed and also used as a part of their Configurable System-on-Chip (CSoC) [35]. However, the focus was not on the reconfigurable IP side but on the CSoC side, similarly to e.g. the Xilinx Virtex II Pro [36], Triscend A7 [37] and Altera Excalibur [38]. An interesting new approach in the genre of LUT based configurable IPs is presented in [39]. The paper drafts possibilities for implementing a synthesizable LUT based core. With a synthesizable structure a soft IP core could be delivered and the problems of hard cores could be sidestepped. However, the price to be paid for transferring full-custom LUT cells to synthesizable RTL code is quite high. Although in [39] the generality of the topology was reduced in an attempt to minimize the area penalty3, a ~6× area penalty is estimated. If it is also taken into consideration that FPGA implementations already have ~100× area, ~10× power and ~3-5× speed penalties [40] over ASIC technologies, something like a 600× area penalty has to be accepted if synthesizable LUTs are to be used. One proposed communication topology is shown in figure 6. As can be seen, a very strong assumption is made about the structure of the logic to be implemented; e.g. feedback loops and register elements are removed.

Figure 6. Interconnection topology of the proposed synthesizable FPGA core [39].

The third scientific LUT-based system meant to be an IP block is presented in [41,42]. The idea is to optimize the architecture for specific applications, instead of trying to make a general purpose logic platform. However, the IP dimension is only briefly mentioned.

3 The de facto array communication topology is replaced with dataflow interconnections.

2.3. Dataflows

The dataflow computing topology is depicted in figure 7. In the dataflow model the architecture consists of a communication network and Computational Elements (CE). Most typically the structure is homogeneous, such that all CEs are identical. In computation the data flows through the array. The configuration determines (1) the path of the data and (2) the operations performed on the data in the nodes.

Figure 7. The theoretical picture of the dataflow paradigm. Grey boxes are CEs and arrows form the interconnection network. The desired functionality is implemented by the CE and interconnection configurations.

Dataflow-based reconfigurable IPs are nowadays the de facto standard in the CG reconfigurable IP group. As at the board level in the early 1980s, the problem of even partially automating the mapping process from software to configware still dominates parallel processor systems. However, the simple and regular structure makes dataflow arrays homogeneous and simplifies the programming and placement of algorithms. In fact, dataflow CG systems are quite similar to FPGAs, only with a higher level of granularity, and FPGA synthesis tools could at least in principle be used to transform function descriptions into the needed configuration binaries.
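The two-part configuration described above (data paths plus node operations) can be made concrete with a tiny interpreter. This is an illustrative model, not any vendor's tool: `nodes` plays the role of the CE configuration and `edges` the role of the configured interconnect, and a node fires once its operands are available (resolved demand-driven here):

```python
import operator

def run_dataflow(nodes, edges, inputs):
    """Evaluate a dataflow configuration.

    nodes:  name -> a 2-input callable (the CE's configured operation)
    edges:  name -> (src1, src2), the configured interconnect; a source
            is another node name or an external input name
    inputs: name -> value for the external inputs
    """
    values = dict(inputs)

    def value_of(name):
        if name not in values:
            s1, s2 = edges[name]
            values[name] = nodes[name](value_of(s1), value_of(s2))
        return values[name]

    return {n: value_of(n) for n in nodes}

# (x + y) * (x - y): three CEs wired together by the "configuration"
nodes = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
edges = {"add": ("x", "y"), "sub": ("x", "y"), "mul": ("add", "sub")}
result = run_dataflow(nodes, edges, {"x": 7, "y": 3})   # result["mul"] == 40
```

Changing `nodes` and `edges` alone changes the implemented function, which is exactly the sense in which a dataflow array is "configured" rather than programmed sequentially.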

2.3.1. D-fabrix by Elixent

The first commercial dataflow-based CG IP was D-fabrix by Elixent [43]4. It is a very typical array of ALUs5. The ALUs are connected together such that each ALU is coupled with a specific switchbox element, and the switchboxes are in turn connected together. The switchbox matrix connects the ALUs not only to their four neighbours but also to the global interconnections. The 4-bit ALUs are not purely combinatorial; the nodes include registers as well, so complex computational graphs directed by Moore state machines can be implemented. As a coarse grain synthesizable structure, D-fabrix is sold as a parameterized soft IP block. On the programming side Elixent provides several programming languages and simulator software for simulating the functionality of an array with a certain configuration. The supported programming languages are VHDL (Very High Speed Integrated Circuit Hardware Description Language), Handel-C and AccelFPGA Matlab entry; however, users have to define the parallelism explicitly.

4 Published in 2001.
5 Maybe because it was the first, its datapath is only 4-bit.

2.3.2. XPP by PACT

Maybe the best known company producing CG reconfigurable dataflow IPs is PACT [44]. At the topology level PACT's XPP is very similar to D-fabrix, but in reality it contains far more sophisticated and modern features. The architecture of XPP is illustrated in figure 8. The XPP architecture is divided into two separate parts: data and control. On the data side there are several Processing Arrays (PA). Each processing array consists of several Processing Array Elements (PAE). The flexibility of PACT stems mostly from the structure of the PAE, which is not fixed. Each PAE contains PAE objects, but their type and number depend on the parameters. However, the most typical configuration is to use an ALU object, which can be used in computations. The communication network of XPP hides the latency of data moves, such that an operation is executed in a certain PAE right after all its inputs are ready, but not earlier. That kind of hardware abstraction greatly simplifies the design of applications for the array. In XPP the data is transferred via the communication network as packets. The width of a packet is uniform, such that not only the type of the PAE elements but also the width of the datapath is a parameter in XPP. On top of data packets, 1-bit event packets can also be sent. The event packets are used to transfer state information from one PAE element to another. [46]


Reconfigurable IP Blocks : a Survey


Figure 8. The structure of the XPP architecture on different levels. The top left box is the top level and the bottom left box is a leaf level element [35].

The control part of XPP is used to upload new configurations to the array. The control part is arranged such that the Supervising Configuration Manager (SCM) is at the root of the topology. Each PA is coupled with its own Configuration Manager (CM) and each CM is connected to the SCM. On the IP interface the SCM is connected to the outside memory. The actual programming of an array is based on pre-designed library functions, which can be utilized from a C-variant language. The library functions are programmed in a specific Native Mapping Language (NML), i.e., the parallelism is explicitly defined [47]. As with the D-fabric, the XPP IP is provided as synthesizable RTL code.

2.3.3. DRP by NEC

The newest CG reconfigurable block on the market is the Dynamically Reconfigurable Processor (DRP) by NEC [48]. In DRP the basic building block is a tile, shown in figure 9. The number of tiles in one DRP configuration is parameterized. Each tile consists of an 8×8 array of PEs, a state transition controller, memories and communication infrastructure. The structure of a PE is also illustrated in figure 9; its main building blocks are an 8-bit ALU, a register file and an instruction memory.


Reconfigurable IP Blocks : a MIMD Approach


Figure 9. On the left hand side the structure of a tile is shown; on the right hand side is the structure of a PE of DRP [48].

The novel structure in DRP is its way of handling the instruction pointer. The state transition controller is a reconfigurable state machine which determines the value of the instruction pointer of every PE in a single tile. The number of possible states in a state transition controller is 64, and it can receive signals from the PEs, enabling conditional branches. The maximum number of branches is four. Because the instruction pointer of every PE inside the tile is the same, the instruction pointing methodology is SIMD. However, the program may be different in every PE and the connections between PEs are made in a dataflow manner, so the topology in reality is not pure SIMD and is classified as dataflow in this survey. The DRP is the only commercial architecture taking advantage of a fast context switch hardware extension. Every tile can be reconfigured separately, enabling partial context switching. The number of pre-fetched configurations is 16 per tile, and the actual switching can be done within a single clock cycle [49].
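The shared-instruction-pointer idea can be sketched behaviorally. The following Python model is illustrative, not NEC's implementation: `Tile`, its callable "instructions" and the event coding are assumptions; the model only shows how one reconfigurable state machine can drive a common instruction pointer while every PE still runs its own private program.

```python
# Behavioral sketch (not NEC's RTL): one state transition controller
# broadcasts a single instruction pointer to every PE in the tile.

class Tile:
    def __init__(self, programs, transitions):
        # programs[pe] is that PE's private instruction memory: a list of
        # callables taking and returning the PE's register value. The
        # programs differ per PE even though the pointer is shared.
        self.programs = programs
        self.regs = [0] * len(programs)
        # transitions[state] maps an event code (0..3, i.e. at most four
        # conditional branches) to the next state; for simplicity the
        # state number doubles as the shared instruction pointer here.
        self.transitions = transitions
        self.state = 0

    def step(self, event=0):
        ip = self.state                     # one pointer for the whole tile
        for pe, program in enumerate(self.programs):
            self.regs[pe] = program[ip](self.regs[pe])
        self.state = self.transitions[self.state][event]
        return list(self.regs)

# Two PEs running different programs under the same instruction pointer:
progs = [
    [lambda r: r + 1, lambda r: r * 2],    # PE 0
    [lambda r: r + 10, lambda r: r - 1],   # PE 1
]
trans = {0: {0: 1}, 1: {0: 0}}             # a simple two-state loop
tile = Tile(progs, trans)
tile.step()   # state 0: PE0 -> 1, PE1 -> 10
tile.step()   # state 1: PE0 -> 2, PE1 -> 9
```

With per-PE instruction memories the behavior is MIMD-like in content but SIMD-like in sequencing, which is exactly why the survey classifies DRP as dataflow rather than pure SIMD.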

2.3.4. CHESS architecture

In [12] CHESS, the academic mother of Elixent's D-fabric, is presented. Because of the close relation between these two blocks the implementation details are quite similar, i.e., the structure is based on an array of 4-bit ALUs. As in the D-fabric, the array of CHESS consists of ALUs and switchboxes. The building blocks are arranged in the shape of a chess board, such that the ALUs are allocated to the black squares and the switchboxes to the white squares. Thus the architecture is tied to the layout level, and is not synthesizable. The architecture is given in figure 10.


An ALU element in CHESS is very simple, containing only two arithmetic and a few logical operations. In addition, a multiplexer and test bit operations are provided. The ALU does not contain any memory except the output registers, but it can be configured to use the block RAMs distributed across the array. The switchboxes are groups of transistor-configurable wire cross points. Because of the fixed layout the area reserved for the interconnections is as big as the area needed to implement the ALUs. Therefore a lot of area can be spent on communication, and multiple configurable busses are implemented. As in the Virtex II series, the configurable connection lines are hierarchically arranged according to their length into short lines, double lines, hex lines etc. So the communication network gives plenty of possibilities to route the ALUs together, globally or locally.

Figure 10. The physical placement of ALU and route resources of CHESS. Also the neighbour connections are depicted in figure [12].

The basic configuration style of CHESS is straightforward: with configuration bits the ALUs are configured to execute specific fixed instructions and the communication is arranged as wanted. However, CHESS also has another way to be programmed. Each ALU can feed its output as an instruction to a neighbouring ALU. In that way programs can be written to the block RAMs and the array can be used as a kind of MIMD machine.

2.3.5. DReAM architecture

In [50] the academic Dynamically Reconfigurable Architecture for Mobile Systems (DReAM) is published. DReAM is more sophisticated than CHESS, and introduces different kinds of communication, computation and configuration


models. On the top level the DReAM architecture follows the dataflow model, but the internal structure is quite heterogeneous. On the data side the basic computational block in DReAM is the Reconfigurable Processing Unit (RPU). RPUs are arranged into four RPU groups as shown in figure 11. Each RPU has a fast local connection to its four neighbours6. An RPU can also have connections to any of its non-neighbour RPUs via global interconnection lines. At the hardware level the global communication lines of an RPU are based on an FPGA-like switching matrix, i.e., the lines are circuit switched with transistors configured by SRAM configuration bits. However, the communication protocol uses handshaking signals, i.e., the interconnection delays are hidden from the programmer.

Figure 11. The top level architecture of DReAM [50].

An RPU contains two identical reconfigurable arithmetic processing units (RAP) and dual port RAM pairs. The RAP is not a typical ALU but a combination of a LUT, a shifter and an adder. The structure of the RAP is area and speed optimized such that operations where one operand is fixed are very fast, although some time penalty has to be paid if both operands change at each clock cycle. The optimization is motivated by the fact that most operations in mobile communication algorithms are multiplications with a fixed operand.

6 Whether the neighbor is in the same group or not does not limit this feature.


The configuration side of DReAM is based on the four RPU groups mentioned. Each group has its own Configuration Memory Unit (CMU) which can reconfigure the RPUs. The CMU is controlled by the Communication Switching Unit (CSU) such that each CSU controls four CMUs. At the top level, the Global Communication Unit (GCU) coordinates all the communication between an outside controller and the CSUs in a centralized manner.

2.3.6. Some very first academic CG IPs

Papers [51] and [52] present purely dataflow-based reconfigurable CG arrays tailored to accelerate algorithms on SoCs. The architecture in [51] consists of several heterogeneous elements; the building blocks of the array are multiplexers, accumulators and comparators. The communication between the elements is done with an FPGA-like transistor switchbox architecture. In [52] the whole structure is similar to the ones used on FPGAs, but the LUTs are replaced with ALUs named programmable arithmetic logic units (PALU). In [51] the programming side is not mentioned at all, but in [52] genetic algorithms are used to automatically implement filters in the array.

2.4. Systolic array

The border between dataflow and systolic array is very fuzzy. Systolic arrays are actually a subgroup of dataflow arrays in which data moves are allowed only between neighbours, and an outside controller can access only the cells on the edges of the array. However, such a limitation greatly simplifies the structure of the communication network inside the array and thus decreases the area penalty. The first commercial CG reconfigurable IP block was PulseDSP by Systolix [13], published in 1998. Taking into account the state of silicon processes at that time, it is no wonder that an area-efficient systolic array topology was used. However, except for its limited interconnection mechanism, the implementation was already quite similar to modern commercial dataflow reconfigurable CG architectures. The topology is depicted in figure 12.


Figure 12. Topology of the Systolix architecture [13].
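The neighbour-only data movement that defines a systolic array can be illustrated with a classic systolic FIR filter. The sketch below is generic, not the PulseDSP design: each cell holds a fixed coefficient, data enters only at the edge and shifts one neighbour per beat, and the controller sees only the edge input and the edge output.

```python
# Generic systolic-array sketch (illustrative, not the PulseDSP netlist).

def systolic_fir(samples, coeffs):
    """Direct-form systolic FIR: y[n] = sum_k coeffs[k] * x[n-k]."""
    taps = [0] * len(coeffs)        # per-cell sample registers
    out = []
    for x in samples:
        # The new sample enters at the edge cell and every cell passes
        # its old sample to its right-hand neighbour (one hop per beat).
        taps = [x] + taps[:-1]
        # Each cell contributes its local product; the partial sums that
        # would flow through neighbour links are modelled by the sum.
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out

systolic_fir([1, 0, 0, 0], [3, 2, 1])   # impulse response -> [3, 2, 1, 0]
```

Because the only long wires are at the array edges, the interconnect cost stays linear in the number of cells, which is exactly the area advantage the text attributes to the systolic restriction.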

2.5. MIMD

MIMD is maybe the best known parallel computation topology. In MIMD there are several processors, each with its own instruction and data memories, connected together with a communication mechanism. MIMD is indeed a universal parallel topology, because one can write an emulation program for each processor to implement the functionality of any part of some other parallel system. E.g., SIMDs, dataflows and systolic arrays are trivial to simulate using a MIMD architecture.

2.5.1. ACM by QuickSilver

QuickSilver Technology offers a parallel algorithm accelerator called the Adaptive Computing Machine (ACM) [53], which can loosely be regarded as a MIMD architecture. The basic building blocks of ACM are nodes, which are heterogeneous functional units. There are basically three different types of nodes: the adaptable execution node, the domain bit manipulation node and the programmable scalar node. Each node type is targeted at a different execution need. The adaptable execution nodes provide plenty of computational power through several optimized mathematical operations, the domain bit manipulation nodes give a fast solution to small word operations, and the scalar nodes are RISC processors. In addition, any user-specific IP can be used as a node if it implements the given interface. [54]


Any four-unit combination of these four node types can be clustered as one node in the Matrix Interconnection Network (MIN). The topology scales hierarchically such that four nodes can again be clustered as a single node on a higher hierarchy level, and so on, as illustrated in figure 13. MIN is a packet communication network in which any of the nodes, or the outside controller, can send a 32-bit data packet to any of the other nodes.

Figure 13. The hierarchical structure of communication architecture of ACM [53].

Programs are executed on ACM such that each program is a task. One node can execute from one to 32 tasks. An internal scheduler, the Hardware Task Manager (HTM), manages the execution of the different tasks in a single node. The scheduling algorithm is basically the following: (1) if a node is executing, it cannot be interrupted; (2) if the inputs of a task are ready, the task is put in a queue to be executed when the previous task has finished; (3) otherwise the task is in IDLE mode. The company claims to have an advanced programming flow for this extremely heterogeneous functional unit network; however, it is conceded that the programming tools do not implicitly find parallelism [55].
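The three scheduling rules can be sketched as a cooperative run-to-completion scheduler. The class and method names below are illustrative, not QuickSilver's HTM API; the sketch only captures the rules themselves: no preemption, a ready queue, and idle tasks waiting for their inputs.

```python
# Sketch of the three HTM rules (names and API are illustrative).

from collections import deque

class NodeScheduler:
    def __init__(self, tasks):
        self.tasks = tasks          # task name -> callable returning a result
        self.ready = deque()        # rule 2: queued until the node is free
        self.log = []

    def inputs_ready(self, name):
        # Until this is called the task simply stays IDLE (rule 3).
        self.ready.append(name)

    def run(self):
        # Rule 1: a running task is never interrupted -- each task runs to
        # completion before the next queued task is dispatched.
        while self.ready:
            name = self.ready.popleft()
            self.log.append((name, self.tasks[name]()))
        return self.log

sched = NodeScheduler({"fir": lambda: "done", "fft": lambda: "done"})
sched.inputs_ready("fir")
sched.inputs_ready("fft")
sched.run()   # [('fir', 'done'), ('fft', 'done')] -- FIFO order, no preemption
```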

2.5.2. Synputer by Synergetic Computing Systems

A company named Synergetic Computing Systems offers a MIMD-based DSP (Digital Signal Processor) IP block [56], published in 2003, called Synputer. The IP contains an array of either 16-bit or 32-bit DSP processors. The Synputer communication topology is depicted in figure 14. The architecture consists of


one to 16 homogeneous processors and a switchboard. The switchboard is actually a global register file into which the processors can write their results and from which they can fetch their operands. The computational paradigm is based on the dataflow graph of the algorithm to be computed. Nodes of the graph at an equal depth can be computed concurrently on different processors, because the processors can fetch from the switchboard the operands produced in earlier stages by other processors. Synputer has a C-compiler of sorts, but the main programming language is assembler.

Figure 14. The basic communication topology of the Synputer architecture [57].
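The level-by-level evaluation through a shared register file can be sketched as follows. The operation set, register names and the `run_levels` helper are illustrative, not Synputer's instruction set: the point is only that all nodes at one graph depth read operands produced by earlier depths from the switchboard.

```python
# Behavioral sketch of level-parallel dataflow evaluation over a shared
# switchboard (names are illustrative, not Synputer's).

def run_levels(levels, switchboard):
    # levels: one dict per graph depth, mapping a result register to
    # (operation, operand register, operand register).
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for level in levels:
        # All nodes of a depth read the same snapshot, i.e. they only see
        # values produced by earlier depths -- so they could run in any
        # order, or concurrently on different processors.
        snapshot = dict(switchboard)
        for dst, (op, a, b) in level.items():
            switchboard[dst] = ops[op](snapshot[a], snapshot[b])
    return switchboard

regs = {"x": 2, "y": 3, "z": 4}
run_levels(
    [
        {"t0": ("add", "x", "y"), "t1": ("mul", "y", "z")},  # depth 1, parallel
        {"out": ("mul", "t0", "t1")},                        # depth 2
    ],
    regs,
)
# regs["out"] == (2 + 3) * (3 * 4) == 60
```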

2.6. Hybrids

As always, it is very unlikely that any one topology is a global optimum. E.g., MIMD is more intuitive for a software designer to program, but dataflow is superior in efficiency. So, with some kind of hybrid, the robustness needed in the real world may be achieved.

2.6.1. Silicon Hive by Philips

One of the most interesting hybrid reconfigurable IP block implementations comes from Silicon Hive [58], a company owned by Philips. The architecture of Silicon Hive consists of VLIW (Very-Long-Instruction-Word) processor-like Processing and Storage Elements (PSEs), which are connected in a MIMD manner into a processor array called a cell. The architecture is highly parameterized, and different kinds of entities can be constructed. However, the first fixed configuration was Avispa, shown in figure 15.


Figure 15. The architecture of the Avispa core on two different hierarchy levels. On the left hand side the upper level is depicted; on the right hand side is the internal structure of a leaf element [59].

In Avispa, each of the four PSEs consists of about 18 functional units, i.e., a multiplier, an ALU, an accumulate unit, a shifter, internal memory and so on. The array of four PSEs is connected to the global bus via an interface block and controlled by a single controller-configuration-memory pair. The centralized controlling part hides the different PSEs, and from the compiler point of view there is only one processor, called an Ultra Long Instruction Word processor (UVLIW) [59]. The other prefixed architectures are called Bresca and Moustique. Bresca consists of an 8×8 array of VLIW cells, but the cells are much simpler than in Avispa, while Moustique has a very simple and tiny architecture consisting of only one PSE. In Silicon Hive IP the parallelism in an accelerated program is exploited such that sequential parts are executed in a sequential manner, instruction by instruction, but in the parallel parts of the program the PSEs are used in a dataflow manner, i.e., the computation model varies during the execution. The novel hybrid structure is not the only exciting feature of the Silicon Hive products. They also claim to have very advanced software tools for making implementations for the blocks on an abstraction level very near the C language. The programming flow consists of three tools: the Partitioning compiler, the Spatial compiler and the Silicon Hive Array Programming Environment (SHAPE). The Partitioning compiler is an optional tool in the Silicon Hive programming flow. The tool automatically finds the most computationally intensive algorithm


kernels from a C-program and partitions the code, i.e., finds the kernels feasible to run in the accelerator. The Spatial compiler, hiveCC, is a tool which accepts code written in a subset of C and a description of the accelerator architecture as input, and generates all the assembler and microprogram code needed to execute the program on the particular Silicon Hive architecture.

2.6.2. Morphotech by Morphotechnologies

Usually reconfigurable arrays are controlled by an outside processor, and subroutines are executed on the reconfigurable block. In the Morphotech [60] implementation the topology is exactly the same, but the processor is packed inside the accelerator IP. The processor used is a 32-bit RISC (Reduced Instruction Set Computer). The Reconfigurable (RC) array is a typical dataflow array of 8 to 64 ALUs. The architecture also contains memory elements and provides a possibility to integrate application specific accelerators. A built-in processor may ease the compilation problem by giving the compiler a sequential choice, despite the fact that the parallelism in the RC array of Morphotech is defined by the programmer.

Figure 16. Hybrid structure of Morphotech architecture [60].

2.6.3. Academic

The state-of-the-art monolithic FPGA chips are nowadays also hybrids, i.e., FPGA cells, RAMs, processors, multipliers etc. are combined on a single configurable platform. The Architectures and Methodologies for Dynamic


Reconfigurable Logic (AMDREL) project has done the same at the IP level [61]. The approach is to have a mixed granularity platform. The decision to use a mixed granularity approach comes from the observation that different granularities are needed in different parts of algorithm implementations. E.g., control and glue logic, i.e. state machines, are ideally implemented in FPGA; on the other hand, mathematical operations are best suited for implementation in a coarse grain section. The resulting architecture of the AMDREL project is given in figure 17. As shown, a set of reconfigurable logic blocks of different granularities with a common interconnection network and common memory is provided. As in the Morphotech architecture, the microprocessor is also packed inside the IP.

Figure 17. Mixed granularity paradigm of AMDREL project [61].

2.7. Discussion

The main trend found in this survey is that granularity and heterogeneity increase all the way from the FPGAs to the hybrid paradigm. It is clear that the enabling force for increasing granularity and adding heterogeneity is the exponentially growing capacity of silicon processes. So area-wise non-optimal hardware solutions are used to try to achieve more flexibility, reconfigurability or a rise in abstraction level7. The main purpose of these actions seems to be to make compiler-based application development possible.

7 Just like in the software industry: a considerable part of the exponential increase in the computational power of processors is sacrificed to make it possible to raise the abstraction level of programming.


However, there is no evidence that the use of a more compiler-friendly architecture would enable the automation of the translation from sequential code to parallel binaries, though it seems that the highest level of parallelism explicitly achievable nowadays on reconfigurable IP blocks lies somewhere near the VLIW kind of topology.


3. Reprogrammable Algorithm Accelerator

Abstract In this chapter the MIMD based reprogrammable IP block RAA (Reprogrammable Algorithm Accelerator) is presented. In RAA the massively parallel coarse grain reconfigurable IP block is constituted of tiny DSP cores, each coupled with local memories. The novel two-level communication mechanism of RAA is introduced, including a novel set of group addressing modes designed to remove the reconfiguration latency bottleneck. Also, a yield-increasing mechanism is shown. The chapter is concluded with area and clock period approximations at gate level. This chapter is based on the author's publications [P1] and [P5], but also contains some unpublished material. When published [P6] in August 2003, the RAA was one of the very first MIMD-based reconfigurable IP blocks.

3.1. Introduction

Perhaps the biggest problem in coarse grain reconfigurable IPs is the lack of a mature configware flow. Low level programming is of course possible, but the problem lies in how to automate the translation from a high level language to the parallel array and especially how to find the parallelism implicitly. Synthesis tools and hardware description languages can always be used to “program” these kinds of structures. However, the goal in the field is to make a platform which can be programmed by software engineers instead of hardware engineers. A lot of work has been done to find a solution, e.g. with supercomputers in the 1980’s [e.g. 62-65] and with coarse grain structures at board level in the 1980’s and 1990’s [e.g. 66-68]. In the back end, applications or accelerators can be built up from macros, but in the front end the compilation techniques for parallel systems have not advanced to a level that would work in practice. So the author is very pessimistic in that regard and has concluded that the goal is not to find the compiler but to make the hardware as programmer-friendly as possible, to enable programming even at the assembler level [P6]. The MIMD computer architecture is very suitable for that purpose. The basic concept, several processors connected together, is intuitive and familiar to software engineers. MIMD is also quite universal, so other topologies are easily emulated on it, i.e., the program of each processor implements one part of some other parallel system [65]. So the computational model can be changed according to need. The other advantages of MIMD are that it was deeply studied in the 1980’s and that the architecture is highly scalable.


At the time when the research work concerning RAA was started, the feasibility of using MIMD topology on coarse grain reconfigurable IPs was not yet proved and thus the MIMD-based reconfigurable IP Reprogrammable Algorithm Accelerator (RAA) was implemented. All results and conclusions within this thesis are based on a synthesizable VHDL model of RAA architecture.

3.2. Architecture of Reprogrammable Algorithm Accelerator

In practice RAA is an array of processors. Each processor has its own instruction and data memories and executes its own program. The combination of one processor and its memories is called a node in RAA. Nodes communicate with each other through FIFOs (First-In-First-Out buffers), so that each node can read from the buffers of its four neighbours. RAA is controlled via its external interface, through which data or instructions can be written to or read from any address of any processor. The architecture description in this chapter is organized such that the communication topology, the interface, and the node structure and memories are introduced in sections 3.2.1., 3.2.2. and 3.2.3., respectively.

3.2.1. Communication topology

In RAA three different kinds of data transfer can take place. The first class is configuration data transfers, with which the outside controller programs the RAA. The second class is data transfers between the outside controller and RAA which are used to transfer computation input data to RAA, and the third class is data transfers between nodes. The requirements for these three groups are quite different, and thus a novel two-level communication mechanism was implemented in RAA.

3.2.1.1 Local communication

The four Nearest Neighbors (4NN) approach was selected to connect the nodes in an array together to establish communication between the processors. In 4NN each processor can receive data from all of its four neighbours. The local communication topology is depicted in figure 18.


Figure 18. The topology of local FIFO connections.

In ordinary multiprocessor architectures the neighbour links are wires, but in RAA FIFOs are used instead. With FIFOs a lot of programming problems concerning the tight relations between the processors are sidestepped; most importantly, the exact timings between the nodes are hidden. The FIFOs are supported at the hardware level such that (1) FIFOs can be used in many instructions as registers, (2) a read operation from an empty FIFO stalls the execution and (3) a write operation to a full FIFO stalls the execution. After a stall the execution is resumed as soon as the reason for the stall is removed. Thus, to the application developer the FIFOs look like untimed wires. The local communication FIFOs could also be used to feed data and configurations to the array by connecting the FIFOs on the edge of the array to the inputs or outputs of the whole reconfigurable block, as done in [16]. However, this kind of mechanism would introduce a huge latency to the communication between an outside controller and the nodes in the middle of the array. So a novel bus-based internal communication mechanism with an advanced addressing system is implemented in RAA.
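The stalling rules above can be sketched with a bounded queue: a read from an empty FIFO and a write to a full FIFO both block until the other side acts, so the link behaves like an untimed wire. The FIFO depth of 4 and the producer/consumer roles below are arbitrary illustrative choices, not RAA parameters.

```python
# Sketch of the blocking FIFO semantics between two nodes.

import queue
import threading

link = queue.Queue(maxsize=4)        # FIFO depth is a parameter; 4 is arbitrary
results = []

def producer():
    for v in range(8):
        link.put(v)                  # stalls whenever the FIFO is full (rule 3)

def consumer():
    for _ in range(8):
        results.append(link.get())   # stalls whenever the FIFO is empty (rule 2)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
results        # [0, 1, 2, 3, 4, 5, 6, 7] -- order preserved, timing hidden
```

Neither side ever checks the other's clock or progress; the blocking operations alone enforce a correct ordering, which is exactly the programming simplification the text describes.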

3.2.1.2 Internal global bus

The internal global bus is connected to every node, as shown in figure 19. The bus is not meant for data transfers between the nodes, but for transferring data and configurations between the outside controller and the nodes. The outside controller is always the master in global bus operations. Every node has its own instruction and data memories, and the outside controller has a mechanism to access those


independently, such that each memory element has a unique address on the global bus.

Figure 19. The topology of RAA internal global bus.

At the hardware level the internal global bus topology of RAA consists of interface blocks located at each node and connected to the same address wires. The addresses are recognized locally, such that the location of the node is configured into its interface block. If a memory access, either unary or group, is made, the interface block at each node compares the address against its own location and, if necessary, the access is served. The data bus is split into write and read busses to avoid three-state buffers, which are undesired in synthesizable structures [69]. The write bus is trivially implemented with wires, because the only driver of the bus in RAA is the outside controller. In the read bus, however, there are as many writers as there are nodes, and some kind of a virtual bus has to be used. The most typical way of implementing a bus without three-state buffers is to use multiplexers such that only one of the drivers is connected to the receiver at any time. However, because the number of drivers, i.e. nodes, could be over 200 in RAA and the number of wires is sixteen, the multiplexer-based architecture is not suitable. Area is not the problem with multiplexers, but the levels of logic1 needed to implement a multiplexer tree for tens of nodes would make the path unfeasibly long. For example, in 0.18 µm technology a 16-bit multiplexer with 256 input

1 i.e. critical path


ports would have a latency of 9 ns, which is intolerable for a communication network. In RAA the read data bus is implemented as an OR-gate tree [70]. The transmitters, i.e. the nodes, are at the leaves and the outside controller is at the root of the tree. The necessary condition for such a system is that only one node writes data to the tree per clock cycle while the others write zeros. However, that kind of functionality is easy to provide in the homogeneous RAA structure. The OR-tree is extremely scalable and has very tolerable delay and area penalties. The delay of the bus structure as a function of the number of RAA nodes connected to it is given for the multiplexer and OR-tree structures in 0.18 µm technology in figure 20.

Figure 20. The delay of the OR-bus (cross marks) and the multiplexer-bus (dot marks) in nanoseconds, as a function of the number of RAA nodes connected, in 0.18 µm technology.
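The OR-tree read bus can be sketched as a pairwise OR reduction. The 16-bit word width and 256-node count below follow the text; the function itself is an illustrative behavioral model, not the RTL. Because every non-addressed node drives zero, the bitwise OR at the root reproduces the selected word after only log2(n) two-input OR levels, which is why the structure scales so well.

```python
# Behavioral sketch of the OR-tree read bus (illustrative, not the RTL).

def or_tree_read(node_outputs):
    """Pairwise OR reduction, mimicking the gate tree from leaves to root."""
    level = list(node_outputs)
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with a zero driver
            level.append(0)
        level = [level[i] | level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# 256 nodes: only node 42 drives its 16-bit word, the others drive zero.
outputs = [0] * 256
outputs[42] = 0xBEEF
or_tree_read(outputs)   # 0xBEEF, reached through 8 OR levels
```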

3.2.1.3 Addressing mechanism

The addressing system in RAA is based on several fields in a 16-bit virtual address. The fields are the address inside the memory element (memoryaddress), the location address of the node in rows and columns (row, column), the data/instruction separation bit (D/I) and the addressing mode bits (addressMode), as shown in figure 21.

Figure 21. The fields of the RAA internal virtual address.



The local memory address field is interpreted in the nodes as is. The field is 5 bits wide, so local instruction and data memories of up to 31 words are supported2. The eight bits reserved for the node location address are split into column and row addresses, both 4 bits wide. Thus 16×16=256 nodes can be addressed and added to the array. On top of the mentioned fields the virtual address includes the group addressing mode bits. With these bits, part or all of the node location bits can be forcibly ignored in addressing. The access can then be forced to be handled simultaneously in every node, in the nodes on a single row or in the nodes within the same 4×4 block. Because nodes near each other often run the same program, it was made possible to address the nodes hierarchically: the user can point in parallel to all nodes, to the nodes on the same row, to a block of nodes or to only one node. To make hierarchical addressing possible, the RAA virtual address includes the addressing mode bits field. If only one node has to be pointed to, the mode bits are set according to Table 4, the row and column addresses are set according to the node location and the actual address is set accordingly. If all nodes on the same row have to be pointed to, everything goes as in the addressing of one node, but the column address has no meaning. In block addressing, the nodes which are within four steps down or right from a given node are pointed to. The functionality of the all-nodes mode is obvious.

Table 4. Addressing mode bits.

With the group addressing system each node can determine by itself whether a memory access concerns it, and so the same data can be written to any number of nodes in the same clock cycle. Obviously, this kind of addressing is possible only for write operations; in read operations the mode bits are ignored.

² The 32nd address of the instruction memory is reserved, as will be shown later.

Addressing mode   Mode bits
All               "11"
Single row        "01"
4×4 block         "10"
Single node       "00"
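The mode bits of Table 4 amount to a per-node match rule that every node applies to each write on the bus. A minimal sketch follows; the function and parameter names are illustrative, and the 4×4 block rule follows the "four steps down or right" description above.

```python
def node_is_addressed(mode, row, col, node_row, node_col):
    """True if the node at (node_row, node_col) must serve a write whose
    virtual address carries mode bits `mode` and location (row, col)."""
    if mode == 0b11:                                  # all nodes
        return True
    if mode == 0b01:                                  # single row: column ignored
        return row == node_row
    if mode == 0b10:                                  # 4x4 block down/right of (row, col)
        return 0 <= node_row - row < 4 and 0 <= node_col - col < 4
    return row == node_row and col == node_col        # single node
```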

Reconfigurable IP Blocks : a MIMD Approach

3.2.1.4 Direct address

The address can be used in yet another special way. The logical instruction memory address "11111" has a special purpose: an instruction written to that address is executed on the next clock cycle regardless of the stage of the processor in the node. This direct instruction makes it possible to synchronise, start and stop³ the computations on the array. Direct access can also be used to transfer program flow control to the outside controller, so the array can be used, e.g., as a SIMD machine without the restriction of 32-word instruction memories. Because the instruction memory address "11111" is reserved, only up to 31 instruction slots are available.

3.2.2. Interface

The external interface of RAA is logically split into two hierarchy levels: the physical structure and a data transfer protocol. Both components fulfil the OCP-IP (Open Core Protocol International Partnership) interface specification 2.0 [71]. The physical interface of RAA implements six of the OCP basic dataflow signals (Clk, Rst, MAddr, MCmd, MData and SData). As OCP-IP specifies, the Clk and Rst signals are mandatory: Clk is a global rising-edge clock signal and Rst is an asynchronous reset. The MCmd signal is also mandatory; it is a three-bit command signal indicating the type of transfer the master is requesting. The specification provides several possible transfer types, but only write and read commands are allowed in RAA. The remaining three interface signals, MAddr, MData and SData, are used to transfer memory addresses from the master to a slave, data from the master to a slave, and data from a slave to the master, respectively. The OCP-IP protocol provides different kinds of thread and burst protocols to decrease communication overhead; however, the simple dataflow protocol without handshake signals is used in RAA. The master makes a memory access request in clock cycle x, and the request is served by a slave in clock cycle x+1.
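The handshake-free protocol (request in cycle x, service in cycle x+1) can be sketched as a one-deep pipeline. The model below is illustrative only; the signal names follow the OCP names used in the text, but the MCmd encodings and the class itself are assumptions of this sketch.

```python
IDLE, WR, RD = 0, 1, 2      # assumed MCmd encodings, for illustration only

class OcpSlave:
    def __init__(self):
        self.mem = {}
        self.pending = None                  # request latched in cycle x

    def clock(self, mcmd=IDLE, maddr=0, mdata=0):
        """Advance one cycle; return SData for a read issued in the previous cycle."""
        sdata = None
        if self.pending is not None:
            cmd, addr, data = self.pending   # serve last cycle's request
            if cmd == WR:
                self.mem[addr] = data
            elif cmd == RD:
                sdata = self.mem.get(addr, 0)
        self.pending = (mcmd, maddr, mdata) if mcmd != IDLE else None
        return sdata
```

A write followed by a read of the same address returns the data one cycle after the read request, reflecting the one-cycle service latency.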

3.2.3. Structure of node

The RAA is a homogeneous MIMD architecture and thus the nodes are identical. The internal structure of a single node is depicted in figure 22. The node consists of the internal bus interface, instruction RAM, data RAM, CPU (Central Processing Unit) core and FIFOs. The functionality of the internal bus interface has already been discussed; the structure of the CPU and memory elements is given here.

³ Stopping is important to save power, although it may not be meaningful from a mathematical point of view.

Figure 22. The topology of a single node.

3.2.3.1 Processor core

In RAA the processor used is a tiny 16-bit DSP core. The core follows the Harvard architecture and is connected to the memories via separate data read, data write and instruction read buses. The execution cycle is split into three stages: fetch, decode and execution. Each stage completes in a single clock cycle, so one instruction cycle takes three clock cycles. The stages are not pipelined; the next instruction is fetched only after the execution of the previous one has finished. The execution flow is as follows:

FETCH:
    The value of the PC is written to the address bus of the instruction memory.
    The PC is incremented by one.

DECODE:
    The operand addresses are written to the address buses.
    The control signals of the execution unit are set.

EXECUTION:
    The result is computed.
    The result is written to the accumulators or to the PC.

The instruction is always fetched from the instruction memory first. In the decode stage the operands are taken from the data memory, accumulators or FIFOs, depending on the control bits set according to the fetched opcode. In the execution stage the results of arithmetic and logic operations are always written to Accumulator1 or Accumulator2, and the result of a jump instruction is written to the PC. Accumulator1 is a 35-bit register targeted especially at single-cycle multiply-accumulate operations; Accumulator2 is a 17-bit register targeted at control structures. The PC is a 5-bit register. The structure of the processor is shown in figure 23.

Figure 23. Internal structure of DSP core used.

The instruction set of RAA is a minimized DSP instruction set. Because the core is optimized for special-purpose use, i.e. algorithm acceleration, single-cycle arithmetic operations form the main group of instructions: 16-bit×16-bit+32-bit multiply-accumulate, 16-bit×16-bit multiply, addition, subtraction and 32-bit shift operations of up to 32 steps are implemented. The control instructions include conditional and unconditional jumps. The complete instruction list is given in table 5.

Table 5. Instruction set.

Instruction   Functionality
Add           Acc1 | Acc2 = *Add1 | Acc(1|2) | FIFO + *Add2 | FIFO
And           Acc1 | Acc2 = *Add1 | Acc(1|2) | FIFO and *Add2 | FIFO
Jmp           PC <= *Add2 | FIFO
Jmpi          If (*Add1 | Acc(1|2) | FIFO == 0) then PC <= *Add2 | FIFO
Ldr           Acc(1|2) | FIFO <= *Add1 | Acc(1|2) | FIFO
Mac           Acc1 = mul + Acc1
Mul           Acc1 = *Add1 | Acc1(15:0) | Acc2 | FIFO × *Add2 | FIFO
Nop           No operation
Not           Acc1 | Acc2 = not *Add1 | Acc(1|2) | FIFO
Or            Acc1 | Acc2 = *Add1 | Acc(1|2) | FIFO or *Add2 | FIFO
Shift         Acc1 <= Acc1 shifted by *Add1 | Acc2 | FIFO | constant steps
Strh          *Add2 <= Acc1(31:16)
Strl          *Add2 = *Add1 | Acc(1|2) | FIFO
Sub           Acc1 | Acc2 = *Add1 | Acc(1|2) | FIFO - *Add2 | FIFO
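To make the accumulator semantics of the table concrete, the sketch below interprets a few of the instructions. It is not the RAA ISA: the operand routing (*Add1 vs. accumulator vs. FIFO) is collapsed to plain memory operands and the opcode spellings are illustrative. Each instruction is charged three clock cycles, matching the fetch/decode/execute flow of section 3.2.3.1.

```python
def execute(program, mem):
    """Run a list of (op, a, b) tuples; return (Acc1, clock cycles used)."""
    acc1 = pc = clock = 0
    while pc < len(program):
        op, a, b = program[pc]
        pc += 1
        if op == "ADD":
            acc1 = mem[a] + mem[b]
        elif op == "MUL":
            acc1 = mem[a] * mem[b]
        elif op == "MAC":                    # Acc1 = mul + Acc1
            acc1 = mem[a] * mem[b] + acc1
        elif op == "JMPI" and mem[a] == 0:   # conditional jump on zero
            pc = b
        clock += 3                           # fetch + decode + execute
    return acc1, clock
```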

In addition to normal processor functionality, the core includes a few special control mechanisms required by the framework it is used in: direct addressing, FIFO-full and FIFO-empty situations, and outside-controller memory accesses. In direct accessing, the outside controller writes an instruction to instruction memory address "11111". The node interface block recognizes the direct address and signals the processor core. In each instruction cycle the processor checks in the fetch stage whether the direct address signal is active, and if so fetches the instruction from the node-internal direct access bus. FIFO exceptions are handled such that the FIFO flags are produced in the FIFOs and the processor receives them as inputs. In each decode stage, if FIFOs are needed, the state flags of the FIFOs are checked. If a full or empty state is recognized, execution is stalled until the state is removed, i.e., another node performs a push or pull operation on the FIFO. Outside memory accesses are taken into account such that the bus interface signals the processor in the case of an external memory access; if the processor is simultaneously performing a memory access on the same port, its execution is stalled.
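The FIFO-full and FIFO-empty stalls can be sketched as non-blocking push/pull attempts that the core retries every decode stage until they succeed. The class below is a behavioural sketch with illustrative names; in hardware the flags are signals, not method calls.

```python
from collections import deque

class NodeFifo:
    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    def try_push(self, value):
        """False means the FIFO is full: the writing core stalls this cycle."""
        if len(self.q) == self.depth:
            return False
        self.q.append(value)
        return True

    def try_pull(self):
        """(False, None) means the FIFO is empty: the reading core stalls."""
        if not self.q:
            return False, None
        return True, self.q.popleft()
```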

3.2.3.2 Memory

Each node has four different kinds of local memory blocks: instruction memory, data memory, two FIFOs and the two accumulators inside the core. Because RAA should be fully synthesizable, actual SRAM (or DRAM) blocks could not be used and actual busses were not wanted; instead, memory blocks from the synthesizable design library were used. As depicted in figure 22, the instruction memory has one read port and one write port, while the data memory has two read ports and one write port.

Page 61: Abstract - edu.cs.tut.fiedu.cs.tut.fi/ristimaki573.pdf · (Register Transfer Level) model and area results are presented and compared to the original unextended RAA. In addition,

Reconfigurable IP Blocks : a MIMD Approach

41

Obviously the synthesizable memory blocks cannot be very large, especially in an architecture that has tens of memory blocks (two per node) and that is only one subsystem of a SoC. Thus the limitation to a maximum memory size of 32 words in RAA is reasonable. In the synthesizable model of RAA the actual sizes of the memories are parameterized (within the given limits), as are the depths of the FIFOs (which have no limits).

3.3. 8-bit data support extension

The width of the datapath in RAA is 16 bits, a compromise between the larger width needed for multimedia applications and the smaller width needed, e.g., in radio network algorithms such as correlation. With 16-bit words it is of course possible to compute wider arithmetic operations, e.g., a 32-bit operation using several 16-bit operations. Likewise, smaller word-length operations can be computed with 16-bit operands by extending the sign bit into the empty slots. However, if, e.g., 8-bit computations are done on a 16-bit architecture, a great amount of memory is wasted when intermediate values are saved. In architectures like RAA the amount of memory per node is very limited, and it can be a remarkable benefit if the memory can be maximally utilized in all situations. In the RAA architecture an 8-bit addressing extension is used to maximize memory utilization with 8-bit operations, and there is a special addressing instruction for 8-bit arithmetic. When 8-bit addressing is used, the architecture interprets each 16-bit word as two 8-bit words.

Figure 24. Block diagram of 8-bit addressing.


The block diagram of the 8-bit addressing mechanism is given above in figure 24. When 8-bit addressing is used, special logic transforms the address before the memory block: it removes the lowest bit of the address by shifting the address to the right and shifts a '1' into the emptied top slot. Thus addresses "00000" to "11111" map to the values "10000" to "11111", such that two successive addresses map to the same physical memory slot. After the data is retrieved from memory, it is transformed such that if the bit shifted out in the address transformation was '0' the lower 8 bits are selected, and otherwise the higher 8 bits are selected; the empty slots are sign extended. In other words, with 8-bit addressing the address "00000" accesses the lower 8 bits at real address "10000", "00001" accesses the higher 8 bits at real address "10000", and so on. With this mechanism the upper half of the memory area is accessible with 8-bit addressing while the lower half can simultaneously be used with normal 16-bit addressing.
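The address and data transformations read as follows in code. This is a direct transcription of the description above (5-bit addresses, 16-bit words); the helper names and the dict-based memory are illustrative.

```python
def to_physical(addr):
    """Map a 5-bit 8-bit-mode address to (physical word address, high-byte flag)."""
    return (addr >> 1) | 0b10000, addr & 1    # drop LSB, shift a '1' in on top

def read_8bit(mem, addr):
    """Read one sign-extended 8-bit value from a 16-bit word memory."""
    phys, high = to_physical(addr)
    word = mem[phys]
    byte = (word >> 8) & 0xFF if high else word & 0xFF
    return byte - 0x100 if byte & 0x80 else byte
```

Addresses "00000" and "00001" both resolve to physical word "10000", selecting its low and high byte respectively, so all 8-bit data lands in the upper half of the memory.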

3.4. Yield increase mechanism

The typical digital IC manufacturing cost consists of the prices of wafer, labour, factory overhead, packaging and testing [72]. Wafer, labour and factory overhead costs are per wafer, while packaging and testing costs are per die. Because the per-wafer cost with modern technologies is in the range of $2000 [73] and the per-chip costs are in the range of $10, a higher yield on big chips would decrease the total chip cost remarkably. To increase yield, a redundancy mechanism was implemented in the RAA architecture. Because of the homogeneous structure of RAA, redundancy was achieved without duplicating functional blocks, by accepting partially functional chips instead. A partially functional chip means in this context that, e.g., 256-node RAA chips are manufactured but the parts are sold as 240-node chips. For a single defect, the most probable consequence in RAA is that one of the nodes is out of order. As with any SoC, the problem is found during production testing, during which it is also possible to recognize which node of a faulty chip is defective. After the faulty node is found, it is encapsulated by inactivating the row where it resides [74]. The encapsulation is done using multiplexers connected to the nodes' vertical FIFO inlets, such that when a row is inactivated the FIFO links bypass it. The bypassing is depicted in figure 25. The FIFO bypass mechanism encapsulates defective nodes from local FIFO communication; the global addressing mechanism is encapsulated as well. The interface block of each node can be forced to recognize addresses whose vertical address is one smaller than the node's physical address. In this way, from the user's point of view, the row with the defective node is inactivated such that the usable addresses, and nodes, are removed from the bottom of the array. The mechanism is illustrated in figure 26.

Figure 25. On the left hand side the FIFO links of rows 1, 2 and 3 are shown. On the right hand side the node on row 2 is defective and is encapsulated by selecting, with a multiplexer, the row 1 FIFO inlet to come from row 3.

Figure 26 shows the nodes of column 1; the address of each node is given in parentheses. On the right hand side of figure 26 the node on row 2 is defective and encapsulated, and the address of the node on row 3 is decreased by one. To the user the array in figure 26 is fully functional, although it has one row less than the array on the left hand side of the figure. With the given row-deactivating mechanism most single-fault problems in RAA can be fixed. The presented system needs only one extra multiplexer per horizontal FIFO and one '−1' block per node interface compared to the RAA architecture without the yield-improving mechanism. A defective but repaired array has no timing penalty compared to a non-defective array.
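The address shift performed by the '−1' blocks can be sketched as a remapping from physical to user-visible row addresses. The function name is hypothetical; it only illustrates that the usable addresses shrink from the bottom of the array, as described above.

```python
def visible_row(phys_row, bad_row):
    """User-visible row address of a node when row `bad_row` is encapsulated.
    Returns None for the bypassed row itself."""
    if phys_row == bad_row:
        return None                        # row is bypassed, not addressable
    return phys_row - 1 if phys_row > bad_row else phys_row
```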


The author has not created any mechanism for finding faulty nodes or for configuring the yield-improving system of RAA, because the choices to be made are obvious and well known. The faults can be found, e.g., by scan methods or by implementing a Built-In Self-Test (BIST) block. Testing could also be done by configuring and executing a test program in every node during the production test phase.

Figure 26. On the left hand side the addresses of column 1 in a non-defective array are given. On the right hand side the addresses in an array with a defective node in row 2 are given.

The encapsulation of a given row after a faulty node is found can be configured, for example, using laser fuses or flash RAM bits. Testing can also be done on the fly, during normal use, if encapsulation is enabled dynamically via processor-writable registers. The dynamic approach could be useful in fault-tolerant applications such as space and military devices.

3.5. Results

The parameterized model of the given RAA architecture was coded in RTL-level VHDL. The parameters of the model include, among others, the sizes of the different memories, such as the FIFOs, data memories and instruction memories. The given extensions are also parameterized, i.e., the 8-bit addressing and the yield-increasing mechanism can be included in or excluded from the implementation. To get silicon area and timing figures for the architecture, the design was synthesized in a 0.13 µm technology with several different sets of parameters. All syntheses were done with gate-level synthesis tools, with the clock period constraint set to 10 ns and the mapping effort to medium. The timing constraint was easily achieved in all synthesis runs. The mathematical and memory elements were implemented as pre-optimized and pre-verified design library components. The sizes of the memory blocks used in synthesis are presented in table 6.

Table 6. Sizes of memory elements used.

Block                Number of 16-bit words
Data memory          32
Instruction memory   16
FIFOs                6

Initially, an RAA with four nodes was synthesized using 0.13 µm technology. From the results the area of a single node was separated, as shown in table 7 below. From the table one can see that more than half of the area comes from the CPU core, and the rest in practice from the memory elements. The multiply-accumulate (MAC) block takes up about 40% of the area of the core block; the area of the bus interface is insignificant.

Table 7. The area of a single node's subblocks in 0.13 µm technology.

Block                                    Area in mm²
Bus interface                            0.0030
Instruction memory                       0.020
Data memory                              0.015
FIFOs                                    0.016
CPU                                      0.071
MAC and multiplication inside the CPU    0.029
Total                                    0.125

As was planned, the architecture is extremely scalable. RAAs with 4, 32 and 64 nodes were synthesized, and their areas in square millimetres were 0.50, 3.9 and 7.9, respectively. There is no doubt that the size of the array would also grow almost linearly for node counts greater than 64. Next, the nodes were extended with the 8-bit addressing instruction and the yield-increasing mechanism shown before, and the extended implementations were synthesized. Both extensions are extremely light as measured by the area used, and thus the changes in the area figures were below the measurement accuracy.

Page 66: Abstract - edu.cs.tut.fiedu.cs.tut.fi/ristimaki573.pdf · (Register Transfer Level) model and area results are presented and compared to the original unextended RAA. In addition,

Reprogrammable Algorithm Accelerator

46

Hard macros of processors of the ARM1020E Thumb family take up 7 mm² to 10 mm² in 0.13 µm technology [75]. If we assume that ~10 mm² is an acceptable size for an IP block in a SoC design targeted at 0.13 µm technology, we could implement an RAA with ~100 processors. In other words, we could put an algorithm execution platform on a SoC that could, e.g., execute 100 16×16+32-bit MACs in parallel. To research the advantages of the yield-increasing mechanism, a hypothetical SoC model was created. The model includes a 100-node RAA, an ARM1020E processor and 8 Mbytes of DRAM. The design was given as input to the InCyte Lite chip estimation program [76]. Further constraints were 90 I/O pads and four voltage domains. A 0.13 µm eight-metal-layer technology on a 150 mm wafer was selected as the implementation technology. The results of the InCyte tool are presented below in table 8.

Table 8. The estimates for the hypothetical SoC chip.

Description        Value
Die dimensions     14×14 mm (196 mm²)
Core dimensions    8×8 mm
Core utilization   70%
Total sites        64
Good               23
Yield              36%

The area used by RAA is about 30% of the utilized core area, and thus in single-fault cases the yield-increasing mechanism could lift the yield of the whole chip by a few tens of percent.

3.6. Summary

In this chapter we have shown the hardware architecture of a MIMD-style coarse grain reconfigurable IP. It was shown that even with 0.13 µm technology an RAA IP with several tens of nodes can be implemented with feasible size and delay values as a single block in a SoC, and that, with insignificant penalty, the basic architecture can be extended with a yield-improving mechanism and 8-bit addressing instructions.

Page 67: Abstract - edu.cs.tut.fiedu.cs.tut.fi/ristimaki573.pdf · (Register Transfer Level) model and area results are presented and compared to the original unextended RAA. In addition,

Reconfigurable IP Blocks : a MIMD Approach

47

The presented coarse grain reconfigurable architecture has several novel features on top of the fact that it is itself one of the first published MIMD coarse grain reconfigurable IPs. Notably, the two-level communication mechanism, with an internally global OR-bus and local links implemented with FIFOs instead of wires, provides the designer with a very intuitive untimed architecture on which to implement accelerator configware. Indeed, the biggest motivation for the research work and for using RAA, or more generally a MIMD architecture, as a coarse grain accelerator instead of the other options is its application-engineer-friendly architecture. To address the problem of reconfiguration delay on reconfigurable devices, a novel group addressing mechanism was designed for RAA. The benefit gained from this addressing system stems from the fact that configurations near each other in an array are likely to be the same or similar. Another possible way to speed up memory accesses to the RAA is to use different bus widths inside and outside the RAA: because the bus width of the processors is, e.g., 16 bits while the interface width may be, e.g., 32 bits, one access can potentially transfer more than one word. However, the use of different bus widths in RAA remains future work. A hypothetical SoC with a processor, memory and RAA was depicted. It was noticed that the yield-improving mechanism can give remarkable savings during the manufacturing phase for SoC designs consisting of several megagates. On the other hand, it was shown that the structures needed to provide the row encapsulation mechanism are extremely cheap in terms of silicon area.


4. Virtualizing Dimensions of Coarse Grain Reconfigurable Array

Abstract. In this chapter a context switching mechanism is implemented in an existing MIMD-based coarse grain reconfigurable IP block, RAA. Context switching is not used only to hide reconfiguration latency; its emphasis is on virtualizing the dimensions of an array of processors by folding the array into multiple configurations. The hardware extensions for configuration management are coded as a synthesizable VHDL model, and area results are presented and compared to the original implementation. The results show that functionality equal to a virtual array four times bigger than the original physical array can be implemented in approximately 2.1 times the area of the original one. The computational efficiency of a virtually bigger array is illustrated with matrix multiplication and GPS correlation case studies. This chapter is based on the author's publication [P2].

4.1. Introduction

One benefit of coarse grain over fine grain reconfigurable IPs is that the amount of configuration data in coarse grain reconfigurable blocks is usually kilobits instead of the megabits needed in FPGAs. However, despite the reduced size of the reconfiguration data, the time needed for a typical serial reconfiguration may still be infeasible for dynamically using the same silicon area to accelerate several algorithms. The reconfiguration latency can be hidden by providing a fast context switching mechanism: several configuration memories are duplicated inside the block, and reconfiguration is done by selecting between the configuration memories instead of downloading a new configuration in the reconfiguration phase. In von Neumann, compiler-based programming models the implementation details of the CPU are abstracted away, and the same software file can be compiled for different kinds of processors. In contrast, in coarse grain reconfigurable systems the configware is very hardware specific. In many cases it is even specific to a particular version of a single reconfigurable block architecture: e.g., a configuration designed for a large array is not architecture-compatible, let alone bit-compatible, with smaller arrays. In this chapter a context switch is added to the Reprogrammable Algorithm Accelerator (RAA) and used not only to enable fast switching between different configurations but also to make configurations designed for different array sizes bit-compatible with each other.


In general a context switch is an operation where the system stops running one process and starts running another. For example, many operating systems implement concurrency by maintaining separate environments, or "contexts", for each process. The amount of separation between processes and the amount of information in a context depend on the Operating System (OS), but generally the OS should prevent processes from interfering with each other, e.g., from modifying each other's memory. In single-processor systems a context switch can be as simple as changing the values of the program counter and stack pointer and setting the MMU (Memory Management Unit) to prevent the current process from accessing the memory areas of other processes [77]. From the point of view of a MIMD-style coarse grain reconfigurable block, the context is the data and state of a single process in a single node. Most likely there are many processes that work together as a single computational entity, and thus the context of a single accelerator is the union of the contexts of many processes. The essential part of the context switch operation is that the state of the previous context has to be saved before the next one can be executed; in other words, there is always some maximum number of different contexts, depending on the memory allocated for saving them. In addition to the memory for saving released contexts, there has to be a system that determines the schedule of context switches, i.e., a scheduler. This chapter is organized such that the needed context memory extension is explained in section 4.2 and the scheduler hardware in section 4.4.

4.1.1. Previous work

The benefits of context switching as a method for hiding configuration delay in reconfigurable devices have been noticed. A lot of scientific work has been done concerning, e.g., FPGA hardware [78], the hardware of coarse grain reconfigurable systems [79, 80] and methodology-level concepts [81, 82, 83, 84, 85]. In addition, a commercial application is available [48]. However, the use of context switching in coarse grain reconfigurable structures to virtualize the dimensions of an array has rarely been studied. In [86] a model of the extensions needed to hide the size of an array is given; however, the publication sidesteps the actual implementation and leaves all of the needed scheduling to the controlling processor. The author's work published in [P2] was the first coarse grain reconfigurable IP virtualization system with node-based schedulers.

4.2. Memory architecture

In single-processor desktop systems the data and instruction memories of different processes lie in the same physical memory structure, and some mechanism allocates the memory to the processes. Although one node of RAA contains all the parts of a usual processor system, it is very simplified: the sizes of its data and instruction memories are only tens of bytes instead of the tens of megabytes used in desktop systems. For that reason the memories are likely to be full or almost full in all practical applications, and it is thus meaningless to use the shared-memory approach in the context-switch-extended nodes of RAA. Instead, a separate-memory model was used, i.e., each node has as many data and instruction memories as there are different contexts, as illustrated in figure 27.

Figure 27. In context-switch-extended RAA the memories inside the nodes are duplicated (right hand side) instead of using a shared-memory approach (left hand side).

4.2.1. Access to memories

In RAA the communication between the nodes and the external controller is done via bus interfaces, which serve data accesses if the address of the access corresponds to the address of the node. In a context-switch-enabled node there are two basic architectural choices to be made for implementing accesses to the separate context memories: first, what the addresses of the context memories are, and second, how memory accesses are directed from the bus interface to the corresponding memories. In RAA the addresses are based on the column and row locations of the nodes. The addresses of the context memories were decided to be determined such that the row and column addresses are computed from the corresponding node's real place, as a function of the maximum number of contexts in a single node and of the corresponding context's number:

    X = X_node_real + (pro mod max_pro) × max_x,    (4.1)
    Y = Y_node_real + (pro / max_pro) × max_y,      (4.2)

where X, Y, X_node_real, Y_node_real, pro, max_pro, max_x and max_y are the computed column address, the computed row address, the node column address, the node row address, the context number, the maximum number of contexts in a node, the number of real columns in RAA and the number of real rows in RAA, respectively. The reasoning behind the formulas is given later. Because pro ∈ [0, max_pro−1], X_node_real < max_x and Y_node_real < max_y, the hash functions give unique addresses to the context memories of different nodes. Note that context number 0 gives row and column addresses equal to X_node_real and Y_node_real. The simplest way to recognize the context addresses would have been to modify the node interface to do so. However, the group addressing modes of RAA present the problem that multiple context memories inside a single node can be accessed simultaneously, and thus the bus interfaces have to be duplicated. The duplicated bus interfaces and the node-based context memories give the context-switch-extended node topology shown in figure 28.
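Equations (4.1) and (4.2) can be checked with a direct transcription; taking the '/' in (4.2) as integer division is an assumption of this sketch, as are the names.

```python
def context_address(x_real, y_real, pro, max_pro, max_x, max_y):
    """Column and row addresses of context `pro` of the node at (x_real, y_real)."""
    x = x_real + (pro % max_pro) * max_x       # (4.1)
    y = y_real + (pro // max_pro) * max_y      # (4.2)
    return x, y
```

As noted in the text, context 0 maps back to the node's physical address, and distinct (node, context) pairs give distinct addresses.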

Figure 28. The multiplexer between processor and memory architecture is used to select the memories in use according to the context on execution. Each context memory has its own bus interface.


The dedicated interface block per context automatically hides the number of CPU cores in RAA from the outside controller's point of view. When the outside controller writes to a memory address, it cannot tell whether the memory is the only memory in a node or one of many context memories. Moreover, if the number of contexts is greater than one, the outside controller does not see that the memory access is not served on the row and column given in the address but somewhere else, according to the hash functions (4.1) and (4.2). The instruction and data memories are not the only memory cells in the node; the processor core contains registers as well. All such registers, i.e. the program counter, accumulator 1 and accumulator 2, were duplicated such that there are as many register banks as the maximum number of contexts. The register banks are connected via a multiplexer to the datapath of the node processor core.

4.2.2. FIFOs

The third memory elements inside the nodes are the FIFOs. Unlike the memories, it cannot be said that the utilization of the FIFOs would be high with all practical applications; as a matter of fact, it is more likely that the utilization is often poor. On the other hand, every FIFO is connected to the processor cores of four neighboring nodes, so duplicating the FIFOs would increase the complexity of the core's FIFO control remarkably. To sidestep the problem of poor utilization, a shared FIFO strategy was used: in context-switch-capable nodes all contexts write to the same FIFOs. The width of the FIFO is extended by the number of bits needed to differentiate contexts, so that data blocks written by different contexts carry different context identification numbers in the FIFO. With the identification number the receiver context can determine whether a block in the FIFO was sent to it or to another context in the same node. In the middle of the array the identification numbers of the receiver and sender have to be the same. Unfortunately, the context identification number cannot be used homogeneously in all nodes, and the nodes on the edge of the array have to be treated specially, because solid FIFO communication between virtually neighboring contexts was desired. In FIFO communication over the edges of the array the neighbor does not have the same context identification number; the numbers differ by one. The identification numbers of data going from the south edge to the north edge are increased by one, data from the north edge to the south edge are decreased by one, data from the east edge to the west edge are increased by one, and data from the west edge to the east edge are decreased by one.
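The receiver-side identification check described above can be sketched in a few lines. The names are illustrative, and the wrap with `num_contexts` past the last context is an assumption the text leaves open:

```python
def receiver_context_id(sender_id, direction, crosses_edge, num_contexts):
    """Context id the receiver should match against a FIFO block's tag.

    Inside the array the ids are equal; over the array edges the id is
    shifted by one in the direction of travel (south->north +1,
    north->south -1, east->west +1, west->east -1), per section 4.2.2.
    The modulo wrap is an assumption for contexts past the last one.
    """
    if not crosses_edge:
        return sender_id
    delta = {"south_to_north": +1, "north_to_south": -1,
             "east_to_west": +1, "west_to_east": -1}[direction]
    return (sender_id + delta) % num_contexts
```

A receiver simply compares this value with the identification number tagged onto the block at the head of the FIFO.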

4.3. Scheduler

The most obvious use of a context switching architecture in the reconfigurable genre is to hide reconfiguration latency. The configuration data can be loaded into one context memory while another is used in execution, and a context switch to the new configuration can be done in a few clock cycles, simply by selecting new memories to be used. For using the context switch only to hide reconfiguration latency, the presented duplication of memory parts and CPU registers, together with an explicit method for choosing the configuration in execution, would be enough.

Figure 29. In figure 29a the context memories in a single node are depicted. In 29b it is shown how the first (white) context memories map to the addresses of the real nodes. In contrast, the other context memories map such that the addresses of a four times bigger array become accessible. In addition, neighboring configuration memories with the same context number are also neighbors when addressed with the hash-function addresses.

The hash functions (4.1) and (4.2) used to address the context memories tally with the organization given by the real addresses of a bigger array, as illustrated in figure 29. The most important property is that with (4.1) and (4.2) the neighbor processes hang together. Say we have a 3×3 array of nodes and every node is extended with four contexts, as in figure 29. With the given hash functions, e.g. the third context (denoted as the black confmem3 in figure 29) in nodes (1,1) and (2,1) is assigned the virtual addresses (1,4) and (2,4), respectively. Thus, if there is accelerator configware configured on the given RAA such that there is a program in the addresses (1,4) and (2,4), the hash functions (4.1) and (4.2) map the programs to the nodes (1,1) and (2,1). Because the real nodes (1,1) and (2,1) are neighbors, there is no problem with FIFO communication between the virtual addresses (1,4) and (2,4). In other words, the context switch mechanism virtualizes the number of processor cores for the user: the external controller has no way of seeing whether there is a unique processor core behind a bus interface or only one context memory of a shared core. By adding an implicit method for scheduling execution time between the different contexts, the main goal of the research work in this chapter is reached, i.e. the ratio between processor cores and context memories is virtualized.

4.3.1. Virtualizing dimensions of an array

In every scheduler system design there are, at least, four main design goals:

• Latency (how long it takes to get results)
• Utilization (percentage of time the core spends doing useful work)
• Fairness (correct amount of execution time to the different contexts)
• Predictability (real-time properties). [87]

The goals conflict with one another: when the scheduler is designed to meet one goal well, it is likely not optimal for some of the others. In RAA, utilization and fairness were set as the main goals. Fairness was interpreted as a deadlock- and starvation-free implementation. Predictability and latency are of course also important factors in reconfigurable IPs, but it was realized that those parameters are more application dependent than hardware and scheduler dependent when compared to the other two goals. That is because the scheduler in RAA is used to schedule different parts of a single accelerator program, instead of scheduling different programs. Thus the latency and predictability of a single context are not critical, and the execution time and latency of an accelerator program in its entirety depend only on the global average execution time utilization. Several scheduling policies can be found in the literature [88]. The most common non-real-time policies are First-Come-First-Served (FCFS), Round Robin and Shortest Process Next (SPN), and the common real-time policies are Earliest Deadline First (EDF), Last Release Time (LRT) and Least-Slack-Time (LST). All of these except FCFS are based on the priorities of the processes, and especially in real-time schedulers the priority computations are the most central part of the algorithm. Likewise, the nature of scheduling can be either static or dynamic: in dynamic systems the scheduling algorithm changes priorities according to the dynamic behavior of the contexts and some dynamic attributes, while in static systems the priorities are fixed. In a multiprocessor system the scheduler can be global, such that the scheduling algorithm is executed in only one place and the scheduling orders are spread to the others [89]. Another approach is peer-based scheduling, where every node executes its own scheduler. Because of scalability issues, a peer-based scheduling architecture was selected for RAA.

4.3.1.1. Round Robin approach

Round Robin (RR) is the base algorithm used in multi-priority-queue scheduling, e.g. in Unix operating systems [90]. In RR the contexts waiting for execution are in a queue, and after some event a context switch to the next one in the queue is performed. As such there are no priorities or real-time behavior in RR, and the only attribute in the algorithm is the order of the contexts in the queue. As a simple and preemptive algorithm, RR fits RAA well. The events are selected such that RR behaves eagerly. In short, if the context in execution can continue, it may do so, but if it cannot, a context switch is done. This covers the situations where the program ends, where the needed data is not in the FIFO, where there is no room to make a push operation to the FIFO, and where the memory is locked by the external controller. The final algorithm is:

1. Execute the next instruction of context X.
2. If the program executes a NOP operation, make a context switch to context mod(X+1, number of contexts).
3. If the program ends, make a context switch to context mod(X+1, number of contexts).
4. If an operand from a FIFO is needed but is not yet accessible, or there is no room to make a push operation to a FIFO, make a context switch to context mod(X+1, number of contexts).
5. If the outside controller has locked the memory of context X, make a context switch to context mod(X+1, number of contexts).
6. Jump to row 1.

The scheduler was added to the core as a hardware block.
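The eager policy above can be modeled in software in a few lines. This is a minimal sketch with illustrative flag names for the four switch conditions; the hardware derives these signals from the node state:

```python
def schedule_step(ctx, num_contexts, *, nop=False, ended=False,
                  fifo_blocked=False, memory_locked=False):
    """One instruction cycle of the eager round-robin scheduler.

    The active context keeps running as long as it can; on a NOP, a
    program end, a blocked FIFO access, or a memory locked by the
    external controller, it yields to context mod(ctx+1, num_contexts),
    as in rows 2-5 of the algorithm above.  Flag names are illustrative.
    """
    if nop or ended or fifo_blocked or memory_locked:
        return (ctx + 1) % num_contexts   # context switch
    return ctx                            # keep executing this context
```

Note that the policy is local: each node evaluates only its own flags, matching the peer-based scheduling choice.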


4.4. Partially non-deadlockable FIFO

The basic RAA implementation without context switching cannot go into a deadlock state1. However, the context switch mechanism introduces a totally new deadlock event. Different contexts write to the same FIFO, and a deadlock will occur if, for example:

1. there is data1 in a FIFO which cannot be used before the node gets data2,
2. data1 is in the same FIFO as data2, and
3. data1 is written to the FIFO before data2.

As said, in the non-context-switch-capable RAA only one process can write to a certain FIFO, and thus a problem of this kind can happen only because of a software bug, i.e. it depends entirely on the accelerator program itself. In contrast, in the extended RAA data1 and data2 can be written by different contexts with no synchronization, and the given deadlock can happen. It is very difficult to conclude beforehand whether the deadlock will happen. For static programs it is possible by simulation, but for programs with dynamic behavior estimation is impossible. Moreover, the golden idea in a virtual-size RAA is that the accelerator developer has no idea of the ratio between nodes and context memories in the accelerator design phase, making it totally impractical to try to design deadlock-free applications, even static ones, for an architecture with basic FIFOs. Thus, the classical FIFO used in the basic RAA was modified. The modification is heuristic, yet efficient. A counter was added to the FIFO. Every time a pull operation happens, the counter is reset; otherwise the counter is increased. If the counter reaches a certain threshold value, FIFO words next to each other are swapped if they do not have the same context identification number. The index of the first swapped element toggles between odd and even after each counter overflow. The mechanism is illustrated in figure 30.
The heuristic behind the deadlock release mechanism can be verbalized as follows: “If a push operation has not happened for a while, it can be because of a deadlock state and may be a consequence of a bad order of data elements in FIFO”.

1 Of course every programmable system can be programmed to go into an infinite loop. Deadlock-free means here that there is no nondeterministic mechanism that inflicts a deadlock.


Figure 30. After the threshold is reached, neighboring elements with different ids are swapped. The first swapped element is odd (1st swap) and even (2nd swap) by turns.
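The counter-and-swap mechanism can be modeled in software as follows. This is a sketch with illustrative names (`tick` stands for whatever hardware event advances the watchdog counter), not the RTL itself:

```python
class ReleasingFifo:
    """FIFO with the heuristic deadlock-release mechanism of section 4.4.

    Each entry is (context_id, data).  A counter resets on every pull
    and grows otherwise; when it reaches `threshold`, adjacent entries
    with different context ids are swapped, and the index of the first
    swapped pair toggles between even and odd after each overflow.
    """
    def __init__(self, threshold):
        self.items = []          # head of the FIFO at index 0
        self.counter = 0
        self.threshold = threshold
        self.start_odd = False   # toggles after every counter overflow

    def push(self, context_id, data):
        self.items.append((context_id, data))

    def pull(self):
        self.counter = 0         # pull resets the watchdog counter
        return self.items.pop(0)

    def tick(self):
        """Advance the counter; run one swap pass on overflow."""
        self.counter += 1
        if self.counter >= self.threshold:
            self.counter = 0
            start = 1 if self.start_odd else 0
            for i in range(start, len(self.items) - 1, 2):
                if self.items[i][0] != self.items[i + 1][0]:
                    self.items[i], self.items[i + 1] = (
                        self.items[i + 1], self.items[i])
            self.start_odd = not self.start_odd
```

After an overflow, a block that was stuck behind another context's data moves one slot toward the head, so repeated overflows eventually expose it to the waiting receiver.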

4.5. Memory access allocating system

In the original architecture simultaneous memory accesses by the core and the outside controller are implicitly denied, because an outside controller access reserves the bus. However, there is no reason to deny simultaneous accesses to different memory parts inside a single node; if they are denied, the capability to use context switching to hide reconfiguration latency is lost. So a simple hardware arbiter was implemented. Each memory part is connected to the core with a busy-flag signal. If an external memory access is made via the interface, the corresponding memory part sets its busy flag. If the core attempts to execute a configuration whose busy flag is set, a context switch is performed according to row 5 of the scheduler algorithm.
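The arbiter's interplay with the scheduler can be sketched as follows; the busy-flag vector representation and names are illustrative:

```python
def node_action(active_ctx, busy_flags, num_contexts):
    """Sketch of the per-node memory arbiter of section 4.5.

    busy_flags[i] is set while the external controller is accessing
    context i's memories.  Accesses to *different* context memories may
    proceed in parallel; only when the active context's own memory is
    busy does the node take the scheduler's row-5 context switch.
    """
    if busy_flags[active_ctx]:
        return ("switch", (active_ctx + 1) % num_contexts)
    return ("execute", active_ctx)
```

For instance, while the controller reloads context 0's memory, context 1 keeps executing undisturbed, which is exactly what makes latency hiding possible.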

4.6. Application case study : Matrix multiplications

To demonstrate the execution of configware in a virtual array, matrix multiplication and scalar multiplication of a matrix were simulated. RAA is a 16-bit architecture, but in these examples 8-bit operands are assumed, to avoid overflow problems. Scalar multiplication of a matrix A = (a_ij) by a scalar r is defined such that rA = (r·a_ij).


The scalar multiplication can naturally be computed in about one instruction cycle in a coarse grain array if the dimensions of the array are as big as the dimensions of the matrix. In RAA the configware needed would be

    Mul 1, 2       # Multiply content of address 1 by content of address 2
    StrlL 3, Acc1  # Store result from Acc1 to memory address 3

With the given architecture the execution time penalty increases linearly as the array is diminished from an i×j-node RAA, where i and j are the maximum dimensions of the matrix, down to a one-node RAA. For example, a 3×3 matrix can be handled in two instruction cycles in a 3×3 array of nodes, and in 18 instruction cycles plus 8 clock cycles of context switches in a single node. In other words, in accelerator programs where the node processes are mathematically independent, the overhead of virtualization is in practice zero. However, this is not the case with most applications. Matrix multiplication of an m-by-n matrix A with an n-by-p matrix B is defined for every pair i, j such that

(AB)_ij = Σ_{r=1..n} a_ir × b_rj = a_i1 × b_1j + a_i2 × b_2j + … + a_in × b_nj.
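For reference, the definition above corresponds directly to the classic triple loop; a minimal Python check of the formula, using plain lists and no external libraries:

```python
def matmul(A, B):
    """Reference implementation of (AB)_ij = sum_r a_ir * b_rj for an
    m-by-n matrix A and an n-by-p matrix B, matching the definition
    above.  Matrices are lists of row lists."""
    m, n, p = len(A), len(B), len(B[0])
    return [[sum(A[i][r] * B[r][j] for r in range(n)) for j in range(p)]
            for i in range(m)]
```

For the 3×3-by-3×1 case discussed next, each of the three output entries is exactly one such dot product, which is why one processor row per output row plus one adder column suffices.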

Again the program is easy to design. For example, the multiplication of a 3×3 matrix by a 3×1 matrix can be computed with 4×3 processors, such that a 3×3 array of processors makes the multiplications and a 3×1 array of processors makes the additions for each row. The multiplication program needed would be

    Mul 1, 2           # Multiplication
    Ldr Fifo1, Acc1    # Move from Acc to the neighborhood
    Ldr Fifo1, FifoW1  # Move from west to east
    Jmp 3              # Continue moving

Each processor makes its own multiplication and after that begins to transfer the results of its row to the east. The processors furthest east then make the additions

    Add Acc1, Fifo1
    Jmp 1


The execution of the presented problem with the presented configware on at least a 3×4 array of processors takes 5 instruction cycles, which equals 15 clock cycles. The same configware was simulated on a 2×2 array of processors, such that each processor had four configuration memory parts. The state of every processor in every instruction cycle is illustrated in table 9. In total 20 instruction cycles and 6 context switches, i.e. 62 clock cycles, are needed to complete the execution. Thus the execution time does not scale linearly: with one third of the processors, over four times the execution time is needed. The problem is that the heuristic and local scheduler is not capable of using all the computational capacity of the array; some processors may be idle although the execution is not over yet, or a processor may be waiting for a result from another configuration currently sleeping in some other processor. The poorly utilized execution blocks can be seen in table 9.

Table 9. States of every processor (P1, P2, P3, P4) in a 2×2 array simulating configware designed for a 3×4 array. The first number in a cell is the row of the code in execution, the second is the ID number of the active configuration, and the third is the assembler operation.

Cycle | P1            | P2                       | P3            | P4
  1   | 1,1 (mult)    | 1,1 (mult)               | 1,1 (mult)    | 1,1 (mult)
  2   | 2,1 (to FIFO) | 2,1 (to FIFO)            | 2,1 (to FIFO) | 2,1 (to FIFO)
  3   | Switch to 2   | 3,1 (WtoE)               | Switch to 2   | 3,1 (WtoE)
  4   | 1,2 (mult)    | 4,1 (jmp)                | 1,2 (mult)    | 4,1 (jmp)
  5   | 2,2 (to FIFO) | Switch to 2, Switch to 3 | 2,2 (to FIFO) | Switch to 2, Switch to 1, Switch to 2
  6   | 3,2 (WtoE)    | 1,3 (mult)               | 3,2 (WtoE)    | 1,2 (add)
  7   | 4,2 (jmp)     | 2,3 (to FIFO)            | 4,2 (jmp)     | 2,2 (jmp)
  8   | 3,2 (WtoE)    | Switch to 1, Switch to 2 | 3,2 (WtoE)    | Switch to 1, Switch to 2, Switch to 1
  9   | 4,2 (jmp)     | 1,2 (add)                | 4,2 (jmp)     | Switch to 2
 10   | Switch to 3   | 2,2 (jmp)                | -             | 1,2 (add)
 11   | 1,3 (mult)    | 1,2 (add)                | -             | 2,2 (jmp)
 12   | 2,3 (to FIFO) | 2,2 (jmp)                | -             | 1,2 (add)
 13   | Switch to 4   | 1,2 (add)                | -             | -
 14   | 1,4 (mult)    | 2,2 (jmp)                | -             | -
 15   | 2,4 (to FIFO) | Switch to 3              | -             | -
 16   | 3,4 (WtoE)    | 3,3 (WtoE)               | -             | -
 17   | 4,4 (jmp)     | 4,3 (jmp)                | -             | -
 18   | 3,4 (WtoE)    | Switch to 4              | -             | -
 19   | -             | 1,4 (add)                | -             | -
 20   | -             | 2,4 (jmp)                | -             | -
 21   | -             | 1,4 (add)                | -             | -
 22   | -             | 2,4 (jmp)                | -             | -
 23   | -             | 1,4 (add)                | -             | -
 24   | -             | 2,4 (jmp)                | -             | -

4.7. Results

On the hardware side, the extensions described in this chapter were implemented into the existing RAA architecture in VHDL, and the new design was synthesized with gate-level synthesis tools. The line width of the technology used was again 0.13 µm.

4.7.1. Area penalty effect of virtualization structures

In figure 31 the area of a single RAA node equipped with 1, 2 and 4 contexts is illustrated. As can be seen from the figure, the area of a single node increases linearly as a function of the number of contexts. The extra area, when the number of contexts is increased, comes from the extra memories, the scheduler and the more complex deadlock-free FIFO. In figure 32, the areas of the processor core and the memory subsystem inside a node are given as a function of the number of contexts. From figure 32 we can see that the area increase of the controlling structures is insignificant when the number of contexts is increased; the extra area comes from the memories. In addition, we notice that the increase in memory area is linear, making the system extremely scalable.


Figure 31. Area of node capable of saving one, two and four different contexts in square millimeters in 0.13µm technology.

Figure 32. Area of processor core (dashed line) and memories (solid line) in node as a function of the number of contexts.

4.7.2. Results of deadlock-free FIFO

In figure 33 the silicon areas of a normal FIFO (dashed line) and the partially deadlock-free FIFO (solid line) are given as a function of the number of memory elements in the FIFO. As can be seen from the figure, the area penalty of the deadlock-free FIFO is very tolerable with practical FIFO sizes. Moreover, the total sizes of the FIFOs are so small compared to the size of the whole node that the effect of using 20-element deadlock-free FIFOs instead of normal FIFOs is insignificant.


Figure 33. The silicon area in square micrometers of partially deadlock free (solid line) and normal (dashed line) FIFOs as a function of memory elements in FIFO.

Simulations were also done to optimize the threshold value of the FIFO. The 512-bit Montgomery algorithm, used e.g. in the RSA algorithm, was executed on RAA with one fourth of the needed real nodes and four contexts. The execution times in instruction cycles as a function of the threshold values used in the FIFOs are given in figure 34. Two conclusions can be drawn. One is that threshold values below ten seem best. On the other hand, the differences in execution times with threshold values between one and ten are very minor; thus, the presented FIFO seems to be robust. The 64-bit and 128-bit correlation algorithms were also simulated. The needed RAA size for the 64-bit correlation was 4×4 nodes and for the 128-bit correlation 8×4 nodes. Both accelerators were executed on RAA with 1, 2 and 4 contexts. The execution times are given in figure 35. They scale almost linearly with the number of contexts, so the overhead of virtualization is in these cases practically zero.


Figure 34. The effect of FIFO threshold value on execution time in 512-bit Montgomery algorithm used in RSA application implemented on RAA with four contexts.

Figure 35. Execution time of 64 bit (left hand side) and 128 bit (right hand side) correlations in RAA with 1,2 and 4 contexts.

4.8. Summary

In this chapter it was presented how a fast context switch can be implemented in a MIMD-based coarse grain reconfigurable algorithm accelerator IP block, and how the actual implementation was done in the RAA framework. It was shown how the context switch can be used to virtualize the dimensions of the array by using a node-based hardware scheduler.


The synthesis results show that the area penalty of the extra controlling structures is insignificant; most of the area increase comes from the additional configuration memories. In other words, in some cases it may be feasible to spend the available area such that a smaller array is implemented while making it possible to use configurations designed for a far more powerful array. On the other hand, there is no additional penalty for abstracting the size of the array on top of implementing the context switch. More exactly, an array that is virtually 4 times bigger is achieved by using about 2.1 times the area, with four contexts and node-based schedulers, so the area of the virtual array is only about half the area of the full-size array with a single context. The ratio between the virtual size and the number of real processor cores seems to be constant over different parameter values of the RAA. On the configware side it was noticed that it is hard to predict the efficiency of an algorithm executed in a virtual array. The execution time depends on the balance of the filling factors between the different processors during a particular process. With the given correlation algorithm the ratio is very good, but for example with the RSA scenario given later in this thesis the ratios with context numbers greater than two are quite bad. However, with ratios near two the virtualization works very well.


5. Configware Flow of RAA

Abstract. In this chapter the configware flow from a parallel algorithm description to the RAA binaries is given. The presented configware flow includes a graphical mapping tool, an automatic place & route tool and an assembler. In the graphical mapping tool the application developer can map the parallel parts of the accelerator configware onto the nodes of RAA. With the automatic place & route tool the mapping can be done automatically. In addition, the route tool takes advantage of nodes as route-through blocks, so that nodes at a distance greater than 1 can communicate with each other. In the back end the assembler is used to translate the assembler language into binaries. In this chapter, the NP-hard (NP: nondeterministic polynomial) problems of mapping the configware to RAA and of finding uses for the group addressing modes are solved with fast heuristics. The mapping tool and assembler are tested with RSA-algorithm and GPS-correlation implementations.

5.1. Introduction

The typical initial state in configware development for RAA is an algorithm description. More specifically, we likely have a C-model of the algorithm planned to be implemented on RAA, and from that description we should produce the binary files needed to program RAA. The first action is to split the sequential C-model into parallel processes; in massively parallel architectures there have to be dozens of processes for an RAA implementation to be reasonable. After parallelization, the assembler code for the nodes has to be written and translated into binaries. In this kind of flow we have two main problems: (1) how to parallelize the sequential code; and (2) how to map the dozens of processes from the previous phase onto the nodes of RAA. Although there are several different commercial and academic HLL compilers for reconfigurable structures, no compiler implementation seems robust enough. For example, it seems that core providers cannot sell only a reconfigurable IP core and a compiler; the customers want the whole algorithm accelerator, including the configware, because state-of-the-art compiling tools are not good enough to work as general-purpose compilers. The situation was predicted already in the author's publication [P6], and thus the configware flow of RAA was originally planned such that the predicted pitfalls would be sidestepped. Because the combination of a compiler and a very simple architecture was predicted to be impossible, an intuitive architecture and low-level programming tools were set as the target, such that an ordinary software engineer could create new accelerator configware for the IP. So while problem (1) is sidestepped within this thesis, a comprehensive chain of tools is provided as a solution to problem (2) and to the back end of the configware flow. This chapter is organized such that the tools of the RAA flow, i.e. the GUI placer, the automatic place & route tool and the assembler, are described in 5.2, 5.3 and 5.4, respectively.

5.1.1. Previous work

Maybe the most advanced C-language-based compiler-hardware architecture combination is made by Silicon Hive. In the Silicon Hive project (2.6) the programming flow is broken down into three stages. In the first stage the "partitioning compiler" analyzes the C-code and suggests which parts should be coded on the reconfigurable structure and which on the host processor. The tool is not really Silicon Hive specific, but could be used in any other accelerator-host processor combination to help with the partitioning decision. The second step in the Silicon Hive tool flow is the C compiler HIVECC. At the back end of the flow is the Silicon Hive Array Programming Environment, which places the compiled program into the physical cells. HIVECC is the most advanced part of the Silicon Hive flow: the compiler does not only place the accelerator pieces in time, it also maps them in space. The idea is to place tasks that communicate with each other near each other, to decrease the communication penalty. The most novel idea is the approach of transferring datapath control, such as pipeline control and forwarding control, from the hardware to the compiler, which reduces the silicon area of the hardware. PSDS-XPP2 is a suite of software tools for the programming and simulation of the massively parallel Pact XPP. The flow consists of three stages. As in Silicon Hive above, the code is first divided into sequential and parallel parts. The second step is to compile the parallel parts with the Vectorizing C-Compiler; the compiler tries to unroll the nested loops and instantiates optimized code blocks hand-coded in the Native Mapping Language (NML). The third stage of the flow is verification, which is done with visual simulation tools. [47]

5.2. GUI mapping tool

In many cases the engineer who has manually partitioned a sequential program into parallel processes has quite a clear picture of how the processes are connected together and, moreover, of how they should be mapped [91]. Especially in RAA, where virtually only communication between neighbors is possible and only tens of instructions per node are allowed, the architecture has to be kept closely in mind while the configware is designed. To make it easier to map the parallel processes in cases where the placement is evident to the designer, the GUI mapping tool was developed. In the tool the designer sees the nodes of the array as boxes and selects from a list the assembler files to place in the nodes. According to the given graphical placement, the tool does the routing automatically. The GUI was programmed with Matlab. The layout of the RAA GUI placer is given in figure 36. On the left-hand side of the figure the tool is shown before the placement is done; the nodes are thus tagged with "empty" texts. On the right-hand side the placement is done, and the nodes are tagged with the names of the assembler files placed on them.

Figure 36. The GUI placer before and after the RSA algorithm placement.


Configware Flow of RAA


5.3. Place & route tool

For large problems hand mapping may become impractical. If the design has nodes communicating with other nodes over distances greater than one, or if the partitioning is the result of some other tool, partitioning by hand may well be an impossible task in practice. For those situations the automated place & route tool was developed.

The use of 4NN FIFOs for communication between the nodes in RAA seems to simplify the place & route problem compared to more flexible communication networks: the only requirement for placement is that communicating nodes have to be neighbors, and then no separate route phase exists. However, in the RAA architecture it is easy to use some of the nodes as route-through blocks if communication with non-neighbors has to be achieved. In a heavily hand-engineered flow it is natural that the route-through blocks are defined explicitly by the developer; in flows based on automatic tools the determination of the route-through blocks has to be made by the tools. So the place & route tool of RAA is not actually any simpler than the tools for architectures equipped with more flexible communication mechanisms: the need to allocate and create configware for nodes used as route-through or even router blocks lifts the complexity of the problem to the same level as in other coarse grain reconfigurable architectures. In addition, the RAA tool should take the group addressing modes into account in the placement phase, so that group addressing can be used as much as possible in the assembler stage.

5.3.1. Some algorithms to solve placement problem

The typical algorithms for solving the placement problem can be divided into three classes: exact, greedy and iterative. The exact algorithms are guaranteed to find the optimal solution in a bounded number of steps; for example, cutting-plane or facet-finding algorithms [92, 93] can be used. These algorithms are quite complex, with code on the order of 10,000 lines, and they are very demanding of computing power: in practice supercomputers have to be used to compute solutions for the bigger problems. Moreover, because the problem is NP-hard, the exact algorithms are computationally far too expensive to be used in practice [94].


In greedy algorithms an unplaced cell is selected in each cycle on the basis of improving the total score of the placement as much as possible, i.e. the best alternative is selected. Obviously this kind of placement strategy is likely to lead to a local optimum that may be far away from the global optimum. However, the greedy algorithm may be good for establishing the initial placement, because the algorithm itself can be executed in seconds.

In iterative algorithms changes are made to an initial placement, which can of course be a randomized placement, according to some rules, until no improving change can be made [95]. The simplest rule for making changes is the 2-opt algorithm [96]. In 2-opt two nodes are swapped in each iteration. The swapped nodes are selected such that the overall score improves as much as possible, and iteration continues until no further improvement can be made. To avoid local minima the n-opt algorithm can be used instead of 2-opt, so that instead of making only one swap per iteration cycle, n swaps are done. Because a single swap can worsen the score as much as the series of n swaps in the single cycle improves it, the algorithm can climb away from local minima [97]. Moreover, if n is selected to be the number of nodes, the algorithm becomes an optimal exhaustive search. Increasing n gives a better result, but the fast growth of execution time as a function of n makes big values of n impractical. On the other hand, it is difficult to know which n achieves the best compromise between running time and solution quality.
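As a concrete illustration, the 2-opt strategy above can be sketched in a few lines of Python. The names and the `score` function are assumptions for illustration only: `score` is taken to be a cost the optimizer drives toward zero, lower being better.

```python
import itertools

def two_opt(placement, score):
    """Repeatedly apply the best pairwise swap until no swap
    improves the score (lower score is better)."""
    improved = True
    while improved:
        improved = False
        best = score(placement)
        best_pair = None
        for i, j in itertools.combinations(range(len(placement)), 2):
            placement[i], placement[j] = placement[j], placement[i]
            s = score(placement)
            placement[i], placement[j] = placement[j], placement[i]  # undo trial swap
            if s < best:
                best, best_pair = s, (i, j)
        if best_pair is not None:
            i, j = best_pair
            placement[i], placement[j] = placement[j], placement[i]  # keep best swap
            improved = True
    return placement
```

With a toy cost such as "process k belongs in slot k", the loop converges to the identity placement; it stops at the first local optimum, which is exactly the weakness the n-opt and Lin-Kernighan variants address.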

5.3.2. Iterative Lin-Kernighan to solve TSP

The Lin-Kernighan algorithm [98] removed the drawback of having to pre-select a compromise between optimality and execution time by introducing a powerful variable n-opt algorithm. The algorithm changes the value of n during its execution, deciding at each iteration what the value of n should be: at each iteration step it examines, for ascending values of n, whether an interchange of n pieces yields a better result. This continues until some stopping conditions are satisfied. The original Lin-Kernighan algorithm [86] was targeted at the NP-hard traveling salesman problem, but the same approach is followed here and applied to the placement problem.


5.4. RAA Place & Route algorithm

The RAA placement algorithm is a combination of a heuristic greedy algorithm and a modified iterative Lin-Kernighan algorithm. The greedy algorithm is used to establish the initial placement. In the greedy initialization the processes are divided onto the available nodes recursively, such that the unplaced process with the most connections to other processes is placed as close to the middle of the array as possible. After that, the process having a connection to the previously placed process and having the most connections to other processes is placed beside the previous process. The operation is continued in breadth-first manner until all processes having direct or indirect communication with the original process have an initial placement. If there are still unplaced processes after this cycle, another cycle is performed, until every process is placed.

After the initial placement the modified Lin-Kernighan iterative optimization phase is started. The basic idea of the iterations is to improve the score by swapping nodes in the current placement. The Manhattan shortest path between nodes holding processes that communicate with each other increases the cost of the current placement if it differs from the maximum distance attached to the connection, such that

if max_length - actual >= 0
    score = score - 0.5*(max_length - actual)
else
    score = score - 1.0*(max_length - actual)
end if

where max_length and actual are the maximum allowed distance between the two processes and the actual current distance between them, respectively. In addition, to direct the Lin-Kernighan algorithm, an additional penalty is added if all of the conditions below are met.

• There are communicating processes placed on nodes such that the Manhattan distance between them is greater than 1.

• max_length is greater than 1.


• There are one or more used nodes in the Manhattan path of point 1.

In the above case the placement is unroutable; in other words, there is no way to use nodes as route-throughs to connect the two given processes. In a case where a clean path is not found between two processes that must communicate, the shortest path algorithm is executed again such that

• The original distances are saved to the edges.
• The destination node is selected as the new starting node.
• The lengths of routes through used nodes are set to infinity.
• The shortest Manhattan path algorithm is executed again on this new graph.

The shortest usable path length can now be computed by summing the first-pass (step1) and second-pass (step2) distances on every edge. The path which goes through the edge with the smallest sum of the distances obtained in the first and second algorithm executions is the shortest usable path. The score is updated such that

if max_length - (step1 + step2) < 0
    score = score + (max_length - (step1 + step2))
else
    score = score - step2
end if

The given cost functions and the presented Lin-Kernighan strategy were combined. The outline of the modified Lin-Kernighan algorithm used is as follows:

 1. old <= Greedy(X)            (make the initial placement)
 2. old_score <= Score(old)
 3. r = 2
 4. while (score(new) != 0)
 5.     if (r > 5)
 6.         return (inf)
 7.     end if
 8.     new = r-opt_move(old, r)
 9.     if (score(new) >= old_score)
10.         r = r + 1
11.     else
12.         old = new
13.         old_score = Score(old)
14.         r = 2
15.     end if
16. end while
17. return (old)

In the given algorithm the subfunction Greedy returns the initial placement and the subfunction Score the score of a given placement. The function r-opt_move makes the best possible r-opt move. The main loop logic is such that 2-opt moves are done until a local optimum has been reached and an uphill move is needed; then 3-opt moves are tried, and the value of r is increased until the score improves. When an improvement is found with r-opt, the cheap 2-opt moves are resumed. The optimization is executed until the score equals zero, or the given optimization algorithm cannot improve the score further. Because r-opt is an optimal exhaustive search, the algorithm is optimal if r equals the number of nodes in the placement. However, the execution time of the r-opt move increases exponentially as a function of r. Thus the algorithm used stops executing if r gets bigger than 5, and the algorithm is started again with a randomized initial placement.
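The outline above can be sketched in Python under the same toy assumptions as before (a hypothetical cost function driven toward zero, lower being better). The `r_opt_move` here is a brute-force search over r-slot permutations, and giving up is signalled by returning `None` instead of inf:

```python
import itertools

def r_opt_move(placement, r, score):
    """Best placement reachable by permuting the contents of r slots."""
    best, best_p = score(placement), placement
    for idx in itertools.combinations(range(len(placement)), r):
        vals = [placement[i] for i in idx]
        for perm in itertools.permutations(vals):
            cand = list(placement)
            for i, v in zip(idx, perm):
                cand[i] = v
            s = score(cand)
            if s < best:
                best, best_p = s, cand
    return best_p

def lin_kernighan(initial, score, r_max=5):
    """Modified Lin-Kernighan loop: widen the move depth r only when
    the current depth cannot improve the score any further."""
    old, r = list(initial), 2
    while score(old) != 0:
        if r > r_max:
            return None              # give up; caller restarts from a random placement
        new = r_opt_move(old, r, score)
        if score(new) >= score(old):
            r += 1                   # local optimum at this depth: widen the move
        else:
            old, r = new, 2          # improvement found: fall back to cheap 2-opt
    return old
```

Note that the sketch tests `score(old)` in the loop condition, since `new` is undefined before the first move.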

5.4.1. Dijkstra’s algorithm

Dijkstra’s algorithm [99] is perhaps the most widely used shortest path algorithm, and it is also used to find the shortest Manhattan paths between processes in the RAA tools. Dijkstra’s algorithm solves the shortest path from a vertex s to all other vertices in a directed graph with nonnegative edge weights. The algorithm’s running time is O(n²). The algorithm works by updating each vertex’s distance from the source with edge relaxations: in an edge relaxation the distance of vertex v is updated if there exists an edge from u to v such that distance(s,u)+distance(u,v) is smaller than the best already known distance(s,v). At the beginning of the algorithm the distance(s,s) is initialized to 0 and all other distances to infinity.
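A minimal sketch of the relaxation scheme follows; this heap-based variant runs in O(m log n) rather than the O(n²) of the simple array-based implementation described above, and the adjacency-dict representation is an assumption for illustration.

```python
import heapq

def dijkstra(adj, s):
    """adj: {vertex: [(neighbor, weight), ...]}, nonnegative weights.
    Returns a dict of shortest distances from source s."""
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                          # stale heap entry, skip
        for v, w in adj.get(u, ()):
            nd = d + w                        # relax edge (u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Unreached vertices are simply absent from the result, which plays the role of the "infinity" initialization.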


In the path finding process, the graph is established such that each node in RAA is depicted as a vertex and every FIFO link as a directed unit-cost edge between the corresponding vertices. It is easy to see that the shortest path problem on this graph corresponds to the problem of the shortest Manhattan path in the RAA architecture. Note that the shortest path between two processes is the shortest path between the two nodes they are placed on.

5.4.2. Route phase

The processes in neighboring nodes do not need any route phase, because they communicate over fixed FIFO links. However, processes separated by a distance of two or more need route-through blocks to establish the communication link. In the RAA place & route tool the preceding placement phase has already made sure that the needed empty nodes are on the path. In such an initial state the route phase is simple:

• Find the Manhattan path from the source to the destination.
• Create the needed configware for each route-through node.

Phase one is done using Dijkstra’s algorithm once again. For each vertex, the algorithm keeps track of the vertex it was reached from on the shortest path. From these parent links, the path from the source to the destination can be enumerated by starting from the destination and following the links back to the source. In phase two a script creates, for each FIFO link that needs to be routed through a node, configware with the line strl fifoX -> fifoX in an infinite loop. The Xs are selected according to the routed path.
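The two phases can be sketched as follows. The `parent` dictionary is assumed to have been filled during the Dijkstra pass; only the `strl fifoX -> fifoX` line comes from the text, while the loop label and jump in the emitted configware are illustrative, not actual RAA assembler.

```python
def reconstruct_path(parent, src, dst):
    """Walk the parent links (filled in during Dijkstra's algorithm)
    from the destination back to the source, then reverse."""
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    return list(reversed(path))

def route_through_code(in_fifo, out_fifo):
    """Emit configware for a node used purely as a route-through:
    copy the incoming FIFO to the outgoing FIFO forever.
    Label/jump syntax is hypothetical."""
    return [f"loop: strl fifo{in_fifo} -> fifo{out_fifo}",
            "jmp loop"]
```

For example, `reconstruct_path({"b": "a", "c": "b"}, "a", "c")` yields the hop sequence from which the FIFO numbers of each intermediate node are read off.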

5.5. RAA Assembler

The assembler in RAA has two main functions. The first is to translate assembler instructions into binary format. From the developer’s point of view it is far more efficient to write programs when one can write e.g. ‘add fifo1 acc1, fifo2’ instead of using a binary opcode format like ‘1010001000010’.


However, the translation between assembler words and binary fields is trivial; the needed program was created with Matlab. The second, and not trivial, purpose of the assembler is to automatically make full use of RAA’s row, block and global addressing mechanisms. The purpose is to minimize the total code size needed to configure the RAA. A minimal code size not only diminishes the amount of memory needed in the outside controller, but also minimizes the configuration time. However, manually finding an optimal way to use the addressing mechanisms for a given accelerator configware is, for a bigger application, impossible in practice.

The most obvious method for automatically finding the optimal way to use the group addressing methods, in order to minimize the size of the configuration data, is to enumerate every possible addressing combination and search for the best of them. Let us suppose that we have a RAA configuration with an 8×8 node array and 32 words of program memory. In addition, we have an application which contains 20 different programs, such that there are some of them in every node. To make the presentation clearer, let us also suppose that we can program all 32 words simultaneously, so that only one memory access per node is needed (this assumption removes only a constant coefficient from the following analysis; likewise, the optimization can be done separately for every row). So first we have to select one global access from 21 possibilities (the 21st possibility being the empty selection), eight row accesses from 21 possibilities each, sixteen block accesses from 21 possibilities each, and 64 single accesses from 21 possibilities each. The result is that we have 21, 21^8, 21^16 and 21^64 different ways to select the possible combinations of the global access, row accesses, block accesses and single accesses, respectively. The total number of combinations is thus 21×21^8×21^16×21^64 = 21^89. The number is enormous: even if there were a machine that could examine one combination per clock cycle at a clock frequency of 1000 GHz, it would take on the order of 10^98 years to go through all of the combinations.
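The magnitude is easy to check with arbitrary-precision arithmetic; this is a quick sanity-check script, not part of the original tool flow.

```python
import math

# The addressing choices enumerated in the text for an 8x8 array with
# 20 programs: each of 1 global, 8 row, 16 block and 64 single accesses
# can carry any of 21 options (the 21st being "no access").
total = 21 ** (1 + 8 + 16 + 64)          # = 21**89 combinations

# Exhaustive search at one combination per cycle on a 1000 GHz machine:
years = total / 1e12 / (365 * 24 * 3600)
print(f"combinations ~10^{math.log10(total):.1f}, years ~10^{math.log10(years):.1f}")
```

The script reports roughly 10^117.7 combinations and about 10^98 years of search time.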


5.5.1. Heuristic to select the group addressing mechanism

To achieve good enough results fast, a heuristic algorithm was constructed. The heuristic is based on a greedy strategy: a new memory operation is selected such that the maximum number of memory slots is set to the legal state, and operations are selected as many times as needed to set every memory slot to the legal state. Legal state means that the program loaded into the slot agrees with the placement solution. The greedy algorithm is guided by the following observations:

• There is no configuration where it would be optimal to make a memory write to the same address more than once, even if the actual data in the accesses differed.

• There is no configuration where it would be optimal to make a memory write that reverts a slot to a previous state.

According to the greedy nature and the regularities above, the algorithm begins as follows:

• (1) Select the program that is used most and make a global memory access with it.

After that, according to the greedy tactic, the row accesses are selected.

• (2) If over half of the memory slots in a row have the same program, such that the program is not the one used in stage 1, make a row access with that program.

In the previous stage the intent was to leave room for block addressing: only rows with the same program written to over half of their nodes were included, instead of simply selecting the most common program.

• (3) If two or more memory slots in a block have the same program, such that the program is not the one used in stages 1 or 2, make a block access with that program.

• (4) Program all memory slots still in an illegal state with single accesses.


It is easy to see that the given algorithm ends with every memory slot in a legal state, because of rule 4. On the other hand, the algorithm is extremely fast: with a proper data structure the running time is only O(n), where n denotes the number of nodes.
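The four steps can be sketched in Python. The function name, the data layout, and the 2×2 block shape are assumptions for illustration; note also that a block access in step 3 may overwrite an already-legal slot, which step 4 then repairs.

```python
from collections import Counter

def plan_accesses(grid):
    """Greedy selection of global / row / block / single accesses for a
    2-D array of target programs (grid[r][c]); blocks assumed 2x2."""
    n_rows, n_cols = len(grid), len(grid[0])
    state = [[None] * n_cols for _ in range(n_rows)]
    accesses = []

    # (1) global access with the most common program
    glob = Counter(p for row in grid for p in row).most_common(1)[0][0]
    accesses.append(("global", glob))
    state = [[glob] * n_cols for _ in range(n_rows)]

    # (2) row access when over half of a row shares another program
    for r in range(n_rows):
        prog, cnt = Counter(grid[r]).most_common(1)[0]
        if prog != glob and cnt > n_cols // 2:
            accesses.append(("row", r, prog))
            state[r] = [prog] * n_cols

    # (3) block access when >= 2 slots of a 2x2 block still need the same program
    for br in range(0, n_rows, 2):
        for bc in range(0, n_cols, 2):
            cells = [(r, c) for r in (br, br + 1) for c in (bc, bc + 1)]
            need = Counter(grid[r][c] for r, c in cells
                           if grid[r][c] != state[r][c])
            if need:
                prog, cnt = need.most_common(1)[0]
                if cnt >= 2:
                    accesses.append(("block", br, bc, prog))
                    for r, c in cells:
                        state[r][c] = prog

    # (4) single accesses for whatever is still illegal
    for r in range(n_rows):
        for c in range(n_cols):
            if state[r][c] != grid[r][c]:
                accesses.append(("single", r, c, grid[r][c]))
                state[r][c] = grid[r][c]
    assert state == grid       # rule 4 guarantees every slot ends legal
    return accesses
```

The final assertion mirrors the termination argument above: after step 4 every slot necessarily holds its target program.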

5.6. Results

The efficiency of the group addressing mechanism and the automatic addressing optimisation algorithm were tested with the RSA encryption and GPS correlation algorithm configwares. The RSA implementation is treated in more detail in Chapter 6; the details of the GPS correlation implementation are given here.

5.6.1. GPS correlation background

GPS satellites broadcast an encoded signal that repeats itself once every millisecond. Each satellite uses a unique 1023-bit pseudo-random noise (PN) code transmitted at a rate of 1.023 Mbps. The encoded signal is detected at the receiver by correlation: the received signal is multiplied with a locally generated replica Coarse Acquisition (C/A) code of the PN code used by the satellite, and the product is integrated to obtain a peak correlation signal [100]. There are 1023 possible code delays in the delay space where the peak could be located. Finding it requires the incoming signal to be correlated with 1023 delayed versions of the PN code, which makes the correlation operation extremely demanding of hardware: a 1023-bit correlation has to be done 1023 times every millisecond. Thus, correlation is impossible to do in the low-power processors of mobile devices; it needs a parallel computing platform or a fixed hardware implementation. After the peak is found, the computationally intensive correlation is not needed anymore until the tracked signal is lost and the peak has to be found again [101]. So the actual correlation is done only rarely, which makes the GPS correlation algorithm very suitable for implementation on a reconfigurable platform: unlike with a fixed implementation, the area of the accelerator is not left unused most of the time, but can be reconfigured to do something else.

The author does not have advanced knowledge of the mathematical background of the algorithms needed in correlation, but the details can be found, for example, in [101]. The RAA implementation here is based on the ready hardware architecture of the bitwise parallel algorithm, which can be found in [102]. The bitwise parallel implementation of GPS correlation gets as input the PN code from a satellite, presented as 1023 2-bit words, and the C/A code, which is 1023 bits. Each word of the PN code is enumerated as given in table 10 below.

Table 10. Numerization of words in PN-code

Code    Number
“00”      -2
“01”      -1
“10”       1
“11”       2

Each bit in the C/A code is interpreted such that ‘1’ and ‘0’ are 1 and -1, respectively. The idea of the algorithm is to multiply each PN word with its corresponding bit in the C/A word and accumulate the results into one 12-bit value. Note that multiplication by -1 or 1 does not increase the word length. After each accumulation the C/A code is shifted right by one and the same procedure is done again. After 1023 cycles, the peak value is the highest value computed.

The problem in implementing GPS correlation on the RAA is that the word length in RAA is 16 bits, while the word lengths of the operations in correlation range from 2 bits in the multiplications up to 12 bits in the last stage of accumulation. Fortunately, in correlation the PN code operand is fixed while 1023 different correlations are performed against differently delayed versions of the C/A code, which makes it possible to do a precomputation step to lift the word length of the computations. In the precomputation step the PN code is spread across the RAA array such that each block has a 4-word part of it. In every node all possible multiplication results are written to a look-up table. The computations are done such that one operand of the multiplication is the fixed 4-word part of the PN code and the other operand runs through all possible 4-bit combinations. The obtained look-up table values are saved to the data memories, so that the multiplications during the actual execution phase can be done as memory operations. Because of the 8-bit memory operations of RAA, the 16 entries required for the look-up table can be accommodated in 8 memory slots.
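The per-node look-up table construction can be sketched as follows. The word values come from table 10; the bit ordering of the 4-bit C/A pattern is an assumption for illustration.

```python
# 2-bit PN words mapped to their numeric values (table 10).
PN_VALUES = {"00": -2, "01": -1, "10": 1, "11": 2}

def precompute_lut(pn_words):
    """Build the 16-entry look-up table for one node holding a fixed
    4-word PN segment: entry i is the partial correlation against the
    4-bit C/A pattern i (bit 1 -> +1, bit 0 -> -1)."""
    assert len(pn_words) == 4
    pn = [PN_VALUES[w] for w in pn_words]
    lut = []
    for pattern in range(16):
        ca = [1 if pattern & (1 << b) else -1 for b in range(4)]
        lut.append(sum(p * c for p, c in zip(pn, ca)))
    return lut
```

The entries range from -8 to 8, which is consistent with the 8-bit partial results accumulated in the next phase.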


Because the given precomputation step has to be done only once per 1023 correlations, its execution time of about 60 instruction cycles is insignificant compared to the total execution time. After the precomputation step the actual correlation algorithm is run. In each node one 4-word multiplication, i.e. a look-up table operation, is done. After that the nodes are grouped into 4×4 node groups. In each group the 8-bit results computed in the previous phase are accumulated into one 12-bit value. The accumulation tactic inside a group is shown in figure 37.

Figure 37. The intermediate values on nodes are accumulated to the 12-bit final result.

The given architecture scales upwards from 64-bit operands in steps of 64 bits, so that e.g. in a 1024-bit version there have to be 256 nodes, which makes 16 groups and sixteen 12-bit values after the group integration. The remaining 12-bit values have to be accumulated in the outside controller. After each cycle the C/A code has to be shifted right by one. For that reason a snake path through the nodes is made and the C/A code is rotated inside the array, so that there is no need to do any configuration or data write operations to the array during the 1023 correlation steps. The snake path is illustrated in figure 38. The execution time of the node where the group result is accumulated is 16 instruction cycles in total. Because the integration is done in parallel in every group, 16 instruction cycles is also the time it takes to make one 1023-bit correlation in a 256-node RAA with the given architecture. In GPS 1023 correlations have to be done, and thus the total execution time is 16×1023=16368 instruction cycles, i.e. 16368×3=49104 clock cycles. If execution is to be completed within the one-millisecond time window, the frequency of the RAA needs to be no more than ~50 MHz. In addition to the execution done in the RAA, the outside controller has to read the result of every group after every correlation, making 16 memory accesses every millisecond in the 1024-bit version.
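The timing arithmetic can be checked directly; the figure of 3 clock cycles per instruction is taken from the product 16368×3 in the text.

```python
# Timing budget for the 256-node (1024-bit) GPS correlation on RAA:
instr_per_corr = 16          # group accumulation, done in parallel in every group
correlations = 1023          # one correlation per C/A code delay
clocks_per_instr = 3         # one RAA instruction takes 3 clock cycles

clock_cycles = instr_per_corr * correlations * clocks_per_instr
min_freq_mhz = clock_cycles / 1e-3 / 1e6    # finish within the 1 ms window
print(clock_cycles, min_freq_mhz)           # 49104 cycles, ~49.1 MHz
```

The required clock frequency of about 49.1 MHz is comfortably below the ~50 MHz bound stated above.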

Figure 38. Snake path in a 64-node (256-bit) correlation configware.

5.7. GUI and assembler results

The efficiency of the group addressing mechanism and the automatic addressing optimisation algorithm were tested with the RSA encryption and the given GPS correlation algorithm configwares. First, the 256-bit RSA encryption and the 64-bit correlation algorithm were optimised. The RAA architecture used was a single-context 4×4 array with 32 words of instruction memory. The configware needed for RSA encryption was composed of three different kinds of programs in total, and the configware used to implement the correlation algorithm was composed of 14 different programs. Both implementations were coded with a text editor and connected together in the RAA GUI environment. The lengths of the individual programs were from 25 to 31 words, and thus the size of the configuration data for the whole RAA array without any optimisation is on the order of 27×16=432 words. The optimisation was done at instruction level, and the achieved configuration sizes for 256-bit RSA and 64-bit correlation were 79 and 62 words, respectively. The sizes of the configuration data for the 128-bit correlation algorithm and 512-bit RSA, both in an 8×4 array, were 97 and 66 words, respectively; the sizes of the unoptimized configuration data are in these cases 992 and 576 words. The use of the different addressing modes in the optimizations of the 128-bit correlation and 512-bit RSA is given in table 11.


Table 11. The code sizes of the RSA and GPS implementations after code optimization.

Addressing mode    RSA    GPS
Global              31     18
Row                 14     13
Group                0      0
Single              52     35
Total               97     66
Raw                992    576

The third test was the implementation of the 1024-bit RSA algorithm on an 8×8 RAA array with 32 words of instruction memory. With the above assumptions the unoptimized code size would be 27×64=1728 words. The optimised code size, with instruction-level optimisation, was 124 words.

5.8. Place & route results

The automatic placement tool was implemented with an interpreted language. Unfortunately, the implementation language was ill-suited for this purpose. Although the implementation was optimized such that in each execution of Dijkstra’s algorithm only one memory operation is made to the look-up table, the execution time of a single iteration step was on the order of tens of milliseconds. Thus, the computations for bigger placement problems were impossible to finish. However, the placement algorithm itself appears to be fast: the inexpensive 2-opt moves are used extensively, and r-opts are needed only for climbing uphill. In the GPS correlation algorithm implementation given in this chapter the 4×4-process basic group was designed. The number of different combinations that an exhaustive search would need to test to place that configware onto an equally sized RAA array is 16!, a total of 2.1×10^13 combinations. The RAA placement tool gave an optimal result after examining 67594 combinations.

Another test case demonstrates how the algorithm solves a problem with processes communicating over a maximum distance greater than one. The connectivity graph of the test case is given in figure 39; the numbers in parentheses are maximum distances greater than one. The optimal result for the problem was obtained after examining 1029 combinations.

Figure 39. Connection graph of test case with maximum distances greater than one.

5.9. Summary

The automatic design tools for configware development for the RAA architecture were presented. The chapter covered the flow from process description to the RAA binaries. In the first step of the flow two alternatives were considered: manual and automatic placement. A GUI-based program was implemented for manual placement. For automatic placement a modified Lin-Kernighan algorithm was implemented, and a heuristic was designed to provide the initial placement for it. It was shown how the RAA group addressing mechanism, together with the automatic code size optimizer tool, reduces the bottleneck in the reconfiguration phase. A remarkable benefit from group addressing can be achieved even with a non-optimal heuristic optimisation algorithm: with realistic application code sizes, the reconfiguration-time cuts achieved with the given algorithm were on the order of 70-90 percent of the non-optimized code sizes. In the future, the placement tool should be implemented as optimized C code to get more information about the presented modified Lin-Kernighan placement algorithm. Another possibility is to combine the code size optimizer and placement tools, so that the possibility to optimize the code size is also taken into account in the placement phase.



6. 3+ Ways to Implement RSA Encryption

Abstract

The hardware architecture to implement RSA encryption is shown. The architecture is scalable and suitable for RSA implementations on high-radix reconfigurable platforms. The presented RSA architecture is implemented on an ASIC, a platform FPGA and a coarse grain reconfigurable IP block, and the overall timing and area results for these three implementations are given. In the coarse grain reconfigurable IP the context switching and dimension virtualization mechanisms are used, and the results show that in this application the execution time is linear in the number of real processors, even though virtual nodes are used. The given platform FPGA implementation was the fastest known FPGA-based 1024-bit RSA encryption when published in [P4].

6.1. Scalable RSA encryption suitable for high-radix reconfigurable structures – Introduction

The need for confidential communication over insecure data channels has driven fast growth in the use of public-key cryptosystems. Online banks, online stores and governments’ online services are based on the authentication and encryption provided by asymmetric cryptosystems. The most widely used public-key cryptosystem is RSA. The problem with RSA and other asymmetric algorithms is their computational complexity: in general, software implementations of asymmetric encryption algorithms are at least 100 times slower than implementations of symmetric ones [103]. Algorithm execution speed is especially important for mobile applications, where authentication has to be done “on the fly”. One way to make RSA efficient enough is to design a hardware accelerator to execute the algorithm. However, in most common applications the encryption is done very rarely, so it is infeasible to sacrifice a large silicon area for it, which makes RSA very suitable for implementation on reconfigurable parts [104].

Another problem with RSA is that there is no mathematical proof that cracking the algorithm is NP-hard. On the contrary, there is a recently published proposal [105] which may help to crack RSA faster than has ever been thought possible. Nowadays it is recommended to use keys over 6000 bits long if data security is to be kept “forever”, although only a couple of years ago 2048-bit keys were claimed to be safe “forever”. In RSA, 1024 bits is nowadays considered the smallest reasonable key length for valuable data. These factors of uncertainty are another important motivation for implementing RSA on a reconfigurable architecture.


3+ Ways to Implement RSA Encryption


6.2. RSA algorithm

The RSA algorithm was invented in 1978 by Rivest, Shamir and Adleman [106]. In RSA, encryption and decryption are performed with the same simple equation

C = M^e mod n, (6.1)

where C, M, e and n are the encrypted message, the plain text, the encryption exponent and the modulus, respectively. When equation (6.1) is used in decryption, C and M are interchanged and e is replaced with the decryption exponent d. The numbers e and n compose the public key, and the numbers d and n compose the private key. Obviously, e, d and n are not mathematically independent. Because key generation is outside the scope of this work, the equations needed to generate e, d and n are not presented here; they can be found e.g. in reference [107].
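Equation (6.1) can be illustrated with a short executable sketch. The key below is the classic textbook example (p = 61, q = 53) and is for illustration only; real RSA moduli are 1024 bits or longer.

```python
# Equation (6.1) in executable form: encryption and decryption are the
# same modular exponentiation. Toy key values, illustration only.
n = 61 * 53    # modulus n = 3233
e = 17         # public (encryption) exponent
d = 2753       # private (decryption) exponent: e*d = 1 mod (p-1)(q-1)

def rsa(msg, exponent, modulus):
    """C = M^e mod n; decryption uses d in place of e."""
    return pow(msg, exponent, modulus)

M = 42
C = rsa(M, e, n)            # encrypt
assert rsa(C, d, n) == M    # decrypting with d recovers the plain text
```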

6.2.1. Montgomery modular multiplication

The basic operation in RSA is the modular multiplication c = a ∗ b mod n, which can be efficiently computed with the Montgomery algorithm developed by P. L. Montgomery in 1985 [108]. The algorithm makes it possible to compute the remainder using only divisions by powers of two. In the Montgomery algorithm the operands are transferred to the n-residue so that

a_r = a ∗ r mod n, (6.2)

where a is the original multiplicand, a_r is a's counterpart in the n-residue, and n is the modulus of the original modular multiplication. r is defined such that

2^(k-1) ≤ n < 2^k and r ≥ 2^k, (6.3)

where k is the number of bits used. When r is selected to be a power of two, only shifts are needed for the divisions in the Montgomery product, so r is selected such that r = 2^k. The Montgomery modular product of two operands a_r and b_r in the n-residue is defined such that


Reconfigurable IP Blocks : a MIMD Approach


c_r = a_r ∗ b_r ∗ r^-1 mod n, (6.4)

where r^-1 is the number that satisfies

r^-1 ∗ r = 1 mod n. (6.5)

The Montgomery algorithm also needs the number n', defined such that

r ∗ r^-1 - n ∗ n' = 1. (6.6)

In 1990 Dusse and Kaliski discovered [109] that resolving the whole n' in the Montgomery algorithm is not necessary if the word length used in the computations is less than that of the operands. Due to this observation, n' in (6.6) can be replaced with n0', which is defined such that

n0' = -n0^-1 mod 2^w, (6.7)

where n0 is the w lowest bits of the modulus. The Montgomery algorithm's result can be transferred back from the n-residue such that

c = 1 ∗ c_r ∗ r^-1 mod n. (6.8)

Since antiquity it has been known that the exponentiation needed in RSA does not have to be computed step by step, such as

M^2 = M ∗ M, M^3 = M^2 ∗ M, M^4 = M^3 ∗ M, M^5 = M^4 ∗ M, (6.9)

but can be computed e.g. such that

M^2 = M ∗ M, M^4 = M^2 ∗ M^2, M^5 = M^4 ∗ M. (6.10)

Because in modular exponentiation the remainder is taken after every multiplication, the number of multiplications determines the execution time. The classical way to compute the exponentiation M^e with fewer than e multiplications is the binary method [110], where only on the order of log2(e) multiplications are needed. It is also well known that modular exponentiation can be performed as follows:


M^3 mod n = ((M mod n ∗ M mod n) mod n ∗ M mod n) mod n, (6.11)

where the partial results never grow bigger than n. By using the binary method, computing the remainder after every exponentiation step, and using the Montgomery algorithm in the modular multiplications, we obtain the top-level algorithm of RSA shown below [111].

RSA(M,e,n)
1.  n0' = modinverse(n)
2.  T = M ∗ 2^(w∗s) mod n
3.  X = 2^(w∗s) mod n
4.  for i = 0 to w∗s-1
5.    T_{i+1} = monpro(T_i, T_i)
6.    if e_i = 1 then
7.      X_{i+1} = monpro(X_i, T_i)
8.    else
9.      X_{i+1} = X_i
10. R = monpro(X_{w∗s}, 1)
11. return R

In the algorithm, w and s are the word length of the computation operations and the number of words needed to represent n, respectively. For example, if 1024-bit encryption is to be computed on a 16-bit architecture, the value of w is 16 and the value of s is 64.
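The top-level algorithm above can be sketched in executable form as follows. Here monpro is realized directly with big integers rather than with the word-serial hardware described later, and all helper names and key values are illustrative, not from this work.

```python
# Executable sketch of the top-level RSA algorithm above.
def monpro(a, b, n, r, n_prime):
    """Montgomery product: a * b * r^-1 mod n (equation 6.4)."""
    t = a * b
    m = (t * n_prime) % r
    u = (t + m * n) // r           # exact: t + m*n is divisible by r
    return u - n if u >= n else u

def rsa_montgomery(M, e, n, k):
    """C = M^e mod n via the binary method and Montgomery products."""
    r = 1 << k                     # r = 2^k with 2^(k-1) <= n < 2^k
    n_prime = (-pow(n, -1, r)) % r # realizes r*r^-1 - n*n' = 1 (eq. 6.6)
    T = (M * r) % n                # M in the n-residue (row 2)
    X = r % n                      # 1 in the n-residue (row 3)
    for i in range(k):             # rows 4-9, LSB-first binary method
        if (e >> i) & 1:
            X = monpro(X, T, n, r, n_prime)   # row 7
        T = monpro(T, T, n, r, n_prime)       # row 5
    return monpro(X, 1, n, r, n_prime)        # row 10: leave the residue

# Textbook key (n=3233, e=17); k=12 since 2^11 <= 3233 < 2^12
assert rsa_montgomery(42, 17, 3233, 12) == pow(42, 17, 3233)
```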

6.3. Hardware architecture

The RSA algorithm topology used consists of three main functions: the main loop, modinverse and monpro. In addition, a 2×w×s-bit division unit (mod) is needed on rows two and three of the top-level algorithm. The hardware architecture was organized so that the main loop was implemented as a state machine which can start and control the execution of the other blocks. When a block's start signal is activated, the block checks from the control bus what it should do and starts the execution. After the start, the block operates autonomously until it completes the task and indicates that it is ready. As can be seen from the top-level algorithm, the monpro block is called twice per main loop cycle. Those calls are mathematically independent, which makes it possible to execute them in parallel. That is why two monpro


blocks (monpro1 and monpro2) were instantiated in the design. By using two monpro blocks, the execution time is reduced on average by a factor of 1.5, while the amount of hardware needed is roughly doubled. The overall architecture of the implementation is shown in Figure 40.

Figure 40. The block diagram of hardware architecture of RSA design.

6.3.1. Montgomery product

The modular multiplication of the Montgomery algorithm is given [112] such that

monpro(a,b)
1. t = a ∗ b
2. m = t ∗ n' mod 2^1024
3. u = (t + m ∗ n) / 2^1024
4. if u >= n then
5.   return u - n
6. else
7.   return u

Often, hardware implementations of the Montgomery algorithm are derived from the above algorithm by implementing the whole algorithm with tiny radix 2^1–2^4 processing elements [113]. In this work the basic multiplication algorithm, which breaks a huge multiplication problem into smaller pieces, was used. The aim is to present a scalable RSA accelerator architecture suitable for implementation on reconfigurable high-radix architectures.
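A big-integer transcription of the monpro listing above, with a self-check against equation (6.4); the modulus and operand values are illustrative only.

```python
# Direct transcription of monpro(a, b) above with r = 2^1024.
R_BITS = 1024

def monpro(a, b, n, n_prime):
    r = 1 << R_BITS
    t = a * b                      # step 1
    m = (t * n_prime) % r          # step 2: only the low bits of t*n' are needed
    u = (t + m * n) >> R_BITS      # step 3: division by 2^1024 is a shift
    return u - n if u >= n else u  # steps 4-7: conditional subtraction

# Self-check with an odd 1024-bit modulus (value is illustrative only)
n = 2**1023 + 1187                 # odd n with 2^1023 <= n < 2^1024
r = 1 << R_BITS
n_prime = (-pow(n, -1, r)) % r     # n' from equation (6.6)
a, b = 123456789, 987654321
u = monpro(a, b, n, n_prime)
assert u == (a * b * pow(r, -1, n)) % n   # equation (6.4)
```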


The standard multiplication algorithm can be written such that

mult(a,b)
1. for i = 0 to s-1
2.   C = 0
3.   for j = 0 to s-1
4.     (C,S) = t_{i+j} + a_j ∗ b_i + C
5.     t_{i+j} = S
6.   t_{i+s} = C

The algorithm makes it possible to multiply two s×w-bit vectors together with a w-bit multiplier and a 2×w+1-bit adder. Row four of the mult algorithm can be implemented as the hardware block shown in figure 41.
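The mult listing above can be transcribed word by word; the pair (C, S) is the carry word and sum word of the row-four addition. The word width matches the w = 16 used in this work; helper names and test values are illustrative.

```python
# Word-serial version of mult(a, b): operands are s-word vectors
# (least significant word first), w bits per word.
W = 16                 # word width w (radix 2^16)
MASK = (1 << W) - 1

def mult(a, b):
    s = len(a)
    t = [0] * (2 * s)              # result vector, 2s words
    for i in range(s):             # row 1
        C = 0                      # row 2
        for j in range(s):         # row 3
            full = t[i + j] + a[j] * b[i] + C   # row 4: (C,S) split below
            C, S = full >> W, full & MASK
            t[i + j] = S           # row 5
        t[i + s] = C               # row 6
    return t

def to_words(x, s):
    return [(x >> (W * k)) & MASK for k in range(s)]

def from_words(t):
    return sum(v << (W * k) for k, v in enumerate(t))

a_int, b_int = 0xDEADBEEFCAFE, 0x123456789ABC
assert from_words(mult(to_words(a_int, 4), to_words(b_int, 4))) == a_int * b_int
```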

Figure 41. Hardware implementation of inner loop of standard multiplication algorithm.

The structure of figure 41 solves only one row of mult, but the whole algorithm can be implemented, in radix w, in parallel by connecting s basic blocks together in cascade so that variable i grows from left to right and every basic block increases its j-value sequentially. Unfortunately, all basic blocks cannot start the computation at the same time, because each needs the second result t of the previous block to compute its first t. The cascade of basic blocks is illustrated in figure 42. With the


presented extremely scalable architecture the s×w times s×w bit multiplication can be computed in 2×s+s clock cycles.

Figure 42. The cascade of basic blocks.

An s×w times s×w bit multiplication solves only the first row of the Montgomery multiplication algorithm. Rows two to three can be written such that

1.  for i = 0 to s-1
2.    C = 0
3.    m = t_i ∗ n0' mod 2^w
4.    for j = 0 to s-1
5.      (C,S) = t_{i+j} + m ∗ n_j + C
6.      t_{i+j} = S
7.    for j = i+s to 2s-1
8.      (C,S) = t_j + C
9.      t_j = S
10.   t_{2s} = C

Fortunately, these rows can also be implemented with the hardware illustrated in figure 42 by modifying the basic block a little. Figure 43 shows the modified basic block, which makes it possible to compute rows one to three of the Montgomery product. With the cascade presented in figure 42 and the modified basic block of figure 43, rows two to three of the Montgomery multiplication can be computed in 3×s+2×s steps. The last rows of the Montgomery product can be solved by a subtractor. The result of row three builds up at the output of the last basic block during the last 2×s steps. n can be subtracted sequentially from u, and the subtraction is


ready one clock cycle after u. If the subtraction result is negative, the final result is the upper part of u; otherwise the final result is the subtraction result. The implementation can be done with one 16-bit subtractor.

Figure 43. Modified basic block.

Because the first cell of the result t of the multiplication a×b is ready after the first 2×s steps, and the execution of the rest of the algorithm starts immediately after that, the computation of the Montgomery modular multiplication requires 2×s+3×s+2×s+1 steps in total.
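The word-serial reduction of rows two to three above can be sketched as below; the final check confirms that, after the conditional subtraction, the result equals a ∗ b ∗ r^-1 mod n. All helper names and test values are illustrative (small 64-bit operands stand in for the 1024-bit case).

```python
# Word-serial Montgomery reduction (rows two to three above), executed
# on w-bit words as the modified basic blocks do.
W = 16
MASK = (1 << W) - 1

def mont_reduce(t_words, n_words, n0_prime):
    s = len(n_words)
    t = t_words + [0]                       # room for t_{2s}
    for i in range(s):
        C = 0
        m = (t[i] * n0_prime) & MASK        # row 3: m = t_i * n0' mod 2^w
        for j in range(s):                  # rows 4-6
            full = t[i + j] + m * n_words[j] + C
            C, S = full >> W, full & MASK
            t[i + j] = S
        for j in range(i + s, 2 * s):       # rows 7-9: carry propagation
            full = t[j] + C
            C, S = full >> W, full & MASK
            t[j] = S
        t[2 * s] += C                       # row 10
    return t[s:]                            # u = t / 2^(w*s): upper words

def words(x, k):
    return [(x >> (W * j)) & MASK for j in range(k)]

s = 4                                       # 64-bit operands in 16-bit words
n = 2**63 + 29                              # odd modulus, 2^(ws-1) <= n < 2^(ws)
r = 1 << (W * s)
n0p = (-pow(n, -1, 1 << W)) % (1 << W)      # n0' from equation (6.7)
a, b = 0x1234567890AB, 0x0FEDCBA98765
u_words = mont_reduce(words(a * b, 2 * s), words(n, s), n0p)
u = sum(v << (W * j) for j, v in enumerate(u_words))
u = u - n if u >= n else u
assert u == (a * b * pow(r, -1, n)) % n
```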

6.3.2. Modular inverse

The modular inverse, when only the first word of the result is needed, can be computed very efficiently. The algorithm, which realizes equation (6.7), is as follows:

modinverse(x)
1. y_1 = 1
2. for i = 2 to w
3.   if 2^(i-1) < x ∗ y_{i-1} (mod 2^i) then
4.     y_i = y_{i-1} + 2^(i-1)
5.   else
6.     y_i = y_{i-1}
7. return y_w
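The modinverse listing can be transcribed directly; the algorithm builds the inverse of an odd x modulo 2^w one bit per iteration, and the self-check confirms x ∗ y ≡ 1 (mod 2^w). Test values are illustrative.

```python
# Bitwise transcription of modinverse(x) above.
def modinverse(x, w):
    y = 1                                        # y_1 = 1: x*y = 1 (mod 2)
    for i in range(2, w + 1):                    # rows 2-6
        if (x * y) % (1 << i) > (1 << (i - 1)):  # row 3
            y += 1 << (i - 1)                    # row 4: set bit i-1
    return y                                     # y_w: x*y = 1 (mod 2^w)

w = 16
x = 0xABCD                                       # any odd w-bit value
y = modinverse(x, w)
assert (x * y) % (1 << w) == 1
# n0' of equation (6.7) is then the negation: (-y) % (1 << w)
```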


When the operands are transferred to the n-residue, the remainder of a 2×s×w-bit vector has to be computed with an s×w-bit modulus. Because of clause (6.3), row three of the RSA algorithm does not require division but can be computed by the subtraction 2^(s×w) - n.

6.4. Implementation in 0.35µm ASIC technology

To get a baseline, the presented RSA architecture was first implemented as a Matlab model to verify its functionality, and then as VHDL code targeted for ASIC technology to get silicon area and timing estimates. In the VHDL model, the whole architecture of figure 40 was coded as synthesizable code. The radix was selected to be w=16. Arithmetic blocks as well as memory blocks were implemented with ready-made design library components to achieve maximum performance. The timing and area figures on silicon were obtained by synthesizing the design with gate-level synthesis tools for a 0.35 µm ASIC technology.

6.5. Implementation on Xilinx Virtex II platform FPGA chip

Xilinx announced its Virtex-II Platform FPGA series in 2001. The difference between traditional FPGAs and these platform FPGA chips is that not only reconfigurable logic resources but also reconfigurable functional units are provided to the designer. An architecture overview of Virtex-II is shown in figure 44.

Figure 44. Architecture overview of Virtex-II platform FPGA [15].


The basic resources in Virtex-II are the Configurable Logic Blocks (CLBs), as in an ordinary FPGA chip. The functional units offered are embedded 18×18 signed multipliers, 18 kbit SelectRAM blocks and SelectI/O resources. All functional units are embedded on the FPGA chip, providing very fast parts to the designer without consuming any CLB resources. For example, an 18×18-bit multiplication can be done in less than 10 ns purely combinationally, up to 144 18-kbit dual-port RAM blocks can be instantiated in a design without using any CLBs, and the chip can easily be connected to various I/O standards (voltage, speed, impedance, etc.) with SelectI/O blocks. The RSA accelerator topology presented in this chapter is well suited for implementation on platform FPGAs. The multipliers inside the basic and extended blocks can be mapped to the platform FPGA's hard-core 18×18-bit multipliers. Because the multipliers are the most area-consuming part of the accelerator, a large number of basic blocks can be implemented without significantly consuming the chip's CLB resources, by using e.g. the value 16 for w. Furthermore, as hard-coded logic the multipliers are extremely fast compared to arithmetic blocks constructed from CLBs. Similarly, the platform chip's 18-kbit hard-core RAM blocks were used to store operands and partial results. By using the embedded RAM blocks, no CLB resources are needed for memory structures. The platform FPGA implementation was done using the ASIC implementation's VHDL code as a starting point. Instead of design library components, the multipliers and memories were mapped to the platform FPGA chip's hard-core structures. Table 12 summarizes the results of the FPGA implementation. The delay value (8.5 ms) for the whole RSA is the minimum cycle time required for the worst case. The results are compared to the best FPGA-based implementation found in the literature [114].


Table 12. Results of FPGA implementation.

Block          CLBs    Delay    Clock cycles    Real time
Basic block    201     11.2 ns  1 (pipelined)   11.2 ns
Monpro         13,099  14.9 ns  449             6690 ns
Mod            346     11.0 ns  32,768          0.36 ms
Modinverse     78      9.9 ns   17              168 ns
State machine  82      8.8 ns   459,776         4 ms
RSA total      26,738  18.2 ns  459,776         8.5 ms
Ref. [114]     6,633   21.9 ns  545,662         11.95 ms

6.6. Implementation on RAA

The very same architecture was also mapped to the RAA. However, on RAA the architecture was partitioned into software and configware parts. In practice the RAA is always connected to an external controller processor, and it is meant to be used to implement computationally intensive inner loops, not controlling structures. On the other hand, despite the several optimization techniques presented in previous chapters, user data uploads always introduce some overhead, so only operations that use the same data more than once can feasibly be implemented on RAA. Thus only the monpro algorithm was mapped to RAA; the other parts of the architecture in figure 40 were coded as controller processor code. In practice this partitioning means that the cascade of modified basic blocks is executed on RAA. The modified basic blocks were mapped onto RAA so that each RAA node executes the functionality of a single block. The s nodes were used as a cascade such that the signals t and j in figure 42 are transferred via FIFOs. Because FIFOs are used instead of wires, the controlling lines of figure 42 were not needed at all. The i-values are node-specific, and thus they were transferred to the nodes via the global data bus. Because the computation of adjoining blocks starts with a delay of one step, the controller has enough time to write the i-values without stalling the execution on RAA.


The cascade of modified basic blocks was mapped to the RAA as a snake path. An architecture-level view of the data communication of an RSA implementation on an 8×8-node RAA is depicted in figure 45.

Figure 45. Communication topology of 1024-bit RSA on 64 nodes RAA.

The RAA's 31-word program memory turned out to be well suited for the code size needed. The assembler code of a basic node is given in figure 46. The same code is uploaded to every node; only the input FIFO names are changed according to the placement of the node in figure 45.

Figure 46. The 31-row code needed to implement the presented scalable RSA architecture on RAA.


6.6.1. Results with single context RAA

The code of figure 46 was written in a text editor, and 16, 32 and 64 blocks, matching key lengths of 256, 512 and 1024 bits, were mapped together in the RAA GUI assembler. The code sizes of the 16, 32 and 64-node RAA configurations after the assembler stage were 79, 97 and 124 words, respectively. The binary of the monpro algorithm and the RTL model of RAA were loaded into a VHDL simulator and the monpro algorithm was verified. The execution times were obtained from the simulations; the area figures are from RAA synthesis results for a 0.13 µm technology. The execution times are given in figure 47, where the execution time refers to the execution of one monpro algorithm. In addition, the silicon area as a function of the nodes needed to implement the 256, 512 and 1024-bit RSAs is given in figure 48.

Figure 47. Execution time in clock cycles vs. key length used.

Figure 48. The silicon area of single context RAA needed on execution vs. key length used.

[Figure 47 axes: key length (256, 512, 1024 bits) vs. instruction cycles (axis range 400–2200). Figure 48 axes: key length of RSA (256, 512, 1024 bits) vs. area in square millimeters (1.9, 3.9, 7.9).]


6.6.2. Results with multi context RAA

The utilization of the computational elements in the presented RSA architecture is poor. After the first basic block starts its computations, it takes, e.g. in the 1024-bit configuration, 128 steps before the last basic block starts. Correspondingly, the first block stops 128 steps before the last one. That is why the virtualization of RAA works well with this application: the virtualization mechanism eagerly schedules processing time to the active processes, and thus increases utilization. The simulations with the multicontext RAA and RSA were done with 16 and 32 basic blocks, i.e. monpro configurations capable of accelerating 256-bit and 512-bit RSA, respectively. Both configurations were mapped onto RAA with 1, 2 and 4 contexts. The corresponding execution times are given in figures 49 and 50.

Figure 49. The execution time of 256-bit monpro-algorithm as a function of contexts used.

The execution times given above were achieved with RAA architectures in which the size of the data memory was 16 words, the size of the instruction memory was 32 words and FIFOs were 10 words long. The swapping threshold of deadlock-free FIFOs was set to 10 clock cycles.


Figure 50. The execution time of 512-bit monpro-algorithm as a function of contexts used.

Figure 51 below illustrates the tradeoff between area and execution time when virtualization is used. The figure gives the execution times of the 512-bit monpro algorithm implemented on architectures with 1, 2 and 4 contexts, that is, on architectures with 32, 16 and 8 real processor cores, respectively. As can be seen from figure 51, virtualization works very well with a small number of contexts, but an exponentially growing penalty has to be accepted if a large number of contexts is used.

Figure 51. Area vs. execution time for 512-bit RSA implemented on RAA with 8 nodes and 4 contexts (left point), 16 nodes and 2 contexts (middle point) and 32 nodes and 1 context (right point).


6.6.3. Comparison

Before the author's publication [P4], the fastest published 1024-bit FPGA-based RSA implementation was the one reported in [114]. That publication actually implements only the functionality realized here by the monpro blocks and parts of the main loop. Additionally, [114] requires precomputations, and thus a straightforward comparison is not fair. However, even if it is assumed that [114] needed no precomputations, the platform FPGA implementation presented in this work is faster, as can be seen from table 13. On the other hand, [114] needs much less CLB resources than the FPGA implementation shown here.

Table 13. The result figures of different RSA implementations.

Implementation   Execution time  Silicon area  Lines of code  Technology  Frequency
ASIC             17.2 ms         41.1 mm²      1600           0.35 µm     30 MHz
FPGA             8.5 ms          200 mm²       1400           0.13 µm     30 MHz
RAA              10 ms           7.9 mm²       31             0.13 µm     200 MHz
Ref. FPGA [114]  11.95 ms        -             -              -           47 MHz

6.7. Summary

A scalable RSA architecture for high-radix reconfigurable platforms was given. The architecture was implemented on ASIC, platform FPGA and RAA. On RAA, the design was also simulated with different context memory and virtualization mechanism configurations. The given architecture and its FPGA implementation constituted the fastest FPGA-based 1024-bit RSA when published. The very same architecture was then implemented on the coarse grain reconfigurable RAA architecture. Perhaps the most important observation is that the 1400-line VHDL FPGA implementation could be realized with 31 lines of assembler on RAA. Raising the abstraction level of the algorithm implementation did not affect the execution time dramatically. On the contrary, the ratio between area used and execution time is far better in the RAA implementation than in the platform FPGA implementation, as can be seen from table 13. The author has not performed power simulations, but it is more than probable that the power consumption of the design on RAA is at least an order of magnitude less than in the


FPGA implementation. The estimate is based on the fact that FPGA fabrics have, in general, roughly 10× power consumption overhead compared to standard cell implementations [40]. The execution time figures given for the RSA implementation on differently configured virtual-size RAA arrays show the feasibility of the proposed virtualization mechanism for architectures with small virtual-to-real node ratios, i.e. ~2:1 or less.


7. Conclusions

This thesis began with a state-of-the-art survey of the reconfigurable IP genre, where the prevailing architectures were presented. The survey classifies the implementations according to their source and topology.

The first results presented in this thesis are based on one of the first implemented MIMD coarse grain reconfigurable IPs, RAA. From the results it can be concluded that the MIMD topology is feasible for use as a reconfigurable IP. Feasibility was first studied from the point of view of silicon area and clock period vs. computational potential. From the synthesis results it can be concluded that an extremely scalable MIMD can be implemented so that the largest portion of the area comes from the arithmetic and memory blocks; the overhead of the control and network structures is only a few percent.

In Chapter Four, the first coarse grain reconfigurable IP capable of virtualizing its dimensions was given. From the results of the chapter it can be concluded that dimension virtualization of a context-switch-capable MIMD coarse grain reconfigurable IP is achieved practically for free by using node-based hardware schedulers.

Next, the Lin-Kernighan heuristics developed for the TSP were modified to solve the NP-hard placement problem of RAA. From the results of Chapter Five it can be concluded that the placement of RAA can be automated. The chapter also shows that fast heuristics can be used to automatically exploit the novel group addressing modes of RAA. The achieved reduction in configuration data size is remarkable, over 80 percent in the examples shown.

Chapter Six provides a comparison between three architectures, i.e. ASIC, platform FPGA and MIMD coarse grain reconfigurable IP, by implementing the RSA encryption algorithm on all three and analyzing the results. The cases show that the implementation effort differs considerably depending on the architecture used.
The amount of code needed in MIMD is only a fraction of the code needed to implement the same architecture on the ASIC and platform FPGA, while the ratio between area used and execution time is far better. The chapter also


gives execution time figures of RSA encryption on RAA when the virtualization mechanism is used. The results show that even for an array with two contexts per node, the execution time is a linear function of the number of real processors; that is to say, the execution time overhead of the virtualization system is in that case zero.

7.1. Discussion

Basically, reconfigurable IP blocks are one way to implement algorithms, but other processing architectures can be used as well. A few of the most fundamental implementation alternatives are presented in figure 52.

Figure 52. The most fundamental algorithm processing architectures, arranged according to their potential as a programming model and their theoretical maximum computational capacity.

When an accelerator is selected, a tradeoff between theoretical computational potential, compilability, area, configurability, power consumption, etc. always exists, although companies advertise the superiority of their own solutions. However, (1) the exponentially improving silicon processes make it acceptable to pay the penalty of increased area if the abstraction level of the presentation of functionality can be raised higher1 and more flexibility, e.g. reconfigurability, is achieved. On the other hand, (2) the complexity of the algorithms needed is also increasing exponentially, according to Shannon's law. The final result from (1)

1 Just like in the software industry: a remarkable part of the exponential increase in the computational power of processors is sacrificed to make it possible to raise the abstraction level of programming.

[Figure 52 axes: probability to find an efficient HLL compiler vs. theoretical maximum computational capacity; the architectures plotted are RISC, DSP, DSP with parallel extensions, VLIW, TTA, coarse grain accelerator, FPGA and ASIC.]


and (2) is that in state-of-the-art products the accelerators in the middle of figure 52 are more likely to be used in the future. We are living in very interesting times. LUT-based reconfigurable structures have already proved their feasibility as commercial products in some (albeit limited) applications, and for that reason it is likely that they can also fit into some segment of on-chip structures. However, coarse grain systems never made a commercial breakthrough at the circuit board level; will they do so on-chip? Will they make a breakthrough even though an efficient high-level language (HLL) compiler has not yet been found [P6]? Or is their use mandatory because of the exponentially increasing algorithm complexity, as argued in [115,116]? The author believes that coarse grain reconfigurable architectures will have their niche as a part of complex System-on-Chip implementations. Moreover, the presented MIMD-based RAA seems to answer most of the questions given above. Its architecture is very intuitive to a software engineer, and thus accelerators can be coded without a compiler. On the other hand, the parallel computational power of RAA is enormous, making it possible to implement computationally complex algorithms. In this thesis, systolic-style accelerator implementations of RSA and GPS correlation were given for RAA, but there is still a lot of work to be done to prove RAA's suitability to accelerate e.g. video codecs. However, on RAA the operations can be done in parallel for each pixel or block of pixels in a SIMD fashion, which makes it likely that RAA is suitable for such applications as well.
The scalability, abstracted dimensions, reconfigurability, parallelism and programmability of the given reconfigurable soft IP RAA are advantages which give the author reason to believe that RAA-like structures will, in the near future, be used both to accelerate the design of the SoCs of commercial mobile terminals and to accelerate algorithms on those SoCs.


8. References

[1] G. E. Moore, “Cramming More Components Onto Integrated Circuits”, Electronics, 1965.
[2] G. E. Moore, “No Exponential is Forever... but We Can Delay 'Forever'”, presentation at International Solid State Circuits Conference, 2003.
[3] L. Harrison, “Moore's Law Meets Shannon's Law: The Evolution of the Communications Industry”, in Proc. International Conference on Computer Design, 2001.
[4] H. Qi, Z. Jiang and J. Wei, “IP reusable design methodology”, in Proc. IEEE International Conference on ASIC, 2001, pp. 756-759.
[5] L. Benini, G. De Micheli, “Networks on Chip: A New Paradigm for Systems on Chip Design”, in Proc. Design, Automation and Test in Europe, 2002.
[6] P. J. Bricaud, “IP reuse creation for system-on-a-chip design”, in Proc. IEEE Custom Integrated Circuits Conference, 1999, pp. 395-401.
[7] D. D. Gajski, A. C.-H. Wu, V. Chaiyakul, S. Mori, T. Nukiyama and P. Bricaud, “Essential issues for IP reuse”, in Proc. ASP Design Automation Conference, 2000, pp. 37-42.
[8] T. Zhang, L. Benini and G. De Micheli, “Component selection and matching for IP-based design”, in Proc. Design, Automation and Test in Europe Conference and Exhibition, 2001, pp. 40-46.
[9] R. A. Bergamaschi and W. R. Lee, “Designing systems-on-chip using cores”, in Proc. Design Automation Conference, 2000, pp. 420-425.
[10] B. Salefski and L. Caglar, “Re-configurable computing in wireless”, in Proc. Design Automation Conference, 2001, pp. 178-183.
[11] “Embedded FPGA Market to Evade Effects of Current Economic Downturn”, [online], http://www.instat.com/newmk.asp?ID=502
[12] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, B. Hutchings, “A Reconfigurable Arithmetic Array for Multimedia Applications”, in Proc. International Symposium on Field Programmable Gate Arrays, 1999, pp. 135-143.
[13] “Systolix technical introduction”, [online], http://www.systolix.co.uk/techintro.htm
[14] “Philips spins out reconfigurable computing cores”, [online], http://www.eetimes.com/story/OEG20030317S0014
[15] R. A. Hartenstein, “A Decade of Reconfigurable Computing: a Visionary Retrospective”, in Proc. Design, Automation and Test in Europe, 2001, pp. 642-649.
[16] R. Hartenstein, “Reconfigurable computing: a new business model - and its impact on SoC design”, in Proc. Euromicro Symposium on Digital Systems Design, 2001, pp. 103-110.
[17] R. Hartenstein, “Trends in Reconfigurable Logic and Reconfigurable Computing”, in Proc. Electronics, Circuits and Systems, 2002, pp. 15-18.
[18] J. Becker, “Configurable Systems-on-Chip (CSoC)”, in Proc. Symposium on Integrated Circuits and Systems Design, 2002, pp. 379-384.
[19] P. Schaumont, I. Verbauwhede, K. Keutzer, M. Sarrafzadeh, “A Quick Safari Through the Reconfiguration Jungle”, in Proc. Design Automation Conference, 2001, pp. 172-177.
[20] J. Greenbaum, “Reconfigurable Logic in SoC Systems”, in Proc. Custom Integrated Circuits Conference, 2002, pp. 5-8.
[21] S. J. E. Wilton, “Programmable Logic IP Cores in SoC Design: Opportunities and Challenges”, in Proc. Custom Integrated Circuits Conference, 2001, pp. 63-66.
[22] “Actel launches VariCore programmable core”, [online], http://www.eetimes.com/story/OEG20010215S0065
[23] T. Vaida, “Reprogrammable Processing Capabilities of Embedded FPGA Blocks”, in Proc. International ASIC/SOC Conference, 2001, pp. 180-184.
[24] “Hybrid architecture embeds Xilinx FPGA core into IBM ASICs”, [online], http://www.eetimes.com/semi/news/OEG20020624S0016
[25] “Virtex II Handbook”, [online], http://www.xilinx.com/products/virtex/handbook/index.htm
[26] [online], http://www3.ibm.com/chips/products/asics/products/cores/efpga.html
[27] M. Keating, P. Bricaud, “Reuse Methodology Manual”, Kluwer Academic Publishers, 2002.
[28] “Embedded Programmable IP”, [online], http://www.ictpld.com/coresunit/epip.htm
[29] “Technology Brief”, [online], http://www.leopardlogic.com/technology.html#3
[30] “VariCore™ Embedded Programmable Gate Array Core (EPGA™) 0.18µ Family”, [online], http://www.actel.com/varicore/products/index.html
[31] “The FlexEOS product”, [online], http://www.m2000.fr/products4.htm
[32] “eASICore Overview (0.13µm Process)”, [online], http://easic.com/products/easicore013.html
[33] “VariCore documentation”, [online], http://www.actel.com/varicore/support/docs/VariCoreEPGADS.pdf
[34] V. George, H. Zhang, J. Rabaey, “The Design of a Low Energy FPGA”, in Proc. International Symposium on Low Power Electronics and Design, 1999, pp. 188-193.
[35] Zhang et al., “A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications”, in Proc. International Solid-State Circuits Conference, 2000, pp. 68-69.
[36] Xilinx, [online], www.xilinx.com
[37] S. Knapp, D. Tavana, “Field configurable system-on-chip device architecture”, in Proc. Custom Integrated Circuits Conference, 2000, pp. 155-158.
[38] “Altera Excalibur Device Overview”, [online], http://www.altera.com/literature/ds/ds_arm.pdf
[39] N. Kafafi, K. Bozman, S. J. E. Wilton, “Architectures and Algorithms for Synthesizable Embedded Programmable Logic Cores”, in Proc. International Symposium on Field Programmable Gate Arrays, 2003.
[40] R. Hartenstein, “Trends in Reconfigurable Logic and Reconfigurable Computing”, in Proc. International Conference on Electronics, Circuits and Systems, 2002, pp. 801-808.
[41] K. Leijten-Nowak, A. Katoch, “Architecture and Implementation of an Embedded Reconfigurable Logic Core in CMOS 0.13 um”, in Proc. International ASIC/SOC Conference, 2002, pp. 3-7.
[42] K. Leijten-Nowak, J. L. van Meerbergen, “Embedded Reconfigurable Logic Core for DSP Applications”, in Proc. FPL 2002, LNCS 2438, 2002, pp. 89-101.
[43] “Reconfigurable Algorithm Processing (RAP) technology”, [online], http://www.elixent.com/products/technologies.htm
[44] “XPP Intellectual Property cores”, [online], www.pactcorp.com
[45] “The XPP white paper”, [online], http://www.pactcorp.com/xneu/download/xpp_white_paper.pdf
[46] V. Baumgarte et al., “PACT XPP - A self-reconfigurable data processing architecture”, in Proc. Engineering of Reconfigurable Systems and Algorithms, 2001.
[47] J. M. P. Cardoso, M. Weinhardt, “Fast and guaranteed C compilation onto the PACT-XPP reconfigurable computing platform”, in Proc. Field-Programmable Custom Computing Machines, 2002, pp. 291-292.
[48] M. Motomura, “A Dynamically Reconfigurable Processor Architecture”, Microprocessor Forum, 2002.
[49] T. Kitaoka, H. Amano, K. Anjo, “Reducing the Configuration Loading Time of a Coarse Grain Multicontext Reconfigurable Device”, in Proc. Field Programmable Logic, 2003, pp. 171-180.
[50] A. Alsolaim, J. Becker, M. Glesner, J. Starzyk, “Architecture and Application of a Dynamically Reconfigurable Hardware Array for Future Mobile Communication Systems”, in Proc. Symposium on Field-Programmable Custom Computing Machines, 2000, pp. 205-214.
[51] S. Khawam, T. Arslan, F. Westall, “Embedded Reconfigurable Array Targeting Motion Estimation Applications”, in Proc. International Symposium on Circuits and Systems, 2003, pp. 760-763.
[52] B. I. Hounsell, T. Arslan, “An Embedded Programmable Core for the Implementation of High Performance Digital Filters”, in Proc. International ASIC/SOC Conference, 2001, pp. 169-174.
[53] “Adaptive Computing Machine”, [online], http://www.qstech.com/pdfs/5_7_WP_dataflow_in_ACM.pdf
[54] P. Master, “The next big leap in reconfigurable systems”, in Proc. Field-Programmable Technology, 2002, pp. 17-22.
[55] B. Plunkett, D. Chou, “Computational efficiency: adaptive computing vs. ASICs”, in Proc. Electronics, Circuits and Systems, 2002, pp. 819-822.
[56] “Architecture of Synputer”, [online], http://www.synputer.com/technology/architechture.html
[57] N. Streltsov, J. Sparsø, S. Bokov, S. Kleberg, “The Synputer - A Novel MIMD Processor Targeting High Performance Low Power DSP Applications”, in Proc. International Signal Processing Conference, CD-ROM, 2003.
[58] “Silicon Hive”, [online], http://www.siliconhive.com
[59] J. Leijten, G. Burns, J. Huisken, E. Waterlander, A. van Wel, “AVISPA: a massively parallel reconfigurable accelerator”, in Proc. International Symposium on System-on-Chip, 2003, pp. 165-168.
[60] “Morphotec”, [online], http://www.morphotech.com/
[61] D. Soudris, K. Masselos, S. Blionas, S. Siskos, S. Nikolaidis, K. Tatas, “AMDREL: Designing Embedded Reconfigurable Hardware Structures for Future Reconfigurable Systems-on-Chip for Wireless Communication Applications”, in Proc. Workshop on Heterogeneous Reconfigurable Systems on Chip, Chances, Application, Trends, 2002.
[62] M. Annaratone, C. Pommerell, R. Ruhl, “Interprocessor Communication Speed and Performance in Distributed-memory Parallel Processors”, in Proc. Symposium on Computer Architecture, 1989, pp. 315-324.
[63] K. E. Batcher, “MPP - A Massively Parallel Processor”, in Proc. International Conference on Parallel Processing, 1979, p. 249.
[64] T. Blank, “The MasPar MP-1 Architecture”, in Proc. IEEE Computer Society International Conference, 1990, pp. 20-24.
[65] G. Demos, “Issues in Applying Massively Parallel Computing Power”, The International Journal of Supercomputer Applications, no. 4, 1990, pp. 90-105.
[66] E. Mirsky and A. DeHon, “MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources”, in Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1996, pp. 157-166.
[67] J. R. Hauser and J. Wawrzynek, “Garp: a MIPS processor with a reconfigurable coprocessor”, in Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1997, pp. 12-21.
[68] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe and A. Agarwal, “Baring it all to software: Raw machines”, IEEE Computer, vol. 30, no. 9, 1997, pp. 86-93.
[69] M. J. S. Smith, “Application-Specific Integrated Circuits”, Addison-Wesley, 1997.
[70] V. Lahtinen, “Design and Analysis of Interconnection Architectures for On-Chip Digital Systems”, PhD Thesis, Tampere University of Technology, publication 478, 2002.
[71] “OCP-IP Specification 2.0”, [online], www.ocpip.org/spec_download
[72] A. B. Kahng, G. Smith, “A New Design Cost Model for the 2001 ITRS”, in Proc. International Symposium on Quality Electronic Design, 2002, pp. 190-193.
[73] “2004 unprobed wafer costs”, [online], www.icknowledge.com/economics/WaferCosts2004.htm
[74] A. Towner, D. Panesar, G. Gray, A. Robbins, W. Duller, “picoArray technology: the tool's story”, in Proc. Design, Automation and Test in Europe, 2005, pp. 106-111.
[75] “ARM1020E”, [online], http://www.arm.com/armtech/ARM1020E_1022E?OpenDocument
[76] “InCyte”, [online], www.chipestimate.com
[77] P. R. Nuth, W. J. Dally, “A Mechanism for Efficient Context Switching”, in Proc. Computer Design, 1991, pp. 301-304.
[78] S. Trimberger, D. Carberry, A. Johnson, J. Wong, “A Time-multiplexed FPGA”, in Proc. Symposium on FPGAs for Custom Computing Machines, 1997, pp. 22-28.
[79] T. Kitaoka, H. Amano, K. Anjo, “Reducing the Configuration Loading Time of a Coarse Grain Multicontext Reconfigurable Device”, in Proc. FPL, 2003, pp. 171-180.
[80] T. Fujii, K. Furuta, M. Motomura, M. Nomura, M. Mizuno, K. Anjo, K. Wakabayashi, Y. Hirota, Y. Nakazawa, H. Ito, M. Yamashina, “A Dynamically Reconfigurable Logic Engine with a Multi-Context/Multi-Mode Unified-Cell Architecture”, in Proc. Solid-State Circuits Conference, 1999, pp. 364-365.
[81] N. Kaneko, H. Amano, “A General Hardware Design Model for Multicontext FPGAs”, in Proc. FPL, 2002, pp. 1037-1047.
[82] H. Amano, A. Jouraku, K. Anjo, “A Dynamically Adaptive Switching Fabric on a Multicontext Reconfigurable Device”, in Proc. FPL, 2003, pp. 161-170.
[83] J. Resano, D. Mozos, D. Verkest, S. Vernalde, F. Catthoor, “Run-Time Minimization of Reconfiguration Overhead in Dynamically Reconfigurable Systems”, in Proc. FPL, 2003, pp. 585-594.
[84] R. Maestre, F. J. Kurdahi, M. Fernandez, R. Hermida, N. Bagherzadeh, H. Singh, “A Framework for Reconfigurable Computing: Task Scheduling and Context Management”, IEEE Transactions on Very Large Scale Integration Systems, vol. 9, no. 6, 2001, pp. 858-873.
[85] S. Hauck, “Configuration Prefetch for Single Context Reconfigurable Coprocessors”, in Proc. International Symposium on Field Programmable Gate Arrays, 1998, pp. 65-74.
[86] C. Plessl, M. Platzner, “Virtualizing Hardware with Multi-context Reconfigurable Arrays”, in Proc. FPL, 2003, pp. 151-160.
[87] A. S. Tanenbaum, A. S. Woodhull, “Operating Systems: Design and Implementation”, Prentice Hall, 1997.
[88] J. A. Stankovic, M. Spuri, M. Natale, and G. C. Buttazzo, “Implications of classical scheduling results for real-time systems”, IEEE Computer, vol. 28, no. 6, 1995, pp. 16-25.
[89] T. L. Casavant, J. G. Kuhl, “A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems”, IEEE Transactions on Software Engineering, vol. 14, no. 2, 1988, pp. 141-154.
[90] M. J. Bach, “The Design of the UNIX Operating System”, Prentice Hall, 1986.
[91] R. Hartenstein, M. Herz, T. Hoffmann, U. Nageldinger, “KressArray Xplorer: a new CAD environment to optimize reconfigurable datapath array architectures”, in Proc. Design Automation Conference, 2000, pp. 163-168.
[92] M. Padberg and G. Rinaldi, “A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems”, SIAM Review, 33, 1991, pp. 60-100.
[93] M. Grötschel, O. Holland, “Solution of large scale symmetric traveling salesman problems”, Math. Programming, 51, 1991, pp. 141-202.
[94] M. Garey, D. Johnson, L. Stockmeyer, “Some simplified NP-complete graph problems”, Theoretical Computer Science, 1, 1976, pp. 237-267.
[95] T. Bui, C. Heigham, C. Jones, T. Leighton, “Improving the performance of the Kernighan-Lin and simulated annealing graph bisection algorithms”, in Proc. Design Automation Conference, 1989, pp. 775-778.
[96] B. W. Kernighan, S. Lin, “An efficient heuristic procedure for partitioning graphs”, The Bell System Technical Journal, 49(1), 1970, pp. 291-307.
[97] I. I. Melamed, S. I. Sergeev, I. Kh. Sigal, “The traveling salesman problem. Approximate algorithms”, Avtomat. Telemekh., 11, 1989, pp. 3-26.
[98] S. Lin, B. W. Kernighan, “An Effective Heuristic Algorithm for the Traveling-Salesman Problem”, Oper. Res., 21, 1973, pp. 498-516.
[99] E. W. Dijkstra, “A note on two problems in connexion with graphs”, Numerische Mathematik, 1, 1959, pp. 269-271.
[100] E. D. Kaplan, “Understanding GPS: Principles and Applications”, Artech House, Boston, 1996.
[101] M. S. Braasch, A. J. Van Dierendonck, “GPS receiver architectures and measurements”, Proceedings of the IEEE, vol. 87, no. 1, 1999, pp. 48-64.
[102] B. M. Ledvina, M. L. Psiaki, D. J. Sheinfeld, A. P. Cerruti, S. P. Powell, P. M. Kintner, “A Real-Time GPS Civilian L1/L2 Software Receiver”, in Proc. Institute of Navigation GNSS, 2004.
[103] “RSA”, [online], http://www.rsasecurity.com/rsalabs/faq/3-1-2.html, January 2003.
[104] B. Lee and L. John, “Implications of programmable general purpose processors for compression/encryption applications”, in Proc. Application-Specific Systems, Architectures, and Processors, 2002, pp. 233-242.
[105] D. J. Bernstein, “Circuits for Integer Factorization: a Proposal”, [online], http://cr.yp.to/papers/nfscircuit.pdf, 2001.
[106] R. L. Rivest, A. Shamir, L. Adleman, “A Method for Obtaining Digital Signatures and Public-Key Cryptosystems”, Communications of the ACM, vol. 21, no. 2, 1978, pp. 120-126.
[107] R. D. Silverman, “Fast Generation of Random Strong RSA Primes”, CryptoBytes, 1997, pp. 9-13.
[108] P. L. Montgomery, “Modular Multiplication Without Trial Division”, Mathematics of Computation, vol. 44, no. 170, 1985, pp. 519-521.
[109] S. R. Dusse, B. S. Kaliski, “A Cryptographic Library for the Motorola DSP56000”, Lecture Notes in Computer Science, vol. 473, 1990, pp. 230-244.
[110] D. E. Knuth, “The Art of Computer Programming: Seminumerical Algorithms”, Volume 2, Addison-Wesley, 2000.
[111] C. K. Koc, “High-Speed RSA Implementation”, RSA Laboratories, 1994.
[112] C. K. Koc, T. Acar, B. S. Kaliski, “Analyzing and Comparing Montgomery Multiplication Algorithms”, IEEE Micro, vol. 16, no. 3, 1996, pp. 26-33.
[113] A. Daly, W. Marnane, “Efficient Architectures for Implementing Montgomery Modular Multiplication and RSA Modular Exponentiation on Reconfigurable Logic”, in Proc. ACM International Symposium on Field-Programmable Gate Arrays, 2002, pp. 40-49.
[114] T. Blum, C. Paar, “High Radix Montgomery Modular Exponentiation on Reconfigurable Hardware”, IEEE Transactions on Computers, vol. 50, no. 7, 2001, pp. 759-764.
[115] “Keynote”, [online], http://xputers.informatik.unikl.de/staff/hartenstein/lot/CASES2002Hartenstein.ppt
[116] J. Helmschmidt, E. Schuler, P. Rao, S. Rossi, S. di Matteo, R. Bonitz, “Reconfigurable Signal Processing in Wireless Terminals [Mobile Applications]”, in Proc. Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 244-249.

