Processing Architectures for Smart Pixel Systems

D. Scott Wills, James M. Baker Jr., Huy H. Cat, Sek Chai, José Cruz-Rivera, John Eble, Antonio Gentile, Michael Hopper, W. Stephen Lacy, Abelardo López-Lagunas, Phil May, and Tarek Taha

School of Electrical and Computer Engineering
Packaging Research Center
Georgia Institute of Technology
Atlanta, Georgia 30332-0250
Abstract
Smart pixel architectures offer important new opportunities for low cost, portable image processing systems. They provide greater I/O bandwidth and computing performance than systems based on CCDs and microprocessors. However, finding a balance between performance, flexibility, efficiency, and cost depends on an evaluation of target applications. This paper describes several promising architectural approaches for the realization of videoputer systems and outlines example implementations being pursued at Georgia Tech.
1. Introduction

Low cost video cameras and advanced telecommunications technology enable many new services, such as electronic video mail and computer-based teleconferencing. Evolving compression standards (e.g., MPEG) and inexpensive disk storage allow these electronic exchanges to be treated much as e-mail is used today. Cellular phone-based wireless technology provides low cost communication in the field. However, acquiring, transmitting, and manipulating this information presents a computational requirement beyond the capabilities of existing systems. Increasing user demand for portable, on-the-move videoputing (video + computing) and teleputing (telecommunications + computing) systems places additional requirements on power, size, and weight.
General purpose microprocessors offer inexpensive and versatile processing elements for such portable imaging systems. However, these new image processing applications demand higher processing rates (10 - 1000 Gops/sec) than commercial microprocessors can provide (0.1 - 0.5 Gops/sec). Dedicated ASICs (Application Specific Integrated Circuits) can provide the needed performance and efficiency, but they lack the flexibility needed for varied application requirements. Unfortunately, many portable imaging applications (image enhancement, recognition, and compression) have requirements met by neither of these processing alternatives.
Alternatively, techniques for integrating OE devices, analog interface circuitry, and digital logic have enabled new approaches for image collection and processing. Monolithic systems incorporating focal plane arrays offer high I/O bandwidth with modest levels of dedicated analog or digital processing capability. Beginning with Mahowald and Mead's silicon retina [13], on-focal plane processing has increased in complexity from simple logic gates to latches [11] to 2-bit registers and counters [9]. Analog processing alternatives have demonstrated even greater operational complexity using passive and active networks. These systems strive to achieve high fill factor detector arrays combined with the maximum computing capability that can effectively be incorporated nearby. Such image processing solutions are compact and efficient, but lack computing power and flexibility.
                 General Purpose    Videoputer    Dedicated
                 Microprocessor     Processor     ASIC
performance      low                high          high
cost             low                moderate      moderate
flexibility      high               high          very low
efficiency       moderate           high          high
Table 1: Characteristics of microprocessors, dedicated ASICs, and an ideal videoputer processor.
The ideal architecture (Table 1) must blend a balance of key characteristics for these applications. It must provide high processing performance that scales with Si VLSI technology advances, while achieving high chip efficiency (Mops/sec/mm2). Low cost must be realized through high efficiency and flexibility, where a single system can address many image processing tasks. System power, size, and weight must support portable operation. Image I/O must exploit OE devices to provide low cost and high performance. This system has not yet been realized, but a successful solution could have an impact comparable to the introduction of the personal computer, video camera, or FAX machine.
This paper summarizes the most promising architectural approaches for videoputer applications. Some example implementations being pursued at Georgia Tech illustrate these architecture classes. Section 2 describes the fundamentals of processing node organizations. Section 3 outlines the approach of systolic architectures. Sections 4 and 5 present SIMD and Message Passing MIMD computing techniques. Finally, Section 6 concludes with directions for future research.
2. Processing Architecture Organization

Before exploring different approaches to smart pixel architectures, the components used to build them need to be defined. Figure 1 illustrates the key elements of all digital processing nodes.
Figure 1: Anatomy of a Processing Node
The datapath contains the most familiar elements of computation: adders, subtractors, multipliers, shifters, and logical units, as well as registers to hold operands as they are being processed. This is where the work required by an application is performed, and all processors, general or special purpose, must have a datapath. Image processing datapaths often include more specialized functional components, such as a multiply-accumulate unit, to better support common image processing operations.
An I/O unit is required to input image data to the datapath, and output results of the computation back to the outside world. This unit is particularly significant given the high I/O data rates demanded in image processing systems. Today's desktop workstation typically operates with less than 10 Mbps I/O; a portable image processor might require 10 to 100 times as much I/O.
Since input and intermediate data cannot all fit in datapath registers, additional data memory is required. This is analogous to memory in a workstation. However, image processing applications tend to use more operands from I/O and require significantly less data storage. Since data memory represents a significant resource cost in computers, the reduction of data memory (1000X or more) can translate to a more efficient system implementation.
Instruction control and program memory are required for all programmable systems. While one computational model presented here, systolic arrays, does not include these components, they are part of nearly every digital computer. Image processing systems can employ several organizations for program control, and the typically shorter, more compact application programs can also be exploited for more powerful, efficient system implementations.
Finally, the network provides a medium for many processing nodes to communicate. This is necessary if nodes are to work together on a common task. Inter-node communication must be high bandwidth and low latency or overall performance suffers. Aggregate network bandwidths in Tbps (1000 Gbps) are sometimes required. Integrated OE smart pixel arrays can play a role in the realization of these networks as well as in image I/O.
These elements provide the building blocks of many smart pixel-based videoputer architectures. The following sections describe a few of these promising architectures.
3. Systolic Array Architectures

Systolic architectures first became popular in the late 1970s as an architectural approach to exploit the growing potential of VLSI technology. H. T. Kung [10] and Charles Leiserson [12] were early proponents of this execution model for extremely efficient implementation of systems that solve computationally intensive applications. More transistors per chip support system designs with increased functionality, leading to greater I/O and inter-cell communication requirements. Communication costs are typically high in execution time, power dissipation, and chip area. To reduce these communication penalties, as well as design complexity, systolic design incorporates regular cell structures that communicate over short distances. Design cost is further minimized by reusing regular cell structures rather than redesigning new components. The key characteristics of systolic designs include modular cells, short communication paths, scalability, and concurrency.
Figure 2 illustrates a systolic array to compute the multiplication of banded matrices. Each hexagonal node includes a simple datapath containing a multiplier and adder, plus clocked registers to regulate data flow between nodes (shown as arrows). On every cycle, each node computes the product of the received input matrix elements and adds it to the rising result matrix element. These systolic nodes include no data or program memory, and have an elementary network and I/O. Systolic processing systems are the most efficient in terms of resource usage, but their lack of programmability restricts their flexibility. Efforts to produce programmable systolic arrays (e.g., the CMU WARP [1][8]) produced systems more akin to MIMD architectures (see Section 5) than those described here. Systolic architectures are well suited for dedicated high throughput computation such as image compression. However, cost and performance comparisons must be made between systolic systems and more flexible architectural approaches.
Figure 2: A systolic array to compute matrix multiplication.
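The lock-step, register-regulated data movement described above can be sketched in software. The following Python sketch simulates a linear systolic array computing a matrix-vector product — a simpler cousin of the hexagonal banded-matrix array of Figure 2, with an illustrative cell structure and schedule rather than the figure's exact design:

```python
def systolic_matvec(A, x):
    """Cycle-level simulation of a linear systolic array computing y = A*x.

    Cell i holds row i of A plus an accumulator. Each x value enters
    cell 0 on its own cycle and marches right one cell per cycle, so
    cell i sees x[j] on cycle i + j and performs one multiply-add.
    """
    n, m = len(A), len(x)
    acc = [0] * n               # result resident in each cell
    pipe = [None] * n           # each cell's clocked input register
    for t in range(m + n - 1):  # enough cycles to drain the pipeline
        for i in range(n - 1, 0, -1):  # clock edge: data shifts right
            pipe[i] = pipe[i - 1]
        pipe[0] = x[t] if t < m else None
        for i in range(n):             # all cells compute concurrently
            if pipe[i] is not None:
                acc[i] += A[i][t - i] * pipe[i]  # t - i recovers index j
    return acc
```

Total latency grows only linearly (m + n - 1 cycles) while up to n multiply-adds occur per cycle — the concurrency and short, regular communication the text identifies as the hallmarks of systolic design.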
The PAMSAC Architecture

Figure 3 shows the layout of a pattern matching systolic architecture being implemented at Georgia Tech. PAMSAC incorporates direct optical input of image data via eight on-chip Si detectors and amplifiers. The chip, which has been implemented through the MOSIS foundry in 2.0 µm CMOS, simulates in IRSIM at 33 MHz. The digital logic of the systolic core has been fully tested; the interface to the OE devices is currently in progress. Figure 4 illustrates the block diagram of the PAMSAC chip. The simplified logic operation of a systolic cell consists of an XNOR and an AND gate to detect perfect pattern matches. This systolic design methodology yields simple, modular logic cells with high concurrency and local interconnection.
Figure 3: Layout of the PAMSAC pattern matching architecture (2252 µm x 2222 µm).
(Figure annotation: eight cascaded 200 µm x 200 µm detectors provide an alternative parallel input into the systolic core.)
Figure 4: Block diagram of the PAMSAC chip.
4. SIMD Architectures

A more flexible architectural approach, compared with systolic arrays, employs programmable digital processors. Yet commercial microprocessors are ill-suited to videoputer applications because of their limited performance and low resource efficiency. They provide generality and functionality that is not required in image processing.
A more promising computational model, SIMD (Single Instruction stream, Multiple Data stream), replicates the datapath, data memory, and I/O to provide high processing performance with low node cost. Figure 5 illustrates this configuration. SIMD systems often employ thousands of processing elements. The cost of the control unit is amortized across the processing elements.
Although a single program is being executed, each instruction is executed simultaneously on many nodes. This execution model is especially well suited to early image processing, when a subroutine must be applied to every region of an image. While a commercial microprocessor must iterate sequentially across an image, a SIMD architecture can process the entire image in a single iteration.
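The contrast between sequential iteration and broadcast, lock-step execution can be illustrated with a small sketch, using NumPy arrays to stand in for the node array (the per-pixel operation here is an arbitrary example, not a SIMPil instruction):

```python
import numpy as np

# A toy 8x8 image, one pixel per (virtual) SIMD processing element.
image = np.arange(64, dtype=np.int32).reshape(8, 8)

# Microprocessor model: a single datapath visits every pixel in turn.
seq = np.empty_like(image)
for r in range(8):
    for c in range(8):
        seq[r, c] = image[r, c] // 2 + 10   # one pixel per iteration

# SIMD model: the control unit broadcasts one instruction stream and
# every node applies it to its own pixel in the same cycle.
simd = image // 2 + 10

assert (seq == simd).all()   # same result, all 64 pixels "in one iteration"
```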
Figure 5: SIMD architectures employ a single control unit with multiple datapaths.
The SIMPil Architecture

While SIMD systems have been used for image processing before, the implementations have been large and expensive. The MPP [2], CM-2 [14], MasPar [3], and the GAPP [18] are examples of general purpose SIMD systems capable of performing image processing applications. However, these systems achieve performance and generality at the expense of focal plane I/O coupling and physical size. Other systems, including the Scan Line Array Processor (SLAP) [6], exploit the frame scanning used in video cameras by operating on sequential scan lines, but serial loading and unloading of image data limits frame rates. A more specialized architecture can provide the same high levels of performance in a portable system.
The SIMPil system being developed at Georgia Tech [4][5][15] incorporates a specialized SIMD architecture with an integrated array of optoelectronic devices. A 1300 nm optoelectronic link allows through-silicon-wafer input of digital image data from a detector plane stacked above the processing plane, shown in Figure 6. By reducing the image transfer bottleneck found in decoupled detector-processor systems, high frame rates are possible without constraining processing power. Processing area does not impact the detector array fill factor.
Figure 6: A Stacked Two Layer Focal Plane Processor.
The block diagram of a SIMPil node is displayed in Figure 7. The figure also illustrates how a single node interfaces to a subarray of detectors, and how the nodes are connected in a mesh network to operate in SIMD mode. Each node includes a traditional RISC load/store datapath plus an interface to the detector array via an OE data channel. Initially, an 8-bit datapath SIMPil node was implemented. It includes an 8-word register file, an arithmetic logic unit, a shift unit, a 16-bit multiply-accumulator (MACC), and a 64-word local memory.
Figure 7: SIMPil Microarchitecture
The instruction set architecture (ISA) provides arithmetic operations including addition, subtraction, multiplication, and multiply accumulation. The multiply accumulate (MACC) instruction is included because of its utility in image processing applications. For example, the MACC operation reduces the partial convolution of a 3 × 3 sub-image from 17 to 9 operations. The 16-bit accumulator in an 8-bit datapath improves precision, especially when using fixed-point operands. The logic unit performs bitwise AND, OR, and exclusive-OR operations. Logical, arithmetic, and rotate shift operations are performed in the shift unit. Register-to-register and immediate addressing modes are supported by the dyadic operations. Local memory is accessed via the load and store instructions.
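The operation count quoted for the MACC instruction can be checked with a short sketch (pure Python; counting one datapath instruction per multiply or add is a simplifying cost model):

```python
def conv3x3(image, kernel, r, c, macc=True):
    """Compute one 3x3 convolution output pixel, counting datapath ops.

    With a fused multiply-accumulate, each of the 9 taps costs a single
    MACC instruction; without it, 9 multiplies and 8 separate adds are
    needed (the first product initializes the accumulator, so no add).
    """
    acc, ops, first = 0, 0, True
    for dr in range(3):
        for dc in range(3):
            acc += image[r + dr][c + dc] * kernel[dr][dc]
            ops += 1                    # multiply (or fused MACC)
            if not macc and not first:
                ops += 1                # separate accumulate add
            first = False
    return acc, ops

flat = [[1] * 4 for _ in range(4)]      # uniform test image
box = [[1] * 3 for _ in range(3)]       # 3x3 box kernel
assert conv3x3(flat, box, 0, 0, macc=True) == (9, 9)    # 9 MACC ops
assert conv3x3(flat, box, 0, 0, macc=False) == (9, 17)  # 9 mults + 8 adds
```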
Each SIMPil node interfaces to an array of thin film detectors. The ISA allows for up to 256 addressable detectors. Each node also includes analog-to-digital circuitry to convert light intensities to digital values. The ISA has a SAMPLE instruction that synchronously captures the light intensity at each detector. The SIMD execution model thus allows the entire image to be sampled synchronously. Once the detector array has been digitized, the image can be processed by the SIMPil nodes in data parallel fashion.
Low level image processing applications, such as edge detection, are usually local algorithms needing only pixel values in a small neighborhood around each data point. This pixel access locality is well supported by a nearest neighbor or mesh network. SIMPil nodes communicate through a nearest neighbor NEWS (north, east, west, and south) network using NEWS registers in the datapath.
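A NEWS transfer moves one value between every node and its neighbor simultaneously. The sketch below models the network with array shifts (the toroidal wrap-around is an assumption of the model, not necessarily SIMPil's border policy) and uses four transfers to build a 4-neighbor Laplacian, a simple edge detector:

```python
import numpy as np

def news(grid, d):
    """One NEWS-network transfer: every node receives the value held by
    its neighbor in direction d (with wrap-around at the borders)."""
    shift = {"N": (1, 0), "S": (-1, 0), "W": (0, 1), "E": (0, -1)}[d]
    return np.roll(grid, shift, axis=(0, 1))

img = np.zeros((5, 5))
img[2, 2] = 1.0   # a single bright pixel

# Four NEWS transfers plus local arithmetic give a 4-neighbor Laplacian;
# each node needs only its own pixel and its four nearest neighbors.
lap = news(img, "N") + news(img, "S") + news(img, "E") + news(img, "W") - 4 * img
```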
The SIMPil system is an embedded, programmable, focal-plane image processing system. The processing power of a SIMPil node surpasses the computational needs of a single pixel. However, desired frame rates may not be achieved if the number of pixels assigned to a node is too large. Simulations of image processing applications suggest that 36 to 64 pixels per SIMPil node (at a 50 MHz node frequency) is a good balance. Our prototype target is 64 pixels per SIMPil node.
Using current VLSI technology, between 16 and 64 SIMPil nodes can be fabricated on a single Si VLSI chip. By tiling an array of 16 chips, each containing 16 nodes, a 128x128 pixel resolution is achieved. This system totals 16,384 pixels and 256 SIMPil nodes. Operating at 50 MHz, SIMPil can perform 781 Kops/sec for each pixel. Eight bits is the minimum datapath width for pixels supporting 256 gray scale levels.
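The system arithmetic above can be tabulated directly (figures taken from the text; the one-operation-per-node-per-cycle assumption is implicit):

```python
# Prototype sizing from the text: 16 chips x 16 nodes, 64 pixels per node.
chips, nodes_per_chip, pixels_per_node = 16, 16, 64
freq_hz = 50_000_000                       # 50 MHz node clock

nodes = chips * nodes_per_chip             # 256 SIMPil nodes
pixels = nodes * pixels_per_node           # 16,384 pixels = 128 x 128
ops_per_pixel = nodes * freq_hz // pixels  # one op per node per cycle

assert nodes == 256 and pixels == 128 * 128
assert ops_per_pixel == 781_250            # ~781 Kops/sec per pixel
```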
This demonstration is currently being developed for use in videoputing systems, such as high speed smart cameras. The prototype addresses issues in multidisciplinary interfacing by incorporating an integrated thin film detector, on-chip analog interface circuitry, and a powerful digital processor on a single Si CMOS chip. To illustrate the effectiveness of the SIMPil processing architecture, several image processing operations have been demonstrated, including edge detection, convolution, and image compression. The silicon area efficiency of this type of processing node is compared with general purpose commercial microprocessors. Figure 8 is a photomicrograph of a prototype SIMPil node fabricated through the MOSIS foundry in 0.8 µm CMOS. This prototype has been fully tested, and a second generation node is currently being designed. Image processing applications such as vector quantization compression have been implemented for SIMPil [7].
Figure 8: A photomicrograph of a prototype SIMPil node with integrated OE interface circuitry.
5. Message Passing MIMD Architectures

MIMD (Multiple Instruction stream, Multiple Data stream) architectures provide the most general computational model. Each processing node is an autonomous computing agent including a datapath, control, and memory. A system consists of a collection of nodes, each executing a different program, connected by a network through which the nodes communicate. This organization resembles a room full of connected workstations, but high throughput, low latency communication and optimized synchronization mechanisms allow the processing nodes to work more closely on a common task.
Figure 9 illustrates the organization of a MIMD architecture. This form of execution offers the greatest generality and the lowest efficiency. Today's commercial supercomputers from Cray (T3D) and IBM (SP2) employ MIMD organizations based on commercial microprocessors. Image processing applications require less generality and storage, and can be effectively executed on MIMD nodes occupying a fraction of a chip.
Figure 9: A MIMD execution model.
SIMD architectures are ideal for early image processing, where operations are performed across a large image array. MIMD architectures are better suited for later steps, when the image features being processed are more sparse and diverse. Often image transformations are dependent on the specific image data in a region. Even with their lower resource efficiency, MIMD architectures often provide more effective computation because of their higher utilization.
Optoelectronic technology can enable this type of system in two ways. First, it can provide the same tightly coupled focal plane image I/O employed in SIMD systems. Second, smart pixel arrays can provide a dense, high throughput communications network for connecting processing nodes. The details of one such system are described in [16].
The Pica Architecture

The Pica execution architecture is designed for handling high message traffic consisting of small, ephemeral tasks. To achieve acceptable efficiency in this fine-grain domain, parallel overhead must be reduced to the minimum achievable level. Complex mechanisms that support general purpose applications are replaced by simpler, lower cost mechanisms for high-throughput problems.
The Pica execution architecture is designed specifically for high-throughput, low-memory operation. The design of a Pica node begins with a minimal sequential core architecture. Pica provides low overhead support for communication, synchronization, naming, and task and storage management. A small amount of memory (4096 36-bit words) and a network interface/router complete the node. This node complexity can be implemented using a fraction of the transistors available on a chip in current technology, allowing multi-node chips; the prototype chip will contain four nodes.
The Pica architecture is designed to form a dense, three-dimensional computational array for processing high-throughput data streams. While less general than other MIMD architectures, it is more efficient for this application area. The execution model supported by Pica is more flexible than other high-throughput architectures (e.g., systolic arrays, static dataflow).
Figure 10: The Pica microarchitecture.
The basic functional blocks of the Pica microarchitecture are shown in Figure 10. The network router routes messages through the node, forming that node's contribution to the communication network. The router implements a simple adaptive routing strategy based on current local virtual-channel allocation. The network interface buffers incoming messages and signals the context manager that a context is required. When it obtains access to local memory, the network interface writes the message contents directly into the allocated, fixed-length context. The datapath consists of a 32-bit integer ALU and shifter, and special-purpose registers. Operands are accessed from a 32-word context cache, which supports two read accesses and one write access on each cycle. The instruction unit fetches and decodes instructions for execution. To keep design complexity and task swapping overhead low, the datapath implementation is not pipelined. The context manager serves three functions: (1) it maintains a queue of suspended and ready tasks for execution, (2) it allocates task storage for incoming messages and deallocates storage as tasks complete, and (3) it arbitrates requests by both the network interface and the cache controller for control of the local memory bus.
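Functions (1) and (2) of the context manager can be sketched as a small state machine. This is a hypothetical Python simplification — slot counts and method names are illustrative, and the memory-bus arbitration of function (3) is omitted:

```python
from collections import deque

class ContextManager:
    """Toy model of the Pica context manager's queueing and allocation."""

    def __init__(self, num_contexts=256):
        self.free = deque(range(num_contexts))  # free fixed-length slots
        self.ready = deque()                    # tasks awaiting the datapath
        self.store = {}                         # slot -> message contents

    def message_arrived(self, payload):
        """(2) Allocate task storage for an incoming message."""
        if not self.free:
            return None                         # back-pressure the network
        slot = self.free.popleft()
        self.store[slot] = payload
        self.ready.append(slot)                 # (1) enqueue as a ready task
        return slot

    def next_task(self):
        """(1) Dequeue the next ready task for execution."""
        return self.ready.popleft() if self.ready else None

    def task_done(self, slot):
        """(2) Deallocate storage as the task completes."""
        del self.store[slot]
        self.free.append(slot)
```

Because contexts are fixed-length and pre-allocated, both allocation and deallocation are constant-time queue operations — consistent with the low parallel overhead the text calls for.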
Plans for a full-scale prototype are underway, but are still preliminary. In a target prototype system, each processing node will contain 4096 36-bit words of local memory (256 contexts), a 32-bit integer processor, and a network interface. The target node performance is 50 MIPS. A chip contains four Pica nodes and 3.2 Gbits/sec of I/O bandwidth. The full-scale prototype will employ a 2.5 V supply voltage in addition to other low power techniques to keep total chip power below 500 mW. In a full scale system (4096 nodes) employing through-wafer optoelectronic interconnect, a processing plane contains 64 chips (256 nodes, 12,800 MIPS) and measures approximately 10 cm by 10 cm. Sixteen planes contain 1024 chips (4096 nodes, 204,800 MIPS) and fit inside a cube 10 cm on a side. 820 Gbits/sec of system I/O bandwidth is available from chips on the top and bottom surfaces of the cube. The sides of the cube are available for power and cooling mechanical connections.
6. Future Directions

Architectural research focuses on how available technologies can be combined to solve problems in a more effective way. The most significant advances have come when new enabling technology is harnessed to address a broad consumer need. Smart pixel systems enable a new class of portable image processing systems. The examples presented here demonstrate the potential of these new products.
Acknowledgments

This work is being supported by AFOSR, NSF contracts #ECS-9422452 and #EEC-9402723, and the DARPA Low Power Electronics Program.
References

[1] Marco Annaratone, Emmanuel Arnould, Thomas Gross, H. T. Kung, Monica S. Lam, Onat Menzilcioglu, Ken Sarocky, and Jon A. Webb, "Warp Architecture and Implementation," Computer Architecture News, June 1986, pages 346-356.
[2] K. E. Batcher, “Design of a Massively Parallel Processor,” IEEE Transactions onComputers C-29, 9, Sept. 1980, pp. 836-840.
[3] T. Blank, “MasPar MP-1 Architecture,” Proceedings of COMPCON Spring ’90 - TheThirty-Fifth IEEE Computer Society International Conference, San Francisco, CA, 1990,pp. 20-24.
[4] H. H. Cat, M. Lee, B. Buchanan, D. S. Wills, M. A. Brooke, N. M. Jokerst, "Silicon VLSI Processing Architectures Incorporating Integrated Optoelectronic Devices," in Proceedings of the 16th Conference on Advanced Research in VLSI, pages 17-27, Chapel Hill, NC, March 1995.
[5] H. H. Cat, J. C. Eble, D. S. Wills, V. K. De, M. Brooke, N. M. Jokerst, "Low Power Opportunities for a SIMD VLSI Architecture Incorporating Integrated Optoelectronic Devices," GOMAC'96 Digest of Papers, pages 59-62, Orlando, FL, March 1996.
[6] A. L. Fisher, “Scan Line Array Processors,” Annual Symposium on Computer Architecture,1986, pp. 338-345.
[7] A. Gentile, H. H. Cat, F. Kossentini, F. Sorbello, and D. S. Wills, "Real-Time Implementation of Full-Search Vector Quantization on a Low Memory SIMD Architecture," IEEE Data Compression Conference, Snowbird, Utah, April 1996, page 438.
[8] T. Gross, H. T. Kung, M. Lam, J. Webb, “Warp as a Machine for Low-Level Vision,”Proceedings of the 1985 IEEE International Conference on Robotics and Automation,March 1985, pp. 790-800.
[9] M. W. Haney, M. P. Christensen, “Smart Pixel Based Viterbi Decoder,” OpticalComputing, 1995 Technical Digest Series, vol. 10, 1995, pp. 99-101.
[10] H. T. Kung, "Why Systolic Architectures?," Computer, pages 37-46, January 1982.
[11] C. B. Kuznia, A. A. Sawchuk, and L. Cheng, “FET-SEED Smart Pixels for Free-SpaceDigital Optics Systems,” Optical Computing, 1995 Technical Digest Series, vol. 10, 1995,pp. 108-110.
[12] Charles Leiserson and J.B. Saxe, “Optimizing Synchronous Systems”, Proc. 22nd AnnualSymp. Foundations of Computer Science, IEEE Computer Society, Oct 1981, pages 23-36.
[13] C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, Reading, Massachusetts,1989.
[14] Connection Machine Model CM-2 Technical Summary, Thinking Machines Corporation, Version 5.1, May 1989.
[15] D. S. Wills, N. M. Jokerst, M. A. Brooke, and A. Brown, "A Two Layer Image Processing System Incorporating Integrated Focal Plane Detectors and Through-Wafer Optical Interconnect," in Technical Digest of the 1995 OSA Optical Computing Topical Meeting, pages 19-22, Salt Lake City, UT, March 1995.
[16] D. S. Wills, W. S. Lacy, C. Camperi-Ginestet, B. Buchanan, H. H. Cat, S. Wilkinson, M. Lee, N. M. Jokerst, M. A. Brooke, "A Three Dimensional High-Throughput Architecture Using Through-Wafer Optical Interconnect," IEEE/OSA Journal of Lightwave Technology Special Issue on Optical Interconnections for Information Processing, 13(6), pages 1085-1092, June 1995.
[17] D. S. Wills, H. Cat, J. Cruz-Rivera, W. S. Lacy, M. Baker, J. Eble, A. Lopez-Lagunas, M. Hopper, "High-Throughput, Low-Memory Applications on the Pica Architecture," to appear in IEEE Transactions on Parallel and Distributed Systems.
[18] W. F. Wong and K. T. Lua, “A Preliminary Evaluation of a Massively Parallel Processor:GAPP,” Microprocessing and Microprogramming, vol. 29, no. 1, July 1990, pp. 53-62.