
Processing architectures for smart pixel systems


D. Scott Wills, James M. Baker Jr., Huy H. Cat, Sek Chai, José Cruz-Rivera, John Eble, Antonio Gentile, Michael Hopper, W. Stephen Lacy, Abelardo López-Lagunas, Phil May, and Tarek Taha

School of Electrical and Computer Engineering
Packaging Research Center
Georgia Institute of Technology
Atlanta, Georgia 30332-0250

Abstract

Smart pixel architectures offer important new opportunities for low cost, portable image processing systems. They provide greater I/O bandwidth and computing performance than systems based on CCD and microprocessors. However, finding a balance between performance, flexibility, efficiency, and cost depends on an evaluation of target applications. This paper describes several promising architectural approaches for the realization of videoputer systems and outlines example implementations being pursued at Georgia Tech.

1. Introduction

Low cost video cameras and advanced telecommunications technology enable many new services, such as electronic video mail and computer-based teleconferencing. Evolving compression standards (e.g., MPEG) and inexpensive disk storage allow these electronic exchanges to be treated much as e-mail is used today. Cellular phone-based wireless technology provides low cost communication in the field. However, acquiring, transmitting, and manipulating this information presents a computational requirement beyond the capabilities of existing systems. Increasing user demand for portable, on-the-move videoputing (video + computing) and teleputing (telecommunications + computing) systems places additional requirements on power, size, and weight.

General purpose microprocessors offer inexpensive and versatile processing elements for such portable imaging systems. However, these new image processing applications demand higher processing rates (10 to 1000 Gops/sec) than commercial microprocessors can provide (0.1 to 0.5 Gops/sec). Dedicated ASICs (Application Specific Integrated Circuits) can provide the needed performance and efficiency, but they lack the flexibility needed for varied application requirements. As a result, many portable imaging applications (image enhancement, recognition, and compression) have requirements that neither processing alternative meets.

Alternatively, techniques for integrating optoelectronic (OE) devices, analog interface circuitry, and digital logic have enabled new approaches for image collection and processing. Monolithic systems incorporating focal plane arrays offer high I/O bandwidth with modest levels of dedicated analog or digital processing capability. Beginning with Mahowald and Mead's silicon retina [13], on-focal-plane processing has increased in complexity from simple logic gates to latches [11] to 2-bit registers and counters [9]. Analog processing alternatives have demonstrated even greater operational complexity using passive and active networks. These systems strive to achieve high fill factor detector arrays combined with the maximum computing capability that can effectively be incorporated nearby. These image processing solutions are compact and efficient, but lack computing power and flexibility.

| | General Purpose Microprocessor | Videoputer Processor | Dedicated ASIC |
|---|---|---|---|
| performance | low | high | high |
| cost | low | moderate | moderate |
| flexibility | high | high | very low |
| efficiency | moderate | high | high |

Table 1: Characteristics of microprocessors, dedicated ASICs, and an ideal videoputer processor.

The ideal architecture (Table 1) must balance the key characteristics for these applications. It must provide high processing performance that scales with Si VLSI technology advances, while achieving high chip efficiency (Mops/sec/mm²). Low cost must be realized through high efficiency and flexibility, so that a single system can address many image processing tasks. System power, size, and weight must support portable operation. Image I/O must exploit OE devices to provide low cost and high performance. This system has not yet been realized, but a successful solution could have an impact comparable to the introduction of the personal computer, video camera, or FAX machine.

This paper summarizes the most promising architectural approaches for videoputer applications. Some example implementations being pursued at Georgia Tech illustrate these architecture classes. Section 2 describes the fundamentals of processing node organizations. Section 3 outlines the approach of systolic architectures. Sections 4 and 5 present SIMD and Message Passing MIMD computing techniques. Finally, Section 6 concludes with directions for future research.

2. Processing Architecture Organization

Before exploring different approaches to smart pixel architectures, the components used to build them need to be defined. Figure 1 illustrates the key elements of all digital processing nodes.

Figure 1: Anatomy of a Processing Node (blocks: Data Memory, Instruction Control, Datapath, Program Memory, Network, I/O)

The datapath contains the most familiar elements of computation: adders, subtractors, multipliers, shifters, and logical units, as well as registers to hold operands as they are being processed. This is where the work required by an application is performed, and all processors, general or special purpose, must have a datapath. Image processing datapaths often include more specialized functional components, such as a multiply-accumulate unit, to better support common image processing operations.

An I/O unit is required to input image data to the datapath and to output the results of the computation back to the outside world. This unit is particularly significant given the high I/O data rates demanded in image processing systems. Today's desktop workstation typically operates with less than 10 Mbps I/O; a portable image processor might require 10 to 100 times as much I/O.

Since input and intermediate data cannot all fit in datapath registers, additional data memory is required. This is analogous to memory in a workstation. However, image processing applications tend to take more of their operands from I/O and require significantly less data storage. Since data memory represents a significant resource cost in computers, the reduction of data memory (1000X or more) can translate to a more efficient system implementation.

Instruction control and program memory are required for all programmable systems. While one computational model presented here, systolic arrays, does not include these components, they are part of nearly every digital computer. Image processing systems can employ several organizations for program control, and their typically shorter, more compact application programs can be exploited for more powerful, efficient system implementations.

Finally, the network provides a medium for many processing nodes to communicate. This is necessary if nodes are to work together on a common task. Inter-node communication must be high bandwidth and low latency or overall performance suffers. Aggregate network bandwidths in Tbps (1000 Gbps) are sometimes required. Integrated OE smart pixel arrays can play a role in the realization of these networks as well as in image I/O.

These elements provide the building blocks of many smart pixel-based videoputer architectures. The following sections describe a few of these promising architectures.

3. Systolic Array Architectures

Systolic architectures first became popular in the late 1970s as an architectural approach to exploit the growing potential of VLSI technology. H. T. Kung [10] and Charles Leiserson [12] were early proponents of this execution model for extremely efficient implementation of systems that solve computationally intensive applications. More transistors per chip support system designs with increased functionality, leading to greater I/O and inter-cell communication requirements. Communication costs are typically high in execution time, power dissipation, and chip area. To reduce these communication penalties, as well as the complexity of designing the system, systolic design incorporates regular cell structures that communicate over short distances. The design cost is further minimized by reusing regular cell structures rather than designing new components. The key characteristics of systolic designs include modular cells, short communications, scalability, and concurrency.

Figure 2 illustrates a systolic array that computes the multiplication of banded matrices. Each hexagonal node includes a simple datapath containing a multiplier and adder, plus clocked registers to regulate data flow between nodes (shown as arrows). On every cycle, each node computes the product of the received input matrix elements and adds it to the rising result matrix element. These systolic nodes include no data or program memory, and have an elementary network and I/O. Systolic processing systems are the most efficient in terms of resource usage, but their lack of programmability restricts their flexibility. Efforts to produce programmable systolic arrays (e.g., the CMU WARP [1][8]) produced systems more akin to MIMD architectures (see Section 5) than those described here. Systolic architectures are well suited for dedicated high throughput computation such as image compression. However, cost and performance comparisons must be made between systolic systems and more flexible architectural approaches.

Figure 2: A systolic array to compute matrix multiplication (inputs: matrix A and matrix B; output: result matrix C).
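As a rough illustration of the systolic execution model, the following Python sketch simulates a square array in which each cell holds only a multiplier, an adder, and clocked registers. It is a simplification chosen for brevity (an output-stationary square array for dense matrices), not the hexagonal banded-matrix array of Figure 2; the function name `systolic_matmul` and the skewed input schedule are assumptions of this sketch.

```python
def systolic_matmul(a, b):
    """Simulate an n x n output-stationary systolic array computing a * b."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]      # one accumulator per cell
    a_reg = [[0] * n for _ in range(n)]    # operands moving left to right
    b_reg = [[0] * n for _ in range(n)]    # operands moving top to bottom

    for t in range(3 * n - 2):             # cycles until the last product
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i              # row i enters i cycles late
                    new_a[i][j] = a[i][k] if 0 <= k < n else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # from the west neighbor
                if i == 0:
                    k = t - j              # column j enters j cycles late
                    new_b[i][j] = b[k][j] if 0 <= k < n else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # from the north neighbor
        a_reg, b_reg = new_a, new_b
        # Every cell performs one multiply-add per cycle, all in parallel.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

Each cell communicates only with its immediate neighbors and needs no program or data memory, which is the property the section attributes to systolic designs.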

The PAMSAC Architecture

Figure 3 shows the layout of a pattern matching systolic architecture being implemented at Georgia Tech. PAMSAC incorporates direct optical input of image data via eight on-chip Si detectors and amplifiers. This chip, which has been implemented through the MOSIS foundry in 2.0 µm CMOS, simulates in IRSIM at 33 MHz. The digital logic of the systolic core has been fully tested; work on the interface to the OE devices is currently in progress. Figure 4 illustrates the block diagram of the PAMSAC chip. The simplified logic operation of a systolic cell consists of an XNOR and an AND gate to detect perfect pattern matches. This systolic design methodology has simple, modular logic cells with high concurrency and local interconnection.

Figure 3: Layout of the PAMSAC pattern matching architecture (2252 µm x 2222 µm). Labeled regions include the systolic core, muxes, detectors 0 to 7, detector amplifier, and latch, surrounded by a 40-pin pad frame.

Figure 4: Block diagram of the PAMSAC chip. Eight OE inputs pass through amplifiers and comparators; a mux and shift registers select between these and an 8-bit digital input to feed the 8x5 systolic core, which takes the match strings and produces the match signal. Eight cascaded 200 µm x 200 µm detectors will provide an alternative parallel input into the systolic core. Each systolic cell contains Ain, Bin, and Cin registers, clock buffers, and XNOR/AND logic producing Cout.
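The XNOR/AND cell logic described above can be sketched functionally in Python. This is a minimal sketch that assumes Cout = Cin AND (Ain XNOR Bin) and ignores the clocked registers and data skew of the real chip; the function names `cell` and `match` are illustrative and do not come from the PAMSAC design.

```python
def cell(a_in, b_in, c_in):
    """One cell: pass the match flag on only if the two input bits agree."""
    xnor = 1 - (a_in ^ b_in)        # 1 when the bits are equal
    return c_in & xnor              # AND with the incoming match flag

def match(data_bits, pattern_bits):
    """Chain cells so the output is 1 only for a perfect pattern match."""
    c = 1                           # the first cell sees an asserted flag
    for a, b in zip(data_bits, pattern_bits):
        c = cell(a, b, c)
    return c
```

For example, `match([1, 0, 1, 1], [1, 0, 1, 1])` yields 1, while a single differing bit anywhere in the chain forces the result to 0.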

4. SIMD Architectures

A more flexible architectural approach, compared with systolic arrays, includes programmable digital processors. Yet commercial microprocessors are ill-suited to videoputer applications because of their limited performance and low resource efficiency. They provide generality and functionality that are not required in image processing.

A more promising computational model, SIMD or Single Instruction stream, Multiple Data stream, replicates the datapath, data memory, and I/O to provide high processing performance with low node cost. Figure 5 illustrates this configuration. SIMD systems often employ thousands of processing elements. The cost of the control unit is amortized across all processing elements.

Although a single program is being executed, each instruction is executed simultaneously on many nodes. This execution model is especially well suited to early image processing, when a subroutine must be applied to every region of an image. While a commercial microprocessor must iterate sequentially across an image, a SIMD architecture can process the entire image in a single iteration.

Figure 5: SIMD architectures employ a single control unit with multiple datapaths. (One Instruction Control and Program Memory block drives four replicated nodes, each with Data Memory, Datapath, Network, and I/O.)
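The SIMD execution model can be illustrated with a small Python sketch in which one instruction stream is applied to the local data of every processing element. The opcode table `OPS` and the function `run_simd` are hypothetical names introduced for this sketch; the list comprehension stands in for processing elements that, in hardware, would all execute the instruction in the same cycle.

```python
OPS = {
    "add": lambda x, k: x + k,
    "sub": lambda x, k: x - k,
    "mul": lambda x, k: x * k,
}

def run_simd(program, pixels):
    """Apply one instruction stream to every processing element's data."""
    data = list(pixels)                  # one local value per element
    for opcode, operand in program:      # single control unit issues each op
        # Conceptually, all elements execute this instruction together.
        data = [OPS[opcode](x, operand) for x in data]
    return data
```

With this model the number of instruction issues depends only on the program length, not on the number of pixels, whereas a sequential processor would issue each instruction once per pixel.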

The SIMPil Architecture

While SIMD systems have been used for image processing before, the implementations have been large and expensive. The MPP [2], CM-2 [14], MasPar [3], and the GAPP [18] are examples of general purpose SIMD systems capable of performing image processing applications. However, these systems achieve performance and generality at the expense of focal plane I/O coupling and physical size. Other systems, including the Scan Line Array Processor (SLAP) [6], exploit the frame scanning used in video cameras by operating on sequential scan lines, but serial loading and unloading of image data limits frame rates. A more specialized architecture can provide the same high levels of performance in a portable system.

The SIMPil system being developed at Georgia Tech [4][5][15] incorporates a specialized SIMD architecture with an integrated array of optoelectronic devices. A 1300 nm optoelectronic link allows through-silicon wafer input of digital image data from a detector plane stacked above the processing plane, as shown in Figure 6. By reducing the image transfer bottleneck found in decoupled detector-processor systems, high frame rates are possible without constraining processing power. Processing area does not impact the detector array fill factor.

Figure 6: A Stacked Two Layer Focal Plane Processor. (A detector array and ADC layer sits above a SIMD processing layer, linked by through-wafer optoelectronic communication.)

The block diagram of a SIMPil node is displayed in Figure 7. The figure also illustrates how a single node interfaces to a subarray of detectors, and how the nodes are connected to one another in a mesh network to operate in SIMD mode. Each node includes a traditional RISC load/store datapath plus an interface to the detector array via an OE data channel. Initially, an 8-bit datapath SIMPil node was implemented. It includes an 8-word register file, an arithmetic logic unit, a shift unit, a 16-bit multiply-accumulator (MACC), and a 64-word local memory.

Figure 7: SIMPil Microarchitecture. (A mesh of nodes N; each processing element (PE) contains a 64-word local memory, NEWS registers, an 8-word register file, an arithmetic, logical, and shift unit, a multiply accumulator, and special registers, and connects to a thin film detector array through sample-and-hold (S&H) and ADC circuitry.)

The instruction set architecture (ISA) provides for arithmetic operations including addition, subtraction, multiplication, and multiply accumulation. The multiply accumulate (MACC) instruction is included because of its utility in image processing applications. For example, the MACC operation reduces the partial convolution of a 3 × 3 sub-image from 17 to 9 operations. The 16-bit accumulator in an 8-bit datapath improves precision, especially when using fixed-point operands. The logic unit allows bitwise AND, OR, and exclusive-OR operations. Logical, arithmetic, and rotate shift operations are performed in the shift unit. Register-to-register and immediate addressing modes are supported by the dyadic operations. Local memory is accessed via the load and store instructions.
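The 17-versus-9 operation count can be checked with a short Python sketch. It assumes the count of 17 corresponds to 9 multiplications plus 8 additions, which is one plausible reading since the paper does not give the breakdown; the function names are illustrative only.

```python
def conv3x3_separate(window, kernel):
    """3 x 3 convolution using separate multiply and add instructions."""
    ops = 0
    total = None
    for w_row, k_row in zip(window, kernel):
        for w, k in zip(w_row, k_row):
            product = w * k          # multiply instruction
            ops += 1
            if total is None:
                total = product      # the first product needs no addition
            else:
                total += product     # add instruction
                ops += 1
    return total, ops

def conv3x3_macc(window, kernel):
    """3 x 3 convolution using one multiply-accumulate per element."""
    acc = 0
    ops = 0
    for w_row, k_row in zip(window, kernel):
        for w, k in zip(w_row, k_row):
            acc += w * k             # single MACC instruction
            ops += 1
    return acc, ops
```

Both functions return the same sum; only the instruction count differs (17 for the separate form, 9 for the MACC form).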

Each SIMPil node interfaces to an array of thin film detectors. The ISA allows for up to 256 addressable detectors. Each node also includes analog-to-digital circuitry to convert light intensities to equivalent digital values. The ISA has a SAMPLE instruction that synchronously captures light intensities at each detector. The SIMD execution model allows the entire image to be sampled by the system synchronously. Once the detector array has been digitized, it can be processed by the SIMPil node in data parallel fashion.

Low level image processing applications, such as edge detection, are usually point algorithms needing only pixel values in a small neighborhood around the data point. This pixel access locality is well supported by a nearest neighbor or mesh network. SIMPil nodes communicate through a nearest neighbor NEWS (north, east, west, and south) network using NEWS registers in the datapath.
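To illustrate why a nearest neighbor network suits such neighborhood algorithms, the following Python sketch models a NEWS read as a whole-grid shift and uses it for a simple four-neighbor Laplacian edge detector. The choice of the Laplacian, the zero value returned at the array border, and the names `shift` and `laplacian` are assumptions of this sketch rather than details of SIMPil.

```python
OFFSETS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

def shift(grid, direction):
    """Value each node reads from its neighbor in the given direction."""
    rows, cols = len(grid), len(grid[0])
    dr, dc = OFFSETS[direction]
    out = [[0] * cols for _ in range(rows)]   # border nodes read 0 (assumed)
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                out[r][c] = grid[nr][nc]
    return out

def laplacian(grid):
    """Four-neighbor Laplacian built from four NEWS reads per node."""
    n, e, w, s = (shift(grid, d) for d in ("north", "east", "west", "south"))
    return [
        [n[r][c] + e[r][c] + w[r][c] + s[r][c] - 4 * grid[r][c]
         for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]
```

Every node needs only four neighbor reads and a few arithmetic operations, so no long-distance communication is required.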

The SIMPil system is an embedded, programmable, focal-plane image processing system. The processing power of the SIMPil node will surpass the computational needs of a single pixel. However, desired frame rates may not be achieved if the number of pixels assigned to a node is too large. Simulations of image processing applications suggest a good balance of 36 to 64 pixels per SIMPil node (with 50 MHz node frequencies). Our prototype target is 64 pixels per SIMPil node.

Using current VLSI technology, between 16 and 64 SIMPil nodes can be fabricated on a single Si VLSI chip. By tiling an array of 16 chips, each containing 16 nodes, a 128x128 pixel resolution is achieved. The aggregate total for this system is 16,384 pixels and 256 SIMPil nodes. Operating at 50 MHz, SIMPil can perform 781 Kops/sec for each pixel. Eight bits is the minimum datapath width for pixels supporting 256 gray scale levels.
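The system totals quoted above follow from simple arithmetic, reproduced here as a Python sketch. It assumes each node completes one operation per clock cycle, which is what the 781 Kops/sec figure implies; the variable names are illustrative.

```python
import math

chips = 16
nodes_per_chip = 16
pixels_per_node = 64
clock_hz = 50_000_000                         # 50 MHz node frequency

nodes = chips * nodes_per_chip                # 256 SIMPil nodes
pixels = nodes * pixels_per_node              # 16,384 pixels
side = math.isqrt(pixels)                     # 128, i.e. a 128x128 image

# Assumes one operation per node per clock cycle, shared by its pixels.
ops_per_pixel = clock_hz / pixels_per_node    # 781,250 ops/sec per pixel
```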

This demonstration is currently being developed for use in videoputing systems, such as high speed smart cameras. This prototype addresses issues in multidisciplinary interfacing by incorporating an integrated thin film detector, on-chip analog interface circuitry, and a powerful digital processor on a single Si CMOS chip. To illustrate the effectiveness of the SIMPil processing architecture, several image processing operations are demonstrated, including edge detection, convolution, and image compression. The silicon area efficiency of this type of processing node is compared with general purpose commercial microprocessors. Figure 8 is a photomicrograph of a prototype SIMPil node fabricated through the MOSIS foundry in 0.8 µm CMOS. This prototype has been fully tested and a second generation node is currently being designed. Image processing applications such as vector quantization compression have been implemented for SIMPil [7].


Figure 8: A photomicrograph of a prototype SIMPil node with integrated OE interface circuitry.

5. Message Passing MIMD Architectures

MIMD (Multiple Instruction stream, Multiple Data stream) architectures provide the most general computational model. Each processing node is an autonomous computing agent including a datapath, control, and memory. A system consists of a collection of nodes, each executing a different program, connected by a network through which the nodes communicate. This organization resembles a room full of connected workstations, but the high throughput, low latency communications and optimized synchronization mechanisms allow the processing nodes to work more closely on a common task.

Figure 9 illustrates the organization of a MIMD architecture. This form of execution offers the greatest generality and the lowest efficiency. Today's commercial supercomputers from Cray (T3D) and IBM (SP2) employ MIMD organizations based on commercial microprocessors. Image processing applications require less generality and storage, and can be effectively executed on MIMD nodes occupying a fraction of a chip.

MIMD diagram goes here.

Figure 9: A MIMD execution model.

SIMD architectures are ideal for early image processing, where operations are performed across a large image array. MIMD architectures are better suited for later steps, when the image features being processed are more sparse and diverse. Often image transformations are dependent on specific image data in that region. Even with their lower resource efficiency, MIMD systems often provide more effective computation because of their higher utilization.

Optoelectronic technology can enable this type of system in two ways. It can provide the same tightly coupled focal plane image I/O employed in SIMD systems. In addition, the same smart pixel arrays can provide a dense, high throughput communications network for connecting processing nodes. The details of one such system are described in [16].

The Pica Architecture

The Pica execution architecture is designed for handling high message traffic consisting of small, ephemeral tasks. In order to achieve acceptable efficiency in this fine-grain domain, parallel overhead must be reduced to the minimum achievable level. Complex mechanisms to support general purpose applications are replaced by simpler, lower cost mechanisms for high-throughput problems.

The Pica execution architecture is designed specifically for high-throughput, low-memory operation. The design of a Pica node begins with a minimal sequential core architecture. Pica provides low overhead support for communication, synchronization, naming, and task and storage management. A small amount of memory (4096 36-bit words) and a network interface/router complete the node. A node of this complexity can be implemented using a fraction of the transistors available on a chip in current technology. This allows multi-node chips; the prototype chip will contain four nodes.

The Pica architecture is designed to form a dense, three dimensional computational array for processing high-throughput data streams. While less general than other MIMD architectures, it is more efficient for this application area. The execution model supported by Pica is more flexible than other high-throughput architectures (e.g., systolic arrays, static dataflow).

Figure 10: The Pica microarchitecture. (Blocks: datapath with ALU/shifter and synchronization and special registers; instruction unit with code cache and IP; context manager with task manager and context allocation; context cache with 32 slots and its controller; local memory of 4096 36-bit words holding 256 contexts; network interface and router connected to other nodes.)

The basic functional blocks of the Pica microarchitecture are shown in Figure 10. The network router routes messages through the node, forming that node's contribution to the communication network. The router implements a simple adaptive routing strategy based on current local virtual-channel allocation. The network interface buffers incoming messages and signals the context manager that a context is required. When it obtains access to local memory, the network interface writes the message contents directly into the allocated, fixed-length context. The datapath consists of a 32-bit integer ALU and shifter, and special-purpose registers. Operands are accessed from a 32-word context cache, which supports two read accesses and one write access on each cycle. The instruction unit fetches and decodes instructions for execution. In order to keep design complexity and task swapping overhead low, the datapath implementation is not pipelined. The context manager serves three functions: (1) it maintains a queue of suspended and ready tasks for execution, (2) it allocates task storage for incoming messages and deallocates storage as the tasks complete, and (3) it arbitrates requests by both the network interface and cache controller for control of the local memory bus.
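The first two context manager functions can be sketched in Python as a pool of fixed-length contexts plus a ready queue. This is a loose software analogy, not the Pica hardware design: the class and method names are hypothetical, suspended tasks and memory bus arbitration are omitted, and returning `None` when no context is free is an assumption, since the behavior on exhaustion is not described here.

```python
from collections import deque

class ContextManager:
    """Toy model of context allocation and task queueing in one node."""

    def __init__(self, num_contexts=256):
        self.free = deque(range(num_contexts))   # unallocated context slots
        self.ready = deque()                     # tasks waiting to execute
        self.storage = {}                        # context id -> message

    def message_arrived(self, message):
        """Allocate a context for an incoming message and queue its task."""
        if not self.free:
            return None                          # assumed: caller retries
        ctx = self.free.popleft()
        self.storage[ctx] = message              # message written to context
        self.ready.append(ctx)
        return ctx

    def next_task(self):
        """Hand the oldest ready task to the datapath, if there is one."""
        return self.ready.popleft() if self.ready else None

    def task_completed(self, ctx):
        """Deallocate the context so later messages can reuse it."""
        del self.storage[ctx]
        self.free.append(ctx)
```

Because every context has the same fixed length, allocation and deallocation reduce to moving an index between two queues, which is consistent with the low task-management overhead the architecture aims for.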

Plans for a full-scale prototype are underway, but are still preliminary. In a target prototype system, each processing node will contain 4096 36-bit words of local memory (256 contexts), a 32-bit integer processor, and a network interface. The target node performance is 50 MIPS. A chip contains four Pica nodes and provides 3.2 Gbits/sec of I/O bandwidth. The full-scale prototype will employ a 2.5 V supply voltage in addition to other low power techniques to keep total chip power below 500 mW. In a full-scale system (4096 nodes) employing through-wafer optoelectronic interconnect, a processing plane contains 64 chips (256 nodes, 12,800 MIPS) and measures approximately 10 cm by 10 cm. Sixteen planes contain 1024 chips (4096 nodes, 204,800 MIPS) and fit inside a cube 10 cm on a side. 820 Gbits/sec of system I/O bandwidth is available from chips on the top and bottom surfaces of the cube. The sides of the cube are available for power, cooling, and mechanical connections.
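The node and performance totals quoted for the target system can be reproduced with the following Python sketch. The words-per-context figure assumes local memory is divided evenly among the 256 contexts, which the text implies but does not state; the variable names are illustrative.

```python
NODE_MIPS = 50
NODES_PER_CHIP = 4
CHIPS_PER_PLANE = 64
PLANES = 16
WORDS_PER_NODE = 4096
CONTEXTS_PER_NODE = 256

nodes_per_plane = CHIPS_PER_PLANE * NODES_PER_CHIP      # 256 nodes
plane_mips = nodes_per_plane * NODE_MIPS                # 12,800 MIPS

system_chips = PLANES * CHIPS_PER_PLANE                 # 1024 chips
system_nodes = system_chips * NODES_PER_CHIP            # 4096 nodes
system_mips = system_nodes * NODE_MIPS                  # 204,800 MIPS

# Assumes memory is split evenly across contexts.
words_per_context = WORDS_PER_NODE // CONTEXTS_PER_NODE  # 16 words
```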

6. Future Directions

Architectural research focuses on how available technologies can be combined to solve problems in a more effective way. The most significant advances have come when new enabling technology is harnessed to address a broad consumer need. Smart pixel systems enable a new class of portable image processing systems. The examples presented here demonstrate the potential of these new products.

Acknowledgments

This work is being supported by AFOSR, NSF contracts #ECS-9422452 and #EEC-9402723, and the DARPA Low Power Electronics Program.

References

[1] Marco Annaratone, Emmanuel Arnould, Thomas Gross, H. T. Kung, Monica S. Lam, Onat Menzilcioglu, Ken Sarocky, and Jon A. Webb, “Warp Architecture and Implementation,” Computer Architecture News, June 1986, pp. 346-356.

[2] K. E. Batcher, “Design of a Massively Parallel Processor,” IEEE Transactions on Computers, C-29, 9, Sept. 1980, pp. 836-840.

[3] T. Blank, “MasPar MP-1 Architecture,” Proceedings of COMPCON Spring ’90 - The Thirty-Fifth IEEE Computer Society International Conference, San Francisco, CA, 1990, pp. 20-24.

[4] H. H. Cat, M. Lee, B. Buchanan, D. S. Wills, M. A. Brooke, N. M. Jokerst, “Silicon VLSI Processing Architectures Incorporating Integrated Optoelectronic Devices,” in Proceedings of the 16th Conference on Advanced Research in VLSI, Chapel Hill, NC, March 1995, pp. 17-27.

[5] H. H. Cat, J. C. Eble, D. S. Wills, V. K. De, M. Brooke, N. M. Jokerst, “Low Power Opportunities for a SIMD VLSI Architecture Incorporating Integrated Optoelectronic Devices,” GOMAC’96 Digest of Papers, Orlando, FL, March 1996, pp. 59-62.

[6] A. L. Fisher, “Scan Line Array Processors,” Annual Symposium on Computer Architecture, 1986, pp. 338-345.

[7] A. Gentile, H. H. Cat, F. Kossentim, F. Sorbello, and D. S. Wills, “Real-Time Implementation of Full-Search Vector Quantization on a Low Memory SIMD Architecture,” IEEE Data Compression Conference, Snowbird, Utah, April 1996, p. 438.

[8] T. Gross, H. T. Kung, M. Lam, J. Webb, “Warp as a Machine for Low-Level Vision,” Proceedings of the 1985 IEEE International Conference on Robotics and Automation, March 1985, pp. 790-800.

[9] M. W. Haney, M. P. Christensen, “Smart Pixel Based Viterbi Decoder,” Optical Computing, 1995 Technical Digest Series, vol. 10, 1995, pp. 99-101.

[10] H. T. Kung, “Why systolic architectures?,” Computer, January 1982, pp. 37-46.

[11] C. B. Kuznia, A. A. Sawchuk, and L. Cheng, “FET-SEED Smart Pixels for Free-Space Digital Optics Systems,” Optical Computing, 1995 Technical Digest Series, vol. 10, 1995, pp. 108-110.

[12] Charles Leiserson and J. B. Saxe, “Optimizing Synchronous Systems,” Proc. 22nd Annual Symp. Foundations of Computer Science, IEEE Computer Society, Oct. 1981, pp. 23-36.

[13] C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, Reading, Massachusetts, 1989.

[14] Connection Machine Model CM-2 Technical Summary, Thinking Machines Corporation, Version 5.1, May 1989.

[15] D. S. Wills, N. M. Jokerst, M. A. Brooke, and A. Brown, “A Two Layer Image Processing System Incorporating Integrated Focal Plane Detectors and Through-Wafer Optical Interconnect,” in Technical Digest of the 1995 OSA Optical Computing Topical Meeting, Salt Lake City, UT, March 1995, pp. 19-22.

[16] D. S. Wills, W. S. Lacy, C. Camperi-Ginestet, B. Buchanan, H. H. Cat, S. Wilkinson, M. Lee, N. M. Jokerst, M. A. Brooke, “A Three Dimensional High-Throughput Architecture Using Through-Wafer Optical Interconnect,” in IEEE/OSA Journal of Lightwave Technology, Special Issue on Optical Interconnections for Information Processing, 13(6), June 1995, pp. 1085-1092.

[17] D. S. Wills, H. Cat, J. Cruz-Rivera, W. S. Lacy, M. Baker, J. Eble, A. Lopez-Lagunas, M. Hopper, “High-Throughput, Low-Memory Applications on the Pica Architecture,” to appear in IEEE Transactions on Parallel and Distributed Systems.

[18] W. F. Wong and K. T. Lua, “A Preliminary Evaluation of a Massively Parallel Processor: GAPP,” Microprocessing and Microprogramming, vol. 29, no. 1, July 1990, pp. 53-62.

