Parallel Computer Architecture and Interconnect

Types of Parallel Computer Architecture

Two principal types:

Shared memory multiprocessor: From a strictly hardware point of view, a computer architecture in which all processors have direct (usually bus-based) access to common physical memory. In a programming sense, a model in which parallel tasks all have the same "picture" of memory and can directly address and access the same logical address space.

Distributed memory multicomputer: In hardware, an architecture in which physical memory is not common and memory on other machines is reached only over a network. As a programming model, tasks can logically "see" only local machine memory and must use communications to access memory on other machines.

Ref.: slides from B. Wilkinson at UNC-Charlotte, 2006, and Kumar, Introduction to Parallel Computing.

Shared Memory Multiprocessor

Conventional Computer

Virtually all computers have followed a common machine model known as the von Neumann computer, named after the Hungarian-born mathematician John von Neumann. A von Neumann computer uses the stored-program concept. The CPU executes a stored program that specifies a sequence of read and write operations on the memory.

Each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.
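The address arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; the function name address_range is made up for the example.

```python
def address_range(b):
    """Return (lowest, highest) address for a b-bit address (hypothetical helper)."""
    # b address bits can encode 2**b distinct locations, numbered from 0.
    return 0, 2**b - 1
```

For example, address_range(16) gives the range of a 16-bit address and address_range(32) that of a 32-bit address.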

Shared Memory Multiprocessor System

The natural way to extend the single-processor model is to connect multiple processors to multiple memory modules, such that each processor can access any memory module.

Multiple processors can operate independently but share the same memory resources. A change to a memory location made by one processor is visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

UMA and NUMA

Uniform Memory Access (UMA):
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
All processors have equal access and equal access times to memory.
Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.

Non-Uniform Memory Access (NUMA):
Often made by physically linking two or more SMPs.
One SMP can directly access the memory of another SMP.
Not all processors have equal access time to all memories.
Memory access across the link is slower.
If cache coherency is maintained, it may also be called CC-NUMA (Cache Coherent NUMA).

Shared Memory Computers

Advantages:
The global address space provides a user-friendly programming interface to memory.
Data sharing between tasks is both fast and uniform, because memory is close to the CPUs.

Disadvantages:
The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can increase traffic on the shared memory-CPU path.
The programmer is responsible for synchronization constructs that ensure "correct" access to global memory and consistent results.
Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors.

Distributed Memory Computer

Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply. When a processor needs access to data held by another processor, it is usually the task of the programmer to explicitly define how and when the data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

Distributed Memory Computer

Advantages:
Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking such as Ethernet.

Disadvantages:
The programmer is responsible for many of the details associated with data communication between processors.
Memory access times are non-uniform: data on a remote processor takes longer to reach than local data.

Hybrid Computer

The largest and fastest computers in the world today employ both shared and distributed memory architectures. The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global. The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory, not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.

Real computer systems have cache memory between the main memory and the processors, typically a Level 1 (L1) cache and a Level 2 (L2) cache.

Example Quad Shared Memory Multiprocessor

[Figure: four processors, each with an L1 cache, an L2 cache, and a bus interface, on a processor/memory bus. Also on the bus: a memory controller in front of the shared memory, and an I/O interface leading to the I/O bus.]

Programming Shared Memory Computers

Several possible ways:

Use threads - the programmer decomposes the program into individual parallel sequences (threads), each able to access the shared and global variables that have been declared. Each thread has local data but also shares all the resources of the running program (a.out). This saves the overhead associated with replicating a program's resources for each thread. Any thread can execute any subroutine at the same time as other threads. Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no more than one thread is updating the same global address at any time. Example: Pthreads.
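The need for synchronization when threads update the same global address can be sketched as follows. The sketch uses Python's standard threading module as a stand-in for Pthreads, so it shows the idea rather than the Pthreads API itself; the names counter, worker, and run are made up for the example.

```python
import threading

counter = 0                 # global variable shared by every thread
lock = threading.Lock()     # synchronization construct guarding counter


def worker(n):
    """Add 1 to the shared counter n times."""
    global counter
    for _ in range(n):
        # Only one thread at a time may perform the read-modify-write.
        with lock:
            counter += 1


def run(num_threads=4, n=10000):
    """Start num_threads workers, wait for them, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(n,)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place, run(4, 10000) always returns 40000. Without it, two threads could read the same old value and one update could be lost.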

Use library functions and preprocessor compiler directives with a sequential programming language to declare shared variables and specify parallelism.

Portable / multi-platform, including Unix and Windows NT platforms.
Available in C/C++ and Fortran implementations.
Can be very easy and simple to use.
Example: OpenMP - an industry standard. Consists of library functions, compiler directives, and environment variables. Needs an OpenMP compiler.

Programming Distributed Memory Computers

Message passing model: Tasks exchange data through communications by sending and receiving messages. Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation. In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
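The pairing of send and receive operations can be sketched as below. This is a simulation only, not MPI: each task is a Python thread that touches nothing but its own local data and its own inbox, and queue.Queue objects stand in for the network. The names run_ring, inbox, and results are made up for the example, and the results list exists only so the example can report what each task computed.

```python
import queue
import threading


def run_ring(num_tasks=4):
    """Each task sends its local value to the right neighbour and receives from the left."""
    inbox = [queue.Queue() for _ in range(num_tasks)]   # one message channel per task
    results = [None] * num_tasks                        # for reporting only

    def task(rank):
        local = rank * 10                    # data private to this task
        right = (rank + 1) % num_tasks
        inbox[right].put(local)              # send to the right neighbour
        received = inbox[rank].get()         # matching receive from the left neighbour
        results[rank] = local + received

    threads = [threading.Thread(target=task, args=(r,)) for r in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every get() here is satisfied by exactly one put() from the left neighbour. If a task called get() with no matching put(), it would wait forever, which is the simulated form of an unmatched receive.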

Interconnection Networks

Provide mechanisms for data transfer between processors or between processors and memory.
A typical network is built from links (physical media such as wires and fibers) and switches (which provide a mapping from inputs to outputs).

Static network: point-to-point links.

Dynamic network: switches and links. Communications are established dynamically among processors and memory.

Interconnection Networks

2- and 3-dimensional meshes
Hypercube (not now common)

Using switches:
Crossbar
Trees
Multistage interconnection networks

Bus-Based Networks

Ideal for broadcasting. The distance between any two nodes is constant. However, the bounded bandwidth of a bus places limitations on performance as the number of nodes increases. Caches are used to improve access time. Scalable in cost but not in performance.

Crossbar Networks

A grid of p x b switches is employed, connecting p processors to b memory banks. With b >= p the network is non-blocking. The lower bound on the total number of switches is Ω(p^2).

Not scalable in terms of cost.
Scalable in terms of performance.

Multistage Networks

An intermediate class of networks that lies between the two extremes above.
An omega network consists of log p stages (logarithm to base 2), where p is the number of inputs (processing nodes) and also the number of outputs (memory banks).

Within each stage, a link exists between input i and output j if:

j = 2i           for 0 <= i <= p/2 - 1
j = 2i + 1 - p   for p/2 <= i <= p - 1
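The usual perfect-shuffle wiring of an omega stage (j = 2i for the lower half of the inputs, j = 2i + 1 - p for the upper half) can be sketched as follows. The function names are made up for the example, and p is assumed to be a power of two.

```python
def perfect_shuffle(i, p):
    """Output j wired to input i in one omega stage (p a power of two)."""
    if i < p // 2:
        return 2 * i            # lower half of the inputs
    return 2 * i + 1 - p        # upper half of the inputs


def rotate_left(i, p):
    """Left-rotate the log2(p)-bit binary representation of i by one position."""
    bits = p.bit_length() - 1
    return ((i << 1) | (i >> (bits - 1))) & (p - 1)
```

For p = 8 the wiring is 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7. The two functions agree for every input, which is why the pattern is often described as a left rotation of the binary address.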

In each stage, the p inputs are fed into a set of p/2 switches. Each switch is in one of two connection modes:

1). Pass-through: the inputs are sent straight through to the outputs.
2). Cross-over: the inputs are crossed over and then sent out.

Total number of switches?
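One way to work out the count: an omega network has log p stages (base 2) with p/2 switches each, so (p/2) log p switches in total, against p x b for a crossbar. The sketch below compares the two; the function names are made up for the example, p is assumed to be a power of two, and the crossbar is assumed to have b = p unless stated otherwise.

```python
def omega_switches(p):
    """Number of 2x2 switches in an omega network with p inputs (p a power of two)."""
    stages = p.bit_length() - 1          # log2(p) for a power of two
    return (p // 2) * stages


def crossbar_switches(p, b=None):
    """Number of switch points in a p x b crossbar (b defaults to p)."""
    if b is None:
        b = p
    return p * b
```

For p = 8 this gives 12 switches against 64, and for p = 1024 it gives 5120 against 1048576, which is the cost gap between the two designs.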

A link (AB in the figure) may be needed by more than one processor-to-memory route at the same time. In that case one of the communications is blocked.
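Blocking can be demonstrated with a small routing sketch. It assumes the common destination-tag scheme: at stage k the switch is set to pass-through if its input already sits on the port matching bit k of the destination (most significant bit first), and to cross-over otherwise. The function names and the sample routes are chosen for illustration, and p is assumed to be a power of two.

```python
def omega_route(src, dst, p):
    """Return [(stage, output_port, mode), ...] for a message from src to dst."""
    stages = p.bit_length() - 1
    pos = src
    path = []
    for k in range(stages):
        # Perfect shuffle between stages: rotate the port number left by one bit.
        pos = ((pos << 1) | (pos >> (stages - 1))) & (p - 1)
        want = (dst >> (stages - 1 - k)) & 1      # bit k of the destination
        mode = "pass-through" if (pos & 1) == want else "cross-over"
        pos = (pos & ~1) | want                   # leave the switch on the wanted port
        path.append((k, pos, mode))
    return path


def shared_links(route_a, route_b):
    """Return the (stage, output_port) links that both routes need."""
    used = {(k, port) for k, port, _ in route_a}
    return [(k, port) for k, port, _ in route_b if (k, port) in used]
```

With p = 8, the routes 010 to 111 and 110 to 100 both need output port 5 of the first stage, so shared_links reports (0, 5) and one of the two messages has to wait.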

A completely-connected network is good in the sense that any two nodes can exchange a message in a single step. It is similar to a crossbar network in its non-blocking property.

A star-connected network is similar to a bus-based network. Communication between any pair of nodes is routed through the central processor. The central node is the bottleneck, just like the bus.

Hypercube: a d-dimensional hypercube has 2^d nodes in total.

In general, a d-dimensional hypercube is constructed by connecting corresponding nodes of two (d-1) dimensional hypercubes.
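The recursive construction can be sketched directly. The sketch assumes the usual labelling in which nodes are numbered 0 to 2^d - 1 and the second copy of the (d-1)-dimensional cube gets the extra high bit set; all function names are made up for the example.

```python
def hypercube_edges(d):
    """Build the edge list of a d-dimensional hypercube from two (d-1)-cubes."""
    if d == 0:
        return []                                    # a single node, no links
    smaller = hypercube_edges(d - 1)
    high = 1 << (d - 1)                              # bit that tells the two copies apart
    copy = [(a | high, b | high) for a, b in smaller]
    bridges = [(i, i | high) for i in range(high)]   # link corresponding nodes
    return smaller + copy + bridges


def neighbors(node, d):
    """Nodes directly linked to node: flip each of the d address bits in turn."""
    return [node ^ (1 << k) for k in range(d)]


def hops(a, b):
    """Minimum number of links between a and b (number of differing bits)."""
    return bin(a ^ b).count("1")
```

For d = 3 this yields 8 nodes and 12 links, each node has 3 neighbours, and the farthest pair (for example 000 and 111) is 3 hops apart.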

Tree-Based Networks

A static tree network has a processing node at each node of the tree.
A dynamic tree network has switching nodes at the intermediate levels and processing nodes at the leaf level.

To route a message, the source node sends the message up the tree until it reaches the node that is the root of the smallest subtree containing both sender and receiver. The message is then routed down the tree to the destination.
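The up-then-down routing rule can be sketched for a dynamic tree. The sketch assumes a complete binary tree numbered like a heap (root 1, children of node n are 2n and 2n + 1) with the p processors at the leaves p to 2p - 1; this numbering and the name tree_route are assumptions made for the example.

```python
def tree_route(src, dst, p):
    """Return the list of tree nodes visited from processor src to processor dst."""
    a, b = p + src, p + dst          # leaf numbers of the two processors
    up, down = [a], [b]
    # Both leaves are at the same depth, so climb from both ends in step
    # until the two paths meet at the root of the common subtree.
    while a != b:
        a //= 2
        b //= 2
        up.append(a)
        down.append(b)
    # Go up from the source, then back down the destination's branch.
    return up + down[-2::-1]
```

With p = 8, a message from processor 0 to processor 3 visits nodes 8, 4, 2, 5, 11 and never reaches the root, while a message from processor 0 to processor 7 has to pass through the root (node 1).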

Cache Coherence

In the case of shared-address-space computers, additional hardware is required to keep multiple copies of data consistent with each other. In particular, when several processors hold copies of the same data, how do we ensure that they all use the same up-to-date value?

If a processor changes the value of its copy, one of two things must happen:
The other copies must be invalidated, or
The other copies must be updated.

In the state diagram, solid lines represent processor actions and dashed lines represent coherence actions. A read of invalid data moves the copy to the shared state by fetching the remote value. A write to shared data moves the copy to the dirty state and issues a c_write that marks the other copies invalid.
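The invalidate transitions described above can be sketched as a small simulation of one shared variable. This is a simplified model under stated assumptions: a single word rather than a cache line, three states (invalid, shared, dirty), and a write-back to memory whenever another processor reads a dirty copy. The class and method names are made up for the example.

```python
class CoherentVariable:
    """One shared variable cached by several processors (invalidate protocol sketch)."""

    def __init__(self, nprocs, value=0):
        self.memory = value
        self.state = ["invalid"] * nprocs
        self.cache = [None] * nprocs

    def read(self, p):
        """Processor action: read. A read of invalid data fetches the current value."""
        if self.state[p] == "invalid":
            # Coherence action (c_read): a dirty copy elsewhere is written
            # back to memory and downgraded to shared.
            for q, s in enumerate(self.state):
                if s == "dirty":
                    self.memory = self.cache[q]
                    self.state[q] = "shared"
            self.cache[p] = self.memory
            self.state[p] = "shared"
        return self.cache[p]

    def write(self, p, value):
        """Processor action: write. The writer's copy becomes dirty."""
        # Coherence action (c_write): every other copy is marked invalid.
        for q in range(len(self.state)):
            if q != p:
                self.state[q] = "invalid"
                self.cache[q] = None
        self.cache[p] = value
        self.state[p] = "dirty"
```

For example, after two processors read the variable both copies are shared; a write by processor 0 makes its copy dirty and the other invalid; the next read by processor 1 fetches the new value and both copies return to shared.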

*********************************

Date post:	31-Dec-2015
Category:	Documents
Upload:	claud-alvin-richards