Multicore Computers
William Stallings, Computer Organization and Architecture, 8th Edition. Chapter 18: Multicore Computers
Transcript
  • William Stallings, Computer Organization and Architecture, 8th Edition. Chapter 18: Multicore Computers

  • Hardware Performance Issues
    - Microprocessors have seen an exponential increase in performance
      - Improved organization
      - Increased clock frequency
      - Increase in parallelism: pipelining, superscalar (multi-issue), simultaneous multithreading (SMT)
    - Diminishing returns
      - More complexity requires more logic
      - Increasing chip area goes to coordination and signal-transfer logic
      - Harder to design, make and debug

  • Alternative Chip Organizations
    http://www.cadalyst.com/files/cadalyst/nodes/2008/6351/i4.jpg

  • Intel Hardware Trends
    - Exponential speedup trend
    - ILP has come and gone
    http://smoothspan.files.wordpress.com/2007/09/clockspeeds.jpg
    http://www.ixbt.com/cpu/semiconductor/intel-65nm/power_density.jpg

  • Increased Complexity
    - Power requirements grow exponentially with chip density and clock frequency
    - Can use more chip area for cache instead
      - Smaller
      - Order of magnitude lower power requirements
    - By 2015:
      - 100 billion transistors on a 300 mm² die
      - Cache of 100 MB
      - 1 billion transistors for logic
    http://www.tomshardware.com/reviews/core-duo-notebooks-trade-battery-life-quicker-response,1206-4.html
    http://techreport.com/r.x/core-i7/die-callout.jpg

  • Power and Memory Considerations
    - (Figure: memory is "less action", logic is "more action"; memory's share of the die has passed 50%)
    - Is this a RAM or a processor?

  • Increased Complexity
    - Pollack's rule: performance is roughly proportional to the square root of the increase in complexity
      - Doubling complexity gives only about 40% more performance
    - Multicore has the potential for near-linear improvement (needs some programming effort and won't work for all problems)
    - Unlikely that one core can use all of a huge cache effectively, so add PEs to make an MPSoC
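The trade-off behind Pollack's rule can be made concrete with a short calculation (a sketch; the function name is ours, and the rule itself is only an empirical approximation):

```python
import math

def pollack_speedup(complexity_factor):
    """Pollack's rule: single-core performance grows roughly with the
    square root of the increase in core complexity (transistor count)."""
    return math.sqrt(complexity_factor)

# Doubling complexity yields only ~41% more single-core performance,
# while spending the same transistors on a second core could
# (ideally) double throughput.
print(round(pollack_speedup(2.0), 2))  # → 1.41
```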

  • Chip Utilization of Transistors (cache vs. CPU logic)

  • Software Performance Issues
    - Performance benefits depend on effective exploitation of parallel resources (obviously)
    - Even small amounts of serial code impact performance (not so obvious)
      - 10% inherently serial code on an 8-processor system gives only 4.7 times the performance
    - Many overheads in an MPSoC: communication, distribution of work, cache coherence
    - Some applications effectively exploit multicore processors
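The 4.7× figure comes from Amdahl's law; a minimal sketch to verify it (the function name is ours):

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Amdahl's law: overall speedup is capped by the serial fraction,
    since only the parallel part benefits from more processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# 10% inherently serial code on an 8-processor system:
print(round(amdahl_speedup(0.10, 8), 1))  # → 4.7
```

Note that even with infinitely many processors, a 10% serial fraction caps the speedup at 10×.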

  • Effective Applications for Multicore Processors
    - Databases (e.g. SELECT *)
    - Servers handling independent transactions
    - Multi-threaded native applications: Lotus Domino, Siebel CRM
    - Multi-process applications: Oracle, SAP, PeopleSoft
    - Java applications
      - The Java VM is multi-threaded, with multi-threaded scheduling and memory management (not so good at SSE)
      - Sun's Java Application Server, BEA's WebLogic, IBM WebSphere, Tomcat
    - Multi-instance applications: one application running multiple times

  • Multicore Organization
    - Main design variables:
      - Number of core processors on chip (dual, quad, ...)
      - Number of levels of cache on chip (L1, L2, L3, ...)
      - Amount of shared vs. unshared cache (1 MB, 4 MB, ...)
    - The following slide has examples of each organization: ARM11 MPCore, AMD Opteron, Intel Core Duo, Intel Core i7

  • Multicore Organization Alternatives (unshared vs. shared cache)
    - ARM11 MPCore
    - AMD Opteron
    - Intel Core Duo
    - Intel Core i7

  • Advantages of Shared L2 Cache
    - Constructive interference reduces the overall miss rate (A wants X, then B wants X: good!)
    - Data shared by multiple cores is not replicated at the cache level (one copy of X serves both A and B)
    - With proper frame replacement algorithms, the amount of shared cache dedicated to each core is dynamic
      - Threads with less locality can have more cache
    - Easy inter-process communication through shared memory
    - Cache coherency is confined to the small L1
    - A dedicated L2 cache, by contrast, gives each core more rapid access
      - Good for threads with strong locality
    - A shared L3 cache may also improve performance

  • Core i7 and Core Duo
    - Let us review these two Intel architectures

  • Individual Core Architecture
    - Intel Core Duo uses superscalar cores
    - Intel Core i7 uses simultaneous multithreading (SMT)
      - Scales up the number of threads supported
      - 4 SMT cores, each supporting 4 threads, appear as 16 cores (my Core i7 has 2 threads per CPU)

  • Intel x86 Multicore Organization: Core Duo (1)
    - 2006
    - Two x86 superscalar cores with a shared L2 cache
    - Dedicated L1 cache per core: 32 KB instruction and 32 KB data
    - Thermal control unit per core
      - Manages chip heat dissipation with sensors; clock speed is throttled
      - Maximizes performance within thermal constraints
      - Improved ergonomics (quiet fan)
    - Advanced Programmable Interrupt Controller (APIC)
      - Inter-processor interrupts between cores
      - Routes interrupts to the appropriate core
      - Includes a timer so the OS can self-interrupt a core

  • Intel x86 Multicore Organization: Core Duo (2)
    - Power management logic
      - Monitors thermal conditions and CPU activity
      - Adjusts voltage (and thus power consumption)
      - Can switch individual logic subsystems on/off to save power
      - Split-bus transactions can sleep on one end
    - 2 MB shared L2 cache
      - Dynamic allocation
      - MESI support for the L1 caches
      - Extended to support multiple Core Duo chips in SMP (not SMT)
      - L2 data shared between local cores (fast) or external
    - Bus interface is the front-side bus (FSB)
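The MESI protocol mentioned above keeps each L1 line in one of four states: Modified, Exclusive, Shared, or Invalid. A minimal sketch of the state machine for a single line, as seen by one core (the event names are ours, and a real controller also issues the matching bus transactions, write-backs and invalidations, alongside these state changes):

```python
# (current state, event) -> next state, for one cache line in one core.
MESI = {
    ("I", "local_read"):  "S",  # assume another cache holds it; else "E"
    ("I", "local_write"): "M",  # read-for-ownership, invalidate other copies
    ("S", "local_write"): "M",  # upgrade: invalidate other sharers
    ("E", "local_write"): "M",  # silent upgrade, no bus traffic needed
    ("M", "snoop_read"):  "S",  # supply dirty data, keep a shared copy
    ("E", "snoop_read"):  "S",
    ("M", "snoop_write"): "I",  # another core wants ownership
    ("E", "snoop_write"): "I",
    ("S", "snoop_write"): "I",
}

def next_state(state, event):
    # Pairs not listed (e.g. a local read hit in M, E or S) keep their state.
    return MESI.get((state, event), state)

print(next_state("E", "local_write"))  # → M
```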

  • Intel Core Duo Block Diagram

  • Intel x86 Multicore Organization: Core i7
    - November 2008
    - Four x86 SMT processors
    - Dedicated L2, shared L3 cache
    - Speculative prefetch for caches
    - On-chip DDR3 memory controller
      - Three 8-byte channels (192 bits) giving 32 GB/s
      - No front-side bus (just like labs 1 & 2 with the SDRAM controller)
    - QuickPath Interconnect (QPI video if time allows)
      - Cache-coherent point-to-point link
      - High-speed communication between processor chips
      - 6.4 G transfers per second, 16 bits per transfer
      - Dedicated bi-directional pairs
      - Total bandwidth 25.6 GB/s
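Both bandwidth figures above follow from simple arithmetic (a back-of-the-envelope check; we assume a DDR3-1333 transfer rate of 1.333 GT/s, which the slide does not state explicitly):

```python
# Memory: channels * bytes per transfer * transfers per second
memory_bw = 3 * 8 * 1.333        # ≈ 32 GB/s

# QPI: 6.4 GT/s * 2 bytes (16 bits) per transfer, counted in both
# directions of the bi-directional link pair
qpi_bw = 6.4 * (16 / 8) * 2      # = 25.6 GB/s

print(round(memory_bw), qpi_bw)  # → 32 25.6
```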

  • Intel Core i7 Block Diagram

  • ARM11 MPCore
    - ARM vs. x86 (and Microsoft): "Intel started this fight by challenging ARM with its Atom processor, which is moving downmarket and towards smartphones. Apparently, the major ARM vendors are feeling the threat, are now moving upmarket and are beginning to make their run at low-end PCs and storage appliances to put the pressure back on Intel."
    http://www.tgdaily.com/trendwatch-features/41561-the-coming-arm-vs-intel-pc-battle

  • ARM11 MPCore
    - Up to 4 processors, each with its own L1 instruction and data cache
    - Distributed Interrupt Controller (DIC)
      - Recall the APIC from Intel's core architecture
    - Timer per CPU
    - Watchdog (feed it or it barks!)
      - Warning alerts for software failures
      - Counts down from a predetermined value; issues a warning at zero
    - CPU interface
      - Interrupt acknowledgement, masking and completion acknowledgement
    - CPU
      - A single ARM11 core here is called an MP11
    - Vector floating-point unit (VFP): FP co-processor
    - L1 cache
    - Snoop control unit: L1 cache coherency
    http://barfblog.foodsafety.ksu.edu/DogObedienceTraining.jpg

  • ARM11 MPCore Block Diagram

  • ARM11 MPCore Interrupt Handling
    - The Distributed Interrupt Controller (DIC) collates interrupts from many sources (ironically, it is a centralized controller)
    - It provides:
      - Masking (who can ignore an interrupt)
      - Prioritization (CPU A is more important than CPU B)
      - Distribution to the target MP11 CPUs
      - Status tracking (of interrupts)
      - Software interrupt generation
    - The number of interrupts is independent of the MP11 CPU design
    - Memory-mapped DIC control registers
      - Accessed by the CPUs via a private interface through the SCU
    - The DIC can:
      - Route interrupts to a single CPU or to multiple CPUs
      - Provide inter-processor communication: a thread on one CPU can cause activity by a thread on another CPU

  • DIC Routing
    - Direct to a specific CPU
    - To a defined group of CPUs
    - To all CPUs
    - The OS can generate an interrupt to: all but self, self, or another specific CPU
    - Typically combined with shared memory for inter-processor communication
    - 16 interrupt IDs are available for inter-processor communication (per CPU)
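The routing choices above can be sketched as a small target-selection function (an illustration only; the mode names are ours, and the real DIC does this in hardware via memory-mapped registers):

```python
ALL_CPUS = {0, 1, 2, 3}  # up to four MP11 CPUs

def route(mode, sender=None, targets=None):
    """Return the set of CPUs that should receive the interrupt."""
    if mode == "specific":       # direct to a defined group (or one CPU)
        return set(targets)
    if mode == "all":            # broadcast to every CPU
        return set(ALL_CPUS)
    if mode == "all_but_self":   # OS-generated, e.g. TLB shootdown style
        return ALL_CPUS - {sender}
    if mode == "self":           # OS self-interrupt
        return {sender}
    raise ValueError(f"unknown routing mode: {mode}")

print(sorted(route("all_but_self", sender=2)))  # → [0, 1, 3]
```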

  • Interrupt States
    - Inactive
      - Non-asserted, or completed by that CPU but still pending or active in others (e.g. an allgather)
    - Pending
      - Asserted, but processing has not started on that CPU
    - Active
      - Started on that CPU but not complete
      - Can be pre-empted by a higher-priority interrupt
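The per-CPU life cycle of an interrupt described above can be sketched as a tiny state machine (the state names come from the slide; the event names and transition table are an illustrative guess, not the DIC's actual register interface):

```python
# (current state, event) -> next state, for one interrupt on one CPU.
TRANSITIONS = {
    ("inactive", "assert"):   "pending",   # a source raises the interrupt
    ("pending",  "start"):    "active",    # this CPU begins processing it
    ("active",   "complete"): "inactive",  # the handler finishes
}

def step(state, event):
    # Unlisted pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

s = "inactive"
for e in ("assert", "start", "complete"):
    s = step(s, e)
print(s)  # → inactive
```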

  • Interrupt Sources
    - Inter-processor interrupts (IPI)
      - Private to the CPU
      - ID0-ID15 (16 IPIs per CPU, as mentioned earlier)
      - Software-triggered
      - Priority depends on the receiving CPU, not the source
    - Private timer and/or watchdog interrupt
      - ID29 and ID30
    - Legacy FIQ line
      - Legacy FIQ pin, per CPU, bypasses the interrupt distributor
      - Directly drives interrupts to the CPU
    - Hardware
      - Triggered by programmable events on associated interrupt lines
      - Up to 224 lines, starting at ID32

  • ARM11 MPCore Interrupt Distributor

  • Cache Coherency
    - The Snoop Control Unit (SCU) resolves most shared-data bottleneck issues
    - Note: L1 cache coherency is based on MESI, similar to Intel's core architecture
    - 3 types of SCU shared-data resolution:
      - Direct data intervention
        - Copies clean entries between L1 caches without accessing external memory or L2
        - Can resolve a local L1 miss from a remote L1 rather than from L2
        - Reduces read-after-write traffic from L1 to L2
      - Duplicated tag RAMs
        - Cache tags are implemented as a separate block of RAM, and a copy is held in the SCU, so the SCU knows when 2 CPUs hold the same cache line
        - The tag RAM has one entry per line in the cache
        - The SCU uses the duplicate tags to check data availability before sending coherency commands, sending them only to CPUs that must update their coherent data cache
        - Less bus locking, due to less communication during the coherency step
      - Migratory lines
        - Allow moving dirty data between CPUs without writing to L2 and reading back from external memory
    (See Stallings, Ch. 18.5, p. 703)

  • Performance Effect of Multiple Cores

  • Recommended Reading
    - Multicore Association web site
    - Stallings, Chapter 18
    - ARM web site
    - (if we have time) http://www.intel.com/technology/quickpath/index.htm
    http://www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html
    http://www.eetimes.com/news/design/features/showArticle.jhtml?articleID=23901143

    * SMT is superscalar with parallel threads in the issue slots.
    * SSE is Streaming SIMD Extensions.

