Lecture 03: Instruction Set Principles
CSCE 513 Computer Architecture
Department of Computer Science and Engineering
Yonghong Yan
[email protected]://cse.sc.edu/~yanyh

Contents
1. Introduction
2. Classifying Instruction Set Architectures
3. Memory Addressing
4. Type and Size of Operands
5. Operations in the Instruction Set
6. Instructions for Control Flow
7. Encoding an Instruction Set
8. Crosscutting Issues: The Role of Compilers
9. RISC-V ISA
• Supplement (not covered)
  – RISC vs CISC
  – Comparison of ISA
• Appendix K

1 Introduction
Instruction Set Architecture
  – The portion of the machine visible to the assembly-level programmer or to the compiler writer
  – To use the hardware of a computer, we must speak its language
  – The words of a computer language are called instructions, and its vocabulary is called an instruction set

[Figure: the instruction set as the layer between software and hardware]

Instr. #   Operation + Operands
i          movl -4(%ebp), %eax
(i+1)      addl %eax, (%edx)
(i+2)      cmpl 8(%ebp), %eax
(i+3)      jl L5
           :
L5:

sum.s for X86
• http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
• https://en.wikibooks.org/wiki/X86_Assembly/SSE
2 operands
-8(%eax): memory address

sum.s for RISC-V
https://riscv.org/
2 or 3 operands
-20(s0): memory address

ISA in Real
• A PDF document that defines the model/architecture/interface of the machine
  – X86 and Intel SDM: https://software.intel.com/en-us/articles/intel-sdm
    • Several thousand pages
  – RISC-V ISA Spec: https://riscv.org/specifications/
    • Latest version 2.2, 145 pages
• A specification that provides the ISA details
• Review Chapter 2 of the COD book

2 Classifying Instruction Set Architectures
• Operand storage in CPU: Where are they other than memory?
• # explicit operands named per instruction: How many? Min, max, average
• Addressing mode: How is the effective address for an operand calculated? Can all use any mode?
• Operations: What are the options for the opcode?
• Type & size of operands: How is typing done? How is the size specified?
These choices critically affect the number of instructions, CPI, and CPU cycle time.

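The three quantities named on this slide are the factors of the usual processor performance equation. It is written out here for reference; the symbol names are a notational choice, not taken from the slide:

```latex
\text{CPU time} = \text{IC} \times \text{CPI} \times T_{\text{cycle}}
```

Here IC is the number of instructions executed, CPI the average clock cycles per instruction, and T_cycle the clock cycle time. An ISA choice that lowers one factor often raises another.
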
ISA Classification
• Most basic differentiation: internal storage in a processor
  – Operands may be named explicitly or implicitly
• Major choices:
  1. In an accumulator architecture, one operand is implicitly the accumulator => similar to a calculator
  2. The operands in a stack architecture are implicitly on the top of the stack
  3. The general-purpose register architectures have only explicit operands – either registers or memory locations

Four ISA Classes
• Register-memory: X86 (CISC)
• Register-register: RISC (e.g. ARM, MIPS, RISC-V, Power)

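To make the classes concrete, the sketch below computes C = A + B on four toy machines (stack, accumulator, register-memory, load-store). The mnemonics in the comments and the dictionary standing in for memory are illustrative assumptions, not any real ISA.

```python
# Toy models of the four ISA classes, each computing C = A + B.
# Each function returns the number of "instructions" it used.

def run_stack(mem):
    stack = []
    stack.append(mem["A"])                   # Push A
    stack.append(mem["B"])                   # Push B
    stack.append(stack.pop() + stack.pop())  # Add   (both operands implicit)
    mem["C"] = stack.pop()                   # Pop C
    return 4

def run_accumulator(mem):
    acc = mem["A"]        # Load A     (accumulator is the implicit operand)
    acc += mem["B"]       # Add B
    mem["C"] = acc        # Store C
    return 3

def run_register_memory(mem):
    r1 = mem["A"]         # Load R1, A
    r3 = r1 + mem["B"]    # Add R3, R1, B   (one operand comes from memory)
    mem["C"] = r3         # Store R3, C
    return 3

def run_load_store(mem):
    r1 = mem["A"]         # Load R1, A
    r2 = mem["B"]         # Load R2, B
    r3 = r1 + r2          # Add R3, R1, R2  (ALU touches registers only)
    mem["C"] = r3         # Store R3, C
    return 4

results = {}
for machine in (run_stack, run_accumulator, run_register_memory, run_load_store):
    memory = {"A": 2, "B": 3}
    count = machine(memory)
    results[machine.__name__] = (memory["C"], count)
```

All four produce the same C. The differences are how many operands each instruction names and how many of them may come from memory.
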
Register Machines
• How many registers are sufficient?
• General-purpose registers vs. special-purpose registers
  – compiler flexibility and hand-optimization
• Two major concerns for arithmetic and logical instructions (ALU)
  1. Two or three operands
     • X + Y ⇒ X
     • X + Y ⇒ Z
  2. How many of the operands may be memory addresses (0 – 3)

Hence, register architecture classification (# mem, # operands)

Number of memory addresses | Maximum number of operands allowed | Type of architecture | Examples
0 | 3 | Load-store      | Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, TM32
1 | 2 | Register-memory | IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x
2 | 2 | Memory-memory   | VAX (also has 3-operand formats)
3 | 3 | Memory-memory   | VAX (also has 2-operand formats)

(0,3): Register-Register (RISC)
• ALU is register to register – also known as pure Reduced Instruction Set Computer (RISC)
• Advantages
  – Simple fixed-length instruction encoding
  – Decode is simple since there are few instruction types
  – Simple code generation model
  – Instruction CPI tends to be very uniform
    • Except for memory instructions of course, but there are only 2 of them: load and store
• Disadvantages
  – Instruction count tends to be higher
  – Some instructions are short, wasting instruction word bits

(1,2): Register-Memory (CISC, X86)
• Evolved RISC and also old CISC
  – new RISC machines capable of doing speculative loads
  – predicated and/or deferred loads are also possible
• Advantages
  – data access to ALU immediate without loading first
  – instruction format is relatively simple to encode
  – code density is improved over Register (0,3) model
• Disadvantages
  – operands are not equivalent: source operand may be destroyed
  – need for memory address field may limit # of registers
  – CPI will vary
    • if memory target is in L0 cache then not so bad
    • if not, life gets miserable

(2,2) or (3,3): Memory-Memory
Not used today
• True and most complex CISC model
  – currently extinct and likely to remain so
  – more complex memory actions are likely to appear but not directly linked to the ALU
• Advantages
  – most compact code
  – doesn't waste registers for temporary values
    • good idea for use-once data, e.g. streaming media
• Disadvantages
  – large variation in instruction size: may need a shoe-horn
  – large variation in CPI, i.e. work per instruction
  – exacerbates the infamous memory bottleneck
    • register file reduces memory accesses if reused

Summary: Tradeoffs for the ISA Classes

Register-register (0,3)
  Advantages: Simple, fixed-length instruction encoding. Simple code generation model. Instructions take similar numbers of clocks to execute.
  Disadvantages: Higher instruction count than architectures with memory references in the instructions. More instructions and lower instruction density lead to larger programs.

Register-memory (1,2)
  Advantages: Data can be accessed without a separate load instruction first. Instruction format tends to be easy to encode and yields good density.
  Disadvantages: Operands are not equivalent since a source operand is destroyed. Encoding a register number and a memory address in each instruction may restrict the number of registers. Clocks per instruction vary by operand location.

Memory-memory (2,2) or (3,3)
  Advantages: Most compact. Does not waste registers for temporaries.
  Disadvantages: Large variation in instruction size, especially for three-operand instructions. In addition, large variation in work per instruction. Memory accesses create a memory bottleneck. (Not used today)

3 Memory Addressing
• Objects have byte addresses – the number of bytes counted from the beginning of memory
• Object length:
  – bytes (8 bits), half words (16 bits),
  – words (32 bits), and double words (64 bits).
  – The type is implied in the opcode, e.g.,
    • LDB – load byte
    • LDW – load word, etc.
• Byte ordering
  – Little Endian: puts the byte whose address is xx00 at the least significant position in the word. (7,6,5,4,3,2,1,0)
  – Big Endian: puts the byte whose address is xx00 at the most significant position in the word. (0,1,2,3,4,5,6,7)
    • Problem occurs when exchanging data among machines with different orderings

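The two byte orderings, and the exchange problem in the last bullet, can be sketched in Python with `int.to_bytes` and `int.from_bytes`. The value 0x0A0B0C0D is an arbitrary example.

```python
# A 32-bit word laid out in memory under each byte ordering.
value = 0x0A0B0C0D

little = value.to_bytes(4, "little")  # least significant byte at the lowest address
big = value.to_bytes(4, "big")        # most significant byte at the lowest address

# The exchange problem: bytes written by a little-endian machine and
# read back as big-endian yield a different number.
misread = int.from_bytes(little, "big")
```

Reading the bytes back with the same ordering that wrote them recovers the original value. Mixing orderings silently byte-swaps it.
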
Interpreting Memory Addresses
• Alignment issues
  – Accesses to objects larger than a byte must be aligned.
    • An access to an object of size s bytes at byte address A is aligned if A mod s = 0.
  – Misalignment causes hardware complications
    • since memory is typically aligned on a word or a double-word boundary
    • Misalignment typically results in an alignment fault that must be handled by the OS
• Hence
  – byte address is anything: never misaligned
  – half word: even addresses, low-order address bit = 0 (XXXXXXX0) else trap
  – word: low-order 2 address bits = 0 (XXXXXX00) else trap
  – double word: low-order 3 address bits = 0 (XXXXX000) else trap

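The alignment rule A mod s = 0 and the equivalent low-order-bit test can be checked with a few lines of Python. The function names are illustrative, and the bit-mask form assumes s is a power of two.

```python
def is_aligned(address: int, size: int) -> bool:
    """Aligned if A mod s == 0 (the definition on the slide)."""
    return address % size == 0

def is_aligned_bits(address: int, size: int) -> bool:
    """Same test using the low-order address bits; size must be a power of two."""
    return address & (size - 1) == 0

# Size in bytes for each object type listed on the slide.
SIZES = {"byte": 1, "half word": 2, "word": 4, "double word": 8}

# Which object types may legally start at address 0x1006?
legal_at_0x1006 = [name for name, s in SIZES.items() if is_aligned(0x1006, s)]
```

At address 0x1006 only bytes and half words are aligned. A word or double word access there would trap on a machine that enforces alignment.
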
Memory Alignment

Aligned/Misaligned Addresses

Addressing Modes
• How does an architecture specify the effective address of an object?
  – Effective address: the actual memory address specified by the addressing mode.
    • E.g. Mem[R[R1]] refers to the contents of the memory location whose address is given by the contents of register 1 (R1).
• Addressing modes:
  – Register
  – Immediate
  – Displacement
  – Register indirect, ...
-20(s0): memory address

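The four modes listed above can be sketched as a small operand-fetch function. The register file and memory are modeled as Python dictionaries, and the mode names and the sample contents are assumptions made for illustration.

```python
def operand_value(mode, regs, mem, reg=None, imm=0):
    """Return the operand selected by one of four common addressing modes."""
    if mode == "register":            # operand is R[reg]
        return regs[reg]
    if mode == "immediate":           # operand is the constant held in the instruction
        return imm
    if mode == "register_indirect":   # operand is Mem[R[reg]]
        return mem[regs[reg]]
    if mode == "displacement":        # operand is Mem[imm + R[reg]], e.g. -20(s0)
        return mem[imm + regs[reg]]
    raise ValueError(f"unknown addressing mode: {mode}")

# Hypothetical machine state: s0 holds address 100.
regs = {"s0": 100}
mem = {100: 7, 80: 42}

via_displacement = operand_value("displacement", regs, mem, reg="s0", imm=-20)
via_indirect = operand_value("register_indirect", regs, mem, reg="s0")
```

With s0 = 100, the operand -20(s0) has effective address 80, so the displacement access reads Mem[80].
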
Address Modes

Addressing Mode Impacts
• Instruction counts
• Architecture complexity
• CPI

Summary of Use of Memory Addressing Modes

Displacement Values are Widely Distributed
Impacts instruction length

Displacement Addressing Mode
• Benchmarks show
  – 12 bits of displacement would capture about 75% of the full 32-bit displacements
  – 16 bits should capture about 99%
• Remember: optimize for the common case. Hence, the choice is at least 12-16 bits
• For addresses that do fit in displacement size:
    Add R4, 10000(R0)
• For addresses that don't fit in displacement size, the compiler must do the following:
    Load R1, 1000000
    Add R1, R0
    Add R4, 0(R1)

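The compiler's decision between the one-instruction and three-instruction sequences can be sketched as below. It assumes a signed two's-complement displacement field and a 16-bit default width; both are assumptions for illustration, and the emitted strings simply mirror the slide's pseudo-assembly.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

def lower_displacement(disp: int, bits: int = 16) -> list:
    """Emit the slide's pseudo-assembly for 'Add R4, disp(R0)'."""
    if fits_signed(disp, bits):
        return [f"Add R4, {disp}(R0)"]       # common case: one instruction
    return [                                 # rare case: build the address first
        f"Load R1, {disp}",
        "Add R1, R0",
        "Add R4, 0(R1)",
    ]

short_form = lower_displacement(10000)
long_form = lower_displacement(1000000)
```

A wider field makes the long form rarer but costs bits in every instruction, which is the tradeoff behind the 12-16 bit recommendation.
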
Immediate Addressing Mode
• Used where we want to get to a numerical value in an instruction
• Around 25% of the operations have an immediate operand

At high level:
    a = b + 3;
    if (a > 17)
        goto Addr

At assembler level:
    Load R2, #3
    Add R0, R1, R2
    Load R2, #17
    CMPBGT R1, R2
    Load R1, Address
    Jump (R1)

About 25% of data transfer and ALU operations have an immediate operand
Impacts instruction length

Number of Bits for Immediate
• 16 bits would capture about 80% and 8 bits about 50%.
Impacts instruction length

Summary: Memory Addressing
• A new architecture is expected to support at least: displacement, immediate, and register indirect
  – represent 75% to 99% of the addressing modes
• The size of the address for displacement mode should be at least 12-16 bits
  – capture 75% to 99% of the displacements
• The size of the immediate field should be at least 8-16 bits
  – capture 50% to 80% of the immediates
Processors rely on compilers to generate code using those addressing modes

4 Type and Size of Operands
How is the type of an operand designated?
• The type of the operand is usually encoded in the opcode
  – e.g., LDB – load byte; LDW – load word
• Common operand types (they imply their sizes):
    Character (8 bits or 1 byte)
    Half word (16 bits or 2 bytes)
    Word (32 bits or 4 bytes)
    Double word (64 bits or 8 bytes)
    Single-precision floating point (4 bytes or 1 word)
    Double-precision floating point (8 bytes or 2 words)
  ✓ Characters are almost always in ASCII
  ✓ 16-bit Unicode (used in Java) is gaining popularity
  ✓ Integers are two's complement binary
  ✓ Floating points follow the IEEE standard 754
• Some architectures support packed decimal: 4 bits are used to encode the values 0-9; 2 decimal digits are packed into each byte

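Packed decimal is easy to misread, so here is a minimal sketch of the encoding described in the last bullet: one decimal digit per 4-bit nibble, two digits per byte. The function names and the choice to left-pad odd-length inputs with a zero digit are assumptions; real packed-decimal formats often also carry a sign nibble, which is omitted here.

```python
def pack_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits, two per byte, high nibble first."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of digits 0-9")
    if len(digits) % 2:
        digits = "0" + digits            # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )

def unpack_bcd(data: bytes) -> str:
    """Inverse of pack_bcd: expand each byte back into two decimal digits."""
    return "".join(f"{b >> 4}{b & 0x0F}" for b in data)

packed = pack_bcd("1234")
```

The packed bytes read like the decimal number when printed in hexadecimal, which is why the format was popular for business data.
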
Distribution of Data Accesses by Size

Summary: Type and Size of Operands
• A 32-bit architecture supports 8-, 16-, and 32-bit integers, and 32-bit and 64-bit IEEE 754 floating-point data.
• A new 64-bit address architecture supports 64-bit integers
• Media processors and DSPs need wider accumulating registers for accuracy.

5 Operations in the Instruction Set
• All computers generally provide a full set of operations for the first three categories
• All computers must have some instruction support for basic system functions
• Graphics instructions typically operate on many smaller data items in parallel

Top 10 Instructions for 80x86

Instruction Encoding
• RISC-V R-format instruction
• RISC-V I-format instruction

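The slide figures for the two formats are not reproduced here, so the sketch below packs the fields by hand using the RV32I field positions (R-format: funct7, rs2, rs1, funct3, rd, opcode; I-format: imm[11:0], rs1, funct3, rd, opcode). The helper names are illustrative and not part of any toolchain.

```python
def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """R-format: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_i(imm, rs1, funct3, rd, opcode):
    """I-format: imm[11:0] in bits 31:20, then rs1, funct3, rd, opcode as in R-format."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

# add x1, x2, x3  (opcode 0110011, funct3 000, funct7 0000000)
ADD_X1_X2_X3 = encode_r(0b0000000, 3, 2, 0b000, 1, 0b0110011)

# addi x1, x2, -1 (opcode 0010011, funct3 000); the immediate is 12-bit two's complement
ADDI_X1_X2_M1 = encode_i(-1, 2, 0b000, 1, 0b0010011)
```

The rs1, funct3, rd, and opcode fields sit in the same bit positions in both formats, which is what keeps decoding uniform.
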
6 Instructions for Control Flow
• Control instructions change the flow of control:
  – instead of executing the next instruction, the program branches to the address specified in the branching instructions
• They break the pipeline
  – Difficult to optimize out
  – AND they are frequent
• Four types of control instructions
  – Conditional branches
    • if…else, for/while, switch/case, …
  – Jumps – unconditional transfer
    • goto
  – Procedure calls
    • foo()
  – Procedure returns
    • return

Breakdown of Control Flow Instructions
  – Conditional branches
  – Jumps – unconditional transfer
  – Procedure calls
  – Procedure returns
• Issues:
  – Where is the target address? How to specify it? (label)
  – Caller: Where is the return address kept? How are the arguments passed?
  – Callee: Where is the return address? How are the results passed?

Addressing Modes for Control Flow Instructions
• PC-relative (Program Counter)
  – Supply a displacement added to the PC
    • Known at compile time for jumps, branches, and calls (specified within the instruction)
  – The target is often near the current instruction
    • Requiring fewer bits
    • Code runs independently of where it is loaded (position independence)
• Register indirect addressing – dynamic addressing
  – The target address may not be known at compile time
  – Naming a register that contains the target address
    • Case or switch statements
    • Virtual functions or methods in C++ or Java
    • High-order functions or function pointers in C or C++
    • Dynamically shared libraries

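Position independence follows directly from storing target minus PC rather than the target itself. The sketch below assumes a signed offset field measured in bytes; the function names, field width, and load addresses are illustrative.

```python
def branch_offset(pc: int, target: int, bits: int) -> int:
    """Offset stored in the instruction; raises if it does not fit in a signed field."""
    offset = target - pc
    if not -(1 << (bits - 1)) <= offset < (1 << (bits - 1)):
        raise ValueError(f"target too far for a {bits}-bit PC-relative field")
    return offset

def branch_target(pc: int, offset: int) -> int:
    """What the hardware computes at run time: PC plus the stored displacement."""
    return pc + offset

# The same loop (branch at +0x40 jumping back to +0x10) loaded at two base addresses.
offset_at_0x1000 = branch_offset(0x1000 + 0x40, 0x1000 + 0x10, bits=8)
offset_at_0x8000 = branch_offset(0x8000 + 0x40, 0x8000 + 0x10, bits=8)
```

Both loads encode the same offset, so the instruction bits do not change when the code moves. An absolute target address would have to be patched at load time.
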
Branch Distances

Conditional Branch Options
Figure 2.21 Major methods for evaluating branch conditions

Comparison Type vs. Frequency
• Most loops go from 0 to n.
• Most backward branches are loops – taken about 90% of the time

Program | % backward branches | % all control instructions that modify PC
gcc     | 26% | 63%
spice   | 31% | 63%
TeX     | 17% | 70%
Average | 25% | 65%

Procedure Invocation Options
• Procedure calls and returns
  – control transfer
  – state saving; the return address must be saved
  Newer architectures require the compiler to generate stores and loads for each register saved and restored
• Two basic conventions in use to save registers
  – caller saving: the calling procedure must save the registers that it wants preserved for access after the call
    • the called procedure need not worry about registers
  – callee saving: the called procedure must save the registers it wants to use
    • leaving the caller unrestrained
  Most real systems today use a combination of both
• The application binary interface (ABI) sets down the basic rules as to which registers should be caller saved and which should be callee saved

7. Encoding an Instruction Set
• Opcode: specifying the operation
• # of operands
  – addressing mode
  – address specifier: tells what addressing mode is used
  – Load-store computer
    • Only one memory operand
    • Only one or two addressing modes
• The architecture must balance several competing forces when encoding the instruction set:
  – # of registers && addressing modes
  – Size of registers && addressing mode fields
  – Average instruction size && average program size
  – Easy to handle in pipeline implementation

Example: x86 and Alpha
• x86:
• Alpha:

Three Basic Variations for Instruction Encoding
The length of 80x86 (CISC) instructions varies between 1 and 17 bytes.
The length of most RISC ISA instructions is 4 bytes.
X86 programs are generally smaller than RISC ISA programs.
To reduce RISC code size

Instruction Length Tradeoffs
• Fixed length: length of all instructions the same
  + Easier to decode single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (Why is this bad?)
  -- Harder-to-extend ISA (how to add new instructions?)
• Variable length: length of instructions different (determined by opcode and sub-opcode)
  + Compact encoding (Why is this good?)
    Intel 432: Huffman encoding (sort of). 6 to 321 bit instructions. How?
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently
• Tradeoffs
  – Code size (memory space, bandwidth, latency) vs. hardware complexity
  – ISA extensibility and expressiveness
  – Performance? Smaller code vs. imperfect decode

Uniform vs Non-uniform Decode
• Uniform decode: same bits in each instruction correspond to the same meaning
  – Opcode is always in the same location
  – immediate values, …
  – Many "RISC" ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate target address before knowing the instruction is a branch
  -- Restricts instruction format (fewer instructions?) or wastes space
• Non-uniform decode
  – E.g., opcode can be the 1st-7th byte in x86
  + More compact and powerful instruction format
  -- More complex decode logic

Reduced Code Size in RISCs
• Hybrid encoding – support 16-bit and 32-bit instructions in RISC, e.g. ARM Thumb, MIPS16 and RISC-V
  – Narrow instructions support fewer operations, smaller address and immediate fields, fewer registers, and two-address format rather than the classic three-address format
  – Claim a code size reduction of up to 40%
• Compression in IBM's CodePack
  – Adds hardware to decompress instructions as they are fetched from memory on an instruction cache miss
  – The instruction cache contains full 32-bit instructions, but compressed code is kept in main memory, ROMs, and the disk
  – Claim code reduction 35% - 40%
  – PowerPC creates a hash table in memory that maps between compressed and uncompressed addresses. Code size 35%~40%
• Hitachi's SuperH: fixed 16-bit format
  – 16 rather than 32 registers
  – fewer instructions

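Hybrid encoding still needs a cheap way to find instruction boundaries. RISC-V does this with the two lowest bits of the first 16-bit parcel: anything other than 11 marks a 16-bit compressed instruction. The sketch below applies that rule to a byte stream; it assumes only 16-bit and 32-bit instructions are present (longer formats are ignored), and the function name is illustrative.

```python
def split_instructions(code: bytes) -> list:
    """Split a little-endian RISC-V byte stream into 16-bit and 32-bit instructions."""
    out = []
    i = 0
    while i < len(code):
        first_parcel = int.from_bytes(code[i:i + 2], "little")
        # Low two bits == 11 -> 32-bit instruction; otherwise 16-bit compressed.
        length = 4 if (first_parcel & 0b11) == 0b11 else 2
        if i + length > len(code):
            raise ValueError("truncated instruction at end of stream")
        out.append(code[i:i + length])
        i += length
    return out

ADD = (0x003100B3).to_bytes(4, "little")   # add x1, x2, x3 (32-bit)
C_NOP = (0x0001).to_bytes(2, "little")     # c.nop (16-bit compressed)

lengths = [len(ins) for ins in split_instructions(ADD + C_NOP + ADD)]
```

Unlike x86 decoding, the length is known after looking at two bits, so the hardware cost of mixing sizes stays small.
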
Summary of Instruction Encoding
• Three choices
  – Variable, fixed and hybrid
  – Note the differences between hybrid and variable
• The choice of instruction encoding is a tradeoff between
  – For performance: fixed encoding
  – For code size: variable encoding
• How hybrid encoding is used in RISC to reduce code size
  – 16-bit and 32-bit
• In general, we see:
  – RISC: fixed or hybrid
  – CISC: variable

8 The Role of Compilers
• Almost all programming is done in high-level languages.
  – An ISA is essentially a compiler target.
• See backup slides for the compilation stages of most compilers, e.g. gcc
• Compiler goals:
  – All correct programs execute correctly
  – Most compiled programs execute fast (optimizations)
  – Fast compilation
  – Debugging support

Typical Modern Compiler Structure
Figure A.19 Compilers typically consist of two to four passes, with more highly optimizing compilers having more passes. This structure maximizes the probability that a program compiled at various levels of optimization will produce the same output when given the same input. The optimizing passes are designed to be optional and may be skipped when faster compilation is the goal and lower-quality code is acceptable. A pass is simply one phase in which the compiler reads and transforms the entire program. (The term phase is often used interchangeably with pass.) Because the optimizing passes are separated, multiple languages can use the same optimizing and code generation passes. Only a new front end is required for a new language.

Optimization Types
• High level – done at or near source code level
  – If a procedure is called only once, put it in-line and save the CALL
  – more general case: if call-count < some threshold, put them in-line
• Local – done within straight-line code
  – common sub-expressions produce the same value – either allocate a register or replace with a single copy
  – constant propagation – replace a constant-valued variable with the constant
  – stack height reduction – re-arrange the expression tree to minimize temporary storage needs
• Global – across a branch
  – copy propagation – replace all instances of a variable A that has been assigned X (i.e., A = X) with X.
  – code motion – remove code from a loop that computes the same value each iteration of the loop and put it before the loop
  – simplify or eliminate array addressing calculations in loops

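Code motion is the easiest of these to show as a before/after pair. The sketch below is hand-written Python standing in for what a compiler would do on its intermediate code; the function names and the expression `a * b + 1` are made up for illustration.

```python
def scale_before(values, a, b):
    """Loop as written: k is recomputed on every iteration although it never changes."""
    out = []
    for v in values:
        k = a * b + 1          # loop-invariant computation
        out.append(v * k)
    return out

def scale_after(values, a, b):
    """After code motion: the invariant computation is hoisted above the loop."""
    k = a * b + 1              # computed once, before the loop
    out = []
    for v in values:
        out.append(v * k)
    return out

before = scale_before([1, 2, 3], 2, 3)
after = scale_after([1, 2, 3], 2, 3)
```

The transformation is legal only because `a` and `b` are not modified inside the loop, which is exactly what the compiler's global analysis has to prove.
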
Optimization Types
• Machine-dependent optimizations – based on machine knowledge
  – strength reduction – replace multiply by a constant with shifts and adds
    • would make sense if there was no hardware support for MUL
    • a trickier version: 17× = arithmetic left shift 4 and add
• Pipelining scheduling – reorder instructions to improve pipeline performance
  – dependency analysis
  – branch offset optimization – reorder code to minimize branch offsets

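The 17× example works because 17 = 16 + 1, so 17·x = (x << 4) + x. A quick sketch to confirm the identity (the function name is illustrative):

```python
def times17(x: int) -> int:
    """Strength-reduced multiply: 17 * x == (x << 4) + x, since 17 = 2**4 + 1."""
    return (x << 4) + x

# Check the identity over a range that includes negative values.
matches = all(times17(x) == 17 * x for x in range(-1000, 1001))
```

Python integers are unbounded, so overflow does not arise here. On fixed-width hardware both forms wrap around identically in two's-complement arithmetic.
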
Major Types of Optimizations

Compiler Optimizations – Change in IC
• L0 – unoptimized
• L1 – local opts, code scheduling, & local reg. allocation
• L2 – global opts and loop transformations, & global reg. allocation
• L3 – procedure integration
gcc -O2 hello.c -o hello

Compiler-Based Register Optimization
• Compiler assumes small number of registers (16-32)
  – Optimizing use is up to compiler
  – HLL programs have no explicit references to registers
• Compiler approach
  – Assign symbolic or virtual register to each candidate variable
  – Map (unlimited) symbolic registers to real registers
  – Symbolic registers that do not overlap can share real registers
  – If you run out of real registers, some variables are placed in memory
    • Spilling

Graph Coloring
• Given a graph of nodes and edges
  – Assign a color to each node
    • Adjacent nodes have different colors
    • Use minimum number of colors
• Register allocation
  – Nodes are symbolic registers
  – Two registers that are live in the same program fragment are joined by an edge
  – Try to color the graph with n colors, where n is the number of real registers
  – Nodes that cannot be colored are placed in memory
https://en.wikipedia.org/wiki/Graph_coloring

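The allocation steps above can be sketched with a simple greedy coloring. This is a simplified heuristic for illustration, not the allocator of gcc or any production compiler. The interference graph, the highest-degree-first ordering, and the function name are all assumptions.

```python
def allocate_registers(interference: dict, n_regs: int):
    """Greedy graph coloring. Returns (assignment, spilled).

    interference maps each symbolic register to the set of symbolic
    registers that are live at the same time (its neighbors).
    """
    assignment, spilled = {}, []
    # Color the most constrained nodes first; ties are broken by name.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for node in order:
        used = {assignment[nb] for nb in interference[node] if nb in assignment}
        free = [r for r in range(n_regs) if r not in used]
        if free:
            assignment[node] = free[0]   # lowest-numbered free real register
        else:
            spilled.append(node)         # no color left: keep it in memory
    return assignment, spilled

# Hypothetical interference graph: a, b, c are mutually live; d overlaps only a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

regs3, spill3 = allocate_registers(graph, 3)
regs2, spill2 = allocate_registers(graph, 2)
```

With three real registers every node gets a color. With two, the triangle a-b-c cannot be colored and one of its nodes is spilled to memory.
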
Iron-code Summary
• Section A.2: Use general-purpose registers with a load-store architecture.
• Section A.3: Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect.
• Section A.4: Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers.
  – Now we see 16-bit FP for deep learning in GPU
    • http://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/
• Section A.5: Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and shift.
• Section A.6: Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return.
• Section A.7: Use fixed instruction encoding if interested in performance, and use variable instruction encoding if interested in code size.
• Section A.8: Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist IS
  – Often use separate floating-point registers.
  – The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.

Real World ISA

The details in design are all about trade-offs!