Lecture 03 Instruction Set Principles · 2019-01-10 · Hence, register architecture classification...

Lecture 03 Instruction Set Principles

CSCE 513 Computer Architecture

Department of Computer Science and Engineering
Yonghong Yan

[email protected]://cse.sc.edu/~yanyh

Contents

1. Introduction
2. Classifying Instruction Set Architectures
3. Memory Addressing
4. Type and Size of Operands
5. Operations in the Instruction Set
6. Instructions for Control Flow
7. Encoding an Instruction Set
8. Crosscutting Issues: The Role of Compilers
9. RISC-V ISA

• Supplement (not covered)
  – RISC vs CISC
  – Comparison of ISA

• Appendix K

1 Introduction

Instruction Set Architecture
  – the portion of the machine visible to the assembly-level programmer or to the compiler writer
  – To use the hardware of a computer, we must speak its language
  – The words of a computer language are called instructions, and its vocabulary is called an instruction set

[Figure: the instruction set is the interface between software and hardware]

Instr. #   Operation + Operands
i          movl -4(%ebp), %eax
(i+1)      addl %eax, (%edx)
(i+2)      cmpl 8(%ebp), %eax
(i+3)      jl L5
           L5:

sum.s for X86

• http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
• https://en.wikibooks.org/wiki/X86_Assembly/SSE

2 operands
-8(%eax): Memory address

sum.s for RISC-V

https://riscv.org/

2 or 3 operands
-20(s0): Memory address

ISA In Real

• A pdf document that defines the model/architecture/interface of the machine
  – X86 and Intel SDM: https://software.intel.com/en-us/articles/intel-sdm
    • Several thousand pages
  – RISC-V ISA Spec: https://riscv.org/specifications/
    • Latest version 2.2, 145 pages

• A specification that provides the ISA details

• Review Chapter 2 of the COD book

2 Classifying Instruction Set Architectures

Operand storage in CPU: Where are operands kept other than in memory?
# explicit operands named per instruction: How many? Min, Max, Average
Addressing mode: How is the effective address for an operand calculated? Can all operands use any mode?
Operations: What are the options for the opcode?
Type & size of operands: How is typing done? How is the size specified?

These choices critically affect number of instructions, CPI, and CPU cycle time
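The three quantities named above combine in the usual CPU time relation, CPU time = instruction count × CPI × clock cycle time. The sketch below only illustrates that product; the function name and the sample numbers are assumptions made up for illustration, not measurements from the lecture.

```python
def cpu_time(instruction_count, cpi, cycle_time_ns):
    """Return execution time in nanoseconds: IC x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical numbers (assumed for illustration only): a load-store ISA
# executes more instructions at a lower CPI, a register-memory ISA executes
# fewer instructions at a higher CPI.
load_store = cpu_time(instruction_count=1_200_000, cpi=1.25, cycle_time_ns=1.0)
reg_memory = cpu_time(instruction_count=1_000_000, cpi=2.0, cycle_time_ns=1.0)
print(load_store, reg_memory)
```

With these assumed inputs the ISA with the higher instruction count still finishes first, which is why instruction count alone is not a performance measure.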

ISA Classification

• Most basic differentiation: internal storage in a processor
  – Operands may be named explicitly or implicitly

• Major choices:
  1. In an accumulator architecture one operand is implicitly the accumulator => similar to calculator
  2. The operands in a stack architecture are implicitly on the top of the stack
  3. The general-purpose register architectures have only explicit operands – either registers or memory location
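To make the implicit/explicit distinction concrete, the sketch below runs C = A + B on three toy interpreters, one per class. The mnemonics (push, pop, load, add, store) and the tuple-based program format are assumptions chosen for illustration; they are not the instruction set of any real machine.

```python
def run_stack(program, memory):
    """Stack machine: ALU operands are implicitly the top two stack entries."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[args[0]] = stack.pop()
    return memory


def run_accumulator(program, memory):
    """Accumulator machine: one operand is implicitly the accumulator."""
    acc = 0
    for op, addr in program:
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc += memory[addr]
        elif op == "store":
            memory[addr] = acc
    return memory


def run_load_store(program, memory):
    """Load-store GPR machine: all operands explicit, ALU ops use registers only."""
    regs = {}
    for op, *args in program:
        if op == "load":
            regs[args[0]] = memory[args[1]]
        elif op == "add":
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        elif op == "store":
            memory[args[1]] = regs[args[0]]
    return memory


# C = A + B expressed for each class (toy encodings, assumed for illustration).
stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
acc_prog = [("load", "A"), ("add", "B"), ("store", "C")]
ls_prog = [("load", "R1", "A"), ("load", "R2", "B"),
           ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

for run, prog in ((run_stack, stack_prog), (run_accumulator, acc_prog),
                  (run_load_store, ls_prog)):
    print(run.__name__, len(prog), "instructions, C =",
          run(prog, {"A": 2, "B": 3})["C"])
```

Note how the add instruction names zero operands on the stack machine, one on the accumulator machine, and three on the load-store machine.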

Four ISA Classes

• Register-memory: X86 (CISC)

• Register-register: RISC (e.g. ARM, MIPS, RISC-V, Power)

Register Machines

• How many registers are sufficient?
• General-purpose registers vs. special-purpose registers
  • compiler flexibility and hand-optimization
• Two major concerns for arithmetic and logical instructions (ALU)
  1. Two or three operands
     • X + Y => X
     • X + Y => Z
  2. How many of the operands may be memory addresses (0 – 3)

Hence, register architecture classification (# mem, # operands)

Number of memory addresses | Maximum number of operands allowed | Type of architecture | Examples
0 | 3 | Load-store | Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, TM32
1 | 2 | Register-memory | IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x
2 | 2 | Memory-memory | VAX (also has 3-operand formats)
3 | 3 | Memory-memory | VAX (also has 2-operand formats)

(0,3): Register-Register (RISC)

• ALU is Register to Register
  – also known as pure Reduced Instruction Set Computer (RISC)

• Advantages
  – Simple fixed-length instruction encoding
  – Decode is simple since there are few instruction types
  – Simple code generation model
  – Instruction CPI tends to be very uniform
    • Except for memory instructions of course
      – but there are only 2 of them: load and store

• Disadvantages
  – Instruction count tends to be higher
  – Some instructions are short, wasting instruction word bits

(1,2): Register-Memory (CISC, X86)

• Evolved RISC and also old CISC
  – new RISC machines capable of doing speculative loads
  – predicated and/or deferred loads are also possible

• Advantages
  – data access to ALU immediate without loading first
  – instruction format is relatively simple to encode
  – code density is improved over Register (0,3) model

• Disadvantages
  – operands are not equivalent: source operand may be destroyed
  – need for memory address field may limit # of registers
  – CPI will vary
    • if memory target is in L0 cache then not so bad
    • if not, life gets miserable

(2,2) or (3,3): Memory-Memory

Not used today

• True and most complex CISC model
  – currently extinct and likely to remain so
  – more complex memory actions are likely to appear but not directly linked to the ALU

• Advantages
  – most compact code
  – doesn't waste registers for temporary values
    • good idea for use-once data, e.g. streaming media

• Disadvantages
  – large variation in instruction size: may need a shoe-horn
  – large variation in CPI, i.e. work per instruction
  – exacerbates the infamous memory bottleneck
    • register file reduces memory accesses if reused

Summary: Tradeoffs for the ISA Classes

Type: Register-register (0,3)
  Advantages: Simple, fixed-length instruction encoding. Simple code generation model. Instructions take similar numbers of clocks to execute.
  Disadvantages: Higher instruction count than architectures with memory references in the instructions. More instructions and lower instruction density lead to larger programs.

Type: Register-memory (1,2)
  Advantages: Data can be accessed without a separate load instruction first. Instruction format tends to be easy to encode and yields good density.
  Disadvantages: Operands are not equivalent since a source operand is destroyed. Encoding a register number and a memory address in each instruction may restrict the number of registers. Clocks per instruction vary by operand location.

Type: Memory-memory (2,2) or (3,3)
  Advantages: Most compact. Does not waste registers for temporaries.
  Disadvantages: Large variation in instruction size, especially for three-operand instructions. In addition, large variation in work per instruction. Memory accesses create a memory bottleneck. (Not used today)

3 Memory Addressing

• Objects have byte addresses – the number of bytes counted from the beginning of memory

• Object Length:
  – bytes (8 bits), half words (16 bits),
  – words (32 bits), and double words (64 bits).
  – The type is implied in opcode, e.g.,
    • LDB – load byte
    • LDW – load word, etc

• Byte Ordering
  – Little Endian: puts the byte whose address is xx00 at the least significant position in the word. (7,6,5,4,3,2,1,0)
  – Big Endian: puts the byte whose address is xx00 at the most significant position in the word. (0,1,2,3,4,5,6,7)

• Problem occurs when exchanging data among machines with different orderings
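The two byte orderings, and the data-exchange problem they cause, can be seen with Python's built-in integer conversions. This is a small sketch; the 32-bit value 0x0A0B0C0D is an arbitrary example chosen so each byte is recognizable.

```python
value = 0x0A0B0C0D  # arbitrary 32-bit example value

# Index 0 of each bytes object corresponds to the lowest memory address.
little = value.to_bytes(4, byteorder="little")
big = value.to_bytes(4, byteorder="big")

print(little.hex())  # least significant byte stored first
print(big.hex())     # most significant byte stored first

# A machine that reads little-endian data as if it were big-endian
# silently gets a different number.
misread = int.from_bytes(little, byteorder="big")
print(hex(misread))
```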

Interpreting Memory Addresses

• Alignment Issues
  – Accesses to objects larger than a byte must be aligned.
    • An access to an object of size s bytes at byte address A is aligned if A mod s = 0.
  – Misalignment causes hardware complications
    • since memory is typically aligned on a word or a double-word boundary
    • Misalignment typically results in an alignment fault that must be handled by the OS

• Hence
  – byte address is anything: never misaligned
  – half word: even addresses, low-order address bit = 0 (XXXXXXX0) else trap
  – word: low-order 2 address bits = 0 (XXXXXX00) else trap
  – double word: low-order 3 address bits = 0 (XXXXX000) else trap
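The alignment rule A mod s = 0 and the equivalent "low-order bits are zero" test can be sketched as follows. The function names are illustrative assumptions; the bit-mask form is only valid when the size is a power of two, as it is for the object sizes listed above.

```python
def is_aligned(address, size):
    """An access of `size` bytes at `address` is aligned if address mod size == 0."""
    return address % size == 0


def is_aligned_bits(address, size):
    """Same test for power-of-two sizes: the low-order log2(size) bits must be 0."""
    return (address & (size - 1)) == 0


for size, name in ((1, "byte"), (2, "half word"), (4, "word"), (8, "double word")):
    print(name, [addr for addr in range(16) if is_aligned(addr, size)])
```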

Memory Alignment

Aligned/Misaligned Addresses

Addressing Modes

• How does an architecture specify the effective address of an object?
  – Effective address: the actual memory address specified by the addressing mode.
    • E.g. Mem[R[R1]] refers to the contents of the memory location whose location is given by the contents of register 1 (R1).

• Addressing Modes:
  – Register
  – Immediate
  – Displacement
  – Register indirect, ...

-20(s0): Memory address
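The four modes listed above differ only in how the operand value is located. The sketch below evaluates each one against a toy register file and memory; the dictionaries, register names, and the `operand` helper are assumptions made up for illustration.

```python
# Toy machine state (assumed values, for illustration only).
regs = {"R1": 100, "R2": 7}
mem = {100: 42, 120: 99}


def operand(mode, reg=None, value=None, disp=0):
    """Return the operand value selected by the given addressing mode."""
    if mode == "register":            # operand is R[reg]
        return regs[reg]
    if mode == "immediate":           # operand is encoded in the instruction
        return value
    if mode == "displacement":        # operand is Mem[disp + R[reg]]
        return mem[disp + regs[reg]]
    if mode == "register_indirect":   # operand is Mem[R[reg]]
        return mem[regs[reg]]
    raise ValueError(f"unknown addressing mode: {mode}")


print(operand("register", reg="R2"))
print(operand("immediate", value=3))
print(operand("displacement", reg="R1", disp=20))
print(operand("register_indirect", reg="R1"))
```

Register indirect is the special case of displacement with a displacement of 0, which is one reason the two are usually supported together.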

Address Modes

Addressing Mode Impacts

• Instruction counts
• Architecture Complexity
• CPI

Summary of Use of Memory Addressing Modes

Displacement Values are Widely Distributed

Impact instruction length

Displacement Addressing Mode

• Benchmarks show
  – 12 bits of displacement would capture about 75% of the full 32-bit displacements
  – 16 bits should capture about 99%

• Remember:
  – optimize for the common case. Hence, the choice is at least 12-16 bits

• For addresses that do fit in displacement size:
    Add R4, 10000(R0)

• For addresses that don't fit in displacement size, the compiler must do the following:
    Load R1, 1000000
    Add R1, R0
    Add R4, 0(R1)
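Whether a displacement "fits" is a range check against the width of the displacement field. The sketch below assumes the field is a signed two's complement value, which the slide does not state explicitly; the function name is likewise an assumption.

```python
def fits_signed(value, bits):
    """True if `value` is representable in a `bits`-wide two's complement field."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


# The two displacements used in the example above, checked against
# the field widths discussed on this slide (signed field assumed).
for disp in (10000, 1000000):
    print(disp, {bits: fits_signed(disp, bits) for bits in (12, 16, 32)})
```

Under this assumption 10000 needs the 16-bit field, while 1000000 fits in neither 12 nor 16 bits, so the compiler has to build the address in a register first.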

Immediate Addressing Mode

• Used where we want to get to a numerical value in an instruction
• Around 25% of the operations have an immediate operand

At high level:

    a = b + 3;
    if (a > 17)
        goto Addr

At Assembler level:

    Load R2, #3
    Add R0, R1, R2

    Load R2, #17
    CMPBGT R1, R2

    Load R1, Address
    Jump (R1)

About 25% of data transfer and ALU operations have an immediate operand

Number of Bits for Immediate

• 16 bits would capture about 80% and 8 bits about 50%.

Summary: Memory Addressing

• A new architecture is expected to support at least: displacement, immediate, and register indirect
  – these represent 75% to 99% of the addressing modes used
• The size of the address for displacement mode should be at least 12-16 bits
  – captures 75% to 99% of the displacements
• The size of the immediate field should be at least 8-16 bits
  – captures 50% to 80% of the immediates

Processors rely on compilers to generate code using those addressing modes

4 Type And Size of Operands

• The type of the operand is usually encoded in the opcode
  – e.g., LDB – load byte; LDW – load word

• Common operand types: (imply their sizes)
    Character (8 bits or 1 byte)
    Half word (16 bits or 2 bytes)
    Word (32 bits or 4 bytes)
    Double word (64 bits or 8 bytes)
    Single precision floating point (4 bytes or 1 word)
    Double precision floating point (8 bytes or 2 words)
  ✓ Characters are almost always in ASCII
  ✓ 16-bit Unicode (used in Java) is gaining popularity
  ✓ Integers are two's complement binary
  ✓ Floating points follow the IEEE standard 754

• Some architectures support packed decimal: 4 bits are used to encode the values 0-9; 2 decimal digits are packed into each byte

How is the type of an operand designated?
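Packed decimal as described above (one decimal digit per 4 bits, two digits per byte) can be sketched in a few lines. The function names, the high-nibble-first digit order, and the leading-zero padding are assumptions for illustration; real packed-decimal formats also define a sign nibble, which is omitted here.

```python
def pack_bcd(digits):
    """Pack a string of decimal digits, two per byte (high nibble first)."""
    if not digits.isdigit():
        raise ValueError("only decimal digits 0-9 can be packed")
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits (assumed policy)
    return bytes((int(hi) << 4) | int(lo)
                 for hi, lo in zip(digits[0::2], digits[1::2]))


def unpack_bcd(data):
    """Recover the digit string from packed bytes."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in data)


packed = pack_bcd("1234")
print(packed.hex(), unpack_bcd(packed))
```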

Distribution of Data Accesses by Size

Summary: Type and Size of Operands

• A 32-bit architecture supports 8-, 16-, and 32-bit integers, and 32-bit and 64-bit IEEE 754 floating-point data.
• A new 64-bit address architecture supports 64-bit integers
• Media processors and DSPs need wider accumulating registers for accuracy.

5 Operations in the Instruction Set

• All computers generally provide a full set of operations for the first three categories (arithmetic and logical, data transfer, control)
• All computers must have some instruction support for basic system functions
• Graphics instructions typically operate on many smaller data items in parallel

Top 10 Instructions for 80x86

Instruction Encoding

• RISC-V R-format instruction

• RISC-V I-format instruction
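As a sketch of what the two formats look like at the bit level, the code below packs the R-format fields (funct7, rs2, rs1, funct3, rd, opcode) and the I-format fields (imm[11:0], rs1, funct3, rd, opcode) into 32-bit words, following the field positions in the RISC-V base ISA specification. The helper names are assumptions; the two example instructions are `add x3, x1, x2` and `addi x1, x0, 5`.

```python
def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """R-format: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def encode_i(imm, rs1, funct3, rd, opcode):
    """I-format: imm[11:0] in bits 31:20, then rs1, funct3, rd, opcode as in R-format."""
    return (((imm & 0xFFF) << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


# add x3, x1, x2  (opcode OP = 0110011, funct3 = 000, funct7 = 0000000)
add_x3_x1_x2 = encode_r(0b0000000, 2, 1, 0b000, 3, 0b0110011)
# addi x1, x0, 5  (opcode OP-IMM = 0010011, funct3 = 000)
addi_x1_x0_5 = encode_i(5, 0, 0b000, 1, 0b0010011)

print(f"{add_x3_x1_x2:08x} {addi_x1_x0_5:08x}")
```

The register and opcode fields sit at the same bit positions in both formats, which is what keeps RISC-V decode uniform.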

6 Instructions for Control Flow

• Control instructions change the flow of control:
  – instead of executing the next instruction, the program branches to the address specified in the branching instructions
• They break the pipeline
  – Difficult to optimize out
  – AND they are frequent
• Four types of control instructions
  – Conditional branches
    • if…else, for/while, switch/case, …
  – Jumps – unconditional transfer
    • goto
  – Procedure calls
    • foo()
  – Procedure returns
    • return

Breakdown of Control Flow Instructions

– Conditional branches
– Jumps – unconditional transfer
– Procedure calls
– Procedure returns

• Issues:
  – Where is the target address? How to specify it? (label)
  – Caller: Where is return address kept? How are the arguments passed?
  – Callee: Where is return address? How are the results passed?

Addressing Modes for Control Flow Instructions

• PC-relative (Program Counter)
  – Supply a displacement added to the PC
    • Known at compile time for jumps, branches, and calls (specified within the instruction)
  – The target is often near the current instruction
    • Requiring fewer bits
    • Independently of where it is loaded (position independence)

• Register indirect addressing – dynamic addressing
  – The target address may not be known at compile time
  – Naming a register that contains the target address
    • Case or switch statements
    • Virtual functions or methods in C++ or Java
    • High-order functions or function pointers in C or C++
    • Dynamically shared libraries
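Position independence follows directly from the arithmetic of PC-relative addressing: the encoded displacement depends only on the distance between the branch and its target. The sketch below uses hypothetical load addresses and helper names to show that the same displacement works wherever the code is loaded.

```python
def encode_offset(branch_pc, target):
    """Displacement stored in the instruction: distance from the branch to the target."""
    return target - branch_pc


def branch_target(pc, offset):
    """Target computed at run time: displacement added to the PC."""
    return pc + offset


# The same code loaded at two different base addresses (hypothetical values).
# The branch sits 0x20 bytes into the code and jumps back to offset 0x08.
for base in (0x1000, 0x8000):
    offset = encode_offset(base + 0x20, base + 0x08)
    print(hex(base), offset, hex(branch_target(base + 0x20, offset)))
```

The displacement is -24 for both load addresses, so the instruction bits do not need to change when the program is relocated.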

Branch Distances

Conditional Branch Options

Figure 2.21 Major methods for evaluating branch conditions

Comparison Type vs. Frequency

• Most loops go from 0 to n.
• Most backward branches are loops – taken about 90%

Program | % backward branches | % all control instructions that modify PC
gcc | 26% | 63%
spice | 31% | 63%
TeX | 17% | 70%
Average | 25% | 65%

Procedure Invocation Options

• Procedure calls and returns
  – control transfer
  – state saving; the return address must be saved
  Newer architectures require the compiler to generate stores and loads for each register saved and restored

• Two basic conventions in use to save registers
  – caller saving: the calling procedure must save the registers that it wants preserved for access after the call
    • the called procedure need not worry about registers
  – callee saving: the called procedure must save the registers it wants to use
    • leaving the caller unrestrained

Most real systems today use a combination of both
• The application binary interface (ABI) sets down the basic rules as to which registers should be caller saved and which should be callee saved

7. Encoding an Instruction Set

• Opcode: specifying the operation
• # of operands
  – addressing mode
  – address specifier: tells what addressing mode is used
  – Load-store computer
    • Only one memory operand
    • Only one or two addressing modes

• The architecture must balance several competing forces when encoding the instruction set:
  – # of registers && Addressing modes
  – Size of registers && Addressing mode fields
  – Average instruction size && Average program size.
  – Easy to handle in pipeline implementation.

Example: x86 and Alpha

• x86:

• Alpha:

Three Basic Variations for Instruction Encoding

The length of 80x86 (CISC) instructions varies between 1 and 17 bytes.

The length of most RISC ISA instructions is 4 bytes.

X86 programs are generally smaller than RISC ISA programs.

To reduce RISC code size

Instruction Length Tradeoffs

• Fixed length: Length of all instructions the same
  + Easier to decode single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (Why is this bad?)
  -- Harder-to-extend ISA (how to add new instructions?)

• Variable length: Length of instructions different (determined by opcode and sub-opcode)
  + Compact encoding (Why is this good?)
    Intel 432: Huffman encoding (sort of). 6 to 321 bit instructions. How?
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently

• Tradeoffs
  – Code size (memory space, bandwidth, latency) vs. hardware complexity
  – ISA extensibility and expressiveness
  – Performance? Smaller code vs. imperfect decode
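One way to see why variable-length code is harder to decode concurrently is to ask where each instruction starts. In the sketch below, fixed-length boundaries are computed without reading any instruction bytes, while variable-length boundaries have to be discovered one instruction at a time. The variable-length format is a toy assumption (the first byte of each instruction holds its total length in bytes); it is not x86 or Intel 432 encoding.

```python
def fixed_boundaries(code, width=4):
    """Fixed length: every start offset is known without inspecting the bytes."""
    return list(range(0, len(code), width))


def variable_boundaries(code):
    """Variable length (toy format): byte 0 of each instruction is its length."""
    starts, offset = [], 0
    while offset < len(code):
        length = code[offset]
        if length == 0:
            raise ValueError("zero-length instruction in toy encoding")
        starts.append(offset)
        # The next start is unknown until this instruction has been decoded.
        offset += length
    return starts


fixed_code = bytes(12)                        # three 4-byte instructions
variable_code = bytes([1, 3, 0, 0, 2, 0, 1])  # lengths 1, 3, 2, 1
print(fixed_boundaries(fixed_code), variable_boundaries(variable_code))
```

The sequential dependence in the while loop is the software analogue of the extra decode logic that variable-length ISAs need in hardware.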

Uniform vs Non-uniform Decode

• Uniform decode: Same bits in each instruction correspond to the same meaning
  – Opcode is always in the same location
  – immediate values, …
  – Many "RISC" ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate target address before knowing the instruction is a branch
  -- Restricts instruction format (fewer instructions?) or wastes space

• Non-uniform decode
  – E.g., opcode can be the 1st-7th byte in x86
  + More compact and powerful instruction format
  -- More complex decode logic

Reduced Code Size in RISCs

• Hybrid encoding – support 16-bit and 32-bit instructions in RISC, e.g. ARM Thumb, MIPS16 and RISC-V
  – Narrow instructions support fewer operations, smaller address and immediate fields, fewer registers, and two-address format rather than the classic three-address format
  – Claim a code size reduction of up to 40%

• Compression in IBM's CodePack
  – Adds hardware to decompress instructions as they are fetched from memory on an instruction cache miss
  – The instruction cache contains full 32-bit instructions, but compressed code is kept in main memory, ROMs, and the disk
  – Claim code reduction 35% - 40%
  – PowerPC creates a hash table in memory that maps between compressed and uncompressed addresses

• Hitachi's SuperH: fixed 16-bit format
  – 16 rather than 32 registers
  – fewer instructions

Summary of Instruction Encoding

• Three choices
  – Variable, fixed and hybrid
  – Note the differences of hybrid and variable
• The choice of instruction encoding is a tradeoff:
  – For performance: fixed encoding
  – For code size: variable encoding
• How hybrid encoding is used in RISC to reduce code size
  – 16 bit and 32 bit
• In general, we see:
  – RISC: fixed or hybrid
  – CISC: variable

8 The Role of Compilers

• Almost all programming is done in high-level languages.
  – An ISA is essentially a compiler target.
• See backup slides for the compilation stages of most compilers, e.g. gcc
• Compiler goals:
  – All correct programs execute correctly
  – Most compiled programs execute fast (optimizations)
  – Fast compilation
  – Debugging support

Typical Modern Compiler Structure

Figure A.19 Compilers typically consist of two to four passes, with more highly optimizing compilers having more passes. This structure maximizes the probability that a program compiled at various levels of optimization will produce the same output when given the same input. The optimizing passes are designed to be optional and may be skipped when faster compilation is the goal and lower-quality code is acceptable. A pass is simply one phase in which the compiler reads and transforms the entire program. (The term phase is often used interchangeably with pass.) Because the optimizing passes are separated, multiple languages can use the same optimizing and code generation passes. Only a new front end is required for a new language.

Optimization Types

• High level – done at or near source code level
  – If procedure is called only once, put it in-line and save CALL
  – more general case: if call-count < some threshold, put them in-line
• Local – done within straight-line code
  – common sub-expressions produce same value – either allocate a register or replace with single copy
  – constant propagation – replace constant valued variable with the constant
  – stack height reduction – re-arrange expression tree to minimize temporary storage needs
• Global – across a branch
  – copy propagation – replace all instances of a variable A that has been assigned X (i.e., A = X) with X.
  – code motion – remove code from a loop that computes same value each iteration of the loop and put it before the loop
  – simplify or eliminate array addressing calculations in loops
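Code motion from the list above can be illustrated by writing the "before" and "after" versions of a loop by hand. This is only a source-level sketch of what a compiler does on its intermediate representation; the function names and the expression x * y + 1 are assumptions chosen for illustration.

```python
def scale_naive(values, x, y):
    """Before code motion: the loop-invariant expression is recomputed each iteration."""
    out = []
    for v in values:
        factor = x * y + 1  # same value on every iteration
        out.append(v * factor)
    return out


def scale_hoisted(values, x, y):
    """After code motion: the invariant computation is moved in front of the loop."""
    factor = x * y + 1
    out = []
    for v in values:
        out.append(v * factor)
    return out


print(scale_naive([1, 2, 3], 2, 3), scale_hoisted([1, 2, 3], 2, 3))
```

Both versions return the same result; the hoisted one executes one multiply and one add instead of one per iteration, which reduces the dynamic instruction count.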

Optimization Types

• Machine-dependent optimizations – based on machine knowledge
  – strength reduction – replace multiply by a constant with shifts and adds
    • would make sense if there was no hardware support for MUL
    • a trickier version: 17 × x = (x arithmetic left shift 4) + x
• Pipelining scheduling – reorder instructions to improve pipeline performance
  – dependency analysis
  – branch offset optimization – reorder code to minimize branch offsets
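The multiply-by-17 case works because 17 = 16 + 1, so the product is a shift by 4 plus one add. A minimal sketch (the function name is an assumption):

```python
def times17(x):
    """Strength-reduced 17 * x: one left shift by 4 (x * 16) plus one add."""
    return (x << 4) + x


print(times17(3), times17(-5))
```

Python integers are unbounded, so this sketch ignores overflow; on fixed-width registers the shift-and-add form wraps around the same way the multiply would.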

Major Types of Optimizations

Compiler Optimizations – Change in IC

• L0 – unoptimized
• L1 – local opts, code scheduling, & local reg. allocation
• L2 – global opts and loop transformations, & global reg. allocation
• L3 – procedure integration

gcc -O2 hello.c -o hello

Compiler Based Register Optimization

• Compiler assumes small number of registers (16-32)
  – Optimizing use is up to compiler
  – HLL programs have no explicit references to registers
• Compiler Approach
  – Assign symbolic or virtual register to each candidate variable
  – Map (unlimited) symbolic registers to real registers
  – Symbolic registers that do not overlap can share real registers
  – If you run out of real registers, some variables are placed in memory
    • Spilling

Graph Coloring

• Given a graph of nodes and edges
  – Assign a color to each node
    • Adjacent nodes have different colors
    • Use minimum number of colors
• Register allocation
  – Nodes are symbolic registers
  – Two registers that are live in the same program fragment are joined by an edge
  – Try to color the graph with n colors, where n is the number of real registers
  – Nodes that cannot be colored are placed in memory

https://en.wikipedia.org/wiki/Graph_coloring
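The mapping from symbolic registers to real registers can be sketched with a simple greedy coloring. This is a heuristic chosen for brevity, not the allocator of gcc or any other specific compiler, and it does not guarantee the minimum number of colors; the interference graph and all names below are made-up assumptions.

```python
def color_registers(interference, num_regs):
    """Greedy coloring of an interference graph.

    interference: dict mapping each symbolic register to the set of symbolic
    registers that are live at the same time. Returns (assignment, spilled).
    """
    assignment, spilled = {}, []
    # Heuristic: color the most constrained nodes first; ties broken by name.
    order = sorted(interference, key=lambda n: (-len(interference[n]), n))
    for node in order:
        taken = {assignment[nb] for nb in interference[node] if nb in assignment}
        free = [r for r in range(num_regs) if r not in taken]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # no real register left: keep it in memory
    return assignment, spilled


# Hypothetical interference graph: a, b, c are all live together; d overlaps only c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

print(color_registers(graph, 3))
print(color_registers(graph, 2))
```

With three real registers every symbolic register gets a color and d shares a register with a; with two, one member of the a-b-c triangle has to be spilled.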

Iron-code Summary

• Section A.2: Use general-purpose registers with a load-store architecture.
• Section A.3: Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect.
• Section A.4: Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers.
  – Now we see 16-bit FP for deep learning in GPU
    • http://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/
• Section A.5: Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and shift.
• Section A.6: Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return.
• Section A.7: Use fixed instruction encoding if interested in performance, and use variable instruction encoding if interested in code size.
• Section A.8: Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist IS
  – Often use separate floating-point registers.
  – The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.

Real World ISA

The details in design are all about trade-offs!

Date post:	19-Mar-2020