
Linux Insides

Date post: 16-Jul-2016

Table of Contents

1. Introduction
2. Booting
    i. From bootloader to kernel
    ii. First steps in the kernel setup code
    iii. Video mode initialization and transition to protected mode
    iv. Transition to 64-bit mode
    v. Kernel decompression
3. Initialization
    i. First steps in the kernel
    ii. Early interrupts handler
    iii. Last preparations before the kernel entry point
    iv. Kernel entry point
    v. Continue architecture-specific boot-time initializations
    vi. Architecture-specific initializations, again...
    vii. End of the architecture-specific initializations, almost...
    viii. Scheduler initialization
    ix. RCU initialization
    x. End of initialization
4. Interrupts
    i. Introduction
    ii. Start to dive into interrupts
    iii. Interrupt handlers
    iv. Initialization of non-early interrupt gates
    v. Implementation of some exception handlers
    vi. Handling Non-Maskable interrupts
    vii. Dive into external hardware interrupts
    viii. Initialization of external hardware interrupts structures
    ix. Softirq, Tasklets and Workqueues
    x. Last part
5. System calls
    i. Introduction to system calls
    ii. How the Linux kernel handles a system call
    iii. vsyscall and vDSO
    iv. How the Linux kernel runs a program
6. Timers and time management
    i. Introduction
    ii. Clocksource framework
7. Memory management
    i. Memblock
    ii. Fixmaps and ioremap
8. SMP
9. Concepts
    i. Per-CPU variables
    ii. Cpumasks
10. Data Structures in the Linux Kernel
    i. Doubly linked list
    ii. Radix tree
11. Theory
    i. Paging
    ii. Elf64
    iii. CPUID
    iv. MSR
12. Initial ram disk
    i. initrd
13. Misc
    i. How the kernel is compiled
    ii. Linkers
    iii. Linux kernel development
    iv. Write and Submit your first Linux kernel Patch
    v. Data types in the kernel
14. Useful links
15. Contributors

linux-insides

A series of posts about the linux kernel and its insides.

The goal is simple - to share my modest knowledge about the internals of the linux kernel and help people who are interested in linux kernel internals, and other low-level subject matter.

Questions/Suggestions: Feel free about any questions or suggestions by pinging me at twitter @0xAX, adding an issue or just drop me an email.

Support

If you like linux-insides you can support me.

On other languages

* Chinese
* Spanish

LICENSE

Licensed BY-NC-SA Creative Commons.

Contributions

Feel free to create issues or pull-requests if you have any problems.

Please read CONTRIBUTING.md before pushing any changes.

Author

@0xAX

Kernel boot process

This chapter describes the linux kernel boot process. You will see here a couple of posts which describe the full cycle of the kernel loading process:

* From the bootloader to kernel - describes all stages from turning on the computer to running the first instruction of the kernel;
* First steps in the kernel setup code - describes first steps in the kernel setup code. You will see heap initialization, query of different parameters like EDD, IST and etc...
* Video mode initialization and transition to protected mode - describes video mode initialization in the kernel setup code and transition to protected mode.
* Transition to 64-bit mode - describes preparation for transition into 64-bit mode and details of transition.
* Kernel Decompression - describes preparation before kernel decompression and details of direct decompression.

Kernel booting process. Part 1.

From the bootloader to kernel

If you have read my previous blog posts, you can see that some time ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works on low-level and many many other things. So, I decided to write yet another series of posts about the Linux kernel for x86_64.

Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter 0xAX, drop me an email or just create an issue. I appreciate it. All posts will also be accessible at linux-insides and if you find something wrong with my English or the post content, feel free to send a pull request.

Note that this isn't the official documentation, just learning and sharing knowledge.

Required knowledge

* Understanding C code
* Understanding assembly code (AT&T syntax)

Anyway, if you just start to learn some tools, I will try to explain some parts during this and the following posts. Ok, simple introduction finishes and now we can start to dive into the kernel and low-level stuff.

All code is actually for kernel - 3.18. If there are changes, I will update the posts accordingly.

The Magic Power Button, What happens next?

Despite that this is a series of posts about the Linux kernel, we will not start from the kernel code (at least not in this paragraph). Ok, you press the magic power button on your laptop or desktop computer and it starts to work. After the motherboard sends a signal to the power supply, the power supply provides the computer with the proper amount of electricity. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.

80386 and later CPUs define the following predefined data in CPU registers after the computer resets:

    IP          0xfff0
    CS selector 0xf000
    CS base     0xffff0000

The processor starts working in real mode. Let's back up a little to try and understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the 8086 all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it could work with a 2^20 byte address space (1 megabyte, addresses 0 to 2^20 - 1). But it only has 16-bit registers, and with 16-bit registers the maximum address is 2^16 - 1 or 0xffff (64 kilobytes). Memory segmentation is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes, or 64 KB. Since we cannot address memory above 64 KB with 16 bit registers, an alternate method is devised. An address consists of two parts: the beginning address of the segment and an offset from this address. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part:

    PhysicalAddress = Segment * 16 + Offset

For example if CS:IP is 0x2000:0x0010, the corresponding physical address will be:

    >>> hex((0x2000 << 4) + 0x0010)
    '0x20010'

But if we take the largest segment part and offset: 0xffff:0xffff, it will be:

    >>> hex((0xffff << 4) + 0xffff)
    '0x10ffef'

which is 65519 bytes over the first megabyte. Since only one megabyte is accessible in real mode, 0x10ffef becomes 0x00ffef with disabled A20.
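The wrap-around with a disabled A20 line can be folded into the same calculation. The following is only a minimal sketch: the function name real_mode_address and the a20_enabled parameter are made up for illustration and do not come from any kernel or BIOS source.

```python
def real_mode_address(segment: int, offset: int, a20_enabled: bool = False) -> int:
    """Translate a real-mode segment:offset pair into a physical address."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    if not a20_enabled:
        # Only 20 address lines are driven, so the address wraps at 1 MB.
        linear &= 0xFFFFF
    return linear


print(hex(real_mode_address(0x2000, 0x0010)))        # 0x20010
print(hex(real_mode_address(0xFFFF, 0xFFFF)))        # 0xffef
print(hex(real_mode_address(0xFFFF, 0xFFFF, True)))  # 0x10ffef
```

Masking with 0xFFFFF is simply the arithmetic equivalent of dropping address bit 20.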

Ok, now we know about real mode and memory addressing. Let's get back to the register values after reset:

The CS register consists of two parts: the visible segment selector and the hidden base address. We know the predefined CS base and IP value, so the logical address will be:

    0xffff0000:0xfff0

The starting address is formed by adding the base address to the value in the EIP register:

    >>> hex(0xffff0000 + 0xfff0)
    '0xfffffff0'

We get 0xfffffff0 which is 4GB - 16 bytes. This point is called the Reset vector. This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a jump instruction which usually points to the BIOS entry point. For example, if we look in the coreboot source code, we see:

        .section ".reset"
        .code16
    .globl reset_vector
    reset_vector:
        .byte 0xe9
        .int _start - ( . + 2 )
        ...

Here we can see the jmp instruction opcode - 0xe9 and its destination address - _start - ( . + 2 ), and we can see that the reset section is 16 bytes and starts at 0xfffffff0:

    SECTIONS {
        _ROMTOP = 0xfffffff0;
        . = _ROMTOP;
        .reset . : {
            *(.reset)
            . = 15 ;
            BYTE(0x00);
        }
    }

Now the BIOS starts: after initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector (which is 512 bytes). The final two bytes of the first sector are 0x55 and 0xaa, which signals the BIOS that this device is bootable. For example:

    ;
    ; Note: this example is written in Intel Assembly syntax
    ;
    [BITS 16]
    [ORG 0x7c00]

    boot:
        mov al, '!'
        mov ah, 0x0e
        mov bh, 0x00
        mov bl, 0x07

        int 0x10
        jmp $

    times 510-($-$$) db 0

    db 0x55
    db 0xaa

Build and run it with:

    nasm -f bin boot.nasm && qemu-system-x86_64 boot

This will instruct QEMU to use the boot binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to 0x7c00, and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.

You will see:

[image]

In this example we can see that the code will be executed in 16 bit real mode and will start at 0x7c00 in memory. After starting it calls the 0x10 interrupt which just prints the ! symbol. It fills the rest of the first 510 bytes with zeros and finishes with the two magic bytes 0x55 and 0xaa.
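The layout rule described above (code, zero padding up to byte 510, then the two magic bytes) can also be checked without an assembler. This is a small sketch, not part of any BIOS or kernel code; the helper names make_boot_sector and looks_bootable are invented for the example.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # bytes 510 and 511 of the sector


def make_boot_sector(code: bytes) -> bytes:
    """Pad code with zeros to 510 bytes and append the boot signature."""
    room = SECTOR_SIZE - len(BOOT_SIGNATURE)
    if len(code) > room:
        raise ValueError("code does not fit into one sector")
    return code + bytes(room - len(code)) + BOOT_SIGNATURE


def looks_bootable(sector: bytes) -> bool:
    """Check the size and the trailing 0x55 0xaa signature."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


# 0xeb 0xfe encodes "jmp $", an endless loop like the one in the example above.
sector = make_boot_sector(b"\xeb\xfe")
print(len(sector), looks_bootable(sector))
```

The zero padding plays the same role as the times 510-($-$$) db 0 line in the assembly listing.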

You can see a binary dump of this with the objdump util:

    nasm -f bin boot.nasm
    objdump -D -b binary -mi386 -Maddr16,data16,intel boot

A real-world boot sector has code to continue the boot process and the partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, BIOS hands over control to the bootloader.

NOTE: As you can read above, the CPU is in real mode. In real mode, calculating the physical address in memory is done as follows:

    PhysicalAddress = Segment * 16 + Offset

The same as mentioned before. We have only 16 bit general purpose registers, the maximum value of a 16 bit register is 0xffff, so if we take the largest values, the result will be:

    >>> hex((0xffff * 16) + 0xffff)
    '0x10ffef'

Where 0x10ffef is the last byte of a range of 1MB + 64KB - 16 bytes. But an 8086 processor, which is the first processor with real mode, has a 20 bit address line and 2^20 = 1048576 is 1MB. This means the actual memory available is 1MB.

General real mode's memory map is:

    0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
    0x00000400 - 0x000004FF - BIOS Data Area
    0x00000500 - 0x00007BFF - Unused
    0x00007C00 - 0x00007DFF - Our Bootloader
    0x00007E00 - 0x0009FFFF - Unused
    0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
    0x000B0000 - 0x000B7FFF - Monochrome Video Memory
    0x000B8000 - 0x000BFFFF - Color Video Memory
    0x000C0000 - 0x000C7FFF - Video ROM BIOS
    0x000C8000 - 0x000EFFFF - BIOS Shadow Area
    0x000F0000 - 0x000FFFFF - System BIOS

In the beginning of this post I wrote that the first instruction executed by the CPU is located at address 0xFFFFFFF0, which is much larger than 0xFFFFF (1MB). How can the CPU access this in real mode? This is in the coreboot documentation:

    0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space

At the start of execution, the BIOS is not in RAM, but in ROM.

Bootloader

There are a number of bootloaders that can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.

Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes grub_main.

grub_main initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules etc. At the end of execution, grub_main moves grub to normal mode. grub_normal_execute (from grub-core/normal/main.c) completes the last preparation and shows a menu to select an operating system. When we select one of the grub menu entries, grub_menu_execute_entry runs, which executes the grub boot command, booting the selected operating system.

As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at offset 0x01f1 from the kernel setup code. The kernel header arch/x86/boot/header.S starts from:

        .globl hdr
    hdr:
        setup_sects: .byte 0
        root_flags:  .word ROOT_RDONLY
        syssize:     .long 0
        ram_size:    .word 0
        vid_mode:    .word SVGA_MODE
        root_dev:    .word 0
        boot_flag:   .word 0xAA55

The bootloader must fill this and the rest of the headers (only those marked as write in the Linux boot protocol) with values which it either got from the command line or calculated. We will not see a description and explanation of all fields of the kernel setup header, we will get back to that when the kernel uses them. You can find a description of all fields in the boot protocol.

As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:

            | Protected-mode kernel  |
    100000  +------------------------+
            | I/O memory hole        |
    0A0000  +------------------------+
            | Reserved for BIOS      | Leave as much as possible unused
            ~                        ~
            | Command line           | (Can also be below the X+10000 mark)
    X+10000 +------------------------+
            | Stack/heap             | For use by the kernel real-mode code.
    X+08000 +------------------------+
            | Kernel setup           | The kernel real-mode code.
            | Kernel boot sector     | The kernel legacy boot sector.
          X +------------------------+
            | Boot loader            |

So when the bootloader transfers control to the kernel, it starts at:

    0x1000 + X + sizeof(KernelBootSector) + 1

where X is the address of the kernel boot sector loaded. In my case X is 0x10000, as we can see in a memory dump:

[image]

The bootloader has now loaded the Linux kernel into memory, filled the header fields and jumped to it. Now we can move directly to the kernel setup code.
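To make the header fields mentioned above a little more concrete, here is a sketch of how the first few fields of the setup header could be read from a kernel image. It only covers the seven fields shown in the header.S excerpt (from setup_sects at offset 0x1f1 up to boot_flag), assumes they are packed without padding in little-endian order, and uses an in-memory stand-in instead of a real image; the function name parse_setup_header and all sample values except boot_flag are made up.

```python
import struct

HDR_OFFSET = 0x1F1
# setup_sects (.byte), root_flags (.word), syssize (.long),
# ram_size, vid_mode, root_dev, boot_flag (.word each)
HDR_FORMAT = "<BHIHHHH"
HDR_FIELDS = ("setup_sects", "root_flags", "syssize", "ram_size",
              "vid_mode", "root_dev", "boot_flag")


def parse_setup_header(image: bytes) -> dict:
    """Decode the leading setup header fields from a kernel image."""
    values = struct.unpack_from(HDR_FORMAT, image, HDR_OFFSET)
    return dict(zip(HDR_FIELDS, values))


# Stand-in for the first 512 bytes of an image (values are illustrative only).
image = bytearray(0x200)
struct.pack_into(HDR_FORMAT, image, HDR_OFFSET, 30, 1, 0, 0, 0xFFFF, 0, 0xAA55)

header = parse_setup_header(bytes(image))
print(hex(header["boot_flag"]))
```

Because boot_flag ends exactly at offset 0x200, it occupies the last two bytes of the first sector, the same place as the boot sector signature discussed earlier.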

Start of Kernel Setup

Finally we are in the kernel. Technically the kernel hasn't run yet, we need to set up the kernel, memory manager, process manager etc first. Kernel setup execution starts from arch/x86/boot/header.S at _start. It is a little strange at first sight, as there are several instructions before it.

A long time ago the Linux kernel had its own bootloader, but now if you run for example:

    qemu-system-x86_64 vmlinuz-3.18-generic

You will see:

[image]

Actually header.S starts from MZ (see image above), error message printing and following PE header:

    #ifdef CONFIG_EFI_STUB
    # "MZ", MS-DOS header
    .byte 0x4d
    .byte 0x5a
    #endif
    ...
    ...
    ...
    pe_header:
        .ascii "PE"
        .word 0

It needs this to load an operating system with UEFI. We won't see how this works right now, we'll see this in one of the next chapters.

So the actual kernel setup entry point is:

    // header.S line 292
    .globl _start
    _start:

The bootloader (grub2 and others) knows about this point (0x200 offset from MZ) and makes a jump directly to this point, despite the fact that header.S starts from the .bstext section which prints an error message:

    //
    // arch/x86/boot/setup.ld
    //
    . = 0;                    // current position
    .bstext : { *(.bstext) }  // put .bstext section to position 0
    .bsdata : { *(.bsdata) }

So the kernel setup entry point is:

        .globl _start
    _start:
        .byte 0xeb
        .byte start_of_setup-1f
    1:
        //
        // rest of the header
        //

Here we can see a jmp instruction opcode - 0xeb to the start_of_setup-1f point. The Nf notation refers to the next local label N: after the current position, so for example 2f refers to the next local 2: label. In our case it is label 1, which goes right after the jump. It contains the rest of the setup header. Right after the setup header we see the .entrytext section which starts at the start_of_setup label.

Actually this is the first code that runs (aside from the previous jump instruction of course). After the kernel setup got control from the bootloader, the first jmp instruction is located at offset 0x200 (right after the first 512 bytes) from the start of the kernel real mode code. This we can read in the Linux kernel boot protocol and also see in the grub2 source code:

    state.gs = state.fs = state.es = state.ds = state.ss = segment;
    state.cs = segment + 0x20;

It means that segment registers will have the following values after kernel setup starts:

    gs = fs = es = ds = ss = 0x1000
    cs = 0x1020

in my case when the kernel is loaded at 0x10000.
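The relation between these segment values and the 0x200 offset follows directly from the real-mode address formula. A quick check, using the values quoted above (the helper name linear is just for illustration):

```python
def linear(segment: int, offset: int) -> int:
    """Real-mode linear address: segment * 16 + offset."""
    return (segment << 4) + offset


load_segment = 0x1000        # kernel setup code loaded at 0x10000
cs = load_segment + 0x20     # what grub2 puts into cs

# Offset 0x200 seen from the load segment and offset 0 seen from cs
# name the same byte in memory.
print(hex(linear(load_segment, 0x200)))  # 0x10200
print(hex(linear(cs, 0x0)))              # 0x10200
```

Adding 0x20 to a segment register moves its base by 0x20 * 16 = 0x200 bytes, which is exactly the size of the legacy boot sector that precedes _start.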

After the jump to start_of_setup, it needs to do the following:

* Be sure that all values of all segment registers are equal
* Set up a correct stack if needed
* Set up bss
* Jump to C code at main.c

Let's look at the implementation.

Segment registers align

First of all it ensures that the ds and es segment registers point to the same address and disables interrupts with the cli instruction:

    movw %ds, %ax
    movw %ax, %es
    cli

As I wrote earlier, grub2 loads the kernel setup code at address 0x10000 and sets cs to 0x1020 because execution doesn't start from the start of the file, but from:

    _start:
        .byte 0xeb
        .byte start_of_setup-1f

jump, which is at a 512 byte offset from the 4d 5a. It also needs to align cs from 0x10200 to 0x10000, as all other segment registers, before we set up the stack:

    pushw %ds
    pushw $6f
    lretw

This pushes the ds value onto the stack, followed by the address of label 6, and executes the lretw instruction. When lretw executes, it loads the address of label 6 into the instruction pointer register and cs with the value of ds. After this we will have ds and cs with the same values.

Stack Setup

Actually, almost all of the setup code is preparation for the C language environment in real mode. The next step is checking the ss register value and making a correct stack if ss is wrong:

    movw %ss, %dx
    cmpw %ax, %dx
    movw %sp, %dx
    je 2f

This can lead to 3 different scenarios:

* ss has the valid value 0x1000 (as all other segment registers beside cs)
* ss is invalid and the CAN_USE_HEAP flag is set (see below)
* ss is invalid and the CAN_USE_HEAP flag is not set (see below)

Let's look at all three of these scenarios:

1. ss has a correct value (0x1000). In this case we go to label 2:

    2:  andw $~3, %dx
        jnz 3f
        movw $0xfffc, %dx
    3:  movw %ax, %ss
        movzwl %dx, %esp
        sti

Here we can see aligning of dx (contains sp given by the bootloader) to 4 bytes and checking whether it is zero. If it is zero, we put 0xfffc (the 4 byte aligned address before the maximum segment size - 64 KB) in dx. If it is not zero we continue to use the sp given by the bootloader (0xf7f4 in my case). After this we put the ax value into ss, which now holds the correct segment value 0x1000 (the segment starting at address 0x10000), and set up a correct sp. We now have a correct stack.

2. In the second scenario (ss != ds), first of all put the _end (address of end of setup code) value in dx and check the loadflags header field with the testb instruction to see whether we can use the heap or not. loadflags is a bitmask header which is defined as:

    #define LOADED_HIGH   (1<<0)
    #define QUIET_FLAG    (1<<5)
    #define KEEP_SEGMENTS (1<<6)
    #define CAN_USE_HEAP  (1<<7)

And as we can read in the boot protocol:

    Field name: loadflags

      This field is a bitmask.

      Bit 7 (write): CAN_USE_HEAP
        Set this bit to 1 to indicate that the value entered in the
        heap_end_ptr is valid. If this field is clear, some setup code
        functionality will be disabled.

If the CAN_USE_HEAP bit is set, put heap_end_ptr in dx (which points to _end) and add STACK_SIZE (minimal stack size - 512 bytes) to it. After this, if the addition did not set the carry flag (it will not, since dx = _end + 512), jump to label 2 as in the previous case and make a correct stack.

3. When CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE.
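The three scenarios can be summarized in one small model of how the stack pointer value in dx is chosen. This is a sketch of the logic described above, not the kernel's code: the function name and its parameters are invented, the sample addresses for _end and heap_end_ptr are hypothetical, and the handling of an overflow on the addition (resetting dx to zero so that the 0xfffc fallback applies) is an assumption that the text above does not spell out.

```python
STACK_SIZE = 512          # minimal stack size mentioned above
CAN_USE_HEAP = 1 << 7     # loadflags bit 7


def choose_stack_pointer(ss: int, ds: int, sp: int, loadflags: int,
                         heap_end_ptr: int, end: int) -> int:
    """Model of the dx value that ends up in sp after the stack setup."""
    if ss == ds:
        dx = sp                                   # scenario 1: keep bootloader sp
    else:
        # scenarios 2 and 3: start from heap_end_ptr or from _end
        dx = heap_end_ptr if loadflags & CAN_USE_HEAP else end
        dx += STACK_SIZE
        if dx > 0xFFFF:                           # carry out of 16 bits (assumed handling)
            dx = 0
    dx &= 0xFFFC                                  # andw $~3, %dx
    if dx == 0:
        dx = 0xFFFC                               # movw $0xfffc, %dx
    return dx


print(hex(choose_stack_pointer(0x1000, 0x1000, 0xF7F4, 0, 0, 0)))                 # 0xf7f4
print(hex(choose_stack_pointer(0x2000, 0x1000, 0, CAN_USE_HEAP, 0xE000, 0x4000)))  # 0xe200
print(hex(choose_stack_pointer(0x2000, 0x1000, 0, 0, 0xE000, 0x4000)))             # 0x4200
```

In all three cases the result is 4 byte aligned and never zero, which is what label 2 guarantees before ss and esp are loaded.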

BSS Setup

The last two steps that need to happen before we can jump to the main C code are setting up the BSS area and checking the "magic" signature. First, signature checking:

    cmpl $0x5a5aaa55, setup_sig
    jne setup_bad

This simply compares the setup_sig with the magic number 0x5a5aaa55. If they are not equal, a fatal error is reported.

If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.

The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first blanked, using the following code:

    movw $__bss_start, %di
    movw $_end+3, %cx
    xorl %eax, %eax
    subw %di, %cx
    shrw $2, %cx
    rep; stosl

First of all the __bss_start address is moved into di and the _end + 3 address (+3 - aligns to 4 bytes) is moved into cx. The eax register is cleared (using an xor instruction), and the bss section size (cx - di) is calculated and put into cx. Then, cx is divided by four (the size of the 32-bit value that stosl stores), and the stosl instruction is repeatedly used, storing the value of eax (zero) into the address pointed to by di, automatically increasing di by four (this occurs until cx reaches zero). The net effect of this code is that zeros are written through all words in memory from __bss_start to _end.
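The rounding done by the +3 and the shift is easy to miss, so here is a small model of the BSS clearing arithmetic. It is only a sketch: memory is a plain bytearray, and the addresses 16 and 30 standing in for __bss_start and _end are arbitrary.

```python
def bss_dword_count(bss_start: int, end: int) -> int:
    """Number of 4 byte stores: (_end + 3 - __bss_start) >> 2 in 16 bit arithmetic."""
    return ((end + 3 - bss_start) & 0xFFFF) >> 2


def clear_bss(memory: bytearray, bss_start: int, end: int) -> None:
    """Zero the region the way rep; stosl would, four bytes at a time."""
    count = bss_dword_count(bss_start, end)
    memory[bss_start:bss_start + 4 * count] = bytes(4 * count)


memory = bytearray(b"\xff" * 64)
clear_bss(memory, 16, 30)         # 14 bytes of BSS, rounded up to 4 stores
print(bss_dword_count(16, 30))    # 4
print(memory[16:32] == bytes(16)) # True
```

Because the count is rounded up, the last store may clear up to three bytes past _end, which is harmless as long as nothing else lives there.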


Jump to main

That's all, we have the stack and BSS, so we can jump to the main() C function:

    calll main

The main() function is located in arch/x86/boot/main.c. What this does, you can read in the next part.

Conclusion

This is the end of the first part about Linux kernel internals. If you have questions or suggestions, ping me in twitter 0xAX, drop me an email or just create an issue. In the next part we will see the first C code which executes in the Linux kernel setup, the implementation of memory routines such as memset and memcpy, the earlyprintk implementation, early console initialization and many more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-internals.

Links

* Intel 80386 programmer's reference manual 1986
* Minimal Boot Loader for Intel® Architecture
* 8086
* 80386
* Reset vector
* Real mode
* Linux kernel boot protocol
* CoreBoot developer manual
* Ralf Brown's Interrupt List
* Power supply
* Power good signal

Kernel booting process. Part 2.

First steps in the kernel setup

We started to dive into linux kernel internals in the previous part and saw the initial part of the kernel setup code. We stopped at the first call to the main function (which is the first function written in C) from arch/x86/boot/main.c.

In this part we will continue to research the kernel setup code and see what protected mode is, some preparation for the transition into it, the heap and console initialization, memory detection, cpu validation, keyboard initialization and much much more.

So, let's go ahead.

Protected mode

Before we can move to the native Intel64 Long Mode, the kernel must switch the CPU into protected mode.

What is protected mode? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from the 80286 processor until Intel 64 and long mode came.

The main reason to move away from Real mode is that there is very limited access to the RAM. As you may remember from the previous part, there is only 2^20 bytes or 1 Megabyte, sometimes even only 640 Kilobytes of RAM available in the Real mode.

Protected mode brought many changes, but the main one is the difference in memory management. The 20-bit address bus was replaced with a 32-bit address bus. It allowed access to 4 Gigabytes of memory vs 1 Megabyte of real mode. Also paging support was added, which you can read about in the next sections.

Memory management in Protected mode is divided into two, almost independent parts:

* Segmentation
* Paging

Here we will only see segmentation. Paging will be discussed in the next sections.

As you can read in the previous part, addresses consist of two parts in real mode:

* Base address of the segment
* Offset from the segment base

And we can get the physical address if we know these two parts by:

    PhysicalAddress = Segment * 16 + Offset

Memory segmentation was completely redone in protected mode. There are no 64 Kilobyte fixed-size segments. Instead, the size and location of each segment is described by an associated data structure called Segment Descriptor. The segment descriptors are stored in a data structure called Global Descriptor Table (GDT).

The GDT is a structure which resides in memory. It has no fixed place in the memory, so its address is stored in the special GDTR register. Later we will see the GDT loading in the Linux kernel code. There will be an operation for loading it into memory, something like:

    lgdt gdt

where the lgdt instruction loads the base address and limit (size) of the global descriptor table to the GDTR register. GDTR is a 48-bit register and consists of two parts:

* size (16-bit) of global descriptor table;
* address (32-bit) of the global descriptor table.
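The two parts of GDTR map naturally onto a 6 byte memory operand: a 16-bit size field followed by a 32-bit address. The sketch below packs and unpacks such an operand; the helper names and the sample base address are hypothetical, and it relies on the x86 convention that the stored size field is the table size in bytes minus one.

```python
import struct


def pack_gdtr(base: int, limit: int) -> bytes:
    """Build the 6 byte operand: 16-bit limit first, then 32-bit base."""
    return struct.pack("<HI", limit, base)


def unpack_gdtr(raw: bytes) -> tuple:
    """Return (base, limit) from a 6 byte operand."""
    limit, base = struct.unpack("<HI", raw)
    return base, limit


# A table with three 8 byte descriptors placed at a made-up address.
raw = pack_gdtr(base=0x00010000, limit=3 * 8 - 1)
print(len(raw) * 8)      # 48
print(unpack_gdtr(raw))  # (65536, 23)
```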

As mentioned above the GDT contains segment descriptors which describe memory segments. Each descriptor is 64-bits in size. The general scheme of a descriptor is:

     31         24         19   16                 7           0
    ------------------------------------------------------------
    |             | |B| |A|       | |   | |0|E|W|A|            |
    | BASE 31:24  |G|/|L|V| LIMIT |P|DPL|S|  TYPE | BASE 23:16 | 4
    |             | |D| |L| 19:16 | |   | |1|C|R|A|            |
    ------------------------------------------------------------
    |                             |                            |
    |        BASE 15:0            |       LIMIT 15:0           | 0
    |                             |                            |
    ------------------------------------------------------------

Don't worry, I know it looks a little scary after real mode, but it's easy. For example LIMIT 15:0 means that bits 0-15 of the Descriptor contain the value for the limit. The rest of it is in LIMIT 19:16. So, the size of Limit is 0-19 i.e 20-bits. Let's take a closer look at it:

1. Limit [20 bits] is at bits 0-15 and 48-51 of the descriptor (LIMIT 15:0 and LIMIT 19:16). It defines length_of_segment - 1. It depends on the G (Granularity) bit.

* if G (bit 55) is 0 and segment limit is 0, the size of the segment is 1 Byte
* if G is 1 and segment limit is 0, the size of the segment is 4096 Bytes
* if G is 0 and segment limit is 0xfffff, the size of the segment is 1 Megabyte
* if G is 1 and segment limit is 0xfffff, the size of the segment is 4 Gigabytes

So, it means that

* if G is 0, Limit is interpreted in terms of 1 Byte and the maximum size of the segment can be 1 Megabyte.
* if G is 1, Limit is interpreted in terms of 4096 Bytes = 4 KBytes = 1 Page and the maximum size of the segment can be 4 Gigabytes. Actually when G is 1, the value of Limit is shifted to the left by 12 bits. So, 20 bits + 12 bits = 32 bits and 2^32 = 4 Gigabytes.

2. Base [32 bits] is at bits 16-31, 32-39 and 56-63. It defines the physical address of the segment's starting location.

3. Type/Attribute (bits 40-47) defines the type of segment and kinds of access to it.

The S flag at bit 44 specifies the descriptor type. If S is 0 then this segment is a system segment, whereas if S is 1 then this is a code or data segment (Stack segments are data segments which must be read/write segments).

To determine if the segment is a code or data segment we can check its Ex (bit 43) attribute, marked as 0 in the above diagram. If it is 0, then the segment is a Data segment, otherwise it is a code segment.

A segment can be of one of the following types:

    |           Type Field        | Descriptor Type | Description
    |-----------------------------|-----------------|------------------
    | Decimal                     |                 |
    |             0    E    W   A |                 |
    | 0           0    0    0   0 | Data            | Read-Only
    | 1           0    0    0   1 | Data            | Read-Only, accessed
    | 2           0    0    1   0 | Data            | Read/Write
    | 3           0    0    1   1 | Data            | Read/Write, accessed
    | 4           0    1    0   0 | Data            | Read-Only, expand-down
    | 5           0    1    0   1 | Data            | Read-Only, expand-down, accessed
    | 6           0    1    1   0 | Data            | Read/Write, expand-down
    | 7           0    1    1   1 | Data            | Read/Write, expand-down, accessed
    |                  C    R   A |                 |
    | 8           1    0    0   0 | Code            | Execute-Only
    | 9           1    0    0   1 | Code            | Execute-Only, accessed
    | 10          1    0    1   0 | Code            | Execute/Read
    | 11          1    0    1   1 | Code            | Execute/Read, accessed
    | 12          1    1    0   0 | Code            | Execute-Only, conforming
    | 13          1    1    0   1 | Code            | Execute-Only, conforming, accessed
    | 14          1    1    1   0 | Code            | Execute/Read, conforming
    | 15          1    1    1   1 | Code            | Execute/Read, conforming, accessed

As we can see the first bit (bit 43) is 0 for a data segment and 1 for a code segment. The next three bits (42, 41, 40) are either EWA (Expansion Writable Accessible) or CRA (Conforming Readable Accessible).

* if E (bit 42) is 0, expand up, otherwise expand down.
* if W (bit 41) (for data segments) is 1, write access is allowed, otherwise not. Note that read access is always allowed on data segments.
* A (bit 40) - whether the segment has been accessed by the processor or not.
* C (bit 42) is the conforming bit (for code selectors). If C is 1, the segment code can be executed from a lower privilege level, e.g. user level. If C is 0, it can only be executed from the same privilege level.
* R (bit 41) (for code segments). If 1, read access to the segment is allowed, otherwise not. Write access is never allowed to code segments.

4. DPL [2 bits] (Descriptor Privilege Level) is at bits 45-46. It defines the privilege level of the segment. It can be 0-3 where 0 is the most privileged.

5. P flag (bit 47) - indicates if the segment is present in memory or not. If P is 0, the segment will be presented as invalid and the processor will refuse to read this segment.

6. AVL flag (bit 52) - Available and reserved bits. It is ignored in Linux.

7. L flag (bit 53) - indicates whether a code segment contains native 64-bit code. If 1 then the code segment executes in 64 bit mode.

8. D/B flag (bit 54) - Default/Big flag represents the operand size i.e 16/32 bits. If it is set then 32 bit, otherwise 16.
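All of the fields listed above can be pulled out of a 64-bit descriptor with shifts and masks. The following sketch decodes one descriptor value; the function name decode_descriptor and the dictionary keys are invented for the example, and the sample value 0x00CF9A000000FFFF is a commonly used flat 4 Gigabyte code segment descriptor rather than something taken from the kernel source discussed here.

```python
def decode_descriptor(raw: int) -> dict:
    """Split a 64-bit segment descriptor into its fields."""
    limit = (raw & 0xFFFF) | (((raw >> 48) & 0xF) << 16)         # bits 0-15, 48-51
    base = ((raw >> 16) & 0xFFFFFF) | (((raw >> 56) & 0xFF) << 24)  # bits 16-39, 56-63
    g = (raw >> 55) & 1
    return {
        "base": base,
        "limit": limit,
        "type": (raw >> 40) & 0xF,   # bits 40-43
        "s": (raw >> 44) & 1,
        "dpl": (raw >> 45) & 3,
        "p": (raw >> 47) & 1,
        "avl": (raw >> 52) & 1,
        "l": (raw >> 53) & 1,
        "db": (raw >> 54) & 1,
        "g": g,
        # Limit counts bytes when G is 0 and 4096 byte pages when G is 1.
        "size": (limit + 1) * (4096 if g else 1),
    }


desc = decode_descriptor(0x00CF9A000000FFFF)
print(hex(desc["limit"]), desc["type"], desc["g"], desc["size"] == 2**32)
```

For this value the type field is 10 (binary 1010), which the table above lists as a Code, Execute/Read segment.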

Segment registers don't contain the base address of the segment as in real mode. Instead they contain a special structure - Segment Selector. Each Segment Descriptor has an associated Segment Selector. Segment Selector is a 16-bit structure:

    -----------------------------
    |       Index    | TI | RPL |
    -----------------------------

Where,

* Index shows the index number of the descriptor in the GDT.
* TI (Table Indicator) shows where to search for the descriptor. If it is 0 then search in the Global Descriptor Table (GDT), otherwise it will look in the Local Descriptor Table (LDT).
* And RPL is the Requester's Privilege Level.

Every segment register has a visible and hidden part.

* Visible - Segment Selector is stored here
* Hidden - Segment Descriptor (base, limit, attributes, flags)

The following steps are needed to get the physical address in the protected mode:

* The segment selector must be loaded in one of the segment registers
* The CPU tries to find a segment descriptor by GDT address + Index from selector and load the descriptor into the hidden part of the segment register
* Base address (from segment descriptor) + offset will be the linear address of the segment which is the physical address (if paging is disabled).
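The lookup steps above can be sketched in a few lines. This is a toy model only: the GDT is a plain Python list of dictionaries holding just a base, its contents are made up, LDT lookups are left out, and it relies on the standard x86 selector layout (RPL in bits 0-1, TI in bit 2, Index in bits 3-15).

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit segment selector into Index, TI and RPL."""
    return {
        "index": (selector >> 3) & 0x1FFF,
        "ti": (selector >> 2) & 1,
        "rpl": selector & 3,
    }


def linear_address(gdt: list, selector: int, offset: int) -> int:
    """Base from the selected descriptor plus offset (paging disabled)."""
    sel = decode_selector(selector)
    if sel["ti"] != 0:
        raise NotImplementedError("LDT lookups are outside this sketch")
    base = gdt[sel["index"]]["base"]
    return (base + offset) & 0xFFFFFFFF


# Hypothetical table: entry 0 is the unused null descriptor.
gdt = [None, {"base": 0x00000000}, {"base": 0x00010000}]

print(decode_selector(0x10))                  # {'index': 2, 'ti': 0, 'rpl': 0}
print(hex(linear_address(gdt, 0x10, 0x200)))  # 0x10200
```

A real CPU would also check the limit and the privilege levels at this point; the sketch skips those checks to keep the address arithmetic visible.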

Schematically it will look like this:

[image]

The algorithm for the transition from real mode into protected mode is:

* Disable interrupts
* Describe and load GDT with lgdt instruction
* Set PE (Protection Enable) bit in CR0 (Control Register 0)
* Jump to protected mode code

We will see the complete transition to protected mode in the linux kernel in the next part, but before we can move to protected mode, we need to do some more preparations.

Let's look at arch/x86/boot/main.c. We can see some routines there which perform keyboard initialization, heap initialization, etc... Let's take a look.

Copying boot parameters into the "zeropage"

We will start from the main routine in "main.c". The first function which is called in main is copy_boot_params(void). It copies the kernel setup header into the field of the boot_params structure which is defined in arch/x86/include/uapi/asm/bootparam.h.

The boot_params structure contains the struct setup_header hdr field. This structure contains the same fields as defined in the linux boot protocol and is filled by the boot loader and also at kernel compile/build time. copy_boot_params does two things:

1. Copies hdr from header.S to the boot_params structure in the setup_header field

2. Updates the pointer to the kernel command line if the kernel was loaded with the old command line protocol.

Note that it copies hdr with the memcpy function which is defined in the copy.S source file. Let's have a look inside:

    GLOBAL(memcpy)
        pushw %si
        pushw %di
        movw %ax, %di
        movw %dx, %si
        pushw %cx
        shrw $2, %cx
        rep; movsl
        popw %cx
        andw $3, %cx
        rep; movsb
        popw %di
        popw %si
        retl
    ENDPROC(memcpy)

Yeah, we just moved to C code and now assembly again :) First of all we can see that memcpy and other routines which are defined here start and end with the two macros: GLOBAL and ENDPROC. GLOBAL is described in arch/x86/include/asm/linkage.h which defines the globl directive and the label for it. ENDPROC is described in include/linux/linkage.h which marks the name symbol as a function name and ends with the size of the name symbol.

The implementation of memcpy is easy. At first, it pushes the values of the si and di registers onto the stack to preserve them, because they will change during the memcpy. memcpy (and other functions in copy.S) use fastcall calling conventions. So it gets its incoming parameters from the ax, dx and cx registers. Calling memcpy looks like this:

    memcpy(&boot_params.hdr, &hdr, sizeof hdr);

So,

* ax will contain the address of boot_params.hdr
* dx will contain the address of hdr
* cx will contain the size of hdr in bytes.

memcpy puts the address of boot_params.hdr into di, the address of hdr into si, and saves the size on the stack. After this it shifts the size right by 2 bits (divides it by 4) and copies from si to di by 4 bytes. After this we restore the size of hdr again, take its remainder modulo 4 and copy the rest of the bytes from si to di byte by byte (if there are any). Finally it restores the si and di values from the stack, and after this the copying is finished.
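The copy strategy of this routine (as many 4 byte moves as possible, then the 0 to 3 leftover bytes one at a time) can be modelled like this. It is a sketch of the strategy only, with an invented function name, not a translation of copy.S.

```python
def memcpy_model(dst: bytearray, src: bytes, n: int) -> None:
    """Copy n bytes from src to dst the way the assembly routine does."""
    dwords = n >> 2     # shrw $2, %cx  -> number of 4 byte moves (rep; movsl)
    tail = n & 3        # andw $3, %cx  -> leftover bytes (rep; movsb)
    pos = 0
    for _ in range(dwords):
        dst[pos:pos + 4] = src[pos:pos + 4]
        pos += 4
    for _ in range(tail):
        dst[pos] = src[pos]
        pos += 1


src = bytes(range(11))      # 11 bytes: two 4 byte moves plus 3 single bytes
dst = bytearray(16)
memcpy_model(dst, src, len(src))
print(dst[:11] == src, dst[11:] == bytes(5))
```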

Console initialization

After the hdr is copied into boot_params.hdr, the next step is console initialization by calling the console_init function, which is defined in arch/x86/boot/early_serial_console.c.

It tries to find the earlyprintk option in the command line and, if the search was successful, parses the port address and baud rate of the serial port and initializes the serial port. The value of the earlyprintk command line option can be one of these:

* serial,0x3f8,115200
* serial,ttyS0,115200
* ttyS0,115200

After serial port initialization we can see the first output:

    if (cmdline_find_option_bool("debug"))
        puts("early console in setup code\n");

The definition of puts is in tty.c. As we can see, it prints character by character in a loop by calling the putchar function. Let's look into the putchar implementation:

    void __attribute__((section(".inittext"))) putchar(int ch)
    {
        if (ch == '\n')
            putchar('\r');

        bios_putchar(ch);

        if (early_serial_base != 0)
            serial_putchar(ch);
    }

__attribute__((section(".inittext"))) means that this code will be in the .inittext section. We can find it in the linker file setup.ld.

First of all, putchar checks for the \n symbol and, if it is found, prints \r before it. After that it outputs the character on the VGA screen by calling the BIOS with the 0x10 interrupt call:

    static void __attribute__((section(".inittext"))) bios_putchar(int ch)
    {
        struct biosregs ireg;

        initregs(&ireg);
        ireg.bx = 0x0007;
        ireg.cx = 0x0001;
        ireg.ah = 0x0e;
        ireg.al = ch;
        intcall(0x10, &ireg, NULL);
    }

Here initregs takes the biosregs structure, first fills it with zeros using the memset function, and then fills it with register values:

    memset(reg, 0, sizeof *reg);
    reg->eflags |= X86_EFLAGS_CF;
    reg->ds = ds();
    reg->es = ds();
    reg->fs = fs();
    reg->gs = gs();

Let's look at the memset implementation:

    GLOBAL(memset)
        pushw   %di
        movw    %ax, %di
        movzbl  %dl, %eax
        imull   $0x01010101, %eax
        pushw   %cx
        shrw    $2, %cx
        rep; stosl
        popw    %cx
        andw    $3, %cx
        rep; stosb
        popw    %di
        retl
    ENDPROC(memset)

As you can read above, it uses the fastcall calling convention like the memcpy function, which means that the function gets its parameters from the ax, dx and cx registers.

Generally memset is like the memcpy implementation. It saves the value of the di register on the stack and puts the ax value, which is the address of the biosregs structure, into di. Next is the movzbl instruction, which copies the dl value to the low byte of the eax register. The remaining 3 high bytes of eax are filled with zeros.

The next instruction multiplies eax by 0x01010101. This is needed because memset copies 4 bytes at a time. For example, suppose we need to fill a structure with 0x7 using memset. eax will contain 0x00000007 in this case. If we multiply eax by 0x01010101, we get 0x07070707, and now we can copy these 4 bytes into the structure. memset uses the rep; stosl instructions to copy eax into es:di.

The rest of the memset function does almost the same as memcpy.

After the biosregs structure is filled with memset, bios_putchar calls the 0x10 interrupt, which prints a character. Afterwards putchar checks whether the serial port was initialized and, if it was, writes the character there with serial_putchar, which uses the inb/outb instructions.
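The multiplication trick used by memset is easy to try in user space. The sketch below is an illustration only: replicate_byte and fill_dwords_then_bytes are names invented for this example, and the loops stand in for rep; stosl and rep; stosb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte into all four bytes of a 32-bit word (the imull step). */
static uint32_t replicate_byte(uint8_t value)
{
    return (uint32_t)value * 0x01010101u;
}

/* Hypothetical name: fill 4 bytes at a time, then the remaining 0..3 bytes. */
static void *fill_dwords_then_bytes(void *dst, uint8_t value, size_t len)
{
    unsigned char *d = dst;
    uint32_t pattern = replicate_byte(value);
    size_t dwords = len >> 2;   /* like "shrw $2, %cx" */
    size_t tail = len & 3;      /* like "andw $3, %cx" */

    assert(dst != NULL);

    while (dwords--) {          /* stands in for "rep; stosl" */
        memcpy(d, &pattern, sizeof(pattern));
        d += 4;
    }
    while (tail--)              /* stands in for "rep; stosb" */
        *d++ = value;
    return dst;
}
```

With value 0x07 the pattern is 0x07070707, exactly as in the example above.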

Heap initialization

After the stack and bss section were prepared in header.S (see the previous part), the kernel needs to initialize the heap with the init_heap function.

First of all init_heap checks the CAN_USE_HEAP flag from the loadflags in the kernel setup header and calculates the end of the stack if this flag was set:

    char *stack_end;

    if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
        asm("leal %P1(%%esp),%0"
            : "=r" (stack_end) : "i" (-STACK_SIZE));

or in other words stack_end = esp - STACK_SIZE.

Then there is the heap_end calculation:

    heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);

which means heap_end_ptr, or _end, plus 512 (0x200). Finally, it checks whether heap_end is greater than stack_end. If it is, heap_end is set equal to stack_end.

Now the heap is initialized and we can use it through the GET_HEAP macro. We will see how it is used and how it is implemented in the next posts.
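The heap_end calculation and the final check can be written as a small pure function. This is a sketch under the assumption that addresses are passed as plain integers; clamp_heap_end is a name invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: heap end after adding 0x200 and clamping to the stack end. */
static uintptr_t clamp_heap_end(uintptr_t heap_end_ptr, uintptr_t stack_end)
{
    uintptr_t heap_end = heap_end_ptr + 0x200;  /* heap_end_ptr + 512 */

    assert(heap_end >= heap_end_ptr);           /* the addition must not wrap */
    if (heap_end > stack_end)
        heap_end = stack_end;                   /* the heap must not run into the stack */
    return heap_end;
}
```

For example, with heap_end_ptr at 0x8000 and the stack ending at 0xf000 the heap ends at 0x8200; if adding 0x200 would go past 0xf000, the heap ends at 0xf000 instead.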

CPU validation

The next step, as we can see, is CPU validation by validate_cpu from arch/x86/boot/cpu.c.

It calls the check_cpu function, passes the CPU level and the required CPU level to it, and checks that the kernel is launched on the right CPU level.

    check_cpu(&cpu_level, &req_level, &err_flags);
    if (cpu_level < req_level) {
        ...
        return -1;
    }

check_cpu checks the CPU's flags, the presence of long mode in the case of an x86_64 (64-bit) CPU, checks the processor's vendor and makes preparations for certain vendors, like turning on SSE+SSE2 for AMD if they are missing, etc.

Memory detection

The next step is memory detection by the detect_memory function. detect_memory basically provides a map of available RAM to the CPU. It uses different programming interfaces for memory detection like 0xe820, 0xe801 and 0x88. We will see only the implementation of 0xE820 here.

Let's look into the detect_memory_e820 implementation from the arch/x86/boot/memory.c source file. First of all, the detect_memory_e820 function initializes the biosregs structure as we saw above and fills registers with special values for the 0xe820 call:

    initregs(&ireg);
    ireg.ax  = 0xe820;
    ireg.cx  = sizeof buf;
    ireg.edx = SMAP;
    ireg.di  = (size_t)&buf;

* ax contains the number of the function (0xe820 in our case)
* cx contains the size of the buffer which will contain data about memory
* edx must contain the SMAP magic number
* es:di must contain the address of the buffer which will contain memory data
* ebx has to be zero.

Next is a loop where data about the memory is collected. It starts with a call of the 0x15 BIOS interrupt, which writes one line from the address allocation table. To get the next line we need to call this interrupt again (which we do in the loop). Before the next call ebx must contain the value returned previously:

    intcall(0x15, &ireg, &oreg);
    ireg.ebx = oreg.ebx;

Ultimately, it iterates in the loop to collect data from the address allocation table and writes this data into the e820entry array:

* start of memory segment
* size of memory segment
* type of memory segment (which can be reserved, usable, etc.)

You can see the result of this in the dmesg output, something like:

    [    0.000000] e820: BIOS-provided physical RAM map:
    [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
    [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
    [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
    [    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
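To make the entry layout concrete, here is a small user-space sketch that models the three fields listed above and sums the usable ranges from the sample dmesg output. It is an illustration, not the kernel's definition: the struct simply follows the start/size/type description, the type values 1 (usable) and 2 (reserved) are the usual E820 conventions, and usable_bytes and sample_map are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One line of the address allocation table: start, size and type. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
} __attribute__((packed));

enum { E820_RAM = 1, E820_RESERVED = 2 };

/* Hypothetical helper: total number of bytes in usable ranges. */
static uint64_t usable_bytes(const struct e820entry *map, size_t n)
{
    uint64_t total = 0;

    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (map[i].type == E820_RAM)
            total += map[i].size;
    return total;
}

/* The ranges from the dmesg output above, as start/size pairs. */
static const struct e820entry sample_map[] = {
    { 0x0000000000000000ULL, 0x0009fc00ULL, E820_RAM },
    { 0x000000000009fc00ULL, 0x00000400ULL, E820_RESERVED },
    { 0x00000000000f0000ULL, 0x00010000ULL, E820_RESERVED },
    { 0x0000000000100000ULL, 0x3fee0000ULL, E820_RAM },
    { 0x000000003ffe0000ULL, 0x00020000ULL, E820_RESERVED },
    { 0x00000000fffc0000ULL, 0x00040000ULL, E820_RESERVED },
};
```

Each size is the end address minus the start address plus one, so the first usable range is 0x9fbff - 0x0 + 1 = 0x9fc00 bytes.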

Keyboard initialization

The next step is the initialization of the keyboard with a call to the keyboard_init() function. At first keyboard_init initializes registers using the initregs function and calls the 0x16 interrupt to get the keyboard status.

    initregs(&ireg);
    ireg.ah = 0x02;     /* Get keyboard status */
    intcall(0x16, &ireg, &oreg);
    boot_params.kbd_status = oreg.al;

After this it calls 0x16 again to set the repeat rate and delay.

    ireg.ax = 0x0305;   /* Set keyboard repeat rate */
    intcall(0x16, &ireg, NULL);

Querying

The next couple of steps are queries for different parameters. We will not dive into details about these queries, but will get back to them in later parts. Let's take a short look at these functions:

The query_mca routine calls the 0x15 BIOS interrupt to get the machine model number, sub-model number, BIOS revision level, and other hardware-specific attributes:

    int query_mca(void)
    {
        struct biosregs ireg, oreg;
        u16 len;

        initregs(&ireg);
        ireg.ah = 0xc0;
        intcall(0x15, &ireg, &oreg);

        if (oreg.eflags & X86_EFLAGS_CF)
            return -1;  /* No MCA present */

        set_fs(oreg.es);
        len = rdfs16(oreg.bx);

        if (len > sizeof(boot_params.sys_desc_table))
            len = sizeof(boot_params.sys_desc_table);

        copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
        return 0;
    }

It fills the ah register with 0xc0 and calls the 0x15 BIOS interrupt. After the interrupt execution it checks the carry flag and, if it is set to 1, the BIOS doesn't support MCA (https://en.wikipedia.org/wiki/Micro_Channel_architecture). If the carry flag is set to 0, ES:BX will contain a pointer to the system information table, which looks like this:

    Offset  Size     Description
     00h    WORD     number of bytes following
     02h    BYTE     model (see #00515)
     03h    BYTE     submodel (see #00515)
     04h    BYTE     BIOS revision: 0 for first release, 1 for 2nd, etc.
     05h    BYTE     feature byte 1 (see #00510)
     06h    BYTE     feature byte 2 (see #00511)
     07h    BYTE     feature byte 3 (see #00512)
     08h    BYTE     feature byte 4 (see #00513)
     09h    BYTE     feature byte 5 (see #00514)
    ---AWARD BIOS---
     0Ah  N BYTEs    AWARD copyright notice
    ---Phoenix BIOS---
     0Ah    BYTE     ??? (00h)
     0Bh    BYTE     major version
     0Ch    BYTE     minor version (BCD)
     0Dh  4 BYTEs    ASCIZ string "PTL" (Phoenix Technologies Ltd)
    ---Quadram Quad386---
     0Ah 17 BYTEs    ASCII signature string "Quadram Quad 386XT"
    ---Toshiba (Satellite Pro 435CDS at least)---
     0Ah  7 BYTEs    signature "TOSHIBA"
     11h    BYTE     ??? (8h)
     12h    BYTE     ??? (E7h) product ID??? (guess)
     13h  3 BYTEs    "JPN"

Next we call the set_fs routine and pass the value of the es register to it. The implementation of set_fs is pretty simple:

    static inline void set_fs(u16 seg)
    {
        asm volatile("movw %0,%%fs" : : "rm" (seg));
    }

This function contains inline assembly which gets the value of the seg parameter and puts it into the fs register. There are many functions in boot.h like set_fs, for example set_gs, and fs and gs for reading the values of these registers.

At the end of query_mca it just copies the table pointed to by es:bx into boot_params.sys_desc_table.

The next step is getting Intel SpeedStep information by calling the query_ist function. First of all it checks the CPU level and, if it is correct, calls 0x15 to get the info and saves the result in boot_params.

The following query_apm_bios function gets Advanced Power Management information from the BIOS. query_apm_bios calls the 0x15 BIOS interrupt too, but with ah = 0x53 to check the APM installation. After the 0x15 execution, query_apm_bios checks the PM signature (it must be 0x504d), the carry flag (it must be 0 if APM is supported) and the value of the cx register (if it's 0x02, the protected mode interface is supported).

Next it calls 0x15 again, but with ax = 0x5304, to disconnect the APM interface and connect the 32-bit protected mode interface. In the end it fills boot_params.apm_bios_info with values obtained from the BIOS.

Note that query_apm_bios will be executed only if CONFIG_APM or CONFIG_APM_MODULE was set in the configuration file:

    #if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
        query_apm_bios();
    #endif

The last is the query_edd function, which queries Enhanced Disk Drive information from the BIOS. Let's look into the query_edd implementation.

First of all it reads the edd option from the kernel's command line and, if it was set to off, query_edd just returns.

If EDD is enabled, query_edd goes over the BIOS-supported hard disks and queries EDD information in the following loop:

    for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
        if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) {
            memcpy(edp, &ei, sizeof ei);
            edp++;
            boot_params.eddbuf_entries++;
        }
        ...
        ...
        ...

where 0x80 is the first hard drive and the value of the EDD_MBR_SIG_MAX macro is 16. It collects data into the array of edd_info structures. get_edd_info checks that EDD is present by invoking the 0x13 interrupt with ah set to 0x41 and, if EDD is present, get_edd_info calls the 0x13 interrupt again, but with ah set to 0x48 and si containing the address of the buffer where the EDD information will be stored.

Conclusion

This is the end of the second part about Linux kernel internals. In the next part we will see video mode setting and the rest of the preparations before the transition to protected mode, and the transition itself.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

* Protected mode
* Protected mode
* Long mode
* Nice explanation of CPU Modes with code
* How to Use Expand Down Segments on Intel 386 and Later CPUs
* earlyprintk documentation
* Kernel Parameters
* Serial console
* Intel SpeedStep
* APM
* EDD specification
* TLDP documentation for Linux Boot Process (old)
* Previous Part

Kernel booting process. Part 3.

Video mode initialization and transition to protected mode

This is the third part of the Kernel booting process series. In the previous part, we stopped right before the call of the set_video routine from main.c. In this part, we will see:

* video mode initialization in the kernel setup code,
* preparation before switching into protected mode,
* transition to protected mode

NOTE If you don't know anything about protected mode, you can find some information about it in the previous part. Also there are a couple of links which can help you.

As I wrote above, we will start from the set_video function, which is defined in the arch/x86/boot/video.c source code file. We can see that it starts by getting the video mode from the boot_params.hdr structure:

    u16 mode = boot_params.hdr.vid_mode;

which we filled in the copy_boot_params function (you can read about it in the previous post). vid_mode is an obligatory field which is filled by the bootloader. You can find information about it in the kernel boot protocol:

    Offset  Proto   Name        Meaning
    /Size
    01FA/2  ALL     vid_mode    Video mode control

As we can read from the linux kernel boot protocol:

    vga=<mode>
        <mode> here is either an integer (in C notation, either
        decimal, octal, or hexadecimal) or one of the strings
        "normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
        (meaning 0xFFFD).  This value should be entered into the
        vid_mode field, as it is used by the kernel before the command
        line is parsed.

So we can add the vga option to the grub or another bootloader configuration file and it will pass this option to the kernel command line. This option can have different values, as mentioned in the description: for example, it can be the integer 0xFFFD or ask. If you pass ask to vga, you will see a menu which will ask you to select a video mode. We will look at its implementation, but before diving into the implementation we have to look at some other things.
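The mapping from a vga= value to a vid_mode number, as quoted from the boot protocol above, can be sketched as follows. This is not how any particular bootloader implements it; parse_vga_option is a name invented for this example, and strtoul with base 0 is used because it accepts the same decimal, octal and hexadecimal C notations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: turn the text after "vga=" into a vid_mode value. */
static uint16_t parse_vga_option(const char *value)
{
    assert(value != NULL);

    if (strcmp(value, "normal") == 0)
        return 0xFFFF;
    if (strcmp(value, "ext") == 0)
        return 0xFFFE;
    if (strcmp(value, "ask") == 0)
        return 0xFFFD;

    /* Base 0: decimal, octal (leading 0) or hexadecimal (leading 0x). */
    return (uint16_t)strtoul(value, NULL, 0);
}
```

So vga=ask and vga=0xFFFD end up as the same vid_mode value.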

Kernel data types

Earlier we saw definitions of different data types like u16 etc. in the kernel setup code. Let's look at a couple of data types provided by the kernel:

    Type  char  short  int  long  u8  u16  u32  u64
    Size  1     2      4    8     1   2    4    8

If you read the source code of the kernel, you'll see these very often, so it will be good to remember them.

Heap API

After we have vid_mode from boot_params.hdr in the set_video function, we can see a call to RESET_HEAP. RESET_HEAP is a macro which is defined in boot.h. It is defined as:

    #define RESET_HEAP() ((void *)( HEAP = _end ))

If you have read the second part, you will remember that we initialized the heap with the init_heap function. We have a couple of utility functions for the heap which are defined in boot.h. They are:

    #define RESET_HEAP()

As we saw just above, it resets the heap by setting the HEAP variable equal to _end, where _end is just extern char _end[];

Next is the GET_HEAP macro:

    #define GET_HEAP(type, n) \
        ((type *)__get_heap(sizeof(type),__alignof__(type),(n)))

for heap allocation. It calls the internal function __get_heap with 3 parameters:

* the size of a type in bytes, which needs to be allocated
* __alignof__(type) shows how variables of this type are aligned
* n tells how many items to allocate

The implementation of __get_heap is:

    static inline char *__get_heap(size_t s, size_t a, size_t n)
    {
        char *tmp;

        HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
        tmp = HEAP;
        HEAP += s*n;
        return tmp;
    }

and further we will see its usage, something like:

    saved.data = GET_HEAP(u16, saved.x * saved.y);

Let's try to understand how __get_heap works. We can see here that HEAP (which is equal to _end after RESET_HEAP()) is first rounded up to an address aligned according to the a parameter. After this we save the memory address from HEAP in the tmp variable, move HEAP to the end of the allocated block and return tmp, which is the start address of the allocated memory.

And the last function is:

    static inline bool heap_free(size_t n)
    {
        return (int)(heap_end - HEAP) >= (int)n;
    }

which subtracts the value of HEAP from heap_end (we calculated it in the previous part) and returns 1 if there is enough memory for n.

That's all. Now we have a simple API for the heap and can set up the video mode.
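The same bump-allocator idea can be tried in user space. The sketch below is an approximation under stated assumptions: a static 256-byte array named arena stands in for the memory between _end and heap_end, and reset_heap, heap_has_room and get_heap are names invented for this example that mirror RESET_HEAP, heap_free and __get_heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for _end, HEAP and heap_end (assumptions). */
static _Alignas(16) char arena[256];
static char *heap_cur = arena;
static char *const heap_limit = arena + sizeof(arena);

/* Like RESET_HEAP(): forget all allocations. */
static void reset_heap(void)
{
    heap_cur = arena;
}

/* Like heap_free(): is there room for n more bytes? */
static bool heap_has_room(size_t n)
{
    return (size_t)(heap_limit - heap_cur) >= n;
}

/* Like __get_heap(): align the cursor, then advance it by size * count. */
static void *get_heap(size_t size, size_t align, size_t count)
{
    char *block;

    assert(align != 0 && (align & (align - 1)) == 0);   /* power of two */

    /* Round the cursor up to the next multiple of align. */
    heap_cur = (char *)(((uintptr_t)heap_cur + (align - 1)) &
                        ~(uintptr_t)(align - 1));
    block = heap_cur;
    heap_cur += size * count;
    return block;
}
```

As in the setup code, get_heap itself does not check for space; the caller is expected to ask heap_has_room first.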

Setup video mode

Now we can move directly to video mode initialization. We stopped at the RESET_HEAP() call in the set_video function. Next is the call to store_mode_params, which stores video mode parameters in the boot_params.screen_info structure, which is defined in include/uapi/linux/screen_info.h.

If we look at the store_mode_params function, we can see that it starts with a call to the store_cursor_position function. As you can understand from the function name, it gets information about the cursor and stores it.

First of all store_cursor_position initializes two variables of type biosregs, sets AH = 0x3 and calls the 0x10 BIOS interrupt. After the interrupt has executed successfully, it returns the column and row in the DL and DH registers. They are stored in the orig_x and orig_y fields of the boot_params.screen_info structure.

After store_cursor_position has executed, the store_video_mode function is called. It just gets the current video mode and stores it in boot_params.screen_info.orig_video_mode.

After this, it checks the current video mode and sets the video_segment. After the BIOS transfers control to the boot sector, the following addresses are used for video memory:

    0xB000:0x0000   32 Kb   Monochrome Text Video Memory
    0xB800:0x0000   32 Kb   Color Text Video Memory

So we set the video_segment variable to 0xB000 if the current video mode is MDA, HGC, or VGA in monochrome mode, and to 0xB800 in color mode. After setting up the address of the video segment, the font size needs to be stored in boot_params.screen_info.orig_video_points with:

    set_fs(0);
    font_size = rdfs16(0x485);
    boot_params.screen_info.orig_video_points = font_size;

First of all we put 0 into the FS register with the set_fs function. We already saw functions like set_fs in the previous part. They are all defined in boot.h. Next we read the value which is located at address 0x485 (this memory location is used to get the font size) and save the font size in boot_params.screen_info.orig_video_points.

    x = rdfs16(0x44a);
    y = (adapter == ADAPTER_CGA) ? 25 : rdfs8(0x484)+1;

Next we get the number of columns from address 0x44a and the number of rows from address 0x484 and store them in boot_params.screen_info.orig_video_cols and boot_params.screen_info.orig_video_lines. After this, execution of store_mode_params is finished.

Next we can see the save_screen function, which just saves the screen contents to the heap. This function collects all the data which we got in the previous functions, like the number of rows and columns, and stores it in the saved_screen structure, which is defined as:

    static struct saved_screen {
        int x, y;
        int curx, cury;
        u16 *data;
    } saved;

It then checks whether the heap has free space for it with:

    if (!heap_free(saved.x*saved.y*sizeof(u16)+512))
        return;

and, if there is enough space, allocates it in the heap and stores saved_screen in it.

The next call is probe_cards(0) from arch/x86/boot/video-mode.c. It goes over all video_cards and collects the number of modes provided by the cards. Here is the interesting part. We can see the loop:

    for (card = video_cards; card < video_cards_end; card++) {
        /* collecting number of modes here */
    }

but video_cards is not declared anywhere. The answer is simple: every video mode presented in the x86 kernel setup code has a definition like this:

    static __videocard video_vga = {
        .card_name  = "VGA",
        .probe      = vga_probe,
        .set_mode   = vga_set_mode,
    };

where __videocard is a macro:

    #define __videocard struct card_info __attribute__((used,section(".videocards")))

which means that the card_info structure:

    struct card_info {
        const char *card_name;
        int (*set_mode)(struct mode_info *mode);
        int (*probe)(void);
        struct mode_info *modes;
        int nmodes;
        int unsafe;
        u16 xmode_first;
        u16 xmode_n;
    };

is in the .videocards section. Let's look in the arch/x86/boot/setup.ld linker file, where we can see:

    .videocards : {
        video_cards = .;
        *(.videocards)
        video_cards_end = .;
    }

It means that video_cards is just a memory address and all card_info structures are placed in this section, between video_cards and video_cards_end, so we can use a loop to go over all of them. After probe_cards has executed we have all structures like static __videocard video_vga with nmodes (the number of video modes) filled in.
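The same "collect structures in a section and iterate between two symbols" technique can be tried in an ordinary user-space program. The sketch below makes several assumptions that differ from the setup code: it relies on GCC or Clang attributes and on the GNU linker's automatically generated __start_<section> and __stop_<section> symbols (available when the section name is a valid C identifier) instead of a custom linker script, and struct card, DEFINE_CARD, probe_vga, probe_vesa and count_modes are all names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* A reduced, hypothetical version of card_info. */
struct card {
    const char *name;
    int (*probe)(void);     /* returns the number of modes */
};

/* Put each descriptor into the "videocards" section. The explicit alignment
 * keeps the compiler from padding between entries. */
#define DEFINE_CARD                                         \
    static struct card                                      \
    __attribute__((used, section("videocards"),             \
                   aligned(__alignof__(struct card))))

static int probe_vga(void)  { return 7; }
static int probe_vesa(void) { return 3; }

DEFINE_CARD card_vga  = { "VGA",  probe_vga };
DEFINE_CARD card_vesa = { "VESA", probe_vesa };

/* Provided by GNU ld (and compatible linkers) for this section name. */
extern struct card __start_videocards[];
extern struct card __stop_videocards[];

/* Walk every descriptor in the section, like the probe_cards loop. */
static int count_modes(void)
{
    int total = 0;

    for (struct card *c = __start_videocards; c < __stop_videocards; c++) {
        assert(c->probe != NULL);
        total += c->probe();
    }
    return total;
}
```

No array of cards is declared anywhere in this program; the linker builds it, which is the point of the trick.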

After probe_cards has finished, we move to the main loop in the set_video function. There is an infinite loop which tries to set up the video mode with the set_mode function, or prints a menu if we passed vga=ask on the kernel command line or the video mode is undefined.

The set_mode function is defined in video-mode.c and gets only one parameter, mode, which is the number of the video mode (we got it either from the menu or at the start of set_video, from the kernel setup header).

The set_mode function checks the mode and calls the raw_set_mode function. raw_set_mode calls the set_mode function of the selected card, i.e. card->set_mode(struct mode_info *). We can get access to this function from the card_info structure; every video mode defines this structure with values filled in depending on the video mode (for example, for vga it is the video_vga.set_mode function; see the example of the card_info structure for vga above). video_vga.set_mode is vga_set_mode, which checks the vga mode and calls the respective function:

    static int vga_set_mode(struct mode_info *mode)
    {
        vga_set_basic_mode();

        force_x = mode->x;
        force_y = mode->y;

        switch (mode->mode) {
        case VIDEO_80x25:
            break;
        case VIDEO_8POINT:
            vga_set_8font();
            break;
        case VIDEO_80x43:
            vga_set_80x43();
            break;
        case VIDEO_80x28:
            vga_set_14font();
            break;
        case VIDEO_80x30:
            vga_set_80x30();
            break;
        case VIDEO_80x34:
            vga_set_80x34();
            break;
        case VIDEO_80x60:
            vga_set_80x60();
            break;
        }
        return 0;
    }

Every function which sets up a video mode just calls the 0x10 BIOS interrupt with a certain value in the AH register.

After we have set the video mode, we pass it to boot_params.hdr.vid_mode.

Next vesa_store_edid is called. This function simply stores the EDID (Extended Display Identification Data) information for kernel use. After this store_mode_params is called again. Lastly, if do_restore is set, the screen is restored to an earlier state.

After this we have set the video mode and now we can switch to protected mode.

Last preparation before transition into protected mode

We can see the last function call, go_to_protected_mode, in main.c. As the comment says: Do the last things and invoke protected mode, so let's see these last things and switch into protected mode.

go_to_protected_mode is defined in arch/x86/boot/pm.c. It contains some functions which make the last preparations before we can jump into protected mode, so let's look at it and try to understand what they do and how they work.

First is the call to the realmode_switch_hook function in go_to_protected_mode. This function invokes the real mode switch hook if it is present and disables NMI. Hooks are used if the bootloader runs in a hostile environment. You can read more about hooks in the boot protocol (see ADVANCED BOOT LOADER HOOKS).

The realmode_swtch hook is a pointer to a 16-bit real mode far subroutine which disables non-maskable interrupts. After the realmode_swtch hook (it isn't present for me) is checked, disabling of Non-Maskable Interrupts (NMI) occurs:

    asm volatile("cli");
    outb(0x80, 0x70);   /* Disable NMI */
    io_delay();

At first there is an inline assembly statement with the cli instruction, which clears the interrupt flag (IF). After this, external interrupts are disabled. The next line disables NMI (non-maskable interrupts).

An interrupt is a signal to the CPU which is emitted by hardware or software. After getting the signal, the CPU suspends the current instruction sequence, saves its state and transfers control to the interrupt handler. After the interrupt handler has finished its work, it transfers control back to the interrupted instruction. Non-maskable interrupts (NMI) are interrupts which are always processed, independently of permission. They cannot be ignored and are typically used to signal non-recoverable hardware errors. We will not dive into the details of interrupts now, but will discuss them in the next posts.

Let's get back to the code. We can see that the second line writes the byte 0x80 (the NMI disable bit) to port 0x70 (the CMOS address register). After that, a call to the io_delay function occurs. io_delay causes a small delay and looks like:

    static inline void io_delay(void)
    {
        const u16 DELAY_PORT = 0x80;
        asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
    }

Outputting any byte to port 0x80 should delay roughly 1 microsecond. So we can write any value (the value from the AL register in our case) to port 0x80. After this delay the realmode_switch_hook function has finished execution and we can move on to the next function.

The next function is enable_a20, which enables the A20 line. This function is defined in arch/x86/boot/a20.c and it tries to enable the A20 gate with different methods. The first is the a20_test_short function, which checks whether A20 is already enabled using the a20_test function:

    static int a20_test(int loops)
    {
        int ok = 0;
        int saved, ctr;

        set_fs(0x0000);
        set_gs(0xffff);

        saved = ctr = rdfs32(A20_TEST_ADDR);

        while (loops--) {
            wrfs32(++ctr, A20_TEST_ADDR);
            io_delay();     /* Serialize and make delay constant */
            ok = rdgs32(A20_TEST_ADDR+0x10) ^ ctr;
            if (ok)
                break;
        }

        wrfs32(saved, A20_TEST_ADDR);
        return ok;
    }

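First of all we put 0x0000 into the FS register and 0xffff into the GS register. Next we read the value at address A20_TEST_ADDR (it is 0x200) and put this value into the saved variable and ctr.

Next we write the updated ctr value to fs:A20_TEST_ADDR with the wrfs32 function, wait for a short delay with io_delay, and then read the value at gs:A20_TEST_ADDR+0x10. With A20 disabled, addresses wrap around at 1 MB, so 0xffff:0x210 refers to the same memory as 0x0000:0x200 and the two values are equal. If they differ (the XOR is not zero), the A20 line is already enabled. If A20 is disabled, we try to enable it with the other methods which you can find in a20.c, for example with a call of the 0x15 BIOS interrupt with AX=0x2401.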
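Why these two addresses are compared becomes clearer with a little arithmetic. The sketch below only models the address calculation; it does not touch any hardware, and real_mode_phys is a name invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Physical address of seg:off in real mode. With A20 disabled, address
 * bit 20 is forced to 0, so addresses wrap around at 1 MB. */
static uint32_t real_mode_phys(uint16_t seg, uint16_t off, bool a20_enabled)
{
    uint32_t addr = ((uint32_t)seg << 4) + off;

    assert(addr <= 0x10FFEFu);      /* highest reachable: 0xffff:0xffff */
    if (!a20_enabled)
        addr &= 0xFFFFFu;           /* keep only address bits 0..19 */
    return addr;
}
```

With A20 disabled, 0xffff:0x210 maps to 0x200, the same location as 0x0000:0x200; with A20 enabled it maps to 0x100200, so a write through fs is no longer visible through gs.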

If the enable_a20 function fails, we print an error message and call the die function. You can remember it from the first source code file where we started, arch/x86/boot/header.S:

    die:
        hlt
        jmp     die
        .size   die, .-die

After the A20 gate is successfully enabled, the reset_coprocessor function is called:

    outb(0, 0xf0);
    outb(0, 0xf1);

This function clears the Math Coprocessor by writing 0 to 0xf0 and then resets it by writing 0 to 0xf1.

After this the mask_all_interrupts function is called:

    outb(0xff, 0xa1);   /* Mask all interrupts on the secondary PIC */
    outb(0xfb, 0x21);   /* Mask all but cascade on the primary PIC */

This masks all interrupts on the secondary PIC (Programmable Interrupt Controller) and on the primary PIC, except for IRQ2 on the primary PIC.
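The two mask bytes can be decoded bit by bit: in a PIC mask, a set bit n means that IRQ line n of that controller is masked. The helper below is a small illustration, and irq_line_masked is a name invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the given line (0..7) is masked in a PIC mask byte. */
static bool irq_line_masked(uint8_t mask, unsigned line)
{
    assert(line < 8);
    return (mask >> line) & 1u;
}
```

0xfb is 11111011 in binary, so only line 2 of the primary PIC (the cascade to the secondary PIC) stays unmasked, while 0xff masks every line of the secondary PIC.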

And after all of these preparations, we can see the actual transition into protected mode.

Setup Interrupt Descriptor Table

Now we set up the Interrupt Descriptor Table (IDT) with setup_idt:

    static void setup_idt(void)
    {
        static const struct gdt_ptr null_idt = {0, 0};
        asm volatile("lidtl %0" : : "m" (null_idt));
    }

which sets up the Interrupt Descriptor Table (it describes interrupt handlers etc.). For now the IDT is not installed (we will see it later); we just load the IDT with the lidtl instruction. null_idt contains the address and size of the IDT, but for now they are just zero. null_idt is a gdt_ptr structure, which is defined as:

    struct gdt_ptr {
        u16 len;
        u32 ptr;
    } __attribute__((packed));

where we can see the 16-bit length (len) of the IDT and the 32-bit pointer to it (we will see more details about the IDT and interrupts in the next posts). __attribute__((packed)) means here that the size of gdt_ptr is the minimum required. So the size of gdt_ptr will be 6 bytes here, or 48 bits. (Next we will load a pointer to a gdt_ptr into the GDTR register, and you might remember from the previous post that it is 48 bits in size.)

Setup Global Descriptor Table

Next is the setup of the Global Descriptor Table (GDT). We can see the setup_gdt function which sets up the GDT (you can read about it in Kernel booting process. Part 2.). There is a definition of the boot_gdt array in this function, which contains the definitions of the three segments:

    static const u64 boot_gdt[] __attribute__((aligned(16))) = {
        [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
        [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
        [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
    };

for code, data and TSS (Task State Segment). We will not use the task state segment for now; it was added there to make Intel VT happy, as we can see in the comment line (if you're interested you can find the commit which describes it here). Let's look at boot_gdt. First of all, note that it has the __attribute__((aligned(16))) attribute. It means that this structure will be aligned on a 16-byte boundary. Let's look at a simple example:

    #include <stdio.h>

    struct aligned {
        int a;
    } __attribute__((aligned(16)));

    struct nonaligned {
        int b;
    };

    int main(void)
    {
        struct aligned    a;
        struct nonaligned na;

        printf("Not aligned - %zu \n", sizeof(na));
        printf("Aligned - %zu \n", sizeof(a));

        return 0;
    }

Technically a structure which contains one int field must be 4 bytes, but here the aligned structure will be 16 bytes:

    $ gcc test.c -o test && ./test
    Not aligned - 4
    Aligned - 16

GDT_ENTRY_BOOT_CS has index 2 here, GDT_ENTRY_BOOT_DS is GDT_ENTRY_BOOT_CS + 1, etc. It starts from 2 because the first is a mandatory null descriptor (index 0) and the second is not used (index 1).

GDT_ENTRY is a macro which takes flags, base and limit and builds a GDT entry. For example, let's look at the code segment entry. GDT_ENTRY takes the following values:

* base - 0
* limit - 0xfffff
* flags - 0xc09b

What does it mean? The segment's base address is 0 and the limit (size of the segment) is 0xfffff; since the granularity bit is set, the limit is counted in 4 KB units, which gives a 4 GB segment. Let's look at the flags. It is 0xc09b and it will be:

    1100 0000 1001 1011

in binary. Let's try to understand what every bit means. We will go through all bits from left to right:

* 1 - (G) granularity bit
* 1 - (D) if 0, 16-bit segment; if 1, 32-bit segment
* 0 - (L) executed in 64 bit mode if 1
* 0 - (AVL) available for use by system software
* 0000 - bits 19:16 of the limit in the descriptor
* 1 - (P) segment presence in memory
* 00 - (DPL) privilege level, 0 is the highest privilege
* 1 - (S) code or data segment, not a system segment
* 101 - segment type: execute/read
* 1 - accessed bit

You can read more about every bit in the previous post or in the Intel® 64 and IA-32 Architectures Software Developer's Manuals 3A.
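To see how flags, base and limit end up inside one 8-byte descriptor, the packing can be written as a plain C function. This is a sketch that follows the descriptor layout from the Intel manual rather than a copy of the kernel macro; gdt_entry and the small decoding helpers are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack flags, base and limit into an 8-byte segment descriptor. */
static uint64_t gdt_entry(uint16_t flags, uint32_t base, uint32_t limit)
{
    assert(limit <= 0xFFFFFu);                       /* the limit is a 20-bit field */

    return (((uint64_t)base  & 0xFF000000u) << 32) | /* base 31:24   -> bits 63:56 */
           (((uint64_t)flags & 0x0000F0FFu) << 40) | /* flags        -> bits 55:52, 47:40 */
           (((uint64_t)limit & 0x000F0000u) << 32) | /* limit 19:16  -> bits 51:48 */
           (((uint64_t)base  & 0x00FFFFFFu) << 16) | /* base 23:0    -> bits 39:16 */
            ((uint64_t)limit & 0x0000FFFFu);         /* limit 15:0   -> bits 15:0  */
}

static bool gdt_granularity_4k(uint64_t entry) { return (entry >> 55) & 1; }
static bool gdt_present(uint64_t entry)        { return (entry >> 47) & 1; }
static unsigned gdt_dpl(uint64_t entry)        { return (entry >> 45) & 3; }

/* Segment size in bytes: (limit + 1) units of 1 byte or 4 KB. */
static uint64_t gdt_segment_size(uint64_t entry)
{
    uint64_t limit = (entry & 0xFFFF) | ((entry >> 32) & 0xF0000);

    return gdt_granularity_4k(entry) ? (limit + 1) << 12 : limit + 1;
}
```

For the code segment, gdt_entry(0xc09b, 0, 0xfffff) gives 0x00cf9b000000ffff, a present, ring 0 segment that covers 4 GB.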

After this we get the length of the GDT with:

    gdt.len = sizeof(boot_gdt)-1;

We get the size of boot_gdt and subtract 1 (the offset of the last valid byte in the GDT).

Next we get a pointer to the GDT with:

    gdt.ptr = (u32)&boot_gdt + (ds() << 4);

Here we just get the address of boot_gdt and add it to the address of the data segment left-shifted by 4 bits (remember, we're in real mode now).
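Both fields can be checked with a small user-space model. This is a sketch only: the segment and offset values used in the usage note are made-up examples, and make_gdt_ptr is a name invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the structure above: 16-bit length, 32-bit linear address. */
struct gdt_ptr {
    uint16_t len;
    uint32_t ptr;
} __attribute__((packed));

/* Hypothetical helper: describe a table of table_size bytes located at
 * the real-mode address ds:table_offset. */
static struct gdt_ptr make_gdt_ptr(uint16_t ds, uint16_t table_offset,
                                   size_t table_size)
{
    struct gdt_ptr p;

    assert(table_size >= 8 && table_size <= 0x10000);

    p.len = (uint16_t)(table_size - 1);             /* offset of the last valid byte */
    p.ptr = ((uint32_t)ds << 4) + table_offset;     /* segment * 16 + offset */
    return p;
}
```

For a five-entry table (40 bytes) at the made-up address 0x1000:0x4c80 this gives len = 39 and ptr = 0x14c80, and sizeof(struct gdt_ptr) is 6 because of the packed attribute.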

Lastly we execute the lgdtl instruction to load the GDT into the GDTR register:

    asm volatile("lgdtl %0" : : "m" (gdt));

Actual transition into protected mode

This is the end of the go_to_protected_mode function. We loaded the IDT and GDT, disabled interrupts and now can switch the CPU into protected mode. As the last step we call the protected_mode_jump function with two parameters:

    protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4));

which is defined in arch/x86/boot/pmjump.S. It takes two parameters:

* address of the protected mode entry point
* address of boot_params

Let's look inside protected_mode_jump. As I wrote above, you can find it in arch/x86/boot/pmjump.S. The first parameter will be in the eax register and the second in edx.

First of all we put the address of boot_params in the esi register and the value of the code segment register cs (0x1000) in bx. After this we shift bx left by 4 bits and add the result to the value stored at label 2 (so that the jump target at label 2 holds the physical address of in_pm32), and jump to label 1. Next we put the data segment and task state segment selectors in the cx and di registers with:

    movw    $__BOOT_DS, %cx
    movw    $__BOOT_TSS, %di

As you can read above, GDT_ENTRY_BOOT_CS has index 2 and every GDT entry is 8 bytes, so __BOOT_CS will be 2 * 8 = 16, __BOOT_DS is 24, etc.
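The relation between a GDT index and the value loaded into a segment register can be sketched like this. A selector also carries a table indicator bit and a requested privilege level in its low three bits, both zero here; make_selector and selector_index are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* A selector is index * 8, plus the table indicator (bit 2) and RPL (bits 1:0). */
static uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}

/* The reverse direction: which descriptor does a selector refer to? */
static unsigned selector_index(uint16_t selector)
{
    return selector >> 3;
}
```

Indexes 2, 3 and 4 give the selectors 16, 24 and 32 (0x10, 0x18 and 0x20).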

Next we set the PE (Protection Enable) bit in the CR0 control register:

    movl    %cr0, %edx
    orb     $X86_CR0_PE, %dl
    movl    %edx, %cr0

and make a long jump to protected mode:

        .byte   0x66, 0xea
    2:  .long   in_pm32
        .word   __BOOT_CS

where

* 0x66 is the operand-size prefix, which allows us to mix 16-bit and 32-bit code,
* 0xea is the jump opcode,
* in_pm32 is the segment offset,
* __BOOT_CS is the code segment.

After this we are finally in protected mode:

    .code32
    .section ".text32","ax"

Let's look at the first steps in protected mode. First of all we set up the data segment with:

    movl    %ecx, %ds
    movl    %ecx, %es
    movl    %ecx, %fs
    movl    %ecx, %gs
    movl    %ecx, %ss

If you read attentively, you will remember that we saved $__BOOT_DS in the cx register. Now we fill all segment registers besides cs with it (cs is already __BOOT_CS). Next we zero out the general purpose registers besides eax (and esi, which holds the address of boot_params) with:

    xorl    %ecx, %ecx
    xorl    %edx, %edx
    xorl    %ebx, %ebx
    xorl    %ebp, %ebp
    xorl    %edi, %edi

And jump to the 32-bit entry point in the end:

    jmpl    *%eax

Remember that eax contains the address of the 32-bit entry (we passed it as the first parameter to protected_mode_jump).

That's all. We're in protected mode and stop at its entry point. We will see what happens next in the next part.

Conclusion

This is the end of the third part about linux kernel internals. In the next part we will see the first steps in protected mode and the transition into long mode.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR with corrections at linux-internals.

Links

* VGA
* VESA BIOS Extensions
* Data structure alignment
* Non-maskable interrupt
* A20
* GCC designated inits
* GCC type attributes
* Previous part

Kernel booting process. Part 4.

Transition to 64-bit mode

This is the fourth part of the Kernel booting process, and we will see the first steps in protected mode, like checking that the CPU supports long mode and SSE, paging and initialization of the page tables, and the transition to long mode at the end of this part.

NOTE: there will be a lot of assembly code in this part, so if you are not familiar with it, you might want to read a book about it first.

In the previous part we stopped at the jump to the 32-bit entry point in arch/x86/boot/pmjump.S:

    jmpl    *%eax

Recall that the eax register contains the address of the 32-bit entry point. We can read about this point in the linux kernel x86 boot protocol:

    When using bzImage, the protected-mode kernel was relocated to 0x100000

And now we can make sure that it is true. Let's look at the register values at the 32-bit entry point:

    eax            0x100000 1048576
    ecx            0x0      0
    edx            0x0      0
    ebx            0x0      0
    esp            0x1ff5c  0x1ff5c
    ebp            0x0      0x0
    esi            0x14470  83056
    edi            0x0      0
    eip            0x100000 0x100000
    eflags         0x46     [ PF ZF ]
    cs             0x10     16
    ss             0x18     24
    ds             0x18     24
    es             0x18     24
    fs             0x18     24
    gs             0x18     24

We can see here that the cs register contains 0x10 (as you may remember from the previous part, it is the second index in the Global Descriptor Table), the eip register is 0x100000, and the base address of all segments, including the code segment, is zero. So we can get the physical address: it will be 0:0x100000 or just 0x100000, as in the boot protocol. Now let's start with the 32-bit entry point.

32-bit entry point

We can find the definition of the 32-bit entry point in arch/x86/boot/compressed/head_64.S:

        __HEAD
        .code32
    ENTRY(startup_32)
    ....
    ....
    ....
    ENDPROC(startup_32)

Firstofallwhycompresseddirectory?Actuallybzimageisagzippedvmlinux+header+kernelsetupcode.Wesawthekernelsetupcodeintheallofpreviousparts.So,themaingoalofthehead_64.Sistoprepareforenteringlongmode,enterintoitanddecompressthekernel.Wewillseeallofthesestepsbesideskerneldecompressioninthispart.

Alsoyoucannotethattherearetwofilesinthearch/x86/boot/compresseddirectory:

head_32.Shead_64.S

Wewillseeonlyhead_64.Sbecausewearelearninglinuxkernelforx86_64.head_32.Sevennotcompiledinourcase.Let'slookonthearch/x86/boot/compressed/Makefile,wecanseetherefollowingtarget:

    vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
        $(obj)/string.o $(obj)/cmdline.o \
        $(obj)/piggy.o $(obj)/cpuflags.o

Note $(obj)/head_$(BITS).o. It means that which of head_{32,64}.o is compiled depends on the value of $(BITS). We can find it in another Makefile, arch/x86/Makefile:

    ifeq ($(CONFIG_X86_32),y)
            BITS := 32
            ...
            ...
    else
            ...
            ...
            BITS := 64
    endif

Now we know where to start, so let's do it.

Reload the segments if needed

As I wrote above, we start in arch/x86/boot/compressed/head_64.S. First of all, we can see this before the startup_32 definition:

        __HEAD
        .code32
    ENTRY(startup_32)

__HEAD is defined in include/linux/init.h and looks like this:

    #define __HEAD .section ".head.text","ax"

We can find this section in the arch/x86/boot/compressed/vmlinux.lds.S linker script:

    SECTIONS
    {
        . = 0;
        .head.text : {
            _head = . ;
            HEAD_TEXT
            _ehead = . ;
        }

Note the . = 0; line. The . symbol is a special linker variable, the location counter. A value assigned to it is an offset relative to the start of the containing segment. Here we assign zero to it, and the comment in the linker script explains why:

    Be careful parts of head_64.S assume startup_32 is at address 0.

OK, now we know where we are, and it is a good time to look inside the startup_32 function.

At the start of startup_32 we can see the cld instruction, which clears the DF flag. After this, string operations such as stosb will increment the index registers esi or edi.

Next we can see the check of the KEEP_SEGMENTS flag from loadflags. If you remember, we already saw loadflags in arch/x86/boot/head.S (there we checked the CAN_USE_HEAP flag). Now we need to check the KEEP_SEGMENTS flag. We can find the description of this flag in the linux boot protocol:

    Bit 6 (write): KEEP_SEGMENTS
      Protocol: 2.07+
      - If 0, reload the segment registers in the 32bit entry point.
      - If 1, do not reload the segment registers in the 32bit entry point.
        Assume that %cs %ds %ss %es are all set to flat segments with
        a base of 0 (or the equivalent for their environment).

So if KEEP_SEGMENTS is not set, we need to set the ds, ss and es registers to a flat segment with base 0. That is what this code does:

        testb $(1<<6), BP_loadflags(%esi)
        jnz 1f

        cli
        movl $(__BOOT_DS), %eax
        movl %eax, %ds
        movl %eax, %es
        movl %eax, %ss

Remember that __BOOT_DS is 0x18 (the selector of the data segment in the Global Descriptor Table). If KEEP_SEGMENTS is set, we jump to the nearest label 1 and skip the reload; if it is not set, we fall through and update the segment registers with __BOOT_DS.
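The selector values seen so far (0x10 in cs, 0x18 for __BOOT_DS) and the KEEP_SEGMENTS test can be checked with a few lines of Python. This is only an illustrative sketch: the helper names decode_selector and must_reload_segments are hypothetical and do not exist in the kernel.

```python
KEEP_SEGMENTS = 1 << 6   # bit 6 of loadflags, as in the boot protocol excerpt


def decode_selector(selector):
    """Split an x86 segment selector into (index, table_indicator, rpl)."""
    index = selector >> 3        # bits 15..3: descriptor index in the table
    ti = (selector >> 2) & 1     # bit 2: 0 = GDT, 1 = LDT
    rpl = selector & 0b11        # bits 1..0: requested privilege level
    return index, ti, rpl


def must_reload_segments(loadflags):
    """Mirror of `testb $(1<<6), loadflags; jnz 1f`: reload only if the bit is clear."""
    return (loadflags & KEEP_SEGMENTS) == 0


print(decode_selector(0x10))   # cs from the register dump
print(decode_selector(0x18))   # __BOOT_DS
print(must_reload_segments(0x00), must_reload_segments(0x40))
```

Shifting the selector right by three bits gives the descriptor index, which is why 0x10 refers to entry 2 and 0x18 to entry 3 of the GDT.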

If you read the previous part, you may remember that we already updated the segment registers in arch/x86/boot/pmjump.S, so why do we need to set them up again? The linux kernel also has a 32-bit boot protocol, so startup_32 can be the first function executed right after a bootloader transfers control to the kernel.

Once we have checked the KEEP_SEGMENTS flag and put the correct value into the segment registers, the next step is to calculate the difference between the address where we were loaded and the address we were compiled to run at (remember that vmlinux.lds.S contains . = 0 at the start of the section):

        leal (BP_scratch+4)(%esi), %esp
        call 1f
    1:  popl %ebp
        subl $1b, %ebp

Here the esi register contains the address of the boot_params structure. boot_params contains a special field, scratch, at offset 0x1e4. We take the address of the scratch field + 4 bytes and put it into the esp register (we will use it as a temporary stack for these calculations). After this we can see the call instruction with the 1f label as its operand. What does call do? It pushes the return address, that is, the address of the instruction right after the call, onto the stack and jumps to the target. Here the target is label 1, which is exactly that next instruction, so we then pop the return address from the stack into the ebp register (ebp now contains the run-time address of label 1) and subtract the link-time address of label 1 from it.

After this, ebp holds the address where we were loaded: 0x100000.
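The call/pop trick above can be modelled in a few lines. This is a sketch only: call_pop_delta is a hypothetical name, and the label offset 0x2a used in the example is an invented value, not the real offset of label 1 in head_64.S.

```python
def call_pop_delta(load_base, label_link_addr):
    """Model `call 1f; 1: popl %ebp; subl $1b, %ebp`.

    load_base       -- physical address the image was loaded at
    label_link_addr -- address of label 1 as seen by the linker (image linked at 0)
    """
    stack = []
    # `call 1f` pushes the run-time address of the next instruction, which is label 1
    stack.append(load_base + label_link_addr)
    ebp = stack.pop()            # popl %ebp
    ebp -= label_link_addr       # subl $1b, %ebp
    return ebp


print(hex(call_pop_delta(0x100000, 0x2a)))
```

Because the image is linked at address 0, the result is the load address itself, whatever the offset of the label happens to be.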

Stack setup and CPU verification

Now we can set up the stack and verify that the CPU supports long mode and SSE.

Next we can see the assembly code which sets up a new stack for kernel decompression:

        movl $boot_stack_end, %eax
        addl %ebp, %eax
        movl %eax, %esp

boot_stack_end is in the .bss section; we can see its definition at the end of head_64.S:

        .bss
        .balign 4
    boot_heap:
        .fill BOOT_HEAP_SIZE, 1, 0
    boot_stack:
        .fill BOOT_STACK_SIZE, 1, 0
    boot_stack_end:

First of all we put the address of boot_stack_end into the eax register and add the value of ebp to it (remember that ebp now contains the address where we were loaded, 0x100000). In the end we just put the eax value into esp, and that's all: we have a correct stack pointer.

The next step is CPU verification. We need to check that the CPU supports long mode and SSE:

        call verify_cpu
        testl %eax, %eax
        jnz no_longmode

It just calls the verify_cpu function from arch/x86/kernel/verify_cpu.S, which contains a couple of calls of the cpuid instruction. cpuid is the instruction used for getting information about the processor. In our case it checks long mode and SSE support and returns 0 on success or 1 on failure in the eax register.

If eax is not zero, we jump to the no_longmode label, which just stops the CPU with the hlt instruction in an endless loop, so the CPU stays halted even if a hardware interrupt wakes it up:

    no_longmode:
    1:
        hlt
        jmp 1b

We have set up the stack and checked the CPU, and now we can move on to the next step.

Calculate relocation address

The next step is to calculate the relocation address for decompression, if needed. We can see the following assembly code:

    #ifdef CONFIG_RELOCATABLE
        movl %ebp, %ebx
        movl BP_kernel_alignment(%esi), %eax
        decl %eax
        addl %eax, %ebx
        notl %eax
        andl %eax, %ebx
        cmpl $LOAD_PHYSICAL_ADDR, %ebx
        jge 1f
    #endif
        movl $LOAD_PHYSICAL_ADDR, %ebx
    1:
        addl $z_extract_offset, %ebx

First of all, note the CONFIG_RELOCATABLE macro. This configuration option is defined in arch/x86/Kconfig, and as we can read from its description:

    This builds a kernel image that retains relocation information
    so it can be loaded someplace besides the default 1MB.

    Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
    it has been loaded at and the compile time physical address
    (CONFIG_PHYSICAL_START) is used as the minimum location.

In short, this code calculates the address to which the kernel will be moved for decompression and puts it into the ebx register. If the kernel is relocatable, that address is derived from the address where we were loaded; otherwise the bzImage will decompress itself above LOAD_PHYSICAL_ADDR.

Let's look at the code. If we have CONFIG_RELOCATABLE=n in our kernel configuration file, it just puts LOAD_PHYSICAL_ADDR into the ebx register and adds z_extract_offset to ebx, so ebx ends up holding LOAD_PHYSICAL_ADDR + z_extract_offset. Now let's try to understand these two values.

LOAD_PHYSICAL_ADDR is a macro defined in arch/x86/include/asm/boot.h, and it looks like this:

    #define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
                    + (CONFIG_PHYSICAL_ALIGN - 1)) \
                    & ~(CONFIG_PHYSICAL_ALIGN - 1))

Here we calculate the aligned address where the kernel is loaded (0x100000 or 1 megabyte in our case). PHYSICAL_ALIGN is the alignment value to which the kernel should be aligned; it ranges from 0x200000 to 0x1000000 for x86_64. With these values we get 2 megabytes for LOAD_PHYSICAL_ADDR:

    >>> 0x100000 + (0x200000 - 1) & ~(0x200000 - 1)
    2097152

After we have this aligned address, we add z_extract_offset (which is 0xe5c000 in my case) to the 2 megabytes. In the end we get an offset of 17154048 bytes. You can find z_extract_offset in arch/x86/boot/compressed/piggy.S. This file is generated at compile time by the mkpiggy program.
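The arithmetic for the non-relocatable case can be reproduced as a small script. It is a sketch that uses the values quoted in the text (0x100000, 0x200000 and the author's z_extract_offset of 0xe5c000); align_up is a hypothetical helper name, not a kernel function.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


CONFIG_PHYSICAL_START = 0x100000   # value used in the text
CONFIG_PHYSICAL_ALIGN = 0x200000   # value used in the text
Z_EXTRACT_OFFSET = 0xe5c000        # the author's value; it differs between builds

load_physical_addr = align_up(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
ebx = load_physical_addr + Z_EXTRACT_OFFSET   # movl + addl in the assembly

print(hex(load_physical_addr), ebx)
```

The first value matches the 2 megabytes computed above, and the second matches the 17154048 byte offset.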

Now let's try to understand the code for the case when CONFIG_RELOCATABLE is y.

First of all we put the ebp value into ebx (remember that ebp contains the address where we were loaded) and the kernel_alignment field from the kernel setup header into the eax register. kernel_alignment is the physical address alignment required for the kernel. Next we do the same as in the previous case (when the kernel is not relocatable), but we use the value of the kernel_alignment field as the alignment unit and ebx (the address where we were loaded) as the base address, instead of CONFIG_PHYSICAL_ALIGN and LOAD_PHYSICAL_ADDR.

After we have calculated the address, we compare it with LOAD_PHYSICAL_ADDR. If the calculated address is less than LOAD_PHYSICAL_ADDR, we put LOAD_PHYSICAL_ADDR into ebx instead. In both cases we then add z_extract_offset to it.

After all of these calculations, ebp contains the address where we were loaded and ebx contains the address to which the kernel will be moved for decompression.
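Both branches of the relocation calculation can be put side by side in one function. This is a simplified sketch: relocation_target is a hypothetical name, the load address 0x1234567 is an invented example, and the signed comparison of the real jge instruction is ignored.

```python
def relocation_target(load_addr, kernel_alignment, load_physical_addr,
                      z_extract_offset, relocatable=True):
    """Return the value left in ebx by the code under discussion."""
    if relocatable:
        mask = kernel_alignment - 1          # decl %eax
        ebx = (load_addr + mask) & ~mask     # addl, notl, andl
        if ebx < load_physical_addr:         # cmpl; jge 1f is not taken
            ebx = load_physical_addr
    else:
        ebx = load_physical_addr
    return ebx + z_extract_offset            # addl $z_extract_offset, %ebx


# Loaded at 1 MB: the aligned address is below the minimum, so the minimum wins
print(hex(relocation_target(0x100000, 0x200000, 0x200000, 0xe5c000)))
# Loaded at an unaligned higher address: rounded up to the next 2 MB boundary
print(hex(relocation_target(0x1234567, 0x200000, 0x200000, 0xe5c000)))
```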

Preparation before entering long mode

Now we need to do the last preparations before we can see the transition to 64-bit mode. First we need to update the Global Descriptor Table:

        leal gdt(%ebp), %eax
        movl %eax, gdt+2(%ebp)
        lgdt gdt(%ebp)

Here we put the address of gdt, offset by the load address in ebp, into the eax register. Next we store this address at gdt+2 (again offset by ebp), which is the base field of the GDT pointer, and then we load the Global Descriptor Table with the lgdt instruction.

Let's look at the Global Descriptor Table definition:

        .data
    gdt:
        .word gdt_end - gdt
        .long gdt
        .word 0
        .quad 0x0000000000000000 /* NULL descriptor */
        .quad 0x00af9a000000ffff /* __KERNEL_CS */
        .quad 0x00cf92000000ffff /* __KERNEL_DS */
        .quad 0x0080890000000000 /* TS descriptor */
        .quad 0x0000000000000000 /* TS continued */

It is defined in the same file, in the .data section. It contains five entries: a null descriptor, the kernel code segment, the kernel data segment and two entries for the task state descriptor. We already loaded a GDT in the previous part, and we are doing almost the same here, but now the code segment descriptor has CS.L = 1 and CS.D = 0 for execution in 64-bit mode.

After we have loaded the Global Descriptor Table, we must enable PAE mode by putting the value of the cr4 register into eax, setting bit 5 in it and loading it back into cr4:

        movl %cr4, %eax
        orl $X86_CR4_PAE, %eax
        movl %eax, %cr4

Now we have finished almost all of the preparations before we can move into 64-bit mode. The last step is to build the page tables, but first some information about long mode.

Long mode

Long mode is the native mode for x86_64 processors. First of all let's look at some differences between x86_64 and x86.

It provides features such as:

* 8 new general purpose registers, r8 to r15, and all general purpose registers are now 64-bit
* 64-bit instruction pointer, RIP
* New operating mode, long mode
* 64-bit addresses and operands
* RIP-relative addressing (we will see an example of it in the next parts)

Long mode is an extension of legacy protected mode. It consists of two sub-modes:

* 64-bit mode
* compatibility mode

To switch into 64-bit mode we need to do the following things:

* enable PAE (we already did it, see above)
* build page tables and load the address of the top-level page table into the cr3 register
* enable EFER.LME
* enable paging

We already enabled PAE by setting the PAE bit in the cr4 register. Now let's look at paging.

Early page tables initialization

Before we can move into 64-bit mode, we need to build page tables, so let's look at how the early 4G boot page tables are built.

NOTE: I will not describe the theory of virtual memory here; if you need to know more about it, see the links at the end.

The linux kernel uses 4-level paging, and here we build 6 page tables:

* One PML4 table
* One PDP table
* Four Page Directory tables

Let's look at the implementation. First of all we clear the buffer for the page tables in memory. Every table is 4096 bytes, so we need a 24 kilobyte buffer:

        leal pgtable(%ebx), %edi
        xorl %eax, %eax
        movl $((4096*6)/4), %ecx
        rep stosl

We put the address stored in ebx (remember that ebx contains the address to which the kernel is relocated for decompression) plus the pgtable offset into the edi register. pgtable is defined at the end of head_64.S and looks like this:

        .section ".pgtable","a",@nobits
        .balign 4096
    pgtable:
        .fill 6*4096, 1, 0

It is in the .pgtable section and its size is 24 kilobytes. After we put the address into edi, we zero out the eax register and write zeros to the buffer with the rep stosl instruction.

Now we can build the top-level page table, PML4, with:

        leal pgtable + 0(%ebx), %edi
        leal 0x1007(%edi), %eax
        movl %eax, 0(%edi)

Here we take the address stored in ebx plus the pgtable offset and put it into edi. Next we put this address plus 0x1007 into the eax register. 0x1007 is 4096 bytes (the size of the PML4) + 7 (the PML4 entry flags: PRESENT+RW+USER). Then we write eax to the memory that edi points to. After this manipulation, the first PML4 entry contains the address of the Page Directory Pointer table with the flags PRESENT+RW+USER.

In the next step we build 4 Page Directory entries in the Page Directory Pointer table. Every entry has the 0x7 flags, and consecutive entries are 8 bytes apart:

        leal pgtable + 0x1000(%ebx), %edi
        leal 0x1007(%edi), %eax
        movl $4, %ecx
    1:  movl %eax, 0x00(%edi)
        addl $0x00001000, %eax
        addl $8, %edi
        decl %ecx
        jnz 1b

We put the base address of the page directory pointer table into edi and the address of the first page directory, together with the 0x7 flags, into eax. We put 4 into the ecx register, which will be the counter in the following loop, and write the first entry to the memory that edi points to.

After this, the first page directory pointer entry contains the address of the first page directory with the flags 0x7. The loop then adds 0x1000 to eax to get the address of each following page directory, advances edi by 8 bytes (the size of one entry) and writes the next entry.

The next step is building 2048 page directory entries, each of which maps a 2 megabyte page:

        leal pgtable + 0x2000(%ebx), %edi
        movl $0x00000183, %eax
        movl $2048, %ecx
    1:  movl %eax, 0(%edi)
        addl $0x00200000, %eax
        addl $8, %edi
        decl %ecx
        jnz 1b

Here we do almost the same as in the previous example, except that the first entry is $0x00000183 (the PRESENT, WRITE, page size and GLOBAL bits with physical address 0) and each following entry adds 0x200000 to the address, with edi again advancing by 8 bytes. In the end we have 2048 pages of 2 megabytes each.

Our early page table structure is done. It maps 4 gigabytes of memory, and now we can put the address of the top-level page table, PML4, into the cr3 control register:

        leal pgtable(%ebx), %eax
        movl %eax, %cr3
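To see that these six tables really give an identity mapping of the first 4 gigabytes, the three loops can be replayed in Python and then walked the way the CPU would walk them. This is a model, not kernel code: build_early_page_tables and translate are hypothetical names, memory is represented as a dictionary from physical address to 64-bit entry, and the pgtable address 0x1000000 is an arbitrary page-aligned example.

```python
ENTRY_SIZE = 8  # every page table entry is 8 bytes


def build_early_page_tables(pgtable):
    """Return {physical address: entry} for the PML4, PDP and four PD tables."""
    mem = {}
    # PML4: entry 0 points at the PDP table, flags PRESENT+RW+USER (0x7)
    mem[pgtable] = pgtable + 0x1007
    # PDP table: four entries, one per page directory, 0x1000 apart
    for i in range(4):
        mem[pgtable + 0x1000 + i * ENTRY_SIZE] = pgtable + 0x2007 + i * 0x1000
    # Page directories: 2048 entries, each mapping one 2 MB page
    for i in range(2048):
        mem[pgtable + 0x2000 + i * ENTRY_SIZE] = 0x183 + i * 0x200000
    return mem


def translate(mem, cr3, vaddr):
    """Walk PML4 -> PDP -> PD for a 2 MB page and return the physical address."""
    pml4e = mem[cr3 + ((vaddr >> 39) & 0x1ff) * ENTRY_SIZE]
    pdpte = mem[(pml4e & ~0xfff) + ((vaddr >> 30) & 0x1ff) * ENTRY_SIZE]
    pde = mem[(pdpte & ~0xfff) + ((vaddr >> 21) & 0x1ff) * ENTRY_SIZE]
    return (pde & ~0x1fffff) + (vaddr & 0x1fffff)


PGTABLE = 0x1000000  # hypothetical, page-aligned location of pgtable
tables = build_early_page_tables(PGTABLE)
for address in (0x0, 0x12345678, 0xfffff000):
    print(hex(address), "->", hex(translate(tables, PGTABLE, address)))
```

Every address below 4 GB translates to itself, which is what lets the code keep running at the same addresses once paging is switched on.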

That's all. Now we can see the transition to long mode.

Transition to the long mode

First of all we need to set the EFER.LME flag in the MSR at 0xC0000080:

        movl $MSR_EFER, %ecx
        rdmsr
        btsl $_EFER_LME, %eax
        wrmsr

Here we put the MSR_EFER value (which is defined in arch/x86/include/uapi/asm/msr-index.h) into the ecx register and execute the rdmsr instruction, which reads the MSR selected by ecx. After rdmsr is executed, we have the resulting data in edx:eax. We set the EFER_LME bit with the btsl instruction and write the data from edx:eax back to the MSR with the wrmsr instruction.

In the next step we push the kernel code segment selector onto the stack (we defined the segment in the GDT) and put the address of the startup_64 routine into eax.

        pushl $__KERNEL_CS
        leal startup_64(%ebp), %eax

After this we push this address onto the stack and enable paging by setting the PG and PE bits in the cr0 register:

        movl $(X86_CR0_PG | X86_CR0_PE), %eax
        movl %eax, %cr0

and execute:

        lret

Remember that we pushed the address of the startup_64 function onto the stack in the previous step. The lret instruction pops that address together with the code segment selector, and the CPU jumps there.

After all of these steps we are finally in 64-bit mode:

        .code64
        .org 0x200
    ENTRY(startup_64)
    ....
    ....
    ....

That'sall!

Conclusion

This is the end of the fourth part about the linux kernel booting process. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.

In the next part we will see kernel decompression and much more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

Kernel booting process. Part 5.

Kernel decompression

This is the fifth part of the Kernel booting process series. We saw the transition to 64-bit mode in the previous part and we will continue from that point here. We will see the last steps before we jump to the kernel code: preparation for kernel decompression, relocation and the kernel decompression itself. So... let's dive into the kernel code again.

Preparation before kernel decompression

We stopped right before the jump to the 64-bit entry point, startup_64, which is located in the arch/x86/boot/compressed/head_64.S source code file. We already saw the jump to startup_64 in startup_32:

        pushl $__KERNEL_CS
        leal startup_64(%ebp), %eax
        ...
        ...
        ...
        pushl %eax
        ...
        ...
        ...
        lret

in the previous part. Now startup_64 starts to work. Since we loaded the new Global Descriptor Table and the CPU switched to another mode (64-bit mode in our case), we can see the setup of the data segments at the beginning of startup_64:

        .code64
        .org 0x200
    ENTRY(startup_64)
        xorl %eax, %eax
        movl %eax, %ds
        movl %eax, %es
        movl %eax, %ss
        movl %eax, %fs
        movl %eax, %gs

All segment registers besides cs are now loaded with zero, the null selector, because eax was cleared by the xorl instruction.

The next step is the computation of the difference between where the kernel was compiled to run and where it was loaded:

    #ifdef CONFIG_RELOCATABLE
        leaq startup_32(%rip), %rbp
        movl BP_kernel_alignment(%rsi), %eax
        decl %eax
        addq %rax, %rbp
        notq %rax
        andq %rax, %rbp
        cmpq $LOAD_PHYSICAL_ADDR, %rbp
        jge 1f
    #endif
        movq $LOAD_PHYSICAL_ADDR, %rbp
    1:
        leaq z_extract_offset(%rbp), %rbx

rbp contains the start address of the decompressed kernel, and after this code is executed the rbx register will contain the address to which the kernel code is relocated for decompression. We already saw code like this in startup_32 (you can read about it in the previous part, Calculate relocation address), but we need to do this calculation again because the bootloader can use the 64-bit boot protocol, in which case startup_32 is not executed at all.

In the next step we can see the setup of the stack and the reset of the flags register:

        leaq boot_stack_end(%rbx), %rsp
        pushq $0
        popfq

As you can see above, the rbx register contains the start address of the kernel decompression code, and we just put this address plus the boot_stack_end offset into the rsp register. After this the stack will be correct. You can find the definition of boot_stack_end at the end of the compressed/head_64.S file:

        .bss
        .balign 4
    boot_heap:
        .fill BOOT_HEAP_SIZE, 1, 0
    boot_stack:
        .fill BOOT_STACK_SIZE, 1, 0
    boot_stack_end:

It is located in the .bss section, right before .pgtable. You can look at arch/x86/boot/compressed/vmlinux.lds.S to find it.

Now that the stack is set up, we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Let's look at this code:

        pushq %rsi
        leaq (_bss-8)(%rip), %rsi
        leaq (_bss-8)(%rbx), %rdi
        movq $_bss, %rcx
        shrq $3, %rcx
        std
        rep movsq
        cld
        popq %rsi

First of all we push rsi onto the stack. We need to save the value of rsi because this register now stores the pointer to the boot_params real mode structure (you must remember this structure; we filled it at the start of the kernel setup). At the end of this code we restore the pointer to boot_params into rsi again.

The next two leaq instructions calculate the effective addresses of _bss - 8 relative to rip and to rbx and put them into rsi and rdi. Why do we calculate these addresses? The compressed kernel image is located between this copying code (from startup_32 to the current code) and the decompression code. You can verify this by looking at the linker script, arch/x86/boot/compressed/vmlinux.lds.S:

    . = 0;
    .head.text : {
        _head = . ;
        HEAD_TEXT
        _ehead = . ;
    }
    .rodata..compressed : {
        *(.rodata..compressed)
    }
    .text : {
        _text = .;    /* Text */
        *(.text)
        *(.text.*)
        _etext = . ;
    }

Note that the .head.text section contains startup_32. You may remember it from the previous part:

        __HEAD
        .code32
    ENTRY(startup_32)
    ...
    ...
    ...

The .text section contains the decompression code:

        .text
    relocated:
    ...
    ...
    ...
    /*
     * Do the decompression, and jump to the new kernel..
     */
    ...

And .rodata..compressed contains the compressed kernel image.

So rsi will contain the rip-relative address of _bss - 8 and rdi will contain the relocation-relative address of _bss - 8. Once we have stored these addresses in registers, we put the address of _bss into the rcx register and shift it right by 3, that is, divide it by 8, to get the number of 8-byte words to copy. As you can see in vmlinux.lds.S, _bss is located at the end of all sections with the setup/kernel code. Now we can start to copy data from rsi to rdi, 8 bytes at a time, with the movsq instruction.

Note that there is a std instruction before the data copying. It sets the DF flag, which means that rsi and rdi will be decremented; in other words, we will copy the bytes backwards.

At the end we clear the DF flag with the cld instruction and restore the boot_params structure pointer to rsi.
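Why copy backwards? A likely reason is that the source and destination ranges can overlap, with the destination at the higher address, and only a backward copy preserves the data in that case. The sketch below illustrates the general technique on a bytearray; copy_backwards and copy_forwards are hypothetical names and the buffer sizes are invented for the example.

```python
def copy_backwards(memory, src, dst, length, word=8):
    """Copy length bytes in word-sized chunks, last chunk first (like std; rep movsq)."""
    for off in range(length - word, -1, -word):
        memory[dst + off:dst + off + word] = memory[src + off:src + off + word]


def copy_forwards(memory, src, dst, length, word=8):
    """Same copy, first chunk first (what cld; rep movsq would do)."""
    for off in range(0, length, word):
        memory[dst + off:dst + off + word] = memory[src + off:src + off + word]


# 32 bytes of payload moved 16 bytes up, so source and destination overlap
expected = bytearray(range(32))

backward = bytearray(range(32)) + bytearray(16)
copy_backwards(backward, 0, 16, 32)

forward = bytearray(range(32)) + bytearray(16)
copy_forwards(forward, 0, 16, 32)

print(backward[16:48] == expected)   # the backward copy keeps the payload intact
print(forward[16:48] == expected)    # the forward copy overwrites data before reading it
```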

After that we get the address of the .text section and jump to it:

        leaq relocated(%rbx), %rax
        jmp *%rax

Last preparation before kernel decompression

The .text section starts with the relocated label. At its start the bss section is cleared with:

        xorl %eax, %eax
        leaq _bss(%rip), %rdi
        leaq _ebss(%rip), %rcx
        subq %rdi, %rcx
        shrq $3, %rcx
        rep stosq

Here we just clear eax, put the RIP-relative address of _bss into rdi and of _ebss into rcx, turn the difference into a count of 8-byte words, and fill the section with zeros using the rep stosq instruction.

At the end we can see the call of the decompress_kernel routine:

        pushq %rsi
        movq $z_run_size, %r9
        pushq %r9
        movq %rsi, %rdi
        leaq boot_heap(%rip), %rsi
        leaq input_data(%rip), %rdx
        movl $z_input_len, %ecx
        movq %rbp, %r8
        movq $z_output_len, %r9
        call decompress_kernel
        popq %r9
        popq %rsi

Again we save rsi with the pointer to the boot_params structure and call decompress_kernel from arch/x86/boot/compressed/misc.c with seven arguments. The first six arguments are passed in registers, and the seventh, z_run_size, is pushed onto the stack. We have finished all preparations and can now look at the kernel decompression.

Kernel decompression

As I wrote above, the decompress_kernel function is in the arch/x86/boot/compressed/misc.c source code file. This function starts with the video/console initialization that we saw in the previous parts. These calls are needed if the bootloader used the 32-bit or 64-bit boot protocol. After this we store pointers to the start of the free memory and to the end of it:

    free_mem_ptr     = heap;
    free_mem_end_ptr = heap + BOOT_HEAP_SIZE;

where heap is the second parameter of the decompress_kernel function, which we got with:

        leaq boot_heap(%rip), %rsi

As you saw above, boot_heap is defined as:

    boot_heap:
        .fill BOOT_HEAP_SIZE, 1, 0

where BOOT_HEAP_SIZE is 0x400000 if the kernel is compressed with bzip2 or 0x8000 if not.

In the next step we call the choose_kernel_location function from arch/x86/boot/compressed/aslr.c. As we can understand from the function name, it chooses the memory location where the kernel image will be decompressed. Let's look at this function.

At the start, choose_kernel_location looks for the kaslr option in the command line if CONFIG_HIBERNATION is set, and for the nokaslr option if CONFIG_HIBERNATION is not set:

    #ifdef CONFIG_HIBERNATION
        if (!cmdline_find_option_bool("kaslr")) {
            debug_putstr("KASLR disabled by default...\n");
            goto out;
        }
    #else
        if (cmdline_find_option_bool("nokaslr")) {
            debug_putstr("KASLR disabled by cmdline...\n");
            goto out;
        }
    #endif

If the kaslr option is missing in the first case, or the nokaslr option is present in the second, it jumps to the out label:

    out:
        return (unsigned char *)choice;

which just returns the output parameter that we passed to choose_kernel_location, without any changes. Let's try to understand what kaslr is. We can find information about it in the documentation:

    kaslr/nokaslr [X86]

    Enable/disable kernel and module base offset ASLR
    (Address Space Layout Randomization) if built into
    the kernel. When CONFIG_HIBERNATION is selected,
    kASLR is disabled by default. When kASLR is enabled,
    hibernation will be disabled.

It means that we can pass the kaslr option to the kernel's command line and get a random address for the decompressed kernel (you can read more about aslr here).

Let's consider the case when the kernel's command line contains the kaslr option.

There is a call of the mem_avoid_init function from the same aslr.c source code file. This function collects the unsafe memory regions (initrd, kernel command line, etc.). We need to know about these memory regions so that the kernel does not overlap them after decompression. For example:

    initrd_start  = (u64)real_mode->ext_ramdisk_image << 32;
    initrd_start |= real_mode->hdr.ramdisk_image;
    initrd_size  = (u64)real_mode->ext_ramdisk_size << 32;
    initrd_size |= real_mode->hdr.ramdisk_size;
    mem_avoid[1].start = initrd_start;
    mem_avoid[1].size = initrd_size;

Here we can see the calculation of the initrd start address and size. ext_ramdisk_image is the high 32 bits of the ramdisk_image field from the boot header, and ext_ramdisk_size is the high 32 bits of the ramdisk_size field from the boot protocol:

    Offset  Proto  Name           Meaning
    /Size
    ...
    ...
    ...
    0218/4  2.00+  ramdisk_image  initrd load address (set by boot loader)
    021C/4  2.00+  ramdisk_size   initrd size (set by boot loader)
    ...

You can find ext_ramdisk_image and ext_ramdisk_size in Documentation/x86/zero-page.txt:

    Offset   Proto  Name               Meaning
    /Size
    ...
    ...
    ...
    0C0/004  ALL    ext_ramdisk_image  ramdisk_image high 32bits
    0C4/004  ALL    ext_ramdisk_size   ramdisk_size high 32bits
    ...

So we take ext_ramdisk_image and ext_ramdisk_size, shift them left by 32 (so that they occupy the high 32 bits) and OR in the low 32 bits from ramdisk_image and ramdisk_size, which gives us the start address of the initrd and its size. After this we store these values in the mem_avoid array, which is defined as:

    #define MEM_AVOID_MAX 5
    static struct mem_vector mem_avoid[MEM_AVOID_MAX];

where the mem_vector structure is:

    struct mem_vector {
        unsigned long start;
        unsigned long size;
    };
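The way the two 32-bit halves are combined into one 64-bit value can be shown in isolation. This is a sketch with made-up field values; combine_u64 is a hypothetical helper, and the (start, size) tuple merely stands in for struct mem_vector.

```python
def combine_u64(high32, low32):
    """Build a 64-bit value from its high and low 32-bit halves."""
    return ((high32 & 0xffffffff) << 32) | (low32 & 0xffffffff)


# Invented example values for the four boot_params fields
ext_ramdisk_image, ramdisk_image = 0x1, 0x7f000000
ext_ramdisk_size, ramdisk_size = 0x0, 0x00a00000

initrd_start = combine_u64(ext_ramdisk_image, ramdisk_image)
initrd_size = combine_u64(ext_ramdisk_size, ramdisk_size)
mem_avoid_entry = (initrd_start, initrd_size)   # stands in for mem_avoid[1]

print(hex(initrd_start), hex(initrd_size))
```

With a non-zero ext_ramdisk_image the initrd can sit above 4 GB, which a single 32-bit field could not express.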

The next step, after we have collected all unsafe memory regions in the mem_avoid array, is to search for a random address which does not overlap the unsafe regions, using the find_random_addr function.

First of all we can see the alignment of the output address in the find_random_addr function:

    minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

You may remember the CONFIG_PHYSICAL_ALIGN configuration option from the previous part. This option provides the value to which the kernel should be aligned, and it is 0x200000 by default. After we have the aligned output address, we go through the memory and collect regions which are suitable for the decompressed kernel image:

    for (i = 0; i < real_mode->e820_entries; i++) {
        process_e820_entry(&real_mode->e820_map[i], minimum, size);
    }

You may remember that we collected e820_entries in the second part of the Kernel booting process.

First of all the process_e820_entry function checks that the e820 memory region is RAM, that the start address of the memory region is not bigger than the maximum allowed aslr offset, and that the memory region does not end below the minimum address:

    struct mem_vector region, img;

    if (entry->type != E820_RAM)
        return;

    if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
        return;

    if (entry->addr + entry->size < minimum)
        return;

After this, we store the e820 memory region start address and size in the mem_vector structure (we saw the definition of this structure above):

    region.start = entry->addr;
    region.size = entry->size;

After storing these values, we align region.start as we did in the find_random_addr function and check that the aligned address is not beyond the end of the original memory region:

    region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);

    if (region.start > entry->addr + entry->size)
        return;

Next we reduce the region size by the difference between the aligned and the original start address. If the last address in the memory region is bigger than CONFIG_RANDOMIZE_BASE_MAX_OFFSET, we also reduce the memory region size so that the end of the kernel image stays below the maximum aslr offset:

    region.size -= region.start - entry->addr;

    if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
        region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;

At the end we walk through the region in steps of CONFIG_PHYSICAL_ALIGN and check that each candidate image position does not overlap the unsafe areas (kernel command line, initrd, etc.):

    for (img.start = region.start, img.size = image_size ;
         mem_contains(&region, &img) ;
         img.start += CONFIG_PHYSICAL_ALIGN) {
        if (mem_avoid_overlap(&img))
            continue;
        slots_append(img.start);
    }

If the candidate position does not overlap the unsafe regions, we call the slots_append function with its start address. The slots_append function just collects the start addresses into the slots array:

    slots[slot_max++] = addr;

which is defined as:

    static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
                               CONFIG_PHYSICAL_ALIGN];
    static unsigned long slot_max;

After process_e820_entry has been executed for every entry, we have an array of addresses which are safe for the decompressed kernel. Next we call the slots_fetch_random function to get a random item from this array:

    if (slot_max == 0)
        return 0;

    return slots[get_random_long() % slot_max];

where the get_random_long function checks different CPU flags such as X86_FEATURE_RDRAND or X86_FEATURE_TSC and chooses a method for getting a random number (it can be obtained with the RDRAND instruction, the time stamp counter, the programmable interval timer, etc.). After we have the random address, the execution of choose_kernel_location is finished.
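The whole slot search can be summarised in a short model. This is a simplified sketch rather than the code from aslr.c: collect_slots and pick_slot are hypothetical names, the e820 map and the avoided region are invented, MAX_OFFSET is deliberately small so the result is easy to follow, and raising the region start to the minimum address is an assumption that is not shown in the excerpts above.

```python
E820_RAM = 1


def collect_slots(e820_map, mem_avoid, minimum, image_size, align, max_offset):
    """Return the aligned start addresses where the image fits safely."""
    slots = []
    for addr, size, kind in e820_map:
        # The three early checks from process_e820_entry
        if kind != E820_RAM or addr >= max_offset or addr + size < minimum:
            continue
        start = max(addr, minimum)                     # assumption, see lead-in
        start = (start + align - 1) & ~(align - 1)     # ALIGN(region.start, align)
        if start > addr + size:
            continue
        end = min(addr + size, max_offset)             # clamp to the maximum offset
        while start + image_size <= end:               # mem_contains(&region, &img)
            overlaps = any(start < a + s and a < start + image_size
                           for a, s in mem_avoid)      # mem_avoid_overlap(&img)
            if not overlaps:
                slots.append(start)                    # slots_append(img.start)
            start += align
    return slots


def pick_slot(slots, random_value):
    """Mirror of slots_fetch_random with the random number passed in."""
    return slots[random_value % len(slots)] if slots else 0


e820 = [(0x0, 0x9fc00, E820_RAM),
        (0x100000, 0x3ff00000, E820_RAM),
        (0xfffc0000, 0x40000, 2)]
avoid = [(0x1000000, 0x400000)]          # for example an initrd
found = collect_slots(e820, avoid, minimum=0x1000000, image_size=0x800000,
                      align=0x200000, max_offset=0x2000000)
print([hex(s) for s in found], hex(pick_slot(found, 7)))
```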

Now let's get back to misc.c. After we have the address for the kernel image, some checks are needed to be sure that the random address we got is correctly aligned and is not wrong.

After all these checks we will see the familiar message:

    Decompressing Linux...

and call the decompress function, which will decompress the kernel. The decompress function depends on which decompression algorithm was chosen during kernel compilation:

    #ifdef CONFIG_KERNEL_GZIP
    #include "../../../../lib/decompress_inflate.c"
    #endif

    #ifdef CONFIG_KERNEL_BZIP2
    #include "../../../../lib/decompress_bunzip2.c"
    #endif

    #ifdef CONFIG_KERNEL_LZMA
    #include "../../../../lib/decompress_unlzma.c"
    #endif

    #ifdef CONFIG_KERNEL_XZ
    #include "../../../../lib/decompress_unxz.c"
    #endif

    #ifdef CONFIG_KERNEL_LZO
    #include "../../../../lib/decompress_unlzo.c"
    #endif

    #ifdef CONFIG_KERNEL_LZ4
    #include "../../../../lib/decompress_unlz4.c"
    #endif

After the kernel is decompressed, the last function, handle_relocations, relocates the kernel to the address that we got from choose_kernel_location. After the kernel is relocated, we return from decompress_kernel to head_64.S. The address of the kernel will be in the rax register, and we jump to it:

        jmp *%rax

That's all. Now we are in the kernel!

Conclusion

This is the end of the fifth and last part about the linux kernel booting process. We will not see any more posts about kernel booting (maybe only updates to this and previous posts), but there will be many posts about other kernel internals.

The next chapter will be about kernel initialization, and we will see the first steps in the linux kernel initialization code.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Kernel initialization process

Here you will find a couple of posts which describe the full cycle of kernel initialization, from its first steps after the kernel has been decompressed to the start of the first process run by the kernel itself.

Note that there will not be a description of all kernel initialization steps. Only the generic kernel part is covered here, without interrupt handling, ACPI, and many other parts. All the parts which I miss will be described in other chapters.

* First steps after kernel decompression - describes the first steps in the kernel.
* Early interrupt and exception handling - describes early interrupts initialization and the early page fault handler.
* Last preparations before the kernel entry point - describes the last preparations before the call of start_kernel.
* Kernel entry point - describes the first steps in the kernel generic code.
* Continue of architecture-specific initializations - describes architecture-specific initialization.
* Architecture-specific initializations, again... - describes the continuation of the architecture-specific initialization process.
* The End of the architecture-specific initializations, almost... - describes the end of the setup_arch related stuff.
* Scheduler initialization - describes the preparation before scheduler initialization and its initialization.
* RCU initialization - describes the initialization of the RCU.
* End of the initialization - the last part about linux kernel initialization.

Kernel initialization. Part 1.

First steps in the kernel code

In the previous post (Kernel booting process. Part 5.) - Kernel decompression we stopped at the jump to the decompressed kernel:

        jmp *%rax

and now we are in the kernel. There are many things to do before the kernel starts the first init process. I hope we will see all of the preparations before the kernel starts in this big chapter. We will start from the kernel entry point, which is in arch/x86/kernel/head_64.S. We will see the first preparations, like early page tables initialization, the switch to a new descriptor in kernel space and much more, before the start_kernel function from init/main.c is called.

So let's start.

First steps in the kernel

Okay, we got the address of the kernel from the decompress_kernel function into the rax register and just jumped there. The decompressed kernel code starts in arch/x86/kernel/head_64.S:

        __HEAD
        .code64
        .globl startup_64
    startup_64:
        ...
        ...
        ...

We can see the definition of the startup_64 routine, and it is defined in the __HEAD section, which is just:

    #define __HEAD .section ".head.text","ax"

We can see the definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:

    .text : AT(ADDR(.text) - LOAD_OFFSET) {
        _text = .;
        ...
        ...
        ...
    } :text = 0x9090

We can understand the default virtual and physical addresses from the linker script. Note that the address of _text is the location counter, which is defined as:

    . = __START_KERNEL;

for x86_64. We can find the definition of the __START_KERNEL macro in arch/x86/include/asm/page_types.h:

    #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)

    #define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)

Here we can see that __START_KERNEL is the sum of __START_KERNEL_map (which is 0xffffffff80000000, see the post about paging) and __PHYSICAL_START, where __PHYSICAL_START is the aligned value of CONFIG_PHYSICAL_START. So if you do not use kASLR and do not change CONFIG_PHYSICAL_START in the configuration, the addresses will be the following:

* Physical address - 0x1000000;
* Virtual address - 0xffffffff81000000.

Now we know the default physical and virtual addresses of the startup_64 routine, but to know the actual addresses we must calculate them with the following code:

        leaq _text(%rip), %rbp
        subq $_text - __START_KERNEL_map, %rbp

Here we just put the rip-relative address of _text into the rbp register and then subtract $_text - __START_KERNEL_map from it. We know that the compiled address of _text is 0xffffffff81000000 and that __START_KERNEL_map is 0xffffffff80000000, so the subtracted value is 0x1000000, the default physical address of the text. After this calculation rbp contains the difference between the physical address where the kernel text is actually running and that default address; it is zero if the kernel runs at the default address. We need to calculate it because the kernel may not be running at the default address.
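The numbers in this calculation can be checked directly. The sketch below assumes the default configuration values quoted in the text; load_delta is a hypothetical name for the value that ends up in rbp, and 0x2000000 is an invented alternative load address.

```python
START_KERNEL_MAP = 0xffffffff80000000
CONFIG_PHYSICAL_START = 0x1000000
CONFIG_PHYSICAL_ALIGN = 0x200000
MASK64 = (1 << 64) - 1


def align_up(value, alignment):
    return (value + alignment - 1) & ~(alignment - 1)


PHYSICAL_START = align_up(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
START_KERNEL = START_KERNEL_MAP + PHYSICAL_START   # link-time address of _text


def load_delta(runtime_text_addr):
    """leaq _text(%rip), %rbp ; subq $_text - __START_KERNEL_map, %rbp"""
    return (runtime_text_addr - (START_KERNEL - START_KERNEL_MAP)) & MASK64


print(hex(START_KERNEL))
print(hex(load_delta(0x1000000)))   # running at the default physical address
print(hex(load_delta(0x2000000)))   # running 16 MB higher, for example after kASLR
```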

In the next step we check that this address is aligned:

        movq %rbp, %rax
        andl $~PMD_PAGE_MASK, %eax
        testl %eax, %eax
        jnz bad_address

Here we just put the address into %rax and test its low 21 bits. PMD_PAGE_MASK is the mask for the page middle directory (read the paging post about it) and is defined as:

    #define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))

    #define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
    #define PMD_SHIFT 21

As we can easily calculate, PMD_PAGE_SIZE is 2 megabytes. Here we use the standard formula for checking alignment, and if the text address is not aligned to 2 megabytes, we jump to the bad_address label.

After this we check that the address is not too large:

        leaq _text(%rip), %rax
        shrq $MAX_PHYSMEM_BITS, %rax
        jnz bad_address

The address must not be greater than 46 bits:

    #define MAX_PHYSMEM_BITS 46
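The two sanity checks above boil down to one mask test and one shift. A minimal sketch, with is_bad_address as a hypothetical name and invented example addresses:

```python
PMD_SHIFT = 21
PMD_PAGE_SIZE = 1 << PMD_SHIFT
PMD_PAGE_MASK = ~(PMD_PAGE_SIZE - 1)
MAX_PHYSMEM_BITS = 46


def is_bad_address(delta, text_addr):
    """Return True where the assembly would jump to bad_address."""
    if delta & ~PMD_PAGE_MASK:             # andl $~PMD_PAGE_MASK, %eax ; jnz
        return True                        # not aligned to 2 megabytes
    if text_addr >> MAX_PHYSMEM_BITS:      # shrq $MAX_PHYSMEM_BITS, %rax ; jnz
        return True                        # does not fit in 46 bits
    return False


print(PMD_PAGE_SIZE)
print(is_bad_address(0x0, 0x1000000))        # default load address
print(is_bad_address(0x100000, 0x1100000))   # shifted by 1 MB: misaligned
print(is_bad_address(0x0, 1 << 46))          # beyond 46 bits
```

Note that ~PMD_PAGE_MASK is simply PMD_PAGE_SIZE - 1, so the first test passes only when the low 21 bits are all zero.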

Okay,wedidsomeearlychecksandnowwecanmoveon.

Fix base addresses of page tables

The first step before we start to set up identity paging is to correct the following addresses:

    addq  %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
    addq  %rbp, level3_kernel_pgt + (510*8)(%rip)
    addq  %rbp, level3_kernel_pgt + (511*8)(%rip)
    addq  %rbp, level2_fixmap_pgt + (506*8)(%rip)

Here we need to correct early_level4_pgt and the other page table directory addresses because, as I wrote above, the kernel may not be running at the default 0x1000000 address. The rbp register reflects the actual load address, so we add it to the entries of early_level4_pgt, level3_kernel_pgt and level2_fixmap_pgt. Let's try to understand what these labels mean. First of all let's look at their definition:

    NEXT_PAGE(early_level4_pgt)
        .fill   511,8,0
        .quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

    NEXT_PAGE(level3_kernel_pgt)
        .fill   L3_START_KERNEL,8,0
        .quad   level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .quad   level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE

    NEXT_PAGE(level2_kernel_pgt)
        PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
             KERNEL_IMAGE_SIZE/PMD_SIZE)

    NEXT_PAGE(level2_fixmap_pgt)
        .fill   506,8,0
        .quad   level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
        .fill   5,8,0

    NEXT_PAGE(level1_fixmap_pgt)
        .fill   512,8,0

It looks hard, but it is not.

First of all let's look at early_level4_pgt. It starts with (4096 - 8) bytes of zeros, which means that we don't use the first 511 entries of early_level4_pgt. After this we can see the level3_kernel_pgt entry. Note that we subtract __START_KERNEL_map from it and add _PAGE_TABLE. As we know, __START_KERNEL_map is the base virtual address of the kernel text, so if we subtract __START_KERNEL_map, we get the physical address of level3_kernel_pgt. Now let's look at _PAGE_TABLE, it is just the page entry access rights:

    #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                         _PAGE_ACCESSED | _PAGE_DIRTY)

You can read more about it in the paging post.

level3_kernel_pgt stores entries which map kernel space. At the start of its definition we can see that it is filled with zeros L3_START_KERNEL times. Here L3_START_KERNEL is the index in the page upper directory which contains the __START_KERNEL_map address, and it equals 510. After it we can see the definition of two level3_kernel_pgt entries: level2_kernel_pgt and level2_fixmap_pgt. The first is simple, it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has:

    #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
                           _PAGE_DIRTY)

access rights. The second, level2_fixmap_pgt, holds virtual addresses which can refer to any physical addresses, even under kernel space.

The next, level2_kernel_pgt, calls the PMDS macro, which maps 512 megabytes from __START_KERNEL_map for the kernel text (after these 512 megabytes comes the modules memory space).

Now that we know this, let's get back to the code from the beginning of the section. Remember that rbp reflects the actual physical address of the _text section (its offset from the default one). We just add it to the base addresses stored in the page tables, so that they hold the correct addresses:

    addq  %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
    addq  %rbp, level3_kernel_pgt + (510*8)(%rip)
    addq  %rbp, level3_kernel_pgt + (511*8)(%rip)
    addq  %rbp, level2_fixmap_pgt + (506*8)(%rip)

The first line adds rbp to the early_level4_pgt entry which points to level3_kernel_pgt, the second line adds rbp to the level3_kernel_pgt entry which points to level2_kernel_pgt, the third line adds rbp to the entry which points to level2_fixmap_pgt, and the fourth line adds rbp to the level2_fixmap_pgt entry which points to level1_fixmap_pgt.

After all of this we will have:

    early_level4_pgt[511]  -> level3_kernel_pgt[0]
    level3_kernel_pgt[510] -> level2_kernel_pgt[0]
    level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
    level2_kernel_pgt[0]   -> 512 MB kernel mapping
    level2_fixmap_pgt[506] -> level1_fixmap_pgt

As we corrected the base addresses of the page tables, we can start to build them.

Identity mapping setup

Now we can set up the identity mapping in the early page tables. In identity mapped paging, virtual addresses are mapped to physical addresses that have the same value, 1:1. Let's look at it in detail. First of all we get the rip-relative addresses of _text and early_level4_pgt and put them into the rdi and rbx registers:

    leaq  _text(%rip), %rdi
    leaq  early_level4_pgt(%rip), %rbx

After this we store the physical address of _text in rax and get the index of the page global directory entry which stores the _text address, by shifting the _text address right by PGDIR_SHIFT:

    movq  %rdi, %rax
    shrq  $PGDIR_SHIFT, %rax

    leaq  (4096 + _KERNPG_TABLE)(%rbx), %rdx
    movq  %rdx, 0(%rbx,%rax,8)
    movq  %rdx, 8(%rbx,%rax,8)

where PGDIR_SHIFT is 39. PGDIR_SHIFT gives the position of the page global directory bits in a virtual address. There are macros for all types of page directories:

    #define PGDIR_SHIFT 39
    #define PUD_SHIFT   30
    #define PMD_SHIFT   21

After this we put the address of the first level3_kernel_pgt into rdx with the _KERNPG_TABLE access rights (see above) and fill early_level4_pgt with the 2 level3_kernel_pgt entries.
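The shift constants select which 9 bits of an address index each level of the page tables. As a small illustration, assuming 512 entries per table at every level, the following sketch applies the same shifts to a virtual address; the function names are hypothetical and only mimic what the kernel's index helpers do.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT    39
#define PUD_SHIFT      30
#define PMD_SHIFT      21
#define PTRS_PER_TABLE 512   /* assumed: 512 eight-byte entries per 4 KB table */

static_assert(PTRS_PER_TABLE * 8 == 4096, "one table fills one page");

/* Index into the page global directory (level 4). */
unsigned pgd_index_of(uint64_t vaddr)
{
    return (vaddr >> PGDIR_SHIFT) & (PTRS_PER_TABLE - 1);
}

/* Index into the page upper directory (level 3). */
unsigned pud_index_of(uint64_t vaddr)
{
    return (vaddr >> PUD_SHIFT) & (PTRS_PER_TABLE - 1);
}

/* Index into the page middle directory (level 2). */
unsigned pmd_index_of(uint64_t vaddr)
{
    return (vaddr >> PMD_SHIFT) & (PTRS_PER_TABLE - 1);
}
```

Applied to __START_KERNEL_map (0xffffffff80000000) this gives the indices 511, 510 and 0, which matches the early_level4_pgt[511] and level3_kernel_pgt[510] entries seen earlier. The default kernel text address 0xffffffff81000000 lands in entry 8 of the page middle directory, because 0x1000000 is eight 2 MB pages.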

After this we add 4096 (the size of early_level4_pgt) to rdx (it now contains the address of the first entry of level3_kernel_pgt) and put rdi (it contains the physical address of _text) into rax. Then we write the addresses of the two page upper directory entries to level3_kernel_pgt:

    addq  $4096, %rdx
    movq  %rdi, %rax
    shrq  $PUD_SHIFT, %rax
    andl  $(PTRS_PER_PUD-1), %eax
    movq  %rdx, 4096(%rbx,%rax,8)
    incl  %eax
    andl  $(PTRS_PER_PUD-1), %eax
    movq  %rdx, 4096(%rbx,%rax,8)

In the next step we write the addresses of the page middle directory entries to level2_kernel_pgt, and the last step is correcting the kernel text+data virtual addresses:

        leaq  level2_kernel_pgt(%rip), %rdi
        leaq  4096(%rdi), %r8
    1:  testq $1, 0(%rdi)
        jz    2f
        addq  %rbp, 0(%rdi)
    2:  addq  $8, %rdi
        cmp   %r8, %rdi
        jne   1b

Here we put the address of level2_kernel_pgt into rdi and the address of the end of this table (4096 bytes later) into the r8 register. Next we check the present bit of the current level2_kernel_pgt entry: if it is set we add rbp to the entry, and if it is zero we leave the entry alone. In both cases we move to the next entry by adding 8 bytes to rdi. After this we compare rdi with r8 and go back to label 1 or move forward.

In the next step we correct the phys_base physical address with rbp, put the physical address of early_level4_pgt into rax and jump to label 1:

    addq  %rbp, phys_base(%rip)
    movq  $(early_level4_pgt - __START_KERNEL_map), %rax
    jmp   1f

where phys_base matches the first entry of level2_kernel_pgt, which is the 512 MB kernel mapping.

Last preparations

After we jump to label 1 we enable PAE and PGE (Page Global Enable), add phys_base (see above) to the rax register and load the cr3 register with the result:

    1:
        movl  $(X86_CR4_PAE | X86_CR4_PGE), %ecx
        movq  %rcx, %cr4

        addq  phys_base(%rip), %rax
        movq  %rax, %cr3

In the next step we check whether the CPU supports the NX bit:

    movl  $0x80000001, %eax
    cpuid
    movl  %edx, %edi

We put the 0x80000001 value into eax and execute the cpuid instruction to get the extended processor info and feature bits. The result will be in the edx register, which we copy to edi.

Now we put 0xc0000080, or MSR_EFER, into ecx and execute the rdmsr instruction to read the model specific register:

    movl  $MSR_EFER, %ecx
    rdmsr

The result will be in edx:eax. The general view of EFER is the following:

    63                                                                           32
    -------------------------------------------------------------------------------
    |                                                                             |
    |                                Reserved MBZ                                 |
    |                                                                             |
    -------------------------------------------------------------------------------
    31          16  15    14      13     12     11    10     9     8   7  1    0
    -------------------------------------------------------------------------------
    |              | T |       |       |      |     |     |     |     |     |     |
    | Reserved MBZ | C | FFXSR | LMSLE | SVME | NXE | LMA | MBZ | LME | RAZ | SCE |
    |              | E |       |       |      |     |     |     |     |     |     |
    -------------------------------------------------------------------------------

We will not look at all the fields in detail here, but we will learn about this and other MSRs in a special part about them. As we have read EFER into edx:eax, we set _EFER_SCE, bit zero, which is System Call Extensions, to one with the btsl instruction. By setting the SCE bit we enable the SYSCALL and SYSRET instructions. In the next step we check the 20th bit in edi; remember that this register stores the result of cpuid (see above). If bit 20 (the NX bit) is not set, we jump straight to wrmsr and write EFER with only SCE added:

        btsl  $_EFER_SCE, %eax
        btl   $20, %edi
        jnc   1f
        btsl  $_EFER_NX, %eax
        btsq  $_PAGE_BIT_NX, early_pmd_flags(%rip)
    1:  wrmsr

If the NX bit is supported we also enable _EFER_NX and write it too with the wrmsr instruction.
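The bit manipulation done by btsl and btl can be restated in C. This is a rough sketch that only models the low 32 bits of EFER (the eax half); the bit positions follow the EFER layout shown above, and the function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define EFER_SCE_BIT 0    /* _EFER_SCE: enables SYSCALL and SYSRET */
#define EFER_NX_BIT  11   /* _EFER_NX: enables no-execute pages */
#define CPUID_NX_BIT 20   /* bit 20 of edx for cpuid leaf 0x80000001 */

static_assert(EFER_NX_BIT < 32, "both bits live in the low half (eax)");

/*
 * Takes the low half of EFER as read by rdmsr and the edx value returned
 * by cpuid, and returns the value that would be written back by wrmsr.
 */
uint32_t early_efer_low(uint32_t efer_low, uint32_t cpuid_edx)
{
    efer_low |= 1u << EFER_SCE_BIT;           /* always enable SCE */
    if (cpuid_edx & (1u << CPUID_NX_BIT))     /* NX supported? */
        efer_low |= 1u << EFER_NX_BIT;
    return efer_low;
}
```

Bits that were already set, such as LME (bit 8) and LMA (bit 10) in long mode, pass through unchanged, which is why the assembly reads the register before modifying it.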

In the next step we need to update the Global Descriptor Table with the lgdt instruction:

    lgdt  early_gdt_descr(%rip)

where the Global Descriptor Table descriptor is defined as:

    early_gdt_descr:
        .word  GDT_ENTRIES*8-1
    early_gdt_descr_base:
        .quad  INIT_PER_CPU_VAR(gdt_page)

We need to reload the Global Descriptor Table because the kernel now works in the low userspace addresses, but soon it will work in its own space. Now let's look at the early_gdt_descr definition. The Global Descriptor Table contains 32 entries:

    #define GDT_ENTRIES 32

for kernel code, data, thread local storage segments and so on. It's simple. Now let's look at early_gdt_descr_base. First of all, gdt_page is defined as:

    struct gdt_page {
        struct desc_struct gdt[GDT_ENTRIES];
    } __attribute__((aligned(PAGE_SIZE)));

in arch/x86/include/asm/desc.h. It contains one field, gdt, which is an array of desc_struct structures, defined as:

    struct desc_struct {
        union {
            struct {
                unsigned int a;
                unsigned int b;
            };
            struct {
                u16 limit0;
                u16 base0;
                unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
            };
        };
    } __attribute__((packed));

and represents the GDT descriptor which is familiar to us. Also note that the gdt_page structure is aligned to PAGE_SIZE, which is 4096 bytes. It means that gdt will occupy one page. Now let's try to understand what INIT_PER_CPU_VAR is. INIT_PER_CPU_VAR is a macro defined in arch/x86/include/asm/percpu.h which just concatenates init_per_cpu__ with the given parameter:

    #define INIT_PER_CPU_VAR(var) init_per_cpu__##var

After this we have init_per_cpu__gdt_page. We can see in the linker script:

    #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
    INIT_PER_CPU(gdt_page);

As we got init_per_cpu__gdt_page from INIT_PER_CPU_VAR, and the INIT_PER_CPU macro from the linker script is expanded, we get the offset from __per_cpu_load. After these calculations, we have the correct base address of the new GDT.
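The ## operator used by INIT_PER_CPU_VAR glues two tokens into one identifier at preprocessing time. Here is a small user-space sketch of the idea; the variable below is a hypothetical stand-in for the symbol that the linker script defines in the kernel, and the numeric values are arbitrary.

```c
#include <assert.h>

#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

/* Stand-in for the linker-defined symbol; the value is arbitrary. */
long init_per_cpu__gdt_page = 0x9000;

long read_init_gdt_page(void)
{
    /* Both spellings name the same object after preprocessing. */
    assert(&INIT_PER_CPU_VAR(gdt_page) == &init_per_cpu__gdt_page);
    return INIT_PER_CPU_VAR(gdt_page);
}

/* The linker script assignment init_per_cpu__x = x + __per_cpu_load. */
unsigned long long init_per_cpu_address(unsigned long long x,
                                        unsigned long long per_cpu_load)
{
    return x + per_cpu_load;
}
```

The second function only restates the linker script assignment as arithmetic: the per-CPU symbol's link-time value plus the load address of the per-CPU area.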

Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a per-CPU variable, each CPU will have its own copy of this variable. Here we are creating the gdt_page per-CPU variable. There are many advantages to variables of this type, for example there are no locks, because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own GDT, and every entry in the table will represent a memory segment which can be accessed from the thread which runs on the core. You can read about per-CPU variables in detail in the Theory/per-cpu post.

As we loaded the new Global Descriptor Table, we reload the segments as we did every time:

    xorl  %eax, %eax
    movl  %eax, %ds
    movl  %eax, %ss
    movl  %eax, %es
    movl  %eax, %fs
    movl  %eax, %gs

After all of these steps we set up the gs register so that it points to the irqstack (we will see information about it in the upcoming parts):

    movl  $MSR_GS_BASE, %ecx
    movl  initial_gs(%rip), %eax
    movl  initial_gs+4(%rip), %edx
    wrmsr

where MSR_GS_BASE is:

    #define MSR_GS_BASE 0xc0000101

We need to put MSR_GS_BASE into the ecx register and load data from eax and edx (which hold the low and high halves of initial_gs) with the wrmsr instruction. We don't use the cs, ds, es and ss segment registers for addressing in 64-bit mode, but the fs and gs registers can be used. fs and gs have a hidden part (as we saw in real mode for cs), and this part contains a descriptor which is mapped to model specific registers. So, as we can see above, 0xc0000101 is the gs.base MSR address.

In the next step we put the address of the real mode bootparam structure into rdi (remember that rsi holds the pointer to this structure from the start) and jump to the C code with:

    movq  initial_code(%rip), %rax
    pushq $0
    pushq $__KERNEL_CS
    pushq %rax
    lretq

Here we put the address stored in initial_code into rax and push a fake address, __KERNEL_CS and the address from initial_code onto the stack. After this we can see the lretq instruction, which means that the return address will be extracted from the stack (now it is the address from initial_code) and we jump there. initial_code is defined in the same source code file and looks like this:

    __REFDATA
    .balign 8
    GLOBAL(initial_code)
    .quad   x86_64_start_kernel
    ...
    ...
    ...

As we can see, initial_code contains the address of x86_64_start_kernel, which is defined in arch/x86/kernel/head64.c and looks like this:

    asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
        ...
        ...
        ...
    }

It has one argument, real_mode_data (remember that we passed the address of the real mode data in the rdi register previously).

This is the first C code in the kernel!

We need to see the last preparations before we can see the "kernel entry point", the start_kernel function from init/main.c.

First of all we can see some checks in the x86_64_start_kernel function:

    BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
    BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
    BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
    BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
    BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
    BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
    BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
    BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

There are checks for different things, for example that the virtual address of the modules space is not lower than the base address of the kernel text, __START_KERNEL_map, that the distance between the modules space and the kernel text base is not less than the kernel image size, and so on. BUILD_BUG_ON is a macro which looks like this:

    #define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

Let's try to understand how this trick works. Let's take the first condition as an example: MODULES_VADDR < __START_KERNEL_map. !!condition is the same as condition != 0. So if MODULES_VADDR < __START_KERNEL_map is true, we will get 1 from !!(condition), or zero if not. After 2*!!(condition) we will get either 2 or 0. At the end of the calculation we can get two different behaviors:

We will have a compilation error, because we try to get the size of a char array with a negative size (this cannot happen in our case, because MODULES_VADDR can't be less than __START_KERNEL_map);
No compilation errors.

That's all. It is an interesting C trick for getting a compile error which depends on some constants.
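The trick can be tried outside the kernel as well. The sketch below copies the macro quoted above and adds a helper macro, BUILD_BUG_ARRAY_LEN, which is not part of the kernel and exists only to show the array length the compiler is asked for. The last statement uses C11 static_assert, which serves the same purpose with a readable message.

```c
#include <assert.h>

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2 * !!(condition)]))

/* Illustration only: the array length requested by the macro above. */
#define BUILD_BUG_ARRAY_LEN(condition) (1 - 2 * !!(condition))

void build_time_checks(void)
{
    /* Condition is false, so the type is char[1] and this compiles. */
    BUILD_BUG_ON(sizeof(char) != 1);

    /* Condition is true, so the type would be char[-1]: a compile error.
     * BUILD_BUG_ON(sizeof(char) == 1);
     */

    /* C11 alternative to the same check. */
    static_assert(sizeof(char) == 1, "char must be one byte");
}
```

Uncommenting the second BUILD_BUG_ON call makes the translation unit fail to compile, which is exactly the behaviour the kernel relies on.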

Next to start_kernel

In the next step we can see the call of the cr4_init_shadow function, which stores a shadow copy of cr4 per CPU. Context switches can change bits in cr4, so we need to store cr4 for each CPU. After this we can see the call of the reset_early_page_tables function, where we reset the page global directory entries and write a new pointer to the PGT in cr3:

    for (i = 0; i < PTRS_PER_PGD-1; i++)
        early_level4_pgt[i].pgd = 0;

    next_early_pgt = 0;

    write_cr3(__pa_nodebug(early_level4_pgt));

Soon we will build new page tables. Here we can see that we go through the Page Global Directory entries in the loop (PTRS_PER_PGD is 512) and set them to zero; the loop stops before the last entry, so the entry which maps the kernel is kept. After this we set next_early_pgt to zero (we will see details about it in the next post) and write the physical address of early_level4_pgt to cr3. __pa_nodebug is a macro which will be expanded to:

    ((unsigned long)(x) - __START_KERNEL_map + phys_base)
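The expansion above is a plain offset calculation, so it can be checked with ordinary integers. A minimal sketch, assuming phys_base is passed in as a parameter instead of being a kernel global, and using a made-up function name:

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL

/*
 * Convert a virtual address from the kernel text mapping into a physical
 * address. phys_base is zero when the kernel runs at its default address.
 */
uint64_t pa_nodebug(uint64_t vaddr, uint64_t phys_base)
{
    assert(vaddr >= START_KERNEL_MAP);   /* only valid for this mapping */
    return vaddr - START_KERNEL_MAP + phys_base;
}
```

For the default text address 0xffffffff81000000 and a phys_base of zero this gives 0x1000000; a kernel relocated by 2 MB (phys_base of 0x200000) gives 0x1200000.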

After this we clear the _bss section from __bss_start to __bss_stop, and the next step will be the setup of the early IDT handlers, but it's a big theme so we will see it in the next part.

Conclusion

This is the end of the first part about linux kernel initialization.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

In the next part we will see the initialization of the early interruption handlers, kernel space memory mapping and a lot more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

Model Specific Register
Paging
Previous part - Kernel decompression
NX
ASLR

Kernel initialization. Part 2.

Early interrupt and exception handling

In the previous part we stopped before setting up the early interrupt handlers. We continue in this part and will learn more about interrupt and exception handling.

Remember that we stopped before the following loop:

    for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
        set_intr_gate(i, early_idt_handlers[i]);

from the arch/x86/kernel/head64.c source code file. But before we start to sort out this code, we need to know about interrupts and handlers.

Some theory

An interrupt is an event caused by software or hardware to the CPU. On an interrupt, the CPU stops the current task and transfers control to the interrupt handler, which handles the interrupt and transfers control back to the previously stopped task. We can split interrupts into three types:

Software interrupts - when software signals the CPU that it needs kernel attention. These interrupts are generally used for system calls;
Hardware interrupts - when a hardware event happens, for example a button is pressed on a keyboard;
Exceptions - interrupts generated by the CPU when it detects an error, for example division by zero or accessing a memory page which is not in RAM.

Every interrupt and exception is assigned a unique number called the vector number. A vector number can be any number from 0 to 255. It is common practice to use the first 32 vector numbers for exceptions, and vector numbers from 32 to 255 are used for user-defined interrupts. We can see it in the code above, NUM_EXCEPTION_VECTORS, which is defined as:

    #define NUM_EXCEPTION_VECTORS 32

The CPU uses the vector number as an index into the Interrupt Descriptor Table (we will see a description of it soon). The CPU catches interrupts from the APIC or through its pins. The following table shows exceptions 0-31:

    ----------------------------------------------------------------------------------------------------------
    | Vector | Mnemonic | Description          | Type  | Error Code | Source                                 |
    ----------------------------------------------------------------------------------------------------------
    | 0      | #DE      | Divide Error         | Fault | NO         | DIV and IDIV                           |
    | 1      | #DB      | Reserved             | F/T   | NO         |                                        |
    | 2      | ---      | NMI                  | INT   | NO         | external NMI                           |
    | 3      | #BP      | Breakpoint           | Trap  | NO         | INT 3                                  |
    | 4      | #OF      | Overflow             | Trap  | NO         | INTO instruction                       |
    | 5      | #BR      | Bound Range Exceeded | Fault | NO         | BOUND instruction                      |
    | 6      | #UD      | Invalid Opcode       | Fault | NO         | UD2 instruction                        |
    | 7      | #NM      | Device Not Available | Fault | NO         | Floating point or [F]WAIT              |
    | 8      | #DF      | Double Fault         | Abort | YES        | Any instruction which can generate NMI |
    | 9      | ---      | Reserved             | Fault | NO         |                                        |
    | 10     | #TS      | Invalid TSS          | Fault | YES        | Task switch or TSS access              |
    | 11     | #NP      | Segment Not Present  | Fault | YES        | Accessing segment register             |
    | 12     | #SS      | Stack-Segment Fault  | Fault | YES        | Stack operations                       |
    | 13     | #GP      | General Protection   | Fault | YES        | Memory reference                       |
    | 14     | #PF      | Page fault           | Fault | YES        | Memory reference                       |
    | 15     | ---      | Reserved             |       | NO         |                                        |
    | 16     | #MF      | x87 FPU fp error     | Fault | NO         | Floating point or [F]WAIT              |
    | 17     | #AC      | Alignment Check      | Fault | YES        | Data reference                         |
    | 18     | #MC      | Machine Check        | Abort | NO         |                                        |
    | 19     | #XM      | SIMD fp exception    | Fault | NO         | SSE[2,3] instructions                  |
    | 20     | #VE      | Virtualization exc.  | Fault | NO         | EPT violations                         |
    | 21-31  | ---      | Reserved             | INT   | NO         | External interrupts                    |
    ----------------------------------------------------------------------------------------------------------

To react to an interrupt the CPU uses a special structure, the Interrupt Descriptor Table or IDT. The IDT is an array of 8-byte descriptors like the Global Descriptor Table, but IDT entries are called gates. The CPU multiplies the vector number by 8 to find the offset of the IDT entry. In 64-bit mode, however, the IDT is an array of 16-byte descriptors and the CPU multiplies the vector number by 16 to find the offset of the entry in the IDT. We remember from the previous part that the CPU uses the special GDTR register to locate the Global Descriptor Table; in the same way it uses the special IDTR register for the Interrupt Descriptor Table and the lidt instruction for loading the base address of the table into this register.
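The lookup described here is a multiplication followed by an addition. As a small sketch for the 64-bit case, assuming 256 vectors (numbers 0 to 255) and using hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

#define IDT_GATE_SIZE_64 16    /* bytes per gate in 64-bit mode */
#define NR_VECTORS       256   /* vector numbers 0..255 */

static_assert(NR_VECTORS * IDT_GATE_SIZE_64 - 1 == 0xfff,
              "a full 64-bit IDT spans exactly one 4 KB page");

/* Byte offset of the gate for the given vector inside the IDT. */
uint32_t idt_gate_offset(uint8_t vector)
{
    return (uint32_t)vector * IDT_GATE_SIZE_64;
}

/* Address of the gate, given the base address stored in IDTR. */
uint64_t idt_gate_address(uint64_t idt_base, uint8_t vector)
{
    return idt_base + idt_gate_offset(vector);
}
```

For example, the page fault gate (vector 14) sits 224 bytes into the table, and the table limit loaded together with the base is 256 * 16 - 1 = 4095.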

The 64-bit mode IDT entry has the following structure:

    127                                                                     96
    --------------------------------------------------------------------------
    |                                                                        |
    |                                Reserved                                |
    |                                                                        |
    --------------------------------------------------------------------------
    95                                                                      64
    --------------------------------------------------------------------------
    |                                                                        |
    |                             Offset 63..32                              |
    |                                                                        |
    --------------------------------------------------------------------------
    63                            48 47  46  44   42     39             34  32
    --------------------------------------------------------------------------
    |                               |   | D |   |      |       |   |   |     |
    |         Offset 31..16         | P | P | 0 | Type | 0 0 0 | 0 | 0 | IST |
    |                               |   | L |   |      |       |   |   |     |
    --------------------------------------------------------------------------
    31                                16 15                                  0
    --------------------------------------------------------------------------
    |                                   |                                    |
    |         Segment Selector          |            Offset 15..0            |
    |                                   |                                    |
    --------------------------------------------------------------------------

Where:

Offset - the offset to the entry point of an interrupt handler;
DPL - Descriptor Privilege Level;
P - Segment Present flag;
Segment selector - a code segment selector in the GDT or LDT;
IST - provides the ability to switch to a new stack for interrupt handling.

And the last, the Type field, describes the type of the IDT entry. There are three different kinds of handlers for interrupts:

Task descriptor
Interrupt descriptor
Trap descriptor

Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. The only difference between these types is how the CPU handles the IF flag. If the interrupt handler was accessed through an interrupt gate, the CPU clears the IF flag to prevent other interrupts while the current interrupt handler executes. After the current interrupt handler has executed, the CPU sets the IF flag again with the iret instruction.

Other bits are reserved and must be 0.

Now let's look at how the CPU handles interrupts:

The CPU saves the flags register, CS, and the instruction pointer on the stack;
If the interrupt causes an error code (like #PF for example), the CPU saves the error code on the stack after the instruction pointer;
After the interrupt handler has executed, the iret instruction is used to return from it.

Now let's get back to the code.

Fill and load IDT

We stopped at the following point:

    for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
        set_intr_gate(i, early_idt_handlers[i]);

Here we call set_intr_gate in the loop, which takes two parameters:

Number of an interrupt;
Address of the idt handler.

and inserts an interrupt gate in the nth IDT entry. First of all let's look at early_idt_handlers. It is an array which contains the addresses of the first 32 interrupt handlers:

    extern const char early_idt_handlers[NUM_EXCEPTION_VECTORS][2+2+5];

We're filling only the first 32 IDT entries because all of the early setup runs with interrupts disabled, so there is no need to set up early exception handlers for vectors greater than 32. early_idt_handlers contains generic idt handlers and we can find it in arch/x86/kernel/head_64.S; we will look at it soon.

Now let's look at the set_intr_gate implementation:

    #define set_intr_gate(n, addr) \
        do { \
            BUG_ON((unsigned)n > 0xFF); \
            _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \
                      __KERNEL_CS); \
            _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr, \
                            0, 0, __KERNEL_CS); \
        } while (0)

First of all it checks with the BUG_ON macro that the passed interrupt number is not greater than 255. We need to do this check because we can have only 256 interrupts. After this it calls _set_gate, which writes the interrupt gate to the IDT:

    static inline void _set_gate(int gate, unsigned type, void *addr,
                                 unsigned dpl, unsigned ist, unsigned seg)
    {
        gate_desc s;

        pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
        write_idt_entry(idt_table, gate, &s);
        write_trace_idt_entry(gate, &s);
    }

At the start of the _set_gate function we can see the call of the pack_gate function, which fills the gate_desc structure with the given values:

    static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                                 unsigned dpl, unsigned ist, unsigned seg)
    {
        gate->offset_low    = PTR_LOW(func);
        gate->segment       = __KERNEL_CS;
        gate->ist           = ist;
        gate->p             = 1;
        gate->dpl           = dpl;
        gate->zero0         = 0;
        gate->zero1         = 0;
        gate->type          = type;
        gate->offset_middle = PTR_MIDDLE(func);
        gate->offset_high   = PTR_HIGH(func);
    }

As mentioned above, we fill the gate descriptor in this function. We fill the three parts of the address of the interrupt handler with the address which we got in the main loop (the address of the interrupt handler entry point). We use the three following macros to split the address into three parts:

    #define PTR_LOW(x)    ((unsigned long long)(x) & 0xFFFF)
    #define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
    #define PTR_HIGH(x)   ((unsigned long long)(x) >> 32)

With the first, PTR_LOW, we get the first 2 bytes of the address, with the second, PTR_MIDDLE, we get the second 2 bytes of the address, and with the third, PTR_HIGH, we get the last 4 bytes of the address. Next we set up the segment selector for the interrupt handler; it will be our kernel code segment, __KERNEL_CS. In the next step we fill the Interrupt Stack Table and the Descriptor Privilege Level (the highest privilege level) with zeros. And we set the GATE_INTERRUPT type at the end.
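To see that the three macros lose no information, here is a small sketch that splits a handler address and joins it back. The struct holds only the three offset fields; it is a simplified stand-in and not the kernel's gate_desc, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTR_LOW(x)    ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x)   ((unsigned long long)(x) >> 32)

/* Only the offset fields of a gate; all other fields are left out. */
struct gate_offset {
    uint16_t offset_low;     /* bits 15..0  */
    uint16_t offset_middle;  /* bits 31..16 */
    uint32_t offset_high;    /* bits 63..32 */
};

static_assert(sizeof(struct gate_offset) == 8, "2 + 2 + 4 bytes, no padding");

struct gate_offset split_handler_address(uint64_t func)
{
    struct gate_offset g;

    g.offset_low    = (uint16_t)PTR_LOW(func);
    g.offset_middle = (uint16_t)PTR_MIDDLE(func);
    g.offset_high   = (uint32_t)PTR_HIGH(func);
    return g;
}

uint64_t join_handler_address(struct gate_offset g)
{
    return ((uint64_t)g.offset_high << 32) |
           ((uint64_t)g.offset_middle << 16) |
           g.offset_low;
}
```

The CPU performs the equivalent of join_handler_address when it reads a gate: it reassembles the 64-bit entry point from the three fields.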

Now we have filled the IDT entry and we can call the native_write_idt_entry function, which just copies the filled IDT entry to the IDT:

    static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
    {
        memcpy(&idt[entry], gate, sizeof(*gate));
    }

After the main loop has finished, we have a filled idt_table array of gate_desc structures and we can load the IDT with:

    load_idt((const struct desc_ptr *)&idt_descr);

Where idt_descr is:

    struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };

and load_idt just executes the lidt instruction:

    asm volatile("lidt %0"::"m" (*dtr));

You can note that there are calls of the _trace_* functions in _set_gate and other functions. These functions fill IDT gates in the same manner as _set_gate but with one difference: they use the trace_idt_table Interrupt Descriptor Table instead of idt_table for tracepoints (we will cover this theme in another part).

Okay, now we have a filled and loaded Interrupt Descriptor Table and we know how the CPU acts during an interrupt. So now it is time to deal with the interrupt handlers.

As you can read above, we filled the IDT with the addresses of early_idt_handlers. We can find them in arch/x86/kernel/head_64.S:

        .globl early_idt_handlers
    early_idt_handlers:
        i = 0
        .rept NUM_EXCEPTION_VECTORS
        .if (EXCEPTION_ERRCODE_MASK >> i) & 1
        ASM_NOP2
        .else
        pushq $0
        .endif
        pushq $i
        jmp early_idt_handler
        i = i + 1
        .endr

We can see here the generation of interrupt handlers for the first 32 exceptions. We check whether the exception has an error code: if it does, we do nothing, and if the exception does not push an error code, we push zero onto the stack. We do it so that the stack is uniform. After that we push the exception number onto the stack and jump to early_idt_handler, which is the generic interrupt handler for now. As I wrote above, the CPU pushes the flags register, CS and RIP onto the stack. So before early_idt_handler is executed, the stack will contain the following data:

    |--------------------|
    | %rflags            |
    | %cs                |
    | %rip               |
    | error code         |
    | rsp --> vector     |
    |--------------------|
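The .if test in the stub generator can be modelled in C to see why the stack ends up uniform. This is a sketch under one loud assumption: the mask value 0x00027d00 (bits 8, 10 to 14 and 17) is what I believe EXCEPTION_ERRCODE_MASK is in kernels of this era; the text above does not quote it. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_EXCEPTION_VECTORS  32
#define EXCEPTION_ERRCODE_MASK 0x00027d00u   /* assumed value, see lead-in */

/* True when the CPU itself pushes an error code for this vector. */
bool cpu_pushes_error_code(unsigned vector)
{
    assert(vector < NUM_EXCEPTION_VECTORS);
    return (EXCEPTION_ERRCODE_MASK >> vector) & 1u;
}

/* Number of 8-byte slots pushed by the generated stub itself. */
unsigned slots_pushed_by_stub(unsigned vector)
{
    unsigned dummy_error_code = cpu_pushes_error_code(vector) ? 0 : 1;

    return dummy_error_code + 1;   /* plus the vector number */
}

/* Slots between the saved rip and the top of the stack: always 2. */
unsigned slots_above_saved_rip(unsigned vector)
{
    return (cpu_pushes_error_code(vector) ? 1 : 0) + slots_pushed_by_stub(vector);
}
```

Whatever the vector, two slots (error code and vector number, 16 bytes in total) sit above the saved rip, so a handler can discard them with a single fixed-size stack adjustment before returning.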

Early interrupts handlers

Now let's look at the early_idt_handler implementation. It is located in the same arch/x86/kernel/head_64.S. First of all we can see a check for NMI; we do not need to handle it, so we just ignore it in early_idt_handler:

    cmpl $2,(%rsp)
    je is_nmi

where is_nmi is:

    is_nmi:
        addq $16,%rsp
        INTERRUPT_RETURN

We drop the error code and vector number from the stack and call INTERRUPT_RETURN, which is just iretq. Once we have checked the vector number and it is not NMI, we check early_recursion_flag to prevent recursion in early_idt_handler, and if everything is correct we save the general registers on the stack:

    pushq %rax
    pushq %rcx
    pushq %rdx
    pushq %rsi
    pushq %rdi
    pushq %r8
    pushq %r9
    pushq %r10
    pushq %r11

We need to do it so that the registers do not hold wrong values when we return from the interrupt handler. After this we check the segment selector on the stack:

    cmpl $__KERNEL_CS,96(%rsp)
    jne 11f

It must be equal to the kernel code segment, and if it is not we jump to label 11, which prints a PANIC message and makes a stack dump.

After the code segment has been checked, we check the vector number, and if it is #PF, we put the value from cr2 into the rdi register and call early_make_pgtable (we'll see it soon):

    cmpl $14,72(%rsp)
    jnz 10f
    GET_CR2_INTO(%rdi)
    call early_make_pgtable
    andl %eax,%eax
    jz 20f

If the vector number is not #PF we jump to label 10. If it is #PF and early_make_pgtable returns zero, we jump to label 20, where we restore the general purpose registers from the stack:

    popq %r11
    popq %r10
    popq %r9
    popq %r8
    popq %rdi
    popq %rsi
    popq %rdx
    popq %rcx
    popq %rax

and exit from the handler with iret.

It is the end of the first interrupt handler. Note that it is a very early interrupt handler, so it handles only the Page Fault now. We will see handlers for the other interrupts, but now let's look at the page fault handler.

Page fault handling

In the previous paragraph we saw the first early interrupt handler, which checks the interrupt number for a page fault and calls early_make_pgtable to build new page tables if it is one. We need to have a #PF handler at this step because there are plans to add the ability to load the kernel above 4G and to access the boot_params structure above 4G.

You can find the implementation of early_make_pgtable in arch/x86/kernel/head64.c. It takes one parameter, the address from the cr2 register which caused the Page Fault. Let's look at it:

    int __init early_make_pgtable(unsigned long address)
    {
        unsigned long physaddr = address - __PAGE_OFFSET;
        unsigned long i;
        pgdval_t pgd, *pgd_p;
        pudval_t pud, *pud_p;
        pmdval_t pmd, *pmd_p;
        ...
        ...
        ...
    }

It starts with the definition of some variables which have *val_t types. All of these types are just:

    typedef unsigned long pgdval_t;

We will also operate with the *_t (not val) types, for example pgd_t and so on. All of these types are defined in arch/x86/include/asm/pgtable_types.h and represent structures like this:

    typedef struct { pgdval_t pgd; } pgd_t;

For example,

    extern pgd_t early_level4_pgt[PTRS_PER_PGD];

Here early_level4_pgt represents the early top-level page table directory, which consists of an array of pgd_t types, and pgd points to the low-level page entries.

After we have checked that the address is not invalid, we get the address of the Page Global Directory entry which contains the #PF address and put its value into the pgd variable:

    pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
    pgd = *pgd_p;

In the next step we check pgd. If it contains a correct page global directory entry, we take the physical address stored in the entry, convert it to a virtual address and put it into pud_p with:

    pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);

where PTE_PFN_MASK is a macro:

    #define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)

which expands to:

    (~(PAGE_SIZE-1)) & ((1 << 46) - 1)

or

    0b1111111111111111111111111111111111000000000000

which is a 46-bit value that masks the page frame address: bits 12 to 45 are set and the low 12 bits are clear.
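The value of this mask can be verified with a few lines of C. This is only a sketch: PHYS_ADDR_BITS is a hypothetical name for the 46 in the expression above, the page size is assumed to be 4096 bytes, and 1ULL is used because shifting a plain int by 46 bits would overflow in ordinary C.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12
#define PAGE_SIZE      (1ULL << PAGE_SHIFT)
#define PHYS_ADDR_BITS 46   /* hypothetical name for the 46 used above */

static_assert(PHYS_ADDR_BITS > PAGE_SHIFT, "mask must keep some frame bits");

/* Mask that keeps only the page frame address bits of a table entry. */
uint64_t pte_pfn_mask(void)
{
    return ~(PAGE_SIZE - 1) & ((1ULL << PHYS_ADDR_BITS) - 1);
}

/* Strip the flag bits from a page table entry. */
uint64_t entry_to_phys(uint64_t entry)
{
    return entry & pte_pfn_mask();
}
```

In hexadecimal the mask is 0x00003ffffffff000. Applying it to an entry such as 0x1234063 drops the low flag bits and leaves the physical address 0x1234000; a flag in bit 63 is dropped as well.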

If pgd does not contain a correct address, we check that next_early_pgt has not reached EARLY_DYNAMIC_PAGE_TABLES, which is 64 and represents a fixed number of buffers to set up new page tables on demand. If next_early_pgt is greater than or equal to EARLY_DYNAMIC_PAGE_TABLES we reset the page tables and start again. If next_early_pgt is less than EARLY_DYNAMIC_PAGE_TABLES, we create a new page upper directory pointer which points to the current dynamic page table, and write its physical address with the _KERNPG_TABLE access rights to the page global directory:

    if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
        reset_early_page_tables();
        goto again;
    }

    pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
    for (i = 0; i < PTRS_PER_PUD; i++)
        pud_p[i] = 0;
    *pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;

After this we fix up the address of the page upper directory with:

    pud_p += pud_index(address);
    pud = *pud_p;

In the next step we do the same actions as before, but with the page middle directory. At the end we fix the address of the page middle directory entry which maps the kernel text+data virtual addresses:

    pmd = (physaddr & PMD_MASK) + early_pmd_flags;
    pmd_p[pmd_index(address)] = pmd;

After the page fault handler has finished its work, our early_level4_pgt contains entries which point to valid addresses.

Conclusion

This is the end of the second part about linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see all steps before the kernel entry point, the start_kernel function.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

GNU assembly .rept
APIC
NMI
Previous part

Kernel initialization. Part 3.

Last preparations before the kernel entry point

This is the third part of the Linux kernel initialization process series. In the previous part we saw early interrupt and exception handling, and in the current part we will continue to dive into the linux kernel initialization process. Our next point is the 'kernel entry point' - the start_kernel function from the init/main.c source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a certain architecture. But before we see the call of the start_kernel function, we must do some preparations. So let's continue.

boot_params again

In the previous part we stopped at setting up the Interrupt Descriptor Table and loading it into the IDTR register. The next step after this is a call of the copy_bootdata function:

    copy_bootdata(__va(real_mode_data));

This function takes one argument - the virtual address of real_mode_data. Remember that we passed the address of the boot_params structure from arch/x86/include/uapi/asm/bootparam.h to the x86_64_start_kernel function as the first argument in arch/x86/kernel/head_64.S:

    /* rsi is pointer to real mode structure with interesting info.
       pass it to C */
    movq %rsi, %rdi

Now let's look at the __va macro. This macro is defined in arch/x86/include/asm/page.h:

    #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))

where PAGE_OFFSET is __PAGE_OFFSET, which is 0xffff880000000000 and is the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the boot_params structure and pass it to the copy_bootdata function, where we copy real_mode_data to boot_params, which is declared in arch/x86/include/asm/setup.h:

    extern struct boot_params boot_params;

Let's look at the copy_bootdata implementation:

    static void __init copy_bootdata(char *real_mode_data)
    {
        char *command_line;
        unsigned long cmd_line_ptr;

        memcpy(&boot_params, real_mode_data, sizeof boot_params);
        sanitize_boot_params(&boot_params);
        cmd_line_ptr = get_cmd_line_ptr();
        if (cmd_line_ptr) {
            command_line = __va(cmd_line_ptr);
            memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
        }
    }

First of all, note that this function is declared with the __init prefix. It means that this function will be used only during initialization and the memory it occupies will be freed afterwards.

We can see the declaration of two variables for the kernel command line and the copying of real_mode_data to boot_params with the memcpy function. Next comes the call of the sanitize_boot_params function, which zeroes some fields of the boot_params structure, such as ext_ramdisk_image, for bootloaders that fail to initialize unknown fields in boot_params to zero. After this we get the address of the command line with the call of the get_cmd_line_ptr function:

    unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
    cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
    return cmd_line_ptr;

which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check that cmd_line_ptr is not zero, get its virtual address and copy the command line to boot_command_line, which is just an array of bytes:

    extern char __initdata boot_command_line[];
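The two address computations in copy_bootdata are plain integer arithmetic, so they are easy to try outside the kernel. The sketch below is only a model: struct sketch_boot_params is a hypothetical stand-in holding just the two fields involved, and sketch_va mimics __va with the PAGE_OFFSET value quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Base of the direct mapping of physical memory, as quoted in the text. */
#define SKETCH_PAGE_OFFSET 0xffff880000000000ULL

/* Hypothetical stand-in for the two boot_params fields involved. */
struct sketch_boot_params {
    uint32_t cmd_line_ptr;      /* models boot_params.hdr.cmd_line_ptr */
    uint32_t ext_cmd_line_ptr;  /* models boot_params.ext_cmd_line_ptr */
};

/* Low 32 bits come from the setup header, high 32 bits from the
 * extension field shifted into place. */
static uint64_t sketch_get_cmd_line_ptr(const struct sketch_boot_params *bp)
{
    uint64_t ptr = bp->cmd_line_ptr;

    ptr |= (uint64_t)bp->ext_cmd_line_ptr << 32;
    return ptr;
}

/* Model of __va(): physical address -> address in the direct mapping. */
static uint64_t sketch_va(uint64_t phys)
{
    uint64_t virt = phys + SKETCH_PAGE_OFFSET;

    assert(virt >= SKETCH_PAGE_OFFSET); /* no wrap-around for sane input */
    return virt;
}
```

A command line placed at physical 0x20000 would therefore be read by the kernel through the virtual address 0xffff880000020000.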

After this we will have copied the kernel command line and the boot_params structure. In the next step we can see the call of the load_ucode_bsp function, which loads processor microcode, but we will not look at it here.

After the microcode is loaded we can see the check of console_loglevel and the early_printk function which prints the Kernel Alive string. But you'll never see this output because early_printk is not initialized yet. It is a minor bug in the kernel; I sent a patch (commit) and you will see it in the mainline soon. So you can skip this code.

Move on init pages

In the next step, as we have copied the boot_params structure, we need to move from the early page tables to the page tables for the initialization process. We already set up early page tables for the switchover (you can read about it in the previous part), dropped them all in the reset_early_page_tables function (you can read about it in the previous part too) and kept only the kernel high mapping. After this we call the:

    clear_page(init_level4_pgt);

function and pass init_level4_pgt, which is also defined in arch/x86/kernel/head_64.S and looks like:

    NEXT_PAGE(init_level4_pgt)
        .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
        .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .org    init_level4_pgt + L4_START_KERNEL*8, 0
        .quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

which maps the first 2 gigabytes and 512 megabytes for the kernel code, data and bss. The clear_page function is defined in arch/x86/lib/clear_page_64.S. Let's look at this function:

    ENTRY(clear_page)
        CFI_STARTPROC
        xorl %eax,%eax
        movl $4096/64,%ecx
        .p2align 4
    .Lloop:
        decl    %ecx
    #define PUT(x) movq %rax,x*8(%rdi)
        movq %rax,(%rdi)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
        leaq 64(%rdi),%rdi
        jnz .Lloop
        nop
        ret
        CFI_ENDPROC
    .Lclear_page_end:
        ENDPROC(clear_page)

As you can understand from the function name, it clears page tables, or in other words fills them with zeros. First of all note that this function starts with CFI_STARTPROC and ends with CFI_ENDPROC, which expand to GNU assembly directives:

    #define CFI_STARTPROC .cfi_startproc
    #define CFI_ENDPROC .cfi_endproc

and are used for debugging. After the CFI_STARTPROC macro we zero out the eax register and put 64 into ecx (it will be the counter). Next we can see a loop which starts at the .Lloop label and begins with the decrement of ecx. After this we store zero from the rax register at the address in rdi, which now contains the base address of init_level4_pgt, and repeat the store seven more times, each time with the offset from rdi increased by 8. After this the first 64 bytes of init_level4_pgt are filled with zeros. In the next step we advance rdi by 64 bytes and repeat all operations until ecx reaches zero (leaq does not modify the flags, so jnz still tests the result of decl). In the end init_level4_pgt is filled with zeros.
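If the assembly is hard to follow, the same loop can be written in C. This is only a sketch of the control flow (64 iterations, eight 8-byte stores each), not the kernel's implementation; sketch_page_is_zero is a hypothetical helper added so the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096

/* C rendering of the loop in clear_page. */
static void sketch_clear_page(uint64_t *page)
{
    unsigned count = SKETCH_PAGE_SIZE / 64;  /* movl $4096/64, %ecx */

    assert(page != NULL);
    while (count != 0) {
        count--;                             /* decl %ecx */
        for (size_t i = 0; i < 8; i++)       /* movq %rax,(%rdi); PUT(1)..PUT(7) */
            page[i] = 0;
        page += 8;                           /* leaq 64(%rdi), %rdi */
    }                                        /* jnz .Lloop */
}

/* Checking helper; not part of the kernel code. */
static int sketch_page_is_zero(const uint64_t *page)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE / sizeof(uint64_t); i++)
        if (page[i] != 0)
            return 0;
    return 1;
}
```

The unrolling into eight stores per iteration is what the PUT macro achieves in the assembly version: fewer loop branches per cleared byte.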

As we have init_level4_pgt filled with zeros, we set the last init_level4_pgt entry to the kernel high mapping with:

    init_level4_pgt[511] = early_level4_pgt[511];

Remember that we dropped all early_level4_pgt entries in the reset_early_page_tables function and kept only the kernel high mapping there.

Last step before kernel entry point

The last step in the x86_64_start_kernel function is the call of the:

    x86_64_start_reservations(real_mode_data);

function with real_mode_data as its argument. The x86_64_start_reservations function is defined in the same source code file as the x86_64_start_kernel function and looks like:

    void __init x86_64_start_reservations(char *real_mode_data)
    {
        if (!boot_params.hdr.version)
            copy_bootdata(__va(real_mode_data));

        reserve_ebda_region();

        start_kernel();
    }

You can see that it is the last function before we are in the kernel entry point - the start_kernel function. Let's look at what it does and how it works.

First of all we can see in the x86_64_start_reservations function the check of boot_params.hdr.version:

    if (!boot_params.hdr.version)
        copy_bootdata(__va(real_mode_data));

and if it is zero we call the copy_bootdata function again with the virtual address of real_mode_data (you can read about its implementation above).

In the next step we can see the call of the reserve_ebda_region function, which is defined in arch/x86/kernel/head.c. This function reserves a memory block for the EBDA or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on.

Let's look at the reserve_ebda_region function. It starts by checking whether paravirtualization is enabled:

    if (paravirt_enabled())
        return;

We exit from the reserve_ebda_region function if paravirtualization is enabled, because in that case the extended BIOS data area is absent. In the next step we need to get the end of the low memory:

    lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
    lowmem <<= 10;

We take the virtual address of the BIOS low memory size, read the value stored there in kilobytes, and convert it to bytes by shifting it left by 10 (multiplying by 1024, in other words). After this we need to get the address of the extended BIOS data area with:

    ebda_addr = get_bios_ebda();

where the get_bios_ebda function is defined in arch/x86/include/asm/bios_ebda.h and looks like:

    static inline unsigned int get_bios_ebda(void)
    {
        unsigned int address = *(unsigned short *)phys_to_virt(0x40E);
        address <<= 4;
        return address;
    }

Let's try to understand how it works. Here we convert the physical address 0x40E to a virtual one, where 0x0040:0x000e is the real-mode segment:offset location of the word that holds the segment of the extended BIOS data area. Don't worry that we are using the phys_to_virt function to convert a physical address to a virtual address. You may note that previously we used the __va macro for the same purpose, but phys_to_virt is the same:

    static inline void *phys_to_virt(phys_addr_t address)
    {
        return __va(address);
    }

with only one difference: we pass an argument of type phys_addr_t, which depends on CONFIG_PHYS_ADDR_T_64BIT:

    #ifdef CONFIG_PHYS_ADDR_T_64BIT
    typedef u64 phys_addr_t;
    #else
    typedef u32 phys_addr_t;
    #endif

This configuration option is enabled on x86_64, so phys_addr_t is u64 there. After we got the virtual address of the word which stores the segment of the extended BIOS data area, we read it, shift it left by 4 and return the result. After this the ebda_addr variable contains the base address of the extended BIOS data area.
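Both conversions (kilobytes to bytes with a shift by 10, real-mode segment to linear address with a shift by 4) can be tried on a small buffer that plays the role of the first kilobytes of physical memory. In this sketch the offset 0x413 for the low memory size is the conventional BIOS data area location and should be treated as an assumption, since the text does not quote the value of BIOS_LOWMEM_KILOBYTES; all sketch_ names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BIOS_LOWMEM_KILOBYTES 0x413 /* assumed BIOS data area offset */
#define SKETCH_BIOS_EBDA_SEGMENT     0x40E /* offset used by get_bios_ebda */

/* Read a little-endian 16-bit word from the modelled memory. */
static uint16_t sketch_read16(const uint8_t *mem, size_t off)
{
    return (uint16_t)(mem[off] | (mem[off + 1] << 8));
}

/* Low memory size is stored in kilobytes; << 10 converts it to bytes. */
static uint32_t sketch_lowmem_bytes(const uint8_t *mem)
{
    return (uint32_t)sketch_read16(mem, SKETCH_BIOS_LOWMEM_KILOBYTES) << 10;
}

/* The EBDA base is stored as a real-mode segment; << 4 converts it
 * to a linear address. */
static uint32_t sketch_ebda_addr(const uint8_t *mem)
{
    uint32_t addr = (uint32_t)sketch_read16(mem, SKETCH_BIOS_EBDA_SEGMENT) << 4;

    assert(addr < 0x100000); /* a 16-bit segment << 4 stays below 1 MiB */
    return addr;
}
```

With 639 stored as the low memory size and 0x9fc0 as the EBDA segment, both helpers return 0x9fc00, a layout often seen on PCs where the EBDA starts right at the end of usable low memory.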

In the next step we check that the address of the extended BIOS data area and the low memory size are not less than the INSANE_CUTOFF macro:

    if (ebda_addr < INSANE_CUTOFF)
        ebda_addr = LOWMEM_CAP;

    if (lowmem < INSANE_CUTOFF)
        lowmem = LOWMEM_CAP;

which is:

    #define INSANE_CUTOFF 0x20000U

or 128 kilobytes. In the last step we take the lower of the low memory end and the extended BIOS data area address, and call the memblock_reserve function, which will reserve the memory region for the extended BIOS data between that point and the one megabyte mark:

    lowmem = min(lowmem, ebda_addr);
    lowmem = min(lowmem, LOWMEM_CAP);
    memblock_reserve(lowmem, 0x100000 - lowmem);
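The clamping logic above can be modelled as a pure function that returns the region which would be handed to memblock_reserve. Note that the value 0x9f000 used for LOWMEM_CAP below is an assumption, as the text does not quote it; struct sketch_region and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INSANE_CUTOFF 0x20000U  /* 128 KiB, from the text */
#define SKETCH_LOWMEM_CAP    0x9f000U  /* assumed value, not quoted in the text */
#define SKETCH_ONE_MB        0x100000U

struct sketch_region {
    uint32_t base;
    uint32_t size;
};

static uint32_t sketch_min(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Compute the [base, base + size) range that would be reserved. */
static struct sketch_region sketch_ebda_reservation(uint32_t lowmem,
                                                    uint32_t ebda_addr)
{
    struct sketch_region r;

    /* Implausibly small values are replaced by the cap. */
    if (ebda_addr < SKETCH_INSANE_CUTOFF)
        ebda_addr = SKETCH_LOWMEM_CAP;
    if (lowmem < SKETCH_INSANE_CUTOFF)
        lowmem = SKETCH_LOWMEM_CAP;

    /* Take the lowest of the three candidates. */
    lowmem = sketch_min(lowmem, ebda_addr);
    lowmem = sketch_min(lowmem, SKETCH_LOWMEM_CAP);

    assert(lowmem <= SKETCH_ONE_MB);
    r.base = lowmem;
    r.size = SKETCH_ONE_MB - lowmem;
    return r;
}
```

Under these assumptions the reserved range always ends at the one megabyte mark and never starts above the cap, whatever the BIOS reports.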

The memblock_reserve function is defined in mm/memblock.c and takes two parameters:

- base physical address;
- region size.

and reserves a memory region for the given base address and size. memblock_reserve is the first function in this book from the linux kernel memory manager framework. We will take a closer look at the memory manager soon, but for now let's look at its implementation.

First touch of the linux kernel memory manager framework

In the previous paragraph we stopped at the call of the memblock_reserve function and, as I said before, it is the first function from the memory manager framework. Let's try to understand how it works. The memblock_reserve function just calls the:

    memblock_reserve_region(base, size, MAX_NUMNODES, 0);

function and passes 4 parameters there:

- physical base address of the memory region;
- size of the memory region;
- maximum number of NUMA nodes;
- flags.

At the start of the memblock_reserve_region body we can see the definition of a pointer to the memblock_type structure:

    struct memblock_type *_rgn = &memblock.reserved;

which represents the type of the memory block and looks like:

    struct memblock_type {
        unsigned long cnt;
        unsigned long max;
        phys_addr_t total_size;
        struct memblock_region *regions;
    };

As we need to reserve a memory block for the extended BIOS data area, the type of the current memory region is reserved, where the memblock structure is:

    struct memblock {
        bool bottom_up;
        phys_addr_t current_limit;
        struct memblock_type memory;
        struct memblock_type reserved;
    #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        struct memblock_type physmem;
    #endif
    };

and describes a generic memory block. You can see that we initialize _rgn by assigning it the address of memblock.reserved. memblock is the global variable which looks like:

    struct memblock memblock __initdata_memblock = {
        .memory.regions     = memblock_memory_init_regions,
        .memory.cnt         = 1,
        .memory.max         = INIT_MEMBLOCK_REGIONS,
        .reserved.regions   = memblock_reserved_init_regions,
        .reserved.cnt       = 1,
        .reserved.max       = INIT_MEMBLOCK_REGIONS,
    #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        .physmem.regions    = memblock_physmem_init_regions,
        .physmem.cnt        = 1,
        .physmem.max        = INIT_PHYSMEM_REGIONS,
    #endif
        .bottom_up          = false,
        .current_limit      = MEMBLOCK_ALLOC_ANYWHERE,
    };

We will not dive into the details of this variable here, but we will see all the details about it in the parts about the memory manager. Just note that the memblock variable is defined with __initdata_memblock, which is:

    #define __initdata_memblock __meminitdata

and __meminitdata is:

    #define __meminitdata __section(.meminit.data)

From this we can conclude that all memory blocks will be in the .meminit.data section. After we defined _rgn we print information about it with the memblock_dbg macro. You can enable it by passing memblock=debug on the kernel command line.

After the debugging lines are printed, next is the call of the following function:

    memblock_add_range(_rgn, base, size, nid, flags);

which adds a new memory block region into the .meminit.data section. As _rgn was not initialized separately and just contains &memblock.reserved, whose region table is still empty, we just fill the passed _rgn with the base address of the extended BIOS data area region, the size of this region and the flags:

    if (type->regions[0].size == 0) {
        WARN_ON(type->cnt != 1 || type->total_size);
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->regions[0].flags = flags;
        memblock_set_region_node(&type->regions[0], nid);
        type->total_size = size;
        return 0;
    }

After we filled our region we can see the call of the memblock_set_region_node function with two parameters:

- address of the filled memory region;
- NUMA node id.

where our regions are represented by the memblock_region structure:

    struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        unsigned long flags;
    #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        int nid;
    #endif
    };

The NUMA node id depends on the MAX_NUMNODES macro, which is defined in include/linux/numa.h:

    #define MAX_NUMNODES (1 << NODES_SHIFT)

where NODES_SHIFT depends on the CONFIG_NODES_SHIFT configuration parameter and is defined as:

    #ifdef CONFIG_NODES_SHIFT
    #define NODES_SHIFT CONFIG_NODES_SHIFT
    #else
    #define NODES_SHIFT 0
    #endif

The memblock_set_region_node function just fills the nid field of memblock_region with the given value:

    static inline void memblock_set_region_node(struct memblock_region *r, int nid)
    {
        r->nid = nid;
    }

After this we will have the first reserved memblock for the extended BIOS data area in the .meminit.data section. The reserve_ebda_region function has finished its work at this step and we can go back to arch/x86/kernel/head64.c.
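To see the bookkeeping in one place, here is a stripped-down model of the "first region" path described above. It is a sketch only: the structures are simplified copies with hypothetical sketch_ names, the capacity of 128 regions is made up, and merging or inserting into a non-empty table is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_REGIONS 128 /* hypothetical capacity */

struct sketch_region {
    uint64_t base;
    uint64_t size;
    unsigned long flags;
    int nid;
};

struct sketch_type {
    unsigned long cnt;            /* number of regions, starts at 1 */
    unsigned long max;            /* capacity of the regions array */
    uint64_t total_size;          /* sum of all region sizes */
    struct sketch_region *regions;
};

/* Handles only the case where the table is still empty.
 * Returns 0 on success, -1 when a region is already present. */
static int sketch_add_first_region(struct sketch_type *type, uint64_t base,
                                   uint64_t size, int nid, unsigned long flags)
{
    if (type->regions[0].size != 0)
        return -1; /* merging/inserting is out of scope for this sketch */

    assert(type->cnt == 1 && type->total_size == 0);
    type->regions[0].base = base;
    type->regions[0].size = size;
    type->regions[0].flags = flags;
    type->regions[0].nid = nid;   /* what memblock_set_region_node does */
    type->total_size = size;
    return 0;
}
```

The empty table is recognised by a zero size in slot 0 while cnt is already 1, which matches the initial state of the memblock variable shown earlier.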

We finished all preparations before the kernel entry point! The last step in the x86_64_start_reservations function is the call of the:

    start_kernel()

function from the init/main.c file.

That's all for this part.

Conclusion

It is the end of the third part about linux kernel internals. In the next part we will see the first initialization steps in the kernel entry point - the start_kernel function. It will be the first step before we see the launch of the first init process.

If you have any questions or suggestions write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

- BIOS data area
- What is in the extended BIOS data area on a PC?
- Previous part

Kernel initialization. Part 4.

Kernel entry point

If you have read the previous part - Last preparations before the kernel entry point, you can remember that we finished all pre-initialization stuff and stopped right before the call to the start_kernel function from init/main.c. start_kernel is the entry of the generic and architecture independent kernel code, although we will return to the arch/ folder many times. If you look inside the start_kernel function, you will see that this function is very big. At this moment it contains about 86 function calls. Yes, it's very big and of course this part will not cover all the processes that occur in this function. In the current part we will only start on it. This part and all the following parts of the Kernel initialization process chapter will cover it.

The main purpose of start_kernel is to finish the kernel initialization process and launch the first init process. Before the first process is started, start_kernel must do many things, such as: enable the lock validator, initialize the processor id, enable the early cgroups subsystem, set up per-cpu areas, initialize different caches in vfs, initialize the memory manager, rcu, vmalloc, scheduler, IRQs, ACPI and many many more. Only after these steps will we see the launch of the first init process in the last part of this chapter. So much kernel code awaits us, let's start.

NOTE: All parts of this big chapter Linux Kernel initialization process will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.

A little about function attributes

As I wrote above, the start_kernel function is defined in init/main.c. This function is defined with the __init attribute and, as you may already know from other parts, all functions which are defined with this attribute are needed only during kernel initialization.

    #define __init __section(.init.text) __cold notrace

After the initialization process is finished, the kernel releases these sections with a call to the free_initmem function. Note also that __init is defined with two attributes: __cold and notrace. The purpose of the first, cold, is to mark that the function is rarely used so that the compiler optimizes this function for size. The second, notrace, is defined as:

    #define notrace __attribute__((no_instrument_function))

where no_instrument_function tells the compiler not to generate profiling function calls.

In the definition of the start_kernel function, you can also see the __visible attribute which expands to:

    #define __visible __attribute__((externally_visible))

where externally_visible tells the compiler that something uses this function or variable, so that it is not treated as unused and discarded. You can find the definitions of this and other macro attributes in include/linux/init.h.

First steps in the start_kernel

At the beginning of start_kernel you can see the definition of these two variables:

    char *command_line;
    char *after_dashes;

The first represents a pointer to the kernel command line and the second will contain the result of the parse_args function, which parses an input string with parameters in the form name=value, looking for specific keywords and invoking the right handlers. We will not go into the details related to these two variables at this time, but will see them in the next parts. In the next step we can see a call to the:

    lockdep_init();

function. lockdep_init initializes the lock validator. Its implementation is pretty simple: it just initializes two list_head hashes and sets the lockdep_initialized global variable to 1. The lock validator detects circular lock dependencies and is called when any spinlock or mutex is acquired.

The next function is set_task_stack_end_magic, which takes the address of init_task and sets STACK_END_MAGIC (0x57AC6E9D) as a canary for it. init_task represents the initial task structure:

    struct task_struct init_task = INIT_TASK(init_task);

where task_struct stores all the information about a process. I will not explain this structure in this book because it's very big. You can find its definition in include/linux/sched.h. At this moment task_struct contains more than 100 fields! Although you will not see the explanation of task_struct in this book, we will use it very often since it is the fundamental structure which describes a process in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice.

You can see the definition of init_task and that it is initialized by the INIT_TASK macro. This macro is from include/linux/init_task.h and it just fills init_task with the values for the first process. For example it sets:

- the init process state to zero or runnable. A runnable process is one which is waiting only for a CPU to run on;
- the init process flags - PF_KTHREAD, which means kernel thread;
- a list of runnable tasks;
- the process address space;
- the init process stack to &init_thread_info, which is init_thread_union.thread_info, where init_thread_union has the type thread_union, which contains thread_info and the process stack:

    union thread_union {
        struct thread_info thread_info;
        unsigned long stack[THREAD_SIZE/sizeof(long)];
    };

Every process has its own stack and it is 16 kilobytes, or 4 page frames, on x86_64. We can note that it is defined as an array of unsigned long. The other field of thread_union is thread_info, defined as:

    struct thread_info {
        struct task_struct      *task;
        struct exec_domain      *exec_domain;
        __u32                   flags;
        __u32                   status;
        __u32                   cpu;
        int                     saved_preempt_count;
        mm_segment_t            addr_limit;
        struct restart_block    restart_block;
        void __user             *sysenter_return;
        unsigned int            sig_on_uaccess_error:1;
        unsigned int            uaccess_err:1;
    };

and occupies 52 bytes. The thread_info structure contains architecture-specific information about the thread. We know that on x86_64 the stack grows down, and thread_union.thread_info is stored at the bottom of the stack in our case. So the process stack is 16 kilobytes and thread_info is at the bottom. The remaining stack space will be 16 kilobytes - 52 bytes = 16332 bytes. Note that thread_union is represented as a union and not a structure, which means that thread_info and the stack share the same memory space.

Schematically it can be represented as follows:

    +-----------------------+
    |                       |
    |                       |
    |        stack          |
    |                       |
    |_______________________|
    |          |            |
    |          |            |
    |          |            |
    |__________↓____________|             +--------------------+
    |                       |             |                    |
    |      thread_info      |<----------->|     task_struct    |
    |                       |             |                    |
    +-----------------------+             +--------------------+

http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct

So the INIT_TASK macro fills these task_struct fields and many many more. As I already wrote above, I will not describe all the fields and values in the INIT_TASK macro, but we will see them soon.

Now let's go back to the set_task_stack_end_magic function. This function is defined in kernel/fork.c and sets a canary on the init process stack to detect stack overflow.

    void set_task_stack_end_magic(struct task_struct *tsk)
    {
        unsigned long *stackend;

        stackend = end_of_stack(tsk);
        *stackend = STACK_END_MAGIC; /* for overflow detection */
    }

Its implementation is simple. set_task_stack_end_magic gets the end of the stack for the given task_struct with the end_of_stack function. The end of a process stack depends on the CONFIG_STACK_GROWSUP configuration option. As we are on the x86_64 architecture, the stack grows down. So the end of the process stack will be:

    (unsigned long *)(task_thread_info(p) + 1);

where task_thread_info just returns the stack which we filled with the INIT_TASK macro:

    #define task_thread_info(task) ((struct thread_info *)(task)->stack)

As we got the end of the init process stack, we write STACK_END_MAGIC there. After the canary is set, we can check it like this:

    if (*end_of_stack(task) != STACK_END_MAGIC) {
        //
        // handle stack overflow here
        //
    }
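The layout trick (thread_info at the lowest address of the union, the canary word directly above it) can be reproduced in user space. In the sketch below struct sketch_thread_info is a cut-down, hypothetical stand-in for the real structure, and the 16 KiB size is the x86_64 value mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_THREAD_SIZE     16384UL      /* 16 KiB, as in the text */
#define SKETCH_STACK_END_MAGIC 0x57AC6E9DUL

/* Cut-down stand-in for struct thread_info; the real one has more fields. */
struct sketch_thread_info {
    void *task;
    unsigned int flags;
};

union sketch_thread_union {
    struct sketch_thread_info thread_info;
    unsigned long stack[SKETCH_THREAD_SIZE / sizeof(long)];
};

/* With a stack that grows down, the last usable word sits right above
 * thread_info, that is at (thread_info + 1). */
static unsigned long *sketch_end_of_stack(union sketch_thread_union *u)
{
    assert(u != NULL);
    return (unsigned long *)(&u->thread_info + 1);
}

static void sketch_set_stack_end_magic(union sketch_thread_union *u)
{
    *sketch_end_of_stack(u) = SKETCH_STACK_END_MAGIC;
}

/* Non-zero when the canary word has been overwritten. */
static int sketch_stack_overflowed(union sketch_thread_union *u)
{
    return *sketch_end_of_stack(u) != SKETCH_STACK_END_MAGIC;
}
```

A stack that grows down far enough reaches the canary word before it reaches thread_info, so a changed canary is a sign that thread_info is about to be (or already was) corrupted.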

The next function after set_task_stack_end_magic is smp_setup_processor_id. This function has an empty body for x86_64:

    void __init __weak smp_setup_processor_id(void)
    {
    }

as it is not implemented for all architectures, only for some such as s390 and arm64.

The next function in start_kernel is debug_objects_early_init. The implementation of this function is almost the same as lockdep_init, but it fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which are for debugging purposes in this chapter.

After the debug_objects_early_init function we can see the call of the boot_init_stack_canary function, which fills task_struct->stack_canary with the canary value for the -fstack-protector gcc feature. This function depends on the CONFIG_CC_STACKPROTECTOR configuration option: if this option is disabled, boot_init_stack_canary does nothing; otherwise it generates a random number based on the random pool and the TSC:

    get_random_bytes(&canary, sizeof(canary));
    tsc = __native_read_tsc();
    canary += tsc + (tsc << 32UL);

After we got a random number, we fill the stack_canary field of task_struct with it:

    current->stack_canary = canary;

and write this value to the top of the IRQ stack with:

    this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write

Again, we will not dive into details here; we will cover it in the part about IRQs. As the canary is set, we disable local and early boot IRQs and register the bootstrap CPU in the CPU maps. We disable local IRQs (interrupts for the current CPU) with the local_irq_disable macro, which expands to the call of the arch_local_irq_disable function from arch/x86/include/asm/irqflags.h:

    static inline notrace void arch_local_irq_disable(void)
    {
        native_irq_disable();
    }

where native_irq_disable is the cli instruction on x86_64. As interrupts are disabled we can register the current CPU with the given ID in the CPU bitmap.

The first processor activation

The current function from start_kernel is boot_cpu_init. This function initializes various CPU masks for the bootstrap processor. First of all it gets the bootstrap processor id with a call to:

    int cpu = smp_processor_id();

For now it is just zero. If the CONFIG_DEBUG_PREEMPT configuration option is disabled, smp_processor_id just expands to the call of raw_smp_processor_id, which expands to:

    #define raw_smp_processor_id() (this_cpu_read(cpu_number))

this_cpu_read, like many other functions of this kind (this_cpu_write, this_cpu_add and so on), is defined in include/linux/percpu-defs.h and represents a this_cpu operation. These operations provide a way of optimizing access to the per-cpu variables which are associated with the current processor. In our case it is this_cpu_read:

    __pcpu_size_call_return(this_cpu_read_, pcp)

Remember that we passed cpu_number as pcp to this_cpu_read from raw_smp_processor_id. Now let's look at the __pcpu_size_call_return implementation:

    #define __pcpu_size_call_return(stem, variable)                         \
    ({                                                                      \
        typeof(variable) pscr_ret__;                                        \
        __verify_pcpu_ptr(&(variable));                                     \
        switch (sizeof(variable)) {                                         \
        case 1: pscr_ret__ = stem##1(variable); break;                      \
        case 2: pscr_ret__ = stem##2(variable); break;                      \
        case 4: pscr_ret__ = stem##4(variable); break;                      \
        case 8: pscr_ret__ = stem##8(variable); break;                      \
        default:                                                            \
            __bad_size_call_parameter(); break;                             \
        }                                                                   \
        pscr_ret__;                                                         \
    })

Yes, it looks a little strange but it's easy. First of all we can see the definition of the pscr_ret__ variable with the int type. Why int? Because variable is cpu_number, and it was declared as a per-cpu int variable:

    DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);

In the next step we call __verify_pcpu_ptr with the address of cpu_number. __verify_pcpu_ptr is used to verify that the given parameter is a per-cpu pointer. After that we set the pscr_ret__ value, which depends on the size of the variable. Our cpu_number variable is int, so it is 4 bytes in size. It means that pscr_ret__ receives the result of this_cpu_read_4(cpu_number), and at the end of __pcpu_size_call_return that value is returned. this_cpu_read_4 is a macro:

    #define this_cpu_read_4(pcp) percpu_from_op("mov", pcp)

which calls percpu_from_op and passes the mov instruction and the per-cpu variable there. percpu_from_op will expand to the inline assembly call:

    asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (cpu_number))

Let's try to understand how it works and what it does. The gs segment register contains the base of the per-cpu area. Here we just copy cpu_number, which is in memory, to pfo_ret__ with the movl instruction. Or in other words:

    this_cpu_read(cpu_number)

is the same as:

    movl %gs:$cpu_number, $pfo_ret__

As we have not set up the per-cpu area yet, we have only one - for the currently running CPU - and we will get zero as the result of smp_processor_id.
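The size-dispatch pattern of __pcpu_size_call_return is not tied to per-cpu variables, so it can be tried in ordinary C. The sketch below keeps the same shape (switch on sizeof, token pasting of the size onto a stem) but replaces the %gs-relative loads with plain memory reads; the sketch_read_N helpers are hypothetical, and the macro relies on GNU C statement expressions and __typeof__, so it needs GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-size readers standing in for this_cpu_read_1/2/4/8;
 * here they are ordinary loads, not %gs-relative ones. */
static uint8_t  sketch_read_1(const void *p) { uint8_t v;  memcpy(&v, p, 1); return v; }
static uint16_t sketch_read_2(const void *p) { uint16_t v; memcpy(&v, p, 2); return v; }
static uint32_t sketch_read_4(const void *p) { uint32_t v; memcpy(&v, p, 4); return v; }
static uint64_t sketch_read_8(const void *p) { uint64_t v; memcpy(&v, p, 8); return v; }

/* Pick the helper by sizeof() and paste the size onto the stem. */
#define sketch_size_call_return(stem, variable)                          \
({                                                                       \
    __typeof__(variable) ret__;                                          \
    switch (sizeof(variable)) {                                          \
    case 1: ret__ = (__typeof__(variable))stem##1(&(variable)); break;   \
    case 2: ret__ = (__typeof__(variable))stem##2(&(variable)); break;   \
    case 4: ret__ = (__typeof__(variable))stem##4(&(variable)); break;   \
    case 8: ret__ = (__typeof__(variable))stem##8(&(variable)); break;   \
    default: assert(!"unsupported size"); ret__ = 0; break;              \
    }                                                                    \
    ret__;                                                               \
})

#define sketch_read(variable) sketch_size_call_return(sketch_read_, variable)
```

Because sizeof is a compile-time constant, the compiler keeps only the matching case, so sketch_read(x) on an int costs no more than a direct call to sketch_read_4.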

As we got the current processor id, boot_cpu_init sets the given CPU online, active, present and possible with:

    set_cpu_online(cpu, true);
    set_cpu_active(cpu, true);
    set_cpu_present(cpu, true);
    set_cpu_possible(cpu, true);

All of these functions use the cpumask concept. cpu_possible is the set of CPU IDs which can be plugged in at any time during the life of that system boot. cpu_present represents which CPUs are currently plugged in. cpu_online represents a subset of cpu_present and indicates CPUs which are available for scheduling. These masks depend on the CONFIG_HOTPLUG_CPU configuration option; if this option is disabled, possible == present and active == online. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is true, it calls cpumask_set_cpu, otherwise cpumask_clear_cpu.

For example let's look at set_cpu_possible. As we passed true as the second parameter, the:

    cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));

will be called. First of all let's try to understand the to_cpumask macro. This macro casts a bitmap to a struct cpumask *. CPU masks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. A CPU mask is represented by the cpumask structure:

    typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

which is just a bitmap declared with the DECLARE_BITMAP macro:

    #define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

As we can see from its definition, the DECLARE_BITMAP macro expands to an array of unsigned long. Now let's look at how the to_cpumask macro is implemented:

    #define to_cpumask(bitmap)                                              \
        ((struct cpumask *)(1 ? (bitmap)                                    \
                            : (void *)sizeof(__check_is_bitmap(bitmap))))

I don't know about you, but it looked really weird to me the first time. We can see a ternary operator here whose condition is true every time, but why is __check_is_bitmap here? It's simple, let's look at it:

    static inline int __check_is_bitmap(const unsigned long *bitmap)
    {
        return 1;
    }

Yeah, it just returns 1 every time. Actually we need it here for only one purpose: at compile time it checks that the given bitmap is a bitmap, or in other words it checks that the given bitmap has the type unsigned long *. So we just pass cpu_possible_bits to the to_cpumask macro to convert the array of unsigned long to a struct cpumask *. Now we can call the cpumask_set_cpu function with cpu - 0 and struct cpumask *cpu_possible_bits. This function makes only one call, to the set_bit function, which sets the given cpu in the cpumask. All of these set_cpu_* functions work on the same principle.
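The whole chain (a bitmap declared as an array of unsigned long, the type-checked cast, setting one bit) fits in a few lines of standalone C. The sketch below mirrors the kernel macros under hypothetical sketch_ names and assumes a made-up limit of 8 CPUs; the bit is set with a plain OR rather than the kernel's atomic set_bit.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_NR_CPUS 8 /* hypothetical CPU count */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
    (((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

struct sketch_cpumask {
    unsigned long bits[SKETCH_BITS_TO_LONGS(SKETCH_NR_CPUS)];
};

/* Accepts only `const unsigned long *`; any other argument type makes
 * the compiler complain, which is the point of the kernel's helper. */
static inline int sketch_check_is_bitmap(const unsigned long *bitmap)
{
    (void)bitmap;
    return 1;
}

/* The condition is always 1, so the sizeof() arm is never evaluated;
 * it exists only so that the argument gets type checked. */
#define sketch_to_cpumask(bitmap)                                   \
    ((struct sketch_cpumask *)(1 ? (bitmap)                         \
        : (void *)sizeof(sketch_check_is_bitmap(bitmap))))

/* Non-atomic stand-in for cpumask_set_cpu. */
static void sketch_cpumask_set_cpu(unsigned int cpu, struct sketch_cpumask *mask)
{
    assert(cpu < SKETCH_NR_CPUS);
    mask->bits[cpu / SKETCH_BITS_PER_LONG] |= 1UL << (cpu % SKETCH_BITS_PER_LONG);
}

static int sketch_cpumask_test_cpu(unsigned int cpu, const struct sketch_cpumask *mask)
{
    assert(cpu < SKETCH_NR_CPUS);
    return (mask->bits[cpu / SKETCH_BITS_PER_LONG] >> (cpu % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

Passing something that is not an unsigned long array (for example an int array) to sketch_to_cpumask produces a compiler diagnostic about incompatible pointer types, even though the checking function is never called at run time.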

If these set_cpu_* operations and cpumask are not yet clear to you, don't worry about it. You can get more info by reading the special part about cpumask or the documentation.

Print linux banner

As we activated the bootstrap processor, it's time to go to the next function in start_kernel. Now it is page_address_init, but this function does nothing in our case, because it does work only when not all RAM can be mapped directly.

The next call is pr_notice:

    #define pr_notice(fmt, ...) \
        printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)

as you can see it just expands to the printk call. At this moment we use pr_notice to print the Linux banner:

    pr_notice("%s", linux_banner);

which is just the kernel version with some additional parameters:

    Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6)) #319 SMP

Architecture-dependent parts of initialization

The next step is architecture-specific initialization. The Linux kernel does it with the call of the setup_arch function. This is a very big function like start_kernel and we do not have time to consider all of its implementation in this part. Here we'll only start on it and continue in the next part. As it is architecture-specific, we need to go again to the arch/ directory. The setup_arch function is defined in the arch/x86/kernel/setup.c source code file and takes only one argument - the address of the kernel command line.

This function starts by reserving a memory block for the kernel _text and _data, which starts from the _text symbol (you can remember it from arch/x86/kernel/head_64.S) and ends before __bss_stop. We are using memblock for the reserving of the memory block:

    memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);

You can read about memblock in the Linux kernel memory management Part 1. As you can remember, the memblock_reserve function takes two parameters:

- base physical address of a memory block;
- size of a memory block.

We can get the base physical address of the _text symbol with the __pa_symbol macro:

    #define __pa_symbol(x) \
        __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))

First of all it calls the __phys_reloc_hide macro on the given parameter. The __phys_reloc_hide macro does nothing for x86_64 and just returns the given parameter. The implementation of the __phys_addr_symbol macro is easy. It just subtracts the base virtual address of the kernel text mapping (you can remember that it is __START_KERNEL_map) from the symbol address and adds phys_base, the physical offset at which the kernel image was loaded:

    #define __phys_addr_symbol(x) \
        ((unsigned long)(x) - __START_KERNEL_map + phys_base)

After we got the physical address of the _text symbol, memblock_reserve can reserve a memory block which starts at _text and has the size __bss_stop - _text.
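As a quick check of the arithmetic, the translation can be modelled with plain integers. The value 0xffffffff80000000 for __START_KERNEL_map is assumed here (it is not quoted in this part), phys_base is passed in as a parameter instead of being a global, and the sketch_ names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base of the kernel text mapping. */
#define SKETCH_START_KERNEL_MAP 0xffffffff80000000ULL

/* Model of __phys_addr_symbol(): only meaningful for addresses inside
 * the kernel text mapping. */
static uint64_t sketch_pa_symbol(uint64_t vaddr, uint64_t phys_base)
{
    assert(vaddr >= SKETCH_START_KERNEL_MAP);
    return vaddr - SKETCH_START_KERNEL_MAP + phys_base;
}

/* Size of the block reserved for the kernel image: __bss_stop - _text. */
static uint64_t sketch_image_size(uint64_t text, uint64_t bss_stop)
{
    assert(bss_stop >= text);
    return bss_stop - text;
}
```

For a symbol at 0xffffffff81000000 and a phys_base of zero the result is 0x1000000 (16 MiB); a non-zero phys_base simply shifts the result by the same amount.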

Reserve memory for initrd

In the next step, after we reserved a place for the kernel text and data, we reserve a place for the initrd. We will not see details about initrd in this post; you just need to know that it is a temporary root file system stored in memory and used by the kernel during its startup. The early_reserve_initrd function does all the work. First of all this function gets the base address of the ramdisk, its size and the end address with:

    u64 ramdisk_image = get_ramdisk_image();
    u64 ramdisk_size  = get_ramdisk_size();
    u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);

All of these parameters are taken from boot_params. If you have read the chapter about the Linux Kernel Booting Process, you must remember that we filled the boot_params structure during boot time. The kernel setup header contains a couple of fields which describe the ramdisk, for example:

    Field name:  ramdisk_image
    Type:        write (obligatory)
    Offset/size: 0x218/4
    Protocol:    2.00+

      The 32-bit linear address of the initial ramdisk or ramfs.  Leave at
      zero if there is no initial ramdisk/ramfs.

So we can get all the information that interests us from boot_params. For example let's look at get_ramdisk_image:

    static u64 __init get_ramdisk_image(void)
    {
        u64 ramdisk_image = boot_params.hdr.ramdisk_image;

        ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;

        return ramdisk_image;
    }

Here we get the low 32 bits of the ramdisk address from boot_params.hdr.ramdisk_image and OR in boot_params.ext_ramdisk_image shifted left by 32. We need to do it because, as you can read in Documentation/x86/zero-page.txt:

    0C0/004    ALL    ext_ramdisk_image    ramdisk_image high 32bits

So after shifting it by 32, we get a 64-bit address in ramdisk_image and return it. get_ramdisk_size works on the same principle as get_ramdisk_image, but it uses ext_ramdisk_size instead of ext_ramdisk_image. After we got the ramdisk's size, base address and end address, we check that the bootloader provided a ramdisk with:

    if (!boot_params.hdr.type_of_loader ||
        !ramdisk_image || !ramdisk_size)
        return;

and reserve a memory block with the calculated addresses for the initial ramdisk in the end:

    memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);

Conclusion

It is the end of the fourth part about the Linux kernel initialization process. We started to dive into the generic kernel code from the start_kernel function in this part and stopped at the architecture-specific initializations in setup_arch. In the next part we will continue with the architecture-dependent initialization steps.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

GCC function attributes
this_cpu operations
cpumask
lock validator
cgroups
stack buffer overflow
IRQs
initrd
Previous part

Kernel initialization. Part 5.

Continue of architecture-specific initializations

In the previous part, we stopped at the initialization of architecture-specific stuff in the setup_arch function, and we will continue with it now. As we have reserved memory for the initrd, the next step is olpc_ofw_detect, which detects One Laptop Per Child support. We will not consider platform-related stuff in this book and will skip the functions related to it. So let's go ahead. The next step is the early_trap_init function. This function initializes the debug (#DB - raised when the TF flag of rflags is set) and int3 (#BP) interrupt gates. If you don't know anything about interrupts, you can read about them in Early interrupt and exception handling. In the x86 architecture INT, INTO and INT3 are special instructions which allow a task to explicitly call an interrupt handler. The INT3 instruction calls the breakpoint (#BP) handler. You may remember that we already saw it in the part about interrupts and exceptions:

----------------------------------------------------------------------
| Vector | Mnemonic | Description | Type | Error Code | Source |
----------------------------------------------------------------------
| 3      | #BP      | Breakpoint  | Trap | NO         | INT 3  |
----------------------------------------------------------------------

The debug interrupt #DB is the primary means of invoking debuggers. early_trap_init is defined in arch/x86/kernel/traps.c. This function sets the #DB and #BP handlers and reloads the IDT:

void __init early_trap_init(void)
{
    set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
    set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
    load_idt(&idt_descr);
}

We already saw the implementation of set_intr_gate in the previous part about interrupts. Here are two similar functions, set_intr_gate_ist and set_system_intr_gate_ist. Both of these functions take three parameters:

number of the interrupt;
base address of the interrupt/exception handler;
third parameter is the Interrupt Stack Table.

IST is a new mechanism in x86_64 and part of the TSS. Every active thread in kernel mode has its own kernel stack which is 16 kilobytes. While a thread is in user space, its kernel stack is empty except for thread_info (read about it in the previous part) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. You can read all about these stacks in the linux kernel documentation - Kernel stacks. x86_64 provides a feature which allows switching to a new special stack during events such as a non-maskable interrupt. The name of this feature is the Interrupt Stack Table. There can be up to 7 IST entries per CPU and every entry points to a dedicated stack. In our case this is DEBUG_STACK.

set_intr_gate_ist and set_system_intr_gate_ist work by the same principle as set_intr_gate with only one difference. Both of these functions check the interrupt number and call _set_gate inside:

BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);

as set_intr_gate does. But set_intr_gate calls _set_gate with dpl 0 and ist 0, while set_intr_gate_ist and set_system_intr_gate_ist set ist to DEBUG_STACK, and set_system_intr_gate_ist sets dpl to 0x3, which is the lowest privilege. When an interrupt occurs and the hardware loads such a descriptor, the hardware automatically sets the new stack pointer based on the IST value and then invokes the interrupt handler. All of the special kernel stacks will be set up in the cpu_init function (we will see it later).

As the #DB and #BP gates have been written to the IDT described by idt_descr, we reload the IDT with load_idt, which just executes the lidt instruction. Now let's look at the interrupt handlers and try to understand how they work. Of course, I can't cover all interrupt handlers in this book and I do not see the point in doing so. It is very interesting to delve into the linux kernel source code, so we will see how the debug handler is implemented in this part, and understanding how the other interrupt handlers are implemented will be your task.

#DB handler

As you can read above, we passed the address of the #DB handler as &debug to set_intr_gate_ist. lxr.free-electrons.com is a great resource for searching identifiers in the linux kernel source code, but unfortunately you will not find the debug handler with it. All you can find is the debug declaration in arch/x86/include/asm/traps.h:

asmlinkage void debug(void);

We can see the asmlinkage attribute, which tells us that debug is a function written in assembly. Yeah, again and again assembly :). The implementation of the #DB handler, like the other handlers, is in arch/x86/kernel/entry_64.S and is defined with the idtentry assembly macro:

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

idtentry is a macro which defines an interrupt/exception entry point. As you can see it takes five arguments:

name of the interrupt entry point;
name of the interrupt handler;
whether the interrupt has an error code or not;
paranoid - if this parameter = 1, switch to the special stack (read above);
shift_ist - the stack to switch to during the interrupt.

Now let's look at the idtentry macro implementation. This macro is defined in the same assembly file and defines the debug function with the ENTRY macro. To start with, the idtentry macro checks that the given parameters are correct in case we need to switch to the special stack. In the next step it checks whether the given interrupt returns an error code. If the interrupt does not return an error code (in our case #DB does not), it calls INTR_FRAME, or XCPT_FRAME if the interrupt has an error code. Both of these macros, XCPT_FRAME and INTR_FRAME, generate no code and are needed only for building the initial frame state for interrupts. They use CFI directives and are used for debugging. You can find more info in the CFI directives. As the comment from arch/x86/kernel/entry_64.S says: CFI macros are used to generate dwarf2 unwind information for better backtraces. They don't change any code. So we will ignore them.

.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
    /* Sanity check */
    .if \shift_ist != -1 && \paranoid == 0
    .error "using shift_ist requires paranoid=1"
    .endif

    .if \has_error_code
    XCPT_FRAME
    .else
    INTR_FRAME
    .endif
    ...
    ...
    ...

You may remember from the previous part about early interrupt/exception handling that after an interrupt occurs, the current stack will have the following format:

    +-----------------------+
    |                       |
+40 |         SS            |
+32 |         RSP           |
+24 |        RFLAGS         |
+16 |         CS            |
+8  |         RIP           |
 0  |       Error Code      | <---- rsp
    |                       |
    +-----------------------+

The next two macros from the idtentry implementation are:

    ASM_CLAC
    PARAVIRT_ADJUST_EXCEPTION_FRAME

The first macro, ASM_CLAC, depends on the CONFIG_X86_SMAP configuration option and is needed for security reasons (Supervisor Mode Access Prevention). The second macro, PARAVIRT_ADJUST_EXCEPTION_FRAME, is for handling Xen-type exceptions (this chapter is about kernel initialization and we will not consider virtualization stuff here).

The next piece of code checks whether the interrupt has an error code, and pushes $-1 (which is 0xffffffffffffffff on x86_64) on the stack if not:

    .ifeq \has_error_code
    pushq_cfi $-1
    .endif

We need to do it as a dummy error code for stack consistency across all interrupts. In the next step we subtract $ORIG_RAX-R15 from the stack pointer:

    subq $ORIG_RAX-R15, %rsp

where ORIG_RAX, R15 and the other macros are defined in arch/x86/include/asm/calling.h, and ORIG_RAX-R15 is 120 bytes. The general purpose registers will occupy these 120 bytes (15 registers of 8 bytes each) because we need to store all registers on the stack during interrupt handling. After we have set up the stack for the general purpose registers, the next step is checking whether the interrupt came from userspace with:

    testl $3, CS(%rsp)
    jnz 1f

Here we check the first and second bits of CS. You may remember that the CS register contains a segment selector whose first two bits are the RPL. All privilege levels are integers in the range 0-3, where the lowest number corresponds to the highest privilege. So if the interrupt came from kernel mode we call save_paranoid, otherwise we jump to label 1. In save_paranoid we store all general purpose registers on the stack and switch the user gs to the kernel gs if needed:

    movl $1,%ebx
    movl $MSR_GS_BASE,%ecx
    rdmsr
    testl %edx,%edx
    js 1f
    SWAPGS
    xorl %ebx,%ebx
1:  ret

In the next steps we put the pt_regs pointer into rdi, save the error code in rsi if there is one, and call the interrupt handler, which is do_debug in our case, from arch/x86/kernel/traps.c. do_debug, like other handlers, takes two parameters:

pt_regs - a structure which presents the set of CPU registers which are saved in the process' memory region;
error code - the error code of the interrupt.

After the interrupt handler has finished its work, paranoid_exit is called, which restores the stack, switches to userspace if the interrupt came from there, and calls iret. That's all. Of course it is not all :), but we will look at it more deeply in the separate chapter about interrupts.

This is a general view of the idtentry macro for the #DB interrupt. All interrupts are similar to this implementation and are defined with idtentry too. After early_trap_init has finished its work, the next function is early_cpu_init. This function is defined in arch/x86/kernel/cpu/common.c and collects information about the CPU and its vendor.

Early ioremap initialization

The next step is the initialization of early ioremap. In general there are two ways to communicate with devices:

I/O ports;
Device memory.

We already saw the first method (outb/inb instructions) in the part about the linux kernel booting process. The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer not to RAM but to the memory of an I/O device mapped at that address. So ioremap is used to map device memory into the kernel address space.

As I wrote above, the next function is early_ioremap_init, which re-maps I/O memory to the kernel address space so the kernel can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like ioremap are available. The implementation of this function is in arch/x86/mm/ioremap.c. At the start of early_ioremap_init we can see the definition of the pmd pointer with the pmd_t type (which presents a page middle directory entry: typedef struct { pmdval_t pmd; } pmd_t; where pmdval_t is unsigned long) and a check that the fixmap is aligned in a correct way:

pmd_t *pmd;
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));

fixmap is a set of fixed virtual address mappings which extends from FIXADDR_START to FIXADDR_TOP. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check, early_ioremap_init makes a call to the early_ioremap_setup function from mm/early_ioremap.c. early_ioremap_setup fills the slot_virt array of unsigned long with the virtual addresses of the 512 temporary boot-time fix-mappings:

for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
    slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);

After this we get the page middle directory entry for FIX_BTMAP_BEGIN and put it in the pmd variable, fill bm_pte, which is the boot time page tables, with zeros and call the pmd_populate_kernel function to set the given page table entry in the given page middle directory:

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);

That's all for this. If you feel some misunderstanding, don't worry. There is a special part about ioremap and fixmaps in the Linux Kernel Memory Management. Part 2 chapter.

Obtaining major and minor numbers for the root device

After early ioremap has been initialized, you can see the following code:

ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);

This code obtains the major and minor numbers for the root device where the initrd will be mounted later in the do_mount_root function. The major number of a device identifies the driver associated with the device. The minor number refers to the device controlled by that driver. Note that old_decode_dev takes one parameter from the boot_params structure. As we can read in the x86 linux kernel boot protocol:

Field name: root_dev
Type: modify (optional)
Offset/size: 0x1fc/2
Protocol: ALL

  The default root device device number. The use of this field is
  deprecated, use the "root=" option on the command line instead.

Now let's try to understand what old_decode_dev is. Actually it just calls MKDEV inside, which generates a dev_t from the given major and minor numbers. Its implementation is pretty easy:

static inline dev_t old_decode_dev(u16 val)
{
    return MKDEV((val >> 8) & 255, val & 255);
}

where dev_t is a kernel data type which presents a major/minor number pair. But what's the strange old_ prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way the major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bits for the major number and 8 bits for the minor number. But there is a problem with this way: only 256 major numbers and 256 minor numbers are possible. So the 16-bit integer was replaced with a 32-bit integer where 12 bits are reserved for the major number and 20 bits for the minor. You can see this in the new_decode_dev implementation:

static inline dev_t new_decode_dev(u32 dev)
{
    unsigned major = (dev & 0xfff00) >> 8;
    unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
    return MKDEV(major, minor);
}

After the calculation, if dev is 0xffffffff, we will get 0xfff (12 bits) for major and 0xfffff (20 bits) for minor. So at the end of the execution of old_decode_dev we will get the major and minor numbers for the root device in ROOT_DEV.

Memory map setup

The next point is the setup of the memory map with the call of the setup_memory_map function. But before this we set up different parameters such as information about the screen (current row and column, video page, etc. (you can read about it in Video mode initialization and transition to protected mode)), Extended display identification data, video mode, bootloader_type, etc.:

    screen_info = boot_params.screen_info;
    edid_info = boot_params.edid_info;
    saved_video_mode = boot_params.hdr.vid_mode;
    bootloader_type = boot_params.hdr.type_of_loader;
    if ((bootloader_type >> 4) == 0xe) {
        bootloader_type &= 0xf;
        bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
    }
    bootloader_version  = bootloader_type & 0xf;
    bootloader_version |= boot_params.hdr.ext_loader_ver << 4;

We got all of these parameters during boot time and stored them in the boot_params structure. After this we need to set up the end of the I/O memory. As you know, one of the main purposes of the kernel is resource management, and one of the resources is memory. As we already know, there are two ways to communicate with devices: I/O ports and device memory. All information about registered resources is available through:

/proc/ioports - provides a list of currently registered port regions used for input or output communication with a device;
/proc/iomem - provides the current map of the system's memory for each physical device.

At the moment we are interested in /proc/iomem:

cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000f0000-000fffff : System ROM

As you can see, ranges of addresses are shown in hexadecimal notation together with their owner. The linux kernel provides an API for managing any resources in a general way. Global resources (for example PICs or I/O ports) can be divided into subsets relating to any hardware bus slot. The main structure is resource:

struct resource {
    resource_size_t start;
    resource_size_t end;
    const char *name;
    unsigned long flags;
    struct resource *parent, *sibling, *child;
};

It presents an abstraction for a tree-like subset of system resources. This structure provides the range of addresses from start to end (resource_size_t is phys_addr_t, or u64 for x86_64) which a resource covers, the name of the resource (you can see these names in the /proc/iomem output) and the flags of the resource (all resource flags are defined in include/linux/ioport.h). The last are three pointers to the resource structure. These pointers enable a tree-like structure:

+-------------+      +-------------+
|             |      |             |
|    parent   |------|    sibling  |
|             |      |             |
+-------------+      +-------------+
       |
       |
+-------------+
|             |
|    child    |
|             |
+-------------+

Every subset of resources has a root range resource. For iomem it is iomem_resource, which is defined as:

struct resource iomem_resource = {
    .name   = "PCI mem",
    .start  = 0,
    .end    = -1,
    .flags  = IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);

iomem_resource defines the root address range for I/O memory with the name PCI mem and IORESOURCE_MEM (0x00000200) as flags. As I wrote above, our current point is to set up the end address of iomem. We will do it with:

iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;

Here we shift 1 left by boot_cpu_data.x86_phys_bits. boot_cpu_data is the cpuinfo_x86 structure which we filled during the execution of early_cpu_init. As you can understand from the name of the x86_phys_bits field, it presents the number of bits in the maximum physical address in the system. Note also that iomem_resource is passed to the EXPORT_SYMBOL macro. This macro exports the given symbol (iomem_resource in our case) for dynamic linking, or in other words it makes the symbol accessible to dynamically loaded modules.

As we have set the end address of the root iomem resource address range, the next step, as I wrote above, is the setup of the memory map. It is produced by the call of the setup_memory_map function:

void __init setup_memory_map(void)
{
    char *who;

    who = x86_init.resources.memory_setup();
    memcpy(&e820_saved, &e820, sizeof(struct e820map));
    printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
    e820_print_map(who);
}

First of all we look here at the call of x86_init.resources.memory_setup. x86_init is an x86_init_ops structure which presents platform-specific setup functions such as resources initialization, pci initialization, etc. The initialization of x86_init is in arch/x86/kernel/x86_init.c. I will not give the full description here because it is very long, but only the one part which interests us for now:

struct x86_init_ops x86_init __initdata = {
    .resources = {
        .probe_roms             = probe_roms,
        .reserve_resources      = reserve_standard_io_resources,
        .memory_setup           = default_machine_specific_memory_setup,
    },
    ...
    ...
    ...
}

As we can see here, the memory_setup field is default_machine_specific_memory_setup, where we get the number of the e820 entries which we collected at boot time, sanitize the BIOS e820 map and fill the e820map structure with the memory regions. Once all regions are collected, they are printed with printk. You can find this output if you execute the dmesg command; you should see something like this:

[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
...
...
...

Copying of the BIOS Enhanced Disk Device information

The next two steps are parsing of the setup_data with the parse_setup_data function and copying the BIOS EDD to a safe place. setup_data is a field from the kernel boot header, and as we can read in the x86 boot protocol:

Field name: setup_data
Type: write (special)
Offset/size: 0x250/8
Protocol: 2.09+

  The 64-bit physical pointer to NULL terminated single linked list of
  struct setup_data. This is used to define a more extensible boot
  parameters passing mechanism.

It is used for storing setup information of different types such as the device tree blob, EFI setup data, etc. In the second step we copy the BIOS EDD information, which we collected in arch/x86/boot/edd.c, from the boot_params structure to the edd structure:

static inline void __init copy_edd(void)
{
     memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
            sizeof(edd.mbr_signature));
     memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
     edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
     edd.edd_info_nr = boot_params.eddbuf_entries;
}

Memory descriptor initialization

The next step is the initialization of the memory descriptor of the init process. As you may already know, every process has its own address space. This address space is presented with a special data structure which is called the memory descriptor. In the linux kernel source code the memory descriptor is presented with the mm_struct structure. mm_struct contains many different fields related to the process address space, such as the start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas, etc. This structure is defined in include/linux/mm_types.h. As every process has its own memory descriptor, the task_struct structure contains it in the mm and active_mm fields. And our first init process has it too. You may remember that we saw part of the initialization of the init task_struct with the INIT_TASK macro in the previous part:

#define INIT_TASK(tsk)  \
{
    ...
    ...
    ...
    .mm = NULL,         \
    .active_mm  = &init_mm, \
    ...
}

mm points to the process address space and active_mm points to the active address space if the process has no address space of its own, as is the case for kernel threads (you can read more about it in the documentation). Now we fill the memory descriptor of the initial process:

    init_mm.start_code = (unsigned long) _text;
    init_mm.end_code = (unsigned long) _etext;
    init_mm.end_data = (unsigned long) _edata;
    init_mm.brk = _brk_end;

with the kernel's text, data and brk. init_mm is the memory descriptor of the initial process and is defined as:

struct mm_struct init_mm = {
    .mm_rb          = RB_ROOT,
    .pgd            = swapper_pg_dir,
    .mm_users       = ATOMIC_INIT(2),
    .mm_count       = ATOMIC_INIT(1),
    .mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
    .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
    INIT_MM_CONTEXT(init_mm)
};

where mm_rb is a red-black tree of the virtual memory areas, pgd is a pointer to the page global directory, mm_users is the count of address space users, mm_count is the primary usage counter and mmap_sem is the memory area semaphore. After we have set up the memory descriptor of the initial process, the next step is the initialization of the Intel Memory Protection Extensions with mpx_mm_init. The next step after it is the initialization of the code/data/bss resources with:

    code_resource.start = __pa_symbol(_text);
    code_resource.end = __pa_symbol(_etext)-1;
    data_resource.start = __pa_symbol(_etext);
    data_resource.end = __pa_symbol(_edata)-1;
    bss_resource.start = __pa_symbol(__bss_start);
    bss_resource.end = __pa_symbol(__bss_stop)-1;

We already know a little about the resource structure (read above). Here we fill the code/data/bss resources with their physical addresses. You can see them in the /proc/iomem output:

00100000-be825fff : System RAM
  01000000-015bb392 : Kernel code
  015bb393-01930c3f : Kernel data
  01a11000-01ac3fff : Kernel bss

All of these structures are defined in arch/x86/kernel/setup.c and look like a typical resource initialization:

static struct resource code_resource = {
    .name   = "Kernel code",
    .start  = 0,
    .end    = 0,
    .flags  = IORESOURCE_BUSY | IORESOURCE_MEM
};

The last step which we will cover in this part is the NX configuration. The NX bit, or no-execute bit, is bit 63 in the page directory entry, and it controls the ability to execute code from all physical pages mapped by the table entry. This bit can only be used/set when the no-execute page-protection mechanism is enabled by setting EFER.NXE to 1. In the x86_configure_nx function we check that the CPU supports the NX bit and that it is not disabled. After the check we fill __supported_pte_mask depending on the result:

void x86_configure_nx(void)
{
    if (cpu_has_nx && !disable_nx)
        __supported_pte_mask |= _PAGE_NX;
    else
        __supported_pte_mask &= ~_PAGE_NX;
}

Conclusion

It is the end of the fifth part about the linux kernel initialization process. In this part we continued to dive into the setup_arch function, which initializes architecture-specific stuff. It was a long part, but we are not finished with it. As I already wrote, setup_arch is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like fix-mapped addresses, ioremap, etc. Don't worry if they are unclear to you. There is a special part about these concepts - Linux kernel memory management Part 2. In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameters, an early dump of the pci devices, Direct Media Interface scanning and many many more.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

mm vs active_mm
e820
Supervisor mode access prevention
Kernel stacks
TSS
IDT
Memory mapped I/O
CFI directives
PDF. dwarf4 specification
Call stack
Previous part

Kernel initialization. Part 6.

Architecture-specific initializations, again...

In the previous part we saw architecture-specific (x86_64 in our case) initialization stuff from arch/x86/kernel/setup.c and finished on the x86_configure_nx function, which sets the _PAGE_NX flag depending on the support of the NX bit. As I wrote before, the setup_arch and start_kernel functions are very big, so in this and in the next part we will continue to learn about the architecture-specific initialization process. The next function after x86_configure_nx is parse_early_param. This function is defined in init/main.c and, as you can understand from its name, it parses the kernel command line and sets up different services depending on the given parameters (you can find all kernel command line parameters in Documentation/kernel-parameters.txt). You may remember how we set up earlyprintk in the earliest part. At that early stage we looked for kernel parameters and their values with the cmdline_find_option function and the __cmdline_find_option, __cmdline_find_option_bool helpers from arch/x86/boot/cmdline.c. Now we are in the generic kernel part which does not depend on the architecture, and here we use another approach. If you are reading the linux kernel source code, you may have already noticed calls like this:

early_param("gbpages", parse_direct_gbpages_on);

The early_param macro takes two parameters:

command line parameter name;
function which will be called if the given parameter is passed.

and is defined as:

#define early_param(str, fn) \
    __setup_param(str, fn, fn, 1)

in include/linux/init.h. As you can see, the early_param macro just makes a call of the __setup_param macro:

#define __setup_param(str, unique_id, fn, early) \
    static const char __setup_str_##unique_id[] __initconst \
        __aligned(1) = str; \
    static struct obs_kernel_param __setup_##unique_id \
        __used __section(.init.setup) \
        __attribute__((aligned((sizeof(long))))) \
        = { __setup_str_##unique_id, fn, early }

This macro defines the __setup_str_* variable (where * depends on the given function name) and assigns the given command line parameter name to it. In the next line we can see the definition of the __setup_* variable, whose type is obs_kernel_param, and its initialization. The obs_kernel_param structure is defined as:

struct obs_kernel_param {
    const char *str;
    int (*setup_func)(char *);
    int early;
};

and contains three fields:

name of the kernel parameter;
function which sets up something depending on the parameter;
field which determines whether the parameter is early (1) or not (0).

Note that the __setup_param macro defines the __setup_* structure with the __section(.init.setup) attribute. It means that all __setup_* structures will be placed in the .init.setup section; moreover, as we can see in include/asm-generic/vmlinux.lds.h, they will be placed between __setup_start and __setup_end:

#define INIT_SETUP(initsetup_align) \
    . = ALIGN(initsetup_align); \
    VMLINUX_SYMBOL(__setup_start) = .; \
    *(.init.setup) \
    VMLINUX_SYMBOL(__setup_end) = .;

Now that we know how parameters are defined, let's go back to the parse_early_param implementation:

void __init parse_early_param(void)
{
    static int done __initdata;
    static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;

    if (done)
        return;

    /* All fall through to do_early_param. */
    strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
    parse_early_options(tmp_cmdline);
    done = 1;
}

The parse_early_param function defines two static variables. The first, done, records that parse_early_param has already been called, and the second is temporary storage for the kernel command line. After this we copy boot_command_line to the temporary command line which we just defined and call the parse_early_options function from the same source code file, main.c. parse_early_options calls the parse_args function from kernel/params.c, where parse_args parses the given command line and calls the do_early_param function. This function goes from __setup_start to __setup_end and calls the function from obs_kernel_param if a parameter is early. After this all services which depend on early command line parameters have been set up, and the next call after parse_early_param is x86_report_nx. As I wrote in the beginning of this part, we already set the NX bit with x86_configure_nx. The next function, x86_report_nx from arch/x86/mm/setup_nx.c, just prints information about NX. Note that we call x86_report_nx not right after x86_configure_nx, but after the call of parse_early_param. The answer is simple: we call it after parse_early_param because the kernel supports the noexec parameter:

noexec      [X86]
            On X86-32 available only on PAE configured kernels.
            noexec=on: enable non-executable mappings (default)
            noexec=off: disable non-executable mappings

We can see it at boot time:

After this we can see the call of the:

    memblock_x86_reserve_range_setup_data();

function. This function is defined in the same arch/x86/kernel/setup.c source code file; it remaps memory for the setup_data and reserves a memory block for it (you can read more about setup_data in the previous part, and about ioremap and memblock in Linux kernel memory management).

In the next step we can see the following conditional statement:

    if (acpi_mps_check()) {
#ifdef CONFIG_X86_LOCAL_APIC
        disable_apic = 1;
#endif
        setup_clear_cpu_cap(X86_FEATURE_APIC);
    }

The first acpi_mps_check function from arch/x86/kernel/acpi/boot.c depends on the CONFIG_X86_LOCAL_APIC and CONFIG_X86_MPPARSE configuration options:

int __init acpi_mps_check(void)
{
#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
    /* mptable code is not built-in */
    if (acpi_disabled || acpi_noirq) {
        printk(KERN_WARNING "MPS support code is not built-in.\n"
               "Using acpi=off or acpi=noirq or pci=noacpi "
               "may have problem\n");
        return 1;
    }
#endif
    return 0;
}

It checks the built-in MPS or MultiProcessor Specification table. If CONFIG_X86_LOCAL_APIC is set and CONFIG_X86_MPPARSE is not set, acpi_mps_check prints a warning message if one of the command line options acpi=off, acpi=noirq or pci=noacpi was passed to the kernel. If acpi_mps_check returns 1, we disable the local APIC and clear the X86_FEATURE_APIC bit in the capabilities of the current CPU with the setup_clear_cpu_cap macro (you can read more about CPU masks in CPU masks).

Early PCI dump

In the next step we make a dump of the PCI devices with the following code:

#ifdef CONFIG_PCI
    if (pci_early_dump_regs)
        early_dump_pci_devices();
#endif

The pci_early_dump_regs variable is defined in arch/x86/pci/common.c and its value depends on the kernel command line parameter pci=earlydump. We can find the definition of this parameter in drivers/pci/pci.c:

early_param("pci", pci_setup);

The pci_setup function gets the string after pci= and analyzes it. This function calls pcibios_setup, which is defined as __weak in drivers/pci/pci.c, and every architecture defines the same function which overrides the __weak analog. For example the x86_64 architecture-dependent version is in arch/x86/pci/common.c:

char *__init pcibios_setup(char *str) {
    ...
    ...
    ...
    } else if (!strcmp(str, "earlydump")) {
        pci_early_dump_regs = 1;
        return NULL;
    }
    ...
    ...
    ...
}

So, if the CONFIG_PCI option is set and we passed the pci=earlydump option on the kernel command line, the next function which will be called is early_dump_pci_devices from arch/x86/pci/early.c. This function checks the noearly pci parameter with:

    if (!early_pci_allowed())
        return;

and returns if it was passed. Each PCI domain can host up to 256 buses, each bus hosts up to 32 devices and each device can have up to 8 functions. So, we go through them in a loop:

    for (bus = 0; bus < 256; bus++) {
        for (slot = 0; slot < 32; slot++) {
            for (func = 0; func < 8; func++) {
                ...
                ...
                ...
            }
        }
    }

and read the pci config with the read_pci_config function.

That's all. We will not go deep into the pci details here, but will see more details in the special Drivers/PCI part.

Finish with memory parsing

After early_dump_pci_devices, there are a couple of functions related to available memory and the e820 map which we collected in the First steps in the kernel setup part:

        /* update the e820_saved too */
        e820_reserve_setup_data();
        finish_e820_parsing();
        ...
        ...
        ...
        e820_add_kernel_range();
        trim_bios_range();
        max_pfn = e820_end_of_ram_pfn();
        early_reserve_e820_mpc_new();

Let's look at them. As you can see, the first function is e820_reserve_setup_data. This function does almost the same as memblock_x86_reserve_range_setup_data which we saw above, but it also calls e820_update_range, which adds new regions to the e820 map with the given type, which is E820_RESERVED_KERN in our case. The next function is finish_e820_parsing, which sanitizes the e820 map with the sanitize_e820_map function. Besides these two functions we can see a couple of other functions related to e820 in the listing above. The e820_add_kernel_range function takes the physical address of the kernel start and end:

    u64 start = __pa_symbol(_text);
    u64 size = __pa_symbol(_end) - start;

checks that .text, .data and .bss are marked as E820_RAM in the e820 map and prints a warning message if not. The next function, trim_bios_range, marks the first 4096 bytes in the e820 map as E820_RESERVED and sanitizes the map again with a call to sanitize_e820_map. After this we get the last page frame number with a call to the e820_end_of_ram_pfn function. Every memory page has a unique number, the page frame number, and the e820_end_of_ram_pfn function returns the maximum one with a call to e820_end_pfn:

    unsigned long __init e820_end_of_ram_pfn(void)
    {
        return e820_end_pfn(MAX_ARCH_PFN);
    }

where e820_end_pfn takes the maximum page frame number for the given architecture (MAX_ARCH_PFN is 0x400000000 for x86_64). In e820_end_pfn we go through all the e820 slots and check that the e820 entry has the E820_RAM or E820_PRAM type, because we calculate page frame numbers only for these types. We get the first and the last page frame number of the current e820 entry and make some checks on them:

    for (i = 0; i < e820.nr_map; i++) {
        struct e820entry *ei = &e820.map[i];
        unsigned long start_pfn;
        unsigned long end_pfn;

        if (ei->type != E820_RAM && ei->type != E820_PRAM)
            continue;

        start_pfn = ei->addr >> PAGE_SHIFT;
        end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;

        if (start_pfn >= limit_pfn)
            continue;
        if (end_pfn > limit_pfn) {
            last_pfn = limit_pfn;
            break;
        }
        if (end_pfn > last_pfn)
            last_pfn = end_pfn;
    }

    if (last_pfn > max_arch_pfn)
        last_pfn = max_arch_pfn;

    printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n",
           last_pfn, max_arch_pfn);
    return last_pfn;

After this we check that the last_pfn which we got in the loop is not greater than the maximum page frame number for the given architecture (x86_64 in our case), print information about the last page frame number and return it. We can see the last_pfn in the dmesg output:

...

[    0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000

...
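The walk over the e820 map can be tried outside the kernel. The following is a minimal sketch that repeats the same start/end/limit checks on a toy map; the toy_entry structure, the TOY_* constants and the toy_end_pfn function are hypothetical stand-ins for the kernel's e820 types, not the kernel's own definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12

enum toy_type { TOY_RAM = 1, TOY_RESERVED = 2 };

/* Hypothetical stand-in for struct e820entry. */
struct toy_entry {
    uint64_t addr;
    uint64_t size;
    enum toy_type type;
};

/* Return the highest page frame number covered by RAM entries,
 * clamped to limit_pfn, following the checks of the loop above. */
static uint64_t toy_end_pfn(const struct toy_entry *map, size_t nr,
                            uint64_t limit_pfn)
{
    uint64_t last_pfn = 0;

    for (size_t i = 0; i < nr; i++) {
        uint64_t start_pfn, end_pfn;

        if (map[i].type != TOY_RAM)
            continue;               /* only RAM contributes frames */

        start_pfn = map[i].addr >> TOY_PAGE_SHIFT;
        end_pfn = (map[i].addr + map[i].size) >> TOY_PAGE_SHIFT;

        if (start_pfn >= limit_pfn)
            continue;               /* entirely above the limit */
        if (end_pfn > limit_pfn) {
            last_pfn = limit_pfn;   /* crosses the limit: clamp and stop */
            break;
        }
        if (end_pfn > last_pfn)
            last_pfn = end_pfn;
    }
    return last_pfn;
}
```

Calling it with a 4 gigabyte limit (1 << (32 - 12) page frames) instead of the architecture maximum gives the kind of result that the low-memory variant of the function is described as producing.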


After this, as we have calculated the biggest page frame number, we calculate max_low_pfn, which is the biggest page frame number in low memory, i.e. below the first 4 gigabytes. If more than 4 gigabytes of RAM are installed, max_low_pfn will be the result of the e820_end_of_low_ram_pfn function, which does the same as e820_end_of_ram_pfn but with a 4 gigabyte limit; otherwise max_low_pfn will be the same as max_pfn:

    if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
        max_low_pfn = e820_end_of_low_ram_pfn();
    else
        max_low_pfn = max_pfn;

    high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;

Next we calculate high_memory (it defines the upper bound of the direct map memory) with the __va macro, which returns the virtual address for the given physical address.

DMI scanning

The next step after the manipulations with different memory regions and e820 slots is collecting information about the computer. We get all this information with the Desktop Management Interface and the following functions:

    dmi_scan_machine();
    dmi_memdev_walk();

The first is dmi_scan_machine, defined in drivers/firmware/dmi_scan.c. This function goes through the System Management BIOS structures and extracts information. There are two specified ways to gain access to the SMBIOS table: getting the pointer to the SMBIOS table from the EFI configuration table, and scanning the physical memory between the 0xF0000 and 0xFFFFF addresses. Let's look at the second approach. The dmi_scan_machine function remaps 0x10000 bytes of memory starting at 0xF0000 with dmi_early_remap, which just expands to early_ioremap:

    void __init dmi_scan_machine(void)
    {
        char __iomem *p, *q;
        char buf[32];
        ...
        ...
        ...
        p = dmi_early_remap(0xF0000, 0x10000);
        if (p == NULL)
            goto error;

and iterates over all DMI header addresses, searching for the _SM_ string:

        memset(buf, 0, 16);
        for (q = p; q < p + 0x10000; q += 16) {
            memcpy_fromio(buf + 16, q, 16);
            if (!dmi_smbios3_present(buf) || !dmi_present(buf)) {
                dmi_available = 1;
                dmi_early_unmap(p, 0x10000);
                goto out;
            }
            memcpy(buf, buf + 16, 16);
        }

The _SM_ string must be between 0x000F0000 and 0x000FFFFF. Here we copy 16 bytes to buf with memcpy_fromio, which is the same as memcpy, and execute dmi_smbios3_present and dmi_present on the buffer. These functions check that the first 4 bytes are the _SM_ string, get the SMBIOS version and get the _DMI_ attributes such as the DMI structure table length, the table address, etc. After one of these functions finishes, you will see its result in the dmesg output:

[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014
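The scanning idea itself is small enough to sketch in user space. The example below only looks for the 4-byte _SM_ anchor on 16-byte boundaries of an ordinary buffer; it is an illustration under that assumption and does none of the checksum or version validation that the kernel functions perform. The function name toy_find_smbios_anchor is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the first 16-byte aligned "_SM_" anchor in buf,
 * or -1 if there is none. Anchors at unaligned offsets are ignored,
 * just as the kernel loop only visits every 16th byte. */
static long toy_find_smbios_anchor(const unsigned char *buf, size_t len)
{
    for (size_t off = 0; off + 4 <= len; off += 16) {
        if (memcmp(buf + off, "_SM_", 4) == 0)
            return (long)off;
    }
    return -1;
}
```

Stepping by 16 bytes keeps the scan of the 64 kilobyte window down to 4096 comparisons.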

At the end of dmi_scan_machine, we unmap the previously remapped memory:

    dmi_early_unmap(p, 0x10000);

The second function is dmi_memdev_walk. As you can understand from its name, it goes over memory devices. Let's look at it:

    void __init dmi_memdev_walk(void)
    {
        if (!dmi_available)
            return;

        if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) {
            dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr);
            if (dmi_memdev)
                dmi_walk_early(save_mem_devices);
        }
    }

It checks that DMI is available (we set this in the previous function, dmi_scan_machine) and collects information about memory devices with dmi_walk_early and dmi_alloc, which is defined as:

    #ifdef CONFIG_DMI
    RESERVE_BRK(dmi_alloc, 65536);
    #endif

RESERVE_BRK is defined in arch/x86/include/asm/setup.h and reserves space of the given size in the brk section.

    init_hypervisor_platform();
    x86_init.resources.probe_roms();
    insert_resource(&iomem_resource, &code_resource);
    insert_resource(&iomem_resource, &data_resource);
    insert_resource(&iomem_resource, &bss_resource);
    early_gart_iommu_check();

SMP config

The next step is parsing of the SMP configuration. We do it with a call to the find_smp_config function, which just calls another function inside:

    static inline void find_smp_config(void)
    {
        x86_init.mpparse.find_smp_config();
    }

x86_init.mpparse.find_smp_config is the default_find_smp_config function from arch/x86/kernel/mpparse.c. In the default_find_smp_config function we scan a couple of memory regions for the SMP config and return as soon as it is found in one of them:

    if (smp_scan_config(0x0, 0x400) ||
        smp_scan_config(639 * 0x400, 0x400) ||
        smp_scan_config(0xF0000, 0x10000))
        return;

First of all the smp_scan_config function defines a couple of variables:

    unsigned int *bp = phys_to_virt(base);
    struct mpf_intel *mpf;

The first is the virtual address of the memory region where we will scan for the SMP config, the second is a pointer to the mpf_intel structure. Let's try to understand what mpf_intel is. All the information is stored in the multiprocessor configuration data structure. mpf_intel represents this structure and looks like:

    struct mpf_intel {
        char signature[4];
        unsigned int physptr;
        unsigned char length;
        unsigned char specification;
        unsigned char checksum;
        unsigned char feature1;
        unsigned char feature2;
        unsigned char feature3;
        unsigned char feature4;
        unsigned char feature5;
    };

As we can read in the documentation, one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. The operating system must have access to this information about the multiprocessor configuration, and mpf_intel stores the physical address (look at the second field, physptr) of the multiprocessor configuration table. So smp_scan_config goes in a loop through the given memory range and tries to find the MP floating pointer structure there. It checks that the current position holds the SMP signature, that mpf->length is 1 (the structure is one 16-byte paragraph long), that the checksum is correct and that mpf->specification is 1 or 4 (the only values allowed by the specification):

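    while (length > 0) {
        mpf = (struct mpf_intel *)bp;
        if ((*bp == SMP_MAGIC_IDENT) &&
            (mpf->length == 1) &&
            !mpf_checksum((unsigned char *)bp, 16) &&
            ((mpf->specification == 1)
             || (mpf->specification == 4))) {

            mem = virt_to_phys(mpf);
            memblock_reserve(mem, sizeof(*mpf));
            if (mpf->physptr)
                smp_reserve_memory(mpf);

            return 1;
        }

        /* bp is an unsigned int pointer, so this advances 16 bytes */
        bp += 4;
        length -= 16;
    }

If the search is successful, it reserves the memory block of the found structure with memblock_reserve and also reserves the memory of the multiprocessor configuration table, whose physical address is stored in mpf->physptr. You can find all the documentation about this in the MultiProcessor Specification. You can read more details in the special part about SMP.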
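The validity checks on a candidate MP floating pointer structure can be sketched in user space on a plain 16-byte array. This is an illustration only: the function names are hypothetical, the byte offsets follow the mpf_intel layout shown above, and the signature is compared as the string _MP_ rather than as a 32-bit constant.

```c
#include <stddef.h>
#include <string.h>

/* Sum of all bytes modulo 256; a valid structure sums to zero. */
static unsigned char toy_mpf_checksum(const unsigned char *p, size_t len)
{
    unsigned char sum = 0;

    while (len--)
        sum += *p++;
    return sum;
}

/* Check one 16-byte candidate. Byte layout (as in struct mpf_intel):
 * 0-3 signature, 4-7 physptr, 8 length, 9 specification, 10 checksum,
 * 11-15 feature bytes. */
static int toy_is_mp_floating_pointer(const unsigned char *p)
{
    return memcmp(p, "_MP_", 4) == 0 &&
           p[8] == 1 &&                     /* length in 16-byte units */
           toy_mpf_checksum(p, 16) == 0 &&
           (p[9] == 1 || p[9] == 4);        /* specification revision */
}
```

A BIOS that builds such a structure picks the checksum byte so that the sum of all 16 bytes wraps around to zero, which is why the check is a comparison with 0 rather than with a stored value.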

Additional early memory initialization routines

In the next step of setup_arch we can see the call of the early_alloc_pgt_buf function, which allocates the page table buffer for the early stage. The page table buffer will be placed in the brk area. Let's look at its implementation:

    void __init early_alloc_pgt_buf(void)
    {
        unsigned long tables = INIT_PGT_BUF_SIZE;
        phys_addr_t base;

        base = __pa(extend_brk(tables, PAGE_SIZE));

        pgt_buf_start = base >> PAGE_SHIFT;
        pgt_buf_end = pgt_buf_start;
        pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
    }

First of all it gets the size of the page table buffer; it will be INIT_PGT_BUF_SIZE, which is (6 * PAGE_SIZE) in the current linux kernel 4.0. As we have the size of the page table buffer, we call the extend_brk function with two parameters: size and align. As you can understand from its name, this function extends the brk area. As we can see in the linux kernel linker script, brk is in memory right after the BSS:

    . = ALIGN(PAGE_SIZE);
    .brk : AT(ADDR(.brk) - LOAD_OFFSET) {
        __brk_base = .;
        . += 64 * 1024;        /* 64k alignment slop space */
        *(.brk_reservation)    /* areas brk users have reserved */
        __brk_limit = .;
    }

We can also find it with the readelf utility.

After we have got the physical address of the new brk with the __pa macro, we calculate the base address and the end of the page table buffer. In the next step, as we have the page table buffer, we reserve a memory block for the brk area with the reserve_brk function:

    static void __init reserve_brk(void)
    {
        if (_brk_end > _brk_start)
            memblock_reserve(__pa_symbol(_brk_start),
                             _brk_end - _brk_start);

        _brk_start = 0;
    }

Note that at the end of reserve_brk we set _brk_start to zero, because after this we will not allocate from it anymore. The next step after reserving the memory block for the brk is to unmap out-of-range memory areas in the kernel mapping with the cleanup_highmap function. Remember that the kernel mapping starts at __START_KERNEL_map and covers _end - _text bytes, and that level2_kernel_pgt maps the kernel _text, data and bss. At the start of cleanup_highmap we define these parameters:

    unsigned long vaddr = __START_KERNEL_map;
    unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
    pmd_t *pmd = level2_kernel_pgt;
    pmd_t *last_pmd = pmd + PTRS_PER_PMD;

Now, as we have defined the start and the end of the kernel mapping, we go in a loop through all the kernel page middle directory entries and clear the entries which are not between _text and end:


    for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
        if (pmd_none(*pmd))
            continue;
        if (vaddr < (unsigned long)_text || vaddr > end)
            set_pmd(pmd, __pmd(0));
    }

After this we set the limit for the memblock allocation with the memblock_set_current_limit function (you can read more about memblock in the Linux kernel memory management Part 2); it will be ISA_END_ADDRESS or 0x100000. Then we fill the memblock information according to e820 with a call to the memblock_x86_fill function. You can see the result of this function at kernel initialization time:

MEMBLOCK configuration:
 memory size = 0x1fff7ec00 reserved size = 0x1e30000
 memory.cnt  = 0x3
 memory[0x0]    [0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0
 memory[0x1]    [0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0
 memory[0x2]    [0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]    [0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0
 reserved[0x1]    [0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0
 reserved[0x2]    [0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0

The rest of the functions after memblock_x86_fill are: early_reserve_e820_mpc_new, which allocates additional slots in the e820 map for the MultiProcessor Specification table; reserve_real_mode, which reserves low memory from 0x0 to 1 megabyte for the trampoline to the real mode (for rebooting, etc.); trim_platform_memory_ranges, which trims certain memory regions starting at 0x20050000, 0x20110000, etc. (these regions must be excluded because Sandy Bridge has problems with them); trim_low_memory_range, which reserves the first 4 kilobyte page in memblock; init_mem_mapping, which reconstructs the direct memory mapping and sets up the direct mapping of the physical memory at PAGE_OFFSET; early_trap_pf_init, which sets up the #PF handler (we will look at it in the chapter about interrupts); and setup_real_mode, which sets up the trampoline to the real mode code.

That's all. You can note that this part does not cover all the functions which are in setup_arch (like early_gart_iommu_check, mtrr initialization, etc.). As I have already written many times, setup_arch is big, and the linux kernel is big. That's why I can't cover every line in the linux kernel. I don't think that we missed something important, but you can say something like: each line of code is important. Yes, it's true, but I skipped them anyway, because I think that it is not realistic to cover the full linux kernel. Anyway, we will often return to ideas that we have already seen, and if something is unfamiliar, we will cover it.

Conclusion

It is the end of the sixth part about the linux kernel initialization process. In this part we continued to dive into the setup_arch function again. It was a long part, but we are not finished with it. Yes, setup_arch is big; I hope that the next part will be the last one about this function.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

MultiProcessor Specification
NX bit

Kernel initialization. Part 7.

The End of the architecture-specific initializations, almost...

This is the seventh part of the Linux Kernel initialization process, which covers internals of the setup_arch function from arch/x86/kernel/setup.c. As you know from the previous parts, the setup_arch function does some architecture-specific (x86_64 in our case) initialization like reserving memory for kernel code/data/bss, early scanning of the Desktop Management Interface, early dump of the PCI devices and much more. If you have read the previous part, you may remember that we finished it at the setup_real_mode function. In the next step, as we have set the limit of memblock to all mapped pages, we can see the call of the setup_log_buf function from kernel/printk/printk.c.

The setup_log_buf function sets up the kernel cyclic buffer, whose length depends on the CONFIG_LOG_BUF_SHIFT configuration option. As we can read in the documentation of CONFIG_LOG_BUF_SHIFT, it can be between 12 and 21. Internally, the buffer is defined as an array of chars:

    #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
    static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
    static char *log_buf = __log_buf;

Now let's look at the implementation of the setup_log_buf function. It starts with a check that the current buffer is still the initial static one (it must be, because we have just set it up) and another check that this is the early setup. If the setup of the kernel log buffer is not early, we call the log_buf_add_cpu function, which increases the size of the buffer for every CPU:

    if (log_buf != __log_buf)
        return;

    if (!early && !new_log_buf_len)
        log_buf_add_cpu();

We will not look into the log_buf_add_cpu function, because as you can see in setup_arch, we call setup_log_buf as:

    setup_log_buf(1);

where 1 means that it is the early setup. In the next step we check the new_log_buf_len variable, which is the updated length of the kernel log buffer, and allocate new space for the buffer with the memblock_virt_alloc function, or just return.

As the kernel log buffer is ready, the next function is reserve_initrd. You may remember that we already called the early_reserve_initrd function in the fourth part of the Kernel initialization. Now, as we have reconstructed the direct memory mapping in the init_mem_mapping function, we need to move the initrd down into directly mapped memory. The reserve_initrd function starts with the definition of the base address and the end address of the initrd and a check that the initrd was provided by a bootloader, all the same as we saw in early_reserve_initrd. But instead of reserving a place in the memblock area with a call to the memblock_reserve function, we get the mapped size of the direct memory area and check that the size of the initrd is less than half of this area with:

    mapped_size = memblock_mem_size(max_pfn_mapped);
    if (ramdisk_size >= (mapped_size>>1))
        panic("initrd too large to handle, "
              "disabling initrd (%lld needed, %lld available)\n",
              ramdisk_size, mapped_size>>1);

You can see here that we call the memblock_mem_size function and pass max_pfn_mapped to it, where max_pfn_mapped contains the highest direct mapped page frame number. If you do not remember what a page frame number is, the explanation is simple: the low 12 bits of an address represent the offset within the physical page, or page frame. If we shift the address right by 12 bits, we discard the offset part and get the page frame number. In memblock_mem_size we go through all the memblock mem (not reserved) regions, calculate the size of the mapped pages and return it to the mapped_size variable (see the code above). As we have the amount of direct mapped memory, we check that the size of the initrd is not too big for the mapped pages. If it is, we just call panic, which halts the system and prints the well-known Kernel panic message. In the next step we print information about the initrd size. We can see the result of this in the dmesg output:

[    0.000000] RAMDISK: [mem 0x36d20000-0x37687fff]

and relocate the initrd to the direct mapping area with the relocate_initrd function. At the start of the relocate_initrd function we try to find a free area with the memblock_find_in_range function:

    relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE);

    if (!relocated_ramdisk)
        panic("Cannot find place for new RAMDISK of size %lld\n",
              ramdisk_size);

The memblock_find_in_range function tries to find a free area in the given range, in our case from 0 to the maximum mapped physical address, and the size must be equal to the aligned size of the initrd. If we did not find an area of the given size, we call panic again. If all is good, we start to relocate the RAM disk down into the directly mapped memory in the next step.

At the end of the reserve_initrd function, we free the memblock memory which was occupied by the ramdisk with a call to:

    memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
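The arithmetic behind this step (page frame numbers, page alignment of the initrd size and the half-of-mapped-memory check) can be sketched in a few lines. The helper names below are hypothetical and a 4096-byte page is assumed; the sketch only mirrors the comparisons described above and does not touch memblock.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

/* Page frame number: the address with the in-page offset discarded. */
static uint64_t toy_pfn(uint64_t addr)
{
    return addr >> TOY_PAGE_SHIFT;
}

/* Round a size up to a whole number of pages. */
static uint64_t toy_page_align(uint64_t size)
{
    return (size + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
}

/* Mirror of the check before the panic: the initrd must be smaller
 * than half of the directly mapped memory. */
static int toy_initrd_fits(uint64_t ramdisk_size, uint64_t mapped_size)
{
    return ramdisk_size < (mapped_size >> 1);
}
```

With the range from the RAMDISK line of the dmesg output, 0x36d20000-0x37687fff, the size is 0x968000 bytes, which is already a whole number of pages, and the first page frame number is 0x36d20.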

After we have relocated the initrd ramdisk image, the next function is vsmp_init from arch/x86/kernel/vsmp_64.c. This function initializes support of ScaleMP vSMP. As I already wrote in the previous parts, this chapter will not cover initialization parts that are not related to x86_64 (for example this one, ACPI, etc.). So we will skip the implementation of this for now and come back to it in the part which covers techniques of parallel computing.

The next function is io_delay_init from arch/x86/kernel/io_delay.c. This function allows overriding the default I/O delay port 0x80. We already saw the I/O delay in the Last preparation before transition into protected mode; now let's look at the io_delay_init implementation:

    void __init io_delay_init(void)
    {
        if (!io_delay_override)
            dmi_check_system(io_delay_0xed_port_dmi_table);
    }

This function checks the io_delay_override variable and, if it is not set, checks the system against io_delay_0xed_port_dmi_table, the DMI table of machines which need the alternate 0xed delay port. We can set the io_delay_override variable by passing the io_delay option on the kernel command line. As we can read in Documentation/kernel-parameters.txt, the io_delay option is:

    io_delay=   [X86] I/O delay method
        0x80
            Standard port 0x80 based delay
        0xed
            Alternate port 0xed based delay (needed on some systems)
        udelay
            Simple two microseconds delay
        none
            No delay

We can see that the io_delay command line parameter is set up with the early_param macro in arch/x86/kernel/io_delay.c:

    early_param("io_delay", io_delay_param);

You can read more about early_param in the previous part. So the io_delay_param function, which sets up the io_delay_override variable, will be called in the do_early_param function. The io_delay_param function gets the argument of the io_delay kernel command line parameter and sets io_delay_type depending on it:

    static int __init io_delay_param(char *s)
    {
        if (!s)
            return -EINVAL;

        if (!strcmp(s, "0x80"))
            io_delay_type = CONFIG_IO_DELAY_TYPE_0X80;
        else if (!strcmp(s, "0xed"))
            io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
        else if (!strcmp(s, "udelay"))
            io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY;
        else if (!strcmp(s, "none"))
            io_delay_type = CONFIG_IO_DELAY_TYPE_NONE;
        else
            return -EINVAL;

        io_delay_override = 1;
        return 0;
    }

The next functions after io_delay_init are acpi_boot_table_init, early_acpi_boot_init and initmem_init, but as I wrote above we will not cover ACPI related stuff in this Linux Kernel initialization process chapter.

Allocate area for DMA

In the next step we need to allocate an area for Direct memory access with the dma_contiguous_reserve function, which is defined in drivers/base/dma-contiguous.c. DMA is a special mode in which devices communicate with memory without the CPU. Note that we pass one parameter, max_pfn_mapped << PAGE_SHIFT, to the dma_contiguous_reserve function and, as you can understand from this expression, it is the limit of the reserved memory. Let's look at the implementation of this function. It starts with the definition of the following variables:

    phys_addr_t selected_size = 0;
    phys_addr_t selected_base = 0;
    phys_addr_t selected_limit = limit;
    bool fixed = false;

where the first represents the size in bytes of the reserved area, the second is the base address of the reserved area, the third is the end address of the reserved area and the last, fixed, shows where to place the reserved area. If fixed is 1 we just reserve the area with memblock_reserve; if it is 0 we allocate space with kmemleak_alloc. In the next step we check the size_cmdline variable and, if it is not equal to -1, we fill all the variables which you can see above with the values from the cma kernel command line parameter:

    if (size_cmdline != -1) {
        ...
        ...
        ...
    }

You can find the definition of this early parameter in the same source code file:

    early_param("cma", early_cma);

where cma is:

    cma=nn[MG]@[start[MG][-end[MG]]]
            [ARM,X86,KNL]
            Sets the size of kernel global memory area for
            contiguous memory allocations and optionally the
            placement constraint by the physical address range of
            memory allocations. A value of 0 disables CMA
            altogether. For more information, see
            include/linux/dma-contiguous.h
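The nn[MG] notation in this description means a number with an optional size suffix. As a rough user-space illustration of how such a value could be turned into bytes, here is a small sketch; toy_parse_size is a hypothetical helper written for this example, not the routine the kernel uses, and it understands only the M and G suffixes named in the documentation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number with an optional M or G suffix
 * and return the value in bytes. If end is not NULL, it receives a
 * pointer to the first character that was not consumed (for example
 * the '@' that introduces the placement part). */
static uint64_t toy_parse_size(const char *s, const char **end)
{
    char *p;
    uint64_t v = strtoull(s, &p, 0);

    switch (*p) {
    case 'G': case 'g':
        v <<= 10;
        /* fall through */
    case 'M': case 'm':
        v <<= 20;
        p++;
        break;
    default:
        break;
    }
    if (end)
        *end = p;
    return v;
}
```

For a string like 64M@0-4G the first call returns the size, and the caller can continue parsing the start and end addresses from the returned position in the same way.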

If we do not pass the cma option on the kernel command line, size_cmdline will be equal to -1. In this case we need to calculate the size of the reserved area, which depends on the following kernel configuration options:

CONFIG_CMA_SIZE_SEL_MBYTES - size in megabytes, the default global CMA area, which is equal to CMA_SIZE_MBYTES * SZ_1M or CONFIG_CMA_SIZE_MBYTES * 1M;
CONFIG_CMA_SIZE_SEL_PERCENTAGE - percentage of total memory;
CONFIG_CMA_SIZE_SEL_MIN - use the lower value;
CONFIG_CMA_SIZE_SEL_MAX - use the higher value.
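The four options above choose between a fixed size, a percentage of memory, or the smaller or larger of the two. A minimal sketch of that selection follows; the enum, the function name and the way the percentage is rounded are assumptions made for the illustration, and in the kernel the choice is made at build time by the configuration option rather than by a runtime argument.

```c
#include <stdint.h>

/* Hypothetical runtime stand-ins for the four configuration options. */
enum toy_cma_sel { SEL_MBYTES, SEL_PERCENTAGE, SEL_MIN, SEL_MAX };

static uint64_t toy_cma_default_size(enum toy_cma_sel sel,
                                     uint64_t size_mbytes,
                                     unsigned int percent,
                                     uint64_t total_bytes)
{
    uint64_t by_size = size_mbytes << 20;               /* megabytes to bytes */
    uint64_t by_percent = total_bytes / 100 * percent;  /* share of memory */

    switch (sel) {
    case SEL_MBYTES:
        return by_size;
    case SEL_PERCENTAGE:
        return by_percent;
    case SEL_MIN:
        return by_size < by_percent ? by_size : by_percent;
    case SEL_MAX:
        return by_size > by_percent ? by_size : by_percent;
    }
    return 0;
}
```

For example, with 1 gigabyte of memory, a 16 megabyte setting and 10 percent, the MIN mode picks the 16 megabytes and the MAX mode picks the percentage.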

As we have calculated the size of the reserved area, we reserve the area with a call to the dma_contiguous_reserve_area function, which first of all calls the following function:

    ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);

cma_declare_contiguous reserves a contiguous area from the given base address and with the given size. After we have reserved the area for DMA, the next function is memblock_find_dma_reserve. As you can understand from its name, this function counts the reserved pages in the DMA area. This part will not cover all the details of CMA and DMA, because they are big. We will see many more details in the special part of the Linux Kernel Memory management chapter which covers contiguous memory allocators and areas.

Initialization of the sparse memory

The next step is the call of the x86_init.paging.pagetable_init function. If you try to find this function in the linux kernel source code, at the end of your search you will see the following macro:

    #define native_pagetable_init        paging_init

which expands, as you can see, to the call of the paging_init function from arch/x86/mm/init_64.c. The paging_init function initializes sparse memory and zone sizes. First of all, what are zones and what is Sparsemem? Sparsemem is a special foundation in the linux kernel memory manager which is used to split the memory area into different memory banks in NUMA systems. Let's look at the implementation of the paging_init function:

    void __init paging_init(void)
    {
        sparse_memory_present_with_active_regions(MAX_NUMNODES);
        sparse_init();

        node_clear_state(0, N_MEMORY);
        if (N_MEMORY != N_NORMAL_MEMORY)
            node_clear_state(0, N_NORMAL_MEMORY);

        zone_sizes_init();
    }

As you can see, there is a call of the sparse_memory_present_with_active_regions function, which records a memory area for every NUMA node in the array of mem_section structures, each of which contains a pointer to an array of struct page. The next function, sparse_init, allocates the non-linear mem_section and mem_map. In the next step we clear the state of the movable memory nodes and initialize the sizes of zones. Every NUMA node is divided into a number of pieces which are called zones. So the zone_sizes_init function from arch/x86/mm/init.c initializes the sizes of zones.

Again, this part and the next parts do not cover this theme in full detail. There will be a special part about NUMA.

The next step after the SparseMem initialization is setting trampoline_cr4_features, which must contain the content of the cr4 control register. First of all we need to check that the current CPU supports the cr4 register and, if it does, we save its content to trampoline_cr4_features, which is the storage for cr4 in the real mode:

    if (boot_cpu_data.cpuid_level >= 0) {
        mmu_cr4_features = __read_cr4();
        if (trampoline_cr4_features)
            *trampoline_cr4_features = mmu_cr4_features;
    }

vsyscall mapping

The next function which you can see is map_vsyscall from arch/x86/kernel/vsyscall_64.c. This function maps the memory space for vsyscalls and depends on the CONFIG_X86_VSYSCALL_EMULATION kernel configuration option. Actually vsyscall is a special segment which provides fast access to certain system calls like getcpu, etc. Let's look at the implementation of this function:

    void __init map_vsyscall(void)
    {
        extern char __vsyscall_page;
        unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);

        if (vsyscall_mode != NONE)
            __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
                         vsyscall_mode == NATIVE
                         ? PAGE_KERNEL_VSYSCALL
                         : PAGE_KERNEL_VVAR);

        BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
                     (unsigned long)VSYSCALL_ADDR);
    }

At the beginning of map_vsyscall we can see the definition of two variables. The first is the extern variable __vsyscall_page. As it is extern, it is defined somewhere in another source code file. Actually we can see the definition of __vsyscall_page in arch/x86/kernel/vsyscall_emu_64.S. The __vsyscall_page symbol points to the aligned entries of the vsyscalls such as gettimeofday, etc.:

        .globl __vsyscall_page
        .balign PAGE_SIZE, 0xcc
        .type __vsyscall_page, @object
    __vsyscall_page:

        mov $__NR_gettimeofday, %rax
        syscall
        ret

        .balign 1024, 0xcc
        mov $__NR_time, %rax
        syscall
        ret
        ...
        ...
        ...

The second variable is physaddr_vsyscall, which just stores the physical address of the __vsyscall_page symbol. In the next step we check the vsyscall_mode variable, which is EMULATE by default:

    static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;

If it is not equal to NONE, we can see the call of the __set_fixmap function, which calls native_set_fixmap with the same parameters:

    void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
    {
        __native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
    }

    void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
    {
        unsigned long address = __fix_to_virt(idx);

        if (idx >= __end_of_fixed_addresses) {
            BUG();
            return;
        }
        set_pte_vaddr(address, pte);
        fixmaps_set++;
    }

Here we can see that native_set_fixmap makes the value of the Page Table Entry from the given physical address (the physical address of the __vsyscall_page symbol in our case) and calls the internal function __native_set_fixmap. The internal function gets the virtual address of the given fixed_addresses index (VSYSCALL_PAGE in our case) and checks that the given index is below the end of the fix-mapped addresses. After this we set the page table entry with a call to the set_pte_vaddr function and increase the count of the fix-mapped addresses. At the end of map_vsyscall we check with the BUILD_BUG_ON macro that the virtual address of VSYSCALL_PAGE (which is the first index in fixed_addresses) is equal to VSYSCALL_ADDR, which is -10UL << 20 or ffffffffff600000:

    BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
                 (unsigned long)VSYSCALL_ADDR);

Now the vsyscall area is in the fix-mapped area. That's all about map_vsyscall. If you do not know anything about fix-mapped addresses, you can read Fix-Mapped Addresses and ioremap. We will see more about vsyscalls in the vsyscalls and vdso part.
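The value of VSYSCALL_ADDR is easy to verify by hand or with a few lines of C. The sketch below redoes the -10UL << 20 computation with explicit 64-bit unsigned arithmetic and, based on the .balign 1024 directive in the assembly above, computes where each vsyscall entry starts inside the page. The function names are hypothetical.

```c
#include <stdint.h>

/* -10UL << 20 written with explicit unsigned 64-bit arithmetic. */
static uint64_t toy_vsyscall_addr(void)
{
    return (UINT64_C(0) - 10) << 20;
}

/* Each vsyscall entry is aligned to 1024 bytes inside the page:
 * entry 0 is gettimeofday, entry 1 is time in the listing above. */
static uint64_t toy_vsyscall_entry(unsigned int nr)
{
    return toy_vsyscall_addr() + (uint64_t)nr * 1024;
}
```

So, under these assumptions, the gettimeofday entry sits at ffffffffff600000 and the time entry 0x400 bytes after it.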

Getting the SMP configuration

You may remember how we searched for the SMP configuration in the previous part. Now we need to get the SMP configuration if we found it. For this we check the smp_found_config variable, which we set in the smp_scan_config function (read about it in the previous part), and call the get_smp_config function:

    if (smp_found_config)
        get_smp_config();

get_smp_config expands to the x86_init.mpparse.get_smp_config call, which by default is the default_get_smp_config function defined in arch/x86/kernel/mpparse.c. This function defines a pointer to the multiprocessor floating pointer structure, mpf_intel (you can read about it in the previous part), and does some checks:

    struct mpf_intel *mpf = mpf_found;

    if (!mpf)
        return;

    if (acpi_lapic && early)
        return;

Here we check that the multiprocessor configuration was found in the smp_scan_config function, and just return from the function if not. The next check returns if this is the early stage and the local APIC was already found through ACPI. After these checks we start to read the SMP configuration. As we finish reading it, the next step is the prefill_possible_map function, which makes a preliminary filling of the possible CPUs cpumask (you can read more about it in the Introduction to the cpumasks).

The rest of the setup_arch

Here we are getting to the end of the setup_arch function. The rest of the functions of course do important stuff, but the details about them will not be included in this part. We will just take a short look at these functions because, although they are important as I wrote above, they cover non-generic kernel features related to NUMA, SMP, ACPI, APICs, etc. First of all, the next is the call of the init_apic_mappings function. As we can understand, this function sets the address of the local APIC. The next is x86_io_apic_ops.init, which initializes the I/O APIC. Please note that we will see all the details related to the APIC in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like DMA, TIMER, FPU, etc., with a call to the x86_init.resources.reserve_resources function. Following it is the mcheck_init function, which initializes the Machine Check Exception, and the last is register_refined_jiffies, which registers jiffies (there will be a separate chapter about timers in the kernel).

So that's all. Finally we have finished with the big setup_arch function in this part. Of course, as I have already written many times, we did not see full details about this function, but do not worry about it. We will come back to this function more than once from different chapters to understand how different platform-dependent parts are initialized.

That's all, and now we can go back to start_kernel from setup_arch.

Back to the main.c

As I wrote above, we have finished with the setup_arch function and now we can go back to the start_kernel function from init/main.c. As you may remember, or even saw yourself, the start_kernel function is as big as setup_arch. So the next couple of parts will be dedicated to learning this function. So, let's continue with it. After setup_arch we can see the call of the mm_init_cpumask function. This function sets the cpumask pointer to the memory descriptor cpumask. We can look at its implementation:

    static inline void mm_init_cpumask(struct mm_struct *mm)
    {
    #ifdef CONFIG_CPUMASK_OFFSTACK
        mm->cpu_vm_mask_var = &mm->cpumask_allocation;
    #endif
        cpumask_clear(mm->cpu_vm_mask_var);
    }

As you can see in init/main.c, we pass the memory descriptor of the init process to mm_init_cpumask. Here, if the CONFIG_CPUMASK_OFFSTACK configuration option is set, we first point cpu_vm_mask_var at the embedded cpumask_allocation, and then in any case we clear the TLB switch cpumask.

In the next step we can see the call of the following function:

    setup_command_line(command_line);

This function takes a pointer to the kernel command line and allocates a couple of buffers to store the command line. We need a couple of buffers because one buffer is used for future reference and access to the command line and one for parameter parsing. We will allocate space for the following buffers:

saved_command_line - will contain the boot command line;
initcall_command_line - will contain the boot command line, will be used in do_initcall_level;
static_command_line - will contain the command line for parameters parsing.

We will allocate space with the memblock_virt_alloc function. This function calls memblock_virt_alloc_try_nid, which allocates a boot memory block with memblock_reserve if slab is not available, or uses kzalloc_node (more about it will be in the linux memory management chapter). memblock_virt_alloc uses BOOTMEM_LOW_LIMIT (the physical address of the (PAGE_OFFSET + 0x1000000) value) and BOOTMEM_ALLOC_ACCESSIBLE (equal to the current value of memblock.current_limit) as the minimum address of the memory region and the maximum address of the memory region.

Let'slookontheimplementationofthesetup_command_line:

staticvoid__initsetup_command_line(char*command_line)

{

saved_command_line=

memblock_virt_alloc(strlen(boot_command_line)+1,0);

initcall_command_line=

memblock_virt_alloc(strlen(boot_command_line)+1,0);

static_command_line=memblock_virt_alloc(strlen(command_line)+1,0);

strcpy(saved_command_line,boot_command_line);

strcpy(static_command_line,command_line);

}

Herewecanseethatweallocatespaceforthethreebufferswhichwillcontainkernelcommandlineforthedifferentpurposes(readabove).Andasweallocatedspace,westoringboot_comand_lineinthesaved_command_lineandcommand_line(kernelcommandlinefromthesetup_archtothestatic_command_line).
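To illustrate why separate copies are useful, here is a minimal userspace sketch. It assumes a parser that modifies its input in place; dup_line and count_params_destructive are hypothetical helpers, not kernel functions, and malloc stands in for memblock_virt_alloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: heap copy of a string (stands in for
 * memblock_virt_alloc() followed by strcpy()). */
static char *dup_line(const char *src)
{
    char *copy = malloc(strlen(src) + 1);

    assert(copy != NULL);
    strcpy(copy, src);
    return copy;
}

/* Hypothetical parser: counts space-separated parameters. strtok()
 * overwrites each separator with '\0', so the input buffer is
 * destroyed while it is parsed. */
static int count_params_destructive(char *line)
{
    int n = 0;

    for (char *tok = strtok(line, " "); tok != NULL; tok = strtok(NULL, " "))
        n++;
    return n;
}
```

After parsing, the scratch copy holds only its first token, while the saved copy still holds the whole line for later reference.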

The next function after setup_command_line is setup_nr_cpu_ids. This function sets nr_cpu_ids (the number of CPUs) according to the last bit in cpu_possible_mask (you can read more about it in the chapter that describes the cpumasks concept). Let's look at its implementation:

void __init setup_nr_cpu_ids(void)
{
    nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1;
}


Here nr_cpu_ids represents the number of CPUs, and NR_CPUS represents the maximum number of CPUs, which we can set at configuration time.

Actually we need to call this function because NR_CPUS can be greater than the actual number of CPUs in your computer. Here we can see that we call the find_last_bit function and pass two parameters to it:

cpu_possible_mask bits;
maximum number of CPUs.

In setup_arch we can find the call of the prefill_possible_map function, which calculates the actual number of CPUs and writes it to cpu_possible_mask. We call the find_last_bit function, which takes the address and the maximum size to search and returns the bit number of the last set bit. We passed the cpu_possible_mask bits and the maximum number of CPUs. First of all, the find_last_bit function splits the given unsigned long address into words:

words = size / BITS_PER_LONG;

where BITS_PER_LONG is 64 on x86_64. Once we have the number of words in the given size of the search data, we need to check whether the given size contains a partial word with the following check:

if (size & (BITS_PER_LONG - 1)) {
    tmp = (addr[words] & (~0UL >> (BITS_PER_LONG
                     - (size & (BITS_PER_LONG - 1)))));
    if (tmp)
        goto found;
}

If it contains a partial word, we mask the last word and check it. If the masked last word is not zero, it means that it contains at least one set bit, and we go to the found label:

found:
    return words * BITS_PER_LONG + __fls(tmp);

Here you can see the __fls function, which returns the last set bit in a given word with the help of the bsr instruction:

static inline unsigned long __fls(unsigned long word)
{
    asm("bsr %1,%0"
        : "=r" (word)
        : "rm" (word));
    return word;
}

The bsr instruction scans the given operand for the most significant set bit. If the last word is not partial, or it contains no set bits, we go through all the words in the given address, starting from the end, and try to find the last set bit:

while (words) {
    tmp = addr[--words];
    if (tmp) {
found:
        return words * BITS_PER_LONG + __fls(tmp);
    }
}

Here we put the last word into the tmp variable and check that tmp contains at least one set bit. If a set bit is found, we return the number of this bit. If none of the words contains a set bit, we just return the given size:

return size;

After this nr_cpu_ids will contain the correct number of available CPUs.
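The steps above can be put together in a small portable sketch. This is an illustration, not the kernel source: fls_portable is a hypothetical stand-in for __fls that uses a shift loop instead of the bsr instruction, and BITS_PER_LONG is derived from the host's unsigned long.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Portable stand-in for __fls(): index of the most significant set
 * bit. The word must be nonzero. */
static unsigned long fls_portable(unsigned long word)
{
    unsigned long pos = 0;

    assert(word != 0);
    while (word >>= 1)
        pos++;
    return pos;
}

/* Returns the index of the last set bit among the first `size` bits,
 * or `size` if none of them is set. */
static unsigned long find_last_bit_sketch(const unsigned long *addr,
                                          unsigned long size)
{
    unsigned long words = size / BITS_PER_LONG;
    unsigned long rem = size % BITS_PER_LONG;
    unsigned long tmp;

    if (rem) {
        /* Partial last word: keep only the low `rem` bits. */
        tmp = addr[words] & (~0UL >> (BITS_PER_LONG - rem));
        if (tmp)
            return words * BITS_PER_LONG + fls_portable(tmp);
    }
    while (words) {
        /* Full words, scanned from the end towards the start. */
        tmp = addr[--words];
        if (tmp)
            return words * BITS_PER_LONG + fls_portable(tmp);
    }
    return size;
}
```

With a mask whose low four bits are set (four possible CPUs), the function returns 3, and adding 1 as setup_nr_cpu_ids does gives 4.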

That's all.

Conclusion

It is the end of the seventh part about the Linux kernel initialization process. In this part, we finally finished with the setup_arch function and returned to the start_kernel function. In the next part we will continue to learn the generic kernel code from start_kernel and will continue our way to the first init process.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

Desktop Management Interface
x86_64
initrd
Kernel panic
Documentation/kernel-parameters.txt
ACPI

Kernel initialization. Part 8.

Scheduler initialization

This is the eighth part of the Linux kernel initialization process chapter, and we stopped at the setup_nr_cpu_ids function in the previous part. The main point of the current part is scheduler initialization. But before we start to learn the initialization process of the scheduler, we need to do some stuff. The next step in init/main.c is the setup_per_cpu_areas function. This function sets up areas for the percpu variables; you can read more about it in the special part about Per-CPU variables. After the percpu areas are up and running, the next step is the smp_prepare_boot_cpu function. This function does some preparations for SMP:

static inline void smp_prepare_boot_cpu(void)
{
    smp_ops.smp_prepare_boot_cpu();
}

where smp_prepare_boot_cpu expands to the call of the native_smp_prepare_boot_cpu function (more about smp_ops will be in the special parts about SMP):

void __init native_smp_prepare_boot_cpu(void)
{
    int me = smp_processor_id();
    switch_to_new_gdt(me);
    cpumask_set_cpu(me, cpu_callout_mask);
    per_cpu(cpu_state, me) = CPU_ONLINE;
}

The native_smp_prepare_boot_cpu function gets the number of the current CPU (which is the bootstrap processor, and its id is zero) with the smp_processor_id function. I will not explain how smp_processor_id works, because we already saw it in the Kernel entry point part. Once we have the processor id, we reload the Global Descriptor Table for the given CPU with the switch_to_new_gdt function:

void switch_to_new_gdt(int cpu)
{
    struct desc_ptr gdt_descr;

    gdt_descr.address = (long)get_cpu_gdt_table(cpu);
    gdt_descr.size = GDT_SIZE - 1;
    load_gdt(&gdt_descr);
    load_percpu_segment(cpu);
}

The gdt_descr variable represents a pointer to the GDT descriptor here (we already saw desc_ptr in the Early interrupt and exception handling part). We get the address and the size of the GDT descriptor, where GDT_SIZE is the number of GDT entries multiplied by the 8-byte size of one descriptor:

#define GDT_SIZE (GDT_ENTRIES * 8)

and the address of the descriptor we get with get_cpu_gdt_table:

static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
    return per_cpu(gdt_page, cpu).gdt;
}

get_cpu_gdt_table uses the per_cpu macro to get the gdt_page percpu variable for the given CPU number (the bootstrap processor with id 0 in our case). You may ask the following question: if we can access the gdt_page percpu variable, where was it defined? Actually we already saw it in this book. If you have read the first part of this chapter, you may remember that we saw the definition of gdt_page in arch/x86/kernel/head_64.S:

early_gdt_descr:
    .word GDT_ENTRIES*8-1
early_gdt_descr_base:
    .quad INIT_PER_CPU_VAR(gdt_page)

and if we look at the linker file we can see that it is located after the __per_cpu_load symbol:

#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);

and gdt_page is filled in arch/x86/kernel/cpu/common.c:

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
    [GDT_ENTRY_KERNEL32_CS]       = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
    [GDT_ENTRY_KERNEL_CS]         = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
    [GDT_ENTRY_KERNEL_DS]         = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
    [GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
    [GDT_ENTRY_DEFAULT_USER_DS]   = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
    [GDT_ENTRY_DEFAULT_USER_CS]   = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
    ...
    ...
    ...

You can read more about percpu variables in the Per-CPU variables part. Once we have the address and size of the GDT descriptor, we can reload the GDT with load_gdt, which just executes the lgdt instruction, and load the percpu segment with the following function:

void load_percpu_segment(int cpu) {
    loadsegment(gs, 0);
    wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
    load_stack_canary_segment();
}

The base address of the percpu area must be contained in the gs register (or the fs register for x86), so we use the loadsegment macro and pass gs. In the next step we write the base address of the IRQ stack and set up the stack canary (this is only for x86_32). After we load the new GDT, we fill the cpu_callout_mask bitmap with the current CPU and set the CPU state to online by setting the cpu_state percpu variable for the current processor to CPU_ONLINE:

cpumask_set_cpu(me, cpu_callout_mask);
per_cpu(cpu_state, me) = CPU_ONLINE;

So, what is the cpu_callout_mask bitmap? Since we initialized the bootstrap processor (the processor which is booted first on x86), the other processors in a multiprocessor system are known as secondary processors. The Linux kernel uses the following two bitmasks:

cpu_callout_mask
cpu_callin_mask

After the bootstrap processor is initialized, it updates cpu_callout_mask to indicate which secondary processor can be initialized next. All other, or secondary, processors can do some initialization beforehand and then check cpu_callout_mask for the bit set by the bootstrap processor. Only after the bootstrap processor has set the bit for this secondary processor in cpu_callout_mask will it continue the rest of its initialization. After a given processor finishes its initialization process, it sets its bit in cpu_callin_mask. Once the bootstrap processor finds the bit in cpu_callin_mask for the current secondary processor, it repeats the same procedure for the initialization of the remaining secondary processors. In short, it works as I described, but we will see more details in the chapter about SMP.

That's all. We did all the SMP boot preparation.

Build zonelists

In the next step we can see the call of the build_all_zonelists function. This function sets up the order of zones that allocations are preferred from. What zones are and what this order is, we will understand now. To start, let's see how the Linux kernel considers physical memory. Physical memory may be arranged into banks which are called nodes. If you have no hardware with support for NUMA, you will see only one node:

$ cat /sys/devices/system/node/node0/numastat
numa_hit 72452442
numa_miss 0
numa_foreign 0
interleave_hit 12925
local_node 72452442
other_node 0

Every node is represented by struct pglist_data in the Linux kernel. Each node is divided into a number of special blocks which are called zones. Every zone is represented by the zone struct in the Linux kernel and has one of the following types:

ZONE_DMA - 0-16M;
ZONE_DMA32 - used for 32-bit devices that can only do DMA to areas below 4G;
ZONE_NORMAL - all RAM from 4GB on x86_64;
ZONE_HIGHMEM - absent on x86_64;
ZONE_MOVABLE - zone which contains movable pages.

which are represented by the zone_type enum. We can get information about zones with:

$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3975
        min      3
        low      3
        ...
        ...
Node 0, zone    DMA32
  pages free     694163
        min      875
        low      1093
        ...
        ...
Node 0, zone   Normal
  pages free     2529995
        min      3146
        low      3932
        ...
        ...

As I wrote above, all nodes are described with the pglist_data or pg_data_t structure in memory. This structure is defined in include/linux/mmzone.h. The build_all_zonelists function from mm/page_alloc.c constructs an ordered zonelist (of the different zones DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE) which specifies the zones/nodes to visit when a selected zone or node cannot satisfy the allocation request. That's all. More about NUMA and multiprocessor systems will be in the special part.
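The idea of an ordered fallback list can be sketched with a toy model. Everything below is hypothetical (toy_zone, build_fallback and alloc_pages_toy are not kernel names), and it assumes a single node where an allocation falls back from the preferred zone to lower zones only.

```c
#include <assert.h>
#include <stddef.h>

enum zone_kind { Z_DMA, Z_DMA32, Z_NORMAL, Z_MAX };

struct toy_zone {
    long free_pages;
};

/* Fill order[] with the zones to try, from the preferred zone down to
 * Z_DMA. Returns the number of entries written. */
static size_t build_fallback(enum zone_kind preferred,
                             enum zone_kind order[Z_MAX])
{
    size_t n = 0;

    for (int z = (int)preferred; z >= (int)Z_DMA; z--)
        order[n++] = (enum zone_kind)z;
    return n;
}

/* Walk the fallback list and take pages from the first zone that can
 * satisfy the request. Returns the zone used, or -1 on failure. */
static int alloc_pages_toy(struct toy_zone zones[Z_MAX],
                           enum zone_kind preferred, long pages)
{
    enum zone_kind order[Z_MAX];
    size_t n = build_fallback(preferred, order);

    for (size_t i = 0; i < n; i++) {
        struct toy_zone *z = &zones[order[i]];

        if (z->free_pages >= pages) {
            z->free_pages -= pages;
            return (int)order[i];
        }
    }
    return -1;
}
```

A request that prefers the normal zone is served from it while it has enough free pages, and silently moves to DMA32 once it does not.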

The rest of the stuff before scheduler initialization

Before we start to dive into the Linux kernel scheduler initialization process, we must do a couple of things. The first thing is the page_alloc_init function from mm/page_alloc.c. This function looks pretty simple:

void __init page_alloc_init(void)
{
    hotcpu_notifier(page_alloc_cpu_notify, 0);
}

and initializes the handler for CPU hotplug. Of course hotcpu_notifier depends on the CONFIG_HOTPLUG_CPU configuration option, and if this option is set, it just calls the cpu_notifier macro, which expands to the call of register_cpu_notifier, which adds the hotplug CPU handler (page_alloc_cpu_notify in our case).

After this we can see the kernel command line in the initialization output.

And a couple of functions, such as parse_early_param and parse_args, which handle the Linux kernel command line. You may remember that we already saw the call of the parse_early_param function in the sixth part of the kernel initialization chapter, so why do we call it again? The answer is simple: we call this function in the architecture-specific code (x86_64 in our case), but not all architectures call this function. And we need to call the second function, parse_args, to parse and handle non-early command line arguments.

In the next step we can see the call of jump_label_init from kernel/jump_label.c, which initializes jump labels.

After this we can see the call of the setup_log_buf function, which sets up the printk log buffer. We already saw this function in the seventh part of the Linux kernel initialization process chapter.

PID hash initialization

The next is the pidhash_init function. As you know, each process is assigned a unique number which is called the process identification number or PID. Each process generated with fork or clone is automatically assigned a new unique PID value by the kernel. The management of PIDs is centered around two special data structures: struct pid and struct upid. The first structure represents information about a PID in the kernel. The second structure represents the information that is visible in a specific namespace. All PID instances are stored in a special hash table:

static struct hlist_head *pid_hash;

This hash table is used to find the pid instance that belongs to a numeric PID value. So, pidhash_init initializes this hash. At the start of the pidhash_init function we can see the call of alloc_large_system_hash:

pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
                                   HASH_EARLY | HASH_SMALL,
                                   &pidhash_shift, NULL,
                                   0, 4096);

The number of elements of pid_hash depends on the RAM configuration, but it can be between 2^4 and 2^12. pidhash_init computes the size and allocates the required storage (which is hlist in our case: the same as a doubly linked list, but with a single pointer in the struct hlist_head). The alloc_large_system_hash function allocates a large system hash table with memblock_virt_alloc_nopanic if we pass the HASH_EARLY flag (as in our case) or with __vmalloc if we do not pass this flag.

We can see the result in the dmesg output:

$ dmesg | grep hash
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
...
...
...
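To show how a hash table lets the kernel go from a numeric PID to its descriptor, here is a toy chained hash table. It is only a sketch: toy_pid, toy_hashfn and the 16-bucket size are assumptions for illustration, and the multiplicative hash is not claimed to be the one the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SHIFT 4
#define TOY_HASH_SIZE  (1u << TOY_HASH_SHIFT)

/* Hypothetical descriptor: a numeric id plus a single "next" pointer,
 * so each bucket head needs only one pointer. */
struct toy_pid {
    int nr;
    struct toy_pid *next;
};

static struct toy_pid *toy_pid_hash[TOY_HASH_SIZE];

/* Multiplicative hash that keeps the top TOY_HASH_SHIFT bits. */
static unsigned int toy_hashfn(uint32_t nr)
{
    return (uint32_t)(nr * UINT32_C(0x9E3779B1)) >> (32 - TOY_HASH_SHIFT);
}

/* Insert at the head of the bucket chain. */
static void toy_pid_insert(struct toy_pid *pid)
{
    unsigned int b = toy_hashfn((uint32_t)pid->nr);

    pid->next = toy_pid_hash[b];
    toy_pid_hash[b] = pid;
}

/* Walk one bucket chain and compare the numeric value. */
static struct toy_pid *toy_find_pid(int nr)
{
    struct toy_pid *p = toy_pid_hash[toy_hashfn((uint32_t)nr)];

    for (; p != NULL; p = p->next)
        if (p->nr == nr)
            return p;
    return NULL;
}
```

A lookup touches only one bucket chain, which is why the table size is tuned to the amount of memory in the system.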

That's all. The rest of the stuff before scheduler initialization is the following functions: vfs_caches_init_early does early initialization of the virtual file system (more about it will be in the chapter which describes the virtual file system), sort_main_extable sorts the kernel's built-in exception table entries, which are between __start___ex_table and __stop___ex_table, and trap_init initializes trap handlers (we will learn more about the last two functions in the separate chapter about interrupts).

The last step before scheduler initialization is the initialization of the memory manager with the mm_init function from init/main.c. As we can see, the mm_init function initializes different parts of the Linux kernel memory manager:

page_ext_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
pgtable_init();
vmalloc_init();

The first, page_ext_init_flatmem, depends on the CONFIG_SPARSEMEM kernel configuration option and initializes extended per-page data handling. mem_init releases all bootmem, kmem_cache_init initializes the kernel cache, percpu_init_late replaces percpu chunks with those allocated by slub, pgtable_init initializes the caches used for page tables, and vmalloc_init initializes vmalloc. Please NOTE that we will not dive into details about all of these functions and concepts, but we will see all of them in the Linux kernel memory manager chapter.

That's all. Now we can look at the scheduler.

Scheduler initialization

And now we come to the main purpose of this part: initialization of the task scheduler. I want to say again, as I already did many times, that you will not see the full explanation of the scheduler here; there will be a special chapter about it. OK, the next point is the sched_init function from kernel/sched/core.c and, as we can understand from the function's name, it initializes the scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. At the start of the sched_init function we can see the following code:

#ifdef CONFIG_FAIR_GROUP_SCHED
    alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
    alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif

First of all we can see two configuration options here:

CONFIG_FAIR_GROUP_SCHED
CONFIG_RT_GROUP_SCHED

Both of these options provide two different planning models. As we can read from the documentation, the current scheduler, CFS or the Completely Fair Scheduler, uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive 1/n of the processor time, where n is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run, and they are called the scheduling policy. The Completely Fair Scheduler supports the following normal, or non-real-time, scheduling policies: SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE. SCHED_NORMAL is used for most normal applications, where the amount of CPU each process consumes is mostly determined by the nice value; SCHED_BATCH is used for 100% non-interactive tasks; and SCHED_IDLE runs tasks only when the processor has nothing else to run. Real-time policies are also supported for time-critical applications: SCHED_FIFO and SCHED_RR. If you've read something about the Linux kernel scheduler, you may know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called scheduler classes. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them.

Now let's get back to our code and look at the two configuration options CONFIG_FAIR_GROUP_SCHED and CONFIG_RT_GROUP_SCHED. The scheduler operates on an individual task. These options allow scheduling of task groups (you can read more about it in CFS group scheduling). We can see that alloc_size, which represents the size to allocate for the sched_entity and cfs_rq pointer arrays, grows by 2 * nr_cpu_ids * sizeof(void **) for each enabled option, and that the memory is then allocated with kzalloc:

ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
    root_task_group.se = (struct sched_entity **)ptr;
    ptr += nr_cpu_ids * sizeof(void **);

    root_task_group.cfs_rq = (struct cfs_rq **)ptr;
    ptr += nr_cpu_ids * sizeof(void **);
#endif

sched_entity is a structure defined in include/linux/sched.h and used by the scheduler to keep track of process accounting. cfs_rq represents a run queue. So, you can see that we allocated space of size alloc_size for the run queue and scheduler entity of root_task_group. root_task_group is an instance of the task_group structure from kernel/sched/sched.h, which contains task group related information:

struct task_group {
    ...
    ...
    struct sched_entity **se;
    struct cfs_rq **cfs_rq;
    ...
    ...
}
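The pattern of one zeroed allocation carved into two per-CPU pointer arrays can be sketched in userspace as follows. The toy_ names are hypothetical and calloc stands in for kzalloc; this is an illustration of the pointer arithmetic, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_se;      /* opaque stand-in for struct sched_entity */
struct toy_cfs_rq;  /* opaque stand-in for struct cfs_rq */

struct toy_task_group {
    struct toy_se **se;
    struct toy_cfs_rq **cfs_rq;
};

/* Allocate one zeroed block that holds two arrays of nr_cpus pointers
 * and point tg->se and tg->cfs_rq at consecutive slices of it.
 * Returns the base of the block (to be freed by the caller) or NULL. */
static void *toy_group_alloc(struct toy_task_group *tg, size_t nr_cpus)
{
    size_t alloc_size = 2 * nr_cpus * sizeof(void *);
    void *base = calloc(1, alloc_size);
    uintptr_t ptr = (uintptr_t)base;

    if (base == NULL)
        return NULL;

    tg->se = (struct toy_se **)ptr;
    ptr += nr_cpus * sizeof(void *);

    tg->cfs_rq = (struct toy_cfs_rq **)ptr;
    return base;
}
```

The second array starts exactly nr_cpus pointers after the first one, so a single allocation (and a single free) covers both.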

The root task group is the task group to which every task in the system belongs. Once we have allocated space for the root task group scheduler entity and run queue, we go over all possible CPUs (the cpu_possible_mask bitmap) and allocate zeroed memory from a particular memory node with the kzalloc_node function for the load_balance_mask percpu variable:

DECLARE_PER_CPU(cpumask_var_t,load_balance_mask);


Here cpumask_var_t is cpumask_t with one difference: cpumask_var_t is allocated with only nr_cpu_ids bits, while cpumask_t always has NR_CPUS bits (you can read more about cpumasks in the CPU masks part). As you can see:

#ifdef CONFIG_CPUMASK_OFFSTACK
    for_each_possible_cpu(i) {
        per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
            cpumask_size(), GFP_KERNEL, cpu_to_node(i));
    }
#endif

this code depends on the CONFIG_CPUMASK_OFFSTACK configuration option. This configuration option says to use dynamic allocation for cpumask instead of putting it on the stack. All groups have to be able to rely on the amount of CPU time. With the call of the two following functions:

init_rt_bandwidth(&def_rt_bandwidth,
        global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth,
        global_rt_period(), global_rt_runtime());

we initialize bandwidth management for the real-time and SCHED_DEADLINE tasks. These functions initialize the rt_bandwidth and dl_bandwidth structures, which store information about the maximum real-time and deadline bandwidth of the system. For example, let's look at the implementation of the init_rt_bandwidth function:

void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
    rt_b->rt_period = ns_to_ktime(period);
    rt_b->rt_runtime = runtime;

    raw_spin_lock_init(&rt_b->rt_runtime_lock);

    hrtimer_init(&rt_b->rt_period_timer,
            CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    rt_b->rt_period_timer.function = sched_rt_period_timer;
}

It takes three parameters:

address of the rt_bandwidth structure, which contains information about the allocated and consumed quota within a period;
period - period over which real-time task bandwidth enforcement is measured, in ns;
runtime - part of the period that we allow tasks to run, in ns.

As period and runtime we pass the results of the global_rt_period and global_rt_runtime functions, which are 1 second and 0.95 seconds by default. The rt_bandwidth structure is defined in kernel/sched/sched.h and looks like this:

struct rt_bandwidth {
    raw_spinlock_t      rt_runtime_lock;
    ktime_t             rt_period;
    u64                 rt_runtime;
    struct hrtimer      rt_period_timer;
};

As you can see, it contains the runtime and period and also the two following fields:

rt_runtime_lock - spinlock for the rt_time protection;
rt_period_timer - high-resolution kernel timer for unthrottling of real-time tasks.
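The relation between period, runtime and throttling can be illustrated with a toy budget model. This is a sketch under simplifying assumptions: the toy_ names are hypothetical, there is no locking or timer, and the period tick that the kernel drives from rt_period_timer is called by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of a real-time bandwidth budget. */
struct toy_rt_bandwidth {
    uint64_t period_ns;   /* length of one enforcement period */
    uint64_t runtime_ns;  /* budget available within one period */
    uint64_t used_ns;     /* time consumed in the current period */
};

static void toy_init_rt_bandwidth(struct toy_rt_bandwidth *b,
                                  uint64_t period_ns, uint64_t runtime_ns)
{
    assert(runtime_ns <= period_ns);
    b->period_ns = period_ns;
    b->runtime_ns = runtime_ns;
    b->used_ns = 0;
}

/* Account delta_ns of real-time execution. Returns 1 when the budget
 * of the current period is exceeded and the tasks should be throttled. */
static int toy_charge(struct toy_rt_bandwidth *b, uint64_t delta_ns)
{
    b->used_ns += delta_ns;
    return b->used_ns > b->runtime_ns;
}

/* What a period timer would do: start a new period with a full budget. */
static void toy_period_tick(struct toy_rt_bandwidth *b)
{
    b->used_ns = 0;
}
```

With the default values from the text (a period of 1 s and a runtime of 0.95 s), real-time tasks that have run for 0.9 s are still within budget, and the next 0.1 s pushes them over it until the period restarts.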


So, in init_rt_bandwidth we initialize the rt_bandwidth period and runtime with the given parameters, and initialize the spinlock and the high-resolution timer. In the next step, depending on whether SMP is enabled, we initialize the root domain:

#ifdef CONFIG_SMP
    init_defrootdomain();
#endif

The real-time scheduler requires global resources to make scheduling decisions. But unfortunately scalability bottlenecks appear as the number of CPUs increases. The concept of root domains was introduced to improve scalability. The Linux kernel provides a special mechanism for assigning a set of CPUs and memory nodes to a set of tasks, and it is called cpuset. If a cpuset contains CPUs that do not overlap with other cpusets, it is an exclusive cpuset. Each exclusive cpuset defines an isolated domain, or root domain, of CPUs partitioned from other cpusets or CPUs. A root domain is represented by struct root_domain from kernel/sched/sched.h in the Linux kernel; its main purpose is to narrow the scope of the global variables to per-domain variables, and all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details in the chapter about the real-time scheduler.

After root domain initialization, we initialize the bandwidth for the real-time tasks of the root task group as we did above:

#ifdef CONFIG_RT_GROUP_SCHED
    init_rt_bandwidth(&root_task_group.rt_bandwidth,
            global_rt_period(), global_rt_runtime());
#endif

In the next step, depending on the CONFIG_CGROUP_SCHED kernel configuration option, we initialize the siblings and children lists of the root task group. As we can read from the documentation, CONFIG_CGROUP_SCHED is:

This option allows you to create arbitrary task groups using the "cgroup" pseudo
filesystem and control the cpu bandwidth allocated to each such task group.

Once we have finished with the list initialization, we can see the call of the autogroup_init function:

#ifdef CONFIG_CGROUP_SCHED
    list_add(&root_task_group.list, &task_groups);
    INIT_LIST_HEAD(&root_task_group.children);
    INIT_LIST_HEAD(&root_task_group.siblings);
    autogroup_init(&init_task);
#endif

which initializes automatic process group scheduling.

After this we go through all possible CPUs (you may remember that possible CPUs are stored in the cpu_possible_mask bitmap of CPUs that can ever be available in the system) and initialize a runqueue for each possible CPU:

for_each_possible_cpu(i) {
    struct rq *rq;
    ...
    ...
    ...

Each processor has its own locking and individual runqueue. All runnable tasks are stored in an active array and indexed according to their priority. When a process consumes its time slice, it is moved to an expired array. All of these arrays are stored in a special structure called the runqueue. As there is no global lock and no global runqueue, we go through all possible CPUs and initialize a runqueue for every CPU. The runqueue is represented by the rq structure, which is defined in kernel/sched/sched.h.

    rq = cpu_rq(i);
    raw_spin_lock_init(&rq->lock);
    rq->nr_running = 0;
    rq->calc_load_active = 0;
    rq->calc_load_update = jiffies + LOAD_FREQ;
    init_cfs_rq(&rq->cfs);
    init_rt_rq(&rq->rt);
    init_dl_rq(&rq->dl);
    rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;

Here we get the runqueue for every CPU with the cpu_rq macro, which returns the runqueues percpu variable, and start to initialize it: the runqueue lock, the number of running tasks, the calc_load related fields (calc_load_active and calc_load_update) which are used in the calculation of CPU load, and the completely fair, real-time and deadline related fields of the runqueue. After this we initialize the cpu_load array with zeros and set the last load update tick to the jiffies variable, which holds the number of timer ticks since system boot:

    for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
        rq->cpu_load[j] = 0;

    rq->last_load_update_tick = jiffies;

where cpu_load keeps a history of runqueue loads in the past; for now CPU_LOAD_IDX_MAX is 5. In the next step we fill the runqueue fields which are related to SMP, but we will not cover them in this part. At the end of the loop we initialize the high-resolution timer for the given runqueue and set the iowait number (more about it in the separate part about the scheduler):

    init_rq_hrtick(rq);
    atomic_set(&rq->nr_iowait, 0);

Now we have come out of the for_each_possible_cpu loop, and next we need to set the load weight for the init task with the set_load_weight function. The weight of a process is calculated through its dynamic priority, which is the static priority + the scheduling class of the process. After this we increase the memory usage counter of the memory descriptor of the init process and set the scheduler class for the current process:

atomic_inc(&init_mm.mm_count);
current->sched_class = &fair_sched_class;

and make the current process (it will be the first init process) idle and update the value of calc_load_update with a 5 second interval:

init_idle(current, smp_processor_id());
calc_load_update = jiffies + LOAD_FREQ;

So, the init process will be run when there are no other candidates (as it is the first process in the system). At the end we just set the scheduler_running variable:

scheduler_running = 1;


That's all. The Linux kernel scheduler is initialized. Of course, we skipped many details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, RCU, etc.) work in the Linux kernel, but we took a short look at the scheduler initialization process. We will look at all the other details in the separate part which will be fully dedicated to the scheduler.

Conclusion

It is the end of the eighth part about the Linux kernel initialization process. In this part, we looked at the initialization process of the scheduler. In the next part we will continue to dive into the Linux kernel initialization process and will see the initialization of RCU and other initialization stuff.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

CPU masks
high-resolution kernel timer
spinlock
Run queue
Linux kernel memory manager
slub
virtual file system
Linux kernel hotplug documentation
IRQ
Global Descriptor Table
Per-CPU variables
SMP
RCU
CFS Scheduler documentation
Real-Time group scheduling
Previous part

Kernel initialization. Part 9.

RCU initialization

This is the ninth part of the Linux kernel initialization process chapter, and in the previous part we stopped at the scheduler initialization. In this part we will continue to dive into the Linux kernel initialization process, and the main purpose of this part will be to learn about the initialization of RCU. We can see that the next step in init/main.c after sched_init is the call of preempt_disable. There are two macros:

preempt_disable
preempt_enable

for disabling and enabling preemption. First of all let's try to understand what preemption is in the context of an operating system kernel. In simple words, preemption is the ability of the operating system kernel to preempt the current task in order to run a task with higher priority. Here we need to disable preemption because we will have only one init process during early boot time and we do not need to stop it before we call the cpu_idle function. The preempt_disable macro is defined in include/linux/preempt.h and depends on the CONFIG_PREEMPT_COUNT kernel configuration option. This macro is implemented as:

#define preempt_disable() \
do { \
    preempt_count_inc(); \
    barrier(); \
} while (0)

and if CONFIG_PREEMPT_COUNT is not set, just:

#define preempt_disable() barrier()

Let's look at it. First of all we can see one difference between these macro implementations. preempt_disable with CONFIG_PREEMPT_COUNT contains the call of preempt_count_inc. There is a special percpu variable which stores the number of held locks and preempt_disable calls:

DECLARE_PER_CPU(int, __preempt_count);

In the first implementation of preempt_disable we increment this __preempt_count. There is an API for returning the value of __preempt_count: the preempt_count function. When we call preempt_disable, first of all we increment the preemption counter with the preempt_count_inc macro, which expands to:

#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val) __preempt_count_add(val)

where preempt_count_add calls the raw_cpu_add_4 macro, which adds 1 to the given percpu variable (__preempt_count in our case; you can read more about percpu variables in the part about Per-CPU variables). OK, we increased __preempt_count, and in the next step we can see the call of the barrier macro in both macros. The barrier macro inserts an optimization barrier. On processors with the x86_64 architecture, independent memory access operations can be performed in any order. That's why we need a way to tell the compiler and processor to preserve a particular order. This mechanism is the memory barrier. Let's consider a simple example:

preempt_disable();

foo();

preempt_enable();

The compiler can rearrange it as:

preempt_disable();

preempt_enable();

foo();

In this case the non-preemptible function foo can be preempted. Since we put the barrier macro in the preempt_disable and preempt_enable macros, it prevents the compiler from swapping preempt_count_inc with other statements. You can read more about barriers here and here.
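A minimal userspace sketch of the same pattern is shown below. It assumes GCC or Clang, where an empty asm statement with a "memory" clobber acts as a compiler-only barrier; the toy_ functions and the plain global counter are hypothetical stand-ins for the percpu __preempt_count machinery.

```c
#include <assert.h>

/* Compiler-only barrier (GCC/Clang syntax): the compiler may not move
 * memory accesses across this point. It emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for the percpu __preempt_count variable. */
static int toy_preempt_count;

static void toy_preempt_disable(void)
{
    toy_preempt_count++;
    barrier();              /* the increment stays before the section */
}

static void toy_preempt_enable(void)
{
    barrier();              /* the section stays before the decrement */
    toy_preempt_count--;
}

static int toy_preemptible(void)
{
    return toy_preempt_count == 0;
}

/* Work done inside the protected section: reports whether preemption
 * would have been allowed at that point. */
static int critical_work(void)
{
    int was_preemptible;

    toy_preempt_disable();
    was_preemptible = toy_preemptible();
    toy_preempt_enable();
    return was_preemptible;
}
```

Inside the section the counter is nonzero, so critical_work reports that preemption was not allowed, and the counter is back to zero afterwards.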

In the next step we can see the following statement:

if (WARN(!irqs_disabled(),
     "Interrupts were enabled *very* early, fixing it\n"))
    local_irq_disable();

which checks the IRQ state and disables interrupts (with the cli instruction for x86_64) if they are enabled.

That's all. Preemption is disabled and we can go ahead.

Initialization of the integer ID management

In the next step we can see the call of the idr_init_cache function, which is defined in lib/idr.c. The idr library is used in various places in the Linux kernel to manage assigning integer IDs to objects and looking up objects by id.

Let's look at the implementation of the idr_init_cache function:

void __init idr_init_cache(void)
{
    idr_layer_cache = kmem_cache_create("idr_layer_cache",
                sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}

Here we can see the call of kmem_cache_create. We already called kmem_cache_init in init/main.c. This function creates generalized caches again using kmem_cache_alloc (we will see more about caches in the Linux kernel memory management chapter). In our case, as we are using kmem_cache_t, the slab allocator will be used and kmem_cache_create creates it. As you can see, we pass five parameters to kmem_cache_create:

name of the cache;
size of the object to store in the cache;
offset of the first object in the page;
flags;
constructor for the objects.

and it will create a kmem_cache for the integer IDs. Integer IDs are a commonly used pattern for mapping a set of integer IDs to a set of pointers. We can see the usage of integer IDs, for example, in the i2c drivers subsystem. For example, drivers/i2c/i2c-core.c, which represents the core of the i2c subsystem, defines the ID for the i2c adapter with the DEFINE_IDR macro:

static DEFINE_IDR(i2c_adapter_idr);

and then it uses it for the declaration of the i2c adapter:

static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
    int id;
    ...
    ...
    ...
    id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
    ...
    ...
    ...
}

and the id allocated from i2c_adapter_idr represents the dynamically calculated bus number.

You can read more about integer ID management here.
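The integer ID pattern itself can be sketched with a fixed-size table. This toy model is an assumption made for illustration: toy_idr, toy_idr_alloc and toy_idr_find are hypothetical, a flat array replaces the kernel's layered structure, and -1 is returned where the real API would return an error code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_IDR_MAX 32

/* Hypothetical flat map from small integer IDs to pointers. */
struct toy_idr {
    void *slots[TOY_IDR_MAX];
};

/* Bind ptr to the lowest free ID in [start, end).
 * Returns the ID, or -1 if the range has no free slot. */
static int toy_idr_alloc(struct toy_idr *idr, void *ptr, int start, int end)
{
    assert(ptr != NULL);
    if (start < 0)
        start = 0;
    if (end > TOY_IDR_MAX)
        end = TOY_IDR_MAX;

    for (int id = start; id < end; id++) {
        if (idr->slots[id] == NULL) {
            idr->slots[id] = ptr;
            return id;
        }
    }
    return -1;
}

/* Look up the pointer bound to id, or NULL if there is none. */
static void *toy_idr_find(const struct toy_idr *idr, int id)
{
    if (id < 0 || id >= TOY_IDR_MAX)
        return NULL;
    return idr->slots[id];
}
```

Passing a one-element range such as [5, 6) asks for exactly ID 5, which mirrors how the i2c code above requests the range adap->nr to adap->nr + 1 for a numbered adapter.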

ThenextstepisRCUinitializationwiththercu_initfunctionandit'simplementationdependsontwokernelconfigurationoptions:

CONFIG_TINY_RCU

CONFIG_TREE_RCU

Inthefirstcasercu_initwillbeinthekernel/rcu/tiny.candinthesecondcaseitwillbedefinedinthekernel/rcu/tree.c.Wewillseetheimplementationofthetreercu,butfirstofallabouttheRCUingeneral.

RCUorread-copyupdateisascalablehigh-performancesynchronizationmechanismimplementedintheLinuxkernel.Ontheearlystagethelinuxkernelprovidedsupportandenvironmentfortheconcurentlyrunningapplications,butallexecutionwasserializedinthekernelusingasinglegloballock.Inourdayslinuxkernelhasnosinglegloballock,butprovidesdifferentmechanismsincludinglock-freedatastructures,percpudatastructuresandother.Oneofthesemechanismsis-theread-copyupdate.TheRCUtechniquedesignedforrarely-modifieddatastructures.TheideaoftheRCUissimple.Forexamplewehaveararely-modifieddatastructure.Ifsomebodywantstochangethisdatastructure,wemakeacopyofthisdatastructureandmakeallchangesinthecopy.Inthesametimeallotherusersofthedatastructureuseoldversionofit.Next,weneedtochoosesafemomentwhenoriginalversionofthedatastructurewillhavenousersandupdateitwiththemodifiedcopy.

OfcoursethisdescriptionoftheRCUisverysimplified.TounderstandsomedetailsaboutRCU,firstofallweneedtolearnsometerminology.DatareadersintheRCUexecutedinthecriticalsection.Everytimewhendatareaderjoinstothecriticalsection,itcallsthercu_read_lock,andrcu_read_unlockonexitfromthecriticalsection.Ifthethreadisnotinthecriticalsection,itwillbeinstatewhichcalled-quiescentstate.Everymomentwheneverythreadwasinthequiescentstatecalled-graceperiod.Ifathreadwantstoremoveelementfromthedatastructure,thisoccursintwosteps.Firststepsisremoval-atomicallyremoveselementfromthedatastructure,butdoesnotreleasethephysicalmemory.Afterthisthread-writerannouncesandwaitswhileitwillbefinsihed.Fromthismoment,theremovedelementisavailabletothethread-readers.Afterthegraceperioudwillbefinished,thesecondstepoftheelementremovalwillbestarted,itjustremoveselementfromthephysicalmemory.

There are a couple of implementations of RCU. The old RCU is called classic, and the new implementation is called tree RCU. As you may already understand, the CONFIG_TREE_RCU kernel configuration option enables tree RCU. Another one is tiny RCU, which depends on CONFIG_TINY_RCU and CONFIG_SMP=n. We will see more details about RCU in general in the separate chapter about synchronization primitives, but now let's look at the rcu_init implementation from kernel/rcu/tree.c:

void __init rcu_init(void)
{
    int cpu;

    rcu_bootup_announce();
    rcu_init_geometry();
    rcu_init_one(&rcu_bh_state, &rcu_bh_data);
    rcu_init_one(&rcu_sched_state, &rcu_sched_data);
    __rcu_init_preempt();
    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

    /*
     * We don't need protection against CPU-hotplug here because
     * this is called early in boot, before either interrupts
     * or the scheduler are operational.
     */
    cpu_notifier(rcu_cpu_notify, 0);
    pm_notifier(rcu_pm_notify, 0);
    for_each_online_cpu(cpu)
        rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);

    rcu_early_boot_tests();
}

At the beginning of the rcu_init function we define the cpu variable and call rcu_bootup_announce. The rcu_bootup_announce function is pretty simple:

static void __init rcu_bootup_announce(void)
{
    pr_info("Hierarchical RCU implementation.\n");
    rcu_bootup_announce_oddness();
}

It just prints information about RCU with the pr_info function and then calls rcu_bootup_announce_oddness, which also uses pr_info to print different information about the current RCU configuration, depending on kernel configuration options like CONFIG_RCU_TRACE, CONFIG_PROVE_RCU, CONFIG_RCU_FANOUT_EXACT, etc. In the next step we can see the call of the rcu_init_geometry function. This function is defined in the same source code file and computes the node tree geometry depending on the number of CPUs. RCU provides scalability with extremely low internal lock contention. What if a data structure is read from different CPUs? The RCU API provides the rcu_state structure, which represents the RCU global state including the node hierarchy. The hierarchy is represented by the:

struct rcu_node node[NUM_RCU_NODES];

array of structures. As we can read in the comment above the definition of this structure:

The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is
determined by the number of CPUs and by CONFIG_RCU_FANOUT.
Small systems will have a "hierarchy" consisting of a single rcu_node.

The rcu_node structure is defined in kernel/rcu/tree.h and contains information about the current grace period, whether the grace period has completed, the CPUs or groups that need to switch in order for the current grace period to proceed, etc. Every rcu_node contains a lock for a couple of CPUs. These rcu_node structures are embedded into a linear array in the rcu_state structure and represented as a tree with the root in the zeroth element, which covers all CPUs. As you can see, the number of rcu nodes is determined by NUM_RCU_NODES, which depends on the number of available CPUs:

#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)

where the level values depend on the CONFIG_RCU_FANOUT_LEAF configuration option. For example, in the simplest case one rcu_node covers two CPUs on a machine with eight CPUs:

+-----------------------------------------------------------------+
|  rcu_state                                                      |
|                 +----------------------+                        |
|                 |         root         |                        |
|                 |       rcu_node       |                        |
|                 +----------------------+                        |
|                    |                |                           |
|               +----v-----+       +--v-------+                   |
|               |          |       |          |                   |
|               | rcu_node |       | rcu_node |                   |
|               |          |       |          |                   |
|         +------------------+     +----------------+             |
|         |                  |        |             |             |
|         |                  |        |             |             |
|    +----v-----+    +-------v--+   +-v--------+  +-v--------+    |
|    |          |    |          |   |          |  |          |    |
|    | rcu_node |    | rcu_node |   | rcu_node |  | rcu_node |    |
|    |          |    |          |   |          |  |          |    |
|    +----------+    +----------+   +----------+  +----------+    |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
+---------|-----------------|-------------|---------------|-------+
          |                 |             |               |
+---------v-----------------v-------------v---------------v--------+
|                 |                |               |               |
|     CPU1        |      CPU3      |      CPU5     |     CPU7      |
|                 |                |               |               |
|     CPU2        |      CPU4      |      CPU6     |     CPU8      |
|                 |                |               |               |
+------------------------------------------------------------------+

So, in the rcu_init_geometry function we just need to calculate the total number of rcu_node structures. We start by calculating the number of jiffies until the first and the next fqs, which stands for force-quiescent-state:

d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
if (jiffies_till_first_fqs == ULONG_MAX)
    jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
    jiffies_till_next_fqs = d;

where:

#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV 256

After we have calculated these jiffies, we check whether the previously defined jiffies_till_first_fqs and jiffies_till_next_fqs variables are equal to ULONG_MAX (their default value) and, if so, set them to the calculated value. As we did not touch these variables before, they are equal to ULONG_MAX:

static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;


In the next step of rcu_init_geometry, we check whether rcu_fanout_leaf has changed (by default it has the same value as CONFIG_RCU_FANOUT_LEAF at compile time). If it is still equal to the value of the CONFIG_RCU_FANOUT_LEAF configuration option and nr_cpu_ids is equal to NR_CPUS, we just return:

if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
    nr_cpu_ids == NR_CPUS)
    return;

After this we need to compute the number of nodes (CPUs) that an rcu_node tree with the given number of levels can handle:

rcu_capacity[0] = 1;
rcu_capacity[1] = rcu_fanout_leaf;
for (i = 2; i <= MAX_RCU_LVLS; i++)
    rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;

In the last step we calculate the number of rcu_nodes at each level of the tree in a loop.

After we have calculated the geometry of the rcu_node tree, we go back to the rcu_init function. The next step is to initialize two rcu_state structures with the rcu_init_one function:

rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);

The rcu_init_one function takes two arguments:

Global RCU state;
Per-CPU data for RCU.

Both variables are defined in kernel/rcu/tree.h together with their per-CPU data:

extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);

You can read about these states here. As I wrote above, we need to initialize the rcu_state structures, and the rcu_init_one function will help us with it. After the rcu_state initialization, we can see the call of __rcu_init_preempt, which depends on the CONFIG_PREEMPT_RCU kernel configuration option. It does the same as the previous calls: it initializes the rcu_preempt_state structure, which has the rcu_state type, with the rcu_init_one function. After this, in rcu_init, we can see the call of the:

open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

function. This function registers a handler for a pending interrupt. A pending interrupt, or softirq, means that part of the work can be delayed for later execution, when the system is less loaded. Pending interrupts are represented by the following structure:

struct softirq_action
{
    void (*action)(struct softirq_action *);
};

which is defined in include/linux/interrupt.h and contains only one field: the handler of an interrupt. You can see the softirqs in your system with:

$ cat /proc/softirqs

CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7

HI:20010200

TIMER:1377791081101395731076471074081149729965398665

NET_TX:11270401100

NET_RX:3342211329393076451361292303

BLOCK:525355968779201637442282855

BLOCK_IOPOLL:00000000

TASKLET:6602916113024267080

SCHED:10235075950917057535675323826276927969914

HRTIMER:510302368260219255248246

RCU:8129068062829796901568390693856330463473

The open_softirq function takes two parameters:

index of the interrupt;
interrupt handler.

and adds the interrupt handler to the array of pending interrupts:

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}

In our case the interrupt handler is rcu_process_callbacks, which is defined in kernel/rcu/tree.c and does the RCU core processing for the current CPU. After we have registered the softirq for RCU, we can see the following code:

cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
    rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);

Here we can see the registration of the CPU notifier, which is needed on systems that support CPU hotplug; we will not dive into details about this theme. The last function in rcu_init is rcu_early_boot_tests:

void rcu_early_boot_tests(void)
{
    pr_info("Running RCU self tests\n");

    if (rcu_self_test)
        early_boot_test_call_rcu();
    if (rcu_self_test_bh)
        early_boot_test_call_rcu_bh();
    if (rcu_self_test_sched)
        early_boot_test_call_rcu_sched();
}

which runs self tests for RCU.

That's all. We saw the initialization process of the RCU subsystem. As I wrote above, more about RCU will be in the separate chapter about synchronization primitives.

Rest of the initialization process


Ok, we have already passed the main theme of this part, which is RCU initialization, but it is not the end of the Linux kernel initialization process. In the last paragraph of this part we will see a couple of functions which work at initialization time, but we will not dive into deep details about these functions, for different reasons. Some reasons not to dive into details are the following:

They are not very important for the generic kernel initialization process and can depend on the kernel configuration;
They have a debugging character and are not important for now;
We will see many of these things in separate parts/chapters.

After we have initialized RCU, the next step which you can see in init/main.c is the trace_init function. As you can understand from its name, this function initializes the tracing subsystem. You can read more about the Linux kernel tracing system here.

After trace_init, we can see the call of radix_tree_init. If you are familiar with different data structures, you can understand from the name of this function that it initializes the kernel implementation of the radix tree. This function is defined in lib/radix-tree.c, and you can read more about it in the part about the radix tree.

In the next step we can see the functions which are related to the interrupt handling subsystem. They are:

early_irq_init

init_IRQ

softirq_init

We will see an explanation of these functions and their implementation in the special part about interrupts and exceptions handling. After this come many different functions (like init_timers, hrtimers_init, time_init, etc.) which are related to timing and timers. We will see more about these functions in the chapter about timers.

The next couple of functions are related to perf events: perf_event_init (there will be a separate chapter about perf) and initialization of profiling with profile_init. After this we enable IRQs with the call of:

local_irq_enable();

which expands to the sti instruction, and do the post-initialization of the SLAB with the call of the kmem_cache_init_late function (as I wrote above, we will learn about the SLAB in the Linux memory management chapter).

After the post-initialization of the SLAB, the next point is initialization of the console with the console_init function from drivers/tty/tty_io.c.

After the console initialization, we can see the lockdep_info function, which prints information about the lock dependency validator. After this, we can see the initialization of the dynamic allocation of debug objects with debug_objects_mem_init, kernel memory leak detector initialization with kmemleak_init, per-CPU pageset setup with setup_per_cpu_pageset, setup of the NUMA policy with numa_policy_init, setting the time for the scheduler with sched_clock_init, pidmap initialization with the call of the pidmap_init function for the initial PID namespace, cache creation with anon_vma_init for private virtual memory areas, and early initialization of ACPI with acpi_early_init.

This is the end of the ninth part of the Linux kernel initialization process, and here we saw the initialization of RCU. In the last paragraph of this part (Rest of the initialization process) we went through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these things, or if you know about them but do not understand them. As I have written many times already, we will see the details of the implementations in other parts or other chapters.

Conclusion

It is the end of the ninth part about the Linux kernel initialization process. In this part, we looked at the initialization process of the RCU subsystem. In the next part we will continue to dive into the Linux kernel initialization process, and I hope that we will finish with the start_kernel function, go on to the rest_init function from the same init/main.c source code file, and see the start of the first process.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

lock-free data structures
kmemleak
ACPI
IRQs
RCU
RCU documentation
integer ID management
Documentation/memory-barriers.txt
Runtime locking correctness validator
Per-CPU variables
Linux kernel memory management
slab
i2c
Previous part

Kernel initialization. Part 10.

End of the linux kernel initialization process

This is the tenth part of the chapter about the Linux kernel initialization process. In the previous part we saw the initialization of RCU and stopped at the call of the acpi_early_init function. This part will be the last part of the Kernel initialization process chapter, so let's finish with it.

After the call of the acpi_early_init function from init/main.c, we can see the following code:

#ifdef CONFIG_X86_ESPFIX64
    init_espfix_bsp();
#endif

Here we can see the call of the init_espfix_bsp function, which depends on the CONFIG_X86_ESPFIX64 kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in arch/x86/kernel/espfix_64.c and prevents leaking of bits 31:16 of the esp register when returning to a 16-bit stack. First of all, in init_espfix_bsp we install the espfix page upper directory into the kernel page directory:

pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);

Where ESPFIX_BASE_ADDR is:

#define PGDIR_SHIFT 39
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)

We can also find it in Documentation/arch/x86_64/mm:

... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...

After we've filled the page global directory with the espfix pud, the next step is the call of the init_espfix_random and init_espfix_ap functions. The first function returns random locations for the espfix page, and the second enables espfix for the current CPU. After init_espfix_bsp has finished its work, we can see the call of the thread_info_cache_init function, which is defined in kernel/fork.c and allocates a cache for thread_info if THREAD_SIZE is less than PAGE_SIZE:

#if THREAD_SIZE >= PAGE_SIZE
...
...
...
void thread_info_cache_init(void)
{
    thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
                                          THREAD_SIZE, 0, NULL);
    BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif

As we already know, PAGE_SIZE is (_AC(1,UL) << PAGE_SHIFT), or 4096 bytes, and THREAD_SIZE is (PAGE_SIZE << THREAD_SIZE_ORDER), or 16384 bytes on x86_64. The next function after thread_info_cache_init is cred_init from kernel/cred.c. This function just allocates a cache for the credentials (like uid, gid, etc.):

void __init cred_init(void)
{
    cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
                                 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
}

You can read more about credentials in Documentation/security/credentials.txt. The next step is the fork_init function from kernel/fork.c. The fork_init function allocates a cache for task_struct. Let's look at the implementation of fork_init. First of all we can see the definition of the ARCH_MIN_TASKALIGN macro and the creation of a slab where task_structs will be allocated:

#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
#endif
    task_struct_cachep =
        kmem_cache_create("task_struct", sizeof(struct task_struct),
            ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif

As we can see, this code depends on the CONFIG_ARCH_TASK_STRUCT_ALLOCATOR kernel configuration option. This option indicates that the given architecture provides its own alloc_task_struct. As x86_64 has no alloc_task_struct function of its own, the option is not set there, so the code inside the #ifndef is compiled and the generic task_struct slab cache is created on x86_64.

After this we can see the call of the arch_task_cache_init function in fork_init:

void arch_task_cache_init(void)
{
    task_xstate_cachep =
        kmem_cache_create("task_xstate", xstate_size,
                          __alignof__(union thread_xstate),
                          SLAB_PANIC | SLAB_NOTRACK, NULL);
    setup_xstate_comp();
}

arch_task_cache_init initializes the architecture-specific caches. In our case it is x86_64, so as we can see, arch_task_cache_init allocates a cache for task_xstate, which represents the FPU state, and sets up the offsets and sizes of all extended states in the xsave area with the call of the setup_xstate_comp function. After arch_task_cache_init we calculate the default maximum number of threads with:

set_max_threads(MAX_THREADS);

where the default maximum number of threads is:

#define FUTEX_TID_MASK 0x3fffffff
#define MAX_THREADS FUTEX_TID_MASK

Allocating cache for init task

At the end of the fork_init function we initialize the resource limits stored in the signal descriptor of init_task:

init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
    init_task.signal->rlim[RLIMIT_NPROC];

As we know, init_task is an instance of the task_struct structure, so it contains the signal field, which represents the signal descriptor. It has the type struct signal_struct. In the first two lines we can see the setting of the current and maximum values of a resource limit. Every process has an associated set of resource limits. These limits specify the amount of resources which the current process can use. Here rlim is a resource control limit, represented by the:

struct rlimit {
    __kernel_ulong_t rlim_cur;
    __kernel_ulong_t rlim_max;
};

structure from include/uapi/linux/resource.h. In our case the resources are RLIMIT_NPROC, which is the maximum number of processes that a user can own, and RLIMIT_SIGPENDING, the maximum number of pending signals. We can see them in:

cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
...
...
...
Max processes             63815                63815                processes
Max pending signals       63815                63815                signals
...
...
...

Initialization of the caches

The next function after fork_init is proc_caches_init from kernel/fork.c. This function allocates caches for the memory descriptors (the mm_struct structure). At the beginning of proc_caches_init we can see the allocation of different SLAB caches with the call of kmem_cache_create:

sighand_cachep - manages information about installed signal handlers;
signal_cachep - manages information about the process signal descriptor;
files_cachep - manages information about opened files;
fs_cachep - manages filesystem information.

After this we allocate a SLAB cache for the mm_struct structures:

mm_cachep = kmem_cache_create("mm_struct",
        sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
        SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);

After this we allocate a SLAB cache for the important vm_area_struct, which is used by the kernel to manage virtual memory space:

vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);

Note that we use the KMEM_CACHE macro here instead of kmem_cache_create. This macro is defined in include/linux/slab.h and just expands to a kmem_cache_create call:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
        sizeof(struct __struct), __alignof__(struct __struct),\
        (__flags), NULL)

KMEM_CACHE has one difference from kmem_cache_create. Take a look at the __alignof__ operator. The KMEM_CACHE macro aligns the SLAB objects to the alignment requirement of the given structure, while kmem_cache_create uses the alignment value passed by the caller. After this we can see the call of the mmap_init and nsproxy_cache_init functions. The first function initializes the virtual memory area SLAB and the second function initializes the SLAB for namespaces.

The next function after proc_caches_init is buffer_init. This function is defined in the fs/buffer.c source code file and allocates a cache for buffer_head. buffer_head is a special structure, defined in include/linux/buffer_head.h, which is used for managing buffers. At the start of the buffer_init function we allocate a cache for the struct buffer_head structures with the call of the kmem_cache_create function, as we did in the previous functions, and calculate the maximum size of the buffers in memory with:

nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));

which will be equal to 10% of ZONE_NORMAL (all RAM from 4GB on x86_64).

The next function after buffer_init is vfs_caches_init. This function allocates SLAB caches and hash tables for different VFS caches. We already saw the vfs_caches_init_early function in the eighth part of the Linux kernel initialization process, which initialized the caches for dcache (or directory cache) and the inode cache. The vfs_caches_init function makes the post-early initialization of the dcache and inode caches, the private data cache, hash tables for the mount points, etc. More details about VFS will be described in a separate part.

After this we can see the signals_init function. This function is defined in kernel/signal.c and allocates a cache for the sigqueue structures, which represent the queue of real-time signals. The next function is page_writeback_init. This function initializes the ratio for dirty pages. Every low-level page entry contains the dirty bit, which, when set, indicates that the page has been written to.

Creation of the root for the procfs

After all of these preparations we need to create the root for the proc filesystem. We do it with the call of the proc_root_init function from fs/proc/root.c. At the start of the proc_root_init function we allocate the cache for the inodes and register a new filesystem in the system with:

err = register_filesystem(&proc_fs_type);
if (err)
    return;

As I wrote above, we will not dive into details about VFS and different filesystems in this chapter, but we will see them in the chapter about VFS. After we've registered a new filesystem in our system, we call the proc_self_init function from fs/proc/self.c, and this function allocates an inode number for self (the /proc/self directory refers to the process accessing the /proc filesystem). The next step after proc_self_init is proc_setup_thread_self, which sets up the /proc/thread-self directory, which contains information about the current thread. After this we create the /proc/mounts symlink, which points to self/mounts and so lists the mount points, with the call of

proc_symlink("mounts", NULL, "self/mounts");

and a couple of directories, depending on different configuration options:

#ifdef CONFIG_SYSVIPC
    proc_mkdir("sysvipc", NULL);
#endif
    proc_mkdir("fs", NULL);
    proc_mkdir("driver", NULL);
    proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
    proc_mkdir("openprom", NULL);
#endif
    proc_mkdir("bus", NULL);
    ...
    ...
    ...
    if (!proc_mkdir("tty", NULL))
        return;
    proc_mkdir("tty/ldisc", NULL);
    ...
    ...
    ...

At the end of proc_root_init we call the proc_sys_init function, which creates the /proc/sys directory and initializes sysctl.

It is the end of the start_kernel function. I did not describe all the functions which are called in start_kernel. I skipped them because they are not so important for the generic kernel initialization and depend only on different kernel configurations. They are: taskstats_init_early, which exports per-task statistics to user space; delayacct_init, which initializes per-task delay accounting; key_init and security_init, which initialize different security stuff; check_bugs, which fixes up some architecture-dependent bugs; the ftrace_init function, which initializes ftrace; cgroup_init, which initializes the rest of the cgroup subsystem; etc. Many of these parts and subsystems will be described in other chapters.

That's all. Finally we have passed through the long, long start_kernel function. But it is not the end of the Linux kernel initialization process. We haven't run the first process yet. At the end of start_kernel we can see the last call, to the rest_init function. Let's go ahead.

First steps after the start_kernel

The rest_init function is defined in the same source code file as the start_kernel function, and this file is init/main.c. At the beginning of rest_init we can see calls of the two following functions:

rcu_scheduler_starting();
smpboot_thread_init();

The first, rcu_scheduler_starting, makes the RCU scheduler active, and the second, smpboot_thread_init, registers the smpboot_thread_notifier CPU notifier (you can read more about it in the CPU hotplug documentation). After this we can see the following calls:

kernel_thread(kernel_init, NULL, CLONE_FS);
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

Here the kernel_thread function (defined in kernel/fork.c) creates a new kernel thread. As we can see, the kernel_thread function takes three arguments:

Function which will be executed in a new thread;
Parameter for that function (kernel_init in the first call);
Flags.

We will not dive into details about the kernel_thread implementation (we will see it in the chapter which describes the scheduler; for now we just need to say that kernel_thread invokes clone). Now we only need to know that we create a new kernel thread with the kernel_thread function, that the parent and child of the thread will share information about the filesystem, and that the thread will start to execute the kernel_init function. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two kernel_thread calls we create two new kernel threads, with PID = 1 for the init process and PID = 2 for kthreadd. We already know what the init process is. Let's look at kthreadd. It is a special kernel thread which allows init and different parts of the kernel to create other kernel threads. We can look for it in the output of the ps util:

$ ps -ef | grep kthradd
alex 12866 4767 0 18:26 pts/0 00:00:00 grep kthradd

Let's postpone kernel_init and kthreadd for now and go ahead in rest_init. In the next step, after we have created two new kernel threads, we can see the following code:

rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();

The first function, rcu_read_lock, marks the beginning of an RCU read-side critical section, and rcu_read_unlock marks the end of an RCU read-side critical section. We call these functions because we need to protect find_task_by_pid_ns. find_task_by_pid_ns returns a pointer to the task_struct for the given pid. So here we are getting the pointer to the task_struct for PID = 2 (we got this pid after the kthreadd creation with kernel_thread). In the next step we call the complete function

complete(&kthreadd_done);

and pass the address of kthreadd_done. kthreadd_done is defined as

static __initdata DECLARE_COMPLETION(kthreadd_done);

where the DECLARE_COMPLETION macro is defined as:

#define DECLARE_COMPLETION(work) \
    struct completion work = COMPLETION_INITIALIZER(work)

and expands to the definition of the completion structure. This structure is defined in include/linux/completion.h and represents the completions concept. Completions are a code synchronization mechanism which provides a race-free solution for threads that must wait for some process to reach a point or a specific state. Using completions consists of three parts. The first is the definition of the completion structure, and we did it with DECLARE_COMPLETION. The second is the call of wait_for_completion. After the call of this function, the thread which called it will not continue to execute and will wait until another thread calls the complete function. Note that we call wait_for_completion with kthreadd_done at the beginning of kernel_init_freeable:

wait_for_completion(&kthreadd_done);

The last step is to call the complete function, as we saw above. The kernel_init_freeable function will not proceed until the kthreadd thread has been set up. After kthreadd is set up, we can see the three following functions in rest_init:

init_idle_bootup_task(current);
schedule_preempt_disabled();
cpu_startup_entry(CPUHP_ONLINE);

The first function, init_idle_bootup_task from kernel/sched/core.c, sets the scheduling class for the current process (the idle class in our case):

void init_idle_bootup_task(struct task_struct *idle)
{
    idle->sched_class = &idle_sched_class;
}

where the idle class is for low-priority tasks, which can run only when the processor has nothing else to run. The second function, schedule_preempt_disabled, is called with preemption disabled and invokes the scheduler. The third function, cpu_startup_entry, is defined in kernel/sched/idle.c and calls cpu_idle_loop from the same file. The cpu_idle_loop function works as the process with PID = 0 and works in the background. The main purpose of cpu_idle_loop is to consume the idle CPU cycles. When there is no process to run, this process starts to work. We have one process with the idle scheduling class (we just set the current task to idle with the call of the init_idle_bootup_task function), so the idle thread does no useful work and only checks whether there is an active task to switch to:

static void cpu_idle_loop(void)
{
    ...
    ...
    ...
    while (1) {
        while (!need_resched()) {
            ...
            ...
            ...
        }
        ...
    }
}

More about it will be in the chapter about the scheduler. So at this moment start_kernel calls the rest_init function, which spawns an init process (the kernel_init function) and becomes the idle process itself. Now it is time to look at kernel_init. Execution of the kernel_init function starts with the call of the kernel_init_freeable function. The kernel_init_freeable function first of all waits for the completion of the kthreadd setup. I already wrote about it above:

wait_for_completion(&kthreadd_done);

After this we set gfp_allowed_mask to __GFP_BITS_MASK, which means that the system is already running; set the allowed cpus/mems to all CPUs and NUMA nodes with the set_mems_allowed function; allow the init process to run on any CPU with set_cpus_allowed_ptr; set the pid for cad, or Ctrl-Alt-Delete; do preparation for booting the other CPUs with the call of smp_prepare_cpus; call early initcalls with do_pre_smp_initcalls; initialize SMP with smp_init; initialize the lockup detector with the call of lockup_detector_init; and initialize the scheduler with sched_init_smp.

After this we can see the call of the do_basic_setup function. By the time we call do_basic_setup, our kernel is already initialized. As the comment says:

Now we can finally start doing some real work..

do_basic_setup reinitializes the cpuset to the active CPUs, initializes khelper (a kernel thread which is used for making calls out to userspace from within the kernel), initializes tmpfs, initializes the drivers subsystem, enables the user-mode helper workqueue and makes the post-early call of the initcalls. After do_basic_setup we can see the opening of /dev/console and the duplication of file descriptor 0 twice, which gives file descriptors 0 to 2:

if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
    pr_err("Warning: unable to open an initial console.\n");

(void) sys_dup(0);
(void) sys_dup(0);

We are using two system calls here: sys_open and sys_dup. In the next chapters we will see the explanation and implementation of different system calls. After we have opened the initial console, we check whether the rdinit= option was passed on the kernel command line; otherwise we set the default path of the ramdisk init program:

if (!ramdisk_execute_command)
    ramdisk_execute_command = "/init";

We check the user's permissions for the ramdisk init program and, if it is not accessible, call the prepare_namespace function from init/do_mounts.c, which checks and mounts the initrd:

if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
    ramdisk_execute_command = NULL;
    prepare_namespace();
}

This is the end of the kernel_init_freeable function, and we need to return to kernel_init. The next step after kernel_init_freeable has finished its execution is async_synchronize_full. This function waits until all asynchronous function calls have been done. After it we call free_initmem, which releases all memory occupied by the initialization stuff located between __init_begin and __init_end. After this we protect .rodata with mark_rodata_ro and update the state of the system from SYSTEM_BOOTING to SYSTEM_RUNNING:

system_state = SYSTEM_RUNNING;

And try to run the init process:

if (ramdisk_execute_command) {
    ret = run_init_process(ramdisk_execute_command);
    if (!ret)
        return 0;
    pr_err("Failed to execute %s (error %d)\n",
           ramdisk_execute_command, ret);
}


First of all it checks ramdisk_execute_command, which we set in the kernel_init_freeable function; it is equal to the value of the rdinit= kernel command line parameter, or /init by default. The run_init_process function fills the first element of the argv_init array:

static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };

which represents the arguments of the init program, and calls the do_execve function:

argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),
    (const char __user *const __user *)argv_init,
    (const char __user *const __user *)envp_init);

The do_execve function is declared in include/linux/sched.h (and implemented in fs/exec.c) and runs a program with the given file name and arguments. If we did not pass the rdinit= option on the kernel command line, the kernel checks execute_command, which is equal to the value of the init= kernel command line parameter:

if(execute_command){

ret=run_init_process(execute_command);

if(!ret)

return0;

panic("Requestedinit%sfailed(error%d).",

execute_command,ret);

}

Ifwedidnotpassinit=kernelcommandlineparametertoo,kerneltriestorunoneofthefollowingexecutablefiles:

if(!try_to_run_init_process("/sbin/init")||

!try_to_run_init_process("/etc/init")||

!try_to_run_init_process("/bin/init")||

!try_to_run_init_process("/bin/sh"))

return0;

Inotherwaywefinishwithpanic:

panic("Noworkinginitfound.Trypassinginit=optiontokernel."

"SeeLinuxDocumentation/init.txtforguidance.");

That'sall!Linuxkernelinitializationprocessisfinished!
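The order in which the kernel looks for an init program, as described above, can be sketched in plain userspace C. This is only a model under stated assumptions: pick_init, try_exec_fn and only_bin_sh are hypothetical names invented for this illustration, and the callback stands in for run_init_process.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns 0 when the program at `path` could be started. */
typedef int (*try_exec_fn)(const char *path);

/*
 * Models the lookup order: rdinit= first, then init=, then the built-in list.
 * Returns the path that was started, or NULL where the kernel would panic.
 */
const char *pick_init(const char *rdinit, const char *init, try_exec_fn try_exec)
{
    static const char *const fallbacks[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
    };
    size_t i;

    /* A failing rdinit= only logs an error; the search continues. */
    if (rdinit && try_exec(rdinit) == 0)
        return rdinit;

    /* A failing init= is fatal: the fallback list is not consulted. */
    if (init)
        return try_exec(init) == 0 ? init : NULL;

    for (i = 0; i < sizeof(fallbacks) / sizeof(fallbacks[0]); i++)
        if (try_exec(fallbacks[i]) == 0)
            return fallbacks[i];

    return NULL;
}

/* Sample callback for experiments: pretends only /bin/sh exists. */
int only_bin_sh(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}
```

With the sample callback, pick_init(NULL, NULL, only_bin_sh) walks the whole fallback list and ends at /bin/sh, while passing a broken init= value yields NULL, which corresponds to the "Requested init ... failed" panic.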

Conclusion

It is the end of the tenth part about the Linux kernel initialization process. It is not only the tenth part, but also the last part which describes initialization of the Linux kernel. As I wrote in the first part of this chapter, we would go through all steps of the kernel initialization, and we did it. We started at the first architecture-independent function, start_kernel, and finished with the launch of the first init process in our system. I skipped details about different subsystems of the kernel; for example, I almost did not cover the Linux kernel scheduler, and we saw almost nothing about interrupts, exception handling and so on. From the next part we will start to dive into the different kernel subsystems. I hope it will be interesting.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you will find any

Interrupts and Interrupt Handling

You will find a couple of posts which describe interrupts and exceptions handling in the Linux kernel.

Interrupts and Interrupt Handling. Part 1. - describes interrupt handling theory.

Start to dive into interrupts in the Linux kernel - this part starts to describe interrupts and exceptions handling related stuff from the early stage.

Early interrupt handlers - third part describes early interrupt handlers.

Interrupt handlers - fourth part describes first non-early interrupt handlers.

Implementation of exception handlers - describes implementation of some exception handlers such as double fault, divide by zero, etc.

Handling Non-Maskable interrupts - describes handling of non-maskable interrupts and the rest of the interrupt handlers from the architecture-specific part.

Dive into external hardware interrupts - this part describes early initialization of code which is related to handling of external hardware interrupts.

Non-early initialization of the IRQs - this part describes non-early initialization of code which is related to handling of external hardware interrupts.

Softirq, Tasklets and Workqueues - this part describes softirqs, tasklets and workqueues concepts.

Last part - this is the last part of the interrupts and interrupt handling chapter and here we will see a real hardware driver and interrupts related stuff.

Interrupts and Interrupt Handling. Part 1.

Introduction

This is the first part of the new chapter of the linux insides book. We have come a long way in the previous chapter of this book. We started from the earliest steps of kernel initialization and finished with the launch of the first init process. Yes, we saw several initialization steps which are related to the various kernel subsystems. But we did not dig deep into the details of these subsystems. With this chapter, we will try to understand how the various kernel subsystems work and how they are implemented. As you can already understand from the chapter's title, the first subsystem will be interrupts.

What is an Interrupt?

We have already heard of the word interrupt in several parts of this book. We even saw a couple of examples of interrupt handlers. In the current chapter we will start from the theory, i.e.

What are interrupts?

What are interrupt handlers?

We will then continue to dig deeper into the details of interrupts and how the Linux kernel handles them.

So, first of all, what is an interrupt? An interrupt is an event which is raised by software or hardware when it needs the CPU's attention. For example, we press a button on the keyboard and what do we expect next? What should the operating system and computer do after this? To simplify matters, assume that each peripheral device has an interrupt line to the CPU. A device can use it to signal an interrupt to the CPU. However, interrupts are not signaled directly to the CPU. In old machines there was a PIC, a chip responsible for sequentially processing multiple interrupt requests from multiple devices. In new machines there is an Advanced Programmable Interrupt Controller, commonly known as the APIC. An APIC consists of two separate devices:

Local APIC

I/O APIC

The first, the Local APIC, is located on each CPU core. The local APIC is responsible for handling the CPU-specific interrupt configuration. The local APIC is usually used to manage interrupts from the APIC timer, thermal sensor and any other such locally connected I/O devices.

The second, the I/O APIC, provides multi-processor interrupt management. It is used to distribute external interrupts among the CPU cores. More about the local and I/O APICs will be covered later in this chapter. As you can understand, interrupts can occur at any time. When an interrupt occurs, the operating system must handle it immediately. But what does it mean to handle an interrupt? When an interrupt occurs, the operating system must ensure the following steps:

The kernel must pause execution of the current process (preempt the current task);

The kernel must search for the handler of the interrupt and transfer control (execute the interrupt handler);

After the interrupt handler completes execution, the interrupted process can resume execution.

Of course there are numerous intricacies involved in this procedure of handling interrupts. But the above three steps form the basic skeleton of the procedure.

Addresses of each of the interrupt handlers are maintained in a special location referred to as the Interrupt Descriptor Table or IDT. The processor uses a unique number for recognizing the type of interruption or exception. This number is called the vector number. A vector number is an index in the IDT. There is a limited number of vector numbers: a vector number can be from 0 to 255. You can note the following range check on the vector number within the Linux kernel source code:

    BUG_ON((unsigned)n > 0xFF);

You can find this check within the Linux kernel source code related to interrupt setup (e.g. set_intr_gate and set_system_intr_gate in arch/x86/include/asm/desc.h). The first 32 vector numbers, from 0 to 31, are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - Early interrupt and exception handling. Vector numbers from 32 to 255 are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.
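The vector ranges described above can be written down as a few small predicates. This is a minimal sketch, not kernel code: the function names and the two constants are chosen here only for illustration, and the first predicate simply mirrors the BUG_ON range check.

```c
#include <stdbool.h>

#define MODEL_NR_VECTORS            256 /* vectors 0..255 */
#define MODEL_FIRST_EXTERNAL_VECTOR 32  /* first vector not reserved by the processor */

/* Same condition as BUG_ON((unsigned)n > 0xFF), inverted: true when n is usable. */
bool vector_is_valid(int n)
{
    return (unsigned)n <= 0xFF;
}

/* Vectors 0..31: architecture-defined exceptions and interrupts. */
bool vector_is_reserved(int n)
{
    return vector_is_valid(n) && n < MODEL_FIRST_EXTERNAL_VECTOR;
}

/* Vectors 32..255: user-defined, typically assigned to external I/O devices. */
bool vector_is_user_defined(int n)
{
    return vector_is_valid(n) && n >= MODEL_FIRST_EXTERNAL_VECTOR;
}
```

Note that the cast to unsigned makes negative values fail the check as well, because they wrap around to very large numbers.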

Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into two major classes:

External or hardware generated interrupts;

Software-generated interrupts.

The first, external interrupts, are received through the Local APIC or pins on the processor which are connected to the Local APIC. The second, software-generated interrupts, are caused by an exceptional condition in the processor itself (sometimes using special architecture-specific instructions). A common example of an exceptional condition is division by zero. Another example is exiting a program with the syscall instruction.

As mentioned earlier, an interrupt can occur at any time for a reason which the code and CPU have no control over. On the other hand, exceptions are synchronous with program execution and can be classified into three categories:

Faults

Traps

Aborts

A fault is an exception reported before the execution of a "faulty" instruction (which can then be corrected). If corrected, it allows the interrupted program to be resumed.

Next, a trap is an exception which is reported immediately following the execution of the trap instruction. Traps also allow the interrupted program to be continued, just as a fault does.

Finally, an abort is an exception that does not always report the exact instruction which caused the exception and does not allow the interrupted program to be resumed.
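To make the three categories more concrete, here is a small sketch which classifies a handful of well-known x86 exception vectors. The vector assignments are taken from the x86 architecture manuals, not from the text above, and the enum and function names are invented for this example. Only a few vectors are listed; everything else is reported as unknown.

```c
enum exception_class {
    EXC_FAULT,   /* reported before the faulting instruction; can be restarted */
    EXC_TRAP,    /* reported after the trapping instruction; execution continues */
    EXC_ABORT,   /* the interrupted program cannot be resumed */
    EXC_UNKNOWN  /* not covered by this sketch */
};

enum exception_class classify_exception(unsigned vector)
{
    switch (vector) {
    case 0:   /* #DE, divide error */
    case 6:   /* #UD, invalid opcode */
    case 13:  /* #GP, general protection */
    case 14:  /* #PF, page fault */
        return EXC_FAULT;
    case 3:   /* #BP, breakpoint */
    case 4:   /* #OF, overflow */
        return EXC_TRAP;
    case 8:   /* #DF, double fault */
    case 18:  /* #MC, machine check */
        return EXC_ABORT;
    default:
        return EXC_UNKNOWN;
    }
}

/* Faults and traps allow the interrupted program to continue; aborts do not. */
int program_can_resume(enum exception_class c)
{
    return c == EXC_FAULT || c == EXC_TRAP;
}
```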

Also, we already know from the previous part that interrupts can be classified as maskable and non-maskable. Maskable interrupts are interrupts which can be blocked and unblocked with the following two x86_64 instructions: cli and sti. We can find them in the Linux kernel source code:

    static inline void native_irq_disable(void)
    {
        asm volatile("cli": : :"memory");
    }

and

    static inline void native_irq_enable(void)
    {
        asm volatile("sti": : :"memory");
    }

These two instructions modify the IF flag bit within the flags register. The sti instruction sets the IF flag and the cli instruction clears this flag. Non-maskable interrupts are always reported. Usually any failure in the hardware is mapped to such non-maskable interrupts.
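The effect of cli and sti on the IF flag can be modelled on an ordinary integer, without touching the real CPU state. The sketch below assumes only that IF is bit 9 of the flags register; the function names are made up for the illustration and nothing here executes a privileged instruction.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAGS_IF (UINT64_C(1) << 9) /* IF is bit 9 of EFLAGS/RFLAGS */

/* What cli does to the flags value: clear IF, maskable interrupts are blocked. */
uint64_t model_cli(uint64_t flags)
{
    return flags & ~FLAGS_IF;
}

/* What sti does to the flags value: set IF, maskable interrupts are delivered. */
uint64_t model_sti(uint64_t flags)
{
    return flags | FLAGS_IF;
}

/* Non-maskable interrupts ignore this bit; only maskable ones depend on it. */
bool maskable_irqs_enabled(uint64_t flags)
{
    return (flags & FLAGS_IF) != 0;
}
```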

If multiple exceptions or interrupts occur at the same time, the processor handles them in order of their predefined priorities. The priorities, from the highest to the lowest, are listed in the following table:

+----------------------------------------------------------------+
|              |                                                 |
|   Priority   | Description                                     |
|              |                                                 |
+--------------+-------------------------------------------------+
|              | Hardware Reset and Machine Checks               |
|     1        | - RESET                                         |
|              | - Machine Check                                 |
+--------------+-------------------------------------------------+
|              | Trap on Task Switch                             |
|     2        | - T flag in TSS is set                          |
|              |                                                 |
+--------------+-------------------------------------------------+
|              | External Hardware Interventions                 |
|              | - FLUSH                                         |
|     3        | - STOPCLK                                       |
|              | - SMI                                           |
|              | - INIT                                          |
+--------------+-------------------------------------------------+
|              | Traps on the Previous Instruction               |
|     4        | - Breakpoints                                   |
|              | - Debug Trap Exceptions                         |
+--------------+-------------------------------------------------+
|     5        | Nonmaskable Interrupts                          |
+--------------+-------------------------------------------------+
|     6        | Maskable Hardware Interrupts                    |
+--------------+-------------------------------------------------+
|     7        | Code Breakpoint Fault                           |
+--------------+-------------------------------------------------+
|     8        | Faults from Fetching Next Instruction           |
|              | Code-Segment Limit Violation                    |
|              | Code Page Fault                                 |
+--------------+-------------------------------------------------+
|              | Faults from Decoding the Next Instruction       |
|              | Instruction length > 15 bytes                   |
|     9        | Invalid Opcode                                  |
|              | Coprocessor Not Available                       |
|              |                                                 |
+--------------+-------------------------------------------------+
|     10       | Faults on Executing an Instruction              |
|              | Overflow                                        |
|              | Bound error                                     |
|              | Invalid TSS                                     |
|              | Segment Not Present                             |
|              | Stack fault                                     |
|              | General Protection                              |
|              | Data Page Fault                                 |
|              | Alignment Check                                 |
|              | x87 FPU Floating-point exception                |
|              | SIMD floating-point exception                   |
|              | Virtualization exception                        |
+--------------+-------------------------------------------------+

Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the Interrupt Descriptor Table. As mentioned earlier, the IDT stores entry points of the interrupt and exception handlers. The IDT is similar in structure to the Global Descriptor Table which we saw in the second part of the Kernel booting process. But of course it has some differences. Instead of descriptors, the IDT entries are called gates. On the x86 architecture it can contain one of the following gates:

Interrupt gates

Task gates

Trap gates

Only long mode interrupt gates and trap gates can be referenced in x86_64. Like the Global Descriptor Table, the Interrupt Descriptor Table is an array of 8-byte gates on x86 and an array of 16-byte gates on x86_64. We can remember from the second part of the Kernel booting process that the Global Descriptor Table must contain a NULL descriptor as its first element. Unlike the Global Descriptor Table, the first element of the Interrupt Descriptor Table may be a gate; a NULL first entry is not mandatory. For example, you may remember that we loaded the Interrupt Descriptor Table with only NULL gates in the earlier part, while transitioning into protected mode:

    /*
     * Set up the IDT
     */
    static void setup_idt(void)
    {
        static const struct gdt_ptr null_idt = {0, 0};
        asm volatile("lidtl %0" : : "m" (null_idt));
    }

from arch/x86/boot/pm.c. The Interrupt Descriptor Table can be located anywhere in the linear address space, and its base address must be aligned on an 8-byte boundary on x86 or a 16-byte boundary on x86_64. The base address of the IDT is stored in a special register, the IDTR. There are two instructions on x86-compatible processors to load and store the IDTR register:

LIDT

SIDT

The first instruction, LIDT, is used to load the base address and limit of the IDT from the specified operand into the IDTR. The second instruction, SIDT, is used to read the contents of the IDTR and store them into the specified operand. The IDTR register is 48 bits wide on x86 and contains the following information:

+-----------------------------------+----------------------+
|                                   |                      |
|     Base address of the IDT       |   Limit of the IDT   |
|                                   |                      |
+-----------------------------------+----------------------+
47                                16 15                    0

Looking at the implementation of setup_idt, we have prepared a null_idt and loaded it into the IDTR register with the lidt instruction. Note that null_idt has the gdt_ptr type, which is defined as:

    struct gdt_ptr {
        u16 len;
        u32 ptr;
    } __attribute__((packed));

Here we can see the definition of the structure with two fields of 2 bytes and 4 bytes (a total of 48 bits), as in the diagram. Now let's look at the structure of the IDT entries. The IDT is an array of 16-byte entries, which are called gates, in x86_64. They have the following structure:

127                                                                            96
+-------------------------------------------------------------------------------+
|                                                                               |
|                                   Reserved                                    |
|                                                                               |
+-------------------------------------------------------------------------------+
95                                                                             64
+-------------------------------------------------------------------------------+
|                                                                               |
|                                 Offset 63..32                                 |
|                                                                               |
+-------------------------------------------------------------------------------+
63                               48  47    46   44   42     39             34  32
+-------------------------------------------------------------------------------+
|                                  |     |     |   |      |       |   |   |     |
|          Offset 31..16           |  P  | DPL | 0 | Type | 0 0 0 | 0 | 0 | IST |
|                                  |     |     |   |      |       |   |   |     |
+-------------------------------------------------------------------------------+
31                                    16 15                                     0
+-------------------------------------------------------------------------------+
|                                       |                                       |
|           Segment Selector            |             Offset 15..0              |
|                                       |                                       |
+-------------------------------------------------------------------------------+

To form an index into the IDT, the processor scales the exception or interrupt vector by sixteen. The processor handles the occurrence of exceptions and interrupts just like it handles calls of a procedure when it sees the call instruction. The processor uses the unique vector number of the interrupt or the exception as the index to find the necessary Interrupt Descriptor Table entry. Now let's take a closer look at an IDT entry.

As we can see, the IDT entry in the diagram consists of the following fields:

0-15 bits - low part of the offset, within the selected segment, of the entry point of the interrupt handler;

16-31 bits - segment selector of the code segment which contains the entry point of the interrupt handler;

IST - a new special mechanism in x86_64, which we will see later;

DPL - Descriptor Privilege Level;

P - Segment Present flag;

48-63 bits - second part of the handler base address;

64-95 bits - third part of the base address of the handler;

96-127 bits - the last bits, reserved by the CPU.

The Type field describes the type of the IDT entry. There are three different kinds of handlers for interrupts:

Interrupt gate

Trap gate

Task gate

The IST or Interrupt Stack Table is a new mechanism in x86_64. It is used as an alternative to the legacy stack-switch mechanism. Previously, the x86 architecture provided a mechanism to automatically switch stack frames in response to an interrupt. The IST is a modified version of the x86 stack switching mode. This mechanism unconditionally switches stacks when it is enabled, and it can be enabled for any interrupt in the IDT entry related to that interrupt (we will soon see it). From this we can understand that the IST is not necessary for all interrupts. Some interrupts can continue to use the legacy stack switching mode. The IST mechanism provides up to seven IST pointers in the Task State Segment or TSS, which is a special structure which contains information about a process. The TSS is used for stack switching during the execution of an interrupt or exception handler in the Linux kernel. Each pointer is referenced by an interrupt gate from the IDT.

The Interrupt Descriptor Table is represented by an array of gate_desc structures:

    extern gate_desc idt_table[];

where gate_desc is:

    #ifdef CONFIG_X86_64
    ...
    ...
    ...
    typedef struct gate_struct64 gate_desc;
    ...
    ...
    ...
    #endif

and gate_struct64 is defined as:

    struct gate_struct64 {
        u16 offset_low;
        u16 segment;
        unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
        u16 offset_middle;
        u32 offset_high;
        u32 zero1;
    } __attribute__((packed));
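To see how a 64-bit handler address is spread over the three offset fields of such a gate, here is a userspace model of the structure. It is a sketch under assumptions: it relies on a GCC-compatible compiler for the packed bitfield layout, and gate64, pack_gate, gate_offset and gate_address are names invented for this example, not kernel functions.

```c
#include <stdint.h>

/* Userspace copy of the 16-byte gate layout shown above (illustration only). */
struct gate64 {
    uint16_t offset_low;
    uint16_t segment;
    unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
    uint16_t offset_middle;
    uint32_t offset_high;
    uint32_t zero1;
} __attribute__((packed));

/* Split the handler address into the low, middle and high offset fields. */
struct gate64 pack_gate(uint64_t handler, uint16_t selector,
                        unsigned type, unsigned dpl, unsigned ist)
{
    struct gate64 g = {0};

    g.offset_low    = (uint16_t)(handler & 0xFFFF);
    g.segment       = selector;
    g.ist           = ist;
    g.type          = type;
    g.dpl           = dpl;
    g.p             = 1; /* mark the gate as present */
    g.offset_middle = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high   = (uint32_t)(handler >> 32);
    return g;
}

/* Reassemble the handler address from the three offset fields. */
uint64_t gate_offset(const struct gate64 *g)
{
    return ((uint64_t)g->offset_high << 32) |
           ((uint64_t)g->offset_middle << 16) |
           g->offset_low;
}

/* The processor scales the vector by the gate size (sixteen bytes). */
uint64_t gate_address(uint64_t idt_base, unsigned vector)
{
    return idt_base + (uint64_t)vector * sizeof(struct gate64);
}
```

For example, packing a handler address with type 0xE (the value used for interrupt gates) and reading it back with gate_offset returns the original address, and vector 14 is found 224 bytes after the IDT base.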

Each active thread has a large stack in the Linux kernel for the x86_64 architecture. The stack size is defined as THREAD_SIZE and is equal to:

    #define PAGE_SHIFT 12
    #define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
    ...
    ...
    ...
    #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
    #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

PAGE_SIZE is 4096 bytes and THREAD_SIZE_ORDER depends on KASAN_STACK_ORDER. As we can see, KASAN_STACK_ORDER depends on the CONFIG_KASAN kernel configuration parameter and is defined as:

    #ifdef CONFIG_KASAN
    #define KASAN_STACK_ORDER 1
    #else
    #define KASAN_STACK_ORDER 0
    #endif

KASan is a runtime memory debugger. So THREAD_SIZE will be 16384 bytes (4096 << 2) if CONFIG_KASAN is disabled or 32768 bytes (4096 << 3) if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user space, the kernel stack is empty except for the thread_info structure (details about this structure are available in the fourth part of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When user space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the interrupt stack used for external hardware interrupts. Its size is determined as follows:

    #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
    #define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)

or 16384 bytes when CONFIG_KASAN is disabled. The per-cpu interrupt stack is represented by the irq_stack_union union in the Linux kernel for x86_64:

    union irq_stack_union {
        char irq_stack[IRQ_STACK_SIZE];

        struct {
            char gs_base[40];
            unsigned long stack_canary;
        };
    };

The first field, irq_stack, is a 16 kilobyte array. You can also see that irq_stack_union contains a structure with two fields:

gs_base - The gs register always points to the bottom of the irqstack union. On x86_64, the gs register is shared by the per-cpu area and the stack canary (you can read more about per-cpu variables in the special part). All per-cpu symbols are zero based and gs points to the base of the per-cpu area. You already know that the segmented memory model is abolished in long mode, but we can set the base address for two segment registers, fs and gs, with the Model specific registers, and these registers can still be used as address registers. If you remember the first part of the Linux kernel initialization process, you can remember that we have set the gs register:

    movl $MSR_GS_BASE,%ecx
    movl initial_gs(%rip),%eax
    movl initial_gs+4(%rip),%edx
    wrmsr

where initial_gs points to the irq_stack_union:

    GLOBAL(initial_gs)
    .quad INIT_PER_CPU_VAR(irq_stack_union)

stack_canary - The stack canary for the interrupt stack is a stack protector to verify that the stack hasn't been overwritten. Note that gs_base is a 40-byte array. GCC requires that the stack canary be at a fixed offset from the base of gs, and its value must be 40 for x86_64 and 20 for x86.
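The size of the interrupt stack and the fixed offset of the canary can be checked with a userspace copy of the union. This is only a sketch: it assumes CONFIG_KASAN is disabled (order 2), uses C11 anonymous structs, and all names carrying the model suffix are invented here so that they do not clash with system headers.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE       4096UL
#define MODEL_IRQ_STACK_ORDER 2 /* assumes CONFIG_KASAN is disabled */
#define MODEL_IRQ_STACK_SIZE  (MODEL_PAGE_SIZE << MODEL_IRQ_STACK_ORDER)

/* Same shape as irq_stack_union: the stack and the canary share one storage. */
union irq_stack_model {
    char irq_stack[MODEL_IRQ_STACK_SIZE];

    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};

/* Whole union: four pages. */
size_t irq_stack_model_size(void)
{
    return sizeof(union irq_stack_model);
}

/* Offset of the canary from the bottom of the union, i.e. from the gs base. */
size_t canary_offset(void)
{
    return offsetof(union irq_stack_model, stack_canary);
}
```

Because gs points to the bottom of the union, the canary is reachable at gs:40, which is the fixed offset mentioned in the text for x86_64.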

The irq_stack_union is the first datum in the per-cpu area; we can see it in the System.map:

    0000000000000000 D __per_cpu_start
    0000000000000000 D irq_stack_union
    0000000000004000 d exception_stacks
    0000000000009000 D gdt_page
    ...
    ...
    ...

We can see its definition in the code:

    DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;

Now it's time to look at the initialization of the irq_stack_union. Besides the irq_stack_union definition, we can see the definition of the following per-cpu variables in arch/x86/include/asm/processor.h:

    DECLARE_PER_CPU(char *, irq_stack_ptr);
    DECLARE_PER_CPU(unsigned int, irq_count);

The first is irq_stack_ptr. From the variable's name, it is obvious that this is a pointer to the top of the stack. The second, irq_count, is used to check if a CPU is already on an interrupt stack or not. Initialization of irq_stack_ptr is located in the setup_per_cpu_areas function in arch/x86/kernel/setup_percpu.c:

    void __init setup_per_cpu_areas(void)
    {
    ...
    ...
    #ifdef CONFIG_X86_64
        for_each_possible_cpu(cpu) {
            ...
            ...
            ...
            per_cpu(irq_stack_ptr, cpu) =
                per_cpu(irq_stack_union.irq_stack, cpu) +
                IRQ_STACK_SIZE - 64;
            ...
            ...
            ...
    #endif
    ...
    ...
    }

Here we go over all the CPUs one by one and set up irq_stack_ptr. This turns out to be equal to the top of the interrupt stack minus 64. Why 64? TODO. The arch/x86/kernel/cpu/common.c source code file contains the following:

    void load_percpu_segment(int cpu)
    {
        ...
        ...
        ...
        loadsegment(gs, 0);
        wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
    }

and as we already know, the gs register points to the bottom of the interrupt stack:

    movl $MSR_GS_BASE,%ecx
    movl initial_gs(%rip),%eax
    movl initial_gs+4(%rip),%edx
    wrmsr

    GLOBAL(initial_gs)
    .quad INIT_PER_CPU_VAR(irq_stack_union)

Here we can see the wrmsr instruction, which loads the data from edx:eax into the Model specific register pointed to by the ecx register. In our case the model specific register is MSR_GS_BASE, which contains the base address of the memory segment pointed to by the gs register. edx:eax holds the value stored at initial_gs, which is the base address of our irq_stack_union.
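The two movl instructions before wrmsr split one 64-bit value into two 32-bit halves: the low half goes to eax and the high half to edx. A small model of that split, with names invented for this illustration, looks like this:

```c
#include <stdint.h>

/* The register pair consumed by wrmsr (and produced by rdmsr). */
struct msr_operands {
    uint32_t eax; /* low 32 bits of the value */
    uint32_t edx; /* high 32 bits of the value */
};

/* Mirrors "movl initial_gs, %eax" and "movl initial_gs+4, %edx". */
struct msr_operands split_for_wrmsr(uint64_t value)
{
    struct msr_operands ops;

    ops.eax = (uint32_t)value;
    ops.edx = (uint32_t)(value >> 32);
    return ops;
}

/* The processor treats edx:eax as one 64-bit quantity. */
uint64_t join_edx_eax(struct msr_operands ops)
{
    return ((uint64_t)ops.edx << 32) | ops.eax;
}
```

The +4 in initial_gs+4(%rip) works because x86 is little-endian: the high 32 bits of the .quad are stored four bytes after the low 32 bits.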

We already know that x86_64 has a feature called Interrupt Stack Table or IST, and this feature provides the ability to switch to a new stack for events such as a non-maskable interrupt, double fault, etc. There can be up to seven IST entries per CPU. Some of them are:

DOUBLEFAULT_STACK

NMI_STACK

DEBUG_STACK

MCE_STACK

or

    #define DOUBLEFAULT_STACK 1
    #define NMI_STACK 2
    #define DEBUG_STACK 3
    #define MCE_STACK 4

All interrupt-gate descriptors which switch to a new stack with the IST are initialized with the set_intr_gate_ist function. For example:

    set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
    ...
    ...
    ...
    set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);

where &nmi and &double_fault are addresses of the entries to the given interrupt handlers:

    asmlinkage void nmi(void);
    asmlinkage void double_fault(void);

defined in arch/x86/kernel/entry_64.S:

    idtentry double_fault do_double_fault has_error_code=1 paranoid=2
    ...
    ...
    ...
    ENTRY(nmi)
    ...
    ...
    ...
    END(nmi)

When an interrupt or an exception occurs, the new ss selector is forced to NULL and the ss selector's rpl field is set to the new cpl. The old ss, rsp, flags register, cs and rip are pushed onto the new stack. In 64-bit mode, the size of interrupt stack-frame pushes is fixed at 8 bytes, so we will get the following stack:

+---------------+
|               |
|      SS       | 40
|      RSP      | 32
|     RFLAGS    | 24
|      CS       | 16
|      RIP      | 8
|   Error code  | 0
|               |
+---------------+

If the IST field in the interrupt gate is not 0, we read the IST pointer into rsp. If the interrupt vector number has an error code associated with it, we then push the error code onto the stack. If the interrupt vector number has no error code, we go ahead and push a dummy error code onto the stack. We need to do this to ensure stack consistency. Next, we load the segment-selector field from the gate descriptor into the CS register and must verify that the target code segment is a 64-bit mode code segment by checking bit 21, i.e. the L bit in the Global Descriptor Table. Finally we load the offset field from the gate descriptor into rip, which will be the entry point of the interrupt handler. After this the interrupt handler begins to execute. After an interrupt handler finishes its execution, it must return control to the interrupted process with the iret instruction. The iret instruction unconditionally pops the stack pointer (ss:rsp) to restore the stack of the interrupted process and does not depend on the cpl change.
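The stack layout from the diagram above maps directly onto a C structure with six 8-byte slots, the lowest address first. The sketch below is a model for checking the offsets; the structure and function names are invented for this illustration, and the helper assumes a stack top that is already 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Lowest address first: the error code sits at offset 0, SS at offset 40. */
struct irq_frame_model {
    uint64_t error_code; /* real or dummy error code */
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;        /* stack pointer of the interrupted code */
    uint64_t ss;
};

/* Value of rsp after the whole frame was pushed below stack_top. */
uint64_t rsp_after_push(uint64_t stack_top)
{
    return stack_top - sizeof(struct irq_frame_model);
}

/* Offset of the saved SS slot, matching the number 40 in the diagram. */
size_t frame_offset_of_ss(void)
{
    return offsetof(struct irq_frame_model, ss);
}
```

Six pushes of 8 bytes give a 48-byte frame, so a handler entered with the stack top at 0x10000 starts with rsp equal to 0x10000 - 48.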

That's all.

Conclusion

It is the end of the first part about interrupts and interrupt handling in the Linux kernel. We saw some theory and the first steps of the initialization of stuff related to interrupts and exceptions. In the next part we will continue to dive into interrupts and interrupt handling, into the more practical aspects of it.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

PIC

Advanced Programmable Interrupt Controller

protected mode

long mode

kernel stacks

Task State Segment

segmented memory model

Model specific registers

Stack canary

Previous chapter

Interrupts and Interrupt Handling. Part 2.

Start to dive into interrupt and exceptions handling in the Linux kernel

We saw some theory about interrupts and exception handling in the previous part and, as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you may have noted, the previous part mostly described theoretical aspects, and in this part we will start to dive directly into the Linux kernel source code. We will do it as we did in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest code lines as we saw it, for example, in the Linux kernel booting process chapter, but we will start from the earliest code which is related to interrupts and exceptions. In this part we will try to go through all the interrupt and exception related stuff which we can find in the Linux kernel source code.

If you've read the previous parts, you can remember that the earliest place in the Linux kernel x86_64 architecture-specific source code which is related to interrupts is located in the arch/x86/boot/pm.c source code file and represents the first setup of the Interrupt Descriptor Table. It occurs right before the transition into protected mode in the go_to_protected_mode function, by the call of setup_idt:

    void go_to_protected_mode(void)
    {
        ...
        setup_idt();
        ...
    }

The setup_idt function is defined in the same source code file as the go_to_protected_mode function and just loads the address of the NULL interrupt descriptor table:

    static void setup_idt(void)
    {
        static const struct gdt_ptr null_idt = {0, 0};
        asm volatile("lidtl %0" : : "m" (null_idt));
    }

where gdt_ptr represents the special 48-bit GDTR register, which must contain the base address of the Global Descriptor Table:

    struct gdt_ptr {
        u16 len;
        u32 ptr;
    } __attribute__((packed));

Of course in our case gdt_ptr does not represent the GDTR register but the IDTR, since we set up the Interrupt Descriptor Table. You will not find an idt_ptr structure, because if it had been in the Linux kernel source code, it would have been the same as gdt_ptr but with a different name. So, as you can understand, there is no sense in having two similar structures which differ only by name. You can note here that we do not fill the Interrupt Descriptor Table with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the IDT with NULL.

After the setup of the Interrupt Descriptor Table, Global Descriptor Table and other stuff, we jump into protected mode in arch/x86/boot/pmjump.S. You can read more about it in the part which describes the transition to protected mode.
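The gdt_ptr structure loaded by setup_idt has to be exactly 48 bits wide, and this is why it carries the packed attribute. The following sketch compares a packed and an unpacked variant in userspace; it assumes a GCC-compatible compiler, and the structure and function names are invented for the comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as gdt_ptr: 16-bit limit followed directly by a 32-bit base. */
struct desc_ptr_model {
    uint16_t len;
    uint32_t ptr;
} __attribute__((packed));

/* Without the attribute the compiler may insert padding after len. */
struct desc_ptr_unpacked {
    uint16_t len;
    uint32_t ptr;
};

/* 2 + 4 bytes: the 48 bits expected by lidt in 32-bit code. */
size_t packed_size(void)
{
    return sizeof(struct desc_ptr_model);
}

/* The base address starts right after the limit, at byte 2. */
size_t packed_base_offset(void)
{
    return offsetof(struct desc_ptr_model, ptr);
}

size_t unpacked_size(void)
{
    return sizeof(struct desc_ptr_unpacked);
}
```

On common ABIs the unpacked variant is larger than six bytes because ptr is aligned to four bytes, so the processor would read the base address from the wrong place.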

We already know from the earliest parts that the entry to protected mode is located in boot_params.hdr.code32_start, and you can see that we pass the protected mode entry point and boot_params to protected_mode_jump at the end of arch/x86/boot/pm.c:

    protected_mode_jump(boot_params.hdr.code32_start,
                        (u32)&boot_params + (ds() << 4));

protected_mode_jump is defined in arch/x86/boot/pmjump.S and gets these two parameters in the ax and dx registers using one of the 8086 calling conventions:

    GLOBAL(protected_mode_jump)
    ...
    ...
    ...
        .byte 0x66, 0xea        # ljmpl opcode
    2:  .long in_pm32           # offset
        .word __BOOT_CS         # segment
    ...
    ...
    ...
    ENDPROC(protected_mode_jump)

where in_pm32 contains a jump to the 32-bit entry point:

    GLOBAL(in_pm32)
    ...
    ...
        jmpl *%eax // %eax contains address of the `startup_32`
    ...
    ...
    ENDPROC(in_pm32)

As you can remember, the 32-bit entry point is in the arch/x86/boot/compressed/head_64.S assembly file, although it contains _64 in its name. We can see two similar files in the arch/x86/boot/compressed directory:

arch/x86/boot/compressed/head_32.S;

arch/x86/boot/compressed/head_64.S.

But the 32-bit mode entry point is in the second file in our case. The first file is not even compiled for x86_64. Let's look at arch/x86/boot/compressed/Makefile:

    vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
    ...
    ...

We can see here that head_* depends on the $(BITS) variable, which depends on the architecture. You can find it in arch/x86/Makefile:

    ifeq ($(CONFIG_X86_32),y)
    ...
        BITS := 32
    else
        BITS := 64
    ...
    endif

Now that we have jumped to startup_32 from arch/x86/boot/compressed/head_64.S, we will not find anything related to interrupt handling here. startup_32 contains code that makes preparations before the transition into long mode and directly jumps into it. The long mode entry is located in startup_64, and it makes preparations before the kernel decompression that occurs in decompress_kernel from arch/x86/boot/compressed/misc.c. After the kernel is decompressed, we jump to startup_64 from arch/x86/kernel/head_64.S. In startup_64 we start to build identity-mapped pages. After we have built identity-mapped pages, checked the NX bit, set up the Extended Feature Enable Register (see in links), and updated the early Global Descriptor Table with the lgdt instruction, we need to set up the gs register with the following code:

    movl $MSR_GS_BASE,%ecx
    movl initial_gs(%rip),%eax
    movl initial_gs+4(%rip),%edx
    wrmsr

We already saw this code in the previous part. First of all pay attention to the last wrmsr instruction. This instruction writes data from the edx:eax registers to the model specific register specified by the ecx register. We can see that ecx contains $MSR_GS_BASE, which is declared in arch/x86/include/uapi/asm/msr-index.h and looks like:

    #define MSR_GS_BASE 0xc0000101

From this we can understand that MSR_GS_BASE defines the number of the model specific register. Since the cs, ds, es, and ss registers are not used in 64-bit mode, their fields are ignored. But we can access memory over the fs and gs registers. The model specific register provides a back door to the hidden parts of these segment registers and allows us to use a 64-bit base address for the segment register addressed by fs and gs. So MSR_GS_BASE is the hidden part, and this part is mapped onto the GS.base field. Let's look at initial_gs:

    GLOBAL(initial_gs)
    .quad INIT_PER_CPU_VAR(irq_stack_union)

We pass the irq_stack_union symbol to the INIT_PER_CPU_VAR macro, which just concatenates the init_per_cpu__ prefix with the given symbol. In our case we will get the init_per_cpu__irq_stack_union symbol. Let's look at the linker script. There we can see the following definition:

    #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
    INIT_PER_CPU(irq_stack_union);

It tells us that the address of init_per_cpu__irq_stack_union will be irq_stack_union + __per_cpu_load. Now we need to understand where init_per_cpu__irq_stack_union and __per_cpu_load are and what they mean. The first, irq_stack_union, is defined in arch/x86/include/asm/processor.h with the DECLARE_INIT_PER_CPU macro, which expands to a call of the init_per_cpu_var macro:

    DECLARE_INIT_PER_CPU(irq_stack_union);

    #define DECLARE_INIT_PER_CPU(var) \
        extern typeof(per_cpu_var(var)) init_per_cpu_var(var)

    #define init_per_cpu_var(var) init_per_cpu__##var
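The ## operator used by these macros simply glues two tokens into one identifier. The sketch below repeats the same trick in userspace: a macro invented for this example declares a variable whose name is built by token pasting, and a small function models the linker script assignment init_per_cpu__x = x + __per_cpu_load. The offset of zero is an assumption taken from the System.map excerpt shown earlier, where irq_stack_union is the first datum of the per-cpu area.

```c
#include <stdint.h>

/* Same token-pasting idea as init_per_cpu_var() in the text. */
#define INIT_PER_CPU_NAME(var) init_per_cpu__##var

/* Offset of irq_stack_union inside the per-cpu area (assumed to be zero). */
static const uint64_t irq_stack_union_offset = 0;

/* Expands to: static uint64_t init_per_cpu__irq_stack_union; */
static uint64_t INIT_PER_CPU_NAME(irq_stack_union);

/* Models "init_per_cpu__irq_stack_union = irq_stack_union + __per_cpu_load". */
uint64_t model_link(uint64_t per_cpu_load)
{
    init_per_cpu__irq_stack_union = irq_stack_union_offset + per_cpu_load;
    return init_per_cpu__irq_stack_union;
}
```

The function can refer to init_per_cpu__irq_stack_union directly, although that name never appears in the declaration; this is the identifier the preprocessor produced.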

If we expand all macros we will get the same init_per_cpu__irq_stack_union as we got after expanding the INIT_PER_CPU macro, but note that it is not just a symbol, but a variable. Let's look at the typeof(per_cpu_var(var)) expression. Our var is irq_stack_union and the per_cpu_var macro is defined in arch/x86/include/asm/percpu.h:

    #define PER_CPU_VAR(var) %__percpu_seg:var

where:

    #ifdef CONFIG_X86_64
    #define __percpu_seg gs
    #endif

So, we are accessing gs:irq_stack_union and getting its type, which is union irq_stack_union. OK, we defined the first variable and know its address; now let's look at the second symbol, __per_cpu_load. There are a couple of per-cpu variables which are located after this symbol. __per_cpu_load is declared in include/asm-generic/sections.h:

    extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];

and represents the base address of the per-cpu variables from the data area. So, we know the address of irq_stack_union and __per_cpu_load, and we know that init_per_cpu__irq_stack_union must be placed right after __per_cpu_load. And we can see it in the System.map:

    ...
    ...
    ...
    ffffffff819ed000 D __init_begin
    ffffffff819ed000 D __per_cpu_load
    ffffffff819ed000 A init_per_cpu__irq_stack_union
    ...
    ...
    ...

Now we know about initial_gs, so let's look at the code:

    movl $MSR_GS_BASE,%ecx
    movl initial_gs(%rip),%eax
    movl initial_gs+4(%rip),%edx
    wrmsr

Here we specified a model specific register with MSR_GS_BASE, put the 64-bit address of initial_gs into the edx:eax pair and executed the wrmsr instruction to fill the gs register with the base address of init_per_cpu__irq_stack_union, which will be at the bottom of the interrupt stack. After this we jump to the C code at x86_64_start_kernel from arch/x86/kernel/head64.c. In the x86_64_start_kernel function we do the last preparations before we jump into the generic and architecture-independent kernel code, and one of these preparations is filling the early Interrupt Descriptor Table with the interrupt handler entries, or early_idt_handlers. You can remember it if you have read the part about Early interrupt and exception handling and can remember the following code:

    for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
        set_intr_gate(i, early_idt_handlers[i]);

    load_idt((const struct desc_ptr *)&idt_descr);

but I wrote the Early interrupt and exception handling part when the Linux kernel version was 3.18. Today the actual version of the Linux kernel is 4.1.0-rc6+, and Andy Lutomirski sent a patch, which will soon be in the mainline kernel, that changes the behaviour of early_idt_handlers. NOTE: while I was writing this part, the patch already landed in the Linux kernel source code. Let's look at it. Now the same part looks like:

for(i=0;i<NUM_EXCEPTION_VECTORS;i++)

set_intr_gate(i,early_idt_handler_array[i]);

load_idt((conststructdesc_ptr*)&idt_descr);

As you can see, the only difference is the name of the array of interrupt handler entry points. Now it is early_idt_handler_array:

extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];

where NUM_EXCEPTION_VECTORS and EARLY_IDT_HANDLER_SIZE are defined as:

#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9

So early_idt_handler_array is an array of interrupt handler entry points and contains one entry point every nine bytes. You may remember that the previous early_idt_handlers was defined in arch/x86/kernel/head_64.S. early_idt_handler_array is defined in the same source code file:

ENTRY(early_idt_handler_array)
...
...
...
ENDPROC(early_idt_handler_common)

It fills early_idt_handler_array with the .rept NUM_EXCEPTION_VECTORS directive and contains the entry of the early_make_pgtable interrupt handler (you can read more about its implementation in the part about early interrupt and exception handling). With this we have come to the end of the x86_64 architecture-specific code; the next part is the generic kernel code. Of course, you may already know that we will return to architecture-specific code in the setup_arch function and other places, but this is the end of the x86_64 early code.

Setting stack canary for the interrupt stack

The next stop after arch/x86/kernel/head_64.S is the big start_kernel function from init/main.c. If you've read the previous chapter about the Linux kernel initialization process, you must remember it. This function does all the initialization work before the kernel launches the first init process with pid 1. The first thing related to interrupt and exception handling is the call of the boot_init_stack_canary function.

This function sets the canary value to protect against interrupt stack overflow. We already saw a few details about the implementation of boot_init_stack_canary in the previous part; now let's take a closer look at it. You can find the implementation of this function in arch/x86/include/asm/stackprotector.h, and it depends on the CONFIG_CC_STACKPROTECTOR kernel configuration option. If this option is not set, this function does nothing:

#ifdef CONFIG_CC_STACKPROTECTOR
...
...
...
#else
static inline void boot_init_stack_canary(void)
{
}
#endif

If the CONFIG_CC_STACKPROTECTOR kernel configuration option is set, the boot_init_stack_canary function starts with a check that stack_canary is located at an offset of forty bytes from the start of irq_stack_union, which represents the per-cpu interrupt stack:

#ifdef CONFIG_X86_64
    BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif

As we read in the previous part, irq_stack_union is represented by the following union:

union irq_stack_union {
    char irq_stack[IRQ_STACK_SIZE];

    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};

which is defined in arch/x86/include/asm/processor.h. We know that a union in the C programming language is a data structure whose members all share the same memory, so only one of them holds a valid value at a time. Here we can see that the structure has a first field, gs_base, which is 40 bytes in size and represents the bottom of irq_stack. So stack_canary sits at offset 40, and our check with the BUILD_BUG_ON macro should pass. (You can read the first part about the Linux kernel initialization process if you're interested in the BUILD_BUG_ON macro.)

After this we calculate a new canary value based on a random number and the Time Stamp Counter:

get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);

and write the canary value to irq_stack_union with the this_cpu_write macro:

this_cpu_write(irq_stack_union.stack_canary, canary);

You can read more about this_cpu_* operations in the Linux kernel documentation.

Disabling/Enabling local interrupts

The next step in init/main.c related to interrupts and interrupt handling, after we have set the canary value for the interrupt stack, is the call of the local_irq_disable macro.

This macro is defined in the include/linux/irqflags.h header file and, as you can guess, it disables interrupts on the current CPU. Let's look at its implementation. First of all, note that it depends on the CONFIG_TRACE_IRQFLAGS_SUPPORT kernel configuration option:

#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
...
#define local_irq_disable() \
    do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
...
#else
...
#define local_irq_disable() do { raw_local_irq_disable(); } while (0)
...
#endif

They are both similar and, as you can see, have only one difference: the local_irq_disable macro contains a call of trace_hardirqs_off when CONFIG_TRACE_IRQFLAGS_SUPPORT is enabled. There is a special feature in the lockdep subsystem, irq-flags tracing, for tracing hardirq and softirq state. In our case the lockdep subsystem can give us interesting information about the hard/soft irq on/off events which occur in the system. The trace_hardirqs_off function is defined in kernel/locking/lockdep.c:

void trace_hardirqs_off(void)
{
    trace_hardirqs_off_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);

and just calls the trace_hardirqs_off_caller function. trace_hardirqs_off_caller checks the hardirqs_enabled field of the current process and increments redundant_hardirqs_off if the call of local_irq_disable was redundant, or hardirqs_off_events if it was not. These two fields and other lockdep statistics fields are defined in kernel/locking/lockdep_internals.h, in the lockdep_stats structure:

struct lockdep_stats {
    ...
    ...
    ...
    int softirqs_off_events;
    int redundant_softirqs_off;
    ...
    ...
    ...
};

If you set the CONFIG_DEBUG_LOCKDEP kernel configuration option, the lockdep_stats_debug_show function will write all tracing information to /proc/lockdep_stats:

static void lockdep_stats_debug_show(struct seq_file *m)
{
#ifdef CONFIG_DEBUG_LOCKDEP
    unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
                       hi2 = debug_atomic_read(hardirqs_off_events),
                       hr1 = debug_atomic_read(redundant_hardirqs_on),
    ...
    ...
    ...
    seq_printf(m, " hardirq on events: %11llu\n", hi1);
    seq_printf(m, " hardirq off events: %11llu\n", hi2);
    seq_printf(m, " redundant hardirq ons: %11llu\n", hr1);
#endif
}

and you can see its result with:

$ sudo cat /proc/lockdep_stats
hardirq on events: 12838248974
hardirq off events: 12838248979
redundant hardirq ons: 67792
redundant hardirq offs: 3836339146
softirq on events: 38002159
softirq off events: 38002187
redundant softirq ons: 0
redundant softirq offs: 0

OK, now we know a little about tracing, but more information will come in a separate part about lockdep and tracing. You can see that both local_irq_disable macros share the same part, raw_local_irq_disable. This macro is defined in arch/x86/include/asm/irqflags.h and expands to a call of:

static inline void native_irq_disable(void)
{
    asm volatile("cli": : :"memory");
}

And you must already remember that the cli instruction clears the IF flag, which determines whether the processor responds to maskable hardware interrupts. Besides local_irq_disable, as you may already know, there is an inverse macro, local_irq_enable. This macro has the same tracing mechanism and is very similar to local_irq_disable but, as you can guess from its name, it enables interrupts with the sti instruction:

static inline void native_irq_enable(void)
{
    asm volatile("sti": : :"memory");
}

Now we know how local_irq_disable and local_irq_enable work. This was the first call of the local_irq_disable macro, but we will meet these macros many times in the Linux kernel source code. For now we are in the start_kernel function from init/main.c and we have just disabled local interrupts. Why local, and why did we do it? Previously the kernel provided a method to disable interrupts on all processors, and it was called cli. This function was removed, and now we have local_irq_{enable,disable} to disable or enable interrupts on the current processor. After we've disabled interrupts with the local_irq_disable macro, we set:

early_boot_irqs_disabled = true;

The early_boot_irqs_disabled variable is declared in include/linux/kernel.h:

extern bool early_boot_irqs_disabled;

and is used in different places. For example, it is used in the smp_call_function_many function from kernel/smp.c to check for a possible deadlock when interrupts are disabled:

WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
             && !oops_in_progress && !early_boot_irqs_disabled);

Early trap initialization during kernel initialization

The next functions after local_irq_disable are boot_cpu_init and page_address_init, but they are not related to interrupts and exceptions (you can read more about these functions in the chapter about the Linux kernel initialization process). Next is the setup_arch function. As you may remember, this function is located in the arch/x86/kernel/setup.c source code file and initializes many different architecture-dependent things. The first interrupt-related function which we can see in setup_arch is early_trap_init. This function is defined in arch/x86/kernel/traps.c and fills the Interrupt Descriptor Table with a couple of entries:

void __init early_trap_init(void)
{
    set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
    set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#ifdef CONFIG_X86_32
    set_intr_gate(X86_TRAP_PF, page_fault);
#endif
    load_idt(&idt_descr);
}

Here we can see calls of three different functions:

set_intr_gate_ist
set_system_intr_gate_ist
set_intr_gate

All of these functions are defined in arch/x86/include/asm/desc.h and do similar, but not identical, things. The first, set_intr_gate_ist, inserts a new interrupt gate into the IDT. Let's look at its implementation:

static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
    BUG_ON((unsigned)n > 0xFF);
    _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}

First of all we can see the check that n, which is the vector number of the interrupt, is not greater than 0xff or 255. We need this check because, as we remember from the previous part, the vector number of an interrupt must be between 0 and 255. In the next step we can see the call of the _set_gate function, which writes the given interrupt gate to the IDT:

static inline void _set_gate(int gate, unsigned type, void *addr,
                             unsigned dpl, unsigned ist, unsigned seg)
{
    gate_desc s;

    pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
    write_idt_entry(idt_table, gate, &s);
    write_trace_idt_entry(gate, &s);
}

Here we start with the pack_gate function, which takes a clean IDT entry represented by the gate_desc structure and fills it with the handler address, segment selector, Interrupt Stack Table index, privilege level and type of the gate, which can be one of the following values:

GATE_INTERRUPT
GATE_TRAP
GATE_CALL
GATE_TASK

and sets the present bit for the given IDT entry:

static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                             unsigned dpl, unsigned ist, unsigned seg)
{
    gate->offset_low = PTR_LOW(func);
    gate->segment = __KERNEL_CS;
    gate->ist = ist;
    gate->p = 1;
    gate->dpl = dpl;
    gate->zero0 = 0;
    gate->zero1 = 0;
    gate->type = type;
    gate->offset_middle = PTR_MIDDLE(func);
    gate->offset_high = PTR_HIGH(func);
}

After this we write the just-filled interrupt gate to the IDT with the write_idt_entry macro, which expands to native_write_idt_entry and simply copies the interrupt gate into idt_table at the given index:

#define write_idt_entry(dt, entry, g) native_write_idt_entry(dt, entry, g)

static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
    memcpy(&idt[entry], gate, sizeof(*gate));
}

where idt_table is just an array of gate_desc:

extern gate_desc idt_table[];

That's all. The second function, set_system_intr_gate_ist, has only one difference from set_intr_gate_ist:

static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
    BUG_ON((unsigned)n > 0xFF);
    _set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}

Do you see it? Look at the fourth parameter of _set_gate. It is 0x3. In set_intr_gate_ist it was 0x0. We know that this parameter represents the DPL, or privilege level. We also know that 0 is the highest privilege level and 3 is the lowest. Now we know how set_system_intr_gate_ist, set_intr_gate_ist and set_intr_gate work, and we can return to the early_trap_init function. Let's look at it again:

set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);

We set two IDT entries, for the #DB interrupt and for int3. These functions take the same set of parameters:

vector number of an interrupt;
address of an interrupt handler;
interrupt stack table index.

That's all. You will learn more about interrupts and handlers in the next parts.

Conclusion

This is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw some theory in the previous part and started to dive into interrupt and exception handling in the current part. We started from the earliest parts of the Linux kernel source code that are related to interrupts. In the next part we will continue to dive into this interesting theme and will learn more about the interrupt handling process.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

IDT
Protected mode
List of x86 calling conventions
8086
Long mode
NX
Extended Feature Enable Register
Model-specific register
Process identifier
lockdep
irqflags tracing
IF
Stack canary
Union type
this_cpu_* operations
vector number
Interrupt Stack Table
Privilege level
Previous part

Interrupts and Interrupt Handling. Part 3.

Interrupt handlers

This is the third part of the chapter about interrupt and exception handling. In the previous part we stopped in the setup_arch function from arch/x86/kernel/setup.c, at the setting of the handlers for the following two exceptions:

#DB - debug exception, transfers control from the interrupted process to the debug handler;
#BP - breakpoint exception, caused by the int3 instruction.

These exceptions allow the x86_64 architecture to have early exception processing for the purpose of debugging via kgdb.

As you may remember, we set these exception handlers in the early_trap_init function:

void __init early_trap_init(void)
{
    set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
    set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
    load_idt(&idt_descr);
}

from arch/x86/kernel/traps.c. We already saw the implementation of the set_intr_gate_ist and set_system_intr_gate_ist functions in the previous part, and now we will look at the implementation of these early exception handlers.

Debug and Breakpoint exceptions

OK, we set the interrupt gates in the early_trap_init function for the #DB and #BP exceptions, and now it is time to look at their handlers. But first of all, let's look at these exceptions. The first exception, #DB or debug exception, occurs when a debug event occurs, for example an attempt to change the contents of a debug register. Debug registers are special registers present in processors starting from the Intel 80386 and, as their name suggests, they are used for debugging. These registers allow us to set breakpoints on code and on data reads or writes, and thus to track down the place of an error. The debug registers are privileged resources, available only to programs running in real-address mode or in protected mode at CPL 0; that's why we used set_intr_gate_ist for #DB and not set_system_intr_gate_ist. The vector number of the #DB exception is 1 (we pass it as X86_TRAP_DB) and it has no error code:

----------------------------------------------------------------
| Vector | Mnemonic | Description | Type | Error Code | Source |
----------------------------------------------------------------
| 1      | #DB      | Reserved    | F/T  | NO         |        |
----------------------------------------------------------------

The second exception is #BP, or breakpoint exception, which occurs when the processor executes the INT 3 instruction. We can add it anywhere in our code; for example, let's look at this simple program:

// breakpoint.c
#include <stdio.h>

int main() {
    int i = 0;    /* start counting from zero */
    while (i < 6) {
        printf("i equal to: %d\n", i);
        __asm__("int3");
        ++i;
    }
}

If we compile and run this program, we will see the following output:

$ gcc breakpoint.c -o breakpoint
$ ./breakpoint
i equal to: 0
Trace/breakpoint trap

But if we run it with gdb, we will see our breakpoint and can continue execution of our program:

$ gdb breakpoint
...
...
...
(gdb) run
Starting program: /home/alex/breakpoints
i equal to: 0

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:    83 45 fc 01    add    DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 1

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:    83 45 fc 01    add    DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 2

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:    83 45 fc 01    add    DWORD PTR [rbp-0x4],0x1
...
...
...

Now we know a little about these two exceptions and we can move on to their handlers.

Preparation before an interrupt handler

As you may note, the set_intr_gate_ist and set_system_intr_gate_ist functions take the addresses of the exception handlers in their second parameter:

&debug;
&int3.

You will not find these functions in the C code. All that can be found in the *.c/*.h files is the declaration of these functions in arch/x86/include/asm/traps.h:

asmlinkage void debug(void);
asmlinkage void int3(void);

But we can see the asmlinkage descriptor here. asmlinkage is a special specifier of gcc. For C functions which are called from assembly, we need an explicit declaration of the function calling convention. If a function is marked with the asmlinkage descriptor on 32-bit x86, gcc compiles the function to retrieve its parameters from the stack; on x86_64 the descriptor does not change the calling convention. Both handlers are defined in the arch/x86/kernel/entry_64.S assembly source code file with the idtentry macro:

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

Actually, debug and int3 are not the interrupt handlers themselves. Remember that before we can execute an interrupt/exception handler, some preparations have to be made:

When an interrupt or exception occurs, the processor uses the exception or interrupt vector as an index to a descriptor in the IDT;
In legacy mode the ss:esp registers are pushed on the stack only if the privilege level changed. In 64-bit mode ss:rsp is pushed on the stack every time;
During stack switching with IST the new ss selector is forced to null. The old ss and rsp are pushed on the new stack;
The rflags, cs, rip and error code are pushed on the stack;
Control is transferred to the interrupt handler;
After the interrupt handler finishes its work with the iret instruction, the old ss is popped from the stack and loaded into the ss register;
ss:rsp is popped from the stack unconditionally in 64-bit mode and only if there is a privilege level change in legacy mode;
The iret instruction restores rip, cs and rflags;
The interrupted program continues its execution.

     +--------------------+
 +40 |         ss         |
 +32 |         rsp        |
 +24 |       rflags       |
 +16 |         cs         |
  +8 |         rip        |
   0 |     error code     |
     +--------------------+

Now we can look at the preparations before the processor transfers control to an interrupt/exception handler from the practical side. As I already wrote above, the first thirteen exception handlers are defined in the arch/x86/kernel/entry_64.S assembly file with the idtentry macro:

.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm

This macro defines an exception entry point and, as we can see, it takes five arguments:

sym - defines the global symbol with .globl name;
do_sym - the interrupt handler;
has_error_code:req - information about the error code; the :req qualifier tells the assembler that the argument is required;
paranoid - shows how we need to check the current mode;
shift_ist - shows which stack to use.

As we can see, our exception handlers are almost the same:

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

The differences are only in the global name and the name of the exception handler. Now let's look at how the idtentry macro is implemented. It starts with two checks:

.if \shift_ist != -1 && \paranoid == 0
.error "using shift_ist requires paranoid=1"
.endif

.if \has_error_code
XCPT_FRAME
.else
INTR_FRAME
.endif

The first check verifies that an exception which uses the Interrupt Stack Table also has paranoid set; otherwise it emits an error with the .error directive. The second if clause checks whether an error code exists and invokes the XCPT_FRAME or INTR_FRAME macro accordingly. These macros just expand to a set of CFI directives which are used by GNU AS to manage call frames. The CFI directives are used only to generate dwarf2 unwind information for better backtraces and they don't change any code, so we will not go into detail about them, and from this point I will skip all code related to these directives. In the next step we check the error code again and, if the exception does not have one, push -1 on the stack in its place:

.ifeq \has_error_code
pushq_cfi $-1
.endif

The pushq_cfi macro is defined in arch/x86/include/asm/dwarf2.h and expands to the pushq instruction, which pushes the given value:

.macro pushq_cfi reg
pushq \reg
CFI_ADJUST_CFA_OFFSET 8
.endm

Pay attention to the $-1. We already know that when an exception occurs, the processor pushes ss, rsp, rflags, cs and rip on the stack:

#define RIP 16*8
#define CS 17*8
#define EFLAGS 18*8
#define RSP 19*8
#define SS 20*8

With pushq \reg we make sure that the slot before RIP holds the error code of the exception:

#define ORIG_RAX 15*8

ORIG_RAX will contain the error code of an exception, the IRQ number on a hardware interrupt, or the system call number on system call entry. In the next step we can see the ALLOC_PT_GPREGS_ON_STACK macro, which allocates space for the 15 general purpose registers on the stack:

.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
subq $15*8+\addskip, %rsp
CFI_ADJUST_CFA_OFFSET 15*8+\addskip
.endm

After this we check paranoid and, if it is set to 1, we test the two low bits of the saved CS, which hold the CPL. Testing them against 3 tells us whether we came from userspace or not:

.if \paranoid
.if \paranoid == 1
CFI_REMEMBER_STATE
testl $3, CS(%rsp)
jnz 1f
.endif
call paranoid_entry
.else
call error_entry
.endif

If we came from userspace we jump to the label 1, which starts with the call error_entry instruction. error_entry saves all registers in the pt_regs structure, which represents an interrupt/exception stack frame and is defined in arch/x86/include/uapi/asm/ptrace.h. It saves the common and extra registers on the stack with:

SAVE_C_REGS 8
SAVE_EXTRA_REGS 8

from rdi to r15 and executes the swapgs instruction. This instruction provides a method for the Linux kernel to obtain a pointer to the kernel data structures and save the user's gs base. After this we exit from error_entry with the ret instruction. After error_entry has finished executing, since we came from userspace, we need to switch to the kernel stack of the current task:

movq %rsp,%rdi
call sync_regs

We have just saved all registers in the pt_regs structure in error_entry; now we put the address of pt_regs into rdi and call the sync_regs function from arch/x86/kernel/traps.c:

asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
{
    struct pt_regs *regs = task_pt_regs(current);
    *regs = *eregs;
    return regs;
}

This function switches off the IST stack if we came from user mode. After this we switch to the stack which we got from sync_regs:

movq %rax,%rsp
movq %rsp,%rdi

and put the pointer to pt_regs in rdi again, and in the last step we call the exception handler:

call \do_sym

So the real exception handlers are the do_debug and do_int3 functions. We will see these functions in this part, but a little later. First of all, let's look at the remaining preparations before the processor transfers control to an interrupt handler. Otherwise, if paranoid is set and we did not come from userspace (or paranoid is not 1, so the CS check is skipped), we call paranoid_entry, which does almost the same as error_entry but checks the current mode in a slower but more accurate way:

ENTRY(paranoid_entry)
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
...
...
movl $MSR_GS_BASE,%ecx
rdmsr
testl %edx,%edx
js 1f /* negative -> in kernel */
SWAPGS
...
...
ret
END(paranoid_entry)

If edx is negative, we are in kernel mode. Now that we have stored all registers on the stack and checked that we are in kernel mode, we need to set up the IST stack if it is set for the given exception, call the exception handler, and restore the exception stack:

.if \shift_ist != -1
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif

call \do_sym

.if \shift_ist != -1
addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif

In the last step, when the exception handler finishes its work, all registers are restored from the stack with the RESTORE_C_REGS and RESTORE_EXTRA_REGS macros and control returns to the interrupted task. That's all. Now we know about the preparation before an interrupt/exception handler starts to execute, and we can go directly to the implementation of the handlers.

Implementation of interrupt and exception handlers

Both handlers, do_debug and do_int3, are defined in the arch/x86/kernel/traps.c source code file and have two things in common. All interrupt/exception handlers are marked with the dotraplinkage prefix, which expands to:

#define dotraplinkage __visible
#define __visible __attribute__((externally_visible))

which tells the compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). They also take two parameters:

pointer to the pt_regs structure which contains the registers of the interrupted task;
error code.

First of all, let's consider the do_debug handler. This function starts by getting the previous state with the ist_enter function from arch/x86/kernel/traps.c. We call it because we need to know whether we came to the interrupt handler from kernel mode or user mode.

prev_state = ist_enter(regs);

The ist_enter function returns the previous context state and makes a couple of preparations before we continue to handle the exception. It starts with a check of the previous mode using the user_mode_vm macro. This macro takes the pt_regs structure, which contains the set of registers of the interrupted task, and returns 1 if we came from userspace and 0 if we came from kernel space. According to the previous mode, we execute exception_enter if we came from userspace or inform RCU if we came from kernel space:

...
if (user_mode_vm(regs)) {
    prev_state = exception_enter();
} else {
    rcu_nmi_enter();
    prev_state = IN_KERNEL;
}
...
...
...
return prev_state;

After this we load the DR6 debug register into the dr6 variable with a call of the get_debugreg macro from arch/x86/include/asm/debugreg.h:

get_debugreg(dr6, 6);
dr6 &= ~DR6_RESERVED;

The DR6 debug register is the debug status register; it contains information about the reason the #DB or debug exception was raised. After we have loaded its value into the dr6 variable, we filter out all reserved bits (4:12 bits). In the next step we check the dr6 register and the previous state with the following if condition:

if (!dr6 && user_mode_vm(regs))
    user_icebp = 1;

If dr6 does not show any reason why we caught this trap, we set user_icebp to one, which means that user code wants to get the SIGTRAP signal. In the next step we check whether it was a kmemcheck trap and, if so, we go to exit:

if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
    goto exit;

After we have done all these checks, we clear the dr6 register, clear the TIF_BLOCKSTEP thread flag, which tracks the DEBUGCTLMSR_BTF flag that provides single-step on branches debugging, set the dr6 register for the current thread and increase the debug_stack_usage per-cpu variable with:

set_debugreg(0, 6);
clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
tsk->thread.debugreg6 = dr6;
debug_stack_usage_inc();

As we saved dr6, we can allow irqs:

static inline void preempt_conditional_sti(struct pt_regs *regs)
{
    preempt_count_inc();
    if (regs->flags & X86_EFLAGS_IF)
        local_irq_enable();
}

You can read more about local_irq_enable and related stuff in the second part about interrupt handling in the Linux kernel. In the next step we check whether the previous mode was virtual 8086 and, if so, handle the trap:

if (regs->flags & X86_VM_MASK) {
    handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, X86_TRAP_DB);
    preempt_conditional_cli(regs);
    debug_stack_usage_dec();
    goto exit;
}
...
...
...
exit:
    ist_exit(regs, prev_state);

If we did not come from virtual 8086 mode, we need to check the dr6 register and the previous mode as we did above. Here we check whether single-step debugging is enabled and we did not come from user mode; if so, we clear DR_STEP in the dr6 copy of the current thread, set the TIF_SINGLESTEP flag and clear the trap flag in the saved flags:

if ((dr6 & DR_STEP) && !user_mode(regs)) {
    tsk->thread.debugreg6 &= ~DR_STEP;
    set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
    regs->flags &= ~X86_EFLAGS_TF;
}

Then we get the SIGTRAP signal code:

si_code = get_si_code(tsk->thread.debugreg6);

and send the signal if any debug trap bits are set or for user icebp traps:

if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
    send_sigtrap(tsk, regs, error_code, si_code);

preempt_conditional_cli(regs);
debug_stack_usage_dec();
exit:
    ist_exit(regs, prev_state);

In the end we disable irqs, decrement the value of debug_stack_usage and exit from the exception handler with the ist_exit function.

The second exception handler is do_int3, defined in the same source code file, arch/x86/kernel/traps.c. In do_int3 we do almost the same as in the do_debug handler: we get the previous state with ist_enter, increment and decrement the debug_stack_usage per-cpu variable, and enable and disable local interrupts. But of course there is one difference between these two handlers: we need to lock and then sync processor cores during breakpoint patching.

That's all.

Conclusion

This is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the Interrupt Descriptor Table in the previous part with the #DB and #BP gates, and in this part we started to dive into the preparation before control is transferred to an exception handler and the implementation of some interrupt handlers. In the next part we will continue to dive into this theme, go further through the setup_arch function and try to understand interrupt handling related stuff.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

Debug registers
Intel 80386
INT 3
gcc
TSS
GNU assembly .error directive
dwarf2
CFI directives
IRQ
system call
swapgs
SIGTRAP
Per-CPU variables
kgdb
ACPI
Previous part

Interrupts and Interrupt Handling. Part 4.

Initialization of non-early interrupt gates

This is the fourth part about interrupt and exception handling in the Linux kernel. In the previous part we saw the first early #DB and #BP exception handlers from arch/x86/kernel/traps.c. We stopped right after the early_trap_init function, which is called in the setup_arch function defined in arch/x86/kernel/setup.c. In this part we will continue to dive into interrupt and exception handling in the Linux kernel for x86_64, starting from the place where we left off in the last part. The first thing related to interrupt and exception handling is the setup of the #PF or page fault handler with the early_trap_pf_init function. Let's start from it.

Early page fault handler

The early_trap_pf_init function is defined in arch/x86/kernel/traps.c. It uses the set_intr_gate macro, which fills the Interrupt Descriptor Table with the given entry:

void __init early_trap_pf_init(void)
{
#ifdef CONFIG_X86_64
    set_intr_gate(X86_TRAP_PF, page_fault);
#endif
}

This macro is defined in arch/x86/include/asm/desc.h. We already saw macros like this in the previous part: set_system_intr_gate and set_intr_gate_ist. This macro checks that the given vector number is not greater than 255 (the maximum vector number) and calls the _set_gate function, as set_system_intr_gate and set_intr_gate_ist did:

#define set_intr_gate(n, addr)                                  \
do {                                                            \
    BUG_ON((unsigned)n > 0xFF);                                 \
    _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,            \
              __KERNEL_CS);                                     \
    _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,    \
                    0, 0, __KERNEL_CS);                         \
} while (0)

The set_intr_gate macro takes two parameters:

vector number of an interrupt;
address of an interrupt handler;

In our case they are:

X86_TRAP_PF - 14;
page_fault - the interrupt handler entry point.

X86_TRAP_PF is an element of the enum which is defined in arch/x86/include/asm/traps.h:

enum {
    ...
    ...
    ...
    ...
    X86_TRAP_PF, /* 14, Page Fault */
    ...
    ...
    ...
}

When early_trap_pf_init is called, set_intr_gate expands to a call of _set_gate, which fills the IDT with the handler for the page fault. Now let's look at the implementation of the page_fault handler. The page_fault handler is defined in the arch/x86/kernel/entry_64.S assembly source code file, like all exception handlers. Let's look at it:

trace_idtentry page_fault do_page_fault has_error_code=1

We saw in the previous part how the #DB and #BP handlers are defined. They were defined with the idtentry macro, but here we can see trace_idtentry. This macro is defined in the same source code file and depends on the CONFIG_TRACING kernel configuration option:

#ifdef CONFIG_TRACING
.macro trace_idtentry sym do_sym has_error_code:req
idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#else
.macro trace_idtentry sym do_sym has_error_code:req
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#endif

WewillnotdiveintoexceptionsTracingnow.IfCONFIG_TRACINGisnotset,wecanseethattrace_idtentrymacrojustexpandstothenormalidtentry.Wealreadysawimplementationoftheidtentrymacrointhepreviouspart,solet'sstartfromthepage_faultexceptionhandler.

Aswecanseeintheidtentrydefinition,thehandlerofthepage_faultisdo_page_faultfunctionwhichdefinedinthearch/x86/mm/fault.candasallexceptionshandlersittakestwoarguments:

regs-pt_regsstructurethatholdsstateofaninterruptedprocess;error_code-errorcodeofthepagefaultexception.

Let's look inside this function. First of all we read the content of the cr2 control register:

dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    unsigned long address = read_cr2();
    ...
    ...
    ...
}

This register contains the linear address which caused the page fault. In the next step we call the exception_enter function from include/linux/context_tracking.h. exception_enter and exception_exit are functions from the context tracking subsystem of the Linux kernel, which RCU uses to remove its dependency on the timer tick while a processor runs in userspace. We will see similar code in almost every exception handler:

enum ctx_state prev_state;
prev_state = exception_enter();
...
... // exception handler here
...
exception_exit(prev_state);

The exception_enter function checks that context tracking is enabled with context_tracking_is_enabled and, if it is, gets the previous context with this_cpu_read (more about this_cpu_* operations you can read in the Documentation). After this it calls the context_tracking_user_exit function, which informs the context tracking subsystem that the processor is exiting userspace mode and entering the kernel:

static inline enum ctx_state exception_enter(void)
{
    enum ctx_state prev_ctx;

    if (!context_tracking_is_enabled())
        return 0;

    prev_ctx = this_cpu_read(context_tracking.state);
    context_tracking_user_exit();

    return prev_ctx;
}

The state can be one of:

enum ctx_state {
    IN_KERNEL = 0,
    IN_USER,
} state;

In the end we return the previous context. Between exception_enter and exception_exit we call the actual page fault handler:

__do_page_fault(regs, error_code, address);
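The enter/exit pairing described above can be modelled in user space. This is only a sketch: the names prefixed with model_ and the two global variables are made up for illustration and stand in for the per-cpu state and the kernel helpers; the real implementation lives in include/linux/context_tracking.h.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's context tracking state. */
enum ctx_state { IN_KERNEL = 0, IN_USER };

/* Hypothetical globals: model context_tracking_is_enabled() and
 * the per-cpu context_tracking.state variable. */
static bool tracking_enabled = true;
static enum ctx_state cpu_state = IN_USER;

/* Remember where we came from and switch the tracked state to kernel. */
static enum ctx_state model_exception_enter(void)
{
    enum ctx_state prev;

    if (!tracking_enabled)
        return IN_KERNEL;

    prev = cpu_state;
    cpu_state = IN_KERNEL;      /* models context_tracking_user_exit() */
    return prev;
}

/* Restore the tracked state if the exception interrupted userspace. */
static void model_exception_exit(enum ctx_state prev)
{
    if (tracking_enabled && prev == IN_USER)
        cpu_state = IN_USER;
}
```

A handler body would sit between the two calls, exactly as in the pattern shown earlier: save the value returned by model_exception_enter, do the work, then pass that value to model_exception_exit.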

__do_page_fault is defined in the same source code file as do_page_fault - arch/x86/mm/fault.c. At the beginning of __do_page_fault we check the state of the kmemcheck checker. kmemcheck detects and warns about some uses of uninitialized memory. We need to check it because a page fault can be caused by kmemcheck:

if (kmemcheck_active(regs))
    kmemcheck_hide(regs);
prefetchw(&mm->mmap_sem);

After this we can see the call of prefetchw, which executes the instruction with the same name (tied to the X86_FEATURE_3DNOW feature) to fetch the cache line for exclusive access. The main purpose of prefetching is to hide the latency of a memory access. In the next step we check whether the faulting address lies in kernel space with the following condition:

if (unlikely(fault_in_kernel_space(address))) {
    ...
    ...
    ...
}

where fault_in_kernel_space is:

static int fault_in_kernel_space(unsigned long address)
{
    return address >= TASK_SIZE_MAX;
}

The TASK_SIZE_MAX macro expands to:

#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)

or 0x00007ffffffff000. Pay attention to the unlikely macro. There are two such macros in the Linux kernel:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

You can often find these macros in the code of the Linux kernel. The main purpose of these macros is optimization. Sometimes we need to check a condition in the code and we know that it will rarely be true (or rarely be false). With these macros we can tell the compiler about this. For example:

static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
    if (ctx->pos < FIRST_PROCESS_ENTRY) {
        int error = proc_readdir(file, ctx);
        if (unlikely(error <= 0))
            return error;
    ...
    ...
    ...
}

Here we can see the proc_root_readdir function, which is called when the Linux VFS needs to read the root directory contents. If a condition is marked with unlikely, the compiler can put the code for the false case right after the branch.

Now let's get back to our address check. Comparing the given address with 0x00007ffffffff000 tells us whether the faulting address belongs to kernel space or to user space. After this check, the __do_page_fault routine tries to understand the problem that provoked the page fault exception and then passes the address to the appropriate routine. It can be a kmemcheck fault, a spurious fault, a kprobes fault, etc. We will not dive into the implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kernel, but we will see it in the chapter about memory management in the Linux kernel.
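The address check and the branch hint can be tried out in user space. The sketch below assumes a 4096-byte page and a GCC-compatible compiler (for __builtin_expect); every name prefixed with demo_ or DEMO_ is made up for this illustration and is not a kernel symbol.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86_64. */
#define DEMO_PAGE_SIZE      4096ULL
#define DEMO_TASK_SIZE_MAX  ((1ULL << 47) - DEMO_PAGE_SIZE)

/* Same shape as the kernel macros; needs GCC or Clang. */
#define demo_likely(x)      __builtin_expect(!!(x), 1)
#define demo_unlikely(x)    __builtin_expect(!!(x), 0)

/* True when the address is at or above the user/kernel boundary. */
static bool demo_fault_in_kernel_space(uint64_t address)
{
    return address >= DEMO_TASK_SIZE_MAX;
}

/* Faults on kernel addresses are expected to be rare, so hint the compiler. */
static const char *demo_classify_fault(uint64_t address)
{
    if (demo_unlikely(demo_fault_in_kernel_space(address)))
        return "kernel";
    return "user";
}
```

The hint changes only code layout, not the result: demo_classify_fault returns the same string with or without demo_unlikely.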

Back to start_kernel

There are many different function calls after early_trap_pf_init in the setup_arch function, from different kernel subsystems, but none of them is related to interrupt and exception handling. So, we have to go back to where we came from - the start_kernel function from init/main.c. The first thing after setup_arch is the trap_init function from arch/x86/kernel/traps.c. This function initializes the remaining exception handlers (remember that we have already set up 3 handlers: for #DB - debug exception, #BP - breakpoint exception and #PF - page fault exception). The trap_init function starts with the check for the Extended Industry Standard Architecture:

#ifdef CONFIG_EISA
    void __iomem *p = early_ioremap(0x0FFFD9, 4);

    if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
        EISA_bus = 1;
    early_iounmap(p, 4);
#endif

Note that it depends on the CONFIG_EISA kernel configuration parameter, which represents EISA support. Here we use the early_ioremap function to map I/O memory into the page tables. We use the readl function to read the first 4 bytes from the mapped region, and if they are equal to the EISA string we set EISA_bus to one. In the end we just unmap the previously mapped region. More about early_ioremap you can read in the part which describes Fix-Mapped Addresses and ioremap.
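To see why that arithmetic expression is equal to the EISA string, the comparison can be reproduced on a plain byte buffer. This is a user-space sketch; le32_from_bytes and looks_like_eisa are hypothetical helpers, and the buffer stands in for the mapped BIOS region that the kernel reads with readl.

```c
#include <stdint.h>

/* The constant from trap_init(), spelled the same way. */
#define EISA_SIGNATURE ((uint32_t)('E' + ('I' << 8) + ('S' << 16) + ('A' << 24)))

/* Assemble a little-endian 32-bit value from 4 bytes,
 * independent of the endianness of the host. */
static uint32_t le32_from_bytes(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Non-zero when the 4 bytes spell "EISA" in memory order. */
static int looks_like_eisa(const unsigned char *bios_bytes)
{
    return le32_from_bytes(bios_bytes) == EISA_SIGNATURE;
}
```

Because x86 is little-endian, the first byte in memory ends up in the lowest 8 bits of the value, which is why 'E' is the unshifted term and 'A' is shifted by 24.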

After this we start to fill the Interrupt Descriptor Table with the different interrupt gates. First of all we set #DE or Divide Error and #NMI or Non-maskable Interrupt:

set_intr_gate(X86_TRAP_DE, divide_error);
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);

We use the set_intr_gate macro to set the interrupt gate for the #DE exception and set_intr_gate_ist for the #NMI. You may remember that we already used these macros when we set the interrupt gates for the page fault handler, the debug handler, etc.; you can find an explanation of them in the previous part. After this we set up exception gates for the following exceptions:

set_system_intr_gate(X86_TRAP_OF, &overflow);
set_intr_gate(X86_TRAP_BR, bounds);
set_intr_gate(X86_TRAP_UD, invalid_op);
set_intr_gate(X86_TRAP_NM, device_not_available);

Here we can see:

* #OF or Overflow exception. This exception indicates that an overflow trap occurred when the special INTO instruction was executed;
* #BR or BOUND Range exceeded exception. This exception indicates that a BOUND-range-exceeded fault occurred when a BOUND instruction was executed;
* #UD or Invalid Opcode exception. Occurs when the processor attempts to execute an invalid or reserved opcode, an instruction with invalid operand(s), etc.;
* #NM or Device Not Available exception. Occurs when the processor tries to execute an x87 FPU floating point instruction while the EM flag in the control register cr0 is set.

In the next step we set the interrupt gate for the #DF or Double fault exception:

set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);

This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. Usually, when the processor detects another exception while trying to call an exception handler, the two exceptions can be handled serially. If the processor cannot handle them serially, it signals the double-fault or #DF exception.

The following set of interrupt gates is:

set_intr_gate(X86_TRAP_OLD_MF, &coprocessor_segment_overrun);
set_intr_gate(X86_TRAP_TS, &invalid_TSS);
set_intr_gate(X86_TRAP_NP, &segment_not_present);
set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK);
set_intr_gate(X86_TRAP_GP, &general_protection);
set_intr_gate(X86_TRAP_SPURIOUS, &spurious_interrupt_bug);
set_intr_gate(X86_TRAP_MF, &coprocessor_error);
set_intr_gate(X86_TRAP_AC, &alignment_check);

Here we can see the setup for the following exception handlers:

* #CSO or Coprocessor Segment Overrun - this exception indicates that the math coprocessor of an old processor detected a page or segment violation. Modern processors do not generate this exception.
* #TS or Invalid TSS exception - indicates that there was an error related to the Task State Segment.
* #NP or Segment Not Present exception - indicates that the present flag of a segment or gate descriptor is clear during an attempt to load one of the cs, ds, es, fs, or gs registers.
* #SS or Stack Fault exception - indicates that one of the stack related conditions was detected, for example a not-present stack segment is detected when attempting to load the ss register.
* #GP or General Protection exception - indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause a general-protection exception. For example: loading the ss, ds, es, fs, or gs register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the Interrupt Descriptor Table (following an interrupt or exception) that is not an interrupt, trap, or task gate, and many more.
* Spurious Interrupt - a hardware interrupt that is unwanted.
* #MF or x87 FPU Floating-Point Error exception - caused when the x87 FPU has detected a floating point error.
* #AC or Alignment Check exception - indicates that the processor detected an unaligned memory operand when alignment checking was enabled.

After we have set up these exception gates, we can see the setup of the Machine-Check exception:

#ifdef CONFIG_X86_MCE
    set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
#endif

Note that it depends on the CONFIG_X86_MCE kernel configuration option. This exception indicates that the processor detected an internal machine error or a bus error, or that an external agent detected a bus error. The next exception gate is for the SIMD Floating-Point exception:

set_intr_gate(X86_TRAP_XF, &simd_coprocessor_error);

which indicates that the processor has detected an SSE, SSE2 or SSE3 SIMD floating-point exception. There are six classes of numeric exception conditions that can occur while executing a SIMD floating-point instruction:

* Invalid operation
* Divide-by-zero
* Denormal operand
* Numeric overflow
* Numeric underflow
* Inexact result (Precision)

In the next step we fill the used_vectors array, which is defined in the arch/x86/include/asm/desc.h header file and represents a bitmap:

DECLARE_BITMAP(used_vectors, NR_VECTORS);

We set the bits of the first 32 interrupts in it (more about bitmaps in the Linux kernel you can read in the part which describes cpumasks and bitmaps):

for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
    set_bit(i, used_vectors);

where FIRST_EXTERNAL_VECTOR is:

#define FIRST_EXTERNAL_VECTOR 0x20

After this we set up the interrupt gate for ia32_syscall and add 0x80 to the used_vectors bitmap:

#ifdef CONFIG_IA32_EMULATION
    set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
    set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
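The bookkeeping done with used_vectors can be sketched with an ordinary array of unsigned long. The demo_ helpers below are simplified, non-atomic stand-ins written for this illustration (the kernel's set_bit is atomic), and the value 256 for the number of vectors is stated here as an assumption about x86.

```c
#include <limits.h>

#define DEMO_NR_VECTORS             256   /* assumed number of IDT vectors */
#define DEMO_FIRST_EXTERNAL_VECTOR  0x20
#define DEMO_IA32_SYSCALL_VECTOR    0x80
#define DEMO_BITS_PER_LONG          (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n) \
    (((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* One bit per vector, packed into unsigned longs. */
static unsigned long demo_used_vectors[DEMO_BITS_TO_LONGS(DEMO_NR_VECTORS)];

/* Non-atomic equivalent of set_bit(). */
static void demo_set_bit(unsigned int nr, unsigned long *map)
{
    map[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

/* Returns 1 if the bit is set, 0 otherwise. */
static int demo_test_bit(unsigned int nr, const unsigned long *map)
{
    return (int)((map[nr / DEMO_BITS_PER_LONG] >> (nr % DEMO_BITS_PER_LONG)) & 1UL);
}

/* Mark the 32 exception vectors and the 0x80 syscall vector as used. */
static void demo_reserve_vectors(void)
{
    unsigned int i;

    for (i = 0; i < DEMO_FIRST_EXTERNAL_VECTOR; i++)
        demo_set_bit(i, demo_used_vectors);
    demo_set_bit(DEMO_IA32_SYSCALL_VECTOR, demo_used_vectors);
}
```

After demo_reserve_vectors runs, vectors 0 through 31 and 0x80 test as used, while vector 32, the first external one, is still free.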

There is the CONFIG_IA32_EMULATION kernel configuration option on x86_64 Linux kernels. This option provides the ability to execute 32-bit processes in compatibility mode. In the next parts we will see how it works; in the meantime we need only to know that there is yet another interrupt gate in the IDT with the vector number 0x80. In the next step we map the IDT to the fixmap area:

__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
idt_descr.address = fix_to_virt(FIX_RO_IDT);

and write its address to idt_descr.address (more about fix-mapped addresses you can read in the second part of the Linux kernel memory management chapter). After this we can see the call of the cpu_init function, which is defined in arch/x86/kernel/cpu/common.c. This function initializes all of the per-cpu state. At the beginning of cpu_init we do the following things: first of all we wait for the current cpu to be initialized, then we call the cr4_init_shadow function, which stores a shadow copy of the cr4 control register for the current cpu, and load the CPU microcode if needed, with the following function calls:

wait_for_master_cpu(cpu);
cr4_init_shadow();
load_ucode_ap();

Next we get the Task State Segment for the current cpu and the orig_ist structure, which represents the original Interrupt Stack Table values, with:

t = &per_cpu(cpu_tss, cpu);
oist = &per_cpu(orig_ist, cpu);

As we got the values of the Task State Segment and the Interrupt Stack Table for the current processor, we clear the following bits in the cr4 control register:

cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);

With this we disable the vm86 extension, virtual interrupts, the time stamp disable flag (when it is set, RDTSC can only be executed with the highest privilege) and the debug extension. After this we reload the Global Descriptor Table and the Interrupt Descriptor Table with:

switch_to_new_gdt(cpu);
loadsegment(fs, 0);
load_current_idt();

After this we set up the array of Thread-Local Storage descriptors, configure NX and load CPU microcode. Now it is time to set up and load the per-cpu Task State Segments. We go in a loop through all the exception stacks, of which there are N_EXCEPTION_STACKS or 4, and fill the Interrupt Stack Table with them:

if (!oist->ist[0]) {
    char *estacks = per_cpu(exception_stacks, cpu);

    for (v = 0; v < N_EXCEPTION_STACKS; v++) {
        estacks += exception_stack_sizes[v];
        oist->ist[v] = t->x86_tss.ist[v] =
                (unsigned long)estacks;
        if (v == DEBUG_STACK-1)
            per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
    }
}
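The loop above advances the pointer before storing it, so every IST entry holds the top (highest address) of its stack, which is what a downward-growing stack needs. A minimal sketch of that pointer arithmetic follows; it assumes equally sized 4096-byte stacks, whereas the kernel takes each size from exception_stack_sizes, and all demo_ names are hypothetical.

```c
#define DEMO_N_EXCEPTION_STACKS 4
#define DEMO_STACK_SIZE         4096u   /* hypothetical size for this sketch */

/* One contiguous buffer that holds all exception stacks back to back. */
static unsigned char demo_stacks[DEMO_N_EXCEPTION_STACKS * DEMO_STACK_SIZE];

/* Stand-in for the ist[] array of the TSS. */
static unsigned char *demo_ist[DEMO_N_EXCEPTION_STACKS];

static void demo_fill_ist(void)
{
    unsigned char *estacks = demo_stacks;
    int v;

    for (v = 0; v < DEMO_N_EXCEPTION_STACKS; v++) {
        estacks += DEMO_STACK_SIZE;   /* move to the END of stack v */
        demo_ist[v] = estacks;        /* entry points at the stack top */
    }
}
```

With this layout demo_ist[0] points one stack size past the start of the buffer, and the last entry points just past its end.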

As we have filled the Task State Segments with the Interrupt Stack Tables, we can set the TSS descriptor for the current processor and load it with:

set_tss_desc(cpu, t);
load_TR_desc();

where the set_tss_desc macro from arch/x86/include/asm/desc.h writes the given descriptor to the Global Descriptor Table of the given processor:

#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr)

static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
{
    struct desc_struct *d = get_cpu_gdt_table(cpu);
    tss_desc tss;

    set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS,
                          IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
                          sizeof(unsigned long) - 1);
    write_gdt_entry(d, entry, &tss, DESC_TSS);
}

and the load_TR_desc macro expands to the ltr or Load Task Register instruction:

#define load_TR_desc() native_load_tr_desc()

static inline void native_load_tr_desc(void)
{
    asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
}
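The operand GDT_ENTRY_TSS*8 is a segment selector: the descriptor index sits in bits 3 and above, bit 2 is the table indicator and bits 0-1 are the requested privilege level, so multiplying the index by 8 yields a selector into the GDT with privilege level 0. The sketch below encodes that layout; demo_make_selector is a hypothetical helper and the value 8 for the TSS entry is an assumption used only for illustration.

```c
#include <stdint.h>

#define DEMO_GDT_ENTRY_TSS 8   /* assumed index of the TSS in the GDT */

/* Build a selector from a descriptor index, a table indicator
 * (0 = GDT, 1 = LDT) and a requested privilege level (0..3). */
static uint16_t demo_make_selector(unsigned int index, unsigned int ti,
                                   unsigned int rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1u) << 2) | (rpl & 3u));
}
```

For example, demo_make_selector(DEMO_GDT_ENTRY_TSS, 0, 0) gives the same number as DEMO_GDT_ENTRY_TSS * 8.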

At the end of the trap_init function we can see the following code:

set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
...
...
...
#ifdef CONFIG_X86_64
    memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
    set_nmi_gate(X86_TRAP_DB, &debug);
    set_nmi_gate(X86_TRAP_BP, &int3);
#endif

Here we copy idt_table to nmi_idt_table and set up exception handlers for the #DB or Debug exception and the #BP or Breakpoint exception. You may remember that we already set these interrupt gates in the previous part, so why do we need to set them up again? We set them up again because when we initialized them before in the early_trap_init function, the Task State Segment was not ready yet, but now it is ready after the call of the cpu_init function.

That's all. Soon we will consider all handlers of these interrupts/exceptions.

Conclusion

It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the Task State Segment in this part and the initialization of different interrupt handlers such as Divide Error, Page Fault exception, etc. Note that we saw just the initialization stuff; we will dive into the details of the handlers for these exceptions later. In the next part we will start to do it.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

* page fault
* Interrupt Descriptor Table
* Tracing
* cr2
* RCU
* this_cpu_* operations
* kmemcheck
* prefetchw
* 3DNow
* CPU caches
* VFS
* Linux kernel memory management
* Fix-Mapped Addresses and ioremap
* Extended Industry Standard Architecture
* INT instruction
* INTO
* BOUND
* opcode
* control register
* x87 FPU
* MCE exception
* SIMD
* cpumasks and bitmaps
* NX
* Task State Segment
* Previous part

Interrupts and Interrupt Handling. Part 5.

Implementation of exception handlers

This is the fifth part about interrupts and exceptions handling in the Linux kernel. In the previous part we stopped at the setting of interrupt gates in the Interrupt Descriptor Table. We did it in the trap_init function from the arch/x86/kernel/traps.c source code file. We saw only the setting of these interrupt gates in the previous part, and in the current part we will see the implementation of the exception handlers for these gates. The preparation before an exception handler is executed is in the arch/x86/entry/entry_64.S assembly file and occurs in the idtentry macro that defines the exception entry points:

idtentry divide_error               do_divide_error               has_error_code=0
idtentry overflow                   do_overflow                   has_error_code=0
idtentry invalid_op                 do_invalid_op                 has_error_code=0
idtentry bounds                     do_bounds                     has_error_code=0
idtentry device_not_available       do_device_not_available       has_error_code=0
idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
idtentry invalid_TSS                do_invalid_TSS                has_error_code=1
idtentry segment_not_present        do_segment_not_present        has_error_code=1
idtentry spurious_interrupt_bug     do_spurious_interrupt_bug     has_error_code=0
idtentry coprocessor_error          do_coprocessor_error          has_error_code=0
idtentry alignment_check            do_alignment_check            has_error_code=1
idtentry simd_coprocessor_error     do_simd_coprocessor_error     has_error_code=0

The idtentry macro does the following preparation before an actual exception handler (do_divide_error for divide_error, do_overflow for overflow, etc.) gets control. In other words, the idtentry macro allocates space for the registers (pt_regs structure) on the stack, pushes a dummy error code for stack consistency if an interrupt/exception has no error code, checks the segment selector in the cs segment register and switches stacks depending on the previous state (userspace or kernelspace). After all of these preparations it calls the actual interrupt/exception handler:

.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
    ...
    ...
    ...
    call \do_sym
    ...
    ...
    ...
END(\sym)
.endm

After an exception handler finishes its work, the idtentry macro restores the stack and the general purpose registers of the interrupted task and executes the iret instruction:

ENTRY(paranoid_exit)
    ...
    ...
    ...
    RESTORE_EXTRA_REGS
    RESTORE_C_REGS
    REMOVE_PT_GPREGS_FROM_STACK 8
    INTERRUPT_RETURN
END(paranoid_exit)

where INTERRUPT_RETURN is:

#define INTERRUPT_RETURN jmp native_iret
...
ENTRY(native_iret)
.global native_irq_return_iret
native_irq_return_iret:
    iretq

More about the idtentry macro you can read in the third part of the http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html chapter. OK, now we have seen the preparation before an exception handler is executed, and it is time to look at the handlers. First of all let's look at the following handlers:

* divide_error
* overflow
* invalid_op
* coprocessor_segment_overrun
* invalid_TSS
* segment_not_present
* stack_segment
* alignment_check

All these handlers are defined in the arch/x86/kernel/traps.c source code file with the DO_ERROR macro:

DO_ERROR(X86_TRAP_DE,     SIGFPE,  "divide error",                divide_error)
DO_ERROR(X86_TRAP_OF,     SIGSEGV, "overflow",                    overflow)
DO_ERROR(X86_TRAP_UD,     SIGILL,  "invalid opcode",              invalid_op)
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE,  "coprocessor segment overrun", coprocessor_segment_overrun)
DO_ERROR(X86_TRAP_TS,     SIGSEGV, "invalid TSS",                 invalid_TSS)
DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",         segment_not_present)
DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",               stack_segment)
DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",             alignment_check)

As we can see, the DO_ERROR macro takes 4 parameters:

* Vector number of an interrupt;
* Signal number which will be sent to the interrupted process;
* String which describes an exception;
* Exception handler entry point.

This macro is defined in the same source code file and expands to a function with the do_handler name:

#define DO_ERROR(trapnr, signr, str, name)                              \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)     \
{                                                                       \
    do_error_trap(regs, error_code, str, trapnr, signr);                \
}

Note the ## tokens. This is a special feature - GCC macro Concatenation - which concatenates two given tokens. For example, the first DO_ERROR in our example expands to:

dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code)
{
    ...
}
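The same token-pasting technique works in any C program. The following sketch generates two small functions from one macro, in the spirit of DO_ERROR; the DEMO_DO_ERROR macro, the demo_error_trap function and the two recording variables are invented for this example and only record which trap and signal were requested.

```c
#include <signal.h>

/* Records the last request so the effect of a generated function is visible. */
static int demo_last_trap = -1;
static int demo_last_signal = -1;

/* Common worker, playing the role of do_error_trap(). */
static void demo_error_trap(int trapnr, int signr)
{
    demo_last_trap = trapnr;
    demo_last_signal = signr;
}

/* ## pastes "demo_do_" and the name argument into one identifier. */
#define DEMO_DO_ERROR(trapnr, signr, name)  \
static void demo_do_##name(void)            \
{                                           \
    demo_error_trap(trapnr, signr);         \
}

DEMO_DO_ERROR(0, SIGFPE, divide_error)  /* defines demo_do_divide_error() */
DEMO_DO_ERROR(6, SIGILL, invalid_op)    /* defines demo_do_invalid_op()   */
```

Calling demo_do_divide_error() records vector 0 with SIGFPE, and demo_do_invalid_op() records vector 6 with SIGILL, although neither function was written out by hand.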

We can see that all functions which are generated by the DO_ERROR macro just call the do_error_trap function from arch/x86/kernel/traps.c. Let's look at the implementation of the do_error_trap function.

Trap handlers

The do_error_trap function starts and ends with the two following functions:

enum ctx_state prev_state = exception_enter();
...
...
...
exception_exit(prev_state);

from include/linux/context_tracking.h. Context tracking is the Linux kernel subsystem which provides kernel boundary probes to keep track of the transitions between level contexts, with two basic initial contexts: user or kernel. The exception_enter function checks that context tracking is enabled. If it is enabled, exception_enter reads the previous context and compares it with CONTEXT_KERNEL. If the previous context is user, we call the context_tracking_exit function from kernel/context_tracking.c, which informs the context tracking subsystem that a processor is exiting user mode and entering kernel mode:

if (!context_tracking_is_enabled())
    return 0;

prev_ctx = this_cpu_read(context_tracking.state);
if (prev_ctx != CONTEXT_KERNEL)
    context_tracking_exit(prev_ctx);

return prev_ctx;

If the previous context is not user, we just return it. prev_ctx has the enum ctx_state type, which is defined in include/linux/context_tracking_state.h and looks like:

enum ctx_state {
    CONTEXT_KERNEL = 0,
    CONTEXT_USER,
    CONTEXT_GUEST,
} state;

The second function is exception_exit, defined in the same include/linux/context_tracking.h file. It checks that context tracking is enabled and calls the context_tracking_enter function if the previous context was user:

static inline void exception_exit(enum ctx_state prev_ctx)
{
    if (context_tracking_is_enabled()) {
        if (prev_ctx != CONTEXT_KERNEL)
            context_tracking_enter(prev_ctx);
    }
}

The context_tracking_enter function informs the context tracking subsystem that a processor is going to enter user mode from kernel mode. We can see the following code between exception_enter and exception_exit:

if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
        NOTIFY_STOP) {
    conditional_sti(regs);
    do_trap(trapnr, signr, str, regs, error_code,
        fill_trap_info(regs, signr, trapnr, &info));
}

First of all it calls the notify_die function, which is defined in kernel/notifier.c. To get notified of a kernel panic, kernel oops, Non-Maskable Interrupt or other events, the caller needs to insert itself into the die chain, and the notify_die function is what runs that chain.

The Linux kernel has a special mechanism that allows kernel code to ask to be told when something happens; this mechanism is called notifiers or notifier chains. It is used, for example, for USB hotplug events (look at drivers/usb/core/notify.c), for memory hotplug (look at include/linux/memory.h, the hotplug_memory_notifier macro, etc.), system reboots, etc. A notifier chain is thus a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special notifier_block structure and passes it to the notifier_chain_register function. An event can be sent with a call of the notifier_call_chain function.

First of all the notify_die function fills the die_args structure with the trap number, trap string, registers and other values:

struct die_args args = {
    .regs   = regs,
    .str    = str,
    .err    = err,
    .trapnr = trap,
    .signr  = sig,
};

and returns the result of the atomic_notifier_call_chain function called with the die_chain:

static ATOMIC_NOTIFIER_HEAD(die_chain);
return atomic_notifier_call_chain(&die_chain, val, &args);

where ATOMIC_NOTIFIER_HEAD just expands to a definition of the atomic_notifier_head structure, which contains a lock and a notifier_block:

struct atomic_notifier_head {
    spinlock_t lock;
    struct notifier_block __rcu *head;
};
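The register-then-call flow of a notifier chain can be sketched as a plain singly-linked list. This is a simplified user-space model with no locking or RCU; every demo_ name, the two return codes and the sample callbacks are hypothetical and chosen for the illustration, not taken from include/linux/notifier.h.

```c
/* Hypothetical return codes for this sketch. */
#define DEMO_NOTIFY_DONE 0
#define DEMO_NOTIFY_STOP 1

struct demo_notifier_block {
    int (*notifier_call)(struct demo_notifier_block *nb,
                         unsigned long event, void *data);
    struct demo_notifier_block *next;
    int priority;
};

/* Insert a block so that the list stays sorted by descending priority. */
static void demo_chain_register(struct demo_notifier_block **head,
                                struct demo_notifier_block *nb)
{
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Call every block in turn; stop early if a callback asks for it.
 * Returns the value of the last callback that ran. */
static int demo_call_chain(struct demo_notifier_block *head,
                           unsigned long event, void *data)
{
    int ret = DEMO_NOTIFY_DONE;

    while (head) {
        ret = head->notifier_call(head, event, data);
        if (ret == DEMO_NOTIFY_STOP)
            break;
        head = head->next;
    }
    return ret;
}

/* Sample callback: counts the event and lets the chain continue. */
static int demo_count_event(struct demo_notifier_block *nb,
                            unsigned long event, void *data)
{
    (void)nb;
    (void)event;
    ++*(int *)data;
    return DEMO_NOTIFY_DONE;
}

/* Sample callback: counts the event and stops the chain. */
static int demo_stop_event(struct demo_notifier_block *nb,
                           unsigned long event, void *data)
{
    (void)nb;
    (void)event;
    ++*(int *)data;
    return DEMO_NOTIFY_STOP;
}
```

A caller such as do_error_trap only looks at the final return value: if some callback stopped the chain, the event is treated as handled and the default processing is skipped.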

The atomic_notifier_call_chain function calls each function in a notifier chain in turn and returns the value of the last notifier function called. If notify_die in do_error_trap does not return NOTIFY_STOP, we execute the conditional_sti function from arch/x86/kernel/traps.c, which checks the value of the interrupt flag and enables interrupts depending on it:

static inline void conditional_sti(struct pt_regs *regs)
{
    if (regs->flags & X86_EFLAGS_IF)
        local_irq_enable();
}

More about the local_irq_enable macro you can read in the second part of this chapter. The next and last call in do_error_trap is the do_trap function. First of all the do_trap function defines the tsk variable, which has the task_struct type and represents the current interrupted process. After the definition of tsk, we can see the call of the do_trap_no_signal function:

struct task_struct *tsk = current;

if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
    return;

The do_trap_no_signal function makes two checks:

* Did we come from the Virtual 8086 mode;
* Did we come from the kernelspace.

if (v8086_mode(regs)) {
    ...
}

if (!user_mode(regs)) {
    ...
}

return -1;

We will not consider the first case because long mode does not support the Virtual 8086 mode. In the second case we invoke the fixup_exception function, which tries to recover from the fault, and die if we can't:

if (!fixup_exception(regs)) {
    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = trapnr;
    die(str, regs, error_code);
}

The die function, defined in the arch/x86/kernel/dumpstack.c source code file, prints useful information about the stack, registers and kernel modules, and causes a kernel oops. If we came from userspace, the do_trap_no_signal function returns -1 and the execution of the do_trap function continues. If we passed through the do_trap_no_signal function and did not exit from do_trap after this, it means that the previous context was user.

Most exceptions caused by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode, etc. When an exception occurs, the Linux kernel sends a signal to the interrupted process that caused the exception to notify it of an incorrect condition. So, in the do_trap function we need to send a signal with the given number (SIGFPE for the divide error, SIGILL for the invalid opcode exception, etc.). First of all we save the error code and the vector number in the current interrupted process by filling thread.error_code and thread.trap_nr:

tsk->thread.error_code = error_code;
tsk->thread.trap_nr = trapnr;

After this we check whether we need to print information about unhandled signals for the interrupted process. We check that the show_unhandled_signals variable is set, that the unhandled_signal function from kernel/signal.c reports the signal as unhandled, and that printk_ratelimit allows printing:

#ifdef CONFIG_X86_64
    if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
        printk_ratelimit()) {
        pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx",
            tsk->comm, tsk->pid, str,
            regs->ip, regs->sp, error_code);
        print_vma_addr(" in ", regs->ip);
        pr_cont("\n");
    }
#endif

And send the given signal to the interrupted process:

force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk);
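The pairing of trap vectors and signals that the DO_ERROR lines establish can be written down as a small lookup table. The sketch below is a user-space illustration only: demo_trap_table and demo_signal_for_trap are made-up names, and the vector numbers are the architectural x86 values for these eight exceptions.

```c
#include <signal.h>
#include <stddef.h>

struct demo_trap_signal {
    int trapnr;         /* x86 exception vector */
    int signr;          /* signal sent to the interrupted process */
    const char *str;    /* description used in the DO_ERROR line */
};

/* One row per DO_ERROR() line from arch/x86/kernel/traps.c. */
static const struct demo_trap_signal demo_trap_table[] = {
    {  0, SIGFPE,  "divide error" },
    {  4, SIGSEGV, "overflow" },
    {  6, SIGILL,  "invalid opcode" },
    {  9, SIGFPE,  "coprocessor segment overrun" },
    { 10, SIGSEGV, "invalid TSS" },
    { 11, SIGBUS,  "segment not present" },
    { 12, SIGBUS,  "stack segment" },
    { 17, SIGBUS,  "alignment check" },
};

/* Returns the signal for a trap vector, or -1 if the table has no entry. */
static int demo_signal_for_trap(int trapnr)
{
    size_t i;

    for (i = 0; i < sizeof(demo_trap_table) / sizeof(demo_trap_table[0]); i++)
        if (demo_trap_table[i].trapnr == trapnr)
            return demo_trap_table[i].signr;
    return -1;
}
```

This is why a user program that divides an integer by zero normally sees SIGFPE, while one that executes an invalid opcode sees SIGILL.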


This is the end of do_trap. We just saw the generic implementation for eight different exceptions which are defined with the DO_ERROR macro. Now let's look at other exception handlers.

Double fault

The next exception is #DF or Double fault. This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. We set the interrupt gate for this exception in the previous part:

set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);

Note that this exception runs on the DOUBLEFAULT_STACK Interrupt Stack Table entry, which has index 1:

#define DOUBLEFAULT_STACK 1

double_fault is the handler for this exception and is defined in arch/x86/kernel/traps.c. The double_fault handler starts with the definition of two variables: a string that describes the exception and the interrupted process, as in other exception handlers:

static const char str[] = "double fault";
struct task_struct *tsk = current;

The handler of the double fault exception is split into two parts. The first part is the check which determines whether the fault is a non-IST fault on the espfix64 stack. Actually the iret instruction restores only the bottom 16 bits when returning to a 16 bit segment. The espfix feature solves this problem. So if it is a non-IST fault on the espfix64 stack, we modify the stack to make it look like a General Protection Fault:

struct pt_regs *normal_regs = task_pt_regs(current);

memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
normal_regs->orig_ax = 0;
regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&normal_regs->orig_ax;
return;

In the second case we do almost the same as in the previous exception handlers. The first step is the call of the ist_enter function, which discards the previous context, user in our case:

ist_enter(regs);

After this we fill the interrupted process with the vector number of the Double fault exception and the error code, as we did in the previous handlers:

tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_DF;

Next we print useful information about the double fault (PID number, registers content):

#ifdef CONFIG_DOUBLEFAULT
    df_debug(regs, error_code);
#endif

And die:

for (;;)
    die(str, regs, error_code);

That's all.

Device not available exception handler

The next exception is the #NM or Device not available. The Device not available exception can occur depending on these things:

* The processor executed an x87 FPU floating-point instruction while the EM flag in control register cr0 was set;
* The processor executed a wait or fwait instruction while the MP and TS flags of register cr0 were set;
* The processor executed an x87 FPU, MMX or SSE instruction while the TS flag in control register cr0 was set and the EM flag is clear.

The handler of the Device not available exception is the do_device_not_available function, and it is defined in the arch/x86/kernel/traps.c source code file too. It starts and ends with getting the previous context, like the other traps which we saw at the beginning of this part:

enum ctx_state prev_state;

prev_state = exception_enter();
...
...
...
exception_exit(prev_state);

In the next step we check that the FPU is not eager:

BUG_ON(use_eager_fpu());

When we switch into a task or interrupt we may avoid loading the FPU state. If a task uses it, we catch the Device not Available exception. If we load the FPU state during task switching, the FPU is eager. In the next step we check the cr0 control register for the EM flag, which shows us whether the x87 floating point unit is present (flag clear) or not (flag set):

#ifdef CONFIG_MATH_EMULATION
    if (read_cr0() & X86_CR0_EM) {
        struct math_emu_info info = { };

        conditional_sti(regs);

        info.regs = regs;
        math_emulate(&info);
        exception_exit(prev_state);
        return;
    }
#endif

If the x87 floating point unit is not present, we enable interrupts with conditional_sti, fill the math_emu_info structure (defined in arch/x86/include/asm/math_emu.h) with the registers of the interrupted task and call the math_emulate function from arch/x86/math-emu/fpu_entry.c. As you can understand from the function's name, it emulates the x87 FPU unit (more about the x87 we will learn in a special chapter). Otherwise, if the X86_CR0_EM flag is clear, which means that the x87 FPU unit is present, we call the fpu__restore function from arch/x86/kernel/fpu/core.c, which copies the FPU registers from the fpustate to the live hardware registers. After this FPU instructions can be used:

fpu__restore(&current->thread.fpu);
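The three conditions listed at the start of this section can be condensed into one predicate over the cr0 flags. The sketch below is only a model of that list: the demo_ names are hypothetical, and the bit positions used for MP, EM and TS (bits 1, 2 and 3 of cr0) are stated here as the architectural values.

```c
#include <stdbool.h>

#define DEMO_CR0_MP (1UL << 1)  /* Monitor Coprocessor */
#define DEMO_CR0_EM (1UL << 2)  /* Emulation */
#define DEMO_CR0_TS (1UL << 3)  /* Task Switched */

/* Instruction classes that appear in the list of #NM conditions. */
enum demo_insn { DEMO_INSN_X87, DEMO_INSN_WAIT, DEMO_INSN_MMX_SSE };

/* Returns true when executing the instruction would raise #NM. */
static bool demo_raises_nm(unsigned long cr0, enum demo_insn insn)
{
    bool mp = (cr0 & DEMO_CR0_MP) != 0;
    bool em = (cr0 & DEMO_CR0_EM) != 0;
    bool ts = (cr0 & DEMO_CR0_TS) != 0;

    switch (insn) {
    case DEMO_INSN_X87:
        return em || ts;        /* conditions 1 and 3 */
    case DEMO_INSN_WAIT:
        return mp && ts;        /* condition 2 */
    case DEMO_INSN_MMX_SSE:
        return ts && !em;       /* condition 3 */
    }
    return false;
}
```

The TS case is the one that lazy FPU switching relies on: the flag is left set after a task switch, so the first FPU instruction of the new task raises #NM and the handler can restore the FPU state.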

General protection fault exception handler

The next exception is the #GP or General protection fault. This exception occurs when the processor detects one of a class of protection violations called general-protection violations. It can be:

* Exceeding the segment limit when accessing the cs, ds, es, fs or gs segments;
* Loading the ss, ds, es, fs or gs register with a segment selector for a system segment;
* Violating any of the privilege rules;
* and others...

The exception handler for this exception is do_general_protection from arch/x86/kernel/traps.c. Like other exception handlers, the do_general_protection function starts and ends with getting the previous context:

prev_state = exception_enter();
...
exception_exit(prev_state);

After this we enable interrupts if they were enabled in the interrupted context and check whether we came from the Virtual 8086 mode:

conditional_sti(regs);

if (v8086_mode(regs)) {
    local_irq_enable();
    handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
    goto exit;
}

As long mode does not support this mode, we will not consider exception handling for this case. In the next step we check whether the previous mode was kernel mode and try to fix the trap. If we can't fix the current general protection fault exception, we fill the interrupted process with the vector number and error code of the exception and pass it to notify_die:

if (!user_mode(regs)) {
    if (fixup_exception(regs))
        goto exit;

    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = X86_TRAP_GP;
    if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
               X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
        die("general protection fault", regs, error_code);
    goto exit;
}

If we can fix the exception, we go to the exit label, which exits from the exception state:

exit:
    exception_exit(prev_state);

If we came from user mode, we send the SIGSEGV signal to the interrupted process as we did in the do_trap function:

if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
        printk_ratelimit()) {
    pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
        tsk->comm, task_pid_nr(tsk),
        regs->ip, regs->sp, error_code);
    print_vma_addr(" in ", regs->ip);
    pr_cont("\n");
}

force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);

That's all.

ItistheendofthefifthpartoftheInterruptsandInterruptHandlingchapterandwesawimplementationofsomeinterrupthandlersinthispart.InthenextpartwewillcontinuetodiveintointerruptandexceptionhandlersandwillseehandlerfortheNon-MaskableInterrupts,handlingofthemathcoprocessorandSIMDcoprocessorexceptionsandmanymanymore.

Ifyouwillhaveanyquestionsorsuggestionswritemeacommentorpingmeattwitter.

PleasenotethatEnglishisnotmyfirstlanguage,AndIamreallysorryforanyinconvenience.IfyouwillfindanymistakespleasesendmePRtolinux-internals.

InterruptdescriptorTableiretinstructionGCCmacroConcatenationkernelpanickerneloopsNon-MaskableInterrupthotpluginterruptflaglongmodesignalprintkcoprocessorSIMDInterruptStackTablePIDx87FPUcontrolregisterMMXPreviouspart

Conclusion

Links


Interrupts and Interrupt Handling. Part 6.

Non-maskable interrupt handler

This is the sixth part of the Interrupts and Interrupt Handling in the Linux kernel chapter. In the previous part we saw the implementation of some exception handlers: for the General Protection Fault exception, the divide exception, invalid opcode exceptions and so on. As I wrote in the previous part, we will see the implementations of the remaining exceptions in this part. We will see the implementation of the following handlers:

* Non-Maskable interrupt;
* BOUND Range Exceeded Exception;
* Coprocessor exception;
* SIMD coprocessor exception.

So, let's start.

Non-Maskable interrupt handling

A Non-Maskable interrupt is a hardware interrupt that cannot be ignored by standard masking techniques. In general, a non-maskable interrupt can be generated in either of two ways:

* External hardware asserts the non-maskable interrupt pin on the CPU.
* The processor receives a message on the system bus or the APIC serial bus with the delivery mode NMI.

When the processor receives an NMI from one of these sources, it handles it immediately by calling the NMI handler pointed to by interrupt vector number 2 (see the table in the first part). We already filled the Interrupt Descriptor Table with the vector number, the address of the nmi interrupt handler and the NMI_STACK Interrupt Stack Table entry:

    set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);

in the trap_init function, which is defined in the arch/x86/kernel/traps.c source code file. In the previous parts we saw that the entry points of all interrupt handlers are defined with the:

    .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
    ENTRY(\sym)
    ...
    ...
    ...
    END(\sym)
    .endm

macro from the arch/x86/entry/entry_64.S assembly source code file. But the handler of the Non-Maskable interrupts is not defined with this macro. It has its own entry point:

    ENTRY(nmi)
    ...
    ...
    ...
    END(nmi)


in the same arch/x86/entry/entry_64.S assembly file. Let's dive into it and try to understand how the Non-Maskable interrupt handler works. The nmi handler starts with the:

    PARAVIRT_ADJUST_EXCEPTION_FRAME

macro, but we will not dive into its details in this part, because this macro is related to Paravirtualization, which we will see in another chapter. After this we save the content of the rdx register on the stack:

    pushq %rdx

and check whether cs was the kernel segment when the non-maskable interrupt occurred:

    cmpl $__KERNEL_CS, 16(%rsp)
    jne first_nmi

The __KERNEL_CS macro is defined in arch/x86/include/asm/segment.h and represents the second descriptor in the Global Descriptor Table:

    #define GDT_ENTRY_KERNEL_CS 2
    #define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
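To see why index 2 turns into the value that is compared with the saved cs, recall that an x86 segment selector keeps the descriptor index in bits 3 and above, while the low three bits hold the table indicator and the requested privilege level. Here is a minimal userspace sketch of that arithmetic. The helpers make_selector and selector_rpl are hypothetical names used only for illustration; they are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define GDT_ENTRY_KERNEL_CS 2

/* Hypothetical helper: build a selector from a GDT index and an RPL (0..3).
 * Bit 2, the table indicator, stays 0, which selects the GDT. */
static uint16_t make_selector(unsigned index, unsigned rpl)
{
    assert(rpl <= 3);
    return (uint16_t)((index << 3) | rpl);
}

/* Hypothetical helper: extract the requested privilege level. */
static unsigned selector_rpl(uint16_t sel)
{
    return sel & 3;
}
```

With these helpers, make_selector(GDT_ENTRY_KERNEL_CS, 0) equals GDT_ENTRY_KERNEL_CS*8, so a saved cs that differs from this value means the NMI did not interrupt code running on the kernel code segment.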

You can read more about the GDT in the second part of the Linux kernel booting process chapter. If cs is not the kernel segment, it means that this is not a nested NMI and we jump to the first_nmi label. Let's consider this case. First of all, at the first_nmi label we load the value from the top of the stack (the rdx that we saved earlier) back into rdx and push 1 onto the stack:

    first_nmi:
        movq (%rsp), %rdx
        pushq $1

Why do we push 1 on the stack? As the comment says: We allow breakpoints in NMIs. On x86_64, like on other architectures, the CPU will not execute another NMI until the first NMI is complete. An NMI finishes with the iret instruction, like other interrupts and exceptions do. But the NMI handler may trigger a page fault, a breakpoint or another exception, and these use the iret instruction too. If this happens while in NMI context, the CPU will leave NMI context and a new NMI may come in: the iret used to return from those exceptions re-enables NMIs, and we get nested non-maskable interrupts. The problem is that the NMI handler will not return to the state it was in when the exception triggered; instead it will return to a state that allows new NMIs to preempt the running NMI handler. If another NMI comes in before the first NMI handler is complete, the new NMI will write all over the preempted NMI's stack, because the nested NMI uses the top of the stack of the previous NMI. It means that we cannot simply execute it, because a nested non-maskable interrupt would corrupt the stack of the previous one. That's why we allocate space on the stack for a temporary variable. We check this variable to see whether a previous NMI is still executing, and clear it when the NMI is not nested. We push 1 here to denote that a non-maskable interrupt is currently executing. Remember that when an NMI or another exception occurs we have the following stack frame:

    +------------------------+
    |          SS            |
    |          RSP           |
    |         RFLAGS         |
    |          CS            |
    |          RIP           |
    +------------------------+


and also an error code if the exception has one. So, after all of these manipulations our stack frame will look like this:

    +------------------------+
    |          SS            |
    |          RSP           |
    |         RFLAGS         |
    |          CS            |
    |          RIP           |
    |          RDX           |
    |           1            |
    +------------------------+

In the next step we allocate another 40 bytes on the stack:

    subq $(5*8), %rsp

and push a copy of the original stack frame after the allocated space:

    .rept 5
    pushq 11*8(%rsp)
    .endr

with the .rept assembly directive. We need a copy of the original stack frame. In fact, we need two copies of the interrupt stack frame: the saved stack frame and the copied stack frame. Here we push the original stack frame to the saved stack frame, which is located after the 40 bytes that we have just allocated (the copied stack frame). The saved stack frame is used to fix up the copied stack frame, which a nested NMI may change. The copied stack frame is modified by any nested NMIs to let the first NMI know that a second NMI was triggered and that the first NMI handler should be repeated. OK, we have made the first copy of the original stack frame; now it is time to make the second copy:

    addq $(10*8), %rsp

    .rept 5
    pushq -6*8(%rsp)
    .endr
    subq $(5*8), %rsp

After all of these manipulations our stack frame will look like this:

    +-------------------------+
    | original SS             |
    | original Return RSP     |
    | original RFLAGS         |
    | original CS             |
    | original RIP            |
    +-------------------------+
    | temp storage for rdx    |
    +-------------------------+
    | NMI executing variable  |
    +-------------------------+
    | copied SS               |
    | copied Return RSP       |
    | copied RFLAGS           |
    | copied CS               |
    | copied RIP              |
    +-------------------------+
    | Saved SS                |
    | Saved Return RSP        |
    | Saved RFLAGS            |
    | Saved CS                |
    | Saved RIP               |
    +-------------------------+
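The offsets 11*8 and -6*8 are easy to get lost in, so it may help to replay the pushes shown above on a small model of the stack. The following is only a userspace sketch: the stack is an array of 8-byte slots, the frame values are made up, and names such as sim_stack, push_rel and simulate_first_nmi are hypothetical. It relies on one architectural detail: pushq with an %rsp-relative operand computes the operand address before %rsp is decremented.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOTS = 32, FRAME = 5 };   /* FRAME: SS, RSP, RFLAGS, CS, RIP */

struct sim_stack {
    uint64_t mem[SLOTS];  /* mem[i] models the 8-byte slot at offset i*8 */
    int sp;               /* index of the slot that %rsp points to */
};

/* pushq value */
static void push(struct sim_stack *s, uint64_t v)
{
    assert(s->sp > 0);
    s->mem[--s->sp] = v;
}

/* pushq off*8(%rsp): the operand address uses %rsp before the decrement */
static void push_rel(struct sim_stack *s, int off)
{
    assert(s->sp + off >= 0 && s->sp + off < SLOTS);
    push(s, s->mem[s->sp + off]);
}

/* Replays the first_nmi pushes and both frame copies on a made-up frame. */
static void simulate_first_nmi(struct sim_stack *s)
{
    static const uint64_t frame[FRAME] = {   /* illustrative values only */
        0x2b /* SS */, 0x7ffc1000 /* RSP */, 0x246 /* RFLAGS */,
        0x33 /* CS */, 0x400123 /* RIP */
    };
    int i;

    s->sp = SLOTS;
    for (i = 0; i < SLOTS; i++)
        s->mem[i] = 0;
    for (i = 0; i < FRAME; i++)
        push(s, frame[i]);      /* frame pushed by the CPU */
    push(s, 0xdead);            /* pushq %rdx (temp storage for rdx) */
    push(s, 1);                 /* pushq $1 (NMI executing variable) */
    s->sp -= FRAME;             /* subq $(5*8), %rsp */
    for (i = 0; i < FRAME; i++)
        push_rel(s, 11);        /* .rept 5; pushq 11*8(%rsp): saved frame */
    s->sp += 10;                /* addq $(10*8), %rsp */
    for (i = 0; i < FRAME; i++)
        push_rel(s, -6);        /* .rept 5; pushq -6*8(%rsp): copied frame */
    s->sp -= FRAME;             /* subq $(5*8), %rsp */
}
```

After simulate_first_nmi the model holds three identical five-slot frames (original, copied and saved, from higher to lower addresses) and the stack pointer rests on the saved RIP, which matches the layout in the diagram.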

After this we push a dummy error code on the stack, as we already did in the previous exception handlers, and allocate space for the general purpose registers on the stack:

    pushq $-1
    ALLOC_PT_GPREGS_ON_STACK

We already saw the implementation of the ALLOC_PT_GPREGS_ON_STACK macro in the third part of the interrupts chapter. This macro is defined in arch/x86/entry/calling.h and allocates another 120 bytes on the stack for the 15 general purpose registers, from rdi to r15:

    .macro ALLOC_PT_GPREGS_ON_STACK addskip=0
    addq $-(15*8+\addskip), %rsp
    .endm

After the space allocation for the general purpose registers we can see the call of paranoid_entry:

    call paranoid_entry

We may remember this label from the previous parts. It pushes the general purpose registers on the stack, reads the MSR_GS_BASE Model Specific Register and checks its value. If the value of MSR_GS_BASE is negative, we came from kernel mode and just return from paranoid_entry; otherwise we came from user mode and need to execute the swapgs instruction, which exchanges the user gs with the kernel gs:

    ENTRY(paranoid_entry)
        cld
        SAVE_C_REGS 8
        SAVE_EXTRA_REGS 8
        movl $1, %ebx
        movl $MSR_GS_BASE, %ecx
        rdmsr
        testl %edx, %edx
        js 1f
        SWAPGS
        xorl %ebx, %ebx
    1:  ret
    END(paranoid_entry)
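The sign test may look odd at first: rdmsr returns the 64-bit MSR value in the edx:eax register pair, so testl %edx, %edx followed by js only looks at the sign bit of the upper half. Kernel addresses have that bit set and user addresses do not. Here is a small sketch of the same decision in C, assuming nothing beyond what the assembly above shows. The function names are hypothetical and the addresses in any usage are purely illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper modelling "rdmsr; testl %edx, %edx; js 1f".
 * rdmsr returns the MSR in edx:eax, so edx holds the upper 32 bits. */
static bool gs_base_is_kernel(uint64_t gs_base)
{
    uint32_t edx = (uint32_t)(gs_base >> 32);

    return (edx & 0x80000000u) != 0;   /* sign bit set: kernel address */
}

/* Value left in ebx by paranoid_entry:
 * 1 means gs was already the kernel one, 0 means swapgs was executed. */
static int paranoid_entry_ebx(uint64_t gs_base)
{
    return gs_base_is_kernel(gs_base) ? 1 : 0;
}
```

So paranoid_entry_ebx returns 1 for a kernel-looking base and 0 for a user-looking one, and that is the flag the exit path tests later.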

Note that after the swapgs instruction we zero the ebx register. Later we will check the content of this register: if we executed swapgs then ebx contains 0, and 1 otherwise. In the next step we store the value of the cr2 control register in the r12 register, because the NMI handler can cause a page fault and corrupt the value of this control register:

    movq %cr2, %r12

Now it is time to call the actual NMI handler. We put the address of pt_regs into rdi, the error code into rsi and call the do_nmi handler:

    movq %rsp, %rdi
    movq $-1, %rsi
    call do_nmi


We will get back to do_nmi a little later in this part, but now let's look at what happens after do_nmi finishes its execution. After the do_nmi handler has finished, we check the cr2 register, because we could have got a page fault while do_nmi was running. If cr2 has changed we restore its original value; otherwise we jump to the label 1. After this we test the content of the ebx register (remember, it contains 0 if we used the swapgs instruction and 1 if we didn't) and execute SWAPGS_UNSAFE_STACK if it contains 0, or jump to the nmi_restore label if it contains 1. The SWAPGS_UNSAFE_STACK macro just expands to the swapgs instruction. At the nmi_restore label we restore the general purpose registers, release the stack space allocated for these registers, clear our temporary variable and exit from the interrupt handler with the INTERRUPT_RETURN macro:

        movq %cr2, %rcx
        cmpq %rcx, %r12
        je 1f
        movq %r12, %cr2
    1:
        testl %ebx, %ebx
        jnz nmi_restore
    nmi_swapgs:
        SWAPGS_UNSAFE_STACK
    nmi_restore:
        RESTORE_EXTRA_REGS
        RESTORE_C_REGS
        /* Pop the extra iret frame at once */
        REMOVE_PT_GPREGS_FROM_STACK 6*8
        /* Clear the NMI executing stack variable */
        movq $0, 5*8(%rsp)
        INTERRUPT_RETURN

where INTERRUPT_RETURN is defined in arch/x86/include/asm/irqflags.h and just expands to the iret instruction. That's all.

Now let's consider the case when another NMI occurs while the previous NMI has not finished its execution. You may remember from the beginning of this part that we check whether we came from user space and jump to first_nmi in this case:

    cmpl $__KERNEL_CS, 16(%rsp)
    jne first_nmi

Note that in this case it is always the first NMI, because if the first NMI caught a page fault, a breakpoint or another exception, it would be executing in kernel mode. If we didn't come from user space, first of all we test our temporary variable:

    cmpl $1, -8(%rsp)
    je nested_nmi

and if it is set to 1 we jump to the nested_nmi label. If it is not 1, we test the IST stack. In the case of nested NMIs we check whether we are above repeat_nmi. In this case we ignore it; otherwise we check whether we are above end_repeat_nmi and jump to the nested_nmi_out label.

Now let's look at the do_nmi exception handler. This function is defined in the arch/x86/kernel/nmi.c source code file and takes two parameters:

* address of the pt_regs;
* error code.

like all exception handlers. do_nmi starts with the call of the nmi_nesting_preprocess function and ends with the call of nmi_nesting_postprocess. The nmi_nesting_preprocess function checks whether we are running on the debug stack (which is unlikely) and, if we are, sets the update_debug_stack per-cpu variable to 1 and calls the debug_stack_set_zero function from arch/x86/kernel/cpu/common.c. This function increments the debug_stack_use_ctr per-cpu variable and loads a new Interrupt Descriptor Table:

    static inline void nmi_nesting_preprocess(struct pt_regs *regs)
    {
        if (unlikely(is_debug_stack(regs->sp))) {
            debug_stack_set_zero();
            this_cpu_write(update_debug_stack, 1);
        }
    }

The nmi_nesting_postprocess function checks the update_debug_stack per-cpu variable which we set in nmi_nesting_preprocess and resets the debug stack, or in other words it loads the original Interrupt Descriptor Table. After the call of nmi_nesting_preprocess, we can see the call of nmi_enter in do_nmi. nmi_enter increments the lockdep_recursion field of the interrupted process, updates the preempt counter and informs the RCU subsystem about the NMI. There is also the nmi_exit function, which does the same things as nmi_enter but in reverse. After nmi_enter we increment __nmi_count in the irq_stat structure and call the default_do_nmi function. First of all, in default_do_nmi we compare the address of the current NMI with the address of the previous one, and then record the current address as the last one:

    if (regs->ip == __this_cpu_read(last_nmi_rip))
        b2b = true;
    else
        __this_cpu_write(swallow_nmi, false);

    __this_cpu_write(last_nmi_rip, regs->ip);
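The b2b flag marks back-to-back NMIs: two NMIs in a row that interrupted the same instruction. The bookkeeping is small enough to try out in userspace. The sketch below assumes a single CPU, so plain static variables stand in for the per-cpu ones, and note_nmi is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the per-cpu variables; a single CPU is assumed here. */
static uint64_t last_nmi_rip;
static bool swallow_nmi;

/* Returns true when this NMI interrupted the same instruction as the
 * previous one, and remembers the address for the next call. */
static bool note_nmi(uint64_t ip)
{
    bool b2b = false;

    if (ip == last_nmi_rip)
        b2b = true;
    else
        swallow_nmi = false;

    last_nmi_rip = ip;
    return b2b;
}
```

Calling note_nmi twice with the same address reports a back-to-back NMI on the second call; a different address resets the detection.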

After this, first of all we need to handle CPU-specific NMIs:

    handled = nmi_handle(NMI_LOCAL, regs, b2b);
    __this_cpu_add(nmi_stats.normal, handled);

and then non-specific NMIs, depending on their reason:

    reason = x86_platform.get_nmi_reason();
    if (reason & NMI_REASON_MASK) {
        if (reason & NMI_REASON_SERR)
            pci_serr_error(reason, regs);
        else if (reason & NMI_REASON_IOCHK)
            io_check_error(reason, regs);

        __this_cpu_add(nmi_stats.external, 1);
        return;
    }
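The reason value is a small bit field, and the dispatch above can be sketched as a pure function. The bit values below are the ones I recall from arch/x86/include/asm/mach_traps.h, so please treat them as an assumption and check them against your kernel tree. The enum nmi_kind and the function classify_nmi are hypothetical names introduced only for this sketch.

```c
#include <stdint.h>

/* Bit values assumed from arch/x86/include/asm/mach_traps.h. */
#define NMI_REASON_SERR   0x80
#define NMI_REASON_IOCHK  0x40
#define NMI_REASON_MASK   (NMI_REASON_SERR | NMI_REASON_IOCHK)

enum nmi_kind { NMI_UNKNOWN, NMI_PCI_SERR, NMI_IO_CHECK };

/* Mirrors the if / else if chain: SERR takes priority over IOCHK. */
static enum nmi_kind classify_nmi(uint8_t reason)
{
    if (!(reason & NMI_REASON_MASK))
        return NMI_UNKNOWN;
    if (reason & NMI_REASON_SERR)
        return NMI_PCI_SERR;
    return NMI_IO_CHECK;
}
```

Note that when both bits are set only the SERR branch runs, exactly as in the else-if chain of the original code.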

That's all.

Range Exceeded Exception

The next exception is the BOUND range exceeded exception. The BOUND instruction determines whether the first operand (array index) is within the bounds of an array specified by the second operand (bounds operand). If the index is not within bounds, a BOUND range exceeded exception, or #BR, occurs. The handler of the #BR exception is the do_bounds function, defined in arch/x86/kernel/traps.c. The do_bounds handler starts with the call of the exception_enter function and ends with the call of exception_exit:

    prev_state = exception_enter();

    if (notify_die(DIE_TRAP, "bounds", regs, error_code,
               X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
        goto exit;
    ...
    ...
    ...
    exception_exit(prev_state);
    return;
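The check that the BOUND instruction performs, as described above, is just a comparison against an inclusive lower and upper limit. A minimal sketch in C follows. The struct bounds and the function bound_would_fault are hypothetical names, and 32-bit signed operands are assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The bounds operand: a pair of limits, both inclusive. */
struct bounds {
    int32_t lower;
    int32_t upper;
};

/* Returns true in the cases where BOUND would raise #BR. */
static bool bound_would_fault(int32_t index, const struct bounds *b)
{
    assert(b->lower <= b->upper);
    return index < b->lower || index > b->upper;
}
```

For a ten-element array with bounds {0, 9}, indexes 0 and 9 pass, while -1 and 10 would raise the exception.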

After we have got the state of the previous context, we pass the exception to the notify_die chain, and if it returns NOTIFY_STOP we return from the exception. You can read more about notify chains and the context tracking functions in the previous part. In the next step we enable interrupts if they were disabled, using the conditional_sti function, which checks the IF flag and calls local_irq_enable depending on its value:

    conditional_sti(regs);

    if (!user_mode(regs))
        die("bounds", regs, error_code);

and if we did not come from user mode, we send the SIGSEGV signal with the die function. After this we check whether MPX is enabled, and if this feature is disabled we jump to the exit_trap label:

    if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
        goto exit_trap;
    }

where we execute the do_trap function (you can find more about it in the previous part):

    exit_trap:
        do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
        exception_exit(prev_state);

If the MPX feature is enabled, we get the BNDSTATUS with the get_xsave_field_ptr function, and if it is zero, it means that MPX was not responsible for this exception:

    bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
    if (!bndcsr)
        goto exit_trap;

After all of this, there is only one case left: MPX is responsible for this exception. We will not dive into the details of Intel Memory Protection Extensions in this part, but will see them in another chapter.

Coprocessor exception and SIMD exception

The next two exceptions are the x87 FPU Floating-Point Error exception, or #MF, and the SIMD Floating-Point Exception, or #XF. The first exception occurs when the x87 FPU has detected a floating point error, for example divide by zero, numeric overflow and so on. The second exception occurs when the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception. Its causes can be the same as for the x87 FPU. The handlers for these exceptions, do_coprocessor_error and do_simd_coprocessor_error, are defined in arch/x86/kernel/traps.c and are very similar to each other. They both call the math_error function from the same source code file but pass different vector numbers. do_coprocessor_error passes the X86_TRAP_MF vector number to math_error:

    dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
    {
        enum ctx_state prev_state;

        prev_state = exception_enter();
        math_error(regs, error_code, X86_TRAP_MF);
        exception_exit(prev_state);
    }

and do_simd_coprocessor_error passes X86_TRAP_XF to the math_error function:

    dotraplinkage void
    do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
    {
        enum ctx_state prev_state;

        prev_state = exception_enter();
        math_error(regs, error_code, X86_TRAP_XF);
        exception_exit(prev_state);
    }

First of all the math_error function defines the current interrupted task, the address of its fpu and a string which describes the exception. Then it passes the exception to the notify_die chain and returns from the exception handler if the chain returns NOTIFY_STOP:

    struct task_struct *task = current;
    struct fpu *fpu = &task->thread.fpu;
    siginfo_t info;
    char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
                                          "simd exception";

    if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
        return;

After this we check whether we came from kernel mode, and if so we try to fix the exception with the fixup_exception function. If we cannot, we fill the task with the exception's error code and vector number and die:

    if (!user_mode(regs)) {
        if (!fixup_exception(regs)) {
            task->thread.error_code = error_code;
            task->thread.trap_nr = trapnr;
            die(str, regs, error_code);
        }
        return;
    }

If we came from user mode, we save the fpu state, fill the task structure with the vector number of the exception and fill siginfo_t with the signal number, errno, the address where the exception occurred and the signal code:

    fpu__save(fpu);

    task->thread.trap_nr = trapnr;
    task->thread.error_code = error_code;
    info.si_signo = SIGFPE;
    info.si_errno = 0;
    info.si_addr = (void __user *)uprobe_get_trap_addr(regs);
    info.si_code = fpu__exception_code(fpu, trapnr);

After this we check the signal code, and if it is zero we return:

    if (!info.si_code)
        return;

Otherwise we send the SIGFPE signal at the end:

    force_sig_info(SIGFPE, &info, task);

That's all.

Conclusion

It is the end of the sixth part of the Interrupts and Interrupt Handling chapter, and we saw the implementation of some exception handlers in this part, like the non-maskable interrupt, SIMD and x87 FPU floating point exceptions. We have finally finished with the trap_init function in this part and will move on in the next part. Our next point is external interrupts and the early_irq_init function from init/main.c.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

* General Protection Fault
* opcode
* Non-Maskable
* BOUND instruction
* CPU socket
* Interrupt Descriptor Table
* Interrupt Stack Table
* Paravirtualization
* .rept
* SIMD
* Coprocessor
* x86_64
* iret
* page fault
* breakpoint
* Global Descriptor Table
* stack frame
* Model Specific register
* percpu
* RCU
* MPX
* x87 FPU
* Previous part


Interrupts and Interrupt Handling. Part 7.

Introduction to external interrupts

This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel chapter, and in the previous part we finished with the exceptions which are generated by the processor. In this part we will continue to dive into interrupt handling and will start with external hardware interrupt handling. As you may remember, in the previous part we finished with the trap_init function from arch/x86/kernel/traps.c, and the next step is the call of the early_irq_init function from init/main.c.

Interrupts are signals that are sent across an IRQ, or Interrupt Request Line, by hardware or software. External hardware interrupts allow devices like a keyboard, a mouse and so on to indicate that they need the attention of the processor. Once the processor receives the Interrupt Request, it temporarily stops execution of the running program and invokes a special routine which depends on the interrupt. We already know that this routine is called an interrupt handler (from this part on we will also call it an ISR, or Interrupt Service Routine). The ISR can be found in the Interrupt Vector table, which is located at a fixed address in memory. After the interrupt is handled, the processor resumes the interrupted process. At boot/initialization time, the Linux kernel identifies all devices in the machine, and the appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by sending a Unix signal to the interrupted process. That's why the kernel can handle an exception quickly. Unfortunately, we cannot use this approach for external hardware interrupts, because they often arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of the interrupt:

* I/O interrupts;
* Timer interrupts;
* Interprocessor interrupts.

I will try to describe all types of interrupts in this book.

Generally, a handler of an I/O interrupt must be flexible enough to service several devices at the same time. For example, in the PCI bus architecture several devices may share the same IRQ line. In the simplest case the Linux kernel must do the following things when an I/O interrupt occurs:

* Save the value of the IRQ and the registers' contents on the kernel stack;
* Send an acknowledgment to the hardware controller which is servicing the IRQ line;
* Execute the interrupt service routine (ISR) which is associated with the device;
* Restore registers and return from the interrupt.

OK, we know a little theory, so now let's start with the early_irq_init function. The implementation of the early_irq_init function is in kernel/irq/irqdesc.c. This function performs early initialization of the irq_desc structure. The irq_desc structure is the foundation of the interrupt management code in the Linux kernel. An array of these structures, which has the same name, irq_desc, keeps track of every interrupt request source in the Linux kernel. This structure is defined in include/linux/irqdesc.h and, as you can see, it depends on the CONFIG_SPARSE_IRQ kernel configuration option. This kernel configuration option enables support for sparse irqs. The irq_desc structure contains many different fields:

* irq_common_data - per irq and chip data passed down to chip functions;
* status_use_accessors - contains the status of the interrupt source, which can be a combination of the values from the enum in include/linux/irq.h and different macros which are defined in the same source code file;
* kstat_irqs - irq stats per-cpu;
* handle_irq - highlevel irq-events handler;
* action - identifies the interrupt service routines to be invoked when the IRQ occurs;
* irq_count - counter of interrupt occurrences on the IRQ line;
* depth - 0 if the IRQ line is enabled and a positive value if it has been disabled at least once;
* last_unhandled - aging timer for unhandled count;
* irqs_unhandled - count of the unhandled interrupts;
* lock - a spin lock used to serialize the accesses to the IRQ descriptor;
* pending_mask - pending rebalanced interrupts;
* owner - an owner of the interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is needed to provide a refcount on the module which provides the interrupts;
* and so on.

Of course, these are not all the fields of the irq_desc structure, because it would take too long to describe each of them, but we will see them all soon. Now let's start to dive into the implementation of the early_irq_init function.

Early external interrupts initialization

Now, let's look at the implementation of the early_irq_init function. Note that the implementation of early_irq_init depends on the CONFIG_SPARSE_IRQ kernel configuration option. For now we consider the implementation of early_irq_init when the CONFIG_SPARSE_IRQ kernel configuration option is not set. This function starts with the declaration of the following variables: the irq descriptors counter, the loop counter, the memory node and the irq_desc descriptor:

    int __init early_irq_init(void)
    {
        int count, i, node = first_online_node;
        struct irq_desc *desc;
        ...
        ...
        ...
    }

The node is an online NUMA node which depends on the MAX_NUMNODES value, which in turn depends on the CONFIG_NODES_SHIFT kernel configuration parameter:

    #define MAX_NUMNODES (1 << NODES_SHIFT)
    ...
    ...
    ...
    #ifdef CONFIG_NODES_SHIFT
        #define NODES_SHIFT CONFIG_NODES_SHIFT
    #else
        #define NODES_SHIFT 0
    #endif

As I already wrote, the implementation of the first_online_node macro depends on the MAX_NUMNODES value:

    #if MAX_NUMNODES > 1
        #define first_online_node first_node(node_states[N_ONLINE])
    #else
        #define first_online_node 0
    #endif

The node_states is an array of node masks which is defined in include/linux/nodemask.h and indexed by the node_states enum; it represents the set of the states of a node. In our case we are searching for an online node, and it will be 0 if MAX_NUMNODES is one or zero. If MAX_NUMNODES is greater than one, node_states[N_ONLINE] gives the mask of online nodes and the first_node macro expands to the call of the __first_node function, which returns the minimal, or the first, online node:

    #define first_node(src) __first_node(&(src))

    static inline int __first_node(const nodemask_t *srcp)
    {
        return min_t(int, MAX_NUMNODES, find_first_bit(srcp->bits, MAX_NUMNODES));
    }
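To make the behaviour of __first_node concrete, here is a simplified userspace sketch. It assumes at most 64 nodes so that the mask fits into one 64-bit word, and it replaces find_first_bit with a plain loop. The type nodemask_sketch_t and the function first_node_sketch are hypothetical names; the real nodemask_t is a bitmap that may span several words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NUMNODES 64   /* illustrative value, i.e. NODES_SHIFT = 6 */

/* Simplified stand-in for nodemask_t: one bit per NUMA node. */
typedef struct {
    uint64_t bits;
} nodemask_sketch_t;

/* Returns the number of the lowest set bit, or MAX_NUMNODES
 * when no node is set in the mask. */
static int first_node_sketch(const nodemask_sketch_t *mask)
{
    int node;

    assert(mask != NULL);
    for (node = 0; node < MAX_NUMNODES; node++)
        if (mask->bits & (UINT64_C(1) << node))
            return node;
    return MAX_NUMNODES;
}
```

With a mask of 0x6 (nodes 1 and 2 online) the sketch returns 1, and with an empty mask it returns MAX_NUMNODES, which is the same upper limit that min_t enforces in the kernel version.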

More about this will be in another chapter about NUMA. The next step after the declaration of these local variables is the call of the:

    init_irq_default_affinity();

function. The init_irq_default_affinity function is defined in the same source code file and, depending on the CONFIG_SMP kernel configuration option, allocates a given cpumask structure (in our case it is irq_default_affinity):

    #if defined(CONFIG_SMP)
    cpumask_var_t irq_default_affinity;

    static void __init init_irq_default_affinity(void)
    {
        alloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
        cpumask_setall(irq_default_affinity);
    }
    #else
    static void __init init_irq_default_affinity(void)
    {
    }
    #endif

We know that when a hardware device, such as a disk controller or a keyboard, needs attention from the processor, it raises an interrupt. The interrupt tells the processor that something has happened and that the processor should interrupt the current process and handle an incoming event. In order to prevent multiple devices from sending the same interrupts, the IRQ system was established, where each device in a computer system is assigned its own special IRQ so that its interrupts are unique. The Linux kernel can assign certain IRQs to specific processors. This is known as SMP IRQ affinity, and it allows you to control how your system will respond to various hardware events (that's why it has a real implementation only if the CONFIG_SMP kernel configuration option is set). After we have allocated the irq_default_affinity cpumask, we can see the printk output:

    printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS);

which prints NR_IRQS:

    ~$ dmesg | grep NR_IRQS
    [    0.000000] NR_IRQS:4352

NR_IRQS is the maximum number of irq descriptors, or in other words the maximum number of interrupts. Its value depends on the state of the CONFIG_X86_IO_APIC kernel configuration option. If CONFIG_X86_IO_APIC is not set and the Linux kernel uses an old PIC chip, NR_IRQS is:

    #define NR_IRQS_LEGACY 16

    #ifdef CONFIG_X86_IO_APIC
    ...
    ...
    ...
    #else
    # define NR_IRQS NR_IRQS_LEGACY
    #endif

Otherwise, when the CONFIG_X86_IO_APIC kernel configuration option is set, NR_IRQS depends on the number of processors and the number of interrupt vectors:

    #define CPU_VECTOR_LIMIT      (64 * NR_CPUS)
    #define NR_VECTORS            256
    #define IO_APIC_VECTOR_LIMIT  (32 * MAX_IO_APICS)
    #define MAX_IO_APICS          128

    # define NR_IRQS                                  \
        (CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ?    \
            (NR_VECTORS + CPU_VECTOR_LIMIT)  :        \
            (NR_VECTORS + IO_APIC_VECTOR_LIMIT))
    ...
    ...
    ...

We remember from the previous parts that we can set the number of processors during the Linux kernel configuration process with the CONFIG_NR_CPUS configuration option.

In the first case (CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT), NR_IRQS is NR_VECTORS + CPU_VECTOR_LIMIT; in the second case (CPU_VECTOR_LIMIT < IO_APIC_VECTOR_LIMIT), NR_IRQS is NR_VECTORS + IO_APIC_VECTOR_LIMIT. In my case NR_CPUS is 8, as you can see in my configuration, so CPU_VECTOR_LIMIT is 512 and IO_APIC_VECTOR_LIMIT is 4096. The first expression would give 768, but since 512 is less than 4096 the second case applies, and NR_IRQS for my configuration is 256 + 4096 = 4352:

    ~$ dmesg | grep NR_IRQS
    [    0.000000] NR_IRQS:4352
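The arithmetic behind this number can be checked with a few lines of C. The sketch below simply restates the NR_IRQS macro as a function of the number of CPUs; nr_irqs_for is a hypothetical name, and the constants are the ones quoted above.

```c
#include <assert.h>

#define NR_VECTORS    256
#define MAX_IO_APICS  128

/* Restates the NR_IRQS macro for the CONFIG_X86_IO_APIC case. */
static int nr_irqs_for(int nr_cpus)
{
    int cpu_vector_limit = 64 * nr_cpus;
    int io_apic_vector_limit = 32 * MAX_IO_APICS;

    assert(nr_cpus > 0);
    return cpu_vector_limit > io_apic_vector_limit
        ? NR_VECTORS + cpu_vector_limit
        : NR_VECTORS + io_apic_vector_limit;
}
```

For 8 CPUs the function returns 4352, matching the dmesg output. The CPU-based limit only starts to win when there are more than 64 CPUs, because 64 * 64 equals 32 * 128.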

In the next step we assign the array of IRQ descriptors to the desc variable, which we declared at the start of the early_irq_init function, and calculate the count of the irq_desc array with the ARRAY_SIZE macro:

    desc = irq_desc;
    count = ARRAY_SIZE(irq_desc);

The irq_desc array is defined in the same source code file and looks like:

    struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
        [0 ... NR_IRQS-1] = {
            .handle_irq = handle_bad_irq,
            .depth      = 1,
            .lock       = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
        }
    };

The irq_desc is an array of irq descriptors. It has three already initialized fields:

* handle_irq - as I already wrote above, this field is the highlevel irq-event handler. In our case it is initialized with the handle_bad_irq function, which is defined in the kernel/irq/handle.c source code file and handles spurious and unhandled irqs;
* depth - 0 if the IRQ line is enabled and a positive value if it has been disabled at least once;
* lock - a spin lock used to serialize the accesses to the IRQ descriptor.
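The depth field described above behaves like a nesting counter: every disable increases it, every enable decreases it, and the line counts as enabled only when the counter is back at zero. Here is a minimal sketch of that idea. The struct and the three functions are hypothetical and model only the counter; the real kernel routines also mask and unmask the line in the interrupt controller.

```c
#include <stdbool.h>

/* Models only the depth counter of an interrupt descriptor. */
struct irq_desc_sketch {
    unsigned int depth;
};

static void sketch_disable_irq(struct irq_desc_sketch *desc)
{
    desc->depth++;
}

static void sketch_enable_irq(struct irq_desc_sketch *desc)
{
    if (desc->depth > 0)        /* ignore an unbalanced enable */
        desc->depth--;
}

static bool sketch_irq_enabled(const struct irq_desc_sketch *desc)
{
    return desc->depth == 0;
}
```

A descriptor that starts with depth equal to 1, as in the static initializer above, is therefore disabled until something enables it once.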

Once we have calculated the count of the interrupts and initialized our irq_desc array, we start to fill the descriptors in the loop:

    for (i = 0; i < count; i++) {
        desc[i].kstat_irqs = alloc_percpu(unsigned int);
        alloc_masks(&desc[i], GFP_KERNEL, node);
        raw_spin_lock_init(&desc[i].lock);
        lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
        desc_set_defaults(i, &desc[i], node, NULL);
    }

We go through all the interrupt descriptors and do the following things:

First of all we allocate a percpu variable for the irq kernel statistic with the alloc_percpu macro. This macro allocates one instance of an object of the given type for every processor in the system. You can access the kernel statistic from userspace via /proc/stat:

    ~$ cat /proc/stat
    cpu  207907 68 53904 5427850 14394 0 394 0 0 0
    cpu0 25881 11 6684 679131 1351 0 18 0 0 0
    cpu1 24791 16 5894 679994 2285 0 24 0 0 0
    cpu2 26321 4 7154 678924 664 0 71 0 0 0
    cpu3 26648 8 6931 678891 414 0 244 0 0 0
    ...
    ...
    ...

Here the sixth column is the time spent servicing interrupts. After this we allocate a cpumask for the affinity of the given irq descriptor and initialize the spinlock for the given interrupt descriptor. From now on, before a critical section the lock will be acquired with a call of raw_spin_lock and released with a call of raw_spin_unlock. In the next step we call the lockdep_set_class macro, which sets the lock validator irq_desc_lock_class class for the lock of the given interrupt descriptor. More about lockdep, spinlocks and other synchronization primitives will be described in a separate chapter.

At the end of the loop we call the desc_set_defaults function from kernel/irq/irqdesc.c. This function takes four parameters:

* number of an irq;
* interrupt descriptor;
* online NUMA node;
* owner of the interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is needed to provide a refcount on the module which provides the interrupts;

and fills the rest of the irq_desc fields. The desc_set_defaults function fills the interrupt number, the irq chip, platform-specific per-chip private data for the chip methods, per-IRQ data for the irq_chip methods and the MSI descriptor for the per irq and irq chip data:

    desc->irq_data.irq = irq;
    desc->irq_data.chip = &no_irq_chip;
    desc->irq_data.chip_data = NULL;
    desc->irq_data.handler_data = NULL;
    desc->irq_data.msi_desc = NULL;
    ...
    ...
    ...

The irq_data.chip structure provides a general API, like irq_set_chip, irq_set_irq_type and so on, for the irq controller drivers. You can find it in the kernel/irq/chip.c source code file.

After this we set the status of the accessor for the given descriptor and set the disabled state of the interrupts:

    ...
    ...
    ...
    irq_settings_clr_and_set(desc, ~0, _IRQ_DEFAULT_INIT_FLAGS);
    irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED);
    ...
    ...
    ...

In the next step we set the high level interrupt handler to handle_bad_irq, which handles spurious and unhandled irqs (as the hardware is not initialized yet, we set this handler), set irq_desc.depth to 1, which means that the IRQ is disabled, and reset the count of the unhandled interrupts and of interrupts in general:

    ...
    ...
    ...
    desc->handle_irq = handle_bad_irq;
    desc->depth = 1;
    desc->irq_count = 0;
    desc->irqs_unhandled = 0;
    desc->name = NULL;
    desc->owner = owner;
    ...
    ...
    ...

After this we go through all possible processors with the for_each_possible_cpu helper and set kstat_irqs to zero for the given interrupt descriptor:

    for_each_possible_cpu(cpu)
        *per_cpu_ptr(desc->kstat_irqs, cpu) = 0;

and call the desc_smp_init function from kernel/irq/irqdesc.c, which initializes the NUMA node of the given interrupt descriptor, sets the default SMP affinity and clears the pending_mask of the given interrupt descriptor, depending on the value of the CONFIG_GENERIC_PENDING_IRQ kernel configuration option:

    static void desc_smp_init(struct irq_desc *desc, int node)
    {
        desc->irq_data.node = node;
        cpumask_copy(desc->irq_data.affinity, irq_default_affinity);
    #ifdef CONFIG_GENERIC_PENDING_IRQ
        cpumask_clear(desc->pending_mask);
    #endif
    }

At the end of the early_irq_init function we return the return value of the arch_early_irq_init function:

    return arch_early_irq_init();

This function is defined in arch/x86/kernel/apic/vector.c and contains only one call, of the arch_early_ioapic_init function from arch/x86/kernel/apic/io_apic.c. As we can understand from its name, the arch_early_ioapic_init function performs early initialization of the I/O APIC. First of all it checks the number of legacy interrupts with the call of the nr_legacy_irqs function. If we have no legacy interrupts with the Intel 8259 programmable interrupt controller, we set io_apic_irqs to 0xffffffffffffffff:

    if (!nr_legacy_irqs())
        io_apic_irqs = ~0UL;

After this we go through all the I/O APICs and allocate space for the registers with the call of alloc_ioapic_saved_registers:

    for_each_ioapic(i)
        alloc_ioapic_saved_registers(i);

And at the end of the arch_early_ioapic_init function we go through all the legacy irqs (from IRQ0 to IRQ15) in the loop and allocate space for the irq_cfg, which represents the configuration of an irq on the given NUMA node:

    for (i = 0; i < nr_legacy_irqs(); i++) {
        cfg = alloc_irq_and_cfg_at(i, node);
        cfg->vector = IRQ0_VECTOR + i;
        cpumask_setall(cfg->domain);
    }

That's all.

Sparse IRQs

We already saw in the beginning of this part that the implementation of the early_irq_init function depends on the CONFIG_SPARSE_IRQ kernel configuration option. Previously we saw the implementation of the early_irq_init function when the CONFIG_SPARSE_IRQ configuration option is not set; now let's look at its implementation when this option is set. The implementation of this function is very similar, but differs a little. We can see the same definition of variables and the call of init_irq_default_affinity in the beginning of the early_irq_init function:

    #ifdef CONFIG_SPARSE_IRQ
    int __init early_irq_init(void)
    {
        int i, initcnt, node = first_online_node;
        struct irq_desc *desc;

        init_irq_default_affinity();
        ...
        ...
        ...
    }
    #else
    ...
    ...
    ...

But after this we can see the following call:

    initcnt = arch_probe_nr_irqs();

The arch_probe_nr_irqs function is defined in arch/x86/kernel/apic/vector.c. It calculates the count of pre-allocated IRQs and updates nr_irqs with this number. But stop. Why are there pre-allocated IRQs? There is an alternative form of interrupts called Message Signaled Interrupts (MSI), available in PCI. Instead of being assigned a fixed interrupt request number, the device is allowed to write a message to a particular address in RAM which is, in fact, mapped to the Local APIC. MSI permits a device to allocate 1, 2, 4, 8, 16 or 32 interrupts and MSI-X permits a device to allocate up to 2048 interrupts. Now we know that IRQs can be pre-allocated. More about MSI will come in a later part, but for now let's look at the arch_probe_nr_irqs function. First we see a check that limits nr_irqs to the number of interrupt vectors available on all processors in the system, followed by the calculation of nr, which represents the number of MSI interrupts:

    int nr_irqs = NR_IRQS;

    if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
        nr_irqs = NR_VECTORS * nr_cpu_ids;

    nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;

Take a look at the gsi_top variable. Each APIC is identified by its own ID and by the offset where its IRQs start. This offset is called the GSI base, or Global System Interrupt base, and gsi_top represents it. We get the Global System Interrupt base from the MultiProcessor Configuration Table (you may remember that we parsed this table in the sixth part of the Linux kernel initialization process chapter).

After this we update nr depending on the value of gsi_top:

    #if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
        if (gsi_top <= NR_IRQS_LEGACY)
            nr += 8 * nr_cpu_ids;
        else
            nr += gsi_top * 16;
    #endif

Then we update nr_irqs if nr is less than it, and return the number of legacy IRQs:

        if (nr < nr_irqs)
            nr_irqs = nr;

        return nr_legacy_irqs();
    }
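The arithmetic above is easy to check in user space. The following is only a minimal sketch: probe_nr_irqs_model is a hypothetical name, every input that is a global in the kernel is passed as a plain parameter here, and the sketch assumes that CONFIG_PCI_MSI (or CONFIG_HT_IRQ) is enabled.

```c
#include <assert.h>

/* Values taken from the text. */
#define NR_VECTORS      256
#define NR_IRQS_LEGACY  16

/*
 * Hypothetical user-space model of the calculation in arch_probe_nr_irqs.
 * nr_irqs_max plays the role of NR_IRQS; the other parameters stand in
 * for nr_cpu_ids, gsi_top and the result of nr_legacy_irqs().
 */
static int probe_nr_irqs_model(int nr_irqs_max, int nr_cpu_ids,
                               int gsi_top, int nr_legacy)
{
    int nr_irqs = nr_irqs_max;
    int nr;

    assert(nr_cpu_ids > 0);

    /* Never more IRQs than vectors available on all processors. */
    if (nr_irqs > NR_VECTORS * nr_cpu_ids)
        nr_irqs = NR_VECTORS * nr_cpu_ids;

    nr = (gsi_top + nr_legacy) + 8 * nr_cpu_ids;

    /* Extra room for MSI/HT interrupts (assumed to be configured in). */
    if (gsi_top <= NR_IRQS_LEGACY)
        nr += 8 * nr_cpu_ids;
    else
        nr += gsi_top * 16;

    if (nr < nr_irqs)
        nr_irqs = nr;

    return nr_irqs;
}
```

For example, with NR_IRQS equal to 4352, 8 processors, 16 legacy IRQs and an assumed gsi_top of 24, probe_nr_irqs_model(4352, 8, 24, 16) gives 488. The gsi_top value is only an illustration; it is not taken from the text.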

The next step after arch_probe_nr_irqs is printing information about the number of IRQs:

    printk(KERN_INFO "NR_IRQS:%d nr_irqs:%d %d\n", NR_IRQS, nr_irqs, initcnt);

We can find it in the dmesg output:

    $ dmesg | grep NR_IRQS
    [    0.000000] NR_IRQS:4352 nr_irqs:488 16

After this we check that the nr_irqs and initcnt values are not greater than the maximum allowed number of IRQs:

    if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
        nr_irqs = IRQ_BITMAP_BITS;

    if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
        initcnt = IRQ_BITMAP_BITS;

where IRQ_BITMAP_BITS is equal to NR_IRQS if CONFIG_SPARSE_IRQ is not set and to NR_IRQS + 8196 otherwise. In the next step we loop over all interrupt descriptors that need to be allocated, allocate space for each descriptor and insert it into the irq_desc_tree radix tree:

    for (i = 0; i < initcnt; i++) {
        desc = alloc_desc(i, node, NULL);
        set_bit(i, allocated_irqs);
        irq_insert_desc(i, desc);
    }

At the end of the early_irq_init function we return the result of the arch_early_irq_init function, as we already did in the previous variant where the CONFIG_SPARSE_IRQ option was not set:

    return arch_early_irq_init();

That's all.

Conclusion

This is the end of the seventh part of the Interrupts and Interrupt Handling chapter, in which we started to dive into external hardware interrupts. We saw the early initialization of the irq_desc structure, which represents the description of an external interrupt and contains information about it such as the list of IRQ actions, information about the interrupt handler, the interrupt's owner, the count of unhandled interrupts and so on. In the next part we will continue to explore external interrupts.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

IRQ
numa
Enum type

Interrupts and Interrupt Handling. Part 8.

Non-early initialization of the IRQs

This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel chapter. In the previous part we started to dive into external hardware interrupts. We looked at the implementation of the early_irq_init function from the kernel/irq/irqdesc.c source code file and saw the initialization of the irq_desc structure in this function. Recall that the irq_desc structure (defined in include/linux/irqdesc.h) is the foundation of the interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization related to external hardware interrupts.

Right after the call to early_irq_init in init/main.c we can see the call to the init_IRQ function. This function is architecture-specific and is defined in arch/x86/kernel/irqinit.c. The init_IRQ function initializes the vector_irq per-cpu variable, which is defined in the same arch/x86/kernel/irqinit.c source code file:

    ...
    DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
        [0 ... NR_VECTORS - 1] = -1,
    };
    ...

and represents a per-cpu array of interrupt vector numbers. The vector_irq_t type is defined in arch/x86/include/asm/hw_irq.h and expands to:

    typedef int vector_irq_t[NR_VECTORS];

where NR_VECTORS is the number of vectors; as you may remember from the first part of this chapter, it is 256 for x86_64:

    #define NR_VECTORS 256

So, at the start of the init_IRQ function we fill the entries of the vector_irq per-cpu array that belong to the legacy interrupts, mapping each legacy vector to its IRQ number:

    void __init init_IRQ(void)
    {
        int i;

        for (i = 0; i < nr_legacy_irqs(); i++)
            per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i;
        ...
        ...
        ...
    }

This vector_irq array will be used during the first steps of external hardware interrupt handling in the do_IRQ function from arch/x86/kernel/irq.c:

    __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
    {
        ...
        ...
        ...
        irq = __this_cpu_read(vector_irq[vector]);

        if (!handle_irq(irq, regs)) {
            ...
            ...
            ...
        }

        exiting_irq();
        ...
        ...
        return 1;
    }

Why "legacy" here? Actually all interrupts are handled by the modern I/O APIC controller. But these interrupts (from 0x30 to 0x3f) are handled by legacy interrupt controllers such as the Programmable Interrupt Controller. If these interrupts are handled by the I/O APIC, then this vector space will be freed and re-used. Let's look at this code more closely. First of all, nr_legacy_irqs is defined in arch/x86/include/asm/i8259.h and just returns the nr_legacy_irqs field of the legacy_pic structure:

    static inline int nr_legacy_irqs(void)
    {
        return legacy_pic->nr_legacy_irqs;
    }

This structure is defined in the same header file and represents a non-modern programmable interrupt controller:

    struct legacy_pic {
        int nr_legacy_irqs;
        struct irq_chip *chip;
        void (*mask)(unsigned int irq);
        void (*unmask)(unsigned int irq);
        void (*mask_all)(void);
        void (*restore_mask)(void);
        void (*init)(int auto_eoi);
        int (*irq_pending)(unsigned int irq);
        void (*make_irq)(unsigned int irq);
    };

The default maximum number of legacy interrupts is represented by the NR_IRQS_LEGACY macro from arch/x86/include/asm/irq_vectors.h:

    #define NR_IRQS_LEGACY 16

In the loop we access the vector_irq per-cpu array with the per_cpu macro at the IRQ0_VECTOR + i index and write the legacy IRQ number i there. The IRQ0_VECTOR macro is defined in the arch/x86/include/asm/irq_vectors.h header file and expands to 0x30:

    #define FIRST_EXTERNAL_VECTOR 0x20

    #define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)

Why 0x30 here? You may remember from the first part of this chapter that the first 32 vector numbers, from 0 to 31, are reserved by the processor and used for processing architecture-defined exceptions and interrupts. Vector numbers from 0x30 to 0x3f are reserved for ISA. So it means that we fill vector_irq from IRQ0_VECTOR, which is equal to 0x30 (48), up to IRQ0_VECTOR + 15 (0x3f).
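To make the macro arithmetic and the table fill concrete, here is a minimal user-space sketch. The plain array vector_irq_model and the helpers fill_legacy_vectors and vector_to_irq are hypothetical names; the array only stands in for the per-cpu vector_irq of a single processor.

```c
#include <assert.h>

#define NR_VECTORS             256
#define NR_IRQS_LEGACY         16
#define FIRST_EXTERNAL_VECTOR  0x20
#define IRQ0_VECTOR            ((FIRST_EXTERNAL_VECTOR + 16) & ~15)

/* Plain array standing in for the per-cpu vector_irq of one CPU. */
static int vector_irq_model[NR_VECTORS];

/* Mark every vector unused (-1), then map legacy vectors to IRQ numbers. */
static void fill_legacy_vectors(void)
{
    int i;

    for (i = 0; i < NR_VECTORS; i++)
        vector_irq_model[i] = -1;

    for (i = 0; i < NR_IRQS_LEGACY; i++)
        vector_irq_model[IRQ0_VECTOR + i] = i;
}

/* The lookup done at the start of interrupt handling: vector -> IRQ. */
static int vector_to_irq(int vector)
{
    assert(vector >= 0 && vector < NR_VECTORS);
    return vector_irq_model[vector];
}
```

After fill_legacy_vectors() runs, vector 0x30 maps to IRQ 0 and vector 0x3f maps to IRQ 15, while every other vector still holds -1.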


At the end of the init_IRQ function we can see the call to the following function:

    x86_init.irqs.intr_init();

from the arch/x86/kernel/x86_init.c source code file. If you have read the chapter about the Linux kernel initialization process, you may remember the x86_init structure. This structure contains a couple of fields which point to functions related to the platform setup (x86_64 in our case), for example resources, related to memory resources, mpparse, related to the parsing of the MultiProcessor Configuration Table, and so on. As we can see, x86_init also contains the irqs field, which contains the following three fields:

    struct x86_init_ops x86_init __initdata
    {
        ...
        ...
        ...
        .irqs = {
            .pre_vector_init    = init_ISA_irqs,
            .intr_init          = native_init_IRQ,
            .trap_init          = x86_init_noop,
        },
        ...
        ...
        ...
    }

Now we are interested in native_init_IRQ. As we can note, the name of the native_init_IRQ function contains the native_ prefix, which means that this function is architecture-specific. It is defined in arch/x86/kernel/irqinit.c and performs general initialization of the Local APIC and initialization of the ISA IRQs. Let's look at the implementation of the native_init_IRQ function and try to understand what happens there. The native_init_IRQ function starts with the execution of the following function:

    x86_init.irqs.pre_vector_init();

As we can see above, pre_vector_init points to the init_ISA_irqs function, which is defined in the same source code file and, as its name suggests, initializes the ISA-related interrupts. The init_ISA_irqs function starts with the definition of the chip variable, which has the irq_chip type:

    void __init init_ISA_irqs(void)
    {
        struct irq_chip *chip = legacy_pic->chip;
        ...
        ...
        ...

The irq_chip structure is defined in the include/linux/irq.h header file and represents a hardware interrupt chip descriptor. It contains:

name - name of a device. Used in /proc/interrupts (look at the IO-APIC column):

    $ cat /proc/interrupts
               CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
      0:         16      0      0      0      0      0      0      0   IO-APIC   2-edge   timer
      1:          2      0      0      0      0      0      0      0   IO-APIC   1-edge   i8042
      8:          1      0      0      0      0      0      0      0   IO-APIC   8-edge   rtc0

(*irq_mask)(struct irq_data *data) - mask an interrupt source;
(*irq_ack)(struct irq_data *data) - start of a new interrupt;
(*irq_startup)(struct irq_data *data) - start up the interrupt;
(*irq_shutdown)(struct irq_data *data) - shut down the interrupt

and other fields. Note that the irq_data structure represents the set of per-IRQ chip data passed down to chip functions. It contains mask - a precomputed bitmask for accessing the chip registers, irq - the interrupt number, hwirq - the hardware interrupt number, local to the interrupt domain, chip - low-level interrupt hardware access, and so on.

After this, depending on the CONFIG_X86_64 and CONFIG_X86_LOCAL_APIC kernel configuration options, we call the init_bsp_APIC function from arch/x86/kernel/apic/apic.c:

    #if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
        init_bsp_APIC();
    #endif

This function initializes the APIC of the bootstrap processor (the processor which starts first). It starts with a check: if an SMP config was found (read more about it in the sixth part of the Linux kernel initialization process chapter) or the processor has no APIC, we return from the function immediately:

    if (smp_found_config || !cpu_has_apic)
        return;

Otherwise we continue. In the next step we call the clear_local_APIC function from the same source code file, which shuts down the local APIC (more about it will be in the chapter about the Advanced Programmable Interrupt Controller), and enable the APIC of the first processor by setting the APIC_SPIV_APIC_ENABLED bit in the unsigned int value:

    value = apic_read(APIC_SPIV);
    value &= ~APIC_VECTOR_MASK;
    value |= APIC_SPIV_APIC_ENABLED;

and writing it back with the help of the apic_write function:

    apic_write(APIC_SPIV, value);
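The bit manipulation on the spurious-interrupt vector register value can be tried in isolation. This is only a sketch: spiv_enable is a hypothetical helper, and the two constants are written out as I believe they are defined in arch/x86/include/asm/apicdef.h, so treat their values as an assumption and check them against your kernel tree.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against arch/x86/include/asm/apicdef.h. */
#define APIC_VECTOR_MASK        0x000FF
#define APIC_SPIV_APIC_ENABLED  (1 << 8)

/* Clear the vector bits of a SPIV value and set the enable bit. */
static uint32_t spiv_enable(uint32_t value)
{
    value &= ~APIC_VECTOR_MASK;
    value |= APIC_SPIV_APIC_ENABLED;

    assert((value & APIC_SPIV_APIC_ENABLED) != 0);
    return value;
}
```

With these assumed constants the low eight bits are cleared and bit 8 is set, while all higher bits keep their previous state.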

After we have enabled the APIC for the bootstrap processor, we return to the init_ISA_irqs function. In the next step we initialize the legacy Programmable Interrupt Controller and set the legacy chip and handler for each legacy IRQ:

    legacy_pic->init(0);

    for (i = 0; i < nr_legacy_irqs(); i++)
        irq_set_chip_and_handler(i, chip, handle_level_irq);

Where can we find the init function? The legacy_pic is defined in arch/x86/kernel/i8259.c:

    struct legacy_pic *legacy_pic = &default_legacy_pic;

where default_legacy_pic is:

    struct legacy_pic default_legacy_pic = {
        ...
        ...
        ...
        .init = init_8259A,
        ...
        ...
        ...
    }

The init_8259A function is defined in the same source code file and initializes the Intel 8259 Programmable Interrupt Controller (more about it will be in the separate chapter about Programmable Interrupt Controllers and APIC).

Now we can return to the native_init_IRQ function, after the init_ISA_irqs function has finished its work. The next step is the call to the apic_intr_init function, which allocates special interrupt gates used by the SMP architecture for Inter-processor interrupts. The alloc_intr_gate macro from arch/x86/include/asm/desc.h is used for the interrupt descriptor allocation:

    #define alloc_intr_gate(n, addr)        \
    do {                                    \
        alloc_system_vector(n);             \
        set_intr_gate(n, addr);             \
    } while (0)

As we can see, first of all it expands to a call to the alloc_system_vector function, which checks the given vector number in the used_vectors bitmap (read the previous part about it) and sets it there if it is not set yet. After this we check whether first_system_vector is greater than the given interrupt vector number, and if it is, we assign the given vector number to first_system_vector. If the vector is already set in the bitmap, this is treated as a bug:

    if (!test_bit(vector, used_vectors)) {
        set_bit(vector, used_vectors);
        if (first_system_vector > vector)
            first_system_vector = vector;
    } else {
        BUG();
    }
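The allocation logic can be modelled in user space as follows. This is a minimal sketch under stated assumptions: every name ending in _model is hypothetical, the kernel's BUG() is replaced by a -1 return value so the behaviour can be observed, and the starting value of first_system_vector_model (NR_VECTORS) is chosen for illustration only.

```c
#include <assert.h>
#include <limits.h>

#define NR_VECTORS     256
#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS   ((NR_VECTORS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Stand-ins for the kernel's used_vectors bitmap and first_system_vector. */
static unsigned long used_vectors_model[BITMAP_LONGS];
static int first_system_vector_model = NR_VECTORS;

static int test_bit_model(int nr, const unsigned long *map)
{
    return (int)((map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL);
}

static void set_bit_model(int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Returns 0 on success, -1 if the vector was already taken
 * (the kernel calls BUG() in that case). */
static int alloc_system_vector_model(int vector)
{
    assert(vector >= 0 && vector < NR_VECTORS);

    if (test_bit_model(vector, used_vectors_model))
        return -1;

    set_bit_model(vector, used_vectors_model);
    if (first_system_vector_model > vector)
        first_system_vector_model = vector;
    return 0;
}
```

Allocating 0xf0 and then 0xe0 leaves first_system_vector_model at 0xe0, the lowest system vector seen so far, and a second attempt to allocate 0xf0 is rejected.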

We already saw the set_bit macro; now let's look at test_bit and first_system_vector. The test_bit macro is defined in arch/x86/include/asm/bitops.h and looks like this:

    #define test_bit(nr, addr)                  \
        (__builtin_constant_p((nr))             \
         ? constant_test_bit((nr), (addr))      \
         : variable_test_bit((nr), (addr)))

We can see the ternary operator here. It tests, with the gcc built-in function __builtin_constant_p, whether the given vector number (nr) is known at compile time. If __builtin_constant_p is unclear to you, we can make a simple test:

    #include <stdio.h>

    #define PREDEFINED_VAL 1

    int main() {
        int i = 5;
        printf("__builtin_constant_p(i) is %d\n", __builtin_constant_p(i));
        printf("__builtin_constant_p(PREDEFINED_VAL) is %d\n", __builtin_constant_p(PREDEFINED_VAL));
        printf("__builtin_constant_p(100) is %d\n", __builtin_constant_p(100));

        return 0;
    }

and look at the result:

    $ gcc test.c -o test
    $ ./test
    __builtin_constant_p(i) is 0
    __builtin_constant_p(PREDEFINED_VAL) is 1
    __builtin_constant_p(100) is 1

Now I think it should be clear. Let's get back to the test_bit macro. If __builtin_constant_p returns non-zero, we call the constant_test_bit function:

    static inline int constant_test_bit(int nr, const void *addr)
    {
        const u32 *p = (const u32 *)addr;

        return ((1UL << (nr & 31)) & (p[nr >> 5])) != 0;
    }

and variable_test_bit otherwise:

    static inline int variable_test_bit(int nr, const void *addr)
    {
        u8 v;
        const u32 *p = (const u32 *)addr;

        asm("btl %2,%1; setc %0" : "=qm" (v) : "m" (*p), "Ir" (nr));
        return v;
    }
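The index arithmetic in constant_test_bit may look cryptic at first: nr >> 5 is nr divided by 32 and selects the 32-bit word, while nr & 31 is nr modulo 32 and selects the bit inside that word. A portable sketch of the same computation, with the hypothetical name test_bit32, could look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Portable equivalent of the index arithmetic in constant_test_bit. */
static int test_bit32(int nr, const uint32_t *words)
{
    uint32_t word, mask;

    assert(nr >= 0);

    word = words[nr >> 5];               /* nr / 32 selects the word */
    mask = UINT32_C(1) << (nr & 31);     /* nr % 32 selects the bit  */

    return (word & mask) != 0;
}
```

For instance, bit 63 lives in word 1 at position 31, so it is set exactly when the top bit of the second word is set.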

What's the difference between these two functions and why do we need two different functions for the same purpose? As you can already guess, the main purpose is optimization. If we write a simple example with these functions:

    #define CONST 25

    int main() {
        int nr = 24;
        variable_test_bit(nr, (int*)0x10000000);
        constant_test_bit(CONST, (int*)0x10000000);
        return 0;
    }

and look at the assembly output of our example, we will see the following assembly code:

    pushq   %rbp
    movq    %rsp, %rbp

    movl    $268435456, %esi
    movl    $25, %edi
    call    constant_test_bit

for the constant_test_bit, and:

    pushq   %rbp
    movq    %rsp, %rbp

    subq    $16, %rsp
    movl    $24, -4(%rbp)
    movl    -4(%rbp), %eax
    movl    $268435456, %esi
    movl    %eax, %edi
    call    variable_test_bit

for the variable_test_bit. These two code listings start with the same part: first of all we save the base of the current stack frame in the %rbp register. But after this the code of the two examples differs. In the first example we put $268435456 (this is our second parameter, 0x10000000) into esi and $25 (our first parameter) into the edi register and call constant_test_bit. We put the function parameters into the esi and edi registers because, as we are learning the Linux kernel for the x86_64 architecture, we use the System V AMD64 ABI calling convention. All is pretty simple. When we use a predefined constant, the compiler can just substitute its value. Now let's look at the second part. As you can see here, the compiler cannot substitute the value of the nr variable. In this case the compiler must calculate its offset in the program's stack frame. We subtract 16 from the rsp register to allocate stack space for the local variables and put $24 (the value of the nr variable) at offset -4 from rbp. Our stack frame will look like this:

             <- stack grows

                 %[rbp]
                    |
    +----------+ +---------+ +---------+ +--------+
    |          | |         | | return  | |        |
    |    nr    |-|         |-|         |-|  argc  |
    |          | |         | | address | |        |
    +----------+ +---------+ +---------+ +--------+
                    |
                 %[rsp]

After this we put this value into eax, so the eax register now contains the value of nr. In the end we do the same as in the first example: we put $268435456 (the second parameter of the variable_test_bit function) into the esi register and the value of eax (the value of nr, the first parameter of the variable_test_bit function) into the edi register.

The next step, after the apic_intr_init function has finished its work, is setting the interrupt gates from FIRST_EXTERNAL_VECTOR (0x20) up to first_system_vector (which is defined as NR_VECTORS, that is 256, when CONFIG_X86_LOCAL_APIC is not set):

    i = FIRST_EXTERNAL_VECTOR;

    #ifndef CONFIG_X86_LOCAL_APIC
    #define first_system_vector NR_VECTORS
    #endif

    for_each_clear_bit_from(i, used_vectors, first_system_vector) {
        set_intr_gate(i, irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR));
    }

But as we are using the for_each_clear_bit_from helper, we set only the interrupt gates that have not been initialized yet. After this we use the same for_each_clear_bit_from helper to fill the remaining unfilled interrupt gates in the interrupt table with spurious_interrupt:

    #ifdef CONFIG_X86_LOCAL_APIC
        for_each_clear_bit_from(i, used_vectors, NR_VECTORS)
            set_intr_gate(i, spurious_interrupt);
    #endif
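The idea behind for_each_clear_bit_from is simply "visit every index at or after a starting point whose bit is still zero". The sketch below models it with one byte per vector instead of a real bitmap to keep it short; used_model, next_clear_from and count_clear_from are hypothetical names and not kernel APIs.

```c
#include <assert.h>

#define NR_VECTORS 256

/* One byte per vector keeps the sketch simple; the kernel uses a bitmap. */
static unsigned char used_model[NR_VECTORS];

/* First unused index >= from, or limit if there is none. */
static int next_clear_from(const unsigned char *used, int limit, int from)
{
    assert(from >= 0);

    while (from < limit && used[from])
        from++;
    return from;
}

/* Visit every unused vector in [from, limit) and count them. */
static int count_clear_from(const unsigned char *used, int limit, int from)
{
    int i, n = 0;

    for (i = next_clear_from(used, limit, from); i < limit;
         i = next_clear_from(used, limit, i + 1))
        n++;
    return n;
}
```

In the kernel loop the body is a set_intr_gate call rather than a counter, but the traversal has the same shape: already-used vectors are skipped and only the free ones get a gate.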

where the spurious_interrupt function represents the interrupt handler for the spurious interrupt. Here used_vectors is the unsigned long bitmap that contains the already initialized interrupt gates. We already filled the first 32 interrupt vectors in the trap_init function from the arch/x86/kernel/traps.c source code file:

    for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
        set_bit(i, used_vectors);

You may remember how we did it in the sixth part of this chapter.

At the end of the native_init_IRQ function we can see the following check:

    if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs())
        setup_irq(2, &irq2);

First of all let's deal with the condition. The acpi_ioapic variable represents the existence of an I/O APIC. It is defined in arch/x86/kernel/acpi/boot.c. This variable is set in the acpi_set_irq_model_ioapic function, which is called during the processing of the Multiple APIC Description Table. This occurs during the initialization of the architecture-specific stuff in arch/x86/kernel/setup.c (we will learn more about it in the other chapter about APIC). Note that the value of the acpi_ioapic variable depends on the CONFIG_ACPI and CONFIG_X86_LOCAL_APIC Linux kernel configuration options. If these options are not set, this variable will be just zero:

    #define acpi_ioapic 0

The second part of the condition, !of_ioapic && nr_legacy_irqs(), checks that we do not use the Open Firmware I/O APIC and that we do have a legacy interrupt controller. We already know about nr_legacy_irqs. The of_ioapic variable is defined in arch/x86/kernel/devicetree.c and is initialized in the dtb_ioapic_setup function, which builds information about APICs in the devicetree. Note that the of_ioapic variable depends on the CONFIG_OF Linux kernel configuration option. If this option is not set, the value of of_ioapic will be zero too:

    #ifdef CONFIG_OF
    extern int of_ioapic;
    ...
    ...
    ...
    #else
    #define of_ioapic 0
    ...
    ...
    ...
    #endif

If the condition evaluates to a non-zero value, we call the setup_irq function:

    setup_irq(2, &irq2);

First of all, about irq2. irq2 is the irqaction structure that is defined in the arch/x86/kernel/irqinit.c source code file and represents the IRQ 2 line, which is used to query devices connected in cascade:

    static struct irqaction irq2 = {
        .handler = no_action,
        .name = "cascade",
        .flags = IRQF_NO_THREAD,
    };

Some time ago the interrupt controller consisted of two chips, one connected to the other. The second chip was connected to the first chip via this IRQ 2 line. The second chip serviced lines 8 to 15, and only after them were the remaining lines of the first chip serviced. So, for example, the Intel 8259A has the following lines:

IRQ 0 - system time;
IRQ 1 - keyboard;
IRQ 2 - used for devices which are cascade connected;
IRQ 8 - RTC;
IRQ 9 - reserved;
IRQ 10 - reserved;
IRQ 11 - reserved;
IRQ 12 - ps/2 mouse;
IRQ 13 - coprocessor;
IRQ 14 - hard drive controller;
IRQ 15 - reserved;
IRQ 3 - COM2 and COM4;
IRQ 4 - COM1 and COM3;
IRQ 5 - LPT2;
IRQ 6 - drive controller;
IRQ 7 - LPT1.

The setup_irq function is defined in kernel/irq/manage.c and takes two parameters:

vector number of an interrupt;
irqaction structure related to the interrupt.

At the beginning this function looks up the interrupt descriptor for the given vector number:

    struct irq_desc *desc = irq_to_desc(irq);

and then calls the __setup_irq function, which sets up the given interrupt:

    chip_bus_lock(desc);
    retval = __setup_irq(irq, desc, act);
    chip_bus_sync_unlock(desc);
    return retval;

Note that the interrupt descriptor is locked while the __setup_irq function does its work. The __setup_irq function does many different things: it creates a handler thread when a thread function is supplied and the interrupt does not nest into another interrupt thread, sets the flags of the chip, fills the irqaction structure and much more.

On top of all this, it creates the /proc/irq/vector_number directory and fills it, but if you are using a modern computer all values there will be zero:

    $ cat /proc/irq/2/node
    0

    $ cat /proc/irq/2/affinity_hint
    00

    $ cat /proc/irq/2/spurious
    count 0
    unhandled 0
    last_unhandled 0 ms

because the APIC probably handles interrupts on our machine.

That's all.

Conclusion

This is the end of the eighth part of the Interrupts and Interrupt Handling chapter, in which we continued to dive into external hardware interrupts. In the previous part we started this topic and saw the early initialization of the IRQs. In this part we saw the non-early interrupt initialization in the init_IRQ function. We saw the initialization of the vector_irq per-cpu array, which stores the vector numbers of the interrupts and will be used during interrupt handling, and the initialization of other things related to external hardware interrupts.

In the next part we will continue to learn about interrupt handling and will see the initialization of the softirqs.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

IRQ
percpu
x86_64
Intel 8259
Programmable Interrupt Controller
ISA
MultiProcessor Configuration Table
Local APIC
I/O APIC
SMP
Inter-processor interrupt
ternary operator
gcc
calling convention
PDF. System V Application Binary Interface AMD64
Call stack
Open Firmware
devicetree
RTC
Previous part

Interrupts and Interrupt Handling. Part 9.

Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)

This is the ninth part of the Interrupts and Interrupt Handling chapter of the linux-insides book. In the previous part we saw the implementation of init_IRQ, which is defined in the arch/x86/kernel/irqinit.c source code file. So, in this part we will continue to dive into the initialization related to external hardware interrupts.

After the init_IRQ function we can see the call to the softirq_init function in init/main.c. This function is defined in the kernel/softirq.c source code file and, as we can understand from its name, it initializes the softirqs or, in other words, the deferred interrupts. What is a deferred interrupt? We already saw a little bit about it in the ninth part of the chapter that describes the initialization process of the Linux kernel. There are three types of deferred interrupts in the Linux kernel:

softirqs;
tasklets;
workqueues;

We will see a description of all of these types in this part. As I said, we saw only a little bit about this topic, so now it is time to dive into the details.

Deferred interrupts

Interrupts may have different important characteristics, and there are two among them:

The handler of an interrupt must execute quickly;
Sometimes an interrupt handler must do a large amount of work.

As you can understand, it is almost impossible to satisfy both characteristics at once. Because of this, the handling of interrupts was historically split into two parts:

Top half;
Bottom half;

In the past, the Linux kernel had one mechanism for organizing deferred processing, which was called the bottom half of the processor, but it is no longer in use. The term has remained as a common noun referring to all the different ways of organizing deferred processing of an interrupt. With the advent of parallelism in the Linux kernel, all new schemes for implementing bottom-half handlers are built on the processor-specific kernel thread called ksoftirqd (it will be discussed below). The softirq mechanism represents handling of interrupts that are almost as important as the handling of hardware interrupts. Deferred processing of an interrupt means that some of the actions for an interrupt may be postponed to a later time, when the system is less loaded. As you may guess, an interrupt handler can have a large amount of work to do, which is unacceptable because it executes in a context where interrupts are disabled. That's why the processing of an interrupt can be split into two different parts. In the first part, the main handler of an interrupt does only the minimal and most important job. After this it schedules the second part and finishes its work. When the system is less busy and the context of the processor allows interrupts to be handled, the second part starts its work and finishes processing the remaining part of the deferred interrupt. That is the main explanation of deferred interrupt handling.

As I already wrote above, the handling of deferred interrupts (or softirqs, in other words) and, accordingly, of tasklets is performed by a set of special kernel threads (one thread per processor). Each processor has its own thread, which is called ksoftirqd/n, where n is the number of the processor. We can see it in the output of the systemd-cgls utility:

    $ systemd-cgls -k | grep ksoft
    ├─3 [ksoftirqd/0]
    ├─13 [ksoftirqd/1]
    ├─18 [ksoftirqd/2]
    ├─23 [ksoftirqd/3]
    ├─28 [ksoftirqd/4]
    ├─33 [ksoftirqd/5]
    ├─38 [ksoftirqd/6]
    ├─43 [ksoftirqd/7]

The spawn_ksoftirqd function starts these threads. As we can see, this function is called as an early initcall:

    early_initcall(spawn_ksoftirqd);

Deferred interrupts are determined statically at compile time of the Linux kernel, and the open_softirq function takes care of softirq initialization. The open_softirq function is defined in kernel/softirq.c:

    void open_softirq(int nr, void (*action)(struct softirq_action *))
    {
        softirq_vec[nr].action = action;
    }

and as we can see this function takes two parameters:

the index in the softirq_vec array;
a pointer to the softirq function to be executed;

First of all let's look at the softirq_vec array:

    static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

It is defined in the same source code file. As we can see, the softirq_vec array may contain NR_SOFTIRQS, or 10, softirqs of the softirq_action type. First of all, about its elements. In the current version of the Linux kernel there are ten softirq vectors defined: two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. All of these kinds are represented by the following enum:

    enum
    {
        HI_SOFTIRQ = 0,
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        BLOCK_IOPOLL_SOFTIRQ,
        TASKLET_SOFTIRQ,
        SCHED_SOFTIRQ,
        HRTIMER_SOFTIRQ,
        RCU_SOFTIRQ,
        NR_SOFTIRQS
    };

The names of all these kinds of softirqs are represented by the following array:

    const char * const softirq_to_name[NR_SOFTIRQS] = {
        "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
        "TASKLET", "SCHED", "HRTIMER", "RCU"
    };

Or we can see them in the output of /proc/softirqs:

    ~$ cat /proc/softirqs
                        CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
              HI:          5          0          0          0          0          0          0          0
           TIMER:     332519     310498     289555     272913     282535     279467     282895     270979
          NET_TX:       2320          0          0          2          1          1          0          0
          NET_RX:     270221        225        338        281        311        262        430        265
           BLOCK:     134282         32         40         10         12          7          8          8
    BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
         TASKLET:     196835          2          3          0          0          0          0          0
           SCHED:     161852     146745     129539     126064     127998     128014     120243     117391
         HRTIMER:          0          0          0          0          0          0          0          0
             RCU:     337707     289397     251874     239796     254377     254898     267497     256624

As we can see, the elements of the softirq_vec array have the softirq_action type. This is the main data structure related to the softirq mechanism, so all softirqs are represented by the softirq_action structure. The softirq_action structure consists of a single field only: an action pointer to the softirq function:

    struct softirq_action
    {
        void (*action)(struct softirq_action *);
    };

So, after this we can understand that the open_softirq function fills the softirq_vec array with the given softirq_action. For a registered deferred interrupt (registered with a call to the open_softirq function) to be queued for execution, it must be activated by a call to the raise_softirq function. This function takes only one parameter, a softirq index nr. Let's look at its implementation:

    void raise_softirq(unsigned int nr)
    {
        unsigned long flags;

        local_irq_save(flags);
        raise_softirq_irqoff(nr);
        local_irq_restore(flags);
    }

Here we can see the call to the raise_softirq_irqoff function between the local_irq_save and local_irq_restore macros. The local_irq_save macro is defined in the include/linux/irqflags.h header file; it saves the state of the IF flag of the eflags register and disables interrupts on the local processor. The local_irq_restore macro is defined in the same header file and does the opposite thing: it restores the interrupt flag and enables interrupts. We disable interrupts here because a softirq runs in interrupt context and only that one softirq (and no others) will be run.

The raise_softirq_irqoff function marks the softirq as deferred by setting the bit corresponding to the given index nr in the softirq bit mask (__softirq_pending) of the local processor. It does it with the help of the:

    __raise_softirq_irqoff(nr);

macro. After this, it checks the result of in_interrupt, which returns the irq_count value. We already saw irq_count in the first part of this chapter; it is used to check whether a CPU is already on an interrupt stack or not. If we are in interrupt context, we just exit from raise_softirq_irqoff, restore the IF flag and enable interrupts on the local processor; otherwise we call wakeup_softirqd:

    if (!in_interrupt())
        wakeup_softirqd();

where the wakeup_softirqd function activates the ksoftirqd kernel thread of the local processor:

    static void wakeup_softirqd(void)
    {
        struct task_struct *tsk = __this_cpu_read(ksoftirqd);

        if (tsk && tsk->state != TASK_RUNNING)
            wake_up_process(tsk);
    }

Each ksoftirqd kernel thread runs the run_ksoftirqd function, which checks for the existence of deferred interrupts and calls the __do_softirq function depending on the result. This function reads the __softirq_pending softirq bit mask of the local processor and executes the deferrable functions corresponding to every bit set. During the execution of a deferred function, new pending softirqs might occur. The main problem here is that the execution of userspace code can be delayed for a long time while the __do_softirq function handles deferred interrupts. For this reason, it has a time limit by which it must be finished:

    unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
    ...
    ...
    ...
    restart:
    while ((softirq_bit = ffs(pending))) {
        ...
        h->action(h);
        ...
    }
    ...
    ...
    ...
    pending = local_softirq_pending();
    if (pending) {
        if (time_before(jiffies, end) && !need_resched() &&
            --max_restart)
            goto restart;
    }
    ...

Checks for the existence of deferred interrupts are performed periodically, and there are several points where this check occurs. The main one is the call to the do_IRQ function, which is defined in arch/x86/kernel/irq.c and provides the main facility for actual interrupt processing in the Linux kernel. When this function finishes handling an interrupt, it calls the exiting_irq function from arch/x86/include/asm/apic.h, which expands to a call to the irq_exit function. irq_exit checks for deferred interrupts and the current context, and calls the invoke_softirq function:

    if (!in_interrupt() && local_softirq_pending())
        invoke_softirq();

which executes __do_softirq too. So what do we have in summary? Each softirq goes through the following stages:

Registration of a softirq with the open_softirq function.
Activation of a softirq by marking it as deferred with the raise_softirq function. After this, all marked softirqs will be run the next time the Linux kernel schedules a round of executions of deferrable functions.
Execution of the deferred functions that have the same type.
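These three stages can be played through in a small user-space model. It is only a sketch: all names ending in _model and the count_action handler are hypothetical, a single unsigned int stands in for the per-processor __softirq_pending mask, and the time limit and restart logic of __do_softirq are left out.

```c
#include <assert.h>

#define NR_SOFTIRQS_MODEL 10

struct softirq_action_model {
    void (*action)(struct softirq_action_model *);
};

static struct softirq_action_model softirq_vec_model[NR_SOFTIRQS_MODEL];
static unsigned int pending_model;            /* one bit per softirq   */
static int runs_model[NR_SOFTIRQS_MODEL];     /* how often each one ran */

/* Stage 1: registration. */
static void open_softirq_model(int nr,
                               void (*action)(struct softirq_action_model *))
{
    assert(nr >= 0 && nr < NR_SOFTIRQS_MODEL);
    softirq_vec_model[nr].action = action;
}

/* Stage 2: activation, which only marks the softirq as pending. */
static void raise_softirq_model(unsigned int nr)
{
    assert(nr < NR_SOFTIRQS_MODEL);
    pending_model |= 1u << nr;
}

/* Example handler: counts how often its vector ran. */
static void count_action(struct softirq_action_model *h)
{
    runs_model[h - softirq_vec_model]++;
}

/* Stage 3: run every pending handler once, lowest index first.
 * Returns the number of handlers that were run. */
static int do_softirq_model(void)
{
    unsigned int pending = pending_model;
    int nr, ran = 0;

    pending_model = 0;
    for (nr = 0; nr < NR_SOFTIRQS_MODEL; nr++) {
        if ((pending & (1u << nr)) && softirq_vec_model[nr].action) {
            softirq_vec_model[nr].action(&softirq_vec_model[nr]);
            ran++;
        }
    }
    return ran;
}
```

Note that raising the same softirq twice before it runs still results in a single execution, because the pending state is just one bit per softirq.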

As I already wrote, softirqs are statically allocated, and this is a problem for a loadable kernel module. The second concept, tasklets, which is built on top of softirqs, solves this problem.

Tasklets

If you read the source code of the Linux kernel that is related to softirqs, you will notice that they are used very rarely. The preferable way to implement deferrable functions is tasklets. As I already wrote above, tasklets are built on top of the softirq concept, and specifically on top of two softirqs:

TASKLET_SOFTIRQ;
HI_SOFTIRQ.

In short, tasklets are softirqs that can be allocated and initialized at runtime and, unlike softirqs, tasklets of the same type cannot run on multiple processors at a time. OK, now we know a little bit about softirqs. Of course the previous text does not cover all aspects of them, but now we can look directly at the code, learn more about softirqs step by step in practice, and learn about tasklets. Let's return to the implementation of the softirq_init function that we talked about at the beginning of this part. This function is defined in the kernel/softirq.c source code file; let's look at its implementation:

    void __init softirq_init(void)
    {
        int cpu;

        for_each_possible_cpu(cpu) {
            per_cpu(tasklet_vec, cpu).tail =
                &per_cpu(tasklet_vec, cpu).head;
            per_cpu(tasklet_hi_vec, cpu).tail =
                &per_cpu(tasklet_hi_vec, cpu).head;
        }

        open_softirq(TASKLET_SOFTIRQ, tasklet_action);
        open_softirq(HI_SOFTIRQ, tasklet_hi_action);
    }

We can see the definition of the integer cpu variable at the beginning of the softirq_init function. Next we use it as a parameter for the for_each_possible_cpu macro, which goes through all possible processors in the system. If possible processor is a new term for you, you can read more about it in the CPU masks chapter. In short, possible cpus is the set of processors that can be plugged in at any time during the life of that system boot. All possible processors are stored in the cpu_possible_bits bitmap; you can find its definition in kernel/cpu.c:

    static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
    ...
    ...
    ...
    const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);

OK, we defined the integer cpu variable, and we go through all possible processors with the for_each_possible_cpu macro and initialize the following two per-cpu variables:

tasklet_vec;
tasklet_hi_vec;

These two per-cpu variables are defined in the same source code file as the softirq_init function and represent two tasklet_head structures:

    static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
    static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);

Where the tasklet_head structure represents a list of tasklets and contains two fields, head and tail:

    struct tasklet_head {
        struct tasklet_struct *head;
        struct tasklet_struct **tail;
    };
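It may not be obvious why tail is a pointer to a pointer and why softirq_init points it at head. A minimal userspace sketch of the same layout shows the benefit: appending needs no special case for the empty list. The demo_node and demo_list types and both functions are hypothetical names used only for illustration.

```c
#include <stddef.h>

struct demo_node {
    struct demo_node *next;
    int id;
};

struct demo_list {
    struct demo_node *head;
    struct demo_node **tail;   /* address of the last `next`, or of `head` */
};

/* Empty list: tail points at head, the same state softirq_init sets up. */
void demo_list_init(struct demo_list *l)
{
    l->head = NULL;
    l->tail = &l->head;
}

/* O(1) append that works the same way for empty and non-empty lists. */
void demo_list_append(struct demo_list *l, struct demo_node *n)
{
    n->next = NULL;
    *l->tail = n;          /* link after the current last element */
    l->tail = &n->next;    /* the new element is now the last one */
}
```

The two statements in demo_list_append have the same shape as the two per-cpu accesses in __tasklet_schedule, which writes through tasklet_vec.tail and then moves it to &(t->next).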

The tasklet_struct structure is defined in include/linux/interrupt.h and represents a tasklet. We have not seen this word in this book before, so let's try to understand what a tasklet is. Actually, a tasklet is one of the mechanisms for handling deferred interrupts. Let's look at the implementation of the tasklet_struct structure:

    struct tasklet_struct
    {
        struct tasklet_struct *next;
        unsigned long state;
        atomic_t count;
        void (*func)(unsigned long);
        unsigned long data;
    };

As we can see, this structure contains five fields:

- next: the next tasklet in the scheduling queue;
- state: the state of the tasklet;
- count: a counter that tells whether the tasklet is currently active (zero) or disabled (non-zero);
- func: the main callback of the tasklet;
- data: the parameter of the callback.

In our case, softirq_init initializes only two lists of tasklets: tasklet_vec and tasklet_hi_vec. Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec lists, respectively. So, we have initialized these lists, and now we can see two calls of the open_softirq function, which is defined in the kernel/softirq.c source code file:

    open_softirq(TASKLET_SOFTIRQ, tasklet_action);
    open_softirq(HI_SOFTIRQ, tasklet_hi_action);

at the end of the softirq_init function. The main purpose of the open_softirq function is the initialization of a softirq: it associates a softirq with the function that handles it. In our case these functions are tasklet_action and tasklet_hi_action: the softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action, and the softirq function associated with TASKLET_SOFTIRQ is named tasklet_action. The Linux kernel provides an API for manipulating tasklets. First of all there is the tasklet_init function, which takes a tasklet_struct, a function and a parameter for it, and initializes the given tasklet_struct with the given data:

    void tasklet_init(struct tasklet_struct *t,
                      void (*func)(unsigned long), unsigned long data)
    {
        t->next = NULL;
        t->state = 0;
        atomic_set(&t->count, 0);
        t->func = func;
        t->data = data;
    }

There are additional ways to initialize a tasklet statically, with the two following macros:

    DECLARE_TASKLET(name, func, data);
    DECLARE_TASKLET_DISABLED(name, func, data);
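To compare static and runtime initialization, here is a small userspace sketch. The demo_ names are hypothetical, a plain int stands in for atomic_t, and the sketch assumes that the disabled variant differs from the enabled one only by starting with a non-zero count.

```c
#include <stddef.h>

struct demo_tasklet {
    struct demo_tasklet *next;
    unsigned long state;
    int count;                       /* 0 = enabled, non-zero = disabled */
    void (*func)(unsigned long);
    unsigned long data;
};

/* Static definition of an enabled tasklet (compare DECLARE_TASKLET). */
#define DEMO_DECLARE_TASKLET(name, func, data) \
    struct demo_tasklet name = { NULL, 0, 0, func, data }

/* Static definition of a tasklet that starts out disabled (assumption). */
#define DEMO_DECLARE_TASKLET_DISABLED(name, func, data) \
    struct demo_tasklet name = { NULL, 0, 1, func, data }

/* Runtime initialization with the same field values as tasklet_init. */
void demo_tasklet_init(struct demo_tasklet *t,
                       void (*func)(unsigned long), unsigned long data)
{
    t->next = NULL;
    t->state = 0;
    t->count = 0;
    t->func = func;
    t->data = data;
}

/* Placeholder callback that ignores its argument. */
void demo_callback(unsigned long data) { (void)data; }

DEMO_DECLARE_TASKLET(demo_static, demo_callback, 42);
DEMO_DECLARE_TASKLET_DISABLED(demo_static_off, demo_callback, 42);
```

A tasklet defined with the first macro ends up in the same state as one passed to demo_tasklet_init; the only difference is that the work is done at compile time.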

The Linux kernel provides the three following functions to mark a tasklet as ready to run:

    void tasklet_schedule(struct tasklet_struct *t);
    void tasklet_hi_schedule(struct tasklet_struct *t);
    void tasklet_hi_schedule_first(struct tasklet_struct *t);

The first function schedules a tasklet with normal priority, the second with high priority and the third out of turn. The implementations of all three functions are similar, so we will consider only the first one, tasklet_schedule. Let's look at its implementation:

    static inline void tasklet_schedule(struct tasklet_struct *t)
    {
        if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
            __tasklet_schedule(t);
    }

    void __tasklet_schedule(struct tasklet_struct *t)
    {
        unsigned long flags;

        local_irq_save(flags);
        t->next = NULL;
        *__this_cpu_read(tasklet_vec.tail) = t;
        __this_cpu_write(tasklet_vec.tail, &(t->next));
        raise_softirq_irqoff(TASKLET_SOFTIRQ);
        local_irq_restore(flags);
    }

As we can see, it tests and sets the TASKLET_STATE_SCHED bit in the state of the given tasklet and, if the bit was not already set, executes __tasklet_schedule with the given tasklet. The __tasklet_schedule function looks very similar to the raise_softirq function that we saw above. It saves the interrupt flag and disables interrupts at the beginning. After this, it appends the new tasklet to tasklet_vec and calls the raise_softirq_irqoff function that we saw above. When the Linux kernel decides to run deferred functions, the tasklet_action function will be called for deferred functions associated with TASKLET_SOFTIRQ, and tasklet_hi_action for deferred functions associated with HI_SOFTIRQ. These functions are very similar, and there is only one difference between them: tasklet_action uses tasklet_vec and tasklet_hi_action uses tasklet_hi_vec.
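The test-and-set step is what guarantees that a tasklet is queued at most once, however many times it is scheduled before it runs. A userspace sketch of that guard, built on C11 atomics, could look as follows. The demo_ names are hypothetical, and a counter stands in for the real list insertion done by __tasklet_schedule.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_STATE_SCHED 0u   /* bit number, plays the role of TASKLET_STATE_SCHED */

struct demo_tasklet {
    atomic_ulong state;
    int queued;               /* how many times it was put on the list */
};

/* Atomically set a bit and report whether it was already set. */
bool demo_test_and_set_bit(unsigned int nr, atomic_ulong *word)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Queue the tasklet only if it was not already scheduled. */
bool demo_tasklet_schedule(struct demo_tasklet *t)
{
    if (demo_test_and_set_bit(DEMO_STATE_SCHED, &t->state))
        return false;         /* already pending, nothing to do */
    t->queued++;              /* stands in for __tasklet_schedule() */
    return true;
}
```

Because the read and the write of the bit happen in one atomic operation, two callers racing on the same tasklet cannot both see the bit as clear.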

Let's look at the implementation of the tasklet_action function:

    static void tasklet_action(struct softirq_action *a)
    {
        struct tasklet_struct *list;

        local_irq_disable();
        list = __this_cpu_read(tasklet_vec.head);
        __this_cpu_write(tasklet_vec.head, NULL);
        __this_cpu_write(tasklet_vec.tail, this_cpu_ptr(&tasklet_vec.head));
        local_irq_enable();

        while (list) {
            struct tasklet_struct *t = list;

            list = list->next;

            if (tasklet_trylock(t)) {
                t->func(t->data);
                tasklet_unlock(t);
            }
            ...
            ...
            ...
        }
    }

At the beginning of the tasklet_action function, we disable interrupts for the local processor with the help of the local_irq_disable macro (you can read about this macro in the second part of this chapter). In the next step, we take the head of the list that contains tasklets with normal priority and set this per-cpu list to NULL, because all of its tasklets are going to be executed in this pass. After this we enable interrupts for the local processor and go through the list of tasklets in the loop. In every iteration of the loop we call the tasklet_trylock function for the given tasklet, which tries to set the TASKLET_STATE_RUN bit in the state of the given tasklet:

    static inline int tasklet_trylock(struct tasklet_struct *t)
    {
        return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
    }

If this operation was successful, we execute the tasklet's action (it was set in tasklet_init) and call the tasklet_unlock function, which clears the tasklet's TASKLET_STATE_RUN state.
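The whole loop, detaching the pending list and running every entry that can be locked, can be modelled in userspace. This is a simplified sketch under stated assumptions: the demo_ names are hypothetical, there is one global list instead of a per-cpu one, and a tasklet that cannot be locked is simply skipped, which is not what the elided part of the kernel function does.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { DEMO_STATE_RUN = 1 };  /* bit number, plays the role of TASKLET_STATE_RUN */

struct demo_tasklet {
    struct demo_tasklet *next;
    atomic_ulong state;
    void (*func)(unsigned long);
    unsigned long data;
};

/* Succeeds only if the RUN bit was clear, and sets it in the same step. */
bool demo_trylock(struct demo_tasklet *t)
{
    unsigned long mask = 1UL << DEMO_STATE_RUN;

    return !(atomic_fetch_or(&t->state, mask) & mask);
}

void demo_unlock(struct demo_tasklet *t)
{
    atomic_fetch_and(&t->state, ~(1UL << DEMO_STATE_RUN));
}

/* Detach the whole pending list, then run each entry that can be locked. */
int demo_tasklet_action(struct demo_tasklet **pending)
{
    struct demo_tasklet *list = *pending;
    int ran = 0;

    *pending = NULL;                 /* new tasklets go to a fresh list */
    while (list) {
        struct demo_tasklet *t = list;

        list = list->next;
        if (demo_trylock(t)) {
            t->func(t->data);
            demo_unlock(t);
            ran++;
        }
    }
    return ran;
}

/* Example callback: adds its argument to a running total. */
unsigned long demo_total;
void demo_add(unsigned long v) { demo_total += v; }
```

The RUN bit is what keeps one tasklet from running on two processors at once: whichever processor sets the bit first runs the callback, and the other one fails the trylock.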

In general, that's all about the tasklet concept. Of course this does not cover tasklets in full, but I think it is a good point from which you can continue to learn this concept.

Tasklets are a widely used concept in the Linux kernel, but as I wrote at the beginning of this part, there is a third mechanism for deferred functions: the workqueue. In the next paragraph we will see what it is.

Workqueues

The workqueue is another concept for handling deferred functions. It is similar to tasklets, with some differences. Workqueue functions run in the context of a kernel process, but tasklet functions run in the software interrupt context. This means that workqueue functions do not have to be atomic, as tasklet functions must be. Tasklets always run on the processor from which they were originally submitted. Workqueues work in the same way, but only by default. The workqueue concept is represented by the following structure:

    struct worker_pool {
        spinlock_t lock;
        int cpu;
        int node;
        int id;
        unsigned int flags;

        struct list_head worklist;
        int nr_workers;
        ...
        ...
        ...

which is defined in the kernel/workqueue.c source code file in the Linux kernel. I will not write the full source code of this structure here, because it has quite a lot of fields, but we will consider some of those fields.

In its most basic form, the workqueue subsystem is an interface for creating kernel threads to handle work that is queued from elsewhere. All of these kernel threads are called worker threads. A work item is represented by the work_struct structure, which is defined in include/linux/workqueue.h. Let's look at this structure:

    struct work_struct {
        atomic_long_t data;
        struct list_head entry;
        work_func_t func;
    #ifdef CONFIG_LOCKDEP
        struct lockdep_map lockdep_map;
    #endif
    };

Here are two things that we are interested in: func, the function that will be scheduled by the workqueue, and data, the parameter of this function. The Linux kernel provides special per-cpu threads that are called kworker:

    systemd-cgls -k | grep kworker
    ├─5 [kworker/0:0H]
    ├─15 [kworker/1:0H]
    ├─20 [kworker/2:0H]
    ├─25 [kworker/3:0H]
    ├─30 [kworker/4:0H]
    ...
    ...
    ...

These threads can be used to schedule the deferred functions of workqueues (as ksoftirqd does for softirqs). Besides this, we can create a new separate worker thread for a workqueue. The Linux kernel provides the following macro for the creation of a work item:

    #define DECLARE_WORK(n, f) \
        struct work_struct n = __WORK_INITIALIZER(n, f)

for static creation. It takes two parameters: the name of the work item and the workqueue function. For creation of a work item at runtime, we can use the:

    #define INIT_WORK(_work, _func) \
        __INIT_WORK((_work), (_func), 0)

    #define __INIT_WORK(_work, _func, _onstack) \
        do { \
            __init_work((_work), _onstack); \
            (_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
            INIT_LIST_HEAD(&(_work)->entry); \
            (_work)->func = (_func); \
        } while (0)

macro, which takes the work_struct structure that has to be initialized and the function to be scheduled in this workqueue. After a work item has been created with one of these macros, we need to put it on the workqueue. We can do it with the help of the queue_work or the queue_delayed_work functions:

    static inline bool queue_work(struct workqueue_struct *wq,
                                  struct work_struct *work)
    {
        return queue_work_on(WORK_CPU_UNBOUND, wq, work);
    }

The queue_work function just calls the queue_work_on function, which queues work on a specific processor. Note that in our case we pass WORK_CPU_UNBOUND to the queue_work_on function. It is a part of the enum that is defined in include/linux/workqueue.h and represents work which is not bound to any specific processor. The queue_work_on function tests and sets the WORK_STRUCT_PENDING_BIT bit of the given work and executes the __queue_work function with the workqueue for the given processor and the given work:

    __queue_work(cpu, wq, work);

The __queue_work function gets the work pool. Yes, the work pool, not the workqueue. Actually, works are not placed in the workqueue, but in the work pool, which is represented by the worker_pool structure in the Linux kernel. The workqueue_struct structure has the pwqs field, which is a list of pool_workqueue structures. When we create a workqueue, a pool_workqueue is allocated for each processor. Each pool_workqueue is associated with a worker_pool, which is allocated on the same processor and corresponds to the type of priority queue. Through them the workqueue interacts with the worker_pool. So in the __queue_work function we set the cpu to the current processor with raw_smp_processor_id (you can find information about this macro in the fourth part of the Linux kernel initialization process chapter), get the pool_workqueue for the given workqueue_struct and insert the given work into the given workqueue:

    static void __queue_work(int cpu, struct workqueue_struct *wq,
                             struct work_struct *work)
    {
        ...
        ...
        ...
        if (req_cpu == WORK_CPU_UNBOUND)
            cpu = raw_smp_processor_id();

        if (!(wq->flags & WQ_UNBOUND))
            pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
        else
            pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
        ...
        ...
        ...
        insert_work(pwq, work, worklist, work_flags);

Now that we can create works and workqueues, we need to know when they are executed. As I already wrote, all works are executed by a kernel thread. When this kernel thread is scheduled, it starts to execute works from the given workqueue. Each worker thread executes a loop inside the worker_thread function. This thread does many different things, and part of them are similar to what we saw before in this part. As it starts executing, it removes all work_struct items, or works, from its workqueue.
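The life of a work item, from initialization through queueing to execution by a worker, can be modelled in a few lines of userspace C. Everything here is a hypothetical sketch: the demo_ names are invented, a boolean stands in for the pending bit, and a plain function call stands in for the worker thread.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_work;
typedef void (*demo_work_fn)(struct demo_work *);

struct demo_work {
    struct demo_work *next;
    demo_work_fn func;
    bool pending;             /* stands in for WORK_STRUCT_PENDING_BIT */
    int runs;
};

struct demo_pool {
    struct demo_work *head;
    struct demo_work **tail;
};

void demo_pool_init(struct demo_pool *p)
{
    p->head = NULL;
    p->tail = &p->head;
}

/* Runtime initialization, in the spirit of INIT_WORK. */
void demo_init_work(struct demo_work *w, demo_work_fn fn)
{
    w->next = NULL;
    w->func = fn;
    w->pending = false;
    w->runs = 0;
}

/* Returns false if the work is already queued, true if it was added. */
bool demo_queue_work(struct demo_pool *p, struct demo_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->next = NULL;
    *p->tail = w;
    p->tail = &w->next;
    return true;
}

/* One pass of a worker: take items off the list and run them in order. */
int demo_worker_run(struct demo_pool *p)
{
    int done = 0;

    while (p->head) {
        struct demo_work *w = p->head;

        p->head = w->next;
        if (!p->head)
            p->tail = &p->head;
        w->pending = false;   /* cleared before the callback runs */
        w->func(w);
        done++;
    }
    return done;
}

/* Example work function: counts how often the item was executed. */
void demo_count(struct demo_work *w) { w->runs++; }
```

Since the pending flag is cleared before the callback runs, a work function may queue itself again from inside its own callback.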

That's all.

Conclusion

It is the end of the ninth part of the Interrupts and Interrupt Handling chapter, and we continued to dive into external hardware interrupts in this part. In the previous part we saw the initialization of the IRQs and the main irq_desc structure. In this part we saw three concepts that are used for deferred functions: the softirq, the tasklet and the workqueue.

The next part will be the last part of the Interrupts and Interrupt Handling chapter, and we will look at a real hardware driver and try to learn how it works with the interrupts subsystem.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Interrupts and Interrupt Handling. Part 10.

Last part

This is the tenth part of the chapter about interrupts and interrupt handling in the Linux kernel. In the previous part we saw a little about deferred interrupts and related concepts like softirq, tasklet and workqueue. In this part we will continue to dive into this theme, and now it's time to look at a real hardware driver.

Let's consider the serial driver of the StrongARM** SA-110/21285 Evaluation Board as an example, and look at how this driver requests an IRQ line, what happens when an interrupt is triggered, and so on. The source code of this driver is located in the drivers/tty/serial/21285.c source code file. OK, we have the source code, let's start.

Initialization of a kernel module

We will consider this driver as we usually did with all new concepts that we saw in this book: starting from the initialization. As you may already know, the Linux kernel provides two macros for initialization and finalization of a driver or a kernel module:

- module_init;
- module_exit.

And we can find the usage of these macros in our driver's source code:

    module_init(serial21285_init);
    module_exit(serial21285_exit);

Most device drivers can be compiled either as a loadable kernel module or statically linked into the Linux kernel. In the first case, the module_init and module_exit macros, which are defined in include/linux/init.h, make the driver's functions aliases of init_module and cleanup_module:

    #define module_init(initfn) \
        static inline initcall_t __inittest(void) \
        { return initfn; } \
        int init_module(void) __attribute__((alias(#initfn)));

    #define module_exit(exitfn) \
        static inline exitcall_t __exittest(void) \
        { return exitfn; } \
        void cleanup_module(void) __attribute__((alias(#exitfn)));

In this case module loading is implemented in the kernel/module.c source code file, and initialization occurs in the do_init_module function. Otherwise, if a device driver is statically linked into the Linux kernel, the implementation of these macros is the following:

    #define module_init(x) __initcall(x);
    #define module_exit(x) __exitcall(x);

and the initialization function is called together with the other initcall functions:

- early_initcall
- pure_initcall
- core_initcall
- postcore_initcall
- arch_initcall
- subsys_initcall
- fs_initcall
- rootfs_initcall
- device_initcall
- late_initcall

that are called in do_initcalls from init/main.c. We will not dive into details about loadable modules in this chapter, but will see them in a special chapter that will describe Linux kernel modules. OK, the module_init macro takes one parameter, serial21285_init in our case. As we can understand from the function's name, this function does work related to driver initialization. Let's look at it:

    static int __init serial21285_init(void)
    {
        int ret;

        printk(KERN_INFO "Serial: 21285 driver\n");

        serial21285_setup_ports();

        ret = uart_register_driver(&serial21285_reg);
        if (ret == 0)
            uart_add_one_port(&serial21285_reg, &serial21285_port);

        return ret;
    }

As we can see, first of all it prints information about the driver to the kernel log buffer and calls the serial21285_setup_ports function. This function sets up the base uart clock of the serial21285_port device:

    unsigned int mem_fclk_21285 = 50000000;

    static void serial21285_setup_ports(void)
    {
        serial21285_port.uartclk = mem_fclk_21285 / 4;
    }

Next, the driver is registered with the uart_register_driver function. Here serial21285_reg is the structure that describes the uart driver:

    static struct uart_driver serial21285_reg = {
        .owner       = THIS_MODULE,
        .driver_name = "ttyFB",
        .dev_name    = "ttyFB",
        .major       = SERIAL_21285_MAJOR,
        .minor       = SERIAL_21285_MINOR,
        .nr          = 1,
        .cons        = SERIAL_21285_CONSOLE,
    };

If the driver is registered successfully, we attach the driver-defined port serial21285_port structure with the uart_add_one_port function from the drivers/tty/serial/serial_core.c source code file and return from the serial21285_init function:

    if (ret == 0)
        uart_add_one_port(&serial21285_reg, &serial21285_port);

    return ret;

That's all. Our driver is initialized. When a uart port is opened with a call of the uart_open function from drivers/tty/serial/serial_core.c, it will call the uart_startup function to start up the serial port. This function will call the startup function that is part of the uart_ops structure. Each uart driver has a definition of this structure; in our case it is the serial21285_ops structure:

    static struct uart_ops serial21285_ops = {
        ...
        .startup = serial21285_startup,
        ...
    }

As we can see, the .startup field references the serial21285_startup function. The implementation of this function is very interesting for us, because it is related to interrupts and interrupt handling.

Requesting irq line

Let's look at the implementation of the serial21285_startup function:

    static int serial21285_startup(struct uart_port *port)
    {
        int ret;

        tx_enabled(port) = 1;
        rx_enabled(port) = 1;

        ret = request_irq(IRQ_CONRX, serial21285_rx_chars, 0,
                          serial21285_name, port);
        if (ret == 0) {
            ret = request_irq(IRQ_CONTX, serial21285_tx_chars, 0,
                              serial21285_name, port);
            if (ret)
                free_irq(IRQ_CONRX, port);
        }

        return ret;
    }

First of all, about TX and RX. A serial bus of a device consists of just two wires: one for sending data and another for receiving. As such, serial devices should have two serial pins: the receiver, RX, and the transmitter, TX. With the first two macros, tx_enabled and rx_enabled, we enable these wires. The following part of this function is of the greatest interest for us. Note the calls of the request_irq function. This function registers an interrupt handler and enables a given interrupt line. Let's look at the implementation of this function and get into the details. This function is defined in the include/linux/interrupt.h header file and looks as follows:

    static inline int __must_check
    request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
                const char *name, void *dev)
    {
        return request_threaded_irq(irq, handler, NULL, flags, name, dev);
    }

As we can see, the request_irq function takes five parameters:

- irq: the interrupt number that is being requested;
- handler: the pointer to the interrupt handler;
- flags: the bitmask options;
- name: the name of the owner of the interrupt;
- dev: the pointer used for shared interrupt lines.

Now let's look at the calls of the request_irq function in our example. As we can see, the first parameter is IRQ_CONRX. We know that it is the number of the interrupt, but what is CONRX? This macro is defined in the arch/arm/mach-footbridge/include/mach/irqs.h header file, where we can find the full list of interrupts that the 21285 board can generate. Note that in the second call of the request_irq function we pass the IRQ_CONTX interrupt number. These two interrupts handle RX and TX events in our driver. The implementation of these macros is easy:

    #define IRQ_CONRX    _DC21285_IRQ(0)
    #define IRQ_CONTX    _DC21285_IRQ(1)
    ...
    ...
    ...
    #define _DC21285_IRQ(x)    (16 + (x))

The ISA IRQs on this board are numbered from 0 to 15, so our interrupts get the first two numbers after them: 16 and 17. The second parameters of the two calls of the request_irq function are serial21285_rx_chars and serial21285_tx_chars. These functions will be called when an RX or TX interrupt occurs. We will not dive into the details of these functions in this part, because this chapter covers interrupts and interrupt handling, not devices and drivers. The next parameter is flags and, as we can see, it is zero in both calls of the request_irq function. All acceptable flags are defined as IRQF_* macros in include/linux/interrupt.h. Some of them:

- IRQF_SHARED: allows sharing the irq among several devices;
- IRQF_PERCPU: an interrupt is per cpu;
- IRQF_NO_THREAD: an interrupt cannot be threaded;
- IRQF_NOBALANCING: excludes this interrupt from irq balancing;
- IRQF_IRQPOLL: an interrupt is used for polling;
- etc.

In our case we pass 0, so it will be IRQF_TRIGGER_NONE. This flag means that it does not imply any kind of edge or level triggered interrupt behaviour. To the fourth parameter (name), we pass serial21285_name, which is defined as:

    static const char serial21285_name[] = "Footbridge UART";

and will be displayed in the output of /proc/interrupts. In the last parameter we pass the pointer to our main uart_port structure. Now that we know a little about the request_irq function and its parameters, let's look at its implementation. As we can see above, the request_irq function just makes a call of the request_threaded_irq function inside. The request_threaded_irq function is defined in the kernel/irq/manage.c source code file and allocates a given interrupt line. If we look at this function, it starts with the definitions of the irqaction and the irq_desc:

    int request_threaded_irq(unsigned int irq, irq_handler_t handler,
                             irq_handler_t thread_fn, unsigned long irqflags,
                             const char *devname, void *dev_id)
    {
        struct irqaction *action;
        struct irq_desc *desc;
        int retval;
        ...
        ...
        ...
    }

We already saw the irqaction and the irq_desc structures in this chapter. The first structure represents a per-interrupt action descriptor and contains pointers to the interrupt handler, the name of the device, the interrupt number, etc. The second structure represents a descriptor of an interrupt and contains a pointer to the irqaction, interrupt flags, etc. Note that request_irq calls request_threaded_irq with an additional parameter: irq_handler_t thread_fn. If this parameter is not NULL, an irq thread will be created and the given irq handler will be executed in this thread. In the next step we need to make the following checks:

    if (((irqflags & IRQF_SHARED) && !dev_id) ||
        (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
        ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
        return -EINVAL;
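The three conditions of this check can be exercised in isolation. The sketch below is a userspace model, not kernel code: demo_irq_request_ok is a hypothetical helper, and the DEMO_IRQF_* values are illustrative constants chosen for this example rather than something to rely on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; treat them as assumptions, not as kernel ABI. */
#define DEMO_IRQF_SHARED        0x0080UL
#define DEMO_IRQF_NO_SUSPEND    0x4000UL
#define DEMO_IRQF_COND_SUSPEND  0x40000UL

/* Returns true when the flag/dev_id combination would pass the check. */
bool demo_irq_request_ok(unsigned long flags, const void *dev_id)
{
    if ((flags & DEMO_IRQF_SHARED) && !dev_id)
        return false;   /* a shared line needs a cookie to tell users apart */
    if (!(flags & DEMO_IRQF_SHARED) && (flags & DEMO_IRQF_COND_SUSPEND))
        return false;   /* COND_SUSPEND only makes sense for shared lines */
    if ((flags & DEMO_IRQF_NO_SUSPEND) && (flags & DEMO_IRQF_COND_SUSPEND))
        return false;   /* the two suspend policies contradict each other */
    return true;
}
```

Our driver passes 0 as flags, so none of the three conditions can trigger for it, with or without a dev_id.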

First of all we check that a real dev_id is passed for a shared interrupt, and that IRQF_COND_SUSPEND, which only makes sense for shared interrupts, is not combined with IRQF_NO_SUSPEND. Otherwise we exit from this function with the -EINVAL error. After this we convert the given irq number to the irq descriptor with the help of the irq_to_desc function, which is defined in the kernel/irq/irqdesc.c source code file, and exit from this function with the -EINVAL error if it was not successful:

    desc = irq_to_desc(irq);
    if (!desc)
        return -EINVAL;

The irq_to_desc function checks that the given irq number is less than the maximum number of IRQs and returns the irq descriptor, where the irq number is the offset into the irq_desc array:

    struct irq_desc *irq_to_desc(unsigned int irq)
    {
        return (irq < NR_IRQS) ? irq_desc + irq : NULL;
    }

Now that we have converted the irq number to the irq descriptor, we check the status of the descriptor to make sure that the interrupt can be requested:

    if (!irq_settings_can_request(desc) || WARN_ON(irq_settings_is_per_cpu_devid(desc)))
        return -EINVAL;

and exit with -EINVAL otherwise. After this we check the given interrupt handler. If it was not passed to the request_irq function, we check thread_fn. If both handlers are NULL, we return -EINVAL. If an interrupt handler was not passed to the request_irq function but thread_fn is not NULL, we set the handler to irq_default_primary_handler:

    if (!handler) {
        if (!thread_fn)
            return -EINVAL;
        handler = irq_default_primary_handler;
    }

In the next step we allocate memory for our irqaction with the kzalloc function and return from the function if this operation was not successful:

    action = kzalloc(sizeof(struct irqaction), GFP_KERNEL);
    if (!action)
        return -ENOMEM;

More about kzalloc will be in the separate chapter about memory management in the Linux kernel. Once we have allocated space for the irqaction, we start to initialize this structure with the values of the interrupt handler, interrupt flags, device name, etc.:

    action->handler = handler;
    action->thread_fn = thread_fn;
    action->flags = irqflags;
    action->name = devname;
    action->dev_id = dev_id;

At the end of the request_threaded_irq function we call the __setup_irq function from kernel/irq/manage.c, which registers the given irqaction. If registration fails, we release the memory allocated for the irqaction, and then return:

    chip_bus_lock(desc);
    retval = __setup_irq(irq, desc, action);
    chip_bus_sync_unlock(desc);

    if (retval)
        kfree(action);

    return retval;

Note that the call of the __setup_irq function is placed between the chip_bus_lock and the chip_bus_sync_unlock functions. These functions lock/unlock access to slow bus (like i2c) chips. Now let's look at the implementation of the __setup_irq function. At the beginning of the __setup_irq function we can see a couple of different checks. First of all we check that the given interrupt descriptor is not NULL, that the irqchip is not NULL and that the module owner of the given interrupt descriptor is not NULL. After this we check whether the interrupt is nested into another interrupt thread, and if it is nested we replace irq_default_primary_handler with irq_nested_primary_handler.

In the next step we create an irq handler thread with the kthread_create function, if the given interrupt is not nested and thread_fn is not NULL:

    if (new->thread_fn && !nested) {
        struct task_struct *t;
        t = kthread_create(irq_thread, new, "irq/%d-%s", irq, new->name);
        ...
    }

And we fill the rest of the given interrupt descriptor fields at the end. So, our 16 and 17 interrupt request lines are registered, and the serial21285_rx_chars and serial21285_tx_chars functions will be invoked when the interrupt controller gets an event related to these interrupts. Now let's look at what happens when an interrupt occurs.

Prepare to handle an interrupt

In the previous paragraph we saw the requesting of the irq line for the given interrupt descriptor and the registration of the irqaction structure for the given interrupt. We already know that when an interrupt event occurs, an interrupt controller notifies the processor about this event, and the processor tries to find the appropriate interrupt gate for this interrupt. If you have read the eighth part of this chapter, you may remember the native_init_IRQ function. This function initializes the local APIC. The following part of this function is the most interesting part for us right now:

    for_each_clear_bit_from(i, used_vectors, first_system_vector) {
        set_intr_gate(i, irq_entries_start +
                      8 * (i - FIRST_EXTERNAL_VECTOR));
    }

Here we iterate over all the clear bits of the used_vectors bitmap up to first_system_vector, which is:

    int first_system_vector = FIRST_SYSTEM_VECTOR; // 0xef

and set interrupt gates with the i vector number and the irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR) start address. Only one thing is unclear here: irq_entries_start. This symbol is defined in the arch/x86/entry/entry_64.S assembly file and provides irq entries. Let's look at it:

        .align 8
    ENTRY(irq_entries_start)
        vector=FIRST_EXTERNAL_VECTOR
        .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
        pushq   $(~vector+0x80)
        vector=vector+1
        jmp     common_interrupt
        .align  8
        .endr
    END(irq_entries_start)

Here we can see the GNU assembler .rept directive, which repeats the sequence of lines before .endr FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR times. As we already know, FIRST_SYSTEM_VECTOR is 0xef and FIRST_EXTERNAL_VECTOR is equal to 0x20. So, it will repeat:

    >>> 0xef - 0x20
    207

times. In the body of the .rept directive we push the adjusted vector number onto the stack (note that we use negative numbers for the interrupt vector numbers, because positive numbers are already reserved to identify system calls), increment the vector variable and jump to the common_interrupt label. In common_interrupt we adjust the vector number on the stack and invoke the interrupt macro with do_IRQ as its parameter:

    common_interrupt:
        addq    $-0x80, (%rsp)
        interrupt do_IRQ
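The arithmetic on the vector number is easy to get lost in, so here is a small C sketch of the round trip: the value the stub pushes, the adjustment in common_interrupt, and the recovery of the vector with a bitwise NOT, as do_IRQ does with regs->orig_ax. The function names are hypothetical. The sketch also checks that every pushed value fits into a signed byte, which is my reading of why 0x80 is added in the first place.

```c
#include <stdint.h>

#define DEMO_FIRST_EXTERNAL_VECTOR 0x20
#define DEMO_FIRST_SYSTEM_VECTOR   0xef

/* Value pushed by an entry stub: ~vector + 0x80. */
int64_t demo_stub_value(unsigned int vector)
{
    return ~(int64_t)vector + 0x80;
}

/* common_interrupt removes the bias again: addq $-0x80, (%rsp). */
int64_t demo_common_interrupt(int64_t pushed)
{
    return pushed - 0x80;
}

/* do_IRQ recovers the vector: vector = ~regs->orig_ax. */
unsigned int demo_recover_vector(int64_t orig_ax)
{
    return (unsigned int)~orig_ax;
}

/* Returns 1 if every stub value fits in a signed byte and round-trips. */
int demo_check_all_vectors(void)
{
    for (unsigned int v = DEMO_FIRST_EXTERNAL_VECTOR;
         v < DEMO_FIRST_SYSTEM_VECTOR; v++) {
        int64_t pushed = demo_stub_value(v);

        if (pushed < INT8_MIN || pushed > INT8_MAX)
            return 0;
        if (demo_recover_vector(demo_common_interrupt(pushed)) != v)
            return 0;
    }
    return 1;
}
```

For the first external vector, 0x20, the stub pushes 95; after the adjustment the stack holds -33, and the bitwise NOT of -33 is 0x20 again.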

The interrupt macro is defined in the same source code file. It saves the general purpose registers on the stack, switches the userspace gs to the kernel one with the SWAPGS assembler instruction if needed, increments the per-cpu irq_count variable, which shows that we are in an interrupt, and calls the do_IRQ function. This function is defined in the arch/x86/kernel/irq.c source code file and handles our device interrupt. Let's look at this function. The do_IRQ function takes one parameter, a pt_regs structure that stores the values of the userspace registers:

    __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
    {
        struct pt_regs *old_regs = set_irq_regs(regs);
        unsigned vector = ~regs->orig_ax;
        unsigned irq;

        irq_enter();
        exit_idle();
        ...
        ...
        ...
    }

At the beginning of this function we can see a call of the set_irq_regs function, which returns the saved per-cpu irq register pointer, and calls of the irq_enter and exit_idle functions. The first function, irq_enter, enters an interrupt context by updating the __preempt_count variable, and the second function, exit_idle, checks whether the current process is the idle process with pid 0 and notifies the idle_notifier with IDLE_END.

In the next step we read the irq for the current cpu and call the handle_irq function:

    irq = __this_cpu_read(vector_irq[vector]);

    if (!handle_irq(irq, regs)) {
        ...
        ...
        ...
    }
    ...
    ...
    ...

The handle_irq function is defined in the arch/x86/kernel/irq_64.c source code file. It checks the given interrupt descriptor and calls generic_handle_irq_desc:

    desc = irq_to_desc(irq);
    if (unlikely(!desc))
        return false;
    generic_handle_irq_desc(irq, desc);

Where generic_handle_irq_desc calls the interrupt handler:

    static inline void generic_handle_irq_desc(unsigned int irq, struct irq_desc *desc)
    {
        desc->handle_irq(irq, desc);
    }

But stop... What is handle_irq, and why do we call our interrupt handler through the interrupt descriptor when we know that irqaction points to the actual interrupt handler? Actually, irq_desc->handle_irq is a high-level API for calling the interrupt handler routine. It is set up during the initialization of the device tree and the APIC initialization. The kernel selects the correct function there, and that function calls the chain of the irq->action(s). In this way, the serial21285_tx_chars or the serial21285_rx_chars function will be executed after an interrupt occurs.
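The two levels of indirection, a flow handler stored in the descriptor and a chain of device handlers behind it, can be sketched in userspace C. All names below are hypothetical, and the flow handler is a deliberately simple one that only walks the chain; real flow handlers also deal with masking and acknowledging the interrupt.

```c
#include <stddef.h>

typedef int (*demo_handler_t)(int irq, void *dev_id);

struct demo_irqaction {
    demo_handler_t handler;
    void *dev_id;
    struct demo_irqaction *next;   /* other users of a shared line */
};

struct demo_irq_desc {
    void (*handle_irq)(unsigned int irq, struct demo_irq_desc *desc);
    struct demo_irqaction *action;
    int handled;                   /* number of handlers that claimed it */
};

/* A flow handler: walks the action chain and calls each device handler. */
void demo_handle_simple(unsigned int irq, struct demo_irq_desc *desc)
{
    for (struct demo_irqaction *a = desc->action; a; a = a->next)
        desc->handled += a->handler((int)irq, a->dev_id);
}

/* Same shape as generic_handle_irq_desc: just an indirect call. */
void demo_generic_handle_irq_desc(unsigned int irq, struct demo_irq_desc *desc)
{
    desc->handle_irq(irq, desc);
}

/* Example device handler: counts events on the port passed as dev_id. */
int demo_rx_chars(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
    return 1;                      /* 1 = this device handled the IRQ */
}
```

The dev_id pointer passed at registration time comes back as the second argument of the handler, which is how our driver would get its uart_port back.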

At the end of the do_IRQ function we call the irq_exit function, which exits from the interrupt context, call set_irq_regs with the old userspace registers and return:

    irq_exit();
    set_irq_regs(old_regs);
    return 1;

We already know that when an IRQ finishes its work, deferred interrupts will be executed if they exist.

Exit from interrupt

OK, the interrupt handler has finished its execution and now we must return from the interrupt. When the work of the do_IRQ function is finished, we return to the assembler code in arch/x86/entry/entry_64.S, to the ret_from_intr label. First of all we disable interrupts with the DISABLE_INTERRUPTS macro, which expands to the cli instruction, and decrement the value of the irq_count per-cpu variable. Remember, this variable was incremented on entry to the interrupt context (its value outside of interrupts is -1):

    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    decl    PER_CPU_VAR(irq_count)

In the last step we check the previous context (user or kernel), restore it in the correct way and exit from the interrupt with:

    INTERRUPT_RETURN

where the INTERRUPT_RETURN macro is:

    #define INTERRUPT_RETURN    jmp native_iret

and

    ENTRY(native_iret)

    .global native_irq_return_iret
    native_irq_return_iret:
        iretq

That's all.

Conclusion

It is the end of the tenth part of the Interrupts and Interrupt Handling chapter and, as you have read at the beginning of this part, it is the last part of this chapter. This chapter started with an explanation of the theory of interrupts, and we learned what an interrupt is and what kinds of interrupts exist. Then we saw exceptions and the handling of this kind of interrupt, deferred interrupts, and finally we looked at hardware interrupts and their handling in this part. Of course, this part and even this chapter do not cover all aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this, at least for me. It was a big part. I don't know about you, but it was really big for me. This theme is much bigger than this chapter, and I am not sure that there is a book anywhere that covers it. We have missed many parts and aspects of interrupts and interrupt handling, but I think this will be a good starting point for diving into the kernel code related to interrupts and interrupt handling.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

- Serial driver documentation
- StrongARM** SA-110/21285 Evaluation Board
- IRQ
- module
- initcall
- uart
- ISA
- memory management

System calls

This chapter describes the system call concept in the Linux kernel.

- Introduction to system call concept: this part is an introduction to the system call concept in the Linux kernel.
- How the Linux kernel handles a system call: this part describes how the Linux kernel handles a system call from a userspace application.
- vsyscall and vDSO: the third part describes the vsyscall and vDSO concepts.
- How the Linux kernel runs a program: this part describes the startup process of a program.

System calls in the Linux kernel. Part 1.

Introduction

This post opens up a new chapter in the linux-insides book and, as you may understand from the title, this chapter will be devoted to the system call concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous chapter we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what happens when a system call occurs from userspace, we will see the implementation of a couple of system call handlers in the Linux kernel, the VDSO and vsyscall concepts, and many more.

Before we start to dive into the implementation of the system call related code in the Linux kernel source code, it is good to know some theory about system calls. Let's do it in the following paragraph.

System call. What is it?

A system call is just a userspace request for a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, start to listen for connections on a socket, delete or create a directory, or even to finish its work, it uses a system call. In other words, a system call is just a C function that is placed in the kernel space, and a user program can ask the kernel to do something via this function.

The Linux kernel provides a set of these functions, and each architecture provides its own set. For example, x86_64 provides 322 system calls and x86 provides 358 different system calls. OK, a system call is just a function. Let's look at a simple Hello world example that's written in the assembly programming language:

    .data

    msg:
        .ascii "Hello, world!\n"
        len = . - msg

    .text
        .global _start

    _start:
        movq  $1, %rax
        movq  $1, %rdi
        movq  $msg, %rsi
        movq  $len, %rdx
        syscall

        movq  $60, %rax
        xorq  %rdi, %rdi
        syscall

We can compile the above with the following commands:

    $ gcc -c test.S
    $ ld -o test test.o

and run it as follows:

    ./test
    Hello, world!

Ok, what do we see here? This simple code represents a Hello world assembly program for the Linux x86_64 architecture. We can see two sections here:

.data
.text

The first section, .data, stores the initialized data of our program (the Hello world string and its length in our case). The second section, .text, contains the code of our program. We can split the code of our program into two parts: the first part is everything up to the first syscall instruction, and the second part is between the first and second syscall instructions. First of all, what does the syscall instruction do in our code and in general? As we can read in the 64-ia-32-architectures-software-developer-vol-2b-manual:

    SYSCALL invokes an OS system-call handler at privilege level 0. It does so by
    loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction
    following SYSCALL into RCX). (The WRMSR instruction ensures that the
    IA32_LSTAR MSR always contain a canonical address.)
    ...
    ...
    ...
    SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the
    IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded from the
    descriptors (in GDT or LDT) referenced by those selectors.
    Instead, the descriptor caches are loaded with fixed values. It is the
    responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced
    by those selector values correspond to the fixed values loaded into the descriptor
    caches; the SYSCALL instruction does not ensure this correspondence.

The kernel sets this up by writing the address of entry_SYSCALL_64, which is defined in the arch/x86/entry/entry_64.S assembly file and represents the entry point for the SYSCALL instruction, to the IA32_LSTAR model specific register:

    wrmsrl(MSR_LSTAR, entry_SYSCALL_64);

in the arch/x86/kernel/cpu/common.c source code file.

So, the syscall instruction invokes a handler of a given system call. But how does it know which handler to call? Actually it gets this information from the general purpose registers. As you can see in the system call table, each system call has a unique number. In our example, the first system call is write, which writes data to the given file. Let's look in the system call table and try to find the write system call. As we can see, the write system call has number 1. We pass the number of this system call through the rax register in our example. The next general purpose registers, %rdi, %rsi and %rdx, take the parameters of the write syscall. In our case, they are the file descriptor (1 is stdout in our case), the pointer to our string, and the size of the data. Yes, you heard right. Parameters for a system call. As I already wrote above, a system call is just a C function in kernel space. In our case the first system call is write. This system call is defined in the fs/read_write.c source code file and looks like:

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                    size_t, count)
    {
        ...
        ...
        ...
    }

Or in other words:

    ssize_t write(int fd, const void *buf, size_t nbytes);

Don't worry about the SYSCALL_DEFINE3 macro for now, we'll come back to it.

The second part of our example is the same, but we call another system call. In this case we call the exit system call. This system call takes only one parameter:

Return value

and handles the way our program exits. We can pass the name of our program to the strace util and we will see our system calls:

    $ strace ./test
    execve("./test", ["./test"], [/* 62 vars */]) = 0
    write(1, "Hello, world!\n", 14Hello, world!
    )                                             = 14
    _exit(0)                                      = ?
    +++ exited with 0 +++

In the first line of the strace output, we can see the execve system call that executes our program, and the second and third are the system calls that we have used in our program: write and exit. Note that we pass the parameters through the general purpose registers in our example. The order of the registers is not accidental. It is defined by the following agreement - x86-64 calling conventions. This and other agreements for the x86_64 architecture are explained in a special document - System V Application Binary Interface. PDF. In general, the arguments of a function are placed either in registers or pushed on the stack. The right order is:

    rdi; rsi; rdx; rcx; r8; r9

for the first six parameters of a function. If a function has more than six arguments, the other parameters will be placed on the stack. Note that the system call interface itself differs in one register: the fourth argument is passed in r10 instead of rcx, because the syscall instruction overwrites rcx with the return address.

We do not use system calls in our code directly, but our program uses them when we want to print something, check access to a file or just read from or write to it.

For example:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FILE *fp;
        char buff[255];

        fp = fopen("test.txt", "r");
        if (fp == NULL)
            return 1;
        if (fgets(buff, sizeof(buff), fp) != NULL)
            printf("%s\n", buff);
        fclose(fp);

        return 0;
    }

There are no fopen, fgets, printf and fclose system calls in the Linux kernel, but open, read, write and close instead. I think you know that these four functions fopen, fgets, printf and fclose are just functions that are defined in the C standard library. Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but use wrapper functions from the standard library instead. The main reason for this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility for performing system calls with the correct set of parameters and makes different checks before it calls the given system call. Let's compile our program with the following command:

    $ gcc test.c -o test

and look at it with the ltrace util:

    $ ltrace ./test
    __libc_start_main(["./test"] <unfinished ...>
    fopen("test.txt", "r")                  = 0x602010
    fgets("Hello World!\n", 255, 0x602010)  = 0x7ffd2745e700
    puts("Hello World!\n"Hello World!
    )                                       = 14
    fclose(0x602010)                        = 0
    +++ exited (status 0) +++

The ltrace util displays the set of userspace calls of a program. The fopen function opens the given text file, the fgets function reads the file content into the buff buffer, the puts function prints it to stdout and the fclose function closes the file referred to by the given file pointer. And as I already wrote, all of these functions call an appropriate system call. For example puts calls the write system call inside; we can see it if we add the -S option to the ltrace program:

    write@SYS(1, "Hello World!\n\n", 14) = 14

Yes, system calls are ubiquitous. Each program needs to open, write or read a file, make a network connection, allocate memory and do many other things that can be provided only by the kernel. The proc file system contains special files of the form /proc/pid/syscall that expose the system call number and argument registers for the system call currently being executed by the process. For example, pid 1, that is systemd for me:

    $ sudo cat /proc/1/comm
    systemd

    $ sudo cat /proc/1/syscall
    232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193

shows the system call with number 232, which is the epoll_wait system call that waits for an I/O event on an epoll file descriptor. Or for example the emacs editor where I'm writing this part:

    $ ps ax | grep emacs
    2093 ?        Sl     2:40 emacs

    $ sudo cat /proc/2093/comm
    emacs

    $ sudo cat /proc/2093/syscall
    270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c

shows the system call with the number 270, which is the sys_pselect6 system call that allows emacs to monitor multiple file descriptors.

Now we know a little about system calls: what they are and why we need them. So let's look at the write system call that our program used.

Implementation of write system call

Let's look at the implementation of this system call directly in the source code of the Linux kernel. As we already know, the write system call is defined in the fs/read_write.c source code file and looks like this:

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                    size_t, count)
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
            loff_t pos = file_pos_read(f.file);
            ret = vfs_write(f.file, buf, count, &pos);
            if (ret >= 0)
                file_pos_write(f.file, pos);
            fdput_pos(f);
        }

        return ret;
    }

First of all, the SYSCALL_DEFINE3 macro is defined in the include/linux/syscalls.h header file and expands to the definition of the sys_name(...) function. Let's look at this macro:

    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

    #define SYSCALL_DEFINEx(x, sname, ...) \
        SYSCALL_METADATA(sname, x, __VA_ARGS__) \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

As we can see, the SYSCALL_DEFINE3 macro takes the name parameter, which will represent the name of a system call, and a variadic number of parameters. This macro just expands to the SYSCALL_DEFINEx macro, which takes the number of parameters of the given system call and the _##name stub for the future name of the system call (you can read more about token concatenation with ## in the documentation of gcc). Next we can see the SYSCALL_DEFINEx macro. This macro expands to the two following macros:

SYSCALL_METADATA;
__SYSCALL_DEFINEx.

The implementation of the first macro, SYSCALL_METADATA, depends on the CONFIG_FTRACE_SYSCALLS kernel configuration option. As we can understand from the name of this option, it enables a tracer to catch the syscall entry and exit events. If this kernel configuration option is enabled, the SYSCALL_METADATA macro initializes the syscall_metadata structure that is defined in the include/trace/syscall.h header file and contains different useful fields such as the name of a system call, the number of a system call in the system call table, the number of parameters of a system call, the list of parameter types, etc.:

    #define SYSCALL_METADATA(sname, nb, ...) \
        ... \
        ... \
        ... \
        struct syscall_metadata __used \
          __syscall_meta_##sname = { \
            .name         = "sys"#sname, \
            .syscall_nr   = -1, \
            .nb_args      = nb, \
            .types        = nb ? types_##sname : NULL, \
            .args         = nb ? args_##sname : NULL, \
            .enter_event  = &event_enter_##sname, \
            .exit_event   = &event_exit_##sname, \
            .enter_fields = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
        }; \
        static struct syscall_metadata __used \
          __attribute__((section("__syscalls_metadata"))) \
         *__p_syscall_meta_##sname = &__syscall_meta_##sname;

If the CONFIG_FTRACE_SYSCALLS kernel option is not enabled during kernel configuration, the SYSCALL_METADATA macro expands to an empty string:

    #define SYSCALL_METADATA(sname, nb, ...)

The second macro, __SYSCALL_DEFINEx, expands to the following five declarations and definitions:

    #define __SYSCALL_DEFINEx(x, name, ...) \
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
            __attribute__((alias(__stringify(SyS##name)))); \
        \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \
        \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
        \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
        { \
            long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
            __MAP(x,__SC_TEST,__VA_ARGS__); \
            __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
            return ret; \
        } \
        \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))

The first, sys##name, is the declaration of the syscall handler function with the given name - sys_system_call_name. The __SC_DECL macro takes the __VA_ARGS__ and combines each parameter type with its parameter name, because the types and names are passed to the macro as separate arguments. The __MAP macro applies the __SC_DECL macro to the __VA_ARGS__ arguments. The other functions that are generated by the __SYSCALL_DEFINEx macro are needed to protect against CVE-2009-0029 and we will not dive into the details of this here. Ok, as a result of the SYSCALL_DEFINE3 macro, we will have:

    asmlinkage long sys_write(unsigned int fd, const char __user *buf, size_t count);

Now we know a little about the system call's definition and we can go back to the implementation of the write system call. Let's look at the implementation of this system call again:

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                    size_t, count)
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
            loff_t pos = file_pos_read(f.file);
            ret = vfs_write(f.file, buf, count, &pos);
            if (ret >= 0)
                file_pos_write(f.file, pos);
            fdput_pos(f);
        }

        return ret;
    }

As we already know and can see from the code, it takes three arguments:

fd - file descriptor;
buf - buffer to write;
count - length of buffer to write.

and writes data from a buffer declared by the user to a given device or file. Note that the second parameter, buf, is defined with the __user attribute. The main purpose of this attribute is checking the Linux kernel code with the sparse util. It is defined in the include/linux/compiler.h header file and depends on the __CHECKER__ definition in the Linux kernel. That's all about the useful meta-information related to our sys_write system call; let's try to understand how this system call is implemented. As we can see, it starts with the definition of the f variable, which has the fd structure type that represents a file descriptor in the Linux kernel, and we store the result of the call to the fdget_pos function in it. The fdget_pos function is defined in the include/linux/file.h header file and just wraps the call of the __to_fd function:

    static inline struct fd fdget_pos(int fd)
    {
        return __to_fd(__fdget_pos(fd));
    }

The main purpose of fdget_pos is to convert the given file descriptor, which is just a number, to the fd structure. Through a long chain of function calls, the fdget_pos function gets the file descriptor table of the current process, current->files, and tries to find the corresponding file descriptor number there. Once we have the fd structure for the given file descriptor number, we check it and return if it does not exist. We get the current position in the file with the call of the file_pos_read function, which just returns the f_pos field of our file:

    static inline loff_t file_pos_read(struct file *file)
    {
        return file->f_pos;
    }

and call the vfs_write function. The vfs_write function is defined in the fs/read_write.c source code file and does the work for us - it writes the given buffer to the given file starting from the given position. We will not dive into the details of the vfs_write function, because this function is only weakly related to the system call concept and is mostly about the Virtual file system concept, which we will see in another chapter. After vfs_write has finished its work, we check the result and, if it finished successfully, we change the position in the file with the file_pos_write function:

    if (ret >= 0)
        file_pos_write(f.file, pos);

which just updates f_pos with the given position in the given file:

    static inline void file_pos_write(struct file *file, loff_t pos)
    {
        file->f_pos = pos;
    }

At the end of our write system call handler, we can see the call of the following function:

    fdput_pos(f);

which unlocks the f_pos_lock mutex that protects the file position during concurrent writes from threads that share a file descriptor.

That's all.

We have seen the partial implementation of one system call provided by the Linux kernel. Of course we have missed some parts in the implementation of the write system call because, as I mentioned above, we will see only system call related stuff in this chapter and will not see other stuff related to other subsystems, such as the Virtual file system.

Conclusion

This concludes the first part covering system call concepts in the Linux kernel. We have covered the theory of system calls so far and in the next part we will continue to dive into this topic, touching Linux kernel code related to system calls.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

system call
vdso
vsyscall
general purpose registers
socket
C programming language
x86
x86_64
x86-64 calling conventions
System V Application Binary Interface. PDF
GCC
Intel manual. PDF
system call table
GCC macro documentation
file descriptor
stdout
strace
standard library
wrapper functions
ltrace
sparse
proc file system
Virtual file system
systemd
epoll
Previous chapter

System calls in the Linux kernel. Part 2.

How does the Linux kernel handle a system call

The previous part was the first part of the chapter that describes the system call concepts in the Linux kernel. In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the write system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving on to the Linux kernel code.

A user application usually does not make system calls directly. We did not write the Hello world! program like:

    int main(int argc, char **argv)
    {
        ...
        ...
        ...
        sys_write(fd1, buf, strlen(buf));
        ...
        ...
    }

We can use something similar with the help of the C standard library and it will look something like this:

    #include <unistd.h>

    int main(int argc, char **argv)
    {
        ...
        ...
        ...
        write(fd1, buf, strlen(buf));
        ...
        ...
    }

But anyway, write is not a direct system call and not a kernel function. An application must fill the general purpose registers with the correct values in the correct order and use the syscall instruction to make the actual system call. In this part we will look at what occurs in the Linux kernel when the syscall instruction is met by the processor.

Initialization of the system calls table

From the previous part we know that the system call concept is very similar to an interrupt. Historically, system calls on x86 were implemented as software interrupts, and the syscall instruction behaves in a similar way: when the processor executes a syscall instruction from a user application, control is transferred to a handler inside the kernel. As we know, all exception handlers (or in other words kernel C functions that react to an exception) are placed in the kernel code, and the same is true of system call handlers. But how does the Linux kernel search for the address of the necessary system call handler for the related system call? The Linux kernel contains a special table called the system call table. The system call table is represented by the sys_call_table array in the Linux kernel, which is defined in the arch/x86/entry/syscall_64.c source code file. Let's look at its implementation:

    asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        [0 ... __NR_syscall_max] = &sys_ni_syscall,
    #include <asm/syscalls_64.h>
    };

As we can see, the sys_call_table is an array of __NR_syscall_max + 1 size, where the __NR_syscall_max macro represents the highest system call number for the given architecture. This book is about the x86_64 architecture, so in our case __NR_syscall_max is 322, and this is the correct number at the time of writing (the current Linux kernel version is 4.2.0-rc8+). We can see this macro in the header file generated by Kbuild during kernel compilation - include/generated/asm-offsets.h:

    #define __NR_syscall_max 322

There will be the same number of system calls in the arch/x86/entry/syscalls/syscall_64.tbl for the x86_64. There are two important topics here: the type of the sys_call_table array, and the initialization of elements in this array. First of all, the type. The sys_call_ptr_t represents a pointer to a system call handler stored in the table. It is defined as a typedef for a function pointer that returns nothing and does not take arguments:

    typedef void (*sys_call_ptr_t)(void);

The second thing is the initialization of the sys_call_table array. As we can see in the code above, all elements of our array that contain pointers to the system call handlers point to sys_ni_syscall. The sys_ni_syscall function represents not-implemented system calls. To start with, all elements of the sys_call_table array point to the not-implemented system call. This is the correct initial behaviour, because we only initialize the storage of the pointers to the system call handlers; it is populated later on. The implementation of sys_ni_syscall is pretty easy, it just returns -errno or -ENOSYS in our case:

    asmlinkage long sys_ni_syscall(void)
    {
        return -ENOSYS;
    }

The -ENOSYS error tells us that:

    ENOSYS          Function not implemented (POSIX.1)

Also a note on ... in the initialization of the sys_call_table. We can do it with a GCC compiler extension to Designated Initializers that allows us to initialize a whole range of elements at once. Designated initializers also allow us to initialize elements in non-fixed order. As you can see, we include the asm/syscalls_64.h header at the end of the array. This header file is generated by the special script at arch/x86/entry/syscalls/syscalltbl.sh from the syscall table. The asm/syscalls_64.h contains definitions of the following macros:

    __SYSCALL_COMMON(0, sys_read, sys_read)
    __SYSCALL_COMMON(1, sys_write, sys_write)
    __SYSCALL_COMMON(2, sys_open, sys_open)
    __SYSCALL_COMMON(3, sys_close, sys_close)
    __SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
    ...
    ...
    ...

The __SYSCALL_COMMON macro is defined in the same source code file and expands to the __SYSCALL_64 macro, which expands to an initializer for one element of the array:

    #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
    #define __SYSCALL_64(nr, sym, compat) [nr] = sym,

So, after this, our sys_call_table takes the following form:

    asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        [0 ... __NR_syscall_max] = &sys_ni_syscall,
        [0] = sys_read,
        [1] = sys_write,
        [2] = sys_open,
        ...
        ...
        ...
    };

After this all elements that point to the non-implemented system calls will contain the address of the sys_ni_syscall function that just returns -ENOSYS as we saw above, and other elements will point to the sys_syscall_name functions.

At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a sys_syscall_name function immediately after it is instructed to handle a system call from a userspace application. Remember the chapter about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparations like saving userspace registers, switching to a new stack and many more tasks before it calls an interrupt handler. It is the same situation with system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel starts these preparations, the entry point of a system call must be initialized, and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.

Initialization of the system call entry

When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual - 64-ia-32-architectures-software-developer-vol-2b-manual:

    SYSCALL invokes an OS system-call handler at privilege level 0.
    It does so by loading RIP from the IA32_LSTAR MSR

it means that we need to put the system call entry into the IA32_LSTAR model specific register. This operation takes place during the Linux kernel initialization process. If you have read the fourth part of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the trap_init function during the initialization process. This function is defined in the arch/x86/kernel/traps.c source code file and executes the initialization of the non-early exception handlers like divide error, coprocessor error, etc. Besides the initialization of the non-early exception handlers, this function calls the cpu_init function from the arch/x86/kernel/cpu/common.c source code file which, besides the initialization of per-cpu state, calls the syscall_init function from the same source code file.

This function performs the initialization of the system call entry point. Let's look at the implementation of this function. It does not take parameters and first of all it fills two model specific registers:

    wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
    wrmsrl(MSR_LSTAR, entry_SYSCALL_64);

The first model specific register, MSR_STAR, contains the user code segment selector in bits 63:48. These bits will be loaded into the CS and SS segment registers by the sysret instruction, which provides functionality to return from a system call to user code with the related privilege. MSR_STAR also contains the kernel code segment selector in bits 47:32, which will be used as the base selector for the CS and SS segment registers when userspace applications execute a system call. In the second line of code we fill the MSR_LSTAR register with the entry_SYSCALL_64 symbol that represents the system call entry. The entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and contains code related to the preparation performed before a system call handler is executed (I already wrote about these preparations, read above). We will not consider the entry_SYSCALL_64 now, but will return to it later in this chapter.

After we have set the entry point for system calls, we need to set the following model specific registers:

MSR_CSTAR - target rip for the compatibility mode callers;
MSR_IA32_SYSENTER_CS - target cs for the sysenter instruction;
MSR_IA32_SYSENTER_ESP - target esp for the sysenter instruction;
MSR_IA32_SYSENTER_EIP - target eip for the sysenter instruction.

The values of these model specific registers depend on the CONFIG_IA32_EMULATION kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In the first case, if the CONFIG_IA32_EMULATION kernel configuration option is enabled, we fill these model specific registers with the entry point for system calls in compatibility mode:

    wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);

and with the kernel code segment, put zero into the stack pointer and write the address of the entry_SYSENTER_compat symbol to the instruction pointer:

    wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
    wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
    wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);

Otherwise, if the CONFIG_IA32_EMULATION kernel configuration option is disabled, we write the ignore_sysret symbol to the MSR_CSTAR:

    wrmsrl(MSR_CSTAR, ignore_sysret);

which is defined in the arch/x86/entry/entry_64.S assembly file and just returns the -ENOSYS error code:

    ENTRY(ignore_sysret)
        mov $-ENOSYS, %eax
        sysret
    END(ignore_sysret)

Now we need to fill the MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP model specific registers as we did in the previous code when the CONFIG_IA32_EMULATION kernel configuration option was enabled. In this case (when the CONFIG_IA32_EMULATION configuration option is not set) we fill the MSR_IA32_SYSENTER_ESP and the MSR_IA32_SYSENTER_EIP with zero and put the invalid segment of the Global Descriptor Table into the MSR_IA32_SYSENTER_CS model specific register:

    wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
    wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
    wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);

You can read more about the Global Descriptor Table in the second part of the chapter that describes the booting process of the Linux kernel.

At the end of the syscall_init function, we just mask flags in the flags register by writing the set of flags to the MSR_SYSCALL_MASK model specific register:

    wrmsrl(MSR_SYSCALL_MASK,
           X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
           X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);

These flags will be cleared in the flags register whenever the syscall instruction is executed. That's all, it is the end of the syscall_init function and it means that the system call entry is ready to work. Now we can see what will occur when a user application executes the syscall instruction.

Preparation before system call handler will be called

As I already wrote, before a system call or an interrupt handler is called by the Linux kernel we need to do some preparations. The idtentry macro performs the preparations required before an exception handler is executed, the interrupt macro performs the preparations required before an interrupt handler is called and the entry_SYSCALL_64 will do the preparations required before a system call handler is executed.

The entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and starts from the following macro:

    SWAPGS_UNSAFE_STACK

This macro is defined in the arch/x86/include/asm/irqflags.h header file and expands to the swapgs instruction:

    #define SWAPGS_UNSAFE_STACK swapgs

which exchanges the current GS base register value with the value contained in the MSR_KERNEL_GS_BASE model specific register. In other words, after this instruction GS refers to the kernel's per-cpu area, which the per-cpu accesses below rely on. After this we save the old stack pointer in the rsp_scratch per-cpu variable and set up the stack pointer to point to the top of the stack for the current processor:

    movq %rsp, PER_CPU_VAR(rsp_scratch)
    movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

In the next step we push the stack segment and the old stack pointer onto the stack:

    pushq $__USER_DS
    pushq PER_CPU_VAR(rsp_scratch)

After this we enable interrupts, because interrupts are off on entry, and save the general purpose registers (besides bp, bx and r12 to r15), the flags, -ENOSYS for the non-implemented system call and the code segment register on the stack:

    ENABLE_INTERRUPTS(CLBR_NONE)

    pushq %r11              /* flags */
    pushq $__USER_CS        /* user code segment */
    pushq %rcx              /* return address */
    pushq %rax              /* system call number */
    pushq %rdi
    pushq %rsi
    pushq %rdx
    pushq %rcx
    pushq $-ENOSYS          /* default return value */
    pushq %r8
    pushq %r9
    pushq %r10
    pushq %r11
    sub $(6*8), %rsp        /* room for bp, bx and r12-r15, which are not saved */

When a system call occurs from the user's application, the general purpose registers have the following state:

rax - contains the system call number;
rcx - contains the return address to userspace;
r11 - contains the flags register;
rdi - contains the first argument of a system call handler;
rsi - contains the second argument of a system call handler;
rdx - contains the third argument of a system call handler;
r10 - contains the fourth argument of a system call handler;
r8 - contains the fifth argument of a system call handler;
r9 - contains the sixth argument of a system call handler.

The other general purpose registers (rbp, rbx and r12 to r15) are callee-preserved in the C ABI. So we push the flags register on the top of the stack, then the user code segment, the return address to userspace, the system call number, the first three arguments (rdi, rsi and rdx), rcx, the -ENOSYS error code for the non-implemented system call, and then r8, r9, r10 and r11.

In the next step we check the _TIF_WORK_SYSCALL_ENTRY in the current thread_info:

    testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
    jnz tracesys

The _TIF_WORK_SYSCALL_ENTRY macro is defined in the arch/x86/include/asm/thread_info.h header file and provides the set of thread information flags that are related to system call tracing:

    #define _TIF_WORK_SYSCALL_ENTRY \
        (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
         _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
         _TIF_NOHZ)

We will not consider debugging/tracing related stuff in this chapter, but will see it in a separate chapter that will be devoted to the debugging and tracing techniques in the Linux kernel. After the tracesys label, the next label is entry_SYSCALL_64_fastpath. In the entry_SYSCALL_64_fastpath we check the __SYSCALL_MASK that is defined in the arch/x86/include/asm/unistd.h header file:

    #ifdef CONFIG_X86_X32_ABI
    # define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
    #else
    # define __SYSCALL_MASK (~0)
    #endif

where the __X32_SYSCALL_BIT is

    #define __X32_SYSCALL_BIT 0x40000000

As we can see, the __SYSCALL_MASK depends on the CONFIG_X86_X32_ABI kernel configuration option and represents the mask for the x32 ABI (32-bit pointers in 64-bit mode) in the 64-bit kernel.

So we check the value of __SYSCALL_MASK: if CONFIG_X86_X32_ABI is disabled, we compare the value of the rax register to the maximum syscall number (__NR_syscall_max); alternatively, if CONFIG_X86_X32_ABI is enabled, we mask the eax register with __SYSCALL_MASK (which clears the __X32_SYSCALL_BIT) and do the same comparison:

    #if __SYSCALL_MASK == ~0
        cmpq $__NR_syscall_max, %rax
    #else
        andl $__SYSCALL_MASK, %eax
        cmpl $__NR_syscall_max, %eax
    #endif

After this we check the result of the last comparison with the ja instruction, which jumps if both the CF and ZF flags are zero, that is, if the system call number is above the maximum:

    ja 1f

and if we have a correct system call number, we move the fourth argument from r10 to rcx to stay compliant with the x86_64 C ABI and execute the call instruction with the address of a system call handler:

    movq %r10, %rcx
    call *sys_call_table(, %rax, 8)

Note, the sys_call_table is the array that we saw above in this part. As we already know, the rax general purpose register contains the number of a system call and each element of the sys_call_table is 8 bytes. So we are using the *sys_call_table(, %rax, 8) notation to find the correct offset in the sys_call_table array for the given system call handler.

That's all. We did all the required preparations and the system call handler was called for the given system call, for example sys_read, sys_write or another system call handler that is defined with the SYSCALL_DEFINE[N] macro in the Linux kernel code.

Exit from a system call

After a system call handler finishes its work, we return to arch/x86/entry/entry_64.S, right after the point where we called the system call handler:

    call *sys_call_table(, %rax, 8)

The next step after we've returned from a system call handler is to put the return value of the handler onto the stack. We know that a system call returns the result to the user program in the general purpose rax register, so we move its value onto the stack after the system call handler has finished its work:

    movq %rax, RAX(%rsp)

in the RAX slot.

After this we can see the call of the LOCKDEP_SYS_EXIT macro from arch/x86/include/asm/irqflags.h:

    LOCKDEP_SYS_EXIT

TheimplementationofthismacrodependsontheCONFIG_DEBUG_LOCK_ALLOCkernelconfigurationoptionthatallowsustodebuglocksonexitfromasystemcall.Andagain,wewillnotconsideritinthischapter,butwillreturntoitinaseparateone.Intheendoftheentry_SYSCALL_64functionwerestoreallgeneralpurposeregistersbesidesrxcandr11,becausethercxregistermustcontainthereturnaddresstotheapplicationthatcalledsystemcallandther11registercontainstheoldflagsregister.Afterallgeneralpurposeregistersarerestored,wefillrcxwiththereturnaddress,r11registerwiththeflagsandrspwiththeoldstackpointer:

RESTORE_C_REGS_EXCEPT_RCX_R11

movqRIP(%rsp),%rcx

movqEFLAGS(%rsp),%r11

movqRSP(%rsp),%rsp

USERGS_SYSRET64

IntheendwejustcalltheUSERGS_SYSRET64macrothatexpandstothecalloftheswapgsinstructionwhichexchangesagaintheuserGSandkernelGSandthesysretqinstructionwhichexecutesonexitfromasystemcallhandler:

#defineUSERGS_SYSRET64\

swapgs;\

sysretq;

Now we know what occurs when a user application calls a system call. The full path of this process is as follows:

A user application contains code that fills general purpose registers with values (the system call number and the arguments of this system call);
The processor switches from user mode to kernel mode and starts execution of the system call entry, entry_SYSCALL_64;
entry_SYSCALL_64 switches to the kernel stack and saves some general purpose registers, the old stack and code segment, flags and so on onto the stack;
entry_SYSCALL_64 checks the system call number in the rax register, looks up the system call handler in the sys_call_table and calls it, if the number of the system call is correct;
If the system call number is not correct, it jumps to the exit from the system call;
After the system call handler finishes its work, it restores general purpose registers, the old stack, flags and the return address and exits from entry_SYSCALL_64 with the sysretq instruction.
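The first step of this path can be observed from user space. As a minimal sketch, assuming a Linux system with glibc, the generic syscall() wrapper takes the system call number explicitly and issues the system call for us; demo_raw_getpid is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Ask the kernel for our PID through the generic syscall(2) wrapper. */
static long demo_raw_getpid(void)
{
    long pid = syscall(SYS_getpid);   /* SYS_getpid is the system call number */

    assert(pid > 0);
    return pid;
}
```

The value returned this way should match what the ordinary getpid() library function reports, since both end up in the same system call handler.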

That's all.

Conclusion

This is the end of the second part about the system calls concept in the Linux kernel. In the previous part we saw the theory of this concept from the user application's point of view. In this part we continued to dive into the system call concept and saw what the Linux kernel does when a system call occurs.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

system call

System calls in the Linux kernel. Part 3.

vsyscalls and vDSO

This is the third part of the chapter that describes system calls in the Linux kernel. In the previous part we saw the preparations made after a system call is issued by a userspace application and the process of handling a system call. In this part we will look at two concepts that are very close to the system call concept; they are called vsyscall and vdso.

We already know what a system call is. It is a special routine in the Linux kernel which a userspace application asks to do privileged tasks, like reading or writing a file, opening a socket and so on. As you may know, invoking a system call is an expensive operation, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping back into userspace after the system call handler finishes its work. These two mechanisms, vsyscall and vdso, are designed to speed up this process for certain system calls, and in this part we will try to understand how these mechanisms work.

Introduction to vsyscalls

The vsyscall or virtual system call is the first and oldest mechanism in the Linux kernel that is designed to accelerate execution of certain system calls. The principle of the vsyscall concept is simple. The Linux kernel maps into user space a page that contains some variables and the implementation of some system calls. We can find information about this memory space in the Linux kernel documentation for x86_64:

ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls

or:

~$ sudo cat /proc/1/maps | grep vsyscall
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

After this, these system calls will be executed in user space, which means that there will be no context switching. Mapping of the vsyscall page occurs in the map_vsyscall function that is defined in the arch/x86/entry/vsyscall/vsyscall_64.c source code file. This function is called during Linux kernel initialization in the setup_arch function that is defined in the arch/x86/kernel/setup.c source code file (we saw this function in the fifth part of the Linux kernel initialization process chapter).

Note that the implementation of the map_vsyscall function depends on the CONFIG_X86_VSYSCALL_EMULATION kernel configuration option:

#ifdef CONFIG_X86_VSYSCALL_EMULATION
extern void map_vsyscall(void);
#else
static inline void map_vsyscall(void) {}
#endif

As we can read in the help text, the CONFIG_X86_VSYSCALL_EMULATION configuration option enables vsyscall emulation. Why emulate vsyscall? Actually, the vsyscall is a legacy ABI because of security concerns: virtual system calls have fixed addresses, meaning that the vsyscall page is at the same location every time. The location of this page is determined in the map_vsyscall function. Let's look at the implementation of this function:

void __init map_vsyscall(void)
{
    extern char __vsyscall_page;
    unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
    ...
    ...
    ...
}

As we can see, at the beginning of the map_vsyscall function we get the physical address of the vsyscall page with the __pa_symbol macro (we already saw the implementation of this macro in the fourth part of the Linux kernel initialization process). The __vsyscall_page symbol is defined in the arch/x86/entry/vsyscall/vsyscall_emu_64.S assembly source code file and has the following virtual address:

ffffffff81881000 D __vsyscall_page

in the .data..page_aligned, aw section and contains calls of the three following system calls:

gettimeofday;
time;
getcpu.

Or:

__vsyscall_page:
    mov $__NR_gettimeofday, %rax
    syscall
    ret

    .balign 1024, 0xcc
    mov $__NR_time, %rax
    syscall
    ret

    .balign 1024, 0xcc
    mov $__NR_getcpu, %rax
    syscall
    ret

Let's go back to the implementation of the map_vsyscall function and return to the implementation of the __vsyscall_page later. After we get the physical address of the __vsyscall_page, we check the value of the vsyscall_mode variable and set the fix-mapped address for the vsyscall page with the __set_fixmap macro:

if (vsyscall_mode != NONE)
    __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
                 vsyscall_mode == NATIVE
                     ? PAGE_KERNEL_VSYSCALL
                     : PAGE_KERNEL_VVAR);

The __set_fixmap macro takes three arguments. The first is an index of the fixed_addresses enum. In our case VSYSCALL_PAGE is the first element of the fixed_addresses enum for the x86_64 architecture:

enum fixed_addresses {
...
...
...
#ifdef CONFIG_X86_VSYSCALL_EMULATION
    VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
...
...
...

It equals 511. The second argument is the physical address of the page that has to be mapped and the third argument is the flags of the page. Note that the flags of the VSYSCALL_PAGE depend on the vsyscall_mode variable. They will be PAGE_KERNEL_VSYSCALL if the vsyscall_mode variable is NATIVE and PAGE_KERNEL_VVAR otherwise. Both macros (PAGE_KERNEL_VSYSCALL and PAGE_KERNEL_VVAR) expand to the following flags:

#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)

that represent access rights to the vsyscall page. Both include the same _PAGE_USER flag, which means that the page can be accessed by a user-mode process running at lower privilege levels. The other flag depends on the value of the vsyscall_mode variable. The first set of flags (__PAGE_KERNEL_VSYSCALL) is used when vsyscall_mode is NATIVE. This means virtual system calls will be native syscall instructions. Otherwise, when the vsyscall_mode variable is emulate, the vsyscall page gets PAGE_KERNEL_VVAR. In this case virtual system calls are turned into traps and are emulated reasonably. The vsyscall_mode variable gets its value in the vsyscall_setup function:

static int __init vsyscall_setup(char *str)
{
    if (str) {
        if (!strcmp("emulate", str))
            vsyscall_mode = EMULATE;
        else if (!strcmp("native", str))
            vsyscall_mode = NATIVE;
        else if (!strcmp("none", str))
            vsyscall_mode = NONE;
        else
            return -EINVAL;

        return 0;
    }

    return -EINVAL;
}

That will be called during early kernel parameters parsing:

early_param("vsyscall", vsyscall_setup);

You can read more about the early_param macro in the sixth part of the chapter that describes the process of the initialization of the Linux kernel.
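To experiment with this parsing logic outside the kernel, the same decision table can be written as an ordinary user-space function. This is only a sketch: the enum and function names are hypothetical (prefixed with demo_), and the result is written through a pointer instead of the kernel's global vsyscall_mode variable.

```c
#include <assert.h>
#include <string.h>
#include <errno.h>

enum demo_vsyscall_mode { DEMO_EMULATE, DEMO_NATIVE, DEMO_NONE };

/* Same decision table as vsyscall_setup, but writing through a pointer. */
static int demo_vsyscall_setup(const char *str, enum demo_vsyscall_mode *mode)
{
    assert(mode != NULL);
    if (!str)
        return -EINVAL;

    if (!strcmp("emulate", str))
        *mode = DEMO_EMULATE;
    else if (!strcmp("native", str))
        *mode = DEMO_NATIVE;
    else if (!strcmp("none", str))
        *mode = DEMO_NONE;
    else
        return -EINVAL;   /* unknown value: the mode is left untouched */

    return 0;
}
```

An unknown string such as "bogus" returns -EINVAL and leaves the previously selected mode unchanged, which mirrors how the kernel function behaves for an invalid vsyscall= value.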

At the end of the map_vsyscall function we just check that the virtual address of the vsyscall page is equal to the value of VSYSCALL_ADDR with the BUILD_BUG_ON macro:

BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
             (unsigned long)VSYSCALL_ADDR);
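The arithmetic behind the index 511 and this build-time check can be reproduced in user space. The sketch below rests on assumptions that are not shown in the text: a 4096-byte page (PAGE_SHIFT of 12), a PMD_SHIFT of 21, VSYSCALL_ADDR equal to 0xffffffffff600000, and FIXADDR_TOP defined as VSYSCALL_ADDR plus one page rounded up to a 2 MB boundary, minus one page. All demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed from the x86_64 headers of that kernel era. */
#define DEMO_PAGE_SHIFT    12
#define DEMO_PAGE_SIZE     (UINT64_C(1) << DEMO_PAGE_SHIFT)
#define DEMO_PMD_SHIFT     21
#define DEMO_VSYSCALL_ADDR UINT64_C(0xffffffffff600000)

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t demo_round_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Assumed definition of FIXADDR_TOP for x86_64. */
static uint64_t demo_fixaddr_top(void)
{
    return demo_round_up(DEMO_VSYSCALL_ADDR + DEMO_PAGE_SIZE,
                         UINT64_C(1) << DEMO_PMD_SHIFT) - DEMO_PAGE_SIZE;
}

/* (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT */
static uint64_t demo_vsyscall_page_index(void)
{
    return (demo_fixaddr_top() - DEMO_VSYSCALL_ADDR) >> DEMO_PAGE_SHIFT;
}

/* Inverse direction: fixmap index back to a virtual address. */
static uint64_t demo_fix_to_virt(uint64_t idx)
{
    return demo_fixaddr_top() - (idx << DEMO_PAGE_SHIFT);
}
```

Under these assumptions the top of the fixmap area is 0xffffffffff7ff000, the index works out to 0x1ff (511), and converting that index back gives the fixed vsyscall address again, which is the property the BUILD_BUG_ON check enforces.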

That's all. The vsyscall page is set up. The result of all the above is the following: if we pass the vsyscall=native parameter on the kernel command line, virtual system calls will be handled as native syscall instructions in arch/x86/entry/vsyscall/vsyscall_emu_64.S. The glibc knows the addresses of the virtual system call handlers. Note that virtual system call handlers are aligned by 1024 (or 0x400) bytes:

__vsyscall_page:
    mov $__NR_gettimeofday, %rax
    syscall
    ret

    .balign 1024, 0xcc
    mov $__NR_time, %rax
    syscall
    ret

    .balign 1024, 0xcc
    mov $__NR_getcpu, %rax
    syscall
    ret

And the start address of the vsyscall page is ffffffffff600000 every time. So, the glibc knows the addresses of all virtual system call handlers. You can find the definition of these addresses in the glibc source code:

#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
#define VSYSCALL_ADDR_vtime 0xffffffffff600400
#define VSYSCALL_ADDR_vgetcpu 0xffffffffff600800

All virtual system call requests will land at the __vsyscall_page + VSYSCALL_ADDR_vsyscall_name offset, put the number of the virtual system call into the rax general purpose register, and the syscall instruction, native for x86_64, will be executed.
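The relation between the 1024-byte alignment and the three addresses above is simple arithmetic. The following sketch uses hypothetical demo_ names and is not the kernel's or glibc's code; it computes the address of a slot from its number and maps an address back to its slot number, rejecting anything that is not the start of one of the three slots.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_VSYSCALL_BASE  UINT64_C(0xffffffffff600000)
#define DEMO_VSYSCALL_SLOT  UINT64_C(0x400)   /* 1024 bytes, from .balign */
#define DEMO_VSYSCALL_COUNT 3

/* Address of the n-th stub (0 = gettimeofday, 1 = time, 2 = getcpu). */
static uint64_t demo_vsyscall_addr(unsigned int nr)
{
    assert(nr < DEMO_VSYSCALL_COUNT);
    return DEMO_VSYSCALL_BASE + nr * DEMO_VSYSCALL_SLOT;
}

/* Inverse mapping: slot number for an address, or -1 if not a slot start. */
static int demo_addr_to_vsyscall_nr(uint64_t addr)
{
    uint64_t off;

    if (addr < DEMO_VSYSCALL_BASE)
        return -1;
    off = addr - DEMO_VSYSCALL_BASE;
    if (off % DEMO_VSYSCALL_SLOT != 0)
        return -1;                      /* misaligned address */
    if (off / DEMO_VSYSCALL_SLOT >= DEMO_VSYSCALL_COUNT)
        return -1;                      /* beyond the last stub */
    return (int)(off / DEMO_VSYSCALL_SLOT);
}
```

Slot 1 lands on 0xffffffffff600400 and slot 2 on 0xffffffffff600800, matching the glibc definitions, while an address such as 0xffffffffff600401 is treated as misaligned.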

In the second case, if we pass the vsyscall=emulate parameter on the kernel command line, an attempt to execute a virtual system call handler will cause a page fault exception. Remember that in this mode the vsyscall page has __PAGE_KERNEL_VVAR access rights, which forbid execution. The do_page_fault function is the #PF or page fault handler. It tries to understand the reason for the last page fault. One of the reasons can be the situation when a virtual system call is called and the vsyscall mode is emulate. In this case the vsyscall will be handled by the emulate_vsyscall function that is defined in the arch/x86/entry/vsyscall/vsyscall_64.c source code file.

The emulate_vsyscall function gets the number of the virtual system call, checks it and, if it is invalid, prints an error and sends a segmentation fault signal:

...
...
...
vsyscall_nr = addr_to_vsyscall_nr(address);
if (vsyscall_nr < 0) {
    warn_bad_vsyscall(KERN_WARNING, regs, "misaligned vsyscall...);
    goto sigsegv;
}
...
...
...
sigsegv:
    force_sig(SIGSEGV, current);
    return true;

After it has checked the number of the virtual system call, it does some further checks, for example for access_ok violations, and executes the system call function depending on the number of the virtual system call:

switch (vsyscall_nr) {
    case 0:
        ret = sys_gettimeofday(
            (struct timeval __user *)regs->di,
            (struct timezone __user *)regs->si);
        break;
    ...
    ...
    ...
}

At the end we put the result of sys_gettimeofday or another virtual system call handler into the ax general purpose register, as we did with normal system calls, restore the instruction pointer register and add 8 bytes to the stack pointer register. This operation emulates the ret instruction.

    regs->ax = ret;

do_ret:
    regs->ip = caller;
    regs->sp += 8;
    return true;
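What "emulating ret" means can be shown with a tiny user-space model. This is only a sketch with hypothetical names: demo_regs stands in for the kernel's register structure, and the stack is an ordinary array indexed by a byte offset. It also assumes that caller is the 8-byte return address found at the top of the user stack.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, trimmed-down register set; the kernel uses struct pt_regs. */
struct demo_regs {
    uint64_t ip;   /* instruction pointer */
    uint64_t sp;   /* stack pointer, here a byte offset into the stack array */
    uint64_t ax;   /* return value register */
};

/* Emulate "ret": pop the 8-byte return address and continue from it. */
static void demo_emulate_ret(struct demo_regs *regs, const uint64_t *stack,
                             size_t stack_words, long result)
{
    size_t slot = regs->sp / sizeof(uint64_t);

    assert(regs->sp % sizeof(uint64_t) == 0 && slot < stack_words);
    regs->ax = (uint64_t)result;     /* result of the emulated call */
    regs->ip = stack[slot];          /* caller's return address */
    regs->sp += sizeof(uint64_t);    /* the "add 8 bytes" step */
}
```

After the call, the instruction pointer holds the address that was on top of the stack and the stack pointer has moved up by 8 bytes, exactly as if the ret instruction in the vsyscall page had been executed.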

That's all. Now let's look at the modern concept, vDSO.

Introduction to vDSO

As I already wrote above, vsyscall is an obsolete concept and has been replaced by the vDSO or virtual dynamic shared object. The main difference between the vsyscall and vDSO mechanisms is that vDSO maps memory pages into each process in a shared object form, but vsyscall is static in memory and has the same address every time. For the x86_64 architecture it is called linux-vdso.so.1. All userspace applications are linked with this shared library via glibc. For example:

~$ ldd /bin/uname
    linux-vdso.so.1 (0x00007ffe014b7000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
    /lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)

Or:

~$ sudo cat /proc/1/maps | grep vdso
7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0 [vdso]

Here we can see that the uname util was linked with three libraries:

linux-vdso.so.1;
libc.so.6;
ld-linux-x86-64.so.2.

The first provides vDSO functionality, the second is the C standard library and the third is the program interpreter (you can read more about this in the part that describes linkers). So, the vDSO solves the limitations of the vsyscall. The implementation of the vDSO is similar to vsyscall.

Initialization of the vDSO occurs in the init_vdso function that is defined in the arch/x86/entry/vdso/vma.c source code file. This function starts by initializing the vDSO image for 64-bit and, depending on the CONFIG_X86_X32_ABI kernel configuration option, the image for the x32 ABI:

static int __init init_vdso(void)
{
    init_vdso_image(&vdso_image_64);

#ifdef CONFIG_X86_X32_ABI
    init_vdso_image(&vdso_image_x32);
#endif

Both function calls initialize the vdso_image structure. This structure is defined in two generated source code files: arch/x86/entry/vdso/vdso-image-64.c and arch/x86/entry/vdso/vdso-image-x32.c. These source code files are generated by the vdso2c program from different source code files and represent different approaches to calling a system call, like int 0x80, sysenter and so on. The full set of images depends on the kernel configuration.

For example, for the x86_64 Linux kernel it will contain vdso_image_64:

#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
#endif

But for the x32 ABI, vdso_image_x32:

#ifdef CONFIG_X86_X32
extern const struct vdso_image vdso_image_x32;
#endif

If our kernel is configured for the x86 architecture, or for x86_64 with compatibility mode, we will be able to call a system call with the int 0x80 interrupt. If compatibility mode is enabled, we will also be able to call a system call with the native syscall instruction, and with the sysenter instruction otherwise:

#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_int80;
#ifdef CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_syscall;
#endif
extern const struct vdso_image vdso_image_32_sysenter;
#endif

As we can understand from the name of the vdso_image structure, it represents the image of the vDSO for a certain mode of system call entry. This structure contains information about the size in bytes of the vDSO area, which is always a multiple of PAGE_SIZE (4096 bytes), a pointer to the text mapping, the start and end address of the alternatives (a set of instructions with better alternatives for a certain type of processor) and so on. For example vdso_image_64 looks like this:

const struct vdso_image vdso_image_64 = {
    .data = raw_data,
    .size = 8192,
    .text_mapping = {
        .name = "[vdso]",
        .pages = pages,
    },
    .alt = 3145,
    .alt_len = 26,
    .sym_vvar_start = -8192,
    .sym_vvar_page = -8192,
    .sym_hpet_page = -4096,
};

Where raw_data contains the raw binary code of the 64-bit vDSO system calls, which is 2 pages in size:

static struct page *pages[2];

or 8 Kilobytes.

The init_vdso_image function is defined in the same source code file and just initializes vdso_image.text_mapping.pages. First of all this function calculates the number of pages and initializes each vdso_image.text_mapping.pages[number_of_page] with the virt_to_page macro that converts a given address to the page structure:

void __init init_vdso_image(const struct vdso_image *image)
{
    int i;
    int npages = (image->size) / PAGE_SIZE;

    for (i = 0; i < npages; i++)
        image->text_mapping.pages[i] =
            virt_to_page(image->data + i * PAGE_SIZE);
    ...
    ...
    ...
}

The init_vdso function is passed to the subsys_initcall macro, which adds the given function to the initcalls list. All functions from this list will be called in the do_initcalls function from the init/main.c source code file:

subsys_initcall(init_vdso);

Ok, we just saw the initialization of the vDSO and the initialization of the page structures that are related to the memory pages that contain the vDSO system calls. But where are these pages mapped? Actually they are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the arch_setup_additional_pages function from the arch/x86/entry/vdso/vma.c source code file, which checks that the vDSO is enabled for x86_64 and calls the map_vdso function:

int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
    if (!vdso64_enabled)
        return 0;

    return map_vdso(&vdso_image_64, true);
}

The map_vdso function is defined in the same source code file and maps pages for the vDSO and for the shared vDSO variables. That's all. The main difference between the vsyscall and the vDSO concepts is that vsyscall has a static address of ffffffffff600000 and implements 3 system calls, whereas the vDSO is loaded dynamically and implements four system calls:

__vdso_clock_gettime;
__vdso_getcpu;
__vdso_gettimeofday;
__vdso_time.

That's all.
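Because the vDSO is mapped at a different address in each process, a program that wants to find it has to ask at run time. As a minimal sketch, assuming Linux with glibc, the auxiliary vector entry AT_SYSINFO_EHDR carries the base address of the vDSO image, which can be read with getauxval. The demo_ helper names are hypothetical.

```c
#include <string.h>
#include <sys/auxv.h>

/* Base address of the vDSO image in this process, or 0 if none is mapped. */
static unsigned long demo_vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Return 1 if the mapping starts with the ELF magic bytes, 0 otherwise. */
static int demo_vdso_looks_like_elf(unsigned long base)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (base == 0)
        return 0;
    return memcmp((const void *)base, magic, sizeof(magic)) == 0;
}
```

Since the vDSO is a shared object, the memory at that base address is expected to begin with an ELF header. Running the program twice will usually show different base addresses, in contrast to the fixed vsyscall page.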

Conclusion

This is the end of the third part about the system calls concept in the Linux kernel. In the previous part we discussed the implementation of the preparation on the Linux kernel side before a system call is handled, and the implementation of the exit process from a system call handler. In this part we continued to dive into the system call concept and learned two new concepts that are very similar to the system call: the vsyscall and the vDSO.

After all of these three parts, we know almost all things that are related to system calls. We know what a system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.

The next part will be the last part in this chapter and we will see what occurs when a user runs a program.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

x86_64 memory map
x86_64
context switching
ABI
virtual address
Segmentation
enum
fix-mapped addresses
glibc
BUILD_BUG_ON
Processor register
Page fault
segmentation fault
instruction pointer
stack pointer
uname
Linkers
Previous part

System calls in the Linux kernel. Part 4.

How does the Linux kernel run a program

This is the fourth part of the chapter that describes system calls in the Linux kernel and, as I wrote in the conclusion of the previous part, this part will be the last in this chapter. In the previous part we stopped at two new concepts:

vsyscall;
vDSO;

that are related and very similar to the system call concept.

As you can understand from the part's title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.

How do we launch our programs?

There are many different ways to launch an application from a user's perspective. For example we can run a program from the shell or double-click on the application icon. It does not matter. The Linux kernel handles application launch regardless of how we launch the application.

In this part we will consider the case when we launch an application from the shell. As you know, the standard way to launch an application from the shell is the following: we launch a terminal emulator application, write the name of the program and optionally pass arguments to it.

Let's consider what occurs when we launch an application from the shell, what the shell does when we write a program name, what the Linux kernel does and so on. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel. That's why we will mostly see Linux kernel internals in this part. We will not consider in detail what the shell does, and we will not consider complex cases, for example subshells.

My default shell is bash, so I will consider how the bash shell launches a program. So let's start. The bash shell, like any program written in the C programming language, starts from the main function. If you look at the source code of the bash shell, you will find the main function in the shell.c source code file. This function does many different things before the main thread loop of bash starts to work. For example this function:

checks and tries to open /dev/tty;
checks whether the shell is running in debug mode;
parses command line arguments;
reads the shell environment;
loads .bashrc, .profile and other configuration files;
and many many more.

After all of these operations we can see the call of the reader_loop function. This function is defined in the eval.c source code file and represents the main thread loop; in other words, it reads and executes commands. Once the reader_loop function has made all checks and read the given program name and arguments, it calls the execute_command function from the execute_cmd.c source code file. The execute_command function, through the chain of function calls:

execute_command
--> execute_command_internal
----> execute_simple_command
------> execute_disk_command
--------> shell_execve

makes different checks, like whether we need to start a subshell, whether it was a builtin bash function or not, and so on. As I already wrote above, we will not consider all the details of things that are not related to the Linux kernel. At the end of this process, the shell_execve function calls the execve system call:

execve(command, args, env);

The execve system call has the following signature:

int execve(const char *filename, char *const argv[], char *const envp[]);

and executes a program by the given filename, with the given arguments and environment variables. In our case this system call is the first and only one, for example:

$ strace ls
execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0

$ strace echo
execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0

$ strace uname
execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0

So, a user application (bash in our case) calls the system call and, as we already know, the next step is the Linux kernel.
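The user-space side of this step can be sketched in a few lines. The helper below is hypothetical (demo_run_program is not a bash function) and assumes the usual pattern in which a shell first creates a child process with fork and then calls execve in the child, so that the shell itself keeps running and can wait for the result.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run path with argv/envp in a child process; return its exit status or -1. */
static int demo_run_program(const char *path, char *const argv[],
                            char *const envp[])
{
    int status;
    pid_t pid;

    assert(path != NULL && argv != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, envp);   /* only returns on failure */
        _exit(127);                 /* conventional "could not execute" code */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, running /bin/sh with the arguments "sh", "-c", "exit 7" returns 7 to the caller, while a path that does not exist makes execve fail and the child exits with 127.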

execve system call

In the second part of this chapter we saw the preparation before a system call is called by a user application, and what happens after a system call handler finishes its work. We stopped at the call of the execve system call in the previous paragraph. This system call is defined in the fs/exec.c source code file and, as we already know, it takes three arguments:

SYSCALL_DEFINE3(execve,
        const char __user *, filename,
        const char __user *const __user *, argv,
        const char __user *const __user *, envp)
{
    return do_execve(getname(filename), argv, envp);
}

The implementation of execve is pretty simple here; as we can see, it just returns the result of the do_execve function. The do_execve function is defined in the same source code file and does the following things:

Initializes two pointers to userspace data with the given arguments and environment variables;
returns the result of do_execveat_common.

We can see its implementation:

struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);

The do_execveat_common function does the main work: it executes a new program. This function takes a similar set of arguments, but as you can see it takes five arguments instead of three. The first argument is the file descriptor that represents the directory with our application; in our case AT_FDCWD means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is flags. In our case we passed 0 to do_execveat_common. We will look at the flags in a later step.

First of all the do_execveat_common function checks the filename pointer and returns if it holds an error value. After this we check the flags of the current process to make sure that the limit of running processes is not exceeded:

if (IS_ERR(filename))
    return PTR_ERR(filename);

if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
    retval = -EAGAIN;
    goto out_ret;
}

current->flags &= ~PF_NPROC_EXCEEDED;

If these two checks were successful we unset the PF_NPROC_EXCEEDED flag in the flags of the current process to prevent failure of the execve. You can see that in the next step we call the unshare_files function that is defined in kernel/fork.c and unshares the files of the current task, and we check the result of this function:

retval = unshare_files(&displaced);
if (retval)
    goto out_ret;

We need to call this function to eliminate a potential leak of the execve'd binary's file descriptor. In the next step we start the preparation of bprm, which is represented by the struct linux_binprm structure (defined in the include/linux/binfmts.h header file). The linux_binprm structure is used to hold the arguments that are used when loading binaries. For example it contains the vma field, which has the vm_area_struct type and represents a single memory area over a contiguous interval in a given address space where our application will be loaded, the mm field, which is the memory descriptor of the binary, a pointer to the top of memory and many other fields.

First of all we allocate memory for this structure with the kzalloc function and check the result of the allocation:

bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
    goto out_files;

After this we start to prepare the binprm credentials with the call of the prepare_bprm_creds function:

retval = prepare_bprm_creds(bprm);
if (retval)
    goto out_free;

check_unsafe_exec(bprm);
current->in_execve = 1;

Initialization of the binprm credentials is, in other words, initialization of the cred structure that is stored inside the linux_binprm structure. The cred structure contains the security context of a task, for example the real uid of the task, the real gid of the task, the uid and gid for virtual file system operations and so on. In the next step, once the bprm credentials are prepared, we check that we can now safely execute a program with the call of the check_unsafe_exec function and set the current process to the in_execve state.

After all of these operations we call the do_open_execat function, which checks the flags that we passed to the do_execveat_common function (remember that we have 0 in the flags), searches for and opens the executable file on disk, checks that we are not loading a binary file from a noexec mount point (we need to avoid executing a binary from filesystems that do not contain executable binaries, like proc or sysfs), initializes the file structure and returns a pointer to this structure. Next we can see the call of sched_exec:

file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
    goto out_unmark;

sched_exec();

The sched_exec function is used to determine the least loaded processor that can execute the new program and to migrate the current process to it.

After this we need to check the file descriptor of the given executable binary. We check whether the name of our binary file starts with the / symbol, or whether the path of the given executable binary is interpreted relative to the current working directory of the calling process, in other words whether the file descriptor is AT_FDCWD (read above about this).

If one of these checks is successful we set the binary parameter filename:

bprm->file = file;

if (fd == AT_FDCWD || filename->name[0] == '/') {
    bprm->filename = filename->name;
}

Otherwise, if the filename is empty we set the binary parameter filename to /dev/fd/%d, and to /dev/fd/%d/%s if it is not, which means that we will execute the file to which the file descriptor refers:

} else {
    if (filename->name[0] == '\0')
        pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
    else
        pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
                            fd, filename->name);
    if (!pathbuf) {
        retval = -ENOMEM;
        goto out_unmark;
    }

    bprm->filename = pathbuf;
}

bprm->interp = bprm->filename;

Note that we set not only bprm->filename but also bprm->interp, which will contain the name of the program interpreter. For now we just write the same name there, but later it will be updated with the real name of the program interpreter, depending on the binary format of the program. You can read above that we already prepared cred for the linux_binprm. The next step is the initialization of the other fields of the linux_binprm. First of all we call the bprm_mm_init function and pass bprm to it:

retval = bprm_mm_init(bprm);
if (retval)
    goto out_unmark;

The bprm_mm_init function is defined in the same source code file and, as we can understand from the function's name, it initializes the memory descriptor; in other words, the bprm_mm_init function initializes the mm_struct structure. This structure is defined in the include/linux/mm_types.h header file and represents the address space of a process. We will not consider the implementation of the bprm_mm_init function because we do not yet know many important things related to the Linux kernel memory manager, but we just need to know that this function initializes mm_struct and populates it with a temporary stack vm_area_struct.

After this we calculate the count of the command line arguments which were passed to our executable binary and the count of the environment variables, and set them to bprm->argc and bprm->envc respectively:

bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
    goto out;

bprm->envc = count(envp, MAX_ARG_STRINGS);
if ((retval = bprm->envc) < 0)
    goto out;

As you can see we do these operations with the help of the count function, which is defined in the same source code file and calculates the count of strings in the argv array. The MAX_ARG_STRINGS macro is defined in the include/uapi/linux/binfmts.h header file and, as we can understand from the macro's name, it represents the maximum number of strings that can be passed to the execve system call. The value of MAX_ARG_STRINGS:

#define MAX_ARG_STRINGS 0x7FFFFFFF
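The idea behind count can be tried out in user space. The following is only a sketch under stated assumptions: demo_count is a hypothetical name, it walks an ordinary NULL-terminated array instead of user-space pointers, and it is assumed that exceeding the limit is reported with -E2BIG.

```c
#include <assert.h>
#include <stddef.h>
#include <errno.h>

/* Count entries of a NULL-terminated string array; -E2BIG if more than max. */
static int demo_count(char *const argv[], int max)
{
    int i = 0;

    assert(max >= 0);
    if (argv == NULL)
        return 0;

    while (argv[i] != NULL) {
        if (i >= max)
            return -E2BIG;   /* too many strings for the given limit */
        i++;
    }
    return i;
}
```

For the array { "ls", "-l", NULL } the function returns 2 with a limit of 0x7FFFFFFF, and -E2BIG with a limit of 1. A negative result is what the retval checks after the count calls are looking for.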

After we have calculated the number of command line arguments and environment variables, we call the prepare_binprm function. We already called a function with a similar name earlier. That function is called prepare_bprm_creds, and we remember that it initializes the cred structure in the linux_binprm. Now the prepare_binprm function:

retval = prepare_binprm(bprm);
if (retval < 0)
    goto out;

fills the linux_binprm structure with the uid from the inode and reads 128 bytes from the binary executable file. We read only the first 128 bytes from the executable file because we need to check the type of our executable. We will read the rest of the executable file in a later step. After the preparation of the linux_binprm structure we copy the filename of the executable binary file, the command line arguments and the environment variables to the linux_binprm with calls to the copy_strings_kernel and copy_strings functions:

retval = copy_strings_kernel(1, &bprm->filename, bprm);
if (retval < 0)
    goto out;

retval = copy_strings(bprm->envc, envp, bprm);
if (retval < 0)
    goto out;

retval = copy_strings(bprm->argc, argv, bprm);
if (retval < 0)
    goto out;

And set the pointer to the top of the new program's stack that we set in the bprm_mm_init function:

bprm->exec = bprm->p;

The top of the stack will contain the program filename and we store this filename in the exec field of the linux_binprm structure.

Now that we have filled the linux_binprm structure, we call the exec_binprm function:

retval = exec_binprm(bprm);
if (retval < 0)
    goto out;

First of all, in exec_binprm we store the pid of the current task and its pid as seen from the namespace of its parent:

old_pid = current->pid;
rcu_read_lock();
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
rcu_read_unlock();

and call the search_binary_handler function:

search_binary_handler(bprm);

This function goes through the list of handlers for different binary formats. Currently the Linux kernel supports the following binary formats:

binfmt_script - support for interpreted scripts that start with the #! line;
binfmt_misc - support for different binary formats, according to the runtime configuration of the Linux kernel;
binfmt_elf - support for the elf format;
binfmt_aout - support for the a.out format;
binfmt_flat - support for the flat format;
binfmt_elf_fdpic - support for elf FDPIC binaries;
binfmt_em86 - support for Intel elf binaries running on Alpha machines.

So, search_binary_handler tries to call the load_binary function of each handler and passes linux_binprm to it. If the binary handler supports the given executable file format, it starts to prepare the executable binary for execution:

int search_binary_handler(struct linux_binprm *bprm)
{
    ...
    ...
    ...
    list_for_each_entry(fmt, &formats, lh) {
        retval = fmt->load_binary(bprm);
        if (retval < 0 && !bprm->mm) {
            force_sigsegv(SIGSEGV, current);
            return retval;
        }
    }

    return retval;

Where the load_binary handler, for example for elf, checks the magic number (each elf binary file contains a magic number in its header) in the linux_binprm buffer (remember that we read the first 128 bytes from the executable binary file) and exits if it is not an elf binary:

static int load_elf_binary(struct linux_binprm *bprm)
{
    ...
    ...
    ...
    loc->elf_ex = *((struct elfhdr *)bprm->buf);

    if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
        goto out;
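The combination of a handler list and a magic number check can be modelled in user space. This is only a sketch with hypothetical names (demo_binfmt, demo_search_binary_handler and so on): each handler inspects the header buffer and either accepts it or declines, and the search returns the first handler that accepts. Only two formats are modelled, scripts starting with #! and files starting with the four ELF magic bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 128   /* size of the header buffer described in the text */

/* Hypothetical handler: returns 0 if it recognizes the header, -1 otherwise. */
struct demo_binfmt {
    const char *name;
    int (*load_binary)(const unsigned char *buf, size_t len);
};

static int demo_load_script(const unsigned char *buf, size_t len)
{
    return (len >= 2 && buf[0] == '#' && buf[1] == '!') ? 0 : -1;
}

static int demo_load_elf(const unsigned char *buf, size_t len)
{
    static const unsigned char elfmag[4] = { 0x7f, 'E', 'L', 'F' };

    return (len >= 4 && memcmp(buf, elfmag, 4) == 0) ? 0 : -1;
}

static const struct demo_binfmt demo_formats[] = {
    { "script", demo_load_script },
    { "elf",    demo_load_elf },
};

/* Try each handler in turn; return the name of the first one that accepts. */
static const char *demo_search_binary_handler(const unsigned char *buf,
                                              size_t len)
{
    size_t i;

    assert(buf != NULL && len <= DEMO_BUF_SIZE);
    for (i = 0; i < sizeof(demo_formats) / sizeof(demo_formats[0]); i++)
        if (demo_formats[i].load_binary(buf, len) == 0)
            return demo_formats[i].name;
    return NULL;   /* no handler understands this header */
}
```

A buffer that begins with 0x7f 'E' 'L' 'F' is claimed by the "elf" handler, a buffer that begins with "#!" by the "script" handler, and anything else is rejected by both, which corresponds to the case where no binary format is found.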

If the given executable file is in elf format, load_elf_binary continues to execute. The load_elf_binary function does many different things to prepare the executable file for execution. For example it checks the architecture and type of the executable file:

if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
    goto out;
if (!elf_check_arch(&loc->elf_ex))
    goto out;

and exits if the architecture is wrong or the file is neither an executable nor a shared object. It tries to load the program header table:

elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
if (!elf_phdata)
    goto out;

that describes segments. It reads the program interpreter and the libraries that are linked with our executable binary file from disk and loads them into memory. The program interpreter is specified in the .interp section of the executable file and, as you can read in the part that describes Linkers, it is /lib64/ld-linux-x86-64.so.2 for x86_64. It sets up the stack and maps the elf binary into the correct location in memory. It maps the bss and the brk sections and does many other things to prepare the executable file for execution.

At the end of the execution of load_elf_binary we call the start_thread function and pass three arguments to it:

    start_thread(regs, elf_entry, bprm->p);
    retval = 0;
out:
    kfree(loc);
out_ret:
    return retval;

These arguments are:

Set of registers for the new task;
Address of the entry point of the new task;
Address of the top of the stack for the new task.

From the function's name we might expect that it starts a new thread, but this is not so. The start_thread function just prepares the new task's registers to be ready to run. Let's look at the implementation of this function:

void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
    start_thread_common(regs, new_ip, new_sp,
                        __USER_CS, __USER_DS, 0);
}

As we can see the start_thread function just calls the start_thread_common function, which does everything for us:

static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
                    unsigned long new_sp,
                    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
    loadsegment(fs, 0);
    loadsegment(es, _ds);
    loadsegment(ds, _ds);
    load_gs_index(0);
    regs->ip = new_ip;
    regs->sp = new_sp;
    regs->cs = _cs;
    regs->ss = _ss;
    regs->flags = X86_EFLAGS_IF;
    force_iret();
}

The start_thread_common function fills the fs segment register with zero and es and ds with the value of the data segment register. After this we set new values for the instruction pointer, the cs segment and so on. At the end of the start_thread_common function we can see the force_iret macro, which forces a system call return via the iret instruction. Ok, we have prepared the new thread to run in userspace, so we can return from exec_binprm and we are in do_execveat_common again. After exec_binprm finishes its execution we release the memory for the structures that were allocated before and return.

After we have returned from the execve system call handler, execution of our program will be started. We can do this because all context related information is already configured for this purpose. As we saw, the execve system call does not return control to the calling process; instead the code, data and other segments of the calling process are overwritten with the segments of the new program. The exit from our application will be implemented through the exit system call.

That's all. From this point our program will be executed.

Conclusion

This is the end of the fourth and last part about the system call concept in the Linux kernel. We saw almost everything related to system calls in these four parts. We started from the understanding of the system call concept: we learned what it is and why user applications need it. Next we saw how Linux handles a system call from a user application. We met two concepts similar to the system call concept, vsyscall and vDSO, and finally we saw how the Linux kernel runs a user program.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links


Timers and time management

This chapter describes timers and time management related concepts in the Linux kernel.

Introduction - this part is an introduction to the timers in the Linux kernel.
Introduction to the clocksource framework - this part describes the clocksource framework in the Linux kernel.


Timers in the Linux kernel. Part 1.

Introduction

This is yet another post that opens a new chapter in the linux-insides book. The previous part was the last part of the chapter that describes the system call concept, and now it is time to start a new chapter. As you can understand from the post's title, this chapter will be devoted to timers and time management in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers, and time management in general, are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks: different timeouts (for example in the TCP implementation), keeping track of the current time, scheduling asynchronous functions, scheduling the next event interrupt and many more.

So, we will start to learn the implementation of the different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through its initialization process. We already did this in the special chapter that describes the initialization process of the Linux kernel, but as you may remember we missed some things there. One of them is the initialization of timers.

Let's start.

Initialization of non-standard PC hardware clock

After the Linux kernel was decompressed (you can read more about this in the Kernel decompression part), the architecture non-specific code starts to work in the init/main.c source code file. After the initialization of the lock validator, the initialization of cgroups and the setting of the canary value, we can see the call of the setup_arch function.

As you may remember, this function is defined in the arch/x86/kernel/setup.c source code file and prepares/initializes architecture-specific stuff (for example it reserves a place for the bss section, reserves a place for initrd, parses the kernel command line and many other things). Besides this, we can find some time management related functions there.

The first is:

    x86_init.timers.wallclock_init();

We already saw the x86_init structure in the chapter that describes the initialization of the Linux kernel. This structure contains pointers to the default setup functions for the different platforms like Intel MID, Intel CE4100, etc. The x86_init structure is defined in arch/x86/kernel/x86_init.c and, as you can see, it describes standard PC hardware by default.

As we can see, the x86_init structure has the x86_init_ops type, which provides a set of functions for platform specific setup like reserving standard resources, platform specific memory setup, initialization of interrupt handlers, etc. This structure looks like:

    struct x86_init_ops {
        struct x86_init_resources resources;
        struct x86_init_mpparse mpparse;
        struct x86_init_irqs irqs;
        struct x86_init_oem oem;
        struct x86_init_paging paging;
        struct x86_init_timers timers;
        struct x86_init_iommu iommu;
        struct x86_init_pci pci;
    };


Note the timers field, which has the x86_init_timers type. As we can understand from its name, this field is related to time management and timers. The x86_init_timers structure contains four fields, all of which are pointers to functions that return void:

setup_percpu_clockev - set up the per cpu clock event device for the boot cpu;
tsc_pre_init - platform function called before TSC init;
timer_init - initialize the platform timer;
wallclock_init - initialize the wallclock device.

So, as we already know, in our case wallclock_init performs the initialization of the wallclock device. If we look at the x86_init structure, we will see that wallclock_init points to x86_init_noop:

    struct x86_init_ops x86_init __initdata = {
        ...
        ...
        ...
        .timers = {
            .wallclock_init = x86_init_noop,
        },
        ...
        ...
        ...
    };

Where x86_init_noop is just a function that does nothing:

    void __cpuinit x86_init_noop(void) { }

for standard PC hardware. Actually, the wallclock_init function is used on the Intel MID platform. The x86_init.timers.wallclock_init pointer is set in the arch/x86/platform/intel-mid/intel-mid.c source code file, in the x86_intel_mid_early_setup function:

    void __init x86_intel_mid_early_setup(void)
    {
        ...
        ...
        ...
        x86_init.timers.wallclock_init = intel_mid_rtc_init;
        ...
        ...
        ...
    }

The implementation of the intel_mid_rtc_init function is in the arch/x86/platform/intel-mid/intel_mid_vrtc.c source code file and looks pretty simple. First of all, this function parses the Simple Firmware Interface M-Real-Time-Clock table to put such devices into the sfi_mrtc_array array, and then it sets up the functions that get and set the wallclock time:

    void __init intel_mid_rtc_init(void)
    {
        unsigned long vrtc_paddr;

        sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);

        vrtc_paddr = sfi_mrtc_array[0].phys_addr;
        if (!sfi_mrtc_num || !vrtc_paddr)
            return;

        vrtc_virt_base = (void __iomem *)set_fixmap_offset_nocache(FIX_LNW_VRTC,
                                                                   vrtc_paddr);
        x86_platform.get_wallclock = vrtc_get_time;
        x86_platform.set_wallclock = vrtc_set_mmss;
    }

That's all. After this, a device based on Intel MID will be able to get the time from the hardware clock. As I already wrote, the standard PC x86_64 architecture does not use this function and just does nothing when it is called. We just saw the initialization of the real time clock for the Intel MID architecture, and now it is time to return to the general x86_64 architecture and look at the time management related stuff there.
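The pattern above, a table of function pointers that defaults to a no-op and is overridden by platform setup code, can be sketched in plain user-space C. This is only an illustration under stated assumptions: the names demo_init_ops, demo_timers_ops, demo_noop, demo_rtc_init, demo_platform_early_setup, demo_setup_arch and demo_rtc_ready are hypothetical and are not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct x86_init_timers: one hook only. */
struct demo_timers_ops {
    void (*wallclock_init)(void);
};

/* Hypothetical stand-in for struct x86_init_ops. */
struct demo_init_ops {
    struct demo_timers_ops timers;
};

/* Set by the platform hook so that callers can observe that it ran. */
static bool demo_rtc_ready;

/* Default hook: does nothing, like x86_init_noop. */
static void demo_noop(void) { }

/* Platform-specific hook, playing the role of intel_mid_rtc_init. */
static void demo_rtc_init(void)
{
    demo_rtc_ready = true;
}

/* Defaults describe "standard PC" hardware: nothing to initialize. */
static struct demo_init_ops demo_init = {
    .timers = { .wallclock_init = demo_noop },
};

/* Early platform setup replaces the default hook with its own. */
void demo_platform_early_setup(void)
{
    demo_init.timers.wallclock_init = demo_rtc_init;
}

/* Generic code calls through the pointer and needs no platform checks. */
bool demo_setup_arch(void)
{
    demo_rtc_ready = false;
    demo_init.timers.wallclock_init();
    return demo_rtc_ready;
}
```

With the defaults, demo_setup_arch() runs the no-op and reports that no clock was set up; after demo_platform_early_setup() the same call reaches the platform hook.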

Acquainted with jiffies

If we return to the setup_arch function, which as you remember is located in the arch/x86/kernel/setup.c source code file, we will see the next call of a time management related function:

    register_refined_jiffies(CLOCK_TICK_RATE);

Before we look at the implementation of this function, we must know about the jiffy. As we can read on wikipedia:

    Jiffy is an informal term for any unspecified short period of time

This definition is very similar to the jiffy in the Linux kernel. There is a global variable called jiffies which holds the number of ticks that have occurred since the system booted. The Linux kernel sets the initial value of this variable:

    extern unsigned long volatile __jiffy_data jiffies;

during the initialization process. This global variable is incremented on each timer interrupt. Besides this, near the jiffies variable we can see the definition of a similar variable:

    extern u64 jiffies_64;

Actually only one of these variables is in use in the Linux kernel, and which one depends on the processor type. For x86_64 the u64 variable is used and for x86 the unsigned long one. We will see this if we look at the arch/x86/kernel/vmlinux.lds.S linker script:

    #ifdef CONFIG_X86_32
    ...
    jiffies = jiffies_64;
    ...
    #else
    ...
    jiffies_64 = jiffies;
    ...
    #endif

In the case of x86_32, jiffies will be the lower 32 bits of the jiffies_64 variable. Schematically, we can imagine it as follows:

                        jiffies_64
    +-----------------------------------------------------+
    |                       |                             |
    |                       |                             |
    |                       |       jiffies on `x86_32`   |
    |                       |                             |
    |                       |                             |
    +-----------------------------------------------------+
    63                     31                             0
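The relationship in the diagram can be checked with a small user-space sketch. It assumes nothing about the kernel beyond what the diagram shows: on x86_32 the jiffies symbol covers the low 32 bits of jiffies_64. The function name demo_jiffies_low32 is hypothetical.

```c
#include <stdint.h>

/*
 * Return the part of a 64-bit tick counter that a 32-bit `jiffies`
 * alias would expose: bits 31..0.
 */
uint32_t demo_jiffies_low32(uint64_t jiffies_64)
{
    return (uint32_t)(jiffies_64 & 0xffffffffu);
}
```

Because x86 is little-endian, the low 32 bits are stored first in memory, which is why placing both symbols at the same address gives exactly this view.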

Now we know a little theory about jiffies and we can return to our function. There is no architecture-specific implementation for our function, register_refined_jiffies. This function is located in the generic kernel code, in the kernel/time/jiffies.c source code file. The main point of register_refined_jiffies is the registration of the jiffy clocksource. Before we look at the implementation of the register_refined_jiffies function, we must know what a clocksource is. As we can read in the comments:

    The `clocksource` is hardware abstraction for a free-running counter.

I'm not sure about you, but that description didn't give me a good understanding of the clocksource concept. Let's try to understand what it is, but we will not go too deep because this topic will be described in a separate part in much more detail. The main point of the clocksource is timekeeping abstraction or, in very simple words, it provides a time value to the kernel. We already know about the jiffies interface that represents the number of ticks that have occurred since the system booted. It is represented by a global variable in the Linux kernel and is incremented on each timer interrupt. The Linux kernel can use jiffies for time measurement. So why do we need a separate concept like the clocksource? Actually, different hardware devices provide different clock sources that vary widely in their capabilities. The availability of more precise techniques for time interval measurement is hardware-dependent.

For example, x86 has an on-chip 64-bit counter called the Time Stamp Counter, and its frequency can be equal to the processor frequency. Or, for example, the High Precision Event Timer, which consists of a 64-bit counter with a frequency of at least 10 MHz. These are two different timers and both of them are for x86. If we add timers from other architectures, this only makes the problem more complex. The Linux kernel provides the clocksource concept to solve this problem.

The clocksource concept is represented by the clocksource structure in the Linux kernel. This structure is defined in the include/linux/clocksource.h header file and contains a couple of fields that describe a time counter. For example, it contains the name field, which is the name of a counter, the flags field, which describes different properties of a counter, pointers to the suspend and resume functions, and many more.

Let's look at the clocksource structure for jiffies, which is defined in the kernel/time/jiffies.c source code file:

    static struct clocksource clocksource_jiffies = {
        .name       = "jiffies",
        .rating     = 1,
        .read       = jiffies_read,
        .mask       = 0xffffffff,
        .mult       = NSEC_PER_JIFFY << JIFFIES_SHIFT,
        .shift      = JIFFIES_SHIFT,
        .max_cycles = 10,
    };

We can see the definition of the default name here, jiffies. The next is the rating field, which allows the clocksource management code to choose the best registered clock source available for the specified hardware. The rating may have the following values:

1-99 - Only available for bootup and testing purposes;
100-199 - Functional for real use, but not desired;
200-299 - A correct and usable clocksource;
300-399 - A reasonably fast and accurate clocksource;
400-499 - The ideal clocksource. A must-use where available.


For example, the rating of the time stamp counter is 300, but the rating of the high precision event timer is 250. The next field is read. It is a pointer to the function that reads the clocksource's cycle value; in our case it just returns the jiffies variable cast to the cycle_t type:

    static cycle_t jiffies_read(struct clocksource *cs)
    {
        return (cycle_t) jiffies;
    }

where cycle_t is just a 64-bit unsigned type:

    typedef u64 cycle_t;

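The next field is the mask value, which ensures that subtraction between counter values from non 64-bit counters does not need special overflow logic. In our case the mask is 0xffffffff, so the counter is 32 bits wide. This means that the jiffies counter wraps around to zero after 2^32 ticks. For example, with HZ equal to 250 this happens after about 198 days:

    >>> 0xffffffff
    4294967295
    # seconds until wrap-around with HZ = 250
    >>> 0x100000000 // 250
    17179869
    # the same in days
    >>> 17179869 // 86400
    198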
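To see why the mask removes the need for special overflow logic, here is a minimal user-space sketch of a masked subtraction. The helper name demo_cycles_delta is hypothetical; only the idea of computing (now - last) & mask is taken from the description of the mask field.

```c
#include <stdint.h>

/*
 * Number of cycles elapsed between two reads of a counter whose valid
 * bits are given by `mask`. Unsigned subtraction wraps modulo 2^64 and
 * the mask then discards the bits the hardware counter does not have,
 * so a wrapped counter still yields the correct small difference.
 */
uint64_t demo_cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}
```

For a 32-bit counter that was read at 0xfffffffe and then, after wrapping, at 5, the function returns 7.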

The next two fields, mult and shift, are used to convert the clocksource's period to nanoseconds per cycle. When the kernel calls the clocksource.read function, this function returns a value in machine time units represented by the cycle_t data type that we saw just now. To convert this return value to nanoseconds we need these two fields: mult and shift. The clocksource provides the clocksource_cyc2ns function that will do it for us with the following expression:

    ((u64) cycles * mult) >> shift;

As we can see, the mult field is equal to:

    NSEC_PER_JIFFY << JIFFIES_SHIFT

    #define NSEC_PER_JIFFY  ((NSEC_PER_SEC+HZ/2)/HZ)
    #define NSEC_PER_SEC    1000000000L

by default, and the shift is

    #if HZ < 34
    #define JIFFIES_SHIFT   6
    #elif HZ < 67
    #define JIFFIES_SHIFT   7
    #else
    #define JIFFIES_SHIFT   8
    #endif

The jiffies clock source uses the NSEC_PER_JIFFY multiplier conversion to specify the nanosecond over cycle ratio. Note that the values of JIFFIES_SHIFT and NSEC_PER_JIFFY depend on the HZ value. HZ represents the frequency of the system timer. This macro is defined in include/asm-generic/param.h and depends on the CONFIG_HZ kernel configuration option.


The value of HZ differs for each supported architecture, but for x86 it is defined like:

    #define HZ      CONFIG_HZ

Where CONFIG_HZ can be one of the following values for x86: 100, 250, 300 or 1000.

In our case it is set to 250, which means that the timer interrupt frequency is 250 Hz, i.e. it occurs 250 times per second, or one timer interrupt every 4 ms.
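The conversion from jiffies to nanoseconds described above can be reproduced outside the kernel. The following is a sketch, not kernel code: demo_nsec_per_jiffy, demo_jiffies_shift and demo_jiffies_to_ns are hypothetical names, HZ is passed as a parameter instead of being a compile-time constant, and mult is kept in 64 bits to keep the sketch simple.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Mirrors NSEC_PER_JIFFY: nanoseconds per tick, rounded to nearest. */
uint64_t demo_nsec_per_jiffy(uint64_t hz)
{
    return (DEMO_NSEC_PER_SEC + hz / 2) / hz;
}

/* Mirrors the JIFFIES_SHIFT selection by HZ range. */
unsigned int demo_jiffies_shift(uint64_t hz)
{
    if (hz < 34)
        return 6;
    if (hz < 67)
        return 7;
    return 8;
}

/* Applies ((u64)cycles * mult) >> shift with the jiffies mult and shift. */
uint64_t demo_jiffies_to_ns(uint64_t cycles, uint64_t hz)
{
    unsigned int shift = demo_jiffies_shift(hz);
    uint64_t mult = demo_nsec_per_jiffy(hz) << shift;

    return (cycles * mult) >> shift;
}
```

With HZ equal to 250, one jiffy converts to 4000000 ns (4 ms) and 250 jiffies convert to exactly one second.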

The last field that we can see in the definition of the clocksource_jiffies structure is max_cycles, which holds the maximum cycle value that can safely be multiplied without potentially causing an overflow.

Ok, we just saw the definition of the clocksource_jiffies structure, and we also know a little about jiffies and clocksource. Now it is time to get back to the implementation of our function. At the beginning of this part we stopped at the call of the:

    register_refined_jiffies(CLOCK_TICK_RATE);

function from the arch/x86/kernel/setup.c source code file.

As I already wrote, the main purpose of the register_refined_jiffies function is to register the refined_jiffies clocksource. We already saw that the clocksource_jiffies structure represents the standard jiffies clock source. Now, if you look in the kernel/time/jiffies.c source code file, you will find yet another clock source definition:

    struct clocksource refined_jiffies;

There is one difference between refined_jiffies and clocksource_jiffies: the standard jiffies based clock source is the lowest common denominator clock source which should function on all systems. As we already know, the jiffies global variable is incremented on each timer interrupt. This means that the standard jiffies based clock source has the same resolution as the timer interrupt frequency. From this we can understand that the standard jiffies based clock source may suffer from inaccuracies. The refined_jiffies clock source uses CLOCK_TICK_RATE as the base of the jiffies shift.

Let's look at the implementation of this function. First of all, we can see that the refined_jiffies clock source is based on the clocksource_jiffies structure:

    int register_refined_jiffies(long cycles_per_second)
    {
        u64 nsec_per_tick, shift_hz;
        long cycles_per_tick;

        refined_jiffies = clocksource_jiffies;
        refined_jiffies.name = "refined-jiffies";
        refined_jiffies.rating++;
        ...
        ...
        ...

Here we can see that we update the name of refined_jiffies to refined-jiffies and increment the rating of this structure. As you remember, clocksource_jiffies has a rating of 1, so our refined_jiffies clocksource will have a rating of 2. This means that the clocksource management code will prefer refined_jiffies over the standard jiffies clock source.

In the next step we need to calculate the number of cycles per one tick:

    cycles_per_tick = (cycles_per_second + HZ/2)/HZ;

Note that we used the NSEC_PER_SEC macro as the base of the standard jiffies multiplier. Here we are using cycles_per_second, which is the first parameter of the register_refined_jiffies function. We passed the CLOCK_TICK_RATE macro to the register_refined_jiffies function. This macro is defined in the arch/x86/include/asm/timex.h header file and expands to:

    #define CLOCK_TICK_RATE         PIT_TICK_RATE

where the PIT_TICK_RATE macro expands to the frequency of the Intel 8253:

    #define PIT_TICK_RATE 1193182ul

After this we calculate shift_hz for register_refined_jiffies, which will store hz << 8, or in other words the frequency of the system timer scaled by 256. We shift cycles_per_second, the frequency of the programmable interval timer, left by 8 in order to get extra accuracy:

    shift_hz = (u64)cycles_per_second << 8;
    shift_hz += cycles_per_tick/2;
    do_div(shift_hz, cycles_per_tick);

In the next step we calculate the number of nanoseconds per one tick by shifting NSEC_PER_SEC left by 8 too, as we did with shift_hz, and do the same calculation as before:

    nsec_per_tick = (u64)NSEC_PER_SEC << 8;
    nsec_per_tick += (u32)shift_hz/2;
    do_div(nsec_per_tick, (u32)shift_hz);

    refined_jiffies.mult = ((u32)nsec_per_tick) << JIFFIES_SHIFT;
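The arithmetic above is easier to follow with concrete numbers. The sketch below repeats the same steps in user space with ordinary 64-bit division instead of do_div; the function name demo_refined_nsec_per_tick is hypothetical and HZ is passed as a parameter.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/*
 * Nanoseconds per tick for a hardware timer running at
 * cycles_per_second and a tick rate of hz, following the same three
 * steps as register_refined_jiffies.
 */
uint64_t demo_refined_nsec_per_tick(uint64_t cycles_per_second, uint64_t hz)
{
    uint64_t cycles_per_tick, shift_hz, nsec_per_tick;

    /* Whole timer cycles in one tick, rounded to nearest. */
    cycles_per_tick = (cycles_per_second + hz / 2) / hz;

    /* Real tick frequency scaled by 256, rounded to nearest. */
    shift_hz = cycles_per_second << 8;
    shift_hz += cycles_per_tick / 2;
    shift_hz /= cycles_per_tick;

    /* Nanoseconds per tick; the factor of 256 cancels out. */
    nsec_per_tick = DEMO_NSEC_PER_SEC << 8;
    nsec_per_tick += (uint32_t)shift_hz / 2;
    nsec_per_tick /= (uint32_t)shift_hz;

    return nsec_per_tick;
}
```

For a timer frequency of 1193182 Hz and HZ equal to 250 this gives 4773 cycles per tick, a shift_hz of 63996 (about 249.98 Hz after dividing by 256) and 4000250 ns per tick, slightly more than the nominal 4000000 ns of the standard jiffies clock source.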

At the end of the register_refined_jiffies function we register the new clock source with the __clocksource_register function, which is defined in the include/linux/clocksource.h header file, and return:

    __clocksource_register(&refined_jiffies);
    return 0;

The clock source management code provides the API for clock source registration and selection. As we can see, clock sources are registered by calling the __clocksource_register function during kernel initialization or from a kernel module. During registration, the clock source management code will choose the best clock source available in the system using the clocksource.rating field, which we already saw when we initialized the clocksource structure for jiffies.
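Selection by rating can be illustrated with a small sketch. This is not the kernel's implementation: struct demo_clocksource and demo_select_best are hypothetical, and the sketch simply picks the entry with the highest rating from an array.

```c
#include <stddef.h>

/* Hypothetical, reduced clock source description. */
struct demo_clocksource {
    const char *name;
    int rating;
};

/*
 * Return the entry with the highest rating, or NULL for an empty list.
 * On equal ratings the earlier entry wins.
 */
const struct demo_clocksource *
demo_select_best(const struct demo_clocksource *list, size_t count)
{
    const struct demo_clocksource *best = NULL;
    size_t i;

    for (i = 0; i < count; i++) {
        if (best == NULL || list[i].rating > best->rating)
            best = &list[i];
    }
    return best;
}
```

Given only the two jiffies based clock sources with ratings 1 and 2, the sketch picks the one rated 2; once entries rated 250 and 300 are added, the one rated 300 wins.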

Using the jiffies

We just saw the initialization of two jiffies based clock sources in the previous paragraph:

standard jiffies based clock source;
refined jiffies based clock source.

Don't worry if you don't understand the calculations here. They look frightening at first. Soon, step by step, we will learn these things. So, we just saw the initialization of jiffies based clock sources and we also know that the Linux kernel has the global variable jiffies that holds the number of ticks that have occurred since the kernel started to work. Now, let's look at how to use it. To use jiffies we can just use the jiffies global variable by its name or call the get_jiffies_64 function. This function is defined in the kernel/time/jiffies.c source code file and just returns the full 64-bit value of jiffies:

    u64 get_jiffies_64(void)
    {
        unsigned long seq;
        u64 ret;

        do {
            seq = read_seqbegin(&jiffies_lock);
            ret = jiffies_64;
        } while (read_seqretry(&jiffies_lock, seq));
        return ret;
    }
    EXPORT_SYMBOL(get_jiffies_64);

Note that the get_jiffies_64 function is not implemented like jiffies_read, for example:

    static cycle_t jiffies_read(struct clocksource *cs)
    {
        return (cycle_t) jiffies;
    }

We can see that the implementation of get_jiffies_64 is more complex. The reading of the jiffies_64 variable is implemented using seqlocks. Actually this is done for machines that cannot atomically read the full 64-bit value.

If we can access the jiffies or the jiffies_64 variable, we can convert it to human time units. To get the number of seconds we can use the following expression:

    jiffies / HZ

So, if we know this, we can get any time units. For example:

    /* Thirty seconds from now */
    jiffies + 30*HZ

    /* Two minutes from now */
    jiffies + 120*HZ

    /* Ten milliseconds from now */
    jiffies + HZ / 100
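The same arithmetic can be wrapped in small helpers. The sketch below is a user-space illustration with hypothetical names (demo_secs_to_ticks, demo_msecs_to_ticks); hz is a parameter here, and the millisecond helper rounds up so that a timeout is never shorter than requested, which is a design choice of this sketch.

```c
#include <stdint.h>

/* Ticks in a whole number of seconds. */
uint64_t demo_secs_to_ticks(uint64_t secs, uint64_t hz)
{
    return secs * hz;
}

/* Ticks in a number of milliseconds, rounded up to a whole tick. */
uint64_t demo_msecs_to_ticks(uint64_t msecs, uint64_t hz)
{
    return (msecs * hz + 999) / 1000;
}
```

A deadline thirty seconds from now is then the current tick count plus demo_secs_to_ticks(30, hz). Note that ten milliseconds do not fit a whole number of ticks when hz is 250, so the helper rounds 2.5 ticks up to 3.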

That's all.

Conclusion

This concludes the first part covering time and time management related concepts in the Linux kernel. We met the first two concepts and their initialization in this part: jiffies and clocksource. In the next part we will continue to dive into this interesting theme and, as I already wrote in this part, we will get acquainted with and try to understand the internals of these and other time management concepts in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

system call
TCP
lock validator
cgroups
bss
initrd
Intel MID
TSC
void
Simple Firmware Interface
x86_64
real time clock
Jiffy
high precision event timer
nanoseconds
Intel 8253
seqlocks
clocksource documentation
Previous chapter


Timers in the Linux kernel. Part 2.

Introduction to the clocksource framework

The previous part was the first part in the current chapter that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:

jiffies
clocksource

The first is the global variable that is defined in the include/linux/jiffies.h header file and represents a counter that is incremented on each timer interrupt. So if we can access this global variable and we know the timer interrupt rate, we can convert jiffies to human time units. As we already know, the timer interrupt rate is represented by the compile-time constant called HZ in the Linux kernel. The value of HZ is equal to the value of the CONFIG_HZ kernel configuration option, and if we look in the arch/x86/configs/x86_64_defconfig kernel configuration file, we will see that the:

    CONFIG_HZ_1000=y

kernel configuration option is set. This means that the value of CONFIG_HZ will be 1000 by default for the x86_64 architecture. So, if we divide the value of jiffies by the value of HZ:

    jiffies / HZ

we will get the number of seconds that have elapsed since the Linux kernel started to work, or in other words the system uptime. Since HZ represents the number of timer interrupts in a second, we can set a value for some time in the future. For example:

    /* one minute from now */
    unsigned long later = jiffies + 60*HZ;

    /* five minutes from now */
    unsigned long later = jiffies + 5*60*HZ;

This is a very common practice in the Linux kernel. For example, if you look in the arch/x86/kernel/smpboot.c source code file, you will find the do_boot_cpu function. This function boots all processors besides the bootstrap processor. You can find a piece of code that waits ten seconds for a response from an application processor:

    if (!boot_error) {
        timeout = jiffies + 10*HZ;
        while (time_before(jiffies, timeout)) {
            ...
            ...
            ...
            udelay(100);
        }
        ...
        ...
        ...
    }

We assign the jiffies + 10*HZ value to the timeout variable here. As I think you already understood, this means a ten second timeout. After this we enter a loop where we use the time_before macro to compare the current jiffies value with our timeout.

Or, for example, if we look in the sound/isa/sscape.c source code file, which represents the driver for the Ensoniq Soundscape Elite sound card, we will see the obp_startup_ack function, which waits up to a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:

    static int obp_startup_ack(struct soundscape *s, unsigned timeout)
    {
        unsigned long end_time = jiffies + msecs_to_jiffies(timeout);

        do {
            ...
            ...
            ...
            x = host_read_unsafe(s->io_base);
            ...
            ...
            ...
            if (x == 0xfe || x == 0xff)
                return 1;
            msleep(10);
        } while (time_before(jiffies, end_time));

        return 0;
    }
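Both examples compare jiffies with a deadline through time_before rather than with a plain < operator, which keeps the comparison correct when the counter wraps around. A minimal sketch of this idea for a 32-bit counter follows; demo_time_before is a hypothetical name, and the sketch assumes that the two values are less than 2^31 ticks apart and that the conversion to int32_t wraps in the usual two's complement way.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if tick value `a` is earlier than `b`. The unsigned difference
 * wraps modulo 2^32; reinterpreting it as signed tells us on which
 * side of `b` the value `a` lies.
 */
bool demo_time_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

A plain comparison would report that 0xfffffff0 is later than 0x10, while the sketch correctly treats 0xfffffff0 as 32 ticks before 0x10.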

So, you can see that the jiffies variable is very widely used in the Linux kernel code. As I already wrote, we met yet another new time management related concept in the previous part: clocksource. In the previous part we just saw a short description of this concept and the API for clock source registration. Let's take a closer look at this concept in this part.

Introduction to clocksource

The clocksource concept represents the generic API for clock source management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. The concept of time is a fundamental concept in the Linux kernel and other operating system kernels. And timekeeping is one of the necessities for using this concept. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on every processor, and many more. Where can the Linux kernel get information about time? First of all, there is the Real Time Clock or RTC, which is a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the Linux kernel in the drivers/rtc directory. Besides this, each architecture can provide a driver for the architecture-dependent real time clock, for example CMOS/RTC - arch/x86/kernel/rtc.c for the x86 architecture. The second is the system timer, a timer that raises interrupts at a periodic rate. For example, for IBM PC compatibles it was the programmable interval timer.

We already know that for timekeeping purposes we can use jiffies in the Linux kernel. jiffies can be considered a read only global variable which is updated with HZ frequency. We know that HZ is a compile-time kernel parameter whose reasonable range is from 100 to 1000 Hz. So, it is guaranteed to provide an interface for time measurement with 1 - 10 millisecond resolution. Besides standard jiffies, we saw the refined_jiffies clock source in the previous part, which is based on the i8253/i8254 programmable interval timer tick rate of almost 1193182 hertz. So we can get something like 1 microsecond resolution with refined_jiffies. Nowadays, nanoseconds are the favorite choice for the time value units of a given clock source.

The availability of more precise techniques for time interval measurement is hardware-dependent. We have just learned a little about x86-dependent timer hardware. But each architecture provides its own timer hardware. Earlier, each architecture had its own implementation for this purpose. The solution to this problem is an abstraction layer and associated API in a common code framework for managing various clock sources, independent of the timer interrupt. This common code framework became the clocksource framework.


The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture-independent portion of the code, with the architecture-dependent portion reduced to defining and managing the low-level hardware pieces of clock sources. Having a large number of ways to measure time intervals on different architectures with different hardware is very complex. The implementation of each clock related service is strongly associated with an individual hardware device and, as you can understand, this results in similar implementations for different architectures.

Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are currently the favorite choice for the time value units of a clock source. One of the main points of the clocksource framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to select, access and scale different clock sources.

The clocksource structure

The foundation of the clocksource framework is the clocksource structure, which is defined in the include/linux/clocksource.h header file. We already saw some fields that are provided by the clocksource structure in the previous part. Let's look at the full definition of this structure and try to describe all of its fields:

    struct clocksource {
        cycle_t (*read)(struct clocksource *cs);
        cycle_t mask;
        u32 mult;
        u32 shift;
        u64 max_idle_ns;
        u32 maxadj;
    #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
        struct arch_clocksource_data archdata;
    #endif
        u64 max_cycles;
        const char *name;
        struct list_head list;
        int rating;
        int (*enable)(struct clocksource *cs);
        void (*disable)(struct clocksource *cs);
        unsigned long flags;
        void (*suspend)(struct clocksource *cs);
        void (*resume)(struct clocksource *cs);
    #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
        struct list_head wd_list;
        cycle_t cs_last;
        cycle_t wd_last;
    #endif
        struct module *owner;
    } ____cacheline_aligned;

We already saw the first field of the clocksource structure in the previous part. It is a pointer to the read function, which returns the value of the counter of the clock source selected by the clocksource framework. For example, we use the jiffies_read function to read the jiffies value:

    static struct clocksource clocksource_jiffies = {
        ...
        .read = jiffies_read,
        ...
    }

where jiffies_read just returns:

    static cycle_t jiffies_read(struct clocksource *cs)
    {
        return (cycle_t) jiffies;
    }


Or the read_tsc function:

    static struct clocksource clocksource_tsc = {
        ...
        .read = read_tsc,
        ...
    };

for reading the time stamp counter.

The next field is mask, which ensures that subtraction between counter values from non 64-bit counters does not need special overflow logic. After the mask field, we can see two fields: mult and shift. These fields are the basis of the mathematical functions that provide the ability to convert time values specific to each clock source. In other words, these two fields help us to convert the abstract machine time units of a counter to nanoseconds.

After these two fields we can see the 64-bit max_idle_ns field, which represents the maximum idle time permitted by the clocksource, in nanoseconds. We need this field for a Linux kernel with the CONFIG_NO_HZ kernel configuration option enabled. This kernel configuration option enables the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that the dynamic tick allows the kernel to sleep for periods longer than a single tick; moreover, the sleep time could be unlimited. The max_idle_ns field represents this sleeping limit.

The next field after max_idle_ns is the maxadj field, which is the maximum adjustment value to mult. The main formula by which we convert cycles to nanoseconds:

    ((u64) cycles * mult) >> shift;

is not 100% accurate. Instead, the number is taken as close as possible to a nanosecond, and maxadj helps to correct this and allows the clocksource API to avoid mult values that might overflow when adjusted. The next four fields are pointers to functions:

enable - optional function to enable the clocksource;
disable - optional function to disable the clocksource;
suspend - suspend function for the clocksource;
resume - resume function for the clocksource.

The next field is max_cycles and, as we can understand from its name, this field represents the maximum cycle value before potential overflow. And the last field is owner, which represents a reference to the kernel module that is the owner of a clocksource. This is all. We just went through all the standard fields of the clocksource structure. But you may have noted that we missed some fields of the clocksource structure. We can divide all of the missed fields into two types. Fields of the first type are already known to us. For example, they are the name field, which represents the name of a clocksource, the rating field, which helps the Linux kernel to select the best clocksource, etc. The second type is fields that depend on different Linux kernel configuration options. Let's look at these fields.

The first field is archdata. This field has the arch_clocksource_data type and depends on the CONFIG_ARCH_CLOCKSOURCE_DATA kernel configuration option. At the moment this field is relevant only for the x86 and IA64 architectures. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the vDSO clock mode:

    struct arch_clocksource_data {
        int vclock_mode;
    };


for the x86 architectures. The vDSO clock mode can be one of the following:

    #define VCLOCK_NONE    0
    #define VCLOCK_TSC     1
    #define VCLOCK_HPET    2
    #define VCLOCK_PVCLOCK 3

The last three fields, wd_list, cs_last and wd_last, depend on the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. First of all, let's try to understand what a watchdog is. In simple words, a watchdog is a timer that is used for detecting computer malfunctions and recovering from them. All three of these fields contain watchdog related data that is used by the clocksource framework. If we grep the Linux kernel source code, we will see that only the arch/x86/KConfig kernel configuration file contains the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. So, why do x86 and x86_64 need a watchdog? You may already know that all x86 processors have a special 64-bit register, the time stamp counter. This register contains the number of cycles since reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see the initialization of the watchdog timer in this part; before this we must learn more about timers.

That's all. From this moment we know all the fields of the clocksource structure. This knowledge will help us to learn the internals of the clocksource framework.

New clock source registration

We saw only one function from the clocksource framework in the previous part. This function was __clocksource_register. It is defined in the include/linux/clocksource.h header file and, as we can understand from the function's name, the main point of this function is to register a new clocksource. If we look at the implementation of the __clocksource_register function, we will see that it just calls the __clocksource_register_scale function and returns its result:

    static inline int __clocksource_register(struct clocksource *cs)
    {
        return __clocksource_register_scale(cs, 1, 0);
    }

Before we look at the implementation of the __clocksource_register_scale function, we can see that clocksource provides an additional API for new clock source registration:

    static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
    {
        return __clocksource_register_scale(cs, 1, hz);
    }

    static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
    {
        return __clocksource_register_scale(cs, 1000, khz);
    }

All of these functions do the same thing. They return the value of the __clocksource_register_scale function, but with different sets of parameters. The __clocksource_register_scale function is defined in the kernel/time/clocksource.c source code file. To understand the difference between these functions, let's look at the parameters of the __clocksource_register_scale function. As we can see, this function takes three parameters:

cs - clocksource to be installed;
scale - scale factor of a clock source. In other words, if we multiply the value of this parameter by the frequency, we will get the hz of a clocksource;
freq - clock source frequency divided by scale.
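The relation between scale and freq can be written down directly. In this sketch demo_effective_hz is a hypothetical helper that returns the frequency in hertz described by a (scale, freq) pair.

```c
#include <stdint.h>

/* Frequency in hertz described by a (scale, freq) pair. */
uint64_t demo_effective_hz(uint32_t scale, uint32_t freq)
{
    return (uint64_t)scale * freq;
}
```

So a registration through clocksource_register_hz with hz equal to 1193182 describes a 1193182 Hz counter, while clocksource_register_khz with khz equal to 1193 describes a 1193000 Hz one.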

Now let's look at the implementation of the __clocksource_register_scale function:

    int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
    {
        __clocksource_update_freq_scale(cs, scale, freq);
        mutex_lock(&clocksource_mutex);
        clocksource_enqueue(cs);
        clocksource_enqueue_watchdog(cs);
        clocksource_select();
        mutex_unlock(&clocksource_mutex);
        return 0;
    }

First of all, we can see that the __clocksource_register_scale function starts with a call of the __clocksource_update_freq_scale function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency and, if it was not passed as zero, we need to calculate the mult and shift parameters for the given clock source. Why do we need to check the value of the frequency? Actually, it can be zero. If you looked attentively at the implementation of the __clocksource_register function, you may have noticed that we passed the frequency as 0. We do this only for clock sources that have self-defined mult and shift parameters. Look in the previous part and you will see that we saw the calculation of mult and shift for jiffies. The __clocksource_update_freq_scale function will do it for us for other clock sources.

So at the start of the __clocksource_update_freq_scale function we check the value of the frequency parameter and, if it is not zero, we calculate mult and shift for the given clock source. Let's look at the mult and shift calculation:

void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
{
        u64 sec;

        if (freq) {
                sec = cs->mask;
                do_div(sec, freq);
                do_div(sec, scale);
                if (!sec)
                        sec = 1;
                else if (sec > 600 && cs->mask > UINT_MAX)
                        sec = 600;

                clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
                                       NSEC_PER_SEC / scale, sec * scale);
        }
        ...
        ...
        ...
}

Here we can see the calculation of the maximum number of seconds which we can run before a clocksource counter overflows. First of all we fill the sec variable with the value of the clocksource mask. Remember that a clocksource's mask represents the bits that are valid for the given clocksource, so it is also the largest value the counter can hold. After this, we can see two division operations. At first we divide our sec variable by the clocksource frequency and then by the scale factor. The freq parameter shows us how many times the counter ticks in one second (in units of scale). So, we divide the mask value, which represents the maximum value of a counter (for example jiffy), by the frequency of a timer and get the maximum number of seconds for the given clocksource. The second division operation corrects this number of seconds for the scale factor, which can be 1 hertz or 1 kilohertz (10^3 Hz).
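The two divisions and the clamping that follows them can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code: the name clocksource_max_seconds is hypothetical, plain 64-bit division stands in for do_div, and returning 0 for a zero frequency or scale is an assumption made only to keep the helper safe to call.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the "sec" computation shown above.
 * mask  - largest value the counter can hold
 * scale - 1 for Hz, 1000 for kHz
 * freq  - counter frequency divided by scale
 */
uint64_t clocksource_max_seconds(uint64_t mask, uint32_t scale, uint32_t freq)
{
        uint64_t sec;

        /* Assumption of this sketch: nothing to compute without a frequency. */
        if (!freq || !scale)
                return 0;

        sec = mask / freq;      /* do_div(sec, freq) in the kernel  */
        sec = sec / scale;      /* do_div(sec, scale) in the kernel */

        if (!sec)
                sec = 1;        /* never less than one second */
        else if (sec > 600 && mask > UINT_MAX)
                sec = 600;      /* cap counters wider than 32 bits at 10 minutes */

        return sec;
}
```

For a 24-bit counter running at 3579545 Hz the helper gives 4 seconds, while a full 64-bit counter is capped at 600 seconds.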

After we have got the maximum number of seconds, we check this value: it is set to 1 if the division gave zero, and to 600 if it is greater than 600 and the mask is wider than 32 bits. These values are the maximum sleeping time for a clocksource in seconds. In the next step we can see a call to clocks_calc_mult_shift. The main point of this function is the calculation of the mult and shift values for a given clocksource. At the end of the __clocksource_update_freq_scale function we check that the just calculated mult value of the given clocksource will not cause an overflow after adjustment, update the max_idle_ns and max_cycles values of the given clocksource with the maximum nanoseconds that can be converted to a clocksource counter, and print the result to the kernel buffer:

pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
        cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);

that we can see in the dmesg output:

$ dmesg | grep "clocksource:"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
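To make the role of mult and shift more concrete, here is a small user-space sketch of how a counter value is turned into nanoseconds with these two numbers. It is only an illustration: the names demo_calc_mult and demo_cyc2ns are hypothetical, the shift is chosen by hand, and the real clocks_calc_mult_shift additionally picks the shift so that the multiplication does not overflow for the given number of seconds.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC UINT64_C(1000000000)

/*
 * For a fixed shift, pick mult so that
 *     ns = (cycles * mult) >> shift
 * approximates cycles * NSEC_PER_SEC / freq_hz.
 * freq_hz must not be zero.
 */
uint32_t demo_calc_mult(uint32_t freq_hz, uint32_t shift)
{
        return (uint32_t)((DEMO_NSEC_PER_SEC << shift) / freq_hz);
}

/* Convert a number of counter cycles to nanoseconds. */
uint64_t demo_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
        return (cycles * mult) >> shift;
}
```

With a 1 MHz counter and a shift of 10, mult is 1024000, so 5 cycles are converted to 5000 nanoseconds without any division on the fast path.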

After the __clocksource_update_freq_scale function finishes its work, we return to the __clocksource_register_scale function, which registers the new clocksource. We can see calls to the following three functions:

mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);

Note that before the first of them is called, we lock the clocksource_mutex mutex. The point of the clocksource_mutex mutex is to protect the curr_clocksource variable, which represents the currently selected clocksource, and the clocksource_list variable, which represents the list of registered clocksources. Now, let's look at these three functions.

The first function, clocksource_enqueue, and the other two are defined in the same source code file. We go through all already registered clocksources, or in other words through all elements of the clocksource_list, and try to find the best place for the given clocksource:

static void clocksource_enqueue(struct clocksource *cs)
{
        struct list_head *entry = &clocksource_list;
        struct clocksource *tmp;

        list_for_each_entry(tmp, &clocksource_list, list)
                if (tmp->rating >= cs->rating)
                        entry = &tmp->list;
        list_add(&cs->list, entry);
}

In the end we just insert the new clocksource into the clocksource_list. The second function, clocksource_enqueue_watchdog, does almost the same as the previous function, but it inserts the new clocksource into the wd_list depending on the flags of the clocksource and starts a new watchdog timer. As I already wrote, we will not consider watchdog related stuff in this part but will do it in the next parts.
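The effect of clocksource_enqueue is that the list stays sorted by rating, from the best clocksource to the worst. The sketch below shows the same idea in user space with a simple singly linked list instead of the kernel's list_head; the demo_ names and the rating numbers used with it are illustrative assumptions. Unlike the kernel loop, which walks the whole list and remembers the last entry with a rating that is not lower, this version stops at the first entry with a lower rating, which gives the same position for a sorted list.

```c
#include <stddef.h>

struct demo_clocksource {
        const char *name;
        int rating;
        struct demo_clocksource *next;
};

/*
 * Insert cs behind every entry whose rating is greater than or equal
 * to cs->rating, so the list stays ordered by descending rating.
 */
void demo_enqueue(struct demo_clocksource **head, struct demo_clocksource *cs)
{
        struct demo_clocksource **link = head;

        while (*link && (*link)->rating >= cs->rating)
                link = &(*link)->next;

        cs->next = *link;
        *link = cs;
}

/* The best clocksource is simply the first element of the sorted list. */
struct demo_clocksource *demo_best(struct demo_clocksource *head)
{
        return head;
}
```

Because the ordering is maintained at insertion time, selecting the best clocksource later does not need any search.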

The last function is clocksource_select. As we can understand from the function's name, the main point of this function is to select the best clocksource from the registered clocksources. This function consists only of a call to a helper function:

static void clocksource_select(void)
{
        return __clocksource_select(false);
}

Note that the __clocksource_select function takes one parameter (false in our case). This bool parameter shows how to traverse the clocksource_list. In our case we pass false, which means that we will go through all entries of the clocksource_list. We already know that the clocksource with the best rating will be the first in the clocksource_list after the call of the clocksource_enqueue function, so we can easily get it from this list. After we have found a clocksource with the best rating, we switch to it:

if (curr_clocksource != best && !timekeeping_notify(best)) {
        pr_info("Switched to clocksource %s\n", best->name);
        curr_clocksource = best;
}

The result of this operation we can see in the dmesg output:

$ dmesg | grep Switched
[    0.199688] clocksource: Switched to clocksource hpet
[    2.452966] clocksource: Switched to clocksource tsc

Note that we can see two clocksources in the dmesg output (hpet and tsc in our case). Yes, there can actually be many different clocksources on particular hardware. So the Linux kernel knows about all registered clocksources and switches to a clocksource with a better rating each time a new clocksource is registered.

If we look at the bottom of the kernel/time/clocksource.c source code file, we will see that it has a sysfs interface. The main initialization occurs in the init_clocksource_sysfs function, which is called during device initcalls. Let's look at the implementation of the init_clocksource_sysfs function:

static struct bus_type clocksource_subsys = {
        .name = "clocksource",
        .dev_name = "clocksource",
};

static int __init init_clocksource_sysfs(void)
{
        int error = subsys_system_register(&clocksource_subsys, NULL);

        if (!error)
                error = device_register(&device_clocksource);
        if (!error)
                error = device_create_file(
                                &device_clocksource,
                                &dev_attr_current_clocksource);
        if (!error)
                error = device_create_file(&device_clocksource,
                                           &dev_attr_unbind_clocksource);
        if (!error)
                error = device_create_file(
                                &device_clocksource,
                                &dev_attr_available_clocksource);
        return error;
}
device_initcall(init_clocksource_sysfs);

First of all we can see that it registers a clocksource subsystem with a call to the subsys_system_register function. In other words, after the call of this function, we will have the following directory:

$ pwd
/sys/devices/system/clocksource

After this step, we can see the registration of the device_clocksource device, which is represented by the following structure:

static struct device device_clocksource = {
        .id     = 0,
        .bus    = &clocksource_subsys,
};

and the creation of three files:

dev_attr_current_clocksource;
dev_attr_unbind_clocksource;
dev_attr_available_clocksource.

These files provide information about the current clocksource in the system, the available clocksources in the system and an interface which allows us to unbind a clocksource.

After the init_clocksource_sysfs function is executed, we will be able to find some information about the available clocksources in:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

Or for example information about the current clocksource in the system:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
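The same files can of course be read from a program. Below is a minimal sketch of a helper which reads the first line of such a file; read_first_line is a hypothetical name and the helper knows nothing about clocksources, it just reads one line from the path it is given.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of the file at path into buf (at most size - 1
 * characters) and strip the trailing newline.
 * Returns 0 on success and -1 if the file cannot be opened or read.
 */
int read_first_line(const char *path, char *buf, size_t size)
{
        FILE *f;

        if (size == 0)
                return -1;

        f = fopen(path, "r");
        if (!f)
                return -1;

        if (!fgets(buf, (int)size, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);

        buf[strcspn(buf, "\n")] = '\0';
        return 0;
}
```

On a system where the sysfs files shown above exist, passing the path /sys/devices/system/clocksource/clocksource0/current_clocksource should fill the buffer with the name of the current clocksource.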

In the previous part, we saw the API for the registration of the jiffies clocksource, but didn't dive into details about the clocksource framework. In this part we did it and saw the implementation of the new clocksource registration and the selection of a clocksource with the best rating value in the system. Of course, this is not all of the API that the clocksource framework provides. There are a couple of additional functions, like clocksource_unregister for removing a given clocksource from the clocksource_list, and so on. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested in them, you can find them in kernel/time/clocksource.c.

That's all.

Conclusion

This is the end of the second part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: jiffies and clocksource. In this part we saw some examples of jiffies usage and learned more details about the clocksource concept.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-insides.

Links

Linux kernel memory management

This chapter describes memory management in the linux kernel. You will see here a couple of posts which describe different parts of the linux memory management framework:

Memblock - describes early memblock allocator.
Fix-Mapped Addresses and ioremap - describes fix-mapped addresses and early ioremap.

Linux kernel memory management Part 1.

Introduction

Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the last preparations before the kernel entry point part we stopped right before the call of the start_kernel function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first init process. You may remember how we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working yet. When the start_kernel function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the memblock.

Memblock

Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and running yet. Previously it was called Logical Memory Block, but with the patch by Yinghai Lu, it was renamed to memblock. The Linux kernel for the x86_64 architecture uses this method. We already met memblock in the Last preparations before the kernel entry point part. And now it is time to get acquainted with it more closely. We will see how it is implemented.

We will start to learn memblock from the data structures. Definitions of all the data structures can be found in the include/linux/memblock.h header file.

The first structure has the same name as this part and it is:

struct memblock {
        bool bottom_up;
        phys_addr_t current_limit;
        struct memblock_type memory;    /* array of memblock_region */
        struct memblock_type reserved;  /* array of memblock_region */
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        struct memblock_type physmem;
#endif
};

This structure contains five fields. The first is bottom_up, which allows allocating memory in bottom-up mode when it is true. The next field is current_limit. This field describes the limit size of the memory block. The next three fields describe the type of the memory block. It can be: reserved, memory and physical memory if the CONFIG_HAVE_MEMBLOCK_PHYS_MAP configuration option is enabled. Now we see yet another data structure - memblock_type. Let's look at its definition:

struct memblock_type {
        unsigned long cnt;
        unsigned long max;
        phys_addr_t total_size;
        struct memblock_region *regions;
};

This structure provides information about a memory type. It contains fields which describe the number of memory regions inside the current memory block (cnt), the size of the allocated array of memory regions (max), the size of all memory regions (total_size) and a pointer to the array of memblock_region structures (regions). memblock_region is a structure which describes a memory region. Its definition is:

struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        int nid;
#endif
};

memblock_region provides the base address and size of the memory region, and flags which can be:

#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE       0
#define MEMBLOCK_HOTPLUG        0x1

Also memblock_region provides an integer field - the numa node selector, if the CONFIG_HAVE_MEMBLOCK_NODE_MAP configuration option is enabled.

Schematically we can imagine it as:

+---------------------------+   +---------------------------+
|         memblock          |   |                           |
|  _______________________  |   |                           |
| |        memory         | |   |       Array of the        |
| |      memblock_type    |-|-->|      memblock_region      |
| |_______________________| |   |                           |
|                           |   +---------------------------+
|  _______________________  |   +---------------------------+
| |       reserved        | |   |                           |
| |      memblock_type    |-|-->|       Array of the        |
| |_______________________| |   |      memblock_region      |
|                           |   |                           |
+---------------------------+   +---------------------------+

These three structures: memblock, memblock_type and memblock_region are the main ones in the Memblock. Now we know about them and can look at the Memblock initialization process.

Memblock initialization

As all of the memblock API is described in the include/linux/memblock.h header file, the implementation of all these functions is in the mm/memblock.c source code file. Let's look at the top of the source code file and we will see the initialization of the memblock structure:

struct memblock memblock __initdata_memblock = {
        .memory.regions         = memblock_memory_init_regions,
        .memory.cnt             = 1,
        .memory.max             = INIT_MEMBLOCK_REGIONS,

        .reserved.regions       = memblock_reserved_init_regions,
        .reserved.cnt           = 1,
        .reserved.max           = INIT_MEMBLOCK_REGIONS,

#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        .physmem.regions        = memblock_physmem_init_regions,
        .physmem.cnt            = 1,
        .physmem.max            = INIT_PHYSMEM_REGIONS,
#endif
        .bottom_up              = false,
        .current_limit          = MEMBLOCK_ALLOC_ANYWHERE,
};

Here we can see the initialization of the memblock structure, in a variable which has the same name as the structure - memblock. First of all note the __initdata_memblock. The definition of this macro looks like:

#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
#define __init_memblock __meminit
#define __initdata_memblock __meminitdata
#else
#define __init_memblock
#define __initdata_memblock
#endif

You can note that it depends on CONFIG_ARCH_DISCARD_MEMBLOCK. If this configuration option is enabled, memblock code will be put into the .init section and it will be released after the kernel is booted up.

Next we can see the initialization of the memblock_type memory, memblock_type reserved and memblock_type physmem fields of the memblock structure. Here we are interested only in the memblock_type.regions initialization process. Note that every memblock_type field is initialized with an array of memblock_region:

static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock;
#endif

The memory and reserved arrays contain 128 memory regions each (the physmem array is sized by the separate INIT_PHYSMEM_REGIONS macro). We can see it in the INIT_MEMBLOCK_REGIONS macro definition:

#define INIT_MEMBLOCK_REGIONS   128

Note that all arrays are also defined with the __initdata_memblock macro which we already saw in the memblock structure initialization (read above if you've forgotten).

The last two fields describe that bottom_up allocation is disabled and that the limit of the current Memblock is:

#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)

which is 0xffffffffffffffff.

At this step the initialization of the memblock structure is finished and we can look at the Memblock API.

Memblock API

As I said above, the whole implementation of memblock is in mm/memblock.c. To understand how memblock works and is implemented, let's look at its usage first of all. There are a couple of places in the linux kernel where memblock is used. For example let's take the memblock_x86_fill function from arch/x86/kernel/e820.c. This function goes through the memory map provided by the e820 and adds memory regions reserved by the kernel to the memblock with the memblock_add function. As we met the memblock_add function first, let's start from it.

This function takes the physical base address and the size of the memory region and adds it to the memblock. The memblock_add function does not do anything special in its body, but just calls the:

memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);

function. We pass the memory block type - memory, the physical base address and size of the memory region, the maximum number of nodes, which is 1 if CONFIG_NODES_SHIFT is not set in the configuration file or 1 << CONFIG_NODES_SHIFT if it is set, and the flags. The memblock_add_range function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, memblock_add_range checks for the existence of memory regions in the memblock structure with the given memblock_type. If there are no memory regions, we just fill a new memblock_region with the given values and return (we already saw the implementation of this in the First touch of the linux kernel memory manager framework). If the memblock_type is not empty, we start to add the new memory region to the memblock with the given memblock_type.

First of all we get the end of the memory region with:

phys_addr_t end = base + memblock_cap_size(base, &size);

memblock_cap_size adjusts size so that base + size will not overflow. Its implementation is pretty easy:

static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
{
        return *size = min(*size, (phys_addr_t)ULLONG_MAX - base);
}

memblock_cap_size returns the new size, which is the smaller of the given size and ULLONG_MAX - base.
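The capping is easy to check in user space. In the following sketch demo_phys_addr_t and demo_cap_size are hypothetical stand-ins for phys_addr_t and memblock_cap_size, and a 64-bit physical address type is assumed:

```c
#include <stdint.h>

/* Assumption: physical addresses are 64 bits wide. */
typedef uint64_t demo_phys_addr_t;

/*
 * Shrink *size if needed so that base + *size does not wrap around,
 * and return the (possibly reduced) size.
 */
demo_phys_addr_t demo_cap_size(demo_phys_addr_t base, demo_phys_addr_t *size)
{
        demo_phys_addr_t room = UINT64_MAX - base;

        if (*size > room)
                *size = room;

        return *size;
}
```

A region which fits is left untouched, while a region which starts 0x10 bytes below the top of the address space and asks for 0x100 bytes is cut down to 0x10 bytes.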

After we have the end address of the new memory region, memblock_add_range checks overlap and merge conditions with the already added memory regions. Insertion of the new memory region into the memblock consists of two steps:

Adding of non-overlapping parts of the new memory area as separate regions;
Merging of all neighbouring regions.

We go through all the already stored memory regions and check for overlap with the new region:

for (i = 0; i < type->cnt; i++) {
        struct memblock_region *rgn = &type->regions[i];
        phys_addr_t rbase = rgn->base;
        phys_addr_t rend = rbase + rgn->size;

        if (rbase >= end)
                break;
        if (rend <= base)
                continue;
        ...
        ...
        ...
}

If the new memory region does not overlap regions which are already stored in the memblock, it is inserted into the memblock as is. This is the first step: we check that the new region can fit into the regions array and call memblock_double_array if it cannot:

while (type->cnt + nr_new > type->max)
        if (memblock_double_array(type, obase, size) < 0)
                return -ENOMEM;
insert = true;
goto repeat;

memblock_double_array doubles the size of the given regions array. Then we set insert to true and go to the repeat label. In the second step, starting from the repeat label, we go through the same loop and insert the current memory region into the memory block with the memblock_insert_region function:

if (base < end) {
        nr_new++;
        if (insert)
                memblock_insert_region(type, i, base, end - base,
                                       nid, flags);
}

As we set insert to true in the first step, now memblock_insert_region will be called. memblock_insert_region has almost the same implementation that we saw when we inserted a new region into the empty memblock_type (see above). This function gets the memory region at the insertion index:

struct memblock_region *rgn = &type->regions[idx];

and moves this region and all regions after it one position further with memmove:

memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));

After this it fills the memblock_region fields of the new memory region (base, size and so on) and increases the size of the memblock_type. At the end of the execution, memblock_add_range calls memblock_merge_regions, which merges neighboring compatible regions in the second step.

In the second case the new memory region can overlap already stored regions. For example we already have region1 in the memblock:

0                    0x1000
+-----------------------+
|                       |
|                       |
|        region1        |
|                       |
|                       |
+-----------------------+

And now we want to add region2 to the memblock with the following base address and size:

0x100                 0x2000
+-----------------------+
|                       |
|                       |
|        region2        |
|                       |
|                       |
+-----------------------+

In this case we set the base address of the new memory region to the end address of the overlapped region with:

base = min(rend, end);

So it will be 0x1000 in our case. And we insert it as we already did in the second step with:

if (base < end) {
        nr_new++;
        if (insert)
                memblock_insert_region(type, i, base, end - base, nid, flags);
}

In this case we insert only the higher, non-overlapping portion (the lower portion is already covered by the overlapped memory region) and then merge these portions with memblock_merge_regions. As I said above, the memblock_merge_regions function merges neighboring compatible regions. It goes through all memory regions of the given memblock_type, takes two neighboring memory regions - type->regions[i] and type->regions[i + 1] - and checks that these regions have the same flags, belong to the same node and that the end address of the first region is equal to the base address of the second region. If any of these checks fails, the pair is skipped:

while (i < type->cnt - 1) {
        struct memblock_region *this = &type->regions[i];
        struct memblock_region *next = &type->regions[i + 1];

        if (this->base + this->size != next->base ||
            memblock_get_region_node(this) !=
            memblock_get_region_node(next) ||
            this->flags != next->flags) {
                BUG_ON(this->base + this->size > next->base);
                i++;
                continue;
        }

If none of these conditions is true, in other words if the two regions are adjacent and compatible, we update the size of the first region with the size of the next region:

this->size += next->size;

As we have updated the size of the first memory region with the size of the next memory region, we move every memory region which is after the current (this) memory region one index back with the memmove function:

memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));

And decrease the count of the memory regions which belong to the memblock_type:

type->cnt--;

After this we will get two memory regions merged into one:

0                                             0x2000
+------------------------------------------------+
|                                                |
|                                                |
|                   region1                      |
|                                                |
|                                                |
+------------------------------------------------+

That's all. This is the whole principle of the work of the memblock_add_range function.
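The two steps (insert the non-overlapping parts, then merge the neighbours) can be reproduced with a small user-space model. The sketch below is a simplification and not the kernel implementation: all demo_ names are hypothetical, the regions array has a fixed size instead of being doubled, node ids and flags are left out, and base + size is assumed not to overflow.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_REGIONS 16

struct demo_region {
        uint64_t base;
        uint64_t size;
};

struct demo_type {
        unsigned long cnt;
        struct demo_region regions[DEMO_MAX_REGIONS];
};

/* Open a slot at index idx and store the new region there. */
static void demo_insert(struct demo_type *t, unsigned long idx,
                        uint64_t base, uint64_t size)
{
        struct demo_region *rgn = &t->regions[idx];

        memmove(rgn + 1, rgn, (t->cnt - idx) * sizeof(*rgn));
        rgn->base = base;
        rgn->size = size;
        t->cnt++;
}

/* Merge every pair of neighbours where one ends exactly where the next begins. */
static void demo_merge(struct demo_type *t)
{
        unsigned long i = 0;

        while (i + 1 < t->cnt) {
                struct demo_region *this = &t->regions[i];
                struct demo_region *next = &t->regions[i + 1];

                if (this->base + this->size != next->base) {
                        i++;
                        continue;
                }
                this->size += next->size;
                memmove(next, next + 1, (t->cnt - (i + 2)) * sizeof(*next));
                t->cnt--;
        }
}

/* Returns 0 on success and -1 if the fixed-size array is full. */
int demo_add_range(struct demo_type *t, uint64_t base, uint64_t size)
{
        uint64_t end = base + size;
        unsigned long i;

        if (!size)
                return 0;

        for (i = 0; i < t->cnt; i++) {
                uint64_t rbase = t->regions[i].base;
                uint64_t rend = rbase + t->regions[i].size;

                if (rbase >= end)
                        break;
                if (rend <= base)
                        continue;
                /* The part of the new range below the existing region. */
                if (rbase > base) {
                        if (t->cnt == DEMO_MAX_REGIONS)
                                return -1;
                        demo_insert(t, i++, base, rbase - base);
                }
                /* Skip the part already covered by the existing region. */
                base = rend < end ? rend : end;
        }

        /* Whatever is left above the last overlapped region. */
        if (base < end) {
                if (t->cnt == DEMO_MAX_REGIONS)
                        return -1;
                demo_insert(t, i, base, end - base);
        }

        demo_merge(t);
        return 0;
}
```

Adding the range 0 - 0x1000 and then the range 0x100 - 0x2000 to an empty demo_type leaves a single region which covers 0 - 0x2000, exactly as in the picture above.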

There is also the memblock_reserve function, which does the same as memblock_add, but with one difference: it stores the region in memblock_type.reserved instead of memblock_type.memory.

Of course this is not the full API. Memblock provides an API not only for adding memory and reserved memory regions, but also:

memblock_remove - removes memory region from memblock;
memblock_find_in_range - finds free area in given range;
memblock_free - releases memory region in memblock;
for_each_mem_range - iterates through memblock areas.

and many more....

Getting info about memory regions

Memblock also provides an API for getting information about allocated memory regions in the memblock. It is split in two parts:

get_allocated_memblock_memory_regions_info - getting info about memory regions;
get_allocated_memblock_reserved_regions_info - getting info about reserved regions.

Implementation of these functions is easy. Let's look at get_allocated_memblock_reserved_regions_info for example:

phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
                                        phys_addr_t *addr)
{
        if (memblock.reserved.regions == memblock_reserved_init_regions)
                return 0;

        *addr = __pa(memblock.reserved.regions);

        return PAGE_ALIGN(sizeof(struct memblock_region) *
                          memblock.reserved.max);
}

First of all this function checks whether the reserved regions array is still the statically allocated initial array (memblock_reserved_init_regions). If it is, nothing was allocated for it and we just return zero. Otherwise we write the physical address of the reserved memory regions array to the given address and return the aligned size of the allocated array. Note that the PAGE_ALIGN macro is used for the alignment. Actually it depends on the size of a page:

#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)

Implementation of the get_allocated_memblock_memory_regions_info function is the same. It has only one difference: memblock_type.memory is used instead of memblock_type.reserved.

Memblock debugging

There are many calls to memblock_dbg in the memblock implementation. If you pass the memblock=debug option to the kernel command line, these calls will print their messages. Actually memblock_dbg is just a macro which expands to printk:

#define memblock_dbg(fmt, ...) \
        if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)

For example you can see a call of this macro in the memblock_reserve function:

memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
             (unsigned long long)base,
             (unsigned long long)base + size - 1,
             flags, (void *)_RET_IP_);

And you will see the corresponding messages in the kernel log.

Memblock also has support in debugfs. If you run the kernel on an architecture other than X86 you can access:

/sys/kernel/debug/memblock/memory
/sys/kernel/debug/memblock/reserved
/sys/kernel/debug/memblock/physmem

for getting a dump of the memblock contents.

Conclusion

This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-internals.

Links

e820
numa
debugfs
First touch of the linux kernel memory manager framework

Linux kernel memory management Part 2.

Fix-Mapped Addresses and ioremap

Fix-Mapped addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus __START_KERNEL_map. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: to have a constant address at compile time, but to set the physical address only in the boot process. You may remember that in the earliest part, we already set the level2_fixmap_pgt:

NEXT_PAGE(level2_fixmap_pgt)
        .fill   506,8,0
        .quad   level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
        .fill   5,8,0

NEXT_PAGE(level1_fixmap_pgt)
        .fill   512,8,0

As you can see level2_fixmap_pgt is right after the level2_kernel_pgt which is kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the fixed_addresses enum from arch/x86/include/asm/fixmap.h. For example it contains entries for VSYSCALL_PAGE - if emulation of the legacy vsyscall page is enabled, FIX_APIC_BASE for the local apic and so on. In virtual memory the fix-mapped area is placed in the modules area:

       +-----------+-----------------+---------------+------------------+
       |           |                 |               |                  |
       |kernel text|      kernel     |               |    vsyscalls     |
       | mapping   |       text      |    Modules    |    fix-mapped    |
       |from phys 0|       data      |               |    addresses     |
       |           |                 |               |                  |
       +-----------+-----------------+---------------+------------------+
__START_KERNEL_map   __START_KERNEL    MODULES_VADDR            0xffffffffffffffff

The base virtual address and the size of the fix-mapped area are presented by the two following macros:

#define FIXADDR_SIZE    (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START   (FIXADDR_TOP - FIXADDR_SIZE)

Here __end_of_permanent_fixed_addresses is an element of the fixed_addresses enum and, as I wrote above, every fix-mapped address is represented by an integer index which is defined in fixed_addresses. PAGE_SHIFT determines the size of a page. For example the size of one page we can get with 1 << PAGE_SHIFT. In our case we need to get the size of the whole fix-mapped area, not only of one page, that's why we are using __end_of_permanent_fixed_addresses for getting the size of the fix-mapped area. In my case it's a little more than 536 kilobytes. In your case it might be a different number, because the size depends on the number of fix-mapped addresses, which depends on your kernel's configuration.

The second macro, FIXADDR_START, just subtracts the size of the fix-mapped area from its last address to get the base virtual address of the fix-mapped area. FIXADDR_TOP is an address rounded up from the base address of the vsyscall space:

#define FIXADDR_TOP     (round_up(VSYSCALL_ADDR + PAGE_SIZE, 1<<PMD_SHIFT) - PAGE_SIZE)

The fixed_addresses enums are used as an index to get the virtual address using the fix_to_virt function. Implementation of this function is easy:

static __always_inline unsigned long fix_to_virt(const unsigned int idx)
{
        BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
        return __fix_to_virt(idx);
}

First of all it checks with the BUILD_BUG_ON macro that the index given for the fixed_addresses enum is not greater than or equal to __end_of_fixed_addresses, and then returns the result of the __fix_to_virt macro:

#define __fix_to_virt(x)        (FIXADDR_TOP - ((x) << PAGE_SHIFT))

Here we shift the given fix-mapped address index left by PAGE_SHIFT, which determines the size of a page as I wrote above, and subtract the result from FIXADDR_TOP, which is the highest address of the fix-mapped area. There is an inverse function for getting the fix-mapped index from a virtual address:

static inline unsigned long virt_to_fix(const unsigned long vaddr)
{
        BUG_ON(vaddr >= FIXADDR_TOP || vaddr < FIXADDR_START);
        return __virt_to_fix(vaddr);
}

virt_to_fix takes a virtual address, checks that this address is between FIXADDR_START and FIXADDR_TOP and calls the __virt_to_fix macro, which is implemented as:

#define __virt_to_fix(x)        ((FIXADDR_TOP - ((x) & PAGE_MASK)) >> PAGE_SHIFT)

A PFN is simply an index within physical memory that is counted in page-sized units. The PFN for a physical address could be trivially defined as (page_phys_addr >> PAGE_SHIFT);

__virt_to_fix clears the first 12 bits in the given address, subtracts the result from the last address of the fix-mapped area (FIXADDR_TOP) and shifts the difference right by PAGE_SHIFT, which is 12. Let me explain how it works. As I already wrote, we clear the first 12 bits in the given address with x & PAGE_MASK. We know that the first 12 bits of a virtual address represent the offset in the page frame, so after clearing them we have the address of the beginning of the page. As FIXADDR_TOP is page aligned too, the difference between them is a whole number of pages. By shifting it right by PAGE_SHIFT we drop the 12 offset bits, in the same way as for a page frame number, and get the number of pages between the given address and FIXADDR_TOP, which is the index of the fix-mapped address.

Fix-mapped addresses are used in different places in the linux kernel. The IDT descriptor is stored there, the Intel Trusted Execution Technology UUID is stored in the fix-mapped area starting from the FIX_TBOOT_BASE index, the Xen bootmap and many more... We already saw a little about fix-mapped addresses in the fifth part about linux kernel initialization. We used the fix-mapped area in the early ioremap initialization. Let's look at it and try to understand what ioremap is, how it is implemented in the kernel and how it is related to the fix-mapped addresses.
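Before moving on, the conversion between an index and a virtual address described above can be checked with a few lines of user-space C. This is only a sketch under assumptions: all demo_ names are hypothetical, and the constants (4 kilobyte pages, PMD_SHIFT equal to 21 and a vsyscall base of 0xffffffffff600000) are values assumed here for a typical x86_64 configuration.

```c
#include <stdint.h>

/* Assumed values for a typical x86_64 configuration. */
#define DEMO_PAGE_SHIFT         12
#define DEMO_PAGE_SIZE          (UINT64_C(1) << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK          (~(DEMO_PAGE_SIZE - 1))
#define DEMO_PMD_SHIFT          21
#define DEMO_VSYSCALL_ADDR      UINT64_C(0xffffffffff600000)

/* Round x up to the next multiple of align (align is a power of two). */
static uint64_t demo_round_up(uint64_t x, uint64_t align)
{
        return (x + align - 1) & ~(align - 1);
}

/* Same formula as the FIXADDR_TOP macro, with the assumed constants. */
static uint64_t demo_fixaddr_top(void)
{
        return demo_round_up(DEMO_VSYSCALL_ADDR + DEMO_PAGE_SIZE,
                             UINT64_C(1) << DEMO_PMD_SHIFT) - DEMO_PAGE_SIZE;
}

/* Index -> virtual address: go idx pages down from the top. */
static uint64_t demo_fix_to_virt(unsigned int idx)
{
        return demo_fixaddr_top() - ((uint64_t)idx << DEMO_PAGE_SHIFT);
}

/* Virtual address -> index: drop the page offset, count pages from the top. */
static unsigned int demo_virt_to_fix(uint64_t vaddr)
{
        return (unsigned int)((demo_fixaddr_top() - (vaddr & DEMO_PAGE_MASK))
                              >> DEMO_PAGE_SHIFT);
}
```

With these constants the top of the fix-mapped area is 0xffffffffff7ff000, index 3 lies three pages below it, and any address inside that page is converted back to index 3.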

ioremap

The Linux kernel provides many different primitives to manage memory. For the moment we will touch I/O memory. Every device is controlled by reading from and writing to its registers. For example a driver can turn a device off or on by writing to its registers or get the state of a device by reading from its registers. Besides registers, many devices have buffers where a driver can write something or read from there. As we know, at the moment there are two ways to access a device's registers and data buffers:

through the I/O ports;
mapping of all the registers to the memory address space;

In the first case every control register of a device has the number of an input and output port. And a driver of a device can read from a port and write to it with the two in and out instructions which we already saw. If you want to know about currently registered port regions, you can find them in /proc/ioports:

$ cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0064-0064 : keyboard
  0070-0077 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
    00f0-00f0 : PNP0C04:00
  03c0-03df : vesafb
  03f8-03ff : serial
  04d0-04d1 : pnp 00:06
  0800-087f : pnp 00:01
  0a00-0a0f : pnp 00:04
  0a20-0a2f : pnp 00:04
  0a30-0a3f : pnp 00:04
0cf8-0cff : PCI conf1
0d00-ffff : PCI Bus 0000:00
...
...
...

/proc/ioports provides information about which driver uses which address of an I/O port region. All of these regions, for example 0000-0cf7, were claimed with the request_region function from include/linux/ioport.h. Actually request_region is a macro which is defined as:

#define request_region(start,n,name)   __request_region(&ioport_resource, (start), (n), (name), 0)

As we can see it takes three parameters:

start - begin of region;
n - length of region;
name - name of requester.

request_region allocates an I/O port region. Very often the check_region function is called before request_region to check that the given address range is available, and release_region is called to release the region. request_region returns a pointer to the resource structure. The resource structure presents an abstraction for a tree-like subset of system resources. We already saw the resource structure in the fifth part about the kernel initialization process and it looks as:

struct resource {
        resource_size_t start;
        resource_size_t end;
        const char *name;
        unsigned long flags;
        struct resource *parent, *sibling, *child;
};

and contains the start and end addresses of the resource, its name and so on. Every resource structure contains pointers to the parent, sibling and child resources. As it has a parent and children, it means that every subset of resources has a root resource structure. For example, for I/O ports it is the ioport_resource structure:

struct resource ioport_resource = {
        .name   = "PCI IO",
        .start  = 0,
        .end    = IO_SPACE_LIMIT,
        .flags  = IORESOURCE_IO,
};
EXPORT_SYMBOL(ioport_resource);

Or for iomem, it is the iomem_resource structure:

struct resource iomem_resource = {
        .name   = "PCI mem",
        .start  = 0,
        .end    = -1,
        .flags  = IORESOURCE_MEM,
};
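To get a feeling for how such a tree of resources rejects conflicting requests, here is a simplified user-space sketch. It is not the kernel implementation: the demo_ names are hypothetical, only one level below the root is handled and there is no locking. The children of the root are kept sorted by start address and a new range is linked in only if it does not overlap an existing child.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_resource {
        uint64_t start;
        uint64_t end;           /* inclusive, like the ranges in /proc/ioports */
        const char *name;
        struct demo_resource *parent, *sibling, *child;
};

/*
 * Try to link new under root.
 * Returns NULL on success, or the resource which conflicts with the
 * request (root itself if the range does not fit into the root).
 */
struct demo_resource *demo_request(struct demo_resource *root,
                                   struct demo_resource *new)
{
        struct demo_resource **p = &root->child;

        if (new->end < new->start ||
            new->start < root->start || new->end > root->end)
                return root;

        while (*p) {
                struct demo_resource *tmp = *p;

                if (tmp->start > new->end)
                        break;          /* free gap found before tmp */
                if (tmp->end >= new->start)
                        return tmp;     /* ranges overlap */
                p = &tmp->sibling;
        }

        new->sibling = *p;
        new->parent = root;
        *p = new;
        return NULL;
}
```

Requesting 0x70-0x77 under a root which covers 0-0xffff succeeds the first time, and a second request which touches the same ports gets the first resource back as the conflict.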

As I wrote above, request_region is used for registering an I/O port region and this macro is used in many places in the kernel. For example let's look at drivers/char/rtc.c. This source code file provides the Real Time Clock interface in the linux kernel. As every kernel module, the rtc module contains a module_init definition:

module_init(rtc_init);

where rtc_init is the rtc initialization function. This function is defined in the same rtc.c source code file. In the rtc_init function we can see a couple of calls to the rtc_request_region function, which wraps request_region, for example:

r = rtc_request_region(RTC_IO_EXTENT);

where rtc_request_region calls:

r = request_region(RTC_PORT(0), size, "rtc");

Here RTC_IO_EXTENT is the size of the region and it is 0x8, "rtc" is the name of the region and RTC_PORT is:

#define RTC_PORT(x)     (0x70 + (x))

So with request_region(RTC_PORT(0), size, "rtc") we register a region which starts at 0x70 and has the size 0x8. Let's look at /proc/ioports:

~$ sudo cat /proc/ioports | grep rtc
0070-0077 : rtc0

So, we got it! Ok, that was ports. The second way is the use of I/O memory. As I wrote above, this way maps the control registers and memory of a device into the memory address space. I/O memory is a set of contiguous addresses which a device provides to the CPU through a bus. The kernel does not use memory-mapped I/O addresses directly. There is a special ioremap function which allows us to convert a physical address on a bus to a kernel virtual address; in other words, ioremap maps an I/O physical memory region so that it can be accessed from the kernel. The ioremap function takes two parameters:

start of the memory region;
size of the memory region;

The I/O memory mapping API provides functions for checking, requesting and releasing a memory region, just as the I/O ports API does. There are three functions for it:

request_mem_region

release_mem_region

check_mem_region

~$ sudo cat /proc/iomem
...
...
...
be826000-be82cfff : ACPI Non-volatile Storage
be82d000-bf744fff : System RAM
bf745000-bfff4fff : reserved
bfff5000-dc041fff : System RAM
dc042000-dc0d2fff : reserved
dc0d3000-dc138fff : System RAM
dc139000-dc27dfff : ACPI Non-volatile Storage
dc27e000-deffefff : reserved
defff000-deffffff : System RAM
df000000-dfffffff : RAM buffer
e0000000-feafffff : PCI Bus 0000:00
  e0000000-efffffff : PCI Bus 0000:01
    e0000000-efffffff : 0000:01:00.0
  f7c00000-f7cfffff : PCI Bus 0000:06
    f7c00000-f7c0ffff : 0000:06:00.0
    f7c10000-f7c101ff : 0000:06:00.0
      f7c10000-f7c101ff : ahci
  f7d00000-f7dfffff : PCI Bus 0000:03
    f7d00000-f7d3ffff : 0000:03:00.0
      f7d00000-f7d3ffff : alx
...
...
...

Part of these addresses comes from the call of the e820_reserve_resources function. We can find the call of this function in arch/x86/kernel/setup.c, and the function itself is defined in arch/x86/kernel/e820.c. e820_reserve_resources goes through the e820 map and inserts memory regions into the root iomem resource structure. All e820 memory regions which will be inserted into the iomem resource will have one of the following types:

static inline const char *e820_type_to_string(int e820_type)
{
    switch (e820_type) {
    case E820_RESERVED_KERN:
    case E820_RAM:      return "System RAM";
    case E820_ACPI:     return "ACPI Tables";
    case E820_NVS:      return "ACPI Non-volatile Storage";
    case E820_UNUSABLE: return "Unusable memory";
    default:            return "reserved";
    }
}

and we can see them in /proc/iomem (read above).

Now let's try to understand how ioremap works. We already know a little about ioremap; we saw it in the fifth part about Linux kernel initialization. If you have read this part, you may remember the call of the early_ioremap_init function from arch/x86/mm/ioremap.c. Initialization of ioremap is split into two parts: there is the early part, which we can use before the normal ioremap is available, and the normal ioremap, which is available after vmalloc initialization and the call of paging_init. We do not know anything about vmalloc for now, so let's consider early initialization of ioremap. First of all, early_ioremap_init checks that the fixmap is aligned on a page middle directory boundary:

BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));

You can read more about BUILD_BUG_ON in the first part about Linux kernel initialization. The BUILD_BUG_ON macro raises a compilation error if the given expression is true. In the next step after this check, we can see the call of the early_ioremap_setup function from mm/early_ioremap.c. This function performs the generic initialization of ioremap. The early_ioremap_setup function fills the slot_virt array with the virtual addresses of the early fixmaps. All early fixmaps are located after __end_of_permanent_fixed_addresses in memory. They start at FIX_BTMAP_BEGIN (top) and end at FIX_BTMAP_END (down). Actually there are 512 temporary boot-time mappings, used by early ioremap:

#define NR_FIX_BTMAPS       64
#define FIX_BTMAPS_SLOTS    8
#define TOTAL_FIX_BTMAPS    (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)

and early_ioremap_setup:

void __init early_ioremap_setup(void)
{
    int i;

    for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
        if (WARN_ON(prev_map[i]))
            break;

    for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
}

The slot_virt and other arrays are defined in the same source code file:

static void __iomem *prev_map[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long prev_size[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata;

slot_virt contains the virtual addresses of the fix-mapped areas, and the prev_map array contains the addresses of the early ioremap areas. Note that I wrote above: "Actually there are 512 temporary boot-time mappings, used by early ioremap", and you can see that all arrays are defined with the __initdata attribute, which means that this memory will be released after the kernel initialization process. After early_ioremap_setup has finished its work, we get the page middle directory where early ioremap begins with the early_ioremap_pmd function, which just gets the base address of the page global directory and calculates the page middle directory for the given address:

static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
    pgd_t *base = __va(read_cr3());
    pgd_t *pgd = &base[pgd_index(addr)];
    pud_t *pud = pud_offset(pgd, addr);
    pmd_t *pmd = pmd_offset(pud, addr);

    return pmd;
}

After this we fill bm_pte (the early ioremap page table entries) with zeros and call the pmd_populate_kernel function:

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);

pmd_populate_kernel takes three parameters:

init_mm - memory descriptor of the init process (you can read about it in the previous part);
pmd - page middle directory of the beginning of the ioremap fixmaps;
bm_pte - early ioremap page table entries array which is defined as:

static pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)] __page_aligned_bss;

The pmd_populate_kernel function is defined in arch/x86/include/asm/pgalloc.h and populates the given page middle directory (pmd) with the given page table entries (bm_pte):

static inline void pmd_populate_kernel(struct mm_struct *mm,
                                       pmd_t *pmd, pte_t *pte)
{
    paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
    set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
}

where set_pmd is:

#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd)

and native_set_pmd is:

static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
    *pmdp = pmd;
}

That's all. Early ioremap is ready to use. There are a couple of other checks in the early_ioremap_init function, but they are not so important; in any case, the initialization of ioremap is finished.

Use of early ioremap

As early ioremap is set up, we can use it. It provides two functions:

early_ioremap
early_iounmap

for mapping/unmapping of an I/O physical address to a virtual address. Both functions depend on the CONFIG_MMU configuration option. The memory management unit is a special block for memory management. The main purpose of this block is the translation of virtual addresses to physical addresses. Technically, the memory management unit knows about the high-level page table address (pgd) from the cr3 control register. If the CONFIG_MMU option is set to n, early_ioremap just returns the given physical address and early_iounmap does nothing. Otherwise, if the CONFIG_MMU option is set to y, early_ioremap calls __early_ioremap which takes three parameters:

phys_addr - base physical address of the I/O memory region to map to virtual addresses;
size - size of the I/O memory region;
prot - page table entry bits.

First of all, in __early_ioremap we go through all the early ioremap fixmap slots, look for the first free one in the prev_map array, remember its number in the slot variable and store the size once we have found it:

slot = -1;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
    if (!prev_map[i]) {
        slot = i;
        break;
    }
}
...
...
...
prev_size[slot] = size;
last_addr = phys_addr + size - 1;

In the next step we can see the following code:

offset = phys_addr & ~PAGE_MASK;
phys_addr &= PAGE_MASK;
size = PAGE_ALIGN(last_addr + 1) - phys_addr;

On the first line we use ~PAGE_MASK to clear all bits of phys_addr except the low 12 bits, which gives us the offset within the page. The PAGE_MASK macro is defined as:

#define PAGE_MASK       (~(PAGE_SIZE-1))

We know that the size of a page is 4096 bytes, or 1000000000000 in binary. PAGE_SIZE - 1 is 111111111111; applying ~ gives PAGE_MASK, whose low 12 bits are 000000000000, and as we use ~PAGE_MASK we get 111111111111 again. On the second line we do the opposite and clear the low 12 bits, and on the third line we get the page-aligned size of the area. Now that we have an aligned area, we need to get the number of pages occupied by the new ioremap area and calculate the fix-mapped index from fixed_addresses in the next steps:

nrpages = size >> PAGE_SHIFT;
idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;

Now we can fill the fix-mapped area with the given physical addresses. On every iteration of the loop, we call the __early_set_fixmap function from arch/x86/mm/ioremap.c, increase the given physical address by the page size, which is 4096 bytes, and update the index and the number of pages:

while (nrpages > 0) {
    __early_set_fixmap(idx, phys_addr, prot);
    phys_addr += PAGE_SIZE;
    --idx;
    --nrpages;
}

The __early_set_fixmap function gets the page table entry (stored in bm_pte, see above) for the fix-mapped address that corresponds to the given index with:

pte = early_ioremap_pte(addr);

In the next step, after early_ioremap_pte, we check the given page flags with the pgprot_val macro and call set_pte or pte_clear depending on the result:

if (pgprot_val(flags))
    set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
    pte_clear(&init_mm, addr, pte);

As you can see above, we passed FIXMAP_PAGE_IO as flags to __early_ioremap. FIXMAP_PAGE_IO expands to the:

(__PAGE_KERNEL_EXEC | _PAGE_NX)

flags, so we call the set_pte function to set the page table entry, which works in the same manner as set_pmd but for PTEs (read above about it). After the page table entry is set, we can see the call of the __flush_tlb_one function:

__flush_tlb_one(addr);

This function is defined in arch/x86/include/asm/tlbflush.h and calls __flush_tlb_single or __flush_tlb depending on the value of cpu_has_invlpg:

static inline void __flush_tlb_one(unsigned long addr)
{
    if (cpu_has_invlpg)
        __flush_tlb_single(addr);
    else
        __flush_tlb();
}

The __flush_tlb_one function invalidates the given address in the TLB. As you just saw, we updated the paging structure, but the TLB is not informed of the changes, which is why we need to do it manually. There are two ways to do it. The first is to update the cr3 control register, and the __flush_tlb function does this:

native_write_cr3(native_read_cr3());

The second method is to use the invlpg instruction, which invalidates a TLB entry. Let's look at the __flush_tlb_one implementation. As you can see, first of all it checks cpu_has_invlpg which is defined as:

#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg  1
#else
# define cpu_has_invlpg  (boot_cpu_data.x86 > 3)
#endif

If the CPU supports the invlpg instruction, we call the __flush_tlb_single macro which expands to the call of __native_flush_tlb_single:

static inline void __native_flush_tlb_single(unsigned long addr)
{
    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}

or call __flush_tlb which just updates the cr3 register as we saw above. After this step the execution of the __early_set_fixmap function is finished and we can go back to the __early_ioremap implementation. As we have set the fixmap area for the given address, we need to save the base virtual address of the I/O re-mapped area in prev_map at the slot index:

prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);

and return it.

The second function, early_iounmap, unmaps an I/O memory region. This function takes two parameters: the base address and the size of an I/O region, and generally looks very similar to early_ioremap. It also goes through the fixmap slots and looks for the slot with the given address. After this it gets the index of the fixmap slot and calls __late_clear_fixmap or __early_set_fixmap depending on the after_paging_init value. It calls __early_set_fixmap with one difference from how early_ioremap does it: it passes zero as the physical address. And in the end it sets the address of the I/O memory region to NULL:

prev_map[slot] = NULL;

That's all about fixmaps and ioremap. Of course this part does not cover all features of ioremap; it covered only early ioremap, but there is also the normal ioremap. But we need to know more than we do now before looking at it.

So, this is the end!

Conclusion

This is the end of the second part about Linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-internals.

Links

apic
vsyscall
Intel Trusted Execution Technology
Xen
Real Time Clock
e820
Memory management unit
TLB
Paging
Linux kernel memory management Part 1.

Linux kernel concepts

This chapter describes various concepts which are used in the Linux kernel.

Per-CPU variables
CPU masks

Per-CPU variables

Per-CPU variables are one of the kernel features. You can understand what this feature means by reading its name. We can create a variable and each processor core will have its own copy of this variable. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works.

The kernel provides an API for creating per-cpu variables - the DEFINE_PER_CPU macro:

#define DEFINE_PER_CPU(type, name) \
        DEFINE_PER_CPU_SECTION(type, name, "")

This macro is defined in include/linux/percpu-defs.h, as are many other macros for working with per-cpu variables. Now we will see how this feature is implemented.

Take a look at the DEFINE_PER_CPU definition. We see that it takes 2 parameters: type and name, so we can use it to create per-cpu variables, for example like this:

DEFINE_PER_CPU(int, per_cpu_n)

We pass the type and the name of our variable. DEFINE_PER_CPU calls the DEFINE_PER_CPU_SECTION macro and passes the same two parameters and an empty string to it. Let's look at the definition of DEFINE_PER_CPU_SECTION:

#define DEFINE_PER_CPU_SECTION(type, name, sec)    \
         __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES  \
         __typeof__(type) name

#define __PCPU_ATTRS(sec)                                                \
         __percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))     \
         PER_CPU_ATTRIBUTES

where section is:

#define PER_CPU_BASE_SECTION ".data..percpu"

After all macros are expanded we will get a global per-cpu variable:

__attribute__((section(".data..percpu"))) int per_cpu_n

It means that we will have a per_cpu_n variable in the .data..percpu section. We can find this section in vmlinux:

.data..percpu 00013a58  0000000000000000  0000000001a5c000  00e00000  2**12
              CONTENTS, ALLOC, LOAD, DATA

Ok, now we know that when we use the DEFINE_PER_CPU macro, a per-cpu variable in the .data..percpu section will be created. When the kernel initializes, it calls the setup_per_cpu_areas function which loads the .data..percpu section multiple times, one section per CPU.

Let's look at the per-CPU areas initialization process. It starts in init/main.c from the call of the setup_per_cpu_areas function which is defined in arch/x86/kernel/setup_percpu.c.

pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
        NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);

setup_per_cpu_areas starts by printing the maximum number of CPUs set during kernel configuration with the CONFIG_NR_CPUS configuration option, the actual number of CPUs, nr_cpumask_bits (which is the same as the NR_CPUS bit for the new cpumask operators) and the number of NUMA nodes.

We can see this output in dmesg:

$ dmesg | grep percpu
[    0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1

In the next step we check the percpu first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has the percpu_alloc command line parameter which selects the type of the first chunk allocator. We can read about it in the kernel documentation:

percpu_alloc=   Select which percpu first chunk allocator to use.
                Currently supported values are "embed" and "page".
                Archs may support subset or none of the selections.
                See comments in mm/percpu.c for details on each
                allocator.  This parameter is primarily for debugging
                and performance comparison.

mm/percpu.c contains the handler of this command line option:

early_param("percpu_alloc", percpu_alloc_setup);

where the percpu_alloc_setup function sets the pcpu_chosen_fc variable depending on the percpu_alloc parameter value. By default the first chunk allocator is auto:

enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;

If the percpu_alloc parameter is not given on the kernel command line, the embed allocator will be used, which embeds the first percpu chunk into bootmem with memblock. The last allocator is the first chunk page allocator, which maps the first chunk with PAGE_SIZE pages.

As I wrote above, first of all we check the first chunk allocator type in setup_per_cpu_areas. We check that the first chunk allocator is not page:

if (pcpu_chosen_fc != PCPU_FC_PAGE) {
    ...
    ...
    ...
}

If it is not PCPU_FC_PAGE, we will use the embed allocator and allocate space for the first chunk with the pcpu_embed_first_chunk function:

rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
                            dyn_size, atom_size,
                            pcpu_cpu_distance,
                            pcpu_fc_alloc, pcpu_fc_free);

As I wrote above, the pcpu_embed_first_chunk function embeds the first percpu chunk into bootmem. As you can see, we pass a couple of parameters to pcpu_embed_first_chunk; they are:

PERCPU_FIRST_CHUNK_RESERVE - the size of the reserved space for the static percpu variables;
dyn_size - minimum free size for dynamic allocation in bytes;
atom_size - all allocations are whole multiples of this and aligned to this parameter;
pcpu_cpu_distance - callback to determine distance between cpus;
pcpu_fc_alloc - function to allocate percpu page;
pcpu_fc_free - function to release percpu page.

All of these parameters are calculated before the call of pcpu_embed_first_chunk:

const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;

#ifdef CONFIG_X86_64
        atom_size = PMD_SIZE;
#else
        atom_size = PAGE_SIZE;
#endif

If the first chunk allocator is PCPU_FC_PAGE, we will use pcpu_page_first_chunk instead of pcpu_embed_first_chunk. After the percpu areas are up, we set up the percpu offset and its segment for every CPU with the setup_percpu_segment function (only for x86 systems) and move some early data from the arrays to the percpu variables (x86_cpu_to_apicid, irq_stack_ptr, etc.). After the kernel finishes the initialization process, we will have N loaded .data..percpu sections, where N is the number of CPUs, and the section used by the bootstrap processor will contain an uninitialized variable created with the DEFINE_PER_CPU macro.

The kernel provides an API for manipulating per-cpu variables:

get_cpu_var(var)
put_cpu_var(var)

Let's look at the get_cpu_var implementation:

#define get_cpu_var(var)     \
(*({                         \
         preempt_disable();  \
         this_cpu_ptr(&var); \
}))

The Linux kernel is preemptible, and accessing a per-cpu variable requires us to know which processor the kernel is running on. So the current code must not be preempted and moved to another CPU while accessing a per-cpu variable. That's why first of all we can see a call of the preempt_disable function. After this we can see a call of the this_cpu_ptr macro, which looks like:

#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)

and

#define raw_cpu_ptr(ptr)        per_cpu_ptr(ptr, 0)

where per_cpu_ptr returns a pointer to the per-cpu variable for the given cpu (second parameter). After we've created a per-cpu variable and made modifications to it, we must call the put_cpu_var macro which enables preemption with a call of the preempt_enable function. So the typical usage of a per-cpu variable is as follows:

get_cpu_var(var);
...
// Do something with the 'var'
...
put_cpu_var(var);

Let's look at the per_cpu_ptr macro:

#define per_cpu_ptr(ptr, cpu)                             \
({                                                        \
        __verify_pcpu_ptr(ptr);                           \
        SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));   \
})

As I wrote above, this macro returns a per-cpu variable for the given cpu. First of all it calls __verify_pcpu_ptr:

#define __verify_pcpu_ptr(ptr)                                          \
do {                                                                    \
        const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;    \
        (void)__vpp_verify;                                             \
} while (0)

which makes the given ptr the type of const void __percpu *.

After this we can see the call of the SHIFT_PERCPU_PTR macro with two parameters. As the first parameter we pass our ptr, and as the second we pass the cpu number to the per_cpu_offset macro:

#define per_cpu_offset(x) (__per_cpu_offset[x])

which expands to getting the x element from the __per_cpu_offset array:

extern unsigned long __per_cpu_offset[NR_CPUS];

where NR_CPUS is the number of CPUs. The __per_cpu_offset array is filled with the distances between cpu-variable copies. For example, if all per-cpu data is X bytes in size, then when we access __per_cpu_offset[Y], X*Y will be accessed. Let's look at the SHIFT_PERCPU_PTR implementation:

#define SHIFT_PERCPU_PTR(__p, __offset)                                  \
        RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))

RELOC_HIDE just returns the offset (typeof(ptr)) (__ptr + (off)) and it will return a pointer to the variable.

That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the include/linux/percpu-defs.h magic.

Let's again look at the algorithm of getting a pointer to a per-cpu variable:

The kernel creates multiple .data..percpu sections (one per-cpu) during the initialization process;
All variables created with the DEFINE_PER_CPU macro will be relocated to the first section or for CPU0;
The __per_cpu_offset array is filled with the distance (BOOT_PERCPU_OFFSET) between .data..percpu sections;
When per_cpu_ptr is called, for example for getting a pointer to a certain per-cpu variable for the third CPU, the __per_cpu_offset array will be accessed, where every index points to the required CPU.

That's all.

CPU masks

Introduction

Cpumasks is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source code and header files which contain the API for cpumask manipulation are:

include/linux/cpumask.h
lib/cpumask.c
kernel/cpu.c

As the comment in include/linux/cpumask.h says: Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. We already saw a bit about cpumask in the boot_cpu_init function from the Kernel entry point part. This function makes the first boot cpu online, active, etc.:

set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);

cpu_possible is a set of cpu IDs which can be plugged in at any time during the life of that system boot. cpu_present represents which CPUs are currently plugged in. cpu_online represents a subset of cpu_present and indicates CPUs which are available for scheduling. These masks depend on the CONFIG_HOTPLUG_CPU configuration option, and if this option is disabled, possible == present and active == online. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is true, it calls cpumask_set_cpu, otherwise it calls cpumask_clear_cpu.

There are two ways to create a cpumask. The first is to use cpumask_t. It is defined as:

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

It wraps the cpumask structure which contains one bitmap field, bits. The DECLARE_BITMAP macro gets two parameters:

bitmap name;
number of bits.

and creates an array of unsigned long with the given name. Its implementation is pretty easy:

#define DECLARE_BITMAP(name,bits) \
        unsigned long name[BITS_TO_LONGS(bits)]

where BITS_TO_LONGS:

#define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))

As we are focusing on the x86_64 architecture, unsigned long is 8 bytes (64 bits) in size, so for example with NR_CPUS equal to 8 our array will contain only one element:

(((8) + (64) - 1) / (64)) = 1

The NR_CPUS macro represents the number of CPUs in the system and depends on the CONFIG_NR_CPUS macro which is defined in include/linux/threads.h and looks like this:

#ifndef CONFIG_NR_CPUS
        #define CONFIG_NR_CPUS  1
#endif

#define NR_CPUS         CONFIG_NR_CPUS

The second way to define a cpumask is to use the DECLARE_BITMAP macro directly and the to_cpumask macro which converts the given bitmap to struct cpumask *:

#define to_cpumask(bitmap)                                              \
        ((struct cpumask *)(1 ? (bitmap)                                \
                            : (void *)sizeof(__check_is_bitmap(bitmap))))

We can see the ternary operator here, whose condition is true every time. The __check_is_bitmap inline function is defined as:

static inline int __check_is_bitmap(const unsigned long *bitmap)
{
        return 1;
}

And it returns 1 every time. We need it here for only one purpose: at compile time it checks that a given bitmap is a bitmap, or in other words it checks that a given bitmap has the type unsigned long *. So we just pass cpu_possible_bits to the to_cpumask macro for converting an array of unsigned long to struct cpumask *.

cpumask API

As we can define a cpumask with one of these methods, the Linux kernel provides an API for manipulating it. Let's consider one of the functions presented above, for example set_cpu_online. This function takes two parameters:

Number of CPU;
CPU status;

The implementation of this function looks like this:

void set_cpu_online(unsigned int cpu, bool online)
{
    if (online) {
        cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
        cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
    } else {
        cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
    }
}

First of all it checks the second state parameter and calls cpumask_set_cpu or cpumask_clear_cpu depending on it. Here we can see the cast of the second parameter of cpumask_set_cpu to struct cpumask *. In our case it is cpu_online_bits which is a bitmap and is defined as:

static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;

The cpumask_set_cpu function makes only one call to the set_bit function:

static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
    set_bit(cpumask_check(cpu), cpumask_bits(dstp));
}

The set_bit function takes two parameters too, and sets a given bit (first parameter) in memory (second parameter, the cpu_online_bits bitmap). We can see here that before set_bit is called, its two parameters are passed to:

cpumask_check;
cpumask_bits.

Let's consider these two macros. First of all, cpumask_check does nothing in our case and just returns the given parameter. The second, cpumask_bits, just returns the bits field from the given struct cpumask * structure:

#define cpumask_bits(maskp) ((maskp)->bits)

Now let's look at the set_bit implementation:

static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
    if (IS_IMMEDIATE(nr)) {
        asm volatile(LOCK_PREFIX "orb %1,%0"
            : CONST_MASK_ADDR(nr, addr)
            : "iq" ((u8)CONST_MASK(nr))
            : "memory");
    } else {
        asm volatile(LOCK_PREFIX "bts %1,%0"
            : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
    }
}

This function looks scary, but it is not as hard as it seems. First of all it passes nr, the number of the bit, to the IS_IMMEDIATE macro which just calls the GCC internal __builtin_constant_p function:

#define IS_IMMEDIATE(nr)    (__builtin_constant_p(nr))

__builtin_constant_p checks that the given parameter is a known constant at compile time. As our cpu is not a compile-time constant, the else clause will be executed:

asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory");

Let's try to understand how it works step by step:

LOCK_PREFIX is the x86 lock prefix. It tells the cpu to occupy the system bus while the instruction(s) will be executed. This allows the CPU to synchronize memory access, preventing simultaneous access of multiple processors (or devices - the DMA controller for example) to one memory cell.

BITOP_ADDR casts the given parameter to (*(volatile long *)) and adds the +m constraint. + means that this operand is both read and written by the instruction. m shows that this is a memory operand. BITOP_ADDR is defined as:

#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))

Next is the memory clobber. It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters).

Ir - immediate or register operand.

The bts instruction sets a given bit in a bit string and stores the value of a given bit in the CF flag. So we passed the cpu number, which is zero in our case, and after set_bit is executed, it sets the zero bit in the cpu_online_bits cpumask. It means that the first cpu is online at this moment.

Additional cpumask API

Besides the set_cpu_* API, cpumask of course provides another API for cpumask manipulation. Let's consider it in short.

cpumask provides a set of macros for getting the numbers of CPUs in various states. For example:

#define num_online_cpus()   cpumask_weight(cpu_online_mask)

This macro returns the number of online CPUs. It calls the cpumask_weight function with the cpu_online_mask bitmap (read about it). The cpumask_weight function makes one call of the bitmap_weight function with two parameters:

cpumask bitmap;
nr_cpumask_bits - which is NR_CPUS in our case.

static inline unsigned int cpumask_weight(const struct cpumask *srcp)
{
    return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
}

and calculates the number of set bits in the given bitmap. Besides num_online_cpus, cpumask provides macros for all the CPU states:

num_possible_cpus;
num_active_cpus;
cpu_online;
cpu_possible.

and many more.

Besides that, the Linux kernel provides the following API for the manipulation of a cpumask:

for_each_cpu - iterates over every cpu in a mask;
for_each_cpu_not - iterates over every cpu in a complemented mask;
cpumask_clear_cpu - clears a cpu in a cpumask;
cpumask_test_cpu - tests a cpu in a mask;
cpumask_setall - sets all cpus in a mask;
cpumask_size - returns size to allocate for a 'struct cpumask' in bytes;

and many many more...

Links

cpumask documentation

Data Structures in the Linux Kernel

The Linux kernel provides different implementations of data structures like doubly linked list, B+ tree, priority heap and many many more.

This part considers the following data structures and algorithms:

Doubly linked list
Radix tree

Data Structures in the Linux Kernel

Doubly linked list

The Linux kernel provides its own implementation of a doubly linked list, which you can find in include/linux/list.h. We will start Data Structures in the Linux kernel from the doubly linked list data structure. Why? Because it is very popular in the kernel, just try to search

First of all, let's look at the main structure in include/linux/types.h:

struct list_head {
    struct list_head *next, *prev;
};

You can note that it is different from many implementations of doubly linked list which you have seen. For example, the doubly linked list structure from the glib library looks like:

struct GList {
    gpointer data;
    GList *next;
    GList *prev;
};

Usually a linked list structure contains a pointer to the item. The implementation of linked list in the Linux kernel does not. So the main question is - where does the list store the data? The actual implementation of linked list in the kernel is an intrusive list. An intrusive linked list does not contain data in its nodes - a node just contains pointers to the next and previous nodes, and the list nodes are part of the data that are added to the list. This makes the data structure generic, so it does not care about the entry data type anymore.

For example:

struct nmi_desc {
    spinlock_t lock;
    struct list_head head;
};

Let's look at some examples to understand how list_head is used in the kernel. As I already wrote, there are many, really many different places where lists are used in the kernel. Let's look for an example in miscellaneous character drivers. The misc character drivers API from drivers/char/misc.c is used for writing small drivers for handling simple hardware or virtual devices. Those drivers share the same major number:

#define MISC_MAJOR              10

but have their own minor number. For example you can see it with:

ls -l /dev | grep 10
crw------- 1 root root 10, 235 Mar 21 12:01 autofs
drwxr-xr-x 10 root root 200 Mar 21 12:01 cpu
crw------- 1 root root 10, 62 Mar 21 12:01 cpu_dma_latency
crw------- 1 root root 10, 203 Mar 21 12:01 cuse

drwxr-xr-x 2 root root 100 Mar 21 12:01 dri
crw-rw-rw- 1 root root 10, 229 Mar 21 12:01 fuse
crw------- 1 root root 10, 228 Mar 21 12:01 hpet
crw------- 1 root root 10, 183 Mar 21 12:01 hwrng
crw-rw----+ 1 root kvm 10, 232 Mar 21 12:01 kvm
crw-rw---- 1 root disk 10, 237 Mar 21 12:01 loop-control
crw------- 1 root root 10, 227 Mar 21 12:01 mcelog
crw------- 1 root root 10, 59 Mar 21 12:01 memory_bandwidth
crw------- 1 root root 10, 61 Mar 21 12:01 network_latency
crw------- 1 root root 10, 60 Mar 21 12:01 network_throughput
crw-r----- 1 root kmem 10, 144 Mar 21 12:01 nvram
brw-rw---- 1 root disk 1, 10 Mar 21 12:01 ram10
crw--w---- 1 root tty 4, 10 Mar 21 12:01 tty10
crw-rw---- 1 root dialout 4, 74 Mar 21 12:01 ttyS10
crw------- 1 root root 10, 63 Mar 21 12:01 vga_arbiter
crw------- 1 root root 10, 137 Mar 21 12:01 vhci

Now let's have a close look at how lists are used in the misc device drivers. First of all, let's look at the miscdevice structure:

struct miscdevice
{
    int minor;
    const char *name;
    const struct file_operations *fops;
    struct list_head list;
    struct device *parent;
    struct device *this_device;
    const char *nodename;
    mode_t mode;
};

We can see the fourth field in the miscdevice structure - list, which is a list of registered devices. In the beginning of the source code file we can see the definition of misc_list:

static LIST_HEAD(misc_list);

which expands to the definition of a variable with the list_head type:

#define LIST_HEAD(name) \
    struct list_head name = LIST_HEAD_INIT(name)

and initializes it with the LIST_HEAD_INIT macro, which sets the previous and next entries to the address of the variable - name:

#define LIST_HEAD_INIT(name) { &(name), &(name) }

Now let's look at the misc_register function which registers a miscellaneous device. At the start it initializes miscdevice->list with the INIT_LIST_HEAD function:

INIT_LIST_HEAD(&misc->list);

which does the same as the LIST_HEAD_INIT macro:

static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

In the next step, after the device is created by the device_create function, we add it to the miscellaneous devices list with:

    list_add(&misc->list, &misc_list);

The kernel's list.h provides this API for adding a new entry to the list. Let's look at its implementation:

    static inline void list_add(struct list_head *new, struct list_head *head)
    {
        __list_add(new, head, head->next);
    }

It just calls the internal function __list_add with three parameters:

new - the new entry;
head - the list head after which the new item will be inserted;
head->next - the item that currently follows the list head.

The implementation of __list_add is pretty simple:

    static inline void __list_add(struct list_head *new,
                                  struct list_head *prev,
                                  struct list_head *next)
    {
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
    }

Here we add the new item between prev and next. So misc_list, which we defined at the start with the LIST_HEAD macro, will now have its previous and next pointers pointing to miscdevice->list.
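
The pointer updates above are easier to follow in a small user-space sketch. The code below is not the kernel's list.h; it copies the same idea into plain C, and the names init_list_head, list_add_between, list_add_after_head and list_length are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* User-space copy of the kernel's list node: just two pointers. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* An empty list is a head whose next and prev point back at itself. */
void init_list_head(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

/* Link new_entry between two neighbours that are already adjacent. */
void list_add_between(struct list_head *new_entry,
                      struct list_head *prev,
                      struct list_head *next)
{
    next->prev = new_entry;
    new_entry->next = next;
    new_entry->prev = prev;
    prev->next = new_entry;
}

/* Insert new_entry right after head, in the same way list_add() does. */
void list_add_after_head(struct list_head *new_entry, struct list_head *head)
{
    assert(new_entry != NULL && head != NULL);
    list_add_between(new_entry, head, head->next);
}

/* Count entries by following next pointers until we are back at head. */
size_t list_length(const struct list_head *head)
{
    size_t n = 0;
    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

After the first insertion into an empty list, both head.next and head.prev point to the new node, which is exactly the situation described for misc_list above. Each further insertion at the head pushes the older entries towards the tail.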

There is still one question: how do we get from a list node back to the entry that contains it? There is a special macro for this:

    #define list_entry(ptr, type, member) \
        container_of(ptr, type, member)

which takes three parameters:

ptr - the pointer to the list_head structure;
type - the type of the structure that contains it;
member - the name of the list_head field within that structure.

For example:

    const struct miscdevice *p = list_entry(v, struct miscdevice, list);

After this we can access any miscdevice field with p->minor, p->name and so on. As the definition of list_entry shows, it just calls the container_of macro with the same arguments. At first sight, container_of looks strange:

    #define container_of(ptr, type, member) ({                      \
        const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
        (type *)( (char *)__mptr - offsetof(type,member) );})

First of all, note that it consists of two expressions in curly brackets wrapped in parentheses. This is a GCC extension called a statement expression: the compiler evaluates the whole block in the curly braces and uses the value of the last expression.

For example:

    #include <stdio.h>

    int main() {
        int i = 0;
        printf("i = %d\n", ({++i; ++i;}));
        return 0;
    }

will print 2.

The next point is typeof, and it is simple: as you can understand from its name, it just returns the type of the given variable. When I first saw the implementation of the container_of macro, the strangest thing I found was the zero in the ((type *)0) expression. Actually this pointer magic calculates the offset of the given field from the address of the structure, and since the structure address here is 0, the address of the field is equal to its offset. Let's look at a simple example:

    #include <stdio.h>

    struct s {
        int field1;
        char field2;
        char field3;
    };

    int main() {
        printf("%p\n", &((struct s *)0)->field3);
        return 0;
    }

will print 0x5.

Next, the offsetof macro calculates the offset from the beginning of the structure to the given field. Its implementation is very similar to the previous code:

    #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

Let's summarize everything about the container_of macro. It returns the address of the containing structure given three things: the address of a field inside the structure (here a field of the list_head type), the name of that field, and the type of the containing structure. The first line declares the __mptr pointer, whose type is a pointer to the member's type, and assigns ptr to it. Now ptr and __mptr point to the same address. Technically we don't need this line, but it is useful for type checking: it ensures that the given structure (the type parameter) has a member called member and that ptr is compatible with it. The second line calculates the offset of the field within the structure with the offsetof macro and subtracts it from the field's address, which gives the address of the structure. That's all.
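
To see the same pointer arithmetic outside the kernel, here is a minimal sketch that uses the standard offsetof from stddef.h. The macro demo_container_of and the structure demo_device are hypothetical names for this example only. This variant drops typeof and the statement expression, so it is portable but loses the type check that the kernel macro performs.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical structure that embeds a list node, as miscdevice does. */
struct demo_device {
    int minor;
    const char *name;
    struct list_head list;
};

/* Step back from the member's address by the member's offset. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the device from a pointer to its embedded list node. */
struct demo_device *device_from_node(struct list_head *node)
{
    assert(node != NULL);
    return demo_container_of(node, struct demo_device, list);
}
```

Given a struct demo_device dev, the call device_from_node(&dev.list) returns &dev, so fields such as minor and name become reachable from nothing more than the list node.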

Of course list_add and list_entry are not the only functions which <linux/list.h> provides. The implementation of the doubly linked list provides the following API:

list_add
list_add_tail
list_del
list_replace
list_move
list_is_last
list_empty
list_cut_position
list_splice
list_for_each
list_for_each_entry

and many more.

Data Structures in the Linux Kernel

Radix tree

As you already know, the linux kernel provides many different libraries and functions which implement different data structures and algorithms. In this part we will consider one of these data structures, the radix tree. There are two files which are related to the radix tree implementation and API in the linux kernel:

include/linux/radix-tree.h
lib/radix-tree.c

Let's talk about what a radix tree is. A radix tree is a compressed trie, where a trie is a data structure which implements the interface of an associative array and allows storing values as key-value pairs. The keys are usually strings, but any data type can be used. A trie differs from an n-ary tree in its nodes. Nodes of a trie do not store keys; instead, a node of a trie stores a single character label. The key which is related to a given node is derived by traversing from the root of the tree to this node. For example:

                   +-----------+
                   |           |
                   |    ""     |
                   |           |
            +------+-----------+------+
            |                         |
            |                         |
       +----v------+            +-----v-----+
       |           |            |           |
       |    g      |            |     c     |
       |           |            |           |
       +-----------+            +-----------+
            |                         |
            |                         |
       +----v------+            +-----v-----+
       |           |            |           |
       |    o      |            |     a     |
       |           |            |           |
       +-----------+            +-----------+
                                      |
                                      |
                                +-----v-----+
                                |           |
                                |     t     |
                                |           |
                                +-----------+

So in this example we can see a trie with the keys go and cat. A compressed trie, or radix tree, differs from a trie in that all intermediate nodes which have only one child are removed.
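
The (uncompressed) trie from the picture can be sketched in a few lines of C. This is only an illustration of the idea and has nothing to do with the kernel's radix tree code. It assumes that keys contain only the lowercase letters a to z, and the names trie_node, trie_new_node, trie_insert and trie_contains are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET 26  /* assumption: keys use only 'a'..'z' */

/* A node stores no key, only one child link per possible next character. */
struct trie_node {
    struct trie_node *child[ALPHABET];
    bool is_key;  /* true if the path from the root to here spells a stored key */
};

struct trie_node *trie_new_node(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Walk the key one character at a time, creating missing nodes on the way. */
bool trie_insert(struct trie_node *root, const char *key)
{
    assert(root != NULL && key != NULL);
    struct trie_node *node = root;
    for (; *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z')
            return false;
        int i = *key - 'a';
        if (node->child[i] == NULL) {
            node->child[i] = trie_new_node();
            if (node->child[i] == NULL)
                return false;
        }
        node = node->child[i];
    }
    node->is_key = true;
    return true;
}

/* A key is present only if its whole path exists and ends on a marked node. */
bool trie_contains(const struct trie_node *root, const char *key)
{
    const struct trie_node *node = root;
    for (; node != NULL && *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z')
            return false;
        node = node->child[*key - 'a'];
    }
    return node != NULL && node->is_key;
}
```

After inserting go and cat, a lookup of ca fails even though the path c, a exists, because that node is not marked as the end of a key. The sketch never frees its nodes, which is acceptable for a demonstration but not for real code.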

The radix tree in the linux kernel is a data structure which maps values to integer keys. It is represented by the following structures from the file include/linux/radix-tree.h:

    struct radix_tree_root {
        unsigned int height;
        gfp_t gfp_mask;
        struct radix_tree_node __rcu *rnode;
    };

This structure represents the root of a radix tree and contains three fields:

height - the height of the tree;
gfp_mask - tells how memory allocations will be performed;
rnode - a pointer to the child node.

The first field we will discuss is gfp_mask. Low-level kernel memory allocation functions take a set of flags, the gfp_mask, which describes how the allocation is to be performed. These GFP_ flags, which control the allocation process, can have the following values:

GFP_NOIO - can sleep and wait for memory;
__GFP_HIGHMEM - high memory can be used;
GFP_ATOMIC - the allocation is high-priority and can't sleep;

etc.

The next field is rnode:

    struct radix_tree_node {
        unsigned int path;
        unsigned int count;
        union {
            struct {
                struct radix_tree_node *parent;
                void *private_data;
            };
            struct rcu_head rcu_head;
        };
        /* For tree user */
        struct list_head private_list;
        void __rcu *slots[RADIX_TREE_MAP_SIZE];
        unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
    };

This structure contains information about the offset in a parent and the height from the bottom, the count of the child nodes, and fields for accessing and freeing a node. These fields are described below:

path - offset in the parent and height from the bottom;
count - count of the child nodes;
parent - pointer to the parent node;
private_data - used by the user of a tree;
rcu_head - used for freeing a node;
private_list - used by the user of a tree.

The last two fields of the radix_tree_node, tags and slots, are important and interesting. Every node can contain a set of slots which store pointers to the data. Empty slots in the linux kernel radix tree implementation store NULL. Radix trees in the linux kernel also support tags, which are kept in the tags field of the radix_tree_node structure. Tags allow individual bits to be set on records which are stored in the radix tree.

Linux kernel radix tree API

Now that we know about the radix tree structure, it is time to look at its API.

We start with the data structure initialization. There are two ways to initialize a new radix tree. The first is to use the RADIX_TREE macro:

    RADIX_TREE(name, gfp_mask);

As you can see, we pass the name parameter, so with the RADIX_TREE macro we can define and initialize a radix tree with the given name. The implementation of RADIX_TREE is easy:

    #define RADIX_TREE(name, mask) \
        struct radix_tree_root name = RADIX_TREE_INIT(mask)

    #define RADIX_TREE_INIT(mask) { \
        .height = 0,                \
        .gfp_mask = (mask),         \
        .rnode = NULL,              \
    }

At the beginning of the RADIX_TREE macro we define an instance of the radix_tree_root structure with the given name and call the RADIX_TREE_INIT macro with the given mask. The RADIX_TREE_INIT macro just initializes the radix_tree_root structure with the default values and the given mask.

The second way is to define a radix_tree_root structure by hand and pass a pointer to it, together with the mask, to the INIT_RADIX_TREE macro:

    struct radix_tree_root my_radix_tree;
    INIT_RADIX_TREE(&my_radix_tree, gfp_mask_for_my_radix_tree);

where:

    #define INIT_RADIX_TREE(root, mask) \
    do {                                \
        (root)->height = 0;             \
        (root)->gfp_mask = (mask);      \
        (root)->rnode = NULL;           \
    } while (0)

performs the same initialization with default values as the RADIX_TREE_INIT macro does.

Next are two functions for inserting and deleting records to/from a radix tree:

radix_tree_insert;
radix_tree_delete.

The first, radix_tree_insert, takes three parameters:

the root of a radix tree;
the index key;
the data to insert.

The radix_tree_delete function takes the same set of parameters as radix_tree_insert, but without the data.

Search in a radix tree is implemented in three ways:

radix_tree_lookup;
radix_tree_gang_lookup;
radix_tree_lookup_slot.

The first, radix_tree_lookup, takes two parameters:

the root of a radix tree;
the index key.

This function tries to find the given key in the tree and returns the record associated with this key. The second, radix_tree_gang_lookup, has the following signature:

    unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
                                        void **results,
                                        unsigned long first_index,
                                        unsigned int max_items);

It returns the number of records found, sorted by key and starting from first_index. The number of returned records will not be greater than max_items.

And the last, radix_tree_lookup_slot, returns the slot which contains the data.

Links

Radix tree
Trie

Theory

This chapter describes various theoretical concepts, as well as concepts which are not directly related to practice but are useful to know.

Paging
Elf64 format

Paging

Introduction

In the fifth part of the series Linux kernel booting process we learned what the kernel does in its earliest stage. In the next step the kernel will initialize different things like initrd mounting, lockdep initialization, and many, many other things, before we can see how the kernel runs the first init process.

Yeah, there will be many different things, but many of them, once again, involve work with memory.

In my view, memory management is one of the most complex parts of the linux kernel and of system programming in general. This is why, before we proceed with the kernel initialization stuff, we need to get acquainted with paging.

Paging is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode, where physical addresses are calculated by shifting a segment register left by four bits and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now that we are in 64-bit mode, we will see paging.

As the Intel manual says:

Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program’s execution environment are mapped into physical memory as needed.

So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the x86_64 version of the linux kernel, but we will not go into too much detail (at least in this post).

Enabling paging

There are three paging modes:

32-bit paging;
PAE paging;
IA-32e paging.

We will only explain the last mode here. To enable the IA-32e paging mode we need to do the following things:

set the CR0.PG bit;
set the CR4.PAE bit;
set the IA32_EFER.LME bit.

We already saw where these bits were set in arch/x86/boot/compressed/head_64.S:

    movl    $(X86_CR0_PG | X86_CR0_PE), %eax
    movl    %eax, %cr0

and

    movl    $MSR_EFER, %ecx
    rdmsr
    btsl    $_EFER_LME, %eax
    wrmsr

Paging structures

Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or even external storage. This fixed size is 4096 bytes for the x86_64 linux kernel. To translate a linear address to a physical address, special structures are used. Every structure is 4096 bytes in size and contains 512 entries (this holds only for the PAE and IA32_EFER.LME modes). Paging structures are hierarchical, and the linux kernel uses 4 levels of paging on the x86_64 architecture. The CPU uses a part of the linear address to identify the entry in another paging structure at the lower level, a physical memory region (page frame), or a physical address within this region (page offset). The address of the top-level paging structure is located in the cr3 register. We already saw this in arch/x86/boot/compressed/head_64.S:

    leal    pgtable(%ebx), %eax
    movl    %eax, %cr3

We built the page table structures and put the address of the top-level structure in the cr3 register. Here cr3 is used to store the address of the top-level structure, the PML4, or Page Global Directory as it is called in the linux kernel. cr3 is a 64-bit register and has the following structure:

    63                  52 51                                                        32
     --------------------------------------------------------------------------------
    |                     |                                                          |
    |    Reserved MBZ     |            Address of the top level structure            |
    |                     |                                                          |
     --------------------------------------------------------------------------------
    31                                  12 11            5     4     3 2             0
     --------------------------------------------------------------------------------
    |                                     |               |  P  |  P  |              |
    |  Address of the top level structure |   Reserved    |  C  |  W  |   Reserved   |
    |                                     |               |  D  |  T  |              |
     --------------------------------------------------------------------------------

These fields have the following meanings:

Bits 63:52 - reserved, must be 0;
Bits 51:12 - store the address of the top-level paging structure;
Bits 11:5 - reserved, must be 0;
Bit 3 and bit 4 - PWT (page-level write-through) and PCD (page-level cache disable). These bits control the way the page or page table is handled by the hardware cache;
Bits 2:0 - ignored.
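
The field layout above translates directly into bit masks. The following sketch only decodes a 64-bit value that has the cr3 layout; it does not read the real register, which is possible only in privileged code. The macro and function names are chosen for this example and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field positions taken from the cr3 layout described in the text. */
#define CR3_PWT       (1ULL << 3)            /* page-level write-through */
#define CR3_PCD       (1ULL << 4)            /* page-level cache disable */
#define CR3_ADDR_MASK 0x000ffffffffff000ULL  /* bits 51:12 */

/* The top-level table is 4096-byte aligned, so bits 11:0 never belong to the address. */
static_assert((CR3_ADDR_MASK & 0xfffULL) == 0, "address field must exclude bits 11:0");

/* Physical address of the top-level paging structure. */
uint64_t cr3_table_address(uint64_t cr3)
{
    return cr3 & CR3_ADDR_MASK;
}

bool cr3_write_through(uint64_t cr3)
{
    return (cr3 & CR3_PWT) != 0;
}

bool cr3_cache_disabled(uint64_t cr3)
{
    return (cr3 & CR3_PCD) != 0;
}
```

Because the table address always has its low 12 bits clear, the hardware can reuse those bits for flags without losing any address information.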

The linear address translation works as follows:

A given linear address arrives at the MMU instead of the memory bus.
The 64-bit linear address is split into several parts. Only the low 48 bits are significant, which means that 2^48 bytes or 256 TBytes of linear-address space may be accessed at any given time.
The cr3 register stores the address of the level-4 (top-level) paging structure.
Bits 47:39 of the given linear address store an index into the level-4 paging structure, bits 38:30 store an index into the level-3 paging structure, bits 29:21 store an index into the level-2 paging structure, bits 20:12 store an index into the level-1 paging structure, and bits 11:0 provide the byte offset into the physical page.

Schematically, we can imagine it like this:

[Figure: 4-level linear address translation]

Every access to a linear address is either a supervisor-mode access or a user-mode access. The kind of access is determined by the CPL (current privilege level): if CPL < 3 it is a supervisor-mode access, otherwise it is a user-mode access. For example, the top-level page table entry contains access bits and has the following structure:

    63  62                  52 51                                                    32
     --------------------------------------------------------------------------------
    | N |                     |                                                      |
    |   |      Available      |    Address of the paging structure on lower level    |
    | X |                     |                                                      |
     --------------------------------------------------------------------------------
    31                                              12 11  9  8 7  6   5   4   3  2 1  0
     --------------------------------------------------------------------------------
    |                                                |     | M |I|   | P | P |U|W|   |
    | Address of the paging structure on lower level | AVL | B |G| A | C | W | | | P |
    |                                                |     | Z |N|   | D | T |S|R|   |
     --------------------------------------------------------------------------------

Where:

Bit 63 - the N/X bit (No Execute bit), which controls the ability to execute code from the physical pages mapped by the table entry;
Bits 62:52 - ignored by the CPU, used by system software;
Bits 51:12 - store the physical address of the lower-level paging structure;
Bits 11:9 - ignored by the CPU;
MBZ - must-be-zero bits;
IGN - ignored bits;
A - the accessed bit, which indicates whether the physical page or page structure was accessed;
PWT and PCD - used for cache control;
U/S - the user/supervisor bit, which controls user access to all the physical pages mapped by this table entry;
R/W - the read/write bit, which controls read/write access to all the physical pages mapped by this table entry;
P - the present bit, which indicates whether the page table or physical page is loaded into primary memory or not.
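
As a rough illustration of these fields, the sketch below decodes a raw 64-bit entry into its parts. The bit positions follow the layout shown above; the structure entry_info and the function decode_entry are hypothetical helpers for this example, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_P         (1ULL << 0)   /* present */
#define ENTRY_RW        (1ULL << 1)   /* read/write */
#define ENTRY_US        (1ULL << 2)   /* user/supervisor */
#define ENTRY_A         (1ULL << 5)   /* accessed */
#define ENTRY_NX        (1ULL << 63)  /* no execute */
#define ENTRY_ADDR_MASK 0x000ffffffffff000ULL  /* bits 51:12 */

static_assert((ENTRY_ADDR_MASK &
               (ENTRY_P | ENTRY_RW | ENTRY_US | ENTRY_A | ENTRY_NX)) == 0,
              "flag bits must not overlap the address field");

struct entry_info {
    uint64_t next_table;  /* physical address of the lower-level structure */
    bool present;
    bool writable;
    bool user;
    bool accessed;
    bool no_execute;
};

/* Split a raw paging-structure entry into its address and access bits. */
struct entry_info decode_entry(uint64_t entry)
{
    struct entry_info info;
    info.next_table = entry & ENTRY_ADDR_MASK;
    info.present    = (entry & ENTRY_P) != 0;
    info.writable   = (entry & ENTRY_RW) != 0;
    info.user       = (entry & ENTRY_US) != 0;
    info.accessed   = (entry & ENTRY_A) != 0;
    info.no_execute = (entry & ENTRY_NX) != 0;
    return info;
}
```

For instance, an entry whose low three bits are all set (value 0x7 in bits 2:0) describes a present, writable, user-accessible mapping.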

OK, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the linux kernel.

Paging structures in the linux kernel

As we've seen, the linux kernel on x86_64 uses 4-level page tables. Their names are:

Page Global Directory
Page Upper Directory
Page Middle Directory
Page Table Entry

After you've compiled and installed the linux kernel, you can see the System.map file, which stores the virtual addresses of the functions that are used by the kernel. For example:

    $ grep "start_kernel" System.map
    ffffffff81efe497 T x86_64_start_kernel
    ffffffff81efeaa2 T start_kernel

We can see 0xffffffff81efe497 here. I doubt you really have that much RAM installed. But anyway, start_kernel and x86_64_start_kernel will be executed. The address space in x86_64 is 2^64 bytes in size, but that is too large, which is why a smaller address space is used, only 48 bits wide. So we have a situation where the virtual address space is limited to 48 bits, but addressing is still performed with 64-bit pointers. How is this problem solved? Look at this diagram:

    0xffffffffffffffff  +-----------+
                        |           |
                        |           | Kernel space
                        |           |
    0xffff800000000000  +-----------+
                        |           |
                        |           |
                        |   hole    |
                        |           |
                        |           |
    0x00007fffffffffff  +-----------+
                        |           |
                        |           | User space
                        |           |
    0x0000000000000000  +-----------+

The solution is sign extension. Here we can see that the lower 48 bits of a virtual address are used for addressing. Bits 63:48 can be either all zeroes or all ones, matching bit 47. Note that the virtual address space is split into 2 parts:

Kernel space
User space
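
The sign-extension rule can be checked with a couple of shifts. The sketch below is an illustration with made-up function names; it tests whether bits 63:47 of an address are all equal (such addresses are called canonical) and shows how a 48-bit value is extended to 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An address is valid only if bits 63:47 are all zeroes or all ones. */
bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;  /* the 17 bits 63:47 */
    return top == 0 || top == 0x1ffffULL;
}

/* Copy bit 47 into bits 63:48 of a 48-bit value. */
uint64_t sign_extend_48(uint64_t low48)
{
    assert((low48 >> 48) == 0);  /* caller must pass at most 48 bits */
    if (low48 & (1ULL << 47))
        return low48 | 0xffff000000000000ULL;
    return low48;
}
```

With this rule the first address of the upper half, 0x800000000000 in 48 bits, becomes 0xffff800000000000, and everything between the two halves is rejected by is_canonical.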

User space occupies the lower part of the virtual address space, from 0x0000000000000000 to 0x00007fffffffffff, and kernel space occupies the highest part, from 0xffff800000000000 to 0xffffffffffffffff. Note that bits 63:48 are all 0 for user space and all 1 for kernel space. All addresses which are in kernel space or in user space, or in other words whose bits 63:48 are all zeroes or all ones, are called canonical addresses. There is a non-canonical area between these memory regions. Together these two memory regions (kernel space and user space) are exactly 2^48 bytes wide. We can find the virtual memory map with 4-level page tables in Documentation/x86/x86_64/mm.txt:

    0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
    hole caused by [48:63] sign extension
    ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
    ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
    ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
    ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
    ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
    ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
    ... unused hole ...
    ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
    ... unused hole ...
    ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
    ... unused hole ...
    ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
    ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
    ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
    ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole

We can see here the memory map for user space, kernel space and the non-canonical area in between them. The user space memory map is simple. Let's take a closer look at the kernel space. We can see that it starts with the guard hole, which is reserved for the hypervisor. We can find the definition of this guard hole in arch/x86/include/asm/page_64_types.h:

    #define __PAGE_OFFSET           _AC(0xffff880000000000, UL)

Previously this guard hole spanned 0xffff800000000000 to 0xffff80ffffffffff, with __PAGE_OFFSET right after it, to prevent access to the non-canonical area, but it was later extended by 3 bits for the hypervisor.

Next is the lowest usable address in kernel space, ffff880000000000. This virtual memory region is for the direct mapping of all the physical memory. After the memory space which maps all physical addresses comes another hole. It needs to be between the direct mapping of all the physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the kasan shadow memory. It was added by a commit and provides the kernel address sanitizer. After the next unused hole we can see the esp fixup stacks (we will talk about them in other parts of this book) and the start of the kernel text mapping from physical address 0. We can find the definition of this address in the same file as __PAGE_OFFSET:

    #define __START_KERNEL_map      _AC(0xffffffff80000000, UL)

Usually the kernel's .text starts here with the CONFIG_PHYSICAL_START offset. We saw it in the post about ELF64:

    readelf -s vmlinux | grep ffffffff81000000
         1: ffffffff81000000     0 SECTION LOCAL  DEFAULT    1
     65099: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 _text
     90766: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 startup_64

Here I checked a vmlinux built with CONFIG_PHYSICAL_START set to 0x1000000. So we have the start point of the kernel .text, 0xffffffff80000000, and the offset, 0x1000000; the resulting virtual address is 0xffffffff80000000 + 0x1000000 = 0xffffffff81000000.

After the kernel .text region there are the virtual memory regions for kernel modules and vsyscalls, and an unused hole of 2 megabytes.

We've seen how the kernel's virtual memory map is laid out and how a virtual address is translated into a physical one. Let's take for example the following address:

    0xffffffff81000000

In binary it will be:

    1111111111111111 111111111 111111110 000001000 000000000 000000000000
         63:48         47:39     38:30     29:21     20:12       11:0

This virtual address is split into parts as described above:

63:48 - bits not used;
47:39 - bits which store an index into the level-4 paging structure;
38:30 - bits which store an index into the level-3 paging structure;
29:21 - bits which store an index into the level-2 paging structure;
20:12 - bits which store an index into the level-1 paging structure;
11:0 - bits which provide the byte offset into the physical page.
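
The same split can be done with shifts and masks. This is a small sketch under the 4-level layout described above (four 9-bit indexes and a 12-bit offset); the structure va_parts and the function split_virtual_address are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS  9   /* 512 entries per paging structure */
#define OFFSET_BITS 12  /* 4096-byte pages */
#define INDEX_MASK  ((1ULL << INDEX_BITS) - 1)

static_assert(4 * INDEX_BITS + OFFSET_BITS == 48,
              "four 9-bit indexes plus a 12-bit offset cover 48 bits");

struct va_parts {
    unsigned int level4;  /* bits 47:39 */
    unsigned int level3;  /* bits 38:30 */
    unsigned int level2;  /* bits 29:21 */
    unsigned int level1;  /* bits 20:12 */
    unsigned int offset;  /* bits 11:0  */
};

/* Cut a virtual address into the indexes used at each paging level. */
struct va_parts split_virtual_address(uint64_t va)
{
    struct va_parts p;
    p.level4 = (unsigned int)((va >> 39) & INDEX_MASK);
    p.level3 = (unsigned int)((va >> 30) & INDEX_MASK);
    p.level2 = (unsigned int)((va >> 21) & INDEX_MASK);
    p.level1 = (unsigned int)((va >> 12) & INDEX_MASK);
    p.offset = (unsigned int)(va & ((1ULL << OFFSET_BITS) - 1));
    return p;
}
```

For 0xffffffff81000000 this gives the indexes 511, 510, 8 and 0 with offset 0, which matches the binary groups shown above (111111111, 111111110, 000001000, 000000000).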

That is all. Now you know a little about the theory of paging, and we can go ahead in the kernel source code and see the first initialization steps.

Conclusion

It's the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the linux kernel builds paging structures and works with them.

Please note that English is not my first language and I am really sorry for any inconvenience. If you've found any mistakes please send me a PR to linux-internals.

Links

Paging on Wikipedia
Intel 64 and IA-32 architectures software developer's manual volume 3A
MMU
ELF64
Documentation/x86/x86_64/mm.txt
Last part - Kernel booting process

Executable and Linkable Format

ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries and core dumps. Linux and many UNIX-like operating systems use this format. Let's look at the structure of the ELF-64 Object File Format and some definitions in the linux kernel source code which are related to it.

An ELF object file consists of the following parts:

ELF header - describes the main characteristics of the object file: type, CPU architecture, the virtual address of the entry point, the size and offset of the remaining parts, etc.;
Program header table - lists the available segments and their attributes. Loaders need the program header table to place sections of the file as virtual memory segments;
Section header table - contains the description of the sections.

Now let's have a closer look at these components.

ELF header

The ELF header is located at the beginning of the object file. Its main purpose is to locate all other parts of the object file. The file header contains the following fields:

ELF identification - an array of bytes which helps identify the file as an ELF object file and also provides information about general object file characteristics;
Object file type - identifies the object file type. This field can describe that the ELF file is a relocatable object file, an executable file, etc.;
Target architecture;
Version of the object file format;
Virtual address of the program entry point;
File offset of the program header table;
File offset of the section header table;
Size of an ELF header;
Size of a program header table entry;
and other fields...

You can find the elf64_hdr structure, which represents the ELF64 header, in the linux kernel source code:

    typedef struct elf64_hdr {
        unsigned char e_ident[EI_NIDENT];
        Elf64_Half e_type;
        Elf64_Half e_machine;
        Elf64_Word e_version;
        Elf64_Addr e_entry;
        Elf64_Off e_phoff;
        Elf64_Off e_shoff;
        Elf64_Word e_flags;
        Elf64_Half e_ehsize;
        Elf64_Half e_phentsize;
        Elf64_Half e_phnum;
        Elf64_Half e_shentsize;
        Elf64_Half e_shnum;
        Elf64_Half e_shstrndx;
    } Elf64_Ehdr;

This structure is defined in elf.h.
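
To make the role of e_ident more concrete, here is a minimal user-space sketch that inspects the identification bytes of a buffer. It assumes a Linux system where the C library ships <elf.h> with the usual constants (ELFMAG, SELFMAG, EI_CLASS, ELFCLASS64, EI_DATA, ELFDATA2LSB, EI_NIDENT). The function names are made up for this example.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Check the magic bytes and the class byte at the start of an ELF file. */
bool looks_like_elf64(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < EI_NIDENT)
        return false;                       /* too short to hold e_ident */
    if (memcmp(buf, ELFMAG, SELFMAG) != 0)
        return false;                       /* magic is 0x7f 'E' 'L' 'F' */
    return buf[EI_CLASS] == ELFCLASS64;
}

/* The data-encoding byte tells whether multi-byte fields are little endian. */
bool is_little_endian_elf(const unsigned char *ident)
{
    assert(ident != NULL);
    return ident[EI_DATA] == ELFDATA2LSB;
}
```

A buffer that starts with the bytes 7f 45 4c 46 02 01 passes both checks: 02 in the class byte means 64-bit and 01 in the data byte means little endian.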

Sections

All data is stored in sections in an ELF object file. Sections are identified by their index in the section header table. A section header contains the following fields:

Section name;
Section type;
Section attributes;
Virtual address in memory;
Offset in file;
Size of section;
Link to other section;
Miscellaneous information;
Address alignment boundary;
Size of entries, if the section has a table.

It is represented by the following elf64_shdr structure in the linux kernel:

    typedef struct elf64_shdr {
        Elf64_Word sh_name;
        Elf64_Word sh_type;
        Elf64_Xword sh_flags;
        Elf64_Addr sh_addr;
        Elf64_Off sh_offset;
        Elf64_Xword sh_size;
        Elf64_Word sh_link;
        Elf64_Word sh_info;
        Elf64_Xword sh_addralign;
        Elf64_Xword sh_entsize;
    } Elf64_Shdr;

This structure is also defined in elf.h.

Program header table

All sections are grouped into segments in an executable or shared object file. The program header table is an array of structures, each of which describes one segment. It looks like this in the linux kernel source code:

    typedef struct elf64_phdr {
        Elf64_Word p_type;
        Elf64_Word p_flags;
        Elf64_Off p_offset;
        Elf64_Addr p_vaddr;
        Elf64_Addr p_paddr;
        Elf64_Xword p_filesz;
        Elf64_Xword p_memsz;
        Elf64_Xword p_align;
    } Elf64_Phdr;

elf64_phdr is defined in the same elf.h.

The ELF object file also contains other fields/structures, which you can find in the documentation. Now let's look at the vmlinux ELF object.

vmlinux

vmlinux is also an ELF object file. We can take a look at it with the readelf util. First of all, let's look at the header:

    $ readelf -h vmlinux
    ELF Header:
      Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
      Class:                             ELF64
      Data:                              2's complement, little endian
      Version:                           1 (current)
      OS/ABI:                            UNIX - System V
      ABI Version:                       0
      Type:                              EXEC (Executable file)
      Machine:                           Advanced Micro Devices X86-64
      Version:                           0x1
      Entry point address:               0x1000000
      Start of program headers:          64 (bytes into file)
      Start of section headers:          381608416 (bytes into file)
      Flags:                             0x0
      Size of this header:               64 (bytes)
      Size of program headers:           56 (bytes)
      Number of program headers:         5
      Size of section headers:           64 (bytes)
      Number of section headers:         73
      Section header string table index: 70

Here we can see that vmlinux is a 64-bit executable file.

We can read in Documentation/x86/x86_64/mm.txt:

    ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0

We can then look this address up in the vmlinux ELF object with:

    $ readelf -s vmlinux | grep ffffffff81000000
         1: ffffffff81000000     0 SECTION LOCAL  DEFAULT    1
     65099: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 _text
     90766: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 startup_64

Note that the address of the startup_64 routine is not ffffffff80000000, but ffffffff81000000, and now I'll explain why.

We can see the following definition in arch/x86/kernel/vmlinux.lds.S:

        . = __START_KERNEL;
        ...
        ...
        ..
        /* Text and read-only data */
        .text :  AT(ADDR(.text) - LOAD_OFFSET) {
            _text = .;
            ...
            ...
            ...
        }

Where __START_KERNEL is:

    #define __START_KERNEL      (__START_KERNEL_map + __PHYSICAL_START)

__START_KERNEL_map is the value from the documentation, ffffffff80000000, and __PHYSICAL_START is 0x1000000. That's why the address of startup_64 is ffffffff81000000.
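
The arithmetic is small enough to verify in a few lines of C. The constants below copy the two values quoted in the text. The second function additionally assumes that LOAD_OFFSET in the linker script equals __START_KERNEL_map, which the text does not state, so treat that part as an assumption. All names here are illustrative and not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL  /* __START_KERNEL_map */
#define PHYSICAL_START   0x1000000ULL           /* __PHYSICAL_START in this build */

/* Virtual address where the kernel text starts. */
uint64_t start_kernel_address(void)
{
    return START_KERNEL_MAP + PHYSICAL_START;
}

/* Load (physical) address of a kernel text address,
 * assuming LOAD_OFFSET == __START_KERNEL_map. */
uint64_t kernel_text_load_address(uint64_t vaddr)
{
    assert(vaddr >= START_KERNEL_MAP);
    return vaddr - START_KERNEL_MAP;
}
```

Under that assumption, the virtual address 0xffffffff81000000 corresponds to the load address 0x1000000, which is also the entry point address that readelf reports for vmlinux.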

And finally we can get the program headers from vmlinux with the following command:

    readelf -l vmlinux

    Elf file type is EXEC (Executable file)
    Entry point 0x1000000
    There are 5 program headers, starting at offset 64

    Program Headers:
      Type           Offset             VirtAddr           PhysAddr
                     FileSiz            MemSiz              Flags  Align
      LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                     0x0000000000cfd000 0x0000000000cfd000  R E    200000
      LOAD           0x0000000001000000 0xffffffff81e00000 0x0000000001e00000
                     0x0000000000100000 0x0000000000100000  RW     200000
      LOAD           0x0000000001200000 0x0000000000000000 0x0000000001f00000
                     0x0000000000014d98 0x0000000000014d98  RW     200000
      LOAD           0x0000000001315000 0xffffffff81f15000 0x0000000001f15000
                     0x000000000011d000 0x0000000000279000  RWE    200000
      NOTE           0x0000000000b17284 0xffffffff81917284 0x0000000001917284
                     0x0000000000000024 0x0000000000000024         4

     Section to Segment mapping:
      Segment Sections...
       00     .text .notes __ex_table .rodata __bug_table .pci_fixup .builtin_fw
              .tracedata __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl
              __ksymtab_strings __param __modver
       01     .data .vvar
       02     .data..percpu
       03     .init.text .init.data .x86_cpu_dev.init .altinstructions
              .altinstr_replacement .iommu_table .apicdrivers .exit.text
              .smp_locks .data_nosave .bss .brk

Here we can see five segments with their section lists. You can find all of these sections in the generated linker script at arch/x86/kernel/vmlinux.lds.

That's all. Of course it's not a full description of ELF (Executable and Linkable Format), but if you want to know more, you can find the documentation here.

Misc

This chapter contains parts which are not directly related to the Linux kernel source code and the implementation of different subsystems.

Process of the Linux kernel building

Introduction

I won't tell you how to build and install a custom Linux kernel on your machine. If you need help with this, you can find many resources that will help you do it. Instead, we will learn what occurs when you execute make in the root directory of the Linux kernel source code.

When I started to study the source code of the Linux kernel, the makefile was the first file that I opened. And it was scary :). The makefile contained 1591 lines of code when I wrote this part and the kernel was the 4.2.0-rc3 release.

This makefile is the top makefile in the Linux kernel source code and the kernel building starts here. Yes, it is big, but moreover, if you've read the source code of the Linux kernel you may have noted that every directory containing source code has its own makefile. Of course it is not possible to describe how each source file is compiled and linked, so we will only study the standard compilation case. You will not find here the building of the kernel's documentation, cleaning of the kernel source code, tags generation, cross-compilation related stuff, etc. We will start from the make execution with the standard kernel configuration file and will finish with the building of the bzImage.

It would be better if you're already familiar with the make util, but I will try to describe every piece of code in this part anyway.

So let's start.

Preparation before the kernel compilation

There are many things to prepare before the kernel compilation can be started. The main point here is to find and configure the type of compilation, to parse command line arguments that are passed to make, etc. So let's dive into the top Makefile of the Linux kernel.

The top Makefile of the Linux kernel is responsible for building two major products: vmlinux (the resident kernel image) and the modules (any module files). The Makefile of the Linux kernel starts with the definition of the following variables:

    VERSION = 4
    PATCHLEVEL = 2
    SUBLEVEL = 0
    EXTRAVERSION = -rc3
    NAME = Hurr durr I'ma sheep

These variables determine the current version of the Linux kernel and are used in different places, for example in forming the KERNELVERSION variable in the same Makefile:

    KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if $(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION)

After this we can see a couple of ifeq conditions that check some of the parameters passed to make. The Linux kernel makefiles provide a special make help target that prints all available targets and some of the command line arguments that can be passed to make. For example: make V=1 => verbose build. The first ifeq checks whether the V=n option is passed to make:

    ifeq ("$(origin V)", "command line")
      KBUILD_VERBOSE = $(V)
    endif
    ifndef KBUILD_VERBOSE
      KBUILD_VERBOSE = 0
    endif

    ifeq ($(KBUILD_VERBOSE),1)
      quiet =
      Q =
    else
      quiet=quiet_
      Q = @
    endif

    export quiet Q KBUILD_VERBOSE

If this option is passed to make, we set the KBUILD_VERBOSE variable to the value of the V option. Otherwise we set KBUILD_VERBOSE to zero. After this we check the value of KBUILD_VERBOSE and set the values of the quiet and Q variables depending on it. The @ symbol suppresses the echo of a command. If it is present before a command, the output will be something like CC scripts/mod/empty.o instead of Compiling .... scripts/mod/empty.o. In the end we just export all of these variables. The next ifeq statement checks whether the O=/dir option was passed to make. This option allows placing all output files in the given dir:

ifeq($(KBUILD_SRC),)

ifeq("$(originO)","commandline")

KBUILD_OUTPUT:=$(O)

endif

ifneq($(KBUILD_OUTPUT),)

saved-output:=$(KBUILD_OUTPUT)

KBUILD_OUTPUT:=$(shellmkdir-p$(KBUILD_OUTPUT)&&cd$(KBUILD_OUTPUT)\

&&/bin/pwd)

$(if$(KBUILD_OUTPUT),,\

$(errorfailedtocreateoutputdirectory"$(saved-output)"))

sub-make:FORCE

$(Q)$(MAKE)-C$(KBUILD_OUTPUT)KBUILD_SRC=$(CURDIR)\

-f$(CURDIR)/Makefile$(filter-out_allsub-make,$(MAKECMDGOALS))

skip-makefile:=1

endif#ifneq($(KBUILD_OUTPUT),)

endif#ifeq($(KBUILD_SRC),)

We check KBUILD_SRC, which represents the top directory of the kernel source code, to see whether it is empty (it is empty when the makefile is executed for the first time). We then set the KBUILD_OUTPUT variable to the value passed with the O option (if this option was passed). In the next step we check this KBUILD_OUTPUT variable and, if it is set, we do the following things:

Store the value of KBUILD_OUTPUT in the temporary saved-output variable;
Try to create the given output directory;
Check that the directory was created, otherwise print an error message;
If the custom output directory was created successfully, execute make again with the new directory (see the -C option).
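The four steps above can be sketched in Python as follows. This is only an illustration of the control flow, not how kbuild is implemented; prepare_output_dir and its arguments are hypothetical names, and the function only builds the sub-make command line instead of running it.

```python
import os
import tempfile


def prepare_output_dir(requested, src_dir, goals):
    """Return (absolute_output_dir, sub_make_argv); raise if the directory cannot be made."""
    saved_output = requested                    # saved-output := $(KBUILD_OUTPUT)
    try:
        os.makedirs(requested, exist_ok=True)   # mkdir -p $(KBUILD_OUTPUT)
        output = os.path.realpath(requested)    # cd $(KBUILD_OUTPUT) && /bin/pwd
    except OSError:
        output = ""
    if not output:                              # $(if $(KBUILD_OUTPUT),,$(error ...))
        raise RuntimeError('failed to create output directory "%s"' % saved_output)

    # $(filter-out _all sub-make,$(MAKECMDGOALS))
    kept_goals = [g for g in goals if g not in ("_all", "sub-make")]
    argv = ["make", "-C", output, "KBUILD_SRC=" + src_dir,
            "-f", os.path.join(src_dir, "Makefile")] + kept_goals
    return output, argv


with tempfile.TemporaryDirectory() as scratch:
    out_dir, command = prepare_output_dir(
        os.path.join(scratch, "build"), "/usr/src/linux", ["_all", "vmlinux"])
    print(" ".join(command))
```

The important detail is that the second make invocation runs inside the output directory (-C) while still reading the makefile from the source tree (-f), which is why KBUILD_SRC is no longer empty on the second pass.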

The next ifeq statements check whether the C or M options were passed to make:

ifeq ("$(origin C)", "command line")
  KBUILD_CHECKSRC = $(C)
endif
ifndef KBUILD_CHECKSRC
  KBUILD_CHECKSRC = 0
endif

ifeq ("$(origin M)", "command line")
  KBUILD_EXTMOD := $(M)
endif

The C option tells the makefile that we need to check all C source code with the tool given by the $CHECK environment variable; by default it is sparse. The second option, M, builds external modules (we will not see this case in this part). We also check whether the KBUILD_SRC variable is set, and if it isn't, we set the srctree variable to .:

ifeq ($(KBUILD_SRC),)
        srctree := .
endif

objtree := .
src     := $(srctree)
obj     := $(objtree)

export srctree objtree VPATH

That tells the Makefile that the kernel source tree is in the current directory, where make was executed. We then set objtree and the other variables to this directory and export them. The next step is to get the value of the SUBARCH variable, which represents the underlying architecture:

SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
                  -e s/sun4u/sparc64/ \
                  -e s/arm.*/arm/ -e s/sa110/arm/ \
                  -e s/s390x/s390/ -e s/parisc64/parisc/ \
                  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
                  -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ )
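To see what this sed chain produces for a few machine names, here is a small Python sketch that applies the same substitutions in the same order. The SED_RULES table is copied from the -e expressions above; the function name subarch is only illustrative, and the machine names passed to it are sample values rather than output captured from a real uname.

```python
import re

# (pattern, replacement) pairs in the same order as the -e expressions.
SED_RULES = [
    (r"i.86", "x86"), (r"x86_64", "x86"),
    (r"sun4u", "sparc64"),
    (r"arm.*", "arm"), (r"sa110", "arm"),
    (r"s390x", "s390"), (r"parisc64", "parisc"),
    (r"ppc.*", "powerpc"), (r"mips.*", "mips"),
    (r"sh[234].*", "sh"), (r"aarch64.*", "arm64"),
]


def subarch(machine):
    """Apply each s/pattern/replacement/ once, left to right, like sed does."""
    for pattern, replacement in SED_RULES:
        machine = re.sub(pattern, replacement, machine, count=1)
    return machine


for name in ("x86_64", "i686", "armv7l", "aarch64", "ppc64le"):
    print(name, "->", subarch(name))
```

Both i686 and x86_64 collapse to x86, which is why a single arch/x86 directory serves 32-bit and 64-bit builds.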

As you can see, it executes the uname utility, which prints information about the machine, operating system and architecture. It takes the output of uname, parses it and assigns the result to the SUBARCH variable. Now that we have SUBARCH, we set the SRCARCH variable, which provides the directory of the given architecture, and hdr-arch, which provides the directory for the header files:

ifeq ($(ARCH),i386)
        SRCARCH := x86
endif
ifeq ($(ARCH),x86_64)
        SRCARCH := x86
endif

hdr-arch  := $(SRCARCH)

Note that ARCH defaults to SUBARCH. In the next step we set the KCONFIG_CONFIG variable, which represents the path to the kernel configuration file; if it was not set before, it is set to .config by default:

KCONFIG_CONFIG  ?= .config
export KCONFIG_CONFIG

and the shell that will be used during kernel compilation:

CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo $$BASH; \
      else if [ -x /bin/bash ]; then echo /bin/bash; \
      else echo sh; fi ; fi)

The next set of variables is related to the compilers used during Linux kernel compilation. We set the host compilers for C and C++ and the flags to be used with them:

HOSTCC       = gcc
HOSTCXX      = g++
HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-pointer -std=gnu89
HOSTCXXFLAGS = -O2

Next we get to the CC variable, which represents a compiler too, so why do we need the HOST* variables? CC is the target compiler that will be used during kernel compilation, while HOSTCC is used to compile the set of host programs (we will see them soon). After this we can see the definition of the KBUILD_MODULES and KBUILD_BUILTIN variables, which are used to determine what to compile (modules, kernel, or both):

KBUILD_MODULES :=
KBUILD_BUILTIN := 1

ifeq ($(MAKECMDGOALS),modules)
  KBUILD_BUILTIN := $(if $(CONFIG_MODVERSIONS),1)
endif

Here we can see the definition of these variables; the value of KBUILD_BUILTIN depends on the CONFIG_MODVERSIONS kernel configuration parameter if we pass only modules to make. The next step is to include the kbuild file:

include scripts/Kbuild.include

Kbuild, or the Kernel Build System, is the special infrastructure that manages the build of the kernel and its modules. The kbuild files have the same syntax as makefiles. The scripts/Kbuild.include file provides some generic definitions for the kbuild system. After this kbuild file is included, we can see the definition of the variables related to the different tools that will be used during kernel and module compilation (the linker, compilers, utilities from binutils, etc.):

AS      = $(CROSS_COMPILE)as
LD      = $(CROSS_COMPILE)ld
CC      = $(CROSS_COMPILE)gcc
CPP     = $(CC) -E
AR      = $(CROSS_COMPILE)ar
NM      = $(CROSS_COMPILE)nm
STRIP   = $(CROSS_COMPILE)strip
OBJCOPY = $(CROSS_COMPILE)objcopy
OBJDUMP = $(CROSS_COMPILE)objdump
AWK     = awk
...
...
...

We then define two other variables: USERINCLUDE and LINUXINCLUDE. They contain the paths of the directories with headers (public for users in the first case and for the kernel in the second case):

USERINCLUDE    := \
        -I$(srctree)/arch/$(hdr-arch)/include/uapi \
        -Iarch/$(hdr-arch)/include/generated/uapi \
        -I$(srctree)/include/uapi \
        -Iinclude/generated/uapi \
        -include $(srctree)/include/linux/kconfig.h

LINUXINCLUDE    := \
        -I$(srctree)/arch/$(hdr-arch)/include \
        ...

And the standard flags for the C compiler:

KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
           -fno-strict-aliasing -fno-common \
           -Werror-implicit-function-declaration \
           -Wno-format-security \
           -std=gnu89

These are not the final compiler flags; they can be updated by the other makefiles (for example kbuilds from arch/). After all of this, all variables are exported to be available in the other makefiles. The following two variables, RCS_FIND_IGNORE and RCS_TAR_IGNORE, contain the version control system files and directories that will be ignored by find and tar:

export RCS_FIND_IGNORE := \( -name SCCS -o -name BitKeeper -o -name .svn -o \
              -name CVS -o -name .pc -o -name .hg -o -name .git \) \
              -prune -o
export RCS_TAR_IGNORE := --exclude SCCS --exclude BitKeeper --exclude .svn \
             --exclude CVS --exclude .pc --exclude .hg --exclude .git

That's all. We have finished all the preparations; the next point is the building of vmlinux.

Directly to the kernel build

We have now finished all the preparations, and the next step in the main makefile is related to the kernel build. Before this moment, nothing has been printed to the terminal by make. But now the first steps of the compilation are started. We need to go to line 598 of the Linux kernel top makefile, where we will find the vmlinux target:

all: vmlinux
include arch/$(SRCARCH)/Makefile

Don't worry that we have skipped many lines of the Makefile between export RCS_FIND_IGNORE..... and all: vmlinux...... That part of the makefile is responsible for the make *.config targets, and as I wrote in the beginning of this part, we will look only at the building of the kernel in a general way.

The all: target is the default when no target is given on the command line. You can see here that we include the architecture-specific makefile (in our case it will be arch/x86/Makefile). From this moment we will continue from this makefile. As we can see, the all target depends on the vmlinux target, which is defined a little lower in the top makefile:

vmlinux: scripts/link-vmlinux.sh $(vmlinux-deps) FORCE

vmlinux is the Linux kernel in a statically linked executable file format. The scripts/link-vmlinux.sh script links and combines the different compiled subsystems into vmlinux. The second prerequisite is vmlinux-deps, which is defined as:

vmlinux-deps := $(KBUILD_LDS) $(KBUILD_VMLINUX_INIT) $(KBUILD_VMLINUX_MAIN)

and consists of the set of built-in.o files from each top directory of the Linux kernel. Later, when we go through all directories in the Linux kernel, Kbuild will compile all the $(obj-y) files. It then calls $(LD) -r to merge these files into one built-in.o file. At this moment we have no vmlinux-deps, so the vmlinux target will not be executed now. For me vmlinux-deps contains the following files:

arch/x86/kernel/vmlinux.lds arch/x86/kernel/head_64.o
arch/x86/kernel/head64.o    arch/x86/kernel/head.o
init/built-in.o             usr/built-in.o
arch/x86/built-in.o         kernel/built-in.o
mm/built-in.o               fs/built-in.o
ipc/built-in.o              security/built-in.o
crypto/built-in.o           block/built-in.o
lib/lib.a                   arch/x86/lib/lib.a
lib/built-in.o              arch/x86/lib/built-in.o
drivers/built-in.o          sound/built-in.o
firmware/built-in.o         arch/x86/pci/built-in.o
arch/x86/power/built-in.o   arch/x86/video/built-in.o
net/built-in.o

The next target that can be executed is the following:

$(sort $(vmlinux-deps)): $(vmlinux-dirs) ;
$(vmlinux-dirs): prepare scripts
	$(Q)$(MAKE) $(build)=$@

As we can see, vmlinux-dirs depends on two targets: prepare and scripts. prepare is defined in the top Makefile of the Linux kernel and executes three stages of preparations:

prepare: prepare0
prepare0: archprepare FORCE
	$(Q)$(MAKE) $(build)=.
archprepare: archheaders archscripts prepare1 scripts_basic

prepare1: prepare2 $(version_h) include/generated/utsrelease.h \
                   include/config/auto.conf
	$(cmd_crmodverdir)
prepare2: prepare3 outputmakefile asm-generic

The first, prepare0, expands to archprepare, which in turn expands to archheaders and archscripts, both defined in the x86_64-specific Makefile. Let's look at it. The x86_64-specific makefile starts with the definition of the variables related to the architecture-specific configs (defconfig, etc.). After this it defines flags for compiling the 16-bit code, calculates the BITS variable, which can be 32 for i386 or 64 for x86_64, and sets flags for the assembly source code, flags for the linker and many more (you can find all the definitions in arch/x86/Makefile). The first target in the makefile is archheaders, which generates the syscall table:

archheaders:
	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all

And the second target in this makefile is archscripts:

archscripts: scripts_basic
	$(Q)$(MAKE) $(build)=arch/x86/tools relocs

We can see that it depends on the scripts_basic target from the top Makefile. First we can see the scripts_basic target, which executes make for the scripts/basic makefile:

scripts_basic:
	$(Q)$(MAKE) $(build)=scripts/basic

The scripts/basic/Makefile contains targets for the compilation of two host programs: fixdep and bin2c:

hostprogs-y := fixdep
hostprogs-$(CONFIG_BUILD_BIN2C)     += bin2c
always      := $(hostprogs-y)

$(addprefix $(obj)/,$(filter-out fixdep,$(always))): $(obj)/fixdep

The first program is fixdep; it optimizes the list of dependencies generated by gcc that tells make when to remake a source code file. The second program is bin2c, which depends on the value of the CONFIG_BUILD_BIN2C kernel configuration option and is a very small C program that converts a binary on stdin to a C include on stdout. You may notice a strange notation here: hostprogs-y, etc. This notation is used in all kbuild files and you can read more about it in the documentation. In our case hostprogs-y tells kbuild that there is one host program named fixdep that will be built from fixdep.c, located in the same directory as the Makefile. The first output after we execute make in our terminal will be the result of this kbuild file:

$ make
  HOSTCC  scripts/basic/fixdep

As the scripts_basic target was executed, the archscripts target will execute make for the arch/x86/tools makefile with the relocs target:

$(Q)$(MAKE) $(build)=arch/x86/tools relocs

The relocs_32.c and relocs_64.c files, which will contain relocation information, will be compiled, and we will see it in the make output:

  HOSTCC  arch/x86/tools/relocs_32.o
  HOSTCC  arch/x86/tools/relocs_64.o
  HOSTCC  arch/x86/tools/relocs_common.o
  HOSTLD  arch/x86/tools/relocs

After relocs.c is compiled, version.h is checked:

$(version_h): $(srctree)/Makefile FORCE
	$(call filechk,version.h)
	$(Q)rm -f $(old_version_h)

We can see it in the output:

CHK     include/config/kernel.release

and the building of the generic assembly headers in arch/x86/include/generated/asm by the asm-generic target, which is defined in the top Makefile of the Linux kernel. After the asm-generic target, archprepare is done, so the prepare0 target will be executed. As I wrote above:

prepare0: archprepare FORCE
	$(Q)$(MAKE) $(build)=.

Note the build variable. It is defined in scripts/Kbuild.include and looks like this:

build := -f $(srctree)/scripts/Makefile.build obj

Or, in our case, where the directory is the current source directory (.):

$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.build obj=.
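To make the substitution concrete, here is a minimal Python sketch that performs the same textual expansion for any directory. The helper name expand_build is hypothetical; only the -f .../scripts/Makefile.build obj=<dir> shape is taken from the definition above.

```python
def expand_build(srctree, directory, make="make"):
    """Expand `$(MAKE) $(build)=<directory>` into the command line make would run."""
    build = "-f %s/scripts/Makefile.build obj" % srctree   # build := ... obj
    return "%s %s=%s" % (make, build, directory)


for target_dir in (".", "scripts/basic", "arch/x86/tools"):
    print(expand_build(".", target_dir))
```

The trick is that build deliberately ends in obj without a value, so writing $(build)=dir at the call site completes the obj=dir assignment.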

scripts/Makefile.build tries to find the Kbuild file in the directory given via the obj parameter and includes this Kbuild file:

include $(kbuild-file)

and builds targets from it. In our case . contains the Kbuild file that generates kernel/bounds.s and arch/x86/kernel/asm-offsets.s. After this the prepare target has finished its work. vmlinux-dirs also depends on a second target, scripts, which compiles the following programs: file2alias, mk_elfconfig, modpost, etc. After the scripts/host programs are compiled, our vmlinux-dirs target can be executed. First of all let's try to understand what vmlinux-dirs contains. In my case it contains the paths of the following kernel directories:

init usr arch/x86 kernel mm fs ipc security crypto block
drivers sound firmware arch/x86/pci arch/x86/power
arch/x86/video net lib arch/x86/lib

We can find the definition of vmlinux-dirs in the top Makefile of the Linux kernel:

vmlinux-dirs    := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
             $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
             $(net-y) $(net-m) $(libs-y) $(libs-m)))

init-y      := init/
drivers-y   := drivers/ sound/ firmware/
net-y       := net/
libs-y      := lib/
...
...
...
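The filter and patsubst pair in the definition above can be imitated with two small list operations. The Python sketch below is only an analogy for ordinary directory names; make_filter_dirs and make_strip_slash are hypothetical helper names, and the extra usr/built-in.o word is added to the sample input just to show that non-directory words are dropped.

```python
def make_filter_dirs(words):
    """Rough equivalent of $(filter %/, words): keep words that end in a slash."""
    return [w for w in words if w.endswith("/")]


def make_strip_slash(words):
    """Rough equivalent of $(patsubst %/,%,words): drop one trailing slash."""
    return [w[:-1] if w.endswith("/") else w for w in words]


# Sample values taken from the init-y, drivers-y, net-y and libs-y lines above.
words = "init/ drivers/ sound/ firmware/ net/ lib/ usr/built-in.o".split()
vmlinux_dirs = make_strip_slash(make_filter_dirs(words))
print(vmlinux_dirs)
```

Filtering first and stripping second means that anything in the *-y and *-m lists that is not a directory never reaches vmlinux-dirs.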

Here we remove the / symbol from each directory with the help of the patsubst and filter functions and put the result into vmlinux-dirs. So we have a list of directories in vmlinux-dirs and the following code:

$(vmlinux-dirs): prepare scripts
	$(Q)$(MAKE) $(build)=$@

The $@ represents each directory from vmlinux-dirs here, which means that make will go recursively over all directories from vmlinux-dirs and their internal directories (depending on the configuration) and execute make there. We can see it in the output:

  CC      init/main.o
  CHK     include/generated/compile.h
  CC      init/version.o
  CC      init/do_mounts.o
  ...
  CC      arch/x86/crypto/glue_helper.o
  AS      arch/x86/crypto/aes-x86_64-asm_64.o
  CC      arch/x86/crypto/aes_glue.o
  ...
  AS      arch/x86/entry/entry_64.o
  AS      arch/x86/entry/thunk_64.o
  CC      arch/x86/entry/syscall_64.o

The source code in each directory will be compiled and linked into built-in.o:

$ find . -name built-in.o
./arch/x86/crypto/built-in.o
./arch/x86/crypto/sha-mb/built-in.o
./arch/x86/net/built-in.o
./init/built-in.o
./usr/built-in.o
...
...

OK, all built-in.o files are built, so now we can go back to the vmlinux target. As you remember, the vmlinux target is in the top Makefile of the Linux kernel. Before linking vmlinux it builds samples, Documentation, etc., but I will not describe that here, as I wrote in the beginning of this part.

vmlinux: scripts/link-vmlinux.sh $(vmlinux-deps) FORCE
	...
	...
	+$(call if_changed,link-vmlinux)

As you can see, its main purpose is to call the scripts/link-vmlinux.sh script, which links all built-in.o files into one statically linked executable and creates System.map. In the end we will see the following output:

  LINK    vmlinux
  LD      vmlinux.o
  MODPOST vmlinux.o
  GEN     .version
  CHK     include/generated/compile.h
  UPD     include/generated/compile.h
  CC      init/version.o
  LD      init/built-in.o
  KSYM    .tmp_kallsyms1.o
  KSYM    .tmp_kallsyms2.o
  LD      vmlinux
  SORTEX  vmlinux
  SYSMAP  System.map

and vmlinux and System.map in the root of the Linux kernel source tree:

$ ls vmlinux System.map
System.map  vmlinux

That's all, vmlinux is ready. The next step is the creation of the bzImage.

Building bzImage

The bzImage file is the compressed Linux kernel image. We can get it by executing make bzImage after vmlinux is built. Alternatively, we can just execute make without any arguments and we will get bzImage anyway, because it is the default image:

all: bzImage

in arch/x86/Makefile. Let's look at this target; it will help us understand how this image is built. As I already said, the bzImage target is defined in arch/x86/Makefile and looks like this:

bzImage: vmlinux
	$(Q)$(MAKE) $(build)=$(boot) $(KBUILD_IMAGE)
	$(Q)mkdir -p $(objtree)/arch/$(UTS_MACHINE)/boot
	$(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/$(UTS_MACHINE)/boot/$@

We can see here that first of all make is called for the boot directory, which in our case is:

boot := arch/x86/boot

The main goal now is to build the source code in the arch/x86/boot and arch/x86/boot/compressed directories, build setup.bin and vmlinux.bin, and build the bzImage from them in the end. The first target in arch/x86/boot/Makefile is $(obj)/setup.elf:

$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
	$(call if_changed,ld)

We already have the setup.ld linker script in the arch/x86/boot directory and the SETUP_OBJS variable, which expands to the object files built from all the source files in the boot directory. We can see the first output:

  AS      arch/x86/boot/bioscall.o
  CC      arch/x86/boot/cmdline.o
  AS      arch/x86/boot/copy.o
  HOSTCC  arch/x86/boot/mkcpustr
  CPUSTR  arch/x86/boot/cpustr.h
  CC      arch/x86/boot/cpu.o
  CC      arch/x86/boot/cpuflags.o
  CC      arch/x86/boot/cpucheck.o
  CC      arch/x86/boot/early_serial_console.o
  CC      arch/x86/boot/edd.o

The next source file is arch/x86/boot/header.S, but we can't build it now because this target depends on the following two header files:

$(obj)/header.o: $(obj)/voffset.h $(obj)/zoffset.h

The first is voffset.h, generated by a sed script that gets two addresses from vmlinux with the nm utility:

#define VO__end 0xffffffff82ab0000
#define VO__text 0xffffffff81000000

They are the start and the end of the kernel. The second is zoffset.h, which depends on the vmlinux target from arch/x86/boot/compressed/Makefile:

$(obj)/zoffset.h: $(obj)/compressed/vmlinux FORCE
	$(call if_changed,zoffset)

The $(obj)/compressed/vmlinux target depends on vmlinux-objs-y, which compiles the source code files from the arch/x86/boot/compressed directory, generates vmlinux.bin and vmlinux.bin.bz2, and compiles the mkpiggy program. We can see this in the output:

  LDS     arch/x86/boot/compressed/vmlinux.lds
  AS      arch/x86/boot/compressed/head_64.o
  CC      arch/x86/boot/compressed/misc.o
  CC      arch/x86/boot/compressed/string.o
  CC      arch/x86/boot/compressed/cmdline.o
  OBJCOPY arch/x86/boot/compressed/vmlinux.bin
  BZIP2   arch/x86/boot/compressed/vmlinux.bin.bz2
  HOSTCC  arch/x86/boot/compressed/mkpiggy

Here vmlinux.bin is the vmlinux file with debugging information and comments stripped, and vmlinux.bin.bz2 is the compressed vmlinux.bin.all plus the u32 size of vmlinux.bin.all. vmlinux.bin.all is vmlinux.bin + vmlinux.relocs, where vmlinux.relocs is the vmlinux that was handled by the relocs program (see above). Once we have these files, the piggy.S assembly file is generated with the mkpiggy program and compiled:

  MKPIGGY arch/x86/boot/compressed/piggy.S
  AS      arch/x86/boot/compressed/piggy.o

This assembly file will contain the computed offset of the compressed kernel. After this we can see that zoffset.h is generated:

  ZOFFSET arch/x86/boot/zoffset.h

Now that zoffset.h and voffset.h are generated, compilation of the source code files from arch/x86/boot can continue:

  AS      arch/x86/boot/header.o
  CC      arch/x86/boot/main.o
  CC      arch/x86/boot/mca.o
  CC      arch/x86/boot/memory.o
  CC      arch/x86/boot/pm.o
  AS      arch/x86/boot/pmjump.o
  CC      arch/x86/boot/printf.o
  CC      arch/x86/boot/regs.o
  CC      arch/x86/boot/string.o
  CC      arch/x86/boot/tty.o
  CC      arch/x86/boot/video.o
  CC      arch/x86/boot/video-mode.o
  CC      arch/x86/boot/video-vga.o
  CC      arch/x86/boot/video-vesa.o
  CC      arch/x86/boot/video-bios.o

Once all source code files are compiled, they are linked into setup.elf:

  LD      arch/x86/boot/setup.elf

or:

ld -m elf_x86_64 -T arch/x86/boot/setup.ld \
   arch/x86/boot/a20.o arch/x86/boot/bioscall.o arch/x86/boot/cmdline.o \
   arch/x86/boot/copy.o arch/x86/boot/cpu.o arch/x86/boot/cpuflags.o \
   arch/x86/boot/cpucheck.o arch/x86/boot/early_serial_console.o \
   arch/x86/boot/edd.o arch/x86/boot/header.o arch/x86/boot/main.o \
   arch/x86/boot/mca.o arch/x86/boot/memory.o arch/x86/boot/pm.o \
   arch/x86/boot/pmjump.o arch/x86/boot/printf.o arch/x86/boot/regs.o \
   arch/x86/boot/string.o arch/x86/boot/tty.o arch/x86/boot/video.o \
   arch/x86/boot/video-mode.o arch/x86/boot/version.o \
   arch/x86/boot/video-vga.o arch/x86/boot/video-vesa.o \
   arch/x86/boot/video-bios.o -o arch/x86/boot/setup.elf

The last two things are the creation of setup.bin, which will contain the compiled code from the arch/x86/boot/* directory:

objcopy -O binary arch/x86/boot/setup.elf arch/x86/boot/setup.bin

and the creation of vmlinux.bin from vmlinux:

objcopy -O binary -R .note -R .comment -S arch/x86/boot/compressed/vmlinux arch/x86/boot/vmlinux.bin

In the end we compile the host program arch/x86/boot/tools/build.c, which creates our bzImage from setup.bin and vmlinux.bin:

arch/x86/boot/tools/build arch/x86/boot/setup.bin arch/x86/boot/vmlinux.bin arch/x86/boot/zoffset.h arch/x86/boot/bzImage

Actually the bzImage is the concatenation of setup.bin and vmlinux.bin. In the end we will see the output that is familiar to everyone who has built the Linux kernel from source:

Setup is 16268 bytes (padded to 16384 bytes).
System is 4704 kB
CRC 94a88f9a
Kernel: arch/x86/boot/bzImage is ready  (#5)
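The "padded to" figure in the first line is consistent with the setup size being rounded up to a whole number of 512-byte sectors. The sketch below assumes exactly that rule and nothing more; the constant and function names are illustrative, and the real tools/build program does more work than this (it also prints the CRC shown above, for instance).

```python
SECTOR = 512  # assumed sector size used for padding the setup part


def padded_setup_size(setup_bytes, sector=SECTOR):
    """Round setup_bytes up to the next multiple of the sector size."""
    sectors = (setup_bytes + sector - 1) // sector
    return sectors * sector


setup_size = 16268  # value from the build output above
print("Setup is %d bytes (padded to %d bytes)."
      % (setup_size, padded_setup_size(setup_size)))
```

Under this assumption 16268 bytes occupy 32 sectors, which matches the 16384 bytes reported by the build.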

That's all.

Conclusion

It is the end of this part, and here we saw all the steps from the execution of the make command to the generation of the bzImage. I know, the Linux kernel makefiles and the process of building the Linux kernel may seem confusing at first glance, but it is not so hard. I hope this part will help you understand the process of building the Linux kernel.

Links

GNU make util
Linux kernel top Makefile
cross-compilation
Ctags
sparse
bzImage
uname
shell
Kbuild
binutils
gcc
Documentation
System.map
Relocation

Linkers

Introduction

During the writing of the linux-insides book I have received many emails with questions related to the linker script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.

If we open the Linker page on Wikipedia, we will see the following definition:

In computer science, a linker or link editor is a computer program that takes one or more object files generated by a compiler and combines them into a single executable file, library file, or another object file.

If you've written at least one program in C in your life, you will have seen files with the *.o extension. These files are object files. Object files are blocks of machine code and data with placeholder addresses that reference data and functions in other object files or libraries, as well as a list of their own functions and data. The main purpose of the linker is to collect and handle the code and data of each object file, turning it into the final executable file or library. In this post we will try to go through all aspects of this process. Let's start.

Linking process

Let's create a simple project with the following structure:

*-linkers
*--main.c
*--lib.c
*--lib.h

Our main.c source code file contains:

#include <stdio.h>

#include "lib.h"

int main(int argc, char **argv) {
    printf("factorial of 5 is: %d\n", factorial(5));
    return 0;
}

The lib.c file contains:

int factorial(int base) {
    int res = 1, i = 1;

    if (base == 0) {
        return 1;
    }

    while (i <= base) {
        res *= i;
        i++;
    }

    return res;
}

And the lib.h file contains:

#ifndef LIB_H
#define LIB_H

int factorial(int base);

#endif

Now let's compile only the main.c source code file with:

$ gcc -c main.c

If we look inside the resulting object file with the nm utility, we will see the following output:

$ nm -A main.o
main.o:                 U factorial
main.o:0000000000000000 T main
main.o:                 U printf

The nm utility allows us to see the list of symbols from the given object file. Its output consists of three columns: the first is the name of the given object file together with the address of any resolved symbol, the second contains a character that represents the status of the given symbol, and the third is the name of the symbol. In this case U means undefined and T denotes that the symbol is placed in the .text section of the object. The nm utility shows us here that we have three symbols in the main.c source code file:

factorial - the factorial function defined in the lib.c source code file. It is marked as undefined here because we compiled only the main.c source code file, and it does not know anything about the code from the lib.c file for now;
main - the main function;
printf - the function from the glibc library. main.c does not know anything about it for now either.
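The column layout described above is easy to take apart programmatically. The following Python sketch parses the nm -A output shown earlier into tuples and lists the undefined symbols; parse_nm is a hypothetical helper written for this example, and it only handles the two line shapes that appear in that output.

```python
NM_OUTPUT = """\
main.o:                 U factorial
main.o:0000000000000000 T main
main.o:                 U printf
"""


def parse_nm(text):
    """Return a list of (object_file, address_or_None, status, name) tuples."""
    symbols = []
    for line in text.splitlines():
        if not line.strip():
            continue
        object_file, rest = line.split(":", 1)
        fields = rest.split()
        if len(fields) == 3:            # address, status letter, name
            address = int(fields[0], 16)
            status, name = fields[1], fields[2]
        else:                           # undefined symbols carry no address
            address = None
            status, name = fields
        symbols.append((object_file, address, status, name))
    return symbols


symbols = parse_nm(NM_OUTPUT)
undefined = [name for (_, _, status, name) in symbols if status == "U"]
print("undefined symbols:", undefined)
```

The undefined list is exactly the set of names the linker still has to resolve from other object files or libraries.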

What can we understand from the output of nm so far? The main.o object file defines the symbol main at address 0000000000000000 (it will be filled in with the correct address after it is linked), and contains two unresolved symbols. We can see all of this information in the disassembly output of the main.o object file:

$ objdump -S main.o

main.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 10             sub    $0x10,%rsp
   8:   89 7d fc                mov    %edi,-0x4(%rbp)
   b:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
   f:   bf 05 00 00 00          mov    $0x5,%edi
  14:   e8 00 00 00 00          callq  19 <main+0x19>
  19:   89 c6                   mov    %eax,%esi
  1b:   bf 00 00 00 00          mov    $0x0,%edi
  20:   b8 00 00 00 00          mov    $0x0,%eax
  25:   e8 00 00 00 00          callq  2a <main+0x2a>
  2a:   b8 00 00 00 00          mov    $0x0,%eax
  2f:   c9                      leaveq
  30:   c3                      retq

Here we are interested only in the two callq operations. The two callq operations contain linker stubs, or the function name and the offset from it to the next instruction. These stubs will be updated to the real addresses of the functions. We can see the names of these functions in the following objdump output:

$ objdump -S -r main.o

...
  14:   e8 00 00 00 00          callq  19 <main+0x19>
  15: R_X86_64_PC32             factorial-0x4
  19:   89 c6                   mov    %eax,%esi
...
  25:   e8 00 00 00 00          callq  2a <main+0x2a>
  26: R_X86_64_PC32             printf-0x4
  2a:   b8 00 00 00 00          mov    $0x0,%eax
...

The -r or --reloc flag of the objdump utility prints the relocation entries of the file. Now let's look in more detail at the relocation process.

Relocation

Relocation is the process of connecting symbolic references with symbolic definitions. Let's look at the previous snippet from the objdump output:

  14:   e8 00 00 00 00          callq  19 <main+0x19>
  15: R_X86_64_PC32             factorial-0x4
  19:   89 c6                   mov    %eax,%esi

Note the e8 00 00 00 00 on the first line. The e8 is the opcode of the call, and the remainder of the line is a relative offset. So e8 00 00 00 00 contains a one-byte operation code followed by a four-byte address. Note that 00 00 00 00 is 4 bytes. Why only 4 bytes if an address can be 8 bytes on an x86_64 (64-bit) machine? Actually, we compiled the main.c source code file with -mcmodel=small! From the gcc man page:

-mcmodel=small

Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.

Of course we didn't pass this option to gcc when we compiled main.c, but it is the default. We know from the gcc manual extract above that our program will be linked in the lower 2 GB of the address space. Four bytes is therefore enough for this. So we have the opcode of the call instruction and an unknown address. When we compile main.c with all its dependencies to an executable file, and then look at the factorial call, we see:

$ gcc main.c lib.c -o factorial && objdump -S factorial | grep factorial

factorial:     file format elf64-x86-64
...
...
0000000000400506 <main>:
  40051a:   e8 18 00 00 00          callq  400537 <factorial>
...
...
0000000000400537 <factorial>:
  400550:   75 07                   jne    400559 <factorial+0x22>
  400557:   eb 1b                   jmp    400574 <factorial+0x3d>
  400559:   eb 0e                   jmp    400569 <factorial+0x32>
  40056f:   7e ea                   jle    40055b <factorial+0x24>
...
...

As we can see in the previous output, the address of the main function is 0x0000000000400506. Why does it not start from 0x0? You may already know that standard C programs are linked with the glibc C standard library (assuming -nostdlib was not passed to gcc). The compiled code for a program includes constructor functions to initialize data in the program when the program is started. These functions need to be called before the program is started, or in other words before the main function is called. To make the initialization and termination functions work, the compiler must output something in the assembler code to cause those functions to be called at the appropriate time. Execution of this program will start from the code placed in the special .init section. We can see this at the beginning of the objdump output:

objdump -S factorial | less

factorial:     file format elf64-x86-64

Disassembly of section .init:

00000000004003a8 <_init>:
  4003a8:       48 83 ec 08             sub    $0x8,%rsp
  4003ac:       48 8b 05 a5 05 20 00    mov    0x2005a5(%rip),%rax        # 600958 <_DYNAMIC+0x1d0>

Note that it starts at the address 0x00000000004003a8, relative to the glibc code. We can also check it in the ELF output by running readelf:

$ readelf -d factorial | grep \(INIT\)
 0x000000000000000c (INIT)               0x4003a8

So, the address of the main function is 0000000000400506 and it is offset from the .init section. As we can see from the output, the address of the factorial function is 0x0000000000400537 and the binary code for the call of the factorial function is now e8 18 00 00 00. We already know that e8 is the opcode for the call instruction; the next 18 00 00 00 (note that addresses are represented as little endian on x86_64, so it is 00 00 00 18) is the offset from the callq to the factorial function:

>>> hex(0x40051a + 0x18 + 0x5) == hex(0x400537)
True

So we add 0x18 and 0x5 to the address of the call instruction. The offset is measured from the address of the following instruction. Our call instruction is 5 bytes long (e8 18 00 00 00) and 0x18 is the offset from the instruction after the call to the factorial function. A compiler generally creates each object file with the program addresses starting at zero. But if a program is created from multiple object files, these will overlap.

What we have seen in this section is the relocation process. This process assigns load addresses to the various parts of the program, adjusting the code and data in the program to reflect the assigned addresses.
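The displacement arithmetic above can also be checked by decoding the call bytes directly. The Python sketch below does two things: it decodes e8 18 00 00 00 at address 0x40051a into its target, and it recomputes the stored displacement with the usual R_X86_64_PC32 rule (symbol address plus addend minus the address of the patched field). The function names are illustrative, and the addresses are the ones from the objdump output earlier in this section.

```python
import struct


def call_target(call_address, instruction_bytes):
    """Decode a 5-byte `call rel32` and return the absolute target address."""
    opcode, displacement = struct.unpack("<Bi", instruction_bytes)
    if opcode != 0xE8:
        raise ValueError("not a call rel32 instruction")
    # The displacement is relative to the address of the next instruction.
    return call_address + len(instruction_bytes) + displacement


def pc32_value(symbol_address, addend, place):
    """Value stored by an R_X86_64_PC32 relocation: S + A - P."""
    return symbol_address + addend - place


target = call_target(0x40051A, bytes.fromhex("e818000000"))
print(hex(target))

# The relocated field starts one byte after the opcode, hence 0x40051a + 1,
# and the addend is the -0x4 shown next to R_X86_64_PC32 in the objdump output.
print(hex(pc32_value(0x400537, -4, 0x40051A + 1)))
```

The -0x4 addend is what compensates for the field sitting four bytes before the next instruction, so both calculations agree on the 0x18 displacement.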

OK, now that we know a little about linkers and relocation, it is time to learn more about linkers by linking our object files.

GNU linker

As you can understand from the title, I will use the GNU linker, or just ld, in this post. Of course we can use gcc to link our factorial project:

$ gcc main.c lib.o -o factorial

and after it we will get the executable file factorial as a result:

./factorial
factorial of 5 is: 120

But gcc does not link object files itself. Instead it uses collect2, which is just a wrapper for the GNU ld linker:

~$ /usr/lib/gcc/x86_64-linux-gnu/4.9/collect2 --version
collect2 version 4.9.3
/usr/bin/ld --version
GNU ld (GNU Binutils for Debian) 2.25
...
...
...

OK, we can use gcc and it will produce the executable file of our program for us. But let's look at how to use the GNU ld linker for the same purpose. First of all let's try to link these object files with the following command:

ld main.o lib.o -o factorial

Try to do it and you will get the following error:

$ ld main.o lib.o -o factorial
ld: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'

Here we can see two problems:

The linker can't find the _start symbol;
The linker does not know anything about the printf function.

First of all let's try to understand what this _start entry symbol is, which appears to be required for our program to run. When I started to learn programming I learned that the main function is the entry point of the program. I think you learned this too :) But it actually isn't the entry point; _start is. The _start symbol is defined in the crt1.o object file. We can find it with the following command:

$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o

/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_start>:
   0:   31 ed                   xor    %ebp,%ebp
   2:   49 89 d1                mov    %rdx,%r9
   ...
   ...
   ...

We pass this object file to the ld command as its first argument (see above). Now let's try to link it and look at the result:

ld /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
main.o lib.o -o factorial

/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o: In function `_start':
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:115: undefined reference to `__libc_csu_fini'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:116: undefined reference to `__libc_csu_init'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:122: undefined reference to `__libc_start_main'
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'

Unfortunately we will see even more errors. We can see here the old error about the undefined printf and another three undefined references:

__libc_csu_fini
__libc_csu_init
__libc_start_main

The _start symbol is defined in the sysdeps/x86_64/start.S assembly file in the glibc source code. We can find the following assembly code lines there:

mov $__libc_csu_fini, %R8_LP
mov $__libc_csu_init, %RCX_LP
...
call __libc_start_main

Here we pass the addresses of the entry points to the .init and .fini sections, which contain the code that starts to execute when the program is run and the code that executes when the program terminates. And in the end we see the call of __libc_start_main, which in turn calls the main function of our program. These three symbols are defined in the csu/elf-init.c source code file. The following two object files:

crtn.o;
crti.o.

define the function prologs/epilogs for the .init and .fini sections (with the _init and _fini symbols respectively).

The crtn.o object file contains these .init and .fini sections:

$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o

0000000000000000 <.init>:
   0:   48 83 c4 08             add    $0x8,%rsp
   4:   c3                      retq

Disassembly of section .fini:

0000000000000000 <.fini>:
   0:   48 83 c4 08             add    $0x8,%rsp
   4:   c3                      retq

And the crti.o object file contains the _init and _fini symbols. Let's try to link again with these two object files:

$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
-o factorial

And anyway we will get the same errors. Now we need to pass the -lc option to ld. This option tells ld to look for the C standard library in its library search path (the directories given with -L plus the linker's default directories). Let's try to link again with the -lc option:

$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o -lc \
-o factorial

Finally we get an executable file, but if we try to run it, we will get strange results:

$ ./factorial
bash: ./factorial: No such file or directory

What's the problem here? Let's look at the executable file with the readelf utility:

$ readelf -l factorial

Elf file type is EXEC (Executable file)
Entry point 0x4003c0
There are 7 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x0000000000000188 0x0000000000000188  R E    8
  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000610 0x0000000000000610  R E    200000
  LOAD           0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x00000000000001cc 0x00000000000001cc  RW     200000
  DYNAMIC        0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x00000000000001e4 0x00000000004001e4 0x00000000004001e4
                 0x0000000000000020 0x0000000000000020  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame
   03     .dynamic .got .got.plt .data
   04     .dynamic
   05     .note.ABI-tag
   06

Note the strange line:

      INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                     0x000000000000001c 0x000000000000001c  R      1
          [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

The .interp section in the ELF file holds the path name of a program interpreter; in other words, the .interp section simply contains an ASCII string that is the name of the dynamic linker. The dynamic linker is the part of Linux that loads and links the shared libraries needed by an executable when it is executed, by copying the content of the libraries from disk to RAM. As we can see in the output of the readelf command, it is placed in the /lib64/ld-linux-x86-64.so.2 file for the x86_64 architecture. Now let's add the -dynamic-linker option with the path of ld-linux-x86-64.so.2 to the ld call and look at the result:

    $ gcc -c main.c lib.c

    $ ld \
        /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
        /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
        /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
        -dynamic-linker /lib64/ld-linux-x86-64.so.2 \
        -lc -o factorial

Now we can run it as a normal executable file:

    $ ./factorial
    factorial of 5 is: 120

It works! With the first line we compile the main.c and lib.c source code files to object files. We get main.o and lib.o after the execution of gcc:

    $ file lib.o main.o
    lib.o:  ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
    main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

and after this we link the object files of our program with the needed system object files and libraries. We just saw a simple example of how to compile and link a C program with the gcc compiler and the GNU ld linker. In this example we have used a couple of command line options of the GNU linker, but it supports many more command line options than -o, -dynamic-linker, etc. Moreover, GNU ld has its own language that allows us to control the linking process. In the next two paragraphs we will look into it.
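The INTERP entry discussed above can also be checked from a script. The following is a minimal sketch, assuming the readelf output format shown earlier; interp_from_readelf is a hypothetical helper name, not a binutils tool, and the sample text is simply the INTERP entry copied from that output. A requested interpreter that does not exist on disk is one common cause of the "No such file or directory" error for a binary that is clearly present.

```shell
#!/bin/sh
# interp_from_readelf is a hypothetical helper, not part of binutils.
# It reads `readelf -l` output on stdin and prints the requested interpreter, if any.
interp_from_readelf() {
    sed -n 's/.*\[Requesting program interpreter: \(.*\)].*/\1/p'
}

# Sample input: the INTERP entry from the readelf output shown earlier.
sample='  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]'

interp=$(printf '%s\n' "$sample" | interp_from_readelf)
echo "requested interpreter: $interp"

# If this path does not exist, the kernel cannot start the program.
if [ -e "$interp" ]; then
    echo "the interpreter exists on this machine"
else
    echo "the interpreter is missing on this machine"
fi
```

On a real binary the same helper would be fed with `readelf -l ./factorial | interp_from_readelf`.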

Useful command line options of the GNU linker

As I already wrote, and as you can see in the manual of the GNU linker, it has a big set of command line options. We've seen a couple of options in this post: -o <output>, which tells ld to produce an output file called output as the result of linking; -l<name>, which adds the archive or object file specified by the name; and -dynamic-linker, which specifies the name of the dynamic linker. Of course ld supports many more command line options, so let's look at some of them.

The first useful command line option is @file. In this case file specifies the name of a file from which command line options will be read. For example, we can create a file with the name linker.ld, put our command line arguments from the previous example there and execute it with:

    $ ld @linker.ld

The next command line option is -b or --format. This command line option specifies the format of the input object files: ELF, DJGPP/COFF and so on. There is a command line option for the same purpose but for the output file: --oformat=output-format.

The next command line option is --defsym. The full format of this command line option is --defsym=symbol=expression. It creates a global symbol in the output file containing the absolute address given by expression. We can find the following case where this command line option is useful: in the Linux kernel source code, and more precisely in the Makefile that is related to kernel decompression for the ARM architecture, arch/arm/boot/compressed/Makefile, we can find the following definition:

    LDFLAGS_vmlinux = --defsym _kernel_bss_size=$(KBSS_SZ)

As we already know, it defines the _kernel_bss_size symbol with the size of the .bss section in the output file. This symbol will be used in the first assembly file that is executed during kernel decompression:

    ldr r5, =_kernel_bss_size

The next command line option is -shared, which allows us to create a shared library. The -M (or --print-map) command line option prints the link map with information about symbols to standard output, and -Map <filename> writes the same map to a file. In our case:

    $ ld -M @linker.ld
    ...
    ...
    ...
    .text           0x00000000004003c0      0x112
     *(.text.unlikely .text.*_unlikely .text.unlikely.*)
     *(.text.exit .text.exit.*)
     *(.text.startup .text.startup.*)
     *(.text.hot .text.hot.*)
     *(.text .stub .text.* .gnu.linkonce.t.*)
     .text          0x00000000004003c0       0x2a /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
    ...
    ...
    ...
     .text          0x00000000004003ea       0x31 main.o
                    0x00000000004003ea                main
     .text          0x000000000040041b       0x3f lib.o
                    0x000000000040041b                factorial

Of course the GNU linker supports the standard command line options --help and --version, which print common usage help for ld and its version. That's all about the command line options of the GNU linker. Of course, this is not the full set of command line options supported by the ld util. You can find the complete documentation of the ld util in the manual.
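The @file option described above can be tried with a small script. This is only a sketch under stated assumptions: the scratch directory is created just for the demonstration, the arguments are the ones from the earlier factorial example, and the script only writes the response file and prints the command instead of running ld.

```shell
#!/bin/sh
# Sketch: write linker arguments to a response file for `ld @file`.
# The scratch directory is a placeholder; the arguments come from the earlier example.
workdir=$(mktemp -d)
argsfile="$workdir/linker.ld"

cat > "$argsfile" <<'EOF'
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o
main.o lib.o
-dynamic-linker /lib64/ld-linux-x86-64.so.2
-lc -o factorial
EOF

# Options in the file are separated by whitespace, so newlines work as separators.
nargs=$(wc -w < "$argsfile" | tr -d ' ')
echo "wrote $nargs arguments to $argsfile"
echo "link with: ld @$argsfile"
```

Keeping long argument lists in a file like this makes the link step easier to repeat and to review.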

Control Language linker

As I wrote previously, ld has support for its own language. It accepts Linker Command Language files written in a superset of AT&T's Link Editor Command Language syntax, to provide explicit and total control over the linking process. Let's look at its details.

With the linker language we can control:

- input files;
- output files;
- file formats;
- addresses of sections;
- etc...

Commands written in the linker control language are usually placed in a file called a linker script. We can pass it to ld with the -T command line option. The main command in a linker script is the SECTIONS command. Each linker script must contain this command, and it determines the map of the output file. The special variable . contains the current position of the output. Let's write a simple assembly program and look at how we can use a linker script to control the linking of this program. We will take a hello world program for this example:

    section .data
        msg db "hello, world!",`\n`
    section .text
        global _start
    _start:
        mov rax, 1
        mov rdi, 1
        mov rsi, msg
        mov rdx, 14
        syscall
        mov rax, 60
        mov rdi, 0
        syscall

We can compile and link it with the following commands:

    $ nasm -f elf64 -o hello.o hello.asm
    $ ld -o hello hello.o

Our program consists of two sections: .text contains the code of the program and .data contains initialized variables. Let's write a simple linker script and try to link our hello.asm assembly file with it. Our script is:

    /*
     * Linker script for the hello program
     */
    OUTPUT(hello)
    OUTPUT_FORMAT("elf64-x86-64")
    INPUT(hello.o)

    SECTIONS
    {
        . = 0x200000;
        .text : {
              *(.text)
        }

        . = 0x400000;
        .data : {
              *(.data)
        }
    }

On the first three lines you can see a comment written in C style. After it the OUTPUT and OUTPUT_FORMAT commands specify the name of our executable file and its format. The next command, INPUT, specifies the input file for the ld linker. Then we can see the main SECTIONS command, which, as I already wrote, must be present in every linker script. The SECTIONS command describes the set and order of the sections that will be in the output file. At the beginning of the SECTIONS command we can see the line . = 0x200000. I already wrote above that the . symbol points to the current position of the output. This line says that the code should be loaded at address 0x200000, and the line . = 0x400000 says that the data section should be loaded at address 0x400000. The line after . = 0x200000 defines .text as an output section. We can see the *(.text) expression inside it. The * symbol is a wildcard that matches any file name. In other words, the *(.text) expression means all .text input sections in all input files. We could rewrite it as hello.o(.text) for our example. After the location counter assignment . = 0x400000 we can see the definition of the data section.

We can compile and link it with:

    $ nasm -f elf64 -o hello.o hello.asm && ld -T linker.script && ./hello
    hello, world!

If we look inside it with the objdump util, we can see that the .text section starts at address 0x200000 and the .data section starts at address 0x400000:

    $ objdump -D hello

    Disassembly of section .text:

    0000000000200000 <_start>:
      200000:   b8 01 00 00 00          mov    $0x1,%eax
      ...

    Disassembly of section .data:

    0000000000400000 <msg>:
      400000:   68 65 6c 6c 6f          pushq  $0x6f6c6c65
      ...

Apart from the commands we have already seen, there are a few others. The first is ASSERT(exp, message), which ensures that the given expression is not zero. If it is zero, the linker exits with an error code and prints the given error message. If you've read about the Linux kernel booting process in the linux-insides book, you may know that the setup header of the Linux kernel has offset 0x1f1. In the linker script of the Linux kernel we can find a check for this:

    . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");

The INCLUDE filename command includes an external linker script in the current one. In a linker script we can also assign a value to a symbol. ld supports a couple of assignment operators:

    symbol = expression;
    symbol += expression;
    symbol -= expression;
    symbol *= expression;
    symbol /= expression;
    symbol <<= expression;
    symbol >>= expression;
    symbol &= expression;
    symbol |= expression;

As you can note, all of these operators are C assignment operators. For example, we can use them in our linker script as:

    START_ADDRESS = 0x200000;
    DATA_OFFSET   = 0x200000;

    SECTIONS
    {
        . = START_ADDRESS;
        .text : {
              *(.text)
        }

        . = START_ADDRESS + DATA_OFFSET;
        .data : {
              *(.data)
        }
    }
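To double-check which addresses the script above asks for, the same expressions can be evaluated by hand. The sketch below uses plain shell arithmetic and is not something ld runs; the variable names simply mirror the symbols from the linker script.

```shell
#!/bin/sh
# Sketch: evaluate the same expressions the linker script above assigns to `.`.
START_ADDRESS=$((0x200000))
DATA_OFFSET=$((0x200000))

text_addr=$START_ADDRESS
data_addr=$((START_ADDRESS + DATA_OFFSET))

printf '.text starts at 0x%x\n' "$text_addr"
printf '.data starts at 0x%x\n' "$data_addr"
```

The second value matches the hard-coded 0x400000 used in the first version of the script.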

As you may have already noted, the syntax for expressions in the linker script language is identical to that of C expressions. Besides this, the linker control language supports the following builtin functions:

- ABSOLUTE - returns the absolute (non-relocatable) value of the given expression;
- ADDR - takes a section and returns its address;
- ALIGN - returns the location counter (the . operator), or the given expression, aligned upward to the given boundary;
- DEFINED - returns 1 if the given symbol is in the global symbol table and 0 otherwise;
- MAX and MIN - return the maximum and minimum of the two given expressions;
- NEXT - returns the next unallocated address that is a multiple of the given expression;
- SIZEOF - returns the size in bytes of the given named section.
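The arithmetic behind ALIGN is easy to try outside the linker. The ld manual describes ALIGN(align) as equivalent to (. + align - 1) & ~(align - 1); the sketch below applies that formula in shell arithmetic. The align_up name and the sample address are made up for the illustration, and the boundary is assumed to be a power of two.

```shell
#!/bin/sh
# align_up ADDR BOUNDARY: round ADDR up to the next multiple of BOUNDARY.
# Hypothetical helper; BOUNDARY is assumed to be a power of two.
align_up() {
    echo $(( ($1 + $2 - 1) & ~($2 - 1) ))
}

# An address in the middle of a page is rounded up to the next page boundary.
printf 'ALIGN of 0x200123 to 0x1000 gives 0x%x\n' "$(align_up 0x200123 0x1000)"
# An address that is already aligned stays where it is.
printf 'ALIGN of 0x201000 to 0x1000 gives 0x%x\n' "$(align_up 0x201000 0x1000)"
```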

That's all.

Conclusion

This is the end of the post about linkers. We learned many things about linkers in this post, such as what a linker is and why it is needed, how to use it, etc.

If you have any questions or suggestions, write me an email or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please let me know via email or send a PR.

Links

- Book about Linux kernel internals
- linker
- object files
- glibc
- opcode
- ELF
- GNU linker
- My posts about assembly programming for x86_64
- readelf

Linux kernel development

Introduction

As you may already know, I started a series of blog posts about assembler programming for the x86_64 architecture last year. I had never written a line of low-level code before that moment, except for a couple of toy Hello World examples at university. That was a long time ago and, as I already said, I didn't write low-level code at all. Some time ago I became interested in such things; in other words, I understood that I can write programs, but I didn't actually understand how my programs are arranged.

After writing some assembler code I began to understand, approximately, how my program looks after compilation. But anyway, I didn't understand many other things. For example: what occurs when the syscall instruction is executed in my assembler code, what occurs when the printf function starts to work, or how my program can talk with other computers via the network. The assembler programming language didn't give me answers to my questions and I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things that I'm interested in. The source code of the Linux kernel didn't give me the answers to all of my questions, but now my knowledge about the Linux kernel and the processes around it is much better.

I'm writing this part nine and a half months after I started to learn from the source code of the Linux kernel and published the first part of this book. Now it contains forty parts and it is not the end. I decided to write this series about the Linux kernel mostly for myself. As you know, the Linux kernel is a very huge piece of code and it is easy to forget what this or that part of the Linux kernel means and how it implements something. But soon the linux-insides repo became popular, and after nine months it has 9096 stars.

It seems that people are interested in the internals of the Linux kernel. Besides this, in all the time that I have been writing linux-insides, I have received many questions from different people, such as how to start with the Linux kernel, what is needed to start contributing to the Linux kernel, and others like these. Generally people are interested in contributing to open source projects for different reasons, and the Linux kernel is no exception.

So, it seems that people are interested in the Linux kernel development process. I thought it would be strange if a book about the Linux kernel did not contain a part that describes how to take part in Linux kernel development, and that's why I decided to write it. You will not find information about why you should be interested in contributing to the Linux kernel in this part. I see many benefits in learning the source code of the Linux kernel, but I don't know about you, so I have no answer to this question. But if you are interested in how to start with Linux kernel development, this part is for you.

Let's start.

How to start with Linux kernel

First of all, let's look at how to get, build and run the Linux kernel. You can run your custom build of the Linux kernel in two ways:

- Run the Linux kernel on a virtual machine;
- Run the Linux kernel on real hardware.

I'll provide descriptions for both methods. Before we start to do something with the Linux kernel, we need to get it. There are a couple of ways to do this, depending on your purpose. If you just want to update the current version of the Linux kernel on your computer, you can use the instructions specific to your Linux distro.

In the first case you just need to download a new version of the Linux kernel with the package manager. For example, to upgrade the version of the Linux kernel to 4.1 for Ubuntu (Vivid Vervet), you just need to execute the following commands:

    $ sudo add-apt-repository ppa:kernel-ppa/ppa
    $ sudo apt-get update

After this execute this command:

    $ apt-cache showpkg linux-headers

and choose the version of the Linux kernel in which you are interested. In the end execute the next command, replacing ${version} with the version that you chose in the output of the previous command:

    $ sudo apt-get install linux-headers-${version} linux-headers-${version}-generic linux-image-${version}-generic --fix-missing

and reboot your system. After the reboot you will see the new kernel in the grub menu.

On the other hand, if you are interested in Linux kernel development, you will need to get the source code of the Linux kernel. You can find it on the kernel.org website and download an archive with the Linux kernel source code. The Linux kernel development process is fully built around the git version control system, so you can get it with git from kernel.org:

    $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I don't know about you, but I prefer github. There is a mirror of the Linux kernel mainline repository, so you can clone it with:

    $ git clone git@github.com:torvalds/linux.git

I'm using my own fork for development, and when I want to pull updates from the main repository I just execute the following commands:

    $ git checkout master
    $ git pull upstream master

Note that the remote name of the main repository is upstream. To add a new remote with the main Linux repository you can execute:

    $ git remote add upstream git@github.com:torvalds/linux.git

After this you will have two remotes:

    ~/dev/linux (master) $ git remote -v
    origin    git@github.com:0xAX/linux.git (fetch)
    origin    git@github.com:0xAX/linux.git (push)
    upstream  https://github.com/torvalds/linux.git (fetch)
    upstream  https://github.com/torvalds/linux.git (push)

One is for your fork (origin) and the second is for the main repository (upstream).

Now that we have a local copy of the Linux kernel source code, we need to configure and build it. The Linux kernel can be configured in different ways. The simplest way is to just copy the configuration file of the already installed kernel that is located in the /boot directory:

    $ sudo cp /boot/config-$(uname -r) ~/dev/linux/.config

If your current Linux kernel was built with support for access to the /proc/config.gz file, you can copy your actual kernel configuration file with this command:

    $ cat /proc/config.gz | gunzip > ~/dev/linux/.config
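The choice between the two sources of the running kernel's configuration can be sketched as a small helper. This is only an illustration: pick_kernel_config and its ROOT prefix parameter are hypothetical names introduced here so that the lookup can be tried on a scratch directory tree, and the release string in the demo is made up.

```shell
#!/bin/sh
# pick_kernel_config ROOT RELEASE: print the path of an existing kernel config,
# preferring /proc/config.gz over /boot/config-RELEASE.
# ROOT is a hypothetical prefix ("" for the real system) used to test on a scratch tree.
pick_kernel_config() {
    root=$1
    release=$2
    if [ -r "$root/proc/config.gz" ]; then
        echo "$root/proc/config.gz"
    elif [ -r "$root/boot/config-$release" ]; then
        echo "$root/boot/config-$release"
    else
        return 1
    fi
}

# Demo on a scratch directory tree so nothing on the real system is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/boot"
: > "$scratch/boot/config-4.1.0-test"
found=$(pick_kernel_config "$scratch" "4.1.0-test")
echo "would copy: ${found#"$scratch"}"
```

On a real system the call would be `pick_kernel_config "" "$(uname -r)"`, followed by either the cp or the gunzip command shown above.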

If you are not satisfied with the standard kernel configuration that is provided by the maintainers of your distro, you can configure the Linux kernel manually. There are a couple of ways to do it. The Linux kernel root Makefile provides a set of targets that allow you to configure it. For example, menuconfig provides a menu-driven interface for the kernel configuration.

The defconfig argument generates the default kernel configuration file for the current architecture, for example x86_64 defconfig. You can pass the ARCH command line argument to make to build defconfig for the given architecture:

    $ make ARCH=arm64 defconfig

The allnoconfig, allyesconfig and allmodconfig arguments allow you to generate a new configuration file where all options will be disabled, enabled, and enabled as modules respectively. The nconfig command line argument provides an ncurses-based program with a menu to configure the Linux kernel.

And there is even randconfig, which generates a random Linux kernel configuration file. I will not write about how to configure the Linux kernel, which options to enable and which not, because it makes no sense to do so for two reasons: first of all, I do not know your hardware, and second, if you know your hardware, the only remaining task is to find out how to use the programs for kernel configuration, and all of them are pretty simple to use.

Ok, at this moment we have got the source code of the Linux kernel and configured it. The next step is the compilation of the Linux kernel. The simplest way to compile the Linux kernel is to just execute the make command:

    $ make
    scripts/kconfig/conf --silentoldconfig Kconfig
    #
    # configuration written to .config
    #
      CHK     include/config/kernel.release
      UPD     include/config/kernel.release
      CHK     include/generated/uapi/linux/version.h
      CHK     include/generated/utsrelease.h
      ...
      ...
      ...
      OBJCOPY arch/x86/boot/vmlinux.bin
      AS      arch/x86/boot/header.o
      LD      arch/x86/boot/setup.elf
      OBJCOPY arch/x86/boot/setup.bin
      BUILD   arch/x86/boot/bzImage
    Setup is 15740 bytes (padded to 15872 bytes).
    System is 4342 kB
    CRC 82703414
    Kernel: arch/x86/boot/bzImage is ready  (#73)

To increase the speed of kernel compilation you can pass the -jN command line argument to the make util, where N specifies the number of commands to run simultaneously:

    $ make -j8
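A common way to choose N is to use the number of online CPUs. The following is a minimal sketch of that idea, assuming that either nproc (from coreutils) or getconf is available; it only prints the make command rather than starting a build.

```shell
#!/bin/sh
# Sketch: derive the -jN value from the number of online CPUs.
# Falls back to 1 when neither nproc nor getconf reports a usable number.
cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case $cpus in
    ''|*[!0-9]*) cpus=1 ;;
esac

echo "make -j$cpus"
```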

If you want to build the Linux kernel for an architecture that differs from your current one, the simplest way to do it is to pass two arguments:

- the ARCH command line argument with the name of the target architecture;
- the CROSS_COMPILE command line argument with the cross-compiler tool prefix.

For example, if we want to compile the Linux kernel for arm64 with the default kernel configuration file, we need to execute the following commands:

    $ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
    $ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-

As a result of the compilation (of the x86_64 build shown earlier) we can see the compressed kernel, arch/x86/boot/bzImage. Now we have a compiled kernel and we can either install it on our computer or just run it in an emulator.

Installing Linux kernel

As I already wrote, we will consider two ways to launch the new kernel: in the first case we install and run the new version of the Linux kernel on real hardware, and in the second we launch the Linux kernel on a virtual machine. In the previous paragraph we saw how to build the Linux kernel from source code, and as a result we have got a compressed image:

    ...
    ...
    ...
    Kernel: arch/x86/boot/bzImage is ready  (#73)

After we have got the bzImage we need to install the headers and modules of the new Linux kernel with:

    $ sudo make headers_install
    $ sudo make modules_install

and the kernel itself:

    $ sudo make install

From this moment we have installed the new version of the Linux kernel and now we must tell the bootloader about it. Of course we can add it manually by editing the /boot/grub2/grub.cfg configuration file, but I prefer to use a script for this purpose. I'm using two different Linux distros, Fedora and Ubuntu, and they have two different ways to update the grub configuration file. I'm using the following script for this purpose:

    #!/bin/bash

    source "term-colors"

    DISTRIBUTIVE=$(cat /etc/*-release | grep NAME | head -1 | sed -n -e 's/NAME\=//p')
    echo -e "Distributive: ${Green}${DISTRIBUTIVE}${Color_Off}"

    if [[ "$DISTRIBUTIVE" == "Fedora" ]] ;
    then
        su -c 'grub2-mkconfig -o /boot/grub2/grub.cfg'
    else
        sudo update-grub
    fi

    echo "${Green}Done.${Color_Off}"

This is the last step of the new Linux kernel installation, and after this you can reboot your computer and select the new version of the kernel during boot.

The second case is to launch the new Linux kernel in a virtual machine. I prefer qemu. First of all we need to build an initial ramdisk, the initrd, for this. The initrd is a temporary root file system that is used by the Linux kernel during the initialization process while other filesystems are not mounted. We can build the initrd with the following commands:

First of all we need to download busybox and run menuconfig for its configuration:

    $ mkdir initrd
    $ cd initrd
    $ curl http://busybox.net/downloads/busybox-1.23.2.tar.bz2 | tar xjf -
    $ cd busybox-1.23.2/
    $ make menuconfig
    $ make -j4

The busybox is an executable file, /bin/busybox, that contains a set of standard tools like coreutils. In the busybox menu we need to enable the Build BusyBox as a static binary (no shared libs) option.

We can find this menu in:

    Busybox Settings
    --> Build Options

After this we exit from the busybox configuration menu and execute the following commands to build and install it:

    $ make -j4
    $ sudo make install

Ok, busybox is installed from this moment and we can start to build our initrd. To do this, we go to the previous initrd directory and:

    $ cd ..
    $ mkdir -p initramfs
    $ cd initramfs
    $ mkdir -pv {bin,sbin,etc,proc,sys,usr/{bin,sbin}}
    $ cp -av ../busybox-1.23.2/_install/* .

copy the busybox files to the bin, sbin and other directories. Now we need to create an executable init file that will be executed as the first process in the system. My init file just mounts the procfs and sysfs filesystems and executes a shell:

    #!/bin/sh

    mount -t proc none /proc
    mount -t sysfs none /sys

    exec /bin/sh
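Since the init file has to be executable, the step of creating it can be sketched as follows. The scratch directory is an assumption that stands in for the initramfs directory created above; in the real tree the file would be written to ./init inside it.

```shell
#!/bin/sh
# Sketch: write the init script shown above into an initramfs tree and mark it executable.
# The scratch directory stands in for the initramfs directory created earlier.
initramfs=$(mktemp -d)

cat > "$initramfs/init" <<'EOF'
#!/bin/sh

mount -t proc none /proc
mount -t sysfs none /sys

exec /bin/sh
EOF

# The kernel executes /init from the initramfs, so the file needs the execute bit.
chmod +x "$initramfs/init"
ls -l "$initramfs/init"
```

Without the execute bit the kernel cannot run the file as the first process.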

Now we can create an archive that will be our initrd:

    $ find . -print0 | cpio --null -ov --format=newc | gzip -9 > ~/dev/initrd_x86_64.gz

We can now run our kernel in the virtual machine. As I already wrote, I prefer qemu for this. We can run our kernel with the following command:

    $ qemu-system-x86_64 -snapshot -m 8GB -serial stdio -kernel ~/dev/linux/arch/x86_64/boot/bzImage -initrd ~/dev/initrd_x86_64.gz -append "root=/dev/sda1 ignore_loglevel"

From now on we can run the Linux kernel in the virtual machine, and this means that we can begin to change and test the kernel.

Consider using ivandaviov/minimal to automate the process of generating the initrd.

Getting started with the Linux Kernel Development

The main point of this paragraph is to answer two questions: what to do and what not to do before you send your first patch to the Linux kernel. Please, do not confuse this to do with todo. I have no answer to what you can fix in the Linux kernel. I just want to tell you about my workflow when experimenting with the Linux kernel source code.

First of all I pull the latest updates from Linus's repo with the following commands:

    $ git checkout master
    $ git pull upstream master

After this my local repository with the Linux kernel source code is synced with the mainline repository. Now we can make some changes in the source code. As I already wrote, I have no advice for you about where to start and what TODO in the Linux kernel. But the best place for newbies is the staging tree, in other words the set of drivers from drivers/staging. The maintainer of the staging tree is Greg Kroah-Hartman, and the staging tree is the place where your trivial patch can be accepted. Let's look at a simple example that describes how to generate a patch, check it and send it to the Linux kernel mailing list.

If we look at the driver for the Digi International EPCA PCI based devices, we will see the dgap_sindex function on line 295:

    static char *dgap_sindex(char *string, char *group)
    {
        char *ptr;

        if (!string || !group)
            return NULL;

        for (; *string; string++) {
            for (ptr = group; *ptr; ptr++) {
                if (*ptr == *string)
                    return string;
            }
        }

        return NULL;
    }

This function looks for a match of any character in the group and returns that position. While researching the source code of the Linux kernel, I noted that the lib/string.c source code file contains an implementation of the strpbrk function that does the same thing as dgap_sindex. It is not a good idea to use a custom implementation of a function that already exists. So we can remove the dgap_sindex function from the drivers/staging/dgap/dgap.c source code file and use strpbrk instead.

First of all let's create a new git branch based on the current master that is synced with the Linux kernel mainline repo:

    $ git checkout -b "dgap-remove-dgap_sindex"

And now we can replace dgap_sindex with strpbrk. After we have made all the changes we need to recompile the Linux kernel or just the dgap directory. Do not forget to enable this driver in the kernel configuration. You can find it in:

    Device Drivers
    --> Staging drivers
    ----> Digi EPCA PCI products

Now it is time to make a commit. I'm using the following combination for this:

    $ git add .
    $ git commit -s -v

After the last command an editor will be opened, chosen from the $GIT_EDITOR or $EDITOR environment variable. The -s command line argument adds a Signed-off-by line by the committer at the end of the commit log message. You can find this line at the end of each commit message, for example in 00cc1633. The main point of this line is tracking who made a change. The -v option shows a unified diff between the HEAD commit and what would be committed at the bottom of the commit message template. It is not necessary, but very useful sometimes. A couple of words about the commit message. A commit message consists of two parts:

The first part is the first line, which contains a short description of the changes. In the patch e-mail it starts with the [PATCH] prefix (git format-patch adds this prefix when it generates the patch), followed by a subsystem, driver or architecture name and, after the : symbol, a short description. In our case it will be something like this:

    [PATCH] staging/dgap: Use strpbrk() instead of dgap_sindex()

After the short description we usually have an empty line and the full description of the commit. In our case it will be:

    The <linux/string.h> provides strpbrk() function that does the same that the
    dgap_sindex(). Let's use already defined function instead of writing custom.

And the Signed-off-by line goes at the end of the commit message. Note that each line of a commit message must not be longer than 80 symbols, and the commit message must describe your changes in detail. Do not just write a commit message like "Custom function removed"; you need to describe what you did and why. The patch reviewers must know what they review. Besides this, commit messages written in this way are very helpful: each time we can't understand something, we can use git blame to read the description of the changes.
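The rules above can be checked mechanically before sending anything. The sketch below is a hypothetical helper, not a kernel or git tool: check_commit_msg is a made-up name, the 80-column limit is the one stated in the text, and the sign-off name and address are placeholders. The subject is written without the [PATCH] prefix on the assumption that git format-patch adds it when the patch e-mail is generated.

```shell
#!/bin/sh
# check_commit_msg: read a commit message on stdin and report problems.
# Hypothetical helper; the 80-column limit is the one stated in the text above.
check_commit_msg() {
    awk '
        BEGIN { bad = 0; signed = 0 }
        length($0) > 80 { printf "line %d is longer than 80 characters\n", NR; bad = 1 }
        NR == 2 && $0 != "" { print "line 2 should be empty"; bad = 1 }
        /^Signed-off-by: / { signed = 1 }
        END {
            if (!signed) { print "missing Signed-off-by line"; bad = 1 }
            exit bad
        }
    '
}

# Placeholder author below; the body text is the example from this part.
if check_commit_msg <<'EOF'
staging/dgap: Use strpbrk() instead of dgap_sindex()

The <linux/string.h> provides strpbrk() function that does the same that the
dgap_sindex(). Let's use already defined function instead of writing custom.

Signed-off-by: Jane Developer <jane@example.org>
EOF
then
    echo "commit message looks fine"
fi
```

A check like this does not replace review, it only catches the mechanical mistakes early.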

After we have committed the changes, it is time to generate a patch. We can do it with the format-patch command:

    $ git format-patch master
    0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch

We've passed the name of a branch (master in this case) to the format-patch command, which generates a patch with the latest changes that are in the dgap-remove-dgap_sindex branch and are not in the master branch. As you can note, the format-patch command generates a file that contains the latest changes and has a name based on the short description of the commit. If you want to generate a patch with a custom name, you can use the --stdout option:

    $ git format-patch master --stdout > dgap-patch-1.patch

The last step after we have generated our patch is to send it to the Linux kernel mailing list. Of course you can use any email client, but git provides a special command for this: git send-email. Before you send your patch, you need to know where to send it. Yes, you can send it just to the Linux kernel mailing list address, which is linux-kernel@vger.kernel.org, but there is a high probability that the patch will be ignored because, as you may already know, there is a large flow of messages on the Linux kernel mailing list. The better way is to send it to the maintainer of the subsystem where you have made changes. We can find the maintainer and other related people who have touched the code with the get_maintainer.pl script. All you need is to pass the file or directory where you changed the code. Go to the root directory of the Linux kernel source code and execute it:

    $ ./scripts/get_maintainer.pl -f drivers/staging/dgap/dgap.c
    Lidza Louina <[email protected]> (maintainer:DIGI EPCA PCI PRODUCTS)
    Mark Hounschell <[email protected]> (maintainer:DIGI EPCA PCI PRODUCTS)
    Daeseok Youn <[email protected]> (maintainer:DIGI EPCA PCI PRODUCTS)
    Greg Kroah-Hartman <[email protected]> (supporter:STAGING SUBSYSTEM)
    [email protected] (open list:DIGI EPCA PCI PRODUCTS)
    [email protected] (open list:STAGING SUBSYSTEM)
    [email protected] (open list)

You will see a set of names and related emails. Now we can send our patch with:

    $ git send-email --to "Lidza Louina <[email protected]>" \
      --cc "Mark Hounschell <[email protected]>" \
      --cc "Daeseok Youn <[email protected]>" \
      --cc "Greg Kroah-Hartman <[email protected]>" \
      --cc "[email protected]" \
      --cc "[email protected]" \
      --cc "[email protected]"
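Turning the get_maintainer.pl output into --to and --cc arguments by hand is error-prone, so the step above can be sketched as a small filter. Everything here is illustrative: the two function names are hypothetical, the sample lines only imitate the output format shown above, and the addresses are example.org placeholders rather than real maintainers.

```shell
#!/bin/sh
# recipients_from_maintainers: read get_maintainer.pl-style output on stdin and
# print one recipient per line with the trailing "(role)" annotation removed.
recipients_from_maintainers() {
    sed -e 's/ *([^()]*)$//' -e '/^$/d'
}

# send_email_args: the first recipient becomes --to, all the others become --cc.
send_email_args() {
    awk 'NR == 1 { printf "--to \"%s\"\n", $0; next }
                 { printf "--cc \"%s\"\n", $0 }'
}

# Placeholder sample in the same format as the output shown above.
sample='Jane Maintainer <jane@example.org> (maintainer:DIGI EPCA PCI PRODUCTS)
John Supporter <john@example.org> (supporter:STAGING SUBSYSTEM)
some-list@example.org (open list:STAGING SUBSYSTEM)
another-list@example.org (open list)'

printf '%s\n' "$sample" | recipients_from_maintainers | send_email_args
```

The printed options would then be passed to git send-email together with the name of the generated patch file.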

That's all. The patch is sent and now you only have to wait for feedback from the Linux kernel developers. After you have sent a patch and a maintainer has accepted it, you will find it in the maintainer's repository (for example the patch that you saw in this part), and after some time the maintainer will send a pull request to Linus and you will see your patch in the mainline repository.

That's all.

Some advice

At the end of this part I want to give you some advice about what to do and what not to do during development of the Linux kernel:

- Think, Think, Think. And think again before you decide to send a patch.

- Each time you change something in the Linux kernel source code, compile it. After any change. Again and again. Nobody likes changes that don't even compile.

- The Linux kernel has a coding style guide and you need to comply with it. There is a great script that can help to check your changes: scripts/checkpatch.pl. Just pass the source code file with your changes to it and you will see:

    $ ./scripts/checkpatch.pl -f drivers/staging/dgap/dgap.c
    WARNING: Block comments use * on subsequent lines
    #94: FILE: drivers/staging/dgap/dgap.c:94:
    +/*
    +     SUPPORTED PRODUCTS

    CHECK: spaces preferred around that '|' (ctx:VxV)
    #143: FILE: drivers/staging/dgap/dgap.c:143:
    +    { PPCM,        PCI_DEV_XEM_NAME,     64, (T_PCXM|T_PCLITE|T_PCIBUS) },

  You can also see problematic places with the help of git diff.

- Linus doesn't accept github pull requests.

- If your change consists of several different and unrelated changes, you need to split it into separate commits. The git format-patch command will generate a patch for each commit, and the subject of each patch will contain an n/m counter (for example [PATCH 1/2]), where n is the number of the patch in the series. If you are planning to send a series of patches, it will be helpful to pass the --cover-letter option to the git format-patch command. This will generate an additional file containing the cover letter, which you can use to describe what your patchset changes. It is also a good idea to use the --in-reply-to option of the git send-email command. This option allows you to send your patch series in reply to your cover message. The structure of your patch series will look like this for a maintainer:

    |--> cover letter
    |----> patch_1
    |----> patch_2

  You need to pass the message-id as an argument of the --in-reply-to option. You can find it in the output of git send-email.

- It's important that your email be in plain text format. Generally, send-email and format-patch are very useful during development, so look at the documentation for git send-email and git format-patch and you'll find some useful options.

- Do not be surprised if you do not get an immediate answer after you send your patch. Maintainers can be very busy.

- The scripts directory contains many different useful scripts that are related to Linux kernel development. We already saw two scripts from this directory: checkpatch.pl and get_maintainer.pl. Besides those scripts, you can find the stackusage script that prints the usage of the stack, extract-vmlinux for extracting an uncompressed kernel image, and many others. Outside of the scripts directory you can find some very useful scripts by Lorenzo Stoakes for kernel development.

- Subscribe to the Linux kernel mailing list. There are a large number of letters every day on lkml, but it is very useful to read them and understand things such as the current state of the Linux kernel. Other than lkml there is a set of mailing lists related to the different Linux kernel subsystems.

- If your patch is not accepted the first time and you receive feedback from Linux kernel developers, make your changes and resend the patch with the [PATCH vN] prefix (where N is the number of the patch version). For example:

    [PATCH v2] staging/dgap: Use strpbrk() instead of dgap_sindex()

  It must also contain a changelog that describes all changes from the previous patch versions.

Of course, this is not an exhaustive list of requirements for Linux kernel development, but some of the most important items were addressed.
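The [PATCH vN] prefix mentioned above follows a simple rule that can be sketched in a few lines. patch_subject is a hypothetical helper used only for illustration; it treats version 1 as the initial submission without a version marker, which is an assumption about the usual convention rather than something git enforces.

```shell
#!/bin/sh
# patch_subject VERSION SUBJECT: print the e-mail subject for a given patch version.
# Hypothetical helper: version 1 gets the plain [PATCH] prefix, later versions get [PATCH vN].
patch_subject() {
    if [ "$1" -le 1 ]; then
        printf '[PATCH] %s\n' "$2"
    else
        printf '[PATCH v%s] %s\n' "$1" "$2"
    fi
}

patch_subject 1 "staging/dgap: Use strpbrk() instead of dgap_sindex()"
patch_subject 2 "staging/dgap: Use strpbrk() instead of dgap_sindex()"
```

In practice there is no need to edit the subject by hand: git format-patch accepts -v<n> (--reroll-count), so `git format-patch -v2 master` marks the generated patch as the second version.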

Happy Hacking!

Conclusion

I hope this will help others join the Linux kernel community! If you have any questions or suggestions, write me an email or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please let me know via email or send a PR.

Links

- blog posts about assembly programming for x86_64
- Assembler
- distro
- package manager
- grub
- kernel.org
- version control system
- arm64
- bzImage
- qemu
- initrd
- busybox
- coreutils
- procfs
- sysfs
- Linux kernel mail listing archive
- Linux kernel coding style guide
- How to Get Your Change Into the Linux Kernel
- Linux Kernel Newbies
- plain text

Thank you to all contributors:

- Akash Shende
- Jakub Kramarz
- ckrooss
- ecksun
- Maciek Makowski
- Thomas Marcelis
- Chris Costes
- nathansoz
- Ruban Deventhiran
- fuzhli
- andars
- Alexandru Pana
- Bogdan Rădulescu
- zil
- codelitt
- gulyasm
- alx741
- Haddayn
- Daniel Campoverde Carrión
- Guillaume Gomez
- Leandro Moreira
- Jonatan Pålsson
- George Horrell
- Ciro Santilli
- Kevin Soules
- Fabio Pozzi
- Kevin Swinton
- Leandro Moreira
- LYF610400210
- Cam Cope
- Miquel Sabaté Solà
- Michael Aquilina
- Gabriel Sullice
- Michael Drüing
- Alexander Polakov
- Anton Davydov
- Arpan Kapoor
- Brandon Fosdick
- Ashleigh Newman-Jones
- Terrell Russell
- Mario
- Ewoud Kohl van Wijngaarden
- Jochen Maes
- Brother-Lal
- Brian McKenna
- Josh Triplett
- James Flowers
- Alexander Harding
- Dzmitry Plashchynski